IoT, known as the Internet of Things, the Internet of Everything by some, and the Internet of Opportunity by a few, has created a tide of positivity in the overall IT industry. Software and hardware companies are looking at how to leverage the evolution of IoT-enabled applications. As mentioned in many management research articles, IoT could generate trillions of USD in value, and by the year 2020 the world could be surrounded by sensors, perhaps ten times the human population. The data generated by these devices would by far surpass the data generated in an entire decade (as a guess, between the years 2000 and 2010), and they would bring a new dimension to everyday life, with devices making decisions for us so that our lifestyle is automatically efficient and effective. For example, smart home devices such as fridges, fans, washing machines, air conditioners, heaters, bulbs and many more could make good choices about operating themselves to save electricity. A famous example is the street light that lights itself when a citizen is moving around the place and dims itself when no motion is detected. Similarly, bulbs at home would automatically turn themselves on and off depending on the movement of the family inside the house; the future will not be about switches but about smart, integrated processing in the home that decides by itself when to turn on, turn off, dim, and so on.
SAP is already venturing into many industries, and one can learn more by visiting SAP IoT Solutions and Technology at the following link: http://www.sap.com/pc/tech/internet-of-things.html. SAP already has solutions such as the following:
- Route maintenance and Services: http://www.youtube.com/watch?v=a3gyWMqRdEo
- Connected Logistics: http://www.youtube.com/watch?v=Zv46j2WZ3jU&feature=youtu.be
- Connected Retail: http://www.youtube.com/watch?v=Pddz4eHuGAo
SAP also has all the required technology to support IoT; further details can be read in the published white paper.
HANA, being an in-memory column-based database, is one of the strong contenders for analytics in the IoT space, where handling huge volumes of data is required and queries must complete in seconds for real-time scenarios. HANA is known for its performance in searching, summing, averaging, counting, etc. over data, and it has predictive analytics libraries integrated into it. A single stack for data and analytics is going to play a vital role in the future: the need of the hour will be one stack that both stores the data and analyzes it directly, instead of the traditional approach where databases only stored data and all machine learning or other algorithms ran on separate servers to which the data had to be passed for analysis. With IoT and sensor data entering the industry, those days are gone; it would become nearly infeasible to get real-time prediction performance at full precision while including the raw data for analysis.
HANA provides an end-to-end database-to-development platform on which a developer can create a whole application as a cloud application. With SAP having the HANA Cloud Platform (HCP) in place at the best possible time, this adds to the benefits for future IoT-based development. Developing a UI5-based application on HCP is simple and easy, and I have no doubt that in future both SAP and non-SAP developers will leverage the HCP environment to create simple yet effective scenarios for future IoT applications.
One of the challenges that IoT brings with it is the huge volume of data: how would it be stored, and how would one do analytics when the data runs into billions of records within a few days for an enterprise, and real-time scenarios or events need responses within seconds? HANA has already established itself in handling huge amounts of data and is well known for exactly this.
Let me dive deeper into HANA's compression capability for storing data. As you might know or have read, HANA does a great job of compressing data with its dictionary encoding implementation (if not, I recommend taking the In-Memory Data Management course from the openSAP or HPI course offerings). Dictionary encoding compresses string-based transactional data by replacing each string with the integer position of that unique string in the dictionary; this saves the database a huge amount of storage space. Going further, you will also find the HANA database utilizing compression algorithms for transactional data such as Prefix Encoding, Run Length Encoding, Cluster Encoding, Indirect Encoding, etc. These algorithms compress the integer data further depending on the distribution of the dataset, and HANA selects the algorithm that gives the best compression. The best part about all these approaches is that queries (search, sum, count, or any other operation) are executed on the compressed dataset, so the whole scan happens on a reduced memory footprint, which results in great performance. A few of these compression algorithms are especially effective on IoT datasets. To substantiate the effectiveness of compression for sensor-based transactional data, let us consider the following example.
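To make the dictionary encoding idea concrete, here is a minimal sketch in Python. This is purely illustrative and not HANA's actual implementation; the function names and the sample city column are my own invention. The column stores only small integer positions, while each distinct string is kept once in the dictionary.

```python
def dictionary_encode(column):
    """Return (dictionary, value_ids) for a list of strings."""
    dictionary = sorted(set(column))                    # each unique value stored once
    position = {value: i for i, value in enumerate(dictionary)}
    value_ids = [position[value] for value in column]   # column holds only integers
    return dictionary, value_ids

def dictionary_decode(dictionary, value_ids):
    """Recover the original column from the dictionary and the value ids."""
    return [dictionary[i] for i in value_ids]

cities = ["Berlin", "Pune", "Berlin", "Walldorf", "Pune", "Berlin"]
dictionary, ids = dictionary_encode(cities)
print(dictionary)  # ['Berlin', 'Pune', 'Walldorf']
print(ids)         # [0, 1, 0, 2, 1, 0]
assert dictionary_decode(dictionary, ids) == cities
```

The repeated strings are replaced by small integers, and any scan (count, filter, etc.) can run on the integer column alone.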
In this example we are considering an enterprise warehouse chain in which each warehouse has cold storage units, and each cold storage unit has temperature sensors installed (one, two or a few, depending on the size of the unit). In our example, the temperature within a cold storage unit should remain at 10 degrees Celsius (I am not a cold storage expert, so bear with me in case this number is too high or too low), and each sensor in the warehouse sends its reading back to the HANA database once per minute. The use case here is to monitor the cold storage spaces for optimal temperature and, in case any unit has a problem, have the system generate an alert to the supervisor of the warehouse (of course, the solution could also propose shifting the goods to a nearby storage at optimal cost, or any other scenario; the possibilities are limitless). Let us say we have 10 cold storage units in each warehouse, each unit has 2 temperature sensors, and there are 50 warehouses overall. This means we have 10 x 2 x 50 = 1,000 temperature sensors pushing data to the HANA database every minute. A day has 24 x 60 = 1,440 minutes, so a total of 1,000 x 1,440 = 1,440,000 temperature readings are sent to the database per day. This means that in 100 days we would have 144,000,000 temperature readings. If we assume one byte to store the temperature reading of each record, this means about ~137.33 MB of storage space, which in the case of HANA is RAM (the timestamp is not considered here, since in a columnar store it would be kept in another column). This could grow beyond a billion records in no time if we increase the number of warehouses or start capturing the temperature every 15 seconds, which is a four-fold increase in the dataset. Anyway, the point I want to bring to your attention is not how the data grows, but rather the data distribution of a sensor dataset.
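The sizing arithmetic above can be checked with a few lines of Python (the variable names are mine; the assumptions are exactly those of the example: one byte per reading, timestamps excluded):

```python
# Back-of-the-envelope sizing for the cold storage example.
sensors = 10 * 2 * 50                    # 10 units x 2 sensors x 50 warehouses
readings_per_day = sensors * 24 * 60     # one reading per sensor per minute
readings_100_days = readings_per_day * 100
megabytes = readings_100_days / (1024 * 1024)  # 1 byte per reading

print(sensors)                  # 1000
print(readings_per_day)         # 1440000
print(readings_100_days)        # 144000000
print(round(megabytes, 2))      # 137.33
```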
As we all know, a cold storage temperature reading isn't going to change every few minutes; it stays the same, or moves by one or two degrees depending on usage. Only when there is an issue with the sensor, a problem with the compressor, or something else that can go wrong does the temperature rise beyond 20 degrees Celsius and trigger an alert.
Let us take an example and capture the data from one of the sensors for 30 minutes, which gives us a dataset of 30 records. The data would be similar to the one stated below, also represented in Figure 1.
Example dataset for 30 minutes:
10, 10, 10, 10, 11, 11, 10, 10, 10, 10, 10, 11, 11, 12, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10
Figure 1, plot showing temperature data for 30 minutes
Figure 1 clearly shows that the line is pretty much flat at 10, with a few variations due to drop-offs or pickups at the cold storage. This makes the captured temperature data very suitable for two compression algorithms that HANA might utilize:
- RLE – Run Length Encoding
- Cluster Encoding.
RLE (Run Length Encoding):
In this algorithm HANA compresses consecutive occurrences of the same number into a single value and creates another array / vector to store the start position of each run.
So for the above example the data would be compressed as following:
Compressed Dataset: 10, 11, 10, 11,12,11,10
Start Position: 1, 5, 7, 12, 14, 15, 17
The start position vector marks where each new value begins: the value changes at the 5th position (the temperature changes to 11 degrees Celsius), which is represented by 5 (the second number) in the start position vector, and the 7 in the third position of the vector represents the change of temperature back to 10 degrees Celsius.
The other variation of RLE stores the number of occurrences instead, and the choice of variation depends on the compression achieved as well as on how the data is accessed.
Compressed Dataset: 10, 11, 10, 11,12,11,10
Occurrences: 4, 2, 5, 2, 1, 2, 14
In this variation of RLE, the run lengths (occurrences) are stored instead of the start positions.
RLE is a lossless compression algorithm, meaning the whole dataset can be recovered without any loss of the original data.
RLE does a good job of compression and reduces the dataset by roughly 50% in our example: instead of 30 records we now have only 7 values in the compressed dataset plus 7 entries in the start position or occurrences vector (assuming one byte each for the address offset and the temperature).
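Both RLE variants above can be sketched in a few lines of Python. This is an illustration of the general technique on our 30-record dataset, not HANA's internal code; the function names are my own.

```python
def rle_start_positions(data):
    """RLE variant: for each run, store the value and its 1-based start position."""
    values, starts = [], []
    for i, v in enumerate(data):
        if not values or v != values[-1]:   # a new run begins here
            values.append(v)
            starts.append(i + 1)
    return values, starts

def rle_occurrences(data):
    """RLE variant: for each run, store the value and the length of the run."""
    values, counts = [], []
    for v in data:
        if values and v == values[-1]:
            counts[-1] += 1                 # extend the current run
        else:
            values.append(v)
            counts.append(1)
    return values, counts

def rle_decode(values, counts):
    """Losslessly rebuild the original data from the occurrence variant."""
    return [v for v, n in zip(values, counts) for _ in range(n)]

temps = [10]*4 + [11]*2 + [10]*5 + [11]*2 + [12] + [11]*2 + [10]*14
print(rle_start_positions(temps))
# ([10, 11, 10, 11, 12, 11, 10], [1, 5, 7, 12, 14, 15, 17])
print(rle_occurrences(temps))
# ([10, 11, 10, 11, 12, 11, 10], [4, 2, 5, 2, 1, 2, 14])
assert rle_decode(*rle_occurrences(temps)) == temps
```

Both variants reproduce exactly the compressed dataset and vectors shown above, and decoding confirms the compression is lossless.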
Cluster Encoding Algorithm:
In the cluster encoding algorithm, the data is virtually partitioned into equal-sized blocks. Each block is checked to see whether it contains only a single number / integer value. If it does, the block is compressed down to that one value, and a bit in a second bit vector is set to 1 to indicate this; a 0 means the block was not compressed.
In our case we virtually partition the data into blocks of 4 records each; after applying cluster encoding, the data is as follows:
Compressed Data set: 10, 11, 11, 10, 10, 10, 10, 10, 11, 11, 12, 11, 11, 10, 10, 10, 10
Cluster Bit Vector: 1, 0, 0, 0, 1, 1, 1, 1
That is 17 data points plus an 8-bit vector (one bit per block) after cluster encoding; at one byte per value this is 18 bytes instead of 30, a reduction of around 40% from the original dataset.
Once again, cluster encoding too is a lossless compression algorithm: we don't lose any of the dataset, and each value can easily be recovered.
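A minimal Python sketch of cluster encoding on the same dataset (again illustrative only, with my own function names; the block size of 4 matches the example, whereas a real system would use much larger blocks):

```python
def cluster_encode(data, block_size=4):
    """Partition data into fixed-size blocks; a block holding a single distinct
    value collapses to that one value. Returns (compressed values, bit vector)
    with one bit per block: 1 = block compressed, 0 = block kept as-is."""
    compressed, bits = [], []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        if len(set(block)) == 1:
            compressed.append(block[0])
            bits.append(1)
        else:
            compressed.extend(block)
            bits.append(0)
    return compressed, bits

def cluster_decode(compressed, bits, total, block_size=4):
    """Rebuild the original data; 'total' is the original record count,
    needed because the last block may be shorter than block_size."""
    data, pos = [], 0
    for bit in bits:
        size = min(block_size, total - len(data))
        if bit:
            data.extend([compressed[pos]] * size)
            pos += 1
        else:
            data.extend(compressed[pos:pos + size])
            pos += size
    return data

temps = [10]*4 + [11]*2 + [10]*5 + [11]*2 + [12] + [11]*2 + [10]*14
compressed, bits = cluster_encode(temps)
print(len(compressed), bits)   # 17 [1, 0, 0, 0, 1, 1, 1, 1]
assert cluster_decode(compressed, bits, len(temps)) == temps
```

Only the uniform blocks (the first block and the trailing run of 10s) collapse, which is exactly why this encoding pays off on flat sensor curves.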
As mentioned before, a data distribution in which a sensor consistently provides the same values is a great advantage for HANA's transactional compression algorithms, and IoT-based applications benefit from the same: a smaller memory footprint for cloud-based solutions, with high performance guaranteed by columnar storage of compressed data.
Of course, a huge amount of research has been happening on compression algorithms specifically for sensor-based data, but in this article I was not looking into that recent work; rather, I wanted to understand the strength of the existing compression algorithms within HANA. I also do not want to claim that this is the best case, but for certain data distributions, and quite likely for many sensor datasets such as the temperature case, it certainly seems so. Lastly, the intention was also not to cover lossy compression algorithms, so I deliberately skipped the algorithms that learn and store only the most relevant points of the dataset (lossy algorithms, in which case the raw data cannot be recovered).
Overall, I am optimistic about HANA as the technology platform supporting the next generation of IoT-based applications, with real-time scenarios made possible with SAP.