In an organization using SAP HANA, data resides in memory to deliver high performance. As data volumes grow, however, so does the memory needed to hold them, and every memory expansion adds cost. Enterprises implementing SAP HANA should therefore follow a data persistence strategy: build a smart storage infrastructure that places data according to its business value, so that storage requirements are met efficiently and at lower cost.
Making Storage Strategy Smarter
In SAP HANA, not all data is accessed frequently, yet keeping all of it in memory increases the amount of main memory used. Historic or ‘cold’ data can instead be kept in a separate data store built on a less expensive storage option. That data can still be accessed at any time, with the necessary performance at lower cost. The end result is a storage infrastructure that meets the storage requirements of the business efficiently and cost-effectively.
Data can be classified into:
- Hot or Active data – data that is used or updated frequently
- Cold or Historic or Passive data – data that is not updated frequently or used purely for analytical or statistical purposes
When historic or cold data is moved to a separate data store, main memory consumption drops, hardware resources are freed up, and the static data remains available. Reads of this data still need to be reasonably fast, but at lower cost. Keeping all data, including infrequently accessed static data, in a high-performance online environment can be very expensive, or simply impractical given the limitations of the databases used in the data warehouse.
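The hot/cold split described above can be sketched as a simple rule over usage statistics. This is only an illustration: the `TableStats` fields, the table names, and both thresholds (`max_idle_days`, `min_reads`) are assumptions made for the example, not SAP HANA settings, and real thresholds would come from the business.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class TableStats:
    """Hypothetical per-table usage statistics (not a HANA structure)."""
    name: str
    last_updated: date
    reads_last_30_days: int


def classify(stats: TableStats, today: date,
             max_idle_days: int = 365, min_reads: int = 100) -> str:
    """Return 'hot' or 'cold' using two illustrative thresholds."""
    recently_updated = (today - stats.last_updated) <= timedelta(days=max_idle_days)
    frequently_read = stats.reads_last_30_days >= min_reads
    # Data counts as hot if it is still being updated or is read often.
    return "hot" if recently_updated or frequently_read else "cold"


today = date(2024, 1, 1)
tables = [
    TableStats("sales_current", date(2023, 12, 30), 5000),
    TableStats("sales_2015", date(2016, 1, 15), 3),
]
labels = {t.name: classify(t, today) for t in tables}
print(labels)  # {'sales_current': 'hot', 'sales_2015': 'cold'}
```

Tables labeled cold by such a rule are the candidates for the less expensive storage tier.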
What Data needs to be persisted?
This is an important exercise to undertake before embarking on any data warehouse project. Because in-memory solutions are expensive, it pays to understand the organization’s data requirements first. Some pointers on what data does or does not need to be persisted are given below.
- How frequently is the data required?
- How many years of data does the business currently report on daily?
- What regulatory reporting data must be made available?
- What transaction data does the business require as a must-have for its reporting needs?
- What data should be consolidated in the data warehouse?
- What data can be consolidated in the source systems?
- Can some transformations be pushed to an external ETL tool so that the impact on the data warehouse is reduced?
- Based on the reporting requirements, choose among the above options to suit the organization’s data needs.
- Redundant data should be explicitly identified and kept out of the data warehouse.
- Processing logic that requires intermediate data to be stored should be analyzed to see whether it can be pushed to runtime, at the query level or at the database layer.
- Rolling data should be identified and moved to near-line storage (NLS) or handled through another backup mechanism.
Strong information lifecycle management covering the above points is required to arrive at an effective data persistence strategy for an organization. Some of the key benefits of a successful data persistence strategy are:
1) Better resource usage – in terms of disk, CPU and memory
2) System availability
3) System performance
4) Analysis with the right set of data
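As an illustration of the rolling-data point in the checklist above, the sketch below splits records by a retention window. The row layout, the `posting_date` field, and the three-year window are assumptions made for the example; the actual move to near-line storage or a backup target is deliberately left out.

```python
from datetime import date


def split_by_retention(rows, current_year, keep_years):
    """Split rows into (keep, move) by the year of their posting date.

    Rows from the most recent `keep_years` calendar years stay in the
    warehouse; older rows are candidates for near-line storage.
    """
    first_kept_year = current_year - keep_years + 1
    keep, move = [], []
    for row in rows:
        target = keep if row["posting_date"].year >= first_kept_year else move
        target.append(row)
    return keep, move


# Hypothetical sample rows.
rows = [
    {"doc": "A1", "posting_date": date(2024, 3, 1)},
    {"doc": "B7", "posting_date": date(2022, 11, 20)},
    {"doc": "C3", "posting_date": date(2019, 6, 5)},
]
keep, move = split_by_retention(rows, current_year=2024, keep_years=3)
print([r["doc"] for r in keep])  # ['A1', 'B7']
print([r["doc"] for r in move])  # ['C3']
```

Cutting on whole calendar years keeps the rule easy to explain to business users; a day-level cutoff would work the same way with a date comparison.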
Option – 1
Implementing a near-line component makes it possible to keep less frequently accessed data, such as aged information or detailed transactions, more cost-effectively. In addition, removing relatively static data from the data warehouse lets regular maintenance activities run more quickly and gives business users higher data availability.
Option – 2
Apache Hadoop and Data warehouse
As enterprises start analyzing larger amounts of data, moving it over the network for analysis becomes unrealistic. Analyzing terabytes of data in memory every day can strain the processing capacity of the system and also occupies more main memory. With Hadoop, data is loaded directly onto low-cost commodity servers just once and transferred to other systems only when required.
Hadoop is a true “active archive”: it not only stores and protects the data but also enables users to quickly, easily and perpetually derive value from it.
Hadoop and the data warehouse can work together in a single information supply chain. Cold or archived data can be stored in Hadoop, which then acts as an online archive in place of tape. Beyond serving as a storage mechanism, Hadoop also helps with real-time data loading, parallel processing of complex data and discovering unknown relationships in the data.
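The idea of an online archive that sits behind the in-memory store can be sketched as a two-tier lookup. This is a toy model only: the `TieredStore` class is hypothetical, and the two dictionaries merely stand in for the in-memory database and the Hadoop archive. No real SAP HANA or Hadoop API is used.

```python
class TieredStore:
    """Toy two-tier store: a hot tier plus a cold online archive."""

    def __init__(self):
        self.hot = {}   # stands in for the in-memory database
        self.cold = {}  # stands in for the Hadoop archive

    def put(self, key, value):
        """New data always lands in the hot tier."""
        self.hot[key] = value

    def archive(self, key):
        """Move a record from the hot tier to the cold tier."""
        self.cold[key] = self.hot.pop(key)

    def get(self, key):
        """Return (value, tier); archived data stays readable."""
        if key in self.hot:
            return self.hot[key], "hot"
        if key in self.cold:
            return self.cold[key], "cold"
        raise KeyError(key)


store = TieredStore()
store.put("order-2024-001", {"amount": 120})
store.put("order-2015-417", {"amount": 75})
store.archive("order-2015-417")

print(store.get("order-2024-001"))  # ({'amount': 120}, 'hot')
print(store.get("order-2015-417"))  # ({'amount': 75}, 'cold')
```

The point of the sketch is that archiving frees space in the hot tier without making the record unreachable, which is what distinguishes an online archive from tape.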
What is Hadoop good at?
1) Hadoop is inexpensive
2) Hadoop is fast
3) Hadoop scales to large volumes of data storage
4) Hadoop scales to large volumes of data computation
5) Hadoop is flexible about the types of data it handles
6) Hadoop is flexible about programming languages
Since NLS has been discussed extensively in various forums and blogs, a subsequent discussion will cover how Hadoop can be integrated with SAP HANA for an effective data persistence strategy.