Loading Data into SAP HANA Cloud, Data Lake
The SAP HANA Cloud data lake is designed to support large volumes of data. One of the first issues you are likely to encounter is how to load large volumes of data into the data lake efficiently, where efficiency is measured by a combination of total cost and performance. While the data lake does support the standard SQL DML statements, by far the fastest way to load data is via the LOAD TABLE statement. This blog examines how the provisioning of your SAP HANA Cloud, data lake impacts both the performance and the cost of data loading. To keep things simple, I am only looking at individual table loads. Concurrent loading of data may be covered in a future post.
If you haven’t had a chance to try out SAP HANA Cloud and its built-in data lake, you can do that now by signing up for the free trial.
The Impact of Compute on Data Loading
In the SAP HANA Cloud data lake, compute is measured and allocated elastically in vCPUs. This means that you can change the amount of compute allocated to the data lake on-demand by simply updating the provisioning. You can have a look at the SAP HANA Cloud Sizing Calculator to see how the number of vCPUs you allocate impacts your SAP HANA Cloud usage.
While data loading is typically dominated by the data transfer (read/write) activities, there is compute involved as part of the ingestion process. For example, the data is processed and compressed into the proper format before it is written to the data lake, in order to minimize storage costs and improve future query performance.
If we start with a minimally sized data lake on AWS (4 vCPUs, 1TB of storage), and then scale our compute up, we can see the impact of compute on the load of about 1.5GB of raw data into a single table. It’s not a lot from the perspective of the data lake, but it is enough to illustrate the impact of compute provisioning on performance.
We can see an immediate, almost 2x performance improvement by increasing the amount of compute, but that improvement tails off rather quickly as we add more. Beyond 16 vCPUs, the difference in performance is negligible. This is because the data lake is built on the SAP IQ multiplex technology for its core data processing capabilities.
The SAP IQ multiplex is a cluster-based solution, which means the data lake processing compute scales both up (using a larger node) and out (using multiple nodes). Currently, scale-up is limited to 16 vCPUs on a node. If you allocate more than this amount of compute in your data lake, you will get additional nodes added to your system. In our test, when we move beyond 16 vCPUs, we are adding another node to the system.
This is relevant because currently the data lake can only use a single node to process a single LOAD TABLE statement, so once you go beyond 16 vCPUs, additional compute will not have much of an impact on load performance. Scaling out to multiple nodes can still help in a multi-user system, by allowing other operations (e.g. queries) to run on other nodes, thus leaving more compute capacity to the node running the load statement.
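The scale-up/scale-out behaviour above can be summarized in a few lines. This sketch simplifies how vCPUs are actually distributed across nodes; the only facts it encodes are the ones stated above: 16 vCPUs is the current per-node cap, and a single LOAD TABLE statement runs on one node.

```python
import math

MAX_VCPUS_PER_NODE = 16  # current per-node scale-up limit

def cluster_layout(total_vcpus: int) -> tuple[int, int]:
    """Return (node_count, vcpus_usable_by_one_LOAD_TABLE).

    A single LOAD TABLE statement runs on one node, so the compute it
    can use is capped at one node's worth, not the cluster total.
    """
    nodes = math.ceil(total_vcpus / MAX_VCPUS_PER_NODE)
    per_load = min(total_vcpus, MAX_VCPUS_PER_NODE)
    return nodes, per_load

print(cluster_layout(8))   # (1, 8)  -- single node, all compute usable by a load
print(cluster_layout(32))  # (2, 16) -- two nodes, but a single load still sees 16
```

This is why the load-performance curve flattens past 16 vCPUs: the extra compute lands on nodes the load statement cannot use directly.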
The Impact of Storage on Data Loading
The SAP HANA Cloud data lake currently provides only a single storage metric. It is not immediately obvious how adjusting this metric can impact performance, but it actually has a potentially very large impact, especially on smaller data lakes (less than 16TB).
This is because storage performance in the data lake is a function of the amount of storage you provision. For each TB of provisioned storage, you are allocated a certain amount of disk throughput, which effectively controls how fast you can read from and write to your data lake. As your data lake grows, the allocated throughput increases, and eventually becomes large enough that it is no longer a significant factor in performance for single-user access; for smaller data lakes, however, it can have a huge impact.
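The relationship can be sketched as a simple throughput model. The per-TB throughput figure below is a made-up placeholder, not a published SAP HANA Cloud number; the point is only the shape of the relationship, namely that provisioned storage sets the I/O ceiling for a load.

```python
# Illustrative model only: MB_PER_SEC_PER_TB is a hypothetical placeholder,
# not an actual SAP HANA Cloud allocation figure.
MB_PER_SEC_PER_TB = 25.0  # assumed throughput granted per provisioned TB

def estimated_load_seconds(data_mb: float, provisioned_tb: int) -> float:
    """Lower bound on load time if disk throughput were the only constraint."""
    throughput_mb_per_sec = MB_PER_SEC_PER_TB * provisioned_tb
    return data_mb / throughput_mb_per_sec

# Loading ~1.5GB of raw data at different storage provisioning levels:
for tb in (1, 4, 16):
    print(f"{tb:>2}TB provisioned -> ~{estimated_load_seconds(1500, tb):.0f}s floor")
```

Under this model, doubling provisioned storage halves the I/O floor of a load, which matches the behaviour observed below: large gains at small sizes, diminishing returns once throughput stops being the bottleneck.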
Here we can see the impact of storage allocated on load performance. Note that in each test we are loading the same amount of data, using the same amount of compute. The only thing that is changing is the amount of storage.
We can see that increasing the provisioned amount of storage has a very large impact on load performance, up until about 16TB of provisioning. Beyond this point, performance continues to improve, but at a lesser rate.
Controlling TCO for Data Lake Loads
We can see that performance can be impacted significantly by the amount of compute and storage you provision for the HANA data lake, but how does this translate to cost?
If we look at the capacity unit cost for the load operation under the various configurations we tested (see chart below), we can see that the minimally provisioned instance, while cheapest if you are running 24×7, is not the cheapest if you want to leverage the elasticity of the cloud to better control your overall TCO. The cheapest option for loading in this scenario is to provision 16TB of storage and 18 vCPUs of compute.
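The elasticity argument reduces to rate × time. The capacity-unit rates and load times below are illustrative placeholders (real rates come from the SAP HANA Cloud Sizing Calculator mentioned earlier); they only demonstrate why a larger instance, scaled up for the load and back down afterwards, can cost less per load than a minimal one.

```python
# Hypothetical capacity-unit rates and load times, chosen only to
# illustrate the cost = rate x time trade-off, not measured values.

def load_cost(capacity_units_per_hour: float, load_seconds: float) -> float:
    """Capacity units consumed by a single load at a given hourly rate."""
    return capacity_units_per_hour * (load_seconds / 3600.0)

small = load_cost(capacity_units_per_hour=10.0, load_seconds=600.0)  # minimal instance, slow load
large = load_cost(capacity_units_per_hour=40.0, load_seconds=60.0)   # bigger instance, fast load

print(f"small config: {small:.3f} CUs per load")
print(f"large config: {large:.3f} CUs per load")
assert large < small  # scale up for the load, scale back down, pay less
```

In other words, if scaling up makes the load finish more than proportionally faster, the bigger configuration wins on cost for the load itself, even though it is more expensive per hour.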
In conclusion, we can see that compute and storage have a significant impact on the performance of your HANA data lake, but to get the best TCO from your HANA data lake, you need to take advantage of the cloud qualities of the service, like elasticity.
Thanks for reading!