When listening to some of our customers talk about their Hadoop data, it can be daunting to hear them casually describe tables at the terabyte, petabyte, and even exabyte scale.
In SAP Lumira, we try to overcome this data wrangling challenge in a number of ways:
- When designing the Lumira document, we work on a smaller sample of the full data set. So even though the original data set can be quite large, we acquire a subset small enough to fit on the local desktop machine.
- When scheduling the data wrangling operations on the full data set, we run them solely on the Hadoop platform: we are just executing Hive statements on the existing Hadoop cluster. If the cluster is set up to process data at that scale, it should just work; it may simply take a while.
- We try to limit how often data is temporarily moved around during the scheduled data wrangling process. Specifically, we only move data when we are actually creating the new Hive table.
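To make the "just executing Hive statements" point concrete, here is a sketch of the kind of statement that materializes a new Hive table in a single step. The table and column names are hypothetical; Lumira generates its own SQL, so this is only an illustration of the pattern, not the actual output:

```sql
-- Hypothetical example: materialize a wrangled result as a new Hive table
-- in one CREATE TABLE ... AS SELECT, so data is only written once.
CREATE TABLE sales_wrangled
STORED AS ORC
AS
SELECT
  order_id,
  region,
  CAST(amount AS DECIMAL(18,2)) AS amount   -- example cleanup operation
FROM sales_raw
WHERE region IS NOT NULL;                   -- example filter applied during wrangling
```

Because the data is written only when the new table is created, no intermediate copies need to be shuffled around the cluster.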
Here are some best practices I would like to share when dealing with big data sets:
- You can increase the default heap memory and perm gen memory in the SAPLumira.ini file. I installed SAP Lumira to the default location, so my SAPLumira.ini file resides in C:\Program Files\SAP Lumira\Desktop\. For these tests, I set my heap memory to 8 GB and my perm gen memory to 256 MB.
- Sampling a large data set is unfortunately slow because we use Hive by default. If possible, use an accelerator like Impala (if you are on a CDH cluster) to speed up the data acquisition process.
- If possible, remove any columns you do not need during the acquisition phase; discarding data at the earliest possible stage is the most efficient use of resources within the system. Also experiment with the % sample size to create a subset of the data that will actually fit on your desktop machine yet still provide enough variety of values for the data wrangling operations to execute on.
- Joining two data sets is expensive, and every additional column increases the data footprint during processing. Again, feel free to prune down the number of columns and filter out any unnecessary records that will not contribute to your analysis.
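On the memory settings: SAP Lumira is a Java-based desktop application, so the values in SAPLumira.ini are standard JVM arguments. A sketch of what the edited `-vmargs` section might look like for the settings used above (the surrounding lines vary by version, so treat this as an assumption-laden fragment rather than a complete file):

```ini
-vmargs
-Xmx8g
-XX:MaxPermSize=256m
```

`-Xmx` sets the maximum heap size and `-XX:MaxPermSize` sets the perm gen limit; restart Lumira after editing the file for the changes to take effect.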
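The column-pruning, filtering, and sampling advice above can be sketched in HiveQL terms. Lumira's acquisition dialog does this for you, but the equivalent hand-written statement (with made-up table and column names) would look something like:

```sql
-- Hypothetical example: prune columns, filter rows, and take a block sample
-- before acquiring data onto the desktop. Names are illustrative only.
SELECT customer_id, region, amount            -- keep only the columns you need
FROM transactions TABLESAMPLE(10 PERCENT) t   -- Hive block sampling, roughly 10% of the data
WHERE txn_year = 2015;                        -- drop records that will not contribute
```

The earlier the pruning and filtering happen, the less data the cluster has to move and the smaller the sample that lands on your machine.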
Here are some sample stats from our initial testing to set expectations:
- A 3-table join test with 19M records; the resulting joined table has 27 columns. This produced a Lumira document of around 1 GB, and the schedule took around 15 minutes.
- A single table with 100M records and 20 columns. The Lumira document file size was around 1.18 GB, and the schedule took around 40 minutes.
It does take a while to open the scheduled lums document when double-clicking it from the Hadoop pane, because we have to download the large lums document locally and open it in the Lumira desktop application. Once it is loaded, however, the user experience should be fairly responsive.
If you are dealing with data sets of over 100M records, you can always schedule to a Hive table instead of trying to create a Lumira document, which will always be subject to the resource constraints of a single desktop machine.
If you are looking for tips on how to schedule your Oozie data wrangling job in Lumira, check out this post.