Hadoop started in 2004, and since then the tech world has slowly been adjusting to the promises and opportunities the technology brings. Some companies are way down the road and run their entire businesses on Big Data: the Facebooks, LinkedIns, and Netflixes of the world, which pushed the existing boundaries and innovated to keep their businesses running adequately. The past ten years have mostly been focused on giving Data Scientists and IT the proper tooling to store, find, analyze, and extract information from Big Data platforms.
In many cases, coding was still necessary to analyze the data in Hadoop, which prevented many business users from even looking into the data lakes being put in place. The emergence of SQL access to the various analytical engines sprouting up is bridging the gap between the traditional BI tools businesses are accustomed to working with and the world of Big Data. There are still challenges in how much of SQL is covered by Hive, for instance, but the first steps have been made: tools like SAP Web Intelligence and SAP Lumira can connect to Big Data sources.
Now, we also have to acknowledge the new challenges Big Data poses when a data analyst wants to start analyzing information spread across millions or billions of rows in multiple datasets. Each query against datasets of this size can take tens of seconds, if not minutes. So will your Data Analyst have to call IT or his favorite developer to carve out a dataset of manageable size and push it into MySQL or another RDBMS to get the slice-and-dice experience he is looking for?
We think the answer should be no and we looked at ways to resolve the problem.
The approach we have taken in SAP Lumira 1.27 is to let the Data Analyst extract a sample of the various Hadoop datasets he is interested in, perform the data transformations and visual analysis he needs, and then automatically replay all the actions done on the sample against the full datasets in Hadoop. Working on sampled data has several advantages:
- The user gets all the interactivity he is accustomed to, without having to worry about the real size of the datasets.
- The user already gets an idea of what to expect from the full dataset. The results won't be entirely accurate, of course, but they should indicate the range the final numbers will fall in.
- Analysis can be prepared upfront, so that once the results come back from Hadoop, they appear immediately.
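The second point above, that a sample gives a usable estimate of the full result, can be sketched with a few lines of Python. This is purely illustrative: the dataset and sampling rate are made up, and Lumira's actual acquisition mechanism is not shown.

```python
import random

random.seed(42)

# Hypothetical "full" dataset: 100,000 transaction amounts
# centered around 100 with some spread.
full = [random.gauss(100.0, 20.0) for _ in range(100_000)]

# A 5% random sample, standing in for what the desktop tool acquires.
sample = random.sample(full, k=5_000)

full_avg = sum(full) / len(full)
sample_avg = sum(sample) / len(sample)

# The sample average lands close to the full average, so the analyst
# already knows the ballpark before the full run completes.
print(f"full average:   {full_avg:.2f}")
print(f"sample average: {sample_avg:.2f}")
```

The gap between the two averages shrinks as the sample grows, which is why the sampled document is a faithful preview rather than an exact answer.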
Once the visualizations and stories are done on the sample data, SAP Lumira 1.27 now offers a way to generate that same document against the full dataset, pushing all of the data transformations done in the desktop down to Hadoop via the Oozie scheduler. This operation can take minutes or hours depending on the complexity and size of the data, but at this point the user is aware of it and can focus his attention elsewhere. The scheduled job can produce a Lumira file or a Hive table, depending on what is to be done with the result afterwards.
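The record-and-replay idea behind this push-down can be sketched as follows. All names here (`record`, `replay`, the sample rows) are illustrative assumptions, not Lumira's actual API; in the real product the recorded steps are translated into jobs scheduled through Oozie rather than run in-process.

```python
# Each transformation the analyst performs on the sample is recorded,
# then replayed verbatim against the full dataset later.

def record(pipeline, step):
    """Append a transformation step to the recorded pipeline."""
    pipeline.append(step)
    return step

def replay(pipeline, rows):
    """Apply every recorded step, in order, to a dataset."""
    for step in pipeline:
        rows = step(rows)
    return rows

pipeline = []
sample = [{"country": "FR", "amount": 10}, {"country": "DE", "amount": -3}]

# The analyst works on the sample; each action is recorded as it is applied.
drop_negatives = record(pipeline, lambda rows: [r for r in rows if r["amount"] >= 0])
sample = drop_negatives(sample)

to_eur = record(pipeline, lambda rows: [{**r, "amount": r["amount"] * 0.9} for r in rows])
sample = to_eur(sample)

# Later, the same steps run unchanged on the full dataset.
full = [{"country": "FR", "amount": 10}, {"country": "DE", "amount": -3},
        {"country": "US", "amount": 50}]
result = replay(pipeline, full)
print(result)
```

Because the pipeline is just an ordered list of steps, the same recording can target either output the text mentions: materialize the rows into a Lumira file, or write them out as a Hive table.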
In either case, the data analyst won't have to call up IT to write code or set up a smaller lens or cube of the data to run his analysis. So far, every time we presented this approach to customers running Hadoop, it was greeted by a lot of nodding heads. The advantage I see is that it takes the worry about the size of the underlying data away from the end user. We as an industry are making great progress in how fast we can retrieve information from ever greater amounts of data, but unfortunately the data itself is not waiting and is also growing at a frenetic pace. Working on samples is the way to remove the worry of size while maintaining the interactivity analysts expect from their analytical tools, and this is the new experience we are bringing in SAP Lumira 1.27.
Come and try it out!