The SAP HANA Effect – Basket Analysis 60x Faster
The scenario that I have been working on is detailed in Part1 and Part2 in the Predictive Analysis area. In it, I perform Market Basket Analysis (MBA) on just over 80 million records, and this runs in SAP HANA in less than 3 minutes. A traditional 3 tier predictive tool would take around 3 hours to process the same 80 million records. From 3 hours down to 3 minutes: that is 60x faster. Now that is a significant difference, but what does that enable the analyst to do?
Well, 60x faster makes a significant difference. With only a 3 minute wait, the analyst can easily run multiple scenarios, multiple times a day. It now becomes possible to look into cause and effect. They can ask 2nd, 3rd, 4th, 10th questions and run multiple scenarios without being subjected to a 3+ hour delay. With the previous setup, a single incorrect parameter could easily cost you a day or more. With SAP HANA in-memory, in-database Predictive Algorithms the gain is huge and I’m still looking for the drawbacks. The productivity gains that SAP HANA provides for this type of process are massive.
There are several reasons why SAP HANA provides a significant benefit:
1. Market Basket Analysis is not suited to sampling. People typically want to look across the entire range to identify those products that do and don’t sell together. Therefore the data volumes and compute power needed can be higher than for some other predictive scenarios.
2. Market Basket Analysis does not follow the traditional partition, selection, sample, train, validate and productionise type of predictive model cycle. Therefore the productivity benefit is huge, as it becomes an interactive analysis process in which you receive results and can apply the learnings to your scenarios immediately. MBA becomes true Business Intelligence.
3. You cannot summarise the data: transaction level data is required. Yes, HANA has an OLAP (Aggregation) engine, but this does not mean we need to use it for everything. Yes, you can perform basket analysis at different levels, but you still need to feed in transaction level data to support this.
4. It is difficult to pre-process or schedule, as the results are only applicable at the same level as the input data-set. This means that if we want to look at Christmas baskets, and then compare them to last Christmas, we may need to run 2 different models. If we then want to look at a particular channel, store, store type, product category, day of week, time of day, etc., each of these could become its own standalone output.
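To make the point about transaction level data concrete, here is a minimal Python sketch of pairwise basket analysis. It is only an illustration and not how the SAP HANA PAL implements it; the sample rows, the `pair_rules` function and the `min_support` threshold are all hypothetical. It shows that support and confidence are derived from which products share a basket, information that is lost as soon as the data is aggregated.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction-level rows: (basket_id, product).
rows = [
    (1, "bread"), (1, "butter"), (1, "milk"),
    (2, "bread"), (2, "butter"),
    (3, "milk"), (3, "bread"),
    (4, "butter"),
]

def pair_rules(rows, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for product pairs."""
    # Group the rows back into baskets; this needs every transaction row.
    baskets = {}
    for basket_id, product in rows:
        baskets.setdefault(basket_id, set()).add(product)

    n_baskets = len(baskets)
    item_counts = Counter()   # number of baskets containing each product
    pair_counts = Counter()   # number of baskets containing each product pair
    for items in baskets.values():
        item_counts.update(items)
        pair_counts.update(combinations(sorted(items), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n_baskets
        if support < min_support:
            continue
        # Confidence of "a => b" is P(b in basket | a in basket).
        rules.append((a, b, support, count / item_counts[a]))
        rules.append((b, a, support, count / item_counts[b]))
    return sorted(rules)

rules = pair_rules(rows)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} => {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

Restricting the analysis to a channel, store or time period would mean filtering `rows` first and running the whole calculation again, which is why each slice becomes its own standalone output.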
I have found the process of Market Basket Analysis to be fairly compute intensive, even for SAP HANA. I am used to receiving results in under a second, or maybe 10 seconds when doing some heavy duty processing. As mentioned, the data-set that I have been using is just over 80 million records. It is my understanding that the association analysis performs a cross join, joining the 80 million records to the same 80 million records, which would result in 6,400,000,000,000,000 possible row pairs. This could explain why it takes around 180 seconds to run the Basket Analysis.
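The figures quoted in this post can be checked with a few lines of Python. This is just the arithmetic, using the rounded numbers from the text (80 million records, 3 hours, 3 minutes); the variable names are my own.

```python
# Rounded figures taken from the text above.
records = 80_000_000

# A full cross join pairs every record with every record.
cross_join_rows = records * records
print(f"{cross_join_rows:,}")  # 6,400,000,000,000,000

# Runtime comparison: roughly 3 hours versus roughly 3 minutes.
traditional_seconds = 3 * 60 * 60
hana_seconds = 3 * 60
speedup = traditional_seconds / hana_seconds
print(f"{speedup:.0f}x faster")  # 60x faster
```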
If we compare the way some other tools work with how Predictive Analysis with the SAP HANA PAL (Predictive Analysis Library) works, the advantage we can exploit becomes clear.
“Traditional” Predictive Analytics
With the traditional approach we have a 3 tier landscape: database, predictive server and predictive client. The scoring, modeling and, in this case, market basket analysis are usually run on the Predictive Server. This is fine when working with a subset of data but becomes problematic with large data sets. The predictive server first has to query the database and pull the data out, and only then can it process that data-set.
SAP HANA Predictive Analytics
With the SAP HANA Platform we have moved the “Predictive Server” inside the database, where it is one of many engines that are available. The data is all stored in-memory in an optimised, compressed way. The data does not need to leave HANA: we can process it all in-memory and just pass the results to the Predictive Client. It’s not surprising that it is 60x faster. This is currently without any tuning or optimisation, which I plan to do later.