Market Basket Analysis
Using SAP PA – Automated Analytics and R
Sudeepti Bandi
Kranthi Kumar Thirumalagiri
Author1: Sudeepti Bandi
Company: NTT DATA Global Delivery Services Limited
Author Bio
Sudeepti is a Principal Consultant at NTT DATA in the SAP Analytics Practice
Author2: Kranthi Kumar Thirumalagiri
Company: NTT DATA Global Delivery Services Limited
Author Bio
Kranthi Kumar is a Senior Principal Consultant at NTT DATA in the SAP Analytics Practice
Introduction
Advanced analytics and data science are fast-evolving disciplines that play a significant role in present-day strategy and decision making for businesses. Many organizations are adopting practices that help them understand their business data better and build powerful insights into their overall functioning. Analytics has a vast scope in terms of usage, and there are different tools available in the market to build data models that support business analysis and prediction/forecasting.
Market Basket Analysis is a popular methodology for finding associations between products/items based on the transactions/shopping carts/market baskets of different customers. This type of analysis helps in identifying associations between items that can be sold together, and can also support cross-selling and up-selling. For example, a conclusion of this analysis might read: customers who buy product A are more likely to buy product B.
In this paper we analyze a retail dataset by leveraging the algorithms within SAP PA (Automated Analytics) and R to perform Market Basket Analysis. We also explore the integration of the R language with SAP Predictive Analytics.
Objective
Predictive analytics encompasses a variety of statistical techniques from modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events. There are several tools and technologies in the market that can be adopted to predict patterns in the given data set.
SAP Predictive Analytics is a statistical analysis and data mining solution that enables you to build predictive models to discover hidden insights and relationships in your data, and thereby make predictions about future events. The tool has two products: Automated Analytics and Expert Analytics. Automated Analytics automates data analysis and addresses business problems without any manual intervention in data modeling or algorithm tuning; it is used for less complicated use cases by data analysts. Expert Analytics analyzes data using in-built algorithms as well as those from R (an open-source programming language for statistical analytics); it can be used for complicated use cases where manual intervention/control is necessary at different steps of the data modeling.
In this paper we explore how to perform MBA through association rules in Automated Analytics in SAP Predictive Analytics using a retail dataset. We then perform the same analysis in R and compare the options, features, output and effort involved in SAP Predictive Analytics versus R. We also explore the integration of R with SAP PA by calling the R algorithms through Expert Analytics.
This kind of analysis is used in the retail industry, in recommender engines for e-commerce and restaurants, and in tools that detect plagiarism. Several algorithms can be used for Market Basket Analysis, the most popular being the Apriori algorithm.
How Does the Algorithm Work?
The primary input for the algorithm is a data set of past transactions within a business. The algorithm then automatically identifies patterns within the data set, such as the items most frequently bought together within a given volume of transactions.
The algorithm stores this pattern-recognition logic and applies it to new data. This is called training the algorithm.
Algorithm for MBA – Apriori
Apriori is a popular algorithm used for Market Basket Analysis. Its significant parameters are listed below:
- Association Rule
- Support
- Confidence
- Lift
Association Rule
A rule is denoted as below –
A → B
A – Antecedent/LHS
B – Consequent/RHS
- A and B are items from the data set of transactions.
- A and B should occur together in different transactions to qualify for a strong association rule.
Support
This is calculated as a ratio of “Number of transactions where items occur together” to that of “Total Number of Transactions”. If A and B are bought together in 5 transactions out of a total of 20 transactions, then the support is calculated as 5/20 which is 0.25.
Confidence
Confidence is calculated as the ratio of “Number of transactions in which A and B occur together” to the number of transactions in which “A occurs”. If A occurs in a total of 10 transactions, out of which, B also occurs in 5 transactions along with A,
Then the Confidence of (A → B) is 5/10 = 0.5.
This means the chance of purchasing item B when item A is bought is 50%. This strength of prediction for the sale of item B is termed the confidence.
It is apparent that
- For a strong association, Support and Confidence have to be high.
- Rules with low Support and Confidence can be eliminated.
Limitations of Support and Confidence
- Many important associations are eliminated if they have low Support. However, low Support may simply mean that an item is expensive and is purchased infrequently; it does not necessarily mean the rule can be ignored. That judgment rests with the analyst.
- High Confidence can be misleading when the Consequent is very frequent overall: if B appears in most transactions anyway, the Confidence of A → B will be high even when B's sales do not actually depend on A. The Lift measure helps correct for this.
Lift
Lift measures how much more likely the Consequent is, given the Antecedent, than it would be if the two were independent. It is calculated as the ratio of the Confidence of (A → B) to the Support of B. A Lift greater than 1 indicates a positive association.
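The three measures can be checked with a small script. The following Python sketch is an illustration added for clarity (it is not part of the original analysis); the item names and counts mirror the hypothetical A and B example above:

```python
# Toy transactions mirroring the text: A and B occur together in 5
# of 20 transactions, and A occurs in 10 transactions overall.
transactions = (
    [{"A", "B"}] * 5   # A and B bought together
    + [{"A"}] * 5      # A alone, so A occurs in 10 transactions total
    + [{"C"}] * 10     # filler transactions with neither A nor B
)

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of (antecedent with consequent) divided by support of antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print(support({"A", "B"}, transactions))       # 5/20 = 0.25
print(confidence({"A"}, {"B"}, transactions))  # 5/10 = 0.5
print(lift({"A"}, {"B"}, transactions))        # 0.5 / 0.25 = 2.0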
Apriori Principle
If an item set is frequent then all of its subsets must also be frequent.
If the item set {A, B, C, D} is frequently occurring, then the subsets listed below are also frequent.
3-item sets – {A, B, C}, {A, B, D}, {A, C, D}, {B, C, D}
2-item sets – {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}
1-item sets – {A}, {B}, {C}, {D}
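This subset enumeration can be reproduced with a few lines of Python (an illustrative sketch, not part of the original analysis):

```python
from itertools import combinations

# If {A, B, C, D} is frequent, the Apriori principle guarantees that
# every non-empty subset of it is frequent as well.
frequent = {"A", "B", "C", "D"}

# Enumerate the 3-item, 2-item and 1-item subsets.
for size in range(len(frequent) - 1, 0, -1):
    subsets = [set(c) for c in combinations(sorted(frequent), size)]
    print(f"{size}-item sets: {subsets}")
```

In practice Apriori exploits the contrapositive: if any subset of a candidate item set is infrequent, the candidate itself can be pruned without ever counting it.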
Let us first consider a very simple example –
- A dataset listing purchases from a stationery store.
- There are 11 transactions.
- We aim to manually identify the associations between the items in this dataset, following the same procedure the actual Apriori algorithm uses.
- As in Apriori, the associations are computed through successive iterations based on the minimum support value we choose.
The table below lists the 11 transactions:

1 – Pencils, Eraser, Sharpener
2 – Covers, Labels, Notebook, Stapler
3 – Colour pencils
4 – Pencils, Eraser
5 – Notebook
6 – Notebook, Pencils, Eraser
7 – Covers, Labels
8 – Crayons
9 – Glue, Covers, Labels, Notebook
10 – Pencils, Eraser
11 – Notebook
Now this is how the algorithm works:
Step 1
We consider a minimum Support count of 2.
The first iteration covers all single items: we count the number of times each item appears across the transactions, and every item that appears two or more times is retained.
The single-item occurrence counts meeting the minimum Support are as below:

Notebook – 5
Pencils – 4
Eraser – 4
Labels – 3
Covers – 3

Sharpener, Stapler, Colour pencils, Crayons and Glue each occur only once and are eliminated.
Step 2
We now count occurrences of the items above in pairs, i.e. in two-item combinations. Only the single items with Support of at least 2 listed above are considered for this iteration; transactions reduce to their frequent items.
Counting the number of times each pair of frequent items appears across the transactions:

{Notebook, Labels} – 2
{Notebook, Covers} – 2
{Pencils, Eraser} – 4
{Labels, Covers} – 3
Step 3
We now consider item sets of size 3 with Support greater than or equal to 2. There is only one such item set, occurring twice:
{Notebook, Covers, Labels} – 2
Discovery/Inference
- Customers buying Pencils also buy Erasers.
- Customers buying Notebooks and Covers also buy Labels.
- Customers buying Covers also buy Labels.
Based on this outcome, we can make decisions on stock availability, positioning of items in the store, promotions, etc.
In practice, we start with minimum values for Support and Confidence and, depending on the use case, settle on optimum levels after several iterations.
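The three iterations above can be reproduced programmatically. The following Python sketch illustrates the manual procedure on the 11 stationery transactions with a minimum support count of 2; it is a simplified illustration, not the implementation used by SAP PA or R's arules:

```python
from itertools import combinations
from collections import Counter

# The 11 stationery-store transactions from the worked example.
transactions = [
    {"Pencils", "Eraser", "Sharpener"},
    {"Covers", "Labels", "Notebook", "Stapler"},
    {"Colour pencils"},
    {"Pencils", "Eraser"},
    {"Notebook"},
    {"Notebook", "Pencils", "Eraser"},
    {"Covers", "Labels"},
    {"Crayons"},
    {"Glue", "Covers", "Labels", "Notebook"},
    {"Pencils", "Eraser"},
    {"Notebook"},
]

MIN_SUPPORT = 2  # minimum occurrence count, as in the text

# Iteration 1: count single items and keep those meeting MIN_SUPPORT.
singles = Counter(item for t in transactions for item in t)
frequent_1 = {item for item, n in singles.items() if n >= MIN_SUPPORT}

# Iteration 2: count pairs built only from the frequent single items.
pairs = Counter(
    frozenset(p)
    for t in transactions
    for p in combinations(sorted(t & frequent_1), 2)
)
frequent_2 = {p: n for p, n in pairs.items() if n >= MIN_SUPPORT}

# Iteration 3: count triples of frequent items (a simplification; full
# Apriori would build candidates from the frequent pairs).
triples = Counter(
    frozenset(c)
    for t in transactions
    for c in combinations(sorted(t & frequent_1), 3)
)
frequent_3 = {c: n for c, n in triples.items() if n >= MIN_SUPPORT}

print(frequent_1)  # Notebook, Pencils, Eraser, Labels, Covers
print(frequent_2)  # e.g. {Pencils, Eraser}: 4, {Labels, Covers}: 3
print(frequent_3)  # {Notebook, Covers, Labels}: 2
```

The counts it produces match the manual iterations: {Pencils, Eraser} occurs 4 times and {Notebook, Covers, Labels} twice.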
Working in SAP Predictive Analytics
Automated Analytics
Dataset:
There are two datasets, together covering 200 transactions. The reference data source holds all the unique transaction IDs in a single column. The events data source holds the transactions and their corresponding items in two columns.
The reference data source is as below:
The events data source is as below:
Step 1: Load and describe/analyze Reference data source
We can either load a file that describes the reference data source, or let the tool analyze the columns and then change the column types if needed. Here column C1 contains the transaction IDs and should be defined as 'nominal' or 'ordinal'.
Step 2: Load Reference data source
Repeat the same steps for Events data source.
Step 3: Set parameters for the algorithm
Step 4: Generate the model
Now we see that the tool has identified the number of transactions. This report can also be saved and distributed for future reference (PDF format is available).
We can also view, save and distribute statistical information about the rules and item sets, for example the frequency distribution of the items across the transactions. For the dataset we have taken, 'whole milk' is the most frequent item.
For this example, with minimum support of 2% and minimum confidence of 50% the tool generated 24 rules.
We can focus on a particular item as antecedent or consequent based on the business requirement/question.
We can also filter on the range of numeric parameters of the model, for example we can search for rules with support >= 0.03 or rules with confidence >= 60%.
If we set the rule size to greater than 2, all the generated rules are covered. The graphical representation of the filtered rules is as below:
Let us focus on whole milk as consequent as this is the most frequent item. We have an option to fix the consequent. We get 7 rules for whole milk as consequent.
We can see a strong association between ‘yogurt’ and ‘whole milk’.
Discovery/Inference:
- Customers buying yogurt are more likely to buy whole milk.
- Customers buying sausage are more likely to buy rolls/buns.
- Customers buying root vegetables are more likely to buy other vegetables.
This way we can derive many rules and take business decisions for cross selling, upselling, promotional offers and store layout using Market Basket Analysis.
Working in R
In R, any functionality beyond the basic version is available in the form of packages. The algorithm we are going to use in R for Market Basket Analysis is Apriori.
Step 1: Download and install the arules and arulesViz packages. These provide Apriori and the corresponding plots.
Step 2: Load the Groceries dataset from the arules package, or load any retail dataset into R.
Code:
Step 3: Plot the frequency distribution of the Items.
Frequency distribution for the graph is as below.
Step 4: Apply Apriori algorithm.
For similar parameters, R's Apriori generated 32 rules; the significant rules and their parameter values are similar, though not exactly the same.
Step 5: Plot the rules/associations identified.
There are several plot types in R to represent the rules and the Apriori parameters; a few of them are 'grouped', 'matrix' and 'graph'.
The “graph” looks as below for the 32 rules –
If we confine the consequent (RHS) to "whole milk", the grouped plot and the graph look like below:
Working in Expert Analytics
We will now explore how this can be done using Expert Analytics, where the Apriori algorithm is called from R. The aim of this section is to demonstrate how well SAP has integrated R and its functionality; the visuals and ease of use are the advantages of this tool.
Step 1: Load the dataset
Step 2: Apply RApriori, in Expert Analytics.
The settings for the algorithm can be given as below –
We then run the model. The rules generated are displayed.
The Association chart in Expert Analytics graphically represents the associations as below.
The output is the same as in R; displaying the summary reproduces R's output.
Note that the data format for the datasets used in all three tools is similar; only Automated Analytics requires the additional dataset holding the transaction IDs.
Conclusion
- Visualization is better in SAP Predictive Analytics.
- SAP PA has a better UI.
- R is open source; however, the RStudio server installation has to be purchased.
- There are more options in R for performing Market Basket Analysis; however, SAP Predictive Analytics can integrate R's functionality through Expert Analytics.
- SAP Predictive Analytics enables two different types of models for a use case, through Automated and Expert Analytics, so we can use either of them as the business requirement and suitability dictate.
- In SAP PA Automated Analytics, the output and statistical reports produced during model preparation can be downloaded and distributed easily in PDF format.
References:
Image courtesy: https://pixabay.com/en/shoppingcartchartstoreshopper650046/
Thank you, interesting post. The explanation of the algorithm basics is really well done.
Thank you!
Nice (beginning) of analysis of the different options :)
I say beginning because of two things:
1/ Our preferred option in Automated is not to use the association rules section but the link analysis or social section (where there is a dedicated workflow for product recommendations).
2/ This is a functional analysis, but the war is in scalability. Automated has been used to generate millions of rules with several hundreds of thousands of possible items... That is where things get interesting :)
Cheers
Thank you Eric, yes we did try recommendations, however interpreting association rules was easier. We are yet to explore with larger datasets as there were performance constraints with the desktop version and also the data format for transactions/events dataset.
Sudeepti, Very good post. Well laid down content.
Thank you Hari
nicely explained!!
Thank you Saritha
Hi Sudeepti,
I know this thread is about 2 years old now, but do you still have the data sets used for this comparison? I would love to run these tests again using the same data set that you have. Any chance you could share them in .csv?
Thanks!