Market Basket Analysis

          Using SAP PA – Automated Analytics and R

Sudeepti Bandi
Kranthi Kumar Thirumalagiri

/wp-content/uploads/2015/09/shopping_650046_1280_789613.png

Author1:          Sudeepti Bandi

Company:       NTT DATA Global Delivery Services Limited

        

Author Bio  

Sudeepti is a Principal Consultant at NTT DATA in the SAP Analytics Practice.

Author2:          Kranthi Kumar Thirumalagiri

Company:       NTT DATA Global Delivery Services Limited

         

Author Bio  

Kranthi Kumar is a Senior Principal Consultant at NTT DATA in the SAP Analytics Practice.

Introduction

Advanced analytics and data science are fast-evolving disciplines that play a significant role in present-day business strategy and decision making. Many organizations are adopting practices that help them understand their business data better and build powerful insights into their overall functioning. Analytics has a vast scope in terms of usage, and there are different tools available in the market to build data models that support business analysis and prediction/forecasting.

Market Basket Analysis is a popular methodology for finding associations between products/items based on the transactions/shopping carts/market baskets of different customers. This type of analysis helps in identifying items that can be sold together and can also support cross-selling and up-selling. For example, a conclusion of this analysis reads: customers who buy product A are more likely to buy product B.

        

In this paper we analyze a retail dataset by leveraging the algorithms within SAP PA (Automated Analytics) and R to perform Market Basket Analysis. We also explore the integration of the R language with SAP Predictive Analytics.

Objective

Predictive analytics encompasses a variety of statistical techniques from modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events. There are several tools and technologies in the market that can be adopted to predict patterns in the given data set.

SAP Predictive Analytics is a statistical analysis and data mining solution that enables you to build predictive models to discover hidden insights and relationships in your data and make predictions about future events. The tool has two components: Automated Analytics and Expert Analytics. Automated Analytics automates data analysis and addresses business problems without manual intervention in data modeling or algorithm tuning; it is typically used by data analysts for less complicated use cases. Expert Analytics analyzes data using built-in algorithms as well as algorithms from R (an open-source programming language for statistical analytics); it can be used for complicated use cases where manual intervention/control is necessary at different steps of data modeling.

In this paper we explore how to perform Market Basket Analysis through association rules in Automated Analytics in SAP Predictive Analytics using a retail dataset. We then perform the same analysis in R and compare the different options, features, output, and effort of SAP Predictive Analytics versus R. We also explore the integration of R with SAP PA by calling R algorithms through Expert Analytics.

This kind of analysis is used in the retail industry, in recommender engines in e-commerce, in restaurants, and in plagiarism-detection tools. Several algorithms can be used for Market Basket Analysis; the most popular is the Apriori algorithm.

                                   How the Algorithm Works

The primary input for the algorithm is a data set of past transactions from a business. The algorithm then automatically identifies patterns within the data set, such as the items most frequently bought together within a given volume of transactions.

The algorithm stores this pattern-recognition logic and applies it to new data sets. This is called training the algorithm.

                                             Algorithm for MBA – Apriori

Apriori is a popular algorithm used for Market Basket Analysis. The significant parameters of this algorithm are listed below-

  • Association Rule
  • Support
  • Confidence
  • Lift

Association Rule

        

A rule is denoted as below –

                                A -> B

A –Antecedent/LHS

B – Consequent/RHS

  • A and B are items from the data set of transactions.
  • They should occur together in different transactions to qualify for a strong association rule.

Support

This is calculated as a ratio of “Number of transactions where items occur together” to that of “Total Number of Transactions”. If A and B are bought together in 5 transactions out of a total of 20 transactions, then the support is calculated as 5/20 which is 0.25.

Confidence

Confidence is calculated as the ratio of the “Number of transactions in which A and B occur together” to the “Number of transactions in which A occurs”. If A occurs in a total of 10 transactions, and B occurs along with A in 5 of them,

then the Confidence of (A -> B) is 5/10.

This means the chance of purchasing item B when item A is bought is 50%. The strength of this prediction of the sale of item B is termed confidence.

It is apparent that

  • For a strong association, Support and Confidence have to be high.
  • Rules with low Support and Confidence could be eliminated.

Limitations of Support and Confidence

  • Many important associations will be eliminated if they have low Support; however, a low Support could simply mean the item is expensive and purchased infrequently. Low Support does not necessarily mean a rule can be ignored; it depends on the discretion of the analyst.

  • High Confidence can be misleading at times. The Confidence of (A -> B) might be 5/10, but if B occurs only in those 5 transactions overall, its purchase is not actually dependent on A.

          Lift

Lift measures how much more often the Antecedent and Consequent occur together than would be expected if they were independent. It is calculated as the ratio of the Confidence of (A -> B) to the Support of B; a lift greater than 1 indicates a positive association.
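The arithmetic behind the three measures can be sketched directly. The short Python snippet below uses the transaction counts from the running example in the text (20 transactions, A in 10 of them, A and B together in 5); the count for B alone is not stated in the text, so it is assumed to be 5 here purely for illustration of the lift formula:

```python
# Counts taken from the running example in the text:
# 20 transactions in total, A occurs in 10, A and B together in 5.
total_transactions = 20
count_A = 10
count_A_and_B = 5
# Support(B) is not given in the text; assume B occurs in 5
# transactions (an assumption for illustration only).
count_B = 5

support_AB = count_A_and_B / total_transactions    # 5/20 = 0.25
confidence_A_to_B = count_A_and_B / count_A        # 5/10 = 0.50
support_B = count_B / total_transactions           # 5/20 = 0.25
lift_A_to_B = confidence_A_to_B / support_B        # 0.50/0.25 = 2.0

print(support_AB, confidence_A_to_B, lift_A_to_B)
```

Under the assumed count for B, the lift of 2.0 says A and B co-occur twice as often as independence would predict.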

Apriori Principle-

If an item set is frequent then all of its subsets must also be frequent.

If the item set {A, B, C, D} is frequently occurring, then the subsets listed below are also frequent.

3-item sets – {A, B, C}, {A, B, D}, {A, C, D}, {B, C, D}

2-item sets – {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}

1-item sets – {A}, {B}, {C}, {D}
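These subsets can be enumerated mechanically; a minimal Python sketch using the standard library:

```python
from itertools import combinations

itemset = ("A", "B", "C", "D")

# Enumerate every non-empty proper subset, level by level; by the
# Apriori principle each one is also frequent if the full set is.
for size in (3, 2, 1):
    subsets = [set(combo) for combo in combinations(itemset, size)]
    print(f"{size}-item sets:", subsets)
```

This yields the same 4 three-item sets, 6 two-item sets, and 4 one-item sets listed above.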

Let us first consider a very simple example:

  • A dataset listing purchases from a stationery store.
  • There are 11 transactions.
  • We aim to manually identify the associations between the items in this dataset, the way the actual Apriori algorithm works.
  • As in Apriori, associations are calculated through successive iterations based on the minimum support we choose.

The table below lists the 11 transactions:

1 – Pencils, Eraser, Sharpener
2 – Covers, Labels, Notebook, Stapler
3 – Colour pencils
4 – Pencils, Eraser
5 – Notebook
6 – Notebook, Pencils, Eraser
7 – Covers, Labels
8 – Crayons
9 – Glue, Covers, Labels, Notebook
10 – Pencils, Eraser
11 – Notebook

Now this is how the algorithm works:

Step 1 – We consider a minimum Support of 2, i.e. an item set must appear in at least 2 transactions.

The first iteration considers all single items: we count the number of times each item appears in the transactions and keep those that appear two or more times.


Single-item occurrences with Support of at least 2 are as below:

Notebook – 5
Pencils – 4
Eraser – 4
Labels – 3
Covers – 3

Step 2 –

We now count occurrences of the items above in pairs, i.e. in two-item combinations (only the single items with Support of at least 2 listed above are considered in this iteration). We also drop transactions containing only one item, since they cannot contribute to any pair.

1 – Pencils, Eraser, Sharpener
2 – Covers, Labels, Notebook, Stapler
4 – Pencils, Eraser
6 – Notebook, Pencils, Eraser
7 – Covers, Labels
9 – Glue, Covers, Labels, Notebook
10 – Pencils, Eraser

Count the number of times each of the pairs below appears in the table above:

{Notebook, Labels} – 2
{Notebook, Covers} – 2
{Pencils, Eraser} – 4
{Labels, Covers} – 3

Step 3 –

We now consider item sets of size 3 with Support greater than or equal to 2. There is only one such item set, and it occurs twice:

{Notebook, Covers, Labels} – 2

Discovery/Inference-

  • Customers buying Pencils also buy Erasers
  • Customers buying Notebooks and Covers also buy Labels
  • Customers buying Covers also buy Labels

Based on the outcome, we can make decisions on stock availability, positioning of items in the store, promotions, etc.

We can start with minimum values for Support and Confidence and, depending on the use case, set them at an optimum level after several iterations.
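The manual walkthrough above can be reproduced in a few lines of code. The sketch below (plain Python, standard library only; `frequent_itemsets` is a name chosen here, not part of any tool discussed) performs the same level-wise counting on the 11 stationery transactions:

```python
from collections import Counter
from itertools import combinations

# The 11 stationery-store transactions from the example above.
transactions = [
    {"Pencils", "Eraser", "Sharpener"},
    {"Covers", "Labels", "Notebook", "Stapler"},
    {"Colour pencils"},
    {"Pencils", "Eraser"},
    {"Notebook"},
    {"Notebook", "Pencils", "Eraser"},
    {"Covers", "Labels"},
    {"Crayons"},
    {"Glue", "Covers", "Labels", "Notebook"},
    {"Pencils", "Eraser"},
    {"Notebook"},
]

MIN_SUPPORT = 2  # minimum number of occurrences, as in the walkthrough

def frequent_itemsets(transactions, min_support, max_size=3):
    """Level-wise, Apriori-style counting of frequent item sets."""
    # Iteration 1: frequent single items.
    counts = Counter(item for t in transactions for item in t)
    frequent = {frozenset([i]): c for i, c in counts.items() if c >= min_support}
    result = dict(frequent)
    # Iterations 2..max_size: count combinations of the frequent items.
    items = {i for s in frequent for i in s}
    for size in range(2, max_size + 1):
        counts = Counter()
        for t in transactions:
            for combo in combinations(sorted(t & items), size):
                counts[frozenset(combo)] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        if not frequent:
            break
        result.update(frequent)
    return result

freq = frequent_itemsets(transactions, MIN_SUPPORT)
```

Running it recovers the same counts as the walkthrough: five frequent single items, four frequent pairs (with {Pencils, Eraser} at 4), and the single frequent triple {Notebook, Covers, Labels} at 2.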

                                   Working in SAP Predictive Analytics

                                   Automated Analytics

Dataset:

There are two datasets. One has all the unique transaction IDs (the reference data source, with one column). The other has the transactions and their corresponding items (the events data source, with two columns). The datasets cover 200 transactions.

Reference data source is as below, it has one column:

/wp-content/uploads/2015/09/pic1_789632.jpg

Events data source is as below, it has two columns.

/wp-content/uploads/2015/09/pic2_789633.png

Step 1: Load and describe/analyze Reference data source

/wp-content/uploads/2015/09/pic3_789634.png

/wp-content/uploads/2015/09/pic4_789644.png

We can either load a file that describes the reference data source or let the tool analyze the columns and then change the column types if needed. Here, column C1 contains the transaction IDs and should be defined as ‘nominal’ or ‘ordinal’.

Step 2: Load and describe Events data source

Repeat the same steps as above for the Events data source.

/wp-content/uploads/2015/09/pic5_789645.png

Step 3: Set parameters for the algorithm

/wp-content/uploads/2015/09/pic6_789646.png

Step 4: Generate the model

/wp-content/uploads/2015/09/pic7_789656.png

Now we see that the tool has identified the number of transactions. This report can be saved and distributed for future reference (PDF format is available).

We can also view, save, and distribute statistical information about the rules and item sets, for example the frequency distribution of the items in the transactions. For the dataset we have taken, ‘whole milk’ is the most frequent item.

/wp-content/uploads/2015/09/pic8_789658.png

For this example, with a minimum support of 2% and a minimum confidence of 50%, the tool generated 24 rules.

/wp-content/uploads/2015/09/pic9_789697.png

/wp-content/uploads/2015/09/pic10_789728.png

We can focus on a particular item as antecedent or consequent based on the business requirement/question.

We can also filter on the range of numeric parameters of the model, for example we can search for rules with support >= 0.03 or rules with confidence >= 60%.

/wp-content/uploads/2015/09/pic11_789729.png

If we set the rule size to be greater than 2, all the generated rules are covered. The graphical representation of the filtered rules is as below:

/wp-content/uploads/2015/09/pic12_789730.png

Let us focus on whole milk as the consequent, as it is the most frequent item. We have an option to fix the consequent, and we get 7 rules with whole milk as the consequent.

/wp-content/uploads/2015/09/pic13_789734.png

/wp-content/uploads/2015/09/pic14_789735.png

/wp-content/uploads/2015/09/pic15_789736.png

We can see a strong association between ‘yogurt’ and ‘whole milk’.

Discovery/Inference:

  • Customers buying yogurt are more likely to buy whole milk.
  • Customers buying sausage are more likely to buy rolls/buns
  • Customers buying root vegetables are more likely to buy other vegetables.

This way we can derive many rules and take business decisions for cross selling, up-selling, promotional offers and store layout using Market Basket Analysis.

                                  Working in R

In R, any functionality beyond the base installation is available in the form of packages. The algorithm we use in R for Market Basket Analysis is Apriori.

Step 1: Download and install the arules and arulesViz packages. These provide the Apriori implementation and the corresponding plots.

Step 2: Load the Groceries dataset from the arules package, or load any retail dataset into R.

Code:

/wp-content/uploads/2015/09/pic16_789743.png

Step 3: Plot the frequency distribution of the Items.

The frequency distribution of the items is as below.

/wp-content/uploads/2015/09/pic17_789744.png

Step 4: Apply Apriori algorithm.

/wp-content/uploads/2015/09/pic18_789745.png

For similar parameters, R’s Apriori generated 32 rules; the significant rules and their parameters are similar to those from Automated Analytics but not exactly the same.

/wp-content/uploads/2015/09/pic19_789752.png

Step 5: Plot the rules/associations identified.

There are several plot types in R to represent the rules and the Apriori parameters. A few of them are grouped, matrix, and graph.

The “graph” plot for the 32 rules looks as below –

/wp-content/uploads/2015/09/pic20_789753.png

If we confine the consequent (RHS) to “whole milk”, then the grouped plot and the graph look like below:

/wp-content/uploads/2015/09/pic21_789754.png

/wp-content/uploads/2015/09/pic22_789761.png

                                   Working in Expert Analytics

We will now explore how this can be done using Expert Analytics, with the Apriori algorithm called from R. The aim of this section is to demonstrate how well SAP has integrated R and its functionality; the visuals and ease of use are the advantages of this tool.

Step 1: Load the dataset

/wp-content/uploads/2015/09/pic23_789762.png

Step 2: Apply R-Apriori, in Expert Analytics.

/wp-content/uploads/2015/09/pic24_789764.png

The settings for the algorithm can be given as below –

/wp-content/uploads/2015/09/pic25_789763.png

We then run the model. The rules generated are displayed.

/wp-content/uploads/2015/09/pic26_789765.png

The Association chart in Expert Analytics graphically represents the associations as below.

/wp-content/uploads/2015/09/pic27_789766.png

The output is the same as in R; displaying the summary gives the same output as R does.

/wp-content/uploads/2015/09/pic333_789783.png

Note that the data format for the datasets used in all three tools is similar; only Automated Analytics requires the additional dataset of transaction IDs.

                                         Conclusion

  • Visualization is better in SAP Predictive Analytics.
  • SAP PA has a better UI.
  • R is open source; however, the R Studio server installation has to be purchased.
  • There are more options in R to perform Market Basket Analysis; however, SAP Predictive Analytics can integrate the functionality of R through Expert Analytics.
  • SAP Predictive Analytics enables two different types of modeling for a use case, through Automated and Expert Analytics, so we can use either as per the business requirement and suitability.
  • In SAP PA Automated Analytics, the output and statistical reports produced during model preparation can be downloaded and distributed easily in PDF format.

                                        References:

Image courtesy: https://pixabay.com/en/shopping-cart-chart-store-shopper-650046/

http://scn.sap.com/community/predictive-analytics/blog/2015/06/28/predictive-smackdown-automated-algorithms-vs-the-data-scientist

http://scn.sap.com/docs/DOC-62238#comment-606881


8 Comments


  1. Erik MARCADE

    Nice (beginning) of analysis of the different options 🙂

    I say beginning because of two things:

    1/ Our preferred option in automated is not to use the association rules section but the link analysis -or social- section (where there is a dedicated workflow for product recommendations).

    2/ This is a functional analysis, but the war is in scalability. Automated has been used to generate millions of rules with several hundreds of thousands of possible items… That is where things get interesting 😉

    Cheers

    1. Sudeepti Bandi Post author

      Thank you Erik; yes, we did try recommendations, however interpreting association rules was easier. We are yet to explore larger datasets, as there were performance constraints with the desktop version and also with the data format for the transactions/events dataset.
