Market Basket Analysis
Using SAP PA – Automated Analytics and R
Sudeepti Bandi
Kranthi Kumar Thirumalagiri
Author1: Sudeepti Bandi
Company: NTT DATA Global Delivery Services Limited
Author Bio
Sudeepti is a Principal Consultant at NTT DATA in the SAP Analytics Practice
Author2: Kranthi Kumar Thirumalagiri
Company: NTT DATA Global Delivery Services Limited
Author Bio
Kranthi Kumar is a Senior Principal Consultant at NTT DATA in the SAP Analytics Practice
Introduction
Advanced analytics and data science are fast-evolving disciplines that play a significant role in present-day strategy and decision making for businesses. Many organizations are adopting practices that help them understand their business data better and build powerful insights into their overall functioning. Analytics has a vast scope in terms of usage, and there are different tools available in the market to build data models that support business analysis and prediction/forecasting.
Market Basket Analysis is a popular methodology for finding associations between products/items based on the transactions/shopping carts/market baskets of different customers. This type of analysis helps in identifying associations between items that can be sold together, and can also support cross-selling and up-selling. For example, a conclusion of this analysis might read: customers who buy product A are more likely to buy product B.
In this paper we analyze a retail dataset by leveraging the algorithms within SAP PA (Automated Analytics) and R to perform Market Basket Analysis. We also explore the integration of the R language with SAP Predictive Analytics.
Objective
Predictive analytics encompasses a variety of statistical techniques from modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events. There are several tools and technologies in the market that can be adopted to predict patterns in the given data set.
SAP Predictive Analytics is a statistical analysis and data mining solution that enables you to build predictive models to discover hidden insights and relationships in your data, and thereby make predictions about future events. The tool has two products: Automated Analytics and Expert Analytics. Automated Analytics automates data analysis and addresses business problems without any manual intervention in data modeling or algorithm tuning; it is used for less complicated use cases by data analysts. Expert Analytics analyzes data using in-built algorithms as well as those from R (an open-source programming language for statistical analytics); it can be used for complicated use cases where manual intervention/control is necessary at different steps of the data modeling.
In this paper we explore how to perform MBA through association rules in Automated Analytics in SAP Predictive Analytics using a retail dataset. We then perform the same analysis in R and compare the options, features, output and effort involved in SAP Predictive Analytics versus R. We also explore the integration of R with SAP PA by calling the R algorithms through Expert Analytics.
This kind of analysis is used in the retail industry, in recommender engines for e-commerce and restaurants, and in tools that detect plagiarism. Several algorithms can be used for Market Basket Analysis, the most popular being the Apriori algorithm.
How Does the Algorithm Work?
The primary input for the algorithm is a data set of past transactions within a business. The algorithm then automatically identifies patterns within the data set, such as the items most frequently bought together within a given volume of transactions.
The algorithm stores this pattern-recognition logic and applies it to new data. This is called training the algorithm.
Algorithm for MBA – Apriori
Apriori is a popular algorithm used for Market Basket Analysis. Its significant parameters are listed below:
- Association Rule
- Support
- Confidence
- Lift
Association Rule
A rule is denoted as below –
A → B
A – Antecedent/LHS
B – Consequent/RHS
- A and B are items from the data set of transactions.
- A and B should occur together in different transactions to qualify for a strong association rule.
Support
This is calculated as a ratio of “Number of transactions where items occur together” to that of “Total Number of Transactions”. If A and B are bought together in 5 transactions out of a total of 20 transactions, then the support is calculated as 5/20 which is 0.25.
Confidence
Confidence is calculated as the ratio of “Number of transactions in which A and B occur together” to the number of transactions in which “A occurs”. If A occurs in a total of 10 transactions, out of which, B also occurs in 5 transactions along with A,
Then the Confidence of (A → B) is 5/10 = 0.5.
This means the chance of purchasing item B when item A is bought is 50%. This strength of prediction for the sale of item B is termed the confidence.
It is apparent that
- For a strong association, Support and Confidence have to be high.
- Rules with low Support and Confidence can be eliminated.
Limitations of Support and Confidence
- Many important associations are eliminated if they have low Support. However, low Support may simply mean that an item is expensive and is purchased infrequently; it does not necessarily mean the rule can be ignored. That judgment rests with the analyst.
- High Confidence can be misleading when the Consequent is very frequent overall: if B appears in most transactions anyway, the Confidence of A → B will be high even when B's sales do not actually depend on A. The Lift measure helps correct for this.
Lift
Lift measures how much more likely the Consequent is, given the Antecedent, than it would be if the two were independent. It is calculated as the ratio of the Confidence of (A → B) to the Support of B. A Lift greater than 1 indicates a positive association.
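The three measures can be checked with a small script. The following Python sketch is an illustration added for clarity (it is not part of the original analysis); the item names and counts mirror the hypothetical A and B example above:

```python
# Toy transactions mirroring the text: A and B occur together in 5
# of 20 transactions, and A occurs in 10 transactions overall.
transactions = (
    [{"A", "B"}] * 5   # A and B bought together
    + [{"A"}] * 5      # A alone, so A occurs in 10 transactions total
    + [{"C"}] * 10     # filler transactions with neither A nor B
)

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of (antecedent with consequent) divided by support of antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

print(support({"A", "B"}, transactions))       # 5/20 = 0.25
print(confidence({"A"}, {"B"}, transactions))  # 5/10 = 0.5
print(lift({"A"}, {"B"}, transactions))        # 0.5 / 0.25 = 2.0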
Apriori Principle
If an item set is frequent then all of its subsets must also be frequent.
If the item set {A, B, C, D} is frequently occurring, then the subsets listed below are also frequent.
3-item sets – {A, B, C}, {A, B, D}, {A, C, D}, {B, C, D}
2-item sets – {A, B}, {A, C}, {A, D}, {B, C}, {B, D}, {C, D}
1-item sets – {A}, {B}, {C}, {D}
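This subset enumeration can be reproduced with a few lines of Python (an illustrative sketch, not part of the original analysis):

```python
from itertools import combinations

# If {A, B, C, D} is frequent, the Apriori principle guarantees that
# every non-empty subset of it is frequent as well.
frequent = {"A", "B", "C", "D"}

# Enumerate the 3-item, 2-item and 1-item subsets.
for size in range(len(frequent) - 1, 0, -1):
    subsets = [set(c) for c in combinations(sorted(frequent), size)]
    print(f"{size}-item sets: {subsets}")
```

In practice Apriori exploits the contrapositive: if any subset of a candidate item set is infrequent, the candidate itself can be pruned without ever counting it.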
Let us first consider a very simple example –
- A dataset listing purchases from a stationery store.
- There are 11 transactions.
- We aim to manually identify the associations between the items in this dataset, following the same procedure the actual Apriori algorithm uses.
- As in Apriori, the associations are computed through successive iterations based on the minimum support value we choose.
The table below lists the 11 transactions:

1 – Pencils, Eraser, Sharpener
2 – Covers, Labels, Notebook, Stapler
3 – Colour pencils
4 – Pencils, Eraser
5 – Notebook
6 – Notebook, Pencils, Eraser
7 – Covers, Labels
8 – Crayons
9 – Glue, Covers, Labels, Notebook
10 – Pencils, Eraser
11 – Notebook
Now this is how the algorithm works:
Step 1
We consider a minimum Support count of 2.
The first iteration covers all single items: we count the number of times each item appears across the transactions, and every item that appears two or more times is retained.
The single-item occurrence counts meeting the minimum Support are as below:

Notebook – 5
Pencils – 4
Eraser – 4
Labels – 3
Covers – 3

Sharpener, Stapler, Colour pencils, Crayons and Glue each occur only once and are eliminated.
Step 2
We now count occurrences of the items above in pairs, i.e. in two-item combinations. Only the single items with Support of at least 2 listed above are considered for this iteration; transactions reduce to their frequent items.
Counting the number of times each pair of frequent items appears across the transactions:

{Notebook, Labels} – 2
{Notebook, Covers} – 2
{Pencils, Eraser} – 4
{Labels, Covers} – 3
Step 3
We now consider item sets of size 3 with Support greater than or equal to 2. There is only one such item set, occurring twice:
{Notebook, Covers, Labels} – 2
Discovery/Inference
- Customers buying Pencils also buy Erasers.
- Customers buying Notebooks and Covers also buy Labels.
- Customers buying Covers also buy Labels.
Based on this outcome, we can make decisions on stock availability, positioning of items in the store, promotions, etc.
In practice, we start with minimum values for Support and Confidence and, depending on the use case, settle on optimum levels after several iterations.
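The three iterations above can be reproduced programmatically. The following Python sketch illustrates the manual procedure on the 11 stationery transactions with a minimum support count of 2; it is a simplified illustration, not the implementation used by SAP PA or R's arules:

```python
from itertools import combinations
from collections import Counter

# The 11 stationery-store transactions from the worked example.
transactions = [
    {"Pencils", "Eraser", "Sharpener"},
    {"Covers", "Labels", "Notebook", "Stapler"},
    {"Colour pencils"},
    {"Pencils", "Eraser"},
    {"Notebook"},
    {"Notebook", "Pencils", "Eraser"},
    {"Covers", "Labels"},
    {"Crayons"},
    {"Glue", "Covers", "Labels", "Notebook"},
    {"Pencils", "Eraser"},
    {"Notebook"},
]

MIN_SUPPORT = 2  # minimum occurrence count, as in the text

# Iteration 1: count single items and keep those meeting MIN_SUPPORT.
singles = Counter(item for t in transactions for item in t)
frequent_1 = {item for item, n in singles.items() if n >= MIN_SUPPORT}

# Iteration 2: count pairs built only from the frequent single items.
pairs = Counter(
    frozenset(p)
    for t in transactions
    for p in combinations(sorted(t & frequent_1), 2)
)
frequent_2 = {p: n for p, n in pairs.items() if n >= MIN_SUPPORT}

# Iteration 3: count triples of frequent items (a simplification; full
# Apriori would build candidates from the frequent pairs).
triples = Counter(
    frozenset(c)
    for t in transactions
    for c in combinations(sorted(t & frequent_1), 3)
)
frequent_3 = {c: n for c, n in triples.items() if n >= MIN_SUPPORT}

print(frequent_1)  # Notebook, Pencils, Eraser, Labels, Covers
print(frequent_2)  # e.g. {Pencils, Eraser}: 4, {Labels, Covers}: 3
print(frequent_3)  # {Notebook, Covers, Labels}: 2
```

The counts it produces match the manual iterations: {Pencils, Eraser} occurs 4 times and {Notebook, Covers, Labels} twice.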
Working in SAP Predictive Analytics
Automated Analytics
Dataset:
There are two datasets, together covering 200 transactions. The reference data source holds all the unique transaction IDs in a single column. The events data source holds the transactions and their corresponding items in two columns.
The reference data source is as below:
The events data source is as below:
Step 1: Load and describe/analyze Reference data source
We can either load a file that describes the reference data source, or let the tool analyze the columns and then change the column types if needed. Here column C1 contains the transaction IDs and should be defined as 'nominal' or 'ordinal'.
Step 2: Load Reference data source
Repeat the same steps for Events data source.
Step 3: Set parameters for the algorithm
Step 4: Generate the model
Now we see that the tool has identified the number of transactions. This report can also be saved and distributed for future reference (PDF format is available).
We can also view, save and distribute statistical information about the rules and item sets, for example the frequency distribution of the items across the transactions. For the dataset we have taken, 'whole milk' is the most frequent item.
For this example, with minimum support of 2% and minimum confidence of 50% the tool generated 24 rules.
We can focus on a particular item as antecedent or consequent based on the business requirement/question.
We can also filter on the range of numeric parameters of the model, for example we can search for rules with support >= 0.03 or rules with confidence >= 60%.
If we set the rule size to greater than 2, all the generated rules are covered. The graphical representation of the filtered rules is as below:
Let us focus on whole milk as consequent as this is the most frequent item. We have an option to fix the consequent. We get 7 rules for whole milk as consequent.
We can see a strong association between ‘yogurt’ and ‘whole milk’.
Discovery/Inference:
- Customers buying yogurt are more likely to buy whole milk.
- Customers buying sausage are more likely to buy rolls/buns.
- Customers buying root vegetables are more likely to buy other vegetables.
This way we can derive many rules and take business decisions for cross selling, upselling, promotional offers and store layout using Market Basket Analysis.
Working in R
In R, any functionality beyond the basic version is available in the form of packages. The algorithm we are going to use in R for Market Basket Analysis is Apriori.
Step 1: Download and install the arules and arulesViz packages. These provide Apriori and the corresponding plots.
Step 2: Load the Groceries dataset from the arules package, or load any retail dataset into R.
Code:
Step 3: Plot the frequency distribution of the Items.
Frequency distribution for the graph is as below.
Step 4: Apply Apriori algorithm.
For similar parameters, R's Apriori generated 32 rules; the significant rules and their parameter values are similar, though not exactly the same.
Step 5: Plot the rules/associations identified.
There are several plot types in R to represent the rules and the Apriori parameters; a few of them are 'grouped', 'matrix' and 'graph'.
The “graph” looks as below for the 32 rules –
If we confine the consequent (RHS) to "whole milk", the grouped plot and the graph look like below:
Working in Expert Analytics
We will now explore how this can be done using Expert Analytics, where the Apriori algorithm is called from R. The aim of this section is to demonstrate how well SAP has integrated R and its functionality; the visuals and ease of use are the advantages of this tool.
Step 1: Load the dataset
Step 2: Apply RApriori, in Expert Analytics.
The settings for the algorithm can be given as below –
We then run the model. The rules generated are displayed.
The Association chart in Expert Analytics graphically represents the associations as below.
The output is the same as in R; displaying the summary reproduces R's output.
Note that the data format for the datasets used in all three tools is similar; only Automated Analytics requires the additional dataset holding the transaction IDs.
Conclusion
- Visualization is better in SAP Predictive Analytics.
- SAP PA has a better UI.
- R is open source; however, the RStudio server installation has to be purchased.
- There are more options in R for performing Market Basket Analysis; however, SAP Predictive Analytics can integrate R's functionality through Expert Analytics.
- SAP Predictive Analytics enables two different types of models for a use case, through Automated and Expert Analytics, so we can use either of them as the business requirement and suitability dictate.
- In SAP PA Automated Analytics, the output and statistical reports produced during model preparation can be downloaded and distributed easily in PDF format.
References:
Image courtesy: https://pixabay.com/en/shoppingcartchartstoreshopper650046/
Thank you, interesting post. The explanation of the algorithm basics is really well done.
Thank you!
Nice (beginning) of analysis of the different options :)
I say beginning because of two things:
1/ Our preferred option in Automated is not to use the association rules section but the link analysis or social section (where there is a dedicated workflow for product recommendations).
2/ This is a functional analysis, but the war is in scalability. Automated has been used to generate millions of rules with several hundreds of thousands of possible items... That is where things get interesting :)
Cheers
Thank you Eric, yes we did try recommendations, however interpreting association rules was easier. We are yet to explore with larger datasets as there were performance constraints with the desktop version and also the data format for transactions/events dataset.
Sudeepti, Very good post. Well laid down content.
Thank you Hari
nicely explained!!
Thank you Saritha
Hi Sudeepti,
I know this thread is about 2 years old now, but do you still have the data sets used for this comparison? I would love to run these tests again using the same data set that you have. Any chance you could share them in .csv?
Thanks!