former_member522236
Summary

Association analysis is the process of finding interesting relationships in large datasets. Grocery stores use it for coupons, packaged deals, and deciding how items are displayed on shelves or placed together. Some common examples of data associations are:

 

“People who buy bread tend to buy butter or jam as well, because bread and butter is a common breakfast.”

“People who buy diapers tend to buy beer as well, because raising kids is a stressful job.”

 

Grocery stores already do a lot with association analysis, and they can do more. A number of algorithms are available, and there are some very good explanations on Git and SCN blogs (the references include a link that explains most association analysis algorithms very well, with code). I like the FP-Growth (frequent pattern growth) algorithm, and in this article I'll try to shed some light on it using the powerful HANA PAL (Predictive Analysis Library), along with some details of the Python code and steps.

FP-Growth is an algorithm that finds frequent patterns in transactions without generating candidate itemsets.


In PAL, the FP-Growth algorithm is extended to find association rules in three steps:




  1. Converts the transactions into a compressed frequent pattern tree (FP-Tree);

  2. Recursively finds frequent patterns from the FP-Tree;

  3. Generates association rules based on the frequent patterns found in Step 2.
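
To make step 1 more concrete, here is a minimal pure-Python sketch of building an FP-Tree. It is not PAL's implementation; the names `FPNode` and `build_fp_tree` and the toy baskets are my own assumptions for illustration. Infrequent items are dropped, the remaining items in each transaction are sorted by descending frequency, and transactions that share a prefix share a path in the tree.

```python
from collections import Counter


class FPNode:
    """One node of the FP-Tree: an item, its count, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    """Build an FP-Tree; return (root, counts of the frequent items)."""
    # Pass 1: count in how many transactions each item occurs.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    # Pass 2: insert each transaction, most frequent items first,
    # so that common prefixes are compressed into shared paths.
    root = FPNode(None)
    for t in transactions:
        items = sorted(
            (item for item in set(t) if item in frequent),
            key=lambda item: (-frequent[item], item),
        )
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


# Toy data (hypothetical, not the dataset used in this post).
baskets = [
    ["bread", "butter"],
    ["bread", "jam"],
    ["bread", "butter", "jam"],
    ["beer"],
]
tree, frequent_items = build_fp_tree(baskets, min_count=2)
```

With these four baskets, "beer" is dropped as infrequent and all three remaining transactions share the single "bread" node under the root. Steps 2 and 3 (mining the tree and generating rules) are left out to keep the sketch short.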


The intention here is to keep the complexity low so that it is easy to explain. Association analysis has other methods, such as Apriori, but I used only one to keep the focus and make it easier to understand.

 

Indicators

  1. Support

  2. Confidence

  3. Lift


Suppose we need to find the support, confidence, and lift for two products, A and B.

Support

Support of A to B = (transactions involving both A and B) / (total transactions).

A low support value means the items appear together in only a small share of all transactions.

Confidence

Confidence of A to B = (transactions involving both A and B) / (transactions involving A).

Lift

Lift measures how much more likely Product B is to be sold when Product A is sold, compared with how often B sells overall.

Lift = (confidence of A to B) / (fraction of transactions containing B)

A lift greater than 1 indicates a positive association between A and B; a lift of 1 means the two are independent.
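
The three indicators can be computed directly from a list of transactions. The sketch below is plain Python, independent of PAL; the function name `rule_metrics` and the sample baskets are assumptions made up for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, b = set(antecedent), set(consequent)
    baskets = [set(t) for t in transactions]
    n_total = len(baskets)
    n_a = sum(1 for t in baskets if a <= t)          # baskets containing A
    n_b = sum(1 for t in baskets if b <= t)          # baskets containing B
    n_ab = sum(1 for t in baskets if (a | b) <= t)   # baskets containing both
    if n_total == 0 or n_a == 0 or n_b == 0:
        raise ValueError("A and B must each occur in at least one transaction")
    support = n_ab / n_total
    confidence = n_ab / n_a
    lift = confidence / (n_b / n_total)
    return support, confidence, lift


# Hypothetical baskets, not the dataset used later in this post.
sample = [
    {"bread", "butter"},
    {"bread", "butter"},
    {"bread"},
    {"jam"},
]
support, confidence, lift = rule_metrics(sample, {"bread"}, {"butter"})
```

In this toy example bread and butter appear together in 2 of 4 baskets, bread in 3, and butter in 2, so the lift is above 1.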

 

Prerequisite

To try this hands-on, the software and environment below need to be available. I used HANA 2.0 with XSA, but this can be done without XSA as well; there is an option to host only the server, without XSA.

 

Environments/Software

HANA 2.0 with XSA Hosted on Google Cloud.

Python 3.7, Anaconda 1.9.12

Jupyter Notebooks 6.0.3

HANA ML 1.0.8

 

Details

To demonstrate this, I used the HANA machine learning library installed on Python 3.x with Anaconda. Any HANA database will do, whether it runs on your laptop or is hosted on Amazon or elsewhere.

To work with HANA ML, PAL needs to be enabled on SAP HANA. A code snippet with the details is available at the link below.

https://github.com/saphanaacademy/PAL/blob/master/Code%20Snippets/PAL%20146%20Getting%20Started%20wi...

This code also created the user “devuser” in my tenant database (HXE).


 

 

Installing the HANA Client for Python

I already had the HANA client installed, so I didn't install it again, but it can be installed easily with the command below.

pip install hdbcli

In my case I just used pip show to confirm it was there.
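
As an alternative to pip show, the same check can be done from inside Python. This is a small sketch rather than what I ran; it assumes the import names `hdbcli` and `hana_ml`, and the helper name `is_installed` is made up for illustration.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


# Report on the two packages this walkthrough relies on.
for name in ("hdbcli", "hana_ml"):
    print(name, "installed" if is_installed(name) else "missing")
```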

 


 

The following machine learning libraries are also installed on my machine:


 

 

Creating HANA Table to store data

 

I created a table named “FP_GROWTH_ASSOCIATION” under devuser, with only two fields: Transaction and Items.

 


 

Table Structure


 

Data for this exercise

Kaggle hosts a wide variety of open datasets. I used the “Random shopping cart” data, which can be found at the link below, though I used only two columns (Transaction and Items).

https://www.kaggle.com/acostasg/random-shopping-cart

 

I loaded this data, 16753 records, into the HANA table manually using the flat-file approach.

If you need details on how to load data into a HANA table, see the link below.

SAP HANA – Uploading data into Table from flat files

 


Since the algorithm requires data without null values or duplicates, I removed the duplicates; the null values are removed in a later step.

 

Data Glimpse

 

The data consists of shopping-cart transactions containing different grocery items. A glimpse of the data:


 

 

 

Back to Python

I imported all the libraries needed to support this algorithm and connected to the HANA database.


 

“hxehost” is the hostname of the HANA server

39015 is the port

“devuser” is the username

These details can be found on your SAP HANA database.

 

Connection command syntax

connection_context = dataframe.ConnectionContext(URL, PORT, UN, PWD)

 

I then checked that the connection to HANA was established.
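
Since the screenshots of this step are not shown here, the following is a minimal sketch of connecting and checking the connection, not my exact notebook cell. The wrapper name `connect_to_hana` is made up for illustration, and it assumes `hana_ml` is installed and that the context exposes the underlying hdbcli connection as `.connection`.

```python
def connect_to_hana(host, port, user, password):
    """Open a hana_ml connection and fail early if it is not usable."""
    # Imported inside the function so this file still loads where hana_ml is absent.
    from hana_ml import dataframe

    connection_context = dataframe.ConnectionContext(host, port, user, password)
    # .connection is assumed to be the underlying hdbcli connection object.
    if not connection_context.connection.isconnected():
        raise ConnectionError(f"Could not connect to {host}:{port} as {user}")
    return connection_context


# Example call with the values from this post (password read at run time):
# import getpass
# cc = connect_to_hana("hxehost", 39015, "devuser", getpass.getpass())
```

Reading the password at run time keeps it out of the notebook.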

 


 

 

Assigning Dataframe

Think of a DataFrame as a 2D table, like a spreadsheet or a simple table in Python, with columns of different types.
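
A HANA ML DataFrame can be pointed at the table created earlier. The sketch below assumes an open hana_ml connection context is passed in; the helper name `preview_table` and the schema name "DEVUSER" are assumptions, so adjust them to your own setup.

```python
def preview_table(connection_context, table="FP_GROWTH_ASSOCIATION",
                  schema="DEVUSER", rows=5):
    """Return a HANA DataFrame for the table plus a small local preview."""
    # table() only builds a reference; the data stays in HANA.
    hana_df = connection_context.table(table, schema=schema)
    # head(n).collect() fetches just the first n rows into pandas.
    preview = hana_df.head(rows).collect()
    return hana_df, preview
```

Only the preview rows are pulled to the client, which matters with larger tables.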


 

Dropping nulls

 

The algorithm requires that there are no null values, so I used the function below to remove nulls.

 


The describe command shows whether we have null values.
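
As a rough sketch of these two steps (not my exact cells), the helper below takes a HANA ML DataFrame, drops rows with nulls, and collects the summary statistics. The name `drop_nulls_and_describe` is made up for illustration; `dropna()`, `describe()` and `collect()` are the hana_ml DataFrame methods I'm relying on.

```python
def drop_nulls_and_describe(hana_df):
    """Drop rows containing nulls and return (cleaned DataFrame, summary)."""
    # dropna() with no arguments removes rows that have a null in any column.
    cleaned = hana_df.dropna()
    # describe() is evaluated in HANA; collect() brings the small result to pandas.
    summary = cleaned.describe().collect()
    return cleaned, summary
```

In the summary, the null counts per column should be zero after the drop, which is the check the algorithm needs.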

 


 

Importing the FP-Growth algorithm

 


 

Assigning parameter values for the algorithm

Details of each parameter are available at the help.sap.com link provided in the references section.
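
The import and the parameter assignment can be sketched as below. The thresholds are placeholder assumptions, not the values I used, and the wrapper name `mine_rules` is made up for illustration. The sketch assumes a hana_ml release where `FPGrowth` is constructed without a connection argument; I believe older releases (possibly including 1.0.8) expected the connection context as the first constructor argument, so check the signature in your installed version.

```python
def mine_rules(transactions_df, min_support=0.2, min_confidence=0.5, min_lift=1.0):
    """Run PAL FP-Growth on a two-column (transaction, item) HANA DataFrame."""
    # Imported inside the function so this file still loads where hana_ml is absent.
    from hana_ml.algorithms.pal.association import FPGrowth

    fpg = FPGrowth(
        min_support=min_support,        # placeholder threshold (assumption)
        min_confidence=min_confidence,  # placeholder threshold (assumption)
        min_lift=min_lift,              # placeholder threshold (assumption)
        relational=False,               # one combined result table
        max_conseq=1,                   # at most one item on the right-hand side
        max_len=5,                      # cap on total items in a rule
    )
    fpg.fit(data=transactions_df)
    # With relational=False the rules are exposed as result_.
    return fpg.result_.collect()
```

Raising `min_support` or `min_confidence` shortens the rule list; lowering them produces more, weaker rules.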


 

 

Result

 


 

 

How to read: consider the first line. It shows that if someone buys poultry, there is about a 78% chance they will buy vegetables too. The support is good, and the lift is above 1, which indicates a positive association between these items.

 

Math

I want to highlight the numbers below to show how the algorithm calculates these values, which helps in understanding how it finds associations.

There are 378 transactions involving both poultry and vegetables.

The total number of transactions in this dataset is 1140.

The number of transactions involving poultry is 480.

The number of transactions involving vegetables is 842.

The fraction of all transactions that involve vegetables is 842/1140 = 0.7386.

 

Support: 378/1140 = 0.33

Confidence: 378/480 = 0.7875

Lift: 0.7875/0.7386 = 1.066
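
The same arithmetic can be checked in a few lines of Python using only the four counts quoted above. The function name `metrics_from_counts` is made up for illustration.

```python
def metrics_from_counts(n_both, n_antecedent, n_consequent, n_total):
    """Return (support, confidence, lift) from raw transaction counts."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    lift = confidence / (n_consequent / n_total)
    return support, confidence, lift


# Poultry -> vegetables, using the counts from this post.
support, confidence, lift = metrics_from_counts(378, 480, 842, 1140)
print(f"support={support:.4f} confidence={confidence:.4f} lift={lift:.4f}")
# prints: support=0.3316 confidence=0.7875 lift=1.0662
```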

 

Note: This is just a demonstration of how HANA ML with Python can be leveraged for machine learning using PAL. The data is an open dataset, and I have not verified each transaction myself.

 

I have shown just one example of FP-Growth. One can keep filtering the data as needed to extract useful information, and even use the relational option by setting its value to “True”.

 

Please share feedback and suggestions.

Please excuse any spelling, grammatical, or typing mistakes.

Thank you

 

 

References:

https://help.sap.com/viewer/2cfbc5cf2bc14f028cfbe2a2bba60a50/1.0.12/en-US/9495128435164c2680f064b65f...

https://www.softwaretestinghelp.com/fp-growth-algorithm-data-mining/

https://www.kaggle.com/acostasg/random-shopping-cart

https://blogs.sap.com/2013/12/11/sap-hana-uploading-data-into-table-from-flat-files/

https://github.com/saphanaacademy/PAL/blob/master/Code%20Snippets/PAL%20146%20Getting%20Started%20wi...

https://github.com/SAP-samples/hana-ml-samples/blob/master/Python-API/pal/notebooks/Association_Anal...

 