HANA Machine Learning (ML): Association Analysis with the Frequent Pattern (FP) Growth Algorithm Using Python
Association analysis is the process of finding interesting relationships in large datasets. Grocery stores use it to decide which coupons to issue, which items to bundle into packaged deals, and how items are displayed on shelves or placed together. Some common examples of data associations are:
“People who buy bread tend to buy butter or jam as well, because bread and butter normally go together at breakfast.”
“People who buy diapers tend to buy beer as well, because raising kids is a stressful job.”
Grocery stores already do a lot with association analysis, and can do more. A number of algorithms are available, and there are some very good explanations on Git and in SCN blogs (a link in the references gives a very good explanation of most association analysis algorithms, with code). I like the FP (Frequent Pattern) Growth algorithm, so in this article I’ll try to shed some light on it using the powerful HANA PAL (Predictive Analysis Library), along with some details of the Python code and steps.
FP-Growth is an algorithm to find frequent patterns from transactions without generating a candidate itemset.
In PAL, the FP-Growth algorithm is extended to find association rules in three steps:
- Converts the transactions into a compressed frequent pattern tree (FP-Tree);
- Recursively finds frequent patterns from the FP-Tree;
- Generates association rules based on the frequent patterns found in Step 2.
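The first two steps can be sketched in plain Python. This is a teaching sketch only, not PAL’s implementation: the names `Node`, `build_tree` and `mine` and the sample baskets are made up for illustration, and rule generation (step 3) is left out to keep it short.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, its count, and links to parent and children."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_tree(weighted_transactions, min_count):
    """Step 1: compress (items, weight) pairs into an FP-tree."""
    counts = defaultdict(int)
    for items, weight in weighted_transactions:
        for item in set(items):
            counts[item] += weight
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    root = Node(None, None)
    header = defaultdict(list)  # item -> every tree node that holds this item
    for items, weight in weighted_transactions:
        # Keep frequent items only, most frequent first (ties broken by name).
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += weight
            node = child
    return header, frequent


def mine(weighted_transactions, min_count, suffix=frozenset()):
    """Step 2: recursively collect {frozenset(items): count} from the tree."""
    header, frequent = build_tree(weighted_transactions, min_count)
    patterns = {}
    for item, nodes in header.items():
        pattern = suffix | {item}
        patterns[pattern] = frequent[item]
        # Conditional pattern base: the path above each node holding `item`.
        conditional = []
        for node in nodes:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        patterns.update(mine(conditional, min_count, pattern))
    return patterns


# Made-up baskets, for illustration only.
baskets = [
    ["bread", "butter"],
    ["bread", "butter", "jam"],
    ["bread", "jam"],
    ["beer", "diapers"],
    ["bread", "butter", "beer"],
]
patterns = mine([(basket, 1) for basket in baskets], min_count=2)
for items, count in sorted(patterns.items(), key=lambda p: (-p[1], sorted(p[0]))):
    print(sorted(items), count)
```

No candidate itemsets are enumerated here: every pattern is read off the tree, or off a smaller conditional tree built from the paths above an item.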
The intention here is to keep complexity low so that the approach is easy to explain. Association analysis offers other methods, such as Apriori, but I used only one to keep the focus and make it easier to understand.
Suppose we need to find the support, confidence and lift for two products, A and B.
Support of Product A to B = (Transactions involving both Product A and Product B) / (Total transactions).
A low support value means that the item combination appears in only a small share of all transactions.
Confidence of Product A to B = (Transactions involving both Product A and Product B) / (Total transactions involving Product A).
Lift measures how much more likely Product B is to be sold when Product A is sold, compared with how often Product B is sold overall.
Lift = (Confidence of A to B) / (Fraction of transactions containing Product B)
A lift greater than 1 indicates a positive association between A and B: they appear together more often than they would if they were bought independently.
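The three measures can be sketched in a few lines of plain Python. This is only an illustration on a made-up list of five baskets; the helper `rule_metrics` and the item names are hypothetical and not part of HANA PAL.

```python
def rule_metrics(transactions, a, b):
    """Return (support, confidence, lift) for the rule a -> b."""
    total = len(transactions)
    with_a = sum(1 for t in transactions if a in t)
    with_b = sum(1 for t in transactions if b in t)
    with_both = sum(1 for t in transactions if a in t and b in t)

    support = with_both / total           # share of all baskets holding A and B
    confidence = with_both / with_a       # share of A-baskets that also hold B
    lift = confidence / (with_b / total)  # confidence relative to B's overall share
    return support, confidence, lift


# Made-up baskets, for illustration only.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "jam"},
    {"beer", "diapers"},
    {"bread", "butter", "beer"},
]

support, confidence, lift = rule_metrics(baskets, "bread", "butter")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Here bread and butter appear together in 3 of 5 baskets, bread appears in 4 and butter in 3, so the lift comes out above 1.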
To try this hands-on, the following software/environment needs to be available. I used HANA 2.0 with XSA, but this can be done without XSA as well, since there is an option to host the server only.
HANA 2.0 with XSA Hosted on Google Cloud.
Python 3.7, Anaconda 1.9.12
Jupyter Notebooks 6.0.3
HANA ML 1.0.8
To demonstrate this I used the HANA machine learning libraries installed on Python 3.x with Anaconda. Any HANA database will do, whether it runs on your laptop or is hosted on Amazon or another cloud.
To work with HANA ML, PAL needs to be enabled on SAP HANA. A code snippet with the details is available at the link below.
This code also created the user “devuser” under my tenant database (HXE).
Installing the HANA client for Python
I already had the HANA client installed, so I didn’t install it again, but it can be installed easily with the command below.
pip install hdbcli
In my case I just used pip show to confirm the installation.
The following machine learning libraries are also installed on my machine:
Creating a HANA table to store the data
I created a table named “FP_GROWTH_ASSOCIATION” under Devuser, with only two fields: Transaction and Items.
Data for this exercise
Kaggle hosts a wide variety of open datasets. I used the “Random shopping cart” data, which can be found below, though I used only two columns (Transaction and Items).
I loaded this data, 16753 records, into the HANA table manually using the flat-file approach.
If you need details on how to load data into a HANA table, see the link below.
The algorithm requires data without null values or duplicates. I removed the duplicates; the null values are removed in a later step.
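As a rough picture of what this clean-up amounts to, here is a small plain-Python sketch. The helper `clean_rows` and the sample rows are hypothetical; in the article itself the duplicates were removed before loading and the nulls are dropped later with HANA ML.

```python
def clean_rows(rows):
    """Drop rows with a missing value and repeated (transaction, item) pairs."""
    seen = set()
    cleaned = []
    for transaction, item in rows:
        if transaction is None or item is None:
            continue  # the algorithm cannot work with null values
        if (transaction, item) in seen:
            continue  # the same item listed twice in one transaction
        seen.add((transaction, item))
        cleaned.append((transaction, item))
    return cleaned


# Hypothetical rows in (transaction, item) form.
rows = [(1, "bread"), (1, "bread"), (1, "butter"), (2, None), (2, "beer")]
cleaned = clean_rows(rows)
print(cleaned)
```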
The data consists of shopping-cart transactions containing different grocery items. A glimpse of the data:
Coming to Python again
I imported all the libraries needed to support this algorithm and connected to the HANA database.
hxehost is the hostname of the HANA server
39015 is the port
devuser is the username
These details can be found on your SAP HANA Database.
Connection command syntax
connection_context = dataframe.ConnectionContext(URL, PORT, UN, PWD)
I then checked that the connection to HANA was established.
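A minimal sketch of the connection step, assuming the hana_ml package is installed and the host, port and user above are valid. The helper names `connection_settings` and `connect` are my own, the upper-case user name is an assumption, and the password is a placeholder.

```python
def connection_settings(address="hxehost", port=39015, user="DEVUSER", password=""):
    """Collect the keyword arguments that ConnectionContext expects."""
    return {"address": address, "port": port, "user": user, "password": password}


def connect(settings):
    """Open a hana_ml connection and confirm that it is alive."""
    # Imported here so the sketch can be loaded without hana_ml installed.
    from hana_ml import dataframe

    connection_context = dataframe.ConnectionContext(**settings)
    # `connection` is the underlying hdbcli connection object.
    if not connection_context.connection.isconnected():
        raise RuntimeError("Could not connect to SAP HANA")
    return connection_context


settings = connection_settings(password="<your password>")
# connection_context = connect(settings)  # uncomment with real credentials
```

Keeping the settings in one dictionary makes it easy to swap in another host or tenant without touching the rest of the notebook.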
Think of a DataFrame as a 2D table, like a spreadsheet or a simple table in Python, with columns of different types.
The algorithm requires that there be no null values, so I used the function below to remove them.
The describe command shows whether we have null values.
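One way this could look with the hana_ml DataFrame API is sketched below. It assumes an open connection context and that the table lives in the DEVUSER schema; the helper names are hypothetical, and the exact columns returned by describe may differ between hana_ml versions.

```python
def load_without_nulls(connection_context, table="FP_GROWTH_ASSOCIATION", schema="DEVUSER"):
    """Return the table as a hana_ml DataFrame with rows containing nulls dropped."""
    df = connection_context.table(table, schema=schema)
    return df.dropna()


def column_summary(df):
    """Fetch describe() as a pandas DataFrame to inspect per-column statistics."""
    return df.describe().collect()


# Usage, assuming `connection_context` is an open hana_ml ConnectionContext:
# df = load_without_nulls(connection_context)
# print(column_summary(df))
```

Both calls are translated into SQL by hana_ml, so the data stays in HANA until collect() pulls the small summary into Python.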
Importing the FP-Growth algorithm
Assigning parameter values for the algorithm
Details of each parameter are available at the help.sap.com link provided in the reference section.
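A hedged sketch of these two steps follows. The threshold values are hypothetical, since the values used for the run described here are not shown, and the column names TRANSACTION and ITEMS are assumptions based on the table definition. The call follows the hana_ml 1.0.x style, where the connection context is, to my knowledge, the first constructor argument; newer releases drop it.

```python
def fpgrowth_parameters():
    """Hypothetical thresholds, chosen only to illustrate the parameter names."""
    return {
        "min_support": 0.1,     # keep patterns present in at least 10% of transactions
        "min_confidence": 0.5,  # keep rules that hold in at least half of the cases
        "relational": False,    # False returns the rules as a single result table
        "min_lift": 1.0,        # keep rules at or above the independence baseline
        "max_conseq": 1,        # at most one item on the right-hand side
    }


def run_fpgrowth(connection_context, df, parameters,
                 transaction="TRANSACTION", item="ITEMS"):
    """Fit PAL FP-Growth on a hana_ml DataFrame and return the rules as pandas."""
    # Imported here so the sketch can be loaded without hana_ml installed.
    from hana_ml.algorithms.pal.association import FPGrowth

    # Assumption: hana_ml 1.0.x signature; omit the first argument on newer releases.
    fpg = FPGrowth(connection_context, **parameters)
    fpg.fit(df, transaction=transaction, item=item)
    return fpg.result_.collect()


parameters = fpgrowth_parameters()
# rules = run_fpgrowth(connection_context, df, parameters)
```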
How to read: consider the first line. It shows that if someone buys poultry, there is a 78% chance that he/she will buy vegetables too. The support of the rule is good, and the lift is above 1, which indicates a positive association between these items.
I want to highlight the numbers below to show how the algorithm calculates these values, which helps in understanding how it finds associations.
There are 378 transactions involving both poultry and vegetables.
The total number of transactions in this dataset is 1140.
Total transactions involving poultry are 480.
Total transactions involving vegetables are 842.
The fraction of all transactions that involve vegetables is 842/1140 ≈ 0.7386.
Support: 378/1140 = 0.33
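Confidence and lift follow from the same counts. The short check below uses only the four numbers quoted above and applies the formulas given earlier in the article; the variable names are my own.

```python
both = 378         # transactions with both poultry and vegetables
total = 1140       # all transactions in the dataset
poultry = 480      # transactions containing poultry
vegetables = 842   # transactions containing vegetables

support = both / total
confidence = both / poultry
lift = confidence / (vegetables / total)

print(f"support={support:.4f} confidence={confidence:.4f} lift={lift:.4f}")
```

The confidence of 378/480 = 0.7875 is in line with the roughly 78% chance read from the first line of the result, and the lift is slightly above 1.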
Note: This is just for demonstration purposes, to show how HANA ML with Python can be leveraged for machine learning using PAL. The data is an open dataset and I’ve not verified each transaction myself.
I have shown just one example of FP-Growth; one can keep filtering the data to extract useful information, and even use the relational option by setting its value to “True”.
Please share feedback/suggestions.
Please excuse any spelling, grammatical or typographical mistakes.