
Custom R Components – Classification with the Naive Bayes Algorithm

The Naive Bayes algorithm is one (of many) methods of classification. For instance, you may want to derive from a past marketing campaign which prospects you should focus on in your next marketing activity. The algorithm can identify patterns among the contacts that have already purchased a certain product (i.e. their age, gender, income, etc.). You can then use this information for your next campaign and focus on the people who are most likely to be interested, so you spend your marketing budget where it is most effective.

SAP Predictive Analysis can use the Naive Bayes algorithm thanks to the ability to create Custom R Components. Within such a component, an expert user can encapsulate an R script in an end-user-friendly format. With thousands of different methods available in R, this concept is extremely powerful. This article explains how to implement and use Naive Bayes.


Let’s try the Naive Bayes algorithm on some data from the real world. The UC Irvine Machine Learning Repository kindly hosts a dataset with information taken from the 1994 US Census. The file, called Adult, contains anonymous information from over 32,000 people, listing their age, education, marital status and much more, including whether the person was earning over 50,000 US dollars in the year 1994. We will use this information to create a model that we can apply to future data to determine whether a person is likely to earn more or less than these 50,000 USD.

You can follow the steps below if you download the above dataset. Before getting started, you may just have to add a first row with column names.
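Outside of SAP Predictive Analysis, the same preparation can be sketched in plain R. The file name and the column names below are assumptions based on the dataset's documentation, so adjust them to your local copy:

```r
# Sketch: load the UCI Adult data and attach column names
# (file name and column names are assumptions based on the
# dataset's documentation - adjust them to your local copy)
load_adult <- function(path) {
  adult <- read.csv(path, header = FALSE, strip.white = TRUE)
  colnames(adult) <- c("Age", "Workclass", "Fnlwgt", "Education",
                       "EducationNum", "MaritalStatus", "Occupation",
                       "Relationship", "Race", "Sex", "CapitalGain",
                       "CapitalLoss", "HoursPerWeek", "NativeCountry",
                       "Income")
  adult
}

# e.g.: adult <- load_adult("adult.data")
```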

Just load your data into SAP Predictive Analysis. You see some of the available columns. The ‘Income’ field on the right-hand side tells us whether the person was above or below the 50k threshold in that year. This column is called ‘TargetVariable’ in the screenshots below.


Now add the Naive Bayes Classifier component to your model. Further below you will find the details on how to add this logic to your own SAP Predictive Analysis installation.


Configure the component. You need to tell the component

– the Classifier Column: Income

– and the Predictor Columns: here you can pick Age, Occupation and HoursPerWeek to start.
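Under the hood the component presumably boils down to a call to naiveBayes() from the e1071 package with just such a formula. A minimal, self-contained sketch of that configuration; the data frame here is a toy stand-in for the Adult data, not the real values:

```r
library(e1071)  # provides naiveBayes()

# Toy stand-in for the Adult data (illustrative values only)
adult <- data.frame(
  Age          = c(25, 47, 52, 23, 38, 61),
  Occupation   = factor(c("Sales", "Exec", "Exec", "Sales", "Tech", "Exec")),
  HoursPerWeek = c(35, 50, 45, 20, 40, 60),
  Income       = factor(c("<=50K", ">50K", ">50K", "<=50K", "<=50K", ">50K"))
)

# Classifier column: Income; predictor columns: Age, Occupation, HoursPerWeek
model <- naiveBayes(Income ~ Age + Occupation + HoursPerWeek, data = adult)
predict(model, adult)
```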


Run the model. Then go to the charts area. The table shows how many records were correctly and incorrectly classified. 24,263 people were correctly classified as earning less than 50,000 USD. 556 people were correctly classified as high earners.
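That summary is a confusion matrix. With illustrative label vectors (not the real campaign numbers) it can be reproduced in base R:

```r
# Illustrative predicted/actual labels (not the real results)
actual    <- factor(c("<=50K", "<=50K", ">50K", ">50K", "<=50K"))
predicted <- factor(c("<=50K", ">50K",  ">50K", "<=50K", "<=50K"))

confusion <- table(Predicted = predicted, Actual = actual)
confusion

# the diagonal holds the correctly classified records
accuracy <- sum(diag(confusion)) / length(actual)
```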


You can also save the trained model to test it further on data that is already classified. Or you can apply the model to new data for which the classification is actually unknown.
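In a standalone R session, the equivalent save-and-reapply round trip might look like this; iris is used as a stand-in dataset and the file name is just an example:

```r
library(e1071)

# Train on classified data, save the model, reload it later and
# apply it to records whose class is unknown
model <- naiveBayes(Species ~ ., data = iris)
path  <- file.path(tempdir(), "naive_bayes_model.rds")
saveRDS(model, path)

reloaded <- readRDS(path)
new_data <- iris[1:5, 1:4]   # pretend these are unclassified records
predict(reloaded, new_data)
```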

R Libraries

Please make sure you have the R libraries e1071 and gplots installed. The following document explains how to make new libraries available in SAP Predictive Analysis:
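In a plain R session, installing the two packages from CRAN is a one-off step; a guarded variant that only installs what is missing:

```r
# Install the packages the component depends on, if they are missing
for (pkg in c("e1071", "gplots")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}
```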

You may want to read the documentation of the Naive Bayes algorithm at:

How to Implement

The component can be downloaded as a .spar file from GitHub. Then deploy it as described here. You just need to import it through the option “Import/Model Component”, which you will find by clicking on the plus sign at the bottom of the list of available algorithms.


Please note that this component is not an official release by SAP and that it is provided as-is without any guarantee or support. Please test the component to ensure it works for your purposes.

  • Hi Andreas,

    Thanks for sharing!

    I always enjoy reading your cases.

    In your description you take 70% to train your model and 30% to test it.

    Have you seen any rule of thumb around for this specific number?

    Furthermore, in order to reduce bias in the data, how would you ensure that your data is picked randomly for both samples and not reused? If this were sales transactions, they could be presorted and carry a "timestamp" bias.

    Looking at PA, the options are: First N, Last N, Every N, Simple Random or Systematic Random. However, how do we make sure that data is not reused in training, testing or validation, and how do we deal with any presorting in the data?

    By the way, it would be nice to have a function to automatically control the process flow for training, testing and validation, right?

    Just a heads-up: the link to The UC Irvine Machine Learning Repository is a bit off.

    It currently points to "http://http//" instead of

    Thanks again.

    Best regards,

    Kurt Holst

    • Hi Kurt,
      Nice to hear you like my articles!

      Often two thirds of the data are used for training a model and the remaining third is used for testing. To keep it simple I did a straight 70% / 30% split and found that worked quite nicely on this data.

      Just as you say, you want to avoid having the same record in both the training and testing dataset. Here I used "First 70%" and "Last 30%" to achieve that. However, this requires the data to be randomly sorted. If this is not the case, like in your example, then a little custom R Script could do the trick to randomly separate/flag the records.
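      A minimal base-R sketch of such a random flag; the 70% ratio, the seed and the record count are just examples:

```r
set.seed(42)                        # reproducible shuffle
n          <- 1000                  # number of records, for example
train_rows <- sample(seq_len(n), size = floor(0.7 * n))
train_flag <- seq_len(n) %in% train_rows   # TRUE = training, FALSE = testing
```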

      Oh and thanks for the heads up on the broken link! It's fixed now.

  • Hi Andreas,

    One small question: does this algorithm work only when the target is numerical? I have one column named Priority with 3 values: LOW, MEDIUM and HIGH. When I applied the algorithm to it, an error appeared.

    • Hello Ranajay, I have just tried it out on iris and it predicts the 3 nominal target values without error. Can you please test with iris as well? If this works, please post your error message.
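      For reference, the iris check can be reproduced in a few lines of R, assuming e1071 is installed:

```r
library(e1071)

model <- naiveBayes(Species ~ ., data = iris)
pred  <- predict(model, iris)
levels(pred)                # the three nominal target values
mean(pred == iris$Species)  # training accuracy
```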

      • Hey Andreas

        I tested with iris and it worked fine. But the iris dataset has measures, on which the algorithm runs smoothly. My data has dimensions only: PRIORITY along with CREATION DATE over a year, plus some additional fields like who raised the incident, etc.

        If I apply the HANA-based Naive Bayes algorithm it works on this data, but if I apply this extension an error comes up.

        The error message screenshot is below:


        • Custom R extensions currently have a limitation with dates. If you remove the date column, it might work.

          Here is the comment from the release notes

          "You cannot use date columns as strings in the data that is passed to the custom R component. Therefore, we recommended to filter the date column from the dataset or use the function in R script."
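          A small sketch of both workarounds in R; the column names are examples:

```r
# Example data with a date column that the component cannot handle
df <- data.frame(
  CreationDate = as.Date(c("2014-01-10", "2014-03-05", "2014-06-20")),
  Priority     = factor(c("LOW", "HIGH", "MEDIUM"))
)

# Option 1: derive a numeric feature from the date ...
df$DaysSince <- as.numeric(Sys.Date() - df$CreationDate)

# Option 2: ... and drop the raw date column before passing the data on
df$CreationDate <- NULL
```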

          • I don't know your use case, but a date as input for a classification seems unusual. Often dates are used to describe activity in relation to a timestamp, e.g. "number of days since last contact with the customer". Is that something that would make sense for your case?

            Maybe you are aware of the Data Manager in Automated Mode that helps create such variables based on dates, amongst other things?

          • No, actually in my case incident priorities for different departments were generated date by date over the past year, and based on these I need to predict the priority pattern for future months.

  • Hi Andreas,

    Thanks for sharing .

    I have an issue with the model: it is throwing an error at the eval function.

    Please find the screenshot below.

    Could you please tell me what the issue is?

    Thanks & Regards,


    • Hello Ramana,

      Thank you for bringing this up. I haven't seen that message before, but I was able to reproduce it on my system. The component itself is still working nicely for me though.

      I will try to find out more and will let you know.



      • Hi again Ramana,

        I have checked with the product group and the warning you see should not appear. The custom R code is just fine. The incorrect warning will be removed in a future release.

        If the primary function and scoring function use the same name for the dataset, the (incorrect) warning is not shown. Currently the names are different (mydata and mynewdata).

        Many Greetings


  • Hi Andreas,

    Just a small suggestion: the caret package supports cross-validation, which makes better use of the data for both training and testing.
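    For illustration, the pattern that caret's trainControl(method = "cv") automates can be written by hand with e1071 on iris; the fold count and seed are arbitrary:

```r
library(e1071)

set.seed(1)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(iris)))  # random fold labels

# Train on k-1 folds, test on the held-out fold, repeat for every fold
acc <- sapply(1:k, function(i) {
  train <- iris[folds != i, ]
  test  <- iris[folds == i, ]
  model <- naiveBayes(Species ~ ., data = train)
  mean(predict(model, test) == test$Species)
})
mean(acc)   # average accuracy across the folds
```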





    • Hello Eser, Thank you for the comment.

      I will try to find some time to add this optionally with a parameter in the configuration.

      Many Greetings


  • Hi,

    Just for my own curiosity, what are the differences between this algorithm and the one that is built into SAP HANA in the PAL library?

    Thanks & regards