- Data Preparation and Data Modeling
- Application of Cluster Algorithm using R in SAP Analytics Cloud
I am Harshavardhan Kadam, Data visualization analyst, I am writing this article on CLUSTERING in SAP Analytics Cloud , when I was build a Viz on SAP Analytics Cloud I wanted to create a cluster for customer based on RFM model, Unlike other Viz tool SAP Analytics Cloud gives us an upper hand in building cluster using R. This article I have written, How to create a cluster using K-Means(Details on KMeans explained in Cluster Section)
To Implement the complete Customer Clustering, We have used data from E-Commerce portal, The data consist of Customer ID, Invoice No, Country, Item No, Item Description, Invoice Date, Quantity ordered, Unit price and order status.
We will be pre-processing the data, then feed the data into the model and then build a cluster and RFM model on preprocessed data. To build a cluster model we must have a basic R programming knowledge.
Data Preparation and Data Modeling
The data need to be analyzed before feeding into the model. Below are few of the steps to be considered
- Check for Null values/Missing values.
- Check for Date format.
- Make sure our data do not have ZERO or negative value for ordered quantity and Unit price (Note: For any other domain we need to be sure that measure related to price must not be Zero or Negative).
- Check for any uncommon name for format in our data
Example: For few Item in our data had a description as Post,Bank Charges,Manual etc.which is not considered.
Note: These cleaning and preprocessing activity need to be performed at backend before feeding data to SAP Analytics Cloud model .Although SAP Analytics Cloud can perform these cleaning very efficiently but this might slow your data load in model as unnecessary data will be loaded that we might need to be delete at later stage.
Note: Data can be preprocessed using any ETL technique or using R
When we have performed the preprocessing activity the data can be loaded to the Analytical model in SAP Analytics Cloud
- Login to SAP Analytics Cloud and create a model.
- Select the Option to import the data. SAP Analytics Cloud provided two option
- Import using file.
- Connect to Data Source.
- Once Data is loaded perform data modeling
- Check if Dimension and measure are projected Correctly.
Example: Customer ID sometimes shown as measure, convert to Dimension.
- Check date format Convert all the date in one single format.
- Extract a day of the Week, Month, and Year from date.
Example to Extract Year ,like wise we can create Month and Day
- Created a required calculated column using Calculated Funcation (fx)
Example: Line Total= Quantity*Unit price
- Create a Geo location column on Country column.
- Once all the modeling is done, validate the data and build model.
Now when model is ready, we must start building our Story.
Clustering in SAP Analytics Cloud using R
Clustering Definition: The technique grouping of homogeneous objects which are similarity between them and dissimilarity to others group is called clustering. This is the Unsupervised model.
Types of Clustering
- K-Means Hierarchical
Clustering is performed using Euclidian Distance.
In this blog I have created a cluster using K-Means
- Less impacted by Outliers
- Faster Computation
- Reading output visual is better when we have a huge data.
Building a Cluster in SAP Analytics Cloud
- To build a Cluster in SAP Analytics Cloud lets us create a story and select a R visualization widget.
- Select the required column in building the Click on input date and select Dimension and the measure on which we need to build cluster.
For my model I used Customer, Country dimension, Date, Invoice No, New LineTotal,Description
- Now Click on Edit Script and start build the code in Editor
- Create a Recency, Frequency and Monetary columns
CustDataFrame<- CustDateFrame %>%
The above code groups customers based on the Recency of visit, Frequency of Purchase and Amount of Spend. This is my base table on which I will do clustering. I created these new columns as K-Means clustering use distance , the records with close distance are grouped together.
- Now we will start our clustering process, before we start we must load library .
- Scale the date
- Calculate the count of required cluster ( you must use subject knowledge to in deciding the cluster count)
- Build the cluster
- Visualize the cluster
- Profile the cluster
ScaledCluster <-scale(clusterData[,3:5]) #scale data
Nc<-NbClust(clusterData[,-c(1)],min.nc=2,max.nc=4,method=”kmeans”) # cluster count
plt<-ggplot(data=clusterData, aes(FrequencyofVisit,Monetory,color=factor(Cluster),legend =”Cluster”,label =CustomerID))+
geom_text() # plot a Graph
tt2 <- ttheme_default(
panel.border = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
core=list(bg_params = list(fill = c(“#fafafa”, “white”), lwd=2, lty=1, col=”#cecece”),
fg_params=list(hjust=1, x=0.95, cex=1.25)),
colhead=list(bg_params = list(fill = “#4682b4”, col=NA, lwd=5),
fg_params=list(col = “white”, fontface=”bold”, cex=1.25)),
rowhead=list(fg_params=list(hjust=0, x=0, cex=1.25, fontface=”plain”)))
tg<-tableGrob(agg, theme = tt2) # Plot a Aggerate table
Note: Above code is written as per the data structure. Sample data screen below
- Below is the Visualization, where I have divided a data in 5 cluster based on their Frequency, Recency and Monetary.
Note: We might see lot of overlap in few cluster, That is a complete other discussion which will be covered separately.
The below table shows the cluster profile based on average Spending, Frequency and Recency.
Conclusion:- Based on above table and Chart we can profile our Cluster and segment our customers
Cluster Profiling is done based on Avg Recency. Frequency and Monetory.
Cluster 1 (Monthly Customer) :- Groups Customers who shops ,24 days on Avg, with Normal spending .
Cluster 2 ( Window Shopper ) :- Groups Customers who Shops on an Avg of 50 days ,but not frequently and not a good Spender
Cluster 3 (Regular Customer):- Groups Customers who shops very frequently.
Cluster 4 (Lost Customer):- Groups Customer who has not visited for long
Cluster 5 (Loyal Customer):- Groups Customer who are High Spender, frequent visitor and Recent buyer.
Using Clustering we can segment our customer into various profiles and help us in identifying the customer who need attention and where our new plans and campaign must be rolled out
Similarly we can cluster Product too ,Below is the Visualization example.
Data Visualization and Analytics