Recently I got a chance to build a small telecom demo in SAP PA (Predictive Analysis). Getting suitable data was the biggest initial challenge. However, I found a university site that hosted some data sets; I requested one and they obliged.
The data set I used came from the Duke/NCR Teradata 2003 Tournament (quite old, I know, but it served the purpose for a demo).
The data was solicited from a major wireless telecom to provide customer-level data for an international modeling competition. It was suitable for churn modeling and prediction. Data were provided for 100,000 customers with at least 6 months of service history. For the demo, however, I used a sample of only 3400 records.
Historical information was provided in the form of:
- Type and price of current handset
- Total revenue (expenditure)
- Call behavior statistics (type, number, duration, totals, etc.)
- Demographic and geographical information
Below is a snapshot of the model that I built in SAP PA. I will not talk about the process of building the models but will focus on the results. If you want to know more about building the model, post your question in the comments section.
The key outputs of the different analyses that I ran on the data set, and their interpretations, are shown below.
1. Clustering
Customers were grouped into 5 clusters based on Total Monthly Calls, Night Billing, Day Billing, Total Revenue and Number of cars with the customer.
It was observed that 4 of the clusters were high-density clusters. Cluster 5 was the densest, with 1629 of the 3400 customers falling in it. On analysis, it was found that the customers in this cluster had high income and were generating higher revenue. Cluster 5 consisted of high-income customers generating average revenue (possible campaign customers), while cluster 4 had customers with low income and very high call usage (possible fraudsters). The tree map gives the state-wise breakup for the same.
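The post does not say which clustering algorithm SAP PA used, so the following is only a minimal sketch that assumes plain k-means with k = 5 on standardized features. The customer records are synthetic, and the segment centres and the helper names (`standardize`, `kmeans`) are made up for illustration; they are not values from the tournament data.

```python
import random
from collections import Counter

def standardize(rows):
    """Scale every column to zero mean and unit variance so no feature dominates."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [tuple((x - m) / s for x, m, s in zip(r, means, stds)) for r in rows]

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])))

def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen customers
    labels = []
    for _ in range(iterations):
        # Assignment step: attach every customer to its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster where it was
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return centroids, labels

# Synthetic customers (assumption, not the real data). Columns:
# total monthly calls, night billing, day billing, total revenue, number of cars.
rng = random.Random(42)
centres = [(100, 5, 20, 30, 0), (300, 10, 45, 60, 1), (600, 25, 40, 90, 1),
           (900, 15, 90, 140, 2), (1500, 40, 120, 220, 3)]
customers = [tuple(rng.gauss(mu, 0.05 * mu + 1) for mu in centre)
             for centre in centres for _ in range(40)]

centroids, labels = kmeans(standardize(customers), k=5)
sizes = Counter(labels)
print(sorted(sizes.items()))  # number of customers per cluster
```

Cluster sizes like these are what the density comparison above is based on; profiling each cluster (income, revenue, usage) is a separate step on top of the labels.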
2. Decision Tree
A decision tree using the R-CNR tree algorithm was created to study the existing churn in the telecom dataset.
The chart represents the chances of churn based on several factors such as Day charge, Evening charge, Net usage and Handset price. This type of chart is called a decision tree, a standard tool for classification in data mining (DM) systems. The R-CNR Tree method was used for the analysis in this scenario. The tree is generated top-down: the starting point (the root) contains all records of the training set, which is then divided into two or more sub-nodes (child nodes) using rules defined on the variables.
By analyzing the diagram we get various profiles of customers who have a high probability of churn.
For example, customers with Day Usage charges (for a fortnight) of more than $44 are more likely to churn. Similarly, people living in small and medium households had a higher churn probability than those living in luxury apartments. Customers with high-priced handsets churned 60% more than those with low-priced handsets, suggesting that they were interested only in the handsets.
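To make the idea of a split rule such as "Day Usage charge > $44" concrete, here is a minimal sketch of how a classification tree can choose a threshold on one numeric variable. It assumes Gini impurity as the split criterion, which the post does not state, and the eight records are toy values constructed only to echo the $44 example, not data from the demo.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 churn flags."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n           # share of churners in the node
    return 2 * p * (1 - p)        # equals 1 - p^2 - (1 - p)^2 for two classes

def best_split(values, labels):
    """Try every midpoint between distinct sorted values; return (threshold, impurity)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue              # no threshold can separate identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Impurity of the two child nodes, weighted by their sizes.
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Toy records (assumption): fortnightly day charge in dollars, and churn flag.
day_charge = [20, 25, 30, 38, 42, 46, 50, 55]
churned    = [0,  0,  0,  0,  0,  1,  1,  1]

threshold, impurity = best_split(day_charge, churned)
print(threshold, impurity)  # 44.0 0.0
```

A full tree repeats this search over every variable at each node and recurses into the resulting child nodes, which is the top-down generation described above.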
3. Churn Prediction using Neural Network
The neural network algorithm is a machine learning algorithm that predicts a dependent variable from several independent variables and the historical association between them. In our case we try to predict whether a future customer will churn, based on the historical analysis of customers who have churned. The number of hidden layer neurons was 5 and the number of iterations was set to 1000. The results were pretty good, with a variance of less than 15% when compared to the original values. The graph below shows the predicted churn for each of the states in the USA.
A quick glance shows that New Mexico has the highest percentage of predicted churn, followed by Wyoming and California. The lowest churn is predicted in Alabama, at around 14%. As we can see, the customer's lifestyle and work location are certainly important factors.
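For readers who want to see what a network of this shape looks like, below is a minimal sketch with one hidden layer of 5 neurons trained for 1000 iterations, the two settings mentioned above. Everything else is an assumption: SAP PA's actual implementation is not described in this post, so the sigmoid activations, the learning rate, the seed and the two toy features (scaled day charge and handset price) are illustrative choices only.

```python
import math
import random

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to keep math.exp from overflowing
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, hidden=5, iterations=1000, lr=0.5, seed=0):
    """Train a 1-hidden-layer network with stochastic gradient descent."""
    rng = random.Random(seed)
    n_in = len(X[0])
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(W1, b1)]
        out = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
        return h, out

    for _ in range(iterations):
        for x, target in zip(X, y):
            h, out = forward(x)
            d_out = out - target  # cross-entropy gradient at the output neuron
            for j in range(hidden):
                # Back-propagate through hidden neuron j (uses W2[j] before its update).
                d_h = d_out * W2[j] * h[j] * (1 - h[j])
                W2[j] -= lr * d_out * h[j]
                b1[j] -= lr * d_h
                for i in range(n_in):
                    W1[j][i] -= lr * d_h * x[i]
            b2 -= lr * d_out

    return lambda x: forward(x)[1]  # predicted churn probability

# Toy customers (assumption): (day charge, handset price), both scaled to 0..1.
X = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.4), (0.2, 0.3),
     (0.8, 0.9), (0.9, 0.7), (0.7, 0.8), (0.9, 0.9)]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = churned

predict = train(X, y)
for x in [(0.15, 0.2), (0.85, 0.8)]:
    print(x, "churn" if predict(x) > 0.5 else "stay")
```

Averaging the predicted churn flags by state would give a per-state percentage of the kind shown in the graph.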
4. Outlier Analysis
The outlier algorithm is mainly used to identify errors in a dataset, to detect fraud, or to spot outperformers. The idea here was to find the outliers in each region based on the various parameters of the customers in the dataset.
229 customers were found to be outliers. The region-wise breakup of outliers is shown below, with the property type of the customer fixed to City (customers residing in cities only).
The mean monthly usage (total calls including net) of 3333 customers is 21428 minutes. Based on that, the upper fence was set at 62967 minutes, and 229 customers were using the service above that level. Hence these 229 customers could be either fraudulent or valuable. Further detailed analysis of these 229 customers can help us decide how to tag them.
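The post does not say how the upper fence was derived, so the sketch below assumes the common inter-quartile-range rule (upper fence = Q3 + 1.5 x IQR). The usage figures are toy values, not the demo data, and `upper_fence` is a hypothetical helper name.

```python
import statistics

def upper_fence(values, k=1.5):
    """Upper fence under the IQR rule: Q3 + k * (Q3 - Q1)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    return q3 + k * (q3 - q1)

# Toy monthly usage in minutes (assumption); one customer is far above the rest.
usage = [12000, 15000, 18000, 20000, 21000, 22000, 25000, 27000, 30000, 95000]

fence = upper_fence(usage)
outliers = [u for u in usage if u > fence]
print(outliers)  # [95000]
```

Customers above the fence are only flagged, not classified: whether they are fraudulent or simply high-value still needs the follow-up analysis mentioned above.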