How to Predict Customer Churn using Machine Learning? A Telecom Case Study
Table of Contents
1 Executive Summary
2 Introduction
3 Methodology
4 Results
5 Discussion
6 Conclusion
Appendixes
1 Executive Summary
We will build a predictive model for telecom customer churn using variables from the customer dataset. After trialling several classical classification algorithms, we chose Random Forest as the best-performing algorithm, with an AUC of 87% on the unseen test set.
2 Introduction
As consumers, we know how easy it is to switch from one phone contract to another. Furthermore, we see alternative mobile phone data providers and companies from other consumer-facing industries trying to entice competitors’ customers by offering better bundles or subsidised hardware. This threat of losing customers to competitors makes churn prediction a fundamental problem in the telecom industry. In this case study, our goal is to predict whether a customer will leave using machine learning techniques on a historical dataset.
This case study is a graded project in the second course of The University of Edinburgh Predictive Analytics using Python MicroMasters Program, PA4.2x_MM Successfully Evaluating Predictive Modelling.
3 Methodology
3.1 Analytic Approach and Data Requirements
The goal is to predict whether a customer will churn or not, which means we will take a predictive analytics approach. We will be using a Logistic Regression algorithm, among others, which accepts numerical data. We first need to normalise the data to conduct PCA and compare the variables’ relative importance.
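As a brief illustration, here is a minimal sketch of the normalisation step, assuming scikit-learn and a hypothetical numeric feature matrix X:

```python
from sklearn.preprocessing import StandardScaler

# Standardise each feature to zero mean and unit variance so that
# no variable dominates the PCA purely because of its scale.
X_std = StandardScaler().fit_transform(X)  # X: hypothetical numeric feature matrix
```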
3.2 Data
We start with a dataset of 5,000 observations in CSV format, with 19 independent variables and the dependent variable, as shown in Figure 3.1 Initial Dataset. The course authors have already done the hard part of collecting the raw data and transforming it into a single CSV file.
We can broadly categorise the variables into features describing the account characteristics, plan type, phone number, and customer behavioural or consumption patterns.[1]
For the subsequent Logistic Regression algorithm, we need to convert categorical variables such as Area_Code, International_Plan, and Voice_mail_Plan into a numerical format by creating dummy variables. Area_Code was in a “numeric” format but did not represent a “continuous” number, so we treat it as a categorical variable. We assume that there is no auto-correlation between the observations, so Phone_Number will not be particularly useful for predictions. This assumption is to simplify the case study: auto-correlation analysis and sequence modelling are beyond this project’s scope.
Likewise, we convert the target variable using a label encoder to bring it into a binary 0/1 format.
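A minimal sketch of these encoding steps, assuming pandas and scikit-learn; df is the loaded dataset, and “Churn” is an assumed name for the target column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Drop the phone number: unlikely to help under our independence assumption
df = df.drop(columns=["Phone_Number"])

# One-hot encode the categorical variables (Area_Code is numeric but not continuous)
df = pd.get_dummies(df, columns=["Area_Code", "International_Plan", "Voice_mail_Plan"])

# Encode the target into 0/1
y = LabelEncoder().fit_transform(df.pop("Churn"))
X = df
```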
3.3 Feature Selection
Having brought the features to the appropriate format, we can focus on feature selection techniques.
Looking at the correlation matrix in Figure 3.3 Correlation, we can see the following groups of variables that are highly correlated and potentially redundant:
- total day calls, minutes, charges
- evening calls, minutes, charges
- international calls, minutes, charges
- voicemail plan and voicemail messages
Some of these may not even be linearly independent; charges, for instance, are likely computed directly from minutes. The other variables are not correlated, as can be seen from the plot above.
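As a sketch of how such groups can be found programmatically, assuming the dataset is loaded into a pandas DataFrame df:

```python
import numpy as np

# Correlation matrix over the numeric columns
corr = df.select_dtypes("number").corr()

# Keep only the upper triangle so each pair appears once, then
# report pairs with an absolute correlation above 0.8
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs.abs() > 0.8].sort_values(ascending=False))
```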
Mutual Information and Chi-Squared
One way to shortlist features is to compute mutual information and chi-squared scores and remove the features that fall below the mean of both metrics, implying they are less informative.
This combined approach would leave us with the following nine features:
- Num_of_Voice_mail_Messages
- Total_Day_Minutes
- Total_day_Charge
- Total_Eve_Calls
- Total_Night_Calls_
- Total_Intl_Calls
- Number_Customer_Service_calls_
- Area_Code_415
- International_Plan_yes
Refer to Figure 3.4 Mutual Information by Feature and Figure 3.5 Chi-2 by Feature, which illustrate the relative scores for each of the variables.
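A sketch of this combined filter, assuming scikit-learn and the X, y produced during pre-processing; chi-squared requires non-negative inputs, so we scale to [0, 1] first:

```python
from sklearn.feature_selection import chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X_scaled = MinMaxScaler().fit_transform(X)  # chi2 needs non-negative values

mi = mutual_info_classif(X_scaled, y, random_state=42)
chi2_scores, _ = chi2(X_scaled, y)

# Keep only the features scoring above the mean on both metrics
keep = (mi > mi.mean()) & (chi2_scores > chi2_scores.mean())
print(list(X.columns[keep]))
```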
PCA
The other alternative was to look at variables that weight above ten per cent on the top five principal components (refer to Figure 3.6 Scree plot). We can see that the first principal component explains 31.6% of the variance, followed by 28.7% and 18.9% for the second and third components, respectively.
This would leave us with the following eight variables:
- Account_Length
- Total_Day_Calls
- Total_Day_Minutes
- Total_Eve_Calls
- Total_Eve_Minutes
- Total_Night_Calls_
- Total_Night_Minutes
- Total_day_Charge
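A sketch of this selection, assuming scikit-learn and a hypothetical numeric feature frame X_num; the ten per cent cut-off is read here as an absolute component weight above 0.10:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X_num)
pca = PCA().fit(X_std)

# Variance explained by the leading components (cf. the scree plot)
print(pca.explained_variance_ratio_[:3])

# Features whose absolute weight exceeds 0.10 on any of the top five components
weights = np.abs(pca.components_[:5])
selected = X_num.columns[(weights > 0.10).any(axis=0)]
print(list(selected))
```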
From Figure 3.7 Principal components, we can already see some themes emerging in the data associated with individual components: total call volume, evening and night calls, account length, and daily calls.
Appendix 1 Component details shows the relative weightings of the elements of the first five principal components.
Preliminary Data Exploration
Next, we can visually explore the data using principal component analysis to seek patterns in the data.
Figure 3.8 PC1 vs PC2 visualises the positive and negative examples against the top two principal components to observe patterns in the data. It appears that customers with low overall call volume and low night call volume tend to churn.
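A sketch of this visualisation, assuming matplotlib, the pca and X_std from the earlier PCA sketch, and y as the encoded target (1 = churn):

```python
import matplotlib.pyplot as plt

# Project the standardised data onto the first two components
pcs = pca.transform(X_std)[:, :2]

for label, colour, name in [(0, "tab:blue", "no churn"), (1, "tab:red", "churn")]:
    mask = (y == label)
    plt.scatter(pcs[mask, 0], pcs[mask, 1], s=8, c=colour, alpha=0.5, label=name)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
```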
Perhaps this is a customer segment representing those who do not use telecom services as heavily and could switch to a different provider. Could the company direct its marketing efforts to develop bundles and promotional campaigns to retain this low-usage segment?
Customers with around average overall volume also tend to churn. Could this be the segment that is the most targeted by standard offers by the competitors?
Customers with a high overall call volume but low night call volume tend to churn less frequently. Outlier customers with consumption outside the typical churn values tend to be more loyal as well. Could this be a result of incentives, or of competitors not being as aggressive with these customers, who do not fall under the standard telecom bundles?
From the above figure, we can see that callers with average call volumes and contract lengths tend to churn. There also seems to be a parabolic relationship between contract length and total volume for the churn class: customers with lower call volumes and longer contracts churn, as do those with higher volumes and shorter subscription durations. Could it be that people who have been on a contract for a long time prefer to switch? Or do we observe a high-usage, frequently-switching segment?
Appendix 2 PCA contains the table displaying each of the principal components along with the original features list.
Plotting churn outcomes against PC2 and PC3, we see that people with exceptional levels of evening calls and contract lengths are less likely to churn. Who are they? Is this a loyal segment worth exploring?
Instead of focussing on contributions to the top three principal components, we could have looked at the loadings, which would result in the following variables being most important:
- Total_Eve_Charge
- Total_Day_Calls
- Account_Length
- Total_Eve_Minutes
- Total_Eve_Calls
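A sketch of a loading-based ranking, under the common convention that loadings are the eigenvectors scaled by the square roots of the eigenvalues; pca and X_num are assumed from the earlier PCA sketch:

```python
import numpy as np
import pandas as pd

loadings = pd.DataFrame(
    pca.components_.T * np.sqrt(pca.explained_variance_),
    index=X_num.columns,
)

# Rank features by their largest absolute loading on the top three components
print(loadings.iloc[:, :3].abs().max(axis=1).sort_values(ascending=False).head())
```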
More generally, we can explore these relationships further through unsupervised clustering algorithms to identify different customer segments and understand their preferences. It would be worthwhile to look deeper into the underlying reasons for higher and lower churn connected to different usage patterns and contract types to understand these customer segments and reflect any findings in the product and marketing mix accordingly.
4 Results
Finally, we build a cross-validation pipeline that:
- normalises the data
- evaluates each of the models (Logistic Regression, Decision Tree Classifier, and Random Forest Classifier) with a Stratified Shuffle Split, using five folds for cross-validation
For each of the models, we measure the validation set accuracy, ROC AUC, and precision.
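A sketch of such a pipeline with scikit-learn, using the X, y from pre-processing; the random seeds and split size are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

for name, model in models.items():
    pipe = Pipeline([("scale", StandardScaler()), ("clf", model)])
    scores = cross_validate(pipe, X, y, cv=cv,
                            scoring=["accuracy", "roc_auc", "precision"])
    print(f"{name}: "
          f"accuracy={scores['test_accuracy'].mean():.3f}, "
          f"roc_auc={scores['test_roc_auc'].mean():.3f}, "
          f"precision={scores['test_precision'].mean():.3f}")
```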
Surprisingly, Logistic Regression performed relatively poorly on precision, i.e., at correctly predicting positive examples out of all examples it labelled as positive. The dataset is “skewed”, with only 16.4% of records corresponding to churn, so by overly emphasising negative examples a model can still score highly on the other two metrics. In other words, a model that labelled every customer as “no churn” would be right 83.6% of the time. Hence, the ROC AUC for Logistic Regression is the lowest among the three models, because ROC AUC measures effectiveness in terms of both True Positive and False Positive rates.
Random Forest Classifier had the highest results across the three metrics on the validation set, so we selected it to report its effectiveness on the unseen test set, the 30% of the data set aside earlier.
The result is encouraging. We have achieved the original objective with minimal hyperparameter tuning, i.e., building a model with 96% accuracy on an unseen test set.
Test set metrics:
- Accuracy: 0.9587
- Precision: 0.9341
- AUC: 0.8726
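A sketch of the hold-out evaluation that could produce metrics of this kind; the 70/30 stratified split and seed are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

best = Pipeline([("scale", StandardScaler()),
                 ("clf", RandomForestClassifier(random_state=42))])
best.fit(X_train, y_train)

pred = best.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print("Precision:", precision_score(y_test, pred))
print("AUC:", roc_auc_score(y_test, best.predict_proba(X_test)[:, 1]))
```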
5 Discussion
Based on PCA loadings, the critical features explaining the variance in the dataset are evening activity, daily calls, and account length. Using a threshold on PCA component weights, daily activity, night activity, and account length again emerge as essential. If we use a combination of the chi-squared and mutual information scores, the central themes are voicemail use, daily activity volumes, evening activity, and international calls. These features explain the variance in the independent variables, which is not the same as being strong predictors.
Being able to make a prediction is only part of the story. Through the PCA visualisations discussed in Section 3.3 Feature Selection, we saw a range of behaviour patterns associated with churn that may correspond to themes or even customer segments:
- customers with low overall call volume usage and low night call volume
- customers with average call volumes
- high overall volume and low night call volumes
- average call volume and contract length
- low call volumes and long contract lengths
- high volume and short contract duration
- evening calling for unusual contract lengths
- unusual evening call patterns
Here, we only visually examined these possible clusters, but the business can initiate a separate unsupervised learning analysis to segment the customers further. Using statistical methods to quantify the likelihood of churn between these groups can further augment this analysis. The telecom provider can also interview the domain experts to understand the underlying factors behind the behavioural differences.
Of the three algorithms we tried on the dataset, Random Forest turned out to be the most effective. The combination of metrics implies that the model is effective at predicting both classes in this imbalanced dataset. Effectively predicting both classes means the organisation would not waste time retaining customers where discretionary effort is unlikely to change the churn outcome. Likewise, it allows the company to pre-emptively influence the customers likely to churn through offers, tailored communications, and a product mix that increases the likelihood of customers staying.
6 Conclusion
We set out to build a model to predict churn for a telecom service customer. After completing the pre-processing steps, we built a model representative of the dataset by reshuffling the data and using multiple folds at the cross-validation stage. Along the way, we also looked at feature importance in explaining the variance in the dataset using chi-squared, mutual information scores, and PCA. We have seen that the themes of overall phone usage, evening behaviour, and contract length are particularly important.
Further clustering analysis to dig deeper, split the customers into segments, and test the statistical significance of differences between the groups can help us grasp the real-world story behind the numbers. The prediction is robust, but the business may also want to change the odds in its favour by understanding the factors that drive the behaviour and altering them to reduce churn. Think of the analogy of only using a thermometer versus having a thermostat to reduce home energy consumption. Accurate churn predictions can help the telecom industry in business decisions, including planning, new service development, marketing, and competitive tactics.
To conclude, we have seen an example of how machine learning can predict human behaviour. Even knowing how these algorithms work under the hood, there is still something uncanny about how precisely machine learning can predict complex human behaviour. If we ask an average person whether they would switch their mobile contract, apart from those who have already made up their minds to switch, few people would even know the answer. The answer would change based on their mood, perhaps the ads they have seen, and other factors. Nevertheless, the ultimate decision resulting in the action to switch is easy to predict given enough data. Arguably, in some areas, machine learning can give better insights into one’s behaviour and decisions than what the consumer can explicitly state in a focus group questionnaire.
Future directions
Hyperparameter tuning for the algorithms was out of scope. Future work could tune the models or explore alternative, potentially more powerful algorithms such as neural networks or XGBoost.
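As an illustration, a minimal grid search for the Random Forest with scikit-learn; the grid values are arbitrary assumptions, and a real search would cover wider ranges:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300],
                "max_depth": [None, 10, 20],
                "min_samples_leaf": [1, 5]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train, y_train)  # training split from the earlier evaluation sketch
print(grid.best_params_, grid.best_score_)
```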
The other direction could be to improve the quality of the data we use to learn the parameters. We could try resampling and oversampling techniques to deal with the imbalance between positive and negative examples in the dataset.
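A sketch using imbalanced-learn’s RandomOverSampler, which duplicates minority-class rows until the classes balance; resampling is applied to the training split only, never to the test set:

```python
import pandas as pd
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)

# The churn class is now as frequent as the majority class
print(pd.Series(y_res).value_counts())
```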
We also do not know directly from the dataset whether the customer owns the phone hardware, whether the telecom provider subsidises it as part of a plan with monthly instalments, whether the provider offers it as a service, or what kind of device it is. Many competitive deals lure customers with the prospect of upgrading their mobile device or tablet at a lower cost in exchange for a longer-term contract. Furthermore, entrants from related industries such as financial services, supermarket chains, and broadband providers offer bundle deals including TV, broadband internet, landline, and mobile as part of the same contract, and customer stickiness depends on the other services in the bundle. Since the dataset tells us nothing about hardware deals or what other services the customer uses or needs, adding such data could make the forecast even more accurate. Data usage is also absent from the dataset, and even though we still refer to smartphones as phones, these are portable computers: voice usage, which is what we have in our dataset, may represent only a small percentage of overall daily phone use, and data usage may be a strong predictor of churn.
In a separate post, I want to describe a simplified pipeline approach to solving this case study that also addresses the class imbalance through Random Oversampling.
Appendixes
Appendix 1 Component details
Appendix 2 PCA
References
[1] EdinburghX, PA4.2x_MM Successfully Evaluating Predictive Modelling, course video, Predictive Analytics using Python MicroMasters Program.