Comparing Boosting Techniques for Customer Churn Prediction: AdaBoost Vs. XGBoost in Python
- Luke Theivagt
- Jan 10, 2025
- 11 min read

Customer churn, the rate at which customers stop using a product or service, is a critical metric for businesses to monitor and minimize. After all, marketing efforts are far more impactful when they turn new customers into loyal supporters who generate recurring revenue. But what factors are most likely to influence a customer’s decision to leave? And what strategies can businesses use to prevent churn and foster long-term loyalty?
“It costs 5 times more to acquire a new customer than it does to keep an existing one” - Forrester
Each business is as unique as its customers, and identifying the specific variables associated with customer churn requires precise handling.
This is where machine learning comes in.
Machine learning models can be trained to identify which customers are most likely to leave, enabling businesses to focus their efforts on retaining these individuals through targeted interventions like coupons, discounts, personalized outreach, or follow-up calls. While investing resources to retain a customer can be costly, it often proves invaluable when directed at high-risk customers who might otherwise churn.
This article will explore how boosting classification models can effectively predict customer churn and guide you in choosing the best model for your specific needs.
Contents
Understanding the Problem and EDA
An Introduction to Decision Trees
Boosting Techniques Overview
Programming the Model
Step 1: Build Function to Measure Model Performance
Step 2: Split the Data
Step 3: Balance the Data
Step 4: Build the Model
Step 5: Tuning the Model
Conclusion
Python Libraries Used in Building the Model
Understanding The Problem
Customer Churn Example
Customer churn, or attrition, is a critical challenge for businesses across industries, especially in sectors like banking, telecommunications, and subscription services. In this case study, we'll focus on a bank that is dealing with an increasing number of credit card cancellations. Understanding why customers leave can help the bank proactively address their needs, improve retention, and ultimately boost long-term profitability.

The target variable for this model is the "attrition flag," which indicates whether a customer has cancelled their credit card (attrited) or is still an active customer (existing). This binary classification problem will allow us to use machine learning models to predict which customers are most likely to cancel their cards, enabling the bank to focus retention efforts on those individuals.
Understanding the other data columns and key factors contributing to customer churn is essential for this process, as it provides insight into the behavior of both existing and attrited customers. The goal is to identify trends and patterns that could be used to develop targeted retention strategies.
Exploratory Data Analysis (EDA)
Before building any machine learning models, it's important to conduct exploratory data analysis (EDA) to understand the underlying patterns in the data. EDA helps us uncover important relationships, detect outliers, and spot potential issues that could affect the model’s performance. For this case study, we’ll visualize key variables to gain insights into customer behavior and churn.

Above, we see that age follows an approximately normal, bell-shaped distribution centered around 45 years old. Customers have various levels of education, and the gender split is roughly even, with slightly more women. It is important to note that there are far more existing customers than attrited customers, which may cause some issues later on.
Next, let's look at specific differences between existing and attrited customers.

The histograms (left) reveal that existing customers tend to have higher total transaction amounts than attrited customers, most of whom spend less than $10,000. However, despite this large difference, the ‘Average Transaction Amount’ plot (bottom right) shows that the average cost per transaction is nearly identical between the two customer types.
Analysis also reveals that customers with fewer products and more frequent business contacts are more likely to churn, highlighting key factors for targeted retention strategies.
An Introduction To Decision Trees
How does a machine learning model classify a customer as existing or attrited?
Basic classification decision trees:
Identify groupings of variables based on their categories.
Create ‘rules’, or mathematical conditions, that determine which combinations of variables are most likely to result in a specific classification.
These rules can be visualized as a tree, with branches representing decision paths and the final classification (e.g., existing or attrited customer) shown at the bottom of each branch.

Example: Classifying spam emails with a decision tree. Imagine a decision tree designed to classify whether an email is spam. The first rule might check if the email's title is written in all capital letters. If the title isn’t in all caps, the algorithm moves on to evaluate other factors. However, if it is in all caps, the tree follows a separate branch to assess additional criteria that help determine the email’s authenticity.
Balancing rules and complexity in decision trees. How do you decide how many rules to include and how specific they should be? If the tree has too many rules, it can become overly complex and overfit to the “noise” or randomness in the training dataset, making it ineffective at predicting future data. On the other hand, too few rules will limit the tree's ability to accurately classify or predict outcomes.
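To make this concrete, here is a small hypothetical sketch (not from the original notebook) showing how the max_depth parameter of scikit-learn's DecisionTreeClassifier caps how many rules the tree can stack; X and y are assumed to be an already-prepared feature matrix and churn labels.

```python
from sklearn.tree import DecisionTreeClassifier

# A shallow tree can only stack three levels of rules, limiting complexity;
# an unconstrained tree keeps splitting until its leaves are pure, which
# often means memorizing noise in the training data.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
deep_tree = DecisionTreeClassifier(max_depth=None, random_state=42)

shallow_tree.fit(X, y)  # X, y: assumed pre-prepared features and churn labels
deep_tree.fit(X, y)
print(shallow_tree.get_depth(), deep_tree.get_depth())
```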
Enter boosting: A smarter way to build stronger models. What if there were a way to combine multiple weak models to fine-tune a single, powerful model? That’s where boosting comes into play.
Boosting Techniques Overview
Boosting: Turning weak models into a strong predictor. Boosting is a process where multiple weak models are combined consecutively to create a strong, accurate model. It begins with the creation of a weak model that performs slightly better than random guessing (e.g., predicting with 60% accuracy instead of a 50% split). The dataset is then adjusted based on the predictions of this first model, typically giving more weight to incorrectly predicted data points. A new weak model is trained on this altered dataset, and the cycle continues, with each model building on the shortcomings of the previous one. This iterative process continues until an optimal set of rules is established.
By continuously fine-tuning the weight, or "importance," of each variable, boosting creates a much stronger predictive model. However, the specifics of how boosting adjusts weights and builds models can vary depending on the approach used.
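For intuition only, here is a simplified, hypothetical sketch of that reweighting loop in the style of discrete AdaBoost; real libraries implement this far more carefully, and the variable names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def toy_boosting(X, y, n_rounds=5):
    """Simplified illustration: each round trains a weak 'stump' on reweighted
    data, then increases the weight of the samples it misclassified."""
    n = len(y)
    weights = np.full(n, 1 / n)                # start with equal sample weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        miss = pred != y
        err = np.clip(weights[miss].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # how much say this learner gets
        weights *= np.exp(alpha * miss)        # up-weight the mistakes
        weights /= weights.sum()               # renormalize
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas
```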
What are some of the common types of boosting?
AdaBoost
How it works:
Sequentially re-weights the training samples so that each new weak learner focuses on the instances the previous one misclassified.
Pros:
Ideal for datasets with low noise.
Only a few hyperparameters need adjustment, making it beginner-friendly and well-suited for smaller datasets.
Cons:
Prone to overfitting when dealing with noisy datasets, as the algorithm can focus too much on outliers or extreme cases.
Slower than other boosting methods, making it less efficient for larger datasets.
XGBoost
How it works:
Optimizes performance with advanced parameters, making it ideal for larger datasets.
A faster implementation of gradient boosting, designed for efficiency and performance.
Pros:
Significantly faster than AdaBoost, capable of processing larger datasets quickly.
Additional parameters help reduce overfitting, making the model more robust.
Cons:
The increased number of parameters makes fine-tuning more complex.
The process is more difficult to understand, especially for beginners, due to its complexity.
Let’s Program and Choose the Best Boosting Technique for You!
To determine the right boosting technique, you’ll need to evaluate and fine-tune your model’s performance. But first, let’s build a function to measure its effectiveness.
Step 1: Build Function to Measure Model Performance
Fine-tuning a model is impossible without a way to test its performance. Choosing the right metric is critical for interpreting how well the model meets your specific needs. Fortunately, there are several metrics to choose from, including accuracy, precision, recall, and the F1 score. Each metric serves a unique purpose and is suited to different scenarios.
Accuracy
Definition: The proportion of correctly predicted samples out of the total samples.
When to Use: Best for balanced datasets where both classes (e.g., churned and existing customers) are equally represented. For instance, if 50% of customers churn and 50% stay, accuracy offers a clear measure of performance.
Limitations: Accuracy can be misleading in imbalanced datasets. For example, in a dataset with 90% existing customers, a model that predicts all customers will stay would achieve 90% accuracy but fail to identify churners effectively.
Precision
Definition: The proportion of true positives (correctly predicted churners) out of all predicted positives.
When to Use: Precision is crucial when the cost of false positives is high. For example, predicting a customer will churn when they won’t could lead to unnecessary discounts or outreach efforts, wasting resources.
Recall
Definition: The proportion of true positives out of all actual positives (e.g., correctly predicting churners out of all churners in the dataset).
When to Use: Recall is essential when the cost of false negatives is high. Missing a true churner could result in a lost customer, so recall is key if your goal is to retain as many customers as possible.
F1 Score
Definition: The harmonic mean of precision and recall, balancing the two.
When to Use: The F1 score is ideal when there’s a trade-off between precision and recall. For example, if both reducing false positives and false negatives are important, the F1 score provides a single, balanced metric to optimize.
Choosing the Right Metric
Balanced Datasets: Use accuracy.
Costly False Positives: Prioritize precision.
Costly False Negatives: Focus on recall.
Balancing Both: Optimize for the F1 score.
By selecting the right metric, you’ll be equipped to interpret your model’s performance and determine which boosting technique best suits your needs!
For this example, we will focus on building a model that performs best on recall, as our goal is to minimize false negatives—situations where the model predicts a customer will stay, but they ultimately leave. However, we will also evaluate the model using all relevant metrics to gain a more comprehensive understanding of its performance.
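The original code is shown as a screenshot in the published article; a minimal sketch of such a helper, assuming scikit-learn's metric functions and a 0/1-encoded target (1 = attrited), might look like this:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def model_performance(model, X_train, y_train, X_valid, y_valid):
    """Return a DataFrame with one row of metrics for the training set
    and one row for the validation set."""
    rows = []
    for name, X, y in [("Train", X_train, y_train), ("Validation", X_valid, y_valid)]:
        pred = model.predict(X)
        rows.append({
            "Set": name,
            "Accuracy": accuracy_score(y, pred),
            "Recall": recall_score(y, pred),
            "Precision": precision_score(y, pred),
            "F1": f1_score(y, pred),
        })
    return pd.DataFrame(rows)
```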

We will use this function definition to produce a data frame that provides the performance of the model based on each metric.
Step 2: Split the Data
To build a model that can accurately predict future customer behavior, we must ensure it generalizes well to new, unseen data, not just the data it was trained on. If we fine-tune the model using the entire dataset, it risks overfitting to the “noise” or random fluctuations in the data. To avoid this, we split the data into training, validation, and testing sets, so the model learns from one subset, tunes its hyperparameters on another, and is finally evaluated on data it has never seen.
The data will be divided into three groups:
Training Set (50% of data): This subset is used to initially train the model, allowing it to learn from the patterns in the data.
Validation Set (30% of data): This set helps to adjust the model, guiding decisions on hyperparameters based on its performance.
Testing Set (20% of data): This subset is kept aside for the final evaluation, testing how well the model performs on completely unseen data.

Code explanation:
The code first splits the data into independent (X) and dependent (Y) variables. The independent data (X) includes the features used to predict the dependent data (Y), which represents the target variable. The second and third sections of the code handle the splitting of the data into training, validation, and testing sets, ensuring the model has distinct data to learn from, tune on, and evaluate.
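A sketch of this splitting step, using scikit-learn's train_test_split twice to get the 50/30/20 proportions; the target column name "Attrition_Flag" is an assumption and may differ from the original dataset.

```python
from sklearn.model_selection import train_test_split

# Separate predictors (X) from the target (y); the column name is illustrative.
X = df.drop(columns=["Attrition_Flag"])
y = df["Attrition_Flag"]

# First split: 50% training, 50% temporary holdout.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)

# Second split: divide the holdout so validation is 30% and testing is 20% of the original data.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.4, random_state=42, stratify=y_temp
)
```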
Step 3: Balance the Data
Effective data balancing is critical for building a robust learning model. Often, datasets are imbalanced, meaning one category has significantly more instances than the other. For example, in this case, around 84% of the data represents existing customers, while only 16% represents customers who have churned. This imbalance can skew model predictions and increase errors, as the model may be biased toward the majority class.
To address this, we can balance the data in two ways:
Undersampling: This method reduces the number of instances in the majority class, making the classes more evenly distributed.
Oversampling: This approach increases the number of instances in the minority class by duplicating some of the data points.
In this example, undersampling was used. However, you can experiment with both techniques and evaluate which works best for your specific dataset using the testing outlined later.
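One way to undersample (a sketch, assuming the imbalanced-learn package; the original code may use a different approach) is RandomUnderSampler:

```python
from imblearn.under_sampling import RandomUnderSampler

# Undersample ONLY the training set; validation and test data keep their original distribution.
rus = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)

print(y_train.value_counts())        # imbalanced original training labels
print(y_train_under.value_counts())  # balanced classes after undersampling
```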


Notice how we only alter the training data because the goal is to keep the testing and validation data as close to the original data as possible. This ensures we get an honest assessment of the model's performance when it encounters real, unseen data.
Step 4: Build the Model
Now, we get to the exciting part: building the model! It's easier than you might think.
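A sketch of this step (names are illustrative, and the target is assumed to be encoded as 0 = existing, 1 = attrited):

```python
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.metrics import recall_score

# Candidate models; more can be appended to compare them in the same way.
models = []
models.append(("AdaBoost", AdaBoostClassifier(random_state=42)))
models.append(("XGBoost", XGBClassifier(random_state=42)))

for name, model in models:
    model.fit(X_train_under, y_train_under)   # fit on the balanced training data
    valid_pred = model.predict(X_valid)       # predict on the untouched validation set
    print(f"{name} validation recall: {recall_score(y_valid, valid_pred):.2f}")
```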

First, we append AdaBoost and XGBoost to our models list, but feel free to add other models to test their performance. Next, we use a for-loop to fit each model to the undersampled training data. We evaluate the model performance using recall_score(), as recall was identified as the most important metric for this example, and test it against the separated validation data.
AdaBoost correctly identified 93% of the churned customers in the validation set, while XGBoost performed slightly better at 95%. At this point, the models perform well, and we could stop here with a solid solution. However, let's see if we can fine-tune them for even better performance.
Step 5: Tuning the Model
Parameters and hyperparameters play a crucial role in adjusting the model’s performance to fit the specific data requirements. However, with so many possible settings, it can be difficult to determine the best combination. Fortunately, we can use a randomized search to test multiple combinations and identify the optimal set of hyperparameters based on our chosen performance metric.
For AdaBoost, key hyperparameters include:
n_estimators: The number of boosting stages (trees).
learning_rate: Controls how much each new model corrects the previous one.
base_estimator: The weak learner trained at each boosting round (typically a shallow decision tree whose max_depth controls complexity); recent scikit-learn versions name this parameter estimator.
For more information on tuning these hyperparameters, visit the scikit-learn AdaBoostClassifier documentation.
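A hypothetical sketch of such a randomized search; the search space and names are illustrative rather than the article's exact settings.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space; older scikit-learn versions use "base_estimator" instead of "estimator".
param_dist = {
    "n_estimators": [50, 100, 200, 400],
    "learning_rate": [0.01, 0.1, 0.5, 1.0],
    "estimator": [DecisionTreeClassifier(max_depth=d) for d in (1, 2, 3)],
}

ada_search = RandomizedSearchCV(
    AdaBoostClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    scoring="recall",   # optimize for the metric we care about most
    cv=5,
    random_state=42,
)
ada_search.fit(X_train_under, y_train_under)
print(ada_search.best_params_)
tuned_ada = ada_search.best_estimator_   # model refit with the best settings found
```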

After running the search on the model, we determine the best settings for each parameter. Then all we have to do is create a new model with the settings provided.

Lastly, we evaluate the performance of the newly tuned model. While we can look at all the metrics, the key metric we’re focused on for this project is recall.
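Using the hypothetical model_performance helper and tuned_ada model from the sketches above, the evaluation could look like:

```python
# Compare the tuned AdaBoost on the training and validation sets.
print(model_performance(tuned_ada, X_train_under, y_train_under, X_valid, y_valid))
```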

The first row shows the model’s performance on the training data, while the second row reflects how it performed on the validation data. Compared with the 93% recall of the untuned model, tuning has increased recall by about 1.4 percentage points. Additionally, the model’s performance on other metrics is as follows:
Accuracy: 94%
Precision: 76%
F1 Score: 84%
Now that we've fine-tuned AdaBoost, let's move on to tuning XGBoost.
Tuning XGBoost follows a similar process but can be a bit more complex due to the larger number of parameters involved. To keep things manageable, we’ll focus on tuning a few key parameters:
n_estimators: The number of boosting rounds (trees).
scale_pos_weight: Adjusts for class imbalance.
learning_rate: Controls the contribution of each new model.
gamma: Regularization term that prevents overfitting.
subsample: The proportion of training data used for each boosting round.
For more detailed information on XGBoost parameters, refer to the XGBoost parameter documentation.
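A sketch of the corresponding search for XGBoost, again with an illustrative search space:

```python
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative grid covering the parameters listed above.
xgb_param_dist = {
    "n_estimators": [100, 200, 400],
    "scale_pos_weight": [1, 3, 5],
    "learning_rate": [0.01, 0.1, 0.3],
    "gamma": [0, 1, 5],
    "subsample": [0.6, 0.8, 1.0],
}

xgb_search = RandomizedSearchCV(
    XGBClassifier(random_state=42),
    param_distributions=xgb_param_dist,
    n_iter=20,
    scoring="recall",
    cv=5,
    random_state=42,
)
xgb_search.fit(X_train_under, y_train_under)
print(xgb_search.best_params_)
```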

Now that we know the best parameters for XGBoost, let's build and test the model.
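Continuing the sketch, the tuned model can be rebuilt from the best parameters and evaluated with the same hypothetical helper:

```python
# Rebuild XGBoost with the best parameters found, then compare train vs. validation metrics.
tuned_xgb = XGBClassifier(**xgb_search.best_params_, random_state=42)
tuned_xgb.fit(X_train_under, y_train_under)
print(model_performance(tuned_xgb, X_train_under, y_train_under, X_valid, y_valid))
```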


Looking at the performance on the validation data in the second row, the recall score has improved after tuning, rising from 95% to 97%. However, accuracy is at 81%, precision at 46%, and F1 score at 63%. While XGBoost outperforms AdaBoost in recall, all other metrics (accuracy, precision, and F1 score) are better with AdaBoost.
Conclusion
By leveraging AdaBoost and XGBoost, we identified high-risk customers with over 95% recall, equipping the bank to target these individuals with retention strategies. While AdaBoost is simpler and suitable for smaller datasets, XGBoost’s speed and robustness make it ideal for larger, more complex data.
AdaBoost demonstrated strong, balanced performance, excelling in accuracy, precision, and F1 score. On the other hand, XGBoost outperformed in recall, making it particularly effective for minimizing false negatives and identifying customers who are at risk of leaving.
Each model has its strengths and limitations. Your choice will depend on your specific goals. In this case study, the tuned XGBoost performed best at identifying customers who are likely to leave, even if it also targeted some customers who weren’t planning to cancel. AdaBoost’s more balanced metrics, however, provide a broader view of customer behavior, offering insights across multiple dimensions.
Next steps for real-world implementation include testing these models on live data to validate their effectiveness and adapt to changing customer behavior. Future enhancements could involve integrating additional features such as customer sentiment analysis from reviews or social media interactions. Additionally, exploring other advanced techniques like CatBoost or LightGBM could further optimize model performance. By continuously improving these models, businesses can not only reduce churn but also build loyalty and strengthen customer relationships.
Python Libraries
Below are the libraries used throughout this project. Some were utilized in the preprocessing and exploratory data analysis, even though they may not have been referenced in the provided code.
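The original article lists these imports as a screenshot; a likely set (an educated guess, not the author's exact list) is:

```python
import numpy as np                 # numerical operations
import pandas as pd                # data loading and manipulation
import matplotlib.pyplot as plt    # plotting for EDA
import seaborn as sns              # statistical visualizations
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.under_sampling import RandomUnderSampler   # assumption: used for undersampling
from xgboost import XGBClassifier
```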


