DEV Community

Shanii

Predicting Customer Lifetime Value (CLV) with Machine Learning

Customer Lifetime Value (CLV) is a crucial metric for businesses, representing the total revenue a business can reasonably expect from a single customer account throughout their relationship. By understanding and predicting CLV, companies can make informed decisions about marketing, sales, and customer service strategies. In this blog post, we'll walk through a practical example of how to predict CLV using machine learning, leveraging historical transactional data.

We'll cover:

Data Preprocessing and Cleaning
Feature Engineering using RFM (Recency, Frequency, Monetary) and other metrics
Splitting data into past and future windows
Training various regression models (Linear, Random Forest, XGBoost)
Evaluating model performance
Segmenting customers based on predicted CLV

Step 1: Data Preprocessing and Cleaning

Our dataset, the Online Retail dataset from the UCI Machine Learning Repository, consists of transactional data across two years (2009-2010 and 2010-2011). The first step is to load this data, combine it, and perform essential cleaning to ensure data quality for our analysis. This includes handling missing values, removing invalid entries, and creating a TotalAmount column for each transaction.

After combining the datasets, we inspect for missing values, especially in critical columns like Customer ID. We'll also clean the data by:

Converting InvoiceDate to datetime objects.
Removing rows with missing Customer ID.
Filtering out cancellation invoices (those whose Invoice starts with 'C').
Excluding transactions with zero or negative Quantity.
Calculating TotalAmount as Price * Quantity.
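As a minimal, runnable sketch of this cleaning step (using a toy DataFrame in place of the actual UCI files; the column names follow the Online Retail II schema):

```python
import pandas as pd

# Toy sample mimicking the Online Retail II columns; in practice you would
# load and concatenate the 2009-2010 and 2010-2011 sheets instead.
df = pd.DataFrame({
    "Invoice": ["536365", "C536366", "536367", "536368"],
    "Customer ID": [17850.0, 17850.0, None, 13047.0],
    "Quantity": [6, -6, 8, 0],
    "Price": [2.55, 2.55, 3.39, 4.25],
    "InvoiceDate": ["2010-12-01 08:26", "2010-12-01 08:28",
                    "2010-12-01 08:34", "2010-12-01 08:35"],
})

df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])    # parse timestamps
df = df.dropna(subset=["Customer ID"])                   # drop unknown customers
df = df[~df["Invoice"].astype(str).str.startswith("C")]  # drop cancellations
df = df[df["Quantity"] > 0]                              # drop zero/negative qty
df["TotalAmount"] = df["Price"] * df["Quantity"]         # revenue per line item
```

After these filters, only the first toy row survives: the cancellation, the missing Customer ID, and the zero-quantity rows are all removed.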

Step 2: Feature Engineering - The Past Window

To predict future CLV, we need to extract meaningful features from historical data. This involves splitting our dataset into 'Past' and 'Future' windows based on a cutoff_day. The 'Past' window will be used to generate features, and the 'Future' window will provide the actual CLV for training. We'll focus on traditional RFM (Recency, Frequency, Monetary) metrics, along with Average Gap, Tenure, and Average Order Value.

First, we define a cutoff_day, set three months before the latest transaction date in our dataset.
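The cutoff and feature computation can be sketched as follows (a toy transactions frame stands in for the cleaned dataset; the helper rfm_features and its exact column names are illustrative, not the notebook's verbatim code):

```python
import pandas as pd

# Toy cleaned transactions, same columns produced by the cleaning step
tx = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 2],
    "Invoice": ["A1", "A2", "A3", "B1", "B2"],
    "InvoiceDate": pd.to_datetime([
        "2011-01-10", "2011-03-01", "2011-06-01",
        "2011-02-15", "2011-11-01"]),
    "TotalAmount": [100.0, 50.0, 25.0, 200.0, 80.0],
})

# Cutoff: three months before the latest transaction in the dataset
cutoff_day = tx["InvoiceDate"].max() - pd.DateOffset(months=3)
past = tx[tx["InvoiceDate"] <= cutoff_day]

def rfm_features(g: pd.DataFrame) -> pd.Series:
    """RFM plus Average Gap, Tenure, and Average Order Value for one customer."""
    dates = g["InvoiceDate"].sort_values()
    frequency = g["Invoice"].nunique()
    return pd.Series({
        "Recency": (cutoff_day - dates.max()).days,   # days since last order
        "Frequency": frequency,                       # distinct invoices
        "Monetary": g["TotalAmount"].sum(),           # total spend in window
        "Tenure": (dates.max() - dates.min()).days,   # first-to-last order span
        "AvgGap": dates.diff().dt.days.mean() if frequency > 1 else 0.0,
        "AvgOrderValue": g["TotalAmount"].sum() / frequency,
    })

features = (past.groupby("Customer ID")[["Invoice", "InvoiceDate", "TotalAmount"]]
            .apply(rfm_features))
```

Note that only 'Past'-window transactions feed the features; anything after the cutoff is reserved for the target.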


Step 3: Calculating Customer Lifetime Value (CLV) - The Future Window

The target variable for our models, CLV, is derived from the 'Future' window. For each customer present in the 'Future' window, their CLV is the sum of their TotalAmount during that period. Customers present in the 'Past' window but not in the 'Future' window will have a CLV of 0, indicating no future purchases within the defined period.
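A small sketch of building this target, assuming a toy transactions frame and a fixed cutoff (variable names here are illustrative):

```python
import pandas as pd

# Toy transactions spanning the cutoff, same columns as before
tx = pd.DataFrame({
    "Customer ID": [1, 1, 2, 3],
    "InvoiceDate": pd.to_datetime(
        ["2011-05-01", "2011-10-01", "2011-04-01", "2011-09-15"]),
    "TotalAmount": [100.0, 40.0, 60.0, 25.0],
})
cutoff_day = pd.Timestamp("2011-08-01")

past = tx[tx["InvoiceDate"] <= cutoff_day]
future = tx[tx["InvoiceDate"] > cutoff_day]

# CLV = total spend in the 'Future' window; past-only customers get 0
clv = (future.groupby("Customer ID")["TotalAmount"].sum()
       .rename("CLV").reset_index())
target = (past[["Customer ID"]].drop_duplicates()
          .merge(clv, on="Customer ID", how="left")
          .fillna({"CLV": 0.0}))
```

In the toy data, customer 1 spends again after the cutoff, customer 2 does not (CLV 0), and customer 3 appears only in the future window, so it has no features and is not part of the training target.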


Step 4: Train-Test Split

Before training our models, we split the data into training and testing sets. This allows us to train the models on one portion of the data and evaluate their performance on unseen data, ensuring generalization.
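A sketch of the split, with random arrays standing in for the real feature matrix and CLV target (the log1p transform shown here anticipates the inverse expm1 applied at evaluation time):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: X would hold the RFM-style features, y the future-window CLV
rng = np.random.default_rng(42)
X = rng.random((100, 6))
y = rng.exponential(scale=500.0, size=100)  # CLV is heavily right-skewed

# Log-transform the skewed target so the models fit a tamer distribution
Y = np.log1p(y)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)
```

Fixing random_state makes the split reproducible across runs.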


Step 5: Model Training

We will train three different regression models to predict CLV:

Linear Regression: A simple, interpretable model.
Random Forest Regressor: An ensemble method known for its robustness.
XGBoost Regressor: A powerful gradient boosting algorithm often used for high performance.
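Training all three can be sketched in one loop (toy arrays stand in for the real training data; if the xgboost package is unavailable, this sketch falls back to scikit-learn's gradient boosting so it still runs):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
try:
    from xgboost import XGBRegressor                 # pip install xgboost
except ImportError:                                  # fallback for the sketch
    from sklearn.ensemble import GradientBoostingRegressor as XGBRegressor

# Toy stand-ins for the engineered features and log-scale CLV target
rng = np.random.default_rng(0)
X_train = rng.random((80, 6))
Y_train = np.log1p(rng.exponential(scale=500.0, size=80))

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "XGBoost": XGBRegressor(n_estimators=100, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, Y_train)  # each model learns the log-scale target
```

All three share the same fit/predict interface, which keeps the comparison loop trivial.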


Step 6: Model Evaluation

To evaluate the performance of our models, we will use three common regression metrics:

Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values.
Mean Absolute Error (MAE): Measures the average absolute difference between predicted and actual values.
R-squared (R²): Represents the proportion of variance in the dependent variable that can be predicted from the independent variables.

Since we applied a log transformation to our target variable Y, we need to apply np.expm1 to the predictions and actual Y_test before calculating the metrics to convert them back to their original scale for a meaningful interpretation.
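A sketch of that evaluation, with small hand-made log-scale arrays standing in for Y_test and a model's predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Toy log-scale actuals and predictions (stand-ins for Y_test and
# model.predict(X_test))
Y_test = np.log1p(np.array([120.0, 0.0, 640.0, 55.0]))
Y_pred = np.log1p(np.array([100.0, 10.0, 600.0, 70.0]))

# Undo the log1p transform so errors are in the original currency scale
actual = np.expm1(Y_test)
predicted = np.expm1(Y_pred)

mse = mean_squared_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
r2 = r2_score(actual, predicted)
```

Because MSE squares each error, a single badly mispredicted high-value customer dominates it, which is why reporting MAE alongside gives a fairer picture of typical error.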


From the evaluation metrics, we can observe that both Random Forest and XGBoost performed significantly better than the simple Linear Regression, which had a very poor R-squared score. XGBoost and Random Forest show comparable performance in terms of MSE and MAE on this dataset.

Step 7: Customer Segmentation Based on Predicted CLV

Predicting CLV allows us to segment customers into different value tiers. We'll use the XGBoost model's predictions to create four segments: 'Low', 'Mid', 'High', and 'VIP'. This segmentation can guide targeted marketing and retention efforts.
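One way to sketch this tiering is quartile binning with pd.qcut (the toy predicted values below stand in for np.expm1 applied to the XGBoost predictions):

```python
import pandas as pd

# Toy predicted CLV values, back on the original currency scale
predicted_clv = pd.Series(
    [5.0, 20.0, 45.0, 80.0, 150.0, 300.0, 700.0, 2500.0])

# Quartile-based tiers: four equal-sized buckets from lowest to highest value
segments = pd.qcut(predicted_clv, q=4, labels=["Low", "Mid", "High", "VIP"])

# Summarise headcount and total predicted value per segment
summary = (pd.DataFrame({"Predicted_CLV": predicted_clv, "Segment": segments})
           .groupby("Segment", observed=True)["Predicted_CLV"]
           .agg(Customers="size", Total_Predicted_Value="sum"))
```

Quartile cuts give equal-sized segments by construction; if the business instead wants fixed monetary thresholds (e.g. VIP above a spend floor), pd.cut with explicit bin edges is the drop-in alternative.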


Conclusion

Predicting Customer Lifetime Value is a powerful analytical technique that provides actionable insights into customer behavior and future revenue. By engineering features from historical data and employing robust machine learning models like Random Forest or XGBoost, businesses can accurately forecast CLV. This allows for effective customer segmentation, enabling targeted strategies for retention, marketing, and resource allocation.

The significant difference in Total_Predicted_Value across segments, particularly the 'VIP' segment, highlights the importance of identifying and nurturing high-value customers. This approach empowers businesses to optimize their customer relationships and drive sustainable growth.

Feel free to adapt this notebook and methodology to your own datasets to unlock the hidden value within your customer base.

Top comments (2)

Bright Emmanuel

Great breakdown, Shanii! Explaining the RFM metrics before diving into the model training really helps bridge the gap between the business logic and the machine learning implementation. Thanks for sharing the code snippets, too!

Shanii

Thanks for your kind words.