Introduction
Customer acquisition is expensive. But do you know which customers will actually generate long-term revenue? That’s where Customer Lifetime Value (CLV) comes in.
Instead of focusing on one-off transactions, CLV estimates the total revenue a business expects from a customer over their entire relationship.
In this project, I built an end-to-end CLV prediction model and then deployed it as a production-ready API.
In this article, we’ll cover:
- Business problem
- Data preprocessing
- Model development
- Model evaluation
- Model deployment with FastAPI
- Production-ready setup
The Business Problem
Businesses want to answer:
- Which customers are most valuable?
- Who should receive retention incentives?
- Where should marketing budgets be allocated?
Predicting CLV helps with:
- Customer segmentation
- Revenue forecasting
- Budget optimization
- Retention strategies
This is a regression problem since CLV is a continuous value.
Step 1: Data Preprocessing
The dataset includes:
- Purchase frequency
- Recency
- Average transaction value
- Tenure
- Demographic features
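If you are starting from raw transactions rather than a pre-aggregated table, features like recency and frequency can be derived with a simple groupby. The sketch below assumes a hypothetical transactions table with `customer_id`, `order_date`, and `amount` columns (these names are illustrative, not from the project's actual dataset):

```python
import pandas as pd

# Hypothetical raw transactions: one row per purchase
tx = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2, 3],
    'order_date': pd.to_datetime([
        '2024-01-05', '2024-03-10', '2024-02-01',
        '2024-02-20', '2024-04-01', '2024-01-15']),
    'amount': [50.0, 70.0, 20.0, 30.0, 25.0, 200.0],
})

# Fix a snapshot date so recency is well defined
snapshot = tx['order_date'].max()

# Aggregate per customer: recency (days since last purchase),
# purchase frequency, and average transaction value
features = tx.groupby('customer_id').agg(
    recency_days=('order_date', lambda d: (snapshot - d.max()).days),
    frequency=('order_date', 'count'),
    avg_transaction=('amount', 'mean'),
)
print(features)
```

Tenure and demographic columns would then be joined onto this table before modelling.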
Data Preparation
Before training any model, we need to separate our features from the target variable. In this case, CLV is what we're trying to predict, and everything else in the dataset serves as input:
x = df.drop('CLV', axis=1)
y = df['CLV']
We also check for missing values:
x.isnull().sum()
Clean data is non-negotiable. Missing values can silently corrupt a model's performance if left unaddressed.
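If the check does surface gaps, one simple option is median imputation for numeric columns. A minimal sketch, using a toy frame with invented column names rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Toy feature frame with a gap in each column (illustrative only)
x = pd.DataFrame({
    'Tenure_Months': [12, 24, np.nan, 36],
    'Monthly_Spend': [50.0, 80.0, 65.0, np.nan],
})

# Fill numeric gaps with the column median; robust to outliers
x_filled = x.fillna(x.median(numeric_only=True))

print(x_filled.isnull().sum().sum())  # no missing values remain
```

More sophisticated strategies (group-wise imputation, indicator columns for missingness) may pay off depending on how the gaps arise.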
Splitting the Dataset
We divide the data into training and testing sets: 80% for training and 20% for evaluating performance on unseen data:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
Setting random_state=42 ensures reproducibility, so results remain consistent across runs.
Step 2: Model Development
Linear Regression
We start with linear regression, a simple but interpretable baseline. It assumes a linear relationship between the features and the target, making it fast to train and easy to explain to stakeholders.
from sklearn.linear_model import LinearRegression
Linear = LinearRegression()
Linear.fit(x_train, y_train)
Predictions = Linear.predict(x_test)
Random Forest Regressor
Next, we train a Random Forest, an ensemble method that builds 200 decision trees and averages their predictions. This approach is more robust to non-linear patterns in the data and tends to outperform linear models on complex, real-world datasets.
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=200, random_state=42)
rf.fit(x_train, y_train)
random_prediction = rf.predict(x_test)
Step 3: Model Evaluation
We evaluate both models using Root Mean Squared Error (RMSE) and R² Score. RMSE tells us the average prediction error in the same units as CLV, while R² tells us how much of the variance in CLV our model explains (1.0 = perfect, 0 = no better than guessing the mean).
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt
RMSE_linear = sqrt(mean_squared_error(y_test, Predictions))
r2_linear = r2_score(y_test, Predictions)
RMSE_tree = sqrt(mean_squared_error(y_test, random_prediction))
r2_tree = r2_score(y_test, random_prediction)
print(f'RMSE_linear: {RMSE_linear}')
print(f'r2_linear: {r2_linear}')
print(f'RMSE_tree: {RMSE_tree}')
print(f'r2_tree: {r2_tree}')
In most real-world CLV scenarios, the Random Forest will outperform Linear Regression due to its ability to capture complex, non-linear relationships between customer features and lifetime value.
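A Random Forest also tells you which features drive its predictions via feature_importances_. The sketch below trains on synthetic data (the column names and the relationship behind y are invented for illustration) just to show the pattern:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: CLV driven mostly by monthly spend
rng = np.random.default_rng(42)
x = pd.DataFrame({
    'Monthly_Spend': rng.uniform(10, 200, 500),
    'Tenure_Months': rng.integers(1, 60, 500),
    'Support_Tickets': rng.integers(0, 5, 500),
})
y = 12 * x['Monthly_Spend'] + 5 * x['Tenure_Months'] + rng.normal(0, 50, 500)

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(x, y)

# Rank features by their contribution to the trees' splits (sums to 1)
importances = pd.Series(rf.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False))
```

Importance rankings like this are often what stakeholders actually want from a CLV model, alongside the predictions themselves.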
Saving the Model
Once we're satisfied with model performance, we persist the trained model and feature schema using joblib. This makes it straightforward to reload the model later without retraining:
import joblib
joblib.dump(rf, 'CLV_model.joblib')
joblib.dump(list(x.columns), 'modelfeatures.joblib')
Saving the feature set alongside the model is a great practice. It documents exactly what columns and structure the model expects at inference time, which prevents subtle bugs when deploying.
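The full round trip can be sketched end to end: train, dump, reload, and verify the schema before predicting. This uses a tiny stand-in model and invented columns so the example is self-contained:

```python
import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# Tiny stand-in model (columns are illustrative)
x = pd.DataFrame({'Tenure_Months': [1, 2, 3, 4],
                  'Monthly_Spend': [10.0, 20.0, 30.0, 40.0]})
y = pd.Series([100.0, 200.0, 300.0, 400.0])
model = LinearRegression().fit(x, y)

# Persist the model and the exact column order it was trained on
joblib.dump(model, 'CLV_model.joblib')
joblib.dump(list(x.columns), 'modelfeatures.joblib')

# Later (e.g. inside the API): reload and check incoming data
# against the saved schema before calling predict
loaded = joblib.load('CLV_model.joblib')
feature_name = joblib.load('modelfeatures.joblib')
assert feature_name == ['Tenure_Months', 'Monthly_Spend']
print(loaded.predict(x[feature_name])[0])
```

Reindexing inference-time inputs with the saved feature list guards against silently passing columns in the wrong order.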
Step 4: Model Deployment with FastAPI
Training a model is only half the work. To put it into production, you need an API that other systems can call. Here's how to build a simple REST endpoint using FastAPI:
1. Install Dependencies
pip install fastapi uvicorn joblib scikit-learn pandas
2. Create the API
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np
app = FastAPI(title='Customer Lifetime Value Prediction API')
# Load the saved model and feature schema
model = joblib.load('CLV_model.joblib')
feature_name = joblib.load('modelfeatures.joblib')
# Define the input schema (adjust fields to match your actual dataset columns)
class CLVinput(BaseModel):
    Customer_Age: int
    Annual_Income: float
    Tenure_Months: int
    Monthly_Spend: float
    Visits_Per_Month: int
    Avg_Basket_Value: float
    Support_Tickets: int

@app.get('/')
def health_check():
    return {'status': 'API is running'}

@app.post('/predict-CLV')
def predict_CLV(data: CLVinput):
    # Order the inputs exactly as the model expects
    x = np.array([[getattr(data, f) for f in feature_name]])
    prediction = model.predict(x)[0]
    # Cast to float so the NumPy scalar serializes cleanly to JSON
    return {'predicted_CLV': float(prediction)}
3. Run the Server Locally
uvicorn app:app --reload
Your API will be live at http://localhost:8000 (assuming the file is named app.py). You can test it at http://localhost:8000/docs, where FastAPI generates interactive API documentation automatically.
4. Deploy to the Cloud
For production, deploy the API to a cloud provider. Here's a quick overview:
Railway or Render (simplest): Push your code to GitHub and connect the repo. Both platforms auto-detect Python apps and handle deployment with minimal configuration. Add a requirements.txt file:
fastapi
uvicorn
joblib
scikit-learn
pandas
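Both platforms also need to know how to start the app. A typical start command (the exact mechanism varies by platform; `$PORT` is the environment variable these hosts usually inject) looks like:

```shell
uvicorn app:app --host 0.0.0.0 --port $PORT
```

Binding to 0.0.0.0 rather than localhost is what makes the server reachable from outside the container.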
Summary
Here's the end-to-end workflow we covered:
- Load and explore the customer dataset
- Prepare features by separating inputs from the CLV target
- Train two models, Linear Regression and Random Forest, and compare them using RMSE and R²
- Save the best model using joblib
- Deploy via FastAPI with a /predict-CLV endpoint that accepts customer data and returns a CLV estimate
Predicting Customer Lifetime Value turns raw customer data into a strategic business asset. With a deployed model, your sales and marketing teams can make real-time decisions based on predicted value, not just historical behaviour.