Data Analyst Guide: Mastering Feature Engineering > Fancy ML Models (Always)
Business Problem Statement
In the e-commerce industry, predicting customer churn is crucial for businesses to retain their customer base and maintain revenue. A company like Amazon can lose millions of dollars in revenue if it fails to identify and retain its high-value customers. In this tutorial, we will work on a real-world scenario where we will predict customer churn for an e-commerce company using feature engineering and machine learning.
The business problem statement is as follows:
- Predict customer churn for an e-commerce company based on their historical transaction data.
- Identify the key factors that contribute to customer churn.
- Develop a predictive model that can identify high-risk customers and provide recommendations to retain them.
The ROI impact of this project can be significant. For example, if we can reduce customer churn by 10%, the company can save millions of dollars in revenue.
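As a back-of-envelope illustration, the savings from a churn reduction can be computed directly. All numbers below are assumptions made up for the example, not real company figures:

```python
# Estimate revenue retained by reducing churn.
# Every input here is an illustrative assumption.
def churn_savings(customers, annual_value_per_customer, churn_rate, reduction):
    """Revenue retained if churn drops by `reduction` (0.10 = 10% relative)."""
    churned_before = customers * churn_rate
    churned_after = customers * churn_rate * (1 - reduction)
    return (churned_before - churned_after) * annual_value_per_customer

# 1M customers, $200/year each, 15% baseline churn, 10% relative reduction
print(churn_savings(1_000_000, 200, 0.15, 0.10))  # 3000000.0
```

Under these assumed inputs, a 10% relative reduction in churn retains $3M of annual revenue, which is the kind of figure the business case rests on.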
Step-by-Step Technical Solution
Step 1: Data Preparation (pandas/SQL)
First, we need to prepare our data for analysis. We will use a sample dataset that contains customer transaction data.
```python
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset
data = pd.read_csv('customer_churn_data.csv')

# Print the first few rows of the dataset
print(data.head())
```
The dataset contains the following columns:
- customer_id: unique customer ID
- transaction_date: date of the transaction
- transaction_amount: amount of the transaction
- product_category: category of the product purchased
- churn: whether the customer has churned (1) or not (0)
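If you don't have customer_churn_data.csv handy, a small hypothetical DataFrame matching this schema lets you run every snippet in the tutorial (the rows mirror the SQL sample data shown later in this step):

```python
import pandas as pd

# Hypothetical sample data matching the schema described above
data = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'transaction_date': pd.to_datetime(
        ['2022-01-01', '2022-01-15', '2022-02-01', '2022-03-01', '2022-04-01']),
    'transaction_amount': [100.00, 200.00, 50.00, 150.00, 250.00],
    'product_category': ['Electronics', 'Fashion', 'Home Goods',
                         'Electronics', 'Fashion'],
    'churn': [0, 1, 0, 1, 0],
})
print(data.dtypes)
```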
We will use SQL to extract the data from a database and load it into a pandas DataFrame.
```sql
-- Create a table to store customer transaction data
CREATE TABLE customer_transactions (
    customer_id INT,
    transaction_date DATE,
    transaction_amount DECIMAL(10, 2),
    product_category VARCHAR(255),
    churn INT
);

-- Insert sample data into the table
INSERT INTO customer_transactions (customer_id, transaction_date, transaction_amount, product_category, churn)
VALUES
    (1, '2022-01-01', 100.00, 'Electronics', 0),
    (2, '2022-01-15', 200.00, 'Fashion', 1),
    (3, '2022-02-01', 50.00, 'Home Goods', 0),
    (4, '2022-03-01', 150.00, 'Electronics', 1),
    (5, '2022-04-01', 250.00, 'Fashion', 0);
```
We then extract the data from the table and load it into a pandas DataFrame:

```python
import pandas as pd
import pyodbc

# Connect to the database (adjust the driver and credentials for your environment)
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=localhost;DATABASE=customer_churn;UID=username;PWD=password'
)

# Extract the data from the table
query = "SELECT * FROM customer_transactions"
data = pd.read_sql(query, conn)

# Print the first few rows of the dataset
print(data.head())
```
Step 2: Analysis Pipeline
Next, we will create an analysis pipeline to extract insights from the data.
```python
# Define a function to calculate the average transaction amount per customer
def calculate_average_transaction_amount(data):
    average_transaction_amount = data.groupby('customer_id')['transaction_amount'].mean()
    return average_transaction_amount

# Define a function to calculate the total transaction amount per customer
def calculate_total_transaction_amount(data):
    total_transaction_amount = data.groupby('customer_id')['transaction_amount'].sum()
    return total_transaction_amount

# Define a function to calculate the frequency of transactions per customer
def calculate_transaction_frequency(data):
    transaction_frequency = data.groupby('customer_id')['transaction_date'].count()
    return transaction_frequency

# Calculate the per-customer aggregates
average_transaction_amount = calculate_average_transaction_amount(data)
total_transaction_amount = calculate_total_transaction_amount(data)
transaction_frequency = calculate_transaction_frequency(data)

# Print the results
print("Average Transaction Amount per Customer:")
print(average_transaction_amount)
print("\nTotal Transaction Amount per Customer:")
print(total_transaction_amount)
print("\nTransaction Frequency per Customer:")
print(transaction_frequency)
```
Step 3: Model/Visualization Code
Now we aggregate the transactions into one row per customer (the engineered features from Step 2 plus the churn label) and train a predictive model to identify high-risk customers.

```python
# Build a customer-level feature table from the transaction data
features = data.groupby('customer_id').agg(
    average_transaction_amount=('transaction_amount', 'mean'),
    total_transaction_amount=('transaction_amount', 'sum'),
    transaction_frequency=('transaction_date', 'count'),
    churn=('churn', 'max'),  # a customer is churned if flagged on any row
).reset_index()

# Define a function to train a random forest classifier.
# It also returns the held-out y_test so the model can be evaluated below.
def train_random_forest_classifier(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return model, y_test, y_pred

# Define the features and target variable
X = features[['average_transaction_amount', 'total_transaction_amount',
              'transaction_frequency']]
y = features['churn']

# Train a random forest classifier
model, y_test, y_pred = train_random_forest_classifier(X, y)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Print the confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
```
We will use visualization to understand the results.

```python
# Import necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of the correlations between the numeric columns
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation between Features')
plt.show()

# Bar chart of the target variable's class distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='churn', data=data)
plt.title('Distribution of Target Variable')
plt.show()
```
Step 4: Performance Evaluation
We will evaluate the performance of the model using metrics such as accuracy, precision, recall, and F1 score.
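As a quick sanity check of what these metrics mean, here is a hand computation on a toy confusion matrix (the counts are made up purely for illustration):

```python
# Toy confusion matrix: 3 true positives, 1 false positive,
# 2 false negatives, 4 true negatives
tp, fp, fn, tn = 3, 1, 2, 4

acc = (tp + tn) / (tp + fp + fn + tn)   # (3+4)/10 = 0.7
prec = tp / (tp + fp)                   # 3/4 = 0.75
rec = tp / (tp + fn)                    # 3/5 = 0.6
f1 = 2 * prec * rec / (prec + rec)      # harmonic mean of precision and recall

print(acc, prec, rec, round(f1, 3))
```

scikit-learn's metric functions compute exactly these quantities from `y_test` and `y_pred`.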
```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Define a function to calculate the performance metrics
def calculate_performance_metrics(y_test, y_pred):
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    # Use a distinct name so we don't shadow sklearn's f1_score function
    f1 = f1_score(y_test, y_pred)
    return accuracy, precision, recall, f1

# Calculate the performance metrics
accuracy, precision, recall, f1 = calculate_performance_metrics(y_test, y_pred)

# Print the performance metrics
print("Model Performance Metrics:")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```
Step 5: Production Deployment
Finally, we will deploy the model in production.
```python
# The model must be scored on the same feature columns it was trained on,
# not on the raw transaction table.
FEATURE_COLUMNS = ['average_transaction_amount',
                   'total_transaction_amount',
                   'transaction_frequency']

# Define a function to score new data with the trained model
def deploy_model_in_production(model, new_data):
    # Use the model to make predictions on new customer-level data
    return model.predict(new_data[FEATURE_COLUMNS])

# Score the customer-level feature table built in Step 3
predictions = deploy_model_in_production(model, X)

# Print the predictions
print("Predictions:")
print(predictions)
```
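The snippet above scores in-process; a common deployment pattern (not covered in the original code) is to persist the trained model with joblib so a separate serving process can reload it. A minimal sketch, using a tiny synthetic dataset just to have a fitted model to save:

```python
import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic training set, purely to produce a fitted model
X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y_demo = np.array([0, 1, 0, 1])
model_demo = RandomForestClassifier(n_estimators=10, random_state=42)
model_demo.fit(X_demo, y_demo)

# Persist to disk, then reload as a serving process would
dump(model_demo, 'churn_model.joblib')
restored = load('churn_model.joblib')

# The restored model makes identical predictions
print((restored.predict(X_demo) == model_demo.predict(X_demo)).all())
```

In a real deployment the serving side would load the file once at startup and score incoming customer feature rows on request.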
Metrics/ROI Calculations
We will calculate the ROI of the project by comparing the cost of implementing the model with the benefits it provides.
```python
# Define a function to calculate the ROI
def calculate_roi(cost, benefit):
    return (benefit - cost) / cost

# Illustrative numbers: $100k to implement, $500k in retained revenue
cost = 100_000
benefit = 500_000
roi = calculate_roi(cost, benefit)

# Print the ROI
print("ROI:", roi)  # 4.0, i.e. a 400% return
```
Edge Cases
We will handle edge cases such as missing values, outliers, and imbalanced data.
```python
# Define a function to handle missing values
def handle_missing_values(data):
    # Replace missing numeric values with the column mean
    # (filling the whole DataFrame would fail on string columns)
    numeric_cols = data.select_dtypes(include='number').columns
    data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())
    return data

# Define a function to handle outliers in a single numeric column
def handle_outliers(data, column):
    # Use the IQR method to drop rows with outlying values
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    mask = (data[column] >= Q1 - 1.5 * IQR) & (data[column] <= Q3 + 1.5 * IQR)
    return data[mask]

# Define a function to handle imbalanced data
def handle_imbalanced_data(features, target):
    # Use SMOTE to oversample the minority class
    # (SMOTE requires an all-numeric feature matrix)
    from imblearn.over_sampling import SMOTE
    smote = SMOTE(random_state=42)
    X_resampled, y_resampled = smote.fit_resample(features, target)
    return X_resampled, y_resampled

# Handle missing values
data = handle_missing_values(data)

# Handle outliers in the transaction amount
data = handle_outliers(data, 'transaction_amount')

# Handle class imbalance on the numeric feature matrix from Step 3
X, y = handle_imbalanced_data(X, y)
```
Scaling Tips
For datasets too large to aggregate in one pass, we can partition the transactions by customer and compute the features for each partition in parallel, then combine the partial results.

```python
from joblib import Parallel, delayed

# Compute the average transaction amount within one partition
def compute_chunk_features(chunk):
    return chunk.groupby('customer_id')['transaction_amount'].mean()

# Partition rows by customer so each customer lands in exactly one chunk
chunks = [chunk for _, chunk in data.groupby(data['customer_id'] % 4)]

# Process the partitions on all available cores
results = Parallel(n_jobs=-1)(
    delayed(compute_chunk_features)(chunk) for chunk in chunks
)

# Combine the partial results into one per-customer Series
average_transaction_amount = pd.concat(results).sort_index()
print(average_transaction_amount)
```
By following these steps and tips, we can master feature engineering and build a robust predictive model that provides valuable insights for business decision-making.