Data Analyst Guide: Mastering Feature Engineering > Fancy ML Models (Always)
Business Problem Statement
In the e-commerce industry, predicting customer churn is crucial for businesses to retain their customer base and maintain revenue. A company like Amazon can lose millions of dollars in revenue if it fails to identify and retain its high-value customers. In this tutorial, we will work on a real-world scenario where we will predict customer churn for an e-commerce company using feature engineering and machine learning.
The business problem statement is as follows:
- Predict customer churn for an e-commerce company based on their historical transaction data.
- Identify the key factors that contribute to customer churn.
- Develop a predictive model that can identify high-risk customers and provide recommendations to retain them.
The ROI impact of this project can be significant. For example, if we can reduce customer churn by 10%, the company can save millions of dollars in revenue.
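As a back-of-envelope illustration, the savings from a churn reduction can be computed directly. All numbers below are assumptions made up for the example, not real company figures:

```python
# Estimate revenue retained by reducing churn.
# Every input here is an illustrative assumption.
def churn_savings(customers, annual_value_per_customer, churn_rate, reduction):
    """Revenue retained if churn drops by `reduction` (0.10 = 10% relative)."""
    churned_before = customers * churn_rate
    churned_after = customers * churn_rate * (1 - reduction)
    return (churned_before - churned_after) * annual_value_per_customer

# 1M customers, $200/year each, 15% baseline churn, 10% relative reduction
print(churn_savings(1_000_000, 200, 0.15, 0.10))  # 3000000.0
```

Under these assumed inputs, a 10% relative reduction in churn retains $3M of annual revenue, which is the kind of figure the business case rests on.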
Step-by-Step Technical Solution
Step 1: Data Preparation (pandas/SQL)
First, we need to prepare our data for analysis. We will use a sample dataset that contains customer transaction data.
```python
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset
data = pd.read_csv('customer_churn_data.csv')

# Print the first few rows of the dataset
print(data.head())
```
The dataset contains the following columns:
- customer_id: unique customer ID
- transaction_date: date of the transaction
- transaction_amount: amount of the transaction
- product_category: category of the product purchased
- churn: whether the customer has churned (1) or not (0)
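If you don't have customer_churn_data.csv handy, a small hypothetical DataFrame matching this schema lets you run every snippet in the tutorial (the rows mirror the SQL sample data shown later in this step):

```python
import pandas as pd

# Hypothetical sample data matching the schema described above
data = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'transaction_date': pd.to_datetime(
        ['2022-01-01', '2022-01-15', '2022-02-01', '2022-03-01', '2022-04-01']),
    'transaction_amount': [100.00, 200.00, 50.00, 150.00, 250.00],
    'product_category': ['Electronics', 'Fashion', 'Home Goods',
                         'Electronics', 'Fashion'],
    'churn': [0, 1, 0, 1, 0],
})
print(data.dtypes)
```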
We will use SQL to extract the data from a database and load it into a pandas DataFrame.
```sql
-- Create a table to store customer transaction data
CREATE TABLE customer_transactions (
    customer_id INT,
    transaction_date DATE,
    transaction_amount DECIMAL(10, 2),
    product_category VARCHAR(255),
    churn INT
);

-- Insert sample data into the table
INSERT INTO customer_transactions (customer_id, transaction_date, transaction_amount, product_category, churn)
VALUES
    (1, '2022-01-01', 100.00, 'Electronics', 0),
    (2, '2022-01-15', 200.00, 'Fashion', 1),
    (3, '2022-02-01', 50.00, 'Home Goods', 0),
    (4, '2022-03-01', 150.00, 'Electronics', 1),
    (5, '2022-04-01', 250.00, 'Fashion', 0);
```
We then extract the data from the table and load it into a pandas DataFrame:

```python
import pandas as pd
import pyodbc

# Connect to the database (adjust the driver and credentials for your environment)
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=localhost;DATABASE=customer_churn;UID=username;PWD=password'
)

# Extract the data from the table
query = "SELECT * FROM customer_transactions"
data = pd.read_sql(query, conn)

# Print the first few rows of the dataset
print(data.head())
```
Step 2: Analysis Pipeline
Next, we will create an analysis pipeline to extract insights from the data.
```python
# Define a function to calculate the average transaction amount per customer
def calculate_average_transaction_amount(data):
    average_transaction_amount = data.groupby('customer_id')['transaction_amount'].mean()
    return average_transaction_amount

# Define a function to calculate the total transaction amount per customer
def calculate_total_transaction_amount(data):
    total_transaction_amount = data.groupby('customer_id')['transaction_amount'].sum()
    return total_transaction_amount

# Define a function to calculate the frequency of transactions per customer
def calculate_transaction_frequency(data):
    transaction_frequency = data.groupby('customer_id')['transaction_date'].count()
    return transaction_frequency

# Calculate the per-customer aggregates
average_transaction_amount = calculate_average_transaction_amount(data)
total_transaction_amount = calculate_total_transaction_amount(data)
transaction_frequency = calculate_transaction_frequency(data)

# Print the results
print("Average Transaction Amount per Customer:")
print(average_transaction_amount)
print("\nTotal Transaction Amount per Customer:")
print(total_transaction_amount)
print("\nTransaction Frequency per Customer:")
print(transaction_frequency)
```
Step 3: Model/Visualization Code
Now we aggregate the transactions into one row per customer (the engineered features from Step 2 plus the churn label) and train a predictive model to identify high-risk customers.

```python
# Build a customer-level feature table from the transaction data
features = data.groupby('customer_id').agg(
    average_transaction_amount=('transaction_amount', 'mean'),
    total_transaction_amount=('transaction_amount', 'sum'),
    transaction_frequency=('transaction_date', 'count'),
    churn=('churn', 'max'),  # a customer is churned if flagged on any row
).reset_index()

# Define a function to train a random forest classifier.
# It also returns the held-out y_test so the model can be evaluated below.
def train_random_forest_classifier(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return model, y_test, y_pred

# Define the features and target variable
X = features[['average_transaction_amount', 'total_transaction_amount',
              'transaction_frequency']]
y = features['churn']

# Train a random forest classifier
model, y_test, y_pred = train_random_forest_classifier(X, y)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Print the confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
```
We will use visualization to understand the results.

```python
# Import necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Heatmap of the correlations between the numeric columns
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation between Features')
plt.show()

# Bar chart of the target variable's class distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='churn', data=data)
plt.title('Distribution of Target Variable')
plt.show()
```
Step 4: Performance Evaluation
We will evaluate the performance of the model using metrics such as accuracy, precision, recall, and F1 score.
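As a quick sanity check of what these metrics mean, here is a hand computation on a toy confusion matrix (the counts are made up purely for illustration):

```python
# Toy confusion matrix: 3 true positives, 1 false positive,
# 2 false negatives, 4 true negatives
tp, fp, fn, tn = 3, 1, 2, 4

acc = (tp + tn) / (tp + fp + fn + tn)   # (3+4)/10 = 0.7
prec = tp / (tp + fp)                   # 3/4 = 0.75
rec = tp / (tp + fn)                    # 3/5 = 0.6
f1 = 2 * prec * rec / (prec + rec)      # harmonic mean of precision and recall

print(acc, prec, rec, round(f1, 3))
```

scikit-learn's metric functions compute exactly these quantities from `y_test` and `y_pred`.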
```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Define a function to calculate the performance metrics
def calculate_performance_metrics(y_test, y_pred):
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    # Use a distinct name so we don't shadow sklearn's f1_score function
    f1 = f1_score(y_test, y_pred)
    return accuracy, precision, recall, f1

# Calculate the performance metrics
accuracy, precision, recall, f1 = calculate_performance_metrics(y_test, y_pred)

# Print the performance metrics
print("Model Performance Metrics:")
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```
Step 5: Production Deployment
Finally, we will deploy the model in production.
```python
# The model must be scored on the same feature columns it was trained on,
# not on the raw transaction table.
FEATURE_COLUMNS = ['average_transaction_amount',
                   'total_transaction_amount',
                   'transaction_frequency']

# Define a function to score new data with the trained model
def deploy_model_in_production(model, new_data):
    # Use the model to make predictions on new customer-level data
    return model.predict(new_data[FEATURE_COLUMNS])

# Score the customer-level feature table built in Step 3
predictions = deploy_model_in_production(model, X)

# Print the predictions
print("Predictions:")
print(predictions)
```
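The snippet above scores in-process; a common deployment pattern (not covered in the original code) is to persist the trained model with joblib so a separate serving process can reload it. A minimal sketch, using a tiny synthetic dataset just to have a fitted model to save:

```python
import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic training set, purely to produce a fitted model
X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y_demo = np.array([0, 1, 0, 1])
model_demo = RandomForestClassifier(n_estimators=10, random_state=42)
model_demo.fit(X_demo, y_demo)

# Persist to disk, then reload as a serving process would
dump(model_demo, 'churn_model.joblib')
restored = load('churn_model.joblib')

# The restored model makes identical predictions
print((restored.predict(X_demo) == model_demo.predict(X_demo)).all())
```

In a real deployment the serving side would load the file once at startup and score incoming customer feature rows on request.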
Metrics/ROI Calculations
We will calculate the ROI of the project by comparing the cost of implementing the model with the benefits it provides.
```python
# Define a function to calculate the ROI
def calculate_roi(cost, benefit):
    return (benefit - cost) / cost

# Illustrative numbers: $100k to implement, $500k in retained revenue
cost = 100_000
benefit = 500_000
roi = calculate_roi(cost, benefit)

# Print the ROI
print("ROI:", roi)  # 4.0, i.e. a 400% return
```
Edge Cases
We will handle edge cases such as missing values, outliers, and imbalanced data.
```python
# Define a function to handle missing values
def handle_missing_values(data):
    # Replace missing numeric values with the column mean
    # (filling the whole DataFrame would fail on string columns)
    numeric_cols = data.select_dtypes(include='number').columns
    data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())
    return data

# Define a function to handle outliers in a single numeric column
def handle_outliers(data, column):
    # Use the IQR method to drop rows with outlying values
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    mask = (data[column] >= Q1 - 1.5 * IQR) & (data[column] <= Q3 + 1.5 * IQR)
    return data[mask]

# Define a function to handle imbalanced data
def handle_imbalanced_data(features, target):
    # Use SMOTE to oversample the minority class
    # (SMOTE requires an all-numeric feature matrix)
    from imblearn.over_sampling import SMOTE
    smote = SMOTE(random_state=42)
    X_resampled, y_resampled = smote.fit_resample(features, target)
    return X_resampled, y_resampled

# Handle missing values
data = handle_missing_values(data)

# Handle outliers in the transaction amount
data = handle_outliers(data, 'transaction_amount')

# Handle class imbalance on the numeric feature matrix from Step 3
X, y = handle_imbalanced_data(X, y)
```
Scaling Tips
For datasets too large to aggregate in one pass, we can partition the transactions by customer and compute the features for each partition in parallel, then combine the partial results.

```python
from joblib import Parallel, delayed

# Compute the average transaction amount within one partition
def compute_chunk_features(chunk):
    return chunk.groupby('customer_id')['transaction_amount'].mean()

# Partition rows by customer so each customer lands in exactly one chunk
chunks = [chunk for _, chunk in data.groupby(data['customer_id'] % 4)]

# Process the partitions on all available cores
results = Parallel(n_jobs=-1)(
    delayed(compute_chunk_features)(chunk) for chunk in chunks
)

# Combine the partial results into one per-customer Series
average_transaction_amount = pd.concat(results).sort_index()
print(average_transaction_amount)
```
By following these steps and tips, we can master feature engineering and build a robust predictive model that provides valuable insights for business decision-making.