Data Analyst Guide: Random Forest vs. XGBoost: Which Wins for Analytics?
Business Problem Statement
In the retail industry, predicting customer churn is crucial for maintaining a loyal customer base and minimizing revenue loss. A leading e-commerce company wants to develop a predictive model to identify customers who are likely to churn, allowing them to proactively offer personalized promotions and improve customer retention. The company estimates that a 10% reduction in customer churn can result in a $1 million increase in annual revenue.
Step-by-Step Technical Solution
Step 1: Data Preparation
We will use a sample dataset containing customer information, purchase history, and demographic data. We will prepare the data using pandas and SQL.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load the dataset
data = pd.read_csv('customer_data.csv')
# Handle missing values
data.fillna(data.mean(numeric_only=True), inplace=True)  # numeric_only avoids errors on string columns in newer pandas
# Convert categorical variables to numerical variables
data = pd.get_dummies(data, columns=['gender', 'location'])
# Define the target variable
target = data['churn']
# Define the feature variables (drop the target and the non-predictive id column)
features = data.drop(['churn', 'id'], axis=1)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42, stratify=target)  # stratify keeps the churn ratio consistent across splits
SQL Query to Create the Dataset
CREATE TABLE customer_data (
id INT PRIMARY KEY,
age INT,
gender VARCHAR(10),
location VARCHAR(20),
purchase_history DECIMAL(10, 2),
churn BOOLEAN
);
INSERT INTO customer_data (id, age, gender, location, purchase_history, churn)
VALUES
(1, 25, 'Male', 'New York', 100.00, FALSE),
(2, 30, 'Female', 'Los Angeles', 200.00, TRUE),
(3, 35, 'Male', 'Chicago', 50.00, FALSE),
-- ... (additional rows omitted)
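If the data lives in a database rather than a flat CSV, the same table can be pulled straight into pandas. Here is a minimal sketch assuming a local SQLite file named retail.db; the connection details are an assumption, so substitute your own engine.
import sqlite3
import pandas as pd
# Assumed SQLite file containing the customer_data table above
conn = sqlite3.connect('retail.db')
data = pd.read_sql_query(
    "SELECT age, gender, location, purchase_history, churn FROM customer_data",
    conn,
)
conn.close()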
Step 2: Analysis Pipeline
We will train the Random Forest and XGBoost models on the same training split and generate predictions from each for a like-for-like comparison.
# Define the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Define the XGBoost model
xgb_model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
# Train the models
rf_model.fit(X_train, y_train)
xgb_model.fit(X_train, y_train)
# Make predictions
rf_pred = rf_model.predict(X_test)
xgb_pred = xgb_model.predict(X_test)
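A single train/test split can give a noisy accuracy estimate. As a steadier basis for comparison, here is a minimal cross-validation sketch using scikit-learn's cross_val_score; five folds is an assumed, conventional choice.
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated accuracy on the training data for each model
rf_cv = cross_val_score(rf_model, X_train, y_train, cv=5, scoring='accuracy')
xgb_cv = cross_val_score(xgb_model, X_train, y_train, cv=5, scoring='accuracy')
print("Random Forest CV accuracy: %.3f +/- %.3f" % (rf_cv.mean(), rf_cv.std()))
print("XGBoost CV accuracy: %.3f +/- %.3f" % (xgb_cv.mean(), xgb_cv.std()))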
Step 3: Model Evaluation and Visualization
We will evaluate the performance of the models using accuracy score, classification report, and confusion matrix.
# Evaluate the models
rf_accuracy = accuracy_score(y_test, rf_pred)
xgb_accuracy = accuracy_score(y_test, xgb_pred)
print("Random Forest Accuracy:", rf_accuracy)
print("XGBoost Accuracy:", xgb_accuracy)
# Print the classification report
print("Random Forest Classification Report:")
print(classification_report(y_test, rf_pred))
print("XGBoost Classification Report:")
print(classification_report(y_test, xgb_pred))
# Print the confusion matrix
print("Random Forest Confusion Matrix:")
print(confusion_matrix(y_test, rf_pred))
print("XGBoost Confusion Matrix:")
print(confusion_matrix(y_test, xgb_pred))
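The step title promises visualization, so here is a minimal sketch that renders the two confusion matrices side by side with matplotlib and scikit-learn's ConfusionMatrixDisplay; the figure layout is an arbitrary choice.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
# Plot both confusion matrices in one figure for easy comparison
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
ConfusionMatrixDisplay.from_predictions(y_test, rf_pred, ax=axes[0])
axes[0].set_title('Random Forest')
ConfusionMatrixDisplay.from_predictions(y_test, xgb_pred, ax=axes[1])
axes[1].set_title('XGBoost')
plt.tight_layout()
plt.show()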
Step 4: Performance Evaluation and ROI Calculation
We will estimate a simplified per-customer ROI for each model. Note that the figures below use overall accuracy rather than predicted churn probabilities, so treat the result as a rough heuristic, not a true expected-value calculation.
# Assumed per-customer cost of a missed churner (illustrative figure)
cost_of_churn = 100.00
# Assumed per-customer revenue gain from retaining a customer (illustrative figure)
revenue_gain = 50.00
# Rough per-customer ROI heuristic for the Random Forest model:
# reward correct predictions, penalize errors, weighted by accuracy
rf_roi = (rf_accuracy * revenue_gain) - (cost_of_churn * (1 - rf_accuracy))
print("Random Forest ROI:", rf_roi)
# Calculate the ROI for the XGBoost model
xgb_roi = (xgb_accuracy * revenue_gain) - (cost_of_churn * (1 - xgb_accuracy))
print("XGBoost ROI:", xgb_roi)
Step 5: Production Deployment
We will deploy the best-performing model to a production environment using a RESTful API.
import joblib  # sklearn.externals.joblib has been removed; use the standalone joblib package
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load the trained model (persisted earlier with joblib.dump)
model = joblib.load('best_model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON object whose keys match the training feature columns
    data = request.get_json()
    features = pd.DataFrame([data])
    prediction = model.predict(features)
    return jsonify({'prediction': int(prediction[0])})

if __name__ == '__main__':
    app.run(debug=True)
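The API above expects best_model.pkl to exist, so the winning model has to be persisted first. A minimal sketch, assuming the accuracy comparison from Step 3 decides the winner:
import joblib
# Persist whichever model scored better on the holdout set
best_model = xgb_model if xgb_accuracy >= rf_accuracy else rf_model
joblib.dump(best_model, 'best_model.pkl')
The endpoint can then be exercised with a JSON payload whose keys match the post-encoding feature columns, for example: curl -X POST http://localhost:5000/predict -H "Content-Type: application/json" -d '{"age": 25, "purchase_history": 100.0, "gender_Male": 1, ...}' (the payload fields shown are illustrative).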
Edge Cases
- Handling missing values: we use mean imputation on numeric columns, as in the data-preparation code above.
- Handling outliers: we use the IQR method to detect and remove outliers from the dataset.
- Handling class imbalance: we use the SMOTE technique to oversample the minority (churn) class, as shown in the sketch after this list.
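Here is a minimal SMOTE sketch using the imbalanced-learn package (an extra dependency: pip install imbalanced-learn). Resampling is applied to the training split only, so no synthetic points leak into evaluation.
from imblearn.over_sampling import SMOTE
# Oversample the minority (churn) class in the training data only
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)
rf_model.fit(X_train_bal, y_train_bal)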
Scaling Tips
- Use distributed computing frameworks like Apache Spark or Dask to scale model training; for single-machine parallelism, see the sketch after this list.
- Use cloud-based services like AWS SageMaker or Google Cloud AI Platform to deploy the model in a production environment.
- Use model pruning techniques to reduce the model size and improve inference speed.
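Before reaching for a cluster, both libraries can already parallelize tree construction across local CPU cores. A minimal sketch using the standard n_jobs parameter (the estimator counts are arbitrary):
# n_jobs=-1 uses all available cores in both libraries
rf_parallel = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
xgb_parallel = XGBClassifier(n_estimators=500, n_jobs=-1, random_state=42)
rf_parallel.fit(X_train, y_train)
xgb_parallel.fit(X_train, y_train)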
By following these steps and tips, data analysts can develop a predictive model to identify customers who are likely to churn and deploy it in a production environment to improve customer retention and revenue.