Data Analyst Guide: Random Forest vs. XGBoost: Which Wins for Analytics?
Business Problem Statement
In the retail industry, predicting customer churn is crucial for maintaining a loyal customer base and minimizing revenue loss. A leading e-commerce company wants to identify the most effective machine learning model to predict customer churn, comparing Random Forest and XGBoost. The goal is to reduce churn by 15% and increase revenue by 10% within the next quarter.
Step-by-Step Technical Solution
Step 1: Data Preparation (pandas/SQL)
First, we'll prepare the data using pandas and SQL. We'll use a sample dataset containing customer information, purchase history, and churn status.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Sample dataset
data = {
'customer_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'age': [25, 32, 43, 28, 35, 40, 45, 38, 48, 50],
'purchase_history': [100, 200, 300, 150, 250, 350, 400, 300, 450, 500],
'churn': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
}
df = pd.DataFrame(data)
# Split data into training and testing sets (toy split: only 2 test rows here, so stratify to keep one example of each class)
X = df.drop(['churn', 'customer_id'], axis=1)
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Scale features with StandardScaler; note that tree-based models like Random Forest and XGBoost do not require scaling, but it is harmless and kept here for consistency
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Equivalent SQL to create and populate the dataset:
CREATE TABLE customers (
customer_id INT PRIMARY KEY,
age INT,
purchase_history DECIMAL(10, 2),
churn BOOLEAN
);
INSERT INTO customers (customer_id, age, purchase_history, churn)
VALUES
(1, 25, 100.00, FALSE),
(2, 32, 200.00, TRUE),
(3, 43, 300.00, FALSE),
(4, 28, 150.00, TRUE),
(5, 35, 250.00, FALSE),
(6, 40, 350.00, TRUE),
(7, 45, 400.00, FALSE),
(8, 38, 300.00, TRUE),
(9, 48, 450.00, FALSE),
(10, 50, 500.00, TRUE);
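If the data lives in a database rather than a hard-coded dict, the same table can be loaded straight into pandas with read_sql. A minimal sketch, using an in-memory SQLite database as a stand-in for the real warehouse (the connection and the three sample rows are illustrative):

```python
import sqlite3
import pandas as pd

# Illustrative: an in-memory SQLite database standing in for the real warehouse
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INT PRIMARY KEY, age INT, "
    "purchase_history DECIMAL(10, 2), churn BOOLEAN)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [(1, 25, 100.00, 0), (2, 32, 200.00, 1), (3, 43, 300.00, 0)],
)

# Load the table straight into a DataFrame for the modeling steps that follow
df = pd.read_sql("SELECT customer_id, age, purchase_history, churn FROM customers", conn)
conn.close()
print(df.shape)
```

With a production database you would swap the sqlite3 connection for your warehouse's connector; the read_sql call stays the same.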
Step 2: Analysis Pipeline
Next, we'll create an analysis pipeline using Random Forest and XGBoost.
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)
# XGBoost model
xgb_model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
xgb_model.fit(X_train_scaled, y_train)
Step 3: Model Evaluation Code
Now, we'll evaluate the performance of both models on the held-out test set.
# Predictions
rf_pred = rf_model.predict(X_test_scaled)
xgb_pred = xgb_model.predict(X_test_scaled)
# Evaluation metrics
rf_accuracy = accuracy_score(y_test, rf_pred)
xgb_accuracy = accuracy_score(y_test, xgb_pred)
print("Random Forest Accuracy:", rf_accuracy)
print("XGBoost Accuracy:", xgb_accuracy)
# Classification report
print("Random Forest Classification Report:")
print(classification_report(y_test, rf_pred))
print("XGBoost Classification Report:")
print(classification_report(y_test, xgb_pred))
# Confusion matrix
print("Random Forest Confusion Matrix:")
print(confusion_matrix(y_test, rf_pred))
print("XGBoost Confusion Matrix:")
print(confusion_matrix(y_test, xgb_pred))
Step 4: Performance Evaluation
To compare business impact, we'll estimate a simplified ROI figure. Note that treating the misclassification rate (1 - accuracy) as a churn rate is a rough proxy for illustration only; a real analysis would model the revenue retained from correctly targeting at-risk customers.
# Simplified ROI: expected revenue retained per customer at a given churn rate
def calculate_roi(churn_rate, revenue):
    return (1 - churn_rate) * revenue
# Assume an average revenue of $1,000 per customer
average_revenue = 1000
# Proxy churn rate for the Random Forest model: its misclassification rate
rf_churn_rate = 1 - rf_accuracy
rf_roi = calculate_roi(rf_churn_rate, average_revenue)
# Proxy churn rate for the XGBoost model
xgb_churn_rate = 1 - xgb_accuracy
xgb_roi = calculate_roi(xgb_churn_rate, average_revenue)
print("Random Forest ROI:", rf_roi)
print("XGBoost ROI:", xgb_roi)
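With only ten rows, a single train/test split leaves just two test samples, so the accuracy numbers above are noisy. A sketch using k-fold cross-validation (scikit-learn's cross_val_score, on the same toy data) gives a more stable estimate:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Same toy dataset as in Step 1
data = {
    'age': [25, 32, 43, 28, 35, 40, 45, 38, 48, 50],
    'purchase_history': [100, 200, 300, 150, 250, 350, 400, 300, 450, 500],
    'churn': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
}
df = pd.DataFrame(data)
X, y = df[['age', 'purchase_history']], df['churn']

# 5-fold CV: each fold holds out 2 rows, averaging out split luck
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y, cv=5
)
print(scores.mean())
```

The same call works for XGBClassifier, so both models can be compared on identical folds.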
Step 5: Production Deployment
Finally, we'll deploy the best-performing model to production.
# Deploy the best-performing model
if rf_accuracy > xgb_accuracy:
    best_model = rf_model
else:
    best_model = xgb_model
# Save the best-performing model
import pickle
with open('best_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)
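Before wiring the artifact into a serving path, it's worth verifying that the pickled model round-trips and scores a new record. A minimal sketch (the tiny training set and the [age, purchase_history] query row are illustrative; feature order must match training):

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative stand-in for the trained model above
X_train = np.array([[25, 100], [32, 200], [43, 300], [28, 150]])
y_train = np.array([0, 1, 0, 1])
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train, y_train)

# Round-trip through pickle, as the deployment step does
blob = pickle.dumps(model)
loaded = pickle.loads(blob)

# Feature order must match training: [age, purchase_history]
pred = loaded.predict(np.array([[30, 180]]))
print(int(pred[0]))
```

In production, the same pickle.load call runs inside the scoring service against the saved best_model.pkl file.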
Edge Cases
- Handling imbalanced datasets: Use techniques like oversampling the minority class, undersampling the majority class, or using class weights.
- Handling missing values: Use techniques like mean/median imputation, interpolation, or using a machine learning model to predict missing values.
Scaling Tips
- Use distributed computing frameworks like Apache Spark or Dask to scale up the analysis pipeline.
- Use cloud-based services like AWS SageMaker or Google Cloud AI Platform to deploy the model to production.
- Use model interpretability techniques like feature importance or partial dependence plots to understand the model's behavior.
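For the interpretability point, both fitted models expose per-feature importances directly. A sketch on the toy dataset from Step 1 (same column names; the Random Forest stands in for either model, since XGBClassifier exposes the same attribute):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy dataset from Step 1
data = {
    'age': [25, 32, 43, 28, 35, 40, 45, 38, 48, 50],
    'purchase_history': [100, 200, 300, 150, 250, 350, 400, 300, 450, 500],
    'churn': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
}
df = pd.DataFrame(data)
X, y = df[['age', 'purchase_history']], df['churn']

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# feature_importances_ sums to 1.0 across features
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```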
By following this guide, data analysts can build and compare Random Forest and XGBoost models, evaluate their performance, and deploy the stronger model to production, ultimately driving retention and revenue growth.