Data Analyst Guide: Random Forest vs. XGBoost: Which Wins for Analytics?
Business Problem Statement
In the retail industry, predicting customer churn is crucial to maintaining a loyal customer base and maximizing revenue. A leading e-commerce company wants to identify the most effective machine learning model to predict customer churn, with the goal of reducing churn by 15% and increasing revenue by 10%. The company has a large dataset of customer information, including demographic data, purchase history, and browsing behavior.
Step-by-Step Technical Solution
Step 1: Data Preparation (pandas/SQL)
First, we need to prepare the data for analysis. We will use a sample dataset of 10,000 customers, with 20 features, including demographic data, purchase history, and browsing behavior.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Load the dataset
df = pd.read_csv('customer_data.csv')
# Drop rows with missing values (imputation alternatives are covered under Edge Cases below)
df.dropna(inplace=True)
# Encode the binary categorical variables
df['gender'] = df['gender'].map({'male': 0, 'female': 1})
df['churn'] = df['churn'].map({'yes': 1, 'no': 0})
# One-hot encode the remaining categorical features so the models receive numeric input
df = pd.get_dummies(df, columns=['purchase_history', 'browsing_behavior'])
# Separate the features and the target
X = df.drop('churn', axis=1)
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
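Churn datasets are usually imbalanced, so it is worth checking the class balance and stratifying the split so that the train and test sets keep the same churn rate. A minimal sketch on synthetic stand-in data (the column names mirror the example above; `stratify=y` is the key addition):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn dataset: ~20% churners
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'age': rng.integers(18, 70, size=1000),
    'gender': rng.integers(0, 2, size=1000),
    'churn': (rng.random(1000) < 0.2).astype(int),
})

X = demo.drop('churn', axis=1)
y = demo['churn']

# stratify=y keeps the churn rate (roughly) identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(round(y_train.mean(), 2), round(y_test.mean(), 2))
```

The same `stratify=y` argument drops straight into the real split above.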
Alternatively, we can use SQL to prepare the data:
CREATE TABLE customer_data (
id INT PRIMARY KEY,
gender VARCHAR(10),
age INT,
purchase_history VARCHAR(100),
browsing_behavior VARCHAR(100),
churn VARCHAR(10)
);
INSERT INTO customer_data (id, gender, age, purchase_history, browsing_behavior, churn)
VALUES
(1, 'male', 25, 'high', 'frequent', 'yes'),
(2, 'female', 30, 'medium', 'occasional', 'no'),
(3, 'male', 35, 'low', 'rare', 'yes'),
...;
SELECT * FROM customer_data WHERE churn = 'yes';
Step 2: Analysis Pipeline
Next, we will create an analysis pipeline using scikit-learn to train and evaluate the Random Forest and XGBoost models.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Define the hyperparameter tuning space for Random Forest
param_grid_rf = {
'n_estimators': [100, 200, 300],
'max_depth': [5, 10, 15],
'min_samples_split': [2, 5, 10]
}
# Define the hyperparameter tuning space for XGBoost
param_grid_xgb = {
'n_estimators': [100, 200, 300],
'max_depth': [5, 10, 15],
'learning_rate': [0.01, 0.1, 0.3]
}
# Perform hyperparameter tuning for Random Forest
grid_search_rf = GridSearchCV(RandomForestClassifier(random_state=42), param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_rf.fit(X_train, y_train)
# Perform hyperparameter tuning for XGBoost
grid_search_xgb = GridSearchCV(XGBClassifier(random_state=42, eval_metric='logloss'), param_grid_xgb, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_xgb.fit(X_train, y_train)
# Train the best-performing models
best_rf = grid_search_rf.best_estimator_
best_xgb = grid_search_xgb.best_estimator_
# Make predictions on the test set
y_pred_rf = best_rf.predict(X_test)
y_pred_xgb = best_xgb.predict(X_test)
# Evaluate the models
print("Random Forest:")
print("Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Classification Report:")
print(classification_report(y_test, y_pred_rf))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
print("XGBoost:")
print("Accuracy:", accuracy_score(y_test, y_pred_xgb))
print("Classification Report:")
print(classification_report(y_test, y_pred_xgb))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_xgb))
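Because churners are typically a minority class, accuracy can look high even for a model that never flags anyone; ROC AUC, computed from predicted probabilities, is a more robust comparison metric. A sketch on synthetic imbalanced data using `RandomForestClassifier` (the same two calls work unchanged on the fitted `best_rf` and `best_xgb` with the real test set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic dataset: ~10% positives (churners)
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Accuracy alone is misleading here: predicting "no churn" for everyone already scores ~0.90
acc = accuracy_score(y_test, clf.predict(X_test))
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"accuracy={acc:.3f}  roc_auc={auc:.3f}")
```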
Step 3: Model/Visualization Code
We can use matplotlib and seaborn to visualize the results:
import matplotlib.pyplot as plt
import seaborn as sns
# Plot the confusion matrix for Random Forest
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix(y_test, y_pred_rf), annot=True, cmap='Blues')
plt.xlabel("Predicted labels")
plt.ylabel("True labels")
plt.title("Random Forest Confusion Matrix")
plt.show()
# Plot the confusion matrix for XGBoost
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix(y_test, y_pred_xgb), annot=True, cmap='Blues')
plt.xlabel("Predicted labels")
plt.ylabel("True labels")
plt.title("XGBoost Confusion Matrix")
plt.show()
Step 4: Performance Evaluation
As a rough first pass, we can translate model accuracy into an estimated ROI; note that this treats every error as equally costly:
# Define the revenue retained per correct prediction and the cost per error
revenue_per_customer = 100
cost_per_customer = 50
# Calculate a rough ROI estimate for Random Forest
acc_rf = accuracy_score(y_test, y_pred_rf)
roi_rf = (acc_rf * revenue_per_customer - (1 - acc_rf) * cost_per_customer) / cost_per_customer
print("Random Forest ROI:", roi_rf)
# Calculate a rough ROI estimate for XGBoost
acc_xgb = accuracy_score(y_test, y_pred_xgb)
roi_xgb = (acc_xgb * revenue_per_customer - (1 - acc_xgb) * cost_per_customer) / cost_per_customer
print("XGBoost ROI:", roi_xgb)
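An accuracy-based ROI treats every error the same. A finer estimate prices each cell of the confusion matrix separately: in churn terms, a missed churner (false negative) usually costs far more than an unnecessary retention offer (false positive). A sketch with illustrative, assumed dollar values:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Assumed per-customer economics (illustrative values, not from the source data)
value_saved_per_retained = 100   # revenue kept when a true churner is caught (TP)
offer_cost = 10                  # retention offer sent to anyone flagged (TP and FP)
lost_revenue_per_miss = 100      # churner we failed to flag (FN)

def campaign_profit(y_true, y_pred):
    """Net profit of a retention campaign driven by the model's churn flags."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return (tp * (value_saved_per_retained - offer_cost)
            - fp * offer_cost
            - fn * lost_revenue_per_miss)

# Tiny worked example: 3 TP, 1 FP, 1 FN, 3 TN
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])
print(campaign_profit(y_true, y_pred))  # 3*90 - 1*10 - 1*100 = 160
```

Calling `campaign_profit(y_test, y_pred_rf)` and `campaign_profit(y_test, y_pred_xgb)` then compares the two models in dollars rather than accuracy points.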
Step 5: Production Deployment
We can persist the best-performing model and load it in a production service, for example one hosted on AWS or Google Cloud:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the joblib package directly
# Save the best-performing model to a file
joblib.dump(best_xgb, 'best_model.pkl')
# Load the model from the file
loaded_model = joblib.load('best_model.pkl')
# Make predictions on new data; new rows must go through the same encoding as the training data
new_data = pd.DataFrame({'gender': [0], 'age': [25], 'purchase_history': ['high'], 'browsing_behavior': ['frequent']})
new_data = pd.get_dummies(new_data, columns=['purchase_history', 'browsing_behavior'])
new_data = new_data.reindex(columns=X_train.columns, fill_value=0)
new_prediction = loaded_model.predict(new_data)
print("New prediction:", new_prediction)
Edge Cases:
- Handling missing values: We can use imputation techniques such as mean, median, or mode to handle missing values.
- Handling outliers: We can use techniques such as winsorization or trimming to handle outliers.
- Handling class imbalance: We can use techniques such as oversampling the minority class, undersampling the majority class, or using class weights to handle class imbalance.
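The first and third edge cases map directly onto scikit-learn tools: `SimpleImputer` fills missing values instead of dropping rows, and `class_weight='balanced'` reweights training toward the minority churn class (XGBoost's analogue is `scale_pos_weight`). A minimal sketch on toy data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Feature matrix with missing values and an imbalanced target (~25% positives)
X = np.array([[25, np.nan], [30, 50], [np.nan, 40], [35, 60],
              [40, 55], [22, np.nan], [28, 45], [33, 52]])
y = np.array([1, 0, 0, 0, 0, 1, 0, 0])

# Median imputation and a class-weighted forest chained in one pipeline,
# so the same preprocessing is applied at fit and predict time
model = make_pipeline(
    SimpleImputer(strategy='median'),
    RandomForestClassifier(class_weight='balanced', random_state=42),
)
model.fit(X, y)
print(model.predict(X))
```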
Scaling Tips:
- Use parallel processing to speed up computation
- Use distributed computing to scale up to large datasets
- Use cloud-based platforms to deploy models to production
- Use automated hyperparameter tuning to optimize model performance
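In scikit-learn, the first and last tips correspond to concrete arguments: `n_jobs=-1` parallelizes tree building and cross-validation across all cores, and `RandomizedSearchCV` samples the hyperparameter space rather than exhaustively enumerating it, which scales far better than `GridSearchCV` as the grid grows. A sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
}

# n_iter=5 samples 5 of the 27 combinations; n_jobs=-1 uses every available core
search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_dist, n_iter=5, cv=3, scoring='accuracy',
    n_jobs=-1, random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```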
By following these steps, we can master the Random Forest and XGBoost algorithms and deploy them to production to solve real-world problems.