
amal org

Data Analyst Guide: Mastering Random Forest vs XGBoost: Which Wins for Analytics?


Business Problem Statement

In the retail industry, predicting customer churn is crucial for maintaining a loyal customer base and maximizing revenue. A leading e-commerce company wants to develop a predictive model to identify customers who are likely to churn. The goal is to proactively offer personalized promotions and improve customer retention, resulting in a significant ROI impact.

Step-by-Step Technical Solution

Step 1: Data Preparation (pandas/SQL)

First, we'll prepare the data using pandas and SQL. We'll assume a PostgreSQL database with a customers table (customer attributes plus a binary churn label) and a transactions table containing transaction history.

import pandas as pd
import psycopg2
from sklearn.model_selection import train_test_split

# Connect to PostgreSQL database
conn = psycopg2.connect(
    host="localhost",
    database="retail_db",
    user="username",
    password="password"
)

# SQL query to retrieve customer data.
# "churned" is assumed to be a binary label column on the customers table.
sql_query = """
    SELECT c.customer_id, c.age, c.gender, c.location, c.churned,
           SUM(t.transaction_amount) AS total_spend,
           COUNT(t.transaction_id) AS transaction_count
    FROM customers c
    JOIN transactions t ON c.customer_id = t.customer_id
    GROUP BY c.customer_id, c.age, c.gender, c.location, c.churned
"""

# Execute SQL query and store results in a pandas DataFrame
df = pd.read_sql_query(sql_query, conn)

# Close database connection
conn.close()

# Define features (X) and target variable (y).
# One-hot encode the categorical columns so the tree models can consume them.
X = pd.get_dummies(df.drop(["customer_id", "churned"], axis=1),
                   columns=["gender", "location"])
y = df["churned"]

# Split data into training and testing sets (stratify to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

Step 2: Analysis Pipeline

Next, we'll create an analysis pipeline using scikit-learn to train and evaluate our models.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from xgboost import XGBClassifier

# Train Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Train XGBoost model. Note: this requires the xgboost package;
# scikit-learn's GradientBoostingClassifier is a related but distinct
# implementation and would not be a fair stand-in for XGBoost.
xgb_model = XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
xgb_model.fit(X_train, y_train)
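Step 5 later loads pickled models from disk, so the fitted models need to be persisted after training. A minimal sketch using joblib on tiny synthetic data (the filenames match those assumed in the deployment step; `X_small`/`y_small` are stand-ins for the real training data):

```python
import joblib
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic data as a stand-in for X_train / y_train
X_small = [[0, 0], [1, 1], [0, 1], [1, 0]]
y_small = [0, 1, 0, 1]

model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_small, y_small)

# Persist the fitted model, then restore it as the API would
joblib.dump(model, "random_forest_model.pkl")
restored = joblib.load("random_forest_model.pkl")
```

In a real pipeline you would call `joblib.dump(rf_model, ...)` and `joblib.dump(xgb_model, ...)` right after the `fit` calls above.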

Step 3: Model/Visualization Code

Now, we'll evaluate the performance of our models using various metrics and visualize the results.

# Make predictions on test data
rf_pred = rf_model.predict(X_test)
xgb_pred = xgb_model.predict(X_test)

# Evaluate model performance
rf_accuracy = accuracy_score(y_test, rf_pred)
xgb_accuracy = accuracy_score(y_test, xgb_pred)

print("Random Forest Accuracy:", rf_accuracy)
print("XGBoost Accuracy:", xgb_accuracy)

print("Random Forest Classification Report:")
print(classification_report(y_test, rf_pred))

print("XGBoost Classification Report:")
print(classification_report(y_test, xgb_pred))

# Visualize confusion matrices
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
sns.heatmap(confusion_matrix(y_test, rf_pred), annot=True, fmt="d", cmap="Blues")
plt.title("Random Forest Confusion Matrix")

plt.subplot(1, 2, 2)
sns.heatmap(confusion_matrix(y_test, xgb_pred), annot=True, fmt="d", cmap="Blues")
plt.title("XGBoost Confusion Matrix")

plt.show()

Step 4: Performance Evaluation

To compare business value, we'll compute a simplified, accuracy-based ROI estimate for each model.

# Calculate a rough, illustrative ROI proxy from model accuracy.
# The dollar figure and retention rate below are assumptions, not measured values.
def calculate_roi(accuracy, avg_order_value=100, customer_retention_rate=0.2):
    # Simplified proxy that scales expected retained revenue by the odds of a
    # correct prediction. A production estimate should be built from the
    # confusion matrix and real campaign costs instead.
    return (accuracy * avg_order_value * customer_retention_rate) / (1 - accuracy)

rf_roi_impact = calculate_roi(rf_accuracy)
xgb_roi_impact = calculate_roi(xgb_accuracy)

print(f"Random Forest ROI proxy: ${rf_roi_impact:.2f}")
print(f"XGBoost ROI proxy: ${xgb_roi_impact:.2f}")
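The accuracy-based formula above is only a rough proxy. A sketch of a more defensible estimate built from confusion-matrix counts; all dollar amounts and rates here are assumed for illustration:

```python
# Illustrative campaign-value estimate from confusion-matrix counts.
def campaign_value(tp, fp, fn, avg_order_value=100,
                   retention_rate=0.2, offer_cost=10):
    # Revenue saved: churners we correctly flagged and managed to retain
    saved_revenue = tp * retention_rate * avg_order_value
    # Cost: every targeted customer (true or false positive) receives an offer
    campaign_cost = (tp + fp) * offer_cost
    # Missed churners (fn) add no campaign cost but represent lost revenue
    return saved_revenue - campaign_cost

# Example: 100 churners caught, 50 false alarms, 20 missed
print(campaign_value(tp=100, fp=50, fn=20))  # 2000 saved - 1500 spent = 500
```

The `tp`/`fp`/`fn` counts come straight from `confusion_matrix(y_test, rf_pred)`, so the two models can be compared in dollars rather than accuracy points.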

Step 5: Production Deployment

Finally, we'll deploy our models to a production environment using a RESTful API.

import joblib
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load trained models (persisted earlier with joblib.dump).
# Note: "from sklearn.externals import joblib" is long deprecated;
# use the standalone joblib package instead.
rf_model = joblib.load("random_forest_model.pkl")
xgb_model = joblib.load("xgboost_model.pkl")

# Define API endpoint
@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()

    # Build a one-row frame and apply the same one-hot encoding used in
    # training; reindex fills any columns absent from this request with 0.
    features = pd.get_dummies(pd.DataFrame([{
        "age": data["age"],
        "gender": data["gender"],
        "location": data["location"],
        "total_spend": data["total_spend"],
        "transaction_count": data["transaction_count"],
    }])).reindex(columns=rf_model.feature_names_in_, fill_value=0)

    # Make predictions (cast to int so the result is JSON-serializable)
    rf_pred = int(rf_model.predict(features)[0])
    xgb_pred = int(xgb_model.predict(features)[0])

    # Return predictions
    return jsonify({"random_forest": rf_pred, "xgboost": xgb_pred})

if __name__ == "__main__":
    app.run(debug=True)

Edge Cases

  • Handling missing values: We can use imputation techniques such as mean, median, or mode to handle missing values.
  • Handling outliers: We can use techniques such as winsorization or trimming to handle outliers.
  • Handling class imbalance: We can use techniques such as oversampling the minority class, undersampling the majority class, or using class weights to handle class imbalance.
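A framework-free sketch of two of these ideas, median imputation and winsorization; in practice you would reach for pandas' `fillna` and a library winsorizer, but the underlying logic is just this:

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    med = median(v for v in values if v is not None)
    return [med if v is None else v for v in values]

def winsorize(values, lower, upper):
    """Clip extreme values to the given bounds."""
    return [min(max(v, lower), upper) for v in values]

print(impute_median([10, None, 30, 20]))   # None -> 20 (median of 10, 30, 20)
print(winsorize([1, 50, 10_000], 5, 100))  # 1 -> 5, 10000 -> 100
```

For class imbalance, tree ensembles also accept weighting directly, e.g. `RandomForestClassifier(class_weight="balanced")` or XGBoost's `scale_pos_weight` parameter.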

Scaling Tips

  • Use distributed computing frameworks such as Apache Spark or Dask to scale our models.
  • Use cloud-based services such as AWS SageMaker or Google Cloud AI Platform to deploy our models.
  • Use model pruning or quantization to reduce the size of our models and improve inference speed.
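Before reaching for Spark or Dask, scoring in bounded-size batches is often enough to keep memory flat. A framework-free sketch of the idea (`predict_fn` is a stand-in for `model.predict`; the toy lambda below is purely illustrative):

```python
def predict_in_batches(predict_fn, rows, batch_size=1000):
    """Score rows in fixed-size chunks to bound peak memory use."""
    predictions = []
    for start in range(0, len(rows), batch_size):
        # Each chunk is scored independently, so memory stays O(batch_size)
        predictions.extend(predict_fn(rows[start:start + batch_size]))
    return predictions

# Toy predict_fn: flag rows whose first feature is below 50
flag_low = lambda batch: [1 if row[0] < 50 else 0 for row in batch]
print(predict_in_batches(flag_low, [[30], [80], [10]], batch_size=2))  # [1, 0, 1]
```

The same chunking pattern is what Dask applies across partitions; swapping `predict_fn` for `rf_model.predict` makes this a drop-in way to score millions of rows.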

By following these steps, we can develop a predictive model that accurately identifies customers who are likely to churn and provides a significant ROI impact. We can also deploy our model to a production environment using a RESTful API and scale our model using distributed computing frameworks or cloud-based services.
