amal org

Data Analyst Guide: Mastering Overfitting: The Silent Killer of ML Projects

===========================================================

Business Problem Statement


For a retail company, an overfit model produces sales forecasts that look accurate in backtesting but break down on new data, leading to inventory mismanagement and financial losses. At the scale of a retailer like Walmart, with thousands of products and stores, even small systematic forecast errors compound into millions of dollars. The ROI impact can be substantial, with potential losses on the order of 10% to 30% of affected sales.

Let's consider a retail company with the following sales data:

| Product | Store | Sales |
|---------|-------|-------|
| A       | 1     | 100   |
| A       | 2     | 120   |
| B       | 1     | 80    |
| B       | 2     | 100   |
| ...     | ...   | ...   |

The goal is to develop a machine learning model that can accurately predict sales for each product in each store.

Step-by-Step Technical Solution


Step 1: Data Preparation (pandas/SQL)

First, we need to prepare our data for analysis. We'll use pandas to load and manipulate the data, and SQL to query the data from a database.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data from CSV file
data = pd.read_csv('sales_data.csv')

# Convert categorical variables to numerical variables
data['Product'] = pd.Categorical(data['Product']).codes
data['Store'] = pd.Categorical(data['Store']).codes

# Define features (X) and target variable (y)
X = data[['Product', 'Store']]
y = data['Sales']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features using StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

SQL query to retrieve data from a database:

SELECT Product, Store, Sales
FROM sales_data
WHERE Product IN ('A', 'B', 'C')
AND Store IN (1, 2, 3);

Step 2: Analysis Pipeline

Next, we'll create an analysis pipeline to evaluate the performance of different machine learning models.

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree Regressor': DecisionTreeRegressor(),
    'Random Forest Regressor': RandomForestRegressor()
}

# Train and evaluate models
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f'{name}: MSE = {mse:.2f}')
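Since overfitting is the subject of this guide, it's worth making the diagnosis explicit: compare training error against test error for each model, and treat a large gap as the overfitting signature. A minimal, self-contained sketch using synthetic data in place of the sales CSV (which isn't included here):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the sales data (the real CSV is not included here)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = X[:, 0] * 10 + X[:, 1] * 5 + rng.normal(0, 5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# An unconstrained decision tree can memorize the training set exactly
model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f'Train MSE: {train_mse:.2f}, Test MSE: {test_mse:.2f}')
# A near-zero train MSE with a much larger test MSE is classic overfitting
```

Running the same comparison across all three models from Step 2 shows which of them generalize rather than memorize.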

Step 3: Model/Visualization Code

Now, let's visualize the performance of the best-performing model using a scatter plot.

import matplotlib.pyplot as plt

# Train and evaluate the best-performing model
best_model = RandomForestRegressor()
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)

# Create a scatter plot
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales')
plt.title('Sales Prediction')
plt.show()

Step 4: Performance Evaluation

To evaluate the performance of the model, we'll calculate the mean squared error (MSE) and the coefficient of determination (R-squared).

from sklearn.metrics import r2_score

# Calculate MSE and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'MSE: {mse:.2f}, R-squared: {r2:.2f}')
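The deployment step below expects a serialized model file on disk, so the fitted model must be persisted first. A sketch, assuming `best_model` is the fitted `RandomForestRegressor` from Step 3 (the tiny inline fit here is just a placeholder to keep the snippet runnable):

```python
import joblib
from sklearn.ensemble import RandomForestRegressor

# Placeholder fit standing in for the best_model trained in Step 3
best_model = RandomForestRegressor(random_state=42)
best_model.fit([[0, 0], [1, 1], [0, 1], [1, 0]], [100, 120, 80, 110])

# Persist the model; the file name matches what the Flask app loads
joblib.dump(best_model, 'best_model.pkl')

# Round-trip check: reload and predict
reloaded = joblib.load('best_model.pkl')
print(reloaded.predict([[0, 1]]))
```

If the `StandardScaler` from Step 1 is part of the pipeline, persist it alongside the model (or wrap both in a `sklearn.pipeline.Pipeline`) so the API can apply the same transform.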

Step 5: Production Deployment

Finally, we'll deploy the model to a production environment using a RESTful API.

from flask import Flask, request, jsonify
import pandas as pd
import joblib  # sklearn.externals.joblib was removed; use the standalone joblib package

app = Flask(__name__)

# Load the trained model
model = joblib.load('best_model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    X = pd.DataFrame(data)
    # Note: transform X with the StandardScaler fitted in Step 1 before predicting
    y_pred = model.predict(X)
    return jsonify({'prediction': y_pred.tolist()})

if __name__ == '__main__':
    app.run(debug=True)  # disable debug mode when deploying for real
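Once the service is running, clients send feature rows as JSON to the `/predict` endpoint. A client-side sketch (the `requests.post` call is commented out because it needs the server live; host and port are Flask's defaults, and the feature values are illustrative encoded codes):

```python
import json

# Feature rows in the list-of-dicts shape that pd.DataFrame(data) expects
payload = [{"Product": 0, "Store": 1}, {"Product": 1, "Store": 0}]
body = json.dumps(payload)
print(body)

# With the server running:
# import requests
# resp = requests.post("http://127.0.0.1:5000/predict", json=payload)
# print(resp.json())  # {"prediction": [...]}
```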

Metrics/ROI Calculations


To calculate the ROI of the model, we'll use the following metrics:

  • Mean squared error (MSE)
  • Coefficient of determination (R-squared)
  • Return on investment (ROI)
# Rough proxy for ROI: error reduction relative to a zero-sales baseline
# (a modeling metric, not a financial ROI in the accounting sense)
roi = (1 - (mse / np.mean(y_test**2))) * 100
print(f'ROI: {roi:.2f}%')

Edge Cases


To handle edge cases, we'll implement the following:

  • Data preprocessing: handle missing values and outliers
  • Model selection: select the best-performing model based on cross-validation
  • Hyperparameter tuning: tune hyperparameters using grid search or random search
# Handle missing values (numeric columns only)
data.fillna(data.mean(numeric_only=True), inplace=True)

# Drop rows where any numeric column lies more than 3 standard deviations from its mean
# (boolean-indexing a DataFrame with a 2-D mask would insert NaNs, not filter rows)
data = data[(np.abs((data - data.mean()) / data.std()) <= 3).all(axis=1)]

# Select the best-performing model
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f'Cross-validation score: {scores.mean():.2f}')
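The hyperparameter-tuning bullet above can be sketched with `GridSearchCV`; the parameter grid here is illustrative rather than tuned, and synthetic data stands in for the sales CSV. Capping `max_depth` is itself a regularizer against overfitting:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training data
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(100, 2))
y_train = X_train[:, 0] * 10 + rng.normal(0, 5, size=100)

# Illustrative grid: shallower trees trade training fit for generalization
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [3, 5, None],
}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      cv=5, scoring='neg_mean_squared_error')
search.fit(X_train, y_train)
print('Best params:', search.best_params_)
```

For larger grids, `RandomizedSearchCV` with a fixed `n_iter` budget is the usual drop-in replacement.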

Scaling Tips


To scale the model, we'll use the following techniques:

  • Data parallelism: split the data into smaller chunks and process them in parallel
  • Model parallelism: split the model into smaller components and train them in parallel
  • Distributed computing: use distributed computing frameworks like Apache Spark or Hadoop
# Use data parallelism: train independent models on separate chunks of the data
from joblib import Parallel, delayed
from sklearn.base import clone

def train_chunk(X, y):
    m = clone(model)  # fresh, unfitted copy per chunk
    m.fit(X, y)
    return m

chunks = zip(np.array_split(X_train, 5), np.array_split(y_train, 5))
models = Parallel(n_jobs=-1)(delayed(train_chunk)(Xc, yc) for Xc, yc in chunks)

# Use model parallelism: a random forest's trees are independent components,
# so scikit-learn can build them in parallel across cores via n_jobs
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, n_jobs=-1)
model.fit(X_train, y_train)

# Use distributed computing with Spark MLlib
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName('Sales Prediction').getOrCreate()
sdf = spark.createDataFrame(data)  # pandas DataFrame with Product, Store, Sales

# MLlib expects a single vector column of features and a label column
assembler = VectorAssembler(inputCols=['Product', 'Store'], outputCol='features')
sdf = assembler.transform(sdf)

lr = LinearRegression(featuresCol='features', labelCol='Sales')
lr_model = lr.fit(sdf)
