amal org
Data Analyst Guide: Mastering ML Ops: Why 87% of Models Never Reach Production

Business Problem Statement

The majority of machine learning models never reach production, and the stalled projects represent a significant sunk cost for businesses. One widely cited industry estimate puts the failure rate at 87%, with the main culprits being poor data quality, inadequate testing, and lack of scalability. In this tutorial, we will work through a real-world scenario and provide a step-by-step technical solution to overcome these challenges.

Let's consider a retail company that wants to predict customer churn using machine learning. The company has a large customer database and wants to identify the factors that contribute to churn. The goal is to develop a model that can predict churn with high accuracy and deploy it to production.

The ROI impact of deploying a successful churn prediction model can be significant. For example, if the company can reduce churn by 10%, it can result in an additional $1 million in revenue per year.

Step-by-Step Technical Solution

Step 1: Data Preparation (pandas/SQL)

The first step is to prepare the data for analysis. We will use a combination of pandas and SQL to load and preprocess the data.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data from SQL database
import sqlite3
conn = sqlite3.connect('customer_database.db')

# SQL query to load data
query = """
    SELECT 
        customer_id,
        age,
        gender,
        average_order_value,
        total_orders,
        churn
    FROM 
        customer_data
"""

# Execute query and load data into pandas dataframe
df = pd.read_sql_query(query, conn)

# Close database connection
conn.close()

# Print first few rows of dataframe
print(df.head())
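If the customer table is too large to load in one read, pandas can stream the same query in chunks via the `chunksize` parameter of `read_sql_query`. A minimal sketch using a small in-memory SQLite table (the table name and columns here are illustrative stand-ins for `customer_data`):

```python
import sqlite3
import pandas as pd

# Build a small in-memory table standing in for customer_data
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_data (customer_id INTEGER, total_orders INTEGER)"
)
conn.executemany(
    "INSERT INTO customer_data VALUES (?, ?)",
    [(i, i * 2) for i in range(10)],
)

# chunksize turns the result into an iterator of small DataFrames
chunks = pd.read_sql_query(
    "SELECT customer_id, total_orders FROM customer_data",
    conn,
    chunksize=4,
)

# Process each chunk independently, keeping only the aggregate
total = sum(chunk["total_orders"].sum() for chunk in chunks)
print(total)  # 90: the sum of 0, 2, 4, ..., 18
conn.close()
```

Each chunk is an ordinary DataFrame, so per-chunk preprocessing keeps memory bounded regardless of table size.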

Step 2: Analysis Pipeline

Next, we will develop an analysis pipeline to explore the data and identify the factors that contribute to churn.

# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Drop the identifier column (it carries no signal and would leak into
# the model) and one-hot encode the categorical gender column, since
# StandardScaler cannot handle string values
X = df.drop(['churn', 'customer_id'], axis=1)
X = pd.get_dummies(X, columns=['gender'], drop_first=True)
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale data using StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train random forest classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train_scaled, y_train)

# Make predictions on testing set
y_pred = rfc.predict(X_test_scaled)

# Evaluate model performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Step 3: Model/Visualization Code

We will use the trained model to make predictions and visualize the results.

# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Make predictions on testing set
y_pred_proba = rfc.predict_proba(X_test_scaled)[:, 1]

# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc_value = auc(fpr, tpr)
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % auc_value)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()

Step 4: Performance Evaluation

We will evaluate the model's performance using various metrics.

# Calculate metrics (precision_score, recall_score, and f1_score were
# not imported earlier, so import them here)
from sklearn.metrics import precision_score, recall_score, f1_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)  # avoid shadowing the f1_score function

# Print metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
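Because churners are usually a small minority, accuracy alone can mislead, and the default 0.5 probability threshold is worth tuning. A NumPy-only sketch (the scores and labels below are synthetic, not the model's actual output) showing how lowering the threshold trades precision for recall:

```python
import numpy as np

# Synthetic predicted probabilities and true churn labels (illustrative)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_proba = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6, 0.7, 0.25, 0.9])

def precision_recall_at(threshold):
    # Classify as churn when the predicted probability meets the threshold
    y_hat = (y_proba >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A lower threshold catches more churners at the cost of precision
for t in (0.3, 0.5):
    p, r = precision_recall_at(t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

In a retention campaign, recall is often the metric to favor: contacting a few extra non-churners is cheaper than missing a churner entirely.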

Step 5: Production Deployment

We will expose the model behind a lightweight Flask API, which can then be containerized and hosted on a cloud platform.

# Import necessary libraries (sklearn.externals.joblib was removed in
# scikit-learn 0.23; import joblib directly instead)
import joblib
from flask import Flask, request, jsonify

# Save the model and the fitted scaler to disk; the scaler must ship
# with the model so inference inputs are transformed identically
joblib.dump(rfc, 'churn_prediction_model.pkl')
joblib.dump(scaler, 'churn_scaler.pkl')

# Load model and scaler from file
loaded_rfc = joblib.load('churn_prediction_model.pkl')
loaded_scaler = joblib.load('churn_scaler.pkl')

# Create Flask app
app = Flask(__name__)

# Define API endpoint
@app.route('/predict', methods=['POST'])
def predict():
    # Get input data from request: a JSON list of feature rows in the
    # same column order used during training
    input_data = request.get_json()

    # Apply the same scaling used at training time before predicting
    scaled = loaded_scaler.transform(input_data)
    predictions = loaded_rfc.predict(scaled)

    # Return predictions as JSON response
    return jsonify({'predictions': predictions.tolist()})

# Run Flask app
if __name__ == '__main__':
    app.run(debug=True)
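The endpoint expects a JSON list of feature rows in the same column order the model was trained on. A client-side sketch of building such a payload (the column names, feature values, and URL below are illustrative assumptions, not part of the dataset above):

```python
import json

# Feature columns must match the training order; gender_M assumes the
# gender column was one-hot encoded upstream
feature_columns = ["age", "average_order_value", "total_orders", "gender_M"]

# Two hypothetical customers to score
rows = [
    [34, 52.5, 12, 1],
    [61, 18.0, 2, 0],
]

# Serialize to the JSON body the /predict endpoint expects
payload = json.dumps(rows)
print(payload)

# With the server running locally, the call would look like:
#   requests.post("http://localhost:5000/predict",
#                 data=payload,
#                 headers={"Content-Type": "application/json"})
```

Pinning the column order on the client side (or validating it on the server) prevents the silent misprediction that occurs when features arrive shuffled.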

Metrics/ROI Calculations

We will calculate the ROI of deploying the churn prediction model.

# Calculate ROI as net gain divided by investment
revenue_increase = 0.1     # 10% reduction in churn
current_revenue = 1000000  # $1 million in annual revenue at stake
investment = 100000        # assumed $100,000 project cost
additional_revenue = revenue_increase * current_revenue

roi = (additional_revenue - investment) / investment

# Print ROI (the revenue gain recurs each year while the build cost is
# one-off, so the return compounds in later years)
print("ROI:", roi)

Edge Cases

We will handle edge cases such as missing values and outliers.

# Handle missing values
from sklearn.impute import SimpleImputer

# Create imputer object
imputer = SimpleImputer(strategy='mean')

# Fit imputer to data
imputer.fit(X_train)

# Transform data
X_train_imputed = imputer.transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Handle outliers (HuberRegressor lives in sklearn.linear_model, not
# sklearn.robust; note it is a regression model, so on a binary churn
# target it serves only as a robustness check on the feature/target
# relationship, not as the classifier itself)
from sklearn.linear_model import HuberRegressor

# Create Huber regressor object
huber = HuberRegressor()

# Fit Huber regressor to the imputed data
huber.fit(X_train_imputed, y_train)
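An alternative that keeps the problem a classification one is to clip extreme feature values to an interquartile-range band before scaling, rather than switching to a robust regressor. A minimal NumPy sketch (the `k=1.5` multiplier is the conventional Tukey fence, an assumption worth tuning per feature):

```python
import numpy as np

def clip_outliers_iqr(X, k=1.5):
    # Clip each column to [Q1 - k*IQR, Q3 + k*IQR]
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    return np.clip(X, q1 - k * iqr, q3 + k * iqr)

# One column whose extreme value gets pulled back to the upper fence
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_clipped = clip_outliers_iqr(X)
print(X_clipped.ravel())  # [1. 2. 3. 4. 7.]
```

Fit the fences on the training split only and reuse them at inference time, for the same reason the scaler is fit on `X_train` alone.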

Scaling Tips

We will provide scaling tips for large datasets.

# Use parallel processing (iterating a DataFrame directly yields column
# names, so convert to a NumPy array and iterate over rows instead)
from joblib import Parallel, delayed

# Define function to parallelize
def parallelize_function(row):
    # Perform computation on one row of features
    return row ** 2

# Parallelize the function across all available CPU cores
results = Parallel(n_jobs=-1)(
    delayed(parallelize_function)(row) for row in X_train.to_numpy()
)

# Use distributed computing
from dask.distributed import Client

# Create client object (spins up a local cluster by default)
client = Client()

# Define function to distribute
def distribute_function(row):
    # Perform computation on one row of features
    return row ** 2

# Distribute the function across workers and gather the results
futures = client.map(distribute_function, list(X_train.to_numpy()))
results = client.gather(futures)

By following these steps and tips, data analysts can master ML Ops and deploy successful machine learning models to production, resulting in significant ROI increases for businesses.
