Data Analyst Guide: Mastering ML Ops: Why 87% of Models Never Reach Production
Business Problem Statement
The majority of machine learning models never reach production, resulting in significant losses for businesses. An often-cited industry estimate puts the figure at 87% of models failing to make it to production, with the main reasons being poor data quality, inadequate testing, and lack of scalability. In this tutorial, we will walk through a real-world scenario and a step-by-step technical solution to these challenges.
Let's consider a retail company that wants to predict customer churn using machine learning. The company has a large customer database and wants to identify the factors that contribute to churn. The goal is to develop a model that can predict churn with high accuracy and deploy it to production.
The ROI impact of deploying a successful churn prediction model can be significant. For example, if the company can reduce churn by 10%, it can result in an additional $1 million in revenue per year.
Step-by-Step Technical Solution
Step 1: Data Preparation (pandas/SQL)
The first step is to prepare the data for analysis. We will use a combination of pandas and SQL to load and preprocess the data.
```python
import sqlite3
import pandas as pd

# Connect to the SQL database
conn = sqlite3.connect('customer_database.db')

# SQL query to load data
query = """
SELECT
    customer_id,
    age,
    gender,
    average_order_value,
    total_orders,
    churn
FROM
    customer_data
"""

# Execute the query and load the result into a pandas DataFrame
df = pd.read_sql_query(query, conn)

# Close the database connection
conn.close()

# Inspect the first few rows
print(df.head())
```
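Since poor data quality is the first failure cause cited above, it is worth auditing the data immediately after loading. Below is a minimal sketch of such an audit; the small inline DataFrame is a hypothetical stand-in for the query result, using the same column names as the query above.

```python
import pandas as pd
import numpy as np

# Hypothetical sample standing in for the query result
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, np.nan, 51, 29],
    "gender": ["F", "M", "M", None],
    "average_order_value": [52.0, 48.5, -10.0, 75.2],
    "total_orders": [12, 3, 7, 1],
    "churn": [0, 1, 0, 1],
})

# Missing values per column
missing = df.isna().sum()

# Simple sanity checks: monetary values should be non-negative,
# and the target should be strictly binary
bad_aov = (df["average_order_value"] < 0).sum()
assert set(df["churn"].unique()) <= {0, 1}

print(missing)
print("Negative average_order_value rows:", bad_aov)
```

Catching a negative order value or a half-empty column here is far cheaper than discovering it after deployment.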
Step 2: Analysis Pipeline
Next, we will develop an analysis pipeline to explore the data and identify the factors that contribute to churn.
```python
# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Prepare features: drop the identifier and one-hot encode the categorical
# gender column (a raw string column would break StandardScaler below)
X = pd.get_dummies(df.drop(['customer_id', 'churn'], axis=1), columns=['gender'])
y = df['churn']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale data using StandardScaler (fit on the training split only to avoid leakage)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train random forest classifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = rfc.predict(X_test_scaled)

# Evaluate model performance
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
```
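Inadequate testing is the second failure cause cited above, and a single train/test split can give an optimistic picture. A sketch of k-fold cross-validation follows; it uses synthetic data from `make_classification` as a stand-in for the churn features, and puts scaling inside a pipeline so each fold is scaled only on its own training split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn features (the real X, y come from the steps above)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=42)

# Pipeline keeps scaling inside each fold, preventing train/test leakage
pipeline = make_pipeline(StandardScaler(),
                         RandomForestClassifier(n_estimators=100, random_state=42))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring='accuracy')

print("Fold accuracies:", scores.round(3))
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```

A large spread between folds is an early warning that the model will not generalize in production.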
Step 3: Model/Visualization Code
We will use the trained model to make predictions and visualize the results.
```python
# Import necessary libraries
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Predicted churn probabilities on the test set
y_pred_proba = rfc.predict_proba(X_test_scaled)[:, 1]

# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc_value = auc(fpr, tpr)
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % auc_value)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
```
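The ROC curve summarizes every possible decision threshold, but to act on predictions in production you have to pick one. One common heuristic is Youden's J statistic (maximize TPR minus FPR). The sketch below uses synthetic labels and scores as stand-ins for `y_test` and `y_pred_proba` above.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic stand-ins for y_test and y_pred_proba from the step above
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)
scores = np.clip(y_true * 0.4 + rng.normal(0.3, 0.2, size=200), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)

# Youden's J statistic: the threshold maximizing TPR - FPR
j = tpr - fpr
best_threshold = thresholds[np.argmax(j)]
print("Best threshold by Youden's J: %.3f" % best_threshold)
```

In a churn setting, you may instead want to weight false negatives (missed churners) more heavily than false positives, since a retention offer is usually cheaper than a lost customer.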
Step 4: Performance Evaluation
We will evaluate the model's performance using various metrics.
```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Calculate metrics (note: don't name a variable f1_score,
# or it will shadow the sklearn function)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```
Step 5: Production Deployment
We will deploy the model to production using a cloud-based platform.
```python
# Import necessary libraries (sklearn.externals.joblib was removed
# from scikit-learn; import joblib directly)
import joblib
import numpy as np
from flask import Flask, request, jsonify

# Save the model AND the scaler: incoming requests must be scaled
# the same way the training data was
joblib.dump(rfc, 'churn_prediction_model.pkl')
joblib.dump(scaler, 'churn_scaler.pkl')

# Load artifacts
loaded_rfc = joblib.load('churn_prediction_model.pkl')
loaded_scaler = joblib.load('churn_scaler.pkl')

# Create Flask app
app = Flask(__name__)

# Define API endpoint
@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON body like {"features": [[...], [...]]}
    input_data = request.get_json()
    features = np.array(input_data['features'])
    # Apply the same scaling used at training time
    features_scaled = loaded_scaler.transform(features)
    predictions = loaded_rfc.predict(features_scaled)
    # Return predictions as a JSON response
    return jsonify({'predictions': predictions.tolist()})

# Run Flask app
if __name__ == '__main__':
    app.run(debug=True)
```
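Before pointing real traffic at the endpoint, it helps to exercise it in-process with Flask's built-in test client. The sketch below wires a tiny stub model into the same endpoint shape, so it runs without the trained artifacts; in the real service, the loaded model and scaler above would be used instead.

```python
import numpy as np
from flask import Flask, request, jsonify

# Stub "model" so the endpoint can be exercised without trained artifacts;
# it predicts 1 when the feature sum is positive
class StubModel:
    def predict(self, X):
        return (np.asarray(X).sum(axis=1) > 0).astype(int)

app = Flask(__name__)
model = StubModel()

@app.route('/predict', methods=['POST'])
def predict():
    features = np.array(request.get_json()['features'])
    return jsonify({'predictions': model.predict(features).tolist()})

# Exercise the endpoint with Flask's test client (no running server needed)
with app.test_client() as client:
    resp = client.post('/predict', json={'features': [[1.0, 2.0], [-3.0, 1.0]]})
    print(resp.get_json())  # {'predictions': [1, 0]}
```

The same pattern slots into a pytest suite, which addresses the "inadequate testing" failure mode at the API layer as well as the model layer.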
Metrics/ROI Calculations
We will calculate the ROI of deploying the churn prediction model.
```python
# Calculate ROI as (gain - cost) / cost, using the scenario above:
# a 10% churn reduction retains ~$1 million in annual revenue
additional_revenue = 1_000_000   # revenue retained per year
investment = 100_000             # assumed cost of building and deploying the model
roi = (additional_revenue - investment) / investment

# Print ROI
print("ROI:", roi)  # 9.0, i.e. a 900% return on the investment
```
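A single point estimate can mislead stakeholders, so it is worth showing ROI under several churn-reduction scenarios. The sketch below assumes, purely for illustration, that retained revenue scales linearly with the reduction achieved and that the $100,000 investment is fixed.

```python
# Sensitivity check: ROI across hypothetical churn-reduction scenarios
investment = 100_000  # assumed fixed deployment cost, as above

for reduction_pct, added_revenue in [(5, 500_000), (10, 1_000_000), (20, 2_000_000)]:
    roi = (added_revenue - investment) / investment
    print(f"{reduction_pct}% churn reduction -> ROI {roi:.1f}x")
```

Even the conservative 5% scenario clears the investment, which is the kind of framing that gets a deployment budget approved.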
Edge Cases
We will handle edge cases such as missing values and outliers.
```python
# Handle missing values
from sklearn.impute import SimpleImputer
import numpy as np

# Fit the imputer on the training split only, then apply to both splits
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Handle outliers by clipping extreme values to the 1st/99th training percentiles.
# (Note: HuberRegressor lives in sklearn.linear_model, not sklearn.robust, and as a
# regression model it is not appropriate for a binary churn target anyway.)
low, high = np.percentile(X_train_imputed, [1, 99], axis=0)
X_train_clipped = np.clip(X_train_imputed, low, high)
X_test_clipped = np.clip(X_test_imputed, low, high)
```
Scaling Tips
We will provide scaling tips for large datasets.
```python
# Use parallel processing across row chunks with joblib.
# (Iterating a DataFrame directly yields column names, not rows,
# so split the underlying array into blocks instead.)
import numpy as np
from joblib import Parallel, delayed

def process_chunk(chunk):
    # Placeholder computation on a block of rows
    return chunk ** 2

chunks = np.array_split(X_train_scaled, 8)
results = Parallel(n_jobs=-1)(delayed(process_chunk)(chunk) for chunk in chunks)

# Use distributed computing with Dask when the data outgrows one machine
from dask.distributed import Client

client = Client()  # starts a local cluster by default
futures = client.map(process_chunk, chunks)
results = client.gather(futures)
```
By following these steps and tips, data analysts can master ML Ops and deploy successful machine learning models to production, resulting in significant ROI increases for businesses.