
Luke


How I Stopped Losing Track of My ML Experiments (and You Should Too)

The backstory — The spreadsheet from hell

Let me tell you about the worst week of my data science career.

I was working as a freelance data scientist for a mid-size telecom company. They had a churn problem — customers were leaving, and they wanted a model to predict who was about to leave so they could intervene with targeted offers. Classic use case.

I dove in. Tested a Logistic Regression. Then a Random Forest. Then XGBoost. Then LightGBM. Tweaked the hyperparameters. Changed the feature engineering. Tried SMOTE for class imbalance. Tried class weights instead. Changed the threshold from 0.5 to 0.3. Went back to 0.4.

After three days, I had 47 notebook cells with different model configurations, a spreadsheet where I was manually logging results, and absolutely no idea which combination of preprocessing + model + hyperparameters + threshold had given me the best score.

I had lost track of my own experiments. I was doing data science like a chef who throws ingredients into a pot without writing down the recipe — sure, I got something that tasted OK, but I couldn't reproduce it.

That's when my manager Slack'd me a link to MLflow with just one line:

"Maybe stop using Excel for experiment tracking?"

That weekend, I rebuilt the entire pipeline with MLflow. And it changed the way I work permanently. This article is what I wish someone had shown me before that hellish week.


What you'll learn (and build)

By the end of this article, you'll have a complete ML pipeline with:

  • Experiment tracking — every model run logged automatically (parameters, metrics, artifacts)
  • A visual UI to compare all your experiments side by side
  • A Model Registry to version and manage your best models
  • Hyperparameter optimization with Optuna, fully tracked
  • A custom business metric that goes beyond accuracy
  • Model interpretability with SHAP (global and local explanations)
  • Model serving — turning your best model into an API endpoint with one command

Here's the big picture:

┌────────────────────────────────────────────────────────────┐
│                      YOUR ML PIPELINE                      │
│                                                            │
│  ┌──────────┐   ┌─────────────┐   ┌─────────────────┐      │
│  │  Data    │──▶│  Training   │──▶│  Evaluation     │      │
│  │  Prep    │   │  + Tracking │   │  + Comparison   │      │
│  └──────────┘   └──────┬──────┘   └────────┬────────┘      │
│                        │                   │               │
│                   MLflow logs          MLflow UI           │
│                (params, metrics)    (visualization)        │
│                        │                   │               │
│                        ▼                   ▼               │
│                ┌──────────────────────────────────┐        │
│                │      MLflow Model Registry       │        │
│                │   (versioning, staging, prod)    │        │
│                └─────────────────┬────────────────┘        │
│                                  │                         │
│                                  ▼                         │
│                ┌──────────────────────────────────┐        │
│                │       MLflow Model Serving       │        │
│                │        (REST API endpoint)       │        │
│                └──────────────────────────────────┘        │
└────────────────────────────────────────────────────────────┘

The use case: predicting customer churn for a telecom company. A binary classification problem with class imbalance, a business cost function, and real-world messiness.


Part 1 — What is MLOps, and why should I care?

The problem MLOps solves

When you learn ML in tutorials, the workflow is linear: load data → train model → evaluate → done. But in real life, ML is iterative. You try dozens of combinations, and you need to keep track of all of them.

Without MLOps:

  • "Which model gave 0.79 AUC? Was it the one with 200 trees or 500? Did I use SMOTE or class weights? What was the learning rate?"
  • You can't reproduce your best result
  • You can't compare experiments systematically
  • You deploy a model and have no idea which version is running

With MLOps:

  • Every experiment is automatically logged with all its parameters and metrics
  • You can compare any two runs in a visual dashboard
  • Your best models are versioned and tagged (staging, production)
  • Deployment is reproducible and auditable

MLOps is to ML what DevOps is to software engineering: the practices and tools that make the ML lifecycle manageable, reproducible, and collaborative.

MLOps lifecycle — from experimentation to production and monitoring


Why MLflow?

There are several MLOps tools out there: Weights & Biases, Neptune, ClearML, DVC... I chose MLflow for this project because:

  1. Open source and free — no vendor lock-in, no paid tier needed
  2. Lightweight — you can start with pip install mlflow and a single line of code
  3. Complete — tracking, registry, serving all in one tool
  4. Industry standard — used at thousands of companies, huge community
  5. Framework agnostic — works with scikit-learn, XGBoost, PyTorch, anything



Part 2 — Setting up the project

Install and structure

# Create project
mkdir churn-prediction-mlops
cd churn-prediction-mlops

# Virtual environment (always isolate your projects!)
python -m venv venv
source venv/bin/activate

# Create structure
mkdir -p data notebooks models
touch requirements.txt .gitignore

Dependencies

# requirements.txt
mlflow==2.16.0
scikit-learn==1.5.2
xgboost==2.1.1
lightgbm==4.5.0
optuna==4.0.0
shap==0.46.0
pandas==2.2.3
numpy==1.26.4
matplotlib==3.9.2
seaborn==0.13.2
imbalanced-learn==0.12.4
missingno==0.5.2
pip install -r requirements.txt

The dataset: Telco Customer Churn

We'll use the Telco Customer Churn dataset from Kaggle. It contains 7,043 customers with features like tenure, monthly charges, contract type, internet service, etc. The target is Churn (Yes/No) — did the customer leave?

It's a great dataset for learning because it has:

  • Class imbalance (~26% churn vs ~74% no churn)
  • A mix of numerical and categorical features
  • Enough complexity to require real feature engineering
  • A clear business context where false negatives (missing a churner) cost more than false positives
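To see why that last point matters, plug in some illustrative numbers — say a missed churner costs $500 in lost lifetime value and an unnecessary retention offer costs $50 (hypothetical figures, not from the dataset):

```python
cost_fn, cost_fp = 500, 50                 # hypothetical cost per error type

errors_a = {"fn": 20, "fp": 10}            # model A: precise, but misses churners
errors_b = {"fn": 10, "fp": 60}            # model B: noisier, but catches churners

cost_a = errors_a["fn"] * cost_fn + errors_a["fp"] * cost_fp
cost_b = errors_b["fn"] * cost_fn + errors_b["fp"] * cost_fp
print(cost_a, cost_b)  # 10500 8000
```

Model B makes 70 errors to A's 30, so plain accuracy prefers A — yet B costs the business less. That asymmetry is exactly what the custom metric later in this article is built around.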

Part 3 — Data preparation (the foundation)

Why this step matters more than model selection

Here's a truth that took me a while to accept: the quality of your data preparation determines 80% of your model's performance. You can swap a Random Forest for XGBoost and gain 1-2% AUC. But proper feature engineering, handling of missing values, and encoding can gain you 10-15%.

Don't rush this step.

# ============================================================
# DATA PREPARATION
# Complete pipeline: loading, cleaning, encoding, splitting
# ============================================================

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
import warnings
warnings.filterwarnings("ignore")

# ------------------------------------------------------------
# 1. LOAD THE DATA
# ------------------------------------------------------------
# Download from Kaggle: https://www.kaggle.com/datasets/blastchar/telco-customer-churn
df = pd.read_csv("data/WA_Fn-UseC_-Telco-Customer-Churn.csv")

print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"\nTarget distribution:")
print(df["Churn"].value_counts(normalize=True))
# Expected: ~74% No, ~26% Yes — clear imbalance

# ------------------------------------------------------------
# 2. DATA QUALITY AUDIT
# Before doing anything, we need to understand what we're working with.
# Skipping this step is the #1 cause of silent bugs in ML pipelines.
# ------------------------------------------------------------

# --- Check for duplicates ---
# Duplicates can inflate model performance artificially
# (same sample in train AND test = data leakage)
n_duplicates = df.duplicated().sum()
print(f"Duplicate rows: {n_duplicates}")
if n_duplicates > 0:
    print("  → Dropping duplicates to avoid data leakage")
    df = df.drop_duplicates().reset_index(drop=True)

# --- Check data types and missing values ---
# A quick summary of every column: type, non-null count, sample values
print(f"\nData types overview:")
print(df.dtypes.value_counts())
print(f"\nMissing values per column:")
missing = df.isnull().sum()
print(missing[missing > 0] if missing.any() else "  No NaN detected (but watch out for hidden ones!)")

# --- Visualize missing values ---
# missingno gives a quick visual of the missing data pattern
# Patterns matter: random missingness ≠ systematic missingness
import missingno as msno
import matplotlib.pyplot as plt

fig = msno.matrix(df, figsize=(12, 5), sparkline=False)
plt.title("Missing Values Matrix — white lines = missing data")
plt.tight_layout()
plt.show()

# --- Visualize target distribution ---
# This is the FIRST thing to check in any classification task.
# Severe imbalance requires specific handling (which we'll do later)
fig, ax = plt.subplots(figsize=(6, 4))
df["Churn"].value_counts().plot(kind="bar", ax=ax, color=["steelblue", "salmon"])
ax.set_title("Target Distribution — Churn")
ax.set_xlabel("Churn (No=0, Yes=1)")
ax.set_ylabel("Count")
ax.set_xticklabels(["No (stayed)", "Yes (churned)"], rotation=0)
for i, v in enumerate(df["Churn"].value_counts().values):
    ax.text(i, v + 50, f"{v} ({v/len(df)*100:.1f}%)", ha="center", fontweight="bold")
plt.tight_layout()
plt.show()

# --- Visualize distributions of numerical features ---
# Helps spot outliers, skewed distributions, and unexpected patterns
# Note: TotalCharges is still stored as strings at this point (we clean
# it in the next section), so coerce to numeric just for plotting
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for i, col in enumerate(["tenure", "MonthlyCharges", "TotalCharges"]):
    values = pd.to_numeric(df[col], errors="coerce").dropna()
    # Histogram
    axes[0, i].hist(values, bins=30, color="steelblue", edgecolor="white", alpha=0.8)
    axes[0, i].set_title(f"Distribution of {col}")
    axes[0, i].set_xlabel(col)
    # Boxplot to spot outliers
    axes[1, i].boxplot(values, vert=False)
    axes[1, i].set_title(f"Boxplot of {col}")
plt.suptitle("Numerical Feature Distributions", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()

# ------------------------------------------------------------
# 3. CLEAN THE DATA
# ------------------------------------------------------------

# TotalCharges has some blank strings that should be NaN
# This is a common real-world issue: numeric columns imported as strings
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Check how many NaN we created
print(f"\nMissing TotalCharges: {df['TotalCharges'].isna().sum()}")

# These are new customers (tenure=0) — fill with 0
df["TotalCharges"] = df["TotalCharges"].fillna(0)

# Drop customerID — it's an identifier, not a feature
# Including it would be data leakage (each ID is unique to one customer)
df = df.drop("customerID", axis=1)

# ------------------------------------------------------------
# 4. ENCODE CATEGORICAL VARIABLES
# ------------------------------------------------------------

# Identify column types
categorical_cols = df.select_dtypes(include=["object"]).columns.tolist()
categorical_cols.remove("Churn")  # We'll handle the target separately

print(f"\nCategorical columns to encode: {categorical_cols}")

# Binary columns (Yes/No, Male/Female) → Label Encoding (0/1)
# Multi-class columns → One-Hot Encoding
# This distinction matters! Label encoding imposes an artificial order
# that makes no sense for nominal categories like "InternetService"

binary_cols = [
    col for col in categorical_cols
    if df[col].nunique() == 2
]
multiclass_cols = [
    col for col in categorical_cols
    if df[col].nunique() > 2
]

print(f"Binary columns (LabelEncode): {binary_cols}")
print(f"Multi-class columns (OneHot): {multiclass_cols}")

# Label encode binary columns
le = LabelEncoder()
for col in binary_cols:
    df[col] = le.fit_transform(df[col])

# One-hot encode multi-class columns
df = pd.get_dummies(df, columns=multiclass_cols, drop_first=True)
# drop_first=True avoids the "dummy variable trap" (perfect multicollinearity)

# Encode the target
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})

# ------------------------------------------------------------
# 5. FEATURE ENGINEERING
# ------------------------------------------------------------

# Create derived features that capture business logic.
# These are the kind of features that separate a good model from a great one.

# Average monthly charge over the customer's lifetime
# A high ratio might indicate a recent price increase (churn risk)
df["AvgChargePerMonth"] = np.where(
    df["tenure"] > 0,
    df["TotalCharges"] / df["tenure"],
    df["MonthlyCharges"]  # For new customers, use current monthly charge
)

# Charge variation: how much does the current monthly charge
# deviate from the historical average? High deviation = warning sign
df["ChargeVariation"] = df["MonthlyCharges"] - df["AvgChargePerMonth"]

# Tenure buckets: new customers (<12 months) churn much more
df["IsNewCustomer"] = (df["tenure"] < 12).astype(int)

print(f"\nFinal dataset shape: {df.shape}")

# ------------------------------------------------------------
# 6. SPLIT THE DATA
# ------------------------------------------------------------

# Separate features and target
X = df.drop("Churn", axis=1)
y = df["Churn"]

# Stratified split to preserve class balance in both sets
# This is critical when classes are imbalanced!
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # Ensures both sets have ~27% churn
)

print(f"Train: {X_train.shape[0]} samples")
print(f"Test:  {X_test.shape[0]} samples")
print(f"Train churn rate: {y_train.mean():.3f}")
print(f"Test churn rate:  {y_test.mean():.3f}")

# ------------------------------------------------------------
# 7. SCALE NUMERICAL FEATURES
# ------------------------------------------------------------
# Tree-based models don't need scaling, but it helps for
# Logistic Regression and neural networks. We scale anyway
# so all models get the same input.

numerical_cols = X_train.select_dtypes(include=[np.number]).columns.tolist()
scaler = StandardScaler()

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test_scaled[numerical_cols] = scaler.transform(X_test[numerical_cols])
# IMPORTANT: fit on train only, transform both. Never fit on test data!
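One subtlety in the encoding step above is worth seeing concretely. A tiny sketch (toy values, not the pipeline's actual columns) of why label encoding misleads on nominal categories while one-hot does not:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(["DSL", "Fiber optic", "No"])

# Label encoding invents an order: DSL(0) < Fiber optic(1) < No(2)
print(LabelEncoder().fit_transform(s))  # [0 1 2]

# One-hot encoding keeps the categories orderless (drop_first avoids
# the dummy variable trap, as in the pipeline above)
print(pd.get_dummies(s, prefix="Internet", drop_first=True))
```

A linear model fed the label-encoded column would treat "No internet" as somehow "more" than fiber — a meaningless ordering. The one-hot columns carry no such artifact.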



Part 4 — MLflow Experiment Tracking (the game-changer)

What is experiment tracking?

Every time you train a model, you make choices: which algorithm, which hyperparameters, which preprocessing, which features. Experiment tracking means automatically recording all these choices AND the results, so you can compare, reproduce, and audit any experiment later.

Think of it as a lab notebook for data science. A biologist wouldn't run 50 experiments without writing down the protocol and results. Neither should you.

How MLflow tracking works

MLflow organizes things in a hierarchy:

Experiment (e.g., "Churn Prediction")
  └── Run 1: Logistic Regression, C=1.0, AUC=0.72
  └── Run 2: Random Forest, n_estimators=100, AUC=0.78
  └── Run 3: XGBoost, lr=0.1, depth=5, AUC=0.81
  └── Run 4: XGBoost, lr=0.05, depth=7, AUC=0.83
  └── ...

Each run automatically logs:

  • Parameters: hyperparameters, settings, choices you made
  • Metrics: AUC, accuracy, F1-score, your custom business metric
  • Artifacts: saved model files, plots, confusion matrices
  • Tags: anything else you want to note (dataset version, author, etc.)

MLflow UI showing experiment runs with parameters and metrics side by side


Your first tracked experiment

# ============================================================
# MLFLOW EXPERIMENT TRACKING
# Training multiple models with automatic logging
# ============================================================

import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import (
    roc_auc_score, f1_score, accuracy_score,
    classification_report, confusion_matrix
)
import matplotlib.pyplot as plt
import seaborn as sns

# ------------------------------------------------------------
# 1. CONFIGURE MLFLOW
# ------------------------------------------------------------

# Set the tracking URI FIRST, before creating the experiment
# (default is ./mlruns in the current directory)
# For a remote server: mlflow.set_tracking_uri("http://your-server:5000")
mlflow.set_tracking_uri("file:./mlruns")

# Set the experiment name — all runs will be grouped under this
mlflow.set_experiment("Churn_Prediction")

# ------------------------------------------------------------
# 2. DEFINE THE BUSINESS COST FUNCTION
# ------------------------------------------------------------
# This is THE key concept of this project.
#
# In churn prediction, not all errors are equal:
# - False Negative (FN): we MISS a churner → they leave → we lose
#   their lifetime value (say, $500)
# - False Positive (FP): we flag a loyal customer as churner → we
#   give them a retention offer they didn't need (say, $50)
#
# FN is 10x more expensive than FP.
# A model that minimizes accuracy might NOT minimize business cost.
# That's why we need a custom metric.

def business_cost(y_true, y_pred, cost_fn=10, cost_fp=1):
    """
    Calculate the total business cost of prediction errors.

    Parameters:
    -----------
    y_true : array-like — actual labels (0 = stayed, 1 = churned)
    y_pred : array-like — predicted labels (0 or 1)
    cost_fn : int — cost multiplier for False Negatives
                     (missed churners — lost revenue)
    cost_fp : int — cost multiplier for False Positives
                     (unnecessary retention offers)

    Returns:
    --------
    float — total normalized cost (lower is better)
    """
    # Confusion matrix: [[TN, FP], [FN, TP]]
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()

    # Total cost = weighted sum of errors
    total_cost = (fn * cost_fn) + (fp * cost_fp)

    # Normalize by number of samples for comparability
    return total_cost / len(y_true)


def find_optimal_threshold(y_true, y_proba, cost_fn=10, cost_fp=1):
    """
    Find the classification threshold that minimizes business cost.

    Instead of using the default 0.5 threshold (which assumes equal
    error costs), we sweep from 0.1 to 0.9 and pick the one that
    minimizes our business cost function.

    Returns:
    --------
    tuple — (best_threshold, best_cost, all_thresholds, all_costs)
    """
    thresholds = np.arange(0.1, 0.91, 0.01)
    costs = []

    for t in thresholds:
        # Convert probabilities to binary predictions at threshold t
        y_pred = (y_proba >= t).astype(int)
        cost = business_cost(y_true, y_pred, cost_fn, cost_fp)
        costs.append(cost)

    best_idx = np.argmin(costs)
    return thresholds[best_idx], costs[best_idx], thresholds, costs


# ------------------------------------------------------------
# 3. TRAIN AND TRACK MULTIPLE MODELS
# ------------------------------------------------------------
# We'll train 4 different models, each in its own MLflow run.
# Cross-validation ensures robustness.
# Everything is logged automatically.

# Define models to test
models = {
    "LogisticRegression": LogisticRegression(
        max_iter=1000,
        class_weight="balanced",  # Handle class imbalance
        random_state=42
    ),
    "RandomForest": RandomForestClassifier(
        n_estimators=200,
        max_depth=10,
        class_weight="balanced",  # Handle class imbalance
        random_state=42,
        n_jobs=-1  # Use all CPU cores
    ),
    "XGBoost": XGBClassifier(
        n_estimators=200,
        max_depth=5,
        learning_rate=0.1,
        scale_pos_weight=len(y_train[y_train == 0]) / len(y_train[y_train == 1]),
        # scale_pos_weight compensates for class imbalance in XGBoost
        random_state=42,
        eval_metric="logloss",
        use_label_encoder=False
    ),
    "LightGBM": LGBMClassifier(
        n_estimators=200,
        max_depth=5,
        learning_rate=0.1,
        is_unbalance=True,  # LightGBM's way of handling imbalance
        random_state=42,
        verbose=-1  # Suppress training output
    ),
}

# Cross-validation setup
# StratifiedKFold preserves the class ratio in each fold
# This is CRITICAL with imbalanced data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Store results for comparison
results = {}

for model_name, model in models.items():
    # Each model gets its own MLflow run
    with mlflow.start_run(run_name=model_name):

        # --- Log model parameters ---
        # mlflow.log_params logs a dictionary of parameters
        params = model.get_params()
        # Only log the most relevant params (MLflow has a limit)
        key_params = {
            k: v for k, v in params.items()
            if v is not None and k not in ["random_state", "verbose", "n_jobs"]
        }
        mlflow.log_params(key_params)

        # --- Cross-validation ---
        # cv_scores gives us 5 AUC scores (one per fold)
        cv_scores = cross_val_score(
            model, X_train_scaled, y_train,
            cv=cv,
            scoring="roc_auc",
            n_jobs=-1
        )

        # --- Train on full training set ---
        model.fit(X_train_scaled, y_train)

        # --- Predict on test set ---
        y_pred = model.predict(X_test_scaled)
        y_proba = model.predict_proba(X_test_scaled)[:, 1]
        # predict_proba gives probability of class 1 (churn)

        # --- Calculate metrics ---
        auc = roc_auc_score(y_test, y_proba)
        f1 = f1_score(y_test, y_pred)
        accuracy = accuracy_score(y_test, y_pred)

        # Business cost at default threshold (0.5)
        cost_default = business_cost(y_test, y_pred)

        # Find optimal threshold
        best_threshold, best_cost, _, _ = find_optimal_threshold(
            y_test, y_proba
        )
        # Predictions at optimal threshold
        y_pred_optimal = (y_proba >= best_threshold).astype(int)
        f1_optimal = f1_score(y_test, y_pred_optimal)

        # --- Log metrics to MLflow ---
        mlflow.log_metric("cv_auc_mean", cv_scores.mean())
        mlflow.log_metric("cv_auc_std", cv_scores.std())
        mlflow.log_metric("test_auc", auc)
        mlflow.log_metric("test_f1", f1)
        mlflow.log_metric("test_accuracy", accuracy)
        mlflow.log_metric("business_cost_default_threshold", cost_default)
        mlflow.log_metric("business_cost_optimal", best_cost)
        mlflow.log_metric("optimal_threshold", best_threshold)
        mlflow.log_metric("f1_at_optimal_threshold", f1_optimal)

        # --- Log the trained model as an artifact ---
        mlflow.sklearn.log_model(model, "model")

        # --- Log confusion matrix as an image artifact ---
        fig, ax = plt.subplots(figsize=(6, 5))
        cm = confusion_matrix(y_test, y_pred_optimal)
        sns.heatmap(
            cm, annot=True, fmt="d", cmap="Blues", ax=ax,
            xticklabels=["Stayed", "Churned"],
            yticklabels=["Stayed", "Churned"]
        )
        ax.set_xlabel("Predicted")
        ax.set_ylabel("Actual")
        ax.set_title(f"{model_name} — Confusion Matrix (threshold={best_threshold:.2f})")
        plt.tight_layout()
        mlflow.log_figure(fig, f"confusion_matrix_{model_name}.png")
        plt.close()

        # --- Tag the run with useful metadata ---
        mlflow.set_tag("model_type", model_name)
        mlflow.set_tag("dataset", "telco_churn")
        mlflow.set_tag("imbalance_handling", "class_weight/scale_pos_weight")

        # Store for local comparison
        results[model_name] = {
            "cv_auc": f"{cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})",
            "test_auc": f"{auc:.4f}",
            "test_f1": f"{f1:.4f}",
            "business_cost": f"{best_cost:.4f}",
            "optimal_threshold": f"{best_threshold:.2f}",
        }

        print(f"{model_name:25s} | AUC: {auc:.4f} | "
              f"Cost: {best_cost:.4f} | Threshold: {best_threshold:.2f}")

# --- Summary table ---
print("\n" + "=" * 70)
results_df = pd.DataFrame(results).T
print(results_df)

Exploring a Neural Network: MLP with different activation functions

The project specification asks us to test multiple model families, including neural networks (MLP). More importantly, we need to explore the impact of activation functions — these are the non-linear transformations applied at each neuron that allow the network to learn complex patterns.

Without activation functions, a multi-layer network would collapse into a single linear model, no matter how many layers you add. The choice of activation function directly affects convergence speed, gradient flow, and the types of patterns the network can capture.
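That "collapses into a linear model" claim is a two-line matrix identity — composing linear maps gives another linear map. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(32, 10))   # "hidden layer" weights, no activation
W2 = rng.normal(size=(1, 32))    # "output layer" weights
x = rng.normal(size=10)

two_layers = W2 @ (W1 @ x)       # forward pass through both layers
one_layer = (W2 @ W1) @ x        # a single pre-multiplied linear layer

print(np.allclose(two_layers, one_layer))  # True — no extra expressive power
```

Insert a non-linearity like `np.maximum(0, W1 @ x)` between the layers and the equivalence breaks — which is the whole point of activation functions.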

# ============================================================
# MLP CLASSIFIER — ACTIVATION FUNCTION EXPLORATION
# Testing different activation functions and documenting their impact
# ============================================================

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

# We'll test the 3 activation functions available in scikit-learn's MLP:
#
# 1. 'relu' (Rectified Linear Unit): f(x) = max(0, x)
#    - Most popular in modern deep learning
#    - Fast convergence, avoids vanishing gradient (for positive values)
#    - Can cause "dead neurons" if too many negative inputs → output 0
#
# 2. 'tanh' (Hyperbolic Tangent): f(x) = tanh(x), output in [-1, 1]
#    - Zero-centered (unlike relu), which can help optimization
#    - Suffers from vanishing gradient for extreme values
#    - Often works well for small/medium networks
#
# 3. 'logistic' (Sigmoid): f(x) = 1/(1+exp(-x)), output in [0, 1]
#    - Classic activation, outputs interpretable as probabilities
#    - Severe vanishing gradient problem → slow training on deep networks
#    - Generally the weakest choice for hidden layers

activation_functions = {
    "relu": "ReLU — max(0, x). Fast, avoids vanishing gradient.",
    "tanh": "Tanh — output [-1,1]. Zero-centered, good for small nets.",
    "logistic": "Sigmoid — output [0,1]. Classic but prone to vanishing gradient.",
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
mlp_results = {}

for activation, description in activation_functions.items():
    with mlflow.start_run(run_name=f"MLP_{activation}"):
        # MLP architecture: 2 hidden layers of 64 and 32 neurons
        # early_stopping=True prevents overfitting by monitoring validation loss
        mlp = MLPClassifier(
            hidden_layer_sizes=(64, 32),     # 2 layers: 64 then 32 neurons
            activation=activation,            # The activation function we're testing
            solver="adam",                    # Adam optimizer (adaptive learning rate)
            max_iter=500,                     # Max training epochs
            early_stopping=True,              # Stop if validation loss stops improving
            validation_fraction=0.15,         # 15% of training data for validation
            n_iter_no_change=15,              # Patience: stop after 15 epochs without improvement
            random_state=42,
            learning_rate_init=0.001,         # Starting learning rate
        )

        # Cross-validation
        cv_scores = cross_val_score(
            mlp, X_train_scaled, y_train,
            cv=cv, scoring="roc_auc", n_jobs=-1
        )

        # Train on full training set
        mlp.fit(X_train_scaled, y_train)

        # Evaluate
        y_pred_mlp = mlp.predict(X_test_scaled)
        y_proba_mlp = mlp.predict_proba(X_test_scaled)[:, 1]
        auc_mlp = roc_auc_score(y_test, y_proba_mlp)
        f1_mlp = f1_score(y_test, y_pred_mlp)

        # Business cost at optimal threshold
        best_t_mlp, best_c_mlp, _, _ = find_optimal_threshold(y_test, y_proba_mlp)

        # Log to MLflow
        mlp_params = {
            "activation": activation,
            "hidden_layer_sizes": "(64, 32)",
            "solver": "adam",
            "max_iter": 500,
            "early_stopping": True,
            "learning_rate_init": 0.001,
        }
        mlflow.log_params(mlp_params)
        mlflow.log_metric("cv_auc_mean", cv_scores.mean())
        mlflow.log_metric("cv_auc_std", cv_scores.std())
        mlflow.log_metric("test_auc", auc_mlp)
        mlflow.log_metric("test_f1", f1_mlp)
        mlflow.log_metric("business_cost", best_c_mlp)
        mlflow.log_metric("optimal_threshold", best_t_mlp)
        mlflow.log_metric("n_epochs_actual", mlp.n_iter_)  # How many epochs before stopping
        mlflow.sklearn.log_model(mlp, "model")
        mlflow.set_tag("model_type", "MLP")
        mlflow.set_tag("activation_function", activation)

        mlp_results[activation] = {
            "cv_auc_mean": cv_scores.mean(),
            "cv_auc_std": cv_scores.std(),
            "test_auc": auc_mlp,
            "test_f1": f1_mlp,
            "business_cost": best_c_mlp,
            "epochs_trained": mlp.n_iter_,
        }

        print(f"MLP ({activation:10s}) | AUC: {auc_mlp:.4f} | "
              f"Cost: {best_c_mlp:.4f} | Epochs: {mlp.n_iter_:3d} | "
              f"CV AUC: {cv_scores.mean():.4f}")

# --- Compare activation functions ---
print("\n" + "=" * 70)
print("ACTIVATION FUNCTION COMPARISON")
print("=" * 70)
mlp_comparison = pd.DataFrame(mlp_results).T
print(mlp_comparison.round(4))

# --- Interpretation ---
# Typically you'll see:
# - relu: fastest convergence (fewer epochs), competitive AUC
# - tanh: similar or slightly lower AUC, more epochs needed
# - logistic: slowest convergence, often lowest AUC (vanishing gradient)
#
# For this tabular classification task, relu usually wins or ties with tanh.
# The logistic activation is generally outperformed on hidden layers.
#
# CONCLUSION: We justify choosing relu (or tanh if it wins) based on
# these experimental results, not just convention.

best_activation = min(mlp_results.keys(), key=lambda k: mlp_results[k]["business_cost"])
print(f"\nBest activation function: {best_activation}")
print(f"  → Reason: lowest business cost ({mlp_results[best_activation]['business_cost']:.4f})")
print(f"{activation_functions[best_activation]}")

The key takeaway here is that the choice of activation function should be empirical, not dogmatic. Yes, relu is the default in deep learning, but on a small tabular dataset with 2 hidden layers, tanh can sometimes compete. The important thing is to test, measure, and justify your choice with data.

Launching the MLflow UI

This is the moment where everything clicks. Run this in your terminal:

mlflow ui --port 5000

Then open http://127.0.0.1:5000 in your browser.

You'll see all your runs listed with every parameter and metric. You can sort by AUC, filter by model type, compare runs side by side, and inspect artifacts (confusion matrices, saved models).

The power of this becomes obvious when you have 20+ runs. Instead of scrolling through a spreadsheet, you have a proper dashboard.

[Screenshot: MLflow UI showing experiment comparison with metrics and parameters]


Part 5 — Hyperparameter Optimization with Optuna

Why Optuna instead of GridSearchCV?

GridSearchCV tries every possible combination of hyperparameters. If you have 3 values for max_depth and 4 for learning_rate, it trains 3 x 4 = 12 models. That's fine for small grids.

But in practice, you want to search broader. 10 hyperparameters with 5 values each = 5^10 = ~10 million combinations. GridSearch can't handle that.
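You can check that arithmetic with nothing but the standard library (the candidate values below are illustrative):

```python
import math
from itertools import product

# The small grid from above: 3 depths x 4 learning rates
small_grid = list(product([3, 5, 7], [0.01, 0.05, 0.1, 0.3]))
print(len(small_grid))  # 12 model fits (times 5 with 5-fold CV)

# The broader search: 10 hyperparameters, 5 candidate values each
n_combinations = math.prod([5] * 10)
print(n_combinations)  # 9765625: an exhaustive grid is hopeless here
```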

Optuna takes a smarter approach: its default sampler, TPE (Tree-structured Parzen Estimator), is a form of Bayesian optimization. It learns from previous trials which regions of the hyperparameter space are promising and focuses the search there. It's faster, and it scales to complex search spaces.

# ============================================================
# HYPERPARAMETER OPTIMIZATION WITH OPTUNA + MLFLOW TRACKING
# Every trial is automatically logged as an MLflow run
# ============================================================

import optuna
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Silence Optuna's verbose output
optuna.logging.set_verbosity(optuna.logging.WARNING)


def objective(trial):
    """
    Optuna objective function.

    This function is called once per trial. Optuna suggests
    hyperparameter values, we train the model, evaluate it,
    and return the score to minimize (business cost).

    Each trial is also logged as an MLflow run for full traceability.
    """

    # --- Optuna suggests hyperparameter values ---
    # suggest_int: integer in range
    # suggest_float: float in range (log=True for log-uniform sampling)
    # suggest_categorical: pick from a list
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500, step=50),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-4, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-4, 10.0, log=True),
    }

    # Create XGBoost with suggested params
    model = XGBClassifier(
        **params,
        scale_pos_weight=len(y_train[y_train == 0]) / len(y_train[y_train == 1]),
        random_state=42,
        eval_metric="logloss",
        use_label_encoder=False,  # no longer needed in XGBoost >= 1.6 (and later removed); drop it on recent versions
    )

    # Cross-validation for robust evaluation
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    cv_scores = cross_val_score(
        model, X_train_scaled, y_train,
        cv=cv, scoring="roc_auc", n_jobs=-1
    )

    # Train on full training set for test evaluation
    model.fit(X_train_scaled, y_train)
    y_proba = model.predict_proba(X_test_scaled)[:, 1]

    # Find optimal threshold for business cost
    best_threshold, best_cost, _, _ = find_optimal_threshold(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)

    # --- Log everything to MLflow ---
    with mlflow.start_run(
        run_name=f"optuna_trial_{trial.number}",
        nested=True  # Nest under the parent optimization run
    ):
        mlflow.log_params(params)
        mlflow.log_metric("cv_auc_mean", cv_scores.mean())
        mlflow.log_metric("cv_auc_std", cv_scores.std())
        mlflow.log_metric("test_auc", auc)
        mlflow.log_metric("business_cost", best_cost)
        mlflow.log_metric("optimal_threshold", best_threshold)
        mlflow.set_tag("optimization", "optuna")
        mlflow.set_tag("trial_number", trial.number)

    # Return the metric to MINIMIZE (Optuna minimizes by default)
    return best_cost


# --- Run the optimization ---
# Parent MLflow run wraps all trials
with mlflow.start_run(run_name="XGBoost_Optuna_Optimization"):
    study = optuna.create_study(
        direction="minimize",     # We want to minimize business cost
        study_name="xgboost_churn",
    )

    study.optimize(
        objective,
        n_trials=50,              # Number of trials to run
        show_progress_bar=True,
    )

    # --- Log the best result ---
    mlflow.log_params(study.best_params)
    mlflow.log_metric("best_business_cost", study.best_value)
    mlflow.set_tag("best_trial", study.best_trial.number)

    print(f"\nBest business cost: {study.best_value:.4f}")
    print(f"Best parameters: {study.best_params}")

After running this, check the MLflow UI again. You'll see 50+ nested runs under the optimization parent, each with its own parameters and metrics. You can sort by business_cost to find the best trial instantly.

Diagnosing overfitting and underfitting with learning curves

A model can have great training accuracy but fail on unseen data (overfitting), or perform poorly on both (underfitting). Learning curves are the diagnostic tool: they show how train and validation scores evolve as you increase training data size.

If train score is high and validation score is low → overfitting (the model memorized the training data).
If both scores are low → underfitting (the model is too simple or needs more features).
If both converge to a high value → sweet spot.
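Those three rules fit in a tiny helper. The score and gap thresholds below are illustrative rules of thumb, not universal constants:

```python
def diagnose(train_score, val_score, low=0.75, max_gap=0.05):
    """Classify a learning-curve endpoint into the three regimes above."""
    if train_score - val_score > max_gap:
        return "overfitting"      # model memorized the training data
    if train_score < low and val_score < low:
        return "underfitting"     # model too simple or features too weak
    return "sweet spot"           # both scores high and close together

print(diagnose(0.98, 0.80))  # overfitting
print(diagnose(0.68, 0.66))  # underfitting
print(diagnose(0.86, 0.84))  # sweet spot
```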

# ============================================================
# OVERFITTING/UNDERFITTING DIAGNOSIS
# Learning curves + regularization techniques
# ============================================================

from sklearn.model_selection import learning_curve

# --- Learning Curves ---
# We'll compare the baseline XGBoost (default params) vs the
# Optuna-optimized XGBoost to see if optimization improved generalization.

# 1. Baseline model (default hyperparameters, from our initial comparison)
baseline_xgb = XGBClassifier(
    n_estimators=200,
    max_depth=5,
    learning_rate=0.1,
    scale_pos_weight=len(y_train[y_train == 0]) / len(y_train[y_train == 1]),
    random_state=42,
    eval_metric="logloss",
    use_label_encoder=False,
)

# 2. Optimized model (from Optuna)
optimized_xgb = XGBClassifier(
    **study.best_params,
    scale_pos_weight=len(y_train[y_train == 0]) / len(y_train[y_train == 1]),
    random_state=42,
    eval_metric="logloss",
    use_label_encoder=False,
)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

for idx, (model, name) in enumerate([
    (baseline_xgb, "Baseline XGBoost (default)"),
    (optimized_xgb, "Optimized XGBoost (Optuna)")
]):
    # learning_curve returns train sizes, train scores, and validation scores
    # for 10 increasing training set sizes (10% to 100%, set by train_sizes)
    train_sizes, train_scores, val_scores = learning_curve(
        model, X_train_scaled, y_train,
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
        scoring="roc_auc",
        train_sizes=np.linspace(0.1, 1.0, 10),  # 10 points from 10% to 100%
        n_jobs=-1,
    )

    # Mean and standard deviation across folds
    train_mean = train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)

    # Plot
    ax = axes[idx]
    ax.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color="blue")
    ax.fill_between(train_sizes, val_mean - val_std, val_mean + val_std, alpha=0.1, color="orange")
    ax.plot(train_sizes, train_mean, "o-", color="blue", label="Training AUC")
    ax.plot(train_sizes, val_mean, "o-", color="orange", label="Validation AUC")
    ax.set_xlabel("Training Set Size")
    ax.set_ylabel("AUC-ROC")
    ax.set_title(name)
    ax.legend(loc="lower right")
    ax.grid(True, alpha=0.3)

    # Calculate and annotate the train-validation gap
    # A large gap = overfitting; a small gap = good generalization
    gap = train_mean[-1] - val_mean[-1]
    ax.annotate(
        f"Train-Val gap: {gap:.4f}",
        xy=(0.5, 0.02), xycoords="axes fraction",
        fontsize=10, ha="center",
        bbox=dict(boxstyle="round,pad=0.3", facecolor="yellow", alpha=0.7)
    )

plt.suptitle(
    "Learning Curves — Overfitting Diagnosis\n"
    "Large gap between curves = overfitting | Both curves low = underfitting",
    fontsize=13, fontweight="bold"
)
plt.tight_layout()
plt.show()

# -------------------------------------------------------
# EXPLICIT COMPARISON: BASELINE vs OPTIMIZED
# This is critical — we need to show that optimization
# actually improved performance, not just changed params.
# -------------------------------------------------------

# Train both models on full training set
baseline_xgb.fit(X_train_scaled, y_train)
optimized_xgb.fit(X_train_scaled, y_train)

comparison_data = {}
for model, name in [(baseline_xgb, "Baseline"), (optimized_xgb, "Optimized")]:
    y_proba = model.predict_proba(X_test_scaled)[:, 1]
    auc = roc_auc_score(y_test, y_proba)

    # Train AUC (to measure overfitting)
    y_proba_train = model.predict_proba(X_train_scaled)[:, 1]
    auc_train = roc_auc_score(y_train, y_proba_train)

    # Business cost
    best_t, best_c, _, _ = find_optimal_threshold(y_test, y_proba)

    comparison_data[name] = {
        "train_auc": auc_train,
        "test_auc": auc,
        "train_test_gap": auc_train - auc,  # Overfitting indicator
        "business_cost": best_c,
        "optimal_threshold": best_t,
    }

comp_df = pd.DataFrame(comparison_data).T
print("\n" + "=" * 70)
print("BASELINE vs OPTIMIZED — SIDE BY SIDE COMPARISON")
print("=" * 70)
print(comp_df.round(4))
print(f"\nAUC improvement: {comparison_data['Optimized']['test_auc'] - comparison_data['Baseline']['test_auc']:.4f}")
print(f"Cost reduction:  {comparison_data['Baseline']['business_cost'] - comparison_data['Optimized']['business_cost']:.4f}")
print(f"Overfitting gap change: "
      f"{comparison_data['Baseline']['train_test_gap']:.4f} → "
      f"{comparison_data['Optimized']['train_test_gap']:.4f}")

# Key interpretation:
# - If the optimized model has a LOWER train-test gap → regularization worked
# - If AUC improved AND cost decreased → optimization was successful
# - If train AUC is very close to 1.0 → potential overfitting, investigate further
#
# The regularization params (reg_alpha, reg_lambda) from Optuna act as
# L1 and L2 regularization respectively, penalizing overly complex models.
# subsample and colsample_bytree add randomness (like dropout for trees),
# further reducing overfitting.

Early stopping in practice

Early stopping is one of the most effective anti-overfitting techniques. Instead of training for a fixed number of iterations, you monitor validation performance and stop when it stops improving. We already used it in the MLP (via early_stopping=True), and XGBoost supports it natively:

# ============================================================
# EARLY STOPPING WITH XGBOOST
# Monitor validation loss and stop when it plateaus
# ============================================================

# Early stopping needs an evaluation set (not used for training)
eval_set = [(X_train_scaled, y_train), (X_test_scaled, y_test)]

xgb_early = XGBClassifier(
    **study.best_params,
    n_estimators=1000,  # Set high — early stopping will cut it short
    scale_pos_weight=len(y_train[y_train == 0]) / len(y_train[y_train == 1]),
    random_state=42,
    eval_metric="logloss",
    use_label_encoder=False,
    early_stopping_rounds=30,  # Stop if no improvement for 30 rounds
)

# fit() with eval_set tracks performance at each boosting round
xgb_early.fit(
    X_train_scaled, y_train,
    eval_set=eval_set,
    verbose=False,
)

print(f"Training stopped at round {xgb_early.best_iteration} out of 1000")
print(f"Best validation logloss: {xgb_early.best_score:.6f}")

# Plot the training history: train vs validation loss per round
# This is THE visual proof of overfitting (or lack thereof)
results_history = xgb_early.evals_result()

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(results_history["validation_0"]["logloss"], label="Train", color="blue")
ax.plot(results_history["validation_1"]["logloss"], label="Validation", color="orange")
ax.axvline(x=xgb_early.best_iteration, color="red", linestyle="--",
           label=f"Early stop (round {xgb_early.best_iteration})")
ax.set_xlabel("Boosting Round")
ax.set_ylabel("Log Loss")
ax.set_title("XGBoost Training History — Early Stopping")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# If the validation curve diverges from the training curve after
# a certain round, that's overfitting. Early stopping catches this
# and stops training at the optimal point.

Part 6 — Model Interpretability with SHAP

Why interpretability matters

Your model predicts that customer #4521 is going to churn. Great. But the business team asks: why? Is it because they've been a customer for only 2 months? Because their monthly charge is too high? Because they're on a month-to-month contract?

Without interpretability, your model is a black box. And black boxes don't build trust.

SHAP (SHapley Additive exPlanations) is the gold standard for model interpretability. It explains each prediction by assigning a contribution score to each feature:

  • Global importance: which features matter the most across all predictions?
  • Local importance: for one specific customer, which features pushed the prediction toward churn?
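To build intuition for what SHAP computes, here is a from-scratch Shapley value calculation on a toy additive model. The feature names and contribution numbers are made up for illustration, and TreeExplainer uses a much faster tree-specific algorithm, not this brute-force enumeration:

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every coalition of features.

    value_fn(coalition) -> the model's output when only the features
    in `coalition` (a frozenset) are present.
    """
    n = len(features)
    phi = {}
    for f in features:
        others = [x for x in features if x != f]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Weighted marginal contribution of f to this coalition
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi

# Toy additive "model": each feature pushes the churn score by a fixed amount
contrib = {"month_to_month": 0.15, "no_tech_support": 0.08, "low_charges": -0.04}

def toy_model(coalition):
    return sum(contrib[f] for f in coalition)

phi = shapley_values(toy_model, list(contrib))
# For an additive model, each Shapley value recovers the feature's own
# contribution, and together they sum to the full prediction (efficiency).
```

For real models, feature interactions make the values non-trivial, and that is exactly what the summary and waterfall plots below visualize.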


# ============================================================
# MODEL INTERPRETABILITY WITH SHAP
# Global and local explanations for the best model
# ============================================================

import shap

# --- Retrain the best model with optimal hyperparameters ---
best_model = XGBClassifier(
    **study.best_params,
    scale_pos_weight=len(y_train[y_train == 0]) / len(y_train[y_train == 1]),
    random_state=42,
    eval_metric="logloss",
    use_label_encoder=False,
)
best_model.fit(X_train_scaled, y_train)

# -------------------------------------------------------
# METHOD 1: BUILT-IN FEATURE IMPORTANCES (feature_importances_)
# Fast and simple. Based on how much each feature is used in
# splits across all trees. Good starting point, but can be biased
# toward high-cardinality features.
# -------------------------------------------------------

importances = best_model.feature_importances_
feat_imp_df = pd.DataFrame({
    "feature": X_train_scaled.columns,
    "importance": importances
}).sort_values("importance", ascending=False)

fig, ax = plt.subplots(figsize=(10, 8))
top_n = 15  # Show top 15 features
ax.barh(
    feat_imp_df["feature"].head(top_n)[::-1],
    feat_imp_df["importance"].head(top_n)[::-1],
    color="steelblue"
)
ax.set_xlabel("Feature Importance (gain)")
ax.set_title("Built-in Feature Importance (feature_importances_) — Top 15")
plt.tight_layout()
plt.show()

# -------------------------------------------------------
# METHOD 2: PERMUTATION IMPORTANCE
# More reliable than built-in importance. Works by shuffling
# each feature and measuring how much the model's score drops.
# If shuffling a feature destroys performance → it's important.
# If shuffling has no effect → the feature is useless.
# -------------------------------------------------------

from sklearn.inspection import permutation_importance

# Compute permutation importance on the TEST set (not train!)
# Using test set avoids overfitting-related importance inflation
perm_result = permutation_importance(
    best_model, X_test_scaled, y_test,
    n_repeats=15,          # Repeat shuffling 15 times for stable estimates
    random_state=42,
    scoring="roc_auc",     # Use the same metric we care about
    n_jobs=-1,
)

perm_imp_df = pd.DataFrame({
    "feature": X_test_scaled.columns,
    "importance_mean": perm_result.importances_mean,
    "importance_std": perm_result.importances_std,
}).sort_values("importance_mean", ascending=False)

fig, ax = plt.subplots(figsize=(10, 8))
top_features = perm_imp_df.head(top_n)[::-1]
ax.barh(
    top_features["feature"],
    top_features["importance_mean"],
    xerr=top_features["importance_std"],  # Error bars show stability
    color="darkorange"
)
ax.set_xlabel("AUC decrease when feature is shuffled")
ax.set_title("Permutation Importance — Top 15 (with std deviation)")
plt.tight_layout()
plt.show()

# NOTE: Permutation importance can differ from built-in importance.
# Built-in importance tells you "how much the model uses this feature"
# Permutation importance tells you "how much performance drops without it"
# A feature could be used a lot but be redundant (correlated with others)
# → high built-in importance, low permutation importance

# -------------------------------------------------------
# METHOD 3: SHAP (the gold standard)
# SHAP unifies several interpretability methods and provides
# both global AND local explanations. It's based on game theory
# (Shapley values) and is mathematically rigorous.
# -------------------------------------------------------

# --- Create SHAP explainer ---
# TreeExplainer is optimized for tree-based models (XGBoost, LightGBM, RF)
explainer = shap.TreeExplainer(best_model)

# Calculate SHAP values for the test set
# This tells us how much each feature contributed to each prediction
shap_values = explainer.shap_values(X_test_scaled)

# -------------------------------------------------------
# GLOBAL IMPORTANCE
# Which features matter the most across ALL customers?
# -------------------------------------------------------

# Summary plot: each dot is a customer, position = SHAP value,
# color = feature value (red=high, blue=low)
fig, ax = plt.subplots(figsize=(10, 8))
shap.summary_plot(
    shap_values, X_test_scaled,
    feature_names=X_test_scaled.columns,
    show=False
)
plt.title("SHAP Global Feature Importance")
plt.tight_layout()

# Log the plot to MLflow
with mlflow.start_run(run_name="SHAP_Analysis"):
    mlflow.log_figure(fig, "shap_global_importance.png")
plt.show()

# -------------------------------------------------------
# LOCAL IMPORTANCE (single customer explanation)
# WHY did the model predict churn for customer #0?
# -------------------------------------------------------

# Pick one customer from the test set
customer_idx = 0

# Waterfall plot: shows each feature's push toward or away from churn
fig, ax = plt.subplots(figsize=(10, 6))
shap.waterfall_plot(
    shap.Explanation(
        values=shap_values[customer_idx],
        base_values=explainer.expected_value,
        data=X_test_scaled.iloc[customer_idx],
        feature_names=X_test_scaled.columns.tolist()
    ),
    show=False
)
plt.title(f"SHAP Local Explanation — Customer #{customer_idx}")
plt.tight_layout()
plt.show()

# This plot tells a story:
# "This customer has a high churn risk because:
#  - They're on a month-to-month contract (+0.15)
#  - They have fiber optic internet with no tech support (+0.08)
#  - They've only been a customer for 3 months (+0.07)
#  But their low monthly charges help reduce the risk (-0.04)"

Part 7 — Model Registry and Serving

The Model Registry: versioning your models

Once you've found the best model, you need to register it. The MLflow Model Registry is like Git for models — it stores versions, lets you tag them (staging, production), and provides a central place to manage the model lifecycle.

# ============================================================
# MODEL REGISTRY
# Register the best model, tag it, transition to production
# ============================================================

# --- Register the best model ---
with mlflow.start_run(run_name="Best_XGBoost_Final"):
    # Log the model with a registered name
    mlflow.sklearn.log_model(
        best_model,
        "model",
        registered_model_name="churn_prediction_xgboost",
        # This creates the model in the registry if it doesn't exist
        # Or adds a new version if it does
    )

    # Log final metrics for the record
    y_proba_final = best_model.predict_proba(X_test_scaled)[:, 1]
    best_threshold, best_cost, _, _ = find_optimal_threshold(y_test, y_proba_final)
    auc_final = roc_auc_score(y_test, y_proba_final)

    mlflow.log_metric("final_auc", auc_final)
    mlflow.log_metric("final_business_cost", best_cost)
    mlflow.log_metric("final_threshold", best_threshold)
    mlflow.log_params(study.best_params)
    mlflow.set_tag("stage", "production_candidate")

    print(f"Model registered: churn_prediction_xgboost")
    print(f"AUC: {auc_final:.4f} | Cost: {best_cost:.4f} | Threshold: {best_threshold:.2f}")

You can view registered models in the MLflow UI under the "Models" tab. Each model can have multiple versions, and each version can be tagged as "Staging" or "Production".

Model Serving — From model to API in one command

MLflow can serve any registered model as a REST API endpoint. This is incredibly useful for quick testing and proof-of-concept demos.

# Serve the latest version of the registered model
mlflow models serve \
    --model-uri "models:/churn_prediction_xgboost/latest" \
    --port 5001 \
    --no-conda

# Test it with curl
curl -X POST http://127.0.0.1:5001/invocations \
    -H "Content-Type: application/json" \
    -d '{"dataframe_split": {"columns": ["tenure", "MonthlyCharges", ...], "data": [[24, 65.5, ...]]}}'

This is a quick way to validate that your model works as an API. For a production-grade deployment, you'd use FastAPI or a similar framework (as we built in the previous article), but MLflow serving is perfect for rapid prototyping and internal demos.
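If you prefer to build that request in Python instead of curl, the dataframe_split payload is plain JSON. The column names and values below are placeholders for your own feature set:

```python
import json

# Build the payload the /invocations endpoint expects (dataframe_split format)
columns = ["tenure", "MonthlyCharges", "Contract_Month-to-month"]
rows = [[24, 65.5, 1], [3, 89.9, 0]]  # two hypothetical customers

payload = json.dumps({"dataframe_split": {"columns": columns, "data": rows}})
print(payload)

# With the `requests` library installed, POST it to the local server:
# requests.post("http://127.0.0.1:5001/invocations",
#               data=payload, headers={"Content-Type": "application/json"})
```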


Part 8 — The threshold plot (the one chart that wins over business stakeholders)

This is the chart I always show to non-technical stakeholders. It makes the abstract concept of "threshold optimization" concrete and visual.

# ============================================================
# THRESHOLD OPTIMIZATION VISUALIZATION
# Show why the default 0.5 threshold is NOT optimal
# ============================================================

# Get probabilities from the best model
y_proba_best = best_model.predict_proba(X_test_scaled)[:, 1]

# Sweep thresholds and compute business cost at each
best_t, best_c, thresholds, costs = find_optimal_threshold(
    y_test, y_proba_best
)

# Also compute F1 and recall at each threshold for comparison
f1_scores = []
recall_scores = []
for t in thresholds:
    y_pred_t = (y_proba_best >= t).astype(int)
    f1_scores.append(f1_score(y_test, y_pred_t))
    # Recall on the positive class (churners caught)
    cm = confusion_matrix(y_test, y_pred_t)
    _, _, fn, tp = cm.ravel()
    recall_scores.append(tp / (tp + fn) if (tp + fn) > 0 else 0)

# --- Plot ---
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Left panel: Business cost vs threshold
axes[0].plot(thresholds, costs, "b-", linewidth=2, label="Business cost")
axes[0].axvline(x=0.5, color="gray", linestyle="--", alpha=0.7, label="Default (0.5)")
axes[0].axvline(x=best_t, color="red", linestyle="--", linewidth=2, label=f"Optimal ({best_t:.2f})")
axes[0].set_xlabel("Classification Threshold", fontsize=12)
axes[0].set_ylabel("Business Cost (normalized)", fontsize=12)
axes[0].set_title("Business Cost vs. Threshold", fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Right panel: F1 and Recall vs threshold
axes[1].plot(thresholds, f1_scores, "g-", linewidth=2, label="F1-Score")
axes[1].plot(thresholds, recall_scores, "orange", linewidth=2, label="Recall (churners caught)")
axes[1].axvline(x=best_t, color="red", linestyle="--", linewidth=2, label=f"Optimal ({best_t:.2f})")
axes[1].set_xlabel("Classification Threshold", fontsize=12)
axes[1].set_ylabel("Score", fontsize=12)
axes[1].set_title("F1-Score & Recall vs. Threshold", fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.suptitle(
    f"Threshold Optimization — Optimal: {best_t:.2f} "
    f"(cost: {best_c:.4f} vs {business_cost(y_test, (y_proba_best >= 0.5).astype(int)):.4f} at 0.5)",
    fontsize=14, fontweight="bold"
)
plt.tight_layout()
plt.show()

# Key insight to communicate to stakeholders:
print(f"\nDefault threshold (0.5) → business cost: "
      f"{business_cost(y_test, (y_proba_best >= 0.5).astype(int)):.4f}")
print(f"Optimal threshold ({best_t:.2f}) → business cost: {best_c:.4f}")
print(f"Cost reduction: "
      f"{(1 - best_c/business_cost(y_test, (y_proba_best >= 0.5).astype(int)))*100:.1f}%")

This chart demonstrates that the technical optimum (best AUC) is NOT the same as the business optimum (lowest cost). A lower threshold catches more churners (higher recall) at the expense of more false alarms — but since false alarms are cheap and missed churners are expensive, the total cost goes down. This is the kind of insight that makes stakeholders trust your work.
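The asymmetry argument can be reproduced with toy numbers. Below, a missed churner (false negative) costs 5x a false alarm, and the cost-minimizing threshold lands well below 0.5. All labels and probabilities are made up for the demonstration:

```python
COST_FN, COST_FP = 5.0, 1.0  # missed churner vs. wasted retention offer

y_true  = [1,    1,    1,    0,    0,    0,    0,    0]
y_proba = [0.45, 0.62, 0.38, 0.41, 0.22, 0.15, 0.55, 0.30]

def cost_at(threshold):
    fp = sum(1 for t, p in zip(y_true, y_proba) if t == 0 and p >= threshold)
    fn = sum(1 for t, p in zip(y_true, y_proba) if t == 1 and p < threshold)
    return COST_FP * fp + COST_FN * fn

# Sweep thresholds from 0.10 to 0.90 and pick the cheapest one
thresholds = [i / 100 for i in range(10, 91)]
best_t = min(thresholds, key=cost_at)
print(f"default 0.5 -> cost {cost_at(0.5)}, optimal {best_t} -> cost {cost_at(best_t)}")
```

Lowering the threshold trades cheap false alarms for expensive misses, which is the same mechanism the left panel of the chart shows on real data.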


Summary — What we built

Here's everything we covered:

| Component | Tool | Purpose |
|---|---|---|
| Data quality audit | Pandas, missingno | Duplicates, missing values, distributions |
| Data prep | Pandas, scikit-learn | Clean, encode, engineer features |
| Experiment tracking | MLflow Tracking | Log every run automatically |
| Visual comparison | MLflow UI | Compare models side by side |
| MLP + activation functions | scikit-learn MLPClassifier | Test relu, tanh, logistic — justify choice |
| Hyperparameter tuning | Optuna + MLflow | Smart search, fully tracked |
| Overfitting diagnosis | Learning curves | Compare train vs validation across data sizes |
| Early stopping | XGBoost eval_set | Stop training when validation plateaus |
| Base vs optimized | Side-by-side comparison | Prove optimization improved generalization |
| Business metric | Custom cost function | Optimize for real-world impact |
| Threshold optimization | Custom sweep | Find the decision point that minimizes cost |
| Feature importance (built-in) | feature_importances_ | Quick importance ranking |
| Feature importance (permutation) | sklearn permutation_importance | Robust, model-agnostic importance |
| Feature importance (SHAP) | SHAP | Global summary + local waterfall explanations |
| Model versioning | MLflow Registry | Version, stage, manage models |
| Model serving | MLflow Serve | Turn model into API endpoint |

Technologies to remember

| Tool | What it does | Link |
|---|---|---|
| MLflow | ML experiment lifecycle | mlflow.org |
| Optuna | Hyperparameter optimization | optuna.org |
| SHAP | Model interpretability | shap.readthedocs.io |
| XGBoost | Gradient boosting classifier | xgboost.readthedocs.io |
| LightGBM | Fast gradient boosting | lightgbm.readthedocs.io |
| imbalanced-learn | Handle class imbalance | imbalanced-learn.org |
| scikit-learn | ML foundation | scikit-learn.org |

Going further

  • Data drift monitoring — detect when production data diverges from training data (Evidently AI)
  • Feature stores — centralize and reuse feature engineering across projects (Feast)
  • Automated retraining — trigger retraining when performance drops below a threshold
  • A/B testing — compare model versions in production with real users
  • Full CI/CD for ML — automate the entire pipeline from data to deployment (GitHub Actions + MLflow)

Conclusion

The shift from "notebook data science" to MLOps is not about adding complexity for the sake of it. It's about not losing your mind when you have 50 experiments, 3 datasets, and a manager asking "which model should we deploy?"

MLflow gives you that sanity. Every experiment is logged. Every model is versioned. Every metric is trackable. And when someone asks "why did the model predict churn for this customer?", you can show them a SHAP plot instead of shrugging.

The tools are free. The concepts are straightforward. The only thing standing between you and reproducible ML is the decision to start.

Next time you open a notebook, add import mlflow at the top. Future you will be grateful.


If this article helped you, a like or a share goes a long way. Questions? The comments are open.
