
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

ML Monitoring Suite

Production model monitoring with Prometheus metrics, Grafana dashboards, and automated alerting. Detect data drift, performance degradation, and service health issues before they impact users.

Key Features

  • Pre-built Grafana dashboards — model performance, prediction distributions, latency, and error rates
  • Prometheus metric exporters — custom Python exporters for sklearn, PyTorch, and TensorFlow models
  • Data drift detection — statistical tests (KS, PSI, chi-squared) running on a configurable schedule
  • Alerting rules — Prometheus alerting configs for accuracy drops, latency spikes, and error rate thresholds
  • SLA monitoring — track p50/p95/p99 latency against defined service level objectives
  • Incident response runbooks — step-by-step guides for common ML production incidents
  • Health check endpoints — readiness and liveness probes for model serving containers
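The health check endpoints in the last bullet can be sketched with only the standard library (illustrative; a real deployment would typically expose these from the serving framework, and the `/healthz` and `/ready` paths are assumptions, not part of the suite's documented API):

```python
"""Minimal readiness/liveness probes for a model-serving container."""
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

MODEL_LOADED = threading.Event()  # set once the model finishes loading


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":           # liveness: the process is up
            self._reply(200, b"ok")
        elif self.path == "/ready":           # readiness: the model is loaded
            if MODEL_LOADED.is_set():
                self._reply(200, b"ready")
            else:
                self._reply(503, b"loading")
        else:
            self._reply(404, b"not found")

    def _reply(self, code: int, body: bytes):
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass


def serve_health(port: int = 8080) -> HTTPServer:
    """Start the probe server on a daemon thread and return it."""
    server = HTTPServer(("", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Kubernetes would point its `livenessProbe` at `/healthz` and its `readinessProbe` at `/ready`, so traffic is held back until the model is actually in memory.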

Quick Start

# 1. Copy the config
cp config.example.yaml config.yaml

# 2. Start the monitoring stack
docker-compose -f templates/docker-compose.yaml up -d

# 3. Import Grafana dashboards
python templates/import_dashboards.py --grafana-url http://localhost:3000

# 4. Start the model metric exporter
python templates/exporter.py --config config.yaml
The exporter, `templates/exporter.py`, publishes model metrics on a Prometheus scrape endpoint:
"""Expose model metrics to Prometheus."""
from prometheus_client import start_http_server, Histogram, Counter, Gauge
import time

# Define metrics
PREDICTION_LATENCY = Histogram("model_prediction_seconds", "Prediction latency",
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0])
PREDICTION_COUNT = Counter("model_predictions_total", "Total predictions",
    ["model_version", "status"])
MODEL_ACCURACY = Gauge("model_accuracy_score", "Rolling accuracy", ["model_name"])

def predict_with_monitoring(model, features: dict) -> dict:
    """Run prediction and record metrics."""
    start = time.perf_counter()
    try:
        result = model.predict(features)
        PREDICTION_COUNT.labels(model_version="v2.1", status="success").inc()
        return {"prediction": result}
    except Exception:
        PREDICTION_COUNT.labels(model_version="v2.1", status="error").inc()
        raise
    finally:
        PREDICTION_LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8001)  # Prometheus scrape endpoint (daemon thread)
    print("Metrics server running on :8001/metrics")
    while True:  # keep the process alive; the HTTP server runs in the background
        time.sleep(60)
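Once Prometheus is scraping the exporter, queries like the following (sketches over the metric names defined above) drive the latency and error-rate panels:

```
# p99 prediction latency over the last 5 minutes
histogram_quantile(0.99, rate(model_prediction_seconds_bucket[5m]))

# error rate as a fraction of all predictions
sum(rate(model_predictions_total{status="error"}[5m]))
  / sum(rate(model_predictions_total[5m]))
```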

Architecture

ml-monitoring-suite/
├── config.example.yaml              # Monitoring configuration
├── templates/
│   ├── docker-compose.yaml          # Prometheus + Grafana stack
│   ├── exporter.py                  # Python metric exporter
│   ├── dashboards/
│   │   ├── model_performance.json   # Accuracy, F1, precision, recall
│   │   ├── serving_latency.json     # p50/p95/p99 latency panels
│   │   ├── data_drift.json          # Feature distribution shifts
│   │   └── system_health.json       # CPU, memory, GPU utilization
│   ├── alerts/
│   │   ├── accuracy_drop.yaml       # Alert when accuracy < threshold
│   │   ├── latency_spike.yaml       # Alert when p99 > SLA
│   │   └── error_rate.yaml          # Alert when error rate > 1%
│   └── runbooks/
│       ├── accuracy_degradation.md  # Step-by-step diagnosis
│       └── data_drift_detected.md   # Drift response procedure
├── docs/
│   └── overview.md
└── examples/
    ├── sklearn_monitoring.py        # Monitor sklearn model
    └── drift_detection.py           # Run drift checks manually

Usage Examples

Data Drift Detection

"""Detect feature distribution drift using Population Stability Index."""
import numpy as np
from scipy import stats

def calculate_psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index. < 0.1 OK, 0.1-0.25 investigate, > 0.25 retrain."""
    breakpoints = np.linspace(
        min(reference.min(), current.min()),
        max(reference.max(), current.max()),
        bins + 1,
    )
    ref_pct = np.clip(np.histogram(reference, breakpoints)[0] / len(reference), 1e-6, None)
    cur_pct = np.clip(np.histogram(current, breakpoints)[0] / len(current), 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Check drift for each feature (assumes reference_df holds a training-time
# sample and production_df a recent production sample)
for feature_name in feature_columns:
    psi = calculate_psi(reference_df[feature_name].values,
                        production_df[feature_name].values)
    print(f"{feature_name}: PSI={psi:.4f} {'DRIFT' if psi > 0.25 else 'OK'}")
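The feature list also mentions the KS test; a sketch of a two-sample KS drift check (illustrative names and thresholds, using `scipy.stats.ks_2samp`):

```python
"""KS-test drift check for a continuous feature."""
import numpy as np
from scipy import stats

def ks_drift(reference: np.ndarray, current: np.ndarray, alpha: float = 0.05) -> dict:
    """Two-sample Kolmogorov-Smirnov test; a small p-value means the
    reference and current distributions differ."""
    statistic, p_value = stats.ks_2samp(reference, current)
    return {
        "statistic": float(statistic),
        "p_value": float(p_value),
        "drift": bool(p_value < alpha),
    }
```

Note that with large production samples the KS test flags even tiny shifts, so it is commonly paired with an effect-size measure like PSI before triggering a retrain.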

Prometheus Alert Rule

# alerts/accuracy_drop.yaml
groups:
  - name: model_quality
    rules:
      - alert: ModelAccuracyDrop
        expr: model_accuracy_score < 0.85
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Model accuracy below threshold"
          description: "{{ $labels.model_name }} accuracy is {{ $value }} (threshold: 0.85)"
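A companion rule for the latency SLO might look like this (a sketch: the expression assumes the exporter's `model_prediction_seconds` histogram and the 200 ms p99 target from the config):

```
# alerts/latency_spike.yaml (sketch)
groups:
  - name: model_latency
    rules:
      - alert: ModelLatencyP99High
        expr: histogram_quantile(0.99, rate(model_prediction_seconds_bucket[5m])) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 prediction latency above SLA"
          description: "p99 latency is {{ $value }}s (SLA: 200ms)"
```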

Configuration

# config.example.yaml
monitoring:
  prometheus_port: 8001
  scrape_interval: "15s"

drift_detection:
  schedule: "0 */6 * * *"           # Every 6 hours
  psi_threshold: 0.25
  reference_window_days: 30

alerts:
  accuracy_threshold: 0.85
  latency_p99_ms: 200
  error_rate_threshold: 0.01
  notification_channel: "slack"      # slack | email | pagerduty

Best Practices

  1. Monitor inputs, not just outputs — data drift in features often precedes accuracy drops by days or weeks
  2. Set up a reference dataset — freeze your training data distribution as the baseline for all drift comparisons
  3. Use rolling windows for metrics — a 1-hour rolling accuracy is more actionable than a per-request metric
  4. Alert on trends, not single points — require the condition to persist (for: 15m) before firing alerts
  5. Automate runbook links in alerts — every alert annotation should include a link to the relevant runbook

Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| Grafana dashboards show "No data" | Prometheus not scraping the exporter | Check `http://localhost:9090/targets` for scrape errors; verify the exporter port |
| PSI always near zero | Reference and current data from the same source | Ensure reference data is from training time, not recent production |
| Alert firing too frequently | Threshold too tight or window too short | Increase the `for:` duration in alert rules or relax thresholds |
| Exporter OOM on high traffic | Unbounded histogram buckets | Set explicit buckets on `Histogram` metrics; limit label cardinality |

This is 1 of 10 resources in the ML Starter Kit toolkit. Get the complete ML Monitoring Suite with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire ML Starter Kit bundle (10 products) for $149 — save 30%.

Get the Complete Bundle →

