
Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

ML Monitoring Suite

Production model monitoring with Prometheus metrics, Grafana dashboards, and automated alerting. Detect data drift, performance degradation, and service health issues before they impact users.

Key Features

  • Pre-built Grafana dashboards — model performance, prediction distributions, latency, and error rates
  • Prometheus metric exporters — custom Python exporters for sklearn, PyTorch, and TensorFlow models
  • Data drift detection — statistical tests (KS, PSI, chi-squared) running on a configurable schedule
  • Alerting rules — Prometheus alerting configs for accuracy drops, latency spikes, and error rate thresholds
  • SLA monitoring — track p50/p95/p99 latency against defined service level objectives
  • Incident response runbooks — step-by-step guides for common ML production incidents
  • Health check endpoints — readiness and liveness probes for model serving containers
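The health check endpoints in the last bullet can be sketched with only the standard library (illustrative; a real deployment would typically expose these from the serving framework, and the `/healthz` and `/ready` paths are assumptions, not part of the suite's documented API):

```python
"""Minimal readiness/liveness probes for a model-serving container."""
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

MODEL_LOADED = threading.Event()  # set once the model finishes loading


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":           # liveness: the process is up
            self._reply(200, b"ok")
        elif self.path == "/ready":           # readiness: the model is loaded
            if MODEL_LOADED.is_set():
                self._reply(200, b"ready")
            else:
                self._reply(503, b"loading")
        else:
            self._reply(404, b"not found")

    def _reply(self, code: int, body: bytes):
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass


def serve_health(port: int = 8080) -> HTTPServer:
    """Start the probe server on a daemon thread and return it."""
    server = HTTPServer(("", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Kubernetes would point its `livenessProbe` at `/healthz` and its `readinessProbe` at `/ready`, so traffic is held back until the model is actually in memory.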

Quick Start

# 1. Copy the config
cp config.example.yaml config.yaml

# 2. Start the monitoring stack
docker-compose -f templates/docker-compose.yaml up -d

# 3. Import Grafana dashboards
python templates/import_dashboards.py --grafana-url http://localhost:3000

# 4. Start the model metric exporter
python templates/exporter.py --config config.yaml
The exporter, `templates/exporter.py`, publishes model metrics on a Prometheus scrape endpoint:
"""Expose model metrics to Prometheus."""
from prometheus_client import start_http_server, Histogram, Counter, Gauge
import time

# Define metrics
PREDICTION_LATENCY = Histogram("model_prediction_seconds", "Prediction latency",
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0])
PREDICTION_COUNT = Counter("model_predictions_total", "Total predictions",
    ["model_version", "status"])
MODEL_ACCURACY = Gauge("model_accuracy_score", "Rolling accuracy", ["model_name"])

def predict_with_monitoring(model, features: dict) -> dict:
    """Run prediction and record metrics."""
    start = time.perf_counter()
    try:
        result = model.predict(features)
        PREDICTION_COUNT.labels(model_version="v2.1", status="success").inc()
        return {"prediction": result}
    except Exception:
        PREDICTION_COUNT.labels(model_version="v2.1", status="error").inc()
        raise
    finally:
        PREDICTION_LATENCY.observe(time.perf_counter() - start)

if __name__ == "__main__":
    start_http_server(8001)  # Prometheus scrape endpoint (daemon thread)
    print("Metrics server running on :8001/metrics")
    while True:  # keep the process alive; the HTTP server runs in the background
        time.sleep(60)
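Once Prometheus is scraping the exporter, queries like the following (sketches over the metric names defined above) drive the latency and error-rate panels:

```
# p99 prediction latency over the last 5 minutes
histogram_quantile(0.99, rate(model_prediction_seconds_bucket[5m]))

# error rate as a fraction of all predictions
sum(rate(model_predictions_total{status="error"}[5m]))
  / sum(rate(model_predictions_total[5m]))
```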

Architecture

ml-monitoring-suite/
├── config.example.yaml              # Monitoring configuration
├── templates/
│   ├── docker-compose.yaml          # Prometheus + Grafana stack
│   ├── exporter.py                  # Python metric exporter
│   ├── dashboards/
│   │   ├── model_performance.json   # Accuracy, F1, precision, recall
│   │   ├── serving_latency.json     # p50/p95/p99 latency panels
│   │   ├── data_drift.json          # Feature distribution shifts
│   │   └── system_health.json       # CPU, memory, GPU utilization
│   ├── alerts/
│   │   ├── accuracy_drop.yaml       # Alert when accuracy < threshold
│   │   ├── latency_spike.yaml       # Alert when p99 > SLA
│   │   └── error_rate.yaml          # Alert when error rate > 1%
│   └── runbooks/
│       ├── accuracy_degradation.md  # Step-by-step diagnosis
│       └── data_drift_detected.md   # Drift response procedure
├── docs/
│   └── overview.md
└── examples/
    ├── sklearn_monitoring.py        # Monitor sklearn model
    └── drift_detection.py           # Run drift checks manually

Usage Examples

Data Drift Detection

"""Detect feature distribution drift using Population Stability Index."""
import numpy as np
from scipy import stats

def calculate_psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index. < 0.1 OK, 0.1-0.25 investigate, > 0.25 retrain."""
    breakpoints = np.linspace(
        min(reference.min(), current.min()),
        max(reference.max(), current.max()),
        bins + 1,
    )
    ref_pct = np.clip(np.histogram(reference, breakpoints)[0] / len(reference), 1e-6, None)
    cur_pct = np.clip(np.histogram(current, breakpoints)[0] / len(current), 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Check drift for each feature (assumes reference_df holds a training-time
# sample and production_df a recent production sample)
for feature_name in feature_columns:
    psi = calculate_psi(reference_df[feature_name].values,
                        production_df[feature_name].values)
    print(f"{feature_name}: PSI={psi:.4f} {'DRIFT' if psi > 0.25 else 'OK'}")
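The feature list also mentions the KS test; a sketch of a two-sample KS drift check (illustrative names and thresholds, using `scipy.stats.ks_2samp`):

```python
"""KS-test drift check for a continuous feature."""
import numpy as np
from scipy import stats

def ks_drift(reference: np.ndarray, current: np.ndarray, alpha: float = 0.05) -> dict:
    """Two-sample Kolmogorov-Smirnov test; a small p-value means the
    reference and current distributions differ."""
    statistic, p_value = stats.ks_2samp(reference, current)
    return {
        "statistic": float(statistic),
        "p_value": float(p_value),
        "drift": bool(p_value < alpha),
    }
```

Note that with large production samples the KS test flags even tiny shifts, so it is commonly paired with an effect-size measure like PSI before triggering a retrain.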

Prometheus Alert Rule

# alerts/accuracy_drop.yaml
groups:
  - name: model_quality
    rules:
      - alert: ModelAccuracyDrop
        expr: model_accuracy_score < 0.85
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Model accuracy below threshold"
          description: "{{ $labels.model_name }} accuracy is {{ $value }} (threshold: 0.85)"
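A companion rule for the latency SLO might look like this (a sketch: the expression assumes the exporter's `model_prediction_seconds` histogram and the 200 ms p99 target from the config):

```
# alerts/latency_spike.yaml (sketch)
groups:
  - name: model_latency
    rules:
      - alert: ModelLatencyP99High
        expr: histogram_quantile(0.99, rate(model_prediction_seconds_bucket[5m])) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 prediction latency above SLA"
          description: "p99 latency is {{ $value }}s (SLA: 200ms)"
```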

Configuration

# config.example.yaml
monitoring:
  prometheus_port: 8001
  scrape_interval: "15s"

drift_detection:
  schedule: "0 */6 * * *"           # Every 6 hours
  psi_threshold: 0.25
  reference_window_days: 30

alerts:
  accuracy_threshold: 0.85
  latency_p99_ms: 200
  error_rate_threshold: 0.01
  notification_channel: "slack"      # slack | email | pagerduty

Best Practices

  1. Monitor inputs, not just outputs — data drift in features often precedes accuracy drops by days or weeks
  2. Set up a reference dataset — freeze your training data distribution as the baseline for all drift comparisons
  3. Use rolling windows for metrics — a 1-hour rolling accuracy is more actionable than a per-request metric
  4. Alert on trends, not single points — require the condition to persist (for: 15m) before firing alerts
  5. Automate runbook links in alerts — every alert annotation should include a link to the relevant runbook

Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| Grafana dashboards show "No data" | Prometheus not scraping the exporter | Check `http://localhost:9090/targets` for scrape errors; verify the exporter port |
| PSI always near zero | Reference and current data from the same source | Ensure reference data is from training time, not recent production |
| Alert firing too frequently | Threshold too tight or window too short | Increase the `for:` duration in alert rules or relax thresholds |
| Exporter OOM on high traffic | Unbounded histogram buckets | Set explicit buckets on `Histogram` metrics; limit label cardinality |

This is 1 of 10 resources in the ML Starter Kit toolkit. Get the complete ML Monitoring Suite with all files, templates, and documentation for $39.

Get the Full Kit →

Or grab the entire ML Starter Kit bundle (10 products) for $149 — save 30%.

Get the Complete Bundle →

