# ML Monitoring Suite
Production model monitoring with Prometheus metrics, Grafana dashboards, and automated alerting. Detect data drift, performance degradation, and service health issues before they impact users.
## Key Features
- Pre-built Grafana dashboards — model performance, prediction distributions, latency, and error rates
- Prometheus metric exporters — custom Python exporters for sklearn, PyTorch, and TensorFlow models
- Data drift detection — statistical tests (KS, PSI, chi-squared) running on a configurable schedule
- Alerting rules — Prometheus alerting configs for accuracy drops, latency spikes, and error rate thresholds
- SLA monitoring — track p50/p95/p99 latency against defined service level objectives
- Incident response runbooks — step-by-step guides for common ML production incidents
- Health check endpoints — readiness and liveness probes for model serving containers
## Quick Start

```bash
# 1. Copy the config
cp config.example.yaml config.yaml

# 2. Start the monitoring stack
docker-compose -f templates/docker-compose.yaml up -d

# 3. Import Grafana dashboards
python templates/import_dashboards.py --grafana-url http://localhost:3000

# 4. Start the model metric exporter
python templates/exporter.py --config config.yaml
```
"""Expose model metrics to Prometheus."""
from prometheus_client import start_http_server, Histogram, Counter, Gauge
import time
# Define metrics
PREDICTION_LATENCY = Histogram("model_prediction_seconds", "Prediction latency",
buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0])
PREDICTION_COUNT = Counter("model_predictions_total", "Total predictions",
["model_version", "status"])
MODEL_ACCURACY = Gauge("model_accuracy_score", "Rolling accuracy", ["model_name"])
def predict_with_monitoring(model, features: dict) -> dict:
"""Run prediction and record metrics."""
start = time.perf_counter()
try:
result = model.predict(features)
PREDICTION_COUNT.labels(model_version="v2.1", status="success").inc()
return {"prediction": result}
except Exception as exc:
PREDICTION_COUNT.labels(model_version="v2.1", status="error").inc()
raise
finally:
PREDICTION_LATENCY.observe(time.perf_counter() - start)
if __name__ == "__main__":
start_http_server(8001) # Prometheus scrape endpoint
print("Metrics server running on :8001/metrics")
## Architecture

```text
ml-monitoring-suite/
├── config.example.yaml              # Monitoring configuration
├── templates/
│   ├── docker-compose.yaml          # Prometheus + Grafana stack
│   ├── exporter.py                  # Python metric exporter
│   ├── dashboards/
│   │   ├── model_performance.json   # Accuracy, F1, precision, recall
│   │   ├── serving_latency.json     # p50/p95/p99 latency panels
│   │   ├── data_drift.json          # Feature distribution shifts
│   │   └── system_health.json       # CPU, memory, GPU utilization
│   ├── alerts/
│   │   ├── accuracy_drop.yaml       # Alert when accuracy < threshold
│   │   ├── latency_spike.yaml       # Alert when p99 > SLA
│   │   └── error_rate.yaml          # Alert when error rate > 1%
│   └── runbooks/
│       ├── accuracy_degradation.md  # Step-by-step diagnosis
│       └── data_drift_detected.md   # Drift response procedure
├── docs/
│   └── overview.md
└── examples/
    ├── sklearn_monitoring.py        # Monitor sklearn model
    └── drift_detection.py           # Run drift checks manually
```
## Usage Examples

### Data Drift Detection

```python
"""Detect feature distribution drift using the Population Stability Index."""
import numpy as np


def calculate_psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index: < 0.1 OK, 0.1-0.25 investigate, > 0.25 retrain."""
    breakpoints = np.linspace(
        min(reference.min(), current.min()),
        max(reference.max(), current.max()),
        bins + 1,
    )
    ref_pct = np.clip(np.histogram(reference, breakpoints)[0] / len(reference), 1e-6, None)
    cur_pct = np.clip(np.histogram(current, breakpoints)[0] / len(current), 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


# Check drift for each feature (reference_df holds training-time data,
# production_df holds recent production data)
for feature_name in feature_columns:
    psi = calculate_psi(reference_df[feature_name].values,
                        production_df[feature_name].values)
    print(f"{feature_name}: PSI={psi:.4f} {'DRIFT' if psi > 0.25 else 'OK'}")
```
### Prometheus Alert Rule

```yaml
# alerts/accuracy_drop.yaml
groups:
  - name: model_quality
    rules:
      - alert: ModelAccuracyDrop
        expr: model_accuracy_score < 0.85
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "Model accuracy below threshold"
          description: "{{ $labels.model_name }} accuracy is {{ $value }} (threshold: 0.85)"
```
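The latency alert works the same way, but queries a quantile over the exporter's histogram. A sketch of what `alerts/latency_spike.yaml` might contain (it assumes the `model_prediction_seconds` histogram from the exporter above and the 200 ms SLA from the example config):

```yaml
# alerts/latency_spike.yaml (illustrative)
groups:
  - name: model_latency
    rules:
      - alert: ModelLatencyP99High
        expr: histogram_quantile(0.99, sum(rate(model_prediction_seconds_bucket[5m])) by (le)) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 prediction latency above the 200 ms SLA"
```

`histogram_quantile` estimates the p99 from the `_bucket` time series, which is why the exporter defines explicit buckets around your SLA boundary.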
## Configuration

```yaml
# config.example.yaml
monitoring:
  prometheus_port: 8001
  scrape_interval: "15s"

drift_detection:
  schedule: "0 */6 * * *"  # every 6 hours
  psi_threshold: 0.25
  reference_window_days: 30

alerts:
  accuracy_threshold: 0.85
  latency_p99_ms: 200
  error_rate_threshold: 0.01
  notification_channel: "slack"  # slack | email | pagerduty
```
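A loader for this config can fail fast on out-of-range values before the monitoring stack starts. A sketch (key names follow `config.example.yaml` above; the specific validation rules are illustrative, not part of the suite):

```python
"""Load and sanity-check config.yaml (sketch)."""
import yaml

VALID_CHANNELS = {"slack", "email", "pagerduty"}


def load_config(path: str) -> dict:
    """Read the YAML config and reject clearly invalid thresholds."""
    with open(path) as fh:
        cfg = yaml.safe_load(fh)
    if not 0 < cfg["drift_detection"]["psi_threshold"] < 1:
        raise ValueError("psi_threshold must be in (0, 1)")
    if not 0 < cfg["alerts"]["accuracy_threshold"] <= 1:
        raise ValueError("accuracy_threshold must be in (0, 1]")
    if cfg["alerts"]["notification_channel"] not in VALID_CHANNELS:
        raise ValueError("notification_channel must be slack, email, or pagerduty")
    return cfg
```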
## Best Practices
- Monitor inputs, not just outputs — data drift in features often precedes accuracy drops by days or weeks
- Set up a reference dataset — freeze your training data distribution as the baseline for all drift comparisons
- Use rolling windows for metrics — a 1-hour rolling accuracy is more actionable than a per-request metric
- Alert on trends, not single points — require the condition to persist (`for: 15m`) before firing alerts
- Automate runbook links in alerts — every alert annotation should include a link to the relevant runbook
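The rolling-window recommendation above can be sketched in a few lines; this is an illustration (the window size and the gauge wiring are hypothetical, not the suite's exporter code):

```python
"""Rolling-window accuracy over the last N labeled predictions (sketch)."""
from collections import deque


class RollingAccuracy:
    def __init__(self, window: int = 1000):
        # 1 = correct, 0 = incorrect; old outcomes fall off automatically
        self.outcomes = deque(maxlen=window)

    def record(self, prediction, label) -> None:
        self.outcomes.append(1 if prediction == label else 0)

    def value(self) -> float:
        if not self.outcomes:
            return float("nan")
        return sum(self.outcomes) / len(self.outcomes)


acc = RollingAccuracy(window=4)
for pred, label in [(1, 1), (0, 1), (1, 1), (1, 1), (0, 0)]:
    acc.record(pred, label)
print(acc.value())  # 0.75 over the last 4 outcomes
```

Feeding `acc.value()` into the `model_accuracy_score` gauge from the exporter gives the `ModelAccuracyDrop` alert something smooth to evaluate.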
## Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| Grafana dashboards show "No data" | Prometheus not scraping the exporter | Check http://localhost:9090/targets for scrape errors; verify exporter port |
| PSI always near zero | Reference and current data from same source | Ensure reference data is from training time, not recent production |
| Alert firing too frequently | Threshold too tight or window too short | Increase the `for` duration in alert rules or relax thresholds |
| Exporter OOM on high traffic | Unbounded histogram buckets | Set explicit buckets on Histogram metrics; limit cardinality of labels |
This is 1 of 10 resources in the ML Starter Kit toolkit. Get the complete ML Monitoring Suite with all files, templates, and documentation for $39.
Or grab the entire ML Starter Kit bundle (10 products) for $149 — save 30%.