# Model Monitoring Dashboard
Models don't fail with exceptions — they fail silently. Accuracy degrades gradually as data distributions shift, features go stale, or upstream systems change schemas. By the time someone notices, you've been serving bad predictions for weeks. This toolkit gives you drift detection algorithms, real-time performance monitoring, and pre-built Grafana dashboards with alerting rules that catch degradation early. You get the complete observability stack for deployed ML models: data drift, prediction drift, feature importance shifts, and business metric correlation.
## Key Features

- **Data Drift Detection** — Population Stability Index (PSI), Kolmogorov-Smirnov tests, and Jensen-Shannon divergence computed per feature on configurable schedules.
- **Prediction Drift Monitoring** — Track prediction distribution changes even when ground truth labels aren't available yet.
- **Grafana Dashboards** — Pre-built dashboard JSON with panels for model performance, feature drift, latency percentiles, and error rates.
- **Alerting Rules** — Prometheus alerting rules for drift thresholds, latency spikes, error rate increases, and data pipeline staleness.
- **Performance Tracking** — Accuracy, precision, recall, and custom metrics computed on incoming labeled data with sliding window aggregations.
- **Feature Importance Monitoring** — SHAP-based feature importance tracking that alerts when the model starts relying on different features than it did during training.
- **Outlier Detection** — Flag individual predictions made on out-of-distribution inputs so downstream systems can handle them appropriately.
- **Report Generator** — Weekly and monthly model health reports in HTML and PDF with trend analysis and recommendations.
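PSI, the first of the drift statistics above, compares binned frequencies between reference and production data. As a self-contained illustration of the idea (not the toolkit's implementation; the `psi` helper is a made-up name), it can be computed with plain NumPy:

```python
import numpy as np

def psi(reference, production, bins=10, eps=1e-6):
    """Population Stability Index between two 1-D samples.

    Bin edges come from reference quantiles so each reference bin holds
    roughly equal mass; production values outside the reference range
    are clipped into the outer bins.
    """
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    production = np.clip(production, edges[0], edges[-1])
    ref_frac = np.histogram(reference, edges)[0] / len(reference)
    prod_frac = np.histogram(production, edges)[0] / len(production)
    ref_frac = np.clip(ref_frac, eps, None)    # avoid log(0) on empty bins
    prod_frac = np.clip(prod_frac, eps, None)
    return float(np.sum((prod_frac - ref_frac) * np.log(prod_frac / ref_frac)))

rng = np.random.default_rng(0)
stable = psi(rng.normal(0, 1, 10_000), rng.normal(0, 1, 10_000))
shifted = psi(rng.normal(0, 1, 10_000), rng.normal(0.5, 1, 10_000))
print(f"same distribution: {stable:.3f}, mean shifted by 0.5: {shifted:.3f}")
```

With identical distributions PSI stays near zero; a half-standard-deviation mean shift typically lands above the 0.2 "significant shift" rule of thumb used later in this document.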
## Quick Start

```bash
unzip model-monitoring-dashboard.zip && cd model-monitoring-dashboard
pip install -r requirements.txt

# Start the monitoring stack (Prometheus + Grafana + monitoring service)
docker compose up -d

# Import Grafana dashboards
python src/model_monitoring/setup.py import-dashboards \
  --grafana-url http://localhost:3000 \
  --api-key YOUR_GRAFANA_API_KEY_HERE
```
```yaml
# config.example.yaml
monitoring:
  model_name: churn_predictor_v3
  check_interval_minutes: 60
  reference_dataset: ./data/training_reference.parquet

drift_detection:
  features:
    numerical: { method: ks_test, threshold: 0.05 }
    categorical: { method: chi_squared, threshold: 0.05 }
  prediction: { method: psi, threshold: 0.1 }
  schedule: "*/60 * * * *"

performance:
  metrics: [accuracy, precision, recall, f1, auc_roc]
  window_size: 1000               # predictions
  sliding_step: 100
  ground_truth_delay_hours: 24    # how long until labels arrive

alerting:
  channels: [slack, pagerduty]
  rules:
    - { name: feature_drift_alert, condition: "psi > 0.2 for any feature", severity: warning }
    - { name: performance_drop, condition: "accuracy < 0.80", severity: critical }
    - { name: latency_spike, condition: "p99_latency_ms > 200", severity: warning }
    - { name: data_staleness, condition: "last_prediction_age > 30m", severity: critical }

grafana:
  url: http://localhost:3000
  dashboards_dir: ./dashboards/
  datasource: prometheus
```
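The `latency_spike` entry above maps naturally onto a Prometheus alerting rule. The sketch below is an illustration, not a file shipped with the toolkit; the histogram series name `model_prediction_latency_ms_bucket` is an assumption about what the metrics collector exposes:

```yaml
groups:
  - name: model_monitoring
    rules:
      - alert: LatencySpike
        expr: histogram_quantile(0.99, sum(rate(model_prediction_latency_ms_bucket[5m])) by (le)) > 200
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 prediction latency above 200 ms"
```

The `for: 5m` clause keeps a single slow scrape interval from paging anyone; the alert fires only after the condition holds continuously.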
## Architecture

```
┌────────────────┐     ┌────────────────┐     ┌────────────────┐
│   Prediction   │────>│    Metrics     │────>│   Prometheus   │
│    Service     │     │   Collector    │     │                │
└────────────────┘     └────────────────┘     └───────┬────────┘
                                                      │
┌────────────────┐     ┌────────────────┐     ┌───────▼────────┐
│   Reference    │────>│     Drift      │────>│    Grafana     │
│    Dataset     │     │    Detector    │     │   Dashboards   │
└────────────────┘     └────────────────┘     └───────┬────────┘
                                                      │
                       ┌────────────────┐     ┌───────▼────────┐
                       │     Report     │<────│     Alert      │
                       │   Generator    │     │    Manager     │
                       └────────────────┘     └────────────────┘
```
## Usage Examples

### Set Up Drift Detection

```python
from model_monitoring.core import DriftDetector
import pandas as pd

# Load reference data (your training distribution)
reference = pd.read_parquet("./data/training_reference.parquet")

detector = DriftDetector(
    reference_data=reference,
    numerical_method="ks_test",
    categorical_method="chi_squared",
)

# Check drift on new production data
production_batch = pd.read_parquet("./data/production_batch_20260323.parquet")
report = detector.check_drift(production_batch)

for feature, result in report.items():
    status = "DRIFT" if result["drifted"] else "OK"
    print(f"{feature}: {status} (p={result['p_value']:.4f}, PSI={result['psi']:.4f})")
```
### Configure Prometheus Metrics

```python
from model_monitoring.core import MetricsCollector
from prometheus_client import start_http_server

# Expose a /metrics endpoint for Prometheus to scrape
start_http_server(port=8000)

collector = MetricsCollector(model_name="churn_predictor_v3")

# Log predictions from your serving code
collector.log_prediction(
    features={"age": 35, "tenure_months": 24},
    prediction=0.73,
    latency_ms=12.5,
)

# Log ground truth when labels arrive (async, hours or days later)
collector.log_ground_truth(prediction_id="pred_abc123", actual_label=1)
```
### Generate a Model Health Report

```python
from model_monitoring.core import ReportGenerator

generator = ReportGenerator.from_config("config.example.yaml")
report = generator.generate(period="weekly", start_date="2026-03-16", end_date="2026-03-23")
report.save_html("./reports/model_health_week_12.html")

print(f"Health: {report.health_score}/100 | Drifted: {report.drifted_features} | Trend: {report.performance_trend}")
```
## Configuration Reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| `drift_detection.features.numerical.method` | str | `ks_test` | Statistical test for numeric drift |
| `drift_detection.features.numerical.threshold` | float | `0.05` | p-value threshold for drift detection |
| `drift_detection.prediction.threshold` | float | `0.1` | PSI threshold for prediction drift |
| `performance.window_size` | int | `1000` | Sliding window size for metrics |
| `performance.ground_truth_delay_hours` | int | `24` | Expected delay for label arrival |
| `alerting.rules.*.severity` | str | `warning` | Alert severity: `info`, `warning`, `critical` |
## Best Practices

- **Monitor prediction distributions, not just performance** — Ground truth labels can take days or weeks to arrive. Prediction drift is your early warning system.
- **Set per-feature drift thresholds** — Not all features matter equally. Set tighter thresholds on high-importance features and looser ones on low-importance features.
- **Use PSI for business stakeholder communication** — PSI is more intuitive than KS test p-values. PSI < 0.1 = stable, 0.1-0.2 = moderate shift, > 0.2 = significant.
- **Track feature importance over time** — If the model starts relying heavily on a feature that was unimportant during training, something has changed fundamentally.
- **Establish a retraining trigger** — Define explicit criteria: "retrain when 3+ features show PSI > 0.2 or accuracy drops below 0.80 for 3 consecutive windows."
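The retraining criterion in the last bullet is easy to encode as a plain function. This is a sketch of the policy, not part of the toolkit; the `should_retrain` name and its inputs are made up for illustration:

```python
def should_retrain(feature_psis: dict[str, float],
                   recent_accuracies: list[float],
                   psi_threshold: float = 0.2,
                   drifted_features_min: int = 3,
                   accuracy_floor: float = 0.80,
                   consecutive_windows: int = 3) -> bool:
    """True when either explicit criterion from the best practice is met:
    3+ features with PSI > 0.2, or accuracy < 0.80 for 3 consecutive windows.
    """
    drifted = sum(1 for v in feature_psis.values() if v > psi_threshold)
    if drifted >= drifted_features_min:
        return True
    tail = recent_accuracies[-consecutive_windows:]
    return (len(tail) == consecutive_windows
            and all(a < accuracy_floor for a in tail))

flagged = should_retrain({"age": 0.25, "tenure": 0.31, "plan": 0.22}, [0.85, 0.84])
print(flagged)  # True: three features exceed the PSI threshold
```

Wiring such a check into the scheduler that already runs drift detection keeps the retraining decision auditable: the trigger is a readable function, not tribal knowledge.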
## Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| All features show drift | Reference dataset is too old | Refresh reference dataset with recent training data |
| Grafana dashboard shows no data | Prometheus scrape target not configured | Verify Prometheus config has the metrics endpoint URL and port |
| Drift alerts firing too frequently | Threshold too sensitive for noisy features | Increase threshold for noisy features, or use longer aggregation windows |
| Performance metrics show NaN | Ground truth labels not arriving | Check ground_truth_delay_hours, verify label pipeline is running |
This is 1 of 11 resources in the ML Engineer Toolkit. Get the complete Model Monitoring Dashboard with all files, templates, and documentation for $39.
Or grab the entire ML Engineer Toolkit bundle (11 products) for $149 — save 30%.