Mona Hamid

# Building an MLOps Monitoring Architecture That Actually Works

## The Problem 😅

You've probably been here:

- Deploy ML model ✅
- Model works great initially ✅
- Stakeholders are happy ✅
- Then... 📉 silent degradation
- Business metrics drop 📊
- "Why didn't we know sooner?" 🤔

Traditional monitoring doesn't work for ML models.

## The Architecture 🏗️

Built a 3-layer monitoring system:

### Layer 1: Models & Data 🤖

```
┌─────────────────┐     ┌─────────────────┐
│    ML Model     │     │  Data Storage   │
│    (FastAPI)    │◄────┤ (PostgreSQL/S3) │
└─────────────────┘     └─────────────────┘
```
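
To make this layer concrete, here is a minimal sketch of what the serving side could look like: a FastAPI endpoint that returns a prediction and logs every request so the monitoring layers have raw data to work with. The model file, the `predictions` table, and the connection string are illustrative placeholders, not the actual setup.

```python
import json

import joblib
import psycopg2
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # placeholder for the trained model artifact

class PredictionRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(request: PredictionRequest):
    prediction = float(model.predict([request.features])[0])

    # Log the raw features and the prediction; the drift checks read this table later
    with psycopg2.connect("postgresql://user:pass@db:5432/ml") as conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO predictions (features, prediction) VALUES (%s, %s)",
                (json.dumps(request.features), prediction),
            )

    return {"prediction": prediction}
```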

### Layer 2: Processing ⚙️

```
┌─────────────────┐     ┌─────────────────┐
│ Drift Detection │     │  Orchestration  │
│ (Evidently AI)  │◄────┤    (Prefect)    │
└─────────────────┘     └─────────────────┘
```
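
And a rough sketch of the processing layer: a Prefect flow that pulls a reference window and a current window, runs Evidently's drift preset, and can be served on a schedule (assuming a recent Prefect 2.x release where `Flow.serve` exists). The loader task, parquet paths, and cron string are illustrative stand-ins for the PostgreSQL/S3 reads in the diagram.

```python
import pandas as pd
from prefect import flow, task
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

@task(retries=2, retry_delay_seconds=60)
def load_window(path: str) -> pd.DataFrame:
    # Placeholder loader; in this setup the data would come from PostgreSQL/S3
    return pd.read_parquet(path)

@task
def run_drift_report(reference: pd.DataFrame, current: pd.DataFrame) -> dict:
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)
    return report.as_dict()

@flow(log_prints=True)
def monitoring_flow():
    reference = load_window("reference.parquet")  # training-time snapshot
    current = load_window("current.parquet")      # e.g. the last 24h of predictions
    result = run_drift_report(reference, current)
    print(result["metrics"][0]["result"]["dataset_drift"])

if __name__ == "__main__":
    # Serve the flow on an hourly cron schedule
    monitoring_flow.serve(name="ml-monitoring", cron="0 * * * *")
```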

### Layer 3: Alerts & Viz 📊

```
┌─────────────────┐     ┌───────────────────┐
│   Dashboards    │     │      Alerts       │
│    (Grafana)    │◄────┤ (Slack/PagerDuty) │
└─────────────────┘     └───────────────────┘
```

## Key Monitoring Metrics 📈

### 🎯 Prediction Drift

Detect when the distribution of model outputs shifts:


```python
from evidently.report import Report
from evidently.metrics import DatasetDriftMetric

def check_prediction_drift(reference, current):
    # Run the drift calculation against a fixed reference window
    report = Report(metrics=[DatasetDriftMetric()])
    report.run(reference_data=reference, current_data=current)

    # DatasetDriftMetric exposes a dataset-level drift flag
    return report.as_dict()["metrics"][0]["result"]["dataset_drift"]
```

### 📊 Feature Drift

Monitor input feature distributions:

- Mean/median shifts
- Standard deviation changes
- Quantile-based detection (see the sketch below)
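
If you want a distribution-level check without a full Evidently report, a two-sample Kolmogorov-Smirnov test per numeric feature is a simple approximation. This is only a sketch, and the 0.05 threshold is an illustrative starting point to tune per feature.

```python
import pandas as pd
from scipy.stats import ks_2samp

def drifted_features(reference: pd.DataFrame, current: pd.DataFrame, alpha: float = 0.05):
    drifted = []
    for column in reference.select_dtypes("number").columns:
        # KS test compares the full distributions, not just the means
        statistic, p_value = ks_2samp(reference[column], current[column])
        if p_value < alpha:  # distributions differ significantly
            drifted.append((column, statistic))
    return drifted
```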

โŒ Data Quality
Real-time validation:

Missing value %
Outlier detection
Schema changes
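
Here is a small pandas sketch of those three checks; the thresholds and the expected schema are made up for illustration and would live in config in practice.

```python
import pandas as pd

EXPECTED_COLUMNS = {"age", "income"}  # hypothetical expected schema

def check_quality(df: pd.DataFrame) -> dict:
    issues = {}

    # Missing value percentage per column, flagging anything above 5%
    missing_pct = df.isna().mean() * 100
    issues["high_missing"] = missing_pct[missing_pct > 5].to_dict()

    # Simple outlier detection: values more than 3 standard deviations from the mean
    numeric = df.select_dtypes("number")
    z_scores = (numeric - numeric.mean()) / numeric.std()
    issues["outlier_counts"] = (z_scores.abs() > 3).sum().to_dict()

    # Schema changes: missing or unexpected columns
    issues["missing_columns"] = sorted(EXPECTED_COLUMNS - set(df.columns))
    issues["unexpected_columns"] = sorted(set(df.columns) - EXPECTED_COLUMNS)

    return issues
```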

### 📉 Performance Metrics

When ground truth is available (see the sketch below):

- Accuracy trends
- F1-score evolution
- Business KPI correlation
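
Once labels land, this can be as simple as joining them back to logged predictions and tracking the scores per day. A sketch, assuming a dataframe with `timestamp`, `prediction`, and `label` columns:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

def daily_performance(df: pd.DataFrame) -> pd.DataFrame:
    """Expects columns: timestamp (datetime), prediction, label."""
    rows = []
    for day, group in df.groupby(df["timestamp"].dt.date):
        rows.append({
            "day": day,
            "accuracy": accuracy_score(group["label"], group["prediction"]),
            "f1": f1_score(group["label"], group["prediction"], average="macro"),
        })
    return pd.DataFrame(rows)
```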

## Implementation Example 💻

```python
import os

import requests
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset


class MLMonitor:
    def __init__(self, reference_data):
        self.reference_data = reference_data
        self.slack_webhook = os.getenv("SLACK_WEBHOOK")

    def monitor_predictions(self, current_data):
        """Main monitoring function"""

        # 1. Check for drift
        drift_result = self.check_drift(current_data)

        # 2. Validate data quality
        quality_result = self.check_quality(current_data)

        # 3. Send alerts if needed
        if drift_result["drift_detected"]:
            self.send_alert(f"🚨 Drift detected: {drift_result['drift_score']:.3f}")

        # 4. Update dashboards
        self.update_metrics(drift_result, quality_result)

    def check_drift(self, current_data):
        """Drift detection with Evidently"""
        report = Report(metrics=[DataDriftPreset()])
        report.run(reference_data=self.reference_data, current_data=current_data)

        # Reduce the full report to the two values the alerting logic needs
        summary = report.as_dict()["metrics"][0]["result"]
        return {
            "drift_detected": summary["dataset_drift"],
            "drift_score": summary["share_of_drifted_columns"],
        }

    def send_alert(self, message):
        """Send Slack notification"""
        payload = {
            "text": message,
            "channel": "#ml-alerts",
            "username": "ML Monitor Bot",
        }

        requests.post(self.slack_webhook, json=payload, timeout=10)
```
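
The class calls `check_quality` and `update_metrics` without showing them. `check_quality` can reuse the pandas checks sketched earlier; for `update_metrics`, one possible shape is a method you drop into `MLMonitor` that writes each run into a PostgreSQL table Grafana charts directly. The table name, columns, and `MONITORING_DB_DSN` variable are assumptions for illustration.

```python
import os
from datetime import datetime, timezone

import psycopg2

def update_metrics(self, drift_result, quality_result):
    """Write one row per monitoring run into the table Grafana queries."""
    with psycopg2.connect(os.getenv("MONITORING_DB_DSN")) as conn:
        with conn.cursor() as cur:
            cur.execute(
                """
                INSERT INTO ml_monitoring (ts, drift_detected, drift_score, quality_issues)
                VALUES (%s, %s, %s, %s)
                """,
                (
                    datetime.now(timezone.utc),
                    drift_result["drift_detected"],
                    drift_result["drift_score"],
                    str(quality_result),  # however check_quality summarizes issues
                ),
            )
```
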
## Results 📊

After implementing this system:

| Metric | Before | After |
|--------|--------|-------|
| Detection Time | 2-3 days | 2-3 hours |
| Monthly Incidents | 8 | 3 |
| False Positive Rate | 40% | 5% |
| Stakeholder Confidence | Low | High |

## Tech Stack Choices 🛠️

### Why Evidently AI?

- Open source & flexible
- Excellent drift algorithms
- Great documentation
- Active community

### Why Grafana?

- Beautiful dashboards
- Real-time capabilities
- PostgreSQL integration
- Industry standard

### Why Prefect over Airflow?

- Modern Python-first approach
- Better error handling (illustrated below)
- Easier Kubernetes deployment
- Superior observability

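On the error-handling point: in Prefect, retries and timeouts are plain decorator arguments, with no operator subclassing required. A tiny illustration, with arbitrary values and a hypothetical task name:

```python
from prefect import task

@task(retries=3, retry_delay_seconds=30, timeout_seconds=300)
def score_batch(batch_id: str) -> None:
    ...  # flaky work (API calls, DB reads) gets retried automatically
```
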
## Lessons Learned 💡

### ✅ What Worked

- Start simple: basic drift detection first
- Tune thresholds: avoid alert fatigue (see the sketch below)
- Pretty dashboards: stakeholders love visuals
- Automation: let the system handle simple fixes
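
On threshold tuning: Evidently's `DataDriftPreset` exposes knobs such as `drift_share` and a per-column statistical test with its own threshold. The values below are illustrative starting points, not recommendations.

```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Flag dataset drift only when half of the columns drift, and use PSI with a
# looser per-column threshold instead of the defaults.
report = Report(metrics=[
    DataDriftPreset(drift_share=0.5, stattest="psi", stattest_threshold=0.2),
])
```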

โŒ What Failed

Too many alerts initially - Alert fatigue is real
Complex metrics upfront - Confused the team
Manual processes - Doesn't scale


## What's Next? 🔮

Planning to add:

- Automated retraining triggers
- A/B testing integration
- Cost monitoring per prediction
- Explainability tracking with SHAP

## Conclusion 🎉

ML monitoring isn't optional anymore. This architecture has:

- Caught issues 10x faster
- Reduced incidents by 60%
- Improved stakeholder trust
- Made our ML systems actually reliable

Key takeaway: Treat monitoring as a first-class citizen in your ML pipeline.

What monitoring challenges are you facing? Share in the comments! 
