Jayita Gulati

Setting Up Model Performance Monitoring with Python and Docker

Deploying a machine learning model is a big milestone, but it’s not the finish line. In fact, most of the real challenges in machine learning start after deployment. Once your model is live, its performance can degrade for reasons like data drift, concept drift, or infrastructure issues.

That’s where model performance monitoring comes in. Monitoring is about continuously tracking your model’s predictions, evaluating performance, and alerting you when something goes wrong.

In this article, we’ll go through setting up a basic model monitoring pipeline using Python and Docker.

Why Monitor Models in Production?

Machine learning models are not static—they’re products of the data they were trained on. Once deployed, they’re exposed to new, unseen data, which may not look like the training data. If input data changes, or if the relationship between features and the target shifts, performance will drop.

Here are some common reasons models fail in production:

  1. Data Drift: The statistical distribution of input data changes over time.
  2. Concept Drift: The underlying relationship between features and the target changes.
  3. Infrastructure Failures: Even if the model logic is fine, latency spikes, memory leaks, or service crashes can still degrade user experience.

Monitoring helps you catch these problems early. Instead of waiting for business KPIs to drop, you’ll have real-time visibility into your model’s accuracy, latency, and stability.
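
To make the data-drift point concrete, here is a minimal sketch of a drift check: it compares a feature's training-time distribution against recent production values with a two-sample Kolmogorov–Smirnov test. The samples below are simulated purely for illustration.

import numpy as np
from scipy.stats import ks_2samp

# Reference sample of one feature, captured when the model was trained (simulated here)
reference = np.random.normal(loc=0.0, scale=1.0, size=1000)

# Recent production values for the same feature (simulated with a shifted mean to mimic drift)
live = np.random.normal(loc=0.5, scale=1.0, size=1000)

# Two-sample KS test: a small p-value suggests the two distributions differ
statistic, p_value = ks_2samp(reference, live)
if p_value < 0.05:
    print(f"Possible data drift: KS statistic={statistic:.3f}, p-value={p_value:.4f}")
else:
    print("No significant drift detected")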

Architecture Overview

To implement monitoring effectively, a common stack combines several components working together:

  1. Inference API (FastAPI): The model is deployed behind an API that exposes two key endpoints—/predict (for serving predictions) and /metrics (for exposing performance and system metrics in a Prometheus-compatible format).
  2. Prometheus: A time-series database designed for monitoring. Prometheus periodically scrapes the /metrics endpoint, storing metrics over time so you can analyze trends and set up alert rules.
  3. Grafana: A visualization layer on top of Prometheus. It allows you to build dashboards to monitor accuracy, latency, drift, and business KPIs in real time. Grafana also supports alerting and integration with Slack, PagerDuty, and other tools.
  4. Docker Compose: To tie everything together, Docker Compose orchestrates the services (FastAPI, Prometheus, Grafana) in one environment, making it easy to spin up the full monitoring stack locally or in staging before moving to production. A minimal compose file is sketched after this list.
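
For orientation, here is a minimal docker-compose.yml sketch for such a stack. The service names, build context, and ports are illustrative assumptions; the rest of this article runs each container individually with docker run.

# docker-compose.yml (illustrative sketch)
services:
  model-api:
    build: .                 # the FastAPI service (or monitoring script) built from the local Dockerfile
    ports:
      - "8000:8000"          # exposes /predict and /metrics

  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"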

This monitoring stack pairs well with other production practices such as CI/CD pipelines and model compression for efficiency.

Create a Python Monitoring Script

Let’s start with a simple monitoring script that logs model accuracy. For demonstration, we’ll simulate predictions and true labels instead of using a real dataset.

from prometheus_client import start_http_server, Gauge
from sklearn.metrics import accuracy_score
import time
import random

# Define Prometheus metrics
accuracy_gauge = Gauge('model_accuracy', 'Accuracy of model predictions')

def get_mock_predictions():
    """Simulate predictions and labels for demo purposes."""
    y_true = [random.randint(0, 1) for _ in range(100)]
    y_pred = [random.randint(0, 1) for _ in range(100)]
    return y_true, y_pred

def monitor_model():
    while True:
        y_true, y_pred = get_mock_predictions()
        acc = accuracy_score(y_true, y_pred)
        accuracy_gauge.set(acc)
        print(f"Logged accuracy: {acc:.2f}")
        time.sleep(10)

if __name__ == "__main__":
    start_http_server(8000)  # Expose metrics at http://localhost:8000/metrics
    monitor_model()

How it works:

  1. We use the Prometheus client library to define a metric (model_accuracy).
  2. The script simulates predictions and calculates accuracy with scikit-learn.
  3. Every 10 seconds, it updates the Prometheus metric.
  4. The HTTP server on port 8000 exposes metrics in a format Prometheus can scrape.

If you run this script, you’ll see logs like:

Logged accuracy: 0.52
Logged accuracy: 0.48

And if you visit http://localhost:8000/metrics, you’ll see:

# HELP model_accuracy Accuracy of model predictions
# TYPE model_accuracy gauge
model_accuracy 0.48

This endpoint is exactly what Prometheus expects.
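
Accuracy is only one signal. As a sketch of how the same script could be extended (the metric name, bucket boundaries, and simulated inference time below are illustrative assumptions, not part of the original example), prediction latency can be tracked with a Prometheus Histogram:

from prometheus_client import Histogram
import random
import time

# Distribution of prediction latency, in seconds
latency_histogram = Histogram(
    'model_prediction_latency_seconds',
    'Time spent producing a prediction',
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0],
)

def timed_prediction():
    start = time.time()
    time.sleep(random.uniform(0.01, 0.2))  # stand-in for real model inference
    latency_histogram.observe(time.time() - start)

Prometheus then stores the resulting _bucket, _sum, and _count series, and Grafana can derive latency percentiles from them with histogram_quantile.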

Dockerize the Monitoring Service

In production, you don’t want to run raw Python scripts. You want containers—portable, reproducible environments that can run anywhere.

Here’s how to Dockerize the monitoring script.

requirements.txt

scikit-learn
prometheus-client

Dockerfile

# Use Python base image
FROM python:3.9-slim

# Set working directory
WORKDIR /app

# Copy dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy monitoring script
COPY monitor.py .

# Run monitoring service
CMD ["python", "monitor.py"]

Build and run the container:

docker build -t model-monitor .
docker run -p 8000:8000 model-monitor

Now your monitoring service is running inside Docker, accessible on http://localhost:8000/metrics.
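
As a quick sanity check, you can query the endpoint from the host and confirm the gauge is present:

curl http://localhost:8000/metrics | grep model_accuracy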

Setting Up Prometheus

Prometheus is an open-source monitoring system and time-series database designed to collect, store, and query metrics. Setting it up involves running the Prometheus server, defining what targets to scrape, and configuring how data is stored.

prometheus.yml

scrape_configs:
  - job_name: 'model_monitor'
    scrape_interval: 10s  # match the script's 10-second update cadence (Prometheus defaults to 1m)
    static_configs:
      - targets: ['host.docker.internal:8000']  # the /metrics endpoint exposed by the monitoring container

Run Prometheus in Docker:

docker run -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus

Navigate to http://localhost:9090 to explore the metrics. Note that host.docker.internal resolves automatically with Docker Desktop on macOS and Windows; on Linux, add --add-host=host.docker.internal:host-gateway to the docker run command or point the scrape target at the host's IP instead.
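
In the expression browser, you can query the gauge directly or smooth it over a time window. For example (the expressions below are illustrative):

# Latest reported accuracy
model_accuracy

# Accuracy averaged over the last 15 minutes, which smooths out a noisy gauge
avg_over_time(model_accuracy[15m])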

Visualizing with Grafana

Grafana is a visualization platform that connects to Prometheus to turn time-series metrics into interactive dashboards. Prometheus scrapes and stores metrics, while Grafana queries them and displays charts, gauges, and alerts.

Run it with Docker:

docker run -d -p 3000:3000 grafana/grafana
Then, in the Grafana UI:
  1. Log in at http://localhost:3000 (default user: admin / admin).
  2. Add Prometheus as a data source (http://host.docker.internal:9090); a provisioning-file alternative is sketched after this list.
  3. Create panels for accuracy over time, latency histograms, or drift metrics.
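
If you'd rather not configure the data source by hand, Grafana can also load it from a provisioning file mounted into the container. A minimal sketch following Grafana's data-source provisioning format (the file name and mount path are assumptions):

# datasources.yml, mounted into /etc/grafana/provisioning/datasources/
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://host.docker.internal:9090
    isDefault: true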

Challenges in Model Monitoring

While monitoring is essential, it comes with its own set of challenges:

  1. Delayed Ground Truth: Many models, such as churn or fraud detection systems, rely on labels that only become available after days or weeks. This delay makes real-time accuracy tracking difficult.

  2. Data Quality Issues: Noisy or incomplete input data can trigger false alerts. Distinguishing between real drift and bad data pipelines is often tricky.

  3. Alert Fatigue: Setting thresholds too tightly can overwhelm teams with alerts, while loose thresholds can miss critical failures. Striking the right balance is hard.

  4. Scalability: Monitoring thousands of models or high-traffic APIs requires careful resource management. Metrics storage and query performance can quickly become bottlenecks.

  5. Contextual Understanding: Not all performance drops mean failure—sometimes business objectives shift, requiring updates to monitoring logic and KPIs.

Best Practices

  1. Define Meaningful Alerts – Set smart thresholds and use tiered alerts to avoid fatigue and ensure the right team is notified (see the example rule after this list).

  2. Automate Monitoring Setup – Containerize and version-control configs so every new model automatically gets monitoring.

  3. Plan for Scale – Use Prometheus federation or long-term storage solutions to handle large numbers of models and metrics.

  4. Close the Feedback Loop – Let monitoring insights trigger retraining, feature updates, or infrastructure fixes automatically.
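
As an example of the first practice, an alert can be defined in a Prometheus rule file and referenced from prometheus.yml under rule_files. The rule, threshold, and names below are illustrative assumptions, not values from this article's setup:

# alerts.yml (illustrative sketch)
groups:
  - name: model-alerts
    rules:
      - alert: ModelAccuracyLow
        expr: avg_over_time(model_accuracy[15m]) < 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Model accuracy has stayed below 0.7 for 10 minutes"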

Final Thoughts

Monitoring machine learning models is not optional—it’s essential for maintaining trust in production systems. With Python, Prometheus, Grafana, and Docker, you can build a monitoring stack that not only tracks accuracy but also surfaces drift, latency, and business KPIs.

Start small: expose a metric, containerize it, and watch it in Grafana. From there, evolve toward batch evaluation, drift detection, and alerting.
