Erythix
Monitoring an ML Pipeline in Production: Anatomy of an Open-Source Stack

This isn't a theoretical guide. It's a field report on the observability stack I've built and iterated across engagements and demos on the AI Observability Hub - a demonstration platform I use to validate AI monitoring architectures before deploying them at client sites.

The goal is straightforward: give an SRE, data engineer, or CTO the building blocks to monitor an ML pipeline in production with VictoriaMetrics, OpenTelemetry, and Grafana. No vendor lock-in. No proprietary platform. Open-source components, assembled with intention.


What we actually monitor (and what we forget)

Most organizations deploying ML in production settle for monitoring infrastructure: CPU, RAM, disk space. That's necessary, but it's the equivalent of watching a factory's temperature without looking at the quality of parts coming off the line.

A production ML pipeline has four observability layers:

Infrastructure: the foundation. GPU utilization (compute, VRAM, memory bandwidth), CPU, network, disk I/O. Without it, you don't even know if the machine is running. But with it alone, you don't know if the model is working.

Data pipeline: the invisible layer. Training data freshness, ingestion latency, feature completeness, statistical drift in input distributions. A model receiving degraded data produces degraded results, and nothing in the infra metrics flags it.

Model: this is what data scientists care about, but what often goes unmonitored in production. Inference latency, throughput (requests/second), confidence score distribution, fallback rate, prediction vs. ground truth comparison when available.

Cost: this is what leadership cares about, and what's often discovered too late. Cost per inference, GPU cost per model, cost/business-value ratio. A model that costs €3 per inference on a use case generating €0.50 in value isn't a technical problem: it's a business problem that only observability makes visible.


The reference architecture

Here's the stack I've built and deploy in my engagements. Each component was chosen for a specific reason - not by habit or popularity.

┌─────────────────────────────────────────────────────────────────┐
│                        ML APPLICATIONS                          │
│  vLLM / TGI / Triton / Custom Flask-FastAPI                    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │         OpenTelemetry SDK + Auto-instrumentation        │    │
│  │    Traces (spans)  │  Metrics (counters/histograms)     │    │
│  └────────────────────┼────────────────────────────────────┘    │
└───────────────────────┼─────────────────────────────────────────┘
                        │
                        ▼
┌─────────────────────────────────────────────────────────────────┐
│                  OPENTELEMETRY COLLECTOR                         │
│  Receivers: OTLP (gRPC/HTTP), Prometheus scrape                 │
│  Processors: batch, filter, attributes, tail_sampling           │
│  Exporters: prometheusremotewrite, otlp                         │
└──────────┬──────────────────────────────────────────────────────┘
           │                                  │
           ▼                                  ▼
┌──────────────────────┐         ┌────────────────────────────────┐
│   VICTORIAMETRICS    │         │       OPENOBSERVE / LOKI       │
│   (metrics TSDB)     │         │       (logs + traces)          │
│   ┌──────────────┐   │         │                                │
│   │  vmselect    │   │         │  Long retention for audit      │
│   │  vminsert    │   │         │  Full-text search              │
│   │  vmstorage   │   │         │  Trace-log correlation         │
│   └──────────────┘   │         └────────────────────────────────┘
└──────────┬───────────┘                      │
           │                                  │
           ▼                                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                          GRAFANA                                │
│  Infra Dashboard  │  Model Dashboard  │  Cost Dashboard         │
│  Alerting (Alertmanager) → PagerDuty / Slack / Email            │
└─────────────────────────────────────────────────────────────────┘

Component by component

OpenTelemetry: the instrumentation standard

OpenTelemetry is the instrumentation choice for a non-negotiable reason: it's the only vendor-agnostic standard covering traces, metrics, and logs in a unified framework. Instrumenting with OTel guarantees the freedom to swap backends without re-instrumenting.

For an ML pipeline, instrumentation covers three levels:

The application SDK integrates directly into the inference service code. For a Python service (FastAPI, Flask), OTel auto-instrumentation automatically captures HTTP requests, database calls, and processing spans. For model-specific metrics, custom instruments are added:

import time

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Meter provider configuration: export metrics to the Collector via OTLP/gRPC
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("ml.inference")

# Custom metrics for the ML pipeline
inference_duration = meter.create_histogram(
    name="ml.inference.duration",
    description="Inference duration in milliseconds",
    unit="ms"
)

inference_tokens = meter.create_counter(
    name="ml.inference.tokens.total",
    description="Total tokens generated"
)

confidence_score = meter.create_histogram(
    name="ml.inference.confidence",
    description="Confidence score distribution",
    unit="1"
)

gpu_cost_counter = meter.create_counter(
    name="ml.inference.cost.gpu",
    description="Cumulative estimated GPU cost in euros",
    unit="EUR"
)

# In the inference function
def predict(request):
    start = time.time()

    result = model.generate(request.prompt)

    duration_ms = (time.time() - start) * 1000
    tokens = result.token_count
    conf = result.confidence

    # Record metrics with labels
    labels = {
        "model_name": "llama-3-8b",
        "model_version": "v2.1",
        "environment": "production",
        "use_case": "maintenance_assistant"
    }

    inference_duration.record(duration_ms, labels)
    inference_tokens.add(tokens, labels)
    confidence_score.record(conf, labels)

    # GPU cost estimation (based on hourly rate / time consumed)
    gpu_hourly_rate = 2.50  # €/h for an A100
    gpu_cost = (duration_ms / 3_600_000) * gpu_hourly_rate
    gpu_cost_counter.add(gpu_cost, labels)

    return result

The Collector is the consolidation point. It receives telemetry data from all services, transforms, filters, and routes it to storage backends. It's the most underestimated component in the stack — and the most critical.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

  # Scrape GPU metrics from DCGM/nvidia-smi exporter
  prometheus:
    config:
      scrape_configs:
        - job_name: 'dcgm-exporter'
          scrape_interval: 15s
          static_configs:
            - targets: ['dcgm-exporter:9400']
        - job_name: 'node-exporter'
          scrape_interval: 30s
          static_configs:
            - targets: ['node-exporter:9100']

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024

  # Filter low-value metrics
  filter/drop-debug:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ".*debug.*"
          - ".*test.*"

  # Attribute enrichment
  attributes/env:
    actions:
      - key: deployment.environment
        value: production
        action: upsert
      - key: service.namespace
        value: ml-platform
        action: upsert

  # Trace sampling (keep 100% of errors, 10% of the rest)
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-always
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  prometheusremotewrite:
    endpoint: "http://victoriametrics:8428/api/v1/write"
    resource_to_telemetry_conversion:
      enabled: true

  otlp/traces:
    endpoint: "openobserve:5081"
    tls:
      insecure: true

  otlp/logs:
    endpoint: "openobserve:5081"
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp, prometheus]
      processors: [batch, filter/drop-debug, attributes/env]
      exporters: [prometheusremotewrite]
    traces:
      receivers: [otlp]
      processors: [batch, tail_sampling, attributes/env]
      exporters: [otlp/traces]
    logs:
      receivers: [otlp]
      processors: [batch, attributes/env]
      exporters: [otlp/logs]

The key point here is tail sampling. A production ML pipeline can generate thousands of traces per minute. Storing everything is costly and unnecessary. Tail sampling keeps 100% of error traces (the ones that matter for debugging) and samples the rest - reducing storage volume without losing signal.

VictoriaMetrics: storage that handles the load

I chose VictoriaMetrics over Prometheus for a simple reason: cardinality.

A production ML pipeline generates metrics with a high number of label combinations: model × version × use case × environment × request type × user. Prometheus starts struggling beyond a few million active time series. VictoriaMetrics is designed to handle this scale with significantly lower memory and disk footprint.
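
A quick back-of-envelope calculation shows how fast this multiplies. The numbers below are illustrative assumptions, not measurements from any deployment:

```python
from math import prod

# Illustrative label cardinalities for one inference service (assumptions)
label_cardinalities = {
    "model_name": 4,
    "model_version": 3,
    "environment": 2,
    "use_case": 5,
    "status": 3,
}
metric_families = 20  # counters + histogram series exported by the service

# Worst case: every label combination is active for every metric family
worst_case_series = prod(label_cardinalities.values()) * metric_families
print(worst_case_series)  # prints 7200
```

Bounded labels stay manageable; add a single user_id label with 10,000 values and the same arithmetic lands at 72 million series.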

In practice, in my deployments:

Single-node mode for mid-market companies with moderate volume (< 500k active series). A single binary, minimal configuration, excellent performance. This is the mode I recommend to start with.

# Launch VictoriaMetrics single-node
docker run -d \
  --name victoriametrics \
  -v /data/vm:/victoria-metrics-data \
  -p 8428:8428 \
  victoriametrics/victoria-metrics:stable \
  -retentionPeriod=180d \
  -search.maxUniqueTimeseries=5000000 \
  -dedup.minScrapeInterval=15s

Cluster mode (vmselect + vminsert + vmstorage) when volume exceeds one million series or when high availability is required. This is the mode I use on the AI Observability Hub to simulate realistic loads.

Retention parameters are an architecture decision, not a configuration detail. For operational observability (SRE), 30 to 90 days suffice. For governance and audit (EU AI Act), plan for 12 to 36 months — and that's where VictoriaMetrics' compression makes a real difference in storage cost compared to alternatives.

Grafana: three dashboards, three audiences

Grafana isn't just a visualization tool. It's the translation layer between technical data and human decisions. A dashboard that just shows curves without guiding action is a useless dashboard.

I systematically structure ML observability into three dashboards:


Dashboard 1: Infra & GPU (audience: SRE/DevOps)

This dashboard answers: "Is the platform holding up?"

Key metrics:

# GPU compute utilization (via DCGM exporter)
DCGM_FI_DEV_GPU_UTIL{instance=~"$gpu_node"}

# GPU memory used vs. total
DCGM_FI_DEV_FB_USED{instance=~"$gpu_node"}
  / DCGM_FI_DEV_FB_TOTAL{instance=~"$gpu_node"} * 100

# GPU temperature (alert if > 85°C)
DCGM_FI_DEV_GPU_TEMP{instance=~"$gpu_node"}

# Inference throughput (requests/second)
rate(ml_inference_duration_count[5m])

# Inference p95 latency
histogram_quantile(0.95,
  rate(ml_inference_duration_bucket[5m])
)

# Queue saturation (if applicable)
ml_inference_queue_depth

Configured alerts:

  • GPU utilization > 95% for 10 minutes → capacity alert
  • GPU temperature > 85°C → thermal alert
  • p95 latency > SLO threshold (e.g., 2s for a conversational assistant) → performance alert
  • Queue depth > 100 requests for 5 minutes → saturation alert
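
As a sketch, those four alerts can be written as Prometheus-style alerting rules (evaluated by vmalert against VictoriaMetrics). The thresholds mirror the list above; metric names assume the DCGM exporter and the instrumentation shown earlier, and the latency rule assumes durations recorded in milliseconds:

```yaml
groups:
  - name: ml_infra_alerts
    rules:
      - alert: GPUCapacitySaturation
        expr: avg by (instance) (DCGM_FI_DEV_GPU_UTIL) > 95
        for: 10m
        labels:
          severity: warning
      - alert: GPUThermal
        expr: DCGM_FI_DEV_GPU_TEMP > 85
        for: 2m
        labels:
          severity: critical
      - alert: InferenceLatencySLO
        expr: >
          histogram_quantile(0.95,
            sum by (le) (rate(ml_inference_duration_bucket[5m]))) > 2000
        for: 5m
        labels:
          severity: warning
      - alert: QueueSaturation
        expr: ml_inference_queue_depth > 100
        for: 5m
        labels:
          severity: warning
```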

Dashboard 2: Model & Quality (audience: data engineers, ML engineers)

This dashboard answers: "Is the model doing its job?"

This is the dashboard missing from 90% of ML deployments I audit. The infra is running, the service responds, but nobody knows if the responses are good.

Key metrics:

# Confidence score distribution (heatmap)
ml_inference_confidence_bucket

# Rolling 24h average confidence score
increase(ml_inference_confidence_sum[24h])
  / increase(ml_inference_confidence_count[24h])

# Low-confidence response rate (< 0.6; assumes a bucket boundary at 0.6)
sum(rate(ml_inference_confidence_bucket{le="0.6"}[1h]))
  / sum(rate(ml_inference_confidence_count[1h])) * 100

# Model error rate (timeouts, exceptions, fallbacks)
sum(rate(ml_inference_duration_count{status="error"}[5m]))
  / sum(rate(ml_inference_duration_count[5m])) * 100

# Average tokens generated per request
# (a per-request distribution would require instrumenting tokens as a histogram)
sum(rate(ml_inference_tokens_total[1h]))
  / sum(rate(ml_inference_duration_count[1h]))

# Drift detector: current vs. baseline distribution comparison
# (requires a periodic compute job publishing the metric)
ml_feature_drift_score{feature="input_length"}

Configured alerts:

  • Average confidence score < adaptive threshold for 48h → drift alert
  • Low-confidence response rate > 20% → quality alert
  • Model error rate > 5% for 15 minutes → critical alert
  • Drift score > 0.3 on a key feature → data shift alert

The drift alert is the most important and hardest to calibrate. The threshold isn't static — it must be calculated against a baseline established over a reference period (the first 30 days in production, for example). This is a use case where VictoriaMetrics recording rules come into their own:

# Recording rules for baseline calculation
groups:
  - name: ml_baseline
    interval: 1h
    rules:
      - record: ml:confidence:baseline_avg
        expr: >
          increase(ml_inference_confidence_sum[30d])
            / increase(ml_inference_confidence_count[30d])

      - record: ml:confidence:current_avg
        expr: >
          increase(ml_inference_confidence_sum[24h])
            / increase(ml_inference_confidence_count[24h])

      - record: ml:confidence:drift_ratio
        expr: >
          abs(ml:confidence:current_avg - ml:confidence:baseline_avg)
            / ml:confidence:baseline_avg
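
With the recorded series in place, the drift alert itself reduces to a one-line rule. The 0.15 threshold is an illustrative starting point to tune against your own baseline, not a recommendation:

```yaml
groups:
  - name: ml_drift_alerts
    rules:
      - alert: ConfidenceDrift
        expr: ml:confidence:drift_ratio > 0.15
        for: 48h
        labels:
          severity: warning
        annotations:
          summary: "Confidence has drifted more than 15% from the 30-day baseline"
```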

Dashboard 3: Cost & Business (audience: leadership, finance, product owners)

This dashboard answers: "How much does it cost and how much value does it deliver?"

This is the dashboard that turns a cost center into a value center — and the one that'll keep your budget next year.

Key metrics:

# Cumulative GPU cost by model (current day)
sum by (model_name)(
  increase(ml_inference_cost_gpu_total[24h])
)

# Average cost per inference
sum(rate(ml_inference_cost_gpu_total[1h]))
  / sum(rate(ml_inference_duration_count[1h]))

# Cost by use case
sum by (use_case)(
  increase(ml_inference_cost_gpu_total[30d])
)

# Daily inference volume
sum(increase(ml_inference_duration_count[24h]))

# Projected end-of-month cost (linear extrapolation)
sum(increase(ml_inference_cost_gpu_total[24h])) * 30

# Tokens/cost ratio (efficiency)
sum(rate(ml_inference_tokens_total[1h]))
  / sum(rate(ml_inference_cost_gpu_total[1h]))

This dashboard must be readable by someone who doesn't know what a quantile is. Big numbers at the top (today's cost, projected monthly cost, inference count), trends below, details by model and use case at the bottom.


Deployment: from docker-compose to cluster

Phase 1: Prototype (docker-compose)

To validate the architecture, a docker-compose.yml is enough. This is what I use on the AI Observability Hub for quick demos:

version: '3.8'

services:
  victoriametrics:
    image: victoriametrics/victoria-metrics:stable
    ports:
      - "8428:8428"
    volumes:
      - vm-data:/victoria-metrics-data
    command:
      - "-retentionPeriod=90d"
      - "-search.maxUniqueTimeseries=3000000"

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # gRPC
      - "4318:4318"   # HTTP
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    depends_on:
      - victoriametrics
      - openobserve

  openobserve:
    image: public.ecr.aws/zinclabs/openobserve:latest
    ports:
      - "5080:5080"   # UI
      - "5081:5081"   # Ingestion
    environment:
      - ZO_ROOT_USER_EMAIL=admin@erythix.com
      - ZO_ROOT_USER_PASSWORD=changeme
    volumes:
      - oo-data:/data

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=changeme
    depends_on:
      - victoriametrics

  # ML load simulator (for demos)
  ml-simulator:
    build: ./ml-simulator
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
      - MODEL_NAME=llama-3-8b
    depends_on:
      - otel-collector

volumes:
  vm-data:
  oo-data:
  grafana-data:

Phase 2: Production (Kubernetes)

For production, each component is deployed via Helm charts or Kubernetes manifests with the following considerations:

VictoriaMetrics: official Helm chart (victoria-metrics-k8s-stack) including the VictoriaMetrics operator, recording rules, and Grafana integration. Cluster mode for HA, with PVCs on performant storage (local SSD or AWS EBS gp3).

OTel Collector: deployed as a DaemonSet (one per node, for system and GPU metric collection) + a centralized Deployment (for aggregation, tail sampling, and routing). The DaemonSet collects DCGM metrics and local logs. The central Deployment handles processing and export.

Grafana: deployed with automatic datasource and dashboard provisioning via ConfigMaps. Dashboards are versioned in Git and deployed via CI/CD — no manual configuration that drifts over time.
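
As an example of that provisioning, a minimal datasource file is enough for Grafana to pick up VictoriaMetrics, which speaks the Prometheus query API. Paths and names here are illustrative:

```yaml
# grafana/provisioning/datasources/victoriametrics.yaml
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus          # VM is queried through its Prometheus-compatible API
    url: http://victoriametrics:8428
    access: proxy
    isDefault: true
```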


Pitfalls I learned to avoid

After multiple iterations on the AI Observability Hub and real-world deployments, here are the most expensive mistakes.

Cardinality explosion

Trap number one. A user_id label on inference metrics seems useful - until 10,000 users generate 10,000 time series per metric. Multiply by 20 metrics and 3 model versions, and you hit 600,000 series for a single service.

The rule: high-cardinality labels (user ID, request ID, session ID) belong in traces and logs, not in metrics. Metrics use bounded-cardinality labels: model_name, model_version, environment, use_case, status.
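
One cheap way to enforce this rule is a guard in your own instrumentation helpers. This is a hypothetical convention, not an OpenTelemetry feature:

```python
# Hypothetical guard: reject high-cardinality label keys before they reach
# a metric instrument. High-cardinality identifiers go on spans/logs instead.
ALLOWED_METRIC_LABELS = {"model_name", "model_version", "environment", "use_case", "status"}

def safe_labels(labels: dict) -> dict:
    """Return labels unchanged, or raise if a high-cardinality key sneaks in."""
    forbidden = set(labels) - ALLOWED_METRIC_LABELS
    if forbidden:
        raise ValueError(
            f"high-cardinality labels belong in traces/logs, not metrics: {sorted(forbidden)}"
        )
    return labels

print(safe_labels({"model_name": "llama-3-8b", "status": "ok"}))
```

Calling it on every `record()`/`add()` turns a cardinality explosion into a failing test instead of a storage incident.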

Cargo cult monitoring

Copying someone else's dashboards without understanding what they measure. I've seen teams with 47 panels on a dashboard, 43 of which nobody looked at. A useful dashboard has between 6 and 12 panels, organized by business question, not by metric type.

No baseline

Monitoring without a baseline is like having a thermometer without knowing what temperature is normal. The first 30 days of a model's production run should establish baselines for every key metric. Drift alerts are calculated against these baselines, not against arbitrary thresholds.

Only monitoring the happy path

Instrumenting only the nominal path and discovering during an incident that the error path isn't traced. Every fallback, every timeout, every exception should produce metrics and spans with an explicit error status. Errors are where observability creates the most value.
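
In code, that means every exit path records a status, not just the nominal one. The sketch below uses an in-memory dict as a stand-in for the OTel counter from the instrumentation section; the shape of the try/except is what matters:

```python
import time

requests_by_status = {}  # stand-in for an OTel counter with a `status` label

def record(status: str, duration_ms: float) -> None:
    requests_by_status[status] = requests_by_status.get(status, 0) + 1
    # in real code: inference_duration.record(duration_ms, {**labels, "status": status})

def predict(generate, prompt, fallback="I don't know"):
    start = time.time()
    try:
        result = generate(prompt)
        record("ok", (time.time() - start) * 1000)
        return result
    except TimeoutError:
        record("timeout", (time.time() - start) * 1000)
        return fallback               # fallback path is measured too
    except Exception:
        record("error", (time.time() - start) * 1000)
        raise                         # error path is measured, then propagated

print(predict(lambda p: p.upper(), "hello"))  # prints HELLO
```

With this structure, the error-rate and fallback-rate queries in the model dashboard get real data instead of silence during incidents.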

The cost of monitoring itself

I've seen observability stacks that cost more than the infrastructure they monitored. VictoriaMetrics helps significantly here (aggressive compression, low memory footprint), but sizing must be planned from the start. Rule of thumb: monitoring cost shouldn't exceed 5-10% of the monitored infrastructure cost.


The result: what the stack makes possible

When all four observability layers are in place (infra, pipeline, model, cost), three things become possible that weren't before:

Drift detection in days, not months. On a recent engagement, the stack detected a confidence score degradation in a predictive maintenance model 72 hours after a sensor change on the factory floor. Without model monitoring, the team would have continued following degraded recommendations for weeks.

Data-driven cost/performance optimization. The cost dashboard helped a client discover that a marginal use case (5% of volume) consumed 35% of the GPU budget due to an oversized model. Replaced with a lighter model, same perceived quality, 30% reduction in the overall bill.

Governance as a byproduct of observability. The audit trails required for the EU AI Act aren't an additional effort: they're the traces and logs the stack already collects. It's a matter of structuring them for audit, not creating them from scratch.


Where to start

If you have nothing today, here's the sequence I recommend:

Week 1: Instrument. Add the OpenTelemetry SDK to your inference service. Five metrics are enough to start: inference duration, request count, error rate, tokens generated, confidence score. Deploy the Collector in minimal mode.

Week 2: Store and visualize. Deploy VictoriaMetrics in single-node mode and Grafana. Create the infra dashboard first (it's the fastest), then the model dashboard. It doesn't need to look pretty — it needs to be functional.

Week 3: Alert. Configure three alerts only: p95 latency above SLO, error rate above 5%, confidence score dropping. Three well-calibrated alerts are worth more than twenty that generate fatigue.

Month 2: Refine. Add cost metrics. Establish your baselines. Configure recording rules for drift calculation. Create the cost/business dashboard. At this point, you have production-grade ML observability.

Month 3+: Extend. Add traces for advanced debugging. Integrate structured logs. Connect AI alerts to your SIEM. Explore input feature monitoring for data drift detection.

Each step delivers immediate value. No need to wait for everything to be in place to benefit from observability.


Samuel Desseaux is the founder of Erythix and Aureonis, a fractional CTO and trainer specializing in IT/AI observability, AI security, and IT/OT convergence. Official VictoriaMetrics partner for France and Benelux, and Arize partner.

The AI Observability Hub is Erythix's demonstration platform for AI workload observability. Contact https://www.linkedin.com/in/sdesseaux/ for a demo or a stack diagnostic.
