Postmortem: How a Misconfigured OpenTelemetry 1.20 Pipeline Lost All Our AI Workload Metrics
Executive Summary
On October 12, 2024, our observability team discovered that all AI workload metrics from our production ML inference cluster had been missing for 72 hours. The root cause was a misconfigured filter processor in our OpenTelemetry Collector 1.20.0 pipeline, which silently dropped 100% of metrics matching our ai.* namespace. No alerts fired during the outage due to gaps in Collector self-monitoring, delaying detection and remediation.
Incident Timeline (UTC)
- 2024-10-09 14:22: Deploy OpenTelemetry Collector 1.20.0 to AI inference cluster via Helm, updating pipeline config to add a filter for "noisy" non-AI metrics.
- 2024-10-09 14:25: All AI workload metrics stop flowing to Prometheus; Collector logs show no errors, only debug-level filter exclusion messages.
- 2024-10-12 08:15: Data science team notices missing metrics when troubleshooting elevated inference latency, files a ticket with observability.
- 2024-10-12 08:42: Observability team confirms Collector config error, rolls back to previous pipeline version.
- 2024-10-12 09:10: All AI metrics restored; no data loss recovery possible for the 72-hour gap.
Impact
Total loss of 72 hours of AI workload metrics, including:
- GPU utilization, memory usage, and temperature per inference node
- Per-model inference latency (p50, p95, p99), throughput, and error rates
- Model version rollout success/failure rates
- Custom business metrics (e.g., inference cost per request, SLA compliance)
Teams were unable to troubleshoot performance regressions, validate model rollouts, or bill customers for inference usage during the outage. No customer-facing outages occurred, but internal operations were severely disrupted.
Root Cause Analysis
Our OpenTelemetry Collector 1.20.0 pipeline was configured to collect OTLP metrics from inference services, process them, and export to Prometheus. The pipeline definition in our Helm values was:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
    send_batch_max_size: 1000
    timeout: 10s
  filter:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ^ai\..*
exporters:
  prometheus:
    endpoint: 0.0.0.0:9090
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch, filter]
      exporters: [prometheus]
The intent of the change was to keep only ai.* metrics and drop the "noisy" non-AI ones to reduce Prometheus storage costs. However, the engineer updating the config applied the ^ai\..* pattern under the filter processor's exclude action instead of include. The Collector cannot validate filter rules against intended outcomes, and it logs filter exclusions only at debug level, which was disabled in our production config. As a result, 100% of AI metrics were silently dropped after processing.
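For reference, here is a minimal sketch of the filter block the change was meant to produce, with everything else in the pipeline unchanged. This is an illustration assuming the filter processor's regexp match_type with metric_names, not a verbatim copy of our final config:

processors:
  filter:
    metrics:
      include:              # keep only matching metric names; everything else is dropped
        match_type: regexp
        metric_names:
          - ^ai\..*         # the AI workload namespace we intended to retain

Using include makes the retained set explicit, which is the behavior we actually wanted.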
Compounding the issue: our Collector self-monitoring only tracked trace and log pipeline health, not metrics pipeline drop rates. The otelcol_processor_filter_excluded_metrics metric was available but not scraped by Prometheus, so no alert fired when exclusions spiked.
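For context, the Collector can expose its own telemetry on a dedicated port that Prometheus scrapes separately from the data pipeline. The sketch below assumes the service.telemetry syntax and default self-monitoring port 8888 used by our Collector version; exact metric names and telemetry configuration vary across releases, so verify against your build:

service:
  telemetry:
    metrics:
      address: 0.0.0.0:8888   # Collector's own metrics, scraped independently of the data pipeline

and the corresponding Prometheus scrape job:

scrape_configs:
  - job_name: otel-collector-self
    # Scrape the Collector's self-monitoring endpoint
    static_configs:
      - targets: ['otel-collector:8888']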
Remediation Steps
- Rolled back Collector config to pre-change version at 08:42 UTC on 2024-10-12, restoring metrics flow within 28 minutes.
- Updated the filter processor config to use the include action with the correct ^ai\..* regex, validated in staging for 24 hours before the production deploy.
- Enabled debug-level logging temporarily during config changes to catch filter mismatches.
- Added scraping for all Collector self-monitoring metrics, including otelcol_processor_filter_excluded_metrics and otelcol_exporter_sent_metric_points, with alerts for unexpected drops (see the rule sketch after this list).
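As an illustration, a minimal Prometheus alerting rule sketch along these lines; the metric names are the ones referenced above, while the thresholds and durations are placeholder assumptions to tune against real traffic:

groups:
  - name: otel-collector-pipelines
    rules:
      - alert: OtelFilterExclusionsSpiked
        # Filter processor is excluding metric points at an unexpected rate.
        expr: rate(otelcol_processor_filter_excluded_metrics[5m]) > 100
        for: 10m
        labels:
          severity: warning
      - alert: OtelMetricExportStalled
        # Exporter has stopped sending metric points entirely.
        expr: rate(otelcol_exporter_sent_metric_points[5m]) == 0
        for: 10m
        labels:
          severity: critical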
Lessons Learned
- Always validate pipeline config changes in staging with production-like metric volumes, and diff configs against previous versions before deploy.
- OpenTelemetry filter processor config is error-prone: use explicit include lists instead of exclude where possible, and add config validation steps to CI/CD pipelines.
- Self-monitoring for all Collector pipelines (metrics, traces, logs) is mandatory: scrape all default OTel Collector metrics and alert on drop rates, export errors, and queue saturation.
- Disable debug logging in production, but enable it temporarily during config changes to catch silent drops.
- AI workload metrics are business-critical: add synthetic metric checks (e.g., a canary inference service emitting a known metric) to detect gaps faster; a sketch of the absence alert this enables follows this list.
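To make the canary idea concrete, here is a sketch of the absence check it enables, written as a Prometheus rule. The canary service and the ai_canary_heartbeat_total metric name are hypothetical examples, not part of our actual deployment:

groups:
  - name: ai-metrics-canary
    rules:
      - alert: AIMetricsPipelineGap
        # The canary metric should always be present; if it vanishes, the whole
        # ai.* namespace is likely being dropped somewhere in the pipeline.
        expr: absent(ai_canary_heartbeat_total)
        for: 15m
        labels:
          severity: critical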
Conclusion
This incident highlighted gaps in our OpenTelemetry pipeline change management and self-monitoring. While no customer impact occurred, the 72-hour metric gap delayed critical model rollout decisions and caused billing discrepancies. We have since implemented all remediation steps and added automated config validation to prevent similar misconfigurations in the future.