ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

Postmortem: How a Jaeger 1.50 Storage Issue Lost 50% of Our Production Traces

Executive Summary

On October 12, 2024, our observability team detected a critical gap in production trace coverage: 50% of all distributed traces from our e-commerce platform were missing from Jaeger 1.50 storage. The incident lasted 4 hours and 22 minutes, impacting our ability to debug latency spikes in checkout and payment services. This postmortem details the root cause, timeline, remediation, and actionable lessons for teams running Jaeger at scale.

Incident Timeline (UTC)

  • 14:12 – Jaeger 1.50 upgrade completes across all production collector and query nodes, following a 2-week staging validation period.
  • 14:15 – First anomaly alert: trace ingestion rate drops from 12k traces/sec to 6k traces/sec, a 50% decline.
  • 14:27 – On-call engineers confirm Jaeger query service returns incomplete trace results for recent time windows.
  • 14:45 – Storage team identifies uneven shard distribution in the new Jaeger 1.50 Elasticsearch backend.
  • 15:30 – Temporary rollback to Jaeger 1.49 initiated for collector nodes to stop further trace loss.
  • 16:34 – Storage rebalancing completes; ingestion rate returns to 12k traces/sec.
  • 18:34 – Full validation confirms 99.9% trace coverage restored; incident marked resolved.

Impact Assessment

The 50% trace loss affected all production services, with the highest impact on checkout (72% of traces lost) and payment (68% lost) services. Key downstream impacts included:

  • Two checkout latency incidents left unresolved while debugging was delayed by roughly 3 hours due to missing traces.
  • False negatives in our trace-completeness SLO alerting, which failed to fire promptly despite the ongoing loss.
  • Loss of 18 million traces totaling 2.4 TB of observability data, unrecoverable due to short-term storage retention policies.

Root Cause Analysis

Jaeger 1.50 introduced a breaking change to the default Elasticsearch index template for trace storage, which our team missed during pre-production testing. The change modified the shard_count calculation logic to use a dynamic formula based on node count, rather than the static 5-shard default used in Jaeger 1.49 and earlier.

Our production Elasticsearch cluster has 12 data nodes. Jaeger 1.50's new logic derived the shard count from the data-node count and requested 12 shards per index, but our cluster caps trace indices at 6 shards each for small trace workloads, so 6 of the 12 requested shards were rejected during index creation. Trace data routed to the rejected shards was silently dropped by the Jaeger collectors, producing the 50% loss.
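
In hindsight, the mismatch was visible directly in Elasticsearch. Below is a minimal sketch of the kind of check that would have caught it: it uses the _cat/indices API to flag any Jaeger span index whose primary shard count or health diverges from what the collectors were configured to request (the cluster URL is a placeholder; the 12-shard expectation matches the dynamic value described above).

```python
import requests

ES_URL = "http://elasticsearch.internal:9200"  # placeholder cluster endpoint
INDEX_PATTERN = "jaeger-span-*"                # Jaeger's daily span indices
EXPECTED_SHARDS = 12                           # what the 1.50 dynamic logic requested per index

def check_jaeger_shard_allocation() -> None:
    """Warn about Jaeger span indices with fewer primary shards than expected, or non-green health."""
    resp = requests.get(
        f"{ES_URL}/_cat/indices/{INDEX_PATTERN}",
        params={"format": "json", "h": "index,pri,health"},
        timeout=10,
    )
    resp.raise_for_status()
    for idx in resp.json():
        pri, health = int(idx["pri"]), idx["health"]
        if pri < EXPECTED_SHARDS or health != "green":
            print(f"WARNING: {idx['index']}: {pri} primary shards "
                  f"(expected {EXPECTED_SHARDS}), health={health}")

if __name__ == "__main__":
    check_jaeger_shard_allocation()
```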

Contributing factors included:

  • Our staging cluster had only 3 nodes, so the dynamically calculated shard count stayed within the Elasticsearch per-index limit, hiding the issue during validation.
  • The Jaeger 1.50 release notes mentioned the shard logic change only in a minor bullet point under "Backend Improvements", not in the breaking-changes section.
  • Our trace completeness alerting checked only ingestion rate, not actual storage write success rates.

Remediation Steps

  1. Rolled back Jaeger collectors to 1.49 immediately to stop ongoing trace loss.
  2. Updated the Jaeger Elasticsearch index template to explicitly set shard_count: 5 (matching our pre-1.50 configuration) before re-upgrading to 1.50.
  3. Added a pre-upgrade check to validate shard allocation for all Jaeger indices against Elasticsearch cluster limits.
  4. Updated our observability alerting to track jaeger_elasticsearch_index_create_errors and jaeger_collector_trace_dropped_total metrics, with a 5-minute paging threshold (the queries behind this threshold are sketched after this list).
  5. Extended production trace retention from 24 hours to 72 hours to allow recovery of lost data in future incidents (where possible).
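
In practice, step 4 lives in our Prometheus alert rules; the standalone sketch below just shows the queries behind the 5-minute paging threshold (the Prometheus endpoint is a placeholder, and the exact label sets depend on how your Jaeger components are scraped).

```python
import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"  # placeholder Prometheus endpoint

# Metrics named in remediation step 4; labels depend on your Jaeger deployment.
QUERIES = {
    "index_create_errors": "sum(increase(jaeger_elasticsearch_index_create_errors[5m]))",
    "dropped_traces": "sum(increase(jaeger_collector_trace_dropped_total[5m]))",
}

def check_silent_loss() -> bool:
    """Return True if either storage-write failure signal is non-zero over the last 5 minutes."""
    page = False
    for name, query in QUERIES.items():
        resp = requests.get(
            f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10
        )
        resp.raise_for_status()
        results = resp.json()["data"]["result"]
        value = float(results[0]["value"][1]) if results else 0.0
        if value > 0:
            print(f"PAGE: {name} = {value:.0f} in the last 5 minutes")
            page = True
    return page

if __name__ == "__main__":
    check_silent_loss()
```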

Lessons Learned

  • Always validate version upgrades against production-scale infrastructure, not just staging clusters with smaller node counts.
  • Explicitly configure all storage parameters for Jaeger rather than relying on dynamic defaults, especially for critical production systems.
  • Alert on storage write failures and trace drop metrics, not just ingestion rates, to catch silent data loss.
  • Review all release notes for minor changes, not just breaking changes sections, when upgrading observability tools.
  • Run canary upgrades for Jaeger components: upgrade 10% of collectors first and validate trace coverage for 30 minutes before full rollout (a sketch of this validation follows the list).
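
One way to mechanize that canary gate is sketched below; the Prometheus endpoint, the group label separating canary from stable collectors, and the traces-received counter name are illustrative stand-ins, while jaeger_collector_trace_dropped_total is the drop metric from the remediation steps above.

```python
import requests

PROMETHEUS_URL = "http://prometheus.internal:9090"  # placeholder endpoint
CANARY = 'group="canary"'                           # hypothetical label on canary collectors
STABLE = 'group="stable"'
WINDOW = "30m"
MAX_GAP = 0.05  # canary ingestion may lag the stable fleet by at most 5%

def _scalar(query: str) -> float:
    """Run an instant PromQL query and return its single scalar result (0.0 if empty)."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def canary_coverage_ok() -> bool:
    """Compare per-instance ingestion of canary vs stable collectors over the validation window."""
    # jaeger_collector_traces_received_total is a stand-in for whatever ingestion counter you scrape.
    canary_rate = _scalar(f'avg(rate(jaeger_collector_traces_received_total{{{CANARY}}}[{WINDOW}]))')
    stable_rate = _scalar(f'avg(rate(jaeger_collector_traces_received_total{{{STABLE}}}[{WINDOW}]))')
    dropped = _scalar(f'sum(increase(jaeger_collector_trace_dropped_total{{{CANARY}}}[{WINDOW}]))')
    if dropped > 0 or (stable_rate > 0 and canary_rate < stable_rate * (1 - MAX_GAP)):
        print(f"Canary unhealthy: rate {canary_rate:.1f}/s vs {stable_rate:.1f}/s, dropped {dropped:.0f}")
        return False
    return True

if __name__ == "__main__":
    canary_coverage_ok()
```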

Conclusion

This incident highlighted the risks of relying on dynamic default configurations for observability storage, especially during version upgrades. By codifying our Jaeger storage settings and improving our alerting coverage, we've reduced the risk of similar trace loss incidents to near zero. We've also shared our findings with the Jaeger maintainers, who have since updated the 1.50 release notes to explicitly flag the shard logic change as a potential breaking change for large clusters.
