ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

Postmortem: How a Lack of OpenTelemetry 1.20 Skills Delayed Observability Adoption by 6 Months

In early 2024, our platform engineering team set out to unify our fragmented observability stack under OpenTelemetry (OTel) 1.20, aiming to replace three disjointed monitoring tools with a single, vendor-neutral standard. What we expected to be a 12-week rollout stretched to 36 weeks, delaying critical observability capabilities by 6 full months. This postmortem breaks down the root causes, impact, and hard-won lessons from that delay.

Background and Original Timeline

Our legacy observability setup relied on a mix of proprietary APM tools, self-hosted Prometheus for metrics, and a custom logging pipeline. This fragmented stack led to inconsistent data, high licensing costs, and hours of toil for on-call engineers correlating signals across tools. We selected OpenTelemetry 1.20 as our unified standard for three key reasons:

  • Native support for the new OpenTelemetry Metrics API v2, which aligned with our goal to standardize metric collection across all microservices
  • Improved tracing context propagation for our gRPC-heavy workloads, a feature stabilized in OTel 1.20
  • Built-in exporters for our existing backend stack (Jaeger for traces, Prometheus for metrics, Loki for logs) without custom shimming (a minimal Go wiring sketch follows this list)
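
To make the second and third points concrete, here is a minimal sketch of the Go-side wiring we were aiming for. It assumes the standard Go tracing SDK, the otelgrpc contrib package, and a hypothetical Collector endpoint; treat it as an illustration of the pattern rather than our exact production setup.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"google.golang.org/grpc"
)

// initTracing wires a tracer provider that exports spans over OTLP/gRPC
// (for example to a Collector that forwards to Jaeger) and installs the
// W3C trace-context propagator so spans stay linked across gRPC hops.
func initTracing(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector:4317"), // hypothetical Collector address
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	otel.SetTracerProvider(tp)

	// Set the propagator explicitly rather than relying on SDK defaults;
	// shifting defaults are exactly what later broke our trace continuity.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, propagation.Baggage{},
	))
	return tp, nil
}

func main() {
	ctx := context.Background()
	tp, err := initTracing(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer tp.Shutdown(ctx)

	// Instrument a gRPC server with the otelgrpc stats handler so incoming
	// trace context is extracted automatically on every RPC.
	_ = grpc.NewServer(grpc.StatsHandler(otelgrpc.NewServerHandler()))
}
```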

Our original timeline targeted a phased rollout: 4 weeks for proof of concept (PoC), 6 weeks for core service instrumentation, and 2 weeks for cutover. We planned to complete the full rollout by March 31, 2024.

Root Cause: Critical OpenTelemetry 1.20 Skill Gaps

The primary driver of the 6-month delay was a widespread lack of expertise in OpenTelemetry 1.20’s new and updated components. While 80% of the platform team had experience with OTel 1.16 or earlier, OTel 1.20 introduced breaking changes and new workflows that our team was unprepared for:

  • The Metrics API v2 overhaul required reworking all existing metric instrumentation, as the legacy v1 APIs were deprecated in 1.20. No team members had hands-on experience with the new API’s aggregation models or cardinality controls (see the Go sketch after this list).
  • Updated tracing SDKs for our primary languages (Go, Java, Node.js) changed default context propagation behavior, leading to broken trace continuity in our initial PoC that took 3 weeks to debug.
  • New configuration syntax for OTel Collectors, including the service::telemetry block for native metrics, was unfamiliar to our team, resulting in misconfigured pipelines that dropped 40% of trace data initially.
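
To give a sense of the first point, below is a minimal Go sketch of the kind of metric instrumentation rework this involved, assuming the current Go metric SDK and Prometheus exporter packages; the instrument, attribute, and service names are hypothetical. The View is where the aggregation models and cardinality controls we had to learn actually live.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/prometheus"
	"go.opentelemetry.io/otel/metric"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// The Prometheus exporter acts as a pull-based reader for the SDK.
	exporter, err := prometheus.New()
	if err != nil {
		log.Fatal(err)
	}

	// A View rewrites the histogram's bucket boundaries and drops a
	// high-cardinality attribute before aggregation ever happens.
	view := sdkmetric.NewView(
		sdkmetric.Instrument{Name: "http.server.duration"},
		sdkmetric.Stream{
			Aggregation: sdkmetric.AggregationExplicitBucketHistogram{
				Boundaries: []float64{5, 10, 25, 50, 100, 250, 500, 1000},
			},
			AttributeFilter: attribute.NewDenyKeysFilter("http.client_ip"),
		},
	)

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(exporter),
		sdkmetric.WithView(view),
	)
	defer provider.Shutdown(ctx)

	meter := provider.Meter("checkout-service") // hypothetical service name

	// Instruments are created once, then recorded against with attributes.
	duration, err := meter.Float64Histogram("http.server.duration",
		metric.WithUnit("ms"),
		metric.WithDescription("Server-side request latency"),
	)
	if err != nil {
		log.Fatal(err)
	}

	duration.Record(ctx, 42.0, metric.WithAttributes(
		attribute.String("http.route", "/api/orders"),
	))
}
```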

We made a critical error in assuming that prior OTel experience would translate directly to 1.20. We did not allocate time for upskilling, and we had no internal subject matter experts to consult when issues arose. When we hit blockers, we relied on community forums and trial-and-error, which added weeks to each phase of the rollout.

Impact of the Delay

The 6-month delay had cascading effects across the organization:

  • We missed our Q2 commitment to provide unified observability for 12 new microservices, forcing product teams to maintain legacy tooling alongside new services, increasing toil by 30%.
  • Licensing costs for legacy APM tools totaled $120k more than budgeted, as we could not decommission them on schedule.
  • Two critical incidents in April and June 2024 took 2x longer to resolve than expected, as engineers could not correlate metrics, traces, and logs across the fragmented stack.
  • Team morale dropped significantly, with 3 platform engineers citing the prolonged rollout as a factor in their decision to leave the company.

Mitigation and Recovery Steps

In May 2024, we paused the rollout to address the skill gaps directly. We took the following steps to get back on track:

  1. Partnered with an OpenTelemetry consulting firm to deliver a 4-week intensive training program focused on OTel 1.20’s new features, with hands-on labs for our core languages and Collector configurations.
  2. Hired a senior OTel engineer with 1.20 production experience to act as an internal subject matter expert and mentor for the team.
  3. Rebuilt our PoC from scratch, applying what we learned in training and documenting all configuration patterns and common pitfalls in an internal wiki.
  4. Implemented a staggered rollout for non-critical services first, with weekly check-ins to catch issues early.

These steps allowed us to complete the full rollout by September 2024, 6 months behind the original timeline but with a stable, well-documented observability stack.

Key Lessons Learned

We documented five core lessons to prevent similar delays in future infrastructure adoptions:

  • Never assume a new version of a major open-source tool is a drop-in upgrade: always audit breaking changes and new features before committing to a timeline.
  • Allocate 20% of rollout timelines for upskilling when adopting a new major version of a core tool.
  • Run a small, time-boxed PoC with the exact version you plan to use before scaling rollout plans.
  • Maintain at least one internal subject matter expert for critical infrastructure tools, or retain access to external expertise.
  • Factor in the cost of delayed adoption (licensing, toil, incident resolution time) when prioritizing upskilling and training budgets.

Conclusion

The 6-month delay in our OpenTelemetry 1.20 adoption was a painful but valuable lesson in the importance of aligning team skills with tooling choices. While we eventually achieved our goal of a unified observability stack, the cost of ignoring version-specific skill gaps was far higher than the investment required to upskill our team upfront. For any team adopting OpenTelemetry 1.20 or later, we strongly recommend prioritizing hands-on training and PoCs before committing to aggressive rollout timelines.
