DEV Community

PracticeOverflow


OpenTelemetry's Stability Sprint: The Week Nobody Noticed

Wednesday morning at KubeCon EU in Amsterdam. Hall 7. The OpenTelemetry maintainers' meeting had maybe 200 people in a room built for 600. Three halls over, every AI agent demo was standing room only.

In that half-empty room, the OTel project announced more stability milestones in a single week than in the previous two years combined.

Declarative Configuration: stable. Profiles: alpha. eBPF Instrumentation: headed to RC. Go Metrics SDK: 30x faster. Baggage propagation validated at 60 million requests per minute.

And the hallway track? All anyone wanted to talk about was whether Claude could auto-instrument their microservices.

Here's the thing. OpenTelemetry has been "almost ready" for production for years. Teams adopt it, hit rough edges in configuration drift and SDK inconsistencies, fall back to Datadog or Dynatrace vendor SDKs, and make a mental note to try again in six months. This week might be the tipping point. But only if you know which parts actually crossed the line.


5-Minute Skim

If you're skimming between sessions:

  • Declarative Configuration hit stable. One YAML schema configures SDK + instrumentation across C++, Go, Java, JavaScript, and PHP. .NET and Python are weeks away. This kills the "every language configures differently" problem that plagued adoption.
  • Profiles entered alpha as the 4th observability pillar. Continuous profiling with 40% smaller wire format than pprof, cross-signal correlation via trace_id/span_id, and an eBPF agent that runs as a Collector receiver.
  • eBPF Instrumentation (OBI) is heading to RC. Zero-code, kernel-level tracing for Go, Rust, and C++ -- languages that never had auto-instrumentation before. No sidecars. No code changes. No runtime overhead from bytecode manipulation.
  • Go Metrics SDK got 30x faster. The synchronous instrument path was the bottleneck everyone complained about. Fixed.
  • 65% of organizations now invest in both Prometheus and OTel. Not either/or. Both. 47% increased OTel usage year-over-year. 84% report time or cost savings from open standards adoption.
  • 92% find AI valuable for anomaly detection on telemetry data. The observability-meets-AI convergence is real, and OTel's structured, vendor-neutral data is what makes it possible.

Now let me walk through what actually changed and why it matters.


Why Has OTel Been "Almost Ready" for Five Years?

Because a spec isn't a product.

OpenTelemetry reached traces GA in 2021. Metrics in 2023. Logs in 2024. Each time, the announcement said "production ready." Each time, platform teams discovered the gap between a stable signal spec and a deployable system.

The spec says traces are stable. Great. But how do you configure the SDK? Environment variables? Code? YAML? It depends on the language. The Go SDK configures differently from Java, which configures differently from Python. You need different expertise for each runtime in your fleet. That's not production-ready. That's a research project.

The spec says metrics are stable. But the Go SDK's synchronous instruments had performance characteristics that made high-throughput services drop samples or add latency. Teams benchmarked, saw the numbers, and switched back to Prometheus client libraries.

The spec says logs are stable. But without profiling data, you still can't answer "this endpoint is slow -- is it the code, the GC, or the downstream dependency?" You had three pillars holding up a roof that needed four.

This week fixed all three problems simultaneously. That's what makes it different.


What Does the 4-Signal Architecture Actually Look Like?

Before this week, OTel had three stable signals. Now it has three stable signals and a fourth entering alpha: traces, metrics, logs, and profiles, each flowing from the SDKs through the Collector to whatever backend you choose, all driven by a shared configuration layer.

The critical change isn't the fourth signal. It's the configuration layer sitting in the middle of that architecture.

Declarative Configuration means a single YAML file controls everything: which signals are enabled, which exporters they route to, which sampling rules apply, which resources are attached. Across five languages today, seven soon. One schema. One file. One truth.

Before this, every language SDK had its own configuration story. Java used system properties and environment variables. Go used functional options in code. JavaScript used a mix of environment variables and programmatic setup. Python had its own thing entirely. If you ran a polyglot microservices fleet -- and who doesn't -- you needed language-specific expertise for every runtime.
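As a rough illustration of what that looks like, here's a minimal sketch of a declarative configuration file. The key names follow the opentelemetry-configuration schema, but treat the `file_format` version, service name, and endpoint as placeholders -- check the spec for your SDK version.

```yaml
# Illustrative sketch of a declarative SDK configuration file.
# Schema keys follow the opentelemetry-configuration spec; the
# version string, service name, and endpoints are placeholders.
file_format: "1.0"
resource:
  attributes:
    - name: service.name
      value: checkout-service          # placeholder service name
tracer_provider:
  processors:
    - batch:
        exporter:
          otlp:
            protocol: grpc
            endpoint: http://otel-collector:4317   # placeholder
meter_provider:
  readers:
    - periodic:
        exporter:
          otlp:
            protocol: grpc
            endpoint: http://otel-collector:4317
logger_provider:
  processors:
    - batch:
        exporter:
          otlp:
            protocol: grpc
            endpoint: http://otel-collector:4317
```

The same file, byte for byte, configures the Java service and the Go service and the PHP service. That's the whole point.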

That's over.

That file replaces dozens of environment variables, language-specific initialization code, and vendor-specific configuration blocks. Deploy it via ConfigMap in Kubernetes, mount it into every pod, and every SDK reads the same truth.
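In Kubernetes terms, that deployment pattern is a ConfigMap plus a mount. A hedged sketch -- the environment variable name below matches the experimental SDKs (`OTEL_EXPERIMENTAL_CONFIG_FILE`); the stable name may differ per SDK, so verify against your language's docs:

```yaml
# Sketch: one config file, shipped to every pod via a ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-sdk-config
data:
  otel-config.yaml: |
    file_format: "1.0"
    # ... signals, exporters, sampling, resources ...
---
# Fragment of a Deployment pod spec: mount the ConfigMap and point
# the SDK at it. The env var name is an assumption (experimental
# SDKs use OTEL_EXPERIMENTAL_CONFIG_FILE; stable may rename it).
spec:
  containers:
    - name: app
      env:
        - name: OTEL_EXPERIMENTAL_CONFIG_FILE
          value: /etc/otel/otel-config.yaml
      volumeMounts:
        - name: otel-config
          mountPath: /etc/otel
  volumes:
    - name: otel-config
      configMap:
        name: otel-sdk-config
```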

The stability guarantee means the schema won't break between minor versions. You can upgrade the SDK without rewriting your configuration. For platform teams managing hundreds of services, that's the difference between "we can standardize on OTel" and "we'll revisit next quarter."

.NET and Python support is underway and expected within weeks. When those land, Declarative Configuration covers every major backend language in production use.


Why Do Profiles Change Everything?

Traces tell you which service is slow. Metrics tell you how slow. Logs tell you what happened. None of them tell you why.

Why is checkout-service P99 at 800ms? Is it a hot code path? GC pressure? Lock contention? A downstream timeout? With three signals, you're guessing. You jump to a profiler, set up a separate agent, try to correlate timestamps manually, lose the thread, give up, add more logging, deploy, wait for the next incident.

Profiles fix this. They're continuous profiling -- CPU, memory allocation, wall-clock, lock contention -- baked into the same pipeline as your traces, metrics, and logs.

The key design decision: profiles carry trace_id and span_id. That means you can go from a slow trace span directly to the flame graph showing exactly which function burned 600ms. No timestamp correlation. No separate tooling. One click.

The wire format is 40% smaller than pprof, which matters when you're shipping continuous profiling data from every pod in your fleet. And the eBPF-based profiling agent runs as a Collector receiver -- not a separate daemon, not a sidecar, but a component inside the Collector you already run.
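To make the integration path concrete: because profiles ride the same OTLP pipeline, enabling them in the Collector is an extra pipeline block, not a separate system. A hedged sketch -- the `profiles` pipeline type is still experimental, and whether you receive profiles over OTLP or via a dedicated eBPF receiver depends on your Collector build:

```yaml
# Sketch: forwarding profiles alongside traces in one Collector.
# The profiles pipeline type is experimental; component availability
# varies by Collector distribution. Endpoint is a placeholder.
receivers:
  otlp:
    protocols:
      grpc: {}
exporters:
  otlp:
    endpoint: backend.example.com:4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
    profiles:
      receivers: [otlp]
      exporters: [otlp]
```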

Alpha means the spec will change. APIs are not frozen. But the signal definition, wire format, and Collector integration path are real enough to evaluate today.


How Does eBPF Instrumentation Work Without Code Changes?

This one matters most for the languages OTel has historically ignored.

Java has auto-instrumentation via bytecode manipulation. Python has monkey-patching. JavaScript has require hooks. But Go? Rust? C++? These compile to native binaries. There's no bytecode to manipulate. No interpreter to hook. You either instrument the code manually or you don't instrument it at all.

eBPF Instrumentation -- OBI -- solves this at the kernel level.

eBPF programs attach to function entry and exit points (uprobes) in the compiled binary. They capture timing, arguments, and return values without modifying the binary, without injecting a sidecar, and without adding runtime overhead from bytecode manipulation. The traces flow into the OTel Collector through a dedicated receiver.

This is beta today, heading to RC. Splunk showed it running in production at KubeCon with their GA Kubernetes Operator managing the lifecycle.

The trade-off is real: eBPF requires Linux kernel 5.8+ and appropriate capabilities (CAP_BPF). It can't instrument inlined functions. And the span detail is coarser than manual instrumentation -- you get function-level granularity, not arbitrary code block spans. For most observability use cases, that's more than enough. For custom business logic spans, you'll still need manual instrumentation at key points.


Vendor SDK vs. OTel: Where's the Trade-off Now?

This is the question I hear most from platform engineering leads. "Should we migrate off Datadog/Dynatrace/New Relic SDKs onto OTel?"

A year ago, the honest answer was "probably not yet." The configuration story was fragmented. Performance had gaps. Profiling didn't exist. Vendor SDKs gave you a coherent, well-tested, fully-supported package. OTel gave you portability at the cost of paper cuts.

After this week, the calculus shifts.

What OTel gives you now:

  • Single configuration schema across all languages (Declarative Config, stable)
  • Four signals in one pipeline (traces, metrics, logs, profiles)
  • Zero-code instrumentation for compiled languages (eBPF)
  • Vendor portability: switch backends without re-instrumenting
  • 30x faster Go metrics (the worst performance gap is closed)

What vendor SDKs still give you:

  • Tighter integration with vendor-specific features (AI-powered root cause, custom dashboards, proprietary correlation)
  • One vendor to call when something breaks
  • Battle-tested at extreme scale with years of production hardening
  • Faster time-to-value for small teams without platform engineering capacity

The hybrid pattern that's emerging: Instrument with OTel SDKs and Declarative Config. Export to your vendor of choice via OTLP. Use vendor-specific features on the backend. This gives you portability at the instrumentation layer and vendor power at the analysis layer.
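The hybrid pattern maps directly onto a Collector config: one OTLP ingest point, multiple exporters. A sketch, with placeholder endpoints and a hypothetical vendor header -- substitute your backend's actual OTLP endpoint and auth scheme:

```yaml
# Sketch of the hybrid pattern: instrument once, fan out to both a
# commercial backend and an open-source one. Endpoints and the
# api-key header are placeholders for your vendor's actual scheme.
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  batch: {}
exporters:
  otlp/vendor:
    endpoint: ingest.vendor.example.com:4317
    headers:
      api-key: ${env:VENDOR_API_KEY}
  otlp/oss:
    endpoint: tempo.internal:4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/vendor, otlp/oss]
```

Switching vendors means editing one exporter block in this file. No application deploys, no re-instrumentation.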

65% of organizations are already doing exactly this -- investing in both open standards and commercial platforms simultaneously. That number is from Grafana's 2026 Open Standards survey, and it matches every conversation I've had this quarter.

The Collector as a routing layer is the unlock. Instrument once. Route anywhere. Change vendors without touching application code. That's the promise OTel has been making for five years. This week, the last major blockers to delivering on it fell away.


What Does the Adoption Data Actually Say?

Grafana surveyed thousands of practitioners in early 2026. The numbers tell a clear story:

  • 57% use OTel for metrics. This was the lagging signal. Prometheus had an iron grip. OTel metrics crossing the majority threshold means the "just use Prometheus" default is eroding.
  • 50% use OTel for traces. Traces were the first stable signal, and half the industry is on board. The other half is split between vendor SDKs and "we don't do distributed tracing yet."
  • 48% use OTel for logs. Surprisingly close to traces, given that OTel logs only went stable in 2024. The structured logging push is working.
  • 47% increased OTel usage year-over-year. Not just adoption, but deepening adoption. Teams that started with traces are adding metrics and logs.
  • 84% report time or cost savings. This is the number that gets budget. Not "it's the right thing to do" but "it saves money."

The Baggage signal at 60 million requests per minute is less about the feature and more about the proof point. OTel's core propagation infrastructure handles hyperscale traffic. The "will it perform?" question has an answer now.


Mono-Signal vs. Multi-Signal: Which Migration Path?

If you're planning an OTel migration, you have two strategies. Both work. They have different risk profiles.

Mono-signal migration: Pick one signal -- usually traces -- and migrate it fully across your fleet. Get the Collector running, the exporters configured, the dashboards rebuilt. Stabilize. Then add metrics. Then logs. Then profiles.

This is lower risk. You learn the operational model on one signal before adding complexity. The downside: you're running two parallel telemetry pipelines for months. Vendor SDK for the signals you haven't migrated. OTel for the one you have. That's more infrastructure, more cost, more cognitive load.

Multi-signal migration: Use Declarative Configuration to deploy all signals at once. One YAML, one Collector, one rollout.

This is higher risk but dramatically faster. Declarative Config makes it feasible because you're not writing language-specific initialization code for each signal in each language. You write the YAML once. The downside: if something breaks, everything breaks. Your blast radius is your entire observability pipeline.

My recommendation for most teams: start with traces (the most mature signal), add metrics within the same quarter, add logs in the next quarter, and evaluate profiles once they hit beta. Use Declarative Config from day one even if you're only enabling one signal -- the migration cost of adding signals later drops to near zero.


What Should You Do This Quarter?

Adopt Declarative Configuration immediately. Even if you're already running OTel, switch to the stable YAML schema. It eliminates environment variable sprawl, makes configuration auditable and version-controlled, and prepares you for adding signals with zero SDK code changes. If you're on C++, Go, Java, JavaScript, or PHP, it's available today.

Evaluate Profiles on a single high-value service. Pick the service that generates the most on-call pages. Deploy the eBPF profiling agent as a Collector receiver. Correlate profile data with existing traces. You'll find root causes you've been chasing for months. Alpha means "the API may change," not "it doesn't work."

Benchmark eBPF Instrumentation against your manual instrumentation. If you have Go, Rust, or C++ services with no observability or hand-rolled tracing, OBI in beta is ready for staging environments. Compare the span coverage against what you'd get from manual instrumentation. For most services, the 80/20 is heavily in eBPF's favor.

Stop waiting for OTel to be "ready." Traces have been stable for five years. Metrics for three. Logs for two. Configuration is now stable. The Go performance gap is closed. The "we'll adopt OTel when it's mature" position was defensible in 2024. In 2026, it's just inertia.

Budget for the Collector as infrastructure. The Collector isn't a nice-to-have sidecar. It's a critical routing layer between your applications and your observability backends. Run it as a DaemonSet. Give it resource limits. Monitor it with... itself. Treat it like you treat your service mesh control plane.
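"Treat it like infrastructure" cashes out as a DaemonSet with explicit resource limits, plus the Collector's own internal telemetry turned on so it can report on itself. A sketch with illustrative values -- size the limits from your own load testing:

```yaml
# Sketch: Collector as managed infrastructure. Resource values are
# illustrative; pin the image tag and size limits from load tests.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
spec:
  selector:
    matchLabels: {app: otel-collector}
  template:
    metadata:
      labels: {app: otel-collector}
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:latest  # pin in prod
          resources:
            requests: {cpu: 200m, memory: 256Mi}
            limits: {cpu: "1", memory: 1Gi}
```

In the Collector's own config, the `service.telemetry` section exposes its internal metrics -- that's the "monitor it with itself" part.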




Sources

  1. Bindplane, "KubeCon EU 2026 OpenTelemetry Recap," April 2, 2026
  2. OpenTelemetry Blog, "Profiles Signal Enters Alpha," April 2026
  3. OpenTelemetry Blog, "Declarative Configuration Reaches Stable," April 2026
  4. Splunk, "KubeCon EU 2026: OTel eBPF Instrumentation and Kubernetes Operator GA," April 2026
  5. Grafana Labs, "2026 Open Standards in Observability Survey," March 2026
  6. Grafana Labs, "2026 State of AI in Observability," March 2026
