The 2am call that dropped before the user finished talking, and the week I spent finding out why my tracer never saw it

#observability #ai #opentelemetry #llm

The call came in at 2am. Not a page, an actual support recording, flagged by a customer who said our voice agent "hung up on her mid-sentence." I pulled the trace. The LLM call was perfect. 380ms, clean completion, sensible response. Every dashboard I had was green. The customer was still angry, and my tooling had nothing to say about why.

That gap is the thing I want to talk about. I build voice agents for a living, the kind that answer phones and book appointments and occasionally embarrass me in production. And after three years of it, here is the hard lesson: tracing the LLM call is the easy 20 percent. For a voice agent, the failures live in the audio layer your tracer never sees.

Week 1: learning what my dashboards were hiding
When the LLM is the whole product, an LLM tracer is enough. You see the prompt, the completion, the tokens, the cost, the latency. Beautiful.

A voice agent is a pipeline, and the LLM is one stage in the middle. Audio comes in, an ASR model transcribes it, an endpointer decides when the human stopped talking, your orchestration assembles context, the LLM responds, TTS speaks it back, and somewhere a barge-in detector is supposed to notice when the human interrupts. The LLM trace covers one box in that chain. The 2am call dropped because the endpointer fired early. The transcript was cut in half before it ever reached the model. My tracer logged a flawless response to half a question.

So I made a list of what actually breaks, and what I needed to see for each:

End-of-turn detection timing. The endpointer decides the human is done. Too eager and you interrupt them (my 2am call). Too slow and the agent feels dead. This is a latency-plus-decision event, not an LLM span, and most tools have no concept of it.

ASR latency and confidence. If transcription takes 900ms or comes back at 0.4 confidence, the LLM response can be instant and still wrong. You need the confidence score attached to the turn.

Barge-in detection. The human starts talking over the agent. Did the system notice? How fast did it stop talking? Pure audio-layer, invisible to a text tracer.

Time-to-first-audio. Not time-to-first-token. The human hears nothing until TTS produces sound. That is the latency that matters, and it lives downstream of everything your LLM dashboard shows.

None of these are exotic. They are the daily failure modes of every voice agent in production. And the tooling conversation almost never mentions them.
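
To make the four signals above concrete, here is a minimal sketch of how they could be derived from per-turn timestamps. Everything in it is an assumption for illustration: the `TurnTimeline` fields, the metric names, and the choice to measure latencies from the moment the endpointer fired are hypothetical, not taken from any particular framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnTimeline:
    """Timestamps for one turn, in milliseconds on a shared clock.

    All field names are hypothetical; map them to whatever your pipeline logs.
    """
    last_user_speech_ms: int               # last voiced frame before the endpointer fired
    endpoint_fired_ms: int                 # endpointer declared the turn over
    asr_final_ms: int                      # final transcript available
    llm_first_token_ms: int                # first token back from the model
    tts_first_audio_ms: int                # first audio frame sent to the caller
    user_resumed_ms: Optional[int] = None  # user spoke again after the endpoint, if ever
    barge_in_ms: Optional[int] = None      # user started talking over the agent
    tts_stopped_ms: Optional[int] = None   # agent audio actually stopped


def turn_metrics(t: TurnTimeline) -> dict:
    """Derive audio-layer metrics, measured from the endpoint decision."""
    metrics = {
        "endpoint.wait_ms": t.endpoint_fired_ms - t.last_user_speech_ms,
        "asr.finalize_ms": t.asr_final_ms - t.endpoint_fired_ms,
        "llm.time_to_first_token_ms": t.llm_first_token_ms - t.endpoint_fired_ms,
        "tts.time_to_first_audio_ms": t.tts_first_audio_ms - t.endpoint_fired_ms,
    }
    # A user who resumes speaking right after the endpoint fired was probably cut off.
    if t.user_resumed_ms is not None:
        metrics["endpoint.user_resumed_after_ms"] = t.user_resumed_ms - t.endpoint_fired_ms
    # Barge-in reaction time only exists when an interruption happened and was handled.
    if t.barge_in_ms is not None and t.tts_stopped_ms is not None:
        metrics["barge_in.reaction_ms"] = t.tts_stopped_ms - t.barge_in_ms
    return metrics


# Made-up numbers shaped like the 2am call: the endpointer waits only 200 ms,
# and the user is talking again 180 ms after it fired.
early_fire = TurnTimeline(
    last_user_speech_ms=9_400,
    endpoint_fired_ms=9_600,
    asr_final_ms=9_750,
    llm_first_token_ms=10_130,
    tts_first_audio_ms=10_450,
    user_resumed_ms=9_780,
)
print(turn_metrics(early_fire))
```

ASR confidence is not a timing, so it is left out here; it would ride along as a plain attribute on the same turn. The point of the sketch is that time-to-first-token and time-to-first-audio are separate numbers, and only the second one is what the caller experiences.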

Week 2: checking six tools against that list
I went through six observability tools I had either used or seriously trialed, and asked one question of each: how much of the audio layer can I actually see, and how much work is it to get there? I am grading on voice-agent fit, not on general quality. Several of these are excellent tools that simply were not built with a pipeline like mine in mind.

Langfuse. OpenTelemetry-based, so the format does not fight you. You can attach custom spans for ASR, endpointing, time-to-first-audio, and they show up in the trace tree. Honest take: on pure LLM observability Langfuse is stronger and more polished than most of this list, including the mid-list option I will get to. The catch is that nothing about the audio layer is automatic. You instrument every span by hand.

Phoenix (Arize). Same OTel story. Format-agnostic, custom spans work, strong on eval and drift if that is your world. Same catch: the audio spans are yours to define and emit.

Laminar. OTel-native and newer, pleasant to instrument. Same pattern: it will hold whatever audio spans you send it, it will not invent them for you.

Future AGI (traceAI). Sits in the middle of this list for me, and I want to be precise about why. Its tracing layer, traceAI, is OpenTelemetry-native and exports OTLP to any backend, with instrumentors for 50-plus frameworks as of June 2026 (the repo is open at github.com/future-agi/traceAI). For voice work that buys you the same thing the others do: custom audio spans are first-class because OTel is the substrate. Where it earned its spot for me is the eval side, scoring a turn against the audio context rather than just the text. Where it does not win: on raw observability ergonomics, Langfuse and Helicone are simply more refined. I keep it mid-list on purpose. It is a capable option, not a crown.

Helicone. Genuinely excellent at LLM-call logging, cost tracking, and gateway-level visibility, and the fastest of this group to stand up for that job. It is also largely silent on the audio layer. That is not a flaw, it is a focus. If your problem is LLM cost and call logging, Helicone may beat everything here. If your problem is a dropped call at 2am, it will not see it.

LangSmith. The most LLM-centric of the six and the least audio-aware by default. Tight integration if you live in the LangChain world. You will be doing the most adapting to make a voice pipeline legible inside it.

The pattern, once I lined them up, was almost boring. The OpenTelemetry-native tools (Langfuse, Phoenix, Laminar, traceAI) can all represent the audio layer, because OTel does not care whether a span wraps an LLM call or an endpointer decision. The LLM-focused tools (Helicone, LangSmith) are sharper at the thing they are built for and quieter about everything else. Nobody on this list ships voice-agent observability that works out of the box. Every one of them needs you to define the audio spans yourself.

Week 3: the instrumentation that actually paid off
The fix was not a tool swap. It was deciding to instrument the audio layer first and treat the LLM trace as already solved, because it was. Concretely, every turn now emits spans for ASR (with latency and confidence as attributes), endpoint decision (with the timing that would have caught the 2am drop), and time-to-first-audio. Because that is plain OpenTelemetry, it lands in whatever backend I point it at. The LLM span, the one thing all six tools handle beautifully, is the least of my worries now.

What shipped, and what I would tell the version of me who pulled that 2am trace
What shipped: a voice pipeline where the endpointer's decision is a first-class, traceable event, and the dashboard that used to glow green on a broken call now shows the early-fire spike that caused it. Mean-time-to-the-real-cause on audio-layer bugs went from "listen to the recording and guess" to "read the span." The 2am class of incident is now a saved query.
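
For illustration only, here is one shape a saved query for early endpoint fires could take, written as a plain Python filter over exported turn records. The attribute name `endpoint.user_resumed_after_ms` and the 300 ms threshold are assumptions, and the heuristic itself (the user speaking again almost immediately after the endpointer fired) is one plausible signal, not a description of the actual query behind this post.

```python
from typing import Iterable, List

# Assumed threshold: if the user is talking again this soon after the
# endpointer fired, the "end of turn" was probably a mid-sentence pause.
EARLY_RESUME_MS = 300


def early_fire_turns(turns: Iterable[dict],
                     resume_threshold_ms: int = EARLY_RESUME_MS) -> List[dict]:
    """Return turns where the user resumed speaking shortly after the endpoint."""
    flagged = []
    for turn in turns:
        resumed_after = turn.get("endpoint.user_resumed_after_ms")
        if resumed_after is not None and resumed_after <= resume_threshold_ms:
            flagged.append(turn)
    return flagged


# Made-up records standing in for span attributes pulled from a backend.
sample_turns = [
    {"turn_id": "a1", "endpoint.user_resumed_after_ms": 180},   # likely cut off
    {"turn_id": "a2", "endpoint.user_resumed_after_ms": 2400},  # normal follow-up
    {"turn_id": "a3"},                                           # user never resumed
]

for turn in early_fire_turns(sample_turns):
    print("possible early fire:", turn["turn_id"])
```

The threshold would need tuning against recordings you have actually listened to; too tight and you miss slow talkers, too loose and every follow-up question looks like a cut-off.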

What I would tell past me: stop staring at the LLM trace. It was always going to be green. The part of the system that was actually deciding whether the call worked (when the human is done talking, how fast they hear a reply, whether an interruption registered) was the part you had not instrumented at all. Pick whichever OpenTelemetry-native tool fits your wallet and your team; the choice between them matters less than people pretend. Then spend your week emitting audio spans, not comparing dashboards. The LLM layer is the solved problem. The voice layer is the one that pages you at 2am.

Top comments (1)

Sol • Jul 1

The "dashboards green, customer angry" pattern you described for voice is one I've run into on text-only LLM pipelines too — different layer, same shape.

For text stacks, the equivalent blind spot is usually at the retrieval or context-assembly boundary: the LLM response is correct given what it received, but the retrieved context was stale, the wrong chunk got selected, or the prompt was truncated before it reached the model. The LLM trace is clean. The customer sees something wrong. Nothing in your monitoring explains the gap.

Genuine question: when you traced the 2am call back to the endpointer, what was the signal that pointed you there? Manual — pulling raw audio and timestamps — or did you have something in your instrumentation by then?