Voice pipelines have four stages that each need their own latency story: ASR (speech to text), LLM (prompt to response), TTS (text to speech), and client (jitter on the receiving end). When we wired OTel across all four, spans without consistent attributes were useless for queries. Three attributes ended up on every span, and each earns its keep.
audio.stage: Enum with values asr, llm, tts, client. The single most-queried attribute. With it, the Grafana query for p95 latency by stage is a single filter. Without it, you are scrolling raw traces.
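To make the "p95 by stage" idea concrete, here is a minimal offline sketch of what that one-filter query computes. It takes the attribute key to be audio.stage, and the span record shape (a dict with a duration_ms key) and the names AudioStage and p95_by_stage are assumptions for illustration, not part of our pipeline or of any OTel API.

```python
from collections import defaultdict
from enum import Enum
from statistics import quantiles


class AudioStage(str, Enum):
    """Allowed values for the audio.stage span attribute."""
    ASR = "asr"
    LLM = "llm"
    TTS = "tts"
    CLIENT = "client"


def p95_by_stage(spans):
    """Group span durations by audio.stage and return the p95 per stage.

    `spans` is an iterable of dicts with the hypothetical keys
    "audio.stage" and "duration_ms".
    """
    durations = defaultdict(list)
    for span in spans:
        # Raises ValueError if the stage is not one of the four enum values.
        stage = AudioStage(span["audio.stage"])
        durations[stage.value].append(span["duration_ms"])
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    # Stages with fewer than two samples are skipped.
    return {
        stage: quantiles(values, n=100, method="inclusive")[94]
        for stage, values in durations.items()
        if len(values) >= 2
    }
```

Validating against the enum at write time is the design choice that matters: a single misspelled stage value silently drops spans out of the per-stage filter.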
audio.session_id: Identifies the full conversation. Lets you query "what did the user actually experience" end-to-end. We use a UUID generated at session start and propagated to every downstream call. Tempo's trace-by-tag lookup is fast on this attribute.
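A minimal sketch of building those two always-present attributes in one place, so every stage tags spans the same way. The helper names new_session_id and span_attributes are hypothetical, as is the choice of a version 4 UUID; the resulting dict is the kind of thing you would pass to a span's set_attributes call in the OpenTelemetry Python API.

```python
import uuid

# Assumed set of stage values, taken from the audio.stage description above.
STAGES = ("asr", "llm", "tts", "client")


def new_session_id() -> str:
    """Generate a session ID once, at session start."""
    return str(uuid.uuid4())


def span_attributes(stage: str, session_id: str) -> dict:
    """Return the attributes that go on every span in the voice pipeline."""
    if stage not in STAGES:
        raise ValueError(f"unknown audio.stage: {stage!r}")
    # uuid.UUID raises ValueError if the propagated ID is not a valid UUID.
    uuid.UUID(session_id)
    return {"audio.stage": stage, "audio.session_id": session_id}
```

Usage: call new_session_id() once when the conversation opens, pass the value along with each downstream request, and have each stage call span_attributes with its own stage name.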
audio.first_byte_ms: The time from request start to the first audio byte returned, recorded on the streaming ASR and TTS stages. This is what catches barge-in latency regressions before the dashboard's aggregate alert does. We page when p95 first_byte_ms goes above 350 ms on TTS.
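One way to capture this value is to wrap the streaming response and note when the first non-empty chunk arrives. The sketch below assumes the stream is a plain Python iterable of bytes and treats wrapper construction as "request start"; FirstByteTimer and should_page are hypothetical names, and the 350 ms default simply mirrors the TTS paging threshold mentioned above.

```python
import time


class FirstByteTimer:
    """Wrap a chunk stream and record milliseconds until the first non-empty chunk."""

    def __init__(self, chunks, clock=time.monotonic):
        self._chunks = iter(chunks)
        self._clock = clock
        # Assumption: the request starts when the wrapper is created.
        self._start = clock()
        self.first_byte_ms = None

    def __iter__(self):
        for chunk in self._chunks:
            # Record the elapsed time only once, on the first chunk that has data.
            if self.first_byte_ms is None and chunk:
                self.first_byte_ms = (self._clock() - self._start) * 1000.0
            yield chunk


def should_page(p95_first_byte_ms: float, threshold_ms: float = 350.0) -> bool:
    """Return True when the p95 first-byte latency is above the paging threshold."""
    return p95_first_byte_ms > threshold_ms
```

Once the stream has produced its first chunk, timer.first_byte_ms holds the value to set as audio.first_byte_ms on the stage's span. Using a monotonic clock avoids negative or inflated readings when the wall clock is adjusted mid-request.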
Honorable mention attributes that didn't survive the first cleanup: audio.codec (covered by the service info), audio.session_turn_index (covered by parent-span linkage), audio.user_id (privacy concerns at scale; left out).
If you are starting voice-pipeline observability: tag stage + session_id on every span from day one. first_byte_ms is the one you will otherwise add after the first production incident; you might as well add it on day two.