TTFT vs Total Latency: Instrumenting What Users Actually Feel

#llm #observability #devops #tutorial

Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
Me: xgabriel.com | GitHub

You ship a chat feature. The dashboard says p95 latency is 6 seconds. You panic. Six seconds is a disaster for a web interaction; nobody waits six seconds.

Then you open the product and try it yourself. Words start appearing in under a second. The full answer takes six seconds to finish streaming, but it never feels slow, because you are reading along while the model writes. The number on the dashboard and the feeling in the UI disagree.

Both are right. They are measuring different things. Your span recorded the wall-clock time of the whole generation. The user felt the wait until the first word showed up. For a streaming chat UI, those two numbers can differ by an order of magnitude, and you are almost certainly alerting on the wrong one.

The two numbers that matter

A streaming LLM response has a shape, not a duration. Three numbers describe it:

TTFT — time to first token. From when the user hits send to when the first chunk lands. This is the perceived responsiveness. It is the spinner-to-text gap.
Total latency — from send to the last token. This is what a naive time.time() around the call records.
TPOT — time per output token, sometimes called inter-token latency. How fast the text scrolls once it starts. Total latency is roughly TTFT + (output_tokens × TPOT).

For a chat UX, TTFT is the number a user judges you on. A 400ms TTFT with a 6-second total feels responsive. A 4-second TTFT with a 4.5-second total feels broken, even though the second one finishes sooner. The instrument most teams ship records only total latency, so it cannot tell those two apart.
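The two cases can be reproduced with a few lines of arithmetic. This is a sketch rather than a measurement: the function name total_ms, the 101-token answer length, and the TPOT values are assumptions chosen so the totals match the figures above, and the first token is counted in TTFT rather than in TPOT.

```python
def total_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Approximate send-to-last-token latency for one streamed response."""
    if output_tokens < 1:
        raise ValueError("need at least one output token")
    # The wait for the first token is the TTFT; each later token adds one TPOT.
    return ttft_ms + (output_tokens - 1) * tpot_ms


# Hypothetical 101-token answers matching the two cases above.
responsive = total_ms(ttft_ms=400, tpot_ms=56, output_tokens=101)
broken = total_ms(ttft_ms=4000, tpot_ms=5, output_tokens=101)

print(responsive)  # 6000
print(broken)      # 4500
```

A dashboard that only sees the return value ranks the second response as the better one, which is the opposite of how the two feel in the UI.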

Why one duration is not enough

When you wrap a streaming call in a single span and read its duration, you collapse the whole response shape into one scalar. That scalar moves for reasons that have nothing to do with how the UI feels:

A longer answer raises total latency without touching TTFT. The user is happy; the dashboard is redder.
A model that buffers before emitting raises TTFT without changing total. The user stares at a spinner; the dashboard looks the same.
A slow network on the last few tokens drags total latency up. The user has already read the answer and moved on.

You need the timestamps inside the stream, not just the bookends. That means recording an event when the first chunk arrives, and computing TTFT as the gap between span start and that event.
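As a rough illustration of why the inner timestamps matter, the sketch below derives all three numbers from a start time, a list of chunk arrival times, and an end time. The StreamShape and stream_shape names are hypothetical, and the timestamps are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StreamShape:
    ttft_ms: Optional[float]  # None if no chunk ever arrived
    total_ms: float
    tpot_ms: Optional[float]  # None with fewer than two chunks


def stream_shape(
    start: float, chunk_times: Sequence[float], end: float
) -> StreamShape:
    """Derive TTFT, total, and TPOT from timestamps given in seconds."""
    total_ms = (end - start) * 1000
    if not chunk_times:
        return StreamShape(None, total_ms, None)
    ttft_ms = (chunk_times[0] - start) * 1000
    tpot_ms = None
    if len(chunk_times) > 1:
        # Spread the time after the first chunk over the remaining chunks.
        after_first_ms = (end - chunk_times[0]) * 1000
        tpot_ms = after_first_ms / (len(chunk_times) - 1)
    return StreamShape(ttft_ms, total_ms, tpot_ms)


# Same bookends, different shape: the second stream buffers before emitting.
steady = stream_shape(0.0, [0.5, 1.0, 1.5, 2.0], end=2.0)
buffered = stream_shape(0.0, [1.5, 1.625, 1.75, 2.0], end=2.0)

print(steady.ttft_ms, steady.total_ms)      # 500.0 2000.0
print(buffered.ttft_ms, buffered.total_ms)  # 1500.0 2000.0
```

Both streams report the same total, so a single-duration span cannot tell them apart; only the first-chunk timestamp does.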

Instrument the stream, not the call

Here is a minimal OpenTelemetry emitter for a streaming chat call. It records TTFT as a span attribute, keeps total latency as the span duration, and derives TPOT. The example uses the OpenAI Python client, but the shape is the same for any streaming SDK.

import time
from openai import OpenAI
from opentelemetry import trace

client = OpenAI()
tracer = trace.get_tracer("app.llm")


def stream_chat(model, messages, conv_id):
    """Yield content deltas while timing the stream on one span."""
    span = tracer.start_span("gen_ai.chat")
    span.set_attribute("gen_ai.request.model", model)
    span.set_attribute(
        "gen_ai.conversation.id", conv_id
    )

    start = time.perf_counter()
    ttft = None
    out_tokens = 0  # content deltas seen; approximates tokens

    try:
        stream = client.chat.completions.create(
            model=model,
            messages=messages,
            stream=True,
        )
        for chunk in stream:
            # Some chunks carry no choices (for example a
            # usage-only chunk), so there is no text to time.
            if not chunk.choices:
                continue
            delta = chunk.choices[0].delta.content
            # Skip None and "" deltas: a role-only first chunk
            # is not text the user can see yet.
            if not delta:
                continue
            if ttft is None:
                # First visible text: this gap is the TTFT.
                ttft = time.perf_counter() - start
                span.set_attribute(
                    "app.llm.ttft_ms", round(ttft * 1000)
                )
            out_tokens += 1
            yield delta
    finally:
        # Runs on normal completion, on errors, and when the
        # consumer closes the generator mid-stream.
        total = time.perf_counter() - start
        span.set_attribute(
            "app.llm.total_ms", round(total * 1000)
        )
        # Delta count, so an approximation of the token count.
        span.set_attribute(
            "gen_ai.usage.output_tokens", out_tokens
        )
        if ttft is not None and out_tokens > 1:
            gen = total - ttft
            tpot = gen / (out_tokens - 1)
            span.set_attribute(
                "app.llm.tpot_ms", round(tpot * 1000)
            )
        span.end()

Three details carry the whole thing. The app.llm.ttft_ms attribute is set the moment the first content delta arrives, not before. The span only ends in the finally block, so total latency is correct even when the client disconnects mid-stream. And TPOT divides by out_tokens - 1, because the first token's time is already accounted for in TTFT.

A note on the token count: counting deltas is an approximation. A delta is often one token but not always, and providers chunk differently. If you need exact output-token counts for cost, read them from the final usage payload the provider sends; use the delta count for TPOT only.
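One way to prefer the provider's count is sketched below. It assumes OpenAI-style chunks where, when the request sets stream_options={"include_usage": True}, a final chunk arrives with an empty choices list and a usage object carrying completion_tokens. The helper names output_token_count and fake_chunk are hypothetical, and the fake chunks are stand-ins for illustration, not SDK objects.

```python
from types import SimpleNamespace


def output_token_count(chunks):
    """Return (count, exact) for an iterable of streamed chunks.

    Prefers the provider-reported usage; falls back to counting
    non-empty content deltas, which is only an approximation.
    """
    reported = None
    delta_count = 0
    for chunk in chunks:
        usage = getattr(chunk, "usage", None)
        if usage is not None:
            reported = usage.completion_tokens
        if chunk.choices and chunk.choices[0].delta.content:
            delta_count += 1
    if reported is not None:
        return reported, True
    return delta_count, False


def fake_chunk(content=None, completion_tokens=None):
    """Build a stand-in chunk for illustration only."""
    if completion_tokens is not None:
        usage = SimpleNamespace(completion_tokens=completion_tokens)
        return SimpleNamespace(choices=[], usage=usage)
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)], usage=None)


stream = [
    fake_chunk(""),
    fake_chunk("Hel"),
    fake_chunk("lo"),
    fake_chunk(completion_tokens=3),
]
print(output_token_count(stream))      # (3, True)
print(output_token_count(stream[:3]))  # (2, False)
```

The second call shows the fallback: without a usage chunk, two deltas are reported for what the provider counted as three tokens, which is fine for TPOT and wrong for cost.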

What to record on the span

Keep the standard GenAI attributes and add three custom ones in a project-prefixed namespace. The custom names are not in the OpenTelemetry GenAI spec yet, so app.* keeps them out of the way of future standard attributes:

gen_ai.request.model        e.g. "gpt-4o-2024-11-20"
gen_ai.conversation.id      stable id across turns
gen_ai.usage.output_tokens  int
app.llm.ttft_ms             time to first token, ms
app.llm.total_ms            send to last token, ms
app.llm.tpot_ms             time per output token, ms

Recording all three lets you answer the question your single-duration span could not: was a slow response slow because the model took a long time to start, or because the answer was long? Those are different bugs with different fixes. A high TTFT points at queueing, cold starts, a long prompt, or provider load. A high TPOT with a fine TTFT points at decode speed, which usually means model choice or a throttled tier.
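The triage described above can be written down as a small rule of thumb. This is a sketch under assumptions: the diagnose function is hypothetical, and the default budgets of 1500ms and 80ms are borrowed from the alert thresholds used later in this article rather than from any standard.

```python
from typing import Optional


def diagnose(
    ttft_ms: float,
    tpot_ms: Optional[float],
    ttft_budget_ms: float = 1500,
    tpot_budget_ms: float = 80,
) -> str:
    """Classify one response by which part of the stream was slow."""
    slow_start = ttft_ms > ttft_budget_ms
    # TPOT is undefined for single-chunk responses, so treat None as fine.
    slow_decode = tpot_ms is not None and tpot_ms > tpot_budget_ms
    if slow_start and slow_decode:
        return "slow start and slow decode"
    if slow_start:
        return "slow start: check queueing, cold starts, prompt length, provider load"
    if slow_decode:
        return "slow decode: check model choice or a throttled tier"
    return "within budget: a high total here just means a long answer"


print(diagnose(ttft_ms=4000, tpot_ms=5))
print(diagnose(ttft_ms=400, tpot_ms=120))
print(diagnose(ttft_ms=400, tpot_ms=56))
```

With only total latency recorded, all three calls would have to be answered with the same guess.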

What to alert on for a chat UX

Alert on TTFT, not total latency. This is the part most teams get backwards.

Total latency is partly under the user's control, because it scales with how long the answer is. A user who asks for a 2,000-word summary signed up for a long total. Paging your on-call because someone requested a long answer is noise. TTFT does not scale with answer length; for a given prompt it should be roughly constant regardless of how much the model ends up writing. That makes it a clean signal.

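PromQL: page when the p95 time-to-first-token over five minutes crosses your threshold. The query assumes TTFT is also exported as a Prometheus histogram named app_llm_ttft_ms, since span attributes alone do not produce the _bucket series it reads:

histogram_quantile(
  0.95,
  sum by (le) (rate(app_llm_ttft_ms_bucket[5m]))
) > 1500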

Pick the threshold from how the UI feels, not from a generic SLO. For a chat box where the user is staring at a spinner, somewhere around 1 to 1.5 seconds of TTFT is the edge of "feels responsive." Past that, the interaction starts to feel like a page load.

Datadog DDQL — same idea, on the p95 rollup:

p95:app.llm.ttft_ms{*} > 1500

Add a second, looser alert on TPOT for the cases where the stream starts fast but crawls afterward:

histogram_quantile(
  0.95,
  sum by (le) (rate(app_llm_tpot_ms_bucket[5m]))
) > 80

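A TPOT past roughly 50 to 80ms per token is where the stream stops comfortably outrunning the reader. At a rule-of-thumb 0.75 English words per token, 80ms per token is about 9 words per second: faster than careful reading, but close to a quick skim. Past that, the responsiveness you bought with a fast TTFT bleeds away over a long answer. That is a different fix from a TTFT regression, which is why it gets its own alert.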

Leave a total-latency alert as a backstop only, set very loose. It catches the pathological case where a stream opens fast and then hangs forever without closing, which neither the TTFT nor the TPOT alert sees on its own.

The slice that hides the regression

A global p95 TTFT smooths over the thing that actually breaks. Slice it by the dimensions that drift independently:

Per model. A cheaper model on a busier tier often has a worse TTFT under load even when its decode speed is fine.
Per region. TTFT includes the network round trip to the provider. A user far from the provider's region eats that on every turn.
First turn vs. follow-up. The first message in a conversation can carry a cold start or a longer system prompt. Follow-ups reuse warm context. Averaging them together hides both.

A flat 900ms p95 can be a healthy 600ms for most traffic and a miserable 2.5 seconds for one region or one model. Your users in that slice feel every millisecond. Your aggregate dashboard does not.
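The effect is easy to reproduce offline. The sketch below uses made-up traffic and a nearest-rank percentile; the p95 and p95_by helpers and the region labels are hypothetical, and the numbers are chosen to show the pattern rather than to match the figures above.

```python
import math
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def p95_by(records, key):
    """Group (labels, ttft_ms) records by one label and take p95 of each."""
    groups = defaultdict(list)
    for labels, ttft_ms in records:
        groups[labels[key]].append(ttft_ms)
    return {name: p95(values) for name, values in groups.items()}


# Made-up traffic: 96 healthy requests and 4 slow ones from one region.
records = [({"region": "us"}, 500 + (i % 5) * 50) for i in range(96)]
records += [({"region": "ap"}, 2500) for _ in range(4)]

overall = p95([ttft_ms for _, ttft_ms in records])
per_region = p95_by(records, "region")

print(overall)     # stays under a 1500ms threshold
print(per_region)  # the "ap" slice is far above it
```

In PromQL the same slicing is a sum by (le, region) inside histogram_quantile, assuming the histogram carries a region label.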

What this changes

The shift is small in code and large in what you can see. You stop treating an LLM response as a single duration and start treating it as a stream with a start, a rate, and an end. You record the first-token moment on the span, keep total as the span duration, and alert on the number the user actually feels.

Once the instrument tells TTFT and total apart, most "the chat feels slow" tickets resolve to one of two clear causes instead of a shrug at a green-but-high latency graph.

If this was useful

If your chat latency dashboard has been arguing with what the product feels like, the disagreement is usually this one: the span measures total, the user feels first-token. The LLM Observability Pocket Guide covers the streaming-span pattern, the GenAI attribute set worth recording, and how to keep these alerts from going stale as models and tiers rotate.