- Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
The dashboard says green. The user says slow.
You ship a chat feature. P50 sits at 1.4s. P95 at 4.8s. P99 at 9s. Cost graph is steady, error rate is zero. Then the support inbox lights up: "the AI feels broken," "it locks up for like three seconds," "is it even working?"
You stare at the dashboard. The dashboard stares back. Green.
The dashboard isn't wrong about what it measures. It's wrong about what matters. For any streaming LLM endpoint, total wall-time is the wrong SLO. Users don't wait for the response. They wait for the first word, and then they read at the speed it appears.
Shipping against P99-total is how you end up with a "fast" app that feels broken.
What users actually wait for
Two numbers describe streaming latency honestly, and both have to live on your dashboard:
Time to first token (TTFT). Wall-clock from request leaving the client to the first piece of model output rendered. This is the silence the user stares at. If TTFT is 2.5s, users start blaming you no matter how fast the rest streams.
Inter-token gap (ITG). The wall-clock between successive tokens (or chunks) once streaming has started. ITG-P50 of 25ms feels fluid. ITG-P95 of 400ms is the "is it frozen?" moment people complain about even when total time was fine.
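Both definitions reduce to simple arithmetic over timestamps. Here's a minimal sketch, assuming you already have a monotonic timestamp for the request and one for each visible token; the helper name ttft_and_gaps is made up for illustration.

```python
from typing import Optional


def ttft_and_gaps(
    request_sent_at: float, token_times: list[float]
) -> tuple[Optional[float], list[float]]:
    """Return (ttft_seconds, inter_token_gaps_seconds) from monotonic timestamps."""
    if not token_times:
        return None, []  # nothing visible ever arrived, so there is no TTFT
    ttft = token_times[0] - request_sent_at
    # Each gap is the distance between one visible token and the next.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, gaps


# Hypothetical timeline: first token at 0.5s, a stall before the last token.
ttft, gaps = ttft_and_gaps(0.0, [0.5, 0.75, 1.0, 2.0])
```

The one-second gap at the end is the kind of stall a total-time metric would never surface.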
OpenAI's streaming protocol uses Server-Sent Events with data: lines containing JSON deltas (see the streaming chat docs). Anthropic's protocol uses SSE with typed events (message_start, content_block_delta, message_stop). Both end with a terminal marker. Both make it surprisingly easy to measure the wrong thing.
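To make "the wrong thing" concrete, here's a rough sketch of pulling user-visible text out of one SSE data payload in either shape. The function name visible_text is hypothetical, and the payload shapes are trimmed to the fields that matter for timing.

```python
import json
from typing import Optional


def visible_text(sse_data: str) -> Optional[str]:
    """Return the user-visible text in one SSE data payload, or None."""
    if sse_data == "[DONE]":  # OpenAI-style terminal marker
        return None
    payload = json.loads(sse_data)
    if "choices" in payload:  # OpenAI-style chat completion chunk
        choices = payload["choices"]
        if not choices:  # usage-only chunk
            return None
        return (choices[0].get("delta") or {}).get("content") or None
    if payload.get("type") == "content_block_delta":  # Anthropic-style typed event
        delta = payload.get("delta", {})
        if delta.get("type") == "text_delta":
            return delta.get("text") or None
    return None  # message_start, message_stop, tool deltas, and so on
```

Anything that returns None here is protocol traffic, not a token, and should never stop a TTFT timer.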
Where the timer goes wrong
The single most common bug: starting the TTFT timer when the SDK call returns its iterator, and stopping it when you receive the first chunk, regardless of whether that chunk has any user-visible content.
Both sides of that sentence are subtly wrong. Start the timer too late, you under-report. Stop it on the wrong chunk, you under-report by another 50-200ms. Here's what honest TTFT looks like in three SDKs.
OpenAI Python SDK
import time
from openai import OpenAI

client = OpenAI()

def stream_with_ttft(prompt: str) -> dict:
    t_start = time.perf_counter()  # start BEFORE the SDK call
    first_token_at: float | None = None
    last_chunk_at = t_start
    itgs: list[float] = []  # gaps in seconds, converted to ms on return
    output = []

    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        stream_options={"include_usage": True},
    )
    for chunk in stream:
        if not chunk.choices:
            continue  # usage-only chunk at the end, skip
        delta = chunk.choices[0].delta.content
        if delta is None or delta == "":
            continue  # role-only opener, doesn't count as a token
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now  # this is the real TTFT mark
        else:
            itgs.append(now - last_chunk_at)
        last_chunk_at = now
        output.append(delta)

    return {
        # None when the stream produced no visible text at all
        "ttft_ms": None if first_token_at is None else (first_token_at - t_start) * 1000,
        "itg_ms": [gap * 1000 for gap in itgs],
        "total_ms": (last_chunk_at - t_start) * 1000,
        "text": "".join(output),
    }
Two things matter here. t_start goes before the create call, not after. The SDK does the TLS handshake and the initial POST on that line, and that round-trip is what your user paid for. The first chunk OpenAI sends is usually {"role": "assistant"} with no content; counting it as the first token under-reports TTFT by 50-200ms.
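Here's a small sketch of how far the two mistakes can move the number, replaying a made-up recording instead of calling a real API. The timings and the names RECORDED, naive_ttft and honest_ttft are assumptions for illustration only.

```python
from typing import Optional

# Hypothetical recording of one stream: (seconds since the request was sent, content).
# The first entry mimics a role-only opener that carries no visible text.
RECORDED = [(0.5, None), (0.75, "Hel"), (1.0, "lo")]
ITERATOR_RETURNED_AT = 0.25  # when the SDK call handed back its iterator


def naive_ttft(chunks, iterator_returned_at: float) -> float:
    # Starts late (after the SDK call) and stops on the first chunk of any kind.
    return chunks[0][0] - iterator_returned_at


def honest_ttft(chunks) -> Optional[float]:
    # Starts at request send (t = 0) and stops on the first chunk with visible text.
    for at, content in chunks:
        if content:
            return at
    return None


naive = naive_ttft(RECORDED, ITERATOR_RETURNED_AT)
honest = honest_ttft(RECORDED)
```

In this recording the naive timer reports 0.25s while the user actually waited 0.75s.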
Anthropic Python SDK
import time
import anthropic

client = anthropic.Anthropic()

def stream_with_ttft(prompt: str) -> dict:
    t_start = time.perf_counter()  # start BEFORE the SDK call
    first_token_at: float | None = None
    last_text_at = t_start
    itgs: list[float] = []  # gaps in seconds, converted to ms on return
    output = []

    with client.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for event in stream:
            # Anthropic sends message_start, content_block_start,
            # content_block_delta, content_block_stop, message_delta,
            # message_stop. Only deltas with text count as tokens.
            if event.type != "content_block_delta":
                continue
            if event.delta.type != "text_delta":
                continue  # tool-call argument deltas (input_json_delta) also flow here
            now = time.perf_counter()
            if first_token_at is None:
                first_token_at = now
            else:
                itgs.append(now - last_text_at)
            last_text_at = now
            output.append(event.delta.text)

    return {
        # None when the stream produced no visible text at all
        "ttft_ms": None if first_token_at is None else (first_token_at - t_start) * 1000,
        "itg_ms": [gap * 1000 for gap in itgs],
        "total_ms": (last_text_at - t_start) * 1000,
        "text": "".join(output),
    }
Anthropic's stream is event-typed, which is a gift. You can't accidentally count a message_start as the first token. But you can accidentally count a tool-call delta as a text token: tool_use blocks stream their arguments as content_block_delta events too, with delta type input_json_delta instead of text_delta. If your agent loop calls tools, treat those deltas as a separate signal. Tool-call TTFT is its own metric.
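One way to keep the two signals apart, sketched over plain dicts instead of SDK objects so it runs without a network call. The function name first_seen and the result keys are hypothetical; the delta type names text_delta and input_json_delta are the ones Anthropic's streaming events use.

```python
from typing import Optional


def first_seen(events: list[tuple[float, dict]]) -> dict[str, Optional[float]]:
    """Return seconds to the first text delta and to the first tool-input delta.

    Each event is (seconds since request start, event payload as a dict).
    """
    firsts: dict[str, Optional[float]] = {"text_ttft_s": None, "tool_ttft_s": None}
    key_for = {"text_delta": "text_ttft_s", "input_json_delta": "tool_ttft_s"}
    for at, event in events:
        if event.get("type") != "content_block_delta":
            continue  # message_start, content_block_start, and friends
        key = key_for.get(event.get("delta", {}).get("type"))
        if key and firsts[key] is None:
            firsts[key] = at  # keep only the first sighting of each kind
    return firsts


# Hypothetical event timeline for a response that answers, then calls a tool.
timeline = [
    (0.25, {"type": "message_start"}),
    (0.5, {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "Sure"}}),
    (1.5, {"type": "content_block_delta", "delta": {"type": "input_json_delta", "partial_json": "{"}}),
]
marks = first_seen(timeline)
```

Emit the two values as separate metrics so a slow tool call never hides inside the text TTFT histogram.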
Generic SSE (your own gateway, vLLM, Together, anything raw)
import time
import json
import httpx

def stream_with_ttft(url: str, payload: dict) -> dict:
    t_start = time.perf_counter()  # start BEFORE the request goes out
    first_token_at: float | None = None
    last_chunk_at = t_start
    itgs: list[float] = []  # gaps in seconds, converted to ms on return
    output = []

    with httpx.stream("POST", url, json=payload, timeout=60) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith("data:"):
                continue  # blank separators, heartbeat comments, event:/id: fields
            data = line[5:].strip()
            if data == "[DONE]":
                break
            try:
                chunk = json.loads(data)
            except json.JSONDecodeError:
                continue  # half-frame, comes through occasionally
            choices = chunk.get("choices") or [{}]  # usage-only chunks send []
            text = (choices[0].get("delta") or {}).get("content")
            if not text:
                continue
            now = time.perf_counter()
            if first_token_at is None:
                first_token_at = now
            else:
                itgs.append(now - last_chunk_at)
            last_chunk_at = now
            output.append(text)

    return {
        # None when the stream produced no visible text at all
        "ttft_ms": None if first_token_at is None else (first_token_at - t_start) * 1000,
        "itg_ms": [gap * 1000 for gap in itgs],
        "total_ms": (last_chunk_at - t_start) * 1000,
        "text": "".join(output),
    }
Raw SSE is where the boring bugs live. Half-frames from a slow proxy. Heartbeat comments (: keep-alive) that aren't data: lines. The [DONE] sentinel that some providers send and some don't. Wrap json.loads in a try, skip what you can't parse, and never count a non-content chunk.
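If you'd rather make those cases explicit than bury them in continue statements, a line classifier is one option. This is a sketch with a hypothetical name (classify_sse_line) and made-up kind labels, assuming lines have already been decoded and split.

```python
import json


def classify_sse_line(line: str) -> tuple[str, object]:
    """Classify one decoded SSE line as (kind, value)."""
    if not line:
        return ("blank", None)  # separator between events
    if line.startswith(":"):
        return ("comment", line[1:].strip())  # heartbeat such as ": keep-alive"
    if not line.startswith("data:"):
        return ("field", line)  # event:, id:, retry:
    data = line[5:]
    if data.startswith(" "):
        data = data[1:]  # SSE allows one optional space after the colon
    if data == "[DONE]":
        return ("done", None)
    try:
        return ("json", json.loads(data))
    except json.JSONDecodeError:
        return ("unparseable", data)  # count these, but never time them
```

Counting the "unparseable" and "comment" kinds per route gives you a cheap signal that a proxy is mangling frames.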
TTFT lies again when there's a proxy in the middle
Here's the bit that surprises people. Your TTFT looks fine in dev. You roll out behind your API gateway, your CDN, or your own LLM router. TTFT-P95 doubles. Sometimes triples.
The culprit is almost always response buffering. Nginx with default settings buffers proxied responses until a buffer fills or the upstream finishes (proxy_buffering defaults to on). AWS API Gateway with Lambda proxy integration (AWS_PROXY) buffers the entire response by default. Cloudflare Workers stream, but a Cloudflare proxy in front of an origin that doesn't send Transfer-Encoding: chunked correctly will buffer too.
What this looks like in numbers: model emits first token at 380ms, proxy releases it at 750ms. Your client-side TTFT records 750ms. Your provider-side dashboard records 380ms. Both are correct. Neither is what the user feels.
Capture two TTFT timers. One at the edge (the value you SLO on, because that's the user's experience), one at the model adapter (the value you debug with, because that's where you can act). When the gap between them widens, you know it's not the provider. It's your own infrastructure swallowing the first token.
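The comparison itself is tiny. Here's a sketch, assuming both timers report milliseconds for the same request; the function names and the 100ms budget are placeholders to tune, not recommendations.

```python
def buffering_gap_ms(edge_ttft_ms: float, adapter_ttft_ms: float) -> float:
    """Time the first token spent between the model adapter and the edge."""
    return edge_ttft_ms - adapter_ttft_ms


def likely_buffering(
    edge_ttft_ms: float, adapter_ttft_ms: float, budget_ms: float = 100.0
) -> bool:
    # budget_ms is an assumed allowance for ordinary network hops.
    return buffering_gap_ms(edge_ttft_ms, adapter_ttft_ms) > budget_ms


# The numbers from the example above: model at 380ms, proxy releases at 750ms.
gap = buffering_gap_ms(750, 380)
```

Export the gap as its own metric so the divergence shows up on a graph instead of in a postmortem.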
Concretely: in nginx, add the directive proxy_buffering off; to the SSE location, or have the upstream send X-Accel-Buffering: no as a response header, and disable response compression for the streaming route. Gzip on a 30-byte SSE frame is a tax with no upside.
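A quick self-check for the headers your streaming handler is about to send can catch regressions before they reach the proxy. This is a sketch with a hypothetical helper name. It is meant to run at the origin, since nginx by default does not pass X-Accel-* headers on to the client.

```python
def streaming_header_problems(headers: dict[str, str]) -> list[str]:
    """Return human-readable problems with an SSE response's origin headers."""
    h = {name.lower(): value.strip().lower() for name, value in headers.items()}
    problems = []
    if not h.get("content-type", "").startswith("text/event-stream"):
        problems.append("content-type is not text/event-stream")
    if h.get("content-encoding", "identity") != "identity":
        problems.append("response is compressed")
    if h.get("x-accel-buffering") != "no":
        problems.append("X-Accel-Buffering: no is missing")
    return problems


good = streaming_header_problems(
    {"Content-Type": "text/event-stream", "X-Accel-Buffering": "no"}
)
bad = streaming_header_problems(
    {"Content-Type": "text/event-stream", "Content-Encoding": "gzip"}
)
```

Running it in a unit test for the streaming route is usually enough; it doesn't need to live in the request path.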
The inter-token gap nobody alerts on
ITG catches a failure mode TTFT misses entirely: the provider serves the first token fast, then stalls.
You've seen this. The response starts, you read the first sentence, and then it just... pauses. For two seconds. For four. Total wall-time still looks fine because the rest of the stream catches up. P99-total stays green. ITG-P95 explodes.
Capture ITG as a histogram, not an average. The interesting signal is the long tail. A Prometheus histogram works well here because the buckets you want are 10ms / 25ms / 50ms / 100ms / 250ms / 500ms / 1s / 2s: log-spaced, biased toward the small end.
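A toy calculation shows why the average is the wrong summary. The gap values below are invented, and the percentile helper uses the simple nearest-rank method, not whatever interpolation your metrics backend applies.

```python
import math


def percentile(values: list[float], q: int) -> float:
    """Nearest-rank percentile for q in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical stream: 90 fluid gaps of 25ms and 10 stalls of 400ms.
gaps_ms = [25] * 90 + [400] * 10

mean_ms = sum(gaps_ms) / len(gaps_ms)
p50_ms = percentile(gaps_ms, 50)
p95_ms = percentile(gaps_ms, 95)
```

The mean lands at 62.5ms, which looks tolerable, while P95 sits at 400ms, which is the stall users actually notice.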
from prometheus_client import Histogram

LLM_TTFT = Histogram(
    "llm_ttft_seconds",
    "Wall-clock from request start to first content token",
    labelnames=["model", "route"],
    buckets=(0.1, 0.25, 0.5, 0.8, 1.0, 1.5, 2.0, 3.0, 5.0, 10.0),
)

LLM_ITG = Histogram(
    "llm_inter_token_gap_seconds",
    "Wall-clock between successive content tokens",
    labelnames=["model", "route"],
    buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.0),
)
After the stream finishes, observe the TTFT once and observe every ITG. The histogram percentiles will tell you what users feel.
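Here's roughly what that recording step could look like. The helper record_stream is hypothetical; it takes the two histograms as arguments (pass LLM_TTFT and LLM_ITG from above) and expects values in seconds, because that is the unit both metric names declare.

```python
from typing import Iterable, Optional


def record_stream(
    ttft_hist,
    itg_hist,
    model: str,
    route: str,
    ttft_s: Optional[float],
    gaps_s: Iterable[float],
) -> None:
    """Observe one finished stream: TTFT once, every inter-token gap individually."""
    labels = {"model": model, "route": route}
    if ttft_s is not None:  # a stream with no visible text has no TTFT to record
        ttft_hist.labels(**labels).observe(ttft_s)
    for gap_s in gaps_s:
        itg_hist.labels(**labels).observe(gap_s)
```

A call might look like record_stream(LLM_TTFT, LLM_ITG, "gpt-4o-mini", "chat", 0.42, [0.03, 0.02, 0.9]). If your timers produce milliseconds, divide by 1000 first.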
The SLO that replaces P99-total
Stop SLO'ing on total wall-time for streaming. Use a composite of three signals. Here's a template; edit the numbers for your product:
slo:
  service: chat-stream
  objectives:
    - name: ttft_p95
      query: histogram_quantile(0.95, sum(rate(
        llm_ttft_seconds_bucket{route="chat"}[5m]
        )) by (le))
      threshold_s: 0.8   # users tolerate 800ms of silence
      window: 28d
      target: 0.99       # 99% of 5-min windows under threshold
    - name: itg_p95
      query: histogram_quantile(0.95, sum(rate(
        llm_inter_token_gap_seconds_bucket{route="chat"}[5m]
        )) by (le))
      threshold_s: 0.05  # tokens land within 50ms of each other
      window: 28d
      target: 0.99
    - name: perceived_latency_p95
      # TTFT + (expected output tokens × ITG-P50)
      # for a 200-token answer at 30ms ITG that's TTFT + 6s
      expression: ttft_p95 + (200 * itg_p50)
      threshold_s: 11.0
      window: 28d
      target: 0.99
The third one, perceived latency, is the SLO that aligns with the user's mental model. It's the answer to "how long does this feel?". For a typical 200-token answer, 800ms of silence followed by 30ms per token is 6.8s end to end (0.8s + 200 × 30ms). That feels fast. Move TTFT to 2s and the same 200 tokens at 30ms feel slow, even though the total only grows to 8s.
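The arithmetic behind that expression, as a sketch in integer milliseconds so the numbers stay exact. The function name is made up; the formula is the one in the template above.

```python
def perceived_latency_ms(ttft_ms: int, expected_tokens: int, itg_p50_ms: int) -> int:
    """TTFT plus the time to stream the expected number of tokens."""
    return ttft_ms + expected_tokens * itg_p50_ms


fast_start = perceived_latency_ms(ttft_ms=800, expected_tokens=200, itg_p50_ms=30)
slow_start = perceived_latency_ms(ttft_ms=2000, expected_tokens=200, itg_p50_ms=30)
```

The two totals are 6800ms and 8000ms. They differ by only 1.2s, and all of that difference is silence before the first word, which is why TTFT carries its own objective.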
You can run all three on the same Prometheus scrape, and the three together replace your old P99-total alert with something that actually catches what users complain about.
One alert, three signals
Don't page on each signal in isolation. You'll wake people up for nothing. Page on a composite, the same way you'd build a multi-window burn-rate alert for any SLO.
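Datadog monitor that wires all three. It's written as a single expression for readability; Datadog itself expresses this kind of AND as a composite monitor that combines separate metric monitors: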
name: "Chat stream - perceived latency burn"
type: query alert
query: |
  (
    avg(last_5m):p95:llm.ttft.seconds{route:chat} > 0.8
  ) and (
    avg(last_5m):p95:llm.itg.seconds{route:chat} > 0.05
  ) and (
    avg(last_5m):p95:llm.perceived.seconds{route:chat} > 11
  )
message: |
  Streaming chat has both slow first token AND choppy mid-stream.
  Check: provider status, gateway buffering, model concurrency cap.
  Runbook: https://wiki/runbooks/chat-stream-slow
tags:
  - service:chat-stream
  - team:platform-llm
Prometheus + Alertmanager equivalent, covering the TTFT and ITG legs:
- alert: ChatStreamPerceivedLatencyBurn
  expr: |
    (
      histogram_quantile(0.95,
        sum(rate(llm_ttft_seconds_bucket{route="chat"}[5m]))
        by (le)
      ) > 0.8
    )
    and
    (
      histogram_quantile(0.95,
        sum(rate(llm_inter_token_gap_seconds_bucket{route="chat"}[5m]))
        by (le)
      ) > 0.05
    )
  for: 10m
  labels:
    severity: page
    team: platform-llm
  annotations:
    summary: "Chat stream feels slow to users (TTFT + ITG)"
    description: |
      Both TTFT-P95 > 800ms and ITG-P95 > 50ms for 10 minutes.
      The composite alert means the user is genuinely waiting.
The and is what saves you. A spike in TTFT alone with normal ITG is usually a cold-start or a model warmup. Annoying, not a page. A spike in ITG alone is usually a noisy provider on one chunk. Both together, sustained for 10 minutes, is real and worth waking someone up for.
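The same decision table as plain code, which can be handy for unit-testing alert logic before it goes into a rule file. The function name and the "watch" and "pending" outcomes are assumptions; the thresholds mirror the rule above.

```python
def alert_action(
    ttft_p95_s: float,
    itg_p95_s: float,
    sustained_minutes: float,
    ttft_limit_s: float = 0.8,
    itg_limit_s: float = 0.05,
    for_minutes: float = 10,
) -> str:
    """Map the two signals to 'ok', 'watch', 'pending', or 'page'."""
    ttft_bad = ttft_p95_s > ttft_limit_s
    itg_bad = itg_p95_s > itg_limit_s
    if ttft_bad and itg_bad:
        # Both legs are breached: page only once the breach has been sustained.
        return "page" if sustained_minutes >= for_minutes else "pending"
    if ttft_bad or itg_bad:
        return "watch"  # one signal alone: annoying, not a page
    return "ok"
```

For example, alert_action(1.2, 0.2, 10) pages, while alert_action(1.2, 0.02, 30) only asks for a look during working hours.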
Gateway-side TTFT and model-side TTFT are not the same number. The first is what you SLO on. The second is what you debug with. Keep both on the dashboard, side by side. The day they diverge by 300ms is the day you fix nginx, not the model.
If this was useful
This is one chapter of a wider problem: picking metrics that match what your LLM users actually feel, then wiring them to the right backend without ending up with twelve dashboards nobody reads. The LLM Observability Pocket Guide walks through composite SLOs, OTel attribute design for token-aware tracing, and the trade-offs between Datadog/Grafana/Langfuse/Arize for streaming workloads. If you're moving past "we have a P99 chart" into "our on-call rotation trusts the alerts," it'll save you some Sundays.
