Aamer Mihaysi
The Latency Lie: Why Your Agent Is Slower Than You Think

Everyone measures agent latency. Almost nobody measures it correctly.

The problem is that most latency metrics capture model response time, not user experience time. And those are very different things.

What most teams measure:

  • Time from request to first token
  • Time from first token to completion
  • Total generation time

These are useful metrics. They tell you how fast the model responds. But they do not tell you how long the user waits.
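Those model-side numbers are easy to capture. A minimal sketch of timing any token stream for time-to-first-token and generation time (the `stream` iterable is a stand-in for whatever streaming API you actually call):

```python
import time

def measure_stream(stream):
    """Time a token stream: time-to-first-token, generation time, total."""
    start = time.perf_counter()
    first = None
    for _ in stream:
        if first is None:
            first = time.perf_counter()  # first token arrived
    end = time.perf_counter()
    return {
        "ttft": (first - start) if first else None,     # request -> first token
        "generation": (end - first) if first else 0.0,  # first token -> completion
        "total": end - start,                           # request -> completion
    }
```

Note that even this "total" is only model response time: the clock starts when you send the request, not when the user acted.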

What actually affects user experience:

  1. Preprocessing time. Before the model sees the prompt, you may be doing retrieval, context building, and prompt assembly. This can add seconds that never show up in model metrics.

  2. Tool execution time. When the agent calls a tool, the model-side clock stops, but the user keeps waiting. Tool calls can take anywhere from milliseconds to minutes.

  3. Retry loops. If the agent fails and retries, you add another full cycle. A single retry roughly doubles the latency, and cascading retries multiply it.

  4. Context accumulation. Longer contexts mean slower inference. A 50k-token prompt takes longer to process than a 5k-token one, even if the generated output is identical.

  5. UI rendering. Streaming tokens into a UI is not instant. Formatting, markdown parsing, and code highlighting all add latency.
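None of those phases appear in model-side metrics unless you time them yourself. One way to do that is to wrap each stage in a timing context manager; a minimal sketch, with the stage names as illustrative assumptions:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def phase(name):
    """Record wall-clock time spent in one stage of the request."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0.0) + time.perf_counter() - start

# Hypothetical stages of a single agent request:
with phase("preprocessing"):
    pass  # retrieval, context building, prompt assembly
with phase("model_inference"):
    pass  # the only part most dashboards see
with phase("tool_execution"):
    pass  # the user is still waiting here
```

Because the accumulator adds rather than overwrites, retries and repeated tool calls fold into the same bucket automatically.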

The hidden latency budget:

A typical agent interaction might look like:

  • Preprocessing: 500ms
  • Model inference: 2s
  • Tool calls: 3s (parallelized)
  • Postprocessing: 200ms
  • UI rendering: 100ms

Total: 5.8 seconds of user-perceived wait. But your metrics probably report only the 2 seconds of model inference.
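The same budget in code, using the numbers above as assumed values, makes the gap concrete:

```python
# Illustrative latency budget for one agent interaction, in seconds.
budget_s = {
    "preprocessing": 0.5,
    "model_inference": 2.0,
    "tool_calls": 3.0,  # parallelized; sequential would be worse
    "postprocessing": 0.2,
    "ui_rendering": 0.1,
}

total = sum(budget_s.values())          # end-to-end user wait
reported = budget_s["model_inference"]  # what model-only metrics show
hidden = total - reported

print(f"user waits {total:.1f}s, metrics report {reported:.1f}s, "
      f"{hidden / total:.0%} of the wait is invisible")
```

With these numbers, roughly two thirds of the wait never reaches your dashboard.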

Why this matters:

When you optimize only model latency, you are optimizing 30-40 percent of the actual wait time. The rest is invisible to your metrics but very visible to your users.

Better latency measurement:

Measure end-to-end time from user action to visible result. Include everything. Then break it down:

  • How much is preprocessing?
  • How much is tool execution?
  • How much is context bloat?
  • How much is retry overhead?
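Once per-phase timings are collected, answering those questions is a grouping exercise. A sketch that buckets measured spans by phase (the sample durations are illustrative, not benchmarks):

```python
# Illustrative per-phase measurements for one request, in seconds.
spans = [
    ("preprocessing", 0.4),
    ("tool_execution", 2.1),
    ("tool_execution", 0.9),   # second tool call
    ("model_inference", 1.8),
    ("model_inference", 1.6),  # retry after a failed first attempt
]

total = sum(d for _, d in spans)
by_phase: dict[str, float] = {}
for name, duration in spans:
    by_phase[name] = by_phase.get(name, 0.0) + duration

# Report phases by share of the end-to-end wait, largest first.
for name, duration in sorted(by_phase.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {duration:5.1f}s  ({duration / total:.0%})")
```

In a real system the spans would come from your instrumentation (or a tracing backend) rather than a hardcoded list, but the breakdown logic is the same.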

Model latency is easy to measure. User experience latency requires more instrumentation. But it is the only metric that matters.

If your agent feels slow but your metrics look fine, you are probably measuring the wrong thing.
