Kunal

Posted on Jul 3 • Originally published at kunalganglani.com

LLM Latency Benchmarks 2026: 6 Levers to Hit Sub-500ms TTFT

#llmlatency #aiperformance #productionai #timetofirsttoken


LLM latency in production is the difference between a product users love and one they abandon. In 2026, the fastest models hit a 0.35-second Time to First Token (TTFT), while some popular budget options crawl at 53 tokens per second of output. Most optimization guides are vendor-locked, outdated, or disconnected from actual user experience science. This one isn't.

Key takeaways:

Gemini 2.5 Flash-Lite leads TTFT at 0.35 seconds and 213.5 tokens/second for $0.10/1M input tokens — the best latency-to-price ratio in 2026.
Mercury 2 hits 841 tokens/second output speed, nearly 16x faster than GPT-4o mini's 53.6 t/s, proving that "budget model" doesn't mean "fast model."
Jakob Nielsen's three UX thresholds (0.1s, 1.0s, 10s) map directly to LLM latency budgets: TTFT above 1 second breaks conversational flow.
A 5-step agentic AI pipeline with 800ms TTFT per step burns 4 seconds before the user sees anything useful.
Six architectural levers — streaming, model routing, prompt caching, speculative decoding, KV cache optimization, and deployment geography — can halve perceived latency without a model swap.

What Is Time to First Token (TTFT) — and Why It's Not the Whole Story

Time to First Token (TTFT) measures the delay between sending an API request and receiving the first token of the response. It's the number users feel most acutely — that dead silence before the chatbot starts typing.

But TTFT alone doesn't tell you whether your application feels fast. There are actually three metrics that matter for production AI latency:

TTFT (Time to First Token): How long until the response starts. Dominated by network round-trip, queue wait time, and prompt prefill computation.
Output throughput (tokens/second): How fast tokens arrive after the first one. This determines how quickly a streaming response completes.
End-to-end response time: Total wall-clock time from request to final token. For non-streaming use cases, this is all that matters.

The model that starts fastest isn't always the model that finishes fastest.

Here's why that distinction matters: a model with 0.4s TTFT but 50 tokens/second throughput will feel slower on a 500-token response than a model with 0.8s TTFT and 200 tokens/second throughput. The first model takes 10.4 seconds total. The second takes 3.3 seconds. For streaming chatbots, TTFT dominates perceived speed. For RAG pipelines that need complete answers before the next step, throughput is king.
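The arithmetic in that comparison can be sketched in a few lines. This is a simplified model that assumes throughput stays constant once generation starts, which real endpoints only approximate; the function name is illustrative, not part of any provider SDK.

```python
def total_response_seconds(ttft_s: float, tokens_per_s: float, output_tokens: int) -> float:
    """Estimate wall-clock time: wait for the first token, then stream the rest.

    Assumes constant throughput after the first token (a simplification).
    """
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return ttft_s + output_tokens / tokens_per_s


if __name__ == "__main__":
    # The two hypothetical models from the paragraph above, 500-token response.
    fast_start = total_response_seconds(ttft_s=0.4, tokens_per_s=50, output_tokens=500)
    fast_finish = total_response_seconds(ttft_s=0.8, tokens_per_s=200, output_tokens=500)
    print(f"fast start:  {fast_start:.1f}s")
    print(f"fast finish: {fast_finish:.1f}s")
```

Plugging in your own measured TTFT, throughput, and typical response length gives a quick first estimate of which endpoint finishes sooner for your workload.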

According to Artificial Analysis, which tracks 500+ model endpoints with measurements taken 8 times per day using a rolling 72-hour window, TTFT and throughput often don't move together across providers: an endpoint with a low TTFT can still have mediocre throughput. The fastest TTFT doesn't guarantee the fastest total response.

LLM Latency Benchmarks 2026: TTFT and Throughput Across Major Providers

The 2026 model generation has completely reshuffled the latency leaderboard. Mercury 2 at 841 tokens/second and Gemini 2.5 Flash-Lite at 0.35s TTFT represent efficiency frontiers that invalidate most optimization advice written before this year. Mixture-of-experts (MoE) models like DeepSeek V3 and Llama 4 Scout have broken the assumption that model size equals latency.

Here's the provider-agnostic benchmark table using live Artificial Analysis data:

Model	TTFT (seconds)	Throughput (tokens/s)	Price ($/1M output tokens)	Architecture	Context Window
Mercury 2	—	841	—	Proprietary	—
LFM2.5-VL-1.6B	—	456	—	Dense	—
Gemini 2.5 Flash-Lite	0.35	213.5	$0.40	Proprietary	1M
Gemini 2.5 Flash	—	204.4	$2.50	Proprietary	1M
GPT-4o (Nov 2024)	—	171.8	$10.00	Proprietary	128K
Llama 4 Scout	—	107.3	$0.66	MoE (109B/17B active)	10M
DeepSeek V3	—	—	$0.89	MoE (671B/37B active)	—
GPT-4o mini	—	53.6	$0.60	Proprietary	128K
North Mini Code	0.39	—	—	—	—

Three things jump out from this data.

First, GPT-4o mini is shockingly slow. At 53.6 tokens/second, it ranks #50 out of 84 non-reasoning models despite being positioned as OpenAI's budget speed option. Gemini 2.5 Flash-Lite runs 4x faster at a comparable price point. If you chose GPT-4o mini for speed, you made the wrong call.

Second, MoE architecture is the real latency lever nobody talks about. Llama 4 Scout has 109 billion total parameters but only activates 17 billion per token, delivering 107.3 t/s throughput while supporting a 10 million token context window. DeepSeek V3 pushes this further: 671B total parameters, 37B active. You get large-model quality at small-model inference cost — no optimization tricks required, just architecture.

Third, based on the benchmark data I maintain at kunalganglani.com/llm-benchmarks, quantization quality cliffs are model-family-specific, which means a blanket recommendation to "just quantize to Q4" for latency gains is wrong. You need to test per-model-family to know where quality drops off.

Where Latency Actually Breaks User Experience (The 0.1s / 1s / 10s Framework)

Jakob Nielsen, co-founder of Nielsen Norman Group, set out three perceptual thresholds in 1993, building on earlier work by Miller (1968) and Card et al. (1991). They have held up for over 30 years:

0.1 seconds: The system feels instantaneous. The user perceives no delay whatsoever.
1.0 second: Flow of thought stays uninterrupted, but the user notices the delay. This is the critical threshold for conversational interfaces.
10 seconds: The absolute limit for keeping user attention on the dialogue. Beyond this, users abandon the task.

Here's how these thresholds map to LLM use cases:

Autocomplete / inline suggestions: You need sub-100ms TTFT. This is why local models on Apple Silicon or edge inference matter — network round-trips alone can blow this budget. See how local LLM setups compare in my hardware guides.

Chatbots and copilots: TTFT under 1 second with streaming. Users tolerate the response building token-by-token as long as it starts quickly. Gemini 2.5 Flash-Lite's 0.35s TTFT fits comfortably here. GPT-4o's latency depends on the provider endpoint.

RAG pipelines and search: Total response time under 3 seconds. Users coming from Google expect fast answers. Throughput matters more than TTFT here because you're typically waiting for the full response before displaying it.

Batch processing and background jobs: Latency barely matters. Optimize for cost per token and throughput. GPT-4o mini's 53.6 t/s is fine if you're processing overnight.

When I built the AI chatbot on Walmart product pages at Firework, handling millions of queries daily at sub-second response times, retrieval quality dominated answer quality far more than model choice. But the latency budget was non-negotiable: users on product pages bounce in seconds. We learned that event-streaming the context pipeline through Kafka mattered more for hitting latency targets than any model-side optimization.

The TTFT vs. Throughput Tradeoff: Which Matters More for Your Use Case

This is one of those things where the boring answer is actually the right one: it depends on whether you're streaming.

For streaming responses (chatbots, writing assistants, copilots), TTFT dominates. A 0.35s TTFT with 100 t/s throughput feels snappier than a 1.2s TTFT with 300 t/s throughput, even though the second model finishes a 500-token response about 2.5 seconds faster (roughly 2.9s versus 5.35s). Users perceive speed from when the first token appears, not when the last one lands.

For non-streaming responses (AI agents making tool calls, RAG pipelines assembling answers, function calling chains), throughput wins. The downstream consumer doesn't see partial results; it waits for the complete output. Here, Mercury 2's 841 t/s is 15.7x faster than GPT-4o mini's 53.6 t/s, and that gap translates directly into wait time.

For agentic pipelines with multiple sequential LLM calls, both matter — and they compound. More on that below.

Here's a framework:

If the user is watching the output stream in: optimize TTFT
If the system is waiting for a complete response: optimize throughput
If you're chaining 3+ LLM calls sequentially: optimize both, but TTFT first

6 Architectural Levers to Hit Sub-500ms TTFT in Production

These are the levers that actually move the needle on production LLM latency in 2026. I'm ordering them by implementation effort, lowest first.

Lever 1: Streaming — Reducing Perceived Latency Without Changing Model Speed

Streaming is the single highest-ROI latency optimization because it changes perceived latency without changing actual generation speed. As OpenAI's latency optimization guide recommends: use streaming for perceived latency improvement even when total generation time is unchanged.

With streaming enabled, a model with 0.8s TTFT starts showing output at 0.8s. Without streaming, the user waits for the entire response — potentially 3-5 seconds for a 500-token answer. Same model, same speed, radically different user experience.

Every major provider supports server-sent events (SSE) streaming. If you're building any user-facing LLM feature and you're not streaming, stop and fix that before reading the rest of this article. Seriously.
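To know whether streaming is paying off, you need to time the stream itself. The sketch below is provider-agnostic: it assumes you can expose the response as an iterable of text chunks (for example, decoded SSE events). `measure_stream` and `fake_stream` are hypothetical names, and the fake stream only stands in for a real API call. Note that it counts chunks, which are not necessarily single tokens.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(chunks: Iterable[str], clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a streamed response and report TTFT, total time, and chunk rate.

    The timer starts when this function is called, so call it right as
    the request is sent.
    """
    start = clock()
    first_at = None
    count = 0
    for _ in chunks:
        if first_at is None:
            first_at = clock()  # first chunk observed: this is the TTFT moment
        count += 1
    end = clock()
    if first_at is None:
        raise ValueError("stream produced no chunks")
    generation_s = end - first_at
    return {
        "ttft_s": first_at - start,
        "total_s": end - start,
        "chunks": count,
        # Rate after the first chunk; undefined for a single-chunk stream.
        "chunks_per_s": (count - 1) / generation_s if generation_s > 0 else None,
    }


def fake_stream(n: int, ttft_s: float, gap_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a provider stream, used only for the demo."""
    time.sleep(ttft_s)
    for i in range(n):
        if i:
            time.sleep(gap_s)
        yield f"tok{i} "


if __name__ == "__main__":
    print(measure_stream(fake_stream(20, ttft_s=0.05, gap_s=0.005)))
```

The `clock` parameter is injectable so the function can be unit-tested without real sleeps.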

Lever 2: Model Routing — Smaller Models for Low-Complexity Tasks

Not every query needs your most capable model. A smart router that sends simple questions to a fast, cheap model and complex queries to a frontier model can cut average latency by 40-60% while barely impacting quality.

The key insight from Joao Gante, Machine Learning Engineer at HuggingFace, applies here: the bottleneck in text generation is memory bandwidth, not compute FLOPs. Smaller models aren't just cheaper. They're fundamentally faster because every generated token requires reading fewer weights from memory.

Consider: Gemini 2.5 Flash-Lite at $0.10/1M input tokens and 213.5 t/s handles 80% of conversational queries as well as models costing 25x more. Route the remaining 20% — complex reasoning, multi-step analysis — to your frontier model.

This is where MoE models shine as a middle ground. Llama 4 Scout delivers quality from its 109B total parameters while only activating 17B per forward pass. It's model routing baked into the architecture itself. If you're comparing LangChain vs LlamaIndex for your orchestration layer, both support routing patterns.
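As a rough illustration of the routing idea, here is a minimal heuristic router. Everything in it is an assumption made for the sketch: the tier names are placeholders, and the keyword list and word-count cutoff are invented signals of complexity. A production router would more likely use a trained classifier or a cheap LLM call.

```python
from dataclasses import dataclass

# Hypothetical tier names; substitute whatever models you actually deploy.
FAST_MODEL = "fast-small-model"
FRONTIER_MODEL = "frontier-model"

# Assumed signals of a complex request (illustrative only).
COMPLEX_HINTS = ("step by step", "analyze", "compare", "prove", "debug", "plan")


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


def route_query(query: str, max_fast_words: int = 60) -> Route:
    """Send short, simple queries to the fast tier and the rest to the frontier tier."""
    text = query.lower()
    if any(hint in text for hint in COMPLEX_HINTS):
        return Route(FRONTIER_MODEL, "complexity keyword")
    if len(text.split()) > max_fast_words:
        return Route(FRONTIER_MODEL, "long query")
    return Route(FAST_MODEL, "short and simple")


if __name__ == "__main__":
    print(route_query("What are your store hours?"))
    print(route_query("Compare these two contracts and analyze the risk."))
```

Returning the reason alongside the model makes it easier to audit routing decisions later and to measure how often each tier is actually used.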

Lever 3: Prompt Caching — Eliminating Prefill Latency for Repeated Context

Prompt caching stores the computed key-value representations of repeated prompt prefixes — system prompts, document context, few-shot examples — so the model skips the prefill computation on subsequent requests.

All three major providers now offer this: OpenAI gives a 50% input price discount on cache hits, Anthropic offers up to 90% discount, and Google supports it across Gemini models. But the price discount is secondary. The latency reduction is the real win: cached prefixes skip prefill entirely, which for long system prompts (4K+ tokens) can cut TTFT by 50-80%.

If your application uses a consistent system prompt or includes the same document context across multiple queries — which describes most production AI chatbots and RAG systems — prompt caching is free latency. You're leaving performance on the table if you haven't enabled it.

This lever didn't exist when most competitor optimization guides were written, which is why it's under-discussed relative to its impact.
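Provider caches generally match on the prompt prefix, so the order in which you assemble the prompt decides how much of it can be reused. The sketch below shows one way to keep stable content first and to estimate how much of a request repeats the previous one. The function names are hypothetical, and it compares whole prompt parts by character count; real caches work on tokens and often enforce a minimum cacheable length.

```python
from typing import List, Sequence


def build_prompt(system_prompt: str, documents: Sequence[str], user_query: str) -> List[str]:
    """Order prompt parts so the stable content forms the prefix.

    Static parts (system prompt, shared documents) come first and the
    per-request query comes last, so consecutive requests share as long
    a prefix as possible.
    """
    return [system_prompt, *documents, user_query]


def shared_prefix_fraction(previous: Sequence[str], current: Sequence[str]) -> float:
    """Fraction of `current` (by characters) that repeats the start of `previous`."""
    shared = 0
    for old, new in zip(previous, current):
        if old != new:
            break  # the prefix ends at the first part that differs
        shared += len(new)
    total = sum(len(part) for part in current)
    return shared / total if total else 0.0


if __name__ == "__main__":
    system = "You are a helpful product assistant. " * 20
    docs = ["Product manual text. " * 50]
    first = build_prompt(system, docs, "Does it support USB-C?")
    second = build_prompt(system, docs, "What is the warranty period?")
    print(f"reusable prefix: {shared_prefix_fraction(first, second):.0%}")
```

If the per-request query were placed first instead, the shared prefix would drop to zero even though most of the prompt text is identical.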

Lever 4: Speculative Decoding — Parallel Token Generation

Speculative decoding is the most elegant latency trick in the LLM inference stack. Yaniv Leviathan, Research Scientist at Google, showed in the ICML 2023 Oral paper that a small draft model can predict several tokens ahead, and the large target model verifies them in a single parallel forward pass — achieving 2x-3x acceleration on T5-XXL with identical output distribution.

The key insight: hard language tasks contain easier subtasks. Most tokens in a response are predictable. The draft model handles the easy parts; the big model only corrects mistakes. No retraining needed, no architecture changes, no quality loss.

HuggingFace's implementation of this, called assisted generation, reduces latency up to 10x on commodity hardware according to Joao Gante. Yichao Fu at UC Berkeley / LMSYS took this further with lookahead decoding, which breaks the autoregressive dependency without needing a draft model at all — using Jacobi iteration to generate multiple n-grams in parallel.

Speculative decoding matters most for self-hosted deployments where you control the inference stack. If you're running models via vLLM or Ollama, this is accessible today.
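To make the draft-and-verify loop concrete, here is a toy sketch of the greedy variant. It is deliberately simplified: the paper's method uses a rejection-sampling scheme over probability distributions, while this version only compares greedy picks, and the verification loop merely simulates what a real server does in one batched forward pass. The "models" are hypothetical arithmetic functions over token ids, not neural networks.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token id


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one sequential target-model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft proposes up to k tokens, target verifies."""
    if k < 1:
        raise ValueError("k must be at least 1")
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The draft model guesses up to k tokens autoregressively (cheap).
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target checks every position. A real server scores all
        #    positions in one parallel pass; this loop only simulates that.
        for i, guess in enumerate(proposal):
            expected = target(seq + proposal[:i])
            if guess != expected:
                # 3. First mismatch: keep the accepted prefix plus the target's token.
                seq.extend(proposal[:i] + [expected])
                break
        else:
            seq.extend(proposal)  # every guess was accepted
    return seq


def toy_target(prefix: List[int]) -> int:
    """Hypothetical 'large model': a fixed arithmetic rule over the last token."""
    return (prefix[-1] * 3 + 1) % 7


def toy_draft(prefix: List[int]) -> int:
    """Hypothetical 'small model': agrees with the target except after token 5."""
    return 0 if prefix[-1] == 5 else toy_target(prefix)


if __name__ == "__main__":
    baseline = greedy_decode(toy_target, [1], 12)
    fast = speculative_decode(toy_target, toy_draft, [1], 12, k=4)
    print(fast == baseline, fast)
```

Because every appended token is either confirmed or supplied by the target, the output matches target-only greedy decoding. The speedup in a real system comes from replacing many sequential target calls with fewer batched verification passes.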

Lever 5: KV Cache Optimization and PagedAttention

The KV cache stores the key-value pairs from attention computation for previously processed tokens. For long sequences, this cache becomes enormous — Woosuk Kwon, PhD Researcher at UC Berkeley and co-creator of vLLM, showed that a single LLaMA-13B sequence can consume up to 1.7GB of GPU memory just for KV cache.
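The 1.7GB figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes standard multi-head attention (one key and one value vector of full hidden width per token per layer), fp16 storage, and the published LLaMA-13B shape of 40 layers with hidden size 5120 at a 2048-token context. Models that use grouped-query attention or a quantized cache need a different formula.

```python
def kv_cache_bytes(num_layers: int, hidden_size: int, seq_len: int,
                   bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence under standard multi-head attention.

    Each token stores one key vector and one value vector (hence the 2)
    of width `hidden_size` in every layer.
    """
    return 2 * num_layers * hidden_size * bytes_per_value * seq_len


if __name__ == "__main__":
    per_token = kv_cache_bytes(num_layers=40, hidden_size=5120, seq_len=1)
    full = kv_cache_bytes(num_layers=40, hidden_size=5120, seq_len=2048)
    print(f"per token: {per_token / 1e3:.0f} KB")
    print(f"2048-token sequence: {full / 1e9:.2f} GB")
```

At roughly 0.8MB per token, a handful of long concurrent sequences can consume more GPU memory than you might expect, which is why cache management has such a large effect on how many requests a server can batch.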

vLLM's PagedAttention algorithm manages KV cache like virtual memory pages, eliminating fragmentation. The result: up to 24x higher throughput than HuggingFace Transformers and 3.5x higher than Text Generation Inference (TGI). If you're self-hosting models and serving more than a handful of concurrent users, vLLM or a similar PagedAttention-based server isn't optional.

Complement this with FlashAttention by Tri Dao at Stanford (now Princeton). FlashAttention reduces GPU HBM reads/writes via tiling, achieving 3x speedup on GPT-2 and 2.4x on long-range tasks. Since attention complexity is quadratic in sequence length, FlashAttention directly reduces prefill latency for long-context prompts — critical if you're pushing large documents through a retrieval-augmented generation pipeline.

When you're evaluating LLM quantization formats like GGUF, GPTQ, or EXL2, remember that quantized weights also free up GPU memory for the KV cache, indirectly improving throughput by allowing more concurrent sequences.

Lever 6: Deployment Geography — Matching API Region to User Location

This is the optimization that nobody writes about because it's boring. But network round-trip time adds 50-200ms per API call depending on geography. If your users are in Tokyo and your API endpoint is in Virginia, you're burning 150ms of latency before the model even starts processing.

For managed API providers, check which regions they serve from. For self-hosted deployments, deploy your inference server in the same region as your application server. For global products, consider multi-region deployments or edge routing.

This matters especially for agentic AI pipelines where you're making 3-5 sequential API calls. A 150ms geographic penalty compounds to 450-750ms of pure network waste across the chain.

Latency Budgeting for Agentic Pipelines: When Each Step Compounds

Here's the thing nobody's saying about agent orchestration: single-call benchmarks hide the real latency problem.

A 5-step agentic pipeline where each step makes one LLM call with 800ms TTFT burns 4 seconds minimum before the user sees any useful output. That's before adding network overhead, tool execution time, or retrieval latency. With a vector database lookup at each step, you're easily at 6-8 seconds total.

Here's how to build your latency budget for a multi-step agent:

Map each step's LLM call — identify which steps are sequential vs. parallelizable
Assign TTFT budget per step — for a 5-step agent with a 3-second total target, that's 600ms per sequential step maximum
Account for non-LLM overhead — tool calls, database queries, API calls to external services typically add 100-300ms each
Identify parallelizable branches — if steps 2 and 3 are independent, run them concurrently
Set a circuit breaker — if any single step exceeds 2x its budget, fail fast and return a degraded response rather than making the user wait 15 seconds
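The budgeting steps above can be sketched with `asyncio`. This is a minimal illustration, not a full agent framework: the helper names are hypothetical, the budget is split evenly across sequential steps, and the circuit breaker is simply a timeout at 2x the per-step budget that returns a fallback value.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

Step = Callable[[], Awaitable[str]]


def per_step_budget_s(total_budget_s: float, sequential_steps: int) -> float:
    """Split a total latency target evenly across sequential steps."""
    if sequential_steps < 1:
        raise ValueError("need at least one step")
    return total_budget_s / sequential_steps


async def run_step(step: Step, budget_s: float, fallback: str,
                   breaker_factor: float = 2.0) -> str:
    """Run one step, failing fast if it exceeds breaker_factor times its budget."""
    try:
        return await asyncio.wait_for(step(), timeout=budget_s * breaker_factor)
    except asyncio.TimeoutError:
        return fallback  # degraded response instead of an open-ended wait


async def run_parallel(steps: Sequence[Step], budget_s: float, fallback: str) -> List[str]:
    """Independent steps run concurrently, so together they use one budget slot."""
    results = await asyncio.gather(*(run_step(s, budget_s, fallback) for s in steps))
    return list(results)


async def _demo() -> None:
    async def quick() -> str:   # hypothetical well-behaved step
        await asyncio.sleep(0.01)
        return "ok"

    async def stuck() -> str:   # hypothetical step that blows its budget
        await asyncio.sleep(1.0)
        return "late"

    budget = per_step_budget_s(total_budget_s=0.25, sequential_steps=5)
    print(await run_step(quick, budget, fallback="degraded"))
    print(await run_parallel([quick, stuck], budget, fallback="degraded"))


if __name__ == "__main__":
    asyncio.run(_demo())
```

An even split is only a starting point. In practice you would give reasoning steps a larger share than routing or planning steps and subtract the non-LLM overhead before dividing.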

When I worked on the Walmart conversational commerce chatbot, we discovered that throughput problems were queue-shape problems, not compute problems. The same applies to agentic pipelines: your bottleneck is usually one slow step in the chain, not the aggregate compute.

If you're building AI agents with Python, bake latency observability in from day one. You need per-step timing, not just end-to-end.
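Per-step timing does not need a heavy framework to get started. The sketch below uses a context manager to record the wall-clock duration of each named step; `StepTimings` is a hypothetical helper, and in production you would likely forward these numbers to whatever tracing or metrics system you already run.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StepTimings:
    """Collects wall-clock duration per named pipeline step."""

    def __init__(self) -> None:
        self.durations_s: Dict[str, float] = {}

    @contextmanager
    def step(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record even if the step raised, so failures stay visible.
            self.durations_s[name] = time.perf_counter() - start

    def slowest(self) -> str:
        """Name of the step that took longest: the first place to look."""
        return max(self.durations_s, key=lambda name: self.durations_s[name])

    def total_s(self) -> float:
        return sum(self.durations_s.values())


if __name__ == "__main__":
    timings = StepTimings()
    with timings.step("plan"):
        time.sleep(0.01)   # stand-in for an LLM planning call
    with timings.step("retrieve"):
        time.sleep(0.03)   # stand-in for a vector database lookup
    print(timings.durations_s, "slowest:", timings.slowest())
```

Seeing the slowest step by name is what turns "the agent feels slow" into a specific fix.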

Self-Hosted vs. Managed API: Latency Trade-offs at Scale

This decision breaks down along three axes: control, consistency, and cost.

Managed APIs (OpenAI, Anthropic, Google) give you zero control over inference infrastructure but minimal operational burden. Latency varies by time of day, load, and provider capacity. You're subject to queue times during peak hours. The upside: no GPU procurement, no model serving headaches, and prompt caching is handled for you.

Self-hosted inference (vLLM, TGI, Ollama) gives you full control over latency characteristics. You choose the hardware, the batch size, the KV cache policy. With vLLM's PagedAttention, you can serve many concurrent users efficiently. The downside: you're now in the infrastructure business, managing GPU hardware, monitoring utilization, handling failover.

The decision framework:

< 100 requests/minute with variable load: Use managed APIs. The operational cost of self-hosting doesn't justify the latency control.
100-10,000 requests/minute with predictable load: Evaluate both. Self-hosting on NVIDIA GPUs or Apple Silicon can deliver better P99 latency if you have the ops capacity.
> 10,000 requests/minute: Self-host your hot path models, use managed APIs as fallback. At this scale, the cost savings and latency consistency of self-hosting pay for the operational overhead many times over.

For local LLM development and testing, tools like Ollama and LM Studio give you fast iteration without API costs. But don't confuse development convenience with production readiness.

Building Your Latency Budget: A Framework by Use Case

Stop optimizing latency in the abstract. Start with your use case, work backward to a budget, then pick the lever that closes the gap.

Conversational chatbot (customer-facing):

TTFT target: < 500ms
Throughput target: > 100 t/s
Recommended model tier: Gemini 2.5 Flash-Lite (0.35s TTFT, 213.5 t/s) or equivalent
Primary levers: Streaming + prompt caching + geographic routing
Monitoring: P50 and P95 TTFT, user abandonment rate correlated with latency percentile

RAG pipeline (internal tool):

TTFT target: < 1.5s (users tolerate more from internal tools)
Throughput target: > 150 t/s (you need complete answers fast)
Recommended model tier: GPT-4o or Claude Sonnet for quality, with model routing for simple queries
Primary levers: Prompt caching (system prompt + document context) + model routing
Monitoring: End-to-end response time, retrieval latency as separate metric

Agentic pipeline (multi-step, user-initiated):

Total budget: < 5s for the full chain
Per-step TTFT target: < 600ms (across 5 sequential steps that is 3s of TTFT, leaving about 2s for generation, tool calls, and network)
Recommended approach: Fast models for routing/planning steps, frontier models only for reasoning steps
Primary levers: Parallelization + model routing + circuit breakers
Monitoring: Per-step timing breakdown, step failure rates, timeout frequency

Batch processing (background):

Latency target: None meaningful
Optimize for: Cost per million tokens and throughput
Recommended: GPT-4o mini ($0.60/1M output) or Gemma 3n E4B ($0.02/1M) depending on quality needs
Primary levers: Batch APIs (OpenAI offers 50% discount), off-peak scheduling

For latency monitoring in production, track these metrics at minimum: P50/P95/P99 TTFT, output throughput (tokens/second), end-to-end response time, and error rate by provider. If you're running AI in production without per-request latency telemetry, you're flying blind. Correlate latency spikes with user engagement metrics — you'll almost certainly find that sessions with P95+ latency have measurably higher abandonment.
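If you are not yet computing percentiles, a nearest-rank implementation is enough to start. The sketch below is a simple in-memory version with hypothetical function names; a metrics backend will usually do this for you over sliding windows, possibly with a different interpolation method.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def ttft_summary(ttft_samples_s: Sequence[float]) -> Dict[str, float]:
    """P50/P95/P99 of measured TTFT values, in seconds."""
    return {f"p{p}": percentile(ttft_samples_s, p) for p in (50, 95, 99)}


if __name__ == "__main__":
    # Hypothetical measurements: mostly fast, with a slow tail.
    samples = [0.35] * 90 + [0.9] * 8 + [2.4] * 2
    print(ttft_summary(samples))
```

The gap between P50 and P99 is often more informative than either number alone, because the tail is where abandonment tends to concentrate.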

How Does Context Window Length Affect TTFT and Prefill Time?

Longer prompts mean longer prefill times. This is physics, not a provider limitation.

Prefill computation scales with input sequence length: the model needs to process every input token before generating the first output token. Standard attention has quadratic time and memory complexity in sequence length, which is the cost that motivates Tri Dao's FlashAttention work, so for long prompts, doubling the prompt length can more than double the attention part of prefill.

Practically, this means:

A 500-token prompt prefills in the noise (< 100ms on most providers)
A 5,000-token prompt (typical RAG context) adds 200-500ms to TTFT
A 50,000-token prompt (long document analysis) can add 2-5 seconds to TTFT
A 1M-token prompt (Gemini's full context) can add 10+ seconds to TTFT

This is exactly why prompt caching matters so much. If 4,000 tokens of your 5,000-token prompt are the same system prompt and document context on every request, caching those 4,000 tokens eliminates roughly 80% of your prefill computation, provided they sit at the start of the prompt where the cache can match them.

It's also why the choice of vector database and retrieval strategy matters for latency. Retrieving 20 relevant chunks instead of 5 doesn't just cost more tokens — it directly increases TTFT through longer prefill. When working on the Walmart chatbot RAG pipeline, I found that retrieval quality — returning fewer, better chunks — reduced both LLM cost and latency simultaneously. GraphRAG paid off specifically for relationship queries like product compatibility, where fewer but more targeted chunks outperformed stuffing the context window.

What the 2026 Latency Landscape Means for What You Build Next

The latency floor has dropped dramatically. Gemini 2.5 Flash-Lite at 0.35s TTFT and $0.10/1M input tokens makes sub-500ms chatbots trivially achievable without self-hosting. Mercury 2 at 841 t/s makes real-time agentic pipelines feasible where they weren't 12 months ago.

But the bigger shift is architectural. MoE models have permanently broken the "bigger model = slower inference" assumption. Prompt caching has made repeated context nearly free. Speculative decoding has moved from research paper to production tooling.

The teams that win in 2026 won't be the ones using the fastest model. They'll be the ones who understand their latency budget, measure per-step timing in their agent frameworks, and apply the right lever at the right layer of the stack.

If you take one thing from this post: measure your TTFT in production today. Not the number from the provider's marketing page — your actual P95 TTFT, from your users' geography, with your prompt lengths, at your peak traffic hours. That number is the starting line for every optimization decision you'll make this year.
