Yash Kumar Panjwani

Posted on May 28

The Reason Your AI Chatbot Feels Fast Has Nothing to Do With a Better Model

#ai #llm #optimization #genai

You have probably noticed that ChatGPT or Claude streams words to your screen almost instantly. But behind the scenes, generating each word requires a massive model to perform billions of computations. So how do these systems feel so fast?

One of the key answers is a technique called speculative decoding — an inference optimization that makes large language models generate text significantly faster without changing a single word of their output.

First: Why Is Text Generation Slow?

To understand speculative decoding, you need to understand one fundamental constraint of large language models.

They generate text one token at a time.

Not one word, one sentence, or one paragraph — one token. A token is roughly a word or a piece of a word. And for every single token, the model performs a complete forward pass through billions of parameters — embeddings, self-attention layers, and feed-forward networks.

Consider this sentence:

“The cat sat on the ______”

To predict that final word, the model processes every previous token, computes attention scores across the entire sequence, and runs the result through every layer of the network. Then it produces the next token. Then it does the entire thing again for the token after that.
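To make the loop concrete, here is a minimal sketch of one-token-at-a-time generation. The lookup table `TOY_NEXT` and the function `forward_pass` are hypothetical stand-ins for a real model, which would run every layer of the network at that step.

```python
# Toy stand-in for a language model: maps the last token to the next one.
# TOY_NEXT and forward_pass are hypothetical, for illustration only.
TOY_NEXT = {"The": "cat", "cat": "sat", "sat": "on", "on": "the", "the": "mat"}

def forward_pass(tokens):
    """Pretend forward pass: returns the next token for the sequence so far."""
    return TOY_NEXT.get(tokens[-1], "<eos>")

def generate(prompt, max_new_tokens):
    tokens = list(prompt)
    passes = 0
    for _ in range(max_new_tokens):
        next_token = forward_pass(tokens)  # one full pass per new token
        passes += 1
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens, passes

tokens, passes = generate(["The"], max_new_tokens=5)
print(tokens, passes)
```

The point to notice is that `passes` grows in lockstep with the number of new tokens: five tokens, five forward passes.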

For a 70 billion parameter model, a single forward pass can take around 50 milliseconds. Generating 200 tokens sequentially at that rate takes about 10 seconds (200 × 50 ms). That is a long time to wait for a response.
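The arithmetic behind that estimate, as a quick sketch. The 50 ms figure is the illustrative number from the text, not a measurement, and `sequential_latency_seconds` is a name made up for this example.

```python
def sequential_latency_seconds(num_tokens, ms_per_pass):
    """Total time when every token needs its own forward pass."""
    return num_tokens * ms_per_pass / 1000

print(sequential_latency_seconds(200, 50))  # 10.0
```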

This is the problem speculative decoding was designed to solve.

The Core Insight

Here is the observation that makes speculative decoding possible.

A large model spends the same amount of time generating a boring, predictable token as it does generating a surprising, complex one.

Writing “the” after “I went to” takes exactly as long as generating a rare technical term in a complex sentence. That feels wasteful. Common, predictable tokens — which make up the majority of natural language — should somehow be cheaper.

Speculative decoding exploits this by splitting the generation process between two models.

Two Models, Two Roles

The draft model is small, cheap, and fast — perhaps 7 billion parameters. It can generate tokens roughly 10 times faster than the large model.

The target model is large, powerful, and slow, perhaps 70 billion parameters. This is the model whose quality you care about. In speculative decoding, it is no longer responsible for generating tokens one by one. Its job is to verify the draft, and to supply a replacement token wherever the draft goes wrong.

Phase 1 — Draft model

Given the sentence:

“The capital of France is _____ _____ _____ _____ _____”

Instead of asking the large model to generate one token at a time, the small draft model quickly predicts several tokens ahead:

Token 1 → “Paris”
Token 2 → “,”
Token 3 → “which”
Token 4 → “is”
Token 5 → “located”

This took a fraction of the time. But the small model is less capable — it may have made mistakes. So the large model now steps in, but not to generate. Only to verify.
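The drafting phase can be sketched in a few lines. The table `DRAFT_NEXT` is a hypothetical toy "draft model" keyed on the last two tokens, hard-coded to reproduce the example above. A real draft model would be a small neural network, but the loop around it would look much the same.

```python
# Hypothetical toy draft model: the last two tokens decide the next one.
DRAFT_NEXT = {
    ("France", "is"): "Paris",
    ("is", "Paris"): ",",
    ("Paris", ","): "which",
    (",", "which"): "is",
    ("which", "is"): "located",
}

def draft_next(tokens):
    return DRAFT_NEXT.get(tuple(tokens[-2:]), "<eos>")

def draft_tokens(context, k):
    """Propose k tokens, one cheap draft-model step at a time."""
    tokens = list(context)
    proposed = []
    for _ in range(k):
        nxt = draft_next(tokens)
        proposed.append(nxt)
        tokens.append(nxt)  # the draft model builds on its own guesses
    return proposed

context = ["The", "capital", "of", "France", "is"]
draft = draft_tokens(context, k=5)
print(draft)
```

Note that drafting is still sequential. It is only worthwhile because each draft step is far cheaper than a step of the large model.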

Phase 2 — Target model verifies

This is where the magic happens.

The large model receives the entire sequence — the original context plus all 5 draft tokens — and processes them simultaneously in one forward pass.

Because all 5 tokens already exist as input, the model does not need to generate them one by one. It simply checks whether it agrees with each one by computing what it would have predicted at each position.

“is” (end of prompt) → target model would say “Paris” → draft said “Paris” → Accept
“Paris” → target model would say “,” → draft said “,” → Accept
“,” → target model would say “which” → draft said “which” → Accept
“which” → target model would say “is” → draft said “is” → Accept
“is” → target model would say “a” → draft said “located” → Reject

The large model accepts the first 4 tokens and rejects the 5th. It then supplies its own token, “a”, in place of “located”.

Net result: 5 tokens (4 accepted drafts plus the target's own “a”) from a single large-model forward pass instead of 1. On this stretch that is up to 5x fewer large-model passes, before counting the smaller cost of the draft model's 5 steps.
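Here is a sketch of that verification step in its simplest, greedy form. `TARGET_NEXT` is a hypothetical toy "target model" keyed on the last two tokens, and `target_predictions` stands in for the single parallel forward pass. On real hardware that pass would be one batched computation rather than a Python loop.

```python
# Hypothetical toy target model: the last two tokens decide the next one.
TARGET_NEXT = {
    ("France", "is"): "Paris",
    ("is", "Paris"): ",",
    ("Paris", ","): "which",
    (",", "which"): "is",
    ("which", "is"): "a",
}

def target_predictions(context, draft):
    """Stand-in for ONE target pass over context + draft.
    Returns the target's next token after each prefix: len(draft) + 1 items."""
    seq = list(context) + list(draft)
    start = len(context)
    return [TARGET_NEXT.get(tuple(seq[i - 2:i]), "<eos>")
            for i in range(start, len(seq) + 1)]

def verify_greedy(context, draft):
    """Keep the draft prefix the target agrees with, then add the target's token."""
    preds = target_predictions(context, draft)
    accepted = []
    for proposed, wanted in zip(draft, preds):
        if proposed != wanted:
            return accepted + [wanted]  # first mismatch: use the target's token
        accepted.append(proposed)
    return accepted + [preds[len(draft)]]  # everything matched: one bonus token

context = ["The", "capital", "of", "France", "is"]
draft = ["Paris", ",", "which", "is", "located"]
new_tokens = verify_greedy(context, draft)
print(new_tokens)
```

The returned list holds the accepted draft tokens followed by the target's own token at the first disagreement, all obtained from one simulated target pass.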

What Happens at Rejection

When the large model rejects a draft token, everything after that point is thrown away immediately. In this example “located” happened to be the last draft token, but had the rejection come earlier, every draft token after it would be discarded too, because those tokens were built on a wrong foundation: a continuation the large model never agreed with.

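The large model then supplies the replacement token itself. With greedy decoding, that is simply its top choice at the rejection point. With sampling, the replacement is drawn from an adjusted distribution: the large model's probabilities minus the draft model's, clipped at zero and renormalized. Either way, the probabilities involved were already computed during the verification pass, so no extra large-model pass is needed.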
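When the models sample instead of picking the top token, the speculative decoding papers describe a probabilistic version of this step: accept a drafted token with probability min(1, p/q), and on rejection sample from the normalized positive part of p minus q. Here is a minimal sketch of that rule. The dictionaries `p` (target) and `q` (draft) hold made-up probabilities for a three-token vocabulary, and the function names are my own.

```python
import random

def residual_distribution(p, q):
    """Normalized max(0, p - q): what the target samples from after a rejection."""
    raw = {tok: max(0.0, p[tok] - q.get(tok, 0.0)) for tok in p}
    total = sum(raw.values())
    return {tok: value / total for tok, value in raw.items()}

def accept_or_resample(token, p, q, rng):
    """One verification step for a drafted token. Returns (token, was_accepted)."""
    if rng.random() < min(1.0, p.get(token, 0.0) / q[token]):
        return token, True
    residual = residual_distribution(p, q)
    candidates = list(residual)
    weights = [residual[tok] for tok in candidates]
    return rng.choices(candidates, weights=weights)[0], False

# Made-up probabilities for the position after "... which is".
p = {"a": 0.6, "located": 0.1, "the": 0.3}   # target model
q = {"a": 0.2, "located": 0.7, "the": 0.1}   # draft model

rng = random.Random(0)
print(residual_distribution(p, q))
print(accept_or_resample("located", p, q, rng))
```

In this toy setup “located” is accepted only about one time in seven (0.1 / 0.7), and a rejection can never produce “located” again because its residual probability is zero.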

The sequence becomes:

“The capital of France is Paris, which is a”

And the next round of speculative decoding begins from here. The small model drafts 5 more tokens. The large model verifies them all in one pass. The cycle continues until the full response is generated.
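Putting the two phases into a loop gives the full cycle. The sketch below uses integer tokens and two hypothetical toy rules, `target_next` and `draft_next`, so that it stays runnable without any model. The draft rule is deliberately wrong at every fourth position to force some rejections. This is the greedy variant only.

```python
def target_next(tokens):
    # Hypothetical "large" model: a deterministic toy rule over integer tokens.
    return (sum(tokens) + 1) % 10

def draft_next(tokens):
    # Hypothetical "small" model: agrees with the target except at every 4th position.
    guess = target_next(tokens)
    return (guess + 1) % 10 if len(tokens) % 4 == 0 else guess

def greedy_generate(context, n):
    """Baseline: the target model alone, one token per pass."""
    tokens = list(context)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens

def speculative_generate(context, n, k=5):
    tokens, target_passes = list(context), 0
    goal = len(context) + n
    while len(tokens) < goal:
        # Phase 1: draft k tokens cheaply.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # Phase 2: one (simulated) target pass scores every position at once.
        target_passes += 1
        preds = [target_next(tokens + draft[:i]) for i in range(k + 1)]
        n_ok = 0
        while n_ok < k and draft[n_ok] == preds[n_ok]:
            n_ok += 1
        # Keep the agreed prefix plus the target's own token at the next position.
        tokens += draft[:n_ok] + [preds[n_ok]]
    return tokens[:goal], target_passes

spec_tokens, spec_passes = speculative_generate([1, 2], n=20)
print(spec_tokens == greedy_generate([1, 2], n=20), spec_passes)
```

The comparison at the end checks the two things that matter: the speculative output matches the target-only output, and it needs far fewer target passes than the 20 the baseline uses.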

The Guarantee — Zero Quality Loss

Here is the part that sounds too good to be true but is mathematically proven.

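With greedy decoding, the accepted tokens (“Paris”, “,”, “which”, “is”) are identical to what the large model would have generated on its own, token by token. The draft model had zero influence on them. The large model happened to agree. With sampling, a draft token is accepted with probability min(1, p/q), where p and q are the probabilities the target and draft models assign to that token, and it is this rule that keeps the output distribution unchanged.

When a token is rejected, the large model substitutes a token chosen from its own probabilities, so the draft model's mistake never reaches the output.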

The final output distribution is provably identical to running the large model alone. This is not an approximation. There is no quality tradeoff.
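You can check this claim numerically for a single position. Under the sampling rule from the speculative decoding papers (accept a drafted token with probability min(1, p/q), otherwise sample from the normalized positive part of p minus q), the chance of emitting a token is min(p, q) plus max(0, p - q), which adds up to exactly p. The sketch below computes that distribution exactly for made-up probabilities `p` (target) and `q` (draft). The function name is my own.

```python
def speculative_output_distribution(p, q):
    """Exact distribution of the token emitted by one draft-then-verify step."""
    vocab = list(p)
    # Probability the draft proposes t AND the target accepts it.
    accept = {t: q[t] * min(1.0, p[t] / q[t]) if q[t] > 0 else 0.0 for t in vocab}
    reject_mass = 1.0 - sum(accept.values())
    # Residual distribution used after a rejection: normalized max(0, p - q).
    raw = {t: max(0.0, p[t] - q[t]) for t in vocab}
    total = sum(raw.values())
    return {
        t: accept[t] + (reject_mass * raw[t] / total if total > 0 else 0.0)
        for t in vocab
    }

# Made-up probabilities over a three-token vocabulary.
p = {"a": 0.6, "located": 0.1, "the": 0.3}   # target model
q = {"a": 0.2, "located": 0.7, "the": 0.1}   # draft model

out = speculative_output_distribution(p, q)
print(out)
```

Up to floating-point rounding, `out` matches `p` even though the draft distribution `q` is very different, which is the guarantee in miniature.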

This distinguishes speculative decoding from techniques like quantization or distillation, which genuinely change the model’s outputs. Speculative decoding is a pure inference optimization — the model gets faster, the output stays exactly the same.

Real World Speedups

The original Google Research paper showed 2x–3x acceleration on T5-XXL (11B parameters) compared with a standard implementation, with identical outputs.

DeepMind’s concurrent paper reported 2–2.5x speedup on the Chinchilla model at 70 billion parameters.

EAGLE (a later method) achieved 2.7–3.5x on LLaMA 2 Chat 70B, with later versions reaching 3–6.5x. Medusa, another variant, measured 2.2–3.6x speedup.

When Does It Not Help?

Speculative decoding is not universally beneficial.

If the acceptance rate collapses and the small model keeps getting rejected, each round still costs several draft steps plus one large-model pass but yields only about one token, making things slower than no speculation at all.
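A simplified cost model makes the break-even point visible. It follows the style of analysis in the original speculative decoding paper, under the assumption that each draft token is accepted independently with probability `alpha`. Here `k` is the number of drafted tokens per round and `cost_ratio` is the cost of one draft step relative to one target pass. The function names and the example values are assumptions made for this sketch.

```python
def expected_tokens_per_round(alpha, k):
    """Expected tokens produced per target pass, given acceptance rate alpha."""
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def expected_speedup(alpha, k, cost_ratio):
    """Tokens per round divided by the round's cost in target-pass units."""
    return expected_tokens_per_round(alpha, k) / (k * cost_ratio + 1)

# A draft model 10x cheaper than the target, drafting 5 tokens per round.
print(expected_speedup(alpha=0.8, k=5, cost_ratio=0.1))  # roughly 2.46
print(expected_speedup(alpha=0.2, k=5, cost_ratio=0.1))  # roughly 0.83
```

A result below 1 means speculation is a net loss. With a 20% acceptance rate in this toy setup, the drafting overhead outweighs the tokens it saves.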

It also optimizes latency (how fast one user gets their response) more than throughput (how many users can be served simultaneously). Running two models requires more GPU memory, and verifying drafts that end up rejected spends compute that could have served other requests. Both effects can reduce the total number of requests a system can handle at once.
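For a rough sense of the memory cost, here is the weight footprint of the two example models, assuming 16-bit weights (2 bytes per parameter). That precision is an assumption for this sketch, and the estimate ignores the KV cache and activations, which add more on top.

```python
def weight_memory_gb(params_billions, bytes_per_param=2):
    """Approximate weight size in GB: billions of params times bytes per param."""
    return params_billions * bytes_per_param

target_gb = weight_memory_gb(70)  # the 70B target model
draft_gb = weight_memory_gb(7)    # the 7B draft model
print(target_gb, draft_gb, target_gb + draft_gb)
```

Under these assumptions the draft model adds about 10% to the weight memory, space that could otherwise hold the cached state of additional concurrent requests.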

Why This Matters

At around 50 milliseconds per token, generating a 200-token response from a 70B model takes roughly 10 seconds. A 2x to 3x speedup, in line with the published results above, brings that to roughly 3 to 5 seconds, with identical output quality.

That is the difference between a user experience that feels sluggish and one that feels instant. And it was achieved not by building a better model, but by being smarter about how the existing model is used.

How It Fits the Bigger Picture

Speculative decoding is one layer of a broader inference optimization stack. Production systems combine it with KV caching, which avoids recomputing attention keys and values for past tokens, and prompt caching, which reuses computation across API calls. Together these techniques take a system that would feel unbearably slow and make it feel instant. Speculative decoding specifically attacks generation latency, which is why Google has described using it in its products and why other major LLM providers are widely believed to do the same.

Conclusion

Speculative decoding is one of the most elegant inference optimizations introduced for large language models. Instead of forcing a massive model to generate every token sequentially, the technique allows a smaller and faster draft model to predict multiple future tokens in advance while the larger model verifies them efficiently in parallel.

The key insight is that most tokens in natural language are highly predictable. By allowing a lightweight model to handle these easy predictions, speculative decoding reduces latency dramatically without changing the final output distribution of the larger model.

This makes modern AI systems feel significantly faster while preserving the same response quality. Techniques like speculative decoding show that improving AI systems is not only about building larger models, but also about designing smarter inference algorithms that use existing models more efficiently.

As large language models continue to scale, inference optimization techniques such as speculative decoding will become increasingly important for making AI systems faster, cheaper, and more practical in real-world applications.

Top comments (1)

Harjot Singh • May 31

TTFT-over-tokens-per-second is the right call and it's the most common perf misconception in chat UX. The reason is psychological: a user can't perceive 80 vs 120 tokens/sec once text is flowing, but they absolutely feel the 2-second silence before the first token, that gap is where it feels broken. So the metric that matters is the one tied to perceived responsiveness, not raw throughput, and optimizing the wrong one (chasing a faster model for higher tok/s) can leave the experience feeling slower if TTFT got worse. Streaming is the other half because it changes the perception of the same total latency, the answer takes just as long but feels instant because something happened immediately. It's the same lesson as voice agents measuring barge-in instead of end-to-end latency: measure what the human actually experiences, not what's easy to put on a dashboard. A model that's slightly dumber but starts streaming in 200ms will feel faster and often test better than a smarter one that stalls. That optimize-the-felt-metric instinct is core to how I think about UX in Moonshift. Beyond streaming and TTFT, did anything else move perceived speed for you, like optimistic UI or chunking the first sentence?