Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star us to help devs discover the project. Give it a try and share your feedback so we can improve the product.
Large Language Models keep getting smarter.
But there's a problem: users don't experience intelligence directly. They experience latency.
If a model takes 30 seconds to write an answer instead of 3 seconds, most users won't care that it scored higher on some benchmark.
This creates an interesting engineering challenge:
How do we make LLMs generate text faster without making them worse?
One of the most important techniques to emerge in recent years is speculative decoding.
The idea sounds almost absurd at first:
What if a small model could guess what a large model is about to say, and the large model simply verifies those guesses?
Surprisingly, that's exactly what happens.
Let's see how it works.
The Fundamental Bottleneck of LLM Inference
To understand speculative decoding, we first need to understand why LLMs are slow.
Imagine a model is generating:
The capital of France is Paris.
The model doesn't generate the entire sentence at once.
Instead it generates one token at a time:
The
The capital
The capital of
The capital of France
...
Each new token requires another forward pass through the model.
For a large model with hundreds of billions of parameters, every token is expensive.
This means generation is inherently sequential:
Token 1 → Token 2 → Token 3 → Token 4
You can't generate token 4 until you know token 3.
This sequential nature becomes one of the biggest sources of inference latency.
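To make the bottleneck concrete, here is a minimal sketch of that loop. It assumes a hypothetical `toy_next_token` function standing in for one forward pass of a real model (a lookup table here, not a neural network), so only the shape of the loop is meaningful:

```python
# Toy stand-in for a model: maps the last token to the next one.
# A real LLM conditions on the whole prefix; this table is an illustrative assumption.
NEXT = {"The": "capital", "capital": "of", "of": "France",
        "France": "is", "is": "Paris", "Paris": "."}

def toy_next_token(tokens):
    """Pretend forward pass: returns the next token for the given prefix."""
    return NEXT[tokens[-1]]

def generate(prompt, max_new_tokens):
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(max_new_tokens):
        # Each new token needs its own pass and depends on the previous token.
        tokens.append(toy_next_token(tokens))
        forward_passes += 1
        if tokens[-1] == ".":
            break
    return tokens, forward_passes

tokens, passes = generate(["The"], max_new_tokens=10)
print(" ".join(tokens), "| forward passes:", passes)
```

The number of forward passes equals the number of generated tokens. That one-to-one relationship is what speculative decoding tries to break.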
The Intuition: Let a Smaller Model Predict Ahead
Suppose you have:
- A large model (expensive)
- A small model (cheap)
The small model is usually less accurate.
But it's often correct about obvious next tokens.
For example:
Prompt:
The capital of France is
Small model prediction:
Paris
Large model prediction:
Paris
Both agree.
Now imagine the small model predicts several tokens:
Paris, which is
Instead of asking the large model for each token individually, we ask it to verify the entire sequence in one pass.
If the predictions are correct, we've effectively skipped multiple expensive decoding steps.
This is the core idea behind speculative decoding.
A Simple Example
Let's say the prompt is "The weather today is" and our draft model completes it as:
The weather today is sunny and warm.
The drafted continuation, tokenized:
sunny
and
warm
.
The large model then evaluates these proposed tokens.
Possible outcome:
| Token | Draft Model | Large Model |
|---|---|---|
| sunny | ✓ | ✓ |
| and | ✓ | ✓ |
| warm | ✓ | ✓ |
| . | ✓ | ✓ |
Everything matches.
The large model accepts all four tokens.
Instead of generating four separate tokens sequentially, we've effectively generated four tokens in a single verification step.
That's a major latency reduction.
What Happens When They Disagree?
This is where the algorithm becomes interesting.
Suppose the draft model predicts:
The weather today is rainy and cold.
The large model evaluates the proposal.
rainy ✓
and ✓
cold ✗
The large model agrees until "cold".
At that point:
- Accepted tokens are kept.
- Incorrect tokens are discarded.
- Generation resumes from the first disagreement, where the large model's own token replaces the rejected one.
Result:
The weather today is rainy and pleasant.
Only part of the speculation was useful.
But even partial acceptance can significantly improve throughput.
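The accept-then-correct step can be sketched in a few lines. This is a simplified greedy version, and it assumes the large model's preferred token at each drafted position is already available from a single verification pass (the `target_tokens` list below is hand-written for illustration):

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification sketch.

    draft_tokens:  tokens proposed by the small model.
    target_tokens: the large model's own choice at each drafted position,
                   assumed to come from one forward pass over the draft.
    Returns the tokens to emit for this step.
    """
    emitted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted == wanted:
            emitted.append(drafted)   # agreement: keep the drafted token
        else:
            emitted.append(wanted)    # first disagreement: use the large model's token
            break                     # everything after it is discarded
    return emitted

draft = ["rainy", "and", "cold", "."]
target = ["rainy", "and", "pleasant", "."]
print(verify_draft(draft, target))
```

Note that the large model's predictions after the mismatch are thrown away too. They were computed with "cold" in the context, so they no longer apply once "cold" is replaced.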
Why This Doesn't Change Model Quality
A common misconception is:
"Aren't we replacing the big model with a smaller model?"
No.
The large model remains the source of truth.
The draft model merely proposes candidates.
The verification process guarantees that the final output follows the same probability distribution as standard decoding.
Conceptually:
Normal Decoding
---------------
Large Model → Token
Speculative Decoding
--------------------
Small Model → Proposed Tokens
Large Model → Verify Tokens
Accepted Tokens → Output
The final answer is still determined by the large model.
The user gets the same quality, but faster.
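When tokens are sampled rather than picked greedily, the guarantee comes from an accept/reject rule: a drafted token x is kept with probability min(1, p(x) / q(x)), where p is the large model's distribution and q is the draft model's, and on rejection a replacement is drawn from the normalized positive part of p - q. Here is a small sketch of that rule for a single token. The two distributions are made-up numbers, not outputs of real models:

```python
import random

def accept_probability(p_x, q_x):
    """Chance of keeping a drafted token x: min(1, p(x) / q(x))."""
    return min(1.0, p_x / q_x)

def residual_distribution(p, q):
    """Distribution to resample from after a rejection: normalize(max(0, p - q))."""
    raw = {tok: max(0.0, p[tok] - q.get(tok, 0.0)) for tok in p}
    total = sum(raw.values())
    return {tok: v / total for tok, v in raw.items()}

def speculative_sample(p, q, rng):
    """Draw one token by drafting from q, then accepting or resampling."""
    tokens = list(q)
    drafted = rng.choices(tokens, weights=[q[t] for t in tokens])[0]
    if rng.random() < accept_probability(p.get(drafted, 0.0), q[drafted]):
        return drafted
    residual = residual_distribution(p, q)
    candidates = list(residual)
    return rng.choices(candidates, weights=[residual[t] for t in candidates])[0]

# Toy next-token distributions (illustrative numbers only).
p = {"sunny": 0.6, "rainy": 0.3, "cold": 0.1}   # large model
q = {"sunny": 0.8, "rainy": 0.1, "cold": 0.1}   # draft model

rng = random.Random(0)
n = 50_000
counts = {tok: 0 for tok in p}
for _ in range(n):
    counts[speculative_sample(p, q, rng)] += 1
print({tok: round(c / n, 3) for tok, c in counts.items()})
```

The printed frequencies should land close to p, not q, even though every token was first proposed by the draft distribution. The draft model over-proposes "sunny", the rule rejects some of those, and the rejected mass is redirected to the tokens the draft under-proposed.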
A Simplified Algorithm
At a high level:
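while not finished:
    # Draft k tokens cheaply with the small model.
    proposed = small_model.generate(k_tokens)
    # Score every drafted position in one large-model forward pass.
    verification = large_model.evaluate(proposed)
    accepted = longest_matching_prefix(
        proposed,
        verification
    )
    output.extend(accepted)
    # On a mismatch, the large model supplies the token at that position.
    if len(accepted) < len(proposed):
        output.append(
            large_model.next_token()
        )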
In practice, the real algorithm is more sophisticated because it must preserve exact sampling behavior.
But this captures the overall workflow.
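For readers who want something they can run, here is a toy, greedy-only version of the workflow. Everything in it is an assumption made for illustration: the `LARGE` and `SMALL` lookup tables stand in for real models, and the "single verification pass" is simulated with a list comprehension and simply counted as one pass.

```python
# Hypothetical toy "models": each maps the last token to its greedy next token.
LARGE = {"<s>": "The", "The": "weather", "weather": "today", "today": "is",
         "is": "rainy", "rainy": "and", "and": "pleasant", "pleasant": ".",
         "cold": ".", ".": "<eos>"}
SMALL = {**LARGE, "and": "cold"}   # the draft model disagrees at one point

def next_token(table, tokens):
    """Stand-in for one forward pass: greedy next token given the prefix."""
    return table[tokens[-1]]

def plain_generate():
    output, passes = ["<s>"], 0
    while output[-1] != "<eos>":
        output.append(next_token(LARGE, output))
        passes += 1
    return output, passes

def speculative_generate(k=4):
    output, large_passes = ["<s>"], 0
    while output[-1] != "<eos>":
        # 1. Draft up to k tokens with the cheap model.
        draft = []
        for _ in range(k):
            draft.append(next_token(SMALL, output + draft))
            if draft[-1] == "<eos>":
                break
        # 2. One large-model pass scores every drafted position (simulated).
        large_passes += 1
        contexts = [output + draft[:i] for i in range(len(draft) + 1)]
        targets = [next_token(LARGE, c) for c in contexts if c[-1] != "<eos>"]
        # 3. Keep the longest matching prefix, then take one token from the
        #    large model: a correction at the mismatch, or a bonus token
        #    if the whole draft matched.
        n = 0
        while n < len(draft) and draft[n] == targets[n]:
            n += 1
        output.extend(draft[:n])
        if output[-1] != "<eos>":
            output.append(targets[n])
    return output, large_passes

plain_out, plain_passes = plain_generate()
spec_out, spec_passes = speculative_generate(k=4)
print(" ".join(spec_out))
print("large-model passes:", spec_passes, "vs", plain_passes)
```

In this toy setup both functions produce the same text, but the speculative loop needs 3 large-model passes where plain decoding needs 9. Real speedups are smaller, because the draft model's own cost and the extra work in each verification pass are not free.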
Why It Works So Well
Speculative decoding exploits an observation about language:
Most tokens are predictable.
Consider:
Once upon a
Most models will predict:
time
Similarly:
Thank you for your
Likely:
help
Large models spend a surprising amount of compute confirming obvious continuations.
A smaller model can often predict these easy regions accurately.
The large model only needs to intervene when things become ambiguous.
This creates a useful division of labor:
| Component | Job |
|---|---|
| Small Model | Predict likely tokens |
| Large Model | Verify and correct |
| User | Receives faster output |
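
How much does predictability buy you? Under the simplifying assumption that every drafted token is accepted independently with the same probability `alpha`, the expected number of tokens emitted per large-model pass with a draft of length `k` is the geometric sum (1 - alpha^(k+1)) / (1 - alpha). The sketch below just evaluates that formula. The `alpha` values are hypothetical, not measurements:

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens emitted per large-model pass.

    alpha: assumed probability that any one drafted token is accepted (i.i.d.).
    k:     number of tokens drafted per step.
    Counts accepted draft tokens plus the one token the large model always supplies.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.5, 0.8, 0.95):
    print(alpha, round(expected_tokens_per_pass(alpha, k=4), 2))
```

The gain grows quickly as the acceptance rate approaches 1 and shrinks toward a single token per pass as it approaches 0. This estimate ignores the cost of running the draft model, so treat it as an upper bound on the benefit.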
Modern Variants
Research and production systems have extended the original idea in several directions.
Self-Speculative Decoding
Instead of using two separate models:
- Early layers generate drafts
- Full model verifies
This avoids maintaining a second model entirely.
Multi-Token Prediction
Some architectures are trained to predict multiple future tokens directly.
Instead of:
Predict token N+1
They predict:
N+1
N+2
N+3
...
This increases opportunities for speculative execution.
Tree-Based Speculation
Rather than proposing a single sequence:
A → B → C
The draft model proposes multiple branches:
      B1
     /
A --
     \
      B2
The verifier can then select among several possible continuations.
These approaches push throughput even further.
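A rough sketch of the selection step for tree-based speculation, again greedy and heavily simplified. The `target_next` callable is a hypothetical stand-in for the large model, and the branches are checked one by one here, whereas real systems typically score the whole tree in one batched pass:

```python
def pick_branch(branches, target_next):
    """Choose the drafted branch with the longest prefix the verifier accepts.

    branches:    list of candidate token sequences from the draft model.
    target_next: hypothetical callable returning the large model's greedy
                 token for a given prefix (a tuple of tokens).
    """
    best = []
    for branch in branches:
        accepted = []
        for tok in branch:
            if target_next(tuple(accepted)) != tok:
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best

# Toy verifier: the large model wants "A", then "B2", then "C".
WANTED = {(): "A", ("A",): "B2", ("A", "B2"): "C"}
branches = [["A", "B1", "C"], ["A", "B2", "C"]]
print(pick_branch(branches, lambda prefix: WANTED.get(prefix)))
```

With a single drafted sequence, one wrong guess at B1 would end the accepted run after A. Proposing B2 as an alternative gives the verifier a second chance to keep going.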
Where You'll Encounter It
Many developers use speculative decoding without realizing it.
Variants of it frequently show up in:
- Server-side LLM inference platforms
- High-throughput API providers
- Optimized open-source inference engines
- Enterprise deployment stacks
Whenever you see a large model streaming unusually quickly, there's a decent chance some form of speculative execution is happening behind the scenes.
It's becoming one of the standard techniques for making frontier models economically viable at scale.
Final Thoughts
Speculative decoding is a beautiful example of an engineering idea that sounds counterintuitive but turns out to be remarkably effective.
Instead of trying to make large models inherently faster, it asks a different question:
What if most of the work they're doing is already predictable?
By letting a smaller model make educated guesses and allowing the larger model to verify them, we can reduce latency dramatically while preserving output quality.
As LLM deployment scales to millions of users and billions of generated tokens, techniques like speculative decoding are likely to matter just as much as advances in model architecture itself.
Question: If you were deploying a large LLM in production, would you prefer investing in a better model, a faster model, or inference optimizations like speculative decoding? Why?
*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.
git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*
Feedback and contributors are welcome! It's online, source-available, and ready for anyone to use.
Top comments (1)
Worth flagging the production gotcha that doesn't show up in most speculative decoding explainers: the latency gain is highly workload-dependent. For text with low entropy (boilerplate, common phrasing, structured output) the draft model's hit rate is high and you can see 2-3x speedups. For code generation in unusual contexts, the draft model misses constantly and you net very little.
The other under-discussed point is that the draft and target model have to share enough of a tokenizer/embedding space for verification to be cheap. Cross-vendor speculative decoding is hard for exactly this reason — it's not a universal acceleration, it's a tuned-couple acceleration. Useful framing for anyone evaluating it for their own serving stack.