DEV Community

Shrijith Venkatramana

Speculative Decoding: How LLMs Generate Tokens Faster Without Changing the Answer

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star us to help devs discover the project, give it a try, and share your feedback to help improve the product.


Large Language Models keep getting smarter.

But there's a problem: users don't experience intelligence directly. They experience latency.

If a model takes 30 seconds to write an answer instead of 3 seconds, most users won't care that it scored higher on some benchmark.

This creates an interesting engineering challenge:

How do we make LLMs generate text faster without making them worse?

One of the most important techniques to emerge in recent years is speculative decoding.

The idea sounds almost absurd at first:

What if a small model could guess what a large model is about to say, and the large model simply verifies those guesses?

Surprisingly, that's exactly what happens.

Let's see how it works.

The Fundamental Bottleneck of LLM Inference

To understand speculative decoding, we first need to understand why LLMs are slow.

Imagine a model is generating:

The capital of France is Paris.

The model doesn't generate the entire sentence at once.

Instead it generates one token at a time:

The
The capital
The capital of
The capital of France
...

Each new token requires another forward pass through the model.

For a large model with hundreds of billions of parameters, every token is expensive.

This means generation is inherently sequential:

Token 1 → Token 2 → Token 3 → Token 4

You can't generate token 4 until you know token 3.

This sequential nature becomes one of the biggest sources of inference latency.
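The sequential loop can be sketched in a few lines of Python. This is a minimal illustration, not a real inference engine: `ToyModel` and its lookup table are hypothetical stand-ins for a network, and each `next_token` call represents one full forward pass.

```python
from typing import Dict, List


class ToyModel:
    """Hypothetical stand-in for an LLM: picks the next token from a lookup table."""

    def __init__(self, table: Dict[str, str]) -> None:
        self.table = table
        self.forward_passes = 0

    def next_token(self, tokens: List[str]) -> str:
        # In a real model this is one full forward pass over the context.
        self.forward_passes += 1
        return self.table[tokens[-1]]


def generate(model: ToyModel, prompt: List[str], max_new_tokens: int) -> List[str]:
    """Plain autoregressive decoding: token i+1 cannot start until token i exists."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model.next_token(tokens))
    return tokens


model = ToyModel({"The": "capital", "capital": "of", "of": "France",
                  "France": "is", "is": "Paris", "Paris": "."})
sentence = generate(model, ["The"], max_new_tokens=6)
print(" ".join(sentence))
print("forward passes:", model.forward_passes)
```

Six new tokens cost six passes, one after another. That one-pass-per-token pattern is what speculative decoding tries to break.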

The Intuition: Let a Smaller Model Predict Ahead

Suppose you have:

  • A large model (expensive)
  • A small model (cheap)

The small model is usually less accurate.

But it's often correct about obvious next tokens.

For example:

Prompt:

The capital of France is

Small model prediction:

Paris

Large model prediction:

Paris

Both agree.

Now imagine the small model predicts several tokens:

Paris, which is

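Instead of asking the large model for each token individually, we ask it to verify the entire sequence in one pass. A transformer can score every position of a known sequence in parallel, the same way it processes a prompt, so checking several drafted tokens costs roughly one forward pass.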

If the predictions are correct, we've effectively skipped multiple expensive decoding steps.

This is the core idea behind speculative decoding.

A Simple Example

Let's say the prompt is "The weather today is" and our draft model completes it as:

The weather today is sunny and warm.

The newly proposed tokens are:

sunny
and
warm
.

The large model then evaluates these proposed tokens.

Possible outcome:

Token    Draft Model    Large Model
sunny    ✓              ✓
and      ✓              ✓
warm     ✓              ✓
.        ✓              ✓

Everything matches.

The large model accepts all four tokens.

Instead of four sequential passes through the large model, we needed one verification pass plus four cheap draft steps.

That's a major latency reduction.

What Happens When They Disagree?

This is where the algorithm becomes interesting.

Suppose the draft model predicts:

The weather today is rainy and cold.

The large model evaluates the proposal.

rainy   ✓
and     ✓
cold    ✗

The large model agrees until "cold".

At that point:

  1. Accepted tokens are kept.
  2. Incorrect tokens are discarded.
  3. The large model supplies its own token at the first disagreement, and generation resumes from there.

Result:

The weather today is rainy and pleasant.

Only part of the speculation was useful.

But even partial acceptance can significantly improve throughput.
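The keep/discard rule from this example can be written as a small function. This is a sketch for greedy decoding only: `accept_draft` is an illustrative name, and the `verified` list is hand-written here to stand in for the large model's choice at each drafted position.

```python
from typing import List


def accept_draft(proposed: List[str], verified: List[str]) -> List[str]:
    """Keep the agreed prefix, then take the large model's token at the first mismatch."""
    kept: List[str] = []
    for draft_token, large_token in zip(proposed, verified):
        if draft_token != large_token:
            kept.append(large_token)  # correction from the large model
            break
        kept.append(draft_token)
    return kept


proposed = ["rainy", "and", "cold", "."]
# Hypothetical large-model choices at the same four positions.
verified = ["rainy", "and", "pleasant", "."]

result = accept_draft(proposed, verified)
print(result)  # ['rainy', 'and', 'pleasant']
```

Everything after the first mismatch is thrown away, including the large model's own later choices, because they were computed on a context that contained the rejected token "cold".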

Why This Doesn't Change Model Quality

A common misconception is:

"Aren't we replacing the big model with a smaller model?"

No.

The large model remains the source of truth.

The draft model merely proposes candidates.

The verification process guarantees that the final output follows the same probability distribution as standard decoding.

Conceptually:

Normal Decoding
---------------
Large Model → Token

Speculative Decoding
--------------------
Small Model → Proposed Tokens
Large Model → Verify Tokens
Accepted Tokens → Output

The final answer is still determined by the large model.

The user gets the same quality, but faster.
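When tokens are sampled instead of picked greedily, the guarantee comes from how each drafted token is accepted. Below is a minimal sketch of one verification step, assuming the accept-or-resample rule commonly described for speculative sampling: accept the drafted token with probability min(1, p/q), otherwise draw from the leftover mass max(0, p - q). The function name and the toy distributions `p` (large model) and `q` (draft model) are illustrative.

```python
import random
from typing import Dict

Dist = Dict[str, float]


def accept_or_resample(token: str, p: Dist, q: Dist, rng: random.Random) -> str:
    """`token` was sampled from the draft distribution q; p is the large model's
    distribution at the same position. Returns the token that is actually emitted."""
    # Accept the drafted token with probability min(1, p(token) / q(token)).
    if rng.random() < min(1.0, p.get(token, 0.0) / q[token]):
        return token
    # Rejected: resample from the leftover mass max(0, p - q).
    # rng.choices normalizes the weights for us.
    residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    return rng.choices(list(residual), weights=list(residual.values()))[0]


rng = random.Random(0)
p = {"sunny": 0.6, "rainy": 0.3, "cloudy": 0.1}  # large model (toy numbers)
q = {"sunny": 0.3, "rainy": 0.6, "cloudy": 0.1}  # draft model (toy numbers)

trials = 20000
counts = {t: 0 for t in p}
for _ in range(trials):
    drafted = rng.choices(list(q), weights=list(q.values()))[0]
    counts[accept_or_resample(drafted, p, q, rng)] += 1

frequencies = {t: counts[t] / trials for t in p}
print(frequencies)
```

Even though every candidate is drawn from `q`, the emitted frequencies should land close to `p`. In this toy setup the draft model only changes how often the large model's answer is reached cheaply, not what that answer is.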

A Simplified Algorithm

At a high level:

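output = list(prompt_tokens)

# finished() is a placeholder stop condition (length limit or end-of-sequence).
while not finished(output):

    # Draft k tokens with the cheap model, given the text so far.
    proposed = small_model.generate(output, k_tokens)

    # One large-model pass yields its own choice at every drafted position.
    verification = large_model.evaluate(output, proposed)

    # Keep the longest prefix on which both models agree.
    accepted = longest_matching_prefix(
        proposed,
        verification
    )

    output.extend(accepted)

    # On a mismatch, take the large model's token at the first disagreement.
    # The verification pass already computed it, so no extra pass is needed.
    if len(accepted) < len(proposed):
        output.append(
            verification[len(accepted)]
        )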

In practice, the real algorithm is more sophisticated because it must preserve exact sampling behavior.

But this captures the overall workflow.
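The workflow can be made runnable with toy models. The sketch below covers greedy decoding only, and both "models" are hypothetical lookup tables keyed by the last token; `table_model`, `greedy_decode`, and `speculative_decode` are illustrative names. A real engine would score all drafted positions in one batched pass, whereas this toy calls the large model once per position.

```python
from typing import Callable, Dict, List

Model = Callable[[List[str]], str]  # context -> greedy next token


def table_model(table: Dict[str, str]) -> Model:
    """Hypothetical toy model: the next token depends only on the last one."""
    return lambda tokens: table[tokens[-1]]


def greedy_decode(model: Model, prompt: List[str], n: int) -> List[str]:
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(small: Model, large: Model, prompt: List[str],
                       n: int, k: int = 3) -> List[str]:
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n:
        # Draft k tokens with the small model.
        draft: List[str] = []
        for _ in range(k):
            draft.append(small(tokens + draft))
        # Verify each drafted position against the large model's choice.
        for i in range(k):
            expected = large(tokens + draft[:i])
            if draft[i] != expected:
                # Keep the agreed prefix plus the large model's correction.
                draft = draft[:i] + [expected]
                break
        tokens.extend(draft)
    return tokens[:len(prompt) + n]


LARGE = {"is": "rainy", "rainy": "and", "and": "pleasant",
         "pleasant": ".", ".": "The", "The": "weather",
         "weather": "today", "today": "is"}
SMALL = {**LARGE, "and": "cold", "cold": "."}  # disagrees after "and"

prompt = ["The", "weather", "today", "is"]
fast = speculative_decode(table_model(SMALL), table_model(LARGE), prompt, n=4)
slow = greedy_decode(table_model(LARGE), prompt, n=4)
print(" ".join(fast))
print(fast == slow)  # True
```

The speculative path rejects "cold", takes "pleasant" from the large model, and ends with exactly the sequence that plain greedy decoding of the large model produces.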

Why It Works So Well

Speculative decoding exploits an observation about language:

Most tokens are predictable.

Consider:

Once upon a

Most models will predict:

time

Similarly:

Thank you for your

Likely:

help

Large models spend a surprising amount of compute confirming obvious continuations.

A smaller model can often predict these easy regions accurately.

The large model only needs to intervene when things become ambiguous.

This creates a useful division of labor:

Component      Job
Small Model    Predict likely tokens
Large Model    Verify and correct
User           Receives faster output
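How much this division of labor pays off depends on how often the small model is right. Here is a rough back-of-the-envelope estimate under simplifying assumptions that are mine, not measurements: each drafted token is accepted independently with probability `alpha`, one draft step costs `draft_cost` of a large-model pass, and every verification pass also contributes one token from the large model itself. The function names are illustrative.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per large-model pass when k tokens are drafted.

    Counts the accepted prefix plus the one token the large model contributes
    (a correction, or an extra token if all k drafts pass).
    This is the geometric sum 1 + alpha + ... + alpha**k.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per round divided by the cost of a round (k draft steps + 1 pass)."""
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1.0)


for alpha in (0.3, 0.6, 0.9):
    print(alpha, round(estimated_speedup(alpha, k=4, draft_cost=0.05), 2))
```

Under this model the estimated gain grows quickly with the acceptance rate and shrinks toward nothing when the draft model misses often, so the gain depends heavily on how predictable the text is.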

Modern Variants

Research and production systems have extended the original idea in several directions.

Self-Speculative Decoding

Instead of using two separate models, a single model plays both roles:

  • Early layers generate drafts
  • Full model verifies

This avoids maintaining a second model entirely.

Multi-Token Prediction

Some architectures are trained to predict multiple future tokens directly.

Instead of:

Predict token N+1

They predict:

N+1
N+2
N+3
...

This increases opportunities for speculative execution.

Tree-Based Speculation

Rather than proposing a single sequence:

A → B → C

The draft model proposes multiple branches:

      B1
    /
A --
    \
      B2

The verifier can then select among several possible continuations.

These approaches push throughput even further.
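A toy version of branch selection, again for greedy decoding: the verifier checks each candidate branch and keeps the longest accepted prefix. `best_branch` and the lookup-table model are hypothetical. Production systems typically verify all branches together in one pass, whereas this sketch simply loops over them.

```python
from typing import Callable, Dict, List

Model = Callable[[List[str]], str]  # context -> greedy next token


def best_branch(context: List[str], branches: List[List[str]], large: Model) -> List[str]:
    """Return the longest prefix, across all branches, that the large model agrees with."""
    best: List[str] = []
    for branch in branches:
        accepted: List[str] = []
        for token in branch:
            if large(context + accepted) != token:
                break
            accepted.append(token)
        if len(accepted) > len(best):
            best = accepted
    return best


# Hypothetical toy verifier: the next token depends only on the last one.
TABLE: Dict[str, str] = {"is": "rainy", "rainy": "and", "and": "pleasant"}
large: Model = lambda tokens: TABLE[tokens[-1]]

context = ["The", "weather", "today", "is"]
branches = [["sunny", "and", "warm"], ["rainy", "and", "cold"]]
chosen = best_branch(context, branches, large)
print(chosen)  # ['rainy', 'and']
```

With a single draft starting with "sunny", nothing would have been accepted. Offering a second branch lets the verifier salvage two tokens from the same round.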

Where You'll Encounter It

Many developers use speculative decoding without realizing it.

Variants of it frequently show up in:

  • Server-side LLM inference platforms
  • High-throughput API providers
  • Optimized open-source inference engines
  • Enterprise deployment stacks

Whenever you see a large model streaming unusually quickly, there's a decent chance some form of speculative execution is happening behind the scenes.

It's becoming one of the standard techniques for making frontier models economically viable at scale.

Final Thoughts

Speculative decoding is a beautiful example of an engineering idea that sounds counterintuitive but turns out to be remarkably effective.

Instead of trying to make large models inherently faster, it asks a different question:

What if most of the work they're doing is already predictable?

By letting a smaller model make educated guesses and allowing the larger model to verify them, we can reduce latency dramatically while preserving output quality.

As LLM deployment scales to millions of users and billions of generated tokens, techniques like speculative decoding are likely to matter just as much as advances in model architecture itself.

Question: If you were deploying a large LLM in production, would you prefer investing in a better model, a faster model, or inference optimizations like speculative decoding? Why?


AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.

GitHub: HexmosTech / git-lrc

Free, Micro AI Code Reviews That Run on Commit





Top comments (1)

Theo Valmis

Worth flagging the production gotcha that doesn't show up in most speculative decoding explainers: the latency gain is highly workload-dependent. For text with low entropy (boilerplate, common phrasing, structured output) the draft model's hit rate is high and you can see 2-3x speedups. For code generation in unusual contexts, the draft model misses constantly and you net very little.

The other under-discussed point is that the draft and target model have to share enough of a tokenizer/embedding space for verification to be cheap. Cross-vendor speculative decoding is hard for exactly this reason — it's not a universal acceleration, it's a tuned-couple acceleration. Useful framing for anyone evaluating it for their own serving stack.