Today, we're diving into a fascinating concept known as Speculative Decoding. The name might sound a bit complex: "speculative" means guessing or making a conjecture, and "decoding" is how an LLM predicts the next token. Put them together, and the name says that the LLM isn't carefully generating each next token one at a time; it's making a speculative guess about the tokens to come.
That quick definition probably doesn't fully capture the essence of it.
So, let's explore why Speculative Decoding is necessary, how it works, and what its pros and cons are, piece by piece.
LLMs Are Slow
Large Language Models (LLMs) are notoriously slow. There are a few reasons for this, but the most critical are the sheer size of the model itself and the Autoregressive nature of generating one token after the next. These two factors combine to make LLM inference incredibly slow. The model is so large that generating a single token takes a long time, and this lengthy process has to be repeated for every single token. It's no wonder the process is slow.
If we dig deeper, there are more technical reasons involving the Multi-Head Self Attention architecture, modern GPU memory hierarchy, and the Memory Bandwidth Bound characteristic, but those two fundamental issues are the root cause.
So, if we want to speed up LLM inference, how do we do it? We need to tackle those two problems: the oversized model and the Autoregressive token-by-token generation method.
Speculative Decoding aims directly at solving these two issues.
The Core Idea Behind Speculative Decoding
The Problem of the Oversized Model
The solution Speculative Decoding offers to the problem of slow, oversized models is surprisingly simple: "Let's use a small model." Imagine the original model you intended to use is 100 billion parameters (100B). If you use a 7B parameter model instead, it will obviously be much faster. A simple calculation suggests it would be nearly 14 times faster.
However, this idea seems absurd on the surface. We use a large model precisely because, while small models are fast, their output quality is poor. Reverting to a small model would be fast, but the quality would drop—that's certainly not a good solution.
But the story changes if the result generated by the small model is reviewed by someone else. Since the small model is 14 times faster than the 100B model, even if the review process adds some time, the total time spent can still be far less than using the 100B model alone.
So, how does this review work?
The Problem of Autoregressive Token-by-Token Generation
Here is where Speculative Decoding's second core idea comes into play. Instead of using the result generated by the small 7B model directly, the output is reviewed using the large 100B model that we originally intended to use.
Again, this seems counter-intuitive. To speed things up, we generated a result quickly with a small model, only to check it with the big model? Doesn't that just add the time the small model took to generate the result?
No, it doesn't. The secret lies in how the large model reviews the small model's output.
The large model does not review the result by generating tokens one by one in an Autoregressive fashion. Instead, it takes the results generated by the small model all at once and determines where the sequence is correct and where it goes wrong. It then accepts the correct portion and sends the rest back to the small model to regenerate from the point of error.
By addressing both the issue of the model being too large (by using a small Draft model first) and the issue of token-by-token prediction (by having the large model review a sequence of tokens in a single forward pass), Speculative Decoding solves the root causes of LLM sluggishness. However, to truly grasp how this is possible, we need a concrete example.
Let's look at one now.
A Practical Example
Step 1: Generation by the Small Model
Imagine a user provides the prompt: "The weather is lovely today."
Instead of the 100B model, we use the 7B model to predict the continuation. We also specify how many words (or tokens—we'll use "word" for simplicity) we want to generate. Let's assume we aim to generate seven words.
The 7B model very quickly generates: "Perhaps we should go for a dance." (Note: 'dance' is not a typo; the explanation follows.)
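To make Step 1 concrete, here is a minimal sketch of the draft loop in Python. The `draft_model` function is a hypothetical stand-in for the 7B model; in reality this would be a cheap forward pass through a real LLM, but here it simply replays the words from this example so the loop structure is visible.

```python
# A toy sketch of Step 1. `draft_model` is a hypothetical stand-in for the
# small 7B model: in reality this would be a cheap forward pass through a
# real LLM, but here it simply replays the words from this example.

PROMPT = ["The", "weather", "is", "lovely", "today"]
DRAFT_WORDS = ["Perhaps", "we", "should", "go", "for", "a", "dance"]

def draft_model(tokens: list[str]) -> str:
    """Hypothetical small model: guesses the next word given the words so far."""
    return DRAFT_WORDS[len(tokens) - len(PROMPT)]

def generate_draft(prompt_tokens: list[str], k: int) -> list[str]:
    """Autoregressively generate k draft words, one cheap step per word."""
    tokens = list(prompt_tokens)
    for _ in range(k):
        tokens.append(draft_model(tokens))
    return tokens[len(prompt_tokens):]   # only the newly drafted words

print(generate_draft(PROMPT, k=7))
# ['Perhaps', 'we', 'should', 'go', 'for', 'a', 'dance']
```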
Step 2: Review by the Large Model
Now, the 100B model reviews this result. The original prompt and the small model's generation are combined and fed into the 100B model. The input becomes: "The weather is lovely today. Perhaps we should go for a dance."
The 100B model then uses its normal LLM inference process to predict the next word following this entire input. Let's say it generates "and." The generated sentence so far is now: "The weather is lovely today. Perhaps we should go for a dance and."
Step 3: Preparing for Review
(This is where it gets a bit complex. Feel free to take a deep breath before continuing.)
The main goal of running the 100B model to generate this single new word ("and") is not the word itself, but reviewing the result generated by the 7B model, as explained earlier. Let's see how the review works.
To generate the new word "and," the 100B model first needs to understand the input: "The weather is lovely today. Perhaps we should go for a dance." Generating "and" means selecting the most probable word to follow that sequence.
How does it select the most probable word?
In the Transformer architecture, each word (token) in the input sentence (the Prompt) is passed through a complex and deep artificial neural network (including the famous self-attention mechanism) and is transformed into a vector, which is a sequence of numbers.
In our example, "The weather is lovely today. Perhaps we should go for a dance" is converted as follows. The specific numbers are arbitrary and not important; just grasp the concept that each word is converted into a vector:
The: [23, 53, 29, 134, …]
weather: [221, 23, 99, 111, …]
is: [34, 13, 252, 153, …]
lovely: [25, 22, 89, 75, …]
today: [162, 1, 3, 66, …]
Perhaps: [14, 43, 2, 46, …]
we: [42, 52, 3, 72, …]
should: [14, 43, 2, 46, …]
go: [25, 22, 89, 75, …]
for: [34, 13, 252, 153, …]
a: [221, 23, 99, 111, …]
dance: [42, 52, 3, 72, …]
The Transformer then takes the vector corresponding to "dance"—[42, 52, 3, 72, ...]—and selects the most plausible next word.
Crucially, it only considers "dance" to pick the next word, not all the preceding words. This is because the vector computed for "dance," [42, 52, 3, 72, ...], already incorporates the values of "The, weather, is, lovely, today, Perhaps, we, should, go, for, a," thanks to the Transformer's self-attention mechanism.
For this discussion, you only need to know that the vector representing any given word already contains information about all the words that came before it.
By running the 100B model once on "The weather is lovely today. Perhaps we should go for a dance," it generated the next word "and." More importantly, it also simultaneously obtained the vector values for all preceding words.
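You can see this directly in code by looking at the shape of a single forward pass's output. The snippet below uses GPT-2 (via the Hugging Face transformers library) purely as a small stand-in for the 100B model; the important part is that `logits` has one row per input position, so the model's next-word opinion at every position comes out of one pass.

```python
# Illustration of the key point, using GPT-2 as a tiny stand-in for the 100B
# model (requires the Hugging Face `transformers` library). One forward pass
# returns logits for EVERY position, i.e., a next-word prediction after each
# word in the input, not only after the last one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The weather is lovely today. Perhaps we should go for a dance"
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits

print(logits.shape)                      # [1, sequence_length, vocab_size]
next_ids = logits.argmax(dim=-1)[0]      # most probable next token at each position
print(tokenizer.convert_ids_to_tokens(next_ids.tolist()))
```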
Step 4: The Actual Review
The: [23, 53, 29, 134, …]
weather: [221, 23, 99, 111, …]
is: [34, 13, 252, 153, …]
lovely: [25, 22, 89, 75, …]
today: [162, 1, 3, 66, …]
Perhaps: [14, 43, 2, 46, …]
we: [42, 52, 3, 72, …]
should: [14, 43, 2, 46, …]
go: [25, 22, 89, 75, …]
for: [34, 13, 252, 153, …]
a: [221, 23, 99, 111, …]
dance: [42, 52, 3, 72, …]
We established that only the vector for "dance" is needed to generate the word following "dance."
What about generating the word following the fifth word, "today"? We need the vector value for "today." Since we already have the vector for "today," we can predict the word that should follow it. The same logic applies to predicting the word following "go."
This is where a slight modification to how we use the 100B model comes in. Since we already have the vector for every word, we can read off the word the 100B model would generate after each of them.
| # | Word (User Prompt + 7B Draft) | Word's Vector Value | Next Word Predicted by the 100B Model from That Vector |
|---|---|---|---|
| 1 | The | [23, 53, 29, 134, …] | N/A / User Input |
| 2 | weather | [221, 23, 99, 111, …] | N/A / User Input |
| 3 | is | [34, 13, 252, 153, …] | N/A / User Input |
| 4 | lovely | [25, 22, 89, 75, …] | N/A / User Input |
| 5 | today | [162, 1, 3, 66, …] | Perhaps |
| 6 | Perhaps | [14, 43, 2, 46, …] | we |
| 7 | we | [42, 52, 3, 72, …] | should |
| 8 | should | [14, 43, 2, 46, …] | go |
| 9 | go | [25, 22, 89, 75, …] | for |
| 10 | for | [34, 13, 252, 153, …] | a |
| 11 | a | [221, 23, 99, 111, …] | walk |
| 12 | dance | [42, 52, 3, 72, …] | and |
Let's look at row 11 in this table. This is equivalent to asking the 100B model, "Here's the vector for 'a'. What would you generate next?" The 100B model responds, "I'd generate 'walk'." Now we compare this to the word generated by the 7B model in the next row, which is "dance." They are different. This means the 7B model failed the 100B model's review at this step.
We don't need to look beyond row 11. Any words generated from the point of error onward are built on a wrong prediction and must be discarded.
Strictly speaking, Speculative Decoding doesn't just do a simple comparison of predicted tokens. It statistically checks how similar the predictions of the small and large models are. However, for a concept-level understanding without complex math, we'll omit that detailed complexity.
We now accept only the portion of the 7B's generated sequence that passed the 100B's review.
Therefore, the first six words generated by the 7B model are accepted: "Perhaps we should go for a". We now know the word that should follow "a" is not "dance", but "walk". We can now feed the accepted sequence, "The weather is lovely today. Perhaps we should go for a walk", back to the 7B model to continue the generation process.
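In code, Step 4 boils down to comparing the draft words against the large model's prediction at each position. Here is a minimal sketch, assuming greedy (most-probable-word) decoding and using the words from the table above.

```python
# The acceptance check under greedy decoding. `draft` is what the 7B model
# produced; `verified[i]` is the 100B model's own choice for draft position i
# (read off the table above), and verified[-1] is the bonus word it predicted
# after the last draft word.
draft    = ["Perhaps", "we", "should", "go", "for", "a", "dance"]
verified = ["Perhaps", "we", "should", "go", "for", "a", "walk", "and"]

accepted = []
for i, word in enumerate(draft):
    if verified[i] == word:
        accepted.append(word)            # 100B agrees: keep the draft word
    else:
        accepted.append(verified[i])     # 100B disagrees: take its correction, stop
        break
else:
    accepted.append(verified[len(draft)])  # every draft word accepted: keep the bonus word

print(accepted)
# ['Perhaps', 'we', 'should', 'go', 'for', 'a', 'walk']
```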
The Meaning of "One-Shot" Review
I have repeatedly emphasized that the review stage in Speculative Decoding happens "all at once." However, the example above shows that, in practice, the next word is predicted for each preceding word, seemingly in an Autoregressive manner.
The "all at once" refers not to the token prediction step, but to the vector calculation step for each word. LLM inference involves two main steps: calculating the vector for each word and then using that vector to predict the next word. The step of calculating the vector for each word is far more complex, lengthy, and time-consuming. One of the main reasons models are getting larger is to extend this vector calculation process.
In other words, the most time-consuming part of the Autoregressive, word-by-word prediction process is calculating the vector for each word. Compared to this, predicting the next word based on an already calculated vector happens in an instant.
And critically, the vector calculation for all words in the input sentence happens simultaneously.
If the input is "The weather is lovely today. Perhaps we should go for a dance," the vectors for "The" and "weather" are not calculated sequentially. Instead, the vectors for the entire sentence are calculated in parallel, all at once.
To summarize, the review stage involves the following:
1. The vectors for each word in "The weather is lovely today. Perhaps we should go for a dance" are calculated in parallel, all at once.
2. Based on each word's vector, the word that should follow it is predicted.
3. The final result accepts only the portion where the 100B model's prediction matched the 7B model's prediction.
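Putting the two stages together, the whole procedure is a loop: draft k words cheaply, review them in one expensive but parallel pass, keep the accepted prefix, and repeat. Here is a minimal sketch; both helpers are hypothetical stand-ins rather than any particular library's API.

```python
# A minimal sketch of the full loop. Both helpers are hypothetical stand-ins:
#   draft_next_word(tokens)        -> one cheap small-model step (next word)
#   verify_in_one_pass(tokens, d)  -> one large-model forward pass over
#                                     tokens + d, returning its predicted next
#                                     word at each of the k draft positions
#                                     plus one bonus position (length k + 1)

def speculative_decode(prompt, k, max_new_words, draft_next_word, verify_in_one_pass):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_words:
        # Stage 1: the small model drafts k words, one by one (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next_word(tokens + draft))

        # Stage 2: the large model reviews all k draft words in ONE pass.
        verified = verify_in_one_pass(tokens, draft)

        # Stage 3: keep the agreeing prefix, plus one correction or bonus word.
        n_accepted = 0
        while n_accepted < k and draft[n_accepted] == verified[n_accepted]:
            n_accepted += 1
        tokens += draft[:n_accepted] + [verified[n_accepted]]
    return tokens
```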
How Much Faster Is It, Really?
Let's assume the 100B model takes 5 seconds to predict one word, and the 7B model takes 0.5 seconds to predict one word.
To generate "Perhaps we should go for a dance" (7 words) following "The weather is lovely today," the 7B model takes 0.5 X 7 = 3.5 seconds.
During the review stage, the 100B model ran a single forward pass, which costs the same as predicting one word; that pass produced both the correction "walk" and the bonus word "and." This review took 5 seconds. (Comparing the 7B results word by word is practically instantaneous and doesn't significantly affect the time.)
The total time spent to generate the corrected sequence, "The weather is lovely today. Perhaps we should go for a walk," is the 3.5 seconds the 7B model spent generating, plus the 5 seconds the 100B model spent reviewing, totaling 8.5 seconds.
If we had used only the 100B model without Speculative Decoding, it would have taken 5 × 7 = 35 seconds. Cutting a 35-second task down to 8.5 seconds represents massive efficiency.
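The same arithmetic, wrapped in a tiny helper so you can plug in your own timings. Like the estimate above, it treats the word-by-word comparison as free.

```python
# The same arithmetic as above, generalized. Timings are the illustrative
# figures from the text: 0.5 s per word for the 7B model, 5 s per forward
# pass for the 100B model; the word-by-word comparison is treated as free.
def round_times(t_small: float, t_large: float, k: int, words_produced: int):
    speculative = t_small * k + t_large       # k cheap drafts + 1 big review pass
    baseline = t_large * words_produced       # 100B generating every word itself
    return speculative, baseline

spec, base = round_times(t_small=0.5, t_large=5.0, k=7, words_produced=7)
print(spec, base, round(base / spec, 1))      # 8.5  35.0  4.1
```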
Revisiting the Name: Speculative Decoding
Stepping back from the technical details, let's look at the name again.
Speculative, as mentioned earlier, means guessing or conjecture. In Speculative Decoding, a smaller model (called the Draft model) is used to create a speculative answer first, which is not guaranteed to be correct. This speculative answer is then reviewed by the larger model. The name Speculative Decoding now makes a lot more sense.
Key Considerations for Choosing the Small Model
Speed of the Small Model
It goes without saying that the small model must be significantly faster than the large one. Otherwise, there's no point in adopting Speculative Decoding. Typically, models 10 to 100 times smaller than the main model are used.
Acceptance Rate
The performance of the small model is critical in Speculative Decoding. While it can't be as accurate as the large model, it must be reasonably accurate. In the review stage, the large model only accepts the parts it agrees with, and the rest is discarded. If the small model constantly makes poor predictions, the process becomes overly reliant on the slow large model, defeating the purpose of quickly generating a speculative result with the small model.
The proportion of the small model's generated sequence that passes the large model's review is called the Acceptance Rate. In our example, the small model predicted 7 words ("Perhaps we should go for a dance"), but only 6 were accepted ("Perhaps we should go for a"), so the Acceptance Rate is 6/7.
Therefore, the small model must be small enough to be fast, but also smart enough to emulate the large model's results to some extent.
A common approach is to use a smaller version of the large model rather than a completely different architecture. For example, if the large model is Llama 3 70B, the small model might be Llama 3 8B. Since they share the same base, the small model is more likely to successfully emulate the large model.
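For reference, the Hugging Face transformers library exposes exactly this pairing as "assisted generation": you pass the small model to `generate()` as `assistant_model`. The sketch below is illustrative only; the model names, dtype, and device settings are assumptions to adapt to your own setup (the Llama 3 checkpoints are also gated).

```python
# Illustrative only: pairing a large target model with a smaller draft model
# from the same family via Hugging Face "assisted generation". Model names,
# dtype, and device placement are assumptions; adjust for your environment.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "meta-llama/Meta-Llama-3-70B-Instruct"   # large, slow, accurate
draft_name  = "meta-llama/Meta-Llama-3-8B-Instruct"    # small, fast draft model

tokenizer = AutoTokenizer.from_pretrained(target_name)
target = AutoModelForCausalLM.from_pretrained(target_name, torch_dtype=torch.bfloat16, device_map="auto")
draft  = AutoModelForCausalLM.from_pretrained(draft_name,  torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("The weather is lovely today.", return_tensors="pt").to(target.device)
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```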
Another method involves training the Draft Model alongside the large model during fine-tuning, such as with LoRA. Since LoRA fine-tuning alters the output characteristics of the base model, it can be highly effective to concurrently train a small Draft Model that matches the characteristics of the fine-tuned LoRA model.
How Much Should the Small Model Predict?
The number of next words (or tokens) the small model generates, often called the k value, is a crucial factor.
In our example following "The weather is lovely today," we generated k=7 words, resulting in "Perhaps we should go for a dance."
Imagine we had generated k=12 words instead: "The weather is lovely today. Perhaps we should go for a dance or sing a cheerful song." The prediction already goes wrong at "dance," so we would have wasted time generating all 12 words only to discard everything from "dance" onward.
If k is too small (e.g., k=2): The large model has to review the small model's results too frequently, and the review overhead can eat up most of the speed gains.
If k is too large (e.g., k=20): There's a high probability that the small model's predictions will go wrong early on. If, say, the very first prediction is wrong, the remaining 19 words must be discarded along with it, wasting a significant amount of computation.
Finding the appropriate k value is a matter of tuning that depends on the service's specific characteristics and the Draft model's Acceptance Rate.
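One rough way to reason about the trade-off: if we assume each draft word is accepted independently with probability equal to the Acceptance Rate (a simplification), the expected number of words produced per draft-and-review round is (1 - rate^(k+1)) / (1 - rate), since a round yields the accepted prefix plus one word from the large model. The sketch below plugs that assumption into the 0.5 s / 5 s timings from earlier to show why throughput first rises and then falls as k grows.

```python
# Expected words produced per draft-and-review round, assuming each draft
# word is accepted independently with probability `alpha` (a simplification).
# Timings reuse the earlier example: 0.5 s per draft word, 5 s per review pass.
def words_per_second(alpha: float, k: int, t_small: float = 0.5, t_large: float = 5.0) -> float:
    expected_words = (1 - alpha ** (k + 1)) / (1 - alpha)   # accepted prefix + 1
    round_time = t_small * k + t_large                      # k drafts + 1 review pass
    return expected_words / round_time

for k in (2, 4, 7, 12, 20):
    print(k, round(words_per_second(alpha=0.8, k=k), 3))
# Throughput rises with k at first, peaks, then falls: exactly the tuning problem.
```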
Self Speculative Decoding: Being Both Small and Large
Speculative Decoding is a fantastic idea, but it has the drawback of requiring maintenance of a separate small model. Self Speculative Decoding emerged to solve this issue.
The core idea is to use only a subset of the large model to act as the small model. The entire large model is then used for the review stage.
An LLM is typically composed of multiple layers, or Decoder Blocks. Let's say the main large model has 20 layers.
In the initial generation stage, we only run the first 5 layers to create the speculative result. This part serves the role of the small model in regular Speculative Decoding.
Once the speculative result is generated, all 20 layers are used for the review. This part serves as the large model.
This approach eliminates the need to maintain a separate small model. Since the 5-layer small model is part of the 20-layer large model, their "thinking styles" are similar, and the Acceptance Rate is likely to be higher.
Of course, there are downsides. It's more complex to implement than regular Speculative Decoding and requires deep modification of the model's internal logic.
However, if implemented well, it can address both problems that weigh on traditional Speculative Decoding: maintaining a separate Draft Model and keeping the Acceptance Rate high.
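Here is a minimal sketch of the idea, assuming a hypothetical `forward(tokens, num_layers)` helper that runs only the first `num_layers` decoder blocks of the same model and returns its next-word prediction at every input position. Wiring such an early-exit prediction head into a real model is exactly the "deep modification" mentioned above.

```python
# Sketch of Self Speculative Decoding. `forward(tokens, num_layers)` is a
# hypothetical helper that runs only the first `num_layers` decoder blocks of
# the SAME model and returns its next-word prediction at every input position.

def self_speculative_step(tokens, k, forward, draft_layers=5, total_layers=20):
    # Draft stage: the shallow sub-model (first 5 of 20 layers) guesses k words.
    draft = []
    for _ in range(k):
        draft.append(forward(tokens + draft, num_layers=draft_layers)[-1])

    # Review stage: the full-depth model checks all k draft words in one pass.
    predictions = forward(tokens + draft, num_layers=total_layers)
    review = predictions[len(tokens) - 1:]   # aligned with the k draft words (+1 bonus)

    # Accept the agreeing prefix, then take one correction or bonus word.
    n = 0
    while n < k and draft[n] == review[n]:
        n += 1
    return tokens + draft[:n] + [review[n]]
```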
Pros and Cons of Speculative Decoding
Finally, let's summarize the advantages and disadvantages of Speculative Decoding.
Advantages
Massive Speed Improvement: When well-tuned, it can speed up inference by roughly 2x to 3x compared to running the large model alone.
No Quality Degradation: This is the most crucial point. Other speed-up techniques (like Quantization or Distillation) often sacrifice a bit of performance. However, because Speculative Decoding always has the large model review the small model's results and only accepts what matches its own choices, the output is identical to what the large model alone would have generated slowly (literally identical with greedy decoding, and identical in distribution when sampling).
Disadvantages
Implementation Complexity: It's much more complex than just running a single model. Synchronizing two models (or two stages of one model), modifying internal model logic, comparing tokens, and implementing the reject/accept logic adds significant development overhead.
Increased Memory Usage: If you use a separate small Draft Model, you must load both the small and large models into memory. This can be mitigated by using Self Speculative Decoding.
Not Always Faster: As mentioned, for tasks with a low Acceptance Rate, the overhead may not be worth it, or it could even result in slower performance.
Conclusion
The sluggishness of LLMs isn't a problem that can be solved simply by upgrading to better hardware. It requires overcoming the intrinsic limitations of their massive size and Autoregressive nature.
Speculative Decoding is an approach that solves this problem not through brute-force computation, but through a shift in thinking. By moving away from the fixed idea of making the large model faster, it introduces an efficient collaboration model where a small model quickly generates a draft, and the large model reviews it.
From now on, when you see an LLM's response "blinking... blinking..." while it generates, you might instinctively picture the frantic collaboration between the small and large models happening behind the scenes.
Appendix: Running Two Models When GPU Is Precious?
In Speculative Decoding, you need to run both a small and a large model. One might think this would put extra strain on the GPU and actually slow things down.
However, a closer look reveals this isn't the case.
When you examine how LLMs operate, you find that LLM computation is not Compute Bound; it's Memory-Bandwidth Bound. Let me explain.
An LLM is essentially a massive neural network. Nearly all its operations involve multiplying and adding the user input with the colossal matrices called weights. This process is a constant repetition of: 1) moving the weights from the GPU's large but slower main memory into its small, extremely fast on-chip memory, 2) the GPU cores computing on the values sitting in that fast memory, and 3) moving the results back to the slower memory.
Step 2 (the actual computation) is very fast, but Steps 1 and 3 (moving data between the slow and fast memory) are comparatively slow. Most of the time is spent shuttling data between these two kinds of memory.
This means that the GPU cores spend most of their time waiting for data to be filled into their fast-access workspace.
So, why not just use fast memory everywhere instead of dividing it into slow and fast tiers? The problem is that the LLM is so big. Fast memory is incredibly expensive, making it impractical to provide in large capacity, so the cheaper, slower memory is used as intermediate storage. It's still orders of magnitude faster than reading directly from disk.
After reading all this, you realize that those precious GPU cores actually spend a lot of time idle during LLM inference. Why let them sit idle? They should be working! That idle time can be used to run the small model for Speculative Decoding. Of course, we don't need to manually track when the GPU is idle and when it's busy; the GPU and the inference framework schedule that for us.
The key is that, due to the LLM's Memory Bound characteristic, running both a small and a large model doesn't strain the GPU core utilization. Instead, it's a method for using the GPU cores more efficiently.
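A back-of-the-envelope calculation makes the Memory-Bandwidth Bound point concrete. For a single token, multiplying the hidden vector by one d x d weight matrix costs about 2·d² floating-point operations but also requires streaming about 2·d² bytes of fp16 weights, roughly 1 FLOP per byte, while a modern GPU can perform on the order of a hundred or more FLOPs in the time it takes to move one byte. The GPU figures below are illustrative placeholders, not the specs of any particular chip.

```python
# Back-of-the-envelope roofline check: one token passing through one d x d
# fp16 weight matrix. GPU figures are illustrative placeholders (roughly the
# ballpark of a modern datacenter GPU), not the specs of any particular chip.
d = 8192                                     # hidden size (illustrative)
flops = 2 * d * d                            # one multiply-add per weight
bytes_moved = 2 * d * d                      # each fp16 weight is 2 bytes

peak_compute = 300e12                        # ~300 TFLOP/s (assumed)
peak_bandwidth = 2e12                        # ~2 TB/s      (assumed)

compute_time = flops / peak_compute          # time the cores actually compute
memory_time = bytes_moved / peak_bandwidth   # time spent streaming weights in

print(f"compute: {compute_time * 1e6:.2f} us, memory: {memory_time * 1e6:.2f} us")
print(f"the cores wait roughly {memory_time / compute_time:.0f}x longer than they compute")
# That waiting time is the idle capacity the small draft model can slip into.
```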