Large language models generate text one token at a time, running a full pass of the model for each one in sequence. This week, two projects applying an acceleration technique called speculative decoding, DeepSeek's DSpark and a related effort called JetSpec, surged to the top of Hacker News, with DSpark pulling in more than seven hundred points. Both aim to speed up LLM inference without changing a single word the model produces.
Key facts
- What: A small model guesses ahead and a big model checks the work in parallel. This week two efforts pushing that idea, DeepSeek's DSpark and JetSpec, lit up the front page while the community argued over whether it's truly "lossless."
- When: 2026-06-27
- Primary source: arXiv 2606.18394
Speculative decoding pairs the big, slow model with a small, fast draft model. The draft model races ahead and guesses the next several words; the big model then verifies all those guesses in a single pass instead of generating them one by one. When the guesses are correct — which they usually are on easy, predictable stretches of text — the big model confirms a whole chunk at once and skips the slow step-by-step process. When a guess is wrong, the big model catches it and corrects course. The output is exactly what the big model would have written alone, just produced in fewer slow rounds. A full plain-language explainer is at how speculative decoding works.
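The draft-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the DSpark or JetSpec implementation: `target_next` and `draft_next` are hypothetical toy functions standing in for the big and small models, decoding is greedy, and the "single pass" of verification is simulated with a list comprehension rather than one batched forward pass.

```python
from typing import List, Tuple


def target_next(prefix: List[int]) -> int:
    # Hypothetical "big" model: a fixed arithmetic rule standing in for an
    # expensive network. Given a prefix, it returns its greedy next token.
    return (sum(prefix) * 31 + len(prefix) * 7) % 50


def draft_next(prefix: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target most of the time,
    # but guesses wrong whenever the prefix length is 3 modulo 4.
    guess = target_next(prefix)
    return (guess + 1) % 50 if len(prefix) % 4 == 3 else guess


def plain_decode(prompt: List[int], n_new: int) -> List[int]:
    # Baseline: one target call per generated token.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    passes = 0  # number of (simulated) target verification passes
    while len(tokens) < goal:
        # 1. The draft model guesses k tokens ahead, one after another.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target scores every drafted position. A real system would do
        #    this in one batched forward pass; here it is a plain loop.
        passes += 1
        checks = [target_next(tokens + draft[:i]) for i in range(k + 1)]
        # 3. Keep the longest prefix of guesses the target agrees with, then
        #    append the target's own token (a correction or a bonus token).
        n_ok = 0
        while n_ok < k and draft[n_ok] == checks[n_ok]:
            n_ok += 1
        tokens.extend(draft[:n_ok])
        tokens.append(checks[n_ok])
    return tokens[:goal], passes


if __name__ == "__main__":
    out, passes = speculative_decode([1, 2, 3], n_new=20)
    print("same output as plain decoding:", out == plain_decode([1, 2, 3], 20))
    print("target passes used:", passes, "instead of 20")
```

Because every emitted token is either a guess the target confirmed or the target's own choice, the result matches plain decoding in this toy setup; only the number of target passes changes.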
Think of a careful senior editor who must approve every sentence. Working alone, they write and approve one sentence at a time — thorough but slow. Now add a fast junior writer who drafts the next few sentences on a guess. The editor reads all of them in one glance: the ones that match what they would have written get a checkmark instantly, and the moment one is wrong, the editor stops, fixes it, and the junior starts fresh from there. On routine passages the junior nails it and the pair flies; on tricky passages the editor takes back over. Same final document, far less waiting.
What DSpark and JetSpec add is breadth. Classic speculative decoding has the draft model guess a single line of words. The newer approach drafts a whole tree of possible continuations — several plausible next paths at once — so when the big model verifies, it's more likely to find a branch it agrees with and can accept even more words per pass. JetSpec's project page and the Hao AI Lab writeup cite speedups of up to eight times, with the underlying paper reporting a range from roughly two to eight times depending on the model and the task. The corresponding research is posted as a paper on parallel tree drafting.
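To see why a tree of guesses can beat a single line of guesses, here is a rough sketch of the acceptance logic. Everything in it is an assumption made for illustration: `target_next` and `draft_candidates` are hypothetical toy models, the tree is a plain nested dictionary, and the walk in `verify_tree` only mimics which tokens would be accepted. It does not reproduce how DSpark or JetSpec build or score their trees.

```python
from typing import Dict, List


def target_next(prefix: List[int]) -> int:
    # Hypothetical "big" model: returns its greedy next token for a prefix.
    return (sum(prefix) * 31 + len(prefix) * 7) % 50


def draft_candidates(prefix: List[int], width: int) -> List[int]:
    # Hypothetical draft model returning its top `width` guesses, best first.
    # When the prefix length is 3 modulo 4, its top guess is wrong and the
    # correct token only shows up as its second guess.
    right = target_next(prefix)
    wrong = (right + 1) % 50
    ranked = [wrong, right] if len(prefix) % 4 == 3 else [right, wrong]
    return ranked[:width]


def build_tree(prefix: List[int], depth: int, width: int) -> Dict[int, dict]:
    # Draft a tree of continuations: each node maps a token to its subtree.
    # width=1 reduces to the classic single chain of guesses.
    if depth == 0:
        return {}
    return {
        tok: build_tree(prefix + [tok], depth - 1, width)
        for tok in draft_candidates(prefix, width)
    }


def verify_tree(prefix: List[int], tree: Dict[int, dict]) -> List[int]:
    # Follow the target's own choices down the tree. Each step emits the
    # target's token; the walk continues only while that token was drafted.
    accepted: List[int] = []
    node = tree
    while True:
        want = target_next(prefix + accepted)
        accepted.append(want)
        if want not in node:
            return accepted
        node = node[want]


if __name__ == "__main__":
    prefix = [1, 2, 3]
    chain = verify_tree(prefix, build_tree(prefix, depth=3, width=1))
    tree = verify_tree(prefix, build_tree(prefix, depth=3, width=2))
    print("tokens accepted with a single chain:", len(chain))
    print("tokens accepted with a width-2 tree:", len(tree))
```

The trade-off the sketch hides is cost: a wider tree gives the big model more chances to find a branch it agrees with, but it also has more nodes to score in each verification pass.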
The practical impact is on money and latency — the two constraints every company running AI feels directly. A two-to-four-times speedup with no retraining and no quality loss means a smaller cloud bill and a snappier product, which is precisely why systems-level inference work, usually invisible to the public, briefly outranked flashy model launches on a hacker forum.
The honest caveat is the part the community is busy stress-testing. The headline word is "lossless," the promise that output is identical to running the big model alone. In theory that holds: the big model only ever emits a word it would have chosen anyway, so the guessing cannot change the answer, only the speed. In practice, on the popular forum for running models locally, people report a more nuanced picture. The speedups are real, but the size of the speedup swings hard with the workload, and some users report minor quality wobble on complex tasks when the little draft model is too weak to guess well. The resolution is subtle. "Lossless" describes the decoding rule, and under exact verification that rule does not depend on how good the draft is: a wrong guess is rejected and replaced by the big model's own choice, so a weak draft costs speed rather than correctness. Any quality difference would therefore have to come from something other than the rule itself, for example an implementation that relaxes exact verification. The magnitude of the gain, by contrast, is never a fixed eight times; it depends entirely on the model pair, the batch size, and how predictable the text is. Treat "up to 8x" the way you'd treat any "up to" number: real, achievable in the best case, and not a promise for your case. For the bigger picture on why inference cost dominates AI economics, see training versus inference.
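A back-of-the-envelope model shows how strongly the gain depends on how predictable the text is. The sketch below is not taken from the DSpark or JetSpec material. It uses the standard simplified analysis of single-chain speculative decoding, which assumes each drafted token is accepted independently with the same probability `alpha`; both `alpha` and the draft length `k` are illustrative parameters.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    # Expected tokens emitted per big-model pass when the draft proposes k
    # tokens and each is accepted independently with probability alpha.
    # Closed form of 1 + alpha + alpha**2 + ... + alpha**k.
    if alpha >= 1.0:
        return float(k + 1)  # every guess accepted, plus one bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.8, 0.95):
        tokens = expected_tokens_per_pass(alpha, k=4)
        print(f"acceptance rate {alpha:.2f}: {tokens:.2f} tokens per pass")
```

This counts tokens per verification pass, not wall-clock speedup: the time spent running the draft model and the effect of batch size are left out, which is one more reason a headline multiplier will not transfer directly to a given deployment.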
Originally published on Ground Truth, where every claim is checked against the primary source.