Every LLM you have used works the same way. It predicts the next token, then the next, then the next. One at a time. Autoregressive generation. That is how GPT-4o works, how Claude works, how Gemini 2.5 Pro works.
Google just said: what if we stop doing that?
Gemini Diffusion generates text the way image models generate images. Instead of building a sentence left to right, it starts with noise and refines the entire output simultaneously. The claimed speedup is 5x over comparable autoregressive models.
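The difference between the two decoding strategies can be sketched in a few lines. This is a toy illustration only: `predict_next` and `denoise_step` are hypothetical stand-ins for the real models, which are vastly more complex.

```python
# Toy sketch of the two decoding strategies, not real model code.
# `predict_next` and `denoise_step` are hypothetical stand-ins.

def autoregressive(predict_next, length):
    # One token at a time: each step depends on everything before it,
    # so the steps cannot run in parallel.
    tokens = []
    for _ in range(length):
        tokens.append(predict_next(tokens))
    return tokens

def diffusion(denoise_step, noisy_seq, steps):
    # Start from noise and refine the WHOLE sequence each pass.
    # A pass touches every position at once, and there are far
    # fewer passes than there are tokens.
    seq = noisy_seq
    for _ in range(steps):
        seq = denoise_step(seq)
    return seq
```

The speedup claim rests on that structural difference: a fixed, small number of full-sequence refinement passes instead of one sequential step per token.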
I have been thinking about what this actually means for the way I build things.
The speed problem nobody talks about
When you are making a single API call, the difference between 2 seconds and 0.4 seconds does not matter much. But when you are running batch jobs — processing 500 documents, generating test cases for an entire codebase, summarizing a week of customer support tickets — that 5x adds up fast.
I run a lot of batch processing through Claude and GPT-4o. A typical overnight job hits the API maybe 2,000 times. At current speeds that takes roughly 3 hours. If diffusion models actually deliver on the 5x claim, that same job finishes in 36 minutes. That changes whether I can run it during lunch instead of overnight.
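The arithmetic behind those numbers is easy to check. The call count and the 3-hour baseline are from my own jobs; the 5x multiplier is Google's claimed figure, not something I have measured.

```python
# Back-of-envelope check of the batch-job estimate above.
# The 5x speedup is Google's claimed figure, not a measurement.
calls = 2000
baseline_hours = 3.0
claimed_speedup = 5.0

per_call_s = baseline_hours * 3600 / calls        # avg seconds per API call
diffusion_minutes = baseline_hours * 60 / claimed_speedup

print(f"{per_call_s:.1f}s per call")       # 5.4s per call
print(f"{diffusion_minutes:.0f} minutes")  # 36 minutes
```

5.4 seconds per call is in the right ballpark for a few hundred output tokens at current autoregressive speeds, which is why the 5x claim would matter so much at batch scale.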
Why I am skeptical
The 5x number comes from controlled benchmarks. Real-world performance depends on context length, output complexity, and how the model handles edge cases. I have seen plenty of impressive benchmark numbers that fall apart when you throw messy real data at them.
Also, diffusion models for text are genuinely new territory. Image diffusion had years of iteration before it got reliable. Text diffusion is maybe six months into serious research. The failure modes are different — you can get away with a slightly wrong pixel in an image, but a slightly wrong word in code breaks everything.
What I am actually going to do
Nothing yet. I am not rewriting any pipelines around a model that is still in research preview. But I am watching three things: whether the speed holds on long outputs (2,000+ tokens), whether Google actually ships a usable API at a reasonable price, and whether the output quality holds up on real code. The third is secondary if the pricing is wrong.
If all three check out, I will probably move my batch processing over first and keep interactive coding on Claude. Speed matters less when you are pair programming. It matters a lot when you are processing data at scale.
What got my attention is not this specific model; it is that someone proved the approach works at all. If Google can do it, Anthropic and OpenAI are probably working on their own versions. In a year we might look back at autoregressive-only models the way we look back at RNNs — technically functional but obviously not the final answer.
Or the whole thing might hit a wall at 1000 tokens and we are back to business as usual. Could go either way.