Sumi is a 7-billion-parameter diffusion language model, released fully open-weight by Tohoku NLP, that generates text by refining an entire passage over multiple passes rather than predicting one word at a time. It is the first openly available diffusion language model at a scale large enough for researchers to seriously test whether the approach can match conventional autoregressive models.
Key facts
- What: Most language models write one word after another, left to right. Sumi, a newly released 7-billion-parameter open-weight model, generates text the way image AIs make pictures: by refining a whole draft at once.
- When: 2026-06-20
- Primary source: arXiv 2606.19005
Nearly every language model in wide use today generates text autoregressively: one word at a time, left to right, each word chosen based on everything before it. Once a word is produced, it is committed; there is no going back to revise it. This approach has powered the entire chatbot era.
Diffusion language models offer a different path, borrowed from image generation. AI art tools don't paint stroke by stroke; they start with random noise and gradually refine the whole image over many passes. Diffusion language models do the same with text: they begin with a rough, garbled draft of the entire passage and repeatedly clean it up, all positions at once, until fluent text emerges.
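To make the contrast concrete, here is a minimal toy sketch in Python. It is not Sumi's actual algorithm, which this article does not describe in detail. It assumes a masked-style formulation in which the "garbled draft" is a row of `<mask>` placeholders, and `toy_denoiser` is a hypothetical stand-in for a trained network that simply looks up a fixed sentence and attaches a random confidence.

```python
import random

# Stand-in for what a trained model would predict (assumption for illustration).
TARGET = "the cat sat on the mat".split()
MASK = "<mask>"


def toy_denoiser(draft, rng):
    """Hypothetical stand-in for a trained network.

    Returns (predicted_token, confidence) for every position. A real model
    would compute both from learned weights and the current draft.
    """
    return [(TARGET[i], rng.random()) for i in range(len(draft))]


def autoregressive_generate(length):
    """Left to right: commit one token per step and never revisit it."""
    out = []
    for i in range(length):
        out.append(TARGET[i])
    return out


def diffusion_generate(length, passes, seed=0):
    """Start fully masked, then fill the most confident positions each pass."""
    rng = random.Random(seed)
    draft = [MASK] * length
    history = [list(draft)]
    for step in range(passes):
        masked = [i for i, tok in enumerate(draft) if tok == MASK]
        if not masked:
            break
        preds = toy_denoiser(draft, rng)
        # Spread the remaining masked positions over the remaining passes.
        remaining_passes = passes - step
        k = -(-len(masked) // remaining_passes)  # ceiling division
        chosen = sorted(masked, key=lambda i: preds[i][1], reverse=True)[:k]
        for i in chosen:
            draft[i] = preds[i][0]
        history.append(list(draft))
    return draft, history


if __name__ == "__main__":
    print(" ".join(autoregressive_generate(6)))
    final, history = diffusion_generate(6, passes=3)
    for snapshot in history:
        print(" ".join(snapshot))
```

Running it prints the draft after each pass. The positions fill in an order set by confidence rather than strictly left to right, and several positions are filled per pass, which is where the potential for parallelism comes from.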
The core appeal is revision. Because a diffusion model works on the whole passage simultaneously and refines it over multiple passes, it can go back and fix earlier words in light of later ones — something a strict left-to-right model can never do. That enables a kind of self-correction that is awkward for conventional models, and it also allows generating many parts of the text in parallel rather than strictly in sequence, which could be faster in some setups. For years this remained mostly a research curiosity, demonstrated at small scale and rarely with openly available weights.
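A tiny example shows what "fixing earlier words in light of later ones" can look like. The sketch below is an illustration under loud assumptions, not Sumi's method: `predict_article` and `confidence` are hypothetical toy functions standing in for a trained model, and the re-mask-then-refill step is one possible way a refinement pass could be organized.

```python
MASK = "<mask>"


def predict_article(next_word):
    """Hypothetical toy 'model': pick the article that fits the following word."""
    return "an" if next_word[0] in "aeiou" else "a"


def confidence(draft, i):
    """Toy confidence: an article is trusted only if it agrees with the word after it."""
    tok = draft[i]
    if tok in ("a", "an") and i + 1 < len(draft):
        return 1.0 if tok == predict_article(draft[i + 1]) else 0.1
    return 1.0


def revise(draft, threshold=0.5):
    """One refinement pass: re-mask low-confidence positions, then refill
    them using the context to their right."""
    draft = list(draft)
    low = [i for i in range(len(draft)) if confidence(draft, i) < threshold]
    for i in low:
        draft[i] = MASK
    for i in low:
        draft[i] = predict_article(draft[i + 1])
    return draft, low


if __name__ == "__main__":
    revised, changed = revise(["she", "ate", "a", "apple"])
    print(" ".join(revised), changed)  # she ate an apple [2]
```

A strict left-to-right generator would have committed to "a" before "apple" existed. Here the later word is already on the page, so the earlier one can be flagged and replaced. Whether a large model revises this usefully in practice is exactly the open question the article raises.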
What makes Sumi notable is the combination of scale and openness. At 7 billion parameters it is a genuinely mid-sized model, in the range of capable open models people actually run, and it was trained from scratch on an enormous amount of text. Its creators at Tohoku NLP released it fully openly, not just as a paper: the model weights are on Hugging Face and the code is on GitHub. That is the part that moves the field. Researchers and tinkerers can now download a real, non-trivial diffusion language model and study how it behaves, where it shines, and where it breaks, rather than taking a lab's word for it. Open releases like this are how a niche idea gets a fair, broad test.
An analogy captures the difference between the two styles. An autoregressive model is like a speaker giving a live, unscripted talk: fluent, but unable to un-say anything. A diffusion model is like a writer with a full draft and an eraser, sweeping over the whole page again and again, tightening a phrase here, fixing an earlier word there, until the whole thing reads well. Both can produce excellent results; they just get there by very different routes, and the writer's ability to revise is the thing researchers are most curious about.
The dominance of left-to-right generation is so total that it's easy to forget it's a choice, not a law of nature. Every serious, openly released alternative is a chance to learn whether the mainstream approach is truly best or merely entrenched. If diffusion language models can match conventional ones while adding genuine self-correction and parallel generation, that reshapes assumptions about how text AI should be built. Even if they can't quite match them yet, knowing where and why they fall short is valuable knowledge that only open models make possible.
The genuine caveat is that the headline promise — real, useful self-correction — still has to prove itself at this scale. It's one thing for the math to allow revision; it's another for a model this size to actually revise in ways that improve its answers rather than just churn. The hard, open question Sumi lets the community finally probe is whether diffusion's theoretical advantages show up in practice when the model is big enough to matter. That we can now ask the question with a real model in hand, openly, is the achievement.
Originally published on Ground Truth, where every claim is checked against the primary source.