A language model that writes by erasing, and now keeps up with the classics

#diffusionlanguagemodels #opensource #architecture #research

iLLaDA, an eight-billion-parameter diffusion language model, demonstrates that generating text by refining a whole passage at once can match conventional left-to-right models at the same scale. The model, described in a paper on arXiv with weights and code released, improves broadly over its predecessor across general knowledge, math, and coding tasks and stays competitive with a strong, similarly sized conventional model. That marks the first time the diffusion approach has held its ground at this scale.

Key facts

What: Almost every chatbot writes one word at a time, left to right. A newly released eight-billion-parameter model writes the way image AIs paint, refining a whole passage at once, and finally holds its own against conventional models of the same size.
When: 2026-06-25
Primary source: arXiv 2606.25331

Almost every AI chatbot in use today writes one word after another, strictly left to right, each new word chosen based on everything written so far. Once a word is out, it is committed. This approach, called autoregression, has powered the entire chatbot era. The alternative borrows its trick from image-generation AI. Picture-generating models start with a field of pure noise and refine it step by step into a coherent image, sharpening the whole canvas at once rather than painting one pixel at a time. iLLaDA does the language version of this. Instead of writing left to right, it starts with a passage where many words are blank, hidden behind a kind of mask, and then fills them in over several passes, refining the whole passage together. This family of models is called diffusion language models, and the appeal is straightforward: a writer who can see the whole draft at once and revise any part of it should, in principle, be better at planning ahead and at fixing the middle of a sentence after seeing the end.
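To make the fill-in-the-blanks loop concrete, here is a minimal Python sketch of the decoding idea. It is a toy under stated assumptions, not iLLaDA's actual code. The names `make_toy_predictor` and `diffusion_decode` are hypothetical, and so is the scoring rule. The "model" is a stand-in that already knows the answer and simply feels more sure about blanks that sit next to filled-in words. The strategy shown, committing the most confident guesses on each pass and leaving the rest masked, is one common choice for masked diffusion decoding. The paper may use something different.

```python
from typing import Callable, List, Tuple

MASK = "[MASK]"
Prediction = Tuple[str, float]  # (guessed token, confidence score)
Predictor = Callable[[List[str]], List[Prediction]]


def make_toy_predictor(target: List[str]) -> Predictor:
    """Build a hypothetical stand-in for a trained denoising network.

    It always guesses the token from `target`. Its score rises when a
    position's immediate neighbours are already filled in, with a small
    arbitrary tie-breaker that favours later positions.
    """
    def predict(tokens: List[str]) -> List[Prediction]:
        out: List[Prediction] = []
        for i in range(len(tokens)):
            neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
            known = sum(tokens[j] != MASK for j in neighbours)
            score = 0.4 + 0.3 * known + 0.01 * i
            out.append((target[i], score))
        return out
    return predict


def diffusion_decode(
    prompt: List[str], length: int, predict: Predictor, steps: int
) -> Tuple[List[str], List[List[str]]]:
    """Fill `length` masked slots after `prompt` over at most `steps` passes."""
    tokens = list(prompt) + [MASK] * length
    per_step = -(-length // steps)  # ceiling division: slots committed per pass
    history = [list(tokens)]
    for _ in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # One pass: the model looks at the WHOLE draft and guesses every slot.
        guesses = predict(tokens)
        # Commit only the most confident guesses; the rest stay masked.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = guesses[i][0]
        history.append(list(tokens))
    return tokens, history


if __name__ == "__main__":
    target = "the cat sat on the mat today".split()
    final, history = diffusion_decode(
        ["the"], 6, make_toy_predictor(target), steps=3
    )
    for draft in history:
        print(" ".join(draft))
```

Run as a script, the drafts fill in from both ends toward the middle rather than strictly left to right, which is the point of the illustration. A real model would replace the toy predictor with a neural network, and some decoding schemes can also re-mask and revise words that were already filled in, which this sketch does not do.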

For years the catch was that diffusion language models did not scale. They were a charming research curiosity that fell behind left-to-right models as soon as the stakes got serious. iLLaDA is the improved successor to an earlier model called LLaDA. It was trained from scratch on an enormous amount of text using the diffusion recipe all the way through, never falling back on the usual left-to-right method. On the paper's reported results, writing by refinement is no longer obviously the weaker choice at this scale.

For the whole modern era of AI, the field has placed one giant bet: that left-to-right prediction is the road to capable language models. iLLaDA is evidence that there is a second viable road, and viable roads are valuable even when the first one is working, because they tend to be good at different things. The researchers argue their approach has natural advantages for reasoning that runs both forward and backward, for planning over long stretches, and for squeezing more out of limited data, since it can revisit the same material from many angles rather than reading it once front to back. A field with two healthy architectures instead of one is a field with more room to improve. This is in the same spirit as earlier diffusion results, such as the open model that writes by refining a whole draft at once and the demonstration of text that arrives all at once.

The claim that iLLaDA is "competitive with a strong conventional model" needs careful reading. The comparison only means something if both models were trained with similar amounts of computing power and data, making it an apples-to-apples match rather than a flattering pairing. That is exactly the detail to scrutinize before declaring the gap closed. Independent groups reproducing the result is what would turn this from a promising paper into a settled fact. It is also worth being clear about what "competitive" is and is not. It is not "better than the best models in the world." It is "this overlooked approach can hang with a serious peer at the same weight class." After years of the diffusion idea trailing badly, that is a genuinely meaningful turn, and it is worth watching to see whether the road keeps climbing.

Originally published on Ground Truth, where every claim is checked against the primary source.

Top comments (1)

Luis Cruz • Jul 1

This is a really interesting direction because it challenges the usual assumption that language generation is purely additive (predicting tokens forward).

The idea of a model that “writes by erasing” suggests a more iterative refinement process, where generation is treated as repeated correction rather than single-pass completion. That aligns closely with how humans actually write—draft → delete → restructure → refine—rather than linear token prediction.

What’s especially compelling is how this connects to quality scaling without simply increasing parameter count. If a model can repeatedly revise its own output, it effectively gains a form of internal search / self-editing loop, which can significantly improve coherence and style alignment.

The claim about “keeping up with the classics” is less about memorization and more about emergent refinement behavior. If validated, this approach could sit somewhere between autoregressive generation and planning-based systems.

Curious how it handles stability—iterative deletion can easily lead to oscillation or loss of structure if not carefully constrained.