A diffusion language model generates text by starting from masked or otherwise corrupted tokens and iteratively restoring them. In this MacBook Air M2 demo, that idea shows up in its smallest, most hackable form: a toy character-level model that learns to recover missing characters from Karpathy's tiny Shakespeare dataset.
The GitHub project Encrux/simple_dlm really is small and direct. The author reports a roughly 7.5-million-parameter model, a 66-token vocabulary made of 65 characters plus [MASK], training on Karpathy's tiny Shakespeare dataset, and, after a few hours on an M2 Air with 16GB, sample outputs like: “To be, fo hend! ... be horse.” Not exactly publishable literature, but enough to show the machinery works.
In the discrete diffusion version used for text, corruption usually means replacing tokens with a mask token and training the model to recover the missing pieces. If that sounds a lot like masked language modeling, it should: that resemblance is the key reason these projects now feel much less forbidding.
Why a diffusion language model is suddenly a weekend project
The conceptual change came from recent masked diffusion language model work. In Simple and Effective Masked Diffusion Language Models, published on OpenReview, the authors argue that simple masked discrete diffusion is stronger than earlier results suggested, and describe the training objective as a mixture of masked language modeling losses. That is a much more approachable starting point than the older, scarier framing around diffusion for text.
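That “mixture of masked language modeling losses” framing can be sketched in a few lines. This is a minimal illustration, not the paper's or the repo's code: the `MASK` id and the `corrupt` helper are hypothetical names, and a real implementation would compute a cross-entropy over model logits at the masked positions rather than just collecting targets.

```python
import random

MASK = 65  # hypothetical id of the [MASK] token in a 66-token vocabulary

def corrupt(tokens, mask_rate, rng):
    """Replace each position with [MASK] independently with probability mask_rate."""
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            corrupted.append(MASK)
            targets.append(tok)    # the model must recover this token
        else:
            corrupted.append(tok)
            targets.append(None)   # no loss is computed at unmasked positions
    return corrupted, targets

rng = random.Random(0)
seq = [7, 3, 42, 19, 8]
noisy, targets = corrupt(seq, mask_rate=0.6, rng=rng)
# Training draws a fresh mask_rate per example and averages a masked-LM
# cross-entropy over only the masked positions, which is why the objective
# reads as a weighted mixture of MLM losses across corruption levels.
```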
That paper-to-repo bridge is pretty direct. Instead of treating text diffusion as an exotic continuous process, the repo implements a masked character recovery loop: mask tokens, predict the missing characters, repeat during sampling. Once you see diffusion as repeated masked-token reconstruction, a small hobby project stops looking mysterious and starts looking like a compact variation on familiar language-model training.
So the “weekend project” feeling is real, but only because this build removes almost every hard part at once:
- Character-level tokens instead of a full tokenizer
- Vocabulary of 66 instead of tens of thousands
- One text file as the corpus
- Tiny Shakespeare instead of a web-scale dataset
- Small model instead of a frontier-scale stack
That changes the implementation burden dramatically. No byte-pair encoding weirdness. No huge embedding tables. No distributed training. No data pipeline that needs a team to debug.
The denoising loop here is also simple enough to describe in four steps:
| Step | What happens |
|---|---|
| 1 | Start with text where many positions are replaced by [MASK] |
| 2 | Run the model to predict likely characters for masked positions |
| 3 | Fill in some of those positions |
| 4 | Repeat until the sequence is fully restored |
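The four steps above can be sketched as a toy sampler. Everything here is illustrative, not the repo's implementation: `denoise` is a hypothetical helper, and the stand-in `predict` function always returns the same character, where a real model would return per-position character distributions from a small transformer.

```python
import random

MASK = "?"  # stand-in for the [MASK] token

def denoise(length, predict, steps, rng):
    """Iteratively fill a fully masked sequence, a few positions per step."""
    seq = [MASK] * length                      # step 1: start fully masked
    for step in range(steps):
        masked = [i for i, c in enumerate(seq) if c == MASK]
        if not masked:
            break
        preds = predict(seq)                   # step 2: predict masked positions
        k = max(1, len(masked) // (steps - step))
        for i in rng.sample(masked, k):        # step 3: commit some predictions
            seq[i] = preds[i]
    return "".join(seq)                        # step 4: repeat until restored

# Trivial stand-in "model" that always predicts 'a'.
out = denoise(8, predict=lambda s: ["a"] * len(s), steps=4,
              rng=random.Random(0))
```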
In the repo, training is basically: put a single input.txt file in /data, run training, sample from checkpoints. The project also includes sampling and ONNX export. For learning purposes, that is a great setup because you can see the whole system at once.
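The character-level setup is similarly compact. As a sketch (the `build_vocab` helper is hypothetical, not the repo's code), deriving the vocabulary from the corpus takes a few lines, and tiny Shakespeare's 65 distinct characters plus [MASK] is exactly where the 66-token count comes from:

```python
def build_vocab(text):
    """Character-level vocab: every distinct character plus one [MASK] id."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    stoi["[MASK]"] = len(chars)   # mask token appended after the characters
    return stoi

corpus = "To be, or not to be"    # stands in for the single data/input.txt file
vocab = build_vocab(corpus)
encode = lambda s: [vocab[ch] for ch in s]
# On tiny Shakespeare, sorted(set(text)) yields 65 characters, so with
# [MASK] the vocabulary size comes out to 66, matching the repo's numbers.
```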
The same compression is why toy demos can be misleading. When people say “I built a diffusion language model on a laptop,” the important follow-up is: what kind of language model? Here, the answer is a tiny character generator with a mask-based objective and a miniature corpus.
What the MacBook build actually proves — and what it doesn't
It proves that the core mechanics of a diffusion language model are now reproducible on consumer hardware.
That matters in a narrow, practical way. You can train a working toy system, inspect the noise process, change the mask schedule, watch sampling behavior, and build intuition for failure modes.
It does not prove that laptop diffusion models are suddenly competitive with mainstream autoregressive models.
The repo does not present benchmark comparisons, and some of the most interesting runtime details are self-reported rather than independently verified. The claimed “few hours” training run on an M2 Air with 16GB comes from the project description. By contrast, the existence of training code, sampling scripts, a checkpoint workflow, and ONNX export support is visible directly in the repository.
The sample output is also telling: it has local Shakespeare-like texture, line breaks, and plausible fragments, but not sustained semantic coherence. Character-level models can look charmingly competent while still failing at the thing people usually mean by “language modeling.”
A quick way to think about it:
| What the demo shows | What it does not show |
|---|---|
| Masked discrete diffusion can be implemented compactly | That diffusion beats autoregressive LMs in general use |
| Consumer hardware can train a toy text generator | That consumer hardware is enough for relevant benchmarks |
| Character-level denoising produces recognizable style fragments | That the model tracks long-range meaning well |
| Recent research ideas are easier to reproduce than before | That scaling the approach is easy |
That last row is the trap. “Easy to reproduce” and “easy to scale” are very different claims.
The trade-offs hiding behind the easy demo
A toy language model gets easier as you make language less like real language.
Shrinking the vocabulary to characters is the biggest example. A 66-token vocabulary means the output space for each position is tiny. That makes training and debugging much simpler. It also removes the very problem that makes language modeling hard in practice: choosing among huge vocabularies while preserving meaning over long spans.
In plain language, a character model only has to pick letters and punctuation one step at a time. That is enough to learn that Shakespeare-like text often contains capital letters, commas, short names, and line breaks. So it can imitate style surprisingly early, before it has learned anything like robust sentence meaning or long-range structure.
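One way to quantify “tiny output space”: the per-position choice ceiling is log2 of the vocabulary size. A quick back-of-envelope comparison (using 50257, a common BPE vocabulary size, purely as a reference point):

```python
import math

# Upper bound on per-position uncertainty: log2(vocab size) bits.
char_bits = math.log2(66)      # ~6 bits per character position
bpe_bits  = math.log2(50257)   # ~15.6 bits per subword position

# Each character decision is a pick among 66 options; each subword decision
# in a large-vocabulary model is a pick among tens of thousands. The toy
# model's per-step problem is more than 2x smaller in bits, before even
# considering long-range structure.
```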
A small comparison makes the trade-off clearer:
| Setup choice | What gets easier | What it hides |
|---|---|---|
| 66 character tokens | Tiny output space, simple embeddings, easy debugging | Real vocabulary selection and semantic precision |
| tiny Shakespeare dataset | Fast training on a clean corpus | Broad-domain generalization, factual recall, instruction following |
| MacBook Air M2 | Consumer reproducibility | The compute needed for serious benchmarks and scaling experiments |
The tiny Shakespeare dataset hides another hard part: data diversity. Tiny Shakespeare is useful because it is small, public, and has obvious stylistic structure. It is not useful for telling you how a model handles broad-domain language, factual recall, instruction following, or code. A model that learns “dramatic-looking character patterns” can appear more coherent than it really is.
Then there is sampling. Diffusion models generate by repeated denoising steps rather than the one-token-at-a-time loop used by autoregressive models. That can bring advantages, but it also introduces schedule choices and step-count trade-offs that are easy to gloss over in toy settings.
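Those schedule choices are concrete knobs, not abstractions. A minimal sketch (both schedule functions are hypothetical examples, not the repo's code) shows what a schedule actually decides: what fraction of positions remain masked at each denoising step.

```python
import math

def linear_schedule(step, total):
    """Fraction of positions still masked after `step` of `total` steps."""
    return 1.0 - step / total

def cosine_schedule(step, total):
    """Cosine schedule: unmasks slowly at first, faster near the end."""
    return math.cos(0.5 * math.pi * step / total)

# With few steps the sampler commits many characters per pass, which is
# cheap but error-prone; with many steps each pass commits little, which
# is slower. The schedule decides how that budget is spent.
remaining = [round(cosine_schedule(s, 8), 2) for s in range(9)]
```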
The OpenReview paper reports strong results for masked diffusion on research benchmarks, including claims that perplexity comes close to autoregressive baselines in some settings. Those results belong to research-scale models, datasets, and evaluations. They are useful context for why this area feels more practical now, but they do not transfer directly to the MacBook build.
This is a good place to be wary of the same instinct people bring to leaderboard snapshots such as code arena rankings: a single visible result can compress a lot of hidden setup.
What readers can steal from this setup
The useful part is the pattern, not the sample prose.
When evaluating a diffusion language model, start with the choices that make the demo look easy:
| Easy demo choice | What it hides |
|---|---|
| Character-level tokenization | Real tokenization complexity and larger vocabularies |
| Single small corpus | Whether the model generalizes beyond one style |
| Runs on a MacBook Air M2 | Whether the method still works at benchmark-relevant scale |
Then check the implementation details that usually decide how meaningful the result is:
- Corruption objective
  - Ask whether it is plain masking, a weighted masking schedule, or something more complex.
  - “Diffusion” in text often turns out to mean an iterative masked-token recovery process, not something mystical.
- Output quality
  - Look for semantic consistency over paragraphs, not just stylish fragments.
  - Toy outputs are good for intuition, weak for performance claims.
That checklist generalizes well beyond diffusion. Small vocabularies, toy corpora, and handpicked examples are often the difference between a system that feels hackable and one that survives contact with real workloads.
Key Takeaways
- The Encrux repo shows that a diffusion language model can now be built as a compact, understandable toy project on a MacBook Air M2.
- The setup is tractable because it uses character-level tokens, a 66-token vocabulary, and the tiny Shakespeare dataset.
- Recent masked diffusion language model research, especially MDLM, helps explain why text diffusion now looks closer to masked language modeling than to an exotic new paradigm.
- The demo is useful for learning mechanics and failure modes, but it does not establish benchmark competitiveness or easy scaling to realistic language tasks.
- When evaluating similar projects, check tokenization, masking objective, corpus size, hardware scope, and whether outputs stay coherent beyond short stylistic fragments.
Further Reading
- Encrux/simple_dlm GitHub repository — Primary source code and README for the DIY diffusion language model project.
- Simple and Effective Masked Diffusion Language Models — OpenReview entry for the paper that reframes masked discrete diffusion as a simpler, stronger baseline.
- MDLM project page — Project overview of masked diffusion language modeling and its training objective; some downstream adoption claims need independent verification.
- tiny Shakespeare dataset — The small character-level corpus commonly used for toy text generation experiments.
Open questions remain around how far this masked character-level pattern carries once vocabularies, datasets, and evaluation standards become more realistic.
Originally published on novaknown.com