On May 7, 2026, ByteDance Seed released a 2B-parameter language model that does not generate text one token at a time. Cola DLM — short for Continuous Latent Diffusion Language Model — plans the whole passage in a continuous latent space, then decodes those latents back to words in a single pass. If you have only ever met autoregressive LLMs, this is the first non-autoregressive recipe that is open, scaled, and benchmarked against models you have heard of.
TL;DR
- Cola DLM separates what to say (a diffusion model over continuous latents) from how to say it (a Text VAE decoder).
- At matched ~2B parameters and ~2000 EFLOP training budgets, it is competitive with autoregressive baselines across 8 benchmarks and pulls ahead of LLaDA on late-stage scaling.
- It is not yet faster than autoregressive models in serving. The interesting part is the architecture, not the latency.
Background
An autoregressive language model — every GPT, Claude, Llama, and Gemini variant you have used — produces text by sampling token N conditioned on tokens 1 through N-1. The architecture has no place to hold a "plan" for the whole passage. Coherence emerges, when it does, from patterns baked into the weights. Diffusion language models break this assumption by producing all positions in parallel and refining them across diffusion steps. The first wave (LLaDA, SEDD) worked at the token level and struggled to match autoregressive quality. Cola DLM takes the next step: do the diffusion in a continuous semantic latent space instead of over discrete tokens.
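In symbols, the next-token recipe described above is the chain rule applied left to right. The second line is a generic latent-variable factorization, included only to show where a "plan" variable could sit; the notation ($x$, $z$, $p_\theta$) is an assumption for illustration and is not taken from the paper.

```latex
% Autoregressive: token n is conditioned on tokens 1 through n-1.
p_\theta(x_1, \dots, x_N) = \prod_{n=1}^{N} p_\theta(x_n \mid x_1, \dots, x_{n-1})

% Generic latent-variable view (assumed notation): sample a latent z, then decode x from it.
p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, dz
```

In the first form nothing is decided about token N until tokens 1 through N-1 exist. In the second, z is fixed before any token is written.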
A Text VAE first, then diffusion on its latents
Cola DLM has three pieces, applied in this order:
- A Text VAE learns to compress a passage into a sequence of continuous latent vectors and reconstruct it back.
- A block-causal DiT (diffusion transformer) trains on those latents — it learns the prior over the latent sequences the VAE produces.
- At inference, the DiT runs a standard diffusion process to produce a latent sequence, and the VAE decoder turns that sequence into tokens.
The diffusion is over what the passage means, not over individual tokens. The decoder handles the surface form.
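The "block-causal" part of the DiT can be pictured as an attention mask. The sketch below builds a generic block-causal mask in plain Python: a position may attend to everything in its own block and in all earlier blocks. The function name, the block size, and the way blocks are laid out are illustrative assumptions, not details from the Cola DLM release.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Positions are grouped into consecutive blocks of `block_size`. A position
    sees every position in its own block and in all earlier blocks, but
    nothing in later blocks.
    """
    if seq_len < 0 or block_size <= 0:
        raise ValueError("seq_len must be >= 0 and block_size must be > 0")
    return [
        [(j // block_size) <= (i // block_size) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Six latent positions in blocks of two (hypothetical sizes).
mask = block_causal_mask(seq_len=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

With a block size of 1 this reduces to an ordinary causal mask, and with a block size equal to the sequence length it becomes full bidirectional attention.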
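# Conceptual sketch of Cola DLM inference (dit, text_vae, conditioning are placeholders)
import torch

z = torch.randn(B, L_latent, D_latent)        # start from the Gaussian noise prior z_T
for t in reversed(range(T)):                  # T denoising steps, noisiest first
    z = dit.denoise_step(z, t, conditioning)  # each step refines the whole latent sequence
tokens = text_vae.decode(z)                   # one decoding pass from latents to text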
Compare this to autoregressive inference, where the loop runs L_token times — one forward pass per output token — and each step depends on the previous one. Cola DLM's outer loop runs T diffusion steps over the full latent sequence in parallel.
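The comparison above comes down to counting sequential model calls. Here is a minimal sketch of that arithmetic, following the conceptual loop in this section; the function names and the example sizes (512 tokens, 50 steps) are made up for illustration.

```python
def autoregressive_sequential_passes(num_output_tokens: int) -> int:
    """One forward pass per generated token, each waiting on the previous one."""
    if num_output_tokens < 0:
        raise ValueError("num_output_tokens must be >= 0")
    return num_output_tokens


def latent_diffusion_sequential_passes(num_diffusion_steps: int) -> int:
    """T denoising passes over the full latent sequence, plus one decode pass."""
    if num_diffusion_steps < 0:
        raise ValueError("num_diffusion_steps must be >= 0")
    return num_diffusion_steps + 1


print(autoregressive_sequential_passes(512))   # 512
print(latent_diffusion_sequential_passes(50))  # 51
```

Pass counts are not wall-clock time. A cached autoregressive pass processes a single new token, while each diffusion pass touches the whole latent sequence, so fewer sequential passes does not by itself mean lower latency.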
Why "continuous" and "hierarchical" both matter
Earlier diffusion LMs operated at the discrete token level and had to handle the categorical noise model explicitly, which is what LLaDA does. Cola DLM pushes the noise into continuous Gaussian space, where the standard tools of image diffusion (DDPM schedulers, classifier-free guidance, EDM samplers) port over cleanly. The hierarchical part is that the Text VAE compresses sequence length: one latent token covers several text tokens. So the diffusion model does global planning at one resolution, and the decoder does local realization at another. In the paper's framing, this separates global semantic organization from local textual realization.
What this buys you, empirically: across 8 benchmarks at matched ~2B parameter budgets and ~2000 EFLOP training compute, Cola DLM is competitive everywhere and shows the strongest late-stage scaling gains versus both autoregressive and LLaDA baselines.
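To make "the standard tools of image diffusion" concrete, here is a minimal sketch of two of them on a toy latent vector: DDPM-style Gaussian forward noising and the usual classifier-free guidance combination. These are the textbook forms; whether Cola DLM uses this exact schedule or guidance formulation is not stated here, and all names and numbers are illustrative.

```python
import math
import random


def add_gaussian_noise(latent, alpha_bar, rng):
    """Forward noising: z_t = sqrt(alpha_bar) * z_0 + sqrt(1 - alpha_bar) * eps.

    alpha_bar = 1.0 keeps the clean latent; alpha_bar = 0.0 is pure noise.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must be in [0, 1]")
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * z + noise_scale * rng.gauss(0.0, 1.0) for z in latent]


def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine predictions: eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


rng = random.Random(0)
clean_latent = [0.5, -1.0, 2.0]
print(add_gaussian_noise(clean_latent, alpha_bar=1.0, rng=rng))  # [0.5, -1.0, 2.0]
print(add_gaussian_noise(clean_latent, alpha_bar=0.1, rng=rng))  # mostly noise
print(classifier_free_guidance([0.0, 1.0], [2.0, 2.0], guidance_scale=2.0))  # [4.0, 3.0]
```

Neither function needs a vocabulary or a categorical distribution, which is the practical meaning of moving the noise into continuous space.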
Running it
Weights, configs, and training/eval code are at github.com/ByteDance-Seed/Cola-DLM, and the model card is at huggingface.co/ByteDance-Seed/Cola-DLM. The release is compatible with Hugging Face Transformers, so you can load it the same way you would load any other model from the Hub.
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and checkpoint from the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("ByteDance-Seed/Cola-DLM")
model = AutoModel.from_pretrained("ByteDance-Seed/Cola-DLM").eval().cuda()

prompt = "Explain diffusion language models to a junior engineer:"
input_ids = tok(prompt, return_tensors="pt").input_ids.cuda()

# num_diffusion_steps is illustrative; check the README for the real argument names.
out = model.generate(input_ids, num_diffusion_steps=50)
print(tok.decode(out[0], skip_special_tokens=True))
Read the repo README for the actual sampler arguments — the API exposes diffusion-specific knobs (steps, guidance scale) that the autoregressive Transformers API does not have.
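Once you know the real sampler arguments, the first thing worth trying is a sweep over the step count. The sketch below is a small timing harness that works with any callable you hand it. The name `sweep_diffusion_steps` and the `generate_fn(prompt, steps)` signature are hypothetical, and the stand-in generator exists only so the example runs without the model.

```python
import time
from typing import Callable


def sweep_diffusion_steps(
    generate_fn: Callable[[str, int], str],
    prompt: str,
    step_counts: list[int],
) -> list[dict]:
    """Call generate_fn once per step count and record the output and wall time."""
    results = []
    for steps in step_counts:
        start = time.perf_counter()
        text = generate_fn(prompt, steps)
        elapsed = time.perf_counter() - start
        results.append({"steps": steps, "seconds": elapsed, "text": text})
    return results


def fake_generate(prompt: str, steps: int) -> str:
    """Stand-in for a real sampler; replace with a wrapper around the actual model."""
    return f"{prompt} [sampled with {steps} steps]"


results = sweep_diffusion_steps(fake_generate, "Hello", [10, 25, 50])
for row in results:
    print(row["steps"], f"{row['seconds']:.6f}s", row["text"])
```

To use it with the real checkpoint, wrap the tokenizer and generate calls in a function with the same two-argument shape and pass that in place of `fake_generate`.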
What it changes for builders
For production work today, almost nothing. Autoregressive models are still faster per token on the hardware most teams have, and the open Cola DLM checkpoint is a 2B research artifact, not a deployed system. What it does change is the mental model you should carry into the next two years. If you have been quietly assuming "language model" and "next-token predictor" are synonyms, that assumption is now testable against an open competitor. The interesting consequence will probably not be "diffusion replaces autoregressive". It will be hybrid architectures where a planning model runs at one resolution and a decoding model runs at another. Cola DLM is a clean example of that pattern.
Caveats and open questions
- Inference speed. Diffusion LMs allow parallel decoding in theory, but current implementations trail well-optimized autoregressive serving. Together AI's consistency-diffusion work claims ~14x speedups, but those are measured against a slow baseline.
- Scaling beyond 2B. The paper's strongest claim is the late-stage scaling slope. Whether that slope holds past ~2000 EFLOPs of training is open.
- Long context. The VAE's compression ratio interacts with effective context length in ways the paper does not fully characterize.
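The long-context caveat is easier to reason about with numbers. Here is a minimal sketch of how a compression ratio maps text tokens to latent positions, and what that does to the number of attention pairs the diffusion model sees. The ratio of 4 and the 4096-token passage are hypothetical, since the text above only says one latent covers "several" text tokens.

```python
def latent_length(num_text_tokens: int, compression_ratio: int) -> int:
    """Latent positions needed to cover the text, rounding up."""
    if num_text_tokens < 0 or compression_ratio <= 0:
        raise ValueError("need num_text_tokens >= 0 and compression_ratio > 0")
    return -(-num_text_tokens // compression_ratio)  # integer ceiling division


def max_text_tokens(latent_context: int, compression_ratio: int) -> int:
    """Text tokens that fit when the latent sequence is capped at latent_context."""
    if latent_context < 0 or compression_ratio <= 0:
        raise ValueError("need latent_context >= 0 and compression_ratio > 0")
    return latent_context * compression_ratio


def attention_pairs(seq_len: int) -> int:
    """Query-key pairs in full self-attention over seq_len positions."""
    return seq_len * seq_len


tokens = 4096
ratio = 4  # hypothetical compression ratio
latents = latent_length(tokens, ratio)
print(latents)                                               # 1024
print(max_text_tokens(latent_context=1024, compression_ratio=ratio))  # 4096
print(attention_pairs(tokens) // attention_pairs(latents))   # 16
```

The arithmetic only covers sequence length. It says nothing about how much detail each latent can carry, which is the part the paper leaves open.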
If you want one weekend project: clone the repo, run generate.py on a few of your usual prompts, and compare to a same-size autoregressive baseline (Qwen2.5-1.5B is a fair reference). The samples will not blow you away. The architecture will.
Paper: arxiv.org/abs/2605.06548. Code: github.com/ByteDance-Seed/Cola-DLM.