Diffusion LLMs: How Text Diffusion Is Challenging Autoregression

Diffusion language models (dLLMs) abandon left-to-right autoregressive generation, instead refining masked noise into text over a few parallel denoising steps. Inception Labs' Mercury Coder runs at 1,100+ tokens per second on H100s versus 50-200 for autoregressive models, and LLaDA 8B's bidirectional design breaks the reversal curse. They still trail the best models on hard reasoning benchmarks, but the one-token-at-a-time assumption is no longer a law of nature.

Aisha Patel

Jun 12, 2026

For three years, every frontier large language model you have used has worked the same way: left to right, one token at a time. Diffusion LLMs propose to throw that constraint out entirely — and in 2026 they finally have the benchmarks to make the rest of the field nervous.

The pitch is simple to state and radical in its consequences. Instead of writing a sentence word by word, a diffusion language model starts with a blob of masked, garbled noise and refines the whole thing at once, over a handful of passes, until coherent text emerges. It is the same coarse-to-fine process that powers image generators like Midjourney and video models like Sora — now pointed at text and code, where it was long assumed not to work.

This piece is a technical look at how diffusion LLMs actually work, why they are suddenly fast and good enough to matter, and where the architecture still falls short of the autoregressive incumbents.

The autoregressive bottleneck

To understand why anyone would abandon the dominant paradigm, you have to understand its central limitation.

An autoregressive model — GPT, Claude, Llama, Gemini, nearly everything in production — factorizes the probability of a sequence into a chain of conditional predictions. Token n depends on tokens 1 through n-1. This is elegant and trains beautifully, but it imposes a hard rule at inference time: you cannot generate token n until every token before it exists.
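Written out, the factorization described above is the chain rule of probability. The notation is chosen here for illustration rather than taken from a specific paper: x_n is the n-th token, N is the sequence length, and θ stands for the model parameters.

```latex
p_\theta(x_1, \dots, x_N) = \prod_{n=1}^{N} p_\theta(x_n \mid x_1, \dots, x_{n-1})
```

During training every token is already known, so all N factors can be evaluated in one parallel pass. During generation the n-th factor needs concrete values for x_1 through x_{n-1}, which is where the sequential bottleneck comes from.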

Generation is inherently sequential. A token cannot be generated until all the text that comes before it has been generated, and producing each one requires a full forward pass through billions of parameters.

That sequential dependency is the reason a frontier model can crawl along at 50 tokens per second on a long reasoning trace. The industry's current answer — test-time compute, where models "think" by generating ever-longer chains of intermediate tokens — makes the problem worse, not better. More reasoning means more sequential tokens, which means ballooning latency and inference cost. You are paying, linearly, for every word the model thinks before it answers.

How diffusion language models flip the script

Diffusion models attack the bottleneck at its root by removing the left-to-right constraint.

The clearest reference implementation is LLaDA (Large Language Diffusion with mAsking), the first discrete diffusion model trained from scratch at 8B scale to rival LLaMA3 8B. Its mechanism is worth understanding in detail, because it is representative of the class.

During training, a forward process randomly masks a fraction of the tokens in a sequence. The model — a Transformer mask predictor — is then trained to recover the originals. Crucially, this Transformer omits causal masking. Where an autoregressive model is structurally forbidden from looking at future tokens, LLaDA's predictor sees the entire sequence bidirectionally and predicts all masked positions simultaneously, trained with a plain cross-entropy loss.
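The training recipe above can be sketched in a few lines of plain Python. This is a toy illustration, not LLaDA's code: the `MASK` symbol, the fixed `mask_ratio`, and the `uniform_predictor` stand-in for the Transformer are all assumptions made for the example, and any weighting terms the published objective may apply are omitted.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask symbol for this sketch


def forward_mask(tokens, mask_ratio, rng):
    """Forward process: mask each token independently with probability mask_ratio."""
    is_masked = [rng.random() < mask_ratio for _ in tokens]
    noisy = [MASK if m else tok for tok, m in zip(tokens, is_masked)]
    return noisy, is_masked


def masked_cross_entropy(predicted_probs, targets, is_masked):
    """Average negative log-probability of the true token, over masked positions only.

    predicted_probs[i] is a dict mapping candidate tokens to probabilities for
    position i, as a bidirectional mask predictor would produce in one pass.
    """
    losses = [
        -math.log(predicted_probs[i][targets[i]])
        for i in range(len(targets))
        if is_masked[i]
    ]
    return sum(losses) / len(losses) if losses else 0.0


def uniform_predictor(noisy, vocab):
    """Stand-in for the Transformer: equal probability for every vocabulary item."""
    return [{tok: 1.0 / len(vocab) for tok in vocab} for _ in noisy]


vocab = ["the", "cat", "sat", "on", "mat"]
sentence = ["the", "cat", "sat", "on", "the", "mat"]
rng = random.Random(0)

noisy, is_masked = forward_mask(sentence, mask_ratio=0.5, rng=rng)
loss = masked_cross_entropy(uniform_predictor(noisy, vocab), sentence, is_masked)
print(noisy, round(loss, 3))
```

An untrained uniform predictor scores log(len(vocab)) whenever at least one position is masked; training would push that number down by teaching the predictor to use the unmasked context on both sides of each gap.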

At inference, the process runs in reverse. The model starts from a fully masked sequence (the discrete analogue of pure noise) and, over a series of denoising steps, progressively unmasks tokens, predicting many positions in parallel at each step and using flexible remasking to revise low-confidence guesses. The training objective fits the same picture: rather than optimizing exact log-likelihood, LLaDA optimizes a variational bound on it (the ELBO), the same principled objective family that underpins image diffusion.
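A minimal sketch of that reverse process, under loud assumptions: `toy_predictor` is a hypothetical stand-in for a trained model (it simply knows the `TARGET` sentence and reports higher confidence for positions with revealed neighbours), and the even unmasking schedule is one simple choice among many. In this simplified version, committed tokens are frozen; only uncommitted guesses are discarded and re-predicted, whereas a fuller remasking scheme could also revisit tokens that were already revealed.

```python
MASK = "[MASK]"

# Hypothetical sentence the toy predictor "knows"; a real model has no such oracle.
TARGET = ["diffusion", "models", "refine", "text", "in", "parallel"]


def toy_predictor(seq):
    """Stand-in for a trained bidirectional mask predictor (hypothetical).

    Returns one (token, confidence) guess per position. Confidence grows with
    the number of already-revealed neighbours, mimicking a model that is surer
    of a token once it can see context around it.
    """
    guesses = []
    for i in range(len(seq)):
        neighbours = [seq[j] for j in (i - 1, i + 1) if 0 <= j < len(seq)]
        revealed = sum(tok != MASK for tok in neighbours)
        confidence = 0.4 + 0.3 * revealed - 0.01 * i  # small tie-break by position
        guesses.append((TARGET[i], confidence))
    return guesses


def denoise(length, num_steps, predictor):
    """Reverse process: start fully masked, commit the most confident guesses each step."""
    seq = [MASK] * length
    history = [list(seq)]
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        guesses = predictor(seq)  # one parallel pass covers every position
        # Spread the remaining masked positions evenly over the remaining steps.
        remaining_steps = num_steps - step
        n_commit = -(-len(masked) // remaining_steps)  # ceiling division
        # Keep the n_commit most confident guesses; the rest stay masked and
        # are predicted again next step with more context available.
        by_confidence = sorted(masked, key=lambda i: guesses[i][1], reverse=True)
        for i in by_confidence[:n_commit]:
            seq[i] = guesses[i][0]
        history.append(list(seq))
    return seq, history


final, history = denoise(length=len(TARGET), num_steps=3, predictor=toy_predictor)
for state in history:
    print(" ".join(state))
```

With three steps, the six positions are revealed two at a time, so the whole sentence costs three predictor calls instead of six. Passing `num_steps=1` reveals everything in a single pass, and `num_steps=6` degenerates to one token per pass.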

The practical upshot, as its proponents pitch it: a diffusion LLM can act as a drop-in replacement for an autoregressive one. It targets the same use cases (RAG, tool use, agentic workflows) but arrives at the answer by refining the whole response at once instead of dictating it word by word.

Speed: the headline number

The reason diffusion LLMs broke into the mainstream conversation is throughput, and the clearest commercial proof point is Mercury from Inception Labs — billed as the world's first commercial-scale diffusion LLM.

Inception's founders are not lightweights: they co-invented techniques as load-bearing as Direct Preference Optimization and Flash Attention. Their coding model, Mercury Coder, posts throughput numbers that simply do not exist in the autoregressive world:

Model	Throughput (tokens/sec)	HumanEval
Mercury Coder Mini	1109	88.0
Mercury Coder Small	737	90.0
Gemini 2.0 Flash-Lite	201	90.0
Claude 3.5 Haiku	61	86.0
GPT-4o Mini	59	88.0

Read the throughput column twice. Speed-optimized autoregressive models top out around 200 tokens per second; frontier models can run below 50. Mercury serves over 1000 tokens per second on commodity NVIDIA H100s — a regime that previously required specialized inference silicon from Groq, Cerebras, or SambaNova. And because diffusion's algorithmic speedup is orthogonal to hardware acceleration, the gains would compound on faster chips, not compete with them.
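To make the gap concrete, the ratios can be worked out directly from the throughput column. The sketch below uses only the figures in the table; the 1,000-token response length is an arbitrary illustrative choice, and dividing length by throughput ignores time to first token, batching, and network overhead.

```python
# Throughput figures (tokens/sec) copied from the table above.
throughput = {
    "Mercury Coder Mini": 1109,
    "Mercury Coder Small": 737,
    "Gemini 2.0 Flash-Lite": 201,
    "Claude 3.5 Haiku": 61,
    "GPT-4o Mini": 59,
}


def speedup(model, baseline):
    """How many times faster `model` streams tokens than `baseline`."""
    return throughput[model] / throughput[baseline]


def seconds_for(model, n_tokens):
    """Naive generation time: output length divided by steady-state throughput."""
    return n_tokens / throughput[model]


for baseline in ("Gemini 2.0 Flash-Lite", "Claude 3.5 Haiku", "GPT-4o Mini"):
    ratio = speedup("Mercury Coder Mini", baseline)
    print(f"Mercury Coder Mini vs {baseline}: {ratio:.1f}x")

for model in ("Mercury Coder Mini", "GPT-4o Mini"):
    print(f"{model}: {seconds_for(model, 1000):.1f} s for 1,000 tokens")
```

By these numbers Mercury Coder Mini is about 5.5x faster than Gemini 2.0 Flash-Lite and about 18x faster than Claude 3.5 Haiku or GPT-4o Mini, which works out to roughly one second versus roughly seventeen for a 1,000-token response.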

The momentum has continued: Inception launched Mercury 2 in February 2026, marketed as roughly 5x faster than leading speed-optimized models, and Google DeepMind shipped Gemini Diffusion as an experimental text-diffusion model. This is no longer a single research curiosity.

The subtler win: bidirectionality

Speed gets the headlines, but the architectural consequence that should interest researchers most is what bidirectional generation makes possible.

Because a diffusion model considers the whole sequence at once, it is not married to its earlier tokens. It can revise. Inception argues this lets dLLMs correct their own mistakes and hallucinations mid-generation — editing the draft rather than committing irreversibly to each word. That same property enables genuinely controllable generation: infilling text, generating tokens in arbitrary order, and conforming reliably to a user-specified format or schema.

The most striking demonstration is the reversal curse. Autoregressive models famously struggle to answer "B is A" when they only ever saw "A is B" during training, a direct artifact of left-to-right factorization. LLaDA, generating bidirectionally, breaks this curse: on a reversed-poem completion task it reportedly surpasses even GPT-4o. That is not a speed win. That is the architecture solving a problem the dominant paradigm is structurally bad at.

Where diffusion LLMs still lose

It would be irresponsible to present this as a settled victory. It is not.

The benchmark tables tell an honest story: on LiveCodeBench, Mercury Coder Small scores 25.0 against Claude 3.5 Haiku's 31.0 and DeepSeek Coder V2 Lite's 37.8. On harder reasoning-heavy code tasks, the speed-optimized autoregressive models still edge ahead on raw quality. Going by the throughput table above, diffusion buys roughly a 4-6x speedup over Gemini 2.0 Flash-Lite and 12-19x over Claude 3.5 Haiku and GPT-4o Mini; it does not yet buy you a clean lead at the frontier of capability.

There are deeper open questions, too. Diffusion training optimizes a bound on likelihood rather than the exact objective, and the number of denoising steps trades quality against speed in ways that are still being characterized. A January 2026 arXiv paper pointedly titled "The Bitter Lesson of Diffusion Language Models for Agentic Workflows" exists precisely because the agentic results are more mixed than the throughput charts suggest. The ecosystem (tooling, fine-tuning recipes, serving infrastructure) is also years less mature than the autoregressive stack it hopes to displace.

The knob that doesn't exist in autoregression

There is one more property worth dwelling on, because it has no clean analogue in the autoregressive world: the quality-versus-speed dial.

An autoregressive model's generation cost is essentially fixed by the length of the output — one forward pass per token, full stop. A diffusion model, by contrast, lets you choose how many denoising steps to run. Fewer steps means faster, rougher output; more steps means slower, more polished output, refining the same draft further toward coherence. The model is not committing irrevocably to each word, so you can spend more compute improving a draft you already have rather than re-deriving it from scratch.
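A back-of-the-envelope way to see the dial is to count forward passes. The function names below are made up for this sketch, and the comparison is deliberately crude: a diffusion pass processes the whole sequence while a cached autoregressive pass processes one new token, so pass counts are not a direct latency ratio.

```python
def forward_passes(output_tokens, denoising_steps=None):
    """Count model forward passes needed to produce `output_tokens` tokens.

    denoising_steps=None models plain autoregressive decoding (one pass per
    token). Otherwise the count is the chosen step budget, which does not
    depend on output length.
    """
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    if denoising_steps is None:
        return output_tokens
    if denoising_steps < 1:
        raise ValueError("denoising_steps must be at least 1")
    return denoising_steps


def tokens_per_pass(output_tokens, denoising_steps=None):
    """Average number of tokens finalized per forward pass."""
    passes = forward_passes(output_tokens, denoising_steps)
    return output_tokens / passes if passes else 0.0


output_tokens = 512
print("autoregressive:", forward_passes(output_tokens), "passes")
for steps in (16, 64, 256):
    per_pass = tokens_per_pass(output_tokens, steps)
    print(f"diffusion, {steps} steps: {per_pass:.0f} tokens per pass")
```

For a 512-token answer, the autoregressive count is pinned at 512 passes, while the diffusion budget can be set anywhere from a handful of passes (many tokens finalized per pass, rougher output) up to one pass per token.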

That reframes inference economics. Instead of paying linearly for every token of a long reasoning trace — the autoregressive tax — you can allocate a budget of refinement passes and stop when the answer is good enough. Inception explicitly pitches this as the route to "advanced reasoning" that still completes in seconds, sidestepping the minutes-long latency that current autoregressive reasoning models incur when they think out loud. Whether that promise holds across hard tasks is exactly what the field is now stress-testing.

Why this matters now

The strategic logic is what makes diffusion LLMs more than an academic footnote.

The entire industry has bet on test-time compute as the path to better reasoning, and that bet has a tax: latency and cost that scale with how much the model thinks. Diffusion offers a different trade — think in parallel, refine globally, finish in seconds. For latency-sensitive products that today are forced to ship a smaller, dumber model just to hit a response-time budget, a dLLM that delivers a larger model's quality at a fraction of the latency is not a marginal improvement. It changes what is buildable.

That is why the relevant question for 2026 is no longer "does diffusion work for text?" — Mercury and LLaDA settled that. The question is whether the quality gap at the frontier closes before the incumbents borrow diffusion's best ideas for themselves.

The Bottom Line

Diffusion LLMs are the first credible challenge to autoregression in years. They are genuinely, measurably faster — 1000+ tokens per second versus a frontier crawl — and their bidirectional design solves problems, like the reversal curse, that left-to-right models are built to fail. They are not yet beating the best models on the hardest reasoning benchmarks, and the surrounding ecosystem is young. But the direction of travel is clear: the assumption that language must be generated one token at a time, left to right, is no longer a law of nature. It is now just one option among at least two.
