
Speculative Decoding: How a Tiny Draft Model Doubles LLM Speed

Speculative decoding speeds up LLM inference 2-6x by having a small draft model propose tokens that the target model verifies in parallel via rejection sampling, guaranteeing lossless output. EAGLE-3 and Medusa reduce or remove the separate draft model. Gains are largest at low batch sizes.

Aisha Patel

Jun 15, 2026

Every token your favorite chatbot writes is the product of a small miracle of waste. To produce a single word, a large language model loads all of its billions of weights from GPU memory, runs one full forward pass, and emits exactly one token. Then it does the whole thing again. For a 70-billion-parameter model generating a 500-token answer, that is 500 sequential trips through the entire network.

Speculative decoding is the trick that breaks this one-token-at-a-time tyranny, and it does so without changing the distribution the model's output is drawn from. It is now standard in production inference engines, and understanding it explains why the same model can feel two to three times faster from one provider to the next.

The real bottleneck: memory bandwidth, not math

The instinct is to assume LLM inference is slow because the math is hard. It isn't. During generation, a modern GPU's arithmetic units sit largely idle, waiting for model weights to stream in from high-bandwidth memory. At the small batch sizes typical of interactive use, token-by-token generation is memory-bandwidth-bound, not compute-bound.

This is the crucial insight that makes speculative decoding possible. If you process one token, you pay the full cost of loading every weight. But if you process several candidate tokens in a single forward pass, you load those same weights once and amortize the cost across all of them. The GPU was going to wait on memory anyway; you might as well give it more arithmetic to do while it waits.
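
A back-of-envelope calculation makes the amortization concrete. The sketch below assumes 16-bit weights and an illustrative 3 TB/s of aggregate memory bandwidth; both figures are assumptions chosen for round numbers, not measurements, and the model ignores KV-cache traffic and multi-GPU overheads.

```python
def weight_read_seconds(params: float, bytes_per_param: float,
                        bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one forward pass: the time to stream every weight once."""
    return params * bytes_per_param / bandwidth_bytes_per_s


PARAMS = 70e9          # 70B-parameter model, as in the example above
BYTES_PER_PARAM = 2    # assumption: 16-bit weights
BANDWIDTH = 3e12       # assumption: 3 TB/s of aggregate memory bandwidth

per_pass = weight_read_seconds(PARAMS, BYTES_PER_PARAM, BANDWIDTH)


def ms_per_token(tokens_per_pass: int) -> float:
    """Weight-streaming cost per token when one pass handles several tokens."""
    return per_pass / tokens_per_pass * 1e3


for n in (1, 2, 4):
    print(f"{n} token(s) per pass: {ms_per_token(n):.1f} ms of weight traffic per token")
```

Under these assumed numbers a single pass costs roughly 47 ms of weight traffic no matter how many tokens it handles, so every extra token that rides along in the same pass cuts the per-token cost proportionally.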

The problem is that autoregressive generation is inherently sequential — token n+1 depends on token n. You cannot verify a token you haven't generated yet. Unless, that is, you guess.

Draft and verify: the core algorithm

Speculative decoding splits the work between two models:

A small, fast draft model proposes a chunk of K candidate tokens, generating them autoregressively but cheaply.
The large target model — the one whose quality you actually want — verifies all K candidates in a single forward pass.

Because the target model can score all K draft tokens in parallel (it is checking tokens that already exist, not generating new ones), it pays the memory cost once instead of K times. If the draft model guessed well, you just produced several tokens for the price of one full target pass.
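
The loop can be sketched with toy stand-ins for the two models. Everything here is hypothetical and for illustration only: `target_next` and `draft_next` are deterministic toy functions rather than real networks, decoding is greedy, and the "single forward pass" is simulated by counting one pass per verification round.

```python
from typing import Callable, List, Tuple

Token = str
NextToken = Callable[[List[Token]], Token]

TEXT = list("the cat sat on the mat")
EOS = "<eos>"


def target_next(ctx: List[Token]) -> Token:
    """Toy target model: recites TEXT, then emits EOS."""
    return TEXT[len(ctx)] if len(ctx) < len(TEXT) else EOS


def draft_next(ctx: List[Token]) -> Token:
    """Toy draft model: agrees with the target except at every seventh position."""
    return "?" if len(ctx) % 7 == 3 else target_next(ctx)


def plain_greedy(target: NextToken) -> Tuple[List[Token], int]:
    out: List[Token] = []
    passes = 0
    while not out or out[-1] != EOS:
        out.append(target(out))
        passes += 1
    return out, passes


def speculative_greedy(target: NextToken, draft: NextToken, k: int) -> Tuple[List[Token], int]:
    out: List[Token] = []
    passes = 0
    while not out or out[-1] != EOS:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. One (simulated) target pass scores all k + 1 prefixes at once.
        passes += 1
        verified = [target(out + proposal[:i]) for i in range(k + 1)]
        # 3. Keep the longest agreeing prefix, then the target's own token
        #    (a correction after a mismatch, a bonus token otherwise).
        n = 0
        while n < k and proposal[n] == verified[n]:
            n += 1
        out.extend(proposal[:n])
        out.append(verified[n])
        if EOS in out:  # drop anything drafted past the end of the sequence
            out = out[: out.index(EOS) + 1]
    return out, passes


plain_out, plain_passes = plain_greedy(target_next)
spec_out, spec_passes = speculative_greedy(target_next, draft_next, k=4)
print("same output:", spec_out == plain_out)
print("target passes:", plain_passes, "plain vs", spec_passes, "speculative")
```

The toy draft is deliberately wrong at regular intervals so that both the accept path and the correction path are exercised; the final text matches plain greedy decoding while the target is invoked far less often.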

The elegant part is the verification step. The target model doesn't blindly trust the draft. It walks through the candidates in order and, using a modified rejection sampling scheme, accepts draft tokens for as long as they are consistent with its own probability distribution, then replaces the first rejected token with one drawn from its own corrected distribution. If every draft token survives, the same pass also yields one extra token from the target.

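The result is provably lossless in the statistical sense: the emitted tokens follow exactly the same distribution as tokens sampled from the target model alone, and under greedy decoding they are the same tokens the target would have picked on its own (up to floating-point effects). You are not trading quality for speed; you are getting the target model's output, faster.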
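
The accept-or-resample rule behind this guarantee can be sketched for a single position. With p the target's distribution and q the draft's, a drafted token x is accepted with probability min(1, p(x)/q(x)); on rejection, a replacement is drawn from the normalized positive part of p - q. The function names and the three-token vocabulary below are hypothetical, and `output_distribution` checks the guarantee exactly rather than by sampling.

```python
import random
from typing import Dict, Tuple

Dist = Dict[str, float]


def residual(p: Dist, q: Dist) -> Dist:
    """Normalized max(0, p - q): where to resample from after a rejection."""
    raw = {t: max(0.0, p[t] - q[t]) for t in p}
    total = sum(raw.values())
    # If p == q a rejection never happens; return p as a harmless placeholder.
    return {t: v / total for t, v in raw.items()} if total > 0 else dict(p)


def verify_token(drafted: str, p: Dist, q: Dist, rng: random.Random) -> Tuple[str, bool]:
    """Accept the drafted token or replace it; returns (token, was_accepted)."""
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted, True
    res = residual(p, q)
    tokens = list(res)
    return rng.choices(tokens, weights=[res[t] for t in tokens])[0], False


def output_distribution(p: Dist, q: Dist) -> Dist:
    """Exact distribution of the emitted token when the draft samples from q."""
    accept = {t: q[t] * min(1.0, p[t] / q[t]) for t in q if q[t] > 0}
    reject_mass = 1.0 - sum(accept.values())
    res = residual(p, q)
    return {t: accept.get(t, 0.0) + reject_mass * res[t] for t in p}


# Hypothetical distributions over a three-token vocabulary.
p = {"a": 0.6, "b": 0.3, "c": 0.1}   # target
q = {"a": 0.2, "b": 0.5, "c": 0.3}   # draft
print(output_distribution(p, q))
print(verify_token("b", p, q, random.Random(0)))
```

Even though this draft distribution is badly mismatched with the target, the emitted token is still distributed according to p; a poor draft costs acceptance rate, not correctness.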

This guarantee comes from the foundational paper, "Fast Inference from Transformers via Speculative Decoding" by Leviathan, Kalman, and Matias, presented at ICML 2023. Their rejection-sampling formulation is what makes the technique safe to deploy: a skeptical engineer can turn it on knowing the model's behavior is unchanged.

Acceptance rate is everything

The entire speedup hinges on one number: the acceptance rate — how often the draft model's guesses survive the target model's verification.

If the draft model is a good mimic of the target, most of its tokens get accepted, and you generate many tokens per target pass. If it guesses poorly, the target rejects most candidates, you have burned compute on draft tokens that got thrown away, and the speedup evaporates.

This creates a fundamental tension:

A larger draft model guesses more accurately (higher acceptance) but is slower to run.
A smaller draft model is fast but less accurate (lower acceptance).

The art of speculative decoding is finding a draft model that is both fast and well-aligned with the target. On general queries with off-the-shelf draft models, realistic acceptance rates land around 0.6 to 0.8, which translates to roughly a 2–3x end-to-end speedup. Predictable, repetitive text (boilerplate code, structured output) accepts at much higher rates; creative, high-entropy text accepts at lower rates.
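
The link between acceptance rate and speedup can be sketched with the simplified model from the Leviathan et al. analysis, which assumes each drafted token is accepted independently with probability alpha. The draft length K = 4 and the draft cost of 5% of a target pass are assumptions chosen for illustration, not figures from this article.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target pass: (1 - alpha^(k+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def expected_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Speedup over plain decoding; draft_cost is one draft step's time
    as a fraction of one target pass."""
    return expected_tokens_per_pass(alpha, k) / (draft_cost * k + 1)


K = 4              # assumption: tokens drafted per round
DRAFT_COST = 0.05  # assumption: a draft step costs 5% of a target pass

for alpha in (0.6, 0.7, 0.8):
    tokens = expected_tokens_per_pass(alpha, K)
    speedup = expected_speedup(alpha, K, DRAFT_COST)
    print(f"alpha={alpha:.1f}: {tokens:.2f} tokens/pass, about {speedup:.2f}x")
```

With these assumed settings the model gives about 1.9x at an acceptance rate of 0.6 and about 2.8x at 0.8, broadly in line with the range quoted above. Real acceptances are not independent across positions, so treat the formula as a planning estimate rather than a prediction.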

Beyond a separate draft model: Medusa and EAGLE

The classic approach needs a second, separate model — extra weights to host, align, and maintain. The most important recent advances eliminate or shrink that overhead.

Medusa takes a different tack: instead of a standalone draft model, it bolts multiple extra decoding heads directly onto the target model. Each head predicts the token at a different future position (the first head one step further ahead, the second two steps, and so on), and a tree-based attention mechanism verifies many candidate continuations at once. There is no separate model to keep in sync; the heads ride on top of the model you already have.
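
Tree-based verification can be sketched as an attention mask over a tree of candidate tokens, in which each candidate may attend only to itself and its ancestors, so several continuations share one forward pass without seeing each other. The parent-index layout and the `tree_attention_mask` helper are hypothetical illustrations, not Medusa's actual code, and attention to the already-generated prompt is left out for brevity.

```python
from typing import List, Optional


def tree_attention_mask(parents: List[Optional[int]]) -> List[List[bool]]:
    """mask[i][j] is True when candidate i may attend to candidate j,
    that is, when j is i itself or one of i's ancestors in the tree."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for node in range(n):
        cur: Optional[int] = node
        while cur is not None:
            mask[node][cur] = True
            cur = parents[cur]
    return mask


# Hypothetical tree: the first head offers two options (nodes 0 and 1);
# the second head offers two options under each of them (nodes 2 to 5).
parents = [None, None, 0, 0, 1, 1]
mask = tree_attention_mask(parents)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each root-to-leaf path in the tree is one candidate continuation; the mask is what lets the target score all of them in a single pass while keeping sibling branches isolated.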

EAGLE and its successors push further by drafting at the feature level rather than the token level. Instead of predicting raw tokens, EAGLE predicts the target model's internal hidden-state representations, which are far more informative, and feeds them back to produce better candidates.

EAGLE-3, introduced in 2025, is the current standard-bearer. Its two key ideas:

Training-time test. The draft head is trained under conditions that simulate real inference — predicting tokens on top of its own previously generated features, not just teacher-forced ground truth. This closes the gap between how the drafter is trained and how it actually runs.
Multi-layer fusion. Rather than reading only the target model's top layer, EAGLE-3 fuses information from low, mid, and high layers, giving the drafter a richer signal.

The payoff is substantial. The EAGLE-3 paper reports speedups around 4.0–4.8x for LLaMA-3.3-70B across tasks, and 70B-class models commonly land in the 4–6x range at low batch sizes. Those numbers are why EAGLE-style drafting now ships in major inference stacks.

The batching catch

Here is the caveat that trips up teams deploying this in production: speculative decoding shines at low batch sizes and fades at high ones.

The whole premise is that the GPU has spare compute while waiting on memory. At small batch sizes — a single user, an interactive coding agent, a latency-sensitive endpoint — that spare capacity is real, and speculative decoding fills it beautifully.

But as you pack more concurrent requests into a batch, the GPU's arithmetic units get busy serving all of them. The free compute disappears, and the extra work of generating and verifying draft tokens starts competing with genuine throughput. At very high batch sizes, speculative decoding can even hurt aggregate tokens-per-second.
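
A toy roofline-style model illustrates the crossover. It treats one decoding step as taking whichever is longer, streaming the weights or doing the arithmetic, and ignores draft-model cost and KV-cache traffic. Every constant below, including the 3.0 tokens accepted per pass, is an assumption picked for round numbers rather than a measurement.

```python
WEIGHT_BYTES = 140e9      # assumption: 70B parameters at 2 bytes each
BANDWIDTH = 3e12          # assumption: 3 TB/s of memory bandwidth
FLOPS_PER_TOKEN = 140e9   # assumption: about 2 FLOPs per parameter per token
PEAK_FLOPS = 1e15         # assumption: 1 PFLOP/s of usable compute


def step_time(batch: int, tokens_per_request: int) -> float:
    """Seconds for one target pass: the slower of weight streaming and compute."""
    memory_time = WEIGHT_BYTES / BANDWIDTH
    compute_time = batch * tokens_per_request * FLOPS_PER_TOKEN / PEAK_FLOPS
    return max(memory_time, compute_time)


def speculative_speedup(batch: int, k: int = 4, accepted_per_pass: float = 3.0) -> float:
    """Throughput ratio of speculative decoding to plain decoding."""
    plain = batch * 1.0 / step_time(batch, 1)
    # Verification scores k drafted tokens plus one extra position per request.
    speculative = batch * accepted_per_pass / step_time(batch, k + 1)
    return speculative / plain


for batch in (1, 64, 128, 512):
    print(f"batch {batch:>3}: {speculative_speedup(batch):.2f}x")
```

In this toy model the ratio holds at the assumed 3x while both modes are memory-bound, then drops once verification becomes compute-bound, reaching 0.6x at a batch of 512: the extra verified tokens stop being free and start displacing real work.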

The practical rule of thumb:

Latency-bound, low-concurrency workloads (chat, coding assistants, agents): speculative decoding is close to free real-time speed.
Throughput-bound, high-concurrency serving: measure carefully; the gains shrink and may invert.

Where it lives in your stack

You rarely implement speculative decoding yourself. Modern inference engines such as vLLM and TensorRT-LLM support it natively, and many hosted APIs apply EAGLE-style drafting transparently. That is precisely why two providers serving the same open-weight model can post very different latencies — one may be running an aggressively tuned drafter, the other plain autoregressive decoding.

For self-hosters, the decision tree is roughly:

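Want no separate draft model to host? Reach for a Medusa-style head-based approach, keeping in mind that the extra heads still have to be trained or obtained for your specific target model.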
Want maximum speedup and can afford to train or download a matched drafter? Use an EAGLE-3 draft model.
Serving huge concurrent batches? Benchmark with and without — don't assume it helps.

The Bottom Line

Speculative decoding is one of the rare optimizations that feels like cheating but isn't: it makes large models dramatically faster while producing output drawn from exactly the same distribution as the target model alone, thanks to a rejection-sampling guarantee that dates back to Leviathan et al. in 2023. The mechanism is simple to state (let a small model guess, let the big model verify in parallel), but the performance depends entirely on acceptance rate, which is why the field has moved from separate draft models toward integrated approaches like Medusa and EAGLE-style drafters like EAGLE-3, the latter delivering 4–6x speedups on 70B-class models. The one thing to internalize before you flip the switch: it is a low-batch, latency-killing technique. Match it to interactive workloads and it is close to magic; throw it at a saturated high-throughput server and the magic quietly disappears.
