Deep Dives 9 min read advanced

KV Cache: The Memory Trick Behind Fast LLM Inference

A deep dive into the KV cache in LLM inference: why autoregressive decoding needs it, how it dominates GPU memory, the 60-80% waste of contiguous allocation, and how vLLM's PagedAttention fixed it.

Aisha Patel

Jun 22, 2026

Ask most people what limits how many users an LLM server can handle, and they'll say "the model is too big." They're looking at the wrong number. On a modern inference server, the model's weights are a fixed, one-time cost. The thing that actually fills your GPU — that decides whether you serve five concurrent users or fifty — is a quietly growing data structure called the KV cache. Understanding it is the difference between an inference bill that scales and one that doesn't.

This is a deep dive into what the KV cache is, why it became the dominant consumer of GPU memory, and how a 2023 idea borrowed from operating systems — PagedAttention — reorganized the whole field of LLM serving around it.

Why a cache exists at all

To see the problem, you have to look at how a transformer generates text. Generation is autoregressive: the model produces one token, appends it to the sequence, and runs the whole thing forward again to produce the next token. One token per forward pass.

Inside each attention layer, every token is projected into three vectors: a query, a key, and a value. To compute attention for the newest token, the model needs the keys and values of every previous token in the sequence. That's the mechanism of attention — the new token "looks back" at everything that came before.

Here's the inefficiency. Without a cache, generating token number 1,000 would mean recomputing the keys and values for tokens 1 through 999 from scratch — every single step. The keys and values for those earlier tokens never change once computed. Recomputing them is pure waste.

The KV cache is the fix: compute each token's key and value once, store them, and reuse them on every subsequent step. It turns an O(n²) recomputation nightmare into something tractable. Modern LLM inference is impossible without it.

The cache that eats the GPU

Caching solves the compute problem and creates a memory problem. The KV cache is not small, and crucially, it grows linearly with sequence length and batch size. Every new token in every active request adds another slice of keys and values that must live in GPU memory for the rest of that request's life.

Think about what that means in production. You're not serving one request — you're batching dozens. Each one carries its own growing KV cache. As context windows have ballooned to tens or hundreds of thousands of tokens, the KV cache has become the single largest dynamic consumer of GPU memory, and at long context lengths and large batch sizes it can rival or exceed the memory taken by the model weights themselves.

So the practical limit on throughput isn't compute. It's: how many KV caches can you fit in memory at once? Every byte you waste on KV cache is a request you can't serve.

The waste nobody saw

And here's the punchline that motivated the breakthrough: pre-2023 serving systems wasted most of that precious memory.

The reason was naive allocation. Older systems stored each sequence's KV cache in a single contiguous block of memory. But you don't know in advance how long a response will be, so systems reserved a chunk sized to the maximum possible sequence length for every request. A request that ended after 20 tokens still held a reservation sized for 2,000.

This produced two classic forms of waste, both names borrowed straight from operating-systems theory:

Internal fragmentation — the gap between what you reserved and what you actually used. Reserve 2,048 slots, generate 100 tokens, and the other 1,948 slots sit empty but locked.
External fragmentation — the unusable gaps left between contiguously-allocated blocks of different sizes, like the holes in a parking lot full of mismatched cars.

The measured result was brutal. Studies of these systems found they wasted the majority of allocated KV cache memory — on the order of 60–80% lost to fragmentation and over-allocation. Put plainly: most of the most-contested resource on the GPU was sitting empty.

PagedAttention: the OS analogy that fixed it

In 2023, a team at UC Berkeley led by Woosuk Kwon published "Efficient Memory Management for Large Language Model Serving with PagedAttention" (SOSP 2023, arXiv:2309.06180), the work that became the engine vLLM. Their insight was that the LLM-serving community had been solving a problem operating systems solved fifty years ago: how to manage memory that's requested in unpredictable, growing amounts.

The OS answer is virtual memory with paging. Instead of giving each process one big contiguous block, the OS divides memory into fixed-size pages that can sit anywhere physically, and uses a page table to map a process's logical addresses to those scattered physical pages.

PagedAttention applies this directly to the KV cache. Rather than one contiguous allocation per sequence, it partitions each sequence's KV cache into fixed-size blocks, each holding the keys and values for a fixed number of tokens. The blocks need not be contiguous in physical GPU memory. A per-sequence block table maps the logical token positions to wherever the physical blocks actually live.

The attention computation is rewritten to fetch keys and values block by block through that table — hence PagedAttention. The model code doesn't care that the blocks are scattered; the indirection layer hides it, exactly as a CPU doesn't care that a process's pages are scattered across RAM.

The crucial move: blocks are allocated on demand, as a sequence actually grows, instead of reserved up front for a hypothetical maximum length.

Why blocks of sixteen

The block size is a real engineering tradeoff, not an arbitrary constant. Too large, and you reintroduce internal fragmentation — the last partially-filled block of every sequence wastes space, and a big block wastes more. Too small, and you pay overhead: more block-table entries to track and less efficient GPU memory access patterns.

The vLLM authors tested a range and found that around 16 tokens per block strikes the balance — small enough to keep waste low, large enough for efficient GPU operations. That figure became the field's default.

The payoff is dramatic. Because blocks are allocated only as needed and packed without contiguity requirements, PagedAttention reduces KV cache waste to under 4% — down from the 60–80% of the contiguous era. Nearly all the memory you paid for is now memory you can use.

What "no waste" buys you

Reclaiming that memory isn't an academic win; it converts almost directly into throughput. With waste near zero, far more sequences' KV caches fit on the same GPU at the same time. Bigger effective batches mean more requests served in parallel, which is the entire game in inference economics.

The original paper measured 2–4× higher throughput for popular LLMs at the same latency compared to the prior state of the art, including systems like FasterTransformer and Orca. Same hardware, same response times, two to four times the work — purely from managing memory more like an operating system does.

Paging unlocked a second capability that contiguous allocation made nearly impossible: sharing KV cache blocks between sequences.

Consider parallel sampling, where you generate several candidate completions from one prompt, or beam search, which maintains multiple branching hypotheses. All of those sequences share an identical prefix — the prompt. With paged blocks, that shared prefix can be stored once and pointed to by multiple sequences' block tables. When one sequence needs to diverge and write into a shared block, the system does copy-on-write: copy the block at the moment of the first write, exactly as an OS forks a process's memory.

The same mechanism powers prefix caching in production systems: if many requests start with the same long system prompt, that prompt's KV blocks are computed once and reused across all of them, saving both memory and the compute to rebuild them. A single good abstraction paid off in several directions at once.

Why this shapes the whole stack

PagedAttention didn't stay a vLLM detail; it reset expectations for the entire serving layer. Once memory was no longer the bottleneck it had been, the field's attention moved to the next constraints — continuous batching to keep the GPU from idling between requests, and a wave of research into shrinking the KV cache itself.

That research is now a field of its own. Techniques like grouped-query attention reduce how many distinct key/value heads each token even produces, cutting cache size at the source. KV cache quantization stores the cached values in lower precision. Eviction and compression schemes drop or summarize the least-useful cached tokens for very long contexts. Every one of these is, at heart, another answer to the same question PagedAttention posed: the KV cache is the scarce resource, so how do we make it smaller or use it better?

The Bottom Line

The KV cache is the hidden protagonist of LLM inference. It exists because recomputing attention from scratch every step would be insane, and it dominates GPU memory because it grows with every token of every concurrent request. The pre-2023 approach of contiguous allocation threw away 60–80% of that memory to fragmentation. PagedAttention fixed it by stealing the oldest trick in operating systems — paging — partitioning the cache into ~16-token blocks mapped through a block table, cutting waste below 4% and delivering 2–4× more throughput on the same hardware, with prefix sharing thrown in for free. If you're choosing or tuning an inference stack today, the question that matters most is no longer "how big is the model." It's "how well does this thing manage the KV cache." That's where your throughput, and your bill, are actually decided.

llm ai-models open-source developer-tools

More in Deep Dives

Deep Dives

Model Collapse: Why AI Trained on AI Slowly Falls Apart

Model collapse is the progressive degradation of generative models trained recursively on synthetic data, documented in Nature (Shumailov et al., 2024). Errors compound and rare data vanishes, but research (Gerstgrasser et al., 2024) shows accumulating real data alongside synthetic data, tracking ratios, and verifying generations prevents it.

By Aisha Patel · 8 min · Jun 19, 2026

Deep Dives

Test-Time Compute: Why Reasoning Models Think Before Answering

Test-time compute spends extra computation during inference, not training, to improve answers. It powers reasoning models like OpenAI o1 and DeepSeek-R1. Two strategies exist: sequential scaling (longer chains of thought, e.g. the s1 paper's budget forcing) and parallel scaling (Best-of-N, majority voting). More thinking is not always better, overthinking degrades accuracy, and hidden reasoning tokens are billable. Match compute to task difficulty.

By Aisha Patel · 8 min · Jun 17, 2026

Deep Dives

Speculative Decoding: How a Tiny Draft Model Doubles LLM Speed

Speculative decoding speeds up LLM inference 2-6x by having a small draft model propose tokens that the target model verifies in parallel via rejection sampling, guaranteeing lossless output. EAGLE-3 and Medusa reduce or remove the separate draft model. Gains are largest at low batch sizes.

By Aisha Patel · 7 min · Jun 15, 2026

KV Cache: The Memory Trick Behind Fast LLM Inference

Why a cache exists at all

The cache that eats the GPU

The waste nobody saw

PagedAttention: the OS analogy that fixed it

Why blocks of sixteen

What "no waste" buys you

The bonus: memory you can share

Why this shapes the whole stack

The Bottom Line

More in Deep Dives

Model Collapse: Why AI Trained on AI Slowly Falls Apart

Test-Time Compute: Why Reasoning Models Think Before Answering

Speculative Decoding: How a Tiny Draft Model Doubles LLM Speed