
TurboQuant: Google's 6x KV Cache Compression Hits 3-Bit With Zero Loss

Aisha Patel
May 11, 2026

TurboQuant is the kind of paper that quietly rewrites the economics of running large language models. Google Research and NYU's joint algorithm, accepted at ICLR 2026, takes a problem every inference engineer has stared at — the bloated key-value cache that swallows GPU memory at long context lengths — and reduces it by roughly 6x with what the authors call zero accuracy loss. No training. No calibration data. No model-specific tuning. It just rotates your vectors and quantizes them to 3 bits.

That is a much bigger deal than it sounds.

Why The KV Cache Is The Real Bottleneck

When a transformer processes a long prompt, every token's attention keys and values get stashed so later decoding steps can reuse them; that running store is the key-value cache. The cache lets the model avoid recomputing attention over prior tokens on every step, but it grows linearly with context length and with the model's depth and KV width. By the time you are pushing a 2-million-token window, the KV cache, not the model weights, is what is hogging your H100s.
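To put numbers on that, here is the back-of-envelope arithmetic, assuming the commonly published Llama-3.1-8B shape (32 layers, 8 grouped-query KV heads, head dimension 128) and 16-bit storage; swap in your own model's values:

```python
# Back-of-envelope KV cache sizing. The model shape is an assumption: the
# commonly published Llama-3.1-8B configuration (32 layers, 8 grouped-query
# KV heads, head dim 128). Substitute your own model's numbers.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # Keys and values are both cached at every layer, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

for ctx in (128_000, 1_000_000, 2_000_000):
    fp16_gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>9,} tokens: {fp16_gib:7.1f} GiB at fp16, {fp16_gib * 3 / 16:6.1f} GiB at 3-bit")
```

At a 2-million-token window that works out to roughly 244 GiB of cache at fp16, more than three 80 GB H100s' worth of memory before you load a single weight; at 3 bits it drops to under 50 GiB.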

The standard fix is quantization: replace 16- or 32-bit floats with fewer bits per number. The problem is that classical quantizers need quantization constants — tiny per-block scaling values — stored in full precision. Those constants add 1 to 2 extra bits per number, partially defeating the compression you came for. TurboQuant's whole insight is that you can eliminate that overhead entirely.
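A toy example makes the overhead concrete. This is a standard per-block absmax quantizer, not TurboQuant; the block sizes and bit widths are illustrative:

```python
# Why quantization constants hurt: a classic absmax quantizer keeps a
# full-precision scale for every small block of values, and those scales are
# pure overhead on top of the payload bits. Block size and bit widths here
# are illustrative, not TurboQuant's.
import numpy as np

def absmax_quantize(block, bits=3):
    scale = np.abs(block).max() / (2**(bits - 1) - 1)   # stored alongside the codes
    codes = np.round(block / scale).astype(np.int8)
    return codes, np.float16(scale)

def effective_bits(bits=3, block_size=32, scale_bits=16):
    # Payload bits per element plus the amortized cost of the block's scale.
    return bits + scale_bits / block_size

x = np.random.default_rng(0).standard_normal(32).astype(np.float32)
codes, scale = absmax_quantize(x)
print("max reconstruction error:", float(np.abs(codes * scale - x).max()))
print("effective bits/element, block of 32:", effective_bits())              # 3.5, not 3
print("effective bits/element, block of 8 :", effective_bits(block_size=8))  # 5.0
```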

The Two-Stage Trick

TurboQuant is a pipeline of two algorithms that the same team published as standalone papers, then stacked.

"TurboQuant is a compression method that achieves a high reduction in model size with zero accuracy loss." — Amir Zandieh and Vahab Mirrokni, Google Research

Stage 1: PolarQuant (accepted at AISTATS 2026). The vector gets a random orthogonal rotation, which spreads its energy evenly across coordinates. Then, instead of using Cartesian coordinates, PolarQuant converts the rotated vector into polar coordinates — a radius (how strong the signal is) and a set of angles (where it points). Because the post-rotation distribution is known in advance, you can pre-compute mathematically optimal quantization buckets once and use them for every vector, killing the per-block overhead.
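Here is a minimal numpy sketch of that rotate-then-quantize-against-fixed-grids idea. It quantizes pairs of rotated coordinates in 2-D polar form, a simplification of the paper's radius-plus-angles representation, and the grid sizes and per-vector fp16 norm are illustrative choices rather than anything from the paper:

```python
# A sketch of the Stage 1 idea: rotate, then quantize against grids that are
# fixed once for every vector because the post-rotation distribution is known.
# This version quantizes pairs of rotated coordinates in 2-D polar form, a
# simplification of the paper's radius-plus-angles representation.
import numpy as np

rng = np.random.default_rng(0)
d = 128
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))      # random orthogonal rotation

# Precomputed once and shared by all vectors: a 4-bit uniform angle grid, and a
# 3-bit radius grid placed at quantiles of the distribution a rotated unit
# vector's coordinate pairs approximately follow (exponential in r^2). The
# paper derives optimal grids; these are plausible stand-ins.
angle_grid = np.linspace(-np.pi, np.pi, 16, endpoint=False)
radius_grid = np.sqrt(-np.log(1 - np.linspace(0.0625, 0.9375, 8)) * 2 / d)

def polar_quantize(x):
    norm = np.float16(np.linalg.norm(x))               # one scalar per vector, amortized
    z = Q @ (x / float(norm))                          # rotate the unit-normalized vector
    pairs = z.reshape(-1, 2)
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])
    r_idx = np.abs(r[:, None] - radius_grid).argmin(axis=1)     # 3 bits
    t_idx = np.abs(theta[:, None] - angle_grid).argmin(axis=1)  # 4 bits
    return norm, r_idx, t_idx                          # 7 bits per pair = 3.5 bits/coord here

def polar_dequantize(norm, r_idx, t_idx):
    r, theta = radius_grid[r_idx], angle_grid[t_idx]
    pairs = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return float(norm) * (Q.T @ pairs.reshape(-1))

x = rng.standard_normal(d)
x_hat = polar_dequantize(*polar_quantize(x))
print("relative error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```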

Stage 2: QJL — Quantized Johnson-Lindenstrauss. This stage spends one bit on the residual error left over from Stage 1. QJL projects the error into a lower dimension and reduces each coordinate to a single sign bit. The result is a debiased estimator that recovers attention scores with near-original fidelity.
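A minimal sketch of that stage, assuming a plain Gaussian Johnson-Lindenstrauss projection; the projection width, the averaging over 2,000 trials, and the function names are illustrative choices, not the paper's:

```python
# A sketch of the Stage 2 idea: keep only the sign of each projected coordinate
# of the residual (one bit each) plus its norm, then undo the bias of the sign
# operation with a sqrt(pi/2) factor (E|g| = sqrt(2/pi) for a standard Gaussian g).
import numpy as np

rng = np.random.default_rng(1)
d, m = 128, 64                            # project the residual into a lower dimension

def qjl_encode(S, residual):
    # Stored state: m sign bits plus one scalar norm per vector.
    return np.sign(S @ residual).astype(np.int8), np.linalg.norm(residual)

def qjl_estimate(S, query, sign_bits, norm):
    # Debiased estimate of <query, residual>, recovered from the sign bits alone.
    return np.sqrt(np.pi / 2) / len(sign_bits) * norm * (S @ query) @ sign_bits

query = rng.standard_normal(d)
# Pretend Stage 1 left this small error behind; it overlaps the query a little
# so the target inner product is visibly nonzero.
residual = 0.05 * query + 0.05 * rng.standard_normal(d)

# A single estimate is noisy, but it is unbiased: averaged over fresh random
# projections it converges on the true inner product.
estimates = [qjl_estimate(S, query, *qjl_encode(S, residual))
             for S in rng.standard_normal((2000, m, d))]
print("true <q, residual> :", round(float(query @ residual), 3))
print("mean of estimates  :", round(float(np.mean(estimates)), 3))
```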

Together, the two stages give you a 3-bit-per-element KV cache with no calibration step and no fine-tuning.

What The Numbers Actually Look Like

Google tested TurboQuant against KIVI and standard product quantization baselines on Llama-3.1-8B-Instruct across five long-context benchmarks: LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval. Results held up on Gemma and Mistral too.

| Metric | TurboQuant result |
| --- | --- |
| KV cache compression | ~6x at 3-bit |
| Accuracy degradation | Effectively zero across LongBench |
| Attention-logit speedup (4-bit vs 32-bit) | Up to 8x on H100 |
| Training data required | None |
| Calibration step | None |

The 8x attention speedup is worth pausing on. KV cache reads are memory-bound; cutting the bytes you have to pull from HBM by a factor of eight is close to a straight-line throughput win at long contexts. On vector-search workloads — the other domain the paper targets — TurboQuant matched or beat the top-1@k recall of PQ and RaBitQ baselines, despite those methods using dataset-specific tuning that TurboQuant does not need.
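The arithmetic behind that claim, using the same assumed Llama-3.1-8B shape as above and a rough 3 TB/s figure for H100-class HBM bandwidth:

```python
# Why fewer bits translate almost directly into decode throughput: every
# generated token streams the whole KV cache out of HBM, so bytes moved per
# step is the budget. Model shape and the 3 TB/s bandwidth are assumptions,
# used only for a sense of scale.
def kv_bytes_per_decode_step(ctx_len, bits, n_layers=32, n_kv_heads=8, head_dim=128):
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bits / 8

for bits in (32, 4):
    gb = kv_bytes_per_decode_step(1_000_000, bits) / 1e9
    print(f"{bits:>2}-bit cache at 1M tokens: {gb:6.1f} GB per step, ~{gb / 3000 * 1e3:.0f} ms at 3 TB/s")
```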

Why Engineers Are Already Shipping It

Within weeks of the paper landing, the open-source community produced multiple independent ports. There are PyTorch implementations, Triton kernels with vLLM integration, and a llama.cpp port reporting roughly 5.2x memory reduction with near-lossless quality. The reason these implementations exist already is that the algorithm has two unusual properties:

  • No training data. You can apply it to a model you do not own and have never seen.
  • Online operation. Vectors are quantized as they enter the cache, not pre-processed offline.
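Here is what the online property looks like in code, with a deliberately crude 4-bit stand-in quantizer in place of the real pipeline; the class and method names are made up for illustration, not any shipped API:

```python
# What "online" means in practice: each key is quantized the moment it is
# appended, with no calibration pass and no dependence on the other vectors in
# the cache. Note the 16-level (4-bit) stand-in still stores a per-vector
# scale, exactly the kind of constant TurboQuant's rotation trick removes.
import numpy as np

class QuantizedKVCache:
    def __init__(self, levels=16):
        self.levels, self.codes, self.scales = levels, [], []

    def append(self, key):
        # Quantize immediately; keep only the integer codes plus one scale.
        scale = np.abs(key).max() / (self.levels // 2 - 1)
        self.codes.append(np.round(key / scale).astype(np.int8))
        self.scales.append(scale)

    def attention_logits(self, query):
        # Scores are computed straight from the compressed representation.
        return np.array([s * (c @ query) for c, s in zip(self.codes, self.scales)])

rng = np.random.default_rng(2)
keys = rng.standard_normal((8, 64))
query = rng.standard_normal(64)

cache = QuantizedKVCache()
for k in keys:                     # keys arrive one decoding step at a time
    cache.append(k)

print("exact  logits:", np.round(keys @ query, 2))
print("approx logits:", np.round(cache.attention_logits(query), 2))
```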

For practitioners running Gemma, Mistral, or Llama-class models on a single GPU, this is a near-free upgrade. For frontier labs hosting million-token contexts, it is the difference between affordable and uneconomical.

KV cache compression is the headline, but the same math compresses vector embeddings. Modern semantic search at any real scale — billions of vectors, low-latency lookups — runs into the same memory wall. Google's paper shows TurboQuant achieving state-of-the-art 1@k recall on the GloVe d=200 benchmark while operating at 3-bit width. If you build retrieval systems, that translates into either dramatically smaller indices or dramatically more documents at the same hardware budget.
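The index-size arithmetic is the same story; the billion-vector count below is illustrative, while d=200 matches the GloVe benchmark just mentioned:

```python
# Index footprint at different bit widths. The vector count is an assumption;
# d=200 matches the GloVe benchmark the article cites.
def index_gib(n_vectors, dim, bits):
    return n_vectors * dim * bits / 8 / 2**30

for bits in (32, 16, 3):
    print(f"1B x 200-dim vectors at {bits:>2} bits: {index_gib(1_000_000_000, 200, bits):7.1f} GiB")
```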

The Bottom Line

TurboQuant is a rare thing in machine learning: a result that is both provably optimal (the paper proves it operates near the theoretical lower bound for distortion-rate trade-offs) and trivially deployable (no training, no calibration, runs online). If you are deploying long-context inference today, ignore it at your peril. If you are running vector search, the index-size implications are larger than they look. And if you are betting on million-token contexts becoming the default, TurboQuant is one of the algorithms that makes the economics work.

The frontier of AI efficiency is no longer just about larger models or better hardware. Sometimes it is about a clever rotation, a polar coordinate, and a single bit of error correction.