
LLM Quantization: GGUF vs AWQ vs GPTQ in 2026

A practical breakdown of the three dominant LLM quantization formats in 2026. GGUF is the portable, CPU-friendly default (use Q4_K_M); AWQ wins on 4-bit quality for GPU serving via activation-aware precision; GPTQ remains a solid NVIDIA-focused option. Quantization is lossy, so test on your real workload.

Aisha Patel

Jun 25, 2026

If you have ever tried to run a large language model on your own hardware, you have hit the same wall everyone hits: the model is too big for your GPU. A 70-billion-parameter model in full 16-bit precision wants roughly 140 GB just for its weights. No consumer card comes close. Quantization is the trick that makes the impossible routine — and in 2026 it is the single most important technique standing between you and a frontier-class model running on a machine you already own.

But "quantization" is not one thing. It is a family of methods with sharp trade-offs, and the three names you will see everywhere — GGUF, GPTQ, and AWQ — are not interchangeable. Pick the wrong one and you either leave quality on the table or you can't run the model at all. Here is what each actually does, and how to choose.

What quantization really is

A model's weights are just numbers. By default they are stored as 16-bit floating-point values (FP16 or BF16), giving each weight a wide, precise range. Quantization reduces the number of bits used to store each weight — from 16 down to 8, 4, or even fewer — by mapping the original range onto a much smaller set of discrete values.
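To make the mapping concrete, here is a minimal sketch of the simplest possible scheme: symmetric round-to-nearest with one shared scale. The function names and sample weights are made up for illustration, and none of the three formats below works exactly like this; they all refine the same basic idea.

```python
def quantize_absmax(weights, bits=4):
    """Map floats onto signed integer codes in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Illustrative weights, not taken from any real model.
weights = [0.70, -0.35, 0.12, 0.03, -0.61]
codes, scale = quantize_absmax(weights, bits=4)
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]

print(codes)        # small integers, each storable in 4 bits
print(max(errors))  # never more than about half of `scale`
```

The rounding error per weight is bounded by half the scale, which is why one large outlier weight hurts: it stretches the scale and makes every other weight in the group coarser.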

The payoff is brutal arithmetic. At a rough 0.5 bytes per parameter for 4-bit storage, a 70B model's weights drop from ~140 GB to about 35 GB. The cost is precision: you are throwing away information, and if you do it carelessly the model gets dumber. Every method below is, at its core, a different strategy for which information to throw away.
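The arithmetic is short enough to check yourself. This sketch counts weights only, using decimal gigabytes; it ignores the KV cache, activations, and any per-group metadata, so treat the results as a floor rather than a full memory budget. The helper name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_memory_gb(70e9, 16)  # about 140 GB
int4_gb = weight_memory_gb(70e9, 4)   # about 35 GB

print(f"70B at 16-bit: {fp16_gb:.0f} GB")
print(f"70B at 4-bit:  {int4_gb:.0f} GB")
```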

The whole game of modern quantization is selective sacrifice: spend your precious bits on the weights that matter, and starve the ones that don't.

GGUF: the format that runs everywhere

GGUF is the quantization format created by the llama.cpp project, and it is the one most people actually use. Its defining feature is reach: GGUF runs on CPU, on consumer GPUs, on Apple Silicon, and on mixed CPU+GPU setups where some layers sit in VRAM and the rest spill to system RAM. If you have used Ollama, LM Studio, or llama.cpp directly, you have used GGUF.

GGUF models are produced by post-training quantization: an already-trained model is converted without needing the original training pipeline. The format also offers a ladder of quality levels, from Q2_K (tiny, lossy) up to Q8_0 (near-lossless). The ones worth knowing are the k-quant variants like Q4_K_M and Q5_K_M.

The "K" matters. K-quants use mixed precision within a single model: more sensitive tensors (such as some attention weights) are stored at higher bit-depth, while less sensitive feedforward weights are quantized harder. Each block of weights also carries its own scaling metadata. The result is that a k-quant "4-bit" model averages roughly 4.5 to 5 bits per weight in practice. That is why a 70B model in Q4_K_M lands around 40–45 GB rather than the naive 35 GB, and why it punches well above its bit budget on quality.
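A rough sketch of where the extra fraction of a bit comes from. Every number here is an assumption chosen for illustration (group size, metadata size, and the 85/15 split between low- and high-precision weights); it is not llama.cpp's actual block layout or tensor mix.

```python
def effective_bits(code_bits, group_size, metadata_bits):
    """Bits per weight once per-group metadata (e.g. a scale) is amortized."""
    return code_bits + metadata_bits / group_size


def average_bits(mix):
    """Weighted mean over (share_of_weights, bits_per_weight) pairs."""
    total_share = sum(share for share, _ in mix)
    return sum(share * bits for share, bits in mix) / total_share


# Hypothetical layout: groups of 32 weights, one 16-bit scale per group.
low = effective_bits(4, group_size=32, metadata_bits=16)   # 4.5
high = effective_bits(6, group_size=32, metadata_bits=16)  # 6.5

# Hypothetical mix: 85% of weights at the low tier, 15% at the high tier.
avg = average_bits([(0.85, low), (0.15, high)])            # about 4.8
size_gb = 70e9 * avg / 8 / 1e9                             # about 42 GB

print(f"average bits/weight: {avg:.2f}, 70B size: {size_gb:.0f} GB")
```

Two effects stack: the scale metadata lifts a nominal 4 bits to 4.5, and the share of weights kept at higher precision lifts the average further.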

GGUF level	Rough bits/weight	Use case
`Q2_K`	~2.6	Last resort on tiny hardware
`Q4_K_M`	~4.5	The default for most people
`Q5_K_M`	~5.5	A bit more quality, a bit more RAM
`Q8_0`	~8.5	Near-original, when you have the memory

If you only remember one thing: Q4_K_M is the sensible default. It keeps the large majority of full-precision quality and runs on hardware as humble as a laptop.

GPTQ: GPU-first, layer-by-layer

GPTQ takes a more surgical approach. Instead of mapping weights with a fixed scheme, it uses a small set of calibration data and quantizes the model one layer at a time, adjusting the remaining weights to compensate for the error introduced so far. The goal is to minimize the difference between the quantized layer's output and the original's.
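The quantity GPTQ tries to shrink can be written down in a few lines. The sketch below is not the GPTQ algorithm: it only measures the layer-output error on calibration inputs, using plain round-to-nearest as the baseline a GPTQ-style method would try to beat. The matrix, inputs, and function names are all illustrative.

```python
def matvec(W, x):
    """Output of a linear layer without bias: one dot product per row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def round_to_nearest(W, bits=4):
    """Per-row absmax rounding, with no error compensation at all."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in W:
        scale = max(abs(w) for w in row) / qmax or 1.0
        out.append([round(w / scale) * scale for w in row])
    return out


def layer_output_error(W, W_q, calibration_inputs):
    """Sum of squared differences between original and quantized outputs."""
    total = 0.0
    for x in calibration_inputs:
        y, y_q = matvec(W, x), matvec(W_q, x)
        total += sum((a - b) ** 2 for a, b in zip(y, y_q))
    return total


# Toy layer and toy calibration data (made up for illustration).
W = [[0.9, -0.2, 0.05], [0.1, 0.4, -0.7]]
calibration_inputs = [[1.0, 0.5, -1.0], [0.2, -0.3, 0.8]]

W_q = round_to_nearest(W)
err = layer_output_error(W, W_q, calibration_inputs)
print(f"baseline output error: {err:.6f}")
```

The point of measuring error on outputs rather than on weights is that a rounding error in one weight can be partly cancelled by nudging another, as long as the layer's output on realistic inputs stays close.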

GPTQ is built for NVIDIA GPUs. It leans on CUDA and tensor cores for fast group-wise 4-bit inference, and historically it was the go-to for serving quantized models on a single GPU. The trade-off: it is GPU-bound by design, so it does not give you the CPU and Apple Silicon flexibility that GGUF does, and quality at 4-bit tends to trail the activation-aware methods on instruction-tuned models.

AWQ: protect the weights that matter

AWQ (Activation-aware Weight Quantization) is the method that has quietly become the default for production GPU serving. Its insight is elegant: not all weights are equally important, and you can tell which ones matter by watching the model's activations, not just the weights themselves.

During a calibration pass, AWQ identifies the small fraction of salient weights that have an outsized effect on the output. It protects those by scaling the corresponding channels before quantization so that they lose less to rounding, rather than storing them in a separate higher-precision format, and it quantizes everything to the same low bit-width. Because it optimizes for what actually drives the model's predictions, AWQ generally delivers the best quality at 4-bit, especially for chat and instruction-tuned models, and it does so with fast GPU inference. That combination is why so much 2026 production serving runs on AWQ.
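Here is a toy version of the activation-aware idea, assuming a single weight row, hand-picked activation magnitudes, and a fixed exponent `alpha`. Real AWQ searches for good scales per layer and folds the inverse scale into the activations; this sketch folds it back into the weights so the two results can be compared directly. All names and numbers are illustrative.

```python
def quantize_row(row, bits=4):
    """Plain absmax round-to-nearest for one row of weights."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in row) / qmax or 1.0
    return [round(w / scale) * scale for w in row]


def channel_scales(act_magnitudes, alpha=0.5):
    """Larger typical activation gives a larger scale (magnitudes must be > 0)."""
    return [m ** alpha for m in act_magnitudes]


def awq_style_quantize(row, act_magnitudes, alpha=0.5, bits=4):
    """Scale up salient channels, quantize, then undo the scaling."""
    s = channel_scales(act_magnitudes, alpha)
    scaled = [w * sj for w, sj in zip(row, s)]
    q = quantize_row(scaled, bits)
    return [qj / sj for qj, sj in zip(q, s)]


# The last input channel sees activations 16x larger than the others,
# so its small weight matters far more than its size suggests.
row = [0.5, -0.4, 0.3, 0.02]
act = [1.0, 1.0, 1.0, 16.0]

plain = quantize_row(row)
aware = awq_style_quantize(row, act)

plain_err = abs(row[3] - plain[3])  # the small weight is rounded away to zero
aware_err = abs(row[3] - aware[3])  # the scaled weight survives rounding
print(plain_err, aware_err)
```

In this example the salient weight survives instead of collapsing to zero. Scaling too aggressively can stretch the group's range and hurt the other weights, which is why the scale is something to tune rather than maximize.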

How to choose

The decision is less about which method is "best" in the abstract and more about your hardware and your goal.

Run it locally on mixed or modest hardware — laptop, Mac, single consumer GPU, or CPU? Use GGUF, level Q4_K_M. It is the most forgiving and the most portable, and tools like Ollama hide the complexity entirely.

Serving a model to users from NVIDIA GPUs and want the best 4-bit quality? Reach for AWQ. It is the modern production default for a reason.

Already standardized on a GPTQ pipeline, or running on a stack where GPTQ has the best kernel support? It remains a perfectly solid GPU-serving choice.
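If it helps to see the three rules side by side, they collapse into a few lines. This is only a restatement of the guidance above as a hypothetical helper; the parameter names are invented, and real decisions will involve more than three booleans.

```python
def pick_format(nvidia_gpu_serving, wants_best_4bit_quality=True,
                existing_gptq_pipeline=False):
    """Return a starting-point format following the rules of thumb above."""
    if existing_gptq_pipeline:
        return "GPTQ"          # already standardized, kernels already tuned
    if nvidia_gpu_serving and wants_best_4bit_quality:
        return "AWQ"           # production GPU serving
    return "GGUF Q4_K_M"       # laptop, Mac, CPU, or a single consumer GPU


print(pick_format(nvidia_gpu_serving=False))                               # GGUF Q4_K_M
print(pick_format(nvidia_gpu_serving=True))                                # AWQ
print(pick_format(nvidia_gpu_serving=True, existing_gptq_pipeline=True))   # GPTQ
```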

A quick reality check on what 4-bit buys you: a 70B model that needs ~140 GB at FP16 fits in roughly 40–45 GB at 4-bit. That is still too much for a single 24 GB consumer card, but it fits on a 48 GB workstation GPU like an RTX A6000 or split across two 24 GB cards, with the remaining few gigabytes going to the KV cache and activations. Smaller 7B–14B models, by contrast, slip onto ordinary gaming GPUs with room to spare. That is the real democratizing effect: quantization is what moved "run a serious model at home" from fantasy to a Saturday-afternoon project.
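A back-of-the-envelope fit check makes the same point. The 2 GB headroom for KV cache and activations is an assumption (long contexts need far more), and the two-card case ignores any overhead from splitting the model; the helper name is hypothetical.

```python
def fits(weights_gb, vram_gb, headroom_gb=2.0):
    """True if the weights plus an assumed working headroom fit in VRAM."""
    return weights_gb + headroom_gb <= vram_gb


# Total VRAM per setup, in GB.
setups = {"1x 24 GB": 24, "1x 48 GB": 48, "2x 24 GB": 48}

# 70B at roughly 4.8 bits per weight is about 42 GB of weights.
results = {name: fits(42.0, vram) for name, vram in setups.items()}
print(results)

# A 7B model at about 4.5 bits per weight is under 4 GB of weights.
small_gb = 7e9 * 4.5 / 8 / 1e9
print(fits(small_gb, 12))
```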

The catch nobody mentions

Quantization is lossy, full stop. The marketing-friendly framing is "almost no quality loss," and for Q4_K_M and AWQ on most tasks that is close to true. But the degradation is not uniform. It tends to show up first in the hardest things a model does: long-chain reasoning, precise code, math, and instruction-following at the edges. A model that benchmarks fine on casual chat at 4-bit can quietly fumble a complex multi-step task it would have nailed at full precision.

The practical rule: test the quantized model on your actual workload, not someone else's benchmark. If you are doing lightweight chat or drafting, 4-bit is almost certainly fine. If you are pushing a model to its reasoning limits, step up to Q5_K_M, Q8_0, or full precision and measure the difference yourself.
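A comparison harness does not need to be elaborate. The sketch below assumes you can wrap each model as a function from prompt to text; the two stand-in "models" and the test cases are fabricated purely so the example runs, and in practice the cases would come from your own workload.

```python
def pass_rate(generate, cases):
    """Fraction of (prompt, check) cases whose output passes its check."""
    passed = sum(1 for prompt, check in cases if check(generate(prompt)))
    return passed / len(cases)


def compare(full_precision, quantized, cases):
    """Run both models on the same cases and report the gap."""
    full = pass_rate(full_precision, cases)
    quant = pass_rate(quantized, cases)
    return {"full": full, "quantized": quant, "drop": full - quant}


# Each case pairs a prompt with a function that judges the output.
cases = [
    ("What is 2 + 2?", lambda out: "4" in out),
    ("What is 17 * 23?", lambda out: "391" in out),
]

# Stand-ins for real model calls (hypothetical canned answers).
fake_full = {"What is 2 + 2?": "4", "What is 17 * 23?": "391"}.get
fake_quant = {"What is 2 + 2?": "4", "What is 17 * 23?": "381"}.get

report = compare(fake_full, fake_quant, cases)
print(report)
```

The useful number is the drop on the hard cases, not the average: a quantized model can match full precision on easy prompts and still lose exactly where you need it.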

The Bottom Line

Quantization is the reason 2026 is the year local AI got real. GGUF is the universal runner — Q4_K_M should be your default unless you have a reason to deviate. AWQ is the production GPU-serving champion, winning on 4-bit quality through activation-aware precision. GPTQ remains a capable NVIDIA-focused option. None of them are free — you are trading bits for capability — but the trade is so favorable that running a frontier-class model on your own hardware has gone from a flex to a default. Just remember to test on your own work before you trust the savings.

