If you have ever tried to run a large language model on your own hardware, you have hit the same wall everyone hits: the model is too big for your GPU. A 70-billion-parameter model in full 16-bit precision wants roughly 140 GB just for its weights. No consumer card comes close. Quantization is the trick that makes the impossible routine — and in 2026 it is the single most important technique standing between you and a frontier-class model running on a machine you already own.
But "quantization" is not one thing. It is a family of methods with sharp trade-offs, and the three names you will see everywhere — GGUF, GPTQ, and AWQ — are not interchangeable. Pick the wrong one and you either leave quality on the table or you can't run the model at all. Here is what each actually does, and how to choose.
What quantization really is
A model's weights are just numbers. By default they are stored as 16-bit floating-point values (FP16 or BF16), giving each weight a wide, precise range. Quantization reduces the number of bits used to store each weight — from 16 down to 8, 4, or even fewer — by mapping the original range onto a much smaller set of discrete values.
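To make that mapping concrete, here is a minimal sketch of symmetric round-to-nearest quantization to 4 bits. It assumes a single scale shared by a short list of non-zero weights; the function names are illustrative rather than taken from any library, and real formats keep one scale per small block of weights instead of one per tensor.

```python
def quantize(weights, bits=4):
    """Map floats onto signed integer codes in [-2**(bits-1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax    # one shared step size
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.70, -0.33, 0.12, 0.04, -0.61]
codes, scale = quantize(weights)
restored = dequantize(codes, scale)

print(codes)
for w, r in zip(weights, restored):
    print(f"{w:+.2f} -> {r:+.2f}  (error {abs(w - r):.3f})")
```

Every restored value lands within half a step of the original, and that rounding error is the information being thrown away.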
The payoff is brutal arithmetic. At a rough 0.5 bytes per parameter for 4-bit storage, a 70B model's weights drop from ~140 GB to about 35 GB. The cost is precision: you are throwing away information, and if you do it carelessly the model gets dumber. Every method below is, at its core, a different strategy for which information to throw away.
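The arithmetic is simple enough to sketch in a few lines. This counts weights only, in decimal gigabytes; the KV cache and activations need memory on top of it, and the helper name is illustrative.

```python
def weight_gb(n_params, bits_per_weight):
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: {weight_gb(70e9, bits):.0f} GB")
```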
The whole game of modern quantization is selective sacrifice: spend your precious bits on the weights that matter, and starve the ones that don't.
GGUF: the format that runs everywhere
GGUF is the model file format created by the llama.cpp project, and its built-in quantization types are the ones most people actually use. Its defining feature is reach: GGUF runs on CPU, on consumer GPUs, on Apple Silicon, and on mixed CPU+GPU setups where some layers sit in VRAM and the rest spill to system RAM. If you have used Ollama, LM Studio, or llama.cpp directly, you have used GGUF.
The quantization behind a GGUF file is post-training: it converts an already-trained model without needing the original training pipeline. It also offers a ladder of quality levels, from Q2_K (tiny, lossy) up to Q8_0 (near-lossless). The ones worth knowing are the k-quant variants like Q4_K_M and Q5_K_M.
The "K" matters. K-quants use mixed precision within a single model: more sensitive tensors (such as attention weights) are stored at higher bit-depth while less sensitive feedforward weights are quantized harder. The result is that a k-quant 4-bit model averages closer to 4.5 bits per weight in practice — which is why a 70B model in Q4_K_M lands around 40–45 GB rather than the naive 35 GB, and why it punches well above its bit-budget on quality.
| GGUF level | Rough bits/weight | Use case |
|---|---|---|
| Q2_K | ~2.6 | Last resort on tiny hardware |
| Q4_K_M | ~4.5 | The default for most people |
| Q5_K_M | ~5.5 | A bit more quality, a bit more RAM |
| Q8_0 | ~8.5 | Near-original, when you have the memory |
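One reason these bits-per-weight figures are not round numbers is that block-wise formats store a shared scale alongside each small block of weights. A minimal sketch of that overhead, assuming the Q8_0 layout of 32 eight-bit codes plus one 16-bit scale per block. The function name is illustrative, and k-quants use larger super-blocks with more elaborate metadata that this does not model.

```python
def effective_bits(code_bits, block_size, scale_bits=16):
    """Average bits per weight when each block of weights shares one scale."""
    return code_bits + scale_bits / block_size

print(effective_bits(8, 32))   # 8-bit codes, 32 per block
print(effective_bits(4, 32))   # 4-bit codes with the same block layout
```

The first line gives 8.5, matching the Q8_0 row; the same layout with 4-bit codes gives 4.5.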
If you only remember one thing: Q4_K_M is the sensible default. It keeps the large majority of full-precision quality and runs on hardware as humble as a laptop.
GPTQ: GPU-first, layer-by-layer
GPTQ takes a more surgical approach. Instead of mapping weights with a fixed scheme, it uses a small set of calibration data and quantizes the model one layer at a time, adjusting the remaining weights to compensate for the error introduced so far. The goal is to minimize the difference between the quantized layer's output and the original's.
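The compensation step is easier to see in code. The following is a toy NumPy sketch in the spirit of GPTQ, not the reference implementation: it quantizes one column at a time and spreads each column's rounding error over the columns still to come, using the inverse of a Hessian built from calibration inputs. The per-row scales, the damping value, the random test data, and all function names are assumptions made for illustration; production implementations add grouping, blocking, and optimized kernels.

```python
import numpy as np

def row_scales(W, bits=4):
    # One symmetric scale per output row: the largest |weight| maps to qmax.
    qmax = 2 ** (bits - 1) - 1
    return np.abs(W).max(axis=1, keepdims=True) / qmax

def to_grid(w, scale, bits=4):
    # Round to the nearest representable value and clip to the signed range.
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def rtn_quantize(W, bits=4):
    # Baseline: plain round-to-nearest with no error compensation.
    return to_grid(W, row_scales(W, bits), bits)

def gptq_like_quantize(W, X, bits=4, damp=0.01):
    # W: (out_features, in_features); X: (in_features, n_samples) calibration inputs.
    W = np.array(W, dtype=np.float64)            # working copy, updated in place
    scale = row_scales(W, bits)                  # scales fixed from the original weights
    n_in = W.shape[1]
    H = X @ X.T                                  # layer-wise Hessian proxy
    H = H + damp * np.mean(np.diag(H)) * np.eye(n_in)   # damping for stability
    U = np.linalg.cholesky(np.linalg.inv(H)).T   # upper factor: inv(H) = U.T @ U
    Q = np.zeros_like(W)
    for i in range(n_in):
        q = to_grid(W[:, i:i + 1], scale, bits)  # quantize column i
        Q[:, i:i + 1] = q
        err = (W[:, i:i + 1] - q) / U[i, i]
        W[:, i + 1:] -= err @ U[i:i + 1, i + 1:] # push the error onto later columns
    return Q

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64))
X = rng.normal(size=(64, 64)) @ rng.normal(size=(64, 512))  # correlated features

def output_error(W_hat):
    # Distance between the original layer output and the quantized layer output.
    return float(np.linalg.norm(W @ X - W_hat @ X))

print("round-to-nearest:", output_error(rtn_quantize(W)))
print("GPTQ-style:      ", output_error(gptq_like_quantize(W, X)))
```

When the input features are correlated, as they are in this synthetic setup, the compensated version typically reports a lower output error than plain rounding. When the inputs are uncorrelated, the Hessian is diagonal and the procedure reduces to round-to-nearest.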
GPTQ is built for NVIDIA GPUs. It leans on CUDA and tensor cores for fast group-wise 4-bit inference, and historically it was the go-to for serving quantized models on a single GPU. The trade-off: it is GPU-bound by design, so it does not give you the CPU and Apple Silicon flexibility that GGUF does, and quality at 4-bit tends to trail the activation-aware methods on instruction-tuned models.
AWQ: protect the weights that matter
AWQ (Activation-aware Weight Quantization) is the method that has quietly become the default for production GPU serving. Its insight is elegant: not all weights are equally important, and you can tell which ones matter by watching the model's activations, not just the weights themselves.
During a calibration pass, AWQ identifies the small fraction of salient weight channels that have an outsized effect on the output. It protects those by scaling them up before quantization and folding the inverse scale into the incoming activations, so they lose less to rounding while every weight is still stored at the same low bit-width. Because it optimizes for what actually drives the model's predictions, AWQ generally delivers the best quality at 4-bit, especially for chat and instruction-tuned models, and it does so with fast GPU inference. That combination is why so much 2026 production serving runs on AWQ.
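A toy NumPy sketch of the activation-aware idea, assuming the per-channel scaling formulation described in the AWQ paper: input channels that see large activations are scaled up before rounding, and the inverse scale is applied afterwards (in a real deployment it is folded into the preceding operation). The search grid, per-row quantizer, synthetic data, and function names are illustrative assumptions, not the AutoAWQ API.

```python
import numpy as np

def quantize_rows(W, bits=4):
    # Symmetric round-to-nearest with one scale per output row.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def output_error(W, W_hat, X):
    # Distance between the original layer output and the quantized layer output.
    return float(np.linalg.norm(W @ X - W_hat @ X))

def awq_like_quantize(W, X, bits=4, n_grid=20):
    # W: (out_features, in_features); X: (in_features, n_samples) calibration inputs.
    act_mag = np.abs(X).mean(axis=1) + 1e-8      # per-input-channel activation size
    best_err, best_W = np.inf, None
    for alpha in np.linspace(0.0, 1.0, n_grid + 1):
        s = act_mag ** alpha                     # alpha = 0 means no scaling at all
        W_hat = quantize_rows(W * s, bits) / s   # effective weights after unscaling
        err = output_error(W, W_hat, X)
        if err < best_err:
            best_err, best_W = err, W_hat
    return best_W, best_err

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 64))
X = rng.normal(size=(64, 512))
X[:4] *= 20.0                                    # a few channels with large activations

rtn_err = output_error(W, quantize_rows(W), X)
_, awq_err = awq_like_quantize(W, X)
print("round-to-nearest:", rtn_err)
print("AWQ-style:       ", awq_err)
```

Because the grid includes alpha = 0, which is plain round-to-nearest, the searched result can never be worse than the baseline on the calibration data. How much better it gets depends on how uneven the activation magnitudes are.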
How to choose
The decision is less about which method is "best" in the abstract and more about your hardware and your goal.
Run it locally on mixed or modest hardware — laptop, Mac, single consumer GPU, or CPU? Use GGUF, level Q4_K_M. It is the most forgiving and the most portable, and tools like Ollama hide the complexity entirely.
Serving a model to users from NVIDIA GPUs and want the best 4-bit quality? Reach for AWQ. It is the modern production default for a reason.
Already standardized on a GPTQ pipeline, or running on a stack where GPTQ has the best kernel support? It remains a perfectly solid GPU-serving choice.
A quick reality check on what 4-bit buys you: a 70B model that needs ~140 GB at FP16 fits in roughly 40–45 GB at 4-bit — still too much for a single 24 GB consumer card, but comfortable on a 48 GB workstation GPU like an RTX A6000, or split across two 24 GB cards. Smaller 7B–14B models, by contrast, slip onto ordinary gaming GPUs with room to spare. That is the real democratizing effect: quantization is what moved "run a serious model at home" from fantasy to a Saturday-afternoon project.
The catch nobody mentions
Quantization is lossy, full stop. The marketing-friendly framing is "almost no quality loss," and for Q4_K_M and AWQ on most tasks that is close to true. But the degradation is not uniform. It tends to show up first in the hardest things a model does: long-chain reasoning, precise code, math, and instruction-following at the edges. A model that benchmarks fine on casual chat at 4-bit can quietly fumble a complex multi-step task it would have nailed at full precision.
The practical rule: test the quantized model on your actual workload, not someone else's benchmark. If you are doing lightweight chat or drafting, 4-bit is almost certainly fine. If you are pushing a model to its reasoning limits, step up to Q5_K_M, Q8_0, or full precision and measure the difference yourself.
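A workload test does not need to be elaborate. Here is a minimal sketch of a side-by-side check, assuming you can wrap each model in a function that takes a prompt and returns text. The two model functions below are hard-coded stand-ins so the example runs on its own, and the cases are placeholders for prompts drawn from your real work.

```python
def pass_rate(generate, cases):
    """Fraction of (prompt, check) cases whose output satisfies its check."""
    passed = sum(1 for prompt, check in cases if check(generate(prompt)))
    return passed / len(cases)

# Stand-ins so the sketch runs; replace these with calls into your own models.
def full_precision_model(prompt):
    return {"2+2": "4", "17*23": "391"}.get(prompt, "")

def quantized_model(prompt):
    return {"2+2": "4", "17*23": "381"}.get(prompt, "")

# Each case pairs a prompt with a function that decides whether the answer is acceptable.
cases = [
    ("2+2", lambda out: out.strip() == "4"),
    ("17*23", lambda out: out.strip() == "391"),
]

print("full precision:", pass_rate(full_precision_model, cases))
print("quantized:     ", pass_rate(quantized_model, cases))
```

The useful number is the gap between the two pass rates on your own cases, not either rate on its own.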
The Bottom Line
Quantization is the reason 2026 is the year local AI got real. GGUF is the universal runner — Q4_K_M should be your default unless you have a reason to deviate. AWQ is the production GPU-serving champion, winning on 4-bit quality through activation-aware precision. GPTQ remains a capable NVIDIA-focused option. None of them are free — you are trading bits for capability — but the trade is so favorable that running a frontier-class model on your own hardware has gone from a flex to a default. Just remember to test on your own work before you trust the savings.


