You wrote a prototype with Hugging Face Transformers, model.generate() worked, and everything felt fine — until you put it behind an API and ten users showed up at once. Latency spiked, your GPU sat half-idle, and throughput cratered. That's not your code. It's the serving loop. vLLM exists to fix exactly this, and it's why by 2026 it's the default engine under most open-source LLM deployments.
This is a working playbook: what vLLM does differently, how to stand up a production server, and the four knobs that decide whether your GPU earns its keep.
Why the naive loop is slow
Two problems kill a hand-rolled inference server.
Problem one: memory waste. The KV cache — the model's memory of the tokens it has already processed — grows as text is generated. Traditional serving pre-allocates a contiguous block sized for the maximum possible sequence length, for every request. Most requests never use that much, so 60–80% of that reserved memory sits empty. Less usable memory means smaller batches, and smaller batches mean a starved GPU.
Problem two: static batching. Batch eight requests together and the whole batch waits for the slowest one to finish. A request that needed 20 tokens sits idle while its neighbor grinds out 800. The GPU processes dead slots.
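To put rough numbers on both problems, here is a back-of-the-envelope sketch. The model shape, the 2048-token reservation, and the request lengths are all made-up assumptions chosen to keep the arithmetic readable; they are not measurements from any real server.

```python
# Hypothetical model shape, chosen only to make the arithmetic concrete.
NUM_LAYERS = 32
NUM_KV_HEADS = 32
HEAD_DIM = 128
BYTES_PER_VALUE = 2      # fp16
MAX_SEQ_LEN = 2048       # what a naive server reserves per request (assumed)


def kv_bytes_per_token() -> int:
    # One key vector and one value vector per layer, per KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def reserved_memory_waste(actual_lengths: list[int]) -> float:
    """Fraction of pre-allocated KV slots that are never written."""
    reserved = len(actual_lengths) * MAX_SEQ_LEN
    used = sum(actual_lengths)
    return 1 - used / reserved


def static_batch_dead_slots(output_lengths: list[int]) -> float:
    """Fraction of slot-steps spent on requests that already finished."""
    steps = max(output_lengths)                  # batch runs until the slowest finishes
    total_slot_steps = steps * len(output_lengths)
    useful = sum(output_lengths)
    return 1 - useful / total_slot_steps


lengths = [20, 800, 150, 300, 60, 500, 90, 240]  # made-up tokens per request
print(kv_bytes_per_token() / 2**20, "MiB of KV cache per token")
print(f"reserved-memory waste: {reserved_memory_waste(lengths):.0%}")
print(f"dead slot-steps:       {static_batch_dead_slots(lengths):.0%}")
```

The exact percentages depend entirely on the assumed lengths. The point is only that both kinds of waste grow with the gap between the longest request and the typical one.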
vLLM attacks both.
PagedAttention and continuous batching
PagedAttention borrows the oldest trick in operating systems: virtual memory paging. Instead of one giant contiguous reservation, the KV cache is split into fixed-size blocks allocated on demand, exactly like OS memory pages. Waste drops from that 60–80% down to under 4%. That reclaimed memory goes straight into bigger batches.
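The paging idea can be sketched with a toy allocator. This is a simplified illustration, not vLLM's actual implementation: the class name, its methods, and the 16-token block size are assumptions made for the example.

```python
class PagedKVAllocator:
    """Toy allocator: hands out fixed-size KV blocks only when they are needed."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))       # physical block ids
        self.block_tables: dict[str, list[int]] = {}     # request id -> its blocks
        self.num_tokens: dict[str, int] = {}

    def append_token(self, request_id: str) -> None:
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count == len(table) * self.block_size:        # last block full, or none yet
            if not self.free_blocks:
                raise MemoryError("KV cache pool exhausted")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1

    def free(self, request_id: str) -> None:
        # Return every block of a finished request to the pool.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

    def wasted_slots(self) -> int:
        """Allocated but unused token slots: at most block_size - 1 per request."""
        return sum(
            len(self.block_tables[r]) * self.block_size - self.num_tokens[r]
            for r in self.block_tables
        )


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("req-a")    # 20 tokens -> 2 blocks
for _ in range(5):
    alloc.append_token("req-b")    # 5 tokens -> 1 block
print(alloc.block_tables, "wasted slots:", alloc.wasted_slots())
```

Because a request only ever holds one partially filled block, the waste per request is bounded by the block size rather than by the maximum sequence length.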
Continuous batching (sometimes called rolling batching) fixes the idle-slot problem. The moment one sequence in the batch finishes, vLLM slots a waiting request into its place instead of stalling the whole group. The scheduler revisits the batch at every decoding step rather than once per batch, so the GPU almost never runs a dead slot.
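A small simulation makes the difference concrete. It is a sketch under strong simplifying assumptions: every step costs the same, each request needs a fixed number of decode steps, and prefill is ignored. The function names and the request lengths are invented for illustration.

```python
from collections import deque


def static_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lengths: list[int], batch_size: int) -> int:
    """A finished request's slot is refilled from the queue before the next step."""
    waiting = deque(output_lengths)
    running: list[int] = []      # tokens still to generate, one entry per busy slot
    steps = 0
    while waiting or running:
        # Fill any free slots from the waiting queue.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token.
        running = [remaining - 1 for remaining in running]
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


lengths = [20, 800, 30, 40, 25, 35, 50, 45]   # made-up output lengths
print("static:    ", static_batching_steps(lengths, batch_size=2), "steps")
print("continuous:", continuous_batching_steps(lengths, batch_size=2), "steps")
```

In the static case every batch pays for its slowest member. In the continuous case the short requests cycle through one slot while the 800-token request occupies the other.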
Put together, the payoff is large. vLLM's own benchmarks — LLaMA-7B on an A10G and LLaMA-13B on an A100-40GB, with request lengths sampled from the ShareGPT dataset — show 14x to 24x higher throughput than Hugging Face Transformers, and 2.2x to 2.5x higher than Hugging Face's Text Generation Inference (TGI). Both continuous batching and PagedAttention are on by default. You get most of this for free the moment you switch engines.
Getting a server running
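Install is one line. vLLM 0.24.0 is current as of mid-2026:
uv pip install vllm --torch-backend=auto
The --torch-backend=auto flag belongs to uv, not to plain pip, which does not accept it. It lets uv pick the PyTorch build that matches your CUDA setup instead of you guessing. If you use plain pip, run pip install vllm without the flag.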
Now serve a model. vLLM ships an OpenAI-compatible server, so it's a drop-in replacement for any code already pointed at the OpenAI API:
vllm serve Qwen/Qwen2.5-1.5B-Instruct
That's the whole command. It downloads the weights, loads them, and starts serving on http://localhost:8000. Change the bind address with --host and --port. Note that the server hosts one model at a time — one process, one model.
Because it speaks the OpenAI protocol, your existing client just needs its base URL repointed:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
)
print(resp.choices[0].message.content)
No SDK swap, no bespoke request format. The /v1/models, /v1/completions, and /v1/chat/completions endpoints all work as expected.
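Since the server has to download and load weights before it can answer, a deployment script usually wants to wait for it. One possible sketch polls the /v1/models endpoint mentioned above using only the standard library. The helper names, timeouts, and the injectable fetch parameter are assumptions made for this example, and the response parsing assumes the OpenAI-style model list shape (a "data" array of objects with an "id").

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_model_ids(base_url: str) -> list[str]:
    """GET {base_url}/models and return the ids of the served models."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload["data"]]


def wait_until_ready(
    base_url: str,
    timeout_s: float = 300.0,
    poll_s: float = 2.0,
    fetch: Callable[[str], list[str]] = fetch_model_ids,
) -> list[str]:
    """Poll until the server lists at least one model, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            models = fetch(base_url)
            if models:
                return models
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # server is still loading weights or not listening yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{base_url} not ready after {timeout_s}s")
        time.sleep(poll_s)
```

Calling wait_until_ready("http://localhost:8000/v1") before sending traffic returns the served model ids once the server is up. The fetch argument exists only so the loop can be exercised without a live server.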
The four knobs that matter
Defaults are good, but production throughput lives in a handful of flags. Tune these in order — each builds on the last.
1. --gpu-memory-utilization (default 0.9). This is the fraction of VRAM vLLM claims for weights and the KV cache pool. A bigger pool means more concurrent sequences. If the GPU is dedicated to vLLM, push it up:
vllm serve <model> --gpu-memory-utilization 0.95
Leave headroom below 1.0 for CUDA context and fragmentation, or you'll hit out-of-memory errors under load. If you share the GPU with anything else, keep it lower.
2. --max-num-batched-tokens. The ceiling on tokens processed per scheduler step. Raising it lets vLLM pack more prompt tokens into each iteration, which lifts throughput and shortens time to first token, at the cost of some inter-token latency for requests that are already decoding. For throughput-oriented serving, 32768 or higher is a common starting point:
vllm serve <model> --max-num-batched-tokens 32768
3. --tensor-parallel-size. Split one model across multiple GPUs when the weights don't fit on a single card, or to add headroom for a bigger KV cache. Set it to the number of GPUs you are sharding across; the model's attention-head count must be divisible by that number:
vllm serve <big-model> --tensor-parallel-size 4
4. Chunked prefill. The prefill phase (processing the prompt) competes with the decode phase (generating tokens). Chunked prefill breaks long prompts into pieces so decode requests aren't blocked behind a giant prompt, smoothing tail latency under mixed traffic. It's enabled by default in recent versions, so the thing to know is that it exists and why other requests keep streaming tokens when a 100K-token prompt arrives.
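Knobs 2 and 4 interact: the per-step token budget decides how large each prefill chunk can be. The following is a deliberately simplified sketch of one scheduler step, assuming decode requests are served first and prefills take whatever budget is left. The Request class, schedule_step, and the request names are hypothetical and do not mirror vLLM's real scheduler code.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_tokens_left: int   # prompt tokens not yet prefilled; 0 means decoding


def schedule_step(requests: list[Request], token_budget: int) -> dict[str, int]:
    """Return how many tokens each request processes in this step."""
    plan: dict[str, int] = {}
    budget = token_budget
    # Decode-phase requests go first: one token each.
    for req in requests:
        if req.prompt_tokens_left == 0 and budget > 0:
            plan[req.name] = 1
            budget -= 1
    # Leftover budget goes to prefills; a prompt that does not fit is chunked.
    for req in requests:
        if req.prompt_tokens_left > 0 and budget > 0:
            chunk = min(req.prompt_tokens_left, budget)
            plan[req.name] = chunk
            req.prompt_tokens_left -= chunk
            budget -= chunk
    return plan


requests = [Request("chat-1", 0), Request("chat-2", 0), Request("long-doc", 100_000)]
for step in range(3):
    print(step, schedule_step(requests, token_budget=32_768))
```

In this toy model the two chat requests emit a token on every step while the long prompt is worked through in chunks. A smaller budget would mean smaller chunks and more steps to finish the long prompt.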
A reasonable throughput-tuned starting config on a dedicated H100 looks like this:
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--gpu-memory-utilization 0.95 \
--max-num-batched-tokens 32768
With continuous batching, a larger PagedAttention pool, and chunked prefill working together, that setup serves several times the traffic of a naive PyTorch loop on the same hardware.
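To see what the extra 0.05 of --gpu-memory-utilization buys, a rough capacity estimate helps. This sketch assumes an 80 GiB card as a round number, about 15 GiB of 16-bit weights for an 8B model, and a Llama-style shape with 32 layers, 8 KV heads, and head dimension 128. It ignores activation memory and other overheads, so treat the result as an upper bound rather than what vLLM will report at startup.

```python
GIB = 2**30


def kv_pool_tokens(
    vram_gib: float,
    gpu_memory_utilization: float,
    weights_gib: float,
    kv_bytes_per_token: int,
) -> int:
    """Rough token capacity of the KV cache pool (ignores activation memory)."""
    pool_bytes = (vram_gib * gpu_memory_utilization - weights_gib) * GIB
    return max(0, int(pool_bytes // kv_bytes_per_token))


# Assumed shape: 32 layers, 8 KV heads, head dim 128, 2 bytes per value (bf16).
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2   # key + value across all layers
WEIGHTS_GIB = 15                             # ~8B parameters at 2 bytes each (assumed)

for util in (0.90, 0.95):
    tokens = kv_pool_tokens(80, util, WEIGHTS_GIB, KV_BYTES_PER_TOKEN)
    print(f"utilization {util:.2f}: ~{tokens:,} tokens of KV cache")
```

Dividing the token capacity by a typical sequence length gives a ballpark for how many sequences can be in flight at once.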
A tuning method, not magic numbers
Don't copy someone else's flags and call it done — the right values depend on your model size, prompt-to-output ratio, and concurrency. Work in this order:
1. Start with defaults and measure. Continuous batching is already on, so establish a baseline with real request shapes.
2. Raise --gpu-memory-utilization toward 0.95 to enlarge the KV cache pool and watch for OOMs.
3. Lift --max-num-batched-tokens and watch throughput climb while you keep an eye on p99 latency.
4. Only reach for --tensor-parallel-size when a model won't fit or you need still more cache headroom.
Change one variable at a time and re-measure, because these knobs interact.
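Measuring after each change needs a consistent summary of a load-test run. A minimal sketch, assuming you have already recorded per-request latencies, output token counts, and total wall-clock time with whatever load generator you use; the function names and the nearest-rank percentile method are choices made for this example.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_run(
    latencies_s: list[float],
    output_tokens: list[int],
    wall_clock_s: float,
) -> dict[str, float]:
    """Collapse one benchmark run into the numbers worth comparing across configs."""
    return {
        "requests_per_s": len(latencies_s) / wall_clock_s,
        "output_tokens_per_s": sum(output_tokens) / wall_clock_s,
        "p50_latency_s": percentile(latencies_s, 50),
        "p99_latency_s": percentile(latencies_s, 99),
    }


# Placeholder samples, not real measurements.
baseline = summarize_run([0.5, 1.0, 1.5, 2.0], [100, 200, 300, 400], wall_clock_s=10.0)
print(baseline)
```

Run the same summary for each configuration and compare throughput against p99 side by side, so a throughput gain that costs too much tail latency is visible immediately.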
Two failure modes to watch. Push --gpu-memory-utilization too high and you buy a slightly bigger batch at the price of occasional OOM crashes under load, which is rarely worth it. And chasing maximum --max-num-batched-tokens for a latency-sensitive chat app is the wrong trade; that setting favors throughput, and interactive users feel the tail.
Not just a server: offline batch jobs
The vllm serve command is the right tool for live traffic, but plenty of work is offline — scoring a dataset, generating synthetic data, running an eval suite. For those, skip the HTTP layer entirely and use the Python LLM class, which applies the same PagedAttention and batching internally:
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = ["Summarize: ...", "Classify sentiment: ...", "Extract entities: ..."]
outputs = llm.generate(prompts, params) # whole list batched at once
for o in outputs:
    print(o.outputs[0].text)
Hand generate() the entire list of prompts in one call rather than looping request-by-request. vLLM batches the whole set internally, and that's where the throughput advantage shows up — a per-prompt Python loop throws most of it away.
A note on quantization
If a model won't fit or you want a bigger KV-cache pool, quantized weights help. vLLM loads common formats — AWQ, GPTQ, and others — and you point it at the format with a single flag:
vllm serve <awq-model> --quantization awq
Quantization shrinks the weights so more VRAM is freed for the PagedAttention pool, which means larger batches and higher throughput — at some accuracy cost. Benchmark the quantized model on your own task before trusting it; the trade-off is real and model-dependent.
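The memory side of that trade-off is simple arithmetic. This sketch assumes an 8B-parameter model as a round number and ignores the scales and zero-points that real quantized formats store alongside the weights, so actual footprints will be somewhat larger.

```python
def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight footprint; ignores quantization scales and zero-points."""
    return num_params * bits_per_weight / 8 / 2**30


params = 8e9   # assumed parameter count
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_gib(params, bits):.1f} GiB of weights")
```

Whatever the weights give back becomes available to the KV cache pool, within the limit set by --gpu-memory-utilization.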
The bottom line
vLLM turns LLM serving from a bespoke engineering project into a one-line command, and the defaults alone (PagedAttention plus continuous batching) delivered 14x to 24x the throughput of a raw Transformers loop in vLLM's own benchmarks. The OpenAI-compatible server means your existing client code barely changes. Get the process running first, confirm it works, then tune --gpu-memory-utilization, --max-num-batched-tokens, tensor parallelism, and chunked prefill against your traffic, one knob at a time. Measure at every step. The engine does the hard part; your job is to stop starving the GPU.


