For a decade, the recipe for a smarter language model was simple: make it bigger and feed it more data. Then, in late 2024, the industry discovered a second lever — and it doesn't touch training at all. Test-time compute is the practice of spending more computation while the model answers instead of only while it learns. It is the single idea behind every "reasoning model" you've used since OpenAI's o1, and understanding it is the difference between knowing that these models are slower and knowing why they're sometimes dramatically better — and sometimes dramatically worse.
This is a technical piece for people who deploy these models, pay their bills, or just want to know what's happening inside the spinner that says "Thinking."
The shift from train-time to test-time
Classic scaling laws describe a relationship between a model's loss and three things fixed before deployment: parameters, training data, and training compute. Once trained, a standard model spends a fixed amount of compute per generated token: one forward pass each, stopping when the answer is complete. The quality of that answer is baked in.
Test-time compute breaks that assumption. The core insight, demonstrated most visibly by OpenAI's o1 in September 2024, is that a model trained with large-scale reinforcement learning can be allowed to generate a long internal chain of thought before committing to a final answer. Those intermediate "thinking" tokens act as a scratch pad: the model explores approaches, checks its own work, backtracks, and only then writes the response you see.
The economic framing is what makes this a genuine paradigm and not a trick. You are no longer limited by what the model knows at the instant the prompt arrives. You can buy additional accuracy at inference time by letting the model compute more — trading latency and dollars for correctness. For a hard math proof or a multi-step agentic task, that trade is often worth it. For "what's the capital of France," it is pure waste.
How the model learned to think: o1 and R1
The reasoning behavior isn't prompted into existence — it's trained. DeepSeek-R1, released January 20, 2025 under an MIT license, was the first major open-weight model to match o1's reasoning across math, coding, and graduate-level science, and because its method was public, it revealed how this class of model is built.
DeepSeek shipped two variants. DeepSeek-R1-Zero was trained with pure reinforcement learning, with no supervised fine-tuning warm-up at all. Remarkably, behaviors like self-verification, reflection, and generating long chains of thought emerged on their own from the RL objective, simply because they helped the model get answers right. The production DeepSeek-R1 added a small supervised "cold-start" stage before RL, plus further fine-tuning rounds, to make those chains more readable and stable.
The takeaway: reasoning models aren't running a different algorithm at inference. They are ordinary autoregressive transformers that were rewarded, during training, for producing long deliberate token sequences before answering. Test-time compute is the lever that controls how long those sequences get.
The two ways to spend more compute
There are two fundamental strategies for allocating extra inference budget, and they behave very differently.
Sequential scaling: think longer
Sequential scaling extends a single reasoning trace — you make the model deliberate for more tokens before it answers. The cleanest public demonstration is the s1 paper (Muennighoff et al., 2025), which introduced a technique called budget forcing.
The mechanism is almost comically simple. When the model tries to stop thinking, you suppress its end-of-thinking token and append the word "Wait" to its output. This nudges it to second-guess and continue, often catching its own errors. To cap compute, you do the reverse — force the thinking to terminate.
The results were striking. The authors fine-tuned Qwen2.5-32B-Instruct on just 1,000 curated reasoning examples (a dataset they called s1K) and equipped it with budget forcing. The resulting s1-32B exceeded o1-preview by up to 27% on competition math (MATH and AIME24). More importantly, budget forcing let them extrapolate performance: pushing the model to think longer lifted its AIME24 score from 50% to 57% without any change to weights. Same model, more thinking, better answer.
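The suppress-and-append loop behind budget forcing can be sketched in a few lines. This is a minimal illustration, not the s1 authors' implementation: `next_chunk`, `toy_model`, and the `</think>` marker are hypothetical stand-ins, and the budget is counted in chunks rather than real tokens.

```python
from typing import Callable, List

# Hypothetical end-of-thinking marker; real models use their own delimiter.
END_THINK = "</think>"


def budget_forced_thinking(
    next_chunk: Callable[[str], str],
    prompt: str,
    min_chunks: int,
    max_chunks: int,
    nudge: str = "Wait",
) -> str:
    """Build a reasoning trace between min_chunks and max_chunks long."""
    trace: List[str] = []
    while len(trace) < max_chunks:
        chunk = next_chunk(prompt + " ".join(trace))
        if chunk == END_THINK:
            if len(trace) >= min_chunks:
                break  # minimum budget met, so let the model stop
            trace.append(nudge)  # suppress the stop and nudge it onward
        else:
            trace.append(chunk)
    # Leaving the loop at max_chunks forces termination: that is the cap.
    return " ".join(trace) + " " + END_THINK


def toy_model(context: str) -> str:
    """Stand-in for a model: one thought after the prompt or a nudge, then stop."""
    return "step" if context.endswith(("Q:", "Wait")) else END_THINK


print(budget_forced_thinking(toy_model, "Q:", min_chunks=4, max_chunks=10))
```

With a real model, `next_chunk` would be a decoding call and the budgets would be token counts; the control flow stays the same.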
Parallel scaling: think wider
Parallel scaling spends the budget differently: instead of one long trace, you generate many independent answers and pick the best. The classic methods are Best-of-N sampling (generate N candidates, select one with a reward model or verifier) and majority voting / self-consistency (generate N chains, take the most common final answer).
The advantage is robustness. Because each sample is independent, a single derailed chain of thought doesn't poison the result — it gets outvoted. Research across math, vision, and language has consistently shown parallel scaling maintains or improves accuracy as the budget grows.
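Both selection rules are short enough to sketch directly. The snippet below is illustrative only: the sampled answers are made up, and `toy_verifier` is a hypothetical checker standing in for a reward model or test suite.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Self-consistency: return the most common final answer."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: return the candidate the verifier scores highest."""
    return max(candidates, key=score)


# Made-up final answers from five independent chains for "x + 8 = 50".
samples = ["42", "41", "42", "17", "42"]


def toy_verifier(answer: str) -> float:
    """Hypothetical checker: 1.0 if the answer satisfies the equation."""
    return 1.0 if int(answer) + 8 == 50 else 0.0


print(majority_vote(samples))            # the two wrong chains are outvoted
print(best_of_n(samples, toy_verifier))  # the verifier picks a passing answer
```

Note the different requirements: voting needs answers that can be compared for equality, while Best-of-N needs a scoring function you trust.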
| | Sequential scaling | Parallel scaling |
|---|---|---|
| Mechanism | One longer chain of thought | Many independent chains |
| Examples | Budget forcing, "Wait" tokens | Best-of-N, majority voting |
| Strength | Deep, self-correcting reasoning | Robust; bad samples get outvoted |
| Failure mode | Overthinking degrades answers | Cost scales linearly with N |
| Needs | A model trained to reason long | A verifier or voting rule |
The catch nobody warns you about: overthinking
Here is the finding that should reshape how you configure these models. More thinking is not monotonically better. A growing body of recent research, including work pointedly titled Mirage of Test-Time Scaling, shows that sequential scaling can degrade accuracy past a certain point. The model talks itself out of correct answers, compounds an early mistake across hundreds of tokens, or spirals into irrelevant tangents.
Parallel scaling is more forgiving here, because independent samples don't share a single failure path. But it isn't free either: cost grows linearly with the number of samples, and you still need a reliable way to select the winner. A weak verifier can confidently pick the wrong answer out of ten good ones.
The practical principle: test-time compute is a tool with a sweet spot, not a slider you crank to maximum. The optimal budget depends on task difficulty. Easy questions need almost none — and spending it anyway risks overthinking a problem the model already had right.
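One way to find that sweet spot is to sweep thinking budgets on your own eval set and take the smallest budget that stays close to the best accuracy. The sketch below assumes you have already measured accuracy per budget; the numbers in `eval_results` are invented for illustration, and `pick_budget` is a hypothetical helper.

```python
from typing import Mapping


def pick_budget(accuracy_by_budget: Mapping[int, float], tolerance: float = 0.01) -> int:
    """Return the smallest budget whose accuracy is within tolerance of the best."""
    best = max(accuracy_by_budget.values())
    return min(
        budget
        for budget, accuracy in accuracy_by_budget.items()
        if accuracy >= best - tolerance
    )


# Invented eval results: accuracy rises, peaks, then drops from overthinking.
eval_results = {512: 0.61, 2048: 0.70, 8192: 0.74, 32768: 0.69}

print(pick_budget(eval_results))                  # the peak budget
print(pick_budget(eval_results, tolerance=0.05))  # a cheaper, near-peak budget
```

Loosening `tolerance` trades a little accuracy for a large cut in tokens, which is often the right call for routine traffic.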
There's a deeper lesson here about what reasoning models actually are. Because the deliberation is learned behavior rather than a guaranteed search procedure, a longer chain is not a proof of more rigor — it's just more sampled tokens, each of which can be right or wrong. A model that has "talked itself into" a bad assumption early will often spend its extra budget elaborating that mistake with growing confidence. This is why a verifier or a vote — something outside the chain that can check the result — tends to be more reliable than simply trusting a model that thought for longer.
The bill: where your tokens really go
If you operate one of these models, the most important sentence in this article is this one: the reasoning tokens are billable, and you usually can't see them.
When a model "pauses to think," it is generating hidden reasoning tokens that never appear in the final chat bubble but represent a real surge in compute. On a metered API, those tokens are counted as output. A reasoning model answering a hard question can emit thousands of internal tokens to produce a three-sentence reply — and you pay for all of them.
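The arithmetic is worth seeing once. This sketch assumes reasoning tokens are billed at the output rate, as described above; the prices and token counts are hypothetical placeholders, not any provider's actual rates.

```python
def request_cost_usd(
    input_tokens: int,
    visible_tokens: int,
    reasoning_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Cost of one request when hidden reasoning tokens bill as output."""
    billed_output = visible_tokens + reasoning_tokens
    total = input_tokens * usd_per_m_input + billed_output * usd_per_m_output
    return total / 1_000_000


# Hypothetical prices: $2 per million input tokens, $8 per million output tokens.
with_thinking = request_cost_usd(200, 80, 4000, 2.0, 8.0)
without_thinking = request_cost_usd(200, 80, 0, 2.0, 8.0)

print(f"with thinking:    ${with_thinking:.5f}")
print(f"without thinking: ${without_thinking:.5f}")
print(f"ratio:            {with_thinking / without_thinking:.1f}x")
```

In this made-up case the visible reply is identical in both requests, and the hidden reasoning accounts for nearly all of the bill.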
This has concrete consequences:
- Latency is variable, not fixed. Response time now depends on how hard the model decides the question is. Budget your timeouts accordingly.
- Cost is unpredictable per request. Two superficially similar prompts can differ by an order of magnitude in token spend.
- Routing matters. Sending trivial queries to a reasoning model is the most common and expensive mistake. Many production systems now classify difficulty first and route only hard problems to the expensive "thinking" path.
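A difficulty-first router can be sketched as below. Everything here is an assumption for illustration: the model names, budgets, and timeouts are placeholders, and `estimate_difficulty` is a crude keyword-and-length heuristic where a production system would more likely use a trained classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str            # placeholder model name
    thinking_budget: int  # max reasoning tokens; 0 disables thinking
    timeout_s: float      # reasoning needs a longer timeout


FAST = Route("fast-model", 0, 10.0)
REASONING = Route("reasoning-model", 8192, 120.0)

# Hypothetical hints that a prompt needs multi-step reasoning.
HARD_HINTS = ("prove", "debug", "optimize", "derive", "step by step")


def estimate_difficulty(prompt: str) -> float:
    """Crude score in [0, 1] from keyword hits and prompt length."""
    text = prompt.lower()
    hits = sum(hint in text for hint in HARD_HINTS)
    length_score = min(len(text.split()) / 200, 1.0)
    return min(1.0, 0.4 * hits + 0.6 * length_score)


def route(prompt: str, threshold: float = 0.3) -> Route:
    """Send only prompts scored as hard to the expensive thinking path."""
    return REASONING if estimate_difficulty(prompt) >= threshold else FAST


print(route("What's the capital of France?").model)
print(route("Prove that the sum of two even integers is even.").model)
```

The timeout lives on the route because latency depends on the path taken, which matches the first bullet above.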
A practical framework for deployment
Putting it together, here's how to reason about reasoning models:
- Match compute to difficulty. Use a cheap, fast model — or a reasoning model with a low thinking budget — for routine work. Reserve heavy test-time compute for genuinely hard, high-stakes tasks where a wrong answer is costly.
- Prefer parallel scaling when you have a good verifier. If you can cheaply check answers (unit tests for code, a reward model for math), Best-of-N is robust and predictable.
- Cap sequential budgets. Don't assume longer is better. Set and test a thinking-token ceiling; watch for overthinking on your own eval set.
- Instrument hidden tokens. Log reasoning-token counts, not just visible output. That's where your costs and latency actually come from.
- Remember what it can't fix. Test-time compute is not a guaranteed accuracy button. It cannot repair gaps caused by poor training data or knowledge the model simply doesn't have — it only helps the model use what it already knows more carefully.
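For the instrumentation point, a small usage log goes a long way. The sketch below uses a hypothetical `UsageRecord` with made-up values; in practice you would fill it from whatever usage fields your provider returns, and those field names vary.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Sequence


@dataclass
class UsageRecord:
    visible_tokens: int    # tokens in the reply the user sees
    reasoning_tokens: int  # hidden thinking tokens, still billed
    latency_s: float       # wall-clock time for the request


def summarize(records: Sequence[UsageRecord]) -> Dict[str, float]:
    """Aggregate the numbers that drive cost and latency."""
    reasoning = sum(r.reasoning_tokens for r in records)
    output = reasoning + sum(r.visible_tokens for r in records)
    return {
        "reasoning_share": reasoning / output if output else 0.0,
        "max_reasoning_tokens": max((r.reasoning_tokens for r in records), default=0),
        "mean_latency_s": mean(r.latency_s for r in records) if records else 0.0,
    }


# Made-up log entries for three requests.
log = [
    UsageRecord(80, 4000, 21.0),
    UsageRecord(60, 0, 0.9),
    UsageRecord(120, 900, 6.1),
]

print(summarize(log))
```

Tracking `reasoning_share` over time is a cheap way to notice when a prompt or model change has quietly made requests think longer.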
The bottom line
Test-time compute reframes inference from a fixed cost into a dial you control: spend more computation to think longer or think wider, and buy accuracy you couldn't get from the weights alone. The s1 results show the upside is real: a 32B model fine-tuned on just 1,000 examples beat o1-preview on competition math, then climbed from 50% to 57% on AIME24 simply by being told to "Wait." But the overthinking research is the necessary counterweight: deliberation has a sweet spot, the hidden tokens land on your invoice, and trivial questions don't deserve a reasoning budget at all. The labs gave us a new lever. The engineering discipline is knowing exactly how hard to pull it.


