Prompt Caching: How to Cut LLM API Costs by Up to 90%

Prompt caching stores the computed KV attention tensors for a repeated prompt prefix so the model skips recomputation, cutting input cost and latency. Anthropic (explicit cache_control, ~90% read discount), OpenAI (automatic, 50% off, 1,024-token minimum), and Google Gemini (implicit plus explicit cache objects, up to 90%) all support it. The one rule that determines hit rate: put all static content at the front of the prompt and all dynamic content at the back.

Marcus Rivera

Jun 12, 2026

If your LLM bill is climbing and you have not turned on prompt caching, you are almost certainly overpaying — often by a wide margin. It is the single highest-leverage cost optimization available to most teams running LLMs in production, and on at least one major provider it requires zero code changes. Yet it remains one of the most under-used features in the entire API surface.

This is a practical guide to what prompt caching is, how it works on Anthropic, OpenAI, and Google Gemini, and how to structure your prompts so the cache actually fires. It assumes you are already calling these APIs and want to cut the bill.

What prompt caching actually does

Every time a model processes your prompt, it computes a set of internal attention matrices — the key and value (KV) tensors — for each token. That computation is the expensive part, and normally it is thrown away the instant the response is returned. On your next request, the model recomputes all of it from scratch, even if 90% of your prompt is identical to last time.

Prompt caching fixes exactly this waste:

The provider stores the computed K and V matrices for a repeated portion of your prompt in their datacenter, so it does not have to reprocess those tokens from scratch on every request. Cached entries are typically held for a few minutes after each call.

The key word is prefix. Caching works on the longest matching prefix of your prompt — the run of tokens from the very start that is byte-for-byte identical to a previous request. The moment the text diverges, caching stops. This single fact dictates everything about how you should structure prompts, which we will get to.
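To make the prefix rule concrete, here is a minimal sketch that measures how much of a new prompt matches an earlier one. It uses whitespace-split words as a stand-in for real tokens, which is an assumption (providers match on their own tokenizer output), and `shared_prefix_len` is a hypothetical helper, not part of any provider's API.

```python
def shared_prefix_len(previous: list[str], current: list[str]) -> int:
    """Count the leading items that are identical in both sequences."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break  # the first mismatch ends the reusable prefix
        n += 1
    return n


# Whitespace-split words stand in for tokens here (an assumption).
earlier = "You are a support bot. Answer briefly. Q: reset password".split()
stable = "You are a support bot. Answer briefly. Q: change email".split()
volatile = "[12:01] You are a support bot. Answer briefly. Q: change email".split()

print(shared_prefix_len(earlier, stable))    # everything before the question matches
print(shared_prefix_len(earlier, volatile))  # the leading timestamp breaks the match
```

The second call is the important one: a single volatile item at the very start leaves nothing to reuse, even though the rest of the prompt is unchanged.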

The payoff is two-sided: lower cost (cached tokens are billed at a steep discount) and lower latency (the model skips recomputation). Reductions in input cost of 70–90% and latency improvements up to 80% are realistic for workloads with a large, stable context.

The three providers, compared

The discounts and the amount of work required differ meaningfully. Here is the current landscape.

Provider	How you enable it	Cache read discount	Minimum prompt
Anthropic (Claude)	Explicit — add `cache_control` markers	~90% (reads billed at 0.10× input)	1,024 tokens
OpenAI	Automatic — no code changes	50% off cached input	1,024 tokens
Google Gemini	Implicit (auto) or explicit cache objects	Up to 90% on cached tokens	varies by model

Anthropic: the most explicit, the most generous

Anthropic gives you the deepest discount but asks you to opt in. You mark the stable parts of your prompt with a cache_control block of type: "ephemeral":

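import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# LONG_SYSTEM_PROMPT_AND_DOCS and user_question are defined elsewhere in your app.
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT_AND_DOCS,   # the big static blob
            "cache_control": {"type": "ephemeral"}  # cache everything up to here
        }
    ],
    messages=[{"role": "user", "content": user_question}]
)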

The economics, on Claude Sonnet 4.6 / Opus 4.8 and the current 4-series models:

Cache write: 1.25× the base input price for a 5-minute TTL, or 2.0× for a 1-hour TTL. You pay a small premium the first time.
Cache read: 0.10× base input — a 90% discount on every subsequent request within the TTL window.
Minimum: the cacheable section must be at least 1,024 tokens on most models (some models require more); shorter blocks are silently ignored.
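The write premium means caching only pays off once a cached block is actually read. As a rough sketch of that break-even, using only the multipliers listed above and assuming every request lands inside the TTL window (the helper names are hypothetical):

```python
# Multipliers relative to the base input price, taken from the list above.
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}
READ_MULTIPLIER = 0.10


def cached_cost(requests_in_window: int, ttl: str = "5m") -> float:
    """Relative cost of the cached block: one write, then reads."""
    if requests_in_window < 1:
        return 0.0
    return WRITE_MULTIPLIER[ttl] + READ_MULTIPLIER * (requests_in_window - 1)


def break_even_requests(ttl: str = "5m") -> int:
    """Smallest request count at which caching beats paying full price."""
    n = 1
    while cached_cost(n, ttl) >= n:  # uncached cost is n times the base price
        n += 1
    return n


for ttl in ("5m", "1h"):
    print(ttl, break_even_requests(ttl))
```

Under these assumptions the 5-minute cache is ahead from the second request and the 1-hour cache from the third, so the premium is recovered quickly as long as traffic arrives before the entry expires.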

You can mix TTLs in one request, but there is an ordering rule: 1-hour cache entries must appear before 5-minute ones. Put your most stable content (system prompt, tool definitions) in the longer-lived cache, earlier in the prompt.
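A sketch of what that ordering might look like as a request payload, plus a small check you could run before sending. The `"ttl": "1h"` and `"ttl": "5m"` values reflect my understanding of the `cache_control` field and should be confirmed against the current API reference; `system_blocks` and `ttl_order_ok` are hypothetical helpers, and treating a missing `ttl` as 5 minutes is an assumption.

```python
def system_blocks(stable_text: str, session_text: str) -> list[dict]:
    """Build system blocks with the longer-lived cache entry first."""
    return [
        {
            "type": "text",
            "text": stable_text,  # e.g. system prompt, rarely changes
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        },
        {
            "type": "text",
            "text": session_text,  # e.g. documents for this session
            "cache_control": {"type": "ephemeral", "ttl": "5m"},
        },
    ]


def ttl_order_ok(blocks: list[dict]) -> bool:
    """Return False if a 1-hour entry appears after a 5-minute one."""
    seen_short = False
    for block in blocks:
        control = block.get("cache_control")
        if control is None:
            continue  # uncached blocks do not affect the ordering rule
        ttl = control.get("ttl", "5m")  # assumed default when ttl is omitted
        if ttl == "1h" and seen_short:
            return False
        if ttl == "5m":
            seen_short = True
    return True


blocks = system_blocks("long stable instructions", "session documents")
print(ttl_order_ok(blocks))
```

The list returned by `system_blocks` is shaped to be passed as the `system` argument in a request like the one shown earlier.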

OpenAI: free, automatic, and easy to miss

OpenAI's implementation takes the opposite approach: it is on by default, with no code changes and no extra fees. Any request of 1,024 tokens or more on a supported model automatically reuses the longest previously computed prefix, with cache hits growing in 128-token increments.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# No cache_control, no flags. Just keep the prefix stable.
response = client.responses.create(
    model="gpt-5.5",
    instructions=STABLE_SYSTEM_PROMPT,   # identical across calls -> cached
    input=user_question
)

The trade-off is a smaller discount: cached input tokens cost 50% less than standard input, versus Anthropic's 90%. But since it costs you nothing to turn on, there is no reason not to benefit — your only job is to keep the prompt prefix stable so the cache can find it.
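The 1,024-token floor and 128-token increments can be sketched as simple arithmetic. This is an illustration of the rule as described above, not OpenAI's implementation, and both function names are hypothetical.

```python
MIN_CACHEABLE_TOKENS = 1024
INCREMENT_TOKENS = 128


def cacheable_tokens(shared_prefix_tokens: int) -> int:
    """Tokens eligible for a cache hit, rounded down to a 128-token step."""
    if shared_prefix_tokens < MIN_CACHEABLE_TOKENS:
        return 0
    extra = shared_prefix_tokens - MIN_CACHEABLE_TOKENS
    return MIN_CACHEABLE_TOKENS + (extra // INCREMENT_TOKENS) * INCREMENT_TOKENS


def billed_equivalent(total_tokens: int, cached: int, discount: float = 0.5) -> float:
    """Input cost in full-price token equivalents after the cached discount."""
    return (total_tokens - cached) + cached * (1 - discount)


hit = cacheable_tokens(10_000)
print(hit, billed_equivalent(10_200, hit))
```

Prefixes just under the floor get no benefit at all, which is why it is worth checking that your stable section comfortably clears 1,024 tokens.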

Google Gemini: two flavors

Gemini offers both modes. Implicit caching has been enabled by default since May 2025: when the infrastructure detects a reused prefix, it applies the discount automatically, much like OpenAI. Explicit context caching lets you create a named cache object from a large document or system prompt and reference it across requests, with a configurable TTL that defaults to 60 minutes. On the Gemini 2.5 models, cached tokens are discounted up to 90%, putting it in Anthropic's territory for cost savings when you manage the cache explicitly.

The one rule that makes or breaks your hit rate

Because caching keys on the prompt prefix, prompt structure is the whole game. The rule:

Put everything static at the front. Put everything dynamic at the back.

A cache-hostile prompt interleaves the two — a timestamp in the system message, the user's name spliced into the instructions, a request ID near the top. Every one of those tiny variations changes the prefix and invalidates the entire cache from that point on. You pay full price.

A cache-friendly prompt is ordered deliberately:

1. System prompt: identical on every call.
2. Tool / function definitions: stable.
3. Few-shot examples: stable.
4. Large reference documents or RAG context: stable for the session.
5. The user's actual, variable input: last.

[ system prompt        ]  <- cached
[ tool definitions     ]  <- cached     ~95% of tokens,
[ few-shot examples    ]  <- cached     billed at 10-50%
[ retrieved documents  ]  <- cached
------------- cache boundary -------------
[ this turn's user msg ]  <- full price, but tiny

Get this right and the expensive 95% of your prompt is billed at the cached rate while only the small dynamic tail pays full freight.
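One way to enforce that ordering in code is to assemble the static prefix and the dynamic tail separately, so volatile values can never leak into the front. The constants and the `build_prompt` helper below are hypothetical placeholders, and joining sections with blank lines is just one possible layout.

```python
import json
from datetime import datetime, timezone

# Hypothetical static content; in a real app these would be much longer.
SYSTEM_PROMPT = "You are a support assistant for Acme."
TOOLS = [{"name": "lookup_order", "description": "Find an order by ID."}]
FEW_SHOT = ["Q: Where is my order?\nA: Let me check that for you."]


def build_prompt(documents: list[str], user_message: str) -> tuple[str, str]:
    """Return (static_prefix, dynamic_tail) with all volatile data in the tail."""
    static_prefix = "\n\n".join(
        [SYSTEM_PROMPT, json.dumps(TOOLS, sort_keys=True), *FEW_SHOT, *documents]
    )
    # The timestamp changes on every call, so it belongs after the cache boundary.
    timestamp = datetime.now(timezone.utc).isoformat()
    dynamic_tail = f"[{timestamp}] {user_message}"
    return static_prefix, dynamic_tail


prefix_a, tail_a = build_prompt(["Refund policy: 30 days."], "Can I return this?")
prefix_b, tail_b = build_prompt(["Refund policy: 30 days."], "Where is my parcel?")
print(prefix_a == prefix_b)  # the prefix is identical across different questions
```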

A worked example

Suppose you run a support assistant with a 10,000-token system prompt and knowledge base, answering 50,000 queries a day, each with a 200-token user question. Without caching, you pay full input price on ~10,200 tokens per call.

With caching and a stable prefix, that 10,000-token block is written once per TTL window and then read at a discount on every subsequent call. On Anthropic's 90% read discount, those 10,000 cached tokens effectively cost a tenth of their list price across the day's traffic — while the 200-token tail is the only thing billed in full. For high-volume, large-context workloads, teams routinely report 60–85% reductions in total input cost. The savings scale directly with how much of your prompt is stable and how often you reuse it.
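The arithmetic for this scenario can be sketched in a few lines. Costs are expressed in full-price token equivalents so no dollar price has to be assumed. The figure of 288 cache writes per day is a deliberately conservative assumption (one rewrite per 5-minute window), not something stated above, and the multipliers are the Anthropic 5-minute write and read rates from earlier in the article.

```python
PREFIX_TOKENS = 10_000      # static system prompt and knowledge base
TAIL_TOKENS = 200           # the user's question
REQUESTS_PER_DAY = 50_000
WRITE_MULTIPLIER = 1.25     # 5-minute TTL cache write
READ_MULTIPLIER = 0.10      # cache read
WRITES_PER_DAY = 288        # assumption: one rewrite per 5-minute window


def daily_input_cost(cached: bool) -> float:
    """Daily input cost in full-price token equivalents."""
    tail = REQUESTS_PER_DAY * TAIL_TOKENS
    if not cached:
        return float(REQUESTS_PER_DAY * PREFIX_TOKENS + tail)
    writes = WRITES_PER_DAY * PREFIX_TOKENS * WRITE_MULTIPLIER
    reads = (REQUESTS_PER_DAY - WRITES_PER_DAY) * PREFIX_TOKENS * READ_MULTIPLIER
    return writes + reads + tail


baseline = daily_input_cost(cached=False)
with_cache = daily_input_cost(cached=True)
saving = 1 - with_cache / baseline
print(f"{baseline:,.0f} -> {with_cache:,.0f} token equivalents, {saving:.1%} saved")
```

Under these assumptions the input bill drops by roughly 88 percent. Real results will be lower if traffic is sparse enough that the cache expires between requests, or if part of the prefix changes during the day.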

Common mistakes

A few traps that quietly kill cache hit rates:

Putting a timestamp or random ID at the top of the prompt. It changes the prefix every call. Move volatile data to the end, or remove it.
Reordering tool definitions or examples between calls. Even a stable set of tokens in a different order is a different prefix. Freeze the order.
Letting the TTL expire under low traffic. By default, caches live for minutes, not hours. If requests arrive sparsely, you may pay the write premium repeatedly without ever banking a read. For Anthropic, consider the 1-hour TTL when traffic is sparse or bursty.
Caching below the minimum. Under ~1,024 tokens, there is nothing to cache. Caching shines on large stable contexts, not short prompts.
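Two of these traps, reordered definitions and silent prefix drift, can be guarded against with a little tooling. The sketch below serializes tool definitions in one frozen order and fingerprints the static prefix so you can log it and notice when it changes. Both helpers are hypothetical, and it assumes each tool is a dict with a `name` key.

```python
import hashlib
import json


def canonical_tools(tools: list[dict]) -> str:
    """Serialize tool definitions in a fixed order with sorted keys."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"))


def prefix_fingerprint(static_prefix: str) -> str:
    """Short hash of the static prefix, useful for logging and alerting."""
    return hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()[:12]


tools_a = [
    {"name": "search", "description": "Search the knowledge base."},
    {"description": "Find an order by ID.", "name": "lookup_order"},
]
tools_b = list(reversed(tools_a))  # same tools, different order

print(canonical_tools(tools_a) == canonical_tools(tools_b))
print(prefix_fingerprint(canonical_tools(tools_a)))
```

If the fingerprint logged for a supposedly static prefix changes between deployments or between requests, something volatile has crept into the front of the prompt.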

The Bottom Line

Prompt caching is close to free money for any team with a large, stable prompt prefix and repeat traffic. On OpenAI it is automatic — verify your prompts clear 1,024 tokens and keep the prefix stable, and you bank a 50% discount for nothing. On Anthropic and Gemini, a little explicit work with cache_control markers or named cache objects unlocks discounts up to 90%. The architecture rewards one discipline above all: static content at the front, dynamic content at the back. Restructure your prompts around that single rule and watch the bill fall.
