Context Engineering: A Practical Playbook for Reliable AI Agents
If your AI agent works beautifully in a demo and falls apart in production, the problem usually isn't your prompt wording. It's everything around the prompt — the tools, the message history, the retrieved files, the accumulated tool outputs. That discipline now has a name: context engineering, and it's the single highest-leverage skill for anyone shipping agents in 2026.
Prompt Engineering Got Promoted
Shopify CEO Tobi Lütke coined the term in mid-2025, and Anthropic frames it cleanly: context engineering is "the set of strategies for curating and maintaining the optimal set of tokens during LLM inference."
The distinction from prompt engineering is real, not semantic:
- Prompt engineering is a discrete task — you write a good system prompt once.
- Context engineering is iterative. The curation happens on every single inference call, deciding what to pass the model from a constantly growing pile of possible information.
An agent running in a loop generates more data every turn — tool results, intermediate reasoning, file contents. Most of it is noise by the next step. Your job is to decide what survives into the next context window.
Why More Context Makes Agents Worse
The counterintuitive core of this discipline: bigger context is not better context.
Research on long-context recall has surfaced a phenomenon called context rot — as the number of tokens in the window grows, the model's ability to accurately recall any specific fact from that window decreases. This shows up across every model, just with different decay curves.
The cause is architectural. Transformers create n² pairwise relationships for n tokens. As context grows, the model's attention gets stretched thin across all those relationships. Models are also trained mostly on shorter sequences, so they have fewer specialized parameters for long-range dependencies.
Treat context as a finite attention budget, not free real estate. Every token you add depletes that budget. The guiding principle for everything below:
Find the smallest possible set of high-signal tokens that maximize the likelihood of your desired outcome.
This isn't theory. In one striking benchmark, Llama 3.1 8B fails when handed 46 tools but succeeds when given only 19 — the same task, the same model. Dynamically selecting only relevant tools (a technique called tool loadout) improved its function-calling score on the Berkeley Function Calling Leaderboard by 44%. The model didn't get smarter; its context got cleaner.
Tactic 1: Right-Altitude System Prompts
The most common system-prompt failures sit at two extremes:
- Too brittle — engineers hardcode complex if-else logic trying to script exact behavior. It's fragile and rots fast.
- Too vague — high-level hand-waving that assumes shared context the model doesn't have.
Aim for the Goldilocks altitude: specific enough to guide behavior, flexible enough to let the model use its own heuristics. Organize prompts into clear sections and let structure do the work:
<background_information>
You are a support triage agent for a SaaS billing system.
</background_information>
## Instructions
- Classify each ticket into exactly one category.
- If confidence is below 0.7, escalate to a human.
## Output format
Return JSON: {"category": str, "confidence": float, "escalate": bool}
"Minimal" doesn't mean "short." It means no token that isn't earning its place. Start with the leanest prompt that could work on the best available model, then add instructions only to fix failure modes you actually observe.
Tactic 2: Curate Tools Like a Codebase
Tools define the contract between your agent and the world, so bloated tool sets are a top failure mode. The test is simple:
If a human engineer can't say definitively which tool to use in a situation, your agent can't either.
Keep tools self-contained, non-overlapping, and unambiguous. And don't dump your entire tool catalog into every call — load tools dynamically based on the task. Here's the pattern that drove that 44% gain:
def select_tools(query: str, all_tools: list[Tool]) -> list[Tool]:
# 1. Let the model reason about what it needs
needed = llm.complete(
f"What capabilities does this query need? Query: {query}"
)
# 2. Semantically match against the full catalog
return semantic_search(needed, all_tools, top_k=10)
Ten well-chosen tools beat forty exhaustive ones almost every time.
Tactic 3: Just-in-Time Retrieval Over Pre-Loading
The older pattern was to embed everything up front and stuff the top matches into context before inference. The newer, more reliable pattern is just-in-time retrieval: the agent holds lightweight references — file paths, queries, links — and loads the actual data only when it needs it.
This mirrors human cognition. You don't memorize an entire codebase; you keep an index in your head and open files on demand. Claude Code works this way: it drops a CLAUDE.md into context up front, then uses primitives like glob and grep to pull files just-in-time, sidestepping stale indexes entirely.
# Anti-pattern: load everything, hope the right thing is in there
context = load_all_docs() # 200K tokens of mostly-irrelevant text
# Better: keep references, fetch on demand
refs = list_doc_paths() # cheap: just filenames
relevant = agent.decide_and_read(refs, task) # reads 2-3 files
The metadata itself is signal. A file named test_utils.py in a tests/ folder tells the agent something different than the same filename in src/core_logic/. Folder hierarchy, naming conventions, and timestamps all guide retrieval for free.
The trade-off: runtime exploration is slower than pre-computed retrieval, and a poorly guided agent can waste its budget chasing dead ends. For mostly-static domains like legal or finance, a hybrid approach — some data up front, exploration at the agent's discretion — often wins.
Tactic 4: Surviving Long-Horizon Tasks
When a task runs for hours and blows past the context window entirely, you need explicit strategies. There are three proven ones, and they suit different shapes of work.
Compaction — when the conversation nears the window limit, summarize it and reinitialize a fresh window with that summary. The art is in what you keep: preserve architectural decisions, unresolved bugs, and implementation details; discard redundant tool outputs. The safest, lightest form is simply clearing old tool results — once a tool has run deep in the history, the agent rarely needs the raw output again.
Structured note-taking — have the agent write notes to persistent memory outside the context window (a NOTES.md, a to-do list) and pull them back later. This is how an agent maintains coherence across thousands of steps. The canonical demo is Claude playing Pokémon: it tracks objectives like "Pikachu has gained 8 levels toward the target of 10" across resets by reading its own notes.
Sub-agent architectures — instead of one agent holding state for an entire project, a lead agent coordinates a high-level plan while specialized sub-agents do deep work in clean context windows. Each sub-agent might burn tens of thousands of tokens exploring, but returns only a distilled 1,000–2,000 token summary. The detail stays isolated; the lead agent stays focused.
Match the technique to the task:
- Compaction for tasks needing extensive back-and-forth conversational flow.
- Note-taking for iterative development with clear milestones.
- Sub-agents for research and analysis where parallel exploration pays off.
A Pre-Flight Checklist
Before you ship an agent, audit its context like a code review:
- Is every tool definition earning its place, or can you prune the set?
- Does the system prompt sit at the right altitude — neither brittle nor vague?
- Are you loading data just-in-time instead of pre-stuffing the window?
- Do you have a compaction or note-taking strategy for runs that exceed the window?
- Are stale tool results being cleared from history?
The Bottom Line
Context engineering is the shift from "write the perfect prompt" to "curate the perfect state on every turn." The models will keep getting better at tolerating messy context — but a finite attention budget is an architectural fact, not a temporary limitation. The teams shipping reliable agents in 2026 aren't the ones with the cleverest prompts. They're the ones treating every token as a cost.


