Context Engineering: More Than "Writing a Good Prompt"
Anthropic's Agent best practices contain this line:
"The most important skill in building agents is context engineering — the art of getting the right information into the model's context window at the right time."
Context engineering is the layer underneath prompt engineering, and it is broader. It governs the entire context window: what goes in, how much of it, in what order, and what to drop when the budget runs out.
This article covers three dimensions:
- Five context sources + token cost analysis — what actually lives inside a context window
- Dynamic assembly under a budget constraint — how to prioritize when tokens are scarce
- Three overflow strategies compared — truncation, summarization, retrieval, on the same question
The Five Sources of Context
A complete Agent context is assembled from five types of content, each with different lifetime and cost characteristics:
┌─────────────────────────────────────────────────────────┐
│                Agent Context Composition                │
├──────────────────┬──────────────────────────────────────┤
│ ① System Prompt  │ Always loaded. Defines Agent role.   │
│                  │ Stable: ideal for Prompt Caching.    │
├──────────────────┼──────────────────────────────────────┤
│ ② Tool Defs      │ Load on demand. Only current-task    │
│                  │ tools. Grows fast (50–200t per tool).│
├──────────────────┼──────────────────────────────────────┤
│ ③ Conversation   │ Last K turns. Grows linearly.        │
│   History        │ Needs truncation or summarization.   │
├──────────────────┼──────────────────────────────────────┤
│ ④ Retrieved      │ Dynamically injected KB snippets.    │
│   Content        │ Quality-gated by relevance score.    │
├──────────────────┼──────────────────────────────────────┤
│ ⑤ Current Input  │ The user's current turn question.    │
│                  │ Always last in, never skippable.     │
└──────────────────┴──────────────────────────────────────┘
Real Token Costs
Measured against a real customer service Agent (128K window, 4K output reserve, 124K available):
Source                      Tokens   % of Budget   Notes
──────────────────────────────────────────────────────────────────────
① System Prompt                155          0.1%   Always loaded
② Tool Definitions             174          0.1%   4 tools on demand
③ Conv. History (8 turns)      263          0.2%   Last N turns
④ Retrieved Content            200          0.2%   Relevance-filtered
⑤ Current Input                 22          0.0%   Current turn
──────────────────────────────────────────────────────────────────────
Total                          814          0.7%
Remaining Buffer           123,186
At 8 turns of history, 0.7% looks fine. Two growth factors change the picture:
Factor 1: Conversation length
- 8 turns = 263 tokens; 100 turns ≈ 3,200 tokens; 1,000 turns ≈ 32,000 tokens
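- The turns measured here are short (about 33 tokens each). Turns that carry tool output or pasted documents are far larger, so without controls a long-running Agent (customer service, personal assistant) can hit overflow after a few hundred turns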
Factor 2: Tool count
- 4 tools = 174 tokens; 20 tools ≈ 860 tokens; 100 tools ≈ 4,300 tokens
- MCP Agents connecting to large tool ecosystems can burn several K tokens on definitions alone
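The two growth factors can be checked with a little arithmetic. The sketch below linearly extrapolates the figures measured in the table above. The linear model and the 400-tokens-per-turn "heavy" case are assumptions for illustration, not measurements.

```python
# Measured figures from the cost table; everything derived from them is an estimate.
HISTORY_TOKENS_8_TURNS = 263
TOOL_TOKENS_4_TOOLS = 174
FIXED_TOKENS = 155 + 174 + 200 + 22   # system prompt + tools + retrieval + input
AVAILABLE = 128_000 - 4_000           # window minus output reserve


def project(measured_tokens: int, measured_units: int, units: int) -> int:
    """Linearly extrapolate a measured token cost to a larger unit count."""
    return round(measured_tokens / measured_units * units)


def turns_until_overflow(fixed_tokens: int, tokens_per_turn: float) -> int:
    """Whole turns of history that fit once the fixed sources are loaded."""
    return int((AVAILABLE - fixed_tokens) // tokens_per_turn)


history_100 = project(HISTORY_TOKENS_8_TURNS, 8, 100)   # history cost at 100 turns
tools_100 = project(TOOL_TOKENS_4_TOOLS, 4, 100)        # definition cost at 100 tools

# Short turns as measured vs. an assumed 400-token turn (e.g. with tool output).
light = turns_until_overflow(FIXED_TOKENS, HISTORY_TOKENS_8_TURNS / 8)
heavy = turns_until_overflow(FIXED_TOKENS, 400)
```

Under these assumptions the short measured turns leave room for several thousand turns, while 400-token turns fill the window in roughly three hundred. Turn size, not turn count alone, decides when overflow arrives.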
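Token counting uses tiktoken with cl100k_base, the encoding used by GPT-4 and GPT-3.5-turbo. Other models use different tokenizers, so treat the counts as estimates rather than exact figures:
import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")


def count_tokens(text: str) -> int:
    """Number of cl100k_base tokens in text."""
    return len(_enc.encode(text))


def msg_tokens(msg) -> int:
    """Token cost of one chat message, including per-message framing."""
    return count_tokens(msg.content) + 4  # 4 overhead tokens per message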
Dynamic Context Assembly Under a Budget
The core principle: load by priority; when the budget tightens, high-priority content stays complete, low-priority content is trimmed elastically.
Priority Model
P0 System Prompt — always complete (role definition cannot be dropped)
P1 Current Input — always complete (no question = no point)
P2 Recent History — load newest-first until budget exhausted (compressible)
P3 Retrieved Docs — load by relevance score descending (trimmable)
P4 Tool Defs — load only tools relevant to this turn (prunable)
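The priority model can be sketched as a greedy loader. This is a minimal illustration, not the article's implementation: `Item`, `assemble`, and the whitespace-based `approx_tokens` counter are hypothetical names, and the counter is a stand-in for a real tokenizer.

```python
from dataclasses import dataclass


def approx_tokens(text: str) -> int:
    """Stand-in counter (whitespace words); swap in a real tokenizer."""
    return len(text.split())


@dataclass
class Item:
    name: str
    text: str
    priority: int            # 0 = P0 ... 4 = P4
    required: bool = False   # P0/P1 must fit completely


def assemble(items: list[Item], budget: int) -> list[str]:
    """Load items in priority order; skip optional items that do not fit."""
    loaded, used = [], 0
    # sorted() is stable, so items sharing a priority keep their input order
    # (e.g. history turns supplied newest-first stay newest-first).
    for item in sorted(items, key=lambda i: i.priority):
        cost = approx_tokens(item.text)
        if used + cost <= budget:
            loaded.append(item.name)
            used += cost
        elif item.required:
            raise ValueError(f"{item.name} does not fit in budget {budget}")
    return loaded


items = [
    Item("tools", "t " * 5, 4),
    Item("system", "s " * 3, 0, required=True),
    Item("input", "q " * 2, 1, required=True),
    Item("history-1", "h " * 4, 2),
    Item("docs", "d " * 6, 3),
]
tight = assemble(items, budget=12)
roomy = assemble(items, budget=20)
```

One design choice to note: an optional item that does not fit is skipped rather than ending the loop, so a smaller lower-priority item can still be loaded afterwards. Stopping at the first miss is an equally defensible policy.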
ContextBudgetManager
class ContextBudgetManager:
    """Tracks token usage against a fixed budget (relies on count_tokens above)."""

    def __init__(self, total_budget: int = 128_000, output_reserve: int = 4_000):
        self.available = total_budget - output_reserve
        self.used = 0
        self.items: list[dict] = []

    def _remaining(self) -> int:
        """Tokens still free in the budget."""
        return self.available - self.used

    def add(self, name: str, content: str, priority: int) -> bool:
        """Try to load completely. Returns False if over budget."""
        t = count_tokens(content)
        if t <= self._remaining():
            self.used += t
            self.items.append({"name": name, "tokens": t, "priority": priority})
            return True
        return False

    def add_with_trim(self, name: str, content: str, priority: int) -> int:
        """Load as much as possible: trim proportionally if over budget."""
        t = count_tokens(content)
        if t <= self._remaining():
            self.used += t
            self.items.append({"name": name, "tokens": t, "priority": priority})
            return t
        budget_left = self._remaining()
        if budget_left <= 0:
            return 0
        # Character-proportional cut: an approximation, since tokens vary in length.
        ratio = budget_left / t
        trimmed = content[:int(len(content) * ratio)]
        actual_t = count_tokens(trimmed)
        # Shrink further if the approximation overshoots the remaining budget.
        while actual_t > budget_left and trimmed:
            trimmed = trimmed[:int(len(trimmed) * 0.9)]
            actual_t = count_tokens(trimmed)
        self.used += actual_t
        self.items.append({"name": name, "tokens": actual_t,
                           "priority": priority, "trimmed": True})
        return actual_t
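Trimming by character ratio, as `add_with_trim` does, only approximates a token budget because tokens are not uniform in length. One alternative is to binary-search the prefix length against the counter itself. The sketch below is a hypothetical helper (`trim_to_budget` is not part of the original code) and assumes the token count never decreases as the prefix grows, which holds for the whitespace stand-in counter used here and is close to true for BPE tokenizers.

```python
from typing import Callable


def trim_to_budget(text: str, budget: int, count: Callable[[str], int]) -> str:
    """Return a prefix of text whose count() is at most budget.

    Under the monotonicity assumption this is the longest such prefix.
    """
    if budget <= 0:
        return ""
    if count(text) <= budget:
        return text
    lo, hi = 0, len(text)   # invariant: text[:lo] fits, text[:hi] does not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if count(text[:mid]) <= budget:
            lo = mid
        else:
            hi = mid
    return text[:lo]


def word_count(text: str) -> int:
    """Stand-in token counter; replace with a real tokenizer's count."""
    return len(text.split())


trimmed = trim_to_budget("one two three four five", 3, word_count)
```

The search calls the counter about log2(len(text)) times, so it costs more than a single proportional cut but never overshoots the budget.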
Measured Results: Two Scenarios
Scenario A: Normal conversation (12K budget)
Percentages are relative to the 10,000 tokens actually available (Used + Remaining below). That is consistent with a 12K budget minus a 2K output reserve; the reserve is inferred from the totals, not stated in the run.
Source             Tokens       %   Status
────────────────────────────────────────────────
System Prompt         155    1.6%   ✓ Complete
Current Input          22    0.2%   ✓ Complete
History Turn -1        62    0.6%   ✓ Complete
History Turn -2        47    0.5%   ✓ Complete
History Turn -3        61    0.6%   ✓ Complete
History Turn -4        60    0.6%   ✓ Complete
Retrieved Docs        200    2.0%   ✓ Complete
Tool Defs             174    1.7%   ✓ Complete
────────────────────────────────────────────────
Used                  781    7.8%
Remaining           9,219   92.2%
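Scenario B: Tight budget (3K, simulating a tool-heavy Agent)
Percentages are relative to the 2,000 tokens actually available (Used + Remaining below). That is consistent with a 3K budget minus a 1K output reserve; the reserve is inferred from the totals, not stated in the run.
Source                    Tokens       %   Status
────────────────────────────────────────────────────
System Prompt                155    7.8%   ✓ Complete
Current Input                 22    1.1%   ✓ Complete
Recent History (1 turn)       40    2.0%   ✓ Complete
Retrieved Docs               200   10.0%   ✓ Complete
Tool Defs                    174    8.7%   ✓ Complete
────────────────────────────────────────────────────
Used                         591   29.5%
Remaining                  1,409   70.5%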
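When the budget is tight:
- System Prompt and current input are always complete (P0/P1).
- History shrinks to the most recent turn.
- Retrieved content and tool defs take whatever space remains. In this run both still fit completely, so nothing was actually trimmed; trimming only starts once the remaining space runs out.
Core information is preserved; supporting content scales with the budget.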
Three Overflow Strategies: Head-to-Head
When conversation history exceeds the token limit, there are three strategies to choose from. All three are tested here on the same scenario:
Test Setup:
- 10-topic Python learning conversation: 20 messages, 510 tokens total
- History token limit: 300 (simulating a nearly-full context)
- Test question: "What was the list comprehension we discussed at the beginning? Can you give me a practical example?"
- Goal: does the answer reflect that we covered list comprehensions in Turn 1?
Strategy 1: Truncation
def strategy_truncation(history: list, max_tokens: int) -> tuple[list, int]:
    """Keep newest messages first, working backwards until budget exhausted."""
    kept = []
    used = 0
    for msg in reversed(history):
        t = msg_tokens(msg)
        if used + t > max_tokens:
            break
        kept.insert(0, msg)
        used += t
    return kept, used
Result: 11/20 messages kept (294 tokens). Earliest visible message is from Turn 5 (context managers with __enter__/__exit__). Turn 1's list comprehension content is gone.
Kept: 11/20 messages | 294/300 tokens used
Earliest visible: "Implement __enter__ and __exit__, or use @contextmanager..."
Answer (from LLM's own knowledge, no history context):
Python list comprehensions are a concise way to create lists.
[expression for variable in iterable if condition]
← Technically correct, but LLM is answering from general knowledge,
not from "what we discussed earlier"
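One side effect is visible in the result above: with 11 of 20 messages kept, the oldest survivor appears to be an answer whose question was cut. A variant that truncates on whole turns avoids that. The sketch below is a hypothetical alternative, not the code used for the measurement; it assumes history is available as (question, answer) pairs and uses a whitespace word count as a stand-in for a real token counter.

```python
from typing import Callable


def word_count(text: str) -> int:
    """Stand-in token counter; replace with a real tokenizer's count."""
    return len(text.split())


def truncate_by_turn(
    turns: list[tuple[str, str]],
    max_tokens: int,
    count: Callable[[str], int] = word_count,
) -> tuple[list[tuple[str, str]], int]:
    """Keep the newest whole (question, answer) turns that fit in max_tokens."""
    kept, used = [], 0
    for question, answer in reversed(turns):
        cost = count(question) + count(answer)
        if used + cost > max_tokens:
            break               # stop at the first turn that does not fit
        kept.append((question, answer))
        used += cost
    kept.reverse()              # restore chronological order
    return kept, used


turns = [("q1 a", "a1 b c"), ("q2", "a2 x"), ("q3 y z", "a3")]
kept, used = truncate_by_turn(turns, max_tokens=8)
```

The trade-off is slightly lower budget utilization, since a turn is dropped as a unit even when half of it would have fit.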
Strategy 2: Summarization
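from langchain_core.messages import HumanMessage, SystemMessage

# `llm` (a LangChain chat model) and `history_text` (the flattened conversation)
# are assumed to be defined earlier.
summary = llm.invoke([
    SystemMessage("Compress the following Python learning conversation into a concise summary. "
                  "Preserve all topics discussed and key conclusions (≤150 words)."),
    HumanMessage(history_text),
])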
Result: 510 tokens → 99 tokens, 5.2x compression ratio. All 10 topics preserved:
510 tokens → 99 tokens (5.2x compression)
Summary content:
Python list comprehensions simplify for loops; dict comprehensions
produce dicts; generators save memory; decorators wrap functions;
context managers implement 'with'; GIL restricts threads;
IO-bound → threading; CPU-bound → multiprocessing; async/await for IO;
dataclass reduces boilerplate; Pydantic for data validation.
Answer (from summary):
Python list comprehensions are a concise syntax for creating lists.
[x**2 for x in range(5)] ← list comprehensions in summary → specific example generated
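In a long conversation the summary has to be refreshed as new turns arrive. One way to do that is to fold turns into a running summary every N turns and keep the not-yet-summarized turns verbatim. The sketch below is an assumption about how that could be wired, not the article's code: `RollingSummary` is a hypothetical class, and `fake_summarize` is a deterministic stand-in for the LLM call so the example runs offline.

```python
from typing import Callable


class RollingSummary:
    """Folds older turns into a running summary every `every` turns."""

    def __init__(self, summarize: Callable[[str, list[str]], str], every: int = 5):
        self.summarize = summarize       # hypothetical: would wrap the LLM call
        self.every = every
        self.summary = ""
        self.pending: list[str] = []     # turns not yet summarized

    def add_turn(self, turn_text: str) -> None:
        self.pending.append(turn_text)
        if len(self.pending) >= self.every:
            self.summary = self.summarize(self.summary, self.pending)
            self.pending = []

    def context(self) -> str:
        """Summary (if any) followed by the verbatim unsummarized turns."""
        parts = [f"Summary so far: {self.summary}"] if self.summary else []
        return "\n".join(parts + self.pending)


def fake_summarize(previous: str, turns: list[str]) -> str:
    """Stand-in for an LLM: keeps only the topic name before each colon."""
    topics = [t.split(":", 1)[0] for t in turns]
    return " ".join(filter(None, [previous, *topics]))


rs = RollingSummary(fake_summarize, every=5)
for i in range(1, 8):
    rs.add_turn(f"topic{i}: details")
```

After seven turns the first five live in the summary and the last two are still verbatim, so recent detail survives while older material is compressed.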
Strategy 3: Retrieval
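# Assumed import paths (langchain-chroma and langchain-core packages);
# adjust to your installed versions.
from langchain_chroma import Chroma
from langchain_core.documents import Document

# Convert each conversation turn into a Document, build vector index.
# `history_topics` (a list of (question, answer) pairs), `embeddings`, and
# `test_question` are assumed to be defined earlier.
history_docs = [
    Document(page_content=f"Q: {q}\nA: {a}", metadata={"turn": i + 1})
    for i, (q, a) in enumerate(history_topics)
]
history_store = Chroma.from_documents(history_docs, embeddings)
history_retriever = history_store.as_retriever(search_kwargs={"k": 2})
relevant_docs = history_retriever.invoke(test_question)
# → Retrieves Turn 1 (list comprehensions) and Turn 10 (Pydantic)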
Result: 2 relevant turns retrieved (118 tokens), directly hitting Turn 1:
Retrieved 2 relevant turns (118 tokens):
Turn 1: "What are Python list comprehensions?" ← precise hit
Turn 10: "Pydantic vs dataclass?"
Answer (from retrieved history):
Python list comprehensions are a concise way to create lists.
squared = [x**2 for x in numbers] ← uses retrieved Turn 1 content directly
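To see the retrieval idea without a vector database, the sketch below scores each turn against the question with bag-of-words cosine similarity. This is a deliberately crude stand-in for embeddings, labeled as such: `bow`, `cosine`, `retrieve`, and the stopword list are all hypothetical, and the sample turns are invented for illustration.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "an", "the", "what", "was", "we", "at", "is", "are", "do",
             "does", "how", "q", "to", "in", "of", "can", "you", "me"}


def bow(text: str) -> Counter:
    """Lowercased word counts with stopwords removed."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(turns: list[str], question: str, k: int = 2) -> list[int]:
    """Return 1-based turn numbers of the k most similar turns."""
    q = bow(question)
    scored = [(cosine(bow(t), q), i + 1) for i, t in enumerate(turns)]
    scored.sort(key=lambda s: (-s[0], s[1]))   # best score first, ties by turn
    return [turn for _, turn in scored[:k]]


sample_turns = [
    "Q: What are Python list comprehensions?\nA: A concise way to build a list from an iterable.",
    "Q: How do generators save memory?\nA: They yield items lazily.",
    "Q: What does the GIL restrict?\nA: Parallel threads in CPython.",
]
question = ("What was the list comprehension we discussed at the beginning? "
            "Can you give me a practical example?")
top = retrieve(sample_turns, question, k=1)
```

Exact word matching only connects the question to Turn 1 through the word "list"; it misses "comprehension" versus "comprehensions". That gap is what an embedding model closes, and it is the reason the real setup uses a vector index.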
Summary Comparison
Strategy        History Tokens   Turn 1 Visible   Implementation
──────────────────────────────────────────────────────────────────────────
Truncation                 294   ✗ Dropped        Trivial
Summarization               99   ✓ In summary     Medium
Retrieval                  118   ✓ Precise hit    High (needs vector index)
Truncation loses Turn 1. The LLM answers from general knowledge — correct, but not from conversation history.
Summarization at 5.2x compression retains all 10 topic names. The trade-off: specifics are generalized (topic names only, not the original Q&A text).
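Retrieval uses 118 tokens, slightly more than the summary's 99 and far fewer than truncation's 294, and lands directly on Turn 1. It gave the highest answer quality of the three, because the answer draws on the original Turn 1 text. The cost is building and maintaining a history vector index.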
Context Engineering Design Checklist
Token Budget Planning
- [ ] Know your model's context window: Claude 3.7 200K, GPT-4o 128K, GLM-4 128K
- [ ] Reserve adequate output space (4K–20K depending on expected output length)
- [ ] Set per-source ceilings: System Prompt < 2K, history budget 20K max, retrieval budget 30K max
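The budget-planning items above can be written down as a small check. This is a sketch under stated assumptions: the constant names and `check_plan` are hypothetical, and the ceilings are simply the figures from the checklist.

```python
WINDOW = 128_000          # e.g. a 128K-window model
OUTPUT_RESERVE = 4_000    # low end of the 4K-20K range
CEILINGS = {"system_prompt": 2_000, "history": 20_000, "retrieval": 30_000}


def check_plan(usage: dict[str, int]) -> list[str]:
    """Return human-readable violations; an empty list means the plan holds."""
    problems = [
        f"{source}: {usage[source]} > ceiling {cap}"
        for source, cap in CEILINGS.items()
        if usage.get(source, 0) > cap
    ]
    total = sum(usage.values())
    available = WINDOW - OUTPUT_RESERVE
    if total > available:
        problems.append(f"total: {total} > available {available}")
    return problems


measured = {"system_prompt": 155, "tools": 174, "history": 263,
            "retrieval": 200, "input": 22}
ok = check_plan(measured)
too_much_history = check_plan({"history": 25_000})
```

Running a check like this before each request turns the ceilings from a guideline into something that fails loudly.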
Priority Assembly
- [ ] P0/P1 (System Prompt + current input) always complete — never participate in trimming
- [ ] P2 (history) loaded newest-first; stop when budget exhausted, don't sample randomly
- [ ] P3 (retrieved content) filtered by relevance score — low-score docs excluded
- [ ] P4 (tool defs) only load tools plausibly relevant to the current turn
Overflow Strategy Selection
- [ ] < 20 turns: truncation — simple and reliable
- [ ] 20–100 turns or need global narrative: summarization (suggest triggering every 5–10 turns)
- [ ] > 100 turns or need to surface specific early content: retrieval (vector-index the history)
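The selection rules above reduce to a few comparisons. The sketch below encodes them directly; `choose_strategy` and its parameter names are hypothetical, and the thresholds are the checklist's rules of thumb rather than hard limits.

```python
def choose_strategy(
    turns: int,
    needs_early_detail: bool = False,
    needs_global_narrative: bool = False,
) -> str:
    """Pick an overflow strategy from conversation length and recall needs."""
    if needs_early_detail or turns > 100:
        return "retrieval"        # vector-index the history
    if needs_global_narrative or turns >= 20:
        return "summarization"    # e.g. refresh every 5-10 turns
    return "truncation"           # simple and reliable for short chats


short_chat = choose_strategy(8)
long_chat = choose_strategy(150)
```

Retrieval is checked first because the need to surface specific early content overrides conversation length.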
Monitoring
- [ ] Log actual token consumption per request (input / output / cache hits)
- [ ] Alert when a single request exceeds 80% of the window — signals compression tuning is needed
- [ ] Track truncation/summarization trigger frequency — too frequent means thresholds are too low
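The first two monitoring items can be covered by one small function. This is a minimal sketch: `record_usage`, the logger name, and the parameter names are hypothetical, and the 80% line is the checklist's suggested threshold.

```python
import logging

logger = logging.getLogger("context_budget")  # hypothetical logger name


def record_usage(
    input_tokens: int,
    output_tokens: int,
    cache_hit_tokens: int,
    window: int = 128_000,
    alert_ratio: float = 0.8,
) -> bool:
    """Log one request's token usage; return True if it crossed the alert line."""
    ratio = (input_tokens + output_tokens) / window
    logger.info(
        "tokens in=%d out=%d cache_hits=%d window_used=%.1f%%",
        input_tokens, output_tokens, cache_hit_tokens, ratio * 100,
    )
    if ratio > alert_ratio:
        logger.warning(
            "request used %.1f%% of the window; review compression settings",
            ratio * 100,
        )
        return True
    return False


normal = record_usage(814, 300, 0)
near_limit = record_usage(110_000, 4_000, 0)
```

Returning the alert flag lets the caller count how often the threshold is crossed, which feeds the third item (trigger frequency).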
Summary
Five core takeaways:
- Context engineering ≠ prompt engineering: prompt engineering tunes the System Prompt; context engineering manages the entire window's composition, budget, and priority
- 128K is not forever: conversation history is the fastest-growing variable — uncontrolled Agents hit overflow after a few hundred turns
- Priority-based assembly is the core principle: System Prompt + current question always win; history and retrieval content scale elastically
- Truncation is simplest but drops early content: both summarization and retrieval preserve Turn 1 information; retrieval is the most precise
- 5.2x compression is a real measurement: 99 tokens replacing 510, retaining all 10 topic names — summarization is genuinely effective
Up next: Multi-Agent Architecture Design Patterns — when you actually need multiple Agents, the Supervisor vs Pipeline trade-off, and LangGraph Subgraph implementation.
References
- Anthropic: Building Effective Agents
- LangGraph Memory and Context Management
- tiktoken
- Full demo code for this series: agent-07-context-engineering