DEV Community

WonderLab

Agent Series (8): Context Engineering — Making Every Token Count

Context Engineering: More Than "Writing a Good Prompt"

Anthropic's Agent best practices contain this line:

"The most important skill in building agents is context engineering — the art of getting the right information into the model's context window at the right time."

Context engineering is broader than prompt engineering. It governs the entire context window: what goes in, how much of it, in what order, and what to drop when the budget runs out.

This article covers three dimensions:

  1. Five context sources + token cost analysis — what actually lives inside a context window
  2. Dynamic assembly under a budget constraint — how to prioritize when tokens are scarce
  3. Three overflow strategies compared — truncation, summarization, retrieval, on the same question

The Five Sources of Context

A complete Agent context is assembled from five types of content, each with different lifetime and cost characteristics:

┌─────────────────────────────────────────────────────────┐
│              Agent Context Composition                    │
├──────────────────┬──────────────────────────────────────┤
│ ① System Prompt  │ Always loaded. Defines Agent role.   │
│                  │ Stable — ideal for Prompt Caching.   │
├──────────────────┼──────────────────────────────────────┤
│ ② Tool Defs      │ Load on demand. Only current-task    │
│                  │ tools. Grows fast (50–200t per tool). │
├──────────────────┼──────────────────────────────────────┤
│ ③ Conversation   │ Last K turns. Grows linearly.        │
│    History       │ Needs truncation or summarization.   │
├──────────────────┼──────────────────────────────────────┤
│ ④ Retrieved      │ Dynamically injected KB snippets.    │
│    Content       │ Quality-gated by relevance score.    │
├──────────────────┼──────────────────────────────────────┤
│ ⑤ Current Input  │ The user's current turn question.    │
│                  │ Always last in, never skippable.     │
└──────────────────┴──────────────────────────────────────┘

Real Token Costs

Measured against a real customer service Agent (128K window, 4K output reserve, 124K available):

Source                     Tokens    % of Budget  Notes
──────────────────────────────────────────────────────────────
① System Prompt               155        0.1%    Always loaded
② Tool Definitions            174        0.1%    4 tools on demand
③ Conv. History (8 turns)     263        0.2%    Last N turns
④ Retrieved Content           200        0.2%    Relevance-filtered
⑤ Current Input                22        0.0%    Current turn
──────────────────────────────────────────────────────────────
Total                          814        0.7%
Remaining Buffer           123,186

At 8 turns of history, 0.7% looks fine. Two growth factors change the picture:

Factor 1: Conversation length

  • 8 turns = 263 tokens; 100 turns ≈ 3,200 tokens; 1,000 turns ≈ 32,000 tokens
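  • At this average the window lasts for thousands of turns, but turns that carry tool output or long answers are far larger, so long-running Agents (customer service, personal assistants) still overflow without controls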

Factor 2: Tool count

  • 4 tools = 174 tokens; 20 tools ≈ 860 tokens; 100 tools ≈ 4,300 tokens
  • MCP Agents connecting to large tool ecosystems can burn several K tokens on definitions alone
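
The extrapolations above are linear projections from the measured averages. A minimal sketch, assuming cost stays proportional to the number of turns and tools (the function names are illustrative and not part of the original code); the figures in the text round the per-unit average down slightly:

```python
# Linear projection from the measured samples in the table above.
HISTORY_TOKENS, HISTORY_TURNS = 263, 8   # measured: 8 turns of history
TOOL_TOKENS, TOOL_COUNT = 174, 4         # measured: 4 tool definitions
AVAILABLE = 128_000 - 4_000              # window minus output reserve


def project(sample_tokens: int, sample_size: int, target_size: int) -> int:
    """Scale a measured token cost linearly to a larger count."""
    return round(sample_tokens / sample_size * target_size)


def turns_until_full(fixed_tokens: int = 0) -> int:
    """Average-sized turns that fit once the fixed costs are paid."""
    per_turn = HISTORY_TOKENS / HISTORY_TURNS
    return int((AVAILABLE - fixed_tokens) // per_turn)


print(project(HISTORY_TOKENS, HISTORY_TURNS, 100))   # history at 100 turns
print(project(TOOL_TOKENS, TOOL_COUNT, 100))         # definitions for 100 tools
print(turns_until_full(fixed_tokens=155 + 174 + 200 + 22))
```

The per-turn average comes from short messages, so treat the last figure as an upper bound on conversation length rather than a prediction.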

Token counting (tiktoken, cl100k_base; exact for the OpenAI models that use this encoding, an approximation for others):

import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(_enc.encode(text))

def msg_tokens(msg) -> int:
    return count_tokens(msg.content) + 4  # 4 overhead tokens per message

Dynamic Context Assembly Under a Budget

The core principle: load by priority; when the budget tightens, high-priority content stays complete, low-priority content is trimmed elastically.

Priority Model

P0  System Prompt  — always complete (role definition cannot be dropped)
P1  Current Input  — always complete (no question = no point)
P2  Recent History — load newest-first until budget exhausted (compressible)
P3  Retrieved Docs — load by relevance score descending (trimmable)
P4  Tool Defs      — load only tools relevant to this turn (prunable)
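
The priority order can be sketched as a sort followed by a greedy fill. This is an illustration under stated assumptions: `Candidate`, `assemble`, and the sample contents are hypothetical, and the token counter is a word-count stand-in rather than tiktoken.

```python
from dataclasses import dataclass


def count_tokens(text: str) -> int:
    """Stand-in counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


@dataclass
class Candidate:
    name: str
    content: str
    priority: int  # 0 = P0 (highest) ... 4 = P4 (lowest)


def assemble(candidates: list[Candidate], budget: int) -> list[str]:
    """Admit candidates in priority order; skip any that no longer fit."""
    loaded, used = [], 0
    for c in sorted(candidates, key=lambda c: c.priority):
        cost = count_tokens(c.content)
        if used + cost <= budget:
            loaded.append(c.name)
            used += cost
    return loaded


candidates = [
    Candidate("tool_defs", "search_orders refund_order " * 4, 4),
    Candidate("retrieved", "refund policy snippet " * 3, 3),
    Candidate("history", "earlier question and answer " * 2, 2),
    Candidate("current_input", "where is my refund", 1),
    Candidate("system_prompt", "you are a support agent", 0),
]
print(assemble(candidates, budget=30))
```

With a 30-token budget the lowest-priority item (tool definitions) is the one left out, which is the behavior the priority list describes.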

ContextBudgetManager

class ContextBudgetManager:
    """Tracks token usage against a fixed budget (relies on count_tokens above)."""

    def __init__(self, total_budget: int = 128_000, output_reserve: int = 4_000):
        self.available = total_budget - output_reserve
        self.used = 0
        self.items: list[dict] = []

    def _remaining(self) -> int:
        """Tokens still available in the budget."""
        return self.available - self.used

    def add(self, name: str, content: str, priority: int) -> bool:
        """Try to load completely. Returns False if over budget."""
        t = count_tokens(content)
        if t <= self._remaining():
            self.used += t
            self.items.append({"name": name, "tokens": t, "priority": priority})
            return True
        return False

    def add_with_trim(self, name: str, content: str, priority: int) -> int:
        """Load as much as possible; trim by character ratio if over budget."""
        t = count_tokens(content)
        budget_left = self._remaining()
        if t <= budget_left:
            self.used += t
            self.items.append({"name": name, "tokens": t, "priority": priority})
            return t
        if budget_left <= 0:
            return 0
        # The character ratio only approximates the token ratio,
        # so keep shrinking until the trimmed text actually fits.
        cut = int(len(content) * budget_left / t)
        trimmed = content[:cut]
        actual_t = count_tokens(trimmed)
        while actual_t > budget_left and cut > 0:
            cut = int(cut * 0.9)
            trimmed = content[:cut]
            actual_t = count_tokens(trimmed)
        self.used += actual_t
        self.items.append({"name": name, "tokens": actual_t,
                           "priority": priority, "trimmed": True})
        return actual_t
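
`add_with_trim` cuts by character ratio, which only approximates a token budget. One possible alternative is to binary-search the longest prefix that fits. The sketch below is not from the original code: `trim_to_budget` is a hypothetical helper, the default counter is a word-count stand-in, and the search assumes the token count never decreases as the prefix grows (true for the stand-in, only approximately true for BPE tokenizers).

```python
from typing import Callable


def word_count(text: str) -> int:
    """Stand-in token counter (assumption); swap in count_tokens in practice."""
    return len(text.split())


def trim_to_budget(content: str, budget: int,
                   counter: Callable[[str], int] = word_count) -> str:
    """Return the longest prefix of content whose token count fits budget."""
    if budget <= 0:
        return ""
    if counter(content) <= budget:
        return content
    lo, hi = 0, len(content)          # invariant: content[:lo] fits
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if counter(content[:mid]) <= budget:
            lo = mid
        else:
            hi = mid - 1
    return content[:lo]


doc = "refunds are processed within five business days of approval"
print(trim_to_budget(doc, budget=4))
```

The search costs O(log n) counter calls instead of one, which is usually acceptable for the few items that need trimming.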

Measured Results: Two Scenarios

Scenario A: Normal conversation (10K budget)

Source                     Tokens    %    Status
────────────────────────────────────────────────
System Prompt               155   1.6%  ✓ Complete
Current Input                22   0.2%  ✓ Complete
History Turn -1              62   0.6%  ✓ Complete
History Turn -2              47   0.5%  ✓ Complete
History Turn -3              61   0.6%  ✓ Complete
History Turn -4              60   0.6%  ✓ Complete
Retrieved Docs              200   2.0%  ✓ Complete
Tool Defs                   174   1.7%  ✓ Complete
────────────────────────────────────────────────
Used                        781   7.8%
Remaining                 9,219  92.2%

Scenario B: Tight budget (2K, simulating a tool-heavy Agent)

Source                     Tokens    %    Status
────────────────────────────────────────────────
System Prompt               155   7.8%  ✓ Complete
Current Input                22   1.1%  ✓ Complete
Recent History (1 turn)      40   2.0%  ✓ Complete
Retrieved Docs              200  10.0%  ✓ Complete
Tool Defs                   174   8.7%  ✓ Complete
────────────────────────────────────────────────
Used                        591  29.5%
Remaining                 1,409  70.5%

When the budget is tight, the System Prompt and current input are always complete (P0/P1) and history shrinks to the most recent turn. Retrieved content and tool defs load into whatever space remains and are trimmed only if they do not fit; in this run both still fit completely. Core information is preserved; supporting content scales with the budget.
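
The totals in both tables can be checked with a few lines of arithmetic. In this sketch the budgets of 10,000 and 2,000 tokens are inferred from the Used and Remaining rows, and `budget_report` is a hypothetical helper rather than part of `ContextBudgetManager`.

```python
def budget_report(items: dict[str, int], budget: int) -> dict[str, float]:
    """Summarize token usage against a budget, as in the tables above."""
    used = sum(items.values())
    return {
        "used": used,
        "used_pct": used / budget * 100,
        "remaining": budget - used,
    }


scenario_a = {"system": 155, "input": 22, "history": 62 + 47 + 61 + 60,
              "retrieved": 200, "tools": 174}
scenario_b = {"system": 155, "input": 22, "history": 40,
              "retrieved": 200, "tools": 174}

print(budget_report(scenario_a, budget=10_000))
print(budget_report(scenario_b, budget=2_000))
```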


Three Overflow Strategies: Head-to-Head

When conversation history exceeds the token limit, three strategies exist. Testing all three on the same scenario:

Test Setup:

  • 10-topic Python learning conversation: 20 messages, 510 tokens total
  • History token limit: 300 (simulating a nearly-full context)
  • Test question: "What was the list comprehension we discussed at the beginning? Can you give me a practical example?"
  • Goal: does the answer reflect that we covered list comprehensions in Turn 1?

Strategy 1: Truncation

def strategy_truncation(history: list, max_tokens: int) -> tuple[list, int]:
    """Keep newest messages first, working backwards until budget exhausted."""
    kept = []
    used = 0
    for msg in reversed(history):
        t = msg_tokens(msg)
        if used + t > max_tokens:
            break
        kept.insert(0, msg)
        used += t
    return kept, used

Result: 11/20 messages kept (294 tokens). Earliest visible message is from Turn 5 (context managers with __enter__/__exit__). Turn 1's list comprehension content is gone.

Kept: 11/20 messages  |  294/300 tokens used
Earliest visible: "Implement __enter__ and __exit__, or use @contextmanager..."

Answer (from LLM's own knowledge, no history context):
Python list comprehensions are a concise way to create lists.
[expression for variable in iterable if condition]
← Technically correct, but LLM is answering from general knowledge,
  not from "what we discussed earlier"
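
Keeping 11 of 20 messages means the oldest kept message is an answer whose question was cut. A possible variant drops whole question/answer pairs instead. This is a sketch, not the author's implementation: `Msg` and `truncate_by_turn` are hypothetical, the counter is a word-count stand-in, and the history is assumed to alternate strictly between user and assistant with an even length.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    role: str      # "user" or "assistant"
    content: str


def msg_tokens(msg: Msg) -> int:
    """Stand-in counter (assumption): words plus 4 overhead tokens."""
    return len(msg.content.split()) + 4


def truncate_by_turn(history: list[Msg], max_tokens: int) -> list[Msg]:
    """Keep the newest complete (user, assistant) pairs that fit the budget."""
    kept: list[Msg] = []
    used = 0
    # Walk pairs from newest to oldest; stop at the first pair that does not fit.
    for i in range(len(history) - 2, -1, -2):
        pair = history[i:i + 2]
        cost = sum(msg_tokens(m) for m in pair)
        if used + cost > max_tokens:
            break
        kept = pair + kept
        used += cost
    return kept


history = []
for n in range(1, 6):
    history.append(Msg("user", f"question about topic {n}"))
    history.append(Msg("assistant", f"a short answer covering topic {n}"))

kept = truncate_by_turn(history, max_tokens=40)
print([m.content for m in kept])
```

The trade-off is a slightly lower fill rate in exchange for never showing the model an orphaned answer.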

Strategy 2: Summarization

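from langchain_core.messages import HumanMessage, SystemMessage

# llm is an already-configured chat model; history_text is the whole
# conversation joined into a single string.
summary = llm.invoke([
    SystemMessage("Compress the following Python learning conversation into a concise summary. "
                  "Preserve all topics discussed and key conclusions (≤150 words)."),
    HumanMessage(history_text),
])
# summary.content holds the compressed text that replaces the raw history.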

Result: 510 tokens → 99 tokens, 5.2x compression ratio. All 10 topics preserved:

510 tokens → 99 tokens (5.2x compression)

Summary content:
Python list comprehensions simplify for loops; dict comprehensions
produce dicts; generators save memory; decorators wrap functions;
context managers implement 'with'; GIL restricts threads;
IO-bound → threading; CPU-bound → multiprocessing; async/await for IO;
dataclass reduces boilerplate; Pydantic for data validation.

Answer (from summary):
Python list comprehensions are a concise syntax for creating lists.
[x**2 for x in range(5)]  ← list comprehensions in summary → specific example generated
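
In a running Agent the summary usually replaces only the older part of the history while the most recent messages stay verbatim. A minimal sketch of that pattern, assuming a pluggable `summarize` callable in place of the LLM call; `compress_history` and `fake_summarize` are hypothetical names, and the counter is a word-count stand-in.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Stand-in counter (assumption): one token per word."""
    return len(text.split())


def compress_history(messages: list[str], max_tokens: int, keep_recent: int,
                     summarize: Callable[[str], str]) -> list[str]:
    """Summarize older messages once the history exceeds max_tokens."""
    total = sum(count_tokens(m) for m in messages)
    if total <= max_tokens or len(messages) <= keep_recent:
        return messages
    split = len(messages) - keep_recent
    older, recent = messages[:split], messages[split:]
    summary = summarize("\n".join(older))
    return [f"[Summary of earlier conversation] {summary}"] + recent


def fake_summarize(text: str) -> str:
    """Placeholder for the LLM call: keeps the first 3 words of each line."""
    return "; ".join(" ".join(line.split()[:3]) for line in text.splitlines())


messages = [f"turn {i} discussed topic number {i} in some detail" for i in range(1, 9)]
compact = compress_history(messages, max_tokens=40, keep_recent=2,
                           summarize=fake_summarize)
print(compact)
```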

Strategy 3: Retrieval

# Convert each conversation turn into a Document, build vector index
history_docs = [
    Document(page_content=f"Q: {q}\nA: {a}", metadata={"turn": i+1})
    for i, (q, a) in enumerate(history_topics)
]
history_store = Chroma.from_documents(history_docs, embeddings)
history_retriever = history_store.as_retriever(search_kwargs={"k": 2})

relevant_docs = history_retriever.invoke(test_question)
# → Precisely retrieves Turn 1 (list comprehensions) and Turn 10 (Pydantic)

Result: 2 relevant turns retrieved (118 tokens), directly hitting Turn 1:

Retrieved 2 relevant turns (118 tokens):
  Turn 1: "What are Python list comprehensions?"  ← precise hit
  Turn 10: "Pydantic vs dataclass?"

Answer (from retrieved history):
Python list comprehensions are a concise way to create lists.
squared = [x**2 for x in numbers]   ← uses retrieved Turn 1 content directly
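
The retrieval step can be illustrated without a vector database. The sketch below swaps the embedding model for bag-of-words cosine similarity, which is a much cruder technique than the Chroma setup above and is used here only to show the ranking logic; the sample turns and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding (assumption): lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(turns: list[str], query: str, k: int = 2) -> list[int]:
    """Return the 1-based numbers of the k turns most similar to the query."""
    q = embed(query)
    scored = [(cosine(embed(t), q), i + 1) for i, t in enumerate(turns)]
    scored.sort(key=lambda s: (-s[0], s[1]))   # best score first, ties by turn
    return [turn for _, turn in scored[:k]]


turns = [
    "Q: What are Python list comprehensions? A: A concise way to build a list.",
    "Q: How do generators save memory? A: They yield items lazily.",
    "Q: Why does the GIL limit threads? A: Only one thread runs bytecode at a time.",
]
print(retrieve(turns, "What was the list comprehension we discussed?", k=1))
```

Word matching misses near-synonyms and inflections ("comprehension" vs "comprehensions"), which is one reason a real embedding model is worth the extra infrastructure.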

Summary Comparison

Strategy      History Tokens    Turn 1 Visible    Implementation
─────────────────────────────────────────────────────────────────
Truncation         294          ✗ Dropped         Trivial
Summarization       99          ✓ In summary       Medium
Retrieval          118          ✓ Precise hit      High (needs vector index)

Truncation loses Turn 1. The LLM answers from general knowledge — correct, but not from conversation history.

Summarization at 5.2x compression retains all 10 topic names. The trade-off: specifics are generalized (topic names only, not the original Q&A text).

Retrieval uses 118 tokens, slightly more than the summary's 99 but far fewer than truncation's 294, and lands directly on Turn 1. Highest answer quality. The cost is building and maintaining a history vector index.


Context Engineering Design Checklist

Token Budget Planning

  • [ ] Know your model's context window: Claude 3.7 200K, GPT-4o 128K, GLM-4 128K
  • [ ] Reserve adequate output space (4K–20K depending on expected output length)
  • [ ] Set per-source ceilings: System Prompt < 2K, history budget 20K max, retrieval budget 30K max
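
Per-source ceilings are easy to check mechanically. A small sketch using the ceilings from the last item; the source keys and the `over_ceiling` helper are hypothetical.

```python
# Ceilings taken from the checklist item above, in tokens.
CEILINGS = {"system_prompt": 2_000, "history": 20_000, "retrieval": 30_000}


def over_ceiling(usage: dict[str, int], ceilings: dict[str, int]) -> dict[str, int]:
    """Return how many tokens each source is over its ceiling, if any."""
    return {
        source: tokens - ceilings[source]
        for source, tokens in usage.items()
        if source in ceilings and tokens > ceilings[source]
    }


usage = {"system_prompt": 155, "history": 26_500, "retrieval": 200}
print(over_ceiling(usage, CEILINGS))
```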

Priority Assembly

  • [ ] P0/P1 (System Prompt + current input) always complete — never participate in trimming
  • [ ] P2 (history) loaded newest-first; stop when budget exhausted, don't sample randomly
  • [ ] P3 (retrieved content) filtered by relevance score — low-score docs excluded
  • [ ] P4 (tool defs) only load tools plausibly relevant to the current turn
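
One cheap way to approximate "plausibly relevant" for P4 is word overlap between the user's question and each tool description. This is a rough sketch only: the tool names and descriptions are invented for illustration, and a production system would more likely use embeddings or a routing step.

```python
import re

TOOLS = {  # hypothetical tool names and descriptions
    "lookup_order": "look up order status and shipping tracking",
    "issue_refund": "issue a refund for a paid order",
    "reset_password": "reset the account password by email",
    "book_callback": "schedule a phone callback with support staff",
}


def words(text: str) -> set[str]:
    """Lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select_tools(query: str, tools: dict[str, str], min_overlap: int = 1) -> list[str]:
    """Keep tools whose description shares at least min_overlap words with the query."""
    q = words(query)
    return [name for name, desc in tools.items()
            if len(q & words(desc)) >= min_overlap]


print(select_tools("I want a refund for my order", TOOLS, min_overlap=2))
```

Common words such as "a" and "for" inflate the overlap, so the threshold needs tuning or a stopword list before this is useful beyond a demo.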

Overflow Strategy Selection

  • [ ] < 20 turns: truncation — simple and reliable
  • [ ] 20–100 turns or need global narrative: summarization (suggest triggering every 5–10 turns)
  • [ ] > 100 turns or need to surface specific early content: retrieval (vector-index the history)
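
The three rules above reduce to a small decision function. A sketch with hypothetical names; the thresholds are the ones in the checklist, and the boundary at exactly 100 turns is assigned to summarization as the list implies.

```python
def choose_overflow_strategy(turns: int, needs_early_detail: bool = False,
                             needs_global_narrative: bool = False) -> str:
    """Map the checklist thresholds to a strategy name."""
    if turns > 100 or needs_early_detail:
        return "retrieval"
    if turns >= 20 or needs_global_narrative:
        return "summarization"
    return "truncation"


for n in (10, 50, 150):
    print(n, choose_overflow_strategy(n))
```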

Monitoring

  • [ ] Log actual token consumption per request (input / output / cache hits)
  • [ ] Alert when a single request exceeds 80% of the window — signals compression tuning is needed
  • [ ] Track truncation/summarization trigger frequency — too frequent means thresholds are too low
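
The alert and the trigger-frequency check can share one small tracker. A minimal sketch, assuming a 128K window and the 80% threshold from the list; `ContextMonitor` is a hypothetical class, and a real deployment would send these values to its existing metrics system.

```python
from collections import deque


class ContextMonitor:
    """Tracks window usage and how often compression is triggered."""

    def __init__(self, window: int = 128_000, alert_ratio: float = 0.8,
                 history_size: int = 100):
        self.window = window
        self.alert_ratio = alert_ratio
        self.triggers = deque(maxlen=history_size)  # recent requests only

    def record(self, input_tokens: int, output_tokens: int,
               compressed: bool) -> bool:
        """Log one request; return True if it crossed the alert threshold."""
        self.triggers.append(compressed)
        return (input_tokens + output_tokens) > self.window * self.alert_ratio

    def trigger_rate(self) -> float:
        """Fraction of recent requests that needed truncation or summarization."""
        return sum(self.triggers) / len(self.triggers) if self.triggers else 0.0


monitor = ContextMonitor()
print(monitor.record(90_000, 2_000, compressed=False))
print(monitor.record(110_000, 3_000, compressed=True))
print(monitor.trigger_rate())
```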

Summary

Five core takeaways:

  1. Context engineering ≠ prompt engineering: prompt engineering tunes the System Prompt; context engineering manages the entire window's composition, budget, and priority
  2. 128K is not forever: conversation history is the fastest-growing variable, and long-running Agents eventually overflow without controls
  3. Priority-based assembly is the core principle: System Prompt + current question always win; history and retrieval content scale elastically
  4. Truncation is simplest but drops early content: both summarization and retrieval preserve Turn 1 information; retrieval is the most precise
  5. 5.2x compression is a real measurement: 99 tokens replacing 510, retaining all 10 topic names — summarization is genuinely effective

Up next: Multi-Agent Architecture Design Patterns — when you actually need multiple Agents, the Supervisor vs Pipeline trade-off, and LangGraph Subgraph implementation.

