klement Gunndu
Context Engineering for AI Agents: 4 Patterns That Replace Prompt Hacking

Your AI agent works on the first call. By turn 20, it forgets your name.

That is not a prompt engineering problem. That is a context engineering problem. Prompt engineering optimizes how you ask. Context engineering optimizes what information surrounds the ask — the schemas, memory, tool definitions, and retrieval architecture that determine whether your agent succeeds or fails at complex tasks.

Anthropic defines context engineering as "the set of strategies for curating and maintaining the optimal set of tokens during LLM inference." The shift matters because autonomous agents persist across multiple interactions, make sequential decisions, and operate with varying levels of human oversight. A well-crafted prompt means nothing if the context window is full of irrelevant conversation history.

Here are 4 patterns that move your agents from prompt hacking to systematic context management — with working Python code for each.

Pattern 1: Message Trimming With Token Budgets

The simplest context engineering failure: your agent accumulates messages until it hits the context window limit, then crashes. Most developers fix this by counting messages. That is the wrong unit. Tokens are what matter.

LangChain provides trim_messages in langchain_core for exactly this. It trims message history based on token counts, not message counts, and preserves conversation boundaries.

from langchain_core.messages.utils import (
    trim_messages,
    count_tokens_approximately,
)
from langchain.chat_models import init_chat_model

model = init_chat_model("claude-sonnet-4-5-20250929")

def prepare_context(messages: list, max_tokens: int = 4096) -> list:
    """Trim messages to fit within token budget."""
    return trim_messages(
        messages,
        strategy="last",
        token_counter=count_tokens_approximately,
        max_tokens=max_tokens,
        start_on="human",
        end_on=("human", "tool"),
    )

The key parameters:

  • strategy="last" keeps the most recent messages. Use "first" to keep the oldest (useful for summarization agents).
  • start_on="human" ensures trimmed output starts on a human message, not a dangling assistant response.
  • end_on=("human", "tool") prevents cutting off mid-tool-call, which causes parsing errors downstream.
  • token_counter accepts either a function or an LLM instance. count_tokens_approximately is fast but imprecise. For production, pass your model directly: token_counter=model.
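To make the trimming behavior concrete, here is a library-free sketch of the same idea: a simplified stand-in for `trim_messages` with `strategy="last"` and `start_on="human"`, assuming each message is a plain dict carrying a role and a precomputed token count (the real function computes tokens for you):

```python
# Library-free sketch of strategy="last" trimming: keep the most recent
# messages that fit the budget, then drop leading messages until the
# window opens on a "human" turn (mirroring start_on="human").

def trim_last(messages: list[dict], max_tokens: int) -> list[dict]:
    kept: list[dict] = []
    total = 0
    for msg in reversed(messages):
        if total + msg["tokens"] > max_tokens:
            break
        kept.insert(0, msg)
        total += msg["tokens"]
    # Drop leading non-human messages so the window never starts on a
    # dangling assistant response
    while kept and kept[0]["role"] != "human":
        kept.pop(0)
    return kept

history = [
    {"role": "human", "tokens": 50, "text": "Hi, I'm Sam."},
    {"role": "ai", "tokens": 80, "text": "Hello Sam!"},
    {"role": "human", "tokens": 60, "text": "Summarize this file."},
    {"role": "ai", "tokens": 120, "text": "Here is the summary..."},
]
# Keeps only the final human/ai exchange (180 tokens fits; 260 would not)
print(trim_last(history, max_tokens=200))
```

The real `trim_messages` adds the pieces this sketch omits: accurate token counting and the `end_on` guard against cutting off mid-tool-call.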

In a LangGraph agent, this slots into your node function:

from langgraph.graph import StateGraph, START, MessagesState
from langgraph.checkpoint.memory import InMemorySaver

def call_model(state: MessagesState):
    trimmed = trim_messages(
        state["messages"],
        strategy="last",
        token_counter=count_tokens_approximately,
        max_tokens=4096,
        start_on="human",
        end_on=("human", "tool"),
    )
    response = model.invoke(trimmed)
    return {"messages": [response]}

builder = StateGraph(MessagesState)
builder.add_node(call_model)
builder.add_edge(START, "call_model")
graph = builder.compile(checkpointer=InMemorySaver())

Every invocation trims before calling the model. The agent stays within budget no matter how long the conversation runs.

Pattern 2: Structured System Prompts With Sections

Most system prompts are a wall of text. The model reads them linearly and loses track of which instruction applies to which situation. Context engineering treats the system prompt as a structured document with clear sections.

Anthropic's engineering team recommends using XML tags or Markdown headers to create distinct sections. The goal: help the model index instructions by category so it retrieves the right rule at the right time.

import anthropic

client = anthropic.Anthropic()

system_prompt = [
    {
        "type": "text",
        "text": """## Role
You are a code review assistant for Python projects.

## Constraints
- Never suggest changes that break existing tests
- Flag security issues as CRITICAL
- Limit suggestions to 5 per file

## Output Format
Return JSON with fields: file, line, severity, suggestion, rationale

## Tool Guidance
- Use the `read_file` tool to examine source code
- Use the `run_tests` tool to verify suggestions do not break tests
- Never use `write_file` without explicit user approval""",
    },
]

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=2048,
    system=system_prompt,
    messages=[{"role": "user", "content": "Review this pull request."}],
)

The structure matters more than the words. A system prompt with ## Role, ## Constraints, ## Output Format, and ## Tool Guidance sections outperforms the same instructions written as continuous prose. The model treats headers as retrieval anchors — when it needs to decide output format, it looks for the ## Output Format section instead of scanning the entire prompt.

Three rules for structuring system prompts:

  1. Find the right altitude. Too high-level ("be helpful") gives no guidance. Too prescriptive ("if the user says X, respond with Y") breaks on edge cases. Aim for constraints and principles.
  2. Keep tool guidance near tool definitions. Models perform better when the description of when to use a tool is adjacent to the tool's schema.
  3. Put the most important instructions first and last. Models recall the beginning and end of context windows better than the middle — the primacy-recency effect applies to LLMs too.
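Rule 2 in practice: instead of burying tool usage rules in the system prompt, put the when-to-use guidance in each tool's `description` field, right next to its schema. A sketch with hypothetical tool definitions mirroring the code-review assistant above:

```python
# Sketch: embed when-to-use guidance in each tool's description so the
# guidance sits adjacent to the schema (tool names are illustrative).
tools = [
    {
        "name": "read_file",
        "description": (
            "Read a source file. Use this before commenting on any code "
            "you have not already seen in the conversation."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "run_tests",
        "description": (
            "Run the project's test suite. Use this to verify that a "
            "suggestion does not break existing tests before flagging it."
        ),
        "input_schema": {"type": "object", "properties": {}},
    },
]
```

Passing these via the `tools=` parameter means the model reads the guidance at the moment it considers each tool, instead of having to recall a rule from elsewhere in the prompt.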

Pattern 3: Prompt Caching for Multi-Turn Agents

Every API call to an LLM re-processes the entire system prompt and conversation history. For an agent that makes 15 tool calls per task, that means the same 3,000-token system prompt gets tokenized and processed 15 times. Anthropic's prompt caching eliminates this redundancy.

import anthropic

client = anthropic.Anthropic()

# First call: system prompt gets cached
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a data analysis assistant with expertise in pandas, SQL, and visualization.",
        },
        {
            "type": "text",
            "text": LARGE_REFERENCE_DOCUMENT,  # 10,000+ tokens of context
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "Summarize the key findings."}],
)

# Check cache performance
usage = response.usage
print(f"Cache created: {usage.cache_creation_input_tokens} tokens")
print(f"Cache read:    {usage.cache_read_input_tokens} tokens")
print(f"New input:     {usage.input_tokens} tokens")

The cache_control: {"type": "ephemeral"} parameter tells the API to cache everything up to that breakpoint. On subsequent calls with the same prefix, cached tokens cost 90% less and process with up to 85% lower latency.

Where to place cache breakpoints:

  • Tool definitions — cache these first. They rarely change between calls.
  • System instructions — cache second. Static across the entire session.
  • Reference documents — cache third. Large context that the agent queries repeatedly.
  • Conversation history — do not cache. It changes every turn.

The cache has a 5-minute TTL by default. For longer agent sessions, use {"type": "ephemeral", "ttl": "1h"} to extend it to one hour.
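The breakpoint ordering above can be sketched as a request payload (built as a plain dict, no API call; tool contents are placeholders). A `cache_control` marker on the last tool caches all tool definitions up to that point; a second marker on the system block caches the instructions, here with the 1-hour TTL syntax from above:

```python
# Request-shaped sketch of the breakpoint ordering: tools first, system
# second, conversation history left uncached.
request = {
    "model": "claude-sonnet-4-5-20250929",
    "max_tokens": 1024,
    "tools": [
        {"name": "read_file", "description": "...",
         "input_schema": {"type": "object"}},
        {"name": "run_tests", "description": "...",
         "input_schema": {"type": "object"},
         "cache_control": {"type": "ephemeral"}},  # breakpoint 1: all tools
    ],
    "system": [
        {"type": "text",
         "text": "You are a data analysis assistant.",
         "cache_control": {"type": "ephemeral", "ttl": "1h"}},  # breakpoint 2
    ],
    # Conversation history carries no cache_control: it changes every turn
    "messages": [{"role": "user", "content": "Summarize the key findings."}],
}
```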

A 15-call agent session with a 5,000-token system prompt processes 75,000 tokens without caching. With caching, it processes 5,000 tokens once and reads from cache 14 times — cutting input costs by roughly 80%.
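The arithmetic behind that claim, assuming cache reads bill at 10% of the base input rate (the "90% less" figure above) and ignoring the one-time cache-write premium:

```python
# Back-of-envelope check of the caching savings for a 15-call session.
calls, prompt_tokens, read_rate = 15, 5_000, 0.10

uncached = calls * prompt_tokens  # every call pays full price: 75,000 tokens
# One full-price pass, then 14 cache reads at 10% of the base rate
cached = prompt_tokens + (calls - 1) * prompt_tokens * read_rate
savings = 1 - cached / uncached

print(f"uncached={uncached} cached={cached:.0f} savings={savings:.0%}")
# savings lands at 84%, consistent with "roughly 80%"
```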

Pattern 4: Sub-Agent Architecture With Context Isolation

The hardest context engineering problem: your agent needs to explore 5 different approaches, but exploring all 5 in a single context window creates confusion. The model mixes up findings from approach 1 with conclusions from approach 3.

Sub-agents solve this by giving each exploration its own isolated context. The orchestrator receives condensed summaries — not the full reasoning traces.

import anthropic

client = anthropic.Anthropic()

def run_sub_agent(task: str, context: str, max_tokens: int = 1024) -> str:
    """Run a focused sub-agent with isolated context."""
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=max_tokens,
        system="You are a focused research agent. Return only findings relevant to the task. Be concise.",
        messages=[
            {
                "role": "user",
                "content": f"Task: {task}\n\nContext:\n{context}",
            }
        ],
    )
    return response.content[0].text

def orchestrate(question: str, approaches: list[str]) -> str:
    """Explore multiple approaches with isolated sub-agents."""
    findings = {}

    for approach in approaches:
        result = run_sub_agent(
            task=f"Investigate: {approach}",
            context=question,
            max_tokens=1500,
        )
        # Each sub-agent returns 1,000-1,500 tokens max
        findings[approach] = result

    # Synthesize with clean context
    synthesis_prompt = "Synthesize these findings into a recommendation:\n\n"
    for approach, result in findings.items():
        synthesis_prompt += f"## {approach}\n{result}\n\n"

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=2048,
        system="You are a decision-making agent. Compare findings and recommend the best approach with reasoning.",
        messages=[{"role": "user", "content": synthesis_prompt}],
    )
    return response.content[0].text

The key constraint: each sub-agent returns 1,000-1,500 tokens, not its full reasoning chain. The orchestrator works with condensed findings, not raw transcripts. This keeps the synthesis context clean and focused.

Three rules for sub-agent context isolation:

  1. Cap sub-agent output. Set max_tokens to bound what comes back. Unbounded sub-agents flood the orchestrator.
  2. Give each sub-agent a focused system prompt. "You are a focused research agent" performs better than reusing the orchestrator's full system prompt.
  3. Synthesize, do not concatenate. The orchestrator should compare and decide — not just paste sub-agent outputs together.

When to Use Which Pattern

Not every agent needs all 4 patterns. Here is the decision framework:

Symptom → Pattern

  • Agent forgets context after many turns → Pattern 1: Message trimming
  • Agent ignores specific instructions → Pattern 2: Structured system prompts
  • API costs scale linearly with turns → Pattern 3: Prompt caching
  • Agent confuses findings from different tasks → Pattern 4: Sub-agent isolation

Start with Pattern 2 — structured system prompts cost nothing to implement and improve every agent. Add Pattern 1 when conversations exceed 10 turns. Add Pattern 3 when you are making 5+ API calls per task. Add Pattern 4 when your agent needs to explore multiple approaches.

The Shift That Matters

Prompt engineering asks: "How do I phrase this so the model understands?"

Context engineering asks: "What information does the model need, in what structure, at what point in the conversation?"

The second question is harder. It requires you to think about token budgets, cache breakpoints, message boundaries, and context isolation — not just word choice. But it scales. A well-engineered context window works across models, across tasks, and across conversation lengths. A clever prompt works until the context window fills up.

The agents that work in production are not the ones with the best prompts. They are the ones with the best context.


Follow @klement_gunndu for more AI engineering content. We're building in public.
