Context Engineering: The Developer's Complete Guide to Building High-Performance AI Agents
Table of Contents
- The Prompt Engineering Era Is Over
- What Is Context Engineering?
- The 7 Components of a Production Harness
- Memory Architecture: Short-Term vs. Long-Term Context
- MCP: Wiring the World to Your Agent
- The Ratchet Pattern: Turning Agent Mistakes Into Permanent Rules
- Multi-Agent Patterns: Orchestrators, Sub-Agents, and Skills
- Benchmarks & Real-World Evidence
- Getting Started: Build Your First Context-Engineered Agent
- Conclusion
The Prompt Engineering Era Is Over
Here's a data point worth sitting with: on Terminal Bench 2.0, Cline's CLI running claude-opus-4.7 scored 74.2%. Anthropic's own Claude Code, running the same model, scored 69.4%. A ~5-point gap on a competitive benchmark — not from a better model, not from more parameters, not from more training data. Just from a better harness.
That gap is the entire argument for context engineering.
For the last three years, the dominant mental model in AI development has been: better prompt → better output. Tweak the system message. Add few-shot examples. Try chain-of-thought. This worked well enough when LLMs were assistants — stateless, turn-by-turn, finishing after one response. But AI has moved on. Agents run for dozens of steps. They call tools, spawn sub-agents, manage memory across sessions, and recover from failures. A static prompt tweaked in a playground simply cannot orchestrate that complexity.
The field has quietly converged on a new discipline to fill that gap: context engineering for AI agents — the practice of deliberately shaping every token that enters an LLM's context window during an agent's execution lifecycle. HuggingFace just launched a full six-unit certification course on it. O'Reilly published a deep-dive from Google Chrome engineer Addy Osmani on the patterns. LangChain's blog is flooded with harness architecture posts. And Viv Trivedy's viral equation — Agent = Model + Harness — has become the consensus definition of what an AI agent actually is.
This post is the engineer's guide to that discipline. By the end, you'll understand what context engineering is, how to decompose a production harness into its seven core components, how to wire external tools through MCP, how to manage short-term and long-term memory, and how to implement the ratchet pattern so your agent gets measurably better with every failure.
What Is Context Engineering?
Context engineering is the discipline of designing what goes into an agent's context window at every step: the system prompt, tool descriptions, conversation history, retrieved knowledge, intermediate reasoning, memory injections, and output format instructions. It's not a one-time decision made before you hit enter — it's an active engineering process that shapes every model call across an agent's entire execution lifecycle.
The HuggingFace definition is precise:
"Context engineering: structuring knowledge so an agent can efficiently find what it needs, when it needs it, to improve its generated outputs."
This sounds simple. It is not. An agent's context window is a scarce, high-stakes resource. Fill it poorly — irrelevant history, redundant tool schemas, imprecise instructions — and the model loses track of the task, hallucinates constraints that don't exist, or burns tokens on reasoning that goes nowhere. Fill it well, and the same model that struggled through a 40-step task completes it cleanly on the first try.
The Anatomy of an Agent
To master this practice, you first need to understand the three-layer architecture that every modern AI agent shares:

Figure 1: The three-layer anatomy of a production AI agent — Model at the core, Scaffolding in the middle, and the Harness as the full execution environment.
The Model is the LLM at the center: Claude, GPT, Qwen, DeepSeek. It takes text in and produces text out. On its own, it has no memory between calls, no loop, and no ability to execute anything. It produces the intent to call a tool; it cannot call the tool itself.
Scaffolding is the behaviour-defining layer wrapped around the model. It includes: the system prompt, tool descriptions, how the model's responses get parsed, what format outputs must follow, and how state is maintained across steps. Scaffolding shapes how the model perceives its world.
The Harness is the execution layer: the code that calls the model, handles its tool calls, routes outputs, decides when to stop, manages errors, and feeds results back into context. The harness is what makes the agent run.
Viv Trivedy's equation crystallises this:
Agent = Model + Harness
If you are not the model, you are the harness. And this is where the real engineering work lives — the harness is your surface area.
Why Prompt Engineering Isn't Enough Anymore
Classic prompt engineering optimises a single inference call. You write a better instruction, you get a better response. The feedback loop is immediate and the surface area is small.
Agentic systems break this model completely. Consider a coding agent tasked with "refactor this service to add proper error handling." It will:
- Read multiple files to understand the codebase
- Reason about the best error-handling strategy
- Write code changes across potentially dozens of files
- Run tests and interpret their output
- Fix failures — potentially through multiple iterations
- Validate the final result against the original spec
Each of these steps is a separate model call. Each call has a context window that needs to be carefully curated — relevant history included, irrelevant tokens pruned, tool results formatted for consumption, memory from previous sessions retrieved and injected. Prompt engineering addresses step 1 of 1. Context engineering addresses every step of every loop.
The 7 Components of a Production Harness
A production-grade agent harness is not a single file with a system prompt. It is an engineered system with seven distinct components, each responsible for a specific aspect of agent behaviour.
1. System Prompts & Agent Configuration Files
This is your AGENTS.md, CLAUDE.md, or equivalent. These files encode the agent's identity, constraints, conventions, and task-specific knowledge. They are not static documentation — they are active configuration that shapes every model call. A well-maintained AGENTS.md is the distilled product of your agent's entire failure history (more on this in the Ratchet Pattern section).
2. Tool Descriptions & Schemas
Tools are how your agent reaches outside its context window: filesystems, databases, REST APIs, code execution environments, web search. The description of each tool — the natural language explanation of what it does, when to use it, and what its parameters mean — is as important as the tool itself. A poorly described tool will be misused or ignored. Tool schema rendering (how the description becomes tokens the model sees) varies significantly across model providers and can account for substantial performance differences.
3. Context Window Policy
A context window policy is the set of rules that govern what gets included, truncated, summarised, or discarded at each step. Key decisions include: how many turns of conversation history to retain; when to summarise rather than retain verbatim; how to format tool results; and when to trigger context compaction (reducing a long context to a shorter, high-fidelity summary). This is one of the most under-engineered components in most agent implementations.
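As a rough illustration of such a policy, the sketch below keeps every system message, retains only the most recent turns, and truncates oversized tool results. The function name `apply_context_policy`, the message shape (plain dicts with `role` and `content`), and both thresholds are assumptions made for this example, not part of any framework.

```python
def apply_context_policy(
    messages: list[dict],
    max_turns: int = 6,
    max_tool_chars: int = 2000,
) -> list[dict]:
    """Return a trimmed copy of `messages` for the next model call.

    Hypothetical policy: keep all system messages, keep the last
    `max_turns` non-system messages, and truncate long tool results.
    The input list is never mutated.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept = rest[-max_turns:] if max_turns > 0 else []

    trimmed = []
    for m in kept:
        content = m["content"]
        if m["role"] == "tool" and len(content) > max_tool_chars:
            dropped = len(content) - max_tool_chars
            content = content[:max_tool_chars] + f"\n[... {dropped} chars truncated]"
        # Copy the message so the caller's history stays intact.
        trimmed.append({**m, "content": content})
    return system + trimmed
```

A real policy would likely also need to keep each tool result paired with the assistant message that requested it, and to summarise dropped turns rather than discard them outright.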
4. Memory Architecture
Short-term memory (in-context) and long-term memory (external, retrieved on demand) are separate systems with different engineering requirements. See the dedicated section below.
5. Feedback Loops & Backpressure
Agents that can't detect failure will run confidently in the wrong direction until they hit a hard limit. Feedback loops are the signals that interrupt and redirect: test failures, lint errors, type check results, assertion violations, CI/CD outputs. Backpressure (the mechanism that prevents an agent from declaring success until all quality signals are green) is what keeps agents honest. Without this, you get agents that "finish" broken code. With it, you get agents that iterate until the code actually works.
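One way to picture backpressure is as a gate the harness runs before it accepts a "done" from the model. The sketch below is a minimal version under stated assumptions: `DEFAULT_GATES`, `run_quality_gates`, and `may_declare_success` are hypothetical names, and the gate commands are placeholders for whatever checks your project actually uses.

```python
import subprocess
import sys

# Hypothetical gate set; swap in your project's own commands.
DEFAULT_GATES: dict[str, list[str]] = {
    "tests": [sys.executable, "-m", "pytest", "-q"],
    "lint": [sys.executable, "-m", "ruff", "check", "."],
}


def run_quality_gates(gates: dict[str, list[str]], timeout: int = 120) -> dict[str, str]:
    """Run each gate command and return {gate_name: output} for the ones that failed."""
    failures: dict[str, str] = {}
    for name, command in gates.items():
        try:
            proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
        except (subprocess.TimeoutExpired, FileNotFoundError) as exc:
            # A gate that cannot run counts as a failure, not a pass.
            failures[name] = f"could not run: {exc}"
            continue
        if proc.returncode != 0:
            failures[name] = (proc.stdout + proc.stderr).strip()
    return failures


def may_declare_success(gates: dict[str, list[str]]) -> tuple[bool, str]:
    """Return (True, message) only when every gate passes; otherwise return feedback."""
    failures = run_quality_gates(gates)
    if not failures:
        return True, "All quality gates passed."
    report = "\n\n".join(f"[{name} FAILED]\n{out}" for name, out in failures.items())
    return False, "Task is not complete. Fix these failures:\n" + report
```

When the gate returns `False`, the harness would append the feedback string to the context and continue the loop instead of ending the run.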
6. Hooks & Middleware
Hooks intercept the agent loop at defined points — before tool execution, after model output, on error — to enforce deterministic policies. Examples: a pre-execution hook that blocks git push --force; a post-output hook that strips personally identifiable information from responses before they're logged; a continuation hook that fires when context compaction is triggered. Hooks give you control points that are not subject to model behaviour — they execute deterministically regardless of what the model decides.
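The force-push example can be sketched as a plain function that the harness consults before executing any tool call. Everything named here (`block_force_push`, `run_tool_with_hooks`, the `run_shell` tool name, and the regular expression) is an illustrative assumption rather than an API from a specific framework.

```python
import re
from typing import Callable, Optional

# A hook receives (tool_name, arguments) and returns a refusal message,
# or None to let the call proceed.
PreToolHook = Callable[[str, dict], Optional[str]]

# Matches `git push` followed by --force or -f. Note that this also matches
# --force-with-lease; loosen the pattern if you want to permit that flag.
FORCE_PUSH = re.compile(r"\bgit\s+push\b.*(--force\b|\s-f\b)")


def block_force_push(tool_name: str, arguments: dict) -> Optional[str]:
    """Refuse shell commands that force-push; ignore every other tool."""
    if tool_name != "run_shell":
        return None
    if FORCE_PUSH.search(arguments.get("command", "")):
        return "Blocked by harness policy: force-pushing is not allowed."
    return None


def run_tool_with_hooks(
    tool_name: str,
    arguments: dict,
    execute: Callable[[str, dict], str],
    hooks: list[PreToolHook],
) -> str:
    """Run pre-execution hooks in order; the first refusal short-circuits the call."""
    for hook in hooks:
        refusal = hook(tool_name, arguments)
        if refusal is not None:
            # The refusal is returned as the tool result, so the model sees why.
            return refusal
    return execute(tool_name, arguments)
```

Because the check runs in harness code, it applies no matter how the model phrases or justifies the command.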
7. Observability: Traces, Costs, and Latency
An agent you can't observe is an agent you can't debug. Production harnesses instrument every model call: input tokens, output tokens, tool calls made, tool results received, latency per step, total cost per run. LangSmith, Langfuse, and Helicone are widely used platforms for this. Without traces, a failing agent is a black box. With traces, every failure becomes a legible sequence of events you can replay and fix.
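Before adopting a tracing platform, it can help to see how little is needed to capture these numbers yourself. The following is a minimal in-memory sketch; `StepTrace`, `RunTrace`, and the per-token prices are all hypothetical placeholders, not real provider pricing or any platform's schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StepTrace:
    """One model call: what went in, what came out, how long it took."""
    step: int
    input_tokens: int
    output_tokens: int
    latency_s: float
    tool_calls: list[str] = field(default_factory=list)


@dataclass
class RunTrace:
    """Accumulates per-step traces for a single agent run."""
    # Placeholder prices in USD per million tokens; substitute your provider's rates.
    input_price_per_m: float = 1.0
    output_price_per_m: float = 4.0
    steps: list[StepTrace] = field(default_factory=list)

    def record(self, input_tokens: int, output_tokens: int, latency_s: float,
               tool_calls: Optional[list[str]] = None) -> None:
        """Append one step; step numbers start at 1."""
        self.steps.append(StepTrace(
            step=len(self.steps) + 1,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            latency_s=latency_s,
            tool_calls=list(tool_calls or []),
        ))

    def summary(self) -> dict:
        """Aggregate totals across all recorded steps."""
        input_tokens = sum(s.input_tokens for s in self.steps)
        output_tokens = sum(s.output_tokens for s in self.steps)
        cost = (input_tokens * self.input_price_per_m
                + output_tokens * self.output_price_per_m) / 1_000_000
        return {
            "steps": len(self.steps),
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "tool_calls": sum(len(s.tool_calls) for s in self.steps),
            "latency_s": sum(s.latency_s for s in self.steps),
            "cost_usd": cost,
        }
```

In a real loop you would call `record` once per model call, using the token counts the provider reports in its response.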
Memory Architecture: Short-Term vs. Long-Term Context
Memory is the most misunderstood component of agent harness design, and getting it wrong produces some of the most frustrating agent bugs: the agent "forgets" a constraint it was told about, repeats work it already completed, or fails to apply knowledge from a previous session.
The architecture has two distinct tiers:
Short-term memory is everything that lives in the context window during a single agent run. This includes the conversation history (prior turns, tool calls, tool results), intermediate reasoning, and any documents or code the agent has loaded for the current task. Short-term memory is fast and immediately accessible — but it is finite, expensive, and ephemeral. When the run ends, it's gone.
Long-term memory persists across sessions. It is stored externally — in vector databases, key-value stores, or structured databases — and retrieved on demand, then injected into the context window when relevant. This is how an agent "remembers" that your team never uses console.log in production code, or that a particular API has a known rate-limiting bug.
The engineering challenge is the injection layer: deciding what long-term memory to retrieve, when to retrieve it, and how to format it for injection so the model treats it as high-confidence context rather than noise.
Here's a practical Python pattern for a context-aware memory manager:
import json

import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer


class AgentMemoryManager:
    """
    Manages short-term (in-context) and long-term (external) memory
    for a production AI agent harness.
    """

    def __init__(self, model_client: OpenAI, max_short_term_tokens: int = 8000):
        self.client = model_client
        self.max_short_term_tokens = max_short_term_tokens
        self.short_term: list[dict] = []  # Current conversation/tool history
        self.long_term_store: list[dict] = []  # Persistent memory store
        self.encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def add_to_short_term(self, role: str, content: str, metadata: dict | None = None):
        """Add a message to short-term (in-context) memory."""
        self.short_term.append({
            "role": role,
            "content": content,
            "metadata": metadata or {}
        })
        # Trigger compaction if we're approaching the context limit
        if self._estimate_tokens(self.short_term) > self.max_short_term_tokens:
            self._compact_short_term()

    def store_to_long_term(self, content: str, tags: list[str] | None = None):
        """
        Persist important information to long-term memory.
        Call this for learnings, constraints, and facts that should
        survive across agent sessions.
        """
        embedding = self.encoder.encode(content).tolist()
        self.long_term_store.append({
            "content": content,
            "embedding": embedding,
            "tags": tags or [],
        })

    def retrieve_relevant_context(self, query: str, top_k: int = 3) -> list[str]:
        """
        Retrieve the most relevant long-term memories for the current task.
        Returns a list of context strings to inject into the system prompt.
        """
        if not self.long_term_store:
            return []
        query_embedding = self.encoder.encode(query)
        scores = []
        for memory in self.long_term_store:
            mem_vec = np.array(memory["embedding"])
            # Cosine similarity; the epsilon guards against division by zero
            similarity = np.dot(query_embedding, mem_vec) / (
                np.linalg.norm(query_embedding) * np.linalg.norm(mem_vec) + 1e-8
            )
            scores.append((float(similarity), memory["content"]))
        # Sort on the score only, so ties never fall back to comparing content
        scores.sort(key=lambda pair: pair[0], reverse=True)
        return [content for _, content in scores[:top_k]]

    def build_context_window(self, current_task: str) -> list[dict]:
        """
        Assemble the full context window by combining:
        1. Long-term memories retrieved for this task (injected as system context)
        2. Current short-term conversation history
        """
        relevant_memories = self.retrieve_relevant_context(current_task)
        injected_context = "\n".join(
            [f"[MEMORY]: {m}" for m in relevant_memories]
        )
        system_message = {
            "role": "system",
            "content": f"You are a helpful coding agent.\n\n"
                       f"Relevant context from previous sessions:\n{injected_context}"
        }
        # Send only role and content; "metadata" is harness-side bookkeeping
        # and is not a field the chat API expects on a message.
        history = [{"role": m["role"], "content": m["content"]} for m in self.short_term]
        return [system_message] + history

    def _compact_short_term(self):
        """
        When short-term memory approaches the token limit, summarise older
        turns to preserve space for new tool calls and reasoning.
        """
        if len(self.short_term) < 6:
            return
        # Summarise everything except the last 4 messages
        to_summarise = self.short_term[:-4]
        recent = self.short_term[-4:]
        summary_prompt = [
            {"role": "system", "content": "Summarise the following conversation turns concisely, "
                                          "preserving all decisions made, constraints established, and tool results."},
            {"role": "user", "content": json.dumps(to_summarise)}
        ]
        response = self.client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=summary_prompt,
            max_tokens=500
        )
        summary = response.choices[0].message.content
        self.short_term = [
            {"role": "system", "content": f"[COMPACTED HISTORY]: {summary}"}
        ] + recent

    def _estimate_tokens(self, messages: list[dict]) -> int:
        """Rough token estimate: ~4 chars per token."""
        total_chars = sum(len(str(m)) for m in messages)
        return total_chars // 4
The key insight: context window management is not passive. Every step of the agent loop, your harness must actively decide what to keep, what to compress, and what to retrieve.
MCP: Wiring the World to Your Agent
The Model Context Protocol (MCP) is the emerging standard for connecting external tools, APIs, and data sources to any AI agent — regardless of which model or framework you're using. Think of it as the USB-C of agent integrations: one protocol, every tool.
Before MCP, every agent framework had its own tool integration pattern. LangChain tools looked different from OpenAI function-calling schemas, which looked different from Anthropic tool use, which looked different from custom ReAct implementations. Developers wrote adapters. Tools drifted out of sync. Schema rendering — how a tool description actually becomes tokens the model sees — varied wildly across providers, producing silent performance differences.
MCP solves this by defining a client-server protocol: MCP servers expose tools, resources, and prompts through a standard interface; MCP clients (your agent harness) connect to those servers and translate their capabilities into whatever the underlying model expects.

Figure 2: MCP as the universal integration layer — one protocol connecting your agent to every tool and data source.
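The "translate their capabilities" step on the client side is often a small mapping. As a sketch, the function below converts a tool declaration with `name`, `description`, and `inputSchema` fields into the function-tool shape used by chat-completions style APIs. It works on plain dicts so it stays self-contained; `mcp_tool_to_openai` is a hypothetical helper name, and a real client would read these fields from whatever objects its MCP library returns.

```python
def mcp_tool_to_openai(tool: dict) -> dict:
    """Map an MCP-style tool declaration onto a chat-completions function tool.

    Assumes `tool` is a plain dict with "name" and, optionally,
    "description" and "inputSchema" (a JSON Schema object).
    """
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            # Fall back to an empty object schema for tools without parameters.
            "parameters": tool.get("inputSchema", {"type": "object", "properties": {}}),
        },
    }


# Example declaration using the same fields an MCP server exposes per tool.
read_file_tool = {
    "name": "read_file",
    "description": "Read the full contents of a file at the given path.",
    "inputSchema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}
openai_tool = mcp_tool_to_openai(read_file_tool)
```

The description passes through unchanged, which is why its wording matters: it is the text the model ultimately reads.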
Here's a minimal MCP server implementation in Python exposing two tools — a file reader and a shell executor — that any MCP-compatible agent can consume:
import asyncio
import subprocess
from pathlib import Path

from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp import types

# Initialise the MCP server
app = Server("dev-tools-mcp")


@app.list_tools()
async def list_tools() -> list[types.Tool]:
    """Declare the tools this MCP server exposes to any connected agent."""
    return [
        types.Tool(
            name="read_file",
            description=(
                "Read the full contents of a file at the given path. "
                "Use this to inspect source code, configs, or data files. "
                "Prefer this over bash `cat` for files under 500 lines."
            ),
            inputSchema={
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "Absolute or relative path to the file to read."
                    }
                },
                "required": ["path"]
            }
        ),
        types.Tool(
            name="run_shell",
            description=(
                "Execute a shell command and return stdout + stderr. "
                "Use for running tests, builds, linters, or system queries. "
                "NEVER use for destructive operations (rm -rf, git push --force) "
                "without explicit user confirmation."
            ),
            inputSchema={
                "type": "object",
                "properties": {
                    "command": {
                        "type": "string",
                        "description": "The shell command to execute."
                    },
                    "timeout": {
                        "type": "integer",
                        "description": "Max seconds to wait. Defaults to 30.",
                        "default": 30
                    }
                },
                "required": ["command"]
            }
        )
    ]


@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    """Handle tool calls routed from the agent harness."""
    if name == "read_file":
        path = Path(arguments["path"])
        if not path.exists():
            return [types.TextContent(type="text", text=f"Error: File not found: {path}")]
        content = path.read_text(encoding="utf-8", errors="replace")
        return [types.TextContent(type="text", text=content)]
    elif name == "run_shell":
        command = arguments["command"]
        timeout = arguments.get("timeout", 30)
        try:
            result = subprocess.run(
                command,
                shell=True,
                capture_output=True,
                text=True,
                timeout=timeout
            )
            output = f"STDOUT:\n{result.stdout}\nSTDERR:\n{result.stderr}\nEXIT CODE: {result.returncode}"
            return [types.TextContent(type="text", text=output)]
        except subprocess.TimeoutExpired:
            return [types.TextContent(type="text", text=f"Error: Command timed out after {timeout}s")]
    return [types.TextContent(type="text", text=f"Unknown tool: {name}")]


async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())


if __name__ == "__main__":
    asyncio.run(main())
Notice the tool descriptions. They are not just labels — they are load-bearing instructions. "Prefer this over bash cat for files under 500 lines." "NEVER use for destructive operations without explicit user confirmation." The model reads these descriptions at every step. Every word is a nudge toward or away from correct behaviour.
Tool description quality is context engineering. A tool named exec with description "executes commands" will be used unpredictably. The same tool with a precise description constraining when and how to use it behaves measurably better — without changing a single line of tool logic.
The Ratchet Pattern: Turning Agent Mistakes Into Permanent Rules
The most powerful operational practice in harness engineering is also the most counterintuitive: you should want your agent to fail.
Not randomly, not catastrophically — but every mistake is a diagnostic. An agent that ships a PR with commented-out tests isn't a broken model; it's a misconfigured harness that lacks the constraint "never comment out tests." Add that constraint, and the agent never makes that mistake again. For every agent. On every run. Forever.
This is the ratchet pattern: the harness only ever gets more capable, never less. Every mistake tightens a ratchet tooth.
Here's how it works in practice:
Step 1: Observe the failure. The agent deleted a .env file that was gitignored. The agent added console.log statements in production code. The agent "finished" a task with 12 failing tests.
Step 2: Identify the harness component responsible. Is it a missing constraint in AGENTS.md? A missing pre-execution hook? A missing feedback signal (the agent didn't know the tests were failing because you hadn't wired test output back into context)?
Step 3: Add the fix to the harness — not to the prompt for that one run. If it fixes a class of failures, it belongs in the harness permanently.
Here's a sample AGENTS.md structure that embeds learned constraints:
# AGENTS.md — Coding Agent Configuration
# Every constraint below is traceable to a specific production failure.
## Identity
You are a senior backend engineer working on a Python/FastAPI service.
Use Python 3.11+ syntax. Prefer `httpx` over `requests` for async HTTP.
## Code Standards
- NEVER comment out or skip tests. Delete them or fix them. Never add `@pytest.mark.skip(...)` to get a green run.
- NEVER include `print()`, `breakpoint()`, or `pdb.set_trace()` in committed code.
- ALL new functions must have type annotations and a docstring.
- Run `ruff check .` and `mypy src/` before marking any task complete.
## File Operations
- NEVER delete, overwrite, or move files in `./config/` or `.env*` without
explicit user confirmation in the same session.
- ALWAYS check `git status` before staging files to avoid committing secrets.
## Task Completion Criteria
A task is NOT complete until:
1. `pytest` exits with 0 failures
2. `ruff check .` exits with 0 warnings
3. `mypy src/` exits with 0 errors
4. You have confirmed the change does what the user asked
## Escalation
If you encounter any of the following, STOP and ask the user:
- Database migration changes
- Changes to authentication/authorization logic
- Modifications to CI/CD configuration files
Pair this with a pre-commit hook that enforces the same constraints deterministically (bypassing the model entirely for checks that can be automated):
#!/bin/bash
# .git/hooks/pre-commit
# Mirror of AGENTS.md constraints as deterministic pre-commit checks
set -e
echo "🔍 Running harness pre-commit checks..."
# Check 1: No commented-out tests
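# In grep's basic regex syntax "(" is a literal paren, while "\(" opens a group.
# An unmatched "\(" makes grep exit with an error, which would silently skip this check.
if grep -rn "\.skip(\|xit(\|# def test_\|# test_" --include="*.py" src/ tests/; then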
echo "❌ Commented-out tests found. Delete or fix them."
exit 1
fi
# Check 2: No debug statements
if grep -rn "pdb\.set_trace\|breakpoint()\|print(" --include="*.py" src/; then
echo "❌ Debug statements found in src/. Remove before committing."
exit 1
fi
# Check 3: Ruff linting
ruff check . || { echo "❌ Ruff check failed."; exit 1; }
# Check 4: Type checking
mypy src/ || { echo "❌ mypy check failed."; exit 1; }
echo "✅ All harness checks passed."
The result: your AGENTS.md and pre-commit hooks become the accumulated wisdom of every failure your agent has ever made. This is institutional knowledge, versioned in git, applied at every run.
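Step 3 can itself be made mechanical. The sketch below appends a learned rule to a named section of an AGENTS.md file and tags it with the incident that motivated it, so each constraint stays traceable. The helper name `ratchet_rule`, the "learned from" annotation, and the assumption that sections are `## ` headings are all choices made for this example.

```python
from pathlib import Path


def ratchet_rule(agents_md: Path, section: str, rule: str, incident: str) -> bool:
    """Append `rule` under `## section`, tagged with the incident behind it.

    Returns False without writing if the rule text is already present,
    so running it twice for the same failure is harmless.
    """
    text = agents_md.read_text(encoding="utf-8") if agents_md.exists() else ""
    if rule in text:
        return False

    entry = f"- {rule} (learned from: {incident})"
    header = f"## {section}"
    lines = text.splitlines()

    if header in lines:
        idx = lines.index(header) + 1
        # Walk to the end of this section: the next "## " heading or end of file.
        while idx < len(lines) and not lines[idx].startswith("## "):
            idx += 1
        # Back up over trailing blank lines so the entry joins the existing list.
        while idx > 0 and lines[idx - 1].strip() == "":
            idx -= 1
        lines.insert(idx, entry)
    else:
        # Section does not exist yet: create it at the end of the file.
        if lines and lines[-1].strip():
            lines.append("")
        lines += [header, entry]

    agents_md.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return True
```

A call such as `ratchet_rule(Path("AGENTS.md"), "Code Standards", "NEVER comment out tests.", "PR with disabled tests")` would then be committed alongside the fix, keeping the rule and its origin in version history.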
Multi-Agent Patterns: Orchestrators, Sub-Agents, and Skills
Single-agent systems hit a ceiling quickly. A 40-step coding task doesn't just stress the context window — it conflates planning, execution, and validation into a single agent that is simultaneously trying to be a strategist, a code writer, and a reviewer. These are different cognitive modes, and mixing them degrades performance on all three.
The solution is to decompose. Multi-agent systems assign these roles to specialised agents coordinated by an orchestrator.

Figure 3: Multi-agent orchestration — the Orchestrator decomposes tasks and delegates to specialised sub-agents, each with their own context-optimised harness.
The vocabulary here matters:
- Orchestrator: A higher-level controller that decomposes tasks and delegates to sub-agents. It reasons about what needs to be done, not how.
- Sub-agent: An agent called by the orchestrator to handle a specific subtask. It has its own model and scaffold, reasons independently, and returns a result. It can itself spawn further sub-agents.
- Skill: A reusable, packaged bundle of context — tool descriptions, prompts, and instructions — that enables a specific capability. Where a tool is a function call, a skill is everything the agent needs to use that capability well.
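To make the tool-versus-skill distinction concrete, a skill can be modelled as a small bundle of instructions plus the tool schemas those instructions refer to. This is only one possible representation; the `Skill` dataclass and `attach_skills` helper are hypothetical, and the tool schema shape assumed here is the chat-completions function format.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    """A packaged capability: instructions plus the tools they rely on."""
    name: str
    instructions: str
    tools: list[dict] = field(default_factory=list)


def attach_skills(base_prompt: str, skills: list[Skill]) -> tuple[str, list[dict]]:
    """Merge skills into a system prompt and a de-duplicated tool list."""
    sections = [base_prompt]
    tools: list[dict] = []
    seen: set[str] = set()
    for skill in skills:
        sections.append(f"## Skill: {skill.name}\n{skill.instructions}")
        for tool in skill.tools:
            tool_name = tool["function"]["name"]
            # Two skills may share a tool; send each schema to the model once.
            if tool_name not in seen:
                seen.add(tool_name)
                tools.append(tool)
    return "\n\n".join(sections), tools


run_tests_tool = {"type": "function", "function": {"name": "run_tests", "parameters": {}}}
testing = Skill("testing", "Run the test suite after every code change.", [run_tests_tool])
debugging = Skill("debugging", "Reproduce the failure before changing code.", [run_tests_tool])
prompt, tools = attach_skills("You are a coding agent.", [testing, debugging])
```

An orchestrator could then hand each sub-agent only the skills its role needs, which keeps every context window small and focused.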
Here's a minimal LangGraph implementation of a planner-executor-reviewer multi-agent loop for a code generation task:
from typing import Annotated, TypedDict

from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage


# ─── Shared agent state ───────────────────────────────────────────────────────
class AgentState(TypedDict):
    messages: Annotated[list, add_messages]
    task: str
    plan: str
    code: str
    review_result: str
    iteration: int


# ─── Model instances (could use different models per role) ────────────────────
planner_llm = ChatOpenAI(model="gpt-4.1", temperature=0)
coder_llm = ChatOpenAI(model="gpt-4.1", temperature=0.1)
reviewer_llm = ChatOpenAI(model="gpt-4.1", temperature=0)


# ─── Planner sub-agent ────────────────────────────────────────────────────────
def planner_agent(state: AgentState) -> dict:
    """
    Decomposes the task into a step-by-step implementation plan.
    Does NOT write code, only plans. Returns a partial state update.
    """
    response = planner_llm.invoke([
        SystemMessage(content=(
            "You are a senior software architect. Your job is to create a detailed, "
            "step-by-step implementation plan. Do NOT write code. Output numbered steps only."
        )),
        HumanMessage(content=f"Task: {state['task']}")
    ])
    return {"plan": response.content}


# ─── Coder sub-agent ──────────────────────────────────────────────────────────
def coder_agent(state: AgentState) -> dict:
    """
    Implements the plan produced by the planner.
    Focuses purely on code quality and correctness.
    """
    response = coder_llm.invoke([
        SystemMessage(content=(
            "You are a senior Python engineer. Implement the following plan as clean, "
            "type-annotated Python code with docstrings. Include unit tests. "
            "Output only the code block."
        )),
        HumanMessage(content=(
            f"Task: {state['task']}\n\n"
            f"Implementation Plan:\n{state['plan']}\n\n"
            # review_result starts as "", so fall back to 'None' on the first pass
            f"Previous review feedback (if any): {state.get('review_result') or 'None'}"
        ))
    ])
    return {"code": response.content, "iteration": state.get("iteration", 0) + 1}


# ─── Reviewer sub-agent ───────────────────────────────────────────────────────
def reviewer_agent(state: AgentState) -> dict:
    """
    Reviews the generated code against quality criteria.
    Returns APPROVED or a list of specific issues to fix.
    """
    response = reviewer_llm.invoke([
        SystemMessage(content=(
            "You are a strict code reviewer. Check for: "
            "(1) correctness, (2) type annotations, (3) error handling, "
            "(4) test coverage, (5) no debug statements. "
            "If all criteria pass, respond with exactly 'APPROVED'. "
            "Otherwise, list specific issues to fix."
        )),
        HumanMessage(content=f"Code to review:\n\n{state['code']}")
    ])
    return {"review_result": response.content}


# ─── Routing logic ────────────────────────────────────────────────────────────
def should_continue(state: AgentState) -> str:
    """Route back to coder if review failed, or end if approved or max iterations hit."""
    # Strip whitespace so a trailing newline from the model does not force a retry
    if state["review_result"].strip() == "APPROVED":
        return "end"
    if state.get("iteration", 0) >= 3:
        return "end"  # Safety: max 3 revision loops
    return "coder"


# ─── Build the graph ──────────────────────────────────────────────────────────
workflow = StateGraph(AgentState)
workflow.add_node("planner", planner_agent)
workflow.add_node("coder", coder_agent)
workflow.add_node("reviewer", reviewer_agent)
workflow.set_entry_point("planner")
workflow.add_edge("planner", "coder")
workflow.add_edge("coder", "reviewer")
workflow.add_conditional_edges(
    "reviewer",
    should_continue,
    {"coder": "coder", "end": END}
)
agent = workflow.compile()

# ─── Run it ───────────────────────────────────────────────────────────────────
result = agent.invoke({
    "task": "Write a rate limiter class using a sliding window algorithm with Redis",
    "messages": [],
    "plan": "",
    "code": "",
    "review_result": "",
    "iteration": 0
})
print("Final Code:\n", result["code"])
print("Review Status:", result["review_result"])
The key design principle here is separation of cognitive concerns: the planner never writes code, the coder never reviews, and the reviewer never plans. Each sub-agent's context window is optimised for its specific role. This produces measurably better results than a single agent trying to do all three simultaneously.
Benchmarks & Real-World Evidence
The case for harness-first engineering is not theoretical. The evidence is accumulating in production:
Terminal Bench 2.0 (June 2026): Cline CLI running claude-opus-4.7 scored 74.2% vs. Anthropic's Claude Code scoring 69.4% on the same model. The difference is entirely attributable to harness design. Cline's newly open-sourced @cline/sdk makes this harness available for inspection — it's a four-layer TypeScript stack with native sub-agent support, CRON scheduling, checkpointing, and MCP connectors.
Viv Trivedy's team (O'Reilly): Moved a custom coding agent from the Top 30 to Top 5 on the same benchmark by changing only the harness. The model was unchanged. The improvements were: tighter AGENTS.md constraints, better tool descriptions, a typecheck backpressure signal, and a planner/executor split.
Uber (June 2026): Capped AI coding tool spending at $1,500/month per tool per engineer after blowing their annual AI budget in four months. At that cap, AI tool spend per engineer approaches ~11% of median annual compensation — a signal that organisations are extracting real productivity value, and that harness quality directly affects cost efficiency.
Rippling: Rebuilt every product AI-native in 6 months using LangGraph multi-agent patterns. Every agent in their stack uses a context-engineered harness rather than raw model calls.
Lyft: Built a self-serve AI agent platform for customer support using LangGraph. The platform handles multi-turn support conversations with tool access to booking systems, payment APIs, and account management — all orchestrated through careful harness design.
The pattern across these cases is consistent: the limiting factor isn't model capability, it's harness design.
Getting Started: Build Your First Context-Engineered Agent
Here is a concise checklist and minimal harness skeleton to get you from zero to a context-engineered agent:
Prerequisites:
- Python 3.11+
- pip install openai langchain-openai langgraph mcp sentence-transformers
- An OpenAI API key (or equivalent)
Your starter harness skeleton:
"""
minimal_harness.py — A context-engineered agent harness from scratch.
This implements the core loop: Plan → Execute → Observe → Decide → Repeat
with proper context management and feedback backpressure.
"""
import json
import subprocess
from typing import Any

from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from env

# ─── 1. Agent Configuration (your AGENTS.md equivalent) ──────────────────────
SYSTEM_PROMPT = """You are a Python coding assistant. You help write clean, tested, typed Python code.

CONSTRAINTS (all are hard rules; never violate them):
- All functions must have type annotations and docstrings
- Never include print() or debug statements in final code
- Always include unit tests alongside implementation code
- A task is complete only when all tests pass

TOOLS AVAILABLE:
- run_python: Execute Python code and return stdout/stderr
- read_file: Read a file by path
- write_file: Write content to a file

When you call a tool, use the function calling format. Be precise with file paths.
"""

# ─── 2. Tool Definitions (your MCP schemas, inline for simplicity) ────────────
# Only run_python is defined and executed in this minimal version; read_file and
# write_file from the prompt would need their own schemas and handlers.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "run_python",
            "description": (
                "Execute Python code in an isolated subprocess. "
                "Use this to test your implementations, run assertions, "
                "or verify outputs. Returns stdout and stderr."
            ),
            "parameters": {
                "type": "object",
                "properties": {
                    "code": {"type": "string", "description": "Python code to execute"}
                },
                "required": ["code"]
            }
        }
    }
]


# ─── 3. Tool Execution Layer (your harness execution) ─────────────────────────
def execute_tool(name: str, arguments: dict[str, Any]) -> str:
    """Execute a tool call and return the result as a string."""
    if name == "run_python":
        try:
            result = subprocess.run(
                ["python3", "-c", arguments["code"]],
                capture_output=True, text=True, timeout=10
            )
        except subprocess.TimeoutExpired:
            # Report the timeout to the model instead of crashing the loop
            return "Error: execution timed out after 10 seconds"
        return f"stdout: {result.stdout}\nstderr: {result.stderr}\nexit: {result.returncode}"
    return f"Error: Unknown tool {name}"


# ─── 4. The Agent Loop (harness + context management) ─────────────────────────
def run_agent(task: str, max_iterations: int = 10) -> str:
    """
    Core agent loop with context management.
    Implements: call model → handle tool use → feed results back → repeat.
    """
    # Short-term memory: starts with system prompt + user task
    messages: list[dict[str, Any]] = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task}
    ]
    for iteration in range(max_iterations):
        # ── Model call ──────────────────────────────────────────────────────
        response = client.chat.completions.create(
            model="gpt-4.1",
            messages=messages,
            tools=TOOLS,
            tool_choice="auto"
        )
        choice = response.choices[0]
        message = choice.message

        # ── Append model response to context (short-term memory update) ─────
        # Built explicitly so only fields valid as API input are sent back
        assistant_message: dict[str, Any] = {"role": "assistant", "content": message.content}
        if message.tool_calls:
            assistant_message["tool_calls"] = [tc.model_dump() for tc in message.tool_calls]
        messages.append(assistant_message)

        # ── Check stopping condition ──────────────────────────────────────
        # finish_reason lives on the choice, not on the message
        if choice.finish_reason == "stop" and not message.tool_calls:
            print(f"✅ Agent completed in {iteration + 1} iterations")
            return message.content or ""

        # ── Execute tool calls and feed results back into context ─────────
        if message.tool_calls:
            for tool_call in message.tool_calls:
                tool_name = tool_call.function.name
                tool_args = json.loads(tool_call.function.arguments)
                print(f"🔧 Calling tool: {tool_name}")
                result = execute_tool(tool_name, tool_args)
                # Inject tool result back into context window
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": result
                })
    return "Max iterations reached. Task incomplete."


# ─── 5. Run it ────────────────────────────────────────────────────────────────
if __name__ == "__main__":
    result = run_agent(
        "Write a Python function that implements binary search with full type annotations, "
        "a docstring, and pytest unit tests. Then run the tests to verify they pass."
    )
    print("\n📄 Final output:\n", result)
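The system prompt above advertises read_file and write_file, but the harness only defines and executes run_python. Here is a minimal sketch of how the two missing tools might look, assuming the agent should be confined to a single working directory. WORKSPACE, FILE_TOOLS, and execute_file_tool are hypothetical names introduced for illustration; they are not part of the original script.

```python
from pathlib import Path

# Hypothetical sandbox root; the original harness does not define one.
WORKSPACE = Path("agent_workspace").resolve()

# Schemas in the same shape as the run_python entry in TOOLS.
FILE_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a UTF-8 text file inside the workspace. Use before editing an existing file.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Path relative to the workspace"}
                },
                "required": ["path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "write_file",
            "description": "Create or overwrite a UTF-8 text file inside the workspace.",
            "parameters": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "Path relative to the workspace"},
                    "content": {"type": "string", "description": "Full file contents"},
                },
                "required": ["path", "content"],
            },
        },
    },
]


def _resolve(path: str) -> Path:
    """Resolve a workspace-relative path, rejecting anything outside the sandbox."""
    target = (WORKSPACE / path).resolve()
    if not target.is_relative_to(WORKSPACE):  # Python 3.9+
        raise ValueError(f"path escapes workspace: {path}")
    return target


def execute_file_tool(name: str, arguments: dict) -> str:
    """Run read_file or write_file and return a string result for the model."""
    try:
        target = _resolve(arguments["path"])
        if name == "read_file":
            return target.read_text(encoding="utf-8")
        if name == "write_file":
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(arguments["content"], encoding="utf-8")
            return f"Wrote {len(arguments['content'])} characters to {arguments['path']}"
        return f"Error: Unknown tool {name}"
    except (OSError, ValueError, KeyError) as exc:
        # Errors go back to the model as text so it can correct course.
        return f"Error: {exc}"
```

To wire these in, you would extend TOOLS with FILE_TOOLS and route the two names from execute_tool to execute_file_tool. Returning errors as strings, instead of raising, keeps failures inside the feedback loop.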
Your context engineering checklist:
- [ ] AGENTS.md written — at minimum: role identity, code standards, task completion criteria, escalation triggers
- [ ] Tool descriptions reviewed — each tool's description should constrain when and how to use it, not just what it does
- [ ] Context window policy defined — how many turns to retain, when to compact, what to summarise
- [ ] Long-term memory wired (for multi-session agents) — vector store or database for persistent context retrieval
- [ ] Feedback loops connected — test output, lint results, and type check results fed back into context
- [ ] Pre-commit / pre-execution hooks added — deterministic enforcement of your AGENTS.md constraints
- [ ] Observability instrumented — token counts, costs, and traces per run
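Of the items above, the context window policy is the one the minimal harness leaves entirely open: its message list grows on every iteration. Here is a sketch of one possible policy, assuming the same message-list format as the harness. It keeps the system prompt and the original task, keeps the most recent messages verbatim, and replaces the middle with a short note. compact_messages and keep_recent are hypothetical names, and a production harness would likely summarise the dropped span with a model call, not just count it.

```python
from typing import Any

Message = dict[str, Any]


def compact_messages(messages: list[Message], keep_recent: int = 6) -> list[Message]:
    """Keep the first two messages and roughly the last `keep_recent`, eliding the middle."""
    head_len = 2  # system prompt + original user task
    if len(messages) <= head_len + keep_recent:
        return messages  # nothing to compact yet

    start = len(messages) - keep_recent
    # A "tool" message is only valid directly after the assistant message that
    # requested it, so move the cut back until the tail starts on a non-tool message.
    while start > head_len and messages[start]["role"] == "tool":
        start -= 1
    if start <= head_len:
        return messages  # the cut reached the head; nothing can be dropped safely

    dropped = start - head_len
    note: Message = {
        "role": "user",
        "content": f"[harness note] {dropped} earlier messages were dropped to save context.",
    }
    return messages[:head_len] + [note] + messages[start:]
```

In the agent loop, this could run just before each model call, for example `messages = compact_messages(messages)`. The tail may end up slightly longer than keep_recent, because the cut never separates a tool result from the assistant message that requested it.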
Conclusion
Context engineering for AI agents is not a buzzword. It's the engineering discipline that determines whether your agent actually works in production — or confidently fails in circles, burning tokens and budget with nothing to show for it.
The core insight, worth repeating: a decent model with a great harness beats a great model with a bad harness. The benchmark data supports it: on Terminal Bench 2.0, the same model scored about five points higher under a better harness. The production case studies point the same way. The gap between what today's models can do and what most deployments see them do is largely a harness gap.
The discipline is maturing rapidly. HuggingFace launched a full certification course on it in June 2026. O'Reilly published practitioner-level guides on harness engineering. LangChain, Cline, and the major agent platforms are converging on shared vocabulary — scaffolding, harness, MCP, skills, sub-agents, ratchet patterns.
The engineers who master context engineering in the next six months will build AI systems that compound in capability with every failure, rather than systems that plateau the moment the model reaches its limits.
Where to go next:
- 📚 HuggingFace Context Engineering Course — Free, 6 units, certification available
- 🔧 Addy Osmani's Agent Harness Engineering — O'Reilly deep-dive
- 🏗️ LangGraph — Production multi-agent framework
- 🔌 MCP Python SDK — Official MCP server/client library
- 📊 LangSmith — Observability and tracing for agent harnesses
The models are good. Make the harness great. Your agents will be unstoppable.
Published on June 6, 2026