DEV Community

marsa adam
marsa adam

Posted on

Context Engineering Is the Skill That Actually Ships Reliable AI Agents

Prompt engineering is what you learn first. Context engineering is what you need when you're actually trying to ship something.

Here's the distinction that took me too long to understand.


What Prompt Engineering Gets Right (and Where It Stops)

Prompt engineering is the craft of writing clear instructions. It matters. A well-constructed prompt reduces ambiguity, sets the right tone, and gives the model enough information to complete a task.

But prompt engineering operates on a single input. It doesn't answer:

  • What happens when the model is on turn 12 and the conversation history is 4000 tokens?
  • What happens when you retrieve 6 documents but only 3 fit in the context window?
  • What happens when the system prompt's constraints contradict something injected by the retrieval pipeline?
  • What happens when an agent calls a tool with parameters it invented?

These are not prompt problems. They're architecture problems. And they're what production AI systems actually fail on.


What Context Engineering Is

Context engineering is the practice of deliberately designing everything the model sees when it generates a response — not just the current prompt, but the entire context: system instructions, retrieved data, conversation history, tool schemas, injected state, and output format guidance.

The core insight: context is a finite, expensive resource that directly determines output quality. Managing it deliberately — rather than letting it accumulate passively — is the difference between a demo and a system that runs at scale.

The term is relatively new. Andrej Karpathy started using it in 2025 to describe what serious agent builders were already doing without a name for it. It's now the most useful framing I know for thinking about LLM system design.


The Four Layers You Have to Design

A reliable AI agent context has four layers. When any of them is designed carelessly, you get unpredictable outputs.

Layer 1: System Layer

This is your role definition, rules, and constraints. Most developers write this as a paragraph of instructions. The production version writes it as a contract:

You are a [role] operating under these constraints: [list].
When [condition A] occurs, always [behavior X].
When [condition B] occurs, always [behavior Y].
If you cannot satisfy the task within these constraints, respond with: [specific fallback].
Output format: [exact specification].
Enter fullscreen mode Exit fullscreen mode

The "if you cannot satisfy" clause is the one most people leave out. It's also the one that prevents your agent from improvising when it should be escalating.

Layer 2: Memory Layer

Memory is what persists across turns. There are four types:

Type What it stores How to implement
In-context Recent turns, working state Direct injection, managed truncation
Episodic Past sessions, events External store, retrieved on relevance
Semantic Facts, knowledge, preferences Vector store or knowledge graph
Procedural How to do tasks Prompt templates, tool definitions

Most agent frameworks handle in-context memory automatically (badly). The other three require explicit design decisions.

The most common failure: in-context memory grows unbounded until it crowds out the system prompt and RAG context. Fix: enforce a token budget and summarize aggressively.

Layer 3: Task Layer

The task layer is your current goal, scoped tightly for this turn. The mistake here is making the task too broad. "Help the user with their request" is not a task layer. "Extract all date mentions from the following document and return them as ISO-8601 strings" is.

Tighter task scoping → more consistent outputs → easier evaluation.

Layer 4: Output Layer

Specify the exact format the model should produce. Not "in JSON format" — the exact schema. Not "clearly and concisely" — the word count range, the heading structure, what to include and what to explicitly exclude.

An output layer specification also includes a quality gate: what makes a valid output? What should the model say if it can't produce a valid one?


The Five Most Common Production Failures (and Their Context Engineering Fixes)

1. Context Bloat

Symptom: Agent works reliably for 5 turns, degrades after 10.
Root cause: Conversation history growing without a budget.
Fix: Set a token budget in code. When history approaches the limit, summarize the oldest turns into a compressed episodic record. Inject the summary; drop the raw turns.

2. Tool Hallucination

Symptom: Agent calls a tool with parameters it invented, or calls the wrong tool for a task.
Root cause: Vague tool descriptions. The model fills gaps with plausible-sounding values.
Fix: Write tool descriptions with explicit anti-conditions. "Do NOT call this tool when [condition]" is as important as "Call this tool when [condition]." Specify the exact input schema, not just the field names.

3. Retrieval Miss (RAG)

Symptom: You retrieved the right document. The model still gave the wrong answer.
Root cause: Not a retrieval problem — an injection problem. Chunk format, chunk size, position in context, and source metadata all affect how well the model uses retrieved content.
Fix: Use a consistent chunk injection format with source metadata before the content. "SOURCE: [id] [relevance score] | [content]" consistently outperforms raw content injection. Position RAG context immediately before the task instruction, not after.

4. Instruction Drift

Symptom: The system prompt's constraints are followed at the start of a session, ignored by turn 8.
Root cause: Attention dilution. As context length grows, the model's effective attention to early tokens decreases.
Fix: Re-inject critical constraints into the task layer, not just the system layer. For long-running agents, include a "constraint re-injection block" every N turns.

5. Silent Failure

Symptom: Agent produces output. Output looks plausible. Output is wrong. No error was signaled.
Root cause: No post-generation evaluation step.
Fix: For high-stakes tasks, add a second LLM call that evaluates the first response for groundedness, format compliance, and stated confidence. This is not expensive — it's a targeted evaluator, not a general review. The cost is worth it.


The Attention Budget You're Not Managing

Every context window has a finite attention budget. Attention is not uniformly distributed — models attend more strongly to the beginning and end of a context, and to tokens that are structurally prominent (headers, code blocks, explicit formatting).

This has architectural implications:

  • Put your most critical constraints at the beginning of the system prompt, not buried in paragraph 4
  • Put your task specification immediately before the expected output in the prompt, not pages before
  • Use explicit structure (numbered lists, labeled sections) for anything that must be reliably attended to
  • Budget token counts across your layers explicitly: 20% reserved for output, 30% system+task, 50% retrieval+history is a reasonable starting point

A Minimal Context Engineering Template (Copy and Adapt)

Here's the system prompt scaffold I use as a starting point for most agent architectures:

## Role
You are a [role]. You [primary capability]. You do NOT [explicit exclusion].

## Operating Constraints
- [Constraint 1]
- [Constraint 2]
- [Constraint 3]

## Behavior Rules
- When [condition A]: [behavior X]
- When [condition B]: [behavior Y]
- If you cannot satisfy the task within these constraints: [specific fallback — do not improvise]

## Output Format
[Exact specification: structure, length, fields, schema]

## Quality Gate
Your response is valid only if: [explicit criteria]
If your response does not meet these criteria, output: "QUALITY_GATE_FAIL: [reason]"

## Memory Injection
[Injected episodic summary if applicable]
[Injected user preferences if applicable]

## Current Task
[Injected at runtime — scoped, specific, bounded]

## Retrieved Context
[RAG chunks injected here, formatted as: SOURCE: [id] [score] | [content]]
Enter fullscreen mode Exit fullscreen mode

This is a scaffold, not a prescription. Adapt section names and content to your agent type. The structural discipline — explicit roles, explicit constraints, explicit fallbacks, explicit quality gates — is what matters.


What to Read Next

If you want to go deeper on any specific layer:

  • Memory architecture: the episodic + semantic combination is the least-covered topic in the public literature. The production pattern is: vector store for retrieval + external episodic log for long-session continuity.
  • RAG evaluation: most teams measure retrieval accuracy but not injection quality. Chunk format, chunk size, and context position all affect outcome and are independently testable.
  • Multi-agent context sharing: the blackboard architecture (agents read/write from a shared state file rather than passing messages) is underused and solves a lot of coordination problems.

I documented the full framework — all four layers, 13 copy-paste templates, 10 failure modes with specific fixes — in a 35-page practitioner's guide.

Context Engineering for AI Agents — Practitioner's Guide

Framework-agnostic. Works with GPT-4o, Claude, and Gemini. $39.


Discussion

What production context failure have you hit that I didn't cover here?

Specifically: the failure mode where everything looks right on the surface but the system is silently degrading. Those are the interesting ones.


This article documents production patterns, not benchmarks. No performance numbers are claimed. All templates are starting points — adapt them to your specific agent architecture and evaluate with your own data.



Top comments (0)