Shopify CEO Tobi Lütke posted on X:
"I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."
Hours later, Andrej Karpathy — former OpenAI researcher — replied:
"+1. In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step."
Two of the most credible names in AI. Same day. Same conclusion.
This wasn't terminology drama.
It was a postmortem.
The Pipeline Nobody Drew
Everyone drew this:
User → Prompt → LLM → Response
Production systems actually look like this:
User Query
↓
Memory Retrieval ← what did this user/session do before?
↓
Semantic Search ← which documents are actually relevant?
↓
Re-ranking ← of those, which ones matter most?
↓
State Injection ← what's the current task/workflow state?
↓
Tool Schema Loading ← which tools should even be available here?
↓
History Compression ← how do we fit 40k tokens of history into 2k?
↓
Context Assembly ← put it all together, within token budget
↓
LLM
↓
Action / Response
The prompt is one box in this pipeline.
Yet for two years, the entire industry obsessed over that one box.
Why This Mattered Less Than We Thought
Prompt engineering was genuinely useful when:
- Tasks were single-turn
- Context was short
- Systems were stateless
- One human + one model
But modern agents are:
- Multi-turn by default
- Stateful across sessions
- Calling 20+ tools
- Running as subagents inside larger orchestrations
At that point, the words in your prompt are not your bottleneck.
Your context pipeline is.
The Four Failure Modes Nobody Talks About
Drew Breunig's 2025 essay "Why context engineering matters" named the four ways context kills production systems:
1. Context Poisoning
Wrong information enters the window.
RAG retrieves:
- doc from 2 product versions ago ✗
- outdated pricing table ✗
- correct current policy ✓
Model confidently answers based on stale data.
Model isn't broken. Context is poisoned.
How context engineering fixes it:
Add metadata to every document at index time:
{
"content": "...",
"version": "v4.2",
"last_updated": "2025-11-01",
"deprecated": false
}
At retrieval time, filter before ranking:
→ only fetch docs where deprecated = false
→ prefer docs with last_updated within 90 days
→ score recency as a ranking signal alongside semantic similarity
Result: stale docs never enter the window.
2. Context Distraction
Too much noise drowns the signal.
You needed: 3 relevant chunks
You retrieved: 15 chunks
Model "has" the right answer somewhere.
But reasoning quality degrades with irrelevant tokens.
More context ≠ better context.
This is the biggest misconception in AI engineering right now.
How context engineering fixes it:
Don't just retrieve — re-rank.
Step 1: Semantic search → top 15 candidates
Step 2: Cross-encoder re-ranker → score each against the exact query
Step 3: Token budget check → keep top N that fit within budget
Step 4: Relevance threshold → drop anything below score 0.6
Before: 15 chunks, 6,000 tokens, noisy
After: 3 chunks, 1,200 tokens, precise
Less context. Better answers. Lower cost.
3. Context Confusion
Contradictory information, same window.
Document A (written 2023): "Refund window: 30 days"
Document B (updated last week): "Refund window: 14 days"
Model hedges.
Or worse — picks the wrong one with full confidence.
How context engineering fixes it:
Treat conflicts as a pipeline responsibility, not model responsibility.
Option 1 — Deduplication layer:
→ Before assembly, detect overlapping topics across chunks
→ Keep only the most recent version
→ Discard Document A entirely
Option 2 — Explicit conflict signal:
→ If conflict detected, inject a note into context:
"NOTE: Policy updated 2025-10-28. Use Document B only."
Option 3 — Single source of truth:
→ For structured facts (prices, policies, dates)
→ Don't RAG at all — query a live DB directly
→ Inject the result as a verified fact block
Result: Model never sees contradictions.
4. Context Clash
Conflicting instructions from different layers.
System prompt: "Be concise."
User preference: "Give detailed explanations."
RAG injection: 8,000 tokens of documentation
Tool results: 2,000 tokens of API output
The agent is now fighting itself.
How context engineering fixes it:
Define a clear priority hierarchy at architecture level:
Priority 1 → System prompt (non-negotiable behavior rules)
Priority 2 → Task-specific instructions for this request
Priority 3 → User preferences (within allowed bounds)
Priority 4 → Retrieved context (supporting data only)
Priority 5 → Tool outputs (structured, labelled clearly)
And enforce a token budget per layer:
system_prompt → 800 tokens (hard cap)
task_instructions → 400 tokens
user_preferences → 200 tokens
retrieved_docs → 2,000 tokens
tool_outputs → 1,000 tokens
─────────────────────────────
total → 4,400 tokens ← predictable, no surprises
Result: Every layer knows its role. No layer overrides another.
None of these are prompt problems.
All of them are architecture problems.
The Numbers That Should Scare You
LangChain State of Agent Engineering, 2025:
57% of organizations have AI agents in production.
32% cite quality as their #1 barrier.
Not capability. Not cost. Not latency.
Quality.
When they traced those failures back — the model was rarely the cause. The cause was context: wrong information, too much information, or stale information at decision time.
Microsoft + Salesforce joint study, 2025:
LLM accuracy drops by ~40% after just one back-and-forth in multi-turn conversation.
Not because the model degraded.
Because the context got messy.
Zylos Research, 2025–2026:
65% of enterprise AI failures were attributed to context drift — not context window exhaustion.
Context drift: the gradual corruption of an agent's working memory over long sessions.
The window was big enough. The contents became unreliable.
The Mental Model That Changes Everything
Karpathy put it best:
LLM = CPU
Context Window = RAM
You wouldn't blame a CPU for running poorly if you filled its RAM with garbage.
The CPU is powerful but blind — it executes on whatever's loaded.
Your job as an AI engineer isn't to talk to the CPU better.
Your job is to be the operating system.
Load the right data. Evict the irrelevant. Compress what's too big. Route what belongs to subprocesses.
That's context engineering.
A Real Production Diff
Same model. Same task. Two different approaches.
Old: Prompt Engineering
system_prompt = """
You are a helpful customer support agent.
Be professional and empathetic.
Resolve the customer's issue.
"""
Result: Generic response. AI knows nothing about this customer.
New: Context Engineering
Before the model runs, a pipeline fires:
context = {
# From CRM
"customer_tier": "Premium",
"lifetime_value": "$4,200",
"prior_escalations_90d": 2,
"crm_sentiment": "frustrated",
# From Order DB
"current_shipment_status": "delayed",
"delay_days": 5,
"delay_reason": "carrier_exception",
# From Policy Engine
"refund_eligible": True,
"auto_approve_threshold": "$500",
# From Memory
"last_agent_promise": "expedited shipping on next order",
# Compressed history (not raw logs)
"session_summary": "2 prior contacts, unresolved shipping issue"
}
Now the model produces:
"I can see this is your second shipping issue in three months, and the last agent promised expedited delivery — which didn't happen. I've already triggered a full refund and flagged your account for priority handling on all future orders."
The difference isn't the prompt.
The difference is the pipeline that ran before the prompt.
Context Has Four Operations
Anthropic's own multi-agent research system (documented in their engineering blog) is built around four primitives:
WRITE → Persist state to external memory explicitly
SELECT → Choose which retrieved chunks actually belong here
COMPRESS → Summarize history to fit within token budget
ISOLATE → Run subagents with minimal context to avoid interference
Not one of these is about wording.
All of them are data engineering problems wearing an AI hat.
The Job Market Already Noticed
In 2023:
Everyone wanted prompt engineers.
Six-figure salaries. Viral LinkedIn posts. "No coding required."
By 2025:
Prompt engineer job postings collapsed.
Context/agent infrastructure roles exploded.
What production teams actually hire for now:
- Retrieval systems (vector DBs, hybrid search, re-ranking)
- Memory architecture (episodic vs semantic, when to retrieve vs regenerate)
- Agent orchestration (LangGraph, CrewAI, state machines)
- Context compression (summarization strategies, failure-driven optimization)
- Observability (tracing context inputs → correlating with output quality)
These skills didn't exist in the prompt engineering discourse.
They're what separates working agents from expensive demos.
The Uncomfortable Truth for Model Buyers
Datadog's 2026 State of AI Engineering report:
"Organizations that invest in context engineering — retrieval quality, summarization, deduplication, and clear information hierarchy — will close the gap between what long-context models allow and what production agents can reliably work with."
Translation:
Small model + clean context
vs
Large model + noisy context
Winner: Small model. (Often.)
The bottleneck isn't model intelligence anymore.
It's context quality.
Teams buying bigger models to fix reliability problems are frequently solving the wrong problem.
The New Stack
This is what serious AI teams are actually building:
┌─────────────────────────────────┐
│ Context Layer │
│ ┌──────────┐ ┌─────────────┐ │
│ │ Memory │ │ Retrieval │ │
│ │ Pipeline │ │ Pipeline │ │
│ └──────────┘ └─────────────┘ │
│ ┌──────────┐ ┌─────────────┐ │
│ │ State │ │ Compression │ │
│ │ Manager │ │ Engine │ │
│ └──────────┘ └─────────────┘ │
└────────────────┬────────────────┘
↓
┌──────────────┐
│ LLM + Prompt │ ← prompt lives here
└──────────────┘
↓
┌──────────────┐
│ Action │
└──────────────┘
Prompt engineering works at one layer.
Context engineering is the system around that layer.
The One Mental Model Shift
Prompt Engineering:
How should I talk to the AI?
Context Engineering:
What does the AI need to know — and how do I build
a system that delivers exactly that, at exactly
the right time, within the right token budget?
One is a communication problem.
The other is a software architecture problem.
Final Thought
Prompt engineering isn't dead.
But it's been demoted.
It's now one layer inside a larger discipline that requires retrieval systems, memory design, compression strategies, observability tooling, and proper state management.
The teams winning with AI in 2026 aren't the ones with the cleverest prompts.
They're the ones who treated the context window like what it actually is:
A resource to be engineered. Not a textarea to be filled.
References
Andrej Karpathy on X — context engineering quote (June 25, 2025)
https://x.com/karpathy/status/1937902205765607626Drew Breunig — "Why the term context engineering matters" (2025)
https://www.dbreunig.com/2025/07/24/why-the-term-context-engineering-matters.htmlLangChain — State of Agent Engineering Report (2025)
https://langchain-ai.github.io/langgraph/Zylos Research — AI Agent Context Compression (2026)
https://zylos.ai/research/2026-02-28-ai-agent-context-compression-strategiesDatadog — State of AI Engineering (2026)
https://www.datadoghq.com/state-of-ai-engineering/Anthropic Engineering — How we built our multi-agent research system
https://www.anthropic.com/engineeringMicrosoft + Salesforce — LLMs Get Lost in Multi-Turn Conversation (2025)
Survey of Context Engineering for LLMs — arXiv:2507.13334 (2025)
Drop a comment: what's the worst context failure you've hit in production?
Top comments (0)