Shopify CEO Tobi Lütke posted on X:
"I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."
Hours later, Andrej Karpathy — former OpenAI researcher — replied:
"+1. In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step."
Two of the most credible names in AI. Same day. Same conclusion.
This wasn't terminology drama.
It was a postmortem.
The Pipeline Nobody Drew
Everyone drew this:
User → Prompt → LLM → Response
Production systems actually look like this:
User Query
↓
Memory Retrieval ← what did this user/session do before?
↓
Semantic Search ← which documents are actually relevant?
↓
Re-ranking ← of those, which ones matter most?
↓
State Injection ← what's the current task/workflow state?
↓
Tool Schema Loading ← which tools should even be available here?
↓
History Compression ← how do we fit 40k tokens of history into 2k?
↓
Context Assembly ← put it all together, within token budget
↓
LLM
↓
Action / Response
The prompt is one box in this pipeline.
Yet for two years, the entire industry obsessed over that one box.
Why This Mattered Less Than We Thought
Prompt engineering was genuinely useful when:
- tasks were single-turn
- context was short
- systems were stateless
- one human + one model
But modern agents are:
- multi-turn by default
- stateful across sessions
- calling 20+ tools
- running as subagents inside larger orchestrations
At that point, the words in your prompt are not your bottleneck.
Your context pipeline is.
The Four Failure Modes Nobody Talks About
Drew Breunig's 2025 essay "Why context engineering matters" named the four ways context kills production systems:
1. Context Poisoning
Wrong information enters the window.
RAG retrieves:
- doc from 2 product versions ago ✗
- outdated pricing table ✗
- correct current policy ✓
Model confidently answers based on stale data.
Model isn't broken. Context is poisoned.
How context engineering fixes it:
Add metadata to every document at index time:
{
"content": "...",
"version": "v4.2",
"last_updated": "2025-11-01",
"deprecated": false
}
At retrieval time, filter before ranking:
→ only fetch docs where deprecated = false
→ prefer docs with last_updated within 90 days
→ score recency as a ranking signal alongside semantic similarity
Result: stale docs never enter the window.
2. Context Distraction
Too much noise drowns the signal.
You needed: 3 relevant chunks
You retrieved: 15 chunks
Model "has" the right answer somewhere.
But reasoning quality degrades with irrelevant tokens.
More context ≠ better context.
This is the biggest misconception in AI engineering right now.
How context engineering fixes it:
Don't just retrieve — re-rank.
Step 1: Semantic search → top 15 candidates
Step 2: Cross-encoder re-ranker → score each against the exact query
Step 3: Token budget check → keep top N that fit within budget
Step 4: Relevance threshold → drop anything below score 0.6
Before: 15 chunks, 6,000 tokens, noisy
After: 3 chunks, 1,200 tokens, precise
Less context. Better answers. Lower cost.
3. Context Confusion
Contradictory information, same window.
Document A (written 2023): "Refund window: 30 days"
Document B (updated last week): "Refund window: 14 days"
Model hedges.
Or worse — picks the wrong one with full confidence.
How context engineering fixes it:
Treat conflicts as a pipeline responsibility, not model responsibility.
Option 1 — Deduplication layer:
→ Before assembly, detect overlapping topics across chunks
→ Keep only the most recent version
→ Discard Document A entirely
Option 2 — Explicit conflict signal:
→ If conflict detected, inject a note into context:
"NOTE: Policy updated 2025-10-28. Use Document B only."
Option 3 — Single source of truth:
→ For structured facts (prices, policies, dates)
→ Don't RAG at all — query a live DB directly
→ Inject the result as a verified fact block
Result: Model never sees contradictions.
4. Context Clash
Conflicting instructions from different layers.
System prompt: "Be concise."
User preference: "Give detailed explanations."
RAG injection: 8,000 tokens of documentation
Tool results: 2,000 tokens of API output
The agent is now fighting itself.
How context engineering fixes it:
Define a clear priority hierarchy at architecture level:
Priority 1 → System prompt (non-negotiable behavior rules)
Priority 2 → Task-specific instructions for this request
Priority 3 → User preferences (within allowed bounds)
Priority 4 → Retrieved context (supporting data only)
Priority 5 → Tool outputs (structured, labelled clearly)
And enforce a token budget per layer:
system_prompt → 800 tokens (hard cap)
task_instructions → 400 tokens
user_preferences → 200 tokens
retrieved_docs → 2,000 tokens
tool_outputs → 1,000 tokens
─────────────────────────────
total → 4,400 tokens ← predictable, no surprises
Result: Every layer knows its role. No layer overrides another.
None of these are prompt problems.
All of them are architecture problems.
The Root Cause Under All Four Failures
There's a pattern connecting all four.
Teams debug model behavior.
The real bug is almost always one of three broken contracts:
1. Retrieval Contract → what was selected, why, and from which version?
2. State Contract → what does the agent believe is current workflow state?
3. Tool-Result Contract → what did the external system return, and is it fresh?
Without these contracts, you're flying blind.
The model has no way to know if its context is stale, incomplete, or conflicting.
So it guesses.
And you blame the model.
Retrieval Contract
Most RAG pipelines return chunks.
They don't return why those chunks were selected.
❌ Without contract:
[chunk 1 text]
[chunk 2 text]
[chunk 3 text]
✓ With contract:
{
"chunk": "...",
"source": "refund-policy-v4.md",
"version": "v4",
"score": 0.91,
"retrieved_at": "2025-11-01T10:32Z",
"deprecated": false
}
Now the model knows:
- where this came from
- how confident the retrieval was
- whether it's current
Provenance is not optional metadata.
It's the difference between a fact and a guess.
State Contract
Agents fail at long tasks because they lose track of what's already happened.
❌ Without contract:
Agent retries a step it already completed.
Agent contradicts a decision it made 10 steps ago.
Agent asks for info it already retrieved.
✓ With contract:
{
"current_step": "awaiting_payment_confirmation",
"completed_steps": ["validate_order", "check_inventory", "reserve_stock"],
"failed_attempts": [{"step": "charge_card", "error": "timeout", "at": "10:31Z"}],
"pending_decisions": ["apply_discount?"]
}
The agent knows exactly where it is.
Not from memory. From an explicit state object that the pipeline maintains.
Tool-Result Contract
This is the one most teams skip entirely.
Tool outputs get injected raw into context.
❌ Without contract:
Tool returned: {"price": 4200, "currency": "USD"}
No timestamp. No scope. No filters. No freshness signal.
Model doesn't know if this is live data or a cached response from 3 hours ago.
✓ With contract:
{
"tool": "pricing_api",
"result": {"price": 4200, "currency": "USD"},
"fetched_at": "2025-11-01T10:33Z",
"cache_ttl_seconds": 300,
"freshness": "live",
"scope": "customer_id:8821, product_id:PRD-44",
"filters_applied": ["region:IN", "tier:premium"]
}
Now the model can reason about the data, not just use it.
It knows the price is live, scoped to this customer, and will be stale in 5 minutes.
And here's the important part most teams get wrong:
❌ Wrong approach:
Ask the model to summarize where its answer came from.
The model will invent a plausible-sounding source trail.
That's not provenance. That's the model's interpretation of execution.
✓ Right approach:
Generate provenance at the database/MCP layer — before the model sees it.
Attach it to the tool result as structured fields.
A complete provenance envelope looks like this:
{
"tool": "revenue_api",
"tool_version": "v2.1",
"source_system": "postgres-prod-replica",
"metric_definition": "mrr_v3", ← which formula was used
"tenant_scope": "org_id:441", ← whose data
"user_scope": "role:finance_read", ← what access level
"query_time": "2025-11-01T10:33Z",
"data_freshness": "replica_lag_~90s", ← not live, replica
"result_limit": 1000,
"redactions": ["pii_fields"],
"partial_result": false,
"audit_id": "qry_8821xz" ← traceable for review
}
This matters because wrong database answers are not always hallucinations.
"MRR is up 8%" ← confident, grounded, wrong
Why wrong?
- queried a replica with 90s lag ✗
- used metric_definition v2, not v3 ✗
- scoped to wrong tenant ✗
Without provenance: you debug the model.
With provenance: you see the actual failure in 30 seconds.
The model didn't hallucinate. The pipeline gave it the wrong source.
Provenance makes wrong answers debuggable. Without it, debugging becomes archaeology.
The principle behind all three:
The model should not have to guess whether context is fresh.
The pipeline should tell it.
This is the shift.
Prompt engineering asked the model to be smart enough to figure things out.
Context engineering gives the model what it needs to not have to figure things out.
The Numbers That Should Scare You
LangChain State of Agent Engineering, 2025:
57% of organizations have AI agents in production.
32% cite quality as their #1 barrier.
Not capability. Not cost. Not latency.
Quality.
When they traced those failures back — the model was rarely the cause. The cause was context: wrong information, too much information, or stale information at decision time.
Microsoft + Salesforce joint study, 2025:
LLM accuracy drops by ~40% after just one back-and-forth in multi-turn conversation.
Not because the model degraded.
Because the context got messy.
Zylos Research, 2025–2026:
65% of enterprise AI failures were attributed to context drift — not context window exhaustion.
Context drift: the gradual corruption of an agent's working memory over long sessions.
The window was big enough. The contents became unreliable.
The Mental Model That Changes Everything
Karpathy put it best:
LLM = CPU
Context Window = RAM
You wouldn't blame a CPU for running poorly if you filled its RAM with garbage.
The CPU is powerful but blind — it executes on whatever's loaded.
Your job as an AI engineer isn't to talk to the CPU better.
Your job is to be the operating system.
Load the right data. Evict the irrelevant. Compress what's too big. Route what belongs to subprocesses.
That's context engineering.
A Real Production Diff
Same model. Same task. Two different approaches.
Old: Prompt Engineering
system_prompt = """
You are a helpful customer support agent.
Be professional and empathetic.
Resolve the customer's issue.
"""
Result: Generic response. AI knows nothing about this customer.
New: Context Engineering
Before the model runs, a pipeline fires:
context = {
# From CRM
"customer_tier": "Premium",
"lifetime_value": "$4,200",
"prior_escalations_90d": 2,
"crm_sentiment": "frustrated",
# From Order DB
"current_shipment_status": "delayed",
"delay_days": 5,
"delay_reason": "carrier_exception",
# From Policy Engine
"refund_eligible": True,
"auto_approve_threshold": "$500",
# From Memory
"last_agent_promise": "expedited shipping on next order",
# Compressed history (not raw logs)
"session_summary": "2 prior contacts, unresolved shipping issue"
}
Now the model produces:
"I can see this is your second shipping issue in three months, and the last agent promised expedited delivery — which didn't happen. I've already triggered a full refund and flagged your account for priority handling on all future orders."
The difference isn't the prompt.
The difference is the pipeline that ran before the prompt.
Context Has Four Operations
Anthropic's own multi-agent research system (documented in their engineering blog) is built around four primitives:
WRITE → Persist state to external memory explicitly
SELECT → Choose which retrieved chunks actually belong here
COMPRESS → Summarize history to fit within token budget
ISOLATE → Run subagents with minimal context to avoid interference
Not one of these is about wording.
All of them are data engineering problems wearing an AI hat.
The Job Market Already Noticed
In 2023:
Everyone wanted prompt engineers.
Six-figure salaries. Viral LinkedIn posts. "No coding required."
By 2025:
Prompt engineer job postings collapsed.
Context/agent infrastructure roles exploded.
What production teams actually hire for now:
- Retrieval systems (vector DBs, hybrid search, re-ranking)
- Memory architecture (episodic vs semantic, when to retrieve vs regenerate)
- Agent orchestration (LangGraph, CrewAI, state machines)
- Context compression (summarization strategies, failure-driven optimization)
- Observability (tracing context inputs → correlating with output quality)
These skills didn't exist in the prompt engineering discourse.
They're what separates working agents from expensive demos.
The Uncomfortable Truth for Model Buyers
Datadog's 2026 State of AI Engineering report:
"Organizations that invest in context engineering — retrieval quality, summarization, deduplication, and clear information hierarchy — will close the gap between what long-context models allow and what production agents can reliably work with."
Translation:
Small model + clean context
vs
Large model + noisy context
Winner: Small model. (Often.)
The bottleneck isn't model intelligence anymore.
It's context quality.
Teams buying bigger models to fix reliability problems are frequently solving the wrong problem.
The New Stack
This is what serious AI teams are actually building:
┌─────────────────────────────────┐
│ Context Layer │
│ ┌──────────┐ ┌─────────────┐ │
│ │ Memory │ │ Retrieval │ │
│ │ Pipeline │ │ Pipeline │ │
│ └──────────┘ └─────────────┘ │
│ ┌──────────┐ ┌─────────────┐ │
│ │ State │ │ Compression │ │
│ │ Manager │ │ Engine │ │
│ └──────────┘ └─────────────┘ │
└────────────────┬────────────────┘
↓
┌──────────────┐
│ LLM + Prompt │ ← prompt lives here
└──────────────┘
↓
┌──────────────┐
│ Action │
└──────────────┘
Prompt engineering works at one layer.
Context engineering is the system around that layer.
The One Mental Model Shift
Prompt Engineering:
How should I talk to the AI?
Context Engineering:
What does the AI need to know — and how do I build
a system that delivers exactly that, at exactly
the right time, within the right token budget?
One is a communication problem.
The other is a software architecture problem.
Final Thought
Prompt engineering isn't dead dead.
But it's been demoted.
It's now one layer inside a larger discipline that requires retrieval systems, memory design, compression strategies, observability tooling, and proper state management.
The teams winning with AI in 2026 aren't the ones with the cleverest prompts.
They're the ones who treated the context window like what it actually is:
A resource to be engineered. Not a textarea to be filled.
References
Andrej Karpathy on X — context engineering quote (June 25, 2025)
https://x.com/karpathy/status/1937902205765607626Drew Breunig — "Why the term context engineering matters" (2025)
https://www.dbreunig.com/2025/07/24/why-the-term-context-engineering-matters.htmlLangChain — State of Agent Engineering Report (2025)
https://langchain-ai.github.io/langgraph/Zylos Research — AI Agent Context Compression (2026)
https://zylos.ai/research/2026-02-28-ai-agent-context-compression-strategiesDatadog — State of AI Engineering (2026)
https://www.datadoghq.com/state-of-ai-engineering/Anthropic Engineering — How we built our multi-agent research system
https://www.anthropic.com/engineeringMicrosoft + Salesforce — LLMs Get Lost in Multi-Turn Conversation (2025)
Survey of Context Engineering for LLMs — arXiv:2507.13334 (2025)
Top comments (1)
Some comments may only be visible to logged-in visitors. Sign in to view all comments.