DEV Community

Cover image for Prompt Engineering Is Slowly Becoming Context Engineering
Sudarshan Gouda
Sudarshan Gouda

Posted on

Prompt Engineering Is Slowly Becoming Context Engineering

Shopify CEO Tobi Lütke posted on X:

"I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."

Hours later, Andrej Karpathy — former OpenAI researcher — replied:

"+1. In every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step."

Two of the most credible names in AI. Same day. Same conclusion.

This wasn't terminology drama.

It was a postmortem.


The Pipeline Nobody Drew

Everyone drew this:

User → Prompt → LLM → Response
Enter fullscreen mode Exit fullscreen mode

Production systems actually look like this:

User Query
  ↓
Memory Retrieval        ← what did this user/session do before?
  ↓
Semantic Search         ← which documents are actually relevant?
  ↓
Re-ranking              ← of those, which ones matter most?
  ↓
State Injection         ← what's the current task/workflow state?
  ↓
Tool Schema Loading     ← which tools should even be available here?
  ↓
History Compression     ← how do we fit 40k tokens of history into 2k?
  ↓
Context Assembly        ← put it all together, within token budget
  ↓
LLM
  ↓
Action / Response
Enter fullscreen mode Exit fullscreen mode

The prompt is one box in this pipeline.

Yet for two years, the entire industry obsessed over that one box.


Why This Mattered Less Than We Thought

Prompt engineering was genuinely useful when:

  • Tasks were single-turn
  • Context was short
  • Systems were stateless
  • One human + one model

But modern agents are:

  • Multi-turn by default
  • Stateful across sessions
  • Calling 20+ tools
  • Running as subagents inside larger orchestrations

At that point, the words in your prompt are not your bottleneck.

Your context pipeline is.


The Four Failure Modes Nobody Talks About

Drew Breunig's 2025 essay "Why context engineering matters" named the four ways context kills production systems:

1. Context Poisoning

Wrong information enters the window.

RAG retrieves:
  - doc from 2 product versions ago ✗
  - outdated pricing table ✗  
  - correct current policy ✓

Model confidently answers based on stale data.
Model isn't broken. Context is poisoned.
Enter fullscreen mode Exit fullscreen mode

How context engineering fixes it:

Add metadata to every document at index time:

{
  "content": "...",
  "version": "v4.2",
  "last_updated": "2025-11-01",
  "deprecated": false
}

At retrieval time, filter before ranking:
  → only fetch docs where deprecated = false
  → prefer docs with last_updated within 90 days
  → score recency as a ranking signal alongside semantic similarity

Result: stale docs never enter the window.
Enter fullscreen mode Exit fullscreen mode

2. Context Distraction

Too much noise drowns the signal.

You needed: 3 relevant chunks
You retrieved: 15 chunks

Model "has" the right answer somewhere.
But reasoning quality degrades with irrelevant tokens.

More context ≠ better context.
This is the biggest misconception in AI engineering right now.
Enter fullscreen mode Exit fullscreen mode

How context engineering fixes it:

Don't just retrieve — re-rank.

Step 1: Semantic search  → top 15 candidates
Step 2: Cross-encoder re-ranker → score each against the exact query
Step 3: Token budget check → keep top N that fit within budget
Step 4: Relevance threshold → drop anything below score 0.6

Before:  15 chunks, 6,000 tokens, noisy
After:   3 chunks, 1,200 tokens, precise

Less context. Better answers. Lower cost.
Enter fullscreen mode Exit fullscreen mode

3. Context Confusion

Contradictory information, same window.

Document A (written 2023): "Refund window: 30 days"
Document B (updated last week): "Refund window: 14 days"

Model hedges.
Or worse — picks the wrong one with full confidence.
Enter fullscreen mode Exit fullscreen mode

How context engineering fixes it:

Treat conflicts as a pipeline responsibility, not model responsibility.

Option 1 — Deduplication layer:
  → Before assembly, detect overlapping topics across chunks
  → Keep only the most recent version
  → Discard Document A entirely

Option 2 — Explicit conflict signal:
  → If conflict detected, inject a note into context:
     "NOTE: Policy updated 2025-10-28. Use Document B only."

Option 3 — Single source of truth:
  → For structured facts (prices, policies, dates)
  → Don't RAG at all — query a live DB directly
  → Inject the result as a verified fact block

Result: Model never sees contradictions.
Enter fullscreen mode Exit fullscreen mode

4. Context Clash

Conflicting instructions from different layers.

System prompt:     "Be concise."
User preference:   "Give detailed explanations."
RAG injection:     8,000 tokens of documentation
Tool results:      2,000 tokens of API output

The agent is now fighting itself.
Enter fullscreen mode Exit fullscreen mode

How context engineering fixes it:

Define a clear priority hierarchy at architecture level:

Priority 1 → System prompt (non-negotiable behavior rules)
Priority 2 → Task-specific instructions for this request
Priority 3 → User preferences (within allowed bounds)
Priority 4 → Retrieved context (supporting data only)
Priority 5 → Tool outputs (structured, labelled clearly)

And enforce a token budget per layer:

system_prompt    →  800 tokens  (hard cap)
task_instructions → 400 tokens
user_preferences →  200 tokens
retrieved_docs   → 2,000 tokens
tool_outputs     → 1,000 tokens
─────────────────────────────
total            → 4,400 tokens  ← predictable, no surprises

Result: Every layer knows its role. No layer overrides another.
Enter fullscreen mode Exit fullscreen mode

None of these are prompt problems.

All of them are architecture problems.


The Numbers That Should Scare You

LangChain State of Agent Engineering, 2025:

57% of organizations have AI agents in production.
32% cite quality as their #1 barrier.

Not capability. Not cost. Not latency.

Quality.

When they traced those failures back — the model was rarely the cause. The cause was context: wrong information, too much information, or stale information at decision time.


Microsoft + Salesforce joint study, 2025:

LLM accuracy drops by ~40% after just one back-and-forth in multi-turn conversation.

Not because the model degraded.

Because the context got messy.


Zylos Research, 2025–2026:

65% of enterprise AI failures were attributed to context drift — not context window exhaustion.

Context drift: the gradual corruption of an agent's working memory over long sessions.

The window was big enough. The contents became unreliable.


The Mental Model That Changes Everything

Karpathy put it best:

LLM          = CPU
Context Window = RAM
Enter fullscreen mode Exit fullscreen mode

You wouldn't blame a CPU for running poorly if you filled its RAM with garbage.

The CPU is powerful but blind — it executes on whatever's loaded.

Your job as an AI engineer isn't to talk to the CPU better.

Your job is to be the operating system.

Load the right data. Evict the irrelevant. Compress what's too big. Route what belongs to subprocesses.

That's context engineering.


A Real Production Diff

Same model. Same task. Two different approaches.

Old: Prompt Engineering

system_prompt = """
You are a helpful customer support agent.
Be professional and empathetic.
Resolve the customer's issue.
"""
Enter fullscreen mode Exit fullscreen mode

Result: Generic response. AI knows nothing about this customer.


New: Context Engineering

Before the model runs, a pipeline fires:

context = {
    # From CRM
    "customer_tier": "Premium",
    "lifetime_value": "$4,200",
    "prior_escalations_90d": 2,
    "crm_sentiment": "frustrated",

    # From Order DB
    "current_shipment_status": "delayed",
    "delay_days": 5,
    "delay_reason": "carrier_exception",

    # From Policy Engine
    "refund_eligible": True,
    "auto_approve_threshold": "$500",

    # From Memory
    "last_agent_promise": "expedited shipping on next order",

    # Compressed history (not raw logs)
    "session_summary": "2 prior contacts, unresolved shipping issue"
}
Enter fullscreen mode Exit fullscreen mode

Now the model produces:

"I can see this is your second shipping issue in three months, and the last agent promised expedited delivery — which didn't happen. I've already triggered a full refund and flagged your account for priority handling on all future orders."

The difference isn't the prompt.

The difference is the pipeline that ran before the prompt.


Context Has Four Operations

Anthropic's own multi-agent research system (documented in their engineering blog) is built around four primitives:

WRITE    → Persist state to external memory explicitly
SELECT   → Choose which retrieved chunks actually belong here
COMPRESS → Summarize history to fit within token budget  
ISOLATE  → Run subagents with minimal context to avoid interference
Enter fullscreen mode Exit fullscreen mode

Not one of these is about wording.

All of them are data engineering problems wearing an AI hat.


The Job Market Already Noticed

In 2023:

Everyone wanted prompt engineers.
Six-figure salaries. Viral LinkedIn posts. "No coding required."

By 2025:

Prompt engineer job postings collapsed.
Context/agent infrastructure roles exploded.

What production teams actually hire for now:

- Retrieval systems     (vector DBs, hybrid search, re-ranking)
- Memory architecture   (episodic vs semantic, when to retrieve vs regenerate)
- Agent orchestration   (LangGraph, CrewAI, state machines)
- Context compression   (summarization strategies, failure-driven optimization)
- Observability         (tracing context inputs → correlating with output quality)
Enter fullscreen mode Exit fullscreen mode

These skills didn't exist in the prompt engineering discourse.

They're what separates working agents from expensive demos.


The Uncomfortable Truth for Model Buyers

Datadog's 2026 State of AI Engineering report:

"Organizations that invest in context engineering — retrieval quality, summarization, deduplication, and clear information hierarchy — will close the gap between what long-context models allow and what production agents can reliably work with."

Translation:

Small model + clean context
  vs
Large model + noisy context

Winner: Small model.  (Often.)
Enter fullscreen mode Exit fullscreen mode

The bottleneck isn't model intelligence anymore.

It's context quality.

Teams buying bigger models to fix reliability problems are frequently solving the wrong problem.


The New Stack

This is what serious AI teams are actually building:

┌─────────────────────────────────┐
│         Context Layer           │
│  ┌──────────┐  ┌─────────────┐  │
│  │  Memory  │  │  Retrieval  │  │
│  │ Pipeline │  │  Pipeline   │  │
│  └──────────┘  └─────────────┘  │
│  ┌──────────┐  ┌─────────────┐  │
│  │   State  │  │ Compression │  │
│  │ Manager  │  │   Engine    │  │
│  └──────────┘  └─────────────┘  │
└────────────────┬────────────────┘
                 ↓
         ┌──────────────┐
         │  LLM + Prompt │  ← prompt lives here
         └──────────────┘
                 ↓
         ┌──────────────┐
         │    Action    │
         └──────────────┘
Enter fullscreen mode Exit fullscreen mode

Prompt engineering works at one layer.

Context engineering is the system around that layer.


The One Mental Model Shift

Prompt Engineering:

How should I talk to the AI?
Enter fullscreen mode Exit fullscreen mode

Context Engineering:

What does the AI need to know — and how do I build
a system that delivers exactly that, at exactly
the right time, within the right token budget?
Enter fullscreen mode Exit fullscreen mode

One is a communication problem.

The other is a software architecture problem.


Final Thought

Prompt engineering isn't dead.

But it's been demoted.

It's now one layer inside a larger discipline that requires retrieval systems, memory design, compression strategies, observability tooling, and proper state management.

The teams winning with AI in 2026 aren't the ones with the cleverest prompts.

They're the ones who treated the context window like what it actually is:

A resource to be engineered. Not a textarea to be filled.


References


Drop a comment: what's the worst context failure you've hit in production?

Top comments (0)