Andrej Karpathy called context engineering "the new prompt engineering." Everyone wrote a definition article. Nobody wrote about the actual infrastructure.
When I built the ESG Analytics Chatbot at Planet Sustech — a RAG system serving 10+ organizations with 95%+ query accuracy — the hardest engineering challenge wasn't the model. It wasn't the retrieval. It was figuring out what to put in the context window before every single LLM call, and why getting that wrong destroyed accuracy in ways that were nearly impossible to debug.
Context engineering is about one thing: building the right input to the model, every call, for every state of your application. It's not about writing better prompts. It's about the pipeline that assembles the prompt at runtime.
This is the post I wish existed when I was building that system.
Why This Is Harder Than It Sounds
Here's a simplified view of what goes into a single LLM call in a production RAG system:
```
[System Prompt] + [Retrieved Documents] + [Conversation History] + [User Query] + [Tool Schemas]
= Total Context
```
Each component competes for the same finite token budget. And each one has a different freshness, relevance, and cost profile.
The naive approach: include everything. Stuff the context window as full as you can.
The result: accuracy drops, costs spike, and you hit a phenomenon called "lost in the middle" — where the model pays disproportionately less attention to content in the middle of a long context. Research shows up to 24.2% accuracy degradation when relevant information is buried in the middle of a long context rather than placed at the beginning or end.
We saw this in the ESG chatbot. When an analyst asked about a specific company's water usage data, our retrieval was returning 8 relevant documents. But we were appending them in retrieval order, so the most relevant chunk was often sitting in positions 3–6. Accuracy on those queries was 15–20 points lower than queries where the best chunk landed first or last.
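One mitigation is to reorder chunks after retrieval so the strongest evidence sits at the edges of the context rather than the middle. A minimal sketch of that "sandwich" ordering, assuming each retrieved doc carries a `score` field from your retriever:

```python
def sandwich_order(docs: list[dict]) -> list[dict]:
    """Place the highest-scoring chunks at the start and end of the
    context, pushing weaker chunks toward the middle, to counter the
    'lost in the middle' effect."""
    ranked = sorted(docs, key=lambda d: d["score"], reverse=True)
    front, back = [], []
    for i, doc in enumerate(ranked):
        # Alternate placement: best chunk first, second-best last,
        # and so on inward, so the weakest chunks end up in the middle.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```

This is one heuristic among several; simply sorting best-first (as shown later in this post) is often enough when the model weights the start of the context heavily.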
The Four Layers of Agent Context
Before optimizing anything, it helps to think about context in layers:
Layer 1: Working Memory
The current task. User's query, the agent's current goal, any in-progress state. This is always present and always goes at the beginning of the context (primacy effect — models weight early content heavily).
Layer 2: Episodic Memory
Recent conversation history. Not all of it — a compressed or windowed version. For a chatbot, this is the last N turns. For a long-running agent, this is a summary of recent steps taken.
Layer 3: Semantic Memory
Retrieved knowledge — documents, database results, external data. This is what RAG injects. It's the most expensive layer to get right because it's query-dependent and changes every call.
Layer 4: Tool State
The results of previous tool calls within the current session. If your agent called search_jobs two steps ago, that result might still be relevant now.
Most systems treat these as a flat list appended in whatever order they're collected. Production systems need to treat them as a priority stack with a budget.
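That priority stack can be made concrete. A sketch, assuming each layer has already been rendered to a string, with a crude 4-characters-per-token heuristic standing in for a real tokenizer (in production you would count with the model's own tokenizer, e.g. tiktoken):

```python
def count_tokens(text: str) -> int:
    # Rough heuristic; swap in the model's tokenizer in production.
    return max(1, len(text) // 4)

def assemble_context(layers: list[tuple[str, str, int]]) -> str:
    """Fill the context from a priority-ordered stack of
    (name, content, budget) layers, truncating each layer to its
    own budget rather than letting any one layer crowd out the rest."""
    parts = []
    for name, content, budget in layers:
        if count_tokens(content) > budget:
            # Truncate to roughly `budget` tokens (4 chars/token heuristic).
            content = content[: budget * 4]
        parts.append(f"## {name}\n{content}")
    return "\n\n".join(parts)
```

The point is the shape, not the truncation strategy: every layer gets an explicit allocation, and overflow in one layer can never silently evict another.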
Building a Context Budget
In our ESG chatbot, we set an explicit token budget per call:
```python
CONTEXT_BUDGET = {
    "system_prompt": 800,     # Fixed — organizational context, instructions
    "working_memory": 400,    # Current query + task state
    "episodic_memory": 600,   # Recent conversation (compressed)
    "semantic_memory": 1800,  # Retrieved documents
    "tool_results": 400,      # Previous tool call outputs
    "response_reserve": 1000, # Leave room for model's response
    # Total: ~5000 tokens — fits in most model context windows with headroom
}
```
This forces you to make explicit tradeoffs instead of letting the context grow unbounded until you hit an error.
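One way to keep that tradeoff honest is a startup-time check that the full budget, response reserve included, fits the smallest context window you deploy against. A sketch (the 8,192-token window below is an assumption, not a recommendation):

```python
MODEL_CONTEXT_WINDOW = 8192  # assumption: smallest model window you target

CONTEXT_BUDGET = {
    "system_prompt": 800,
    "working_memory": 400,
    "episodic_memory": 600,
    "semantic_memory": 1800,
    "tool_results": 400,
    "response_reserve": 1000,
}

def check_budget(budget: dict, window: int) -> int:
    """Fail fast at startup if the budget cannot fit; return headroom."""
    total = sum(budget.values())
    if total > window:
        raise ValueError(f"Context budget {total} exceeds window {window}")
    return window - total
```

Running this once at boot turns a class of mid-conversation overflow errors into a single loud configuration error.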
When retrieved documents exceed the semantic memory budget, you don't just truncate — you re-rank and select:
```python
def build_semantic_context(query: str, retrieved_docs: list, budget: int) -> str:
    """
    Select and rank documents to fit within token budget.
    Prioritize by relevance score, place highest-scoring chunk first.
    """
    # Sort by relevance score descending
    ranked = sorted(retrieved_docs, key=lambda d: d["score"], reverse=True)
    selected = []
    used_tokens = 0
    for doc in ranked:
        doc_tokens = count_tokens(doc["content"])  # your tokenizer, e.g. tiktoken
        if used_tokens + doc_tokens > budget:
            break
        selected.append(doc)
        used_tokens += doc_tokens
    # Critical: place highest-relevance content first (primacy effect)
    return "\n\n".join(f"[Source: {d['source']}]\n{d['content']}" for d in selected)
```
This simple re-ranking step improved our query accuracy from the mid-70s to 95%+ on complex multi-document queries.
The KV-Cache Problem (and Why It's a Cost Issue, Not Just a Performance Issue)
If you're running in production at scale, KV-cache hit rate is one of the most important metrics you're probably not tracking.
When you make an LLM API call, the model can cache the key-value pairs computed for a prefix of your context. If your next call starts with the same prefix, it reuses the cache instead of recomputing — dramatically reducing latency and cost (cached tokens are typically 10x cheaper than prompt tokens on Anthropic's API).
The problem: if your system prompt or retrieved documents change slightly on every call, you get 0% cache hit rate. You pay full price every time.
Our ESG chatbot was rebuilding the organization's schema context from scratch on every call — even though that schema barely changed. Cache hit rate was near zero. When we moved the static organizational context to a fixed prefix that never changed, and only appended dynamic content after it, our cache hit rate jumped to ~65%.
```python
def build_context(org_schema: str, query: str, retrieved_docs: list) -> list[dict]:
    """
    Structure context so static content comes first (cacheable prefix).
    Dynamic content comes after (changes per request).
    """
    return [
        # STATIC — same for all queries in this org. Gets cached.
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "system", "content": f"Organization schema:\n{org_schema}"},
        # DYNAMIC — changes per query. Not cached.
        {"role": "user", "content": f"Retrieved context:\n{build_semantic_context(query, retrieved_docs, 1800)}"},
        {"role": "user", "content": query},
    ]
```
The rule: put what doesn't change first. Put what changes last.
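On Anthropic's API the cacheable prefix is opt-in: you mark the last static block with `cache_control`, and subsequent calls sharing that prefix read from cache. A sketch of the request shape (the model name and schema string are placeholders, and the system prompt here is illustrative):

```python
def build_cached_request(org_schema: str, user_query: str) -> dict:
    """Build an Anthropic Messages API payload whose static prefix
    (system prompt + org schema) is marked cacheable. Only the user
    turn varies between calls, so the prefix caches across requests."""
    return {
        "model": "claude-sonnet-4-5",  # placeholder model name
        "max_tokens": 1000,
        "system": [
            {"type": "text", "text": "You are an ESG analytics assistant."},
            {
                "type": "text",
                "text": f"Organization schema:\n{org_schema}",
                # Everything up to and including this block is cached.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        "messages": [{"role": "user", "content": user_query}],
    }
```

You can verify cache behavior per call via the `cache_read_input_tokens` field in the response usage, which is exactly the signal logged in the metrics section below.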
Episodic Memory: When You Can't Afford Full History
For multi-turn conversations, you can't keep appending the full history. At turn 20, your conversation history alone might be 5,000 tokens.
Two patterns that work in production:
Pattern 1: Rolling Window
Keep only the last N turns verbatim. Simple, predictable, no information loss on recent turns.
```python
def get_episodic_context(conversation_history: list, max_turns: int = 6) -> list:
    """Keep the most recent N turns."""
    return conversation_history[-max_turns * 2:]  # *2 for user+assistant pairs
```
Pattern 2: Hierarchical Summarization
For longer sessions, summarize older turns and keep recent ones verbatim.
```python
async def build_episodic_context(history: list, budget_tokens: int) -> str:
    # `llm` and `format_turns` are your own client and formatting helpers.
    recent_turns = history[-6:]  # Last 3 exchanges verbatim
    older_turns = history[:-6]
    if not older_turns:
        return format_turns(recent_turns)
    # Summarize everything older
    summary = await llm.summarize(
        f"Summarize this conversation context concisely:\n{format_turns(older_turns)}",
        max_tokens=200,
    )
    return (
        f"Earlier in this conversation: {summary}\n\n"
        f"Recent exchanges:\n{format_turns(recent_turns)}"
    )
```
This kept our ESG chatbot coherent across 30+ turn sessions while staying well under our token budget.
What to Measure
If you're not measuring these, you're flying blind:
```python
# Log this for every LLM call
context_metrics = {
    "total_tokens": count_tokens(full_context),
    "system_tokens": count_tokens(system_prompt),
    "retrieved_tokens": count_tokens(semantic_context),
    "history_tokens": count_tokens(episodic_context),
    "cache_hit": response.usage.cache_read_input_tokens > 0,  # Anthropic API
    "retrieval_count": len(retrieved_docs),
    "retrieval_top_score": retrieved_docs[0]["score"] if retrieved_docs else 0,
    "query_id": query_id,
    "org_id": org_id,
}
```
The metrics that predicted accuracy problems in our system:
- `retrieval_top_score < 0.7` → likely hallucination risk
- `retrieved_tokens > 2000` → "lost in the middle" risk on complex queries
- `cache_hit = False` on high-traffic routes → unnecessary cost
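Those thresholds are easy to turn into automated alerts on the per-call metrics dict. A sketch, with the cutoffs above hard-coded (tune them for your own system):

```python
def context_health_flags(metrics: dict) -> list[str]:
    """Return warning flags for a single call's context metrics,
    using the thresholds that predicted accuracy problems."""
    flags = []
    if metrics.get("retrieval_top_score", 0) < 0.7:
        flags.append("weak-retrieval: hallucination risk")
    if metrics.get("retrieved_tokens", 0) > 2000:
        flags.append("long-context: lost-in-the-middle risk")
    if not metrics.get("cache_hit", False):
        flags.append("cache-miss: unnecessary cost")
    return flags
```

Wire the output into whatever alerting you already have; the value is catching a drifting retrieval index or a broken cache prefix days before users notice accuracy drop.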
The Mental Model Shift
Prompt engineering asks: "What do I write in the prompt?"
Context engineering asks: "What information does the model need, in what order, with what budget, at this specific point in this specific conversation, for this specific user?"
The second question has an engineering answer. It involves data structures, caching strategies, budget allocation, and retrieval pipelines. It's not about being clever with words — it's about building infrastructure.
When we treated our context pipeline as an engineering artifact — with explicit budgets, priority layers, and metrics — our ESG chatbot went from inconsistent to consistent. 95%+ accuracy wasn't the result of a better model or better prompts. It was the result of better context architecture.
Conclusion
Context engineering is not a new concept, but the production implementation is genuinely underexplored. Most of what's written is definitional. The infrastructure details — budget allocation, cache optimization, memory layers, retrieval re-ranking — are things teams figure out on their own after expensive mistakes.
If you're building a production RAG system or a long-running agent and you're hitting accuracy walls or cost spikes, audit your context pipeline first. The answer is probably there.
What patterns have you found that work? Drop them in the comments — I'm genuinely curious what others are doing here.
I built the ESG Analytics Chatbot at Planet Sustech — a multi-tenant RAG system serving 10+ organizations — and these patterns came directly from debugging production failures in that system.
Tags: #ai #llm #rag #machinelearning #python