Context Engineering: The Discipline That Determines What Your LLM Actually Sees

#ai #architecture #machinelearning #python

Prompt engineering asks: how do I phrase this instruction? Context engineering asks: what information does the model need, in what form, in what order, and how much of it, to produce a correct answer?

For a long time, the implicit mental model was: give the LLM more context and it performs better. That model is wrong. A 20,000-token window stuffed with weakly relevant content often produces worse answers than a 4,000-token window with precisely curated information. Larger windows do not eliminate context quality problems; they amplify them.

The Context Window Is a Budget

Treat it as a budget with competing line items, not a container you fill. Start with the total window, subtract fixed allocations (system prompt, output reserve, safety margin), and what remains is your dynamic budget split across retrieved chunks, conversation history, and memory.
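The subtraction described above can be sketched with a small dataclass. This is a minimal illustration, not the full article's implementation: the class name `ContextBudget`, its field names, the token numbers, and the percentage split are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextBudget:
    """Token budget for one request. All numbers used below are illustrative."""
    total_window: int
    system_prompt: int
    output_reserve: int
    safety_margin: int

    @property
    def dynamic(self) -> int:
        """Tokens left once the fixed allocations are subtracted."""
        fixed = self.system_prompt + self.output_reserve + self.safety_margin
        remaining = self.total_window - fixed
        if remaining < 0:
            raise ValueError("fixed allocations exceed the total window")
        return remaining

    def split(self, percentages: dict[str, int]) -> dict[str, int]:
        """Divide the dynamic budget by integer percentages (sum <= 100)."""
        if sum(percentages.values()) > 100:
            raise ValueError("percentages must sum to at most 100")
        return {name: self.dynamic * pct // 100
                for name, pct in percentages.items()}


# Hypothetical 8,000-token window with 2,000 tokens of fixed allocations.
budget = ContextBudget(total_window=8000, system_prompt=600,
                       output_reserve=1000, safety_margin=400)
allocation = budget.split({"retrieved": 60, "history": 30, "memory": 10})
```

Integer percentages keep the arithmetic exact. How the dynamic budget should actually be divided depends on the application and is not prescribed here.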

The first question should always be: "can we get better at selecting less, rather than including more?"

Four Memory Types, Four Purposes

Episodic: conversation history. Highest priority for continuity. Grows without bound, so it needs compression.
Semantic: durable facts about the user (role, team, preferences). Compact, injected into the system prompt before retrieved content.
Procedural: reusable workflows and SOPs. Retrieved selectively when the query type matches.
Working: intermediate results within a single request (agentic loop output). Ephemeral and request-scoped.

Each type has different durability, update frequency, and token cost. Conflating them into a single undifferentiated store is the most common memory architecture mistake.
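One way to avoid a single undifferentiated store is to key every memory item by its type. The sketch below is an assumption-heavy illustration: `MemoryType`, `MemoryStore`, and their methods are hypothetical names, and a real system would likely back each type with different storage.

```python
from dataclasses import dataclass, field
from enum import Enum


class MemoryType(Enum):
    EPISODIC = "episodic"      # conversation history
    SEMANTIC = "semantic"      # durable facts about the user
    PROCEDURAL = "procedural"  # reusable workflows and SOPs
    WORKING = "working"        # request-scoped intermediate results


@dataclass
class MemoryStore:
    """Keeps each memory type in its own bucket instead of one shared list."""
    items: dict[MemoryType, list[str]] = field(
        default_factory=lambda: {kind: [] for kind in MemoryType}
    )

    def add(self, kind: MemoryType, text: str) -> None:
        self.items[kind].append(text)

    def get(self, kind: MemoryType) -> list[str]:
        # Return a copy so callers cannot mutate the store by accident.
        return list(self.items[kind])

    def end_request(self) -> None:
        """Working memory is ephemeral, so clear it when the request ends."""
        self.items[MemoryType.WORKING].clear()


store = MemoryStore()
store.add(MemoryType.SEMANTIC, "User is a data engineer on the platform team")
store.add(MemoryType.WORKING, "Step 1 result: 3 candidate tables found")
store.end_request()
```

After `end_request()`, the semantic fact is still available while the working-memory entry is gone, which reflects the different durability of the two types.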

Structured Injection Patterns

XML tags for section boundaries (<documents>, <user_context>, <instructions>): they give the model clear anchors for where each type of information begins and ends
Indexed documents: label chunks with indices so citations can be traced back to their source
Ordering matters: most relevant content first (primacy effect), user query last (recency effect)
A grounding instruction is not optional: tell the model explicitly to use only the provided context and to say so when that context is insufficient
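The four patterns above can be combined in one assembly function. The following is a rough sketch under stated assumptions: `build_prompt`, the `<query>` tag, the `index` attribute, and the wording of the grounding instruction are illustrative choices, not a fixed convention.

```python
from xml.sax.saxutils import escape


def build_prompt(instructions, user_context, chunks, query):
    """Assemble a prompt; `chunks` is a list of (score, text) pairs."""
    # Most relevant chunk first (primacy); indices make citations traceable.
    ranked = sorted(chunks, key=lambda c: c[0], reverse=True)
    docs = "\n".join(
        f'<document index="{i}">{escape(text)}</document>'
        for i, (_, text) in enumerate(ranked, start=1)
    )
    return (
        f"<instructions>\n{instructions}\n"
        "Answer using only the documents below and cite them by index. "
        "If they are insufficient, say so.\n</instructions>\n"
        f"<user_context>\n{escape(user_context)}\n</user_context>\n"
        f"<documents>\n{docs}\n</documents>\n"
        # The user query goes last (recency).
        f"<query>\n{escape(query)}\n</query>"
    )


prompt = build_prompt(
    instructions="You are an internal support assistant.",
    user_context="Role: analyst",
    chunks=[(0.4, "B & C"), (0.9, "A")],
    query="What is A?",
)
```

Escaping the chunk text keeps stray `<` or `&` characters in retrieved content from being mistaken for section boundaries.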

Lost-in-the-Middle

Models attend more strongly to content near the beginning and end of the context window. Information buried in the middle receives less attention. Mitigations:

Relevance-ordered injection (highest score first)
Sandwich pattern (critical content at both start and end)
Active relevance filtering (exclude low-scoring chunks even if they fit)
Smaller, tighter windows (fewer high-quality chunks beat more mediocre ones)
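Relevance filtering and the sandwich pattern can be sketched together. This is one possible reading of "critical content at both start and end": the strongest chunks alternate between the front and the back, so the weakest surviving chunks land in the middle. The function name, the `min_score` threshold, and the `max_chunks` cap are assumptions for illustration.

```python
def filter_and_sandwich(chunks, min_score=0.5, max_chunks=6):
    """Drop weak chunks, then place the strongest at both ends.

    `chunks` is a list of (score, text) pairs.
    """
    # Active relevance filtering: exclude low scorers even if they would fit.
    kept = sorted(
        (c for c in chunks if c[0] >= min_score),
        key=lambda c: c[0],
        reverse=True,
    )[:max_chunks]

    # Alternate ranks between the front and the back of the window.
    front, back = [], []
    for rank, chunk in enumerate(kept):
        (front if rank % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ordered = filter_and_sandwich(
    [(0.9, "a"), (0.3, "x"), (0.7, "c"), (0.8, "b"), (0.6, "d")]
)
order = [text for _, text in ordered]
```

Here the chunk scored 0.3 is excluded, the top-ranked chunk opens the window, and the second-ranked chunk closes it. Plain relevance-ordered injection is the simpler alternative: return `kept` as is.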

Conversation Compression

Left unchecked, a 100-turn conversation can consume the entire budget you meant to spend on retrieved context. Naive truncation, in turn, loses critical early constraints. Solutions:

Sliding window with pinned turns: critical turns (user constraints, decisions) are never truncated
Progressive summarization: compress old segments into 3-5 sentence summaries using a small, low-cost model such as Claude Haiku (summarization is a cheap, mechanical task)
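A sliding window with pinned turns might look like the sketch below. The `Turn` dataclass, `compress_history`, and the whitespace-based `count_tokens` are hypothetical stand-ins; a real system would use the model's tokenizer and could summarize the dropped turns instead of discarding them.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    text: str
    pinned: bool = False  # True for user constraints and decisions


def count_tokens(text: str) -> int:
    """Crude whitespace estimate; an assumption, not a real tokenizer."""
    return len(text.split())


def compress_history(turns, budget):
    """Keep every pinned turn, then fill the rest with the newest turns."""
    keep = {i for i, t in enumerate(turns) if t.pinned}
    remaining = budget - sum(count_tokens(turns[i].text) for i in keep)

    # Walk backwards from the newest turn so the window stays recent.
    for i in range(len(turns) - 1, -1, -1):
        if i in keep:
            continue
        cost = count_tokens(turns[i].text)
        if cost > remaining:
            break  # stop here so the unpinned window stays contiguous
        keep.add(i)
        remaining -= cost

    # Return survivors in their original chronological order.
    return [turns[i] for i in sorted(keep)]


history = [
    Turn("user", "Never suggest paid tools", pinned=True),
    Turn("assistant", "Understood"),
    Turn("user", "What about linters for Python"),
    Turn("assistant", "Here are three free options to consider"),
]
kept = compress_history(history, budget=12)
```

With a 12-token budget, the pinned constraint from the first turn survives even though older unpinned turns are dropped. If the pinned turns alone exceed the budget, this sketch still keeps them all and drops everything else.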

Context Assembly Is Testable

Unit test your assembly layer: the budget is respected, ordering is preserved, critical turns survive truncation, and no chunk is cut off mid-text. Assembly failures tend to leave a recognizable signature in RAGAS metrics: a drop in context precision points to noisy inclusion, and a drop in faithfulness points to contradictory context.
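Three of these checks can be written as plain assertion-based tests. The `assemble` function here is a deliberately tiny stand-in for a real assembly layer, and `count_tokens` is a whitespace approximation; both are assumptions made so the tests have something to run against.

```python
def count_tokens(text):
    """Whitespace token estimate, used only for this illustration."""
    return len(text.split())


def assemble(chunks, budget):
    """Add whole chunks in descending score order until the budget is spent."""
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > budget:
            continue  # skip the chunk rather than cut it in half
        selected.append((score, text))
        used += cost
    return selected


CHUNKS = [
    (0.9, "alpha beta gamma"),
    (0.5, "delta epsilon zeta eta"),
    (0.7, "theta iota"),
]
BUDGET = 5


def test_budget_compliance():
    total = sum(count_tokens(text) for _, text in assemble(CHUNKS, BUDGET))
    assert total <= BUDGET


def test_ordering_preserved():
    scores = [score for score, _ in assemble(CHUNKS, BUDGET)]
    assert scores == sorted(scores, reverse=True)


def test_no_mid_chunk_truncation():
    originals = {text for _, text in CHUNKS}
    assert all(text in originals for _, text in assemble(CHUNKS, BUDGET))
```

The functions follow the `test_` naming convention, so a runner such as pytest would collect them. A test that critical turns survive truncation would follow the same shape against the history-compression step.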

Read the Full Article

This is a summary of my deep dive into context engineering. The full article covers the complete discipline with production implementations:

👉 Context Engineering: The Discipline That Determines What Your LLM Actually Sees — Full Article

The full article includes:

Context window budget accounting with Python dataclasses
Four memory types with implementation patterns (episodic, semantic, procedural, working)
Working memory bridge from agentic retrieval loops
XML-structured injection with document indexing
Primacy/recency ordering strategy
Progressive summarization with critical turn pinning
Lost-in-the-middle mitigation (4 strategies with code)
Contradiction detection and resolution
Noise taxonomy (stale, tangential, redundant, over-retrieved)
Unit testing context assembly
AssemblyMetadata integration with RAGAS eval pipeline
RAGAS metric → assembly failure mapping table
Production checklist (19 items)