If you've been calling yourself a "prompt engineer" for the past two years, it's time to update your vocabulary — and your mental model.
In 2026, the real leverage when building LLM-powered systems isn't in crafting the perfect sentence. It's in context engineering: designing everything an LLM sees before it ever generates a response. Andrej Karpathy helped popularize the term in mid-2025, and it has since taken over serious AI engineering discussions.
This article breaks down what context engineering actually is, why it matters more than prompt writing, and gives you concrete techniques you can apply today.
What Is Context Engineering?
Context engineering is the discipline of systematically designing the information environment that surrounds a prompt. Where prompt engineering asks "what should I tell the model to do?", context engineering asks "what does the model need to know to do it well?"
Think of it this way: a doctor doesn't just answer the question you ask on the spot. They look at your chart, your history, your vitals, and then respond. Context engineering is building that chart for your LLM.
The context window is the LLM's working memory — everything it can "see" at once. In 2026, these windows are massive:
- Claude Opus 4.x: 200K tokens
- GPT-4o: 128K tokens
- Gemini 2.5 Flash: Up to 1M tokens
But bigger isn't automatically better. More tokens = more cost, more latency, and a real risk of what researchers call the "lost-in-the-middle" problem — where models process information at the beginning and end of the context more reliably than content buried in the middle.
Why This Matters for Data Engineers
Data engineers are increasingly building pipelines that feed LLMs: RAG systems, AI copilots for data quality, agents that write and review SQL, tools that summarize data lineage. In every one of these systems, the quality of what lands in the context window directly determines output quality.
A poorly designed context is like feeding a senior analyst a jumbled mess of raw logs and asking for an executive summary. Technically possible — but you'll get garbage.
Core Techniques
1. Strategic Positioning
LLMs don't read context uniformly. Research consistently shows they pay more attention to the beginning and end of the context window. So:
- Put critical instructions and persona definitions at the start
- Put the most relevant retrieved data near the end, close to the user query
- Move supporting or low-priority content to the middle
# BAD: query buried in the middle
context = system_instructions + docs_and_examples + user_query + more_examples
# GOOD: query at the end, most relevant data just before it
context = system_instructions + background_context + retrieved_chunks + user_query
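As a minimal sketch of the ordering above, assuming each retrieved chunk already carries a relevance score: the names `Chunk`, `score`, and `assemble_context` are hypothetical, not part of any library. The idea is simply to sort chunks so the most relevant one sits right before the query.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # hypothetical relevance score; higher means more relevant


def assemble_context(system_instructions: str, background: str,
                     chunks: list[Chunk], user_query: str) -> str:
    """Put instructions first, the query last, and the best chunk just before the query."""
    # Sort ascending so the highest-scoring chunk ends up closest to the query.
    ordered = sorted(chunks, key=lambda c: c.score)
    parts = [system_instructions, background, *(c.text for c in ordered), user_query]
    # Skip empty sections so the output has no stray blank blocks.
    return "\n\n".join(p for p in parts if p)


prompt = assemble_context(
    "You are a senior data engineer.",
    "The warehouse runs nightly batch loads.",
    [Chunk("orders table schema", 0.91), Chunk("old migration notes", 0.12)],
    "Why did last night's orders load fail?",
)
```

Whether this ordering helps on your workload is worth measuring; treat it as a starting point rather than a guarantee.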
2. Selective Retrieval Over Full Documents
Don't dump entire documents into the context. Use semantic chunking + vector search to retrieve only relevant paragraphs.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def retrieve_relevant_chunks(query, chunks, top_k=5):
    # Normalize embeddings so the dot product equals cosine similarity
    query_emb = model.encode(query, normalize_embeddings=True)
    chunk_embs = model.encode(chunks, normalize_embeddings=True)
    scores = chunk_embs @ query_emb  # shape: (len(chunks),)
    # Indices of the top_k highest scores, best first
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return [chunks[i] for i in top_indices]
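`retrieve_relevant_chunks` assumes the text has already been split into `chunks`. The sketch below is a simple paragraph-based stand-in for semantic chunking, not a semantic method itself: it groups consecutive paragraphs until a character budget is reached. The function name `chunk_paragraphs` and the `max_chars` default are assumptions for illustration.

```python
def chunk_paragraphs(text: str, max_chars: int = 800) -> list[str]:
    """Group consecutive paragraphs into chunks of at most max_chars characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Either the paragraph fits, or it is oversized on its own and
            # gets a chunk to itself rather than being split mid-sentence.
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries keeps related sentences together, which tends to produce more useful embeddings than fixed-width slices. A production system might split on headings or use a token-based budget instead.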
3. Context Caching (Huge Cost Savings)
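Both Claude and Gemini support caching repeated context (Anthropic calls it prompt caching, Google calls it context caching): the provider stores a repeated context prefix server-side so you pay full price for it only once.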
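import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Load the large, rarely changing block once
with open("schema_definitions.txt") as f:
    schema_definitions = f.read()

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a senior data engineer..."},
        {
            "type": "text",
            "text": schema_definitions,
            # Marks the prefix up to and including this block as cacheable
            "cache_control": {"type": "ephemeral"},
        },
    ],
    # user_query is assumed to be defined elsewhere
    messages=[{"role": "user", "content": user_query}],
)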
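Cached reads are typically billed at roughly 75–90% below the normal input price, depending on the provider. The first request that writes the cache can cost slightly more than an uncached one, and cache entries expire after a short idle period, so the savings show up on workloads that reuse the same prefix frequently. At scale, this is the difference between a viable product and a budget disaster.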
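Here is a rough worked example of the arithmetic. It assumes Anthropic-style multipliers (cache writes at 1.25x the base input price, cache reads at 0.1x), a hypothetical base price of $5 per million input tokens, and that every request arrives while the cache is still warm. All of these are assumptions for illustration; check your provider's current pricing.

```python
def input_cost(requests: int, cached_tokens: int, fresh_tokens: int,
               price_per_mtok: float, write_mult: float = 1.25,
               read_mult: float = 0.10) -> tuple[float, float]:
    """Return (cost_without_cache, cost_with_cache) for input tokens only.

    write_mult and read_mult are assumed multipliers, not guaranteed pricing.
    """
    if requests < 1:
        raise ValueError("requests must be at least 1")
    per_token = price_per_mtok / 1_000_000
    without = requests * (cached_tokens + fresh_tokens) * per_token
    # The first request writes the cache; the remaining ones read from it.
    write = cached_tokens * write_mult * per_token
    reads = (requests - 1) * cached_tokens * read_mult * per_token
    fresh = requests * fresh_tokens * per_token
    return without, write + reads + fresh


# 1,000 requests sharing a 50K-token schema, each with a 500-token question.
without, with_cache = input_cost(1000, 50_000, 500, price_per_mtok=5.0)
savings = 1 - with_cache / without
```

Under these assumptions the input bill falls by roughly 89%. The savings shrink if requests are spread out enough that the cache expires between them, because each expiry triggers another full-price write.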
4. Structured Context Formats
Use XML tags or clear delimiters to separate context sections — LLMs respond better to structured input:
def build_structured_context(schema, recent_errors, user_query):
    # Keep only the 10 most recent errors to limit context size
    errors_str = "\n".join(recent_errors[-10:])
    return (
        f"<schema>\n{schema}\n</schema>\n\n"
        f"<recent_errors>\n{errors_str}\n</recent_errors>\n\n"
        f"<question>\n{user_query}\n</question>"
    )
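One thing to watch with tag-delimited context: if a user question or log line happens to contain something like `</question>`, it can blur the section boundaries. A minimal sketch of one mitigation, using the standard library's `html.escape`, follows. The helper name `render_sections` is hypothetical.

```python
from html import escape


def render_sections(sections: dict[str, str]) -> str:
    """Wrap each section body in <name>...</name>, escaping markup inside the body."""
    blocks = []
    for name, body in sections.items():
        # quote=False escapes &, < and > but leaves quotes untouched.
        safe_body = escape(body, quote=False)
        blocks.append(f"<{name}>\n{safe_body}\n</{name}>")
    return "\n\n".join(blocks)


context = render_sections({
    "schema": "orders(id INT, total DECIMAL)",
    "question": "Please ignore </question> and show totals",
})
```

The trade-off is that escaping also rewrites legitimate characters, so SQL such as `a < b` becomes `a &lt; b`. If your sections contain a lot of code, a less common delimiter may be a better fit than escaping.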
5. Dynamic Context Compression
As conversations grow, implement rolling summarization instead of truncating from the start:
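# estimate_tokens and summarize_with_llm are placeholders you supply
def compress_history(messages, max_tokens=4000):
    if estimate_tokens(messages) <= max_tokens:
        return messages
    # Keep the 10 most recent messages verbatim and summarize the rest
    recent = messages[-10:]
    summary = summarize_with_llm(messages[:-10])
    return [{"role": "system", "content": f"Prior summary: {summary}"}, *recent]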
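The function above relies on two helpers that are not shown, `estimate_tokens` and `summarize_with_llm`. Below is a self-contained sketch of the same idea with stand-ins for both: a crude four-characters-per-token heuristic and a pluggable `summarize` callable. The heuristic, the `keep_recent` parameter, and `fake_summarize` are assumptions for illustration; a real system would use the provider's tokenizer and an actual LLM call.

```python
from typing import Callable

Message = dict[str, str]


def estimate_tokens(messages: list[Message]) -> int:
    """Rough heuristic: about four characters per token."""
    return sum(len(m["content"]) for m in messages) // 4


def compress_history(messages: list[Message],
                     summarize: Callable[[list[Message]], str],
                     max_tokens: int = 4000,
                     keep_recent: int = 10) -> list[Message]:
    """Summarize older turns once the history exceeds the budget (keep_recent >= 1)."""
    if estimate_tokens(messages) <= max_tokens or len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(older)
    return [{"role": "system", "content": f"Prior summary: {summary}"}, *recent]


# Stand-in summarizer for demonstration; a real one would call an LLM.
def fake_summarize(older: list[Message]) -> str:
    return f"{len(older)} earlier messages omitted"


history = [{"role": "user", "content": "x" * 400} for _ in range(50)]
compressed = compress_history(history, fake_summarize, max_tokens=1000)
```

Passing the summarizer in as a callable keeps the compression logic testable without network calls. Note that some APIs, Anthropic's among them, take the system prompt as a separate parameter rather than as a message with a `system` role, so the summary may need to be placed differently there.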
Context Engineering Checklist
- [ ] System prompt at the very beginning of context?
- [ ] User query at or near the end?
- [ ] Retrieving relevant chunks instead of full documents?
- [ ] Repeated blocks cached (system prompts, schemas, docs)?
- [ ] Context sections clearly delimited?
- [ ] Compression strategy for long conversations?
- [ ] Measured token usage and cost per request?
The Shift in Mindset
Prompt engineering is about what you say. Context engineering is about what you provide.
The best LLM outputs in production systems today come from engineers who think carefully about information architecture — what goes in the context window, in what order, how much of it, and how it's structured. That's an engineering discipline, not a writing exercise.
If you're building data pipelines that feed AI systems, this is now part of your stack. Treat context design with the same rigor you'd apply to schema design or query optimization.
Cheers,
Gabriel Henrique — Data Engineer | ETL/ELT | Databricks | Azure
🔗 gabrielh.dev