By Roman Rylko, CTO at Pynest
Prompt engineering helped us get early wins. But once you put agents into production—touching tickets, dashboards, ERPs—you quickly learn the real game is what goes into the model’s head each step, not just how you word the instruction. Anthropic calls this context engineering: curating and governing the exact tokens (facts, rules, history, tool outputs) the model sees at inference time. Their guidance frames context as a finite, valuable resource that must be budgeted and maintained, not a bottomless dump of text.
Below I’ll answer the journalist’s questions directly, then share patterns we use at Pynest, plus expert perspectives worth hearing.
Do we see a trend toward context engineering?
Yes—fast. Three signals:
- Vendors are productizing it. Anthropic’s engineering playbooks explicitly shift focus from clever prompts to repeatable context pipelines (retrieval, summarization, tool traces, policies). They also released guidance on writing tools and on Model Context Protocol (MCP) to wire agents to live data—both assume you’ll curate context, not improvise it.
- Practitioners are renaming the job. Andrej Karpathy puts it bluntly: “I really like the term ‘context engineering’ over prompt engineering… the art of providing all the context for the task.”
- Security folks are insisting on context control. Prompt-injection research (and the design patterns that mitigate it) boils down to one rule: be radically intentional about what you load, from where, and when. That’s context engineering by another name.
At Pynest, the inflection point came when we moved agents from “smart autocomplete” to doing things—opening incidents, filing expense anomalies, producing BI briefs. We had to decide: what gets into the window, who vouches for it, and how we prove provenance later.
Advantages vs. prompt engineering; do we still need both?
Think of prompt engineering as interface design (clear instructions, roles, refusal policies). It still matters. But context engineering is data and control plane design:
- Accuracy and explainability. Right-sized, verified context raises answer quality and makes “why” auditable. OpenAI’s own accuracy docs point to retrieval and context structuring as the main levers; Anthropic says the context is “critical but finite,” so you budget it like memory on an embedded device.
- Lower attack surface. Context minimization and typed sources blunt prompt-injection/confused-deputy risks. Simon Willison’s work is clear: models trust whatever tokens you feed; controlling those tokens is the defense.
- Operational stability. When context is a pipeline (not an ad-hoc paste), you can test it, cache it, diff it, set SLAs for retrieval, and ship rollbacks.
So yes, you still need crisp, testable prompts—but prompts without disciplined context are lipstick on a pager.
What CIOs and IT leaders should know
- Context is a budget. Treat the window as scarce. Decide up-front how many tokens you allocate to: policies, task frame, fresh facts (retrieved), prior turns, and tool traces. Then enforce it automatically (truncation rules, summarization tiers). Anthropic’s post is explicit about managing a finite resource. (A minimal allocator sketch follows this list.)
- Sources must be typed and signed. Don’t mix anonymous web clippings with authoritative ledgers in the same breath. Use adapters that label origin (e.g., “Jira issue #123,” “Snowflake query X”), include a timestamp, and attach a lightweight signature or hash so agents can’t be tricked by look-alikes. (MCP heads in this direction.)
- Minimize and stage. Only load what matters for this step. Plan-then-execute and context-minimization patterns are emerging best practice; they also happen to reduce token cost.
- Design for “trust fallbacks.” Karpathy urges keeping AI “on the leash”—we translate that into confidence thresholds, human-in-the-loop checkpoints, and a one-click read-only switch for agents when signals look off.
- Measure context quality, not just output quality. Track retrieval hit rate, duplication, staleness (age of facts), source coverage, and how often humans click through citations.
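To make the budget idea concrete, here is a minimal sketch of a slot-based allocator, written as a plain Python illustration rather than any vendor API. The slot names, the 4-characters-per-token heuristic, and the head-truncating `summarize` stand-in are all assumptions; a real pipeline would plug in an actual tokenizer and a cheap summarization model.

```python
from dataclasses import dataclass

def estimate_tokens(text: str) -> int:
    # Rough heuristic (~4 chars per token); swap in a real tokenizer in practice.
    return max(1, len(text) // 4)

@dataclass
class Slot:
    name: str        # "policy", "task", "facts", "history", "tools"
    text: str
    max_tokens: int  # hard cap for this slot
    droppable: bool  # may be collapsed to a summary under pressure

def summarize(text: str, budget: int) -> str:
    # Stand-in summarizer: keeps the head of the text within the budget.
    return text[: budget * 4]

def assemble_context(slots: list[Slot], total_budget: int) -> str:
    """Enforce per-slot caps, then shrink droppable slots (listed first,
    e.g. history) until the whole window fits the total budget."""
    for s in slots:
        if estimate_tokens(s.text) > s.max_tokens:
            s.text = summarize(s.text, s.max_tokens)

    def used() -> int:
        return sum(estimate_tokens(s.text) for s in slots)

    for s in slots:
        if used() <= total_budget:
            break
        if s.droppable:
            s.text = summarize(s.text, max(1, s.max_tokens // 4))

    return "\n\n".join(f"## {s.name}\n{s.text}" for s in slots if s.text)

window = assemble_context(
    [
        Slot("history", "prior turns, oldest first " * 400, 2000, True),
        Slot("tools",   "receipt: BI query, 42 rows, total=1.2M", 1000, False),
        Slot("facts",   "Jira issue #123, retrieved 2 minutes ago", 3000, False),
        Slot("policy",  "Read-only outside the finance scope.", 500, False),
        Slot("task",    "Draft the weekly spend-anomaly brief.", 500, False),
    ],
    total_budget=1500,
)
```

The useful property is that overflow behavior is decided up front, so it can be tested, diffed, and rolled back like any other pipeline step.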
How we do it at Pynest (field notes)
- Context contracts. Each agent has a YAML contract describing slots in its window: policy, task, facts, history, tools. A small allocator assigns token budgets; if we run out, history collapses first into summaries, then drops. This alone cut token spend ~18% and reduced regressions tied to “lost instructions.”
- Typed retrieval with freshness guards. Facts come from connectors that return (content, source_id, ts, checksum). If ts is stale or the checksum mismatches the index, the slot stays empty and the agent gets a “fact missing” state. We borrowed the spirit from Anthropic’s emphasis on curated inputs vs. indiscriminate stuffing. (A small guard sketch follows these notes.)
- Tool traces as context. After a tool call (e.g., a BI query), we add a short, structured receipt—inputs, rows, aggregates, limits—rather than dumping raw CSVs. Anthropic’s guidance on writing effective tools steered us here. (A sketch of such a receipt appears after the results below.)
- Defense-in-depth for prompt injection. We whitelist sources for “actionable” slots; anything from untrusted surfaces is sandboxed to analysis-only and run through a red-team pattern scanner. Willison’s taxonomy and patterns helped us move from vibes to guardrails.
- Context QA dashboards. We score conversations for evidence density (citations per 1k tokens), staleness, and novelty reuse (how often the same chunk shows up). When evidence density dips below a floor, the agent cannot execute—only draft.
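As promised, here is a small guard sketch in the spirit of those connectors. The record mirrors the (content, source_id, ts, checksum) tuple above; the 24-hour freshness window and SHA-256 hashing are illustrative assumptions.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE = timedelta(hours=24)  # illustrative freshness window

@dataclass
class Fact:
    content: str
    source_id: str   # e.g. "jira:123" or "snowflake:query-X"
    ts: datetime     # when the connector fetched the content (tz-aware)
    checksum: str    # hash recorded in the index at ingestion time

def admit_fact(fact: Fact, now: Optional[datetime] = None) -> Optional[str]:
    """Return a provenance-labeled snippet for the 'facts' slot, or None
    so the slot stays empty and the agent sees a 'fact missing' state."""
    now = now or datetime.now(timezone.utc)
    if now - fact.ts > MAX_AGE:
        return None  # stale: do not load it
    if hashlib.sha256(fact.content.encode()).hexdigest() != fact.checksum:
        return None  # content no longer matches what the index vouched for
    return f"[{fact.source_id} @ {fact.ts.isoformat()}] {fact.content}"
```

The empty-slot fallback matters as much as the happy path: a visibly missing fact is recoverable, while a stale one masquerading as fresh is not.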
Result: fewer “confident wrongs,” faster human reviews (people click sources rather than asking “where did this come from?”), and easier incident forensics because we can replay exactly what the model saw.
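To picture the “short, structured receipt” from the tool-traces note, here is one hypothetical shape; the field names and the example query are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ToolReceipt:
    """Compact summary of a tool call for the 'tools' context slot,
    instead of dumping the raw result into the window."""
    tool: str                                       # e.g. "bi_query"
    inputs: dict                                    # parameters the agent supplied
    row_count: int                                  # size of the raw result
    aggregates: dict = field(default_factory=dict)  # headline numbers only
    limits: str = ""                                # truncation / sampling caveats

    def render(self) -> str:
        agg = ", ".join(f"{k}={v}" for k, v in self.aggregates.items())
        return (f"[tool:{self.tool}] inputs={self.inputs} rows={self.row_count} "
                f"aggregates=({agg}) limits={self.limits or 'none'}")

receipt = ToolReceipt(
    tool="bi_query",
    inputs={"dashboard": "weekly_spend", "period": "last_7_days"},
    row_count=5412,
    aggregates={"total_spend": "1.2M", "anomalies": 3},
    limits="top 100 rows by amount",
)
# receipt.render() is what lands in the context, not the 5,412 rows.
```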
A few expert voices worth hearing
Anthropic engineering team: “Context is a critical but finite resource… curate and manage it deliberately.” Their patterns—chunking, summarization layers, and tool receipts—are a strong baseline for enterprise agents.
Andrej Karpathy: Context engineering better names the real skill—providing the right information so the task is “plausibly solvable.” It’s a cultural nudge: stop worshiping clever prompts; start shipping reliable inputs.
Simon Willison: Prompt injection remains unsolved. The safest bet is strict control of what gets into the window and when agents can act—exactly the remit of context engineering.
Quick answers for busy CIOs
Is there a trend toward context engineering?
Yes. Vendors, researchers, and practitioners are converging on it as the scalable way to run agents—treating context like a governed data plane, not a clipboard.
Advantages vs. prompt engineering; need both?
Prompt craft is table stakes. Context engineering delivers accuracy, safety, and operability—because you control inputs and provenance, not just phrasing. Keep both, prioritize context.
What should leaders know/do?
Create context budgets; type and sign sources; minimize and stage; add trust fallbacks; and measure context quality. If you can’t explain what the model saw, you can’t govern what it did.
Handy starter checklist (what we’d implement first)
- Define slots and budgets for agent context (policy/task/facts/history/tools). Make overflows predictable.
- Wire retrieval with provenance—every fact has an origin and timestamp; block actions without proof.
- Adopt two or three secure patterns (e.g., plan-then-execute, context minimization) and stick to them.
- Instrument context, not just outputs—track evidence density, staleness, duplication.
- Install a leash—confidence floors, action scopes, and a one-click read-only mode when signals drift. (A combined gating sketch follows this checklist.)
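As referenced in the last two checklist items, here is a sketch of how instrumentation and the leash can meet in a single gate. The floors (2 citations per 1k tokens, 0.7 confidence to execute, 0.4 before dropping to read-only) are illustrative assumptions, and “confidence” stands in for whatever signal you actually trust (evals, self-consistency checks, or a verifier model).

```python
def evidence_density(citations: int, context_tokens: int) -> float:
    """Citations per 1k tokens of assembled context."""
    return 1000.0 * citations / max(1, context_tokens)

def allowed_mode(
    citations: int,
    context_tokens: int,
    confidence: float,
    density_floor: float = 2.0,    # min citations per 1k tokens to act
    execute_floor: float = 0.7,    # min confidence to execute
    read_only_floor: float = 0.4,  # below this, stop acting entirely
) -> str:
    """Decide the agent's action scope for this step."""
    if confidence < read_only_floor:
        return "read_only"  # leash pulled: signals look off
    if confidence < execute_floor or evidence_density(citations, context_tokens) < density_floor:
        return "draft"      # produce output, but a human executes
    return "execute"

assert allowed_mode(citations=12, context_tokens=4000, confidence=0.85) == "execute"
assert allowed_mode(citations=2,  context_tokens=4000, confidence=0.85) == "draft"
assert allowed_mode(citations=12, context_tokens=4000, confidence=0.30) == "read_only"
```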
If you’re still spending most of your energy word-smithing prompts, you’re optimizing the front door while leaving the back door wide open. Close the loop. Engineer the context.