A coding agent does not usually fail because it cannot write code. It fails because it writes too soon.
It opens a few files, guesses the architecture, edits the wrong seam, runs a narrow test, and returns a confident summary. The pull request may even look clean. Then you find the real damage later: a broken tenant boundary, a missed migration, a hidden side effect, or a test that passed because it never touched the risky path.
The fix is not a longer prompt. It is a context engineering workflow that forces the agent to collect evidence before it edits.
For AI app builders, solo developers, and small product teams, this matters more than it sounds. AI coding tools are getting faster, agent frameworks are improving, and repo-scale assistants are moving from demos into daily work. Speed is no longer the scarce resource. Trust is.
This guide shows how to design a practical pre-edit context layer for coding agents: repo maps, local indexes, retrieved decisions, impact analysis, test discovery, and verification receipts.
The goal is simple: make the agent prove it understands the codebase before it changes the codebase.
Why coding agents need context engineering
Most teams treat context as a chat problem:
- Add a better system prompt.
- Paste a longer issue description.
- Point the agent at `README.md`.
- Ask it to “inspect the code first.”
That helps, but it is not enough for production work.
A coding agent needs a repeatable way to answer these questions before editing:
- What files are probably involved?
- What symbols, routes, schemas, jobs, and tests connect to this change?
- What previous decisions or gotchas matter?
- What evidence would prove the change worked?
- What risks should slow or block the edit?
Without this, agents burn tokens rediscovering the same repo shape over and over. Worse, they rely on partial evidence. A few text matches become an architecture model. A passing unit test becomes a release signal. A prompt instruction becomes a substitute for real code inspection.
Context engineering turns that loose behavior into a workflow.
The recent signal: agents are moving toward evidence layers
Current AI developer tooling is pointing in the same direction: agents need structured evidence, not just larger windows.
Recent signals include local code intelligence tools that expose symbols and references, memory tools that reduce repeated exploration, monitors that track context windows and cost, and review agents that require exact file-line evidence. The pattern is clear: teams are no longer satisfied with “the agent seemed smart.” They want evidence before edits, proof after edits, and readable receipts during review.
What context engineering means for coding agents
Context engineering is the design of what an AI system sees, when it sees it, and how it proves that the context is relevant.
For coding agents, it has five layers.
| Layer | Purpose | Example evidence |
|---|---|---|
| Task context | Defines the work | issue, user story, acceptance criteria, non-goals |
| Repo context | Shows code structure | files, symbols, routes, schemas, dependencies |
| Memory context | Recalls prior decisions | ADRs, past fixes, migration notes, gotchas |
| Risk context | Highlights danger zones | auth, billing, tenant isolation, deletion, PII |
| Verification context | Proves the outcome | tests, lint, typecheck, traces, logs |
A good agent workflow does not dump all of this into the prompt. That creates noise. Instead, it retrieves the smallest useful slice at each stage.
Think of it as a pipeline:
task brief
-> repo search
-> symbol/reference lookup
-> impact analysis
-> memory retrieval
-> plan
-> edit
-> verification
-> review receipt
The key is order. Evidence comes before the plan. The plan comes before the edit. Verification comes before the summary.
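One way to make that order enforceable is to model the stages as an ordered list and refuse any jump ahead. The following is a minimal TypeScript sketch, not a feature of any particular agent framework: the stage names mirror the pipeline above, and `canAdvance` is a hypothetical helper.

```typescript
// Ordered stages, mirroring the pipeline above.
const STAGES = [
  "task_brief",
  "repo_search",
  "symbol_lookup",
  "impact_analysis",
  "memory_retrieval",
  "plan",
  "edit",
  "verification",
  "review_receipt",
] as const;

type Stage = (typeof STAGES)[number];

// Hypothetical guard: a stage may start only when every earlier stage is complete.
function canAdvance(completed: ReadonlySet<Stage>, next: Stage): boolean {
  const index = STAGES.indexOf(next);
  return STAGES.slice(0, index).every((stage) => completed.has(stage));
}

const done = new Set<Stage>(["task_brief", "repo_search"]);
console.log(canAdvance(done, "symbol_lookup")); // true
console.log(canAdvance(done, "edit")); // false
```

A guard like this is most useful in the tool layer: the edit tool can call it and refuse to run, so the order does not depend on the agent remembering an instruction.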
The hidden failure mode: confident partial context
The most dangerous coding-agent failure is not an obvious crash. It is confident partial context.
You see it when the agent says:
- “I found the relevant file” after reading one route handler.
- “No tests need changes” after searching only one folder.
- “This is safe” without checking downstream callers.
- “The bug is fixed” after testing the happy path.
The output looks professional. The summary is crisp. But the agent never built a complete enough map of the change.
This is especially risky in AI app codebases because small edits often cross boundaries:
- A prompt template change affects evaluation results.
- A tool schema change breaks an agent workflow.
- A retrieval filter change leaks tenant data.
- A model fallback change breaks structured output validation.
- A cache key change creates stale or cross-user answers.
- A background job change doubles token spend.
The agent needs to see these connections before it starts typing.
A practical pre-edit routine
Use a pre-edit routine for any agent task that touches production code, data, auth, billing, integrations, or AI behavior.
1. Restate the task and non-goals.
2. Identify likely files and symbols.
3. Find references and callers.
4. Identify tests and missing tests.
5. Retrieve relevant memory or decisions.
6. Name risks and assumptions.
7. Propose an edit plan with validation commands.
8. Wait for approval or continue only if risk is low.
You can give this routine to an agent as a policy, but it works better when backed by tools.
For example, a repo-aware agent can run:
repo_status
search_code("usage metering webhook")
get_definition("recordUsage")
get_references("recordUsage")
impact_analysis("recordUsage")
find_tests_for_change("usage metering webhook")
plan_change("add idempotency to usage webhook")
The exact tool names do not matter. The behavior does.
The agent should not move from “search” to “edit” until it can explain primary files, related files, expected side effects, validation commands, confidence level, and known gaps.
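That readiness rule can be expressed as a small check. This is a sketch under assumptions: the `PreEditReport` shape and the `missingEvidence` helper are hypothetical, and the file path in the example is invented for illustration. `null` means "not assessed yet", while an empty list means "assessed, nothing found".

```typescript
// Hypothetical shape for what the agent must explain before editing.
interface PreEditReport {
  primaryFiles: string[];
  relatedFiles: string[] | null;
  expectedSideEffects: string[] | null;
  validationCommands: string[];
  confidence: "low" | "medium" | "high" | null;
  knownGaps: string[] | null;
}

// Returns the names of the fields the agent has not filled in yet.
function missingEvidence(report: PreEditReport): string[] {
  const missing: string[] = [];
  if (report.primaryFiles.length === 0) missing.push("primaryFiles");
  if (report.relatedFiles === null) missing.push("relatedFiles");
  if (report.expectedSideEffects === null) missing.push("expectedSideEffects");
  if (report.validationCommands.length === 0) missing.push("validationCommands");
  if (report.confidence === null) missing.push("confidence");
  if (report.knownGaps === null) missing.push("knownGaps");
  return missing;
}

const draft: PreEditReport = {
  primaryFiles: ["apps/api/src/billing/usage-webhook.ts"], // hypothetical path
  relatedFiles: null,
  expectedSideEffects: ["duplicate events are ignored"],
  validationCommands: ["pnpm test"],
  confidence: "medium",
  knownGaps: null,
};

console.log(missingEvidence(draft).join(", ")); // relatedFiles, knownGaps
```

The agent moves from search to edit only when the returned list is empty.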
Example: a context packet for an AI feature change
Imagine you are changing an AI support agent so it can escalate billing questions to a human.
A weak prompt says:
Add human escalation for billing questions in the support agent.
A better context packet says:
task: Add human escalation for billing questions in the support agent.
intent: Billing conversations should create an escalation ticket instead of giving account-specific billing advice.
non_goals:
- Do not change pricing logic.
- Do not expose invoice details in model prompts.
- Do not auto-refund or modify subscriptions.
risk_zones:
- billing data
- tenant isolation
- tool permissions
- PII in logs
required_evidence:
- support agent route or workflow entrypoint
- billing intent classifier or prompt
- escalation tool schema
- existing ticket creation tests
validation:
- unit tests for billing intent classification
- integration test for escalation ticket creation
- log redaction check
This is still short, but it gives the agent a map. It also defines what “done” means.
Build a repo map before you need it
Agents waste time when every task starts with blind exploration. A repo map reduces that cost.
A useful repo map can start as one markdown file:
# Repo Map
## Product areas
- `apps/web`: user-facing dashboard
- `apps/api`: API routes and background jobs
- `packages/ai`: prompts, model routing, tool schemas
- `packages/db`: schema, migrations, query helpers
- `packages/evals`: golden tasks and regression evals
## Risk zones
- Auth: `apps/api/src/auth`, `packages/db/src/tenant.ts`
- Billing: `apps/api/src/billing`, `packages/stripe`
- AI tools: `packages/ai/src/tools`
- Retrieval filters: `packages/ai/src/retrieval`
## Validation commands
- `pnpm test`
- `pnpm typecheck`
- `pnpm lint`
- `pnpm evals:agent`
This map gives agents a starting point. It also helps human reviewers see whether the agent touched the right surface area.
Add memory, but keep code evidence first
Agent memory is useful, but it can become dangerous if it outranks the current code.
Good memory items look like this:
{
"scope": "billing-webhooks",
"fact": "Webhook handlers must use idempotency keys from Stripe event IDs before writing usage records.",
"source": "incident-usage-duplicates.md",
"last_verified": "2026-07-04",
"confidence": "high"
}
Bad memory items look like this:
{
"fact": "Billing is handled in the old webhook file."
}
The first memory has scope, source, and a verification date. The second has none of these, so the agent cannot tell whether it is still true. It may be stale and misleading.
Use memory for architectural decisions, prior incidents, gotchas, migration warnings, evaluation failures, and “do not repeat this” notes.
Do not use memory as a replacement for code search. The agent should retrieve memory, then verify it against the current repo.
A safe instruction is:
Use memory to guide exploration, not to conclude. If memory conflicts with code, trust current code and report the conflict.
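A freshness check for memory items could look like the sketch below. The field names follow the JSON example above; the `isUsableLead` helper and the 180-day default are assumptions chosen for illustration, not a recommended threshold.

```typescript
// Memory record with the fields from the JSON example above.
interface MemoryItem {
  fact: string;
  scope?: string;
  source?: string;
  last_verified?: string; // ISO date such as "2026-07-04"
  confidence?: "low" | "medium" | "high";
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical policy: a memory item is a usable lead only if it has scope,
// source, and a verification date no older than maxAgeDays.
function isUsableLead(item: MemoryItem, today: Date, maxAgeDays = 180): boolean {
  if (!item.scope || !item.source || !item.last_verified) return false;
  const verified = new Date(item.last_verified);
  if (Number.isNaN(verified.getTime())) return false; // unparseable date
  const ageDays = (today.getTime() - verified.getTime()) / DAY_MS;
  return ageDays >= 0 && ageDays <= maxAgeDays;
}

const good: MemoryItem = {
  scope: "billing-webhooks",
  fact: "Webhook handlers must use idempotency keys from Stripe event IDs before writing usage records.",
  source: "incident-usage-duplicates.md",
  last_verified: "2026-07-04",
  confidence: "high",
};
const bad: MemoryItem = { fact: "Billing is handled in the old webhook file." };

console.log(isUsableLead(good, new Date("2026-08-01"))); // true
console.log(isUsableLead(bad, new Date("2026-08-01"))); // false
```

Passing this check only makes an item worth following up. The agent still verifies it against the current repo before relying on it.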
Teach the agent to find tests before editing
Many agents edit first and look for tests later. Reverse that.
Before editing, the agent should answer:
- Which tests cover current behavior?
- Which test should fail before the fix?
- Which test proves the new behavior?
- Which validation is too expensive to run locally?
A small test discovery note can prevent a lot of review pain:
## Test discovery
Likely existing tests:
- `packages/ai/src/tools/__tests__/ticket-tool.test.ts`
- `apps/api/src/support/__tests__/support-agent-route.test.ts`
Missing test:
- No regression test confirms billing questions create escalation tickets without exposing invoice data.
Plan:
- Add a failing test for billing intent -> escalation.
- Add a redaction assertion for logs.
- Run support-agent route tests and agent tool tests.
With this note in place, the agent stops treating tests as cleanup and starts treating them as navigation.
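Test discovery can start from a naming convention. The sketch below assumes the `__tests__/<name>.test.ts` layout seen in the paths above, plus a co-located `<name>.test.ts` fallback; `candidateTestPaths` is a hypothetical helper, the source file name in the example is inferred from the test path, and other repos will use different conventions.

```typescript
// Given a source file path, propose where its tests might live.
// These are guesses to check against the file system, not confirmed paths.
function candidateTestPaths(sourcePath: string): string[] {
  const slash = sourcePath.lastIndexOf("/");
  const dir = slash === -1 ? "" : sourcePath.slice(0, slash + 1);
  const file = sourcePath.slice(slash + 1);
  const base = file.replace(/\.tsx?$/, ""); // strip .ts or .tsx
  return [`${dir}__tests__/${base}.test.ts`, `${dir}${base}.test.ts`];
}

const candidates = candidateTestPaths("packages/ai/src/tools/ticket-tool.ts");
console.log(candidates.join("\n"));
// packages/ai/src/tools/__tests__/ticket-tool.test.ts
// packages/ai/src/tools/ticket-tool.test.ts
```

If none of the candidates exist, that is a finding in itself: it belongs under "Missing test" in the discovery note.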
Use risk tiers for agent edits
Not every change needs the same ceremony. A typo fix should not require a full architecture review. A billing-agent tool change should.
| Tier | Example | Agent behavior |
|---|---|---|
| Low | docs, comments, isolated UI copy | inspect, edit, run narrow check |
| Medium | UI logic, internal API, non-critical job | pre-edit plan, tests, summary receipt |
| High | auth, billing, tenant data, AI tools, deletion | approval gate, impact analysis, rollback note |
| Critical | production data migration, permission model, external writes | human review before execution |
For AI systems, mark these as high risk by default: prompt changes that affect customer-visible answers, tool permission changes, retrieval filter changes, memory writes, model routing changes, fallback logic, usage metering, PII handling, and tenant isolation.
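The tier table can be encoded as data so a tool can look up the required behavior. In this sketch the steps are transcribed from the table above; the rule that a change spanning several areas takes the highest tier among them is an assumption, and `highestTier` and `needsHumanBeforeEdit` are hypothetical helpers.

```typescript
type Tier = "low" | "medium" | "high" | "critical";

// Required agent behavior per tier, transcribed from the table above.
const REQUIRED_STEPS: Record<Tier, string[]> = {
  low: ["inspect", "edit", "run narrow check"],
  medium: ["pre-edit plan", "tests", "summary receipt"],
  high: ["approval gate", "impact analysis", "rollback note"],
  critical: ["human review before execution"],
};

const TIER_ORDER: Tier[] = ["low", "medium", "high", "critical"];

// Assumed policy: a change touching several areas takes the highest tier.
// An empty list falls back to "low".
function highestTier(tiers: Tier[]): Tier {
  return tiers.reduce<Tier>(
    (max, t) => (TIER_ORDER.indexOf(t) > TIER_ORDER.indexOf(max) ? t : max),
    "low",
  );
}

// High and critical tiers both require a human before the edit proceeds.
function needsHumanBeforeEdit(tier: Tier): boolean {
  return tier === "high" || tier === "critical";
}

const tier = highestTier(["low", "high", "medium"]);
console.log(tier); // high
console.log(REQUIRED_STEPS[tier].join(", ")); // approval gate, impact analysis, rollback note
```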
Require a verification receipt
A final agent message should not be “done.” It should be a receipt.
## Change summary
- Added billing escalation path for support agent.
## Evidence used
- Read support route, intent classifier, escalation tool, and audit log code.
- Checked references for `createEscalationTicket`.
## Validation run
- `pnpm test support-agent-route` ✅
- `pnpm test agent-tools` ✅
## Risks remaining
- Did not run full eval suite because it takes 40 minutes.
This format separates claims from evidence and tells the reviewer where to look.
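If the agent reports structured results, the receipt can be generated rather than free-written. The sketch below is one possible shape: the `Receipt` type and `renderReceipt` are hypothetical, and the ❌ marker for failed commands is an addition to the format shown above.

```typescript
interface ValidationRun {
  command: string;
  passed: boolean;
}

interface Receipt {
  changeSummary: string[];
  evidenceUsed: string[];
  validationRun: ValidationRun[];
  risksRemaining: string[];
}

// Renders one section. An empty section is stated explicitly, not omitted,
// so a reviewer can see that nothing was recorded.
function section(title: string, lines: string[]): string {
  const body = lines.length > 0 ? lines.map((l) => "- " + l) : ["- (none recorded)"];
  return ["## " + title, ...body].join("\n");
}

function renderReceipt(r: Receipt): string {
  const runs = r.validationRun.map((v) => "`" + v.command + "` " + (v.passed ? "✅" : "❌"));
  return [
    section("Change summary", r.changeSummary),
    section("Evidence used", r.evidenceUsed),
    section("Validation run", runs),
    section("Risks remaining", r.risksRemaining),
  ].join("\n");
}

const receipt: Receipt = {
  changeSummary: ["Added billing escalation path for support agent."],
  evidenceUsed: ["Checked references for `createEscalationTicket`."],
  validationRun: [
    { command: "pnpm test support-agent-route", passed: true },
    { command: "pnpm test agent-tools", passed: true },
  ],
  risksRemaining: ["Did not run full eval suite because it takes 40 minutes."],
};

console.log(renderReceipt(receipt));
```

Generating the receipt from recorded tool results keeps the "Validation run" section tied to commands that actually executed.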
Implementation pattern: a lightweight context gate
You can implement a context gate without building a full platform.
Create `.agent/context-gate.md`:
# Context Gate
Before editing production code, complete this checklist:
- [ ] Restate task and non-goals.
- [ ] List primary files with reason.
- [ ] List references/callers checked.
- [ ] List tests found before editing.
- [ ] List risk tier.
- [ ] List validation commands.
- [ ] List unknowns.
Do not edit high-risk files until the plan includes risk, rollback, and validation.
Then add a short agent instruction:
For code tasks, read `.agent/context-gate.md` first. Complete the checklist before editing. If the change is high risk, pause after the plan.
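If the agent returns a filled-in copy of the checklist, a script can confirm that nothing was skipped. This sketch assumes the agent marks completed items with `- [x]`; `uncheckedItems` is a hypothetical helper.

```typescript
// Returns the text of every unchecked "- [ ]" item in a markdown checklist.
function uncheckedItems(markdown: string): string[] {
  const prefix = "- [ ] ";
  const unchecked: string[] = [];
  for (const line of markdown.split("\n")) {
    const trimmed = line.trim();
    if (trimmed.startsWith(prefix)) unchecked.push(trimmed.slice(prefix.length));
  }
  return unchecked;
}

const filled = [
  "- [x] Restate task and non-goals.",
  "- [x] List primary files with reason.",
  "- [ ] List risk tier.",
  "- [ ] List unknowns.",
].join("\n");

console.log(uncheckedItems(filled).join("; ")); // List risk tier.; List unknowns.
```

A checked box only shows that the agent claims the step is done, so this complements review of the plan and does not replace it.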
Common mistakes
Mistake 1: dumping the whole repo into context
More context is not always better. Large irrelevant context can make the agent slower and less accurate. Retrieve only what the current step needs, and pass handles (file paths, symbol names) that the agent can resolve on demand.
Mistake 2: trusting memory without freshness
Memory should have source, scope, and verification. Stale memory is just a confident rumor.
Mistake 3: running tests only after the edit
Tests guide the plan. Find them before editing.
Mistake 4: treating all files as equal risk
A CSS tweak and a tenant-filter change should not have the same workflow.
Mistake 5: accepting summaries without receipts
A summary tells you what the agent claims. A receipt tells you what the agent checked.
A starter workflow for small teams
If you are a solo builder or small AI product team, start here:
- Create a repo map.
- Add a context gate checklist.
- Add a PR receipt template.
- Define high-risk file patterns.
- Ask agents to find tests before editing.
- Keep a small memory file for decisions and incidents.
- Review the receipt, not just the diff.
High-risk file patterns can be simple:
high_risk:
- "**/auth/**"
- "**/billing/**"
- "**/migrations/**"
- "**/tools/**"
- "**/retrieval/**"
- "**/tenant*.ts"
- "**/prompts/**"
Then tell the agent:
If a touched file matches a high-risk pattern, stop after the plan and explain risk, rollback, and validation.
That one rule can prevent a lot of expensive agent confidence.
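The rule can also be checked mechanically against the list of touched files. The matcher below is a deliberately small sketch that supports only `*` and `**`; a real setup would more likely use an established glob library, and edge-case behavior may differ from one. `globToRegExp` and `highRiskFiles` are hypothetical helpers, and the file names in the example are invented.

```typescript
// Converts a glob with * and ** into a RegExp. Other glob syntax is not supported.
function globToRegExp(glob: string): RegExp {
  let source = "";
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith("**/", i)) {
      source += "(?:.*/)?"; // zero or more leading directories
      i += 3;
    } else if (glob.startsWith("**", i)) {
      source += ".*"; // anything, including slashes
      i += 2;
    } else if (glob.charAt(i) === "*") {
      source += "[^/]*"; // anything within one path segment
      i += 1;
    } else {
      // Escape regex metacharacters in literal text.
      source += glob.charAt(i).replace(/[.+?^${}()|[\]\\]/g, "\\$&");
      i += 1;
    }
  }
  return new RegExp("^" + source + "$");
}

// Patterns copied from the high_risk list above.
const HIGH_RISK = [
  "**/auth/**",
  "**/billing/**",
  "**/migrations/**",
  "**/tools/**",
  "**/retrieval/**",
  "**/tenant*.ts",
  "**/prompts/**",
].map((pattern) => globToRegExp(pattern));

function highRiskFiles(touched: string[]): string[] {
  return touched.filter((path) => HIGH_RISK.some((re) => re.test(path)));
}

const flagged = highRiskFiles([
  "apps/web/src/components/Button.tsx", // hypothetical file names
  "packages/db/src/tenant.ts",
  "apps/api/src/billing/invoice.ts",
]);
console.log(flagged.join("\n"));
// packages/db/src/tenant.ts
// apps/api/src/billing/invoice.ts
```

If `flagged` is non-empty, the agent stops after the plan.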
FAQ
What is coding agent context engineering?
Coding agent context engineering is the practice of designing what evidence an AI coding agent receives before, during, and after a code change. It includes task briefs, repo maps, code indexes, memory, risk rules, tests, and verification receipts.
Is a larger context window enough for coding agents?
No. A larger context window can help, but it does not guarantee relevance. Agents still need retrieval, symbol lookup, reference checks, test discovery, and risk rules so they use the right context instead of more context.
Should coding agents use memory?
Yes, but memory should guide exploration rather than replace code evidence. Good memory includes source, scope, freshness, and confidence. The agent should verify memory against the current repo before relying on it.
What should an agent check before editing code?
Before editing, an agent should restate the task, list non-goals, identify primary files, check references, find tests, retrieve relevant decisions, assign risk, and propose validation commands.
How do I make agent-written code easier to review?
Require a verification receipt. The receipt should list evidence used, files touched, tests run, risks remaining, and reviewer focus areas. This gives human reviewers a trail instead of only a diff.
Which code changes should require human approval?
Require approval for high-risk changes such as auth, billing, tenant isolation, data deletion, migrations, AI tool permissions, retrieval filters, memory writes, prompt changes that affect users, and external actions.