We stopped trying to find a better model.
We built a better context surface. Different problem. Different fix.
Here's the story of how we got there, and why I think most teams in 2026 are optimising the wrong side of the equation.
The 1,200-line PR
A few months ago, one of our engineers asked an AI agent to help add a new refund flow to our merchant service. The agent returned a PR. 1,200 lines. It compiled. The tests passed.
It also did three things we'd explicitly decided, months earlier, to never do in this codebase:
- It created a new service-to-service HTTP client instead of using our internal ServiceBus abstraction.
- It persisted refund state in the merchant service's own database instead of emitting a domain event for the ledger service to consume.
- It wrote a retry loop with setTimeout instead of using our @Retryable decorator, which has backoff policies tied to our SLOs (rough contrast sketched below).
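To make that last one concrete, here is roughly the shape of the difference. The @Retryable options below are illustrative (I'm not reproducing the real decorator's API), just enough to show a hand-rolled loop versus a policy that lives in shared code.
// Stand-ins so the sketch is self-contained; both are purely illustrative.
declare function postRefund(refundId: string): Promise<void>;
declare function Retryable(opts: { policy: string }): MethodDecorator;

// What the agent wrote: a hand-rolled retry loop (simplified).
async function refundWithRetry(refundId: string): Promise<void> {
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      return await postRefund(refundId);
    } catch {
      // Fixed one-second sleep: no backoff policy, no circuit breaking,
      // and nothing surfaces in the SLO dashboards.
      await new Promise((resolve) => setTimeout(resolve, 1000));
    }
  }
  throw new Error(`refund ${refundId} failed after 3 attempts`);
}

// What the codebase expected: the shared decorator owns the retry policy.
class RefundService {
  @Retryable({ policy: 'payments-slo' })
  async refund(refundId: string): Promise<void> {
    return postRefund(refundId);
  }
}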
None of this is in the agent's training data. Nothing in the README told it either. And the reviewer — doing the review at 6pm on a Friday — skimmed the diff, saw green CI, and approved.
Two weeks later we had a duplicate-refund incident. One hour of debugging to find the cause. Not a bug in the agent's code. A design-pattern violation the agent had no way to know existed.
The realisation
Here's the uncomfortable part.
The agent didn't do anything wrong. It did exactly what a capable junior engineer would have done if dropped into the repo for the first time with no context. Which is: it solved the immediate problem with reasonable-looking code, using the patterns it had seen in its training data.
Our new hires did the same thing. I went back and checked. In the six months before that incident, we'd had three separate PRs from three different people — two human, one AI — all creating bespoke HTTP clients instead of using ServiceBus. All of them reviewed by people who knew better but missed it under time pressure.
The bug wasn't the model. The bug was that the knowledge of which patterns we'd consciously chosen to standardise lived nowhere an agent could read it, and only half-lived in the heads of senior engineers who weren't always in the review.
So we stopped chasing model quality and started building the thing that was actually missing: a context layer.
What "context layer" actually means
The phrase has been thrown around loosely since MCP took off, so let me be concrete.
In our stack, a context layer is:
- A single, versioned source of truth for architectural decisions, design patterns, and merchant-domain invariants.
- Structured as machine-readable documents (MDX with frontmatter, not free-form Confluence pages).
- Served over MCP so the same corpus is queryable by every AI tool on the team — Claude, Cursor, Copilot, our internal agents.
- Enforced by CI through design-pattern lints that fail the build when any PR — human-authored or AI-authored — violates a recorded pattern.
The enforcement layer is what most teams skip. The context on its own is a wiki nobody reads. The lints on their own are arbitrary rules nobody remembers the reason for. Pairing them is where the leverage lives.
The three files that made it work
Here's the minimum structure we settled on, with real examples from our monorepo.
1. adr/*.mdx — architectural decisions, machine-readable
---
id: ADR-0047
title: "Service-to-service communication goes through ServiceBus"
status: accepted
date: 2025-11-12
tags: [microservices, inter-service, nestjs]
supersedes: null
lint_rule: no-direct-http-client
---
## Context
15 NestJS microservices. Two years ago, every service had its own
Axios instance. Retry semantics drifted. Timeouts drifted. Tracing
headers got dropped. Incidents had no consistent trail.
## Decision
All service-to-service calls go through @atoa/service-bus, which
wraps Axios with retries, circuit breaking, OpenTelemetry tracing,
and our standard auth header injection.
## Rationale
- Retry policies live in one place, tied to SLOs.
- Every call is traced by default.
- Failures surface consistently in Grafana.
## Enforcement
eslint rule: no-direct-http-client (see lint-rules/)
CI gate: fail on import of 'axios' or 'node:http' in service code.
Every ADR has a lint_rule pointer. No ADR ships without one, unless explicitly marked advisory.
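We keep that rule honest with a small CI check. Here's a minimal sketch of the kind of script that enforces it; the gray-matter dependency and the status: advisory marker are assumptions for the example, not a transcript of our pipeline.
// ci/check-adr-frontmatter.ts: a sketch of the gate, not our exact script.
// Assumes YAML frontmatter parsed with gray-matter, and that advisory ADRs
// carry status: advisory (the exact marker is an assumption here).
import { readFileSync, readdirSync } from 'node:fs';
import { join } from 'node:path';
import matter from 'gray-matter';

const adrDir = 'adr';
const failures: string[] = [];

for (const file of readdirSync(adrDir).filter((f) => f.endsWith('.mdx'))) {
  const { data } = matter(readFileSync(join(adrDir, file), 'utf8'));
  const advisory = data.status === 'advisory';
  if (!advisory && !data.lint_rule) {
    failures.push(`${file}: no lint_rule and not marked advisory`);
  }
}

if (failures.length > 0) {
  console.error(failures.join('\n'));
  process.exit(1); // an ADR without enforcement doesn't ship
}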
2. lint-rules/no-direct-http-client.ts — the actual enforcement
import { TSESTree, TSESLint } from '@typescript-eslint/utils';

// Modules that must never be imported directly from service code.
const BANNED = ['axios', 'node:http', 'node:https', 'undici'];

// The wrapper libraries themselves are allowed to touch raw HTTP clients.
const ALLOWED_PATHS = [
  'libs/service-bus/',
  'libs/http-primitives/',
];

export const rule: TSESLint.RuleModule<'useServiceBus', []> = {
  meta: {
    type: 'problem',
    messages: {
      useServiceBus:
        'Direct HTTP clients are banned. Use @atoa/service-bus. See ADR-0047.',
    },
    schema: [],
  },
  defaultOptions: [],
  create(ctx) {
    // Skip the allow-listed wrapper libraries entirely.
    const filename = ctx.getFilename();
    if (ALLOWED_PATHS.some((p) => filename.includes(p))) return {};

    return {
      // Flag any static import of a banned HTTP client module.
      ImportDeclaration(node: TSESTree.ImportDeclaration) {
        if (BANNED.includes(node.source.value)) {
          ctx.report({ node, messageId: 'useServiceBus' });
        }
      },
    };
  },
};
Nothing clever. The point is: when an agent (or a human) ships the banned pattern, the PR cannot land. Not "a reviewer will notice." The build fails. Every time.
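For completeness, here's roughly how a rule like that gets wired in so the gate actually runs. This is a sketch of an ESLint flat config; the plugin namespace and file globs are illustrative, not copied from our repo.
// eslint.config.ts: illustrative wiring, not our literal config.
// Assumes the custom rules are exposed as a local plugin under the "atoa" namespace.
import { rule as noDirectHttpClient } from './lint-rules/no-direct-http-client';

export default [
  {
    files: ['apps/**/*.ts', 'libs/**/*.ts'],
    plugins: {
      atoa: {
        rules: { 'no-direct-http-client': noDirectHttpClient },
      },
    },
    rules: {
      // "error" severity is what turns the ADR into a merge-blocking gate in CI.
      'atoa/no-direct-http-client': 'error',
    },
  },
];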
3. context.mcp.json — what we expose to every tool
{
  "name": "atoa-engineering-context",
  "version": "1.4.0",
  "resources": [
    { "uri": "adr://*", "description": "Architectural decisions with enforcement status" },
    { "uri": "pattern://*", "description": "Approved design patterns with code examples" },
    { "uri": "domain://merchant", "description": "Merchant domain invariants and flows" },
    { "uri": "domain://payments", "description": "Payment flow state machines" }
  ],
  "tools": [
    {
      "name": "check_pattern",
      "description": "Given a code snippet, return any ADR violations it would trigger"
    },
    {
      "name": "find_precedent",
      "description": "Search for prior implementations of a similar pattern in our codebase"
    }
  ]
}
Every AI tool our team uses mounts this MCP server. When an engineer asks Claude to "add a refund flow," the model has the ADRs in retrieval before it starts writing code. When they ask "how have we handled async retries in the past," find_precedent returns the real decorator, not something that looks plausible.
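If you're curious what sits behind that manifest, here's a minimal sketch of the find_precedent tool using the TypeScript MCP SDK. The naive text search over a patterns/ directory is a stand-in for real indexing, and the paths are illustrative.
// mcp-server.ts: a minimal sketch, not our production server.
// Assumes the TypeScript MCP SDK (@modelcontextprotocol/sdk) and zod;
// the text scan over patterns/ is a naive stand-in for real indexing.
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { readFileSync, readdirSync } from 'node:fs';
import { join } from 'node:path';
import { z } from 'zod';

const server = new McpServer({ name: 'atoa-engineering-context', version: '1.4.0' });

server.tool(
  'find_precedent',
  'Search for prior implementations of a similar pattern in our codebase',
  { query: z.string() },
  async ({ query }) => {
    // Scan the approved-pattern docs for the query term.
    const patternDir = 'patterns';
    const hits = readdirSync(patternDir)
      .filter((f) => f.endsWith('.mdx'))
      .filter((f) =>
        readFileSync(join(patternDir, f), 'utf8').toLowerCase().includes(query.toLowerCase()),
      );

    return {
      content: [
        {
          type: 'text',
          text: hits.length > 0 ? `Precedents: ${hits.join(', ')}` : 'No precedent found.',
        },
      ],
    };
  },
);

await server.connect(new StdioServerTransport());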
The agent stopped hallucinating patterns not because the model got smarter. Because we gave it somewhere to look.
What happened in the last 30 days
We've been running this layer across the full engineering team — 18 people, mix of AI-heavy and AI-light workflows — for just under a quarter now. Last month's numbers:
- 23 pattern violations caught by design-pattern lints before merge. 14 from human-authored PRs. 9 from AI-authored PRs. The ratio surprised me. I'd expected AI to dominate the violation list. It did not.
- 2 architectural regressions avoided that would previously have shipped. One was a would-be duplicate-refund bug in the same area as the Friday-night incident. The lint caught what the reviewer under time pressure would have missed.
- Onboarding time for a new engineer down from 2 weeks to 4 days on the local-dev side. That's a separate story, but the context layer helped here too: new hires read the ADR corpus once, then let the MCP server answer their day-to-day "does this already exist?" questions.
- Zero arguments in code review about "is this the right pattern." When a disagreement happens, the question becomes "is there an ADR for this?" If yes, the lint decides. If no, we write the ADR.
That last one is the quiet win. Code review time on architectural questions dropped by roughly a third, because we stopped relitigating decisions we'd already made.
The part most teams get wrong
Two patterns I see repeatedly on teams that try to build this and don't get the leverage:
1. Context without enforcement. A beautiful ADR wiki nobody reads. Every violation still ships because there's no gate. This is where most teams stop, because the wiki feels like the real deliverable. It is not. The lint is the real deliverable.
2. Enforcement without context. A forest of lint rules nobody understands. The first time someone hits a red CI gate with a rule they've never seen, they open a Slack channel and ask why. If the lint points to an ADR with a clear rationale, the question answers itself. If it points to a rule that just says "forbidden," you've built a political problem disguised as infrastructure.
Pairing them is not optional. Either one alone is worse than nothing.
What this means for "model quality" debates in 2026
Every week there's a new "is Claude 4.6 better than Opus 4.5 at code" thread. I read them. I have opinions. But in terms of what actually moved the needle on our shipping velocity this quarter — it wasn't the model.
It was the retrieval surface.
The model doesn't need to be smarter. It needs to read the right thing before it answers. And once the context layer is good enough, the difference between "good model" and "great model" collapses, because both are now looking at the same authoritative source.
For 2026, if I had to pick one place to invest a quarter of engineering time to improve AI-native development, it wouldn't be better prompts. It wouldn't be a new IDE extension. It would be this:
Write down the patterns you've actually chosen. Make them machine-readable. Serve them over MCP. Enforce them in CI. Stop relying on tribal knowledge to survive code review.
The agent isn't the bottleneck. The knowledge surface is.
I'm Arun, CTO and co-founder at Atoa — we build open banking payments for the UK. We run 15 NestJS microservices in production and I write about the things we've learned the hard way. Find me on X @mickyarun if you want to argue about any of this.