Takafumi Endo

Posted on Jun 19

The Repo Is the Context: Why Agents Don’t Need History

#agents #coding #architecture #ai

History still matters. I just don't want it to be the default context for agents editing the present system.

A coding agent will read whatever you put in front of it.

That sounds like a strength. Most of the time it is the problem.

For a while my answer was "more Markdown" — CLAUDE.md, AGENTS.md, specs, ADRs, planning docs. It felt responsible: agents need context, so give them context.

Then I watched an agent do something confidently wrong because it had faithfully trusted a document that described a system we stopped having months ago. It wasn't a hallucination. It was obedience to history.

That's when the question shifted for me. Not "what prompt should I write?" but the quieter one underneath it: what should the agent read? I wrote recently that token efficiency is really a "what do you hand over" problem — each token should buy a verified fact, not a guess. This post is the same idea, pointed at the repo itself.

History is for humans. The present source of truth is for agents.

This is not an argument against keeping history.

I still think ADRs, migration logs, old specs, and planning docs are valuable. They explain why the system took its current shape — the tradeoffs, the constraints, the abandoned paths, the organizational memory. Humans need that. When you're reviewing a decision, onboarding to a domain, debugging a strange constraint, or trying to understand why an obvious-looking change was deliberately avoided, history is exactly the right thing to read.

But Claude Code, Codex, and similar coding agents are usually doing something different. They're about to edit the present system.

For that task, historical prose shouldn't be the primary context. The agent needs the current source of truth: current schema, current module boundaries, current public APIs, current tests, current state machines, current configuration, current dependency rules.

A human may read migration history to understand how the database evolved. But I don't want Claude Code to learn the current database by replaying every migration — I want it to read the current schema. A human may read ADRs to follow the reasoning behind a decision. But I don't want an editing agent to infer the current architecture by replaying every ADR — I want it to read the current module graph.

History explains why the system became this shape. The current source of truth explains what shape it has now. Both matter. They just serve different readers.

Reader	What they should read
Human — trying to understand why	ADRs, old specs, migration history, design notes
Coding agent — trying to change what exists now	current schema, tests, types, module graph, config, public APIs

So the rule was never "delete history." It's:

Keep history for humans.
Give agents the present source of truth.

A well-shaped repository should let an agent understand the current system without replaying its past. And the larger the system gets, the more this matters — not because history becomes useless, but because in prose it gets mixed in with obsolete truth, and the two become indistinguishable. The danger isn't the old fact. It's the old fact wearing the same clothes as the current one.

Drift is the failure mode, and agents are loyal to it

Here's the chain I keep tripping over:

duplication → drift → context pollution

The moment a fact lives in two places — once in the code, once in a Markdown description of the code — they begin to diverge. Code changes under pressure; prose changes when someone remembers. So the doc quietly goes stale, and now you have two answers to the same question with no marker for which is current.

A human reading a slightly-wrong doc squints and cross-checks. An agent doesn't squint. It's very good at trusting polluted context and acting on it decisively. The thing that makes agents fast — taking the given facts at face value — is exactly what makes drift dangerous.

So the most useful move isn't writing better docs. It's having fewer places where a fact can rot.

A boundary in CLAUDE.md is advice. A boundary in the build is architecture.

This is the line I keep coming back to.

If CLAUDE.md says "the billing module must not import from checkout," that's a polite request. Nothing enforces it; the next edit can ignore it and nobody — human or agent — gets stopped.

If that same boundary lives in ESLint config, in package.json exports, or in TypeScript project references, it's no longer advice. It's a wall. The agent doesn't have to remember the rule, because the rule is a property of the system it's editing.
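As a rough sketch of what that wall could look like in ESLint's flat config, using the core `no-restricted-imports` rule. The `src/billing` and `checkout` paths are assumptions about the layout, `boundaryConfig` is a hypothetical name, and linting `.ts` files also assumes a TypeScript parser such as typescript-eslint is configured elsewhere:

```typescript
// Sketch of an eslint.config.js body (flat config).
// The directory names are hypothetical; adjust them to the real layout.
const boundaryConfig = [
  {
    // Apply the rule only to files inside the billing module.
    files: ["src/billing/**/*.ts"],
    rules: {
      // Core ESLint rule: report any import whose specifier matches a pattern.
      "no-restricted-imports": [
        "error",
        {
          patterns: [
            {
              group: ["**/checkout/**"],
              message: "billing must not import from checkout.",
            },
          ],
        },
      ],
    },
  },
];

// In a real eslint.config.js this array would be the default export:
// export default boundaryConfig;
```

The patterns use gitignore-style matching against the import specifier, so it is worth confirming they catch the relative and aliased import forms the codebase actually uses.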

That reframes "AI readability" for me. It was never about writing longer instructions. It's about making the repository itself say the true thing — structurally, where it can't drift:

directory structure carries the architecture
naming carries intent
package exports carry the public API surface
lint rules carry dependency boundaries
schemas carry data contracts
state machines carry workflows
tests carry expected behavior
config files carry the actual stack
generated-file markers carry "do not touch by hand"

Each of these is closer to the system than any sentence about the system can be. That proximity is the whole point — the closer a fact sits to the thing it describes, the less room there is for it to lie.

Markdown should be a map, not the territory

I haven't thrown out CLAUDE.md and AGENTS.md. I've shrunk them.

Their job isn't to restate the architecture. It's to point at where the current architecture actually lives.

## Sources of truth

- Database schema:     packages/database/schema
- API contract:        packages/api/openapi.yaml
- Checkout workflow:   src/features/checkout/checkout.machine.ts
- Public package APIs: package.json "exports"
- Module boundaries:   eslint.config.js
- Commands:            package.json scripts

That's a map. It survives change, because it points at the territory instead of copying it.

The trap is the opposite: a section in CLAUDE.md that restates the behavior — the checkout states, what billing depends on, what the API accepts. The day the code moves, that paragraph becomes a confident lie. And the agent will read it with exactly the same trust it gives the truth.
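The map itself can be checked mechanically so that it fails loudly when a path moves. A minimal sketch, assuming the map keeps the `- Label: target` line shape shown above; `extractMapPaths` and `findDeadPaths` are hypothetical helper names, and the existence check is passed in so the same function works in CI and in a test:

```typescript
// Pull the first path-like token out of every "- Label: target" line.
// Lines that do not match that shape (headings, blank lines) are skipped.
function extractMapPaths(markdown: string): string[] {
  const paths: string[] = [];
  for (const line of markdown.split("\n")) {
    const match = /^-\s+[^:]+:\s+(\S+)/.exec(line.trim());
    const target = match ? match[1] : undefined;
    if (target) paths.push(target);
  }
  return paths;
}

// Return every mapped path that the injected `exists` check cannot resolve.
function findDeadPaths(
  markdown: string,
  exists: (path: string) => boolean
): string[] {
  return extractMapPaths(markdown).filter((p) => !exists(p));
}
```

In CI, the `exists` argument could be `existsSync` from `node:fs`, with the job failing whenever the returned list is non-empty. This only proves that each address still resolves, not that the file behind it still means the same thing.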

Specs are scaffolding — give them an exit strategy

I still like a short spec at the start of a feature. It gives Claude Code or Codex a sharper target than vague intent, and it helps me think. That part is genuinely useful.

The mistake is letting the spec stay as a second source of truth after the code exists. Once the feature ships, the durable parts of the spec should migrate into the repo's present tense. This is where the repo starts becoming readable without being explained:

The spec captured…	…so move it into
screen states	state machine / Storybook stories
API behavior	OpenAPI / GraphQL / typed router
validation rules	Zod / Valibot / JSON Schema
domain states	discriminated unions / domain model
permissions	policy code / authorization tests
database rules	schema / constraints
done-ness	tests
infrastructure	Terraform / OpenTofu / Pulumi

The spec was scaffolding around the building. You don't leave scaffolding up and then ask people to trust it over the walls.
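To illustrate the "domain states" row, here is a sketch of a spec paragraph about checkout states rewritten as a discriminated union plus a transition function. Every state, event, and field name below is hypothetical and only stands in for whatever the real spec described:

```typescript
// Hypothetical checkout states. Each variant carries only the data
// that can exist in that state, so illegal combinations do not type-check.
type CheckoutState =
  | { status: "cart"; items: string[] }
  | { status: "payment_pending"; items: string[]; paymentId: string }
  | { status: "paid"; items: string[]; paymentId: string; paidAt: Date }
  | { status: "cancelled"; reason: string };

type CheckoutEvent =
  | { type: "SUBMIT_PAYMENT"; paymentId: string }
  | { type: "PAYMENT_CONFIRMED"; paidAt: Date }
  | { type: "CANCEL"; reason: string };

// Events that are not valid for the current state leave it unchanged.
function transition(state: CheckoutState, event: CheckoutEvent): CheckoutState {
  switch (state.status) {
    case "cart":
      if (event.type === "SUBMIT_PAYMENT") {
        return { status: "payment_pending", items: state.items, paymentId: event.paymentId };
      }
      if (event.type === "CANCEL") return { status: "cancelled", reason: event.reason };
      return state;
    case "payment_pending":
      if (event.type === "PAYMENT_CONFIRMED") {
        return { status: "paid", items: state.items, paymentId: state.paymentId, paidAt: event.paidAt };
      }
      if (event.type === "CANCEL") return { status: "cancelled", reason: event.reason };
      return state;
    case "paid":
    case "cancelled":
      // Terminal states in this sketch: no event moves them.
      return state;
  }
}
```

Once the states live in a type like this, the spec's description of them has nothing left to say that the compiler and the tests do not already enforce.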

Naming is prompt engineering, run inside the repo

This is the part I find quietly fun. Agents infer intent from names — which means names aren't only for humans anymore.

createUser()
updateUser()
deleteUser()

tells an agent almost nothing about the domain. This:

registerMember()
inviteMember()
deactivateMember()
transferWorkspaceOwnership()

tells it where it is. The same goes for structure — a layout an agent can read as a map of where behavior belongs:

Path	What lives there
`domains/billing/commands/`	state-changing operations
`domains/billing/queries/`	reads
`domains/billing/policies/`	authorization rules
`domains/billing/repositories/`	data access

A layout like that answers "where does this change belong?" before the agent has to guess. And it will guess — if the repo doesn't tell an agent where a thing lives, it'll happily invent a home for it, somewhere plausible and wrong.

So naming and layout are a kind of standing prompt: written once, into the structure, read on every single run.

The same "what do you hand over" question, one layer up

In the earlier post I mentioned at the top (the one on graphs), the facts I wanted to hand an agent were computed (dominance, reachability, ordering), and the discipline was labeling each one verified or estimated by where it came from.

This is the same question, just earlier in the pipeline. A current schema is a verified fact about the data. A migration log you have to replay is, at best, estimated truth about the present. A lint-enforced boundary is verified; a sentence in CLAUDE.md is a hope. The repo, when it's shaped well, is a source of facts that carry their own freshness — because they are the system, not a description trailing behind it.

Present schema over migration history. Current module graph over ADRs. Live tests over old acceptance criteria. Real config over README prose. State machines over paragraphs about workflows.

None of that throws the history away. It just stops asking the editing agent to reconstruct the present from it.

Last year I tuned the instructions. This year I shaped the repo.

I want to be honest about where this comes from, because it's experience, not a benchmark.

A year ago I was fixated on the instruction layer — .claude/, .codex/, AGENTS.md, CLAUDE.md. I tuned rules, wrote skills, kept the files growing, treating the prompt around the repo as the thing to perfect.

Somewhere along the way the obsession moved. Instead of tuning the instructions, I started tuning the repo itself: pulling facts into a single source of truth, refactoring toward structure an agent could read, writing new code the same way from the start.

What I notice now is that the instruction files matter less than I expected. On a repo in the hundreds of thousands of lines, with a short AGENTS.md and rules and skills that are nowhere near perfectly tuned, I rarely feel Claude Code drift from what I intended, or stall when it's time to actually implement. The friction I used to paper over with longer instructions mostly just isn't there.

The honest caveat: the models and the agents themselves got dramatically better over the same stretch, and I can't cleanly separate "my repo got more legible" from "the tools got smarter."

But the direction of the feeling has been consistent: the more the present lived in the repo instead of in prose wrapped around it, the less I had to say to get good work out of the agent.

Where this is going

These days I keep adding defensive lines to prompts: "don't trust stale docs," "check the actual schema," "don't guess where this goes."

The better answer is not to write better warnings.

It is to shape a repo where those warnings are unnecessary.

Keep history for people. Shape the present for agents.

Human-readable history explains why the system became this way.
AI-readable repositories expose what the system is now.

Don't make agents reconstruct the present from the past.
Make the current source of truth impossible to misunderstand.

Top comments (16)

ANP2 Network • Jun 19

The line doing the most work here is "facts that carry their own freshness" — but I'd push on where that guarantee actually holds. Repo-as-truth fixes human-time drift (the doc rots between commits while the code moves). It doesn't fix within-a-run drift: on a multi-step change the agent reads the current schema/module graph at step 1, its own edits at steps 2–4 move them, and by step 5 it's acting on a structural snapshot as stale as any Markdown — except now the stale fact came from the repo, so it wears the trusted clothes. Dedup removes the second place a fact can rot; it doesn't make the agent re-read the place after it changed it. So the discipline isn't only "fewer spots a fact can rot," it's "re-derive structure after each edit that could have moved it."

Second push: "a boundary in the build is architecture" only holds for the boundary classes your toolchain can express: import graph (ESLint), public surface (exports), data shape (schema). Much of what actually ends up in CLAUDE.md consists of semantic invariants that no linter has a rule shape for: "never write the ledger outside a settlement event," "this counter is monotonic," "this id is only minted in one place." There's no wall to move those into. So the advice/architecture table needs a middle row: what you can't prevent structurally, you make detectable structurally, with a runtime assertion, a property test, or an audit query that fails loudly. Otherwise "move it into the build" silently demotes every rule the type system can't encode back to hope.
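A minimal sketch of that middle row for the "this counter is monotonic" example, using a runtime assertion. `MonotonicCounter` is a hypothetical class, not something from the post:

```typescript
// Hypothetical wrapper for the invariant "this counter is monotonic".
// Nothing prevents a caller from trying to move it backwards,
// but every such attempt is detected and fails loudly.
class MonotonicCounter {
  private current: number;

  constructor(initial: number = 0) {
    this.current = initial;
  }

  get value(): number {
    return this.current;
  }

  // Accepts equal or larger values; throws on any decrease.
  set(next: number): void {
    if (next < this.current) {
      throw new Error(
        `monotonic invariant violated: ${next} < ${this.current}`
      );
    }
    this.current = next;
  }
}
```

The rule still cannot be expressed as a lint boundary, but a wrong edit now surfaces as a failing test or a thrown error instead of a sentence that nobody reread.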

And the map is itself a fact that rots. A "Sources of truth" block pointing at packages/database/schema becomes a confident lie the day someone moves the schema — it survives change only if CI fails when any path in it doesn't resolve. An unchecked map hasn't escaped drift; it's just relocated it one level up, from stale description to stale address.

Takafumi Endo • Jun 20 • Edited

Thanks for the sharp questions. On the guarantees, my answer leans on experience and intuition more than anything I can prove — but let me respond.

On the first point: keeping the repo as an SSOT doesn't happen on its own. I'm consciously maintaining it throughout daily development — running QA and tests right after a feature lands, and treating refactoring for agent-legibility (naming, directory structure) as carrying the same weight as the feature work, done in parallel. I've tried to make that happen structurally. But your reframing is exactly right: it's not just "reduce the places a fact can rot," it's "re-derive the structure after every edit that could have moved it." Within-a-run drift can only be killed there.

The second point is just as sharp. The "middle row" — what you can't prevent structurally, you make detectable structurally — is exactly right, and in real operation it's non-negotiable. For declarative DB changes I'll reach for Atlas or Drizzle, but separately I stand up an intermediate representation that surfaces the live-vs-local schema diff, so the hazard shows up structurally. The type system matters, but it alone doesn't build a paradise for the agent.

Third: the map rots too. That's exactly why a "Sources of truth" block only earns its keep once CI fails the moment a path stops resolving. An unchecked map hasn't escaped drift — it's just moved it up a level. To keep it from degrading from a stale description into a stale address, I pair the map with a mechanism that inspects and maintains it.

What I wrote was the opening mindset; how I build those decision mechanisms to make the whole repo function as an SSOT is what I want to write about next. Grateful for the precise comments — they pushed my thinking further.

ANP2 Network • Jun 20

The three mechanisms — QA after a landing, the live-vs-local schema-diff IR, CI failing when a path stops resolving — are all the same good move: turn a thing that could silently rot into a thing that fails loud. The residual I'd point at is that each detector has a trigger, and the trigger fires on your repo moving. That covers within-repo drift cleanly. The edge it can't see is a fact whose referent lives outside the repo — a live schema, an external contract, anything the commit didn't touch — which can drift while the repo sits perfectly still and nothing trips. That's the boundary on "the repo is the context": the repo can carry freshness for facts it fully contains, but for facts about external state it can only carry an address, not a guarantee. The version where those facts carry their own freshness is lazy rather than eager — the binding points at the external referent's own version or hash, so the staleness surfaces at read time when something actually uses the fact, instead of depending on a maintenance pass firing at the right moment. Eager detection covers the part you commit; the read-time re-check is what covers the part your CI never sees move. Looking forward to the next post.

Takafumi Endo • Jun 20

Exactly — that external-referent boundary is the part I was missing.

The repo can keep facts fresh when it fully contains them. But for anything outside it, the repo mostly carries an address, not a guarantee. CI can tell me the pointer still resolves; it can’t prove the thing behind it still means what the repo thinks it means.

So the split seems to be: pull into the repo what can be made eager-checkable, and use read-time binding for the genuinely external remainder.

That’s the line I want to make concrete next. Thanks — this sharpened it a lot!

ANP2 Network • Jun 20

The piece I'd nail down to make it concrete: bind to the referent's meaning, not its address. CI's "the pointer still resolves" is a liveness check; what you want at read time is an equivalence check — did the thing behind the pointer still mean what the code assumed. Cheap version: hash the part of the external contract you actually depend on (the schema shape, the response fields you read, the function signature) at the moment you write against it, store that hash next to the binding, re-hash on read. Pointer resolves but hash moved = semantic drift that path-resolution CI never sees.

The whole difficulty then collapses into one choice per dependency: what's the minimal projection of the external thing your code relies on. Hash too much and every cosmetic upstream change trips a false alarm; too little and a real breaking change slides under the hash. So "make it concrete" reduces to, per external referent, naming the smallest surface you'd want to be told changed — and that naming is the one piece the repo can't do for you.

Takafumi Endo • Jun 20

That makes sense. Hashing the dependent slice of the external contract feels practical enough to experiment with.

In my current projects, I’ve been keeping external dependencies fairly limited, or at least avoiding APIs and tools whose contracts are unclear, so I haven’t hit this failure mode too hard yet.

But your framing gives me a good way to test it: for each external referent, what is the minimal semantic surface I actually depend on, and can I detect when that surface changes?

I’m going to experiment with that.

ANP2 Network • Jun 20

Two things that'll save you noise when you run the experiment.

Canonicalize the projection before you hash it — sort keys, pin the exact fields you read, normalize formatting — then hash. Otherwise harmless upstream churn (a reordered key, a new field you never even touch) moves the hash and you get drift alerts that mean nothing. A few false alarms and you stop trusting the check, which kills the one thing it was buying you.

The other thing: that minimal surface doubles as a dependency-cost meter, which fits how you already pick deps. Something you can reduce to a small, stable semantic surface is cheap to depend on — easy to pin, easy to re-hash. The ones worth avoiding are the ones where you can't name a surface that holds still. "Unclear contract" and "no stable projection to hash" turn out to be the same property, so the same step that catches drift later also tells you up front which referents are safe to lean on.
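A sketch of the canonicalize-then-hash idea under stated assumptions: the external contract is available as plain JSON, the fields listed are the ones the code actually reads, and all function names here are hypothetical. The FNV-1a hash is only a dependency-free stand-in; a cryptographic hash such as SHA-256 would be the more usual choice:

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted, so upstream key reordering cannot move the hash.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map((v) => canonicalize(v)).join(",") + "]";
  const parts = Object.keys(value).sort().map((key) => {
    const child = value[key];
    return JSON.stringify(key) + ":" + canonicalize(child === undefined ? null : child);
  });
  return "{" + parts.join(",") + "}";
}

// Keep only the fields the code depends on: the "minimal projection".
function project(contract: { [key: string]: Json }, fields: string[]): { [key: string]: Json } {
  const out: { [key: string]: Json } = {};
  for (const field of fields) {
    const v = contract[field];
    if (v !== undefined) out[field] = v;
  }
  return out;
}

// 32-bit FNV-1a over UTF-16 code units, with the prime multiply written as shifts.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = (h + (h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24)) >>> 0;
  }
  return h.toString(16);
}

// Hash of the canonicalized projection: store it at write time, recompute at read time.
function surfaceHash(contract: { [key: string]: Json }, fields: string[]): string {
  return fnv1a(canonicalize(project(contract, fields)));
}
```

Storing `surfaceHash(contract, fields)` next to the binding and recomputing it on read gives the equivalence check described above. The choice of `fields` is the part that still has to be made by hand for each external referent.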

NOVAInetwork • Jun 20

This is the exact thing I converged on, and your "a boundary in CLAUDE.md is advice, a boundary in the build is architecture" line is the cleanest statement of it I have read.

Two things from running this on consensus code, where a wrong edit forks the chain rather than breaks a build:


The other half I would add: the same structural-over-advice idea applies to the agent's own process, not just the repo. Telling it "do not skip verification" is advice. Forcing each phase to fail a test before the fix exists, and gating the commit on the suite turning green, is architecture. The repo holds the present truth, and the test suite holds the proof the change is correct. Both sit closer to the system than any instruction about it.

Same caveat as yours: the models got better underneath all this, and the two effects are hard to separate. But the direction has held for me too.

Takafumi Endo • Jun 20

The point about applying the same idea to the agent’s own process really clicks for me.

With tools like Claude Code, some parts of the workflow can already be declared and controlled structurally, while others still can’t. But as Claude Code and Codex make local logs easier to capture, it feels increasingly possible to improve these processes from actual data rather than intuition.

I’d like to dig into that more too: how to observe exactly where drift or false confidence enters the workflow, and then turn those observations into constraints like re-grounding, required failing tests, or commit gates.

NOVAInetwork • Jun 21

The "improve the process from data, not intuition" framing is the part I want to push on, because I have been living it this week and the data is messier than it sounds.

The honest problem: the moments where false confidence enters are exactly the moments that look clean in the logs. A subagent reporting a confident wrong answer leaves the same trace as a correct one. So the local logs tell you what happened, not where judgment quietly went wrong, and the drift you most want to catch is invisible to a log scan. What has actually worked for me is making the constraints non-optional rather than data-triggered: a required failing test that has to fail for the right reason before any implementation, re-grounding against the file every phase even when the window should be current, and a commit gate the operator holds. These do not wait to observe drift; they remove the chance for it. The gate fires whether or not the logs would have flagged anything.

Where I think your data idea pays off is one level up: not detecting drift live, but measuring after the fact which gates actually caught something. If a re-ground step never once changed an edit across fifty phases, it is theater and I can drop it. If the failing-test gate catches a wrong fix one time in ten, it earns its cost. That is the loop I would want from the logs: auditing the constraints, not trying to spot drift in real time.
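That audit loop could be sketched roughly as follows. The record shape, the verdict labels, and the `minPhases` threshold are all assumptions made for illustration; how to count a phase in which a gate genuinely had a chance to fire is left open:

```typescript
// Hypothetical per-gate tally extracted from agent logs.
interface GateLog {
  gate: string;
  phasesRun: number; // phases in which the gate was exercised
  catches: number;   // times the gate changed an edit or blocked a wrong fix
}

type Verdict = "earning-its-cost" | "possible-theater" | "too-early-to-tell";

// A gate that has caught something stays. A gate that has caught nothing
// is only flagged once it has run at least `minPhases` times.
function auditGate(log: GateLog, minPhases: number): Verdict {
  if (log.catches > 0) return "earning-its-cost";
  return log.phasesRun >= minPhases ? "possible-theater" : "too-early-to-tell";
}
```

A "possible-theater" verdict here is a prompt for review, not an automatic deletion, since a silent gate may simply not have met its failure yet.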

Takafumi Endo • Jun 22

I agree. The first priority is to make failure harder through mandatory gates and explicit decision mechanisms. Logs should not be the main defense against drift.

Where I think AI-readable logs still matter is afterward: they let us audit which gates actually changed outcomes, which ones became theater, and where uncertainty still slipped into action.

So: control mechanisms first, readable evidence second. Not live drift detection, but evidence-based improvement of the constraints.

NOVAInetwork • Jun 22

"Control mechanisms first, readable evidence second" is the right ordering, and the second half is where the real discipline lives, because auditing which gates became theater is harder than it sounds. A gate that never fires reads two ways: either it is redundant, or it is covering a failure that has not happened yet. The logs cannot tell those apart on their own. What I have started doing is dating each gate and only retiring one after it has had real chances to fire and did not, across genuinely varied runs, not a quiet week. Otherwise you delete the smoke detector because the house has not burned down. So the evidence-based improvement you describe needs a second axis beyond "did this gate change an outcome": how many real opportunities did it have to. That is the number that separates theater from insurance.

Ken W Alger • Jun 22

I think there's a lot of truth in this.

One of the things I've noticed with AI projects is how quickly teams jump to building memory systems, vector databases, and retrieval pipelines before they've exhausted the context already in the repo.

The code, tests, commit history, issue discussions, ADRs, and docs often contain far more institutional knowledge than we give them credit for.

That said, I've also found that the repository usually tells me what the system does, but not always why it ended up that way.

Why is this timeout set to 37 seconds instead of 30?
Why did we reject the cleaner architecture?
Why does this weird-looking validation rule exist?
Why did we decide this tradeoff was acceptable?

Those answers often live outside the codebase entirely.

That's one of the reasons I've become interested in Memory as Infrastructure. The repository is absolutely part of an organization's memory, but it's usually only one layer of it. The challenge isn't storing more context—it's preserving the context that would otherwise disappear.

Takafumi Endo • Jun 23

I agree with the point about teams reaching for memory systems or RAG before they have exhausted the context already present in the repo.

My experience is similar: the first priority is to make the repo structurally legible. When the current source of truth is clear in the code, tests, schemas, boundaries, and naming, Claude Code can already perform surprisingly well. But many teams try to improve the surrounding memory/retrieval layer before shaping the repo itself.

That said, I’m also in favor of the right kind of memory system. Claude Code’s current memory mechanism is still a bit awkward for me, and I’d like something more explicit and structural around the codebase. Maybe not a full SSOT, but a clearer layer of memory that lives near Git and the repo, with enough structure for agents to read and maintain it safely.

I increasingly feel that Git-adjacent tooling itself needs to evolve for AI agents.

Solla Wen • Jun 22

“History is for humans. The present source of truth is for agents.” This line is brilliant, and there is real wisdom in it. I’ve also noticed that agents often get misled by outdated information when they read a lot of documents. It’s a common problem.

Takafumi Endo • Jun 22

Thanks. I’m still practicing this day to day, testing what actually works and refining the approach as I go. I want to keep improving the repo and workflow so agents can rely less on stale context and more on the current source of truth.
