Your agent relearns your codebase every session. That's the cold start.

Akash Goenka — Mon, 27 Apr 2026 04:33:40 +0000

Agents don't take advice. They take evidence. Here's what I deleted before I understood that.

I work with coding agents every day. And every day, in every new session, the agent starts over.

It greps around my repo. It opens files it doesn't need. It works out how the pieces connect — the same pieces it worked out yesterday, and the day before. Somewhere in there it rediscovers the one gotcha in the config loader that it had already traced, carefully, last week.

The agent isn't the memory. I am. Every session, I pour the context back in from my own head before we can do any real work.

The obvious fix doesn't hold

Write it down. Put it in CLAUDE.md, or AGENTS.md, or whatever your tool reads at startup.

I went further than that once. A whole docs/ folder — the important things about the codebase written out properly, so that any agent, or any human, could get oriented without me. It was good for about a week.

The problem was never writing it. The problem is that keeping it true is a job, and it's a job with no deadline, no ticket, and nobody chasing you for it. So it doesn't get done. The code moves, the docs don't, and now you have something strictly worse than no documentation: a confident, out-of-date map that an agent will follow straight into a wall.

I didn't want to write more documentation. I wanted context that maintains itself.

That's what I've been building. It's called coldstart, and this post is mostly about the things I built and then deleted along the way, because that's the part I'd actually want to read.

What version one taught me

The first version was an MCP server with four tools. It worked, sort of, and using it taught me three things I didn't expect.

Give an agent a free hand with a tool and it will abuse it. I assumed a capable model handed a good tool would use it well. It doesn't. Search has no natural structure, so the agent fires off grep after grep, each one an independent guess, trying to triangulate onto an answer from a scatter of weak signals. It's not stupid — it's doing exactly what you'd do with no map. But it's expensive, and it's slow, and it happens every single session.

So four tools became two. Fewer doors, less improvisation.

Agents don't believe bare tool output. This one changed the design. Hand an agent a ranked list of files and it will not act on it. It goes and checks. What makes it trust the result is seeing actual code alongside the ranking — the lines that matched, the symbol that's defined there, the receipts. Once the evidence is in front of it, it moves. Without it, the tool call was just a suggestion, and the agent quietly did the work itself anyway.

So find doesn't return a list. It returns a list with the evidence attached.

And agents guess line numbers. Watching sessions, I kept seeing the model reach for a file, decide roughly where the thing it wanted lived, and read the wrong range. Then read again. So the second tool, gs, exists to give the shape of a file cheaply — the symbols, their line ranges, who calls them — so that when the agent finally does a real read, it reads the right twenty lines instead of guessing at two hundred.

Both of those are the same lesson wearing different clothes, and it's the subtitle of this post. The agent doesn't want to be told. It wants to be shown.

The graveyard

Coldstart got simpler as it got better. Almost every clever idea I had died, and the interesting thing is that most of them died the same way.

Making the ranking clever. The first ranker had more signals in it than the one that shipped. The one I was most attached to was graph centrality — the intuition being that a file lots of other files import must be an important file, so it should rank higher. It sounds obviously right. It's wrong. The most-imported files in any repo are the utils, the constants, the type definitions. Centrality quietly promotes the least useful results in the codebase. I started calling it the hub trap, and then I deleted the signal. What's left is deliberately blunt: how many of your query terms does this file actually cover, and does it define them or merely mention them.

Nudging the agent's decisions. If it won't read my descriptions, maybe it'll respond to a nudge at the right moment. It won't. Advisory rules get ignored, or pattern-matched around, or they steer behaviour somewhere I didn't intend. Every attempt to persuade the agent failed, and the ones that "worked" only worked because they'd become hard constraints rather than advice. That's when surface, don't steer stopped being a philosophy I liked and became a finding I had to live with.

Freshness by timer. Both the index and the notes originally aged out on a TTL, the way a cache does. This is wrong, and it took me longer than it should have to see why: time doesn't make knowledge stale. Code changes do. A note written six months ago about a file nobody has touched since is perfectly good. A note written this morning about a file you refactored at lunch is garbage. TTLs went in the bin and got replaced with content hashing.

A "map my repo" command. The tempting onboarding feature — one command, point an LLM at the whole codebase, generate notes for everything. I rejected this one before building it, and I'm glad. Batch synthesis produces exactly the thing the notebook exists to prevent: confident prose that nobody verified, generated by a model that was never actually solving a problem in that file. It would have mass-produced the failure mode, with a bill attached.

The pattern in all four graves is the same. Every time I tried to make coldstart smarter, the answer was to make it more honest instead.

The detour I didn't take

There was a point where I nearly went in a completely different direction.

I'd noticed agents re-reading the same files across sessions, and I thought: the transcripts are sitting right there on disk. Scrape them. Work out which files turned out to matter for which kinds of question, and build an index of that. It's local, it's deterministic, it fits everything else I believe about this problem.

I didn't build it, and the reason I didn't is the reason the rest of this project exists.

Capturing what an agent learned is the easy half. Keeping it true is the hard half. A big pile of harvested knowledge with no mechanism to invalidate it isn't memory — it's a faster way to be confidently wrong. And the code underneath it moves every day.

So I went at the hard half first.

What survived

The notebook. It's the part I care about, and it rests on one distinction:

It doesn't store a journal. It stores what an agent learned about the codebase, anchored to the codebase.

That's not a small difference. Most "give your AI memory" tools are storing a record of your conversations — what was said, what was done, in what order. That decays into archaeology. What I want isn't a diary of past sessions; it's the durable facts about this repo that a session happened to uncover. How a flow moves across files. Why a piece of code is the way it is. The trap that isn't visible from reading the code, only from having been burned by it.

Here's how it holds together.

An agent writes notes at the end of a task — while it still has the full context, having actually read the code and fixed the thing. Not later, not in a batch, not by a model that was never there.

Every note carries anchors: the real files and symbols its claims rest on, stamped with a content hash at the moment it was written. The index re-checks those stamps as the code changes. When the evidence moves, the note says so — it renders as [evidence changed], drops down the search results, and degrades into a labelled hypothesis instead of continuing to assert a fact it can no longer support.

That's the whole trick, and it's why the notes can be trusted at all. The codebase is the source of truth, and a note that no longer matches it admits it.

Writes go through a gate, so a new note is checked against what's already there and either merges into an existing one or is explicitly declared new. Notes don't pile up into a swamp.

You can just look at it

None of this is a black box you have to take on faith. coldstart kb view opens the notebook in your browser — every note the agents have written, what each one is anchored to, and whether it still holds.

Here it is on Arches, an open-source (AGPL-3.0) heritage data platform I pointed coldstart at as a test. A large Django codebase I have never worked on.

115 notes and 22 flows, written by agents doing real tasks in a repo they'd never seen before. The note open here is a hub note on models.py — a file too big for one summary, so it's split into facets, one per symbol.

Look at what a facet actually says. Not "this is the ResourceInstance model" — I can get that from the class name. It says that get_instance_creator() quietly falls back to a settings default when no EditLog record exists. That's the kind of thing you only learn by having been in there, and exactly the kind of thing you've forgotten by next Tuesday.

Above it sits the anchor block: the verified code addresses those claims rest on. Every note in this notebook is fresh, because nothing has moved under them since they were written. Change one of those files and the note flips to [evidence changed] — here, in search results, everywhere it surfaces. The green dot is load-bearing, not decoration.

And the framing I keep coming back to, because it's the honest one:

Written by agents. Kept honest by the codebase.

Not "the agents maintain it, so you don't have to." That would be a lie, and everything in the section above about version one tells you why — agents given a free hand produce junk. The notes are trustworthy because there's a mechanism adjudicating them. Take the mechanism away and you've just automated the production of rotting markdown.

And it accrues. This is the part that only shows up over time. One note is a convenience. A few hundred, written across months of real tasks and continuously re-checked against the code, is something else — a picture of the codebase that no single session could have produced, built by the agents that were actually in there doing the work. The second time a question comes up, the answer is already sitting there.

And it can be shared. The notebook is private to you by default — nothing leaves your machine unless you say so. But if you want it, one command commits it to the repo as an append-only log that merges cleanly across branches, so parallel work doesn't collide. Then it isn't your agent's memory any more, it's the team's. Your colleague's agent starts from what yours already worked out, and the notes keep getting checked against the same code you're all changing.

The boring layer underneath

Under the notebook sits the navigation, and it's deliberately dumb.

find answers "which files are about this?" and gs answers "what is this file, and who uses it?" Both run against a deterministic index built by static analysis — file paths, symbol names, exports, the import graph. Ranking is by how many of your query terms a file actually covers, weighted toward terms that name something declared in it. No embeddings. No model in the query path. No service to run. You can always see why a file came back.

"Static" here means static analysis, not stale: a background process watches the repo and updates the index within seconds of an edit, uncommitted ones included.

There is no API key anywhere in coldstart, and no bill. Nothing is sent off your machine — not to me, not to anyone. The index is parsing, not inference. And the notes are free too, in the sense that matters: the agent writes them as part of the task it was already doing, out of the context it already had in front of it. There is no second model reading your codebase to summarise it. That was the whole point of refusing batch synthesis — the moment you need a separate pass over the repo to generate knowledge, you have signed up for a cost, a delay, and a model forming opinions about code it was never asked to fix.

It's in this post for one reason only. It's the thing that makes the notes verifiable. The notebook's freshness stamps come from the index, and without them the notes would just be a wiki with better branding.

Where this leaves me

I wrote a lot to get here, and most of it wasn't worth keeping. Four tools became two. Descriptions, nudges, TTLs, batch synthesis — all built or seriously considered, all gone.

Which is, I've come to think, the same argument the notebook is making. What's worth keeping was never the transcript of how you got somewhere. It's the residue that survived contact with the code.

coldstart is open source, MIT, pure Node.

npm install -g @cstart/coldstart
cd your-project
coldstart init

Works with Claude Code, Cursor, and Codex.

Site: https://akashgoenka.github.io/coldstart/
GitHub: https://github.com/AkashGoenka/coldstart
npm: https://www.npmjs.com/package/@cstart/coldstart

If you try it, I'd genuinely like to know where it falls over. Most of what's in this post came from me and a couple of friends using it and something breaking.

DEV Community: Akash Goenka