I kept watching agents do the same thing. Given a real task in a real codebase, they'd spend the first half of the session navigating — grepping, opening files, reading imports, going back, grepping again — and only after that get to the part they're actually good at: reasoning, planning, writing code.
The context window isn't infinite. Every token spent locating a file is a token not spent thinking about it.
So I built coldstart. It's an MCP server that gives agents a static index over a codebase — file paths, exported symbol names, path segments — built once with Tree-sitter, queried instantly. Four tools, no embeddings, no vector database, no model. The whole point is to find the right file fast and get out of the agent's way.
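To make "built once with Tree-sitter" concrete, here's a minimal sketch of pulling exported symbol names out of a JavaScript source string with the Tree-sitter node bindings. This shows the flavor of the thing, not coldstart's actual code; the `exportedSymbols` helper is mine.

```typescript
// Minimal sketch, not coldstart's code: extract exported symbol names
// from a JavaScript source string using the Tree-sitter node bindings.
import Parser from "tree-sitter";
import JavaScript from "tree-sitter-javascript";

const parser = new Parser();
parser.setLanguage(JavaScript);

function exportedSymbols(source: string): string[] {
  const names: string[] = [];
  const tree = parser.parse(source);
  for (const exp of tree.rootNode.descendantsOfType("export_statement")) {
    // The first identifier under an export is the exported name in the
    // common cases (`export function f() {}`, `export const x = ...`).
    const id = exp.descendantsOfType("identifier")[0];
    if (id) names.push(id.text);
  }
  return names;
}

console.log(exportedSymbols("export function buildIndex(root) {}"));
// -> [ 'buildIndex' ]
```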
This is a post about why it ended up that small, what I deliberately didn't build, and what I learned from agents using it on real codebases.
## The problem isn't intelligence, it's navigation
Agents are most valuable when they're reasoning about code, not searching for it. But navigation is where they quietly burn context. A typical session in a large codebase looks something like:
- Grep for a likely keyword. Get 200 lines of matches.
- Open three or four files to figure out which one is actually relevant.
- Realize the right file is named something different and grep again.
- Trace imports manually.
- Finally start the actual task — with significantly less context window left.

Larger context windows don't fix this; they just delay it. An agent that wastes tokens navigating has less left for the work that matters. The cost isn't theoretical — it shows up as worse answers near the end of long sessions, more re-reading of files the agent already saw, and tasks that get abandoned mid-flight because there's no room left to finish.
The framing I landed on: context window is finite regardless of price. An agent that navigates efficiently stays in its best-value zone longer.
## What coldstart actually is
Straight from the README:
> coldstart is a lightweight navigation layer for AI agents. It answers one question: which files are relevant to this task? No embeddings, no graph, no model to run or maintain. Just a fast, static index over your codebase — file paths, symbol names, exports — built once, queried instantly. Agents are already good at reading code, tracing logic, and reasoning about structure. What they don't need is another system trying to do that for them. coldstart stays out of the way: find the file, hand it off, done. 4 tools. Minimal context overhead. No infrastructure to babysit.
The mental model is "smarter grep." Grep matches strings; coldstart matches strings and knows which ones are exported symbols, which are path segments, and which are rare enough to be meaningful signals. Once it points the agent at the right file, the agent does the rest. That's the whole tool.
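In code terms, the difference from grep is that matching happens against a prebuilt token table rather than against file contents. A toy version of that table, with invented names and files throughout, might look like:

```typescript
// Toy version of the index, invented names throughout: each token knows
// what kind of signal it is and which files it points at.
type Kind = "export" | "path-segment";
type Entry = { file: string; kind: Kind };

const index = new Map<string, Entry[]>();

function add(token: string, entry: Entry) {
  index.set(token, [...(index.get(token) ?? []), entry]);
}

add("buildIndex", { file: "src/indexer.ts", kind: "export" });
add("indexer", { file: "src/indexer.ts", kind: "path-segment" });

// A query is a lookup, not a scan. A token that maps to two files is a
// strong signal; one that maps to two hundred is closer to noise.
console.log(index.get("buildIndex"));
// -> [ { file: 'src/indexer.ts', kind: 'export' } ]
```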
## What I deliberately didn't build
Most of the design effort went into things I chose not to add. Each one was tempting and each one would have made the tool worse for what agents actually need.
No recommendations or next-step suggestions. That's the agent's job. A navigation tool that tries to also be a reasoning tool ends up doing both poorly; worse, it anchors the agent on whatever heuristic the tool used. I'd rather hand back a clean list of files and let the agent decide.
No semantic search or embeddings. Embeddings add an entire failure mode — model versioning, index rebuilds when the embedding model changes, cost, latency, dependency on a hosted service or a shipped model file — without proportional gain for navigation. For finding files by symbol name, lexical retrieval is faster, more predictable, and easier to debug. (For conceptual queries — "where's the retry logic" — embeddings genuinely help. coldstart isn't trying to be that tool.)
No path prefix filters or file-type filters. That's grep's job. coldstart's value is what grep can't do: finding files by exported symbol names and structural signals. If you need a glob, you already have one.
Ranking is fixed. I went through a lot of iterations on ranking — penalties, exact-match filters, length thresholds, IDF weighting — and what shipped is what held up across real queries on real codebases. I'm done tweaking it. If I keep tuning, I'm just overfitting to whatever I tested last week.
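For context on that last iteration: IDF weighting is the standard trick of discounting a token by how many files contain it. A generic version (not the formula coldstart shipped) looks like this:

```typescript
// Generic IDF weighting, not coldstart's shipped formula: a token that
// appears in few files gets a high weight; one in nearly every file ~0.
function idf(totalFiles: number, filesContainingToken: number): number {
  return Math.log(totalFiles / (1 + filesContainingToken));
}

idf(1000, 2);   // ~5.8 -> rare token, strong signal
idf(1000, 900); // ~0.1 -> common token, near-noise
```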
The pattern across all four: every feature I didn't add was one that would have made coldstart bigger without making it more useful for the specific job of "find the file fast and disappear."
## What I learned from real agent runs
Two things stood out once I watched agents actually use it.
Soft failures are recoverable; hard failures aren't. When coldstart returns too many results, the agent narrows the query and tries again — that's a soft failure, and agents handle it fine. What kills an agent is a tool that returns zero results with no signal, or a confident wrong answer. coldstart is designed so its failure mode is "too much" rather than "nothing" — agents can always work with too much.
Confidence scores are decoration unless they mean something. I tried adding confidence scores early on. They were meaningless — every result came back at roughly the same number — and the agent would over-anchor on them. I removed them. If a score doesn't differentiate results, it's just noise the agent has to interpret.
There's a more detailed benchmark coming, comparing coldstart to a graph-based codebase analysis approach on real queries. I want to do that properly with numbers rather than vibes, so it's getting its own post.
## The honest limitations
A few things coldstart doesn't do well, in case you're evaluating it:
- Cross-file call resolution is partial. Named function calls across files are resolved; member-expression calls (`this.foo()`, `obj.foo()`) are not. This is a Tree-sitter limitation I haven't fully solved (see the sketch after this list).
- It's lexical, not semantic. If you ask "where do we handle authentication" and no file has the word "auth" in its path or exports, coldstart won't find it. Use it for symbol-shaped queries, not concept-shaped ones.
- Ten languages are parsed for symbols: TypeScript, JavaScript, Java, Ruby, Python, Go, Rust, C#, PHP, Kotlin. Swift, Dart, and C++ files are walked and indexed by path — so agents can still find them — but symbols, imports, and call edges aren't extracted yet, so `trace-deps` and `trace-impact` stop at those boundaries. Adding full parsing is a Tree-sitter grammar wire-up per language.
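To make the first limitation concrete, here's what resolves and what doesn't, using a hypothetical module and names:

```typescript
// Hypothetical module and names, just to show the call shapes.
import { buildIndex } from "./indexer";
import * as indexer from "./indexer";

buildIndex("/repo");         // named cross-file call: the edge is resolved

indexer.buildIndex("/repo"); // member-expression call: not resolved, the
                             // receiver isn't analyzed statically; the same
                             // applies to this.foo() inside a class
```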
## Try it

```bash
npm install -g coldstart-mcp
```
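If your MCP client uses the common `mcpServers` config shape (Claude Desktop and Claude Code both do), registering it should look roughly like this. I'm assuming the package installs a `coldstart-mcp` binary; check the README for the exact invocation.

```json
{
  "mcpServers": {
    "coldstart": {
      "command": "coldstart-mcp"
    }
  }
}
```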
GitHub: github.com/AkashGoenka/coldstart
npm: npmjs.com/package/coldstart-mcp
If you want the long version of how the design got here — the failed ranking iterations, the AST decisions, what broke and what held up — there's a deep-dive post covering that.