Building coldstart: what broke, what held up

This is the long version. If you want the short pitch for coldstart — what it is and why it exists — read the main post first. This one is for people who want to see the iteration story: the design decisions that didn't work, the ones that did, and why.

The arc is roughly: I started with a simple idea, hit real codebases, and watched it break in interesting ways. Each section below is a thing that broke and what I did about it.

The starting point: one folder-path domain per file

The first version of the index assigned each file a single "domain" based on its folder path. The idea was straightforward — files in src/auth/ belong to the auth domain, files in src/billing/ belong to billing, and so on.
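
In sketch form, assuming the domain was cut from the first folder under the root (the actual cut point varied):

// v1: one domain per file, derived from its folder path (a sketch of the idea)
function domainOf(filePath: string): string {
  return filePath.split("/")[1] ?? "root"; // "src/auth/login.ts" -> "auth"
}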

This held up for about ten minutes of real-world testing.

The problem: deeply nested files lost all specificity. A file at src/features/billing/components/invoices/list/InvoiceListRow.tsx would get tagged with list or invoices or billing depending on where you cut, and none of those is uniquely identifying. Two completely unrelated Row.tsx files in different feature trees would collide on the same domain. Worse, the agent had no way to query for "the invoice list row" because the domain was just one slice of the path.

The fix was to stop thinking about a single domain and start thinking about all the meaningful tokens.

Domains as an array of tokens

I moved to a domains[] array — every meaningful token from the path segments plus every exported symbol from the file. So InvoiceListRow.tsx at src/features/billing/components/invoices/list/InvoiceListRow.tsx would index as something like:

{
  path: "src/features/billing/components/invoices/list/InvoiceListRow.tsx",
  domains: [
    "features", "billing", "components", "invoices", "list",
    "invoice", "list", "row",                    // from filename, split on case
    "InvoiceListRow", "InvoiceListRowProps"      // exported symbols
  ],
  exports: ["InvoiceListRow", "InvoiceListRowProps"]
}
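
The extraction itself is simple. A minimal sketch, assuming separator characters and camelCase boundaries are the split points (coldstart's exact rules may differ):

// Split an identifier like "InvoiceListRow" into ["invoice", "list", "row"]
function splitIdentifier(name: string): string[] {
  return name
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2") // mark camelCase boundaries
    .split(/[\s\-_.\/]+/)                   // split on separators
    .filter(Boolean)
    .map(t => t.toLowerCase());
}

function buildDomains(filePath: string, exports: string[]): string[] {
  const segments = filePath.split("/");
  const dirTokens = segments.slice(1, -1); // folder names, minus the root dir
  const fileName = segments.at(-1)!.replace(/\.\w+$/, ""); // drop extension
  return [...dirTokens, ...splitIdentifier(fileName), ...exports];
}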

This worked much better. A query for "invoice list" would hit both the path segments and the symbol name. A query for "InvoiceListRow" would hit the export directly.

Then I tried adding import paths as a token source — the reasoning being "files that import from auth/ are probably auth-related." This was a mistake.

A middleware file that imports from many feature-specific files (a common pattern — global router config, a top-level layout component, an API client setup file) would start matching every query for any of those features. Pure noise. The middleware file was structurally important but not about any one feature, and indexing its imports made it look like it was about all of them. I pulled it back out.

The lesson: what a file imports tells you about its dependencies, not its identity. Identity comes from the file's own path and exports. That's the boundary I drew.

The substring-matching disaster

Early on, matching was substring-based. A query token would match an index token if it appeared anywhere inside it. This seemed reasonable — "user" should match "UserProfile", after all.

It caused cascade failures.

The token "in" is a substring of "login", "signin", "settings", "admin", "binding", "PluginConfig", and roughly a thousand other tokens. A query like "sign in form" would tokenize to ["sign", "in", "form"], with "in" matching as a substring across hundreds of unrelated files, and the result list would balloon with files that had nothing to do with sign-in flows.
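
The matcher was essentially this (a reconstruction of the behavior, not the original code):

// Substring matching as the first version did it
const matches = (queryToken: string, indexToken: string) =>
  indexToken.toLowerCase().includes(queryToken.toLowerCase());

matches("in", "login");        // true
matches("in", "settings");     // true
matches("in", "PluginConfig"); // true, and so on for hundreds of files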

I tried fixes in this order:

  1. Length-based penalties — penalize matches where the query token is much shorter than the index token. Helped a little; broke for legitimate short tokens like id, db, api.
  2. Minimum length thresholds for substring matching — only allow substring matches if the query token is at least N characters. Cut some noise; introduced new false negatives where a 3-character token was actually meaningful.
  3. Exact-match-only with a fallback — match exactly first, fall back to substring only if no exact matches exist. Better, but the fallback still reintroduced the noise on queries that legitimately had zero exact matches.

None of these felt principled. They were all heuristics layered on a fundamentally noisy signal.

IDF-based rarity scoring

What finally worked was scoring tokens by inverse document frequency at index-build time. Common tokens — index, utils, helper, component, types — get low weight. Rare tokens — your specific feature names, your specific symbol names — get high weight.

// At index build time, compute IDF for every token
function computeIDF(tokenCounts: Map<string, number>, totalFiles: number) {
  const idf = new Map<string, number>();
  for (const [token, count] of tokenCounts) {
    idf.set(token, Math.log(totalFiles / (1 + count)));
  }
  return idf;
}

The match score for a file became the sum of IDF weights of matched tokens, scaled by the fraction of query tokens matched:

function scoreFile(queryTokens: string[], fileTokens: Set<string>, idf: Map<string, number>) {
  const matched = queryTokens.filter(t => fileTokens.has(t));
  if (matched.length === 0) return 0;
  const idfSum = matched.reduce((s, t) => s + (idf.get(t) ?? 0), 0);
  const coverage = matched.length / queryTokens.length;
  return idfSum * coverage * coverage;  // squared to favor higher coverage
}

This alone wasn't enough — even with IDF, common-token files were still slipping through if they happened to match one rare token incidentally. So I added a two-predicate filter on top:

A file qualifies as a result if it matches a rare token (IDF above threshold) OR satisfies multiple distinct concept groups in the query.

The "concept groups" thing matters. A query like "invoice list row" is conceptually one thing — invoices, list views, row components — and a file that hits all three is structurally relevant even if no individual token is super rare. Either rare-token-match or multi-group-match gets you in. Neither, you're out.
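
As a sketch, assuming each distinct query token forms its own concept group (the real grouping may be coarser) and an illustrative IDF cutoff:

// Two-predicate qualification gate: a sketch, not coldstart's exact rules
const RARE_IDF_THRESHOLD = 2.5; // illustrative cutoff for "rare"

function qualifies(
  queryTokens: string[],
  fileTokens: Set<string>,
  idf: Map<string, number>
): boolean {
  const matched = queryTokens.filter(t => fileTokens.has(t));
  const hasRareMatch = matched.some(t => (idf.get(t) ?? 0) >= RARE_IDF_THRESHOLD);
  const conceptGroupsHit = new Set(matched).size;
  return hasRareMatch || conceptGroupsHit >= 2; // rare token OR multi-group
}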

This was the version that held up across a wide range of real queries. I stopped tweaking it.

Tree-sitter and the nested-function problem

Symbol extraction is done with Tree-sitter. The first pass walked top-level declarations only — function foo(), const bar = ..., class Baz, export default .... This works for most languages.

It does not work for React components.

In React, handlers are typically defined inside the component body:

export function UserProfile({ userId }: Props) {
  const handleSubmit = async () => { /* ... */ };
  const handleDelete = async () => { /* ... */ };

  return <form onSubmit={handleSubmit}>...</form>;
}

handleSubmit and handleDelete are real symbols. They're referenced in tests, they show up in stack traces, they're things an agent might reasonably search for. But a top-level walk misses them entirely — Tree-sitter sees UserProfile as the only declaration in the file.

The fix is to walk one level deeper into function bodies when the parent is a component-shaped function (PascalCase name, returns JSX). I don't go arbitrarily deep — that opens the door to indexing every closure and helper in every callback chain — just one level into the immediate component body.

import Parser from "tree-sitter";

// PascalCase is the cheap proxy for "component-shaped"; a fuller check
// would also confirm the body returns JSX
const isPascalCase = (name: string) => /^[A-Z][A-Za-z0-9]*$/.test(name);

// extractTopLevelSymbol and extractNestedHandlers are grammar-specific
// helpers defined elsewhere
function extractSymbols(tree: Parser.Tree): string[] {
  const symbols: string[] = [];

  for (const child of tree.rootNode.namedChildren) {
    const topLevel = extractTopLevelSymbol(child);
    if (topLevel) {
      symbols.push(topLevel.name);
      // If it's a component-shaped function, walk one level into its body
      if (topLevel.kind === "function" && isPascalCase(topLevel.name)) {
        symbols.push(...extractNestedHandlers(topLevel.body));
      }
    }
  }
  return symbols;
}

This isn't perfect — it'll over-index a function that happens to be PascalCase but isn't actually a component, and under-index components defined as arrow functions assigned to lowercase variables — but it covered the cases that mattered in practice.

What's still unsolved: cross-file call resolution

I want to be honest about this one. Tracing impact (what depends on this function?) requires resolving function calls across files. Named calls work — if foo is defined in A.ts and B.ts has foo() somewhere, I can resolve that. Member expression calls do not work:

// A.ts
import { service } from './services';
service.handleUpdate();        // unresolved — I can't tell what handleUpdate is

// B.ts
class Service {
  handleUpdate() { /* ... */ }
}

Resolving member expressions requires either (a) full type inference, which is most of the way to a language server and dramatically more work, or (b) a heuristic match on the method name plus the import graph, which is fast but has false positives.

For now I'm taking the false-positive hit on heuristic matching and exposing it honestly in the response — trace-impact returns named matches separately from heuristic matches so the agent knows which to trust. That's not a fix, it's an accommodation. Real type-aware resolution is on the list but not next.
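
To make the trust split concrete, here's roughly the shape. All names here are illustrative, not coldstart's actual trace-impact API:

interface ImpactResult {
  named: string[];     // resolved by exact symbol lookup: safe to trust
  heuristic: string[]; // method-name + import-graph guesses: may be wrong
}

function traceImpact(
  symbol: string,
  definedIn: string,
  namedCallers: Map<string, Set<string>>, // symbol -> files calling `symbol()` directly
  memberCalls: Map<string, Set<string>>,  // method name -> files calling `x.method()`
  importersOf: Map<string, Set<string>>   // file -> files that import from it
): ImpactResult {
  const named = [...(namedCallers.get(symbol) ?? [])];
  // Heuristic: count a member-expression call only if the calling file
  // also imports from the file that defines the symbol
  const heuristic = [...(memberCalls.get(symbol) ?? [])].filter(
    f => importersOf.get(definedIn)?.has(f) && !named.includes(f)
  );
  return { named, heuristic };
}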

The live index

Agents work in active codebases. Files change while the agent is mid-task — it edits a file, runs a test, the index is now stale. A stale index means wrong results, which means the agent goes hunting for files that no longer exist or doesn't find files it just wrote.

The watcher uses fs.watch with a 400ms debounce:

import fs from "node:fs";

const pendingChanges = new Set<string>();
let rebuildTimer: NodeJS.Timeout | undefined;

// rootDir and rebuildIndex are defined elsewhere in the indexer
const watcher = fs.watch(rootDir, { recursive: true }, (event, filename) => {
  if (!filename) return;
  pendingChanges.add(filename);
  clearTimeout(rebuildTimer);                   // collapse bursts of writes
  rebuildTimer = setTimeout(rebuildIndex, 400); // into a single rebuild
});

Below 30 changed files, it does an incremental patch — re-parse only the changed files, update the index in place. Above 30, it does a full rebuild. The threshold is a guess that's been fine in practice; below 30 the patch is faster, above 30 the bookkeeping cost outweighs the savings.

The rebuild swaps the index atomically — agents querying during a rebuild see the old index until the new one is ready, then see the new one. They never see a half-built index.
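
A minimal sketch of how the two paths and the swap might fit together, assuming buildIndex, patchIndex, allFiles, and an Index type defined elsewhere:

let activeIndex: Index = buildIndex(allFiles); // assumed helpers, not real API

function rebuildIndex(): void {
  const changed = [...pendingChanges];
  pendingChanges.clear();

  const next = changed.length < 30
    ? patchIndex(activeIndex, changed) // incremental: re-parse changed files only
    : buildIndex(allFiles);            // past the threshold, start fresh

  // Single reference swap: queries in flight keep reading the old index,
  // new queries see the finished one. Never a half-built index.
  activeIndex = next;
}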

The four tools

That index supports four MCP tools:

  • get-overview — given a query, return the most relevant files. The thing this whole post has been about.
  • get-structure — given a file or folder, return its symbol structure. No semantic overlap with embeddings; it's just "what's in here."
  • trace-deps — what does this file import and depend on?
  • trace-impact — given a symbol, what depends on it? (with the cross-file-call caveat above)

The first one is where most of the design effort went, but the other three are where coldstart genuinely can't be replaced by semantic search — embeddings don't answer "who imports this file."

What I'd tell anyone building something similar

Three things that took me longer to internalize than they should have.

Soft failures beat hard failures for agent tools. Returning 50 results the agent can narrow is recoverable. Returning zero with no recovery signal is a hard stop. Design for the agent to try again, not to be right the first time.

Decorative scores actively mislead. I removed confidence scores because they didn't differentiate results. If a number doesn't carry information, deleting it makes the tool more honest, not less professional.

Subtraction is the main design move. Almost every iteration ended with me removing a feature, not adding one. The final tool is much smaller than the first version.


If you got this far and you're building agent tools, I'd genuinely like to compare notes. Issues and discussions on GitHub are the best way to reach me.

npm install -g coldstart-mcp

GitHub · npm
