Nikita Groshin
How I stopped Claude Code from hallucinating function names on a 4,000-file repo (with a local MCP server)

TL;DR: My Claude Code agent kept inventing function names that looked plausible but didn't exist (getUserByEmail, parseConfigFile, validateInput — all fake in my codebase). Adding a local MCP server that gives the agent a real symbol graph and ranked code search cut hallucinations to roughly zero on the same repo. Below: the bug, the cause, the fix, the bench numbers, and the cases where it still doesn't help.


The bug

I was refactoring a logging middleware in a 4,000-file TypeScript monorepo. The agent's task: rename logRequest to logHttpRequest everywhere it's called, including transitive callers.

What Claude Code generated, paraphrased:

```ts
// src/middleware/auth.ts
import { logRequest } from "../logging/logger"

export function withAuth(handler: Handler) {
  return async (req, res) => {
    logRequest(req)  // ← real
    if (!req.session) {
      return res.status(401).json({ error: "unauthorized" })
    }
    // …
    return logResponseTime(req, res, handler)  // ← INVENTED
  }
}
```

logResponseTime does not exist. It has never existed in this codebase. The agent generated a call to it because (a) the surrounding code talks about logging, (b) function names like logResponseTime exist in millions of public repos, (c) the model's training data has a strong prior that "logging middleware should also log response time."

The actual function in our codebase is called recordLatency, which the agent never used in any of the four files it edited.
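For contrast, here's roughly what the correct edit looks like, paraphrased the same way as the buggy version. The import path and signature of recordLatency are placeholders; the post only names the function:

```ts
// src/middleware/auth.ts — the edit the agent should have produced.
// recordLatency's import path and signature are illustrative placeholders.
import { logRequest } from "../logging/logger"
import { recordLatency } from "../logging/metrics"

export function withAuth(handler: Handler) {
  return async (req, res) => {
    logRequest(req)
    if (!req.session) {
      return res.status(401).json({ error: "unauthorized" })
    }
    // …
    return recordLatency(req, res, handler)  // the function that actually exists
  }
}
```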

I rolled back, ran the same task again, and got trackRequestDuration and emitTimingEvent — also fake. Three runs, three different invented names. The agent was confident each time.

This is the load-bearing failure of every AI coding agent on a repo larger than its context window: the model treats your codebase as if it were a representative sample of the training corpus.

Why grep alone doesn't fix it

Cursor's @codebase, Claude Code's grep tool, plain ripgrep: all of them work in principle. The agent can search before it writes. In practice, it almost never does, for three reasons.

Cost. A single grep call against this repo returned 14,200 input tokens (I measured it). Across the four files in this task the agent would have needed roughly 8 grep calls to be confident, and each result stays in the context window for every subsequent model call, so the spend compounds. That works out to ~$1.40 per task at Claude Sonnet's input rate, just for exploration. Multiply by 50 tasks a day. Engineers feel this, and agents respond by doing fewer searches and guessing more.
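Here's that compounding effect as a back-of-envelope sketch. The turn-by-turn accumulation is a simplifying assumption about how the agent loop re-sends context; the token counts are the measured ones above:

```ts
// Back-of-envelope: grep results accumulate in the agent's context,
// so turn N re-sends everything retrieved in turns 1..N (simplifying assumption).
const TOKENS_PER_GREP = 14_200;   // measured above
const GREP_CALLS = 8;             // calls needed to be confident on this task
const SONNET_INPUT_PER_M = 3.0;   // USD per 1M input tokens

let cumulativeInput = 0;
for (let turn = 1; turn <= GREP_CALLS; turn++) {
  cumulativeInput += turn * TOKENS_PER_GREP;
}

const costPerTask = (cumulativeInput / 1_000_000) * SONNET_INPUT_PER_M;
console.log(`${cumulativeInput} tokens ≈ $${costPerTask.toFixed(2)} per task`);
// → 511200 tokens ≈ $1.53 per task, the right ballpark for the ~$1.40 above
```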

Bias. Grep returns lexically matching strings. It doesn't tell the agent which results are load-bearing — which functions are central to the call graph and which are utility code touched once. The agent reads the first three results and stops, which on a 4,000-file repo is almost always wrong.

Recall. Grep matches identifiers. It doesn't match concepts. If you ask "what handles request timing in this repo?" grep can't answer. Embedding-only search can, but embedding-only search returns ranked-by-cosine results that are often wrong on code (logResponseTime and recordLatency are nearly cosine-identical; the second is correct only because of where it sits in the call graph).

The honest answer is that grep, embeddings, and graphs each fail in a different way. You need all three signals plus a way to combine them, exposed to the agent as MCP tools so it actually uses them.
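To make "combine all three signals" concrete, here's a minimal sketch of a hybrid ranker. The weights and normalization are illustrative placeholders, not Sverklo's actual tuning:

```ts
// Minimal hybrid ranker: lexical (BM25), semantic (cosine), structural (PageRank).
// The 0.4/0.3/0.3 weights are made-up placeholders.
interface Candidate {
  symbol: string;
  bm25: number;      // lexical match score from the text index
  cosine: number;    // embedding similarity to the query
  pagerank: number;  // importance in the call graph
}

function rank(candidates: Candidate[]): Candidate[] {
  // Normalize each signal to [0, 1] so no single one dominates by scale.
  const norm = (xs: number[]) => {
    const max = Math.max(...xs, 1e-9);
    return xs.map((x) => x / max);
  };
  const b = norm(candidates.map((c) => c.bm25));
  const s = norm(candidates.map((c) => c.cosine));
  const p = norm(candidates.map((c) => c.pagerank));

  return candidates
    .map((c, i) => ({ c, score: 0.4 * b[i] + 0.3 * s[i] + 0.3 * p[i] }))
    .sort((x, y) => y.score - x.score)
    .map((x) => x.c);
}

// logResponseTime and recordLatency can be cosine-identical;
// the call-graph term is what breaks the tie toward the symbol that exists.
```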

What I did instead

I installed Sverklo, a local-first MCP server for code intelligence. (Disclosure: I wrote it.) The 60-second pitch:

```bash
npm install -g sverklo
cd your-project && sverklo init
```

sverklo init auto-detects your AI coding agent (Claude Code, Cursor, Windsurf, Zed) and writes the right MCP config. It indexes your repo with tree-sitter, builds a call graph, computes a PageRank-ranked symbol importance score, and generates ONNX embeddings — all locally. Your code never leaves the machine.
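If you haven't seen PageRank applied to code: treat every function as a node, every call as an edge, and let callers "vote" for their callees. A minimal sketch follows (standard 0.85 damping, dangling nodes ignored for brevity; Sverklo's real implementation surely differs):

```ts
// PageRank over a call graph: callers "vote" for the functions they call.
type CallGraph = Map<string, string[]>; // function -> functions it calls

function pagerank(graph: CallGraph, iterations = 20, d = 0.85): Map<string, number> {
  const nodes = [...graph.keys()];
  const n = nodes.length;
  let rank = new Map(nodes.map((f): [string, number] => [f, 1 / n]));

  for (let i = 0; i < iterations; i++) {
    const next = new Map(nodes.map((f): [string, number] => [f, (1 - d) / n]));
    for (const [caller, callees] of graph) {
      const share = (rank.get(caller) ?? 0) / Math.max(callees.length, 1);
      for (const callee of callees) {
        if (next.has(callee)) next.set(callee, next.get(callee)! + d * share);
      }
    }
    rank = next;
  }
  return rank; // higher rank = more load-bearing symbol
}

// Toy example: recordLatency outranks an uncalled symbol because a real caller points at it.
const graph: CallGraph = new Map([
  ["withAuth", ["logRequest", "recordLatency"]],
  ["logRequest", []],
  ["recordLatency", []],
]);
console.log(pagerank(graph));
```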

The agent now has 37 extra tools alongside grep:

  • sverklo_search — hybrid BM25 + embedding + PageRank ranked search
  • sverklo_lookup — exact symbol definition by name
  • sverklo_refs — every reference to a symbol (call graph, not just textual)
  • sverklo_impact — recursive blast-radius (transitive callers)
  • sverklo_audit — god classes, dead code, security patterns
  • sverklo_remember / sverklo_recall — bi-temporal memory pinned to git SHAs
  • …and 30 more

For the rename task above, the agent's first move now is sverklo_lookup logRequest. That returns the canonical definition with file path and line number, ranked by PageRank importance. Then sverklo_refs logRequest returns every reference in the call graph, including the indirect callers grep would miss. The agent edits exactly the right files. No invented function names.
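Under the hood these are plain MCP tools/call requests over JSON-RPC. The method and envelope below are standard MCP; the { symbol: ... } arguments shape is shorthand, not the documented input schema:

```ts
// What an MCP client sends for the lookup step (JSON-RPC 2.0, "tools/call").
// The argument name "symbol" is an assumed shorthand for the real schema.
const lookupRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "sverklo_lookup",
    arguments: { symbol: "logRequest" },
  },
};

// Follow-up: fetch every call-graph reference before editing anything.
const refsRequest = {
  jsonrpc: "2.0",
  id: 2,
  method: "tools/call",
  params: {
    name: "sverklo_refs",
    arguments: { symbol: "logRequest" },
  },
};
```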

Re-ran the same task three times after install. Zero hallucinations.

The bench numbers

I ran a 60-task benchmark across 5 retrieval baselines (naive grep, smart grep, sverklo, jcodemunch-mcp, GitNexus). Methodology and raw data: sverklo.com/bench. Headline numbers:

| Baseline | Avg input tokens per task | Tool calls per task |
| --- | ---: | :---: |
| Naive grep | 17,169 | 7–12 |
| Smart grep | 5,082 | 4–6 |
| jcodemunch-mcp | 5,351 | 1 |
| GitNexus | 543 | 1 |
| Sverklo | 386 | 1 |

That's 44.5× fewer input tokens than naive grep (17,169 / 386), 13.2× fewer than tuned grep (5,082 / 386), and a single tool call vs grep's 7–12.

For a typical Claude Sonnet session at $3/M input tokens and 50 tasks a day spread across 10 sessions, the math comes out to roughly $0.41 per session today, or ~$123/month ($0.41 × 10 sessions × 30 days) for a small team. Sverklo's local indexing turns the same workload into roughly $9/month.

I'm not telling you those numbers to sell you anything. The repo is MIT-licensed and the bench is reproducible — clone it, run npm run bench, get the same numbers (or different ones for your repo, which is the point).

Where this still doesn't help

This is the honest part most blog posts skip.

  1. Repos under ~5,000 LOC. The agent's context window can hold the whole thing. Grep is faster, and the indexing overhead isn't worth it.

  2. Single-file edits with no cross-references. If you're editing one file and the change doesn't propagate, the symbol graph adds no signal.

  3. First call to a fresh repo. Sverklo has to index before its first useful query. On a 50K-LOC repo this is ~30 seconds; on 500K-LOC, ~5 minutes. After that it's incremental and fast, but the first run isn't free.

  4. Reference finding (P2 in the bench). This is the embarrassing one. A well-tuned ripgrep ties sverklo on the "find every caller of X" task. The semantic graph doesn't help when the question is purely textual. If P2 is your dominant workload, smart grep is genuinely competitive.

  5. Definition lookup (P1). jcodemunch-mcp beats sverklo here at 0.65 F1 vs sverklo's 0.45. Their tree-sitter symbol indexing is sharper. I have something to learn from them.

If the audit/blast-radius/memory tools don't sound load-bearing to your workflow, just use ripgrep + Cursor's @codebase. They're fine.
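For calibration, "well-tuned ripgrep" in the P2 sense means something like this; adjust the pattern and type filter for your own stack:

```bash
# Every textual call site of logRequest in TypeScript files.
# \b avoids matching logRequestId; --type ts limits to *.ts/*.tsx; -n adds line numbers.
rg -n --type ts '\blogRequest\s*\('
```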

What I'd actually try first

If you're a Claude Code or Cursor user on a repo bigger than ~50K LOC, the cheapest experiment I can suggest is:

```bash
npm install -g sverklo
cd your-project
sverklo init
```

Then ask your agent its three least-favorite codebase questions. Mine were:

  1. "What handles request timing in this repo?" (was: 14,200 grep tokens, no useful answer; now: one sverklo_search call, 312 tokens, correct)
  2. "If I rename logRequest, what breaks?" (was: agent guesses confidently; now: sverklo_impact returns the 23 transitive callers)
  3. "Where is the rate limiter implemented?" (was: agent edits the wrong file 50% of the time; now: sverklo_lookup returns the canonical definition)

If those three questions don't get meaningfully better answers, uninstall and use grep. npm uninstall -g sverklo is one command.

The deeper point

Hallucination in AI coding agents isn't a model problem. It's a retrieval problem. The model has to write code that matches your codebase; if it can't see your codebase fast and ranked, it falls back on the training-data prior. Function names like logResponseTime win over recordLatency because the prior is overwhelming.

The fix isn't a smarter model. It's giving the model a real view of your code — a symbol graph, a ranked search, a call graph, a memory of what changed yesterday — exposed as tools the agent can call cheaply enough to actually use.

That's it. That's the whole post.


Repo: github.com/sverklo/sverklo (MIT, ⭐ if it saved you a hallucination — it's the only way other engineers find it)
Bench: sverklo.com/bench — 60 tasks, 5 baselines, reproducible
Paper: doi.org/10.5281/zenodo.19802051 — peer-reviewable methodology, CC-BY 4.0
Demo (90 sec): youtube.com/watch?v=OX7aEgdlqhQ

If you've been hit by the same problem, I'd love to see your worst hallucination — DM me on X @marazmo or open an issue on the repo. I'm collecting them for a follow-up post.
