DEV Community

manoj mallick

How I got 80% code retrieval accuracy without vectors, embeddings, or any ML

I wanted to answer one question: how far can pure heuristic
retrieval go before you actually need embeddings?

The answer surprised me.

The problem with AI coding tools today

When you paste your codebase into an LLM, you're typically
sending 60,000–100,000 tokens of raw source code. Most of
that is noise — loop bodies, imports, boilerplate — that
never shows up in the answer.

The model reads the wrong file. Guesses the rest.
You retry 2–3 times.

The insight

Code identifiers are already the compressed representation.

```ts
parseToken(src: string, opts?: ParseOpts): Token[]
```

That signature tells a retrieval system everything it needs
to decide "is this file relevant to my query?" The body adds
nothing for retrieval purposes.

Embedding that signature loses information — you're projecting
a precise vocabulary into a dense vector. Exact token match
keeps it.
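To make the "exact token match" idea concrete, here is a minimal identifier splitter of the kind such a system would need. This is a sketch of the general technique, not SigMap's actual code; the function name is mine:

```typescript
// Hypothetical sketch: split identifiers so that a query like
// "parse token" aligns exactly with the symbol `parseToken`.
function splitIdentifier(id: string): string[] {
  return id
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2") // camelCase boundary
    .split(/[\s_\-./]+/)                    // snake_case, paths, dots
    .filter(Boolean)
    .map((t) => t.toLowerCase());
}

splitIdentifier("parseToken"); // → ["parse", "token"]
splitIdentifier("ParseOpts");  // → ["parse", "opts"]
```

No projection into vector space: the vocabulary the programmer chose survives intact, and matching stays exact.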

How SigMap works

  1. Walk the codebase and extract signatures per file using language-specific regex extractors (21 languages)
  2. Build a signature index: a map from file path to its extracted signatures
  3. At query time, tokenize the query (camelCase/snake_case split, stop-word removal)
  4. Score every file using stacked heuristics:

```
exact token match    +1.0
symbol name hit      +0.5
path token match     +0.8
prefix match         +0.3
recency boost        ×1.5 multiplier
```

  5. Return top-K files. Done.
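The query-time half can be sketched in a few dozen lines. The weights come from the table above, but the `FileEntry` shape, the tokenizer, and the stop-word list are my assumptions, not SigMap's internals:

```typescript
// Sketch of the stacked-heuristic scorer. Weights are from the
// article; everything else (FileEntry, tokenize) is hypothetical.
interface FileEntry {
  path: string;
  symbols: string[];         // identifiers pulled from signatures
  recentlyModified: boolean;
}

const STOP = new Set(["the", "a", "an", "of", "in", "to", "how", "does"]);

function tokenize(s: string): string[] {
  return s
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2") // camelCase split
    .split(/[\s_\-./]+/)                    // snake_case, paths
    .map((t) => t.toLowerCase())
    .filter((t) => t.length > 0 && !STOP.has(t));
}

function scoreFile(queryTokens: string[], file: FileEntry): number {
  const symbolTokens = new Set(file.symbols.flatMap(tokenize));
  const symbolNames = new Set(file.symbols.map((s) => s.toLowerCase()));
  const pathTokens = new Set(tokenize(file.path));

  let score = 0;
  for (const q of queryTokens) {
    if (symbolTokens.has(q)) score += 1.0;  // exact token match
    else if ([...symbolTokens].some((t) => t.startsWith(q)))
      score += 0.3;                         // prefix match
    if (symbolNames.has(q)) score += 0.5;   // whole symbol name hit
    if (pathTokens.has(q)) score += 0.8;    // path token match
  }
  return file.recentlyModified ? score * 1.5 : score; // recency boost
}

function topK(query: string, files: FileEntry[], k = 5): FileEntry[] {
  const q = tokenize(query);
  return [...files]
    .sort((a, b) => scoreFile(q, b) - scoreFile(q, a))
    .slice(0, k);
}
```

Every operation here is a set lookup or a string prefix check, which is why the whole pipeline stays in the low hundreds of milliseconds even on large repos.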

No model inference. No API calls. Runs in ~200ms on a
50K-file repo.

Benchmark results

Tested across 18 real open-source repos, 90 hand-labeled
(query → expected_file) tasks:

Metric              Result
------              ------
Hit@5               80.0%
Random baseline     13.6%
Lift                5.8×
Token reduction     98.1%
Prompts per task    1.69 (vs 2.84 baseline)

The random baseline is 1 / avg_files_in_repo — it's what
you get by picking files at random. SigMap hits 5.8x that
with zero ML.

Why path match is underrated

Queries like "python extractor" or "retrieval ranker"
self-select by path before a single signature is checked.

src/extractors/python.js scores +0.8 just from
path match. Most well-structured repos are
src/feature/thing.js — this is essentially free recall.
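The path-match half is almost embarrassingly simple. A hypothetical `pathTokens` helper (my name, not SigMap's) is all it takes:

```typescript
// Hypothetical path tokenizer: the directory structure alone gives
// the query tokens something to match, before any signature is read.
function pathTokens(p: string): string[] {
  return p.toLowerCase().split(/[/._\-]+/).filter(Boolean);
}

pathTokens("src/extractors/python.js");
// → ["src", "extractors", "python", "js"]
```

A query containing "python" overlaps with that token set immediately, so the file earns its +0.8 before any extractor runs.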

Where it fails (~20% of tasks)

  • Implicit intent: "how does auth work" when the auth functions are named validateSession, checkPermissions, etc. — no keyword overlap
  • Synonyms: "authenticate" ≠ "login" unless both appear as identifiers
  • Multi-hop: "find where input gets validated before the DB" needs graph traversal, not single-file scoring

This is where embeddings earn their cost — on the hard 20%,
not the easy 80%.

Try it

```sh
npx sigmap
```

Generates a context file your IDE or LLM already knows
how to read. Works with Claude, Cursor, Copilot, Gemini,
Windsurf, OpenAI.

GitHub: https://github.com/manojmallick/sigmap


Curious what retrieval approaches others are using —
especially around multi-hop and the hybrid TF-IDF +
embeddings tradeoff.
