I wanted to answer one question: how far can pure heuristic
retrieval go before you actually need embeddings?
The answer surprised me.
## The problem with AI coding tools today
When you paste your codebase into an LLM, you're typically
sending 60,000–100,000 tokens of raw source code. Most of
that is noise — loop bodies, imports, boilerplate — that
never shows up in the answer.
The model reads the wrong file. Guesses the rest.
You retry 2–3 times.
## The insight
Code identifiers are already the compressed representation.
```
parseToken(src: string, opts?: ParseOpts) → Token[]
```
That signature tells a retrieval system everything it needs
to decide "is this file relevant to my query?" The body adds
nothing for retrieval purposes.
Embedding that signature loses information — you're projecting
a precise vocabulary into a dense vector. Exact token match
keeps it.
## How SigMap works
- Walk the codebase, extract signatures per file using language-specific regex extractors (21 languages)
- Build a signature index: a map from file path to its extracted signatures
- At query time, tokenize the query (camelCase/snake_case split, stop-word removal)
- Score every file using stacked heuristics:
  - exact token match: +1.0
  - symbol name hit: +0.5
  - path token match: +0.8
  - prefix match: +0.3
  - recency boost: ×1.5 multiplier
- Return the top-K files. Done.
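The scoring step can be sketched as follows. The weights come from the list above; everything else — the shape of the index entry, the function names, and the 7-day recency window — is my guess at a plausible implementation, not SigMap's actual internals:

```typescript
// Minimal index entry: what we assume SigMap keeps per file.
interface FileEntry {
  path: string;
  sigTokens: Set<string>;   // tokens split out of extracted signatures
  symbolNames: Set<string>; // symbol names, lowercased
  pathTokens: Set<string>;  // tokens split out of the file path
  mtime: number;            // last-modified timestamp (ms), for the recency boost
}

// Score one file against an already-tokenized query using the stacked heuristics.
function scoreFile(queryTokens: string[], file: FileEntry, now: number): number {
  let score = 0;
  for (const tok of queryTokens) {
    if (file.sigTokens.has(tok)) score += 1.0;   // exact token match
    if (file.symbolNames.has(tok)) score += 0.5; // symbol name hit
    if (file.pathTokens.has(tok)) score += 0.8;  // path token match
    // prefix match: some longer indexed token starts with the query token
    const prefixHit = [...file.sigTokens, ...file.pathTokens].some(
      (t) => t !== tok && t.startsWith(tok),
    );
    if (prefixHit) score += 0.3;
  }
  // Recency boost; the 7-day window is my assumption, not documented by SigMap.
  const sevenDaysMs = 7 * 24 * 60 * 60 * 1000;
  if (now - file.mtime < sevenDaysMs) score *= 1.5;
  return score;
}

// Rank the whole index and return the top-K file paths.
function topK(queryTokens: string[], index: FileEntry[], k: number, now: number): string[] {
  return index
    .map((f) => ({ path: f.path, score: scoreFile(queryTokens, f, now) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((e) => e.path);
}
```

Sets make every membership check O(1), which is what keeps a full scan of a large repo in the low hundreds of milliseconds.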
No model inference. No API calls. Runs in ~200ms on a
50K-file repo.
## Benchmark results
Tested across 18 real open-source repos on 90 hand-labeled
(query → expected_file) tasks:
| Metric | Result |
|---|---|
| Hit@5 | 80.0% |
| Random baseline | 13.6% |
| Lift | 5.8x |
| Token reduction | 98.1% |
| Prompts per task | 1.69 (vs. 2.84 baseline) |
The random baseline is 1 / avg_files_in_repo — it's what
you get by picking files at random. SigMap hits 5.8x that
with zero ML.
## Why path match is underrated
Queries like "python extractor" or "retrieval ranker"
self-select by path before a single signature is checked.
`src/extractors/python.js` scores +0.8 from path match
alone. Most well-structured repos follow a
`src/feature/thing.js` layout, so this is essentially free recall.
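Concretely, path matching only needs the path split the same way identifiers are; this splitting rule is my approximation of SigMap's behavior:

```typescript
// Tokenize a file path: directory separators, dots, hyphens, and
// underscores all delimit tokens, lowercased for matching.
function pathTokens(path: string): string[] {
  return path
    .split(/[\/\\.]/)
    .flatMap((seg) => seg.split(/[-_]/))
    .map((t) => t.toLowerCase())
    .filter((t) => t.length > 0);
}

console.log(pathTokens("src/extractors/python.js"));
// → ["src", "extractors", "python", "js"]
```

For the query "python extractor": "python" matches exactly, and "extractor" prefix-matches "extractors" — the file ranks before any signature is read.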
## Where it fails (~20% of tasks)
- Implicit intent: "how does auth work" when the auth functions are named `validateSession`, `checkPermissions`, etc., so there is no keyword overlap
- Synonyms: "authenticate" ≠ "login" unless both appear as identifiers
- Multi-hop: "find where input gets validated before the DB" needs graph traversal, not single-file scoring
This is where embeddings earn their cost — on the hard 20%,
not the easy 80%.
## Try it
```
npx sigmap
```
Generates a context file your IDE or LLM already knows
how to read. Works with Claude, Cursor, Copilot, Gemini,
Windsurf, OpenAI.
GitHub: https://github.com/manojmallick/sigmap
Curious what retrieval approaches others are using —
especially around multi-hop and the hybrid TF-IDF +
embeddings tradeoff.