I wanted to answer one question: how far can pure heuristic
retrieval go before you actually need embeddings?
The answer surprised me.
## The problem with AI coding tools today
When you paste your codebase into an LLM, you're typically
sending 60,000–100,000 tokens of raw source code. Most of
that is noise — loop bodies, imports, boilerplate — that
never shows up in the answer.
The model reads the wrong file. Guesses the rest.
You retry 2–3 times.
## The insight
Code identifiers are already the compressed representation.
```
parseToken(src: string, opts?: ParseOpts) → Token[]
```
That signature tells a retrieval system everything it needs
to decide "is this file relevant to my query?" The body adds
nothing for retrieval purposes.
Embedding that signature loses information — you're projecting
a precise vocabulary into a dense vector. Exact token match
keeps it.
## How SigMap works
- Walk the codebase, extract signatures per file using language-specific regex extractors (21 languages)
- Build a signature index: a map from file path to its extracted signatures
- At query time, tokenize the query (camelCase/snake_case split, stop-word removal)
- Score every file using stacked heuristics:
  - exact token match: +1.0
  - symbol name hit: +0.5
  - path token match: +0.8
  - prefix match: +0.3
  - recency boost: ×1.5 multiplier
- Return the top-K files. Done.
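The scoring step can be sketched as follows. The weights come from the list above; everything else — the shape of the index entry, the function names, and the 7-day recency window — is my guess at a plausible implementation, not SigMap's actual internals:

```typescript
// Minimal index entry: what we assume SigMap keeps per file.
interface FileEntry {
  path: string;
  sigTokens: Set<string>;   // tokens split out of extracted signatures
  symbolNames: Set<string>; // symbol names, lowercased
  pathTokens: Set<string>;  // tokens split out of the file path
  mtime: number;            // last-modified timestamp (ms), for the recency boost
}

// Score one file against an already-tokenized query using the stacked heuristics.
function scoreFile(queryTokens: string[], file: FileEntry, now: number): number {
  let score = 0;
  for (const tok of queryTokens) {
    if (file.sigTokens.has(tok)) score += 1.0;   // exact token match
    if (file.symbolNames.has(tok)) score += 0.5; // symbol name hit
    if (file.pathTokens.has(tok)) score += 0.8;  // path token match
    // prefix match: some longer indexed token starts with the query token
    const prefixHit = [...file.sigTokens, ...file.pathTokens].some(
      (t) => t !== tok && t.startsWith(tok),
    );
    if (prefixHit) score += 0.3;
  }
  // Recency boost; the 7-day window is my assumption, not documented by SigMap.
  const sevenDaysMs = 7 * 24 * 60 * 60 * 1000;
  if (now - file.mtime < sevenDaysMs) score *= 1.5;
  return score;
}

// Rank the whole index and return the top-K file paths.
function topK(queryTokens: string[], index: FileEntry[], k: number, now: number): string[] {
  return index
    .map((f) => ({ path: f.path, score: scoreFile(queryTokens, f, now) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((e) => e.path);
}
```

Sets make every membership check O(1), which is what keeps a full scan of a large repo in the low hundreds of milliseconds.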
No model inference. No API calls. Runs in ~200ms on a
50K-file repo.
## Benchmark results
Tested across 18 real open-source repos on 90 hand-labeled
(query → expected_file) tasks:
| Metric | Result |
|---|---|
| Hit@5 | 80.0% |
| Random baseline | 13.6% |
| Lift | 5.8x |
| Token reduction | 98.1% |
| Prompts per task | 1.69 (vs. 2.84 baseline) |
The random baseline is 1 / avg_files_in_repo — it's what
you get by picking files at random. SigMap hits 5.8x that
with zero ML.
## Why path match is underrated
Queries like "python extractor" or "retrieval ranker"
self-select by path before a single signature is checked.
`src/extractors/python.js` scores +0.8 from path match
alone. Most well-structured repos follow a
`src/feature/thing.js` layout, so this is essentially free recall.
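Concretely, path matching only needs the path split the same way identifiers are; this splitting rule is my approximation of SigMap's behavior:

```typescript
// Tokenize a file path: directory separators, dots, hyphens, and
// underscores all delimit tokens, lowercased for matching.
function pathTokens(path: string): string[] {
  return path
    .split(/[\/\\.]/)
    .flatMap((seg) => seg.split(/[-_]/))
    .map((t) => t.toLowerCase())
    .filter((t) => t.length > 0);
}

console.log(pathTokens("src/extractors/python.js"));
// → ["src", "extractors", "python", "js"]
```

For the query "python extractor": "python" matches exactly, and "extractor" prefix-matches "extractors" — the file ranks before any signature is read.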
## Where it fails (~20% of tasks)
- Implicit intent: "how does auth work" when the auth functions are named `validateSession`, `checkPermissions`, etc., so there is no keyword overlap
- Synonyms: "authenticate" ≠ "login" unless both appear as identifiers
- Multi-hop: "find where input gets validated before the DB" needs graph traversal, not single-file scoring
This is where embeddings earn their cost — on the hard 20%,
not the easy 80%.
## Try it
```
npx sigmap
```
Generates a context file your IDE or LLM already knows
how to read. Works with Claude, Cursor, Copilot, Gemini,
Windsurf, OpenAI.
GitHub: https://github.com/manojmallick/sigmap
Curious what retrieval approaches others are using —
especially around multi-hop and the hybrid TF-IDF +
embeddings tradeoff.