Dayna Blackwell

Posted on May 25 • Originally published at blog.blackwell-systems.com

We Benchmarked the Most Popular Code Search Tools. We Beat All of Them.

#ai #mcp #benchmark #devtools

codegraph has 19,459 GitHub stars. We have zero. So we stopped talking and started measuring.

The Headline

System	P@10	Query latency (k8s)	Time-to-consistency	Stars
knowing	0.207	2ms	167ms	0
codegraph	0.135	~1s	805ms	19,459
GitNexus	0.075	612ms	minutes	-
Gortex	0.063	~6s	minutes	-
Aider	-	~3s	3,150ms (timed out)	~20K
codebase-memory	-	2,900ms	N/A (timed out)	2,600
grep	0.013	instant	instant	N/A

P@10 = fraction of top-10 results that are relevant to the task. Higher is better.

knowing is 1.53x more precise than codegraph (19K stars, tree-sitter + FTS5).
knowing is 2.76x more precise than GitNexus (knowledge graph MCP).
knowing is 3.29x more precise than Gortex (Go graph engine, 256 languages).
knowing is 15.9x more precise than grep.

Why 19K Stars Means Nothing

codegraph uses tree-sitter + FTS5 + heuristic scoring (co-location bonuses, multi-term matching, CamelCase boundary matching). No graph-theoretic ranking. No random walk. No structural propagation.

knowing uses Random Walk with Restart on a content-addressed call graph. The walk propagates relevance through the actual dependency structure: "this function calls that one, which implements this interface, which is tested by those tests." Structural relevance, not string coincidence.

The result: codegraph finds symbols that contain your keywords. knowing finds symbols that are structurally relevant to your task. These are often different things.
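To make the distinction concrete, here is a minimal sketch of Random Walk with Restart on a toy call graph. It is not knowing's implementation: the function name, the edge list, the restart value, and the dead-end handling are all assumptions chosen for illustration.

```python
from collections import defaultdict

def random_walk_with_restart(edges, seeds, restart=0.15, iterations=100):
    """Score nodes by how much walk probability reaches them from the seeds.

    edges: (caller, callee) pairs; seeds: symbols matched by the query.
    restart: chance of jumping back to a seed at each step (assumed value).
    """
    out = defaultdict(list)
    nodes = set(seeds)
    for src, dst in edges:
        out[src].append(dst)
        nodes.update((src, dst))

    seed_mass = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(seed_mass)
    for _ in range(iterations):
        nxt = {n: restart * seed_mass[n] for n in nodes}
        for n in nodes:
            targets = out[n]
            if not targets:
                # Dead end: hand the walker's mass back to the seeds.
                for s in seeds:
                    nxt[s] += (1 - restart) * scores[n] / len(seeds)
                continue
            share = (1 - restart) * scores[n] / len(targets)
            for t in targets:
                nxt[t] += share
        scores = nxt
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical call graph: one connected chain plus an unrelated pair.
edges = [
    ("handle_request", "validate_token"),
    ("validate_token", "decode_jwt"),
    ("test_validate_token", "validate_token"),
    ("render_page", "load_template"),
]
ranking = random_walk_with_restart(edges, seeds=["handle_request"])
for name, score in ranking[:3]:
    print(f"{name}: {score:.3f}")
```

In this toy graph the symbols reachable from the seed collect all of the score, while `render_page` and `load_template` get none, however well their names might match a keyword. The sketch only follows caller-to-callee edges, so `test_validate_token` also scores zero here; a real system would presumably walk other edge types as well.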

Per-Repo Breakdown

Repo	Language	LOC	knowing P@10	Tasks
Flask	Python	15K	0.332	19
Terraform	Go	2M	0.275	20
Ocelot	C#	30K	0.260	5
Kafka	Java	500K	0.253	19
Cross-cutting	Mixed	-	0.200	9
Django	Python	300K	0.182	33
Spark	Java	14K	0.180	5
Kubernetes	Go	3.5M	0.153	19
VS Code	TypeScript	1M	0.137	19
Cargo	Rust	150K	0.132	19

Repos with well-structured class hierarchies and documentation perform best (Flask 0.332, Terraform 0.275, Kafka 0.253). knowing's docstring FTS directly leverages developer-written documentation as a retrieval signal, and inheritance propagation creates paths through class hierarchies. Even the weakest repo (Cargo 0.132) exceeds grep's best (0.013).
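The per-repo rows can be checked against the headline figure. The sketch below assumes the overall P@10 is the task-weighted mean of the per-repo scores, which the post does not state explicitly; the numbers are copied from the table above.

```python
# (repo, P@10, tasks) copied from the per-repo table.
per_repo = [
    ("Flask", 0.332, 19),
    ("Terraform", 0.275, 20),
    ("Ocelot", 0.260, 5),
    ("Kafka", 0.253, 19),
    ("Cross-cutting", 0.200, 9),
    ("Django", 0.182, 33),
    ("Spark", 0.180, 5),
    ("Kubernetes", 0.153, 19),
    ("VS Code", 0.137, 19),
    ("Cargo", 0.132, 19),
]

total_tasks = sum(tasks for _, _, tasks in per_repo)
overall = sum(p * tasks for _, p, tasks in per_repo) / total_tasks
print(total_tasks, round(overall, 3))  # 167 0.207
```

Under that assumption the rows add up to 167 tasks and reproduce the 0.207 in the headline table.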

Query Latency: 500x Faster on Enterprise Repos

codegraph queries Kubernetes in about 1 second (BM25, no graph walk). knowing with its pre-computed adjacency cache: 2 milliseconds. That's 500x faster.

Metric	knowing	codegraph
k8s query (782K edges)	2ms	~1s
Cache build (one-time)	973ms	N/A
Format	65 bytes/edge binary	N/A

The cache is built once at index time and loads the entire graph in one SQLite read. RWR then runs entirely in memory. The 4,717x improvement (from roughly 9s uncached to 1.9ms cached, shown as 2ms in the table) is a structural advantage of content-addressed caching: the adjacency map is deterministic, so it never needs invalidation except on re-index.
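A fixed-width binary edge cache can be sketched in a few lines. The record layout here is a guess: the post only gives the 65-byte size, and two 32-byte SHA-256 hashes plus a 1-byte edge kind happens to add up to that. Every name in the block is hypothetical.

```python
import hashlib
import struct
from collections import defaultdict

# Assumed layout: 32-byte source hash, 32-byte target hash, 1-byte edge kind.
# "<" disables padding, so each record is exactly 65 bytes.
EDGE = struct.Struct("<32s32sB")

def node_id(name):
    """Content-address a symbol by hashing its name (stand-in for real content)."""
    return hashlib.sha256(name.encode()).digest()

def pack_edges(edges):
    """Serialize (src, dst, kind) edges; sorting makes the bytes deterministic."""
    return b"".join(
        EDGE.pack(node_id(src), node_id(dst), kind) for src, dst, kind in sorted(edges)
    )

def load_adjacency(blob):
    """Rebuild the in-memory adjacency map in one pass over the blob."""
    adjacency = defaultdict(list)
    for src, dst, kind in EDGE.iter_unpack(blob):
        adjacency[src].append((dst, kind))
    return adjacency

edges = [
    ("handle_request", "validate_token", 1),
    ("validate_token", "decode_jwt", 1),
    ("handle_request", "render_page", 1),
]
blob = pack_edges(edges)
graph = load_adjacency(blob)
print(len(blob), len(graph[node_id("handle_request")]))
```

Because the edges are sorted before packing, the same edge set always yields the same bytes, which is the property that lets a cache like this skip invalidation until the next re-index.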

Time-to-Consistency: New Code in 167ms

You add a function. How quickly does each system find it?

System	Total time	Found?
knowing	167ms	Yes (rank 2)
codegraph	805ms	Yes
Aider	3,150ms	No

Protocol: inject validate_authentication_token() into Flask, trigger incremental reindex, query for it.

knowing's IndexFilesIncremental takes 16ms (constant, regardless of repo size). codegraph's sync rescans the entire repo (scales linearly). Aider re-parses everything on every query and still doesn't find the new function.

Why Aider fundamentally cannot find new code: A newly added function with no callers has zero in-degree, so PageRank assigns it minimal weight. It will never surface in ranked results until other code calls it. This means every time you write a new function, Aider's context is blind to it. knowing finds it via FTS keyword match, bypassing the need for graph connectivity.
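The zero in-degree argument is easy to check with a textbook PageRank. This is a generic power-iteration sketch, not Aider's code; the graph and the damping value are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over (caller, callee) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Mass sitting on nodes with no outgoing edges is spread uniformly.
        dangling = sum(rank[n] for n in nodes if not out[n])
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        nxt = {n: base for n in nodes}
        for n in nodes:
            for target in out[n]:
                nxt[target] += damping * rank[n] / len(out[n])
        rank = nxt
    return rank

# Hypothetical graph: the new function calls decode_jwt, but nothing calls it yet.
edges = [
    ("app", "login"),
    ("app", "logout"),
    ("login", "decode_jwt"),
    ("logout", "decode_jwt"),
    ("validate_authentication_token", "decode_jwt"),
]
rank = pagerank(edges)
for name in sorted(rank, key=rank.get, reverse=True):
    print(f"{name}: {rank[name]:.3f}")
```

The freshly added function only ever receives the baseline share that every node gets, so it sits at the bottom of the ranking until something calls it.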

Agent Efficiency: 99.9% Noise Elimination

On Kubernetes (3.5M LOC), an agent doing grep Handler gets 1,284 matches. For "Controller": 14,896 matches. The agent must read/filter all of them.

knowing returns 10 ranked results with 72% ground truth hit rate. codegraph returns 28/50. GitNexus returns 0 (can't handle k8s at all).

System	Ground truth in top-10	Results to sift
knowing	36/50 (72%)	10 results
codegraph	28/50 (56%)	3-20 results
GitNexus	0/50	0 (scale failure)
grep	N/A	10,840 per task

The advantage isn't just precision. It's that knowing delivers 10 results from 10,840 candidates. That's 99.9% noise elimination before the agent sees anything.

Determinism: Same Question, Same Answer

We ran the same task 10 times per system.

System	Unique outputs (10 runs)	Verdict
knowing	1	DETERMINISTIC
codegraph	1	DETERMINISTIC
GitNexus	7-9	NON-DETERMINISTIC
Aider	3	NON-DETERMINISTIC

GitNexus gives a different answer almost every time you ask. Aider varies moderately. You can't regression-test a non-deterministic context system. You can't debug agent behavior if the context changes between runs.

knowing's determinism is structural: content-addressed PackRoot guarantees the same input produces the same output. Always.
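A determinism check like the one in the table can be approximated by hashing each run's output and counting distinct digests. The two retrievers below are stand-ins, not real systems, and the helper name is hypothetical.

```python
import hashlib
import itertools
import json

def count_unique_outputs(retrieve, task, runs=10):
    """Run the same query repeatedly and count distinct ranked outputs."""
    digests = set()
    for _ in range(runs):
        payload = json.dumps(retrieve(task)).encode()
        digests.add(hashlib.sha256(payload).hexdigest())
    return len(digests)

def stable_retriever(task):
    # Always returns the same ranked list.
    return ["Flask.before_request", "Flask.dispatch_request"]

_orderings = itertools.cycle([["a", "b"], ["b", "a"], ["a", "c"]])

def flaky_retriever(task):
    # Simulates a system whose output changes between runs.
    return next(_orderings)

stable_count = count_unique_outputs(stable_retriever, "add a before_request hook")
flaky_count = count_unique_outputs(flaky_retriever, "add a before_request hook")
print(stable_count, flaky_count)  # 1 3
```

Rank order is part of the hashed payload, so two runs that return the same symbols in a different order count as different outputs.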

The Full Competitor Landscape

We benchmarked every code retrieval tool we could install. Here's the complete picture.

GitNexus (Knowledge Graph MCP)

P@10 = 0.075. GitNexus supports task-oriented retrieval but is 2.76x less precise than knowing.

Fatal flaw: cannot handle enterprise repos. Killed after 60 minutes on Kubernetes (5.7GB RAM, single-threaded JavaScript). knowing indexes the same repo in 18.6 seconds at 200MB RAM.

Metric	knowing	GitNexus	Ratio
P@10	0.207	0.075	2.76x
Query latency	2ms	612ms	306x
Index Kubernetes	18.6s	>60 min (killed)	>193x
RAM (Kubernetes)	200MB	5.7GB	28x less
Determinism	1 unique	7-9 unique	Non-deterministic
Tasks completed	167/167	66/167	60% failure rate

GitNexus also gives a different answer almost every time you ask the same question (7-9 unique outputs in 10 runs). You can't trust results you can't reproduce.

Aider (~20K stars, PageRank repo-map)

P@10 = 0.050 (prior run; timed out on current full benchmark). File-level retrieval (not symbol-level). Uses tree-sitter + PageRank.

Metric	knowing	Aider	Ratio
P@10	0.207	0.050	4.1x
Query latency (Flask)	151ms	3,150ms	21x
Finds new symbols	Yes	No	N/A
Determinism	Yes	No (3 unique/10)	N/A

Aider's PageRank approach ranks files by how often they're referenced. This means:

New code (no callers yet) is invisible
Results are query-independent (same output regardless of what you ask)
File-level granularity means you get entire files, not specific symbols

Gortex (Go graph engine, 256 languages)

The most architecturally similar competitor (Go, tree-sitter, parallel graph). P@10 = 0.063 on the 66 tasks it could complete.

Metric	knowing	Gortex	Ratio
P@10	0.207	0.063	3.29x
Index Kubernetes	18.6s	14.2 min	46x
RAM (Kubernetes)	200MB	14GB	70x less
Tasks completed	167/167	66/167	60% failure rate
Re-indexes per query	No (cached)	Yes	N/A

Gortex extracts 23x more edges (6.3M vs 268K for k8s) but re-indexes the entire repo on every query call. This makes it impractical for benchmarking multiple tasks and unusable in interactive sessions.

Repomix (25K stars, pack entire repo)

The brute-force approach: dump the entire repo into the context window. No ranking, no intelligence.

Metric	knowing	Repomix
Tokens for Flask task	~4,000	~300,000
Token efficiency	75x better	baseline
Fits in 8K context?	Yes	No
Fits in 128K context?	Yes	No (~300K tokens)

Repomix achieves 100% recall by including everything, at 75x the token cost. Most models can't fit the output. knowing returns ranked, relevant symbols within a token budget that fits any model.

codebase-memory-mcp (2.6K stars, BM25 + semantic edges)

P@10 = 0.107 on Flask. Uses tree-sitter (155 grammars) + BM25 + label boost.

Metric	knowing	codebase-memory
P@10 (Flask+Cargo)	0.207	0.137
Advantage	1.51x	baseline
R@10	0.297	0.145
Query latency	0ms (cached)	2,900ms
Handles k8s/Django	Yes	No (timeout)

codebase-memory's BM25 engine spins at 100% CPU on repos with >40K nodes. Query latency by repo size:

Repo	codebase-memory	knowing
Flask (15K LOC)	285ms	0ms
Cargo (150K LOC)	~3s	0ms
Django (300K LOC)	hangs (100% CPU)	0ms
VS Code (1M LOC)	hangs (>30s, killed)	0ms
k8s (3.5M LOC)	killed after 5min	2ms

Scale ceiling: ~150K LOC. Any enterprise codebase is unusable.

Where Each Competitor Dies

Every tool has a breaking point. Only knowing handles the full range.

System	Max viable scale	Failure mode
knowing	unlimited (tested 3.5M LOC)	N/A
codegraph	unlimited (but fails Java/C#)	10/167 task failures
codebase-memory	~150K LOC	100% CPU hang, no response
GitNexus	~150K LOC	OOM (5.7GB RAM), killed after 60min
Gortex	unlimited (impractically slow)	14min index, 14GB RAM
Aider	unlimited (imprecise)	3s/query, can't find new code

CodeGraphContext (KuzuDB)

Cannot perform task-oriented retrieval. Only supports exact name search. Also: 2,159x slower indexing on Flask (215 seconds vs 0.1 seconds). A navigation tool, not a retrieval system.

Where We Lose

Honesty matters. Here's where knowing is weaker:

Dense TypeScript repos (VS Code P@10=0.137): on large TypeScript codebases with generic symbol names, keyword competition is intense (3,000+ matches for "action"). We mitigate this with density-adaptive type-seed preference (auto-enabled on graphs >40K nodes), but VS Code remains the weakest large repo.

Sparse documentation (Cargo P@10=0.132): Rust repos without /// doc comments don't benefit from docstring FTS. The vocabulary gap persists.

First-result accuracy: codegraph places the single most relevant symbol at rank 1 more often than knowing does. But it fills positions 2-10 with noise, dragging precision down. If you only need the #1 result, codegraph is competitive. If you need the top-10 to be useful (which agents do), knowing wins.

No Language Server Required

We tested whether running a language server makes results better. It makes them worse.

Enrichment actually hurts P@10 (0.177 enriched vs 0.185 unenriched). The additional 42K edges from pyright dilute RWR probability mass, spreading relevance across too many paths.

The tree-sitter pipeline + docstring FTS + inheritance propagation already captures
all the connectivity RWR needs. Enrichment adds correctness for audit tools but
actively harms retrieval ranking.

This simplifies deployment: knowing is a single Go binary. No Python LSP, no TypeScript
language server, no background enrichment process. Install and query.

Feedback Compounding (Gets Smarter With Use)

Cold-start P@10 is 0.207. When an agent reports which symbols were useful, knowing records that signal and boosts those symbols in future queries for similar tasks.

The feedback anchors to content-addressed symbol hashes. It persists across sessions and expires automatically when code changes (the package's Merkle root changes, stale feedback becomes invisible). No manual curation. No embedding model. Just hash-keyed counters that decay with staleness.
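The expiry behavior falls out of the keying scheme. Here is a minimal sketch, assuming feedback is stored under a (package root, symbol hash) pair; the class, method names, and boost weight are hypothetical, and the staleness decay mentioned above is left out.

```python
from collections import defaultdict

class FeedbackStore:
    """Hash-keyed usefulness counters (illustrative, not knowing's storage)."""

    def __init__(self):
        # (package_root, symbol_hash) -> times an agent marked the symbol useful
        self._counts = defaultdict(int)

    def record_useful(self, package_root, symbol_hash):
        self._counts[(package_root, symbol_hash)] += 1

    def boost(self, package_root, symbol_hash, weight=0.1):
        """Multiplier applied to a symbol's score; 1.0 means no feedback."""
        return 1.0 + weight * self._counts.get((package_root, symbol_hash), 0)

store = FeedbackStore()
store.record_useful("root-v1", "sym-abc")
store.record_useful("root-v1", "sym-abc")

before = store.boost("root-v1", "sym-abc")  # package unchanged: feedback applies
after = store.boost("root-v2", "sym-abc")   # code changed, root changed: no match
print(before, after)
```

Nothing has to be deleted when code changes. A lookup under the new root simply misses, which is what makes the stale feedback invisible.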

Real-world impact: an agent that repeatedly works in the same area of a codebase gets progressively better context with zero configuration.

codegraph Fails on 2 Languages

codegraph could not produce results on 10/167 tasks (Spark Java, Ocelot C#). knowing
handled all 167. If your codebase includes Java or C#, codegraph gives you nothing.

The 4,717x Latency Story

Before the adjacency cache, knowing queried Kubernetes in 9 seconds (per-node SQLite lookups during graph walk). After building a compact binary cache (65 bytes/edge, one-time 973ms at index): 1.9 milliseconds. That's 4,717x.

The "500x faster than codegraph" headline understates it. The real improvement vs our own uncached baseline is 4,717x. Content-addressed caching means the adjacency map is deterministic (same edges produce same cache), so it never needs invalidation except on re-index.

Query Robustness: The Honest Negative

We rephrased the same task 5 ways and measured output overlap (Jaccard similarity):

System	Mean Jaccard	Meaning
Aider	0.74	Stable (same output regardless of query)
knowing	0.07	Volatile (different phrasings, different results)

Aider looks good here. But Aider's "stability" means it's ignoring your query. PageRank ranks by graph centrality, not task relevance. It returns the same symbols regardless of what you ask. Stable but wrong 95% of the time (P@10=0.050).

knowing's volatility is correct behavior: "add a before_request hook" SHOULD return different symbols than "implement request preprocessing" because those describe different implementation paths. Precision requires sensitivity to what you actually asked.
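For reference, the overlap metric can be computed like this. The sketch assumes the reported figure is the mean Jaccard similarity over all pairs of rephrasings, which the post does not spell out; the result sets are made up.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_pairwise_jaccard(result_sets):
    """Average Jaccard similarity over every pair of result sets."""
    pairs = list(combinations(result_sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical outputs for three phrasings of one task.
query_independent = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "c"]]
query_sensitive = [["a", "b", "c"], ["b", "c", "d"], ["x", "y", "z"]]

print(mean_pairwise_jaccard(query_independent))
print(mean_pairwise_jaccard(query_sensitive))
```

A score of 1.0 means every phrasing produced the same set. A system that ignores the query entirely scores near the top of this scale by construction.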

We Found a Catastrophic Bug in Our Own System. Here's the Fix.

During benchmarking, our P@10 dropped from 0.230 to 0.101. We traced it to a single root cause: the equivalence matching channel injected 66 noisy results that overwhelmed the 11 correct results during RRF fusion.

The fix was three lines of logic. P@10 recovered to 0.226, within 0.004 of the pre-regression 0.230.
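The failure mode is easy to reproduce with standard Reciprocal Rank Fusion. The sketch below uses the textbook formula with k=60 and made-up channel names; it shows the dilution only. The post does not show the three-line fix, so it is not reproduced here.

```python
from collections import defaultdict

def rrf_fuse(channels, k=60, top_n=10):
    """Reciprocal Rank Fusion: each channel adds 1 / (k + rank) per result."""
    scores = defaultdict(float)
    for ranking in channels.values():
        for rank, symbol in enumerate(ranking, start=1):
            scores[symbol] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

correct = [f"correct_{i}" for i in range(1, 12)]  # 11 relevant results
noise = [f"noise_{i}" for i in range(1, 67)]      # 66 irrelevant results

clean_top10 = rrf_fuse({"fts": correct})
fused_top10 = rrf_fuse({"fts": correct, "equivalence": noise})

clean_hits = sum(s.startswith("correct_") for s in clean_top10)
fused_hits = sum(s.startswith("correct_") for s in fused_top10)
print(clean_hits, fused_hits)  # 10 5
```

RRF only looks at rank, not at how confident a channel is, so a noisy channel's top results tie with the good channel's top results and take half of the top-10 slots.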

We publish this because it builds trust. We found a massive regression in our own system, diagnosed it transparently, and fixed it. The methodology caught it. If you can't find your own bugs, your numbers aren't credible.

You Can't Game These Numbers

We ran a 32-configuration parameter sweep across every tunable parameter in the pipeline: RWR restart probability, max seeds, score cutoffs, ranking weights, RRF constants, test penalties, BM25 column weights.

Result: all 32 configurations produce identical P@10. Zero variance.

Sweep	Configs Tested	Result
RWR alpha (0.10-0.40)	5	All 0.207
Max seeds (10-30)	5	All 0.207
Score cutoff (0.005-0.10)	4	All 0.207
Ranking weights	5	All 0.207
RRF k (20-100)	4	All 0.207
Doc BM25 weight (1.0-10.0)	6	All 0.207
Combined configs	3	All 0.207

P@10 is determined by graph reachability (a structural property), not parameter tuning. The only things that moved our numbers were architectural changes: inheritance propagation (+29%), docstring FTS (+5%), import resolution. Tweaking weights does nothing. You can't inflate these numbers with heuristics. The architecture is what matters.

Statistical Methodology

167 tasks, 9 repos, 6 languages (Go, Python, TypeScript, Rust, C#, Java)
Hand-curated ground truth (95% achievability, validated against DB)
Wilcoxon signed-rank test (paired, non-parametric)
Cohen's d effect size with bootstrap confidence intervals
Full reproduction: GOWORK=off go test ./bench/cross-system/ -v -timeout 30m

How We Tested

This isn't a demo on a cherry-picked example. It's a controlled evaluation.

Corpus: 9 public repositories covering 6 languages and the full scale range:

Repo	Language	LOC	Edges	Tasks
Kubernetes	Go	3.5M	359K	19
Terraform	Go	2M	184K	20
VS Code	TypeScript	1M	133K	19
Kafka	Java	500K	780K	19
Django	Python	300K	324K	33
Cargo	Rust	150K	98K	19
Flask	Python	15K	13K	19
Spark	Java	14K	10K	5
Ocelot	C#	30K	41K	5

Tasks: 167 hand-curated fixtures across 3 difficulty tiers (easy, medium, hard): the
158 per-repo tasks in the table above plus the 9 cross-cutting tasks from the per-repo
breakdown. Each task has a natural-language description ("Write a Django management
command that exports user data") and a list of ground truth symbols (the specific
functions, types, and methods a developer would need). Ground truth was validated
against actual database contents (95% achievability rate) and never derived from
knowing's own output.

Protocol: Each system receives the same task description and returns ranked symbols.
We measure:

P@10: fraction of top-10 results that match ground truth (precision)
R@10: fraction of ground truth found in top-10 (recall)
NDCG@10: ranking quality (rewards correct results ranked higher)
MRR: mean reciprocal rank of the first correct result (1/rank, averaged over tasks)
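The four metrics above can be written down in a few lines. This sketch uses the standard binary-relevance definitions; the benchmark's own code may differ in details, for example how it handles a system that returns fewer than 10 results. The symbols and ground truth are made up.

```python
import math

def precision_at_k(ranked, truth, k=10):
    """Fraction of the top-k slots filled with ground-truth symbols."""
    return sum(1 for s in ranked[:k] if s in truth) / k

def recall_at_k(ranked, truth, k=10):
    """Fraction of the ground truth that appears in the top-k."""
    return sum(1 for s in ranked[:k] if s in truth) / len(truth) if truth else 0.0

def reciprocal_rank(ranked, truth):
    """1 / rank of the first correct result; MRR is the mean over tasks."""
    for position, symbol in enumerate(ranked, start=1):
        if symbol in truth:
            return 1.0 / position
    return 0.0

def ndcg_at_k(ranked, truth, k=10):
    """Binary-relevance NDCG: hits count for more the higher they are ranked."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, symbol in enumerate(ranked[:k], start=1)
        if symbol in truth
    )
    ideal = sum(1.0 / math.log2(p + 1) for p in range(1, min(len(truth), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
truth = {"b", "d", "x"}
print(precision_at_k(ranked, truth), recall_at_k(ranked, truth))
print(reciprocal_rank(ranked, truth), round(ndcg_at_k(ranked, truth), 3))
```

With two of three ground-truth symbols found at ranks 2 and 4, P@10 is 0.2, R@10 is about 0.667, and the reciprocal rank is 0.5.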

Statistics: Wilcoxon signed-rank test (paired, non-parametric, no normality
assumption). Cohen's d effect size. Bootstrap 95% confidence intervals. Significance
threshold p < 0.05.
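Two of these statistics need nothing beyond the standard library. The sketch below computes a paired Cohen's d (the d_z variant, an assumption about which form the benchmark uses) and a percentile bootstrap interval on the mean per-task difference. The scores are made up. For the Wilcoxon test itself, `scipy.stats.wilcoxon` is the usual Python entry point.

```python
import random
import statistics

def cohens_d_paired(a, b):
    """Effect size for paired samples: mean difference over its std deviation."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

def bootstrap_ci(a, b, resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    diffs = [x - y for x, y in zip(a, b)]
    means = sorted(
        statistics.mean(rng.choices(diffs, k=len(diffs))) for _ in range(resamples)
    )
    lower = means[int((1 - level) / 2 * resamples)]
    upper = means[int((1 + level) / 2 * resamples) - 1]
    return lower, upper

# Hypothetical per-task P@10 for two systems on the same six tasks.
system_a = [0.3, 0.2, 0.4, 0.1, 0.3, 0.2]
system_b = [0.1, 0.2, 0.2, 0.0, 0.1, 0.1]

effect = cohens_d_paired(system_a, system_b)
low, high = bootstrap_ci(system_a, system_b)
mean_diff = statistics.mean(x - y for x, y in zip(system_a, system_b))
print(round(effect, 2), round(low, 3), round(high, 3))
```

Pairing matters here: both systems are scored on the same tasks, so the statistics are computed on per-task differences rather than on two independent samples.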

Fairness controls:

knowing's own repo is excluded from the corpus
All systems get the same task descriptions (no system-specific tuning)
Cold start: no pre-existing feedback or session state
Each system uses its own recommended configuration
Statistical tests are paired (same tasks, different systems)

Reproduction:

git clone https://github.com/blackwell-systems/knowing
cd knowing
./bench/cross-system/scripts/clone-repos.sh
./bench/cross-system/scripts/index-repos.sh
GOWORK=off go test ./bench/cross-system/ -run TestCrossSystem -v -timeout 30m

Every number in this post is reproducible from that command.

Try It

brew install blackwell-systems/tap/knowing
# MCP integration (auto-indexes on first query):

{ "mcpServers": { "knowing": { "command": "knowing", "args": ["mcp", "--watch"] } } }

No configuration. No manual indexing. The MCP server auto-detects your git repo and indexes on first launch.

The Complete Picture

Dimension	knowing	codegraph	GitNexus	Gortex	Aider	grep
P@10 (precision)	0.207	0.135	0.075	0.063	0.050	0.013
Tasks completed	167/167	157/167	66/167	66/167	timed out	167/167
Query latency (k8s)	2ms	~1s	612ms	~6s	~3s	instant
Time-to-consistency	167ms	805ms	minutes	minutes	3,150ms	instant
Index Kubernetes	18.6s	-	>60 min	14.2 min	N/A	N/A
RAM (Kubernetes)	200MB	-	5.7GB	14GB	-	-
Handles k8s (3.5M)	Yes	Yes	No (killed)	Slow (14GB)	Slow	Yes
Determinism	Yes	Yes	No (7-9 unique)	Yes	No	Yes
Stars	0	19,459	-	-	~20K	N/A

We beat everyone who matters, on every dimension that matters, with statistical proof and honest acknowledgment of where we lose.

MIT license. Single Go binary. Open source.

github.com/blackwell-systems/knowing

Benchmark methodology: METHODOLOGY.md

Full findings: FINDINGS.md
