DEV Community

Dayna Blackwell
Dayna Blackwell

Posted on • Originally published at blog.blackwell-systems.com

We Benchmarked the Most Popular Code Search Tools. We Beat All of Them.

codegraph has 19,459 GitHub stars. We have zero. So we stopped talking and started measuring.

The Headline

System P@10 Query k8s Time-to-consistency Stars
knowing 0.207 2ms 167ms 0
codegraph 0.135 ~1s 805ms 19,459
GitNexus 0.075 612ms minutes -
Gortex 0.063 ~6s minutes -
Aider - ~3s 3,150ms (timed out) ~20K
codebase-memory - 2,900ms N/A (timed out) 2,600
grep 0.013 instant instant N/A

P@10 = fraction of top-10 results that are relevant to the task. Higher is better.

knowing is 1.53x more precise than codegraph (19K stars, tree-sitter + FTS5).
knowing is 2.76x more precise than GitNexus (knowledge graph MCP).
knowing is 3.29x more precise than Gortex (Go graph engine, 256 languages).
knowing is 15.9x more precise than grep.

Why 19K Stars Means Nothing

codegraph uses tree-sitter + FTS5 + heuristic scoring (co-location bonuses, multi-term matching, CamelCase boundary matching). No graph-theoretic ranking. No random walk. No structural propagation.

knowing uses Random Walk with Restart on a content-addressed call graph. The walk propagates relevance through the actual dependency structure: "this function calls that one, which implements this interface, which is tested by those tests." Structural relevance, not string coincidence.

The result: codegraph finds symbols that contain your keywords. knowing finds symbols that are structurally relevant to your task. These are often different things.

Per-Repo Breakdown

Repo Language LOC knowing P@10 Tasks
Flask Python 15K 0.332 19
Terraform Go 2M 0.275 20
Ocelot C# 30K 0.260 5
Kafka Java 500K 0.253 19
Cross-cutting Mixed - 0.200 9
Django Python 300K 0.182 33
Spark Java 14K 0.180 5
Kubernetes Go 3.5M 0.153 19
VS Code TypeScript 1M 0.137 19
Cargo Rust 150K 0.132 19

Repos with well-structured class hierarchies and documentation perform best (Flask 0.332, Terraform 0.275, Kafka 0.253). knowing's docstring FTS directly leverages developer-written documentation as a retrieval signal, and inheritance propagation creates paths through class hierarchies. Even the weakest repo (Cargo 0.132) exceeds grep's best (0.013).

Query Latency: 500x Faster on Enterprise Repos

codegraph queries Kubernetes in about 1 second (BM25, no graph walk). knowing with its pre-computed adjacency cache: 2 milliseconds. That's 500x faster.

Metric knowing codegraph
k8s query (782K edges) 2ms ~1s
Cache build (one-time) 973ms N/A
Format 65 bytes/edge binary N/A

The cache is built once at index time and loads the entire graph in one SQLite read. RWR then runs entirely in memory. The 4,717x improvement (from 9s uncached to 2ms cached) is a structural advantage of content-addressed caching: the adjacency map is deterministic, so it never needs invalidation except on re-index.

Time-to-Consistency: New Code in 167ms

You add a function. How quickly does each system find it?

System Total time Found?
knowing 167ms Yes (rank 2)
codegraph 805ms Yes
Aider 3,150ms No

Protocol: inject validate_authentication_token() into Flask, trigger incremental reindex, query for it.

knowing's IndexFilesIncremental takes 16ms (constant, regardless of repo size). codegraph's sync rescans the entire repo (scales linearly). Aider re-parses everything on every query and still doesn't find the new function.

Why Aider fundamentally cannot find new code: A newly added function with no callers has zero in-degree, so PageRank assigns it minimal weight. It will never surface in ranked results until other code calls it. This means every time you write a new function, Aider's context is blind to it. knowing finds it via FTS keyword match, bypassing the need for graph connectivity.

Agent Efficiency: 99.9% Noise Elimination

On Kubernetes (3.5M LOC), an agent doing grep Handler gets 1,284 matches. For "Controller": 14,896 matches. The agent must read/filter all of them.

knowing returns 10 ranked results with 72% ground truth hit rate. codegraph returns 28/50. GitNexus returns 0 (can't handle k8s at all).

System Ground truth in top-10 Grep noise to sift
knowing 36/50 (72%) 10 results
codegraph 28/50 (56%) 3-20 results
GitNexus 0/50 0 (scale failure)
grep N/A 10,840 per task

The advantage isn't just precision. It's that knowing delivers 10 results from 10,840 candidates. That's 99.9% noise elimination before the agent sees anything.

Determinism: Same Question, Same Answer

We ran the same task 10 times per system.

System Unique outputs (10 runs) Verdict
knowing 1 DETERMINISTIC
codegraph 1 DETERMINISTIC
GitNexus 7-9 NON-DETERMINISTIC
Aider 3 NON-DETERMINISTIC

GitNexus gives a different answer almost every time you ask. Aider varies moderately. You can't regression-test a non-deterministic context system. You can't debug agent behavior if the context changes between runs.

knowing's determinism is structural: content-addressed PackRoot guarantees the same input produces the same output. Always.

The Full Competitor Landscape

We benchmarked every code retrieval tool we could install. Here's the complete picture.

GitNexus (Knowledge Graph MCP)

P@10 = 0.075. Has task-oriented retrieval but 2.76x less precise than knowing.

Fatal flaw: cannot handle enterprise repos. Killed after 60 minutes on Kubernetes (5.7GB RAM, single-threaded JavaScript). knowing indexes the same repo in 18.6 seconds at 200MB RAM.

Metric knowing GitNexus Ratio
P@10 0.207 0.075 2.76x
Query latency 2ms 612ms 306x
Index Kubernetes 18.6s >60 min (killed) >193x
RAM (Kubernetes) 200MB 5.7GB 28x less
Determinism 1 unique 7-9 unique Non-deterministic
Tasks completed 167/167 66/167 56% failure rate

GitNexus also gives a different answer almost every time you ask the same question (7-9 unique outputs in 10 runs). You can't trust results you can't reproduce.

Aider (~20K stars, PageRank repo-map)

P@10 = 0.050 (prior run; timed out on current full benchmark). File-level retrieval (not symbol-level). Uses tree-sitter + PageRank.

Metric knowing Aider Ratio
P@10 0.207 0.050 4.1x
Query latency (Flask) 151ms 3,150ms 21x
Finds new symbols Yes No N/A
Determinism Yes No (3 unique/10) N/A

Aider's PageRank approach ranks files by how often they're referenced. This means:

  • New code (no callers yet) is invisible
  • Results are query-independent (same output regardless of what you ask)
  • File-level granularity means you get entire files, not specific symbols

Gortex (Go graph engine, 256 languages)

The most architecturally similar competitor (Go, tree-sitter, parallel graph). P@10 = 0.063 on the 66 tasks it could complete.

Metric knowing Gortex Ratio
P@10 0.207 0.063 3.29x
Index Kubernetes 18.6s 14.2 min 46x
RAM (Kubernetes) 200MB 14GB 70x less
Tasks completed 167/167 66/167 44% failure rate
Re-indexes per query No (cached) Yes N/A

Gortex extracts 23x more edges (6.3M vs 268K for k8s) but re-indexes the entire repo on every query call. This makes it impractical for benchmarking multiple tasks and unusable in interactive sessions.

Repomix (25K stars, pack entire repo)

The brute-force approach: dump the entire repo into the context window. No ranking, no intelligence.

Metric knowing Repomix
Tokens for Flask task ~4,000 ~300,000
Token efficiency 48x better baseline
Fits in 8K context? Yes No
Fits in 128K context? Yes Barely

Repomix achieves 100% recall by including everything, at 75x the token cost. Most models can't fit the output. knowing gives ranked, relevant symbols in tokens that fit any model.

codebase-memory-mcp (2.6K stars, BM25 + semantic edges)

P@10 = 0.107 on Flask. Uses tree-sitter (155 grammars) + BM25 + label boost.

Metric knowing codebase-memory
P@10 (Flask+Cargo) 0.207 0.137
Advantage 1.51x baseline
R@10 0.297 0.145
Query latency 0ms (cached) 2,900ms
Handles k8s/Django Yes No (timeout)

codebase-memory's BM25 engine spins at 100% CPU on repos with >40K nodes. Scale:

Repo codebase-memory knowing
Flask (15K LOC) 285ms 0ms
Cargo (150K LOC) ~3s 0ms
Django (300K LOC) hangs (100% CPU) 0ms
VS Code (1M LOC) hangs (>30s, killed) 0ms
k8s (3.5M LOC) killed after 5min 2ms

Scale ceiling: ~150K LOC. Any enterprise codebase is unusable.

Where Each Competitor Dies

Every tool has a breaking point. Only knowing handles the full range.

System Max viable scale Failure mode
knowing unlimited (tested 3.5M LOC) N/A
codegraph unlimited (but fails Java/C#) 10/167 task failures
codebase-memory ~150K LOC 100% CPU hang, no response
GitNexus ~150K LOC OOM (5.7GB RAM), killed after 60min
Gortex unlimited (impractically slow) 14min index, 14GB RAM
Aider unlimited (imprecise) 3s/query, can't find new code

CodeGraphContext (KuzuDB)

Cannot perform task-oriented retrieval. Only supports exact name search. Also: 2,159x slower indexing on Flask (215 seconds vs 0.1 seconds). A navigation tool, not a retrieval system.

Where We Lose

Honesty matters. Here's where knowing is weaker:

Dense TypeScript repos (VS Code P@10=0.137): on large TypeScript codebases with generic symbol names, keyword competition is intense (3,000+ matches for "action"). We mitigate this with density-adaptive type-seed preference (auto-enabled on graphs >40K nodes), but VS Code remains the weakest large repo.

Sparse documentation (Cargo P@10=0.132): Rust repos without /// doc comments don't benefit from docstring FTS. The vocabulary gap persists.

First-result accuracy: codegraph sometimes places the single most relevant symbol at rank 1 more often. But it fills positions 2-10 with noise, dragging precision down. If you only need the #1 result, codegraph is competitive. If you need the top-10 to be useful (which agents do), knowing wins.

No Language Server Required

We tested whether running a language server makes results better. It makes them worse.

Enrichment actually hurts P@10 (0.177 enriched vs 0.185 unenriched). The additional 42K edges from pyright dilute RWR probability mass, spreading relevance across too many paths.

The tree-sitter pipeline + docstring FTS + inheritance propagation already captures
all the connectivity RWR needs. Enrichment adds correctness for audit tools but
actively harms retrieval ranking.

This simplifies deployment: knowing is a single Go binary. No Python LSP, no TypeScript
language server, no background enrichment process. Install and query.

Feedback Compounding (Gets Smarter With Use)

Cold-start P@10 is 0.207. When an agent reports which symbols were useful, knowing records that signal and boosts those symbols in future queries for similar tasks.

The feedback anchors to content-addressed symbol hashes. It persists across sessions and expires automatically when code changes (the package's Merkle root changes, stale feedback becomes invisible). No manual curation. No embedding model. Just hash-keyed counters that decay with staleness.

Real-world impact: an agent that repeatedly works in the same area of a codebase gets progressively better context with zero configuration.

codegraph Fails on 2 Languages

codegraph could not produce results on 10/167 tasks (Spark Java, Ocelot C#). knowing
handled all 167. If your codebase includes Java or C#, codegraph gives you nothing.

The 4,717x Latency Story

Before the adjacency cache, knowing queried Kubernetes in 9 seconds (per-node SQLite lookups during graph walk). After building a compact binary cache (65 bytes/edge, one-time 973ms at index): 1.9 milliseconds. That's 4,717x.

The "500x faster than codegraph" headline understates it. The real improvement vs our own uncached baseline is 4,717x. Content-addressed caching means the adjacency map is deterministic (same edges produce same cache), so it never needs invalidation except on re-index.

Query Robustness: The Honest Negative

We rephrased the same task 5 ways and measured output overlap (Jaccard similarity):

System Mean Jaccard Meaning
Aider 0.74 Stable (same output regardless of query)
knowing 0.07 Volatile (different phrasings, different results)

Aider looks good here. But Aider's "stability" means it's ignoring your query. PageRank ranks by graph centrality, not task relevance. It returns the same symbols regardless of what you ask. Stable but wrong 95% of the time (P@10=0.050).

knowing's volatility is correct behavior: "add a before_request hook" SHOULD return different symbols than "implement request preprocessing" because those describe different implementation paths. Precision requires sensitivity to what you actually asked.

We Found a Catastrophic Bug in Our Own System. Here's the Fix.

During benchmarking, our P@10 dropped from 0.230 to 0.101. We traced it to a single root cause: the equivalence matching channel injected 66 noisy results that overwhelmed the 11 correct results during RRF fusion.

The fix was three lines of logic. P@10 recovered to 0.226, exceeding the pre-regression peak.

We publish this because it builds trust. We found a massive regression in our own system, diagnosed it transparently, and fixed it. The methodology caught it. If you can't find your own bugs, your numbers aren't credible.

You Can't Game These Numbers

We ran a 32-configuration parameter sweep across every tunable parameter in the pipeline: RWR restart probability, max seeds, score cutoffs, ranking weights, RRF constants, test penalties, BM25 column weights.

Result: all 32 configurations produce identical P@10. Zero variance.

Sweep Configs Tested Result
RWR alpha (0.10-0.40) 5 All 0.207
Max seeds (10-30) 5 All 0.207
Score cutoff (0.005-0.10) 4 All 0.207
Ranking weights 5 All 0.207
RRF k (20-100) 4 All 0.207
Doc BM25 weight (1.0-10.0) 6 All 0.207
Combined configs 3 All 0.207

P@10 is determined by graph reachability (a structural property), not parameter tuning. The only things that moved our numbers were architectural changes: inheritance propagation (+29%), docstring FTS (+5%), import resolution. Tweaking weights does nothing. You can't inflate these numbers with heuristics. The architecture is what matters.

Statistical Methodology

  • 167 tasks, 9 repos, 6 languages (Go, Python, TypeScript, Rust, C#, Java)
  • Hand-curated ground truth (95% achievability, validated against DB)
  • Wilcoxon signed-rank test (paired, non-parametric)
  • Cohen's d effect size with bootstrap confidence intervals
  • Full reproduction: GOWORK=off go test ./bench/cross-system/ -v -timeout 30m

How We Tested

This isn't a demo on a cherry-picked example. It's a controlled evaluation.

Corpus: 9 public repositories covering 6 languages and the full scale range:

Repo Language LOC Edges Tasks
Kubernetes Go 3.5M 359K 19
Terraform Go 2M 184K 20
VS Code TypeScript 1M 133K 19
Kafka Java 500K 780K 19
Django Python 300K 324K 33
Cargo Rust 150K 98K 19
Flask Python 15K 13K 19
Spark Java 14K 10K 5
Ocelot C# 30K 41K 5

Tasks: 167 hand-curated fixtures across 3 difficulty tiers (easy, medium, hard).
Each task has a natural-language description ("Write a Django management command that
exports user data") and a list of ground truth symbols (the specific functions, types,
and methods a developer would need). Ground truth validated against actual database
contents (95% achievability rate). Never derived from knowing's own output.

Protocol: Each system receives the same task description and returns ranked symbols.
We measure:

  • P@10: fraction of top-10 results that match ground truth (precision)
  • R@10: fraction of ground truth found in top-10 (recall)
  • NDCG@10: ranking quality (rewards correct results ranked higher)
  • MRR: position of the first correct result

Statistics: Wilcoxon signed-rank test (paired, non-parametric, no normality
assumption). Cohen's d effect size. Bootstrap 95% confidence intervals. Significance
threshold p < 0.05.

Fairness controls:

  • knowing's own repo is excluded from the corpus
  • All systems get the same task descriptions (no system-specific tuning)
  • Cold start: no pre-existing feedback or session state
  • Each system uses its own recommended configuration
  • Statistical tests are paired (same tasks, different systems)

Reproduction:

git clone https://github.com/blackwell-systems/knowing
cd knowing
./bench/cross-system/scripts/clone-repos.sh
./bench/cross-system/scripts/index-repos.sh
GOWORK=off go test ./bench/cross-system/ -run TestCrossSystem -v -timeout 30m
Enter fullscreen mode Exit fullscreen mode

Every number in this post is reproducible from that command.

Try It

brew install blackwell-systems/tap/knowing
# MCP integration (auto-indexes on first query):
Enter fullscreen mode Exit fullscreen mode
{ "mcpServers": { "knowing": { "command": "knowing", "args": ["mcp", "--watch"] } } }
Enter fullscreen mode Exit fullscreen mode

No configuration. No manual indexing. The MCP server auto-detects your git repo and indexes on first launch.

The Complete Picture

Dimension knowing codegraph GitNexus Gortex Aider grep
P@10 (precision) 0.207 0.135 0.075 0.063 0.050 0.013
Tasks completed 167/167 107/167 66/167 66/167 timed out 167/167
Query latency (k8s) 2ms ~1s 612ms ~6s ~3s instant
Time-to-consistency 167ms 805ms minutes minutes 3,150ms instant
Index Kubernetes 18.6s - >60 min 14.2 min N/A N/A
RAM (Kubernetes) 200MB - 5.7GB 14GB - -
Handles k8s (3.5M) Yes Yes No (killed) Slow (14GB) Slow Yes
Determinism Yes Yes No (7-9 unique) Yes No Yes
Stars 0 19,459 - - ~20K N/A

We beat everyone who matters, on every dimension that matters, with statistical proof and honest acknowledgment of where we lose.


MIT license. Single Go binary. Open source.

github.com/blackwell-systems/knowing

Benchmark methodology: METHODOLOGY.md

Full findings: FINDINGS.md

Top comments (0)