DEV Community: Ilias Miftakhov

A Cognitive Benchmark for Code-RAG Retrieval: Part 2 — Why Model Rankings Depend on the Pipeline

Ilias Miftakhov — Sun, 14 Jun 2026 21:00:41 +0000

When developers enter an unfamiliar project, they rarely search for a specific
file by name. They usually ask about system behavior: where incoming
connections are accepted, which component cleans logs, or how a request travels
between architectural layers.

Code-RAG tries to answer such questions through semantic search. It splits and
indexes the source code, then retrieves the context most closely related to a
developer's query.

The quality of this search is often reduced to the choice of embedding model:
compare several candidates and select the one with the highest metric. In
practice, the result also depends on how the code was split and which retrieval
mode was used.

To study these dependencies, I built a Code-RAG benchmark on the Apache Kafka
4.0.0 broker core, a real polyglot project written in Java and Scala. For
thirty questions about system behavior, I identified the correct files in
advance, allowing me to measure how accurately the retrieval pipeline finds
the relevant code.

The results show that a model ranking exists only within a specific
configuration. Changing the chunking strategy, retrieval mode, or query
phrasing can change both the metric value and the order of models in the
ranking.

1. Experimental Setup

In this part of the study, I compare sixteen embedding models, five chunking
configurations, and three retrieval modes: BM25, vector search, and hybrid
search. Each of the thirty questions was expressed in five forms, ranging from
a natural developer question to a query with inaccurate terminology or a
reference to a neighboring module. The structure of these variants and the
evaluation methodology are described in Part 1 of the study.

I compared four groups of variables:

Variable	What changed	What it tested
Embedding model	7 local models through Ollama and 9 commercial APIs	How strongly quality depends on the vector representation of the query and code
Chunking	Whole-file indexing and four fixed-size chunks with overlap	How the indexed fragment size affects a particular model
Retrieval mode	`BM25_ONLY`, `VECTOR_ONLY`, `HYBRID_RRF`	Whether lexical search, vector search, or their combination works best
Query phrasing	Natural question, technical query, keywords, inaccurate terminology, and selected cross-module queries	How strongly the result depends on the language of the query

The three retrieval modes work as follows:

Mode	How the ranking is produced
`BM25_ONLY`	Lucene lexical search. Files rank highly when query terms match terms in the code. No embedding model is used.
`VECTOR_ONLY`	The query and code fragments are converted into embeddings. Ranking is based on vector similarity, so exact word overlap is unnecessary.
`HYBRID_RRF`	BM25 and vector search run independently, then their positions are combined using Reciprocal Rank Fusion. RRF uses rank positions rather than directly adding incomparable scores.
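
As a rough illustration of the fusion step, the sketch below applies Reciprocal Rank Fusion to two ranked lists. The constant k=60 is a commonly used default and an assumption here, not a value reported by this study. The function name and the placeholder files Other1.scala and Other2.scala are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists (best result first) with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for position, doc in enumerate(ranking, start=1):
            # Only the rank position contributes; raw BM25 or cosine
            # scores are never added together.
            scores[doc] += 1.0 / (k + position)
    # Highest fused score first; ties are broken by name for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Illustrative rankings, not benchmark output.
bm25_ranking = ["RequestChannel.scala", "SocketServer.scala", "Other1.scala"]
vector_ranking = ["SocketServer.scala", "Other2.scala", "RequestChannel.scala"]
fused = rrf_fuse([bm25_ranking, vector_ranking])
print(fused)
```

In this toy case the file ranked second by BM25 and first by vector search ends up on top, because it holds a good position in both lists.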

The primary metric in this article is recall@10. For a single question, it
equals 1 when the primary correct file appears in the first ten results and 0
otherwise. The final value is the average across thirty questions. For example,
recall@10 = 0.900 means that the correct file appeared in the top ten for 27
of 30 questions.
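
The metric can be sketched in a few lines. This is a minimal illustration with synthetic rankings, assuming each question has exactly one primary file; the function names and file names are hypothetical and not taken from the benchmark code.

```python
def hit_at_k(ranked_files, primary_file, k=10):
    """Return 1 if the primary file is among the first k results, else 0."""
    return 1 if primary_file in ranked_files[:k] else 0


def recall_at_k(runs, k=10):
    """Average per-question hits; runs is a list of (ranked_files, primary_file)."""
    return sum(hit_at_k(ranked, primary, k) for ranked, primary in runs) / len(runs)


# Synthetic example: 27 questions place the primary file at rank 3,
# and 3 questions place it at rank 11, just outside the cutoff.
filler = [f"file_{i}.java" for i in range(12)]
hit = filler[:2] + ["Primary.java"] + filler[2:]
miss = filler[:10] + ["Primary.java"] + filler[10:]
runs = [(hit, "Primary.java")] * 27 + [(miss, "Primary.java")] * 3
score = recall_at_k(runs)
print(round(score, 3))
```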

The model ranking also reports a 95% bootstrap confidence interval (95% CI).
To calculate it, I repeatedly resampled the set of questions with replacement
and recalculated recall for each sample. A wide interval means that thirty
questions are insufficient for precisely estimating small differences.
Overlapping intervals are not themselves a formal pairwise test, but they warn
against treating the order of neighboring rows as stable.
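
A percentile bootstrap over questions might look like the sketch below. It is an illustration under assumptions, not the project's script: the function name is hypothetical, the per-question hits are synthetic, and the defaults of 2,000 samples and seed 42 are borrowed from the reproducibility notes only for convenience.

```python
import random


def bootstrap_ci(hits, n_samples=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of per-question 0/1 hits."""
    rng = random.Random(seed)
    n = len(hits)
    # Resample the questions with replacement and recompute recall each time.
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_samples))
    lower = means[int(n_samples * alpha / 2)]
    upper = means[int(n_samples * (1 - alpha / 2)) - 1]
    return lower, upper


# Synthetic hits for a model that finds 27 of 30 primary files.
hits = [1] * 27 + [0] * 3
lower, upper = bootstrap_ci(hits)
print(f"recall@10 = {sum(hits) / len(hits):.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

With only thirty binary outcomes, each resampled recall moves in steps of 1/30, which is one reason the resulting intervals are wide.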

The chunking label c500-o100 means fragments of 500 characters with a
100-character overlap. The label whole-file means that an entire file is
indexed as one fragment.
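
Fixed-size chunking with overlap can be sketched as follows. How the benchmark handles the final short fragment is not stated, so the behavior here (keep the remainder as a shorter last chunk) is an assumption, and the function name is hypothetical.

```python
def chunk_text(text, size=500, overlap=100):
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            # The current chunk already reaches the end of the text.
            break
    return chunks


# A 1,200-character synthetic "file" split with the c500-o100 settings.
source = "".join(chr(ord("a") + i % 26) for i in range(1200))
chunks = chunk_text(source, size=500, overlap=100)
print([len(c) for c in chunks])
```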

I did not test the complete Cartesian product of all parameters. Models were
compared under a fixed baseline configuration; local and commercial models were
tested across five chunking configurations; and VECTOR_ONLY was compared
with HYBRID_RRF at the fixed c1500-o200 chunking. The full interaction
between retrieval mode and chunking remained outside the study. BM25 was run as
a single baseline because it does not depend on an embedding model.

2. Conditions for Comparing Models

To compare embedding models, the remaining retrieval-pipeline parameters must
be fixed. Otherwise, it is impossible to tell whether a difference was caused
by the model, fragment size, or retrieval method.

The baseline comparison used natural human questions, HYBRID_RRF, and
c1500-o200 chunking. For each model, I measured the share of thirty questions
for which the correct file appeared in the first ten results.

This ranking compares models under identical conditions, but it does not
describe their quality outside the selected configuration. For example, OpenAI
text-embedding-3-large achieved recall@10 = 0.833 with c1500-o200,
0.900 with the smaller c500-o100 fragments, and 0.433 when files were
indexed whole.

The value 0.833 therefore cannot be treated as an independent property of the
model. It describes one combination of model, chunking, retrieval mode, corpus,
and question set. The baseline ranking is a useful starting point, but it
cannot identify the best configuration without testing the other parameters.

3. The Effect of Chunking

Ideally, code should be split along logical boundaries such as methods, classes,
or other structural units. Structural chunking, however, requires a dedicated
parser for every language.

This study deliberately uses a polyglot Java and Scala project. I therefore
split the code into fixed-size fragments. This is not presented as the optimal
way to index code; it provides a common denominator across languages and makes
it possible to isolate the effect of fragment size.

Every value in the table is recall@10 for natural human questions using
hybrid retrieval. The best observed result for each model is shown in bold.

Model	c500-o100	c1500-o200	c3000-o300	c5000-o500	whole-file	Type
all-minilm	0.733	0.700	0.667	0.700	0.567	local
bge-m3	0.833	0.767	0.767	0.633	0.733	local
granite-embedding	0.800	0.767	0.767	0.633	0.433	local
mxbai-embed-large	0.900	0.833	0.733	0.633	0.433	local
nomic-embed-text	0.733	0.667	0.700	0.633	0.133	local
qwen3-embedding:0.6b	0.800	0.800	0.933	0.833	0.900	local
snowflake-arctic-embed2	0.800	0.800	0.800	0.733	0.733	local
EmbeddingsGigaR	0.767	0.767	0.800	0.789	0.167	commercial
GigaEmbeddings-3B	0.767	0.867	0.833	0.767	0.300	commercial
codestral-embed	0.800	0.900	0.867	0.800	0.467	commercial
mistral-embed-2312	0.800	0.800	0.900	0.800	0.400	commercial
text-embedding-3-large	0.900	0.833	0.867	0.800	0.433	commercial
text-embedding-3-small	0.867	0.867	0.867	0.867	0.433	commercial
voyage-4-large	0.900	0.833	0.867	0.833	0.867	commercial
voyage-code-2	0.933	0.867	0.800	0.800	0.533	commercial
voyage-code-3	0.900	0.867	0.900	0.833	0.833	commercial

Smaller Fragments Helped Most Local Models

For five of the seven local models, c500-o100 alone produced the highest
observed result, and snowflake-arctic-embed2 matched its best score at that
size. One possible explanation is that a small fragment contains less
unrelated code. Its embedding can describe a local implementation more
precisely, while BM25 benefits from matching specific terms.

The experiment does not establish this mechanism directly. Doing so would
require inspecting retrieved fragments and comparing hybrid and vector-only
search at every chunk size.

Some Models Benefit from Larger Fragments

qwen3-embedding:0.6b achieved its highest result with c3000-o300 and still
reached 0.900 when indexing whole files. Unlike most local models, it retained
quality on larger fragments.

A possible explanation is the model's ability to process longer context. A
larger fragment preserves relationships between methods and their surrounding
class that smaller fragments may lose. A similar pattern appeared for
mistral-embed-2312, EmbeddingsGigaR, and partly for voyage-code-3.

This remains a hypothesis: the experiment measured retrieval outcomes, not the
internal cause of each model's behavior.

Whole-File Indexing Is the Riskiest Choice

With whole-file, results ranged from 0.133 to 0.900. The approach
remained viable for qwen3-embedding:0.6b, voyage-4-large, and voyage-code-3,
but quality dropped sharply for nomic-embed-text and EmbeddingsGigaR.

The likely explanation is context-window limits and truncation of long files.
Because I did not directly measure truncation by provider tokenizers, this must
also remain a hypothesis.

The matrix does not reveal a universally best fragment size. Instead, it shows
three kinds of behavior:

models that benefit substantially from small fragments;
models that require more surrounding context;
models that remain stable across chunking configurations.

Chunking should therefore be selected together with the embedding model. When
tuning time is limited, c500-o100 is a reasonable starting point, but at
least one larger alternative should also be tested, and whole-file should
not be used without separate validation.
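
One way to read the chunking matrix per model is to compute the best configuration and the spread between the best and worst result. The sketch below does this for three rows copied from the table above; the helper name is hypothetical.

```python
CONFIGS = ["c500-o100", "c1500-o200", "c3000-o300", "c5000-o500", "whole-file"]

# recall@10 values copied from the chunking table for three models.
RESULTS = {
    "nomic-embed-text": [0.733, 0.667, 0.700, 0.633, 0.133],
    "qwen3-embedding:0.6b": [0.800, 0.800, 0.933, 0.833, 0.900],
    "voyage-4-large": [0.900, 0.833, 0.867, 0.833, 0.867],
}


def summarize(values):
    """Return the best configuration and the spread between best and worst."""
    best_index = max(range(len(values)), key=values.__getitem__)
    return CONFIGS[best_index], round(max(values) - min(values), 3)


summary = {model: summarize(values) for model, values in RESULTS.items()}
for model, (best, spread) in summary.items():
    print(f"{model}: best={best}, spread={spread}")
```

A large spread signals a model whose chunking must be tuned, while a small spread signals one that tolerates a default.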

4. The Effect of Retrieval Mode

After choosing how to split the code, the next question is how to retrieve the
relevant fragments. The experiment compared three modes:

BM25_ONLY matches words in the query against words in the code;
VECTOR_ONLY compares semantic similarity between embeddings;
HYBRID_RRF combines the rank positions from BM25 and vector search.

The retrieval-mode comparison used c1500-o200. In an earlier experiment, the
combination c1500-o200 + HYBRID_RRF produced the strongest result available
at the time and became the control configuration for later runs.

The subsequent chunking matrix showed that there is no universally optimal
fragment size. Keeping c1500-o200, however, allowed retrieval modes to be
compared under identical conditions without mixing their effect with a
chunking change.

The full matrix of retrieval modes and chunking configurations was not tested.
The results below therefore describe retrieval-mode behavior only at
c1500-o200.

Every value is recall@10 for natural human questions. The best mode for
each model is shown in bold.

Model	BM25_ONLY	VECTOR_ONLY	HYBRID_RRF	Type
No embedding model	0.600	—	—	lexical baseline
all-minilm	—	0.667	0.700	local
bge-m3	—	0.867	0.767	local
granite-embedding	—	0.733	0.767	local
mxbai-embed-large	—	0.833	0.833	local
nomic-embed-text	—	0.667	0.667	local
qwen3-embedding:0.6b	—	0.800	0.800	local
snowflake-arctic-embed2	—	0.900	0.800	local
EmbeddingsGigaR	—	0.711	0.767	commercial
GigaEmbeddings-3B	—	0.833	0.867	commercial
codestral-embed	—	0.967	0.900	commercial
mistral-embed-2312	—	0.900	0.800	commercial
text-embedding-3-large	—	0.867	0.833	commercial
text-embedding-3-small	—	0.833	0.867	commercial
voyage-4-large	—	0.878	0.833	commercial
voyage-code-2	—	0.933	0.867	commercial
voyage-code-3	—	0.933	0.867	commercial

Commercial-model values are averaged across three repeated runs, so some
values are not multiples of one question out of thirty.

Hybrid Search Was Not a Universal Improvement

Adding BM25 to vector search helped two local and three commercial models. It
made no difference for three local models. In the remaining cases, hybrid
retrieval reduced recall@10.
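
These counts can be rechecked directly from the table. The sketch below copies the VECTOR_ONLY and HYBRID_RRF columns and groups the models by outcome; the function name is hypothetical.

```python
# (VECTOR_ONLY, HYBRID_RRF) recall@10 at c1500-o200, copied from the table.
RESULTS = {
    "all-minilm": (0.667, 0.700),
    "bge-m3": (0.867, 0.767),
    "granite-embedding": (0.733, 0.767),
    "mxbai-embed-large": (0.833, 0.833),
    "nomic-embed-text": (0.667, 0.667),
    "qwen3-embedding:0.6b": (0.800, 0.800),
    "snowflake-arctic-embed2": (0.900, 0.800),
    "EmbeddingsGigaR": (0.711, 0.767),
    "GigaEmbeddings-3B": (0.833, 0.867),
    "codestral-embed": (0.967, 0.900),
    "mistral-embed-2312": (0.900, 0.800),
    "text-embedding-3-large": (0.867, 0.833),
    "text-embedding-3-small": (0.833, 0.867),
    "voyage-4-large": (0.878, 0.833),
    "voyage-code-2": (0.933, 0.867),
    "voyage-code-3": (0.933, 0.867),
}


def group_by_outcome(results):
    """Group models by whether hybrid retrieval helped, changed nothing, or hurt."""
    groups = {"helped": [], "unchanged": [], "hurt": []}
    for model, (vector, hybrid) in results.items():
        if hybrid > vector:
            groups["helped"].append(model)
        elif hybrid < vector:
            groups["hurt"].append(model)
        else:
            groups["unchanged"].append(model)
    return groups


groups = group_by_outcome(RESULTS)
print({name: len(models) for name, models in groups.items()})
```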

Among local models, the clearest differences appeared for bge-m3 and
snowflake-arctic-embed2: vector-only search improved their results by
0.100. Among commercial models, mistral-embed-2312 showed the same
improvement.

One possible explanation is that BM25 helps when the correct file contains
query terms missed by vector search. It can also promote lexically similar but
semantically incorrect files and weaken an already strong vector ranking. The
experiment did not test this mechanism directly.

BM25 Remains a Useful Baseline

For natural questions, BM25 achieved recall@10 = 0.600, below every
embedding-based result measured at c1500-o200. Its result, however, depended
strongly on query language.

For queries composed of technical terms and keywords, BM25 reached
0.833–0.867. With inaccurate terminology, it fell to 0.400. Lexical search
works well when the developer already knows the names of relevant entities,
but it is less effective when system behavior is described in the developer's
own words.

The choice of retrieval mode, like the choice of chunking, depends on the
embedding model. Hybrid retrieval cannot be assumed to improve vector search:
it helped some models, left some unchanged, and reduced the results of others.

A practical evaluation should compare at least VECTOR_ONLY and HYBRID_RRF
on the selected model and representative queries. BM25 remains both a useful
control point and a standalone option for precise technical searches.

5. The Effect of Query Phrasing

The same question about code can be expressed in different ways. A developer
may describe system behavior in natural language, list known technical terms,
or use a plausible but incorrect name for a component.

To test retrieval robustness under these changes, each question was represented
in several forms:

human — a natural developer question;
ai_optimized — a detailed query using precise technical terminology;
keyword — a short list of keywords;
wrong_terminology — the original intent with one controlled terminology error;
cross_module — a question connecting multiple system components.

The construction rules for these variants are described in Part 1 of the study.

This comparison fixed chunking at c1500-o200 and used HYBRID_RRF. Every
value is recall@10. The cross_module variant existed for only ten
applicable questions, while the other results were calculated across all
thirty.

Model	human	ai_optimized	keyword	wrong_terminology	cross_module	Type
BM25 without an embedding model	0.600	0.833	0.867	0.400	0.600	baseline
all-minilm	0.700	0.933	0.933	0.433	0.600	local
bge-m3	0.767	1.000	0.833	0.633	0.700	local
granite-embedding	0.767	1.000	0.967	0.567	0.700	local
mxbai-embed-large	0.833	0.967	0.867	0.633	0.700	local
nomic-embed-text	0.667	0.700	0.733	0.467	0.700	local
qwen3-embedding:0.6b	0.800	1.000	0.900	0.600	0.700	local
snowflake-arctic-embed2	0.800	1.000	1.000	0.633	0.700	local
EmbeddingsGigaR	0.767	1.000	0.900	0.600	0.700	commercial
GigaEmbeddings-3B	0.867	1.000	0.800	0.633	0.700	commercial
codestral-embed	0.900	1.000	1.000	0.667	0.700	commercial
mistral-embed-2312	0.800	1.000	1.000	0.633	0.700	commercial
text-embedding-3-large	0.833	1.000	0.933	0.600	0.700	commercial
text-embedding-3-small	0.867	1.000	0.967	0.567	0.700	commercial
voyage-4-large	0.833	1.000	1.000	0.733	0.700	commercial
voyage-code-2	0.867	1.000	1.000	0.700	0.700	commercial
voyage-code-3	0.867	1.000	1.000	0.700	0.700	commercial

Precise Technical Queries Make Retrieval Easier

The ai_optimized variant contains class names, component names, and
operations already present in the code. On these queries, all nine commercial
and four of the seven local models reached recall@10 = 1.000.

This does not mean that code retrieval is solved. It shows that Code-RAG works
far better when the user already knows the terminology and approximate
location of the answer. In practice, retrieval is often needed precisely
because that knowledge is missing.

Short keyword queries also performed well. Even BM25 reached 0.867, because
the keywords often matched names and terms in the source code directly.

Inaccurate Terminology Hurts Every Model

Replacing one term with a plausible but incorrect alternative reduced the
result of every tested model. Among commercial models, the drop relative to
human ranged from 0.100 for voyage-4-large to 0.300 for
text-embedding-3-small.
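
The drops can be recomputed from the human and wrong_terminology columns of the table. A small sketch, with hypothetical variable names:

```python
# (human, wrong_terminology) recall@10 for the commercial models, from the table.
COMMERCIAL = {
    "EmbeddingsGigaR": (0.767, 0.600),
    "GigaEmbeddings-3B": (0.867, 0.633),
    "codestral-embed": (0.900, 0.667),
    "mistral-embed-2312": (0.800, 0.633),
    "text-embedding-3-large": (0.833, 0.600),
    "text-embedding-3-small": (0.867, 0.567),
    "voyage-4-large": (0.833, 0.733),
    "voyage-code-2": (0.867, 0.700),
    "voyage-code-3": (0.867, 0.700),
}

# Absolute loss in recall@10 when one term in the query is wrong.
drops = {model: round(human - wrong, 3) for model, (human, wrong) in COMMERCIAL.items()}
most_robust = min(drops, key=drops.get)
least_robust = max(drops, key=drops.get)
print(most_robust, drops[most_robust])
print(least_robust, drops[least_robust])
```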

The model ranked first on natural questions was not the most robust to
terminology distortion. codestral-embed fell from 0.900 to 0.667, while
voyage-4-large fell from 0.833 to 0.733.

Code specialization also failed to predict robustness. The smallest and
largest observed drops among commercial models both belonged to
general-purpose models.

Cross-Module Queries Need a Separate Study

The cross_module values barely distinguish the embedding models: every model
except all-minilm received 0.700, while all-minilm received 0.600.
This variant existed for only ten questions, so the result cannot be
interpreted as evidence of equal robustness.

A meaningful comparison would require a separate question set focused on
relationships between modules and containing more examples of this type.

Query phrasing is another parameter of the retrieval pipeline. Precise
terminology can bring almost every model close to the maximum result, while a
small terminology error can reduce quality substantially.

Model selection should therefore account for where queries come from. A system
for developers familiar with the codebase and a system for new team members or
non-technical users may require different configurations.

6. Model Comparison Results

For the baseline comparison, every model ran under the same conditions:
natural human questions, HYBRID_RRF, and c1500-o200 chunking.

Rank	Model	Model types in row	Recall@10	95% CI
1	mistral/codestral-embed	commercial	0.900	[0.800–1.000]
2–5	GigaEmbeddings-3B, text-embedding-3-small, voyage-code-2, voyage-code-3	commercial	0.867	[0.733–0.967]
6–8	mxbai-embed-large, text-embedding-3-large, voyage-4-large	local and commercial	0.833	[0.700–0.967]
9–11	qwen3-embedding:0.6b, snowflake-arctic-embed2	local	0.800	[0.633–0.933]
9–11	mistral-embed-2312	commercial	0.800	[0.666–0.933]
12–14	bge-m3, granite-embedding, EmbeddingsGigaR	local and commercial	0.767	[0.600–0.900]
15	all-minilm	local	0.700	[0.533–0.867]
16	nomic-embed-text	local	0.667	[0.500–0.833]

At first glance, this looks like a conventional ranking: a specialized
commercial model takes first place, and the remaining results gradually fall
from 0.867 to 0.667. With thirty questions, however, a difference of
0.033 represents only one retrieved file.

One question separates codestral-embed from the next four models. Those four
retrieved the correct files for the same 26 questions out of 30, while the
leader retrieved one additional file. A paired bootstrap analysis showed that
the confidence interval for every pairwise difference among the top five
models included zero. The available data is therefore insufficient to treat
their order as stable.
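
The sketch below shows one way a paired bootstrap over questions could be set up. It is not the analysis code used for the article: the function name is hypothetical, and the per-question hits are synthetic, constructed only to mirror the situation described above (one model solves 26 questions, the other solves the same 26 plus one more).

```python
import random


def paired_bootstrap_ci(hits_a, hits_b, n_samples=2000, alpha=0.05, seed=42):
    """Percentile CI for mean(hits_a) - mean(hits_b), resampling questions jointly."""
    rng = random.Random(seed)
    n = len(hits_a)
    diffs = []
    for _ in range(n_samples):
        # The same resampled question indices are used for both models,
        # which keeps the comparison paired.
        sample = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(hits_a[i] - hits_b[i] for i in sample) / n)
    diffs.sort()
    lower = diffs[int(n_samples * alpha / 2)]
    upper = diffs[int(n_samples * (1 - alpha / 2)) - 1]
    return lower, upper


# Synthetic per-question hits, aligned by question index.
hits_b = [1] * 26 + [0] * 4
hits_a = [1] * 27 + [0] * 3
lower, upper = paired_bootstrap_ci(hits_a, hits_b)
includes_zero = lower <= 0.0 <= upper
print(lower, upper, includes_zero)
```

When two models differ on a single question, a large share of resamples omit that question entirely, so the interval for the difference reaches zero.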

The separation between local and commercial models was also less pronounced
than expected. Local mxbai-embed-large achieved 0.833, two correct answers
behind the leader out of thirty, and its confidence interval overlaps those of
commercial models.

Larger and more expensive models did not always produce better results.
text-embedding-3-large achieved 0.833, while the cheaper
text-embedding-3-small reached 0.867. The compact 62 MB
granite-embedding tied the 1.2 GB bge-m3 at 0.767. Two generations of
specialized Voyage models, voyage-code-2 and voyage-code-3, also completed
the baseline comparison with the same result of 0.867.

These observations do not prove that the models are equal: thirty questions
are insufficient for confidently comparing small differences. They do show
why an embedding-model ranking is meaningful only together with its
measurement conditions. Changing chunking, retrieval mode, or query phrasing
can alter both the metric and the order of models in the table.

7. Practical Recommendations

The experiment does not identify one configuration suitable for every
Code-RAG project. The decision depends on the queries the system will receive,
where it will run, and how much time is available for tuning.

Requirement	Candidate	Evidence from this study
Highest observed `recall@10`	`codestral-embed` + `c1500-o200` + `VECTOR_ONLY`	Highest point estimate among completed runs: `0.967`
No commercial APIs	`qwen3-embedding:0.6b` + `c3000-o300`	Highest tuned local-model result: `0.933`
Minimal initial tuning	`voyage-4-large` or `voyage-code-3`	Smallest observed range across chunking configurations: `0.067`
Restricted memory	`granite-embedding`	A 62 MB model with the same baseline point estimate, `0.767`, as the 1.2 GB `bge-m3`
Precise technical queries	BM25 as standalone search or a baseline	BM25 reached `recall@10 = 0.867` on `keyword` queries without embedding infrastructure
Inaccurate user queries	Vector or hybrid retrieval after testing the chosen model	The advantage of semantic retrieval over BM25 was most visible with inaccurate terminology

These candidates reflect the best observed results inside the experiment, not
universal production configurations. For example, codestral-embed produced
the highest recall@10, but its advantage was measured at one chunking
configuration, and statistical superiority over nearby models was not
established. Local models avoid API charges but move cost into hardware and
operations.

Practical Code-RAG tuning should begin with a description of future queries,
not a large model leaderboard. If the system is used by developers familiar
with project terminology, BM25 may be a strong starting point. If questions
come from new team members or users describing behavior in their own words,
semantic retrieval becomes more important.

The next step is to choose a short list of models that meet cost, memory, and
deployment constraints. For each candidate, test several fragment sizes, then
compare VECTOR_ONLY and HYBRID_RRF. A chunking or retrieval-mode choice
should not be transferred from one model to another without retesting.

The final comparison should include not only convenient technical queries, but
also natural phrasing, inaccurate terminology, and plausible but incorrect
files. Results should be retained per question so that an apparent advantage
can be traced to a stable pattern rather than a few favorable examples.

In practice, selecting a Code-RAG configuration becomes a process of narrowing
the search space:

identify realistic query types and system constraints;
measure BM25 and one accessible embedding model as starting points;
select a short list of candidate models;
test several chunking configurations for each candidate;
compare vector-only and hybrid retrieval;
repeat the evaluation across different query phrasings;
test the statistical stability of the resulting ranking.

8. Conclusion

The central result of this study is that an embedding model cannot be evaluated
independently from the retrieval pipeline around it. Each model has its own
effective combination of chunking, retrieval mode, and query phrasing.

These parameters are connected. Fragment size determines how much code enters
an embedding. Retrieval mode sets the balance between semantic similarity and
exact terminology. Query phrasing determines how easily the system can connect
a developer's intent to the vocabulary of the source code.

There is therefore no universal ranking of Code-RAG models. Models can only be
compared under explicit conditions: on a particular codebase, with a selected
chunking strategy and retrieval mode, and for a known distribution of user
queries.

The practical question is not "Which model is best?" but "Which configuration
best solves the tasks of this project's users?" Answering it requires joint
tuning of the pipeline and evaluation on a project-specific question set.

This study used one polyglot project and a small gold set, so its selected
configurations should not be transferred to other codebases without retesting.
Part 3 will describe the reproducible benchmark harness used to run these
comparisons.

Reproducibility

The raw results are published with the project. The main tables in this article
can be verified using these artifacts:

results/E003-full/all.parquet
results/E007-commercial-chunking-merged/all.parquet
results/E004-vector/all.parquet
results/E005-production/all.parquet
results/E006-production-merged/all.parquet
results/E007-commercial-vector-merged/all.parquet
results/E004-bm25/all.parquet
results/forest_plot_data.csv

The confidence intervals for the baseline ranking can be recalculated from the
source CSV files:

git clone https://github.com/Daeryss/karta-rag-map
cd karta-rag-map
python3 -m venv .venv
.venv/bin/pip install -r scripts/requirements.txt
.venv/bin/python scripts/bootstrap_cis.py \
  results/E005-production/all.csv \
  results/E006-production-merged/all.csv

The script fixes the baseline conditions at k=10, the human query variant,
2,000 bootstrap samples, and seed 42.


A Cognitive Benchmark for Code-RAG Retrieval: Part 1 — Methodology

Ilias Miftakhov — Tue, 09 Jun 2026 19:37:51 +0000

TL;DR

Code-RAG systems promise to help developers navigate large codebases: find the
implementation of a behavior, trace a data flow, or identify the component
responsible for a specific function. But a compelling demo does not tell us how
reliable the retrieval itself is.

To investigate this, I built a retrieval benchmark on the Apache Kafka 4.0.0
broker core, a real polyglot project containing 697 Java and Scala files. For
each question, I identified the correct files, acceptable supporting files, and
plausible but incorrect alternatives in advance. I then compared which files
Code-RAG retrieved under different embedding models, code chunking strategies,
and retrieval modes.

The first results looked convincing, but the conclusions changed several times
as the dataset evolved. A single query phrasing concealed retrieval instability,
ordinary recall missed dangerous errors, and the final leaderboard leader beat
the next four models on only one question out of thirty.

The main outcome of the study was not the selection of a "best" model. It was a
methodology for avoiding the mistake of treating a chance result as a reliable
one.

1. Why Code-RAG Needs Its Own Benchmark

Code-RAG maps make an appealing promise: a developer asks a question about an
unfamiliar project and receives the relevant parts of the codebase. Such a
system could be used inside an IDE, in a chat assistant, during onboarding,
while investigating incidents, or when preparing a change.

A typical Code-RAG pipeline looks like this:

source code
    → split into chunks
    → build a search index
    → developer query
    → retrieve relevant chunks
    → pass the retrieved context to a generative model
    → answer the user

The final answer does not depend on the generative model alone. If retrieval
fails to find the file that actually implements the relevant behavior, the
model will either produce an incomplete answer or reason plausibly from the
wrong context.

I therefore started with a narrower question:

How accurately does Code-RAG find the correct files in a real polyglot
codebase?

An important clarification: this study did not send questions to different
generative LLMs and evaluate their written answers. It sent them to a retrieval
pipeline. An embedding model converted the query and code into vectors, Lucene
returned a ranked list of files, and the benchmark compared that list with
labels prepared in advance.

This restriction is intentional. It isolates retrieval quality instead of
mixing it with prompt quality, answer generation, and LLM hallucinations.

I selected the Apache Kafka 4.0.0 broker core as the corpus. It is not a
tutorial example, but a large Java and Scala codebase with packages,
interfaces, implementations, manager classes, and logic distributed across
multiple components. In a project like this, a retrieval system must
distinguish the file that actually owns a behavior from a similarly named file
or an adjacent responsibility.

2. Defining the Correct Answer

Evaluating retrieval requires more than asking a question and deciding whether
the result looks reasonable. It requires a known ground truth.

I created a set of questions about the Kafka codebase, which I already knew
well. Each question described a specific behavior or concept for which the
owner file could be identified manually. For example:

How does the Kafka broker accept incoming TCP connections?

The correct primary file for this question is:

core/src/main/scala/kafka/network/SocketServer.scala

RequestChannel.scala is nearby. It belongs to the networking flow and helps
pass requests onward, so it looks plausible. However, it is not responsible for
accepting TCP connections. If a system ranks it above SocketServer.scala, it
has found thematically related code but identified the wrong owner of the
behavior.

For every run, the retriever returned a ranked list of files. The result could
then be evaluated formally:

recall@k shows whether the correct file appeared in the first k results;
MRR accounts for how highly the first correct result was ranked;
displacement_rate shows whether a plausible but incorrect file displaced the primary answer within the top-k;
rank_gap measures the distance between correct and adversarial files in the complete ranked result.
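
The sketch below illustrates how three of these metrics could be computed for a single ranked list. The exact definitions used by the benchmark are not given here, so the reading of displacement (an adversarial file outranks the primary file within the top-k) and the sign convention of the rank gap are assumptions, as are the function names and the placeholder Other.scala.

```python
def reciprocal_rank(ranked_files, correct_files):
    """1 / rank of the first correct file, or 0.0 if none was retrieved."""
    for position, path in enumerate(ranked_files, start=1):
        if path in correct_files:
            return 1.0 / position
    return 0.0


def is_displaced(ranked_files, primary_file, adversarial_files, k=10):
    """Assumed reading: an adversarial file in the top-k outranks the primary file."""
    top_k = ranked_files[:k]
    adversarial_ranks = [top_k.index(p) for p in adversarial_files if p in top_k]
    if not adversarial_ranks:
        return False
    primary_rank = top_k.index(primary_file) if primary_file in top_k else k
    return min(adversarial_ranks) < primary_rank


def rank_gap(ranked_files, primary_file, adversarial_file):
    """Assumed sign convention: positive when the primary file ranks higher."""
    return ranked_files.index(adversarial_file) - ranked_files.index(primary_file)


# Illustrative ranking in which the plausible neighbor outranks the owner file.
ranked = ["RequestChannel.scala", "SocketServer.scala", "Other.scala"]
rr = reciprocal_rank(ranked, {"SocketServer.scala"})
displaced = is_displaced(ranked, "SocketServer.scala", ["RequestChannel.scala"])
gap = rank_gap(ranked, "SocketServer.scala", "RequestChannel.scala")
print(rr, displaced, gap)
```

MRR is the mean of reciprocal_rank over all questions, and displacement_rate would be the share of questions for which is_displaced returns True.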

This turns the subjective judgment that "the result looks reasonable" into a
reproducible comparison. With the same set of questions, I could change the
embedding model, chunking configuration, or retrieval mode and measure how the
quality changed.

But the first version of the dataset was too simple.

3. Why the First Gold Queries Had to Be Replaced

The First Version: Verify That the Pipeline Works

The pilot gold set contained three questions and ten Kafka files. Each question
had one phrasing and a list of expected files. This was enough to exercise the
entire path: load the corpus, build the index, run retrieval, and calculate the
metrics.

The pilot used nomic-embed-text with 3,000-character chunks and a
300-character overlap. It achieved a perfect score on every measured metric. I
therefore selected 3000/300 as the working configuration.

The problem was that the benchmark had not confirmed the reliability of the
approach. It had only confirmed that the system could solve three familiar
questions on a small corpus. It left almost no room for failure.

The Next Version: More Questions and More Phrasings

After expanding the corpus to 697 files, I prepared 30 questions. Each question
received several paraphrases: canonical human phrasing, passive phrasing, and
descriptions framed around a process, component, or location.

This version showed that the same intent could produce different results
depending on its wording. However, the variants were still mostly superficial
language transformations. They did not distinguish the different ways in which
a developer actually searches for code: a natural question, a precise
technical query, a set of keywords, or a query using imprecise terminology.

The labels also still focused mainly on whether the correct file had been
found, without adequately describing plausible errors. For real code
navigation, it is not enough to find SocketServer.scala; the system must also
avoid confusing it with the adjacent RequestChannel.scala.

The Final Version: A Five-Layer Gold-Query Schema

In the repository, the earlier full set is stored as gold-queries.yaml, while
the final set is available as gold-queries-v2.1.yaml. Their differences can be
summarized as follows:

Aspect	Early gold set	Gold queries v2
Correct answer	List of expected files	Primary and secondary files
Phrasings	Generic language paraphrases	Controlled cognitive modes
Incorrect neighbors	Simple list of adversarial files	Type, strength, and source of confusion
Rationale	Short explanation	Behavior owner, reason, and explicit boundary
Evaluation	Shared set of metrics	Relevant metric families for each question

This led to the gold-queries-v2.1 schema, in which every question is
represented across five independent layers:

1. RETRIEVAL     what counts as the correct answer
2. COGNITIVE     how the user expresses the same query
3. INTERFERENCE  which files look plausible but are incorrect
4. RATIONALE     why the answer is correct and where its boundary lies
5. EVALUATION    which metrics matter for this question

The retrieval layer separates primary and supporting files. The primary file
owns the requested behavior; a supporting file helps explain it but does not
replace the answer.

The cognitive layer contains up to five query forms:

human: a natural question from a developer;
ai_optimized: a detailed technical query using precise terminology;
keyword: a short set of keywords;
wrong_terminology: the same intent with one controlled terminology error;
cross_module: a question that requires connecting multiple components, where applicable.

The interference layer lists adversarial nodes: semantically or structurally
close files that are easy to mistake for the answer. Each one records its
relationship to the primary file, the strength of the trap, and the type of
confusion.

The rationale layer documents the behavior owner, the reason for selecting
it, and an explicit boundary explaining why a neighboring file is not the
answer. This makes the labels reviewable and debatable instead of treating them
as unexplained decisions by the author.

Finally, the evaluation layer specifies which metric families are especially
important for a particular question.
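The five layers can be pictured as one record per question. The sketch below is only an illustration: the class names, field names, and example values are hypothetical and are not the actual keys of gold-queries-v2.1.yaml. Only the five phrasing names and the two file names come from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The five phrasing names come from the article; everything else is hypothetical.
ALLOWED_PHRASINGS = {
    "human", "ai_optimized", "keyword", "wrong_terminology", "cross_module",
}


@dataclass
class AdversarialNode:
    path: str       # plausible but incorrect file
    relation: str   # relationship to the primary file
    strength: str   # strength of the trap
    confusion: str  # type of confusion


@dataclass
class GoldQuery:
    qid: str
    primary: List[str]                                                # 1. RETRIEVAL
    supporting: List[str] = field(default_factory=list)
    phrasings: Dict[str, str] = field(default_factory=dict)           # 2. COGNITIVE
    adversarial: List[AdversarialNode] = field(default_factory=list)  # 3. INTERFERENCE
    behavior_owner: str = ""                                          # 4. RATIONALE
    reason: str = ""
    boundary: str = ""
    metric_families: List[str] = field(default_factory=list)          # 5. EVALUATION


def validate(q: GoldQuery) -> List[str]:
    """Return readable problems; an empty list means the record is consistent."""
    problems = []
    if not q.primary:
        problems.append("retrieval: no primary file")
    if not q.phrasings:
        problems.append("cognitive: no phrasings")
    unknown = sorted(set(q.phrasings) - ALLOWED_PHRASINGS)
    if unknown:
        problems.append(f"cognitive: unknown phrasings {unknown}")
    trapped = sorted({a.path for a in q.adversarial} & set(q.primary))
    if trapped:
        problems.append(f"interference: adversarial file is also primary {trapped}")
    if not q.boundary:
        problems.append("rationale: missing explicit boundary")
    if not q.metric_families:
        problems.append("evaluation: no metric families")
    return problems


# Placeholder values for illustration only, not an entry from the real gold set.
example = GoldQuery(
    qid="Q-example",
    primary=["SocketServer.scala"],
    phrasings={"human": "Where are incoming connections accepted?"},
    adversarial=[AdversarialNode("RequestChannel.scala", "adjacent", "strong", "semantic")],
    behavior_owner="SocketServer",
    reason="placeholder reason",
    boundary="placeholder boundary",
    metric_families=["adversarial"],
)
print(validate(example))  # prints []
```

Keeping the layers as separate fields means a reviewer can dispute the boundary or the trap strength without touching the list of correct files.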

This schema changed the subject of the study itself. Instead of testing whether
the system could occasionally find the right file, the benchmark began testing
how consistently it recognized the same intent and distinguished the behavior
owner from plausible neighbors.

4. What Paraphrases and Adversarial Nodes Revealed

One Phrasing Does Not Represent the Whole Question

In an early full-corpus experiment, top-1 accuracy on the human phrasing was
0.233. When a question counted as solved if the system succeeded on any
available phrasing, the result was 0.600. The best available phrasing therefore
scored about 2.6 times higher than the single human phrasing.

This does not mean that we should select the most convenient variant and
publish the best result. On the contrary, a large gap shows that one phrasing
is a poor representation of real retrieval robustness.
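The two numbers answer different questions, which a small sketch makes concrete. The function names and the toy data below are my own illustration, not the benchmark code or its results: one function scores a single phrasing, the other counts a question as solved when any available phrasing succeeds.

```python
from typing import Dict

# hits[question][phrasing] is True when the primary file ranked first.
Hits = Dict[str, Dict[str, bool]]


def phrasing_accuracy(hits: Hits, phrasing: str) -> float:
    """Share of questions solved by one specific phrasing."""
    scored = [row[phrasing] for row in hits.values() if phrasing in row]
    return sum(scored) / len(scored)


def any_phrasing_accuracy(hits: Hits) -> float:
    """Share of questions solved by at least one available phrasing."""
    return sum(any(row.values()) for row in hits.values()) / len(hits)


# Toy data, not benchmark results.
toy = {
    "Q1": {"human": True, "keyword": True},
    "Q2": {"human": False, "keyword": True},
    "Q3": {"human": False, "ai_optimized": True},
    "Q4": {"human": False, "keyword": False},
    "Q5": {"human": False, "keyword": False},
}
print(phrasing_accuracy(toy, "human"))  # 0.2
print(any_phrasing_accuracy(toy))       # 0.6
```

With thirty questions, 0.233 and 0.600 are consistent with 7 and 18 solved questions, and 18/7 is roughly 2.6.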

On the final baseline, all nine commercial embedding models reached
quorum-any recall@10 = 1.000: for every question, at least one phrasing placed
the correct file in the top 10. However, the models handled changes in query
language with different levels of consistency.

The drop between human and wrong_terminology ranged from 0.100 to 0.300. I
expected code-specialized models to be more robust, but the data did not
support that expectation: the lowest and highest observed drops both belonged
to general-purpose models.

The ai_optimized variant, deliberately written to be retrieval-friendly,
reached recall@10 = 1.000 for all nine commercial models. In this experiment,
it was useful as an upper bound on solvability but nearly useless for comparing
strong models with one another.

Recall Misses Dangerous Almost-Correct Answers

Suppose the correct file ranks fifth while a plausible but incorrect neighbor
ranks first. recall@10 is still 1.0. For the user, however, the first results
matter most: these are the files they will read or pass to a generative model.

Adversarial labels make this type of error measurable. For the TCP-connection
question, all nine commercial models found SocketServer.scala in the top 10
under every available phrasing. Ordinary recall could not distinguish them at
all.

However, broader ai_optimized queries also brought RequestChannel.scala
into the results more often. The rank_gap and displacement_rate metrics
showed whether the correct order was preserved and whether the neighboring file
displaced the primary answer.
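This section does not spell out the formulas behind rank_gap and displacement_rate, so the sketch below uses one plausible definition and should be read as an assumption: rank_gap is the adversarial file's rank minus the primary file's rank, a file missing from the top k is treated as rank k + 1, and displacement_rate is the share of result lists in which the adversarial file ranks above the primary. The filler file names are placeholders.

```python
from typing import List, Sequence


def rank_in_top_k(ranked: Sequence[str], target: str, k: int) -> int:
    """1-based rank within the top k; k + 1 when the file is absent."""
    top = list(ranked[:k])
    return top.index(target) + 1 if target in top else k + 1


def recall_at_k(ranked: Sequence[str], primary: str, k: int = 10) -> float:
    return 1.0 if primary in ranked[:k] else 0.0


def rank_gap(ranked: Sequence[str], primary: str, adversarial: str, k: int = 10) -> int:
    """Adversarial rank minus primary rank; negative means the trap ranks higher."""
    return rank_in_top_k(ranked, adversarial, k) - rank_in_top_k(ranked, primary, k)


def displacement_rate(runs: List[Sequence[str]], primary: str,
                      adversarial: str, k: int = 10) -> float:
    """Share of result lists in which the adversarial file outranks the primary."""
    return sum(rank_gap(r, primary, adversarial, k) < 0 for r in runs) / len(runs)


# The neighbor ranks first, the correct file fifth (A, B, C are placeholders).
bad = ["RequestChannel.scala", "A.scala", "B.scala", "C.scala", "SocketServer.scala"]
good = ["SocketServer.scala", "RequestChannel.scala"]

print(recall_at_k(bad, "SocketServer.scala"))                       # 1.0
print(rank_gap(bad, "SocketServer.scala", "RequestChannel.scala"))  # -4
print(displacement_rate([bad, good], "SocketServer.scala", "RequestChannel.scala"))  # 0.5
```

Both lists score recall@10 = 1.0, yet only the second one puts the behavior owner ahead of its neighbor.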

Paraphrases and adversarial nodes measure different properties. Paraphrases
show whether the system recognizes the same intent when its language changes.
Adversarial nodes show whether it can separate the correct answer from a
thematically close mistake. Recall alone cannot measure both.

5. How the Results Changed as the Benchmark Evolved

The methodology changed the practical conclusions several times.

The first pilot, with ten files and three questions, produced perfect scores and
established 3000/300 as the recommended chunking configuration.

When the same embedding model was evaluated on the full corpus and thirty
questions, 3000/300 produced the worst recall@10 among four configurations:
0.533 versus 0.600 for the other three. 1500/200 led on the other quality
metrics and became the new recommended configuration.

I then evaluated seven local embedding models across five chunking
configurations. Even the replacement recommendation did not hold: five of the
seven models preferred c500-o100 (500/100 in the notation used above), which
had not been included in the earlier sweep. For the pilot model,
nomic-embed-text, 1500/200 ranked only third.

This is not merely a story about choosing the wrong chunk size. It shows how
easily a fortunate combination of a small corpus, a few questions, and one
model can become a universal recommendation.

6. Bootstrap Confidence Intervals Changed the Final Conclusion

After the main experiment runs were complete, the top of the leaderboard for
the nine commercial models under the baseline configuration looked convincing:

Model                                     recall@10
mistral/codestral-embed                   0.900
GigaEmbeddings-3B                         0.867
OpenAI text-embedding-3-small             0.867
voyage-code-2                             0.867
voyage-code-3                             0.867

On an ordinary chart, Codestral looked like the winner. But with n=30, the
difference between 0.900 and 0.867 is only one question.

I calculated 95% percentile bootstrap confidence intervals, with B=2000
resamples, from each model's per-query hit vector:

codestral-embed                   0.900   [0.800, 1.000]
each of the next four models      0.867   [0.733, 0.967]
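A percentile bootstrap on a per-query 0/1 vector can be sketched in a few lines. This is a minimal standard-library illustration, not the repository's analysis script: the function name, the seed, and the placement of the three misses are assumptions, and the exact bounds depend on the random resamples.

```python
import random
from typing import List, Tuple


def percentile_bootstrap_ci(hits: List[int], b: int = 2000,
                            alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of a per-query 0/1 hit vector."""
    rng = random.Random(seed)
    n = len(hits)
    # Resample n questions with replacement, b times, and record each mean.
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(b))
    lo = means[round(alpha / 2 * b)]
    hi = means[round((1 - alpha / 2) * b) - 1]
    return lo, hi


# 27 hits out of 30 gives the 0.900 point estimate; which questions miss is a placeholder.
hits = [1] * 27 + [0] * 3
low, high = percentile_bootstrap_ci(hits)
print(f"{sum(hits) / len(hits):.3f} [{low:.3f}, {high:.3f}]")
```

Because every resampled mean is a multiple of 1/30, the interval bounds also move in steps of about 0.033.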

Overlap between separate confidence intervals is not, by itself, a formal
pairwise test. I therefore also ran a paired bootstrap on the differences
between every pair in the top cluster.

The four models scoring 0.867 were arithmetically identical on this metric:
they found the primary file for the same 26 questions and failed on the same
four. Codestral's entire advantage came from one question, Q30. The confidence
interval for its advantage over each of the four models was
[+0.000, +0.100], meaning that the lower bound was exactly zero.
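The paired variant resamples questions once and applies the same resample to both models, which is equivalent to bootstrapping the per-question differences. The sketch below is my own standard-library illustration under assumptions: it is not scripts/paired_bootstrap.py, the function name and seed are hypothetical, and the positions of the three shared misses are placeholders. Only the counts (27 versus 26 hits, one discordant question) follow the text.

```python
import random
from typing import List, Tuple


def paired_bootstrap_diff_ci(a: List[int], b: List[int], n_boot: int = 2000,
                             alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile CI for mean(a) - mean(b), resampling questions jointly."""
    if len(a) != len(b):
        raise ValueError("both models must be scored on the same questions")
    diffs = [x - y for x, y in zip(a, b)]
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    return means[round(alpha / 2 * n_boot)], means[round((1 - alpha / 2) * n_boot) - 1]


model_a = [0, 0, 0] + [1] * 27        # 27 hits, including the last question
model_b = [0, 0, 0] + [1] * 26 + [0]  # same misses, plus the last question

low, high = paired_bootstrap_diff_ci(model_a, model_b)
print(f"[{low:+.3f}, {high:+.3f}]")   # the lower bound is +0.000
```

The lower bound is zero because a resample of thirty questions omits any one particular question with probability (29/30)^30, about 0.36, and in every such resample the two models tie.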

The data does not support the claim that Codestral won. A defensible statement
is weaker:

Codestral had the highest point estimate in this experiment, but with thirty
questions, the paired bootstrap could not distinguish it from the next
cluster of models.

Bootstrap did not change the collected results. It changed the strength of the
claims those results allowed me to publish.

7. Rules for an Honest AI Benchmark

The study led me to several rules that apply beyond Code-RAG.

First, define which stage of the system you are measuring. Retrieval
quality, generated-answer quality, and product usefulness are different
research targets. Mixing them into one evaluation makes the source of an error
difficult to identify.

Use a known ground truth. The correct result must be defined before the
model runs. Otherwise, the benchmark author can easily accept a plausible
answer as a correct one.

Test the intent, not one fortunate phrasing. Users are not required to ask
questions using the words that work best for an embedding model.

Label plausible errors. In a real codebase, an incorrect result often looks
almost correct. A benchmark should distinguish the behavior owner from a
neighboring interface, manager class, or configuration file.

Publish uncertainty alongside the ranking. Point estimates are useful for
sorting a table, but they do not always support a claim of superiority.

Use the benchmark to challenge your own conclusions. A good methodology
should not only confirm hypotheses. It should create conditions in which they
can fail.

8. Limitations and Next Step

The study has several clear limitations.

Thirty questions. recall@10 has only 31 possible values, from 0/30 to
30/30. The bootstrap 95% confidence-interval half-widths in this experiment
were approximately 0.10-0.17. Reliably separating the top cluster requires a
larger question set.

One corpus. The study used 697 files from the Kafka 4.0.0 broker core. It
does not establish that the same models and configurations will behave
similarly on another project, language, or architecture.

File-level labels. The benchmark evaluates retrieved files rather than
individual classes and methods. A correct file containing an irrelevant chunk
can count as a hit, while the correct method inside a low-ranked file can count
as a miss.
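The effect of file-level labels is easiest to see when chunk results are collapsed into files. The sketch below assumes a best-chunk-wins aggregation; whether the benchmark aggregates exactly this way is an assumption, and the function names and scores are hypothetical.

```python
from typing import Dict, List, Tuple


def files_from_chunks(chunk_hits: List[Tuple[str, float]]) -> List[str]:
    """Collapse scored chunks into a ranked list of unique files (best chunk wins)."""
    best: Dict[str, float] = {}
    for path, score in chunk_hits:
        if path not in best or score > best[path]:
            best[path] = score
    return sorted(best, key=lambda p: best[p], reverse=True)


def file_hit_at_k(chunk_hits: List[Tuple[str, float]], primary: str, k: int = 10) -> bool:
    """True when any chunk of the primary file lifts it into the top k files."""
    return primary in files_from_chunks(chunk_hits)[:k]


# Placeholder scores: the top chunk may be an irrelevant part of the right file.
chunks = [
    ("SocketServer.scala", 0.91),
    ("RequestChannel.scala", 0.88),
    ("SocketServer.scala", 0.40),
]
print(files_from_chunks(chunks))  # ['SocketServer.scala', 'RequestChannel.scala']
print(file_hit_at_k(chunks, "SocketServer.scala", k=1))  # True
```

The hit is recorded without checking which chunk earned it, which is exactly the limitation described above.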

One annotator. I authored all gold queries. The rationale layer makes the
decisions reviewable, but independent annotation by a second specialist would
strengthen the dataset.

Retrieval rather than final-answer quality. The study does not show how
well a generative model uses the retrieved context or how useful its final
answer is to a developer.

The next article in the series will use the same dataset to investigate the
relative effects of the embedding model, chunking, and retrieval mode. This
time, those comparisons will be interpreted with query robustness, adversarial
errors, and statistical uncertainty in mind.

Reproducibility

The repository contains the gold set, raw CSV and Parquet results, benchmark
code, and analysis scripts.

git clone https://github.com/Daeryss/karta-rag-map
cd karta-rag-map

# Inspect the final five-layer schema and 30 gold queries
cat corpora/kafka-4.0.0/gold-queries-v2.1.yaml

# Validate the dataset schema
./gradlew test --tests 'io.github.daeryss.karta.dataset.GoldQueryLoaderV2Test'

# Data behind the ranking and paired bootstrap
ls results/forest_plot_data.csv
ls results/paired_bootstrap_top5.csv

# Recompute the paired bootstrap for the top cluster
.venv/bin/python scripts/paired_bootstrap.py \
  results/E005-production/all.csv \
  results/E006-production-merged/all.csv \
  --output results/paired_bootstrap_top5.csv

Code-RAG Research Series · Article 1 of 3 · Next: "What Actually Matters in
Code-RAG: Models, Chunking, or Retrieval?"