Ilias Miftakhov

Posted on Jun 9

A Cognitive Benchmark for Code-RAG Retrieval: Part 1 — Methodology

#rag #ai #java #machinelearning

TL;DR

Code-RAG systems promise to help developers navigate large codebases: find the
implementation of a behavior, trace a data flow, or identify the component
responsible for a specific function. But a compelling demo does not tell us how
reliable the retrieval itself is.

To investigate this, I built a retrieval benchmark on the Apache Kafka 4.0.0
broker core, a real polyglot project containing 697 Java and Scala files. For
each question, I identified the correct files, acceptable supporting files, and
plausible but incorrect alternatives in advance. I then compared which files
Code-RAG retrieved under different embedding models, code chunking strategies,
and retrieval modes.

The first results looked convincing, but the conclusions changed several times
as the dataset evolved. A single query phrasing concealed retrieval instability,
ordinary recall missed dangerous errors, and the model at the top of the final
leaderboard beat the next four models on only one question out of thirty.

The main outcome of the study was not the selection of a "best" model. It was a
methodology for avoiding the mistake of treating a chance result as a reliable
one.

1. Why Code-RAG Needs Its Own Benchmark

Code-RAG systems make an appealing promise: a developer asks a question about an
unfamiliar project and receives the relevant parts of the codebase. Such a
system could be used inside an IDE, in a chat assistant, during onboarding,
while investigating incidents, or when preparing a change.

A typical Code-RAG pipeline looks like this:

source code
    → split into chunks
    → build a search index
    → developer query
    → retrieve relevant chunks
    → pass the retrieved context to a generative model
    → answer the user

The final answer does not depend on the generative model alone. If retrieval
fails to find the file that actually implements the relevant behavior, the
model will either produce an incomplete answer or reason plausibly from the
wrong context.

I therefore started with a narrower question:

How accurately does Code-RAG find the correct files in a real polyglot
codebase?

An important clarification: this study did not send questions to different
generative LLMs and evaluate their written answers. It sent them to a retrieval
pipeline. An embedding model converted the query and code into vectors, Lucene
returned a ranked list of files, and the benchmark compared that list with
labels prepared in advance.

This restriction is intentional. It isolates retrieval quality instead of
mixing it with prompt quality, answer generation, and LLM hallucinations.
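
The retrieval-only setup can be pictured with a minimal sketch. The vectors below are invented by hand and a plain Python sort stands in for the index; the real pipeline used an embedding model and Lucene, neither of which is reproduced here. `OtherFile.scala` is a placeholder name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_files(query_vec, file_vecs):
    """Return file names ordered from most to least similar to the query."""
    return sorted(
        file_vecs,
        key=lambda name: cosine(query_vec, file_vecs[name]),
        reverse=True,
    )

# Hand-made 3-dimensional vectors, purely illustrative (not real embeddings).
file_vecs = {
    "SocketServer.scala": [0.9, 0.1, 0.0],
    "RequestChannel.scala": [0.6, 0.6, 0.1],
    "OtherFile.scala": [0.0, 0.2, 0.9],  # placeholder file
}
query_vec = [1.0, 0.2, 0.0]

# The benchmark only inspects this ranked list; no text is generated.
ranking = rank_files(query_vec, file_vecs)
```

Everything the benchmark measures is a property of a list like `ranking`, compared against labels written before the run.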

I selected the Apache Kafka 4.0.0 broker core as the corpus. It is not a
tutorial example, but a large Java and Scala codebase with packages,
interfaces, implementations, manager classes, and logic distributed across
multiple components. In a project like this, a retrieval system must
distinguish the file that actually owns a behavior from a similarly named file
or an adjacent responsibility.

2. Defining the Correct Answer

Evaluating retrieval requires more than asking a question and deciding whether
the result looks reasonable. It requires a known ground truth.

I wrote a set of questions about the Kafka codebase, which I already knew well.
Each question described a specific behavior or concept whose owner file could
be identified manually. For example:

How does the Kafka broker accept incoming TCP connections?

The correct primary file for this question is:

core/src/main/scala/kafka/network/SocketServer.scala

RequestChannel.scala is nearby. It belongs to the networking flow and helps
pass requests onward, so it looks plausible. However, it is not responsible for
accepting TCP connections. If a system ranks it above SocketServer.scala, it
has found thematically related code but identified the wrong owner of the
behavior.

For every run, the retriever returned a ranked list of files. The result could
then be evaluated formally:

recall@k shows whether the correct file appeared in the first k results;
MRR accounts for how highly the first correct result was ranked;
displacement_rate shows whether a plausible but incorrect file displaced the primary answer within the top-k;
rank_gap measures the distance between correct and adversarial files in the complete ranked result.

This turns the subjective judgment that "the result looks reasonable" into a
reproducible comparison. With the same set of questions, I could change the
embedding model, chunking configuration, or retrieval mode and measure how the
quality changed.
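
As a rough illustration of the first two metrics, here is a minimal sketch of recall@k and MRR over ranked file lists. The function names and the toy rankings are assumptions made for this example; they are not taken from the benchmark code.

```python
def recall_at_k(ranked, correct, k):
    """1.0 if any correct file appears in the first k results, else 0.0."""
    return 1.0 if any(f in correct for f in ranked[:k]) else 0.0

def reciprocal_rank(ranked, correct):
    """1 / rank of the first correct file, or 0.0 if none was retrieved."""
    for position, f in enumerate(ranked, start=1):
        if f in correct:
            return 1.0 / position
    return 0.0

def mean(values):
    return sum(values) / len(values)

# Two toy queries: (ranked result list, set of correct files).
# A.scala ... E.scala are placeholder names.
runs = [
    (["RequestChannel.scala", "SocketServer.scala", "A.scala"], {"SocketServer.scala"}),
    (["B.scala", "C.scala", "D.scala"], {"E.scala"}),
]

recall_at_1 = mean([recall_at_k(r, c, 1) for r, c in runs])
recall_at_3 = mean([recall_at_k(r, c, 3) for r, c in runs])
mrr = mean([reciprocal_rank(r, c) for r, c in runs])
```

In the first toy query the correct file sits at rank 2, so it counts for recall@3 but not recall@1, and contributes 1/2 to MRR.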

But the first version of the dataset was too simple.

3. Why the First Gold Queries Had to Be Replaced

The First Version: Verify That the Pipeline Works

The pilot gold set contained three questions and ten Kafka files. Each question
had one phrasing and a list of expected files. This was enough to exercise the
entire path: load the corpus, build the index, run retrieval, and calculate the
metrics.

The pilot used nomic-embed-text with 3,000-character chunks and a
300-character overlap. It achieved a perfect score on every measured metric. I
therefore selected 3000/300 as the working configuration.
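
To make the 3000/300 notation concrete, here is a minimal sketch of a sliding character window. It assumes the chunker cuts on raw character offsets; the article does not say whether the real chunker also respects code structure, so treat this as an approximation.

```python
def chunk_text(text, size=3000, overlap=300):
    """Split text into windows of `size` characters, each sharing
    `overlap` characters with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# A synthetic 7,000-character "file" with varied content.
sample = "".join(chr(97 + i % 26) for i in range(7000))
chunks = chunk_text(sample, size=3000, overlap=300)
lengths = [len(c) for c in chunks]
```

With these settings each window starts 2,700 characters after the previous one, so a 7,000-character file yields two full chunks and one shorter tail.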

The problem was that the benchmark had not confirmed the reliability of the
approach. It had only confirmed that the system could solve three familiar
questions on a small corpus. It left almost no room for failure.

The Next Version: More Questions and More Phrasings

After expanding the corpus to 697 files, I prepared 30 questions. Each question
received several paraphrases: canonical human phrasing, passive phrasing, and
descriptions framed around a process, component, or location.

This version showed that the same intent could produce different results
depending on its wording. However, the variants were still mostly superficial
language transformations. They did not distinguish the different ways in which
a developer actually searches for code: a natural question, a precise
technical query, a set of keywords, or a query using imprecise terminology.

The labels also still focused mainly on whether the correct file had been
found, without adequately describing plausible errors. For real code
navigation, it is not enough to find SocketServer.scala; the system must also
avoid confusing it with the adjacent RequestChannel.scala.

The Final Version: A Five-Layer Gold-Query Schema

In the repository, the earlier full set is stored as gold-queries.yaml, while
the final set is available as gold-queries-v2.1.yaml. Their differences can be
summarized as follows:

Aspect	Early gold set	Gold queries v2.1
Correct answer	List of expected files	Primary and secondary files
Phrasings	Generic language paraphrases	Controlled cognitive modes
Incorrect neighbors	Simple list of adversarial files	Type, strength, and source of confusion
Rationale	Short explanation	Behavior owner, reason, and explicit boundary
Evaluation	Shared set of metrics	Relevant metric families for each question

This led to the gold-queries-v2.1 schema, in which every question is
represented across five independent layers:

1. RETRIEVAL     what counts as the correct answer
2. COGNITIVE     how the user expresses the same query
3. INTERFERENCE  which files look plausible but are incorrect
4. RATIONALE     why the answer is correct and where its boundary lies
5. EVALUATION    which metrics matter for this question

The retrieval layer separates primary and supporting files. The primary file
owns the requested behavior; a supporting file helps explain it but does not
replace the answer.

The cognitive layer contains up to five query forms:

human: a natural question from a developer;
ai_optimized: a detailed technical query using precise terminology;
keyword: a short set of keywords;
wrong_terminology: the same intent with one controlled terminology error;
cross_module: a question that requires connecting multiple components, where applicable.

The interference layer lists adversarial nodes: semantically or structurally
close files that are easy to mistake for the answer. Each one records its
relationship to the primary file, the strength of the trap, and the type of
confusion.

The rationale layer documents the behavior owner, the reason for selecting
it, and an explicit boundary explaining why a neighboring file is not the
answer. This makes the labels reviewable and debatable instead of treating them
as unexplained decisions by the author.

Finally, the evaluation layer specifies which metric families are especially
important for a particular question.
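
The five layers can be pictured as one record per question. The sketch below is a hypothetical in-memory shape, not the actual layout of gold-queries-v2.1.yaml: every field name, the validation rules, the keyword phrasing, and the adversarial attribute values are assumptions made for illustration. Only the human phrasing and the two file names come from the example earlier in the article.

```python
from dataclasses import dataclass

ALLOWED_MODES = {"human", "ai_optimized", "keyword", "wrong_terminology", "cross_module"}

@dataclass
class AdversarialNode:
    path: str
    relationship: str    # relationship to the primary file
    strength: str        # strength of the trap
    confusion_type: str  # type of confusion

@dataclass
class GoldQuery:
    query_id: str
    primary: list          # 1. RETRIEVAL: files that own the behavior
    secondary: list        #    supporting files
    phrasings: dict        # 2. COGNITIVE: mode -> query text
    adversarial: list      # 3. INTERFERENCE: AdversarialNode entries
    behavior_owner: str    # 4. RATIONALE
    reason: str
    boundary: str
    metric_families: list  # 5. EVALUATION

def validate(q):
    """Return a list of problems; an empty list means the record looks sane."""
    problems = []
    if not q.primary:
        problems.append("no primary file")
    unknown = set(q.phrasings) - ALLOWED_MODES
    if unknown:
        problems.append(f"unknown phrasing modes: {sorted(unknown)}")
    if {node.path for node in q.adversarial} & set(q.primary):
        problems.append("a file is labeled both primary and adversarial")
    return problems

example = GoldQuery(
    query_id="tcp-accept",  # hypothetical identifier
    primary=["core/src/main/scala/kafka/network/SocketServer.scala"],
    secondary=[],
    phrasings={
        "human": "How does the Kafka broker accept incoming TCP connections?",
        "keyword": "broker accept tcp connection",  # invented for this sketch
    },
    adversarial=[
        AdversarialNode("RequestChannel.scala", "same networking flow",
                        "strong", "adjacent responsibility"),  # invented values
    ],
    behavior_owner="SocketServer",
    reason="accepts incoming TCP connections",
    boundary="RequestChannel passes requests onward but does not accept connections",
    metric_families=["recall", "displacement"],  # invented values
)
problems = validate(example)
```

Keeping the boundary and the adversarial nodes next to the primary file is what lets a reviewer disagree with a specific label rather than with the dataset as a whole.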

This schema changed the subject of the study itself. Instead of testing whether
the system could occasionally find the right file, the benchmark began testing
how consistently it recognized the same intent and distinguished the behavior
owner from plausible neighbors.

4. What Paraphrases and Adversarial Nodes Revealed

One Phrasing Does Not Represent the Whole Question

In an early full-corpus experiment, top-1 accuracy on the human phrasing was
0.233. When a question counted as solved if the system succeeded on any
available phrasing, the result was 0.600. The difference between one phrasing
and the best available phrasing was 2.6x.

This does not mean that we should select the most convenient variant and
publish the best result. On the contrary, a large gap shows that one phrasing
is a poor representation of real retrieval robustness.
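
The gap between the two numbers comes purely from how per-phrasing outcomes are aggregated. A minimal sketch, using an invented table of top-1 outcomes for four toy questions (the function names and data are assumptions, not the benchmark's code or results):

```python
def accuracy_for_mode(hits, mode):
    """Share of questions whose top-1 result was correct for one phrasing mode."""
    scored = [modes[mode] for modes in hits.values() if mode in modes]
    return sum(scored) / len(scored)

def accuracy_any_mode(hits):
    """Share of questions solved by at least one available phrasing."""
    return sum(any(modes.values()) for modes in hits.values()) / len(hits)

# question id -> {phrasing mode: was the top-1 file correct?}  (toy data)
hits = {
    "Q1": {"human": True, "keyword": True},
    "Q2": {"human": False, "keyword": True},
    "Q3": {"human": False, "keyword": False},
    "Q4": {"human": False, "ai_optimized": True},
}

human_top1 = accuracy_for_mode(hits, "human")
any_top1 = accuracy_any_mode(hits)

# The ratio quoted in the article, recomputed from its two reported values.
reported_ratio = round(0.600 / 0.233, 1)
```

Reporting only `any_top1` would hide Q2 and Q4, where the outcome depends entirely on how the question was worded.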

On the final baseline, all nine commercial embedding models reached
quorum-any recall@10 = 1.000: for every question, at least one phrasing placed
the correct file in the top 10. However, the models handled changes in query
language with different levels of consistency.

The drop between human and wrong_terminology ranged from 0.100 to 0.300. I
expected code-specialized models to be more robust, but the data did not
support that expectation: the lowest and highest observed drops both belonged
to general-purpose models.

The ai_optimized variant, deliberately written to be retrieval-friendly,
reached recall@10 = 1.000 for all nine commercial models. In this experiment,
it was useful as an upper bound on solvability but nearly useless for comparing
strong models with one another.

Recall Misses Dangerous Almost-Correct Answers

Suppose the correct file ranks fifth while a plausible but incorrect neighbor
ranks first. recall@10 is still 1.0. For the user, however, the first results
matter most: these are the files they will read or pass to a generative model.

Adversarial labels make this type of error measurable. For the TCP-connection
question, all nine commercial models found SocketServer.scala in the top 10
under every available phrasing. Ordinary recall could not distinguish them at
all.

However, broader ai_optimized queries also brought RequestChannel.scala
into the results more often. The rank_gap and displacement_rate metrics
showed whether the correct order was preserved and whether the neighboring file
displaced the primary answer.

Paraphrases and adversarial nodes measure different properties. Paraphrases
show whether the system recognizes the same intent when its language changes.
Adversarial nodes show whether it can separate the correct answer from a
thematically close mistake. Recall alone cannot measure both.
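
The scenario above (correct file fifth, plausible neighbor first) can be scored with a minimal sketch. The exact definitions of displacement_rate and rank_gap are not spelled out in this article, so the code shows one possible reading: a query counts as displaced when an adversarial file appears in the top k ahead of the primary file, and rank_gap is the adversarial rank minus the primary rank, so negative values mean the trap ranked higher. Both conventions are assumptions, and A.scala, B.scala, and C.scala are placeholder names.

```python
def first_rank(ranked, targets):
    """1-based rank of the first file found in `targets`, or None if absent."""
    for position, f in enumerate(ranked, start=1):
        if f in targets:
            return position
    return None

def displaced(ranked, primary, adversarial, k):
    """True if an adversarial file sits in the top k ahead of the primary file."""
    top = ranked[:k]
    adv = first_rank(top, adversarial)
    if adv is None:
        return False
    prim = first_rank(top, primary)
    return prim is None or adv < prim

def rank_gap(ranked, primary, adversarial):
    """Adversarial rank minus primary rank over the full list (None if either is missing)."""
    prim = first_rank(ranked, primary)
    adv = first_rank(ranked, adversarial)
    if prim is None or adv is None:
        return None
    return adv - prim

ranked = ["RequestChannel.scala", "A.scala", "B.scala", "C.scala", "SocketServer.scala"]
primary = {"SocketServer.scala"}
adversarial = {"RequestChannel.scala"}

hit_at_10 = any(f in primary for f in ranked[:10])      # recall still counts a hit
was_displaced = displaced(ranked, primary, adversarial, k=10)
gap = rank_gap(ranked, primary, adversarial)
```

Averaging `was_displaced` over all queries would give a rate; here the single query is a recall hit and a displacement at the same time.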

5. How the Results Changed as the Benchmark Evolved

The methodology changed the practical conclusions several times.

The first pilot, with ten files and three questions, produced perfect scores and
established 3000/300 as the chunking configuration.

When the same embedding model was evaluated on the full corpus and thirty
questions, 3000/300 produced the worst recall@10 among four configurations:
0.533 versus 0.600 for the other three. 1500/200 led on the other quality
metrics and became the new recommended configuration.

I then evaluated seven local embedding models across five chunking
configurations. Even the replacement recommendation did not hold: five of the
seven models preferred c500-o100 (chunk size 500, overlap 100), which had not
been included in the earlier sweep. For the pilot model, nomic-embed-text,
1500/200 ranked only third.

This is not merely a story about choosing the wrong chunk size. It shows how
easily a fortunate combination of a small corpus, a few questions, and one
model can become a universal recommendation.

6. Bootstrap Confidence Intervals Changed the Final Conclusion

After the main experiment runs were complete, the top of the leaderboard for
the nine commercial models under the baseline configuration looked convincing:

Model                                     recall@10
mistral/codestral-embed                   0.900
GigaEmbeddings-3B                         0.867
OpenAI text-embedding-3-small             0.867
voyage-code-2                             0.867
voyage-code-3                             0.867

On an ordinary chart, Codestral looked like the winner. But with n=30, the
difference between 0.900 and 0.867 is only one question.

I calculated 95% percentile bootstrap confidence intervals, with B=2000
resamples drawn from each model's per-query hit vector:

codestral-embed                   0.900   [0.800, 1.000]
each of the next four models      0.867   [0.733, 0.967]
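
A percentile bootstrap of this kind can be sketched with the standard library alone. This is not the repository's analysis script: the function name, the fixed seed, and the simple index-based percentiles are assumptions. The hit vector encodes 27 hits out of 30, which matches a recall@10 of 0.900.

```python
import random

def percentile_bootstrap_ci(hits, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a 0/1 per-query hit vector."""
    rng = random.Random(seed)
    n = len(hits)
    # Resample n queries with replacement and record the mean each time.
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# 27 hits and 3 misses; the order of queries does not affect the result.
hits = [1] * 27 + [0] * 3
point_estimate = sum(hits) / len(hits)
lower, upper = percentile_bootstrap_ci(hits)
```

With only thirty 0/1 outcomes, every resampled mean is a multiple of 1/30, so the interval endpoints move in steps of about 0.033 and can shift by one step depending on the seed.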

Overlap between separate confidence intervals is not, by itself, a formal
pairwise test. I therefore also ran a paired bootstrap on the differences
between every pair in the top cluster.

The four models scoring 0.867 were arithmetically identical on this metric:
they found the primary file for the same 26 questions and failed on the same
four. Codestral's entire advantage came from one question, Q30. The confidence
interval for its advantage over each of the four models was
[+0.000, +0.100], meaning that the lower bound was exactly zero.
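
The zero lower bound follows directly from the shape of the data. Below is a minimal sketch of a paired bootstrap on per-query differences, assuming two hit vectors that agree on 29 questions and differ on one. It is not the repository's paired_bootstrap.py; the function name, seed, and percentile indexing are assumptions.

```python
import random

def paired_bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(a) - mean(b), resampling the same queries for both."""
    if len(a) != len(b):
        raise ValueError("both vectors must cover the same queries")
    rng = random.Random(seed)
    n = len(a)
    # Resampling per-query differences keeps each query's two outcomes paired.
    diffs = [x - y for x, y in zip(a, b)]
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# 26 shared hits, one question that only the leader solved, 3 shared misses.
leader = [1] * 26 + [1] + [0] * 3
runner_up = [1] * 26 + [0] + [0] * 3

lower, upper = paired_bootstrap_ci(leader, runner_up)
```

Because 29 of the 30 differences are zero, a large share of resamples never draw the single differing question at all, and their mean difference is exactly zero. That is why the interval cannot exclude "no difference".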

The data does not support the claim that Codestral won. A defensible statement
is weaker:

Codestral had the highest point estimate in this experiment, but with thirty
questions, the paired bootstrap could not distinguish it from the next
cluster of models.

Bootstrap did not change the collected results. It changed the strength of the
claims those results allowed me to publish.

7. Rules for an Honest AI Benchmark

The study led me to several rules that apply beyond Code-RAG.

First, define which stage of the system you are measuring. Retrieval
quality, generated-answer quality, and product usefulness are different
research targets. Mixing them into one evaluation makes the source of an error
difficult to identify.

Use a known ground truth. The correct result must be defined before the
model runs. Otherwise, the benchmark author can easily accept a plausible
answer as a correct one.

Test the intent, not one fortunate phrasing. Users are not required to ask
questions using the words that work best for an embedding model.

Label plausible errors. In a real codebase, an incorrect result often looks
almost correct. A benchmark should distinguish the behavior owner from a
neighboring interface, manager class, or configuration file.

Publish uncertainty alongside the ranking. Point estimates are useful for
sorting a table, but they do not always support a claim of superiority.

Use the benchmark to challenge your own conclusions. A good methodology
should not only confirm hypotheses. It should create conditions in which they
can fail.

8. Limitations and Next Step

The study has several clear limitations.

Thirty questions. recall@10 has only 31 possible values, from 0/30 to
30/30. The bootstrap 95% confidence-interval half-widths in this experiment
were approximately 0.10-0.17. Reliably separating the top cluster requires a
larger question set.

One corpus. The study used 697 files from the Kafka 4.0.0 broker core. It
does not establish that the same models and configurations will behave
similarly on another project, language, or architecture.

File-level labels. The benchmark evaluates retrieved files rather than
individual classes and methods. A correct file containing an irrelevant chunk
can count as a hit, while the correct method inside a low-ranked file can count
as a miss.
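
A minimal sketch of why this happens, assuming each file is ranked by its best-scoring chunk. The article does not state how chunk hits are collapsed into a file ranking, so the max-score rule, the function name, and the scores below are all assumptions.

```python
def rank_files_from_chunks(chunk_hits):
    """Collapse (file_path, score) chunk hits into a file ranking,
    scoring each file by its best chunk."""
    best = {}
    for path, score in chunk_hits:
        if path not in best or score > best[path]:
            best[path] = score
    return sorted(best, key=lambda p: best[p], reverse=True)

# Toy chunk-level results; A.scala appears twice with different chunks.
chunk_hits = [
    ("A.scala", 0.71),
    ("B.scala", 0.80),
    ("A.scala", 0.92),
    ("C.scala", 0.40),
]
file_ranking = rank_files_from_chunks(chunk_hits)
```

The ranking records that A.scala came first but not which of its chunks earned that position, so a file-level label cannot tell a relevant chunk from an irrelevant one inside the same file.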

One annotator. I authored all gold queries. The rationale layer makes the
decisions reviewable, but independent annotation by a second specialist would
strengthen the dataset.

Retrieval rather than final-answer quality. The study does not show how
well a generative model uses the retrieved context or how useful its final
answer is to a developer.

The next article in the series will use the same dataset to investigate the
relative effects of the embedding model, chunking, and retrieval mode. This
time, those comparisons will be interpreted with query robustness, adversarial
errors, and statistical uncertainty in mind.

Reproducibility

The repository contains the gold set, raw CSV and Parquet results, benchmark
code, and analysis scripts.

git clone https://github.com/Daeryss/karta-rag-map
cd karta-rag-map

# Inspect the final five-layer schema and 30 gold queries
cat corpora/kafka-4.0.0/gold-queries-v2.1.yaml

# Validate the dataset schema
./gradlew test --tests 'io.github.daeryss.karta.dataset.GoldQueryLoaderV2Test'

# Data behind the ranking and paired bootstrap
ls results/forest_plot_data.csv
ls results/paired_bootstrap_top5.csv

# Recompute the paired bootstrap for the top cluster
.venv/bin/python scripts/paired_bootstrap.py \
  results/E005-production/all.csv \
  results/E006-production-merged/all.csv \
  --output results/paired_bootstrap_top5.csv

Code-RAG Research Series · Article 1 of 3 · Next: "What Actually Matters in
Code-RAG: Models, Chunking, or Retrieval?"

DEV Community