<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ilias Miftakhov</title>
    <description>The latest articles on DEV Community by Ilias Miftakhov (@miftakhov).</description>
    <link>https://dev.to/miftakhov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3974826%2F219693d4-f9c9-40fa-bcf7-727689bce697.png</url>
      <title>DEV Community: Ilias Miftakhov</title>
      <link>https://dev.to/miftakhov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/miftakhov"/>
    <language>en</language>
    <item>
      <title>A Cognitive Benchmark for Code-RAG Retrieval: Part 1 — Methodology</title>
      <dc:creator>Ilias Miftakhov</dc:creator>
      <pubDate>Tue, 09 Jun 2026 19:37:51 +0000</pubDate>
      <link>https://dev.to/miftakhov/a-cognitive-benchmark-for-code-rag-retrieval-part-1-methodology-3m7l</link>
      <guid>https://dev.to/miftakhov/a-cognitive-benchmark-for-code-rag-retrieval-part-1-methodology-3m7l</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Code-RAG systems promise to help developers navigate large codebases: find the&lt;br&gt;
implementation of a behavior, trace a data flow, or identify the component&lt;br&gt;
responsible for a specific function. But a compelling demo does not tell us how&lt;br&gt;
reliable the retrieval itself is.&lt;/p&gt;

&lt;p&gt;To investigate this, I built a retrieval benchmark on the Apache Kafka 4.0.0&lt;br&gt;
broker core, a real polyglot project containing 697 Java and Scala files. For&lt;br&gt;
each question, I identified the correct files, acceptable supporting files, and&lt;br&gt;
plausible but incorrect alternatives in advance. I then compared which files&lt;br&gt;
Code-RAG retrieved under different embedding models, code chunking strategies,&lt;br&gt;
and retrieval modes.&lt;/p&gt;

&lt;p&gt;The first results looked convincing, but the conclusions changed several times&lt;br&gt;
as the dataset evolved. A single query phrasing concealed retrieval instability,&lt;br&gt;
ordinary recall missed dangerous errors, and the top-ranked model on the final&lt;br&gt;
leaderboard beat the next four models on only one question out of thirty.&lt;/p&gt;

&lt;p&gt;The main outcome of the study was not the selection of a "best" model. It was a&lt;br&gt;
methodology for avoiding the mistake of treating a chance result as a reliable&lt;br&gt;
one.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Why Code-RAG Needs Its Own Benchmark
&lt;/h2&gt;

&lt;p&gt;Code-RAG systems make an appealing promise: a developer asks a question about an&lt;br&gt;
unfamiliar project and receives the relevant parts of the codebase. Such a&lt;br&gt;
system could be used inside an IDE, in a chat assistant, during onboarding,&lt;br&gt;
while investigating incidents, or when preparing a change.&lt;/p&gt;

&lt;p&gt;A typical Code-RAG pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source code
    → split into chunks
    → build a search index
    → developer query
    → retrieve relevant chunks
    → pass the retrieved context to a generative model
    → answer the user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final answer does not depend on the generative model alone. If retrieval&lt;br&gt;
fails to find the file that actually implements the relevant behavior, the&lt;br&gt;
model will either produce an incomplete answer or reason plausibly from the&lt;br&gt;
wrong context.&lt;/p&gt;

&lt;p&gt;I therefore started with a narrower question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How accurately does Code-RAG find the correct files in a real polyglot&lt;br&gt;
codebase?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An important clarification: this study did not send questions to different&lt;br&gt;
generative LLMs and evaluate their written answers. It sent them to a retrieval&lt;br&gt;
pipeline. An embedding model converted the query and code into vectors, Lucene&lt;br&gt;
returned a ranked list of files, and the benchmark compared that list with&lt;br&gt;
labels prepared in advance.&lt;/p&gt;

&lt;p&gt;This restriction is intentional. It isolates retrieval quality instead of&lt;br&gt;
mixing it with prompt quality, answer generation, and LLM hallucinations.&lt;/p&gt;

&lt;p&gt;I selected the Apache Kafka 4.0.0 broker core as the corpus. It is not a&lt;br&gt;
tutorial example, but a large Java and Scala codebase with packages,&lt;br&gt;
interfaces, implementations, manager classes, and logic distributed across&lt;br&gt;
multiple components. In a project like this, a retrieval system must&lt;br&gt;
distinguish the file that actually owns a behavior from a similarly named file&lt;br&gt;
or an adjacent responsibility.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Defining the Correct Answer
&lt;/h2&gt;

&lt;p&gt;Evaluating retrieval requires more than asking a question and deciding whether&lt;br&gt;
the result looks reasonable. It requires a known ground truth.&lt;/p&gt;

&lt;p&gt;I wrote a set of questions about the Kafka codebase, with which I was already&lt;br&gt;
familiar. Each question described a specific behavior or concept whose owner&lt;br&gt;
file could be identified manually. For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How does the Kafka broker accept incoming TCP connections?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The correct primary file for this question is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;core/src/main/scala/kafka/network/SocketServer.scala
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;RequestChannel.scala&lt;/code&gt; is nearby. It belongs to the networking flow and helps&lt;br&gt;
pass requests onward, so it looks plausible. However, it is not responsible for&lt;br&gt;
accepting TCP connections. If a system ranks it above &lt;code&gt;SocketServer.scala&lt;/code&gt;, it&lt;br&gt;
has found thematically related code but identified the wrong owner of the&lt;br&gt;
behavior.&lt;/p&gt;

&lt;p&gt;For every run, the retriever returned a ranked list of files. The result could&lt;br&gt;
then be evaluated formally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;recall@k&lt;/code&gt; shows whether the correct file appeared in the first &lt;code&gt;k&lt;/code&gt; results;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MRR&lt;/code&gt; accounts for how highly the first correct result was ranked;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;displacement_rate&lt;/code&gt; shows whether a plausible but incorrect file displaced
the primary answer within the top-k;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rank_gap&lt;/code&gt; measures the distance between correct and adversarial files in the
complete ranked result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This turns the subjective judgment that "the result looks reasonable" into a&lt;br&gt;
reproducible comparison. With the same set of questions, I could change the&lt;br&gt;
embedding model, chunking configuration, or retrieval mode and measure how the&lt;br&gt;
quality changed.&lt;/p&gt;
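
&lt;p&gt;The four metrics listed above can be sketched for a single query as follows. This is a minimal illustration, not the benchmark's own code: the exact definitions of &lt;code&gt;displacement_rate&lt;/code&gt; and &lt;code&gt;rank_gap&lt;/code&gt; used here are assumptions, the helper names are hypothetical, and the filler file names in the toy ranking are invented.&lt;/p&gt;

```python
from typing import Optional, Sequence


def first_rank(ranked: Sequence[str], targets: set) -> Optional[int]:
    """Return the 1-based rank of the first file found in `targets`, or None."""
    for position, path in enumerate(ranked, start=1):
        if path in targets:
            return position
    return None


def recall_at_k(ranked: Sequence[str], primary: set, k: int = 10) -> float:
    """1.0 if a primary file appears in the first k results, else 0.0."""
    return 1.0 if first_rank(ranked[:k], primary) is not None else 0.0


def reciprocal_rank(ranked: Sequence[str], primary: set) -> float:
    """1 / rank of the first primary file; MRR is the mean of this over all queries."""
    rank = first_rank(ranked, primary)
    return 1.0 / rank if rank is not None else 0.0


def displaced(ranked: Sequence[str], primary: set, adversarial: set, k: int = 10) -> bool:
    """Assumed definition: an adversarial file in the top k outranks every primary file."""
    adv = first_rank(ranked[:k], adversarial)
    if adv is None:
        return False
    prim = first_rank(ranked, primary)
    return prim is None or prim > adv


def rank_gap(ranked: Sequence[str], primary: set, adversarial: set) -> Optional[int]:
    """Assumed definition: best adversarial rank minus best primary rank.

    A positive value means the primary file is ranked above the adversarial one.
    """
    prim = first_rank(ranked, primary)
    adv = first_rank(ranked, adversarial)
    if prim is None or adv is None:
        return None
    return adv - prim


# Toy ranking: the plausible neighbor is first, the behavior owner is fifth.
ranked = ["RequestChannel.scala", "FillerA.scala", "FillerB.scala",
          "FillerC.scala", "SocketServer.scala"]
primary = {"SocketServer.scala"}
adversarial = {"RequestChannel.scala"}

print(recall_at_k(ranked, primary))             # 1.0
print(reciprocal_rank(ranked, primary))         # 0.2
print(displaced(ranked, primary, adversarial))  # True
print(rank_gap(ranked, primary, adversarial))   # -4
```

&lt;p&gt;In this toy case &lt;code&gt;recall@10&lt;/code&gt; reports a hit, while the displacement flag and the negative gap record that the neighbor outranked the owner. Averaging the displacement flag over all queries would give a rate.&lt;/p&gt;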

&lt;p&gt;But the first version of the dataset was too simple.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Why the First Gold Queries Had to Be Replaced
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The First Version: Verify That the Pipeline Works
&lt;/h3&gt;

&lt;p&gt;The pilot gold set contained three questions and ten Kafka files. Each question&lt;br&gt;
had one phrasing and a list of expected files. This was enough to exercise the&lt;br&gt;
entire path: load the corpus, build the index, run retrieval, and calculate the&lt;br&gt;
metrics.&lt;/p&gt;

&lt;p&gt;The pilot used &lt;code&gt;nomic-embed-text&lt;/code&gt; with 3,000-character chunks and a&lt;br&gt;
300-character overlap. It achieved a perfect score on every measured metric. I&lt;br&gt;
therefore selected &lt;code&gt;3000/300&lt;/code&gt; as the working configuration.&lt;/p&gt;
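
&lt;p&gt;For readers unfamiliar with the notation, a configuration such as &lt;code&gt;3000/300&lt;/code&gt; can be read as a sliding character window. The sketch below assumes plain fixed-size character chunking; the benchmark's actual chunker may split differently (for example, on code boundaries), and &lt;code&gt;chunk_text&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
def chunk_text(text: str, size: int = 3000, overlap: int = 300) -> list:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if 0 > overlap or overlap >= size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    step = size - overlap                     # distance between window starts
    last_start = max(len(text) - overlap, 1)  # avoid a trailing chunk that is pure overlap
    return [text[start:start + size] for start in range(0, last_start, step)]


# Synthetic 7,000-character input standing in for a source file.
sample = "".join(chr(97 + i % 26) for i in range(7000))

chunks = chunk_text(sample, size=3000, overlap=300)
print([len(c) for c in chunks])               # [3000, 3000, 1600]
print(chunks[0][-300:] == chunks[1][:300])    # True: neighbors share 300 characters
print(len(chunk_text(sample, size=1500, overlap=200)))  # 6
```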

&lt;p&gt;The problem was that the benchmark had not confirmed the reliability of the&lt;br&gt;
approach. It had only confirmed that the system could solve three familiar&lt;br&gt;
questions on a small corpus. It left almost no room for failure.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Next Version: More Questions and More Phrasings
&lt;/h3&gt;

&lt;p&gt;After expanding the corpus to 697 files, I prepared 30 questions. Each question&lt;br&gt;
received several paraphrases: canonical human phrasing, passive phrasing, and&lt;br&gt;
descriptions framed around a process, component, or location.&lt;/p&gt;

&lt;p&gt;This version showed that the same intent could produce different results&lt;br&gt;
depending on its wording. However, the variants were still mostly superficial&lt;br&gt;
language transformations. They did not distinguish the different ways in which&lt;br&gt;
a developer actually searches for code: a natural question, a precise&lt;br&gt;
technical query, a set of keywords, or a query using imprecise terminology.&lt;/p&gt;

&lt;p&gt;The labels also still focused mainly on whether the correct file had been&lt;br&gt;
found, without adequately describing plausible errors. For real code&lt;br&gt;
navigation, it is not enough to find &lt;code&gt;SocketServer.scala&lt;/code&gt;; the system must also&lt;br&gt;
avoid confusing it with the adjacent &lt;code&gt;RequestChannel.scala&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Final Version: A Five-Layer Gold-Query Schema
&lt;/h3&gt;

&lt;p&gt;In the repository, the earlier full set is stored as &lt;code&gt;gold-queries.yaml&lt;/code&gt;, while&lt;br&gt;
the final set is available as&lt;br&gt;
&lt;a href="https://github.com/Daeryss/karta-rag-map/blob/main/corpora/kafka-4.0.0/gold-queries-v2.1.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;gold-queries-v2.1.yaml&lt;/code&gt;&lt;/a&gt;.&lt;br&gt;
Their differences can be summarized as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Early gold set&lt;/th&gt;
&lt;th&gt;Gold queries v2.1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Correct answer&lt;/td&gt;
&lt;td&gt;List of expected files&lt;/td&gt;
&lt;td&gt;Primary and secondary files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phrasings&lt;/td&gt;
&lt;td&gt;Generic language paraphrases&lt;/td&gt;
&lt;td&gt;Controlled cognitive modes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Incorrect neighbors&lt;/td&gt;
&lt;td&gt;Simple list of adversarial files&lt;/td&gt;
&lt;td&gt;Type, strength, and source of confusion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rationale&lt;/td&gt;
&lt;td&gt;Short explanation&lt;/td&gt;
&lt;td&gt;Behavior owner, reason, and explicit boundary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evaluation&lt;/td&gt;
&lt;td&gt;Shared set of metrics&lt;/td&gt;
&lt;td&gt;Relevant metric families for each question&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This led to the &lt;code&gt;gold-queries-v2.1&lt;/code&gt; schema, in which every question is&lt;br&gt;
represented across five independent layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. RETRIEVAL     what counts as the correct answer
2. COGNITIVE     how the user expresses the same query
3. INTERFERENCE  which files look plausible but are incorrect
4. RATIONALE     why the answer is correct and where its boundary lies
5. EVALUATION    which metrics matter for this question
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;retrieval&lt;/code&gt; layer separates primary and supporting files. The primary file&lt;br&gt;
owns the requested behavior; a supporting file helps explain it but does not&lt;br&gt;
replace the answer.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;cognitive&lt;/code&gt; layer contains up to five query forms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;human&lt;/code&gt;: a natural question from a developer;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ai_optimized&lt;/code&gt;: a detailed technical query using precise terminology;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;keyword&lt;/code&gt;: a short set of keywords;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;wrong_terminology&lt;/code&gt;: the same intent with one controlled terminology error;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cross_module&lt;/code&gt;: a question that requires connecting multiple components,
where applicable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;interference&lt;/code&gt; layer lists adversarial nodes: semantically or structurally&lt;br&gt;
close files that are easy to mistake for the answer. Each one records its&lt;br&gt;
relationship to the primary file, the strength of the trap, and the type of&lt;br&gt;
confusion.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;rationale&lt;/code&gt; layer documents the behavior owner, the reason for selecting&lt;br&gt;
it, and an explicit boundary explaining why a neighboring file is not the&lt;br&gt;
answer. This makes the labels reviewable and debatable instead of treating them&lt;br&gt;
as unexplained decisions by the author.&lt;/p&gt;

&lt;p&gt;Finally, the &lt;code&gt;evaluation&lt;/code&gt; layer specifies which metric families are especially&lt;br&gt;
important for a particular question.&lt;/p&gt;

&lt;p&gt;This schema changed the subject of the study itself. Instead of testing whether&lt;br&gt;
the system could occasionally find the right file, the benchmark began testing&lt;br&gt;
how consistently it recognized the same intent and distinguished the behavior&lt;br&gt;
owner from plausible neighbors.&lt;/p&gt;
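
&lt;p&gt;To make the five layers concrete, here is one way a single gold query could be modeled in code. All field names, the &lt;code&gt;validate&lt;/code&gt; helper, the query id, and the &lt;code&gt;strength&lt;/code&gt;, &lt;code&gt;confusion_type&lt;/code&gt;, and &lt;code&gt;metric_families&lt;/code&gt; values are hypothetical; they are not copied from &lt;code&gt;gold-queries-v2.1.yaml&lt;/code&gt;, which remains the authoritative schema. Only the question text, the two file names, and the mode names come from this article.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The five cognitive modes named in the article.
ALLOWED_MODES = {"human", "ai_optimized", "keyword", "wrong_terminology", "cross_module"}


@dataclass
class AdversarialNode:
    path: str
    relationship: str    # relation to the primary file
    strength: str        # how strong the trap is (hypothetical scale)
    confusion_type: str  # kind of confusion (hypothetical label)


@dataclass
class GoldQuery:
    query_id: str
    primary: List[str]                   # 1. RETRIEVAL: behavior owners
    secondary: List[str]                 # 1. RETRIEVAL: supporting files
    phrasings: Dict[str, str]            # 2. COGNITIVE: mode name to query text
    adversarial: List[AdversarialNode]   # 3. INTERFERENCE
    owner: str                           # 4. RATIONALE
    reason: str
    boundary: str
    metric_families: List[str] = field(default_factory=list)  # 5. EVALUATION


def validate(query: GoldQuery) -> List[str]:
    """Return a list of consistency problems; an empty list means the entry looks sane."""
    problems = []
    if not query.primary:
        problems.append("no primary file")
    unknown = set(query.phrasings) - ALLOWED_MODES
    if unknown:
        problems.append(f"unknown cognitive modes: {sorted(unknown)}")
    clash = {node.path for node in query.adversarial}.intersection(query.primary)
    if clash:
        problems.append(f"file is both primary and adversarial: {sorted(clash)}")
    return problems


example = GoldQuery(
    query_id="example",  # hypothetical id
    primary=["core/src/main/scala/kafka/network/SocketServer.scala"],
    secondary=[],
    phrasings={"human": "How does the Kafka broker accept incoming TCP connections?"},
    adversarial=[AdversarialNode(
        path="RequestChannel.scala",
        relationship="same networking flow",
        strength="strong",                        # hypothetical value
        confusion_type="adjacent_responsibility", # hypothetical value
    )],
    owner="SocketServer.scala",
    reason="accepts incoming TCP connections",
    boundary="RequestChannel passes requests onward but does not accept connections",
    metric_families=["recall", "displacement"],   # hypothetical values
)

print(validate(example))  # []
```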
&lt;h2&gt;
  
  
  4. What Paraphrases and Adversarial Nodes Revealed
&lt;/h2&gt;
&lt;h3&gt;
  
  
  One Phrasing Does Not Represent the Whole Question
&lt;/h3&gt;

&lt;p&gt;In an early full-corpus experiment, top-1 accuracy on the human phrasing was&lt;br&gt;
0.233. When a question was counted as solved if the system succeeded on any&lt;br&gt;
available phrasing, the result was 0.600, about 2.6x the single-phrasing&lt;br&gt;
score.&lt;/p&gt;

&lt;p&gt;This does not mean that we should select the most convenient variant and&lt;br&gt;
publish the best result. On the contrary, a large gap shows that one phrasing&lt;br&gt;
is a poor representation of real retrieval robustness.&lt;/p&gt;
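
&lt;p&gt;The gap between single-phrasing and any-phrasing scoring is simple arithmetic over a per-question, per-mode hit table. The sketch below uses a synthetic table chosen only to reproduce the two reported aggregates (7 of 30 and 18 of 30); which questions succeeded under which phrasing is not taken from the real results, and the function names are hypothetical.&lt;/p&gt;

```python
# Synthetic hit table: question index -> {mode: did top-1 retrieval succeed?}
# Only the totals (7/30 for "human", 18/30 for any mode) mirror the article.
hits = {
    q: {"human": q in range(7), "keyword": q in range(18)}
    for q in range(30)
}


def single_mode_accuracy(table: dict, mode: str) -> float:
    """Share of questions solved when only one phrasing is used."""
    return sum(per_mode[mode] for per_mode in table.values()) / len(table)


def any_mode_accuracy(table: dict) -> float:
    """Share of questions solved by at least one available phrasing."""
    return sum(any(per_mode.values()) for per_mode in table.values()) / len(table)


human = single_mode_accuracy(hits, "human")
best = any_mode_accuracy(hits)

print(round(human, 3))         # 0.233
print(round(best, 3))          # 0.6
print(round(best / human, 1))  # 2.6
```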

&lt;p&gt;On the final baseline, all nine commercial embedding models reached&lt;br&gt;
&lt;code&gt;quorum-any recall@10 = 1.000&lt;/code&gt;: for every question, at least one phrasing placed&lt;br&gt;
the correct file in the top 10. However, the models handled changes in query&lt;br&gt;
language with different levels of consistency.&lt;/p&gt;

&lt;p&gt;The drop between &lt;code&gt;human&lt;/code&gt; and &lt;code&gt;wrong_terminology&lt;/code&gt; ranged from 0.100 to 0.300. I&lt;br&gt;
expected code-specialized models to be more robust, but the data did not&lt;br&gt;
support that expectation: the lowest and highest observed drops both belonged&lt;br&gt;
to general-purpose models.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ai_optimized&lt;/code&gt; variant, deliberately written to be retrieval-friendly,&lt;br&gt;
reached recall@10 = 1.000 for all nine commercial models. In this experiment,&lt;br&gt;
it was useful as an upper bound on solvability but nearly useless for comparing&lt;br&gt;
strong models with one another.&lt;/p&gt;
&lt;h3&gt;
  
  
  Recall Misses Dangerous Almost-Correct Answers
&lt;/h3&gt;

&lt;p&gt;Suppose the correct file ranks fifth while a plausible but incorrect neighbor&lt;br&gt;
ranks first. &lt;code&gt;recall@10&lt;/code&gt; is still 1.0. For the user, however, the first results&lt;br&gt;
matter most: these are the files they will read or pass to a generative model.&lt;/p&gt;

&lt;p&gt;Adversarial labels make this type of error measurable. For the TCP-connection&lt;br&gt;
question, all nine commercial models found &lt;code&gt;SocketServer.scala&lt;/code&gt; in the top 10&lt;br&gt;
under every available phrasing. Ordinary recall could not distinguish them at&lt;br&gt;
all.&lt;/p&gt;

&lt;p&gt;However, broader &lt;code&gt;ai_optimized&lt;/code&gt; queries also brought &lt;code&gt;RequestChannel.scala&lt;/code&gt;&lt;br&gt;
into the results more often. The &lt;code&gt;rank_gap&lt;/code&gt; and &lt;code&gt;displacement_rate&lt;/code&gt; metrics&lt;br&gt;
showed whether the correct order was preserved and whether the neighboring file&lt;br&gt;
displaced the primary answer.&lt;/p&gt;

&lt;p&gt;Paraphrases and adversarial nodes measure different properties. Paraphrases&lt;br&gt;
show whether the system recognizes the same intent when its language changes.&lt;br&gt;
Adversarial nodes show whether it can separate the correct answer from a&lt;br&gt;
thematically close mistake. Recall alone cannot measure both.&lt;/p&gt;
&lt;h2&gt;
  
  
  5. How the Results Changed as the Benchmark Evolved
&lt;/h2&gt;

&lt;p&gt;The methodology changed the practical conclusions several times.&lt;/p&gt;

&lt;p&gt;The first pilot, with ten files and three questions, produced perfect scores and&lt;br&gt;
established &lt;code&gt;3000/300&lt;/code&gt; as the chunking configuration.&lt;/p&gt;

&lt;p&gt;When the same embedding model was evaluated on the full corpus and thirty&lt;br&gt;
questions, &lt;code&gt;3000/300&lt;/code&gt; produced the worst recall@10 among four configurations:&lt;br&gt;
0.533 versus 0.600 for the other three. &lt;code&gt;1500/200&lt;/code&gt; led on the other quality&lt;br&gt;
metrics and became the new recommended configuration.&lt;/p&gt;

&lt;p&gt;I then evaluated seven local embedding models across five chunking&lt;br&gt;
configurations. Even the replacement recommendation did not hold: five of the&lt;br&gt;
seven models preferred &lt;code&gt;c500-o100&lt;/code&gt;, which had not been included in the earlier&lt;br&gt;
sweep. For the pilot model, &lt;code&gt;nomic-embed-text&lt;/code&gt;, &lt;code&gt;1500/200&lt;/code&gt; ranked only third.&lt;/p&gt;

&lt;p&gt;This is not merely a story about choosing the wrong chunk size. It shows how&lt;br&gt;
easily a fortunate combination of a small corpus, a few questions, and one&lt;br&gt;
model can become a universal recommendation.&lt;/p&gt;
&lt;h2&gt;
  
  
  6. Bootstrap Confidence Intervals Changed the Final Conclusion
&lt;/h2&gt;

&lt;p&gt;After the main experiment runs were complete, the top of the leaderboard for&lt;br&gt;
the nine commercial models under the baseline configuration looked convincing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model                                     recall@10
mistral/codestral-embed                   0.900
GigaEmbeddings-3B                         0.867
OpenAI text-embedding-3-small             0.867
voyage-code-2                             0.867
voyage-code-3                             0.867
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On an ordinary chart, Codestral looked like the winner. But with &lt;code&gt;n=30&lt;/code&gt;, the&lt;br&gt;
difference between 0.900 and 0.867 is only one question.&lt;/p&gt;

&lt;p&gt;I calculated 95% percentile bootstrap confidence intervals, using &lt;code&gt;B=2000&lt;/code&gt;&lt;br&gt;
resamples of each model's per-query vector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;codestral-embed                   0.900   [0.800, 1.000]
each of the next four models      0.867   [0.733, 0.967]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
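
&lt;p&gt;A percentile bootstrap of this kind can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the repository's script: the nearest-rank percentile rule, the fixed seed, and the function names are my choices here, and the input vector is synthetic (27 hits out of 30, matching only the 0.900 point estimate). Interval endpoints depend on the seed and the percentile rule, so they may differ slightly from the published ones.&lt;/p&gt;

```python
import random


def nearest_rank(sorted_values: list, q: float):
    """Value at quantile q (0 to 1) of an already sorted list, nearest-rank rule."""
    return sorted_values[round(q * (len(sorted_values) - 1))]


def bootstrap_ci(per_query: list, b: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap interval for the mean of a per-query 0/1 vector."""
    rng = random.Random(seed)
    n = len(per_query)
    # Resample n queries with replacement, b times, and record each mean.
    means = sorted(sum(rng.choices(per_query, k=n)) / n for _ in range(b))
    return nearest_rank(means, alpha / 2), nearest_rank(means, 1 - alpha / 2)


# Synthetic per-query hits: 27 of 30 questions solved (point estimate 0.900).
per_query = [1] * 27 + [0] * 3
point = sum(per_query) / len(per_query)
low, high = bootstrap_ci(per_query)

print(f"{point:.3f} [{low:.3f}, {high:.3f}]")
```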



&lt;p&gt;Overlap between separate confidence intervals is not, by itself, a formal&lt;br&gt;
pairwise test. I therefore also ran a paired bootstrap on the differences&lt;br&gt;
between every pair in the top cluster.&lt;/p&gt;

&lt;p&gt;The four models scoring 0.867 were arithmetically identical on this metric:&lt;br&gt;
they found the primary file for the same 26 questions and failed on the same&lt;br&gt;
four. Codestral's entire advantage came from one question, Q30. The confidence&lt;br&gt;
interval for its advantage over each of the four models was&lt;br&gt;
&lt;code&gt;[+0.000, +0.100]&lt;/code&gt;: the lower bound is exactly zero, so the interval does not&lt;br&gt;
exclude a true difference of zero.&lt;/p&gt;
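
&lt;p&gt;The paired variant resamples questions once and applies the same resample to both models, so it works on the per-question difference vector. The sketch below is a minimal illustration, not &lt;code&gt;scripts/paired_bootstrap.py&lt;/code&gt;: the function name, seed, and nearest-rank percentile rule are assumptions, and the two vectors are synthetic, built only to match the described situation (identical results except for one question).&lt;/p&gt;

```python
import random


def paired_bootstrap_ci(a: list, b: list, resamples: int = 2000,
                        alpha: float = 0.05, seed: int = 0):
    """Percentile interval for mean(a) - mean(b), resampling questions jointly."""
    if len(a) != len(b):
        raise ValueError("both models must be scored on the same questions")
    rng = random.Random(seed)
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]  # per-question difference
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(resamples))
    low = means[round((alpha / 2) * (resamples - 1))]
    high = means[round((1 - alpha / 2) * (resamples - 1))]
    return sum(diffs) / n, low, high


# Synthetic vectors: 26 shared hits, 3 shared misses, and one question
# (placed last) that only the leader solves.
leader = [1] * 26 + [0] * 3 + [1]
other = [1] * 26 + [0] * 3 + [0]

delta, low, high = paired_bootstrap_ci(leader, other)
print(f"{delta:+.3f} [{low:+.3f}, {high:+.3f}]")
```

&lt;p&gt;Because 29 of the 30 differences are zero, a large share of resamples never draw the one decisive question, which is why the lower bound sits at zero.&lt;/p&gt;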

&lt;p&gt;The data does not support the claim that Codestral won. A defensible statement&lt;br&gt;
is weaker:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Codestral had the highest point estimate in this experiment, but with thirty&lt;br&gt;
questions, the paired bootstrap could not distinguish it from the next&lt;br&gt;
cluster of models.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Bootstrap did not change the collected results. It changed the strength of the&lt;br&gt;
claims those results allowed me to publish.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuekffrcpjgxvedg1p9qw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuekffrcpjgxvedg1p9qw.png" alt=" " width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  7. Rules for an Honest AI Benchmark
&lt;/h2&gt;

&lt;p&gt;The study led me to several rules that apply beyond Code-RAG.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, define which stage of the system you are measuring.&lt;/strong&gt; Retrieval&lt;br&gt;
quality, generated-answer quality, and product usefulness are different&lt;br&gt;
research targets. Mixing them into one evaluation makes the source of an error&lt;br&gt;
difficult to identify.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use a known ground truth.&lt;/strong&gt; The correct result must be defined before the&lt;br&gt;
model runs. Otherwise, the benchmark author can easily accept a plausible&lt;br&gt;
answer as a correct one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test the intent, not one fortunate phrasing.&lt;/strong&gt; Users are not required to ask&lt;br&gt;
questions using the words that work best for an embedding model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Label plausible errors.&lt;/strong&gt; In a real codebase, an incorrect result often looks&lt;br&gt;
almost correct. A benchmark should distinguish the behavior owner from a&lt;br&gt;
neighboring interface, manager class, or configuration file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Publish uncertainty alongside the ranking.&lt;/strong&gt; Point estimates are useful for&lt;br&gt;
sorting a table, but they do not always support a claim of superiority.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the benchmark to challenge your own conclusions.&lt;/strong&gt; A good methodology&lt;br&gt;
should not only confirm hypotheses. It should create conditions in which they&lt;br&gt;
can fail.&lt;/p&gt;
&lt;h2&gt;
  
  
  8. Limitations and Next Step
&lt;/h2&gt;

&lt;p&gt;The study has several clear limitations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thirty questions.&lt;/strong&gt; With &lt;code&gt;n=30&lt;/code&gt;, &lt;code&gt;recall@10&lt;/code&gt; can take only 31 values, from 0/30 to&lt;br&gt;
30/30, so a single question moves the score by about 0.033. The bootstrap 95%&lt;br&gt;
confidence-interval half-widths in this experiment were approximately 0.10 to&lt;br&gt;
0.17. Reliably separating the top cluster requires a larger question set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One corpus.&lt;/strong&gt; The study used 697 files from the Kafka 4.0.0 broker core. It&lt;br&gt;
does not establish that the same models and configurations will behave&lt;br&gt;
similarly on another project, language, or architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File-level labels.&lt;/strong&gt; The benchmark evaluates retrieved files rather than&lt;br&gt;
individual classes and methods. A correct file containing an irrelevant chunk&lt;br&gt;
can count as a hit, while the correct method inside a low-ranked file can count&lt;br&gt;
as a miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One annotator.&lt;/strong&gt; I authored all gold queries. The &lt;code&gt;rationale&lt;/code&gt; layer makes the&lt;br&gt;
decisions reviewable, but independent annotation by a second specialist would&lt;br&gt;
strengthen the dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval rather than final-answer quality.&lt;/strong&gt; The study does not show how&lt;br&gt;
well a generative model uses the retrieved context or how useful its final&lt;br&gt;
answer is to a developer.&lt;/p&gt;

&lt;p&gt;The next article in the series will use the same dataset to investigate the&lt;br&gt;
relative effects of the embedding model, chunking, and retrieval mode. This&lt;br&gt;
time, those comparisons will be interpreted with query robustness, adversarial&lt;br&gt;
errors, and statistical uncertainty in mind.&lt;/p&gt;
&lt;h2&gt;
  
  
  Reproducibility
&lt;/h2&gt;

&lt;p&gt;The repository contains the gold set, raw CSV and Parquet results, benchmark&lt;br&gt;
code, and analysis scripts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Daeryss/karta-rag-map
&lt;span class="nb"&gt;cd &lt;/span&gt;karta-rag-map

&lt;span class="c"&gt;# Inspect the final five-layer schema and 30 gold queries&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;corpora/kafka-4.0.0/gold-queries-v2.1.yaml

&lt;span class="c"&gt;# Validate the dataset schema&lt;/span&gt;
./gradlew &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--tests&lt;/span&gt; &lt;span class="s1"&gt;'io.github.daeryss.karta.dataset.GoldQueryLoaderV2Test'&lt;/span&gt;

&lt;span class="c"&gt;# Data behind the ranking and paired bootstrap&lt;/span&gt;
&lt;span class="nb"&gt;ls &lt;/span&gt;results/forest_plot_data.csv
&lt;span class="nb"&gt;ls &lt;/span&gt;results/paired_bootstrap_top5.csv

&lt;span class="c"&gt;# Recompute the paired bootstrap for the top cluster&lt;/span&gt;
.venv/bin/python scripts/paired_bootstrap.py &lt;span class="se"&gt;\&lt;/span&gt;
  results/E005-production/all.csv &lt;span class="se"&gt;\&lt;/span&gt;
  results/E006-production-merged/all.csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; results/paired_bootstrap_top5.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Code-RAG Research Series · Article 1 of 3 · Next: "What Actually Matters in&lt;br&gt;
Code-RAG: Models, Chunking, or Retrieval?"&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>java</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
