<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ilias Miftakhov</title>
    <description>The latest articles on DEV Community by Ilias Miftakhov (@miftakhov).</description>
    <link>https://dev.to/miftakhov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3974826%2F219693d4-f9c9-40fa-bcf7-727689bce697.png</url>
      <title>DEV Community: Ilias Miftakhov</title>
      <link>https://dev.to/miftakhov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/miftakhov"/>
    <language>en</language>
    <item>
      <title>A Cognitive Benchmark for Code-RAG Retrieval: Part 2 — Why Model Rankings Depend on the Pipeline</title>
      <dc:creator>Ilias Miftakhov</dc:creator>
      <pubDate>Sun, 14 Jun 2026 21:00:41 +0000</pubDate>
      <link>https://dev.to/miftakhov/a-cognitive-benchmark-for-code-rag-retrieval-part-2-why-model-rankings-depend-on-the-pipeline-12a4</link>
      <guid>https://dev.to/miftakhov/a-cognitive-benchmark-for-code-rag-retrieval-part-2-why-model-rankings-depend-on-the-pipeline-12a4</guid>
      <description>&lt;p&gt;When developers enter an unfamiliar project, they rarely search for a specific&lt;br&gt;
file by name. They usually ask about system behavior: where incoming&lt;br&gt;
connections are accepted, which component cleans logs, or how a request travels&lt;br&gt;
between architectural layers.&lt;/p&gt;

&lt;p&gt;Code-RAG tries to answer such questions through semantic search. It splits and&lt;br&gt;
indexes the source code, then retrieves the context most closely related to a&lt;br&gt;
developer's query.&lt;/p&gt;

&lt;p&gt;The quality of this search is often reduced to the choice of embedding model:&lt;br&gt;
compare several candidates and select the one with the highest metric. In&lt;br&gt;
practice, the result also depends on how the code was split and which retrieval&lt;br&gt;
mode was used.&lt;/p&gt;

&lt;p&gt;To study these dependencies, I built a Code-RAG benchmark on the Apache Kafka&lt;br&gt;
4.0.0 broker core, a real polyglot project written in Java and Scala. For&lt;br&gt;
thirty questions about system behavior, I identified the correct files in&lt;br&gt;
advance, allowing me to measure how accurately the retrieval pipeline finds&lt;br&gt;
the relevant code.&lt;/p&gt;

&lt;p&gt;The results show that a model ranking exists only within a specific&lt;br&gt;
configuration. Changing the chunking strategy, retrieval mode, or query&lt;br&gt;
phrasing can change both the metric value and the order of models in the&lt;br&gt;
ranking.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Experimental Setup
&lt;/h2&gt;

&lt;p&gt;In this part of the study, I compare sixteen embedding models, five chunking&lt;br&gt;
configurations, and three retrieval modes: BM25, vector search, and hybrid&lt;br&gt;
search. Each of the thirty questions was expressed in up to five forms, ranging&lt;br&gt;
from a natural developer question to a query with inaccurate terminology or a&lt;br&gt;
reference to a neighboring module (the last only for the ten questions where it&lt;br&gt;
applies). The structure of these variants and the&lt;br&gt;
evaluation methodology are described in&lt;br&gt;
&lt;a href="https://dev.to/miftakhov/a-cognitive-benchmark-for-code-rag-retrieval-part-1-methodology-3m7l"&gt;Part 1 of the study&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I compared four groups of variables:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Variable&lt;/th&gt;
&lt;th&gt;What changed&lt;/th&gt;
&lt;th&gt;What it tested&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Embedding model&lt;/td&gt;
&lt;td&gt;7 local models run through Ollama and 9 commercial models accessed through APIs&lt;/td&gt;
&lt;td&gt;How strongly quality depends on the vector representation of the query and code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunking&lt;/td&gt;
&lt;td&gt;Whole-file indexing and four fixed-size chunk configurations with overlap&lt;/td&gt;
&lt;td&gt;How the indexed fragment size affects a particular model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval mode&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;BM25_ONLY&lt;/code&gt;, &lt;code&gt;VECTOR_ONLY&lt;/code&gt;, &lt;code&gt;HYBRID_RRF&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Whether lexical search, vector search, or their combination works best&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query phrasing&lt;/td&gt;
&lt;td&gt;Natural question, technical query, keywords, inaccurate terminology, and selected cross-module queries&lt;/td&gt;
&lt;td&gt;How strongly the result depends on the language of the query&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The three retrieval modes work as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;How the ranking is produced&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;BM25_ONLY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lucene lexical search. Files rank highly when query terms match terms in the code. No embedding model is used.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;VECTOR_ONLY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;The query and code fragments are converted into embeddings. Ranking is based on vector similarity, so exact word overlap is unnecessary.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;HYBRID_RRF&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;BM25 and vector search run independently, then their positions are combined using Reciprocal Rank Fusion. RRF uses rank positions rather than directly adding incomparable scores.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
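
&lt;p&gt;As a rough illustration of the fusion step, the sketch below applies Reciprocal Rank Fusion to two ranked file lists. The function name &lt;code&gt;rrf_fuse&lt;/code&gt;, the file names, and the constant &lt;code&gt;k = 60&lt;/code&gt; are assumptions chosen for illustration; the constant and implementation used in the benchmark are not stated here.&lt;/p&gt;

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists: each document scores sum(1 / (k + rank)), rank starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by name for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical ranked outputs of the two retrievers for one query.
bm25_ranking = ["FileA.java", "FileB.java", "FileC.scala"]
vector_ranking = ["FileC.scala", "FileA.java", "FileD.scala"]

fused = rrf_fuse([bm25_ranking, vector_ranking])
print(fused)
```

&lt;p&gt;Only positions enter the formula, so the BM25 score and the vector similarity never have to be put on a common scale.&lt;/p&gt;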

&lt;p&gt;The primary metric in this article is &lt;code&gt;recall@10&lt;/code&gt;. For a single question, it&lt;br&gt;
equals 1 when the primary correct file appears in the first ten results and 0&lt;br&gt;
otherwise. The final value is the average across thirty questions. For example,&lt;br&gt;
&lt;code&gt;recall@10 = 0.900&lt;/code&gt; means that the correct file appeared in the top ten for 27&lt;br&gt;
of 30 questions.&lt;/p&gt;
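
&lt;p&gt;The definition above can be sketched in a few lines. The function name &lt;code&gt;recall_at_k&lt;/code&gt; and the sample data are hypothetical and only reproduce the 27-of-30 example.&lt;/p&gt;

```python
def recall_at_k(results_per_question, primary_file_per_question, k=10):
    """Share of questions whose primary file appears among the first k results."""
    hits = 0
    for question_id, primary_file in primary_file_per_question.items():
        top_k = results_per_question.get(question_id, [])[:k]
        if primary_file in top_k:
            hits += 1
    return hits / len(primary_file_per_question)


# Hypothetical data: 30 questions, the primary file is found for the first 27.
gold = {f"q{i}": f"File{i}.java" for i in range(30)}
results = {
    f"q{i}": [f"File{i}.java"] if i in range(27) else ["Other.java"]
    for i in range(30)
}
print(round(recall_at_k(results, gold), 3))  # 27 / 30
```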

&lt;p&gt;The model ranking also reports a &lt;code&gt;95% CI&lt;/code&gt;, a 95% bootstrap confidence interval.&lt;br&gt;
To calculate it, I repeatedly resampled the set of questions with replacement&lt;br&gt;
and recalculated recall for each sample. A wide interval means that thirty&lt;br&gt;
questions are insufficient for precisely estimating small differences.&lt;br&gt;
Overlapping intervals are not themselves a formal pairwise test, but they warn&lt;br&gt;
against treating the order of neighboring rows as stable.&lt;/p&gt;
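
&lt;p&gt;A minimal sketch of such an interval, assuming the percentile method, 10,000 resamples, and a fixed seed; none of these details are specified above, and the per-question outcomes are hypothetical.&lt;/p&gt;

```python
import random


def bootstrap_ci(hits, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean recall over per-question 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(hits)
    # Resample the questions with replacement and recompute recall each time.
    means = sorted(
        sum(rng.choices(hits, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical outcomes: 25 hits and 5 misses, i.e. recall@10 of about 0.833.
outcomes = [1] * 25 + [0] * 5
low, high = bootstrap_ci(outcomes)
print(f"recall@10 = {sum(outcomes) / len(outcomes):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

&lt;p&gt;With only thirty binary outcomes, each resampled mean moves in steps of 1/30, which is one reason the resulting intervals are wide.&lt;/p&gt;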

&lt;p&gt;The chunking label &lt;code&gt;c500-o100&lt;/code&gt; means fragments of 500 characters with a&lt;br&gt;
100-character overlap. &lt;code&gt;whole-file&lt;/code&gt; means that an entire file is indexed as one&lt;br&gt;
fragment.&lt;/p&gt;
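
&lt;p&gt;The labels can be read as parameters of a sliding character window. The sketch below is one possible implementation; the helper names and the handling of the final, shorter fragment are assumptions rather than the benchmark's actual code.&lt;/p&gt;

```python
def chunk_text(text, size=500, overlap=100):
    """Split text into fixed-size character windows that share `overlap` characters."""
    if overlap not in range(size):
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


def parse_label(label):
    """Turn a label such as 'c500-o100' into (size, overlap)."""
    size_part, overlap_part = label.split("-")
    return int(size_part[1:]), int(overlap_part[1:])


size, overlap = parse_label("c500-o100")
source = "x" * 1200  # stand-in for the characters of one source file
chunks = chunk_text(source, size, overlap)
print([len(c) for c in chunks])
```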

&lt;p&gt;I did not test the complete Cartesian product of all parameters. Models were&lt;br&gt;
compared under a fixed baseline configuration; local and commercial models were&lt;br&gt;
tested across five chunking configurations; and &lt;code&gt;VECTOR_ONLY&lt;/code&gt; was compared&lt;br&gt;
with &lt;code&gt;HYBRID_RRF&lt;/code&gt; at the fixed &lt;code&gt;c1500-o200&lt;/code&gt; chunking. The full interaction&lt;br&gt;
between retrieval mode and chunking remained outside the study. BM25 was run as&lt;br&gt;
a single baseline because it does not depend on an embedding model.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Conditions for Comparing Models
&lt;/h2&gt;

&lt;p&gt;To compare embedding models, the remaining retrieval-pipeline parameters must&lt;br&gt;
be fixed. Otherwise, it is impossible to tell whether a difference was caused&lt;br&gt;
by the model, fragment size, or retrieval method.&lt;/p&gt;

&lt;p&gt;The baseline comparison used natural &lt;code&gt;human&lt;/code&gt; questions, &lt;code&gt;HYBRID_RRF&lt;/code&gt;, and&lt;br&gt;
&lt;code&gt;c1500-o200&lt;/code&gt; chunking. For each model, I measured the share of thirty questions&lt;br&gt;
for which the correct file appeared in the first ten results.&lt;/p&gt;

&lt;p&gt;This ranking compares models under identical conditions, but it does not&lt;br&gt;
describe their quality outside the selected configuration. For example, OpenAI&lt;br&gt;
&lt;code&gt;text-embedding-3-large&lt;/code&gt; achieved &lt;code&gt;recall@10 = 0.833&lt;/code&gt; with &lt;code&gt;c1500-o200&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;0.900&lt;/code&gt; with the smaller &lt;code&gt;c500-o100&lt;/code&gt; fragments, and &lt;code&gt;0.433&lt;/code&gt; when files were&lt;br&gt;
indexed whole.&lt;/p&gt;

&lt;p&gt;The value &lt;code&gt;0.833&lt;/code&gt; therefore cannot be treated as an independent property of the&lt;br&gt;
model. It describes one combination of model, chunking, retrieval mode, corpus,&lt;br&gt;
and question set. The baseline ranking is a useful starting point, but it&lt;br&gt;
cannot identify the best configuration without testing the other parameters.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. The Effect of Chunking
&lt;/h2&gt;

&lt;p&gt;Ideally, code should be split along logical boundaries such as methods, classes,&lt;br&gt;
or other structural units. Structural chunking, however, requires a dedicated&lt;br&gt;
parser for every language.&lt;/p&gt;

&lt;p&gt;This study deliberately uses a polyglot Java and Scala project. I therefore&lt;br&gt;
split the code into fixed-size fragments. This is not presented as the optimal&lt;br&gt;
way to index code; it provides a common denominator across languages and makes&lt;br&gt;
it possible to isolate the effect of fragment size.&lt;/p&gt;

&lt;p&gt;Every value in the table is &lt;code&gt;recall@10&lt;/code&gt; for natural &lt;code&gt;human&lt;/code&gt; questions using&lt;br&gt;
hybrid retrieval. The best observed result for each model is shown in bold.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;c500-o100&lt;/th&gt;
&lt;th&gt;c1500-o200&lt;/th&gt;
&lt;th&gt;c3000-o300&lt;/th&gt;
&lt;th&gt;c5000-o500&lt;/th&gt;
&lt;th&gt;whole-file&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;all-minilm&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.733&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;0.667&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;0.567&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bge-m3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.833&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.767&lt;/td&gt;
&lt;td&gt;0.767&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;td&gt;0.733&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;granite-embedding&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.800&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.767&lt;/td&gt;
&lt;td&gt;0.767&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;td&gt;0.433&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.900&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;0.733&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;td&gt;0.433&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic-embed-text&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.733&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.667&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;td&gt;0.133&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3-embedding:0.6b&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.933&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;0.900&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;snowflake-arctic-embed2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.800&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.800&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.800&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.733&lt;/td&gt;
&lt;td&gt;0.733&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EmbeddingsGigaR&lt;/td&gt;
&lt;td&gt;0.767&lt;/td&gt;
&lt;td&gt;0.767&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.800&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.789&lt;/td&gt;
&lt;td&gt;0.167&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GigaEmbeddings-3B&lt;/td&gt;
&lt;td&gt;0.767&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.867&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;0.767&lt;/td&gt;
&lt;td&gt;0.300&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;codestral-embed&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.900&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.867&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;0.467&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mistral-embed-2312&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.900&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;0.400&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-large&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.900&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;0.867&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;0.433&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-small&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.867&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.867&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.867&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.867&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.433&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;voyage-4-large&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.900&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;0.867&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;0.867&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;voyage-code-2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.933&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.867&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;0.533&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;voyage-code-3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.900&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.867&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.900&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Smaller Fragments Helped Most Local Models
&lt;/h3&gt;

&lt;p&gt;For five of the seven local models, &lt;code&gt;c500-o100&lt;/code&gt; alone produced the highest&lt;br&gt;
observed result; for a sixth, &lt;code&gt;snowflake-arctic-embed2&lt;/code&gt;, it tied with&lt;br&gt;
&lt;code&gt;c1500-o200&lt;/code&gt; and &lt;code&gt;c3000-o300&lt;/code&gt;. One possible explanation is that a small&lt;br&gt;
fragment contains less unrelated code, so its embedding can describe a local&lt;br&gt;
implementation more precisely, while BM25 benefits from matching specific terms.&lt;/p&gt;

&lt;p&gt;The experiment does not establish this mechanism directly. Doing so would&lt;br&gt;
require inspecting retrieved fragments and comparing hybrid and vector-only&lt;br&gt;
search at every chunk size.&lt;/p&gt;
&lt;h3&gt;
  
  
  Some Models Benefit from Larger Fragments
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;qwen3-embedding:0.6b&lt;/code&gt; achieved its highest result with &lt;code&gt;c3000-o300&lt;/code&gt; and still&lt;br&gt;
reached &lt;code&gt;0.900&lt;/code&gt; when indexing whole files. Unlike most local models, it retained&lt;br&gt;
quality on larger fragments.&lt;/p&gt;

&lt;p&gt;A possible explanation is the model's ability to process longer context. A&lt;br&gt;
larger fragment preserves relationships between methods and their surrounding&lt;br&gt;
class that smaller fragments may lose. A similar pattern appeared for&lt;br&gt;
&lt;code&gt;mistral-embed-2312&lt;/code&gt;, &lt;code&gt;EmbeddingsGigaR&lt;/code&gt;, and partly for &lt;code&gt;voyage-code-3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This remains a hypothesis: the experiment measured retrieval outcomes, not the&lt;br&gt;
internal cause of each model's behavior.&lt;/p&gt;
&lt;h3&gt;
  
  
  Whole-File Indexing Is the Riskiest Choice
&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;whole-file&lt;/code&gt;, results ranged from &lt;code&gt;0.133&lt;/code&gt; to &lt;code&gt;0.900&lt;/code&gt;. The approach&lt;br&gt;
remained viable for &lt;code&gt;qwen3-embedding:0.6b&lt;/code&gt; (&lt;code&gt;0.900&lt;/code&gt;), &lt;code&gt;voyage-4-large&lt;/code&gt; (&lt;code&gt;0.867&lt;/code&gt;),&lt;br&gt;
and &lt;code&gt;voyage-code-3&lt;/code&gt; (&lt;code&gt;0.833&lt;/code&gt;), but quality dropped sharply for&lt;br&gt;
&lt;code&gt;nomic-embed-text&lt;/code&gt; (&lt;code&gt;0.133&lt;/code&gt;) and &lt;code&gt;EmbeddingsGigaR&lt;/code&gt; (&lt;code&gt;0.167&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;A likely explanation is context-window limits and truncation of long files.&lt;br&gt;
Because I did not directly measure truncation by provider tokenizers, this&lt;br&gt;
also remains a hypothesis.&lt;/p&gt;

&lt;p&gt;The matrix does not reveal a universally best fragment size. Instead, it shows&lt;br&gt;
three kinds of behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;models that benefit substantially from small fragments;&lt;/li&gt;
&lt;li&gt;models that require more surrounding context;&lt;/li&gt;
&lt;li&gt;models that remain stable across chunking configurations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Chunking should therefore be selected together with the embedding model. When&lt;br&gt;
tuning time is limited, &lt;code&gt;c500-o100&lt;/code&gt; is a reasonable starting point, but at&lt;br&gt;
least one larger alternative should also be tested, and &lt;code&gt;whole-file&lt;/code&gt; should&lt;br&gt;
not be used without separate validation.&lt;/p&gt;
&lt;h2&gt;
  
  
  4. The Effect of Retrieval Mode
&lt;/h2&gt;

&lt;p&gt;After choosing how to split the code, the next question is how to retrieve the&lt;br&gt;
relevant fragments. The experiment compared three modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;BM25_ONLY&lt;/code&gt; matches words in the query against words in the code;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;VECTOR_ONLY&lt;/code&gt; compares semantic similarity between embeddings;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;HYBRID_RRF&lt;/code&gt; combines the rank positions from BM25 and vector search.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The retrieval-mode comparison used &lt;code&gt;c1500-o200&lt;/code&gt;. In an earlier experiment, the&lt;br&gt;
combination &lt;code&gt;c1500-o200 + HYBRID_RRF&lt;/code&gt; produced the strongest result available&lt;br&gt;
at the time and became the control configuration for later runs.&lt;/p&gt;

&lt;p&gt;The subsequent chunking matrix showed that there is no universally optimal&lt;br&gt;
fragment size. Keeping &lt;code&gt;c1500-o200&lt;/code&gt;, however, allowed retrieval modes to be&lt;br&gt;
compared under identical conditions without mixing their effect with a&lt;br&gt;
chunking change.&lt;/p&gt;

&lt;p&gt;The full matrix of retrieval modes and chunking configurations was not tested.&lt;br&gt;
The results below therefore describe retrieval-mode behavior only at&lt;br&gt;
&lt;code&gt;c1500-o200&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Every value is &lt;code&gt;recall@10&lt;/code&gt; for natural &lt;code&gt;human&lt;/code&gt; questions. The best mode for&lt;br&gt;
each model is shown in bold.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;BM25_ONLY&lt;/th&gt;
&lt;th&gt;VECTOR_ONLY&lt;/th&gt;
&lt;th&gt;HYBRID_RRF&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No embedding model&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.600&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;lexical baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;all-minilm&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.667&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.700&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bge-m3&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.867&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.767&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;granite-embedding&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.733&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.767&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.833&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.833&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic-embed-text&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.667&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.667&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3-embedding:0.6b&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.800&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.800&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;snowflake-arctic-embed2&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.900&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EmbeddingsGigaR&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.711&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.767&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GigaEmbeddings-3B&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.867&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;codestral-embed&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.967&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.900&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mistral-embed-2312&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.900&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-large&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.867&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-small&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.867&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;voyage-4-large&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.878&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;voyage-code-2&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.933&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.867&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;voyage-code-3&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.933&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.867&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Commercial-model values are averaged across three repeated runs, so some&lt;br&gt;
values are not multiples of one question out of thirty.&lt;/p&gt;
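
&lt;p&gt;The arithmetic behind this note can be checked directly. The sketch assumes that each averaged value equals the total number of hits over 3 × 30 = 90 question-run outcomes; the variable names are illustrative.&lt;/p&gt;

```python
# Averaged values from the table that are not multiples of 1/30.
averaged_values = (0.711, 0.878)

# Three runs of thirty questions give 90 question-run outcomes in total.
total_hits = {value: round(value * 90) for value in averaged_values}

for value, hits in total_hits.items():
    # A remainder other than 0 means the value is not a multiple of 1/30.
    print(value, hits, round(hits / 90, 3), hits % 3)
```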
&lt;h3&gt;
  
  
  Hybrid Search Was Not a Universal Improvement
&lt;/h3&gt;

&lt;p&gt;Adding BM25 to vector search helped two local and three commercial models and&lt;br&gt;
made no difference for three local models. In the remaining eight cases, two&lt;br&gt;
local and six commercial models, hybrid retrieval reduced &lt;code&gt;recall@10&lt;/code&gt;.&lt;/p&gt;
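
&lt;p&gt;These counts can be reproduced from the table. The snippet below copies the &lt;code&gt;VECTOR_ONLY&lt;/code&gt; and &lt;code&gt;HYBRID_RRF&lt;/code&gt; columns and groups the models by the sign of the difference; the variable names are illustrative.&lt;/p&gt;

```python
# (VECTOR_ONLY, HYBRID_RRF) recall@10 at c1500-o200, copied from the table above.
scores = {
    "all-minilm": (0.667, 0.700),
    "bge-m3": (0.867, 0.767),
    "granite-embedding": (0.733, 0.767),
    "mxbai-embed-large": (0.833, 0.833),
    "nomic-embed-text": (0.667, 0.667),
    "qwen3-embedding:0.6b": (0.800, 0.800),
    "snowflake-arctic-embed2": (0.900, 0.800),
    "EmbeddingsGigaR": (0.711, 0.767),
    "GigaEmbeddings-3B": (0.833, 0.867),
    "codestral-embed": (0.967, 0.900),
    "mistral-embed-2312": (0.900, 0.800),
    "text-embedding-3-large": (0.867, 0.833),
    "text-embedding-3-small": (0.833, 0.867),
    "voyage-4-large": (0.878, 0.833),
    "voyage-code-2": (0.933, 0.867),
    "voyage-code-3": (0.933, 0.867),
}

helped = [m for m, (vec, hyb) in scores.items() if hyb > vec]
unchanged = [m for m, (vec, hyb) in scores.items() if hyb == vec]
hurt = [m for m, (vec, hyb) in scores.items() if vec > hyb]

print(len(helped), len(unchanged), len(hurt))  # 5 3 8
```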

&lt;p&gt;Among local models, the clearest differences appeared for &lt;code&gt;bge-m3&lt;/code&gt; and&lt;br&gt;
&lt;code&gt;snowflake-arctic-embed2&lt;/code&gt;: vector-only search outperformed hybrid search by&lt;br&gt;
&lt;code&gt;0.100&lt;/code&gt;. Among commercial models, &lt;code&gt;mistral-embed-2312&lt;/code&gt; showed the same&lt;br&gt;
gap.&lt;/p&gt;

&lt;p&gt;One possible explanation is that BM25 helps when the correct file contains&lt;br&gt;
query terms missed by vector search. It can also promote lexically similar but&lt;br&gt;
semantically incorrect files and weaken an already strong vector ranking. The&lt;br&gt;
experiment did not test this mechanism directly.&lt;/p&gt;
&lt;h3&gt;
  
  
  BM25 Remains a Useful Baseline
&lt;/h3&gt;

&lt;p&gt;For natural questions, BM25 achieved &lt;code&gt;recall@10 = 0.600&lt;/code&gt;, below every&lt;br&gt;
embedding-based result in the table above. Its result, however, depended&lt;br&gt;
strongly on query language.&lt;/p&gt;

&lt;p&gt;For queries composed of technical terms and keywords, BM25 reached&lt;br&gt;
&lt;code&gt;0.833–0.867&lt;/code&gt;. With inaccurate terminology, it fell to &lt;code&gt;0.400&lt;/code&gt;. Lexical search&lt;br&gt;
works well when the developer already knows the names of relevant entities,&lt;br&gt;
but it is less effective when system behavior is described in the developer's&lt;br&gt;
own words.&lt;/p&gt;

&lt;p&gt;The choice of retrieval mode, like the choice of chunking, depends on the&lt;br&gt;
embedding model. Hybrid retrieval cannot be assumed to improve vector search:&lt;br&gt;
it helped some models, left some unchanged, and reduced the results of others.&lt;/p&gt;

&lt;p&gt;A practical evaluation should compare at least &lt;code&gt;VECTOR_ONLY&lt;/code&gt; and &lt;code&gt;HYBRID_RRF&lt;/code&gt;&lt;br&gt;
on the selected model and representative queries. BM25 remains both a useful&lt;br&gt;
control point and a standalone option for precise technical searches.&lt;/p&gt;
&lt;h2&gt;
  
  
  5. The Effect of Query Phrasing
&lt;/h2&gt;

&lt;p&gt;The same question about code can be expressed in different ways. A developer&lt;br&gt;
may describe system behavior in natural language, list known technical terms,&lt;br&gt;
or use a plausible but incorrect name for a component.&lt;/p&gt;

&lt;p&gt;To test retrieval robustness under these changes, each question was represented&lt;br&gt;
in several forms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;human&lt;/code&gt; — a natural developer question;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ai_optimized&lt;/code&gt; — a detailed query using precise technical terminology;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;keyword&lt;/code&gt; — a short list of keywords;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;wrong_terminology&lt;/code&gt; — the original intent with one controlled terminology error;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cross_module&lt;/code&gt; — a question connecting multiple system components.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The construction rules for these variants are described in&lt;br&gt;
&lt;a href="https://dev.to/miftakhov/a-cognitive-benchmark-for-code-rag-retrieval-part-1-methodology-3m7l"&gt;Part 1 of the study&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This comparison fixed chunking at &lt;code&gt;c1500-o200&lt;/code&gt; and used &lt;code&gt;HYBRID_RRF&lt;/code&gt;. Every&lt;br&gt;
value is &lt;code&gt;recall@10&lt;/code&gt;. The &lt;code&gt;cross_module&lt;/code&gt; variant existed for only ten&lt;br&gt;
applicable questions, while the other results were calculated across all&lt;br&gt;
thirty.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;human&lt;/th&gt;
&lt;th&gt;ai_optimized&lt;/th&gt;
&lt;th&gt;keyword&lt;/th&gt;
&lt;th&gt;wrong_terminology&lt;/th&gt;
&lt;th&gt;cross_module&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;BM25 without an embedding model&lt;/td&gt;
&lt;td&gt;0.600&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;0.867&lt;/td&gt;
&lt;td&gt;0.400&lt;/td&gt;
&lt;td&gt;0.600&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;all-minilm&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;0.933&lt;/td&gt;
&lt;td&gt;0.933&lt;/td&gt;
&lt;td&gt;0.433&lt;/td&gt;
&lt;td&gt;0.600&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bge-m3&lt;/td&gt;
&lt;td&gt;0.767&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;granite-embedding&lt;/td&gt;
&lt;td&gt;0.767&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.967&lt;/td&gt;
&lt;td&gt;0.567&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mxbai-embed-large&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;0.967&lt;/td&gt;
&lt;td&gt;0.867&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;nomic-embed-text&lt;/td&gt;
&lt;td&gt;0.667&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;0.733&lt;/td&gt;
&lt;td&gt;0.467&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3-embedding:0.6b&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.900&lt;/td&gt;
&lt;td&gt;0.600&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;snowflake-arctic-embed2&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EmbeddingsGigaR&lt;/td&gt;
&lt;td&gt;0.767&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.900&lt;/td&gt;
&lt;td&gt;0.600&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GigaEmbeddings-3B&lt;/td&gt;
&lt;td&gt;0.867&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;codestral-embed&lt;/td&gt;
&lt;td&gt;0.900&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.667&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;mistral-embed-2312&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.633&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-large&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.933&lt;/td&gt;
&lt;td&gt;0.600&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;text-embedding-3-small&lt;/td&gt;
&lt;td&gt;0.867&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.967&lt;/td&gt;
&lt;td&gt;0.567&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;voyage-4-large&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.733&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;voyage-code-2&lt;/td&gt;
&lt;td&gt;0.867&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;voyage-code-3&lt;/td&gt;
&lt;td&gt;0.867&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;1.000&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Precise Technical Queries Make Retrieval Easier
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;ai_optimized&lt;/code&gt; variant contains class names, component names, and&lt;br&gt;
operations already present in the code. On these queries, all nine commercial&lt;br&gt;
and four of the seven local models reached &lt;code&gt;recall@10 = 1.000&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This does not mean that code retrieval is solved. It shows that Code-RAG works&lt;br&gt;
far better when the user already knows the terminology and approximate&lt;br&gt;
location of the answer. In practice, retrieval is often needed precisely&lt;br&gt;
because that knowledge is missing.&lt;/p&gt;

&lt;p&gt;Short &lt;code&gt;keyword&lt;/code&gt; queries also performed well. Even BM25 reached &lt;code&gt;0.867&lt;/code&gt;, because&lt;br&gt;
the keywords often matched names and terms in the source code directly.&lt;/p&gt;
&lt;h3&gt;
  
  
  Inaccurate Terminology Hurts Every Model
&lt;/h3&gt;

&lt;p&gt;Replacing one term with a plausible but incorrect alternative reduced the&lt;br&gt;
result of every tested model. Among commercial models, the drop relative to&lt;br&gt;
&lt;code&gt;human&lt;/code&gt; ranged from &lt;code&gt;0.100&lt;/code&gt; for &lt;code&gt;voyage-4-large&lt;/code&gt; to &lt;code&gt;0.300&lt;/code&gt; for&lt;br&gt;
&lt;code&gt;text-embedding-3-small&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The model ranked first on natural questions was not the most robust to&lt;br&gt;
terminology distortion. &lt;code&gt;codestral-embed&lt;/code&gt; fell from &lt;code&gt;0.900&lt;/code&gt; to &lt;code&gt;0.667&lt;/code&gt;, while&lt;br&gt;
&lt;code&gt;voyage-4-large&lt;/code&gt; fell from &lt;code&gt;0.833&lt;/code&gt; to &lt;code&gt;0.733&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Code specialization also failed to predict robustness. The smallest and&lt;br&gt;
largest observed drops among commercial models both belonged to&lt;br&gt;
general-purpose models.&lt;/p&gt;
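
&lt;p&gt;The drops quoted in this section can be rechecked with a few lines of Python.&lt;br&gt;
This is a minimal sketch, not part of the benchmark harness. The names&lt;br&gt;
&lt;code&gt;SCORES&lt;/code&gt; and &lt;code&gt;terminology_drops&lt;/code&gt; are illustrative, the values are copied by hand&lt;br&gt;
for a subset of the commercial models, and the sketch assumes that the first and&lt;br&gt;
fourth numeric columns of the table hold the &lt;code&gt;human&lt;/code&gt; and &lt;code&gt;wrong_terminology&lt;/code&gt;&lt;br&gt;
variants.&lt;/p&gt;

```python
# recall@10 per model as (human, wrong_terminology); values transcribed by hand
# from the article's table and text, so treat them as an illustrative copy.
SCORES = {
    "codestral-embed": (0.900, 0.667),
    "mistral-embed-2312": (0.800, 0.633),
    "text-embedding-3-large": (0.833, 0.600),
    "text-embedding-3-small": (0.867, 0.567),
    "voyage-4-large": (0.833, 0.733),
    "voyage-code-2": (0.867, 0.700),
    "voyage-code-3": (0.867, 0.700),
}


def terminology_drops(scores):
    """Return {model: human - wrong_terminology}, rounded to three decimals."""
    return {name: round(human - wrong, 3) for name, (human, wrong) in scores.items()}


drops = terminology_drops(SCORES)
most_robust = min(drops, key=drops.get)    # smallest drop
least_robust = max(drops, key=drops.get)   # largest drop
print(most_robust, drops[most_robust])
print(least_robust, drops[least_robust])
```

&lt;p&gt;Sorting models by this drop rather than by their &lt;code&gt;human&lt;/code&gt; score gives a different&lt;br&gt;
order, which is the point of the comparison.&lt;/p&gt;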
&lt;h3&gt;
  
  
  Cross-Module Queries Need a Separate Study
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;cross_module&lt;/code&gt; values barely distinguish the embedding models: every model&lt;br&gt;
except &lt;code&gt;all-minilm&lt;/code&gt; received &lt;code&gt;0.700&lt;/code&gt;, while &lt;code&gt;all-minilm&lt;/code&gt; received &lt;code&gt;0.600&lt;/code&gt;.&lt;br&gt;
This variant was defined for only ten of the thirty questions, so the result&lt;br&gt;
cannot be interpreted as evidence of equal robustness.&lt;/p&gt;

&lt;p&gt;A meaningful comparison would require a separate question set focused on&lt;br&gt;
relationships between modules and containing more examples of this type.&lt;/p&gt;

&lt;p&gt;Query phrasing is another parameter of the retrieval pipeline. Precise&lt;br&gt;
terminology can bring almost every model close to the maximum result, while a&lt;br&gt;
small terminology error can reduce quality substantially.&lt;/p&gt;

&lt;p&gt;Model selection should therefore account for where queries come from. A system&lt;br&gt;
for developers familiar with the codebase and a system for new team members or&lt;br&gt;
non-technical users may require different configurations.&lt;/p&gt;
&lt;h2&gt;
  
  
  6. Model Comparison Results
&lt;/h2&gt;

&lt;p&gt;For the baseline comparison, every model ran under the same conditions:&lt;br&gt;
natural &lt;code&gt;human&lt;/code&gt; questions, &lt;code&gt;HYBRID_RRF&lt;/code&gt;, and &lt;code&gt;c1500-o200&lt;/code&gt; chunking.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rank&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Model types in row&lt;/th&gt;
&lt;th&gt;Recall@10&lt;/th&gt;
&lt;th&gt;95% CI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;mistral/codestral-embed&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;td&gt;0.900&lt;/td&gt;
&lt;td&gt;[0.800–1.000]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2–5&lt;/td&gt;
&lt;td&gt;GigaEmbeddings-3B, text-embedding-3-small, voyage-code-2, voyage-code-3&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;td&gt;0.867&lt;/td&gt;
&lt;td&gt;[0.733–0.967]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6–8&lt;/td&gt;
&lt;td&gt;mxbai-embed-large, text-embedding-3-large, voyage-4-large&lt;/td&gt;
&lt;td&gt;local and commercial&lt;/td&gt;
&lt;td&gt;0.833&lt;/td&gt;
&lt;td&gt;[0.700–0.967]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9–11&lt;/td&gt;
&lt;td&gt;qwen3-embedding:0.6b, snowflake-arctic-embed2&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;[0.633–0.933]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9–11&lt;/td&gt;
&lt;td&gt;mistral-embed-2312&lt;/td&gt;
&lt;td&gt;commercial&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;[0.666–0.933]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12–14&lt;/td&gt;
&lt;td&gt;bge-m3, granite-embedding, EmbeddingsGigaR&lt;/td&gt;
&lt;td&gt;local and commercial&lt;/td&gt;
&lt;td&gt;0.767&lt;/td&gt;
&lt;td&gt;[0.600–0.900]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;all-minilm&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;[0.533–0.867]&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;nomic-embed-text&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;td&gt;0.667&lt;/td&gt;
&lt;td&gt;[0.500–0.833]&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At first glance, this looks like a conventional ranking: a specialized&lt;br&gt;
commercial model takes first place, and the remaining results gradually fall&lt;br&gt;
from &lt;code&gt;0.867&lt;/code&gt; to &lt;code&gt;0.667&lt;/code&gt;. With thirty questions, however, a difference of&lt;br&gt;
&lt;code&gt;0.033&lt;/code&gt; (1/30) corresponds to a single question.&lt;/p&gt;

&lt;p&gt;One question separates &lt;code&gt;codestral-embed&lt;/code&gt; from the next four models. Those four&lt;br&gt;
retrieved the correct files for the same 26 questions out of 30, while the&lt;br&gt;
leader succeeded on one additional question (27 of 30). A paired bootstrap&lt;br&gt;
analysis showed that the confidence interval for every pairwise difference&lt;br&gt;
among the top five models included zero. The available data is therefore&lt;br&gt;
insufficient to treat their order as stable.&lt;/p&gt;
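
&lt;p&gt;To illustrate why a one-question lead does not survive a paired bootstrap, here&lt;br&gt;
is a minimal sketch. It is not the repository's &lt;code&gt;scripts/bootstrap_cis.py&lt;/code&gt;: the&lt;br&gt;
function name &lt;code&gt;paired_bootstrap_ci&lt;/code&gt; and the two hit vectors are hypothetical, built&lt;br&gt;
only to match the 27-versus-26 situation described above. Questions are resampled&lt;br&gt;
jointly for both models, and a percentile interval is taken over the resampled&lt;br&gt;
differences.&lt;/p&gt;

```python
import random


def paired_bootstrap_ci(hits_a, hits_b, samples=2000, seed=42, alpha=0.05):
    """Percentile CI for mean(hits_a) - mean(hits_b), resampling questions jointly."""
    if len(hits_a) != len(hits_b):
        raise ValueError("both models must be scored on the same questions")
    rng = random.Random(seed)
    n = len(hits_a)
    diffs = []
    for _ in range(samples):
        # Draw question indices with replacement and reuse them for both models.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(hits_a[i] - hits_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * (samples - 1))]
    upper = diffs[int((1 - alpha / 2) * (samples - 1))]
    return lower, upper


# Hypothetical per-question hits: the rival solves 26 of 30 questions,
# the leader solves the same 26 plus one more.
rival = [1] * 26 + [0] * 4
leader = [1] * 27 + [0] * 3
low, high = paired_bootstrap_ci(leader, rival)
print(low, high)
```

&lt;p&gt;Because the two vectors differ on a single question, more than a third of the&lt;br&gt;
resamples never draw that question at all, so the lower bound of the interval is&lt;br&gt;
zero.&lt;/p&gt;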

&lt;p&gt;The separation between local and commercial models was also less pronounced&lt;br&gt;
than expected. Local &lt;code&gt;mxbai-embed-large&lt;/code&gt; achieved &lt;code&gt;0.833&lt;/code&gt;, two correct answers&lt;br&gt;
behind the leader out of thirty, and its confidence interval overlaps those of&lt;br&gt;
commercial models.&lt;/p&gt;

&lt;p&gt;Larger and more expensive models did not always produce better results.&lt;br&gt;
&lt;code&gt;text-embedding-3-large&lt;/code&gt; achieved &lt;code&gt;0.833&lt;/code&gt;, while the cheaper&lt;br&gt;
&lt;code&gt;text-embedding-3-small&lt;/code&gt; reached &lt;code&gt;0.867&lt;/code&gt;. The compact 62 MB&lt;br&gt;
&lt;code&gt;granite-embedding&lt;/code&gt; tied the 1.2 GB &lt;code&gt;bge-m3&lt;/code&gt; at &lt;code&gt;0.767&lt;/code&gt;. Two generations of&lt;br&gt;
specialized Voyage models, &lt;code&gt;voyage-code-2&lt;/code&gt; and &lt;code&gt;voyage-code-3&lt;/code&gt;, also completed&lt;br&gt;
the baseline comparison with the same result of &lt;code&gt;0.867&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;These observations do not prove that the models are equal: thirty questions&lt;br&gt;
are insufficient for confidently comparing small differences. They do show&lt;br&gt;
why an embedding-model ranking is meaningful only together with its&lt;br&gt;
measurement conditions. Changing chunking, retrieval mode, or query phrasing&lt;br&gt;
can alter both the metric and the order of models in the table.&lt;/p&gt;
&lt;h2&gt;
  
  
  7. Practical Recommendations
&lt;/h2&gt;

&lt;p&gt;The experiment does not identify one configuration suitable for every&lt;br&gt;
Code-RAG project. The decision depends on the queries the system will receive,&lt;br&gt;
where it will run, and how much time is available for tuning.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Candidate&lt;/th&gt;
&lt;th&gt;Evidence from this study&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Highest observed &lt;code&gt;recall@10&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;codestral-embed&lt;/code&gt; + &lt;code&gt;c1500-o200&lt;/code&gt; + &lt;code&gt;VECTOR_ONLY&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Highest point estimate among completed runs: &lt;code&gt;0.967&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No commercial APIs&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;qwen3-embedding:0.6b&lt;/code&gt; + &lt;code&gt;c3000-o300&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Highest tuned local-model result: &lt;code&gt;0.933&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Minimal initial tuning&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;voyage-4-large&lt;/code&gt; or &lt;code&gt;voyage-code-3&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Smallest observed range across chunking configurations: &lt;code&gt;0.067&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Restricted memory&lt;/td&gt;
&lt;td&gt;&lt;code&gt;granite-embedding&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;A 62 MB model with the same baseline point estimate, &lt;code&gt;0.767&lt;/code&gt;, as the 1.2 GB &lt;code&gt;bge-m3&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precise technical queries&lt;/td&gt;
&lt;td&gt;BM25 as standalone search or a baseline&lt;/td&gt;
&lt;td&gt;BM25 reached &lt;code&gt;recall@10 = 0.867&lt;/code&gt; on &lt;code&gt;keyword&lt;/code&gt; queries without embedding infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inaccurate user queries&lt;/td&gt;
&lt;td&gt;Vector or hybrid retrieval after testing the chosen model&lt;/td&gt;
&lt;td&gt;The advantage of semantic retrieval over BM25 was most visible with inaccurate terminology&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These candidates reflect the best observed results inside the experiment, not&lt;br&gt;
universal production configurations. For example, &lt;code&gt;codestral-embed&lt;/code&gt; produced&lt;br&gt;
the highest &lt;code&gt;recall@10&lt;/code&gt;, but its advantage was measured at one chunking&lt;br&gt;
configuration, and statistical superiority over nearby models was not&lt;br&gt;
established. Local models avoid API charges but move cost into hardware and&lt;br&gt;
operations.&lt;/p&gt;

&lt;p&gt;Practical Code-RAG tuning should begin with a description of future queries,&lt;br&gt;
not a large model leaderboard. If the system is used by developers familiar&lt;br&gt;
with project terminology, BM25 may be a strong starting point. If questions&lt;br&gt;
come from new team members or users describing behavior in their own words,&lt;br&gt;
semantic retrieval becomes more important.&lt;/p&gt;

&lt;p&gt;The next step is to choose a short list of models that meet cost, memory, and&lt;br&gt;
deployment constraints. For each candidate, test several fragment sizes, then&lt;br&gt;
compare &lt;code&gt;VECTOR_ONLY&lt;/code&gt; and &lt;code&gt;HYBRID_RRF&lt;/code&gt;. A chunking or retrieval-mode choice&lt;br&gt;
should not be transferred from one model to another without retesting.&lt;/p&gt;
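
&lt;p&gt;The mode name &lt;code&gt;HYBRID_RRF&lt;/code&gt; suggests reciprocal rank fusion of a lexical and a&lt;br&gt;
vector ranking. The sketch below assumes that reading and shows the general&lt;br&gt;
technique, not the benchmark's implementation. The function name &lt;code&gt;rrf_fuse&lt;/code&gt;, the&lt;br&gt;
constant &lt;code&gt;k=60&lt;/code&gt; (a common default, not a value taken from this study), and the two&lt;br&gt;
file lists are all illustrative.&lt;/p&gt;

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists: each list adds 1 / (k + rank) to a file's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by name to keep output stable.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Illustrative rankings for one query (not measured results).
vector_ranking = ["SocketServer.scala", "RequestChannel.scala", "KafkaApis.scala"]
bm25_ranking = ["SocketServer.scala", "LogCleaner.scala", "RequestChannel.scala"]

fused = rrf_fuse([vector_ranking, bm25_ranking])
print(fused)
```

&lt;p&gt;A file that both retrievers rank highly rises to the top, while a file found by&lt;br&gt;
only one retriever keeps a smaller score. This is one reason a retrieval-mode&lt;br&gt;
choice can interact with the embedding model.&lt;/p&gt;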

&lt;p&gt;The final comparison should include not only convenient technical queries, but&lt;br&gt;
also natural phrasing, inaccurate terminology, and labels for plausible but&lt;br&gt;
incorrect files. Results should be retained per question so that an apparent&lt;br&gt;
advantage can be traced to a stable pattern rather than a few favorable&lt;br&gt;
examples.&lt;/p&gt;

&lt;p&gt;In practice, selecting a Code-RAG configuration becomes a process of narrowing&lt;br&gt;
the search space:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;identify realistic query types and system constraints;&lt;/li&gt;
&lt;li&gt;measure BM25 and one accessible embedding model as starting points;&lt;/li&gt;
&lt;li&gt;select a short list of candidate models;&lt;/li&gt;
&lt;li&gt;test several chunking configurations for each candidate;&lt;/li&gt;
&lt;li&gt;compare vector-only and hybrid retrieval;&lt;/li&gt;
&lt;li&gt;repeat the evaluation across different query phrasings;&lt;/li&gt;
&lt;li&gt;test the statistical stability of the resulting ranking.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  8. Conclusion
&lt;/h2&gt;

&lt;p&gt;The central result of this study is that an embedding model cannot be evaluated&lt;br&gt;
independently from the retrieval pipeline around it. Each model has its own&lt;br&gt;
effective combination of chunking, retrieval mode, and query phrasing.&lt;/p&gt;

&lt;p&gt;These parameters are connected. Fragment size determines how much code enters&lt;br&gt;
an embedding. Retrieval mode sets the balance between semantic similarity and&lt;br&gt;
exact terminology. Query phrasing determines how easily the system can connect&lt;br&gt;
a developer's intent to the vocabulary of the source code.&lt;/p&gt;

&lt;p&gt;There is therefore no universal ranking of Code-RAG models. Models can only be&lt;br&gt;
compared under explicit conditions: on a particular codebase, with a selected&lt;br&gt;
chunking strategy and retrieval mode, and for a known distribution of user&lt;br&gt;
queries.&lt;/p&gt;

&lt;p&gt;The practical question is not "Which model is best?" but "Which configuration&lt;br&gt;
best solves the tasks of this project's users?" Answering it requires joint&lt;br&gt;
tuning of the pipeline and evaluation on a project-specific question set.&lt;/p&gt;

&lt;p&gt;This study used one polyglot project and a small gold set, so its selected&lt;br&gt;
configurations should not be transferred to other codebases without retesting.&lt;br&gt;
Part 3 will describe the reproducible benchmark harness used to run these&lt;br&gt;
comparisons.&lt;/p&gt;
&lt;h2&gt;
  
  
  Reproducibility
&lt;/h2&gt;

&lt;p&gt;The raw results are published with the project. The main tables in this article&lt;br&gt;
can be verified using these artifacts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;results/E003-full/all.parquet
results/E007-commercial-chunking-merged/all.parquet
results/E004-vector/all.parquet
results/E005-production/all.parquet
results/E006-production-merged/all.parquet
results/E007-commercial-vector-merged/all.parquet
results/E004-bm25/all.parquet
results/forest_plot_data.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The confidence intervals for the baseline ranking can be recalculated from the&lt;br&gt;
source CSV files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Daeryss/karta-rag-map
&lt;span class="nb"&gt;cd &lt;/span&gt;karta-rag-map
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
.venv/bin/pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; scripts/requirements.txt
.venv/bin/python scripts/bootstrap_cis.py &lt;span class="se"&gt;\&lt;/span&gt;
  results/E005-production/all.csv &lt;span class="se"&gt;\&lt;/span&gt;
  results/E006-production-merged/all.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script fixes the baseline conditions at &lt;code&gt;k=10&lt;/code&gt;, the &lt;code&gt;human&lt;/code&gt; query variant,&lt;br&gt;
2,000 bootstrap samples, and seed &lt;code&gt;42&lt;/code&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;A Cognitive Benchmark for Code-RAG Retrieval · Part 2 of 3 · Previous:&lt;br&gt;
&lt;a href="https://dev.to/miftakhov/a-cognitive-benchmark-for-code-rag-retrieval-part-1-methodology-3m7l"&gt;Part 1 — Methodology&lt;/a&gt; · Next: Part 3 —&lt;br&gt;
Engineering a Reproducible Benchmark&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>java</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>A Cognitive Benchmark for Code-RAG Retrieval: Part 1 — Methodology</title>
      <dc:creator>Ilias Miftakhov</dc:creator>
      <pubDate>Tue, 09 Jun 2026 19:37:51 +0000</pubDate>
      <link>https://dev.to/miftakhov/a-cognitive-benchmark-for-code-rag-retrieval-part-1-methodology-3m7l</link>
      <guid>https://dev.to/miftakhov/a-cognitive-benchmark-for-code-rag-retrieval-part-1-methodology-3m7l</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Code-RAG systems promise to help developers navigate large codebases: find the&lt;br&gt;
implementation of a behavior, trace a data flow, or identify the component&lt;br&gt;
responsible for a specific function. But a compelling demo does not tell us how&lt;br&gt;
reliable the retrieval itself is.&lt;/p&gt;

&lt;p&gt;To investigate this, I built a retrieval benchmark on the Apache Kafka 4.0.0&lt;br&gt;
broker core, a real polyglot project containing 697 Java and Scala files. For&lt;br&gt;
each question, I identified the correct files, acceptable supporting files, and&lt;br&gt;
plausible but incorrect alternatives in advance. I then compared which files&lt;br&gt;
Code-RAG retrieved under different embedding models, code chunking strategies,&lt;br&gt;
and retrieval modes.&lt;/p&gt;

&lt;p&gt;The first results looked convincing, but the conclusions changed several times&lt;br&gt;
as the dataset evolved. A single query phrasing concealed retrieval instability,&lt;br&gt;
ordinary recall missed dangerous errors, and the final leaderboard leader beat&lt;br&gt;
the next four models on only one question out of thirty.&lt;/p&gt;

&lt;p&gt;The main outcome of the study was not the selection of a "best" model. It was a&lt;br&gt;
methodology for avoiding the mistake of treating a chance result as a reliable&lt;br&gt;
one.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Why Code-RAG Needs Its Own Benchmark
&lt;/h2&gt;

&lt;p&gt;Code-RAG systems make an appealing promise: a developer asks a question about an&lt;br&gt;
unfamiliar project and receives the relevant parts of the codebase. Such a&lt;br&gt;
system could be used inside an IDE, in a chat assistant, during onboarding,&lt;br&gt;
while investigating incidents, or when preparing a change.&lt;/p&gt;

&lt;p&gt;A typical Code-RAG pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source code
    → split into chunks
    → build a search index
    → developer query
    → retrieve relevant chunks
    → pass the retrieved context to a generative model
    → answer the user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The final answer does not depend on the generative model alone. If retrieval&lt;br&gt;
fails to find the file that actually implements the relevant behavior, the&lt;br&gt;
model will either produce an incomplete answer or reason plausibly from the&lt;br&gt;
wrong context.&lt;/p&gt;

&lt;p&gt;I therefore started with a narrower question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How accurately does Code-RAG find the correct files in a real polyglot&lt;br&gt;
codebase?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An important clarification: this study did not send questions to different&lt;br&gt;
generative LLMs and evaluate their written answers. It sent them to a retrieval&lt;br&gt;
pipeline. An embedding model converted the query and code into vectors, Lucene&lt;br&gt;
returned a ranked list of files, and the benchmark compared that list with&lt;br&gt;
labels prepared in advance.&lt;/p&gt;

&lt;p&gt;This restriction is intentional. It isolates retrieval quality instead of&lt;br&gt;
mixing it with prompt quality, answer generation, and LLM hallucinations.&lt;/p&gt;

&lt;p&gt;I selected the Apache Kafka 4.0.0 broker core as the corpus. It is not a&lt;br&gt;
tutorial example, but a large Java and Scala codebase with packages,&lt;br&gt;
interfaces, implementations, manager classes, and logic distributed across&lt;br&gt;
multiple components. In a project like this, a retrieval system must&lt;br&gt;
distinguish the file that actually owns a behavior from a similarly named file&lt;br&gt;
or an adjacent responsibility.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Defining the Correct Answer
&lt;/h2&gt;

&lt;p&gt;Evaluating retrieval requires more than asking a question and deciding whether&lt;br&gt;
the result looks reasonable. It requires a known ground truth.&lt;/p&gt;

&lt;p&gt;I created a set of questions about the Kafka codebase, which I was familiar&lt;br&gt;
with. Each question described a specific behavior or concept for which the&lt;br&gt;
owner file could be identified manually. For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How does the Kafka broker accept incoming TCP connections?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The correct primary file for this question is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;core/src/main/scala/kafka/network/SocketServer.scala
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;RequestChannel.scala&lt;/code&gt; is nearby. It belongs to the networking flow and helps&lt;br&gt;
pass requests onward, so it looks plausible. However, it is not responsible for&lt;br&gt;
accepting TCP connections. If a system ranks it above &lt;code&gt;SocketServer.scala&lt;/code&gt;, it&lt;br&gt;
has found thematically related code but identified the wrong owner of the&lt;br&gt;
behavior.&lt;/p&gt;

&lt;p&gt;For every run, the retriever returned a ranked list of files. The result could&lt;br&gt;
then be evaluated formally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;recall@k&lt;/code&gt; shows whether the correct file appeared in the first &lt;code&gt;k&lt;/code&gt; results;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MRR&lt;/code&gt; accounts for how highly the first correct result was ranked;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;displacement_rate&lt;/code&gt; shows whether a plausible but incorrect file displaced
the primary answer within the top-k;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rank_gap&lt;/code&gt; measures the distance between correct and adversarial files in the
complete ranked result.&lt;/li&gt;
&lt;/ul&gt;
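
&lt;p&gt;For a single question, the first three metrics can be sketched as follows. This&lt;br&gt;
is a simplified reading with one primary and one adversarial file per question.&lt;br&gt;
The function names are hypothetical, and the &lt;code&gt;displaced&lt;/code&gt; check is only one plausible&lt;br&gt;
interpretation of &lt;code&gt;displacement_rate&lt;/code&gt;; the benchmark's exact definitions may&lt;br&gt;
differ. Averaging each value over all questions gives the reported rates.&lt;/p&gt;

```python
def hit_at_k(ranked, primary, k=10):
    """1.0 if the primary file appears in the first k results, else 0.0."""
    return 1.0 if primary in ranked[:k] else 0.0


def reciprocal_rank(ranked, primary):
    """1 / rank of the primary file (1-based), or 0.0 if it was not retrieved."""
    return 1.0 / (ranked.index(primary) + 1) if primary in ranked else 0.0


def displaced(ranked, primary, adversarial, k=10):
    """True if the adversarial file is in the top k and outranks the primary file."""
    top = ranked[:k]
    if adversarial not in top:
        return False
    return primary not in top or top.index(primary) > top.index(adversarial)


# Illustrative ranking for the TCP-connection question (not a measured result).
ranked = ["RequestChannel.scala", "SocketServer.scala", "KafkaApis.scala"]
print(hit_at_k(ranked, "SocketServer.scala"))          # 1.0
print(reciprocal_rank(ranked, "SocketServer.scala"))   # 0.5
print(displaced(ranked, "SocketServer.scala", "RequestChannel.scala"))  # True
```

&lt;p&gt;In this example &lt;code&gt;recall@10&lt;/code&gt; counts the question as solved, while the reciprocal&lt;br&gt;
rank and the displacement check both register that the wrong owner came first.&lt;/p&gt;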

&lt;p&gt;This turns the subjective judgment that "the result looks reasonable" into a&lt;br&gt;
reproducible comparison. With the same set of questions, I could change the&lt;br&gt;
embedding model, chunking configuration, or retrieval mode and measure how the&lt;br&gt;
quality changed.&lt;/p&gt;

&lt;p&gt;But the first version of the dataset was too simple.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Why the First Gold Queries Had to Be Replaced
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The First Version: Verify That the Pipeline Works
&lt;/h3&gt;

&lt;p&gt;The pilot gold set contained three questions and ten Kafka files. Each question&lt;br&gt;
had one phrasing and a list of expected files. This was enough to exercise the&lt;br&gt;
entire path: load the corpus, build the index, run retrieval, and calculate the&lt;br&gt;
metrics.&lt;/p&gt;

&lt;p&gt;The pilot used &lt;code&gt;nomic-embed-text&lt;/code&gt; with 3,000-character chunks and a&lt;br&gt;
300-character overlap. It achieved a perfect score on every measured metric. I&lt;br&gt;
therefore selected &lt;code&gt;3000/300&lt;/code&gt; as the working configuration.&lt;/p&gt;
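
&lt;p&gt;A &lt;code&gt;3000/300&lt;/code&gt; configuration can be pictured as a sliding character window. The&lt;br&gt;
sketch below assumes plain fixed-size windows measured in characters; the&lt;br&gt;
function name &lt;code&gt;chunk_text&lt;/code&gt; is hypothetical, and the benchmark's actual splitter&lt;br&gt;
may behave differently, for example by respecting line or declaration&lt;br&gt;
boundaries.&lt;/p&gt;

```python
def chunk_text(text, size=3000, overlap=300):
    """Split text into fixed-size character windows that share `overlap` characters."""
    step = size - overlap
    if not (overlap >= 0 and step >= 1):
        raise ValueError("overlap must be non-negative and smaller than size")
    if not text:
        return []
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + size] for start in range(0, last_start, step)]


# A synthetic 7,000-character "file" stands in for real source code.
sample = "".join(chr(97 + i % 26) for i in range(7000))
chunks = chunk_text(sample, size=3000, overlap=300)
print([len(chunk) for chunk in chunks])  # [3000, 3000, 1600]
```

&lt;p&gt;Each window starts 2,700 characters after the previous one, so the last 300&lt;br&gt;
characters of a chunk reappear at the start of the next.&lt;/p&gt;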

&lt;p&gt;The problem was that the benchmark had not confirmed the reliability of the&lt;br&gt;
approach. It had only confirmed that the system could solve three familiar&lt;br&gt;
questions on a small corpus. It left almost no room for failure.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Next Version: More Questions and More Phrasings
&lt;/h3&gt;

&lt;p&gt;After expanding the corpus to 697 files, I prepared 30 questions. Each question&lt;br&gt;
received several paraphrases: canonical human phrasing, passive phrasing, and&lt;br&gt;
descriptions framed around a process, component, or location.&lt;/p&gt;

&lt;p&gt;This version showed that the same intent could produce different results&lt;br&gt;
depending on its wording. However, the variants were still mostly superficial&lt;br&gt;
language transformations. They did not distinguish the different ways in which&lt;br&gt;
a developer actually searches for code: a natural question, a precise&lt;br&gt;
technical query, a set of keywords, or a query using imprecise terminology.&lt;/p&gt;

&lt;p&gt;The labels also still focused mainly on whether the correct file had been&lt;br&gt;
found, without adequately describing plausible errors. For real code&lt;br&gt;
navigation, it is not enough to find &lt;code&gt;SocketServer.scala&lt;/code&gt;; the system must also&lt;br&gt;
avoid confusing it with the adjacent &lt;code&gt;RequestChannel.scala&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Final Version: A Five-Layer Gold-Query Schema
&lt;/h3&gt;

&lt;p&gt;In the repository, the earlier full set is stored as &lt;code&gt;gold-queries.yaml&lt;/code&gt;, while&lt;br&gt;
the final set is available as&lt;br&gt;
&lt;a href="https://github.com/Daeryss/karta-rag-map/blob/main/corpora/kafka-4.0.0/gold-queries-v2.1.yaml" rel="noopener noreferrer"&gt;&lt;code&gt;gold-queries-v2.1.yaml&lt;/code&gt;&lt;/a&gt;.&lt;br&gt;
Their differences can be summarized as follows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Early gold set&lt;/th&gt;
&lt;th&gt;Gold queries v2.1&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Correct answer&lt;/td&gt;
&lt;td&gt;List of expected files&lt;/td&gt;
&lt;td&gt;Primary and secondary files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phrasings&lt;/td&gt;
&lt;td&gt;Generic language paraphrases&lt;/td&gt;
&lt;td&gt;Controlled cognitive modes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Incorrect neighbors&lt;/td&gt;
&lt;td&gt;Simple list of adversarial files&lt;/td&gt;
&lt;td&gt;Type, strength, and source of confusion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rationale&lt;/td&gt;
&lt;td&gt;Short explanation&lt;/td&gt;
&lt;td&gt;Behavior owner, reason, and explicit boundary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Evaluation&lt;/td&gt;
&lt;td&gt;Shared set of metrics&lt;/td&gt;
&lt;td&gt;Relevant metric families for each question&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This led to the &lt;code&gt;gold-queries-v2.1&lt;/code&gt; schema, in which every question is&lt;br&gt;
represented across five independent layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. RETRIEVAL     what counts as the correct answer
2. COGNITIVE     how the user expresses the same query
3. INTERFERENCE  which files look plausible but are incorrect
4. RATIONALE     why the answer is correct and where its boundary lies
5. EVALUATION    which metrics matter for this question
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;retrieval&lt;/code&gt; layer separates primary and supporting files. The primary file&lt;br&gt;
owns the requested behavior; a supporting file helps explain it but does not&lt;br&gt;
replace the answer.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;cognitive&lt;/code&gt; layer contains up to five query forms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;human&lt;/code&gt;: a natural question from a developer;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ai_optimized&lt;/code&gt;: a detailed technical query using precise terminology;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;keyword&lt;/code&gt;: a short set of keywords;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;wrong_terminology&lt;/code&gt;: the same intent with one controlled terminology error;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cross_module&lt;/code&gt;: a question that requires connecting multiple components,
where applicable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;interference&lt;/code&gt; layer lists adversarial nodes: semantically or structurally&lt;br&gt;
close files that are easy to mistake for the answer. Each one records its&lt;br&gt;
relationship to the primary file, the strength of the trap, and the type of&lt;br&gt;
confusion.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;rationale&lt;/code&gt; layer documents the behavior owner, the reason for selecting&lt;br&gt;
it, and an explicit boundary explaining why a neighboring file is not the&lt;br&gt;
answer. This makes the labels reviewable and debatable instead of treating them&lt;br&gt;
as unexplained decisions by the author.&lt;/p&gt;

&lt;p&gt;Finally, the &lt;code&gt;evaluation&lt;/code&gt; layer specifies which metric families are especially&lt;br&gt;
important for a particular question.&lt;/p&gt;
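
&lt;p&gt;One way to picture a five-layer entry is as a small typed record. The sketch&lt;br&gt;
below is not the schema of &lt;code&gt;gold-queries-v2.1.yaml&lt;/code&gt;: every class and field name is&lt;br&gt;
hypothetical, and the &lt;code&gt;strength&lt;/code&gt; and &lt;code&gt;evaluation&lt;/code&gt; values are placeholders. Only the&lt;br&gt;
question text, the primary file, and the adversarial file come from the example&lt;br&gt;
earlier in this article.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Dict, List

# The five query forms named in the cognitive layer.
VARIANTS = ("human", "ai_optimized", "keyword", "wrong_terminology", "cross_module")


@dataclass
class AdversarialNode:
    path: str        # plausible but incorrect file
    relation: str    # relationship to the primary file
    strength: str    # strength of the trap (placeholder scale)
    confusion: str   # type of confusion


@dataclass
class GoldQuery:
    primary: List[str]                   # RETRIEVAL: files that own the behavior
    secondary: List[str]                 # RETRIEVAL: supporting files
    cognitive: Dict[str, str]            # COGNITIVE: variant name to query text
    interference: List[AdversarialNode]  # INTERFERENCE
    rationale: str                       # RATIONALE: owner, reason, boundary
    evaluation: List[str]                # EVALUATION: metric families that matter

    def __post_init__(self):
        unknown = set(self.cognitive) - set(VARIANTS)
        if unknown:
            raise ValueError(f"unknown query variants: {sorted(unknown)}")


query = GoldQuery(
    primary=["core/src/main/scala/kafka/network/SocketServer.scala"],
    secondary=[],
    cognitive={"human": "How does the Kafka broker accept incoming TCP connections?"},
    interference=[
        AdversarialNode(
            path="RequestChannel.scala",
            relation="belongs to the same networking flow",
            strength="placeholder",
            confusion="thematically related, wrong owner",
        )
    ],
    rationale="SocketServer accepts TCP connections; RequestChannel only passes requests onward.",
    evaluation=["recall@k", "displacement_rate"],  # placeholder selection
)
print(sorted(query.cognitive))
```

&lt;p&gt;Keeping the layers as separate fields lets an evaluation script vary the query&lt;br&gt;
form without touching the labels that define the correct answer.&lt;/p&gt;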

&lt;p&gt;This schema changed the subject of the study itself. Instead of testing whether&lt;br&gt;
the system could occasionally find the right file, the benchmark began testing&lt;br&gt;
how consistently it recognized the same intent and distinguished the behavior&lt;br&gt;
owner from plausible neighbors.&lt;/p&gt;
&lt;h2&gt;
  
  
  4. What Paraphrases and Adversarial Nodes Revealed
&lt;/h2&gt;
&lt;h3&gt;
  
  
  One Phrasing Does Not Represent the Whole Question
&lt;/h3&gt;

&lt;p&gt;In an early full-corpus experiment, top-1 accuracy on the human phrasing was&lt;br&gt;
0.233. When a question counted as solved if the system succeeded on any&lt;br&gt;
available phrasing, the result was 0.600. Counting any successful phrasing thus&lt;br&gt;
raised the score by a factor of about 2.6 (0.600 / 0.233).&lt;/p&gt;

&lt;p&gt;This does not mean that we should select the most convenient variant and&lt;br&gt;
publish the best result. On the contrary, a large gap shows that one phrasing&lt;br&gt;
is a poor representation of real retrieval robustness.&lt;/p&gt;

&lt;p&gt;On the final baseline, all nine commercial embedding models reached&lt;br&gt;
&lt;code&gt;quorum-any recall@10 = 1.000&lt;/code&gt;: for every question, at least one phrasing placed&lt;br&gt;
the correct file in the top 10. However, the models handled changes in query&lt;br&gt;
language with different levels of consistency.&lt;/p&gt;
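
&lt;p&gt;The difference between a single-phrasing score and the quorum-any score can be&lt;br&gt;
shown with a small sketch. The function names and the four-question result set&lt;br&gt;
are hypothetical; they only illustrate how the two aggregates are formed from&lt;br&gt;
the same per-question hits.&lt;/p&gt;

```python
def recall_for_variant(results, variant):
    """Mean hit rate over the questions that define the given variant."""
    hits = [question[variant] for question in results if variant in question]
    return sum(hits) / len(hits)


def quorum_any_recall(results):
    """Share of questions where at least one phrasing produced a hit."""
    return sum(1 for question in results if any(question.values())) / len(results)


# Hypothetical top-10 hits per question and phrasing (True = correct file found).
results = [
    {"human": True, "ai_optimized": True, "keyword": True, "wrong_terminology": False},
    {"human": False, "ai_optimized": True, "keyword": True, "wrong_terminology": False},
    {"human": False, "ai_optimized": True, "keyword": False, "wrong_terminology": False},
    {"human": True, "ai_optimized": True, "keyword": True,
     "wrong_terminology": True, "cross_module": True},
]
print(recall_for_variant(results, "human"))  # 0.5
print(quorum_any_recall(results))            # 1.0
```

&lt;p&gt;A quorum-any score of 1.000 therefore says that every question is solvable by&lt;br&gt;
some phrasing, not that any particular phrasing is reliable.&lt;/p&gt;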

&lt;p&gt;The drop between &lt;code&gt;human&lt;/code&gt; and &lt;code&gt;wrong_terminology&lt;/code&gt; ranged from 0.100 to 0.300. I&lt;br&gt;
expected code-specialized models to be more robust, but the data did not&lt;br&gt;
support that expectation: the lowest and highest observed drops both belonged&lt;br&gt;
to general-purpose models.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ai_optimized&lt;/code&gt; variant, deliberately written to be retrieval-friendly,&lt;br&gt;
reached recall@10 = 1.000 for all nine commercial models. In this experiment,&lt;br&gt;
it was useful as an upper bound on solvability but nearly useless for comparing&lt;br&gt;
strong models with one another.&lt;/p&gt;
&lt;h3&gt;
  
  
  Recall Misses Dangerous Almost-Correct Answers
&lt;/h3&gt;

&lt;p&gt;Suppose the correct file ranks fifth while a plausible but incorrect neighbor&lt;br&gt;
ranks first. &lt;code&gt;recall@10&lt;/code&gt; is still 1.0. For the user, however, the first results&lt;br&gt;
matter most: these are the files they will read or pass to a generative model.&lt;/p&gt;

&lt;p&gt;Adversarial labels make this type of error measurable. For the TCP-connection&lt;br&gt;
question, all nine commercial models found &lt;code&gt;SocketServer.scala&lt;/code&gt; in the top 10&lt;br&gt;
under every available phrasing. Ordinary recall could not distinguish them at&lt;br&gt;
all.&lt;/p&gt;

&lt;p&gt;However, broader &lt;code&gt;ai_optimized&lt;/code&gt; queries also brought &lt;code&gt;RequestChannel.scala&lt;/code&gt;&lt;br&gt;
into the results more often. The &lt;code&gt;rank_gap&lt;/code&gt; and &lt;code&gt;displacement_rate&lt;/code&gt; metrics&lt;br&gt;
showed whether the correct order was preserved and whether the neighboring file&lt;br&gt;
displaced the primary answer.&lt;/p&gt;

&lt;p&gt;Paraphrases and adversarial nodes measure different properties. Paraphrases&lt;br&gt;
show whether the system recognizes the same intent when its language changes.&lt;br&gt;
Adversarial nodes show whether it can separate the correct answer from a&lt;br&gt;
thematically close mistake. Recall alone cannot measure both.&lt;/p&gt;
&lt;h2&gt;
  
  
  5. How the Results Changed as the Benchmark Evolved
&lt;/h2&gt;

&lt;p&gt;The methodology changed the practical conclusions several times.&lt;/p&gt;

&lt;p&gt;The first pilot, with ten files and three questions, produced perfect scores and&lt;br&gt;
established &lt;code&gt;3000/300&lt;/code&gt; as the chunking configuration.&lt;/p&gt;

&lt;p&gt;When the same embedding model was evaluated on the full corpus and thirty&lt;br&gt;
questions, &lt;code&gt;3000/300&lt;/code&gt; produced the worst &lt;code&gt;recall@10&lt;/code&gt; among four&lt;br&gt;
configurations: 0.533 (16 of 30) versus 0.600 (18 of 30) for the other three.&lt;br&gt;
&lt;code&gt;1500/200&lt;/code&gt; led on the other quality metrics and became the new recommended&lt;br&gt;
configuration.&lt;/p&gt;

&lt;p&gt;I then evaluated seven local embedding models across five chunking&lt;br&gt;
configurations. Even the replacement recommendation did not hold: five of the&lt;br&gt;
seven models preferred &lt;code&gt;c500-o100&lt;/code&gt; (&lt;code&gt;500/100&lt;/code&gt; in the notation above), which had&lt;br&gt;
not been included in the earlier sweep. For the pilot model, &lt;code&gt;nomic-embed-text&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;1500/200&lt;/code&gt; ranked only third.&lt;/p&gt;

&lt;p&gt;This is not merely a story about choosing the wrong chunk size. It shows how&lt;br&gt;
easily a fortunate combination of a small corpus, a few questions, and one&lt;br&gt;
model can become a universal recommendation.&lt;/p&gt;
&lt;h2&gt;
  
  
  6. Bootstrap Confidence Intervals Changed the Final Conclusion
&lt;/h2&gt;

&lt;p&gt;After the main experiment runs were complete, the top of the leaderboard for&lt;br&gt;
the nine commercial models under the baseline configuration looked convincing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model                                     recall@10
mistral/codestral-embed                   0.900
GigaEmbeddings-3B                         0.867
OpenAI text-embedding-3-small             0.867
voyage-code-2                             0.867
voyage-code-3                             0.867
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On an ordinary chart, Codestral looked like the winner. But with &lt;code&gt;n=30&lt;/code&gt;, the&lt;br&gt;
difference between 0.900 and 0.867 is only one question.&lt;/p&gt;

&lt;p&gt;I calculated 95% percentile bootstrap confidence intervals, using &lt;code&gt;B=2000&lt;/code&gt;&lt;br&gt;
resamples of each model's per-query hit vector:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model                             recall@10   95% CI
codestral-embed                   0.900       [0.800, 1.000]
each of the next four models      0.867       [0.733, 0.967]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
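
&lt;p&gt;For readers who want to see the mechanics, the following is a minimal sketch of a percentile bootstrap over a per-query hit vector, using only the Python standard library. The function names, the fixed seed, and the interpolation rule are my assumptions for illustration; the repository's analysis scripts may be implemented differently, so the bounds it prints need not match the table digit for digit.&lt;/p&gt;

```python
import random


def percentile(sorted_values, q):
    """Linear-interpolation percentile of pre-sorted values, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def bootstrap_ci(hits, b=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of a 0/1 per-query vector.

    Each resample draws len(hits) queries with replacement and
    recomputes the mean; the CI is read off the sorted resample means.
    """
    rng = random.Random(seed)
    n = len(hits)
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(b))
    return (
        percentile(means, 100 * alpha / 2),
        percentile(means, 100 * (1 - alpha / 2)),
    )


# Hypothetical layout: 27 hits and 3 misses out of 30 questions,
# which gives the same point estimate as the top row (0.900).
hits = [1] * 27 + [0] * 3

low, high = bootstrap_ci(hits)
point = sum(hits) / len(hits)
print(f"recall@10 = {point:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

&lt;p&gt;Because each resample mean is a multiple of 1/30, the interval bounds move in steps of about 0.033, and the exact bounds can shift by one step depending on the seed.&lt;/p&gt;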



&lt;p&gt;Overlap between separate confidence intervals is not, by itself, a formal&lt;br&gt;
pairwise test. I therefore also ran a paired bootstrap on the differences&lt;br&gt;
between every pair in the top cluster.&lt;/p&gt;

&lt;p&gt;The four models scoring 0.867 were identical on this metric, question by&lt;br&gt;
question: they found the primary file for the same 26 questions and failed on&lt;br&gt;
the same four. Codestral's entire advantage came from one question, Q30. The&lt;br&gt;
confidence interval for its advantage over each of the four models was&lt;br&gt;
&lt;code&gt;[+0.000, +0.100]&lt;/code&gt;: the lower bound was exactly zero, so the interval does not&lt;br&gt;
exclude a tie.&lt;/p&gt;
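
&lt;p&gt;A paired bootstrap differs from two separate intervals in one respect: both models are scored on the same resampled questions, so only the per-query differences matter. The sketch below shows that idea with the standard library. The vectors are a hypothetical reconstruction from the counts above (26 shared hits, one extra hit for the first model on the last question, three shared misses), and the function names and seed are assumptions rather than the contents of &lt;code&gt;scripts/paired_bootstrap.py&lt;/code&gt;.&lt;/p&gt;

```python
import random


def percentile(sorted_values, q):
    """Linear-interpolation percentile of pre-sorted values, q in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def paired_bootstrap_ci(hits_a, hits_b, b=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(hits_a) - mean(hits_b) on shared queries.

    Resampling the per-query differences is the same as resampling
    query indices once and applying them to both models.
    """
    if len(hits_a) != len(hits_b):
        raise ValueError("both models must be scored on the same queries")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(hits_a, hits_b)]
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(b))
    return (
        percentile(means, 100 * alpha / 2),
        percentile(means, 100 * (1 - alpha / 2)),
    )


# Hypothetical question order: 26 shared hits, 3 shared misses, then Q30.
model_a = [1] * 26 + [0] * 3 + [1]   # 27 of 30, hits Q30
model_b = [1] * 26 + [0] * 3 + [0]   # 26 of 30, misses Q30

low, high = paired_bootstrap_ci(model_a, model_b)
print(f"advantage CI = [{low:+.3f}, {high:+.3f}]")
```

&lt;p&gt;With a single differing question, roughly 36% of resamples contain no copy of Q30 at all, so the lower bound cannot rise above zero no matter how many resamples are drawn.&lt;/p&gt;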

&lt;p&gt;The data does not support the claim that Codestral won. A defensible statement&lt;br&gt;
is weaker:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Codestral had the highest point estimate in this experiment, but with thirty&lt;br&gt;
questions, the paired bootstrap could not distinguish it from the next&lt;br&gt;
cluster of models.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Bootstrap did not change the collected results. It changed the strength of the&lt;br&gt;
claims those results allowed me to publish.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuekffrcpjgxvedg1p9qw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuekffrcpjgxvedg1p9qw.png" alt=" " width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  7. Rules for an Honest AI Benchmark
&lt;/h2&gt;

&lt;p&gt;The study led me to several rules that apply beyond Code-RAG.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, define which stage of the system you are measuring.&lt;/strong&gt; Retrieval&lt;br&gt;
quality, generated-answer quality, and product usefulness are different&lt;br&gt;
research targets. Mixing them into one evaluation makes the source of an error&lt;br&gt;
difficult to identify.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use a known ground truth.&lt;/strong&gt; The correct result must be defined before the&lt;br&gt;
model runs. Otherwise, the benchmark author can easily accept a plausible&lt;br&gt;
answer as a correct one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test the intent, not one fortunate phrasing.&lt;/strong&gt; Users are not required to ask&lt;br&gt;
questions using the words that work best for an embedding model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Label plausible errors.&lt;/strong&gt; In a real codebase, an incorrect result often looks&lt;br&gt;
almost correct. A benchmark should distinguish the behavior owner from a&lt;br&gt;
neighboring interface, manager class, or configuration file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Publish uncertainty alongside the ranking.&lt;/strong&gt; Point estimates are useful for&lt;br&gt;
sorting a table, but they do not always support a claim of superiority.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use the benchmark to challenge your own conclusions.&lt;/strong&gt; A good methodology&lt;br&gt;
should not only confirm hypotheses. It should create conditions in which they&lt;br&gt;
can fail.&lt;/p&gt;
&lt;h2&gt;
  
  
  8. Limitations and Next Step
&lt;/h2&gt;

&lt;p&gt;The study has several clear limitations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thirty questions.&lt;/strong&gt; &lt;code&gt;recall@10&lt;/code&gt; has only 31 possible values, from 0/30 to&lt;br&gt;
30/30, so a single question moves the score by about 0.033. The bootstrap 95%&lt;br&gt;
confidence-interval half-widths in this experiment were approximately 0.10 to&lt;br&gt;
0.17, several times that step. Reliably separating the top cluster requires a&lt;br&gt;
larger question set.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One corpus.&lt;/strong&gt; The study used 697 files from the Kafka 4.0.0 broker core. It&lt;br&gt;
does not establish that the same models and configurations will behave&lt;br&gt;
similarly on another project, language, or architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File-level labels.&lt;/strong&gt; The benchmark evaluates retrieved files rather than&lt;br&gt;
individual classes and methods. A correct file containing an irrelevant chunk&lt;br&gt;
can count as a hit, while the correct method inside a low-ranked file can count&lt;br&gt;
as a miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One annotator.&lt;/strong&gt; I authored all gold queries. The &lt;code&gt;rationale&lt;/code&gt; layer makes the&lt;br&gt;
decisions reviewable, but independent annotation by a second specialist would&lt;br&gt;
strengthen the dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval rather than final-answer quality.&lt;/strong&gt; The study does not show how&lt;br&gt;
well a generative model uses the retrieved context or how useful its final&lt;br&gt;
answer is to a developer.&lt;/p&gt;
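
&lt;p&gt;To put the thirty-question limitation in numbers, here is a rough sketch of how the score step and an approximate interval half-width shrink as the question set grows. It uses the normal (Wald) approximation for a proportion, which is not the bootstrap used in this study, and the set sizes 120 and 480 are hypothetical. The value 0.867 is taken from the leaderboard above.&lt;/p&gt;

```python
import math


def normal_half_width(p, n, z=1.96):
    """Approximate 95% half-width for a proportion p measured on n queries.

    Normal (Wald) approximation; a rough guide, not a bootstrap result.
    """
    return z * math.sqrt(p * (1 - p) / n)


def score_step(n):
    """Smallest possible change in recall@k when one of n queries flips."""
    return 1 / n


# 30 is the size used in the study; 120 and 480 are hypothetical sizes.
for n in (30, 120, 480):
    hw = normal_half_width(0.867, n)
    print(f"n={n:3d}  step={score_step(n):.4f}  approx half-width={hw:.3f}")
```

&lt;p&gt;Under this approximation the half-width is about 0.12 at thirty questions and halves with every fourfold increase in the number of questions.&lt;/p&gt;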

&lt;p&gt;The next article in the series will use the same dataset to investigate the&lt;br&gt;
relative effects of the embedding model, chunking, and retrieval mode. This&lt;br&gt;
time, those comparisons will be interpreted with query robustness, adversarial&lt;br&gt;
errors, and statistical uncertainty in mind.&lt;/p&gt;
&lt;h2&gt;
  
  
  Reproducibility
&lt;/h2&gt;

&lt;p&gt;The repository contains the gold set, raw CSV and Parquet results, benchmark&lt;br&gt;
code, and analysis scripts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Daeryss/karta-rag-map
&lt;span class="nb"&gt;cd &lt;/span&gt;karta-rag-map

&lt;span class="c"&gt;# Inspect the final five-layer schema and 30 gold queries&lt;/span&gt;
&lt;span class="nb"&gt;cat &lt;/span&gt;corpora/kafka-4.0.0/gold-queries-v2.1.yaml

&lt;span class="c"&gt;# Validate the dataset schema&lt;/span&gt;
./gradlew &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--tests&lt;/span&gt; &lt;span class="s1"&gt;'io.github.daeryss.karta.dataset.GoldQueryLoaderV2Test'&lt;/span&gt;

&lt;span class="c"&gt;# Data behind the ranking and paired bootstrap&lt;/span&gt;
&lt;span class="nb"&gt;ls &lt;/span&gt;results/forest_plot_data.csv
&lt;span class="nb"&gt;ls &lt;/span&gt;results/paired_bootstrap_top5.csv

&lt;span class="c"&gt;# Recompute the paired bootstrap for the top cluster&lt;/span&gt;
.venv/bin/python scripts/paired_bootstrap.py &lt;span class="se"&gt;\&lt;/span&gt;
  results/E005-production/all.csv &lt;span class="se"&gt;\&lt;/span&gt;
  results/E006-production-merged/all.csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; results/paired_bootstrap_top5.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Code-RAG Research Series · Article 1 of 3 · Next: "What Actually Matters in&lt;br&gt;
Code-RAG: Models, Chunking, or Retrieval?"&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>ai</category>
      <category>java</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
