<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: JunyiiBlvd</title>
    <description>The latest articles on DEV Community by JunyiiBlvd (@junyiiblvd).</description>
    <link>https://dev.to/junyiiblvd</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3845412%2Fb83e7e6f-2677-4e8e-be66-fb22e943a643.jpeg</url>
      <title>DEV Community: JunyiiBlvd</title>
      <link>https://dev.to/junyiiblvd</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/junyiiblvd"/>
    <language>en</language>
    <item>
      <title>Cosine Similarity Failed Our RAG on Exact Terms — BM25 Fixed It</title>
      <dc:creator>JunyiiBlvd</dc:creator>
      <pubDate>Fri, 27 Mar 2026 01:22:09 +0000</pubDate>
      <link>https://dev.to/junyiiblvd/cosine-similarity-failed-our-rag-on-exact-terms-bm25-fixed-it-af8</link>
      <guid>https://dev.to/junyiiblvd/cosine-similarity-failed-our-rag-on-exact-terms-bm25-fixed-it-af8</guid>
      <description>&lt;h1&gt;
  
  
  Our RAG Couldn't Find Its Own Documentation — Here's the Fix
&lt;/h1&gt;

&lt;p&gt;I built a local AI pipeline on top of Ollama. It has a knowledge base of markdown documents — session notes, architectural decisions, build logs. The idea was that the model could answer questions about its own project history using those documents as ground truth instead of hallucinating from parametric memory.&lt;/p&gt;

&lt;p&gt;It failed in a very specific way.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Failure
&lt;/h2&gt;

&lt;p&gt;I ran this query through the pipeline:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Why was nomic-embed-text chosen over mxbai-embed-large for the RAG embedding upgrade?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer exists verbatim in a session document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| nomic-embed-text over mxbai-embed-large | Available via Ollama, retrieval-trained, 768d, clean upgrade path |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cosine retrieval returned this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[knowledge/cosine] 'unrelated-project-context'  score=0.5694  [HIT]   ← wrong doc
[knowledge/cosine] 'build-session'              score=0.5634  [HIT]   ← right file, wrong chunk
[memory/cosine]    'old-session-notes'          score=0.6018          ← wrong doc entirely
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model received the wrong documents and responded with generic plausible-sounding reasoning — technically coherent, factually wrong. It invented an explanation rather than reporting the recorded one.&lt;/p&gt;

&lt;p&gt;The answer was on disk. &lt;code&gt;grep&lt;/code&gt; found it in milliseconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s2"&gt;"nomic-embed-text"&lt;/span&gt; ./vault/AI/memory/
&lt;span class="c"&gt;# → | nomic-embed-text over mxbai-embed-large | Available via Ollama, retrieval-trained, 768d...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The answer was trivially retrievable. The system just wasn’t retrieving it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Cosine Fails Here
&lt;/h2&gt;

&lt;p&gt;Cosine similarity works by measuring proximity in an embedding space. Two pieces of text score high if the embedding model places them near each other — which happens when they're semantically related.&lt;/p&gt;
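&lt;p&gt;The metric itself is simple: the dot product of two vectors divided by the product of their magnitudes. A minimal sketch, not the pipeline's code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def cosine_similarity(u, v):
    # dot product over the product of vector magnitudes
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# parallel vectors score 1.0, orthogonal vectors score 0.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;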

&lt;p&gt;&lt;code&gt;nomic-embed-text&lt;/code&gt; is a model name. It's a low-frequency technical token that the embedding model has no meaningful semantic neighborhood for. When you embed the query &lt;code&gt;"why was nomic-embed-text chosen"&lt;/code&gt;, the resulting vector floats somewhere in the AI/ML region of the embedding space — near documents about embeddings, retrieval, and model selection generally. Not near the specific document containing the exact string &lt;code&gt;nomic-embed-text&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The embedding model is doing its job correctly. It's returning semantically similar content. The problem is that exact technical terms — model names, version strings, flag names, tool names — don't have semantic neighborhoods. They're proper nouns in a space that operates on meaning, not identity.&lt;/p&gt;

&lt;p&gt;This is the class of queries that breaks cosine-only RAG:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;"why was qwen2.5:14b chosen over qwen2.5:32b"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;"what does the --unrestricted flag do in GRUB"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;"what version introduced the format:json parameter"&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;"why was rank_bm25 chosen over sentence-transformers"&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are exactly the queries you want to ask a project knowledge base. And cosine consistently fails them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Fix: BM25 as a Parallel Path
&lt;/h2&gt;

&lt;p&gt;BM25 (Best Match 25) is a classic information retrieval algorithm that scores documents by exact term frequency weighted by document length and corpus statistics. It doesn't use embeddings. It tokenizes the query, looks for those tokens in documents, and ranks by how many times they appear and how rare they are across the corpus.&lt;/p&gt;
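&lt;p&gt;The scoring fits in a few lines of pure Python. This is a simplified Okapi BM25 for illustration, not the &lt;code&gt;rank_bm25&lt;/code&gt; implementation the pipeline uses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    # query: list of tokens; corpus: list of token lists
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    # document frequency per term
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in corpus:
        score = 0.0
        for term in query:
            tf = doc.count(term)
            if tf == 0:
                continue
            # rare terms get a large idf, common terms a small one
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # term-frequency saturation, normalized by document length
            score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A query token like &lt;code&gt;nomic-embed-text&lt;/code&gt; contributes score only to documents that contain it literally; for every other document the term adds zero.&lt;/p&gt;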

&lt;p&gt;&lt;code&gt;nomic-embed-text&lt;/code&gt; as a BM25 query returns the document that contains &lt;code&gt;nomic-embed-text&lt;/code&gt; as a literal string. It's doing what &lt;code&gt;grep&lt;/code&gt; does, but with ranking.&lt;/p&gt;

&lt;p&gt;The fix isn't to replace cosine with BM25. They cover different query types:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query type&lt;/th&gt;
&lt;th&gt;Cosine wins&lt;/th&gt;
&lt;th&gt;BM25 wins&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"how does the authentication work"&lt;/td&gt;
&lt;td&gt;✓ (semantic)&lt;/td&gt;
&lt;td&gt;✗ (no exact match)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"why was nomic-embed-text chosen"&lt;/td&gt;
&lt;td&gt;✗ (no embedding neighbors)&lt;/td&gt;
&lt;td&gt;✓ (exact token)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"explain the retry logic"&lt;/td&gt;
&lt;td&gt;✓ (semantic)&lt;/td&gt;
&lt;td&gt;partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"what is the --unrestricted flag"&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓ (exact token)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The implementation runs both in parallel and merges results.&lt;/p&gt;
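&lt;p&gt;The fan-out is small enough to sketch with a thread pool (function names here are illustrative, not the pipeline's actual API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from concurrent.futures import ThreadPoolExecutor

def hybrid_search(query, cosine_search, bm25_search, merge):
    # run both retrievers concurrently and merge their ranked lists
    with ThreadPoolExecutor(max_workers=2) as pool:
        cosine_future = pool.submit(cosine_search, query)
        bm25_future = pool.submit(bm25_search, query)
        return merge(cosine_future.result(), bm25_future.result())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;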




&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;BM25 doesn't need to be persisted. It rebuilds from source documents in milliseconds — no &lt;code&gt;.pkl&lt;/code&gt; file, no disk writes, no separate index management. The &lt;code&gt;rank_bm25&lt;/code&gt; library is pure Python with no model downloads. The implementation auto-installs it on first use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rank_bm25&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BM25Okapi&lt;/span&gt;
    &lt;span class="n"&gt;_BM25_AVAILABLE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;ImportError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
        &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;check_call&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;install&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank_bm25&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                              &lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DEVNULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DEVNULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rank_bm25&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BM25Okapi&lt;/span&gt;
        &lt;span class="n"&gt;_BM25_AVAILABLE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;_BM25_AVAILABLE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the install fails for any reason, the pipeline falls back to cosine-only silently — no exception, no warning to the user.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Tokenization Detail
&lt;/h3&gt;

&lt;p&gt;This is the part that matters most. The tokenizer must preserve hyphenated technical tokens as single units:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="c1"&gt;# Split on punctuation and whitespace, excluding hyphens
&lt;/span&gt;    &lt;span class="c1"&gt;# 'nomic-embed-text' → one token, not three
&lt;/span&gt;    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[\s,;:!()\[\]{}&amp;lt;&amp;gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\'\\@#$%^&amp;amp;*+=~|.]+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key: hyphens are &lt;strong&gt;excluded&lt;/strong&gt; from the split pattern. &lt;code&gt;nomic-embed-text&lt;/code&gt; stays as one token. If you split on hyphens, you get &lt;code&gt;["nomic", "embed", "text"]&lt;/code&gt; — three common words that appear in dozens of documents. BM25 loses all precision and the exact-term advantage disappears.&lt;/p&gt;
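&lt;p&gt;A quick check makes the difference concrete. The split class below is trimmed to the characters that matter for the demo (note the absent hyphen), contrasted with a naive &lt;code&gt;\w+&lt;/code&gt; split:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# trimmed version of the split class above; the point is the missing hyphen
SPLIT = r'[\s,;:!()\[\]{}/@#%^*+=~|.]+'

def tokenize(text):
    return [t for t in re.split(SPLIT, text.lower()) if t]

print(tokenize('Why was nomic-embed-text chosen'))
#  ['why', 'was', 'nomic-embed-text', 'chosen']

print(re.findall(r'\w+', 'nomic-embed-text'))
#  ['nomic', 'embed', 'text']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;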

&lt;h3&gt;
  
  
  The Merge
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def _merge(
    cosine: list[tuple[str, str, float]],
    bm25:   list[tuple[str, str, float]],
    max_results: int = 4,
) -&amp;gt; list[tuple[str, str, float]]:
    seen: set[tuple[str, str]] = set()
    merged: list[tuple[str, str, float]] = []
    for stem, text, score in cosine + bm25:
        key = (stem, text[:100])
        if key not in seen:
            seen.add(key)
            merged.append((stem, text, score))
        if len(merged) &amp;gt;= max_results:
            break
    return merged
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The full implementation is in &lt;code&gt;context_loader.py&lt;/code&gt; in the repo.&lt;/p&gt;


&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;After implementing hybrid retrieval:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[knowledge/cosine] 'unrelated-context'  score=0.5454
&lt;/span&gt;&lt;span class="gp"&gt;[knowledge/bm25]   'build-session'      score=17.74  preview='#&lt;/span&gt;&lt;span class="c"&gt;## GRUB Boot Lockout\n**Cause:** Set GRUB...'&lt;/span&gt;
&lt;span class="go"&gt;[memory/bm25]      'session-notes'      score=6.99   preview='| nomic-embed-text over mxbai-embed-large | A'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pipeline run after fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;python run_task.py "why was nomic-embed-text chosen over mxbai-embed-large"

→ nomic-embed-text was chosen for four reasons:
  1. Available via Ollama — no separate service required
  2. Retrieval-trained — specifically optimized for retrieval tasks
  3. 768-dimensional vectors — 2x the dimensions of the previous model
  4. Clean upgrade path — drop-in replacement with no interface changes
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All four points match the recorded decision table exactly. The model did not fabricate. It read the document that BM25 retrieved.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Secondary Problem: Context Budget
&lt;/h2&gt;

&lt;p&gt;Fixing retrieval exposed a second problem. Even with BM25 finding the right document, the content wasn't reaching the model.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;CONTEXT_INJECT_LIMIT&lt;/code&gt; was 2000 characters. The file manifest section (listing all scripts in the pipeline directory) consumed 503 characters on every query — before any retrieval content was injected. BM25 found the correct document, but the retrieved chunk was truncated out of the prompt before the model ever saw it.&lt;/p&gt;

&lt;p&gt;Evidence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[load_context output — 2000 chars / 2000 limit]
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="c"&gt;## GRUB Boot Lockout&lt;/span&gt;
&lt;span class="go"&gt;**Cause:** Set GRUB superuser password without `--unrestricted` on menu entries.
GRUB requires the password for ALL interactions including automatic boot
when a superuser i    ← cut off mid-sentence, memory section never appeared
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fix: raise the limit to 4000 and cap the manifest section at 200 characters. The manifest is least valuable for knowledge queries; retrieval hits are highest value. Budget allocation matters as much as retrieval quality.&lt;/p&gt;
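&lt;p&gt;In sketch form the allocation is ordered truncation: hard-cap the low-value manifest, then let retrieval hits fill whatever budget remains (names here are illustrative, not the pipeline's actual code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def build_context(manifest, retrieval_chunks, limit=4000, manifest_cap=200):
    # the manifest gets a fixed cap; retrieval content fills the rest
    parts = [manifest[:manifest_cap]]
    used = len(parts[0])
    for chunk in retrieval_chunks:
        remaining = max(limit - used, 0)
        if not remaining:
            break
        parts.append(chunk[:remaining])
        used += len(parts[-1])
    # separator characters are not counted, which is fine for a sketch
    return '\n\n'.join(parts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;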




&lt;h2&gt;
  
  
  What Still Doesn't Work
&lt;/h2&gt;

&lt;p&gt;Multi-question prompts that require chunks from three different documents simultaneously still fail. A single query asking about three separate historical decisions fills the context budget with the first two documents before the third can be injected. The model falls back to parametric guessing for the third.&lt;/p&gt;

&lt;p&gt;This is an architectural constraint, not a retrieval problem. The fix requires either query decomposition (ask three separate focused questions) or a re-ranker that explicitly allocates context budget across topics. Not solved yet.&lt;/p&gt;
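&lt;p&gt;One plausible shape for that re-ranker is a round-robin merge over per-question result lists, so each topic claims a slot before any topic gets a second one. This is a sketch of the idea, not something the pipeline implements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from itertools import zip_longest

def round_robin_merge(per_query_results, max_results=6):
    # per_query_results: one ranked chunk list per decomposed sub-question
    merged = []
    seen = set()
    for tier in zip_longest(*per_query_results):
        for chunk in tier:
            if chunk is not None and chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
            if len(merged) == max_results:
                return merged
    return merged
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;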




&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Cosine-only RAG has a silent failure mode on exact technical terms. The failure is invisible — the system returns results, the model generates a response, everything looks like it's working. The model is just hallucinating a plausible answer instead of reading the correct one.&lt;/p&gt;

&lt;p&gt;BM25 as a parallel retrieval path costs almost nothing (pure Python, no model, rebuilds in milliseconds) and directly plugs this gap. It doesn't replace cosine — it covers the query types cosine can't.&lt;/p&gt;

&lt;p&gt;The tokenization detail matters: preserve hyphenated technical tokens as single units by excluding hyphens from your split pattern, or you lose the precision advantage entirely.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/JunyiiBlvd/ollama-hybrid-pipeline" rel="noopener noreferrer"&gt;github.com/JunyiiBlvd/ollama-hybrid-pipeline&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The full implementation is in &lt;code&gt;context_loader.py&lt;/code&gt;. The pipeline runs on any Ollama model.&lt;/p&gt;

</description>
      <category>ollama</category>
      <category>rag</category>
      <category>python</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
