# Our RAG Couldn't Find Its Own Documentation — Here's the Fix
I built a local AI pipeline on top of Ollama. It has a knowledge base of markdown documents — session notes, architectural decisions, build logs. The idea was that the model could answer questions about its own project history using those documents as ground truth instead of hallucinating from parametric memory.
It failed in a very specific way.
## The Failure
I ran this query through the pipeline:
"Why was nomic-embed-text chosen over mxbai-embed-large for the RAG embedding upgrade?"
The answer exists verbatim in a session document:
```
| nomic-embed-text over mxbai-embed-large | Available via Ollama, retrieval-trained, 768d, clean upgrade path |
```
The cosine retrieval returned this:
```
[knowledge/cosine] 'unrelated-project-context' score=0.5694 [HIT]  ← wrong doc
[knowledge/cosine] 'build-session'             score=0.5634 [HIT]  ← right file, wrong chunk
[memory/cosine]    'old-session-notes'         score=0.6018        ← wrong doc entirely
```
The model received the wrong documents and responded with generic plausible-sounding reasoning — technically coherent, factually wrong. It invented an explanation rather than reporting the recorded one.
The answer was on disk. grep found it in milliseconds:
```bash
grep -r "nomic-embed-text" ./vault/AI/memory/
# → | nomic-embed-text over mxbai-embed-large | Available via Ollama, retrieval-trained, 768d...
```
The answer was trivially retrievable. The system just wasn’t retrieving it.
## Why Cosine Fails Here
Cosine similarity works by measuring proximity in an embedding space. Two pieces of text score high if the embedding model places them near each other — which happens when they're semantically related.
nomic-embed-text is a model name. It's a low-frequency technical token that the embedding model has no meaningful semantic neighborhood for. When you embed the query "why was nomic-embed-text chosen", the resulting vector floats somewhere in the AI/ML region of the embedding space — near documents about embeddings, retrieval, and model selection generally. Not near the specific document containing the exact string nomic-embed-text.
The embedding model is doing its job correctly. It's returning semantically similar content. The problem is that exact technical terms — model names, version strings, flag names, tool names — don't have semantic neighborhoods. They're proper nouns in a space that operates on meaning, not identity.
This is the class of queries that breaks cosine-only RAG:
- "why was qwen2.5:14b chosen over qwen2.5:32b"
- "what does the --unrestricted flag do in GRUB"
- "what version introduced the format:json parameter"
- "why was rank_bm25 chosen over sentence-transformers"
These are exactly the queries you want to ask a project knowledge base. And cosine consistently fails them.
## The Fix: BM25 as a Parallel Path
BM25 (Best Match 25) is a classic information retrieval algorithm that scores documents by exact term frequency weighted by document length and corpus statistics. It doesn't use embeddings. It tokenizes the query, looks for those tokens in documents, and ranks by how many times they appear and how rare they are across the corpus.
nomic-embed-text as a BM25 query returns the document that contains nomic-embed-text as a literal string. It's doing what grep does, but with ranking.
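To make the scoring concrete, here is a minimal hand-rolled sketch of the Okapi BM25 formula (this uses one common IDF variant; the k1/b values, document stems, and contents are illustrative, not from the real vault):

```python
import math

# Illustrative corpus: stems and contents are made up, not the real vault.
docs = {
    "session-notes": "nomic-embed-text over mxbai-embed-large retrieval-trained 768d upgrade".split(),
    "old-notes": "general notes about embeddings retrieval and model selection".split(),
    "build-log": "grub boot lockout superuser password menu entries".split(),
}

def bm25_score(query: list[str], doc: list[str], corpus: list[list[str]],
               k1: float = 1.5, b: float = 0.75) -> float:
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query:
        df = sum(term in d for d in corpus)              # how many docs contain the term
        if df == 0:
            continue
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # rare terms weigh heavily
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

query = "why was nomic-embed-text chosen".split()
ranked = sorted(docs, key=lambda s: bm25_score(query, docs[s], list(docs.values())),
                reverse=True)
print(ranked[0])  # → session-notes
```

The common words in the query ("why", "was", "chosen") contribute nothing, but the single rare token `nomic-embed-text` pins the right document to the top. That is the whole exact-term advantage in one line of IDF math.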
The fix isn't to replace cosine with BM25. They cover different query types:
| Query type | Cosine wins | BM25 wins |
|---|---|---|
| "how does the authentication work" | ✓ (semantic) | ✗ (no exact match) |
| "why was nomic-embed-text chosen" | ✗ (no embedding neighbors) | ✓ (exact token) |
| "explain the retry logic" | ✓ (semantic) | partial |
| "what is the --unrestricted flag" | ✗ | ✓ (exact token) |
The implementation runs both in parallel and merges results.
## Implementation
BM25 doesn't need to be persisted. It rebuilds from source documents in milliseconds — no .pkl file, no disk writes, no separate index management. The rank_bm25 library is pure Python with no model downloads. The implementation auto-installs it on first use:
```python
try:
    from rank_bm25 import BM25Okapi
    _BM25_AVAILABLE = True
except ImportError:
    try:
        import subprocess, sys
        subprocess.check_call([sys.executable, "-m", "pip", "install", "rank_bm25"],
                              stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        from rank_bm25 import BM25Okapi
        _BM25_AVAILABLE = True
    except Exception:
        _BM25_AVAILABLE = False
```
If the install fails for any reason, the pipeline falls back to cosine-only silently — no exception, no warning to the user.
### The Tokenization Detail
This is the part that matters most. The tokenizer must preserve hyphenated technical tokens as single units:
```python
import re

def _tokenize(text: str) -> list[str]:
    # Split on punctuation and whitespace, excluding hyphens:
    # 'nomic-embed-text' → one token, not three
    tokens = re.split(r'[\s,;:!()\[\]{}<>/"\'\\@#$%^&*+=~|.]+', text.lower())
    return [t for t in tokens if t]
```
The key: hyphens are excluded from the split pattern. nomic-embed-text stays as one token. If you split on hyphens, you get ["nomic", "embed", "text"] — three common words that appear in dozens of documents. BM25 loses all precision and the exact-term advantage disappears.
### The Merge
```python
def _merge(
    cosine: list[tuple[str, str, float]],
    bm25: list[tuple[str, str, float]],
    max_results: int = 4,
) -> list[tuple[str, str, float]]:
    seen: set[tuple[str, str]] = set()
    merged: list[tuple[str, str, float]] = []
    for stem, text, score in cosine + bm25:
        key = (stem, text[:100])
        if key not in seen:
            seen.add(key)
            merged.append((stem, text, score))
            if len(merged) >= max_results:
                break
    return merged
```
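Note that cosine and BM25 scores are never compared against each other (cosine sits roughly in 0–1, BM25 is unbounded), so the merge is positional: cosine hits fill the slots first, then BM25 fills the rest. A toy run, with `_merge` restated and made-up stems and scores:

```python
# _merge restated from above so this snippet runs standalone.
def _merge(cosine, bm25, max_results=4):
    seen, merged = set(), []
    for stem, text, score in cosine + bm25:
        key = (stem, text[:100])
        if key not in seen:
            seen.add(key)
            merged.append((stem, text, score))
            if len(merged) >= max_results:
                break
    return merged

# Made-up stems and scores; the duplicate 'unrelated-context' hit is dropped.
cosine_hits = [("unrelated-context", "some semantic match", 0.55)]
bm25_hits = [
    ("session-notes", "| nomic-embed-text over mxbai-embed-large | ...", 6.99),
    ("unrelated-context", "some semantic match", 3.2),
]
merged_demo = _merge(cosine_hits, bm25_hits)
print([stem for stem, _, _ in merged_demo])  # → ['unrelated-context', 'session-notes']
```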
The full implementation is in context_loader.py in the repo.
## Results
After implementing hybrid retrieval:
```
[knowledge/cosine] 'unrelated-context' score=0.5454
[knowledge/bm25]   'build-session'    score=17.74 preview='### GRUB Boot Lockout\n**Cause:** Set GRUB...'
[memory/bm25]      'session-notes'    score=6.99  preview='| nomic-embed-text over mxbai-embed-large | A'
```
Pipeline run after fix:
```
python run_task.py "why was nomic-embed-text chosen over mxbai-embed-large"

→ nomic-embed-text was chosen for four reasons:
  1. Available via Ollama — no separate service required
  2. Retrieval-trained — specifically optimized for retrieval tasks
  3. 768-dimensional vectors — 2x the dimensions of the previous model
  4. Clean upgrade path — drop-in replacement with no interface changes
```
All four points match the recorded decision table exactly. The model did not fabricate. It read the document that BM25 retrieved.
## The Secondary Problem: Context Budget
Fixing retrieval exposed a second problem. Even with BM25 finding the right document, the content wasn't reaching the model.
The `CONTEXT_INJECT_LIMIT` was 2000 characters. The file manifest section (listing all scripts in the pipeline directory) consumed 503 of those characters on every query, before any retrieval content was injected. BM25 found the correct document, and it was then truncated out of the prompt before the model ever saw it.
Evidence:
```
[load_context output — 2000 chars / 2000 limit]
### GRUB Boot Lockout
**Cause:** Set GRUB superuser password without `--unrestricted` on menu entries.
GRUB requires the password for ALL interactions including automatic boot
when a superuser i            ← cut off mid-sentence; memory section never appeared
```
Fix: raise the limit to 4000 and cap the manifest section at 200 characters. The manifest is least valuable for knowledge queries; retrieval hits are highest value. Budget allocation matters as much as retrieval quality.
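A hedged sketch of that allocation; the function name `build_context` and the constant names are illustrative (the real logic lives in context_loader.py and may differ):

```python
# Hypothetical names; separator overhead is ignored for simplicity.
CONTEXT_INJECT_LIMIT = 4000   # raised from 2000
MANIFEST_CAP = 200            # hard cap: the manifest is the least valuable section

def build_context(manifest: str, retrieval_chunks: list[str]) -> str:
    parts = [manifest[:MANIFEST_CAP]]
    budget = CONTEXT_INJECT_LIMIT - len(parts[0])
    for chunk in retrieval_chunks:        # retrieval hits spend the remaining budget
        if budget <= 0:
            break
        take = chunk[:budget]
        parts.append(take)
        budget -= len(take)
    return "\n\n".join(parts)

# With a 900-char manifest and two 3000-char chunks, the manifest is capped at
# 200, the first chunk fits whole, and the second is trimmed to the remainder.
ctx = build_context("manifest " * 100, ["A" * 3000, "B" * 3000])
```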
## What Still Doesn't Work
Multi-question prompts that require chunks from three different documents simultaneously still fail. A single query asking about three separate historical decisions fills the context budget with the first two documents before the third can be injected. The model falls back to parametric guessing for the third.
This is an architectural constraint, not a retrieval problem. The fix requires either query decomposition (ask three separate focused questions) or a re-ranker that explicitly allocates context budget across topics. Not solved yet.
## Takeaway
Cosine-only RAG has a silent failure mode on exact technical terms. The failure is invisible — the system returns results, the model generates a response, everything looks like it's working. The model is just hallucinating a plausible answer instead of reading the correct one.
BM25 as a parallel retrieval path costs almost nothing (pure Python, no model, rebuilds in milliseconds) and directly plugs this gap. It doesn't replace cosine — it covers the query types cosine can't.
The tokenization detail matters: preserve hyphenated technical tokens as single units by excluding hyphens from your split pattern, or you lose the precision advantage entirely.
Repo: github.com/JunyiiBlvd/ollama-hybrid-pipeline
The full implementation is in context_loader.py. The pipeline runs on any Ollama model.