I thought I was being clever. Our RAG system was hallucinating, and everyone on Twitter was raving about rerankers. Cohere's rerank endpoint looked perfect: connect to their API, make one call, and boom, better results!
Three weeks and $400 later, my production metrics were worse than before.
Here's what I learned the hard way about when rerankers actually help, and when they're just an expensive patch on a broken retrieval system.
The Setup: A Classic RAG Mistake
Our customer support chatbot was giving increasingly bizarre answers. Questions like "How do I cancel my subscription?" were getting responses about our security features or pricing tiers. Loosely related to subscriptions, but completely unhelpful.
Our cosine similarity scores looked healthy (0.87, 0.91, 0.93), so we figured retrieval was fine and the problem must be the ordering. Right? Wrong.
The $400 Band-Aid
I integrated Cohere's rerank endpoint between retrieval and generation:
# Seemed so simple...
import cohere

co = cohere.Client(COHERE_API_KEY)  # API key placeholder

candidates = retriever.search(query, k=50)
reranked = co.rerank(
    query=query,
    documents=[c['text'] for c in candidates],
    top_n=10,
    model='rerank-english-v2.0'
)
At $0.002 per query and ~15,000 queries per day, that's $30/day, or about $900/month in steady state. I started with a smaller test set, but even my initial testing cost $400 before I realized what was actually going on.
The results? Our nDCG@10 metric dropped from 0.72 to 0.68. Latency increased by 250ms. User satisfaction scores didn't budge.
I was paying to make things worse.
The Real Problem: Polishing a Turd
Here's what I didn't understand: rerankers improve precision (ordering), not recall (coverage).
When I finally measured it properly, my first-stage retrieval had a recall@50 of 0.61. That means 39% of the time, the correct answer wasn't even in my candidate pool.
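Measuring it is simple enough that there's no excuse to skip it. Here's a minimal sketch of a recall@k check, assuming you hand-label a small eval set that maps each query to the IDs of its known-relevant chunks, and that your retriever results carry an 'id' field (both are assumptions, not a prescribed format):

# Recall@k: fraction of queries where at least one relevant chunk
# shows up anywhere in the top-k candidates
def recall_at_k(retriever, eval_set, k=50):
    hits = 0
    for example in eval_set:  # e.g. {"query": ..., "relevant_ids": {...}}
        results = retriever.search(example["query"], k=k)
        retrieved_ids = {r["id"] for r in results}
        if retrieved_ids & example["relevant_ids"]:
            hits += 1
    return hits / len(eval_set)

print(f"recall@50 = {recall_at_k(retriever, eval_set, k=50):.2f}")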
The reranker was doing exactly what it was designed to do, picking the best chunks from the pool I gave it. The problem was that I was handing it a pool of crap.
I was literally asking it to rank:
- Chunk about security features (ΔS = 0.72)
- Chunk about pricing tiers (ΔS = 0.68)
- Chunk about account settings (ΔS = 0.71)
None of these were about cancellation. The reranker dutifully picked "pricing tiers" as the best match, and our LLM hallucinated an answer about downgrading plans instead of canceling.
The rule I learned: A reranker on bad retrieval is like polishing a turd - you're just making a shinier turd.
To fix it, I had to kill the reranker and go back to first principles.
1. Fixed Chunking (Cost: $0)
Our chunks were too large (800 tokens) and cut mid-sentence. I switched to semantic chunking that respected document structure:
# Before: Arbitrary 800-token chunks
chunks = naive_split(doc, chunk_size=800)
# After: Section-aware chunking
chunks = chunk_by_headers(doc, min_size=200, max_size=500)
This alone improved recall@50 from 0.61 to 0.78.
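For what it's worth, chunk_by_headers isn't a library function - it's a small helper. Here's a minimal sketch of the idea under the same min/max sizes as above, assuming Markdown-style # headers and using whitespace word counts as a rough token proxy:

import re

def chunk_by_headers(doc: str, min_size: int = 200, max_size: int = 500):
    # Split at lines that start a Markdown header so boundaries follow structure
    sections = re.split(r'\n(?=#{1,3} )', doc)
    chunks, current = [], ""
    for section in sections:
        candidate = f"{current}\n\n{section}".strip() if current else section
        if len(candidate.split()) <= max_size:
            current = candidate  # keep packing small sections together
        else:
            if current:
                chunks.append(current)  # flush what we've built so far
            current = section  # oversized single sections are kept whole here
    if current:
        chunks.append(current)
    # Fold a too-small trailing chunk into its neighbour
    if len(chunks) > 1 and len(chunks[-1].split()) < min_size:
        tail = chunks.pop()
        chunks[-1] = chunks[-1] + "\n\n" + tail
    return chunks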
2. Added Semantic Firewall (Cost: $0, Latency: +12ms)
I started measuring Semantic Stress (ΔS) - the semantic distance between the query's intent and each retrieved chunk:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
def semantic_firewall(query: str, chunks: list, threshold: float = 0.60):
    q_emb = model.encode(query, normalize_embeddings=True)
    filtered = []
    for chunk in chunks:
        c_emb = model.encode(chunk['text'], normalize_embeddings=True)
        cosine = float(util.cos_sim(c_emb, q_emb)[0][0])
        delta_s = 1 - cosine
        if delta_s < threshold:  # Lower ΔS = more relevant
            chunk['delta_s'] = delta_s
            filtered.append(chunk)
    return filtered
# Usage
candidates = retriever.search(query, k=50)
safe_chunks = semantic_firewall(query, candidates, threshold=0.60)
Chunks with ΔS > 0.60 were getting rejected before they could poison the context. Simple, fast, and actually effective.
3. Improved Hybrid Search (Cost: $0)
I was only using dense embeddings. Adding BM25 for keyword matching improved recall@50 to 0.87:
# Hybrid retrieval
dense_results = dense_retriever.search(query, k=100)
sparse_results = bm25_retriever.search(query, k=100)
# Reciprocal Rank Fusion
combined = reciprocal_rank_fusion([dense_results, sparse_results])
top_50 = combined[:50]
Now my candidate pool actually contains the right answers.
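Like chunk_by_headers, reciprocal_rank_fusion is a small helper rather than an import. A minimal version, assuming each result dict carries an 'id' field; the k=60 constant is the conventional default from the original RRF paper:

def reciprocal_rank_fusion(result_lists, k=60):
    # Each doc scores the sum of 1 / (k + rank) over every ranked list it appears in
    scores, docs = {}, {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            doc_id = doc['id']
            docs[doc_id] = doc
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked_ids]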
When Rerankers Actually Help
After fixing retrieval, I tried adding a reranker again. Having been burned by API reranking, I started with a self-hosted cross-encoder instead. Owning the reranking step teaches you a lot more about what it's actually doing, and if it hadn't been enough I could always have gone back to Cohere.
from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-base', use_fp16=True)

def rerank_topk(query: str, candidates: list, out_k: int = 10):
    pairs = [(query, c['text']) for c in candidates]
    scores = reranker.compute_score(pairs, normalize=True)
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, _ in ranked[:out_k]]
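For completeness, here's roughly how this slots into the pipeline built up earlier - the retriever names are the same illustrative ones used above, and the thresholds are the ones already discussed:

# End-to-end sketch: hybrid retrieval -> ΔS firewall -> cross-encoder rerank
candidates = reciprocal_rank_fusion([dense_retriever.search(query, k=100),
                                     bm25_retriever.search(query, k=100)])[:50]
safe_chunks = semantic_firewall(query, candidates, threshold=0.60)
context_chunks = rerank_topk(query, safe_chunks, out_k=10)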
The difference?
- Cost: ~$30/month (self-hosted GPU) vs. ~$900/month (Cohere API at our volume)
- Latency: +35ms vs. +250ms
- Quality: nDCG@10 improved from 0.87 to 0.93 (vs. 0.72 → 0.68 before)
It worked because now I was reranking from a pool where the right answer was actually present.
The Decision Table I Wish I Had
| Situation | Use Reranker? | Why |
|---|---|---|
| Recall@50 < 0.85 | ❌ NO | Fix retrieval first - reranker can't help |
| Recall good, but top-5 wrong | ✅ YES | Reranker will improve precision |
| ΔS already < 0.40 in top-5 | ❌ NO | You don't need it |
| High QPS (>10k/day) | ⚠️ MAYBE | Use self-hosted, not API |
| Low volume (<1k/day) | ✅ YES | LLM reranker okay for cost |
The Real Costs Nobody Ever Talks About
Senior management struggles to see the hidden costs of a "quick win". Even if Cohere had solved our problem, we would still have gotten pushback once our small-scale tests alone ran up $400. The API bill was the obvious expense, but adding a single API call to a codebase is extremely simple in engineering terms - which is exactly why it looked like the cheap option.
The following costs are harder to quantify, and harder still to get some kinds of leadership to take seriously:
- Engineering time - 2 weeks debugging why quality dropped
- Opportunity cost - Could have fixed retrieval on day one
- Production incidents - 3 escalations from confused customer support
- Credibility - Having to explain to my VP why we spent money to make things worse
The $400 in API costs was the cheapest part of this mistake.
Key Takeaways
- Measure recall first - If recall@50 < 0.85, don't even think about reranking
- Use ΔS as a gate - Filter out high-stress chunks before they poison your context
- Self-host when possible - Cross-encoders are 30x cheaper than LLM APIs
- Fix fundamentals first - Good chunking and hybrid search >>> expensive rerankers
- Verify with metrics - an nDCG improvement of at least 0.05 should justify the added complexity (a minimal way to compute it is sketched below)
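If you don't already track nDCG, it's cheap to compute by hand: take the discounted gain of your ranking and divide by the gain of the ideal ordering. A minimal sketch for nDCG@10 using graded relevance labels (0 = irrelevant, higher = better) and the simple rel / log2 gain form:

import math

def ndcg_at_k(relevances, k=10):
    # relevances: graded labels in the order your system returned the documents
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# e.g. top result is wrong, results 2-4 are partially relevant
print(round(ndcg_at_k([0, 2, 1, 1, 0], k=10), 2))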
The Numbers That Matter
Before my fixes:
- Recall@50: 0.61
- nDCG@10: 0.72
- Average ΔS: 0.68
- Monthly cost: $900
- User satisfaction: 3.2/5
After fixing retrieval (no reranker):
- Recall@50: 0.87
- nDCG@10: 0.87
- Average ΔS: 0.42
- Monthly cost: $30
- User satisfaction: 4.1/5
After adding self-hosted reranker:
- Recall@50: 0.87 (unchanged)
- nDCG@10: 0.93
- Average ΔS: 0.38
- Monthly cost: $60
- User satisfaction: 4.4/5
Want to Learn More?
This is just one module from my comprehensive RAG debugging course, where I cover:
- How to measure and fix Semantic Stress (ΔS)
- Building semantic firewalls to prevent hallucinations
- When (and when not) to use rerankers
- Multi-stage retrieval pipelines that actually work
- Production-grade citation tracking
- A/B testing RAG systems properly
Check out the full course: RAG Firewall Guide on Gumroad
Plus, grab the free GitHub repo with working code examples:
github.com/jongmoss/rag-firewall-examples
Have you made expensive mistakes optimizing RAG systems? I'd love to hear your war stories in the comments below.
And if you're currently debugging a RAG system that's hallucinating despite high similarity scores - go measure your recall@50 and ΔS before you buy that reranker. Trust me. 😅