I thought I was being clever. Our RAG system was hallucinating, and everyone on Twitter was raving about rerankers. Cohere's rerank endpoint looked perfect: connect to their API, make one call, and boom, better results!
Three weeks and $400 later, my production metrics were worse than before.
Here's what I learned the hard way about when rerankers actually help, and when they're just an expensive patch on a broken retrieval system.
The Setup: A Classic RAG Mistake
Our customer support chatbot was giving increasingly bizarre answers. Questions like "How do I cancel my subscription?" were getting responses about our security features or pricing tiers. Loosely related to subscriptions, but completely unhelpful.
Our cosine similarity scores looked healthy (0.87, 0.91, 0.93), so we figured retrieval was fine and the problem must be the ordering. Right? Wrong.
The $400 Band-Aid
I integrated Cohere's rerank endpoint between retrieval and generation:
# Seemed so simple...
import cohere

co = cohere.Client(COHERE_API_KEY)  # API key placeholder

candidates = retriever.search(query, k=50)
reranked = co.rerank(
    query=query,
    documents=[c['text'] for c in candidates],
    top_n=10,
    model='rerank-english-v2.0'
)
At $0.002 per query and ~15,000 queries per day, that's $30/day, or about $900/month in steady state. I started with a smaller test set, but even my initial testing cost $400 before I realized what was actually going on.
The results? Our nDCG@10 metric dropped from 0.72 to 0.68. Latency increased by 250ms. User satisfaction scores didn't budge.
I was paying to make things worse.
The Real Problem: Polishing a Turd
Here's what I didn't understand: rerankers improve precision (ordering), not recall (coverage).
When I finally measured it properly, my first-stage retrieval had a recall@50 of 0.61. That means 39% of the time, the correct answer wasn't even in my candidate pool.
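Measuring it is simple enough that there's no excuse to skip it. Here's a minimal sketch of a recall@k check, assuming you hand-label a small eval set that maps each query to the IDs of its known-relevant chunks, and that your retriever results carry an 'id' field (both are assumptions, not a prescribed format):

# Recall@k: fraction of queries where at least one relevant chunk
# shows up anywhere in the top-k candidates
def recall_at_k(retriever, eval_set, k=50):
    hits = 0
    for example in eval_set:  # e.g. {"query": ..., "relevant_ids": {...}}
        results = retriever.search(example["query"], k=k)
        retrieved_ids = {r["id"] for r in results}
        if retrieved_ids & example["relevant_ids"]:
            hits += 1
    return hits / len(eval_set)

print(f"recall@50 = {recall_at_k(retriever, eval_set, k=50):.2f}")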
The reranker was doing exactly what it was designed to do, picking the best chunks from the pool I gave it. The problem was that I was handing it a pool of crap.
I was literally asking it to rank:
- Chunk about security features (ΔS = 0.72)
- Chunk about pricing tiers (ΔS = 0.68)
- Chunk about account settings (ΔS = 0.71)
None of these were about cancellation. The reranker dutifully picked "pricing tiers" as the best match, and our LLM hallucinated an answer about downgrading plans instead of canceling.
The rule I learned: A reranker on bad retrieval is like polishing a turd - you're just making a shinier turd.
To fix it, I had to kill the reranker and go back to first principles.
1. Fixed Chunking (Cost: $0)
Our chunks were too large (800 tokens) and cut mid-sentence. I switched to semantic chunking that respected document structure:
# Before: Arbitrary 800-token chunks
chunks = naive_split(doc, chunk_size=800)
# After: Section-aware chunking
chunks = chunk_by_headers(doc, min_size=200, max_size=500)
This alone improved recall@50 from 0.61 to 0.78.
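For what it's worth, chunk_by_headers isn't a library function - it's a small helper. Here's a minimal sketch of the idea under the same min/max sizes as above, assuming Markdown-style # headers and using whitespace word counts as a rough token proxy:

import re

def chunk_by_headers(doc: str, min_size: int = 200, max_size: int = 500):
    # Split at lines that start a Markdown header so boundaries follow structure
    sections = re.split(r'\n(?=#{1,3} )', doc)
    chunks, current = [], ""
    for section in sections:
        candidate = f"{current}\n\n{section}".strip() if current else section
        if len(candidate.split()) <= max_size:
            current = candidate  # keep packing small sections together
        else:
            if current:
                chunks.append(current)  # flush what we've built so far
            current = section  # oversized single sections are kept whole here
    if current:
        chunks.append(current)
    # Fold a too-small trailing chunk into its neighbour
    if len(chunks) > 1 and len(chunks[-1].split()) < min_size:
        tail = chunks.pop()
        chunks[-1] = chunks[-1] + "\n\n" + tail
    return chunks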
2. Added Semantic Firewall (Cost: $0, Latency: +12ms)
I started measuring Semantic Stress (ΔS) - the semantic distance between the query's intent and each retrieved chunk:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
def semantic_firewall(query: str, chunks: list, threshold: float = 0.60):
    q_emb = model.encode(query, normalize_embeddings=True)
    filtered = []
    for chunk in chunks:
        c_emb = model.encode(chunk['text'], normalize_embeddings=True)
        cosine = float(util.cos_sim(c_emb, q_emb)[0][0])
        delta_s = 1 - cosine
        if delta_s < threshold:  # Lower ΔS = more relevant
            chunk['delta_s'] = delta_s
            filtered.append(chunk)
    return filtered
# Usage
candidates = retriever.search(query, k=50)
safe_chunks = semantic_firewall(query, candidates, threshold=0.60)
Chunks with ΔS > 0.60 were getting rejected before they could poison the context. Simple, fast, and actually effective.
3. Improved Hybrid Search (Cost: $0)
I was only using dense embeddings. Adding BM25 for keyword matching improved recall@50 to 0.87:
# Hybrid retrieval
dense_results = dense_retriever.search(query, k=100)
sparse_results = bm25_retriever.search(query, k=100)
# Reciprocal Rank Fusion
combined = reciprocal_rank_fusion([dense_results, sparse_results])
top_50 = combined[:50]
Now my candidate pool actually contains the right answers.
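Like chunk_by_headers, reciprocal_rank_fusion is a small helper rather than an import. A minimal version, assuming each result dict carries an 'id' field; the k=60 constant is the conventional default from the original RRF paper:

def reciprocal_rank_fusion(result_lists, k=60):
    # Each doc scores the sum of 1 / (k + rank) over every ranked list it appears in
    scores, docs = {}, {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            doc_id = doc['id']
            docs[doc_id] = doc
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked_ids]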
When Rerankers Actually Help
After fixing retrieval, I tried adding a reranker again. Having been burned by API reranking, I started with a self-hosted cross-encoder instead. Owning the reranking step teaches you a lot more about what it's actually doing, and if it hadn't been enough I could always have gone back to Cohere.
from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-base', use_fp16=True)

def rerank_topk(query: str, candidates: list, out_k: int = 10):
    pairs = [(query, c['text']) for c in candidates]
    scores = reranker.compute_score(pairs, normalize=True)
    ranked = sorted(zip(candidates, scores), key=lambda x: -x[1])
    return [c for c, _ in ranked[:out_k]]
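For completeness, here's roughly how this slots into the pipeline built up earlier - the retriever names are the same illustrative ones used above, and the thresholds are the ones already discussed:

# End-to-end sketch: hybrid retrieval -> ΔS firewall -> cross-encoder rerank
candidates = reciprocal_rank_fusion([dense_retriever.search(query, k=100),
                                     bm25_retriever.search(query, k=100)])[:50]
safe_chunks = semantic_firewall(query, candidates, threshold=0.60)
context_chunks = rerank_topk(query, safe_chunks, out_k=10)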
The difference?
- Cost: ~$30/month (self-hosted GPU) vs. ~$900/month (Cohere API at our volume)
- Latency: +35ms vs. +250ms
- Quality: nDCG@10 improved from 0.87 to 0.93 (vs. 0.72 → 0.68 before)
It worked because now I was reranking from a pool where the right answer was actually present.
The Decision Table I Wish I Had
| Situation | Use Reranker? | Why |
|---|---|---|
| Recall@50 < 0.85 | ❌ NO | Fix retrieval first - reranker can't help |
| Recall good, but top-5 wrong | ✅ YES | Reranker will improve precision |
| ΔS already < 0.40 in top-5 | ❌ NO | You don't need it |
| High QPS (>10k/day) | ⚠️ MAYBE | Use self-hosted, not API |
| Low volume (<1k/day) | ✅ YES | LLM reranker okay for cost |
The Real Costs Nobody Ever Talks About
Senior management struggles to see the hidden costs of a "quick win". Even if Cohere had solved our problem, we would still have gotten pushback once our small-scale tests alone ran up $400. The API bill was the obvious expense, but adding a single API call to a codebase is extremely simple in engineering terms - which is exactly why it looked like the cheap option.
The following costs are harder to quantify, and harder still to get some kinds of leadership to take seriously:
- Engineering time - 2 weeks debugging why quality dropped
- Opportunity cost - Could have fixed retrieval on day one
- Production incidents - 3 escalations from confused customer support
- Credibility - Having to explain to my VP why we spent money to make things worse
The $400 in API costs was the cheapest part of this mistake.
Key Takeaways
- Measure recall first - If recall@50 < 0.85, don't even think about reranking
- Use ΔS as a gate - Filter out high-stress chunks before they poison your context
- Self-host when possible - Cross-encoders are 30x cheaper than LLM APIs
- Fix fundamentals first - Good chunking and hybrid search >>> expensive rerankers
- Verify with metrics - an nDCG improvement of at least 0.05 should justify the added complexity (a minimal way to compute it is sketched below)
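If you don't already track nDCG, it's cheap to compute by hand: take the discounted gain of your ranking and divide by the gain of the ideal ordering. A minimal sketch for nDCG@10 using graded relevance labels (0 = irrelevant, higher = better) and the simple rel / log2 gain form:

import math

def ndcg_at_k(relevances, k=10):
    # relevances: graded labels in the order your system returned the documents
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# e.g. top result is wrong, results 2-4 are partially relevant
print(round(ndcg_at_k([0, 2, 1, 1, 0], k=10), 2))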
The Numbers That Matter
Before my fixes:
- Recall@50: 0.61
- nDCG@10: 0.72
- Average ΔS: 0.68
- Monthly cost: $900
- User satisfaction: 3.2/5
After fixing retrieval (no reranker):
- Recall@50: 0.87
- nDCG@10: 0.87
- Average ΔS: 0.42
- Monthly cost: $30
- User satisfaction: 4.1/5
After adding self-hosted reranker:
- Recall@50: 0.87 (unchanged)
- nDCG@10: 0.93
- Average ΔS: 0.38
- Monthly cost: $60
- User satisfaction: 4.4/5
Want to Learn More?
This is just one module from my comprehensive RAG debugging course, where I cover:
- How to measure and fix Semantic Stress (ΔS)
- Building semantic firewalls to prevent hallucinations
- When (and when not) to use rerankers
- Multi-stage retrieval pipelines that actually work
- Production-grade citation tracking
- A/B testing RAG systems properly
Check out the full course: RAG Firewall Guide on Gumroad
Plus, grab the free GitHub repo with working code examples:
github.com/jongmoss/rag-firewall-examples
Have you made expensive mistakes optimizing RAG systems? I'd love to hear your war stories in the comments below.
And if you're currently debugging a RAG system that's hallucinating despite high similarity scores - go measure your recall@50 and ΔS before you buy that reranker. Trust me. 😅