Serhii Panchyshyn

My First RAG System Had No Evals. 40% of Answers Were Wrong.

When I started building production RAG systems, I noticed something: nobody was measuring retrieval quality.

Teams would ship a system, ask users if it "felt good," and move on. No metrics. No baseline. No way to know if changes actually helped.

So I started measuring everything. And the first thing I discovered: most RAG failures aren't LLM failures. They're retrieval failures.

The documents that could answer the question aren't making it into the context window. The LLM is being asked to answer questions without the information it needs. No wonder it hallucinates.

Here's what I've learned about measuring and fixing RAG systems after building them for B2B SaaS companies.


The metric that actually matters: Recall@k

Before I measure anything else on a new RAG system, I measure Recall@k.

Recall@k answers a simple question: "Of all the documents that should have been retrieved, what percentage actually made it into the top k results?"

def recall_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    """What % of relevant docs are in the top k results?"""
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)

    if not relevant:
        return 1.0

    return len(top_k & relevant) / len(relevant)

On systems I've audited, Recall@10 is often around 60%. That means 40% of the time, the document that could answer the question isn't even in the context. The LLM never had a chance.

Here's the math that drives everything:

P(correct answer) ≈ P(correct context retrieved)

If the right chunks aren't retrieved, the LLM can't answer correctly. This is why I always measure retrieval separately from answer quality. Otherwise you're debugging the wrong layer.
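A quick back-of-the-envelope sketch makes the point. The 90% figure below is an illustrative assumption about LLM accuracy given good context, not a measured number:

```python
# If retrieval surfaces the right chunk only 60% of the time, even an
# LLM that answers correctly 90% of the time *given* the right context
# caps out near 54% end to end. Prompt tweaks can't fix that ceiling.
p_context_retrieved = 0.60      # Recall@k from your retrieval eval
p_correct_given_context = 0.90  # illustrative assumption

p_correct_answer = p_context_retrieved * p_correct_given_context
print(f"{p_correct_answer:.0%}")  # → 54%
```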


You can start measuring today

You don't need production traffic to build evals. Generate synthetic test data from your corpus:

def generate_synthetic_evals(chunks: list) -> list:
    """Generate question-answer pairs from your chunks."""
    eval_pairs = []

    for chunk in chunks:
        response = llm.generate(f"""
Generate 3 questions that this text can answer.
Make them specific. "What is this about?" doesn't test retrieval.

Text:
{chunk.text}

Return JSON: [{{"question": "...", "chunk_id": "{chunk.id}"}}]
""")

        eval_pairs.extend(parse_json(response))

    return eval_pairs

50-100 questions is enough to establish a baseline. Run your retriever, measure Recall@10, write down the number. Now you can actually tell if changes help.
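Putting the pieces together, the baseline run is a short loop. This is a minimal sketch: `search` is a placeholder for whatever retriever you're testing, and each eval pair maps a question back to the chunk it was generated from, so that chunk is the one relevant document:

```python
def measure_recall(eval_pairs: list, search, k: int = 10) -> float:
    """Mean Recall@k across a synthetic eval set.

    eval_pairs: [{"question": "...", "chunk_id": "..."}]
    search: callable taking (question, k=...) and returning ranked doc ids.
    """
    total = 0.0
    for pair in eval_pairs:
        retrieved_ids = search(pair["question"], k=k)
        # One relevant doc per question, so recall is a hit/miss check
        total += 1.0 if pair["chunk_id"] in set(retrieved_ids[:k]) else 0.0
    return total / len(eval_pairs)
```

Run it once before you change anything. That number is the baseline everything else gets compared against.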


The two fixes that consistently move the needle

I've tried a lot of retrieval improvements. Most make marginal differences. Two consistently deliver results.

Fix 1: Hybrid search

Embeddings are great at semantic similarity. "How do I reset my password?" matches "Steps to recover account access" even though they share no keywords.

But embeddings are weak on:

  • Numbers: They don't understand that 49 is close to 50
  • Exact match: Product codes, IDs, ticker symbols
  • Rare terms: Domain jargon not in the training data

BM25 (keyword search) catches what embeddings miss. Combine them:

def hybrid_search(query: str, k: int = 10) -> list:
    """Combine embedding search and BM25 using RRF."""

    embedding_results = embedding_index.search(query, k=20)
    bm25_results = bm25_index.search(query, k=20)

    # Reciprocal Rank Fusion
    scores = {}
    rrf_k = 60

    for rank, doc_id in enumerate(embedding_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (rrf_k + rank + 1)

    for rank, doc_id in enumerate(bm25_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (rrf_k + rank + 1)

    ranked = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
    return ranked[:k]

Typical improvement: 5-15% recall boost depending on query mix.
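Part of why RRF works well: a document that appears in both ranked lists accumulates two reciprocal-rank scores, so agreement between the two retrievers is rewarded even when neither ranks the document first. A standalone sketch of just the fusion step:

```python
def rrf_fuse(rankings: list, rrf_k: int = 60) -> list:
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (rrf_k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "b" is only second in each list, but it appears in BOTH, so its
# combined score beats each list's sole top pick.
fused = rrf_fuse([["a", "b", "c"], ["d", "b", "e"]])
print(fused[0])  # → b
```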

Fix 2: Add a reranker

Embedding models are bi-encoders. They encode query and documents separately, then compare. Fast, but imprecise.

Cross-encoders (rerankers) look at the query and document together. Slower, but much more accurate. Use them as a second pass:

def search_with_rerank(query: str, k: int = 5) -> list:
    """Retrieve broadly, then rerank precisely."""

    # Cast a wide net
    candidates = hybrid_search(query, k=20)

    # Rerank with cross-encoder
    pairs = [(query, get_content(doc_id)) for doc_id in candidates]
    scores = reranker.score(pairs)

    # Return top k after reranking
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, score in ranked[:k]]

Typical improvement: another 5-10% on top of hybrid search.

Combined, these two fixes often take a system from 60% to 80% recall. That's the difference between "works sometimes" and "works reliably."


Chunking decisions that make or break retrieval

Your chunking strategy matters more than your embedding model choice. A few things I always check:

The "it" problem

Chunks that start with "It also supports..." or "This feature allows..." are useless on their own. The word "it" has no meaning without the previous chunk.

Fix: Prepend context to every chunk.

def chunk_with_context(doc) -> list:
    chunks = []

    for section in doc.sections:
        # Prepend document and section info
        context = f"Document: {doc.title}\nSection: {section.header}\n\n"

        for chunk_text in split_section(section.content):
            chunks.append({
                "content": context + chunk_text,
                "metadata": {
                    "doc_title": doc.title,
                    "section": section.header
                }
            })

    return chunks

Other chunking rules I follow

  1. Never split mid-table. A row without headers is meaningless.
  2. 10-20% overlap between consecutive chunks.
  3. Test multiple chunk sizes (256, 512, 1024 tokens). The optimal size depends on your queries.
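Rule 2 is easy to get subtly wrong. Here's a minimal word-based sliding window with fractional overlap; in practice you'd count tokens with your embedding model's tokenizer rather than words, but the stride logic is the same:

```python
def split_with_overlap(text: str, chunk_size: int = 512,
                       overlap: float = 0.15) -> list:
    """Split text into ~chunk_size-word chunks with fractional overlap."""
    words = text.split()
    # Step between chunk starts: a 15% overlap means advancing 85% of a chunk
    stride = max(1, int(chunk_size * (1 - overlap)))
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk already covers the tail
    return chunks
```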

The workflow I use on every RAG project

Week 1-2: Establish baseline

  1. Parse documents (test multiple parsers for PDFs)
  2. Chunk with context headers
  3. Generate 50-100 synthetic eval questions
  4. Build basic retriever
  5. Measure Recall@10
  6. Write down the number

Week 2-4: Apply standard fixes

  1. Add hybrid search (BM25 + embeddings)
  2. Add reranker
  3. Measure again
  4. Compare to baseline

Week 4+: Debug specific failures

  1. Break down recall by query type
  2. Find worst-performing segment
  3. Fix that segment
  4. Measure again

The key: measure after every change. If you can't see improvement in numbers, you're guessing.


When to measure answer quality

Only after retrieval is solid.

Once Recall@10 is above 80%, start measuring end-to-end:

def eval_answer(question: str, answer: str, context: list) -> dict:
    """Use LLM-as-judge for answer evaluation."""

    result = llm.generate(f"""
Evaluate this answer. Return JSON:
- correct: true/false (factually accurate)
- grounded: true/false (supported by the context)
- complete: true/false (addresses the full question)

Context: {format_context(context)}
Question: {question}
Answer: {answer}
""")

    return parse_json(result)

But if retrieval is broken, this eval is noise. You're just measuring how well your LLM fills in gaps it shouldn't have to fill.
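Once the judge returns per-answer verdicts, aggregate them into pass rates across the eval set. A minimal sketch, assuming `verdicts` is the list of parsed JSON results from `eval_answer` above:

```python
def summarize_verdicts(verdicts: list) -> dict:
    """Fraction of answers passing each judge check across the eval set."""
    keys = ("correct", "grounded", "complete")
    n = len(verdicts)
    return {key: sum(bool(v[key]) for v in verdicts) / n for key in keys}
```

Track these three numbers over time the same way you track Recall@10: measure, change one thing, measure again.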


The takeaway

RAG quality is retrieval quality.

Before you touch your prompts:

  1. Generate synthetic evals from your corpus
  2. Measure Recall@10
  3. Add hybrid search
  4. Add a reranker
  5. Fix your chunking
  6. Measure again

The fixes are straightforward. The impact is not.


This is Part 1 of a series on production AI systems. Next: how to know when to fix your prompts vs. build an evaluator.


About me

I help B2B SaaS companies ship production AI in 6 weeks.

If you're building RAG and want a second set of eyes, I do free AI Teardowns — a 30-45 min video showing exactly where your pipeline is breaking and how to fix it.

No pitch. Just clarity.

AI Implementation for B2B SaaS | AnimaNova Labs

Ship production AI features in 6 weeks. For B2B SaaS companies who need AI but can't hire fast enough. No $300K engineer. No 6-month timeline.
