Serhii Panchyshyn

My First RAG System Had No Evals. 40% of Answers Were Wrong.

When I started building production RAG systems, I noticed something: nobody was measuring retrieval quality.

Teams would ship a system, ask users if it "felt good," and move on. No metrics. No baseline. No way to know if changes actually helped.

So I started measuring everything. And the first thing I discovered: most RAG failures aren't LLM failures. They're retrieval failures.

The documents that could answer the question aren't making it into the context window. The LLM is being asked to answer questions without the information it needs. No wonder it hallucinates.

Here's what I've learned about measuring and fixing RAG systems after building them for B2B SaaS companies.


The metric that actually matters: Recall@k

Before I measure anything else on a new RAG system, I measure Recall@k.

Recall@k answers a simple question: "Of all the documents that should have been retrieved, what percentage actually made it into the top k results?"

def recall_at_k(retrieved_ids: list, relevant_ids: list, k: int) -> float:
    """What % of relevant docs are in the top k results?"""
    top_k = set(retrieved_ids[:k])
    relevant = set(relevant_ids)

    if not relevant:
        return 1.0

    return len(top_k & relevant) / len(relevant)

On systems I've audited, Recall@10 is often around 60%. That means 40% of the time, the document that could answer the question isn't even in the context. The LLM never had a chance.

Here's the math that drives everything:

P(correct answer) ≈ P(correct context retrieved)

If the right chunks aren't retrieved, the LLM can't answer correctly. This is why I always measure retrieval separately from answer quality. Otherwise you're debugging the wrong layer.
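A quick back-of-the-envelope sketch makes the point. The 90% figure below is an illustrative assumption about LLM accuracy given good context, not a measured number:

```python
# If retrieval surfaces the right chunk only 60% of the time, even an
# LLM that answers correctly 90% of the time *given* the right context
# caps out near 54% end to end. Prompt tweaks can't fix that ceiling.
p_context_retrieved = 0.60      # Recall@k from your retrieval eval
p_correct_given_context = 0.90  # illustrative assumption

p_correct_answer = p_context_retrieved * p_correct_given_context
print(f"{p_correct_answer:.0%}")  # → 54%
```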


You can start measuring today

You don't need production traffic to build evals. Generate synthetic test data from your corpus:

def generate_synthetic_evals(chunks: list) -> list:
    """Generate question-answer pairs from your chunks."""
    eval_pairs = []

    for chunk in chunks:
        response = llm.generate(f"""
Generate 3 questions that this text can answer.
Make them specific. "What is this about?" doesn't test retrieval.

Text:
{chunk.text}

Return JSON: [{{"question": "...", "chunk_id": "{chunk.id}"}}]
""")

        eval_pairs.extend(parse_json(response))

    return eval_pairs

50-100 questions is enough to establish a baseline. Run your retriever, measure Recall@10, write down the number. Now you can actually tell if changes help.
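Putting the pieces together, the baseline run is a short loop. This is a minimal sketch: `search` is a placeholder for whatever retriever you're testing, and each eval pair maps a question back to the chunk it was generated from, so that chunk is the one relevant document:

```python
def measure_recall(eval_pairs: list, search, k: int = 10) -> float:
    """Mean Recall@k across a synthetic eval set.

    eval_pairs: [{"question": "...", "chunk_id": "..."}]
    search: callable taking (question, k=...) and returning ranked doc ids.
    """
    total = 0.0
    for pair in eval_pairs:
        retrieved_ids = search(pair["question"], k=k)
        # One relevant doc per question, so recall is a hit/miss check
        total += 1.0 if pair["chunk_id"] in set(retrieved_ids[:k]) else 0.0
    return total / len(eval_pairs)
```

Run it once before you change anything. That number is the baseline everything else gets compared against.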


The two fixes that consistently move the needle

I've tried a lot of retrieval improvements. Most make marginal differences. Two consistently deliver results.

Fix 1: Hybrid search

Embeddings are great at semantic similarity. "How do I reset my password?" matches "Steps to recover account access" even though they share no keywords.

But embeddings are weak on:

  • Numbers: They don't understand that 49 is close to 50
  • Exact match: Product codes, IDs, ticker symbols
  • Rare terms: Domain jargon not in the training data

BM25 (keyword search) catches what embeddings miss. Combine them:

def hybrid_search(query: str, k: int = 10) -> list:
    """Combine embedding search and BM25 using RRF."""

    embedding_results = embedding_index.search(query, k=20)
    bm25_results = bm25_index.search(query, k=20)

    # Reciprocal Rank Fusion
    scores = {}
    rrf_k = 60

    for rank, doc_id in enumerate(embedding_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (rrf_k + rank + 1)

    for rank, doc_id in enumerate(bm25_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (rrf_k + rank + 1)

    ranked = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
    return ranked[:k]

Typical improvement: 5-15% recall boost depending on query mix.
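Part of why RRF works well: a document that appears in both ranked lists accumulates two reciprocal-rank scores, so agreement between the two retrievers is rewarded even when neither ranks the document first. A standalone sketch of just the fusion step:

```python
def rrf_fuse(rankings: list, rrf_k: int = 60) -> list:
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (rrf_k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# "b" is only second in each list, but it appears in BOTH, so its
# combined score beats each list's sole top pick.
fused = rrf_fuse([["a", "b", "c"], ["d", "b", "e"]])
print(fused[0])  # → b
```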

Fix 2: Add a reranker

Embedding models are bi-encoders. They encode query and documents separately, then compare. Fast, but imprecise.

Cross-encoders (rerankers) look at the query and document together. Slower, but much more accurate. Use them as a second pass:

def search_with_rerank(query: str, k: int = 5) -> list:
    """Retrieve broadly, then rerank precisely."""

    # Cast a wide net
    candidates = hybrid_search(query, k=20)

    # Rerank with cross-encoder
    pairs = [(query, get_content(doc_id)) for doc_id in candidates]
    scores = reranker.score(pairs)

    # Return top k after reranking
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, score in ranked[:k]]

Typical improvement: another 5-10% on top of hybrid search.

Combined, these two fixes often take a system from 60% to 80% recall. That's the difference between "works sometimes" and "works reliably."


Chunking decisions that make or break retrieval

Your chunking strategy matters more than your embedding model choice. A few things I always check:

The "it" problem

Chunks that start with "It also supports..." or "This feature allows..." are useless on their own. The word "it" has no meaning without the previous chunk.

Fix: Prepend context to every chunk.

def chunk_with_context(doc) -> list:
    chunks = []

    for section in doc.sections:
        # Prepend document and section info
        context = f"Document: {doc.title}\nSection: {section.header}\n\n"

        for chunk_text in split_section(section.content):
            chunks.append({
                "content": context + chunk_text,
                "metadata": {
                    "doc_title": doc.title,
                    "section": section.header
                }
            })

    return chunks

Other chunking rules I follow

  1. Never split mid-table. A row without headers is meaningless.
  2. 10-20% overlap between consecutive chunks.
  3. Test multiple chunk sizes (256, 512, 1024 tokens). The optimal size depends on your queries.
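Rule 2 is easy to get subtly wrong. Here's a minimal word-based sliding window with fractional overlap; in practice you'd count tokens with your embedding model's tokenizer rather than words, but the stride logic is the same:

```python
def split_with_overlap(text: str, chunk_size: int = 512,
                       overlap: float = 0.15) -> list:
    """Split text into ~chunk_size-word chunks with fractional overlap."""
    words = text.split()
    # Step between chunk starts: a 15% overlap means advancing 85% of a chunk
    stride = max(1, int(chunk_size * (1 - overlap)))
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk already covers the tail
    return chunks
```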

The workflow I use on every RAG project

Week 1-2: Establish baseline

  1. Parse documents (test multiple parsers for PDFs)
  2. Chunk with context headers
  3. Generate 50-100 synthetic eval questions
  4. Build basic retriever
  5. Measure Recall@10
  6. Write down the number

Week 2-4: Apply standard fixes

  1. Add hybrid search (BM25 + embeddings)
  2. Add reranker
  3. Measure again
  4. Compare to baseline

Week 4+: Debug specific failures

  1. Break down recall by query type
  2. Find worst-performing segment
  3. Fix that segment
  4. Measure again

The key: measure after every change. If you can't see improvement in numbers, you're guessing.


When to measure answer quality

Only after retrieval is solid.

Once Recall@10 is above 80%, start measuring end-to-end:

def eval_answer(question: str, answer: str, context: list) -> dict:
    """Use LLM-as-judge for answer evaluation."""

    result = llm.generate(f"""
Evaluate this answer. Return JSON:
- correct: true/false (factually accurate)
- grounded: true/false (supported by the context)
- complete: true/false (addresses the full question)

Context: {format_context(context)}
Question: {question}
Answer: {answer}
""")

    return parse_json(result)

But if retrieval is broken, this eval is noise. You're just measuring how well your LLM fills in gaps it shouldn't have to fill.
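Once the judge returns per-answer verdicts, aggregate them into pass rates across the eval set. A minimal sketch, assuming `verdicts` is the list of parsed JSON results from `eval_answer` above:

```python
def summarize_verdicts(verdicts: list) -> dict:
    """Fraction of answers passing each judge check across the eval set."""
    keys = ("correct", "grounded", "complete")
    n = len(verdicts)
    return {key: sum(bool(v[key]) for v in verdicts) / n for key in keys}
```

Track these three numbers over time the same way you track Recall@10: measure, change one thing, measure again.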


The takeaway

RAG quality is retrieval quality.

Before you touch your prompts:

  1. Generate synthetic evals from your corpus
  2. Measure Recall@10
  3. Add hybrid search
  4. Add a reranker
  5. Fix your chunking
  6. Measure again

The fixes are straightforward. The impact is not.


This is Part 1 of a series on production AI systems. Next: how to know when to fix your prompts vs. build an evaluator.


About me

I help B2B SaaS companies ship production AI in 6 weeks.

If you're building RAG and want a second set of eyes, I do free AI Teardowns — a 30-45 min video showing exactly where your pipeline is breaking and how to fix it.

No pitch. Just clarity.

AI Implementation for B2B SaaS | AnimaNova Labs

Ship production AI features in 6 weeks. For B2B SaaS companies who need AI but can't hire fast enough. No $300K engineer. No 6-month timeline.
