Shiva Shrestha
Building a RAG Evaluation Harness That Actually Catches Problems

Most "chat with your website" projects ship without any measurement. Mine did too. The live demo was up, answers looked plausible, and I moved on. Then I built a proper evaluation harness and found out exactly how wrong "looks plausible" is as a quality signal.

This post covers the eval design, the bugs it caught, the prompt changes that fixed most of them, and the two metrics that still don't pass threshold after all the fixes. The failures are the interesting part.


The System

Web Intelligence is a RAG pipeline that turns any public URL into a queryable knowledge base. You give it a URL, it crawls up to 50 pages, chunks and embeds the text with Pinecone's multilingual-e5-large, and stores vectors in a serverless Pinecone index. At query time, top-k chunks are retrieved and passed to an LLM (Gemini 2.0 Flash or a local Ollama model) with a strict context-only prompt.

Nothing exotic. The evaluation harness is the part I want to talk about.


Eval Design: The Answerable/Unanswerable Split

Before writing a single metric, the most important design decision is splitting your question bank.

All eval questions
├── Answerable    → Hit@k · MRR · Faithfulness · Hallucination · Ctx Coverage
└── Unanswerable  → Rejection Rate (did the system correctly refuse?)

This matters because they measure fundamentally different behaviours. An unanswerable question where the system correctly refuses should not contribute Hit@1 = 0 to your retrieval average. Before I introduced the split, three out-of-scope questions were dragging down the Hit@k numbers, and there was no metric at all for whether the refusals were happening. The system was getting credit for nothing and penalised for things it was doing right.

The baseline: aboutamazon.com, 5 answerable questions + 3 unanswerable questions, top_k=5. Small sample - I'll address that.
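A minimal sketch of what the split looks like in code. The field names and the `aggregate` helper are illustrative, not the repo's actual schema; the point is that averaging only happens after the tracks are separated:

```python
from statistics import mean

# The exact refusal string the prompt asks the LLM to emit on out-of-scope questions
REFUSAL = "Sorry, I couldn't find this information. Please try another question."

def aggregate(results):
    """Split eval results into the two tracks before averaging anything.

    Each result is a dict with (at least) 'answerable', 'hit1', and 'answer'.
    Answerable questions feed the retrieval metrics; unanswerable questions
    only feed the rejection rate, so a correct refusal never drags Hit@k down.
    """
    answerable = [r for r in results if r["answerable"]]
    unanswerable = [r for r in results if not r["answerable"]]
    return {
        # Retrieval quality, averaged over the answerable track only
        "hit@1": mean([r["hit1"] for r in answerable]) if answerable else None,
        # Did the system refuse when it should have?
        "rejection_rate": (
            mean([REFUSAL in r["answer"] for r in unanswerable])
            if unanswerable else None
        ),
    }
```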


Issue 1: Hit@1 Was 60% for the Wrong Reason

Two of the five questions scored Hit@1 = 0. For Q01 ("What does Amazon do?"), the top-ranked chunk by cosine similarity (0.857) was Amazon's mission statement, which is clearly relevant. But my ground-truth keyword was "ecommerce" and the chunk text used "e-commerce" with a hyphen.

# Original - breaks on surface-form variants
def chunk_hit(chunk_text, keywords):
    text = chunk_text.lower()
    return any(kw in text for kw in keywords)

# Fixed - strip whitespace, hyphens, and underscores before comparison
import re

def _norm_kw(s: str) -> str:
    return re.sub(r'[\s\-_]', '', s.lower())

def chunk_hit(chunk_text, keywords):
    norm_text = _norm_kw(chunk_text)
    return any(_norm_kw(kw) in norm_text for kw in keywords)

Result: Hit@1 60% → 80%.

Q03 had a harder problem alongside the normalisation bug: the top chunk genuinely addressed Amazon's mission rather than its business lines, which is what the question targeted. That's a ranking problem. The embedding is working correctly - the mission statement is semantically related to "what Amazon does" - but a cross-encoder re-ranker scoring (query, chunk) pairs jointly would promote the more task-relevant chunk. That fix is still pending.
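A sketch of what that second pass could look like. The scoring call is factored out as `score_fn` so the re-ranking logic stands on its own; in practice it would be something like `sentence_transformers.CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict`, which is an assumption about the eventual wiring, not code from the repo:

```python
def rerank(query, chunks, score_fn, top_k=5):
    """Second-pass re-rank: score (query, chunk) pairs jointly, keep the best top_k.

    score_fn takes a list of (query, chunk) pairs and returns one relevance
    score per pair. A cross-encoder reads both texts together, so it can
    prefer the chunk that answers *this* question over one that is merely
    on-topic -- exactly the Q03 failure mode.
    """
    pairs = [(query, chunk) for chunk in chunks]
    scores = score_fn(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

The bi-encoder's top-k (say, 20) becomes the candidate pool; the cross-encoder only has to rank those, which keeps the added latency bounded.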


Issue 2: Hallucination Was 41% but the Metric Was Partly Lying

Before the prompt fix, hallucination averaged 41%. After the fix, it dropped to 28%. But the story of why it was 41% is more useful than the number.

The hallucination metric is 1 - ctx_coverage, where:

ctx_coverage = |answer_tokens ∩ context_tokens| / |answer_tokens|

Tokens are lower-cased and NLTK stopwords are removed first. The problem: verbosity inflates this metric without representing actual fabrication.

With my original prompt ("Prioritise the provided context", "Under 400 words"), answers averaged 219 words. The LLM produced long, connector-heavy responses. Words like "Overall", "As a result", "combining", "leveraging" don't appear in the retrieved chunks — but they're not factual claims either. They counted as hallucinated tokens.
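The inflation effect is easy to demonstrate. A minimal sketch, with a tiny inline stopword list standing in for NLTK's and illustrative example strings:

```python
import re

# Stand-in for NLTK's stopword list; the real eval uses nltk.corpus.stopwords
STOPWORDS = {"the", "a", "an", "of", "and", "is", "as", "to", "in"}

def tokens(text):
    """Lower-cased word set with stopwords removed."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def ctx_coverage(answer, context):
    """Fraction of answer tokens that also appear in the retrieved context."""
    ans = tokens(answer)
    return len(ans & tokens(context)) / len(ans) if ans else 1.0

context = "Amazon is an e-commerce and cloud computing company."

# Terse answer: every content word comes from the context, coverage is perfect
terse = "Amazon is an e-commerce company."

# Verbose answer: no new facts, but the connectors drag coverage down anyway
verbose = "Overall, Amazon combines e-commerce and cloud computing, leveraging both."
```

With these inputs, the terse answer scores full coverage while the verbose one loses points for "overall", "combines", "leveraging", and "both", none of which is a factual claim.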

I separated these two failure modes:

Mode                    Example                                                        Factual risk
LLM knowledge leakage   "Career Choice", "The Climate Pledge" inserted from training   High
Connector expansion     "Overall, Amazon combines…", "As a result…"                    Low

The fix: a hallucination_cw metric that counts only content words of at least five characters, with connector vocabulary ("overall", "result", "based") excluded as well. The verbosity_score field (max(0, (words − 150) / 150)) quantifies how much of the raw metric is inflation.
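Roughly what the two new fields compute, as a self-contained sketch. The connector list here is hypothetical; the real eval's filters may differ:

```python
import re

# Hypothetical connector stop-list; illustrative, not the repo's actual list
CONNECTORS = {"overall", "result", "based", "combining", "leveraging"}

def hallucination_cw(answer, context, min_len=5):
    """Content-word hallucination: share of substantive answer words not in context."""
    def content_words(text):
        return {w for w in re.findall(r"[a-z\-]+", text.lower())
                if len(w) >= min_len and w not in CONNECTORS}
    ans = content_words(answer)
    return len(ans - content_words(context)) / len(ans) if ans else 0.0

def verbosity_score(answer):
    """How far past the 150-word target the answer runs, as a fraction."""
    n_words = len(answer.split())
    return max(0, (n_words - 150) / 150)
```

With this filter, "Overall" no longer counts against the answer, but a leaked proper noun like "Career Choice" still does, which is the behaviour the metric is after.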


Issue 3: The Prompt Was Too Soft

The original prompt:

prompt = f"""You are a website content assistant. 
Prioritise the provided context when answering.
Under 400 words.

CONTEXT:
{context}

QUESTION:
{question}"""

"Prioritise" is not a constraint. The LLM treated it as a suggestion. On Amazon-specific questions, it injected training knowledge: product names, operational statistics, initiatives that weren't in any retrieved chunk.

The fixed prompt (current rag.py):

prompt = f"""You are a website content assistant. Answer ONLY using the text in the CONTEXT section below.

Rules:
- ONLY use information explicitly present in the CONTEXT. Do not add facts, names, or details from your training knowledge.
- If the context has nothing relevant, respond exactly: "Sorry, I couldn't find this information. Please try another question."
- Be concise and specific. No filler, no elaboration beyond what the context states.
- Under 150 words. If the question genuinely requires more, cap at 200 words maximum.

CONTEXT:
{context}

QUESTION:
{question}

ANSWER (cite only what the CONTEXT states):"""

Before/after:

Metric                 Before   After   Threshold
Avg words              219      97      ≤ 150
Hallucination (raw)    ~41%     27%     –
Hallucination (CW) ★   ~41%     28%     ≤ 25%
Ctx Coverage           59%      73%     ≥ 65%

The Two Metrics That Still Fail

Honest reporting: two checks are still red after all the fixes.

1. Hallucination (CW) 28% vs 25% threshold

Three points off. The verbosity fix removed most of the noise; what remains is genuine leakage, 2 to 3 content words per answer that came from training knowledge rather than retrieved chunks. The 150-word cap reduced it but didn't eliminate it. The next step is LLM-as-judge faithfulness (RAGAS-style claim decomposition) to measure actual factual correctness rather than surface-form overlap.

2. KW Overlap 53% vs 75% threshold

This one is partly self-inflicted. Before the word-cap fix, KW overlap was 83% — answers were long enough to include all expected keywords. After the 150-word cap, shorter correct answers naturally contain fewer words, including some expected keywords that dropped out. The keyword set was calibrated for 200-word answers. Two options: tighten to 2–3 high-signal keywords per question, or weight by TF-IDF importance so that high-information terms count more.
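The TF-IDF option could look like this sketch. `idf_weights` and `weighted_kw_overlap` are hypothetical names, with IDF computed over the crawled chunks so rare, high-information keywords count for more:

```python
import math

def idf_weights(keywords, corpus):
    """Smoothed IDF per keyword over the crawled chunk corpus."""
    n = len(corpus)
    return {
        kw: math.log((n + 1) / (1 + sum(kw in doc.lower() for doc in corpus))) + 1
        for kw in keywords
    }

def weighted_kw_overlap(answer, keywords, weights):
    """Keyword recall where each hit is weighted by the keyword's IDF."""
    text = answer.lower()
    total = sum(weights[kw] for kw in keywords)
    hit = sum(weights[kw] for kw in keywords if kw in text)
    return hit / total if total else 0.0
```

A short answer that keeps the distinctive keywords and drops the generic ones would then pass, instead of being penalised for omitting words the 200-word calibration expected.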


Full Results Summary

Track         Metric               Before       After    Threshold   Status
Answerable    Hit@1                60%          80%      ≥ 80%       pass
Answerable    Hit@5                100%         100%     ≥ 95%       pass
Answerable    MRR@5                0.767        0.883    ≥ 0.75      pass
Answerable    Hallucination (CW)   ~41%         28%      ≤ 25%       fail
Answerable    Ctx Coverage         59%          73%      ≥ 65%       pass
Answerable    KW Overlap           83%          53%      ≥ 75%       fail
Answerable    Avg Words            219          97       ≤ 150       pass
Unanswerable  Rejection Rate       unmeasured   100%     ≥ 90%       pass

Scope note: one site, 8 questions. These are directional signals, not a production-grade benchmark.


What I'd Do Next

Cross-encoder re-ranking - replace bi-encoder-only ranking with a ms-marco-MiniLM-L-6-v2 cross-encoder as a second-pass re-ranker. Expected Hit@1 improvement: 80% → 90%+.

LLM-as-judge faithfulness - RAGAS-style: decompose each answer into atomic claims and verify each claim against retrieved chunks. Slower and costs tokens but measures actual correctness instead of token overlap.

Answer-length calibration - run the eval at word caps of 100/125/150/175 and plot hallucination (CW) vs KW overlap. Find the Pareto-optimal cap where both pass threshold simultaneously.

Keyword set recalibration - reduce to 2–3 high-signal terms per question, or adopt TF-IDF weighting.


Code and Demo

GitHub repo: web-intelligence

Live demo: web-intelligence-red.vercel.app

The eval notebook is at backend/rag_eval_single.ipynb. Results JSON is written to data/eval_single_<site>_<date>.json on each run.

If you've built RAG eval harnesses and hit similar issues, especially the verbosity/hallucination conflation, I'd like to hear how you handled it ☺️.
