Most "chat with your website" projects ship without any measurement. Mine did too. The live demo was up, answers looked plausible, and I moved on. Then I built a proper evaluation harness and found out exactly how wrong "looks plausible" is as a quality signal.
This post covers the eval design, the bugs it caught, the prompt changes that fixed most of them, and the two metrics that still don't pass threshold after all the fixes. The failures are the interesting part.
The System
Web Intelligence is a RAG pipeline that turns any public URL into a queryable knowledge base. You give it a URL, it crawls up to 50 pages, chunks and embeds the text with Pinecone's multilingual-e5-large, and stores vectors in a serverless Pinecone index. At query time, top-k chunks are retrieved and passed to an LLM (Gemini 2.0 Flash or a local Ollama model) with a strict context-only prompt.
Nothing exotic. The evaluation harness is the part I want to talk about.
Eval Design: The Answerable/Unanswerable Split
The most important design decision comes before you write a single metric: split your question bank.
All eval questions
├── Answerable → Hit@k · MRR · Faithfulness · Hallucination · Ctx Coverage
└── Unanswerable → Rejection Rate (did the system correctly refuse?)
This matters because they measure fundamentally different behaviours. An unanswerable question where the system correctly refuses should not contribute Hit@1 = 0 to your retrieval average. Before I introduced the split, three out-of-scope questions were dragging down the Hit@k numbers, and there was no metric at all for whether the refusals were happening. The system was getting credit for nothing and penalised for things it was doing right.
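In code, the split is just a flag on each question plus separate metric routing. A minimal sketch, assuming a list-of-dicts question bank (the unanswerable example question and the field names are illustrative, not the actual harness schema):

```python
REFUSAL = "Sorry, I couldn't find this information. Please try another question."

# Illustrative question-bank entries; the real bank has 5 answerable + 3 unanswerable questions
questions = [
    {"id": "Q01", "text": "What does Amazon do?", "keywords": ["ecommerce"], "answerable": True},
    {"id": "Q06", "text": "What is Amazon's share price today?", "answerable": False},
]

def split_questions(bank):
    # Route each question to its own metric track so a correct refusal
    # never counts as a retrieval miss, and vice versa
    answerable = [q for q in bank if q["answerable"]]
    unanswerable = [q for q in bank if not q["answerable"]]
    return answerable, unanswerable

def is_refusal(answer: str) -> bool:
    # Matches the exact refusal string the prompt asks for (see Issue 3)
    return answer.strip().startswith(REFUSAL)

def rejection_rate(answers: list[str]) -> float:
    return sum(is_refusal(a) for a in answers) / len(answers) if answers else 0.0
```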
The baseline: aboutamazon.com, 5 answerable questions + 3 unanswerable questions, top_k=5. Small sample - I'll address that.
Issue 1: Hit@1 Was 60% for the Wrong Reason
Two of five questions scored Hit@1 = 0. For Q01 ("What does Amazon do?"), the top-ranked chunk by cosine similarity (0.857) was Amazon's mission statement, which is clearly relevant. But my ground-truth keyword was "ecommerce" and the chunk text used "e-commerce" with a hyphen, so the hit check failed.
import re

# Original - breaks on surface-form variants like "e-commerce" vs "ecommerce"
def chunk_hit(chunk_text, keywords):
    text = chunk_text.lower()
    return any(kw in text for kw in keywords)

# Fixed - normalise whitespace, hyphens and underscores before comparison
def _norm_kw(s: str) -> str:
    return re.sub(r'[\s\-_]', '', s.lower())

def chunk_hit(chunk_text, keywords):
    norm_text = _norm_kw(chunk_text)
    return any(_norm_kw(kw) in norm_text for kw in keywords)
Result: Hit@1 60% → 80%.
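For completeness, Hit@k and MRR are straightforward on top of chunk_hit, given the retrieved chunks in ranked order. A minimal sketch (helper names are mine, not necessarily the notebook's):

```python
def hit_at_k(ranked_chunks, keywords, k=5):
    # 1 if any of the top-k chunks matches a ground-truth keyword, else 0
    return int(any(chunk_hit(c, keywords) for c in ranked_chunks[:k]))

def mrr_at_k(ranked_chunks, keywords, k=5):
    # Reciprocal rank of the first matching chunk in the top k; 0 if nothing matches
    for rank, chunk in enumerate(ranked_chunks[:k], start=1):
        if chunk_hit(chunk, keywords):
            return 1.0 / rank
    return 0.0
```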
Q03 had a harder problem alongside the normalisation bug: the top chunk genuinely addressed Amazon's mission rather than its business lines, which is what the question targeted. That's a ranking problem. The embedding is working correctly - the mission statement is semantically related to "what Amazon does" - but a cross-encoder re-ranker scoring (query, chunk) pairs jointly would promote the more task-relevant chunk. That fix is still pending.
Issue 2: Hallucination Was 41% but the Metric Was Partly Lying
Before the prompt fix, hallucination averaged 41%. After the fix, it dropped to 28%. But the story of why it was 41% is more useful than the number.
The hallucination metric is 1 - ctx_coverage, where:
ctx_coverage = |answer_tokens ∩ context_tokens| / |answer_tokens|
Both token sets are computed with NLTK stopwords removed. The problem: verbosity inflates this metric without representing actual fabrication.
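A minimal sketch of the raw metric, assuming simple regex tokenisation and the NLTK stopword list (helper names are illustrative; the actual notebook may tokenise differently):

```python
import re
from nltk.corpus import stopwords  # assumes nltk.download("stopwords") has been run

STOP = set(stopwords.words("english"))

def _content_tokens(text: str) -> set:
    # Lowercase, split on non-word characters, drop stopwords and empty strings
    return {t for t in re.split(r"\W+", text.lower()) if t and t not in STOP}

def ctx_coverage(answer: str, context: str) -> float:
    ans, ctx = _content_tokens(answer), _content_tokens(context)
    return len(ans & ctx) / len(ans) if ans else 1.0

def hallucination(answer: str, context: str) -> float:
    return 1.0 - ctx_coverage(answer, context)
```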
With my original prompt ("Prioritise the provided context", "Under 400 words"), answers averaged 219 words. The LLM produced long, connector-heavy responses. Words like "Overall", "As a result", "combining", "leveraging" don't appear in the retrieved chunks — but they're not factual claims either. They counted as hallucinated tokens.
I separated these two failure modes:
| Mode | Example | Factual Risk |
|---|---|---|
| LLM knowledge leakage | "Career Choice", "The Climate Pledge" inserted from training | High |
| Connector expansion | "Overall, Amazon combines…", "As a result…" | Low |
The fix: a hallucination_cw metric that counts only content words of at least 5 characters and excludes connector vocabulary like "overall", "result", and "based". The verbosity_score field (max(0, (words − 150) / 150)) quantifies how much of the raw metric is inflation.
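A rough sketch of both fields under those assumptions, reusing the _content_tokens helper from the sketch above (the connector stoplist here is illustrative):

```python
CONNECTORS = {"overall", "result", "based", "combining", "leveraging"}  # illustrative stoplist

def hallucination_cw(answer: str, context: str, min_len: int = 5) -> float:
    # Content-word hallucination: only longer, non-connector tokens are checked against the context
    ans = {t for t in _content_tokens(answer) if len(t) >= min_len and t not in CONNECTORS}
    ctx = _content_tokens(context)
    return 1.0 - len(ans & ctx) / len(ans) if ans else 0.0

def verbosity_score(answer: str, cap: int = 150) -> float:
    # How far past the word cap the answer runs, as a fraction of the cap
    words = len(answer.split())
    return max(0.0, (words - cap) / cap)
```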
Issue 3: The Prompt Was Too Soft
The original prompt:
prompt = f"""You are a website content assistant.
Prioritise the provided context when answering.
Under 400 words.
CONTEXT:
{context}
QUESTION:
{question}"""
"Prioritise" is not a constraint. The LLM treated it as a suggestion. On Amazon-specific questions, it injected training knowledge: product names, operational statistics, initiatives that weren't in any retrieved chunk.
The fixed prompt (current rag.py):
prompt = f"""You are a website content assistant. Answer ONLY using the text in the CONTEXT section below.
Rules:
- ONLY use information explicitly present in the CONTEXT. Do not add facts, names, or details from your training knowledge.
- If the context has nothing relevant, respond exactly: "Sorry, I couldn't find this information. Please try another question."
- Be concise and specific. No filler, no elaboration beyond what the context states.
- Under 150 words. If the question genuinely requires more, cap at 200 words maximum.
CONTEXT:
{context}
QUESTION:
{question}
ANSWER (cite only what the CONTEXT states):"""
Before/after:
| Metric | Before | After | Threshold |
|---|---|---|---|
| Avg words | 219 | 97 | ≤ 150 |
| Hallucination (raw) | ~41% | 27% | — |
| Hallucination (CW) ★ | ~41% | 28% | ≤ 25% |
| Ctx Coverage | 59% | 73% | ≥ 65% |
The Two Metrics That Still Fail
Honest reporting: two checks are still red after all the fixes.
1. Hallucination (CW) 28% vs 25% threshold
Three points off. The verbosity fix removed most of the inflation. What remains is genuine leakage: 2 to 3 content words per answer that came from training knowledge rather than retrieved chunks. The 150-word cap reduced it but didn't eliminate it. The next step is LLM-as-judge faithfulness (RAGAS-style claim decomposition) to measure actual factual correctness rather than surface-form overlap.
2. KW Overlap 53% vs 75% threshold
This one is partly self-inflicted. Before the word-cap fix, KW overlap was 83%: answers were long enough to include all expected keywords. After the 150-word cap, shorter but still correct answers naturally contain fewer of the expected keywords; some simply dropped out. The keyword set was calibrated for 200-word answers. Two options: tighten to 2–3 high-signal keywords per question, or weight by TF-IDF importance so that high-information terms count more (sketch below).
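The TF-IDF route could look roughly like this, fitting on the crawled page texts so rare, high-information keywords carry more weight (scikit-learn assumed; single-word keywords only, since the default tokeniser splits phrases):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def weighted_kw_overlap(answer: str, keywords: list[str], page_texts: list[str]) -> float:
    # IDF weights from the crawled corpus: rarer keywords contribute more to the score
    vec = TfidfVectorizer(stop_words="english").fit(page_texts)
    idf = dict(zip(vec.get_feature_names_out(), vec.idf_))
    weights = {kw: idf.get(kw.lower(), 1.0) for kw in keywords}
    answer_lower = answer.lower()
    matched = sum(w for kw, w in weights.items() if kw.lower() in answer_lower)
    return matched / sum(weights.values()) if weights else 0.0
```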
Full Results Summary
| Track | Metric | Before | After | Threshold | Status |
|---|---|---|---|---|---|
| Answerable | Hit@1 | 60% | 80% | ≥ 80% | ✅ |
| Answerable | Hit@5 | 100% | 100% | ≥ 95% | ✅ |
| Answerable | MRR@5 | 0.767 | 0.883 | ≥ 0.75 | ✅ |
| Answerable | Hallucination (CW) | ~41% | 28% | ≤ 25% | ❌ |
| Answerable | Ctx Coverage | 59% | 73% | ≥ 65% | ✅ |
| Answerable | KW Overlap | 83% | 53% | ≥ 75% | ❌ |
| Answerable | Avg Words | 219 | 97 | ≤ 150 | ✅ |
| Unanswerable | Rejection Rate | unmeasured | 100% | ≥ 90% | ✅ |
Scope note: one site, 8 questions. These are directional signals, not a production-grade benchmark.
What I'd Do Next
Cross-encoder re-ranking - replace bi-encoder-only ranking with an ms-marco-MiniLM-L-6-v2 cross-encoder as a second-pass re-ranker (see the sketch after this list). Expected Hit@1 improvement: 80% → 90%+.
LLM-as-judge faithfulness - RAGAS-style: decompose each answer into atomic claims and verify each claim against retrieved chunks. Slower and costs tokens but measures actual correctness instead of token overlap.
Answer-length calibration - run the eval at word caps of 100/125/150/175 and plot hallucination (CW) vs KW overlap. Find the Pareto-optimal cap where both pass threshold simultaneously.
Keyword set recalibration - reduce to 2–3 high-signal terms per question, or adopt TF-IDF weighting.
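For the re-ranking step, a minimal sketch with the sentence-transformers CrossEncoder; the chunk list is whatever the existing bi-encoder retrieval returns:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    # Score every (query, chunk) pair jointly, then keep the highest-scoring chunks
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```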
Code and Demo
GitHub repo: web-intelligence
Live demo: web-intelligence-red.vercel.app
The eval notebook is at backend/rag_eval_single.ipynb. Results JSON written to data/eval_single_<site>_<date>.json on each run.
If you've built RAG eval harnesses and hit similar issues, especially the verbosity/hallucination conflation, I'd like to hear how you handled it ☺️.
