Dogukan Karademir

Posted on Jun 28

My RAG Benchmark Is Lying to Me

#rag #llm #springai #ollama

I built a benchmark to find the best local LLM for my RAG system. After some runs, I'm less confident in the results than when I started — and I think that's the more useful story.

Here's the specific problem that broke my assumptions.

The Setup

Kenning is a Spring Boot RAG backend: Spring AI, pgvector, Ollama, Apache Tika for PDF parsing. You upload a document, ask questions, get answers grounded only in that document.

I built a benchmark to test six local models: llama3.1:8b, llama3.2:3b, qwen2.5:7b, gemma2:9b, mistral:7b, phi4:14b.

Four question categories, judged blind by qwen2.5:14b:

IN_CONTEXT — answer is in the document
OUT_OF_CONTEXT — answer isn't; model must refuse
PARTIAL_CONTEXT — partial information; model must say what it found and what's missing
MULTI_CHUNK — answer spans multiple sections

Each question is scored out of 25, so the maximum is 875 points per model at 35 questions (500 points for the initial 20-question run).

First Problem: The Ceiling Effect

First run, 20 questions on Attention Is All You Need (the Transformer paper):

Model	Score
`qwen2.5:7b`	481/500 — 96.2%
`llama3.1:8b`	475/500 — 95.0%
`phi4:14b`	474/500 — 94.8%
`gemma2:9b`	473/500 — 94.6%
`llama3.2:3b`	466/500 — 93.2%
`mistral:7b`	463/500 — 92.6%

IN_CONTEXT category: every single model averaged 25/25. Perfect score.

This is what a useless benchmark looks like. Questions like "How many attention heads does the Transformer use?" are trivially easy if the retrieved chunk contains h = 8. I wasn't measuring model capability — I was measuring whether models can read.

I added 15 harder questions and rewrote the chunking.

The Rewrite That Changed Everything

The original code used TokenTextSplitter with default settings. I changed it to 200-token chunks with 100-token overlap between adjacent chunks:

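// Split the parsed documents into chunks of roughly 200 tokens.
TokenTextSplitter splitter = TokenTextSplitter.builder()
    .withChunkSize(200)
    .withKeepSeparator(true)
    .build();

List<Document> chunks = splitter.apply(documents);
// Add 100 tokens of overlap between adjacent chunks.
List<Document> overlapped = overlapAppender.addOverlap(chunks, 100);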

The idea: information lost at chunk boundaries (a sentence split across two chunks is fully represented in neither) would be preserved by overlapping.
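The overlapAppender helper is not shown in this post, so here is a minimal sketch of what such a step might do. It assumes chunks are already lists of tokens and that "overlap" means prepending the last N tokens of the previous chunk. OverlapSketch and its addOverlap method are hypothetical names, and whitespace-split words stand in for real tokenizer tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: not the Kenning implementation and not a Spring AI API.
final class OverlapSketch {

    // Prepends up to `overlap` trailing tokens of the previous chunk to each chunk.
    static List<List<String>> addOverlap(List<List<String>> chunks, int overlap) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            List<String> current = new ArrayList<>();
            if (i > 0) {
                // Always read from the original (non-overlapped) previous chunk.
                List<String> previous = chunks.get(i - 1);
                int from = Math.max(0, previous.size() - overlap);
                current.addAll(previous.subList(from, previous.size()));
            }
            current.addAll(chunks.get(i));
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> chunks = List.of(
                List.of("we", "employ", "h", "=", "8"),
                List.of("parallel", "attention", "layers"));
        // The second chunk now starts with the tail of the first.
        System.out.println(addOverlap(chunks, 3));
    }
}
```

Under this assumption, a 200-token chunk with 100 tokens of overlap grows to as many as 300 tokens. Every stored chunk then embeds differently than before, which is one way the retrieved top 5 could shift.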

New results on 35 questions, same document:

Model	Score
`phi4:14b`	839/875 — 95.9%
`qwen2.5:7b`	822/875 — 93.9%
`gemma2:9b`	818/875 — 93.5%
`llama3.1:8b`	815/875 — 93.1%
`mistral:7b`	780/875 — 89.1%
`llama3.2:3b`	771/875 — 88.1%

The ranking changed. phi4:14b, which was 3rd before, now leads. The spread grew from 3.6 to 7.8 percentage points.
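The spread figures follow directly from the two tables. Here is a small sketch of the arithmetic, with the scores copied from above; the class and method names are illustrative only.

```java
import java.util.Collections;
import java.util.List;

// Illustrative arithmetic only; scores are copied from the two result tables.
final class SpreadSketch {

    // Spread = best percentage minus worst percentage, in percentage points.
    static double spread(List<Integer> scores, int maxScore) {
        double best = 100.0 * Collections.max(scores) / maxScore;
        double worst = 100.0 * Collections.min(scores) / maxScore;
        return best - worst;
    }

    public static void main(String[] args) {
        List<Integer> firstRun = List.of(481, 475, 474, 473, 466, 463);   // out of 500
        List<Integer> secondRun = List.of(839, 822, 818, 815, 780, 771);  // out of 875
        System.out.printf("first run:  %.1f pp%n", spread(firstRun, 500));
        System.out.printf("second run: %.1f pp%n", spread(secondRun, 875));
    }
}
```

A wider spread only says the models separated more on the second run. It does not say whether the harder questions or the new chunking caused the separation.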

Here's the Problem

I changed two things at the same time: the chunking strategy and the question difficulty. I can't isolate which change drove the ranking shift.

And I can prove the chunking changed what models actually saw.

Question q01: "How many attention heads does the base Transformer use?" — categorized as IN_CONTEXT because the answer (h = 8) is in the paper.

Original chunking: retrieved a chunk containing h = 8. Model answered correctly.

New chunking: retrieved chunks about multi-head attention applications. The specific h = 8 chunk was no longer in the top 5 by similarity score. phi4:14b correctly said: "The provided context does not specify the number of attention heads."

Judge score: 25/25. The model isn't lying — it answered correctly given what it received.

But the system failed the user. That question is answerable. The document has the answer. The retrieval missed it.

So here's what I was actually measuring: model behavior given what my chunking strategy retrieved — not model capability. The "model benchmark" was really a "chunking configuration benchmark." I just didn't realize it until the results changed.
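One way to make that distinction visible is to record, per question, whether the chunk holding the answer was retrieved at all, and only then look at the answer. The sketch below assumes each question has a hand-labeled gold chunk ID, which my benchmark does not currently have. Outcome, classify, and the chunk IDs are hypothetical.

```java
import java.util.List;

// Hypothetical sketch: separates retrieval misses from model errors per question.
final class OutcomeSketch {

    enum Outcome { ANSWERED, MODEL_ERROR, RETRIEVAL_MISS }

    // goldChunkId:   chunk known to contain the answer (labeled by hand).
    // retrievedIds:  chunk IDs passed to the model, best match first.
    // answerCorrect: whether the model's answer matched the document.
    static Outcome classify(String goldChunkId, List<String> retrievedIds, boolean answerCorrect) {
        if (!retrievedIds.contains(goldChunkId)) {
            // The model never saw the answer, so this counts against the pipeline,
            // regardless of how reasonable the model's refusal was.
            return Outcome.RETRIEVAL_MISS;
        }
        return answerCorrect ? Outcome.ANSWERED : Outcome.MODEL_ERROR;
    }

    public static void main(String[] args) {
        // q01 under the new chunking: the h = 8 chunk ("c07" here) is absent from the top 5.
        List<String> top5 = List.of("c12", "c13", "c20", "c21", "c30");
        System.out.println(classify("c07", top5, false));
    }
}
```

With a label like this, q01 would be counted as a retrieval miss instead of a 25/25 for the model.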

The Second Document Made It Worse

I added a second document — NIST SP 800-63B, a US federal authentication standard. ~70 pages of SHALL/SHOULD requirements, distributed across sections and tables. Nothing like an academic paper.

Same question structure, same judge, same chunking.

Model	Transformer paper	NIST	Drop
`phi4:14b`	95.9%	90.9%	−5.0 pp
`mistral:7b`	89.1%	88.6%	−0.5 pp
`qwen2.5:7b`	93.9%	87.8%	−6.1 pp
`gemma2:9b`	93.5%	83.4%	−10.1 pp
`llama3.1:8b`	93.1%	83.2%	−9.9 pp
`llama3.2:3b`	88.1%	79.3%	−8.8 pp

mistral:7b went from 5th to 2nd. gemma2:9b dropped 10 percentage points and posted the worst category score in the entire dataset (17.1/25 average in PARTIAL_CONTEXT on NIST).

Now I have two explanations and no way to distinguish them:

Explanation A: These are real model differences. Some models handle technical regulatory text better than dense academic prose. mistral is more stable across document types; gemma2 is more brittle.

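Explanation B: These are chunking differences. How well a given chunking configuration works depends on the document, so the same 200-token split could serve one document well and fragment another. The score drops would then reflect what was retrieved, not what the models can do.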

Recent research highlights exactly how much the structure of a document dictates the winning pipeline. For instance, a February 2026 benchmark by Vecta evaluating 7 chunking strategies across 50 academic papers found that standard recursive 512-token splitting took 1st place with 69% accuracy. In that specific domain, semantic chunking tanked at 54% because it over-fragmented the text, producing tiny snippets averaging just 43 tokens that stripped away crucial context. For a standard academic paper, fixed-size or recursive chunking is often perfectly fine or even superior.

Conversely, when dealing with complex, non-linear layouts, fixed token limits completely collapse. A separate study evaluating structured/clinical documents found that adaptive, theme-boundary chunking reached 87% accuracy, while fixed-size baselines plummeted to a dismal 13%.

This offers one reading of my results, though not a proven one. My naive 200-token split with 100-token overlap may have happened to work reasonably well for the uniform, dense layout of the Transformer paper. Applied to a 70-page regulatory standard like NIST, where a single requirement might be scattered across cross-referenced sections and multi-row tables, it likely cut the text at arbitrary points. Under that reading, models like gemma2 would be more sensitive to fragmented context, while mistral would be more resilient to poorly sliced context.

The takeaway isn't that semantic chunking is a silver bullet—it's that a one-size-fits-all chunking pipeline is fundamentally broken. The experiment that would actually prove this — running the same models with multiple chunking configurations (fixed vs. semantic vs. structure-aware) on the exact same document — is the one I didn't do.
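Here is a sketch of what that run matrix could look like, assuming three chunking strategies and the two documents from this post. Nothing here executes a benchmark; it only enumerates configurations so that one factor changes at a time. The strategy and document labels are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical run matrix: every model sees every chunking strategy on every document.
final class RunMatrixSketch {

    record Run(String model, String chunking, String document) {}

    static List<Run> enumerate(List<String> models, List<String> chunkings, List<String> documents) {
        List<Run> runs = new ArrayList<>();
        for (String document : documents) {
            for (String chunking : chunkings) {
                for (String model : models) {
                    runs.add(new Run(model, chunking, document));
                }
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        List<String> models = List.of("llama3.1:8b", "llama3.2:3b", "qwen2.5:7b",
                "gemma2:9b", "mistral:7b", "phi4:14b");
        List<String> chunkings = List.of("fixed-200-overlap-100", "semantic", "structure-aware");
        List<String> documents = List.of("transformer-paper", "nist-sp-800-63b");
        System.out.println(enumerate(models, chunkings, documents).size() + " runs");
    }
}
```

Comparing models within one (chunking, document) cell holds the pipeline constant. Comparing cells for a single model shows the chunking effect.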

What I'd Actually Need to Know Which Model to Pick

Multiple chunking strategies per document type, held constant while varying models
Retrieval quality metrics separate from answer quality (MRR, Recall@5 — did the right chunk even make it into the top 5?)
Multiple judge models, not just one (my judge could have systematic biases I can't detect)
Real user questions from actual sessions, not questions I wrote after reading the document myself
Multiple runs per model to account for non-determinism

Without these, the ranking I have is a ranking of "this specific pipeline configuration," not "these models."
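The retrieval metrics in that list are cheap to compute once each question is labeled with the chunk or chunks that contain its answer. Below is a minimal sketch of Recall@k and MRR under that assumption; the class name, method names, and chunk IDs are all hypothetical. OUT_OF_CONTEXT questions have no relevant chunk and would need to be excluded.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of two retrieval metrics, computed before any model is involved.
final class RetrievalMetricsSketch {

    // Fraction of relevant chunks that appear in the top k retrieved results.
    static double recallAtK(List<String> retrieved, Set<String> relevant, int k) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        long hits = retrieved.stream().limit(k).filter(relevant::contains).distinct().count();
        return (double) hits / relevant.size();
    }

    // 1 / rank of the first relevant chunk, or 0 if none was retrieved.
    static double reciprocalRank(List<String> retrieved, Set<String> relevant) {
        for (int i = 0; i < retrieved.size(); i++) {
            if (relevant.contains(retrieved.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    // MRR: mean of the reciprocal ranks over all questions.
    static double meanReciprocalRank(List<List<String>> retrievedPerQuestion,
                                     List<Set<String>> relevantPerQuestion) {
        double sum = 0.0;
        for (int i = 0; i < retrievedPerQuestion.size(); i++) {
            sum += reciprocalRank(retrievedPerQuestion.get(i), relevantPerQuestion.get(i));
        }
        return retrievedPerQuestion.isEmpty() ? 0.0 : sum / retrievedPerQuestion.size();
    }

    public static void main(String[] args) {
        // Question 1: the relevant chunk is ranked 2nd. Question 2: it is missing.
        List<List<String>> retrieved = List.of(List.of("c3", "c7", "c9"), List.of("c1", "c2", "c4"));
        List<Set<String>> relevant = List.of(Set.of("c7"), Set.of("c8"));
        System.out.println("Recall@5 (q1): " + recallAtK(retrieved.get(0), relevant.get(0), 5));
        System.out.println("MRR: " + meanReciprocalRank(retrieved, relevant));
    }
}
```

If Recall@5 is low for a chunking configuration, the answer scores for that configuration say little about the models.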

The Honest Takeaway

I didn't build a production RAG app. I built an understanding of how much is hidden under "just do RAG."

The thing I expected to matter most — model choice — turned out to be inseparable from chunking strategy, retrieval configuration, and document structure. Changing chunk size doesn't change which model is capable of what. It changes what the model sees. And what the model sees determines everything.

If I had to tell someone one thing before they start benchmarking models for RAG: measure your retrieval quality first. If the right chunks aren't being retrieved, you're not benchmarking models — you're benchmarking whether your similarity search surfaces the right context. Those are very different problems.

Top comments (1)

Tae Kim • Jun 29

Changing chunking and questions simultaneously is the classic confound here — the ranking shift could be entirely from harder questions, entirely from the new chunking, or some mix, and you cannot separate them. The isolation runs are: original chunking + original questions (baseline), original chunking + harder questions (question effect only), new chunking + harder questions (adds chunking effect); delta between the last two is the chunking signal. Your cross-document results actually do show something clean though: the spread widening on NIST vs the Transformer paper is evidence that chunking is document-structure-dependent, not a property of the models. One more variable to watch: qwen2.5:14b as the judge in a benchmark where qwen2.5:7b is a test subject likely inflates the qwen family scores relative to the rest.