Tobias Egner

Posted on May 27 • Originally published at dagentic.de

Most RAG Problems Are R(etrieval) Problems

#ai #llm #rag #machinelearning

Most RAG blog posts read like product brochures. After building a few systems over the last few months and reading way too many production post-mortems, I'm pretty convinced the LLM is usually not the thing that breaks first.

Especially not in EU mid-market deployments.

A few failure modes I see again and again:

1. Retrieval quality falls apart somewhere between 10K and 40K docs

The demo with 500 PDFs looks amazing.

Then the first real pilot starts, somebody uploads 30k documents from SharePoint, and suddenly top-3 retrieval becomes semi-random.

Typical example:
The query is “Lieferantenbewertung 2024” (supplier evaluation 2024).

What comes back:

a supplier evaluation form from 2019
three meeting notes because they contain the word “Lieferant” (supplier)
the actually correct document maybe at rank 4 or 5

This problem is way more common than most tutorials mention.

What people in production seem to converge on:

hybrid retrieval (BM25 + dense)
reciprocal rank fusion
reranker on top (Cohere if budget exists, BGE reranker otherwise)
separate indexes per document type

Honestly, adding a reranker solved more quality issues for us than changing the LLM ever did.
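
For readers who have not seen reciprocal rank fusion before, here is a minimal sketch of how the BM25 and dense result lists can be merged before a reranker sees them. The document IDs and both hit lists are made up for illustration, and k=60 is just the commonly used constant, not something tuned for any real corpus.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical hits for the query "Lieferantenbewertung 2024".
bm25_hits = ["eval_2024", "eval_form_2019", "meeting_03"]
dense_hits = ["meeting_03", "eval_2024", "meeting_07"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

The point of fusing on ranks instead of raw scores is that BM25 scores and cosine similarities live on different scales, so they cannot be added directly. The fused list is what you would hand to the reranker.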

2. German enterprise PDFs are completely cursed

Most demos run on clean PDFs.

Real document stores are:

scanned contracts from 1998
supplier manuals with 3-column layouts
rotated tables
faxed quality reports
old encodings destroying umlauts

pypdf turns many of these into complete garbage text.

Things I have already seen multiple times:

ü becoming weird symbols
tables flattened into unreadable prose
footnotes injected into random sentences
OCR artifacts treated as actual content
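
The umlaut case is at least cheap to detect automatically. A rough sketch, assuming the damage comes from UTF-8 bytes that were decoded as Latin-1 somewhere along the way (the classic "ü" turning into "Ã¼"). The pattern is illustrative and will not catch every kind of encoding damage, and the function names are my own.

```python
import re

# Typical leftovers of UTF-8 decoded as Latin-1, plus the Unicode
# replacement character. Illustrative only, not a complete list.
MOJIBAKE = re.compile(r"[\u00c3\u00c2][\u0080-\u00bf]|\ufffd")

def mojibake_ratio(text):
    """Return suspicious sequences per character (0.0 for empty text)."""
    if not text:
        return 0.0
    return len(MOJIBAKE.findall(text)) / len(text)

def try_repair(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return the input if that fails."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# Simulate the damage on a German word.
broken = "Prüfung".encode("utf-8").decode("latin-1")
print(broken, mojibake_ratio(broken) > 0, try_repair(broken))
```

A ratio above some small threshold is a reasonable signal to send the file to a different parser instead of indexing it as is.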

Current stack that works reasonably okay:

Marker for most docs
Docling as fallback
VLM pass for ugly tables

This preprocessing layer is very unsexy work, but probably 30% of the actual implementation effort.

And if you skip it, the whole RAG quality later becomes fake-good.
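
The primary-then-fallback idea can be sketched as a small cascade with a quality gate in between. The parser functions below are hypothetical stand-ins, not the real Marker or Docling APIs, and the length check is a placeholder for whatever quality heuristic you actually trust.

```python
from typing import Callable, Optional, Sequence, Tuple

Parser = Callable[[str], str]  # takes a file path, returns extracted text

def extract_with_fallback(
    path: str,
    parsers: Sequence[Tuple[str, Parser]],
    is_acceptable: Callable[[str], bool],
) -> Tuple[Optional[str], str]:
    """Try parsers in order; return (parser_name, text) of the first good result.

    Returns (None, "") if every parser fails, so the caller can queue the
    file for manual review instead of indexing garbage.
    """
    for name, parse in parsers:
        try:
            text = parse(path)
        except Exception:
            continue  # a crashing parser just means "try the next one"
        if is_acceptable(text):
            return name, text
    return None, ""

# Hypothetical stand-ins for the real parser calls.
def primary(path):
    return ""  # simulates empty or unusable output

def fallback(path):
    return "Prüfbericht Nr. 12, Ergebnis: bestanden"

name, text = extract_with_fallback(
    "report.pdf",
    [("primary", primary), ("fallback", fallback)],
    is_acceptable=lambda t: len(t.strip()) > 20,
)
print(name)
```

Storing the parser name alongside the text is useful later, because it tells you which documents to reprocess when one of the parsers improves.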

3. Hallucinations are not the real production problem

Every stakeholder asks:
“What about hallucinations?”

Almost nobody asks:
“What if the source itself is outdated?”

From what I’ve seen, this kills more pilots.

The model gives a perfectly grounded answer.
It cites the right document.
The document is just no longer valid.

Or worse:
two valid documents disagree and the system confidently picks one.

What seems to work:

recency decay in retrieval scoring
contradiction checks across retrieved chunks
confidence thresholds + human handoff

A lot of “hallucination problems” are actually retrieval problems wearing a fake mustache.
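
Recency decay in particular is only a few lines. A minimal sketch, assuming every chunk carries a document date in its metadata and that an exponential decay with a one-year half-life is acceptable. Both the half-life and the example hits are invented for illustration, so tune them per document type.

```python
from datetime import date

def recency_weight(doc_date, today, half_life_days=365.0):
    """Exponential decay: the weight halves every half_life_days."""
    age_days = max((today - doc_date).days, 0)
    return 0.5 ** (age_days / half_life_days)

def rescore(hits, today, half_life_days=365.0):
    """hits: list of (doc_id, similarity, doc_date). Returns re-ranked pairs."""
    scored = [
        (doc_id, sim * recency_weight(d, today, half_life_days))
        for doc_id, sim, d in hits
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical case: the older version wins on raw similarity.
hits = [
    ("policy_2022", 0.86, date(2022, 3, 1)),
    ("policy_2024", 0.82, date(2024, 3, 1)),
]
ranked = rescore(hits, today=date(2024, 9, 1))
print([doc_id for doc_id, _ in ranked])
```

This does not solve the case where two current documents disagree. It only stops an outdated version from winning on similarity alone.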

4. Permissions become a disaster very fast

This one appears in basically every internal rollout thread.

The assistant accidentally answers something using an HR spreadsheet or salary export the user should never have seen.

Technically the solution is easy:
permission filtering before semantic retrieval.

In reality:

SharePoint permissions are ancient
metadata missing
nobody knows document ownership anymore
legal says ask IT
IT says ask department head
department head left in 2021

In EU environments this becomes even more annoying because GDPR turns this from “oops” into potentially reportable-incident territory.

Honestly, I would no longer even start a pilot before the customer can explain who should access what.
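
The "filter before retrieval" part can be sketched as follows, assuming each chunk carries the groups that may read its source document. The Chunk type, the group names and the scores are hypothetical. Note the deliberate choice that chunks with missing permission metadata are dropped, not shown.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: frozenset  # groups allowed to read the source document
    score: float = 0.0         # similarity to the query, set by retrieval

def visible_chunks(chunks, user_groups):
    """Keep only chunks the user may read. Chunks without ACL data are dropped."""
    user_groups = frozenset(user_groups)
    return [c for c in chunks if c.allowed_groups & user_groups]

def retrieve(chunks, user_groups, top_k=3):
    """Filter by permission first, then rank what is left."""
    candidates = visible_chunks(chunks, user_groups)
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:top_k]

chunks = [
    Chunk("salary-export-1", "salary table", frozenset({"hr"}), 0.95),
    Chunk("handbook-1", "travel policy", frozenset({"all-staff"}), 0.70),
    Chunk("orphan-1", "unknown owner", frozenset(), 0.90),
]
result = retrieve(chunks, user_groups={"all-staff", "engineering"})
print([c.chunk_id for c in result])
```

In a real vector store this would usually be a metadata filter passed along with the query rather than a Python list comprehension, but the order of operations is the same: the ranking never gets to see what the user may not read.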

5. Re-embedding costs are massively underestimated

Everybody budgets the first embedding run.

Almost nobody budgets:

daily delta updates
re-embedding after model upgrades
vector storage growth
multi-vector indexing

Embedding APIs look cheap until somebody realizes the SharePoint dump contains 800 million tokens.

What seems to become the default setup now:

local embedding models after ~10k docs
incremental indexing pipelines from day one
embedding model versioning in metadata

Otherwise migrations become painful very quickly.
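
A rough sketch of what incremental indexing with model versioning means in practice: store a content hash and the embedding model label next to every vector, and only re-embed what changed. The model label and the metadata layout are assumptions for illustration.

```python
import hashlib

EMBEDDING_MODEL = "example-embedder-v1"  # hypothetical model/version label

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_updates(documents, index_meta, model=EMBEDDING_MODEL):
    """Return (doc IDs to embed, doc IDs to delete).

    documents:  {doc_id: text} as currently found in the source system
    index_meta: {doc_id: {"hash": ..., "model": ...}} as stored with the vectors
    """
    to_embed = [
        doc_id
        for doc_id, text in documents.items()
        if index_meta.get(doc_id) != {"hash": content_hash(text), "model": model}
    ]
    to_delete = [doc_id for doc_id in index_meta if doc_id not in documents]
    return to_embed, to_delete

documents = {"a": "unchanged", "b": "changed text", "c": "new document"}
index_meta = {
    "a": {"hash": content_hash("unchanged"), "model": EMBEDDING_MODEL},
    "b": {"hash": content_hash("old text"), "model": EMBEDDING_MODEL},
    "d": {"hash": content_hash("removed"), "model": EMBEDDING_MODEL},
}
print(plan_updates(documents, index_meta))
```

Because the model label is part of the comparison, switching to a new embedding model automatically marks every document for re-embedding, and you can do that migration in batches while the metadata tells you which vectors are still on the old model.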

The EU / German Mittelstand angle

This changes the architecture more than many US blog posts suggest.

On-premise is usually the default ask now.

GDPR + Art. 28 data processing contracts eliminate half the providers immediately.
Most legal departments only accept a very small shortlist without months of discussions.

Also:
right-to-erasure with vector DBs is more annoying than many teams expect. If embeddings are derived from customer documents, you need to know exactly where they are.
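
"Knowing exactly where they are" boils down to keeping a mapping from source documents to every vector derived from them. A minimal in-memory sketch follows. The class and the delete callback are hypothetical, and a real version would persist this mapping and also cover caches and backups.

```python
from collections import defaultdict

class VectorRegistry:
    """Tracks which vector IDs were derived from which source document."""

    def __init__(self):
        self._by_doc = defaultdict(set)

    def record(self, doc_id, index_name, vector_id):
        self._by_doc[doc_id].add((index_name, vector_id))

    def erase(self, doc_id, delete_vector):
        """Delete every vector derived from doc_id; return how many were removed.

        delete_vector(index_name, vector_id) is supplied by the caller and
        performs the actual deletion in the vector store. The mapping is only
        dropped after all deletions went through.
        """
        entries = sorted(self._by_doc.get(doc_id, ()))
        for index_name, vector_id in entries:
            delete_vector(index_name, vector_id)
        self._by_doc.pop(doc_id, None)
        return len(entries)

registry = VectorRegistry()
registry.record("contract-17", "contracts", "vec-001")
registry.record("contract-17", "contracts", "vec-002")
registry.record("contract-17", "summaries", "vec-900")

deleted = []
count = registry.erase("contract-17", lambda idx, vid: deleted.append((idx, vid)))
print(count, deleted)
```

The important design choice is that the registry is written at indexing time. Reconstructing this mapping after the fact, once an erasure request arrives, is the annoying part.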

Still feels like many teams underestimate how much “boring infrastructure work” is inside production RAG systems.

The LLM part is honestly often the easiest component.

If you want a longer version with concrete vendor breakdowns and cost ranges, we wrote one up here: RAG mit eigenen Daten (in German). The broader take on agentic AI in EU-regulated environments: KI-Agenten im Mittelstand 2026.

Top comments (5)

Haris Putratama • May 27

I am building systems with PDFs from the '90s too, and I'm having similar problems. What solutions do you use to overcome them? I build a lot of financial, health, and legal agentic AI that needs almost 99.99% accuracy.

Tobias Egner • May 27

Honestly, first: even with high-quality PDFs that have an intact markup layer, 99.99% on free-form generation is unfeasible in basically any domain.

I feel you on the 90s ones. The COBOL of the future :D

Hardware-dependent, but if you have the GPU budget: sparring approach with a fast primary OCR + a much heavier VLM as verifier. My current approach, using an RTX 6000 PRO Blackwell, would be: ocrmypdf for preprocessing -> dots.ocr as primary OCR -> Qwen3-VL-235B-FP8 as verifier, which flags every disagreement with the primary.

The verifier step is required for your accuracy requirements. Pair it with schema-bound extraction (typed JSON) wherever the answer fits into a structure, and route disagreements to a human queue.
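
A sketch of what "flag every disagreement" can look like once both passes emit the same typed fields. The field names and values are invented, and the normalization (whitespace and case only) is deliberately naive.

```python
def compare_extractions(primary, verifier):
    """Compare two field->value dicts; return (agreed, disputed).

    agreed:   fields where both passes produced the same normalized value
    disputed: fields that differ or are missing on one side, with both values
    """
    def norm(value):
        if value is None:
            return None
        return " ".join(str(value).split()).casefold()

    agreed, disputed = {}, {}
    for field in sorted(set(primary) | set(verifier)):
        a, b = primary.get(field), verifier.get(field)
        if a is not None and norm(a) == norm(b):
            agreed[field] = a
        else:
            disputed[field] = {"primary": a, "verifier": b}
    return agreed, disputed

# Hypothetical outputs of the primary OCR and the verifier for one page.
primary_out = {"invoice_no": "A-1998-17", "total": "1.250,00", "date": "1998-04-03"}
verifier_out = {"invoice_no": "a-1998-17 ", "total": "1.280,00", "date": "1998-04-03"}

agreed, disputed = compare_extractions(primary_out, verifier_out)
print(sorted(agreed), disputed)
```

Anything that lands in disputed goes to the human queue. Everything in agreed can pass without review, which is where the time saving comes from.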

How many GBs of those legacy PDFs are you dealing with?

Haris Putratama • May 29

Almost 500 GB of PDFs in total, but under 100 GB, I guess, for the old ones. I currently run 2 H200s in my datacenter.

Nice insights! I use Qwen3-VL too, combined with GPT-OSS 120B for my agents. And yeah, human verification will be the most painful step I still have to do to process those PDFs lol

Harjot Singh • May 31

Hard agree, and the title is the whole insight - people blame the LLM for a bad RAG answer when the model never had a chance because retrieval handed it the wrong context. Garbage in, confident garbage out. The instinct to "switch to a smarter model" when RAG underperforms is almost always misdirected effort; the smarter model still can't answer from chunks that don't contain the answer.

The retrieval levers that actually move quality, in rough order: chunking (semantic boundaries beat fixed-size splits that cut answers in half), the embedding model fit to your domain, and a re-ranker to fix the "top-k by cosine similarity isn't top-k by relevance" gap. Re-ranking especially is the cheapest big win most people skip. And there's a cost bonus: better retrieval means fewer, more relevant chunks, which trims tokens too - same scoped-context principle I lean on in Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) to keep builds accurate AND ~$3 flat. Spot-on post, this needs to be said more. What moved your retrieval quality most - chunking strategy, a re-ranker, or query rewriting? Curious which gave the biggest jump per unit effort.

Mudassir Khan • Jun 2

the 'two valid documents disagree' case is the one that kills production confidence fastest tbh. hardest to catch in evals too because the model answer is technically grounded — you only discover it when a user escalates.

we hit this in an internal knowledge base: right answer in a 2024 policy doc, but the 2022 version was retrieved with higher cosine similarity because it was in a cleaner chunk. recency decay helped but the real fix was explicit chunk metadata tagging plus a contradiction check before surfacing the final result.

what's your current setup for detecting document staleness before it reaches the user?