<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tobias Egner</title>
    <description>The latest articles on DEV Community by Tobias Egner (@dagentic).</description>
    <link>https://dev.to/dagentic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3954423%2F845086bf-3e95-4c8c-b696-4609d0ea0b4b.png</url>
      <title>DEV Community: Tobias Egner</title>
      <link>https://dev.to/dagentic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dagentic"/>
    <language>en</language>
    <item>
      <title>Most RAG Problems Are R(etrieval) Problems</title>
      <dc:creator>Tobias Egner</dc:creator>
      <pubDate>Wed, 27 May 2026 14:09:52 +0000</pubDate>
      <link>https://dev.to/dagentic/most-rag-problems-are-retrieval-problems-327h</link>
      <guid>https://dev.to/dagentic/most-rag-problems-are-retrieval-problems-327h</guid>
      <description>&lt;p&gt;Most RAG blog posts read like product brochures. After building a few systems over the last months and reading way too many production post-mortems, I'm pretty convinced the LLM is usually not the thing that breaks first.&lt;/p&gt;

&lt;p&gt;Especially not in EU mid-market deployments.&lt;/p&gt;

&lt;p&gt;A few failure modes I see again and again:&lt;/p&gt;

&lt;h2&gt;1. Retrieval quality falls apart somewhere between 10K and 40K docs&lt;/h2&gt;

&lt;p&gt;The demo with 500 PDFs looks amazing.&lt;/p&gt;

&lt;p&gt;Then the first real pilot starts, somebody uploads 30k documents from SharePoint and suddenly top-3 retrieval becomes semi-random.&lt;/p&gt;

&lt;p&gt;Typical example:&lt;br&gt;
The query is &lt;code&gt;Lieferantenbewertung 2024&lt;/code&gt; (supplier evaluation 2024).&lt;/p&gt;

&lt;p&gt;What comes back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a supplier evaluation form from 2019&lt;/li&gt;
&lt;li&gt;three meeting notes because they contain the word “Lieferant”&lt;/li&gt;
&lt;li&gt;the actually correct document maybe at rank 4 or 5&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This problem is way more common than most tutorials mention.&lt;/p&gt;

&lt;p&gt;What people in production seem to converge on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hybrid retrieval (BM25 + dense)&lt;/li&gt;
&lt;li&gt;reciprocal rank fusion&lt;/li&gt;
&lt;li&gt;reranker on top (Cohere if budget exists, BGE reranker otherwise)&lt;/li&gt;
&lt;li&gt;separate indexes per document type&lt;/li&gt;
&lt;/ul&gt;
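
&lt;p&gt;To make the fusion step concrete, here is a minimal sketch of reciprocal rank fusion over two ranked lists. The document ids and the two hit lists are made up for illustration, and &lt;code&gt;k=60&lt;/code&gt; is just the commonly used default, not something tuned on our data.&lt;/p&gt;

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists for the query "Lieferantenbewertung 2024".
bm25_hits = ["eval_2024", "eval_2019", "notes_a"]
dense_hits = ["notes_a", "eval_2024", "notes_b"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

&lt;p&gt;The document that ranks well in both lists ends up on top, even though the dense retriever alone preferred a meeting note. A reranker would then only need to look at the first few fused hits.&lt;/p&gt;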

&lt;p&gt;Honestly, adding a reranker solved more quality issues for us than changing the LLM ever did.&lt;/p&gt;

&lt;h2&gt;2. German enterprise PDFs are completely cursed&lt;/h2&gt;

&lt;p&gt;Most demos run on clean PDFs.&lt;/p&gt;

&lt;p&gt;Real document stores are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scanned contracts from 1998&lt;/li&gt;
&lt;li&gt;supplier manuals with 3-column layouts&lt;/li&gt;
&lt;li&gt;rotated tables&lt;/li&gt;
&lt;li&gt;faxed quality reports&lt;/li&gt;
&lt;li&gt;old encodings destroying umlauts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;pypdf&lt;/code&gt; turns many of these into complete garbage text.&lt;/p&gt;

&lt;p&gt;Things I have seen multiple times already:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ü&lt;/code&gt; becoming weird symbols&lt;/li&gt;
&lt;li&gt;tables flattened into unreadable prose&lt;/li&gt;
&lt;li&gt;footnotes injected into random sentences&lt;/li&gt;
&lt;li&gt;OCR artifacts treated as actual content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Current stack that works reasonably well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Marker for most docs&lt;/li&gt;
&lt;li&gt;Docling as fallback&lt;/li&gt;
&lt;li&gt;VLM pass for ugly tables&lt;/li&gt;
&lt;/ul&gt;
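
&lt;p&gt;Deciding when to hand a document to the fallback parser needs some kind of quality gate. Below is a rough sketch of one, assuming you already have the extracted text as a string. The function names and the 2% threshold are my own placeholders, and it deliberately does not call any parser API.&lt;/p&gt;

```python
import unicodedata

# Mojibake produced when UTF-8 umlauts are decoded as cp1252, e.g. "ü" turns into "Ã¼".
MOJIBAKE_MARKERS = tuple(ch.encode("utf-8").decode("cp1252") for ch in "äöüÄÖÜß")

def garbage_ratio(text):
    """Share of characters that are control/unassigned code points or U+FFFD."""
    if not text:
        return 1.0  # an empty extraction counts as fully broken
    bad = 0
    for ch in text:
        if ch in "\n\r\t":
            continue  # ordinary whitespace is fine
        if ch == "\ufffd" or unicodedata.category(ch).startswith("C"):
            bad += 1
    return bad / len(text)

def needs_fallback(text, max_garbage=0.02):
    """True if the text shows umlaut mojibake or too many junk characters."""
    if any(marker in text for marker in MOJIBAKE_MARKERS):
        return True
    return garbage_ratio(text) > max_garbage

clean = "Prüfbericht für Lieferant Müller GmbH, 2024."
broken = clean.encode("utf-8").decode("cp1252")  # simulate a wrong decoding
```

&lt;p&gt;This only catches encoding damage and junk characters. Flattened tables and misplaced footnotes still need a layout-aware check or a human spot check.&lt;/p&gt;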

&lt;p&gt;This preprocessing layer is very unsexy work, but probably 30% of the actual implementation effort.&lt;/p&gt;

&lt;p&gt;And if you skip it, RAG quality later becomes fake-good: answers read fluently but are built on corrupted text.&lt;/p&gt;

&lt;h2&gt;3. Hallucinations are not the real production problem&lt;/h2&gt;

&lt;p&gt;Every stakeholder asks:&lt;br&gt;
“What about hallucinations?”&lt;/p&gt;

&lt;p&gt;Almost nobody asks:&lt;br&gt;
“What if the source itself is outdated?”&lt;/p&gt;

&lt;p&gt;From what I’ve seen, this kills more pilots than hallucinations do.&lt;/p&gt;

&lt;p&gt;The model gives a perfectly grounded answer.&lt;br&gt;
It cites the right document.&lt;br&gt;
The document is just no longer valid.&lt;/p&gt;

&lt;p&gt;Or worse:&lt;br&gt;
two valid documents disagree and the system confidently picks one.&lt;/p&gt;

&lt;p&gt;What seems to work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;recency decay in retrieval scoring&lt;/li&gt;
&lt;li&gt;contradiction checks across retrieved chunks&lt;/li&gt;
&lt;li&gt;confidence thresholds + human handoff&lt;/li&gt;
&lt;/ul&gt;
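
&lt;p&gt;Recency decay can be as simple as multiplying the relevance score by an exponential weight. The sketch below assumes every hit carries a document date. The one-year half-life and the example scores are invented, so treat them as knobs to tune, not recommendations.&lt;/p&gt;

```python
from datetime import date

def recency_weight(doc_date, today, half_life_days=365.0):
    """Exponential decay: the weight halves every half_life_days."""
    age_days = max((today - doc_date).days, 0)
    return 0.5 ** (age_days / half_life_days)

def rescore(hits, today, half_life_days=365.0):
    """hits: list of (doc_id, relevance, doc_date). Returns ids by decayed score."""
    scored = [
        (relevance * recency_weight(doc_date, today, half_life_days), doc_id)
        for doc_id, relevance, doc_date in hits
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored]

# Hypothetical hits: the 2019 form is slightly more "relevant" but much older.
today = date(2025, 1, 1)
hits = [
    ("eval_2019", 0.82, date(2019, 1, 1)),
    ("eval_2024", 0.78, date(2024, 1, 1)),
]
ranked = rescore(hits, today)
```

&lt;p&gt;With these numbers the 2024 document moves ahead of the 2019 one. For document types that never expire (contracts, norms), a much longer half-life or no decay at all is probably the safer choice.&lt;/p&gt;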

&lt;p&gt;A lot of “hallucination problems” are actually retrieval problems wearing a fake mustache.&lt;/p&gt;

&lt;h2&gt;4. Permissions become a disaster very fast&lt;/h2&gt;

&lt;p&gt;This one appears in basically every internal rollout thread.&lt;/p&gt;

&lt;p&gt;The assistant accidentally answers something using an HR spreadsheet or salary export the user should never have seen.&lt;/p&gt;

&lt;p&gt;Technically the solution is easy:&lt;br&gt;
permission filtering before semantic retrieval.&lt;/p&gt;
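
&lt;p&gt;A toy version of that idea, assuming each chunk carries an &lt;code&gt;allowed_groups&lt;/code&gt; list in its metadata (my naming, not a standard field). The term-overlap scoring is only a stand-in for the real retriever. The point is the order: filter first, rank second, and deny by default when the ACL is missing.&lt;/p&gt;

```python
def allowed_chunks(chunks, user_groups):
    """Keep only chunks whose ACL shares at least one group with the user.

    Chunks without ACL metadata are dropped (deny by default).
    """
    user_groups = set(user_groups)
    return [
        chunk for chunk in chunks
        if user_groups.intersection(chunk.get("allowed_groups") or ())
    ]

def search(query_terms, chunks, user_groups, top_k=3):
    """Filter by permission first, then rank by a toy term-overlap score."""
    candidates = allowed_chunks(chunks, user_groups)

    def score(chunk):
        words = set(chunk["text"].lower().split())
        return len(words.intersection(query_terms))

    ranked = sorted(candidates, key=score, reverse=True)
    return [chunk["id"] for chunk in ranked[:top_k]]

# Hypothetical corpus: one HR-only file, one shared file, one with no ACL at all.
chunks = [
    {"id": "salary_export", "text": "salary export 2024", "allowed_groups": ["hr"]},
    {"id": "travel_policy", "text": "travel policy salary advance",
     "allowed_groups": ["hr", "all_staff"]},
    {"id": "orphan_doc", "text": "salary notes", "allowed_groups": None},
]
result = search({"salary"}, chunks, ["all_staff"])
```

&lt;p&gt;A regular employee only ever sees the shared policy here. In a real vector store the same filter would typically be a metadata filter on the query, so restricted chunks never enter the candidate set.&lt;/p&gt;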

&lt;p&gt;In reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SharePoint permissions are ancient&lt;/li&gt;
&lt;li&gt;metadata missing&lt;/li&gt;
&lt;li&gt;nobody knows document ownership anymore&lt;/li&gt;
&lt;li&gt;legal says ask IT&lt;/li&gt;
&lt;li&gt;IT says ask department head&lt;/li&gt;
&lt;li&gt;department head left in 2021&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In EU environments this becomes even more annoying because GDPR turns this from an “oops” into a potentially reportable incident.&lt;/p&gt;

&lt;p&gt;Honestly I would not even start a pilot anymore before the customer can explain who should access what.&lt;/p&gt;

&lt;h2&gt;5. Re-embedding costs are massively underestimated&lt;/h2&gt;

&lt;p&gt;Everybody budgets the first embedding run.&lt;/p&gt;

&lt;p&gt;Almost nobody budgets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;daily delta updates&lt;/li&gt;
&lt;li&gt;re-embedding after model upgrades&lt;/li&gt;
&lt;li&gt;vector storage growth&lt;/li&gt;
&lt;li&gt;multi-vector indexing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Embedding APIs look cheap until somebody realizes the SharePoint dump contains 800 million tokens.&lt;/p&gt;

&lt;p&gt;What seems to become the default setup now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;local embedding models after ~10k docs&lt;/li&gt;
&lt;li&gt;incremental indexing pipelines from day one&lt;/li&gt;
&lt;li&gt;embedding model versioning in metadata&lt;/li&gt;
&lt;/ul&gt;
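
&lt;p&gt;Incremental indexing and model versioning fit together nicely: store a content hash and the embedding model name next to every document, and only re-embed what changed. This is a sketch under those assumptions. The model name and the metadata layout are hypothetical.&lt;/p&gt;

```python
import hashlib

EMBEDDING_MODEL = "local-embedder-v2"  # hypothetical model identifier

def content_hash(text):
    """Stable fingerprint of the document text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_updates(documents, index_metadata, model=EMBEDDING_MODEL):
    """Return the ids of documents that need (re-)embedding.

    documents: dict mapping doc_id to current text
    index_metadata: dict mapping doc_id to {"hash": ..., "model": ...}
    """
    stale = []
    for doc_id, text in documents.items():
        meta = index_metadata.get(doc_id)
        if (
            meta is None                          # never indexed
            or meta["hash"] != content_hash(text)  # content changed
            or meta["model"] != model              # embedded with an older model
        ):
            stale.append(doc_id)
    return sorted(stale)

documents = {"a": "unchanged", "b": "edited text", "c": "new", "d": "old model"}
index_metadata = {
    "a": {"hash": content_hash("unchanged"), "model": EMBEDDING_MODEL},
    "b": {"hash": content_hash("original text"), "model": EMBEDDING_MODEL},
    "d": {"hash": content_hash("old model"), "model": "local-embedder-v1"},
}
todo = plan_updates(documents, index_metadata)
```

&lt;p&gt;Only the edited, new, and old-model documents get queued. After a model upgrade the same check tells you exactly how much of the corpus is still on the previous version.&lt;/p&gt;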

&lt;p&gt;Otherwise migrations become painful very quickly.&lt;/p&gt;

&lt;h2&gt;The EU / German Mittelstand angle&lt;/h2&gt;

&lt;p&gt;This changes the architecture more than many US blog posts suggest.&lt;/p&gt;

&lt;p&gt;On-premise is usually the default ask now.&lt;/p&gt;

&lt;p&gt;GDPR and the data processing agreements required under Art. 28 eliminate half the providers immediately.&lt;br&gt;
Most legal departments only accept a very small shortlist without months of discussions.&lt;/p&gt;

&lt;p&gt;Also:&lt;br&gt;
right-to-erasure with vector DBs is more annoying than many teams expect. If embeddings are derived from customer documents, you need to know exactly where they are.&lt;/p&gt;
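
&lt;p&gt;One way to stay on top of this is a registry that maps every source document to the vector ids derived from it. The class below is a minimal in-memory sketch with made-up names. In practice the mapping would live in a database and the delete callback would call your vector store.&lt;/p&gt;

```python
from collections import defaultdict

class VectorRegistry:
    """Tracks which vector ids were derived from which source document."""

    def __init__(self):
        self._by_doc = defaultdict(set)

    def register(self, doc_id, vector_id):
        self._by_doc[doc_id].add(vector_id)

    def erase(self, doc_id, delete_vectors):
        """Delete all vectors of a document via the callback; return their ids."""
        vector_ids = sorted(self._by_doc.pop(doc_id, set()))
        if vector_ids:
            delete_vectors(vector_ids)
        return vector_ids

# Toy vector store: a plain dict standing in for the real database.
store = {"v1": [0.1], "v2": [0.2], "v3": [0.3]}

def delete_from_store(vector_ids):
    for vector_id in vector_ids:
        store.pop(vector_id, None)

registry = VectorRegistry()
registry.register("contract_42", "v1")
registry.register("contract_42", "v2")
registry.register("manual_7", "v3")
erased = registry.erase("contract_42", delete_from_store)
```

&lt;p&gt;The returned ids double as an audit record of what was deleted. Caches, backups, and any second index built from the same chunks need the same treatment.&lt;/p&gt;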

&lt;p&gt;Still feels like many teams underestimate how much “boring infrastructure work” is inside production RAG systems.&lt;/p&gt;

&lt;p&gt;The LLM part is honestly often the easiest component.&lt;/p&gt;

&lt;p&gt;If you want a longer version with concrete vendor breakdowns and cost ranges, we wrote one up here: &lt;a href="https://dagentic.de/blog/rag-eigene-daten/" rel="noopener noreferrer"&gt;RAG mit eigenen Daten&lt;/a&gt; (in German). The broader take on agentic AI in EU-regulated environments: &lt;a href="https://dagentic.de/blog/ki-agenten-mittelstand-2026/" rel="noopener noreferrer"&gt;KI-Agenten im Mittelstand 2026&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
