Ksenia Se

Posted on Jun 16

Your RAG Stack Is Solving the 2023 Problem

#ai #architecture #llm #rag

Top-k retrieval was the beginning. Production systems now need routing, memory, evidence checks, structured retrieval, and security around the retrieval layer.

Most RAG tutorials still start with the same pipeline:

documents → chunks → embeddings → vector database → top-k retrieval → LLM answer

This was the right starting point. It made LLMs useful with private data, reduced some hallucinations, and gave developers a simple way to connect models to documents.

With real applications everything is a bit more complicated.

The answer may be scattered across twelve pages. The relevant source may be a table, a PDF diagram, a spreadsheet, a ticket thread, or a database row. The user’s question may require several retrieval steps. A semantically similar chunk may be relevant without actually proving the answer. The corpus may contain untrusted text that tries to steer the model.

At that point, the question is no longer “Do we have RAG?”

The better question is:

What kind of retrieval problem are we solving?

I wrote a longer taxonomy at Turing Post, 20 Advanced RAG Types to Know in 2026. This post is the short developer version: the part you should think through before building another “upload docs and chat” system.

Basic RAG assumes too much

The classic RAG pipeline works well in a well-structured world.

It assumes the answer lives in one or a few text chunks. It assumes semantic similarity is close enough to evidence. It assumes one retrieval pass is enough. It assumes the retrieved context is safe to pass into the model. It assumes the user’s question is clear and the corpus is stable.

Sometimes those assumptions hold. A support FAQ, a small documentation set, or a clean internal knowledge base can work well with basic vector search plus reranking.

But many systems break these assumptions quickly.

A legal assistant may need to connect definitions across a contract. A research assistant may need to understand a full paper, not one paragraph. A finance assistant may need structured numbers from tables. A customer support agent may need to check policy, account status, previous tickets, and current product behavior. A coding assistant may need repo structure, conventions, issue history, and a changing local state. etc etc

In all these cases, “retrieve top-k chunks” becomes a habit, not an architecture.

The useful shift: retrieval as a set of decisions

A more mature RAG system treats retrieval as a sequence of choices.

user query
    ↓
should we retrieve?
    ↓
what source should we use?
    ↓
what retrieval method fits this source?
    ↓
is the evidence enough?
    ↓
should we retrieve again?
    ↓
answer, refuse, ask, or escalate

Seems small when seen like this? Maybe, but in production, it changes almost everything.

The system now has to decide whether the user’s question needs retrieval at all. It has to choose between sources: documentation, database, logs, tickets, long-term memory, web search, or internal APIs. It has to decide whether the answer requires one pass, multiple passes, a graph traversal, a table lookup, or a verification step.

This is why “RAG” has become a family of patterns.

Five places where basic RAG usually breaks

1. The answer is spread across the document

When the answer is one place – basic RAG works great.

Real documents do not always behave like that. Contracts define terms in one section and apply them later. Research papers introduce assumptions early and results much later. Internal strategy docs contain scattered decisions, caveats, and exceptions.

When the answer depends on document-level structure, the system needs more than similarity search. It may need long-context retrieval, hierarchical chunking, section-aware retrieval, summary layers, or memory over previous retrieval steps.

A simple symptom: the model keeps giving plausible partial answers because every retrieved chunk is locally relevant, while the actual answer requires a broader reading.

2. Retrieval should be conditional

A lot of RAG systems retrieve every time because the pipeline says so.

That creates noise. Some questions can be answered from the model’s general knowledge. Some require precise internal data. Some require asking the user a clarifying question before retrieval. Some require several retrieval rounds because the first answer reveals what must be checked next.

This is where adaptive and agentic RAG patterns will shine.

The system does not treat retrieval as an automatic reflex. It treats retrieval as an action. The model, router, or controller decides when to retrieve, where to retrieve from, and whether the result is good enough to continue.

For developers, this usually means the retrieval layer needs policy, and not just plumbing.

if query asks about internal policy:
    retrieve from policy docs

if query asks about account-specific status:
    call account API

if query is ambiguous:
    ask a clarifying question

if first retrieval has weak evidence:
    retrieve again or escalate

This sounds obvious. But again, many production failures come from skipping exactly this step.

3. Similarity is not evidence

Vector search is good at finding text that resembles the query. That is useful, but resemblance is not proof.

A retrieved paragraph can be on the same topic and still fail to support the answer. It can mention the right entity while saying nothing about the user’s actual question. It can be outdated, or contradict another source. Or it can be a summary of a policy instead of the policy itself.

This is where failure creeps into many RAG systems: the answer looks grounded because there are citations, but the citations do not actually carry the claim.

Verification-oriented RAG adds another step. After retrieval, the system checks whether the evidence is sufficient, whether it contradicts other evidence, and whether the answer should be narrower.

A useful mental model:

retrieval asks: “What looks relevant?”
verification asks: “What can we safely say from this?”

The second question is where many serious applications begin.

4. The source is not plain text

The easy version of RAG assumes documents become chunks of text.

But the world, which includes modern corpora, is messier. They contain tables, charts, screenshots, slides, diagrams, invoices, code, transcripts, forms, logs, and database records. Flattening all of that into plain text can destroy the structure that made the source useful.

A spreadsheet is not just a sequence of words. A table has rows, columns, headers, units, and relationships. A diagram can encode a process. A codebase has imports, call graphs, tests, comments, issues, and conventions.

When the source has structure, retrieval should preserve as much of that structure as possible.

That may mean multimodal retrieval, table-aware retrieval, graph-based retrieval, SQL generation, code search, metadata filters, or hybrid search that combines lexical, semantic, and structural signals.

This is less about building a fancier system and more about acknowledging that knowledge doesn't always come packaged as paragraphs.

5. The retrieval layer can be attacked

RAG systems often treat retrieved context as trusted context.

That is a dangerous shortcut.

If the corpus includes user-generated content, external web pages, third-party documentation, support tickets, shared docs, or any source that can be edited by someone outside your control, retrieval becomes a security boundary.

The model does not naturally know which retrieved text is evidence and which retrieved text is an instruction. Developers have to make that boundary explicit.

At minimum, retrieval-aware systems need:

source trust levels
instruction filtering
clear separation between system instructions and retrieved evidence
logging for what context was used
refusal or escalation paths when retrieved content conflicts with policy
different handling for trusted internal data and untrusted external text

Security has to be woven into the retrieval process.

A better architecture question

Before building another RAG stack, ask this:

What job is retrieval doing in this system?

That one question is more useful than picking a vector database too early.

Maybe retrieval is there to fetch facts. Maybe it is there to maintain memory. Maybe it is there to support reasoning across documents. Maybe it is there to verify claims. Maybe it is there to search structured data. Maybe it is there to ground an agent before it takes action.

Each job leads to a different architecture.

For a simple documentation bot, the old pipeline may be enough:

query → vector search → rerank → answer with citations

For a customer support assistant, you may need:

query → classify intent → retrieve policy → check account state → draft response → verify against policy → human review

For a research assistant, you may need:

query → decompose question → retrieve sources → compare evidence → identify gaps → retrieve again → synthesize with citations

For an enterprise agent, you may need:

query → permission check → source routing → retrieval → tool call → evidence check → action proposal → approval gate

These are all called RAG in casual conversation. They are not the same system.

The practical developer checklist

When a RAG system starts failing, do not immediately tune embeddings or change chunk size. Those can help, but they are often local fixes for a deeper design issue.

Start with these questions:

Where should retrieval happen?
Before the model answers, during a multi-step process, after the model forms a plan, or only when confidence is low?

What source should the system trust?
Internal documentation, user files, live APIs, external web pages, databases, logs, tickets, or memory?

What shape is the source?
Plain text, long documents, tables, code, images, transcripts, structured records, or mixed media?

What does “good evidence” mean?
A relevant paragraph, an exact quote, a database value, a policy match, a calculation, or agreement across several sources?

What should happen when evidence is weak?
Retrieve again, ask the user, answer narrowly, refuse, or escalate?

Can retrieved content change behavior?
If yes, you need stronger boundaries between evidence and instructions.

This is where RAG becomes the real engineering.

The 2026 version of RAG is less glamorous and more useful

The early RAG story was simple: give the model external knowledge.

The production story is more complicated: give the system the right context, from the right source, at the right time, with the right permissions, and with enough evidence to justify the answer.

That is why the field is spreading into many patterns: long-document RAG, agentic RAG, adaptive RAG, corrective RAG, self-reflective RAG, graph RAG, multimodal RAG, structured RAG, federated RAG, secure RAG, and more.

Some of these names will survive. Some will be renamed. Some will be absorbed into ordinary application architecture. The naming is less important than the underlying shift.

RAG is becoming the context layer for AI systems.

And context is no longer just “some chunks in the prompt.”

For the full map, I put together a deeper taxonomy at Turing Post: 20 Advanced RAG Types to Know in 2026. It groups the patterns by the problems they solve, from long-document memory and adaptive retrieval to verification, multimodal sources, graph reasoning, federated retrieval, and retrieval-layer security.

Read that before you rebuild your pipeline for the fifth time. Your vector database has suffered enough.

Top comments (1)

Max Quimby • Jun 17

The line that hits hardest for me is "a semantically similar chunk may be relevant without actually proving the answer." That gap is where most of my production incidents have lived — the retriever returns something that reads on-topic, the model treats proximity as proof, and you only notice when an answer is confidently wrong. Adding a cheap "does this passage actually support the claim?" verification pass before generation caught more bad answers for us than any amount of embedding-model tuning did.

The decision I'd push even harder on is "should we retrieve at all?" A surprising share of traffic doesn't need retrieval, and forcing it just injects noise and latency. Routing that upfront was the single highest-ROI change we made.

Curious how you're handling the untrusted-corpus case in practice — provenance/trust scoring on chunks before they reach the prompt, quarantining sources, or mostly instruction-hardening on the model side?