Gursharan Singh
How RAG Works: The Complete Pipeline

Part 3 of 8 — RAG Article Series

Previous: What RAG Is: The Pattern That Grounds AI in Reality (Part 2)

Two Shifts, Two Jobs

Part 2 showed the RAG pattern as six components in a line: query in, context retrieved, answer out. That is the shape of the system. This article shows how it actually runs.

For a single document, you can paste it into a chat window and ask questions directly. RAG exists because companies have hundreds of documents that change weekly, and the answer to a real question may depend on several of them.

A RAG pipeline is not one flow. It is two shifts with different jobs, different costs, and different ways to fail.

Shift 1 is ingestion. It runs offline, before any question arrives. Its job is to take your raw documents — TechNova's return policies, troubleshooting guides, product specs, firmware changelogs — and turn them into something a retriever can search. Parse, chunk, embed, store. This shift runs once per document update, not once per question.

Shift 2 is query time. It runs live, when a customer asks a question. Its job is to find the right chunks from the index that Shift 1 built, assemble them into a prompt, and generate an answer. This shift runs on every question and needs to be fast.

The two shifts share an index but share almost nothing else. They run at different times, at different speeds, with different failure modes. Understanding them as separate shifts is what makes debugging possible.

Two Shifts, Two Jobs — Full pipeline overview showing Shift 1 (ingestion, offline) and Shift 2 (query time, live) connected by the vector index

Shift 1 — Preparing the Knowledge

TechNova has five documents that need to become searchable: the return policy, the warranty terms, the troubleshooting guide, the firmware changelog, and the product specifications with a comparison table. Each one is structured differently, and each creates a different problem for the ingestion pipeline.

The goal of Shift 1 is to make these documents searchable by meaning, not just by keywords. A customer might ask "can I return my headphones?" while the document says "return window" or "refund policy."

To make that match possible, the system turns documents into clean text, splits them into smaller pieces, and converts those pieces into representations it can search later. Those representations are stored in a vector database for retrieval at query time.

Document Parsing Matters More Than You Think

Most tutorials skip this step. Before you can chunk or embed anything, you need clean text. Getting clean text from real documents is harder than it sounds.

TechNova's knowledge base includes Markdown files, HTML help pages, and an HTML product specs page. Each format needs its own parser before it becomes usable text.

But parsing is not just text extraction. It is structure preservation. A heading, a numbered procedure, and a comparison table all look like plain text after extraction, but they carry very different meaning during retrieval. When structure is lost early, every step after it works with broken material.

Consider TechNova's product specs. The original table looks like this:

| Model   | Driver Size | Battery  | Codecs         |
| ------- | ----------- | -------- | -------------- |
| WH-1000 | 30mm        | 30 hours | SBC, AAC, LDAC |
| WH-500  | 30mm        | 20 hours | SBC, AAC       |

A naive parser — one that strips HTML tags or pulls raw text — flattens that into:

"30mm 30 hours SBC AAC LDAC 30mm 20 hours SBC AAC"

No row boundaries. No column headers. No way for a retriever to answer "What is the battery life of the WH-1000?" because the answer is mixed up with the WH-500's specs.

A structure-aware parser keeps the table's shape intact, so each product's attributes stay separate. Now retrieval has something usable to work with.
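A minimal sketch of the difference, using only Python's standard library. The HTML and the `TableParser` class are illustrative, not a production parser:

```python
# Naive vs structure-aware parsing of the spec table, stdlib only.
import re
from html.parser import HTMLParser

SPEC_HTML = """
<table>
  <tr><th>Model</th><th>Driver Size</th><th>Battery</th><th>Codecs</th></tr>
  <tr><td>WH-1000</td><td>30mm</td><td>30 hours</td><td>SBC, AAC, LDAC</td></tr>
  <tr><td>WH-500</td><td>30mm</td><td>20 hours</td><td>SBC, AAC</td></tr>
</table>
"""

def naive_parse(html: str) -> str:
    """Strip tags and collapse whitespace -- row boundaries are lost."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())

class TableParser(HTMLParser):
    """Keep the table's shape: one list of cells per row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False
    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

parser = TableParser()
parser.feed(SPEC_HTML)
header, *body = parser.rows
# Each row becomes a self-contained record the retriever can match.
structured = [dict(zip(header, row)) for row in body]

print(naive_parse(SPEC_HTML))    # all cells run together, products mixed
print(structured[0]["Battery"])  # still tied to the WH-1000 row
```

The structured records can now answer "What is the battery life of the WH-1000?" because each attribute stays attached to its model.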

In practice, production systems often store both a searchable summary and the raw structured data for tables. The summary — "WH-1000: 30mm driver, 30hr battery, LDAC + SBC" — gets embedded and indexed for retrieval. The full table is stored alongside it as a separate object.

When a summary matches a query, the generator receives the complete table, not just the summary. This matters because a summary can match a query it cannot fully answer. "Compare the codec support of WH-1000 and WH-500" needs the raw table, not a one-line description of one product. Part 6 uses a sample product specs document with a comparison table so this parsing challenge becomes visible in code, not just prose.

The decision: how do you handle documents that are not plain text? Tables, nested headers, lists with sub-items, mixed-format PDFs — each needs a parser that understands structure, not just characters. The failure: structured content destroyed by bad parsing. Every step after it inherits the damage.

Chunking

Documents are too long to retrieve whole. A 2,000-word troubleshooting guide cannot fit in a model's context alongside four other retrieved documents and still leave room for generation. The guide needs to be split into chunks — pieces small enough to retrieve individually, but large enough to carry a complete thought.

Where you split matters. TechNova's troubleshooting guide has a section on Bluetooth pairing with five numbered steps. If the chunk boundary falls between step 3 and step 4, the retriever might return the first chunk when a customer asks about pairing. That chunk ends mid-procedure. The model generates an answer from incomplete instructions. The customer follows three steps, gets stuck, and contacts support anyway.

The tradeoff: how big should chunks be, and where should boundaries fall? Too small, and chunks lack context. Too large, and retrieval gets less accurate. Overlap between chunks — repeating the last few sentences of one chunk at the start of the next — helps preserve context at boundaries. Part 4 examines chunking strategies in detail.

What breaks: a coherent answer split across two chunks, so neither chunk is enough on its own.
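A toy chunker makes the overlap idea concrete. The sizes below are illustrative, not recommendations, and real splitters handle far more than sentence boundaries:

```python
# Fixed-size chunking with sentence overlap, as a minimal sketch.
def chunk_text(text: str, max_chars: int = 200, overlap_sents: int = 1) -> list[str]:
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current = [], []
    for sent in sentences:
        if current and sum(len(s) for s in current) + len(sent) > max_chars:
            chunks.append(" ".join(current))
            # carry the last sentence(s) forward to preserve context
            current = current[-overlap_sents:]
        current.append(sent)
    if current:
        chunks.append(" ".join(current))
    return chunks

guide = (
    "Step 1: turn on the headphones. Step 2: open Bluetooth settings. "
    "Step 3: hold the power button for five seconds. "
    "Step 4: select WH-1000 from the device list. Step 5: confirm pairing."
)
for chunk in chunk_text(guide, max_chars=120):
    print(chunk)
```

With overlap, the sentence at a chunk boundary appears in both chunks, so a retriever that returns either one still carries the connecting step.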

Embedding and Storage

Each chunk gets converted into a vector — a list of numbers that represents what the text means. Two chunks about return policies will produce similar vectors, even if they use different words. This is what makes semantic search possible: the retriever matches meaning, not keywords.

Here is what that looks like in practice.

Retrieval matches meaning, not exact wording — relevant chunks are found even when the wording is different

The embedding model matters more than most teams expect early on. A general-purpose model trained on web text will treat "WH-1000" as a meaningless token. A model that has seen electronics documentation will understand it as a specific product with specific attributes. The same query will retrieve different chunks depending on how well the embedding model understands your vocabulary.

Once embedded, chunks go into a vector database — an index built for finding the most similar vectors to a given query. This is the bridge between the two shifts: everything ingestion produces, the query pipeline searches.

The choice that matters: which embedding model, and does it understand your domain? The silent risk: embeddings that capture general meaning but miss domain-specific terms, so the retriever returns results that sound right but are wrong.
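The similarity math itself is simple. In the sketch below, the three-dimensional vectors are made up purely to show the mechanics; a real system would get them from an embedding model:

```python
# Cosine similarity over toy vectors standing in for real embeddings.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend an embedding model produced these vectors for three chunks.
index = {
    "Return window: 15 days from date of delivery.":   [0.9, 0.1, 0.0],
    "Hold the power button for five seconds to pair.": [0.1, 0.9, 0.1],
    "WH-1000 battery: 30 hours playback.":             [0.0, 0.2, 0.9],
}

# Stand-in for embedding the query "can I return my headphones?"
query_vec = [0.85, 0.15, 0.05]
best = max(index, key=lambda chunk: cosine(query_vec, index[chunk]))
print(best)
```

The query never contains the words "return window", yet its vector lands closest to the return-policy chunk. That is the match-by-meaning behavior a vector database implements at scale.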

Contextual Enrichment

A chunk that says "Return window: 15 days" is unclear on its own. Fifteen days for which product? Under which policy version? If TechNova's WH-1000 and WH-500 have different return windows, the embedding for "15 days" alone cannot tell them apart. Both chunks can look too similar to the retriever, and it may return the wrong one.

Before embedding, some teams use an LLM to add context to each chunk — turning "Return window: 15 days" into "From TechNova WH-1000 return policy (updated Q4 2024): Return window: 15 days." Now the embedding captures not just the content, but which product and which policy version it came from. Chunks that would otherwise look too similar become easier to tell apart. This is not required on day one, but it is one of the first improvements teams make when retrieval is not accurate enough on domain-specific queries.

Some teams also attach structured metadata to each chunk — product name, document version, last-updated date — so retrieval can filter by product or version before comparing embeddings.
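A minimal sketch of both ideas together. In production the added context often comes from an LLM; here it is built from known document fields, and the field names are illustrative:

```python
# Contextual enrichment: prepend provenance to the chunk text before
# embedding, and keep structured metadata alongside it for filtering.
def enrich(chunk: str, product: str, doc: str, version: str) -> dict:
    return {
        "text": f"From TechNova {product} {doc} ({version}): {chunk}",
        "metadata": {"product": product, "doc": doc, "version": version},
    }

record = enrich(
    "Return window: 15 days",
    product="WH-1000",
    doc="return policy",
    version="updated Q4 2024",
)
print(record["text"])

# Metadata lets retrieval filter by product before comparing embeddings.
records = [record]
wh1000_only = [r for r in records if r["metadata"]["product"] == "WH-1000"]
```

The enriched `text` is what gets embedded; the `metadata` dict is what a vector database would use for pre-filtering.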

Shift 1: Preparing the Knowledge — pipeline from Raw Documents through Parse, Chunk, Enrich, Embed, to Store, with failure warnings at each step

Shift 2 — Answering the Question

A customer asks: "What is the return policy for the WH-1000?" The question enters Shift 2. Everything from here runs live.

The Vector Search Path

The query gets embedded using the same model that embedded the chunks in Shift 1. Same model, same vector space — so the query's vector can be compared directly against every chunk in the index. The retriever returns the chunks whose vectors are closest in meaning to the question.

For the return policy question, the retriever pulls the chunk from return-policy.md that says "Return window: 15 days from date of delivery." That chunk, along with any other high-scoring results, gets assembled into a prompt: "Here is the relevant context. Now answer this question." The model reads the assembled prompt and generates: "The return policy for the WH-1000 is 15 days from the date of delivery."

This is the path most people picture when they hear "RAG." It works well for questions answered by documents — policies, guides, specifications, changelogs.
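The whole query-time path fits in a few lines once the index exists. As before, the vectors here are made-up stand-ins for real embeddings, and the prompt template is illustrative:

```python
# Query-time sketch: embed the query, take the top-k chunks, assemble
# a prompt. A real system would call an embedding model and an LLM.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Index built by Shift 1: chunk text -> vector (vectors are made up).
index = {
    "Return window: 15 days from date of delivery.":    [0.9, 0.1],
    "Warranty covers manufacturing defects for 1 year.": [0.6, 0.5],
    "Hold the power button for five seconds to pair.":  [0.1, 0.9],
}

def retrieve(query_vec, k=2):
    ranked = sorted(index, key=lambda c: cosine(query_vec, index[c]), reverse=True)
    return ranked[:k]

def assemble_prompt(question, chunks):
    context = "\n".join(f"- {c}" for c in chunks)
    return (f"Here is the relevant context:\n{context}\n\n"
            f"Now answer this question: {question}")

question = "What is the return policy for the WH-1000?"
query_vec = [0.95, 0.05]  # stand-in for embedding the question
prompt = assemble_prompt(question, retrieve(query_vec))
print(prompt)
```

Everything after `assemble_prompt` is generation: the model reads this prompt and answers from the context it contains.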

The Structured Data Path

Not every question is answered by a document. "How many WH-1000 units were returned last quarter?" is a data question. No document chunk contains that number. It lives in a database.

The structured data path uses text-to-SQL: the model translates the natural language question into a SQL query, runs it against a database, and generates an answer from the result. The retrieval mechanism is different, but the pattern is the same — retrieve the relevant data, then generate from it. In production, this path usually needs schema constraints, query validation, and safe execution boundaries. The model should not have unrestricted write access to production databases.
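A sketch of that path with the simplest possible guardrail. The SQL string is hard-coded where a model would generate it, and the schema and data are invented for the example:

```python
# Structured data path: validate model-generated SQL, execute it
# read-only, and hand the rows to prompt assembly.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE returns (model TEXT, quarter TEXT, units INTEGER)")
conn.executemany("INSERT INTO returns VALUES (?, ?, ?)",
                 [("WH-1000", "Q4", 42), ("WH-500", "Q4", 17)])

def run_validated(sql: str):
    """Refuse anything that is not a single SELECT statement."""
    stmt = sql.strip().rstrip(";")
    if not stmt.lower().startswith("select") or ";" in stmt:
        raise ValueError("only single SELECT statements are allowed")
    return conn.execute(stmt).fetchall()

# In a real system, an LLM would translate the user's question into this.
generated_sql = "SELECT units FROM returns WHERE model = 'WH-1000' AND quarter = 'Q4'"
rows = run_validated(generated_sql)
print(rows)  # the result feeds prompt assembly, same as retrieved chunks
```

Production systems go further: allow-listed tables, query timeouts, a read-only database role. The point is that generated SQL is never executed as-is.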

Both paths meet at the same point: prompt assembly. The model does not know or care which path produced its context. This matters because production systems rarely deal only with documents. Knowing that RAG supports both paths prevents the common mistake of forcing every question through vector search. Whether teams call this RAG or a related retrieval pattern matters less than the architectural point: the model answers from retrieved external context, not from its training data alone.

Shift 2: Answering the Question — two paths (vector search and structured data) converging at prompt assembly, then generation

Production Additions: Query Rewriting and Reranking

Two production additions worth naming briefly. Query rewriting rephrases the user's question before retrieval so the retriever has a better target. The most common version is multi-query retrieval: an LLM generates three to five rephrased versions of the original question, runs retrieval on each, and merges the results. A customer who asks "my headphones won't connect" generates variants like "Bluetooth pairing failure WH-1000" and "troubleshooting wireless connection issues." Each phrasing retrieves chunks the original might have missed.
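The merge step is the easy part. In this sketch the variants are hard-coded where an LLM would generate them, and the retriever is a crude keyword scorer over a three-chunk corpus:

```python
# Multi-query retrieval: run the retriever over several rephrasings
# of one question, then merge and dedupe the results.
CORPUS = [
    "Bluetooth pairing: hold the power button for five seconds.",
    "Troubleshooting: reset the WH-1000 by holding both buttons.",
    "Firmware changelog: v2.1 improves pairing stability.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Toy keyword retriever; a real one compares embeddings.
    words = query.lower().split()
    scored = sorted(CORPUS, key=lambda c: -sum(w in c.lower() for w in words))
    return scored[:k]

variants = [
    "my headphones won't connect",          # the original question
    "Bluetooth pairing failure WH-1000",    # LLM-generated rephrasing
    "troubleshooting wireless connection issues",
]

merged: list[str] = []
for variant in variants:
    for chunk in retrieve(variant):
        if chunk not in merged:  # dedupe across variants
            merged.append(chunk)
print(len(merged), "unique chunks go on to prompt assembly")
```

Each variant can surface chunks the original phrasing misses; the union, deduplicated, is what moves on to prompt assembly (or to a reranker, if one is in place).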

Reranking re-scores retrieved chunks with a more expensive model to improve accuracy. Neither technique is required on day one. Both are among the first things teams add when retrieval quality falls short. Part 4 covers when and why to adopt reranking alongside its broader look at retrieval decisions.

Where This Pipeline Breaks

The pipeline above will produce wrong answers. Every stage has a failure mode, and the symptoms show up in the generated output. Three patterns are worth recognizing early.

Wrong chunks, confident answer. The retriever returns the wrong chunks, and the model generates a fluent, well-structured, wrong answer. It reads like a correct response because the model is doing exactly what it should — generating confidently from whatever context it received. The context was just wrong. This is the hardest failure to catch because nothing in the output looks broken.

Right topic, wrong content. The query is not understood well enough, and the retriever returns content that is about the right topic but not what the user actually needed. A question about firmware update failures retrieves the firmware changelog instead of the troubleshooting guide. The content is real. It is just not the right content.

Right chunks, wrong answer. Sometimes the retriever does its job correctly — the right chunks are in the prompt — but the model still generates a wrong answer. It misreads the context, ignores a qualifying condition, or goes beyond what the retrieved text actually says. From the outside, this looks identical to the first failure: a confident, wrong answer. The difference is internal: the retriever succeeded and the generator failed. Telling retrieval failures apart from generation failures is the single most important debugging skill in RAG. Part 7 builds a diagnostic framework around exactly this.

For now, the instinct worth developing: when the answer is wrong, look at what was retrieved before blaming the model.


Three Takeaways

1. Ingestion and query time are separate shifts with different failure modes. Shift 1 prepares knowledge offline. Shift 2 answers questions live. They share an index but share almost nothing else. Debugging requires knowing which shift failed.

2. Parsing quality constrains everything downstream. If structured content is destroyed during parsing, no amount of chunking or embedding improvement will recover it.

3. RAG works with structured data too, not just documents. Text-to-SQL handles data questions that no document chunk can answer. Production systems often need both paths.


This article focuses on the core pipeline. Production concerns like input validation, access control, handling sensitive information, and safety checks come later in the series.

The pipeline is the mechanism. But the decisions you make inside it — how to chunk, how to retrieve, how to evaluate — are what determine whether it works. Part 4 examines those decisions and the tradeoffs that come with each one, including when hybrid search becomes useful.

Next: Chunking, Retrieval, and the Decisions That Break RAG (Part 4 of 8)
