Bhargav Patel
Part 2: RAG Architecture: How Retrieval-Augmented Generation Actually Works

Now that we understand what RAG is and why it is so popular, let's look at how RAG actually works in real systems.

In this part, we will walk through the complete RAG pipeline.

The RAG pipeline has two main components:

  • First is called the ingestion pipeline
  • Second is called the retrieval pipeline

These are very important concepts, but they are actually very simple once you understand them.


1. Ingestion Pipeline

Let’s first understand the ingestion pipeline.

The ingestion pipeline is basically how we prepare the data before giving it to the model.

You can think of it like preparing an open book before the exam.

We first prepare all the information in a structured format so that the model can later access it easily.


Step 1: Collect Data

First we collect all the data.

This data can be anything like:

  • PDF files
  • simple documents
  • Excel files
  • websites
  • or even an entire company’s internal database

So basically, we take all possible sources of information.


Step 2: Extract Useful Data

Once we collect the data, the next step is extraction.

We extract useful text from all these sources.

For example:

  • from PDFs we extract text
  • from websites we extract content
  • from databases we extract structured information

Now we have clean, usable text.
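Just to make this concrete, here is a rough sketch of how PDF text extraction might look in Python. I am assuming the pypdf library here, and the file name is only a placeholder:

```python
# A minimal sketch of text extraction, assuming the pypdf library.
# "policy.pdf" is just a placeholder file name for illustration.
from pypdf import PdfReader

reader = PdfReader("policy.pdf")

# Join the text of every page into one string
text = "\n".join(page.extract_text() or "" for page in reader.pages)

print(text[:200])  # peek at the first 200 characters
```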


Step 3: Split Data into Chunks

Now we split this data into small parts.

We do not process the whole document at once.

We divide it into small pieces called chunks.

For example, a document can be split into:

  • chunk 1
  • chunk 2
  • chunk 3
  • chunk 4

Each chunk contains a small meaningful portion of information.
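For example, a very simple character-based splitter might look like this. The chunk size and overlap values are just illustrative; real systems often split by sentences or tokens instead:

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks with a small overlap between them."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

chunks = split_into_chunks("Refunds are allowed within 30 days. " * 100)
print(len(chunks), "chunks created")
```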


Step 4: Convert Chunks into Embeddings

Now comes a very important step.

We convert each chunk into embeddings.

Now what are embeddings?

Basically, models cannot directly understand text. They only understand numbers.

So we convert text into numbers using embedding models.

Each chunk is converted into a vector of numbers, which represents its meaning.

Example:

"Refunds are allowed within 30 days"
→ [0.21, -0.67, 0.92, ...]

So now every chunk has:

  • its original text
  • and its embedding (vector representation)
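One way to do this in Python is with the sentence-transformers library. The model name below is just a common choice, not a requirement:

```python
from sentence_transformers import SentenceTransformer

# "all-MiniLM-L6-v2" is just one common embedding model, not a requirement
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Refunds are allowed within 30 days",
    "Shipping usually takes 3 to 5 business days",
]

# Each chunk becomes a fixed-length vector of numbers that captures its meaning
embeddings = model.encode(chunks)
print(embeddings.shape)  # e.g. (2, 384) -> 2 chunks, 384 numbers each
```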

Step 5: Store in Vector Database

Now we store all these embeddings in a vector database.

A vector database is different from a normal database.

Normal databases like MongoDB or MySQL mainly rely on exact matches or keyword-based search.

For example, if I search:

“heart attack symptoms”

I will only get documents that contain these exact words.

But vector databases work differently.

They use semantic search, which means meaning-based search.

So even if the exact words are not present, as long as the meaning is similar, it will still return results.

For example:

  • “heart attack symptoms”
  • “cardiac arrest signs”

Even though the words are different, the meaning is similar, so the results will still be retrieved.
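As a small sketch, here is how storing chunks and their embeddings might look using Chroma, which is just one example of a vector database (the chunks and model here are placeholders):

```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Refunds are allowed within 30 days",
    "Cardiac arrest signs include chest pain and shortness of breath",
]
embeddings = model.encode(chunks)

# An in-memory Chroma instance; a real system would use a persistent setup
client = chromadb.Client()
collection = client.create_collection("docs")

# Store each chunk together with its embedding and an id
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings.tolist(),
)
```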


So this entire process of:

  • collecting data
  • extracting text
  • chunking
  • creating embeddings
  • storing in vector database

is called the data ingestion pipeline.


2. Retrieval Pipeline

Now let’s understand the second part, which is the retrieval pipeline.

This is the actual runtime system where the user interacts with the model.


Step 1: User Query

First, the user asks a question.

For example:

“What are heart attack symptoms?”

This query enters the system.


Step 2: Convert Query into Embeddings

Just like we converted documents into embeddings, we also convert the user query into embeddings.

We use the same embedding model that was used during ingestion.

This is important because both must exist in the same vector space.
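In code, this step is tiny. Continuing with the same example model from the ingestion sketch:

```python
from sentence_transformers import SentenceTransformer

# IMPORTANT: use the SAME embedding model that was used during ingestion,
# so the query vector lives in the same vector space as the chunk vectors
model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What are heart attack symptoms?"
query_embedding = model.encode([query])
```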


Step 3: Search in Vector Database

Now we search the vector database.

We compare the query embedding with all stored embeddings.

This is called semantic search or similarity search.

So instead of matching keywords, we match meaning.

For example, if the query is about:

heart attack symptoms

We may retrieve documents about:

  • cardiac arrest
  • chest pain
  • emergency conditions

Even if exact words are not present, we still get relevant results.
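To see why this works, here is a small self-contained sketch that compares a query against stored chunks using cosine similarity. The chunks are made up for illustration; a real vector database does this comparison for you, at much larger scale:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Pretend these are the stored chunks from the vector database
chunks = [
    "Cardiac arrest signs include chest pain and shortness of breath",
    "Refunds are allowed within 30 days",
]
chunk_embeddings = model.encode(chunks)

# Embed the user query with the same model
query_embedding = model.encode(["What are heart attack symptoms?"])[0]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare the query with every stored chunk and pick the closest in meaning
scores = [cosine_similarity(query_embedding, e) for e in chunk_embeddings]
best = int(np.argmax(scores))
print(chunks[best])  # the cardiac arrest chunk wins, even with no shared keywords
```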


Step 4: Retrieve Relevant Documents (Context)

Now we get the relevant documents from the vector database.

These retrieved pieces of information are called context.

This context contains useful information related to the user query.


Step 5: Augmentation (Create Prompt)

Now instead of directly sending the user query to the LLM, we create a full prompt.

This prompt contains:

  • the original user query
  • the retrieved context

So we are basically enhancing the prompt with external knowledge.

This is called augmentation.
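A minimal sketch of this augmentation step might look like the following. The retrieved chunks here are just placeholders standing in for whatever the vector database returned:

```python
# Placeholder chunks standing in for what the vector database returned
retrieved_chunks = [
    "Common heart attack symptoms include chest pain and shortness of breath.",
    "Call emergency services immediately if symptoms last more than a few minutes.",
]
query = "What are heart attack symptoms?"

context = "\n\n".join(retrieved_chunks)

# The augmented prompt = retrieved context + the original user query
prompt = f"""Answer the question using only the context below.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {query}
"""
print(prompt)
```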


Step 6: Generation

Finally, this augmented prompt is passed to the LLM.

Now the LLM generates the final answer based on:

  • user query
  • retrieved context

Then we return the response to the user.
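Here is a rough sketch of the generation step, assuming the OpenAI Python client as one example LLM provider and reusing the prompt built in the previous step. The model name is only illustrative:

```python
from openai import OpenAI

llm_client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = llm_client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name; any chat model works
    messages=[{"role": "user", "content": prompt}],  # the augmented prompt from the previous step
)

print(response.choices[0].message.content)
```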


So this entire process of:

  • taking a user query
  • converting it into embeddings
  • searching the vector database using semantic similarity
  • retrieving relevant documents as context
  • augmenting the prompt with query and context
  • generating the final answer using the LLM

is called the retrieval pipeline.


Why It Is Called Retrieval-Augmented Generation

Now let’s understand the naming clearly.

It is called Retrieval-Augmented Generation because:

  • Retrieval → we first retrieve relevant documents from the vector database
  • Augmented → we add that retrieved context into the prompt
  • Generation → the LLM generates the final response

So all three steps together form RAG.


Complete RAG Pipeline (Simple View)
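
Ingestion (offline): documents → extract text → split into chunks → create embeddings → store in vector database

Retrieval (runtime): user query → query embedding → semantic search → retrieved context → augmented prompt → LLM → final answer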


Final Understanding

So now you should clearly understand:

  • The ingestion pipeline is where we prepare the data
  • The retrieval pipeline is where we use that data at runtime
  • Vector databases help us search based on meaning
  • LLM generates answers using retrieved context

One Line Summary

RAG works by preparing data into embeddings during ingestion, storing it in a vector database, and retrieving relevant context at query time to help the LLM generate accurate answers.
