DEV Community

Lini Abraham

What is RAG (Retrieval-Augmented Generation)?

RAG allows the AI model to look things up before answering a question.

RAG is an AI technique that combines two powerful components:

1.  A retriever that searches for relevant information from external sources
2.  A generator (a language model such as GPT or Claude) that uses that information to craft accurate, grounded responses

RAG is useful in scenarios such as:

  1. Answering questions that require recent data or information
  2. Retrieving information from private documents

How does RAG work?

Let’s say you ask an AI assistant:

“What’s our company’s refund policy?”

  1. The question is converted into a vector (a list of numbers that captures its meaning)
  2. The system searches a vector database of your documents (like PDFs, FAQs, or manuals)
  3. It retrieves the most relevant chunks of text
  4. Those chunks are inserted into the prompt sent to the language model
  5. The model then generates an answer based on both your question and the retrieved info

RAG processing step by step

1. Split Documents into Chunks (Document Chunking)
• Your company's HR policy PDF is split into small, readable chunks (e.g., 200–500 words each).

Example Chunk:

“Employees are eligible for health benefits after 90 days of full-time employment…”
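A simple word-count chunker with a small overlap between neighboring chunks (so a sentence cut at a boundary still appears whole in one of them) might look like this. The function name and parameters are illustrative, not from any particular library:

```python
def chunk_text(text, chunk_size=200, overlap=20):
    """Split text into overlapping chunks of roughly `chunk_size` words."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap  # advance by less than a full chunk to overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk reached the end of the document
    return chunks
```

Production pipelines often chunk by sentences or paragraphs instead of raw word counts, but the idea is the same: small pieces that each fit comfortably in a prompt.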

2. Generate Embeddings (Using an Embedding Model)
• Each chunk is passed through an embedding model (e.g., Amazon Titan Embeddings).
• Output: a vector (a list of numbers) representing the meaning of the text.
• A high-dimensional vector can capture complex meaning, relationships, and semantic context in:
 • Words and sentences (via embeddings)
 • Images
 • User behavior

The more dimensions, the more nuance the vector can represent — like tone, topic, or context.

Example

Chunk vector → 0.21, -0.64, 0.48, …, 0.02
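To make the idea concrete, here is a toy embedding: a normalized word-count vector over a tiny fixed vocabulary. This is only a stand-in to show the input/output shape; a real embedding model like Amazon Titan produces learned dense vectors with hundreds or thousands of dimensions:

```python
# Tiny illustrative vocabulary; a real model has no such explicit word list.
VOCAB = ["benefits", "eligible", "employment", "health", "vacation"]

def toy_embed(text):
    """Toy bag-of-words embedding: count vocabulary words, then
    normalize to unit length so cosine similarity is a dot product."""
    words = text.lower().split()
    vec = [float(words.count(w)) for w in VOCAB]
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]
```

Calling `toy_embed("health benefits")` yields a unit-length vector with nonzero entries at the "health" and "benefits" positions, mimicking the `0.21, -0.64, 0.48, …` shape above.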

3. Store in a Vector Database
• All vectors are stored in a vector DB (e.g., Amazon OpenSearch, Kendra, Pinecone).
• Each vector is linked to its original text chunk.

Now your database can search by meaning, not just keywords.
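Conceptually, the store is just a list of records pairing each vector with its source chunk and metadata. The structure and field names below are illustrative, standing in for what a real vector DB manages for you:

```python
# In-memory stand-in for a vector database: each entry links an
# embedding to the chunk it came from and where that chunk originated.
vector_store = []

def add_chunk(vector, chunk, source):
    vector_store.append({"vector": vector, "text": chunk, "source": source})

add_chunk(
    [0.21, -0.64, 0.48],  # embedding produced for this chunk
    "Employees are eligible for health benefits after 90 days of full-time employment.",
    "hr_policy.pdf",
)
```

Real vector DBs add the important part: indexes (like HNSW) that make nearest-neighbor search fast over millions of vectors.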

RUNTIME (When User Asks a Question)

4. User Asks a Question

“When do I qualify for health benefits?”

5. Convert the Question to a Vector (Query Embedding)
• The question is passed through the same embedding model.
• Result: a query vector that captures the semantic meaning of the question.

6. Semantic Search in the Vector DB
• The query vector is compared to all stored vectors using cosine similarity (or a similar metric).
• The most relevant document chunks are retrieved, even if the wording doesn't match exactly.

Retrieved chunk:

“Employees are eligible for health benefits after 90 days…”
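Cosine similarity and top-k retrieval fit in a few lines of plain Python. The `top_k` helper and its `store` format (a list of `(vector, chunk)` pairs) are illustrative; a vector DB performs this search with an approximate index instead of a full scan:

```python
def cosine(a, b):
    """Cosine similarity: dot product divided by the vectors' lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, store, k=3):
    """Return the text of the k chunks whose vectors best match the query.
    `store` is a list of (vector, chunk_text) pairs."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```

With good embeddings, "When do I qualify?" and "Employees are eligible after 90 days" land close together in vector space even though they share almost no words.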

7. Augment the Prompt
• The retrieved chunks are inserted into the prompt along with the user's question:

Prompt to the foundation model:

Context:
"Employees are eligible for health benefits after 90 days of full-time employment."

Question:
"When do I qualify for health benefits?"
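Building that augmented prompt is ordinary string formatting. The exact instruction wording is a matter of prompt design, not a fixed API; this sketch mirrors the Context/Question layout above:

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user's question into one prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}"
    )
```

The "using only the context" instruction nudges the model to stay grounded in the retrieved text rather than its training data.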

8. Foundation Model Generates Answer

“You qualify for health benefits after 90 days of full-time employment, according to company policy.”
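Putting it all together, the indexing and runtime steps above can be sketched end to end. Everything here is a toy: the bag-of-words `embed` stands in for a real embedding model, the list `store` stands in for a vector database, and the final `prompt` is what you would send to the foundation model:

```python
def tokenize(text):
    # lowercase and strip punctuation so "benefits?" matches "benefits"
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()

def embed(text, vocab):
    """Toy bag-of-words embedding over the corpus vocabulary, unit-normalized."""
    vec = [0.0] * len(vocab)
    for w in tokenize(text):
        if w in vocab:
            vec[vocab[w]] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))  # vectors are unit-length

# --- indexing (done once, offline) ---
chunks = [
    "Employees are eligible for health benefits after 90 days of full-time employment.",
    "Vacation requests must be submitted two weeks in advance.",
]
vocab = {w: i for i, w in enumerate(sorted({w for c in chunks for w in tokenize(c)}))}
store = [(embed(c, vocab), c) for c in chunks]

# --- runtime (per question) ---
question = "When do I qualify for health benefits?"
query_vec = embed(question, vocab)
best_chunk = max(store, key=lambda item: cosine(query_vec, item[0]))[1]
prompt = f"Context:\n{best_chunk}\n\nQuestion:\n{question}"
# `prompt` is what gets sent to the foundation model in step 8
```

The health-benefits chunk wins because it shares "for", "health", and "benefits" with the question, while the vacation chunk scores zero; a learned embedding model would also match paraphrases with no shared words at all.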

Benefits of Using RAG

  1. No fine-tuning needed: You don’t have to retrain the model
  2. Up-to-date answers: Pull from the latest documents
  3. Custom knowledge: Use your own files, policies, or FAQs
  4. Fewer hallucinations: Grounded responses using real data
