I know it's a bit late to write about this, but knowledge never becomes irrelevant. I haven't explored the topic of Generative AI (GenAI) yet, so I'm starting today with a series of blogs covering various exciting and practical topics with hands-on implementations.
Before we dive in, try this: go to ChatGPT or any other large language model (LLM) and ask, "What is the name of my father?" You'll likely get a response like, "I don't know until you provide more details." This is an improved response from LLMs. In the past, when queries lacked context not present in the LLM's training data, they might have given a random name based on that data.
This highlights a key challenge that LLM providers faced, and the best solution they found is called Retrieval-Augmented Generation (RAG). Let's explore what it is and how it works.
What Is RAG?
Retrieval-Augmented Generation (RAG) is a technique that provides an LLM with additional factual information beyond its training data, enabling it to answer user queries more effectively using this new information.
In the example above, when you asked the LLM for your father's name, it lacked the necessary information and responded by asking for more details. If we had included personal details in the query, the LLM could have answered accurately. Providing this additional context is, in essence, what RAG does.
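Conceptually, RAG just means the prompt the LLM sees is augmented with retrieved facts before it generates an answer. Here is a minimal sketch of that idea; retrieved_facts and ask_llm() are hypothetical placeholders, not part of any specific library:

def build_rag_prompt(question, retrieved_facts):
    # Prepend the retrieved facts so the LLM can ground its answer in them.
    context = "\n".join(f"- {fact}" for fact in retrieved_facts)
    return f"Use only the context below to answer.\n\nContext:\n{context}\n\nQuestion: {question}"

# prompt = build_rag_prompt("What is the name of my father?", ["The user's father is named John."])
# answer = ask_llm(prompt)  # ask_llm() stands in for whatever LLM client you use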
"The term RAG was coined by Patrick Lewis in the 2020 paper, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks."
Why Do We Need RAG?
Like any technology shaped by user demand, LLMs like ChatGPT initially had gaps in their knowledge, which restricted their ability to provide personalized answers. Improvements have been made over time, but they still struggled to meet certain user expectations.
Here are the main problems LLMs faced before RAG:
- Hallucination: Providing false information when the answer wasn't in the training data (learn more about AI hallucinations).
- Generic responses: Giving broad answers when users expected personalized ones.
- Non-existent references: Citing resources or links that don't exist (e.g., leading to a 404 error).
- Terminology confusion: Generating incorrect responses because different training sources used the same terms for different concepts.
You might have experienced an LLM providing a reference link that doesn't exist, resulting in a 404 error.
But is there only one type of RAG? No, there are several types, each with unique approaches to retrieving data and generating answers.
Types of RAG
RAG can be categorized based on how data is retrieved and answers are generated. Here are the three main types:
1) Naive RAG
- Description: Naive RAG is the simplest form of retrieval-augmented generation. It retrieves information and generates responses without optimizations or feedback loops.
- Process (a minimal sketch follows this list):
- Convert the user query into a vector.
- Perform a similarity check to select the top N relevant results.
- Pass the query and retrieved context to the LLM for response generation.
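Here is a minimal sketch of those three steps. The embed() and generate() helpers are hypothetical placeholders for whatever embedding model and LLM you use, not a specific library's API:

import numpy as np

def naive_rag(query, chunks, chunk_vectors, embed, generate, top_n=3):
    # 1) Convert the user query into a vector.
    q = np.array(embed(query))
    # 2) Cosine similarity against every chunk vector, keep the top N.
    sims = [np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)) for v in np.array(chunk_vectors)]
    top = np.argsort(sims)[-top_n:][::-1]
    context = "\n".join(chunks[i] for i in top)
    # 3) Pass the query plus retrieved context to the LLM.
    return generate(f"Answer using only this context:\n{context}\n\nQuestion: {query}")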
2) Advanced RAG
- Description: Advanced RAG uses sophisticated algorithms, including rerankers, fine-tuned LLMs, and feedback loops, to improve retrieval and generation.
- Process (a rough sketch of the retrieval and fusion step follows this list):
- Convert the query into a vector.
- Retrieve documents using both vector search (semantic) and keyword-based search.
- Re-rank retrieved documents for relevance.
- Fuse the results from the different retrieval paths (their scores aren't directly comparable, so they are combined with a method such as reciprocal rank fusion).
- Generate a response.
- Use feedback loops (active learning, reinforcement learning, or retriever-generator co-training) to enhance performance over time.
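A rough sketch of the hybrid retrieval and fusion step. The vector_search(), keyword_search(), and rerank() helpers are hypothetical placeholders, and reciprocal rank fusion (RRF) is just one common way to merge the two ranked lists:

def reciprocal_rank_fusion(rankings, k=60):
    # Combine several ranked lists of document IDs into one fused ranking.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def advanced_retrieve(query, vector_search, keyword_search, rerank, top_k=5):
    semantic_hits = vector_search(query)   # ranked doc IDs from embedding search
    keyword_hits = keyword_search(query)   # ranked doc IDs from e.g. BM25
    fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
    return rerank(query, fused[:top_k * 4])[:top_k]  # e.g. a cross-encoder re-ranker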
3) Modular RAG
- Description: Modular RAG is the most advanced variant, operating as an open, composable pipeline in which retrieval, generation, and the surrounding steps (query rewriting, re-ranking, post-processing) are separate modules that can be swapped or extended independently.
- Process (a small sketch of the composable pipeline follows this list):
- Rephrase the query to remove ambiguity.
- Retrieve documents from a vector database or knowledge base using similarity search.
- Filter and re-rank documents to prioritize the most relevant ones.
- Feed the retrieved information to the LLM for response generation.
- Post-process the response to add citations for credibility.
- Create a feedback loop to refine retrieval for future interactions.
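The point of the modular design is that each step is an interchangeable piece. A small sketch of that idea; every step name in the commented example (rephrase, retrieve_docs, rerank, generate_answer, add_citations, record_feedback) is a hypothetical placeholder:

from typing import Callable, List

# Each module takes the running state (a dict) and returns an updated state,
# so steps can be added, removed, or reordered without touching the others.
Module = Callable[[dict], dict]

def run_pipeline(query: str, modules: List[Module]) -> dict:
    state = {"query": query}
    for module in modules:
        state = module(state)
    return state

# Example wiring (all step functions are hypothetical placeholders):
# result = run_pipeline("What do cats eat?",
#     [rephrase, retrieve_docs, rerank, generate_answer, add_citations, record_feedback])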
Steps in RAG
Here are the key steps involved in implementing RAG:
1. Data Collection
Gather all relevant data for your application, such as emails, sales data, company policies, websites, blogs, FAQs, databases, or user manuals.
2. Data Chunking
Break data into smaller, manageable parts (e.g., by page, topic, or paragraph). This ensures related information stays together, making it easier to locate based on user queries.
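For example, here is a minimal paragraph-based chunker (one simple strategy among many):

def chunk_by_paragraph(text, max_chars=1000):
    # Split on blank lines, then merge paragraphs until a size limit is reached.
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks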
3. Document Embedding
Convert data into vector embeddings (numerical representations) where semantically related data is grouped closely together, and unrelated data is farther apart.
4. Handle the User Query
Transform the user query into a vector embedding using the same method as document embedding. Compare the query embedding with document embeddings to identify relevant chunks using techniques like cosine similarity or Euclidean distance.
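As a quick illustration of the comparison step, here is cosine similarity computed by hand with NumPy (the implementation below gets the same measure from scikit-learn); the three-dimensional vectors are toy values just to show the behaviour:

import numpy as np

def cosine_sim(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated.
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_sim([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]))  # similar vectors -> high score
print(cosine_sim([0.9, 0.1, 0.0], [0.0, 0.1, 0.9]))  # dissimilar vectors -> low score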
5. Generate the Response
Feed the retrieved relevant data chunks and the user query to the LLM, which generates the final answer.
Use Cases of RAG
RAG has a wide range of applications, including:
- Specialized chatbots and virtual assistants: Providing tailored responses based on specific data.
- Internal data research: Quickly finding answers within organizational datasets.
- Citations for credibility: Adding references to generated answers to improve trust.
- Recommendation services: Suggesting relevant products, services, or content.
Implementing RAG
Now that we understand RAG, let's implement it to get hands-on experience. In this example, we'll use cat-related data to answer user queries about cats. You'll need a Gemini API key (and the google-genai Python package) for generating embeddings and accessing the LLM.
Here’s the implementation:
import os
from google import genai
from google.genai import types

# Initialize the Gemini client with the API key from the environment.
client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

# Each line of cat.txt becomes one chunk of information.
with open("cat.txt", "r", encoding="utf-8") as f:
    data = f.readlines()
This code imports the SDK, initializes the Gemini API client using an API key stored in an environment variable, and then reads the cat data from cat.txt into a list, where each line represents one chunk of information.
vector_database = []

def add_chunk_to_db(chunk):
    # Embed the chunk and store its vector in the in-memory "database".
    embeddings = client.models.embed_content(
        model="gemini-embedding-001",
        contents=[chunk],
        config=types.EmbedContentConfig(task_type="SEMANTIC_SIMILARITY")
    ).embeddings[0].values
    vector_database.append(embeddings)
    print(f"Added chunk to database: {chunk.strip()}")

for chunk in data:
    add_chunk_to_db(chunk)
This code builds a simple vector database by converting each chunk of cat data into a vector embedding with the gemini-embedding-001 model. The embeddings are stored in vector_database, and a confirmation message is printed for each chunk added. The task_type="SEMANTIC_SIMILARITY" setting tells the model to produce embeddings optimized for comparing texts by meaning, which is exactly what the retrieval step relies on.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query, top_k=5):
    # Embed the query the same way the documents were embedded.
    query_embedding = client.models.embed_content(
        model="gemini-embedding-001",
        contents=[query],
        config=types.EmbedContentConfig(task_type="SEMANTIC_SIMILARITY")
    ).embeddings[0].values
    # Compare the query against every stored chunk and keep the top_k matches.
    similarities = cosine_similarity([query_embedding], vector_database)[0]
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    return [(data[i], similarities[i]) for i in top_indices]
This function converts the user query into a vector embedding using the same Gemini model and task_type="SEMANTIC_SIMILARITY" setting. It then calculates the cosine similarity between the query embedding and every vector in the database, selects the top_k most similar chunks, and returns them along with their similarity scores.
queries = [
    "What do cats eat?",
    "How long do cats sleep?",
    "What is the average lifespan of a cat?",
]

for q in queries:
    results = retrieve(q)
    print(f"Query: {q}")
    print("Top results:")
    for text, similarity in results:
        print(f" - {text.strip()} (similarity: {similarity:.4f})")
    print("\n\n\n")

    # Without context: the model answers from its training data alone.
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=f"Answer the question: {q}",
    )
    print(f"Response without context: {response.text.strip()}\n")

    # With context: the retrieved chunks are included in the prompt.
    context = "\n".join([text.strip() for text, _ in results])
    response_with_context = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=f"Answer the question only using the provided context: {q}\n\nContext:\n{context}",
    )
    print(f"Response with context: {response_with_context.text.strip()}\n")
    print("=" * 50 + "\n")
This code processes a list of sample queries about cats. For each query:
- It retrieves the top 5 relevant chunks from the vector database using the retrieve function.
- It prints the query and the top results with their similarity scores.
- It generates two responses using the Gemini gemini-2.5-flash model:
- One without context (just the query).
- One with the retrieved context included.
- The responses are printed for comparison, showing how the retrieved context improves accuracy. The contents parameter is passed as a single prompt string, which the Gemini API accepts directly.
Conclusion
Retrieval-Augmented Generation (RAG) is a powerful technique that enhances LLMs by providing them with relevant external data, enabling more accurate and personalized responses. By addressing issues like hallucination, generic answers, and incorrect references, RAG makes LLMs more reliable and versatile. Its variants (Naive, Advanced, and Modular) offer flexibility for different use cases, from simple chatbots to complex research tools. The hands-on implementation above demonstrates how RAG works in practice, using cat data to answer queries with improved precision. As you explore RAG further, you can experiment with different datasets and fine-tune the process to suit your needs.
Credits
- Cat Data: Xuan-Son Nguyen
- Diagrams: IBM, Amazon
- Resources: Google, Nvidia, Salesforce, Amazon, IBM, Datacamp
- Code: Heet Vekariya
👉 If you found this helpful, don’t forget to share and follow for more agent-powered insights. Got an idea or workflow in mind? Join the discussion in the comments or reach out on Twitter | LinkedIn