Next Up: Retrieval-Augmented Generation (RAG) made simple
In Part 2, we built a semantic search engine with embeddings + Qdrant. Now, we'll hook that up to an LLM so your app can answer questions instead of just returning matching docs.
By the end, you'll:
- Understand RAG and why it matters.
- Learn how LlamaIndex helps you build RAG pipelines quickly.
- Build: Raw text → Vector DB → LLM → Answer.
What is RAG?
RAG = Retrieval-Augmented Generation.
- Retrieve: Fetch relevant chunks from your own data using semantic search.
- Augment: Add that data to the prompt for the LLM.
- Generate: Get a grounded, accurate answer.
Why? LLMs are frozen in time: they are trained on generic data, not your data. RAG lets them ground their answers in your current, domain-specific data, as sketched below.
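In code terms, the loop looks roughly like this (a pseudocode sketch; search_qdrant and call_llm are hypothetical placeholders for the pieces we build below):

# Pseudocode only: the three RAG steps (placeholder functions, not real APIs)
def answer_with_rag(question):
    # 1. Retrieve: semantic search over your own data (the Qdrant index from Part 2)
    chunks = search_qdrant(question, top_k=3)
    # 2. Augment: inject the retrieved chunks into the prompt
    context = "\n\n".join(chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # 3. Generate: the LLM produces a grounded answer
    return call_llm(prompt)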
LlamaIndex vs LangChain
| Feature | LangChain | LlamaIndex |
| --- | --- | --- |
| Focus | Agents, tools, flexibility | RAG & retrieval pipelines |
| Setup | More verbose, more options | Leaner, less boilerplate |
| Best for | Complex, multi-tool workflows | Quick, retrieval-focused RAG |
Here, we go with LlamaIndex: less config, easier to build. The two are not mutually exclusive, though, and can be mixed and matched.
More on concepts: LlamaIndex Overview
Building It
Install dependencies
python -m venv venv
source venv/bin/activate
pip install llama-index sentence-transformers qdrant-client langchain_community llama-index-vector-stores-qdrant llama-index-embeddings-langchain llama-index-llms-gemini
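The code below also assumes the Qdrant instance from Part 2 is running on localhost:6333. If it isn't, one way to start it locally (assuming you have Docker) is:

docker run -p 6333:6333 qdrant/qdrant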
Scenario A: Naïve RAG (full doc as context)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings, StorageContext
from llama_index.embeddings.langchain import LangchainEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from llama_index.llms.gemini import Gemini
import os
# Keep this as None to see the context creation in the response
api_key = os.environ.get('GEMINI_API_KEY', None)
# Loads full docs, embeds, and queries without chunking
# Step 1: Load your files (txt for now)
documents = SimpleDirectoryReader("./tmp").load_data()
# Step 2: Use HuggingFace embeddings (SBERT)
# Wrapper so llama_index can consume the LangChain embedding model
embed_model = LangchainEmbedding(HuggingFaceBgeEmbeddings(model_name="all-MiniLM-L6-v2"))
# Step 3: Connect to Qdrant + create storage context for llamaindex
qdrant = QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=qdrant, collection_name="rag_docs")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Step 4: Create index to query on
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embed_model,
)
Settings.llm = Gemini(api_key=api_key) if api_key else None
# Internally calls the retriever, as shown in rag_tokens
query_engine = index.as_query_engine()
# Step 5: Create context from the index response and send it to the LLM
# Works, but not efficient: it passes the entire document text to the LLM
response = query_engine.query("What is milk?")
print(response)
response = query_engine.query("What is the color of this item?")
print(response)
response = query_engine.query("Is it edible?")
print(response)
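If you want to check which chunks the query engine actually used (handy for spotting ungrounded answers), the response object exposes them as source nodes; a small optional check:

# Optional: inspect the retrieved nodes behind the answer
for source in response.source_nodes:
    print(f"Score: {source.score}")
    print(source.node.get_content()[:200])  # first 200 characters of the chunk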
Scenario B: Smarter RAG (chunk + retrieve)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings, StorageContext
from llama_index.embeddings.langchain import LangchainEmbedding
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from llama_index.llms.gemini import Gemini
from textwrap import dedent
import google.generativeai as genai
import sys
import os
api_key = os.environ.get('GEMINI_API_KEY', None)
documents = SimpleDirectoryReader("./tmp").load_data()
embed_model = LangchainEmbedding(HuggingFaceBgeEmbeddings(model_name="all-MiniLM-L6-v2"))
qdrant = QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=qdrant, collection_name="rag_docs_chunk")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Splits docs into 200-token chunks with 50-token overlap
parser = SimpleNodeParser.from_defaults(chunk_size=200, chunk_overlap=50)
nodes = parser.get_nodes_from_documents(documents)
index = VectorStoreIndex(nodes, storage_context=storage_context, embed_model=embed_model)
retriever = index.as_retriever(similarity_top_k=3)
question = "Health benefits of turmeric?"
retrieved_nodes = retriever.retrieve(question)
print("=======Nodes retrieved=========")
for i, node in enumerate(retrieved_nodes):
    print(f"--- Chunk {i} ---")
    print(f"(Score: {node.score})")
    print(node)
if api_key is None:
    print("Cannot run Gemini workflows. API key not set")
    sys.exit()
# Both approaches below retrieve the top-k relevant chunks, then pass only those to the LLM
# =======Self prompt building section=========
def build_prompt(question, nodes):
    context_text = "\n\n".join(
        [f"Chunk {i+1}:\n{chunk.get_content()}" for i, chunk in enumerate(nodes)]
    )
    return dedent(f"""
You are an expert assistant. Use only the provided context to answer the question.
If the answer is not found in the context, say "I don't know."
Context:
{context_text}
Question:
{question}
Answer:
""")
genai.configure(api_key=api_key)
prompt = build_prompt(question, retrieved_nodes)
model = genai.GenerativeModel("gemini-pro")
response = model.generate_content(prompt)
print("\n--- Final Answer ---")
print(response.text)
# =======Self prompt building section=========
# =======Auto prompt section=========
Settings.llm = Gemini(api_key=api_key)
query_engine = index.as_query_engine()
response = query_engine.query("Health benefits of turmeric?")
print(response)
# =======Auto prompt section=========
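One practical note: both scenarios re-embed the documents on every run. Since the vectors already live in Qdrant after the first run, you could instead rebuild the index straight from the existing collection; a minimal sketch, reusing the vector_store and embed_model defined above:

# Sketch: skip re-embedding by pointing the index at the existing Qdrant collection
index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
query_engine = index.as_query_engine()
response = query_engine.query("Health benefits of turmeric?")
print(response)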
Step 4: How the Pipeline Works
- Chunking → Break docs into small segments.
- Embedding → Convert chunks to semantic vectors.
- Vector Store → Save vectors + metadata in Qdrant.
- Retrieval → Search for top-k similar chunks.
- Prompting → Inject chunks into the LLM query (recapped in the sketch below).
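Condensed into code, these five steps are the same calls we made in Scenario B (same variable names as above):

# 1. Chunking: split docs into overlapping segments
parser = SimpleNodeParser.from_defaults(chunk_size=200, chunk_overlap=50)
nodes = parser.get_nodes_from_documents(documents)
# 2. Embedding + 3. Vector store: chunks are embedded and written to Qdrant via the storage context
index = VectorStoreIndex(nodes, storage_context=storage_context, embed_model=embed_model)
# 4. Retrieval: fetch the top-k most similar chunks
top_chunks = index.as_retriever(similarity_top_k=3).retrieve(question)
# 5. Prompting: the query engine injects retrieved chunks into the LLM prompt
answer = index.as_query_engine().query(question)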
TL;DR Takeaways
- RAG = more accurate, up-to-date answers.
- Chunking keeps context relevant and token count low.
- LlamaIndex is great for quick RAG setups.
If using the files in repo, you should see chunks instead of the whole file:
=======Nodes retrieved=========
--- Chunk 0 ---
(Score: 0.72586405)
Node ID: 473e9e27-8052-4347-8108-6dc496f46912
Text: It's also used as a natural food coloring and in various
beverages and teas. Fresh turmeric root can be grated and added to
dishes, while dried and powdered turmeric is widely available.
Traditional and Medicinal Uses: Turmeric has a rich history in
traditional medicine systems like Ayurveda and traditional Chinese
medicine. It has been u...
GitHub Repo
Other Parts: