
🧠 GenAI as a Backend Engineer: Part 3 – RAG with LlamaIndex

Next Up: Retrieval-Augmented Generation (RAG) made simple 🚀

In Part 2, we built a semantic search engine with embeddings + Qdrant. Now, we’ll hook that up to an LLM so your app can answer questions instead of just returning matching docs.

By the end, you’ll:

  • 📖 Understand RAG and why it matters.
  • ⚙️ Learn how LlamaIndex helps you build RAG pipelines quickly.
  • 🏗 Build: Raw text → Vector DB → LLM → Answer.

What is RAG? 🤔

RAG = Retrieval-Augmented Generation.

  • Retrieve πŸ”: Fetch relevant chunks from your own data using semantic search.
  • Augment βž•: Add that data to the prompt for the LLM.
  • Generate ✨: Get a grounded, accurate answer.

Why? LLMs are frozen in time. They are trained on generic data, not your data. RAG lets them use your current, domain-specific data to generate responses; a minimal sketch of this loop follows below.
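
Here is the whole idea in a few lines of Python. This is only a conceptual sketch: search_collection and llm_complete are made-up placeholders standing in for the vector search and LLM calls we wire up for real below.

# Conceptual RAG loop. The two helpers are fake placeholders; the real
# versions (Qdrant search + Gemini) are built later in this post.
def search_collection(question, top_k=3):
  # Placeholder: pretend these chunks came back from semantic search
  return ["Milk is a white liquid produced by mammals.",
          "Milk is rich in calcium and protein."][:top_k]

def llm_complete(prompt):
  # Placeholder: a real LLM call goes here
  return f"(LLM answer grounded in the {prompt.count('Chunk')} chunks above)"

def answer(question):
  chunks = search_collection(question)                                      # Retrieve
  context = "\n\n".join(f"Chunk {i+1}: {c}" for i, c in enumerate(chunks))  # Augment
  prompt = f"Use only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
  return llm_complete(prompt)                                               # Generate

print(answer("What is milk?"))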


LlamaIndex vs LangChain ⚖️

Feature  | LangChain                     | LlamaIndex
Focus    | Agents, tools, flexibility    | RAG & retrieval pipelines
Setup    | More verbose, more options    | Leaner, less boilerplate
Best for | Complex, multi-tool workflows | Quick, retrieval-focused RAG

Here, we go with LlamaIndex: less config, easier to build. The two are not mutually exclusive, though, and can be mixed and matched; the snippet below shows the exact combination this post uses.
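
For example, the scripts below already mix the two: a LangChain embedding class is wrapped with LlamaIndex's LangchainEmbedding adapter so LlamaIndex can call it. A minimal standalone version of that combination (assuming the packages from the install step below):

from llama_index.embeddings.langchain import LangchainEmbedding
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

# LangChain embedding model, wrapped so LlamaIndex can use it
embed_model = LangchainEmbedding(HuggingFaceBgeEmbeddings(model_name="all-MiniLM-L6-v2"))
vector = embed_model.get_text_embedding("turmeric")
print(len(vector))  # all-MiniLM-L6-v2 produces 384-dimensional vectors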

📖 More on concepts: LlamaIndex Overview


Building It 🛠

Install dependencies 📦

python -m venv venv
source venv/bin/activate
pip install llama-index sentence-transformers qdrant-client langchain_community llama-index-vector-stores-qdrant llama-index-embeddings-langchain llama-index-llms-gemini
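
The scripts below also expect the Qdrant instance from Part 2 running on localhost:6333; if it isn't up yet, one way to start it is docker run -p 6333:6333 qdrant/qdrant. A quick connectivity check (assuming the default host and port):

from qdrant_client import QdrantClient

# Should print the current (possibly empty) list of collections if Qdrant is reachable
client = QdrantClient(host="localhost", port=6333)
print(client.get_collections())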

Scenario A: Naïve RAG (full doc as context)

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings, StorageContext
from llama_index.embeddings.langchain import LangchainEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from llama_index.llms.gemini import Gemini
import os

# Keep this as None to see the constructed context echoed back in the response instead of a generated answer
api_key = os.environ.get('GEMINI_API_KEY', None)

# Loads full docs, embeds, and queries without chunking

# Step 1: Load your files (txt for now)
documents = SimpleDirectoryReader("./tmp").load_data()

# Step 2: Use HuggingFace embeddings (SBERT)
# Interface conversion so llama_index can understand embeddings
embed_model = LangchainEmbedding(HuggingFaceBgeEmbeddings(model_name="all-MiniLM-L6-v2"))

# Step 3: Connect to Qdrant + create storage context for llamaindex
qdrant = QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=qdrant, collection_name="rag_docs")

storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Step 4: Create index to query on
index = VectorStoreIndex.from_documents(
  documents,
  storage_context=storage_context,
  embed_model=embed_model
)

Settings.llm = Gemini(api_key=api_key) if api_key else None
# Internally calls the retriever (as shown in rag_tokens)
query_engine = index.as_query_engine()

# Step 5: Build context from the retrieved docs and send it to the LLM
# Works, but not efficient: passes the entire document text to the LLM
response = query_engine.query("What is milk?")
print(response)

response = query_engine.query("What is the color of this item?")
print(response)

response = query_engine.query("Is it edible?")
print(response)
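
If you want to see exactly which text the query engine pulled in as context, the response object also carries the retrieved nodes. A small follow-up that reuses the last response from the script above:

# Inspect the nodes the query engine retrieved and passed to the LLM as context
for src in response.source_nodes:
  print(f"score={src.score}")
  print(src.node.get_content()[:200])
  print("---")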

Scenario B: Smarter RAG (chunk + retrieve)

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings, StorageContext
from llama_index.embeddings.langchain import LangchainEmbedding
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from langchain_community.embeddings import HuggingFaceBgeEmbeddings 
from llama_index.llms.gemini import Gemini
from textwrap import dedent
import google.generativeai as genai
import sys
import os

api_key = os.environ.get('GEMINI_API_KEY', None)

documents = SimpleDirectoryReader("./tmp").load_data()

embed_model = LangchainEmbedding(HuggingFaceBgeEmbeddings(model_name="all-MiniLM-L6-v2"))

qdrant = QdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(client=qdrant, collection_name="rag_docs_chunk")

storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Splits docs into 200-token chunks with 50-token overlap
parser = SimpleNodeParser.from_defaults(chunk_size=200, chunk_overlap=50)
nodes = parser.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes, storage_context=storage_context, embed_model=embed_model)
retriever = index.as_retriever(similarity_top_k=3)

question = "Health benefits of turmeric?"
retrieved_nodes = retriever.retrieve(question)

print("=======Nodes retrieved=========")
for i, node in enumerate(retrieved_nodes):
  print(f"--- Chunk {i} ---")
  print(f"(Score: {node.score})")
  print(node)

if api_key is None:
  print("Cannot run Gemini workflows. API key not set")
  sys.exit()

# Both approaches below retrieve the top-k relevant chunks, then pass only those to the LLM

# =======Manual prompt building section=========
def build_prompt(question, nodes):
  context_text = "\n\n".join(
    [f"Chunk{i+1}:\n{chunk.get_content()}" for i, chunk in enumerate(nodes)]
  )
  return dedent(f"""
  You are an expert assistant. Use only the provided context to answer the question.
  If the answer is not found in the context, say "I don't know."

  Context:
  {context_text}

  Question:
  {question}

  Answer:
  """)

genai.configure(api_key=api_key)
prompt = build_prompt(question, retrieved_nodes)

# Uses the library's default Gemini model
model = genai.GenerativeModel()
response = model.generate_content(prompt)
print("\n--- Final Answer ---")
print(response.text)
# =======Manual prompt building section=========

# =======Auto prompt section=========
Settings.llm = Gemini(api_key=api_key)  # set to None to see the constructed prompt echoed back instead
query_engine = index.as_query_engine()
response = query_engine.query("Health benefits of turmeric?")
print(response)
# =======Auto prompt section=========
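
To double-check that the chunked ingestion actually landed in Qdrant, you can peek at the collection directly. This reuses the qdrant client from the script above:

# Count the chunk vectors stored in the collection and sample a couple of payloads
print(qdrant.count(collection_name="rag_docs_chunk", exact=True))
points, _ = qdrant.scroll(collection_name="rag_docs_chunk", limit=2)
for point in points:
  print(str(point.payload)[:150])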

Step 4: How the Pipeline Works 🔄

  1. Chunking 🪓: Break docs into small segments.
  2. Embedding 🧩: Convert chunks to semantic vectors.
  3. Vector Store 📚: Save vectors + metadata in Qdrant.
  4. Retrieval 🔍: Search for top-k similar chunks.
  5. Prompting 🗨: Inject chunks into the LLM query (a minimal end-to-end recap follows below).
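
Tying those five stages together, here is a minimal self-contained recap using an in-memory index instead of Qdrant; same ./tmp files and embedding model as above, and retrieval only, so it runs without a Gemini API key:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.embeddings.langchain import LangchainEmbedding
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

documents = SimpleDirectoryReader("./tmp").load_data()
parser = SimpleNodeParser.from_defaults(chunk_size=200, chunk_overlap=50)
nodes = parser.get_nodes_from_documents(documents)                 # 1. Chunking
embed_model = LangchainEmbedding(HuggingFaceBgeEmbeddings(model_name="all-MiniLM-L6-v2"))
index = VectorStoreIndex(nodes, embed_model=embed_model)           # 2-3. Embedding + (in-memory) vector store
retriever = index.as_retriever(similarity_top_k=3)                 # 4. Retrieval
for node in retriever.retrieve("Health benefits of turmeric?"):    # 5. These chunks get injected into the prompt
  print(node.score, node.node.get_content()[:80])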

TL;DR Takeaways 📌

  • RAG = more accurate, up-to-date answers.
  • Chunking keeps context relevant and token count low.
  • LlamaIndex is great for quick RAG setups.

If you use the files in the repo, you should see individual chunks retrieved instead of the whole file:

=======Nodes retrieved=========
--- Chunk 0 ---
(Score: 0.72586405)
Node ID: 473e9e27-8052-4347-8108-6dc496f46912
Text: It's also used as a natural food coloring and in various
beverages and teas.   Fresh turmeric root can be grated and added to
dishes, while dried and powdered turmeric is widely available.
Traditional and Medicinal Uses:  Turmeric has a rich history in
traditional medicine systems like Ayurveda and traditional Chinese
medicine.   It has been u...

💻 GitHub Repo

🔗 Other Parts:
