Build Your First RAG App with Python + LlamaIndex — Step-by-Step Tutorial (2026)
Large language models know a lot, but they do not know your data. Ask Claude or GPT about your company's internal docs, your research papers, or last quarter's reports and you get confident-sounding nonsense. Fine-tuning is expensive, slow, and overkill for most use cases. RAG — Retrieval-Augmented Generation — is how you fix this.
RAG is simple in concept: before the LLM generates an answer, your application retrieves relevant chunks from your own documents and feeds them as context. The LLM answers based on your data, not its training data. No model retraining required. No hallucination about facts that exist in your files.
The concept is simple. Building it properly is not. You need document loading, text chunking, vector embeddings, a retrieval strategy, and a way to stitch it all together without drowning in boilerplate.
That is where LlamaIndex comes in. LlamaIndex is a Python framework purpose-built for RAG. While LangChain focuses on general-purpose LLM orchestration, LlamaIndex focuses specifically on connecting LLMs to your data — and it does that one job exceptionally well.
In this tutorial, you will build a RAG application from scratch using Python and LlamaIndex 0.14. Not a toy that queries a single PDF — a real application with hybrid search, conversational memory, and a path to production deployment. Every code block is complete and runnable.
What Is Retrieval-Augmented Generation (RAG) — Explained Simply
Before writing code, let's make sure the concept is clear.
A standard LLM interaction works like this:
- You send a question to the LLM.
- The LLM generates an answer from its training data.
- If the answer requires information the LLM was not trained on, it either says "I don't know" (rare) or hallucinates (common).
A RAG interaction adds a retrieval step:
- You send a question to the RAG application.
- The application searches your documents for chunks relevant to the question.
- Those chunks are injected into the LLM prompt as context.
- The LLM generates an answer grounded in your actual data.
The key insight: you are not changing the model. You are changing what the model can see when it answers. This means RAG works with any LLM — OpenAI, Anthropic, a local model running on Ollama, or even a small model like Gemma 4.
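The retrieval step is easy to sketch in plain Python. The toy `search_chunks` below ranks chunks by word overlap with the question, standing in for real embedding search, and `build_prompt` shows how retrieved chunks become LLM context. Both names are illustrative stand-ins, not LlamaIndex APIs:

```python
import re

def tokens(text):
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def search_chunks(question, corpus, top_k=2):
    """Toy retriever: rank chunks by word overlap with the question.
    A real RAG system uses embedding similarity instead."""
    q = tokens(question)
    ranked = sorted(corpus, key=lambda chunk: -len(q & tokens(chunk)))
    return ranked[:top_k]

def build_prompt(question, chunks):
    """Inject the retrieved chunks into the prompt as context."""
    context = "\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

corpus = [
    "Effloow is an AI-powered content company.",
    "The company uses 14 AI agents.",
    "Bananas are rich in potassium.",
]

question = "How many AI agents does Effloow use?"
prompt = build_prompt(question, search_chunks(question, corpus))
print(prompt)
```

The irrelevant banana chunk never reaches the LLM: that selectivity is the whole point of the retrieval step.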
Why RAG Matters in 2026
RAG is not new — the original paper was published by Meta in 2020. But in 2026, it has become the default architecture for enterprise AI applications. The reasons:
- Context windows grew, but RAG still wins. Models now accept 200K+ token inputs. You could dump entire documents into the prompt. But retrieval is still faster, cheaper, and more accurate than brute-force context stuffing. A focused 2,000-token retrieval outperforms a 100,000-token dump in both answer quality and cost.
- Enterprise adoption exploded. Companies need LLMs that know their internal data — support tickets, product docs, legal contracts, research databases. RAG is how they get there without exposing training data or paying for fine-tuning.
- The tooling matured. In 2024, building a production RAG pipeline meant stitching together five different libraries. In 2026, frameworks like LlamaIndex handle the entire pipeline with tested, optimized defaults.
RAG vs Fine-Tuning: When to Use Each
This is the most common question developers ask before starting a RAG project. The answer is clearer in 2026 than it was a year ago.
Use RAG when:
- Your knowledge base changes frequently (docs updated weekly, new data arriving daily)
- You need answers grounded in specific documents with citations
- You want to get started quickly without GPU infrastructure
- Your data is proprietary and you cannot send it to a fine-tuning API
- You need the LLM to admit when it does not know something (RAG makes this natural)
Use fine-tuning when:
- You need to change the model's behavior — its tone, format, or reasoning style
- Your task is highly specialized (medical coding, legal clause extraction) and general models underperform
- Latency is critical and you cannot afford the retrieval step
- Your training data is stable and does not change frequently
The 2026 trend: use both. The most effective production systems use RAG for knowledge and fine-tuning for behavior. Put volatile knowledge in retrieval. Put stable behavior patterns in fine-tuning. Stop trying to force one tool to do both jobs.
For this tutorial, we are building pure RAG — which covers 80%+ of real-world use cases.
LlamaIndex vs LangChain: Why LlamaIndex for RAG
Both frameworks can build RAG applications. The difference is focus.
LangChain is a general-purpose LLM orchestration framework. It handles agents, chains, tools, memory, and RAG. It is broad. If you are building an AI agent that uses tools and makes decisions, LangChain (specifically LangGraph) is the better choice.
LlamaIndex is a data framework for LLM applications. It focuses specifically on ingesting, indexing, and querying data. For RAG, it provides:
- Better document handling. Built-in loaders for 160+ data sources (PDFs, databases, APIs, Notion, Slack, Google Drive).
- Smarter chunking. Sentence-aware splitting, hierarchical chunking, and metadata extraction out of the box.
- More retrieval options. Vector search, BM25, hybrid search, knowledge graphs, and recursive retrieval — all as first-class features.
- Simpler API for RAG. A basic RAG pipeline is 5 lines of code. A production pipeline is 50. LangChain's equivalent is typically 2-3x more code.
LlamaIndex reached version 0.14.20 as of April 2026, requires Python 3.10+, and has over 300 integration packages. It is the most focused tool for the job we are doing today.
Environment Setup
Prerequisites
- Python 3.10 or higher
- An OpenAI API key (for embeddings and LLM; you can swap for Anthropic or a local model later)
- A terminal and a text editor
Create Your Project
```bash
mkdir rag-tutorial && cd rag-tutorial
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```
Install Dependencies
```bash
pip install llama-index==0.14.20 \
    llama-index-vector-stores-chroma==0.5.5 \
    llama-index-retrievers-bm25==0.7.1 \
    chromadb==1.5.5 \
    python-dotenv==1.1.0
```
Set Your API Key
Create a .env file in your project root:
```
OPENAI_API_KEY=sk-your-key-here
```
Then load it in your code:
```python
from dotenv import load_dotenv

load_dotenv()
```
If you prefer to use Anthropic's Claude as your LLM, install llama-index-llms-anthropic and set ANTHROPIC_API_KEY instead. The rest of the tutorial works the same — LlamaIndex abstracts the LLM layer.
Create a Data Directory
```bash
mkdir data
```
Add some documents to this directory. For this tutorial, use any text files, PDFs, or Markdown files you have. If you want to follow along exactly, create a sample file:
```bash
cat > data/sample.txt << 'EOF'
Effloow is an AI-powered content company that uses 14 AI agents
orchestrated through Paperclip to produce SEO-optimized technical
articles. The company focuses on developer tools, AI infrastructure,
and automation workflows. Their content covers topics including
Claude Code workflows, self-hosting guides, AI coding tool comparisons,
and hands-on tutorials for frameworks like LangGraph and LlamaIndex.
The tech stack includes a Laravel 13 content site, with articles
written in Markdown and published through an automated pipeline.
Each article targets specific keywords identified through SERP analysis
and aims for 2,000-3,000 words with practical, runnable code examples.
EOF
```
Step 1: Document Loading with SimpleDirectoryReader
LlamaIndex's SimpleDirectoryReader is the fastest way to load documents from a local directory. It automatically detects file types and uses the appropriate parser — plain text, PDF, Markdown, DOCX, and more.
```python
from llama_index.core import SimpleDirectoryReader

# Load all documents from the data directory
documents = SimpleDirectoryReader(
    input_dir="./data",
    recursive=True,                         # Include subdirectories
    required_exts=[".txt", ".pdf", ".md"],  # Filter by extension
).load_data()

print(f"Loaded {len(documents)} document(s)")
for doc in documents:
    print(f"  - {doc.metadata.get('file_name', 'unknown')} "
          f"({len(doc.text)} characters)")
```
What Happens Under the Hood
SimpleDirectoryReader does three things:
- Scans the directory for files matching your criteria.
- Parses each file using the appropriate reader (built-in for common formats, extensible for custom ones).
- Returns a list of `Document` objects, each containing the text content and metadata (filename, file path, creation date, etc.).
Loading from Other Sources
LlamaIndex supports 160+ data connectors via LlamaHub. Common examples:
```python
# From a web page
# pip install llama-index-readers-web
from llama_index.readers.web import SimpleWebPageReader

docs = SimpleWebPageReader().load_data(["https://example.com/docs"])

# From a database
# pip install llama-index-readers-database
from llama_index.readers.database import DatabaseReader

reader = DatabaseReader(uri="postgresql://user:pass@localhost/db")
docs = reader.load_data(query="SELECT content FROM articles")
```
For this tutorial, we stick with SimpleDirectoryReader because it covers the most common case: loading local files.
Step 2: Chunking and Indexing with VectorStoreIndex
Raw documents are too large to retrieve effectively. A 50-page PDF as a single chunk means your LLM gets 50 pages of context when it only needs 2 paragraphs. Chunking splits documents into smaller, focused pieces.
Basic Indexing (5 Lines)
The simplest RAG pipeline in LlamaIndex:
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What does Effloow do?")
print(response)
```
That is a working RAG app in 5 lines. But the defaults hide important decisions. Let's make them explicit.
Custom Chunking Strategy
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter

# Configure the text splitter
Settings.text_splitter = SentenceSplitter(
    chunk_size=512,     # Target chunk size in tokens
    chunk_overlap=128,  # Overlap between consecutive chunks
)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

print(f"Created {len(index.docstore.docs)} chunks from documents")
```
Chunking Strategy Guidelines
Choosing chunk size involves a trade-off:
| Chunk Size | Pros | Cons |
|---|---|---|
| Small (128-256 tokens) | Precise retrieval, less noise | May lose context, more chunks to search |
| Medium (512-768 tokens) | Good balance of precision and context | Default choice for most use cases |
| Large (1024-2048 tokens) | Rich context per chunk | More noise, higher LLM cost per query |
Our recommendation: Start with 512 tokens and 128-token overlap. Evaluate retrieval quality on your specific data. If answers lack context, increase chunk size. If answers include irrelevant information, decrease it.
The overlap ensures that sentences split across chunk boundaries are still captured. Without overlap, a key fact that spans two chunks might be lost from both.
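The mechanics of overlap are easy to see with a framework-free sliding-window sketch. This version is word-based rather than token-based, purely for illustration; SentenceSplitter works on tokens and additionally respects sentence boundaries:

```python
def chunk_words(words, chunk_size, overlap):
    """Sliding-window chunker: each chunk repeats the last `overlap`
    words of the previous one, so a fact spanning a boundary
    appears intact in at least one chunk."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

words = [f"w{i}" for i in range(10)]
for chunk in chunk_words(words, chunk_size=4, overlap=2):
    print(chunk)
```

With `chunk_size=4` and `overlap=2`, every consecutive pair of chunks shares two words: that shared region is what rescues boundary-spanning sentences.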
Persisting Your Index with ChromaDB
The default VectorStoreIndex stores everything in memory. When you restart your application, you re-index everything. For production, use a persistent vector store.
```python
import chromadb
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore

# Create a persistent ChromaDB client
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("rag_tutorial")

# Create a vector store backed by ChromaDB
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Load and index documents
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
)

print("Index persisted to ./chroma_db")
```
Now your embeddings survive restarts. To load an existing index without re-indexing:
```python
# Load existing index from ChromaDB
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_collection("rag_tutorial")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
index = VectorStoreIndex.from_vector_store(vector_store)
```
Step 3: Querying — Building a Query Engine
A query engine is where retrieval meets generation. It retrieves relevant chunks, constructs a prompt with those chunks as context, and sends it to the LLM.
Basic Query Engine
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(
    similarity_top_k=3,  # Retrieve top 3 most relevant chunks
)

response = query_engine.query("What topics does Effloow cover?")
print(response)
```
Inspecting Retrieved Chunks
Understanding what the retriever finds is critical for debugging:
```python
response = query_engine.query("What is Effloow's tech stack?")

print("Answer:", response)
print("\n--- Source Chunks ---")
for i, node in enumerate(response.source_nodes):
    print(f"\nChunk {i+1} (score: {node.score:.4f}):")
    print(f"  File: {node.metadata.get('file_name', 'unknown')}")
    print(f"  Text: {node.text[:200]}...")
```
Customizing the LLM
By default, LlamaIndex uses OpenAI's gpt-3.5-turbo. To use a different model:
```python
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# Use GPT-4o for higher quality answers
Settings.llm = OpenAI(model="gpt-4o", temperature=0.1)

# Or use a local model via Ollama:
# pip install llama-index-llms-ollama
# from llama_index.llms.ollama import Ollama
# Settings.llm = Ollama(model="gemma4:4b", request_timeout=120)
```
Using a local model with Ollama or Docker Model Runner means zero API costs and full data privacy — your documents never leave your machine.
Customizing the Prompt
You can override the default prompt template:
```python
from llama_index.core import PromptTemplate

custom_template = PromptTemplate(
    "Context from our documents:\n"
    "-----\n"
    "{context_str}\n"
    "-----\n"
    "Based on the context above, answer the following question. "
    "If the context does not contain enough information, say so clearly.\n\n"
    "Question: {query_str}\n"
    "Answer: "
)

query_engine = index.as_query_engine(
    similarity_top_k=3,
    text_qa_template=custom_template,
)
```
Step 4: Hybrid Search — BM25 + Vector Search
Pure vector search finds semantically similar content. But it can miss exact keyword matches that matter. The query "error code E-4021" might retrieve chunks about errors in general rather than the specific error code.
Hybrid search combines vector search (semantic similarity) with BM25 (keyword matching) to get the best of both approaches.
How Hybrid Search Works
- Vector search converts your query into an embedding and finds chunks with similar embeddings. Great for "what does this mean" queries.
- BM25 search is a traditional keyword ranking algorithm. It finds chunks that contain the exact terms in your query. Great for "find this specific thing" queries.
- Hybrid search runs both, then fuses the results using reciprocal rank fusion.
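Reciprocal rank fusion itself is tiny. Here is a standalone sketch of the scoring rule (each item earns `1/(k + rank)` per result list it appears in; `k=60` is the conventional damping constant). QueryFusionRetriever's internals differ in detail, so treat this as an illustration of the idea, not the library's implementation:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked result lists: each item scores 1/(k + rank) in every
    list it appears in, summed across lists, then sorted by total."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk_a", "chunk_b", "chunk_c"]  # ranked by embedding similarity
bm25_hits = ["chunk_c", "chunk_a", "chunk_d"]    # ranked by keyword score
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))
```

Chunks that rank well in both lists float to the top, which is exactly the behavior you want when one retriever catches semantics and the other catches exact terms.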
Implementation
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.retrievers.bm25 import BM25Retriever

# Load and parse documents into nodes
documents = SimpleDirectoryReader("./data").load_data()
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=128)
nodes = splitter.get_nodes_from_documents(documents)

# Create vector index from nodes
index = VectorStoreIndex(nodes)

# Create both retrievers
vector_retriever = index.as_retriever(similarity_top_k=5)
bm25_retriever = BM25Retriever.from_defaults(
    nodes=nodes,
    similarity_top_k=5,
)

# Fuse them with reciprocal rank fusion
hybrid_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    num_queries=1,       # No query expansion, just fuse results
    similarity_top_k=5,  # Final number of results after fusion
)

# Use the hybrid retriever in a query engine
query_engine = RetrieverQueryEngine.from_args(
    retriever=hybrid_retriever,
)

response = query_engine.query("What AI agents does Effloow use?")
print(response)
```
When Hybrid Search Matters
Hybrid search is worth the additional complexity when:
- Your documents contain technical terms, codes, or identifiers (error codes, product SKUs, API endpoint names)
- Users search with both natural language and exact terms
- Your corpus mixes structured data (tables, configs) with narrative text
For most RAG applications, hybrid search improves retrieval quality by 10-25% over pure vector search. The cost is slightly more memory (BM25 index in addition to vector index) and marginally higher query latency.
Step 5: Adding Memory — RAG Chatbot with Conversation History
So far, our query engine is stateless. Each query is independent. But real applications need conversation — users ask follow-up questions that reference previous answers.
The Problem
Without memory:

```text
User: What is Effloow?
Bot:  Effloow is an AI-powered content company...
User: How many agents do they use?
Bot:  I don't have information about agents.   ← Lost context
```

With memory:

```text
User: What is Effloow?
Bot:  Effloow is an AI-powered content company...
User: How many agents do they use?
Bot:  Effloow uses 14 AI agents orchestrated through Paperclip.   ← Remembers context
```
Implementation with Chat Engine
LlamaIndex provides a ChatEngine that wraps your index with conversation memory:
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.memory import ChatMemoryBuffer

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Create a memory buffer that keeps the last 3000 tokens
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

# Create a chat engine with context mode
chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt=(
        "You are a helpful assistant that answers questions about "
        "the user's documents. Be concise and cite specific details "
        "from the documents when possible. If you don't know the answer "
        "based on the provided context, say so."
    ),
)

# Multi-turn conversation
response1 = chat_engine.chat("What is Effloow?")
print("Bot:", response1)

response2 = chat_engine.chat("How many agents do they use?")
print("Bot:", response2)

response3 = chat_engine.chat("What framework orchestrates them?")
print("Bot:", response3)
```
Chat Modes Explained
LlamaIndex offers several chat modes:
| Mode | Description | Best For |
|---|---|---|
| `context` | Retrieves relevant chunks for every message, combines with chat history | General-purpose RAG chatbot |
| `condense_plus_context` | Rewrites the user query using chat history before retrieval | Follow-up questions that reference earlier messages |
| `simple` | No retrieval, pure LLM chat | When you want to toggle RAG on/off |
For most applications, `condense_plus_context` gives the best results because it reformulates "How many agents do they use?" into "How many AI agents does Effloow use?" before searching.
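Conceptually, the condensation step is a prompt that asks the LLM to rewrite the follow-up into a standalone question. The template below is a sketch to make the idea concrete; it is not LlamaIndex's internal template, which lives inside the framework:

```python
# Hypothetical condensation prompt (illustrative, not LlamaIndex's actual
# internal template). The LLM's rewritten question is what gets embedded
# and searched, instead of the ambiguous follow-up.
CONDENSE_TEMPLATE = (
    "Given the conversation below, rewrite the follow-up question as a "
    "standalone question that contains all needed context.\n\n"
    "History:\n{history}\n\n"
    "Follow-up: {question}\n"
    "Standalone question:"
)

history = (
    "User: What is Effloow?\n"
    "Bot: Effloow is an AI-powered content company..."
)
prompt = CONDENSE_TEMPLATE.format(
    history=history,
    question="How many agents do they use?",
)
print(prompt)
```

The cost is one extra LLM call per message, which is why the plain `context` mode can still be the right choice for latency-sensitive applications.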
Complete Chatbot Script
Here is a complete, runnable chatbot that combines everything so far:
"""
RAG Chatbot with LlamaIndex — Complete Example
Requires: pip install llama-index chromadb python-dotenv
"""
import chromadb
from dotenv import load_dotenv
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings, StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.vector_stores.chroma import ChromaVectorStore
load_dotenv()
# --- Configuration ---
DATA_DIR = "./data"
CHROMA_DIR = "./chroma_db"
COLLECTION_NAME = "rag_chatbot"
CHUNK_SIZE = 512
CHUNK_OVERLAP = 128
# --- Chunking settings ---
Settings.text_splitter = SentenceSplitter(
chunk_size=CHUNK_SIZE,
chunk_overlap=CHUNK_OVERLAP,
)
# --- Vector store ---
chroma_client = chromadb.PersistentClient(path=CHROMA_DIR)
chroma_collection = chroma_client.get_or_create_collection(COLLECTION_NAME)
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
# --- Build or load index ---
if chroma_collection.count() == 0:
print("Building index from documents...")
documents = SimpleDirectoryReader(DATA_DIR, recursive=True).load_data()
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context
)
print(f"Indexed {len(documents)} document(s)")
else:
print("Loading existing index...")
index = VectorStoreIndex.from_vector_store(vector_store)
# --- Chat engine ---
memory = ChatMemoryBuffer.from_defaults(token_limit=3000)
chat_engine = index.as_chat_engine(
chat_mode="condense_plus_context",
memory=memory,
system_prompt=(
"You are a helpful assistant. Answer questions based on the "
"provided documents. Be concise and accurate. If the documents "
"don't contain the answer, say so clearly."
),
)
# --- Chat loop ---
print("\nRAG Chatbot ready. Type 'quit' to exit.\n")
while True:
user_input = input("You: ").strip()
if user_input.lower() in ("quit", "exit", "q"):
break
if not user_input:
continue
response = chat_engine.chat(user_input)
print(f"Bot: {response}\n")
Step 6: Deployment Options
You have a working RAG application. Now let's get it running beyond your laptop.
Option A: FastAPI Server (Recommended for Most Teams)
Wrap your RAG pipeline in a REST API:
"""
RAG API Server — FastAPI
Requires: pip install fastapi uvicorn llama-index chromadb python-dotenv
"""
import chromadb
from contextlib import asynccontextmanager
from dotenv import load_dotenv
from fastapi import FastAPI
from pydantic import BaseModel
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore
load_dotenv()
index = None
@asynccontextmanager
async def lifespan(app: FastAPI):
global index
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_collection("rag_chatbot")
vector_store = ChromaVectorStore(chroma_collection=collection)
index = VectorStoreIndex.from_vector_store(vector_store)
yield
app = FastAPI(title="RAG API", lifespan=lifespan)
class QueryRequest(BaseModel):
question: str
top_k: int = 3
class QueryResponse(BaseModel):
answer: str
sources: list[dict]
@app.post("/query", response_model=QueryResponse)
async def query(request: QueryRequest):
query_engine = index.as_query_engine(
similarity_top_k=request.top_k,
)
response = query_engine.query(request.question)
sources = [
{
"text": node.text[:200],
"score": round(node.score, 4),
"file": node.metadata.get("file_name", "unknown"),
}
for node in response.source_nodes
]
return QueryResponse(answer=str(response), sources=sources)
Run it:
```bash
uvicorn server:app --host 0.0.0.0 --port 8000
```
Test it:
```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is Effloow?", "top_k": 3}'
```
Option B: Docker Deployment
```dockerfile
FROM python:3.12-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

EXPOSE 8000
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
```

```text
# requirements.txt
llama-index==0.14.20
llama-index-vector-stores-chroma==0.5.5
chromadb==1.5.5
fastapi==0.135.3
uvicorn==0.43.0
python-dotenv==1.1.0
```
```bash
docker build -t rag-app .
docker run -p 8000:8000 -e OPENAI_API_KEY=sk-your-key \
  -v ./chroma_db:/app/chroma_db rag-app
```
Option C: Serverless on AWS Lambda
For intermittent traffic, serverless is cost-effective. The main challenge is cold starts — loading the vector index on each invocation is slow. Solutions:
- Use a managed vector database (Pinecone, Qdrant Cloud) instead of local ChromaDB. The index lives remotely, so cold starts only load your application code.
- Use provisioned concurrency on AWS Lambda to keep instances warm.
- Use AWS Fargate for container-based serverless with persistent storage.
For most teams, Option A (FastAPI on a VPS or container) is the right starting point. If you want a cheap VPS for deployment, we covered affordable cloud options in our self-hosted dev stack guide.
Cost Estimates
RAG costs come from two places: embeddings and LLM queries.
Embedding costs (one-time per document):
- OpenAI `text-embedding-3-small`: $0.02 per 1M tokens
- 1,000 pages of documents ≈ 500K tokens ≈ $0.01
Query costs (per user query):
- Embedding the query: negligible
- LLM generation with context: $0.01-0.05 per query (GPT-4o)
- Or $0.00 if using a local model via Ollama
For a typical internal tool with 100 queries/day using GPT-4o, expect roughly $50-150/month in API costs. Using a local model drops this to zero — see our guides on running local models with Ollama or Docker Model Runner.
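The arithmetic behind these estimates fits in a few lines. The two price constants below are this article's figures, not live pricing; plug in current rates for your provider:

```python
# Back-of-the-envelope RAG cost model (illustrative figures from this
# article, not live pricing).
EMBED_PRICE_PER_M_TOKENS = 0.02  # OpenAI text-embedding-3-small
LLM_COST_PER_QUERY = 0.03        # midpoint of the $0.01-0.05 GPT-4o range

def one_time_embedding_cost(corpus_tokens):
    """One-time cost to embed the whole corpus."""
    return corpus_tokens / 1_000_000 * EMBED_PRICE_PER_M_TOKENS

def monthly_query_cost(queries_per_day):
    """Recurring LLM spend at a given daily query volume."""
    return queries_per_day * 30 * LLM_COST_PER_QUERY

print(f"Embed 500K tokens (one-time): ${one_time_embedding_cost(500_000):.2f}")
print(f"100 queries/day (monthly):    ${monthly_query_cost(100):.2f}")
```

Note how lopsided the two numbers are: indexing is effectively free, so the lever that matters is per-query LLM cost, which is exactly what a local model eliminates.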
Vector Database Comparison: Pinecone vs Weaviate vs Qdrant vs ChromaDB
Choosing a vector database is one of the most important decisions in your RAG architecture. Here is an honest comparison based on how each database fits RAG use cases in 2026.
ChromaDB — Best for Getting Started
What it is: An open-source, embedded vector database. Runs in your Python process — no separate server needed.
Strengths:
- Zero infrastructure. `pip install chromadb` and you are running.
- Persistent storage to disk. Survives restarts.
- Good enough for production with datasets under 1M vectors.
- Active community and good LlamaIndex integration.
Limitations:
- Single-node only. No horizontal scaling.
- No built-in hybrid search (you implement BM25 separately, as we did above).
- Performance degrades above 1M vectors.
Best for: Prototypes, internal tools, small-to-medium datasets, solo developers.
Qdrant — Best Performance per Dollar
What it is: A Rust-based vector search engine. Self-hosted or managed cloud.
Strengths:
- Fast. Rust implementation means lower latency and better throughput than Python-based alternatives.
- Excellent metadata filtering. Filter by any field before vector search.
- Built-in hybrid search with sparse vectors.
- Generous free tier on Qdrant Cloud (1GB free cluster).
Limitations:
- Self-hosted requires more ops knowledge than ChromaDB.
- Smaller ecosystem than Pinecone.
Best for: Production RAG with complex filtering, cost-sensitive teams, teams with ops capacity.
Pinecone — Best Managed Experience
What it is: Fully managed vector database. No infrastructure to manage.
Strengths:
- Zero ops. Pinecone handles scaling, backups, and availability.
- Scales to billions of vectors.
- Good enterprise features (RBAC, encryption, SOC2).
Limitations:
- Expensive at scale. Pricing starts around $70/month and grows with data volume. At 5M+ vectors, costs can reach $500-1,500/month.
- Vendor lock-in. Proprietary API, no self-hosted option.
- Overkill for small datasets.
Best for: Enterprise teams with budget, applications that need to scale to millions of vectors, teams that want zero ops.
Weaviate — Best for Hybrid and Multi-Modal
What it is: Open-source vector search engine with built-in hybrid search and multi-modal support.
Strengths:
- Native hybrid search (BM25 + vector in a single query).
- Multi-modal support (text, images, audio).
- GraphQL API for complex queries.
- Strong knowledge graph capabilities.
Limitations:
- Heavier than ChromaDB or Qdrant to run.
- More complex configuration.
- Overkill if you only need basic vector search.
Best for: Applications needing built-in hybrid search, multi-modal RAG, knowledge graph integration.
Quick Decision Framework
| Scenario | Choose |
|---|---|
| Starting a new project, < 100K documents | ChromaDB |
| Production app, need filtering and performance | Qdrant |
| Enterprise, zero ops, big budget | Pinecone |
| Need native hybrid search or multi-modal | Weaviate |
| Already using PostgreSQL and want simplicity | pgvector |
All of these integrate with LlamaIndex. Switching between them requires changing roughly 5-10 lines of code — the rest of your pipeline stays the same.
Frequently Asked Questions
What is retrieval-augmented generation (RAG) explained simply?
RAG is a pattern where your application retrieves relevant information from your own documents before asking an LLM to generate an answer. Instead of relying on the LLM's training data, the LLM answers based on the specific context you provide. Think of it as giving the LLM an open-book exam instead of a closed-book exam.
How much data can a RAG application handle?
With ChromaDB, you can comfortably handle up to 1 million document chunks. With Qdrant or Pinecone, the limit is in the hundreds of millions to billions. The practical limit is usually your embedding cost (one-time) and query latency requirements, not the vector database itself.
Can I use RAG with a local LLM instead of OpenAI?
Yes. LlamaIndex supports any LLM backend including Ollama for running local models. Replace the LLM setting with Ollama(model="gemma4:4b") and your data never leaves your machine. See our Ollama setup guide for installation instructions.
How is RAG different from just pasting documents into ChatGPT?
Three differences. First, RAG is selective — it retrieves only the relevant chunks, not the entire document, which produces better answers and lower costs. Second, RAG works at scale — you cannot paste 10,000 documents into a chat window, but a RAG system searches them all in milliseconds. Third, RAG is programmable — you control the retrieval strategy, the prompt, and the output format.
LlamaIndex vs LangChain for RAG — which is better in 2026?
For pure RAG applications, LlamaIndex is more focused and typically requires less code. For applications that combine RAG with agentic behavior (tool use, multi-step reasoning, decision loops), LangGraph is the better choice. Many production systems use both — LlamaIndex for the retrieval layer and LangChain/LangGraph for the orchestration layer.
What is the best vector database for RAG in 2026?
It depends on your scale and team. ChromaDB for starting out and small-to-medium projects. Qdrant for production workloads that need performance and filtering. Pinecone if you want fully managed infrastructure and have the budget. Weaviate if you need native hybrid search. See the detailed comparison section above.
How do I evaluate if my RAG application is working well?
Track three metrics: context precision (are the retrieved chunks relevant?), faithfulness (does the answer match the retrieved context?), and answer relevancy (does the answer actually address the question?). LlamaIndex includes evaluation utilities, and the RAGAS framework provides a standardized evaluation toolkit for RAG systems.
Pinecone vs ChromaDB vs Qdrant for RAG — which should I use?
Start with ChromaDB for prototyping. It requires no infrastructure and runs embedded in your Python process. When you outgrow it — either because your dataset exceeds 1M vectors, you need horizontal scaling, or you need advanced filtering — migrate to Qdrant (best price-performance) or Pinecone (best managed experience). The migration is straightforward because LlamaIndex abstracts the vector store layer.
Can I deploy a RAG application serverless on AWS?
Yes, but with caveats. The main challenge is cold starts — loading a local vector index on each Lambda invocation is slow. Use a managed vector database (Pinecone or Qdrant Cloud) so the index lives remotely, and your Lambda only handles the application logic. Alternatively, use AWS Fargate for container-based serverless with persistent storage.
What to Build Next
You now have a working RAG application with hybrid search, conversational memory, and deployment options. Here are the natural next steps:
Add evaluation. Use LlamaIndex's built-in evaluation or the RAGAS framework to measure retrieval quality and answer accuracy on your specific data.
Try agentic RAG. Combine your RAG pipeline with an AI agent that can decide when to search and what to search for. Our LangGraph tutorial shows how to build agents that use tools — your RAG query engine can be one of those tools.
Add metadata filtering. Tag documents with metadata (department, date, category) and let users filter results. This is especially powerful with Qdrant's filtering capabilities.
Experiment with local models. Replace OpenAI with a local model via Ollama for zero-cost, fully private RAG. Models like Gemma 4 are surprisingly capable for RAG tasks.
Scale your retrieval. As your document corpus grows, explore advanced strategies: hierarchical retrieval (retrieve document → retrieve chunks), recursive retrieval, and re-ranking with cross-encoder models.
The tools you learned in this tutorial — LlamaIndex, vector databases, hybrid search, and conversational memory — are the foundation of every production RAG system. The specific architecture varies, but these building blocks stay the same.
Build something useful. Ship it. Then optimize.