Picture this: You’ve just deployed a shiny new AI assistant for your company. Your stakeholders are thrilled. Then, a highly-valued client asks it a crucial question about your brand-new, unreleased product documentation. The AI, possessing all the confidence in the world, proceeds to completely hallucinate an answer based on what it read on the internet two years ago.
This is the fatal flaw of standard Large Language Models (LLMs). Out of the box, they are incredibly articulate but fundamentally ignorant about your proprietary, real-time, or deeply domain-specific data. They confidently guess when they don't know the answer.
Enter Retrieval-Augmented Generation (RAG)—one of the most effective defenses against LLM hallucination.
Think of RAG as giving your AI an open-book test. Instead of relying on its static memory, a RAG system securely searches your private databases for the exact, relevant context and injects it directly into the prompt before the LLM generates a response. You instantly transform a generic AI into a precise domain expert on your custom data.
While there are plenty of simple "hello world" RAG tutorials out there, building something that scales requires careful software design. In this comprehensive deep-dive tutorial, we aren't just writing toy Jupyter cells. I will walk you through building a modular, multi-storage, production-ready RAG web service from scratch using FastAPI, LangChain, and Google Gemini—an architecture that you can actually deploy to the real world.
1. Our Technology Stack
Instead of bloated orchestrators, we'll build on a lean, highly performant stack:
- FastAPI: A modern, wildly fast web framework for building our backend API. It gives us async capabilities, automatic data validation, and built-in Swagger UI.
- LangChain & LCEL: The leading framework to orchestrate LLM workflows. We'll heavily utilize LangChain Expression Language (LCEL) for declarative, readable pipelines.
- Google Generative AI (Gemini): We'll power our system using two models:
  - gemini-embedding-001 for mapping text to vectors.
  - gemini-2.5-flash for blindingly fast and intelligent text generation.
Durable Vector Stores: To store and query embeddings, our application dynamically supports:
- Pinecone: A fully managed, cloud-native vector database.
- FAISS: Facebook AI Similarity Search for blistering local performance.
- Cloud Backups: Built-in support for synchronizing our FAISS indexes with AWS S3 or Google Cloud Storage (GCS).
2. Project Architecture
Before jumping into the implementation, let's map out our project structure. A good RAG application separates the HTTP layer, the orchestration layer, and the data layer.
rag-app/
├── main.py # FastAPI application entry point
├── endpoints.py # API route definitions
├── rag.py # Core RAG logic & LLM orchestration
├── vector_stores/ # Vector Database integrations
│ ├── __init__.py
│ ├── pinecone_store.py # Managed Cloud Vector DB
│ ├── faiss_store.py # Local Disk FAISS
│ ├── s3_store.py # AWS S3 backed FAISS
│ └── gcs_store.py # Google Cloud Storage backed FAISS
├── data/ # Your raw PDFs, TXTs, MDs, etc.
├── .env # Environment variables
└── requirements.txt # Dependencies
To follow along, make sure your core dependencies are installed:
pip install fastapi uvicorn langchain langchain-core langchain-community langchain-text-splitters langchain-google-genai langchain-pinecone pinecone pydantic python-dotenv faiss-cpu google-cloud-storage boto3
3. The Data Layer: Multi-Backend Vector Stores
Vector stores hold our document "embeddings" (mathematical representations of text in n-dimensional space). This allows us to perform rapid "similarity searches" using cosine similarity or Euclidean distance to find text chunks semantically related to a user's question.
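To make "similarity search" concrete, here is a minimal pure-Python sketch of cosine similarity and a top-k scan over toy 2-D vectors (real vector stores use optimized approximate-nearest-neighbor indexes, not a linear scan):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=5):
    """Return the indices of the k document vectors most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(top_k([1.0, 0.0], docs, k=2))  # → [0, 2]
```

In the real system, the vectors come from gemini-embedding-001 and have hundreds of dimensions, but the ranking idea is exactly this.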
Because production environments vary, we designed our data layer to be dynamic based on environment variables.
Option A: Cloud-Native with Pinecone
For a fully managed experience, we use Pinecone (vector_stores/pinecone_store.py):
import os
from langchain_pinecone import PineconeVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from pinecone import Pinecone

def setup_pinecone(embeddings, data_dir):
    """Setup Pinecone vector store"""
    index_name = os.getenv("PINECONE_INDEX_NAME", "rag-index")
    pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
    index = pc.Index(index_name)

    # Idempotency: Avoid re-processing if the index is already populated
    if index.describe_index_stats()['total_vector_count'] > 0:
        print("Loading existing Pinecone index...")
        return PineconeVectorStore(index=index, embedding=embeddings)

    print("Creating new Pinecone index from documents...")
    loader = DirectoryLoader(data_dir, glob="**/*.*")
    documents = loader.load()

    # Chunk the documents to prevent hitting LLM context windows
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    document_chunks = splitter.split_documents(documents)
    return PineconeVectorStore.from_documents(document_chunks, embeddings, index_name=index_name)
Key takeaways here:
- Idempotency: We check index.describe_index_stats() to ensure we don't ingest the same data multiple times on server restarts.
- Chunking: RecursiveCharacterTextSplitter breaks huge documents into smaller 500-character chunks with a 50-character overlap (to prevent cutting ideas in half).
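To see why the overlap matters, here is a simplified fixed-size chunker (a sketch of the idea only; LangChain's recursive splitter is smarter and tries to break on paragraph and sentence boundaries before resorting to hard cuts):

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Slice text into windows that share chunk_overlap characters with their neighbor."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 1000)
print(len(chunks), len(chunks[0]))  # → 3 500
```

The last 50 characters of each chunk reappear at the start of the next one, so a sentence straddling a boundary is always intact in at least one chunk.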
Option B: Local & Cloud-Backed FAISS
If you don't want to use a managed service, FAISS is incredibly fast. However, FAISS runs in-memory. To make it durable across deployments, we've implemented loaders that sync the index directory to S3 or GCS.
Here is a look at our GCS implementation (vector_stores/gcs_store.py):
import os
from google.cloud import storage
from langchain_community.vectorstores import FAISS
# ... other imports omitted for brevity

def save_faiss_to_gcs(vector_store, bucket_name, gcs_prefix="faiss_index"):
    local_path = "faiss_index"
    vector_store.save_local(local_path)
    client = storage.Client()
    bucket = client.bucket(bucket_name)

    # Upload local FAISS index files to GCS
    for file in os.listdir(local_path):
        blob = bucket.blob(f"{gcs_prefix}/{file}")
        blob.upload_from_filename(os.path.join(local_path, file))

def load_faiss_from_gcs(embeddings, bucket_name, gcs_prefix="faiss_index"):
    local_path = "faiss_index"
    # Downloads files back from GCS to local disk before instantiating FAISS
    # ...
    return FAISS.load_local(local_path, embeddings, allow_dangerous_deserialization=True)
Our setup_faiss_gcs function tries to load from GCS first; if the index isn't there, it creates it locally from the data/ folder and then pushes it to the bucket. We have an identical setup for AWS S3 using boto3.
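The control flow behind setup_faiss_gcs (and its S3 twin) boils down to a "load from remote, else build and push" pattern. Here is a backend-agnostic sketch, with hypothetical callables standing in for the real GCS/S3 and FAISS calls:

```python
def setup_with_remote_sync(load_remote, build_index, save_remote):
    """Prefer an existing remote index; otherwise build locally and push it up."""
    index = load_remote()   # e.g. download FAISS files from the bucket, or None if absent
    if index is not None:
        return index
    index = build_index()   # e.g. embed and index everything under data/
    save_remote(index)      # e.g. upload the saved index files to the bucket
    return index

# Toy usage: the remote is empty, so we build and "push"
pushed = []
index = setup_with_remote_sync(lambda: None, lambda: "new-index", pushed.append)
print(index, pushed)  # → new-index ['new-index']
```

Because the push happens only after a successful build, a crashed ingestion never leaves a half-written index in the bucket.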
4. The Brains: Orchestrating with LangChain (LCEL)
Now we need to glue together our embedding model, our vector store, our prompt, and our Gemini LLM. LangChain makes this elegant using the LangChain Expression Language (LCEL).
Let's look at rag.py:
import os
from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from vector_stores import setup_faiss, setup_pinecone, setup_faiss_gcs, setup_faiss_s3
load_dotenv()
google_api_key = os.getenv("GOOGLE_API_KEY")
data_dir = os.getenv("DATA_DIR", "data/")
vector_store_type = os.getenv("VECTOR_STORE", "faiss").lower()
use_gcs = os.getenv("USE_GCS_STORAGE", "false").lower() == "true"
use_s3 = os.getenv("USE_S3_STORAGE", "false").lower() == "true"
# 1. Initialize the Generator LLM (Gemini 2.5 Flash)
llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", google_api_key=google_api_key)
def setup_rag_system():
    # 2. Setup Embeddings
    embeddings = GoogleGenerativeAIEmbeddings(model="gemini-embedding-001", google_api_key=google_api_key)

    # 3. Dynamically route to the right vector database
    if vector_store_type == "pinecone":
        vector_store = setup_pinecone(embeddings, data_dir)
    elif use_gcs:
        vector_store = setup_faiss_gcs(embeddings, data_dir)
    elif use_s3:
        vector_store = setup_faiss_s3(embeddings, data_dir)
    else:
        vector_store = setup_faiss(embeddings, data_dir)

    # 4. Create the Retriever (Fetch Top 5 most relevant chunks)
    return vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 5})

def get_qa_chain():
    retriever = setup_rag_system()

    # 5. Construct the System Prompt
    prompt = ChatPromptTemplate.from_template(
        "Answer the question based strictly on the following context:\n\n{context}\n\nQuestion: {question}"
    )

    def format_docs(docs):
        return "\n\n".join(doc.page_content for doc in docs)

    # 6. Build the LCEL Pipeline
    chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )
    return chain

# Create a singleton instance to use across API calls
qa_chain_instance = get_qa_chain()

async def get_rag_response(query: str):
    # Use the async API so we don't block FastAPI's event loop
    return await qa_chain_instance.ainvoke(query)
Deep Dive into LCEL: How data relates
The most elegant part of this script is LangChain's piping syntax:
{"context": retriever | format_docs, ...} | prompt | llm | StrOutputParser()
Here is exactly what happens when qa_chain_instance.invoke("What is our refund policy?") is called:
- Routing: The query "What is our refund policy?" is passed to the dictionary.
- Retrieval: For the context key, the query goes into the retriever. The retriever converts the text to a vector using gemini-embedding-001, searches Pinecone/FAISS, and returns the top 5 Document objects.
- Formatting: Those 5 documents are piped (|) into format_docs, which extracts the raw text and concatenates it with double newlines.
- Prompting: The filled dictionary (context + question) is piped into the ChatPromptTemplate, injecting the values into the string.
- Generation: The assembled prompt is sent to gemini-2.5-flash.
- Parsing: Because chat models return metadata-heavy AIMessage objects, the StrOutputParser extracts just the raw markdown string.
All of this happens synchronously or asynchronously, abstracted cleanly into one chain.
5. Exposing the Pipeline via FastAPI
To facilitate the deployment of this pipeline for operational use, we expose it over HTTP using FastAPI. By leveraging FastAPI's Pydantic integrations, we effortlessly add strict input validation and automatic OpenAPI (Swagger) documentation.
In endpoints.py, we define the /chat route:
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel
from rag import get_rag_response
router = APIRouter()
# Define our expected JSON Request Body
class Payload(BaseModel):
    chat_input: str

@router.post("/chat")
async def query_rag_system(request: Payload):
    try:
        # Pass the input securely to our RAG service
        response = await get_rag_response(request.chat_input)
        return {"input": request.chat_input, "response": response}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
And in main.py, we strap it all together:
from fastapi import FastAPI
from endpoints import router
app = FastAPI(title="RAG Application with Langchain and Gemini")
# Register our route controllers
app.include_router(router)
6. Running, Testing, and Deployment Options
To fire up your robust new RAG server, spin up Uvicorn in your terminal:
uvicorn main:app --reload
By default, the server spins up on http://127.0.0.1:8000. Thanks to FastAPI, you instantly have an interactive UI available.
- Visit http://127.0.0.1:8000/docs.
- Open the POST /chat endpoint.
- Pass a payload into chat_input:
{
  "chat_input": "What data formatting does our application support?"
}
Behind the scenes, the system will embed your query, search your data sources, assemble a context-aware prompt, consult Google Gemini, and return a grounded answer over HTTP.
Deploying to Production
Because the entire app is stateless (state is managed by Pinecone/S3/GCS), you can quickly containerize this application with Docker.
An example Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
Deploy the image to Google Cloud Run, AWS App Runner, or Railway, provide your .env variables, and you have a horizontally scalable AI backend.
Conclusion & Next Steps
Building an intelligent, context-aware chatbot doesn't require a messy monolith of code. By cleanly separating our concerns—FastAPI for networking, Langchain for orchestration, Pinecone/FAISS for storage, and Google Gemini for raw intelligence—we've crafted a highly scalable codebase.
Where to go from here?
- Conversation Memory: Try wrapping our chain in RunnableWithMessageHistory to sustain complex, multi-turn chat interactions.
- Metadata Filtering: If you serve multiple users, update the Pinecone/FAISS retriever to apply metadata filtering based on user_id, ensuring strict data isolation per user.
- Agentic Workflows: Replace our linear chain with langgraph to allow the LLM to decide whether it needs to search the vector database, search the web, or use an API tool.
Happy Coding! Let me know in the comments how you plan to use RAG in your own projects, and if you have any questions setting up your vector indexes!