Retrieval-Augmented Generation (RAG) is the gold standard for reducing LLM hallucinations and giving an AI access to your private data. While there are many frameworks out there, building one from scratch gives you full control over the pipeline.
In this post, I'll show you how to build a complete RAG system using Google's Gemini API for embeddings and text generation, and FAISS for lightning-fast vector similarity search.
The Tech Stack
- LLM: Gemini 2.5 Flash (Fast, cost-effective, and powerful)
- Embeddings: gemini-embedding-001
- Vector Database: FAISS (Facebook AI Similarity Search)
- Environment: Python 3.13+ with the uv package manager.
How it Works: The Architecture
The system follows a three-step process:
Ingestion: We read .txt files, convert them into high-dimensional vectors using Gemini, and store them in a FAISS index.
Retrieval: When a user asks a question, we embed the query and find the most relevant document chunks using Cosine Similarity.
Generation: We feed the retrieved context + the original question into Gemini 2.5 Flash to generate a grounded, cited answer.
rag-implementation-gemini/
├── docs/              # Place your .txt documents here
│   ├── doc1.txt       # Sample AI/ML content
│   └── doc2.txt       # Sample RAG content
├── main.py            # Main application entry point
├── ingest_docs.py     # Document ingestion and embedding
├── rag_query.py       # Query processing and answer generation
├── test_rag.py        # Automated test suite
├── pyproject.toml     # Project dependencies
├── README.md          # Project documentation
├── faiss_index.bin    # Generated FAISS index (after ingestion)
└── docs_meta.pkl      # Generated document metadata (after ingestion)
Getting Started
1. Setup
First, ensure you have your Google API Key from Google AI Studio.
I use the uv package manager for its incredible speed. If you haven't tried it yet, it's a game-changer for Python workflows.
# Install dependencies
uv sync
# Set your API Key
export GOOGLE_API_KEY="your_api_key_here"
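If you're starting from an empty directory, a minimal pyproject.toml for this stack might look like the following. The dependency names are the PyPI packages I'd expect here (the repo's actual file may pin versions differently):

```toml
[project]
name = "rag-implementation-gemini"
version = "0.1.0"
requires-python = ">=3.13"
dependencies = [
    "google-genai",    # Gemini API client
    "faiss-cpu",       # local vector search
    "numpy",
    "python-dotenv",   # load the API key from .env
]
```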
2. Document Ingestion (ingest_docs.py)
We use IndexFlatIP (inner product) on L2-normalized vectors, which is equivalent to cosine similarity. Normalization removes magnitude effects, so documents of different lengths are compared by direction alone and semantic relevance stays consistent.
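As a quick sanity check that inner product on unit vectors really equals cosine similarity (pure-Python, no FAISS needed):

```python
import math

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a, b = [3.0, 4.0], [4.0, 3.0]
# Cosine similarity: dot product divided by the two magnitudes
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
# Inner product after normalization
ip = dot(normalize(a), normalize(b))
print(abs(cosine - ip) < 1e-9)  # True: the two measures agree
```

This is exactly why the ingestion code below calls faiss.normalize_L2 before adding vectors to an IndexFlatIP index.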
import os
import pickle
from glob import glob

import numpy as np
import faiss
from dotenv import load_dotenv
from google import genai

load_dotenv()
# Accept either variable name so the setup instructions above work as-is
API_KEY = os.environ.get("GOOGLE_API_KEY") or os.environ.get("GEMINI_API_KEY")
if not API_KEY:
    raise SystemExit("Set GOOGLE_API_KEY or GEMINI_API_KEY in .env or environment")

# Initialize the client (per the Gemini docs)
client = genai.Client(api_key=API_KEY)

EMBED_MODEL = "gemini-embedding-001"
DOCS_DIR = "docs"
INDEX_FILE = "faiss_index.bin"
META_FILE = "docs_meta.pkl"

def read_documents(path):
    files = glob(os.path.join(path, "*.txt"))
    docs = []
    for p in files:
        with open(p, "r", encoding="utf-8") as f:
            text = f.read().strip()
        docs.append({"path": p, "text": text})
    return docs

def embed_texts(texts):
    # Call the Gemini embeddings endpoint and normalize output to a float32 matrix
    resp = client.models.embed_content(model=EMBED_MODEL, contents=texts)
    vectors = []
    for emb in resp.embeddings:
        # The SDK returns ContentEmbedding objects with `.values`; fall back to an iterable
        if hasattr(emb, "values"):
            vec = np.array(emb.values, dtype="float32")
        else:
            vec = np.array(list(emb), dtype="float32")
        vectors.append(vec)
    return np.vstack(vectors)

def build_faiss_index(embs):
    dim = embs.shape[1]
    index = faiss.IndexFlatIP(dim)  # inner product on normalized vectors == cosine
    faiss.normalize_L2(embs)        # normalize in place so IP behaves as cosine
    index.add(embs)
    return index

def main():
    docs = read_documents(DOCS_DIR)
    texts = [d["text"] for d in docs]
    if not texts:
        print("No docs found in", DOCS_DIR)
        return
    print(f"Embedding {len(texts)} docs with model {EMBED_MODEL} ...")
    embs = embed_texts(texts)  # shape: (N, dim)
    print("Building FAISS index...")
    index = build_faiss_index(embs)
    print("Saving index and metadata...")
    faiss.write_index(index, INDEX_FILE)
    with open(META_FILE, "wb") as f:
        pickle.dump(docs, f)
    print("Done. Index saved to", INDEX_FILE)

if __name__ == "__main__":
    main()
3. The Retrieval & Query Engine (rag_query.py)
The magic happens when we combine the retrieved snippets into a single prompt. We instruct the model to be a "helpful assistant" and, crucially, to cite its sources.
import os
import pickle

import numpy as np
import faiss
from dotenv import load_dotenv
from google import genai

load_dotenv()
# Support both GOOGLE_API_KEY and GEMINI_API_KEY for convenience
API_KEY = os.environ.get("GOOGLE_API_KEY") or os.environ.get("GEMINI_API_KEY")
if not API_KEY:
    raise SystemExit("Set GOOGLE_API_KEY or GEMINI_API_KEY in .env or environment")

client = genai.Client(api_key=API_KEY)

EMBED_MODEL = "gemini-embedding-001"
GEN_MODEL = "gemini-2.5-flash"  # pick a generation model available to you
INDEX_FILE = "faiss_index.bin"
META_FILE = "docs_meta.pkl"

def embed_query(q):
    resp = client.models.embed_content(model=EMBED_MODEL, contents=[q])
    # Extract the embedding values from the ContentEmbedding object
    if hasattr(resp.embeddings[0], "values"):
        vec = np.array(resp.embeddings[0].values, dtype="float32")
    else:
        # fallback if the structure differs
        vec = np.array(list(resp.embeddings[0]), dtype="float32")
    return vec

def load_index():
    if not os.path.exists(INDEX_FILE) or not os.path.exists(META_FILE):
        raise SystemExit("Run ingest_docs.py first to build the index.")
    index = faiss.read_index(INDEX_FILE)
    with open(META_FILE, "rb") as f:
        docs = pickle.load(f)
    return index, docs

def retrieve_topk(index, qvec, k=3):
    q = qvec.reshape(1, -1)
    faiss.normalize_L2(q)  # normalize so inner product == cosine (matches the index)
    scores, ids = index.search(q, k)
    return scores[0], ids[0]

def generate_answer(query, retrieved_texts):
    # Build a prompt that includes the retrieved docs as context
    context = "\n\n---\n\n".join(retrieved_texts)
    prompt = (
        "You are a helpful assistant. Use the following context to answer the question.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION:\n{query}\n\nAnswer concisely and cite which context file you used."
    )
    # Call Gemini text generation (per the docs)
    response = client.models.generate_content(
        model=GEN_MODEL,
        contents=prompt,
    )
    # `response.text` is the convenience accessor; fall back to the raw candidate structure
    answer = getattr(response, "text", None)
    if not answer:
        answer = response.candidates[0].content.parts[0].text
    return answer

def main():
    index, docs = load_index()
    question = input("Enter your question: ").strip()
    qvec = embed_query(question)
    scores, ids = retrieve_topk(index, qvec, k=3)
    retrieved_texts = []
    for idx in ids:
        if idx < 0 or idx >= len(docs):
            continue
        meta = docs[idx]
        retrieved_texts.append(f"FILE: {meta['path']}\n{meta['text'][:1000]}")  # limited preview
    print("\nRetrieved top documents (score, path):")
    for s, i in zip(scores, ids):
        if 0 <= i < len(docs):
            print(f"{s:.4f} {docs[i]['path']}")
    print("\nGenerating answer using Gemini...")
    answer = generate_answer(question, retrieved_texts)
    print("\n=== Answer ===\n")
    print(answer)

if __name__ == "__main__":
    main()
User Experience: The Interactive CLI
I built an interactive menu to make the system easy to use. You can toggle between ingesting new knowledge and asking questions instantly.
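A minimal sketch of what the menu loop in main.py could look like (the repo's actual main.py may differ; the dispatch helper here is purely illustrative):

```python
def dispatch(choice):
    """Map a raw menu choice to an action name; anything unknown is invalid."""
    actions = {"1": "ingest", "2": "query", "0": "exit"}
    return actions.get(choice, "invalid")

def run_cli():
    while True:
        print("\n=== RAG with Gemini ===")
        print("1. Ingest documents")
        print("2. Ask a question")
        print("0. Exit")
        action = dispatch(input("Enter your choice: ").strip())
        if action == "ingest":
            import ingest_docs  # deferred import: only needed for this branch
            ingest_docs.main()
        elif action == "query":
            import rag_query
            rag_query.main()
        elif action == "exit":
            break
        else:
            print("Invalid choice, try again.")

# run_cli()  # uncomment to start the interactive loop
```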
=== RAG with Gemini ===
1. Ingest documents
2. Ask a question
0. Exit
Enter your choice: 2
Enter your question: How does this RAG system handle vector search?
Retrieved top documents (score, path):
0.8921 docs/technical_specs.txt
Generating answer using Gemini...
=== Answer ===
This system utilizes FAISS with IndexFlatIP for similarity search.
It normalizes embeddings to perform Cosine Similarity... [FILE: docs/technical_specs.txt]
Why this approach?
Gemini 2.5 Flash: It offers a massive context window and rapid response times, making the "Generation" phase feel instantaneous.
FAISS: Instead of relying on a heavy cloud database for small-to-medium projects, FAISS is local, incredibly fast, and easy to deploy.
Transparency: By including Source Citations, we eliminate the "black box" feel of AI. The user knows exactly which document provided the answer.
Future Improvements
Chunking Strategy: Implement recursive character splitting for larger documents.
PDF Support: Add PyPDF2 or langchain loaders to handle more file types.
Web UI: Wrap this in Streamlit for a more modern interface.
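To make the chunking idea concrete, here is a rough sketch of recursive character splitting. It is similar in spirit to LangChain's RecursiveCharacterTextSplitter but written from scratch; the separator order and chunk size are assumptions you would tune:

```python
def split_text(text, chunk_size=500, separators=("\n\n", "\n", " ")):
    """Recursively split text on progressively finer separators until
    every chunk fits in chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    for sep in separators:
        if sep in text:
            parts = text.split(sep)
            chunks, current = [], ""
            for part in parts:
                candidate = (current + sep + part) if current else part
                if len(candidate) <= chunk_size:
                    current = candidate  # keep accumulating into this chunk
                else:
                    if current:
                        chunks.append(current)
                    # part itself may still be too big: recurse with finer separators
                    chunks.extend(split_text(part, chunk_size, separators))
                    current = ""
            if current:
                chunks.append(current)
            return chunks
    # no separator found at all: fall back to a hard character split
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

You would run this over each document at ingestion time and embed the chunks instead of whole files, keeping the source path in the metadata so citations still work.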
Explore the Code
The full implementation is available on my GitHub:
(https://github.com/SumantaSwainEpam/rag-implementation-gemini)
I'd love to hear your thoughts! How are you handling document retrieval in your projects? Let's discuss in the comments!