Sumanta Swain
Build a High-Performance RAG System with Gemini 2.5 Flash and FAISS πŸš€

Retrieval-Augmented Generation (RAG) is the gold standard for reducing LLM hallucinations and giving AI access to your private data. While there are many frameworks out there, building one from scratch gives you full control over the pipeline.

In this post, I’ll show you how to build a complete RAG system using Google’s Gemini API for embeddings and text generation, and FAISS for lightning-fast vector similarity search.

πŸ—οΈ The Tech Stack

  • LLM: Gemini 2.5 Flash (Fast, cost-effective, and powerful)
  • Embeddings: gemini-embedding-001
  • Vector Database: FAISS (Facebook AI Similarity Search)
  • Environment: Python 3.13+ with the uv package manager.

πŸ› οΈ How it Works: The Architecture

The system follows a three-step process:

  1. Ingestion: We read .txt files, convert them into high-dimensional vectors using Gemini, and store them in a FAISS index.

  2. Retrieval: When a user asks a question, we embed the query and find the most relevant document chunks using Cosine Similarity.

  3. Generation: We feed the retrieved context + the original question into Gemini 2.5 Flash to generate a grounded, cited answer.
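The three steps above can be sketched end to end with a toy stand-in for the embedding model (a character-frequency vector instead of Gemini); the real implementations follow in the next sections:

```python
import numpy as np

# Toy ingest -> retrieve -> generate sketch. The "embedding" here is just a
# normalized character-frequency vector, standing in for Gemini embeddings.
def toy_embed(text):
    vec = np.zeros(26, dtype="float32")
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

docs = ["FAISS is a vector search library", "Gemini generates text"]
doc_vecs = np.vstack([toy_embed(d) for d in docs])   # 1. Ingestion

query = "vector search"
qvec = toy_embed(query)
scores = doc_vecs @ qvec                             # 2. Retrieval (cosine, vectors are normalized)
best = docs[int(np.argmax(scores))]

# 3. Generation: this prompt would be sent to the LLM
prompt = f"CONTEXT:\n{best}\n\nQUESTION:\n{query}"
print(best)
```

In the real system, toy_embed becomes a Gemini embedding call, the matrix becomes a FAISS index, and the prompt is sent to Gemini 2.5 Flash.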

rag-implementation-gemini/
β”œβ”€β”€ docs/                    # Place your .txt documents here
β”‚   β”œβ”€β”€ doc1.txt            # Sample AI/ML content
β”‚   └── doc2.txt            # Sample RAG content
β”œβ”€β”€ main.py                 # Main application entry point
β”œβ”€β”€ ingest_docs.py          # Document ingestion and embedding
β”œβ”€β”€ rag_query.py            # Query processing and answer generation
β”œβ”€β”€ test_rag.py             # Automated test suite
β”œβ”€β”€ pyproject.toml          # Project dependencies
β”œβ”€β”€ README.md               # This file
β”œβ”€β”€ faiss_index.bin         # Generated FAISS index (after ingestion)
└── docs_meta.pkl           # Generated document metadata (after ingestion)

πŸš€ Getting Started

1. Setup

First, ensure you have your Google API Key from Google AI Studio.

I use the uv package manager for its incredible speed. If you haven't tried it yet, it’s a game-changer for Python workflows.

# Install dependencies
uv sync

# Set your API Key
export GOOGLE_API_KEY="your_api_key_here"

2. Document Ingestion (ingest_docs.py)

We use IndexFlatIP (Inner Product) on L2-normalized vectors, which makes the inner product equal to Cosine Similarity. This keeps semantic relevance accurate even when document lengths vary.
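As a quick sanity check, the inner product of two L2-normalized vectors is exactly their cosine similarity:

```python
import numpy as np

# Inner product of L2-normalized vectors == cosine similarity.
a = np.array([3.0, 4.0], dtype="float32")
b = np.array([4.0, 3.0], dtype="float32")

cosine = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

a_n = a / np.linalg.norm(a)   # this is what faiss.normalize_L2 does in place
b_n = b / np.linalg.norm(b)
inner = float(a_n @ b_n)

print(cosine, inner)  # the two values agree
```

This is why the code below normalizes every vector before adding it to (or searching) the IndexFlatIP index.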

import os
import pickle
from glob import glob
import numpy as np
import faiss
from dotenv import load_dotenv
from google import genai

load_dotenv()
# Support both GOOGLE_API_KEY and GEMINI_API_KEY, matching the setup step above
API_KEY = os.environ.get("GOOGLE_API_KEY") or os.environ.get("GEMINI_API_KEY")
if not API_KEY:
    raise SystemExit("Set GOOGLE_API_KEY or GEMINI_API_KEY in .env or environment")

# initialize client (per Gemini docs)
client = genai.Client(api_key=API_KEY)

# Gemini embedding model (see the Gemini docs for currently available models)
EMBED_MODEL = "gemini-embedding-001"

DOCS_DIR = "docs"
INDEX_FILE = "faiss_index.bin"
META_FILE = "docs_meta.pkl"

def read_documents(path):
    files = glob(os.path.join(path, "*.txt"))
    docs = []
    for p in files:
        with open(p, "r", encoding="utf-8") as f:
            text = f.read().strip()
        docs.append({"path": p, "text": text})
    return docs

def embed_texts(texts):
    # Call Gemini embeddings endpoint and normalize output to list of float vectors
    resp = client.models.embed_content(model=EMBED_MODEL, contents=texts)
    vectors = []
    for emb in resp.embeddings:
        # SDKs often return ContentEmbedding with `.values`; fallback to iterable
        if hasattr(emb, "values"):
            vec = np.array(emb.values, dtype="float32")
        else:
            vec = np.array(list(emb), dtype="float32")
        vectors.append(vec)
    return np.vstack(vectors)

def build_faiss_index(embs):
    dim = embs.shape[1]
    index = faiss.IndexFlatIP(dim)  # use inner product on normalized vectors (cosine)
    # Normalize if using IP for cosine:
    faiss.normalize_L2(embs)
    index.add(embs)
    return index

def main():
    docs = read_documents(DOCS_DIR)
    texts = [d["text"] for d in docs]
    if not texts:
        print("No docs found in", DOCS_DIR)
        return

    print(f"Embedding {len(texts)} docs with model {EMBED_MODEL} ...")
    embs = embed_texts(texts)  # shape: (N, dim)

    print("Building FAISS index...")
    index = build_faiss_index(embs)

    print("Saving index and metadata...")
    faiss.write_index(index, INDEX_FILE)
    with open(META_FILE, "wb") as f:
        pickle.dump(docs, f)

    print("Done. Index saved to", INDEX_FILE)

if __name__ == "__main__":
    main()


3. The Retrieval & Query Engine (rag_query.py)

The magic happens when we combine the retrieved snippets into a single prompt. We instruct the model to be a "helpful assistant" and, crucially, to cite its sources.

import os
import pickle
import numpy as np
import faiss
from dotenv import load_dotenv
from google import genai

load_dotenv()
# Support both GOOGLE_API_KEY and GEMINI_API_KEY for convenience
API_KEY = os.environ.get("GOOGLE_API_KEY") or os.environ.get("GEMINI_API_KEY")
if not API_KEY:
    raise SystemExit("Set GOOGLE_API_KEY or GEMINI_API_KEY in .env or environment")

client = genai.Client(api_key=API_KEY)
EMBED_MODEL = "gemini-embedding-001"
GEN_MODEL = "gemini-2.5-flash"   # example text generation model; pick one available to you

INDEX_FILE = "faiss_index.bin"
META_FILE = "docs_meta.pkl"

def embed_query(q):
    resp = client.models.embed_content(model=EMBED_MODEL, contents=[q])
    # extract the actual embedding values from ContentEmbedding object
    if hasattr(resp.embeddings[0], 'values'):
        vec = np.array(resp.embeddings[0].values, dtype="float32")
    else:
        # fallback if structure is different
        vec = np.array(list(resp.embeddings[0]), dtype="float32")
    # normalize in place for cosine search (reshape returns a view of vec,
    # so vec itself ends up L2-normalized, matching the index)
    faiss.normalize_L2(vec.reshape(1, -1))
    return vec

def load_index():
    if not os.path.exists(INDEX_FILE) or not os.path.exists(META_FILE):
        raise SystemExit("Run ingest_docs.py first to build index.")
    index = faiss.read_index(INDEX_FILE)
    with open(META_FILE, "rb") as f:
        docs = pickle.load(f)
    return index, docs

def retrieve_topk(index, qvec, k=3):
    # qvec shape (dim,)
    q = qvec.reshape(1, -1)
    faiss.normalize_L2(q)  # ensure normalized
    scores, ids = index.search(q, k)
    return scores[0], ids[0]

def generate_answer(query, retrieved_texts):
    # Build a prompt that includes retrieved docs as context (short)
    context = "\n\n---\n\n".join(retrieved_texts)
    prompt = (
        "You are a helpful assistant. Use the following context to answer the question.\n\n"
        f"CONTEXT:\n{context}\n\nQUESTION:\n{query}\n\nAnswer concisely and cite which context file you used."
    )

    # call Gemini text generation (per docs)
    response = client.models.generate_content(
        model=GEN_MODEL,
        contents=prompt
    )
    # the google-genai SDK exposes the generated text on response.text;
    # fall back to the candidates structure if needed
    answer = getattr(response, "text", None) or response.candidates[0].content.parts[0].text
    return answer

def main():
    index, docs = load_index()
    question = input("Enter your question: ").strip()
    qvec = embed_query(question)
    scores, ids = retrieve_topk(index, qvec, k=3)

    retrieved_texts = []
    for idx in ids:
        if idx < 0 or idx >= len(docs): 
            continue
        meta = docs[idx]
        retrieved_texts.append(f"FILE: {meta['path']}\n{meta['text'][:1000]}")  # limited preview

    print("\nRetrieved top documents (score, path):")
    for s, i in zip(scores, ids):
        if i >= 0 and i < len(docs):
            print(f"{s:.4f}  {docs[i]['path']}")

    print("\nGenerating answer using Gemini...")
    answer = generate_answer(question, retrieved_texts)
    print("\n=== Answer ===\n")
    print(answer)

if __name__ == "__main__":
    main()


πŸ–₯️ User Experience: The Interactive CLI

I built an interactive menu to make the system easy to use. You can toggle between ingesting new knowledge and asking questions instantly.

=== RAG with Gemini ===
1. Ingest documents
2. Ask a question
0. Exit

Enter your choice: 2
Enter your question: How does this RAG system handle vector search?

Retrieved top documents (score, path):
0.8921  docs/technical_specs.txt

Generating answer using Gemini...
=== Answer ===
This system utilizes FAISS with IndexFlatIP for similarity search. 
It normalizes embeddings to perform Cosine Similarity... [FILE: docs/technical_specs.txt]
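The menu loop in main.py can be sketched roughly as follows (a hypothetical structure; the real file wires the choices to ingest_docs.main and rag_query.main, here replaced with placeholders so the sketch runs standalone):

```python
# Minimal sketch of the interactive menu loop in main.py.
MENU = (
    "\n=== RAG with Gemini ===\n"
    "1. Ingest documents\n"
    "2. Ask a question\n"
    "0. Exit\n"
)

def dispatch(choice, actions):
    """Look up the handler for a menu choice; None means unknown choice."""
    return actions.get(choice.strip())

def run(actions, read_input=input):
    while True:
        print(MENU)
        choice = read_input("Enter your choice: ")
        if choice.strip() == "0":
            break
        handler = dispatch(choice, actions)
        if handler is None:
            print("Invalid choice, try again.")
        else:
            handler()

# Drive the loop with scripted input to show the flow (placeholders stand in
# for ingest_docs.main and rag_query.main):
scripted = iter(["2", "0"])
run({"1": lambda: print("ingesting..."), "2": lambda: print("asking...")},
    read_input=lambda prompt: next(scripted))
```

Passing read_input in as a parameter keeps the loop testable without a real terminal.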

πŸ’‘ Why this approach?

  • Gemini 2.5 Flash: It offers a massive context window and rapid response times, making the "Generation" phase feel instantaneous.

  • FAISS: Instead of relying on a heavy cloud database for small-to-medium projects, FAISS is local, incredibly fast, and easy to deploy.

  • Transparency: By including Source Citations, we eliminate the "black box" feel of AI. The user knows exactly which document provided the answer.

πŸ”§ Future Improvements

  • Chunking Strategy: Implement recursive character splitting for larger documents.

  • PDF Support: Add pypdf (formerly PyPDF2) or LangChain document loaders to handle more file types.

  • Web UI: Wrap this in Streamlit for a more modern interface.
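The chunking improvement could look something like this (a hedged sketch of recursive character splitting; the parameter names and separator order are my own, not from this project):

```python
# Sketch of recursive character splitting: try the coarsest separator first,
# recursing into any piece that is still longer than chunk_size.
def recursive_split(text, chunk_size=500, separators=("\n\n", "\n", ". ", " ")):
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for sep in separators:
        if sep in text:
            chunks = []
            for part in text.split(sep):
                chunks.extend(recursive_split(part, chunk_size, separators))
            return chunks
        # otherwise fall through to the next, finer separator
    # no separator left: hard-cut at chunk_size
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = recursive_split(
    "para one.\n\npara two is a bit longer. it has two sentences.",
    chunk_size=30,
)
print(chunks)
```

Each chunk would then be embedded and indexed individually in ingest_docs.py, with the metadata tracking which file (and offset) each chunk came from.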

πŸ”— Explore the Code

The full implementation is available on my GitHub:

https://github.com/SumantaSwainEpam/rag-implementation-gemini

I’d love to hear your thoughts! How are you handling document retrieval in your projects? Let’s discuss in the comments! πŸ‘‡
