Santanu Mohanta

Posted on May 30

I built a RAG pipeline from scratch — no LangChain, just FastAPI + FAISS

#rag #python #ai #fastapi

Most RAG tutorials I found were either "pip install langchain and you're done" or 50-page academic papers. I wanted something in between — a pipeline I could actually explain in an interview, where I understood every line.

So I built one from scratch. No LangChain, no LlamaIndex, no frameworks. Just FastAPI, FAISS, sentence-transformers, and an LLM API.

Here's what I built, what worked, and what broke.

[Image: Uploading a PDF]

[Image: Querying the document]

The architecture

PDF --> extract text (pypdf) --> chunk (500 char, 50 overlap) --> embed (MiniLM-L6-v2)
                                                                        |
                                                                        v
question --> embed --> FAISS top-k search --> build prompt with chunks --> LLM --> answer + sources

Five Python files, ~300 lines total:

File	Responsibility
`main.py`	FastAPI app, 3 endpoints, prompt engineering
`pdf_loader.py`	PDF text extraction via pypdf
`rag.py`	Chunking + embedding
`store.py`	FAISS vector store wrapper
`llm.py`	Swappable LLM client (Groq / OpenAI / Anthropic)

How the upload works

When you POST a PDF to /upload, three things happen:

1. Text extraction — pypdf reads each page and returns the raw text. Pages with no extractable text (scanned images) are skipped.
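
A minimal sketch of this step, assuming pypdf's `PdfReader` and 1-based page numbers. The names `extract_pages` and `load_pdf` are illustrative and may not match what the repo's `pdf_loader.py` actually contains.

```python
def extract_pages(reader):
    """Return (text, page_number) pairs for pages that have extractable text.

    `reader` is anything with a `.pages` sequence whose items expose
    `extract_text()`, which is the shape of pypdf's PdfReader.
    """
    pages = []
    for page_num, page in enumerate(reader.pages, start=1):  # assumption: 1-based pages
        text = page.extract_text() or ""
        if text.strip():  # scanned/image-only pages yield no text and are skipped
            pages.append((text, page_num))
    return pages


def load_pdf(path):
    # Imported lazily so extract_pages can be exercised without pypdf installed.
    from pypdf import PdfReader

    return extract_pages(PdfReader(path))
```

The `(text, page_num)` order matches what `chunk_pages` unpacks in the next step.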

2. Chunking — each page is split into ~500-character chunks with 50 characters of overlap. The overlap prevents losing context at chunk boundaries.

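from dataclasses import dataclass

CHUNK_SIZE = 500     # maximum characters per chunk
CHUNK_OVERLAP = 50   # characters shared by consecutive chunks

@dataclass
class Chunk:  # minimal definition inferred from the call below
    chunk_id: int
    text: str
    page: int

def chunk_pages(pages):
    """Split (text, page_num) pairs into overlapping character chunks."""
    chunks = []
    chunk_id = 0
    for text, page_num in pages:
        start = 0
        while start < len(text):
            end = min(start + CHUNK_SIZE, len(text))
            chunk_text = text[start:end].strip()
            if chunk_text:  # skip whitespace-only slices
                chunks.append(Chunk(chunk_id=chunk_id, text=chunk_text, page=page_num))
                chunk_id += 1
            if end == len(text):
                break  # reached the end of this page
            start = end - CHUNK_OVERLAP  # step back to create the overlap
    return chunks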
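
To see how the overlap plays out, this small sketch reproduces only the index arithmetic of the loop above for a hypothetical 1,200-character page. `chunk_spans` is an illustrative helper, not part of the repo.

```python
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50

def chunk_spans(text_len, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Return the (start, end) offsets the chunking loop would visit."""
    spans = []
    start = 0
    while start < text_len:
        end = min(start + size, text_len)
        spans.append((start, end))
        if end == text_len:
            break
        start = end - overlap  # step back so neighbouring chunks share 50 chars
    return spans

print(chunk_spans(1200))  # [(0, 500), (450, 950), (900, 1200)]
```

Each chunk after the first starts 450 characters after the previous one, so the last 50 characters of one chunk are repeated at the start of the next.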

3. Embedding — each chunk is embedded into a 384-dimensional vector using all-MiniLM-L6-v2. This runs locally on CPU, no API call needed. Vectors are normalized so we can use inner product as cosine similarity.

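from functools import lru_cache

from sentence_transformers import SentenceTransformer

@lru_cache(maxsize=1)  # one possible way to get the lazy-loaded singleton
def get_embed_model():
    return SentenceTransformer("all-MiniLM-L6-v2")

def embed_texts(texts):
    model = get_embed_model()  # lazy-loaded singleton
    vectors = model.encode(
        texts,
        normalize_embeddings=True,  # unit-length vectors: inner product == cosine
        show_progress_bar=False,
        convert_to_numpy=True,
    )
    return vectors.astype("float32")  # FAISS expects float32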

The vectors and chunk metadata go into a FAISS IndexFlatIP index — brute-force exact search, which is fine for up to ~100k vectors.
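
To make the "inner product as cosine similarity" point concrete, here is a dependency-free sketch of what an exact inner-product search computes. It is plain Python rather than FAISS, and `normalize`, `inner_product`, and `top_k` are illustrative names, not FAISS or repo APIs.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query, vectors, k=2):
    """Brute-force search: score every stored vector, return the k best (index, score)."""
    scores = [(i, inner_product(query, v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]

# Toy 2-D "embeddings"; real ones from all-MiniLM-L6-v2 have 384 dimensions.
stored = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])]
query = normalize([2.0, 0.1])

results = top_k(query, stored, k=2)
print([i for i, _ in results])  # [0, 1]
```

Because every vector has unit length, the inner product equals the cosine of the angle between the vectors. Scoring every stored vector is also why the cost grows linearly with the size of the index.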

How the query works

When you POST a question to /query:

1. The question is embedded using the same model.
2. FAISS finds the top-k most similar chunks by cosine similarity.
3. The chunks are formatted into a prompt with labels like [Chunk 3 | Page 2].
4. The LLM generates an answer grounded in those chunks.
5. Both the answer and the source chunks are returned.
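
The chunk-labeling step can be sketched like this, assuming each chunk carries `chunk_id`, `text`, and `page` as in the chunking code. The `build_user_prompt` name and the overall layout are assumptions; only the `[Chunk 3 | Page 2]` label format comes from the steps above, and the chunk texts are placeholders.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    chunk_id: int
    text: str
    page: int

def build_user_prompt(question, chunks):
    """Label each retrieved chunk so the answer can be traced back to its source."""
    blocks = [f"[Chunk {c.chunk_id} | Page {c.page}]\n{c.text}" for c in chunks]
    context = "\n\n".join(blocks)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Placeholder chunks standing in for real retrieval results.
retrieved = [
    Chunk(chunk_id=3, text="(text of chunk 3)", page=2),
    Chunk(chunk_id=7, text="(text of chunk 7)", page=4),
]
print(build_user_prompt("What's included in the Standard tier?", retrieved))
```

Keeping the labels in the prompt is what lets the response return source chunks alongside the answer.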

The system prompt is deliberately strict:

You are a careful assistant that answers questions strictly
from the provided document context.

Rules:
- Use ONLY the context below. Do not use outside knowledge.
- If the answer is not in the context, say:
  "I couldn't find that in the document."

Swappable LLM providers

One thing I'm happy with — the LLM is swappable via a single environment variable:

LLM_PROVIDER=groq      # or openai, or anthropic

All three providers share the same interface:

class LLMClient(ABC):
    @abstractmethod
    def generate(self, system: str, user: str) -> str: ...

You only need an API key for the provider you pick. I used Groq with Llama 3.3 70B for development because it's fast and free-tier friendly.
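
One way the single-variable switch could be wired up is a small registry keyed by `LLM_PROVIDER`. This is a sketch: `EchoClient`, `PROVIDERS`, and `get_llm_client` are hypothetical names, and the echo client stands in for the real Groq, OpenAI, and Anthropic wrappers so the example runs without API keys.

```python
import os
from abc import ABC, abstractmethod

class LLMClient(ABC):
    @abstractmethod
    def generate(self, system: str, user: str) -> str: ...

class EchoClient(LLMClient):
    """Stand-in provider: returns the user message instead of calling an API."""
    def generate(self, system: str, user: str) -> str:
        return f"echo: {user}"

# Hypothetical registry; a real one would map "groq", "openai", and "anthropic"
# to classes that wrap each provider's SDK.
PROVIDERS = {"echo": EchoClient}

def get_llm_client(provider=None):
    """Pick a client by explicit name, falling back to the LLM_PROVIDER variable."""
    name = (provider or os.environ.get("LLM_PROVIDER", "echo")).lower()
    if name not in PROVIDERS:
        raise ValueError(f"Unknown LLM_PROVIDER: {name!r}")
    return PROVIDERS[name]()

client = get_llm_client("echo")
print(client.generate(system="Be brief.", user="hello"))  # echo: hello
```

The rest of the app only ever calls `generate`, so adding a provider means adding one class and one registry entry.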

Testing it: what worked and what didn't

I created a fictional 5-page company document and threw 19 questions at the pipeline. Questions ranged from simple lookups to multi-hop reasoning to negative tests (questions the document can't answer).

What worked well:

Direct lookups: "What is the list price of the Magpie-7?" — nailed it
Table data: "What's included in the Standard tier?" — correct
Negative tests: "What's Zentara's stock ticker?" — correctly said "not in the document"
Multi-hop: "If I want 1-hour SLA support, what will it cost?" — combined info from the pricing table

What failed:

"Who is the CEO?" — couldn't find it
"How many employees does Zentara have?" — couldn't find it

Both answers were on page 1, in a dense "Company snapshot" table: CEO, CTO, HQ, employees, revenue — all packed together.

Why it failed (and what I learned)

The problem wasn't the LLM — it was the retriever. The Company snapshot table had 8+ different facts crammed into one chunk. The embedding for that chunk became a muddy average of all those topics, so it didn't rank highly for any specific question.

This is a classic weakness of pure semantic search. The word "CEO" appears exactly once in the document, so a keyword search (BM25) would find it instantly. Vector search, however, relies on semantic similarity, and a short query like "Who is the CEO?" doesn't match strongly against a chunk that is mostly about revenue, headquarters, and employee count.

The fix is hybrid retrieval: combine BM25 (keyword matching) with vector search, so that an exact term match can rescue a chunk the embedding ranks poorly. Many production RAG systems take this approach. It's on my to-do list.
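
One common way to combine the two rankings is reciprocal rank fusion (RRF). The post does not say which fusion method is planned, so treat this as one possible sketch: `rrf_fuse` is an illustrative name, the chunk ids are made up, and `k=60` is a conventional default rather than a tuned value.

```python
def rrf_fuse(rankings, k=60):
    """Merge ranked lists of chunk ids with reciprocal rank fusion.

    Each id earns 1 / (k + rank) from every list it appears in (rank starts
    at 1); ids are returned from highest to lowest combined score.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

# Made-up rankings for "Who is the CEO?": the keyword ranker puts the
# snapshot chunk (id 0) first, while the vector ranker buries it.
bm25_ranking = [0, 5, 2]
vector_ranking = [5, 2, 9, 0]

print(rrf_fuse([bm25_ranking, vector_ranking]))  # [5, 0, 2, 9]
```

In this toy case the snapshot chunk moves from fourth place in the vector ranking to second place overall, which would be enough to land it in a top-k of 2 or more.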

Key design decisions (interview-ready)

If you're building this for interviews, these are the tradeoffs worth knowing:

Decision	Why
Character-based chunking (not token-based)	Simpler, no tokenizer dependency. Production would typically chunk by tokens, using the tokenizer that matches the model (e.g. tiktoken for OpenAI models).
Local embeddings (not OpenAI)	Free, offline, no API latency. Lower quality but fine for demos.
FAISS IndexFlatIP (not HNSW)	Exact search, no approximation. Fine up to ~100k vectors.
Normalized embeddings	Inner product = cosine similarity. One less thing to configure.
No streaming	v1 simplification. Streaming is where LLM SDKs diverge the most.
No conversation memory	Each query is independent. Adding memory is straightforward but adds complexity.

What I'd add next

Hybrid retrieval (BM25 + vector) — catches keyword matches that pure semantic search misses
Reranker (cross-encoder) — re-scores the top-k results for better precision
Evaluation set — automated accuracy measurement instead of manual testing
Streaming — better UX for longer answers
Conversation memory — follow-up questions

Try it yourself

The repo is here: github.com/santanu2908/chat-with-pdf-rag (v1)

uv sync
cp .env.example .env   # set your API key
uv run uvicorn app.main:app --reload

Open http://localhost:8000/docs, upload the included sample PDF (data/sample_test_file.pdf), and start asking questions.

If you've built something similar or have suggestions (especially on hybrid retrieval), I'd love to hear about it in the comments.

I'm Santanu Mohanta — you can connect with me on LinkedIn or check out my other projects on GitHub.

Top comments (7)

Harjot Singh • May 31

"No LangChain, just FastAPI + FAISS" is a choice a lot of people are quietly making, and for good reason - for a straightforward RAG pipeline the framework often adds more abstraction (and debugging pain) than it saves, and rolling it yourself means you actually understand every step and can tune it. Frameworks earn their keep on complex multi-step orchestration; for "embed, store, retrieve, stuff context," raw is frequently cleaner. Knowing WHEN you've outgrown DIY is the real skill.

The payoff of building it raw is exactly what shows up when quality matters: you control chunking, the embedding choice, and can add a re-ranker - the levers that actually drive retrieval quality, which a framework can obscure. Owning those is owning the part that makes RAG good vs mediocre. That control-the-retrieval-quality discipline is what I lean on in Moonshift (a multi-agent pipeline that ships a prompt to a deployed SaaS) - scoped, well-retrieved context beats framework magic, for quality and cost. Clean build, FastAPI+FAISS is a solid no-framework base. At what point would you reach for a framework - or have you found raw scales fine even as the pipeline grows? Curious where your DIY ceiling is.

Santanu Mohanta • May 31

Thanks for the thoughtful comment! Completely agree — for a focused pipeline like this, raw gives you the understanding that frameworks abstract away. You can't debug what you don't understand.

My plan is to keep building raw for a few more iterations — hybrid retrieval (BM25 + vector), a reranker, evaluation harness, streaming — basically until I've implemented the core components that actually drive RAG quality. Once I'm comfortable with how each piece works under the hood, I'll move to a framework like LangChain or LlamaIndex for the orchestration layer. At that point, the framework becomes a productivity tool rather than a black box — I'll know exactly what it's doing for me and where to look when something breaks.

So to answer your question: my DIY ceiling is when I've touched enough of the moving parts to have strong intuition about what the framework is abstracting. Not there yet, but getting close. Moonshift sounds like a great example of that approach — scoped retrieval over framework magic.

Harjot Singh • May 31

Exactly, for a focused pipeline the no-framework route is the right call: FastAPI + FAISS gives you full control over chunking, the embedding step, and re-ranking, which IS retrieval quality. The framework only earns its weight once you need the generic abstractions, and most focused RAG never does. The one upgrade I'd put next on your list is an abstain path, when retrieval comes back thin, "I don't have support for this" beats letting the model paper over the gap with a fluent guess. That single check is what makes raw RAG trustworthy. Genuinely good build-from-scratch writeup.

Santanu Mohanta • May 31

Thanks! Agree on the abstain path — a retrieval-level confidence check before even hitting the LLM would be more robust. Something like a minimum similarity threshold on the top-k results. Good call, adding it to the list.

Harjot Singh • May 31

That's exactly the right ceiling: the framework stops being a black box the moment you've hand-built enough of the moving parts to know what it's hiding. Your ordering is also right, hybrid retrieval and a reranker move quality far more than orchestration sugar does, and the evaluation harness is the piece most people skip and then can't tell whether a change actually helped. Build that early, it turns "feels better" into a number you can defend. By the time you reach for LangChain/LlamaIndex you'll be using it for plumbing while keeping your own judgment on retrieval, which is the healthy split. I went the same way with Moonshift: own the parts that decide quality, let a framework handle the boilerplate. When you add the reranker, are you leaning cross-encoder, or LLM-as-reranker?

Santanu Mohanta • May 31

Cross-encoder first — lightweight, fast, no extra API calls. Something like cross-encoder/ms-marco-MiniLM-L-6-v2 keeps it local and consistent with the local-embeddings philosophy.
LLM-as-reranker is interesting but adds latency and API cost for what's still a small-scale pipeline. If cross-encoder precision isn't enough, that's when I'd experiment with LLM reranking.
Thanks for the validation on the evaluation harness.

FORGE SOCIAL AGENT • May 31

Great to see a detailed walk-through! How did you handle real-time query processing with FastAPI?