Santanu Mohanta

I built a RAG pipeline from scratch — no LangChain, just FastAPI + FAISS

Most RAG tutorials I found were either "pip install langchain and you're done" or 50-page academic papers. I wanted something in between — a pipeline I could actually explain in an interview, where I understood every line.

So I built one from scratch. No LangChain, no LlamaIndex, no frameworks. Just FastAPI, FAISS, sentence-transformers, and an LLM API.

Here's what I built, what worked, and what broke.

Uploading a PDF

Selecting a PDF to upload via Swagger UI

Upload response — 16 chunks indexed from 5 pages

Querying the document

Asking a question via the /query endpoint

Response with answer and source chunks

The architecture

PDF --> extract text (pypdf) --> chunk (500 char, 50 overlap) --> embed (MiniLM-L6-v2)
                                                                        |
                                                                        v
question --> embed --> FAISS top-k search --> build prompt with chunks --> LLM --> answer + sources

Five Python files, ~300 lines total:

| File | Responsibility |
| --- | --- |
| main.py | FastAPI app, 3 endpoints, prompt engineering |
| pdf_loader.py | PDF text extraction via pypdf |
| rag.py | Chunking + embedding |
| store.py | FAISS vector store wrapper |
| llm.py | Swappable LLM client (Groq / OpenAI / Anthropic) |

How the upload works

When you POST a PDF to /upload, three things happen:

1. Text extraction — pypdf reads each page and returns the raw text. Pages with no extractable text (scanned images) are skipped.
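A minimal sketch of what this step might look like, assuming pypdf's PdfReader. The function names keep_text_pages and load_pdf are mine, and the 1-based page numbering is an assumption; the repo's pdf_loader.py may differ. Pages come back as (text, page_number) pairs so they fit the loop in chunk_pages.

```python
def keep_text_pages(raw_pages):
    """Return (text, page_number) pairs, skipping pages with no text.

    raw_pages holds one entry per page. None or whitespace-only entries
    stand for pages with no extractable text (for example scanned images).
    """
    pages = []
    for page_num, text in enumerate(raw_pages, start=1):  # 1-based numbering is an assumption
        cleaned = (text or "").strip()
        if cleaned:
            pages.append((cleaned, page_num))
    return pages


def load_pdf(path):
    """Extract text from every page of the PDF at `path` using pypdf."""
    from pypdf import PdfReader  # imported lazily so keep_text_pages has no dependency

    reader = PdfReader(path)
    return keep_text_pages([page.extract_text() for page in reader.pages])
```

Keeping the filtering separate from the pypdf call makes the skip logic easy to test without a real PDF.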

2. Chunking — each page is split into ~500-character chunks with 50 characters of overlap. The overlap prevents losing context at chunk boundaries.

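from dataclasses import dataclass

CHUNK_SIZE = 500     # maximum characters per chunk
CHUNK_OVERLAP = 50   # characters shared between neighboring chunks

# Assumed definition: the post only shows the three fields Chunk is built with.
@dataclass
class Chunk:
    chunk_id: int
    text: str
    page: int

def chunk_pages(pages):
    """Split (text, page_num) pairs into overlapping character chunks."""
    chunks = []
    chunk_id = 0
    for text, page_num in pages:
        start = 0
        while start < len(text):
            end = min(start + CHUNK_SIZE, len(text))
            chunk_text = text[start:end].strip()
            if chunk_text:  # skip windows that contain only whitespace
                chunks.append(Chunk(chunk_id=chunk_id, text=chunk_text, page=page_num))
                chunk_id += 1
            if end == len(text):
                break  # reached the end of this page
            start = end - CHUNK_OVERLAP  # step back so neighboring chunks overlap
    return chunks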
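To see how the overlap plays out, the sketch below reproduces only the start/end arithmetic of chunk_pages. The helper name chunk_spans and the guard against an overlap that is too large are my additions for illustration.

```python
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50


def chunk_spans(length, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Return the (start, end) character offsets produced by the sliding window."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size, or the window never advances")
    spans = []
    start = 0
    while start < length:
        end = min(start + size, length)
        spans.append((start, end))
        if end == length:
            break
        start = end - overlap
    return spans


print(chunk_spans(1200))  # [(0, 500), (450, 950), (900, 1200)]
```

Each new chunk starts 450 characters after the previous one, so a 1200-character page yields three chunks, with the last one shorter than the rest.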

3. Embedding — each chunk is embedded into a 384-dimensional vector using all-MiniLM-L6-v2. This runs locally on CPU, no API call needed. Vectors are normalized so we can use inner product as cosine similarity.

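from functools import lru_cache

from sentence_transformers import SentenceTransformer

@lru_cache(maxsize=1)
def get_embed_model():
    # Assumed implementation of the lazy-loaded singleton; the post does not show it.
    return SentenceTransformer("all-MiniLM-L6-v2")

def embed_texts(texts):
    model = get_embed_model()  # lazy-loaded singleton
    vectors = model.encode(
        texts,
        normalize_embeddings=True,  # unit-length vectors: inner product == cosine
        show_progress_bar=False,
        convert_to_numpy=True,
    )
    return vectors.astype("float32")  # FAISS expects float32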

The vectors and chunk metadata go into a FAISS IndexFlatIP index — brute-force exact search, which is fine for up to ~100k vectors.
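As a rough mental model of what an exact inner-product search computes, here is a pure-Python stand-in. It is not FAISS and not the code in store.py; the function names normalize and top_k are hypothetical, and the 2-dimensional vectors are toy data.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, vectors, k=3):
    """Score every stored vector against the query and return the k best.

    Returns (score, index) pairs sorted by descending inner product.
    """
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(v))), idx)
        for idx, v in enumerate(vectors)
    ]
    return sorted(scored, reverse=True)[:k]


stored = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]  # toy vectors, not real embeddings
best = top_k([2.0, 1.0], stored, k=2)
print([idx for _, idx in best])  # [2, 0]
```

In FAISS itself the same flow is roughly faiss.IndexFlatIP(384) to create the index, index.add(vectors) to store the chunks, and index.search(query_vectors, k) to get scores and ids back. Every query is compared against every stored vector, which is why this approach stops scaling at some point.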

How the query works

When you POST a question to /query:

  1. The question is embedded using the same model
  2. FAISS finds the top-k most similar chunks by cosine similarity
  3. The chunks are formatted into a prompt with labels like [Chunk 3 | Page 2]
  4. The LLM generates an answer grounded in those chunks
  5. Both the answer and source chunks are returned
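Step 3 might look something like the sketch below. Only the [Chunk N | Page M] label format comes from the project; the Chunk dataclass, the function names, and the "Context:" / "Question:" layout are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: int
    text: str
    page: int


def format_context(chunks):
    """Render retrieved chunks as labeled blocks the LLM can cite."""
    return "\n\n".join(
        f"[Chunk {c.chunk_id} | Page {c.page}]\n{c.text}" for c in chunks
    )


def build_user_prompt(question, chunks):
    """Combine the labeled context and the question into one user message."""
    return f"Context:\n{format_context(chunks)}\n\nQuestion: {question}"


hits = [Chunk(3, "Example text from the document.", 2)]
print(build_user_prompt("What does the document say?", hits))
```

Labeling each chunk with its id and page is what lets the endpoint return source chunks alongside the answer.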

The system prompt is deliberately strict:

You are a careful assistant that answers questions strictly
from the provided document context.

Rules:
- Use ONLY the context below. Do not use outside knowledge.
- If the answer is not in the context, say:
  "I couldn't find that in the document."

Swappable LLM providers

One thing I'm happy with — the LLM is swappable via a single environment variable:

LLM_PROVIDER=groq      # or openai, or anthropic

All three providers share the same interface:

from abc import ABC, abstractmethod

class LLMClient(ABC):
    @abstractmethod
    def generate(self, system: str, user: str) -> str: ...

You only need an API key for the provider you pick. I used Groq with Llama 3.3 70B for development because it's fast and free-tier friendly.
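Here is one way the provider switch could be wired up. This is a sketch, not the code in llm.py: EchoClient is a stand-in that calls no SDK, the PROVIDERS registry and get_llm_client are hypothetical names, and defaulting to groq is my assumption.

```python
import os
from abc import ABC, abstractmethod


class LLMClient(ABC):
    @abstractmethod
    def generate(self, system: str, user: str) -> str: ...


class EchoClient(LLMClient):
    """Stand-in provider for the sketch; a real one would call its SDK here."""

    def __init__(self, name: str):
        self.name = name

    def generate(self, system: str, user: str) -> str:
        return f"[{self.name}] {user}"


# Hypothetical registry: each entry would normally build a real SDK-backed client.
PROVIDERS = {
    "groq": lambda: EchoClient("groq"),
    "openai": lambda: EchoClient("openai"),
    "anthropic": lambda: EchoClient("anthropic"),
}


def get_llm_client() -> LLMClient:
    """Pick a client based on the LLM_PROVIDER environment variable."""
    provider = os.environ.get("LLM_PROVIDER", "groq").strip().lower()  # default is an assumption
    try:
        return PROVIDERS[provider]()
    except KeyError:
        raise ValueError(f"Unknown LLM_PROVIDER: {provider!r}") from None
```

Because the factories are only called for the selected provider, the other SDKs and their API keys never need to be present.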

Testing it: what worked and what didn't

I created a fictional 5-page company document and threw 19 questions at the pipeline. Questions ranged from simple lookups to multi-hop reasoning to negative tests (questions the document can't answer).

What worked well:

  • Direct lookups: "What is the list price of the Magpie-7?" — nailed it
  • Table data: "What's included in the Standard tier?" — correct
  • Negative tests: "What's Zentara's stock ticker?" — correctly said "not in the document"
  • Multi-hop: "If I want 1-hour SLA support, what will it cost?" — combined info from the pricing table

What failed:

  • "Who is the CEO?" — couldn't find it
  • "How many employees does Zentara have?" — couldn't find it

Both answers were on page 1, in a dense "Company snapshot" table: CEO, CTO, HQ, employees, revenue — all packed together.

Why it failed (and what I learned)

The problem wasn't the LLM — it was the retriever. The Company snapshot table had 8+ different facts crammed into one chunk. The embedding for that chunk became a muddy average of all those topics, so it didn't rank highly for any specific question.

This is the classic weakness of pure semantic search. The word "CEO" appears exactly once in the document. A keyword search (BM25) would find it instantly. But vector search relies on semantic similarity, and a short query like "Who is the CEO?" doesn't produce a strong enough match against a chunk that's 80% about revenue, headquarters, and employee count.

The fix: hybrid retrieval, which combines BM25 (keyword matching) with vector search. Many production RAG systems take this approach. It's on my to-do list.
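One common way to merge the two result lists is reciprocal rank fusion (RRF). I haven't implemented this in the repo yet, so treat the sketch below as an illustration of the idea: the function name, the chunk ids, and the choice of RRF over other fusion methods are all assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one fused ranking.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Higher fused score ranks first, and ties
    are broken by the smaller chunk id.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


vector_hits = [7, 2, 9]   # hypothetical ids from the embedding search
keyword_hits = [0, 7, 4]  # hypothetical ids from a BM25 search
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))  # [7, 0, 2, 4, 9]
```

In this toy run, chunk 0 never appears in the vector results but still lands second because the keyword search ranked it first. That is the behavior the "Who is the CEO?" query needs.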

Key design decisions (interview-ready)

If you're building this for interviews, these are the tradeoffs worth knowing:

| Decision | Why |
| --- | --- |
| Character-based chunking (not token-based) | Simpler, no tokenizer dependency. Production would use tiktoken. |
| Local embeddings (not OpenAI) | Free, offline, no API latency. Lower quality but fine for demos. |
| FAISS IndexFlatIP (not HNSW) | Exact search, no approximation. Fine up to ~100k vectors. |
| Normalized embeddings | Inner product = cosine similarity. One less thing to configure. |
| No streaming | v1 simplification. Streaming is where LLM SDKs diverge the most. |
| No conversation memory | Each query is independent. Adding memory is straightforward but adds complexity. |

What I'd add next

  1. Hybrid retrieval (BM25 + vector) — catches keyword matches that pure semantic search misses
  2. Reranker (cross-encoder) — re-scores the top-k results for better precision
  3. Evaluation set — automated accuracy measurement instead of manual testing
  4. Streaming — better UX for longer answers
  5. Conversation memory — follow-up questions
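For item 3, a first evaluation pass could measure retrieval alone, before the LLM is involved. The sketch below is a guess at the shape of such a check: hit_rate, the (question, expected_page) format, and the retrieve callable's signature are all hypothetical. The two questions are the ones that failed in my manual test, both answered on page 1.

```python
def hit_rate(eval_set, retrieve, k=4):
    """Fraction of questions whose expected page shows up in the top-k results.

    eval_set: list of (question, expected_page) pairs.
    retrieve: callable taking (question, k) and returning a list of page numbers.
    """
    if not eval_set:
        return 0.0
    hits = sum(1 for question, page in eval_set if page in retrieve(question, k))
    return hits / len(eval_set)


eval_set = [
    ("Who is the CEO?", 1),
    ("How many employees does Zentara have?", 1),
]


def fake_retrieve(question, k):
    """Stand-in retriever that never returns page 1, mimicking the failure."""
    return [2, 3][:k]


print(hit_rate(eval_set, fake_retrieve))  # 0.0
```

Swapping fake_retrieve for the real retriever would turn the manual 19-question test into a number that can be tracked across changes such as hybrid retrieval or a reranker.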

Try it yourself

The repo is here: github.com/santanu2908/chat-with-pdf-rag

uv sync
cp .env.example .env   # set your API key
uv run uvicorn app.main:app --reload

Open http://localhost:8000/docs, upload the included sample PDF (data/sample_test_file.pdf), and start asking questions.


If you've built something similar or have suggestions (especially on hybrid retrieval), I'd love to hear about it in the comments.

I'm Santanu Mohanta — you can connect with me on LinkedIn or check out my other projects on GitHub.
