DEV Community: Santanu Mohanta

Adding streaming to my RAG pipeline — three SDKs, three different APIs

Santanu Mohanta — Tue, 23 Jun 2026 13:34:20 +0000

In v3 I added a cross-encoder reranker. This time the feature was simpler but touched every layer: streaming responses via Server-Sent Events (SSE).

The goal: instead of waiting 3-5 seconds for the full answer, start showing tokens the moment the LLM generates them. The sources still arrive at the end.

Why streaming matters for RAG

Without streaming, the user experience is: click → wait → wall of text. With streaming, the first token arrives in ~200ms. The user starts reading while the model is still generating. It's the same answer, but it feels instant.

For a RAG pipeline specifically, there's a design question: when do you send the sources? You can't stream them inline — the LLM doesn't produce structured source metadata as it generates. So the pattern becomes:

SSE event 1:  {"token": "The"}
SSE event 2:  {"token": " list"}
SSE event 3:  {"token": " price"}
...
SSE final:    {"sources": [{"chunk_id": 4, "text": "...", "page": 2, "score": 0.75}]}

Tokens stream in real-time. Sources are sent as the final event once the LLM is done. The client knows the stream is complete when it receives the sources event.
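A minimal sketch of the client side of this protocol, assuming the server emits one JSON object per data: line exactly as shown above. The function name consume_sse_lines is hypothetical, and the input is a plain iterable of decoded lines rather than a real HTTP response.

```python
import json
from typing import Iterable


def consume_sse_lines(lines: Iterable[str]) -> tuple[str, list[dict]]:
    """Accumulate token events until the final sources event arrives."""
    answer_parts: list[str] = []
    sources: list[dict] = []
    for line in lines:
        # SSE separates events with blank lines; payload lines start with "data:".
        if not line.startswith("data:"):
            continue
        payload = json.loads(line[len("data:"):].strip())
        if "token" in payload:
            answer_parts.append(payload["token"])
        elif "sources" in payload:
            # The sources event doubles as the end-of-stream marker.
            sources = payload["sources"]
            break
    return "".join(answer_parts), sources


example = [
    'data: {"token": "The"}',
    "",
    'data: {"token": " list"}',
    "",
    'data: {"token": " price"}',
    "",
    'data: {"sources": [{"chunk_id": 4, "page": 2}]}',
    "",
]
answer, sources = consume_sse_lines(example)
print(answer)   # The list price
print(sources)  # [{'chunk_id': 4, 'page': 2}]
```

Because the sources event is the only completion signal, a client built this way would treat a connection that closes without it as an incomplete answer.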

The abstraction

In v3, LLMClient had one method:

class LLMClient(ABC):
    @abstractmethod
    def generate(self, system: str, user: str) -> str: ...

Now it has two:

from abc import ABC, abstractmethod
from typing import Iterator

class LLMClient(ABC):
    @abstractmethod
    def generate(self, system: str, user: str) -> str: ...

    @abstractmethod
    def stream(self, system: str, user: str) -> Iterator[str]: ...

Same inputs, different output shape. generate returns a string. stream yields string chunks. The endpoint decides which to call — /query calls generate, /query/stream calls stream.
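To make the contract concrete, here is a small sketch with a fake client. EchoClient is hypothetical (it is not in the repo) and simply echoes the user prompt. It shows one way the two methods can relate: the blocking result is the streamed chunks joined together.

```python
from abc import ABC, abstractmethod
from typing import Iterator


class LLMClient(ABC):
    @abstractmethod
    def generate(self, system: str, user: str) -> str: ...

    @abstractmethod
    def stream(self, system: str, user: str) -> Iterator[str]: ...


class EchoClient(LLMClient):
    """Hypothetical stand-in that 'answers' by echoing the user prompt."""

    def stream(self, system: str, user: str) -> Iterator[str]:
        # Yield word-sized chunks, keeping the leading space on each chunk
        # the way real token streams usually do.
        words = user.split(" ")
        for i, word in enumerate(words):
            yield word if i == 0 else " " + word

    def generate(self, system: str, user: str) -> str:
        # Blocking mode is just the streamed chunks joined together.
        return "".join(self.stream(system, user))


client = EchoClient()
chunks = list(client.stream("sys", "list price of Magpie-7"))
print(chunks)  # ['list', ' price', ' of', ' Magpie-7']
print(client.generate("sys", "list price of Magpie-7"))
```

A fake like this is handy for testing the endpoint without an API key, since anything that accepts an LLMClient will accept it.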

This is where it got interesting: each SDK streams differently.

Three SDKs, three streaming APIs

Groq and OpenAI (similar)

Both use the OpenAI-compatible stream=True parameter:

def stream(self, system: str, user: str) -> Iterator[str]:
    resp = self.client.chat.completions.create(
        model=self.model,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=0.2,
        max_tokens=800,
        stream=True,
    )
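    for chunk in resp:
        # Some chunks carry no choices (e.g. a usage-only chunk); skip them.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # content is None on role-only and finish chunks
            yield delta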

The only differences from generate are stream=True and iterating over chunks instead of reading .choices[0].message.content. Groq uses the same API shape since it's OpenAI-compatible.
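The delta-handling loop can be exercised without a network call by faking the chunk objects. In this sketch, iter_deltas and fake_chunk are hypothetical helpers. The fakes only mimic the choices[0].delta.content shape used above, plus one chunk with an empty choices list, which OpenAI-compatible servers can send (for example a usage-only chunk).

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def iter_deltas(chunks: Iterable) -> Iterator[str]:
    """Yield non-empty text deltas from OpenAI-style streaming chunks."""
    for chunk in chunks:
        if not chunk.choices:
            continue  # nothing to read on this chunk
        delta = chunk.choices[0].delta.content
        if delta:  # skips None and "" (role-only or finish chunks)
            yield delta


def fake_chunk(content):
    """Build an object shaped like a streaming chunk (test helper only)."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


stream = [
    fake_chunk(None),
    fake_chunk("The"),
    fake_chunk(" list"),
    SimpleNamespace(choices=[]),
    fake_chunk(""),
    fake_chunk(" price"),
]
text = "".join(iter_deltas(stream))
print(text)  # The list price
```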

Anthropic (different)

Anthropic's SDK has a dedicated streaming context manager:

def stream(self, system: str, user: str) -> Iterator[str]:
    with self.client.messages.stream(
        model=self.model,
        system=system,
        messages=[{"role": "user", "content": user}],
        temperature=0.2,
        max_tokens=800,
    ) as resp:
        for text in resp.text_stream:
            yield text

Instead of parsing the raw events from client.messages.create(..., stream=True), which the SDK also supports, I used the client.messages.stream(...) helper, a separate method. And instead of digging into chunk.choices[0].delta.content, you iterate resp.text_stream, which yields plain text directly. The with block handles connection cleanup.

It's a cleaner API honestly — no null-checking on deltas, no digging into nested objects. But it means you can't write one streaming implementation and share it across providers.

The endpoint

FastAPI's StreamingResponse handles the SSE transport:

import json
from typing import Iterator

from fastapi.responses import StreamingResponse

@app.post("/query/stream")
def query_stream(req: QueryRequest) -> StreamingResponse:
    # ... same retrieval + reranking as /query ...

    llm = get_llm_client()
    user_prompt = build_user_prompt(req.question, retrieved)

    def event_stream() -> Iterator[str]:
        for token in llm.stream(system=SYSTEM_PROMPT, user=user_prompt):
            yield f"data: {json.dumps({'token': token})}\n\n"
        yield f"data: {json.dumps({'sources': sources})}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")

The retrieval pipeline (embed → hybrid search → rerank) runs before streaming starts — that's all synchronous work. Only the LLM generation streams. This means the client sees a brief pause (retrieval + reranking), then tokens start flowing.

The sources list is built from the retrieved chunks before the stream starts, so it's ready to send as the final event without any extra processing.
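One thing the endpoint above does not cover is a provider failure mid-stream. The generator runs after the 200 response has started, so an exception can no longer become an HTTP error status. A possible approach, sketched below, is to report the failure as an in-band event. The sse_event helper and the {"error": ...} payload are assumptions for illustration, not part of the original API.

```python
import json
from typing import Callable, Iterable, Iterator


def sse_event(payload: dict) -> str:
    """Serialize one payload as a single SSE 'data:' event."""
    return f"data: {json.dumps(payload)}\n\n"


def event_stream(
    stream_tokens: Callable[[], Iterable[str]], sources: list[dict]
) -> Iterator[str]:
    """Yield token events, then sources; report failures in-band."""
    try:
        for token in stream_tokens():
            yield sse_event({"token": token})
    except Exception as exc:  # the 200 status line has already been sent
        yield sse_event({"error": str(exc)})
        return
    yield sse_event({"sources": sources})


def flaky_tokens():
    """Hypothetical token source that fails after the first token."""
    yield "The"
    raise RuntimeError("provider timeout")


ok = list(event_stream(lambda: ["The", " price"], [{"chunk_id": 4}]))
bad = list(event_stream(flaky_tokens, [{"chunk_id": 4}]))
print(ok[-1])
print(bad[-1])
```

With this shape the client has two terminal events to watch for: sources on success, error on failure.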

Testing it

curl -N -X POST http://localhost:8000/query/stream \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the list price of the Magpie-7?", "top_k": 3}'

Output:

data: {"token": "The"}

data: {"token": " list"}

data: {"token": " price"}

data: {"token": " of"}

data: {"token": " the"}

data: {"token": " Magpie"}

data: {"token": "-7"}

data: {"token": " is"}

data: {"token": " €"}

data: {"token": "68"}

data: {"token": ",400"}

data: {"token": " per"}

data: {"token": " unit."}

data: {"sources": [{"chunk_id": 4, "text": "...", "page": 2, "score": 0.7542}]}

The -N flag disables curl's output buffering so you see tokens as they arrive.

The pipeline now

PDF ─► extract text ─► chunk ─► embed (MiniLM-L6-v2)
                                        │
                                        ▼
question ─► FAISS + BM25 (RRF) ─► cross-encoder rerank
         ─► LLM generate (blocking)  → /query   → {answer, sources}
         ─► LLM stream   (SSE)       → /query/stream → token events + sources

Same retrieval pipeline, two output modes. The client picks which endpoint to call.

What I learned

Streaming is a UX feature, not an accuracy feature. The answer is identical — streaming just changes when the user sees it. But the perceived latency difference is dramatic.
SDK divergence is real. Groq and OpenAI share the same streaming interface (OpenAI-compatible). Anthropic uses a fundamentally different pattern. If you're building a multi-provider abstraction, streaming is where it gets messy. The LLMClient abstract class earns its keep here.
Sources and tokens are separate concerns. In a RAG pipeline, you know the sources before the LLM starts generating. Streaming them as the final SSE event is a clean separation — the client can render tokens immediately and append source citations when the stream ends.
FastAPI makes SSE trivial. StreamingResponse with a generator function and text/event-stream media type — that's it. No WebSocket setup, no special middleware.

What's next

Conversation memory (multi-turn follow-ups)
Possibly a Streamlit UI

Try it yourself

v4 (streaming): github.com/santanu2908/chat-with-pdf-rag
v3 (reranker): github.com/santanu2908/chat-with-pdf-rag/tree/v3
v2 (hybrid retrieval): github.com/santanu2908/chat-with-pdf-rag/tree/v2
v1 (pure FAISS): github.com/santanu2908/chat-with-pdf-rag/tree/v1

uv sync
cp .env.example .env   # set your API key
uv run uvicorn app.main:app --reload

Open http://localhost:8000/docs, upload the sample PDF, and try /query/stream — watch the tokens arrive one by one.

If you're building multi-provider streaming, I'd love to hear how you handled the SDK differences.

I'm Santanu Mohanta — connect with me on LinkedIn or check out my projects on GitHub.

I added a reranker to my RAG pipeline — it broke everything, then I fixed it

Santanu Mohanta — Tue, 23 Jun 2026 13:01:18 +0000

In v2 I added hybrid retrieval (FAISS + BM25) to fix keyword blindspots. All 19 test questions passed. The next item on my list was a cross-encoder reranker for better precision.

The idea is standard: over-fetch candidates, rerank with a smarter model, keep the top-k. Every RAG tutorial recommends it. It took me 20 minutes to implement and immediately broke 2 of my 19 tests.

Here's what went wrong and the strategy I landed on.

What a cross-encoder does (and why it's better)

In v2, the dense side of retrieval uses a bi-encoder: the query and each chunk are embedded independently, then compared by cosine similarity. That's fast, but the model never sees the query and chunk together.

A cross-encoder is different. It takes the (query, chunk) pair as a single input and outputs a relevance score. It can attend to both simultaneously — word-level interactions, negation, paraphrasing. Much more accurate, but too slow for first-stage retrieval because you'd need to score every chunk in the index.

The standard two-stage pattern:

Stage 1: cheap retrieval (FAISS + BM25) → broad candidate set
Stage 2: cross-encoder reranks candidates → precise top-k → LLM

The implementation (the easy part)

New file — app/reranker.py:

from sentence_transformers import CrossEncoder

RERANKER_MODEL_NAME = "cross-encoder/ms-marco-MiniLM-L-6-v2"

_reranker = None

def get_reranker():
    global _reranker
    if _reranker is None:
        _reranker = CrossEncoder(RERANKER_MODEL_NAME)
    return _reranker

def rerank(query, retrievals, top_k):
    model = get_reranker()
    pairs = [[query, r.chunk.text] for r in retrievals]
    scores = model.predict(pairs)
    for r, score in zip(retrievals, scores):
        r.score = float(score)
    ranked = sorted(retrievals, key=lambda r: r.score, reverse=True)
    return ranked[:top_k]

And in main.py, over-fetch then rerank:

# Before (v2): retrieve top_k directly
retrieved = store.search(query_vec, top_k=req.top_k, query_text=req.question)

# After (v3): over-fetch, then rerank
candidates = store.search(query_vec, top_k=req.top_k * 2, query_text=req.question)
retrieved = rerank(req.question, candidates, top_k=req.top_k)

No new dependency: cross-encoder/ms-marco-MiniLM-L-6-v2 loads through sentence-transformers, which was already installed. The model is ~80MB and runs on CPU.

I ran the eval. Two tests broke.

What broke

Question:  Who is the CEO of Zentara Robotics?
Expected:  ['Iris Kallas']
Got:       I couldn't find that in the document.

Question:  How many employees does Zentara have?
Expected:  ['287']
Got:       I couldn't find that in the document.

The exact same two questions that failed in v1 with pure FAISS. Hybrid retrieval fixed them. The reranker un-fixed them.
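The Expected/Got output above suggests a check where an answer passes if it contains the expected strings. Below is a rough sketch of such a harness, assuming case-insensitive substring matching. run_eval, the ask callable, and the canned answers are hypothetical and not taken from the repo.

```python
from typing import Callable

EvalCase = tuple[str, list[str]]  # (question, expected substrings)


def run_eval(cases: list[EvalCase], ask: Callable[[str], str]) -> list[str]:
    """Return the questions whose answer misses an expected substring."""
    failures = []
    for question, expected in cases:
        answer = ask(question)
        if not all(e.lower() in answer.lower() for e in expected):
            failures.append(question)
    return failures


cases = [
    ("Who is the CEO of Zentara Robotics?", ["Iris Kallas"]),
    ("How many employees does Zentara have?", ["287"]),
]

# Canned answers standing in for a real call to /query.
canned = {
    cases[0][0]: "The CEO is Iris Kallas.",
    cases[1][0]: "I couldn't find that in the document.",
}
failures = run_eval(cases, canned.__getitem__)
print(f"{len(cases) - len(failures)}/{len(cases)} passing")  # 1/2 passing
print(failures)
```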

Why the cross-encoder hates tables

The CEO chunk looks like this:

Company: Zentara Robotics | CEO: Iris Kallas | Employees: 287 | Founded: 2018 ...

Dense. Tabular. Eight facts crammed together.

The cross-encoder (ms-marco-MiniLM-L-6-v2) was trained on MS MARCO — a web search dataset where passages are natural language paragraphs. When it sees a fact-packed table row as a "passage" for the query "Who is the CEO?", it scores it low. It doesn't look like a good answer, even though it contains the answer.

Meanwhile, hybrid retrieval ranked this chunk #1 — BM25 matched "CEO" exactly and RRF boosted it. The cross-encoder then threw it away.

What I tried (and why it failed)

I went through 7 approaches before finding one that worked. Here's the progression:

#	Approach	Result
1	Pure CE rerank	CE buries table chunks
2	Bigger candidate pool (15)	More candidates = more competition
3	Score blending (0.7 CE + 0.3 RRF)	CE score is so negative it still dominates
4	Score blending (0.5 + 0.5)	Still not enough
5	RRF fusion of CE + first-stage rankings	K=60 makes all rank contributions ~equal, CE rank wins
6	Weighted RRF (2x first-stage)	Still too flat with K=60
7	Smaller pool (top_k * 2)	CE still pushes table chunks out

The core issue: the cross-encoder's score for table chunks is so negative that no amount of score blending or rank fusion can compensate. It's not a "this chunk ranks slightly lower" problem — it's a "the model actively rejects this format" problem.
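A small arithmetic sketch of why blending and rank fusion both lose here. All numbers are hypothetical: raw cross-encoder scores are assumed to be unnormalized logits (a strongly negative value for the table chunk), while RRF scores stay below about 0.033 with K=60.

```python
RRF_K = 60


def blend(ce: float, rrf: float, w_ce: float) -> float:
    """Weighted sum of a raw cross-encoder score and an RRF score."""
    return w_ce * ce + (1 - w_ce) * rrf


def rank_fusion(first_stage_rank: int, ce_rank: int) -> float:
    """RRF over two 1-based ranks."""
    return 1 / (RRF_K + first_stage_rank) + 1 / (RRF_K + ce_rank)


# Hypothetical scores: the table chunk is first-stage #1 but the CE dislikes it.
table = {"ce": -9.0, "rrf": 2 / 61, "first_rank": 1, "ce_rank": 6}
prose = {"ce": 2.0, "rrf": 1 / 63, "first_rank": 3, "ce_rank": 1}

for w in (0.7, 0.5):
    t = blend(table["ce"], table["rrf"], w)
    p = blend(prose["ce"], prose["rrf"], w)
    print(f"w_ce={w}: table={t:.3f} prose={p:.3f}")

t_fused = rank_fusion(table["first_rank"], table["ce_rank"])
p_fused = rank_fusion(prose["first_rank"], prose["ce_rank"])
print(f"fusion: table={t_fused:.4f} prose={p_fused:.4f}")
```

Under these assumptions the prose chunk wins all three comparisons. In the blends, the RRF term is two orders of magnitude smaller than the CE term. In the fusion, moving from rank 1 to rank 6 only changes a contribution from 1/61 to 1/66, so the CE's rank decides the outcome.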

What actually worked: guaranteed slots

The insight: the first-stage results are already good. Hybrid retrieval passed all 19 tests. The reranker should improve those results, not override them.

The strategy:

top_k = 3:  guaranteed slots = 2 (from first-stage)  +  1 CE pick
top_k = 5:  guaranteed slots = 4 (from first-stage)  +  1 CE pick

The top first-stage results are preserved. The cross-encoder only gets to fill the last slot from the remaining candidates. Here's the final implementation:

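def rerank(query, retrievals, top_k):
    # Nothing to rerank if there are no more candidates than slots.
    if not retrievals or top_k >= len(retrievals):
        return retrievals
    if top_k < 1:
        return []

    n_guaranteed = top_k - 1  # first-stage results kept as-is
    n_ce_slots = 1            # slots the cross-encoder may fill

    guaranteed = retrievals[:n_guaranteed]
    remaining = retrievals[n_guaranteed:]

    if remaining:
        model = get_reranker()
        pairs = [[query, r.chunk.text] for r in remaining]
        scores = model.predict(pairs)
        # Note: this overwrites the RRF score with the CE score, so the
        # returned list mixes two score scales.
        for r, score in zip(remaining, scores):
            r.score = round(float(score), 4)
        remaining.sort(key=lambda r: r.score, reverse=True)

    return guaranteed + remaining[:n_ce_slots]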

The CEO chunk (first-stage #1) is always guaranteed. The employee chunk (~rank 3-4 at top_k=5) is also preserved. The CE still adds value by selecting the most relevant candidate for the final slot.

Result: 19/19 passing.
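The slot logic can be tried without downloading a model by injecting the scoring function. The sketch below is not the repo's code: Candidate, rerank_with_guaranteed_slots, and toy_score are hypothetical, the filler chunk texts are made up, and toy_score only imitates a cross-encoder that penalizes table-like text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    text: str
    score: float


def rerank_with_guaranteed_slots(
    query: str,
    candidates: list[Candidate],
    top_k: int,
    score_fn: Callable[[str, str], float],
) -> list[Candidate]:
    """Keep the first top_k - 1 candidates, let score_fn pick the last one."""
    if top_k < 1 or top_k >= len(candidates):
        return candidates[: max(top_k, 0)]
    guaranteed = candidates[: top_k - 1]
    remaining = candidates[top_k - 1 :]
    best = max(remaining, key=lambda c: score_fn(query, c.text))
    return guaranteed + [best]


def toy_score(query: str, text: str) -> float:
    """Stand-in for a cross-encoder: word overlap, minus a penalty for tables."""
    overlap = len(set(query.lower().split()) & set(text.lower().split()))
    return overlap - (10.0 if "|" in text else 0.0)


query = "who is the ceo"
first_stage = [
    Candidate("Company: Zentara Robotics | CEO: Iris Kallas | Employees: 287", 0.033),
    Candidate("Shipping and returns policy overview.", 0.031),
    Candidate("Details of the Standard support tier.", 0.030),
    Candidate("The CEO announced a new product line.", 0.029),
]

# Pure reranking: sort everything by the toy score and keep the top 3.
pure = sorted(first_stage, key=lambda c: toy_score(query, c.text), reverse=True)[:3]
picked = rerank_with_guaranteed_slots(query, first_stage, 3, toy_score)

print([c.text[:20] for c in pure])
print([c.text[:20] for c in picked])
```

In this toy setup the pure rerank drops the table row entirely, while the guaranteed-slot version keeps it first and still lets the scorer promote the fourth candidate into the last slot.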

The pipeline now

PDF ─► extract text ─► chunk ─► embed (MiniLM-L6-v2)
                                        │
                                        ▼
question ─► FAISS + BM25 (2× top_k candidates, RRF fused)
         ─► cross-encoder reranks remaining candidates
         ─► guaranteed first-stage slots + 1 CE-picked slot
         ─► top_k chunks ─► LLM ─► answer + sources

Three stages of retrieval now: vector search, keyword search, cross-encoder. Each catches something the others miss.

What I learned

Rerankers aren't drop-in improvements. Every RAG tutorial shows "add a cross-encoder, get better results." In practice, cross-encoders trained on natural language passages can actively hurt retrieval quality on structured or tabular content.
Your eval set is your safety net. Without the 19-question eval harness, I would've shipped this and had no idea I'd regressed on 2 questions. The eval caught it in seconds.
Guaranteed slots > score blending. I tried 7 variations of reranking, score blending, and rank fusion. None worked because the CE's score for table chunks was so negative it dominated every combination. The fix wasn't mathematical, it was structural: protect what's already working and let the CE improve the margins.
The retriever still matters most. v1 → v2 (adding BM25) was the biggest accuracy jump. v2 → v3 (adding the reranker) was a precision refinement that nearly caused regressions. Invest in your first-stage retrieval before reaching for rerankers.

What's next

Streaming responses
Conversation memory
Possibly a Streamlit UI

Try it yourself

v3 (reranker): github.com/santanu2908/chat-with-pdf-rag
v2 (hybrid retrieval): github.com/santanu2908/chat-with-pdf-rag/tree/v2
v1 (pure FAISS): github.com/santanu2908/chat-with-pdf-rag/tree/v1

uv sync
cp .env.example .env   # set your API key
uv run uvicorn app.main:app --reload

Open http://localhost:8000/docs, upload the sample PDF, and try "Who is the CEO?" — it still works, even with the reranker.

If you've hit similar issues with cross-encoders on structured content, I'd love to hear your approach.


My RAG pipeline couldn't find the CEO — here's how I fixed it with hybrid retrieval

Santanu Mohanta — Wed, 03 Jun 2026 15:04:11 +0000

In my last post, I built a RAG pipeline from scratch — no LangChain, just FastAPI + FAISS. It scored 17/19 on my test set. But two questions failed:

"Who is the CEO?" — couldn't find it
"How many employees does Zentara have?" — couldn't find it

Both answers were right there on page 1. So what went wrong, and how did I fix it?

Why pure vector search failed

The problem was a dense "Company snapshot" table on page 1 — CEO, CTO, HQ, employee count, revenue, all packed into one chunk. The embedding for that chunk became a muddy average of 8+ topics, so when I asked "Who is the CEO?", it didn't rank highly against any specific query.

This is the classic weakness of pure semantic search. The word "CEO" appears exactly once in the document. A keyword search would find it instantly. But vector search relies on semantic similarity, and a short query doesn't produce a strong enough match against a chunk that's mostly about other things.

The fix: hybrid retrieval

The solution is to run two searches in parallel and combine the results:

FAISS (dense) — semantic similarity, good at "What's the charging time?" style questions
BM25 (sparse) — keyword matching, good at "Who is the CEO?" style questions

Then merge them using Reciprocal Rank Fusion (RRF) — a standard algorithm that combines ranked lists from different sources.

question ─► embed ─► FAISS search ──┐
                                    ├─► RRF fusion ─► top-k chunks ─► LLM ─► answer
question ─► tokenize ─► BM25 search ┘

How RRF works

RRF is simple. For each chunk that appears in either ranked list, compute:

rrf_score = 1/(k + rank_in_faiss) + 1/(k + rank_in_bm25)

Where k = 60 (the commonly used default) and ranks start at 1. If a chunk is missing from one list, that term is simply dropped. A chunk that ranks well in both searches scores higher than one that ranks #1 in only one.

Example: chunk 5 is ranked #1 by BM25, #4 by FAISS:

From FAISS:  1/(60 + 4) = 0.0156
From BM25:   1/(60 + 1) = 0.0164
RRF score:                0.0320  ← beats a FAISS-only #1 (0.0164)
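The same arithmetic as a small function, as a sketch. rrf_fuse is a hypothetical name, and every chunk id other than 5 is made up to fill out the two lists.

```python
RRF_K = 60


def rrf_fuse(rankings: list[list[int]], k: int = RRF_K) -> dict[int, float]:
    """Sum 1 / (k + rank) over every ranked list, with 1-based ranks."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1 / (k + rank)
    return scores


# Chunk 5 is #4 in FAISS and #1 in BM25; the other ids are made up.
faiss_ranking = [2, 9, 7, 5]
bm25_ranking = [5, 11, 3]
scores = rrf_fuse([faiss_ranking, bm25_ranking])
print(round(scores[5], 4))  # 0.032
print(round(scores[2], 4))  # 0.0164
```

Chunk 2 is the FAISS-only #1 here, and it ends up with roughly half the score of chunk 5, which appears in both lists.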

The implementation

Only 3 files changed. Here's the core — the updated store.py:

import re

import faiss
import numpy as np
from rank_bm25 import BM25Okapi

RRF_K = 60

def _tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

class VectorStore:
    def __init__(self):
        self.index = faiss.IndexFlatIP(EMBED_DIM)
        self.chunks = []
        self.bm25 = None

    def add(self, vectors, chunks):
        self.index.add(vectors)
        self.chunks.extend(chunks)
        # Build BM25 index from the same chunks
        tokenized = [_tokenize(c.text) for c in self.chunks]
        self.bm25 = BM25Okapi(tokenized)

    def search(self, query_vector, top_k=3, query_text=""):
        top_k_fetch = min(top_k * 3, self.index.ntotal)

        # Dense search
        _, faiss_indices = self.index.search(query_vector.reshape(1, -1), top_k_fetch)
        faiss_ranking = [int(i) for i in faiss_indices[0] if i != -1]

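        # Sparse search (skipped when there is no BM25 index or no query text;
        # zero-score chunks are dropped so they add no RRF noise)
        bm25_ranking = []
        if self.bm25 is not None and query_text:
            bm25_scores = self.bm25.get_scores(_tokenize(query_text))
            order = np.argsort(bm25_scores)[::-1][:top_k_fetch]
            bm25_ranking = [int(i) for i in order if bm25_scores[i] > 0]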

        # Reciprocal Rank Fusion
        rrf_scores = {}
        for rank, idx in enumerate(faiss_ranking):
            rrf_scores[idx] = rrf_scores.get(idx, 0) + 1 / (RRF_K + rank + 1)
        for rank, idx in enumerate(bm25_ranking):
            rrf_scores[idx] = rrf_scores.get(idx, 0) + 1 / (RRF_K + rank + 1)

        sorted_indices = sorted(rrf_scores, key=rrf_scores.get, reverse=True)[:top_k]
        return [Retrieval(chunk=self.chunks[i], score=rrf_scores[i]) for i in sorted_indices]

The only change in main.py — one extra parameter:

# Before (v1)
retrieved = store.search(query_vec, top_k=req.top_k)

# After (v2)
retrieved = store.search(query_vec, top_k=req.top_k, query_text=req.question)

That's it. No changes to chunking, embedding, PDF extraction, or LLM logic.

Results: before and after

Question	v1 (FAISS only)	v2 (hybrid)
Who is the CEO of Zentara Robotics?	Failed	Correct
How many employees does Zentara have?	Failed	Correct (top_k=5)
All other 17 questions	Correct	Correct

The CEO question now works at default top_k=3 — BM25 matches "CEO" directly and RRF promotes it.

The employee count question works at top_k=5. The chunk still ranks lower because it's packed with many facts, but hybrid retrieval brings it within reach. A reranker (cross-encoder) would likely fix this at top_k=3 — that's next on the list.

What I learned

Pure vector search has a keyword blindspot. If a term appears once in a dense chunk, semantic similarity alone won't reliably surface it. BM25 catches these instantly.
RRF is elegant. No score normalization needed, no tuning of weights between the two retrievers. Just ranks and a constant. It works out of the box.
The retriever matters more than the LLM. Both failures in v1 were retrieval failures, not LLM failures. The LLM never even saw the right chunk. Improving retrieval quality is where RAG gets better — not by switching to a fancier model.
Hybrid didn't fully solve dense chunks. The employee count still needs top_k=5. The real fix is either better chunking (split dense tables into smaller pieces) or a reranker that can re-score candidates more precisely.

What's next

Reranker (cross-encoder) — re-score the top-k for better precision
Evaluation harness — automate the 19-question test set instead of testing manually
Streaming — better UX for longer answers

Try it yourself

v2 (hybrid retrieval): github.com/santanu2908/chat-with-pdf-rag
v1 (pure FAISS): github.com/santanu2908/chat-with-pdf-rag/tree/v1

uv sync
cp .env.example .env   # set your API key
uv run uvicorn app.main:app --reload

Open http://localhost:8000/docs, upload the included sample PDF (data/sample_test_file.pdf), and try "Who is the CEO?" — it works now.

If you've implemented hybrid retrieval or have experience with rerankers, I'd love to hear what worked for you.


I built a RAG pipeline from scratch — no LangChain, just FastAPI + FAISS

Santanu Mohanta — Sat, 30 May 2026 18:38:55 +0000

Most RAG tutorials I found were either "pip install langchain and you're done" or 50-page academic papers. I wanted something in between — a pipeline I could actually explain in an interview, where I understood every line.

So I built one from scratch. No LangChain, no LlamaIndex, no frameworks. Just FastAPI, FAISS, sentence-transformers, and an LLM API.

Here's what I built, what worked, and what broke.

[Figure: Uploading a PDF]

[Figure: Querying the document]

The architecture

PDF --> extract text (pypdf) --> chunk (500 char, 50 overlap) --> embed (MiniLM-L6-v2)
                                                                        |
                                                                        v
question --> embed --> FAISS top-k search --> build prompt with chunks --> LLM --> answer + sources

Five Python files, ~300 lines total:

File	Responsibility
`main.py`	FastAPI app, 3 endpoints, prompt engineering
`pdf_loader.py`	PDF text extraction via pypdf
`rag.py`	Chunking + embedding
`store.py`	FAISS vector store wrapper
`llm.py`	Swappable LLM client (Groq / OpenAI / Anthropic)

How the upload works

When you POST a PDF to /upload, three things happen:

1. Text extraction — pypdf reads each page and returns the raw text. Pages with no extractable text (scanned images) are skipped.

2. Chunking — each page is split into ~500-character chunks with 50 characters of overlap. The overlap prevents losing context at chunk boundaries.

CHUNK_SIZE = 500
CHUNK_OVERLAP = 50

def chunk_pages(pages):
    chunks = []
    chunk_id = 0
    for text, page_num in pages:
        start = 0
        while start < len(text):
            end = min(start + CHUNK_SIZE, len(text))
            chunk_text = text[start:end].strip()
            if chunk_text:
                chunks.append(Chunk(chunk_id=chunk_id, text=chunk_text, page=page_num))
                chunk_id += 1
            if end == len(text):
                break
            start = end - CHUNK_OVERLAP
    return chunks

3. Embedding — each chunk is embedded into a 384-dimensional vector using all-MiniLM-L6-v2. This runs locally on CPU, no API call needed. Vectors are normalized so we can use inner product as cosine similarity.

def embed_texts(texts):
    model = get_embed_model()  # lazy-loaded singleton
    vectors = model.encode(
        texts,
        normalize_embeddings=True,
        show_progress_bar=False,
        convert_to_numpy=True,
    )
    return vectors.astype("float32")

The vectors and chunk metadata go into a FAISS IndexFlatIP index — brute-force exact search, which is fine for up to ~100k vectors.
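Going back to the chunking step, the overlap is easiest to see by looking at offsets alone. This sketch uses a hypothetical helper, chunk_spans, that follows the same stepping as chunk_pages but returns (start, end) pairs instead of text. The overlap check is an added safeguard, since an overlap as large as the chunk size would stop the loop from advancing.

```python
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50


def chunk_spans(length: int, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP):
    """Return (start, end) offsets using the same stepping as chunk_pages."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than the chunk size")
    spans = []
    start = 0
    while start < length:
        end = min(start + size, length)
        spans.append((start, end))
        if end == length:
            break
        # Step back by the overlap so neighbouring chunks share text.
        start = end - overlap
    return spans


spans = chunk_spans(1200)
print(spans)  # [(0, 500), (450, 950), (900, 1200)]
```

For a 1200-character page, each chunk starts 50 characters before the previous one ends, so a sentence cut at offset 500 still appears whole in the second chunk.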

How the query works

When you POST a question to /query:

1. The question is embedded using the same model.
2. FAISS finds the top-k most similar chunks by cosine similarity.
3. The chunks are formatted into a prompt with labels like [Chunk 3 | Page 2].
4. The LLM generates an answer grounded in those chunks.
5. Both the answer and source chunks are returned.
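A sketch of the prompt-formatting step, assuming the label format shown above. The function name format_context_prompt, the simplified Chunk class, and the exact "Context:" / "Question:" layout are assumptions; the repo's own prompt builder may differ.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: int
    text: str
    page: int


def format_context_prompt(question: str, chunks: list[Chunk]) -> str:
    """Format retrieved chunks as labeled context followed by the question."""
    blocks = [f"[Chunk {c.chunk_id} | Page {c.page}]\n{c.text}" for c in chunks]
    context = "\n\n".join(blocks)
    return f"Context:\n{context}\n\nQuestion: {question}"


prompt = format_context_prompt(
    "Who is the CEO?",
    [Chunk(chunk_id=0, text="Company: Zentara Robotics | CEO: Iris Kallas", page=1)],
)
print(prompt)
```

Keeping the chunk id and page in the label is what lets the answer be traced back to a source chunk in the response.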

The system prompt is deliberately strict:

You are a careful assistant that answers questions strictly
from the provided document context.

Rules:
- Use ONLY the context below. Do not use outside knowledge.
- If the answer is not in the context, say:
  "I couldn't find that in the document."

Swappable LLM providers

One thing I'm happy with — the LLM is swappable via a single environment variable:

LLM_PROVIDER=groq      # or openai, or anthropic

All three providers share the same interface:

class LLMClient(ABC):
    @abstractmethod
    def generate(self, system: str, user: str) -> str: ...

You only need an API key for the provider you pick. I used Groq with Llama 3.3 70B for development because it's fast and free-tier friendly.
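One way the environment-variable switch could be wired up, as a sketch. The stub classes stand in for real provider clients and make no network calls. The PROVIDERS registry, the env parameter, and the groq default are assumptions for illustration.

```python
import os
from abc import ABC, abstractmethod


class LLMClient(ABC):
    @abstractmethod
    def generate(self, system: str, user: str) -> str: ...


class StubClient(LLMClient):
    """Placeholder for a real provider client (no network calls)."""

    provider = "stub"

    def generate(self, system: str, user: str) -> str:
        return f"[{self.provider}] {user}"


class GroqStub(StubClient):
    provider = "groq"


class OpenAIStub(StubClient):
    provider = "openai"


class AnthropicStub(StubClient):
    provider = "anthropic"


PROVIDERS = {"groq": GroqStub, "openai": OpenAIStub, "anthropic": AnthropicStub}


def get_llm_client(env=os.environ) -> LLMClient:
    """Pick a client class from LLM_PROVIDER, failing loudly on typos."""
    name = env.get("LLM_PROVIDER", "groq").strip().lower()
    if name not in PROVIDERS:
        raise ValueError(
            f"Unknown LLM_PROVIDER {name!r}; expected one of {sorted(PROVIDERS)}"
        )
    return PROVIDERS[name]()


client = get_llm_client({"LLM_PROVIDER": "anthropic"})
print(client.generate("sys", "hello"))  # [anthropic] hello
```

Passing the environment in as a parameter keeps the factory easy to test, and raising on an unknown name avoids silently falling back to the wrong provider.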

Testing it: what worked and what didn't

I created a fictional 5-page company document and threw 19 questions at the pipeline. Questions ranged from simple lookups to multi-hop reasoning to negative tests (questions the document can't answer).

What worked well:

Direct lookups: "What is the list price of the Magpie-7?" — nailed it
Table data: "What's included in the Standard tier?" — correct
Negative tests: "What's Zentara's stock ticker?" — correctly said "not in the document"
Multi-hop: "If I want 1-hour SLA support, what will it cost?" — combined info from the pricing table

What failed:

"Who is the CEO?" — couldn't find it
"How many employees does Zentara have?" — couldn't find it

Both answers were on page 1, in a dense "Company snapshot" table: CEO, CTO, HQ, employees, revenue — all packed together.

Why it failed (and what I learned)

The problem wasn't the LLM — it was the retriever. The Company snapshot table had 8+ different facts crammed into one chunk. The embedding for that chunk became a muddy average of all those topics, so it didn't rank highly for any specific question.

This is the classic weakness of pure semantic search. The word "CEO" appears exactly once in the document. A keyword search (BM25) would find it instantly. But vector search relies on semantic similarity, and a short query like "Who is the CEO?" doesn't produce a strong enough match against a chunk that's 80% about revenue, headquarters, and employee count.

The fix: hybrid retrieval — combine BM25 (keyword matching) with vector search. This is what production RAG systems do. It's on my to-do list.

Key design decisions (interview-ready)

If you're building this for interviews, these are the tradeoffs worth knowing:

Decision	Why
Character-based chunking (not token-based)	Simpler, no tokenizer dependency. Production would use tiktoken.
Local embeddings (not OpenAI)	Free, offline, no API latency. Lower quality but fine for demos.
FAISS IndexFlatIP (not HNSW)	Exact search, no approximation. Fine up to ~100k vectors.
Normalized embeddings	Inner product = cosine similarity. One less thing to configure.
No streaming	v1 simplification. Streaming is where LLM SDKs diverge the most.
No conversation memory	Each query is independent. Adding memory is straightforward but adds complexity.

What I'd add next

Hybrid retrieval (BM25 + vector) — catches keyword matches that pure semantic search misses
Reranker (cross-encoder) — re-scores the top-k results for better precision
Evaluation set — automated accuracy measurement instead of manual testing
Streaming — better UX for longer answers
Conversation memory — follow-up questions

Try it yourself

The repo is here: github.com/santanu2908/chat-with-pdf-rag (v1)

uv sync
cp .env.example .env   # set your API key
uv run uvicorn app.main:app --reload

Open http://localhost:8000/docs, upload the included sample PDF (data/sample_test_file.pdf), and start asking questions.

If you've built something similar or have suggestions (especially on hybrid retrieval), I'd love to hear about it in the comments.
