<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Santanu Mohanta</title>
    <description>The latest articles on DEV Community by Santanu Mohanta (@santanu_mohanta_29).</description>
    <link>https://dev.to/santanu_mohanta_29</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3959064%2F3ea06ae1-908c-4583-82ae-64a8e10d6737.png</url>
      <title>DEV Community: Santanu Mohanta</title>
      <link>https://dev.to/santanu_mohanta_29</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/santanu_mohanta_29"/>
    <language>en</language>
    <item>
      <title>I built a RAG pipeline from scratch — no LangChain, just FastAPI + FAISS</title>
      <dc:creator>Santanu Mohanta</dc:creator>
      <pubDate>Sat, 30 May 2026 18:38:55 +0000</pubDate>
      <link>https://dev.to/santanu_mohanta_29/i-built-a-rag-pipeline-from-scratch-no-langchain-just-fastapi-faiss-28ke</link>
      <guid>https://dev.to/santanu_mohanta_29/i-built-a-rag-pipeline-from-scratch-no-langchain-just-fastapi-faiss-28ke</guid>
      <description>&lt;p&gt;Most RAG tutorials I found were either "pip install langchain and you're done" or 50-page academic papers. I wanted something in between — a pipeline I could actually explain in an interview, where I understood every line.&lt;/p&gt;

&lt;p&gt;So I built one from scratch. No LangChain, no LlamaIndex, no frameworks. Just FastAPI, FAISS, sentence-transformers, and an LLM API.&lt;/p&gt;

&lt;p&gt;Here's what I built, what worked, and what broke.&lt;/p&gt;

&lt;h3&gt;
  
  
  Uploading a PDF
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7z7evnjm0l028a68yycx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7z7evnjm0l028a68yycx.png" alt="Selecting a PDF to upload via Swagger UI" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokfhrbbj8vt8yxbu2395.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokfhrbbj8vt8yxbu2395.png" alt="Upload response — 16 chunks indexed from 5 pages" width="799" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Querying the document
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbah7z1477rixwjux51g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbah7z1477rixwjux51g.png" alt="Asking a question via the /query endpoint" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbu7jtybw38xs70ql7jj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flbu7jtybw38xs70ql7jj.png" alt="Response with answer and source chunks" width="799" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PDF --&amp;gt; extract text (pypdf) --&amp;gt; chunk (500 char, 50 overlap) --&amp;gt; embed (MiniLM-L6-v2)
                                                                        |
                                                                        v
question --&amp;gt; embed --&amp;gt; FAISS top-k search --&amp;gt; build prompt with chunks --&amp;gt; LLM --&amp;gt; answer + sources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five Python files, ~300 lines total:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;main.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;FastAPI app, 3 endpoints, prompt engineering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pdf_loader.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;PDF text extraction via pypdf&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;rag.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Chunking + embedding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;store.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;FAISS vector store wrapper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llm.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Swappable LLM client (Groq / OpenAI / Anthropic)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  How the upload works
&lt;/h2&gt;

&lt;p&gt;When you POST a PDF to &lt;code&gt;/upload&lt;/code&gt;, three things happen:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Text extraction&lt;/strong&gt; — pypdf reads each page and returns the raw text. Pages with no extractable text (scanned images) are skipped.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Chunking&lt;/strong&gt; — each page is split into ~500-character chunks with 50 characters of overlap. The overlap prevents losing context at chunk boundaries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CHUNK_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="n"&gt;CHUNK_OVERLAP&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chunk_pages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;chunk_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;CHUNK_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="n"&gt;chunk_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
                &lt;span class="n"&gt;chunk_id&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;CHUNK_OVERLAP&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
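
&lt;p&gt;To see where the chunk boundaries land, here is a minimal sketch that reproduces only the start/end arithmetic of the loop above. It leaves out the text, the &lt;code&gt;Chunk&lt;/code&gt; class, and the skipping of empty chunks. The helper name &lt;code&gt;chunk_spans&lt;/code&gt; is hypothetical and is not taken from the repo.&lt;/p&gt;

```python
CHUNK_SIZE = 500
CHUNK_OVERLAP = 50


def chunk_spans(length, size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Return the (start, end) offsets the chunking loop would visit."""
    spans = []
    if length == 0:
        return spans
    start = 0
    while True:
        end = min(start + size, length)
        spans.append((start, end))
        if end == length:
            break
        # Step back by the overlap so neighbouring chunks share text.
        start = end - overlap
    return spans


print(chunk_spans(1200))  # [(0, 500), (450, 950), (900, 1200)]
```

&lt;p&gt;A 1,200-character page yields three chunks, and each consecutive pair shares 50 characters.&lt;/p&gt;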



&lt;p&gt;&lt;strong&gt;3. Embedding&lt;/strong&gt; — each chunk is embedded into a 384-dimensional vector using &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt;. This runs locally on CPU, no API call needed. Vectors are normalized so we can use inner product as cosine similarity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;embed_texts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_embed_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# lazy-loaded singleton
&lt;/span&gt;    &lt;span class="n"&gt;vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;normalize_embeddings&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;show_progress_bar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;convert_to_numpy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



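&lt;p&gt;The vectors go into a FAISS &lt;code&gt;IndexFlatIP&lt;/code&gt; index, with the chunk metadata kept alongside it because FAISS itself only stores vectors. A flat index does brute-force exact search: every query is compared against every stored vector. Search time grows linearly with the index size, which is fine for up to ~100k vectors.&lt;/p&gt;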
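
&lt;p&gt;Conceptually, a flat inner-product search over normalized vectors looks like the sketch below. This is a pure-Python stand-in for illustration, not the FAISS API and not the code in &lt;code&gt;store.py&lt;/code&gt;. The names &lt;code&gt;normalize&lt;/code&gt; and &lt;code&gt;flat_ip_search&lt;/code&gt; are hypothetical, and the 2-dimensional vectors are toy data.&lt;/p&gt;

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def flat_ip_search(index_vectors, query, k):
    """Score the query against every stored vector and return the top k.

    Returns a list of (score, position) pairs, highest score first.
    On unit-length vectors the inner product equals cosine similarity.
    """
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(normalize(v), q)), pos)
        for pos, v in enumerate(index_vectors)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


vectors = [[1.0, 0.0], [0.0, -2.0], [3.0, 3.0]]
top = flat_ip_search(vectors, [1.0, 1.0], k=2)
print([pos for _, pos in top])  # [2, 0]
```

&lt;p&gt;The vector at position 2 points in the same direction as the query, so it scores about 1.0 even though its raw length is different. That is the effect of normalizing first.&lt;/p&gt;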

&lt;h2&gt;
  
  
  How the query works
&lt;/h2&gt;

&lt;p&gt;When you POST a question to &lt;code&gt;/query&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The question is embedded using the &lt;strong&gt;same model&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;FAISS finds the top-k most similar chunks by cosine similarity&lt;/li&gt;
&lt;li&gt;The chunks are formatted into a prompt with labels like &lt;code&gt;[Chunk 3 | Page 2]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The LLM generates an answer grounded in those chunks&lt;/li&gt;
&lt;li&gt;Both the answer and source chunks are returned&lt;/li&gt;
&lt;/ol&gt;
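
&lt;p&gt;Step 3 can be sketched roughly as follows. The function names, the dictionary shape of a retrieved chunk, and the "Context / Question" layout are assumptions made for illustration. Only the &lt;code&gt;[Chunk N | Page M]&lt;/code&gt; label format comes from the pipeline described here, and the chunk texts are placeholders.&lt;/p&gt;

```python
def build_context(chunks):
    """Join retrieved chunks into one labelled context string."""
    blocks = []
    for chunk in chunks:
        label = f"[Chunk {chunk['chunk_id']} | Page {chunk['page']}]"
        blocks.append(f"{label}\n{chunk['text']}")
    return "\n\n".join(blocks)


def build_user_prompt(question, chunks):
    """Combine the labelled context and the question into the user message."""
    return f"Context:\n{build_context(chunks)}\n\nQuestion: {question}"


# Placeholder chunks standing in for real retrieval results.
retrieved = [
    {"chunk_id": 3, "page": 2, "text": "(text of chunk 3)"},
    {"chunk_id": 7, "page": 4, "text": "(text of chunk 7)"},
]
print(build_user_prompt("What is in the Standard tier?", retrieved))
```

&lt;p&gt;Keeping the labels in the prompt lets the answer be traced back to specific chunks and pages when the sources are returned.&lt;/p&gt;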

&lt;p&gt;The system prompt is deliberately strict:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a careful assistant that answers questions strictly
from the provided document context.

Rules:
- Use ONLY the context below. Do not use outside knowledge.
- If the answer is not in the context, say:
  "I couldn't find that in the document."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Swappable LLM providers
&lt;/h2&gt;

&lt;p&gt;One thing I'm happy with — the LLM is swappable via a single environment variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;LLM_PROVIDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;groq      &lt;span class="c"&gt;# or openai, or anthropic&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All three providers share the same interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LLMClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You only need an API key for the provider you pick. I used Groq with Llama 3.3 70B for development because it's fast and free-tier friendly.&lt;/p&gt;
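
&lt;p&gt;One way to wire the environment variable to a client is a small registry, sketched below. The two client classes are stand-ins that only echo their input (a real client would call the provider's SDK), and &lt;code&gt;PROVIDERS&lt;/code&gt;, &lt;code&gt;get_llm_client&lt;/code&gt;, and the &lt;code&gt;groq&lt;/code&gt; default are assumptions, not necessarily how &lt;code&gt;llm.py&lt;/code&gt; does it.&lt;/p&gt;

```python
import os
from abc import ABC, abstractmethod


class LLMClient(ABC):
    @abstractmethod
    def generate(self, system, user):
        """Return the model's answer as a string."""


class FakeGroqClient(LLMClient):
    # Stand-in: a real client would call the provider's SDK here.
    def generate(self, system, user):
        return f"[groq] {user}"


class FakeOpenAIClient(LLMClient):
    # Stand-in with the same interface as the class above.
    def generate(self, system, user):
        return f"[openai] {user}"


# Hypothetical registry mapping LLM_PROVIDER values to client classes.
PROVIDERS = {"groq": FakeGroqClient, "openai": FakeOpenAIClient}


def get_llm_client():
    """Pick a client class from the LLM_PROVIDER environment variable."""
    name = os.environ.get("LLM_PROVIDER", "groq").lower()
    if name not in PROVIDERS:
        raise ValueError(f"Unknown LLM_PROVIDER: {name!r}")
    return PROVIDERS[name]()


os.environ["LLM_PROVIDER"] = "openai"
print(get_llm_client().generate("system prompt", "hello"))  # [openai] hello
```

&lt;p&gt;Because the rest of the app only ever calls &lt;code&gt;generate&lt;/code&gt;, adding a provider means adding one class and one registry entry.&lt;/p&gt;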

&lt;h2&gt;
  
  
  Testing it: what worked and what didn't
&lt;/h2&gt;

&lt;p&gt;I created a &lt;a href="https://github.com/santanu2908/chat-with-pdf-rag/blob/main/data/sample_test_file.pdf" rel="noopener noreferrer"&gt;fictional 5-page company document&lt;/a&gt; and threw 19 questions at the pipeline. Questions ranged from simple lookups to multi-hop reasoning to negative tests (questions the document can't answer).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What worked well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct lookups: &lt;em&gt;"What is the list price of the Magpie-7?"&lt;/em&gt; — nailed it&lt;/li&gt;
&lt;li&gt;Table data: &lt;em&gt;"What's included in the Standard tier?"&lt;/em&gt; — correct&lt;/li&gt;
&lt;li&gt;Negative tests: &lt;em&gt;"What's Zentara's stock ticker?"&lt;/em&gt; — correctly said "not in the document"&lt;/li&gt;
&lt;li&gt;Multi-hop: &lt;em&gt;"If I want 1-hour SLA support, what will it cost?"&lt;/em&gt; — combined info from the pricing table&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What failed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;"Who is the CEO?"&lt;/em&gt; — couldn't find it&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;"How many employees does Zentara have?"&lt;/em&gt; — couldn't find it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both answers were on page 1, in a dense "Company snapshot" table: CEO, CTO, HQ, employees, revenue — all packed together.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why it failed (and what I learned)
&lt;/h2&gt;

&lt;p&gt;The problem wasn't the LLM — it was the &lt;strong&gt;retriever&lt;/strong&gt;. The Company snapshot table had 8+ different facts crammed into one chunk. The embedding for that chunk became a muddy average of all those topics, so it didn't rank highly for any specific question.&lt;/p&gt;

&lt;p&gt;This is the classic weakness of &lt;strong&gt;pure semantic search&lt;/strong&gt;. The word "CEO" appears exactly once in the document. A keyword search (BM25) would find it instantly. But vector search relies on semantic similarity, and a short query like "Who is the CEO?" doesn't produce a strong enough match against a chunk that's 80% about revenue, headquarters, and employee count.&lt;/p&gt;

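&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; hybrid retrieval, which combines BM25 (keyword matching) with vector search so that an exact term like "CEO" can surface a chunk even when its embedding is a weak match. Many production RAG systems take this approach. It's on my to-do list.&lt;/p&gt;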
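
&lt;p&gt;I haven't built this yet, but one common way to merge a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below is an illustration of that technique, not code from the repo. The two input lists are made-up chunk ids, and &lt;code&gt;k=60&lt;/code&gt; is a commonly used default for the smoothing constant.&lt;/p&gt;

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (k + rank) per id, with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first; ties fall back to the chunk id.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


keyword_hits = [0, 5, 2]  # hypothetical BM25 order
vector_hits = [5, 2, 9]   # hypothetical vector-search order
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))  # [5, 2, 0, 9]
```

&lt;p&gt;Chunk 0 is found only by the keyword ranking, yet it still makes the fused list. That is the behaviour the "Who is the CEO?" question needs.&lt;/p&gt;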

&lt;h2&gt;
  
  
  Key design decisions (interview-ready)
&lt;/h2&gt;

&lt;p&gt;If you're building this for interviews, these are the tradeoffs worth knowing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Character-based chunking (not token-based)&lt;/td&gt;
&lt;td&gt;Simpler, no tokenizer dependency. Production would size chunks in tokens using the model's tokenizer (e.g. tiktoken for OpenAI models).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local embeddings (not OpenAI)&lt;/td&gt;
&lt;td&gt;Free, offline, no API latency. Retrieval quality is often lower than with larger hosted models, but fine for demos.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FAISS IndexFlatIP (not HNSW)&lt;/td&gt;
&lt;td&gt;Exact search, no approximation. Fine up to ~100k vectors.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Normalized embeddings&lt;/td&gt;
&lt;td&gt;Inner product = cosine similarity. One less thing to configure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No streaming&lt;/td&gt;
&lt;td&gt;v1 simplification. Streaming is where LLM SDKs diverge the most.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No conversation memory&lt;/td&gt;
&lt;td&gt;Each query is independent. Adding memory is conceptually simple, but it complicates prompt building and state handling.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What I'd add next
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid retrieval&lt;/strong&gt; (BM25 + vector) — catches keyword matches that pure semantic search misses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reranker&lt;/strong&gt; (cross-encoder) — re-scores the top-k results for better precision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation set&lt;/strong&gt; — automated accuracy measurement instead of manual testing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming&lt;/strong&gt; — better UX for longer answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conversation memory&lt;/strong&gt; — follow-up questions&lt;/li&gt;
&lt;/ol&gt;
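
&lt;p&gt;For item 3, even a tiny harness beats eyeballing answers. The sketch below is one possible shape, assuming a simple "expected text appears in the answer" check. &lt;code&gt;evaluate&lt;/code&gt; and &lt;code&gt;fake_pipeline&lt;/code&gt; are hypothetical names, and the questions are toy cases rather than my real test set.&lt;/p&gt;

```python
def evaluate(cases, answer_fn):
    """Return the fraction of cases whose answer contains the expected text."""
    passed = 0
    for question, expected in cases:
        answer = answer_fn(question)
        if expected.lower() in answer.lower():
            passed += 1
    return passed / len(cases) if cases else 0.0


def fake_pipeline(question):
    # Stand-in for a call to the /query endpoint.
    canned = {"What colour is the sky?": "The sky is blue."}
    return canned.get(question, "I couldn't find that in the document.")


cases = [
    ("What colour is the sky?", "blue"),                # direct lookup
    ("What is the stock ticker?", "couldn't find"),     # negative test
    ("What is the capital of France?", "Paris"),        # expected to fail
]
print(evaluate(cases, fake_pipeline))
```

&lt;p&gt;Substring matching is crude, but it is enough to catch regressions when chunking or retrieval settings change.&lt;/p&gt;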

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;The repo is here: &lt;a href="https://github.com/santanu2908/chat-with-pdf-rag" rel="noopener noreferrer"&gt;github.com/santanu2908/chat-with-pdf-rag&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv &lt;span class="nb"&gt;sync
cp&lt;/span&gt; .env.example .env   &lt;span class="c"&gt;# set your API key&lt;/span&gt;
uv run uvicorn app.main:app &lt;span class="nt"&gt;--reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;code&gt;http://localhost:8000/docs&lt;/code&gt;, upload the included sample PDF (&lt;code&gt;data/sample_test_file.pdf&lt;/code&gt;), and start asking questions.&lt;/p&gt;




&lt;p&gt;If you've built something similar or have suggestions (especially on hybrid retrieval), I'd love to hear about it in the comments.&lt;/p&gt;

&lt;p&gt;I'm &lt;strong&gt;Santanu Mohanta&lt;/strong&gt; — you can connect with me on &lt;a href="https://www.linkedin.com/in/santanu29/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or check out my other projects on &lt;a href="https://github.com/santanu2908" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>python</category>
      <category>ai</category>
      <category>fastapi</category>
    </item>
  </channel>
</rss>
