<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sharath Kurup</title>
    <description>The latest articles on DEV Community by Sharath Kurup (@sharathkurup).</description>
    <link>https://dev.to/sharathkurup</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3855663%2F841065e1-5964-4cef-8344-4800620c456f.jpg</url>
      <title>DEV Community: Sharath Kurup</title>
      <link>https://dev.to/sharathkurup</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sharathkurup"/>
    <language>en</language>
    <item>
      <title>Part 4: Improving Retrieval Quality with Token-Aware Chunking and HyDE</title>
      <dc:creator>Sharath Kurup</dc:creator>
      <pubDate>Sun, 26 Apr 2026 03:04:50 +0000</pubDate>
      <link>https://dev.to/sharathkurup/part-4-improving-retrieval-quality-with-token-aware-chunking-and-hyde-4hkp</link>
      <guid>https://dev.to/sharathkurup/part-4-improving-retrieval-quality-with-token-aware-chunking-and-hyde-4hkp</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjtwfkmhfsw3fwj0w0p8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwjtwfkmhfsw3fwj0w0p8.png" alt="Rag Architecture" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Making RAG Smarter with Token-Aware Chunking, HyDE, and Context-Aware Search
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;In Part 3, we improved chunking and optimized context. The system was faster and cleaner… but still not always correct.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What broke after Part 3?
&lt;/h2&gt;

&lt;p&gt;By this point, the system looked solid:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smarter chunking&lt;/li&gt;
&lt;li&gt;Context compression&lt;/li&gt;
&lt;li&gt;FAISS + re-ranking&lt;/li&gt;
&lt;li&gt;Streaming responses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But when I started using it more realistically, a few problems showed up:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Token limits were still hurting quality
&lt;/h3&gt;

&lt;p&gt;Even with better chunks, we were still &lt;strong&gt;not controlling how much context we send to the model&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Vague queries failed badly
&lt;/h3&gt;

&lt;p&gt;Questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;“Explain this”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;“What does it mean?”&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…would often retrieve irrelevant chunks.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Follow-up questions felt disconnected
&lt;/h3&gt;

&lt;p&gt;The system didn’t “remember” what we were talking about.&lt;/p&gt;




&lt;p&gt;At this point, it stopped feeling like a “retrieval problem”&lt;br&gt;
…and more like a &lt;strong&gt;context understanding problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So in Part 4, I focused on making RAG smarter.&lt;/p&gt;


&lt;h2&gt;
  
  
  What we’re building in this part
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Token-aware chunking (based on actual LLM limits)&lt;/li&gt;
&lt;li&gt;HyDE (Hypothetical Document Embeddings)&lt;/li&gt;
&lt;li&gt;Early version of context-aware retrieval&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Updated Pipeline
&lt;/h2&gt;

&lt;p&gt;Before jumping into code, here’s how the pipeline evolved:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (Part 3):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query → Embedding → FAISS → Re-rank → Context → LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Now (Part 4):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query → (HyDE) → Better Query → Embedding  
     → FAISS → Re-rank → Token-aware Context  
     → LLM (with constraints)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqhqkj55r650b92n790y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqhqkj55r650b92n790y.png" alt="Pipeline" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Problem 1: Token limits are real (and we were ignoring them)
&lt;/h2&gt;

&lt;p&gt;Until now, chunking was based on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;characters&lt;/li&gt;
&lt;li&gt;sentences&lt;/li&gt;
&lt;li&gt;separators&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But LLMs don’t think in characters.&lt;/p&gt;

&lt;p&gt;They think in &lt;strong&gt;tokens&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Why this matters
&lt;/h3&gt;

&lt;p&gt;You might send:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;2 chunks → fine&lt;/li&gt;
&lt;li&gt;5 chunks → maybe fine&lt;/li&gt;
&lt;li&gt;10 chunks → silently truncated or degraded&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And the worst part?&lt;/p&gt;

&lt;p&gt;You won’t even know it’s happening.&lt;/p&gt;




&lt;h2&gt;
  
  
  Solution: Token-aware chunking
&lt;/h2&gt;

&lt;p&gt;Instead of guessing chunk sizes, we measure them using a tokenizer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;

&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tiktoken&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_encoding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cl100k_base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_token_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now chunking becomes &lt;strong&gt;token-driven instead of size-driven&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Token-based chunking strategy
&lt;/h3&gt;

&lt;p&gt;Key idea:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build chunks until a &lt;strong&gt;token limit&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;If exceeded → split intelligently&lt;/li&gt;
&lt;li&gt;Maintain &lt;strong&gt;overlap using tokens&lt;/strong&gt;, not characters
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MAX_TOKENS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;250&lt;/span&gt;
&lt;span class="n"&gt;OVERLAP_TOKENS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Smarter chunk building
&lt;/h3&gt;

&lt;p&gt;Instead of blindly splitting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefer paragraphs&lt;/li&gt;
&lt;li&gt;Then sentences&lt;/li&gt;
&lt;li&gt;Then fallback
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_chunks_recursive_tokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;paragraphs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;current_chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;current_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;paragraph&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;paragraphs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;paragraph_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_token_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paragraph&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;paragraph_tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MAX_TOKENS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

            &lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_get_overlap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;current_chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paragraph&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;current_tokens&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;paragraph_tokens&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Why this works better
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Matches &lt;strong&gt;actual LLM limits&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Avoids hidden truncation&lt;/li&gt;
&lt;li&gt;Improves &lt;strong&gt;context density&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Makes responses more reliable&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Problem 2: RAG fails on vague queries
&lt;/h2&gt;

&lt;p&gt;This was the bigger issue.&lt;/p&gt;

&lt;p&gt;Even with good chunking, queries like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Explain this concept”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;…don’t contain enough semantic signal.&lt;/p&gt;

&lt;p&gt;So FAISS retrieves something… but often not the right thing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Solution: HyDE (Hypothetical Document Embeddings)
&lt;/h2&gt;

&lt;p&gt;This is one of the most interesting tricks in RAG.&lt;/p&gt;

&lt;p&gt;Instead of embedding the raw query…&lt;/p&gt;

&lt;p&gt;👉 We first &lt;strong&gt;generate a hypothetical answer&lt;/strong&gt;&lt;br&gt;
👉 Then embed &lt;em&gt;that&lt;/em&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  Why this works
&lt;/h3&gt;

&lt;p&gt;A vague query becomes a &lt;strong&gt;rich semantic representation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User query:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Explain this
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;HyDE generates:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This concept refers to a method where...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now embedding this gives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More keywords&lt;/li&gt;
&lt;li&gt;Better semantic alignment&lt;/li&gt;
&lt;li&gt;Stronger retrieval&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Generate hypothetical answer
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_hypothetical_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;recent_user&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;history_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;recent_user&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:])&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a 2-sentence technical summary answering: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Recent user context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;history_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;HYDE_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 2: Augment the query
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;hypothetical_answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_hypothetical_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;search_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hypothetical_answer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Step 3: Embed the enriched query
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EMBED_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;search_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Impact
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Better retrieval for vague queries&lt;/li&gt;
&lt;li&gt;Improved relevance&lt;/li&gt;
&lt;li&gt;More stable responses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the cost of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~1–2 seconds extra latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Worth it? In most cases — yes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Early Step: Context-aware retrieval
&lt;/h2&gt;

&lt;p&gt;Another issue we started addressing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Follow-up questions were treated as completely new queries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: Explain attention mechanism
User: How does it work?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Second query loses context.&lt;/p&gt;




&lt;h3&gt;
  
  
  What we added
&lt;/h3&gt;

&lt;p&gt;A simple but effective improvement:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detect vague follow-ups&lt;/li&gt;
&lt;li&gt;Inject &lt;strong&gt;previous page context&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;target_page&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;vague_followups&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()):&lt;/span&gt;
    &lt;span class="n"&gt;last_pages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_last_referenced_pages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;last_pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; page &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;last_pages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Result
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Follow-ups become meaningful&lt;/li&gt;
&lt;li&gt;Retrieval stays anchored&lt;/li&gt;
&lt;li&gt;Conversation feels connected&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Putting it all together
&lt;/h2&gt;

&lt;p&gt;Now the system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understands token limits&lt;/li&gt;
&lt;li&gt;Improves weak queries&lt;/li&gt;
&lt;li&gt;Handles basic conversation flow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiiawcneyiju7772hub0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiiawcneyiju7772hub0i.png" alt="Show Flow" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  From learning project → real system
&lt;/h2&gt;

&lt;p&gt;This is where things started to change.&lt;/p&gt;

&lt;p&gt;Earlier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It worked&lt;/li&gt;
&lt;li&gt;It demonstrated RAG&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It behaves more like a real assistant&lt;/li&gt;
&lt;li&gt;Handles imperfect queries&lt;/li&gt;
&lt;li&gt;Works under constraints&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What’s next?
&lt;/h2&gt;

&lt;p&gt;We’ve improved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data representation (chunks)&lt;/li&gt;
&lt;li&gt;Query understanding (HyDE)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But one big gap still remains:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;We still don’t know &lt;em&gt;why&lt;/em&gt; RAG fails when it fails.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Part 5, we’ll go deeper into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debugging the RAG pipeline&lt;/li&gt;
&lt;li&gt;Visualizing FAISS vs re-ranking&lt;/li&gt;
&lt;li&gt;Understanding retrieval quality&lt;/li&gt;
&lt;li&gt;Making the system more transparent&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;Full implementation available here:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/SharathKurup/chatPDF/blob/token_aware_rag/" rel="noopener noreferrer"&gt;https://github.com/SharathKurup/chatPDF/blob/token_aware_rag/&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;At a high level, RAG seems simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Retrieve → augment → generate&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But in practice, most of the work is here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How you &lt;strong&gt;represent data&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;How you &lt;strong&gt;interpret queries&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;How you &lt;strong&gt;control context&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This part was about tightening those pieces.&lt;/p&gt;

&lt;p&gt;And it makes a noticeable difference.&lt;/p&gt;




&lt;p&gt;If you’ve been building with RAG, I’d recommend trying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token-aware chunking&lt;/li&gt;
&lt;li&gt;HyDE&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They’re relatively small changes — but high impact.&lt;/p&gt;




&lt;p&gt;Let me know what you think or what you’d improve.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>python</category>
      <category>ai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Understanding RAG by Building a ChatPDF App: Smarter Chunking &amp; Context Optimization (Part 3)</title>
      <dc:creator>Sharath Kurup</dc:creator>
      <pubDate>Mon, 20 Apr 2026 06:55:25 +0000</pubDate>
      <link>https://dev.to/sharathkurup/understanding-rag-by-building-a-chatpdf-app-smarter-chunking-context-optimization-part-3-4blh</link>
      <guid>https://dev.to/sharathkurup/understanding-rag-by-building-a-chatpdf-app-smarter-chunking-context-optimization-part-3-4blh</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83oq3rvwmnymyhdihmmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83oq3rvwmnymyhdihmmn.png" alt="Rag Pipeline" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚡ Understanding RAG by Building a ChatPDF App: Better Chunking &amp;amp; Smarter Context
&lt;/h2&gt;




&lt;blockquote&gt;
&lt;p&gt;In Part 1, we made it work.&lt;br&gt;
In Part 2, we made it fast.&lt;br&gt;
In Part 3, things got… interesting 😅&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  📌 Recap from Part 1 &amp;amp; 2
&lt;/h2&gt;

&lt;p&gt;In the previous parts:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Part 1&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built a basic RAG pipeline using NumPy&lt;/li&gt;
&lt;li&gt;Understood embeddings + similarity search&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 &lt;strong&gt;Part 2&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Switched to FAISS for fast retrieval ⚡&lt;/li&gt;
&lt;li&gt;Added persistence + re-ranking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point, everything &lt;em&gt;looked&lt;/em&gt; solid.&lt;/p&gt;




&lt;h2&gt;
  
  
  😅 But Something Still Felt Off
&lt;/h2&gt;

&lt;p&gt;I started testing with real questions…&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is FAISS indexing?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And sometimes the answer would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Talk about embeddings instead&lt;/li&gt;
&lt;li&gt;Miss key details&lt;/li&gt;
&lt;li&gt;Or feel… slightly off&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🤔 The weird part?
&lt;/h3&gt;

&lt;p&gt;The answer &lt;strong&gt;was actually present in the document&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But we weren’t retrieving the &lt;em&gt;right chunk&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 The Real Problem Was Not Search
&lt;/h2&gt;

&lt;p&gt;FAISS was doing its job.&lt;/p&gt;

&lt;p&gt;The issue was earlier in the pipeline:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;We were feeding it bad chunks.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🔍 Let’s Look at the Old Chunking Logic
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;CHUNK_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;last_space&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rfind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;last_space&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;last_space&lt;/span&gt;
                &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;OVERLAP_SIZE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🧠 What This Was Doing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Split text using fixed size&lt;/li&gt;
&lt;li&gt;Avoid breaking words&lt;/li&gt;
&lt;li&gt;Add overlap&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Looks reasonable… right?&lt;/p&gt;




&lt;h2&gt;
  
  
  🚨 Where It Breaks
&lt;/h2&gt;

&lt;p&gt;Let’s take a simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Original:
"FAISS is a library for efficient similarity search. It is widely used in RAG systems."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now imagine this gets chunked like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunk 1:
"FAISS is a library for efficient similarity"

Chunk 2:
"search. It is widely used in RAG systems"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  💥 What just happened?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The sentence got split&lt;/li&gt;
&lt;li&gt;Meaning got split&lt;/li&gt;
&lt;li&gt;Embeddings lost context&lt;/li&gt;
&lt;/ul&gt;




&lt;blockquote&gt;
&lt;p&gt;Embeddings don’t understand fragments&lt;br&gt;
They understand &lt;strong&gt;complete ideas&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🔍 Let’s Visualize This
&lt;/h2&gt;

&lt;p&gt;Let’s visualize what was actually happening 👇&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvwj9otfwnom9chbd1vy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsvwj9otfwnom9chbd1vy.png" alt="Fixed vs Recursive Chunking" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;👉 Notice how sentences are broken across chunks —&lt;br&gt;
this is exactly what degrades retrieval quality.&lt;/p&gt;


&lt;h2&gt;
  
  
  💡 The Shift in Thinking
&lt;/h2&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Split text by size”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We need:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Split text by meaning”&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  🚀 Step 1: Recursive Chunking (Respect Structure)
&lt;/h2&gt;


&lt;h3&gt;
  
  
  ✅ New Approach
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_chunks_recursive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;overlap_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;
        &lt;span class="n"&gt;chunk_slice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;separator&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
            &lt;span class="n"&gt;last_break&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk_slice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rfind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;separator&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;last_break&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;separator&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;last_break&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;last_break&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk_size&lt;/span&gt;

        &lt;span class="n"&gt;actual_end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;last_break&lt;/span&gt;
        &lt;span class="n"&gt;final_chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;actual_end&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;final_chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;actual_end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;overlap_size&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧠 What Changed Here?
&lt;/h2&gt;

&lt;p&gt;Instead of blindly splitting, we now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;separator&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We try:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Paragraph&lt;/li&gt;
&lt;li&gt;Line&lt;/li&gt;
&lt;li&gt;Sentence&lt;/li&gt;
&lt;li&gt;Word&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;👉 This is a &lt;strong&gt;priority-based splitting strategy&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  💡 Why This Works Better
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Paragraphs stay intact&lt;/li&gt;
&lt;li&gt;Sentences stay intact&lt;/li&gt;
&lt;li&gt;Meaning stays intact&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ✅ Micro Summary
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What changed:&lt;/strong&gt; Structure-aware chunking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters:&lt;/strong&gt; Better embeddings → better retrieval&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔁 Step 2: Overlap Still Matters
&lt;/h2&gt;

&lt;p&gt;We still keep overlap:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Chunk 1]  "RAG systems work by retrieving relevant context"
[Chunk 2]         "retrieving relevant context from documents"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  🧠 Why This Is Important
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Prevents context gaps&lt;/li&gt;
&lt;li&gt;Keeps continuity between chunks&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔥 Step 3: Storing Full Context (Big Upgrade)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_advanced_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;search_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_chunks_recursive&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;search_chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Page &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page_content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🧠 Why This Matters
&lt;/h2&gt;

&lt;p&gt;Earlier:&lt;/p&gt;

&lt;p&gt;👉 We only stored chunk text&lt;/p&gt;

&lt;p&gt;Now:&lt;/p&gt;

&lt;p&gt;👉 We also store the &lt;strong&gt;entire page&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  💡 What This Enables
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Better answer generation&lt;/li&gt;
&lt;li&gt;Flexibility for later processing&lt;/li&gt;
&lt;li&gt;Smarter context selection&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🚨 New Problem Introduced
&lt;/h2&gt;

&lt;p&gt;Now that we store full pages…&lt;/p&gt;

&lt;p&gt;We started sending &lt;strong&gt;too much data&lt;/strong&gt; to the LLM.&lt;/p&gt;




&lt;h2&gt;
  
  
  ❌ Problem
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Large token usage&lt;/li&gt;
&lt;li&gt;Slower responses&lt;/li&gt;
&lt;li&gt;Irrelevant information&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🚀 Step 4: Context Compression
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compress_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;full_text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;(?&amp;lt;=[.!?]) +&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;full_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;query_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;scored_sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query_words&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;scored_sentences&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;top_sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scored_sentences&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;MAX_SENTENCES&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_sentences&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🧠 What’s Happening Here?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Break into sentences&lt;/li&gt;
&lt;li&gt;Score relevance using query&lt;/li&gt;
&lt;li&gt;Keep only top sentences&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🔍 Visualize It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before:
Full page → 1000+ tokens ❌

After:
Relevant sentences → smaller context ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  💡 Why This Is Powerful
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Faster responses&lt;/li&gt;
&lt;li&gt;Better relevance&lt;/li&gt;
&lt;li&gt;Lower token usage&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ✅ Micro Summary
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What changed:&lt;/strong&gt; Context filtering&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters:&lt;/strong&gt; Less noise → better answers&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔍 Retrieval Still Uses FAISS + Re-ranking
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ranker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rerank_request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🧠 Flow
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;FAISS → fast retrieval&lt;/li&gt;
&lt;li&gt;Re-ranker → improves relevance&lt;/li&gt;
&lt;li&gt;Top results → passed to LLM&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  💬 Smarter Answer Generation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;compressed_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compress_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;full_context&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;👉 Instead of raw chunks, we now send:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Focused, relevant context&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🔁 Final System
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4msme4cfreusqqgenqmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4msme4cfreusqqgenqmn.png" alt="Final System" width="500" height="992"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 What Changed Overall
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Part 1&lt;/th&gt;
&lt;th&gt;Part 2&lt;/th&gt;
&lt;th&gt;Part 3&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Search&lt;/td&gt;
&lt;td&gt;NumPy&lt;/td&gt;
&lt;td&gt;FAISS&lt;/td&gt;
&lt;td&gt;FAISS + Rerank&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chunking&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Recursive 🧠&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context&lt;/td&gt;
&lt;td&gt;Raw&lt;/td&gt;
&lt;td&gt;Raw&lt;/td&gt;
&lt;td&gt;Compressed 🔥&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🧠 Final Thought
&lt;/h2&gt;

&lt;p&gt;This is where things clicked for me.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I kept thinking better models would fix my system…&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But the real issue was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bad context in → bad answers out&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Most people focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Models ❌&lt;/li&gt;
&lt;li&gt;Vector DB ❌&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the real gains came from:&lt;/p&gt;

&lt;p&gt;👉 Chunking&lt;br&gt;
👉 Context handling&lt;/p&gt;




&lt;h2&gt;
  
  
  🔜 What’s Next?
&lt;/h2&gt;

&lt;p&gt;Now things get even more interesting.&lt;/p&gt;

&lt;p&gt;In Part 4:&lt;/p&gt;

&lt;p&gt;👉 We’ll move beyond basic retrieval and make the system &lt;strong&gt;smarter&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token-aware chunking&lt;/li&gt;
&lt;li&gt;Better query understanding&lt;/li&gt;
&lt;li&gt;More intelligent retrieval&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  💬 Let’s Connect
&lt;/h2&gt;

&lt;p&gt;If you're building something similar or experimenting with local LLMs, I’d love to hear your thoughts 👇&lt;/p&gt;




</description>
      <category>ai</category>
      <category>python</category>
      <category>rag</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Understanding RAG by Building a ChatPDF App: From NumPy to FAISS (Part 2)</title>
      <dc:creator>Sharath Kurup</dc:creator>
      <pubDate>Tue, 14 Apr 2026 06:44:11 +0000</pubDate>
      <link>https://dev.to/sharathkurup/understanding-rag-by-building-a-chatpdf-app-from-numpy-to-faiss-part-2-33oe</link>
      <guid>https://dev.to/sharathkurup/understanding-rag-by-building-a-chatpdf-app-from-numpy-to-faiss-part-2-33oe</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimcq46y2robl4qqoz9hm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimcq46y2robl4qqoz9hm.png" alt="RAG with FAISS" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚡ From NumPy to FAISS: Making ChatPDF Fast &amp;amp; Scalable
&lt;/h2&gt;




&lt;blockquote&gt;
&lt;p&gt;In Part 1, we made it work.&lt;br&gt;
In Part 2, we make it &lt;em&gt;usable&lt;/em&gt; 🚀&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  📌 Recap from Part 1
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/sharathkurup/understanding-rag-by-building-a-chatpdf-app-with-numpy-part-1-14p7"&gt;Part 1&lt;/a&gt;, we built a ChatPDF app using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PDF → Text → Chunks&lt;/li&gt;
&lt;li&gt;Embeddings using Ollama&lt;/li&gt;
&lt;li&gt;Similarity search using &lt;strong&gt;NumPy&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;LLM to generate answers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It worked well for small PDFs and helped us understand &lt;strong&gt;RAG from first principles&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;But once I started testing with slightly larger PDFs…&lt;/p&gt;




&lt;h2&gt;
  
  
  😅 The Problem Started Showing Up
&lt;/h2&gt;

&lt;p&gt;The issue was not correctness — it was &lt;strong&gt;performance&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let’s revisit what we were doing during search.&lt;/p&gt;




&lt;h3&gt;
  
  
  ❌ NumPy Search (What we had before)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;similarities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;top_indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarities&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;TOP_K&lt;/span&gt;&lt;span class="p"&gt;:][::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  🧠 What’s actually happening here?
&lt;/h3&gt;

&lt;p&gt;Every time you ask a question:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compute similarity with &lt;strong&gt;every chunk&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Store all similarity scores&lt;/li&gt;
&lt;li&gt;Sort the entire list&lt;/li&gt;
&lt;li&gt;Pick top K&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  🚨 Why this becomes a problem
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Time complexity → &lt;strong&gt;O(n)&lt;/strong&gt; per query&lt;/li&gt;
&lt;li&gt;More chunks = slower search&lt;/li&gt;
&lt;li&gt;Entire dataset scanned every time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To make this visible, I added timing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# similarity logic
&lt;/span&gt;&lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total time with numpy: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;execution_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And as the number of chunks increased…&lt;br&gt;
⏳ the delay became noticeable.&lt;/p&gt;


&lt;h2&gt;
  
  
  💡 So What’s the Solution?
&lt;/h2&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Search through everything every time”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We need:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“A system that knows &lt;em&gt;where to look&lt;/em&gt;”&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  🔍 Let’s Visualize the Problem (This is the key moment)
&lt;/h2&gt;

&lt;p&gt;👉 This is where the real difference becomes obvious:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F025vv39ee8jvojnoq5qs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F025vv39ee8jvojnoq5qs.png" alt="Brute Force vs Index Search" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💡 What this diagram shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NumPy → scans every single chunk&lt;/li&gt;
&lt;li&gt;FAISS → directly jumps to the most relevant results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the exact shift from:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;brute force → intelligent retrieval&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  🚀 Introducing FAISS
&lt;/h2&gt;

&lt;p&gt;FAISS (Facebook AI Similarity Search) is built for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast vector similarity search&lt;/li&gt;
&lt;li&gt;Efficient indexing&lt;/li&gt;
&lt;li&gt;Handling large datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key idea:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Build an index once → search efficiently many times&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  🔄 Step 1: Moving from Raw Vectors → FAISS Index
&lt;/h2&gt;


&lt;h2&gt;
  
  
  ❌ Before (NumPy mindset)
&lt;/h2&gt;

&lt;p&gt;We stored vectors like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vector_db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it.&lt;/p&gt;

&lt;p&gt;No structure. No optimization. Just raw data.&lt;/p&gt;




&lt;h2&gt;
  
  
  ✅ After (FAISS approach)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vector_np&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;dimension&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IndexFlatIP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_np&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🧠 Let’s understand this properly
&lt;/h2&gt;




&lt;h3&gt;
  
  
  1️⃣ Converting to float32
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;vector_np&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vectors&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;FAISS requires vectors in &lt;code&gt;float32&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Even if your embeddings are already floats, doing this ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compatibility&lt;/li&gt;
&lt;li&gt;No runtime surprises&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2️⃣ Getting the dimension
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dimension&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each embedding looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-0.45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.88&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The number of elements = &lt;strong&gt;dimension&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;FAISS needs this to build the index correctly.&lt;/p&gt;




&lt;h3&gt;
  
  
  3️⃣ Creating the index
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IndexFlatIP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;IndexFlatIP&lt;/code&gt; → Inner Product search&lt;/li&gt;
&lt;li&gt;Since embeddings are normalized →
👉 Inner Product ≈ Cosine Similarity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we are essentially saying:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Store these vectors and allow fast similarity-based search.”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  4️⃣ Adding vectors to FAISS
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_np&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This step:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loads all embeddings into FAISS&lt;/li&gt;
&lt;li&gt;Builds the internal structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 From here, we stop thinking in terms of arrays and start thinking in terms of an &lt;strong&gt;index&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🎯 Big Concept Shift
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;NumPy&lt;/th&gt;
&lt;th&gt;FAISS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw vectors&lt;/td&gt;
&lt;td&gt;Indexed vectors&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual search&lt;/td&gt;
&lt;td&gt;Optimized search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full scan&lt;/td&gt;
&lt;td&gt;Smart retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🔍 Step 2: Searching with FAISS
&lt;/h2&gt;




&lt;h2&gt;
  
  
  ❌ Before (NumPy)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;similarities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;top_indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarities&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;TOP_K&lt;/span&gt;&lt;span class="p"&gt;:][::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ✅ After (FAISS)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TOP_K&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🧠 Let’s break this down
&lt;/h2&gt;




&lt;h3&gt;
  
  
  1️⃣ Why reshape?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;FAISS expects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[number_of_queries, dimension]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even a single query must be shaped like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="err"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2️⃣ What does &lt;code&gt;search()&lt;/code&gt; do?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;distances&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;FAISS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finds nearest vectors&lt;/li&gt;
&lt;li&gt;Sorts internally&lt;/li&gt;
&lt;li&gt;Returns top K&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3️⃣ Mapping results back
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;text_metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use indices to fetch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Actual text chunks&lt;/li&gt;
&lt;li&gt;Page numbers&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  💡 Why this is powerful
&lt;/h2&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing similarity logic ❌&lt;/li&gt;
&lt;li&gt;Writing sorting logic ❌&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You now:&lt;/p&gt;

&lt;p&gt;👉 Call one optimized function&lt;/p&gt;




&lt;h2&gt;
  
  
  💾 Step 3: Avoid Recomputing Everything
&lt;/h2&gt;




&lt;h2&gt;
  
  
  🚨 Problem in Part 1
&lt;/h2&gt;

&lt;p&gt;Every run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read PDF&lt;/li&gt;
&lt;li&gt;Chunk text&lt;/li&gt;
&lt;li&gt;Generate embeddings&lt;/li&gt;
&lt;li&gt;Build vectors&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ✅ Solution: Save the Index
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db/index.faiss&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db/metadata.pkl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;pickle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🧠 What are we saving?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;FAISS index → vector structure&lt;/li&gt;
&lt;li&gt;Metadata → chunk + page info&lt;/li&gt;
&lt;li&gt;PDF hash → detect changes&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔁 Loading instead of recomputing
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;faiss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db/index.faiss&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⚡ Faster startup&lt;/li&gt;
&lt;li&gt;❌ No repeated embedding calls&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔐 Step 4: Detecting PDF Changes
&lt;/h2&gt;






&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_pdf_hash&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;sha256_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🧠 Why this matters
&lt;/h2&gt;

&lt;p&gt;If the PDF changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Old embeddings become invalid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate hash&lt;/li&gt;
&lt;li&gt;Compare with stored hash&lt;/li&gt;
&lt;li&gt;Rebuild only if needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Small addition, big impact.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔥 Step 5: Improving Retrieval with Re-ranking
&lt;/h2&gt;




&lt;p&gt;Even FAISS isn’t perfect.&lt;/p&gt;

&lt;p&gt;So we add another layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ranker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rerank_request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🧠 What’s happening here?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;FAISS retrieves top 10 chunks&lt;/li&gt;
&lt;li&gt;Re-ranker evaluates relevance&lt;/li&gt;
&lt;li&gt;Returns best TOP_K&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  📊 Debug visibility
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--- Re-ranker Scores ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Helps you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand ranking&lt;/li&gt;
&lt;li&gt;Debug results&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  💬 Step 6: Streaming Responses (UX Upgrade)
&lt;/h2&gt;






&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;generate_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_llm&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🧠 Why this matters
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Feels real-time&lt;/li&gt;
&lt;li&gt;Improves perceived speed&lt;/li&gt;
&lt;li&gt;Better experience&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔁 Final System (Let’s Visualize It)
&lt;/h2&gt;

&lt;p&gt;👉 This is what your ChatPDF system looks like now:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b0jim1wbgbdl4wszlpj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7b0jim1wbgbdl4wszlpj.png" alt="Rag Pipeline" width="640" height="451"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 What this diagram represents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Query → converted into embedding&lt;/li&gt;
&lt;li&gt;FAISS → retrieves relevant chunks&lt;/li&gt;
&lt;li&gt;Re-ranker → improves quality&lt;/li&gt;
&lt;li&gt;LLM → generates final answer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 This is a &lt;strong&gt;complete RAG pipeline&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 What We Achieved
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Part 1 (NumPy)&lt;/th&gt;
&lt;th&gt;Part 2 (FAISS)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Search&lt;/td&gt;
&lt;td&gt;Brute force&lt;/td&gt;
&lt;td&gt;Indexed ⚡&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Slow&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistence&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Improved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UX&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🧠 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This is where things became real.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From “I understand RAG”&lt;br&gt;
to&lt;br&gt;
“I can build something scalable”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;If you’re learning RAG:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with NumPy ✅&lt;/li&gt;
&lt;li&gt;Move to FAISS ✅&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That transition is where the real understanding happens.&lt;/p&gt;




&lt;h2&gt;
  
  
  📂 Project Repo
&lt;/h2&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/SharathKurup/chatPDF/tree/faiss_indexing" rel="noopener noreferrer"&gt;https://github.com/SharathKurup/chatPDF/tree/faiss_indexing&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🔜 What’s Next?
&lt;/h2&gt;

&lt;p&gt;In Part 3:&lt;/p&gt;

&lt;p&gt;👉 We’ll build a &lt;strong&gt;Streamlit UI&lt;/strong&gt;&lt;br&gt;
👉 Turn this into a proper app&lt;/p&gt;




&lt;h2&gt;
  
  
  💬 Let’s Connect
&lt;/h2&gt;

&lt;p&gt;If you're building something similar or exploring local LLMs, I’d love to hear your thoughts 👇&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>tutorial</category>
      <category>faiss</category>
    </item>
    <item>
      <title>Understanding RAG by Building a ChatPDF App with NumPy (Part 1)</title>
      <dc:creator>Sharath Kurup</dc:creator>
      <pubDate>Wed, 08 Apr 2026 02:21:07 +0000</pubDate>
      <link>https://dev.to/sharathkurup/understanding-rag-by-building-a-chatpdf-app-with-numpy-part-1-14p7</link>
      <guid>https://dev.to/sharathkurup/understanding-rag-by-building-a-chatpdf-app-with-numpy-part-1-14p7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfp80bej7rutb9f9cg0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfp80bej7rutb9f9cg0i.png" alt="RAG Architecture" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🧠 Building a Chat with PDF App (From Scratch using NumPy)
&lt;/h2&gt;




&lt;blockquote&gt;
&lt;p&gt;Turning a simple PDF into a conversational AI system using local LLMs 🚀&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  📌 Introduction
&lt;/h2&gt;

&lt;p&gt;Have you ever wanted to &lt;em&gt;chat with your PDF documents&lt;/em&gt; like you chat with ChatGPT?&lt;/p&gt;

&lt;p&gt;In this series, I’ll walk you through building a &lt;strong&gt;ChatPDF application from scratch&lt;/strong&gt;, starting from the absolute basics and gradually improving it into a production-ready system.&lt;/p&gt;

&lt;p&gt;👉 In this first part, we’ll build a &lt;strong&gt;naive RAG (Retrieval-Augmented Generation) system using only NumPy&lt;/strong&gt; — no FAISS, no vector databases, just pure fundamentals.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎯 What We’ll Build
&lt;/h2&gt;

&lt;p&gt;By the end of this article, you’ll have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📄 A system that reads a PDF&lt;/li&gt;
&lt;li&gt;✂️ Splits it into meaningful chunks&lt;/li&gt;
&lt;li&gt;🔢 Converts text into embeddings using a local model&lt;/li&gt;
&lt;li&gt;🔍 Searches relevant content using vector similarity&lt;/li&gt;
&lt;li&gt;💬 Generates answers using an LLM&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ⚙️ Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;pdfplumber&lt;/code&gt; → Extract text from PDFs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;numpy&lt;/code&gt; → Perform vector similarity search&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ollama&lt;/code&gt; → Run local embedding + LLM models&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧩 How It Works (High Level)
&lt;/h2&gt;

&lt;p&gt;Our pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PDF → Text → Chunks → Embeddings → Similarity Search → LLM → Answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  📥 Step 1: Reading the PDF
&lt;/h2&gt;

&lt;p&gt;We start by extracting text page by page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;readpdf&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;all_texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;pdfplumber&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PDF_PATH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="n"&gt;all_texts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;all_texts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🧠 What’s happening?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Reads each page&lt;/li&gt;
&lt;li&gt;Skips empty pages&lt;/li&gt;
&lt;li&gt;Stores &lt;code&gt;(page_number, text)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ✂️ Step 2: Chunking the Text
&lt;/h2&gt;

&lt;p&gt;Large text doesn’t work well for embeddings or LLMs, so we split it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_chunks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;CHUNK_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;last_space&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rfind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;last_space&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;last_space&lt;/span&gt;
                &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;OVERLAP_SIZE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🧠 Why overlap?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Prevents context loss between chunks&lt;/li&gt;
&lt;li&gt;Helps LLM understand continuity&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔢 Step 3: Generating Embeddings
&lt;/h2&gt;

&lt;p&gt;We convert text into vectors using Ollama:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_embeddings_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;all_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;batch_texts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;BATCH_SIZE&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EMBED_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch_texts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;all_embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embeddings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;all_embeddings&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🧠 Why batching?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Faster processing&lt;/li&gt;
&lt;li&gt;Efficient use of resources&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📏 Step 4: Similarity Search (Core Logic)
&lt;/h2&gt;

&lt;p&gt;Here’s where NumPy shines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;similarities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;top_indices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;argsort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similarities&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;TOP_K&lt;/span&gt;&lt;span class="p"&gt;:][::&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🧠 What’s happening?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;We compute &lt;strong&gt;dot product similarity&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Higher score = more relevant chunk&lt;/li&gt;
&lt;li&gt;Select top K results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 This is essentially a &lt;strong&gt;manual vector database using NumPy&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  💬 Step 5: Generate Answer using LLM
&lt;/h2&gt;

&lt;p&gt;We pass retrieved chunks as context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;context_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Context:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context_chunks&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Question:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Answer:
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;THINKING_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🧠 Key Idea
&lt;/h3&gt;

&lt;p&gt;We’re doing &lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval → relevant chunks&lt;/li&gt;
&lt;li&gt;Generation → LLM response&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔁 Step 6: Interactive Chat Loop
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat_pdf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text_metadata&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;user_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You - &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text_metadata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;context_llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;res&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context_llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can literally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You - What is the main topic?
AI  - ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🔍 Bonus: Embedding Normalization Check
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;norms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embeddings_array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🧠 Why this matters?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If vectors are normalized → dot product ≈ cosine similarity&lt;/li&gt;
&lt;li&gt;Improves consistency in search results&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🚨 Limitations of This Approach
&lt;/h2&gt;

&lt;p&gt;This implementation is intentionally simple — and that comes with trade-offs:&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚠️ 1. Slow Search for Large PDFs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;NumPy scans &lt;strong&gt;every vector&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;No indexing → O(n) search&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ⚠️ 2. Not Scalable
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Works fine for small docs&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Breaks down with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large PDFs&lt;/li&gt;
&lt;li&gt;Multiple documents&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h3&gt;
  
  
  ⚠️ 3. No Persistent Storage
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Embeddings are generated every run&lt;/li&gt;
&lt;li&gt;No caching or database&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ⚠️ 4. Limited Retrieval Quality
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pure similarity search&lt;/li&gt;
&lt;li&gt;No reranking, filtering, or hybrid search&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ⚠️ 5. Context Limitation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Only &lt;code&gt;TOP_K&lt;/code&gt; chunks used&lt;/li&gt;
&lt;li&gt;May miss important information&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧠 What You Learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;How RAG works under the hood&lt;/li&gt;
&lt;li&gt;How embeddings enable semantic search&lt;/li&gt;
&lt;li&gt;How to build a vector search using NumPy&lt;/li&gt;
&lt;li&gt;How LLMs use context to answer questions&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔜 What’s Next?
&lt;/h2&gt;

&lt;p&gt;In &lt;strong&gt;Part 2&lt;/strong&gt;, we’ll upgrade this system by replacing NumPy search with:&lt;/p&gt;

&lt;p&gt;➡️ &lt;strong&gt;FAISS (Facebook AI Similarity Search)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This will give us:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⚡ Faster retrieval&lt;/li&gt;
&lt;li&gt;📈 Better scalability&lt;/li&gt;
&lt;li&gt;🧠 Efficient indexing&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📂 Project Repo
&lt;/h2&gt;

&lt;p&gt;👉 GitHub: &lt;a href="https://github.com/SharathKurup/chatPDF/tree/numpy_vector" rel="noopener noreferrer"&gt;https://github.com/SharathKurup/chatPDF/tree/numpy_vector&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  💬 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This is the &lt;strong&gt;most important step&lt;/strong&gt; in understanding RAG systems:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Before using fancy tools like FAISS or vector DBs,&lt;br&gt;
you should understand what’s happening underneath.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once you get this, everything else becomes easy.&lt;/p&gt;




&lt;p&gt;If you're building something similar or experimenting with local LLMs, I’d love to hear your thoughts 👇&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Stay tuned for Part 2 🚀&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>rag</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
