<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ANKIT AMBASTA</title>
    <description>The latest articles on DEV Community by ANKIT AMBASTA (@asquare8).</description>
    <link>https://dev.to/asquare8</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3927544%2F647e9276-d20d-401f-9a26-d17c1071cd8f.png</url>
      <title>DEV Community: ANKIT AMBASTA</title>
      <link>https://dev.to/asquare8</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/asquare8"/>
    <language>en</language>
    <item>
      <title>The Fallback Pattern: How I Handle 15+ RPM (30,000 Tokens/Min) on Free AI Models</title>
      <dc:creator>ANKIT AMBASTA</dc:creator>
      <pubDate>Tue, 12 May 2026 16:57:48 +0000</pubDate>
      <link>https://dev.to/asquare8/the-fallback-pattern-how-i-handle-15-rpm-30000-tokensmin-on-free-ai-models-the-solution-4dig</link>
      <guid>https://dev.to/asquare8/the-fallback-pattern-how-i-handle-15-rpm-30000-tokensmin-on-free-ai-models-the-solution-4dig</guid>
      <description>&lt;p&gt;When I built &lt;strong&gt;VerdictAI X&lt;/strong&gt; — a high-end decision support system where five specialized AI agents debate your life choices — I ran into a massive architectural problem.&lt;/p&gt;

&lt;p&gt;Multi-agent systems do not just eat tokens; they completely destroy your rate limits.&lt;/p&gt;

&lt;p&gt;Most tutorials show you how to build a simple chatbot that makes one API call per user message. But what happens when you have a multi-agent orchestration pipeline that triggers &lt;strong&gt;21 LLM calls&lt;/strong&gt; in rapid succession for a single button click?&lt;/p&gt;

&lt;p&gt;If you are using the free tier of Google AI Studio, you can hit &lt;code&gt;429 RESOURCE_EXHAUSTED&lt;/code&gt; errors almost immediately.&lt;/p&gt;

&lt;p&gt;The bottleneck is not the tokens. It is the &lt;strong&gt;RPM (Requests Per Minute)&lt;/strong&gt;. &lt;/p&gt;




&lt;h1&gt;The Math: Why RPM Kills Multi-Agent Systems&lt;/h1&gt;

&lt;p&gt;VerdictAI X is not a standard chatbot; it is a multi-layered reasoning pipeline.&lt;/p&gt;

&lt;p&gt;When a user submits a dilemma, the system spins up five specialized agents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Strategist&lt;/li&gt;
&lt;li&gt;The Guardian&lt;/li&gt;
&lt;li&gt;The Visionary&lt;/li&gt;
&lt;li&gt;The Humanist&lt;/li&gt;
&lt;li&gt;The Contrarian&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A single user query requires the following behind the scenes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Initial Analysis: 5 requests
Debate Round 1 (Challenge): 5 requests
Debate Round 2 (Defend &amp;amp; Challenge): 5 requests
Debate Round 3 (Defend): 5 requests
Final Verdict Synthesis: 1 request

Total = 21 LLM requests per user click
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That creates a real problem for free-tier usage, because the primary model may allow only around 15 RPM. One user query can already exceed that ceiling, even when token usage is still well under the TPM limit. &lt;/p&gt;




&lt;h1&gt;The Solution: Dynamic Fallback Queue&lt;/h1&gt;

&lt;p&gt;Instead of hardcoding a single model, I built a &lt;strong&gt;fallback queue&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The idea was simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Try the primary model first&lt;/li&gt;
&lt;li&gt;If it hits a rate limit, move to the next model&lt;/li&gt;
&lt;li&gt;Keep retrying until one succeeds&lt;/li&gt;
&lt;li&gt;Show a small system notice in the UI when switching models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This way, the app can keep streaming responses instead of crashing on a 429 error. &lt;/p&gt;




&lt;h1&gt;Core Failover Logic&lt;/h1&gt;

&lt;p&gt;Here is the architecture powering the automatic model switching inside &lt;code&gt;gemini_client.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;types&lt;/span&gt;

&lt;span class="n"&gt;FALLBACK_MODELS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-flash-lite-preview&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma-4-31b-it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma-4-26b-a4b-it&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_model_queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_pro&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Returns a list of models to try in order.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;primary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;use_pro&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;FALLBACK_MODELS&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;use_pro&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Streams a response with automatic failover to fallback models.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;models_to_try&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_get_model_queue&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;use_pro&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models_to_try&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;final_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_build_config_and_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;br&amp;gt;&amp;lt;span style=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;color:#fbbf24; font-size:10px;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;[System: Primary RPM limit reached. Switching to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...]&amp;lt;/span&amp;gt;&amp;lt;br&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;final_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;

            &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;429&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RESOURCE_EXHAUSTED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models_to_try&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;continue&lt;/span&gt;
                &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;span style=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;color:#f43f5e; font-weight:600;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;System overloaded. All backup models are currently busy. Please try again in a few minutes.&amp;lt;/span&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

            &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;internal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
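
&lt;p&gt;One thing the snippet above does not show is &lt;code&gt;_build_config_and_prompt&lt;/code&gt;. Here is a minimal sketch of what such a helper could look like, assuming the &lt;code&gt;google-genai&lt;/code&gt; SDK's &lt;code&gt;GenerateContentConfig&lt;/code&gt;; the Gemma special-casing is my guess at why the function returns both a config and a rewritten prompt, not the actual VerdictAI X code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def _build_config_and_prompt(model: str, prompt: str, system_prompt: str):
    """Sketch only: the article does not include this helper.

    Gemma models do not take a separate system instruction, so one
    plausible design folds the system prompt into the user prompt there.
    """
    if model.startswith("gemma") and system_prompt:
        return types.GenerateContentConfig(), f"{system_prompt}\n\n{prompt}"
    return types.GenerateContentConfig(system_instruction=system_prompt or None), prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;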






&lt;h1&gt;What This Actually Bought Me&lt;/h1&gt;

&lt;p&gt;When the primary model hits its RPM limit, &lt;code&gt;generate_stream()&lt;/code&gt; catches the &lt;code&gt;429&lt;/code&gt; error, skips to the next model, and retries the same prompt.&lt;/p&gt;

&lt;p&gt;Because the fallback happens inside the streaming loop, the UI can show a tiny notice like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[System: Primary RPM limit reached. Switching to gemma-4-31b-it...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user does not get an ugly error screen. They just keep seeing the response stream normally. &lt;/p&gt;
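
&lt;p&gt;For context, here is roughly how a caller consumes the generator. This is an illustrative sketch rather than the actual orchestration code; the persona and dilemma strings are made up:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical caller: the strings below are illustrative, not from the app.
agent_persona = "You are The Contrarian. Attack the consensus view."
dilemma = "Should I quit my job to go all-in on my startup?"

parts = []
for token in generate_stream(dilemma, system_prompt=agent_persona):
    parts.append(token)  # or push each token straight to the UI stream

full_response = "".join(parts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;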




&lt;h1&gt;Why I Am Writing About This&lt;/h1&gt;

&lt;p&gt;Most tutorials end at the point where one LLM call works.&lt;/p&gt;

&lt;p&gt;But if you want to build complex, multi-agent AI applications, &lt;strong&gt;Requests Per Minute&lt;/strong&gt; limits are one of the first real architectural hurdles you will face.&lt;/p&gt;

&lt;p&gt;You do not always need to upgrade to a paid tier immediately. Sometimes the better solution is to design your system to fail gracefully and take advantage of the available model ecosystem. &lt;/p&gt;




&lt;h1&gt;Project Links&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: VerdictAI X repository [&lt;a href="https://github.com/A-Square8/VerdictAI-X" rel="noopener noreferrer"&gt;https://github.com/A-Square8/VerdictAI-X&lt;/a&gt;]&lt;/li&gt;
&lt;li&gt;LinkedIn: Ankit Ambasta [&lt;a href="https://www.linkedin.com/in/ankit-ambasta-4a58002b9/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/ankit-ambasta-4a58002b9/&lt;/a&gt;]&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>Why I Used SHA-256 to Solve a Problem Most RAG Tutorials Pretend Doesn't Exist</title>
      <dc:creator>ANKIT AMBASTA</dc:creator>
      <pubDate>Tue, 12 May 2026 16:18:27 +0000</pubDate>
      <link>https://dev.to/asquare8/why-i-used-sha-256-to-solve-a-problem-most-rag-tutorials-pretend-doesnt-exist-2gbc</link>
      <guid>https://dev.to/asquare8/why-i-used-sha-256-to-solve-a-problem-most-rag-tutorials-pretend-doesnt-exist-2gbc</guid>
      <description>&lt;p&gt;When I built GridMind — a fully offline RAG assistant designed to run on CPU-only hardware with under 4 GB of RAM — I ran into a problem that no LangChain tutorial ever warned me about.&lt;/p&gt;

&lt;p&gt;GridMind is a knowledge base assistant designed to work when there's no internet, no GPU, no cloud. Think disaster scenarios, remote areas, or a zombie apocalypse where the government is not coming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens when your knowledge base changes?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most RAG demos show you the happy path: chunk documents, embed them, store vectors, query. Done. But they quietly skip the part where your source documents get updated, corrected, or extended. Because if you follow the naive approach, the answer is painful: re-embed everything from scratch, every single time.&lt;/p&gt;

&lt;p&gt;For GridMind, that wasn't an option.&lt;/p&gt;




&lt;h2&gt;The Constraints That Forced Me to Think&lt;/h2&gt;

&lt;p&gt;GridMind's premise is that it works &lt;em&gt;when the grid fails&lt;/em&gt; — no internet, no GPU, no cloud. It runs on a Raspberry Pi-class machine using &lt;code&gt;nomic-embed-text&lt;/code&gt; for embeddings and &lt;code&gt;qwen2.5:3b&lt;/code&gt; via Ollama for inference.&lt;/p&gt;

&lt;p&gt;Embedding is the expensive step. On CPU, embedding a full knowledge base across 8 survival domains (water, shelter, medical, navigation, etc.) takes minutes. Re-running that every time I updated a markdown file was a non-starter.&lt;/p&gt;

&lt;p&gt;I needed a way to know, cheaply and reliably, exactly which documents had changed since the last index run — and only re-embed those.&lt;/p&gt;




&lt;h2&gt;The Solution: SHA-256 as a Change Fingerprint&lt;/h2&gt;

&lt;p&gt;The core idea is simple, but I didn't see it written about clearly anywhere, so I'll spell it out.&lt;/p&gt;

&lt;p&gt;Before embedding any document, compute its SHA-256 hash and store it alongside its vector in FAISS metadata. On the next indexing run, before calling the embedding model at all, hash the current file and compare it against the stored hash.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hash matches&lt;/strong&gt; → skip. The document hasn't changed. No embedding call made.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hash differs&lt;/strong&gt; → re-embed and update the stored hash.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New file (no hash stored)&lt;/strong&gt; → embed fresh and store the hash.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File deleted&lt;/strong&gt; → remove its vectors from the index.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hash_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sha256&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reading in 8 KB chunks matters — it keeps memory flat even for large documents.&lt;/p&gt;




&lt;h2&gt;Why SHA-256 Specifically?&lt;/h2&gt;

&lt;p&gt;A few alternatives I considered:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File modification timestamps (&lt;code&gt;mtime&lt;/code&gt;)&lt;/strong&gt; — Fast, but unreliable. Copying a file, running a deployment script, or touching a file changes &lt;code&gt;mtime&lt;/code&gt; without changing content. You'd re-embed files that didn't need it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File size&lt;/strong&gt; — Even faster, even less reliable. A one-character edit to a 10 KB file changes content but not size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MD5&lt;/strong&gt; — Would work fine here. SHA-256 is marginally slower, but the difference at this scale is microseconds. I used it because it's the standard I'm used to reaching for, and collision resistance, while overkill for this use case, costs nothing.&lt;/p&gt;
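
&lt;p&gt;If you want to sanity-check that cost claim on your own machine, a quick micro-benchmark (illustrative only; numbers vary by hardware) looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
import timeit

# 1 MB of dummy bytes: far larger than a typical markdown document.
data = b"x" * (1024 * 1024)

for algo in ("md5", "sha256"):
    t = timeit.timeit(lambda: hashlib.new(algo, data).hexdigest(), number=100)
    print(f"{algo}: {t / 100 * 1000:.2f} ms per MB")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;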




&lt;h2&gt;The Index Store Structure&lt;/h2&gt;

&lt;p&gt;I kept a simple JSON manifest alongside the FAISS index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"documents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"data/water/purification.md"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a3f5c2d1..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"vector_ids"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"indexed_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-11-14T10:22:00"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"data/medical/wound-care.md"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"9b8e1f44..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"vector_ids"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"indexed_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-11-14T10:22:01"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tracking &lt;code&gt;vector_ids&lt;/code&gt; per document is what makes deletion and update clean — when a file changes, you know exactly which FAISS vectors to remove before inserting the new ones.&lt;/p&gt;
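
&lt;p&gt;Putting the manifest and &lt;code&gt;hash_file&lt;/code&gt; together, the decision loop looks roughly like this. This is a simplified sketch of the idea, not GridMind's exact indexer; &lt;code&gt;embed_and_store&lt;/code&gt; and &lt;code&gt;remove_vectors&lt;/code&gt; are placeholders for the FAISS plumbing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import glob
import json

def sync_index(manifest_path: str = "manifest.json"):
    with open(manifest_path) as f:
        manifest = json.load(f)
    docs = manifest["documents"]

    current_files = set(glob.glob("data/**/*.md", recursive=True))

    # Deleted files: drop their vectors from the index.
    for path in set(docs) - current_files:
        remove_vectors(docs[path]["vector_ids"])  # placeholder
        del docs[path]

    # New or changed files: re-embed only those.
    for path in sorted(current_files):
        digest = hash_file(path)
        entry = docs.get(path)
        if entry and entry["hash"] == digest:
            continue  # unchanged: no embedding call at all
        if entry:
            remove_vectors(entry["vector_ids"])  # stale vectors out first
        docs[path] = {"hash": digest, "vector_ids": embed_and_store(path)}

    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;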




&lt;h2&gt;What This Actually Bought Me&lt;/h2&gt;

&lt;p&gt;On a knowledge base update where I corrected two markdown files and added one new one, the indexer processed 3 files instead of 47. Embedding time dropped from ~6 minutes to ~40 seconds on the test machine.&lt;/p&gt;

&lt;p&gt;More importantly, it made iteration &lt;em&gt;feel&lt;/em&gt; fast. When you're building a local-first tool and testing knowledge base changes, waiting 6 minutes per cycle kills momentum. Forty seconds doesn't.&lt;/p&gt;




&lt;h2&gt;The Honest Limitations&lt;/h2&gt;

&lt;p&gt;This approach has real tradeoffs I want to be upfront about:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FAISS's basic flat indexes don't support in-place deletion.&lt;/strong&gt; To "remove" old vectors, I rebuild the index from the non-deleted vectors. For 47 documents this is fast. At 10,000 documents it would become the bottleneck. A production system would reach for something like Qdrant or Weaviate that supports vector-level deletes natively.&lt;/p&gt;
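
&lt;p&gt;For the curious, that rebuild amounts to something like the following (a sketch assuming a flat index; note that positions shift after a rebuild, so the manifest's &lt;code&gt;vector_ids&lt;/code&gt; have to be rewritten as well):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import faiss
import numpy as np

def rebuild_index(old_index, keep_ids):
    """Copy only the surviving vectors into a fresh flat index."""
    kept = np.vstack([old_index.reconstruct(int(i)) for i in keep_ids])
    new_index = faiss.IndexFlatL2(old_index.d)
    new_index.add(kept)
    return new_index  # positions 0..n-1 now correspond to keep_ids in order
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;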

&lt;p&gt;&lt;strong&gt;The manifest is a single JSON file with no locking.&lt;/strong&gt; If two indexing processes ran simultaneously (they don't in GridMind, but still), you'd get corruption. A proper solution uses SQLite or file-level locking.&lt;/p&gt;
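
&lt;p&gt;If you ever did need to guard it, even a crude advisory lock would do. A POSIX-only sketch, not something GridMind ships:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import fcntl
import json

def save_manifest(path: str, manifest: dict):
    # Serialize writers through a sidecar lock file so that opening the
    # manifest with "w" never truncates it while another process writes.
    with open(path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        with open(path, "w") as f:
            json.dump(manifest, f, indent=2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;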

&lt;p&gt;&lt;strong&gt;SHA-256 hashes content, not semantics.&lt;/strong&gt; If I rename a section header in a document, the hash changes and it re-embeds — even though the semantic content barely changed. That's probably the right behavior, but it's worth knowing.&lt;/p&gt;




&lt;h2&gt;Why I'm Writing About This&lt;/h2&gt;

&lt;p&gt;Because the RAG tutorials that got me started all ended at step 3. They showed me how to build something that works once, in a clean demo environment, with a static knowledge base.&lt;/p&gt;

&lt;p&gt;Real systems have messy, evolving data. If you're building anything beyond a proof-of-concept, you'll hit this problem. I spent a day thinking through the right approach before I wrote a line of code, and I think that day was worth it.&lt;/p&gt;

&lt;p&gt;GridMind is open source. If you're building something offline-first or resource-constrained, the indexer code is in the repo — feel free to use or adapt it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;GitHub → [&lt;a href="https://github.com/A-Square8/GRIDMIND-Intelligence-When-the-Grid-Fails" rel="noopener noreferrer"&gt;https://github.com/A-Square8/GRIDMIND-Intelligence-When-the-Grid-Fails&lt;/a&gt;] | LinkedIn → [&lt;a href="https://www.linkedin.com/in/ankit-ambasta-4a58002b9" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/ankit-ambasta-4a58002b9&lt;/a&gt;]&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
