ANKIT AMBASTA
Why I Used SHA-256 to Solve a Problem Most RAG Tutorials Pretend Doesn't Exist

When I built GridMind — a fully offline RAG assistant designed to run on CPU-only hardware with under 4 GB of RAM — I ran into a problem that no LangChain tutorial ever warned me about.

GridMind is a knowledge base assistant designed to work when there's no internet, no GPU, and no cloud. Think disaster scenarios, remote areas, or a zombie apocalypse where the government isn't coming.

What happens when your knowledge base changes?

Most RAG demos show you the happy path: chunk documents, embed them, store vectors, query. Done. But they quietly skip the part where your source documents get updated, corrected, or extended. Because if you follow the naive approach, the answer is painful: re-embed everything from scratch, every single time.

For GridMind, that wasn't an option.


The Constraints That Forced Me to Think

GridMind's premise is that it works when the grid fails — no internet, no GPU, no cloud. It runs on a Raspberry Pi class machine using nomic-embed-text for embeddings and qwen2.5:3b via Ollama for inference.

Embedding is the expensive step. On CPU, embedding a full knowledge base across 8 survival domains (water, shelter, medical, navigation, etc.) takes minutes. Re-running that every time I updated a markdown file was a non-starter.

I needed a way to know, cheaply and reliably, exactly which documents had changed since the last index run — and only re-embed those.


The Solution: SHA-256 as a Change Fingerprint

The core idea is simple but I didn't see it written about clearly anywhere, so I'll spell it out.

Before embedding any document, compute its SHA-256 hash and store it alongside its vector in FAISS metadata. On the next indexing run, before calling the embedding model at all, hash the current file and compare it against the stored hash.

  • Hash matches → skip. The document hasn't changed. No embedding call made.
  • Hash differs → re-embed and update the stored hash.
  • New file (no hash stored) → embed fresh and store the hash.
  • File deleted → remove its vectors from the index.
import hashlib

def hash_file(filepath: str) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    sha256 = hashlib.sha256()
    with open(filepath, "rb") as f:
        # Stream the file in 8 KB chunks so memory stays flat for large documents.
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    return sha256.hexdigest()

Reading in 8 KB chunks matters — it keeps memory flat even for large documents.
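
To make the skip/re-embed decision concrete, here's a minimal sketch of the check that runs before any embedding call. The data/ directory layout and the in-memory stored_hashes dict are placeholders for illustration; in GridMind the hashes live in the manifest described in the next section.

from pathlib import Path

stored_hashes = {}  # path -> SHA-256 hex digest recorded on the previous indexing run

for md_file in Path("data").rglob("*.md"):
    path = str(md_file)
    current = hash_file(path)
    if stored_hashes.get(path) == current:
        continue  # unchanged: skip, no embedding call made
    # new or modified: this is where the expensive chunk-and-embed step would run
    stored_hashes[path] = current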


Why SHA-256 Specifically?

A few alternatives I considered:

File modification timestamps (mtime) — Fast, but unreliable. Copying a file, running a deployment script, or touching a file changes mtime without changing content. You'd re-embed files that didn't need it.

File size — Even faster, even less reliable. A one-character edit to a 10 KB file changes content but not size.

MD5 — Would work fine here. SHA-256 is marginally slower, but at this scale the difference is microseconds. I used it because it's the standard I reach for by default, and its collision resistance, while overkill for this use case, costs nothing.


The Index Store Structure

I kept a simple JSON manifest alongside the FAISS index:

{
  "documents": {
    "data/water/purification.md": {
      "hash": "a3f5c2d1...",
      "vector_ids": [0, 1, 2, 3],
      "indexed_at": "2024-11-14T10:22:00"
    },
    "data/medical/wound-care.md": {
      "hash": "9b8e1f44...",
      "vector_ids": [4, 5, 6],
      "indexed_at": "2024-11-14T10:22:01"
    }
  }
}

Tracking vector_ids per document is what makes deletion and update clean — when a file changes, you know exactly which FAISS vectors to remove before inserting the new ones.
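
As a rough sketch of the update path, here's how a manifest entry gets rewritten when a file's hash no longer matches. The remove_vectors and embed_chunks helpers are hypothetical stand-ins for illustration, not GridMind's actual functions:

from datetime import datetime

def reindex_document(filepath: str, manifest: dict) -> None:
    """Re-embed a changed or new file and keep the manifest consistent."""
    entry = manifest["documents"].get(filepath)
    if entry:
        remove_vectors(entry["vector_ids"])  # drop the stale vectors first
    new_ids = embed_chunks(filepath)         # chunk + embed, returns the new FAISS ids
    manifest["documents"][filepath] = {
        "hash": hash_file(filepath),
        "vector_ids": new_ids,
        "indexed_at": datetime.now().isoformat(timespec="seconds"),
    }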


What This Actually Bought Me

On a knowledge base update where I corrected two markdown files and added one new one, the indexer processed 3 files instead of 47. Embedding time dropped from ~6 minutes to ~40 seconds on the test machine.

More importantly, it made iteration feel fast. When you're building a local-first tool and testing knowledge base changes, waiting 6 minutes per cycle kills momentum. 40 seconds doesn't.


The Honest Limitations

This approach has real tradeoffs I want to be upfront about:

FAISS doesn't natively support deletion. To "remove" old vectors, I rebuild the index from the non-deleted vectors. For 47 documents this is fast. At 10,000 documents it would become the bottleneck. A production system would reach for something like Qdrant or Weaviate that supports vector-level deletes natively.
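
For reference, the rebuild-on-"delete" pattern looks roughly like this for a flat index (a sketch, not GridMind's exact code): copy the surviving vectors into a fresh index and swap it in. It's this copy that turns into the bottleneck as the corpus grows.

import faiss
import numpy as np

def rebuild_index(old_index: faiss.IndexFlatL2, keep_ids: list[int]) -> faiss.IndexFlatL2:
    """'Delete' vectors by rebuilding the index from the ones we keep."""
    new_index = faiss.IndexFlatL2(old_index.d)
    if not keep_ids:
        return new_index  # nothing survives: return an empty index
    kept = np.vstack([old_index.reconstruct(i) for i in keep_ids]).astype("float32")
    new_index.add(kept)
    return new_index

Note that rebuilding renumbers the vectors, so the vector_ids stored in the manifest have to be rewritten in the same pass.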

The manifest is a single JSON file with no locking. If two indexing processes ran simultaneously (they don't in GridMind, but still), you'd get corruption. A proper solution uses SQLite or file-level locking.
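
A cheap partial mitigation, sketched below, is to write the manifest to a temporary file and atomically rename it into place. That guarantees nobody ever reads a half-written file, though it still doesn't coordinate two indexers racing each other, which is what SQLite or a real lock would buy you.

import json
import os
import tempfile

def save_manifest(path: str, manifest: dict) -> None:
    """Write the manifest via a temp file so readers never see a partial write."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(manifest, f, indent=2)
    os.replace(tmp, path)  # atomic rename: readers get the old manifest or the new one, never a mix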

SHA-256 hashes content, not semantics. If I rename a section header in a document, the hash changes and it re-embeds — even though the semantic content barely changed. That's probably the right behavior, but it's worth knowing.


Why I'm Writing About This

Because the RAG tutorials that got me started all ended at step 3. They showed me how to build something that works once, in a clean demo environment, with a static knowledge base.

Real systems have messy, evolving data. If you're building anything beyond a proof-of-concept, you'll hit this problem. I spent a day thinking through the right approach before I wrote a line of code, and I think that day was worth it.

GridMind is open source. If you're building something offline-first or resource-constrained, the indexer code is in the repo — feel free to use or adapt it.


GitHub → https://github.com/A-Square8/GRIDMIND-Intelligence-When-the-Grid-Fails | LinkedIn → https://www.linkedin.com/in/ankit-ambasta-4a58002b9
