WonderLab

Posted on May 18

RAG Series (19): Incremental Updates — Keeping the Knowledge Base Fresh

#ai #ragas #rag #langchain

Knowledge Bases Are Not Static

Every article in this series so far has shared one implicit assumption: documents are loaded once, and the index never changes.

Production doesn't work like that.

Product documentation updates weekly. Knowledge base articles are added daily. Outdated content gets retired. Every time something changes, you face a choice:

Option A: Full rebuild

Re-embed every document — including the ones that didn't change — and rebuild the entire vector index from scratch. Simple to implement. Expensive to run:

You pay for embedding every document every time
1000 documents, 5 changed: still 1000 embed calls
Rebuild time grows proportionally to corpus size, not change size

Option B: Incremental update

Store a content hash for each indexed document. On the next sync, only process the documents whose hash changed — embed the new ones, replace the modified ones, clean up the deleted ones, skip everything else.

LangChain's Indexing API implements Option B.
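
The hash comparison behind Option B can be sketched with the standard library alone. This is a minimal illustration, not LangChain's implementation: `content_hash` and `classify_changes` are hypothetical names, SHA-256 over the raw text is an assumed choice of hash, and the three sample documents are made up.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a stable fingerprint of a document's text (assumed: SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def classify_changes(stored: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes (source -> hash) with current texts (source -> text)."""
    changes = {"unchanged": [], "modified": [], "added": [], "deleted": []}
    for source, text in current.items():
        if source not in stored:
            changes["added"].append(source)
        elif stored[source] == content_hash(text):
            changes["unchanged"].append(source)
        else:
            changes["modified"].append(source)
    # Anything indexed earlier that is missing from the current batch was removed.
    changes["deleted"] = [s for s in stored if s not in current]
    return changes


previous_texts = {"a": "alpha", "b": "beta", "c": "gamma"}
stored_hashes = {s: content_hash(t) for s, t in previous_texts.items()}
current_texts = {"a": "alpha", "b": "beta v2", "d": "delta"}

changes = classify_changes(stored_hashes, current_texts)
print(changes)
```

Only the sources listed under "modified" and "added" would need an embedding call; "unchanged" costs nothing beyond the hash comparison.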

How the Indexing API Works

Two components:

SQLRecordManager: A record manager backed by a SQL database (SQLite in this article). It stores one record per indexed document, conceptually like this:

source         |  content_hash           |  indexed_at
rag-intro      |  a3f8b2c1...            |  2026-05-15 10:00
ragas          |  d9e2f1a4...            |  2026-05-15 10:00
vector-db      |  7c4b8e3f...            |  2026-05-15 10:00
...

index() function: Compares the current document batch against the RecordManager and decides what happens to each document:

For each document in the batch:
  Hash matches   → skip (num_skipped++)
  Hash differs   → delete old version, insert new (num_deleted++, num_added++)
  First time     → insert (num_added++)

After processing all documents (cleanup="full"):
  In RecordManager but not in batch → delete (num_deleted++)

cleanup="full" handles the deletion case. Without it, documents that were removed from your knowledge base continue to live in the vector store and show up in retrieval results — stale content, indefinitely.
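
The decision table above can be simulated in memory. The sketch below is a toy model, not LangChain's code: `ToyIndex` is a hypothetical class that keys each record by a hash of source plus content and stamps it with the run that last saw it, so a full cleanup can delete whatever the current run did not touch. The batches use placeholder texts: six documents first, then a second batch with two modified, one removed, and two new.

```python
import hashlib


class ToyIndex:
    """In-memory stand-in for a record manager plus vector store (illustrative only)."""

    def __init__(self) -> None:
        self.last_seen: dict[str, int] = {}  # record key -> run number that last saw it
        self.run = 0

    @staticmethod
    def _key(source: str, text: str) -> str:
        return hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()

    def sync(self, docs: dict[str, str], cleanup_full: bool = True) -> dict[str, int]:
        self.run += 1
        counts = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
        for source, text in docs.items():
            key = self._key(source, text)
            if key in self.last_seen:
                counts["num_skipped"] += 1  # same hash: no embedding needed
            else:
                counts["num_added"] += 1  # new or modified: would be embedded here
            self.last_seen[key] = self.run
        if cleanup_full:
            # Records not touched in this run are stale: old versions or removed docs.
            stale = [k for k, seen in self.last_seen.items() if seen < self.run]
            for key in stale:
                del self.last_seen[key]
            counts["num_deleted"] = len(stale)
        return counts


v1 = {name: f"{name} v1" for name in
      ["rag-intro", "ragas", "vector-db", "embedding", "rerank", "chunking"]}
v2 = {**v1, "ragas": "ragas v2", "chunking": "chunking v2",
      "advanced-rag": "advanced-rag v1", "conv-rag": "conv-rag v1"}
del v2["embedding"]

toy = ToyIndex()
first = toy.sync(v1)
second = toy.sync(v2)
print(first)
print(second)
```

In this toy model a modified document counts once as an addition and once as a deletion, which is why the deleted count can exceed the number of documents actually removed from the knowledge base.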

Core Implementation

RecordManager Setup

from langchain_classic.indexes import SQLRecordManager, index

NAMESPACE = "chroma/rag_knowledge_base"

record_manager = SQLRecordManager(
    NAMESPACE,
    db_url="sqlite:///record_manager.db",
)
record_manager.create_schema()   # create tables on first run

NAMESPACE acts as a partition key. One SQLite file can manage multiple independent knowledge bases without interference.
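
To see why a namespace keeps knowledge bases apart, consider a single table where every record carries its namespace. The sketch below uses the standard `sqlite3` module with a made-up `records` table; the column names and the second namespace (`chroma/support_faq`) are assumptions for illustration, not the schema SQLRecordManager actually creates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    " namespace TEXT NOT NULL,"
    " key TEXT NOT NULL,"
    " PRIMARY KEY (namespace, key))"
)

rows = [
    ("chroma/rag_knowledge_base", "hash-of-rag-intro"),
    ("chroma/rag_knowledge_base", "hash-of-ragas"),
    ("chroma/support_faq", "hash-of-rag-intro"),  # same key, different namespace
]
conn.executemany("INSERT INTO records VALUES (?, ?)", rows)


def keys_in(namespace: str) -> list[str]:
    """List the record keys that belong to one namespace only."""
    cursor = conn.execute(
        "SELECT key FROM records WHERE namespace = ? ORDER BY key", (namespace,)
    )
    return [row[0] for row in cursor]


print(keys_in("chroma/rag_knowledge_base"))
print(keys_in("chroma/support_faq"))
```

Because every lookup is filtered by namespace, a cleanup pass over one knowledge base never sees the other's records, even when both share a file.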

The Sync Function

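from langchain_core.documents import Document

# Assumes `vectorstore` is an already-initialized vector store and
# `record_manager` is the SQLRecordManager created above.
def sync_knowledge_base(docs: list[Document]) -> dict: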
    """Incrementally sync a document batch into the vector store.

    - Unchanged documents: skipped (no embedding API call)
    - New / modified documents: embedded and written
    - Removed documents: deleted from the vector store
    """
    return index(
        docs,
        record_manager,
        vectorstore,
        cleanup="full",          # auto-remove docs not in this batch
        source_id_key="source",  # metadata["source"] identifies each document
    )

source_id_key is the document identity key. Two documents with the same source are treated as different versions of the same document. If content changes, the old version is deleted and the new version is added.

Documents Must Have a `source` Field

Document(
    page_content="...",
    metadata={"source": "rag-intro"},   # required for version tracking
)

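The source value is what ties successive versions of a document together. Documents without one can't be matched to their earlier versions by source, and LangChain's cleanup="incremental" mode rejects them outright.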
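
Because the source value carries document identity, a sync job can check for it before calling index(). The sketch below uses a hypothetical `Doc` dataclass as a stand-in for LangChain's Document so that it runs without dependencies; `missing_source` is likewise an invented helper, not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def missing_source(docs: list[Doc], key: str = "source") -> list[int]:
    """Return the positions of documents whose metadata lacks a usable source id."""
    return [i for i, doc in enumerate(docs) if not doc.metadata.get(key)]


batch = [
    Doc("RAG overview...", {"source": "rag-intro"}),
    Doc("Orphan text with no identity"),
    Doc("Empty source string", {"source": ""}),
]

bad = missing_source(batch)
if bad:
    print(f"Refusing to sync: documents at positions {bad} have no source")
```

Failing fast here is a design choice: it surfaces a loader bug at sync time rather than later, as unexplained duplicates or stale chunks in retrieval.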

Experiment: Three Sync Rounds

Dataset Design

V1 (initial knowledge base, 6 documents):
rag-intro, ragas, vector-db, embedding, rerank, chunking

V2 (simulated update cycle):

Change Type  |  Document                      |  Description
Unchanged    |  rag-intro, vector-db, rerank  |  Identical content
Modified     |  ragas                         |  Added faithfulness explanation
Modified     |  chunking                      |  Added contextual retrieval section
Deleted      |  embedding                     |  Not present in V2 batch
Added        |  advanced-rag                  |  New document
Added        |  conv-rag                      |  New document

V1 → V2: 3 unchanged, 2 modified, 1 deleted, 2 added.

Results

======================================================================
  Scenario 1: Initial Index (V1 — 6 documents)
======================================================================

  [Initial Index]
  ┌─────────────────────────────────────────┐
  │  added:       6  (newly embedded)       │
  │  skipped:     0  (content unchanged)    │
  │  deleted:     0  (removed/replaced)     │
  ├─────────────────────────────────────────┤
  │  embed calls:    6                      │
  │  wall time:   0.913s                    │
  └─────────────────────────────────────────┘

======================================================================
  Scenario 2: Incremental Update (V2)
======================================================================

  [Incremental Update]
  ┌─────────────────────────────────────────┐
  │  added:       4  (newly embedded)       │
  │  skipped:     3  (content unchanged)    │
  │  deleted:     3  (removed/replaced)     │
  ├─────────────────────────────────────────┤
  │  embed calls:    4                      │
  │  wall time:   0.891s                    │
  └─────────────────────────────────────────┘

======================================================================
  Scenario 3: Full Rebuild (V2, record manager wiped)
======================================================================

  [Full Rebuild]
  ┌─────────────────────────────────────────┐
  │  added:       7  (newly embedded)       │
  │  skipped:     0  (content unchanged)    │
  │  deleted:     0  (removed/replaced)     │
  ├─────────────────────────────────────────┤
  │  embed calls:    7                      │
  │  wall time:   0.494s                    │
  └─────────────────────────────────────────┘

Cost Comparison

  ┌──────────────────────┬───────────────┬───────────────┐
  │                      │   Incremental │  Full Rebuild │
  ├──────────────────────┼───────────────┼───────────────┤
  │  Documents embedded  │       4       │       7       │
  │  Documents skipped   │       3       │       0       │
  │  Embedding savings   │    42.9%      │     0%        │
  └──────────────────────┴───────────────┴───────────────┘

Incremental update triggered 4 embed calls; full rebuild triggered 7. That's 42.9% fewer API calls for the same end state.
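
The embed-call counts above can be collected by wrapping whatever embedding function is in use. The repository's own counting wrapper is not reproduced in this article, so the sketch below is a hypothetical version: `CountingEmbedder`, `fake_embed`, and `savings_pct` are invented names, and the fake embedder returns dummy vectors instead of calling a model.

```python
from typing import Callable


class CountingEmbedder:
    """Wrap an embedding function and count how many texts pass through it."""

    def __init__(self, embed_fn: Callable[[list[str]], list[list[float]]]) -> None:
        self._embed_fn = embed_fn
        self.calls = 0

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        self.calls += len(texts)  # one count per document embedded
        return self._embed_fn(texts)


def fake_embed(texts: list[str]) -> list[list[float]]:
    """Placeholder embedder: a 1-dimensional 'vector' holding the text length."""
    return [[float(len(t))] for t in texts]


def savings_pct(embedded: int, total: int) -> float:
    """Share of embed calls avoided relative to embedding every document."""
    return round(100 * (total - embedded) / total, 1)


counter = CountingEmbedder(fake_embed)
counter.embed_documents(["ragas v2", "chunking v2", "advanced-rag", "conv-rag"])
print(counter.calls)                  # 4
print(savings_pct(counter.calls, 7))  # 42.9
```

Counting per document rather than per HTTP request keeps the number comparable across batch sizes, which is the quantity the cost comparison cares about.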

An Honest Look at the Timing Results

Full rebuild was actually faster (0.494s vs 0.891s). This deserves an explanation.

With 7 small documents, the bookkeeping costs more than it saves. The incremental run performs hash lookups, record updates, and three deletions, all synchronous local I/O, while skipping only 3 embed calls. Embedding calls go out as batched async HTTP requests, so their latency is dominated by the network round-trip rather than by the number of documents; dropping 3 of 7 documents from a batch barely shortens it.

This reverses quickly at realistic scale:

Scenario: 1,000-document knowledge base, 5% daily change rate (50 docs)

Full rebuild:    1,000 embed calls per day
Incremental:        50 embed calls per day  →  95% reduction

At $0.0001 per embed call (the figure assumed here for bge-large-zh-v1.5, avg 200 tokens/doc):
  Full rebuild:   ~$0.10/day   (1,000 × $0.0001)
  Incremental:    ~$0.005/day  (50 × $0.0001)

At 10,000 documents:
  Full rebuild:   ~$1.00/day
  Incremental:    ~$0.05/day

The time savings at small scale are not meaningful. The reduction in embedding calls, and therefore in API cost, applies from the first sync, and the absolute savings grow with corpus size.
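
The scaling argument can be checked with a few lines of arithmetic. In this sketch, `project_daily_calls` and `daily_cost` are hypothetical helpers; the per-call price is left as a parameter because actual pricing depends on the provider and model.

```python
def project_daily_calls(corpus_size: int, daily_change_rate: float) -> dict[str, float]:
    """Compare embed calls per day for a full rebuild versus an incremental sync."""
    full = corpus_size
    incremental = round(corpus_size * daily_change_rate)
    return {
        "full_rebuild_calls": full,
        "incremental_calls": incremental,
        "reduction_pct": round(100 * (full - incremental) / full, 1),
    }


def daily_cost(calls: int, price_per_call: float) -> float:
    """Cost of a day's embed calls at an assumed flat price per call."""
    return calls * price_per_call


for size in (1_000, 10_000):
    print(size, project_daily_calls(size, 0.05))
```

The reduction percentage depends only on the change rate, not on corpus size; what grows with the corpus is the absolute number of calls avoided.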

Two Kinds of Deletion

deleted: 3 in the incremental result contains two different things:

Replacement deletion (2): ragas and chunking changed content. The old versions are deleted from the vector store; new versions are embedded and inserted. Net document count for these sources: unchanged.
Cleanup deletion (1): the embedding document was not in the V2 batch at all. With cleanup="full", after processing all documents in the batch, the indexer checks the RecordManager for any source not seen in this run — finds embedding, and removes it.

If you used cleanup=None instead:

# Not recommended: stale documents accumulate
index(docs, record_manager, vectorstore, cleanup=None)

# Recommended: full sync, stale content auto-removed
index(docs, record_manager, vectorstore, cleanup="full", source_id_key="source")

Without cleanup, embedding stays in the vector store indefinitely. Users querying about embedding models would still get answers based on the retired document. This is the "ghost document" problem — one of the more insidious production bugs in RAG systems, because it's invisible until someone notices the answers reference content that no longer exists.
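
A periodic audit can make ghost documents visible before a user does. The sketch below assumes the set of sources currently in the vector store can be obtained somehow (how depends on the store, so it is passed in as a plain set), and `find_ghost_sources` is a hypothetical helper.

```python
def find_ghost_sources(indexed_sources: set[str], current_sources: set[str]) -> list[str]:
    """Sources still present in the index but absent from the knowledge base."""
    return sorted(indexed_sources - current_sources)


# What the vector store would hold after a V2 sync that never cleaned up.
indexed = {"rag-intro", "ragas", "vector-db", "embedding", "rerank", "chunking",
           "advanced-rag", "conv-rag"}
# What the knowledge base actually contains in V2.
current = {"rag-intro", "ragas", "vector-db", "rerank", "chunking",
           "advanced-rag", "conv-rag"}

print(find_ghost_sources(indexed, current))  # ['embedding']
```

An empty result is the expected state when every sync runs with full cleanup; anything else points to a sync that skipped the deletion step.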

Production Integration Pattern

In practice, incremental updates are triggered by a scheduled job or a document change event:

import glob

from langchain_core.documents import Document

def load_documents_from_dir(docs_dir: str) -> list[Document]:
    """Load documents from the filesystem, using file path as source."""
    docs = []
    for filepath in glob.glob(f"{docs_dir}/**/*.md", recursive=True):
        with open(filepath, encoding="utf-8") as f:
            content = f.read()
        docs.append(Document(
            page_content=content,
            metadata={"source": filepath},
        ))
    return docs

# Scheduled job: sync every hour
def hourly_sync():
    docs = load_documents_from_dir("./knowledge_base")
    result = index(
        docs,
        record_manager,
        vectorstore,
        cleanup="full",
        source_id_key="source",
    )
    print(
        f"Sync done: +{result['num_added']} added  "
        f"~{result['num_deleted']} deleted  "
        f"={result['num_skipped']} skipped"
    )

File path as source is naturally unique. File content changes automatically invalidate the stored hash, triggering re-embedding on the next sync. No extra tracking code needed.
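
One caveat with raw file paths is that the same file can be spelled several ways ("./knowledge_base/..." versus an absolute path), and each spelling is a different source string. A possible refinement, sketched here with a hypothetical `stable_source_id` helper and a made-up file name, is to store the path relative to the knowledge-base root.

```python
from pathlib import Path


def stable_source_id(filepath: str, docs_dir: str) -> str:
    """Return the file's path relative to the knowledge-base root, using '/'."""
    root = Path(docs_dir).resolve()
    return Path(filepath).resolve().relative_to(root).as_posix()


# The same file reached through two spellings of the root directory.
a = stable_source_id("./knowledge_base/guides/intro.md", "./knowledge_base")
b = stable_source_id("knowledge_base/guides/intro.md", "knowledge_base/")
print(a, b)
```

Relative ids stay the same when the job moves to another machine or working directory, so a redeploy does not make every document look new.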

RecordManager Persistence

SQLRecordManager persists to disk, so the hash registry survives service restarts. For production:

# Development / single machine
record_manager = SQLRecordManager(
    "namespace",
    db_url="sqlite:///record_manager.db",
)

# Production / distributed (multiple service instances share one registry)
record_manager = SQLRecordManager(
    "namespace",
    db_url="postgresql://user:pass@host/dbname",
)

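SQLite works for single-instance deployments. Switch to PostgreSQL when multiple instances need to share the same RecordManager: a local SQLite file is not visible to other machines and handles concurrent writers poorly, so instances would either keep diverging registries or contend for the write lock.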
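
One way to keep a single code path for both setups is to read the database URL from the environment. In this sketch the variable name `RECORD_MANAGER_DB_URL` and the helper `record_manager_db_url` are assumptions, not LangChain conventions; the returned string would be passed as db_url.

```python
import os
from typing import Mapping, Optional

DEFAULT_DB_URL = "sqlite:///record_manager.db"


def record_manager_db_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the record-manager database URL from the environment, else SQLite."""
    env = os.environ if env is None else env
    return env.get("RECORD_MANAGER_DB_URL", DEFAULT_DB_URL)


# Development: nothing set, so the local SQLite file is used.
print(record_manager_db_url({}))
# Production: the deployment sets the variable to a shared PostgreSQL instance.
print(record_manager_db_url({"RECORD_MANAGER_DB_URL": "postgresql://user:pass@host/dbname"}))
```

Keeping the credentials out of the source file is a side benefit; the same `.env` mechanism the repository already uses could carry the value.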

Full Code

Complete code is open-sourced at:

https://github.com/chendongqi/llm-in-action/tree/main/19-incremental-update

Key file:

incremental_update.py — three sync scenarios, counting wrapper, cost comparison, query verification

How to run:

git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/19-incremental-update
cp .env.example .env
pip install -r requirements.txt
python incremental_update.py

Summary

This article implemented incremental knowledge base updates using LangChain's Indexing API. Key findings:

Content hash tracking is the mechanism: RecordManager stores a hash for each document; unchanged → skip, modified → delete old + insert new, deleted → cleanup removes it
42.9% embedding reduction: of 7 documents, 3 were unchanged, so only 4 embed calls were needed instead of 7. The ratio improves as the corpus grows and the change rate falls
Wall time savings don't show at small scale: SQLite hash lookup overhead dominates at 7 documents; time savings are expected, though not measured here, once the corpus reaches 1,000+ documents
cleanup="full" prevents ghost documents: without it, deleted documents stay in the vector store indefinitely and keep appearing in retrieval results

Incremental updates are the step that moves RAG from "demo that works" to "production system that stays correct." A knowledge base is not a one-time artifact — it needs to evolve alongside the business that depends on it.
