Knowledge Bases Are Not Static
Every article in this series so far has shared one implicit assumption: documents are loaded once, and the index never changes.
Production doesn't work like that.
Product documentation updates weekly. Knowledge base articles are added daily. Outdated content gets retired. Every time something changes, you face a choice:
Option A: Full rebuild
Re-embed every document — including the ones that didn't change — and rebuild the entire vector index from scratch. Simple to implement. Expensive to run:
- You pay for embedding every document every time
- 1000 documents, 5 changed: still 1000 embed calls
- Rebuild time grows proportionally to corpus size, not change size
Option B: Incremental update
Store a content hash for each indexed document. On the next sync, only process the documents whose hash changed — embed the new ones, replace the modified ones, clean up the deleted ones, skip everything else.
LangChain's Indexing API implements Option B.
How the Indexing API Works
Two components:
SQLRecordManager: A SQL-backed registry (a SQLite file in this article) that stores a record for each indexed document. Conceptually:
source | content_hash | indexed_at
rag-intro | a3f8b2c1... | 2026-05-15 10:00
ragas | d9e2f1a4... | 2026-05-15 10:00
vector-db | 7c4b8e3f... | 2026-05-15 10:00
...
index() function: Compares the current document batch against the RecordManager and decides what happens to each document:
For each document in the batch:
    Hash matches → skip (num_skipped++)
    Hash differs → delete old version, insert new (num_deleted++, num_added++)
    First time → insert (num_added++)
After processing all documents (cleanup="full"):
    In RecordManager but not in batch → delete (num_deleted++)
cleanup="full" handles the deletion case. Without it, documents that were removed from your knowledge base continue to live in the vector store and show up in retrieval results — stale content, indefinitely.
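The decision rules above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: `content_hash`, `plan_sync`, and the `cleanup_full` flag are hypothetical names, and a dictionary stands in for the RecordManager.

```python
import hashlib


def content_hash(text: str) -> str:
    """Return a stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(stored: dict[str, str], batch: dict[str, str],
              cleanup_full: bool = True) -> dict[str, int]:
    """Count the actions a sync would take.

    stored: source -> content hash recorded at the previous sync
    batch:  source -> current document text
    """
    result = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
    for source, text in batch.items():
        new_hash = content_hash(text)
        old_hash = stored.get(source)
        if old_hash == new_hash:
            result["num_skipped"] += 1   # unchanged: no embedding needed
        elif old_hash is None:
            result["num_added"] += 1     # first time this source is seen
        else:
            result["num_deleted"] += 1   # modified: drop the old version
            result["num_added"] += 1     # and embed the new one
    if cleanup_full:
        # Sources recorded earlier but absent from this batch were removed upstream.
        result["num_deleted"] += len(stored.keys() - batch.keys())
    return result


previous = {"a": "alpha", "b": "beta", "c": "gamma"}
stored = {source: content_hash(text) for source, text in previous.items()}
current = {"a": "alpha", "b": "beta v2", "d": "delta"}
print(plan_sync(stored, current))
# {'num_added': 2, 'num_skipped': 1, 'num_deleted': 2}
```

Here `a` is skipped, `b` is replaced (one delete, one add), `d` is new, and `c` is removed by the cleanup pass.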
Core Implementation
RecordManager Setup
from langchain_classic.indexes import SQLRecordManager, index
NAMESPACE = "chroma/rag_knowledge_base"
record_manager = SQLRecordManager(
    NAMESPACE,
    db_url="sqlite:///record_manager.db",
)
record_manager.create_schema()  # create tables on first run
NAMESPACE acts as a partition key. One SQLite file can manage multiple independent knowledge bases without interference.
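To make the partition-key idea concrete, here is a conceptual sketch using only the standard library's sqlite3 module. The `records` table, its columns, and the helper functions are assumptions made for illustration; they are not the schema SQLRecordManager actually creates.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE records (
        namespace    TEXT NOT NULL,
        source       TEXT NOT NULL,
        content_hash TEXT NOT NULL,
        PRIMARY KEY (namespace, source)  -- same source may exist in several namespaces
    )
    """
)


def upsert(namespace: str, source: str, content_hash: str) -> None:
    """Insert or overwrite the record for one source inside one namespace."""
    conn.execute(
        "INSERT OR REPLACE INTO records VALUES (?, ?, ?)",
        (namespace, source, content_hash),
    )


def sources_in(namespace: str) -> list[str]:
    """List the sources tracked under a single namespace."""
    rows = conn.execute(
        "SELECT source FROM records WHERE namespace = ? ORDER BY source",
        (namespace,),
    )
    return [row[0] for row in rows]


upsert("chroma/rag_knowledge_base", "rag-intro", "a3f8")
upsert("chroma/support_faq", "rag-intro", "ffff")  # hypothetical second knowledge base
print(sources_in("chroma/rag_knowledge_base"))  # ['rag-intro']
```

Because every query filters on the namespace, a cleanup pass for one knowledge base never sees the records of another.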
The Sync Function
from langchain_core.documents import Document

# vectorstore is assumed to be the vector store instance created earlier in the script
def sync_knowledge_base(docs: list[Document]) -> dict:
    """Incrementally sync a document batch into the vector store.

    - Unchanged documents: skipped (no embedding API call)
    - New / modified documents: embedded and written
    - Removed documents: deleted from the vector store
    """
    return index(
        docs,
        record_manager,
        vectorstore,
        cleanup="full",          # auto-remove docs not in this batch
        source_id_key="source",  # metadata["source"] identifies each document
    )
source_id_key is the document identity key. Two documents with the same source are treated as different versions of the same document. If content changes, the old version is deleted and the new version is added.
Documents Must Have a source Field
Document(
    page_content="...",
    metadata={"source": "rag-intro"},  # required for version tracking
)
Documents without a source have no stable identity, so the indexer cannot link a modified document back to its previous version.
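One way to catch this early is to check the batch before syncing. The sketch below is an assumption-laden illustration: `Doc` is a hypothetical stand-in for a LangChain Document so the example runs without the library, and `find_missing_source` is not part of any API.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def find_missing_source(docs, source_id_key: str = "source") -> list[int]:
    """Return the positions of documents whose metadata lacks a usable source."""
    return [
        position
        for position, doc in enumerate(docs)
        if not doc.metadata.get(source_id_key)  # missing key, None, or empty string
    ]


batch = [
    Doc("RAG overview", {"source": "rag-intro"}),
    Doc("No metadata at all"),
    Doc("Empty source", {"source": ""}),
]
print(find_missing_source(batch))  # [1, 2]
```

A sync job could refuse to run, or log a warning, when this list is not empty.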
Experiment: Three Sync Rounds
Dataset Design
V1 (initial knowledge base, 6 documents):
rag-intro, ragas, vector-db, embedding, rerank, chunking
V2 (simulated update cycle):
| Change Type | Document | Description |
|---|---|---|
| Unchanged | rag-intro, vector-db, rerank | Identical content |
| Modified | ragas | Added faithfulness explanation |
| Modified | chunking | Added contextual retrieval section |
| Deleted | embedding | Not present in V2 batch |
| Added | advanced-rag | New document |
| Added | conv-rag | New document |
V1 → V2: 3 unchanged, 2 modified, 1 deleted, 2 added.
Results
======================================================================
Scenario 1: Initial Index (V1 — 6 documents)
======================================================================
[Initial Index]
┌─────────────────────────────────────────┐
│ added: 6 (newly embedded) │
│ skipped: 0 (content unchanged) │
│ deleted: 0 (removed/replaced) │
├─────────────────────────────────────────┤
│ embed calls: 6 │
│ wall time: 0.913s │
└─────────────────────────────────────────┘
======================================================================
Scenario 2: Incremental Update (V2)
======================================================================
[Incremental Update]
┌─────────────────────────────────────────┐
│ added: 4 (newly embedded) │
│ skipped: 3 (content unchanged) │
│ deleted: 3 (removed/replaced) │
├─────────────────────────────────────────┤
│ embed calls: 4 │
│ wall time: 0.891s │
└─────────────────────────────────────────┘
======================================================================
Scenario 3: Full Rebuild (V2, record manager wiped)
======================================================================
[Full Rebuild]
┌─────────────────────────────────────────┐
│ added: 7 (newly embedded) │
│ skipped: 0 (content unchanged) │
│ deleted: 0 (removed/replaced) │
├─────────────────────────────────────────┤
│ embed calls: 7 │
│ wall time: 0.494s │
└─────────────────────────────────────────┘
Cost Comparison
┌──────────────────────┬───────────────┬───────────────┐
│ │ Incremental │ Full Rebuild │
├──────────────────────┼───────────────┼───────────────┤
│ Documents embedded │ 4 │ 7 │
│ Documents skipped │ 3 │ 0 │
│ Embedding savings │ 42.9% │ 0% │
└──────────────────────┴───────────────┴───────────────┘
Incremental update triggered 4 embed calls; full rebuild triggered 7. That's 42.9% fewer API calls for the same end state.
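Embed calls like these can be counted by wrapping the embedding object. The sketch below is a hypothetical illustration, not the counting wrapper from the article's repository: `CountingEmbeddings` and `FakeEmbeddings` are made-up names, and the fake model exists only so the example runs without an API key. In a real pipeline you might need to subclass `langchain_core.embeddings.Embeddings` instead of relying on duck typing.

```python
class CountingEmbeddings:
    """Wrap an embedding object and count how many texts it is asked to embed."""

    def __init__(self, inner):
        self.inner = inner
        self.documents_embedded = 0

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        # Count texts rather than calls, since one call may carry a whole batch.
        self.documents_embedded += len(texts)
        return self.inner.embed_documents(texts)

    def embed_query(self, text: str) -> list[float]:
        # Query embeddings are passed through without counting.
        return self.inner.embed_query(text)


class FakeEmbeddings:
    """Hypothetical stand-in model: maps each text to a one-dimensional vector."""

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(text))] for text in texts]

    def embed_query(self, text: str) -> list[float]:
        return [float(len(text))]


counter = CountingEmbeddings(FakeEmbeddings())
counter.embed_documents(["doc one", "doc two", "doc three", "doc four"])
print(counter.documents_embedded)  # 4
```

Resetting `documents_embedded` to zero before each scenario gives a per-run count.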
An Honest Look at the Timing Results
Full rebuild was actually faster (0.494s vs 0.891s). This deserves an explanation.
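With 7 small documents, the bookkeeping costs more than skipping 3 embed calls saves. The incremental run also does work the rebuild does not: hash lookups in SQLite and 3 deletions from the vector store. A likely explanation is that the embed calls go out as batched HTTP requests, so latency is dominated by a network round trip that changes little between 4 and 7 documents, while the SQLite operations are synchronous local disk I/O. Only one measurement per scenario is reported here, so part of a sub-second gap may be noise.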
This reverses quickly at realistic scale:
Scenario: 1,000-document knowledge base, 5% daily change rate (50 docs)
    Full rebuild: 1,000 embed calls per day
    Incremental: 50 embed calls per day → 95% reduction
At $0.0001 per embed call (the price assumed here for bge-large-zh-v1.5, avg 200 tokens/doc):
    Full rebuild: 1,000 × $0.0001 = ~$0.10/day
    Incremental: 50 × $0.0001 = ~$0.005/day
At 10,000 documents:
    Full rebuild: ~$1.00/day
    Incremental: ~$0.05/day
The time savings at small scale are not meaningful. The API cost savings are real from day one: the absolute amounts depend on your provider's pricing and sync frequency, but the 95% ratio does not, and both metrics grow with corpus size.
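The projection is simple enough to script, which makes it easy to plug in your own corpus size, change rate, and price. This is a minimal sketch: `daily_embedding_cost` is a hypothetical helper, and it assumes one sync per day and one embed call per document.

```python
def daily_embedding_cost(corpus_size: int, change_rate: float,
                         price_per_call: float) -> dict[str, float]:
    """Compare a daily full rebuild with a daily incremental sync."""
    changed = round(corpus_size * change_rate)
    return {
        "full_rebuild_calls": corpus_size,
        "incremental_calls": changed,
        "full_rebuild_cost": corpus_size * price_per_call,
        "incremental_cost": changed * price_per_call,
        "reduction": 1 - changed / corpus_size,
    }


for size in (1_000, 10_000):
    projection = daily_embedding_cost(size, change_rate=0.05, price_per_call=0.0001)
    print(size, projection["incremental_calls"], f"{projection['reduction']:.0%}")
```

The reduction depends only on the change rate, so it stays the same as the corpus grows while the absolute number of calls saved scales with it.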
Two Kinds of Deletion
deleted: 3 in the incremental result contains two different things:
Replacement deletion (2): ragas and chunking changed content. The old versions are deleted from the vector store; new versions are embedded and inserted. Net document count for these sources: unchanged.
Cleanup deletion (1): the embedding document was not in the V2 batch at all. With cleanup="full", after processing all documents in the batch, the indexer checks the RecordManager for any source not seen in this run, finds embedding, and removes it.
If you used cleanup=None instead:
# Not recommended: stale documents accumulate
index(docs, record_manager, vectorstore, cleanup=None)
# Recommended: full sync, stale content auto-removed
index(docs, record_manager, vectorstore, cleanup="full", source_id_key="source")
Without cleanup, embedding stays in the vector store indefinitely. Users querying about embedding models would still get answers based on the retired document. This is the "ghost document" problem — one of the more insidious production bugs in RAG systems, because it's invisible until someone notices the answers reference content that no longer exists.
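The two kinds of deletion can be told apart by comparing the hash registries of two consecutive syncs. The sketch below uses placeholder hashes for the V1 and V2 batches from the experiment; `split_deletions` is a hypothetical helper, not part of the Indexing API.

```python
def split_deletions(previous: dict[str, str],
                    current: dict[str, str]) -> tuple[set[str], set[str]]:
    """Separate replacement deletions from cleanup deletions.

    previous / current map source -> content hash for two consecutive syncs.
    """
    # Present in both syncs but with a different hash: old version gets replaced.
    replaced = {
        source
        for source in previous.keys() & current.keys()
        if previous[source] != current[source]
    }
    # Present before, absent now: only a cleanup pass will remove these.
    removed = set(previous.keys() - current.keys())
    return replaced, removed


# Placeholder hashes: "h1" marks original content, "h2" marks modified content.
v1 = {name: "h1" for name in
      ["rag-intro", "ragas", "vector-db", "embedding", "rerank", "chunking"]}
v2 = {"rag-intro": "h1", "vector-db": "h1", "rerank": "h1",
      "ragas": "h2", "chunking": "h2",
      "advanced-rag": "h1", "conv-rag": "h1"}

replaced, removed = split_deletions(v1, v2)
print(sorted(replaced))  # ['chunking', 'ragas']
print(sorted(removed))   # ['embedding']
```

The `removed` set is exactly what would linger as ghost documents if no cleanup pass ran.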
Production Integration Pattern
In practice, incremental updates are triggered by a scheduled job or a document change event:
import glob

def load_documents_from_dir(docs_dir: str) -> list[Document]:
    """Load documents from the filesystem, using file path as source."""
    docs = []
    for filepath in glob.glob(f"{docs_dir}/**/*.md", recursive=True):
        with open(filepath, encoding="utf-8") as f:
            content = f.read()
        docs.append(Document(
            page_content=content,
            metadata={"source": filepath},
        ))
    return docs

# Scheduled job: sync every hour
def hourly_sync():
    docs = load_documents_from_dir("./knowledge_base")
    result = index(
        docs,
        record_manager,
        vectorstore,
        cleanup="full",
        source_id_key="source",
    )
    print(
        f"Sync done: +{result['num_added']} added "
        f"~{result['num_deleted']} deleted "
        f"={result['num_skipped']} skipped"
    )
File path as source is naturally unique. File content changes automatically invalidate the stored hash, triggering re-embedding on the next sync. No extra tracking code needed.
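One caveat worth sketching: glob returns paths in the form the directory was given, so "./knowledge_base/a.md" and an absolute path to the same file are different source strings and would look like different documents. A possible mitigation, assuming you are free to choose the source format, is to derive the source from the path relative to the knowledge base root. `source_id_for` is a hypothetical helper.

```python
from pathlib import Path


def source_id_for(filepath: str, docs_dir: str) -> str:
    """Build a source ID from the file's path relative to the knowledge base root.

    Resolving both paths first makes the result independent of the current
    working directory and of whether docs_dir was given as relative or absolute.
    """
    root = Path(docs_dir).resolve()
    return Path(filepath).resolve().relative_to(root).as_posix()
```

With this helper, the metadata line in the loader would become `metadata={"source": source_id_for(filepath, docs_dir)}`, and the same file yields the same source on every machine that runs the sync.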
RecordManager Persistence
SQLRecordManager persists to disk, so the hash registry survives service restarts. For production:
# Development / single machine
record_manager = SQLRecordManager(
    "namespace",
    db_url="sqlite:///record_manager.db",
)

# Production / distributed (multiple service instances share one registry)
record_manager = SQLRecordManager(
    "namespace",
    db_url="postgresql://user:pass@host/dbname",
)
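SQLite works for single-instance deployments. Switch to PostgreSQL when multiple instances need to share the same RecordManager: a local SQLite file is not a good fit for concurrent writers on different hosts, and instances that each keep their own file will disagree about what has been indexed.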
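A common way to switch between the two without editing code is to read the connection string from the environment. This is a small sketch under an assumption: the variable name RECORD_MANAGER_DB_URL and the helper `record_manager_db_url` are hypothetical, not something LangChain defines.

```python
import os


def record_manager_db_url(default: str = "sqlite:///record_manager.db") -> str:
    """Return the registry URL from the environment, falling back to local SQLite.

    RECORD_MANAGER_DB_URL is a hypothetical variable name chosen for this sketch.
    """
    return os.environ.get("RECORD_MANAGER_DB_URL", default)
```

The result can be passed as `db_url=record_manager_db_url()` when constructing the record manager, which also keeps database credentials out of the source tree.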
Full Code
Complete code is open-sourced at:
https://github.com/chendongqi/llm-in-action/tree/main/19-incremental-update
Key file:
- incremental_update.py: three sync scenarios, counting wrapper, cost comparison, query verification
How to run:
git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/19-incremental-update
cp .env.example .env
pip install -r requirements.txt
python incremental_update.py
Summary
This article implemented incremental knowledge base updates using LangChain's Indexing API. Key findings:
- Content hash tracking is the mechanism: RecordManager stores a hash for each document; unchanged → skip, modified → delete old + insert new, deleted → cleanup removes it
- 42.9% embedding reduction: 7 documents, 3 unchanged, so only 4 embed calls instead of 7. The savings track the share of unchanged documents, so they grow as the change rate falls
- Wall time savings don't show at small scale: SQLite hash lookup overhead dominates at 7 documents; time savings are expected at 1,000+ documents but were not measured here
- cleanup="full" prevents ghost documents: without it, deleted documents stay in the vector store indefinitely and keep appearing in retrieval results
Incremental updates are the step that moves RAG from "demo that works" to "production system that stays correct." A knowledge base is not a one-time artifact — it needs to evolve alongside the business that depends on it.