You've shipped a RAG service. It works. Then imagine OpenAI
ships a hypothetical text-embedding-next, your eval rig says
it picks up 4 points of recall on the corpus you actually care
about, and your VP wants the upgrade live by Friday.
The naive plan is: stop ingest, drop the old vectors, re-embed
the corpus, write the new vectors back, resume traffic. On a
small corpus this takes minutes. On 8 million chunks it takes
a weekend, and your support bot returns nothing useful while
the swap is mid-flight.
The right shape is a dual-write re-index. Writes land in
both the old and the new index from the moment you start.
Reads keep going to the old one. A background worker
back-fills history into the new index. A parity check confirms
the new index agrees with the old on a held-out query set.
Only then do you flip reads.
Schema, orchestrator, parity gate, and the four edge cases
that bite teams the first time they try it.
What dual-write actually means here
In a transactional database, dual-write is the pattern where
every write hits two stores so you can switch readers without
losing data. It's the same idea here, with a wrinkle: the two
stores are not equivalent. They have different embedding
dimensions, possibly different chunk granularity, and possibly
different metadata schemas.
Three migration shapes show up in practice, and each breaks a
different assumption.
Shape 1: schema change. You're adding a new filterable
field. Say, tenant_id for row-level access control, or
section_path so a retriever can scope to one chapter. The
embedding model and chunk strategy stay the same, but the new
table has columns the old one doesn't.
Shape 2: embedding model change. You're moving from
text-embedding-3-small (1536 dims) to a 3072-dim model
(e.g., text-embedding-3-large). The vectors are not
interchangeable. The two indexes can't share a namespace; they
can't even share a column type.
Shape 3: chunk-size change. You're dropping from
1000-character chunks to 400-character chunks for tighter
retrieval. The doc-to-chunk ratio changes. One source
document maps to N old rows and M new rows, where M is often
2-3× N. Foreign keys to chunk IDs across services are now
stale.
Dual-write handles all three. The operational shape is the
same; what changes is how the writer translates a single
ingest event into rows in each index.
The schema
Two physical tables, never one with a versioned column. A
versioned column tempts you to query both at once, and the
pgvector planner cannot do that across different vector
dimensions anyway.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks_v1 (
    id         BIGSERIAL PRIMARY KEY,
    doc_id     TEXT NOT NULL,
    text       TEXT NOT NULL,
    embedding  VECTOR(1536),
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX chunks_v1_emb
    ON chunks_v1 USING hnsw (embedding vector_cosine_ops);

CREATE TABLE chunks_v2 (
    id           BIGSERIAL PRIMARY KEY,
    doc_id       TEXT NOT NULL,
    section_path TEXT,
    text         TEXT NOT NULL,
    embedding    VECTOR(3072),
    created_at   TIMESTAMPTZ DEFAULT now()
);

-- pgvector's hnsw index caps the vector type at 2000
-- dimensions, so the 3072-dim column is indexed through a
-- halfvec cast (pgvector >= 0.7.0).
CREATE INDEX chunks_v2_emb
    ON chunks_v2 USING hnsw
    ((embedding::halfvec(3072)) halfvec_cosine_ops);
A separate index_state row tracks which index reads should
hit. The orchestrator updates it; the retriever reads it on
every request (cache it in process memory with a 1-second TTL).
The table is a singleton: one row, ever, enforced by the
CHECK (id = 1) constraint on the primary key.
CREATE TABLE index_state (
    id             INT PRIMARY KEY CHECK (id = 1),
    active_version INT NOT NULL,
    updated_at     TIMESTAMPTZ DEFAULT now()
);

INSERT INTO index_state (id, active_version) VALUES (1, 1);
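The singleton really is enforced at the database level; a second
row is rejected outright:

```sql
-- Attempting a second row fails the CHECK (id = 1) constraint,
-- so there is never ambiguity about which row holds the pointer.
INSERT INTO index_state (id, active_version) VALUES (2, 2);
```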
If you're on a managed vector store like Pinecone, Qdrant, or
Weaviate, the equivalent is two namespaces or two collections,
with the active-version pointer in your config store.
The dual-write orchestrator
Roughly 70 lines of Python over asyncpg. The chunker and
embedder are stubs; swap in your real ones.
import asyncio
import asyncpg
import random

from openai import AsyncOpenAI

oai = AsyncOpenAI()

PARITY_SAMPLE = 200
PARITY_THRESHOLD = 0.92

# Assumes pgvector's asyncpg codec is registered on each pool
# connection (pgvector.asyncpg.register_vector) so embedding
# lists bind to VECTOR columns.

async def chunk_v1(doc): ...   # 1000-char chunks
async def chunk_v2(doc): ...   # 400-char chunks + section_path
async def embed_v1(text): ...  # 1536-dim model
async def embed_v2(text): ...  # 3072-dim model
async def load_source(doc_id): ...  # raw doc from your source of truth


async def ingest(pool, doc):
    rows_v1 = [(doc["id"], c, await embed_v1(c))
               for c in await chunk_v1(doc)]
    rows_v2 = [(doc["id"], path, c, await embed_v2(c))
               for path, c in await chunk_v2(doc)]
    async with pool.acquire() as conn:
        async with conn.transaction():
            # Per-doc advisory lock so live ingest and backfill
            # cannot clobber each other on the same doc_id.
            await conn.execute(
                "SELECT pg_advisory_xact_lock(hashtext($1))",
                doc["id"])
            await conn.execute(
                "DELETE FROM chunks_v1 WHERE doc_id=$1",
                doc["id"])
            await conn.executemany(
                "INSERT INTO chunks_v1 (doc_id, text, embedding) "
                "VALUES ($1, $2, $3)", rows_v1)
            await conn.execute(
                "DELETE FROM chunks_v2 WHERE doc_id=$1",
                doc["id"])
            await conn.executemany(
                "INSERT INTO chunks_v2 "
                "(doc_id, section_path, text, embedding) "
                "VALUES ($1, $2, $3, $4)", rows_v2)


async def backfill(pool, batch=500):
    # Holds one connection for the anti-join while ingest
    # acquires a second; keep the pool's max_size >= 2.
    async with pool.acquire() as conn:
        while True:
            todo = await conn.fetch(
                "SELECT DISTINCT c.doc_id "
                "FROM chunks_v1 c "
                "LEFT JOIN chunks_v2 v ON c.doc_id = v.doc_id "
                "WHERE v.doc_id IS NULL "
                "LIMIT $1", batch)
            if not todo:
                return
            for r in todo:
                doc = await load_source(r["doc_id"])
                await ingest(pool, doc)


async def parity_check(pool, eval_queries):
    n = min(PARITY_SAMPLE, len(eval_queries))
    sample = random.sample(eval_queries, n)
    agree = 0
    for q in sample:
        v1 = set(await retrieve(pool, q, version=1, top_k=5))
        v2 = set(await retrieve(pool, q, version=2, top_k=5))
        union = len(v1 | v2)
        overlap = len(v1 & v2) / union if union else 1.0
        if overlap >= 0.6:
            agree += 1
    return agree / n


async def cutover(pool, eval_queries):
    score = await parity_check(pool, eval_queries)
    if score < PARITY_THRESHOLD:
        raise RuntimeError(
            f"Parity {score:.2f} below threshold")
    async with pool.acquire() as conn:
        await conn.execute(
            "UPDATE index_state SET active_version=2, "
            "updated_at=now() WHERE id=1")
The shape:

- ingest is the live writer. Every doc update writes to both tables in the same transaction, behind a per-doc advisory lock. A failure on either side rolls both back. The lock also serialises live ingest against backfill, so a backfill worker cannot clobber a fresher live write on the same doc_id.
- backfill runs once on the historical corpus. It walks every doc_id in chunks_v1 that hasn't been written to chunks_v2 yet and re-runs ingest. It's idempotent because ingest deletes the doc's existing rows on each side before inserting, and the advisory lock keeps it safe under concurrent live writes.
- parity_check runs an eval-query set against both versions and measures Jaccard overlap of the returned doc_id sets. Jaccard (intersection over union) is robust to dup-heavy corpora where one doc contributes several chunks to a top-5; dividing by a constant 5 over-penalises those cases. Pick the cutover threshold from your own eval rig; 0.92 is a reasonable starting point.
- cutover flips active_version once parity is good. The retriever reads active_version on every request and routes accordingly. Because index_state is a singleton, the WHERE id=1 update is the atomic switch.
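As a standalone illustration of the Jaccard choice (the doc-id
sets below are hypothetical, not from any real corpus):

```python
def jaccard(a, b):
    """Intersection over union of two doc_id sets; 1.0 when both are empty."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0

# Dup-heavy corpus: one doc contributes several chunks to a top-5,
# so five retrieved chunks can collapse to three distinct docs.
v1_docs = {"doc-a", "doc-b", "doc-c"}           # from 5 v1 chunks
v2_docs = {"doc-a", "doc-b", "doc-c", "doc-d"}  # from 5 v2 chunks

print(jaccard(v1_docs, v2_docs))  # 3 shared / 4 in union = 0.75
```

Dividing the raw chunk intersection by a constant 5 would score
this same agreement much lower, purely because of duplicates.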
Reading is two queries hidden behind a version switch:
async def retrieve(pool, query, version=None, top_k=10):
    async with pool.acquire() as conn:
        if version is None:
            version = await conn.fetchval(
                "SELECT active_version FROM index_state "
                "WHERE id=1")
        if version == 1:
            qvec = await embed_v1(query)
            sql = ("SELECT doc_id FROM chunks_v1 "
                   "ORDER BY embedding <=> $1 LIMIT $2")
        else:
            qvec = await embed_v2(query)
            # If chunks_v2's hnsw index is built over a halfvec
            # cast, mirror the cast here so the planner uses it.
            sql = ("SELECT doc_id FROM chunks_v2 "
                   "ORDER BY embedding <=> $1 LIMIT $2")
        # Over-fetch chunks, then dedupe to doc_ids in rank order.
        # (SELECT DISTINCT can't be ordered by a column outside
        # the select list, so the dedupe happens client-side.)
        rows = await conn.fetch(sql, qvec, top_k * 4)
        seen, out = set(), []
        for r in rows:
            if r["doc_id"] not in seen:
                seen.add(r["doc_id"])
                out.append(r["doc_id"])
        return out[:top_k]
The edge cases that actually hurt
1. The embedding model has different dimensions. Two
tables, two VECTOR(N) widths. A query embedded with
embed_v1 cannot run against chunks_v2. The bug is when
you forget to switch the embedder alongside the table:
cutover produces zero results because the dimension mismatch
errors at the driver level and your code swallows the
exception. The fix is to bind embedder and table together
inside retrieve so they cannot drift.
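One cheap extra guard, assuming each version's dimensionality is
known up front (the dims below match this post's example models;
assert_query_dims is a name introduced here), is to validate the
query vector's length before it ever reaches the driver:

```python
EXPECTED_DIMS = {1: 1536, 2: 3072}

def assert_query_dims(version: int, qvec: list) -> None:
    # Fail loudly at the call site instead of letting a dimension
    # mismatch surface as a swallowed driver error and an empty
    # result set after cutover.
    want = EXPECTED_DIMS[version]
    if len(qvec) != want:
        raise ValueError(
            f"v{version} query vector has {len(qvec)} dims, "
            f"expected {want}")
```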
2. The chunk-size change reshapes the doc-to-chunk ratio.
At 1000 chars, a 30-page contract becomes 60 chunks. At 400
chars, the same contract becomes 150. Anything downstream
that keys off chunk IDs (reranker cache, citation IDs, audit
log) is now meaningless. Don't expose chunk IDs across
service boundaries. Expose (doc_id, byte range) or
(doc_id, section_path) and translate to current chunk IDs
at read time. If chunk IDs are already exposed, build a
translation table from old to new before the parity check
runs, or the parity sample will fail on references the new
index doesn't recognize.
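A minimal sketch of that read-time translation, assuming
fixed-size character chunks (your real chunker's spans go where
chunk_spans is; both function names are introduced here):

```python
def chunk_spans(text: str, size: int) -> list:
    # (start, end) character offsets of each chunk under a
    # fixed-size chunking; a stand-in for your real chunker.
    return [(i, min(i + size, len(text)))
            for i in range(0, len(text), size)]

def chunks_covering(text: str, size: int,
                    start: int, end: int) -> list:
    # Translate a stable (doc_id, byte-range) reference into the
    # chunk indices of the *current* chunking at read time.
    return [i for i, (s, e) in enumerate(chunk_spans(text, size))
            if s < end and e > start]
```

The same (450, 520) range maps to chunk 0 under 1000-char
chunking and chunk 1 under 400-char chunking: the byte range is
the stable currency, the chunk index is derived.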
3. The schema change introduces a new filterable field.
section_path is null for any document the back-fill hasn't
touched yet. A retriever with WHERE section_path = $1 and
a request that hits during back-fill returns partial results.
Back-fill first, then flip reads. Never run a filter that
depends on the new column while reads are still on the old
version. If the filter is mandatory at read time, gate
cutover on back-fill completion alongside parity score.
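A sketch of that combined gate, assuming you count back-fill
remainders with the same anti-join the back-fill worker uses
(the SQL string, cutover_ready, and the 0.92 default are this
post's values and names, not universal constants):

```python
# Docs still missing from chunks_v2; zero means back-fill is done.
REMAINING_SQL = (
    "SELECT count(DISTINCT c.doc_id) "
    "FROM chunks_v1 c "
    "LEFT JOIN chunks_v2 v ON c.doc_id = v.doc_id "
    "WHERE v.doc_id IS NULL")

def cutover_ready(remaining_docs: int, parity: float,
                  threshold: float = 0.92) -> bool:
    # When a mandatory read-time filter depends on the new
    # column, parity alone is not enough: the back-fill must
    # also be finished before reads flip.
    return remaining_docs == 0 and parity >= threshold
```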
4. Deletes. The orchestrator above deletes by doc_id on
every write, which keeps both tables in sync. If you support
hard deletes (GDPR erasure, manual takedown), the delete path
has to write to both tables too. A document lingers in
chunks_v1 after cutover when the delete path forgot the new
table. That bug shows up six weeks later in a privacy audit.
(GDPR Article 17 erasure has its own legal process; this
section is only about the engineering plumbing.)
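A sketch of a both-tables delete path, reusing the
transaction-plus-advisory-lock shape from ingest (delete_doc is
a name introduced here, not part of the post's orchestrator):

```python
async def delete_doc(pool, doc_id: str) -> None:
    # Hard deletes must hit both tables in one transaction,
    # behind the same per-doc advisory lock as ingest, or the
    # doc lingers in whichever table the delete path forgot.
    async with pool.acquire() as conn:
        async with conn.transaction():
            await conn.execute(
                "SELECT pg_advisory_xact_lock(hashtext($1))",
                doc_id)
            await conn.execute(
                "DELETE FROM chunks_v1 WHERE doc_id=$1", doc_id)
            await conn.execute(
                "DELETE FROM chunks_v2 WHERE doc_id=$1", doc_id)
```

Keep this path wired even after cutover: chunks_v1 must keep
receiving deletes until the day you drop it.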
When to skip dual-write
Dual-write earns its weight when downtime is unacceptable
and a stop-the-world re-index would take more than a few
minutes. Skip it when the corpus is under a few hundred
thousand chunks (a maintenance window is cheaper), when a
managed vector DB exposes versioned namespaces natively, or
when the change is purely additive and a plain ALTER TABLE ...
ADD COLUMN does the job in one statement.
The shape (write to both, validate, switch) is the
embedding-store version of the classic blue-green deployment
idea. The vocabulary is older than the embedding wave; it
applies cleanly here.
What to monitor during the migration
Three signals tell you the dual-write is healthy.
- Write latency on ingest. It roughly doubles because you're paying for two embed calls and two inserts. If it more than doubles, your v2 embedder is the bottleneck, not the database.
- Back-fill rate. Documents per minute. If it flattens, you're hitting embedding API rate limits or HNSW build pressure on chunks_v2. Drop batch size or pause ingest.
- Parity-check trend. Run it every 10 minutes during back-fill, not only at cutover. The score should climb monotonically. If it plateaus below threshold for an hour, your v2 chunker or embedder is wrong and back-filling more documents will not fix it.
Once active_version flips, hold the old index for at least
a week. A bad cutover is an UPDATE index_state away from
recovery. Don't drop chunks_v1 until production retrieval
metrics have settled for a full business cycle.
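Rollback is intentionally boring; flipping the pointer back
re-routes reads to the still-intact chunks_v1:

```sql
-- Roll reads back to the old index; no data movement required.
UPDATE index_state
   SET active_version = 1, updated_at = now()
 WHERE id = 1;
```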
If this was useful
The RAG Pocket Guide
covers the operational side of RAG that tutorials skip:
re-indexing, eval gating, chunking trade-offs, hybrid retrieval,
reranking, and the production patterns that keep showing up
across teams. If you're shipping RAG and the migrations feel
scarier than the model swaps, it's the book.
