Humza Tareen

Posted on • Originally published at humzakt.github.io

Zero-Downtime Embedding Migration: Switching from text-embedding-004 to text-embedding-3-large in Production

Our embedding model got deprecated overnight. Every RAG query started returning 404s. Here's the exact playbook we used to migrate to a new model in 48 hours with zero downtime.

The Situation

  • Service: RAG retrieval service using pgvector on PostgreSQL
  • Old model: text-embedding-004 (deprecated)
  • New model: text-embedding-3-large (natively 3072 dimensions, requested at 768 via the API's dimensions parameter to match the existing column width)
  • Data volume: Thousands of embedded documents
  • Constraint: Zero downtime, zero data loss, production traffic must keep flowing

Step 1: Make the Model Configurable

Before anything else, stop hardcoding:

import os
import openai

# Before (hardcoded in 6 places)
response = openai.embeddings.create(
    model="text-embedding-004",
    input=text,
)

# After (configured once)
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-large")
EMBED_DIMENSIONS = int(os.getenv("EMBED_DIMENSIONS", "768"))

response = openai.embeddings.create(
    model=EMBED_MODEL,
    input=text,
    dimensions=EMBED_DIMENSIONS,
)

Two environment variables. This is what makes the difference between a 2-day migration and a 2-week one.
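If you want to fail fast on a bad value instead of at the first API call, the two variables can be resolved through one small startup helper. A minimal sketch — load_embed_config is a hypothetical name, not from our codebase:

```python
import os

def load_embed_config() -> tuple[str, int]:
    """Resolve the embedding model and dimension count once at startup."""
    model = os.getenv("EMBED_MODEL", "text-embedding-3-large")
    dims = int(os.getenv("EMBED_DIMENSIONS", "768"))
    if dims <= 0:
        raise ValueError(f"EMBED_DIMENSIONS must be positive, got {dims}")
    return model, dims
```

Centralizing the read also gives you one place to log which model the process is actually using.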

Step 2: Add New Columns (Don't Replace)

-- Migration: add new embedding column alongside the old one
ALTER TABLE documents 
ADD COLUMN embedding_v2 vector(768);

CREATE INDEX CONCURRENTLY idx_documents_embedding_v2 
ON documents USING ivfflat (embedding_v2 vector_cosine_ops)
WITH (lists = 100);

Using CONCURRENTLY means the index builds without locking the table. Production reads continue uninterrupted.

Step 3: Batch Re-embedding Script

import asyncio

from openai import AsyncOpenAI
from tqdm import tqdm

client = AsyncOpenAI()  # the sync module-level client can't be awaited

async def re_embed_batch(session, documents, batch_size=50):
    """Re-embed documents in batches with progress tracking."""
    for i in tqdm(range(0, len(documents), batch_size)):
        batch = documents[i:i + batch_size]
        texts = [doc.content for doc in batch]

        # One embedding call per batch instead of per document
        response = await client.embeddings.create(
            model=EMBED_MODEL,
            input=texts,
            dimensions=EMBED_DIMENSIONS,
        )

        for doc, embedding in zip(batch, response.data):
            doc.embedding_v2 = embedding.embedding

        # Commit each batch so progress survives a crash
        await session.commit()

        # Crude rate limiting to stay under API quotas
        await asyncio.sleep(0.5)

Key features:

  • Batch processing (don't embed one doc at a time)
  • Progress bar (you need to know how long this takes)
  • Rate limiting (embedding APIs have rate limits)
  • Commits per batch (don't hold a transaction for 10K docs)
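The fixed sleep is a blunt throttle; if the API still returns transient errors mid-run, a retry wrapper with exponential backoff is a common companion. A sketch under the assumption that any exception is retryable — with_backoff is a hypothetical helper, not our exact code:

```python
import asyncio
import random

async def with_backoff(call, max_retries=5, base_delay=0.5):
    """Retry an async call with exponential backoff plus a little jitter."""
    for attempt in range(max_retries):
        try:
            return await call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the original error
            # 0.5s, 1s, 2s, 4s... plus jitter to avoid thundering herds
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

In production you would narrow the except clause to the SDK's rate-limit and timeout errors rather than retrying everything.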

Step 4: Dry-Run Validation

Before switching production traffic:

async def validate_migration(session, sample_size=100):
    """Compare search results between old and new embeddings."""
    test_queries = get_random_queries(sample_size)
    overlaps = []

    for query in test_queries:
        old_results = await search(session, query, column="embedding")
        new_results = await search(session, query, column="embedding_v2")

        # Fraction of the old top-10 that survives in the new top-10
        old_ids = {r.id for r in old_results[:10]}
        new_ids = {r.id for r in new_results[:10]}
        overlap = len(old_ids & new_ids) / len(old_ids)
        overlaps.append(overlap)

        if overlap < 0.6:  # Less than 60% overlap is concerning
            logger.warning(f"Low overlap for query: {query[:50]}... ({overlap:.0%})")

    avg_overlap = sum(overlaps) / len(overlaps)
    logger.info(f"Validation complete. Average overlap: {avg_overlap:.0%}")

Our average overlap was 82% -- different models produce different embeddings, but the top results were comparable. Good enough.
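The overlap check boils down to a small pure function, which is handy to unit-test on its own. This is just a restatement of the metric above, not extra behaviour:

```python
def top_k_overlap(old_ids, new_ids, k=10):
    """Fraction of the old top-k result IDs that also appear in the new top-k."""
    old = set(list(old_ids)[:k])
    new = set(list(new_ids)[:k])
    if not old:
        return 1.0  # No old results means nothing to preserve
    return len(old & new) / len(old)
```

Keeping the metric pure makes it trivial to tune the 60% threshold against recorded query logs offline.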

Step 5: Feature Flag Switch

import os

from sqlalchemy import text

USE_V2_EMBEDDINGS = os.getenv("USE_V2_EMBEDDINGS", "false") == "true"

async def search(session, query: str, top_k: int = 10):
    column = "embedding_v2" if USE_V2_EMBEDDINGS else "embedding"
    query_embedding = await embed(query)

    results = await session.execute(text(f"""
        SELECT id, content, 1 - ({column} <=> :query_vec) as similarity
        FROM documents
        ORDER BY {column} <=> :query_vec
        LIMIT :top_k
    """), {"query_vec": str(query_embedding), "top_k": top_k})

    return results.fetchall()

Deploy with USE_V2_EMBEDDINGS=false and verify everything still works. Then flip it to true. If anything breaks, flipping back is a one-line config rollback, not a code revert.
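One caveat worth noting: a module-level flag like the one above is frozen at import time, so "flip back" really means "restart with the old value". Reading the flag at call time lets a config reload take effect without a restart — a minimal sketch, assuming the flag is delivered via the process environment:

```python
import os

def use_v2_embeddings() -> bool:
    """Read the rollout flag per call so a flip needs no process restart."""
    return os.getenv("USE_V2_EMBEDDINGS", "false").lower() == "true"
```

The per-call os.getenv lookup is cheap enough that it won't show up in query latency.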

Step 6: Cleanup

After running with v2 for a week with no issues:

-- Dropping the old column also drops its index, so no separate DROP INDEX is needed
ALTER TABLE documents DROP COLUMN embedding;
ALTER TABLE documents RENAME COLUMN embedding_v2 TO embedding;
ALTER INDEX idx_documents_embedding_v2 RENAME TO idx_documents_embedding;

Lessons Learned

  1. Always abstract the embedding provider. Two env vars saved us from a multi-file refactor.
  2. Add model version tracking to stored vectors. We didn't. We should have.
  3. Build migration tooling before you need it. The batch script and validation tool are reusable.
  4. Side-by-side columns > in-place replacement. The rollback story is instant.
  5. Dry-run everything. Our validation caught 3 queries with low overlap that needed investigation.
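Lesson 2 in code could be as simple as carrying the model name next to each vector. A hypothetical shape, not our actual schema:

```python
from dataclasses import dataclass

@dataclass
class EmbeddingRecord:
    """A stored vector plus the model that produced it."""
    vector: list[float]
    model: str        # e.g. "text-embedding-3-large"
    dimensions: int   # makes stale vectors trivially queryable

    def is_stale(self, current_model: str) -> bool:
        return self.model != current_model
```

With a model column in place, the next migration's "what still needs re-embedding" query is a simple WHERE clause instead of a guess.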

Total impact: 48 hours, zero downtime, zero data loss.


Read the full migration story on my blog. Part of my "Production GCP Patterns" series — find me at humzakt.github.io.

