Humza Tareen

Posted on • Originally published at humzakt.github.io

Zero-Downtime Embedding Migration: Switching from text-embedding-004 to text-embedding-3-large in Production

Our embedding model got deprecated overnight. Every RAG query started returning 404s. Here's the exact playbook we used to migrate to a new model in 48 hours with zero downtime.

The Situation

  • Service: RAG retrieval service using pgvector on PostgreSQL
  • Old model: text-embedding-004 (deprecated)
  • New model: text-embedding-3-large (natively 3072 dimensions, requested at 768 via the API's dimensions parameter to match the existing column width)
  • Data volume: Thousands of embedded documents
  • Constraint: Zero downtime, zero data loss, production traffic must keep flowing

Step 1: Make the Model Configurable

Before anything else, stop hardcoding:

import os
import openai

# Before (hardcoded in 6 places)
response = openai.embeddings.create(
    model="text-embedding-004",
    input=text,
)

# After (configured once)
EMBED_MODEL = os.getenv("EMBED_MODEL", "text-embedding-3-large")
EMBED_DIMENSIONS = int(os.getenv("EMBED_DIMENSIONS", "768"))

response = openai.embeddings.create(
    model=EMBED_MODEL,
    input=text,
    dimensions=EMBED_DIMENSIONS,
)

Two environment variables. This is what makes the difference between a 2-day migration and a 2-week one.
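If you want to fail fast on a bad value instead of at the first API call, the two variables can be resolved through one small startup helper. A minimal sketch — load_embed_config is a hypothetical name, not from our codebase:

```python
import os

def load_embed_config() -> tuple[str, int]:
    """Resolve the embedding model and dimension count once at startup."""
    model = os.getenv("EMBED_MODEL", "text-embedding-3-large")
    dims = int(os.getenv("EMBED_DIMENSIONS", "768"))
    if dims <= 0:
        raise ValueError(f"EMBED_DIMENSIONS must be positive, got {dims}")
    return model, dims
```

Centralizing the read also gives you one place to log which model the process is actually using.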

Step 2: Add New Columns (Don't Replace)

-- Migration: add new embedding column alongside the old one
ALTER TABLE documents 
ADD COLUMN embedding_v2 vector(768);

CREATE INDEX CONCURRENTLY idx_documents_embedding_v2 
ON documents USING ivfflat (embedding_v2 vector_cosine_ops)
WITH (lists = 100);

Using CONCURRENTLY means the index builds without locking the table. Production reads continue uninterrupted.

Step 3: Batch Re-embedding Script

import asyncio

from openai import AsyncOpenAI
from tqdm import tqdm

client = AsyncOpenAI()  # the sync module-level client can't be awaited

async def re_embed_batch(session, documents, batch_size=50):
    """Re-embed documents in batches with progress tracking."""
    for i in tqdm(range(0, len(documents), batch_size)):
        batch = documents[i:i + batch_size]
        texts = [doc.content for doc in batch]

        # One embedding call per batch instead of per document
        response = await client.embeddings.create(
            model=EMBED_MODEL,
            input=texts,
            dimensions=EMBED_DIMENSIONS,
        )

        for doc, embedding in zip(batch, response.data):
            doc.embedding_v2 = embedding.embedding

        # Commit each batch so progress survives a crash
        await session.commit()

        # Crude rate limiting to stay under API quotas
        await asyncio.sleep(0.5)

Key features:

  • Batch processing (don't embed one doc at a time)
  • Progress bar (you need to know how long this takes)
  • Rate limiting (embedding APIs have rate limits)
  • Commits per batch (don't hold a transaction for 10K docs)
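The fixed sleep is a blunt throttle; if the API still returns transient errors mid-run, a retry wrapper with exponential backoff is a common companion. A sketch under the assumption that any exception is retryable — with_backoff is a hypothetical helper, not our exact code:

```python
import asyncio
import random

async def with_backoff(call, max_retries=5, base_delay=0.5):
    """Retry an async call with exponential backoff plus a little jitter."""
    for attempt in range(max_retries):
        try:
            return await call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the original error
            # 0.5s, 1s, 2s, 4s... plus jitter to avoid thundering herds
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```

In production you would narrow the except clause to the SDK's rate-limit and timeout errors rather than retrying everything.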

Step 4: Dry-Run Validation

Before switching production traffic:

async def validate_migration(session, sample_size=100):
    """Compare search results between old and new embeddings."""
    test_queries = get_random_queries(sample_size)
    overlaps = []

    for query in test_queries:
        old_results = await search(session, query, column="embedding")
        new_results = await search(session, query, column="embedding_v2")

        # Fraction of the old top-10 that survives in the new top-10
        old_ids = {r.id for r in old_results[:10]}
        new_ids = {r.id for r in new_results[:10]}
        overlap = len(old_ids & new_ids) / len(old_ids)
        overlaps.append(overlap)

        if overlap < 0.6:  # Less than 60% overlap is concerning
            logger.warning(f"Low overlap for query: {query[:50]}... ({overlap:.0%})")

    avg_overlap = sum(overlaps) / len(overlaps)
    logger.info(f"Validation complete. Average overlap: {avg_overlap:.0%}")

Our average overlap was 82% -- different models produce different embeddings, but the top results were comparable. Good enough.
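The overlap check boils down to a small pure function, which is handy to unit-test on its own. This is just a restatement of the metric above, not extra behaviour:

```python
def top_k_overlap(old_ids, new_ids, k=10):
    """Fraction of the old top-k result IDs that also appear in the new top-k."""
    old = set(list(old_ids)[:k])
    new = set(list(new_ids)[:k])
    if not old:
        return 1.0  # No old results means nothing to preserve
    return len(old & new) / len(old)
```

Keeping the metric pure makes it trivial to tune the 60% threshold against recorded query logs offline.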

Step 5: Feature Flag Switch

import os

from sqlalchemy import text

USE_V2_EMBEDDINGS = os.getenv("USE_V2_EMBEDDINGS", "false") == "true"

async def search(session, query: str, top_k: int = 10):
    column = "embedding_v2" if USE_V2_EMBEDDINGS else "embedding"
    query_embedding = await embed(query)

    results = await session.execute(text(f"""
        SELECT id, content, 1 - ({column} <=> :query_vec) as similarity
        FROM documents
        ORDER BY {column} <=> :query_vec
        LIMIT :top_k
    """), {"query_vec": str(query_embedding), "top_k": top_k})

    return results.fetchall()

Deploy with USE_V2_EMBEDDINGS=false and verify everything still works. Then flip it to true. If anything breaks, flipping back is a one-line config rollback, not a code revert.
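One caveat worth noting: a module-level flag like the one above is frozen at import time, so "flip back" really means "restart with the old value". Reading the flag at call time lets a config reload take effect without a restart — a minimal sketch, assuming the flag is delivered via the process environment:

```python
import os

def use_v2_embeddings() -> bool:
    """Read the rollout flag per call so a flip needs no process restart."""
    return os.getenv("USE_V2_EMBEDDINGS", "false").lower() == "true"
```

The per-call os.getenv lookup is cheap enough that it won't show up in query latency.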

Step 6: Cleanup

After running with v2 for a week with no issues:

-- Dropping the old column also drops its index, so no separate DROP INDEX is needed
ALTER TABLE documents DROP COLUMN embedding;
ALTER TABLE documents RENAME COLUMN embedding_v2 TO embedding;
ALTER INDEX idx_documents_embedding_v2 RENAME TO idx_documents_embedding;

Lessons Learned

  1. Always abstract the embedding provider. Two env vars saved us from a multi-file refactor.
  2. Add model version tracking to stored vectors. We didn't. We should have.
  3. Build migration tooling before you need it. The batch script and validation tool are reusable.
  4. Side-by-side columns > in-place replacement. The rollback story is instant.
  5. Dry-run everything. Our validation caught 3 queries with low overlap that needed investigation.
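Lesson 2 in code could be as simple as carrying the model name next to each vector. A hypothetical shape, not our actual schema:

```python
from dataclasses import dataclass

@dataclass
class EmbeddingRecord:
    """A stored vector plus the model that produced it."""
    vector: list[float]
    model: str        # e.g. "text-embedding-3-large"
    dimensions: int   # makes stale vectors trivially queryable

    def is_stale(self, current_model: str) -> bool:
        return self.model != current_model
```

With a model column in place, the next migration's "what still needs re-embedding" query is a simple WHERE clause instead of a guess.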

Total impact: 48 hours, zero downtime, zero data loss.


Read the full migration story on my blog. Part of my "Production GCP Patterns" series — find me at humzakt.github.io.

