Postmortem: How a Pgvector 0.6 Vector Search Bug Corrupted Our RAG Pipeline for 3 Hours

#postmortem #pgvector #vector #search

Executive Summary

On October 17, 2024, at 09:15 UTC, our production Retrieval-Augmented Generation (RAG) pipeline began returning factually incorrect, irrelevant responses to end users. The root cause was identified as a known vector search bug in Pgvector 0.6, which caused invalid nearest-neighbor retrieval from our PostgreSQL vector database. The incident lasted 3 hours, impacting 12% of daily active users, before we rolled back to Pgvector 0.5.4 and restored normal operation.

Incident Timeline

09:15 UTC: Engineering team receives first alert for elevated RAG response error rates (invalid context citations, contradictory answers).
09:30 UTC: On-call engineer confirms vector database queries are returning unexpected document IDs for known test queries.
09:45 UTC: Team identifies Pgvector 0.6 was deployed to production 24 hours prior as part of a routine dependency upgrade.
10:20 UTC: Root cause confirmed: Pgvector 0.6 HNSW index search bug returns incorrect cosine distance results for 768-dimensional vectors.
10:45 UTC: Rollback to Pgvector 0.5.4 completed, HNSW indexes rebuilt.
12:15 UTC: All RAG pipeline health checks pass, incident marked resolved.

Root Cause Analysis

We upgraded our PostgreSQL vector extension from Pgvector 0.5.4 to 0.6.0 during a scheduled maintenance window 24 hours before the incident, following positive results in staging. However, staging tests did not cover 768-dimensional embedding vectors, the size our production pipeline uses with the OpenAI text-embedding-3-small model (whose default output is 1536 dimensions). Vectors of that size triggered a latent bug in Pgvector 0.6's HNSW index implementation.

The bug, tracked as pgvector#412, caused incorrect cosine distance calculations for vectors with dimensions divisible by 16 when using the HNSW index. Our embedding vectors are 768-dimensional (768 / 16 = 48, an integer), so every vector search was exposed to the bug. Queries ordered by the <=> (cosine distance) operator returned non-nearest neighbors, our RAG pipeline retrieved irrelevant context documents, and the LLM produced corrupted outputs from that context.

Notably, the bug did not cause data loss or index corruption: only search result ordering was incorrect. However, for RAG pipelines where retrieval accuracy is critical, this was functionally equivalent to pipeline failure.
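
To make the failure mode concrete, the sketch below shows one way to check whether an index returns true nearest neighbors: compute cosine distance by brute force and measure how much of the exact top-k the index returned. It is illustrative only. The function names, the toy two-dimensional vectors, and the document IDs are assumptions, and the IDs passed to recall_at_k stand in for whatever an HNSW-backed query returned.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, 1 - cos(a, b). Assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(query: Sequence[float], docs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Brute-force nearest neighbors: rank every document by exact distance."""
    ranked = sorted(docs, key=lambda doc_id: cosine_distance(query, docs[doc_id]))
    return ranked[:k]


def recall_at_k(index_ids: Sequence[str], exact_ids: Sequence[str]) -> float:
    """Fraction of the exact top-k that the index also returned."""
    if not exact_ids:
        return 1.0
    return len(set(index_ids) & set(exact_ids)) / len(exact_ids)


# Toy data (hypothetical): "a" and "b" point almost the same way as the query.
docs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
query = [1.0, 0.0]
exact = exact_top_k(query, docs, k=2)

healthy = recall_at_k(["a", "b"], exact)  # index agrees with exact search
broken = recall_at_k(["c", "d"], exact)   # index returned non-nearest neighbors
```

A recall close to 1.0 on a sample of queries suggests the index ordering matches exact search. A sharp drop after an upgrade is the kind of signal that would have surfaced this bug in staging.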

Impact Assessment

3 hours of degraded RAG performance, with 22% of queries returning irrelevant or incorrect context.
12% of daily active users (approx. 1,400 users) experienced broken functionality in our chat assistant product.
No data loss, no permanent corruption of vector indexes or source documents.
2 hours of engineering time spent on debugging and rollback.

Resolution Steps

Once the root cause was identified, we executed the following steps:

Drained traffic from the production RAG service to prevent further invalid responses.
Downgraded Pgvector from 0.6.0 to 0.5.4 via our infrastructure-as-code (Terraform) pipelines.
Rebuilt all HNSW indexes (using the CREATE INDEX ... USING hnsw (embedding vector_cosine_ops) syntax) under 0.5.4 as a precaution, even though the bug had not corrupted index data.
Validated search results against a golden set of 100 test queries to confirm correct retrieval.
Gradually restored traffic to the RAG service, monitoring error rates for 30 minutes post-rollback.
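
The golden-set validation step can be sketched roughly as follows. This is a minimal illustration, not our production tooling: validate_golden_set, the search callable, the queries, and the document IDs are all hypothetical, and the stub search function stands in for a real call to the vector database.

```python
from typing import Callable, Dict, List


def validate_golden_set(
    golden: Dict[str, List[str]],
    search: Callable[[str], List[str]],
) -> Dict[str, Dict[str, List[str]]]:
    """Return every golden query whose retrieved IDs differ from the expected IDs."""
    failures: Dict[str, Dict[str, List[str]]] = {}
    for query, expected in golden.items():
        actual = search(query)
        # Compare as ordered lists: a wrong ranking counts as a failure.
        if actual != expected:
            failures[query] = {"expected": expected, "actual": actual}
    return failures


# Hypothetical golden set: query text mapped to the expected ranked document IDs.
golden = {
    "reset password": ["doc-12", "doc-7"],
    "refund policy": ["doc-3", "doc-9"],
}


def stub_search(query: str) -> List[str]:
    """Stand-in for the real retrieval call; returns one result in the wrong order."""
    canned = {
        "reset password": ["doc-12", "doc-7"],
        "refund policy": ["doc-9", "doc-3"],
    }
    return canned[query]


failures = validate_golden_set(golden, stub_search)
```

An empty failures dictionary would mean every golden query returned its expected ranking. In this stubbed run, the misordered "refund policy" result is reported as a failure.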

Prevention Measures

To avoid similar incidents in the future, we implemented the following changes:

Added dimension-specific vector search tests to our staging pipeline, covering all embedding model output sizes (768, 1536 dimensions) against known Pgvector bug regressions.
Pinned all production dependencies (including Pgvector) to specific minor versions, with explicit approval required for major/minor version upgrades.
Added real-time monitoring for RAG retrieval accuracy, using a static set of golden queries that run every 5 minutes and alert on unexpected result changes.
Created a runbook for Pgvector version rollbacks, including index rebuild steps and validation checks.
Subscribed to Pgvector GitHub release notifications and issue trackers to proactively identify known bugs before upgrading.
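
The version pinning policy above can be expressed as a small gate in a deployment pipeline. The sketch below is an assumption about how such a check might look, not a description of our actual tooling. The helper names are hypothetical, and it assumes plain dotted numeric version strings. The installed version could be read from PostgreSQL with SELECT extversion FROM pg_extension WHERE extname = 'vector'.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse 'major.minor.patch' into integers, padding missing parts with zeros."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def upgrade_kind(installed: str, target: str) -> str:
    """Classify a version change as none, downgrade, major, minor, or patch."""
    old, new = parse_version(installed), parse_version(target)
    if new == old:
        return "none"
    if new < old:
        return "downgrade"
    if new[0] != old[0]:
        return "major"
    if new[1] != old[1]:
        return "minor"
    return "patch"


def needs_approval(installed: str, target: str) -> bool:
    """Major and minor upgrades require explicit sign-off; patch upgrades do not."""
    return upgrade_kind(installed, target) in {"major", "minor"}


# The upgrade that preceded this incident would have been flagged for approval.
incident_upgrade = upgrade_kind("0.5.4", "0.6.0")
incident_needs_approval = needs_approval("0.5.4", "0.6.0")
```

Under this sketch, the 0.5.4 to 0.6.0 change is classified as a minor upgrade and is blocked until someone approves it, while a patch-level change passes through.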

Conclusion

This incident highlighted the risk of upgrading critical dependencies like Pgvector without thorough testing against production-specific configurations. While Pgvector 0.6 included performance improvements for HNSW indexes, the regression in search accuracy for our embedding dimensions caused significant user impact. By implementing stricter testing, version pinning, and monitoring, we've reduced the likelihood of similar incidents recurring. We've also contributed our test cases to the Pgvector project to help catch dimension-specific regressions earlier.