Rhea Kapoor

The Reality of Scale: What Billion-Transaction Systems Teach Us About Vector Databases

I've spent the last year implementing vector search for a payment system processing tens of billions of annual transactions. Here’s what matters when abstract databases meet physical infrastructure.

Why Scale Isn't Theoretical

We needed personalized recommendations across 200+ countries. Our requirements:

  1. Hourly ingestion of 50M+ vector updates
  2. <100ms p99 latency at peak traffic
  3. Support for 10B+ vectors without rearchitecting
  4. Dynamic schema changes during live updates

Commercial graph databases failed at 100M vectors. Custom solutions choked on batch writes.

Batch Ingestion: The Silent Killer

Test case: 48M vectors, average dimensionality 768

  • Competitor A: 8.2 hours (2.5K vectors/sec)
  • Competitor B: 6.1 hours (3.4K vectors/sec)
  • Milvus: 52 minutes (18.7K vectors/sec)

Why this matters:

Database | Peak Memory | CPU Utilization | Failed Batches
A        | 38GB        | 92%             | 12%
B        | 41GB        | 88%             | 8%
Milvus   | 19GB        | 67%             | 0.2%

The difference came down to parallel I/O design. Milvus separates index building from ingestion, avoiding write amplification. This Python snippet shows the clean API:

from pymilvus import (
    connections, Collection, CollectionSchema, FieldSchema, DataType
)

connections.connect("default", host="localhost", port="19530")

# Define schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768)
]
schema = CollectionSchema(fields)

# Insert without locking the index
collection = Collection("recommendations", schema)
insert_result = collection.insert(batch_data)  # batch_data: [ids, embeddings]
collection.flush()
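
At these batch sizes the client side matters too: a single giant insert call can exceed the server's message-size limits, so we split the data into chunks and flush once at the end. A minimal sketch, with illustrative sizes and random placeholder data rather than our production pipeline:

import numpy as np

# Illustrative data only; in production the ids/embeddings come from the feature pipeline
ids = list(range(20_000))
embeddings = np.random.rand(20_000, 768).astype("float32")

CHUNK = 5_000  # keep each insert comfortably under Milvus' default message-size limit
for start in range(0, len(ids), CHUNK):
    collection.insert([
        ids[start:start + CHUNK],
        embeddings[start:start + CHUNK].tolist(),
    ])
collection.flush()  # flush once at the end instead of per chunk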

The Consistency Trap

You’ll see these options in distributed systems:

Level              | Use Case                | Latency Cost (ours)
Strong Consistency | Financial auditing      | +85ms
Bounded Staleness  | Recommendation engines  | +12ms
Session            | User-specific search    | +3ms
Eventual           | Analytics/cold storage  | +0ms

We used bounded staleness for checkout recommendations. It was the wrong choice for customer-service bots, though:

# Problematic pattern for conversational AI
collection.query(
    expr="user_id == 'abc123'",
    consistency_level="Bounded",
    timeout=10.0,  # caused 8% timeouts during concurrent writes
)

We changed to session consistency with request batching, and timeouts dropped to 0.3%.
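
A sketch of the replacement pattern, reusing the same collection; the batched user IDs and output fields below are placeholders:

# Batch several users' lookups into one query under Session consistency (sketch)
expr = 'user_id in ["abc123", "def456", "ghi789"]'  # hypothetical batch from one request window
results = collection.query(
    expr=expr,
    output_fields=["user_id"],
    consistency_level="Session",
    timeout=10.0,
)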

Deployment Lessons

  1. Never run on Kubernetes without these:
# Must-have for stateful services  
affinity:  
  podAntiAffinity:  
    requiredDuringSchedulingIgnoredDuringExecution:  
    - labelSelector:  
        matchExpressions:  
        - key: "app"  
          operator: In  
          values: ["milvus"]  
      topologyKey: "kubernetes.io/hostname"  
  2. Storage tradeoffs:

    • SSD: Required for >1B vectors
    • Local NVMe: 37% faster than network-attached
    • MinIO object storage: Saved $16k/month vs cloud storage
  3. Indexing during ingestion increased latency 400%. Solution:

# Index after peak hours  
curl -X POST http://localhost:9091/api/v1/index \  
     -H "Content-Type: application/json" \  
     -d '{"collection_name": "recommendations", "index_type": "IVF_FLAT"}'  
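
If you would rather drive the off-peak build from the Python client than the REST endpoint, the equivalent call looks roughly like this (the metric type and nlist value are assumptions, not our production settings):

# Equivalent index build via pymilvus, scheduled outside peak hours (e.g. from a cron job)
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "IVF_FLAT",
        "metric_type": "L2",        # assumption; use "IP" if your embeddings expect it
        "params": {"nlist": 1024},  # illustrative value
    },
)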

What I’d Do Differently Today

  1. Use quantized indexes (IVF_SQ8 over IVF_FLAT) - 60% memory reduction (sketch below)
  2. Pre-partition collections by geo-region
  3. Deploy Zilliz Cloud earlier to offload the stateful-service headaches
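
A minimal sketch of the first two points, reusing the collection from earlier; the nlist value and region names are illustrative:

# 1. Quantized index: IVF_SQ8 trades a little recall for a much smaller memory footprint
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "IVF_SQ8",
        "metric_type": "L2",
        "params": {"nlist": 2048},  # illustrative
    },
)

# 2. Pre-partition by geo-region so queries and writes touch only the relevant segments
for region in ["amer", "emea", "apac"]:  # hypothetical region names
    collection.create_partition(region)
collection.insert(batch_data, partition_name="emea")  # route each batch to its region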

Still Unsolved Problems

  • Multi-tenant isolation at 1M+ QPS
  • Real-time index tuning
  • Cross-cluster replication without consistency nightmares

Our team is now experimenting with merging sparse and dense vectors via hybrid retrieval. Early results show an 11% relevance improvement for customer-service bots.
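
Roughly, the experiment looks like this with pymilvus 2.4+, assuming a collection that carries both a dense and a sparse vector field; the field names, query vectors, and search params below are placeholders:

from pymilvus import AnnSearchRequest, RRFRanker

# Placeholder query vectors: a dense embedding and a sparse term-weight dict
query_dense = [0.1] * 768
query_sparse = {1045: 0.8, 20233: 0.3}  # token_id -> weight

dense_req = AnnSearchRequest(
    data=[query_dense],
    anns_field="embedding",          # dense field from the earlier schema
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=20,
)
sparse_req = AnnSearchRequest(
    data=[query_sparse],
    anns_field="sparse_embedding",   # hypothetical sparse field
    param={"metric_type": "IP"},
    limit=20,
)

# Fuse the two candidate lists with reciprocal-rank fusion
hits = collection.hybrid_search(
    reqs=[dense_req, sparse_req],
    rerank=RRFRanker(),
    limit=10,
)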

The physics of large-scale search don’t care about marketing. Test relentlessly.
