I've spent the last year implementing vector search for a payment system processing tens of billions of annual transactions. Here’s what matters when abstract databases meet physical infrastructure.
## Why Scale Isn't Theoretical
We needed personalized recommendations across 200+ countries. Our requirements:
- Hourly ingestion of 50M+ vector updates
- <100ms p99 latency at peak traffic
- Support for 10B+ vectors without rearchitecting
- Dynamic schema changes during live updates
Commercial graph databases failed at 100M vectors. Custom solutions choked on batch writes.
## Batch Ingestion: The Silent Killer
Test case: 48M vectors with an average dimensionality of 768
- Competitor A: 8.2 hours (~1.6K vectors/sec)
- Competitor B: 6.1 hours (~2.2K vectors/sec)
- Milvus: 52 minutes (~15.4K vectors/sec)
Why this matters:
| Database | Peak Memory | CPU Utilization | Failed Batches |
|---|---|---|---|
| A | 38 GB | 92% | 12% |
| B | 41 GB | 88% | 8% |
| Milvus | 19 GB | 67% | 0.2% |
The difference came down to parallel I/O design. Milvus separates index building from ingestion, avoiding write amplification. This Python snippet shows the clean API:
```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect("default", host="localhost", port="19530")

# Define schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
]
schema = CollectionSchema(fields)

# Insert without locking the index; batch_data is prepared upstream,
# e.g. [[ids...], [embeddings...]] in schema field order
collection = Collection("recommendations", schema)
insert_result = collection.insert(batch_data)
collection.flush()  # seal the segments so they become searchable
```
## The Consistency Trap
You’ll see these options in distributed systems:
| Level | Use Case | Our Latency Cost |
|---|---|---|
| Strong Consistency | Financial auditing | +85 ms |
| Bounded Staleness | Recommendation engines | +12 ms |
| Session | User-specific search | +3 ms |
| Eventual | Analytics/cold storage | +0 ms (baseline) |
We used bounded staleness for checkout recommendations. Wrong choice for customer service bots though:
```python
# Problematic pattern for conversational AI
results = collection.query(
    expr="user_id == 'abc123'",
    consistency_level="Bounded",
    timeout=10.0,  # caused 8% timeouts during concurrent writes
)
```
We switched to session consistency with request batching; timeouts dropped to 0.3%.
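A minimal sketch of the replacement pattern (the `batched_user_query` helper and the grouping logic are illustrative, not our production code):

```python
# Session consistency + request batching: concurrent bot lookups are
# grouped and resolved in one round trip instead of one query each
def batched_user_query(collection, user_ids):
    id_list = ", ".join(f'"{uid}"' for uid in user_ids)
    return collection.query(
        expr=f"user_id in [{id_list}]",
        consistency_level="Session",  # reads see this session's own writes
        timeout=10.0,
    )
```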
## Deployment Lessons
- Never run on Kubernetes without these:

```yaml
# Must-have for stateful services
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: "app"
              operator: In
              values: ["milvus"]
        topologyKey: "kubernetes.io/hostname"
```
- Storage tradeoffs:
  - SSD: required for >1B vectors
  - Local NVMe: 37% faster than network-attached storage
  - MinIO object storage: saved $16k/month vs. cloud storage (config sketch below)
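Pointing Milvus at self-hosted MinIO is a config change rather than a code change; a sketch of the relevant `milvus.yaml` keys (the endpoint, credentials, and bucket name are illustrative):

```yaml
# milvus.yaml excerpt: use self-hosted MinIO for object storage
minio:
  address: minio.internal       # illustrative internal endpoint
  port: 9000
  accessKeyID: <your-access-key>
  secretAccessKey: <your-secret-key>
  useSSL: false
  bucketName: milvus-segments   # illustrative bucket name
```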
- Indexing during ingestion increased latency by 400%. Solution:

```bash
# Index after peak hours
curl -X POST http://localhost:9091/api/v1/index \
  -H "Content-Type: application/json" \
  -d '{"collection_name": "recommendations", "index_type": "IVF_FLAT"}'
```
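The same deferred build can be run from pymilvus instead of the REST endpoint; a sketch, with an illustrative (not tuned) `nlist` value:

```python
from pymilvus import Collection, connections

connections.connect("default", host="localhost", port="19530")
collection = Collection("recommendations")

# Deferred index build, run from a scheduled job after peak hours
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "IVF_FLAT",
        "metric_type": "L2",
        "params": {"nlist": 1024},  # illustrative value
    },
)
collection.load()  # the collection must be loaded before it can serve searches
```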
## What I’d Do Differently Today
- Use quantized indexes (IVF_SQ8 over IVF_FLAT) for a ~60% memory reduction (see the sketch after this list)
- Pre-partition collections by geo-region
- Adopt Zilliz Cloud earlier to offload the stateful-service headaches
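A minimal sketch of the first two changes together; the partition names, `nlist`, and the placeholder query vector are illustrative:

```python
from pymilvus import Collection, connections

connections.connect("default", host="localhost", port="19530")
collection = Collection("recommendations")

# IVF_SQ8 stores 8-bit quantized codes instead of raw float32 values,
# which is where the memory reduction comes from
collection.create_index(
    field_name="embedding",
    index_params={
        "index_type": "IVF_SQ8",
        "metric_type": "L2",
        "params": {"nlist": 2048},  # illustrative, not a tuned value
    },
)

# Per-region partitions keep regional traffic from scanning everything
for region in ["NA", "EU", "APAC"]:
    collection.create_partition(region)

# Reads then target one partition instead of the whole collection
query_vector = [0.0] * 768  # placeholder 768-dim embedding
results = collection.search(
    data=[query_vector],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,
    partition_names=["EU"],
)
```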
## Still Unsolved Problems
- Multi-tenant isolation at 1M+ QPS
- Real-time index tuning
- Cross-cluster replication without consistency nightmares
Our team is now experimenting with hybrid retrieval that merges sparse and dense vectors. Early results show an 11% relevance improvement for customer service bots.
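A sketch of the shape this takes in Milvus 2.4+, assuming the collection carries both a dense `embedding` field and a sparse `sparse_embedding` field (the field names, placeholder queries, and parameters are illustrative):

```python
from pymilvus import AnnSearchRequest, Collection, RRFRanker, connections

connections.connect("default", host="localhost", port="19530")
collection = Collection("recommendations")

dense_query = [0.0] * 768              # placeholder dense embedding
sparse_query = {523: 0.2, 1002: 0.4}   # placeholder {dim: weight} pairs

# One ANN request per vector field...
dense_req = AnnSearchRequest(
    data=[dense_query],
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=50,
)
sparse_req = AnnSearchRequest(
    data=[sparse_query],
    anns_field="sparse_embedding",  # assumed sparse field name
    param={"metric_type": "IP"},
    limit=50,
)

# ...then reciprocal-rank fusion merges the two candidate lists
hits = collection.hybrid_search(
    reqs=[dense_req, sparse_req],
    rerank=RRFRanker(),
    limit=10,
)
```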
The physics of large-scale search don’t care about marketing. Test relentlessly.