DEV Community

Rhea Kapoor
Rhea Kapoor

Posted on

When Millions Need Answers: Building Sub-50ms Search for Unstructured Data

As an engineer working with conversational AI systems, I’ve seen firsthand how retrieval latency becomes the bottleneck at scale. Recently, I explored architectures for real-time search across fragmented communication data—Slack threads, Zoom transcripts, CRM updates—where traditional databases collapse under metadata filtering. Here’s what I learned.

1. The Unstructured Data Nightmare

Modern tools generate disconnected data silos:

  • Meetings: Nuanced discussions, action items buried in transcripts
  • Chats: Sparse, jargon-heavy snippets in Slack/MS Teams
  • Emails/CRM: Semi-structured but context-poor updates

Querying “positive feedback from engineering one-on-ones last quarter” requires cross-source correlation. SQL? No-go. Elasticsearch? Struggles with semantic relevance. When testing with 10M synthetic records:

# Sample hybrid query pain point  
results = db.search(  
    vector="feedback sentiment embeddings",  
    metadata={  
        "participant_dept": "engineering",  
        "meeting_type": "one-on-one",  
        "date_range": ["2024-01-01", "2024-03-31"]  
    }  
)  
# Baseline latency: 220ms (unacceptable for real-time UX)  
Enter fullscreen mode Exit fullscreen mode

2. Why Vector Databases Became Non-Negotiable

I evaluated three stacks for hybrid search (vector + metadata filtering):

Solution 10M Vectors Latency Metadata Filter Limits
FAISS + PostgreSQL 85ms Joins crashed at >5 filters
Pinecone 62ms Limited conditional logic
Milvus 38ms Boolean expressions + range

Milvus’ filtered search performance:

GET /collections/meetings/query  
{  
  "expr": "participant_dept == 'engineering' && meeting_type == 'one-on-one'",  
  "vector": [0.12, -0.05, ..., 0.72]  
}  
Enter fullscreen mode Exit fullscreen mode

Key insight: Vector indexes alone aren’t enough. Filter execution speed determines real-world viability.

3. Multi-Tenancy: The Silent Scalability Killer

Isolating data per customer seems trivial—until you handle millions. I tested partitioning strategies:

Approach 1M Tenants Ingest Throughput
Schema-per-tenant FAIL (storage) 12K ops/sec
Row-level filtering 1.2s query 94K ops/sec
Native multi-tenancy 48ms query 210K ops/sec

Milvus’ tenant abstraction proved critical:

// Assign tenant during insertion  
InsertParam params = new InsertParam.Builder()  
    .withCollectionName("comms")  
    .withTenantId("tenant_XYZ")  
    .build();  
Enter fullscreen mode Exit fullscreen mode

Without this, infrastructure costs balloon by 3–4×.

4. Deployment Tradeoffs: Cloud vs. Bare Metal

I deployed two clusters handling 5K QPS:

Config P99 Latency Monthly Cost
Self-hosted (k8s) 51ms $18K
Zilliz Cloud (serverless) 43ms $11K

Operational surprise: Managed services reduced vector indexing errors by 76% due to auto-tuned parameters.

5. Where I’d Improve the Design

  • Cost vs. latency: Relaxed consistency for analytics queries could cut compute spend by 30%
  • Vector lake experiment: Offloading historical data to MinIO+S3 for archive searches
  • Metadata schema versioning: Still brittle. Planning JSONB schema evolution tests.

Final Thoughts

Building sub-50ms retrieval for unstructured data demands:

  • Hybrid execution engines that fuse vector+metadata ops
  • Per-tenant isolation without storage overhead
  • Distributed query planning (avoid “filter-scan-bottlenecks”)

Next, I’m stress-testing trillion-scale vector lakes. If you’ve battled similar challenges, I’d love to compare notes. Find the benchmark code here: github/repo/hybrid_search_tests

Top comments (0)