Why Multi-Tenancy Matters for Vector Databases
When I started building AI products with real-world customer isolation requirements, the limitations of existing vector search infrastructure quickly became obvious. Anyone who has wrestled with whether to split each customer into their own database instance or multiplex everyone into a single shared set of tables knows how stark the trade-offs become as usage grows. Logical isolation, operational simplicity, and real query performance all pull in different directions.
In the vector space, the equivalent of a relational table is the collection—each with its own schema, index, and access boundaries. With SaaS-style applications, especially those with privacy or model customization demands, a collection-per-tenant approach has practical advantages: clean isolation, custom configs per tenant, and simpler scaling. But this model falls apart if the system can’t handle enough collections. When I talked to platform teams, some were managing thousands of separate clusters just to avoid hard system limits—operationally painful and expensive.
The Breaking Point: What Happens as Tenant Count Grows
I’ve seen this breaking point in practice. Say you’ve got a platform serving hundreds or thousands of customer workspaces, each mapping to a distinct vector collection. That works right up until a certain scale. For example, when teams hit Milvus’s earlier cap of roughly 5,000 collections, those building SaaS tools like Airtable (where every customer can spin up custom apps with isolated vector search) found themselves completely blocked.
What actually breaks? The most severe pain points I encountered are:
- CreateCollection latency spiking from milliseconds to several seconds.
- Insert throughput dropping off a cliff.
- Indexing bottlenecks—too many small jobs clogging up scheduling.
- Cluster recovery times stretching from minutes to hours as metadata grows.
- Memory blowups—mostly from keeping too many in-memory filters or background job state.
For example, in one practical test at 10,000 collections with 100 partitions each, the system created over a million segments (10,000 × 100 = 1,000,000 partitions, each backed by at least one segment), which translated into millions of indexing jobs and a massive metadata load; cold starts stretched past 30 minutes.
Engineering for Scale: Breaking Through the 5,000 Collection Wall
When I started working hands-on with Milvus 2.5.x, the goal was to test whether true large-scale multi-tenancy was feasible in a single cluster. Here’s what actually worked, with measured results.
Metadata: Making Core Operations Fast
The first bottleneck was metadata. Early versions needed to scan huge metadata tables for every API call—an approach that works for a few thousand objects but becomes catastrophic beyond that.
Optimization Strategy
- Built a two-tier metadata index, so high-frequency lookups bypass table scans.
- Moved from global locks to granular locks and, in critical paths, lock-free designs.
- Eliminated unnecessary network hops between coordinator components; now internal calls are in-process.
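The actual two-tier index lives inside Milvus (in Go), but the idea is straightforward: put a small name-to-ID map in front of the full metadata records so hot-path lookups never fall back to a table scan. Here is a minimal Python sketch of that shape; every name is hypothetical and it makes no claim of matching the real data structures:

```python
# Illustrative sketch of a two-tier metadata index (not Milvus internals, which are in Go).
# Tier 1: a flat name -> collection ID map for hot-path lookups.
# Tier 2: full metadata records keyed by collection ID.
from dataclasses import dataclass, field
from threading import RLock


@dataclass
class CollectionMeta:
    collection_id: int
    name: str
    schema: dict = field(default_factory=dict)


class MetadataIndex:
    def __init__(self):
        self._name_to_id = {}   # tier 1: name -> ID
        self._id_to_meta = {}   # tier 2: ID -> metadata record
        self._lock = RLock()    # one lock per index, not one global lock

    def put(self, meta: CollectionMeta):
        with self._lock:
            self._name_to_id[meta.name] = meta.collection_id
            self._id_to_meta[meta.collection_id] = meta

    def get_by_name(self, name):
        # O(1) lookup; no scan over every collection record on each API call.
        with self._lock:
            cid = self._name_to_id.get(name)
            return None if cid is None else self._id_to_meta.get(cid)
```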
Real-World Result
In my stress tests with 100,000 collections and 10 concurrent clients, the numbers spoke for themselves:
| Operation | v2.5.3 | v2.5.8 | Speedup |
|---|---|---|---|
| CreateCollection | 5.62s | 0.501s | 11× |
| Insert | 7.2s | 0.0526s | 137× |
Environment:
- 8-core MixCoord, 32GB RAM
- 3 etcd nodes (1 core, 1GB RAM each)
- 1,000 vectors per collection (128d), 10 clients
- 100,000 collections created/inserted continuously
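If you want to sanity-check numbers like these against your own cluster, a small client-side benchmark is enough: a handful of concurrent workers timing CreateCollection and insert calls with pymilvus. A rough sketch along those lines follows; the host/port, collection names, and counts are placeholders, not part of the original test harness:

```python
import time
import random
from concurrent.futures import ThreadPoolExecutor

from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

# Adjust host/port for your deployment.
connections.connect(host="localhost", port="19530")

DIM = 128
ROWS_PER_COLLECTION = 1000


def bench_one(i):
    """Create one collection and insert one batch, returning both latencies in seconds."""
    schema = CollectionSchema([
        FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=DIM),
    ])
    t0 = time.perf_counter()
    coll = Collection(name=f"bench_tenant_{i}", schema=schema)
    t_create = time.perf_counter() - t0

    pks = list(range(ROWS_PER_COLLECTION))
    vectors = [[random.random() for _ in range(DIM)] for _ in range(ROWS_PER_COLLECTION)]
    t0 = time.perf_counter()
    coll.insert([pks, vectors])
    t_insert = time.perf_counter() - t0
    return t_create, t_insert


# 10 concurrent clients, as in the test above; raise the range for larger runs.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(bench_one, range(100)))

creates, inserts = zip(*results)
print(f"avg CreateCollection: {sum(creates) / len(creates):.3f}s")
print(f"avg Insert:           {sum(inserts) / len(inserts):.3f}s")
```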
Task Scheduling: Preventing Indexing Gridlock
With millions of small jobs, a naive scheduler will let small tasks block big ones and waste resources.
Scheduling Redesign
- Dynamic resource units: each DataNode is divided into 2-core/8GB blocks; scheduler matches jobs to blocks.
- Double-buffered queue: while one set of tasks executes, the next set is already being prepared.
- Lock-free task queues: minimize thread contention for high concurrency.
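None of this is user-facing, but the slot idea is easy to picture: carve each DataNode into fixed 2-core/8GB units and hand out whole units to jobs, so many small indexing tasks can run side by side without starving a large one. The following is a toy Python sketch of that matching step, not Milvus code; every name and number in it is illustrative:

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class IndexTask:
    segment_id: int
    slots_needed: int   # a tiny segment might need 1 slot, a big one several


class DataNode:
    """A node carved into fixed-size resource units ("slots"), e.g. 2 cores / 8 GB each."""

    def __init__(self, node_id, total_slots):
        self.node_id = node_id
        self.free_slots = total_slots


def schedule_tick(tasks, nodes):
    """Greedily match queued tasks to nodes with enough free slots.

    Tasks that do not fit stay queued for the next tick, so a flood of tiny
    jobs cannot permanently block a large one (and vice versa).
    """
    assignments = []
    deferred = deque()
    while tasks:
        task = tasks.popleft()
        node = next((n for n in nodes if n.free_slots >= task.slots_needed), None)
        if node is None:
            deferred.append(task)
            continue
        node.free_slots -= task.slots_needed
        assignments.append((task.segment_id, node.node_id))
    tasks.extend(deferred)
    return assignments
```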
Observed Throughput
Using 4 DataNodes (each 4 cores, 16GB), segment indexing throughput jumped from 10,000 to 300,000 tasks/hour—a 30× gain. The cluster could now index 100,000 collections without stalling.
Recovery at Scale: From Hours to Minutes
Recovering large clusters is usually the difference between “prod ready” and “just a demo.” Two things always broke for me: metadata load and message stream subscription.
Parallel Metadata Load
- Old: Sequential load—startup time grows linearly with cluster size. Loading 3 million entries took >30 minutes.
- New: Shard metadata by collection/partition, use Go goroutines for parallel load. Startup for same dataset: 1 minute (30× faster).
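Milvus implements this in Go, but the shape of the change translates to any language: shard the key space and load the shards concurrently instead of walking one giant list. An illustrative Python version using a thread pool, where load_shard is only a stand-in for real metastore reads:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 64  # illustrative; the real sharding is by collection/partition


def load_shard(shard_id):
    """Stand-in for reading one shard's metadata entries from the metastore.

    In the real system this would be a range read scoped to the shard's key prefix.
    """
    return {}


def load_all_metadata():
    """Load all shards in parallel instead of one long sequential scan."""
    merged = {}
    with ThreadPoolExecutor(max_workers=16) as pool:
        for shard in pool.map(load_shard, range(NUM_SHARDS)):
            merged.update(shard)
    return merged
```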
Message Subscription Dispatcher
Previously, Milvus would spin up one subscription per collection; at 100,000 collections that is untenable (subscribing alone would take ~17 hours and exhaust memory). Now collections are batched into groups (e.g., 200 groups), with a single subscription per group and a Dispatcher routing messages to the right handler.
This cut startup subscription time from 17 hours to ~1 minute (1,000× faster), and eliminated memory spikes.
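The Dispatcher is the interesting piece: one physical subscription per group, plus a routing step that fans messages out to per-collection handlers. A minimal sketch of that routing logic follows; the 200-group figure comes from the example above, while the message shape is an assumption rather than something taken from the Milvus source:

```python
NUM_GROUPS = 200  # e.g. 100,000 collections -> 200 subscriptions, ~500 collections each


def group_of(collection_id):
    return collection_id % NUM_GROUPS


class Dispatcher:
    """One per message-stream subscription; routes each message to its collection's handler."""

    def __init__(self):
        self._handlers = {}  # collection_id -> handler callable

    def register(self, collection_id, handler):
        self._handlers[collection_id] = handler

    def on_message(self, msg):
        # msg is assumed to carry its target collection ID, e.g. {"collection_id": 42, ...}
        handler = self._handlers.get(msg["collection_id"])
        if handler is not None:
            handler(msg)


# One dispatcher per group: startup creates NUM_GROUPS subscriptions instead of 100,000.
dispatchers = {g: Dispatcher() for g in range(NUM_GROUPS)}


def subscribe_collection(collection_id, handler):
    dispatchers[group_of(collection_id)].register(collection_id, handler)
```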
Smarter Memory Management
A major hidden cost in large clusters: BloomFilters for delete requests. If each filter is ~175KB (with a 0.1% FP rate) and you have a million segments, that’s 175GB RAM just for BloomFilters.
The Solution
Milvus 2.5.8 simply stopped loading these into memory. Now, Delete ops are logged to storage, and the existence check/cleanup is deferred to Compaction. That means massively less RAM used, at the price of (rarely) more compaction work.
Delete Flow:
User → Delete request → DataNode appends to binlog (no existence check)
→ Compaction eventually removes actual PKs, discards the rest
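In other words, deletes become append-only tombstones that are reconciled against segment data only when the segment is rewritten. A simplified Python sketch of that reconciliation, with nothing here reflecting the actual storage format:

```python
def append_delete(delete_log, pks):
    """Delete path: just record the requested primary keys; no BloomFilter lookup."""
    delete_log.extend(pks)


def compact_segment(segment_rows, delete_log):
    """Compaction path: drop rows whose PK appears in the delete log.

    PKs that never existed in this segment are simply discarded here,
    which is the deferred "existence check" from the flow above.
    """
    deleted = set(delete_log)
    return {pk: vec for pk, vec in segment_rows.items() if pk not in deleted}


# Example: PK 7 never existed in this segment, so its tombstone is discarded at compaction.
rows = {1: [0.1, 0.2], 2: [0.3, 0.4]}
log = []
append_delete(log, [2, 7])
print(compact_segment(rows, log))   # {1: [0.1, 0.2]}
```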
Before vs. After: A Table of Practical Improvements
| Operation | Milvus v2.5.3 | Milvus v2.5.8 | Improvement |
|---|---|---|---|
| CreateCollection | 5.62s | 0.501s | 11× |
| Insert | 7.2s | 0.0526s | 137× |
| Metadata Recovery | ~30 min | ~1 min | 30× |
| Stream Subscription | ~17 hours | ~1 min | 1,000× |
| Memory Footprint | High | ~70% less | Significant |
Note: Actual results depend on hardware, but the step-change was always clear in my tests.
Deployment and Operational Reflections
After these changes, the shift is obvious: I can now provision a single cluster for 10,000+ tenant collections, each with independent schema/index/type. The old pain points—slow recovery, unpredictable query latency, memory blowups—don’t show up under stress. The operational burden of running dozens of separate clusters (just to work around hard limits) is gone.
For real SaaS and AI scenarios (law firms, data providers, customer-specific models), this is now genuinely viable. True per-tenant isolation, no performance penalty, and less infrastructure overhead.
Code Snippet: Creating Collections per Tenant
I found this workflow reliable for batch tenant creation:
```python
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

# Connect to the cluster first (adjust host/port for your deployment).
connections.connect(host="localhost", port="19530")


def create_tenant_collection(tenant_id, vector_dim):
    """Create one isolated collection per tenant, each with its own schema."""
    schema = CollectionSchema([
        FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=vector_dim)
    ])
    collection_name = f"tenant_{tenant_id}"
    return Collection(name=collection_name, schema=schema)


# Example: Create 100,000 collections (simulate with a loop)
for i in range(100000):
    create_tenant_collection(i, 128)
```
Design Trade-Offs
While these optimizations pay off at scale, there are still open questions:
- Monitoring/Observability: Keeping track of resource usage, background tasks, and slow tenants is now much more important.
- Compaction Strategy: Deferring deletes saves memory but may push more work to compaction; tuning compaction is critical for write-heavy workloads.
- Query Optimization: At massive scale, further improvements to query path and cache management are on the roadmap.
Next Steps: What I Plan to Explore
I want to push even further, testing:
- Ultra-large query workloads with uneven tenant sizes.
- Automatic fairness/resource governance per-tenant.
- Integration with RAG and knowledge base systems at this scale (best practices for scalable RAG).
- Deeper dive into manual sharding pitfalls at multi-million collection scale.
- More efficient storage engines for millions of small, sparse collections.
Conclusion: Multi-Tenant Vector Databases Are Ready for Real SaaS Workloads
After running through this process, I see that with the right architectural optimizations, collection-per-tenant models are not just possible—they’re preferable for many real AI workloads. The operational friction is lower, performance is more predictable, and disaster recovery is no longer a deal breaker. In future iterations, I plan to keep pushing these boundaries with more complex tenant patterns and bigger datasets, but the foundation is finally there.