DEV Community

Alain Airom (Ayrom)

“The Architecture Handbook for Milvus Vector Database”: A Book Review

My feedback and thoughts on the book “The Architecture Handbook for Milvus Vector Database”.

The image from the book cover, produced by Packt Publications

Introduction: Mastering the Core of AI Memory

I think I can fairly say that I know the Milvus vector DB; I’ve used it many times locally to test my Ollama + LLM + vector database applications, and Milvus is also one of the several vector databases offered on the watsonx Data platform. However, I found this book very useful, and one of the best parts (among others) is Chapter 16, “Implementing Multi-Tenancy in Milvus”. This handbook, written by core contributors of the Milvus project, systematically unpacks the full technology stack, from basic concepts and deployment to performance tuning and real-world implementation.

> Disclaimer: This is an independent review. I have no professional or financial relationship with the publisher or the authors.

A Synthesis of the Book

Setup and Configuration

  • Chapter 1: Introduction to Milvus — This chapter sets the stage by defining vector databases and explaining how they handle unstructured data through encoding processes like one-hot and TF-IDF to enable similarity search.

  • Chapter 2: Deploying Milvus in Multiple Ways — It provides a practical guide to various deployment methods, including Docker Compose, Kubernetes operators, and Helm charts, helping users choose the right environment for their needs.

If you’ve already set up Milvus on your end, you can skip these first chapters, in my opinion!


  • Chapter 3: Interacting with Milvus — It covers the core objects — fields, schemas, collections, and indexes — and shows how to interact with the database using both the Python SDK and REST APIs for standard data operations.

  • Chapter 4: Configuring the Milvus System — This chapter focuses on managing system dependencies and service configurations, including setting up monitoring tools like Prometheus, Grafana, and Loki for log analysis.
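
Chapter 1’s encoding concepts (one-hot and TF-IDF) are easy to ground with a tiny, self-contained Python sketch. This is illustrative only, not code from the book; the toy documents and vocabulary are my own:

```python
import math

docs = [
    ["vector", "database", "search"],
    ["vector", "index"],
    ["database", "index", "search", "search"],
]

# One-hot: each document becomes a binary presence vector over the vocabulary.
vocab = sorted({term for doc in docs for term in doc})

def one_hot(doc):
    return [1 if term in doc else 0 for term in vocab]

# TF-IDF: weight terms by in-document frequency, discounting terms
# that appear in many documents.
def tf_idf(doc):
    n_docs = len(docs)
    weights = []
    for term in vocab:
        tf = doc.count(term) / len(doc)
        df = sum(term in d for d in docs)
        idf = math.log(n_docs / df)
        weights.append(tf * idf)
    return weights

print(vocab)             # ['database', 'index', 'search', 'vector']
print(one_hot(docs[0]))  # [1, 0, 1, 1]
print([round(w, 3) for w in tf_idf(docs[2])])
```

Either encoding turns unstructured text into fixed-length vectors, which is exactly what makes similarity search possible.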


Data Model — Reading Data, Modification and Compacting Data

  • Chapter 5: Understanding the Milvus Data Model and Architecture — It dives into the disaggregated four-tier architecture (access, coordination, worker, and storage) and the “lifeline” of the system: the Message Queue (MQ) and Timetick mechanism.

  • Chapter 6: Data Modification and Maintenance in Milvus — This section details the lifecycle of data “segments” and the internal journey of insert, delete, and upsert requests through the system’s various layers.

  • Chapter 7: Reading Data in Milvus — It explains the “scatter-gather” query pattern, replica management, and how Milvus offers multi-level consistency to balance data freshness against search latency.

  • Chapter 8: Compaction and Garbage Collection — This chapter explores how background processes like L0 and clustering compaction optimize storage and query performance by managing fragmented segments and obsolete data.
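
To make Chapter 7’s “scatter-gather” pattern concrete, here is a minimal, self-contained sketch (my own toy model, not Milvus internals): the query is scattered to every query segment, each segment returns its local top-k sorted by distance, and the gather step merges those partial lists into a single global top-k:

```python
import heapq

def scatter_gather_topk(shard_results, k):
    """Merge per-segment top-k hit lists into one global top-k.

    Each shard result is a list of (distance, id) tuples, already
    sorted ascending by distance (smaller = closer, as with L2).
    """
    # heapq.merge exploits the fact that each shard list is pre-sorted,
    # so the gather step only pays log(num_shards) per emitted hit.
    merged = heapq.merge(*shard_results)
    return [hit for _, hit in zip(range(k), merged)]

# Three query segments each return their local top-3 hits.
shard_a = [(0.10, "doc_7"), (0.35, "doc_2"), (0.80, "doc_9")]
shard_b = [(0.05, "doc_4"), (0.50, "doc_1"), (0.90, "doc_8")]
shard_c = [(0.20, "doc_3"), (0.60, "doc_5"), (0.70, "doc_6")]

top3 = scatter_gather_topk([shard_a, shard_b, shard_c], k=3)
print(top3)  # [(0.05, 'doc_4'), (0.1, 'doc_7'), (0.2, 'doc_3')]
```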


In the context of the Milvus vector database, the processes of Compaction and Garbage Collection (GC) are critical because they address the inherent challenges of a high-performance, distributed, LSM-tree (Log-Structured Merge-tree) based system.

These mechanisms are vital for maintaining a healthy and fast vector database:

  • Maintaining Search Speed (Performance Optimization): When we delete or update data in Milvus, the system doesn’t immediately “scrub” the old data from the disk (as this would be computationally expensive). Instead, it marks data as deleted. Over time, these “logical deletes” create fragmentation, and during a search the system still has to scan through the fragmented segments and “skipped” records, which increases query latency. Compaction therefore merges small segments into larger, more optimized ones and physically removes deleted records, ensuring the search engine only processes relevant, “healthy” data.
  • Storage Efficiency: Without garbage collection, storage usage would grow indefinitely even if we deleted as much data as we added.
  • Solving the “Small File” Problem (L0 Compaction): Milvus often receives data in small batches. If left unmanaged, this results in thousands of tiny files (Level 0 segments). Searching across 1,000 tiny files is significantly slower than searching across one large file, due to the overhead of opening files and managing metadata. Compaction continuously “rolls up” these tiny L0 segments into larger, indexed segments that are optimized for the Knowhere search engine.
  • Clustering Compaction for Spatial Locality: Standard compaction just merges files, but clustering compaction goes a step further by reorganizing data based on the values of a specific field (like a partition key or a scalar field). This enables “segment pruning”: if our query only needs data from “Tenant A,” the system can skip entire segments that it knows only contain data for “Tenant B,” drastically reducing the I/O required for a search.
  • System Stability and Reliability: If segments are never compacted, the system’s metadata (stored in etcd) becomes bloated. This can lead to a) slow startup times, b) increased memory consumption by the Query Nodes, and c) potential system crashes when the metadata layer becomes a bottleneck.
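
The payoff of clustering compaction is easiest to see in a toy model (my own illustration, not Milvus code): once each segment carries metadata about which tenants it contains, a tenant-scoped query can skip non-matching segments entirely:

```python
def prune_segments(segments, tenant):
    """Toy model of segment pruning after clustering compaction.

    Each segment carries metadata listing which tenants it contains;
    a query scoped to one tenant only scans the matching segments.
    """
    return [s for s in segments if tenant in s["tenants"]]

# Before clustering compaction: every segment mixes tenants,
# so a query for tenant_a must scan most of them.
mixed = [
    {"name": "seg_1", "tenants": {"tenant_a", "tenant_b"}},
    {"name": "seg_2", "tenants": {"tenant_a", "tenant_c"}},
    {"name": "seg_3", "tenants": {"tenant_b", "tenant_c"}},
]

# After clustering compaction on the tenant key: each segment
# holds a single tenant, so two of the three segments are skipped.
clustered = [
    {"name": "seg_1", "tenants": {"tenant_a"}},
    {"name": "seg_2", "tenants": {"tenant_b"}},
    {"name": "seg_3", "tenants": {"tenant_c"}},
]

print(len(prune_segments(mixed, "tenant_a")))      # 2 segments scanned
print(len(prune_segments(clustered, "tenant_a")))  # 1 segment scanned
```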

  • Chapter 9: Exploring Milvus’ Vector Engine — The focus shifts to the Knowhere engine, the core computational heart that supports various hardware acceleration frameworks and distance metrics like L2 and Cosine.

  • Chapter 10: Indexing in Detail: Algorithms and Parameters — Based on the book’s structure, this chapter likely provides a deep dive into the mathematical foundations of indexing algorithms like HNSW, IVF, and Product Quantization (PQ).

  • Chapter 11: Performance Testing and Benchmarking — It outlines how to use tools like VectorDBBench to establish performance baselines and simulate data growth to test system limits.

  • Chapter 12: Stability and Reliability Testing — This chapter focuses on ensuring the system can handle failures through chaos engineering, recovery testing, and stress testing.
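
Chapter 9’s distance metrics are easy to sanity-check in a few lines of plain Python (a didactic sketch, not Knowhere’s optimized implementation): L2 measures absolute distance between vectors, while cosine compares only their direction, so vectors pointing the same way with different magnitudes look identical to cosine but far apart to L2:

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [1.0, 0.0, 2.0]
v2 = [2.0, 0.0, 4.0]  # same direction as v1, twice the magnitude

print(l2_distance(v1, v2))        # ~2.236: L2 sees them as far apart
print(cosine_similarity(v1, v2))  # 1.0: cosine sees identical direction
```

This is why the metric type chosen at index-creation time (e.g. L2 vs COSINE) must match the semantics of your embeddings.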


Scalability and Performance

  • Chapter 13: Scalability and Capacity Planning — It provides guidelines on how to scale the various tiers of Milvus — query nodes, data nodes, and index nodes — based on specific workload requirements.

  • Chapter 14: Performance Tuning and Best Practices — This part distills years of engineering experience into actionable advice for optimizing indexing and search parameters.

"""
Chapter 14: Configure in-memory replicas for horizontal scaling.

Loads a collection with a specified number of in-memory replicas to distribute
search load across multiple QueryNode instances.

Usage:
    python load_replicas.py [--uri localhost:19530] [--replicas 3]
"""

import argparse

from pymilvus import MilvusClient

COLLECTION_NAME = "test_collection"


def main():
    parser = argparse.ArgumentParser(description="Configure Milvus in-memory replicas")
    parser.add_argument(
        "--uri", default="http://localhost:19530", help="Milvus server URI"
    )
    parser.add_argument("--token", default="", help="Milvus authentication token")
    parser.add_argument("--collection", default=COLLECTION_NAME, help="Collection name")
    parser.add_argument(
        "--replicas", type=int, default=3, help="Number of in-memory replicas"
    )
    args = parser.parse_args()

    client = MilvusClient(uri=args.uri, token=args.token)

    # Release collection first if already loaded
    try:
        client.release_collection(args.collection)
        print(f"Released collection '{args.collection}'")
    except Exception:
        pass

    # Load with specified replica count
    client.load_collection(
        collection_name=args.collection,
        replica_number=args.replicas,
    )
    print(
        f"Collection '{args.collection}' loaded with {args.replicas} in-memory replicas"
    )
    print(
        f"Search requests will be distributed across {args.replicas} QueryNode instances"
    )


if __name__ == "__main__":
    main()

Advanced Topics

  • Chapter 15: Advanced Search and Querying — It explores techniques like batch search processing and horizontal scaling with multiple replicas to handle thousands of concurrent users.
"""
Chapter 15: Batch Search Optimization

This example demonstrates how batch processing dramatically improves
search throughput by amortizing overhead costs using real Wikipedia embeddings.
"""

import os
import time
from datasets import load_dataset
from pymilvus import MilvusClient, DataType


def load_wikipedia_data(limit=50000, batch_size=1000):
    """
    Load Wikipedia embeddings from Hugging Face dataset.

    Args:
        limit: Number of records to load
        batch_size: Records per batch

    Yields:
        Batch of records formatted for Milvus
    """
    print(f"Loading Wikipedia dataset (total_records={limit})...")
    docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")

    # Use slice to get data efficiently
    for start_idx in range(0, limit, batch_size):
        end_idx = min(start_idx + batch_size, limit)
        batch_docs = docs[start_idx:end_idx]

        # Format batch for Milvus
        batch = []
        for i in range(len(batch_docs["emb"])):
            batch.append(
                {
                    "emb": batch_docs["emb"][i],
                    "title": batch_docs["title"][i][:100],  # Truncate to schema max_length (100)
                    "text": batch_docs["text"][i][:2000],  # Truncate to max length
                    "wiki_id": batch_docs["wiki_id"][i],
                    "views": float(batch_docs["views"][i]),
                }
            )

        yield batch


def get_search_vectors(num_searches=100):
    """Get real search vectors from Wikipedia dataset."""
    print(f"Loading {num_searches} search vectors from dataset...")
    docs = load_dataset("Cohere/wikipedia-22-12-simple-embeddings", split="train")

    search_vectors = []

    for i, doc in enumerate(docs):
        if i >= num_searches:
            break
        search_vectors.append(doc["emb"])

    return search_vectors


def main():
    # Connect to Milvus
    client = MilvusClient(uri=os.getenv("MILVUS_URI", "http://localhost:19530"))

    collection_name = "wiki_articles"
    num_vectors = 50000
    dim = 768

    # ========================================================================
    # Setup: Create Collection and Insert Data
    # ========================================================================
    print("=" * 70)
    print("Setup: Creating collection and inserting data")
    print("=" * 70)

    if client.has_collection(collection_name):
        client.drop_collection(collection_name)

    # Create collection
    schema = client.create_schema(auto_id=True, enable_dynamic_field=False)
    schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
    schema.add_field(field_name="emb", datatype=DataType.FLOAT_VECTOR, dim=dim)
    schema.add_field(field_name="title", datatype=DataType.VARCHAR, max_length=100)
    schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=5000)
    schema.add_field(field_name="wiki_id", datatype=DataType.INT64)
    schema.add_field(field_name="views", datatype=DataType.FLOAT)

    client.create_collection(collection_name=collection_name, schema=schema)

    # Create HNSW index
    index_params = client.prepare_index_params()
    index_params.add_index(
        field_name="emb",
        index_type="HNSW",
        metric_type="COSINE",
        params={"M": 16, "efConstruction": 200},
    )
    client.create_index(collection_name=collection_name, index_params=index_params)

    # Insert data in batches
    batch_num = 0
    total_inserted = 0
    for batch in load_wikipedia_data(limit=num_vectors, batch_size=1000):
        client.insert(collection_name=collection_name, data=batch)
        batch_num += 1
        total_inserted += len(batch)
        if batch_num % 10 == 0:
            print(f"Inserted {total_inserted} records...")

    print(f"Total inserted: {total_inserted} records")

    # Load collection
    client.load_collection(collection_name)

    search_params = {"metric_type": "COSINE", "params": {"ef": 128}}

    # ========================================================================
    # Example 1: Single Search
    # ========================================================================
    print("\n" + "=" * 70)
    print("Example 1: Single Search (Sequential)")
    print("=" * 70)

    search_vectors = get_search_vectors(num_searches=100)
    print("Running 100 sequential single searches...")

    start_time = time.time()
    for search_vector in search_vectors:
        client.search(
            collection_name=collection_name,
            data=[search_vector],
            anns_field="emb",
            search_params=search_params,
            limit=10,
        )
    total_time = time.time() - start_time

    print(f"Total time: {total_time:.2f}s")
    print(f"Average time per search: {total_time / len(search_vectors) * 1000:.2f}ms")
    print(f"Throughput: {len(search_vectors) / total_time:.2f} searches/sec")

    # ========================================================================
    # Example 2: Batch Search
    # ========================================================================
    print("\n" + "=" * 70)
    print("Example 2: Batch Search (All at Once)")
    print("=" * 70)

    print("Running single batch search with 100 searches...")

    start_time = time.time()
    results = client.search(
        collection_name=collection_name,
        data=search_vectors,
        anns_field="emb",
        search_params=search_params,
        limit=10,
    )
    total_time = time.time() - start_time

    print(f"Total time: {total_time:.2f}s")
    print(f"Average time per search: {total_time / len(search_vectors) * 1000:.2f}ms")
    print(f"Throughput: {len(search_vectors) / total_time:.2f} searches/sec")

    print("\nSample results from first search:")
    for i, hit in enumerate(results[0][:5]):
        print(f"  {i + 1}. ID: {hit['id']}, Distance: {hit['distance']:.4f}")

    # ========================================================================
    # Cleanup
    # ========================================================================
    print("\nCleaning up...")
    client.release_collection(collection_name)
    client.drop_collection(collection_name)
    print("Done!")


if __name__ == "__main__":
    main()
  • Chapter 16: Implementing Multi-Tenancy in Milvus — This standout chapter details different strategies for isolating tenant data, ranging from database-level isolation to collection-level and partition-key-based approaches.

Database-Level Multi-Tenancy in Milvus

This notebook demonstrates how to implement database-level isolation for multiple tenants using Milvus. Each tenant gets their own dedicated database, providing the strongest form of tenant isolation.

Prerequisites

  • Running Milvus instance (v2.5.x or later)
  • PyMilvus SDK (v2.5.8 or compatible)

Setup and Configuration
# Install required packages if not already installed
# !pip install pymilvus==2.5.8
from pymilvus import MilvusClient, DataType
import random

# Configuration
MILVUS_URI = "http://localhost:19530"  # Update this to your Milvus instance
client = MilvusClient(uri=MILVUS_URI)

print(f"Connected to Milvus at {MILVUS_URI}")

Define Tenant Configurations

Each tenant will have their own database with specific collections.

# Define tenant configurations
TENANTS = {
    "company_a": {"db_name": "tenant_company_a", "collections": ["products", "users"]},
    "company_b": {
        "db_name": "tenant_company_b",
        "collections": ["inventory", "customers"],
    },
    "company_c": {
        "db_name": "tenant_company_c",
        "collections": ["documents", "analytics"],
    },
}

print(f"Configured {len(TENANTS)} tenants:")
for tenant_id, config in TENANTS.items():
    print(
        f"  - {tenant_id}: {config['db_name']} with collections {config['collections']}"
    )

Tenant Database Setup Functions

def setup_tenant_database(tenant_id, config):
    """Set up a dedicated database for a tenant"""
    db_name = config["db_name"]

    # Create database for tenant
    try:
        client.create_database(db_name)
        print(f"✓ Created database '{db_name}' for {tenant_id}")
    except Exception as e:
        print(f"Database '{db_name}' may already exist: {e}")

    # Switch to tenant database
    client.using_database(db_name)

    # Create collections for tenant
    for coll_name in config["collections"]:
        schema = client.create_schema(auto_id=True, enable_dynamic_field=True)

        # Add fields (example schema)
        schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True)
        schema.add_field(field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=768)
        schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000)
        schema.add_field(
            field_name="category", datatype=DataType.VARCHAR, max_length=100
        )

        # Create collection
        try:
            client.create_collection(collection_name=coll_name, schema=schema)
            print(f"✓ Created collection '{coll_name}' in '{db_name}'")

            # Create index for the vector field (MilvusClient expects an
            # IndexParams object built via prepare_index_params)
            index_params = client.prepare_index_params()
            index_params.add_index(
                field_name="vector", index_type="FLAT", metric_type="COSINE"
            )
            client.create_index(
                collection_name=coll_name, index_params=index_params
            )
            print(f"✓ Created index for '{coll_name}'")

        except Exception as e:
            print(f"Collection '{coll_name}' may already exist: {e}")

    # Switch back to default database
    client.using_database("default")
    return True


def insert_sample_data(tenant_id, collection_name, num_records=100):
    """Insert sample data into a tenant's collection"""
    db_name = TENANTS[tenant_id]["db_name"]

    # Switch to tenant database
    client.using_database(db_name)

    # Generate sample data
    data = []
    categories = ["electronics", "books", "clothing", "home", "sports"]

    for i in range(num_records):
        data.append(
            {
                "vector": [random.random() for _ in range(768)],
                "text": f"{tenant_id} item {i + 1}",
                "category": random.choice(categories),
            }
        )

    # Insert data
    client.insert(collection_name=collection_name, data=data)

    # Load collection for searching
    client.load_collection(collection_name)

    print(f"✓ Inserted {num_records} records into {tenant_id}/{collection_name}")

    # Switch back to default database
    client.using_database("default")
    return True


def query_tenant_data(tenant_id, collection_name, query_vector, limit=5):
    """Query data from a specific tenant's database"""
    db_name = TENANTS[tenant_id]["db_name"]

    # Switch to tenant database
    client.using_database(db_name)

    # Perform search - data is completely isolated
    results = client.search(
        collection_name=collection_name,
        data=[query_vector],
        limit=limit,
        output_fields=["text", "category"],
    )

    # Switch back to default database
    client.using_database("default")

    return results

Setup All Tenant Databases

Create separate databases for each tenant with their specific collections.

# Setup databases for all tenants
print("Setting up tenant databases...")
for tenant_id, config in TENANTS.items():
    print(f"\nSetting up {tenant_id}:")
    setup_tenant_database(tenant_id, config)

Insert Sample Data for Each Tenant

# Insert sample data for each tenant
print("Inserting sample data...")
for tenant_id, config in TENANTS.items():
    print(f"\nInserting data for {tenant_id}:")
    for collection_name in config["collections"]:
        insert_sample_data(tenant_id, collection_name, num_records=50)

Demonstrate Tenant Isolation

Show that each tenant can only access their own data.

# Generate a query vector
query_vector = [random.random() for _ in range(768)]

print("Demonstrating tenant isolation:")
print("\nQuerying Company A's products:")
results_a = query_tenant_data("company_a", "products", query_vector, limit=3)
for i, result in enumerate(results_a[0]):
    print(
        f"  {i + 1}. {result['entity']['text']} ({result['entity']['category']}) - Score: {result['distance']:.4f}"
    )

print("\nQuerying Company B's inventory:")
results_b = query_tenant_data("company_b", "inventory", query_vector, limit=3)
for i, result in enumerate(results_b[0]):
    print(
        f"  {i + 1}. {result['entity']['text']} ({result['entity']['category']}) - Score: {result['distance']:.4f}"
    )

print("\nQuerying Company C's documents:")
results_c = query_tenant_data("company_c", "documents", query_vector, limit=3)
for i, result in enumerate(results_c[0]):
    print(
        f"  {i + 1}. {result['entity']['text']} ({result['entity']['category']}) - Score: {result['distance']:.4f}"
    )

Verify Database Isolation

# List all databases to verify isolation
databases = client.list_databases()
print("All databases in Milvus:")
for db in databases:
    print(f"  - {db}")

print(
    f"\nTotal tenant databases created: {len([db for db in databases if db.startswith('tenant_')])}"
)

Test Cross-Tenant Access (Should Fail)

Demonstrate that tenants cannot access each other’s data.

# Try to access Company B's data from Company A's database context
print("Testing cross-tenant access restriction:")
try:
    # Switch to Company A's database
    client.using_database(TENANTS["company_a"]["db_name"])

    # Try to access Company B's collection (this should fail)
    results = client.search(
        collection_name="inventory",  # Company B's collection
        data=[query_vector],
        limit=5,
    )
    print("❌ Cross-tenant access succeeded (this shouldn't happen!)")
except Exception as e:
    print(f"✓ Cross-tenant access properly blocked: {str(e)[:100]}...")
finally:
    # Switch back to default database
    client.using_database("default")

Performance and Resource Analysis

# Analyze resource usage per tenant
print("Database-Level Multi-Tenancy Analysis:")
print("\nBenefits:")
print("  ✓ Maximum data isolation (100% secure)")
print("  ✓ Simple tenant lifecycle management")
print("  ✓ Granular RBAC control per database")
print("  ✓ Zero risk of data leakage between tenants")

print("\nLimitations:")
print("  ⚠ Limited to ~64 tenants (Milvus database limit)")
print("  ⚠ 40-60% higher memory usage per tenant")
print("  ⚠ 3-5x more operational complexity")
print("  ⚠ Higher infrastructure costs")

total_collections = sum(len(config["collections"]) for config in TENANTS.values())
print("\nCurrent Setup:")
print(f"  - Total tenants: {len(TENANTS)}")
print(f"  - Total databases: {len(TENANTS)}")
print(f"  - Total collections: {total_collections}")
print(f"  - Average collections per tenant: {total_collections / len(TENANTS):.1f}")

Cleanup (Optional)

Uncomment and run the following cell to clean up the created databases.

# Cleanup - Uncomment to remove all tenant databases
# print("Cleaning up tenant databases...")
# for tenant_id, config in TENANTS.items():
#     try:
#         client.drop_database(config["db_name"])
#         print(f"✓ Dropped database {config['db_name']}")
#     except Exception as e:
#         print(f"Error dropping {config['db_name']}: {e}")
# print("Cleanup completed!")

Summary

This notebook demonstrated database-level multi-tenancy in Milvus, which provides:

  • Strongest Isolation: Complete data separation at the database level
  • Simple Management: Each tenant has their own namespace
  • Security: Zero risk of cross-tenant data access
  • Scalability Limit: Suitable for up to 64 enterprise tenants

This approach is ideal for enterprise customers requiring strict compliance and data isolation guarantees.
  • Chapter 17: Real-world Applications and Integrations — The book concludes by showing how Milvus integrates with frameworks like LangChain for RAG and image retrieval systems using pre-trained models like ResNet50.

Conclusion

I found this book incredibly useful. Even though, as mentioned, I consider myself a bit experienced with Milvus, the depth of technical detail helped me significantly strengthen my knowledge. Whether you are just starting with local testing or managing large-scale clusters on platforms like watsonx Data, this handbook provides the architectural clarity needed to build truly high-performance vector search systems.

>>> Thanks for reading <<<
