A theoretical approach to migrating data from ChromaDB to a Milvus vector database
Introduction
In a recent discussion, I was asked to help with a specific, yet increasingly common, challenge: migrating vector data from a ChromaDB instance to a Milvus vector database. The request was not for general advice but for a tangible, practical solution covering the entire process. The migration involves three critical steps: first, extracting the existing vector data and associated information from the ChromaDB source; second, transforming this data, which crucially includes enriching it with new metadata to enhance its utility and search capabilities within Milvus; and finally, loading the fully prepared dataset into the new Milvus system, all while safeguarding data integrity and consistency throughout the transition.
An organization might consider migrating from ChromaDB to Milvus primarily due to scalability, performance, and advanced feature requirements that outgrow ChromaDB’s capabilities.
Why Migrate from ChromaDB to Milvus?
Organizations typically start with ChromaDB for its simplicity and ease of use, often for smaller-scale projects, local development, or applications with moderate data volumes. However, as their needs evolve, especially with a growing number of vectors, higher query QPS (queries per second), or complex filtering requirements, they might hit limitations that Milvus is designed to address.
Here’s a breakdown of the pros and cons for each, highlighting why a migration might be considered (from my own research and what I could gather):
ChromaDB
Pros:
1/ Ease of Use & Simplicity:
- Developer-Friendly API: Very easy to set up and get started, often with just a few lines of Python code.
- Lightweight: Can run in-memory, as a local file-based database, or with a simple client-server setup, making it ideal for quick prototypes and embedded applications.
- Zero-Configuration (often): For local use, it requires almost no configuration, which speeds up development.
2/ Embedded/Local Deployment:
- Excellent for local development, testing, and applications where the vector database needs to be co-located with the application.
3/ Flexible Schema (for metadata):
- Allows for dynamic addition of metadata fields without strict schema enforcement upfront, offering flexibility during early development.
Cons:
1/ Scalability Limitations:
- Less Suited for Large Datasets: While it supports persistent storage, scaling to millions or billions of vectors efficiently for high-throughput queries can be challenging compared to distributed systems.
- Limited Distributed Capabilities: Primarily designed for single-node deployments or simpler client-server architectures, making horizontal scaling for massive workloads less straightforward or non-existent in its core offering.
2/ Performance at Scale:
- May experience performance bottlenecks (latency for searches, ingestion rates) as vector counts and query complexity increase significantly.
3/ Feature Set (Compared to Milvus):
- While robust for its use case, it offers fewer advanced indexing options, query types, and management features than enterprise-grade solutions.
- Less robust ecosystem for large-scale operations, monitoring, and high availability.
4/ No Native Multi-tenancy/Isolation:
- Managing multiple isolated workloads or tenants within a single ChromaDB instance can be less sophisticated.
Milvus
Pros:
1/ Massive Scalability (Cloud-Native & Distributed):
- Designed for Billions of Vectors: Built from the ground up to handle massive datasets (billions of vectors) and high query concurrency across distributed clusters.
- Cloud-Native Architecture: Leverages Kubernetes and cloud storage (e.g., S3, Azure Blob Storage) for elastic scalability, high availability, and fault tolerance.
2/ High Performance & Low Latency:
- Optimized Indexing Algorithms: Supports a wide range of state-of-the-art ANNS (Approximate Nearest Neighbor Search) algorithms (e.g., HNSW, IVF_FLAT, IVF_SQ8) for fast searches, even on vast datasets.
- Efficient Data Ingestion: Designed for high-throughput data ingestion, crucial for frequently updated vector embeddings.
3/ Rich Feature Set:
- Advanced Filtering: Offers powerful filtering capabilities on metadata, allowing for precise semantic search combined with structured criteria.
- Diverse Data Types: Supports various data types for metadata fields, enabling more structured and complex data models.
- Time Travel & Data Versioning: Allows querying historical versions of data (useful for auditing or analyzing changes).
- Hybrid Search: Seamlessly combines vector similarity search with scalar filtering.
4/ Robust Ecosystem & Enterprise Readiness:
- Observability: Provides tools and integrations for monitoring and managing large-scale deployments.
- High Availability & Disaster Recovery: Architectural design supports robust HA and DR strategies.
- Active Community & Commercial Support: Benefits from a large open-source community and commercial support options (Zilliz Cloud).
Cons:
1/ Increased Complexity:
- More Involved Setup: Setting up and managing a Milvus cluster is significantly more complex than ChromaDB, requiring knowledge of distributed systems, Kubernetes, and cloud infrastructure.
- Higher Resource Requirements: A Milvus cluster consumes more computational and storage resources compared to a lightweight ChromaDB instance.
2/ Steeper Learning Curve:
- The API and operational aspects (indexing strategies, shard management) have a steeper learning curve.
3/ Overkill for Small Projects:
- For small-scale applications, prototypes, or local development, Milvus can be an unnecessary overhead, both in terms of resources and operational complexity.
Conclusion of Pros and Cons
An organization would typically migrate from ChromaDB to Milvus when their vector search needs mature beyond simple retrieval. This often happens when:
- The volume of vectors grows into the millions or billions.
- Query concurrency and QPS demands increase significantly.
- The need for advanced features like complex metadata filtering, hybrid search, and high availability becomes critical.
- The application requires a production-grade, fault-tolerant, and horizontally scalable vector database solution.
While Milvus introduces operational complexity, the benefits in terms of scalability, performance, and advanced features make it a compelling choice for organizations whose vector search capabilities are central to their growing product or service.
Hypothetical Migration Code: ChromaDB to Milvus Transition Logic
This code serves as a high-level conceptual blueprint for the migration process, akin to an algorithm rather than a fully hardened, production-ready solution. It illustrates the fundamental steps involved in moving data between these two vector databases.
It’s crucial to understand that this is a hypothetical implementation. Before deploying in any real-world scenario, comprehensive testing in a controlled environment is absolutely essential. You’ll need to:
- Set up both a ChromaDB and a Milvus instance in your environment.
- Populate ChromaDB with representative sample data that reflects your actual use case (including vector dimensions and metadata structures).
- Thoroughly test the script’s functionality, including data extraction, transformation (especially metadata handling), and loading, to ensure data integrity and performance.
- Validate the migrated data in Milvus by performing searches and filtering to confirm it behaves as expected.
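As a concrete illustration of the validation step, here is a minimal, database-agnostic sketch. The `source_records` mapping and `fetch_target` callable are hypothetical stand-ins for real ChromaDB and Milvus lookups:

```python
def validate_migration(source_records, fetch_target, sample_size=3):
    """Check that every source id exists in the target and that a small
    sample of vectors survived the migration unchanged.

    source_records: dict mapping id -> {"vector": [...], ...}
    fetch_target:   callable(id) -> target record dict, or None if missing
    """
    ids = list(source_records)
    # 1. Existence check: every migrated id must resolve in the target.
    missing = [rid for rid in ids if fetch_target(rid) is None]
    if missing:
        return False
    # 2. Spot-check: compare vector payloads for a small sample of ids.
    for rid in ids[:sample_size]:
        if fetch_target(rid)["vector"] != source_records[rid]["vector"]:
            return False
    return True

# Toy usage with in-memory stubs standing in for the two databases.
source = {"a": {"vector": [0.1, 0.2]}, "b": {"vector": [0.3, 0.4]}}
target = dict(source)  # pretend the migration copied everything
print(validate_migration(source, target.get))  # True when all checks pass
```

In a real run you would back `fetch_target` with a Milvus `get`/`query` call and sample far more than three records.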
The first sample code is supposed 🤞 to migrate data from ChromaDB to a Milvus database as-is!
🚩 Untested code! Based only on my past Milvus code. Testing in progress.
import chromadb
from pymilvus import MilvusClient, DataType, FieldSchema, CollectionSchema

# --- 1. CONFIGURATION ---

# ChromaDB configuration
CHROMA_DB_PATH = "./my_chroma_db"
CHROMA_COLLECTION_NAME = "my_sample_collection"

# Milvus configuration
MILVUS_URI = "http://localhost:19530"  # Or your Milvus server URI
MILVUS_COLLECTION_NAME = "my_migrated_collection"
VECTOR_DIMENSION = 128  # Ensure this matches the dimension of your vectors

# --- 2. EXTRACT DATA FROM CHROMADB ---

def get_chroma_data():
    """Extracts all data from a ChromaDB collection."""
    print("Connecting to ChromaDB...")
    client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
    collection = client.get_collection(name=CHROMA_COLLECTION_NAME)
    # Omitting `ids` retrieves every record in the collection.
    results = collection.get(
        include=["embeddings", "documents", "metadatas"]
    )
    print(f"Extracted {len(results['ids'])} items from ChromaDB.")
    return results

# --- 3. LOAD DATA INTO MILVUS ---

def create_milvus_collection(client: MilvusClient):
    """Creates a new collection in Milvus with a specified schema."""
    print(f"Creating Milvus collection '{MILVUS_COLLECTION_NAME}'...")
    # Define the schema: a varchar primary key, a varchar for documents,
    # a JSON field for metadata, and a float_vector for the embeddings.
    fields = [
        FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=256),
        FieldSchema(name="document", dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="metadata", dtype=DataType.JSON),
        FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=VECTOR_DIMENSION),
    ]
    schema = CollectionSchema(fields, description="A collection migrated from ChromaDB")
    if client.has_collection(collection_name=MILVUS_COLLECTION_NAME):
        print("Collection already exists, dropping it...")
        client.drop_collection(collection_name=MILVUS_COLLECTION_NAME)
    # With a custom schema, the index (and its metric) is declared through
    # an IndexParams object rather than bare keyword arguments.
    index_params = client.prepare_index_params()
    index_params.add_index(
        field_name="vector",
        index_type="AUTOINDEX",
        metric_type="COSINE",  # Or "L2" - must match your embedding model's metric
    )
    client.create_collection(
        collection_name=MILVUS_COLLECTION_NAME,
        schema=schema,
        index_params=index_params,
    )
    print("Milvus collection created successfully.")

def insert_milvus_data(client: MilvusClient, data: dict):
    """Inserts data extracted from ChromaDB into the Milvus collection."""
    print("Inserting data into Milvus...")
    # Prepare data for Milvus insertion. ChromaDB may return embeddings as
    # numpy arrays, so coerce each vector to a plain list of floats.
    milvus_data = [
        {
            "id": data['ids'][i],
            "document": data['documents'][i],
            "metadata": data['metadatas'][i],
            "vector": list(data['embeddings'][i]),
        }
        for i in range(len(data['ids']))
    ]
    # Insert data in batches for better performance.
    BATCH_SIZE = 100
    for i in range(0, len(milvus_data), BATCH_SIZE):
        batch = milvus_data[i:i + BATCH_SIZE]
        client.insert(
            collection_name=MILVUS_COLLECTION_NAME,
            data=batch
        )
        print(f"Inserted batch {i // BATCH_SIZE + 1}...")
    print("Data insertion complete. Flushing to disk...")
    client.flush(collection_name=MILVUS_COLLECTION_NAME)
    print("Data flushed. Milvus collection is ready.")

# --- 4. MAIN MIGRATION SCRIPT ---

if __name__ == "__main__":
    try:
        # Step 1: Extract data from ChromaDB
        chroma_data = get_chroma_data()
        if not chroma_data['ids']:
            print("ChromaDB collection is empty. Nothing to migrate.")
            raise SystemExit
        # Step 2: Connect to Milvus and prepare the collection
        print("Connecting to Milvus...")
        milvus_client = MilvusClient(uri=MILVUS_URI)
        create_milvus_collection(milvus_client)
        # Step 3: Insert the extracted data into Milvus
        insert_milvus_data(milvus_client, chroma_data)
        # Verify the migration
        stats = milvus_client.get_collection_stats(collection_name=MILVUS_COLLECTION_NAME)
        print(f"Migration complete! Milvus collection '{MILVUS_COLLECTION_NAME}' now contains {stats['row_count']} items.")
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        # It's good practice to close the Milvus client connection.
        if 'milvus_client' in locals():
            milvus_client.close()
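One practical wrinkle worth noting: ChromaDB can hand embeddings back as NumPy arrays, while Milvus insert payloads expect plain Python lists of floats. A small sketch of the coercion (the helper name is mine):

```python
import numpy as np

def to_float_list(embedding):
    """Coerce an embedding (numpy array, list, or tuple) into a plain
    Python list of floats, as expected by Milvus insert payloads."""
    return [float(x) for x in np.asarray(embedding).ravel()]

# Works the same whether the source handed back an array or a list.
print(to_float_list(np.array([0.25, 0.5])))  # [0.25, 0.5]
print(to_float_list([0.25, 0.5]))            # [0.25, 0.5]
```

Applying such a helper to every extracted vector makes the transform step robust to whichever representation the ChromaDB client returns.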
Below is the core of the enhanced migration solution: Python code designed not only to transfer the "as-is" vector data but also to actively enrich the Milvus database with valuable, newly generated metadata during the process.
The enrichment implementation consists of the following:
- Milvus Schema Modification: In create_milvus_collection, the fields list is expanded with three new FieldSchema objects: new_field_string (DataType.VARCHAR), new_field_int (DataType.INT64), and new_field_bool (DataType.BOOL). This step is crucial because Milvus requires all fields to be predefined in the collection schema before data insertion.
- Data Generation and Merging: A new helper function, generate_new_metadata, creates a dictionary with arbitrary values for the new fields.
- Entity Preparation: The insert_milvus_data function uses the **new_meta syntax (dictionary unpacking) to merge the new metadata into each entity dictionary. This produces a single, comprehensive dictionary per record that matches the Milvus collection's new schema.
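The **new_meta merge described above boils down to standard Python dictionary unpacking. A self-contained sketch, with invented field values:

```python
import random

def generate_new_metadata():
    """Arbitrary illustrative values for the three new schema fields."""
    return {
        "new_field_string": f"category_{random.choice(['A', 'B', 'C'])}",
        "new_field_int": random.randint(1, 100),
        "new_field_bool": random.choice([True, False]),
    }

# Unpacking merges the generated fields into one flat record per row,
# exactly the shape the expanded Milvus schema expects.
base = {"id": "doc-1", "document": "hello", "vector": [0.1, 0.2]}
entity = {**base, **generate_new_metadata()}
print(sorted(entity))  # original keys plus the three new fields
```

If a key appeared in both dictionaries, the right-hand operand would win; here the key sets are disjoint, so the merge is purely additive.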
🚩 Untested code! Based only on my past Milvus code. Testing in progress.
import chromadb
from pymilvus import MilvusClient, DataType, FieldSchema, CollectionSchema
import random

# --- 1. CONFIGURATION ---

# ChromaDB configuration
CHROMA_DB_PATH = "./my_chroma_db"
CHROMA_COLLECTION_NAME = "my_sample_collection"

# Milvus configuration
MILVUS_URI = "http://localhost:19530"
MILVUS_COLLECTION_NAME = "my_migrated_collection_with_new_meta"
VECTOR_DIMENSION = 128

# --- 2. EXTRACT DATA FROM CHROMADB ---

def get_chroma_data():
    """Extracts all data from a ChromaDB collection."""
    print("Connecting to ChromaDB...")
    client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
    collection = client.get_collection(name=CHROMA_COLLECTION_NAME)
    # Omitting `ids` retrieves every record in the collection.
    results = collection.get(
        include=["embeddings", "documents", "metadatas"]
    )
    print(f"Extracted {len(results['ids'])} items from ChromaDB.")
    return results

# --- 3. LOAD DATA INTO MILVUS ---

def create_milvus_collection(client: MilvusClient):
    """Creates a new collection in Milvus with new metadata fields."""
    print(f"Creating Milvus collection '{MILVUS_COLLECTION_NAME}'...")
    # Define the schema, including the three new metadata fields.
    fields = [
        FieldSchema(name="id", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=256),
        FieldSchema(name="document", dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="metadata", dtype=DataType.JSON),
        FieldSchema(name="new_field_string", dtype=DataType.VARCHAR, max_length=256),  # New String Field
        FieldSchema(name="new_field_int", dtype=DataType.INT64),  # New Integer Field
        FieldSchema(name="new_field_bool", dtype=DataType.BOOL),  # New Boolean Field
        FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=VECTOR_DIMENSION),
    ]
    schema = CollectionSchema(fields, description="A collection migrated from ChromaDB with new metadata")
    if client.has_collection(collection_name=MILVUS_COLLECTION_NAME):
        print("Collection already exists, dropping it...")
        client.drop_collection(collection_name=MILVUS_COLLECTION_NAME)
    # The vector index and its metric are declared via an IndexParams object.
    index_params = client.prepare_index_params()
    index_params.add_index(
        field_name="vector",
        index_type="AUTOINDEX",
        metric_type="COSINE",
    )
    client.create_collection(
        collection_name=MILVUS_COLLECTION_NAME,
        schema=schema,
        index_params=index_params,
    )
    print("Milvus collection created successfully with new metadata fields.")
    # An index on the new scalar fields enables efficient filtering.
    # We create one on `new_field_int` for demonstration.
    scalar_index = client.prepare_index_params()
    scalar_index.add_index(field_name="new_field_int", index_type="INVERTED")
    client.create_index(
        collection_name=MILVUS_COLLECTION_NAME,
        index_params=scalar_index,
    )
    print("Index created on 'new_field_int'.")

def insert_milvus_data(client: MilvusClient, data: dict):
    """Inserts data from ChromaDB and adds new metadata."""
    print("Inserting data into Milvus...")

    # Helper function to generate arbitrary new metadata.
    def generate_new_metadata():
        return {
            "new_field_string": f"category_{random.choice(['A', 'B', 'C'])}",
            "new_field_int": random.randint(1, 100),
            "new_field_bool": random.choice([True, False]),
        }

    # Prepare data for Milvus insertion, including the new fields.
    milvus_data = []
    for i in range(len(data['ids'])):
        new_meta = generate_new_metadata()
        entity = {
            "id": data['ids'][i],
            "document": data['documents'][i],
            "metadata": data['metadatas'][i],
            "vector": list(data['embeddings'][i]),
            **new_meta,  # Unpack the new metadata dictionary
        }
        milvus_data.append(entity)
    # Insert data in batches.
    BATCH_SIZE = 100
    for i in range(0, len(milvus_data), BATCH_SIZE):
        batch = milvus_data[i:i + BATCH_SIZE]
        client.insert(
            collection_name=MILVUS_COLLECTION_NAME,
            data=batch
        )
        print(f"Inserted batch {i // BATCH_SIZE + 1}...")
    print("✅ Data insertion complete. Flushing to disk...")
    client.flush(collection_name=MILVUS_COLLECTION_NAME)
    print("✅ Data flushed. Milvus collection is ready.")

# --- 4. MAIN MIGRATION SCRIPT ---

if __name__ == "__main__":
    try:
        # Step 1: Extract data from ChromaDB
        chroma_data = get_chroma_data()
        if not chroma_data['ids']:
            print("ChromaDB collection is empty. Nothing to migrate.🫣")
            raise SystemExit
        # Step 2: Connect to Milvus and prepare the collection
        print("✅ Connecting to Milvus...")
        milvus_client = MilvusClient(uri=MILVUS_URI)
        create_milvus_collection(milvus_client)
        # Step 3: Insert the extracted data into Milvus
        insert_milvus_data(milvus_client, chroma_data)
        # Verify the migration
        stats = milvus_client.get_collection_stats(collection_name=MILVUS_COLLECTION_NAME)
        print(f"⭐ Migration complete! Milvus collection '{MILVUS_COLLECTION_NAME}' now contains {stats['row_count']} items.")
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        if 'milvus_client' in locals():
            milvus_client.close()
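With the enriched fields in place, the payoff is scalar-filtered vector search. The sketch below is untested like the rest: filtered_search assumes a running Milvus instance and a connected pymilvus MilvusClient, while build_filter is a plain helper (my own naming) that composes a Milvus boolean filter expression:

```python
def build_filter(category: str, max_int: int) -> str:
    """Compose a Milvus boolean filter expression over the new fields."""
    return f'new_field_string == "{category}" and new_field_int <= {max_int}'

def filtered_search(client, query_vector, category, max_int, limit=5):
    """Vector search constrained by the newly added scalar metadata.
    `client` is assumed to be a connected pymilvus MilvusClient."""
    return client.search(
        collection_name="my_migrated_collection_with_new_meta",
        data=[query_vector],
        filter=build_filter(category, max_int),
        limit=limit,
        output_fields=["document", "new_field_string", "new_field_int"],
    )

print(build_filter("category_A", 50))
# new_field_string == "category_A" and new_field_int <= 50
```

This is exactly the kind of hybrid query, vector similarity combined with structured criteria, that motivated the migration in the first place.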
Conclusion
In conclusion, migrating from ChromaDB to Milvus represents a strategic step for organizations seeking to scale their vector search capabilities beyond initial prototyping or smaller-scale applications. While ChromaDB excels in ease of use and local development, Milvus offers far greater scalability, performance, and advanced features crucial for handling massive datasets and high-throughput queries in production environments. The hypothetical migration code we've explored provides a foundational blueprint for this transition, demonstrating how to extract data, define a new Milvus schema, and, most importantly, enrich the vector database with additional, valuable metadata during the loading process. This enrichment, highlighted in the refined code example, is key to unlocking more sophisticated filtering and search functionalities within Milvus. Ultimately, a well-planned migration, supported by thorough testing, allows organizations to leverage Milvus's robust, cloud-native architecture for truly enterprise-grade vector database solutions.
Links
- Milvus documents: https://milvus.io/docs
- Milvus indexes: https://milvus.io/docs/index-explained.md
- Milvus architecture: https://milvus.io/docs/architecture_overview.md
- Zilliz GitHub: https://github.com/zilliztech
- Milvus vs. ChromaDB: https://www.waterflai.ai/post/milvus-vs-chromadb-choosing-the-right-vector-database-for-your-ai-applications
- Choosing a Vector Database: Milvus vs. Chroma DB: https://zilliz.com/blog/milvus-vs-chroma
- ChromaDB: https://www.trychroma.com/
- ChromaDB GitHub: https://github.com/chroma-core/chroma
- Unlock the Power of Text Embeddings: A Guide to Milvus and Chromadb Vector Database: https://ai.gopubby.com/unlock-the-power-of-text-embeddings-a-guide-to-milvus-and-chromadb-vector-databases-2b22a7365394 (Atef Ataya)
- How easy or difficult is it to migrate from one vector database solution to another (for instance, exporting data from Pinecone to Milvus)? What standards or formats help in this process? https://milvus.io/ai-quick-reference/how-easy-or-difficult-is-it-to-migrate-from-one-vector-database-solution-to-another-for-instance-exporting-data-from-pinecone-to-milvus-what-standards-or-formats-help-in-this-process
- Vector Database Migration and Implementation: Lessons from 20 Enterprise Deployments: https://nimblewasps.medium.com/vector-database-migration-and-implementation-lessons-from-20-enterprise-deployments-027f09f7daa3 (Aarthy Ramachandran)