Clustered index vs Non-clustered and how they look like in MongoDB

#mongodb #performance #programming #coding

When working with databases, performance optimization is often a key concern. One of the most effective ways to optimize queries is through indexing. In relational databases like SQL Server or MySQL, clustered and non-clustered indexes are important concepts for efficiently organizing and retrieving data. However, if you're moving to or working with NoSQL databases like MongoDB, you might wonder how these concepts apply. Let’s explore both and look at how MongoDB handles indexing.

What is an Index?

Before diving into clustered and non-clustered indexes, let's clarify what an index is. An index is a special data structure that improves the speed of data retrieval operations on a database table at the cost of additional storage space. Without an index, the database engine would have to perform a full scan of the entire collection or table to find the necessary records.

Clustered vs Non-Clustered Index

1. Clustered Index

Definition: A clustered index is a type of index where the data rows are stored physically on the disk in the same order as the index. In other words, the clustered index defines the order in which data is stored in a table.
Key Features:
- A table can have only one clustered index because the data can only be sorted in one way.
- The actual data of the table is part of the clustered index.
- Clustered indexes are generally created on a primary key.
- Faster data retrieval for range queries, as data is stored in sorted order.

2. Non-Clustered Index

Definition: A non-clustered index stores the data separately from the actual table. The index contains pointers to the physical location of the data rather than storing the data in the index itself.
Key Features:
- A table can have multiple non-clustered indexes.
- Non-clustered indexes are generally used to speed up lookups for columns that are not sorted physically.
- It creates a logical ordering of data that points to the physical data blocks.
- They take up additional space because they are separate from the data table.

3. When to Use Each Type

Choose a clustered index when:

You frequently run range queries
The column has high cardinality (many unique values)
You need fast data retrieval for a specific column

Opt for non-clustered indexes when:

You need multiple indexes on a table
You frequently search for specific values
You want to index columns used in JOIN, WHERE, and ORDER BY clauses

Indexing in MongoDB

MongoDB, being a NoSQL database, does not use the exact same terminology as SQL databases when it comes to indexes, but the underlying concepts of how data is indexed are similar. MongoDB uses B-tree indexes to store and retrieve data efficiently, and indexes can drastically improve query performance.

Clustered Indexes in MongoDB

MongoDB traditionally didn’t have a direct equivalent to SQL’s clustered indexes. However, as of MongoDB 5.3, MongoDB introduced clustered collections, which provide functionality similar to a clustered index in relational databases.

What are Clustered Collections in MongoDB?

A clustered collection in MongoDB organizes documents based on a specified field and stores data in the same order as the index on that field. The clustered index is embedded directly into the collection, which eliminates the need for a separate B-tree structure that MongoDB typically uses for secondary indexes.

Key Features:
- Documents are stored in the same order as the specified clustered index field.
- It is primarily useful for time-series data or use cases where data is naturally ordered by a specific field (e.g., timestamps, log data, etc.).
- MongoDB ensures that the clustered index field must be unique, so there can be only one clustered index per collection, much like SQL’s clustered index limitation.

How Clustered Collections Work in MongoDB

When you create a clustered collection, MongoDB physically organizes the documents based on the clustered index field, making range queries or searches on that field more efficient.
Unlike traditional secondary indexes, MongoDB does not need to maintain a separate index structure for clustered collections, as the documents are inherently organized according to the clustered index field.

Benchmarks

package main

import (
    "context"
    "fmt"
    "log"
    "testing"
    "time"

    "go.mongodb.org/mongo-driver/bson"
    "go.mongodb.org/mongo-driver/bson/primitive"
    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
)

var (
    uri                  = "mongodb://localhost:27017"
    dbName               = "performance_test"
    clusteredCollName    = "clustered_logs"
    nonClusteredCollName = "non_clustered_logs"
    client               *mongo.Client
    db                   *mongo.Database
    numDocuments         = 1_000_000
)

func setup() {
    if client == nil {
        var err error
        client, err = mongo.Connect(context.TODO(), options.Client().ApplyURI(uri))
        if err != nil {
            log.Fatal(err)
        }
        db = client.Database(dbName)
    }
}

func BenchmarkQueryClusteredCollection(b *testing.B) {
    setup()

    // Drop collection before the benchmark to ensure fresh test data
    db.Collection(clusteredCollName).Drop(context.TODO())

    // // Create Clustered Collection
    opts := options.CreateCollection().
        SetClusteredIndex(bson.D{{Key: "key", Value: bson.D{{Key: "_id", Value: 1}}}, {Key: "unique", Value: true}})

    err := db.CreateCollection(context.TODO(), clusteredCollName, opts)
    if err != nil {
        log.Fatal("Error creating clustered collection: ", err)
    }

    coll := db.Collection(clusteredCollName)
    startTime := time.Now()
    insertTestData(coll)
    log.Printf("Clustered Collection: Inserted %d documents in %s\n", numDocuments, time.Since(startTime))

    startID := primitive.NewObjectIDFromTimestamp(startTime)
    endID := primitive.NewObjectIDFromTimestamp(time.Now())

    b.ResetTimer() // Reset the timer for the actual query benchmarking

    for i := 0; i < b.N; i++ {
        filter := bson.D{
            {Key: "_id", Value: bson.D{
                {Key: "$gte", Value: startID},
                {Key: "$lt", Value: endID},
            }},
        }

        _, err := coll.Find(context.TODO(), filter)
        if err != nil {
            b.Fatal(err)
        }
    }
}

func BenchmarkQueryNonClusteredCollection(b *testing.B) {
    setup()

    // Drop collection before the benchmark to ensure fresh test data
    db.Collection(nonClusteredCollName).Drop(context.TODO())

    // Create Non-Clustered Collection and Index
    err := db.CreateCollection(context.TODO(), nonClusteredCollName)
    if err != nil {
        log.Fatal("Error creating non-clustered collection: ", err)
    }

    coll := db.Collection(nonClusteredCollName)
    startTime := time.Now()
    insertTestData(coll)
    log.Printf("Non-Clustered Collection: Inserted %d documents in %s\n", numDocuments, time.Since(startTime))

    startID := primitive.NewObjectIDFromTimestamp(startTime)
    endID := primitive.NewObjectIDFromTimestamp(time.Now())

    b.ResetTimer() // Reset the timer for the actual query benchmarking

    for i := 0; i < b.N; i++ {
        filter := bson.D{
            {Key: "_id", Value: bson.D{
                {Key: "$gte", Value: startID},
                {Key: "$lt", Value: endID},
            }},
        }

        _, err := coll.Find(context.TODO(), filter)
        if err != nil {
            b.Fatal(err)
        }
    }
}

func insertTestData(coll *mongo.Collection) {
    for n := 0; n < numDocuments/100; n++ {
        docs := make([]interface{}, 100)
        for i := 0; i < 100; i++ {
            docs[i] = bson.D{
                {Key: "log", Value: fmt.Sprintf("Log entry %d", n*100+i)},
            }
        }
        _, err := coll.InsertMany(context.TODO(), docs)
        if err != nil {
            log.Fatal(err)
        }
    }
}

go test -bench=.

BenchmarkQueryClusteredCollection-8
Clustered Collection: Inserted 1000000 documents in 2.032176167s
    9693            113248 ns/op
BenchmarkQueryNonClusteredCollection-8
Non-Clustered Collection: Inserted 1000000 documents in 2.738731583s
    8781            137176 ns/op
PASS

Interpretation:

Insertion Time: The clustered collection is faster at inserting documents. This is expected because, in a clustered collection, the documents are already stored in the order of the _id field, which optimizes the insertion process. In contrast, non-clustered collections require managing the separate index for _id, adding overhead.
Query Performance: The clustered collection is faster for queries, which is consistent with MongoDB's behavior. Since the data in a clustered collection is stored according to the _id field, range queries can be executed more efficiently. The non-clustered collection requires additional lookups on the _id index, which adds overhead and makes the query slightly slower.

Key Differences between SQL Clustered Indexes and MongoDB Clustered Collections

Physical Data Organization: In both SQL and MongoDB, clustered indexes determine the physical order of data. However, MongoDB only recently introduced clustered collections in version 5.3, whereas SQL has long had clustered indexes.
Default Behavior: In SQL, a primary key often defaults to a clustered index, whereas in MongoDB, the _id field is indexed by default, but not as a clustered index. You need to explicitly create a clustered collection if you want MongoDB to organize documents based on a specific field.
Use Cases: Clustered collections in MongoDB are especially beneficial for time-series data and logs where the data is naturally sequential. In SQL, clustered indexes can be applied to various types of data and use cases.

Conclusion

With the introduction of clustered collections, MongoDB now offers a way to physically organize documents in a collection based on a specified field, similar to a clustered index in relational databases. This feature is especially useful for use cases like time-series data, where the natural ordering of data can improve query efficiency.

For MongoDB users, understanding and applying clustered collections to appropriate workloads can significantly enhance query performance, particularly when dealing with sequential or range-based data.

For more details, you can refer to MongoDB’s official documentation on clustered collections.

Top comments (3)

Sander • Oct 31

Nice write-up. The documentation on MongoDB website does not describe what clustered indexes are, and don't provide the guidance when (not) to use them. You did, thanks.