MongoDB Multikey Indexes and Index Bound Optimization

#mongodb #database #performance

Previously, I discussed how MongoDB keeps track of whether indexed fields contain arrays. This matters because if the database knows a filter is operating on scalar values, it can optimize index range scans.

As an example, let's begin with a collection that has an index on two fields, both containing only scalar values:

db.demo.createIndex( { field1:1, field2:1 } );

db.demo.insertOne(
 { _id: 0, field1: 2           , field2: "y"               },
);

With arrays, each combination is stored as a separate entry in the index. Instead of diving into the internals (as in the previous post), you can visualize this with an aggregation pipeline by unwinding every field:

db.demo.aggregate([ 
 { $unwind: "$field1" }, 
 { $unwind: "$field2" }, 
 { $project: { field1: 1, field2: 1, "document _id": "$_id", _id: 0 } },
 { $sort: { "document _id": 1 } } 
]);

[ { field1: 2, field2: 'y', 'document _id': 0 } ]

Note that this is simply an example to illustrate how index entries are created, ordered, and how they belong to a document. In a real index, the internal key is used. However, since showRecordId() can be used only on cursors and not in aggregation pipelines, I displayed "_id" instead.

Single-key index with bound intersection

As I have documents with only scalars, there's only one entry for this document, and the index is not multikey and this is visible in the execution plan as isMultiKey: false and multiKeyPaths: { field1: [], field2: [] } with empty markers:

db.demo.find(
 { field1: { $gt: 1, $lt: 3 } }
).explain("executionStats").executionStats;

{
  executionSuccess: true,
  nReturned: 1,
  executionTimeMillis: 0,
  totalKeysExamined: 1,
  totalDocsExamined: 1,
  executionStages: {
    isCached: false,
    stage: 'FETCH',
    nReturned: 1,
    executionTimeMillisEstimate: 0,
    works: 2,
    advanced: 1,
    needTime: 0,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    docsExamined: 1,
    alreadyHasObj: 0,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: 1,
      executionTimeMillisEstimate: 0,
      works: 2,
      advanced: 1,
      needTime: 0,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 1,
      keyPattern: { field1: 1, field2: 1 },
      indexName: 'field1_1_field2_1',
      isMultiKey: false,
      multiKeyPaths: { field1: [], field2: [] },
      isUnique: false,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: { field1: [ '(1, 3)' ], field2: [ '[MinKey, MaxKey]' ] },
      keysExamined: 1,
      seeks: 1,
      dupsTested: 0,
      dupsDropped: 0
    }
  }
}

As the query planner knows that there's only one index entry per document, the filter { field1: {$gt: 1, $lt: 3} } can be applied as one index range indexBounds: { field1: [ '(1, 3)' ], field2: [ '[MinKey, MaxKey]' ] }, reading only the index entry (keysExamined: 1) required to get the result (nReturned: 1).

Multikey index with bound intersection

I add a document with an array in field2:

db.demo.insertOne(
 { _id:1,  field1: 2           , field2: [ "x", "y", "z" ] },
);

My visualization of the index entries show multiple keys per document:

db.demo.aggregate([ 
 { $unwind: "$field1" }, 
 { $unwind: "$field2" }, 
 { $project: { field1: 1, field2: 1, "document _id": "$_id", _id: 0 } } ,
 { $sort: { "document _id": 1 } }
]);

[
  { field1: 2, field2: 'y', 'document _id': 0 },
  { field1: 2, field2: 'x', 'document _id': 1 },
  { field1: 2, field2: 'y', 'document _id': 1 },
  { field1: 2, field2: 'z', 'document _id': 1 }
]

The index is marked as multikey (isMultiKey: true), but only for field2 (multiKeyPaths: { field1: [], field2: [ 'field2' ] }):

db.demo.find(
 { field1: { $gt: 1, $lt: 3 } }
).explain("executionStats").executionStats;

{
  executionSuccess: true,
  nReturned: 2,
  executionTimeMillis: 0,
  totalKeysExamined: 4,
  totalDocsExamined: 2,
  executionStages: {
    isCached: false,
    stage: 'FETCH',
    nReturned: 2,
    executionTimeMillisEstimate: 0,
    works: 5,
    advanced: 2,
    needTime: 2,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    docsExamined: 2,
    alreadyHasObj: 0,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: 2,
      executionTimeMillisEstimate: 0,
      works: 5,
      advanced: 2,
      needTime: 2,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 1,
      keyPattern: { field1: 1, field2: 1 },
      indexName: 'field1_1_field2_1',
      isMultiKey: true,
      multiKeyPaths: { field1: [], field2: [ 'field2' ] },
      isUnique: false,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: { field1: [ '(1, 3)' ], field2: [ '[MinKey, MaxKey]' ] },
      keysExamined: 4,
      seeks: 1,
      dupsTested: 4,
      dupsDropped: 2
    }
  }
}

As the query planner knows that there's only one value per document for field1, the multiple keys have all the same value for this field. The filter can apply on it with tight bounds ((1, 3)) and the multiple keys are deduplicated (dupsTested: 4, dupsDropped: 2).

Multikey index with larger bound

I add a document with an array in field1:

db.demo.insertOne(
 { _id:2, field1: [ 0,5 ] , field2: "x"               },
);

My visualization of the index entries show the combinations:

db.demo.aggregate([ 
 { $unwind: "$field1" }, 
 { $unwind: "$field2" }, 
 { $project: { field1: 1, field2: 1, "document _id": "$_id", _id: 0 } } ,
 { $sort: { "document _id": 1 } }
]);

[
  { field1: 2, field2: 'y', 'document _id': 0 },
  { field1: 2, field2: 'x', 'document _id': 1 },
  { field1: 2, field2: 'y', 'document _id': 1 },
  { field1: 2, field2: 'z', 'document _id': 1 },
  { field1: 0, field2: 'x', 'document _id': 2 },
  { field1: 5, field2: 'x', 'document _id': 2 }
]

If I apply the filter to the collection, the document with {_id: 2} matches because it has values of field1 greater than 1 and values of field1 lower than 3 (I didn't use $elemMatch to apply to the same element):

db.demo.aggregate([ 
 { $match: { field1: { $gt: 1, $lt: 3 } } },
 { $unwind: "$field1" }, 
 { $unwind: "$field2" }, 
 { $project: { field1: 1, field2: 1, "document _id": "$_id", _id: 0 } } ,
 { $sort: { "document _id": 1 } }
]);

[
  { field1: 2, field2: 'y', 'document _id': 0 },
  { field1: 2, field2: 'x', 'document _id': 1 },
  { field1: 2, field2: 'y', 'document _id': 1 },
  { field1: 2, field2: 'z', 'document _id': 1 },
  { field1: 0, field2: 'x', 'document _id': 2 },
  { field1: 5, field2: 'x', 'document _id': 2 }
]

However, if I apply the same filter after the $unwind, that simulates the index entries, there's no entry for {_id: 2} that verifies { field1: { $gt: 1, $lt: 3 } }:

db.demo.aggregate([ 
 { $unwind: "$field1" }, 
 { $match: { field1: { $gt: 1, $lt: 3 } } },
 { $unwind: "$field2" }, 
 { $project: { field1: 1, field2: 1, "document _id": "$_id", _id: 0 } } ,
 { $sort: { "document _id": 1 } }
]);

[
  { field1: 2, field2: 'y', 'document _id': 0 },
  { field1: 2, field2: 'x', 'document _id': 1 },
  { field1: 2, field2: 'y', 'document _id': 1 },
  { field1: 2, field2: 'z', 'document _id': 1 }
]

This shows that the query planner cannot utilize the tight range field1: [ '(1, 3)' ] anymore. Because field1 is multikey (multiKeyPaths: { field1: [ 'field1' ] }), the planner can only apply the bound to the leading index field during the index scan. The other predicate must be evaluated after fetching the document that contains the whole array (filter: { field1: { '$gt': 1 } }):

db.demo.find(
 { field1: { $gt: 1, $lt: 3 } }
).explain("executionStats").executionStats;

{
  executionSuccess: true,
  nReturned: 3,
  executionTimeMillis: 0,
  totalKeysExamined: 5,
  totalDocsExamined: 3,
  executionStages: {
    isCached: false,
    stage: 'FETCH',
    filter: { field1: { '$gt': 1 } },
    nReturned: 3,
    executionTimeMillisEstimate: 0,
    works: 7,
    advanced: 3,
    needTime: 2,
    needYield: 0,
    saveState: 0,
    restoreState: 0,
    isEOF: 1,
    docsExamined: 3,
    alreadyHasObj: 0,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: 3,
      executionTimeMillisEstimate: 0,
      works: 6,
      advanced: 3,
      needTime: 2,
      needYield: 0,
      saveState: 0,
      restoreState: 0,
      isEOF: 1,
      keyPattern: { field1: 1, field2: 1 },
      indexName: 'field1_1_field2_1',
      isMultiKey: true,
      multiKeyPaths: { field1: [ 'field1' ], field2: [ 'field2' ] },
      isUnique: false,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: { field1: [ '[-inf.0, 3)' ], field2: [ '[MinKey, MaxKey]' ] },
      keysExamined: 5,
      seeks: 1,
      dupsTested: 5,
      dupsDropped: 2
    }
  }
}

This execution plan is logically equivalent to the following:

db.demo.aggregate([  
  // simulate index entries as combinations
  { $unwind: "$field1" },  
  { $unwind: "$field2" },  
  // simulate index range scan bounds
  { $match: { field1: { $lt: 3 } } },  
  // simulate deduplication
  { $group: { _id: "$_id" } },   
  // simulate fetching the full document
  { $lookup: { from: "demo", localField: "_id", foreignField: "_id", as: "fetch" } },   
  { $unwind: "$fetch" },  
  { $replaceRoot: { newRoot: "$fetch" } },
  // apply the remaining filter
  { $match: { field1: { $gt: 1 } } }, 
]).explain();

I show this to make it easier to understand why MongoDB cannot intersect the index bounds in this case, resulting in a wider scan range ([MinKey, MaxKey]), especially when the filter must apply to multiple keys (this behavior would be different if the query used $elemMatch). MongoDB allows flexible schema where a field can be an array, but keeps track of it to optimize the index range scan when it is known that there are only scalars in a field.

Final notes

MongoDB’s flexible schema lets you embed many‑to‑many relationships within a document, improving data locality and often eliminating the need for joins. Despite this, most queries traverse one-to-many relationships, building hierarchical views for particular use cases. When optimizing access through secondary indexes, it’s important not to combine too many multikey fields. If you attempt to do so, MongoDB will block both the creation of such indexes and insertion into collections containing them:

db.demo.insertOne(
 { _id:3, field1: [ 0,5 ] , field2: [ "x", "y", "z" ] },
);

MongoServerError: cannot index parallel arrays [field2] [field1]

MongoDB doesn't allow compound indexes on two array fields from the same document, known as "parallel arrays." Indexing such fields would generate a massive number of index entries due to all possible element combinations, making maintenance and semantics unmanageable. If you attempt this, MongoDB rejects it with a "cannot index parallel arrays" error. This rule ensures index entries remain well-defined. This is per-document, so partial indexes may be created as long a same index doesn't have entries for multiple arrays from the same document.

While the execution plan may display isMultiKey and multiKeyPaths, the multiKey status is managed on a per-index and per-field path basis. This state is updated automatically as documents are inserted, updated, or deleted, and stored in the index metadata, as illustrated in the previous post, rather than recalculated dynamically for each query.

I used top-level fields in this demonstration but arrays can be nested — which is why the list is called "path". multiKeyPaths is the list of dot-separated paths within a field that cause the index to be multikey (i.e., where MongoDB encountered arrays).

MongoDB determines precise index bounds on compound indexes by tracking which fields are multikey using multiKeyPaths metadata. This allows optimized range scans on scalar fields, even if other indexed fields contain arrays. What the query planner can do depends on the path and the presence of array in a field within the path. There are examples documented in Compound Bounds of Multiple Fields from the Same Array. To show a quick example, I add a field with a nested array and an index on its fields:

db.demo.createIndex( { "obj.sub1":1, "obj.sub2":1 } )
;
db.demo.insertOne(
 { _id:4, obj: { sub1: [ 1, 2, 3 ] , sub2:  "x" } },
);
db.demo.find(
 { "obj.sub1": { $gt:1 } , "obj.sub2": { $lt: 3 }  }
).explain("executionStats").executionStats
;

The plan shows that the multikey status comes from only one array field (obj.sub1) and all filters are pushed down to the index bounds:

      keyPattern: { 'obj.sub1': 1, 'obj.sub2': 1 },
      isMultiKey: true,
      multiKeyPaths: { 'obj.sub1': [ 'obj.sub1' ], 'obj.sub2': [] },
      indexBounds: {
        'obj.sub1': [ '(1, inf.0]' ],
        'obj.sub2': [ '[-inf.0, 3)' ]
      }

I insert another document with an array at higher level:

db.demo.insertOne(
 { _id:5, obj: [ { sub1: [ 1, 2, 3 ] , sub2:  "x" } ] },
);
db.demo.find(
 { "obj.sub1": { $gt:1 } , "obj.sub2": { $lt: 3 }  }
).explain("executionStats").executionStats;

Since obj is an array, MongoDB recognizes both obj and its subfield obj.sub1 as array paths ('obj.sub1': [ 'obj', 'obj.sub1' ]). On the other hand, obj.sub2 isn't an array, though its parent (obj) is, so obj.sub2's multikey path is recorded as [ 'obj' ] ('obj.sub2': [ 'obj' ]). Consequently, only a single filter could be applied to the index bounds:

      keyPattern: { 'obj.sub1': 1, 'obj.sub2': 1 },
      isMultiKey: true,
      multiKeyPaths: { 'obj.sub1': [ 'obj', 'obj.sub1' ], 'obj.sub2': [ 'obj' ] },
      indexBounds: {
        'obj.sub1': [ '(1, inf.0]' ],
        'obj.sub2': [ '[MinKey, MaxKey]' ]
      }

Here, the array status of the parent path (obj) propagates to its subfields for multikey tracking purposes.

Unlike SQL databases, which require predefined structure and data types, MongoDB’s flexible schema lets you store arrays or scalar values in any field without prior specification. Indexes define some aspects of the schema, like field names used by queries, but support any data type, even sub-objects and arrays. As data evolves, MongoDB’s storage engine tracks which indexed fields contain arrays, and records the multikey path. The query planner leverages this to optimize index range scans, efficiently intersecting or compounding scan boundaries when it detects fields contain only scalar values.