You can learn how MongoDB stores documents internally from simple queries that expose the physical storage ordering. Some databases store records (called rows or tuples) in heap tables, using their physical location in the data files, such as ROWID in Oracle or CTID in PostgreSQL, to reference those records from index entries. In contrast, databases like MySQL's InnoDB or YugabyteDB store records in the primary key index ("clustered index", or "index-organized table"), ordered by the logical order of their primary key values, so that secondary indexes point to these logical locations with the primary key, or an encoded version of it.
Default MongoDB collections are similar to heap tables because their documents are stored independently of the primary key ("_id") exposed to the application. Internally, the WiredTiger storage engine organizes collection documents in a B+Tree structure, keyed by an internal RecordId assigned by MongoDB. This structure resembles a clustered index, but it is clustered on an internal key rather than the primary key.
MongoDB’s approach improves on traditional heap tables, especially for storing variable-size documents, because WiredTiger uses B+Tree nodes for efficient space management, reusing space and splitting pages as needed, rather than relying on settings like PCTFREE or FILLFACTOR to reserve space for updates, or SHRINK/VACUUM operations to defragment after deletes.
To cover all cases: with clustered collections, MongoDB can generate the RecordId from "_id", the primary key exposed to the application, making storage similar to how some databases organize tables in clustered indexes ("_id" can be a generated ObjectId or defined by the application at insert time). So, when looking at the storage internals, there are two keys:
- "_id" is the application's primary key, a generated surrogate key, or a natural key. It is always indexed, and this unique index, like other secondary indexes, references the document with a RecordId
- RecordId is the internal key. It can be generated from "_id" (in clustered collections), but is more generally generated as a monotonically increasing 64-bit integer during inserts. It can be considered a physical address by the query layer, but it is not directly mapped to an address in the filesystem because files are B+Tree structures. This offers physical data independence since the primary key generation pattern does not affect the storage organization. However, it is helpful to understand how it functions when reviewing execution plans.
Another perspective is that, aside from clustered collections, all indexes in MongoDB, including the primary key index on "_id", are essentially secondary indexes, similar to those found in heap table databases (such as Db2, Oracle, and PostgreSQL). However, instead of a heap table with a fixed block size and row identification tied to their physical location, MongoDB documents are stored within an internal B+Tree index using the WiredTiger engine. Both approaches have their rationale:
- SQL databases are primarily optimized for fixed, normalized schemas with small, uniform row lengths, where a PCTFREE or FILLFACTOR can be set according to the expected updates. Storing larger types involves row chaining or slicing (like Oracle LOB chunks or PostgreSQL TOAST)
- MongoDB is designed for flexible schemas, and collections can contain documents of any size up to the BSON limit. Typically, documents are tens to hundreds of kilobytes, with some larger ones reaching a few megabytes. This flexibility requires adaptable storage management and efforts to minimize fragmentation beyond a small, fixed page size. A B+Tree with a flexible leaf block size is a suitable structure for this purpose.
The document size is flexible, thanks to the storage described above, but the ideal document size is a frequent question. Until I write a dedicated blog post, here is the summary of a slide: the green area indicates where access is most efficient; the red side is acceptable for outliers if they don't grow further, but may be a sign of embedding too much; and the blue side works for small documents inserted and queried together, but may also be a sign of unnecessary normalization, meaning you should embed more to avoid scattered lookups at runtime.
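As a quick check on your own data, you can measure BSON document sizes directly from mongosh. A minimal sketch, using the $bsonSize aggregation operator (available since MongoDB 4.4):

// size distribution of the documents in a collection
db.collection.aggregate([
  { $project: { size: { $bsonSize: "$$ROOT" } } },
  { $group: {
      _id: null,
      count: { $sum: 1 },
      avgSize: { $avg: "$size" },
      maxSize: { $max: "$size" }
  } }
]);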
An example to understand the internal ordering
To demonstrate how it works, I generate ten documents and insert them asynchronously, so they may be written to the database in a random order:
db.collection.drop();
Array.from({ length: 10 }, (_, i) => {
  db.collection.insertOne({
    _id: `#${String(i).padStart(5, '0')}`,
    val: Math.random()
  });
});
If I query without any filter or sort, the query planner chooses a COLLSCAN, which reads the records in RecordId order; here, that is the order in which they were inserted:
test> db.collection.find();
[
{ _id: '#00002', val: 0.07658988613973294 },
{ _id: '#00008', val: 0.39893981577036675 },
{ _id: '#00009', val: 0.5279631881196858 },
{ _id: '#00007', val: 0.8445363162277748 },
{ _id: '#00006', val: 0.01935050813731909 },
{ _id: '#00004', val: 0.0732484258238264 },
{ _id: '#00005', val: 0.7733464850237388 },
{ _id: '#00003', val: 0.3356001641172073 },
{ _id: '#00000', val: 0.8956753135566624 },
{ _id: '#00001', val: 0.4952318922619017 }
]
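You can confirm the access path with .explain(); the top stage of the winning plan should be a collection scan:

// expect 'COLLSCAN' for an unfiltered, unsorted query
db.collection.find().explain().queryPlanner.winningPlan.stage;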
Keep in mind that I'm working with just a single node here. Sharding and parallel processing might retrieve documents in a different order than how they're stored. You should not rely on any "natural" order. Instead, unless you're conducting this type of investigation, where you're guessing the physical ordering from the query layer, ensure that you use an explicit sort operation to specify the expected order of results.
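For example, a sketch of the same query with an explicit, guaranteed ordering:

// deterministic order, whatever the physical layout or sharding
db.collection.find().sort({ _id: 1 });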
I can display the RecordId with .showRecordId(), which adds it to the cursor projection:
test> db.collection.find().showRecordId();
[
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') }
]
Documentation: showRecordId()
Forcing an index with a hint
I can force an index with a hint, for example, the index on "_id", which was created automatically:
test> db.collection.find().hint( { _id: 1} ).showRecordId();
[
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]
This runs an IXSCAN instead of a COLLSCAN and returns the documents in the order of the index. You can verify it with .explain(), but it is also perceptible from the order of the documents fetched, which follows the order of "_id" rather than the order of insertion as before (also called "natural" order).
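A quick way to verify, printing only the stage under the FETCH of the winning plan:

// expect 'IXSCAN' feeding the FETCH stage
db.collection.find().hint({ _id: 1 }).explain().queryPlanner.winningPlan.inputStage.stage;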
Rather than using a hint, I can add a filter, and the query planner chooses the index. A filter like {$gt: MinKey} or {$lt: MaxKey} does not change the result, but changes the execution plan to an IXSCAN:
test> db.collection.find({_id:{$gt:MinKey}}).showRecordId();
[
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]
A filter like {$ne: null} also runs an IXSCAN, and we observe the results fetched in that order:
test> db.collection.find({_id:{$ne:null}}).showRecordId();
[
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') },
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') }
]
This technique, adding an unbounded range predicate on the indexed sort field, is used to get the index chosen for the sort in the absence of an equality predicate: MongoDB Equality, Sort, Range (ESR) without Equality (SR)
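A minimal sketch of the idea on this collection: the unbounded predicate keeps the result complete, while the index walk satisfies the sort, so no SORT stage should appear in the plan:

// sort served by the index order, without a blocking SORT stage
db.collection.find({ _id: { $gt: MinKey } }).sort({ _id: 1 }).explain().queryPlanner.winningPlan;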
Forcing a full scan with a hint for natural order
Hints specify the index definition, and you may wonder how to force a full scan instead of the index scan chosen by the query planner. Remember that the documents are stored in an internal index on RecordId. So you can hint this internal index using the $natural operator, asking for the natural order of the collection documents:
test> db.collection.find({_id:{$ne:null}}).hint({$natural:1}).showRecordId();
[
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Long('1') },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Long('2') },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Long('3') },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Long('4') },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Long('5') },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Long('6') },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Long('7') },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Long('8') },
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Long('9') },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Long('10') }
]
The documents are fetched in RecordId order from a COLLSCAN. The hint syntax allows an ascending or descending option, to start at the beginning or the end of the collection. I'm showing this to explain how records are stored internally. However, if you need a specific order, you should use sort() and let the query planner decide whether to use an index to avoid a sort operation.
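For completeness, a sketch of the descending variant, which starts from the end of the collection:

// expect documents in reverse RecordId order: 10, 9, 8, ...
db.collection.find().hint({ $natural: -1 }).showRecordId();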
MongoDB is more than a NoSQL database:
- Like many NoSQL databases, it allows you to query the indexes directly with .hint(), forcing the access path.
- Like all SQL databases, it has a query planner offering data independence, allowing you to declare the collection and the expected order with .sort() and let the database optimize the access path.
Avoid combining storage-level instructions, such as .hint(), .min(), or .max(), with declarative query filters in find() or $match, as this can undermine the query planner's guarantee that results match the query predicates. For example, hinting a partial index might lead to incomplete results.
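A hypothetical sketch of that pitfall, on an illustrative collection (the orders collection, its status field, and the filter expression are assumptions for this example):

// a partial index only contains documents matching its filter expression
db.orders.createIndex(
  { status: 1 },
  { partialFilterExpression: { status: "active" } }
);
// the planner would never pick this index for the wider predicate below,
// but forcing it with hint() may silently miss the non-"active" documents
db.orders.find({ status: { $ne: "deleted" } }).hint({ status: 1 });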
Covering indexes and "_id" projection
Understanding what is stored in the index entries helps optimize queries to use an index-only scan (covering index).
For example, the following query reads the index on "_id" and projects only "_id" (which is included by default) and "val":
test> db.collection.find(
{ _id: { $ne: null } },
{ val: 1 }
).explain().queryPlanner.winningPlan;
{
isCached: false,
stage: 'PROJECTION_SIMPLE',
transformBy: { val: 1 },
inputStage: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { _id: 1 },
indexName: '_id_',
isMultiKey: false,
isUnique: true,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { _id: [ '[MinKey, null)', '(null, MaxKey]' ] }
}
}
}
Because the index on "_id" holds only the key ("_id") and RecordId, it must fetch the document (FETCH) before the projection (PROJECTION_SIMPLE). Even if it is a primary index from the application's point of view, it is physically equivalent to a secondary index.
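Conversely, if the projection keeps only "_id", the index alone should be enough. A quick check:

// expect 'PROJECTION_COVERED': the key is in the index entry, no FETCH needed
db.collection.find({ _id: { $ne: null } }, { _id: 1 }).explain().queryPlanner.winningPlan.stage;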
I can see the same with another secondary index:
test> db.collection.createIndex( { val: 1 } );
val_1
test> db.collection.find(
{ val:{$gt: 0} },
{ val: 1 }
).explain().queryPlanner.winningPlan;
{
isCached: false,
stage: 'PROJECTION_SIMPLE',
transformBy: { val: 1 },
inputStage: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { val: 1 },
indexName: 'val_1',
isMultiKey: false,
multiKeyPaths: { val: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { val: [ '(0, inf.0]' ] }
}
}
}
Such a query projects "_id" because it is included by default, so the index on "val" does not cover all projected fields. To avoid the FETCH, I need to exclude "_id" from the projection explicitly:
test> db.collection.find(
{ val:{$gt: 0} },
{ val: 1 , _id: 0 }
).explain().queryPlanner.winningPlan;
{
isCached: false,
stage: 'PROJECTION_COVERED',
transformBy: { val: 1, _id: 0 },
inputStage: {
stage: 'IXSCAN',
keyPattern: { val: 1 },
indexName: 'val_1',
isMultiKey: false,
multiKeyPaths: { val: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { val: [ '(0, inf.0]' ] }
}
}
Another possibility: if I need to project "_id", I can add it to the index definition, making it a covering index for my query:
test> db.collection.createIndex( { val: 1 , _id: 1 } );
val_1__id_1
test> db.collection.find(
{ val:{$gt: 0} },
{ val: 1 }
).explain().queryPlanner.winningPlan;
{
isCached: false,
stage: 'PROJECTION_COVERED',
transformBy: { val: 1 },
inputStage: {
stage: 'IXSCAN',
keyPattern: { val: 1, _id: 1 },
indexName: 'val_1__id_1',
isMultiKey: false,
multiKeyPaths: { val: [], _id: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { val: [ '(0, inf.0]' ], _id: [ '[MinKey, MaxKey]' ] }
}
}
You probably don't need both indexes and can drop the non-covering one. I kept it to demonstrate that the query planner prefers the covering one, as it avoids a fetch. Here are my indexes:
test> db.collection.getIndexes();
[
{ v: 2, key: { _id: 1 }, name: '_id_' },
{ v: 2, key: { val: 1 }, name: 'val_1' },
{ v: 2, key: { val: 1, _id: 1 }, name: 'val_1__id_1' }
]
Although not visible in the index definition, all these indexes contain the RecordId in each entry, in addition to the key values. It is used only internally and has no business significance. Unlike ROWID in Oracle or CTID in PostgreSQL, the RecordId is immutable for a given document within a collection. Still, it may have a different value on each replica, or after the collection is rewritten or reorganized.
Clustered collections to organize on "_id" order
MongoDB introduced the possibility of storing documents clustered by the "_id" primary key, allowing a collection to behave like an index-organized table in an RDBMS. It is not advisable to use it widely, as it was introduced for specific purposes and is mainly used internally. I use it here to show the different behavior related to RecordId.
To behave similarly to databases that store records in the primary key (like Index Organized Table in Oracle or Clustered Indexes in SQL Server), I can use a clustered collection:
db.createCollection("clustered", {
clusteredIndex: {
key: { _id: 1 },
unique: true
}
});
db.collection.find().sort({ _id: 1 }).forEach(doc => {
db.clustered.insertOne(doc)
});
With a clustered collection, a COLLSCAN reads the documents in the order of "_id". In reality, it reads in order of RecordId, but RecordId is an encoded version of "_id" that preserves the ordering:
test> db.clustered.find().showRecordId();
[
{ _id: '#00000', val: 0.8956753135566624, '$recordId': Binary.createFromBase64('PCMwMDAwMAA=', 0) },
{ _id: '#00001', val: 0.4952318922619017, '$recordId': Binary.createFromBase64('PCMwMDAwMQA=', 0) },
{ _id: '#00002', val: 0.07658988613973294, '$recordId': Binary.createFromBase64('PCMwMDAwMgA=', 0) },
{ _id: '#00003', val: 0.3356001641172073, '$recordId': Binary.createFromBase64('PCMwMDAwMwA=', 0) },
{ _id: '#00004', val: 0.0732484258238264, '$recordId': Binary.createFromBase64('PCMwMDAwNAA=', 0) },
{ _id: '#00005', val: 0.7733464850237388, '$recordId': Binary.createFromBase64('PCMwMDAwNQA=', 0) },
{ _id: '#00006', val: 0.01935050813731909, '$recordId': Binary.createFromBase64('PCMwMDAwNgA=', 0) },
{ _id: '#00007', val: 0.8445363162277748, '$recordId': Binary.createFromBase64('PCMwMDAwNwA=', 0) },
{ _id: '#00008', val: 0.39893981577036675, '$recordId': Binary.createFromBase64('PCMwMDAwOAA=', 0) },
{ _id: '#00009', val: 0.5279631881196858, '$recordId': Binary.createFromBase64('PCMwMDAwOQA=', 0) }
]
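Out of curiosity, you can decode those bytes to see the "_id" string embedded in the RecordId. A version-dependent sketch, assuming mongosh exposes Node's TextDecoder (the KeyString encoding is an internal detail and may change):

// print the raw RecordId bytes of the first document as text
const doc = db.clustered.find().showRecordId().next();
print(new TextDecoder().decode(doc.$recordId.buffer));
// expect something like: <#00000 (a type byte, the string, and a terminator)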
The query planner is aware of this and doesn't execute an additional sort operation when I add a .sort() to my query:
test> db.clustered.find().sort({_id:1}).explain("executionStats").executionStats;
{
executionSuccess: true,
nReturned: 10,
executionTimeMillis: 0,
totalKeysExamined: 0,
totalDocsExamined: 10,
executionStages: {
isCached: false,
stage: 'COLLSCAN',
nReturned: 10,
executionTimeMillisEstimate: 0,
works: 11,
advanced: 10,
needTime: 0,
needYield: 0,
saveState: 0,
restoreState: 0,
isEOF: 1,
direction: 'forward',
docsExamined: 10
}
}
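Presumably, a descending sort can also be served without a SORT stage, by scanning the clustered collection backward. A sketch to verify:

// expect a COLLSCAN with direction: 'backward' and no SORT stage
db.clustered.find().sort({ _id: -1 }).explain().queryPlanner.winningPlan;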
Accessing a document by its "_id" in a clustered collection is fast, with an express clustered index scan:
test> db.clustered.find(
{ _id: "#00002" },
{ val: 1 }
).explain("executionStats").executionStats;
{
executionSuccess: true,
nReturned: 1,
executionTimeMillis: 0,
totalKeysExamined: 0,
totalDocsExamined: 1,
executionStages: {
isCached: false,
stage: 'EXPRESS_CLUSTERED_IXSCAN',
keysExamined: 0,
docsExamined: 1,
nReturned: 1
}
}
It doesn't account for keysExamined because no secondary index is involved. It is equivalent to a FETCH without an IXSCAN, with a single docsExamined and no keysExamined. A range query would have displayed a CLUSTERED_IXSCAN, also with no keysExamined.
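A sketch of such a range query, bounded on "_id" (the bounds are arbitrary):

// expect stage 'CLUSTERED_IXSCAN', docsExamined limited to the range, keysExamined: 0
db.clustered.find({ _id: { $gte: "#00003", $lt: "#00006" } }).explain("executionStats").executionStats;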
To provide this fast access, MongoDB translates the provided "_id" value to a RecordId. However, the opposite conversion is not possible, and secondary indexes on clustered collections do not cover "_id", even though they store an encoded version of it in the RecordId.
I create a secondary index and query with the projection of the key and "_id":
test> db.clustered.createIndex( { val: 1 } );
val_1
test> db.clustered.find(
{ val:{$gt: 0} },
{ val: 1 }
).explain().queryPlanner.winningPlan;
{
isCached: false,
stage: 'PROJECTION_SIMPLE',
transformBy: { val: 1 },
inputStage: {
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
keyPattern: { val: 1 },
indexName: 'val_1',
isMultiKey: false,
multiKeyPaths: { val: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { val: [ '(0, inf.0]' ] }
}
}
}
As with non-clustered collections, you need to either exclude "_id" from the projection or explicitly add it to the index key to get the query covered, as in the sketch below.
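For instance, excluding "_id" from the projection should be enough for the existing val_1 index to cover the query:

// expect 'PROJECTION_COVERED' with no FETCH stage
db.clustered.find(
  { val: { $gt: 0 } },
  { val: 1, _id: 0 }
).explain().queryPlanner.winningPlan.stage;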
The index definition shows the clustered index with clustered: true:
test> db.clustered.getIndexes();
[
{
v: 2,
key: { _id: 1 },
name: '_id_',
unique: true,
clustered: true
},
{ v: 2, key: { val: 1 }, name: 'val_1' }
]
Understanding the storage internals helps to understand performance and cache efficiency. If you generate a random "_id", like a UUIDv4, instead of the default ObjectId, the values are distributed over the whole index rather than appended to the last leaf of the B+Tree. If, in addition, you defined the collection as clustered, it is the documents themselves that are scattered. Such inserts are inefficient in terms of cache locality, because they bring random pages into cache rather than concentrating the inserts on a few of them.
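A minimal sketch of the contrast, using mongosh's UUID() helper and two illustrative collection names (mono and rand are assumptions):

// default ObjectId values increase monotonically: inserts append to the right-most leaf
db.mono.insertOne({ val: 1 });
// random UUIDs scatter the index entries (and, if clustered, the documents) across the whole key space
db.rand.insertOne({ _id: UUID(), val: 1 });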
Takeaways
There are two immutable keys in MongoDB collections:
- "_id" is the application visible primary key, natural or surrogate (generated by default as a global ObjectId). At the storage level, it is a field like others.
- RecordId is the key of documents in the WiredTiger B+Tree for the collection, and referenced by the WiredTiger B+Tree for secondary indexes.
Secondary indexes on fields other than "_id" store the RecordId in their entries, to find documents by a key value or range.
This holds for both regular and clustered collections. What is different is how RecordId is generated, or calculated from "_id":
- In default collections, "_id" is indexed in the same way as secondary indexes, to map an "_id" value to the RecordId and find the full document. Even though everything is stored in B+Tree structures, it behaves like a heap table at the query level. Disk and cache locality depend on the order of insertion, as RecordId is a monotonically increasing integer.
- In clustered collections, RecordId is an order-preserving encoded version of "_id", generated at insert time, so there is no need for an additional index to fetch documents by "_id". It behaves like an index-organized table at the query level, and disk and cache locality depend on the "_id" values.
Note that while I discussed clustered collections, most users do not need to create them directly, as they were introduced for internal purposes, such as time series collections. In this article, I examined the natural order without using a sort operation to illustrate how documents are stored internally, without resorting to deeper-level tools that dump the internal structures. However, in practice, you should always specify your desired sort order in queries and let the query planner select the optimal index for you. Finally, displaying the RecordId is useful for understanding storage internals, but it should not be used in your application, as it is generated independently on each replica, so you may read different values within a replica set.