Franck Pachot

Posted on May 29

$exists and non-sparse indexes in MongoDB and in other DocumentDB

#postgres #documentdb #mongodb #database

In SQL databases, NULL represents an unknown value — not the absence of a value. When a value is simply non-applicable for a given entity, the correct relational modeling approach is normalization: the entity gets no row in the relevant table at all, rather than a NULL in a column. This distinction becomes tricky with OUTER JOIN results, where the absence of a row is surfaced as NULL across all columns of the unmatched side, including key columns — making it easy to confuse "unknown value" with "no row existed."

MongoDB has its own subtlety: a field can be explicitly set to null or simply not exist in the document at all. In the BSON representation, these are distinct — one is a key with a null-typed value, the other is the absence of the key entirely. The schema is flexible: you can define a field or not. But in indexes, this distinction disappears. Except for partial indexes, indexes must have a key value for every document it covers. For documents where the field is missing, MongoDB uses null as a stand-in — the same key value used for explicit nulls. This means an index scan cannot distinguish between the two states, and resolving null vs. missing requires fetching the full document to apply a residual filter.

Consequently, a standard index scan with a filter on null or $exists is inexact: the query planner performs an index scan on the null key and then fetches the full document to verify whether the field is truly null or simply absent.

An example: `{ $exists: true }` filter

When you query with { num: { $exists: true } }, you expect MongoDB to use an index on num. Let's test it on MongoDB, as well as some emulations: Oracle Database, Amazon DocumentDB (AWS), and DocumentDB extension on PostgreSQL (Microsoft).

Here is my test collection:

db.test.insertMany([
  { _id: 1, num: 42   },
  { _id: 2, num: 7    },
  { _id: 3, num: null },
  { _id: 4            },
  { _id: 5, num: 99   },
  { _id: 6            },
  { _id: 7, num: null },
  { _id: 8, num: 15   }
])

I have inserted eight documents:

four with real values (_id 1, 2, 5, 8),
two with the field explicitly set to null (_id 3, 7), and
two where the field is entirely absent (_id 4, 6).

The query { num: { $exists: true } } should return six documents — everything except _id 4 and 6.

Before touching indexes, notice that $exists is not the same as a null check:

db.test.find({ num: null })

[
  { _id: 3, num: null },
  { _id: 4 },
  { _id: 6 },
  { _id: 7, num: null }
]

db.test.find({ num: { $exists: false } })

[
  { _id: 4 },
  { _id: 6 }
]

db.test.find({ num: { $exists: true } })

[
  { _id: 1, num: 42 },
  { _id: 2, num: 7 },
  { _id: 3, num: null },
  { _id: 5, num: 99 },
  { _id: 7, num: null },
  { _id: 8, num: 15 }
]

A field set to null exists. A field not written into the document does not. This distinction is perfectly clear at the document level. At the index level, it is not.

Null in the index key is ambiguous

When MongoDB builds a B-tree index on num, it must create an entry for every document. For documents with no num field, the index key exists with a null value. For documents where num is explicitly set to null, it also stores null. Both cases produce the same index key.

Here is how the non-sparse index looks:

Non-sparse index on { num: 1 }:

  null  →  _id:3  { num: null }    explicit null
  null  →  _id:4  { }              missing field
  null  →  _id:6  { }              missing field
  null  →  _id:7  { num: null }    explicit null
  7     →  _id:2
  15    →  _id:8
  42    →  _id:1
  99    →  _id:5

The four entries under the null key are indistinguishable from the index alone. To evaluate $exists, the engine must read the actual document. This is called a residual predicate — a filter condition the index cannot resolve, deferred to a later fetch stage.

Another way to look at it: the document schema is flexible, with no structure declared upfront and fields that may or may not exist, whereas indexes are different—their schema is declared, and the key fields always exist.

MongoDB with a non-sparse index

I create a regular index, which is by default non-sparse and has one index entry per document (or more for multi-key indexes).

db.test.createIndex({ num: 1 })

db.test.find({ num: { $exists: true } }).explain("executionStats")

The execution plan shows what happens across the IXSCAN and FETCH stages:

executionStats: {
  nReturned: 6,
  totalKeysExamined: 8,
  totalDocsExamined: 8,
  executionStages: {
    stage: 'FETCH',
    filter: { num: { '$exists': true } },
    nReturned: 6,
    docsExamined: 8,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: 8,
      isSparse: false,
      indexBounds: { num: [ '[MinKey, MaxKey]' ] },
      keysExamined: 8
    }
  }
}

The IXSCAN returns all 8 index entries across the full [MinKey, MaxKey] range. The FETCH stage then reads all 8 documents and applies filter: { num: { $exists: true } } as a residual predicate, discarding _id 4 and 6. Notice docsExamined: 8 but nReturned: 6 — two fetches were wasted. The index was used, but the null bucket forced unnecessary work.

MongoDB with a sparse index

A sparse index excludes documents where the indexed field is entirely absent. It does not exclude explicit null values. Documents _id 3 and 7 have num: null and are still indexed.

db.test.createIndex({ num: 1 }, { sparse: true })

db.test.find({ num: { $exists: true } }).explain("executionStats")

As I have no projection, there is still a FETCH, but only for the documents in the final result:

executionStats: {
  nReturned: 6,
  totalKeysExamined: 6,
  totalDocsExamined: 6,
  executionStages: {
    stage: 'FETCH',
    nReturned: 6,
    docsExamined: 6,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: 6,
      isSparse: true,
      indexBounds: { num: [ '[MinKey, MaxKey]' ] },
      keysExamined: 6
    }
  }
}

keysExamined dropped from 8 to 6 — the two missing-field documents are not in the index. More importantly, the FETCH stage has no filter. There is no residual predicate. Every document pointed to by the sparse index either has a real value or has an explicit null — both satisfy $exists: true. The index structure itself proves the condition. The FETCH still happens because find() needs to return the documents, but it is doing useful work only, not wasted disambiguation.

Here is how the sparse index looks:

Sparse index on { num: 1 }:

  null  →  _id:3  { num: null }    explicit null — indexed
  null  →  _id:7  { num: null }    explicit null — indexed
  7     →  _id:2
  15    →  _id:8
  42    →  _id:1
  99    →  _id:5

  _id:4  { }  — not indexed
  _id:6  { }  — not indexed

The null bucket still exists in a sparse index, but it contains only explicit nulls. The ambiguity is gone.

Oracle Database

I reproduced the same on Oracle Database with the MongoDB emulation:

ora> db.test.createIndex({ num: 1 })
num_1
ora> db.test.find({ num: { $exists: true } }).explain("executionStats")
{
  queryPlanner: {
    namespace: 'ora.test',
    parsedQuery: { num: { '$exists': true } },
    rewrittenQuery: { num: { '$exists': true } },
    generatedSql: `select "DATA",rawtohex("RESID"),"ETAG" from "ORA"."test" where JSON_EXISTS("DATA",'$?(exists(@.num)) ' type(strict))`,
    winningPlan: ' Plan Hash Value  : 3552627291 \n' +
      '\n' +
      '--------------------------------------------------------------------------------------------------\n' +
      '| Id  | Operation                             | Name            | Rows | Bytes | Cost | Time     |\n' +
      '--------------------------------------------------------------------------------------------------\n' +
      '|   0 | SELECT STATEMENT                      |                 |    1 | 24501 |    2 | 00:00:01 |\n' +
      '|   1 |   TABLE ACCESS BY INDEX ROWID BATCHED | test            |    1 | 24501 |    2 | 00:00:01 |\n' +
      '|   2 |    HASH UNIQUE                        |                 |    1 | 24501 |      |          |\n' +
      '| * 3 |     INDEX RANGE SCAN (MULTI VALUE)    | $ora:test.num_1 |    1 |       |    1 | 00:00:01 |\n' +
      '--------------------------------------------------------------------------------------------------\n' +
      '\n' +
      'Predicate Information (identified by operation id):\n' +
      '------------------------------------------\n' +
      `* 3 - access(JSON_QUERY("DATA" /*+ LOB_BY_VALUE */ FORMAT OSON , '$."num"[*]' RETURNING ANY ORA_RAWCOMPARE ASIS WITHOUT ARRAY WRAPPER ERROR ON ERROR PRESENT ON EMPTY NULL ON MISMATCH TYPE(LAX)\n` +
      "  MULTIVALUE)>HEXTORAW('01'))\n" +
      '\n' +
      '\n' +
      'Notes\n' +
      '-----\n' +
      '- Dynamic sampling used for this statement ( level = 2 )\n' +
      '\n'
  },
  serverInfo: { host: 'localhost', port: 27017, version: '7.0.22' },
  ok: 1
}
ora>

It doesn't display the execution statistics, but I can get it from the SQL endpoint:

sql> select /*+ gather_plan_statistics */ "DATA",rawtohex("RESID"),"ETAG" from "ORA"."test" where JSON_EXISTS("DATA",'$?(exists(@.num)) ' type(strict));

DATA                    RAWTOHEX("RESID")    ETAG
_______________________ ____________________ ___________________________________
{"_id":3,"num":null}    03C104               523160F3D2777CB2E0637B5B000A71CD
{"_id":7,"num":null}    03C108               523160F3D27F7CB2E0637B5B000A71CD
{"_id":2,"num":7}       03C103               523160F3D2757CB2E0637B5B000A71CD
{"_id":8,"num":15}      03C109               523160F3D2817CB2E0637B5B000A71CD
{"_id":1,"num":42}      03C102               523160F3D2737CB2E0637B5B000A71CD
{"_id":5,"num":99}      03C106               523160F3D27B7CB2E0637B5B000A71CD

6 rows selected.

sql> select * from dbms_xplan.display_cursor(format=>'allstats last');

PLAN_TABLE_OUTPUT
____________________________________________________________________________________________________________________
SQL_ID  c08vsvqpn75vw, child number 0
-------------------------------------
select /*+ gather_plan_statistics */ "DATA",rawtohex("RESID"),"ETAG"
from "ORA"."test" where JSON_EXISTS("DATA",'$?(exists(@.num)) '
type(strict))

Plan hash value: 3552627291

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name            | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                 |      1 |        |      6 |00:00:00.01 |       2 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| test            |      1 |      1 |      6 |00:00:00.01 |       2 |
|   2 |   HASH UNIQUE                       |                 |      1 |      1 |      6 |00:00:00.01 |       1 |
|*  3 |    INDEX RANGE SCAN (MULTI VALUE)   | $ora:test.num_1 |      1 |      1 |      6 |00:00:00.01 |       1 |
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("test"."SYS_NC00005$">HEXTORAW('01'))

Note
-----
   - dynamic statistics used: dynamic sampling (level=2)


26 rows selected.

The index range scan returned 6 entries, as if it were a sparse index. We cannot create a sparse index on Oracle Database:

ora> db.test.dropIndex({ num: 1 })
{ nIndexesWas: 2, ok: 1 }

ora> db.test.createIndex({ num: 1 } , { sparse: 1 })
MongoServerError[MONGO-67]: Unsupported index option: sparse

Amazon DocumentDB (AWS)

AWS DocumentDB speaks the MongoDB wire protocol but is built on a completely different architecture. The storage layer is distributed like Aurora, replicated across three availability zones. The query planner and storage engine are specific to Amazon DocumentDB and deliver performance characteristics that differ from both MongoDB and standard PostgreSQL.

Non-sparse index on Amazon DocumentDB

The { num: { $exists: true } } query does not use the non-sparse index (created as db.test.createIndex({ num: 1 })) on Amazon DocumentDB (tested on version 8, planner version 3):

queryPlanner: {
  plannerVersion: 3,
  winningPlan: { stage: 'COLLSCAN', filter: { num: { '$exists': true } } }
},
executionStats: {
  nReturned: '6',
  executionTimeMillis: '14.121',
  planningTimeMillis: '14.019',
  executionStages: {
    stage: 'COLLSCAN',
    nReturned: '6',
    executionTimeMillisEstimate: '0.025'
  }
}

The index is completely abandoned. The planner chose a full collection scan.

Sparse index on Amazon DocumentDB

With the index created as db.test.createIndex({ num: 1 }, { sparse: true }), the index is used:

queryPlanner: {
  plannerVersion: 3,
  winningPlan: { stage: 'IXSCAN', indexName: 'num_1', direction: 'forward' }
},
executionStats: {
  nReturned: '6',
  executionTimeMillis: '10.034',
  planningTimeMillis: '8.128',
  executionStages: {
    stage: 'IXSCAN',
    nReturned: '6',
    executionTimeMillisEstimate: '1.842',
    indexName: 'num_1',
    direction: 'forward'
  }
}

Every entry in the sparse index provably satisfies $exists: true. It scans 6 index entries and returns 6 documents. While a sparse index is optional in MongoDB, it is mandatory in Amazon DocumentDB to use an index for this query at all.

Microsoft DocumentDB on PostgreSQL

Microsoft DocumentDB is implemented as an open-source PostgreSQL extension, accessed via the MongoDB wire protocol through a compatible endpoint.

With DocumentDB on PostgreSQL, a sparse index is not required for an optimal access path. I created the index as db.test.createIndex({ num: 1 }) and used a hint to force the index, since on a small collection the cost-based planner would otherwise prefer a sequential scan:

db.test.find(
  { num: { $exists: true } }
).hint("num_1").explain("executionStats")

(Be careful when using a hint with MongoDB queries, as it may change the result, limiting the scan to what is indexed)

The execution plan reads only the necessary entries from the index:

executionStats: {
  nReturned: Long('6'),
  executionTimeMillis: 0.093,
  executionStartAtTimeMillis: 0.089,
  totalDocsExamined: Long('6'),
  totalKeysExamined: Long('6'),
  executionStages: {
    stage: 'FETCH',
    nReturned: Long('6'),
    executionTimeMillis: 0.093,
    executionStartAtTimeMillis: 0.089,
    totalKeysExamined: 6,
    numBlocksFromCache: 24,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: Long('6'),
      executionTimeMillis: 0.093,
      executionStartAtTimeMillis: 0.089,
      indexName: 'num_1',
      totalKeysExamined: 6,
      numBlocksFromCache: 24
    }
  }
}

It shows the same count for index entries (totalKeysExamined: 6) and documents fetched (totalDocsExamined: 6).

Here we can go further: we can bypass the MongoDB layer entirely and query PostgreSQL directly, seeing exactly what the database engine sees.

To understand why, we can look at the underlying implementation from the PostgreSQL catalog:

\d documentdb_data.documents_15

           Table "documentdb_data.documents_15"
     Column      |  Type  | Collation | Nullable | Default
-----------------+--------+-----------+----------+---------
 shard_key_value | bigint |           | not null |
 object_id       | bson   |           | not null |
 document        | bson   |           | not null |
Indexes:
    "collection_pk_15" PRIMARY KEY, btree (shard_key_value, object_id)
    "documents_rum_index_47" documentdb_extended_rum
        (document bson_extended_rum_composite_path_ops
         (pathspec='[ "num" ]', tl='2691'))

There are no individual columns for num, name, or any other document field. The entire document is stored as a single bson blob in the document column. PostgreSQL has no native knowledge of what is inside it. The collection name test maps to documents_15, where 15 is the collection's internal identifier.

The index is not a standard PostgreSQL B-tree. It is an Extended RUM index — documentdb_extended_rum — with a custom operator class: bson_extended_rum_composite_path_ops. RUM is an extension of GIN (Generalized Inverted Index) that adds support for ordering, range scans, and additional per-entry metadata. The operator class is the critical piece: it knows how to extract the num field from the opaque BSON blob and store it in a structure PostgreSQL can search. pathspec='[ "num" ]' tells it which field to index.

We can obtain the PostgreSQL execution plan directly using the DocumentDB API. I disabled sequential scans to override the cost-based planner's preference on this small table:

postgres=# set enable_seqscan to off;
           explain (analyze, buffers, verbose, costs off)
select document from bson_aggregation_find(
  'test',
  '{
    "find": "test",
    "filter": { "num": { "$exists": true } }
  }'::documentdb_core.bson
);
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Index Scan using num_1 on documentdb_data.documents_15 collection (actual time=0.036..0.040 rows=6 loops=1)
   Output: document
   Index Cond: (collection.document @>= '{ "num" : { "$minKey" : 1 } }'::bson)
   Buffers: shared hit=3
 Planning:
   Buffers: shared hit=26
 Planning Time: 0.416 ms
 Execution Time: 0.052 ms
(8 rows)

The $exists: true predicate has been translated into a PostgreSQL index condition: document @>= '{ "num": { "$minKey": 1 } }'::bson. This uses a custom BSON operator @>= meaning "document has field num with a value greater than or equal to MinKey."

MinKey is a special BSON sentinel value that sits below every other BSON value in the type ordering. The condition @>= MinKey therefore means "field num exists and has any BSON value at all" — which is exactly $exists: true. Existence becomes a range scan from the minimum possible value: an elegant encoding.

The RUM index is path-based and only creates entries for paths that actually exist in documents. However, documents where num is absent also have their index entry, that are scanned by the opposite filter { num: {$exists": false} }:

postgres=# explain (analyze, buffers, verbose, costs off)
select document from bson_aggregation_find(
  'test',
  '{
    "find": "test",
    "filter": { "num": { "$exists": false } }
  }'::documentdb_core.bson
);
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Index Scan using num_1 on documentdb_data.documents_15 collection (actual time=0.030..0.033 rows=2 loops=1)
   Output: document
   Index Cond: (collection.document @? '{ "num" : false }'::bson)
   Buffers: shared hit=3
 Planning:
   Buffers: shared hit=80
 Planning Time: 0.282 ms
 Execution Time: 0.070 ms
(8 rows)

This has read the two rows (rows=2) without the num field.

The complete picture

Here is the summary for { $exists: true } queries on the example above:

	Non-sparse index	Sparse index
MongoDB	FETCH ← IXSCAN, 8 keys, 8 docs, residual filter, 2 wasted fetches	FETCH ← IXSCAN, 6 keys, 6 docs, no residual filter
Amazon DocumentDB (AWS)	COLLSCAN, no index, 8 docs	IXSCAN, 6 keys
DocumentDB on PostgreSQL (Microsoft)	FETCH ← IXSCAN, 6 keys, 6 docs, no residual filter	FETCH ← IXSCAN, 6 keys, 6 docs, no residual filter
Oracle Database (ORDS)	TABLE ACCESS BY INDEX ROWID BATCHED ← HASH UNIQUE ← INDEX RANGE SCAN (MULTI VALUE), 6 keys, 6 docs	sparse index not supported

What this means in practice

🟠 The safe pattern for $exists: true queries in Amazon DocumentDB is to use a sparse index. Without it, you get a collection scan regardless of selectivity. With it, you get a direct index scan without any fetch overhead.

🟢 In MongoDB, the benefit is real but more modest: a sparse index saves the wasted fetch-and-discard work for missing-field documents and removes the residual predicate. The difference scales with how many documents are missing the field entirely.

🔴 In Oracle Database where traditionally indexes do not store null keys, indexes act like they are sparse.

🔵 In DocumentDB on PostgreSQL, the execution is optimal with a standard non-sparse RUM index. In case of doubt, verify plans with explain() from the MongoDB API (port 10260) and EXPLAIN from the PostgreSQL API (port 9712).

DEV Community

$exists and non-sparse indexes in MongoDB and in other DocumentDB

An example: `{ $exists: true }` filter

Null in the index key is ambiguous

MongoDB with a non-sparse index

MongoDB with a sparse index

Oracle Database

Amazon DocumentDB (AWS)

Non-sparse index on Amazon DocumentDB

Sparse index on Amazon DocumentDB

Microsoft DocumentDB on PostgreSQL

The complete picture

What this means in practice

Top comments (0)

An example: { $exists: true } filter

Null in the index key is ambiguous

MongoDB with a non-sparse index

MongoDB with a sparse index

Oracle Database

Amazon DocumentDB (AWS)

Non-sparse index on Amazon DocumentDB

Sparse index on Amazon DocumentDB

Microsoft DocumentDB on PostgreSQL

The complete picture

What this means in practice

An example: `{ $exists: true }` filter