Franck Pachot for MongoDB

Posted on Aug 4 • Edited on Aug 5

Why MongoDB skips indexes when flattening or renaming sub-document fields in $project before $match aggregation pipeline

#mongodb #document #database #index

MongoDB is a general-purpose database that offers more than just a JSON datastore. It is recognized for its efficient indexing and aggregation capabilities within flexible schemas. The query planner optimizes the aggregation pipeline to filter data early and effectively utilize indexes.
Although this process looks similar to SQL databases, MongoDB stands out by supporting advanced indexing on nested objects and arrays.

It's important to note that some optimizer transformations applicable to tabular data in relational databases may not work the same way with documents. Keeping this in mind helps in understanding why an index is employed or not, as the query planner must ensure consistent results across all possible documents in the flexible schema.

As an example, a projection renaming a JSON dotted path before filtering can alter its semantics and hinder index usage.

Example

I store customer contact information like this:

db.customers.insertMany([
{  "customer_id": "C001",
  "contact": {
    "email": "marla.singer@fightclub.com",
    "phone": "555-1010"
  }
}
]);

{
  acknowledged: true,
  insertedIds: { '0': ObjectId('6890a5530400e901b3d4b0e1') }
}

The subdocument offers flexibility, accommodating customers who may have multiple emails, phone numbers or other contact information. However, I prefer to begin with a simple document that does not include arrays.

To enable efficient lookup by email, I create the following index:

db.customers.createIndex({ "contact.email": 1 })

contact.email_1

In a normalized SQL database, it is necessary to establish upfront that each user can only have one email address. If this is the case, the email should be stored in a column within the customers table or in a separate "contacts" table, with a unique foreign key to the customers table.

In contrast, MongoDB's document model allows flexibility, as you do not need to decide between One-to-One or One-to-Many relationships once for all future insertions. The data model remains accurate even if a customer has multiple contacts or a contact has several email addresses, thus avoiding significant refactoring when business rules change.

If you want to flatten the information in your aggregation pipeline, you may rename the contact's email to a simple top-level "email" field in a $project stage. You can achieve this by writing the following projection in an aggregation pipeline:

  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  }

You may additionally search for one email and execute the following aggregation pipeline:

db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: "marla.singer@fightclub.com" }
  }
])

[
  {
    _id: ObjectId('6890a5530400e901b3d4b0e1'),
    customer_id: 'C001',
    email: 'marla.singer@fightclub.com'
  }
]

Intuitively, this seems reasonable and you test it on your data sample, assuming one email per customer. However you realize that your index is not used:

db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: "marla.singer@fightclub.com" }
  }
]).explain( ).stages[0]['$cursor'].queryPlanner.winningPlan

{
  isCached: false,
  stage: 'PROJECTION_DEFAULT',
  transformBy: { _id: true, customer_id: true, email: '$contact.email' },
  inputStage: { stage: 'COLLSCAN', direction: 'forward' }
}

The $match operation was not pushed down to the $project stage, resulting in a COLLSCAN instead of an IXSCAN.

If you are used to SQL databases and tabular data, you might assume that the projection only renames "contact.email" to "email". This could lead to an expectation that the query planner would transform the expression $match: { "email": ... } to $match: { "contact.email": ... } and use the index on "contact.email".
However, this is not the case in a document database. The reason lies in the potential existence of arrays and embedded documents. Renaming dotted paths to a top-level field can alter the semantics based on the data's structure.

To use your index, you must start with the $match stage and $project later:

db.customers.aggregate([
  {
    $match: { "contact.email": "marla.singer@fightclub.com" }
  },
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  }
]).explain( ).queryPlanner.winningPlan

{
  isCached: false,
  stage: 'PROJECTION_DEFAULT',
  transformBy: { _id: true, customer_id: true, email: '$contact.email' },
  inputStage: {
    stage: 'FETCH',
    inputStage: {
      stage: 'IXSCAN',
      keyPattern: { 'contact.email': 1 },
      indexName: 'contact.email_1',
      isMultiKey: false,
      multiKeyPaths: { 'contact.email': [] },
      isUnique: false,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: {
        'contact.email': [ '["marla.singer@fightclub.com", "marla.singer@fightclub.com"]' ]
      }
    }
  }
}

The index was used and the result is apparently the same. However, it is semantically different and the result could be different with array fields, which is why the query planner doesn't perform the transformation.

With MongoDB flexible schema, I can insert sub-documents with different shape, like an array of multiple contacts, or even array of arrays:

db.customers.insertMany([
  {
    customer_id: "C002",
    contact: [
      { email: "robert.paulson@fightclub.com" },
      { phone: "555-2020" }
    ]
  },
  {
    customer_id: "C003",
    contact: [
      { email: ["narrator@fightclub.com", "tyler.durden@fightclub.com"] },
      { email: "jack@fightclub.com" },
    ]
  }
]);

{
  acknowledged: true,
  insertedIds: {
    '0': ObjectId('6890a74d0400e901b3d4b0e2'),
    '1': ObjectId('6890a74d0400e901b3d4b0e3')
  }
}

MongoDB accommodates changing business requirements and increasingly complex data structures throughout the application lifecycle. In C002, I introduced an array of contact elements, including email addresses and phone numbers. In C003, I added multiple values for a single element, enabling users to store numerous emails under a single "contact.email" path.

The projection of "contact.email" to the top level still allows the filter to be applied to array elements in C002. However, the projection of C003 results in an array of arrays.

db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  }
])

[
  {
    _id: ObjectId('6890a5530400e901b3d4b0e1'),
    customer_id: 'C001',
    email: 'marla.singer@fightclub.com'
  },
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e2'),
    customer_id: 'C002',
    email: [ 'robert.paulson@fightclub.com' ]
  },
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e3'),
    customer_id: 'C003',
    email: [
      [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ],
      'jack@fightclub.com'
    ]
  }
]

With such a projection, $match: { "contact.email": "robert.paulson@fightclub.com" } will find one document that has one item equal to "robert.paulson@fightclub.com". However, $match: { "contact.email": "tyler.durden@fightclub.com" } will find no document. The projection of C003 is an array that has one item being an array, and would be in the result of $match: { "contact.email": [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ] } but not on $match: { "contact.email": 'tyler.durden@fightclub.com' } alone.

With the projection first, I get a result for C002:


db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: "robert.paulson@fightclub.com" }
  }
]);

[
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e2'),
    customer_id: 'C002',
    email: [ 'robert.paulson@fightclub.com' ]
  }
]

However, still with the projection first, I get no result for C003:

db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: "tyler.durden@fightclub.com" }
  }
]);

(no result)

To get C003 I need to match one of the first-level array items fully:


db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ] }
  }
]);

[
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e3'),
    customer_id: 'C003',
    email: [
      [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ],
      'jack@fightclub.com'
    ]
  }
]

db.customers.aggregate([
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  },
  {
    $match: { email: 'jack@fightclub.com' }
  }
]);

[
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e3'),
    customer_id: 'C003',
    email: [
      [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ],
      'jack@fightclub.com'
    ]
  }
]

This is different from a query that filters first, using a dotted path:

db.customers.aggregate([
  {
    $match: { "contact.email": 'tyler.durden@fightclub.com' }
  },
  {
    $project: {
      customer_id: 1,
      email: "$contact.email"
    }
  }
]);

[
  {
    _id: ObjectId('6890a74d0400e901b3d4b0e3'),
    customer_id: 'C003',
    email: [
      [ 'narrator@fightclub.com', 'tyler.durden@fightclub.com' ],
      'jack@fightclub.com'
    ]
  }
]

When building pipelines or views in MongoDB, avoid renaming dotted fields (e.g., "contact.email" to "email") if you intend to query and index the flattened field in downstream filters, especially in views or multi-stage aggregations. MongoDB's query optimizer cannot safely push a $match down before that $project, because the semantics with arrays/embedded documents and arrays-of-arrays would result in incorrect matches.

This differs from SQL, where the schema is fixed and column renaming in views or subqueries is safe. With MongoDB’s flexible schema and support for arrays at any level, such rewrite optimizations are not always correct.

For best performance (index usage), always filter on the original dotted path before projecting/renaming it. This may require placing filtering stages early, or restructuring the pipeline to avoid flattening nested fields you want to use for indexed lookups.

Top comments (1)

Franck Pachot MongoDB • Sep 13

Another example of this on StackOverflow:

Jun 2 '21 Comments: 3 Answers: 1

I want to make use of MongoDB Views, but from my point of view, views in MongoDB seem to be made just for "static" use cases where the $match condition is set up front. Mongo does not seem to be able to push down a filter criteria properly to the…

Open Full Question