Table Of Contents
- Application Version 1: The baseline implementation
- Application Version 2: Better Understanding Indexing
- Application Version 3: Better Data Types and Field Name Shorthanding
- Application Version 4: Taking Advantage of the _id Index
Article Introduction
Welcome to the first part of the series, "The Cost of Not Knowing MongoDB". This comprehensive analysis explores how different MongoDB schema design decisions can dramatically impact application performance, demonstrating the critical importance of understanding MongoDB's underlying mechanisms.
In this first article, we examine four progressive application versions, appV1 through appV4, each representing common approaches developers take when working with MongoDB. Through detailed performance testing and analysis, we reveal how seemingly minor schema modifications can lead to significant improvements in throughput, latency, and resource utilization.
The journey begins with appV1, a baseline implementation that reflects typical patterns used by junior MongoDB developers. We then progress through increasingly optimized versions, introducing concepts such as field concatenation, data type optimization, and strategic field abbreviation. Each version builds upon the lessons learned from its predecessor, culminating in appV4.
This foundational knowledge sets the stage for Part 2, where we explore advanced patterns like the Bucket Pattern and Computed Pattern to achieve even greater performance improvements.
Application Version 1 (appV1): The baseline implementation
Introduction
The first application version, and the base case for our comparison, would have been developed by someone with a junior knowledge level of MongoDB who took a quick look at the documentation and learned that every document in a collection must have an _id field and that this field always has a unique index.
To take advantage of the _id obligatory field and index, the developer decides to store the values of key and date in an embedded document in the _id field. With that, each document will register the status totals for one user, specified by the field _id.key, in one day, specified by the field _id.date.
Schema
The application implementation presented above would have the following TypeScript document schema, named SchemaV1:
type SchemaV1 = {
_id: {
key: string;
date: Date;
};
approved?: number;
noFunds?: number;
pending?: number;
rejected?: number;
};
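For illustration, a document following SchemaV1 could look like the one below. The values are hypothetical and mirror the sample document shown later for appV2; each document registers the status totals for one user on one day.
const document: SchemaV1 = {
  _id: {
    key: "0000000000000000000000000000000000000000000000000000000000000001",
    date: new Date("2022-06-25T00:00:00.000Z"),
  },
  approved: 10,
  noFunds: 3,
  pending: 1,
  rejected: 1,
};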
Bulk Upsert
Based on the specification presented, we have the following updateOne operation for each event generated by this application version:
const operation = {
updateOne: {
filter: {
_id: { date: event.date, key: event.key },
},
update: {
$inc: {
approved: event.approved,
noFunds: event.noFunds,
pending: event.pending,
rejected: event.rejected,
},
},
upsert: true,
},
};
filter:
- Targets the document where the _id field matches { date: event.date, key: event.key }.

update:
- Uses the $inc operator to increment the counters (approved, noFunds, pending, rejected) based on the event data.

upsert:
- Ensures a new document is created if no matching document exists.
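The section shows only the per-event operation, so here is a minimal sketch of how the operations might be batched and submitted with the Node.js driver. The Event type, the batching, and the collection name appV1 are assumptions for illustration, not part of the original article.
import { Db } from "mongodb";

// Hypothetical event shape, inferred from the fields used in the operation above.
type Event = {
  key: string;
  date: Date;
  approved: number;
  noFunds: number;
  pending: number;
  rejected: number;
};

async function bulkUpsert(db: Db, events: Event[]): Promise<void> {
  const operations = events.map((event) => ({
    updateOne: {
      filter: { _id: { date: event.date, key: event.key } },
      update: {
        $inc: {
          approved: event.approved,
          noFunds: event.noFunds,
          pending: event.pending,
          rejected: event.rejected,
        },
      },
      upsert: true,
    },
  }));

  // ordered: false lets independent upserts proceed even if one operation fails.
  await db.collection("appV1").bulkWrite(operations, { ordered: false });
}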
Get Reports
To fulfill the Get Reports operation, five aggregation pipelines are required, one for each date interval. Each pipeline follows the same structure, differing only in the filtering criteria in the $match stage:
const pipeline = [
{
$match: {
"_id.key": request.key,
"_id.date": { $gte: request.date - oneYear, $lt: request.date },
},
},
{
$group: {
_id: null,
approved: { $sum: "$approved" },
noFunds: { $sum: "$noFunds" },
pending: { $sum: "$pending" },
rejected: { $sum: "$rejected" },
},
},
];
- { $match: {...} }:
  - Filters documents based on the key and date fields.
  - The "_id.key" field matches the user key provided in the request.
  - The "_id.date" field filters documents within the specified date range using $gte (greater than or equal to) and $lt (less than).
- { $group: {...} }:
  - Groups the filtered documents into a single result.
  - The _id field is set to null to group all matching documents from the previous stage together.
  - Computes the sum of the approved, noFunds, pending, and rejected fields using the $sum operator.
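As a complement, here is a minimal sketch of how the five pipelines could be run with the Node.js driver. Only the oneYear interval is named in the article, so the other interval lengths, the getReports signature, and the collection name appV1 are assumptions.
import { Db } from "mongodb";

const oneDay = 24 * 60 * 60 * 1000;

// Placeholder interval lengths; only `oneYear` is named explicitly in the article.
const intervals: Record<string, number> = {
  oneWeek: 7 * oneDay,
  oneMonth: 30 * oneDay,
  threeMonths: 90 * oneDay,
  sixMonths: 180 * oneDay,
  oneYear: 365 * oneDay,
};

async function getReports(db: Db, request: { key: string; date: Date }) {
  const reports: Record<string, unknown> = {};

  for (const [name, spanMs] of Object.entries(intervals)) {
    const pipeline = [
      {
        $match: {
          "_id.key": request.key,
          "_id.date": { $gte: new Date(request.date.getTime() - spanMs), $lt: request.date },
        },
      },
      {
        $group: {
          _id: null,
          approved: { $sum: "$approved" },
          noFunds: { $sum: "$noFunds" },
          pending: { $sum: "$pending" },
          rejected: { $sum: "$rejected" },
        },
      },
    ];

    // One round trip per interval; when no documents match, fall back to zeroed totals.
    const [totals] = await db.collection("appV1").aggregate(pipeline).toArray();
    reports[name] = totals ?? { approved: 0, noFunds: 0, pending: 0, rejected: 0 };
  }

  return reports;
}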
Indexes
Initially, appV1 aimed to use the default index on the _id field (which contained an embedded document with key and date). However, this default index on the embedded _id field was not sufficient to efficiently support the query patterns, particularly for the Get Reports function, which filters by _id.key and _id.date.
To address this, an additional compound index was created:
const keys = { "_id.key": 1, "_id.date": 1 };
const options = { unique: true };
db.appV1.createIndex(keys, options);
This explicit index on _id.key and _id.date ensures that queries filtering and sorting on these fields can be performed efficiently. The unique: true option enforces that the combination of _id.key and _id.date is unique across all documents in the collection. For a more detailed explanation of why an index on an embedded document's fields might be needed even if the top-level field is indexed, refer to Appendices - Index on Embedded Documents.
Initial Scenario Statistics
Collection Statistics
To evaluate the performance of appV1, we inserted 500 million event documents into the collection using the schema and Bulk Upsert function described earlier.
| Collection | Documents | Data Size | Avg. Document Size | Storage Size | Indexes | Index Size |
|---|---|---|---|---|---|---|
| appV1 | 359,639,622 | 39.58GB | 119B | 8.78GB | 2 | 20.06GB |
Event Statistics
To evaluate the storage efficiency per event, the Event Statistics are calculated by dividing the total Data Size and Index Size by the 500 million events.
| Collection | Data Size/Event | Index Size/Event | Total Size/Event |
|---|---|---|---|
| appV1 | 85B | 43.1B | 128.1B |
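The per-event figures are simply the collection totals divided by the 500 million events. A quick sketch of the arithmetic, assuming the sizes are reported with 1 GB = 1024^3 bytes:
const EVENTS = 500_000_000;
const GB = 1024 ** 3;

const dataSizePerEvent = (39.58 * GB) / EVENTS;  // ≈ 85 B
const indexSizePerEvent = (20.06 * GB) / EVENTS; // ≈ 43.1 B

console.log(dataSizePerEvent.toFixed(1), indexSizePerEvent.toFixed(1));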
Load Test Results
Executing the load test for appV1 and plotting it alongside the Desired values, we have the following results for Get Reports and Bulk Upsert.
Get Reports Rate
The application never reaches the target rate of 25 reports per second during the first 10 minutes of the test, peaking at only 16.5 reports per second. During the rest of the test, the rate stays at around 6 reports per second.
Get Reports Latency
Begins at 2 seconds and progressively increases throughout the test duration, reaching a maximum of 6.5 seconds with an average of 4.5 seconds.
Bulk Upsert Rate
The application only reaches the desired rate of 250 events per second during the first 10 minutes of the test. During the rest of the test, the rate degrades to around 200 events per second.
Bulk Upsert Latency
Starts at 10 seconds and exhibits similar degradation patterns, escalating to a maximum of 62 seconds with an average of 42 seconds.
Issues and Improvements
The first issue that can be pointed out and improved in this implementation is the document schema in combination with the two indexes. Because the fields key and date are in an embedded document in the _id field, their values are indexed twice: by the default/obligatory index on the _id field and by the index we created to support the Bulk Upsert and Get Reports operations.
As the key field is a 64-character string and the date field is of type date, these two values use at least 68 bytes of storage. Since we have two indexes, each document contributes at least 136 bytes of index data in a non-compressed scenario.
The improvement here is to extract the fields key and date from the _id field and let the _id field keep its default value of type ObjectId. The ObjectId data type takes only 12 bytes of storage.
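A rough, value-only sketch of the difference (it ignores BSON field-name and framing overhead, so the byte counts are an approximation):
import { ObjectId } from "bson";

const key = "0".repeat(63) + "1"; // 64 hexadecimal characters
const embeddedIdValueBytes = Buffer.byteLength(key, "utf8") + 8; // 64-byte UTF-8 string + 8-byte BSON date
const objectIdBytes = new ObjectId().id.length; // 12 bytes

console.log({ embeddedIdValueBytes, objectIdBytes });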
One might think this first implementation is a forced worst-case scenario, built only to make the more optimized solutions look better. Unfortunately, that is not the case. It's not hard to find implementations like this on the internet, and I've worked on a big project with a schema like this one, which is where the idea for this first case came from.
Application Version 2 (appV2): Better Understanding Indexing
Introduction
As discussed in the issues and improvements of appV1, embedding the fields key and date as a document in the _id field to take advantage of its obligatory index is not a good solution for our application: we would still need to create an extra index, and the index on the _id field would take more storage than needed.
To solve the issue of the index on the _id field being bigger than needed, the solution is to move out the fields key and date from the embedded document in the _id field, and let the _id field have its default value of type ObjectId. Each document would still register the status totals for one user, specified by the field key, in one day, specified by the field date, the same way it's done in appV1.
The second application version and the improvements to get to it would still have been developed by someone with a junior knowledge level of MongoDB, but who has gone more in-depth in the documentation related to indexes in MongoDB, especially when indexing fields of type documents.
Schema
The application implementation presented above would have the following TypeScript document schema, named SchemaV2:
type SchemaV2 = {
_id: ObjectId;
key: string;
date: Date;
approved?: number;
noFunds?: number;
pending?: number;
rejected?: number;
};
Bulk Upsert
Based on the specification presented, we have the following updateOne operation for each event generated by this application version:
const operation = {
updateOne: {
filter: { key: event.key, date: event.date },
update: {
$inc: {
approved: event.approved,
noFunds: event.noFunds,
pending: event.pending,
rejected: event.rejected,
},
},
upsert: true,
},
};
This updateOne operation has a similar logic to the one in appV1, with the only difference being the filter criteria.
filter:
- Targets the document where the fields key and date from the event document match the fields key and date of a document in the collection.
Get Reports
To fulfill the Get Reports operation, five aggregation pipelines are required, one for each date interval. Each pipeline follows the same structure, differing only in the filtering criteria in the $match stage:
const pipeline = [
{
$match: {
key: request.key,
date: { $gte: new Date(request.date.getTime() - oneYear), $lt: request.date },
},
},
{
$group: {
_id: null,
approved: { $sum: "$approved" },
noFunds: { $sum: "$noFunds" },
pending: { $sum: "$pending" },
rejected: { $sum: "$rejected" },
},
},
];
This aggregation operation has a similar logic to the one in appV1, with the only difference being the filtering criteria in the $match stage.
- { $match: {...} }:
  - The key field matches the user key provided in the request.
  - The date field filters documents within the specified date range using $gte (greater than or equal to) and $lt (less than).
Indexes
In appV2, the key and date fields were moved out of the _id field and became top-level fields. To support efficient querying for both Bulk Upsert (filtering by key and date) and Get Reports (filtering by key and a date range), a compound index was created on these two fields:
const keys = { key: 1, date: 1 };
const options = { unique: true };
db.appV2.createIndex(keys, options);
Initial Scenario Statistics
Collection Statistics
To evaluate the performance of appV2, we inserted 500 million event documents into the collection using the schema and Bulk Upsert function described earlier. For comparison, the tables below also include statistics from previous comparable application versions:
| Collection | Documents | Data Size | Avg. Document Size | Storage Size | Indexes | Index Size |
|---|---|---|---|---|---|---|
| appV1 | 359,639,622 | 39.58GB | 119B | 8.78GB | 2 | 20.06GB |
| appV2 | 359,614,536 | 41.92GB | 126B | 10.46GB | 2 | 16.66GB |
Event Statistics
To evaluate the storage efficiency per event, the Event Statistics are calculated by dividing the total Data Size and Index Size by the 500 million events.
| Collection | Data Size/Event | Index Size/Event | Total Size/Event |
|---|---|---|---|
| appV1 | 85B | 43.1B | 128.1B |
| appV2 | 90B | 35.8B | 125.8B |
Analyzing the tables above, we can see that from appV1 to appV2, we increased the data size by 6% and decreased the index size by 17%. We can say that our goal of making the index on the _id field smaller was accomplished.
Looking at the Event Statistics, the total size per event value decreased only by 1.8%, from 128.1B to 125.8B. With this difference being so small, there is a good chance that we won’t see significant improvements from a performance point of view.
Load Test Results
Executing the load test for appV2 and plotting it alongside the results for appV1 and Desired rates, we have the following results for Get Reports and Bulk Upsert.
Get Reports Rates
Performance remains suboptimal, reaching only 17 reports per second compared to the target of 25 reports per second for the first 10 minutes of the test, slightly better than appV1. For the rest of the test, both versions have equally bad performance.
Get Reports Latency
appV2 demonstrates considerably worse latency performance compared to appV1, indicating that the schema changes negatively impacted read operations.
Bulk Upsert Rates
Similar to appV1, appV2 achieves the target rate of 250 events per second only during the first 10 minutes of testing. For the rest of the test, appV2 has a slightly better performance than appV1, but still way below the desired rates.
Bulk Upsert Latency
appV2 shows marginal improvement over appV1, suggesting some benefit from the reduced index size for write operations.
Performance Summary
The results align with the modest 1.8% improvement observed in the Initial Scenario Statistics. appV2's performance characteristics demonstrate that simply restructuring the _id field provides minimal benefits. The marginal improvements in Bulk Upsert operations (attributed to smaller indexes) are offset by degraded Get Reports performance (attributed to larger document sizes), resulting in negligible overall performance gains.
Issues and Improvements
The following document is a sample from the collection appV2:
const document = {
_id: ObjectId("6685c0dfc2445d3c5913008f"),
key: "0000000000000000000000000000000000000000000000000000000000000001",
date: new Date("2022-06-25T00:00:00.000Z"),
approved: 10,
noFunds: 3,
pending: 1,
rejected: 1,
};
Analyzing it with the aim of reducing its size, two points of improvement can be found. One is the field key, which is of type string and will always contain 64 characters of hexadecimal data; the other is the names of the status fields, which combined can take up to 30 characters.
The field key, as presented in the scenario section, is composed of hexadecimal data, in which each character requires four bits to be represented. In our implementation so far, we have stored this data as a string using UTF-8 encoding, in which each character requires eight bits to be represented. So, we are using double the storage we need. One way around this issue is to store the hexadecimal data in its raw format using the binary data type.
For the status field names, we can see that the names of the fields use more storage than the values themselves. The field names are strings with at least seven UTF-8 characters, which take at least 7 bytes. The value of a status field is a 32-bit integer, which takes 4 bytes. We can shorthand the status names to their first character, where approved becomes a, noFunds becomes n, pending becomes p, and rejected becomes r.
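A small sketch of both savings; the Buffer lengths are exact, while the BSON sizes reported by calculateObjectSize depend on whether the counters are encoded as int32 or double:
import { calculateObjectSize } from "bson";

// Key stored as a UTF-8 string vs. as raw binary data:
const key = "0000000000000000000000000000000000000000000000000000000000000001";
console.log(Buffer.byteLength(key, "utf8")); // 64 bytes
console.log(Buffer.from(key, "hex").length); // 32 bytes

// BSON size of the status counters with full field names vs. single-letter names:
console.log(calculateObjectSize({ approved: 10, noFunds: 3, pending: 1, rejected: 1 }));
console.log(calculateObjectSize({ a: 10, n: 3, p: 1, r: 1 })); // noticeably smaller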
Application Version 3 (appV3): Better Data Types and Field Name Shorthanding
Introduction
As discussed in the issues and improvements of appV2, to reduce the document size, two improvements were proposed. One is to convert the data type of the field key from string to binary, requiring four bits to represent each hexadecimal character instead of the eight bits of a UTF-8 character. The other is to shorthand the name of the status fields by their first letter, requiring one byte for each field name instead of seven bytes. Each document would still register the status totals for one user, specified by the field key, in one day, specified by the field date, the same way it was done in the previous implementations.
To convert the key value from string to binary/buffer, the following TypeScript function was created:
const buildKey = (key: string): Buffer => {
return Buffer.from(key, "hex");
};
The third application version has two improvements compared to the second version. The improvement of storing the field key as binary data to reduce its storage need would have been thought of by an intermediate to senior MongoDB developer. The improvement of shortening the names of the status fields would have been thought of by an intermediate MongoDB developer.
Schema
The application implementation presented above would have the following TypeScript document schema, named SchemaV3:
type SchemaV3 = {
_id: ObjectId;
key: Buffer;
date: Date;
a?: number;
n?: number;
p?: number;
r?: number;
};
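For comparison with the appV2 sample document shown in the previous section, the same record under SchemaV3 might look like the following; the ObjectId value is reused purely for illustration:
import { ObjectId } from "bson";

const document: SchemaV3 = {
  _id: new ObjectId("6685c0dfc2445d3c5913008f"),
  key: Buffer.from(
    "0000000000000000000000000000000000000000000000000000000000000001",
    "hex"
  ),
  date: new Date("2022-06-25T00:00:00.000Z"),
  a: 10,
  n: 3,
  p: 1,
  r: 1,
};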
Bulk Upsert
Based on the specification presented, we have the following updateOne operation for each event generated by this application version:
const operation = {
updateOne: {
filter: { key: buildKey(event.key), date: event.date },
update: {
$inc: {
a: event.approved,
n: event.noFunds,
p: event.pending,
r: event.rejected,
},
},
upsert: true,
},
};
This updateOne operation has a similar logic to the one in appV2, with the differences being the filter criteria and the $inc operation.
filter:
- Targets the document where the fields key and date from the event document match the fields key and date of a document in the collection.
- The key is converted to binary format using the buildKey function.

update:
- Uses the $inc operator to increment the counters (a, n, p, r) based on the event data.
Get Reports
To fulfill the Get Reports operation, five aggregation pipelines are required, one for each date interval. Each pipeline follows the same structure, differing only in the filtering criteria in the $match stage:
const pipeline = [
{
$match: {
key: buildKey(request.key),
date: { $gte: new Date(request.date.getTime() - oneYear), $lt: request.date },
},
},
{
$group: {
_id: null,
approved: { $sum: "$a" },
noFunds: { $sum: "$n" },
pending: { $sum: "$p" },
rejected: { $sum: "$r" },
},
},
];
This aggregation operation has a similar logic to the one in appV2, with the differences being the filtering criteria in the $match stage and the names of the status fields in the $group stage.
- { $match: {...} }:
  - The key field is converted to binary format using the buildKey function.
- { $group: {...} }:
  - Computes the sum of the a, n, p, and r fields using the $sum operator.
Indexes
Similar to appV2, appV3 relies on a compound index on the key and date fields to optimize Bulk Upsert and Get Reports operations. Even though the key field is now stored as binary data and status field names are shortened, the query patterns remain the same, necessitating the following index:
const keys = { key: 1, date: 1 };
const options = { unique: true };
db.appV3.createIndex(keys, options);
Initial Scenario Statistics
Collection Statistics
To evaluate the performance of appV3, we inserted 500 million event documents into the collection using the schema and Bulk Upsert function described earlier. For comparison, the tables below also include statistics from previous comparable application versions:
| Collection | Documents | Data Size | Avg. Document Size | Storage Size | Indexes | Index Size |
|---|---|---|---|---|---|---|
| appV2 | 359,614,536 | 41.92GB | 126B | 10.46GB | 2 | 16.66GB |
| appV3 | 359,633,376 | 28.7GB | 86B | 8.96GB | 2 | 16.37GB |
Event Statistics
To evaluate the storage efficiency per event, the Event Statistics are calculated by dividing the total Data Size and Index Size by the 500 million events.
| Collection | Data Size/Event | Index Size/Event | Total Size/Event |
|---|---|---|---|
| appV2 | 90B | 35.8B | 125.8B |
| appV3 | 62B | 35.2B | 96.8B |
Analyzing the tables above, we can see that from appV2 to appV3, there was practically no change in the index size and a decrease of 32% in the data size. Our goal of reducing the document size was accomplished.
Looking at the Event Statistics, the total size per event value decreased by 23%, from 125.8B to 96.8B. With this reduction, we’ll probably see considerable improvements.
Load Test Results
Executing the load test for appV3 and plotting it alongside the results for appV2 and Desired rates, we have the following results for Get Reports and Bulk Upsert.
Get Reports Rates
While still falling short of the 25 reports per second target, appV3 demonstrates some improvement, maintaining approximately 16 reports per second for half the test duration, an enhancement over appV2.
Get Reports Latency
Maintains approximately 1.2 seconds for the first 100 minutes before degrading to levels similar to previous versions.
Bulk Upsert Rates
Successfully maintains target rates for the first 100 minutes of testing - achieving 250 events per second from 0-50 minutes and 500 events per second from 50-100 minutes. This marks the first version to sustain target performance for extended periods.
Bulk Upsert Latency
Sustains approximately 2.5 seconds during the first 100 minutes, considerably better than previous implementations, before experiencing degradation in the final test phase.
Issues and Improvements
Looking at the collection stats of appV3 and thinking about how MongoDB executes our queries and which indexes are used, we can see that the _id field and its index aren't being used by our application. The field by itself is not a big deal from a performance standpoint, but its obligatory unique index is: every time a new document is inserted into the collection, the index structure on the _id field has to be updated.
Going back to the idea from appV1 of trying to take advantage of the obligatory _id field and its index, is there a way that we can use it in our application?
Let's take a look at our filtering criteria in the Get Report and Bulk Upsert functions:
const bulkUpsertFilter = {
key: event.key,
date: event.date,
};
const getReportsFilter = {
key: request.key,
date: {
$gte: new Date("2021-06-15"),
$lt: new Date("2022-06-15"),
},
};
In both filtering criteria, the key field is compared using equality. The date field is compared using equality in the Bulk Upsert and using a range in the Get Reports. What if we combined these two field values into a single one by concatenating them, and stored the result in _id?
To guide us on how we should order the fields in the resulting concatenated value and get the best performance of the index on it, let's follow the Equality, Sort, and Range rule (ESR).
As seen above, the key field is compared by equality in both cases, and the date field is compared by equality just in one case, so let's choose the key field for the first part of our concatenated value and the date field for the second part. As we don't have a Sort operation in our queries, we can skip it. Next, we have Range comparison, which is used in the date field, so now it makes sense to keep it as the second part of our concatenated value. With that, the most optimal way of concatenating the two values and getting the best performance of its index is key+date.
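A small sketch of why the key + date order works. With the key as the prefix, all of a user's values are adjacent and ordered by date, so one range over the concatenated value covers the Get Reports query; the shortened keys and dates below are illustrative only:
// Illustrative, shortened values: a 4-character key followed by a YYYYMMDD date.
const ids = [
  "0001" + "20220101",
  "0001" + "20220630",
  "0001" + "20221231",
  "0002" + "20220630", // another user's documents sort after all of user 0001's
].sort();

// Range for user 0001, from 2022-06-01 (inclusive) to 2023-01-01 (exclusive):
const lower = "0001" + "20220601";
const upper = "0001" + "20230101";
const selected = ids.filter((id) => id >= lower && id < upper);

console.log(selected); // [ '000120220630', '000120221231' ]
Since MongoDB compares binary values of the same length byte by byte, the concatenated binary _id used in appV4 behaves the same way as these strings.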
One point of attention is how we are going to format the date field in this concatenation in a way that the range filter works, and we don't store more data than we really need. One possible implementation will be presented and tested in the next application version, appV4.
Application Version 4 (appV4): Taking Advantage of the _id Index
Introduction
As presented in the issues and improvements of appV3, one way to take advantage of the obligatory field and index on _id is to store on it the concatenated value of key + date. One thing that we need to cover now is what data type the _id field will have and how we are going to format the date field.
As seen in previous implementations, storing the key field as binary/hexadecimal data improved the performance. So, let's see if we can also store the resulting concatenated field, key + date, as binary/hexadecimal.
To store the date field in a binary/hexadecimal type, we have some options. One could be converting it to a 4-byte timestamp that measures the seconds since the Unix epoch, and the other could be converting it to the format YYYYMMDD, which stores year, month, and day. Both cases would require the same 32 bits/8 hexadecimal characters.
For our case, let's use the second option and store the date value as YYYYMMDD because it'll help in future implementation/improvements. Considering a key field with the value 0001 and a date field with the value 2022-01-01, we would have the following _id field:
const _id = Buffer.from("000120220101", "hex");
To concatenate and convert the key and date fields to their desired format and type, the following TypeScript function was created:
const buildId = (key: string, date: Date): Buffer => {
const day = date.toISOString().split("T")[0].replace(/-/g, ""); // YYYYMMDD
return Buffer.from(`${key}${day}`, "hex");
};
Each document would still register the status totals for one user in one day, specified by the _id field, the same way it's done in the previous implementations.
These changes reflect an advanced understanding of MongoDB's indexing strategies and storage optimization techniques, demonstrating the expertise of a very experienced senior developer with deep knowledge of BSON data types and compound key design patterns.
Schema
The application implementation presented above would have the following TypeScript document schema, named SchemaV4:
type SchemaV4 = {
_id: Buffer;
a?: number;
n?: number;
p?: number;
r?: number;
};
Bulk Upsert
Based on the specification presented, we have the following updateOne operation for each event generated by this application version:
const operation = {
updateOne: {
filter: { _id: buildId(event.key, event.date) },
update: {
$inc: {
a: event.approved,
n: event.noFunds,
p: event.pending,
r: event.rejected,
},
},
upsert: true,
},
};
This updateOne operation has a similar logic to the one in appV3, with the only difference being the filter criteria.
filter:
- Targets the document where the _id field matches the concatenated value of key and date.
- The buildId function converts the key and date into a binary format.
Get Reports
To fulfill the Get Reports operation, five aggregation pipelines are required, one for each date interval. Each pipeline follows the same structure, differing only in the filtering criteria in the $match stage:
const pipeline = [
{
$match: {
_id: {
$gte: buildId(request.key, new Date(request.date.getTime() - oneYear)),
$lt: buildId(request.key, request.date),
},
},
},
{
$group: {
_id: null,
approved: { $sum: "$a" },
noFunds: { $sum: "$n" },
pending: { $sum: "$p" },
rejected: { $sum: "$r" },
},
},
];
This aggregation operation has a similar logic to the one in appV3, with the only difference being the filtering criteria in the $match stage.
- { $match: {...} }:
  - The _id field is a binary representation of the concatenated key and date values.
  - The $gte operator specifies the start of the date range, while $lt specifies the end.
Indexes
The key design goal of appV4 was to leverage the mandatory, default index on the _id field. By storing the concatenated key and date (formatted as YYYYMMDD and converted to binary) directly in the _id field, appV4 eliminates the need for any additional custom indexes.
The default index on _id is automatically created by MongoDB and is unique. This index now directly supports the filtering requirements for both Bulk Upsert (equality match on the full _id) and Get Reports (range queries on the _id based on key and date ranges).
Initial Scenario Statistics
Collection Statistics
To evaluate the performance of appV4, we inserted 500 million event documents into the collection using the schema and Bulk Upsert function described earlier. For comparison, the tables below also include statistics from previous comparable application versions:
| Collection | Documents | Data Size | Avg. Document Size | Storage Size | Indexes | Index Size |
|---|---|---|---|---|---|---|
| appV3 | 359,633,376 | 28.7GB | 86B | 8.96GB | 2 | 16.37GB |
| appV4 | 359,615,279 | 19.66GB | 59B | 6.69GB | 1 | 9.5GB |
Event Statistics
To evaluate the storage efficiency per event, the Event Statistics are calculated by dividing the total Data Size and Index Size by the 500 million events.
| Collection | Data Size/Event | Index Size/Event | Total Size/Event |
|---|---|---|---|
| appV3 | 62B | 35.2B | 96.8B |
| appV4 | 42.4B | 20.4B | 62.6B |
Analyzing the tables above, we can see that from appV3 to appV4, we reduced the data size by 32% and the index size by 42%—big improvements. We also have one less index to maintain now.
Looking at the Event Statistics, the total size per event value decreased by 35%, from 96.8B to 62.6B. With this reduction, we’ll probably see some significant improvements in performance.
Load Test Results
Executing the load test for appV4 and plotting it alongside the results for appV3 and Desired rates, we have the following results for Get Reports and Bulk Upsert.
Get Reports Rates
While still not achieving the target of 25 reports per second, appV4 shows consistently better average rates compared to appV3, representing incremental progress toward optimal performance.
Get Reports Latency
Both versions exhibit comparable latency behavior throughout most of the test duration. However, during the final 100 minutes when performance degrades, appV4 demonstrates better resilience with smaller latency increases compared to appV3.
Bulk Upsert Rates
Both versions successfully maintain target rates during the first 100 minutes, but appV4 demonstrates superior performance during the degraded final 100 minutes, sustaining higher rates than appV3.
Bulk Upsert Latency
Both versions exhibit comparable latency behavior throughout most of the test duration. However, during the final 100 minutes when performance degrades, appV4 demonstrates better resilience with smaller latency increases compared to appV3.
Performance Summary
Despite the substantial improvements observed in the Initial Scenario Statistics (35% reduction in Total Size per Event from 96.8B to 62.6B), the performance gains in appV4 are more modest than anticipated. This suggests that while index optimization and storage reduction provide measurable benefits, the fundamental architectural constraints require more significant changes. The results indicate that appV4 has reached the optimization ceiling for the current document-per-day approach.
Issues and Improvements
Enough of looking at our documents to get better performance; let's focus on the application's behavior.
When generating the oneYear totals, the Get Reports function will need to retrieve something close to 60 documents on average, and in the worst-case scenario, 365 documents. To access each one of these documents, one index entry will have to be visited, and one disk read operation will have to be performed. How can we increase the data density of the documents in our application and, with that, reduce the index entries and read operations needed to perform the desired operation?
One way of doing that is using the Bucket Pattern. According to the MongoDB documentation, "The bucket pattern separates long series of data into distinct objects. Separating large data series into smaller groups can improve query access patterns and simplify application logic."
Looking at our application from the perspective of the bucket pattern, so far, we have bucketed our data by daily user, each document containing the status totals for one user in one day. We can increase the bucketing range of our schema and, in one document, store events or status totals from a week, month, or even quarter.
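As a preview of that idea, a monthly bucket for one user might look something like the sketch below. This exact shape is an assumption made here for illustration, not necessarily the schema appV5 will use:
// Hypothetical monthly bucket: one document per user per month,
// with one entry per day that registered events.
type MonthlyBucket = {
  _id: Buffer; // key + YYYYMM, following the same concatenation idea as appV4
  items: {
    date: Date;
    a?: number;
    n?: number;
    p?: number;
    r?: number;
  }[];
};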
Conclusion
That is the end of the first part of the series. We covered how indexes work on fields of type documents and saw some small changes that we can make to our application to reduce its storage and index needs, and as a consequence, improve its performance.
Here is a quick review of the improvements made between the application versions:
- appV1 to appV2: Moved the fields key and date out of an embedded document in the _id field and let _id have its default value of type ObjectId.
- appV2 to appV3: Reduced the document size by shortening the names of the status fields and changing the data type of the key field from string to binary/hexadecimal.
- appV3 to appV4: Removed the need for an extra index by concatenating the values of key and date and storing them in the _id field.
So far, none of our applications have gotten even close to the desired rates, but let's not give up. As presented in the issues and improvements of appV4, we can still improve our application by using the Bucket Pattern. The Bucket Pattern with the Computed Pattern will be the main points of improvement for the next application version, appV5, and its revisions.
Appendices
Index on Embedded Documents
This section examines how MongoDB indexes embedded document fields and explains why the appV1 implementation requires an additional index beyond the default _id index.
Index Behavior Analysis
To understand MongoDB's indexing behavior with embedded documents, we'll analyze how the default _id index performs with our appV1 query patterns. The following tests demonstrate the difference between exact document matching and embedded field queries through the explain functionality.
// A sample document
const doc = {
_id: { key: "0001", date: new Date("2020-01-01") },
approved: 2,
rejected: 1,
};
// Making sure we have an empty collection
db.appV1.drop();
// Inserting the document in the `appV1` collection
db.appV1.insertOne(doc);
// Finding a document using `Bulk Upsert` filtering criteria
const bulkUpsertFilter = {
_id: { key: "0001", date: new Date("2020-01-01") },
};
db.appV1.find(bulkUpsertFilter).explain("executionStats");
/*{
...
executionStats: {
nReturned: 1,
totalKeysExamined: 1,
totalDocsExamined: 1,
...
executionStages: {
stage: 'EXPRESS_IXSCAN',
...
}
...
},
...
}*/
// Finding a document using `Get Reports` filtering criteria
const getReportsFilter = {
"_id.key": "0001",
"_id.date": { $gte: new Date("2019-01-01"), $lte: new Date("2021-01-01") },
};
db.appV1.find(getReportsFilter).explain("executionStats");
/*{
...
executionStats: {
nReturned: 1,
totalKeysExamined: 0,
totalDocsExamined: 1,
...
executionStages: {
stage: 'COLLSCAN',
...
}
...
},
...
}*/
Index Utilization Results
The execution statistics reveal a critical performance difference:
- Bulk Upsert Query: Uses the index efficiently (EXPRESS_IXSCAN) because it matches the entire embedded document exactly.
- Get Reports Query: Performs a collection scan (COLLSCAN) because it queries individual fields within the embedded document.
This behavior occurs because MongoDB treats embedded documents as atomic values when indexing, not as collections of individual fields.
MongoDB's Embedded Document Indexing Strategy
MongoDB handles different data types with varying indexing approaches:
- Primitive Types: Directly indexed with their native values
- Arrays: Special indexing that creates entries for each array element
- Embedded Documents: Indexed as serialized, atomic values
For embedded documents, MongoDB creates index entries using a stringified representation of the entire document structure:
const documentValue = { key: "0001", date: new Date("2010-01-01T00:00:00.000Z") };
const indexValue = "{key:0001,date:2010-01-01T00:00:00.000Z}";
Index Limitation Implications
This indexing strategy creates a fundamental limitation: since the index stores the embedded document as a serialized blob, MongoDB cannot access or search individual fields within that structure. Consequently:
- Queries matching the entire embedded document can use the index effectively
- Queries targeting specific embedded fields (like _id.key or _id.date) cannot utilize the index.
- Range queries on embedded fields require full collection scans.
This explains why the appV1 implementation requires an additional compound index on _id.key and _id.date to support efficient querying of individual embedded document fields.