Table Of Contents
- Application Version 1: The baseline implementation
- Application Version 2: Better Understanding Indexing
- Application Version 3: Better Data Types and Field Name Shorthanding
- Application Version 4: Taking Advantage of the _id Index
Article Introduction
Welcome to the first part of the series, "The Cost of Not Knowing MongoDB". This comprehensive analysis explores how different MongoDB schema design decisions can dramatically impact application performance, demonstrating the critical importance of understanding MongoDB's underlying mechanisms.
In this first article, we examine four progressive application versions, appV1 through appV4, each representing common approaches developers take when working with MongoDB. Through detailed performance testing and analysis, we reveal how seemingly minor schema modifications can lead to significant improvements in throughput, latency, and resource utilization.
The journey begins with appV1, a baseline implementation that reflects typical patterns used by junior MongoDB developers. We then progress through increasingly optimized versions, introducing concepts such as field concatenation, data type optimization, and strategic field abbreviation. Each version builds upon the lessons learned from its predecessor, culminating in appV4.
This foundational knowledge sets the stage for Part 2, where we explore advanced patterns like the Bucket Pattern and Computed Pattern to achieve even greater performance improvements.
Application Version 1 (appV1): The baseline implementation
Introduction
The first application version, and the base case for our comparison, would have been developed by someone with a junior knowledge level of MongoDB who took a quick look at the documentation and learned that every document in a collection must have an _id field and that this field always has a unique index.
To take advantage of the _id obligatory field and index, the developer decides to store the values of key and date in an embedded document in the _id field. With that, each document will register the status totals for one user, specified by the field _id.key, in one day, specified by the field _id.date.
Schema
The application implementation presented above would have the following TypeScript document schema, named SchemaV1:
type SchemaV1 = {
_id: {
key: string;
date: Date;
};
approved?: number;
noFunds?: number;
pending?: number;
rejected?: number;
};
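For illustration, a document following SchemaV1 could look like the one below. The values are hypothetical and mirror the sample document shown later for appV2; each document registers the status totals for one user on one day.
const document: SchemaV1 = {
  _id: {
    key: "0000000000000000000000000000000000000000000000000000000000000001",
    date: new Date("2022-06-25T00:00:00.000Z"),
  },
  approved: 10,
  noFunds: 3,
  pending: 1,
  rejected: 1,
};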
Bulk Upsert
Based on the specification presented, we have the following updateOne operation for each event generated by this application version:
const operation = {
updateOne: {
filter: {
_id: { date: event.date, key: event.key },
},
update: {
$inc: {
approved: event.approved,
noFunds: event.noFunds,
pending: event.pending,
rejected: event.rejected,
},
},
upsert: true,
},
};
filter:
- Targets the document where the _id field matches { date: event.date, key: event.key }.

update:
- Uses the $inc operator to increment the counters (approved, noFunds, pending, rejected) based on the event data.

upsert:
- Ensures a new document is created if no matching document exists.
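The section shows only the per-event operation, so here is a minimal sketch of how the operations might be batched and submitted with the Node.js driver. The Event type, the batching, and the collection name appV1 are assumptions for illustration, not part of the original article.
import { Db } from "mongodb";

// Hypothetical event shape, inferred from the fields used in the operation above.
type Event = {
  key: string;
  date: Date;
  approved: number;
  noFunds: number;
  pending: number;
  rejected: number;
};

async function bulkUpsert(db: Db, events: Event[]): Promise<void> {
  const operations = events.map((event) => ({
    updateOne: {
      filter: { _id: { date: event.date, key: event.key } },
      update: {
        $inc: {
          approved: event.approved,
          noFunds: event.noFunds,
          pending: event.pending,
          rejected: event.rejected,
        },
      },
      upsert: true,
    },
  }));

  // ordered: false lets independent upserts proceed even if one operation fails.
  await db.collection("appV1").bulkWrite(operations, { ordered: false });
}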
Get Reports
To fulfill the Get Reports operation, five aggregation pipelines are required, one for each date interval. Each pipeline follows the same structure, differing only in the filtering criteria in the $match stage:
const pipeline = [
{
$match: {
"_id.key": request.key,
"_id.date": { $gte: request.date - oneYear, $lt: request.date },
},
},
{
$group: {
_id: null,
approved: { $sum: "$approved" },
noFunds: { $sum: "$noFunds" },
pending: { $sum: "$pending" },
rejected: { $sum: "$rejected" },
},
},
];
- { $match: {...} }:
  - Filters documents based on the key and date fields.
  - The "_id.key" field matches the user key provided in the request.
  - The "_id.date" field filters documents within the specified date range using $gte (greater than or equal to) and $lt (less than).
- { $group: {...} }:
  - Groups the filtered documents into a single result.
  - The _id field is set to null to group all matching documents from the previous stage together.
  - Computes the sum of the approved, noFunds, pending, and rejected fields using the $sum operator.
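As a complement, here is a minimal sketch of how the five pipelines could be run with the Node.js driver. Only the oneYear interval is named in the article, so the other interval lengths, the getReports signature, and the collection name appV1 are assumptions.
import { Db } from "mongodb";

const oneDay = 24 * 60 * 60 * 1000;

// Placeholder interval lengths; only `oneYear` is named explicitly in the article.
const intervals: Record<string, number> = {
  oneWeek: 7 * oneDay,
  oneMonth: 30 * oneDay,
  threeMonths: 90 * oneDay,
  sixMonths: 180 * oneDay,
  oneYear: 365 * oneDay,
};

async function getReports(db: Db, request: { key: string; date: Date }) {
  const reports: Record<string, unknown> = {};

  for (const [name, spanMs] of Object.entries(intervals)) {
    const pipeline = [
      {
        $match: {
          "_id.key": request.key,
          "_id.date": { $gte: new Date(request.date.getTime() - spanMs), $lt: request.date },
        },
      },
      {
        $group: {
          _id: null,
          approved: { $sum: "$approved" },
          noFunds: { $sum: "$noFunds" },
          pending: { $sum: "$pending" },
          rejected: { $sum: "$rejected" },
        },
      },
    ];

    // One round trip per interval; when no documents match, fall back to zeroed totals.
    const [totals] = await db.collection("appV1").aggregate(pipeline).toArray();
    reports[name] = totals ?? { approved: 0, noFunds: 0, pending: 0, rejected: 0 };
  }

  return reports;
}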
Indexes
Initially, appV1 aimed to use the default index on the _id field (which contained an embedded document with key and date). However, this default index on the embedded _id field was not sufficient to efficiently support the query patterns, particularly for the Get Reports function, which filters by _id.key and _id.date.
To address this, an additional compound index was created:
const keys = { "_id.key": 1, "_id.date": 1 };
const options = { unique: true };
db.appV1.createIndex(keys, options);
This explicit index on _id.key and _id.date ensures that queries filtering and sorting on these fields can be performed efficiently. The unique: true option enforces that the combination of _id.key and _id.date is unique across all documents in the collection. For a more detailed explanation of why an index on an embedded document's fields might be needed even if the top-level field is indexed, refer to Appendices - Index on Embedded Documents.
Initial Scenario Statistics
Collection Statistics
To evaluate the performance of appV1, we inserted 500 million event documents into the collection using the schema and Bulk Upsert function described earlier.
| Collection | Documents | Data Size | Avg. Document Size | Storage Size | Indexes | Index Size |
|---|---|---|---|---|---|---|
| appV1 | 359,639,622 | 39.58GB | 119B | 8.78GB | 2 | 20.06GB |
Event Statistics
To evaluate the storage efficiency per event, the Event Statistics are calculated by dividing the total Data Size and Index Size by the 500 million events.
| Collection | Data Size/Event | Index Size/Event | Total Size/Event |
|---|---|---|---|
| appV1 | 85B | 43.1B | 128.1B |
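The per-event figures are simply the collection totals divided by the 500 million events. A quick sketch of the arithmetic, assuming the sizes are reported with 1 GB = 1024^3 bytes:
const EVENTS = 500_000_000;
const GB = 1024 ** 3;

const dataSizePerEvent = (39.58 * GB) / EVENTS;  // ≈ 85 B
const indexSizePerEvent = (20.06 * GB) / EVENTS; // ≈ 43.1 B

console.log(dataSizePerEvent.toFixed(1), indexSizePerEvent.toFixed(1));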
Load Test Results
Executing the load test for appV1 and plotting it alongside the Desired values, we have the following results for Get Reports and Bulk Upsert.
Get Reports Rate
The application never reaches the target rate of 25 reports per second during the first 10 minutes of the test, peaking at only 16.5 reports per second. During the rest of the test, the rate stays at around 6 reports per second.
Get Reports Latency
Begins at 2 seconds and progressively increases throughout the test duration, reaching a maximum of 6.5 seconds with an average of 4.5 seconds.
Bulk Upsert Rate
The application only reaches the desired rate of 250 events per second during the first 10 minutes of the test. During the rest of the test, the rate degrades to around 200 events per second.
Bulk Upsert Latency
Starts at 10 seconds and exhibits similar degradation patterns, escalating to a maximum of 62 seconds with an average of 42 seconds.
Issues and Improvements
The first issue that can be pointed out and improved in this implementation is the document schema in combination with the two indexes. Because the fields key and date are in an embedded document in the _id field, their values are indexed twice: by the default/obligatory index on the _id field and by the index we created to support the Bulk Upsert and Get Reports operations.
As the key field is a 64-character string and the date field is of type date, these two values use at least 68 bytes of storage. Since we have two indexes, each document contributes at least 136 bytes of index data in a non-compressed scenario.
The improvement here is to extract the fields key and date from the _id field and let the _id field keep its default value of type ObjectId. The ObjectId data type takes only 12 bytes of storage.
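A rough, value-only sketch of the difference (it ignores BSON field-name and framing overhead, so the byte counts are an approximation):
import { ObjectId } from "bson";

const key = "0".repeat(63) + "1"; // 64 hexadecimal characters
const embeddedIdValueBytes = Buffer.byteLength(key, "utf8") + 8; // 64-byte UTF-8 string + 8-byte BSON date
const objectIdBytes = new ObjectId().id.length; // 12 bytes

console.log({ embeddedIdValueBytes, objectIdBytes });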
One might think this first implementation is a forced worst-case scenario, built only to make the more optimized solutions look better. Unfortunately, that is not the case. It's not hard to find implementations like this on the internet, and I've worked on a big project with a schema like this one, which is where the idea for this first case came from.
Application Version 2 (appV2): Better Understanding Indexing
Introduction
As discussed in the issues and improvements of appV1, embedding the fields key and date as a document in the _id field to take advantage of its obligatory index is not a good solution for our application: we would still need to create an extra index, and the index on the _id field would take more storage than needed.
To solve the issue of the index on the _id field being bigger than needed, the solution is to move out the fields key and date from the embedded document in the _id field, and let the _id field have its default value of type ObjectId. Each document would still register the status totals for one user, specified by the field key, in one day, specified by the field date, the same way it's done in appV1.
The second application version and the improvements to get to it would still have been developed by someone with a junior knowledge level of MongoDB, but who has gone more in-depth in the documentation related to indexes in MongoDB, especially when indexing fields of type documents.
Schema
The application implementation presented above would have the following TypeScript document schema, named SchemaV2:
type SchemaV2 = {
_id: ObjectId;
key: string;
date: Date;
approved?: number;
noFunds?: number;
pending?: number;
rejected?: number;
};
Bulk Upsert
Based on the specification presented, we have the following updateOne operation for each event generated by this application version:
const operation = {
updateOne: {
filter: { key: event.key, date: event.date },
update: {
$inc: {
approved: event.approved,
noFunds: event.noFunds,
pending: event.pending,
rejected: event.rejected,
},
},
upsert: true,
},
};
This updateOne operation has a similar logic to the one in appV1, with the only difference being the filter criteria.
filter:
- Targets the document where the fields key and date from the event document match the fields key and date of a document in the collection.
Get Reports
To fulfill the Get Reports operation, five aggregation pipelines are required, one for each date interval. Each pipeline follows the same structure, differing only in the filtering criteria in the $match stage:
const pipeline = [
{
$match: {
key: request.key,
date: { $gte: new Date(request.date.getTime() - oneYear), $lt: request.date },
},
},
{
$group: {
_id: null,
approved: { $sum: "$approved" },
noFunds: { $sum: "$noFunds" },
pending: { $sum: "$pending" },
rejected: { $sum: "$rejected" },
},
},
];
This aggregation operation has a similar logic to the one in appV1, with the only difference being the filtering criteria in the $match stage.
- { $match: {...} }:
  - The key field matches the user key provided in the request.
  - The date field filters documents within the specified date range using $gte (greater than or equal to) and $lt (less than).
Indexes
In appV2, the key and date fields were moved out of the _id field and became top-level fields. To support efficient querying for both Bulk Upsert (filtering by key and date) and Get Reports (filtering by key and a date range), a compound index was created on these two fields:
const keys = { key: 1, date: 1 };
const options = { unique: true };
db.appV2.createIndex(keys, options);
Initial Scenario Statistics
Collection Statistics
To evaluate the performance of appV2, we inserted 500 million event documents into the collection using the schema and Bulk Upsert function described earlier. For comparison, the tables below also include statistics from previous comparable application versions:
| Collection | Documents | Data Size | Avg. Document Size | Storage Size | Indexes | Index Size |
|---|---|---|---|---|---|---|
| appV1 | 359,639,622 | 39.58GB | 119B | 8.78GB | 2 | 20.06GB |
| appV2 | 359,614,536 | 41.92GB | 126B | 10.46GB | 2 | 16.66GB |
Event Statistics
To evaluate the storage efficiency per event, the Event Statistics are calculated by dividing the total Data Size and Index Size by the 500 million events.
| Collection | Data Size/Event | Index Size/Event | Total Size/Event |
|---|---|---|---|
| appV1 | 85B | 43.1B | 128.1B |
| appV2 | 90B | 35.8B | 125.8B |
Analyzing the tables above, we can see that from appV1 to appV2, we increased the data size by 6% and decreased the index size by 17%. We can say that our goal of making the index on the _id field smaller was accomplished.
Looking at the Event Statistics, the total size per event value decreased only by 1.8%, from 128.1B to 125.8B. With this difference being so small, there is a good chance that we won’t see significant improvements from a performance point of view.
Load Test Results
Executing the load test for appV2 and plotting it alongside the results for appV1 and Desired rates, we have the following results for Get Reports and Bulk Upsert.
Get Reports Rates
Performance remains suboptimal, reaching only 17 reports per second compared to the target of 25 reports per second for the first 10 minutes of the test, slightly better than appV1. For the rest of the test, both versions have equally bad performance.
Get Reports Latency
appV2 demonstrates considerably worse latency performance compared to appV1, indicating that the schema changes negatively impacted read operations.
Bulk Upsert Rates
Similar to appV1, appV2 achieves the target rate of 250 events per second only during the first 10 minutes of testing. For the rest of the test, appV2 has a slightly better performance than appV1, but still way below the desired rates.
Bulk Upsert Latency
appV2 shows marginal improvement over appV1, suggesting some benefit from the reduced index size for write operations.
Performance Summary
The results align with the modest 1.8% improvement observed in the Initial Scenario Statistics. appV2's performance characteristics demonstrate that simply restructuring the _id field provides minimal benefits. The marginal improvements in Bulk Upsert operations (attributed to smaller indexes) are offset by degraded Get Reports performance (attributed to larger document sizes), resulting in negligible overall performance gains.
Issues and Improvements
The following document is a sample from the collection appV2:
const document = {
_id: ObjectId("6685c0dfc2445d3c5913008f"),
key: "0000000000000000000000000000000000000000000000000000000000000001",
date: new Date("2022-06-25T00:00:00.000Z"),
approved: 10,
noFunds: 3,
pending: 1,
rejected: 1,
};
Analyzing it with the aim of reducing its size, two points of improvement can be found. One is the field key, which is of type string and will always contain 64 characters of hexadecimal data; the other is the names of the status fields, which combined can take up to 30 characters.
The field key, as presented in the scenario section, is composed of hexadecimal data, in which each character requires four bits to be represented. In our implementation so far, we have stored this data as a string using UTF-8 encoding, in which each character requires eight bits to be represented. So, we are using double the storage we need. One way around this issue is to store the hexadecimal data in its raw format using the binary data type.
For the status field names, we can see that the names of the fields use more storage than the values themselves. The field names are strings with at least seven UTF-8 characters, which take at least 7 bytes. The value of a status field is a 32-bit integer, which takes 4 bytes. We can shorthand the status names to their first character, where approved becomes a, noFunds becomes n, pending becomes p, and rejected becomes r.
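A small sketch of both savings; the Buffer lengths are exact, while the BSON sizes reported by calculateObjectSize depend on whether the counters are encoded as int32 or double:
import { calculateObjectSize } from "bson";

// Key stored as a UTF-8 string vs. as raw binary data:
const key = "0000000000000000000000000000000000000000000000000000000000000001";
console.log(Buffer.byteLength(key, "utf8")); // 64 bytes
console.log(Buffer.from(key, "hex").length); // 32 bytes

// BSON size of the status counters with full field names vs. single-letter names:
console.log(calculateObjectSize({ approved: 10, noFunds: 3, pending: 1, rejected: 1 }));
console.log(calculateObjectSize({ a: 10, n: 3, p: 1, r: 1 })); // noticeably smaller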
Application Version 3 (appV3): Better Data Types and Field Name Shorthanding
Introduction
As discussed in the issues and improvements of appV2, to reduce the document size, two improvements were proposed. One is to convert the data type of the field key from string to binary, requiring four bits to represent each hexadecimal character instead of the eight bits of a UTF-8 character. The other is to shorthand the name of the status fields by their first letter, requiring one byte for each field name instead of seven bytes. Each document would still register the status totals for one user, specified by the field key, in one day, specified by the field date, the same way it was done in the previous implementations.
To convert the key value from string to binary/buffer, the following TypeScript function was created:
const buildKey = (key: string): Buffer => {
return Buffer.from(key, "hex");
};
The third application version has two improvements compared to the second version. The improvement of storing the field key as binary data to reduce its storage need would have been thought of by an intermediate to senior MongoDB developer. The improvement of shortening the names of the status fields would have been thought of by an intermediate MongoDB developer.
Schema
The application implementation presented above would have the following TypeScript document schema, named SchemaV3:
type SchemaV3 = {
_id: ObjectId;
key: Buffer;
date: Date;
a?: number;
n?: number;
p?: number;
r?: number;
};
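For comparison with the appV2 sample document shown in the previous section, the same record under SchemaV3 might look like the following; the ObjectId value is reused purely for illustration:
import { ObjectId } from "bson";

const document: SchemaV3 = {
  _id: new ObjectId("6685c0dfc2445d3c5913008f"),
  key: Buffer.from(
    "0000000000000000000000000000000000000000000000000000000000000001",
    "hex"
  ),
  date: new Date("2022-06-25T00:00:00.000Z"),
  a: 10,
  n: 3,
  p: 1,
  r: 1,
};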
Bulk Upsert
Based on the specification presented, we have the following updateOne operation for each event generated by this application version:
const operation = {
updateOne: {
filter: { key: buildKey(event.key), date: event.date },
update: {
$inc: {
a: event.approved,
n: event.noFunds,
p: event.pending,
r: event.rejected,
},
},
upsert: true,
},
};
This updateOne operation has a similar logic to the one in appV2, with the differences being the filter criteria and the $inc operation.
filter:
- Targets the document where the fields key and date from the event document match the fields key and date of a document in the collection.
- The key is converted to binary format using the buildKey function.

update:
- Uses the $inc operator to increment the counters (a, n, p, r) based on the event data.
Get Reports
To fulfill the Get Reports operation, five aggregation pipelines are required, one for each date interval. Each pipeline follows the same structure, differing only in the filtering criteria in the $match stage:
const pipeline = [
{
$match: {
key: buildKey(request.key),
date: { $gte: new Date(request.date.getTime() - oneYear), $lt: request.date },
},
},
{
$group: {
_id: null,
approved: { $sum: "$a" },
noFunds: { $sum: "$n" },
pending: { $sum: "$p" },
rejected: { $sum: "$r" },
},
},
];
This aggregation operation has a similar logic to the one in appV2, with the differences being the filtering criteria in the $match stage and the names of the status fields in the $group stage.
- { $match: {...} }:
  - The key field is converted to binary format using the buildKey function.
- { $group: {...} }:
  - Computes the sum of the a, n, p, and r fields using the $sum operator.
Indexes
Similar to appV2, appV3 relies on a compound index on the key and date fields to optimize Bulk Upsert and Get Reports operations. Even though the key field is now stored as binary data and status field names are shortened, the query patterns remain the same, necessitating the following index:
const keys = { key: 1, date: 1 };
const options = { unique: true };
db.appV3.createIndex(keys, options);
Initial Scenario Statistics
Collection Statistics
To evaluate the performance of appV3, we inserted 500 million event documents into the collection using the schema and Bulk Upsert function described earlier. For comparison, the tables below also include statistics from previous comparable application versions:
| Collection | Documents | Data Size | Avg. Document Size | Storage Size | Indexes | Index Size |
|---|---|---|---|---|---|---|
| appV2 | 359,614,536 | 41.92GB | 126B | 10.46GB | 2 | 16.66GB |
| appV3 | 359,633,376 | 28.7GB | 86B | 8.96GB | 2 | 16.37GB |
Event Statistics
To evaluate the storage efficiency per event, the Event Statistics are calculated by dividing the total Data Size and Index Size by the 500 million events.
| Collection | Data Size/Event | Index Size/Event | Total Size/Event |
|---|---|---|---|
| appV2 | 90B | 35.8B | 125.8B |
| appV3 | 62B | 35.2B | 96.8B |
Analyzing the tables above, we can see that from appV2 to appV3, there was practically no change in the index size and a decrease of 32% in the data size. Our goal of reducing the document size was accomplished.
Looking at the Event Statistics, the total size per event value decreased by 23%, from 125.8B to 96.8B. With this reduction, we’ll probably see considerable improvements.
Load Test Results
Executing the load test for appV3 and plotting it alongside the results for appV2 and Desired rates, we have the following results for Get Reports and Bulk Upsert.
Get Reports Rates
While still falling short of the 25 reports per second target, appV3 demonstrates some improvement, maintaining approximately 16 reports per second for half the test duration, an enhancement over appV2.
Get Reports Latency
Maintains approximately 1.2 seconds for the first 100 minutes before degrading to levels similar to previous versions.
Bulk Upsert Rates
Successfully maintains target rates for the first 100 minutes of testing - achieving 250 events per second from 0-50 minutes and 500 events per second from 50-100 minutes. This marks the first version to sustain target performance for extended periods.
Bulk Upsert Latency
Sustains approximately 2.5 seconds during the first 100 minutes, considerably better than previous implementations, before experiencing degradation in the final test phase.
Issues and Improvements
Looking at the collection stats of appV3 and thinking about how MongoDB executes our queries and which indexes are used, we can see that the _id field and its index aren't being used by our application. The field by itself is not a big deal from a performance standpoint, but its obligatory unique index is: every time a new document is inserted into the collection, the index structure on the _id field has to be updated.
Going back to the idea from appV1 of trying to take advantage of the obligatory _id field and its index, is there a way that we can use it in our application?
Let's take a look at our filtering criteria in the Get Report and Bulk Upsert functions:
const bulkUpsertFilter = {
key: event.key,
date: event.date,
};
const getReportsFilter = {
key: request.key,
date: {
$gte: new Date("2021-06-15"),
$lt: new Date("2022-06-15"),
},
};
In both filtering criteria, the key field is compared using equality. The date field is compared using equality in the Bulk Upsert and using a range in the Get Reports. What if we combined these two field values into a single one by concatenating them, and stored the result in _id?
To guide us on how we should order the fields in the resulting concatenated value and get the best performance of the index on it, let's follow the Equality, Sort, and Range rule (ESR).
As seen above, the key field is compared by equality in both cases, and the date field is compared by equality just in one case, so let's choose the key field for the first part of our concatenated value and the date field for the second part. As we don't have a Sort operation in our queries, we can skip it. Next, we have Range comparison, which is used in the date field, so now it makes sense to keep it as the second part of our concatenated value. With that, the most optimal way of concatenating the two values and getting the best performance of its index is key+date.
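A small sketch of why the key + date order works. With the key as the prefix, all of a user's values are adjacent and ordered by date, so one range over the concatenated value covers the Get Reports query; the shortened keys and dates below are illustrative only:
// Illustrative, shortened values: a 4-character key followed by a YYYYMMDD date.
const ids = [
  "0001" + "20220101",
  "0001" + "20220630",
  "0001" + "20221231",
  "0002" + "20220630", // another user's documents sort after all of user 0001's
].sort();

// Range for user 0001, from 2022-06-01 (inclusive) to 2023-01-01 (exclusive):
const lower = "0001" + "20220601";
const upper = "0001" + "20230101";
const selected = ids.filter((id) => id >= lower && id < upper);

console.log(selected); // [ '000120220630', '000120221231' ]
Since MongoDB compares binary values of the same length byte by byte, the concatenated binary _id used in appV4 behaves the same way as these strings.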
One point of attention is how we are going to format the date field in this concatenation in a way that the range filter works, and we don't store more data than we really need. One possible implementation will be presented and tested in the next application version, appV4.
Application Version 4 (appV4): Taking Advantage of the _id Index
Introduction
As presented in the issues and improvements of appV3, one way to take advantage of the obligatory field and index on _id is to store on it the concatenated value of key + date. One thing that we need to cover now is what data type the _id field will have and how we are going to format the date field.
As seen in previous implementations, storing the key field as binary/hexadecimal data improved the performance. So, let's see if we can also store the resulting concatenated field, key + date, as binary/hexadecimal.
To store the date field in a binary/hexadecimal type, we have some options. One could be converting it to a 4-byte timestamp that measures the seconds since the Unix epoch, and the other could be converting it to the format YYYYMMDD, which stores year, month, and day. Both cases would require the same 32 bits/8 hexadecimal characters.
For our case, let's use the second option and store the date value as YYYYMMDD because it'll help in future implementation/improvements. Considering a key field with the value 0001 and a date field with the value 2022-01-01, we would have the following _id field:
const _id = Buffer.from("000120220101", "hex");
To concatenate and convert the key and date fields to their desired format and type, the following TypeScript function was created:
const buildId = (key: string, date: Date): Buffer => {
const day = date.toISOString().split("T")[0].replace(/-/g, ""); // YYYYMMDD
return Buffer.from(`${key}${day}`, "hex");
};
Each document would still register the status totals for one user in one day, specified by the _id field, the same way it's done in the previous implementations.
These changes reflect an advanced understanding of MongoDB's indexing strategies and storage optimization techniques, demonstrating the expertise of a very experienced senior developer with deep knowledge of BSON data types and compound key design patterns.
Schema
The application implementation presented above would have the following TypeScript document schema, named SchemaV4:
type SchemaV4 = {
_id: Buffer;
a?: number;
n?: number;
p?: number;
r?: number;
};
Bulk Upsert
Based on the specification presented, we have the following updateOne operation for each event generated by this application version:
const operation = {
updateOne: {
filter: { _id: buildId(event.key, event.date) },
update: {
$inc: {
a: event.approved,
n: event.noFunds,
p: event.pending,
r: event.rejected,
},
},
upsert: true,
},
};
This updateOne operation has a similar logic to the one in appV3, with the only difference being the filter criteria.
filter:
- Targets the document where the _id field matches the concatenated value of key and date.
- The buildId function converts the key and date into a binary format.
Get Reports
To fulfill the Get Reports operation, five aggregation pipelines are required, one for each date interval. Each pipeline follows the same structure, differing only in the filtering criteria in the $match stage:
const pipeline = [
{
$match: {
_id: {
$gte: buildId(request.key, new Date(request.date.getTime() - oneYear)),
$lt: buildId(request.key, request.date),
},
},
},
{
$group: {
_id: null,
approved: { $sum: "$a" },
noFunds: { $sum: "$n" },
pending: { $sum: "$p" },
rejected: { $sum: "$r" },
},
},
];
This aggregation operation has a similar logic to the one in appV3, with the only difference being the filtering criteria in the $match stage.
- { $match: {...} }:
  - The _id field is a binary representation of the concatenated key and date values.
  - The $gte operator specifies the start of the date range, while $lt specifies the end.
Indexes
The key design goal of appV4 was to leverage the mandatory, default index on the _id field. By storing the concatenated key and date (formatted as YYYYMMDD and converted to binary) directly in the _id field, appV4 eliminates the need for any additional custom indexes.
The default index on _id is automatically created by MongoDB and is unique. This index now directly supports the filtering requirements for both Bulk Upsert (equality match on the full _id) and Get Reports (range queries on the _id based on key and date ranges).
Initial Scenario Statistics
Collection Statistics
To evaluate the performance of appV4, we inserted 500 million event documents into the collection using the schema and Bulk Upsert function described earlier. For comparison, the tables below also include statistics from previous comparable application versions:
| Collection | Documents | Data Size | Avg. Document Size | Storage Size | Indexes | Index Size |
|---|---|---|---|---|---|---|
| appV3 | 359,633,376 | 28.7GB | 86B | 8.96GB | 2 | 16.37GB |
| appV4 | 359,615,279 | 19.66GB | 59B | 6.69GB | 1 | 9.5GB |
Event Statistics
To evaluate the storage efficiency per event, the Event Statistics are calculated by dividing the total Data Size and Index Size by the 500 million events.
| Collection | Data Size/Event | Index Size/Event | Total Size/Event |
|---|---|---|---|
| appV3 | 62B | 35.2B | 96.8B |
| appV4 | 42.4B | 20.4B | 62.6B |
Analyzing the tables above, we can see that from appV3 to appV4, we reduced the data size by 32% and the index size by 42%—big improvements. We also have one less index to maintain now.
Looking at the Event Statistics, the total size per event value decreased by 35%, from 96.8B to 62.6B. With this reduction, we’ll probably see some significant improvements in performance.
Load Test Results
Executing the load test for appV4 and plotting it alongside the results for appV3 and Desired rates, we have the following results for Get Reports and Bulk Upsert.
Get Reports Rates
While still not achieving the target of 25 reports per second, appV4 shows consistently better average rates compared to appV3, representing incremental progress toward optimal performance.
Get Reports Latency
Both versions exhibit comparable latency behavior throughout most of the test duration. However, during the final 100 minutes when performance degrades, appV4 demonstrates better resilience with smaller latency increases compared to appV3.
Bulk Upsert Rates
Both versions successfully maintain target rates during the first 100 minutes, but appV4 demonstrates superior performance during the degraded final 100 minutes, sustaining higher rates than appV3.
Bulk Upsert Latency
Both versions exhibit comparable latency behavior throughout most of the test duration. However, during the final 100 minutes when performance degrades, appV4 demonstrates better resilience with smaller latency increases compared to appV3.
Performance Summary
Despite the substantial improvements observed in the Initial Scenario Statistics (35% reduction in Total Size per Event from 96.8B to 62.6B), the performance gains in appV4 are more modest than anticipated. This suggests that while index optimization and storage reduction provide measurable benefits, the fundamental architectural constraints require more significant changes. The results indicate that appV4 has reached the optimization ceiling for the current document-per-day approach.
Issues and Improvements
Enough of looking at our documents to get better performance; let's focus on the application's behavior.
When generating the oneYear totals, the Get Reports function will need to retrieve something close to 60 documents on average, and in the worst-case scenario, 365 documents. To access each one of these documents, one index entry will have to be visited, and one disk read operation will have to be performed. How can we increase the data density of the documents in our application and, with that, reduce the index entries and read operations needed to perform the desired operation?
One way of doing that is using the Bucket Pattern. According to the MongoDB documentation, "The bucket pattern separates long series of data into distinct objects. Separating large data series into smaller groups can improve query access patterns and simplify application logic."
Looking at our application from the perspective of the bucket pattern, so far, we have bucketed our data by daily user, each document containing the status totals for one user in one day. We can increase the bucketing range of our schema and, in one document, store events or status totals from a week, month, or even quarter.
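As a preview of that idea, a monthly bucket for one user might look something like the sketch below. This exact shape is an assumption made here for illustration, not necessarily the schema appV5 will use:
// Hypothetical monthly bucket: one document per user per month,
// with one entry per day that registered events.
type MonthlyBucket = {
  _id: Buffer; // key + YYYYMM, following the same concatenation idea as appV4
  items: {
    date: Date;
    a?: number;
    n?: number;
    p?: number;
    r?: number;
  }[];
};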
Conclusion
That is the end of the first part of the series. We covered how indexes work on fields of type documents and saw some small changes that we can make to our application to reduce its storage and index needs, and as a consequence, improve its performance.
Here is a quick review of the improvements made between the application versions:
- appV1 to appV2: Moved the fields key and date out of an embedded document in the _id field and let _id have its default value of type ObjectId.
- appV2 to appV3: Reduced the document size by shortening the names of the status fields and changing the data type of the key field from string to binary/hexadecimal.
- appV3 to appV4: Removed the need for an extra index by concatenating the values of key and date and storing them in the _id field.
So far, none of our applications have gotten even close to the desired rates, but let's not give up. As presented in the issues and improvements of appV4, we can still improve our application by using the Bucket Pattern. The Bucket Pattern with the Computed Pattern will be the main points of improvement for the next application version, appV5, and its revisions.
Appendices
Index on Embedded Documents
This section examines how MongoDB indexes embedded document fields and explains why the appV1 implementation requires an additional index beyond the default _id index.
Index Behavior Analysis
To understand MongoDB's indexing behavior with embedded documents, we'll analyze how the default _id index performs with our appV1 query patterns. The following tests demonstrate the difference between exact document matching and embedded field queries through the explain functionality.
// A sample document
const doc = {
_id: { key: "0001", date: new Date("2020-01-01") },
approved: 2,
rejected: 1,
};
// Making sure we have an empty collection
db.appV1.drop();
// Inserting the document in the `appV1` collection
db.appV1.insertOne(doc);
// Finding a document using `Bulk Upsert` filtering criteria
const bulkUpsertFilter = {
_id: { key: "0001", date: new Date("2020-01-01") },
};
db.appV1.find(bulkUpsertFilter).explain("executionStats");
/*{
...
executionStats: {
nReturned: 1,
totalKeysExamined: 1,
totalDocsExamined: 1,
...
executionStages: {
stage: 'EXPRESS_IXSCAN',
...
}
...
},
...
}*/
// Finding a document using `Get Reports` filtering criteria
const getReportsFilter = {
"_id.key": "0001",
"_id.date": { $gte: new Date("2019-01-01"), $lte: new Date("2021-01-01") },
};
db.appV1.find(getReportsFilter).explain("executionStats");
/*{
...
executionStats: {
nReturned: 1,
totalKeysExamined: 0,
totalDocsExamined: 1,
...
executionStages: {
stage: 'COLLSCAN',
...
}
...
},
...
}*/
Index Utilization Results
The execution statistics reveal a critical performance difference:
- Bulk Upsert Query: Uses the index efficiently (EXPRESS_IXSCAN) because it matches the entire embedded document exactly.
- Get Reports Query: Performs a collection scan (COLLSCAN) because it queries individual fields within the embedded document.
This behavior occurs because MongoDB treats embedded documents as atomic values when indexing, not as collections of individual fields.
MongoDB's Embedded Document Indexing Strategy
MongoDB handles different data types with varying indexing approaches:
- Primitive Types: Directly indexed with their native values
- Arrays: Special indexing that creates entries for each array element
- Embedded Documents: Indexed as serialized, atomic values
For embedded documents, MongoDB creates index entries using a stringified representation of the entire document structure:
const documentValue = { key: "0001", date: new Date("2010-01-01T00:00:00.000Z") };
const indexValue = "{key:0001,date:2010-01-01T00:00:00.000Z}";
Index Limitation Implications
This indexing strategy creates a fundamental limitation: since the index stores the embedded document as a serialized blob, MongoDB cannot access or search individual fields within that structure. Consequently:
- Queries matching the entire embedded document can use the index effectively
- Queries targeting specific embedded fields (like _id.key or _id.date) cannot utilize the index.
- Range queries on embedded fields require full collection scans.
This explains why the appV1 implementation requires an additional compound index on _id.key and _id.date to support efficient querying of individual embedded document fields.