This article was written by Darshan Jayarama.
MongoDB is a flexible-schema database, which means you are free to vary the structure of the data you store. One collection can hold a document (row) with 5 fields, another with 10, another with 1,000, and another with just 1. The documents in a collection don't have to share the same structure.
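For example, in mongosh you can insert documents of completely different shapes into the same collection (the collection and field names here are only for illustration):

```javascript
// mongosh: documents in one collection need not share a structure
db.devices.insertMany([
  { name: "sensor-a" },                                   // 1 field
  { name: "sensor-b", type: "thermostat", firmware: 2 },  // 3 fields
  { name: "sensor-c", tags: ["lab"], meta: { rack: 4 } }  // nested fields
]);
// All three inserts succeed; MongoDB enforces no shared schema by default.
```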
If you are a developer with a SQL background, you may find this fantastical, like something out of a science fiction movie. But that is exactly the flexibility MongoDB gives you. A good percentage of people choose MongoDB for this flexible schema, but be warned: abusing the feature can cost you dearly.
I would like to share a few stories from my time working with customers to tune their applications' performance.
Massive unbounded array
Problem: One customer stored ping data from IoT devices in an array field. For the first few hours of the day, document writes worked fine, but as time passed they became increasingly expensive.
Reason: As the growing array inflated the document, each findOneAndUpdate() became heavier on the network: every call had to fetch the ever-larger document, update it, and save it back.
Solution: Since pings arrive every minute from thousands of IoT devices, we created a new collection in which each ping is its own document keyed by deviceID. With inserts replacing ever-growing updates, writes stayed stable throughout the day.
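A rough sketch of the before and after (the collection and field names are hypothetical):

```javascript
const deviceId = "dev-42";

// BEFORE: one document per device, with pings appended to an unbounded
// array. The document grows all day, so every update rewrites more data.
db.devices.findOneAndUpdate(
  { _id: deviceId },
  { $push: { pings: { ts: new Date(), rtt: 23 } } }
);

// AFTER: one document per ping in a separate collection keyed by deviceID.
// Inserts stay cheap no matter how many pings have already accumulated.
db.pings.insertOne({ deviceId: deviceId, ts: new Date(), rtt: 23 });
```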
Timeseries saved time
Problem: For the same customer, writes were now stable under the referential model, but each read started costing them time: generating the daily report meant collecting every ping for a device across the whole day.
Reason: Each device's ping documents were scattered across different pages on disk, so fetching 100 records could mean reading 100 pages, keeping the required data from each and throwing away the rest. This drove up page-in/page-out activity.
Solution: MongoDB offers time series collections, which use a bucketing pattern under the hood (explained in detail in the video series).
We created a time series collection to store the ping data, with deviceID as the metaField and the timestamp as the timeField.
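Creating it looks something like this (the names and the granularity value are assumptions based on the one-minute ping interval):

```javascript
// Time series collection: MongoDB buckets measurements that share the
// same metaField value, so one device's pings sit physically together.
db.createCollection("pings", {
  timeseries: {
    timeField: "timestamp",  // when the measurement was taken
    metaField: "deviceId",   // identifies the series (one per device)
    granularity: "minutes"   // matches the one-minute ping frequency
  }
});

db.pings.insertOne({ deviceId: "dev-42", timestamp: new Date(), rtt: 23 });
```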
When multi-tenancy became the villain
Problem: A customer running a web-hosting platform used a separate database per customer, with a customer-identification suffix like billing_cust1, data_cust2… As their customer base grew, this became a pain in their database.
Reason: The customer was using MongoDB change streams, so the growing number of databases meant a growing number of open streams, which delayed application startup. As explained in the documentation, change streams read from the oplog collection, which cannot use indexes; opening a high number of targeted streams can therefore hurt performance.
Solution: We converted the multi-tenant layout into a monolithic database structure, storing all customer data of a given type in a single collection with a customerID in each document. Whenever we query for a specific customer, we append a customerID filter to keep the query targeted and the data isolated. This reduced 10,000s of collections to 5–10, as sketched below.
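A minimal sketch of the consolidated model (collection, field, and index names are illustrative):

```javascript
// One shared collection instead of one database per tenant.
db.billing.insertOne({ customerId: "cust1", invoice: 1042, amount: 99.0 });

// A compound index keeps tenant-filtered queries targeted.
db.billing.createIndex({ customerId: 1, invoice: 1 });

// Every query appends the customerId filter.
db.billing.find({ customerId: "cust1", invoice: 1042 });

// A single change stream on the shared collection replaces thousands of
// per-tenant streams; $match narrows events to the tenant of interest.
const stream = db.billing.watch(
  [{ $match: { "fullDocument.customerId": "cust1" } }],
  { fullDocument: "updateLookup" }  // populate fullDocument on updates too
);
```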
$lookup is designed badly
Problem: The application slowed down at delivery time, while verifying the product being delivered.
Reason: A famous e-commerce company was storing only the productID in the orders collection. When a QR code was scanned, the application performed multiple $lookup joins against the products collection to fetch the product details.
Solution: Because the number of products was limited, we remodeled the collection's schema and stored the product details directly in the orders collection. As order details are rarely accessed after delivery, we archive orders that are 3–4 months old.
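Roughly what changed (schemas simplified, names hypothetical):

```javascript
const orderId = "ord-1001", productId = "prod-77";

// BEFORE: the order stores only the productID, so every scan joins
// to the products collection via $lookup.
db.orders.aggregate([
  { $match: { _id: orderId } },
  { $lookup: {
      from: "products",
      localField: "productId",
      foreignField: "_id",
      as: "product"
  } }
]);

// AFTER: the product details are embedded when the order is created,
// so verifying a delivery is a single indexed read with no join.
db.orders.insertOne({
  _id: orderId,
  product: { _id: productId, name: "USB-C cable", weightGrams: 60 }
});
db.orders.findOne({ _id: orderId });
```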
Closing thoughts: Performance tuning is a never-ending loop: analyze the queries → tune → repeat.
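In practice, the "analyze" step usually starts with explain() (a generic example, not tied to any one story above):

```javascript
// Inspect how a query actually executes: which index (if any) was used,
// how many keys and documents were scanned, and where the time went.
db.orders.find({ customerId: "cust1" }).explain("executionStats");
```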