Burhanuddin Ahmed

Optimizing a MongoDB collection with 200 million rows of data using indexing

Preface

We have a MongoDB collection with almost 200 million rows. Basically, this collection stores log data from an operation that I can't disclose here. Every day, ~40 million new rows are inserted, and we keep 5 days of retention, which means data older than 5 days is automatically deleted. We have a cron job that checks whether the data needs to be deleted or not.

If the number of users keeps increasing, we could potentially end up with more than 200 million rows.

We are using ExpressJS and Mongoose.

What I want to do and the problems

Basically, the log data has an event_type property, and we want to count the number of each event that happened in the last 24 hours, per user.

Here's our code:

const countTypeA = await LogSchema.countDocuments({
  createdAt: { $gte: new Date('<24 hour before>') },
  userId: <secret id hehe>,
  one_other_secret_filter: 'secret value',
  event_type: 'A'
})

We have over 20 event types, so we call the code above more than 20 times with a different event_type each time. This makes the API response take very long, and the API often returns a timeout error.

How I do it

We figured out that the first thing we can do when working with large data is indexing.

Indexing

Add an index for the properties we are using as filters. Because we use four properties in our count filter, we decided to put a compound index on this Log collection.

LogSchema.index({ createdAt: -1, userId: 1, one_other_secret_filter: 1, event_type: 1 })

We put -1 for createdAt because we want it indexed in descending order. One of our filters is createdAt: { $gte: new Date('<24 hour before>') }, so indexing by the latest records first makes that range scan faster.

After adding a new index, MongoDB builds it in the background, so we are still able to read and write during the process. The build took 30-40 minutes because we have a lot of data; on an empty collection, indexing would take just a second.
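If you want to double-check that the query really uses the new index, you can run the same filter with explain() in the Mongo Shell. This is just a sketch; the collection name and the placeholder values are ours to fill in:

// Hypothetical collection name; replace with your own.
db.logs.find({
  createdAt: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) },
  userId: '<secret id hehe>',
  one_other_secret_filter: 'secret value',
  event_type: 'A'
}).explain('executionStats')

// In the output, look for an IXSCAN stage on the new compound index in
// winningPlan, and compare totalDocsExamined with nReturned.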

Using MongoClient instead of Mongoose

We thought our job was done, but the problem still existed: our API still returned a timeout error.

After hours of research, we found something on Stack Overflow: Mongoose can be much slower than the Mongo Shell.

We tried it immediately: we opened our terminal, went into the Mongo Shell, and ran db.collection.count({ <FILTER> }) there.

TADA!!

It returned the result really quickly.

We concluded that the problem might be in Mongoose. But it's impractical to replace Mongoose with MongoClient everywhere, as we already have tons of modules relying on Mongoose.

Okay, so we only migrated the modules that use countDocuments() to MongoClient; everything else keeps using Mongoose.
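As a rough sketch of what that migration looks like (the connection string, database, and collection names below are placeholders, not our real ones), the count now goes through the native driver like this:

const { MongoClient } = require('mongodb')

// Placeholders: use your real connection string, database, and collection.
const client = new MongoClient('mongodb://localhost:27017')
const connecting = client.connect() // connect once and reuse the client

async function countEvents(filter) {
  await connecting
  return client.db('our_db').collection('logs').countDocuments(filter)
}

// Usage: same filter as before, just without Mongoose in the middle.
// const countTypeA = await countEvents({ ..., event_type: 'A' })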

Split request into small chunks

As I said above, we have 20 event types, which means we run the countDocuments query 20 times.

Let's say each query takes 2 seconds, so the total response time is around 40 seconds. Can you imagine a user having to stare at the loading indicator for 40 seconds? That's a nightmare.

Just an example:

async function (request, response) {
  const types = ['A', 'B', /* ... up to 20 event types */]

  for (const t of types) {
    <MONGO COUNT QUERY>
  }

  response.json(<data>)
}

Actually, we could use Promise.all, but we chose to chunk the requests because some queries may still take a bit longer, and that would still affect the overall response time.
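For comparison, the Promise.all version would look roughly like this; baseFilter and countEvents here are just stand-ins for our shared filter and whichever count helper you use (Mongoose or MongoClient):

// Fire all counts in parallel and wait for them together.
const counts = await Promise.all(
  types.map((t) => countEvents({ ...baseFilter, event_type: t }))
)

// Map the results back to their event types: counts[i] belongs to types[i].
const byType = Object.fromEntries(types.map((t, i) => [t, counts[i]]))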

You can use query string params to decide which types you are going to fetch; it's actually similar to pagination. Just discuss with your frontend dev about the best approach for your team.

Updated:

async function (request, response) {
  const size = 2
  const page = Number(request.query.page) || 1
  const start = (page - 1) * size
  const end = start + size

  const types = ['A', 'B', /* ... up to 20 event types */]

  for (const t of types.slice(start, end)) {
    <MONGO COUNT QUERY>
  }

  response.json(<data>)
}

So now it only takes 2-3 secs per request to get the data.
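On the frontend side, a minimal sketch of fetching the counts page by page could look like this (the /event-counts endpoint name and response shape are made up for the example):

// Hypothetical endpoint and response shape; adjust to your own API.
async function loadAllCounts(totalTypes, size = 2) {
  const pages = Math.ceil(totalTypes / size)
  const counts = {}

  for (let page = 1; page <= pages; page++) {
    const res = await fetch(`/event-counts?page=${page}`)
    Object.assign(counts, await res.json())
    // render partial results here so users see numbers as they arrive
  }

  return counts
}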

Happy Coding!!!
