Situation
I have a Node.js backend service built with the NestJS framework that uses Mongoose, an object modeling tool, to connect to MongoDB. Its job is to import CSV data into the database via an API endpoint. The CSV consists of more than 10k rows, each with multiple columns. The service has three tasks:
- Convert the CSV data to an array of JavaScript objects.
- Calculate the value of new fields for each object using some business logic.
- Insert each object into the database, or update the existing document whose value for the new field matches.
Task
Task 1 was relatively easy: the csv-parse library converts the rows and columns into an array of JavaScript objects. The problem appeared when doing the calculations on each object and when writing each object to the database. Calculating the new fields for an object might mean querying the database, doing mathematical calculations, manipulating strings, and so on, so each object ended up needing a different update. Furthermore, the database operation could be either updateOne or create, depending on the value of the new field. Below is the gist of the first version of the code.
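(A minimal TypeScript sketch of that first version; the ImportRow shape, the field names, and the computeNewFields helper are illustrative stand-ins, not the actual code.)

```typescript
import { parse } from 'csv-parse/sync';
import { Model } from 'mongoose';

// Illustrative row shape; the real schema has more columns.
interface ImportRow {
  name: string;
  amount: number;
  computedKey?: string; // the "new field" derived by business logic
}

// Hypothetical stand-in for the business logic: it may query the
// database, do math, manipulate strings, etc.
async function computeNewFields(row: ImportRow): Promise<string> {
  return `${row.name}-${row.amount}`;
}

async function importCsvSlow(csv: string, model: Model<ImportRow>): Promise<void> {
  // Task 1: convert CSV rows into an array of JavaScript objects.
  const rows: ImportRow[] = parse(csv, { columns: true, cast: true });

  for (const row of rows) {
    // Task 2: calculate the new field for this object.
    row.computedKey = await computeNewFields(row);

    // Task 3: one database round trip per row -- the bottleneck.
    const existing = await model.findOne({ computedKey: row.computedKey }).exec();
    if (existing) {
      await model.updateOne({ computedKey: row.computedKey }, { $set: row }).exec();
    } else {
      await model.create(row);
    }
  }
}
```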
Guess how it performed? Processing 1k rows took an hour or more! If an error occurred, it took even longer. Imagine how long I had to wait to process 10k rows. It was such a waste of time.
Action and Result
That's why I explored the root cause of this problem, and I found that the updateOne or create command on every loop iteration consumed most of the time. Why? Because each updateOne or create makes a round trip to the database, and create also needs time to update indexes. Thus n documents mean n round trips, where each round trip may take 100 ms to 1,000 ms.
That's where I found the solution: Model.bulkWrite. Instead of making n round trips to the database, one per loop iteration, we write in a batch. The idea is to do only the calculations inside the loop, collecting the insert or update operations as we go, and then send them all in a single call that performs the writes in batch. Now the service can process 8k rows in under 5 minutes. Below is the fixed code.
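(Again a sketch, reusing the ImportRow shape and computeNewFields helper from above. One assumption on my part: the separate create/updateOne branch is collapsed into a single updateOne with upsert: true, which inserts when no document matches the filter and updates otherwise.)

```typescript
import { parse } from 'csv-parse/sync';
import { AnyBulkWriteOperation } from 'mongodb';
import { Model } from 'mongoose';

async function importCsvFast(csv: string, model: Model<ImportRow>): Promise<void> {
  const rows: ImportRow[] = parse(csv, { columns: true, cast: true });

  // Only calculations happen inside the loop; the writes are collected.
  const ops: AnyBulkWriteOperation<ImportRow>[] = [];
  for (const row of rows) {
    row.computedKey = await computeNewFields(row);
    ops.push({
      updateOne: {
        filter: { computedKey: row.computedKey },
        update: { $set: row },
        upsert: true, // insert when nothing matches, update otherwise
      },
    });
  }

  // One batched call instead of n round trips.
  const result = await model.bulkWrite(ops);
  console.log(`upserted: ${result.upsertedCount}, modified: ${result.modifiedCount}`);
}
```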
Short about Model.bulkWrite
By MongoDB's definition, Model.bulkWrite or db.collection.bulkWrite is a method that
"Performs multiple write operations with controls for order of execution".
It receives write operations as input and returns a count for each type of write operation, along with an _id for each inserted or upserted document.
In Mongoose, the details look like this.
**Write operations:**
- insertOne
- replaceOne
- updateOne
- updateMany
- deleteOne
- deleteMany

**Options:** bulk write options such as ordered.

**Counts:**
- insertedCount
- matchedCount
- modifiedCount
- deletedCount
- upsertedCount

**upsertedIds:** a map from operation index to the _id of each upserted document.
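Put together, a call might look like this; ItemModel and its fields are hypothetical, and the result counts depend on what is already in the collection.

```typescript
// ItemModel: a hypothetical Mongoose model for { name: string; amount: number }.
const result = await ItemModel.bulkWrite([
  { insertOne: { document: { name: 'a', amount: 1 } } },
  {
    updateOne: {
      filter: { name: 'b' },
      update: { $set: { amount: 2 } },
      upsert: true,
    },
  },
  { deleteOne: { filter: { name: 'c' } } },
]);

console.log(result.insertedCount); // 1
console.log(result.modifiedCount); // 1 if 'b' already existed, else 0
console.log(result.deletedCount);  // 1 if 'c' existed, else 0
console.log(result.upsertedIds);   // e.g. { 1: ObjectId(...) } if 'b' was upserted
```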
So I recommend Model.bulkWrite if you have a similar use case: an application that writes lots of data to MongoDB with a different operation for each document.