Description
The poll phase gets hijacked by the uncontrolled flow from a readable stream.
This is a short post, probably followed by a longer one later. Recently I was implementing something with Node.js streams, and I got a really good opportunity to think about backpressure, the event loop and closures. All concepts in one!
Problem overview
There's a CSV file. Parse it using csv-parser and store the rows in an internal buffer. As soon as the buffer gets full, commit the rows to the database. The functionality should use asynchronous programming, which simply means that the main thread should stay free while the application is performing I/O operations.
Seemed an easy problem until I started implementing it.
Before moving on to the solution, think about how you would solve it. (Make sure not to miss the important detail: the buffer.)
Solution overview
Visualization of the problem:
File.csv (I/O) -> readable stream (csv-parser) -> buffer -> commit to the database (I/O)
Focus on the I/O part. I/O callbacks are handled in the poll phase of the Node.js event loop.
The readable stream starts filling up the poll phase. As soon as a row is read by csv-parser, the 'data' event handler runs on the main thread.
This event handler fills the buffer until it holds 5 rows from the CSV file.
After 5 rows, the handler should make a db call to commit those 5 rows.
Since this is an I/O operation, we make the call with async/await.
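Once you complete the implementation and test it, you'll see that awaiting the db call frees the main thread, so the rows queued in the poll phase keep firing the 'data' handler.
This is the problem that "backpressure" is meant to solve: the stream produces rows faster than the database can take them, and nothing tells it to slow down.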
Even though this implementation works for a small file, it will spike the RAM usage for a larger file.
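To make this concrete, here is a minimal sketch of the naive version. It is not my project code: `Readable.from` stands in for the csv-parser stream, and `makeFakeDb`, `naiveImport` and the 5 ms delay are made-up names and numbers, purely for illustration.

```javascript
const { Readable } = require('node:stream');

// Hypothetical stand-in for a real database client: each commit takes
// `delayMs`, and we count how many commits are pending at the same time.
function makeFakeDb(delayMs = 5) {
  const stats = { committed: 0, inFlight: 0, maxInFlight: 0 };
  async function commit(batch) {
    stats.inFlight += 1;
    stats.maxInFlight = Math.max(stats.maxInFlight, stats.inFlight);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    stats.committed += batch.length;
    stats.inFlight -= 1;
  }
  return { commit, stats };
}

// Naive version: the handler awaits the commit, but the stream is never paused.
function naiveImport(rows, db, batchSize = 5) {
  return new Promise((resolve, reject) => {
    const stream = Readable.from(rows); // assumption: stands in for csv-parser output
    const commits = [];
    let buffer = [];

    stream.on('data', async (row) => {
      buffer.push(row);
      if (buffer.length < batchSize) return;
      const batch = buffer;
      buffer = [];
      const commit = db.commit(batch);
      commits.push(commit);
      try {
        await commit; // frees the main thread, so more 'data' events keep firing
      } catch (err) {
        stream.destroy(err);
      }
    });

    stream.on('end', () => {
      if (buffer.length > 0) commits.push(db.commit(buffer)); // leftover rows
      Promise.all(commits).then(() => resolve(db.stats), reject);
    });
    stream.on('error', reject);
  });
}

const rows = Array.from({ length: 1000 }, (_, i) => ({ id: i }));
naiveImport(rows, makeFakeDb()).then((stats) => console.log(stats));
```

If you run it, `maxInFlight` should end up well above 1, because rows keep arriving while earlier commits are still pending. With a real file and a real database, every one of those pending batches sits in memory.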
The current implementation will create a poll phase like this:
csv, csv, ... a million csv rows, then
db, db, ... million/5 db writes.
All of these will try to execute as fast as the CPU allows.
The drawback: the main thread is never free until all the reads and db calls have executed. If this functionality runs on a REST API server, you will face downtime, since the main thread is busy working through the overflowing poll phase.
So, how do you control the flow of water from the tap? By closing the tap ;P
Our main problem was that the poll phase is hijacked by the uncontrolled flow from the readable stream. So the solution is simply to pause the stream when the buffer is full. While the stream is paused, make the db call to commit the rows and empty the buffer. Then resume the stream. This makes the poll phase look something like this:
csv csv csv csv csv (5 only) db
So the poll phase is no longer flooded by our JS code. In between these operations, any API call made to the REST API server also gets served, since it only has to wait for 6 callbacks (5 rows and 1 db write) instead of millions.
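Here is a minimal sketch of the pause and resume idea, under the same assumptions as before: `Readable.from` stands in for the csv-parser stream, and `makeFakeDb` and `pausedImport` are hypothetical names for illustration, not my actual project code. Only `stream.pause()` and `stream.resume()` are the real Node.js calls doing the work.

```javascript
const { Readable } = require('node:stream');

// Hypothetical stand-in for a real database client: each commit takes
// `delayMs`, and we count how many commits are pending at the same time.
function makeFakeDb(delayMs = 5) {
  const stats = { committed: 0, inFlight: 0, maxInFlight: 0 };
  async function commit(batch) {
    stats.inFlight += 1;
    stats.maxInFlight = Math.max(stats.maxInFlight, stats.inFlight);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    stats.committed += batch.length;
    stats.inFlight -= 1;
  }
  return { commit, stats };
}

function pausedImport(rows, db, batchSize = 5) {
  return new Promise((resolve, reject) => {
    const stream = Readable.from(rows); // assumption: stands in for csv-parser output
    let buffer = [];
    let pending = Promise.resolve(); // the commit currently in flight, if any

    stream.on('data', (row) => {
      buffer.push(row);
      if (buffer.length < batchSize) return;
      stream.pause(); // close the tap: no more 'data' events until resume()
      const batch = buffer;
      buffer = [];
      pending = db.commit(batch).then(() => stream.resume()); // open the tap again
      pending.catch((err) => stream.destroy(err)); // a failed commit ends the import
    });

    stream.on('end', () => {
      // The last commit may still be pending when 'end' fires, so wait for it,
      // then flush whatever is left in the buffer.
      pending
        .then(() => (buffer.length > 0 ? db.commit(buffer) : undefined))
        .then(() => resolve(db.stats), reject);
    });
    stream.on('error', reject);
  });
}

const rows = Array.from({ length: 1000 }, (_, i) => ({ id: i }));
pausedImport(rows, makeFakeDb()).then((stats) => console.log(stats));
```

Compared with the version that never pauses, only one batch is ever waiting on the database, so the buffer never grows past `batchSize` rows.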
That's how the main thread stays unblocked, the server stays up and the RAM usage doesn't spike.
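A side note on a different way to close the tap, which is not the one I used: Node.js readable streams are async iterables, so a `for await` loop only asks for the next row once the loop body, including the awaited commit, has finished. A small sketch, again with `Readable.from` standing in for the csv-parser stream and with `makeFakeDb` and `forAwaitImport` as made-up names:

```javascript
const { Readable } = require('node:stream');

// Hypothetical stand-in for a real database client: each commit takes
// `delayMs`, and we count how many commits are pending at the same time.
function makeFakeDb(delayMs = 5) {
  const stats = { committed: 0, inFlight: 0, maxInFlight: 0 };
  async function commit(batch) {
    stats.inFlight += 1;
    stats.maxInFlight = Math.max(stats.maxInFlight, stats.inFlight);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    stats.committed += batch.length;
    stats.inFlight -= 1;
  }
  return { commit, stats };
}

async function forAwaitImport(rows, db, batchSize = 5) {
  const stream = Readable.from(rows); // assumption: stands in for csv-parser output
  let buffer = [];
  for await (const row of stream) {
    buffer.push(row);
    if (buffer.length >= batchSize) {
      await db.commit(buffer); // the loop does not pull the next row until this settles
      buffer = [];
    }
  }
  if (buffer.length > 0) await db.commit(buffer); // leftover rows
  return db.stats;
}

const rows = Array.from({ length: 1000 }, (_, i) => ({ id: i }));
forAwaitImport(rows, makeFakeDb()).then((stats) => console.log(stats));
```

The trade-off is mostly taste: the loop is shorter, while explicit pause and resume makes the flow control visible in the code.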
Phew!
At first it seemed like jargon, but in the end it was just an event loop question!
Conclusion
Never block the main thread!
PS: I will publish my project with a complete setup of csv-parser, Express.js, knex.js and a cron job, with well-commented code and complete documentation, for educational purposes. Thanks for your attention!