DEV Community

Muhammad Hasnain

How would you tackle this NodeJS project?

Hello!

I work at a startup and have been working on a cool NodeJS-based project. Even though it is a WordPress position, I take time to work on interesting projects when I'm free. Given what we're trying to achieve in this soon-to-be proprietary system, I noticed optimization issues with it. I tried bombarding the endpoints with 200 requests at a time and pm2 monit showed a few issues.

Before I talk about the pm2 stats, I would like to explain what the system actually does. We send a domain name to our endpoint, for example, dev.to. The endpoint makes an entry in the database and emits an event to analyze that website. The endpoint quickly does what it is supposed to and sends back a 200 response, but the process afterwards takes a lot of time: it involves HTTP requests, a huge set of things it needs to analyze the website for, Puppeteer, and potentially a lot of loops with hundreds, if not thousands, of iterations.

All of this resulted in almost 100% CPU usage in the PM2 stats, the heap was at almost 100% too, and EventEmitter gave us memory leak warnings. Given that there is no queue or in-memory data store like Redis, I think the event loop was overwhelmed by 200 simultaneous requests, each of which involves a lot of processing! I'm not happy with the results, and this will cause problems further down the road.
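The post doesn't show the code, so the cause of the EventEmitter warning is a guess. One common cause is attaching a new listener on every request: by default Node prints a "possible EventEmitter memory leak" warning once more than 10 listeners are registered for a single event. A minimal sketch of the difference, where the event name and `analyzeDomain` are made-up placeholders:

```javascript
const { EventEmitter } = require('events');

// Hypothetical event name; the real one is not shown in the post.
const ANALYZE = 'analyze';

// Placeholder for the slow analysis work.
function analyzeDomain(domain) {
  return `analyzed ${domain}`;
}

// Anti-pattern: every request attaches another listener and never removes it.
function leakyHandleRequest(emitter, domain) {
  emitter.on(ANALYZE, analyzeDomain);
  emitter.emit(ANALYZE, domain);
}

// Safer: attach the listener once at startup; requests only emit.
function createAnalyzer() {
  const emitter = new EventEmitter();
  emitter.on(ANALYZE, analyzeDomain);
  return emitter;
}

function handleRequest(emitter, domain) {
  emitter.emit(ANALYZE, domain);
}

const leaky = new EventEmitter();
for (let i = 0; i < 200; i++) leakyHandleRequest(leaky, `site-${i}.example`);

const fixed = createAnalyzer();
for (let i = 0; i < 200; i++) handleRequest(fixed, `site-${i}.example`);

console.log(leaky.listenerCount(ANALYZE)); // 200 (and a warning on stderr)
console.log(fixed.listenerCount(ANALYZE)); // 1
```

If the listener count stays flat under load, the warning is more likely coming from somewhere else (for example a library's internal emitter), and the listener setup is probably not the culprit.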

NOTE: Keep in mind that the slow process I am talking about takes place at the app level. The ExpressJS route only emits an event to start the process for the specific domain that it just saved into the database. I did this because there was no need to keep a user waiting for 10 seconds!

I discussed this with my boss and he encouraged me to take the time to ask the community for help. This is why I am here! Have you worked on such a project? If so, how did you handle it? What would you recommend I do in this case? Should I go for cron jobs instead of an event-based system? I would really appreciate an answer.

Thank you! Also, if you have any questions regarding the project, let me know and I shall answer them.

Top comments (4)

Ahmed Magdy

It's probably bad practice, aka bad code, that led to the memory leaks. I would suggest reading about the bad practices that lead to this kind of behaviour and refactoring the code from there. Recognizing the wrong or bad pieces of code will make it a lot faster to fix.

Muhammad Hasnain • Edited

Thanks! But isn't that the point though? If I use Bubble Sort to sort an array because I don't know about Merge Sort at that point, then regardless of how I write the code, I'm stuck with the slow O(n^2) Bubble Sort algorithm.

Anyway, according to my findings, I realize that I must use Redis, a queueing system, consumers and multiple threads so I can run these tasks in parallel. This will not only reduce the abuse of the heap but will also utilize the CPU much more effectively.
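To illustrate the "multiple threads" part, here is a minimal sketch using Node's built-in `worker_threads` module, so the CPU-heavy loop runs off the main event loop. The summing loop is only a stand-in for the real analysis, `analyzeInWorker` is a hypothetical helper name, and the worker source is kept inline (`eval: true`) purely so the example fits in one file:

```javascript
const { Worker } = require('worker_threads');

// Inline worker source; a real project would likely use a separate file.
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  // Stand-in for CPU-heavy analysis: a long loop.
  let sum = 0;
  for (let i = 0; i < workerData.iterations; i++) sum += i;
  parentPort.postMessage({ domain: workerData.domain, sum });
`;

// Runs one job in its own thread and resolves with the worker's message.
function analyzeInWorker(domain, iterations) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { domain, iterations },
    });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`worker exited with code ${code}`));
    });
  });
}

analyzeInWorker('dev.to', 1000).then((result) => {
  console.log(result); // { domain: 'dev.to', sum: 499500 }
});
```

Spawning one thread per request would just move the overload elsewhere, so in practice this would sit behind a queue or a fixed-size pool of workers.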

I also discovered a library called "Kue" that takes care of all of the above and more. I'm not sure if it runs on a different thread by default or has out-of-the-box support for it.

Again, thanks for taking your time to read and participate in the discussion!

EDIT: Kue does support parallel processing!

Devashish Rana • Edited

I see that you are already handling requests asynchronously by saving the domain in the database and triggering an event, but every other request will also add a job to the database and trigger the event right away.

I might try the following approaches:

  1. Simple Scheduled Job: trigger an event at an interval to fetch a job from the database and execute it, keeping track of it with some sort of flag. This can also happen in the following way:
    setInterval (in the same code) or cron job --> trigger an event --> fetch batches from the database --> spawn a few threads and process the batches either in parallel or synchronously.
    Even a synchronous approach won't hurt here, as you can notify the user once the process is done, but it can be slow, depending on the use case.

  2. Utilize a queue (any queue will be okay for a POC): push the domain to the queue and configure consumers to fetch from the queue and do the task. You will have to control the number of consumers that can run in parallel to reduce the load.
    Make sure that the producer does not produce more than the consumers can handle, as that might also lead to a memory issue.
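The queue approach can be sketched in-process before reaching for Redis. This is only an illustration under assumptions: `JobQueue`, `concurrency` and `maxPending` are made-up names, the queue lives in memory (so jobs are lost on restart), and `push` returning `false` is one possible way to signal the producer to back off.

```javascript
class JobQueue {
  constructor(worker, { concurrency = 2, maxPending = 100 } = {}) {
    this.worker = worker;           // async function that handles one job
    this.concurrency = concurrency; // max jobs running at the same time
    this.maxPending = maxPending;   // max jobs waiting in the queue
    this.pending = [];
    this.active = 0;
    this.idleResolvers = [];
  }

  // Returns false when the queue is full so the producer can back off.
  push(job) {
    if (this.pending.length >= this.maxPending) return false;
    this.pending.push(job);
    this._drain();
    return true;
  }

  // Starts waiting jobs until the concurrency limit is reached.
  _drain() {
    while (this.active < this.concurrency && this.pending.length > 0) {
      const job = this.pending.shift();
      this.active++;
      Promise.resolve()
        .then(() => this.worker(job))
        .catch((err) => console.error('job failed:', err))
        .finally(() => {
          this.active--;
          this._drain();
          if (this.active === 0 && this.pending.length === 0) {
            this.idleResolvers.splice(0).forEach((resolve) => resolve());
          }
        });
    }
  }

  // Resolves once nothing is running or waiting.
  onIdle() {
    if (this.active === 0 && this.pending.length === 0) return Promise.resolve();
    return new Promise((resolve) => this.idleResolvers.push(resolve));
  }
}

// Usage: at most 2 analyses run at once, however many requests arrive.
const queue = new JobQueue(async (domain) => {
  await new Promise((resolve) => setTimeout(resolve, 10)); // placeholder work
  console.log('analyzed', domain);
}, { concurrency: 2, maxPending: 50 });

queue.push('dev.to');
```

In an Express route, a `false` return from `push` could be turned into a 429 or 503 response instead of silently accepting more work than the consumers can handle.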

You can even isolate the main server to just get the request and push it to the database and you can have another instance to take care of the heavy lifting if you think that the load can be a lot on the main server.
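A separate worker instance like that could simply poll the database for pending jobs and flag them as it goes. The sketch below uses a plain array as a stand-in for the jobs table, and the `status` values, `claimBatch`, `processBatch` and `startPolling` are all hypothetical names. With a real database, the claim step would need to be atomic (for example a single conditional update) so that two workers never pick up the same job.

```javascript
// Claim up to batchSize pending jobs by flipping their status flag.
function claimBatch(table, batchSize) {
  const batch = table
    .filter((job) => job.status === 'pending')
    .slice(0, batchSize);
  for (const job of batch) job.status = 'processing';
  return batch;
}

// Process one batch sequentially and record the outcome on each job.
async function processBatch(table, batchSize, analyze) {
  const batch = claimBatch(table, batchSize);
  for (const job of batch) {
    try {
      await analyze(job.domain);
      job.status = 'done';
    } catch (err) {
      job.status = 'failed';
    }
  }
  return batch.length;
}

// Poll on a timer, skipping a tick while the previous batch is still running.
function startPolling(table, { intervalMs, batchSize, analyze }) {
  let busy = false;
  const timer = setInterval(async () => {
    if (busy) return;
    busy = true;
    try {
      await processBatch(table, batchSize, analyze);
    } finally {
      busy = false;
    }
  }, intervalMs);
  return () => clearInterval(timer); // call to stop polling
}
```

The `busy` flag keeps batches from overlapping, which is what bounds the load on the worker regardless of how fast the main server accepts requests.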

This might not be the best solution; I would also appreciate it if others could correct my approach and help me learn more.

Muhammad Hasnain

Thanks!

I looked into general optimizations for NodeJS apps and your answer is on point. I doubt that there will be a better one, but if someone gives it, I'll let you know in the comments. One article, called "CPU Intensive Node.js", suggested using multiple threads along with Redis and a priority queue to handle CPU-intensive tasks. It is exactly what you suggested.
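The priority queue part can be illustrated on its own, without Redis. The following is a small in-memory binary min-heap, not what the article or any particular library actually uses; the class name and the "lower number runs first" convention are assumptions made for the example.

```javascript
class PriorityQueue {
  constructor() {
    this.heap = []; // array-backed binary min-heap
  }

  get size() {
    return this.heap.length;
  }

  // Lower priority numbers are popped first.
  push(item, priority) {
    const heap = this.heap;
    heap.push({ item, priority });
    let i = heap.length - 1;
    // Sift the new entry up until its parent is not larger.
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (heap[parent].priority <= heap[i].priority) break;
      [heap[parent], heap[i]] = [heap[i], heap[parent]];
      i = parent;
    }
  }

  // Removes and returns the item with the lowest priority number.
  pop() {
    const heap = this.heap;
    if (heap.length === 0) return undefined;
    const top = heap[0];
    const last = heap.pop();
    if (heap.length > 0) {
      heap[0] = last;
      let i = 0;
      // Sift the moved entry down until both children are not smaller.
      for (;;) {
        const left = 2 * i + 1;
        const right = left + 1;
        let smallest = i;
        if (left < heap.length && heap[left].priority < heap[smallest].priority) smallest = left;
        if (right < heap.length && heap[right].priority < heap[smallest].priority) smallest = right;
        if (smallest === i) break;
        [heap[smallest], heap[i]] = [heap[i], heap[smallest]];
        i = smallest;
      }
    }
    return top.item;
  }
}

const pq = new PriorityQueue();
pq.push('bulk-import.example', 5);
pq.push('dev.to', 1);
pq.push('example.org', 3);
console.log(pq.pop()); // dev.to
```

Consumers would pop from this structure instead of a plain array, so an urgent domain is analyzed before a backlog of low-priority ones.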