Scheduling a task in Node.js can be achieved in several ways, depending on your requirements. For this example, let's say we've created a library function that scrapes data from a website, and we want to run it on an hourly basis. We'll see how the solution for scheduling this task changes as the function becomes more robust. As with most things in software, we'll start trading off simplicity for a more scalable but complex solution.
Running on a single Node.js process
To start with, let's assume the WebScraper library only fetches data from a single website and dumps it into a data store somewhere. This is quite a simple use case, and we can probably get away with just using an open-source library like cron and running a single process.
Start by installing the dependencies:
```shell
npm install cron
```
Next, the entry point would look something like this:
```javascript
const CronJob = require("cron").CronJob;
const WebScraper = require("./lib/webscraper");

const job = new CronJob("0 * * * *", WebScraper);
job.start();
```
Here we've defined the schedule using a cron expression that calls the WebScraper function at the start of every hour.
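If cron expressions are unfamiliar, the five fields of the schedule above break down as follows (shown here as a commented constant):

```javascript
// "0 * * * *" reads field-by-field as:
//   minute (0)       -> at minute zero
//   hour (*)         -> of every hour
//   day of month (*) -> every day
//   month (*)        -> every month
//   day of week (*)  -> every weekday
// i.e. "run at the top of every hour".
const HOURLY = "0 * * * *";
```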
Scaling to multiple jobs
Let's say you've now iterated on the WebScraper function a few times and added a feature where it accepts an arbitrary URL to crawl and fetch data from.
The same code above can be expanded for multiple jobs like this:
```javascript
const CronJob = require("cron").CronJob;
const WebScraper = require("./lib/webscraper");

const URLS = [
  // list of urls to scrape...
];

const jobs = URLS.map((url) => {
  const job = new CronJob("0 0 * * *", () => WebScraper(url));
  job.start();
  return job;
});
```
However, this approach isn't scalable as the number of jobs grows, for a couple of reasons.
- It's inefficient and hard to scale horizontally. As the number of jobs grows, the process they run on becomes a bottleneck. Eventually, you'll have to figure out how to run the jobs in parallel across multiple processes in order to complete them in a reasonable time.
- Tracking and retrying failures gets tricky. As the number of jobs increases, it becomes more likely that some will fail intermittently. With our current approach, we have no way to easily track which jobs have failed and why, and no way to retry them if they do.
Using a job queue
To deal with the scaling problems, we can look into using a task queue instead. A queue holds many jobs that can be distributed to workers. Since they're stateless, the workers can also be scaled horizontally across multiple processes as required.
There are a few different libraries for implementing a task queue in Node.js, but for this example we'll take a look at Bull. The message queue implemented by this library is backed by Redis, which handles the distribution of jobs to workers.
Start by installing the dependencies:
```shell
npm install bull
```
Following on from our last example we can set up and add jobs to the queue with the following code (this also assumes you have access to a Redis cluster):
```javascript
const Queue = require("bull");

const webScraperQueue = new Queue("Fetch Data", process.env.REDIS_URL);

const URLS = [
  // list of urls to scrape...
];

URLS.forEach((url) =>
  webScraperQueue.add({ url }, { repeat: { cron: "0 0 * * *" } })
);
```
The worker code would then look something like this:
```javascript
const Queue = require("bull");
const WebScraper = require("./lib/webscraper");

const webScraperQueue = new Queue("Fetch Data", process.env.REDIS_URL);

webScraperQueue.process("*", async (job) => {
  const { url } = job.data;
  const res = await WebScraper(url);
  return res;
});
```
Although slightly more complicated, this system is more scalable as the queue grows with more jobs. The two bottlenecks when scaling are:
- The size of the queue, which can be fixed by scaling up the Redis cluster.
- The processing time, which can be fixed by scaling up the number of worker processes.
When adding jobs to the queue, we can also set additional options for retries in case of the inevitable failed attempt.
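As a sketch, the options object passed as the second argument to `webScraperQueue.add()` could look like this — the specific numbers here are illustrative, not recommendations:

```javascript
// Per-job options for Bull's queue.add(data, opts).
// Retries a failed job up to 3 times with exponential backoff starting
// at one minute (roughly 1min, 2min, 4min between attempts), while still
// repeating the job daily on the cron schedule from earlier.
const jobOptions = {
  attempts: 3,
  backoff: { type: "exponential", delay: 60 * 1000 },
  repeat: { cron: "0 0 * * *" },
};

// Usage: webScraperQueue.add({ url }, jobOptions);
```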
Adding observability and making queues easier to manage
While the system can now be easily scaled, we still need a way to add some observability into what is happening in production. Information on the state of the jobs, stack traces, and logs are all things that can help us to determine the overall health of the queue while also being useful tools for debugging when needed.
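Even before reaching for a full dashboard, Bull's queue emits `completed` and `failed` events that can feed logs. Below is a sketch with the formatting pulled into pure functions so they can be tested in isolation; the wiring shown in the comments assumes the `webScraperQueue` from the earlier examples:

```javascript
// Format a log line for a failed job. Kept as a pure function so it can
// be unit-tested without a Redis connection.
function formatFailed(job, err) {
  return `job ${job.id} (${job.data.url}) failed: ${err.message}`;
}

// Format a log line for a completed job.
function formatCompleted(job) {
  return `job ${job.id} (${job.data.url}) completed`;
}

// Hypothetical wiring against the queue from the examples above:
// webScraperQueue.on("failed", (job, err) => console.error(formatFailed(job, err)));
// webScraperQueue.on("completed", (job) => console.log(formatCompleted(job)));
```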
Bull has a few different third-party UI options to enable observability:
- Arena
- bull-board
- ZeroQueue
All three options are open source and straightforward to integrate. Arena and bull-board can both be mounted onto an existing Express.js app, while ZeroQueue is built with the aim of requiring as little code as possible. The latter comes with authentication and the ability to create and manage queues from the UI rather than through code.
Summary
- The simplest way to schedule a job in Node.js is to use an open-source library like cron.
- As the number of jobs grows, you will likely want to transition to a task queue such as Bull to overcome processing bottlenecks and continue scaling.
- Because of the added complexity, you'll also likely want to leverage a UI to easily manage your queues and gain better observability into how your system is performing. For this, you can use dashboards such as Arena, bull-board, and ZeroQueue.