Ashwani Yadav

Posted on Mar 7

How Background Jobs Actually Work in Rocket.Chat — A Deep Dive into Agenda

#webdev #programming #opensource #learning

You know that feeling when you look at an open source codebase for the first time and think, "Wow, this is massive, where do I even start?"

That's exactly how I felt when I started exploring Rocket.Chat. I was curious about one specific thing: how does Rocket.Chat handle background tasks? Things like cleaning up old files, syncing users from LDAP, sending scheduled reports — all the stuff that happens behind the scenes while you're busy chatting.

Turns out, the answer is a library called Agenda, and how Rocket.Chat uses it taught me a lot more than I expected.

Why not just use `setInterval`?

This was my first question. Node.js already has setInterval. Why bring in a whole library?

Think about it:

Server crashes? Your setInterval is gone, along with any memory of when the job was next due. Nothing runs again until someone restarts the process.
Running multiple instances? Great, now the same job runs 3 times simultaneously on 3 different servers. That's not a feature, that's a bug.
No history? Good luck figuring out when something last ran, or why it failed.

Agenda solves all of these by storing jobs in MongoDB. The database becomes the single source of truth. If a server goes down, another one picks up the job once the dead server's lock expires. If two servers try to grab the same job, MongoDB's atomic operations ensure only one wins.

It's basically setInterval that survived the chaos of distributed systems.
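To make the "only one wins" idea concrete, here's a minimal in-memory sketch. It is not Agenda's API: `SharedJobStore`, `tryClaim`, and `tick` are hypothetical names, and a plain `Map` stands in for the MongoDB collection. The point is only that the check and the write happen in a single step against shared state.

```typescript
// Hypothetical in-memory stand-in for the jobs collection (not Agenda's API).
type JobRecord = { name: string; lockedAt: Date | null };

class SharedJobStore {
  private jobs = new Map<string, JobRecord>();

  add(name: string): void {
    this.jobs.set(name, { name, lockedAt: null });
  }

  // Check-and-set in one synchronous step, imitating an atomic update:
  // the lock is granted only if nobody holds it yet.
  tryClaim(name: string, now: Date): boolean {
    const job = this.jobs.get(name);
    if (!job || job.lockedAt !== null) return false;
    job.lockedAt = now;
    return true;
  }
}

// Each "server" runs the job only if it wins the claim.
function tick(server: string, store: SharedJobStore, ran: string[]): void {
  if (store.tryClaim('cleanup', new Date())) ran.push(server);
}

const store = new SharedJobStore();
store.add('cleanup');

const ranOn: string[] = [];
for (const server of ['server-a', 'server-b', 'server-c']) {
  tick(server, store, ranOn);
}

console.log(ranOn); // [ 'server-a' ]
```

With three independent setInterval timers the job would have run three times. With one shared record, the second and third servers see the lock and skip.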

Rocket.Chat doesn't use Agenda directly

Here's something I didn't expect: Rocket.Chat maintains its own forked version of Agenda. Check packages/agenda/package.json:

"description": "Fork of https://github.com/agenda/agenda"

This fork lives locally in the monorepo, which means Rocket.Chat has full control over the scheduling behavior without depending on upstream releases.

The Three Layers

After reading through the code, I noticed Agenda is used in three distinct layers:

Layer 1: The Engine (`packages/agenda`)

This is the raw scheduling engine. The core Agenda class in Agenda.ts handles everything: database connections, job definitions, polling MongoDB every minute to find due jobs, locking them with atomic findOneAndUpdate, executing them, and managing concurrency limits.

Layer 2: The Wrapper (`packages/cron`)

Rocket.Chat built a wrapper called AgendaCronJobs on top of the engine. It does two key things:

1. Simplifies the API:

cronJobs.add('VideoConferences', '0 */3 * * *', async () => {
  await expireOldConferences();
});

2. Records execution history:

Every time a job runs, the wrapper logs it to rocketchat_cron_history:

const { insertedId } = await CronHistory.insertOne({
  intendedAt: new Date(),
  name: jobName,
  startedAt: new Date(),
});
// job runs...
await CronHistory.updateOne(
  { _id: insertedId },
  { $set: { finishedAt: new Date(), result } }
);

This is invaluable for debugging — you can see exactly when a job ran, how long it took, and whether it failed.
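The same record-then-update pattern can be tried out without a database. The sketch below is an assumption-heavy stand-in: `cronHistory` is a plain array in place of rocketchat_cron_history, `runWithHistory` is a hypothetical helper name, and the `error` field is added here for illustration rather than taken from Rocket.Chat's schema.

```typescript
// Hypothetical in-memory stand-in for the history collection.
interface HistoryEntry {
  name: string;
  intendedAt: Date;
  startedAt: Date;
  finishedAt?: Date;
  result?: unknown;
  error?: string; // illustrative field, not confirmed by the source
}

const cronHistory: HistoryEntry[] = [];

// Wraps a job so that every run leaves a record, whether it succeeds or throws.
async function runWithHistory(
  jobName: string,
  job: () => Promise<unknown>,
): Promise<void> {
  const entry: HistoryEntry = {
    name: jobName,
    intendedAt: new Date(),
    startedAt: new Date(),
  };
  cronHistory.push(entry);
  try {
    entry.result = await job();
  } catch (err) {
    entry.error = err instanceof Error ? err.message : String(err);
  } finally {
    // Always stamp the end time so the duration can be computed later.
    entry.finishedAt = new Date();
  }
}

async function demo(): Promise<HistoryEntry[]> {
  await runWithHistory('VideoConferences', async () => 'expired 3 conferences');
  await runWithHistory('Broken', async () => {
    throw new Error('boom');
  });
  return cronHistory;
}
```

Calling `demo()` leaves two entries behind: one with a `result`, one with an `error`, and both with a `finishedAt`. That is what makes "when did it run and did it fail" answerable after the fact.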

Layer 3: The Actual Jobs

These are spread across the codebase. They don't care about MongoDB or locking. They just say "run this function on this schedule" and the layers below handle the rest.

A good way to think about it:

Layer 1 = the postal system (handles delivery logistics)
Layer 2 = the post office (accepts your letter and tracks it)
Layer 3 = you writing the letter (just defines the content)

The `IJob` Interface — What Gets Stored

Reading packages/agenda/src/definition/IJob.ts was eye-opening. Every job in MongoDB has these fields:

interface IJob {
  name: string;              // Job identifier
  nextRunAt?: Date | null;   // When it should run next
  type?: 'once' | 'single' | 'normal';
  repeatInterval?: string;   // Cron expression
  lastRunAt?: Date;          // When it last started
  lastFinishedAt?: Date;     // When it last completed
  lockedAt?: Date | null;    // Lock timestamp
  disabled?: boolean;        // Can disable without deleting!
  failedAt?: Date;           // When it failed
  failReason?: string;       // Error message
  failCount?: number;        // Cumulative failure count
  priority?: number;         // Execution priority
  data?: Record<string, any>;
}

A few things stood out:

disabled: There's already built-in support for disabling jobs without removing them. The engine checks disabled: { $ne: true } when finding the next job to run.
lockedAt: This is how Agenda prevents duplicate execution across servers. Once a server locks a job, others skip it.
failCount + failReason: Error tracking is built into the data model. You don't need external monitoring to know how often something fails.
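Here's a rough sketch of how `disabled` and `lockedAt` combine when deciding whether a job may be picked up. This is a plain predicate, not the actual MongoDB query; `isEligible` and `JobAttrs` are hypothetical names, and the 10-minute lock lifetime is assumed to be Agenda's default.

```typescript
// Subset of the IJob fields that matter for picking the next job.
interface JobAttrs {
  name: string;
  nextRunAt?: Date | null;
  lockedAt?: Date | null;
  disabled?: boolean;
}

const LOCK_LIFETIME_MS = 10 * 60 * 1000; // assumed default: 10 minutes

// Due, not disabled, and not held by a lock that is still fresh.
function isEligible(
  job: JobAttrs,
  now: Date,
  lockLifetimeMs: number = LOCK_LIFETIME_MS,
): boolean {
  if (job.disabled === true) return false;
  if (!job.nextRunAt || job.nextRunAt.getTime() > now.getTime()) return false;
  if (!job.lockedAt) return true;
  // A lock older than the lifetime is treated as abandoned.
  return now.getTime() - job.lockedAt.getTime() >= lockLifetimeMs;
}

const now = new Date('2024-01-01T12:00:00Z');
const minutesAgo = (m: number): Date => new Date(now.getTime() - m * 60_000);

const jobs: JobAttrs[] = [
  { name: 'due', nextRunAt: minutesAgo(1) },
  { name: 'disabled', nextRunAt: minutesAgo(1), disabled: true },
  { name: 'locked', nextRunAt: minutesAgo(1), lockedAt: minutesAgo(2) },
  { name: 'stale-lock', nextRunAt: minutesAgo(20), lockedAt: minutesAgo(15) },
  { name: 'future', nextRunAt: new Date(now.getTime() + 60_000) },
];

const eligible = jobs.filter((j) => isEligible(j, now)).map((j) => j.name);
console.log(eligible); // [ 'due', 'stale-lock' ]
```

The 'stale-lock' case is the interesting one: the job is locked, but the lock is older than the lifetime, so another server is allowed to take it over.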

The Job Class Has More Than I Expected

Looking at Job.ts, I found methods I wasn't expecting:

// Built-in disable/enable
disable(): Job { this.attrs.disabled = true; return this; }
enable(): Job { this.attrs.disabled = false; return this; }

// Status check
isRunning(): boolean {
  if (!this.attrs.lastRunAt) return false;
  if (!this.attrs.lastFinishedAt) return true;
  if (this.attrs.lockedAt && 
      this.attrs.lastRunAt.getTime() > this.attrs.lastFinishedAt.getTime()) {
    return true;
  }
  return false;
}

So the infrastructure for checking job status and toggling jobs on/off already exists at the engine level. That's pretty cool.
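The `isRunning()` logic is easier to follow with a few concrete states. The sketch below copies the same decision rules into a free function (a hypothetical standalone version, not the fork's `Job` class) and feeds it four made-up timestamps.

```typescript
interface RunState {
  lastRunAt?: Date;
  lastFinishedAt?: Date;
  lockedAt?: Date | null;
}

// Same rules as the method quoted above, written as a standalone function.
function isRunning(attrs: RunState): boolean {
  if (!attrs.lastRunAt) return false; // never started
  if (!attrs.lastFinishedAt) return true; // started, never finished
  // Locked, and the latest start is newer than the latest finish.
  return (
    Boolean(attrs.lockedAt) &&
    attrs.lastRunAt.getTime() > attrs.lastFinishedAt.getTime()
  );
}

const t = (iso: string): Date => new Date(iso);

const neverRan = isRunning({});
const firstRun = isRunning({
  lastRunAt: t('2024-01-01T10:00:00Z'),
  lockedAt: t('2024-01-01T10:00:00Z'),
});
const finished = isRunning({
  lastRunAt: t('2024-01-01T10:00:00Z'),
  lastFinishedAt: t('2024-01-01T10:00:05Z'),
  lockedAt: null,
});
const rerunning = isRunning({
  lastRunAt: t('2024-01-01T13:00:00Z'),
  lastFinishedAt: t('2024-01-01T10:00:05Z'),
  lockedAt: t('2024-01-01T13:00:00Z'),
});

console.log({ neverRan, firstRun, finished, rerunning });
// { neverRan: false, firstRun: true, finished: false, rerunning: true }
```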

How a Job Actually Runs — End to End

Following a job from schedule to execution was one of the more interesting exercises:

1. Registration — cronJobs.add() calls agenda.define() to register the handler, then agenda.every() to set the schedule.

2. Polling — Every minute, Agenda queries MongoDB for jobs where nextRunAt <= now AND disabled !== true AND lockedAt is null or expired.

3. Locking — findOneAndUpdate atomically sets lockedAt = new Date(). Because it's atomic, even if two servers query simultaneously, only one gets the lock.

4. Execution — The handler runs. Events fire: start → success/fail → complete.

5. Cleanup — lockedAt is set to null, nextRunAt is recalculated from the cron expression.

6. Lock Expiry — Default lock lifetime is 10 minutes. If a server crashes while holding a lock, after 10 minutes another server can reclaim the job. Self-healing.
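Steps 3 to 5 above can be sketched as one small function. This is a simplified model under stated assumptions: `runClaimedJob` and `JobState` are hypothetical names, the handler is synchronous, a fixed interval stands in for real cron-expression parsing, and a single `now` value is reused for every timestamp so the result is deterministic.

```typescript
interface JobState {
  name: string;
  nextRunAt: Date | null;
  lockedAt: Date | null;
  lastRunAt?: Date;
  lastFinishedAt?: Date;
  failedAt?: Date;
  failReason?: string;
  failCount?: number;
}

// Locks the job, runs the handler, records the outcome, then cleans up.
// Returns the events in the order they would fire.
function runClaimedJob(
  job: JobState,
  handler: () => void,
  now: Date,
  intervalMs: number, // assumption: fixed interval instead of a cron expression
): string[] {
  const events: string[] = ['start'];
  job.lockedAt = now;
  job.lastRunAt = now;
  try {
    handler();
    events.push('success');
  } catch (err) {
    job.failedAt = now;
    job.failReason = err instanceof Error ? err.message : String(err);
    job.failCount = (job.failCount ?? 0) + 1;
    events.push('fail');
  }
  // Cleanup: release the lock and schedule the next run.
  job.lastFinishedAt = now;
  job.lockedAt = null;
  job.nextRunAt = new Date(now.getTime() + intervalMs);
  events.push('complete');
  return events;
}

const THREE_HOURS = 3 * 60 * 60 * 1000;
const firstRunAt = new Date('2024-01-01T00:00:00Z');
const secondRunAt = new Date(firstRunAt.getTime() + THREE_HOURS);
const job: JobState = { name: 'VideoConferences', nextRunAt: firstRunAt, lockedAt: null };

const okEvents = runClaimedJob(job, () => {}, firstRunAt, THREE_HOURS);
const failEvents = runClaimedJob(
  job,
  () => {
    throw new Error('db down');
  },
  secondRunAt,
  THREE_HOURS,
);

console.log(okEvents); // [ 'start', 'success', 'complete' ]
console.log(failEvents); // [ 'start', 'fail', 'complete' ]
```

Note that the failing run still releases the lock and gets a new nextRunAt. A thrown error is recorded on the job (failReason, failCount) without taking the schedule down with it.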

Jobs Are Everywhere (Not Just `server/cron/`)

This was my biggest surprise. I assumed all background jobs lived in apps/meteor/server/cron/. That folder has 6 files — NPS surveys, OEmbed cache cleanup, video conference expiry, etc.

But when I searched the entire codebase for cronJobs.add(, I found 19 distinct jobs registered across many different modules:

Core (6): NPS, OEmbed cleanup, usage reports, video conferences, temp file cleanup, user data exports
System (4): Version checking, cloud workspace sync, retention policy pruning, Smarsh compliance exports
Authentication (5): CROWD sync, plus 4 LDAP jobs (user sync, avatar sync, auto-logout, attribute-based access control)
Apps (2): Marketplace update checks, app request notifications
Livechat (2): Business hour scheduling, daylight saving time adjustment

Each module has its own patterns too. Some jobs have fixed schedules (run at 2 AM daily), some are setting-driven (admin changes frequency from the UI), and some use random offsets — like Cloud Workspace Sync which picks a random minute to prevent all Rocket.Chat instances from hitting the cloud API at the same time:

const minute = Math.floor(Math.random() * 60);
await cronJobs.add(licenseCronName, `${minute} */12 * * *`, ...);
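To see what that expression actually does, here's a small sketch that builds it and checks when it fires. `staggeredCron`, `fieldMatches`, and `firesAt` are hypothetical helpers written for this post, not part of Rocket.Chat or Agenda, and the matcher only understands the minute and hour fields.

```typescript
// Builds the "twice a day at a random minute" expression. The random source
// is injectable so the minute can be pinned for a demo or a test.
function staggeredCron(random: () => number = Math.random): string {
  const minute = Math.floor(random() * 60);
  return `${minute} */12 * * *`;
}

// Minimal matcher for one cron field. Supports "*", "*/n" and plain numbers.
function fieldMatches(field: string, value: number): boolean {
  if (field === '*') return true;
  if (field.startsWith('*/')) return value % Number(field.slice(2)) === 0;
  return Number(field) === value;
}

// Checks only the first two fields (minute, hour); the rest are ignored here.
function firesAt(cron: string, hour: number, minute: number): boolean {
  const [minuteField, hourField] = cron.split(' ');
  return fieldMatches(minuteField, minute) && fieldMatches(hourField, hour);
}

const expr = staggeredCron(() => 0.62); // pinned "random" value for the demo
console.log(expr); // "37 */12 * * *"
console.log(firesAt(expr, 0, 37)); // true  (00:37)
console.log(firesAt(expr, 12, 37)); // true  (12:37)
console.log(firesAt(expr, 6, 37)); // false (06:37)
```

Every instance still syncs twice a day, but each one lands on its own minute, so the load on the cloud API is spread across the hour instead of spiking at :00.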

There Are Actually Two Scheduling Systems

Here's something that took me a while to piece together. Rocket.Chat doesn't have just one scheduling system — it has two:

1. Core Cron Jobs → Use the AgendaCronJobs wrapper → Store jobs in rocketchat_cron → History goes to rocketchat_cron_history

2. Apps Engine Scheduler → Uses its own separate Agenda instance via AppSchedulerBridge → Stores jobs in rocketchat_apps_scheduler

The Apps Engine (Rocket.Chat's app framework) has a completely independent scheduler. When a marketplace app registers background tasks, those go into a different MongoDB collection entirely. Different Agenda instance, different collection, different lifecycle.

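This means that if you ever wanted a complete picture of all background tasks running in a Rocket.Chat instance, you'd need to look at three separate collections: rocketchat_cron and rocketchat_cron_history for the core jobs, plus rocketchat_apps_scheduler for the Apps Engine.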

What I Learned

1. Read the code, not just the docs.
I understood locking by reading _findAndLockNextJob(), not by reading about it. Documentation gives you the "what." Source code gives you the "how" and "why."

2. grep is your best friend.
I found all 19 jobs only because I searched the entire codebase for cronJobs.add( instead of just browsing one folder. Assumptions about where code lives will mislead you.

3. Layered architecture makes complexity manageable.
The person writing an NPS job doesn't need to know about MongoDB locking. The layers abstract that away beautifully.

4. Background jobs are invisible infrastructure.
Users never see them, but without them, files don't get cleaned up, licenses don't sync, old messages don't get pruned, and LDAP users fall out of date. They're the unsung heroes of the application.

If you're exploring a large open source codebase, I'd recommend picking one system and following it end-to-end. Don't try to understand everything at once — just pick a thread and pull it.

Happy exploring! 🚀
