Building a batched notification engine

#notifications #tutorial #architecture #node

👋 We’re Knock. We run flexible notifications infrastructure that developers use to schedule, batch, and deliver notifications, without the need to write custom application code.

In this blog post we'll take a deep dive into building a batched notification engine, including some of the technical details on how you might design and construct it yourself.

What are batched notifications?

A batched notification collapses one or more related notifications that occur within a window of time into a single notification message delivered to a recipient.

In a naive notification system each event maps to a single notification sent to a user. A common example: a document collaboration app where every comment results in an email notification (looking at you, overly noisy Notion default notification settings!). Contrast that with a batched notification system, where all comments left on a document within a window of time are batched together and sent as a single notification message.

When you batch notifications, you send your customers fewer, higher-information-density notifications, which leads to increased engagement on the notifications you do send and, ultimately, happier customers with better retention rates.

Note: you might also think of a batched notification as a kind of "notification digest". However, at Knock we think about digests as operating on a heterogeneous set of notification types. In our document commenting example, a digest could contain notifications about comments, likes, and replies. A batch instead operates on a single notification type. We would build separate batches for comments, likes, and replies.

Designing a batched notification system

There are, broadly speaking, two different approaches we can take in the design of a batched notification system:

Batch on write: notifications are accumulated into batches per recipient when an event has occurred. Batches are "flushed" at the end of a batch window to become a notification.
Batch on read: notification batches are lazily generated by periodically running a task (usually via a cron job) that finds all notifications that have not been sent, collapses them into batches, and sends notifications.

The biggest difference between these two approaches is how they scale: a batch on write system spends extra storage to maintain an optimized lookup table of what needs to be batched and when. A batch on read system must (fairly inefficiently) query ever-increasing amounts of data to determine what to batch and when.
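To make the batch on read cost concrete, here's a rough sketch over an in-memory array standing in for a notifications table. The `sentAt` and `batchKey` fields and the `collectUnsentBatches` name are assumptions made for this illustration, not part of the schema we use later in the post.

```javascript
// Hypothetical batch on read pass: every run has to look at every unsent
// notification and regroup it, so the work grows with the backlog.
function collectUnsentBatches(notifications) {
  const batches = new Map();
  for (const n of notifications) {
    if (n.sentAt !== null) continue; // already delivered, skip it
    // Group by recipient and batch key (illustrative composite key)
    const key = `${n.recipientId}:${n.batchKey}`;
    if (!batches.has(key)) batches.set(key, []);
    batches.get(key).push(n);
  }
  return batches;
}

// Made-up sample rows for illustration only
const sample = [
  { id: 1, recipientId: 7, batchKey: "document:1:comments", sentAt: null },
  { id: 2, recipientId: 7, batchKey: "document:1:comments", sentAt: null },
  { id: 3, recipientId: 8, batchKey: "document:1:comments", sentAt: null },
  { id: 4, recipientId: 7, batchKey: "document:1:comments", sentAt: new Date() },
];
const grouped = collectUnsentBatches(sample);
```

A batch on write system does this grouping once, at the moment each notification is created, so the periodic task only needs to look up batches whose window has closed.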

In our experience building a batch on write system is more effort but is generally worth it to future-proof your notification system. Of course this approach also isn't perfect, and it has its own scaling challenges to overcome. We touch on some of those later on in the post.

Table design

For this example we'll model our system using a good ol' fashioned relational database. Our table design may therefore look like:

A notifications table to keep track of the individual notifications that a user should receive.
A notification_batches table to keep track of all of the batched notifications for a recipient.
A notification_batch_notifications table to keep track of the individual notifications per batch (our entries on the queue).

CREATE TABLE notifications (
  id serial PRIMARY KEY,
  type varchar(255) NOT NULL,
  actor_id INT NOT NULL,
  recipient_id INT NOT NULL,
  object_id INT NOT NULL,
  object_type varchar(255) NOT NULL,
  inserted_at TIMESTAMP NOT NULL
);

CREATE TABLE notification_batches (
  id serial PRIMARY KEY,
  type varchar(255) NOT NULL,
  recipient_id INT NOT NULL,
  batch_key varchar(255) NOT NULL,
  object_id INT NOT NULL,
  object_type varchar(255) NOT NULL,
  closes_at TIMESTAMP NOT NULL,
  processed_at TIMESTAMP,
  inserted_at TIMESTAMP NOT NULL
);

CREATE TABLE notification_batch_notifications (
  notification_batch_id INT NOT NULL,
  notification_id INT NOT NULL,
  inserted_at TIMESTAMP NOT NULL,
  PRIMARY KEY (notification_batch_id, notification_id),
  FOREIGN KEY (notification_batch_id) REFERENCES notification_batches (id),
  FOREIGN KEY (notification_id) REFERENCES notifications (id)
);

A few details on the design of our tables:

We use a polymorphic design with object_id and object_type to reference the object attached to a notification
We use a batch_key on our notification batches table, which we'll use as a lookup key to accumulate items into open batches. For example, if we want to batch all comments in the document for a single recipient our batch_key would be an identifier that includes the document_id
We keep a closes_at timestamp to store when the batch window should close
We store a processed_at timestamp to keep track of the batches that we've flushed
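As a small illustration of the batch_key and closes_at columns, here's a sketch of two helpers. Both `buildBatchKey` and `computeClosesAt` are hypothetical names, and the `objectType:objectId:notificationType` key format is just one possible convention.

```javascript
// Hypothetical helper: builds a lookup key scoped to one notification type
// on one object, e.g. all comments on a single document.
function buildBatchKey(objectType, objectId, notificationType) {
  return `${objectType}:${objectId}:${notificationType}`;
}

// Hypothetical helper: the closes_at value for a newly opened batch is the
// current time plus the batch window (given here in seconds).
function computeClosesAt(now, batchWindowSeconds) {
  return new Date(now.getTime() + batchWindowSeconds * 1000);
}

const exampleKey = buildBatchKey("document", 42, "comments");
const exampleClosesAt = computeClosesAt(new Date(0), 60 * 5);
```

Because the key includes the document identifier, comments on different documents land in different batches even when they're for the same recipient.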

Batching notifications

In order to batch our notifications for our users, we'll want to (per recipient):

Generate a batch_key to use to accumulate notifications into a batch for a window of time
Create a notification entry to keep track of the action that occurred, as well as tracking the object that the action occurred on
Find an "open" notification batch using the batch_key where the batch window has not closed. If there isn't an open batch, then create one using the batch key and set the closes_at window to now() + batch_window
Add the notification entry to the open batch

Let's see what this might look like in practice in our codebase using our document commenting example (granular implementation details omitted):

// Create our comment for the document
const comment = await Comments.createComment(
  document,
  { text: commentText },
  user
);

// Find all of the recipients for the document (excluding the user who created the comment)
const recipients = await Documents.getCollaborators(document);
const recipientsToNotify = recipients.filter((recipient) => recipient.id !== user.id);

// The key we want to query an open batch for
const batchKey = `document:${document.id}:comments`;

// How long do we want this batch window to be open? (5 mins, expressed in seconds)
const batchWindow = 60 * 5;

// Use for...of rather than forEach so that await works inside the loop body
for (const recipient of recipientsToNotify) {
  // For each recipient, generate a notification and add it to the batch
  const notification = await Notifications.createNotification(
    "new-comment",
    { object: comment, actor: user },
    recipient
  );

  // Find an open (unprocessed, not yet closed) batch by the key given for this recipient
  // SELECT * FROM notification_batches AS nb
  // WHERE nb.recipient_id = ? AND nb.batch_key = ?
  //   AND nb.processed_at IS NULL AND nb.closes_at > now();
  const batch = await Notifications.findOrCreateBatch(
    recipient,
    batchKey,
    { object: document, type: "new-comment", batchWindow }
  );

  // Add the notification to the batch
  await Notifications.addNotificationToBatch(batch, notification);
}
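The Notifications.findOrCreateBatch call above is left unimplemented. Here's a minimal sketch of what it could do, using an in-memory array in place of the notification_batches table. The `batchStore` array, the field names, and the simplified function signature are all assumptions made for illustration.

```javascript
// In-memory stand-in for the notification_batches table (illustration only)
const batchStore = [];
let nextBatchId = 1;

// Hypothetical sketch: return the open batch for this recipient and key,
// or open a new one that closes batchWindowSeconds from now.
function findOrCreateBatch(recipientId, batchKey, batchWindowSeconds, now = new Date()) {
  const open = batchStore.find(
    (b) =>
      b.recipientId === recipientId &&
      b.batchKey === batchKey &&
      b.processedAt === null &&
      b.closesAt > now // the window has not closed yet
  );
  if (open) return open;

  const batch = {
    id: nextBatchId++,
    recipientId,
    batchKey,
    closesAt: new Date(now.getTime() + batchWindowSeconds * 1000),
    processedAt: null,
  };
  batchStore.push(batch);
  return batch;
}
```

Note that the find and the create are two separate steps here. Against a real database, two concurrent comments could both miss the lookup and each open a batch, so the lookup and insert would need to be made atomic in some way.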

Flushing closed batches

We'll next need a way to "flush" our batches at the end of the batch window to produce a notification message per recipient. There are two separate approaches we can leverage to do this:

Enqueue a job to be executed at the end of the batch window once the batch has been created
Have a cron task that runs every minute to find any batch windows that are closed but not yet sent

If you're dealing with an ephemeral job queue (like something Redis backed) then the first option might be a non-starter for you given that you could end up dropping scheduled jobs in the event of a failure. Similarly, not all job queues support future scheduled jobs. Instead, let's take a look at some code for executing a cron job to flush our batches:

// Find batches whose window has closed but that have not been processed yet
// SELECT * FROM notification_batches AS batch
// WHERE batch.closes_at <= now() AND batch.processed_at IS NULL;
const batches = await Notifications.getBatchesToBeProcessed();

// Use for...of rather than forEach so that await works inside the loop body
for (const batch of batches) {
  // For each batch, generate a notification message
  const { template, subjectLine } = await Notifications.generateEmailFromBatch(batch);

  // Send our email
  await Notifications.sendEmail(batch.recipient, subjectLine, template);

  // Mark the batch as processed
  await Notifications.markBatchAsProcessed(batch);
}

Notice here that we're also keeping track of a processed_at field for our batches, so that we know if we need to reprocess any of the batches in the event of an issue with the cron job.

Generating our batched notification message

Now that we have our batched notifications, we'll use them to generate actual notification messages. This is the code inside our Notifications.generateEmailFromBatch function in the example above.

Note: one important consideration you'll want to think through here is the total number of items fetched in the batch. In our example, the number of items in the batch can theoretically be unbounded, which may lead to poor performance when fetching and rendering a notification template.
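One way to keep rendering bounded is to cap the number of items shown while still reporting the true total. The sketch below assumes a hypothetical `summarizeBatchItems` helper and an arbitrary default cap of 10 items.

```javascript
// Hypothetical helper: render at most maxItems notifications, but keep the
// counts needed for a line like "and 5 more comments".
function summarizeBatchItems(notifications, maxItems = 10) {
  const shown = notifications.slice(0, maxItems);
  return {
    shown,
    totalCount: notifications.length,
    remainingCount: notifications.length - shown.length,
  };
}

// 15 placeholder notifications for illustration
const exampleItems = Array.from({ length: 15 }, (_, i) => ({ id: i + 1 }));
const summary = summarizeBatchItems(exampleItems);
```

This sketch slices an array that has already been loaded. In a real system you would more likely apply the cap in the query itself and fetch the total with a separate count, so that the full batch is never loaded into memory.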

In our document commenting example, we might have the following template (here, written in Liquid for simplicity) to show the available comments for the document:

<h1>Comments for {{ batch.object.name }}</h1>

{% for notification in batch.notifications %}
  <p>
    <strong>{{ notification.object.author.name }}</strong> said at {{ notification.inserted_at }}:
  </p>

  <blockquote>
    <p>{{ notification.object.text }}</p>
  </blockquote>
{% endfor %}

Preparing this design for production

The above design is a naive implementation of a batching system, and there are a few important details to consider when taking this design to production:

Protecting against race conditions whereby two comments can be created at the same time, leading to multiple batches being generated
Ensuring each batch executes only once, so that we don't send duplicate messages
Handling retries with the delivery of our email notifications
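For the "only once" concern, one common pattern is to have the worker claim a batch before sending and to skip any batch that's already claimed. Here's a sketch of that idea on an in-memory object. The `claimBatch` name and the `processedAt` field are assumptions, and a single-threaded example like this can't demonstrate real concurrency.

```javascript
// Hypothetical sketch of a "claim" step. Against a real database this would
// be one conditional statement, along the lines of:
//   UPDATE notification_batches SET processed_at = now()
//   WHERE id = ? AND processed_at IS NULL
// where only the worker that updated a row goes on to send.
function claimBatch(batch, now = new Date()) {
  if (batch.processedAt !== null) return false; // someone else already claimed it
  batch.processedAt = now;
  return true;
}

const exampleBatch = { id: 1, processedAt: null };
const firstClaim = claimBatch(exampleBatch);
const secondClaim = claimBatch(exampleBatch);
```

Claiming before sending gives at-most-once delivery: if the send then fails, the batch is already marked as processed. That means this approach needs to be paired with a retry mechanism for the delivery step itself.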

Extending our batched notification system

Building upon our design, we may want to handle more advanced scenarios:

Keeping a count of the total number of items that are stored in the batch. This is useful for when you want to display a subset of the items in the batch, but still have the ability to show the total number of items that were added within the window (e.g. "There were 15 new comments on X").
Adding the ability to flush a batch window early. When a batch hits a certain size, flush the batch window early so that users get notified sooner rather than later at specified thresholds of activity.
Removing one or more items from a batch. To return to our document collaboration example, if users are allowed to delete comments, we'll want to remove those comments from our batch before the batch window closes and a notification is sent to users.
Enabling user-specific batch windows. Your users may wish to customize the duration of their batch window, such that they can determine shorter or longer frequencies in which to receive notifications. This is especially helpful for digesting use cases, where some users will want a daily digest and others will want them once a week.
Partitioning cron jobs to flush batches to handle large numbers of users. Most applications won't need this level of optimization, but if your product does serve very large numbers of users this can become an interesting challenge as your notifications scale.
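Two of these extensions are small enough to sketch: flushing early at a size threshold, and splitting the flush work across several cron workers. The function names, the `itemCount` field, and the modulo-based partitioning scheme are all assumptions for illustration.

```javascript
// Hypothetical: a batch is ready to flush when its window has closed OR it
// has reached the size threshold.
function shouldFlush(batch, now, maxItems) {
  return batch.closesAt <= now || batch.itemCount >= maxItems;
}

// Hypothetical: assign each recipient to one of partitionCount workers so
// that each cron job only scans its own slice of batches.
function partitionFor(recipientId, partitionCount) {
  return recipientId % partitionCount;
}

const quietBatch = { closesAt: new Date(1000), itemCount: 1 };
const busyBatch = { closesAt: new Date(1000), itemCount: 5 };
const checkTime = new Date(500); // before either window closes
const flushQuiet = shouldFlush(quietBatch, checkTime, 5);
const flushBusy = shouldFlush(busyBatch, checkTime, 5);
```

With a scheme like this, worker `k` of `partitionCount` would only process batches where `partitionFor(batch.recipientId, partitionCount) === k`.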

Don't want to build this yourself?

If you've read the above and thought that this sounds like a large lift for you or your engineering team, you're not alone. That's exactly why we built Knock.

Knock is a complete solution for powering product notifications that handles batching out of the box with no cron jobs or job queues to set up. We even support removing items from batches. You can use Knock to orchestrate notifications across multiple channels, manage user preferences, and keep your notification templates in a single place that's visible to your whole team.

If you want to try out Knock to power your batched notifications (and much more!), you can sign up for free here. We have a generous free tier you can use to get started.