Design a Notification Fan-Out Service: Push, Email, SMS at Scale

#systemdesign #interview #distributedsystems #scalability

Book: System Design Pocket Guide: Interviews — 15 Real System Designs, Step by Step
Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
Me: xgabriel.com | GitHub

You drew the box labeled "Notification Service." An arrow goes in from the order service. Three arrows go out: push, email, SMS. The interviewer nods, then asks: "A celebrity posts. Two million followers should get a push. Walk me through what happens in the next ten seconds."

That is the whole question. Not "how do you send an email." Anyone can call SendGrid. The question is what happens when one event has to become two million sends, three providers each have their own rate limit, and Twilio returns a 429 halfway through.

What the interviewer is actually testing

A notification system looks simple because the happy path is simple. The hard parts are all in the corners:

One event becomes N sends. N can be 1 or 2 million. The system has to handle both without a separate code path.
Every channel is a different external API with its own latency, error model, and rate limit. SES throttles per-region. Twilio throttles per-number. APNs wants a persistent HTTP/2 connection.
Sends fail constantly and most failures are retryable. Some are not. Telling them apart is the job.
The same notification must never go out twice. A duplicate push at 3am loses you a user.

So the first questions you ask, before drawing anything: "What's the fan-out factor, and is it bounded? Do notifications have a delivery deadline, or is best-effort fine? And is exactly-once a hard requirement, or is at-least-once with dedup acceptable?" The answers shape the next 40 minutes.

Fan-out on write vs fan-out on read

This is the decision that splits the design, and it is the one most candidates skip.

Fan-out on write means: when the event arrives, you immediately expand it into one job per recipient and enqueue all N. The celebrity posts, you push two million rows onto a queue right now.

Fan-out on read means: you store the event once. Each recipient's device pulls (or gets notified to pull) when it next connects. Nothing gets expanded until someone asks.

The trade is the same one you see in social feeds. Write-fan-out gives instant delivery and simple readers, but a single high-degree event creates a write storm and you pay storage for every copy. Read-fan-out is cheap to ingest and handles celebrity accounts without a spike, but readers do more work and "instant" becomes "whenever you next poll."

The answer most production systems land on is hybrid:

Normal users (low follower count): fan-out on write. Expand immediately, deliver fast.
High-degree accounts above a threshold: fan-out on read, or a deferred write-fan-out drained over minutes.

Say the threshold out loud. Something like: "Accounts under 10K followers fan out on write. Above that, I defer and drain the expansion so one celebrity post doesn't starve everyone else's notifications." That sentence is what separates a hire from a "knows the buzzwords."

The pipeline

Four stages, each its own queue. Decoupling them is what lets the slow part (provider sends) not back up the fast part (ingestion).

event -> [ingest] -> [fan-out] -> [per-channel queues]
      -> [channel workers] -> providers

Ingest. Accept the event, validate it, write it once to durable storage, ack the producer. Fast. No expansion here.
Fan-out worker. Reads the event, resolves the recipient list, applies user preferences (did they mute push? unsubscribe from email?), and emits one message per (recipient, channel) onto the right channel queue.
Channel queues. One queue per channel: push, email, sms. Separate queues so a Twilio outage can't block your push throughput.
Channel workers. Pull from their queue, call the provider through an adapter, handle the response, retry or dead-letter as needed.

The fan-out worker is where preference resolution happens, not the channel worker. You want to drop muted recipients before you spend a queue slot on them.

def fan_out(event, queues, prefs_store):
    recipients = resolve_recipients(event)
    for user_id in recipients:
        prefs = prefs_store.get(user_id)
        for channel in ("push", "email", "sms"):
            if not prefs.allows(channel, event.type):
                continue
            msg = {
                "notification_id": event.id,
                "user_id": user_id,
                "channel": channel,
                "payload": render(event, channel, user_id),
            }
            queues[channel].publish(msg)

Channel adapters: one interface, many providers

Every channel speaks a different protocol. You hide that behind one interface so the worker doesn't care whether it's hitting APNs or SES.

from dataclasses import dataclass

@dataclass
class SendResult:
    ok: bool
    retryable: bool
    provider_id: str | None = None
    retry_after: float | None = None

class ChannelAdapter:
    def send(self, msg: dict) -> SendResult:
        raise NotImplementedError

class EmailAdapter(ChannelAdapter):
    def __init__(self, client):
        self.client = client

    def send(self, msg: dict) -> SendResult:
        resp = self.client.send_email(
            to=msg["payload"]["to"],
            subject=msg["payload"]["subject"],
            body=msg["payload"]["body"],
        )
        if resp.status == 200:
            return SendResult(True, False, resp.message_id)
        if resp.status == 429 or resp.status >= 500:
            return SendResult(
                False, True,
                retry_after=resp.retry_after,
            )
        # 400-class: bad address, unsubscribed.
        # Never retry, never succeed.
        return SendResult(False, False)

The split that matters is in the response mapping. A 400 (bad email, invalid phone number) is a permanent failure. Retrying it wastes quota and never works. A 429 or 5xx is transient. The adapter's whole job is to translate each provider's idiosyncratic error model into one honest retryable flag.

Per-provider rate limits

Each provider gives you a budget and punishes you for exceeding it. SES has a per-second send rate per region. Twilio limits messages per number per second. APNs will throttle a connection that pushes too hard. If your channel workers just send as fast as they drain the queue, you will hit those limits during exactly the bursts you built this system to handle.

You need a rate limiter shared across all workers for a given provider, not per-worker. A token bucket in Redis does this:

import time

# refill `rate` tokens/sec, burst up to `capacity`
LUA = """
local key = KEYS[1]
local rate = tonumber(ARGV[1])
local cap = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local t = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(t[1]) or cap
local ts = tonumber(t[2]) or now
tokens = math.min(cap, tokens + (now - ts) * rate)
if tokens < 1 then
  redis.call('HMSET', key, 'tokens', tokens, 'ts', now)
  return 0
end
tokens = tokens - 1
redis.call('HMSET', key, 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', key, 60)
return 1
"""

def allow(redis, provider, rate, cap):
    ok = redis.eval(
        LUA, 1, f"rl:{provider}",
        rate, cap, time.time(),
    )
    return ok == 1

A worker that gets denied a token doesn't drop the message. It re-queues with a short delay, or better, the queue itself supports a visibility delay so the message reappears in a second. The limiter shapes throughput; the queue holds the backlog. Pairing them is what lets you absorb a two-million-send burst without tripping a single provider's limit.

Dedup: the same notification, exactly once-ish

Two things cause duplicates. The producer retries the original event (network blip, no ack received). And your own at-least-once queue redelivers a message after a worker crashes mid-send.

You solve both with an idempotency key and a dedup store. The key is deterministic: notification_id + user_id + channel. Before sending, claim the key. If it's already claimed, skip.

def send_once(redis, adapter, msg, ttl=86400):
    key = (
        f"sent:{msg['notification_id']}"
        f":{msg['user_id']}:{msg['channel']}"
    )
    # SET NX: claim only if not already claimed
    claimed = redis.set(key, "pending", nx=True, ex=ttl)
    if not claimed:
        return  # already sent or in flight
    result = adapter.send(msg)
    if result.ok:
        redis.set(key, "sent", ex=ttl)
    elif not result.retryable:
        redis.set(key, "failed", ex=ttl)
    else:
        # release the claim so retry can re-attempt
        redis.delete(key)
        raise RetryableError()

Be honest in the interview: this is at-least-once delivery with best-effort dedup, not true exactly-once. There's a window where a worker sends to the provider, then crashes before writing sent. The redelivery sees no claim and sends again. True exactly-once would need the provider to accept your idempotency key (Stripe does this; most notification providers don't). For push and email, a rare duplicate is acceptable. For anything billable, you say so and pick a different design.

Retry and the dead-letter queue

Retryable failures get retried with exponential backoff and jitter. Permanent failures go straight to a dead-letter queue. The DLQ is not a graveyard; it's a debugging tool and a manual-replay hatch.

import random

def handle(msg, adapter, requeue, dlq, max_attempts=5):
    attempt = msg.get("attempt", 0)
    result = adapter.send(msg)

    if result.ok:
        return
    if not result.retryable or attempt + 1 >= max_attempts:
        dlq.publish({**msg, "final_attempt": attempt})
        return

    delay = min(2 ** attempt, 60)
    delay += random.uniform(0, delay * 0.2)  # jitter
    if result.retry_after:
        delay = max(delay, result.retry_after)
    requeue.publish({**msg, "attempt": attempt + 1}, delay=delay)

Three things candidates miss here. Jitter (without it, a provider outage recovery causes a thundering retry herd that knocks the provider over again). Honoring retry_after from the 429 (the provider told you when to come back; listen). And a real cap on attempts so a permanently broken recipient doesn't loop forever. Everything past the cap lands in the DLQ with enough context to replay or discard.

The 90-second answer

When they say "design a notification fan-out service," say this:

"Two questions first. What's the fan-out factor, and is it bounded? And is at-least-once with dedup acceptable, or do we need exactly-once? I'll assume fan-out can hit millions and at-least-once is fine for push and email.

Ingest accepts the event, writes it once, acks fast. A fan-out worker resolves recipients and user preferences, then emits one message per recipient-channel pair onto a per-channel queue: push, email, SMS each separate, so one provider outage can't block the others.

For high-degree accounts above a follower threshold, I defer the expansion and drain it over time, so one celebrity post doesn't starve everyone else. Normal users fan out on write for instant delivery.

Channel workers pull from their queue and call providers through an adapter that maps each provider's error model to one retryable flag. A shared token-bucket rate limiter in Redis keeps each provider under its per-second budget. Denied messages re-queue with a delay instead of dropping.

Dedup uses an idempotency key of notification-plus-user-plus-channel, claimed with SET NX before sending. Retryable failures back off with jitter and honor retry-after. Permanent failures and exhausted retries go to a dead-letter queue for replay.

What this does well: absorbs a two-million-send burst without tripping a provider limit or double-sending. What it doesn't do: true exactly-once. For billable notifications I'd need provider-side idempotency."

That hits fan-out strategy, the pipeline, channel adapters, rate limits, dedup, retry, and DLQ, plus the limits of the design. The interviewer can push on any of them and you have a coherent thing to defend.

Follow-ups that catch people

"How do you handle a user who unsubscribes mid-fan-out?" Preference check happens in the fan-out worker, but for a long-draining celebrity expansion you re-check at send time too, because the user might unsubscribe in the minutes between expand and deliver. Cheap Redis lookup, worth it.

"What about scheduled or quiet-hours notifications?" A delay queue or a separate scheduler keyed on send-time. The pipeline downstream doesn't change; only the entry point does. Don't bolt scheduling into the channel worker.

"How do you know a push was actually delivered?" You don't, from the send call alone. APNs and FCM accept the message and deliver asynchronously. Real delivery confirmation comes from a separate feedback channel (device receipts, FCM delivery callbacks) that you fold back into the dedup store's state. Say that the send result means "accepted by the provider," not "seen by the human."

If this was useful

This layered shape (ingest, fan-out strategy, channel adapters, rate limits, dedup, retry, DLQ) is the same skeleton behind half the "design X at scale" questions. The System Design Pocket Guide: Interviews walks through 15 of these end to end, including the queue patterns, idempotency tricks, and rate-limiting designs that turn a scary prompt into a routine one.