Gabriel Anhaia

Designing a Notification System That Survives 10x Traffic Spikes


Picture a growth team that ships a "weekly digest" on a Wednesday. The cron job enqueues a couple million emails in minutes, the email provider throttles the account for exceeding its contracted send rate, and the incident channel turns into a wall of red. Every notification produced after the throttle goes into a tight retry loop, which produces more 429s, which produces more retries.

The canonical failure mode is not volume. It is the shape of the volume: synchronous, undeduplicated, retry-greedy, and bottlenecked behind a third-party rate limit you do not control.

What follows is the second-pass design: the one you build after you have lost a Wednesday to a digest job and want the next launch to absorb 10x without paging anyone.

The problem space, named honestly

Notification systems have to do six things, and every one of them is a place where teams cut corners on day one and regret it on day ninety.

  • Fanout. One event, many recipients. Sometimes 1, sometimes 50M.
  • Channel selection. Push, email, SMS, in-app, webhook. Each has its own SDK, its own throttle, its own failure modes.
  • User preferences. Some users opted out of marketing emails three months ago and you cannot send them this one.
  • Deduplication. The same "your order shipped" event must not produce three SMS texts because three services thought they owned it.
  • Retry and DLQ. Transient failure from APNs is normal. A bad payload is permanent and must not poison the queue.
  • Per-provider rate limiting. SendGrid, Twilio, FCM, APNs each enforce their own caps; exceeding them can result in throttling, deliverability degradation, or account-level enforcement depending on the provider.

Skip any of these and you have built a notification firehose. Get all six right and you have a system.

Why event-driven decoupling is non-negotiable

The synchronous version of "send a notification when X happens" looks innocent. The order service calls the notification service, which calls the email provider, which returns 200, which propagates back. Every layer is on the request path. When the provider takes 8 seconds to respond, the order service takes 8 seconds to respond. When it 429s you, the order service surfaces a 5xx to your customer.

The event-driven version pushes a single event to a topic and returns. Whatever happens downstream (fanout, channel routing, retries, DLQs) is invisible to the producer. This is the only way the system tolerates the failure modes that 10x traffic exposes (scaling notification fan-out at 50M devices).

Concretely, that means a small set of Kafka topics, one per channel, fed by a single producer-side event, with channel-specific worker pools downstream. The shape:

order.created  ─►  notification.dispatcher  ─►  notif.email
                                            ─►  notif.push
                                            ─►  notif.sms
                                            ─►  notif.inapp

The dispatcher is the one place that knows about user preferences, dedup keys, and channel selection. Workers know only their channel.
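
A rough sketch of that produce step, assuming confluent_kafka and the topic names from the diagram. load_prefs and resolve_channels are hypothetical stand-ins for the dispatcher's preference logic (a concrete version of resolve_channels appears in the preferences section below); the idem_key construction is the part the workers depend on.

import json

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})


def dispatch(event: dict) -> None:
    # load_prefs / resolve_channels: hypothetical preference lookup and
    # channel-selection helpers owned by the dispatcher
    for channel in resolve_channels(event["type"], load_prefs(event["user_id"])):
        payload = {
            **event,
            # the dedup key each worker will check: event_id + user_id + channel
            "idem_key": f'{event["event_id"]}:{event["user_id"]}:{channel}',
        }
        producer.produce(f"notif.{channel}", json.dumps(payload).encode())
    producer.flush()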

A 70-line sketch you can actually read

This is the worker side: a Python consumer for the email topic with retry, jitter, dedup, and per-provider rate limiting. It is not production code. It is small enough to fit in your head.

import json
import random
import time
from dataclasses import asdict, dataclass

import redis
from confluent_kafka import Consumer, Producer
from ratelimit import limits, RateLimitException

r = redis.Redis()
DEDUP_TTL = 86_400   # 24h
PROVIDER_RPS = 600   # the send rate your provider contract allows, per second


class TransientProviderError(Exception):
    """Provider 429/5xx: worth retrying."""


class PermanentProviderError(Exception):
    """Provider 4xx (bad address, malformed payload): never retry."""


@dataclass
class Email:
    to: str
    subject: str
    body: str
    idem_key: str  # event_id + user_id + channel


def already_sent(key: str) -> bool:
    # SET NX EX is atomic: only the first worker to claim the key sees a fresh set
    return not r.set(f"sent:{key}", 1, nx=True, ex=DEDUP_TTL)


@limits(calls=PROVIDER_RPS, period=1)  # provider contract cap
def send_via_provider(msg: Email) -> None:
    # provider_client is your provider SDK wrapper; it is expected to raise the
    # exceptions above on 429/5xx/4xx and return quietly on 200
    provider_client.send(msg)


def send_with_retry(msg: Email, max_attempts: int = 5) -> bool:
    for attempt in range(max_attempts):
        try:
            send_via_provider(msg)
            return True
        except RateLimitException:
            time.sleep(0.25 + random.random() * 0.25)   # local limiter said no
        except TransientProviderError:
            backoff = (2 ** attempt) + random.random()  # exponential + jitter
            time.sleep(min(backoff, 30))
        except PermanentProviderError:
            return False  # straight to DLQ, no further retries
    return False

That handles the send path. The consumer loop pulls from Kafka and decides whether to dispatch, dedup, or DLQ:

dlq = Producer({"bootstrap.servers": "kafka:9092"})


def dlq_publish(topic: str, msg: Email) -> None:
    # the DLQ is just another topic; the payload goes across unchanged
    dlq.produce(topic, json.dumps(asdict(msg)).encode())
    dlq.flush()


def consume() -> None:
    c = Consumer({
        "bootstrap.servers": "kafka:9092",
        "group.id": "email-workers",
        "enable.auto.commit": False,   # commit only after success or DLQ
    })
    c.subscribe(["notif.email"])
    while True:
        m = c.poll(1.0)
        if m is None or m.error():
            continue
        msg = Email(**json.loads(m.value()))
        if already_sent(msg.idem_key):
            c.commit(m)   # duplicate: ack and move on
            continue
        if send_with_retry(msg):
            c.commit(m)   # sent: ack
        else:
            dlq_publish("notif.email.dlq", msg)
            c.commit(m)   # parked in the DLQ: ack so the partition keeps moving

Five things in that file are load-bearing.

Idempotency keys are a Redis SET NX EX. Cheap, atomic, expires after 24 hours so the dedup table does not grow forever. The key is event_id + user_id + channel, never just event_id. The same event legitimately produces a push and an email.

Manual offset commits. With auto-commit on (the default), offsets are committed in the background on an interval, regardless of whether the message has actually been processed, so a crash mid-send can silently drop the message. enable.auto.commit=false plus a commit only after success or after a DLQ publish gives you at-least-once delivery, which, paired with the dedup check, approximates exactly-once from the recipient's perspective.

Exponential backoff with jitter. Synchronized retries are how you turn a transient SendGrid blip into a self-inflicted DDoS. The random.random() term is what stops every worker from retrying at the same millisecond (retry storms as self-inflicted DDoS).

Permanent vs transient error split. A 400 from the provider (bad email address, malformed payload) never recovers. Retrying it five times wastes capacity and delays the messages behind it. It goes straight to DLQ.

DLQ is a separate Kafka topic. A topic, with a consumer that is allowed to be slow and human-supervised. You build the dashboard over the DLQ topic; you replay from it when the bug is fixed.
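
When the bug is fixed, the replay pass can be a consumer this small: read the DLQ, republish to the live topic, let the normal pipeline and the dedup key do the rest. A minimal sketch, assuming the DLQ payloads are the original messages unchanged; the group id is illustrative.

from confluent_kafka import Consumer, Producer


def replay_dlq(dlq_topic: str = "notif.email.dlq", live_topic: str = "notif.email") -> None:
    c = Consumer({
        "bootstrap.servers": "kafka:9092",
        "group.id": "dlq-replay",
        "auto.offset.reset": "earliest",
        "enable.auto.commit": False,
    })
    p = Producer({"bootstrap.servers": "kafka:9092"})
    c.subscribe([dlq_topic])
    while True:
        m = c.poll(1.0)
        if m is None:
            break  # nothing left to replay
        if m.error():
            continue
        p.produce(live_topic, m.value())  # dedup keys still guard against double sends
        c.commit(m)
    p.flush()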

The dedup story, in detail

The mistake teams make: they put the dedup key in the database, write to it on send, and check it on the next event. Two problems. One, the database becomes a hot spot during fanout. Two, the check-then-write is not atomic. Two workers picking up the same message will both pass the check.

Redis SET key 1 NX EX 86400 is atomic and runs at hundreds of thousands of ops per second per node. At 50M-event fanout you have to shard the Redis cluster, but the per-key cost stays flat. The 24-hour TTL is a tradeoff: long enough that retries within a day collapse, short enough that the keyspace does not grow without bound. If you need longer dedup windows, persist the idempotency keys to a partitioned table and check Redis first as a hot cache.
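
If you do take that route, the layered check stays small. A sketch under loose assumptions: db is a hypothetical psycopg-style connection, and sent_notifications is a placeholder for your partitioned idempotency table with a unique constraint on the key.

def already_sent_long_window(key: str) -> bool:
    # Hot path: the same atomic Redis claim the worker uses
    if not r.set(f"sent:{key}", 1, nx=True, ex=DEDUP_TTL):
        return True
    # Cold path: the unique constraint makes the insert itself the atomic check
    cur = db.cursor()
    cur.execute(
        "INSERT INTO sent_notifications (idem_key) VALUES (%s) ON CONFLICT DO NOTHING",
        (key,),
    )
    db.commit()
    return cur.rowcount == 0  # zero rows inserted means the key was already claimed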

Fanout for the 50M case

The 1-event-per-recipient model breaks at scale. If your "all users" notification produces 50M Kafka messages on a single topic, you have moved the problem from the producer to the broker.

The pattern that survives: a single campaign event with a target predicate. A campaign orchestrator paginates the user table in batches of 10k–50k and emits one batch message per channel. Workers expand each batch and dispatch in parallel.

campaign.created (target=premium_us, copy_id=xyz)
  → orchestrator paginates users
    → notif.email.batch (users=[1..10000], copy_id=xyz)
    → notif.email.batch (users=[10001..20000], copy_id=xyz)
    → ...

This keeps the message count bounded by users / batch_size rather than users. It also gives the orchestrator a natural place to apply user preferences and per-channel rate limits before the batch hits the workers (batched fanout pattern).
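
A rough cut of that orchestrator loop, shown for the email channel only. fetch_user_ids_page is a hypothetical helper that applies the target predicate and pages the user table by primary key; batch size and topic name follow the diagram above.

import json

from confluent_kafka import Producer

BATCH_SIZE = 10_000
producer = Producer({"bootstrap.servers": "kafka:9092"})


def run_campaign(campaign: dict) -> None:
    last_id = 0
    while True:
        user_ids = fetch_user_ids_page(campaign["target"], after_id=last_id, limit=BATCH_SIZE)
        if not user_ids:
            break
        producer.produce(
            "notif.email.batch",
            json.dumps({"users": user_ids, "copy_id": campaign["copy_id"]}).encode(),
        )
        last_id = user_ids[-1]  # keyset pagination: no OFFSET scans at 50M rows
    producer.flush()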

Per-provider rate limiting, the right place

Workers cannot self-rate-limit. Twenty workers each capped at 600/sec give you 12,000/sec, which is fine until autoscaling adds ten more workers and your contract was 12,000/sec total.

Two real options.

A token bucket in Redis, refilled at the contract rate, decremented per send. Every worker, regardless of count, takes a token before dispatching. The bucket is the global cap. This is the pattern that survives autoscaling, multi-region, and the on-call engineer who doubles your worker count at 3am.

Or a single-tenant proxy in front of the provider: a small service whose only job is to enforce the contract cap and return 429s to your workers when you are over it. Workers back off on the proxy's 429 just as they would on a real provider 429. The proxy is one more thing to operate, but it gives you a clean place to put per-recipient throttling and provider failover.

For most teams under 10M notifications/day, the Redis token bucket is enough. Above that, the proxy starts paying for itself.
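
A sketch of the shared bucket, written as a Lua script so the refill-and-take happens in one atomic Redis call. Rate, burst, key name, and sleep interval are illustrative; in the worker sketch above you would call acquire_send_token right before send_via_provider.

import random
import time

import redis

r = redis.Redis()

# Refill from elapsed time, then try to take one token, all in a single atomic call.
TOKEN_BUCKET_LUA = """
local key      = KEYS[1]
local rate     = tonumber(ARGV[1])  -- tokens added per second (the contract rate)
local capacity = tonumber(ARGV[2])  -- burst ceiling
local now      = tonumber(ARGV[3])  -- caller's clock, seconds

local data   = redis.call('HMGET', key, 'tokens', 'ts')
local tokens = tonumber(data[1]) or capacity
local ts     = tonumber(data[2]) or now

tokens = math.min(capacity, tokens + (now - ts) * rate)
local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

redis.call('HSET', key, 'tokens', tokens, 'ts', now)
redis.call('EXPIRE', key, 60)
return allowed
"""
take_token = r.register_script(TOKEN_BUCKET_LUA)


def acquire_send_token(provider: str, rate: int, burst: int) -> None:
    # Every worker shares the same bucket key, so adding workers never raises
    # the aggregate send rate past the contract.
    while not take_token(keys=[f"bucket:{provider}"], args=[rate, burst, time.time()]):
        time.sleep(0.05 + random.random() * 0.05)  # short wait with jitter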

Channel selection and user preferences

The dispatcher is where opinionated logic lives. Every event arrives with a notification type. The dispatcher reads the user's preference object (a JSON document, cached aggressively) and decides which channels to fan out to.

The thing teams get wrong: they put preference checks inside each worker. Now you have four workers each loading the same preference doc on every message. The right place is the dispatcher, before the channel-specific events are produced. Workers should be dumb pipes that trust the dispatcher's decision.

Preferences also need a global kill switch. When SendGrid is on fire, you do not want 800k retries piling up in notif.email. A feature-flag-style toggle that lets the dispatcher route emails to a delayed-replay topic, or drop non-urgent ones, saves you during provider outages.
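
In code, the dispatcher-side decision can stay this small. The preference shape, the flag store, and the channel names are assumptions for the sketch, not a schema to adopt.

# In practice the kill switches live behind a feature-flag service, not a dict
KILL_SWITCH = {"email": False, "push": False, "sms": False, "inapp": False}
CHANNELS = ("email", "push", "sms", "inapp")


def resolve_channels(notif_type: str, prefs: dict) -> list[str]:
    enabled = []
    for channel in CHANNELS:
        if KILL_SWITCH.get(channel):
            continue  # sketch drops the channel; a delayed-replay topic is the gentler option
        if prefs.get(notif_type, {}).get(channel, False):
            enabled.append(channel)  # user opted in to this type on this channel
    return enabled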

What 10x actually breaks

When traffic goes 10x, four things fail in this order.

  1. The producer's synchronous path. Anything still on the request path slows down. Fix: be honest about which calls return 202 and which return 200.
  2. The dedup store. Redis hits its connection limit before its CPU. Fix: connection pooling, sharding, and a circuit breaker that fails open on dedup (you'd rather send twice than not at all, for most notification types; see the sketch after this list).
  3. The provider rate limit. 429s pile up faster than retries can drain. Fix: token bucket, graceful degradation per channel, and DLQ replay.
  4. The DLQ. It fills up faster than the human team can investigate. Fix: a triage consumer that auto-categorizes by error type and only pages a human for unknown failures.
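
For the dedup fail-open in particular, the wrapper is a few lines: tight socket timeouts on the Redis client and a catch that treats a dedup outage as "not sent yet". The timeout values are illustrative.

import redis

DEDUP_TTL = 86_400
r = redis.Redis(socket_timeout=0.05, socket_connect_timeout=0.05)


def already_sent_failopen(key: str) -> bool:
    try:
        return not r.set(f"sent:{key}", 1, nx=True, ex=DEDUP_TTL)
    except redis.RedisError:
        # Fail open: for most notification types a duplicate beats a dropped send
        return False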

The system that survives is the one where every step in the pipeline is asynchronous, every retry has jitter, every dedup is atomic, and every provider call is rate-limited at the global level. None of that is fancy. All of it is what the digest job that took down Wednesday did not have.

If this was useful

If you are going to be designing systems like this in interviews or design reviews, System Design Pocket Guide: Interviews walks through fifteen real designs at this level of detail. And the tradeoffs around topics, dedup, and at-least-once vs exactly-once are the entire subject of the Event-Driven Architecture Pocket Guide: Outbox, Saga, CQRS, and the failure modes that kill teams who skip them.
