Damir Karimov

Posted on Jun 9 • Originally published at blog.damir-karimov.com

Real-Time Notification Systems Are Harder Than Most Teams Expect

#systemdesign #websockets #kafka #backend

If you’ve ever thought, “It’s just a WebSocket event,” this article is for you.

Notification systems look simple on the surface, but in production they fail in annoying, expensive, and user-visible ways:

duplicate notifications
missing events
race conditions
delayed delivery
mobile disconnects
retry storms
ordering bugs
state drift across regions

The tricky part is not sending a message.

The tricky part is making sure the right user gets the right notification, in the right order, with enough reliability that the system can survive crashes, retries, and mobile networks.

The problem with “just send a WebSocket event”

A basic notification flow looks like this:

Backend → WebSocket Server → Client

That works in local dev. It even works for a while in production.

Then real traffic arrives, and the system suddenly has to handle:

reconnects
offline users
multiple devices
persistence
retries
fan-out
backpressure
push fallbacks
deduplication
ordering

At that point, your “WebSocket feature” has become a distributed messaging system.

And that is a very different problem.

Delivery semantics matter first

Before you design the system, decide what guarantees you need.

Semantics	Meaning	Use case
At-most-once	Messages may be lost, but won’t be duplicated	Low-priority updates
At-least-once	Messages won’t be lost, but may be duplicated	Payments, security alerts
Effectively-once	At-least-once delivery plus deduplication, so each message takes effect only once	Critical product events

Most teams make a mistake here.

They start building transport first, then discover later that they actually needed:

idempotency keys
durable cursors
sequence numbers
replay support
acknowledgements

That is why notification systems become expensive: the real problem is not delivery, it is delivery semantics.
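To make acknowledgements and replay concrete, here is a minimal sketch of a per-user outbox. The `Outbox` class and its method names are hypothetical, and the in-memory dict stands in for what would need to be durable storage in a real system.

```python
from dataclasses import dataclass, field


@dataclass
class Outbox:
    """Per-user outbox: keeps every notification until the client acks it."""

    next_sequence: int = 1
    pending: dict = field(default_factory=dict)  # sequence -> payload

    def publish(self, payload):
        """Assign the next sequence number and hold the payload until acked."""
        sequence = self.next_sequence
        self.next_sequence += 1
        self.pending[sequence] = payload
        return sequence

    def ack(self, up_to_sequence):
        """Client confirms it has everything up to and including this sequence."""
        for sequence in [s for s in self.pending if s <= up_to_sequence]:
            del self.pending[sequence]

    def replay(self):
        """Everything not yet acknowledged, in sequence order (sent on reconnect)."""
        return [(s, self.pending[s]) for s in sorted(self.pending)]


outbox = Outbox()
outbox.publish("payment received")
outbox.publish("payment refunded")
outbox.publish("receipt ready")
outbox.ack(1)  # the client confirmed only the first message before disconnecting
unacked = outbox.replay()  # what the server would resend on reconnect
```

A cumulative ack ("everything up to N") keeps the protocol small, but it only works if the client applies notifications in sequence order.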

Why fan-out breaks systems

One event is easy.

One event to 10,000 users is not.

A single action can trigger:

feed updates
badge counter updates
push notifications
email digests
analytics events
moderation triggers

That creates:

queue amplification
retry cascades
hot partitions
uneven load
latency spikes

So the system stops being “send a notification” and becomes “shape traffic safely under failure.”
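One common way to shape fan-out traffic is to split the recipient list into bounded batches, so a single event becomes many small jobs that can be retried independently. The sketch below assumes a hypothetical `plan_fanout` helper and an arbitrary batch size of 500; the right size depends on your workers and queue.

```python
from itertools import islice


def batched(recipients, batch_size):
    """Yield lists of at most batch_size recipients."""
    iterator = iter(recipients)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def plan_fanout(event_id, recipients, batch_size=500):
    """Turn one event into bounded-size jobs instead of one giant job."""
    return [
        {"event_id": event_id, "batch": index, "user_ids": batch}
        for index, batch in enumerate(batched(recipients, batch_size))
    ]


# One event for 10,000 users becomes 20 jobs of 500 users each.
jobs = plan_fanout("evt-1", range(10_000), batch_size=500)
```

If one batch fails, only those 500 users are retried, which limits how far a retry cascade can spread.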

Why duplicates happen

Duplicates rarely come from a single bug. They emerge from the interaction of retries, crashes, and missing idempotency.

A common chain looks like this:

1. Message is written to Kafka
2. Consumer processes it and sends the notification
3. Consumer crashes before committing offset
4. Partition is reassigned
5. Another consumer reads the same message
6. User gets the notification twice

That is not random.

That is at-least-once delivery with missing deduplication.
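The chain above can be reproduced without Kafka. This is a toy simulation, not real client code: `Partition` is a hypothetical stand-in for one partition and its committed offset, and the "crash" is just an early return.

```python
class Partition:
    """Stand-in for one Kafka partition with a committed offset."""

    def __init__(self, messages):
        self.messages = messages
        self.committed = 0  # offset of the next message to process

    def poll(self):
        """Return the first uncommitted message and its offset."""
        return self.committed, self.messages[self.committed]

    def commit(self, offset):
        self.committed = offset + 1


delivered = []  # what the user actually sees


def consume(partition, crash_before_commit=False):
    offset, message = partition.poll()
    delivered.append(message)  # the side effect happens first
    if crash_before_commit:
        return  # simulated crash: the offset is never committed
    partition.commit(offset)


partition = Partition(["payment received"])
consume(partition, crash_before_commit=True)  # consumer A crashes after sending
consume(partition)  # consumer B takes over and reads the same offset
```

After both calls, `delivered` holds the same notification twice, even though each consumer behaved correctly in isolation.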

The fix

Use:

idempotency keys
dedupe storage
sequence numbers
consumer-side protection before side effects
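A minimal sketch of consumer-side protection might look like the following. `DedupeStore` is a hypothetical in-memory stand-in for a shared store with expiring keys, and `handle` is an illustrative name, not part of any library.

```python
class DedupeStore:
    """In-memory stand-in for a shared dedupe store with expiring keys."""

    def __init__(self):
        self._seen = set()

    def claim(self, key):
        """Return True only the first time a key is claimed."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def handle(message, store, send):
    """Run the side effect at most once per idempotency key."""
    key = f"{message['user_id']}:{message['sequence']}"
    if not store.claim(key):
        return False  # duplicate: skip the side effect
    send(message)
    return True


sent = []
store = DedupeStore()
message = {"user_id": "user-1", "sequence": 7, "text": "payment received"}

# The same message is delivered three times, as it could be after a redelivery.
results = [handle(message, store, sent.append) for _ in range(3)]
```

Only the first call sends anything. Claiming before the side effect trades a possible duplicate for a possible loss if the process dies between the claim and the send, so the order should match the semantics you chose earlier.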

Ordering is harder than throughput

Most users don’t complain about 200ms of delay.

They absolutely notice this:

“Payment refunded” arrives before “Payment received”

That destroys trust immediately.

Global ordering is usually too expensive. In practice, teams often choose:

per-user ordering
per-conversation ordering
approximate ordering
causal consistency where needed

For most products, per-user ordering is the best balance.

Example: sequence numbers per user

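import json


class NotificationLog:
    """Stamps each of a user's notifications with a sequence number.

    `producer` is assumed to expose produce(topic=..., key=..., value=...),
    as confluent_kafka.Producer does. The counter is in memory here; persist
    it in production so a restart cannot reuse a sequence number.
    """

    def __init__(self, user_id, producer):
        self.user_id = user_id
        self.producer = producer
        self.sequence = 0

    def append(self, notification):
        self.sequence += 1
        record = {
            "user_id": self.user_id,
            "sequence": self.sequence,
            "notification": notification,
            "idempotency_key": f"{self.user_id}:{self.sequence}",
        }
        # Keying by user_id routes all of a user's records to one partition.
        self.producer.produce(
            topic="notifications", key=str(self.user_id), value=json.dumps(record)
        )
        return record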

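This is a simple way to keep ordering stable for each user. Kafka preserves order only within a partition, and keying by user_id sends all of a user's records to the same partition.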
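Sequence numbers only help if the receiving side uses them. As a sketch, a client (or gateway) could hold back early arrivals until the gap is filled. `ReorderBuffer` is a hypothetical name, and this version assumes sequences start at 1 and never skips a missing number on its own.

```python
class ReorderBuffer:
    """Releases notifications strictly in sequence order, holding back early arrivals."""

    def __init__(self):
        self.next_expected = 1
        self.held = {}  # sequence -> notification that arrived early

    def receive(self, sequence, notification):
        """Return the notifications that can be shown now, in order."""
        if sequence < self.next_expected or sequence in self.held:
            return []  # duplicate of something already shown or already held
        self.held[sequence] = notification
        ready = []
        while self.next_expected in self.held:
            ready.append(self.held.pop(self.next_expected))
            self.next_expected += 1
        return ready


buffer = ReorderBuffer()
first = buffer.receive(2, "Payment refunded")   # arrives early, held back
second = buffer.receive(1, "Payment received")  # fills the gap, releases both
third = buffer.receive(1, "Payment received")   # late duplicate, ignored
```

A production version would also need a timeout or a resync request, so that one permanently lost sequence number does not block everything behind it.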

Mobile makes everything worse

Desktop clients are relatively stable.

Mobile clients are not.

You have to deal with:

app backgrounding
battery optimization
network switching
silent disconnects
delayed push delivery
OS throttling

That’s why real systems often combine:

WebSockets for active sessions
APNs for iOS
FCM for Android
polling or pull fallback
local persistence
sync checkpoints

Important detail

Neither APNs nor FCM guarantees delivery, let alone exactly-once delivery.

They can:

delay notifications
drop messages under pressure
coalesce updates
expire tokens
throttle traffic

So if the notification matters, the server still needs durable state.
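One way to use that durable state is a sync checkpoint: the client remembers the last sequence number it has seen and asks the server for anything newer when it wakes up. The sketch below uses a hypothetical `NotificationStore` with an in-memory dict in place of a real database.

```python
class NotificationStore:
    """In-memory stand-in for durable per-user notification storage."""

    def __init__(self):
        self._by_user = {}  # user_id -> list of (sequence, payload)

    def append(self, user_id, payload):
        """Store a notification and return its per-user sequence number."""
        log = self._by_user.setdefault(user_id, [])
        sequence = len(log) + 1
        log.append((sequence, payload))
        return sequence

    def sync(self, user_id, after_sequence, limit=100):
        """Return up to `limit` notifications newer than the client's checkpoint."""
        log = self._by_user.get(user_id, [])
        page = [entry for entry in log if entry[0] > after_sequence][:limit]
        new_checkpoint = page[-1][0] if page else after_sequence
        return page, new_checkpoint


store = NotificationStore()
for text in ["order placed", "payment received", "order shipped"]:
    store.append("user-1", text)

# The device last saw sequence 1 before it went to sleep.
missed, checkpoint = store.sync("user-1", after_sequence=1)
```

With this in place, a dropped or coalesced push is only a latency problem: the next sync still returns the missing notifications.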

A real incident example

At 3AM, an on-call engineer gets paged because one user received dozens of duplicate payment emails.

What happened?

the consumer crashed mid-batch
the offset was not committed
Kafka redelivered the same event
the email sender had no dedupe check
the user got spammed

That kind of issue is painful because it is not one bug.

It is a chain of small design decisions that only becomes visible under failure.

The practical fix

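# Assumes `redis` is a redis-py client, `consumer` is the Kafka consumer for
# this message, and `email_service` is the email client.
def send_notification(notification):
    idempotency_key = f"{notification.user_id}:{notification.sequence}"
    dedupe_key = f"dedupe:{idempotency_key}"

    # SET NX claims the key atomically, so two consumers holding the same
    # message cannot both pass the check (EXISTS followed by SETEX can race).
    # Trade-off: a crash between the claim and the send drops this email.
    if not redis.set(dedupe_key, "1", nx=True, ex=24 * 3600):
        consumer.commit()  # already handled: just advance the offset
        return

    try:
        email_service.send(notification)
    except Exception:
        redis.delete(dedupe_key)  # release the claim so a retry can send
        raise
    consumer.commit()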

The important part is not the code style.

It is the fact that the system now assumes duplicates can happen and is built to survive them.

Observability is not optional

If you cannot observe the pipeline, you cannot debug it.

Useful signals include:

queue lag
retry count
delivery success rate
connection churn
consumer health
fan-out latency
push provider errors

The real question is not:

Did we send the event?

It is:

Can we prove the user received it?

Those are very different questions.

Metrics worth tracking

Metric	Good target
Delivery success rate	>99.5%
p99 delivery latency	<500ms
Consumer lag	low and stable
Retry rate	close to zero
Connection churn	predictable
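As an illustration of how the first two rows could be computed from raw delivery attempts, here is a small sketch. The function names and the `(delivered, latency_ms)` tuple shape are assumptions, and the percentile uses the nearest-rank method; your metrics backend may define p99 slightly differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def delivery_report(attempts):
    """attempts: list of (delivered: bool, latency_ms: float) tuples."""
    latencies = [latency for delivered, latency in attempts if delivered]
    return {
        "success_rate": len(latencies) / len(attempts),
        "p99_latency_ms": percentile(latencies, 99),
    }


# 199 successful deliveries with latencies of 1..199 ms, plus one failure.
attempts = [(True, float(ms)) for ms in range(1, 200)] + [(False, 0.0)]
report = delivery_report(attempts)
```

Latency is computed over successful deliveries only, so a falling success rate has to be watched alongside it rather than inferred from it.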

What to do in production

A good notification system needs a runbook, not just code.

If retries spike:

Throttle producers.
Pause non-critical workers.
Increase retry backoff.
Check consumer lag.
Check push provider errors.
Rehydrate missed clients from durable state.
Replay safely with idempotency keys.
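For the "increase retry backoff" step, a common approach is exponential backoff with jitter, so that retrying workers spread out instead of hitting the provider in lockstep. This is a sketch of the "full jitter" variant; the `base` and `cap` defaults are arbitrary assumptions.

```python
import random


def backoff_delay(attempt, base=0.5, cap=60.0, rng=random):
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


rng = random.Random(42)  # seeded only so the example is repeatable
delays = [backoff_delay(attempt, rng=rng) for attempt in range(10)]
```

The ceiling doubles with each attempt until it reaches `cap`, and the random draw below the ceiling is what breaks up a retry storm.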

That is what makes the system operationally survivable.

The scaling path

A lot of teams go through the same evolution.

Startup

API → WebSocket Server → Client

Mid-scale

API → Kafka → Notification Workers → WebSocket Gateway Cluster → Client

High-scale

API → Multi-region event bus → regional workers → regional gateways → clients

At higher scale, the hardest problems are usually:

state distribution
per-user ordering
region routing
dedupe
offline recovery

Not CPU.

State.
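One simple way to keep a user's state in a single place is to pin each user to a home region (or partition) with a stable hash. The sketch below is an assumption about how that could look, not a description of any particular system; the region names are made up.

```python
import hashlib


def home_for(user_id, targets):
    """Pick a stable target (region or partition) for a user."""
    # Python's built-in hash() is salted per process for strings, so a
    # cryptographic digest is used to get the same answer on every host.
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(targets)
    return targets[index]


REGIONS = ["eu-west", "us-east", "ap-south"]  # hypothetical region names
assignments = {uid: home_for(uid, REGIONS) for uid in ("alice", "bob", "carol")}
```

Because every host computes the same home for a given user, sequence numbers and dedupe keys for that user can live in one region. Plain modulo hashing remaps most users when the target list changes, so a consistent-hashing scheme is worth considering if regions are added or removed often.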

What strong teams optimize for

Early teams optimize for:

speed of delivery

Strong teams optimize for:

correctness
recoverability
observability
graceful degradation
idempotency

That difference matters a lot in production.

A notification that is slightly late is usually acceptable.

A notification that is duplicated, lost, or out of order is not.

Testing the system

You should test the failure modes, not just the happy path.

Try:

dropping WebSocket connections mid-message
killing consumers during processing
simulating mobile sleep/wake cycles
forcing Kafka rebalances
replaying duplicate events
load testing fan-out spikes

If the system only works when nothing fails, it is not ready.

Final thought

Real-time notification systems look simple until scale, retries, mobile behavior, ordering, and distributed state show up.

Then they become one of the hardest backend problems in the product.

The goal is not just to send events.

The goal is to make sure the right user gets the right notification, with the right semantics, even when the system is under stress.
