Damir Karimov

Posted on • Originally published at blog.damir-karimov.com

Real-Time Notification Systems Are Harder Than Most Teams Expect

If you’ve ever thought, “It’s just a WebSocket event,” this article is for you.

Notification systems look simple on the surface, but in production they fail in annoying, expensive, and user-visible ways:

  • duplicate notifications
  • missing events
  • race conditions
  • delayed delivery
  • mobile disconnects
  • retry storms
  • ordering bugs
  • state drift across regions

The tricky part is not sending a message.

The tricky part is making sure the right user gets the right notification, in the right order, with enough reliability that the system can survive crashes, retries, and mobile networks.


The problem with “just send a WebSocket event”

A basic notification flow looks like this:

Backend → WebSocket Server → Client

That works in local dev. It even works for a while in production.

Then real traffic arrives, and the system suddenly has to handle:

  • reconnects
  • offline users
  • multiple devices
  • persistence
  • retries
  • fan-out
  • backpressure
  • push fallbacks
  • deduplication
  • ordering

At that point, your “WebSocket feature” has become a distributed messaging system.

And that is a very different problem.


Delivery semantics matter first

Before you design the system, decide what guarantees you need.

| Semantics | Meaning | Use case |
|---|---|---|
| At-most-once | Messages may be lost, but won’t be duplicated | Low-priority updates |
| At-least-once | Messages won’t be lost, but may be duplicated | Payments, security alerts |
| Effectively-once | Duplicates are removed with dedupe logic | Critical product events |

Most teams make a mistake here.

They start building transport first, then discover later that they actually needed:

  • idempotency keys
  • durable cursors
  • sequence numbers
  • replay support
  • acknowledgements

That is why notification systems become expensive: the real problem is not delivery, it is delivery semantics.
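
To make the three rows of the table concrete, here is a minimal sketch that runs them over a scripted lossy channel. `ScriptedChannel` and the helper functions are hypothetical names invented for this illustration; they do not come from any real library, and a real transport would also need timeouts and persistence.

```python
from typing import Iterator, List, Set


class ScriptedChannel:
    """Hypothetical lossy channel: each send follows a scripted outcome.

    "ok"       -> delivered and acknowledged
    "drop"     -> message lost in transit
    "ack_lost" -> delivered, but the acknowledgement never arrives
    """

    def __init__(self, outcomes: List[str]) -> None:
        self._outcomes: Iterator[str] = iter(outcomes)
        self.received: List[str] = []

    def send(self, message_id: str) -> bool:
        outcome = next(self._outcomes)
        if outcome != "drop":
            self.received.append(message_id)
        return outcome == "ok"  # True means the sender saw an ack


def send_at_most_once(channel: ScriptedChannel, message_id: str) -> None:
    """Fire and forget: the message is never retried."""
    channel.send(message_id)


def send_at_least_once(channel: ScriptedChannel, message_id: str,
                       max_attempts: int = 5) -> bool:
    """Retry until an ack is seen; a lost ack causes a duplicate delivery."""
    for _ in range(max_attempts):
        if channel.send(message_id):
            return True
    return False


def dedupe(received: List[str]) -> List[str]:
    """Receiver-side dedupe keyed on message id, keeping first-seen order."""
    seen: Set[str] = set()
    unique: List[str] = []
    for message_id in received:
        if message_id not in seen:
            seen.add(message_id)
            unique.append(message_id)
    return unique


lossy = ScriptedChannel(["drop"])
send_at_most_once(lossy, "n1")

flaky = ScriptedChannel(["ack_lost", "ok"])
send_at_least_once(flaky, "n1")

print(lossy.received)          # at-most-once: the message was lost
print(flaky.received)          # at-least-once: the message arrived twice
print(dedupe(flaky.received))  # effectively-once: one copy after dedupe
```

The point of the sketch is that "effectively-once" is not a transport feature. It is at-least-once delivery plus a receiver that recognizes message ids it has already handled.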


Why fan-out breaks systems

One event is easy.

One event to 10,000 users is not.

A single action can trigger:

  • feed updates
  • badge counter updates
  • push notifications
  • email digests
  • analytics events
  • moderation triggers

That creates:

  • queue amplification
  • retry cascades
  • hot partitions
  • uneven load
  • latency spikes

So the system stops being “send a notification” and becomes “shape traffic safely under failure.”
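
One common way to shape that traffic is to split a large fan-out into bounded jobs that can be queued, throttled, and retried independently. The sketch below assumes a hypothetical `plan_fan_out` helper and a batch size of 500; both are illustrative choices, not values from a specific system.

```python
from typing import Dict, Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def plan_fan_out(event_id: str, recipients: List[str],
                 batch_size: int = 500) -> List[Dict]:
    """Turn one event into bounded jobs instead of one huge task."""
    return [
        {
            # Stable job id, so a retried job can be recognized and deduped.
            "job_id": f"{event_id}:{index}",
            "event_id": event_id,
            "recipients": batch,
        }
        for index, batch in enumerate(chunked(recipients, batch_size))
    ]


followers = [f"user-{n}" for n in range(10_000)]
jobs = plan_fan_out("post-42", followers, batch_size=500)
print(len(jobs), len(jobs[0]["recipients"]))
```

With bounded jobs, a failure costs one batch retry rather than a retry of all 10,000 deliveries, which limits how far a retry cascade can spread.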


Why duplicates happen

Duplicates rarely come from a single bug. They emerge from the interaction of retries, crashes, and missing idempotency.

A common chain looks like this:

1. Message is written to Kafka
2. Consumer processes it
3. Consumer crashes before committing offset
4. Partition is reassigned
5. Another consumer reads the same message
6. User gets the notification twice

That is not random.

That is at-least-once delivery with missing deduplication.

The fix

Use:

  • idempotency keys
  • dedupe storage
  • sequence numbers
  • consumer-side protection before side effects
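
The crash chain and the fix can be reproduced without a broker. The following sketch uses an in-memory `Partition` class as a stand-in for one Kafka partition; the class, `consume`, and `run_with_crash` are hypothetical names for this illustration, and the `seen` set stands in for a durable dedupe store.

```python
from typing import Callable, List, Set


class Partition:
    """In-memory stand-in for one partition with a committed offset."""

    def __init__(self, messages: List[dict]) -> None:
        self.messages = messages
        self.committed = 0

    def poll(self) -> List[dict]:
        # A consumer always resumes from the last committed offset.
        return self.messages[self.committed:]

    def commit(self, offset: int) -> None:
        self.committed = offset


def consume(partition: Partition, handle: Callable[[dict], None],
            crash_before_commit: bool = False) -> None:
    batch = partition.poll()
    for message in batch:
        handle(message)
    if crash_before_commit:
        return  # simulated crash: the offset is never committed
    partition.commit(partition.committed + len(batch))


def run_with_crash(handle: Callable[[dict], None]) -> None:
    partition = Partition([{"idempotency_key": "u1:1"}])
    consume(partition, handle, crash_before_commit=True)  # first consumer dies
    consume(partition, handle)                            # replacement re-reads


# Without protection, the side effect runs twice.
naive_sent: List[str] = []
run_with_crash(lambda message: naive_sent.append(message["idempotency_key"]))

# With a dedupe check before the side effect, it runs once.
safe_sent: List[str] = []
seen: Set[str] = set()


def idempotent_handle(message: dict) -> None:
    key = message["idempotency_key"]
    if key in seen:
        return
    seen.add(key)
    safe_sent.append(key)


run_with_crash(idempotent_handle)

print(naive_sent)
print(safe_sent)
```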

Ordering is harder than throughput

Most users don’t complain about 200ms of delay.

They absolutely notice this:

“Payment refunded” arrives before “Payment received”

That destroys trust immediately.

Global ordering is usually too expensive. In practice, teams often choose:

  • per-user ordering
  • per-conversation ordering
  • approximate ordering
  • causal consistency where needed

For most products, per-user ordering is the best balance.

Example: sequence numbers per user

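import json

class NotificationLog:
    def __init__(self, user_id, producer):
        self.user_id = user_id
        # Assumed to be a Kafka producer such as confluent_kafka.Producer.
        self.producer = producer
        # In-memory counter for illustration only; persist it in production
        # so sequence numbers survive a restart.
        self.sequence = 0

    def append(self, notification):
        self.sequence += 1
        record = {
            "user_id": self.user_id,
            "sequence": self.sequence,
            "notification": notification,
            "idempotency_key": f"{self.user_id}:{self.sequence}",
        }
        # Keying by user_id routes every record for this user to the same
        # partition, which is what keeps them in order.
        self.producer.produce(
            "notifications",
            key=str(self.user_id),
            value=json.dumps(record),
        )
        return record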

Kafka preserves order only within a partition, so keying every record by user ID keeps one user's notifications in order without paying for global ordering.
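
Sequence numbers only help if the receiver uses them. Here is a minimal sketch of a client-side buffer that holds back records that arrive early and releases them in order. `InOrderBuffer` is a hypothetical name, and the sketch assumes sequences start at 1 and have no permanent gaps; a real client would also need a timeout or a resync path for a gap that never fills.

```python
from typing import Dict, List


class InOrderBuffer:
    """Releases records in sequence order, holding back early arrivals."""

    def __init__(self) -> None:
        self.next_sequence = 1
        self.pending: Dict[int, dict] = {}

    def receive(self, record: dict) -> List[dict]:
        sequence = record["sequence"]
        if sequence < self.next_sequence or sequence in self.pending:
            return []  # duplicate of something already delivered or buffered
        self.pending[sequence] = record

        # Release the longest contiguous run starting at next_sequence.
        ready: List[dict] = []
        while self.next_sequence in self.pending:
            ready.append(self.pending.pop(self.next_sequence))
            self.next_sequence += 1
        return ready


buffer = InOrderBuffer()
early = buffer.receive({"sequence": 2, "notification": "Payment refunded"})
released = buffer.receive({"sequence": 1, "notification": "Payment received"})

print(early)
print([record["notification"] for record in released])
```

In this run the refund arrives first but is held until the receipt shows up, so the user sees "Payment received" before "Payment refunded".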


Mobile makes everything worse

Desktop clients are relatively stable.

Mobile clients are not.

You have to deal with:

  • app backgrounding
  • battery optimization
  • network switching
  • silent disconnects
  • delayed push delivery
  • OS throttling

That’s why real systems often combine:

  • WebSockets for active sessions
  • APNs for iOS
  • FCM for Android
  • polling or pull fallback
  • local persistence
  • sync checkpoints

Important detail

APNs and FCM are best-effort transports. They guarantee neither delivery nor single delivery.

They can:

  • delay notifications
  • drop messages under pressure
  • coalesce updates
  • expire tokens
  • throttle traffic

So if the notification matters, the server still needs durable state.
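
A sync checkpoint is one way to use that durable state: the client remembers the last sequence it saw, and on reconnect it asks the server for everything after it. The sketch below uses an in-memory `NotificationStore` as a stand-in for a database table; the class and its method names are hypothetical, and it assumes dense, 1-based sequence numbers per user.

```python
from collections import defaultdict
from typing import DefaultDict, List


class NotificationStore:
    """In-memory stand-in for durable per-user notification storage."""

    def __init__(self) -> None:
        self._by_user: DefaultDict[str, List[dict]] = defaultdict(list)

    def append(self, user_id: str, notification: str) -> dict:
        records = self._by_user[user_id]
        record = {"sequence": len(records) + 1, "notification": notification}
        records.append(record)
        return record

    def since(self, user_id: str, last_seen_sequence: int,
              limit: int = 100) -> List[dict]:
        """Return records the client has not seen yet, oldest first."""
        records = self._by_user.get(user_id, [])
        # Sequences are 1-based and dense, so the cursor doubles as an index.
        start = max(last_seen_sequence, 0)
        return records[start:start + limit]


store = NotificationStore()
store.append("u1", "Payment received")
store.append("u1", "Receipt ready")
store.append("u1", "Payment refunded")

# The phone saw sequence 1, then went to sleep and missed the push messages.
missed = store.since("u1", last_seen_sequence=1)
print([record["sequence"] for record in missed])
```

With this in place, a dropped or coalesced push only delays a notification until the next sync instead of losing it.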


A real incident example

At 3 AM, an on-call engineer gets paged because one user received dozens of duplicate payment emails.

What happened?

  • the consumer crashed mid-batch
  • the offset was not committed
  • Kafka redelivered the same event
  • the email sender had no dedupe check
  • the user got spammed

That kind of issue is painful because it is not one bug.

It is a chain of small design decisions that only becomes visible under failure.

The practical fix

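DEDUPE_TTL_SECONDS = 24 * 3600

def send_notification(notification, redis, email_service, consumer):
    # redis, email_service, and consumer are passed in explicitly; redis is
    # assumed to be a redis-py client and consumer a Kafka consumer.
    dedupe_key = f"dedupe:{notification.user_id}:{notification.sequence}"

    # SET with NX claims the key atomically, so two workers racing on the
    # same message cannot both pass the check.
    if not redis.set(dedupe_key, "1", nx=True, ex=DEDUPE_TTL_SECONDS):
        consumer.commit()  # duplicate: skip the side effect, advance the offset
        return

    try:
        email_service.send(notification)
    except Exception:
        redis.delete(dedupe_key)  # release the claim so a retry can send
        raise
    consumer.commit()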

The important part is not the code style.

It is the fact that the system now assumes duplicates can happen and is built to survive them.


Observability is not optional

If you cannot observe the pipeline, you cannot debug it.

Useful signals include:

  • queue lag
  • retry count
  • delivery success rate
  • connection churn
  • consumer health
  • fan-out latency
  • push provider errors

The real question is not:

Did we send the event?

It is:

Can we prove the user received it?

Those are very different questions.

Metrics worth tracking

| Metric | Good target |
|---|---|
| Delivery success rate | >99.5% |
| p99 delivery latency | <500 ms |
| Consumer lag | low and stable |
| Retry rate | close to zero |
| Connection churn | predictable |
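
The first two targets can be computed from raw delivery records. This is a minimal sketch, assuming each attempt is logged as a latency in milliseconds, with `None` for a delivery that was never confirmed. The function names are hypothetical, and the percentile uses the nearest-rank method; a metrics backend may use a different interpolation.

```python
import math
from typing import List, Optional


def delivery_success_rate(latencies_ms: List[Optional[float]]) -> float:
    """Fraction of attempts confirmed; None marks an unconfirmed delivery."""
    if not latencies_ms:
        return 0.0
    delivered = sum(1 for value in latencies_ms if value is not None)
    return delivered / len(latencies_ms)


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# 199 confirmed deliveries (1 ms to 199 ms) and one that was never confirmed.
attempts: List[Optional[float]] = [float(n) for n in range(1, 200)] + [None]
confirmed = [value for value in attempts if value is not None]

print(delivery_success_rate(attempts))
print(percentile(confirmed, 99))
```

Note that the success rate here counts confirmed receipt, not successful sends, which matches the question of whether the user actually got the notification.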

What to do in production

A good notification system needs a runbook, not just code.

If retries spike:

  1. Throttle producers.
  2. Pause non-critical workers.
  3. Increase retry backoff.
  4. Check consumer lag.
  5. Check push provider errors.
  6. Rehydrate missed clients from durable state.
  7. Replay safely with idempotency keys.

That is what makes the system operationally survivable.
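
For the "increase retry backoff" step, one widely used approach is exponential backoff with full jitter, so that clients that failed together do not all retry at the same instant. The sketch below is one possible shape; `backoff_delay`, the 0.5 second base, and the 60 second cap are illustrative assumptions rather than recommended values.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


# A seeded generator makes the schedule reproducible in tests.
seeded = random.Random(0)
schedule = [backoff_delay(attempt, rng=seeded) for attempt in range(8)]
print([round(delay, 2) for delay in schedule])
```

Raising `base` or `cap` during an incident spreads retries over a longer window, which gives a struggling downstream service room to recover.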


The scaling path

A lot of teams go through the same evolution.

Startup

API → WebSocket Server → Client

Mid-scale

API → Kafka → Notification Workers → WebSocket Gateway Cluster → Client

High-scale

API → Multi-region event bus → regional workers → regional gateways → clients

At higher scale, the hardest problems are usually:

  • state distribution
  • per-user ordering
  • region routing
  • dedupe
  • offline recovery

Not CPU.

State.
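
One way to keep per-user state in a single place is to give each user a stable home region, so that ordering and dedupe for that user are decided by one set of workers. The sketch below is a deliberately simple illustration with a hypothetical `home_region` function and made-up region names. It uses plain modulo hashing, which reassigns many users whenever the region list changes; consistent hashing or an explicit lookup table are the usual ways around that.

```python
import hashlib
from typing import List


def home_region(user_id: str, regions: List[str]) -> str:
    """Map a user to one region deterministically, across processes and restarts."""
    if not regions:
        raise ValueError("at least one region is required")
    # hashlib is used instead of the built-in hash(), which is salted per
    # process for strings and would give different answers on each worker.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(regions)
    return regions[index]


REGIONS = ["region-a", "region-b", "region-c"]  # hypothetical region names
print(home_region("user-123", REGIONS))
```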


What strong teams optimize for

Early teams optimize for:

  • speed of delivery

Strong teams optimize for:

  • correctness
  • recoverability
  • observability
  • graceful degradation
  • idempotency

That difference matters a lot in production.

A notification that is slightly late is usually acceptable.

A notification that is duplicated, lost, or out of order is not.


Testing the system

You should test the failure modes, not just the happy path.

Try:

  • dropping WebSocket connections mid-message
  • killing consumers during processing
  • simulating mobile sleep/wake cycles
  • forcing Kafka rebalances
  • replaying duplicate events
  • load testing fan-out spikes

If the system only works when nothing fails, it is not ready.
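
Failure injection does not need real infrastructure to be useful. As a small sketch of the first item in the list, the test double below drops the connection on chosen send attempts, and the assertions check that every queued message still arrives. `FlakyConnection` and `flush_outbox` are hypothetical names for this example.

```python
from typing import List, Set


class FlakyConnection:
    """Test double that fails on chosen send attempts (1-based)."""

    def __init__(self, fail_on: Set[int]) -> None:
        self.fail_on = fail_on
        self.attempts = 0
        self.delivered: List[str] = []

    def send(self, message: str) -> None:
        self.attempts += 1
        if self.attempts in self.fail_on:
            raise ConnectionError("simulated drop")
        self.delivered.append(message)


def flush_outbox(connection: FlakyConnection, outbox: List[str],
                 max_attempts: int = 3) -> List[str]:
    """Send each queued message with retries; return those that never got through."""
    undelivered: List[str] = []
    for message in outbox:
        for _ in range(max_attempts):
            try:
                connection.send(message)
                break
            except ConnectionError:
                continue
        else:
            # The retry loop ran out without a successful send.
            undelivered.append(message)
    return undelivered


connection = FlakyConnection(fail_on={2})
leftover = flush_outbox(connection, ["a", "b", "c"])

assert connection.delivered == ["a", "b", "c"]
assert leftover == []
```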


Final thought

Real-time notification systems look simple until scale, retries, mobile behavior, ordering, and distributed state show up.

Then they become one of the hardest backend problems in the product.

The goal is not just to send events.

The goal is to make sure the right user gets the right notification, with the right semantics, even when the system is under stress.
