Isaac Ojerumu

Designing a Reliable Notification System for 1M+ Users (Push, SMS, Email)

In fintech, notifications are not a “nice-to-have” feature.

They’re part of the product’s trust layer.

If a user transfers money and doesn’t get a confirmation, they panic.
If an OTP arrives 3 minutes late, it may already have expired and the login fails.
If price alerts come twice, users lose confidence fast.

At small scale, sending notifications feels simple:

Application → Twilio/SendGrid → Done

But once you’re dealing with millions of users, multiple channels, retries, provider outages, and traffic spikes… notification systems become distributed systems problems.

And distributed systems are mostly about handling failure gracefully.

Imagine a fintech platform sending:

  • transaction alerts
  • OTPs
  • portfolio updates
  • price alerts
  • security notifications

…to over 1 million users across:

  • Push notifications
  • SMS
  • Email

The challenge isn’t just “sending messages.”

The real challenge is making sure the system:

  • doesn’t send duplicates
  • doesn’t silently lose messages
  • survives provider outages
  • scales during spikes
  • remains observable when things go wrong

Here’s the architecture I’d use.


High-Level Architecture

[App Service]
      │
      ▼
[Notification Queue (Redis Streams / SQS / Kafka)]
      │
      ▼
[Worker Pool]
      │
      ▼
[Provider Router]
      │
 ┌────┼───────────────────────────────┐
 ▼    ▼                               ▼
SMS  Email                           Push
 │      │                              │
 ▼      ▼                              ▼
Twilio SendGrid                      FCM/APNs
 │
 ▼
Fallback Providers
(Termii / Mailgun / Direct APNs)
      │
      ▼
[Delivery Log + Idempotency Store]
(PostgreSQL + Redis)

1. Queue-Based Architecture

One of the biggest mistakes teams make early on is sending notifications directly from the API request cycle.

That works… until traffic spikes.

Imagine Black Friday, a crypto market crash, or salary payment day.
Suddenly, millions of notifications need to go out almost at once.

If your application waits for Twilio or SendGrid to respond before it answers the user, your entire app becomes hostage to external providers.

That’s dangerous.

Instead, the API should do one thing:

Accept the request quickly and push a notification event into a queue.

From there, worker processes handle delivery asynchronously.

This changes the system completely.

Queues give you:

  • horizontal scalability
  • retry capability
  • backpressure handling
  • failure isolation

If providers slow down, the queue absorbs the spike instead of crashing your application.

At this scale, queues stop being optional infrastructure.
They become the safety buffer protecting the rest of your system.

Recommended technologies:

  • Redis Streams
  • AWS SQS
  • Kafka (especially for very high throughput systems)
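
To make the enqueue-then-return idea concrete, here’s a minimal sketch. It uses Python’s in-process `queue.Queue` purely as a stand-in for Redis Streams, SQS, or Kafka, and the names `NotificationEvent`, `handle_api_request`, and `worker` are hypothetical, not part of any real framework.

```python
import queue
import threading
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class NotificationEvent:
    user_id: int
    channel: str  # "sms", "email", or "push"
    body: str
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)


# In-process stand-in for a real broker (assumption for this sketch only).
notification_queue: "queue.Queue[Optional[NotificationEvent]]" = queue.Queue()
delivered: list = []  # records what the worker "sent", for illustration


def handle_api_request(user_id: int, channel: str, body: str) -> dict:
    """API path: enqueue and return immediately; never call a provider here."""
    event = NotificationEvent(user_id, channel, body)
    notification_queue.put(event)
    return {"status": "accepted", "event_id": event.event_id}


def worker() -> None:
    """Worker path: drain the queue and deliver asynchronously."""
    while True:
        event = notification_queue.get()
        if event is None:  # sentinel value: shut the worker down
            notification_queue.task_done()
            break
        delivered.append(event.event_id)  # placeholder for a real provider call
        notification_queue.task_done()
```

The API handler only ever touches the queue, so a slow provider delays the worker, not the user’s request.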

2. Reliability & Idempotency

The hardest problem in notification systems usually isn’t failed sends.

It’s duplicate sends.

Users are surprisingly tolerant of delayed notifications.
They are not tolerant of receiving the same debit alert three times.

Retries are where duplicates usually happen.

Example:

  1. Worker sends SMS
  2. Provider actually succeeds
  3. Network timeout occurs before acknowledgment returns
  4. Worker assumes failure
  5. Worker retries
  6. User receives duplicate SMS

To prevent this, every notification should carry an idempotency_key.

Before sending, workers check:

“Have we already processed this exact notification?”

Example constraint:

UNIQUE(user_id, notification_type, idempotency_key)

This is one of those small architectural decisions that saves massive operational pain later.

Even if retries happen multiple times, the database becomes the final protection layer against duplicates.
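
Here’s a rough sketch of that constraint doing its job. It uses SQLite in memory so it runs anywhere; in the architecture above the table would live in PostgreSQL. The table name `notifications` and the helpers `claim_notification` and `send_once` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE notifications (
        id                INTEGER PRIMARY KEY,
        user_id           INTEGER NOT NULL,
        notification_type TEXT    NOT NULL,
        idempotency_key   TEXT    NOT NULL,
        UNIQUE (user_id, notification_type, idempotency_key)
    )
    """
)


def claim_notification(user_id: int, notification_type: str, key: str) -> bool:
    """Return True if this call claimed the notification, False if it is a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO notifications (user_id, notification_type, idempotency_key) "
                "VALUES (?, ?, ?)",
                (user_id, notification_type, key),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique constraint rejected the row: someone already claimed it.
        return False


def send_once(user_id: int, notification_type: str, key: str, send) -> bool:
    """Call `send` only when the claim succeeds; return whether it was called."""
    if not claim_notification(user_id, notification_type, key):
        return False
    send()
    return True
```

The sketch inserts first instead of doing a separate “check, then send”, so two workers racing on the same key cannot both pass the check. The trade-off is that a crash between the claim and the send leaves a claimed but unsent row, which is the kind of state a reconciliation job has to pick up.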

Every delivery attempt should also be logged.

Not just successes — everything.

notification_attempts

Including:

  • provider used
  • response status
  • retry count
  • timestamps
  • provider error payloads

Because when something goes wrong in production, you want evidence, not guesses.
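
One possible shape for that table, sketched with SQLite so it runs as-is. Only the table name `notification_attempts` comes from the text above; the column names and the `log_attempt` / `attempts_for` helpers are my assumptions.

```python
import json
import sqlite3
import time
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE notification_attempts (
        id              INTEGER PRIMARY KEY,
        notification_id TEXT    NOT NULL,
        provider        TEXT    NOT NULL,
        response_status TEXT    NOT NULL,
        retry_count     INTEGER NOT NULL,
        attempted_at    REAL    NOT NULL,
        error_payload   TEXT
    )
    """
)


def log_attempt(notification_id: str, provider: str, response_status: str,
                retry_count: int, error_payload: Optional[dict] = None) -> None:
    """Record one delivery attempt, successful or not."""
    payload = json.dumps(error_payload) if error_payload is not None else None
    with conn:
        conn.execute(
            "INSERT INTO notification_attempts "
            "(notification_id, provider, response_status, retry_count, "
            "attempted_at, error_payload) VALUES (?, ?, ?, ?, ?, ?)",
            (notification_id, provider, response_status, retry_count,
             time.time(), payload),
        )


def attempts_for(notification_id: str) -> list:
    """Return every attempt for a notification, oldest first."""
    return conn.execute(
        "SELECT provider, response_status, retry_count, error_payload "
        "FROM notification_attempts WHERE notification_id = ? ORDER BY id",
        (notification_id,),
    ).fetchall()
```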


3. Provider Routing & Fallbacks

A reality every senior engineer eventually learns:

Third-party providers fail more often than you expect.

Twilio can degrade.
SendGrid can throttle requests.
FCM can delay pushes.

The mistake is designing systems that assume providers are always available.

Reliable systems assume failure is normal.

So instead of hardcoding a single provider, introduce a provider routing layer.

Example Routing Strategy

Channel   Primary           Fallback
SMS       Twilio / Termii   Flutterwave SMS
Email     SendGrid          Mailgun
Push      Firebase FCM      Direct APNs

The worker flow becomes:

  1. Attempt primary provider
  2. Detect timeout or failure
  3. Retry intelligently
  4. Automatically fail over if necessary
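
The four steps above can be sketched as a small routing function. Everything here is hypothetical: `ProviderError`, `send_with_failover`, and the idea that each provider is just a callable taking a recipient and a body. Real provider SDKs look different, and a production router would wait between retries instead of retrying immediately.

```python
from typing import Callable, List, Sequence, Tuple


class ProviderError(Exception):
    """Raised by a provider client on timeout or a failed send (assumed contract)."""


# A provider is modelled as: (recipient, body) -> provider message id
Provider = Callable[[str, str], str]


def send_with_failover(
    providers: Sequence[Tuple[str, Provider]],
    recipient: str,
    body: str,
    attempts_per_provider: int = 2,
) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, message_id) on first success."""
    errors: List[str] = []
    for name, provider in providers:
        for attempt in range(1, attempts_per_provider + 1):
            try:
                return name, provider(recipient, body)
            except ProviderError as exc:
                # Remember the failure, then retry or move to the next provider.
                errors.append(f"{name} attempt {attempt}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))
```

Passing the provider list in as data keeps the primary/fallback order a configuration decision, not something baked into the worker.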

Users shouldn’t notice your provider had a bad day.

That’s the goal.


4. Retry Strategy

Retries sound simple until they start causing damage.

Bad retry systems can:

  • spam users
  • overload providers
  • amplify outages
  • generate huge costs

A common mistake is retrying too aggressively.

If Twilio is already struggling, hammering it with thousands of immediate retries only makes things worse.

Instead, use exponential backoff: each retry waits noticeably longer than the one before.

Example:

Retry #1 → 30 seconds
Retry #2 → 2 minutes
Retry #3 → 10 minutes

This gives providers time to recover while keeping pressure manageable.
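
A small sketch of how such delays might be computed. The base of 30 seconds and the 10-minute cap come from the schedule above; the growth factor of 4 is my assumption (it yields 30 s, 2 min, 8 min, close to but not identical to that schedule), and the random “full jitter” is an addition the text doesn’t mention, commonly used so thousands of workers don’t all retry at the same instant.

```python
import random
from typing import Optional


def backoff_ceiling(attempt: int, base: float = 30.0, factor: float = 4.0,
                    cap: float = 600.0) -> float:
    """Upper bound in seconds for retry `attempt` (1-based), capped at `cap`."""
    return min(cap, base * factor ** (attempt - 1))


def backoff_delay(attempt: int, base: float = 30.0, factor: float = 4.0,
                  cap: float = 600.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay: a random wait between 0 and the exponential ceiling."""
    source = rng if rng is not None else random
    return source.uniform(0, backoff_ceiling(attempt, base, factor, cap))
```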

And after maximum retries?

Move the message into a Dead Letter Queue (DLQ).

That queue is basically your “something unusual happened here” bucket.

At that point, engineers should be alerted.
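
Put together, the retry-then-DLQ flow might look like this sketch. The delay table mirrors the example schedule above; `process`, `schedule_retry`, and `dead_letter` are hypothetical names, with the last two standing in for whatever your queue uses for delayed redelivery and dead-lettering.

```python
from typing import Callable, Optional

RETRY_DELAYS_SECONDS = [30, 120, 600]  # 30 s, 2 min, 10 min


def next_retry_delay(retry_number: int) -> Optional[int]:
    """Delay before retry N (1-based), or None once retries are exhausted."""
    if 1 <= retry_number <= len(RETRY_DELAYS_SECONDS):
        return RETRY_DELAYS_SECONDS[retry_number - 1]
    return None


def process(
    message: dict,
    send: Callable[[dict], None],
    schedule_retry: Callable[[dict, int], None],
    dead_letter: Callable[[dict], None],
) -> str:
    """Send once; on failure either schedule the next retry or dead-letter."""
    try:
        send(message)
        return "sent"
    except Exception as exc:  # a real worker would catch narrower error types
        retry_number = message.get("retries", 0) + 1
        delay = next_retry_delay(retry_number)
        if delay is None:
            # Out of retries: park the message with its last error for humans.
            dead_letter({**message, "last_error": str(exc)})
            return "dead_lettered"
        schedule_retry({**message, "retries": retry_number}, delay)
        return "retry_scheduled"
```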


5. Reconciliation Jobs

One subtle issue in distributed systems:

Sometimes providers say “accepted” even though delivery eventually fails.

That creates dangerous blind spots.

A notification may look successful internally while the user never actually receives it.

This is why reconciliation jobs matter.

Every few minutes, background jobs should scan for suspicious states:

Notifications stuck in "pending" for too long

Then:

→ Re-query provider APIs
→ Update delivery status
→ Retry if needed

These jobs quietly save systems from edge cases caused by:

  • webhook failures
  • transient outages
  • network interruptions
  • delayed provider processing

A lot of reliability engineering is really just building systems that continuously self-correct.
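
As a rough sketch, one reconciliation pass could look like the function below. The 5-minute threshold, the dictionary fields, and the status strings (`"delivered"`, `"failed"`) are assumptions; each real provider has its own status vocabulary and lookup API, represented here by the `fetch_provider_status` callable.

```python
import time
from typing import Callable, Iterable, Optional

STUCK_AFTER_SECONDS = 300  # assumed threshold for "pending for too long"


def reconcile(
    notifications: Iterable[dict],
    fetch_provider_status: Callable[[str], Optional[str]],
    retry: Callable[[dict], None],
    now: Optional[float] = None,
) -> dict:
    """Re-query the provider for stuck notifications and update or retry them."""
    now = time.time() if now is None else now
    summary = {"checked": 0, "delivered": 0, "retried": 0}
    for n in notifications:
        if n["status"] != "pending" or now - n["sent_at"] < STUCK_AFTER_SECONDS:
            continue  # not stuck yet, or already resolved
        summary["checked"] += 1
        provider_status = fetch_provider_status(n["provider_message_id"])
        if provider_status == "delivered":
            n["status"] = "delivered"
            summary["delivered"] += 1
        elif provider_status == "failed":
            n["status"] = "retrying"
            retry(n)
            summary["retried"] += 1
        # Any other answer (still queued, unknown) is left for the next run.
    return summary
```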


6. User Preferences & Rate Limiting

Good notification systems are not just reliable.

They’re respectful.

Users should control how they’re contacted.

Examples:

  • disable marketing SMS
  • mute non-critical notifications after 10 PM
  • push-only preferences
  • email digests instead of instant alerts

Simple table:

user_notification_settings

…can dramatically improve user experience.
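
Here’s one way the preference check could be sketched. The field names on `UserNotificationSettings` and the `should_send` helper are hypothetical, and letting critical notifications bypass quiet hours is my assumption about how “mute non-critical notifications” would be applied.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Optional, Set


@dataclass
class UserNotificationSettings:
    enabled_channels: Set[str] = field(
        default_factory=lambda: {"push", "sms", "email"}
    )
    marketing_sms_enabled: bool = True
    quiet_hours_start: Optional[time] = None  # e.g. time(22, 0)
    quiet_hours_end: Optional[time] = None    # e.g. time(7, 0)


def in_quiet_hours(settings: UserNotificationSettings, now: time) -> bool:
    """True if `now` falls inside the user's quiet window (may wrap midnight)."""
    start, end = settings.quiet_hours_start, settings.quiet_hours_end
    if start is None or end is None:
        return False
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # window wraps past midnight


def should_send(settings: UserNotificationSettings, channel: str,
                category: str, critical: bool, now: time) -> bool:
    """Apply channel, marketing, and quiet-hours preferences in that order."""
    if channel not in settings.enabled_channels:
        return False
    if channel == "sms" and category == "marketing" and not settings.marketing_sms_enabled:
        return False
    if not critical and in_quiet_hours(settings, now):
        return False
    return True
```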

Rate limiting matters too.

Without it, bugs or loops can become expensive very quickly.

Imagine accidentally sending OTPs in a retry loop to thousands of users.

Redis-based limits help protect against this.

Example:

Max 3 SMS/hour/user

That protects:

  • users from spam
  • providers from overload
  • the business from runaway costs
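
A minimal sliding-window limiter shows the logic. This version keeps its state in process memory so it runs standalone; in production the same counters would live in Redis so every worker shares them. `SlidingWindowLimiter` is a hypothetical name.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Allow at most `max_events` per key within the last `window_seconds`."""

    def __init__(self, max_events: int = 3, window_seconds: float = 3600.0):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        events = self._events[key]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_events:
            return False
        events.append(now)
        return True
```

Usage is a single call before each send, for example `limiter.allow(f"sms:{user_id}")`; a `False` result means the message should be dropped or deferred.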

7. Monitoring & Observability

At scale, invisible systems are dangerous systems.

You need to know:

  • Are queues growing?
  • Are workers failing?
  • Which provider is slowing down?
  • Which region is affected?

The most important metrics are usually boring operational ones.

Golden Signals

  • queue latency
  • queue depth
  • worker error rate
  • provider response times

Then business-level metrics:

  • delivery success rate
  • percentage delivered within 30 seconds
  • provider costs
  • user opt-out rates

And finally: alerts.

Example:

Alert if SMS failure rate exceeds 5% for 2 minutes

The earlier you detect degradation, the smaller the incident becomes.
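
In practice a rule like that would usually live in your monitoring stack, but the logic is simple enough to sketch. `FailureRateAlert` is a hypothetical class; it assumes you feed it periodic samples of how many sends were attempted and how many failed.

```python
from typing import Optional


class FailureRateAlert:
    """Fire when the failure rate stays above a threshold for a sustained period."""

    def __init__(self, threshold: float = 0.05, sustain_seconds: float = 120.0):
        self.threshold = threshold
        self.sustain_seconds = sustain_seconds
        self._breach_started_at: Optional[float] = None

    def observe(self, timestamp: float, sent: int, failed: int) -> bool:
        """Feed one sample; return True if the alert should be firing now."""
        rate = failed / sent if sent else 0.0
        if rate <= self.threshold:
            self._breach_started_at = None  # recovered: reset the timer
            return False
        if self._breach_started_at is None:
            self._breach_started_at = timestamp
        return timestamp - self._breach_started_at >= self.sustain_seconds
```

Requiring the breach to persist keeps a single bad sample from paging anyone.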


8. Testing Failure Scenarios

The biggest difference between systems that look reliable and systems that are reliable is failure testing.

Because everything works in happy-path demos.

The real question is:

What happens when dependencies misbehave?

One useful strategy is canary testing (sometimes loosely called shadow testing).

Route a tiny percentage of production traffic through a new provider and compare results safely.

Example:

  • compare latency
  • compare delivery rates
  • validate formatting consistency
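
The traffic split itself can be sketched with a stable hash, so the same notification always lands on the same side even across retries. The function names and the use of SHA-256 are my assumptions; any stable hash would do.

```python
import hashlib


def routes_to_candidate(notification_id: str, percent: float) -> bool:
    """Deterministically pick roughly `percent`% of ids for the new provider."""
    digest = hashlib.sha256(notification_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def pick_provider(notification_id: str, current: str, candidate: str,
                  percent: float) -> str:
    """Return the provider name this notification should be sent through."""
    return candidate if routes_to_candidate(notification_id, percent) else current
```

Because the decision depends only on the id, you can later join delivery logs from both providers and compare latency and delivery rates for comparable traffic.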

Chaos testing is also incredibly valuable.

Example:

Intentionally fail 10% of Twilio requests in staging

That sounds scary initially.

But it validates whether:

  • retries actually work
  • failovers trigger correctly
  • reconciliation jobs recover messages

Reliable systems are engineered through controlled failure exposure.
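
A fault-injection wrapper for staging can be very small. This sketch wraps any send function and fails a configurable fraction of calls; `with_fault_injection` and `InjectedFailure` are hypothetical names, and you would only enable this outside production.

```python
import random
from typing import Callable, Optional


class InjectedFailure(Exception):
    """Simulated provider failure raised by the chaos wrapper."""


def with_fault_injection(send: Callable, failure_rate: float = 0.10,
                         rng: Optional[random.Random] = None) -> Callable:
    """Wrap `send` so that roughly `failure_rate` of calls raise InjectedFailure."""
    source = rng if rng is not None else random.Random()

    def wrapped(*args, **kwargs):
        if source.random() < failure_rate:
            raise InjectedFailure("chaos: simulated provider failure")
        return send(*args, **kwargs)

    return wrapped
```

Wrapping the provider client this way means retries, failover, and reconciliation all see the failure exactly as they would see a real one.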


Why This Architecture Works

What makes this architecture resilient is that it assumes bad things will happen.

Because eventually:

  • providers fail
  • queues spike
  • retries happen
  • networks become unreliable

The system survives because reliability is built into the architecture itself.

By combining:

  • queues
  • idempotency
  • retries
  • fallback providers
  • reconciliation jobs
  • observability

…the platform continues operating even during partial outages and heavy traffic spikes.

And in fintech, reliability isn’t just infrastructure quality.

It directly affects user trust.


Tech Stack Example

Component    Suggested Tech
Queue        Redis Streams / Kafka / SQS
Workers      Laravel Queues / Go Workers / Node Consumers
Cache        Redis
Database     PostgreSQL
Monitoring   Prometheus + Grafana
Alerts       PagerDuty / Slack
SMS          Twilio / Termii
Email        SendGrid / Mailgun
Push         Firebase FCM

Final Thoughts

Most notification systems work during normal traffic.

That’s not the hard part.

The hard part is surviving:

  • provider outages
  • duplicate retry scenarios
  • partial failures
  • sudden spikes in traffic

That’s where architecture starts to matter.

Because users rarely remember the notifications that worked.

They remember the moments when communication failed during something important.
