Isaac Ojerumu

Posted on May 24

Designing a Reliable Notification System for 1M+ Users (Push, SMS, Email)

#architecture #backend #distributedsystems #systemdesign

In fintech, notifications are not a “nice-to-have” feature.

They’re part of the product’s trust layer.

If a user transfers money and doesn’t get a confirmation, they panic.
If an OTP arrives 3 minutes late, login fails.
If price alerts come twice, users lose confidence fast.

At small scale, sending notifications feels simple:

Application → Twilio/SendGrid → Done

But once you’re dealing with millions of users, multiple channels, retries, provider outages, and traffic spikes… notification systems become distributed systems problems.

And distributed systems are mostly about handling failure gracefully.

Imagine a fintech platform sending:

transaction alerts
OTPs
portfolio updates
price alerts
security notifications

…to over 1 million users across:

Push notifications
SMS
Email

The challenge isn’t just “sending messages.”

The real challenge is making sure the system:

doesn’t send duplicates
doesn’t silently lose messages
survives provider outages
scales during spikes
remains observable when things go wrong

Here’s the architecture I’d use.

High-Level Architecture

[App Service]
      │
      ▼
[Notification Queue (Redis Streams / SQS / Kafka)]
      │
      ▼
[Worker Pool]
      │
      ▼
[Provider Router]
      │
 ┌────┴─────────┬──────────────┐
 ▼              ▼              ▼
SMS           Email           Push
 │              │              │
 ▼              ▼              ▼
Twilio       SendGrid        FCM/APNs
 │              │              │
 ▼              ▼              ▼
Termii       Mailgun      Direct APNs   (fallback providers)
 │              │              │
 └──────────────┼──────────────┘
                ▼
[Delivery Log + Idempotency Store]
(PostgreSQL + Redis)

1. Queue-Based Architecture

One of the biggest mistakes teams make early on is sending notifications directly from the API request cycle.

That works… until traffic spikes.

Imagine Black Friday, a crypto market crash, or salary payment day.
Suddenly, millions of notifications need to go out almost at once.

If your application waits for Twilio or SendGrid to respond before returning a response to the user, your entire app becomes hostage to external providers.

That’s dangerous.

Instead, the API should do one thing:

Accept the request quickly and push a notification event into a queue.

From there, worker processes handle delivery asynchronously.

This changes the system completely.

Queues give you:

horizontal scalability
retry capability
backpressure handling
failure isolation

If providers slow down, the queue absorbs the spike instead of crashing your application.

At this scale, queues stop being optional infrastructure.
They become the safety buffer protecting the rest of your system.
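
To make the split concrete, here is a minimal sketch of the enqueue-then-deliver pattern. It assumes an in-memory queue.Queue as a stand-in for Redis Streams, SQS, or Kafka, and the names NotificationEvent, handle_api_request, and worker are hypothetical, not part of any real framework.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationEvent:
    user_id: int
    channel: str           # "sms", "email", or "push"
    body: str
    idempotency_key: str


# In-memory stand-in for a real broker (assumption for this sketch).
notification_queue = queue.Queue()
delivered = []


def handle_api_request(event):
    """API path: enqueue and return immediately. No provider call happens here."""
    notification_queue.put(event)
    return {"status": "accepted"}


def worker():
    """Worker path: pull events and deliver them outside the request cycle."""
    while True:
        event = notification_queue.get()
        if event is None:           # sentinel used to stop the worker
            notification_queue.task_done()
            break
        delivered.append(event)     # placeholder for the real provider call
        notification_queue.task_done()


def run_demo():
    """Start one worker, enqueue three events, and return how many were delivered."""
    thread = threading.Thread(target=worker)
    thread.start()
    for i in range(3):
        handle_api_request(NotificationEvent(i, "sms", "Debit alert", f"txn-{i}"))
    notification_queue.put(None)
    thread.join()
    return len(delivered)
```

The point of the sketch is that handle_api_request never touches a provider, so a slow provider can only grow the queue, not slow down the API.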

Recommended technologies:

Redis Streams
AWS SQS
Kafka (especially for very high throughput systems)

2. Reliability & Idempotency

The hardest problem in notification systems usually isn’t failed sends.

It’s duplicate sends.

Users are surprisingly tolerant of delayed notifications.
They are not tolerant of receiving the same debit alert three times.

Retries are where duplicates usually happen.

Example:

Worker sends SMS
Provider actually succeeds
Network timeout occurs before acknowledgment returns
Worker assumes failure
Worker retries
User receives duplicate SMS

To prevent this, every notification should carry an idempotency_key that is generated once, when the event is created, and reused unchanged on every retry.

Before sending, workers check:

“Have we already processed this exact notification?”

Example constraint:

UNIQUE(user_id, notification_type, idempotency_key)

This is one of those small architectural decisions that saves massive operational pain later.

Even if retries happen multiple times, the database becomes the final protection layer against duplicates.
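
A rough sketch of how the constraint can do the work, assuming SQLite in memory as a stand-in for PostgreSQL. The table name notifications and the helpers claim and send_once are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE notifications (
        user_id           INTEGER NOT NULL,
        notification_type TEXT    NOT NULL,
        idempotency_key   TEXT    NOT NULL,
        UNIQUE (user_id, notification_type, idempotency_key)
    )
    """
)

sent = []  # placeholder for messages handed to a provider


def claim(user_id, notification_type, idempotency_key):
    """Return True only for the first caller to claim this notification."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO notifications VALUES (?, ?, ?)",
        (user_id, notification_type, idempotency_key),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted, 0 when the UNIQUE constraint ignored it.
    return cur.rowcount == 1


def send_once(user_id, notification_type, idempotency_key, body):
    """Send only if this exact notification has not been claimed before."""
    if not claim(user_id, notification_type, idempotency_key):
        return False        # duplicate: an earlier attempt already claimed it
    sent.append(body)       # placeholder for the real provider call
    return True
```

Letting the insert itself decide avoids the race in a separate "check, then send" step. One trade-off to be aware of: claiming before sending leans toward at-most-once delivery, so a worker that crashes between the claim and the send needs a status column and a reconciliation pass to recover.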

Every delivery attempt should also be logged.

Not just successes — everything.

notification_attempts

Including:

provider used
response status
retry count
timestamps
provider error payloads

Because when something goes wrong in production, you want evidence, not guesses.
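
One possible shape for an attempt record, as a sketch. The field names mirror the list above but are assumptions, and the in-memory attempt_log list stands in for an append-only notification_attempts table.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class NotificationAttempt:
    idempotency_key: str
    provider: str
    response_status: str                  # e.g. "accepted", "timeout", "rejected"
    retry_count: int
    error_payload: Optional[dict] = None  # raw provider error, if any
    attempted_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


attempt_log = []  # stand-in for the notification_attempts table


def record_attempt(attempt):
    """Append one attempt, success or failure, as a JSON line."""
    attempt_log.append(json.dumps(asdict(attempt)))
```

Every attempt gets its own row, so a failed Twilio call followed by a successful fallback shows up as two records under the same idempotency key.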

3. Provider Routing & Fallbacks

A reality every senior engineer eventually learns:

Third-party providers fail more often than you expect.

Twilio can degrade.
SendGrid can throttle requests.
FCM can delay pushes.

The mistake is designing systems that assume providers are always available.

Reliable systems assume failure is normal.

So instead of hardcoding a single provider, introduce a provider routing layer.

Example Routing Strategy

Channel	Primary	Fallback
SMS	Twilio / Termii	Flutterwave SMS
Email	SendGrid	Mailgun
Push	Firebase FCM	Direct APNs

The worker flow becomes:

Attempt primary provider
Detect timeout or failure
Retry intelligently
Automatically fail over if necessary
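
The worker flow above can be sketched roughly like this. The provider functions are hypothetical stubs rather than real Twilio or Termii SDK calls, and the delay between attempts is left out to keep the example short.

```python
class ProviderError(Exception):
    """Raised by a provider stub, or when every provider has failed."""


def send_with_failover(providers, message, attempts_per_provider=2):
    """Try each (name, send) pair in order; return the provider that succeeded.

    Also returns the list of failed attempts so they can be logged.
    """
    errors = []
    for name, send in providers:
        for attempt in range(1, attempts_per_provider + 1):
            try:
                send(message)
                return name, errors
            except (ProviderError, TimeoutError) as exc:
                errors.append((name, attempt, repr(exc)))
    raise ProviderError(f"all providers failed: {errors}")


def primary_stub(message):
    """Hypothetical primary provider that always times out."""
    raise TimeoutError("primary timed out")


def fallback_stub(message):
    """Hypothetical fallback provider that always succeeds."""
    return "ok"
```

Usage: send_with_failover([("primary", primary_stub), ("fallback", fallback_stub)], "Your OTP is 123456") returns "fallback" along with the two failed primary attempts. Keep in mind that a timeout does not prove the primary failed to send, so failover is one more place where the idempotency check matters.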

Users shouldn’t notice your provider had a bad day.

That’s the goal.

4. Retry Strategy

Retries sound simple until they start causing damage.

Bad retry systems can:

spam users
overload providers
amplify outages
generate huge costs

A common mistake is retrying too aggressively.

If Twilio is already struggling, hammering it with thousands of immediate retries only makes things worse.

Instead, use exponential backoff.

Example:

Retry #1 → 30 seconds
Retry #2 → 2 minutes
Retry #3 → 10 minutes

This gives providers time to recover while keeping pressure manageable.

And after maximum retries?

Move the message into a Dead Letter Queue (DLQ).

That queue is basically your “something unusual happened here” bucket.

At that point, engineers should be alerted.
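
Here is a small sketch of the retry decision. It assumes a base delay of 30 seconds and a constant multiplier of 4, which gives 30 seconds, 2 minutes, and 8 minutes, plus an optional jitter fraction so retries do not all fire at the same instant. The constants and the names backoff_delay and next_step are illustrative, not prescribed values.

```python
import random

BASE_DELAY_SECONDS = 30   # assumed base delay
MULTIPLIER = 4            # assumed growth factor
MAX_RETRIES = 3           # assumed retry budget

dead_letter_queue = []    # stand-in for a real DLQ


def backoff_delay(retry_number, jitter=0.0, rng=random):
    """Delay before retry N (1-based): base * multiplier^(N-1), plus optional jitter."""
    delay = BASE_DELAY_SECONDS * MULTIPLIER ** (retry_number - 1)
    return delay + rng.uniform(0, jitter * delay)


def next_step(message, retries_so_far):
    """Decide what happens after a failed attempt."""
    if retries_so_far >= MAX_RETRIES:
        dead_letter_queue.append(message)   # this is where engineers get alerted
        return ("dead_letter", None)
    return ("retry", backoff_delay(retries_so_far + 1))
```

With jitter=0.2, each delay is stretched by a random amount of up to 20%, which spreads out a burst of retries that all failed at the same moment.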

5. Reconciliation Jobs

One subtle issue in distributed systems:

Sometimes providers say “accepted” even though delivery eventually fails.

That creates dangerous blind spots.

A notification may look successful internally while the user never actually receives it.

This is why reconciliation jobs matter.

Every few minutes, background jobs should scan for suspicious states:

Notifications stuck in "pending" for too long, or marked "accepted" by the provider with no delivery confirmation

Then:

→ Re-query provider APIs
→ Update delivery status
→ Retry if needed
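
A reconciliation pass could look something like the sketch below. The 5-minute threshold, the dictionary fields, and the query_provider_status callback are all assumptions for illustration. A real job would read from the delivery log and call the provider's status API.

```python
from datetime import datetime, timedelta, timezone

STUCK_AFTER = timedelta(minutes=5)   # assumed "too long" threshold


def reconcile(notifications, query_provider_status, now):
    """Re-check notifications that stayed 'pending' longer than STUCK_AFTER.

    Updates each stuck record's status in place and returns the ids that
    should be retried because the provider reports them as failed.
    """
    to_retry = []
    for n in notifications:
        if n["status"] != "pending" or now - n["sent_at"] < STUCK_AFTER:
            continue
        n["status"] = query_provider_status(n["provider_message_id"])
        if n["status"] == "failed":
            to_retry.append(n["id"])
    return to_retry
```

Anything returned in to_retry goes back through the normal send path, so it still passes the idempotency check and the retry limits.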

These jobs quietly save systems from edge cases caused by:

webhook failures
transient outages
network interruptions
delayed provider processing

A lot of reliability engineering is really just building systems that continuously self-correct.

6. User Preferences & Rate Limiting

Good notification systems are not just reliable.

They’re respectful.

Users should control how they’re contacted.

Examples:

disable marketing SMS
mute non-critical notifications after 10PM
push-only preferences
email digests instead of instant alerts

A simple table:

user_notification_settings

…can dramatically improve user experience.
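
As a sketch of how a worker might consult those settings before sending: the fields below cover only two of the examples (marketing SMS and quiet hours after 10PM), and the 7AM end of quiet hours is an assumption.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class UserNotificationSettings:
    marketing_sms_enabled: bool = True
    quiet_hours_start: time = time(22, 0)   # 10PM, from the example above
    quiet_hours_end: time = time(7, 0)      # assumed end of quiet hours


def should_send(settings, channel, category, is_critical, local_time):
    """Return True if this notification respects the user's preferences."""
    if channel == "sms" and category == "marketing" and not settings.marketing_sms_enabled:
        return False
    # Quiet hours wrap past midnight, so the check is "after start OR before end".
    in_quiet_hours = (
        local_time >= settings.quiet_hours_start
        or local_time < settings.quiet_hours_end
    )
    if in_quiet_hours and not is_critical:
        return False
    return True
```

Critical messages such as OTPs and security alerts bypass quiet hours here, which is a design choice worth making explicit rather than leaving to each caller.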

Rate limiting matters too.

Without it, bugs or loops can become expensive very quickly.

Imagine accidentally sending OTPs in a retry loop to thousands of users.

Redis-based limits help protect against this.

Example:

Max 3 SMS/hour/user

That protects:

users from spam
providers from overload
the business from runaway costs
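
A fixed-window counter is one simple way to enforce a limit like this. The sketch below keeps counters in a Python dict as a stand-in for Redis, where the same idea is usually an INCR on a per-user, per-window key with an expiry. The class name and defaults are illustrative.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` sends per user in each fixed time window."""

    def __init__(self, limit=3, window_seconds=3600, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        # (user_id, window index) -> count. Redis would expire old keys for us.
        self.counters = {}

    def allow(self, user_id):
        window = int(self.clock() // self.window_seconds)
        key = (user_id, window)
        count = self.counters.get(key, 0) + 1
        self.counters[key] = count
        return count <= self.limit
```

Fixed windows allow a short burst across a window boundary (three at 12:59, three more at 13:00). If that matters, a sliding window or token bucket is the usual next step.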

7. Monitoring & Observability

At scale, invisible systems are dangerous systems.

You need to know:

Are queues growing?
Are workers failing?
Which provider is slowing down?
Which region is affected?

The most important metrics are usually boring operational ones.

Golden Signals

queue latency
queue depth
worker error rate
provider response times

Then business-level metrics:

delivery success rate
percentage delivered within 30 seconds
provider costs
user opt-out rates

And finally: alerts.

Example:

Alert if SMS failure rate exceeds 5% for 2 minutes

The earlier you detect degradation, the smaller the incident becomes.
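
The example alert rule can be expressed as a small function. This sketch assumes metrics arrive as (sent, failed) counts sampled every 30 seconds. In practice this logic usually lives in the monitoring system's rule language rather than in application code.

```python
def should_alert(samples, threshold=0.05, duration_seconds=120, interval_seconds=30):
    """Return True if the failure rate exceeded `threshold` for the whole duration.

    `samples` is a list of (sent, failed) tuples, one per interval, oldest first.
    """
    needed = duration_seconds // interval_seconds
    recent = samples[-needed:]
    if len(recent) < needed:
        return False   # not enough data to cover the full duration yet
    return all(sent > 0 and failed / sent > threshold for sent, failed in recent)
```

Requiring every sample in the window to breach the threshold keeps a single bad interval from paging anyone.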

8. Testing Failure Scenarios

The biggest difference between systems that look reliable and systems that are reliable is failure testing.

Because everything works in happy-path demos.

The real question is:

What happens when dependencies misbehave?

One useful strategy is canary testing a new provider (sometimes loosely called shadow testing).

Route a tiny percentage of production traffic through the new provider and compare its results against the current one.

Example:

compare latency
compare delivery rates
validate formatting consistency

Chaos testing is also incredibly valuable.

Example:

Intentionally fail 10% of Twilio requests in staging

That sounds scary initially.

But it validates whether:

retries actually work
failovers trigger correctly
reconciliation jobs recover messages
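
Fault injection for a staging environment can be as small as a wrapper around the send function. This is a sketch with hypothetical names; it wraps whatever callable your worker uses and is not tied to the Twilio SDK.

```python
import random


class ProviderUnavailable(Exception):
    """Raised when a failure is injected on purpose."""


def with_fault_injection(send, failure_rate=0.10, rng=None):
    """Wrap `send` so that roughly `failure_rate` of calls raise an error."""
    rng = rng or random.Random()

    def chaotic_send(message):
        if rng.random() < failure_rate:
            raise ProviderUnavailable("injected failure")
        return send(message)

    return chaotic_send
```

Passing a seeded random.Random makes a chaos run repeatable, which helps when you want to replay the exact sequence of failures that exposed a bug.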

Reliable systems are engineered through controlled failure exposure.

Why This Architecture Works

What makes this architecture resilient is that it assumes bad things will happen.

Because eventually:

providers fail
queues spike
retries happen
networks become unreliable

The system survives because reliability is built into the architecture itself.

By combining:

queues
idempotency
retries
fallback providers
reconciliation jobs
observability

…the platform continues operating even during partial outages and heavy traffic spikes.

And in fintech, reliability isn’t just infrastructure quality.

It directly affects user trust.

Tech Stack Example

Component	Suggested Tech
Queue	Redis Streams / Kafka / SQS
Workers	Laravel Queues / Go Workers / Node Consumers
Cache	Redis
Database	PostgreSQL
Monitoring	Prometheus + Grafana
Alerts	PagerDuty / Slack
SMS	Twilio / Termii
Email	SendGrid / Mailgun
Push	Firebase FCM

Final Thoughts

Most notification systems work during normal traffic.

That’s not the hard part.

The hard part is surviving:

provider outages
duplicate retry scenarios
partial failures
sudden spikes in traffic

That’s where architecture starts to matter.

Because users rarely remember the notifications that worked.

They remember the moments when communication failed during something important.
