In fintech, notifications are not a “nice-to-have” feature.
They’re part of the product’s trust layer.
If a user transfers money and doesn’t get a confirmation, they panic.
If an OTP arrives 3 minutes late, login fails.
If price alerts come twice, users lose confidence fast.
At small scale, sending notifications feels simple:
Application → Twilio/SendGrid → Done
But once you’re dealing with millions of users, multiple channels, retries, provider outages, and traffic spikes… notification systems become distributed systems problems.
And distributed systems are mostly about handling failure gracefully.
Imagine a fintech platform sending:
- transaction alerts
- OTPs
- portfolio updates
- price alerts
- security notifications
…to over 1 million users across:
- Push notifications
- SMS
- Email
The challenge isn’t just “sending messages.”
The real challenge is making sure the system:
- doesn’t send duplicates
- doesn’t silently lose messages
- survives provider outages
- scales during spikes
- remains observable when things go wrong
Here’s the architecture I’d use.
High-Level Architecture
                  [App Service]
                        │
                        ▼
[Notification Queue (Redis Streams / SQS / Kafka)]
                        │
                        ▼
                  [Worker Pool]
                        │
                        ▼
                [Provider Router]
                        │
        ┌───────────────┼───────────────┐
        ▼               ▼               ▼
       SMS            Email           Push
        │               │               │
        ▼               ▼               ▼
     Twilio         SendGrid        FCM/APNs
                        │
                        ▼
               Fallback Providers
        (Termii / Mailgun / Direct APNs)
                        │
                        ▼
       [Delivery Log + Idempotency Store]
              (PostgreSQL + Redis)
1. Queue-Based Architecture
One of the biggest mistakes teams make early on is sending notifications directly from the API request cycle.
That works… until traffic spikes.
Imagine Black Friday, a crypto market crash, or salary payment day.
Suddenly, millions of notifications need to go out almost at once.
If your application waits for Twilio or SendGrid to respond before returning a response to the user, your entire app becomes hostage to external providers.
That’s dangerous.
Instead, the API should do one thing:
Accept the request quickly and push a notification event into a queue.
From there, worker processes handle delivery asynchronously.
This changes the system completely.
Queues give you:
- horizontal scalability
- retry capability
- backpressure handling
- failure isolation
If providers slow down, the queue absorbs the spike instead of crashing your application.
At this scale, queues stop being optional infrastructure.
They become the safety buffer protecting the rest of your system.
Recommended technologies:
- Redis Streams
- AWS SQS
- Kafka (especially for very high throughput systems)
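To make the enqueue-then-deliver split concrete, here is a minimal sketch. It uses Python's in-process `queue.Queue` purely as a stand-in for Redis Streams, SQS, or Kafka, and the names `NotificationEvent`, `enqueue_notification`, and `worker` are hypothetical, not part of any real library.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationEvent:
    user_id: int
    channel: str          # "sms", "email", or "push"
    body: str
    idempotency_key: str


# In-process stand-in for a real broker. The size bound is what gives
# backpressure: a full queue is reported immediately instead of blocking.
notification_queue = queue.Queue(maxsize=10_000)


def enqueue_notification(event: NotificationEvent) -> bool:
    """Called from the API request path; never waits on a provider."""
    try:
        notification_queue.put_nowait(event)
        return True
    except queue.Full:
        return False  # caller can shed load or persist the event for later


def worker(deliver, stop_marker=None) -> None:
    """Drain the queue, handing each event to `deliver` (a hypothetical sender)."""
    while True:
        event = notification_queue.get()
        try:
            if event is stop_marker:
                return
            deliver(event)
        finally:
            notification_queue.task_done()
```

In a real deployment the worker would run in separate processes or containers, and the broker would persist events so a crash does not lose them. An in-memory queue gives neither guarantee.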
2. Reliability & Idempotency
The hardest problem in notification systems usually isn’t failed sends.
It’s duplicate sends.
Users are surprisingly tolerant of delayed notifications.
They are not tolerant of receiving the same debit alert three times.
Retries are where duplicates usually happen.
Example:
- Worker sends SMS
- Provider actually succeeds
- Network timeout occurs before acknowledgment returns
- Worker assumes failure
- Worker retries
- User receives duplicate SMS
To prevent this, every notification should carry an idempotency_key.
Before sending, workers check:
“Have we already processed this exact notification?”
Example constraint:
UNIQUE(user_id, notification_type, idempotency_key)
This is one of those small architectural decisions that saves massive operational pain later.
Even if retries happen multiple times, the database becomes the final protection layer against duplicates.
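Here is one way the constraint could be used, sketched with SQLite from the standard library as a stand-in for PostgreSQL. The table layout and the helpers `claim_notification` and `send_once` are assumptions for illustration; only the UNIQUE constraint comes from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
conn.execute(
    """
    CREATE TABLE notifications (
        user_id           INTEGER NOT NULL,
        notification_type TEXT    NOT NULL,
        idempotency_key   TEXT    NOT NULL,
        status            TEXT    NOT NULL DEFAULT 'pending',
        UNIQUE (user_id, notification_type, idempotency_key)
    )
    """
)


def claim_notification(user_id: int, notification_type: str, key: str) -> bool:
    """Return True only for the first caller to claim this notification."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO notifications "
                "(user_id, notification_type, idempotency_key) VALUES (?, ?, ?)",
                (user_id, notification_type, key),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the row already exists, so skip the send


def send_once(user_id: int, notification_type: str, key: str, send) -> bool:
    """Invoke `send` (a hypothetical provider call) at most once per key."""
    if not claim_notification(user_id, notification_type, key):
        return False
    send()
    return True
```

Claiming before sending gives at-most-once behaviour: a retry after a lost acknowledgment is blocked. The trade-off is that a send which truly failed after the claim stays `pending`, so something else has to pick it up later.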
Every delivery attempt should also be logged.
Not just successes: everything.
A dedicated table such as notification_attempts works well here.
Each row should include:
- provider used
- response status
- retry count
- timestamps
- provider error payloads
Because when something goes wrong in production, you want evidence, not guesses.
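A rough sketch of what that log might look like, again using SQLite as a stand-in for PostgreSQL. The column names and the `record_attempt` helper are my assumptions; the source only names the table and the kinds of fields to capture.

```python
import json
import sqlite3
import time

log_db = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
log_db.execute(
    """
    CREATE TABLE notification_attempts (
        idempotency_key TEXT    NOT NULL,
        provider        TEXT    NOT NULL,
        response_status TEXT    NOT NULL,
        retry_count     INTEGER NOT NULL,
        attempted_at    REAL    NOT NULL,  -- Unix timestamp
        error_payload   TEXT               -- raw provider error as JSON, if any
    )
    """
)


def record_attempt(key, provider, status, retry_count, error=None) -> None:
    """Append one row per attempt, whether it succeeded or failed."""
    payload = json.dumps(error) if error is not None else None
    with log_db:
        log_db.execute(
            "INSERT INTO notification_attempts VALUES (?, ?, ?, ?, ?, ?)",
            (key, provider, status, retry_count, time.time(), payload),
        )
```

Keeping the table append-only (never updating rows) means the full history of an incident can be reconstructed afterwards.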
3. Provider Routing & Fallbacks
A reality every senior engineer eventually learns:
Third-party providers fail more often than you expect.
Twilio can degrade.
SendGrid can throttle requests.
FCM can delay pushes.
The mistake is designing systems that assume providers are always available.
Reliable systems assume failure is normal.
So instead of hardcoding a single provider, introduce a provider routing layer.
Example Routing Strategy
| Channel | Primary | Fallback |
|---|---|---|
| SMS | Twilio / Termii | Flutterwave SMS |
| Email | SendGrid | Mailgun |
| Push | Firebase FCM | Direct APNs |
The worker flow becomes:
- Attempt primary provider
- Detect timeout or failure
- Retry intelligently
- Automatically fail over if necessary
Users shouldn’t notice your provider had a bad day.
That’s the goal.
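The worker flow above could be sketched roughly like this. Providers are modelled as plain callables that raise on failure; `send_with_failover` and `AllProvidersFailed` are hypothetical names, and no real provider SDK is used.

```python
from typing import Callable, List, Sequence, Tuple

# A provider is any callable taking (recipient, message) that raises on failure.
Provider = Callable[[str, str], None]


class AllProvidersFailed(Exception):
    """Raised when the primary and every fallback have been exhausted."""


def send_with_failover(
    providers: Sequence[Tuple[str, Provider]],
    recipient: str,
    message: str,
    attempts_per_provider: int = 2,
) -> str:
    """Try each provider in order; return the name of the one that succeeded."""
    errors: List[str] = []
    for name, send in providers:
        for attempt in range(1, attempts_per_provider + 1):
            try:
                send(recipient, message)
                return name
            except Exception as exc:  # timeouts, throttling, server errors
                errors.append(f"{name} attempt {attempt}: {exc}")
    raise AllProvidersFailed("; ".join(errors))
```

One caveat: a timeout does not prove the primary failed to deliver, so failing over can itself create a duplicate. That is why routing needs to sit behind the idempotency check rather than replace it.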
4. Retry Strategy
Retries sound simple until they start causing damage.
Bad retry systems can:
- spam users
- overload providers
- amplify outages
- generate huge costs
A common mistake is retrying too aggressively.
If Twilio is already struggling, hammering it with thousands of immediate retries only makes things worse.
Instead, use exponential backoff.
Example:
Retry #1 → 30 seconds
Retry #2 → 2 minutes
Retry #3 → 10 minutes
This gives providers time to recover while keeping pressure manageable.
And after maximum retries?
Move the message into a Dead Letter Queue (DLQ).
That queue is basically your “something unusual happened here” bucket.
At that point, engineers should be alerted.
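A small sketch of that policy, using the delay schedule from the example. The optional `jitter` parameter is my addition (it spreads retries out so they do not all land at the same instant), and the list-based queues are placeholders for real ones.

```python
import random
from typing import Optional

RETRY_DELAYS_SECONDS = [30, 120, 600]  # 30 seconds, 2 minutes, 10 minutes


def next_retry_delay(retry_number: int, jitter: float = 0.0) -> Optional[float]:
    """Delay before retry N (1-based), or None once retries are exhausted."""
    if retry_number < 1 or retry_number > len(RETRY_DELAYS_SECONDS):
        return None
    base = RETRY_DELAYS_SECONDS[retry_number - 1]
    # With jitter=0.2 the delay is somewhere between base and base * 1.2.
    return base + random.uniform(0, jitter * base)


def handle_failure(message: dict, retry_queue: list, dead_letter_queue: list) -> None:
    """Schedule the next retry, or park the message in the DLQ."""
    retry_number = message.get("retries", 0) + 1
    delay = next_retry_delay(retry_number)
    if delay is None:
        dead_letter_queue.append(message)  # this is where alerting would hook in
    else:
        retry_queue.append(({**message, "retries": retry_number}, delay))
```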
5. Reconciliation Jobs
One subtle issue in distributed systems:
Sometimes providers say “accepted” even though delivery eventually fails.
That creates dangerous blind spots.
A notification may look successful internally while the user never actually receives it.
This is why reconciliation jobs matter.
Every few minutes, background jobs should scan for suspicious states:
Notifications stuck in "pending" for too long
Then:
→ Re-query provider APIs
→ Update delivery status
→ Retry if needed
These jobs quietly save systems from edge cases caused by:
- webhook failures
- transient outages
- network interruptions
- delayed provider processing
A lot of reliability engineering is really just building systems that continuously self-correct.
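As a sketch, one pass of such a job might look like the following. The `Notification` record, the five-minute staleness threshold, and the `fetch_provider_status` callable (standing in for a real provider status API) are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Notification:
    id: str
    status: str      # "pending", "delivered", or "failed"
    sent_at: float   # Unix timestamp of the last send


def reconcile(
    notifications: Iterable[Notification],
    fetch_provider_status: Callable[[str], str],
    now: float,
    stale_after_seconds: float = 300.0,
) -> List[Notification]:
    """Refresh stale pending notifications; return the ones that need a retry."""
    to_retry = []
    for n in notifications:
        if n.status != "pending" or now - n.sent_at < stale_after_seconds:
            continue  # already settled, or still within the normal window
        n.status = fetch_provider_status(n.id)  # re-query the provider
        if n.status == "failed":
            to_retry.append(n)
    return to_retry
```

Passing `now` in explicitly keeps the job easy to test; a scheduler would call it with the current time every few minutes.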
6. User Preferences & Rate Limiting
Good notification systems are not just reliable.
They’re respectful.
Users should control how they’re contacted.
Examples:
- disable marketing SMS
- mute non-critical notifications after 10PM
- push-only preferences
- email digests instead of instant alerts
A simple table such as user_notification_settings can dramatically improve user experience.
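Here is a minimal sketch of how a row from a settings table like user_notification_settings might gate a send. The field names, the 7 AM end of quiet hours, and the `should_send` helper are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class UserNotificationSettings:
    marketing_sms: bool = True
    quiet_hours_start: time = time(22, 0)  # 10 PM
    quiet_hours_end: time = time(7, 0)     # assumed end of quiet hours


def in_quiet_hours(settings: UserNotificationSettings, local_time: time) -> bool:
    start, end = settings.quiet_hours_start, settings.quiet_hours_end
    if start <= end:
        return start <= local_time < end
    return local_time >= start or local_time < end  # window wraps past midnight


def should_send(
    settings: UserNotificationSettings,
    channel: str,
    category: str,
    critical: bool,
    local_time: time,
) -> bool:
    """Apply opt-outs first, then mute non-critical messages in quiet hours."""
    if category == "marketing" and channel == "sms" and not settings.marketing_sms:
        return False
    if not critical and in_quiet_hours(settings, local_time):
        return False
    return True
```

Critical messages such as OTPs and security alerts bypass quiet hours here, which seems like the safer default for a fintech product.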
Rate limiting matters too.
Without it, bugs or loops can become expensive very quickly.
Imagine accidentally sending OTPs in a retry loop to thousands of users.
Redis-based limits help protect against this.
Example:
Max 3 SMS/hour/user
That protects:
- users from spam
- providers from overload
- the business from runaway costs
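The "3 SMS per hour per user" rule can be sketched as a sliding-window limiter. This version keeps state in process memory as a stand-in for Redis; the class name and its parameters are hypothetical.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """In-memory stand-in for a Redis-backed per-user limit."""

    def __init__(self, max_events: int = 3, window_seconds: float = 3600.0):
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # user_id -> timestamps of recent sends

    def allow(self, user_id, now=None) -> bool:
        """Record and permit the send if the user is under the limit."""
        now = time.monotonic() if now is None else now
        events = self._events[user_id]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_events:
            return False
        events.append(now)
        return True
```

With several workers the counters have to live in a shared store, which is where Redis comes in; a per-process limiter like this one only protects a single worker.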
7. Monitoring & Observability
At scale, invisible systems are dangerous systems.
You need to know:
- Are queues growing?
- Are workers failing?
- Which provider is slowing down?
- Which region is affected?
The most important metrics are usually boring operational ones.
Golden Signals
- queue latency
- queue depth
- worker error rate
- provider response times
Then business-level metrics:
- delivery success rate
- percentage delivered within 30 seconds
- provider costs
- user opt-out rates
And finally: alerts.
Example:
Alert if SMS failure rate exceeds 5% for 2 minutes
The earlier you detect degradation, the smaller the incident becomes.
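The "exceeds 5% for 2 minutes" rule has a detail worth spelling out: the alert should fire only when the breach is sustained, not on a single bad sample. A minimal sketch, assuming failure-rate samples are fed in periodically (the class name is hypothetical):

```python
class FailureRateAlert:
    """Fires only after the failure rate stays above a threshold for a while."""

    def __init__(self, threshold: float = 0.05, sustained_seconds: float = 120.0):
        self.threshold = threshold
        self.sustained_seconds = sustained_seconds
        self._breach_started_at = None

    def observe(self, failure_rate: float, now: float) -> bool:
        """Feed one sample; True means the alert should fire."""
        if failure_rate <= self.threshold:
            self._breach_started_at = None  # recovered, so reset the timer
            return False
        if self._breach_started_at is None:
            self._breach_started_at = now
        return now - self._breach_started_at >= self.sustained_seconds
```

In practice this logic usually lives in the monitoring stack rather than application code. Prometheus alerting rules, for instance, express the same idea with a `for` clause.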
8. Testing Failure Scenarios
The biggest difference between systems that look reliable and systems that are reliable is failure testing.
Because everything works in happy-path demos.
The real question is:
What happens when dependencies misbehave?
One useful strategy is shadow testing.
Route a tiny percentage of production traffic through a new provider and compare results safely.
Example:
- compare latency
- compare delivery rates
- validate formatting consistency
Chaos testing is also incredibly valuable.
Example:
Intentionally fail 10% of Twilio requests in staging
That sounds scary initially.
But it validates whether:
- retries actually work
- failovers trigger correctly
- reconciliation jobs recover messages
Reliable systems are engineered through controlled failure exposure.
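One lightweight way to inject failures in staging is to wrap the provider call. This is only a sketch: `with_injected_failures` and `InjectedFailure` are hypothetical names, and `send` stands for whatever function actually calls the provider.

```python
import random


class InjectedFailure(Exception):
    """Raised in place of calling the real provider."""


def with_injected_failures(send, failure_rate: float = 0.10, rng=None):
    """Wrap `send` so that roughly `failure_rate` of calls fail artificially."""
    rng = rng or random.Random()

    def chaotic_send(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure("simulated provider failure")
        return send(*args, **kwargs)

    return chaotic_send
```

Passing a seeded `random.Random` as `rng` makes a chaos run reproducible, which helps when a test uncovers a retry bug you need to replay.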
Why This Architecture Works
What makes this architecture resilient is that it assumes bad things will happen.
Because eventually:
- providers fail
- queues spike
- retries happen
- networks become unreliable
The system survives because reliability is built into the architecture itself.
By combining:
- queues
- idempotency
- retries
- fallback providers
- reconciliation jobs
- observability
…the platform continues operating even during partial outages and heavy traffic spikes.
And in fintech, reliability isn’t just infrastructure quality.
It directly affects user trust.
Tech Stack Example
| Component | Suggested Tech |
|---|---|
| Queue | Redis Streams / Kafka / SQS |
| Workers | Laravel Queues / Go Workers / Node Consumers |
| Cache | Redis |
| Database | PostgreSQL |
| Monitoring | Prometheus + Grafana |
| Alerts | PagerDuty / Slack |
| SMS | Twilio / Termii |
| Email | SendGrid / Mailgun |
| Push | Firebase FCM |
Final Thoughts
Most notification systems work during normal traffic.
That’s not the hard part.
The hard part is surviving:
- provider outages
- duplicate retry scenarios
- partial failures
- sudden spikes in traffic
That’s where architecture starts to matter.
Because users rarely remember the notifications that worked.
They remember the moments when communication failed during something important.