Overview
Design a highly reliable notification system supporting push notifications, SMS, and email for over 1 million users. The system must guarantee:
- no duplicate notifications
- no missed sends
- graceful degradation during provider failures
- scalability and observability
High-Level Architecture
Client/API
↓
Notification Service
↓
Message Queue (Kafka/SQS/RabbitMQ)
↓
Notification Workers
↓
Provider Adapters (SendGrid, Twilio, Firebase, etc.)
↓
Webhook/Event Processor
↓
Notification Database + Audit Logs
Core Components
1. Notification API Service
Handles:
- notification creation
- validation
- idempotency checks
- enqueueing messages
Each request receives:
- notification_id
- idempotency_key
The service stores notifications before processing to ensure durability.
2. Message Queue
A message queue such as Kafka, SQS, or RabbitMQ handles asynchronous processing.
Benefits:
- decouples producers and consumers
- absorbs traffic spikes
- supports retries
- prevents request blocking
Messages are persisted until successfully processed.
3. Notification Workers
Dedicated workers per channel:
- SMS
- push notifications
- email
Responsibilities:
- consume queue messages
- send notifications
- retry transient failures
- update delivery status
Workers scale horizontally.
Reliability Strategy
Idempotency
Every notification has:
- unique notification_id
- idempotency_key
Database constraint:
UNIQUE(idempotency_key)
Before sending, workers check whether the notification has already succeeded. This prevents duplicate sends during retries.
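The two checks above can be sketched as follows. This is a minimal illustration, not a definitive implementation: it uses an in-memory SQLite database in place of PostgreSQL, and the column names, the status values (`PENDING`, `SENT`), and the helper functions are all assumptions made for the example.

```python
import sqlite3
import uuid
from typing import Tuple


def open_db() -> sqlite3.Connection:
    """Create an in-memory table with the UNIQUE(idempotency_key) constraint."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE notifications (
            notification_id TEXT PRIMARY KEY,
            idempotency_key TEXT NOT NULL UNIQUE,
            status          TEXT NOT NULL DEFAULT 'PENDING'
        )
        """
    )
    return conn


def create_notification(conn: sqlite3.Connection, idempotency_key: str) -> Tuple[str, bool]:
    """Insert a notification once per key. Returns (notification_id, created)."""
    new_id = str(uuid.uuid4())
    with conn:  # one transaction; commits on success
        cur = conn.execute(
            "INSERT OR IGNORE INTO notifications (notification_id, idempotency_key) "
            "VALUES (?, ?)",
            (new_id, idempotency_key),
        )
        if cur.rowcount == 1:
            return new_id, True
        # The key already exists: return the original notification instead.
        row = conn.execute(
            "SELECT notification_id FROM notifications WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0], False


def already_succeeded(conn: sqlite3.Connection, notification_id: str) -> bool:
    """Worker-side check performed before calling a provider."""
    row = conn.execute(
        "SELECT status FROM notifications WHERE notification_id = ?",
        (notification_id,),
    ).fetchone()
    return row is not None and row[0] == "SENT"


def mark_sent(conn: sqlite3.Connection, notification_id: str) -> None:
    with conn:
        conn.execute(
            "UPDATE notifications SET status = 'SENT' WHERE notification_id = ?",
            (notification_id,),
        )
```

A repeated request with the same key returns the original notification_id with `created` set to False, so the API can skip enqueueing it again. Note that a check followed by a send only narrows the window for duplicates; two workers holding the same message at the same moment could still both pass the check.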
Retry & Failure Handling
Failures are classified as:
- transient → retry
- permanent → fail immediately
Retry strategy:
- exponential backoff
- dead-letter queue (DLQ)
- max retry threshold
Example backoff schedule (delay before each retry):
1 min → 5 min → 15 min → 1 hour
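One way to express this policy in code is a small decision function. The sketch below is illustrative only: the exception classes, the `next_action` name, and the choice to treat unknown errors as transient are assumptions, and the delays simply mirror the example schedule.

```python
from typing import Tuple

# Delays before retry 1, 2, 3, 4 (seconds): 1 min, 5 min, 15 min, 1 hour.
RETRY_DELAYS_SECONDS = [60, 300, 900, 3600]


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx, throttling)."""


class PermanentError(Exception):
    """Hypothetical marker for failures that will never succeed (invalid number, unsubscribed)."""


def next_action(failed_attempts: int, error: Exception) -> Tuple[str, int]:
    """Decide what to do after a failed send.

    failed_attempts counts failures so far, including the one just observed.
    Returns ("retry", delay_seconds), ("fail", 0), or ("dlq", 0).
    """
    if isinstance(error, PermanentError):
        return ("fail", 0)
    # Assumption: anything not known to be permanent is treated as transient.
    if failed_attempts > len(RETRY_DELAYS_SECONDS):
        # Max retry threshold reached: park the message in the dead-letter queue.
        return ("dlq", 0)
    return ("retry", RETRY_DELAYS_SECONDS[failed_attempts - 1])
```

With this schedule a message gets one initial attempt plus four retries before it lands in the DLQ. In practice a small random jitter is often added to each delay so that many failed messages do not retry at the same instant.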
Graceful Degradation
If a provider fails:
- fallback provider activates automatically
Example:
Twilio fails → switch to Termii
SendGrid fails → switch to SES
Circuit breakers prevent continuously calling unhealthy providers.
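A rough sketch of how fallback and circuit breaking can fit together is shown below. The `CircuitBreaker` and `Provider` classes, the threshold of 3 failures, and the 30-second cool-down are all assumptions for illustration; the provider `send` callables stand in for real SDK calls.

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    """Opens after consecutive failures, then allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block calls until the cool-down elapses, then permit a trial call.
        return self.clock() - self.opened_at >= self.reset_after_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


class Provider:
    def __init__(self, name: str, send: Callable[[str], None]) -> None:
        self.name = name
        self.send = send
        self.breaker = CircuitBreaker()


def send_with_fallback(providers: List[Provider], message: str) -> str:
    """Try providers in priority order, skipping any whose breaker is open."""
    for provider in providers:
        if not provider.breaker.allow():
            continue
        try:
            provider.send(message)
        except Exception:
            provider.breaker.record_failure()
            continue
        provider.breaker.record_success()
        return provider.name
    raise RuntimeError("all providers unavailable")
```

Passing `[twilio, termii]` in priority order means Termii only receives traffic when the Twilio call fails or its breaker is open. When every provider is unavailable, the raised error can be treated as a transient failure and fed into the retry path.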
Delivery Tracking
Webhook processors receive provider events:
- delivered
- failed
- bounced
- opened
All events are stored in audit tables.
Database Design
Tables:
- notifications
- delivery_attempts
- provider_logs
- templates
- user_preferences
Indexes:
(user_id, created_at)
(status, channel)
(notification_id)
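To make the tables and indexes concrete, here is a possible schema for two of them, expressed as DDL run through SQLite for portability. Every column name and type is an assumption; the post only names the tables and the indexed columns. Placing the (notification_id) index on delivery_attempts is also a guess, since the post does not say which table it belongs to.

```python
import sqlite3

SCHEMA = """
CREATE TABLE notifications (
    notification_id TEXT PRIMARY KEY,
    idempotency_key TEXT NOT NULL UNIQUE,
    user_id         TEXT NOT NULL,
    channel         TEXT NOT NULL CHECK (channel IN ('push', 'sms', 'email')),
    status          TEXT NOT NULL DEFAULT 'PENDING',
    created_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE delivery_attempts (
    attempt_id      INTEGER PRIMARY KEY,
    notification_id TEXT NOT NULL REFERENCES notifications (notification_id),
    provider        TEXT NOT NULL,
    outcome         TEXT NOT NULL,
    attempted_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- "Show this user's recent notifications"
CREATE INDEX idx_notifications_user_created ON notifications (user_id, created_at);
-- "Find PENDING sms notifications" for workers and reconciliation jobs
CREATE INDEX idx_notifications_status_channel ON notifications (status, channel);
-- "List every attempt for one notification"
CREATE INDEX idx_attempts_notification ON delivery_attempts (notification_id);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the example tables and indexes on the given connection."""
    conn.executescript(SCHEMA)
```

The same statements translate to PostgreSQL with minor type changes (for example, a timestamp type for the time columns).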
Preventing Missed Sends
To avoid message loss:
- notifications are stored before they are queued
- the transactional outbox pattern writes the notification and its queue message in the same database transaction
- reconciliation jobs scan for stuck notifications
Example:
status PENDING for more than 10 minutes → requeue
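The outbox write and the reconciliation scan might look roughly like this. It is a sketch under stated assumptions: SQLite stands in for PostgreSQL, timestamps are plain epoch seconds passed in by the caller, and the `outbox` table layout and function names are hypothetical. A separate relay process (not shown) would read the outbox rows and publish them to the queue.

```python
import sqlite3
from typing import List

STUCK_AFTER_SECONDS = 10 * 60  # "PENDING for more than 10 minutes"


def open_outbox_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE notifications (
            notification_id TEXT PRIMARY KEY,
            status          TEXT NOT NULL DEFAULT 'PENDING',
            updated_at      REAL NOT NULL
        );
        CREATE TABLE outbox (
            id              INTEGER PRIMARY KEY,
            notification_id TEXT NOT NULL
        );
        """
    )
    return conn


def create_with_outbox(conn: sqlite3.Connection, notification_id: str, now: float) -> None:
    """Store the notification and its outbox row atomically: both commit or neither does."""
    with conn:
        conn.execute(
            "INSERT INTO notifications (notification_id, updated_at) VALUES (?, ?)",
            (notification_id, now),
        )
        conn.execute("INSERT INTO outbox (notification_id) VALUES (?)", (notification_id,))


def requeue_stuck(conn: sqlite3.Connection, now: float) -> List[str]:
    """Re-add an outbox row for every notification stuck in PENDING past the cutoff."""
    cutoff = now - STUCK_AFTER_SECONDS
    with conn:
        rows = conn.execute(
            "SELECT notification_id FROM notifications "
            "WHERE status = 'PENDING' AND updated_at < ? ORDER BY notification_id",
            (cutoff,),
        ).fetchall()
        for (notification_id,) in rows:
            conn.execute("INSERT INTO outbox (notification_id) VALUES (?)", (notification_id,))
            # Touch the row so the next scan does not requeue it again immediately.
            conn.execute(
                "UPDATE notifications SET updated_at = ? WHERE notification_id = ?",
                (now, notification_id),
            )
    return [notification_id for (notification_id,) in rows]
```

Because a requeued notification may already be in flight, this job relies on the idempotency check in the workers to avoid sending it twice.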
Observability
Monitoring:
- queue depth
- provider latency
- retry count
- failed delivery rate
Tools:
- Prometheus
- Grafana
- CloudWatch
- Sentry
Alerts trigger on abnormal failure spikes.
Scalability
The system supports 1M+ users through:
- horizontal worker scaling
- partitioned queues
- stateless services
- Redis caching
- batch processing where applicable
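As an illustration of queue partitioning, the sketch below maps each user to a fixed partition with a stable hash. Using user_id as the partition key is an assumption (the post does not name one), as is the `partition_for` helper; managed queues such as Kafka typically perform this mapping themselves when given a message key.

```python
import hashlib


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user to a partition deterministically.

    hashlib is used instead of the built-in hash(), which is salted per
    process for strings and would give different results on different workers.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Keeping all of one user's messages on one partition preserves their relative order while still letting partitions be consumed in parallel by separate workers.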
Security
- encrypted provider credentials
- signed webhooks
- rate limiting
- RBAC for admin access
- audit logging for all notification events
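For signed webhooks, a generic HMAC-SHA256 check is sketched below. This is only an illustration of the idea: each real provider defines its own signing scheme, header name, and payload format, so the `sign` and `verify` helpers and the hex-encoded signature here are assumptions, not any provider's actual API.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature over the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Return True only if the signature matches the body.

    compare_digest runs in constant time, which avoids leaking how many
    leading characters of a forged signature were correct.
    """
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode("utf-8"), signature_hex.encode("utf-8"))
```

The webhook processor would call `verify` on the raw bytes of each incoming request and reject anything that fails before parsing or storing the event.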
Tech Stack Example
- Backend: Node.js / Python
- Queue: Kafka or SQS
- Database: PostgreSQL
- Cache: Redis
- Providers: Twilio, Firebase, SendGrid
- Monitoring: Grafana + Prometheus
- Deployment: Kubernetes/ECS