DEV Community

MOTY

Notification System Technical Specification

Overview

Design a highly reliable notification system supporting push notifications, SMS, and email for over 1 million users. The system must guarantee:

  • no duplicate notifications
  • no missed sends
  • graceful degradation during provider failures
  • scalability and observability

High-Level Architecture

Client/API
   ↓
Notification Service
   ↓
Message Queue (Kafka/SQS/RabbitMQ)
   ↓
Notification Workers
   ↓
Provider Adapters
(SendGrid, Twilio, Firebase, etc.)
   ↓
Webhook/Event Processor
   ↓
Notification Database + Audit Logs

Core Components

1. Notification API Service

Handles:

  • notification creation
  • validation
  • idempotency checks
  • enqueueing messages

Each request receives:

  • notification_id
  • idempotency_key

The service stores notifications before processing to ensure durability.
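The create path above can be sketched as follows. This is a minimal illustration, not the service's actual code: it assumes the client supplies the idempotency_key, uses an in-process SQLite table as a stand-in for the notification database, and the names `init_db`, `create_notification`, and the column layout are hypothetical.

```python
import sqlite3
import uuid

VALID_CHANNELS = {"email", "sms", "push"}


def init_db(conn: sqlite3.Connection) -> None:
    # Cut-down table for illustration; column names are assumptions.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS notifications (
            notification_id TEXT PRIMARY KEY,
            idempotency_key TEXT NOT NULL UNIQUE,
            channel         TEXT NOT NULL,
            recipient       TEXT NOT NULL,
            status          TEXT NOT NULL DEFAULT 'PENDING'
        )
        """
    )


def create_notification(conn, idempotency_key, channel, recipient):
    """Validate and store a notification.

    Returns (notification_id, created). A repeated idempotency_key returns
    the original notification_id with created=False instead of a new row.
    """
    if channel not in VALID_CHANNELS:
        raise ValueError(f"unsupported channel: {channel!r}")
    notification_id = str(uuid.uuid4())
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO notifications "
                "(notification_id, idempotency_key, channel, recipient) "
                "VALUES (?, ?, ?, ?)",
                (notification_id, idempotency_key, channel, recipient),
            )
    except sqlite3.IntegrityError:
        # UNIQUE(idempotency_key) rejected the insert: this is a repeat request.
        row = conn.execute(
            "SELECT notification_id FROM notifications WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return row[0], False
    # Only a newly stored notification would be enqueued at this point.
    return notification_id, True
```

Letting the database constraint detect the repeat, rather than doing a SELECT first, avoids a race between two concurrent requests carrying the same key.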


2. Message Queue

A message queue (Kafka, SQS, or RabbitMQ) handles asynchronous processing between the API service and the workers.

Benefits:

  • decouples producers and consumers
  • absorbs traffic spikes
  • supports retries
  • prevents request blocking

Messages are persisted until successfully processed.


3. Notification Workers

Dedicated workers for:

  • email
  • SMS
  • push notifications

Responsibilities:

  • consume queue messages
  • send notifications
  • retry transient failures
  • update delivery status

Workers scale horizontally.
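A worker's loop might look roughly like this. It is a sketch under simplifying assumptions: Python's in-process `queue.Queue` stands in for the real broker, `statuses` is a plain dict standing in for the database, and `run_worker` plus the status names `SENT` and `RETRY_PENDING` are hypothetical.

```python
import queue
from typing import Callable, Dict


def run_worker(
    channel_queue: "queue.Queue[dict]",
    send: Callable[[dict], None],
    statuses: Dict[str, str],
) -> int:
    """Drain one channel's queue and record a status per notification.

    Returns the number of messages handled.
    """
    handled = 0
    while True:
        try:
            message = channel_queue.get_nowait()
        except queue.Empty:
            return handled
        try:
            send(message)
            statuses[message["notification_id"]] = "SENT"
        except Exception:
            # Left for the retry policy to pick up; the message is not dropped silently.
            statuses[message["notification_id"]] = "RETRY_PENDING"
        finally:
            channel_queue.task_done()
            handled += 1
```

With one queue per channel, scaling horizontally means running more copies of the same loop against that channel's queue.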


Reliability Strategy

Idempotency

Every notification has:

  • unique notification_id
  • idempotency_key

Database constraint:

UNIQUE(idempotency_key)

Before sending:

  • workers check whether the notification has already succeeded
  • an already-sent notification is skipped, which prevents duplicate sends during retries
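One way to make the pre-send check safe when several workers run concurrently is to claim the row atomically before sending. The sketch below assumes a `notifications` table with a `status` column and uses SQLite as a stand-in; `claim_and_send` and the `SENDING` status are hypothetical names, not part of the spec above.

```python
import sqlite3
from typing import Callable


def claim_and_send(
    conn: sqlite3.Connection,
    notification_id: str,
    send: Callable[[str], None],
) -> bool:
    """Send a notification only if this call wins the claim.

    Returns True if this call sent it, False if it was already claimed or sent.
    """
    with conn:
        cur = conn.execute(
            "UPDATE notifications SET status = 'SENDING' "
            "WHERE notification_id = ? AND status = 'PENDING'",
            (notification_id,),
        )
    if cur.rowcount == 0:
        return False  # another worker or an earlier attempt got there first
    try:
        send(notification_id)
    except Exception:
        with conn:  # release the claim so a retry can pick it up
            conn.execute(
                "UPDATE notifications SET status = 'PENDING' WHERE notification_id = ?",
                (notification_id,),
            )
        raise
    with conn:
        conn.execute(
            "UPDATE notifications SET status = 'SENT' WHERE notification_id = ?",
            (notification_id,),
        )
    return True
```

A crash between the provider call and the final update can still leave the outcome unknown, so passing the idempotency_key through to providers that accept one is a useful second line of defence.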

Retry & Failure Handling

Failures are classified as:

  • transient (e.g. timeouts, rate limits, provider outages) → retry
  • permanent (e.g. invalid recipient, rejected payload) → fail immediately

Retry strategy:

  • exponential backoff between attempts
  • max retry threshold
  • dead-letter queue (DLQ) for messages that exhaust their retries

Example backoff schedule:

1 min → 5 min → 15 min → 1 hour
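The retry decision can be expressed as a small pure function. This sketch reuses the delays from the example schedule; the exception classes, the `next_action` name, and the choice to treat unclassified errors as transient are assumptions made for illustration.

```python
from typing import Optional, Tuple

# Delays from the example schedule: 1 min, 5 min, 15 min, 1 hour.
BACKOFF_SECONDS = [60, 300, 900, 3600]


class TransientError(Exception):
    """Worth retrying, e.g. a timeout or a rate-limit response."""


class PermanentError(Exception):
    """Retrying cannot help, e.g. an invalid recipient address."""


def next_action(error: Exception, attempts_made: int) -> Tuple[str, Optional[int]]:
    """Decide what to do after a failed attempt.

    attempts_made counts attempts so far, starting at 1.
    Returns ("retry", delay_seconds), ("fail", None), or ("dlq", None).
    """
    if attempts_made < 1:
        raise ValueError("attempts_made must be at least 1")
    if isinstance(error, PermanentError):
        return ("fail", None)
    # Anything else, including unclassified errors, is treated as transient.
    if attempts_made > len(BACKOFF_SECONDS):
        return ("dlq", None)  # max retry threshold reached
    return ("retry", BACKOFF_SECONDS[attempts_made - 1])
```

In practice a small random jitter is often added to each delay so that many failed messages do not all retry at the same instant.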

Graceful Degradation

If a provider fails, a fallback provider takes over automatically.

Example:

Twilio fails → switch to Termii
SendGrid fails → switch to SES

Circuit breakers prevent continuously calling unhealthy providers.
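Fallback and circuit breaking can be combined as in the sketch below. It is a simplified illustration: the thresholds are arbitrary, every exception is treated as a provider failure, and `CircuitBreaker` and `send_with_fallback` are hypothetical names rather than any provider SDK's API.

```python
import time
from typing import Callable, List, Optional, Tuple


class CircuitBreaker:
    """Opens after consecutive failures; allows a trial call after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: once the cool-down has passed, let a trial call through.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


Provider = Tuple[str, Callable[[dict], None], CircuitBreaker]


def send_with_fallback(message: dict, providers: List[Provider]) -> str:
    """Try providers in priority order; return the name of the one that accepted."""
    for name, send, breaker in providers:
        if not breaker.allow():
            continue  # unhealthy provider, skip without calling it
        try:
            send(message)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return name
    raise RuntimeError("all providers unavailable")
```

When every provider is unavailable, the raised error would feed back into the retry policy rather than dropping the message.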


Delivery Tracking

Webhook processors receive provider events:

  • delivered
  • failed
  • bounced
  • opened

All events are stored in audit tables.
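Because webhook deliveries are commonly retried by the sender, the processor benefits from being idempotent too. The sketch below assumes each event carries a unique `event_id`; that field, the `delivery_events` table, and the function names are hypothetical, and SQLite stands in for the real database.

```python
import json
import sqlite3

KNOWN_EVENT_TYPES = {"delivered", "failed", "bounced", "opened"}


def init_audit(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS delivery_events (
            event_id        TEXT PRIMARY KEY,
            notification_id TEXT NOT NULL,
            event_type      TEXT NOT NULL,
            payload         TEXT NOT NULL
        )
        """
    )


def record_event(conn: sqlite3.Connection, event: dict) -> bool:
    """Append a provider event to the audit table.

    Returns True if the event was new, False if it was a redelivery.
    """
    if event["event_type"] not in KNOWN_EVENT_TYPES:
        raise ValueError(f"unknown event type: {event['event_type']!r}")
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO delivery_events "
            "(event_id, notification_id, event_type, payload) VALUES (?, ?, ?, ?)",
            (
                event["event_id"],
                event["notification_id"],
                event["event_type"],
                json.dumps(event, sort_keys=True),  # keep the raw event for auditing
            ),
        )
    return cur.rowcount == 1
```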


Database Design

Tables:

  • notifications
  • delivery_attempts
  • provider_logs
  • templates
  • user_preferences

Indexes:

(user_id, created_at)
(status, channel)
(notification_id)
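A possible shape for two of these tables and the three indexes is sketched below. The column names, and the guess that the (notification_id) index belongs on delivery_attempts, are assumptions; SQLite is used only so the DDL can be run locally, and PostgreSQL types would differ.

```python
import sqlite3

SCHEMA = """
CREATE TABLE notifications (
    notification_id TEXT PRIMARY KEY,
    idempotency_key TEXT NOT NULL UNIQUE,
    user_id         TEXT NOT NULL,
    channel         TEXT NOT NULL,
    status          TEXT NOT NULL DEFAULT 'PENDING',
    created_at      TEXT NOT NULL
);

CREATE TABLE delivery_attempts (
    attempt_id      INTEGER PRIMARY KEY,
    notification_id TEXT NOT NULL REFERENCES notifications(notification_id),
    provider        TEXT NOT NULL,
    outcome         TEXT NOT NULL,
    attempted_at    TEXT NOT NULL
);

-- "Show this user's recent notifications"
CREATE INDEX idx_notifications_user_created ON notifications (user_id, created_at);
-- "Find pending SMS", used by workers and reconciliation jobs
CREATE INDEX idx_notifications_status_channel ON notifications (status, channel);
-- "List all attempts for one notification"
CREATE INDEX idx_attempts_notification ON delivery_attempts (notification_id);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(SCHEMA)
```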

Preventing Missed Sends

To avoid message loss:

  • notifications are stored before queueing
  • the transactional outbox pattern commits the notification and its pending queue message in one transaction
  • reconciliation jobs scan for stuck notifications

Example:

status PENDING for more than 10 minutes → requeue
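The outbox and the reconciliation scan could be sketched like this. Everything here is illustrative: SQLite stands in for the database, timestamps are plain integer seconds passed in by the caller, `publish` is whatever function hands a notification_id to the queue, and the table and function names are hypothetical.

```python
import sqlite3
from typing import Callable, List


def init_outbox_db(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE notifications (
            notification_id TEXT PRIMARY KEY,
            status          TEXT NOT NULL,
            updated_at      INTEGER NOT NULL
        );
        CREATE TABLE outbox (
            id              INTEGER PRIMARY KEY AUTOINCREMENT,
            notification_id TEXT NOT NULL,
            published       INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def create_with_outbox(conn: sqlite3.Connection, notification_id: str, now: int) -> None:
    with conn:  # one transaction: both rows commit, or neither does
        conn.execute(
            "INSERT INTO notifications (notification_id, status, updated_at) "
            "VALUES (?, 'PENDING', ?)",
            (notification_id, now),
        )
        conn.execute("INSERT INTO outbox (notification_id) VALUES (?)", (notification_id,))


def relay_outbox(conn: sqlite3.Connection, publish: Callable[[str], None]) -> int:
    """Publish unpublished outbox rows; returns how many were published."""
    rows = conn.execute(
        "SELECT id, notification_id FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, notification_id in rows:
        publish(notification_id)  # a crash after this line means a re-publish later
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


def requeue_stuck(conn: sqlite3.Connection, publish: Callable[[str], None],
                  now: int, max_age_seconds: int = 600) -> List[str]:
    """Requeue notifications that have been PENDING for longer than max_age_seconds."""
    cutoff = now - max_age_seconds
    rows = conn.execute(
        "SELECT notification_id FROM notifications "
        "WHERE status = 'PENDING' AND updated_at < ?",
        (cutoff,),
    ).fetchall()
    stuck = [row[0] for row in rows]
    for notification_id in stuck:
        publish(notification_id)
        with conn:  # bump the timestamp so the next scan does not requeue immediately
            conn.execute(
                "UPDATE notifications SET updated_at = ? WHERE notification_id = ?",
                (now, notification_id),
            )
    return stuck
```

Both the relay and the requeue deliver at least once rather than exactly once, which is why the workers' idempotency check matters.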

Observability

Monitoring:

  • queue depth
  • provider latency
  • retry count
  • failed delivery rate

Tools:

  • Prometheus
  • Grafana
  • CloudWatch
  • Sentry

Alerts trigger on abnormal failure spikes.
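The logic behind a failure-spike alert can be illustrated with a sliding window. In a real deployment this would more likely live in an alerting rule in the monitoring stack; the class below is only a sketch of the idea, and its name, thresholds, and window size are assumptions.

```python
from collections import deque
from typing import Deque, Tuple


class FailureRateMonitor:
    """Tracks send outcomes in a sliding time window and flags high failure rates."""

    def __init__(self, window_seconds: float = 300.0, threshold: float = 0.2,
                 min_samples: int = 20):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_samples = min_samples
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, succeeded)

    def record(self, timestamp: float, succeeded: bool) -> None:
        self._events.append((timestamp, succeeded))
        # Drop outcomes that have fallen out of the window.
        while self._events and self._events[0][0] <= timestamp - self.window_seconds:
            self._events.popleft()

    def failure_rate(self) -> float:
        if not self._events:
            return 0.0
        failures = sum(1 for _, ok in self._events if not ok)
        return failures / len(self._events)

    def should_alert(self) -> bool:
        # min_samples avoids alerting on one failure out of two sends.
        return len(self._events) >= self.min_samples and self.failure_rate() > self.threshold
```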


Scalability

The system supports 1M+ users through:

  • horizontal worker scaling
  • partitioned queues
  • stateless services
  • Redis caching (e.g. for templates and user preferences)
  • batch processing where applicable
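Two of these techniques, partitioned queues and batching, can be sketched in a few lines. Keying partitions on the user ID is an assumption made here so that one user's notifications stay in order; `partition_for` and `in_batches` are hypothetical helper names.

```python
import hashlib
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user to a stable partition so their messages share one queue partition."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def in_batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

A stable hash is used instead of Python's built-in `hash()`, which is randomized per process for strings and would send the same user to different partitions on different workers.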

Security

  • encrypted provider credentials
  • signed webhooks
  • rate limiting
  • RBAC for admin access
  • audit logging for all notification events
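Webhook signature checking generally follows the pattern below. Each provider documents its own signing scheme, so the HMAC-SHA256 hex digest over the raw body used here is an assumption for illustration, as are the function names; the point is to recompute the signature with a shared secret and compare in constant time.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature over the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Return True only if the supplied signature matches the body."""
    expected = sign(secret, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature_header.encode("utf-8"))
```

Verification has to run on the exact bytes received, before any JSON parsing or re-serialization, since even a whitespace change alters the signature.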

Tech Stack Example

  • Backend: Node.js / Python
  • Queue: Kafka or SQS
  • Database: PostgreSQL
  • Cache: Redis
  • Providers: Twilio, Firebase, SendGrid
  • Monitoring: Grafana + Prometheus
  • Deployment: Kubernetes/ECS
