
Ramin Farajpour Cami


Design of a Push Notification Service

A comprehensive system design for handling 1 billion notifications per day


Table of Contents

  1. System Requirements
  2. Capacity Estimation
  3. API Design
  4. Database Design
  5. High-Level Architecture
  6. Component Design
  7. Trade-offs & Technology Choices
  8. Failure Scenarios & Mitigation
  9. Future Improvements
  10. Monitoring & Observability

System Requirements

Functional Requirements

  • Message Targeting: Ability to target messages based on user attributes, device, and other relevant criteria
  • Delivery Scheduling: Deliver notifications immediately or at a scheduled time, so they can be sent when users are most likely to engage
  • Personalization: Notifications should be customizable for individual user preferences and characteristics
  • Localization: The system must support multiple languages and regional settings
  • User Engagement Tracking: Capability to track how users interact with notifications to facilitate continuous improvement
  • Performance Monitoring: Tools to monitor the speed and reliability of notification delivery
  • Prioritization: Time-critical notifications, such as OTPs, must be delivered ahead of promotional messages

Non-Functional Requirements

  • Scalability: Handle high volume of notifications and users, scaling dynamically as demand increases
  • Reliability: High availability and fault tolerance to ensure continuous operation
  • Performance: Fast response times for incoming notification requests and timely delivery
  • Security: Secure handling and storage of user data and notification content
  • Maintainability: Ease of updating the system and integrating with other services

Capacity Estimation

| Metric | Value | Notes |
| --- | --- | --- |
| Daily Notifications | 1 billion | Target volume |
| Notifications per Second | ~11,574 | Assuming constant traffic |
| Active Users | 100 million | Concurrent connection handling required |
| Average Notification Size | 500 bytes | Including headers and payload |
| Data Throughput | 5.79 MB/s | Sustained data transfer rate |
| Storage Requirements | Scalable | For user preferences, engagement tracking, and logs |
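
The per-second and throughput figures follow directly from the daily volume and the average message size. A short worked check, assuming perfectly even traffic and decimal megabytes (the variable names are illustrative only):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

daily_notifications = 1_000_000_000
avg_size_bytes = 500

# Average rate if traffic were spread evenly over the day.
per_second = daily_notifications / SECONDS_PER_DAY

# Sustained payload throughput in decimal megabytes per second.
throughput_mb_s = per_second * avg_size_bytes / 1_000_000

# Raw payload volume per day, before replication or indexing overhead.
daily_volume_gb = daily_notifications * avg_size_bytes / 1_000_000_000

print(f"{per_second:,.0f} notifications/s")
print(f"{throughput_mb_s:.2f} MB/s")
print(f"{daily_volume_gb:.0f} GB/day")
```

Real traffic is rarely constant, so peak rates will sit above the 11,574/s average; the peak factor has to come from measured traffic rather than from this estimate.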

API Design

1. Send Notification

POST /notifications/send

Request:

{
  "user_id": "12345",
  "message": "Your OTP is 6789",
  "priority": "high",
  "language": "en",
  "device_id": "device123"
}

Response:

{
  "status": "success",
  "notification_id": "abc123",
  "message": "Notification sent successfully."
}
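
One way the service might validate this payload before enqueueing it is sketched below. The field names come from the request above; the set of allowed priorities, the defaults, and the class name are assumptions for illustration, since the article only shows `"high"`.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Assumption: the article only shows "high"; the other levels are illustrative.
ALLOWED_PRIORITIES = {"high", "normal", "low"}


@dataclass(frozen=True)
class SendNotificationRequest:
    user_id: str
    message: str
    priority: str = "normal"
    language: str = "en"
    device_id: Optional[str] = None

    @classmethod
    def from_json(cls, raw: str) -> "SendNotificationRequest":
        """Parse and validate the body of POST /notifications/send."""
        data = json.loads(raw)
        missing = [name for name in ("user_id", "message") if not data.get(name)]
        if missing:
            raise ValueError(f"missing required fields: {', '.join(missing)}")
        request = cls(
            user_id=str(data["user_id"]),
            message=str(data["message"]),
            priority=data.get("priority", "normal"),
            language=data.get("language", "en"),
            device_id=data.get("device_id"),
        )
        if request.priority not in ALLOWED_PRIORITIES:
            raise ValueError(f"unknown priority: {request.priority!r}")
        return request
```

Rejecting malformed requests at the API boundary keeps invalid messages out of the queue, where they would otherwise fail later and consume retry capacity.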

2. Update User Preferences

POST /users/{user_id}/preferences/update

Request:

{
  "preferences": {
    "language": "es",
    "marketing_notifications": false
  }
}

Response:

{
  "status": "success",
  "message": "Preferences updated successfully."
}

3. Fetch Notification Status

GET /notifications/{notification_id}/status

Response:

{
  "notification_id": "abc123",
  "status": "delivered",
  "delivered_at": "2023-04-14T12:00:00Z"
}

4. Track User Engagement

POST /engagements/track

Request:

{
  "notification_id": "abc123",
  "user_id": "12345",
  "action": "clicked"
}

Response:

{
  "status": "success",
  "message": "Engagement recorded successfully."
}

Database Design

Entity-Relationship Diagram

Database Strategy

Hybrid Approach:

  • PostgreSQL for USER and NOTIFICATION data (strong consistency, ACID properties)
  • Apache Cassandra for ENGAGEMENT data (high write throughput, scalability)

Key Design Decisions

  • USER table stores preferences and profile information with referential integrity
  • NOTIFICATION table maintains delivery status and metadata for audit trails
  • ENGAGEMENT table captures all user interactions for analytics and optimization
  • Partitioning strategy based on user_id for horizontal scalability
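
Partitioning by user_id needs a hash that returns the same value in every process. A minimal sketch, assuming a fixed shard count and a hypothetical `shard_for` helper (neither is specified in this design):

```python
import hashlib


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user_id to a shard index that is stable across processes and restarts.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used instead.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo hashing moves most keys when `num_shards` changes; if shards are expected to be added often, consistent hashing is the usual alternative.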

High-Level Architecture

Request Flow Sequence

Component Design

1. Message Queue (Apache Kafka)

Key Features:

  • Distributed Architecture: Horizontal scaling with multiple brokers
  • Priority Queuing: Separate topics for different priority levels
  • Partitioning Strategy: By user_id hash for load distribution
  • Durability: Configurable replication factor for fault tolerance
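
Kafka has no built-in message priority, which is why the design uses one topic per priority level. The consumer-side policy can be sketched with in-memory queues standing in for topics; the topic names and the strict-priority rule below are assumptions, not part of the original design.

```python
from collections import deque
from typing import Deque, Dict, List

# Hypothetical topic names, highest priority first.
TOPICS_BY_PRIORITY = [
    "notifications.high",
    "notifications.normal",
    "notifications.low",
]


def drain(queues: Dict[str, Deque[str]], budget: int) -> List[str]:
    """Take up to `budget` messages, emptying higher-priority topics first."""
    taken: List[str] = []
    for topic in TOPICS_BY_PRIORITY:
        queue = queues[topic]
        while queue and len(taken) < budget:
            taken.append(queue.popleft())
    return taken


queues = {
    "notifications.high": deque(["otp"]),
    "notifications.normal": deque(),
    "notifications.low": deque(["promo-1", "promo-2"]),
}
print(drain(queues, budget=2))  # ['otp', 'promo-1']
```

Strict priority can starve the low-priority topic under sustained load; a weighted budget per topic is a common way to soften that.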

2. Delivery Service

Implementation Details:

  • Circuit Breaker Pattern: Prevent cascade failures
  • Exponential Backoff: delay = base_delay * (2^attempt) + jitter
  • Batch Processing: Group notifications for efficiency
  • Platform-Specific Handling: Dedicated services for iOS, Android, Web
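
The backoff formula above can be sketched as follows. The cap on the exponential term, the jitter range, and the `TransientDeliveryError` type are assumptions added for illustration; the article specifies only the formula.

```python
import random
import time


class TransientDeliveryError(Exception):
    """Hypothetical error type for failures that are worth retrying."""


def backoff_delay(attempt: int, base_delay: float = 0.5, max_delay: float = 60.0) -> float:
    """delay = base_delay * 2^attempt + jitter, with the exponential part capped."""
    exponential = min(max_delay, base_delay * (2 ** attempt))
    jitter = random.uniform(0, base_delay)  # assumption: jitter drawn from [0, base_delay]
    return exponential + jitter


def send_with_retry(send, max_attempts: int = 5, sleep=time.sleep):
    """Call send() until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return send()
        except TransientDeliveryError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: the caller can dead-letter the message
            sleep(backoff_delay(attempt))
```

The jitter spreads retries from many workers over time, so a provider that has just recovered is not hit by all of them at once.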

3. Tracking Service

Analytics Capabilities:

  • Real-time Processing: Apache Flink for stream processing
  • Sliding Window Analysis: 1min, 5min, 1hour, 1day windows
  • Metrics Tracked: Click-through rates, conversion rates, engagement scores
  • Pattern Detection: Anomaly detection for unusual engagement patterns
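
In production the windows would be computed by the stream processor, but the idea behind a sliding-window click-through rate fits in a few lines of plain Python. This is a single-process sketch that assumes events arrive in timestamp order; the class and action names are illustrative.

```python
from collections import deque


class SlidingWindowCTR:
    """Click-through rate over the most recent `window_seconds` of events."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, action), oldest first
        self._counts = {"delivered": 0, "clicked": 0}

    def record(self, timestamp: float, action: str) -> None:
        if action not in self._counts:
            raise ValueError(f"unsupported action: {action!r}")
        self._events.append((timestamp, action))
        self._counts[action] += 1
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that have fallen out of the window ending at `now`.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_action = self._events.popleft()
            self._counts[old_action] -= 1

    def ctr(self, now: float) -> float:
        self._evict(now)
        delivered = self._counts["delivered"]
        return self._counts["clicked"] / delivered if delivered else 0.0
```

One instance per window size (1 minute, 5 minutes, and so on) gives the set of windows listed above.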

Trade-offs & Technology Choices

Message Queue: Apache Kafka vs RabbitMQ

| Aspect | Apache Kafka | RabbitMQ |
| --- | --- | --- |
| Throughput | ✅ Very High (1M+ msgs/sec) | ⚠️ Moderate (100K msgs/sec) |
| Durability | ✅ Excellent (persistent logs) | ✅ Good (persistent queues) |
| Complexity | ⚠️ High operational overhead | ✅ Lower complexity |
| Ordering | ✅ Per-partition ordering | ⚠️ Limited ordering guarantees |
| Scalability | ✅ Horizontal scaling | ⚠️ Vertical scaling preferred |

Choice: Kafka - Better suited for high-throughput, distributed environments

Database Strategy: Hybrid Approach

| Data Type | Database | Rationale |
| --- | --- | --- |
| User Data | PostgreSQL | ACID properties, complex queries, referential integrity |
| Notifications | PostgreSQL | Transaction support, status consistency |
| Engagement | Cassandra | High write throughput, eventual consistency acceptable |

Microservices Architecture

Advantages:

  • Independent scaling of delivery services per platform
  • Technology diversity (different languages/frameworks)
  • Fault isolation and resilience

Challenges:

  • Increased operational complexity
  • Network latency between services
  • Distributed transaction challenges

Failure Scenarios & Mitigation

1. Message Queue Overload

Scenario: Queue becomes overwhelmed during traffic spikes

Mitigation Strategies:

  • Rate Limiting: Token bucket algorithm at API Gateway
  • Auto-scaling: Kubernetes HPA driven by queue depth (for Kafka, consumer lag exposed as an external metric)
  • Circuit Breaker: Fail fast when queue is full
  • Backpressure: Gradual throttling of incoming requests
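
A token bucket allows short bursts up to its capacity while holding the long-run rate to a fixed refill speed. The sketch below is a single-process version with an injectable clock; a real API gateway would keep the bucket state in shared storage, which is not shown, and the class name is illustrative.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

When `allow()` returns False, the gateway can reject the request (for example with HTTP 429) or delay it, which is where the backpressure strategy comes in.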

2. Database Bottlenecks

Scenario: High read/write operations cause latency

Mitigation Strategies:

  • Read Replicas: Route read queries to replicas
  • Database Sharding: Partition by user_id
  • Redis Caching: Cache frequently accessed data
  • Connection Pooling: PgBouncer for PostgreSQL
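
The caching strategy is typically implemented as cache-aside: read from the cache, fall back to the database on a miss, then populate the cache with an expiry. In the sketch below an in-process `TTLCache` stands in for Redis, and `load_from_db`, the key format, and the 300-second TTL are assumptions for illustration.

```python
import time


class TTLCache:
    """In-process stand-in for a Redis-style get / set-with-expiry store."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave like a miss
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)


def get_preferences(user_id, cache, load_from_db, ttl_seconds=300):
    """Cache-aside read: try the cache first, fall back to the database on a miss."""
    key = f"prefs:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    prefs = load_from_db(user_id)
    cache.set(key, prefs, ttl_seconds)
    return prefs
```

The preferences update endpoint should delete or overwrite the cached key, otherwise users can receive notifications based on stale preferences until the TTL expires.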

3. External Service Failures

Scenario: APNS/FCM services are down or rate-limiting

Mitigation Strategies:

  • Exponential Backoff: Progressive retry delays
  • Dead Letter Queue: Store failed messages for later processing
  • Alternative Providers: Fallback to secondary providers
  • Circuit Breaker: Stop calling failing services temporarily
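
A circuit breaker and a dead letter queue work together: once the provider has failed repeatedly, calls stop for a cool-down period and undeliverable messages are parked for later. A minimal sketch, where the thresholds, the `ProviderError` type, and the use of a plain list as the dead letter queue are all assumptions:

```python
import time


class ProviderError(Exception):
    """Hypothetical error raised when a push provider call fails."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cool-down, let trial requests through (half-open state).
        return self._clock() - self._opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed trial


def deliver(notification, send, breaker, dead_letters) -> bool:
    """Send one notification; park it in the dead letter queue if it cannot be sent."""
    if not breaker.allow_request():
        dead_letters.append(notification)  # fail fast without calling the provider
        return False
    try:
        send(notification)
    except ProviderError:
        breaker.record_failure()
        dead_letters.append(notification)
        return False
    breaker.record_success()
    return True
```

A separate worker would later replay `dead_letters`, ideally with the same backoff policy used for normal retries.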

Future Improvements

1. Machine Learning Integration

Adaptive Delivery Timing:

  • Analyze user behavior patterns to determine optimal send times
  • Personalize delivery schedules based on engagement history
  • A/B testing framework for notification strategies

Implementation:

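# Sketch of ML-based send-time prediction. The helpers are passed in as
# arguments because their implementations are outside the scope of this design.
def predict_optimal_time(user_id, get_engagement_history, get_utc_offset_hours, model):
    user_features = get_engagement_history(user_id)
    utc_offset_hours = get_utc_offset_hours(user_id)
    # Assumption: the model returns the best local hour of day (0-23).
    local_hour = model.predict(user_features)
    # Convert to UTC so the scheduler works against a single clock.
    return (local_hour - utc_offset_hours) % 24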

2. Advanced Personalization

Content Personalization:

  • Dynamic message content based on user preferences
  • Sentiment analysis for tone adjustment
  • Multi-variate testing for message optimization

3. Location-Based Services

Geofencing Integration:

  • Trigger notifications based on user location
  • Proximity-based promotional campaigns
  • Location-aware content delivery
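
The core of a geofence trigger is a distance check between the user's reported position and the centre of the fence. A minimal sketch using the haversine formula, assuming circular fences and a spherical Earth; the function names are illustrative.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def inside_geofence(user_lat, user_lon, fence_lat, fence_lon, radius_m) -> bool:
    """True when the user is within radius_m metres of the fence centre."""
    return haversine_m(user_lat, user_lon, fence_lat, fence_lon) <= radius_m
```

A notification would typically fire when a user moves from outside to inside the fence, which means remembering the previous result per user rather than evaluating each location update in isolation.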

4. Performance Enhancements

Edge Computing:

  • Deploy notification services at edge locations
  • Reduce latency for global users
  • Implement intelligent request routing

Real-time Analytics:

  • Stream processing for instant feedback
  • Live dashboards for notification performance
  • Automated alerting for system anomalies

Monitoring & Observability

Key Metrics

| Category | Metric | Target SLA |
| --- | --- | --- |
| Throughput | Messages/second | > 15,000/sec |
| Latency | End-to-end delivery time | < 5 seconds |
| Reliability | Delivery success rate | > 99.9% |
| Engagement | Click-through rate | Monitor trends |
| System Health | Error rate | < 0.1% |
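
These targets translate naturally into an automated check. The sketch below compares measured values against the numeric targets from the table; the metric key names and the `sla_violations` helper are assumptions, and click-through rate is left out because it has no fixed target.

```python
import operator

# Targets copied from the Key Metrics table; the metric key names are assumptions.
TARGETS = {
    "messages_per_second": (operator.gt, 15_000),
    "delivery_latency_seconds": (operator.lt, 5.0),
    "delivery_success_rate": (operator.gt, 0.999),
    "error_rate": (operator.lt, 0.001),
}


def sla_violations(measured: dict) -> list:
    """Return the names of measured metrics that miss their target, sorted by name."""
    return sorted(
        name
        for name, (meets, target) in TARGETS.items()
        if name in measured and not meets(measured[name], target)
    )


print(sla_violations({
    "messages_per_second": 16_000,
    "delivery_latency_seconds": 7.2,
    "delivery_success_rate": 0.9995,
    "error_rate": 0.0005,
}))  # ['delivery_latency_seconds']
```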

Alerting Strategy


This design gives a push notification service a scalable foundation for 1 billion notifications per day. Kafka decouples request intake from delivery, PostgreSQL and Cassandra split transactional data from high-volume engagement data, and retries, circuit breakers, and dead letter queues contain failures in external providers. The tracking pipeline supplies the engagement data needed for analytics and personalization as the system grows.
