Vikas Kumar

Design HLD - Notification System

Requirements

Functional Requirements

  1. Support sending notifications to users.
  2. Support delivery across multiple channels (Email, SMS, Push, In-app).
  3. Support critical and promotional notification types.
  4. Support user notification preferences and opt-in/opt-out.
  5. Support scheduled notifications.
  6. Support bulk notifications targeting large user groups.
  7. Support safe retries and idempotent notification processing.
  8. Support tracking of notification delivery status.

Non-Functional Requirements

  1. Highly available and fault tolerant.
  2. Low-latency delivery for critical notifications.
  3. High throughput with large-scale fan-out.
  4. Highly scalable with increasing traffic.
  5. Durable notification processing with no message loss.
  6. Secure notification delivery and access control.
  7. Cost-efficient operation at scale.

Key Concepts You Must Know

Notification vs Delivery Attempt

A notification represents the logical intent to notify a user, while delivery attempts represent concrete, channel-specific executions. A single notification can result in multiple delivery attempts due to retries, fallbacks, or multi-channel delivery.

Critical vs Promotional Isolation

Critical notifications such as OTPs or chat messages must be processed in isolation from promotional traffic. This prevents head-of-line blocking and guarantees that spikes in bulk or campaign traffic do not impact latency-sensitive notifications.

Priority-Aware Queuing

Notifications are routed through priority-aware queues so that high-priority messages are always processed ahead of lower-priority ones. This ensures predictable latency for critical flows even under heavy system load.

Idempotent Processing

All notification operations must be idempotent to safely handle retries caused by network failures or timeouts. Repeating the same request should always result in the same final state without creating duplicate notifications.
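
As a rough illustration, idempotency can be enforced with a conditional write keyed by the idempotency key. The sketch below assumes a Redis-backed check; the key naming, TTL, and client wiring are illustrative, not part of this design:

import redis

r = redis.Redis(host="localhost", port=6379)   # assumed shared cache reachable by all API nodes
IDEMPOTENCY_TTL_SECONDS = 24 * 3600            # assumption: keys are kept for 24 hours

def create_notification(idempotency_key: str, new_notification_id: str) -> str:
    """Return the notification_id that owns this idempotency key."""
    # SET ... NX stores the mapping only if the key does not exist yet,
    # so a retried request maps back to the original notification.
    stored = r.set(f"idem:{idempotency_key}", new_notification_id,
                   nx=True, ex=IDEMPOTENCY_TTL_SECONDS)
    if stored:
        return new_notification_id   # first time: persist and enqueue the notification
    existing = r.get(f"idem:{idempotency_key}")
    return existing.decode() if existing else new_notification_id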

Safe Retries

Transient failures during delivery should trigger automatic retries using controlled retry policies such as exponential backoff. Retries must be bounded to avoid infinite loops and system overload.
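
A small sketch of a bounded exponential-backoff policy with jitter; the base delay, cap, and attempt budget are illustrative values rather than numbers from this design:

import random

BASE_DELAY_SECONDS = 2      # assumed initial delay
MAX_DELAY_SECONDS = 300     # assumed cap so delays never grow unbounded
MAX_ATTEMPTS = 5            # assumed retry budget before the message goes to the DLQ

def next_retry_delay(attempt: int) -> float | None:
    """Delay before the next retry, or None once retries are exhausted."""
    if attempt >= MAX_ATTEMPTS:
        return None   # caller moves the notification to the Dead Letter Queue
    delay = min(BASE_DELAY_SECONDS * (2 ** attempt), MAX_DELAY_SECONDS)
    # Full jitter spreads retries out so failing workers do not retry in lockstep.
    return random.uniform(0, delay)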

Scheduling vs Immediate Delivery

Immediate notifications are dispatched as soon as they are accepted by the system, while scheduled notifications are stored and triggered at a future time. Scheduling logic must be reliable and time-correct to ensure notifications are sent neither early nor late.

Bulk Fan-out Model

Bulk notifications should be expanded asynchronously into individual notification instances. Fan-out must happen outside the critical path to prevent large campaigns from overwhelming the system.

User Preferences Enforcement

Notification delivery must respect user-configured preferences such as opt-in, opt-out, preferred channels, and quiet hours. Preferences are enforced consistently across all notification types, with configurable exceptions for critical messages.
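
One detail worth making concrete is quiet hours, since the window usually crosses midnight (for example 22:00 to 08:00). A minimal check, assuming critical notifications may bypass quiet hours as described above:

from datetime import time

def in_quiet_hours(now: time, start: time, end: time) -> bool:
    """True if 'now' falls inside the quiet-hours window, including windows that wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end

def should_suppress(notification_type: str, now: time, start: time, end: time) -> bool:
    if notification_type == "critical":
        return False   # assumption: critical alerts are exempt from quiet hours
    return in_quiet_hours(now, start, end)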

Dead Letter Queue (DLQ)

Notifications that fail permanently after exhausting retries are moved to a Dead Letter Queue. The DLQ provides visibility, auditability, and a mechanism for manual inspection or reprocessing.

Durable Event Processing

Once a notification is accepted, it must be durably persisted so it is not lost due to crashes or restarts. Durability guarantees that every accepted notification is eventually processed or explicitly marked as failed.


Capacity Estimation

Key Assumptions

  • DAU (Daily Active Users): ~50 million
  • Notifications per user per day: ~5
  • Traffic mix: ~80% critical, ~20% promotional
  • Traffic pattern: Write-heavy with bursty fan-out
  • System scale: Large-scale, distributed SaaS system assumed

Notification Volume Estimation

Total notifications per day ⇒ 50M users × 5 notifications ⇒ ~250M notifications/day
Critical notifications ⇒ ~80% of 250M ≈ ~200M/day
Promotional notifications ⇒ ~20% of 250M ≈ ~50M/day

Throughput Estimation (QPS)

Average write QPS ⇒ 250M / 86,400 ⇒ ~2,900 notifications/sec
Peak write QPS ⇒ Up to ~1,000,000 notifications/sec during bulk fan-out and campaign spikes
Fan-out amplification ⇒ A single bulk request can expand into thousands to millions of notifications

Read Traffic Estimation

Status checks, analytics, dashboards ⇒ Reads assumed ~2–3× writes ⇒ Average read QPS ≈ ~6,000–9,000/sec

Metadata Size Estimation

Metadata per notification ⇒ ~1 KB (IDs, user, channel, status, retries, timestamps)
Metadata per day ⇒ 250M × 1 KB ⇒ ~250 GB/day
Monthly metadata (30 days retention) ⇒ ~7.5 TB
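
The estimates above are easy to re-derive; a quick back-of-envelope script using the same assumptions (decimal units, 1 KB = 1,000 bytes):

DAU = 50_000_000
NOTIFICATIONS_PER_USER = 5
SECONDS_PER_DAY = 86_400
METADATA_BYTES = 1_000                                      # ~1 KB per notification

total_per_day = DAU * NOTIFICATIONS_PER_USER                 # 250,000,000/day
critical_per_day = total_per_day * 0.80                      # ~200M/day
promotional_per_day = total_per_day * 0.20                   # ~50M/day

avg_write_qps = total_per_day / SECONDS_PER_DAY              # ~2,894/sec -> ~2,900/sec
avg_read_qps = (2 * avg_write_qps, 3 * avg_write_qps)        # ~5,800-8,700/sec -> ~6,000-9,000/sec rounded

metadata_per_day_gb = total_per_day * METADATA_BYTES / 1e9   # ~250 GB/day
metadata_30_days_tb = metadata_per_day_gb * 30 / 1_000       # ~7.5 TB

print(round(avg_write_qps), metadata_per_day_gb, metadata_30_days_tb)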


Core Entities

  • User: Represents a system user who receives notifications.
  • Notification: Represents the logical intent to notify a user; stores type, priority, schedule, and lifecycle state, not delivery execution.
  • Delivery Attempt: Represents a single channel-specific attempt to deliver a notification and captures retries and failures.
  • Notification Preference: Represents user-defined preferences such as opt-in/opt-out, preferred channels, and quiet hours.
  • Campaign: Represents a bulk or promotional notification request that targets a large group of users.
  • Schedule: Represents a time-based trigger that controls when a notification or campaign should be delivered.
  • Retry Task: Represents a delayed retry for a failed delivery attempt using a retry policy.
  • Dead Letter Entry: Represents a permanently failed notification that requires audit or manual intervention.

Database Design

Database Choice

  • The system uses a distributed NoSQL database (such as Cassandra or DynamoDB) to store notification metadata. This is because the system needs to handle very high write traffic, scale horizontally, and remain fast even during large notification spikes.
  • Data is partitioned by tenant and user so that notifications are evenly spread across nodes and no single partition becomes a bottleneck. Time-based fields (like creation time) are used to efficiently query recent notifications and to clean up old data.
  • A relational database may be used for tenant configuration, billing, and reporting, where strong relationships and transactional queries are more important than write throughput.

Users Table

Represents system users.

User

user_id (PK)
tenant_id
created_at
status

Used for

  • User identity
  • Tenant isolation
  • Preference lookup

Notification Table

Represents a user-visible notification.

Notification

notification_id (PK)
user_id (FK → User)
tenant_id
type (critical / promotional)
priority
status (pending / delivered / failed / expired)
scheduled_at
expiry_at
created_at

Key Points

  • One row per user notification
  • Represents intent and lifecycle
  • Used for auditing and status queries

DeliveryAttempt Table

Represents channel-level delivery execution.

DeliveryAttempt

attempt_id (PK)
notification_id (FK → Notification)
channel (email / sms / push / in-app)
status (success / failed / retrying)
retry_count
last_error
created_at

Key Points

  • Multiple attempts per notification
  • Tracks retries and failures
  • Enables per-channel isolation

NotificationPreference Table

Represents user notification preferences.

NotificationPreference

user_id (PK)
channel
enabled
quiet_hours
updated_at

Key Points

  • Source of truth for opt-in / opt-out
  • Enforced during processing

Campaign Table

Represents bulk notification requests.

Campaign

campaign_id (PK)
tenant_id
status (scheduled / active / completed / cancelled)
scheduled_at
expiry_at
created_at

Key Points

  • Used only for bulk notifications
  • Expanded asynchronously into notifications

RetryTask Table

Represents scheduled retries.

RetryTask

retry_task_id (PK)
attempt_id (FK → DeliveryAttempt)
next_retry_at
retry_policy
created_at


Key Points

  • Retries are time-based, not immediate
  • Drives retry scheduling

DeadLetter Table

Represents permanently failed notifications.

DeadLetter

notification_id
channel
failure_reason
created_at

Key Points

  • Terminal failure state
  • Used for audit and investigation

Indexing Strategy

| Access Pattern           | Index                 |
| ------------------------ | --------------------- |
| Fetch user notifications | (user_id, created_at) |
| Priority processing      | (priority, status)    |
| Retry scheduling         | (next_retry_at)       |
| Campaign expansion       | (campaign_id)         |
| Cleanup jobs             | (status, expiry_at)   |

Indexes are chosen based on actual query patterns, not theoretical normalization.

Transaction Model

  • The system avoids complex multi-table transactions. Each notification-related operation is handled as a single atomic write, which keeps the system fast and reliable.
  • To handle retries safely, the system uses idempotency keys, ensuring that the same request processed multiple times results in only one notification. Notification state moves forward in a controlled manner (for example, PENDING → DELIVERED or PENDING → FAILED) and never moves backward.

This approach keeps the system correct even when requests are retried or processed in parallel.
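
A sketch of the forward-only lifecycle guard described above; the exact set of allowed transitions is an assumption for illustration:

ALLOWED_TRANSITIONS = {
    "PENDING":   {"SCHEDULED", "DELIVERED", "FAILED", "EXPIRED", "CANCELLED"},
    "SCHEDULED": {"PENDING", "CANCELLED", "EXPIRED"},
    "DELIVERED": set(),   # terminal
    "FAILED":    set(),   # terminal
    "EXPIRED":   set(),   # terminal
    "CANCELLED": set(),   # terminal
}

def transition(current: str, requested: str) -> str:
    """Apply a state change only if it moves the lifecycle forward."""
    if requested not in ALLOWED_TRANSITIONS.get(current, set()):
        # Stale or duplicate updates (e.g. a late retry) are ignored, so
        # parallel workers can never move the state backwards.
        return current
    return requested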

Failure Handling

  • If a notification is saved successfully but delivery fails, it remains in a pending or retryable state and is retried automatically. Retry information is stored so the system can safely continue even after crashes or restarts.
  • Notifications that fail permanently are moved to a Dead Letter Queue, making failures visible and easy to investigate. Background jobs periodically scan for stuck or inconsistent records and safely recover or clean them up.

Consistency Model

  • The system uses strong consistency for critical data such as notification creation, status updates, retries, and user preferences. This ensures users do not receive duplicate or incorrect notifications.
  • For analytics and reporting, the system uses eventual consistency, since slight delays in metrics do not affect correctness. This balance allows the system to scale efficiently while keeping user-facing behavior correct.

API / Endpoints

Send Notification → POST: /notifications

Creates a new notification request.

Request

{
  "user_id": "string",
  "type": "critical | promotional",
  "channels": ["email", "sms", "push"],
  "message": {
    "title": "string",
    "body": "string"
  },
  "schedule_at": "datetime (optional)",
  "expiry_at": "datetime (optional)",
  "idempotency_key": "string"
}

Response

{
  "status": "accepted",
  "notification_id": "uuid"
}

Send Bulk Notifications

Creates a bulk notification campaign. → POST: /notifications/bulk

Request

{
  "campaign_name": "string",
  "type": "promotional",
  "target": {
    "segment_id": "string"
  },
  "channels": ["email", "push"],
  "message": {
    "title": "string",
    "body": "string"
  },
  "schedule_at": "datetime",
  "expiry_at": "datetime"
}

Response

{
  "status": "accepted",
  "campaign_id": "uuid"
}

Get Notification Status

Fetches the current status of a notification. → GET: /notifications/{notification_id}

Response

{
  "notification_id": "uuid",
  "status": "pending | delivered | failed | expired",
  "last_updated": "datetime"
}

Retry Notification (Internal / Admin)

Triggers a retry for a failed notification. → POST: /notifications/{notification_id}/retry

Response

{
  "status": "retry_scheduled"
}

Cancel Scheduled Notification

Cancels a notification that has not yet been delivered. → DELETE: /notifications/{notification_id}

Response

{
  "status": "cancelled"
}

Get User Notification Preferences

Fetches notification preferences for a user. → GET: /users/{user_id}/preferences

Response

{
  "channels": {
    "email": true,
    "sms": false,
    "push": true
  },
  "quiet_hours": {
    "start": "22:00",
    "end": "08:00"
  }
}

Update User Notification Preferences

Updates notification preferences for a user. → PUT: /users/{user_id}/preferences

Request

{
  "channels": {
    "email": true,
    "sms": false,
    "push": true
  },
  "quiet_hours": {
    "start": "22:00",
    "end": "08:00"
  }
}

Response

{
  "status": "updated"
}

List Notifications (Optional)

Fetches recent notifications for a user. → GET: /users/{user_id}/notifications?limit=20

Response

{
  "notifications": [
    {
      "notification_id": "uuid",
      "status": "delivered",
      "created_at": "datetime"
    }
  ]
}

Key API Design Notes

  • All write APIs are idempotent using idempotency_key.
  • APIs are asynchronous; delivery is not guaranteed at request time.
  • Bulk APIs only enqueue campaigns; fan-out happens asynchronously.
  • Admin and retry APIs are restricted to internal services.

System Components

1. Client (Web / Mobile / Backend Producers)

Primary Responsibilities:

  • Generates notification requests in response to user actions or system events such as login, payment, chat messages, or campaigns.
  • Attaches idempotency keys and contextual metadata (user, tenant, type, priority).
  • Does not wait for delivery completion and treats notification APIs as asynchronous.

Examples:
Web apps, Mobile apps, Order Service, Auth Service, Chat Service

Why:
Keeps product services simple and prevents notification latency from impacting core user flows.

2. API Gateway

Primary Responsibilities:

  • Acts as the secure ingress layer for all notification APIs.
  • Performs authentication, authorization, tenant validation, schema validation, and request normalization.
  • Applies per-tenant and per-client rate limits to protect downstream systems.
  • Rejects duplicate requests early using idempotency keys when possible.

Examples:
AWS API Gateway, Kong, NGINX, Envoy

Why:
Provides centralized security, traffic control, and isolation at scale.

3. Notification Service (Control Plane)

Primary Responsibilities:

  • Validates notification requests and applies business rules.
  • Classifies notifications as critical or promotional and assigns priority.
  • Fetches and enforces user preferences including opt-in, channel selection, and quiet hours.
  • Validates scheduling and expiry constraints.
  • Persists notification metadata as the source of truth.
  • Publishes notification events to the message queue for further processing.

Examples:
Spring Boot / Node.js / Go microservice

Why:
Centralizes orchestration logic while keeping the system asynchronous and scalable.

4. Message Queue / Event Bus

Primary Responsibilities:

  • Decouples notification ingestion from processing and delivery.
  • Buffers traffic spikes and absorbs bursty workloads.
  • Provides ordering guarantees where required (e.g., per user).
  • Uses separate topics or queues to isolate critical traffic from promotional traffic.
  • Ensures at-least-once delivery semantics.

Examples:
Apache Kafka, AWS SNS + SQS

Why:
Enables high-throughput, fault-tolerant, and scalable event-driven processing.

5. Scheduler Service

Primary Responsibilities:

  • Stores and manages scheduled notifications and delayed retry tasks.
  • Triggers notification events exactly at their scheduled execution time.
  • Ensures notifications are not delivered before schedule_at or after expiry_at.
  • Handles large volumes of scheduled tasks using partitioned or sharded scheduling.

Examples:
Kafka delay topics, Redis Sorted Sets, Quartz, AWS EventBridge

Why:
Provides reliable time-based execution without inefficient polling.
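
One common realization of this is a time-indexed store such as a Redis sorted set scored by execution time; a rough sketch, where the key name, batch size, and polling cadence are assumptions:

import time
import redis

r = redis.Redis()

def schedule(notification_id: str, run_at_epoch: float) -> None:
    # Score = scheduled execution time, so due items sort to the front.
    r.zadd("schedule", {notification_id: run_at_epoch})

def poll_due(publish, batch_size: int = 100) -> None:
    """Publish everything whose scheduled time has passed."""
    now = time.time()
    due = r.zrangebyscore("schedule", min=0, max=now, start=0, num=batch_size)
    for member in due:
        # ZREM succeeds for exactly one competing worker, so each scheduled
        # item is triggered once even with multiple scheduler instances.
        if r.zrem("schedule", member):
            publish(member.decode())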

6. Campaign / Fan-out Service

Primary Responsibilities:

  • Processes bulk notification requests and resolves target audiences.
  • Expands campaigns into per-user notification events asynchronously.
  • Applies batching, throttling, and backpressure to control fan-out rate.
  • Tracks campaign progress and completion state.

Examples:
Custom fan-out service + Kafka consumers, Flink/Spark for very large campaigns

Why:
Prevents large campaigns from overwhelming real-time notification flows.
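
A stripped-down sketch of batched, throttled fan-out; the batch size, rate, and the allowed_by_preferences / enqueue helpers are assumptions for illustration:

import time
from typing import Iterable

BATCH_SIZE = 1_000              # assumed per-batch size
MAX_BATCHES_PER_SECOND = 50     # assumed throttle protecting queues and providers

def fan_out(campaign_id: str, user_ids: Iterable[str],
            allowed_by_preferences, enqueue) -> None:
    """Expand a campaign into per-user notification events, batch by batch."""
    batch = []
    for user_id in user_ids:
        if not allowed_by_preferences(user_id):
            continue   # opt-out, quiet hours, frequency caps
        batch.append({"campaign_id": campaign_id, "user_id": user_id})
        if len(batch) >= BATCH_SIZE:
            enqueue(batch)
            batch = []
            time.sleep(1 / MAX_BATCHES_PER_SECOND)   # simple rate limiting
    if batch:
        enqueue(batch)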

7. Channel Workers – Email

Primary Responsibilities:

  • Consumes email notification events and formats email content.
  • Integrates with email providers and handles provider-specific constraints.
  • Manages retries, bounces, and transient failures.
  • Emits delivery results back into the system.

Examples:
Amazon SES, SendGrid, Mailgun

Why:
Email delivery requires specialized handling and independent scaling.

8. Channel Workers – SMS

Primary Responsibilities:

  • Delivers SMS notifications with low latency.
  • Handles provider throttling, regional routing, and failover.
  • Normalizes errors from different providers into a common failure model.

Examples:
Twilio, Vonage (Nexmo), AWS SNS

Why:
SMS delivery is latency-sensitive and highly provider-dependent.

9. Channel Workers – Push

Primary Responsibilities:

  • Sends push notifications to mobile and web devices.
  • Manages device tokens, expiration, and invalid token cleanup.
  • Handles platform-specific delivery semantics and retries.

Examples:
Firebase Cloud Messaging (FCM), Apple Push Notification Service (APNs)

Why:
Push platforms require tight integration with OS-level services.

10. Channel Workers – In-App

Primary Responsibilities:

  • Delivers real-time notifications to active users over persistent connections.
  • Maintains connection state and fan-out to connected clients.
  • Falls back gracefully when users are offline.

Examples:
WebSockets, Server-Sent Events (SSE), Redis Pub/Sub

Why:
Provides the lowest-latency notification path for active users.

11. Retry Service

Primary Responsibilities:

  • Tracks failed delivery attempts and retry counts.
  • Applies retry policies such as exponential backoff and maximum retry limits.
  • Schedules retries through the Scheduler Service.
  • Ensures retries are controlled and do not cause retry storms.

Examples:
Kafka retry topics, Redis delay queues, SQS with visibility timeout

Why:
Improves reliability while protecting the system under failure conditions.

12. Dead Letter Queue (DLQ)

Primary Responsibilities:

  • Stores notifications that fail permanently after all retries.
  • Captures failure context and error metadata.
  • Supports auditing, alerting, and optional manual reprocessing.

Examples:
Kafka DLQ topics, AWS SQS DLQ

Why:
Ensures failures are visible and never silently dropped.

13. Preference Service

Primary Responsibilities:

  • Stores user notification preferences and channel-level settings.
  • Provides low-latency reads for preference enforcement.
  • Acts as the single source of truth for opt-in and quiet hours.

Examples:
Microservice + Redis cache + DynamoDB/Cassandra

Why:
Preference checks are on the critical path and must be fast and consistent.
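
A minimal cache-aside read path for preferences; the key format, TTL, and the load_from_db / save_to_db helpers are assumptions:

import json
import redis

r = redis.Redis()
PREFS_TTL_SECONDS = 300   # assumed short TTL; writes also invalidate the key explicitly

def get_preferences(user_id: str, load_from_db) -> dict:
    """Cache-first lookup with database fallback."""
    cached = r.get(f"prefs:{user_id}")
    if cached is not None:
        return json.loads(cached)
    prefs = load_from_db(user_id)                        # database remains the source of truth
    r.set(f"prefs:{user_id}", json.dumps(prefs), ex=PREFS_TTL_SECONDS)
    return prefs

def update_preferences(user_id: str, prefs: dict, save_to_db) -> None:
    save_to_db(user_id, prefs)
    r.delete(f"prefs:{user_id}")   # invalidate so the next read repopulates the cache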

14. Metadata Database

Primary Responsibilities:

  • Stores notification lifecycle state, delivery attempts, retry metadata, and audit logs.
  • Supports strong consistency for state transitions.
  • Optimized for high write throughput and time-based access patterns.

Examples:
Cassandra, DynamoDB, ScyllaDB

Why:
Designed for massive scale and durability under heavy write load.

15. Cache

Primary Responsibilities:

  • Caches hot data such as preferences, idempotency keys, and rate-limit counters.
  • Reduces load on the primary database and lowers latency.

Examples:
Redis, Memcached

Why:
Improves performance and protects databases under peak load.

16. Analytics & Tracking Service

Primary Responsibilities:

  • Consumes delivery events asynchronously.
  • Generates metrics for success rate, latency, retries, and failures.
  • Supports dashboards, alerts, and reporting.

Examples:
Kafka Streams, Flink, ClickHouse, BigQuery

Why:
Separates observability from the critical delivery path.

17. Monitoring & Alerting Service

Primary Responsibilities:

  • Tracks system health, queue lag, error rates, and SLOs.
  • Triggers alerts for abnormal behavior or degradation.

Examples:
Prometheus, Grafana, Datadog

Why:
Early detection is critical in high-throughput systems.

18. Logging Service

Primary Responsibilities:

  • Aggregates logs from all services for debugging and audits.
  • Supports correlation across distributed requests.

Examples:
ELK Stack, OpenSearch

Why:
Distributed systems require centralized visibility.

19. Security & Secrets Management

Primary Responsibilities:

  • Manages encryption keys, API credentials, and sensitive configuration.
  • Enforces encryption at rest and in transit.

Examples:
AWS KMS, HashiCorp Vault, AWS Secrets Manager

Why:
Protects sensitive data and ensures compliance.


High-Level Flows

Flow 0: Default Notification Flow (Happy Path)

This is the baseline flow that everything else builds on.

  • Client sends a notification request with an idempotency key to the API Gateway.
  • API Gateway authenticates the client, validates the request, and applies rate limits.
  • Request is forwarded to the Notification Service.
  • Notification Service: Validates payload, Classifies notification type (critical / promotional), Assigns priority, Fetches and enforces user preferences, Validates scheduling and expiry
  • Notification metadata is written durably to the database.
  • Notification Service publishes an event to the appropriate queue/topic.
  • Channel Worker consumes the event and sends the notification via the provider.
  • Delivery result is recorded and emitted to analytics.

Guarantee: Notification is accepted, processed asynchronously, and delivered successfully.

Flow 1: Critical Notification (Low-Latency Path)

  • Notification is classified as critical (OTP, chat, security alert).
  • Event is published to a high-priority queue/topic.
  • Dedicated high-priority Channel Workers consume the event immediately.
  • Worker sends notification to the provider with aggressive timeouts.
  • Delivery result is recorded synchronously.

Guarantee: Sub-second p99 latency, No impact from bulk or promotional traffic

Flow 2: Promotional Notification (Best-Effort Path)

  • Notification is classified as promotional.
  • Notification Service enforces: Opt-in / opt-out, Quiet hours, Frequency caps, Expiry time
  • Event is published to a low-priority queue/topic.
  • Workers process messages opportunistically.
  • Before sending, expiry is re-checked.

Guarantee: Delivered only within validity window, Never blocks critical traffic

Flow 3: Scheduled Notification

  • Client provides schedule_at.
  • Notification Service stores the notification in scheduled state.
  • Scheduler Service tracks the schedule using a time-indexed store.
  • At trigger time, Scheduler publishes the event to the queue.
  • Normal delivery flow resumes.

Guarantee: Sent exactly at scheduled time, No early or late delivery

Flow 4: Bulk Notification / Campaign (Fan-out)

  • Client creates a bulk campaign.
  • Notification Service stores campaign metadata.
  • Campaign Service resolves target users asynchronously.
  • Campaign is expanded into per-user notifications in batches.
  • Batched events are published gradually with throttling.
  • Channel Workers deliver independently.

Guarantee: Fan-out is controlled, Bulk traffic never overloads real-time flows

Flow 5: Retry on Transient Failure

Failure Detection

  • Channel Worker calls provider.
  • Provider returns transient error: Timeout, 5xx, Rate limit, Network error

Retry Handling

  • Worker records failure and retry count.
  • Retry Service evaluates retry policy: Is error retryable? Retry count < max?
  • Retry Service computes next retry time (exponential backoff).
  • Retry is scheduled via Scheduler Service.
  • Scheduler republishes the event at retry time.
  • Worker retries delivery.

Guarantee: Safe retries, No retry storms, System remains stable under partial outages

Flow 6: Provider Failover (Multi-Vendor)

  • Channel Worker detects provider degradation: High error rate, Throttling, Timeouts.
  • Circuit breaker opens for the failing provider.
  • Traffic is shifted to a secondary provider (if configured).
  • Delivery attempts continue via backup provider.
  • Primary provider is retried after cool-down.

Guarantee: High availability despite provider outages, Graceful degradation
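
A bare-bones circuit breaker makes the failover decision concrete; the threshold and cool-down are illustrative, and real implementations typically track latency as well as errors:

import time

class CircuitBreaker:
    """Open after repeated failures; probe the provider again after a cool-down."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.consecutive_failures = 0
        self.opened_at: float | None = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                  # closed: provider is healthy
        if time.time() - self.opened_at >= self.cooldown_seconds:
            return True                  # half-open: let one probe through
        return False                     # open: route traffic to the secondary provider

    def record_success(self) -> None:
        self.consecutive_failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.time()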

Flow 7: Permanent Failure → DLQ

  • Notification exceeds maximum retry attempts OR
  • Error is classified as non-retryable (invalid number, blocked email).
  • Notification is marked as failed.
  • Payload and failure context are written to DLQ.
  • Alerts are triggered for investigation.

Guarantee: No silent drops, Full auditability

Flow 8: Idempotent Request Handling

  • Client retries request due to timeout.
  • API Gateway / Notification Service checks idempotency key.
  • Duplicate request is detected.
  • Existing notification reference is returned.

Guarantee: No duplicate notifications, Safe client retries

Flow 9: Cancellation of Scheduled Notification

  • Client requests cancellation.
  • Notification Service validates state.
  • Notification is marked cancelled.
  • Scheduler skips execution if encountered.

Guarantee: Safe cancellation before delivery

Flow 10: Expiry Enforcement

  • Notification has expiry_at.
  • Before delivery, worker checks current time.
  • If expired: Delivery is skipped, Status is marked expired

Guarantee: Promotions are never delivered late

Flow 11: Per-User Ordering (When Required)

  • Notifications are keyed by user/device.
  • Queue guarantees ordering per key.
  • Workers process in order for each user.

Guarantee: Correct ordering for chat and conversational flows
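
With Kafka, per-user ordering follows from keying messages by user ID, because all messages with the same key land on the same partition and are consumed in order. A minimal producer sketch, assuming the kafka-python client, a local broker, and a topic named notifications:

import json
from kafka import KafkaProducer   # assumption: kafka-python client

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],                     # assumed broker address
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_notification(event: dict) -> None:
    # Keying by user_id routes every event for that user to the same partition,
    # so consumers see the user's notifications in production order.
    producer.send("notifications", key=event["user_id"], value=event)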

Flow 12: Analytics & Tracking

  • Workers emit delivery events.
  • Analytics Service consumes asynchronously.
  • Metrics, dashboards, and alerts update.

Guarantee: Observability without impacting delivery latency


Deep Dives – Functional Requirements

1. Support Sending Notifications to Users

  • The system exposes asynchronous APIs that allow internal services and external clients to trigger notifications in a non-blocking manner.
  • Once a request is accepted, notification intent is durably persisted, ensuring the notification is not lost even if downstream components fail.

2. Support Delivery Across Multiple Channels

  • Notifications can be delivered through Email, SMS, Push, and In-app channels.
  • Each channel is implemented as an independent delivery pipeline with its own workers, providers, retry logic, and scaling policy, preventing failures in one channel from impacting others.

3. Support Critical and Promotional Notification Types

  • Notifications are classified at ingestion time based on type and priority.
  • Critical notifications are routed through high-priority queues and dedicated workers to guarantee low latency, while promotional notifications are routed through low-priority paths that tolerate delay and throttling.

4. Support User Notification Preferences and Opt-In/Opt-Out

  • User preferences such as channel enablement, quiet hours, and frequency limits are enforced before delivery.
  • Preferences are cached for low-latency access and treated as the source of truth, with limited and explicit overrides allowed for critical system alerts.

5. Support Scheduled Notifications

  • The system allows notifications to be scheduled for future delivery using a distributed scheduler.
  • Scheduled notifications are triggered exactly at the specified time, survive service restarts, and are validated against expiry constraints before being dispatched.

6. Support Bulk Notifications Targeting Large User Groups

  • Bulk notifications are modeled as campaigns that are expanded asynchronously into per-user notifications.
  • Fan-out is performed in batches with throttling and backpressure to protect downstream systems and preserve the performance of real-time notifications.

7. Support Safe Retries and Idempotent Processing

  • All notification operations use idempotency keys to ensure retries do not create duplicates.
  • Delivery failures are retried using controlled retry policies such as exponential backoff, with retry state persisted to survive crashes and restarts.

8. Support Tracking of Notification Delivery Status

  • Each notification and its delivery attempts are tracked through well-defined lifecycle states.
  • Delivery events are emitted asynchronously to analytics systems, enabling auditing, monitoring, and reporting without impacting delivery latency.

Deep Dives – Non-Functional Requirements

1. Highly Available and Fault Tolerant

  • The system is composed of stateless services deployed across multiple availability zones.
  • All critical state (notification metadata, retry state, schedules) is stored in replicated and durable systems.
  • Failures of individual services, nodes, or zones do not result in downtime or message loss.

2. Low-Latency Delivery for Critical Notifications

  • Critical notifications are isolated using priority-aware queues and dedicated worker pools.
  • This prevents head-of-line blocking from bulk or promotional traffic.
  • The critical delivery path minimizes synchronous work to achieve predictable sub-second p99 latency.

3. High Throughput with Large-Scale Fan-out

  • The system uses asynchronous ingestion and delivery pipelines backed by high-throughput message queues.
  • Bulk notifications are expanded and delivered in batches with controlled fan-out rates.
  • This allows the system to sustain millions of notifications per second during peak events.

4. Highly Scalable with Increasing Traffic

  • All components scale horizontally and independently.
  • API servers scale with request volume, queues scale via partitioning, and workers scale based on backlog and lag.
  • Capacity increases linearly by adding instances, without architectural changes.

5. Durable Notification Processing with No Message Loss

  • Once a notification request is accepted, it is durably persisted before processing begins.
  • At-least-once delivery guarantees ensure notifications are eventually processed even after crashes or restarts.
  • Explicit lifecycle states prevent silent drops or stuck notifications.

6. Secure Notification Delivery and Access Control

  • All APIs are authenticated and authorized at the gateway layer with tenant-level isolation.
  • Sensitive data is encrypted both in transit and at rest.
  • Access to external delivery providers is tightly controlled using scoped credentials and secret rotation.

7. Cost-Efficient Operation at Scale

  • The system avoids synchronous delivery and keeps the critical path lightweight.
  • Promotional traffic is throttled and deprioritized to reduce peak infrastructure costs.
  • Analytics and reporting are handled asynchronously, keeping delivery fast and cost-efficient.

Trade-Offs

1. At-Least-Once Delivery vs Exactly-Once Delivery

Choice: At-least-once delivery with idempotent processing.

Pros

  • Ensures no notification is ever lost.
  • Simplifies system design and improves throughput.

Cons

  • Duplicate delivery attempts are possible in failure scenarios.

Why This Works
Idempotency keys and state tracking prevent user-visible duplicates while preserving durability, which is more critical than strict exactly-once semantics.

2. Priority Isolation vs Single Unified Queue

Choice: Separate queues and workers for critical and promotional notifications.

Pros

  • Guarantees low latency for critical notifications.
  • Prevents promotional spikes from impacting OTPs or chat messages.

Cons

  • Increases operational complexity and infrastructure cost.

Why This Works
Latency guarantees for critical traffic are non-negotiable in real systems, and isolation is the simplest and most reliable way to enforce them.

3. Asynchronous Processing vs Synchronous Delivery

Choice: Asynchronous notification ingestion and delivery.

Pros

  • Enables very high throughput and resilience to downstream failures.
  • Protects clients from provider latency and outages.

Cons

  • Clients do not get immediate delivery confirmation.

Why This Works
Notifications are inherently asynchronous, and durability plus retries provide stronger guarantees than blocking APIs.

4. Fan-out at Write Time vs Fan-out at Read Time

Choice: Fan-out at write time for bulk and campaign notifications.

Pros

  • Simplifies delivery logic and tracking.
  • Allows per-user preference checks and rate limiting.

Cons

  • Higher write amplification and storage usage.

Why This Works
Write-heavy fan-out enables precise control, retries, and auditing, which are required for large-scale notification platforms.

5. Strong Consistency vs Eventual Consistency

Choice: Strong consistency for notification state, eventual consistency for analytics.

Pros

  • Prevents duplicate deliveries and inconsistent user experience.
  • Improves availability and performance for non-critical data.

Cons

  • Analytics may lag slightly behind real-time.

Why This Works
Users care about correct delivery, not real-time dashboards. Separating consistency models optimizes both correctness and scale.

6. Centralized Preference Checks vs Cached Preferences

Choice: Cache-first preference checks with database fallback.

Pros

  • Reduces latency and database load.
  • Supports real-time delivery at scale.

Cons

  • Cache invalidation adds complexity.

Why This Works
Preferences change infrequently compared to delivery volume, making caching a high-impact optimization.

7. Single Provider vs Multi-Provider Strategy

Choice: Multi-provider integration for email and SMS.

Pros

  • Improves reliability and reduces vendor lock-in.
  • Enables failover during provider outages.

Cons

  • Higher integration and operational complexity.

Why This Works
External providers are unreliable by nature; redundancy is essential for critical notifications.

8. Aggressive Retries vs Controlled Backoff

Choice: Controlled retries with exponential backoff.

Pros

  • Prevents retry storms and provider overload.
  • Improves system stability under failure.

Cons

  • Retries may introduce delivery delays.

Why This Works
Stability and provider trust are more important than aggressive retrying, especially at high scale.

9. Immediate Deletion vs Retained Delivery Logs

Choice: Retain notification logs with configurable TTL.

Pros

  • Supports auditing, debugging, and compliance.
  • Enables analytics and reporting.

Cons

  • Requires additional storage.

Why This Works
Storage is cheap compared to the cost of missing audit data in incidents or compliance scenarios.

10. Cost Optimization vs Peak Performance

Choice: Optimize cost for promotional traffic, optimize performance for critical traffic.

Pros

  • Keeps infrastructure costs predictable.
  • Protects user experience for high-priority notifications.

Cons

  • Promotional notifications may be delayed during peak load.

Why This Works
Business impact of delayed promotions is far lower than delayed critical alerts.


Frequently Asked Questions in Interviews

Q. Why do we separate critical and promotional notifications?

  • Critical notifications (OTP, security alerts, chat messages) have strict latency and reliability SLOs, while promotional notifications can tolerate delays.
  • By isolating them into separate queues, partitions, and worker pools, we prevent head-of-line blocking where a promotional spike could delay time-sensitive messages.
  • This guarantees predictable latency for critical traffic even during large campaigns.

Q. Why is at-least-once delivery preferred over exactly-once delivery?

  • Exactly-once delivery requires distributed transactions across queues, databases, and external providers, which is expensive and fragile at scale.
  • At-least-once delivery guarantees durability and availability, which are more important for notifications.
  • User-visible duplicates are avoided using idempotency keys and state checks, achieving practical correctness with far lower complexity.

Q. How do you prevent duplicate notifications during retries?

  • Each notification has a globally unique notification ID or idempotency key.
  • Before sending, workers check the persisted delivery state to ensure the notification hasn’t already been delivered.
  • Retries update state atomically, so even if the same message is processed twice, only one delivery attempt succeeds.

Q. How do you handle massive fan-out for promotional campaigns?

  • Bulk campaigns are expanded asynchronously rather than synchronously at API time.
  • The system processes recipients in batches, applies preferences and rate limits, and enqueues individual delivery tasks gradually.
  • Fan-out rate is throttled to protect downstream providers and internal infrastructure.

Q. What happens if the notification service crashes mid-processing?

  • All important state transitions are persisted before moving to the next step.
  • If a worker crashes after pulling a message but before acknowledging it, the message is re-delivered by the queue.
  • Because processing is idempotent, retries do not corrupt state or cause duplicates.

Q. How is per-user ordering guaranteed?

  • Notifications are partitioned by user ID (or user-channel key) in the message queue.
  • Consumers process messages sequentially within a partition, ensuring ordering for a given user.
  • Global ordering is intentionally not guaranteed, as it does not scale and is unnecessary.

Q. How do you handle external provider failures (SMS, Email, Push)?

  • Providers are treated as unreliable dependencies.
  • Each provider integration includes timeouts, bounded retries, and circuit breakers.
  • Failures are retried later or routed to fallback providers if configured.

Q. What if a provider is slow but not fully down?

  • Latency-based circuit breakers detect degradation even when errors are low.
  • Traffic is gradually reduced or paused to avoid queue buildup and cascading failures.
  • This protects system stability and prevents retry storms.

Q. How do you ensure users don’t receive expired promotions?

  • Promotional notifications include an explicit expiration timestamp.
  • Workers validate the expiry at delivery time and discard expired notifications immediately.
  • This ensures correctness even if notifications are delayed due to retries or backpressure.

Q. How are user preferences enforced at scale?

  • User preferences are cached in memory (e.g., Redis) for fast access.
  • The database remains the source of truth but is only consulted on cache misses or updates.
  • This allows preference checks to be performed inline without adding latency.

Q. How do you support scheduled notifications at large scale?

  • Scheduled notifications are stored in time-partitioned storage keyed by execution time.
  • A scheduler scans upcoming time windows and enqueues notifications just-in-time for delivery.
  • This avoids keeping millions of delayed messages sitting in queues.

Q. How do you prevent notification spam?

  • Rate limits are applied per user, per channel, and per tenant.
  • Promotional notifications are capped daily, while critical notifications bypass limits.
  • This protects user experience without impacting essential communication.
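
A minimal sketch of the per-user daily cap with a Redis counter; the cap value, key format, and critical bypass mirror the answer above, but the specifics are assumptions:

import redis
from datetime import date

r = redis.Redis()
DAILY_PROMO_CAP = 5   # assumed per-user cap for promotional notifications

def within_daily_cap(user_id: str, notification_type: str) -> bool:
    if notification_type == "critical":
        return True                      # critical notifications bypass the cap
    key = f"promo_count:{user_id}:{date.today().isoformat()}"
    count = r.incr(key)                  # atomic increment
    if count == 1:
        r.expire(key, 86_400)            # counter expires automatically after a day
    return count <= DAILY_PROMO_CAP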

Q. How is multi-tenancy handled?

  • Each tenant has isolated identifiers, quotas, rate limits, and metrics.
  • Traffic from one tenant cannot starve resources for others.
  • Billing and usage tracking are enforced at the tenant level.

Q. How do you monitor system health?

  • Metrics track queue depth, consumer lag, latency percentiles, retry rates, and provider errors.
  • Dashboards provide real-time visibility, and alerts trigger when SLOs are violated.
  • This allows proactive issue detection before users are impacted.

Q. How do you debug a missing or delayed notification?

  • Every notification has a traceable lifecycle with immutable logs.
  • Operators can trace a notification ID across ingestion, scheduling, retries, and delivery attempts.
  • Dead Letter Queues preserve full context for permanent failures.

Q. What are the biggest scalability bottlenecks?

  • Metadata writes, fan-out amplification, and external provider rate limits.
  • These are mitigated using partitioning, batching, caching, and backpressure.
  • Provider limits often become the true ceiling, not internal infrastructure.

Q. How does the system behave under extreme load?

  • Critical notifications continue to flow with priority.
  • Promotional traffic is throttled, delayed, or dropped first.
  • The system degrades gracefully instead of failing catastrophically.

Q. Why not make notification delivery synchronous?

  • Synchronous delivery couples system availability to external providers.
  • Any provider latency or outage would block clients and reduce availability.
  • Asynchronous processing decouples ingestion from delivery and improves resilience.

Q. How would the system change at 10× or 100× scale?

  • The architecture remains the same.
  • We increase partitions, workers, and regional deployments.
  • No redesign is required—only capacity expansion.

Q. How do you add a new notification channel (e.g., WhatsApp)?

  • Add a new channel processor and provider integration.
  • Core ingestion, scheduling, retry, and tracking logic remains unchanged.
  • This keeps the system extensible and pluggable.

Q. What guarantees does the system actually provide?

  • Near-real-time delivery for critical notifications.
  • At-least-once delivery with idempotency.
  • Per-user ordering where required.
  • No delivery after expiry for promotions.

High-Level Summary

This notification system delivers low-latency, highly reliable critical notifications while supporting large-scale promotional fan-out without interference.
It uses an asynchronous, event-driven architecture with durable queues, idempotent processing, and safe retries to prevent message loss or duplication.
Traffic isolation, rate limiting, and expiry checks ensure correctness and user experience even during spikes or provider failures.
The system scales linearly and cost-efficiently, matching real-world production notification platforms.

Feel free to ask questions or share your thoughts — happy to discuss!

