Ankit Kumar Shaw
Building a 5000+ Notifications/sec Event Pipeline with Kafka & Distributed Idempotency

The Hook: When Synchronous Becomes the Bottleneck

Picture this: Your leave approval API is taking 2.3 seconds to respond. Not because the database is slow. Not because the business logic is complex. But because it's waiting for:

  • An email to be sent to the approver
  • An SMS notification to the employee
  • An audit log to be written to a compliance database
  • A push notification to mobile devices

Each of these external calls adds 300-500ms of latency. And if one fails? The entire approval operation fails. The user sees an error. The leave request is lost. Support tickets flood in.

This was the reality when I joined the notification team. Our leave management system—serving 20,000 employees—was tightly coupled with notification delivery. Every approval action had to wait for emails, SMS, and audit logs to complete. If SendGrid was slow or Twilio was down, leave approvals ground to a halt.

The solution? Decouple notification delivery from the critical path using event-driven architecture with Apache Kafka.

This is the story of how we re-architected the notification system to handle 5000+ notifications/sec with zero message loss, 100% idempotency, and 85% reduction in third-party failure propagation—all while cutting approval API latency by 70%.


The Problem: Synchronous Coupling at Scale

Original Architecture (The Synchronous Nightmare)

User → Leave Approval API → [Business Logic]
                                    ↓
                    ┌───────────────┴───────────────┐
                    │                               │
          [Send Email] ──→ SendGrid (500ms)        │
                    │                               │
          [Send SMS] ──→ Twilio (400ms)            │
                    │                               │
          [Audit Log] ──→ Compliance DB (300ms)    │
                    │                               │
          [Push Notification] ──→ FCM (350ms)      │
                    │                               │
                    └───────────────┬───────────────┘
                                    ↓
                          Total Latency: 2.3s
                          Return Response to User

The Cascading Failures:

  1. Latency Amplification: Every notification channel adds to API response time
  2. Failure Propagation: If SendGrid is down, the entire approval fails
  3. No Retry Logic: Failed notifications are lost forever
  4. Poor User Experience: Users wait 2+ seconds for a simple approval
  5. Tight Coupling: Business logic can't evolve independently of notification logic

Production Impact:

  • 450 incidents/month from third-party notification provider failures
  • p99 latency: 2.3 seconds (unacceptable for user-facing APIs)
  • Zero fault isolation: One provider outage affects all operations

The Solution: Event-Driven Architecture with Kafka

The Core Insight

Observation: Notification delivery is not part of the core business transaction. If a leave is approved, it's approved—whether or not the email sends immediately is a separate concern.

Key Architectural Decision: Decouple notification delivery from the approval workflow using asynchronous event processing.

Target Architecture (Event-Driven)

User → Leave Approval API → [Business Logic]
                                    ↓
                    ┌───────────────┴───────────────┐
                    │  Save to Database             │
                    │  (50ms)                       │
                    └───────────────┬───────────────┘
                                    ↓
                    Publish Event to Kafka Topic
                    (5ms - async, non-blocking send)
                                    ↓
                          Return Response: 55ms
                          (User sees success immediately)

                    ┌─────────────────────────────────┐
                    │   Kafka Topic (12 partitions)   │
                    └─────────────┬───────────────────┘
                                  ↓
              ┌───────────────────┴───────────────────┐
              │                                       │
    [Consumer Group: Email]              [Consumer Group: SMS]
    [Consumer Group: Audit]              [Consumer Group: Push]
              │                                       │
              ↓                                       ↓
    Process independently with                Circuit breakers,
    retries, DLQ, idempotency                exponential backoff

Immediate Benefits:

  • 70% latency reduction: API response time drops from 2.3s → 700ms (then further optimized to 250ms with DB tuning)
  • Fault Isolation: Email provider outage doesn't affect leave approval
  • Independent Scaling: Notification consumers scale independently of API layer
  • Retry Logic: Failed notifications automatically retry with exponential backoff
  • Zero Message Loss: Dead Letter Queues (DLQ) capture failures for manual intervention

Deep Dive: Kafka Architecture Decisions

Why Kafka Over RabbitMQ or SQS?

As I researched event-streaming platforms, I evaluated three options:

| Aspect | Kafka | RabbitMQ | AWS SQS |
| --- | --- | --- | --- |
| Throughput | 1M+ msg/sec | 50K msg/sec | 120K msg/sec |
| Ordering Guarantee | Per-partition ordering | No strict ordering | FIFO queues (limited throughput) |
| Replay Capability | Yes (offset management) | No | Limited (max 14 days retention) |
| Durability | Replicated logs | Optional persistence | High (managed by AWS) |
| Consumer Groups | Native support | Manual implementation | Limited |
| Best For | Event streaming, high throughput | Task queues, RPC | Serverless, decoupled microservices |

Why We Chose Kafka:

  1. Throughput Requirements: We needed to scale from 100 → 5000+ notifications/sec
  2. Replay Capability: Critical for debugging and reprocessing failed batches
  3. Ordering Guarantees: Per-user notification ordering was essential (e.g., "leave requested" before "leave approved")
  4. Consumer Groups: Native support for multiple independent consumers (Email, SMS, Push, Audit)

Kafka Partitioning Strategy: Scaling to 5000+ Notifications/sec

The Partitioning Challenge

Key Insight: Kafka partitions are the unit of parallelism. More partitions = more concurrent consumers = higher throughput.

Design Decision: Use 12 partitions partitioned by userId hash.

Key Configuration:

// Idempotent producer with userId as partition key
config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
config.put(ProducerConfig.ACKS_CONFIG, "all"); // Wait for all replicas

kafkaTemplate.send("notification-events", 
                   event.getUserId().toString(), // Partition key
                   event);

Why Partition by userId?

  • Ordering Guarantee: All notifications for a user go to the same partition (FIFO order preserved)
  • Load Distribution: Uniform distribution across partitions (assuming balanced user activity)
  • Consumer Affinity: Each consumer processes a subset of users (better caching)

Throughput Calculation:

  • Single consumer throughput: ~500 msg/sec (bounded by external API calls)
  • 12 partitions × 500 msg/sec = 6000 msg/sec peak capacity
  • Production average: ~800 msg/sec, peak: 5000+ msg/sec
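To make the "same user, same partition" claim concrete, here is a simplified sketch of key-based partition assignment. Kafka's actual default partitioner hashes the serialized key bytes with murmur2 and masks off the sign bit; `String.hashCode()` stands in here purely for illustration, so the exact partition numbers will differ from a real cluster.

```java
// Simplified sketch of Kafka's key-based partition assignment.
// Real Kafka uses murmur2 over the serialized key; String.hashCode()
// is used here only to illustrate the idea.
public class PartitionSketch {
    static final int NUM_PARTITIONS = 12;

    // Map a userId key to one of 12 partitions. The mapping is
    // deterministic: the same key always lands on the same partition,
    // which is what preserves per-user FIFO ordering.
    static int partitionFor(String userId) {
        // Mask off the sign bit so the result is non-negative,
        // analogous to Kafka's toPositive(murmur2(keyBytes)).
        return (userId.hashCode() & 0x7fffffff) % NUM_PARTITIONS;
    }

    public static void main(String[] args) {
        String userA = "user-1042";
        // Same key twice -> same partition, so the user's
        // notifications are consumed in order.
        System.out.println(partitionFor(userA) == partitionFor(userA)); // true
    }
}
```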

The Idempotency Challenge: Ensuring "Exactly-Once" Semantics

The Duplicate Notification Problem

Scenario: Kafka delivers "at-least-once" by default. If a consumer processes a message but crashes before committing the offset, the message is redelivered.

Without Idempotency:

1. Consumer receives "Leave Approved" email notification
2. SendGrid API call succeeds (email sent)
3. Consumer crashes before committing Kafka offset
4. Message redelivered → duplicate email sent

Real-World Impact: Users receiving 3-5 duplicate emails for the same action is unacceptable.

Solution: Two-Barrier Idempotency Pattern

Architecture:

Kafka Message → [Check Redis] → [Check PostgreSQL] → [Send Notification] → [Update Both]
                     ↓                ↓
              Fast dedup          Durable dedup
              (~1ms)              (~10ms)

Core Logic:

@KafkaListener(topics = "notification-events")
public void processNotification(NotificationEvent event) {
    // Deterministic key: userId + eventType + timestamp
    String dedupeKey = generateDedupeKey(event);

    // Barrier 1: Redis (fast path - 99.9% of duplicates caught here).
    // setIfAbsent returns false if the key already exists; guard with
    // Boolean.TRUE.equals to avoid an NPE on unboxing a null result.
    Boolean acquired = redisTemplate.opsForValue()
            .setIfAbsent(dedupeKey, "PROCESSING", Duration.ofMinutes(10));
    if (!Boolean.TRUE.equals(acquired)) {
        return; // Duplicate detected
    }

    // Barrier 2: PostgreSQL (durable fallback, survives Redis eviction)
    if (notificationRepo.existsByDedupeKey(dedupeKey)) {
        return; // Duplicate detected in database
    }

    // Process the notification, then mark it completed in both barriers
    emailService.send(event);
    notificationRepo.save(new NotificationRecord(dedupeKey, event));
    redisTemplate.opsForValue().set(dedupeKey, "COMPLETED", Duration.ofHours(24));
}

Why Two Barriers?

  1. Redis (Primary): Ultra-fast deduplication (<1ms) for 99.9% of cases
  2. PostgreSQL (Fallback): Durable guarantee even if Redis is evicted or fails

Result: 100% deduplication with minimal performance overhead
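The `generateDedupeKey` helper referenced above is not shown in the snippet. A plausible minimal sketch, assuming the key is derived from the user ID, event type, and the event's creation timestamp (the field names are hypothetical; any change to this recipe changes the deduplication semantics):

```java
public class DedupeKeySketch {
    // Hypothetical sketch of generateDedupeKey(): a deterministic key
    // built from userId, eventType, and the event's creation timestamp.
    // A redelivered Kafka message carries the same three values, so it
    // produces the same key and collides in Redis/PostgreSQL.
    static String generateDedupeKey(long userId, String eventType, long eventTimestampMillis) {
        return userId + ":" + eventType + ":" + eventTimestampMillis;
    }

    public static void main(String[] args) {
        String first = generateDedupeKey(42L, "LEAVE_APPROVED", 1700000000000L);
        String redelivered = generateDedupeKey(42L, "LEAVE_APPROVED", 1700000000000L);
        // Redelivery maps to the same key, so the second attempt is dropped.
        System.out.println(first.equals(redelivered)); // true
    }
}
```

Note the timestamp must be the event's own creation time carried in the payload, not the processing time, or redeliveries would generate fresh keys and defeat the dedup.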


Circuit Breakers & Exponential Backoff: Surviving Third-Party Failures

The Third-Party Dependency Problem

Observation: External notification providers (SendGrid, Twilio, FCM) fail regularly:

  • Rate limits (429 errors)
  • Transient network issues
  • Scheduled maintenance
  • Provider outages

Before Event-Driven Architecture:

  • These failures propagated to the leave approval API
  • 450 incidents/month affecting core business flows

After Kafka + Circuit Breakers:

  • Failures isolated to notification consumers
  • Core business logic unaffected
  • ~68 incidents/month (85% reduction)
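To show how the breaker isolates a failing provider, here is a minimal hand-rolled sketch of the circuit-breaker state machine (CLOSED, OPEN, HALF_OPEN). It is illustrative only; in a Spring service this would typically come from a library such as Resilience4j rather than be written by hand.

```java
// Minimal circuit-breaker state machine: CLOSED -> OPEN after a failure
// threshold, OPEN -> HALF_OPEN after a cooldown, HALF_OPEN -> CLOSED on
// success. Sketch only; production code would use Resilience4j.
public class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtMillis = 0;
    private final int failureThreshold;
    private final long cooldownMillis;

    CircuitBreakerSketch(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    // Should the consumer attempt the external call at all?
    boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAtMillis >= cooldownMillis) {
            state = State.HALF_OPEN; // allow one probe request after cooldown
        }
        return state != State.OPEN;
    }

    void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED; // provider recovered; resume normal traffic
    }

    void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // stop hammering the failing provider
            openedAtMillis = nowMillis;
        }
    }

    public static void main(String[] args) {
        CircuitBreakerSketch cb = new CircuitBreakerSketch(3, 30_000);
        cb.recordFailure(0); cb.recordFailure(0); cb.recordFailure(0);
        System.out.println(cb.allowRequest(1_000));  // false: breaker is open
        System.out.println(cb.allowRequest(31_000)); // true: half-open probe allowed
    }
}
```

While the breaker is open, events simply stay in the Kafka topic (or are retried later), which is exactly why a SendGrid outage no longer reaches the approval API.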

Exponential Backoff: Configured Kafka consumer retry with exponential backoff (1s → 2s → 4s → 8s → 16s max).
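That schedule can be expressed as a tiny helper. This sketch assumes a base delay of 1s doubled per attempt and capped at 16s, matching the 1s → 2s → 4s → 8s → 16s sequence above:

```java
public class BackoffSketch {
    // Exponential backoff matching the schedule above:
    // 1s, 2s, 4s, 8s, then capped at 16s for every later attempt.
    static long backoffMillis(int attempt) {
        long base = 1_000L << Math.min(attempt, 4); // 1s * 2^attempt
        return Math.min(base, 16_000L);             // cap at 16s
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt <= 5; attempt++) {
            System.out.println("attempt " + attempt + " -> " + backoffMillis(attempt) + "ms");
        }
    }
}
```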

Result:

  • SendGrid outage → Circuit breaker opens → DLQ captures events → No impact to core system
  • 450 → ~68 incidents/month (85% reduction in failure propagation)

Dead Letter Queue (DLQ): Zero-Loss Failure Handling

The Problem of Permanent Failures

Scenarios:

  • Invalid phone number (SMS)
  • Malformed email address
  • User opted out of notifications
  • Provider returns 400 (non-retryable error)

Without DLQ: After N retries, message is discarded → data loss.
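Distinguishing retryable from non-retryable failures is the crux of DLQ routing. Here is a hedged sketch of one way to classify provider HTTP responses, treating 429 and 5xx as transient and all other 4xx as permanent; a real classifier would also inspect provider-specific error codes:

```java
public class FailureClassifierSketch {
    // Classify a provider HTTP status, following the scenarios above:
    // 429 (rate limit) and 5xx (provider fault) are transient -> retry;
    // other 4xx (bad address, opted-out user) will never succeed -> DLQ.
    static boolean isRetryable(int httpStatus) {
        if (httpStatus == 429) return true; // rate limited: back off and retry
        if (httpStatus >= 500) return true; // provider-side fault: retry
        return false;                       // permanent: route to notification-dlq
    }

    public static void main(String[] args) {
        System.out.println(isRetryable(429)); // true
        System.out.println(isRetryable(503)); // true
        System.out.println(isRetryable(400)); // false -> DLQ
    }
}
```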

DLQ Architecture

Kafka Topic: notification-events
        ↓
    Consumer
        ↓
    [Try Process]
        ↓
    ┌──────────┬──────────┐
    │          │          │
  Success   Retryable  Non-Retryable
    │       Failure     Failure
    │          │          │
    ✓      [Retry]   [Send to DLQ]
            (3x)          ↓
                   Kafka Topic: notification-dlq
                          ↓
                   [Manual Review Queue]
                   [Alerting/Monitoring]

Implementation:

@KafkaListener(topics = "notification-events")
public void consume(NotificationEvent event) {
    try {
        processNotification(event);
    } catch (RetryableException e) {
        throw e; // Let Kafka consumer retry
    } catch (NonRetryableException e) {
        log.error("Non-retryable error, sending to DLQ: {}", e.getMessage());
        dlqProducer.send("notification-dlq", event);
    }
}

DLQ Monitoring:

  • Grafana dashboard for DLQ message count
  • PagerDuty alert if DLQ > 100 messages
  • Daily batch job to review and reprocess DLQ messages

The Backend Engineer's Perspective: Event-Driven Patterns

As I built this system, I kept seeing parallels to distributed system patterns I'd encountered:

  1. Kafka Consumers = Microservices
    Each consumer group (Email, SMS, Push, Audit) is an independent microservice. They scale, fail, and deploy independently.

  2. Idempotency = Distributed Transactions
    The two-barrier pattern is similar to two-phase commit (2PC), but optimized for performance (fast Redis path + durable PostgreSQL fallback).

  3. Circuit Breakers = Resilience Patterns
    Same pattern I use in Spring Boot microservices with Resilience4j. The only difference? Here it's protecting Kafka consumers instead of REST APIs.

  4. DLQ = Exception Handling at Scale
    DLQ is like a try-catch block for distributed systems. Instead of rethrowing, we quarantine failures for human review.

  5. Partitioning = Database Sharding
    Kafka partitions are like database shards—both distribute load across multiple nodes while preserving per-key ordering.

The Insight: Event-driven architecture isn't fundamentally different from synchronous microservices. It's the same engineering principles applied to asynchronous, decoupled workflows.


Conclusion: From Synchronous Coupling to Scalable Event Streams

When I started this project, I thought event-driven architecture was "advanced" and maybe unnecessary. Why not just keep it simple with synchronous REST calls?

But as I profiled the system, measured the latency, and watched third-party failures bring down core business flows, the path forward became clear: services that don't need to fail together shouldn't be coupled together.

Kafka gave us that decoupling. Circuit breakers gave us resilience. Idempotency gave us correctness. And DLQs gave us zero-loss guarantees.

The result? A notification system that handles 5000+ notifications/sec, recovers from failures autonomously, and never blocks the critical path of leave approvals.

If you're building microservices and hitting synchronous coupling bottlenecks, start with one question: "Does this operation need to complete before returning a response to the user?"

If the answer is no, make it asynchronous. Your users—and your on-call engineers—will thank you.


Connect with me: If you're building event-driven systems or have questions about Kafka architecture, let's discuss! I'm always learning and happy to share insights.
