Ankit Kumar Shaw
Building a 5000+ Notifications/sec Event Pipeline with Kafka & Distributed Idempotency

The Hook: When Synchronous Becomes the Bottleneck

Picture this: Your leave approval API is taking 2.3 seconds to respond. Not because the database is slow. Not because the business logic is complex. But because it's waiting for:

  • An email to be sent to the approver
  • An SMS notification to the employee
  • An audit log to be written to a compliance database
  • A push notification to mobile devices

Each of these external calls adds 300-500ms of latency. And if one fails? The entire approval operation fails. The user sees an error. The leave request is lost. Support tickets flood in.

This was the reality when I joined the notification team. Our leave management system—serving 20,000 employees—was tightly coupled with notification delivery. Every approval action had to wait for emails, SMS, and audit logs to complete. If SendGrid was slow or Twilio was down, leave approvals ground to a halt.

The solution? Decouple notification delivery from the critical path using event-driven architecture with Apache Kafka.

This is the story of how we re-architected the notification system to handle 5000+ notifications/sec with zero message loss, 100% idempotency, and 85% reduction in third-party failure propagation—all while cutting approval API latency by 70%.


The Problem: Synchronous Coupling at Scale

Original Architecture (The Synchronous Nightmare)

User → Leave Approval API → [Business Logic]
                                    ↓
                    ┌───────────────┴───────────────┐
                    │                               │
          [Send Email] ──→ SendGrid (500ms)        │
                    │                               │
          [Send SMS] ──→ Twilio (400ms)            │
                    │                               │
          [Audit Log] ──→ Compliance DB (300ms)    │
                    │                               │
          [Push Notification] ──→ FCM (350ms)      │
                    │                               │
                    └───────────────┬───────────────┘
                                    ↓
                          Total Latency: 2.3s
                          Return Response to User

The Cascading Failures:

  1. Latency Amplification: Every notification channel adds to API response time
  2. Failure Propagation: If SendGrid is down, the entire approval fails
  3. No Retry Logic: Failed notifications are lost forever
  4. Poor User Experience: Users wait 2+ seconds for a simple approval
  5. Tight Coupling: Business logic can't evolve independently of notification logic

Production Impact:

  • 450 incidents/month from third-party notification provider failures
  • p99 latency: 2.3 seconds (unacceptable for user-facing APIs)
  • Zero fault isolation: One provider outage affects all operations

The Solution: Event-Driven Architecture with Kafka

The Core Insight

Observation: Notification delivery is not part of the core business transaction. If a leave is approved, it's approved—whether or not the email sends immediately is a separate concern.

Key Architectural Decision: Decouple notification delivery from the approval workflow using asynchronous event processing.

Target Architecture (Event-Driven)

User → Leave Approval API → [Business Logic]
                                    ↓
                    ┌───────────────┴───────────────┐
                    │  Save to Database             │
                    │  (50ms)                       │
                    └───────────────┬───────────────┘
                                    ↓
                    Publish Event to Kafka Topic
                    (5ms - async, non-blocking send)
                                    ↓
                          Return Response: 55ms
                          (User sees success immediately)

                    ┌─────────────────────────────────┐
                    │   Kafka Topic (12 partitions)   │
                    └─────────────┬───────────────────┘
                                  ↓
              ┌───────────────────┴───────────────────┐
              │                                       │
    [Consumer Group: Email]              [Consumer Group: SMS]
    [Consumer Group: Audit]              [Consumer Group: Push]
              │                                       │
              ↓                                       ↓
    Process independently with                Circuit breakers,
    retries, DLQ, idempotency                exponential backoff

Immediate Benefits:

  • 70% latency reduction: API response time drops from 2.3s → 700ms (then further optimized to 250ms with DB tuning)
  • Fault Isolation: Email provider outage doesn't affect leave approval
  • Independent Scaling: Notification consumers scale independently of API layer
  • Retry Logic: Failed notifications automatically retry with exponential backoff
  • Zero Message Loss: Dead Letter Queues (DLQ) capture failures for manual intervention

Deep Dive: Kafka Architecture Decisions

Why Kafka Over RabbitMQ or SQS?

As I researched event-streaming platforms, I evaluated three options:

| Aspect | Kafka | RabbitMQ | AWS SQS |
| --- | --- | --- | --- |
| Throughput | 1M+ msg/sec | 50K msg/sec | 120K msg/sec |
| Ordering Guarantee | Per-partition ordering | No strict ordering | FIFO queues (limited throughput) |
| Replay Capability | Yes (offset management) | No | Limited (max 14 days retention) |
| Durability | Replicated logs | Optional persistence | High (managed by AWS) |
| Consumer Groups | Native support | Manual implementation | Limited |
| Best For | Event streaming, high throughput | Task queues, RPC | Serverless, decoupled microservices |

Why We Chose Kafka:

  1. Throughput Requirements: We needed to scale from 100 → 5000+ notifications/sec
  2. Replay Capability: Critical for debugging and reprocessing failed batches
  3. Ordering Guarantees: Per-user notification ordering was essential (e.g., "leave requested" before "leave approved")
  4. Consumer Groups: Native support for multiple independent consumers (Email, SMS, Push, Audit)

Kafka Partitioning Strategy: Scaling to 5000+ Notifications/sec

The Partitioning Challenge

Key Insight: Kafka partitions are the unit of parallelism. More partitions = more concurrent consumers = higher throughput.

Design Decision: Use 12 partitions partitioned by userId hash.

Key Configuration:

// Idempotent producer with userId as partition key
config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
config.put(ProducerConfig.ACKS_CONFIG, "all"); // Wait for all replicas

kafkaTemplate.send("notification-events", 
                   event.getUserId().toString(), // Partition key
                   event);

Why Partition by userId?

  • Ordering Guarantee: All notifications for a user go to the same partition (FIFO order preserved)
  • Load Distribution: Uniform distribution across partitions (assuming balanced user activity)
  • Consumer Affinity: Each consumer processes a subset of users (better caching)

Throughput Calculation:

  • Single consumer throughput: ~500 msg/sec (bounded by external API calls)
  • 12 partitions × 500 msg/sec = 6000 msg/sec peak capacity
  • Production average: ~800 msg/sec, peak: 5000+ msg/sec
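To make the "same user, same partition" claim concrete, here is a simplified sketch of key-based partition assignment. Kafka's actual default partitioner hashes the serialized key bytes with murmur2 and masks off the sign bit; `String.hashCode()` stands in here purely for illustration, so the exact partition numbers will differ from a real cluster.

```java
// Simplified sketch of Kafka's key-based partition assignment.
// Real Kafka uses murmur2 over the serialized key; String.hashCode()
// is used here only to illustrate the idea.
public class PartitionSketch {
    static final int NUM_PARTITIONS = 12;

    // Map a userId key to one of 12 partitions. The mapping is
    // deterministic: the same key always lands on the same partition,
    // which is what preserves per-user FIFO ordering.
    static int partitionFor(String userId) {
        // Mask off the sign bit so the result is non-negative,
        // analogous to Kafka's toPositive(murmur2(keyBytes)).
        return (userId.hashCode() & 0x7fffffff) % NUM_PARTITIONS;
    }

    public static void main(String[] args) {
        String userA = "user-1042";
        // Same key twice -> same partition, so the user's
        // notifications are consumed in order.
        System.out.println(partitionFor(userA) == partitionFor(userA)); // true
    }
}
```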

The Idempotency Challenge: Ensuring "Exactly-Once" Semantics

The Duplicate Notification Problem

Scenario: Kafka delivers "at-least-once" by default. If a consumer processes a message but crashes before committing the offset, the message is redelivered.

Without Idempotency:

1. Consumer receives "Leave Approved" email notification
2. SendGrid API call succeeds (email sent)
3. Consumer crashes before committing Kafka offset
4. Message redelivered → duplicate email sent

Real-World Impact: Users receiving 3-5 duplicate emails for the same action is unacceptable.

Solution: Two-Barrier Idempotency Pattern

Architecture:

Kafka Message → [Check Redis] → [Check PostgreSQL] → [Send Notification] → [Update Both]
                     ↓                ↓
              Fast dedup          Durable dedup
              (~1ms)              (~10ms)

Core Logic:

@KafkaListener(topics = "notification-events")
public void processNotification(NotificationEvent event) {
    // Deterministic key: userId + eventType + timestamp
    String dedupeKey = generateDedupeKey(event);

    // Barrier 1: Redis (fast path - 99.9% of duplicates caught here).
    // setIfAbsent returns false if the key already exists; guard with
    // Boolean.TRUE.equals to avoid an NPE on unboxing a null result.
    Boolean acquired = redisTemplate.opsForValue()
            .setIfAbsent(dedupeKey, "PROCESSING", Duration.ofMinutes(10));
    if (!Boolean.TRUE.equals(acquired)) {
        return; // Duplicate detected
    }

    // Barrier 2: PostgreSQL (durable fallback, survives Redis eviction)
    if (notificationRepo.existsByDedupeKey(dedupeKey)) {
        return; // Duplicate detected in database
    }

    // Process the notification, then mark it completed in both barriers
    emailService.send(event);
    notificationRepo.save(new NotificationRecord(dedupeKey, event));
    redisTemplate.opsForValue().set(dedupeKey, "COMPLETED", Duration.ofHours(24));
}

Why Two Barriers?

  1. Redis (Primary): Ultra-fast deduplication (<1ms) for 99.9% of cases
  2. PostgreSQL (Fallback): Durable guarantee even if Redis is evicted or fails

Result: 100% deduplication with minimal performance overhead
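The `generateDedupeKey` helper referenced above is not shown in the snippet. A plausible minimal sketch, assuming the key is derived from the user ID, event type, and the event's creation timestamp (the field names are hypothetical; any change to this recipe changes the deduplication semantics):

```java
public class DedupeKeySketch {
    // Hypothetical sketch of generateDedupeKey(): a deterministic key
    // built from userId, eventType, and the event's creation timestamp.
    // A redelivered Kafka message carries the same three values, so it
    // produces the same key and collides in Redis/PostgreSQL.
    static String generateDedupeKey(long userId, String eventType, long eventTimestampMillis) {
        return userId + ":" + eventType + ":" + eventTimestampMillis;
    }

    public static void main(String[] args) {
        String first = generateDedupeKey(42L, "LEAVE_APPROVED", 1700000000000L);
        String redelivered = generateDedupeKey(42L, "LEAVE_APPROVED", 1700000000000L);
        // Redelivery maps to the same key, so the second attempt is dropped.
        System.out.println(first.equals(redelivered)); // true
    }
}
```

Note the timestamp must be the event's own creation time carried in the payload, not the processing time, or redeliveries would generate fresh keys and defeat the dedup.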


Circuit Breakers & Exponential Backoff: Surviving Third-Party Failures

The Third-Party Dependency Problem

Observation: External notification providers (SendGrid, Twilio, FCM) fail regularly:

  • Rate limits (429 errors)
  • Transient network issues
  • Scheduled maintenance
  • Provider outages

Before Event-Driven Architecture:

  • These failures propagated to the leave approval API
  • 450 incidents/month affecting core business flows

After Kafka + Circuit Breakers:

  • Failures isolated to notification consumers
  • Core business logic unaffected
  • ~68 incidents/month (85% reduction)
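To show how the breaker isolates a failing provider, here is a minimal hand-rolled sketch of the circuit-breaker state machine (CLOSED, OPEN, HALF_OPEN). It is illustrative only; in a Spring service this would typically come from a library such as Resilience4j rather than be written by hand.

```java
// Minimal circuit-breaker state machine: CLOSED -> OPEN after a failure
// threshold, OPEN -> HALF_OPEN after a cooldown, HALF_OPEN -> CLOSED on
// success. Sketch only; production code would use Resilience4j.
public class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAtMillis = 0;
    private final int failureThreshold;
    private final long cooldownMillis;

    CircuitBreakerSketch(int failureThreshold, long cooldownMillis) {
        this.failureThreshold = failureThreshold;
        this.cooldownMillis = cooldownMillis;
    }

    // Should the consumer attempt the external call at all?
    boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAtMillis >= cooldownMillis) {
            state = State.HALF_OPEN; // allow one probe request after cooldown
        }
        return state != State.OPEN;
    }

    void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED; // provider recovered; resume normal traffic
    }

    void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // stop hammering the failing provider
            openedAtMillis = nowMillis;
        }
    }

    public static void main(String[] args) {
        CircuitBreakerSketch cb = new CircuitBreakerSketch(3, 30_000);
        cb.recordFailure(0); cb.recordFailure(0); cb.recordFailure(0);
        System.out.println(cb.allowRequest(1_000));  // false: breaker is open
        System.out.println(cb.allowRequest(31_000)); // true: half-open probe allowed
    }
}
```

While the breaker is open, events simply stay in the Kafka topic (or are retried later), which is exactly why a SendGrid outage no longer reaches the approval API.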

Exponential Backoff: Configured Kafka consumer retry with exponential backoff (1s → 2s → 4s → 8s → 16s max).
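That schedule can be expressed as a tiny helper. This sketch assumes a base delay of 1s doubled per attempt and capped at 16s, matching the 1s → 2s → 4s → 8s → 16s sequence above:

```java
public class BackoffSketch {
    // Exponential backoff matching the schedule above:
    // 1s, 2s, 4s, 8s, then capped at 16s for every later attempt.
    static long backoffMillis(int attempt) {
        long base = 1_000L << Math.min(attempt, 4); // 1s * 2^attempt
        return Math.min(base, 16_000L);             // cap at 16s
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt <= 5; attempt++) {
            System.out.println("attempt " + attempt + " -> " + backoffMillis(attempt) + "ms");
        }
    }
}
```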

Result:

  • SendGrid outage → Circuit breaker opens → DLQ captures events → No impact to core system
  • 450 → ~68 incidents/month (85% reduction in failure propagation)

Dead Letter Queue (DLQ): Zero-Loss Failure Handling

The Problem of Permanent Failures

Scenarios:

  • Invalid phone number (SMS)
  • Malformed email address
  • User opted out of notifications
  • Provider returns 400 (non-retryable error)

Without DLQ: After N retries, message is discarded → data loss.
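Distinguishing retryable from non-retryable failures is the crux of DLQ routing. Here is a hedged sketch of one way to classify provider HTTP responses, treating 429 and 5xx as transient and all other 4xx as permanent; a real classifier would also inspect provider-specific error codes:

```java
public class FailureClassifierSketch {
    // Classify a provider HTTP status, following the scenarios above:
    // 429 (rate limit) and 5xx (provider fault) are transient -> retry;
    // other 4xx (bad address, opted-out user) will never succeed -> DLQ.
    static boolean isRetryable(int httpStatus) {
        if (httpStatus == 429) return true; // rate limited: back off and retry
        if (httpStatus >= 500) return true; // provider-side fault: retry
        return false;                       // permanent: route to notification-dlq
    }

    public static void main(String[] args) {
        System.out.println(isRetryable(429)); // true
        System.out.println(isRetryable(503)); // true
        System.out.println(isRetryable(400)); // false -> DLQ
    }
}
```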

DLQ Architecture

Kafka Topic: notification-events
        ↓
    Consumer
        ↓
    [Try Process]
        ↓
    ┌──────────┬──────────┐
    │          │          │
  Success   Retryable  Non-Retryable
    │       Failure     Failure
    │          │          │
    ✓      [Retry]   [Send to DLQ]
            (3x)          ↓
                   Kafka Topic: notification-dlq
                          ↓
                   [Manual Review Queue]
                   [Alerting/Monitoring]

Implementation:

@KafkaListener(topics = "notification-events")
public void consume(NotificationEvent event) {
    try {
        processNotification(event);
    } catch (RetryableException e) {
        throw e; // Let Kafka consumer retry
    } catch (NonRetryableException e) {
        log.error("Non-retryable error, sending to DLQ: {}", e.getMessage());
        dlqProducer.send("notification-dlq", event);
    }
}

DLQ Monitoring:

  • Grafana dashboard for DLQ message count
  • PagerDuty alert if DLQ > 100 messages
  • Daily batch job to review and reprocess DLQ messages

The Backend Engineer's Perspective: Event-Driven Patterns

As I built this system, I kept seeing parallels to distributed system patterns I'd encountered:

  1. Kafka Consumers = Microservices
    Each consumer group (Email, SMS, Push, Audit) is an independent microservice. They scale, fail, and deploy independently.

  2. Idempotency = Distributed Transactions
    The two-barrier pattern is similar to two-phase commit (2PC), but optimized for performance (fast Redis path + durable PostgreSQL fallback).

  3. Circuit Breakers = Resilience Patterns
    Same pattern I use in Spring Boot microservices with Resilience4j. The only difference? Here it's protecting Kafka consumers instead of REST APIs.

  4. DLQ = Exception Handling at Scale
    DLQ is like a try-catch block for distributed systems. Instead of rethrowing, we quarantine failures for human review.

  5. Partitioning = Database Sharding
    Kafka partitions are like database shards—both distribute load across multiple nodes while preserving per-key ordering.

The Insight: Event-driven architecture isn't fundamentally different from synchronous microservices. It's the same engineering principles applied to asynchronous, decoupled workflows.


Conclusion: From Synchronous Coupling to Scalable Event Streams

When I started this project, I thought event-driven architecture was "advanced" and maybe unnecessary. Why not just keep it simple with synchronous REST calls?

But as I profiled the system, measured the latency, and watched third-party failures bring down core business flows, the path forward became clear: services that don't need to fail together shouldn't be coupled together.

Kafka gave us that decoupling. Circuit breakers gave us resilience. Idempotency gave us correctness. And DLQs gave us zero-loss guarantees.

The result? A notification system that handles 5000+ notifications/sec, recovers from failures autonomously, and never blocks the critical path of leave approvals.

If you're building microservices and hitting synchronous coupling bottlenecks, start with one question: "Does this operation need to complete before returning a response to the user?"

If the answer is no, make it asynchronous. Your users—and your on-call engineers—will thank you.


Connect with me: If you're building event-driven systems or have questions about Kafka architecture, let's discuss! I'm always learning and happy to share insights.
