The Hook: When Synchronous Becomes the Bottleneck
Picture this: Your leave approval API is taking 2.3 seconds to respond. Not because the database is slow. Not because the business logic is complex. But because it's waiting for:
- An email to be sent to the approver
- An SMS notification to the employee
- An audit log to be written to a compliance database
- A push notification to mobile devices
Each of these external calls adds 300-500ms of latency. And if one fails? The entire approval operation fails. The user sees an error. The leave request is lost. Support tickets flood in.
This was the reality when I joined the notification team. Our leave management system—serving 20,000 employees—was tightly coupled with notification delivery. Every approval action had to wait for emails, SMS, and audit logs to complete. If SendGrid was slow or Twilio was down, leave approvals ground to a halt.
The solution? Decouple notification delivery from the critical path using event-driven architecture with Apache Kafka.
This is the story of how we re-architected the notification system to handle 5000+ notifications/sec with zero message loss, 100% idempotency, and 85% reduction in third-party failure propagation—all while cutting approval API latency by 70%.
The Problem: Synchronous Coupling at Scale
Original Architecture (The Synchronous Nightmare)
User → Leave Approval API → [Business Logic]
↓
┌───────────────┴───────────────┐
│ │
[Send Email] ──→ SendGrid (500ms) │
│ │
[Send SMS] ──→ Twilio (400ms) │
│ │
[Audit Log] ──→ Compliance DB (300ms) │
│ │
[Push Notification] ──→ FCM (350ms) │
│ │
└───────────────┬───────────────┘
↓
Total Latency: 2.3s
Return Response to User
The Cascading Failures:
- Latency Amplification: Every notification channel adds to API response time
- Failure Propagation: If SendGrid is down, the entire approval fails
- No Retry Logic: Failed notifications are lost forever
- Poor User Experience: Users wait 2+ seconds for a simple approval
- Tight Coupling: Business logic can't evolve independently of notification logic
Production Impact:
- 450 incidents/month from third-party notification provider failures
- p99 latency: 2.3 seconds (unacceptable for user-facing APIs)
- Zero fault isolation: One provider outage affects all operations
The Solution: Event-Driven Architecture with Kafka
The Core Insight
Observation: Notification delivery is not part of the core business transaction. If a leave is approved, it's approved—whether or not the email sends immediately is a separate concern.
Key Architectural Decision: Decouple notification delivery from the approval workflow using asynchronous event processing.
Target Architecture (Event-Driven)
User → Leave Approval API → [Business Logic]
↓
┌───────────────┴───────────────┐
│ Save to Database │
│ (50ms) │
└───────────────┬───────────────┘
↓
Publish Event to Kafka Topic
(5ms - async, fire-and-forget)
↓
Return Response: 55ms
(User sees success immediately)
┌─────────────────────────────────┐
│ Kafka Topic (12 partitions) │
└─────────────┬───────────────────┘
↓
┌───────────────────┴───────────────────┐
│ │
[Consumer Group: Email] [Consumer Group: SMS]
[Consumer Group: Audit] [Consumer Group: Push]
│ │
↓ ↓
Each consumer group processes independently, with retries, exponential backoff, circuit breakers, idempotency checks, and a DLQ
Immediate Benefits:
- 70% latency reduction: API response time drops from 2.3s → 700ms (then further optimized to 250ms with DB tuning)
- Fault Isolation: Email provider outage doesn't affect leave approval
- Independent Scaling: Notification consumers scale independently of API layer
- Retry Logic: Failed notifications automatically retry with exponential backoff
- Zero Message Loss: Dead Letter Queues (DLQ) capture failures for manual intervention
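The decoupling idea behind these benefits can be sketched with an in-memory queue standing in for the Kafka topic (class and method names here are illustrative, not from our codebase): the request handler persists state, enqueues the event, and returns without waiting for any provider.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncPublishSketch {
    // Stand-in for the Kafka topic: the handler enqueues, consumers drain later.
    private final BlockingQueue<String> topic = new LinkedBlockingQueue<>();
    // Stand-in for the leave database.
    private final List<String> approvedLeaves = new CopyOnWriteArrayList<>();

    // Critical path: persist, publish, return. No external provider calls in-line.
    public String approveLeave(String leaveId) {
        approvedLeaves.add(leaveId);              // "save to database" (~50ms in prod)
        topic.offer("LEAVE_APPROVED:" + leaveId); // async publish (~5ms in prod)
        return "APPROVED";                        // user sees success immediately
    }

    // A notification consumer drains events on its own schedule.
    public String consumeOne() {
        try {
            return topic.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

The key property: `approveLeave` returns as soon as the event is enqueued, so a slow or failing consumer never extends the API's response time.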
Deep Dive: Kafka Architecture Decisions
Why Kafka Over RabbitMQ or SQS?
As I researched event-streaming platforms, I evaluated three options:
| Aspect | Kafka | RabbitMQ | AWS SQS |
|---|---|---|---|
| Throughput | 1M+ msg/sec | 50K msg/sec | 120K msg/sec |
| Ordering Guarantee | Per-partition ordering | No strict ordering | FIFO queues (limited throughput) |
| Replay Capability | Yes (offset management) | No | Limited (max 14 days retention) |
| Durability | Replicated logs | Optional persistence | High (managed by AWS) |
| Consumer Groups | Native support | Manual implementation | Limited |
| Best For | Event streaming, high throughput | Task queues, RPC | Serverless, decoupled microservices |
Why We Chose Kafka:
- Throughput Requirements: We needed to scale from 100 → 5000+ notifications/sec
- Replay Capability: Critical for debugging and reprocessing failed batches
- Ordering Guarantees: Per-user notification ordering was essential (e.g., "leave requested" before "leave approved")
- Consumer Groups: Native support for multiple independent consumers (Email, SMS, Push, Audit)
Kafka Partitioning Strategy: Scaling to 5000+ Notifications/sec
The Partitioning Challenge
Key Insight: Kafka partitions are the unit of parallelism. More partitions = more concurrent consumers = higher throughput.
Design Decision: Use 12 partitions partitioned by userId hash.
Key Configuration:
// Producer configuration (Spring Kafka): idempotent writes, full-ISR acks
config.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
config.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas
// Publish with userId as the key so all of a user's events land on one partition
kafkaTemplate.send("notification-events",
    event.getUserId().toString(), // partition key
    event);
Why Partition by userId?
- Ordering Guarantee: All notifications for a user go to the same partition (FIFO order preserved)
- Load Distribution: Uniform distribution across partitions (assuming balanced user activity)
- Consumer Affinity: Each consumer processes a subset of users (better caching)
Throughput Calculation:
- Single consumer throughput: ~500 msg/sec (bounded by external API calls)
- 12 partitions × 500 msg/sec = 6000 msg/sec peak capacity
- Production average: ~800 msg/sec, peak: 5000+ msg/sec
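The property that makes this work is that keyed messages are assigned deterministically: Kafka's default partitioner hashes the serialized key (with murmur2) modulo the partition count. A simplified sketch of that mapping, using `String.hashCode` in place of murmur2:

```java
public class PartitionSketch {
    static final int PARTITIONS = 12;

    // Simplified stand-in for Kafka's key hashing (Kafka actually applies
    // murmur2 to the serialized key bytes, masked to non-negative).
    public static int partitionFor(String userId) {
        return (userId.hashCode() & 0x7fffffff) % PARTITIONS;
    }

    public static void main(String[] args) {
        // Same user always maps to the same partition, preserving per-user order.
        System.out.println(partitionFor("user-1001") == partitionFor("user-1001")); // true
        System.out.println(partitionFor("user-1001"));
    }
}
```

Because the mapping is a pure function of the key, per-user FIFO ordering holds no matter how many producers are writing concurrently.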
The Idempotency Challenge: Ensuring "Exactly-Once" Semantics
The Duplicate Notification Problem
Scenario: Kafka delivers "at-least-once" by default. If a consumer processes a message but crashes before committing the offset, the message is redelivered.
Without Idempotency:
1. Consumer receives "Leave Approved" email notification
2. SendGrid API call succeeds (email sent)
3. Consumer crashes before committing Kafka offset
4. Message redelivered → duplicate email sent
Real-World Impact: Users receiving 3-5 duplicate emails for the same action is unacceptable.
Solution: Two-Barrier Idempotency Pattern
Architecture:
Kafka Message → [Check Redis] → [Check PostgreSQL] → [Send Notification] → [Update Both]
↓ ↓
Fast dedup Durable dedup
(~1ms) (~10ms)
Core Logic:
@KafkaListener(topics = "notification-events")
public void processNotification(NotificationEvent event) {
    String dedupeKey = generateDedupeKey(event); // userId + eventType + timestamp

    // Barrier 1: Redis (fast path - catches ~99.9% of duplicates in <1ms).
    // setIfAbsent returns a nullable Boolean, so compare explicitly.
    if (!Boolean.TRUE.equals(redisTemplate.opsForValue()
            .setIfAbsent(dedupeKey, "PROCESSING", Duration.ofMinutes(10)))) {
        return; // duplicate detected
    }

    // Barrier 2: PostgreSQL (durable fallback, survives Redis eviction)
    if (notificationRepo.existsByDedupeKey(dedupeKey)) {
        return; // duplicate detected in database
    }

    // Process once, then mark as completed in both barriers
    emailService.send(event);
    notificationRepo.save(new NotificationRecord(dedupeKey, event));
    redisTemplate.opsForValue().set(dedupeKey, "COMPLETED", Duration.ofHours(24));
}
Why Two Barriers?
- Redis (Primary): Ultra-fast deduplication (<1ms) for 99.9% of cases
- PostgreSQL (Fallback): Durable guarantee even if Redis is evicted or fails
Result: 100% deduplication with minimal performance overhead
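The two-barrier flow can be simulated end to end with in-memory stand-ins (one concurrent set for Redis, another for the PostgreSQL dedupe table); the class and key names here are illustrative, not from the production code:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class TwoBarrierDedupe {
    // Barrier 1: fast cache (stand-in for Redis SETNX with a TTL).
    private final Set<String> cache = ConcurrentHashMap.newKeySet();
    // Barrier 2: durable store (stand-in for the PostgreSQL dedupe table).
    private final Set<String> durable = ConcurrentHashMap.newKeySet();
    private final AtomicInteger sends = new AtomicInteger();

    // Returns true only the first time a dedupe key is seen.
    public boolean process(String dedupeKey) {
        if (!cache.add(dedupeKey)) return false;   // duplicate caught in fast path
        if (!durable.add(dedupeKey)) return false; // duplicate caught in durable path
        sends.incrementAndGet();                   // "send the notification" exactly once
        return true;
    }

    // Simulates Redis eviction/restart wiping the fast barrier.
    public void evictCache() { cache.clear(); }

    public int sentCount() { return sends.get(); }
}
```

Note what happens when the fast barrier is wiped: a redelivered event slips past the cache but is still rejected by the durable store, which is exactly why the second barrier exists.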
Circuit Breakers & Exponential Backoff: Surviving Third-Party Failures
The Third-Party Dependency Problem
Observation: External notification providers (SendGrid, Twilio, FCM) fail regularly:
- Rate limits (429 errors)
- Transient network issues
- Scheduled maintenance
- Provider outages
Before Event-Driven Architecture:
- These failures propagated to the leave approval API
- 450 incidents/month affecting core business flows
After Kafka + Circuit Breakers:
- Failures isolated to notification consumers
- Core business logic unaffected
- ~68 incidents/month (85% reduction)
Exponential Backoff: Configured Kafka consumer retry with exponential backoff (1s → 2s → 4s → 8s → 16s max).
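In Spring Kafka this schedule is typically wired via an `ExponentialBackOff` on the container's error handler; the delay math itself is just the base interval doubled per attempt, capped at a maximum. A minimal sketch:

```java
public class BackoffSketch {
    // Delay for the given attempt (0-based): base * 2^attempt, capped at maxMillis.
    public static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis << Math.min(attempt, 30); // clamp shift to avoid overflow
        return Math.min(delay, maxMillis);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println(delayMillis(attempt, 1000, 16000));
        }
        // prints 1000, 2000, 4000, 8000, 16000, 16000
    }
}
```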
Result:
- SendGrid outage → Circuit breaker opens → DLQ captures events → No impact to core system
- 450 → ~68 incidents/month (85% reduction in failure propagation)
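The circuit-breaker behavior described above can be reduced to a small state machine. This is a deliberately simplified sketch (a production library like Resilience4j adds a HALF_OPEN state and timed recovery); names are illustrative:

```java
import java.util.function.BooleanSupplier;

public class CircuitBreakerSketch {
    public enum State { CLOSED, OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private final int failureThreshold;

    public CircuitBreakerSketch(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    // Runs the provider call unless the breaker is open; returns call success.
    public boolean call(BooleanSupplier provider) {
        if (state == State.OPEN) return false; // fail fast, don't hit the provider
        boolean ok = provider.getAsBoolean();
        if (ok) {
            consecutiveFailures = 0;
        } else if (++consecutiveFailures >= failureThreshold) {
            state = State.OPEN; // trip: subsequent calls are rejected immediately
        }
        return ok;
    }

    public State state() { return state; }
}
```

Once tripped, the consumer stops hammering a failing provider; events keep accumulating safely in Kafka (or land in the DLQ) until the breaker is reset.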
Dead Letter Queue (DLQ): Zero-Loss Failure Handling
The Problem of Permanent Failures
Scenarios:
- Invalid phone number (SMS)
- Malformed email address
- User opted out of notifications
- Provider returns 400 (non-retryable error)
Without DLQ: After N retries, message is discarded → data loss.
DLQ Architecture
Kafka Topic: notification-events
↓
Consumer
↓
[Try Process]
↓
┌──────────┬──────────┐
│ │ │
Success Retryable Non-Retryable
│ Failure Failure
│ │ │
✓ [Retry] [Send to DLQ]
(3x) ↓
Kafka Topic: notification-dlq
↓
[Manual Review Queue]
[Alerting/Monitoring]
Implementation:
@KafkaListener(topics = "notification-events")
public void consume(NotificationEvent event) {
    try {
        processNotification(event);
    } catch (RetryableException e) {
        throw e; // rethrow so the Kafka error handler retries with backoff
    } catch (NonRetryableException e) {
        log.error("Non-retryable error, sending to DLQ: {}", e.getMessage());
        dlqProducer.send("notification-dlq", event);
    }
}
DLQ Monitoring:
- Grafana dashboard for DLQ message count
- PagerDuty alert if DLQ > 100 messages
- Daily batch job to review and reprocess DLQ messages
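The routing logic in the diagram (retry bounded times, then quarantine; skip retries entirely for non-retryable errors) can be sketched standalone, with a list standing in for the DLQ topic. Names are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

public class DlqSketch {
    public static class RetryableException extends RuntimeException {}
    public static class NonRetryableException extends RuntimeException {}

    public final Deque<String> dlq = new ArrayDeque<>(); // stand-in for notification-dlq
    private static final int MAX_RETRIES = 3;

    // Retry retryable failures up to MAX_RETRIES, quarantine everything else.
    public void consume(String event, Consumer<String> handler) {
        for (int attempt = 1; ; attempt++) {
            try {
                handler.accept(event);
                return; // success
            } catch (RetryableException e) {
                if (attempt >= MAX_RETRIES) { dlq.add(event); return; } // exhausted
            } catch (NonRetryableException e) {
                dlq.add(event); // no point retrying a 400-class error
                return;
            }
        }
    }
}
```

Either way the event survives: it is delivered, or it lands in the DLQ for the daily review job, never silently dropped.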
The Backend Engineer's Perspective: Event-Driven Patterns
As I built this system, I kept seeing parallels to distributed system patterns I'd encountered:
Kafka Consumers = Microservices
Each consumer group (Email, SMS, Push, Audit) is an independent microservice. They scale, fail, and deploy independently.
Idempotency = Distributed Transactions
The two-barrier pattern is similar to two-phase commit (2PC), but optimized for performance (fast Redis path + durable PostgreSQL fallback).
Circuit Breakers = Resilience Patterns
Same pattern I use in Spring Boot microservices with Resilience4j. The only difference? Here it's protecting Kafka consumers instead of REST APIs.
DLQ = Exception Handling at Scale
A DLQ is like a try-catch block for distributed systems. Instead of rethrowing, we quarantine failures for human review.
Partitioning = Database Sharding
Kafka partitions are like database shards: both distribute load across multiple nodes while preserving per-key ordering.
The Insight: Event-driven architecture isn't fundamentally different from synchronous microservices. It's the same engineering principles applied to asynchronous, decoupled workflows.
Conclusion: From Synchronous Coupling to Scalable Event Streams
When I started this project, I thought event-driven architecture was "advanced" and maybe unnecessary. Why not just keep it simple with synchronous REST calls?
But as I profiled the system, measured the latency, and watched third-party failures bring down core business flows, the path forward became clear: services that don't need to fail together shouldn't be coupled together.
Kafka gave us that decoupling. Circuit breakers gave us resilience. Idempotency gave us correctness. And DLQs gave us zero-loss guarantees.
The result? A notification system that handles 5000+ notifications/sec, recovers from failures autonomously, and never blocks the critical path of leave approvals.
If you're building microservices and hitting synchronous coupling bottlenecks, start with one question: "Does this operation need to complete before returning a response to the user?"
If the answer is no, make it asynchronous. Your users—and your on-call engineers—will thank you.
Connect with me: If you're building event-driven systems or have questions about Kafka architecture, let's discuss! I'm always learning and happy to share insights.