What If Your Database Goes Down? REST vs Kafka Under Fire
Companies like Uber, Netflix, and Airbnb have migrated from traditional REST APIs to event-driven architectures. This shift isn't driven by trends, but by fundamental resilience requirements. To quantify the difference, I built a production-grade chaos-engineering proof of concept that simulates real-world database failures under sustained load.
The Hypothesis
Traditional synchronous REST APIs create tight coupling between API servers and databases. When the database fails, the API fails. Event-driven architectures using Kafka decouple producers from consumers through message buffering, theoretically providing resilience during infrastructure failures.
We designed an experiment to measure this resilience difference empirically.
Experiment Design
Scenario
We simulated Uber's real-time driver location tracking system, where high-frequency updates must be reliably persisted to PostgreSQL.
Infrastructure
- Load Pattern: Constant 50 requests/second
- Failure Injection: PostgreSQL crash for 120 seconds (t=90s to t=210s)
- Duration: 5 minutes total (300 seconds)
- Load Testing: k6 with automated orchestration
- Monitoring: Prometheus + Grafana for real-time metrics collection
Architectures Tested
Architecture A: Synchronous REST
- Direct HTTP POST to REST API
- Immediate database INSERT
- Response after database confirmation
Architecture B: Asynchronous Kafka
- HTTP POST to Producer API
- Event published to Kafka topic
- Consumer processes events with circuit breaker pattern
- Asynchronous database INSERT
Both architectures handled identical traffic patterns and experienced the same database failure window.
Results
REST API Performance
Metric            Value
Total Requests    30,002
Successful        15,001
Failed            15,001
Error Rate        50.00%
P50 Latency       3.60ms
P95 Latency       15.55ms
Average Latency   11.24ms
Failure Characteristics:
- Immediate error propagation to clients
- No request buffering capability
- Manual intervention required for recovery
- 50% of requests returned HTTP 500 errors
- Complete service degradation during database outage
Kafka Architecture Performance
Metric                           Value
Total Requests                   15,000
Successful                       15,000
Failed                           0
Error Rate                       0.00%
P50 Latency                      3.47ms
P95 Latency                      12.10ms
Average Latency                  6.77ms
Avg Latency Improvement vs REST  39.77%
Failure Characteristics:
- Zero client-facing errors
- Automatic request buffering in Kafka topics
- Circuit breaker pattern enabled graceful degradation
- Automatic recovery when database restored
- Transparent failure handling from client perspective
Architecture Analysis
REST: Synchronous Coupling
Client → REST API → PostgreSQL → Response
                        ↓ (if DB fails)
                  HTTP 500 Error
Failure Mode:
When PostgreSQL becomes unavailable, the REST API has no buffering mechanism. Each incoming request attempts a database connection, fails after timeout (2000ms), and returns an error to the client. This creates a cascading failure pattern where API health is directly tied to database availability.
Characteristics:
- Strong consistency guarantees
- Immediate failure feedback
- No request buffering
- Tight coupling between components
- Simple operational model
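This coupling can be sketched as a minimal write handler. The handler, table, and field names below are illustrative, not taken from the repository; `db` stands in for a pg-style connection pool:

```javascript
// Architecture A: the synchronous write path. There is no buffer between
// the request and the database, so a DB failure becomes a client error.
async function handleLocationUpdate(db, update) {
  try {
    await db.query(
      'INSERT INTO driver_locations (driver_id, lat, lon) VALUES ($1, $2, $3)',
      [update.driverId, update.lat, update.lon]
    );
    // The client only hears back after the row is durably written.
    return { status: 200, body: { persisted: true } };
  } catch (err) {
    // No fallback exists: the database failure propagates straight out.
    return { status: 500, body: { error: 'database unavailable' } };
  }
}
```

Because success is only reported after the INSERT commits, the client gets strong consistency, but also inherits every database outage.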
Kafka: Asynchronous Decoupling
Client → Producer API → Kafka Topic → Consumer → PostgreSQL
             ↓ (immediate)               ↓ (buffered)
      HTTP 202 Accepted            Circuit Breaker
Failure Mode:
When PostgreSQL fails, the Producer API continues accepting requests and publishing events to Kafka. The Consumer detects database failures through the circuit breaker pattern, transitions to OPEN state, and stops attempting writes. Events accumulate in Kafka's persistent log. When PostgreSQL recovers, the circuit breaker transitions to HALF_OPEN, tests connectivity, and resumes processing the buffered events.
Characteristics:
- Eventual consistency model
- Request buffering via Kafka topics
- Automatic failure detection and recovery
- Component independence
- Complex operational requirements
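The producer side can be sketched in a few lines, assuming a kafkajs-style `producer.send` and an illustrative topic name (neither is taken from the repository):

```javascript
// Architecture B: the producer path. The API acknowledges as soon as the
// event is in the topic; persistence happens later in the consumer.
async function acceptLocationUpdate(producer, update) {
  await producer.send({
    topic: 'driver-locations', // assumed topic name
    messages: [{ key: String(update.driverId), value: JSON.stringify(update) }],
  });
  // 202 Accepted: "buffered", not "persisted".
  return { status: 202, body: { accepted: true } };
}
```

Keying by driver ID keeps each driver's updates on one partition, preserving per-driver ordering through the pipeline.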
Circuit Breaker Implementation
The circuit breaker pattern is critical for preventing cascading failures in distributed systems. Our implementation uses a state machine with three states:
class CircuitBreaker {
  constructor(threshold = 5, resetTimeout = 60000) {
    this.state = 'CLOSED';             // CLOSED | OPEN | HALF_OPEN
    this.failureThreshold = threshold; // failures before tripping OPEN
    this.resetTimeout = resetTimeout;  // how long to stay OPEN before probing
    this.failureCount = 0;
    this.successCount = 0;             // consecutive successes while HALF_OPEN
    this.nextAttempt = Date.now();
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker OPEN - database unavailable');
      }
      // Reset window has elapsed: let a probe request through.
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    if (this.state === 'HALF_OPEN') {
      this.successCount++;
      if (this.successCount >= 3) {
        // Three consecutive probes succeeded: fully close the circuit.
        this.state = 'CLOSED';
        this.successCount = 0;
      }
    }
  }

  onFailure() {
    this.failureCount++;
    if (this.state === 'HALF_OPEN' || this.failureCount >= this.failureThreshold) {
      // A failed probe, or too many failures, (re)opens the circuit.
      this.state = 'OPEN';
      this.successCount = 0;
      this.nextAttempt = Date.now() + this.resetTimeout;
    }
  }
}
State Transitions:
- CLOSED: Normal operation, all requests processed
- OPEN: Failure threshold exceeded, stop attempting database writes
- HALF_OPEN: Testing recovery, require 3 consecutive successes before full recovery
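In the consumer, each database write is wrapped in the breaker. The sketch below shows one possible message handler; the handler shape, table schema, and return values are assumptions, not the repository's actual code:

```javascript
// Consumer-side sketch: the breaker guards every write. While the breaker
// is OPEN, execute() throws immediately without touching the database, and
// the event stays safely in Kafka's log for later reprocessing.
async function processEvent(breaker, db, message) {
  const update = JSON.parse(message.value.toString());
  try {
    await breaker.execute(() =>
      db.query(
        'INSERT INTO driver_locations (driver_id, lat, lon) VALUES ($1, $2, $3)',
        [update.driverId, update.lat, update.lon]
      )
    );
    return 'committed';   // safe to commit the offset
  } catch (err) {
    return 'retry-later'; // leave the offset uncommitted; Kafka retains the event
  }
}
```

The key property is that a rejected write costs almost nothing: no connection attempt, no 2-second timeout, no pressure on a database that is trying to come back up.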
Business Impact Analysis
Assumptions
- Request rate: 1,000 requests/second
- Revenue per request: $0.01
- Database outage duration: 5 minutes
- Error rate impact: 50% (REST) vs 0% (Kafka)
REST Architecture Impact
Failed Requests: 150,000 (50% × 1,000 req/s × 300s)
Revenue Loss: $1,500
Recovery Time: Manual intervention required
Customer Impact: Severe degradation, visible errors
Kafka Architecture Impact
Failed Requests: 0 (buffered in Kafka)
Revenue Loss: $0
Recovery Time: Automatic
Customer Impact: None (transparent to clients)
ROI Calculation:
At scale, assuming one 5-minute database outage per month, Kafka prevents $18,000 in annual revenue loss. This excludes:
- Customer support costs from incident tickets
- Engineering time for manual recovery
- Reputational damage from service degradation
- Compliance implications for SLA violations
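The back-of-envelope numbers above can be reproduced directly; every input below is one of the stated assumptions, not a measured value:

```javascript
// Revenue-impact arithmetic from the assumptions above.
const reqPerSec = 1000;       // assumed request rate
const revenuePerReq = 0.01;   // assumed $ per request
const outageSec = 300;        // one 5-minute outage
const restErrorRate = 0.5;    // measured 50% error rate under REST
const outagesPerYear = 12;    // one outage per month

const failedPerOutage = restErrorRate * reqPerSec * outageSec; // 150,000 requests
const lossPerOutage = failedPerOutage * revenuePerReq;         // $1,500
const annualLoss = lossPerOutage * outagesPerYear;             // $18,000
console.log({ failedPerOutage, lossPerOutage, annualLoss });
```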
Technical Stack
- Runtime: Node.js 18+ with Express
- Message Broker: Apache Kafka 3.x with KafkaJS client
- Database: PostgreSQL 15 with connection pooling
- Load Testing: k6 with custom JavaScript scenarios
- Monitoring: Prometheus 2.x + Grafana 9.x
- Orchestration: Docker Compose for reproducible environments
- Chaos Engineering: Automated failure injection scripts
Observability Metrics
Prometheus Instrumentation
// REST API Metrics
rest_http_requests_total{status="200|500"}
rest_request_duration_seconds{quantile="0.5|0.95|0.99"}
// Kafka Producer Metrics
kafka_produce_total
kafka_produce_latency_milliseconds
kafka_produce_errors_total
// Consumer Metrics with Circuit Breaker
circuit_breaker_state{state="CLOSED|HALF_OPEN|OPEN"}
kafka_consumer_lag_seconds
db_write_failures_total
db_connection_pool_size
These metrics enabled real-time failure detection and post-mortem analysis of system behavior during the outage window.
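One way to feed the `circuit_breaker_state` series is to encode the state as a numeric gauge value (prom-client's `Gauge#set` takes a number). The mapping below is a convention of this sketch, not something prescribed by the repository:

```javascript
// Numeric encoding of breaker states for a Prometheus gauge.
// Alerting rule idea: fire when the value is 2 (OPEN) for > 1 minute.
const STATE_VALUES = { CLOSED: 0, HALF_OPEN: 1, OPEN: 2 };

function breakerStateValue(state) {
  const v = STATE_VALUES[state];
  if (v === undefined) throw new Error(`unknown breaker state: ${state}`);
  return v;
}

// In the consumer loop (gauge creation via prom-client omitted):
//   stateGauge.set(breakerStateValue(breaker.state));
```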
Reproduction Guide
Prerequisites
# Install k6 load testing tool
brew install k6 # macOS
sudo apt-get install k6 # Linux (after adding Grafana's k6 apt repository)
# Clone repository
git clone https://github.com/builtbychikara/WhatIfSeries
cd WhatIfSeries/what-if/kafka-vs-rest-polling
Infrastructure Startup
# Start Kafka, PostgreSQL, Prometheus, and Grafana
docker-compose up -d
# Verify all services are healthy
docker-compose ps
Service Deployment
Open three terminal windows:
# Terminal 1: REST API (Port 3001)
npm run start:rest
# Terminal 2: Kafka Producer API (Port 3002)
npm run start:kafka
# Terminal 3: Kafka Consumer with Circuit Breaker (Port 3003)
npm run start:consumer
Execute Experiment
# Run automated chaos engineering experiment
npm run experiment
Experiment Timeline
Phase     Time      Load      Database State
Warmup    0-30s     25 req/s  Healthy
Normal    30-90s    50 req/s  Healthy
Crash     90-210s   50 req/s  OFFLINE
Recovery  210-300s  50 req/s  Healthy
The experiment script automatically injects the database failure at t=90s and restores it at t=210s, while maintaining constant load throughout.
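The timeline can be expressed as a small lookup that an orchestration script might consult each second; this is purely illustrative, and the repository's actual `k6/runner.js` may be organized differently:

```javascript
// Given elapsed seconds, return the phase, target load, and expected
// database state from the experiment timeline above.
function phaseAt(t) {
  if (t < 30) return { phase: 'warmup', load: 25, db: 'healthy' };
  if (t < 90) return { phase: 'normal', load: 50, db: 'healthy' };
  if (t < 210) return { phase: 'crash', load: 50, db: 'offline' };
  return { phase: 'recovery', load: 50, db: 'healthy' };
}

// The runner injects the failure when the expected state first becomes
// 'offline' (e.g. by stopping the PostgreSQL container) and restores the
// database when it flips back to 'healthy' at t=210s.
```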
Key Findings
1. Resilience Through Decoupling
Event-driven architectures achieve resilience not through redundancy, but through temporal decoupling. Kafka's persistent log provides a buffer that absorbs transient failures, converting availability problems into latency problems.
2. Circuit Breaker Necessity
Without the circuit breaker pattern, the Kafka consumer would continuously retry failed database writes, wasting resources and potentially overwhelming the database during recovery. The circuit breaker provides:
- Automatic failure detection
- Resource conservation during outages
- Controlled recovery testing
- Metrics for failure state visibility
3. Consistency Trade-offs
REST provides strong consistency - clients know immediately whether their write succeeded. Kafka provides eventual consistency - clients receive acknowledgment that the write is buffered, not that it's persisted. This trade-off must align with business requirements.
4. Operational Complexity
Kafka introduces operational overhead:
- Additional infrastructure (brokers, ZooKeeper/KRaft)
- Message ordering guarantees to maintain
- Consumer lag monitoring requirements
- Topic partition management
- Schema evolution considerations
This complexity must be justified by resilience requirements.
Decision Framework
Choose REST When:
- Traffic Patterns: Request rate under 100 req/s
- Consistency Requirements: Strong consistency required
- Team Capabilities: Limited operational expertise
- System Simplicity: Monolithic architecture preferred
- Latency Requirements: Sub-10ms response times required
Choose Kafka When:
- Traffic Patterns: Request rate exceeds 1,000 req/s
- Resilience Requirements: Infrastructure failures must be transparent
- Event Consumers: Multiple downstream systems need the same events
- Consistency Trade-offs: Eventual consistency acceptable
- Operational Maturity: Team can manage distributed systems
Future Work
This experiment is part of the What If Series, exploring system design decisions through empirical measurement rather than theoretical analysis.
Upcoming Experiments:
- High-throughput scaling: Measuring Kafka performance at 1M+ req/s
- Cache failure analysis: Redis outage impact on application tier
- Real-time processing: Stream processing 1TB of logs with sub-second latency
- Network partition testing: Consistency guarantees under split-brain scenarios
Code Repository
Full implementation including:
- Complete source code for both architectures
- Docker Compose infrastructure definitions
- k6 load testing scenarios
- Prometheus metric exporters
- Automated chaos injection scripts
- Grafana dashboard configurations
GitHub: https://github.com/builtbychikara/WhatIfSeries/tree/main/what-if/kafka-vs-rest-polling
Key Files:
- kafka-pubsub-service/consumer.js: Circuit breaker implementation
- k6/runner.js: Chaos engineering experiment orchestration
- docker-compose.yml: Complete infrastructure stack
Conclusion
Event-driven architectures using Kafka demonstrate measurably superior resilience during database failures, with zero client-facing errors compared to 50% error rates in synchronous REST implementations. However, this resilience comes at the cost of operational complexity and consistency guarantees.
The choice between REST and Kafka should be driven by quantified requirements for resilience, throughput, and acceptable consistency models, not by architectural preferences or industry trends.
Acknowledgments: Built in collaboration with Aakanksh Singh to demonstrate production-grade system design patterns through empirical measurement.
Tech Stack: Node.js, Apache Kafka, PostgreSQL, k6, Prometheus, Grafana, Docker
