What If Your Database Goes Down? REST vs Kafka Under Fire
Companies like Uber, Netflix, and Airbnb have migrated from traditional REST APIs to event-driven architectures. This shift isn't driven by trends, but by fundamental resilience requirements. To quantify the difference, I built a production-grade chaos-engineering proof of concept that simulates real-world database failures under sustained load.
The Hypothesis
Traditional synchronous REST APIs create tight coupling between API servers and databases. When the database fails, the API fails. Event-driven architectures using Kafka decouple producers from consumers through message buffering, theoretically providing resilience during infrastructure failures.
We designed an experiment to measure this resilience difference empirically.
Experiment Design
Scenario
We simulated Uber's real-time driver location tracking system, where high-frequency updates must be reliably persisted to PostgreSQL.
Infrastructure
- Load Pattern: Constant 50 requests/second
- Failure Injection: PostgreSQL crash for 120 seconds (t=90s to t=210s)
- Duration: 5 minutes total (300 seconds)
- Load Testing: k6 with automated orchestration
- Monitoring: Prometheus + Grafana for real-time metrics collection
Architectures Tested
Architecture A: Synchronous REST
- Direct HTTP POST to REST API
- Immediate database INSERT
- Response after database confirmation
Architecture B: Asynchronous Kafka
- HTTP POST to Producer API
- Event published to Kafka topic
- Consumer processes events with circuit breaker pattern
- Asynchronous database INSERT
Both architectures handled identical traffic patterns and experienced the same database failure window.
Results
REST API Performance
Metric            Value
Total Requests    30,002
Successful        15,001
Failed            15,001
Error Rate        50.00%
P50 Latency       3.60ms
P95 Latency       15.55ms
Average Latency   11.24ms
Failure Characteristics:
- Immediate error propagation to clients
- No request buffering capability
- Manual intervention required for recovery
- 50% of requests returned HTTP 500 errors
- Complete service degradation during database outage
Kafka Architecture Performance
Metric                           Value
Total Requests                   15,000
Successful                       15,000
Failed                           0
Error Rate                       0.00%
P50 Latency                      3.47ms
P95 Latency                      12.10ms
Average Latency                  6.77ms
Avg Latency Improvement vs REST  39.77%
Failure Characteristics:
- Zero client-facing errors
- Automatic request buffering in Kafka topics
- Circuit breaker pattern enabled graceful degradation
- Automatic recovery when database restored
- Transparent failure handling from client perspective
Architecture Analysis
REST: Synchronous Coupling
Client → REST API → PostgreSQL → Response
                        ↓ (if DB fails)
                  HTTP 500 Error
Failure Mode:
When PostgreSQL becomes unavailable, the REST API has no buffering mechanism. Each incoming request attempts a database connection, fails after timeout (2000ms), and returns an error to the client. This creates a cascading failure pattern where API health is directly tied to database availability.
Characteristics:
- Strong consistency guarantees
- Immediate failure feedback
- No request buffering
- Tight coupling between components
- Simple operational model
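This coupling can be sketched as a minimal write handler. The handler, table, and field names below are illustrative, not taken from the repository; `db` stands in for a pg-style connection pool:

```javascript
// Architecture A: the synchronous write path. There is no buffer between
// the request and the database, so a DB failure becomes a client error.
async function handleLocationUpdate(db, update) {
  try {
    await db.query(
      'INSERT INTO driver_locations (driver_id, lat, lon) VALUES ($1, $2, $3)',
      [update.driverId, update.lat, update.lon]
    );
    // The client only hears back after the row is durably written.
    return { status: 200, body: { persisted: true } };
  } catch (err) {
    // No fallback exists: the database failure propagates straight out.
    return { status: 500, body: { error: 'database unavailable' } };
  }
}
```

Because success is only reported after the INSERT commits, the client gets strong consistency, but also inherits every database outage.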
Kafka: Asynchronous Decoupling
Client → Producer API → Kafka Topic → Consumer → PostgreSQL
             ↓ (immediate)               ↓ (buffered)
      HTTP 202 Accepted            Circuit Breaker
Failure Mode:
When PostgreSQL fails, the Producer API continues accepting requests and publishing events to Kafka. The Consumer detects database failures through the circuit breaker pattern, transitions to OPEN state, and stops attempting writes. Events accumulate in Kafka's persistent log. When PostgreSQL recovers, the circuit breaker transitions to HALF_OPEN, tests connectivity, and resumes processing the buffered events.
Characteristics:
- Eventual consistency model
- Request buffering via Kafka topics
- Automatic failure detection and recovery
- Component independence
- Complex operational requirements
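The producer side can be sketched in a few lines, assuming a kafkajs-style `producer.send` and an illustrative topic name (neither is taken from the repository):

```javascript
// Architecture B: the producer path. The API acknowledges as soon as the
// event is in the topic; persistence happens later in the consumer.
async function acceptLocationUpdate(producer, update) {
  await producer.send({
    topic: 'driver-locations', // assumed topic name
    messages: [{ key: String(update.driverId), value: JSON.stringify(update) }],
  });
  // 202 Accepted: "buffered", not "persisted".
  return { status: 202, body: { accepted: true } };
}
```

Keying by driver ID keeps each driver's updates on one partition, preserving per-driver ordering through the pipeline.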
Circuit Breaker Implementation
The circuit breaker pattern is critical for preventing cascading failures in distributed systems. Our implementation uses a state machine with three states:
class CircuitBreaker {
  constructor(threshold = 5, resetTimeout = 60000) {
    this.state = 'CLOSED';             // CLOSED | OPEN | HALF_OPEN
    this.failureThreshold = threshold; // failures before tripping OPEN
    this.resetTimeout = resetTimeout;  // how long to stay OPEN before probing
    this.failureCount = 0;
    this.successCount = 0;             // consecutive successes while HALF_OPEN
    this.nextAttempt = Date.now();
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker OPEN - database unavailable');
      }
      // Reset window has elapsed: let a probe request through.
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    if (this.state === 'HALF_OPEN') {
      this.successCount++;
      if (this.successCount >= 3) {
        // Three consecutive probes succeeded: fully close the circuit.
        this.state = 'CLOSED';
        this.successCount = 0;
      }
    }
  }

  onFailure() {
    this.failureCount++;
    if (this.state === 'HALF_OPEN' || this.failureCount >= this.failureThreshold) {
      // A failed probe, or too many failures, (re)opens the circuit.
      this.state = 'OPEN';
      this.successCount = 0;
      this.nextAttempt = Date.now() + this.resetTimeout;
    }
  }
}
State Transitions:
- CLOSED: Normal operation, all requests processed
- OPEN: Failure threshold exceeded, stop attempting database writes
- HALF_OPEN: Testing recovery, require 3 consecutive successes before full recovery
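In the consumer, each database write is wrapped in the breaker. The sketch below shows one possible message handler; the handler shape, table schema, and return values are assumptions, not the repository's actual code:

```javascript
// Consumer-side sketch: the breaker guards every write. While the breaker
// is OPEN, execute() throws immediately without touching the database, and
// the event stays safely in Kafka's log for later reprocessing.
async function processEvent(breaker, db, message) {
  const update = JSON.parse(message.value.toString());
  try {
    await breaker.execute(() =>
      db.query(
        'INSERT INTO driver_locations (driver_id, lat, lon) VALUES ($1, $2, $3)',
        [update.driverId, update.lat, update.lon]
      )
    );
    return 'committed';   // safe to commit the offset
  } catch (err) {
    return 'retry-later'; // leave the offset uncommitted; Kafka retains the event
  }
}
```

The key property is that a rejected write costs almost nothing: no connection attempt, no 2-second timeout, no pressure on a database that is trying to come back up.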
Business Impact Analysis
Assumptions
- Request rate: 1,000 requests/second
- Revenue per request: $0.01
- Database outage duration: 5 minutes
- Error rate impact: 50% (REST) vs 0% (Kafka)
REST Architecture Impact
Failed Requests: 150,000 (50% × 1,000 req/s × 300s)
Revenue Loss: $1,500
Recovery Time: Manual intervention required
Customer Impact: Severe degradation, visible errors
Kafka Architecture Impact
Failed Requests: 0 (buffered in Kafka)
Revenue Loss: $0
Recovery Time: Automatic
Customer Impact: None (transparent to clients)
ROI Calculation:
At scale, assuming one 5-minute database outage per month, Kafka prevents $18,000 in annual revenue loss. This excludes:
- Customer support costs from incident tickets
- Engineering time for manual recovery
- Reputational damage from service degradation
- Compliance implications for SLA violations
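The back-of-envelope numbers above can be reproduced directly; every input below is one of the stated assumptions, not a measured value:

```javascript
// Revenue-impact arithmetic from the assumptions above.
const reqPerSec = 1000;       // assumed request rate
const revenuePerReq = 0.01;   // assumed $ per request
const outageSec = 300;        // one 5-minute outage
const restErrorRate = 0.5;    // measured 50% error rate under REST
const outagesPerYear = 12;    // one outage per month

const failedPerOutage = restErrorRate * reqPerSec * outageSec; // 150,000 requests
const lossPerOutage = failedPerOutage * revenuePerReq;         // $1,500
const annualLoss = lossPerOutage * outagesPerYear;             // $18,000
console.log({ failedPerOutage, lossPerOutage, annualLoss });
```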
Technical Stack
- Runtime: Node.js 18+ with Express
- Message Broker: Apache Kafka 3.x with KafkaJS client
- Database: PostgreSQL 15 with connection pooling
- Load Testing: k6 with custom JavaScript scenarios
- Monitoring: Prometheus 2.x + Grafana 9.x
- Orchestration: Docker Compose for reproducible environments
- Chaos Engineering: Automated failure injection scripts
Observability Metrics
Prometheus Instrumentation
// REST API Metrics
rest_http_requests_total{status="200|500"}
rest_request_duration_seconds{quantile="0.5|0.95|0.99"}
// Kafka Producer Metrics
kafka_produce_total
kafka_produce_latency_milliseconds
kafka_produce_errors_total
// Consumer Metrics with Circuit Breaker
circuit_breaker_state{state="CLOSED|HALF_OPEN|OPEN"}
kafka_consumer_lag_seconds
db_write_failures_total
db_connection_pool_size
These metrics enabled real-time failure detection and post-mortem analysis of system behavior during the outage window.
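One way to feed the `circuit_breaker_state` series is to encode the state as a numeric gauge value (prom-client's `Gauge#set` takes a number). The mapping below is a convention of this sketch, not something prescribed by the repository:

```javascript
// Numeric encoding of breaker states for a Prometheus gauge.
// Alerting rule idea: fire when the value is 2 (OPEN) for > 1 minute.
const STATE_VALUES = { CLOSED: 0, HALF_OPEN: 1, OPEN: 2 };

function breakerStateValue(state) {
  const v = STATE_VALUES[state];
  if (v === undefined) throw new Error(`unknown breaker state: ${state}`);
  return v;
}

// In the consumer loop (gauge creation via prom-client omitted):
//   stateGauge.set(breakerStateValue(breaker.state));
```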
Reproduction Guide
Prerequisites
# Install k6 load testing tool
brew install k6 # macOS
sudo apt-get install k6 # Linux (after adding Grafana's k6 apt repository)
# Clone repository
git clone https://github.com/builtbychikara/WhatIfSeries
cd WhatIfSeries/what-if/kafka-vs-rest-polling
Infrastructure Startup
# Start Kafka, PostgreSQL, Prometheus, and Grafana
docker-compose up -d
# Verify all services are healthy
docker-compose ps
Service Deployment
Open three terminal windows:
# Terminal 1: REST API (Port 3001)
npm run start:rest
# Terminal 2: Kafka Producer API (Port 3002)
npm run start:kafka
# Terminal 3: Kafka Consumer with Circuit Breaker (Port 3003)
npm run start:consumer
Execute Experiment
# Run automated chaos engineering experiment
npm run experiment
Experiment Timeline
Phase     Time      Load      Database State
Warmup    0-30s     25 req/s  Healthy
Normal    30-90s    50 req/s  Healthy
Crash     90-210s   50 req/s  OFFLINE
Recovery  210-300s  50 req/s  Healthy
The experiment script automatically injects the database failure at t=90s and restores it at t=210s, while maintaining constant load throughout.
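The timeline can be expressed as a small lookup that an orchestration script might consult each second; this is purely illustrative, and the repository's actual `k6/runner.js` may be organized differently:

```javascript
// Given elapsed seconds, return the phase, target load, and expected
// database state from the experiment timeline above.
function phaseAt(t) {
  if (t < 30) return { phase: 'warmup', load: 25, db: 'healthy' };
  if (t < 90) return { phase: 'normal', load: 50, db: 'healthy' };
  if (t < 210) return { phase: 'crash', load: 50, db: 'offline' };
  return { phase: 'recovery', load: 50, db: 'healthy' };
}

// The runner injects the failure when the expected state first becomes
// 'offline' (e.g. by stopping the PostgreSQL container) and restores the
// database when it flips back to 'healthy' at t=210s.
```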
Key Findings
1. Resilience Through Decoupling
Event-driven architectures achieve resilience not through redundancy, but through temporal decoupling. Kafka's persistent log provides a buffer that absorbs transient failures, converting availability problems into latency problems.
2. Circuit Breaker Necessity
Without the circuit breaker pattern, the Kafka consumer would continuously retry failed database writes, wasting resources and potentially overwhelming the database during recovery. The circuit breaker provides:
- Automatic failure detection
- Resource conservation during outages
- Controlled recovery testing
- Metrics for failure state visibility
3. Consistency Trade-offs
REST provides strong consistency - clients know immediately whether their write succeeded. Kafka provides eventual consistency - clients receive acknowledgment that the write is buffered, not that it's persisted. This trade-off must align with business requirements.
4. Operational Complexity
Kafka introduces operational overhead:
- Additional infrastructure (brokers, ZooKeeper/KRaft)
- Message ordering guarantees to maintain
- Consumer lag monitoring requirements
- Topic partition management
- Schema evolution considerations
This complexity must be justified by resilience requirements.
Decision Framework
Choose REST When:
- Traffic Patterns: Request rate under 100 req/s
- Consistency Requirements: Strong consistency required
- Team Capabilities: Limited operational expertise
- System Simplicity: Monolithic architecture preferred
- Latency Requirements: Sub-10ms response times required
Choose Kafka When:
- Traffic Patterns: Request rate exceeds 1,000 req/s
- Resilience Requirements: Infrastructure failures must be transparent
- Event Consumers: Multiple downstream systems need the same events
- Consistency Trade-offs: Eventual consistency acceptable
- Operational Maturity: Team can manage distributed systems
Future Work
This experiment is part of the What If Series, exploring system design decisions through empirical measurement rather than theoretical analysis.
Upcoming Experiments:
- High-throughput scaling: Measuring Kafka performance at 1M+ req/s
- Cache failure analysis: Redis outage impact on application tier
- Real-time processing: Stream processing 1TB of logs with sub-second latency
- Network partition testing: Consistency guarantees under split-brain scenarios
Code Repository
Full implementation including:
- Complete source code for both architectures
- Docker Compose infrastructure definitions
- k6 load testing scenarios
- Prometheus metric exporters
- Automated chaos injection scripts
- Grafana dashboard configurations
GitHub: https://github.com/builtbychikara/WhatIfSeries/tree/main/what-if/kafka-vs-rest-polling
Key Files:
- kafka-pubsub-service/consumer.js: Circuit breaker implementation
- k6/runner.js: Chaos engineering experiment orchestration
- docker-compose.yml: Complete infrastructure stack
Conclusion
Event-driven architectures using Kafka demonstrate measurably superior resilience during database failures, with zero client-facing errors compared to 50% error rates in synchronous REST implementations. However, this resilience comes at the cost of operational complexity and consistency guarantees.
The choice between REST and Kafka should be driven by quantified requirements for resilience, throughput, and acceptable consistency models, not by architectural preferences or industry trends.
Acknowledgments: Built in collaboration with Aakanksh Singh to demonstrate production-grade system design patterns through empirical measurement.
Tech Stack: Node.js, Apache Kafka, PostgreSQL, k6, Prometheus, Grafana, Docker
