
Bhupesh Chikara

What If Your Database Goes Down? REST vs Kafka Under Fire

Companies like Uber, Netflix, and Airbnb have migrated from traditional REST APIs to event-driven architectures. This shift isn't driven by trends, but by fundamental resilience requirements. To quantify the difference, we built a production-grade chaos engineering proof-of-concept that simulates real-world database failures under sustained load.

The Hypothesis

Traditional synchronous REST APIs create tight coupling between API servers and databases. When the database fails, the API fails. Event-driven architectures using Kafka decouple producers from consumers through message buffering, theoretically providing resilience during infrastructure failures.

We designed an experiment to measure this resilience difference empirically.

REST vs Kafka Architecture Comparison

Experiment Design

Scenario

We simulated Uber's real-time driver location tracking system, where high-frequency updates must be reliably persisted to PostgreSQL.

Infrastructure

  • Load Pattern: Constant 50 requests/second
  • Failure Injection: PostgreSQL crash for 120 seconds (t=90s to t=210s)
  • Duration: 5 minutes total (300 seconds)
  • Load Testing: k6 with automated orchestration
  • Monitoring: Prometheus + Grafana for real-time metrics collection
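For reference, the constant-rate load described above maps onto a k6 scenario like the following (a sketch, not the repository's exact script; the endpoint URL and payload are placeholders):

```javascript
// k6 scenario sketch: hold a constant 50 req/s for the 5-minute window.
// Warmup staging is omitted for brevity; the URL below is a placeholder.
import http from 'k6/http';

export const options = {
  scenarios: {
    steady_load: {
      executor: 'constant-arrival-rate',
      rate: 50,              // requests per second
      timeUnit: '1s',
      duration: '300s',      // full experiment window
      preAllocatedVUs: 100,  // virtual users reserved to sustain the rate
    },
  },
};

export default function () {
  const payload = JSON.stringify({ driverId: 'driver-42', lat: 37.77, lon: -122.42 });
  http.post('http://localhost:3001/locations', payload, {
    headers: { 'Content-Type': 'application/json' },
  });
}
```

The `constant-arrival-rate` executor keeps the request rate fixed regardless of response latency, which is what lets us hold load steady through the outage window.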

Architectures Tested

Architecture A: Synchronous REST

  • Direct HTTP POST to REST API
  • Immediate database INSERT
  • Response after database confirmation
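To make the coupling concrete, here is a minimal model of the synchronous path (a hypothetical handler, not the repository's code): the HTTP status is decided entirely by the database write, so a failed request has nowhere to go but back to the client.

```javascript
// Simplified model of the synchronous REST path: the HTTP status is
// decided entirely by the outcome of the database write.
async function handleLocationUpdate(dbInsert, payload) {
  try {
    await dbInsert(payload);            // blocks until the DB confirms
    return { status: 200, body: 'persisted' };
  } catch (err) {
    // No buffer to fall back on: the DB error surfaces to the client.
    return { status: 500, body: 'database unavailable' };
  }
}

// Healthy database: the client sees success.
const okInsert = async () => true;
// Crashed database: every request fails.
const downInsert = async () => { throw new Error('connection refused'); };

handleLocationUpdate(okInsert, { driverId: 'd1' })
  .then(res => console.log(res.status));   // 200
handleLocationUpdate(downInsert, { driverId: 'd1' })
  .then(res => console.log(res.status));   // 500
```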

Architecture B: Asynchronous Kafka

  • HTTP POST to Producer API
  • Event published to Kafka topic
  • Consumer processes events with circuit breaker pattern
  • Asynchronous database INSERT

Both architectures handled identical traffic patterns and experienced the same database failure window.

Results

REST API Performance

Metric                  Value
Total Requests         30,002
Successful             15,001
Failed                 15,001
Error Rate             50.00%
P50 Latency            3.60ms
P95 Latency            15.55ms
Average Latency        11.24ms

Failure Characteristics:

  • Immediate error propagation to clients
  • No request buffering capability
  • Manual intervention required for recovery
  • 50% of requests returned HTTP 500 errors
  • Complete service degradation during database outage

Kafka Architecture Performance

Metric                  Value
Total Requests         15,000
Successful             15,000
Failed                 0
Error Rate             0.00%
P50 Latency            3.47ms
P95 Latency            12.10ms
Average Latency        6.77ms
Latency Improvement    39.77%

Failure Characteristics:

  • Zero client-facing errors
  • Automatic request buffering in Kafka topics
  • Circuit breaker pattern enabled graceful degradation
  • Automatic recovery when database restored
  • Transparent failure handling from client perspective

Architecture Analysis

REST: Synchronous Coupling

Client → REST API → PostgreSQL → Response
           ↓ (if DB fails)
       HTTP 500 Error

Failure Mode:
When PostgreSQL becomes unavailable, the REST API has no buffering mechanism. Each incoming request attempts a database connection, fails after timeout (2000ms), and returns an error to the client. This creates a cascading failure pattern where API health is directly tied to database availability.

Characteristics:

  • Strong consistency guarantees
  • Immediate failure feedback
  • No request buffering
  • Tight coupling between components
  • Simple operational model

Kafka: Asynchronous Decoupling

Client → Producer API → Kafka Topic → Consumer → PostgreSQL
         ↓ (immediate)           ↓ (buffered)
     HTTP 202 Accepted     Circuit Breaker

Failure Mode:
When PostgreSQL fails, the Producer API continues accepting requests and publishing events to Kafka. The Consumer detects database failures through the circuit breaker pattern, transitions to OPEN state, and stops attempting writes. Events accumulate in Kafka's persistent log. When PostgreSQL recovers, the circuit breaker transitions to HALF_OPEN, tests connectivity, and resumes processing the buffered events.
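The buffering behaviour is easy to model in miniature. The sketch below is a toy stand-in (a plain array for the Kafka log, a flag for database health, hypothetical names throughout), but it shows the key property: producer acceptance is independent of database availability, and buffered events drain once the database returns.

```javascript
// Toy model: events accepted during an outage accumulate in the "log"
// and are drained once the database comes back.
const log = [];          // stand-in for the Kafka topic
const persisted = [];    // stand-in for PostgreSQL rows
let dbHealthy = true;

function produce(event) {
  log.push(event);       // always succeeds: the client gets a "202"
  return 202;
}

function consume() {
  // The consumer only drains the log while the database is reachable.
  while (dbHealthy && log.length > 0) {
    persisted.push(log.shift());
  }
}

// Normal operation.
produce({ driverId: 'd1', seq: 1 });
consume();

// Database crash: the producer still accepts, nothing is persisted.
dbHealthy = false;
produce({ driverId: 'd1', seq: 2 });
produce({ driverId: 'd1', seq: 3 });
consume();
console.log(log.length, persisted.length);   // 2 1

// Recovery: buffered events drain automatically.
dbHealthy = true;
consume();
console.log(log.length, persisted.length);   // 0 3
```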

Characteristics:

  • Eventual consistency model
  • Request buffering via Kafka topics
  • Automatic failure detection and recovery
  • Component independence
  • Complex operational requirements

Circuit Breaker Implementation

The circuit breaker pattern is critical for preventing cascading failures in distributed systems. Our implementation uses a state machine with three states:

class CircuitBreaker {
  constructor(threshold = 5, resetTimeout = 60000) {
    this.state = 'CLOSED';
    this.failureThreshold = threshold;
    this.resetTimeout = resetTimeout;   // how long to stay OPEN before probing
    this.failureCount = 0;
    this.successCount = 0;              // consecutive successes while HALF_OPEN
    this.nextAttempt = Date.now();
  }

  async execute(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker OPEN - database unavailable');
      }
      this.state = 'HALF_OPEN';         // reset window elapsed: probe the DB
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  onSuccess() {
    this.failureCount = 0;
    if (this.state === 'HALF_OPEN') {
      this.successCount++;
      if (this.successCount >= 3) {     // require 3 probes before full recovery
        this.state = 'CLOSED';
        this.successCount = 0;
      }
    }
  }

  onFailure() {
    this.failureCount++;
    // A failed probe in HALF_OPEN reopens immediately; otherwise trip
    // once the failure threshold is reached.
    if (this.state === 'HALF_OPEN' || this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      this.successCount = 0;
      this.nextAttempt = Date.now() + this.resetTimeout;
    }
  }
}

State Transitions:

  • CLOSED: Normal operation, all requests processed
  • OPEN: Failure threshold exceeded, stop attempting database writes
  • HALF_OPEN: Testing recovery, require 3 consecutive successes before full recovery

Business Impact Analysis

Assumptions

  • Request rate: 1,000 requests/second
  • Revenue per request: $0.01
  • Database outage duration: 5 minutes
  • Error rate impact: 50% (REST) vs 0% (Kafka)

REST Architecture Impact

Failed Requests:    150,000 (50% × 1,000 req/s × 300s)
Revenue Loss:       $1,500
Recovery Time:      Manual intervention required
Customer Impact:    Severe degradation, visible errors

Kafka Architecture Impact

Failed Requests:    0 (buffered in Kafka)
Revenue Loss:       $0
Recovery Time:      Automatic
Customer Impact:    None (transparent to clients)

ROI Calculation:
At scale, assuming one 5-minute database outage per month, Kafka prevents $18,000 in annual revenue loss. This excludes:

  • Customer support costs from incident tickets
  • Engineering time for manual recovery
  • Reputational damage from service degradation
  • Compliance implications for SLA violations
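The figures above follow from the stated assumptions by straightforward arithmetic:

```javascript
// Revenue-loss arithmetic from the stated assumptions (cents keep the
// math exact).
const reqPerSecond = 1000;
const revenueCentsPerRequest = 1;   // $0.01
const outageSeconds = 5 * 60;       // one 5-minute outage
const restErrorRate = 0.5;          // 50% of requests fail during the outage

const failedRequests = restErrorRate * reqPerSecond * outageSeconds;
const lossPerOutage = (failedRequests * revenueCentsPerRequest) / 100;  // dollars
const annualLoss = lossPerOutage * 12;  // one outage per month

console.log(failedRequests, lossPerOutage, annualLoss);  // 150000 1500 18000
```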

Technical Stack

  • Runtime: Node.js 18+ with Express
  • Message Broker: Apache Kafka 3.x with KafkaJS client
  • Database: PostgreSQL 15 with connection pooling
  • Load Testing: k6 with custom JavaScript scenarios
  • Monitoring: Prometheus 2.x + Grafana 9.x
  • Orchestration: Docker Compose for reproducible environments
  • Chaos Engineering: Automated failure injection scripts

Observability Metrics

Prometheus Instrumentation

// REST API Metrics
rest_http_requests_total{status="200|500"}
rest_request_duration_seconds{quantile="0.5|0.95|0.99"}

// Kafka Producer Metrics
kafka_produce_total
kafka_produce_latency_milliseconds
kafka_produce_errors_total

// Consumer Metrics with Circuit Breaker
circuit_breaker_state{state="CLOSED|HALF_OPEN|OPEN"}
kafka_consumer_lag_seconds
db_write_failures_total
db_connection_pool_size

These metrics enabled real-time failure detection and post-mortem analysis of system behavior during the outage window.

Reproduction Guide

Prerequisites

# Install k6 load testing tool
brew install k6  # macOS
sudo apt-get install k6  # Linux

# Clone repository
git clone https://github.com/builtbychikara/WhatIfSeries
cd WhatIfSeries/what-if/kafka-vs-rest-polling

Infrastructure Startup

# Start Kafka, PostgreSQL, Prometheus, and Grafana
docker-compose up -d

# Verify all services are healthy
docker-compose ps

Service Deployment

Open three terminal windows:

# Terminal 1: REST API (Port 3001)
npm run start:rest

# Terminal 2: Kafka Producer API (Port 3002)
npm run start:kafka

# Terminal 3: Kafka Consumer with Circuit Breaker (Port 3003)
npm run start:consumer

Execute Experiment

# Run automated chaos engineering experiment
npm run experiment

Experiment Timeline

Phase         Time        Load        Database State
Warmup        0-30s       25 req/s    Healthy
Normal        30-90s      50 req/s    Healthy
Crash         90-210s     50 req/s    OFFLINE
Recovery      210-300s    50 req/s    Healthy

The experiment script automatically injects the database failure at t=90s and restores it at t=210s, while maintaining constant load throughout.
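The schedule is simple enough to express as two functions of elapsed time, which is effectively what the orchestration script evaluates (hypothetical names; the real script drives Docker):

```javascript
// Experiment schedule as functions of elapsed time t (seconds),
// mirroring the timeline table above.
function databaseState(t) {
  return t >= 90 && t < 210 ? 'OFFLINE' : 'Healthy';
}

function loadRate(t) {
  return t < 30 ? 25 : 50;  // req/s: warmup, then steady load
}

console.log(loadRate(10), databaseState(10));    // 25 Healthy
console.log(loadRate(120), databaseState(120));  // 50 OFFLINE
console.log(loadRate(250), databaseState(250));  // 50 Healthy
```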

Key Findings

1. Resilience Through Decoupling

Event-driven architectures achieve resilience not through redundancy, but through temporal decoupling. Kafka's persistent log provides a buffer that absorbs transient failures, converting availability problems into latency problems.

2. Circuit Breaker Necessity

Without the circuit breaker pattern, the Kafka consumer would continuously retry failed database writes, wasting resources and potentially overwhelming the database during recovery. The circuit breaker provides:

  • Automatic failure detection
  • Resource conservation during outages
  • Controlled recovery testing
  • Metrics for failure state visibility

3. Consistency Trade-offs

REST provides strong consistency - clients know immediately whether their write succeeded. Kafka provides eventual consistency - clients receive acknowledgment that the write is buffered, not that it's persisted. This trade-off must align with business requirements.

4. Operational Complexity

Kafka introduces operational overhead:

  • Additional infrastructure (brokers, ZooKeeper/KRaft)
  • Message ordering guarantees to maintain
  • Consumer lag monitoring requirements
  • Topic partition management
  • Schema evolution considerations

This complexity must be justified by resilience requirements.

Decision Framework

Choose REST When:

  1. Traffic Patterns: Request rate under 100 req/s
  2. Consistency Requirements: Strong consistency required
  3. Team Capabilities: Limited operational expertise
  4. System Simplicity: Monolithic architecture preferred
  5. Latency Requirements: Sub-10ms response times required

Choose Kafka When:

  1. Traffic Patterns: Request rate exceeds 1,000 req/s
  2. Resilience Requirements: Infrastructure failures must be transparent
  3. Event Consumers: Multiple downstream systems need the same events
  4. Consistency Trade-offs: Eventual consistency acceptable
  5. Operational Maturity: Team can manage distributed systems

Future Work

This experiment is part of the What If Series, exploring system design decisions through empirical measurement rather than theoretical analysis.

Upcoming Experiments:

  • High-throughput scaling: Measuring Kafka performance at 1M+ req/s
  • Cache failure analysis: Redis outage impact on application tier
  • Real-time processing: Stream processing 1TB of logs with sub-second latency
  • Network partition testing: Consistency guarantees under split-brain scenarios

Code Repository

Full implementation including:

  • Complete source code for both architectures
  • Docker Compose infrastructure definitions
  • k6 load testing scenarios
  • Prometheus metric exporters
  • Automated chaos injection scripts
  • Grafana dashboard configurations

GitHub: https://github.com/builtbychikara/WhatIfSeries/tree/main/what-if/kafka-vs-rest-polling

Key Files:

  • kafka-pubsub-service/consumer.js - Circuit breaker implementation
  • k6/runner.js - Chaos engineering experiment orchestration
  • docker-compose.yml - Complete infrastructure stack

Conclusion

Event-driven architectures using Kafka demonstrate measurably superior resilience during database failures, with zero client-facing errors compared to 50% error rates in synchronous REST implementations. However, this resilience comes at the cost of operational complexity and consistency guarantees.

The choice between REST and Kafka should be driven by quantified requirements for resilience, throughput, and acceptable consistency models, not by architectural preferences or industry trends.

Acknowledgments: Built in collaboration with Aakanksh Singh to demonstrate production-grade system design patterns through empirical measurement.

Tech Stack: Node.js, Apache Kafka, PostgreSQL, k6, Prometheus, Grafana, Docker
