Cache Strategies Explained: Part 2 - Advanced Architectures

From Write-Behind to Write-Ahead Log: How Netflix guarantees zero data loss at global scale


This article is the continuation of Part 1 - The Fundamentals. If you haven't read the first part, I recommend starting there to understand caching basics.


Recap: The Netflix Incident

Netflix, Production Incident (Reported September 2025)

A developer types ALTER TABLE user_preferences...

Three seconds later: massive database corruption.

Transparency note: Netflix hasn't publicly disclosed the exact number of affected records. The incident demonstrated the critical importance of their cache + WAL architecture, but specific numbers aren't verifiable from public sources.

Result: Zero customer complaints, zero downtime, zero data loss.

How?

Thanks to two silent technologies:

  1. A cache with extendable TTL
  2. A Write-Ahead Log (WAL) that had captured all mutations

In this Part 2, we'll break down exactly how Netflix transformed classic Write-Behind into an enterprise-grade, mission-critical architecture.


Why Write-Behind Isn't Enough Anymore

The Context: 6 Critical Challenges at Netflix Scale

In 2024-2025, Netflix was facing recurring challenges that kept causing production incidents:

  1. Accidental data loss and corruption in databases
  2. System entropy between different datastores (Cassandra and Elasticsearch becoming inconsistent)
  3. Multi-partition updates (e.g., building secondary indexes on NoSQL)
  4. Data replication (in-region and cross-region)
  5. Reliable retry mechanisms for real-time pipelines at scale
  6. Mass deletions causing OOM (Out Of Memory) on Key-Value nodes

Direct quote from Netflix article (September 2025):

"During a particular incident, a developer executed an ALTER TABLE command that caused data corruption. Fortunately, the data was protected by cache, so the ability to quickly extend cache TTL combined with the application writing mutations to Kafka allowed us to recover. Without the application's resilience features, there would have been permanent data loss."


The Problem with Traditional Write-Behind

Application → Cache (instant)
                ↓
           Async Queue → Database (later)

What happens if:

  • The message queue crashes before writing to DB?
  • The database is corrupted?
  • You need to replicate across 4 geographic regions?

Answer: Data loss.

Classic Write-Behind wasn't enough for Netflix anymore. They needed a solution with enterprise-grade durability guarantees.
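
To see the gap concretely, here is a minimal, illustrative write-behind sketch (not Netflix's code; the cache and database are stand-in dictionaries and the queue is plain process memory). Anything still sitting in the in-memory queue when the process dies is gone:

import queue
import threading
import time

cache = {}                 # fast, but volatile
db = {}                    # durable, but written later
pending = queue.Queue()    # in-memory buffer: lost if the process crashes

def write(key, value):
    cache[key] = value           # 1. instant write to the cache
    pending.put((key, value))    # 2. queued for asynchronous persistence

def flusher():
    while True:
        key, value = pending.get()
        time.sleep(0.1)          # simulate a slow database write
        db[key] = value          # 3. eventually durable -- unless we crashed first

threading.Thread(target=flusher, daemon=True).start()

write("user:123", {"theme": "dark"})
# If the process dies here, the cache already served the value but the DB never saw it.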


The Write-Ahead Log (WAL)

Fundamental Principle of WAL

Netflix developed a generic WAL system that transforms Write-Behind into an enterprise-grade, mission-critical architecture.

Application
    ↓
1. DURABLE write to Kafka (Write-Ahead Log)
    ↓
2. Only after confirmation → write to Cache
    ↓
3. Consumers read from Kafka → write to DB
    ↓
4. On failure → automatic infinite retry until success

Guarantee: zero data loss, even in catastrophic scenarios.
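
A minimal sketch of this durability-first write path, assuming the kafka-python client, a reachable broker, and an illustrative topic name wal-user-prefs (the cache is a stand-in dict rather than EVCache):

import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",                # wait for the brokers to acknowledge the write
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
cache = {}                     # stand-in for EVCache

def wal_write(key, value):
    # 1. DURABLE write: block until Kafka confirms the record is persisted.
    future = producer.send("wal-user-prefs", key=key.encode(), value=value)
    future.get(timeout=10)     # raises if the broker never acknowledges
    # 2. Only after confirmation do we touch the cache.
    cache[key] = value
    # 3. Consumers read the same topic later and write to the database.

wal_write("user:123", {"autoplay": False})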

Write-Behind Classic vs Netflix WAL: Architectural Difference

Write-Behind classic (cache-first approach):

Application
    ↓ 1. INSTANT write
Cache (volatile memory)
    ↓ 2. ASYNCHRONOUS write (non-durable queue)
Database

RISK: if crash between step 1 and 2 → DATA LOSS

Netflix WAL (durability-first approach):

Application
    ↓ 1. DURABLE write
Kafka (Write-Ahead Log)
    ↓ 2. PARALLEL write after Kafka confirmation
    ├──→ Cache
    ├──→ Database
    └──→ Other consumers

Guarantee: even in case of crash → ZERO LOSS (replay from Kafka)

The fundamental difference:

  • Write-Behind = performance optimization (cache first)
  • WAL = durability guarantee (durable log first)

Netflix inverted the priorities: durability before speed.


WAL Architecture for Global EVCache Replication

Here's how Netflix synchronizes its cache across the world:

┌──────────────────────┐
│  EVCache Client      │  (Region US-WEST)
│  Application writes  │
└──────────┬───────────┘
           │
           ↓ Write mutations to Kafka (WAL)
           │
┌──────────┴───────────────────────────────────┐
│         Kafka Topics (Durable WAL)           │
│  • Sequence numbers for guaranteed order      │
│  • Configurable retention                     │
│  • Internal Kafka replication                 │
└──────────┬───────────────────────────────────┘
           │
           ├────────────┬─────────────┬─────────────┐
           ↓            ↓             ↓             ↓
     ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
     │Consumer │  │Consumer │  │Consumer │  │Consumer │
     │ US-EAST │  │   EU    │  │  APAC   │  │  LATAM  │
     └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘
          │            │            │            │
          ↓            ↓            ↓            ↓
     ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
     │ Writer  │  │ Writer  │  │ Writer  │  │ Writer  │
     │ Groups  │  │ Groups  │  │ Groups  │  │ Groups  │
     └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘
          │            │            │            │
          ↓            ↓            ↓            ↓
     EVCache      EVCache      EVCache      EVCache
     Servers      Servers      Servers      Servers
    (Regional)   (Regional)   (Regional)   (Regional)

Detailed flow:

  1. Application (region US-WEST) writes a mutation: SET user:123 = {...}

  2. WAL Producer writes to Kafka with:

    • Key: user:123
    • Value: data + metadata
    • Sequence number: 12,847,392
    • Timestamp: 2025-02-16T10:32:45Z
  3. 4 Regional consumers (US-EAST, EU, APAC, LATAM):

    • Read from the same Kafka topic
    • Consume in parallel and independently
    • Each maintains its own offset
  4. Local Writer Groups:

    • Receive mutations
    • Write to their region's EVCache servers
    • Retry on failure

Result: one write in US-WEST is automatically and reliably replicated across 4 regions.
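
For illustration, one regional consumer/writer group could look roughly like this, assuming kafka-python, redis-py standing in for the EVCache writer, and the same illustrative topic name as before; each region uses its own consumer group, which is what gives it an independent offset:

import json
import time
import redis
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "wal-user-prefs",
    bootstrap_servers="localhost:9092",
    group_id="evcache-writer-eu",     # one group per region => independent offsets
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
local_cache = redis.Redis(host="localhost", port=6379)   # stand-in for regional EVCache servers

for record in consumer:
    key = record.key.decode()
    for attempt in range(5):          # retry on transient cache failures
        try:
            local_cache.set(key, json.dumps(record.value))
            break
        except redis.RedisError:
            time.sleep(2 ** attempt)  # simple exponential backoff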


The WAL API: Intentional Simplicity

One of the strengths of Netflix's WAL is its extremely simple API. Here's the main endpoint:

rpc WriteToLog (WriteToLogRequest) returns (WriteToLogResponse)

// Request
message WriteToLogRequest {
  string namespace = 1;        // Identifier for a particular WAL
  Lifecycle lifecycle = 2;     // Delay and original write timestamp
  bytes payload = 3;           // Message content
  Target target = 4;           // Where to send the payload
}

// Response
message WriteToLogResponse {
  Trilean durable = 1;  // SUCCESS / FAILED / UNKNOWN
  string message = 2;   // Failure reason
}

Why this simplicity?

  • Easy onboarding for teams
  • Complete abstraction of underlying implementation
  • Flexibility via the "namespace" concept
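
For a sense of what that onboarding looks like, here is a hypothetical client call against the API above, assuming stubs generated with grpcio-tools from a proto that exposes WriteToLog on a service (the module names wal_pb2 / wal_pb2_grpc, the service name, and the endpoint address are all illustrative):

import grpc
import wal_pb2          # hypothetical generated module
import wal_pb2_grpc     # hypothetical generated module

channel = grpc.insecure_channel("dgwwal.example.net:50051")   # placeholder; real deployments use mTLS
stub = wal_pb2_grpc.WriteAheadLogStub(channel)                # hypothetical service name

request = wal_pb2.WriteToLogRequest(
    namespace="evcache_foobar",       # the namespace selects the persona/backend
    payload=b'{"user:123": {"autoplay": false}}',
)
response = stub.WriteToLog(request)

if response.durable != wal_pb2.Trilean.SUCCESS:   # SUCCESS / FAILED / UNKNOWN
    print("write not durable:", response.message)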

The 3 WAL Personas

Netflix's WAL can adopt 3 different personas depending on namespace configuration.

Persona #1: Delayed Queue

Example configuration (Product Data Systems):

{
  "namespace": "pds",
  "persistenceConfiguration": {
    "physicalStorage": {
      "type": "SQS"
    },
    "config": {
      "wal-queue": ["dgwwal-dq-pds"],
      "wal-dlq-queue": ["dgwwal-dlq-pds"],
      "queue.poll-interval.secs": 10,
      "queue.max-messages-per-poll": 100
    }
  }
}

Usage:

# Send a message that will be delivered in 3600 seconds (1h)
wal.write(
    namespace="pds",
    payload=message,
    delay=3600
)

Backend: SQS (Amazon Simple Queue Service)
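
Underneath this persona sits the native SQS delay primitive. A sketch with boto3 (the queue URL is a placeholder): note that SQS caps DelaySeconds at 900, so a one-hour delay like the example above requires the WAL layer to re-enqueue the message until the remaining delay fits in one hop.

import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/dgwwal-dq-pds"   # placeholder

sqs.send_message(
    QueueUrl=queue_url,
    MessageBody=json.dumps({"namespace": "pds", "payload": "..."}),
    DelaySeconds=900,     # SQS maximum per hop; longer delays are chained by re-enqueueing
)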


Persona #2: Generic Cross-Region Replication

Example configuration (EVCache):

{
  "namespace": "evcache_foobar",
  "persistenceConfiguration": {
    "physicalStorage": {
      "type": "KAFKA"
    },
    "config": {
      "consumer_stack": "consumer",
      "target": {
        "us-east-1": "dgwwal.foobar.cluster.us-east-1.netflix.net",
        "us-east-2": "dgwwal.foobar.cluster.us-east-2.netflix.net",
        "us-west-2": "dgwwal.foobar.cluster.us-west-2.netflix.net",
        "eu-west-1": "dgwwal.foobar.cluster.eu-west-1.netflix.net"
      },
      "wal-kafka-topics": ["evcache_foobar"],
      "wal-kafka-dlq-topics": []
    }
  }
}

Usage:

# Write to EVCache in region US-WEST-2
evcache.set("user:123", user_data)

# WAL automatically replicates to:
# → US-EAST-1
# → US-EAST-2
# → EU-WEST-1

Backend: Kafka


Persona #3: Multi-Partition Mutations (2-Phase Commit)

Example configuration (Key-Value):

{
  "namespace": "kv_foobar",
  "persistenceConfiguration": {
    "physicalStorage": {
      "type": "KAFKA"
    },
    "config": {
      "durable_storage": {
        "type": "kv",
        "namespace": "foobar_wal_type",
        "shard": "walfoobar"
      },
      "wal-kafka-topics": ["foobar_kv_multi_id"],
      "wal-kafka-dlq-topics": ["foobar_kv_multi_id-dlq"]
    }
  }
}

Usage:

# Single request that modifies multiple tables/partitions
kv.mutate_items([
    PutItem(table="users", id="123", data=user_data),
    PutItem(table="profiles", id="123", data=profile_data),
    DeleteItem(table="cache", id="old:123")
])

# WAL guarantees ALL operations will eventually succeed

Backend: Kafka + Durable Storage (for 2-phase commit)

Key detail: presence of durable_storage enables 2-phase commit semantics.
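
To make the idea concrete, here is a minimal sketch of how a durable intent record can provide those semantics (stand-in dictionaries, not Netflix's implementation): the full multi-partition intent is persisted first, mutations are applied second, and an intent that never reached COMMITTED is simply re-applied on recovery.

import uuid

durable_storage = {}   # stand-in for the "kv" durable store named in the config
tables = {"users": {}, "profiles": {}, "cache": {}}

def _apply(op, table, key, value):
    if op == "put":
        tables[table][key] = value
    elif op == "delete":
        tables[table].pop(key, None)

def mutate_items(mutations):
    # Phase 1: durably record the whole multi-partition intent.
    intent_id = str(uuid.uuid4())
    durable_storage[intent_id] = {"status": "PENDING", "mutations": mutations}
    # Phase 2: apply every mutation, then mark the intent committed.
    for m in mutations:
        _apply(*m)
    durable_storage[intent_id]["status"] = "COMMITTED"

def recover():
    # After a crash, any PENDING intent is re-applied until it commits --
    # which is why all operations "eventually succeed".
    for intent in durable_storage.values():
        if intent["status"] == "PENDING":
            for m in intent["mutations"]:
                _apply(*m)
            intent["status"] = "COMMITTED"

mutate_items([
    ("put", "users", "123", {"name": "Ada"}),
    ("put", "profiles", "123", {"lang": "en"}),
    ("delete", "cache", "old:123", None),
])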


Real-World WAL Use Cases at Netflix

The generic WAL isn't just for EVCache. Netflix uses it for:

1. Queues with Intelligent Retries

Mutation failed → WAL
    ↓
Exponential backoff retry
    ↓
Retry until success (or DLQ after X attempts)
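A minimal sketch of that retry loop (illustrative; apply_mutation and the DLQ are stand-ins):

import time

dead_letter_queue = []

def process_with_retries(mutation, apply_mutation, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            apply_mutation(mutation)
            return True
        except Exception:
            time.sleep(min(2 ** attempt, 60))   # 1s, 2s, 4s, ... capped at 60s
    dead_letter_queue.append(mutation)          # give up: park it for operators
    return False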

2. Cross-Region Replication (EVCache Global)

  • 4 synchronized geographic regions
  • Replication latency: a few seconds
  • Guaranteed eventual consistency

3. Multi-Partition / Multi-Table Mutations

Complex transaction:
1. Write to Table A (partition 1)
2. Write to Table B (partition 7)
3. Update Cache

With WAL:
- Two-phase commit semantics
- Atomic guarantee
- Automatic recovery on partial failure (replay until every mutation lands)

4. Database Failure Protection

Catastrophe scenario:

11:30 - Cassandra database becomes unavailable
11:31 - Applications continue writing to WAL (Kafka)
13:00 - Cassandra comes back online
13:01 - WAL automatically replays all missed mutations
13:15 - System 100% synchronized, ZERO data loss

Write-Behind vs WAL: The Comparative Match

| Aspect | Classic Write-Behind | Netflix WAL | Winner |
|---|---|---|---|
| Durability | Not guaranteed (memory queue) | Strong guarantee (Kafka) | WAL |
| Failure Resilience | Possible loss | No loss | WAL |
| Retries | Manual/basic | Automatic/intelligent | WAL |
| Cross-Region | Not natively supported | Native multi-region support | WAL |
| Operation Ordering | Can be lost | Preserved (sequence numbers) | WAL |
| Complexity | Simple | Complex (Kafka, consumers, etc.) | Write-Behind |
| Write Latency | Ultra-fast (<1ms) | Fast (~5-10ms Kafka) | Write-Behind |
| Infrastructure | Minimal | Heavy (Kafka cluster, consumers) | Write-Behind |

Conclusion: WAL sacrifices some simplicity and latency to gain enterprise-grade durability guarantees.


When NOT to Use WAL

The Netflix WAL is powerful but comes with significant costs. Here's when it's over-engineered:

Don't Use WAL If:

1. Startup / Small Team (< 10 people)

  • Managed Kafka infrastructure cost (AWS MSK, Confluent Cloud): 500€-2000€/month minimum
  • Operational complexity: monitoring, consumer tuning, DLQ management
  • Development time: 2-4 weeks implementation
  • Alternative: Simple Write-Behind with SQS/RabbitMQ queue is sufficient

2. Non-Critical Data

  • Logs, analytics, metrics, tracking events
  • Loss of a few entries is acceptable
  • Alternative: Simple Write-Behind or even fire-and-forget

3. Critical Latency (< 5ms required)

  • WAL adds 5-10ms latency (Kafka round-trip)
  • Real-time gaming, high-frequency trading
  • Alternative: Write-Behind + asynchronous replication

4. Simple Infrastructure / Single-Region

  • No geographic replication needed
  • Single datacenter
  • Alternative: Cache-Aside + regular backups is sufficient

5. Limited Budget

  • Infrastructure: Kafka cluster (3+ brokers) + Zookeeper/KRaft
  • Operations: DevOps expertise required
  • Alternative: Simple managed services (Redis Cloud + RDS with replication)

When WAL Becomes NECESSARY:

  • Critical data (finance, healthcare, user profiles)
  • Zero tolerance for data loss
  • Multi-region replication mandatory
  • Complex operations (multi-table, atomic)
  • Mature infrastructure with dedicated DevOps team
  • Infrastructure budget > 5000€/month

Golden rule: Start simple (Cache-Aside + Write-Behind), evolve to WAL when your durability constraints justify it.


Incident Resolution: Minute by Minute

Back to our ALTER TABLE corruption incident. Here's exactly what happened:

Step 1: Detection (T+3 seconds)

Alert: Database corrupted
Status: Millions of records affected
Severity: CRITICAL

Step 2: Immediate Protection (T+30 seconds)

# Extend cache TTL to buy time
cache.extend_ttl("user_preferences:*", ttl=7200)  # 2 hours

# Users continue to be served by cache
# No one notices the problem
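The extend_ttl call above is pseudocode. With a Redis-backed cache, for example, the same emergency move could look like this (assuming redis-py and keys named user_preferences:<id>):

import redis

r = redis.Redis(host="localhost", port=6379)

for key in r.scan_iter(match="user_preferences:*"):
    r.expire(key, 7200)        # push every matching key's TTL out to 2 hours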

Step 3: Recovery (T+5 minutes)

# Identify last healthy Kafka offset
last_good_offset = kafka.find_offset_before(corruption_timestamp)

# Isolate corrupted database
database.set_read_only()

# Restore from backup
database.restore_from_snapshot(timestamp=corruption_timestamp - 60)

Step 4: Replay (T+15 minutes)

# Replay all mutations from WAL
wal.replay_from_offset(
    start_offset=last_good_offset,
    target_database=database,
    verify=True  # Integrity verification
)

# Verify consistency
assert database.count() == expected_count
assert cache.check_consistency(database)

Step 5: Back to Normal (T+20 minutes)

# Reset TTL to normal
cache.reset_ttl("user_preferences:*", ttl=3600)  # 1 hour

# Re-enable writes
database.set_read_write()

# Check service
monitoring.check_all_metrics()  # ALL GREEN

Final result:

  • Zero data loss
  • Zero service interruption
  • Recovery time: 20 minutes
  • Business impact: $0

Without cache + WAL:

  • Millions of users affected
  • Several hours of interruption
  • Customer data loss
  • Business impact: tens of millions of dollars

Lessons Learned by Netflix Building WAL

Netflix publicly shared the key lessons from this project:

1. Pluggable Architecture Is Fundamental

"The ability to support different targets — databases, caches, queues, or upstream applications — via configuration rather than code changes has been fundamental to WAL's success."

Concrete example:

Same API, different backends per use case:
- Delayed Queue → SQS
- Cross-Region Replication → Kafka
- Multi-Partition → Kafka + Durable Storage

Backend change = config change, not code!

2. Reuse Existing Building Blocks

"We already had control plane infrastructure, Key-Value abstractions, and other components in place. Building on top of these existing abstractions allowed us to focus on the unique challenges WAL needed to solve."

Lesson for your project:
Don't reinvent the wheel. If your company already has:

  • A messaging system (Kafka, RabbitMQ)
  • A database abstraction
  • A monitoring system

Build ON TOP rather than redoing everything from scratch.


3. Separation of Concerns = Scalability

"By separating message processing from consumption, and allowing independent scaling of each component, we can handle traffic spikes and failures more gracefully."

Netflix WAL architecture:

Producer Group (independent scaling)
    ↕ Auto-scale based on CPU/Network
Queue (Kafka/SQS)
    ↕ Auto-scale based on CPU/Network
Consumer Group (independent scaling)

If producers are overloaded → scale just producers.
If consumers are slow → scale just consumers.


4. Systems Fail — Understand Tradeoffs

"WAL itself has failure modes, including traffic spikes, slow consumers, and non-transient errors. We use abstractions and operational strategies like data partitioning and backpressure signals to manage this, but tradeoffs must be understood."

WAL failure modes identified by Netflix:

  1. Traffic Surge

    • Problem: 10x normal traffic suddenly
    • Solution: automatic load shedding + backpressure
  2. Slow Consumer

    • Problem: one consumer processes 10x more slowly
    • Solution: automatic scaling + DLQ for problematic messages
  3. Non-Transient Errors

    • Problem: a mutation always fails (e.g., DB constraint violated)
    • Solution: DLQ after X attempts + operator alerts
  4. Queue Lag Building Up

    • Problem: messages accumulate faster than processed
    • Solution: lag monitoring + proactive auto-scaling (a minimal lag check is sketched just below)
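
As a rough illustration of failure mode 4, here is a minimal lag check using kafka-python (topic and consumer group names are the same illustrative ones used earlier); a lag figure that keeps growing is the trigger for scaling consumers out:

from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="evcache-writer-eu",
    enable_auto_commit=False,
)
partitions = [TopicPartition("wal-user-prefs", p)
              for p in consumer.partitions_for_topic("wal-user-prefs")]
consumer.assign(partitions)

end_offsets = consumer.end_offsets(partitions)          # latest offset per partition
total_lag = sum(end_offsets[tp] - consumer.position(tp)  # position = next offset this group will fetch
                for tp in partitions)
print(f"total lag: {total_lag} messages")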

The fundamental tradeoff accepted:

Eventual Consistency (a few seconds of delay)
    VS
Immediate Consistency (data always up-to-date)

Netflix chose: Eventual Consistency
Why? Performance + Zero Data Loss

WAL vs Using Kafka/SQS Directly

Legitimate question: why not just use Kafka directly?

Netflix's answer:

| Aspect | Kafka/SQS Direct | Netflix WAL |
|---|---|---|
| Initial setup | Complex (configs, topics, consumers, DLQ, monitoring) | Simple (1 API call) |
| Backend change | Code rewrite | Config change |
| Retry logic | Must implement yourself | Built-in with exponential backoff |
| DLQ | Manually configure | Default for each namespace |
| Cross-region | Must architect yourself | Ready-to-use persona |
| 2-Phase Commit | Implement from scratch | Persona with durable storage |
| Monitoring | Build yourself | Integrated (Data Gateway) |
| Authentication | Configure | Automatic mTLS |

Netflix's conclusion:

"WAL is an abstraction over underlying queues, so the underlying technology can be changed per use case without code changes. WAL emphasizes a simple but effective API that saves users from complicated setups and configurations."


Conclusion: Lessons from the Giants

From Incident to Innovation

Remember: a single ALTER TABLE command could have cost millions of dollars and affected millions of users.

What made the difference?

  1. A well-sized cache with flexible TTL
  2. A Write-Ahead Log capturing all mutations
  3. A prepared team with runbooks for this type of incident
  4. Resilient architecture treating cache as protection, not just optimization

This incident perfectly illustrates what we explored in this series: caching isn't just about performance, it's about system resilience.


The Perfect Cache Doesn't Exist

Every strategy has tradeoffs:

  • TTL → Can serve stale data
  • LRU → Can evict important data
  • Write-Through → Write latency
  • Write-Behind → Risk of loss (without WAL)
  • WAL → Infrastructure complexity

The best cache is the one adapted to YOUR use case.

And as Netflix demonstrated: the best cache is the one that saves you when everything goes wrong at 11:30 on a Tuesday morning.


When to Adopt WAL?

Adopt WAL if:

  • Your data is critical (financial, healthcare, user profiles)
  • You can't tolerate ANY data loss
  • You need to replicate across geographic regions
  • You have complex operations (multi-table, atomic)

Simple Write-Behind is sufficient if:

  • Non-critical data (logs, analytics, metrics)
  • Loss of a few entries acceptable
  • Simple infrastructure (1 region, 1 datacenter)
  • You're starting out (start simple, evolve later)

Further Reading

Recommended Resources

Academic Papers:

  • TAO: Facebook's Distributed Data Store (USENIX)
  • CacheSack: Admission Algorithms for Flash Caches (Google)
  • Spanner: Google's Globally-Distributed Database


Acknowledgments

All information in this article is based on verifiable public sources:

  • Official company engineering blogs
  • Published academic papers
  • Technical conferences (QCon, USENIX, etc.)
  • Official system documentation

Special mention: Netflix article "Building a Resilient Data Platform with Write-Ahead Log at Netflix" (September 2025) by Prudhviraj Karumanchi, Samuel Fu, Sriram Rangarajan, Vidhya Arvind, Yun Wang, and John Lu, which provided exceptionally rich details on the ALTER TABLE incident and complete WAL architecture.

Big thanks to the engineering teams sharing their practices with the community!
