From Write-Behind to Write-Ahead Log: How Netflix guarantees zero data loss at global scale
This article is the continuation of Part 1 - The Fundamentals. If you haven't read the first part, I recommend starting there to understand caching basics.
Table of Contents
- Recap: The Netflix Incident
- Why Write-Behind Isn't Enough Anymore
- The Write-Ahead Log (WAL)
- Real-World WAL Use Cases at Netflix
- Write-Behind vs WAL: Comparison
- Incident Resolution: Minute by Minute
- Lessons Learned by Netflix
- WAL vs Using Kafka/SQS Directly
- Conclusion
Recap: The Netflix Incident
Netflix, Production Incident (Reported September 2025)
A developer types ALTER TABLE user_preferences...
Three seconds later: massive database corruption.
Transparency note: Netflix hasn't publicly disclosed the exact number of affected records. The incident demonstrated the critical importance of their cache + WAL architecture, but specific numbers aren't verifiable from public sources.
Result: Zero customer complaints, zero downtime, zero data loss.
How?
Thanks to two silent technologies:
- A cache with extendable TTL
- A Write-Ahead Log (WAL) that had captured all mutations
In this Part 2, we'll break down exactly how Netflix transformed classic Write-Behind into enterprise-grade critical architecture.
Why Write-Behind Isn't Enough Anymore
The Context: 6 Critical Challenges at Netflix Scale
In 2024-2025, Netflix was facing recurring challenges causing production incidents:
- Accidental data loss and corruption in databases
- System entropy between different datastores (Cassandra and Elasticsearch becoming inconsistent)
- Multi-partition updates (e.g., building secondary indexes on NoSQL)
- Data replication (in-region and cross-region)
- Reliable retry mechanisms for real-time pipelines at scale
- Mass deletions causing OOM (Out Of Memory) on Key-Value nodes
Direct quote from Netflix article (September 2025):
"During a particular incident, a developer executed an ALTER TABLE command that caused data corruption. Fortunately, the data was protected by cache, so the ability to quickly extend cache TTL combined with the application writing mutations to Kafka allowed us to recover. Without the application's resilience features, there would have been permanent data loss."
The Problem with Traditional Write-Behind
Application → Cache (instant)
↓
Async Queue → Database (later)
What happens if:
- The message queue crashes before writing to DB?
- The database is corrupted?
- You need to replicate across 4 geographic regions?
Answer: Data loss.
Classic Write-Behind wasn't enough for Netflix anymore. They needed a solution with enterprise-grade durability guarantees.
The Write-Ahead Log (WAL)
Fundamental Principle of WAL
Netflix developed a generic WAL system that transforms Write-Behind into enterprise-grade critical architecture.
Application
↓
1. DURABLE write to Kafka (Write-Ahead Log)
↓
2. Only after confirmation → write to Cache
↓
3. Consumers read from Kafka → write to DB
↓
4. On failure → automatic infinite retry until success
Guarantee: zero data loss, even in catastrophic scenarios.
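To make the ordering concrete, here is a minimal Python sketch of a durability-first write path. It assumes the confluent-kafka client and a generic cache object; the topic name and the cache interface are illustrative, not Netflix's actual internals.
import json
from confluent_kafka import Producer  # assumed Kafka client; any durable log would do

producer = Producer({"bootstrap.servers": "kafka:9092", "acks": "all"})

def write_mutation(key, value, cache, topic="wal-mutations"):
    """Durability first: the cache is only touched after the log write is acknowledged."""
    # 1. DURABLE write to the log (Kafka).
    producer.produce(topic, key=key.encode("utf-8"), value=json.dumps(value).encode("utf-8"))
    producer.flush()  # blocks until delivery reports arrive (error callbacks omitted for brevity)

    # 2. Only after the log confirms durability do we write to the cache.
    cache.set(key, value)

    # 3. Downstream consumers read the same topic and apply the mutation
    #    to the database, retrying until it succeeds.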
Classic Write-Behind vs Netflix WAL: The Architectural Difference
Classic Write-Behind (cache-first approach):
Application
↓ 1. INSTANT write
Cache (volatile memory)
↓ 2. ASYNCHRONOUS write (non-durable queue)
Database
RISK: a crash between steps 1 and 2 → DATA LOSS
Netflix WAL (durability-first approach):
Application
↓ 1. DURABLE write
Kafka (Write-Ahead Log)
↓ 2. PARALLEL write after Kafka confirmation
├──→ Cache
├──→ Database
└──→ Other consumers
Guarantee: even in case of crash → ZERO LOSS (replay from Kafka)
The fundamental difference:
- Write-Behind = performance optimization (cache first)
- WAL = durability guarantee (durable log first)
Netflix inverted the priorities: durability before speed.
WAL Architecture for Global EVCache Replication
Here's how Netflix synchronizes its cache across the world:
┌──────────────────────┐
│ EVCache Client │ (Region US-WEST)
│ Application writes │
└──────────┬───────────┘
│
↓ Write mutations to Kafka (WAL)
│
┌──────────┴───────────────────────────────────┐
│ Kafka Topics (Durable WAL) │
│ • Sequence numbers for guaranteed order │
│ • Configurable retention │
│ • Internal Kafka replication │
└──────────┬───────────────────────────────────┘
│
├────────────┬─────────────┬─────────────┐
↓ ↓ ↓ ↓
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│Consumer │ │Consumer │ │Consumer │ │Consumer │
│ US-EAST │ │ EU │ │ APAC │ │ LATAM │
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
│ │ │ │
↓ ↓ ↓ ↓
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Writer │ │ Writer │ │ Writer │ │ Writer │
│ Groups │ │ Groups │ │ Groups │ │ Groups │
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
│ │ │ │
↓ ↓ ↓ ↓
EVCache EVCache EVCache EVCache
Servers Servers Servers Servers
(Regional) (Regional) (Regional) (Regional)
Detailed flow:
1. The application (region US-WEST) writes a mutation: SET user:123 = {...}
2. The WAL Producer writes to Kafka with:
   - Key: user:123
   - Value: data + metadata
   - Sequence number: 12,847,392
   - Timestamp: 2025-02-16T10:32:45Z
3. Four regional consumers (US-EAST, EU, APAC, LATAM):
   - Read from the same Kafka topic
   - Consume in parallel and independently
   - Each maintains its own offset
4. Local Writer Groups:
   - Receive mutations
   - Write to their region's EVCache servers
   - Retry on failure
Result: one write in US-WEST is automatically and reliably replicated across 4 regions.
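As a rough sketch of the consumer side, each region could run a loop like the one below (again with confluent-kafka). The consumer group name, topic, and the stub cache client are placeholders; the real Writer Groups are internal to Netflix.
import json
from confluent_kafka import Consumer

class RegionalCacheStub:
    """Stand-in for this region's EVCache client (illustrative only)."""
    def set(self, key, value):
        print(f"cache SET {key} -> {value}")

evcache_client = RegionalCacheStub()

# One consumer group per region, so each region tracks its own offset.
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "evcache-replicator-eu-west-1",  # placeholder group name
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,                 # commit only after the cache write succeeds
})
consumer.subscribe(["evcache_foobar"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    evcache_client.set(msg.key().decode("utf-8"), json.loads(msg.value()))
    consumer.commit(msg)  # advance the offset only once the regional write is done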
The WAL API: Intentional Simplicity
One of the strengths of Netflix's WAL is its extremely simple API. Here's the main endpoint:
rpc WriteToLog (WriteToLogRequest) returns (WriteToLogResponse)
// Request
message WriteToLogRequest {
string namespace = 1; // Identifier for a particular WAL
Lifecycle lifecycle = 2; // Delay and original write timestamp
bytes payload = 3; // Message content
Target target = 4; // Where to send the payload
}
// Response
message WriteToLogResponse {
Trilean durable = 1; // SUCCESS / FAILED / UNKNOWN
string message = 2; // Failure reason
}
Why this simplicity?
- Easy onboarding for teams
- Complete abstraction of underlying implementation
- Flexibility via the "namespace" concept
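To give a feel for the call site, here is roughly what invoking that endpoint could look like from Python once the proto is compiled with grpcio-tools. The module names, service name, and address below are assumptions: the service is internal to Netflix and its generated stubs are not public.
import grpc
import wal_pb2        # hypothetical module generated from the proto above
import wal_pb2_grpc   # hypothetical generated stub module

channel = grpc.insecure_channel("dgwwal.example.netflix.net:8980")  # placeholder address
stub = wal_pb2_grpc.WALServiceStub(channel)  # assumed service name

request = wal_pb2.WriteToLogRequest(
    namespace="pds",             # selects which WAL configuration handles this message
    payload=b'{"user": "123"}',  # opaque bytes, interpreted by the target
)
response = stub.WriteToLog(request)

if response.durable == wal_pb2.Trilean.SUCCESS:
    print("mutation durably logged")
else:
    print("not durable:", response.message)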
The 3 WAL Personas
Netflix's WAL can adopt 3 different personas depending on namespace configuration.
Persona #1: Delayed Queue
Example configuration (Product Data Systems):
{
"namespace": "pds",
"persistenceConfiguration": {
"physicalStorage": {
"type": "SQS"
},
"config": {
"wal-queue": ["dgwwal-dq-pds"],
"wal-dlq-queue": ["dgwwal-dlq-pds"],
"queue.poll-interval.secs": 10,
"queue.max-messages-per-poll": 100
}
}
}
Usage:
# Send a message that will be delivered in 3600 seconds (1h)
wal.write(
namespace="pds",
payload=message,
delay=3600
)
Backend: SQS (Amazon Simple Queue Service)
Persona #2: Generic Cross-Region Replication
Example configuration (EVCache):
{
"namespace": "evcache_foobar",
"persistenceConfiguration": {
"physicalStorage": {
"type": "KAFKA"
},
"config": {
"consumer_stack": "consumer",
"target": {
"us-east-1": "dgwwal.foobar.cluster.us-east-1.netflix.net",
"us-east-2": "dgwwal.foobar.cluster.us-east-2.netflix.net",
"us-west-2": "dgwwal.foobar.cluster.us-west-2.netflix.net",
"eu-west-1": "dgwwal.foobar.cluster.eu-west-1.netflix.net"
},
"wal-kafka-topics": ["evcache_foobar"],
"wal-kafka-dlq-topics": []
}
}
}
Usage:
# Write to EVCache in region US-WEST-2
evcache.set("user:123", user_data)
# WAL automatically replicates to:
# → US-EAST-1
# → US-EAST-2
# → EU-WEST-1
Backend: Kafka
Persona #3: Multi-Partition Mutations (2-Phase Commit)
Example configuration (Key-Value):
{
"namespace": "kv_foobar",
"persistenceConfiguration": {
"physicalStorage": {
"type": "KAFKA"
},
"config": {
"durable_storage": {
"type": "kv",
"namespace": "foobar_wal_type",
"shard": "walfoobar"
},
"wal-kafka-topics": ["foobar_kv_multi_id"],
"wal-kafka-dlq-topics": ["foobar_kv_multi_id-dlq"]
}
}
}
Usage:
# Single request that modifies multiple tables/partitions
kv.mutate_items([
PutItem(table="users", id="123", data=user_data),
PutItem(table="profiles", id="123", data=profile_data),
DeleteItem(table="cache", id="old:123")
])
# WAL guarantees ALL operations will eventually succeed
Backend: Kafka + Durable Storage (for 2-phase commit)
Key detail: presence of durable_storage enables 2-phase commit semantics.
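The sketch below is one rough interpretation of how a durable intent record plus the WAL can provide those semantics; it is not Netflix's actual implementation, and every name in it is made up.
import json
import uuid

def mutate_items(operations, durable_store, wal_producer, topic):
    """Illustrative only: persist the full intent first, then fan out the operations."""
    txn_id = str(uuid.uuid4())

    # Phase 1 (prepare): durably record the complete set of operations as one unit.
    durable_store.put(f"txn:{txn_id}", json.dumps({"status": "PENDING", "ops": operations}))

    # Phase 2 (commit): publish each operation to the WAL; consumers apply and
    # retry each one independently until every operation has succeeded.
    for op in operations:
        wal_producer.produce(topic, key=txn_id.encode(), value=json.dumps(op).encode())
    wal_producer.flush()

    # In practice the consumers (or a cleanup job) mark the intent record
    # COMPLETED once all operations have been applied.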
Real-World WAL Use Cases at Netflix
The generic WAL isn't just for EVCache. Netflix uses it for:
1. Queues with Intelligent Retries
Mutation failed → WAL
↓
Exponential backoff retry
↓
Retry until success (or DLQ after X attempts)
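A toy version of such a retry loop, with exponential backoff, jitter, and a dead-letter hand-off, might look like this (thresholds and helper names are arbitrary):
import random
import time

def apply_with_retries(mutation, apply_fn, dead_letter_fn,
                       max_attempts=8, base_delay=0.5, max_delay=60.0):
    """Retry a failed mutation with exponential backoff, dead-lettering it after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            apply_fn(mutation)
            return True
        except Exception as exc:
            if attempt == max_attempts:
                dead_letter_fn(mutation, reason=str(exc))
                return False
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay / 2))  # jitter avoids synchronized retries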
2. Cross-Region Replication (EVCache Global)
- 4 synchronized geographic regions
- Replication latency: a few seconds
- Guaranteed eventual consistency
3. Multi-Partition / Multi-Table Mutations
Complex transaction:
1. Write to Table A (partition 1)
2. Write to Table B (partition 7)
3. Update Cache
With WAL:
- Two-phase commit semantics
- Atomic guarantee
- Automatic rollback on partial failure
4. Database Failure Protection
Catastrophe scenario:
11:30 - Cassandra database becomes unavailable
11:31 - Applications continue writing to WAL (Kafka)
13:00 - Cassandra comes back online
13:01 - WAL automatically replays all missed mutations
13:15 - System 100% synchronized, ZERO data loss
Write-Behind vs WAL: Comparison
| Aspect | Classic Write-Behind | Netflix WAL | Winner |
|---|---|---|---|
| Durability | Not guaranteed (memory queue) | Strong guarantee (Kafka) | WAL |
| Failure Resilience | Possible loss | No loss | WAL |
| Retries | Manual/basic | Automatic/intelligent | WAL |
| Cross-Region | Not natively supported | Native multi-region support | WAL |
| Operation Ordering | Can be lost | Preserved (sequence numbers) | WAL |
| Complexity | Simple | Complex (Kafka, consumers, etc.) | Write-Behind |
| Write Latency | Ultra-fast (<1ms) | Fast (~5-10ms Kafka) | Write-Behind |
| Infrastructure | Minimal | Heavy (Kafka cluster, consumers) | Write-Behind |
Conclusion: WAL sacrifices some simplicity and latency to gain enterprise-grade durability guarantees.
When NOT to Use WAL
The Netflix WAL is powerful but comes with significant costs. Here's when it's over-engineered:
Don't Use WAL If:
1. Startup / Small Team (< 10 people)
- Managed Kafka infrastructure cost (AWS MSK, Confluent Cloud): 500€-2000€/month minimum
- Operational complexity: monitoring, consumer tuning, DLQ management
- Development time: 2-4 weeks implementation
- Alternative: Simple Write-Behind with SQS/RabbitMQ queue is sufficient
2. Non-Critical Data
- Logs, analytics, metrics, tracking events
- Loss of a few entries is acceptable
- Alternative: Simple Write-Behind or even fire-and-forget
3. Critical Latency (< 5ms required)
- WAL adds 5-10ms latency (Kafka round-trip)
- Real-time gaming, high-frequency trading
- Alternative: Write-Behind + asynchronous replication
4. Simple Infrastructure / Single-Region
- No geographic replication needed
- Single datacenter
- Alternative: Cache-Aside + regular backups is sufficient
5. Limited Budget
- Infrastructure: Kafka cluster (3+ brokers) + Zookeeper/KRaft
- Operations: DevOps expertise required
- Alternative: Simple managed services (Redis Cloud + RDS with replication)
When WAL Becomes NECESSARY:
- Critical data (finance, healthcare, user profiles)
- Zero tolerance for data loss
- Multi-region replication mandatory
- Complex operations (multi-table, atomic)
- Mature infrastructure with dedicated DevOps team
- Infrastructure budget > 5000€/month
Golden rule: Start simple (Cache-Aside + Write-Behind), evolve to WAL when your durability constraints justify it.
Incident Resolution: Minute by Minute
Back to our ALTER TABLE corruption incident. Here's exactly what happened:
Step 1: Detection (T+3 seconds)
Alert: Database corrupted
Status: Millions of records affected
Severity: CRITICAL
Step 2: Immediate Protection (T+30 seconds)
# Extend cache TTL to buy time
cache.extend_ttl("user_preferences:*", ttl=7200) # 2 hours
# Users continue to be served by cache
# No one notices the problem
Step 3: Recovery (T+5 minutes)
# Identify last healthy Kafka offset
last_good_offset = kafka.find_offset_before(corruption_timestamp)
# Isolate corrupted database
database.set_read_only()
# Restore from backup
database.restore_from_snapshot(timestamp=corruption_timestamp - 60)
Step 4: Replay (T+15 minutes)
# Replay all mutations from WAL
wal.replay_from_offset(
start_offset=last_good_offset,
target_database=database,
verify=True # Integrity verification
)
# Verify consistency
assert database.count() == expected_count
assert cache.check_consistency(database) == True
Step 5: Back to Normal (T+20 minutes)
# Reset TTL to normal
cache.reset_ttl("user_preferences:*", ttl=3600) # 1 hour
# Re-enable writes
database.set_read_write()
# Check service
monitoring.check_all_metrics() # ALL GREEN
Final result:
- Zero data loss
- Zero service interruption
- Recovery time: 20 minutes
- Business impact: $0
Without cache + WAL:
- Millions of users affected
- Several hours of interruption
- Customer data loss
- Business impact: tens of millions of dollars
Lessons Netflix Learned Building WAL
Netflix publicly shared the key lessons from this project:
1. Pluggable Architecture Is Fundamental
"The ability to support different targets — databases, caches, queues, or upstream applications — via configuration rather than code changes has been fundamental to WAL's success."
Concrete example:
Same API, different backends per use case:
- Delayed Queue → SQS
- Cross-Region Replication → Kafka
- Multi-Partition → Kafka + Durable Storage
Backend change = config change, not code!
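In code terms, pluggability can be as simple as a factory that reads the physicalStorage type from the namespace configuration; the classes below are stand-ins, not Netflix's code.
class SqsStorage:
    def __init__(self, config): self.config = config   # stand-in SQS-backed delayed queue

class KafkaStorage:
    def __init__(self, config): self.config = config   # stand-in Kafka-backed durable log

STORAGE_BACKENDS = {"SQS": SqsStorage, "KAFKA": KafkaStorage}

def build_wal(namespace_config):
    """Select the physical storage from configuration rather than code."""
    storage_type = namespace_config["persistenceConfiguration"]["physicalStorage"]["type"]
    return STORAGE_BACKENDS[storage_type](namespace_config)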
2. Reuse Existing Building Blocks
"We already had control plane infrastructure, Key-Value abstractions, and other components in place. Building on top of these existing abstractions allowed us to focus on the unique challenges WAL needed to solve."
Lesson for your project:
Don't reinvent the wheel. If your company already has:
- A messaging system (Kafka, RabbitMQ)
- A database abstraction
- A monitoring system
Build ON TOP rather than redoing everything from scratch.
3. Separation of Concerns = Scalability
"By separating message processing from consumption, and allowing independent scaling of each component, we can handle traffic spikes and failures more gracefully."
Netflix WAL architecture:
Producer Group (independent scaling)
↕ Auto-scale based on CPU/Network
Queue (Kafka/SQS)
↕ Auto-scale based on CPU/Network
Consumer Group (independent scaling)
If producers are overloaded → scale just producers.
If consumers are slow → scale just consumers.
4. Systems Fail — Understand Tradeoffs
"WAL itself has failure modes, including traffic spikes, slow consumers, and non-transient errors. We use abstractions and operational strategies like data partitioning and backpressure signals to manage this, but tradeoffs must be understood."
WAL failure modes identified by Netflix:
1. Traffic Surge
   - Problem: 10x normal traffic suddenly
   - Solution: automatic load shedding + backpressure
2. Slow Consumer
   - Problem: one consumer processes 10x more slowly
   - Solution: automatic scaling + DLQ for problematic messages
3. Non-Transient Errors
   - Problem: a mutation always fails (e.g., DB constraint violated)
   - Solution: DLQ after X attempts + operator alerts
4. Queue Lag Building Up
   - Problem: messages accumulate faster than processed
   - Solution: lag monitoring + proactive auto-scaling
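A toy policy tying those mitigations together might look like this (the thresholds are invented for illustration):
def backpressure_action(queue_lag, max_lag=100_000):
    """Decide how to react to consumer lag (illustrative thresholds)."""
    if queue_lag > max_lag:
        return "SHED_LOAD"        # apply backpressure: reject or delay new writes
    if queue_lag > max_lag // 2:
        return "SCALE_CONSUMERS"  # proactively add consumer instances
    return "OK"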
The fundamental tradeoff accepted:
Eventual Consistency (a few seconds of delay)
VS
Immediate Consistency (data always up-to-date)
Netflix chose: Eventual Consistency
Why? Performance + Zero Data Loss
WAL vs Using Kafka/SQS Directly
Legitimate question: why not just use Kafka directly?
Netflix's answer:
| Aspect | Kafka/SQS Direct | Netflix WAL |
|---|---|---|
| Initial setup | Complex (configs, topics, consumers, DLQ, monitoring) | Simple (1 API call) |
| Backend change | Code rewrite | Config change |
| Retry logic | Must implement yourself | Built-in with exponential backoff |
| DLQ | Manually configure | Default for each namespace |
| Cross-region | Must architect yourself | Ready-to-use persona |
| 2-Phase Commit | Implement from scratch | Persona with durable storage |
| Monitoring | Build yourself | Integrated (Data Gateway) |
| Authentication | Configure | Automatic mTLS |
Netflix's conclusion:
"WAL is an abstraction over underlying queues, so the underlying technology can be changed per use case without code changes. WAL emphasizes a simple but effective API that saves users from complicated setups and configurations."
Conclusion: Lessons from the Giants
From Incident to Innovation
Remember: a simple ALTER TABLE command that could have cost millions and affected millions of users.
What made the difference?
- A well-sized cache with flexible TTL
- A Write-Ahead Log capturing all mutations
- A prepared team with runbooks for this type of incident
- Resilient architecture treating cache as protection, not just optimization
This incident perfectly illustrates what we explored in this series: caching isn't just about performance, it's about system resilience.
The Perfect Cache Doesn't Exist
Every strategy has tradeoffs:
- TTL → Can serve stale data
- LRU → Can evict important data
- Write-Through → Write latency
- Write-Behind → Risk of loss (without WAL)
- WAL → Infrastructure complexity
The best cache is the one adapted to YOUR use case.
And as Netflix demonstrated in 2025: the best cache is the one that saves you when everything goes wrong at 11:30 on a Tuesday morning.
When to Adopt WAL?
Adopt WAL if:
- Your data is critical (financial, healthcare, user profiles)
- You can't tolerate ANY data loss
- You need to replicate across geographic regions
- You have complex operations (multi-table, atomic)
Simple Write-Behind is sufficient if:
- Non-critical data (logs, analytics, metrics)
- Loss of a few entries acceptable
- Simple infrastructure (1 region, 1 datacenter)
- You're starting out (start simple, evolve later)
Further Reading
Recommended Resources
Official Engineering Blogs:
- Netflix TechBlog: https://netflixtechblog.com
- Facebook Engineering: https://engineering.fb.com
- Twitter Engineering: https://blog.x.com/engineering
- Spotify Engineering: https://engineering.atspotify.com
- LinkedIn Engineering: https://engineering.linkedin.com
Academic Papers:
- TAO: Facebook's Distributed Data Store (USENIX)
- CacheSack: Admission Algorithms for Flash Caches (Google)
- Spanner: Google's Globally-Distributed Database
Open Source Tools:
- Redis: https://redis.io
- Memcached: https://memcached.org
- EVCache (Netflix): https://github.com/Netflix/EVCache
Acknowledgments
All information in this article is based on verifiable public sources:
- Official company engineering blogs
- Published academic papers
- Technical conferences (QCon, USENIX, etc.)
- Official system documentation
Special mention: Netflix article "Building a Resilient Data Platform with Write-Ahead Log at Netflix" (September 2025) by Prudhviraj Karumanchi, Samuel Fu, Sriram Rangarajan, Vidhya Arvind, Yun Wang, and John Lu, which provided exceptionally rich details on the ALTER TABLE incident and complete WAL architecture.
Big thanks to the engineering teams sharing their practices with the community!