From Write-Behind to Write-Ahead Log: How Netflix guarantees zero data loss at global scale
This article is the continuation of Part 1 - The Fundamentals. If you haven't read the first part, I recommend starting there to understand caching basics.
Table of Contents
- Recap: The Netflix Incident
- Why Write-Behind Isn't Enough Anymore
- The Write-Ahead Log (WAL)
- Real-World WAL Use Cases at Netflix
- Write-Behind vs WAL: Comparison
- Incident Resolution: Minute by Minute
- Lessons Learned by Netflix
- WAL vs Using Kafka/SQS Directly
- Conclusion
Recap: The Netflix Incident
Netflix, Production Incident (Reported September 2025)
A developer types ALTER TABLE user_preferences...
Three seconds later: massive database corruption.
Transparency note: Netflix hasn't publicly disclosed the exact number of affected records. The incident demonstrated the critical importance of their cache + WAL architecture, but specific numbers aren't verifiable from public sources.
Result: Zero customer complaints, zero downtime, zero data loss.
How?
Thanks to two silent technologies:
- A cache with extendable TTL
- A Write-Ahead Log (WAL) that had captured all mutations
In this Part 2, we'll break down exactly how Netflix transformed classic Write-Behind into enterprise-grade critical architecture.
Why Write-Behind Isn't Enough Anymore
The Context: 6 Critical Challenges at Netflix Scale
In 2024-2025, Netflix was facing recurring challenges causing production incidents:
- Accidental data loss and corruption in databases
- System entropy between different datastores (Cassandra and Elasticsearch becoming inconsistent)
- Multi-partition updates (e.g., building secondary indexes on NoSQL)
- Data replication (in-region and cross-region)
- Reliable retry mechanisms for real-time pipelines at scale
- Mass deletions causing OOM (Out Of Memory) on Key-Value nodes
Direct quote from Netflix article (September 2025):
"During a particular incident, a developer executed an ALTER TABLE command that caused data corruption. Fortunately, the data was protected by cache, so the ability to quickly extend cache TTL combined with the application writing mutations to Kafka allowed us to recover. Without the application's resilience features, there would have been permanent data loss."
The Problem with Traditional Write-Behind
Application → Cache (instant)
↓
Async Queue → Database (later)
What happens if:
- The message queue crashes before writing to DB?
- The database is corrupted?
- You need to replicate across 4 geographic regions?
Answer: Data loss.
Classic Write-Behind wasn't enough for Netflix anymore. They needed a solution with enterprise-grade durability guarantees.
The Write-Ahead Log (WAL)
Fundamental Principle of WAL
Netflix developed a generic WAL system that transforms Write-Behind into enterprise-grade critical architecture.
Application
↓
1. DURABLE write to Kafka (Write-Ahead Log)
↓
2. Only after confirmation → write to Cache
↓
3. Consumers read from Kafka → write to DB
↓
4. On failure → automatic infinite retry until success
Guarantee: zero data loss, even in catastrophic scenarios.
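To make the ordering concrete, here is a minimal Python sketch of a durability-first write path. It assumes the confluent-kafka client and a generic cache object; the topic name and the cache interface are illustrative, not Netflix's actual internals.
import json
from confluent_kafka import Producer  # assumed Kafka client; any durable log would do

producer = Producer({"bootstrap.servers": "kafka:9092", "acks": "all"})

def write_mutation(key, value, cache, topic="wal-mutations"):
    """Durability first: the cache is only touched after the log write is acknowledged."""
    # 1. DURABLE write to the log (Kafka).
    producer.produce(topic, key=key.encode("utf-8"), value=json.dumps(value).encode("utf-8"))
    producer.flush()  # blocks until delivery reports arrive (error callbacks omitted for brevity)

    # 2. Only after the log confirms durability do we write to the cache.
    cache.set(key, value)

    # 3. Downstream consumers read the same topic and apply the mutation
    #    to the database, retrying until it succeeds.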
Classic Write-Behind vs Netflix WAL: The Architectural Difference
Classic Write-Behind (cache-first approach):
Application
↓ 1. INSTANT write
Cache (volatile memory)
↓ 2. ASYNCHRONOUS write (non-durable queue)
Database
RISK: a crash between steps 1 and 2 → DATA LOSS
Netflix WAL (durability-first approach):
Application
↓ 1. DURABLE write
Kafka (Write-Ahead Log)
↓ 2. PARALLEL write after Kafka confirmation
├──→ Cache
├──→ Database
└──→ Other consumers
Guarantee: even in case of crash → ZERO LOSS (replay from Kafka)
The fundamental difference:
- Write-Behind = performance optimization (cache first)
- WAL = durability guarantee (durable log first)
Netflix inverted the priorities: durability before speed.
WAL Architecture for Global EVCache Replication
Here's how Netflix synchronizes its cache across the world:
┌──────────────────────┐
│ EVCache Client │ (Region US-WEST)
│ Application writes │
└──────────┬───────────┘
│
↓ Write mutations to Kafka (WAL)
│
┌──────────┴───────────────────────────────────┐
│ Kafka Topics (Durable WAL) │
│ • Sequence numbers for guaranteed order │
│ • Configurable retention │
│ • Internal Kafka replication │
└──────────┬───────────────────────────────────┘
│
├────────────┬─────────────┬─────────────┐
↓ ↓ ↓ ↓
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│Consumer │ │Consumer │ │Consumer │ │Consumer │
│ US-EAST │ │ EU │ │ APAC │ │ LATAM │
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
│ │ │ │
↓ ↓ ↓ ↓
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Writer │ │ Writer │ │ Writer │ │ Writer │
│ Groups │ │ Groups │ │ Groups │ │ Groups │
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
│ │ │ │
↓ ↓ ↓ ↓
EVCache EVCache EVCache EVCache
Servers Servers Servers Servers
(Regional) (Regional) (Regional) (Regional)
Detailed flow:
1. The application (region US-WEST) writes a mutation: SET user:123 = {...}
2. The WAL Producer writes to Kafka with:
   - Key: user:123
   - Value: data + metadata
   - Sequence number: 12,847,392
   - Timestamp: 2025-02-16T10:32:45Z
3. Four regional consumers (US-EAST, EU, APAC, LATAM):
   - Read from the same Kafka topic
   - Consume in parallel and independently
   - Each maintains its own offset
4. Local Writer Groups:
   - Receive mutations
   - Write to their region's EVCache servers
   - Retry on failure
Result: one write in US-WEST is automatically and reliably replicated across 4 regions.
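As a rough sketch of the consumer side, each region could run a loop like the one below (again with confluent-kafka). The consumer group name, topic, and the stub cache client are placeholders; the real Writer Groups are internal to Netflix.
import json
from confluent_kafka import Consumer

class RegionalCacheStub:
    """Stand-in for this region's EVCache client (illustrative only)."""
    def set(self, key, value):
        print(f"cache SET {key} -> {value}")

evcache_client = RegionalCacheStub()

# One consumer group per region, so each region tracks its own offset.
consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "evcache-replicator-eu-west-1",  # placeholder group name
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,                 # commit only after the cache write succeeds
})
consumer.subscribe(["evcache_foobar"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    evcache_client.set(msg.key().decode("utf-8"), json.loads(msg.value()))
    consumer.commit(msg)  # advance the offset only once the regional write is done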
The WAL API: Intentional Simplicity
One of the strengths of Netflix's WAL is its extremely simple API. Here's the main endpoint:
rpc WriteToLog (WriteToLogRequest) returns (WriteToLogResponse)
// Request
message WriteToLogRequest {
string namespace = 1; // Identifier for a particular WAL
Lifecycle lifecycle = 2; // Delay and original write timestamp
bytes payload = 3; // Message content
Target target = 4; // Where to send the payload
}
// Response
message WriteToLogResponse {
Trilean durable = 1; // SUCCESS / FAILED / UNKNOWN
string message = 2; // Failure reason
}
Why this simplicity?
- Easy onboarding for teams
- Complete abstraction of underlying implementation
- Flexibility via the "namespace" concept
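To give a feel for the call site, here is roughly what invoking that endpoint could look like from Python once the proto is compiled with grpcio-tools. The module names, service name, and address below are assumptions: the service is internal to Netflix and its generated stubs are not public.
import grpc
import wal_pb2        # hypothetical module generated from the proto above
import wal_pb2_grpc   # hypothetical generated stub module

channel = grpc.insecure_channel("dgwwal.example.netflix.net:8980")  # placeholder address
stub = wal_pb2_grpc.WALServiceStub(channel)  # assumed service name

request = wal_pb2.WriteToLogRequest(
    namespace="pds",             # selects which WAL configuration handles this message
    payload=b'{"user": "123"}',  # opaque bytes, interpreted by the target
)
response = stub.WriteToLog(request)

if response.durable == wal_pb2.Trilean.SUCCESS:
    print("mutation durably logged")
else:
    print("not durable:", response.message)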
The 3 WAL Personas
Netflix's WAL can adopt 3 different personas depending on namespace configuration.
Persona #1: Delayed Queue
Example configuration (Product Data Systems):
{
"namespace": "pds",
"persistenceConfiguration": {
"physicalStorage": {
"type": "SQS"
},
"config": {
"wal-queue": ["dgwwal-dq-pds"],
"wal-dlq-queue": ["dgwwal-dlq-pds"],
"queue.poll-interval.secs": 10,
"queue.max-messages-per-poll": 100
}
}
}
Usage:
# Send a message that will be delivered in 3600 seconds (1h)
wal.write(
namespace="pds",
payload=message,
delay=3600
)
Backend: SQS (Amazon Simple Queue Service)
Persona #2: Generic Cross-Region Replication
Example configuration (EVCache):
{
"namespace": "evcache_foobar",
"persistenceConfiguration": {
"physicalStorage": {
"type": "KAFKA"
},
"config": {
"consumer_stack": "consumer",
"target": {
"us-east-1": "dgwwal.foobar.cluster.us-east-1.netflix.net",
"us-east-2": "dgwwal.foobar.cluster.us-east-2.netflix.net",
"us-west-2": "dgwwal.foobar.cluster.us-west-2.netflix.net",
"eu-west-1": "dgwwal.foobar.cluster.eu-west-1.netflix.net"
},
"wal-kafka-topics": ["evcache_foobar"],
"wal-kafka-dlq-topics": []
}
}
}
Usage:
# Write to EVCache in region US-WEST-2
evcache.set("user:123", user_data)
# WAL automatically replicates to:
# → US-EAST-1
# → US-EAST-2
# → EU-WEST-1
Backend: Kafka
Persona #3: Multi-Partition Mutations (2-Phase Commit)
Example configuration (Key-Value):
{
"namespace": "kv_foobar",
"persistenceConfiguration": {
"physicalStorage": {
"type": "KAFKA"
},
"config": {
"durable_storage": {
"type": "kv",
"namespace": "foobar_wal_type",
"shard": "walfoobar"
},
"wal-kafka-topics": ["foobar_kv_multi_id"],
"wal-kafka-dlq-topics": ["foobar_kv_multi_id-dlq"]
}
}
}
Usage:
# Single request that modifies multiple tables/partitions
kv.mutate_items([
PutItem(table="users", id="123", data=user_data),
PutItem(table="profiles", id="123", data=profile_data),
DeleteItem(table="cache", id="old:123")
])
# WAL guarantees ALL operations will eventually succeed
Backend: Kafka + Durable Storage (for 2-phase commit)
Key detail: presence of durable_storage enables 2-phase commit semantics.
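The sketch below is one rough interpretation of how a durable intent record plus the WAL can provide those semantics; it is not Netflix's actual implementation, and every name in it is made up.
import json
import uuid

def mutate_items(operations, durable_store, wal_producer, topic):
    """Illustrative only: persist the full intent first, then fan out the operations."""
    txn_id = str(uuid.uuid4())

    # Phase 1 (prepare): durably record the complete set of operations as one unit.
    durable_store.put(f"txn:{txn_id}", json.dumps({"status": "PENDING", "ops": operations}))

    # Phase 2 (commit): publish each operation to the WAL; consumers apply and
    # retry each one independently until every operation has succeeded.
    for op in operations:
        wal_producer.produce(topic, key=txn_id.encode(), value=json.dumps(op).encode())
    wal_producer.flush()

    # In practice the consumers (or a cleanup job) mark the intent record
    # COMPLETED once all operations have been applied.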
Real-World WAL Use Cases at Netflix
The generic WAL isn't just for EVCache. Netflix uses it for:
1. Queues with Intelligent Retries
Mutation failed → WAL
↓
Exponential backoff retry
↓
Retry until success (or DLQ after X attempts)
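A toy version of such a retry loop, with exponential backoff, jitter, and a dead-letter hand-off, might look like this (thresholds and helper names are arbitrary):
import random
import time

def apply_with_retries(mutation, apply_fn, dead_letter_fn,
                       max_attempts=8, base_delay=0.5, max_delay=60.0):
    """Retry a failed mutation with exponential backoff, dead-lettering it after max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            apply_fn(mutation)
            return True
        except Exception as exc:
            if attempt == max_attempts:
                dead_letter_fn(mutation, reason=str(exc))
                return False
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay / 2))  # jitter avoids synchronized retries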
2. Cross-Region Replication (EVCache Global)
- 4 synchronized geographic regions
- Replication latency: a few seconds
- Guaranteed eventual consistency
3. Multi-Partition / Multi-Table Mutations
Complex transaction:
1. Write to Table A (partition 1)
2. Write to Table B (partition 7)
3. Update Cache
With WAL:
- Two-phase commit semantics
- Atomic guarantee
- Automatic rollback on partial failure
4. Database Failure Protection
Catastrophe scenario:
11:30 - Cassandra database becomes unavailable
11:31 - Applications continue writing to WAL (Kafka)
13:00 - Cassandra comes back online
13:01 - WAL automatically replays all missed mutations
13:15 - System 100% synchronized, ZERO data loss
Write-Behind vs WAL: Comparison
| Aspect | Classic Write-Behind | Netflix WAL | Winner |
|---|---|---|---|
| Durability | Not guaranteed (memory queue) | Strong guarantee (Kafka) | WAL |
| Failure Resilience | Possible loss | No loss | WAL |
| Retries | Manual/basic | Automatic/intelligent | WAL |
| Cross-Region | Not natively supported | Native multi-region support | WAL |
| Operation Ordering | Can be lost | Preserved (sequence numbers) | WAL |
| Complexity | Simple | Complex (Kafka, consumers, etc.) | Write-Behind |
| Write Latency | Ultra-fast (<1ms) | Fast (~5-10ms Kafka) | Write-Behind |
| Infrastructure | Minimal | Heavy (Kafka cluster, consumers) | Write-Behind |
Conclusion: WAL sacrifices some simplicity and latency to gain enterprise-grade durability guarantees.
When NOT to Use WAL
The Netflix WAL is powerful but comes with significant costs. Here's when it's over-engineered:
Don't Use WAL If:
1. Startup / Small Team (< 10 people)
- Managed Kafka infrastructure cost (AWS MSK, Confluent Cloud): 500€-2000€/month minimum
- Operational complexity: monitoring, consumer tuning, DLQ management
- Development time: 2-4 weeks implementation
- Alternative: Simple Write-Behind with SQS/RabbitMQ queue is sufficient
2. Non-Critical Data
- Logs, analytics, metrics, tracking events
- Loss of a few entries is acceptable
- Alternative: Simple Write-Behind or even fire-and-forget
3. Critical Latency (< 5ms required)
- WAL adds 5-10ms latency (Kafka round-trip)
- Real-time gaming, high-frequency trading
- Alternative: Write-Behind + asynchronous replication
4. Simple Infrastructure / Single-Region
- No geographic replication needed
- Single datacenter
- Alternative: Cache-Aside + regular backups is sufficient
5. Limited Budget
- Infrastructure: Kafka cluster (3+ brokers) + Zookeeper/KRaft
- Operations: DevOps expertise required
- Alternative: Simple managed services (Redis Cloud + RDS with replication)
When WAL Becomes NECESSARY:
- Critical data (finance, healthcare, user profiles)
- Zero tolerance for data loss
- Multi-region replication mandatory
- Complex operations (multi-table, atomic)
- Mature infrastructure with dedicated DevOps team
- Infrastructure budget > 5000€/month
Golden rule: Start simple (Cache-Aside + Write-Behind), evolve to WAL when your durability constraints justify it.
Incident Resolution: Minute by Minute
Back to our ALTER TABLE corruption incident. Here's exactly what happened:
Step 1: Detection (T+3 seconds)
Alert: Database corrupted
Status: Millions of records affected
Severity: CRITICAL
Step 2: Immediate Protection (T+30 seconds)
# Extend cache TTL to buy time
cache.extend_ttl("user_preferences:*", ttl=7200) # 2 hours
# Users continue to be served by cache
# No one notices the problem
Step 3: Recovery (T+5 minutes)
# Identify last healthy Kafka offset
last_good_offset = kafka.find_offset_before(corruption_timestamp)
# Isolate corrupted database
database.set_read_only()
# Restore from backup
database.restore_from_snapshot(timestamp=corruption_timestamp - 60)
Step 4: Replay (T+15 minutes)
# Replay all mutations from WAL
wal.replay_from_offset(
start_offset=last_good_offset,
target_database=database,
verify=True # Integrity verification
)
# Verify consistency
assert database.count() == expected_count
assert cache.check_consistency(database) == True
Step 5: Back to Normal (T+20 minutes)
# Reset TTL to normal
cache.reset_ttl("user_preferences:*", ttl=3600) # 1 hour
# Re-enable writes
database.set_read_write()
# Check service
monitoring.check_all_metrics() # ALL GREEN
Final result:
- Zero data loss
- Zero service interruption
- Recovery time: 20 minutes
- Business impact: $0
Without cache + WAL:
- Millions of users affected
- Several hours of interruption
- Customer data loss
- Business impact: tens of millions of dollars
Lessons Netflix Learned Building WAL
Netflix publicly shared the key lessons from this project:
1. Pluggable Architecture Is Fundamental
"The ability to support different targets — databases, caches, queues, or upstream applications — via configuration rather than code changes has been fundamental to WAL's success."
Concrete example:
Same API, different backends per use case:
- Delayed Queue → SQS
- Cross-Region Replication → Kafka
- Multi-Partition → Kafka + Durable Storage
Backend change = config change, not code!
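In code terms, pluggability can be as simple as a factory that reads the physicalStorage type from the namespace configuration; the classes below are stand-ins, not Netflix's code.
class SqsStorage:
    def __init__(self, config): self.config = config   # stand-in SQS-backed delayed queue

class KafkaStorage:
    def __init__(self, config): self.config = config   # stand-in Kafka-backed durable log

STORAGE_BACKENDS = {"SQS": SqsStorage, "KAFKA": KafkaStorage}

def build_wal(namespace_config):
    """Select the physical storage from configuration rather than code."""
    storage_type = namespace_config["persistenceConfiguration"]["physicalStorage"]["type"]
    return STORAGE_BACKENDS[storage_type](namespace_config)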
2. Reuse Existing Building Blocks
"We already had control plane infrastructure, Key-Value abstractions, and other components in place. Building on top of these existing abstractions allowed us to focus on the unique challenges WAL needed to solve."
Lesson for your project:
Don't reinvent the wheel. If your company already has:
- A messaging system (Kafka, RabbitMQ)
- A database abstraction
- A monitoring system
Build ON TOP rather than redoing everything from scratch.
3. Separation of Concerns = Scalability
"By separating message processing from consumption, and allowing independent scaling of each component, we can handle traffic spikes and failures more gracefully."
Netflix WAL architecture:
Producer Group (independent scaling)
↕ Auto-scale based on CPU/Network
Queue (Kafka/SQS)
↕ Auto-scale based on CPU/Network
Consumer Group (independent scaling)
If producers are overloaded → scale just producers.
If consumers are slow → scale just consumers.
4. Systems Fail — Understand Tradeoffs
"WAL itself has failure modes, including traffic spikes, slow consumers, and non-transient errors. We use abstractions and operational strategies like data partitioning and backpressure signals to manage this, but tradeoffs must be understood."
WAL failure modes identified by Netflix:
1. Traffic Surge
   - Problem: 10x normal traffic suddenly
   - Solution: automatic load shedding + backpressure
2. Slow Consumer
   - Problem: one consumer processes 10x more slowly
   - Solution: automatic scaling + DLQ for problematic messages
3. Non-Transient Errors
   - Problem: a mutation always fails (e.g., DB constraint violated)
   - Solution: DLQ after X attempts + operator alerts
4. Queue Lag Building Up
   - Problem: messages accumulate faster than processed
   - Solution: lag monitoring + proactive auto-scaling
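A toy policy tying those mitigations together might look like this (the thresholds are invented for illustration):
def backpressure_action(queue_lag, max_lag=100_000):
    """Decide how to react to consumer lag (illustrative thresholds)."""
    if queue_lag > max_lag:
        return "SHED_LOAD"        # apply backpressure: reject or delay new writes
    if queue_lag > max_lag // 2:
        return "SCALE_CONSUMERS"  # proactively add consumer instances
    return "OK"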
The fundamental tradeoff accepted:
Eventual Consistency (a few seconds of delay)
VS
Immediate Consistency (data always up-to-date)
Netflix chose: Eventual Consistency
Why? Performance + Zero Data Loss
WAL vs Using Kafka/SQS Directly
Legitimate question: why not just use Kafka directly?
Netflix's answer:
| Aspect | Kafka/SQS Direct | Netflix WAL |
|---|---|---|
| Initial setup | Complex (configs, topics, consumers, DLQ, monitoring) | Simple (1 API call) |
| Backend change | Code rewrite | Config change |
| Retry logic | Must implement yourself | Built-in with exponential backoff |
| DLQ | Manually configure | Default for each namespace |
| Cross-region | Must architect yourself | Ready-to-use persona |
| 2-Phase Commit | Implement from scratch | Persona with durable storage |
| Monitoring | Build yourself | Integrated (Data Gateway) |
| Authentication | Configure | Automatic mTLS |
Netflix's conclusion:
"WAL is an abstraction over underlying queues, so the underlying technology can be changed per use case without code changes. WAL emphasizes a simple but effective API that saves users from complicated setups and configurations."
Conclusion: Lessons from the Giants
From Incident to Innovation
Remember: a simple ALTER TABLE command that could have cost millions and affected millions of users.
What made the difference?
- A well-sized cache with flexible TTL
- A Write-Ahead Log capturing all mutations
- A prepared team with runbooks for this type of incident
- Resilient architecture treating cache as protection, not just optimization
This incident perfectly illustrates what we explored in this series: caching isn't just about performance, it's about system resilience.
The Perfect Cache Doesn't Exist
Every strategy has tradeoffs:
- TTL → Can serve stale data
- LRU → Can evict important data
- Write-Through → Write latency
- Write-Behind → Risk of loss (without WAL)
- WAL → Infrastructure complexity
The best cache is the one adapted to YOUR use case.
And as Netflix demonstrated in 2025: the best cache is the one that saves you when everything goes wrong at 11:30 on a Tuesday morning.
When to Adopt WAL?
Adopt WAL if:
- Your data is critical (financial, healthcare, user profiles)
- You can't tolerate ANY data loss
- You need to replicate across geographic regions
- You have complex operations (multi-table, atomic)
Simple Write-Behind is sufficient if:
- Non-critical data (logs, analytics, metrics)
- Loss of a few entries acceptable
- Simple infrastructure (1 region, 1 datacenter)
- You're starting out (start simple, evolve later)
Further Reading
Recommended Resources
Official Engineering Blogs:
- Netflix TechBlog: https://netflixtechblog.com
- Facebook Engineering: https://engineering.fb.com
- Twitter Engineering: https://blog.x.com/engineering
- Spotify Engineering: https://engineering.atspotify.com
- LinkedIn Engineering: https://engineering.linkedin.com
Academic Papers:
- TAO: Facebook's Distributed Data Store (USENIX)
- CacheSack: Admission Algorithms for Flash Caches (Google)
- Spanner: Google's Globally-Distributed Database
Open Source Tools:
- Redis: https://redis.io
- Memcached: https://memcached.org
- EVCache (Netflix): https://github.com/Netflix/EVCache
Acknowledgments
All information in this article is based on verifiable public sources:
- Official company engineering blogs
- Published academic papers
- Technical conferences (QCon, USENIX, etc.)
- Official system documentation
Special mention: Netflix article "Building a Resilient Data Platform with Write-Ahead Log at Netflix" (September 2025) by Prudhviraj Karumanchi, Samuel Fu, Sriram Rangarajan, Vidhya Arvind, Yun Wang, and John Lu, which provided exceptionally rich details on the ALTER TABLE incident and complete WAL architecture.
Big thanks to the engineering teams sharing their practices with the community!