How tech giants (Netflix, Facebook, Google, Twitter) serve billions of requests per second using caching
Table of Contents
- The Incident That Changed Everything
- Why Caching Is Not Optional
- The 6 Fundamental Strategies
- Comparison Table
- How Giants Use Caching
- Real-World Challenges
- The Invalidation Problem
- Getting Started Guide
- Essential Metrics
The Incident That Changed Everything
Netflix, Production Incident (Reported September 2025)
An experienced developer types an ALTER TABLE command in their terminal. This is routine work, something they've done hundreds of times. They hit Enter.
ALTER TABLE user_preferences...
Three seconds later, the alert fires.
Dashboards light up red. The primary database has just suffered massive corruption. Critical user preference data (profiles, watch lists, personalized recommendations) has become unusable.
In a typical company, this is where you start calculating the millions of dollars this incident will cost. Where careers can hang in the balance.
But at Netflix, something unexpected happens.
No customer noticed anything. No complaints, no service interruption. 200+ million subscribers kept watching their shows peacefully.
How is this possible?
Two silent technologies saved the day:
- A cache continuing to serve valid data
- A Write-Ahead Log (WAL) that had captured all mutations before the corruption
Engineers simply extended the cache TTL, replayed mutations from Kafka, cleaned up the corruption, and resumed operations. Result: zero data loss, zero downtime.
Transparency note: Netflix hasn't publicly disclosed the exact number of affected records or full incident details. Information comes from their official blog post (September 2025) demonstrating the critical importance of their cache + WAL architecture for resilience.
Why Caching Is Not Optional
This incident proves that caching isn't just a performance optimization. It's a critical protection layer that can mean the difference between a minor incident and a multi-million dollar catastrophe.
In this two-part series, we'll explore:
- Part 1: Fundamental strategies every developer should know
- Part 2: Enterprise-grade advanced architectures (WAL, multi-region, resilience)
The 6 Fundamental Strategies
1. TTL (Time-To-Live) - Temporal Expiration
TTL defines how long data remains valid in cache before being automatically deleted or refreshed.
Implementation example:
```python
# Redis with TTL
cache.set("user:123", user_data, ttl=3600)  # Expires after 1 hour
```
Ideal use cases:
- Weather data (hourly refresh)
- News feeds (updated every 5 minutes)
- Product prices (daily changes)
- User sessions
TTL is universal. Every major tech company uses it in some form.
Important: TTL and eviction policies work together
In production, TTL and LRU/LFU operate simultaneously in Redis/Memcached:
```python
# Redis configuration: maxmemory-policy allkeys-lru
cache.set("user:123", data, ttl=3600)
# This item expires in 1 hour OR is evicted earlier if the cache is full (LRU)
```
Data can disappear from cache for two reasons:
- TTL expired: time elapsed (3600 seconds in the example)
- Eviction: cache full, least recently used item removed (LRU)
This combination ensures both data freshness (TTL) and optimal memory usage (LRU).
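The interplay between TTL and LRU can be illustrated with a toy cache (a hypothetical `TTLLRUCache` class, not the actual Redis internals): entries expire after their TTL, and the least recently used entry is evicted when capacity is reached.

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Toy cache combining TTL expiration with LRU eviction (illustration only)."""

    def __init__(self, capacity, ttl):
        self.capacity = capacity
        self.ttl = ttl
        self.store = OrderedDict()  # key -> (value, expiry timestamp)

    def set(self, key, value):
        if key in self.store:
            del self.store[key]
        elif len(self.store) >= self.capacity:
            self.store.popitem(last=False)  # evict the least recently used key
        self.store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key):
        item = self.store.get(key)
        if item is None:
            return None                     # miss: never cached or already evicted
        value, expires_at = item
        if time.monotonic() > expires_at:
            del self.store[key]             # miss: TTL expired
            return None
        self.store.move_to_end(key)         # hit: mark as most recently used
        return value
```

With `capacity=2`, inserting a third key evicts the least recently used one even if its TTL has not expired, which is exactly the "two reasons data can disappear" described above.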
2. LRU (Least Recently Used) - Priority to Recent Items
When cache is full, LRU removes the least recently accessed data. It's like organizing your desk: you keep what you use often within reach.
Visual workflow:
Cache (capacity: 3 items)
1. Access A → [A]
2. Access B → [A, B]
3. Access C → [A, B, C]
4. Access D → [B, C, D] // A removed (oldest)
5. Access B → [C, D, B] // B moves to front
Ideal use cases:
- Web pages (repeated navigation)
- Active user sessions
- Browsing history
Used in production by: Netflix (EVCache with client-side LRU)
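Python's standard library ships this exact policy as `functools.lru_cache`; a minimal sketch reproducing the workflow above (`fetch_page` is a stand-in for an expensive lookup):

```python
from functools import lru_cache

@lru_cache(maxsize=3)  # the cache holds at most 3 distinct arguments
def fetch_page(url):
    # Stand-in for an expensive operation; real code would hit the network or DB.
    return f"content of {url}"

# Same access sequence as the visual workflow: D evicts A, then B is a hit
for url in ["A", "B", "C", "D", "B"]:
    fetch_page(url)

info = fetch_page.cache_info()  # hits=1 (B), misses=4 (A, B, C, D), currsize=3
```

`cache_info()` exposes hit/miss counters for free, which is handy when you later start tracking the metrics covered at the end of this article.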
3. LFU (Least Frequently Used) - Priority to Popularity
LFU keeps the most frequently requested data, regardless of last access time.
LRU vs LFU difference:
- LRU: "When did you last use this?"
- LFU: "How many times have you used this total?"
Concrete example:
Data: A (used 10x), B (used 2x), C (used 5x)
Cache full → Remove B (least frequent)
Ideal use cases:
- E-commerce best-sellers
- Viral content with lasting popularity
- Repetitive search queries
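The eviction rule from the concrete example above can be sketched with a toy `LFUCache` (hypothetical class; real LFU implementations use more efficient bookkeeping than a linear `min` scan):

```python
class LFUCache:
    """Toy LFU: evicts the key with the lowest access count (ties broken arbitrarily)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = {}
        self.counts = {}  # key -> total number of accesses (set + get)

    def get(self, key):
        if key not in self.data:
            return None
        self.counts[key] += 1
        return self.data[key]

    def set(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            victim = min(self.counts, key=self.counts.get)  # least frequently used
            del self.data[victim]
            del self.counts[victim]
        self.data[key] = value
        self.counts[key] = self.counts.get(key, 0) + 1
```

Replaying the article's example (A used 10x, B used 2x, C used 5x) and inserting a fourth item evicts B, the least frequent, regardless of when B was last touched.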
4. Write-Through vs Write-Behind - Write Strategies
Write-Through (Synchronous Write)
Application writes to cache AND database simultaneously.
```python
def save_user(user):
    cache.set(f"user:{user.id}", user)
    database.save(user)  # Both writes complete before returning (synchronous)
```
Pros: guaranteed data consistency
Cons: higher write latency
Use case: banking, financial transactions, critical data
Used by: Facebook TAO (synchronous cache + DB writes)
Write-Behind / Write-Back (Asynchronous Write)
Application writes to cache first, then to database asynchronously.
```python
def save_user(user):
    cache.set(f"user:{user.id}", user)
    queue.add_job("save_to_db", user)  # Async persistence via a message queue
```
Pros: ultra-fast writes
Cons: risk of loss if crash before DB save
Use case: logs, analytics, non-critical metrics
Important note: Simple Write-Behind has production limitations. In Part 2, we'll see how Netflix transformed it into Write-Ahead Log (WAL) for enterprise-grade durability guarantees.
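A minimal write-behind pipeline can be sketched with a queue and a background worker. Everything here (`cache`, `db`, `write_queue`) is an in-memory stand-in, not a real cache or broker, and the sketch deliberately shows where the durability gap lies:

```python
import queue
import threading

cache = {}                    # stand-in for Redis/Memcached
db = {}                       # stand-in for the real database
write_queue = queue.Queue()   # stand-in for a message queue

def save_user(user_id, user_data):
    cache[f"user:{user_id}"] = user_data   # fast, synchronous cache write
    write_queue.put((user_id, user_data))  # DB write deferred to the worker

def db_writer():
    # Background worker: drains the queue and persists entries to the database.
    while True:
        item = write_queue.get()
        if item is None:              # shutdown sentinel
            break
        user_id, user_data = item
        db[user_id] = user_data       # a crash before this line means data loss
        write_queue.task_done()

worker = threading.Thread(target=db_writer, daemon=True)
worker.start()
```

The caller returns as soon as the cache write lands; the comment in `db_writer` marks the exact window where a crash loses data, which is the limitation WAL addresses in Part 2.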
5. Cache-Aside (Lazy Loading) - The Most Common Pattern
This is the dominant strategy in the industry. The application manages the cache itself.
```python
def get_user(user_id):
    # 1. Check cache
    user = cache.get(f"user:{user_id}")
    if user:
        return user  # Cache HIT

    # 2. Not in cache? Fetch from DB (cache MISS)
    user = database.get_user(user_id)

    # 3. Store in cache for next time
    cache.set(f"user:{user_id}", user, ttl=3600)
    return user
```
Used by: Netflix, Spotify, Twitter, and most web applications
6. Read-Through Cache - Delegation to Cache
The cache itself automatically manages database reads (transparent to the application).
```python
# The application simply asks the cache
user = cache.get("user:123")
# The cache fetches from the DB automatically on a miss
```
Used by: Facebook (evolution of their architecture)
Comparison Table
| Strategy | Pros | Cons | Use Case |
|---|---|---|---|
| TTL | Simple, predictable | May serve stale data | Weather, news |
| LRU | Adapts to temporal patterns | May evict important data | Sessions, navigation |
| LFU | Keeps popular data | More complex to implement | Best-sellers |
| Write-Through | Guaranteed consistency | Write latency | Banking, critical data |
| Write-Behind | Very fast | Risk of loss | Logs, analytics |
| Cache-Aside | Flexible, full control | App manages logic | Most cases |
| Read-Through | Transparent to app | Requires middleware | Complex systems |
How Giants Use Caching
Netflix - EVCache: Billions of Requests/Second
Infrastructure:
- Distributed cache based on Memcached
- Combined strategies: TTL + LRU + Cache-Aside
- Geographic replication across 4 global regions
- Some clusters with 2 copies, others with 9 (depending on criticality)
Verified performance:
- Handles billions of requests per second
- Cache warming: reduced 45 GB/s → 100 MB/s network traffic
Multi-tier architecture:
L1: Local memory cache (client-side LRU)
↓
L2: EVCache distributed (TTL)
↓
L3: Multi-zone replication
↓
Database
Key lesson: Netflix pre-calculates and pre-loads cache before putting servers in production (cache warming).
Facebook/Meta - TAO: 1 Billion Reads/Second
Architectural evolution:
- Phase 1: Memcache + MySQL (look-aside, i.e. Cache-Aside)
- Phase 2: TAO (The Associations and Objects) - abstraction layer
- Current strategy: Write-Through (synchronous cache + DB writes)
Verified performance:
- 96.4% hit rate on reads
- Over 1 billion read requests/second
- Millions of writes/second
Technical innovation: "Leases"
To avoid the thundering herd problem (massive rush when cache expires):
- Only one request can hit the database every 10 seconds per key
- Other requests wait or retrieve the freshly calculated value
Concrete result: reduction from 17,000 req/s → 1,300 req/s to database during peaks.
Twitter/X - Manhattan + Redis: Consistency at Scale
Infrastructure:
- Manhattan (distributed key-value store)
- Redis (Haplo) as primary cache for Timeline
- Strategy: Cache-Aside + eventual consistency by default
Verified performance:
- 320 million packets/second
- 120 GB/s network throughput
- Tens of millions of read QPS
- Cache represents only 3% of infrastructure but is critical
Particularity: strong consistency option available via consensus for critical data.
Google - Bigtable + Spanner: Multi-Tier Cache
Sophisticated architecture:
L1: Row cache (in-memory) → Reduces CPU by 25%
↓
L2: Block cache (local SSD)
↓
L3: Colossus Flash Cache (datacenter)
↓
Persistent storage
Verified performance:
- Bigtable: 17,000 point reads/second per node (1.7x improvement)
- Colossus Flash Cache: over 5 billion requests/second
- Spanner automatically caches query execution plans
Innovation: CacheSack
Intelligent admission algorithm for flash cache that optimizes total cost of ownership (TCO).
Real-World Challenges
1. The Thundering Herd
The problem:
When a popular key expires, thousands of requests simultaneously hit the database.
Cache expires at 12:00:00
↓
10,000 requests arrive at 12:00:01
↓
All go to DB simultaneously → CRASH
Facebook solution (Leases):
- Only one request authorized every 10 seconds
- Others wait or read the freshly calculated value
Measured result: 17,000 req/s → 1,300 req/s
2. Cache Warming
The problem:
Starting with empty cache = terrible latency during first few minutes/hours.
Netflix solution:
- Copy data from EBS snapshots
- Load cache BEFORE putting servers in production
- Avoids "warm-up" period
Measured result: 45 GB/s → 100 MB/s network traffic saved
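The warming step itself is conceptually a pre-load loop run before the server is added to the production pool. A minimal sketch, where `load_popular_keys_from_snapshot` is a hypothetical stand-in for reading an EBS snapshot or an analytics export of hot keys:

```python
cache = {}  # stand-in for the cache node being brought online

def load_popular_keys_from_snapshot():
    # Stand-in: in practice this reads a snapshot or a list of hot keys.
    return {"movie:42": "metadata-42", "movie:7": "metadata-7"}

def warm_cache():
    # Pre-load the cache BEFORE taking production traffic, so the first
    # requests are hits instead of a wall of misses against the database.
    for key, value in load_popular_keys_from_snapshot().items():
        cache[key] = value

warm_cache()
```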
3. Geographic Consistency
The problem:
How to synchronize caches across multiple continents?
Adopted solutions:
- Eventual consistency by default (few seconds delay acceptable)
- Optional strong consistency for critical data
- Asynchronous replication between regions
Examples:
- Spotify: EU ↔ NA replication
- Netflix: 4 global regions
- Facebook: global datacenters with synchronization
The Invalidation Problem
As Phil Karlton famously put it:
"There are only 2 hard problems in computer science: cache invalidation and naming things."
The 4 Invalidation Strategies
1. TTL (Time-To-Live)
```python
cache.set("product:123", data, ttl=3600)  # Auto-expires
```
- Simple, predictable
- May serve stale data
2. Manual Invalidation
```python
def update_user(user_id, new_data):
    database.update(user_id, new_data)
    cache.delete(f"user:{user_id}")  # Explicit deletion
```
- Full control
- Risk of missing some keys
3. Event-Based
```python
# Invalidate automatically when a domain event occurs
event_bus.on("user_updated", lambda user_id: cache.delete(f"user:{user_id}"))
```
- Automatic, decoupled
- System complexity
4. Version Tagging
```python
cache.set(f"user:{user_id}:v{version}", data)
# On update, just bump the version; readers request the new key
```
- No need to delete old one
- Uses more memory
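The version-tagging flow can be sketched end to end with an in-memory stand-in for the cache (all names here are illustrative):

```python
cache = {}  # stand-in for Redis/Memcached

def cache_user(user_id, data, version):
    cache[f"user:{user_id}:v{version}"] = data

def get_user(user_id, current_version):
    # Readers always ask for the current version; stale versions are simply
    # never read again and age out via TTL or eviction.
    return cache.get(f"user:{user_id}:v{current_version}")

cache_user(123, {"name": "Ada"}, version=1)
cache_user(123, {"name": "Ada L."}, version=2)  # update: bump version, no delete
```

Nothing is ever deleted explicitly, which is why this strategy trades extra memory for simpler invalidation logic.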
Getting Started Guide
Decision Tree: Which Strategy Should You Choose?
Are your data critical (banking, healthcare, user profiles)?
│
├─ YES → Zero data loss tolerable?
│ │
│ ├─ YES → Multi-region replication necessary?
│ │ │
│ │ ├─ YES → Write-Through + WAL (Netflix-style)
│ │ │ Example: Banking, Healthcare
│ │ │
│ │ └─ NO → Write-Through (synchronous cache + DB)
│ │ Example: E-commerce, B2B SaaS
│ │
│ └─ NO → Loss of a few seconds acceptable?
│ │
│ └─ YES → Write-Behind (asynchronous)
│ Example: Analytics, metrics
│
└─ NO → Highly unequal popularity (few items very popular)?
│
├─ YES → Cache-Aside + LFU
│ Example: E-commerce (best-selling products)
│
└─ NO → Data with limited lifetime?
│
├─ YES → Cache-Aside + TTL
│ Example: Weather API, RSS feeds
│
└─ NO → Cache-Aside + LRU (universal default)
Example: Majority of web applications
Concrete use cases by company size:
| Size | Users | Recommended Stack | Example |
|---|---|---|---|
| Startup | < 100K | Cache-Aside + Redis + TTL | Blog, MVP, early-stage SaaS |
| Scale-up | 100K-1M | Cache-Aside + Redis Cluster + LRU | E-commerce, growth SaaS |
| Enterprise | 1M-10M | Write-Through + Multi-region | Fintech, Healthcare |
| Hyper-scale | 10M+ | Write-Through + WAL + Flash Cache | Netflix, Facebook |
Simple rule:
- Don't know what to choose? → Start with Cache-Aside + TTL + LRU
- This is what 80% of web applications use successfully
To Start: Cache-Aside + TTL
Why this choice?
- It's the most used pattern in the industry
- Used by Netflix, Spotify, Twitter, and most startups
- Easy to understand and implement
- Works for the vast majority of use cases
Universal starting pattern:
```python
def get_data(key):
    # 1. Check cache
    data = cache.get(key)
    if data:
        return data  # Cache HIT

    # 2. Cache MISS -> fetch from DB
    data = database.query(key)

    # 3. Store in cache for next time
    cache.set(key, data, ttl=300)  # 5 minutes
    return data
```
Progressive Evolution: The Maturity Curve
Phase 1: Early Days (1-100K users)
- Simple cache: Redis or Memcached
- Pattern: Cache-Aside + TTL
- Infrastructure: 1-2 cache servers
Phase 2: Growth (100K-1M users)
- Distributed cache (Redis/Memcached cluster)
- Monitoring: hit rate, latency
- Add cache warming for popular data
Phase 3: Scale (1M-10M users)
- Multi-tier architecture (memory + distributed)
- Geographic replication
- Anti-thundering herd system
- Event-based invalidation
Phase 4: Hyper-scale (10M+ users)
- Flash cache (SSD)
- Sophisticated admission algorithms
- Global replication
- Strong consistency for critical data
Essential Metrics
1. Hit Rate
Hit Rate = (Cache Hits / Total Requests) × 100
Targets:
- Excellent: >95%
- Good: 90-95%
- Needs improvement: <90%
Hit rate measured at Facebook: 96.4%
2. Latency (P50, P95, P99)
P50: 50% of requests respond in less than X ms
P95: 95% of requests respond in less than Y ms
P99: 99% of requests respond in less than Z ms
Typical targets:
- Cache hit: <1ms
- Cache miss: <50ms (including DB)
3. Eviction Rate
How often are items evicted from the cache because it has run out of space?
If the eviction rate is too high: increase the cache size or tune TTLs so entries expire before memory pressure forces evictions.
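All three metrics can be tracked application-side with a thin wrapper around the cache. A minimal sketch (the `InstrumentedCache` class and its naive FIFO eviction are illustrative, not a production policy):

```python
class InstrumentedCache:
    """Dict-backed cache that tracks hits, misses, and evictions (toy sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}
        self.hits = self.misses = self.evictions = 0

    def get(self, key):
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        return None

    def set(self, key, value):
        if key not in self.store and len(self.store) >= self.capacity:
            self.store.pop(next(iter(self.store)))  # naive FIFO eviction
            self.evictions += 1
        self.store[key] = value

    def hit_rate(self):
        # Hit Rate = (Cache Hits / Total Requests) x 100
        total = self.hits + self.misses
        return 100.0 * self.hits / total if total else 0.0
```

In production you would read these counters from the cache itself (e.g. Redis `INFO` exposes hit/miss and eviction statistics) rather than maintaining them by hand, but the formula is the same.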
Part 1 Conclusion
In this first part, we covered the fundamental caching strategies used by all web giants.
You now understand:
- The 6 basic strategies (TTL, LRU, LFU, Write-Through/Write-Behind, Cache-Aside, Read-Through)
- How Netflix, Facebook, Google, and Twitter use caching
- Real-world challenges (thundering herd, cache warming, consistency)
- Where to start for your own project
In Part 2: Advanced Architectures, we'll discover:
- Netflix's Write-Ahead Log (WAL) in detail
- How to survive database corruption with zero downtime
- Multi-region replication
- Tradeoffs and lessons learned at enterprise scale
Up next: Part 2 - From Write-Behind to Write-Ahead Log: How Netflix Guarantees Zero Data Loss