DEV Community

SpicyCode

Cache Strategies Explained: Part 1 - The Fundamentals

How tech giants (Netflix, Facebook, Google, Twitter) serve billions of requests per second using caching


The Incident That Changed Everything

Netflix, Production Incident (Reported September 2025)

An experienced developer types an ALTER TABLE command in their terminal. This is routine work, something they've done hundreds of times. They hit Enter.

ALTER TABLE user_preferences...

Three seconds later, the alert fires.

Dashboards light up red. The primary database just suffered massive corruption. Critical user preference data (profiles, watch lists, personalized recommendations) became unusable.

In a typical company, this is where you start calculating the millions of dollars the incident will cost, and where careers hang in the balance.

But at Netflix, something unexpected happens.

No customer noticed anything. No complaints, no service interruption. 200+ million subscribers kept watching their shows peacefully.

How is this possible?

Two silent technologies saved the day:

  1. A cache continuing to serve valid data
  2. A Write-Ahead Log (WAL) that had captured all mutations before the corruption

Engineers simply extended the cache TTL, replayed mutations from Kafka, cleaned up the corruption, and resumed operations. Result: zero data loss, zero downtime.

Transparency note: Netflix hasn't publicly disclosed the exact number of affected records or full incident details. Information comes from their official blog post (September 2025) demonstrating the critical importance of their cache + WAL architecture for resilience.


Why Caching Is Not Optional

This incident proves that caching isn't just a performance optimization. It's a critical protection layer that can mean the difference between a minor incident and a multi-million dollar catastrophe.

In this two-part series, we'll explore:

  • Part 1: Fundamental strategies every developer should know
  • Part 2: Enterprise-grade advanced architectures (WAL, multi-region, resilience)

The 6 Fundamental Strategies

1. TTL (Time-To-Live) - Temporal Expiration

TTL defines how long data remains valid in cache before being automatically deleted or refreshed.

Implementation example:

# Redis with TTL (pseudocode; with redis-py this would be r.set("user:123", user_data, ex=3600))
cache.set("user:123", user_data, ttl=3600)  # Expires after 1 hour

Ideal use cases:

  • Weather data (hourly refresh)
  • News feeds (updated every 5 minutes)
  • Product prices (daily changes)
  • User sessions

TTL is universal. Every major tech company uses it in some form.

Important: TTL and eviction policies work together

In production, TTL and LRU/LFU operate simultaneously in Redis/Memcached:

# Redis configuration: maxmemory-policy allkeys-lru
cache.set("user:123", data, ttl=3600)

# This item will expire in 1 hour OR be evicted earlier if cache is full (LRU)

Data can disappear from cache for two reasons:

  • TTL expired: time elapsed (3600 seconds in the example)
  • Eviction: cache full, least recently used item removed (LRU)

This combination ensures both data freshness (TTL) and optimal memory usage (LRU).
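To make the interplay concrete, here is a toy in-memory cache (illustrative only, not production code) where an entry can disappear for either reason: its TTL elapsed, or LRU eviction reclaimed the slot when the cache filled up:

```python
import time
from collections import OrderedDict

class MiniCache:
    """Toy cache combining per-key TTL with LRU eviction (illustration only)."""

    def __init__(self, maxsize):
        self.maxsize = maxsize
        self._store = OrderedDict()  # key -> (value, expires_at)

    def set(self, key, value, ttl):
        self._store[key] = (value, time.monotonic() + ttl)
        self._store.move_to_end(key)           # mark as most recently used
        if len(self._store) > self.maxsize:    # cache full -> LRU eviction
            self._store.popitem(last=False)

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None                        # miss: never stored, or evicted
        value, expires_at = item
        if time.monotonic() >= expires_at:     # miss: TTL expired
            del self._store[key]
            return None
        self._store.move_to_end(key)           # a hit refreshes recency, not TTL
        return value
```

Either mechanism can remove an entry first, exactly as described above: a rarely used key may be evicted long before its TTL, and a hot key still expires on schedule.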


2. LRU (Least Recently Used) - Priority to Recent Items

When cache is full, LRU removes the least recently accessed data. It's like organizing your desk: you keep what you use often within reach.

Visual workflow:

Cache (capacity: 3 items)
1. Access A → [A]
2. Access B → [A, B]
3. Access C → [A, B, C]
4. Access D → [B, C, D]  // A evicted (least recently used)
5. Access B → [C, D, B]  // B becomes most recently used

Ideal use cases:

  • Web pages (repeated navigation)
  • Active user sessions
  • Browsing history

Used in production by: Netflix (EVCache with client-side LRU)
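The workflow above maps almost one-to-one onto Python's OrderedDict; here is a minimal sketch (illustrative, not how EVCache is implemented):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache mirroring the workflow above (capacity: 3 items)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # least recently used first

    def access(self, key, value=None):
        if key in self._items:
            self._items.move_to_end(key)         # refresh recency on every access
        else:
            self._items[key] = value
            if len(self._items) > self.capacity:
                self._items.popitem(last=False)  # drop the least recently used
        return list(self._items)

cache = LRUCache(3)
for key in ["A", "B", "C"]:
    cache.access(key)
cache.access("D")          # A evicted
print(cache.access("B"))   # ['C', 'D', 'B']
```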


3. LFU (Least Frequently Used) - Priority to Popularity

LFU keeps the most frequently requested data, regardless of last access time.

LRU vs LFU difference:

  • LRU: "When did you last use this?"
  • LFU: "How many times have you used this total?"

Concrete example:

Data: A (used 10x), B (used 2x), C (used 5x)
Cache full → Remove B (least frequent)

Ideal use cases:

  • E-commerce best-sellers
  • Viral content with lasting popularity
  • Repetitive search queries
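A minimal LFU sketch that mirrors the A/B/C example above (counts are kept in a Counter here for clarity; real implementations use more efficient structures such as frequency buckets):

```python
from collections import Counter

class LFUCache:
    """Minimal LFU sketch: evict the key with the lowest access count."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = {}
        self._counts = Counter()

    def get(self, key):
        if key in self._data:
            self._counts[key] += 1
        return self._data.get(key)

    def set(self, key, value):
        if key not in self._data and len(self._data) >= self.capacity:
            # Evict the least frequently used key (ties broken arbitrarily)
            victim, _ = min(self._counts.items(), key=lambda kv: kv[1])
            del self._data[victim]
            del self._counts[victim]
        self._data[key] = value
        self._counts[key] += 1

# Mirror the example above: A used 10x, B used 2x, C used 5x
cache = LFUCache(capacity=3)
for key in ("A", "B", "C"):
    cache.set(key, key.lower())
for _ in range(9):
    cache.get("A")          # A: 1 set + 9 gets = 10 accesses
cache.get("B")              # B: 2 accesses
for _ in range(4):
    cache.get("C")          # C: 5 accesses
cache.set("D", "d")         # cache full -> B (least frequent) evicted
print(cache.get("B"))       # None
```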

4. Write-Through vs Write-Behind - Write Strategies

Write-Through (Synchronous Write)

Application writes to the cache AND the database synchronously, within the same operation.

def save_user(user):
    cache.set(f"user:{user.id}", user)
    database.save(user)  # Both writes complete before returning

Pros: guaranteed data consistency
Cons: higher write latency
Use case: banking, financial transactions, critical data

Used by: Facebook TAO (synchronous cache + DB writes)


Write-Behind / Write-Back (Asynchronous Write)

Application writes to cache first, then to database asynchronously.

def save_user(user):
    cache.set(f"user:{user.id}", user)
    queue.add_job("save_to_db", user)  # Async (via message queue)

Pros: ultra-fast writes
Cons: risk of loss if crash before DB save
Use case: logs, analytics, non-critical metrics

Important note: Simple Write-Behind has production limitations. In Part 2, we'll see how Netflix transformed it into Write-Ahead Log (WAL) for enterprise-grade durability guarantees.


5. Cache-Aside (Lazy Loading) - The Most Common Pattern

This is the dominant strategy in the industry. The application manages the cache itself.

def get_user(user_id):
    # 1. Check cache
    user = cache.get(f"user:{user_id}")

    if user:
        return user  # Cache HIT

    # 2. Not in cache? Fetch from DB
    user = database.get_user(user_id)  # Cache MISS

    # 3. Store in cache for next time
    cache.set(f"user:{user_id}", user, ttl=3600)

    return user

Used by: Netflix, Spotify, Twitter, and most web applications


6. Read-Through Cache - Delegation to Cache

The cache itself automatically manages database reads (transparent to the application).

# Application simply asks the cache
user = cache.get("user:123")
# Cache automatically fetches from DB if needed

Used by: Facebook (evolution of their architecture)
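A sketch of the idea: the cache is constructed with a loader function and performs the database fetch itself, so callers only ever talk to the cache. The `load_from_db` helper and dict-backed "database" here are purely illustrative:

```python
class ReadThroughCache:
    """Sketch of a read-through cache: the cache owns the DB lookup,
    so the application never queries the database directly."""

    def __init__(self, loader):
        self._loader = loader   # e.g. a function that queries the database
        self._store = {}

    def get(self, key):
        if key not in self._store:
            # Cache miss: the cache itself fetches from the backing store
            self._store[key] = self._loader(key)
        return self._store[key]

# Hypothetical backing "database"
db = {"user:123": {"name": "Ada"}}
calls = []

def load_from_db(key):
    calls.append(key)           # track how often the DB is actually hit
    return db[key]

cache = ReadThroughCache(load_from_db)
cache.get("user:123")           # miss: the cache goes to the DB
cache.get("user:123")           # hit: served from cache
print(len(calls))               # 1
```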


Comparison Table

| Strategy | Pros | Cons | Use Case |
| --- | --- | --- | --- |
| TTL | Simple, predictable | May serve stale data | Weather, news |
| LRU | Adapts to temporal patterns | May evict important data | Sessions, navigation |
| LFU | Keeps popular data | More complex to implement | Best-sellers |
| Write-Through | Guaranteed consistency | Write latency | Banking, critical data |
| Write-Behind | Very fast | Risk of loss | Logs, analytics |
| Cache-Aside | Flexible, full control | App manages logic | Most cases |
| Read-Through | Transparent to app | Requires middleware | Complex systems |

How Giants Use Caching

Netflix - EVCache: Billions of Requests/Second

Infrastructure:

  • Distributed cache based on Memcached
  • Combined strategies: TTL + LRU + Cache-Aside
  • Geographic replication across 4 global regions
  • Some clusters with 2 copies, others with 9 (depending on criticality)

Verified performance:

  • Handles billions of requests per second
  • Cache warming: reduced 45 GB/s → 100 MB/s network traffic

Multi-tier architecture:

L1: Local memory cache (client-side LRU)
    ↓
L2: EVCache distributed (TTL)
    ↓
L3: Multi-zone replication
    ↓
Database

Key lesson: Netflix pre-calculates and pre-loads cache before putting servers in production (cache warming).


Facebook/Meta - TAO: 1 Billion Reads/Second

Architectural evolution:

  1. Phase 1: Memcache + MySQL (look-aside caching, i.e. Cache-Aside)
  2. Phase 2: TAO (The Associations and Objects) - abstraction layer
  3. Current strategy: Write-Through (synchronous cache + DB writes)

Verified performance:

  • 96.4% hit rate on reads
  • Over 1 billion read requests/second
  • Millions of writes/second

Technical innovation: "Leases"
To avoid the thundering herd problem (massive rush when cache expires):

  • Only one request can hit the database every 10 seconds per key
  • Other requests wait or retrieve the freshly calculated value

Concrete result: reduction from 17,000 req/s → 1,300 req/s to database during peaks.


Twitter/X - Manhattan + Redis: Consistency at Scale

Infrastructure:

  • Manhattan (distributed key-value store)
  • Redis (Haplo) as primary cache for Timeline
  • Strategy: Cache-Aside + eventual consistency by default

Verified performance:

  • 320 million packets/second
  • 120 GB/s network throughput
  • Tens of millions of read QPS
  • Cache represents only 3% of infrastructure but is critical

Particularity: strong consistency option available via consensus for critical data.


Google - Bigtable + Spanner: Multi-Tier Cache

Sophisticated architecture:

L1: Row cache (in-memory) → Reduces CPU by 25%
    ↓
L2: Block cache (local SSD)
    ↓
L3: Colossus Flash Cache (datacenter)
    ↓
Persistent storage

Verified performance:

  • Bigtable: 17,000 point reads/second per node (1.7x improvement)
  • Colossus Flash Cache: over 5 billion requests/second
  • Spanner automatically caches query execution plans

Innovation: CacheSack
Intelligent admission algorithm for flash cache that optimizes total cost of ownership (TCO).


Real-World Challenges

1. The Thundering Herd

The problem:
When a popular key expires, thousands of requests simultaneously hit the database.

Cache expires at 12:00:00
    ↓
10,000 requests arrive at 12:00:01
    ↓
All go to DB simultaneously → CRASH

Facebook solution (Leases):

  • Only one request authorized every 10 seconds
  • Others wait or read the freshly calculated value

Measured result: 17,000 req/s → 1,300 req/s
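Facebook's leases live inside memcached itself, but a simplified application-side version of the same idea is a per-key lock: only one caller recomputes a missing value while concurrent callers wait and reuse the result. A sketch (the `expensive_query` helper stands in for a slow database read):

```python
import threading

class SingleFlight:
    """Simplified, app-side take on the lease idea: per key, only one
    caller recomputes a missing value; concurrent callers wait and reuse it."""

    def __init__(self):
        self._cache = {}
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key, recompute):
        if key in self._cache:
            return self._cache[key]
        with self._lock_for(key):        # only one thread per key enters here
            if key not in self._cache:   # re-check: another thread may have filled it
                self._cache[key] = recompute(key)
            return self._cache[key]

db_calls = []

def expensive_query(key):
    db_calls.append(key)                 # stands in for a slow DB read
    return f"value-for-{key}"

sf = SingleFlight()
threads = [threading.Thread(target=sf.get, args=("hot-key", expensive_query))
           for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(len(db_calls))                     # 1: a single DB hit despite 100 readers
```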


2. Cache Warming

The problem:
Starting with empty cache = terrible latency during first few minutes/hours.

Netflix solution:

  • Copy data from EBS snapshots
  • Load cache BEFORE putting servers in production
  • Avoids "warm-up" period

Measured result: 45 GB/s → 100 MB/s network traffic saved


3. Geographic Consistency

The problem:
How to synchronize caches across multiple continents?

Adopted solutions:

  • Eventual consistency by default (few seconds delay acceptable)
  • Optional strong consistency for critical data
  • Asynchronous replication between regions

Examples:

  • Spotify: EU ↔ NA replication
  • Netflix: 4 global regions
  • Facebook: global datacenters with synchronization

The Invalidation Problem

As Phil Karlton's famous quote says:

"There are only 2 hard problems in computer science: cache invalidation and naming things."

The 4 Invalidation Strategies

1. TTL (Time-To-Live)

cache.set("product:123", data, ttl=3600)  # Auto-expires
  • Simple, predictable
  • May serve stale data

2. Manual Invalidation

def update_user(user_id, new_data):
    database.update(user_id, new_data)
    cache.delete(f"user:{user_id}")  # Explicit deletion
  • Full control
  • Risk of missing some keys

3. Event-Based

# When an event occurs
event_bus.on("user_updated", lambda user_id: cache.delete(f"user:{user_id}"))
  • Automatic, decoupled
  • System complexity

4. Version Tagging

cache.set(f"user:{user_id}:v{version}", data)
# On update, increment `version`: new reads use the new key,
# and the old versioned entry simply ages out of the cache
  • No need to delete old one
  • Uses more memory

Getting Started Guide

Decision Tree: Which Strategy Should You Choose?

Are your data critical (banking, healthcare, user profiles)?
│
├─ YES → Zero data loss tolerable?
│   │
│   ├─ YES → Multi-region replication necessary?
│   │   │
│   │   ├─ YES → Write-Through + WAL (Netflix-style)
│   │   │         Example: Banking, Healthcare
│   │   │
│   │   └─ NO → Write-Through (synchronous cache + DB)
│   │             Example: E-commerce, B2B SaaS
│   │
│   └─ NO → Loss of a few seconds acceptable?
│       │
│       └─ YES → Write-Behind (asynchronous)
│                 Example: Analytics, metrics
│
└─ NO → Highly unequal popularity (few items very popular)?
    │
    ├─ YES → Cache-Aside + LFU
    │         Example: E-commerce (best-selling products)
    │
    └─ NO → Data with limited lifetime?
        │
        ├─ YES → Cache-Aside + TTL
        │         Example: Weather API, RSS feeds
        │
        └─ NO → Cache-Aside + LRU (universal default)
                  Example: Majority of web applications

Concrete use cases by company size:

| Size | Users | Recommended Stack | Example |
| --- | --- | --- | --- |
| Startup | < 100K | Cache-Aside + Redis + TTL | Blog, MVP, early-stage SaaS |
| Scale-up | 100K-1M | Cache-Aside + Redis Cluster + LRU | E-commerce, growth SaaS |
| Enterprise | 1M-10M | Write-Through + Multi-region | Fintech, Healthcare |
| Hyper-scale | 10M+ | Write-Through + WAL + Flash Cache | Netflix, Facebook |

Simple rule:

  • Don't know what to choose? → Start with Cache-Aside + TTL + LRU
  • This is what 80% of web applications use successfully

To Start: Cache-Aside + TTL

Why this choice?

  1. It's the most used pattern in the industry
  2. Used by Netflix, Spotify, Twitter, and most startups
  3. Easy to understand and implement
  4. Works for the vast majority of use cases

Universal starting pattern:

def get_data(key):
    # 1. Check cache
    data = cache.get(key)

    if data:
        return data  # Cache HIT

    # 2. Cache MISS → go to DB
    data = database.query(key)

    # 3. Store in cache
    cache.set(key, data, ttl=300)  # 5 minutes

    return data

Progressive Evolution: The Maturity Curve

Phase 1: Early Days (1-100K users)

  • Simple cache: Redis or Memcached
  • Pattern: Cache-Aside + TTL
  • Infrastructure: 1-2 cache servers

Phase 2: Growth (100K-1M users)

  • Distributed cache (Redis/Memcached cluster)
  • Monitoring: hit rate, latency
  • Add cache warming for popular data

Phase 3: Scale (1M-10M users)

  • Multi-tier architecture (memory + distributed)
  • Geographic replication
  • Anti-thundering herd system
  • Event-based invalidation

Phase 4: Hyper-scale (10M+ users)

  • Flash cache (SSD)
  • Sophisticated admission algorithms
  • Global replication
  • Strong consistency for critical data

Essential Metrics

1. Hit Rate

Hit Rate = (Cache Hits / Total Requests) × 100

Targets:

  • Excellent: >95%
  • Good: 90-95%
  • Needs improvement: <90%

Hit rate measured at Facebook: 96.4%
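The same formula in code, using Facebook's reported figure as the example:

```python
def hit_rate(hits, total_requests):
    """Hit Rate = (Cache Hits / Total Requests) x 100"""
    if total_requests == 0:
        return 0.0
    return hits * 100 / total_requests

# Facebook-scale example: 964 hits out of 1,000 requests
print(hit_rate(964, 1_000))  # 96.4
```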


2. Latency (P50, P95, P99)

P50: 50% of requests respond in less than X ms
P95: 95% of requests respond in less than Y ms
P99: 99% of requests respond in less than Z ms

Typical targets:

  • Cache hit: <1ms
  • Cache miss: <50ms (including DB)

3. Eviction Rate

How many entries per second are evicted from the cache due to lack of space?

If too high: increase cache size or optimize TTL


Part 1 Conclusion

In this first part, we covered the fundamental caching strategies used by all web giants.

You now understand:

  • The 6 basic strategies (TTL, LRU, LFU, Write-Through/Write-Behind, Cache-Aside, Read-Through)
  • How Netflix, Facebook, Google, and Twitter use caching
  • Real-world challenges (thundering herd, cache warming, consistency)
  • Where to start for your own project

In Part 2: Advanced Architectures, we'll discover:

  • Netflix's Write-Ahead Log (WAL) in detail
  • How to survive database corruption with zero downtime
  • Multi-region replication
  • Tradeoffs and lessons learned at enterprise scale

Up next: Part 2 - From Write-Behind to Write-Ahead Log: How Netflix Guarantees Zero Data Loss
