Sivagurunathan Velayutham

When Your Database Goes Down for 25+ Minutes: Building a Survival Cache

In microservice architectures, config services are critical infrastructure. They store the feature flags, API endpoints, and runtime settings that services query constantly: on startup, during requests, and when auto-scaling. Most are backed by a database with aggressive caching. Everything works beautifully, until your database goes down.

Here's the nightmare scenario: Your cache has a 5-minute TTL. Your database outage lasts 25+ minutes. At the 5-minute mark, cache entries start expiring. Services start failing. New instances can't bootstrap. Your availability drops to zero.

This is the story of building a cache that survives prolonged database outages by persisting stale data to disk and the hard lessons learned along the way.

The Problem Nobody Talks About

Everyone tells you to cache your database. "Just use Redis!" "Throw some Caffeine in there!" And they're right for normal operations.

But here's what the tutorials don't cover: What happens when your cache expires during a prolonged outage?

The failure sequence looks like this:

  • T+0 min: Database goes down. Cache still serving traffic (100% hit rate).
  • T+5 min: First cache entries expire. Cache misses start happening.
  • T+6 min: Cache miss → try database → timeout. Service starts returning errors.
  • T+10 min: Most cache entries expired. Availability plummets.
  • T+15 min: Auto-scaling spins up new instances. They can't fetch configs. Immediate crash.
  • T+25 min: Database finally recovers. You've been down for 20 minutes.

The traditional solution is replication: Aurora multi-region, DynamoDB global tables, all that good stuff. But replication has its own problems:

Cost: You're running duplicate infrastructure 24/7 for failure scenarios that happen 2-3 times per year.

Complexity: Cross-region replication, failover logic, data consistency concerns, network latency.

Partial protection: Regional outages still take you down. Replication lag can be seconds to minutes.

There had to be a simpler approach.

The Core Insight: Stale Data Beats No Data

Here's the controversial take that changed everything: For read-heavy config services, serving 10-minute-old data during an outage is infinitely better than serving nothing.

Think about what your config service actually stores:

  • Feature flags: Don't change every second
  • Service endpoints: Relatively stable
  • API rate limits: Rarely updated mid-incident
  • Routing rules: Can tolerate brief staleness

Sure, you might serve a feature flag that was disabled 5 minutes ago. But that's better than taking down your entire service because the config is unreachable.

The question became: How do I serve stale data when my cache is empty and my database is unavailable?

The answer: Persist cache evictions to local disk.

Architecture: The Three-Tier Survival Strategy

I built what I call a "tier cache"—three layers of defense against database failures:

Architecture

Normal Operation Flow:

  1. Request comes in → check L1 (memory)
  2. Cache hit (99% of the time) → return immediately in ~2.5μs
  3. Cache miss → fetch from L2 (database)
  4. Write to L1 for fast access
  5. Asynchronously write to L3 (disk) for outage protection

Outage Operation Flow:

  1. Request comes in → check L1 (memory)
  2. Cache miss → try L2 (database) → connection timeout
  3. Fall back to L3 (disk) → serve stale data
  4. Service stays alive with degraded data

The key innovation: Every cache eviction gets persisted to disk. When the database is unreachable, we serve from this stale disk cache. It's not perfect data, but it keeps services running.
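To make the two flows concrete, here is a minimal sketch of the read path. The names used here (Config, databaseClient, diskStore, DatabaseUnavailableException) are illustrative placeholders, not the exact API from the repository:

// Sketch of the three-tier read path; persistence to L3 happens separately,
// via the eviction listener described later in the post.
public Config get(String key) {
    // L1: in-memory Caffeine cache
    Config cached = cache.getIfPresent(key);
    if (cached != null) {
        return cached;
    }
    try {
        // L2: the database is the source of truth while it is healthy
        Config fresh = databaseClient.fetch(key);
        cache.put(key, fresh); // repopulate L1
        return fresh;
    } catch (DatabaseUnavailableException e) {
        // L3: database unreachable, fall back to stale data on local disk
        Config stale = diskStore.load(key);
        if (stale != null) {
            return stale; // degraded but alive
        }
        throw e; // cold start: nothing on disk for this key either
    }
}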

Why RocksDB?

My first instinct was simple file serialization. Why not just dump everything to JSON?

File cacheFile = new File("cache-backup.json");
objectMapper.writeValue(cacheFile, cacheData);

This worked great for 100 entries in my test. Then I tried 10,000 realistic config objects:

  • File size: 45MB of verbose JSON
  • Write time: 280ms (blocking the cache)
  • Read time: 380ms (sequential scan to find one key)

Completely unusable.

I needed something that could:

  • Read individual keys fast without scanning the entire file
  • Compress data since config JSON is highly repetitive
  • Handle writes efficiently without blocking cache operations
  • Survive crashes without losing all data

After researching embedded databases, RocksDB emerged as the clear winner:

Compression: My 45MB JSON dump compressed to ~8MB with LZ4 (5.6x reduction). Real-world compression varies by data patterns, typically in the 2-4x range.

Fast random reads: Log-Structured Merge (LSM) tree design optimized for key-value lookups. 10-50μs to fetch any key.

Write-optimized: Writes go to memory first, then flush to disk in batches. No blocking on individual writes.

Battle-tested: Powers production systems at Facebook, LinkedIn, Netflix. If it's good enough for them, it's good enough for my config service.

Crash safety: Write-Ahead Logging (WAL) ensures durability even if the process crashes.

import com.fasterxml.jackson.databind.ObjectMapper;
import org.rocksdb.CompressionType;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDBDiskStore implements AutoCloseable {
    private final RocksDB db;
    private final ObjectMapper mapper;

    public RocksDBDiskStore(String path) throws RocksDBException {
        RocksDB.loadLibrary();

        Options options = new Options()
            .setCreateIfMissing(true)
            .setCompressionType(CompressionType.LZ4_COMPRESSION)
            .setMaxOpenFiles(256)
            .setWriteBufferSize(8 * 1024 * 1024); // 8MB write buffer before flushing to disk

        this.db = RocksDB.open(options, path);
        this.mapper = new ObjectMapper(); // used to (de)serialize cached config objects
    }

    @Override
    public void close() {
        db.close(); // flushes memtables and releases native resources
    }
}
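For the outage path, the disk store also needs a read method. A minimal sketch that would live alongside the constructor above, assuming string keys and Jackson-serializable values (the Config type and method name are placeholders):

// Hedged sketch of the L3 read path used during outages.
public Config load(String key) {
    try {
        byte[] bytes = db.get(key.getBytes(StandardCharsets.UTF_8));
        if (bytes == null) {
            return null; // key was never persisted (cold-start case)
        }
        return mapper.readValue(bytes, Config.class);
    } catch (RocksDBException | IOException e) {
        return null; // treat disk errors as a miss, not a hard failure
    }
}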

Disk Management Built-In

Implementation: the disk store wraps RocksDB with a configurable background cleanup thread:

// From RocksDBDiskStore.java
if (cleanupDuration > 0) {
    this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread thread = new Thread(r, "RocksDB-Cleanup");
        thread.setDaemon(true);
        return thread;
    });
    this.scheduler.scheduleAtFixedRate(
        this::cleanup,
        cleanupDuration,
        cleanupDuration,
        unit
    );
}

This daemon thread runs periodic cleanup to prevent unbounded disk growth. You configure the cleanup frequency when initializing the disk store, ensuring L3 doesn't consume all server disk space over time.
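The cleanup method itself isn't shown above. A hedged sketch of what a retention-based pass might look like, assuming each persisted value carries a written-at timestamp in an envelope (the StoredEntry type, retentionMillis field, and logger are hypothetical):

private void cleanup() {
    long cutoff = System.currentTimeMillis() - retentionMillis;
    try (RocksIterator it = db.newIterator()) {
        for (it.seekToFirst(); it.isValid(); it.next()) {
            StoredEntry entry = mapper.readValue(it.value(), StoredEntry.class);
            if (entry.getWrittenAt() < cutoff) {
                db.delete(it.key()); // drop entries older than the retention window
            }
        }
    } catch (Exception e) {
        // best-effort: never let a failed pass kill the daemon thread
        log.warn("RocksDB cleanup pass failed", e);
    }
}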

Cache Eviction: The Secret Sauce

The clever part is when data gets written to RocksDB. I don't persist every cache write—that would be wasteful. Instead, I persist on cache eviction.

Caffeine's removal listener is the key:

this.cache = Caffeine.newBuilder()
    .maximumSize(maxSize)
    .expireAfterWrite(ttl)
    .evictionListener((key, value, cause) -> {
        this.diskStore.save(key, value); // persist the evicted entry to RocksDB (L3)
    })
    .build();
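The save call invoked by that listener might look something like this, a minimal sketch assuming Jackson-serializable values (method and logger names are illustrative):

// Hedged sketch of persisting an evicted entry to RocksDB.
public void save(Object key, Object value) {
    try {
        byte[] keyBytes = key.toString().getBytes(StandardCharsets.UTF_8);
        byte[] valueBytes = mapper.writeValueAsBytes(value); // JSON-encode; RocksDB applies LZ4
        db.put(keyBytes, valueBytes);
    } catch (Exception e) {
        // best-effort: never fail the cache's eviction path because of L3
        log.warn("Failed to persist evicted entry {}", key, e);
    }
}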

When does eviction happen?

  1. Time-based expiry: An entry's TTL elapses (expireAfterWrite above) → eviction
  2. Size-based eviction: Cache hits 10,000 entries → least recently used gets evicted

Why this approach is efficient:

Hot data stays in memory: Frequently accessed configs never touch disk.

Cold data gets archived: When a config entry expires from L1, it gets persisted to L3 for outage scenarios.

Eviction-triggered persistence: Data is written to disk when evicted from memory, not on every cache operation.

During normal operations: L3 is write-mostly, read-rarely. The database is healthy, so cache misses go to L2, not L3.

During outages: L3 becomes read-heavy. Cache misses can't reach L2 (database down), so they fall back to L3 for stale data.

This design means your disk isn't constantly thrashing with writes—it only persists data that's already being evicted from memory anyway.

Benchmarking: Does This Actually Work?

I built a test harness to simulate realistic failure scenarios. Here are the results that convinced me this approach works:

Test 1: Long Outage Resilience (25-min database failure)

Setup: 10K cache entries, 5-min TTL, simulated database outage at T+0

Time Elapsed    Tier Cache    EhCache (disk)    Caffeine Only
3 minutes       100%          100%              100%
5 minutes       100%          0%                0%
7 minutes       100%          0%                0%
10 minutes      100%          0%                0%
25 minutes      100%          0%                0%

Key finding: The tier cache maintained availability for previously-cached keys by serving from L3 (RocksDB) after L1 expired. This assumes all requested keys were previously cached; newly added configs or never-requested keys won't be in L3 and will fail. That warm-cache assumption is a reasonable approximation of typical production traffic patterns.

Why did EhCache fail? Its disk persistence is designed for overflow, not outage recovery. When the cache expires, it tries to fetch from the database (which is down) rather than serving stale disk data.

Test 2: Normal Operation Performance

Setup: Database healthy, measuring latency for cache operations

Operation             Tier Cache    EhCache    Caffeine
Cache hit (memory)    2.50 μs       6.31 μs    2.74 μs
Cache miss (DB up)    1.2 ms        1.3 ms     1.1 ms
Disk fallback         19.11 μs      N/A        N/A

Important clarification: The "cache miss" numbers include network round-trip (mocked) to the database. The "disk fallback" is what happens when the DB is down—we serve from RocksDB instead.

During normal operations, tier cache performs nearly identically to vanilla Caffeine. The disk layer only matters during outages.

Test 3: Write Throughput Under Memory Pressure

Setup: 50K writes with 10K cache size limit (heavy eviction)

Strategy         Total Time    Throughput     vs Baseline
Caffeine Only    37 ms         1,351,351/s    100%
Tier Cache       140 ms        357,143/s      26%
EhCache          201 ms        248,756/s      18%

This is the cost. Async disk persistence reduces write throughput by ~74%. Every eviction triggers a disk write, and under heavy churn, this adds up.

What I Got Wrong

This is a learning project, not production-ready code. Here are the real limitations you need to understand:

1. The Cold Start Problem

New instances start with empty RocksDB. During an outage, they have no stale data to serve.

What happens: Auto-scaling spins up a new pod → L1 empty → L2 down → L3 empty → requests fail.

My benchmarks showed 100% availability, but that assumed warm caches. Real-world availability during outages depends on whether instances have previously cached the requested keys.

2. Single Node Limitation

Each instance maintains its own local RocksDB. In a distributed deployment with multiple instances, each has different stale data based on what it personally cached. Request routing becomes non-deterministic—the same config key might return different values depending on which instance handles the request.

This isn't a bug to fix; it's a fundamental architectural choice. Local disk persistence trades consistency for simplicity. Solving this requires either accepting eventual consistency or moving to distributed storage like Redis, which defeats the "simple local cache" design goal.

When Should You Actually Use This?

This project demonstrates caching patterns and outage resilience strategies. Based on the architecture:

Appropriate for:

  • Single-node applications
  • Systems where eventual consistency across instances is acceptable

Not appropriate for:

  • Multi-instance production deployments requiring consistency
  • Applications needing strong consistency guarantees

Try It Yourself

The full implementation is here: github.com/SivagurunathanV/tier-cache

Quick start:

git clone https://github.com/SivagurunathanV/tier-cache
cd tier-cache
./gradlew test    # Run test suite
./gradlew run     # Interactive demo

What's Next?

If you're building something similar:

  • Start simple (JSON files) and profile before over-engineering
  • Measure your actual outage frequency and duration
  • Calculate the real cost of downtime vs. infrastructure
  • Test with realistic failure scenarios, not just happy paths

Key improvements for production:

  • Implement write coalescing (batch evictions); see the sketch after this list
  • Add circuit breakers and error handling
  • Build comprehensive observability
  • Test cold start and multi-instance scenarios
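For the write-coalescing item, a hedged sketch of one possible approach inside the disk store: the eviction listener enqueues entries, and a background task flushes them to RocksDB in a single WriteBatch instead of issuing one put per eviction. The queue, batch cap, and field names are hypothetical:

// Hypothetical write coalescing for evicted entries.
private final BlockingQueue<Map.Entry<String, byte[]>> pending = new LinkedBlockingQueue<>();

void flushPending() throws RocksDBException {
    List<Map.Entry<String, byte[]>> drained = new ArrayList<>();
    pending.drainTo(drained, 1000); // cap each pass at 1,000 evictions
    if (drained.isEmpty()) {
        return;
    }
    try (WriteBatch batch = new WriteBatch();
         WriteOptions opts = new WriteOptions()) {
        for (Map.Entry<String, byte[]> e : drained) {
            batch.put(e.getKey().getBytes(StandardCharsets.UTF_8), e.getValue());
        }
        db.write(opts, batch); // one disk write for the whole batch
    }
}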

I'd love to hear about your failure survival strategies. What patterns have kept your services alive during database outages? What trade-offs have you made?

