DEV Community

Shubham Bhati

Beyond the Cache Miss: Designing Resilient Caching Layers with Redis Degradation Strategies

1. The Anatomy of a Cache Disaster

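To design a solution, we must first analyze how cache failures manifest as systemic outages. Consider a standard cache-aside read path (often loosely called read-through), in which the service checks Redis first and reads the database on a miss: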

[Client] ---> [API Gateway] ---> [Product Service] 
                                    |         |
                              (1) Read    (2) Miss? Read DB
                                    v         v
                                 [Redis]   [PostgreSQL]

The Failure Cascade

If Redis latency increases from 2ms to 2000ms (due to network congestion or CPU saturation), the following cascade occurs:

  1. Thread Pool Exhaustion: The API container's HTTP worker threads (e.g., Tomcat, Netty) wait on Redis read timeouts. New incoming requests queue up, quickly exhausting the container's thread pool.
  2. The Cache Stampede (Thundering Herd): If the Redis connection drops entirely, all concurrent requests for a popular resource miss simultaneously. They bypass the cache and hit the downstream database together.
  3. Database Demolition: The database, designed for a fraction of the cache's read volume, experiences immediate connection pool saturation, CPU spikes to 100%, and starts dropping requests. The entire platform goes offline.
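
The first step of the cascade is unbounded waiting, so a small sketch of a bounded wait may help. The helper below is hypothetical (the name `BoundedCacheRead` and its signature are not from the article) and uses only the JDK: it runs a lookup asynchronously and gives up after a fixed budget, returning an empty result that the caller can treat as a cache miss.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper: bounds how long a caller waits on a cache lookup. */
final class BoundedCacheRead {

    /**
     * Runs the lookup on the common pool and waits at most timeoutMs.
     * Returns Optional.empty() on timeout, on error, or when the lookup
     * returns null, so the caller can fall back to another tier.
     */
    static <T> Optional<T> read(Supplier<T> cacheLookup, long timeoutMs) {
        try {
            T value = CompletableFuture
                    .supplyAsync(cacheLookup)
                    .orTimeout(timeoutMs, TimeUnit.MILLISECONDS)
                    .join();
            return Optional.ofNullable(value);
        } catch (RuntimeException e) {
            // join() wraps a TimeoutException or the lookup's own failure
            // in a CompletionException, which is a RuntimeException.
            return Optional.empty();
        }
    }
}
```

Note that the timed-out lookup keeps running on its pool thread; this only frees the request thread. Client-level socket and command timeouts are still needed to release the underlying connection.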

Root Cause Analysis (RCA)

The root cause is tight coupling and synchronous blocking on the caching layer. The application treats Redis as a hard dependency rather than an opportunistic optimization.


2. The Resilient Architecture: Multi-Tier Caching & Circuit Breaking

To build a resilient caching layer, we must implement three core design patterns:

  1. Dual-Layer Caching (L1/L2):
    • L1 (Local Memory): A small, fast, in-memory cache (e.g., Caffeine) that lives inside the application's JVM process.
    • L2 (Distributed Cache): Redis.
  2. Circuit Breaking & Fallbacks: Wrap Redis interactions in a circuit breaker. If Redis error rates or response times cross a threshold, the breaker trips, bypassing Redis entirely and falling back to L1 or a safe database read.
  3. Asynchronous Non-Blocking Refresh (Stale-While-Revalidate): Serving slightly stale data from L1/L2 while asynchronously updating the cache in the background.
                  +---------------------------------------+
                  |           Product Service             |
                  +---------------------------------------+
                                      |
                           [Check L1 Cache (Local)]
                                      | (Miss)
                                      v
                        +----------------------------+
                        |   Resilience4j Breaker     |
                        +----------------------------+
                          /                        \
                 (Closed)/                          \(Open / Half-Open)
                        v                            v
               [Check L2 (Redis)]             [Fallback Path]
                 /            \                      |
         (Hit)  /      (Miss)  \ (Error)             |
               v                v                    v
         [Return Data]   [Query DB & Populate]  [Query DB (Rate Limited) / Stale Data]
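
The third pattern, stale-while-revalidate, can be sketched in plain Java. This is an illustrative model only, not the article's service: the class name `StaleWhileRevalidateCache`, the injected `loader`, `clock`, and `refreshExecutor` are all assumptions made so the example stays self-contained.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical stale-while-revalidate cache built only on the JDK. */
final class StaleWhileRevalidateCache<K, V> {

    private record Entry<T>(T value, long loadedAtMs) {}

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<K, Boolean> refreshing = new ConcurrentHashMap<>();
    private final Function<K, V> loader;      // e.g. a database read
    private final long freshForMs;            // age after which an entry counts as stale
    private final Executor refreshExecutor;   // runs background refreshes
    private final LongSupplier clock;         // current time in milliseconds

    StaleWhileRevalidateCache(Function<K, V> loader, long freshForMs,
                              Executor refreshExecutor, LongSupplier clock) {
        this.loader = loader;
        this.freshForMs = freshForMs;
        this.refreshExecutor = refreshExecutor;
        this.clock = clock;
    }

    V get(K key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) {
            // Cold miss: there is nothing stale to serve, so load synchronously.
            V value = loader.apply(key);
            entries.put(key, new Entry<>(value, clock.getAsLong()));
            return value;
        }
        boolean stale = clock.getAsLong() - entry.loadedAtMs() >= freshForMs;
        if (stale && refreshing.putIfAbsent(key, Boolean.TRUE) == null) {
            // Only the caller that wins putIfAbsent schedules a refresh.
            CompletableFuture.runAsync(() -> {
                try {
                    entries.put(key, new Entry<>(loader.apply(key), clock.getAsLong()));
                } finally {
                    refreshing.remove(key);
                }
            }, refreshExecutor);
        }
        return entry.value(); // possibly stale, but returned without waiting
    }
}
```

If the background load fails, the stale entry simply stays in place and the next stale read schedules another attempt. A production version would also bound how stale an entry may become before it is evicted.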

3. Implementation: Building a Resilient Cache Manager in Spring Boot

Let's implement this architecture using Java 17, Spring Boot 3, Caffeine (L1), Redis (L2), and Resilience4j.

Dependency Configuration (pom.xml)

<dependencies>
    <!-- Spring Boot Starter Cache -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-cache</artifactId>
    </dependency>
    <!-- Redis -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-redis</artifactId>
    </dependency>
    <!-- Caffeine (L1 Cache) -->
    <dependency>
        <groupId>com.github.ben-manes.caffeine</groupId>
        <artifactId>caffeine</artifactId>
    </dependency>
    <!-- Resilience4j Circuit Breaker -->
    <dependency>
        <groupId>io.github.resilience4j</groupId>
        <artifactId>resilience4j-spring-boot3</artifactId>
        <version>2.1.0</version>
    </dependency>
    <!-- Spring AOP (needed at runtime for the @CircuitBreaker annotation to take effect) -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-aop</artifactId>
    </dependency>
</dependencies>

The Resilient Cache Layer Implementation

We will write a custom ResilientProductService that coordinates the L1 cache, the L2 Redis cache protected by a circuit breaker, and the database fallback.

package com.example.cache.service;

import com.example.cache.model.Product;
import com.example.cache.repository.ProductRepository;
import com.github.benmanes.caffeine.cache.Cache;
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

import java.time.Duration;
import java.util.Optional;

@Slf4j
@Service
@RequiredArgsConstructor
public class ResilientProductService {

    private final ProductRepository productRepository;
    private final RedisTemplate<String, Product> redisTemplate;
    private final Cache<String, Product> l1CaffeineCache; // Local JVM cache

    private static final String REDIS_KEY_PREFIX = "product:";
    private static final String REDIS_CIRCUIT_BREAKER = "redisService";

    /**
     * Fetch Product with a resilient multi-tier fallback architecture.
     * 1. Try L1 Cache (Caffeine)
     * 2. Try L2 Cache (Redis) - Wrapped in Circuit Breaker
     * 3. Database Fallback (with automatic L1/L2 repopulation)
     */
    public Product getProduct(String productId) {
        // Step 1: Query L1 (In-Memory JVM Cache) - Instant, cannot fail due to network
        Product product = l1CaffeineCache.getIfPresent(productId);
        if (product != null) {
            log.debug("L1 Cache Hit for product: {}", productId);
            return product;
        }

        // Step 2: Query L2 (Redis) via Circuit Breaker
        return getProductFromL2WithCircuitBreaker(productId);
    }

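    /**
     * Redis lookup protected by Resilience4j.
     * If Redis is slow or down, the fallbackMethod is executed.
     * Caveat: the annotation is applied through a Spring AOP proxy, so it only takes
     * effect on a public method invoked from outside this class. Calling it via
     * `this`, as getProduct() does above, bypasses the breaker; in a real service,
     * move this method into a separate bean and inject that bean here.
     */
    @CircuitBreaker(name = REDIS_CIRCUIT_BREAKER, fallbackMethod = "fallbackGetProductFromDb")
    public Product getProductFromL2WithCircuitBreaker(String productId) {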
        log.debug("L1 Cache Miss. Querying L2 (Redis) for product: {}", productId);
        String key = REDIS_KEY_PREFIX + productId;

        // This operation will throw an exception if Redis is unreachable, 
        // triggering the circuit breaker and fallback.
        Product product = redisTemplate.opsForValue().get(key);

        if (product != null) {
            log.debug("L2 Cache Hit for product: {}", productId);
            // Populate L1 cache so subsequent reads avoid L2/Network completely
            l1CaffeineCache.put(productId, product);
            return product;
        }

        // Step 3: L2 Miss -> Query Database
        log.warn("L2 Cache Miss for product: {}. Fetching from Database.", productId);
        Product dbProduct = productRepository.findById(productId)
                .orElseThrow(() -> new ResourceNotFoundException("Product not found: " + productId));

        // Populate both caches (done synchronously here)
        populateCaches(productId, dbProduct);
        return dbProduct;
    }

    /**
     * Fallback Method executed when the Redis Circuit Breaker is OPEN or Redis throws an Exception.
     * This bypasses Redis so request threads do not block on a degraded cache. The database
     * absorbs these reads instead, so cap its concurrency if traffic is high.
     */
    private Product fallbackGetProductFromDb(String productId, Throwable throwable) {
        log.error("Redis Cache Unavailable (Circuit Breaker status/error: {}). Falling back directly to DB.", 
                  throwable.getMessage());

        // Under degradation, we fetch from the database. 
        // Optional: Implement a rate-limiter or semaphore here to prevent database overload!
        Product dbProduct = productRepository.findById(productId)
                .orElseThrow(() -> new ResourceNotFoundException("Product not found: " + productId));

        // Populate L1 (Local Memory) only. Do NOT touch Redis while it is struggling.
        l1CaffeineCache.put(productId, dbProduct);

        return dbProduct;
    }

    private void populateCaches(String productId, Product product) {
        // Populate L1
        l1CaffeineCache.put(productId, product);

        // Populate L2 (Redis) with write-timeout protection
        try {
            redisTemplate.opsForValue().set(
                REDIS_KEY_PREFIX + productId, 
                product, 
                Duration.ofMinutes(10)
            );
        } catch (Exception e) {
            log.error("Failed to populate L2 Redis Cache. Suppressing exception to avoid client disruption.", e);
        }
    }
}
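
The fallback method's comment suggests a rate limiter or semaphore to protect the database. One possible shape for that guard, using only `java.util.concurrent.Semaphore`, is sketched below. The class `DatabaseBulkhead` and its `tryRead` method are hypothetical names, not part of the article's code.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical guard that caps concurrent database reads during degradation. */
final class DatabaseBulkhead {

    private final Semaphore permits;

    DatabaseBulkhead(int maxConcurrentReads) {
        this.permits = new Semaphore(maxConcurrentReads);
    }

    /**
     * Runs the query only if a permit is free. Returns Optional.empty()
     * immediately when the limit is reached, so callers shed load
     * instead of queueing behind a struggling database.
     * Assumption: dbQuery returns a non-null value on success.
     */
    <T> Optional<T> tryRead(Supplier<T> dbQuery) {
        if (!permits.tryAcquire()) {
            return Optional.empty();
        }
        try {
            return Optional.ofNullable(dbQuery.get());
        } finally {
            permits.release(); // always return the permit, even if the query throws
        }
    }
}
```

In the fallback path, an empty result could map to a stale L1 value or an HTTP 503, depending on how much staleness the endpoint tolerates.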

Application Configuration (application.yml)

The circuit breaker config is critical. We must set low timeouts for Redis connections and configure the circuit breaker sensitivity to trip quickly.

spring:
  data:
    redis:
      host: localhost
      port: 6379
      connect-timeout: 200ms # Short connect timeout
      timeout: 100ms         # Very aggressive read timeout; healthy cache lookups take only a few milliseconds

resilience4j:
  circuitbreaker:
    instances:
      redisService:
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 20           # Track last 20 requests
        minimumNumberOfCalls: 10         # Min calls before calculating error rate
        failureRateThreshold: 50         # Trip if 50% of last 20 calls failed
        slowCallRateThreshold: 75        # Trip if 75% of calls are slower than limit
        slowCallDurationThreshold: 50ms  # Call is "slow" if it takes > 50ms
        waitDurationInOpenState: 15s     # Keep breaker open for 15s before retrying
        permittedNumberOfCallsInHalfOpenState: 5
        automaticTransitionFromOpenToHalfOpenEnabled: true
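
To make the threshold arithmetic concrete, here is a simplified, single-threaded model of a count-based window using the values above (window of 20, minimum of 10 calls, 50% failure threshold). It is a sketch of the idea only and is not Resilience4j's actual implementation; the class name `CountBasedWindow` is made up for illustration.

```java
/** Simplified model of a count-based failure window; not thread-safe. */
final class CountBasedWindow {

    private final boolean[] failures;          // ring buffer of the most recent outcomes
    private final int minimumCalls;            // no verdict before this many calls
    private final double failureRateThreshold; // percentage, e.g. 50.0
    private int recorded;                      // number of outcomes currently in the window
    private int next;                          // index of the slot to overwrite next
    private int failureCount;                  // failures currently in the window

    CountBasedWindow(int windowSize, int minimumCalls, double failureRateThreshold) {
        this.failures = new boolean[windowSize];
        this.minimumCalls = minimumCalls;
        this.failureRateThreshold = failureRateThreshold;
    }

    /** Records one call outcome and reports whether the breaker should trip. */
    boolean recordAndCheck(boolean failed) {
        if (recorded == failures.length && failures[next]) {
            failureCount--; // the oldest outcome falls out of the window
        }
        failures[next] = failed;
        if (failed) {
            failureCount++;
        }
        next = (next + 1) % failures.length;
        if (recorded < failures.length) {
            recorded++;
        }
        if (recorded < minimumCalls) {
            return false; // too few samples to judge
        }
        double failureRate = 100.0 * failureCount / recorded;
        return failureRate >= failureRateThreshold;
    }
}
```

With these settings, nine consecutive failures on a cold breaker do not trip it, but the tenth does, because the rate is only evaluated once `minimumNumberOfCalls` outcomes have been recorded.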

4. Mitigating Cache Stampede: Mutex Locking

If a highly popular cache key expires (e.g., home page configuration), the fallback database lookup can still trigger a spike. To solve this, we implement a single-flight lock (a per-key mutex inside each JVM) so that only one thread per instance fetches the missing data from the database, while other concurrent requests wait briefly or receive stale data.

Below is an implementation of a thread-safe, per-key local lock that guards database reads:

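import com.github.benmanes.caffeine.cache.Cache;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Service;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Product and ProductRepository are the same types used by ResilientProductService.
@Service
@RequiredArgsConstructor // generates the constructor for the uninitialized final fields
public class StampedeProofProductService {

    private final ProductRepository productRepository;
    private final Cache<String, Product> l1Cache;
    private final ConcurrentHashMap<String, ReentrantLock> keyLocks = new ConcurrentHashMap<>();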

    public Product getProductWithStampedeProtection(String productId) {
        Product product = l1Cache.getIfPresent(productId);
        if (product != null) {
            return product;
        }

        // Get or create a lock specific to this productId
        ReentrantLock lock = keyLocks.computeIfAbsent(productId, k -> new ReentrantLock());

        if (lock.tryLock()) {
            try {
                // Double-checked locking pattern
                Product doubleCheck = l1Cache.getIfPresent(productId);
                if (doubleCheck != null) {
                    return doubleCheck;
                }

                // Fetch from Database
                Product dbProduct = productRepository.findById(productId).orElseThrow();
                l1Cache.put(productId, dbProduct);
                return dbProduct;
            } finally {
                lock.unlock();
                keyLocks.remove(productId); // Clean up map
            }
        } else {
            // If lock cannot be acquired, a concurrent thread is already pulling from the DB.
            // Option A: Sleep briefly and retry local cache lookup.
            // Option B: Serve a stale/cached default payload.
            return handleContentionFallback(productId);
        }
    }

    private Product handleContentionFallback(String productId) {
        try {
            Thread.sleep(50); // Small backoff
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // Retry once from L1
        Product fallback = l1Cache.getIfPresent(productId);
        if (fallback != null) {
            return fallback;
        }
        // Last-resort mock / read-through fallback logic
        return Product.builder().id(productId).name("Fallback Limited Details").build();
    }
}
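
As an alternative to returning a placeholder when the lock is contended, waiting callers can share the leader's result. The sketch below swaps the `ReentrantLock` approach for a map of in-flight `CompletableFuture`s; it is a different technique from the one above, and `SingleFlightLoader` is a hypothetical name. The wait is unbounded here, so a real service would add a timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical single-flight loader: concurrent callers for one key share one load. */
final class SingleFlightLoader<K, V> {

    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // e.g. the database read

    SingleFlightLoader(Function<K, V> loader) {
        this.loader = loader;
    }

    V load(K key) {
        CompletableFuture<V> mine = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, mine);
        if (existing != null) {
            // Another thread is already loading this key: wait for its result.
            // A failed load surfaces here as a CompletionException.
            return existing.join();
        }
        try {
            V value = loader.apply(key);
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            mine.completeExceptionally(e); // waiters observe the same failure
            throw e;
        } finally {
            inFlight.remove(key, mine); // later callers start a fresh load
        }
    }
}
```

This only deduplicates loads that overlap in time; it is not a cache, so the caller should still store the returned value in L1 as the service above does.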

5. Operational Verification: Testing Caching Failure Modes

To verify that your degradation layer works, run a chaos test for each failure mode:

| Test Case | Simulation Action | Expected System Behavior | Verified? |
| --- | --- | --- | --- |
| Normal Path | Warm cache read | L1 hit (no network hop) / L2 hit (1-3ms latency). | [ ] |
| L2 Connection Drop | Block port 6379 using iptables | First few requests trigger timeouts. Circuit breaker trips to OPEN. Subsequent requests are served from L1 or the DB without querying Redis. | [ ] |
| Redis CPU Spike (100%) | Execute complex script in Redis | Reads exceed the 50ms slow-call threshold. Circuit breaker transitions to OPEN due to slowCallRateThreshold. DB protected. | [ ] |
| Self-Healing | Unblock port 6379 | Circuit breaker goes to HALF-OPEN after 15s and permits 5 trial calls. If they succeed, the breaker returns to CLOSED and the system is restored automatically. | [ ] |

Summary: Caching Best Practices for Architects

  1. Establish Aggressive Timeouts: Never rely on default timeouts for cache connections. For Redis, keep the connection timeout at or below 200ms and the command execution timeout at or below 100ms.
  2. Never Let Redis Failure Crash the App: Wrap distributed cache calls in a fallback or a circuit breaker.
  3. Keep L1 Lean: Use L1 (Caffeine/Ehcache) to cache high-frequency read keys with extremely short TTLs (e.g., 30-60 seconds) to guard against sudden hot-key spikes.
  4. Log & Monitor Cache State Transitions: Create alerts for Circuit Breaker status changes (CLOSED to OPEN) to detect issues before they affect end-users.
