Your system handles 10,000 requests per second. Everything is fine until 11 PM when your database connection pool exhausts itself.
What you see: Requests start failing with SQLException: Cannot get a Connection. Your monitoring alerts scream. Panic sets in.
What you do: You throw cache at it. Add Redis. Deploy. Response times drop, errors clear up. You declare victory.
Then, 48 hours later, it happens again. This time, Redis is also getting hammered. Cache misses climb to 80%. You're hitting the database anyway.
The uncomfortable truth? Redis alone doesn't protect you from a database failure. Neither does a circuit breaker alone. Together, though, they cover each other's blind spots.
But most implementations treat them as separate concerns. They shouldn't be. In a high-throughput system, they need to work together, understanding what the other is doing, and degrading intelligently when things break.
This is where most Spring Boot applications fail under real load.
The Three Failure Modes of High-Throughput Systems
Before we talk about solutions, let's understand what actually kills your database:
Failure Mode 1: The Cache-Miss Avalanche
Your cache is working perfectly. Hit rate is 95%. Then something goes wrong: a network blip, a bad deploy, a batch of keys that all expire at the same moment, or the cache layer itself going down.
Suddenly, every single request hits the database.
Normal state:
100 req/s → Cache (95 hits) → 5 DB queries/s ✓
Cache fails:
100 req/s → Cache (0 hits) → 100 DB queries/s ✗
Database connection pool: 20 connections
Queue: 80 requests waiting
Connections timeout: 50 failures
More requests arrive: Cascading failure
Your database was designed for 5 queries per second. It just got 100. Connection pools exhaust. Queries queue up and timeout. Your database suffers a slow, painful death.
And here's the cruel part: even after the cache recovers, the damage is done. Your database is already under water, responding slowly, triggering more cache misses, which triggers more database hits, which makes the database slower...
This is a cascade. And a circuit breaker is your only exit.
Failure Mode 2: The Slow-Database Trap
Your database becomes slow but doesn't fail. A missing index. A long-running migration. A query that's suddenly taking 500ms instead of 50ms.
Your thread pool has 20 threads. Each thread is now blocking for 500ms waiting for a database response. Within seconds, all 20 threads are busy. New requests queue up (remember Article 3? unbounded queues?). Threads are still blocking. Queue grows. Memory pressure increases.
Your application dies not because the database failed, but because it got slow.
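The arithmetic behind that collapse is worth making explicit. With a fixed thread pool and fully blocking calls, sustainable throughput is capped at threads divided by per-call latency (a direct consequence of Little's law). A tiny plain-Java sketch with illustrative numbers:

```java
public class ThroughputCeiling {

    // With N threads each blocking for L ms per request, sustainable
    // throughput is capped at N * 1000 / L requests per second.
    static long maxRequestsPerSecond(int threads, long dbLatencyMillis) {
        return threads * 1000L / dbLatencyMillis;
    }

    public static void main(String[] args) {
        // Healthy database: 50 ms per query
        System.out.println(maxRequestsPerSecond(20, 50));   // 400
        // Slow database: 500 ms per query -- a 10x drop in capacity
        System.out.println(maxRequestsPerSecond(20, 500));  // 40
    }
}
```

At 500 ms per query, 20 threads can serve at most 40 requests per second. Any load above that queues, and the queue is what kills you.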
A circuit breaker sees the slow responses, recognizes the pattern, and stops sending traffic to the database. Suddenly, your API returns 503 quickly instead of timing out after 30 seconds.
Failure Mode 3: The Resource Leak
Your connection pool is exhausted, but it's not because of traffic. It's because of connections that aren't being returned. A bug in your code, a timeout that doesn't close the connection, a third-party library that leaks resources.
Now you have:
- 20 available connections in the pool
- 50 connections checked out but never returned
- 100 new requests trying to acquire connections
- 80 requests timing out waiting
Your database is fine. Your application is broken.
A circuit breaker won't fix this directly, but combined with proper monitoring and backpressure (timeout handling), it ensures you fail fast instead of hanging.
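If you're on HikariCP (Spring Boot's default pool), it can flag leaked connections for you. A sketch of the relevant settings — the values here are illustrative, tune them for your workload:

```yaml
spring:
  datasource:
    hikari:
      maximum-pool-size: 20
      connection-timeout: 3000         # fail fast instead of queueing forever (ms)
      max-lifetime: 1800000            # recycle connections every 30 minutes (ms)
      leak-detection-threshold: 60000  # warn if a connection is held > 60s (ms)
```

The leak-detection warning won't fix the bug, but it tells you which code path is holding connections hostage.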
The Architecture Most Teams Get Wrong
Here's what I see in most Spring Boot applications:
```java
@Service
public class ProductService {

    @Autowired
    private ProductRepository repository;

    @Cacheable("products")
    public Product getProduct(Long id) {
        // Redis caching at the method level
        return repository.findById(id).orElseThrow();
    }
}
```
This has problems:
Cache layer is independent of database layer. If the database is slow, the cache layer doesn't know. It might serve stale data forever, or it might just let requests pile up.
No circuit breaker on the database. If the database fails, every cache miss attempts to hit it, potentially making things worse.
No timeout strategy. How long does the cache check wait for the database? If it waits forever, you've just turned a database failure into an application hang.
No degradation path. What happens when both cache and database are struggling? Do you return stale data? Return errors? Return nothing?
Here's what you actually need:
Request
↓
[Check Redis Cache]
├─→ CACHE HIT: Return immediately ✓
├─→ CACHE MISS: Try database
│ ├─→ Circuit OPEN (DB is down): Return fallback/error
│ ├─→ Circuit HALF-OPEN: Try DB, update cache if success
│ ├─→ Circuit CLOSED: Query DB, update cache, return ✓
│ └─→ Query TIMEOUT: Fail fast, don't retry
└─→ CACHE ERROR: Try database with circuit breaker
This is the architecture that survives real-world chaos.
Implementation: Redis + Circuit Breaker in Spring Boot
Let's build this step by step.
Step 1: Basic Redis Configuration with Timeout
```java
@Configuration
@EnableCaching
public class CacheConfig {

    @Bean
    public RedisCacheManagerBuilderCustomizer redisCacheManagerBuilderCustomizer() {
        return builder -> builder
            .withCacheConfiguration("products",
                RedisCacheConfiguration.defaultCacheConfig()
                    .entryTtl(Duration.ofMinutes(30))
                    .serializeValuesWith(
                        RedisSerializationContext.SerializationPair.fromSerializer(
                            new GenericJackson2JsonRedisSerializer())))
            .withCacheConfiguration("orders",
                RedisCacheConfiguration.defaultCacheConfig()
                    .entryTtl(Duration.ofMinutes(10)));
    }

    @Bean
    public LettuceConnectionFactory redisConnectionFactory() {
        return new LettuceConnectionFactory();
    }

    @Bean
    public RedisTemplate<String, Object> redisTemplate(
            RedisConnectionFactory connectionFactory) {
        RedisTemplate<String, Object> template = new RedisTemplate<>();
        template.setConnectionFactory(connectionFactory);
        return template;
    }
}
```
Notice the TTL is explicit. In a high-throughput system, you can't afford cache entries that live forever.
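One refinement worth considering (not shown above): if thousands of entries are written in the same burst, a fixed 30-minute TTL makes them all expire in the same burst too, which re-creates the avalanche on a schedule. Adding a small random jitter per entry spreads the expiries out. A sketch, usable with the manual `redisTemplate.opsForValue().set(key, value, ttl)` calls later in this article:

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

class TtlJitter {
    // Returns the base TTL plus up to `jitterPercent` percent of random slack,
    // so entries written together don't all expire together.
    static Duration withJitter(Duration base, int jitterPercent) {
        long maxJitterMillis = base.toMillis() * jitterPercent / 100;
        long jitter = ThreadLocalRandom.current().nextLong(maxJitterMillis + 1);
        return base.plusMillis(jitter);
    }
}
```

`withJitter(Duration.ofMinutes(30), 10)` yields a TTL somewhere between 30 and 33 minutes, different for each entry.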
Step 2: Add Resilience4j Circuit Breaker
Add the dependency:
```xml
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-circuitbreaker</artifactId>
    <version>2.1.0</version>
</dependency>
```
Configure circuit breaker in application.yml:
```yaml
resilience4j:
  circuitbreaker:
    configs:
      default:
        registerHealthIndicator: true
        slidingWindowSize: 100
        failureRateThreshold: 50.0
        slowCallRateThreshold: 50.0
        slowCallDurationThreshold: 2000ms
        permittedNumberOfCallsInHalfOpenState: 10
        automaticTransitionFromOpenToHalfOpenEnabled: true
        waitDurationInOpenState: 10s
    instances:
      productRepository:
        baseConfig: default
        slidingWindowType: COUNT_BASED
```
What this means:
- Track the last 100 calls
- If 50% fail OR 50% are slower than 2 seconds, OPEN the circuit
- When OPEN, wait 10 seconds before trying again (HALF-OPEN)
- In HALF-OPEN, allow 10 attempts before deciding if things are better
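If the state machine feels abstract, here's a stripped-down, count-based sketch of the core decision in plain Java. This is not Resilience4j's actual implementation — it ignores slow calls, HALF-OPEN trials, and wait timers — just the CLOSED-to-OPEN transition over a sliding window:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy count-based breaker: opens when the failure rate over the last
// `windowSize` calls crosses `failureRateThreshold`. Illustrative only.
class ToyCircuitBreaker {
    enum State { CLOSED, OPEN }

    private final int windowSize;
    private final double failureRateThreshold; // e.g. 50.0 (%)
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failure
    private State state = State.CLOSED;

    ToyCircuitBreaker(int windowSize, double failureRateThreshold) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
    }

    void record(boolean failure) {
        if (window.size() == windowSize) {
            window.removeFirst(); // slide the window
        }
        window.addLast(failure);
        long failures = window.stream().filter(f -> f).count();
        double failureRate = 100.0 * failures / window.size();
        // Only evaluate once the window is full (Resilience4j's
        // minimumNumberOfCalls plays a similar role)
        if (window.size() == windowSize && failureRate >= failureRateThreshold) {
            state = State.OPEN;
        }
    }

    State state() { return state; }
}
```

With `windowSize = 100` and a threshold of 50.0, the breaker stays CLOSED until at least 50 of the last 100 recorded calls failed.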
Step 3: The Real Implementation
Here's where Redis and circuit breaker come together:
```java
@Service
@Slf4j
public class ProductService {

    @Autowired
    private ProductRepository repository;

    @Autowired
    private RedisTemplate<String, Object> redisTemplate;

    // Inject the registry, not a CircuitBreaker directly -- Resilience4j
    // registers instances by name, so we look ours up when we need it.
    @Autowired
    private CircuitBreakerRegistry circuitBreakerRegistry;

    private static final String CACHE_KEY_PREFIX = "products:";
    private static final Duration CACHE_TTL = Duration.ofMinutes(30);
    private static final Duration QUERY_TIMEOUT = Duration.ofSeconds(5);

    public Product getProduct(Long id) {
        String cacheKey = CACHE_KEY_PREFIX + id;

        try {
            // Step 1: Check Redis first
            Object cached = redisTemplate.opsForValue().get(cacheKey);
            if (cached != null) {
                log.debug("Cache hit for product {}", id);
                return (Product) cached;
            }
        } catch (Exception e) {
            // Redis is down. Don't let this prevent us from trying the database.
            log.warn("Redis cache check failed for product {}: {}", id, e.getMessage());
        }

        // Step 2: Cache miss or cache error. Try database with circuit breaker.
        // `var` avoids importing the CircuitBreaker interface, whose simple name
        // would clash with the @CircuitBreaker annotation used below.
        var productCircuitBreaker = circuitBreakerRegistry.circuitBreaker("productRepository");
        try {
            Product product = productCircuitBreaker.executeSupplier(() ->
                queryProductWithTimeout(id)
            );

            // Step 3: Update cache on successful database hit
            try {
                redisTemplate.opsForValue().set(cacheKey, product, CACHE_TTL);
                log.debug("Cached product {} in Redis", id);
            } catch (Exception e) {
                log.warn("Failed to cache product {}: {}", id, e.getMessage());
                // Don't fail the request. Just return the product.
            }

            return product;
        } catch (CallNotPermittedException e) {
            // Circuit is OPEN. Database is down.
            log.error("Circuit breaker is OPEN for product {}", id);
            return getProductFallback(id);
        }
        // Any other failure (timeout, SQL error) propagates; the circuit
        // breaker has already counted it toward the failure rate.
    }

    private Product queryProductWithTimeout(Long id) {
        // The circuit breaker wraps this, monitoring for failures and slowness.
        // Note: on timeout the underlying JDBC call keeps running, so also set
        // a statement/JDBC timeout at the datasource level.
        try {
            return CompletableFuture.supplyAsync(() ->
                repository.findById(id).orElseThrow(
                    () -> new ProductNotFoundException("Product " + id + " not found")
                )
            ).get(QUERY_TIMEOUT.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new DatabaseTimeoutException(
                "Database query timed out after " + QUERY_TIMEOUT.toMillis() + "ms");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new DatabaseException("Database query interrupted", e);
        } catch (ExecutionException e) {
            if (e.getCause() instanceof ProductNotFoundException notFound) {
                throw notFound; // a missing product is not a database failure
            }
            throw new DatabaseException("Database query failed: " + e.getMessage(), e);
        }
    }

    private Product getProductFallback(Long id) {
        // Circuit is OPEN. Try to return stale data from Redis if available.
        try {
            Object cached = redisTemplate.opsForValue().get(CACHE_KEY_PREFIX + id);
            if (cached != null) {
                log.info("Returning stale cache for product {} (circuit is open)", id);
                return (Product) cached;
            }
        } catch (Exception e) {
            log.debug("Cannot access Redis fallback: {}", e.getMessage());
        }

        // No fallback available. Return error.
        throw new ServiceUnavailableException(
            "Product service is temporarily unavailable. Please try again.");
    }

    @CircuitBreaker(name = "productRepository") // the Resilience4j annotation
    public void updateProduct(Long id, ProductUpdateRequest request) {
        Product product = repository.findById(id).orElseThrow();
        product.setName(request.name());
        product.setPrice(request.price());
        repository.save(product);

        // Explicitly clear the Redis entry we manage ourselves (this class
        // bypasses Spring's cache abstraction, so @CacheEvict wouldn't help)
        try {
            redisTemplate.delete(CACHE_KEY_PREFIX + id);
        } catch (Exception e) {
            log.warn("Failed to clear cache for product {}: {}", id, e.getMessage());
        }
    }

    @CircuitBreaker(name = "productRepository")
    public List<Product> searchProducts(String query) {
        // Use a different cache for search results with a shorter TTL
        String cacheKey = "search:" + query.hashCode();

        try {
            Object cached = redisTemplate.opsForValue().get(cacheKey);
            if (cached != null) {
                return (List<Product>) cached;
            }
        } catch (Exception e) {
            log.warn("Search cache check failed: {}", e.getMessage());
        }

        List<Product> results = repository.findByNameContainingIgnoreCase(query);

        try {
            // Search results change frequently, shorter TTL
            redisTemplate.opsForValue().set(cacheKey, results, Duration.ofMinutes(5));
        } catch (Exception e) {
            log.warn("Failed to cache search results: {}", e.getMessage());
        }

        return results;
    }
}
```
This is a real implementation. Let's break down what's happening:
Cache check is non-blocking. If Redis fails, we continue. Redis failures don't cascade.
Circuit breaker wraps database calls. If the database is slow or fails, the circuit opens automatically, preventing further hammering.
Timeout is enforced. Database queries can't block forever. After 5 seconds, we fail fast.
Fallback uses stale data. When the circuit is open, we return yesterday's data if available. A stale product is better than no product.
Cache invalidation is explicit. When data changes, we evict it.
Step 4: Monitoring and Metrics
You need visibility into what's happening:
```java
@Component
@Slf4j
public class CircuitBreakerMetrics {

    @Autowired
    private CircuitBreakerRegistry circuitBreakerRegistry;

    @Scheduled(fixedRate = 10000) // Every 10 seconds
    public void logMetrics() {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("productRepository");

        log.info("Circuit Breaker Stats - State: {}, Failure Rate: {}%, Slow Call Rate: {}%",
            cb.getState(),
            cb.getMetrics().getFailureRate(),
            cb.getMetrics().getSlowCallRate());

        // Redis-backed Spring caches don't expose size or hit rate through
        // CacheManager; wire up Micrometer via the actuator if you want
        // cache metrics alongside the circuit breaker stats.
    }
}
```
Add to application.yml:
```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,metrics,circuitbreakers
  endpoint:
    health:
      show-details: always
  metrics:
    tags:
      application: product-service
```
Now you can check /actuator/health to see circuit breaker status:
```json
{
  "status": "UP",
  "components": {
    "circuitBreakers": {
      "status": "UP",
      "details": {
        "productRepository": {
          "status": "UP",
          "details": {
            "state": "CLOSED",
            "failureRate": "0.0%",
            "slowCallRate": "5.2%"
          }
        }
      }
    }
  }
}
```
The Gotchas: Where This Implementation Can Still Break
Gotcha 1: Cache Stampede on Circuit Breaker Recovery
When the circuit transitions from OPEN to HALF-OPEN, up to 10 requests (your permittedNumberOfCallsInHalfOpenState) hit the database simultaneously. If many of those requests want the same product, they all miss the cache and query it at once.
Circuit: OPEN
All requests get fallback (stale data) ✓
Circuit: HALF-OPEN (10 permits)
Request 1: Query product 42 → Hits DB
Request 2: Query product 42 → Hits DB (cache miss, both in-flight)
Request 3: Query product 42 → Hits DB (cache miss, all three in-flight)
...
All 10 permits → Same 3 products queried 10 times total
Solution: coalesce concurrent lookups so only one in-flight database query runs per key. Every other caller for the same product waits on the same future instead of issuing its own query (this helps ordinary cache misses too, not just the HALF-OPEN burst):

```java
@Service
@Slf4j
public class ProductService {

    // One in-flight DB lookup per product id; concurrent callers for the
    // same id share the result instead of each hitting the database.
    private final ConcurrentHashMap<Long, CompletableFuture<Product>> inFlight =
            new ConcurrentHashMap<>();

    private Product loadProductCoalesced(Long id) {
        CompletableFuture<Product> future = inFlight.computeIfAbsent(id, key ->
                CompletableFuture.supplyAsync(() -> queryProductWithTimeout(key))
                        .whenComplete((result, error) -> inFlight.remove(key)));
        return future.join();
    }
}
```

Call loadProductCoalesced on the cache-miss path and the 10 HALF-OPEN permits collapse into one query per distinct product.
Gotcha 2: Stale Data Divergence
Your circuit is open. You're serving stale product prices from Redis. Meanwhile, your admin panel is writing new prices to the database. When the circuit closes, users see the old price, then it suddenly jumps to the new one. Confusion ensues.
Solution: Add a "stale" marker to cache entries:
```java
public class CachedProduct {
    public Product product;
    public long cachedAtMillis;
    public boolean isStale;
}

private Product getProductFallback(Long id) {
    try {
        Object cached = redisTemplate.opsForValue().get(CACHE_KEY_PREFIX + id);
        if (cached instanceof CachedProduct cp) {
            long age = System.currentTimeMillis() - cp.cachedAtMillis;
            if (age > 60_000) { // Older than 1 minute
                cp.isStale = true;
            }
            log.info("Returning {} cache for product {}",
                cp.isStale ? "stale" : "fresh", id);
            return cp.product;
        }
    } catch (Exception e) {
        log.debug("Cannot access Redis fallback: {}", e.getMessage());
    }
    throw new ServiceUnavailableException("Product service unavailable");
}
```
Include staleness in the API response so clients know the data age.
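A minimal shape for that response — the field names here are an assumption, pick whatever fits your API:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical API response wrapper that tells clients how old the data is.
record ProductResponse(Long id, String name, long dataAgeSeconds, boolean stale) {

    // Marks anything older than 60 seconds as stale, matching the fallback above.
    static ProductResponse from(Long id, String name, Instant cachedAt, Instant now) {
        long ageSeconds = Duration.between(cachedAt, now).getSeconds();
        return new ProductResponse(id, name, ageSeconds, ageSeconds > 60);
    }
}
```

Clients that care (a checkout page, say) can refuse stale data; clients that don't (a product listing) can render it anyway.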
Gotcha 3: Circuit Breaker State Not Visible to Load Balancer
Your application has 5 instances. Instance 1's circuit breaker is open (it detected the database is slow). Instances 2-5 still think the database is fine and send traffic to it. You get 80% of the benefit, not 100%.
Solution: Make circuit breaker state part of the health check. Your load balancer removes unhealthy instances:
```java
@Component
public class CircuitBreakerHealthIndicator implements HealthIndicator {

    @Autowired
    private CircuitBreakerRegistry registry;

    @Override
    public Health health() {
        CircuitBreaker cb = registry.circuitBreaker("productRepository");

        if (cb.getState() == CircuitBreaker.State.OPEN) {
            return Health.outOfService()
                .withDetail("reason", "Circuit breaker is OPEN")
                .build();
        }
        return Health.up().build();
    }
}
```
Now, when one instance's circuit opens, it reports OUT_OF_SERVICE and gets removed from the load balancer. One caution: if the database itself is down, every instance's circuit opens at once, and a health check like this can remove the entire fleet. Use it for load-balancer routing, not for your liveness probe.
Gotcha 4: Cache Invalidation Across Instances
You have 5 instances. Instance 1 updates a product and clears the Redis cache. But Instance 2 is still serving the old value because it has it in local memory (if you're using multi-level caching).
Solution: Use a cache invalidation message broker:
```java
@Service
@Slf4j
public class ProductService {

    @Autowired
    private ProductRepository repository;

    @Autowired
    private StringRedisTemplate redisTemplate;

    @Autowired
    private ApplicationEventPublisher eventPublisher;

    public void updateProduct(Long id, ProductUpdateRequest request) {
        Product product = repository.findById(id).orElseThrow();
        product.setName(request.name());
        repository.save(product);

        // Publish invalidation event to every instance via Redis pub/sub
        redisTemplate.convertAndSend("cache-invalidation", "product:" + id);
        eventPublisher.publishEvent(new ProductInvalidatedEvent(id));
    }
}
```

Each instance subscribes to the channel and clears its local caches when a message arrives:

```java
@Configuration
@Slf4j
public class CacheInvalidationConfig {

    @Bean
    public RedisMessageListenerContainer cacheInvalidationContainer(
            RedisConnectionFactory connectionFactory) {
        RedisMessageListenerContainer container = new RedisMessageListenerContainer();
        container.setConnectionFactory(connectionFactory);
        container.addMessageListener(
            (message, pattern) -> {
                String key = new String(message.getBody(), StandardCharsets.UTF_8);
                log.info("Invalidating local cache for {}", key);
                // Clear in-memory (local) caches for this key here
            },
            new ChannelTopic("cache-invalidation"));
        return container;
    }
}
```
The Real-World Trade-Offs
You've now built a system that handles:
- Cache misses without failing
- Database failures without crashing
- Slow databases without hanging
- Circuit breaker recovery without stampeding
But it's more complex. More moving parts. More things that can fail.
Here's the honest assessment:
| Scenario | Complexity | Reliability | Cost |
|---|---|---|---|
| Just database | Low | Low (high failure risk) | Low |
| Just cache | Low | Medium (cache failures) | Low |
| Database + cache | Medium | Medium (no circuit breaker) | Medium |
| Database + cache + circuit breaker | High | High (handles failures gracefully) | Medium |
You pay in complexity. You gain in reliability. For a high-throughput system, that trade is almost always worth it.
The Mindset Shift
Here's what separates systems that survive 3 AM pages from those that don't:
Wrong approach: "Redis will make everything fast, and the database will always be there."
Right approach: "Redis handles the happy path. Circuit breaker handles the sad path. Together, they make the system survive chaos."
The circuit breaker isn't a performance optimization. It's a survival mechanism. It's saying: "If things are broken, fail fast and gracefully, not catastrophically."
Redis isn't just a cache. It's a fallback layer. It's saying: "Yesterday's data is better than no data."
When you implement both, you're not building a system that's 2x better. You're building a system that behaves fundamentally differently under stress. Instead of cascading failures, you get degradation. Instead of 10 seconds of errors, you get 10 seconds of slower responses with stale data.
Your users might not notice the difference. Your oncall engineer will.
One More Thing
This implementation assumes you control both the cache and circuit breaker. If you're using a managed cache (AWS ElastiCache, Azure Cache for Redis) and a managed database (RDS, Cloud SQL), you can't control the circuit breaker on their side.
What you can do: Put the circuit breaker in your application, not in the infrastructure. That's what this article shows.
And if you're wondering why you haven't had to think about this before: You probably haven't scaled to the point where it matters. Congratulations. But keep this pattern in your back pocket. When you do scale, you'll be ready.
Your 11 PM database will thank you.