Your system handles 10,000 requests per second. Everything is fine until 11 PM when your database connection pool exhausts itself.
What you see: Requests start failing with SQLException: Cannot get a Connection. Your monitoring alerts scream. Panic sets in.
What you do: You throw cache at it. Add Redis. Deploy. Response times drop, errors clear up. You declare victory.
Then, 48 hours later, it happens again. This time, Redis is also getting hammered. Cache misses climb to 80%. You're hitting the database anyway.
The uncomfortable truth? Redis alone doesn't protect you from a database failure. Neither does a circuit breaker alone. Together, though, they cover each other's blind spots.
But most implementations treat them as separate concerns. They shouldn't be. In a high-throughput system, they need to work together, understanding what the other is doing, and degrading intelligently when things break.
This is where most Spring Boot applications fail under real load.
The Three Failure Modes of High-Throughput Systems
Before we talk about solutions, let's understand what actually kills your database:
Failure Mode 1: The Cache-Miss Avalanche
Your cache is working perfectly. Hit rate is 95%. Then something goes wrong: a network blip, a bad deploy, a batch of keys that all expire at the same moment, or the cache layer itself going down.
Suddenly, every single request hits the database.
Normal state:
100 req/s → Cache (95 hits) → 5 DB queries/s ✓
Cache fails:
100 req/s → Cache (0 hits) → 100 DB queries/s ✗
Database connection pool: 20 connections
Queue: 80 requests waiting
Connections timeout: 50 failures
More requests arrive: Cascading failure
Your database was designed for 5 queries per second. It just got 100. Connection pools exhaust. Queries queue up and timeout. Your database suffers a slow, painful death.
And here's the cruel part: even after the cache recovers, the damage is done. Your database is already under water, responding slowly, triggering more cache misses, which triggers more database hits, which makes the database slower...
This is a cascade. And a circuit breaker is your only exit.
Failure Mode 2: The Slow-Database Trap
Your database becomes slow but doesn't fail. A missing index. A long-running migration. A query that's suddenly taking 500ms instead of 50ms.
Your thread pool has 20 threads. Each thread is now blocking for 500ms waiting for a database response. Within seconds, all 20 threads are busy. New requests queue up (remember Article 3? unbounded queues?). Threads are still blocking. Queue grows. Memory pressure increases.
Your application dies not because the database failed, but because it got slow.
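The arithmetic behind that collapse is worth making explicit. With a fixed thread pool and fully blocking calls, sustainable throughput is capped at threads divided by per-call latency (a direct consequence of Little's law). A tiny plain-Java sketch with illustrative numbers:

```java
public class ThroughputCeiling {

    // With N threads each blocking for L ms per request, sustainable
    // throughput is capped at N * 1000 / L requests per second.
    static long maxRequestsPerSecond(int threads, long dbLatencyMillis) {
        return threads * 1000L / dbLatencyMillis;
    }

    public static void main(String[] args) {
        // Healthy database: 50 ms per query
        System.out.println(maxRequestsPerSecond(20, 50));   // 400
        // Slow database: 500 ms per query -- a 10x drop in capacity
        System.out.println(maxRequestsPerSecond(20, 500));  // 40
    }
}
```

At 500 ms per query, 20 threads can serve at most 40 requests per second. Any load above that queues, and the queue is what kills you.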
A circuit breaker sees the slow responses, recognizes the pattern, and stops sending traffic to the database. Suddenly, your API returns 503 quickly instead of timing out after 30 seconds.
Failure Mode 3: The Resource Leak
Your connection pool is exhausted, but it's not because of traffic. It's because of connections that aren't being returned. A bug in your code, a timeout that doesn't close the connection, a third-party library that leaks resources.
Now you have:
- 20 available connections in the pool
- 50 connections checked out but never returned
- 100 new requests trying to acquire connections
- 80 requests timing out waiting
Your database is fine. Your application is broken.
A circuit breaker won't fix this directly, but combined with proper monitoring and backpressure (timeout handling), it ensures you fail fast instead of hanging.
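If you're on HikariCP (Spring Boot's default pool), it can flag leaked connections for you. A sketch of the relevant settings — the values here are illustrative, tune them for your workload:

```yaml
spring:
  datasource:
    hikari:
      maximum-pool-size: 20
      connection-timeout: 3000         # fail fast instead of queueing forever (ms)
      max-lifetime: 1800000            # recycle connections every 30 minutes (ms)
      leak-detection-threshold: 60000  # warn if a connection is held > 60s (ms)
```

The leak-detection warning won't fix the bug, but it tells you which code path is holding connections hostage.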
The Architecture Most Teams Get Wrong
Here's what I see in most Spring Boot applications:
```java
@Service
public class ProductService {

    @Autowired
    private ProductRepository repository;

    @Cacheable("products")
    public Product getProduct(Long id) {
        // Redis caching at the method level
        return repository.findById(id).orElseThrow();
    }
}
```
This has problems:
Cache layer is independent of database layer. If the database is slow, the cache layer doesn't know. It might serve stale data forever, or it might just let requests pile up.
No circuit breaker on the database. If the database fails, every cache miss attempts to hit it, potentially making things worse.
No timeout strategy. How long does the cache check wait for the database? If it waits forever, you've just turned a database failure into an application hang.
No degradation path. What happens when both cache and database are struggling? Do you return stale data? Return errors? Return nothing?
Here's what you actually need:
Request
↓
[Check Redis Cache]
├─→ CACHE HIT: Return immediately ✓
├─→ CACHE MISS: Try database
│ ├─→ Circuit OPEN (DB is down): Return fallback/error
│ ├─→ Circuit HALF-OPEN: Try DB, update cache if success
│ ├─→ Circuit CLOSED: Query DB, update cache, return ✓
│ └─→ Query TIMEOUT: Fail fast, don't retry
└─→ CACHE ERROR: Try database with circuit breaker
This is the architecture that survives real-world chaos.
Implementation: Redis + Circuit Breaker in Spring Boot
Let's build this step by step.
Step 1: Basic Redis Configuration with Timeout
```java
@Configuration
@EnableCaching
public class CacheConfig {

    @Bean
    public RedisCacheManagerBuilderCustomizer redisCacheManagerBuilderCustomizer() {
        return builder -> builder
            .withCacheConfiguration("products",
                RedisCacheConfiguration.defaultCacheConfig()
                    .entryTtl(Duration.ofMinutes(30))
                    .serializeValuesWith(
                        RedisSerializationContext.SerializationPair.fromSerializer(
                            new GenericJackson2JsonRedisSerializer())))
            .withCacheConfiguration("orders",
                RedisCacheConfiguration.defaultCacheConfig()
                    .entryTtl(Duration.ofMinutes(10)));
    }

    @Bean
    public LettuceConnectionFactory redisConnectionFactory() {
        return new LettuceConnectionFactory();
    }

    @Bean
    public RedisTemplate<String, Object> redisTemplate(
            RedisConnectionFactory connectionFactory) {
        RedisTemplate<String, Object> template = new RedisTemplate<>();
        template.setConnectionFactory(connectionFactory);
        return template;
    }
}
```
Notice the TTL is explicit. In a high-throughput system, you can't afford cache entries that live forever.
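One refinement worth considering (not shown above): if thousands of entries are written in the same burst, a fixed 30-minute TTL makes them all expire in the same burst too, which re-creates the avalanche on a schedule. Adding a small random jitter per entry spreads the expiries out. A sketch, usable with the manual `redisTemplate.opsForValue().set(key, value, ttl)` calls later in this article:

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

class TtlJitter {
    // Returns the base TTL plus up to `jitterPercent` percent of random slack,
    // so entries written together don't all expire together.
    static Duration withJitter(Duration base, int jitterPercent) {
        long maxJitterMillis = base.toMillis() * jitterPercent / 100;
        long jitter = ThreadLocalRandom.current().nextLong(maxJitterMillis + 1);
        return base.plusMillis(jitter);
    }
}
```

`withJitter(Duration.ofMinutes(30), 10)` yields a TTL somewhere between 30 and 33 minutes, different for each entry.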
Step 2: Add Resilience4j Circuit Breaker
Add the dependency:
```xml
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-spring-boot3</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>io.github.resilience4j</groupId>
    <artifactId>resilience4j-circuitbreaker</artifactId>
    <version>2.1.0</version>
</dependency>
```
Configure circuit breaker in application.yml:
```yaml
resilience4j:
  circuitbreaker:
    configs:
      default:
        registerHealthIndicator: true
        slidingWindowSize: 100
        failureRateThreshold: 50.0
        slowCallRateThreshold: 50.0
        slowCallDurationThreshold: 2000ms
        permittedNumberOfCallsInHalfOpenState: 10
        automaticTransitionFromOpenToHalfOpenEnabled: true
        waitDurationInOpenState: 10s
    instances:
      productRepository:
        baseConfig: default
        slidingWindowType: COUNT_BASED
```
What this means:
- Track the last 100 calls
- If 50% fail OR 50% are slower than 2 seconds, OPEN the circuit
- When OPEN, wait 10 seconds before trying again (HALF-OPEN)
- In HALF-OPEN, allow 10 attempts before deciding if things are better
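If the state machine feels abstract, here's a stripped-down, count-based sketch of the core decision in plain Java. This is not Resilience4j's actual implementation — it ignores slow calls, HALF-OPEN trials, and wait timers — just the CLOSED-to-OPEN transition over a sliding window:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy count-based breaker: opens when the failure rate over the last
// `windowSize` calls crosses `failureRateThreshold`. Illustrative only.
class ToyCircuitBreaker {
    enum State { CLOSED, OPEN }

    private final int windowSize;
    private final double failureRateThreshold; // e.g. 50.0 (%)
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = failure
    private State state = State.CLOSED;

    ToyCircuitBreaker(int windowSize, double failureRateThreshold) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
    }

    void record(boolean failure) {
        if (window.size() == windowSize) {
            window.removeFirst(); // slide the window
        }
        window.addLast(failure);
        long failures = window.stream().filter(f -> f).count();
        double failureRate = 100.0 * failures / window.size();
        // Only evaluate once the window is full (Resilience4j's
        // minimumNumberOfCalls plays a similar role)
        if (window.size() == windowSize && failureRate >= failureRateThreshold) {
            state = State.OPEN;
        }
    }

    State state() { return state; }
}
```

With `windowSize = 100` and a threshold of 50.0, the breaker stays CLOSED until at least 50 of the last 100 recorded calls failed.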
Step 3: The Real Implementation
Here's where Redis and circuit breaker come together:
```java
@Service
@Slf4j
public class ProductService {

    @Autowired
    private ProductRepository repository;

    @Autowired
    private RedisTemplate<String, Object> redisTemplate;

    // Inject the registry, not a CircuitBreaker directly -- Resilience4j
    // registers instances by name, so we look ours up when we need it.
    @Autowired
    private CircuitBreakerRegistry circuitBreakerRegistry;

    private static final String CACHE_KEY_PREFIX = "products:";
    private static final Duration CACHE_TTL = Duration.ofMinutes(30);
    private static final Duration QUERY_TIMEOUT = Duration.ofSeconds(5);

    public Product getProduct(Long id) {
        String cacheKey = CACHE_KEY_PREFIX + id;

        try {
            // Step 1: Check Redis first
            Object cached = redisTemplate.opsForValue().get(cacheKey);
            if (cached != null) {
                log.debug("Cache hit for product {}", id);
                return (Product) cached;
            }
        } catch (Exception e) {
            // Redis is down. Don't let this prevent us from trying the database.
            log.warn("Redis cache check failed for product {}: {}", id, e.getMessage());
        }

        // Step 2: Cache miss or cache error. Try database with circuit breaker.
        // `var` avoids importing the CircuitBreaker interface, whose simple name
        // would clash with the @CircuitBreaker annotation used below.
        var productCircuitBreaker = circuitBreakerRegistry.circuitBreaker("productRepository");
        try {
            Product product = productCircuitBreaker.executeSupplier(() ->
                queryProductWithTimeout(id)
            );

            // Step 3: Update cache on successful database hit
            try {
                redisTemplate.opsForValue().set(cacheKey, product, CACHE_TTL);
                log.debug("Cached product {} in Redis", id);
            } catch (Exception e) {
                log.warn("Failed to cache product {}: {}", id, e.getMessage());
                // Don't fail the request. Just return the product.
            }

            return product;
        } catch (CallNotPermittedException e) {
            // Circuit is OPEN. Database is down.
            log.error("Circuit breaker is OPEN for product {}", id);
            return getProductFallback(id);
        }
        // Any other failure (timeout, SQL error) propagates; the circuit
        // breaker has already counted it toward the failure rate.
    }

    private Product queryProductWithTimeout(Long id) {
        // The circuit breaker wraps this, monitoring for failures and slowness.
        // Note: on timeout the underlying JDBC call keeps running, so also set
        // a statement/JDBC timeout at the datasource level.
        try {
            return CompletableFuture.supplyAsync(() ->
                repository.findById(id).orElseThrow(
                    () -> new ProductNotFoundException("Product " + id + " not found")
                )
            ).get(QUERY_TIMEOUT.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new DatabaseTimeoutException(
                "Database query timed out after " + QUERY_TIMEOUT.toMillis() + "ms");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new DatabaseException("Database query interrupted", e);
        } catch (ExecutionException e) {
            if (e.getCause() instanceof ProductNotFoundException notFound) {
                throw notFound; // a missing product is not a database failure
            }
            throw new DatabaseException("Database query failed: " + e.getMessage(), e);
        }
    }

    private Product getProductFallback(Long id) {
        // Circuit is OPEN. Try to return stale data from Redis if available.
        try {
            Object cached = redisTemplate.opsForValue().get(CACHE_KEY_PREFIX + id);
            if (cached != null) {
                log.info("Returning stale cache for product {} (circuit is open)", id);
                return (Product) cached;
            }
        } catch (Exception e) {
            log.debug("Cannot access Redis fallback: {}", e.getMessage());
        }

        // No fallback available. Return error.
        throw new ServiceUnavailableException(
            "Product service is temporarily unavailable. Please try again.");
    }

    @CircuitBreaker(name = "productRepository") // the Resilience4j annotation
    public void updateProduct(Long id, ProductUpdateRequest request) {
        Product product = repository.findById(id).orElseThrow();
        product.setName(request.name());
        product.setPrice(request.price());
        repository.save(product);

        // Explicitly clear the Redis entry we manage ourselves (this class
        // bypasses Spring's cache abstraction, so @CacheEvict wouldn't help)
        try {
            redisTemplate.delete(CACHE_KEY_PREFIX + id);
        } catch (Exception e) {
            log.warn("Failed to clear cache for product {}: {}", id, e.getMessage());
        }
    }

    @CircuitBreaker(name = "productRepository")
    public List<Product> searchProducts(String query) {
        // Use a different cache for search results with a shorter TTL
        String cacheKey = "search:" + query.hashCode();

        try {
            Object cached = redisTemplate.opsForValue().get(cacheKey);
            if (cached != null) {
                return (List<Product>) cached;
            }
        } catch (Exception e) {
            log.warn("Search cache check failed: {}", e.getMessage());
        }

        List<Product> results = repository.findByNameContainingIgnoreCase(query);

        try {
            // Search results change frequently, shorter TTL
            redisTemplate.opsForValue().set(cacheKey, results, Duration.ofMinutes(5));
        } catch (Exception e) {
            log.warn("Failed to cache search results: {}", e.getMessage());
        }

        return results;
    }
}
```
This is a real implementation. Let's break down what's happening:
Cache check is non-blocking. If Redis fails, we continue. Redis failures don't cascade.
Circuit breaker wraps database calls. If the database is slow or fails, the circuit opens automatically, preventing further hammering.
Timeout is enforced. Database queries can't block forever. After 5 seconds, we fail fast.
Fallback uses stale data. When the circuit is open, we return yesterday's data if available. A stale product is better than no product.
Cache invalidation is explicit. When data changes, we evict it.
Step 4: Monitoring and Metrics
You need visibility into what's happening:
```java
@Component
@Slf4j
public class CircuitBreakerMetrics {

    @Autowired
    private CircuitBreakerRegistry circuitBreakerRegistry;

    @Scheduled(fixedRate = 10000) // Every 10 seconds
    public void logMetrics() {
        CircuitBreaker cb = circuitBreakerRegistry.circuitBreaker("productRepository");

        log.info("Circuit Breaker Stats - State: {}, Failure Rate: {}%, Slow Call Rate: {}%",
            cb.getState(),
            cb.getMetrics().getFailureRate(),
            cb.getMetrics().getSlowCallRate());

        // Redis-backed Spring caches don't expose size or hit rate through
        // CacheManager; wire up Micrometer via the actuator if you want
        // cache metrics alongside the circuit breaker stats.
    }
}
```
Add to application.yml:
```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,metrics,circuitbreakers
  endpoint:
    health:
      show-details: always
  metrics:
    tags:
      application: product-service
```
Now you can check /actuator/health to see circuit breaker status:
```json
{
  "status": "UP",
  "components": {
    "circuitBreakers": {
      "status": "UP",
      "details": {
        "productRepository": {
          "status": "UP",
          "details": {
            "state": "CLOSED",
            "failureRate": "0.0%",
            "slowCallRate": "5.2%"
          }
        }
      }
    }
  }
}
```
The Gotchas: Where This Implementation Can Still Break
Gotcha 1: Cache Stampede on Circuit Breaker Recovery
When the circuit transitions from OPEN to HALF-OPEN, up to 10 requests (your permittedNumberOfCallsInHalfOpenState) hit the database simultaneously. If many of those requests want the same product, they all miss the cache and query it at once.
Circuit: OPEN
All requests get fallback (stale data) ✓
Circuit: HALF-OPEN (10 permits)
Request 1: Query product 42 → Hits DB
Request 2: Query product 42 → Hits DB (cache miss, both in-flight)
Request 3: Query product 42 → Hits DB (cache miss, all three in-flight)
...
All 10 permits → Same 3 products queried 10 times total
Solution: coalesce concurrent lookups so only one in-flight database query runs per key. Every other caller for the same product waits on the same future instead of issuing its own query (this helps ordinary cache misses too, not just the HALF-OPEN burst):

```java
@Service
@Slf4j
public class ProductService {

    // One in-flight DB lookup per product id; concurrent callers for the
    // same id share the result instead of each hitting the database.
    private final ConcurrentHashMap<Long, CompletableFuture<Product>> inFlight =
            new ConcurrentHashMap<>();

    private Product loadProductCoalesced(Long id) {
        CompletableFuture<Product> future = inFlight.computeIfAbsent(id, key ->
                CompletableFuture.supplyAsync(() -> queryProductWithTimeout(key))
                        .whenComplete((result, error) -> inFlight.remove(key)));
        return future.join();
    }
}
```

Call loadProductCoalesced on the cache-miss path and the 10 HALF-OPEN permits collapse into one query per distinct product.
Gotcha 2: Stale Data Divergence
Your circuit is open. You're serving stale product prices from Redis. Meanwhile, your admin panel is writing new prices to the database. When the circuit closes, users see the old price, then it suddenly jumps to the new one. Confusion ensues.
Solution: Add a "stale" marker to cache entries:
```java
public class CachedProduct {
    public Product product;
    public long cachedAtMillis;
    public boolean isStale;
}

private Product getProductFallback(Long id) {
    try {
        Object cached = redisTemplate.opsForValue().get(CACHE_KEY_PREFIX + id);
        if (cached instanceof CachedProduct cp) {
            long age = System.currentTimeMillis() - cp.cachedAtMillis;
            if (age > 60_000) { // Older than 1 minute
                cp.isStale = true;
            }
            log.info("Returning {} cache for product {}",
                cp.isStale ? "stale" : "fresh", id);
            return cp.product;
        }
    } catch (Exception e) {
        log.debug("Cannot access Redis fallback: {}", e.getMessage());
    }
    throw new ServiceUnavailableException("Product service unavailable");
}
```
Include staleness in the API response so clients know the data age.
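A minimal shape for that response — the field names here are an assumption, pick whatever fits your API:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical API response wrapper that tells clients how old the data is.
record ProductResponse(Long id, String name, long dataAgeSeconds, boolean stale) {

    // Marks anything older than 60 seconds as stale, matching the fallback above.
    static ProductResponse from(Long id, String name, Instant cachedAt, Instant now) {
        long ageSeconds = Duration.between(cachedAt, now).getSeconds();
        return new ProductResponse(id, name, ageSeconds, ageSeconds > 60);
    }
}
```

Clients that care (a checkout page, say) can refuse stale data; clients that don't (a product listing) can render it anyway.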
Gotcha 3: Circuit Breaker State Not Visible to Load Balancer
Your application has 5 instances. Instance 1's circuit breaker is open (it detected the database is slow). Instances 2-5 still think the database is fine and send traffic to it. You get 80% of the benefit, not 100%.
Solution: Make circuit breaker state part of the health check. Your load balancer removes unhealthy instances:
```java
@Component
public class CircuitBreakerHealthIndicator implements HealthIndicator {

    @Autowired
    private CircuitBreakerRegistry registry;

    @Override
    public Health health() {
        CircuitBreaker cb = registry.circuitBreaker("productRepository");

        if (cb.getState() == CircuitBreaker.State.OPEN) {
            return Health.outOfService()
                .withDetail("reason", "Circuit breaker is OPEN")
                .build();
        }
        return Health.up().build();
    }
}
```
Now, when one instance's circuit opens, it reports OUT_OF_SERVICE and gets removed from the load balancer. One caution: if the database itself is down, every instance's circuit opens at once, and a health check like this can remove the entire fleet. Use it for load-balancer routing, not for your liveness probe.
Gotcha 4: Cache Invalidation Across Instances
You have 5 instances. Instance 1 updates a product and clears the Redis cache. But Instance 2 is still serving the old value because it has it in local memory (if you're using multi-level caching).
Solution: Use a cache invalidation message broker:
```java
@Service
@Slf4j
public class ProductService {

    @Autowired
    private ProductRepository repository;

    @Autowired
    private StringRedisTemplate redisTemplate;

    @Autowired
    private ApplicationEventPublisher eventPublisher;

    public void updateProduct(Long id, ProductUpdateRequest request) {
        Product product = repository.findById(id).orElseThrow();
        product.setName(request.name());
        repository.save(product);

        // Publish invalidation event to every instance via Redis pub/sub
        redisTemplate.convertAndSend("cache-invalidation", "product:" + id);
        eventPublisher.publishEvent(new ProductInvalidatedEvent(id));
    }
}
```

Each instance subscribes to the channel and clears its local caches when a message arrives:

```java
@Configuration
@Slf4j
public class CacheInvalidationConfig {

    @Bean
    public RedisMessageListenerContainer cacheInvalidationContainer(
            RedisConnectionFactory connectionFactory) {
        RedisMessageListenerContainer container = new RedisMessageListenerContainer();
        container.setConnectionFactory(connectionFactory);
        container.addMessageListener(
            (message, pattern) -> {
                String key = new String(message.getBody(), StandardCharsets.UTF_8);
                log.info("Invalidating local cache for {}", key);
                // Clear in-memory (local) caches for this key here
            },
            new ChannelTopic("cache-invalidation"));
        return container;
    }
}
```
The Real-World Trade-Offs
You've now built a system that handles:
- Cache misses without failing
- Database failures without crashing
- Slow databases without hanging
- Circuit breaker recovery without stampeding
But it's more complex. More moving parts. More things that can fail.
Here's the honest assessment:
| Scenario | Complexity | Reliability | Cost |
|---|---|---|---|
| Just database | Low | Low (high failure risk) | Low |
| Just cache | Low | Medium (cache failures) | Low |
| Database + cache | Medium | Medium (no circuit breaker) | Medium |
| Database + cache + circuit breaker | High | High (handles failures gracefully) | Medium |
You pay in complexity. You gain in reliability. For a high-throughput system, that trade is almost always worth it.
The Mindset Shift
Here's what separates systems that survive 3 AM pages from those that don't:
Wrong approach: "Redis will make everything fast, and the database will always be there."
Right approach: "Redis handles the happy path. Circuit breaker handles the sad path. Together, they make the system survive chaos."
The circuit breaker isn't a performance optimization. It's a survival mechanism. It's saying: "If things are broken, fail fast and gracefully, not catastrophically."
Redis isn't just a cache. It's a fallback layer. It's saying: "Yesterday's data is better than no data."
When you implement both, you're not building a system that's 2x better. You're building a system that behaves fundamentally differently under stress. Instead of cascading failures, you get degradation. Instead of 10 seconds of errors, you get 10 seconds of slower responses with stale data.
Your users might not notice the difference. Your oncall engineer will.
One More Thing
This implementation assumes you control both the cache and circuit breaker. If you're using a managed cache (AWS ElastiCache, Azure Cache for Redis) and a managed database (RDS, Cloud SQL), you can't control the circuit breaker on their side.
What you can do: Put the circuit breaker in your application, not in the infrastructure. That's what this article shows.
And if you're wondering why you haven't had to think about this before: You probably haven't scaled to the point where it matters. Congratulations. But keep this pattern in your back pocket. When you do scale, you'll be ready.
Your 11 PM database will thank you.