I wanted to understand what really happens to a distributed system when things go wrong. So I broke it on purpose.
This article walks through a chaos engineering experiment I built around a real-time stock quotes API for NASDAQ-listed companies. The stack involves Java 21, Spring Boot 4, Redis, PostgreSQL, Resilience4j and ToxiProxy — and the results were more interesting than I expected.
What is Chaos Engineering?
Chaos Engineering is the practice of intentionally injecting failures into a system to observe how it behaves under stress. The idea, popularized by Netflix with their Chaos Monkey tool, is simple: if failures are going to happen in production anyway, it's better to discover how your system reacts in a controlled experiment than during a real outage.
In a financial context like stock quotes, the stakes are clear. Imagine a system that feeds real-time prices to traders during market open at 9:30 AM ET, and suddenly Redis starts responding in 2 seconds instead of 6 milliseconds. What happens? Does the system degrade gracefully, or does it collapse?
That's what I set out to find out.
Architecture
The project simulates a production-like environment where every external dependency passes through ToxiProxy — a programmable network proxy that can inject latency, bandwidth limits, connection resets and more.
Client
 └─► Nginx (port 8000 / 20000)
      └─► Backend (Spring Boot :8080)
           ├─► ToxiProxy :20001 ──► PostgreSQL :5432
           ├─► ToxiProxy :20002 ──► Redis :6379
           └─► ToxiProxy :20003 ──► finnhub.io:443
The key detail: the backend never connects directly to PostgreSQL, Redis or Finnhub. Every call passes through ToxiProxy, so we can degrade any dependency at any time without touching application code.
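In practice this just means the application's connection settings point at the proxy ports instead of the real services. A sketch of what that looks like in application.yml (service name toxiproxy, database name stocks, and the exact property keys are assumptions, not taken from the project):

```yaml
spring:
  datasource:
    # PostgreSQL reached through ToxiProxy, not directly on :5432
    url: jdbc:postgresql://toxiproxy:20001/stocks
  data:
    redis:
      # Redis reached through ToxiProxy, not directly on :6379
      host: toxiproxy
      port: 20002
```

Because the indirection lives entirely in configuration, switching between a healthy and a degraded dependency never requires a rebuild or redeploy.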
Tech stack:
- Java 21 + Spring Boot 4
- PostgreSQL 17 (persistence)
- Redis 7 (cache via Spring Cache + Lettuce)
- Resilience4j (Circuit Breaker + Retry)
- ToxiProxy (chaos injection)
- Nginx (reverse proxy + SSL termination for Finnhub)
- Finnhub API (real market data)
- Docker + Docker Compose
How the Cache Works
The application fetches stock quotes from Finnhub, saves them to PostgreSQL and caches them in Redis using @Cacheable:
@Cacheable(value = "quotes", key = "#symbol")
public StockQuote getQuote(String symbol) {
    log.info("Cache MISS for symbol: {}", symbol);
    FinnhubQuoteResponse response = finnhubClient.fetchQuote(symbol);
    // compareTo ignores scale, so 0 and 0.00 both match; BigDecimal.equals would not
    if (response.currentPrice().compareTo(BigDecimal.ZERO) == 0) {
        throw new AssetNotFoundException("Asset not found for symbol: " + symbol);
    }
    StockQuote quote = stockQuoteMapper.toEntity(symbol, response);
    return stockQuoteRepository.save(quote);
}
The flow is:
- 1st call (cache miss): Finnhub → PostgreSQL → Redis → response
- 2nd call (cache hit): Redis → response (PostgreSQL and Finnhub are never touched)
A background scheduler also refreshes all quotes every 60 seconds, so the cache stays warm automatically.
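Under the hood, @Cacheable implements the classic cache-aside pattern: check the cache, and on a miss fetch from the source and populate it. A minimal plain-Java sketch of that logic, with a ConcurrentHashMap standing in for Redis and a counter standing in for the Finnhub call (all names here are illustrative, not the project's API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class CacheAsideSketch {
    private final Map<String, Double> cache = new ConcurrentHashMap<>(); // stands in for Redis
    private final AtomicInteger upstreamCalls = new AtomicInteger();     // counts "Finnhub" hits

    double getQuote(String symbol) {
        // computeIfAbsent = "check cache, on miss fetch and populate" in one step
        return cache.computeIfAbsent(symbol, s -> {
            upstreamCalls.incrementAndGet(); // cache MISS: go to the upstream API
            return 123.45;                   // dummy price
        });
    }

    int upstreamCalls() {
        return upstreamCalls.get();
    }

    public static void main(String[] args) {
        CacheAsideSketch c = new CacheAsideSketch();
        c.getQuote("AAPL"); // first call: miss, upstream invoked
        c.getQuote("AAPL"); // second call: hit, served from the map
        System.out.println("upstream calls: " + c.upstreamCalls()); // prints 1
    }
}
```

The important property, which the chaos experiment below exercises, is that the cache only protects the upstream once the populated entry is actually visible to the next reader.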
The Circuit Breaker
The Finnhub client is wrapped with Resilience4j annotations:
@CircuitBreaker(name = "financialApi", fallbackMethod = "fetchQuoteFallback")
@Retry(name = "financialApi")
public FinnhubQuoteResponse fetchQuote(String symbol) {
    return finnhubWebClient.get()
            .uri(uri -> uri.path("/quote")
                    .queryParam("symbol", symbol)
                    .queryParam("token", properties.token())
                    .build())
            .retrieve()
            .bodyToMono(FinnhubQuoteResponse.class)
            .timeout(Duration.ofSeconds(3))
            .block();
}

private FinnhubQuoteResponse fetchQuoteFallback(String symbol, Exception e) {
    log.error("Fallback triggered for symbol: {} - reason: {}", symbol, e.getMessage());
    return FinnhubQuoteResponse.empty();
}
The Circuit Breaker is configured to open when 50% of calls are slow or failing, evaluated over a sliding window of 5 calls, with a minimum of 3 calls before any rate is computed:
resilience4j:
  circuitbreaker:
    configs:
      default:
        registerHealthIndicator: true
        slidingWindowSize: 5
        minimumNumberOfCalls: 3
        permittedNumberOfCallsInHalfOpenState: 3
        automaticTransitionFromOpenToHalfOpenEnabled: true
        waitDurationInOpenState: 30s
        slowCallRateThreshold: 50
        failureRateThreshold: 50
        eventConsumerBufferSize: 10
    instances:
      financialApi:
        baseConfig: default
        waitDurationInOpenState: 60s
        recordExceptions:
          - org.springframework.web.client.HttpServerErrorException
          - java.util.concurrent.TimeoutException
          - java.io.IOException
Chaos Scripts
Three scripts control the experiment:
setup-toxiproxy.sh — creates the proxies, no chaos yet:
#!/bin/bash
set -e
echo "Configuring Toxiproxy proxies..."
curl -X POST http://localhost:8474/proxies \
  -H "Content-Type: application/json" \
  -d '{"name":"postgres_proxy","listen":"0.0.0.0:20001","upstream":"stock-quotes-postgres:5432","enabled":true}'

curl -X POST http://localhost:8474/proxies \
  -H "Content-Type: application/json" \
  -d '{"name":"redis_proxy","listen":"0.0.0.0:20002","upstream":"stock-quotes-redis:6379","enabled":true}'

curl -X POST http://localhost:8474/proxies \
  -H "Content-Type: application/json" \
  -d '{
    "name":"finnhub_proxy",
    "listen":"0.0.0.0:20003",
    "upstream":"finnhub.io:443",
    "enabled":true
  }'
echo "Toxiproxy configuration complete."
inject-chaos.sh — injects 2000ms ±500ms latency on all three proxies:
#!/bin/bash
set -e
TOXIPROXY_URL="http://localhost:8474"
echo "Injecting chaos: 2000ms latency (±500ms jitter) to postgres_proxy, redis_proxy and finnhub_proxy..."
curl --fail -X POST "$TOXIPROXY_URL/proxies/postgres_proxy/toxics" \
  -H "Content-Type: application/json" \
  -d '{"name":"latency_downstream","type":"latency","stream":"downstream","attributes":{"latency":2000,"jitter":500}}'

curl --fail -X POST "$TOXIPROXY_URL/proxies/redis_proxy/toxics" \
  -H "Content-Type: application/json" \
  -d '{"name":"latency_downstream","type":"latency","stream":"downstream","attributes":{"latency":2000,"jitter":500}}'

curl --fail -X POST "$TOXIPROXY_URL/proxies/finnhub_proxy/toxics" \
  -H "Content-Type: application/json" \
  -d '{"name":"latency_downstream","type":"latency","stream":"downstream","attributes":{"latency":2000,"jitter":500}}'
echo "Chaos injected. Monitor your Resilience4j Circuit Breaker status."
remove-chaos.sh — removes all toxics, system recovers:
#!/bin/bash
set -e
TOXIPROXY_URL="http://localhost:8474"
echo "Removing chaos from postgres_proxy, redis_proxy and finnhub_proxy..."
curl --fail -X DELETE "$TOXIPROXY_URL/proxies/postgres_proxy/toxics/latency_downstream"
curl --fail -X DELETE "$TOXIPROXY_URL/proxies/redis_proxy/toxics/latency_downstream"
curl --fail -X DELETE "$TOXIPROXY_URL/proxies/finnhub_proxy/toxics/latency_downstream"
echo "Chaos removed. System should be recovering — watch the Circuit Breaker close."
Experiment Results
Baseline — No Chaos
# First call (cold cache)
curl -w "\nTotal time: %{time_total}s\n" http://localhost:8080/api/v1/stock-quotes/get-by-symbol/AAPL
# Total time: 0.029s
# Second call (Redis cache hit)
curl -w "\nTotal time: %{time_total}s\n" http://localhost:8080/api/v1/stock-quotes/get-by-symbol/AAPL
# Total time: 0.006s
Redis responding in 6ms is the happy path. PostgreSQL metrics confirm it: the database itself answers in under 2ms, so the remaining overhead comes from the ToxiProxy hop even with no toxics injected.
With Chaos — 2000ms ±500ms on All Services
Here's where things get interesting.
./scripts/inject-chaos.sh
# First call
curl -w "\nTotal time: %{time_total}s\n" http://localhost:8080/api/v1/stock-quotes/get-by-symbol/AAPL
# Total time: 9.4s
# Second call (expected Redis hit...)
curl -w "\nTotal time: %{time_total}s\n" http://localhost:8080/api/v1/stock-quotes/get-by-symbol/AAPL
# Total time: 10.8s
The second call was slower than the first. That was unexpected.
The reason: with 2000ms of latency on the Redis proxy, the write to Redis after the first call was delayed so long that by the time the second request arrived, the cache entry still hadn't been written. Both calls went all the way to Finnhub and PostgreSQL, each adding its own 2000ms of latency.
This is the invisible failure mode: Redis latency doesn't just slow down cache reads, it breaks the cache population itself. The system continued to function, but silently lost all caching benefits. Every request was hitting the full stack.
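This race is easy to model with a logical clock (no real sleeping; the code below is illustrative, not the project's): the cache write issued at t=0 only becomes visible after the injected Redis delay, so a request arriving before that point misses even though the "populate" step has already started.

```java
public class DelayedCacheWrite {
    /** Returns true if a request at queryTimeMs sees the cached value. */
    static boolean cacheHit(long writeIssuedAtMs, long redisLatencyMs, long queryTimeMs) {
        long visibleAtMs = writeIssuedAtMs + redisLatencyMs; // write lands after the injected delay
        return queryTimeMs >= visibleAtMs;
    }

    public static void main(String[] args) {
        // Healthy Redis: write is visible ~6ms after the first response
        System.out.println(cacheHit(0, 6, 500));    // true  -> second call is a cache hit
        // Chaos: 2000ms injected latency on the Redis proxy
        System.out.println(cacheHit(0, 2000, 500)); // false -> second call misses too
    }
}
```

Any request rate faster than the cache-write latency keeps hitting the full stack, which is exactly what the two back-to-back curl calls demonstrated.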
Real Latency Numbers per Asset

Figure 1: Comparison between healthy Redis hits (green) and system performance under 2000ms ±500ms injected latency (red). Even a cold cache miss under normal conditions takes only 29ms; chaos doesn't just slow the system down, it breaks the efficiency of the caching architecture itself.
| Asset | No Chaos (Redis hit) | With Chaos |
|---|---|---|
| AAPL | 5.9ms | 1823.9ms |
| GOOGL | 6.7ms | 1870.6ms |
| NVDA | 7.4ms | 2279.5ms |
| AMZN | 7.2ms | 2295.1ms |
| META | 6.4ms | 1522.2ms |
| MSFT | 6.0ms | 2445.1ms |
| AMD | 5.6ms | 2180.9ms |
Redis is approximately 300x faster than the chaos scenario.
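Sanity-checking that claim against the table, using the averages of the seven assets above:

```java
public class SpeedupCheck {
    static double avg(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    public static void main(String[] args) {
        double[] healthy = {5.9, 6.7, 7.4, 7.2, 6.4, 6.0, 5.6};                     // ms, Redis hit
        double[] chaos   = {1823.9, 1870.6, 2279.5, 2295.1, 1522.2, 2445.1, 2180.9}; // ms, with toxics
        double hAvg = avg(healthy);
        double cAvg = avg(chaos);
        // Averages land around 6.5ms vs ~2060ms, a ratio of roughly 319x
        System.out.printf("healthy: %.1fms, chaos: %.1fms, ratio: %.0fx%n", hAvg, cAvg, cAvg / hAvg);
    }
}
```

The exact ratio works out to roughly 319x, so "approximately 300x" is, if anything, slightly conservative.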
Circuit Breaker Opens
After enough slow calls, the Circuit Breaker opened automatically:
ERROR: Fallback triggered for symbol: INTC - reason: CircuitBreaker 'financialApi' is OPEN
WARN: Financial API Circuit Breaker state changed: OPEN -> HALF_OPEN
curl http://localhost:8080/actuator/circuitbreakers | jq
{
  "circuitBreakers": {
    "financialApi": {
      "bufferedCalls": 3,
      "failedCalls": 0,
      "failureRate": "0.0%",
      "failureRateThreshold": "50.0%",
      "notPermittedCalls": 0,
      "slowCallRate": "100.0%",
      "slowCallRateThreshold": "50.0%",
      "slowCalls": 3,
      "slowFailedCalls": 0,
      "state": "OPEN"
    }
  }
}
Recovery
After removing chaos:
./scripts/remove-chaos.sh
The Circuit Breaker transitioned automatically: OPEN → HALF_OPEN → CLOSED. The system recovered without any manual intervention.
Key Takeaways
1. Redis latency silently breaks your cache. This is the most important finding. The system didn't throw errors — it just stopped caching, and every request paid the full cost. Without observability, this would be invisible in production.
2. Circuit Breakers are essential, but need correct wiring. Getting the @CircuitBreaker annotation to work required: creating an interface for the client class (so Spring AOP could create a proxy), aligning the instance name across the annotation, YAML config and registry, and using the right exception types. The annotation is simple — the configuration is not.
3. ToxiProxy gives you a realistic baseline before chaos. Even with zero toxics configured, every dependency call pays for an extra TCP hop through the proxy. Measure this proxied path first: it, not localhost-to-localhost speed, is the baseline your chaos numbers should be compared against.
4. The scheduler is also affected. The background job that refreshes quotes every 60 seconds suffered the same latency — meaning cached data would eventually become stale even if the circuit breaker protected the API endpoint.
Running the Project
git clone https://github.com/Doug16Yanc/stock-quotes.git
cd stock-quotes
cp .env-example .env
# fill in your Finnhub API key and credentials
# start infrastructure
docker compose up -d postgres redis toxiproxy
# configure proxies
./scripts/setup-toxiproxy.sh
# start application
docker compose up -d backend nginx
# test
curl http://localhost:8080/api/v1/stock-quotes/get-by-symbol/AAPL
What's Next? (Hardening Resilience)
Now that we've seen how the system behaves under stress and confirmed that the circuit breaker and fallback keep it from collapsing, the next steps toward true high availability are:
Stale-While-Revalidate Pattern: Implement a strategy where Redis serves "stale" (slightly outdated) data instantly while a background thread fetches the update, eliminating user-facing latency during cache refreshes.
Bulkhead Isolation: Isolate resource pools (threads and connections) to ensure that a catastrophic slowdown in the NASDAQ API doesn't exhaust the backend's thread pool, which could otherwise crash unrelated services.
Deep Observability: Integrate Micrometer, Prometheus, and Grafana to visualize Circuit Breaker state transitions in real-time and create alerts based on "Anomalous Cache Miss Rates."
Stress & Load Testing: Use k6 to simulate a massive volume of concurrent requests during chaos injection to observe the "Thundering Herd" effect—where multiple requests try to rebuild the cache simultaneously.
If you found this useful or have questions, drop a comment. The full source code is on GitHub: Doug16Yanc/stock-quotes
Breaking things on purpose is the best way to learn how they work.