Building resilient distributed systems requires thorough testing of failure scenarios. While unit tests are great for business logic, they can't simulate the complex network failures that happen in production. This is where Toxiproxy comes in—a powerful tool for testing how your application handles real-world network chaos.
In this tutorial, we'll explore how to test a Redis circuit breaker implementation using Toxiproxy to simulate various failure modes, from complete outages to subtle network degradation.
What is Toxiproxy?
Toxiproxy is a TCP proxy developed by Shopify that simulates network and system conditions for chaos and resiliency testing. Unlike traditional testing approaches that mock dependencies, Toxiproxy sits between your application and its dependencies, allowing you to inject real network failures:
- Connection failures (service down)
- Latency (slow networks)
- Bandwidth limitations (network congestion)
- Connection resets (unstable networks)
- Sliced or truncated data (partial responses)
This makes it ideal for testing circuit breakers, retry logic, timeouts, and other resilience patterns.
Understanding Circuit Breakers
Before we dive into testing, let's briefly review the circuit breaker pattern. A circuit breaker prevents an application from repeatedly trying to execute an operation that's likely to fail, giving the failing service time to recover.
The circuit breaker has three states:
- Closed: Normal operation, requests pass through
- Open: Too many failures detected, requests are blocked or fail-fast
- Half-Open: Testing if the service has recovered
For our Redis implementation, we'll configure the circuit breaker to:
- Open after 5 consecutive failures
- Either fail-open (allow requests through) or fail-closed (block requests)
- Log when state transitions occur
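The tutorial doesn't show the breaker code itself, so as a concrete reference point, here is a minimal sketch of comparable settings using the sony/gobreaker library. This is an illustrative stand-in rather than the auth service's actual implementation; the names and values are placeholders:
package cache

import (
    "log"
    "time"

    "github.com/sony/gobreaker"
)

// breaker guards every Redis call made by the auth service.
var breaker = gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name: "redis",
    // Open after 5 consecutive failures.
    ReadyToTrip: func(c gobreaker.Counts) bool {
        return c.ConsecutiveFailures >= 5
    },
    // How long the circuit stays open before a half-open trial request.
    Timeout: 30 * time.Second,
    // Log every state transition (closed -> open -> half-open -> closed).
    OnStateChange: func(name string, from, to gobreaker.State) {
        log.Printf("circuit breaker %q: %s -> %s", name, from, to)
    },
})
Fail-open versus fail-closed then lives in the caller: when Execute returns gobreaker.ErrOpenState, a fail-open caller falls back to the database and skips Redis, while a fail-closed caller returns the error immediately.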
Setup
Installing Toxiproxy
First, install Toxiproxy on your system:
macOS:
brew install toxiproxy
Starting the Toxiproxy Server
Start the Toxiproxy server in the background:
toxiproxy-server &
The server starts on localhost:8474 by default. You can verify it's running:
curl http://localhost:8474/version
Creating a Proxy for Redis
Now create a proxy that sits between your auth service and Redis:
# Create proxy: listens on 6380, forwards to Redis on 6379
toxiproxy-cli create redis-proxy \
-l localhost:6380 \
-u localhost:6379
This creates a proxy named redis-proxy that:
- Listens on port 6380 (your application will connect here)
- Forwards traffic to Redis on port 6379
Verify the proxy was created:
toxiproxy-cli list
Output:
Name Listen Upstream Enabled
============================================================
redis-proxy localhost:6380 localhost:6379 true
Configuring Your Application
Point your auth service to use the Toxiproxy port instead of connecting directly to Redis:
export REDIS_HOST=localhost
export REDIS_PORT=6380 # Toxiproxy port, not 6379
# Restart your service
Now all Redis traffic flows through Toxiproxy, allowing you to inject failures without modifying your application code.
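For a Go service built on go-redis (github.com/redis/go-redis/v9), the wiring might look like this minimal sketch; adjust for whatever client your service actually uses:
package cache

import (
    "net"
    "os"

    "github.com/redis/go-redis/v9"
)

// newRedisClient builds the client from the same environment variables
// exported above, so pointing it at Toxiproxy is purely a config change.
func newRedisClient() *redis.Client {
    host := os.Getenv("REDIS_HOST") // localhost
    port := os.Getenv("REDIS_PORT") // 6380: the Toxiproxy listener, not Redis itself
    return redis.NewClient(&redis.Options{
        Addr: net.JoinHostPort(host, port),
    })
}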
Testing Scenarios
Let's explore different failure scenarios and verify our circuit breaker behaves correctly.
Scenario 1: Simulate Redis Down (Circuit Opens)
The most critical test—what happens when Redis becomes completely unavailable?
Disable the proxy:
toxiproxy-cli toggle redis-proxy
This simulates Redis being down. Your application will start getting connection failures.
Expected Behavior:
- First 4 requests fail but circuit remains closed
- 5th consecutive failure → Circuit breaker opens
- Logs should show:
{"level":"warn","msg":"Circuit breaker opened - Redis failures exceeded threshold","service":"auth"}
- Subsequent requests are either:
- Fail-open: Allowed through (degraded mode, no Redis caching)
- Fail-closed: Rejected immediately (fail-fast)
Monitoring the circuit state:
While the circuit is open, check your application metrics or logs. The circuit should remain open for the configured recovery period.
Re-enable the proxy:
toxiproxy-cli toggle redis-proxy
After re-enabling, the circuit should transition to half-open on the next request, then back to closed if the request succeeds.
Verification checklist:
- ✅ Circuit opens after 5 failures
- ✅ Log message appears with correct timestamp
- ✅ Requests are handled according to fail-open/fail-closed policy
- ✅ Circuit recovers when service returns
- ✅ Application continues functioning (degraded or failed requests)
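You can also drive this scenario from an automated test instead of the CLI. The sketch below uses Toxiproxy's official Go client together with go-redis and sony/gobreaker as stand-ins for your service's own client and breaker; the proxy name and ports match the setup above, everything else is illustrative:
package chaos_test

import (
    "context"
    "testing"
    "time"

    toxiproxy "github.com/Shopify/toxiproxy/v2/client"
    "github.com/redis/go-redis/v9"
    "github.com/sony/gobreaker"
)

func TestCircuitOpensWhenRedisIsDown(t *testing.T) {
    toxi := toxiproxy.NewClient("localhost:8474")
    proxy, err := toxi.Proxy("redis-proxy")
    if err != nil {
        t.Fatalf("fetch proxy: %v", err)
    }

    // A self-contained client and breaker for the test; your service
    // would normally own these.
    rdb := redis.NewClient(&redis.Options{
        Addr:        "localhost:6380", // the Toxiproxy listen port
        DialTimeout: time.Second,
        ReadTimeout: time.Second,
    })
    cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
        Name: "redis",
        ReadyToTrip: func(c gobreaker.Counts) bool {
            return c.ConsecutiveFailures >= 5
        },
    })

    // Simulate Redis going down; restore it when the test ends.
    if err := proxy.Disable(); err != nil {
        t.Fatalf("disable proxy: %v", err)
    }
    t.Cleanup(func() { _ = proxy.Enable() })

    // Drive enough failing calls to cross the 5-failure threshold.
    ctx := context.Background()
    for i := 0; i < 6; i++ {
        _, _ = cb.Execute(func() (interface{}, error) {
            return rdb.Get(ctx, "session:test").Result()
        })
    }

    if cb.State() != gobreaker.StateOpen {
        t.Fatalf("expected circuit to be open, got %v", cb.State())
    }
}
Run it with the Toxiproxy server and Redis already up locally, and the test exercises the same path you just walked through by hand.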
Scenario 2: Add Latency (Slow Redis)
Network latency is often more insidious than complete failures. It can cause timeouts, thread pool exhaustion, and cascading failures.
Add 5-second latency:
toxiproxy-cli toxic add redis-proxy \
-t latency \
-a latency=5000
This adds a 5000ms (5 second) delay to all requests passing through the proxy.
Expected Behavior:
If your Redis client timeout is less than 5 seconds (e.g., 2 seconds), requests will timeout and count as failures. After 5 consecutive timeouts, the circuit should open.
Logs you should see:
{"level":"error","msg":"Redis operation timeout","error":"context deadline exceeded"}
{"level":"warn","msg":"Circuit breaker opened - Redis failures exceeded threshold"}
Testing different latency levels:
# Moderate latency (1 second)
toxiproxy-cli toxic update redis-proxy \
-n latency_downstream \
-a latency=1000
# Extreme latency (10 seconds)
toxiproxy-cli toxic update redis-proxy \
-n latency_downstream \
-a latency=10000
Remove the latency toxic:
toxiproxy-cli toxic remove redis-proxy -n latency_downstream
Key insights:
- Latency above your timeout threshold behaves like a failure
- Helps verify your timeouts are properly configured
- Tests thread pool behavior under slow dependencies
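Those timeouts live in your client code. Here is a hedged go-redis sketch of where the "context deadline exceeded" error in the logs above comes from; the 2-second value is illustrative:
package cache

import (
    "context"
    "time"

    "github.com/redis/go-redis/v9"
)

// getSession puts a 2-second deadline on the call, so any injected latency
// above 2s surfaces as context.DeadlineExceeded and counts as a failure.
func getSession(rdb *redis.Client, key string) (string, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()
    return rdb.Get(ctx, key).Result()
}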
Scenario 3: Connection Reset (Network Errors)
Simulate unstable network connections that reset mid-request:
Add reset_peer toxic:
toxiproxy-cli toxic add redis-proxy \
-t reset_peer \
-a timeout=500
This abruptly resets the connection (TCP RST) after 500ms, simulating network instability.
Expected Behavior:
Connections will be abruptly closed, causing errors like:
connection reset by peer
unexpected EOF
broken pipe
These should count as failures and eventually open the circuit.
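What counts as "a failure" is a client-side decision. If your breaker is built on sony/gobreaker (an assumption, not something this tutorial prescribes), its IsSuccessful hook is one place to encode that: treat cache misses as success, and network-level errors like the ones above as failures. A hedged sketch:
package cache

import (
    "context"
    "errors"
    "io"
    "net"

    "github.com/redis/go-redis/v9"
    "github.com/sony/gobreaker"
)

// isSuccessful decides which errors count toward opening the circuit.
// A cache miss (redis.Nil) is business as usual; network-level errors are not.
func isSuccessful(err error) bool {
    if err == nil || errors.Is(err, redis.Nil) {
        return true
    }
    var netErr net.Error
    if errors.Is(err, context.DeadlineExceeded) ||
        errors.Is(err, io.EOF) ||
        errors.As(err, &netErr) { // "connection reset by peer", "broken pipe", ...
        return false
    }
    // Unknown errors are treated as failures, conservatively.
    return false
}

var settings = gobreaker.Settings{
    Name:         "redis",
    IsSuccessful: isSuccessful,
    ReadyToTrip: func(c gobreaker.Counts) bool {
        return c.ConsecutiveFailures >= 5
    },
}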
Remove the toxic:
toxiproxy-cli toxic remove redis-proxy -n reset_peer_downstream
When to use this test:
- Verifying connection pool recovery
- Testing retry logic
- Ensuring proper cleanup of broken connections
Scenario 4: Bandwidth Limit (Network Congestion)
Simulate network congestion with bandwidth restrictions:
Limit to 1KB/s:
toxiproxy-cli toxic add redis-proxy \
-t bandwidth \
-a rate=1
This restricts throughput to 1 kilobyte per second, simulating a severely congested network.
Expected Behavior:
- Small Redis operations (GET, SET) might still work but slowly
- Large operations (fetching big values, pipeline operations) will timeout
- Gradual degradation rather than immediate failure
Test different bandwidth levels:
# Moderate congestion (10 KB/s)
toxiproxy-cli toxic update redis-proxy \
-n bandwidth_downstream \
-a rate=10
# Severe congestion (back to 1 KB/s; rate is a whole number of KB/s, so this is the floor)
toxiproxy-cli toxic update redis-proxy \
-n bandwidth_downstream \
-a rate=1
Remove the toxic:
toxiproxy-cli toxic remove redis-proxy -n bandwidth_downstream
What this tests:
- Application behavior under sustained degradation
- Whether your timeouts are appropriate for typical data sizes
- If your circuit breaker is sensitive enough to detect slow failures
Scenario 5: Jitter (Variable Latency)
Real networks don't have consistent latency—they fluctuate. Simulate this with jitter:
Add latency with jitter:
toxiproxy-cli toxic add redis-proxy \
-t latency \
-a latency=1000 \
-a jitter=500
This creates latency ranging from 500ms to 1500ms (1000ms ± 500ms).
Expected Behavior:
- Some requests complete quickly
- Others timeout randomly
- Circuit breaker sees intermittent failures
- Tests the circuit breaker's threshold logic
Why this matters:
- More realistic than fixed latency
- Reveals issues with retry timing
- Tests how your system handles sporadic failures
Scenario 6: Sliced Data (Gradual Degradation)
Simulate a service slowly dying:
Add slicer toxic:
toxiproxy-cli toxic add redis-proxy \
-t slicer \
-a average_size=64 \
-a size_variation=32 \
-a delay=100000
This slices responses into small chunks (roughly 32 to 96 bytes each) and pauses 100,000 microseconds (100ms) between chunks.
Expected Behavior:
- Operations slow down roughly in proportion to payload size
- Eventually exceed timeouts
- Allows testing gradual degradation vs sudden failure
Advanced Testing Patterns
Combining Multiple Toxics
You can apply multiple toxics simultaneously to simulate complex scenarios:
# Add latency plus a slow connection close
toxiproxy-cli toxic add redis-proxy -t latency -a latency=200
toxiproxy-cli toxic add redis-proxy -t slow_close -a delay=100
This simulates a network that's both slow AND unstable.
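The same combination can be applied from test code with Toxiproxy's official Go client. The sketch below assumes the redis-proxy created earlier and uses illustrative toxic names:
package main

import (
    "log"

    toxiproxy "github.com/Shopify/toxiproxy/v2/client"
)

func main() {
    client := toxiproxy.NewClient("localhost:8474")

    proxy, err := client.Proxy("redis-proxy")
    if err != nil {
        log.Fatalf("fetch proxy: %v", err)
    }

    // 200ms of added latency on traffic flowing toward Redis...
    if _, err := proxy.AddToxic("latency_down", "latency", "downstream", 1.0,
        toxiproxy.Attributes{"latency": 200}); err != nil {
        log.Fatalf("add latency toxic: %v", err)
    }

    // ...plus a 100ms delay before connections are allowed to close.
    if _, err := proxy.AddToxic("slow_close_down", "slow_close", "downstream", 1.0,
        toxiproxy.Attributes{"delay": 100}); err != nil {
        log.Fatalf("add slow_close toxic: %v", err)
    }

    // ... drive traffic through the proxy here ...

    // Remove both toxics once the experiment is done.
    _ = proxy.RemoveToxic("latency_down")
    _ = proxy.RemoveToxic("slow_close_down")
}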
Automated Testing Script
Create a bash script to run your test scenarios automatically:
#!/bin/bash
echo "Starting Redis circuit breaker tests..."
# Test 1: Complete failure
echo "Test 1: Simulating Redis down"
toxiproxy-cli toggle redis-proxy
sleep 5
curl http://localhost:8080/health
toxiproxy-cli toggle redis-proxy
# Test 2: High latency
echo "Test 2: Adding high latency"
toxiproxy-cli toxic add redis-proxy -t latency -a latency=5000
sleep 5
curl http://localhost:8080/health
toxiproxy-cli toxic remove redis-proxy -n latency_downstream
# Test 3: Reset connections
echo "Test 3: Reset connections"
toxiproxy-cli toxic add redis-proxy -t reset_peer -a timeout=500
sleep 5
curl http://localhost:8080/health
toxiproxy-cli toxic remove redis-proxy -n reset_peer_downstream
echo "Tests complete!"
Key Metrics to Track
- Circuit state: closed, open, half-open
- Failure count: Number of consecutive failures
- Success rate: Percentage of successful requests
- Latency percentiles: p50, p95, p99
- Request volume: Total requests during the test
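One way to make the circuit state visible is to export it as a gauge. The sketch below assumes prometheus/client_golang and a gobreaker-style OnStateChange hook; the metric name is illustrative:
package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/sony/gobreaker"
)

var circuitState = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "redis_circuit_breaker_state",
    Help: "Redis circuit breaker state: 0=closed, 1=half-open, 2=open.",
})

func init() {
    prometheus.MustRegister(circuitState)
}

// OnStateChange can be plugged into gobreaker.Settings.OnStateChange.
func OnStateChange(name string, from, to gobreaker.State) {
    switch to {
    case gobreaker.StateClosed:
        circuitState.Set(0)
    case gobreaker.StateHalfOpen:
        circuitState.Set(1)
    case gobreaker.StateOpen:
        circuitState.Set(2)
    }
}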
Best Practices
1. Test All Failure Modes
Don't just test complete outages. Real production issues include:
- Partial failures (some operations succeed, others fail)
- Slow failures (latency-induced timeouts)
- Intermittent failures (flaky networks)
2. Verify Recovery
Always test that your circuit breaker recovers properly:
# Cause failure
toxiproxy-cli toggle redis-proxy
# Wait for circuit to open
sleep 10
# Restore service
toxiproxy-cli toggle redis-proxy
# Verify recovery
curl http://localhost:8080/health
3. Test Under Load
Run toxics while your service is under load to see realistic behavior:
# Start load test
hey -z 60s -c 10 http://localhost:8080/api/endpoint &
# Inject failure mid-test
sleep 20
toxiproxy-cli toxic add redis-proxy -t latency -a latency=3000
4. Clean Up Between Tests
Always remove toxics and reset state:
# List active toxics, then remove each one by name
toxiproxy-cli inspect redis-proxy
toxiproxy-cli toxic remove redis-proxy -n latency_downstream
toxiproxy-cli toxic remove redis-proxy -n reset_peer_downstream
# Or reset the entire proxy
toxiproxy-cli delete redis-proxy
toxiproxy-cli create redis-proxy -l localhost:6380 -u localhost:6379
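If your tests use the Go client, a small helper can guarantee this cleanup for every test case. The sketch below leans on ResetState(), which maps to the HTTP API's POST /reset and re-enables all proxies while removing all toxics; the helper name is illustrative:
package chaos_test

import (
    "testing"

    toxiproxy "github.com/Shopify/toxiproxy/v2/client"
)

// withCleanProxy hands the test a proxy and guarantees Toxiproxy is reset
// afterwards, so one scenario's toxics never leak into the next.
func withCleanProxy(t *testing.T) *toxiproxy.Proxy {
    t.Helper()
    client := toxiproxy.NewClient("localhost:8474")

    proxy, err := client.Proxy("redis-proxy")
    if err != nil {
        t.Fatalf("fetch proxy: %v", err)
    }

    t.Cleanup(func() {
        if err := client.ResetState(); err != nil {
            t.Logf("reset toxiproxy: %v", err)
        }
    })
    return proxy
}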
Real-World Example
Here's a complete test scenario simulating a production incident:
#!/bin/bash
echo "=== Simulating Production Incident ==="
# Phase 1: Normal operation
echo "Phase 1: Normal operation (30s)"
sleep 30
# Phase 2: Redis starts slowing down
echo "Phase 2: Redis latency increases (1s → 3s)"
toxiproxy-cli toxic add redis-proxy -t latency -a latency=1000
sleep 10
toxiproxy-cli toxic update redis-proxy -n latency_downstream -a latency=3000
sleep 10
# Phase 3: Redis becomes unstable
echo "Phase 3: Connections start resetting"
toxiproxy-cli toxic add redis-proxy -t reset_peer -a timeout=1000
sleep 10
# Phase 4: Complete outage
echo "Phase 4: Redis goes down completely"
toxiproxy-cli toggle redis-proxy
sleep 20
# Phase 5: Redis recovers
echo "Phase 5: Redis comes back online"
toxiproxy-cli toggle redis-proxy
toxiproxy-cli toxic remove redis-proxy -n latency_downstream
toxiproxy-cli toxic remove redis-proxy -n reset_peer_downstream
sleep 20
echo "=== Incident simulation complete ==="
echo "Check logs and metrics to verify circuit breaker behavior"
Expected outcome:
- Phase 1-2: Elevated latency, some timeouts
- Phase 3: Circuit might start opening/closing intermittently
- Phase 4: Circuit should open and stay open
- Phase 5: Circuit should recover to closed state
Conclusion
Testing with Toxiproxy transforms abstract resilience patterns into concrete, verifiable behaviors. By simulating real network failures, you can:
- Validate that your circuit breaker opens at the correct threshold
- Verify that your application handles failures gracefully
- Discover edge cases before they occur in production
- Build confidence in your system's resilience
The key is to test not just complete failures, but the spectrum of degradation that happens in production: latency spikes, intermittent errors, bandwidth constraints, and gradual deterioration.
Remember: A circuit breaker that's never tested is just technical debt in disguise.
Resources
- Toxiproxy GitHub Repository
- Netflix's Hystrix Circuit Breaker
- Martin Fowler on Circuit Breakers
- Redis Client Best Practices
Have you used Toxiproxy for testing? What failure scenarios have you discovered that surprised you? Share your experiences in the comments! 💬