Building resilient distributed systems requires thorough testing of failure scenarios. While unit tests are great for business logic, they can't simulate the complex network failures that happen in production. This is where Toxiproxy comes in—a powerful tool for testing how your application handles real-world network chaos.
In this tutorial, we'll explore how to test a Redis circuit breaker implementation using Toxiproxy to simulate various failure modes, from complete outages to subtle network degradation.
What is Toxiproxy?
Toxiproxy is a TCP proxy developed by Shopify that simulates network and system conditions for chaos and resiliency testing. Unlike traditional testing approaches that mock dependencies, Toxiproxy sits between your application and its dependencies, allowing you to inject real network failures:
- Connection failures (service down)
- Latency (slow networks)
- Bandwidth limitations (network congestion)
- Connection resets (unstable networks)
- Sliced or truncated data (partial responses)
This makes it ideal for testing circuit breakers, retry logic, timeouts, and other resilience patterns.
Understanding Circuit Breakers
Before we dive into testing, let's briefly review the circuit breaker pattern. A circuit breaker prevents an application from repeatedly trying to execute an operation that's likely to fail, giving the failing service time to recover.
The circuit breaker has three states:
- Closed: Normal operation, requests pass through
- Open: Too many failures detected, requests are blocked or fail-fast
- Half-Open: Testing if the service has recovered
For our Redis implementation, we'll configure the circuit breaker to:
- Open after 5 consecutive failures
- Either fail-open (allow requests through) or fail-closed (block requests)
- Log when state transitions occur
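The tutorial doesn't show the breaker code itself, so as a concrete reference point, here is a minimal sketch of comparable settings using the sony/gobreaker library. This is an illustrative stand-in rather than the auth service's actual implementation; the names and values are placeholders:
package cache

import (
    "log"
    "time"

    "github.com/sony/gobreaker"
)

// breaker guards every Redis call made by the auth service.
var breaker = gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name: "redis",
    // Open after 5 consecutive failures.
    ReadyToTrip: func(c gobreaker.Counts) bool {
        return c.ConsecutiveFailures >= 5
    },
    // How long the circuit stays open before a half-open trial request.
    Timeout: 30 * time.Second,
    // Log every state transition (closed -> open -> half-open -> closed).
    OnStateChange: func(name string, from, to gobreaker.State) {
        log.Printf("circuit breaker %q: %s -> %s", name, from, to)
    },
})
Fail-open versus fail-closed then lives in the caller: when Execute returns gobreaker.ErrOpenState, a fail-open caller falls back to the database and skips Redis, while a fail-closed caller returns the error immediately.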
Setup
Installing Toxiproxy
First, install Toxiproxy on your system:
macOS:
brew install toxiproxy
Starting the Toxiproxy Server
Start the Toxiproxy server in the background:
toxiproxy-server &
The server starts on localhost:8474 by default. You can verify it's running:
curl http://localhost:8474/version
Creating a Proxy for Redis
Now create a proxy that sits between your auth service and Redis:
# Create proxy: listens on 6380, forwards to Redis on 6379
toxiproxy-cli create redis-proxy \
-l localhost:6380 \
-u localhost:6379
This creates a proxy named redis-proxy that:
- Listens on port 6380 (your application will connect here)
- Forwards traffic to Redis on port 6379
Verify the proxy was created:
toxiproxy-cli list
Output:
Name Listen Upstream Enabled
============================================================
redis-proxy localhost:6380 localhost:6379 true
Configuring Your Application
Point your auth service to use the Toxiproxy port instead of connecting directly to Redis:
export REDIS_HOST=localhost
export REDIS_PORT=6380 # Toxiproxy port, not 6379
# Restart your service
Now all Redis traffic flows through Toxiproxy, allowing you to inject failures without modifying your application code.
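For a Go service built on go-redis (github.com/redis/go-redis/v9), the wiring might look like this minimal sketch; adjust for whatever client your service actually uses:
package cache

import (
    "net"
    "os"

    "github.com/redis/go-redis/v9"
)

// newRedisClient builds the client from the same environment variables
// exported above, so pointing it at Toxiproxy is purely a config change.
func newRedisClient() *redis.Client {
    host := os.Getenv("REDIS_HOST") // localhost
    port := os.Getenv("REDIS_PORT") // 6380: the Toxiproxy listener, not Redis itself
    return redis.NewClient(&redis.Options{
        Addr: net.JoinHostPort(host, port),
    })
}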
Testing Scenarios
Let's explore different failure scenarios and verify our circuit breaker behaves correctly.
Scenario 1: Simulate Redis Down (Circuit Opens)
The most critical test—what happens when Redis becomes completely unavailable?
Disable the proxy:
toxiproxy-cli toggle redis-proxy
This simulates Redis being down. Your application will start getting connection failures.
Expected Behavior:
- First 4 requests fail but circuit remains closed
- 5th consecutive failure → Circuit breaker opens
- Logs should show:
{"level":"warn","msg":"Circuit breaker opened - Redis failures exceeded threshold","service":"auth"}
- Subsequent requests are either:
- Fail-open: Allowed through (degraded mode, no Redis caching)
- Fail-closed: Rejected immediately (fail-fast)
Monitoring the circuit state:
While the circuit is open, check your application metrics or logs. The circuit should remain open for the configured recovery period.
Re-enable the proxy:
toxiproxy-cli toggle redis-proxy
After re-enabling, the circuit should transition to half-open on the next request, then back to closed if the request succeeds.
Verification checklist:
- ✅ Circuit opens after 5 failures
- ✅ Log message appears with correct timestamp
- ✅ Requests are handled according to fail-open/fail-closed policy
- ✅ Circuit recovers when service returns
- ✅ Application continues functioning (degraded or failed requests)
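You can also drive this scenario from an automated test instead of the CLI. The sketch below uses Toxiproxy's official Go client together with go-redis and sony/gobreaker as stand-ins for your service's own client and breaker; the proxy name and ports match the setup above, everything else is illustrative:
package chaos_test

import (
    "context"
    "testing"
    "time"

    toxiproxy "github.com/Shopify/toxiproxy/v2/client"
    "github.com/redis/go-redis/v9"
    "github.com/sony/gobreaker"
)

func TestCircuitOpensWhenRedisIsDown(t *testing.T) {
    toxi := toxiproxy.NewClient("localhost:8474")
    proxy, err := toxi.Proxy("redis-proxy")
    if err != nil {
        t.Fatalf("fetch proxy: %v", err)
    }

    // A self-contained client and breaker for the test; your service
    // would normally own these.
    rdb := redis.NewClient(&redis.Options{
        Addr:        "localhost:6380", // the Toxiproxy listen port
        DialTimeout: time.Second,
        ReadTimeout: time.Second,
    })
    cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
        Name: "redis",
        ReadyToTrip: func(c gobreaker.Counts) bool {
            return c.ConsecutiveFailures >= 5
        },
    })

    // Simulate Redis going down; restore it when the test ends.
    if err := proxy.Disable(); err != nil {
        t.Fatalf("disable proxy: %v", err)
    }
    t.Cleanup(func() { _ = proxy.Enable() })

    // Drive enough failing calls to cross the 5-failure threshold.
    ctx := context.Background()
    for i := 0; i < 6; i++ {
        _, _ = cb.Execute(func() (interface{}, error) {
            return rdb.Get(ctx, "session:test").Result()
        })
    }

    if cb.State() != gobreaker.StateOpen {
        t.Fatalf("expected circuit to be open, got %v", cb.State())
    }
}
Run it with the Toxiproxy server and Redis already up locally, and the test exercises the same path you just walked through by hand.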
Scenario 2: Add Latency (Slow Redis)
Network latency is often more insidious than complete failures. It can cause timeouts, thread pool exhaustion, and cascading failures.
Add 5-second latency:
toxiproxy-cli toxic add redis-proxy \
-t latency \
-a latency=5000
This adds a 5000ms (5 second) delay to all requests passing through the proxy.
Expected Behavior:
If your Redis client timeout is less than 5 seconds (e.g., 2 seconds), requests will timeout and count as failures. After 5 consecutive timeouts, the circuit should open.
Logs you should see:
{"level":"error","msg":"Redis operation timeout","error":"context deadline exceeded"}
{"level":"warn","msg":"Circuit breaker opened - Redis failures exceeded threshold"}
Testing different latency levels:
# Moderate latency (1 second)
toxiproxy-cli toxic update redis-proxy \
-n latency_downstream \
-a latency=1000
# Extreme latency (10 seconds)
toxiproxy-cli toxic update redis-proxy \
-n latency_downstream \
-a latency=10000
Remove the latency toxic:
toxiproxy-cli toxic remove redis-proxy -n latency_downstream
Key insights:
- Latency above your timeout threshold behaves like a failure
- Helps verify your timeouts are properly configured
- Tests thread pool behavior under slow dependencies
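Those timeouts live in your client code. Here is a hedged go-redis sketch of where the "context deadline exceeded" error in the logs above comes from; the 2-second value is illustrative:
package cache

import (
    "context"
    "time"

    "github.com/redis/go-redis/v9"
)

// getSession puts a 2-second deadline on the call, so any injected latency
// above 2s surfaces as context.DeadlineExceeded and counts as a failure.
func getSession(rdb *redis.Client, key string) (string, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
    defer cancel()
    return rdb.Get(ctx, key).Result()
}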
Scenario 3: Connection Reset (Network Errors)
Simulate unstable network connections that reset mid-request:
Add reset_peer toxic:
toxiproxy-cli toxic add redis-proxy \
-t reset_peer \
-a timeout=500
This abruptly resets the connection (TCP RST) after 500ms, simulating network instability.
Expected Behavior:
Connections will be abruptly closed, causing errors like:
connection reset by peer
unexpected EOF
broken pipe
These should count as failures and eventually open the circuit.
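What counts as "a failure" is a client-side decision. If your breaker is built on sony/gobreaker (an assumption, not something this tutorial prescribes), its IsSuccessful hook is one place to encode that: treat cache misses as success, and network-level errors like the ones above as failures. A hedged sketch:
package cache

import (
    "context"
    "errors"
    "io"
    "net"

    "github.com/redis/go-redis/v9"
    "github.com/sony/gobreaker"
)

// isSuccessful decides which errors count toward opening the circuit.
// A cache miss (redis.Nil) is business as usual; network-level errors are not.
func isSuccessful(err error) bool {
    if err == nil || errors.Is(err, redis.Nil) {
        return true
    }
    var netErr net.Error
    if errors.Is(err, context.DeadlineExceeded) ||
        errors.Is(err, io.EOF) ||
        errors.As(err, &netErr) { // "connection reset by peer", "broken pipe", ...
        return false
    }
    // Unknown errors are treated as failures, conservatively.
    return false
}

var settings = gobreaker.Settings{
    Name:         "redis",
    IsSuccessful: isSuccessful,
    ReadyToTrip: func(c gobreaker.Counts) bool {
        return c.ConsecutiveFailures >= 5
    },
}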
Remove the toxic:
toxiproxy-cli toxic remove redis-proxy -n reset_peer_downstream
When to use this test:
- Verifying connection pool recovery
- Testing retry logic
- Ensuring proper cleanup of broken connections
Scenario 4: Bandwidth Limit (Network Congestion)
Simulate network congestion with bandwidth restrictions:
Limit to 1KB/s:
toxiproxy-cli toxic add redis-proxy \
-t bandwidth \
-a rate=1
This restricts throughput to 1 kilobyte per second, simulating a severely congested network.
Expected Behavior:
- Small Redis operations (GET, SET) might still work but slowly
- Large operations (fetching big values, pipeline operations) will timeout
- Gradual degradation rather than immediate failure
Test different bandwidth levels:
# Moderate congestion (10 KB/s)
toxiproxy-cli toxic update redis-proxy \
-n bandwidth_downstream \
-a rate=10
# Severe congestion (back to 1 KB/s; rate is a whole number of KB/s, so this is the floor)
toxiproxy-cli toxic update redis-proxy \
-n bandwidth_downstream \
-a rate=1
Remove the toxic:
toxiproxy-cli toxic remove redis-proxy -n bandwidth_downstream
What this tests:
- Application behavior under sustained degradation
- Whether your timeouts are appropriate for typical data sizes
- If your circuit breaker is sensitive enough to detect slow failures
Scenario 5: Jitter (Variable Latency)
Real networks don't have consistent latency—they fluctuate. Simulate this with jitter:
Add latency with jitter:
toxiproxy-cli toxic add redis-proxy \
-t latency \
-a latency=1000 \
-a jitter=500
This creates latency ranging from 500ms to 1500ms (1000ms ± 500ms).
Expected Behavior:
- Some requests complete quickly
- Others timeout randomly
- Circuit breaker sees intermittent failures
- Tests the circuit breaker's threshold logic
Why this matters:
- More realistic than fixed latency
- Reveals issues with retry timing
- Tests how your system handles sporadic failures
Scenario 6: Sliced Data (Gradual Degradation)
Simulate a service slowly dying:
Add slicer toxic:
toxiproxy-cli toxic add redis-proxy \
-t slicer \
-a average_size=64 \
-a size_variation=32 \
-a delay=100000
This slices responses into small chunks (roughly 32 to 96 bytes each) and pauses 100,000 microseconds (100ms) between chunks.
Expected Behavior:
- Operations slow down roughly in proportion to payload size
- Eventually exceed timeouts
- Allows testing gradual degradation vs sudden failure
Advanced Testing Patterns
Combining Multiple Toxics
You can apply multiple toxics simultaneously to simulate complex scenarios:
# Add latency plus a slow connection close
toxiproxy-cli toxic add redis-proxy -t latency -a latency=200
toxiproxy-cli toxic add redis-proxy -t slow_close -a delay=100
This simulates a network that's both slow AND unstable.
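The same combination can be applied from test code with Toxiproxy's official Go client. The sketch below assumes the redis-proxy created earlier and uses illustrative toxic names:
package main

import (
    "log"

    toxiproxy "github.com/Shopify/toxiproxy/v2/client"
)

func main() {
    client := toxiproxy.NewClient("localhost:8474")

    proxy, err := client.Proxy("redis-proxy")
    if err != nil {
        log.Fatalf("fetch proxy: %v", err)
    }

    // 200ms of added latency on traffic flowing toward Redis...
    if _, err := proxy.AddToxic("latency_down", "latency", "downstream", 1.0,
        toxiproxy.Attributes{"latency": 200}); err != nil {
        log.Fatalf("add latency toxic: %v", err)
    }

    // ...plus a 100ms delay before connections are allowed to close.
    if _, err := proxy.AddToxic("slow_close_down", "slow_close", "downstream", 1.0,
        toxiproxy.Attributes{"delay": 100}); err != nil {
        log.Fatalf("add slow_close toxic: %v", err)
    }

    // ... drive traffic through the proxy here ...

    // Remove both toxics once the experiment is done.
    _ = proxy.RemoveToxic("latency_down")
    _ = proxy.RemoveToxic("slow_close_down")
}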
Automated Testing Script
Create a bash script to run your test scenarios automatically:
#!/bin/bash
echo "Starting Redis circuit breaker tests..."
# Test 1: Complete failure
echo "Test 1: Simulating Redis down"
toxiproxy-cli toggle redis-proxy
sleep 5
curl http://localhost:8080/health
toxiproxy-cli toggle redis-proxy
# Test 2: High latency
echo "Test 2: Adding high latency"
toxiproxy-cli toxic add redis-proxy -t latency -a latency=5000
sleep 5
curl http://localhost:8080/health
toxiproxy-cli toxic remove redis-proxy -n latency_downstream
# Test 3: Reset connections
echo "Test 3: Reset connections"
toxiproxy-cli toxic add redis-proxy -t reset_peer -a timeout=500
sleep 5
curl http://localhost:8080/health
toxiproxy-cli toxic remove redis-proxy -n reset_peer_downstream
echo "Tests complete!"
Key Metrics to Track
- Circuit state: closed, open, half-open
- Failure count: Number of consecutive failures
- Success rate: Percentage of successful requests
- Latency percentiles: p50, p95, p99
- Request volume: Total requests during the test
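One way to make the circuit state visible is to export it as a gauge. The sketch below assumes prometheus/client_golang and a gobreaker-style OnStateChange hook; the metric name is illustrative:
package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/sony/gobreaker"
)

var circuitState = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "redis_circuit_breaker_state",
    Help: "Redis circuit breaker state: 0=closed, 1=half-open, 2=open.",
})

func init() {
    prometheus.MustRegister(circuitState)
}

// OnStateChange can be plugged into gobreaker.Settings.OnStateChange.
func OnStateChange(name string, from, to gobreaker.State) {
    switch to {
    case gobreaker.StateClosed:
        circuitState.Set(0)
    case gobreaker.StateHalfOpen:
        circuitState.Set(1)
    case gobreaker.StateOpen:
        circuitState.Set(2)
    }
}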
Best Practices
1. Test All Failure Modes
Don't just test complete outages. Real production issues include:
- Partial failures (some operations succeed, others fail)
- Slow failures (latency-induced timeouts)
- Intermittent failures (flaky networks)
2. Verify Recovery
Always test that your circuit breaker recovers properly:
# Cause failure
toxiproxy-cli toggle redis-proxy
# Wait for circuit to open
sleep 10
# Restore service
toxiproxy-cli toggle redis-proxy
# Verify recovery
curl http://localhost:8080/health
3. Test Under Load
Run toxics while your service is under load to see realistic behavior:
# Start load test
hey -z 60s -c 10 http://localhost:8080/api/endpoint &
# Inject failure mid-test
sleep 20
toxiproxy-cli toxic add redis-proxy -t latency -a latency=3000
4. Clean Up Between Tests
Always remove toxics and reset state:
# List active toxics, then remove each one by name
toxiproxy-cli inspect redis-proxy
toxiproxy-cli toxic remove redis-proxy -n latency_downstream
toxiproxy-cli toxic remove redis-proxy -n reset_peer_downstream
# Or reset the entire proxy
toxiproxy-cli delete redis-proxy
toxiproxy-cli create redis-proxy -l localhost:6380 -u localhost:6379
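If your tests use the Go client, a small helper can guarantee this cleanup for every test case. The sketch below leans on ResetState(), which maps to the HTTP API's POST /reset and re-enables all proxies while removing all toxics; the helper name is illustrative:
package chaos_test

import (
    "testing"

    toxiproxy "github.com/Shopify/toxiproxy/v2/client"
)

// withCleanProxy hands the test a proxy and guarantees Toxiproxy is reset
// afterwards, so one scenario's toxics never leak into the next.
func withCleanProxy(t *testing.T) *toxiproxy.Proxy {
    t.Helper()
    client := toxiproxy.NewClient("localhost:8474")

    proxy, err := client.Proxy("redis-proxy")
    if err != nil {
        t.Fatalf("fetch proxy: %v", err)
    }

    t.Cleanup(func() {
        if err := client.ResetState(); err != nil {
            t.Logf("reset toxiproxy: %v", err)
        }
    })
    return proxy
}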
Real-World Example
Here's a complete test scenario simulating a production incident:
#!/bin/bash
echo "=== Simulating Production Incident ==="
# Phase 1: Normal operation
echo "Phase 1: Normal operation (30s)"
sleep 30
# Phase 2: Redis starts slowing down
echo "Phase 2: Redis latency increases (1s → 3s)"
toxiproxy-cli toxic add redis-proxy -t latency -a latency=1000
sleep 10
toxiproxy-cli toxic update redis-proxy -n latency_downstream -a latency=3000
sleep 10
# Phase 3: Redis becomes unstable
echo "Phase 3: Connections start resetting"
toxiproxy-cli toxic add redis-proxy -t reset_peer -a timeout=1000
sleep 10
# Phase 4: Complete outage
echo "Phase 4: Redis goes down completely"
toxiproxy-cli toggle redis-proxy
sleep 20
# Phase 5: Redis recovers
echo "Phase 5: Redis comes back online"
toxiproxy-cli toggle redis-proxy
toxiproxy-cli toxic remove redis-proxy -n latency_downstream
toxiproxy-cli toxic remove redis-proxy -n reset_peer_downstream
sleep 20
echo "=== Incident simulation complete ==="
echo "Check logs and metrics to verify circuit breaker behavior"
Expected outcome:
- Phase 1-2: Elevated latency, some timeouts
- Phase 3: Circuit might start opening/closing intermittently
- Phase 4: Circuit should open and stay open
- Phase 5: Circuit should recover to closed state
Conclusion
Testing with Toxiproxy transforms abstract resilience patterns into concrete, verifiable behaviors. By simulating real network failures, you can:
- Validate that your circuit breaker opens at the correct threshold
- Verify that your application handles failures gracefully
- Discover edge cases before they occur in production
- Build confidence in your system's resilience
The key is to test not just complete failures, but the spectrum of degradation that happens in production: latency spikes, intermittent errors, bandwidth constraints, and gradual deterioration.
Remember: A circuit breaker that's never tested is just technical debt in disguise.
Resources
- Toxiproxy GitHub Repository
- Netflix's Hystrix Circuit Breaker
- Martin Fowler on Circuit Breakers
- Redis Client Best Practices
Have you used Toxiproxy for testing? What failure scenarios have you discovered that surprised you? Share your experiences in the comments! 💬