Samson Tanimawo

Load Testing in Production: How We Do It Safely

Why Staging Load Tests Lie

We ran perfect load tests in staging. 10,000 requests per second, sub-100ms latency, zero errors. Then we launched and fell over at 3,000 RPS in production.

Why? Staging had different data (100 rows vs 10 million), different traffic patterns (uniform vs Zipfian), and different infrastructure (single AZ vs multi-AZ).

Staging load tests are theater. Production load tests find real problems.
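The traffic-pattern gap alone is enough to invalidate a staging result. Under a Zipf-like distribution, a handful of hot keys absorb most requests, so caches and row locks behave nothing like they do under uniform load. A quick sketch in plain Python (the key counts and skew parameter are illustrative, not our production numbers):

```python
import random
from collections import Counter

def sample_requests(n, num_keys=1000, zipf_s=1.2, seed=42):
    """Sample n cache-key requests under uniform vs Zipf-like weighting."""
    rng = random.Random(seed)
    keys = list(range(num_keys))
    zipf_weights = [1 / (k + 1) ** zipf_s for k in keys]
    uniform = Counter(rng.choices(keys, k=n))
    zipfian = Counter(rng.choices(keys, weights=zipf_weights, k=n))
    return uniform, zipfian

uniform, zipfian = sample_requests(100_000)
# Share of traffic hitting the single hottest key under each pattern
print(f"uniform hottest key: {max(uniform.values()) / 100_000:.1%}")
print(f"zipfian hottest key: {max(zipfian.values()) / 100_000:.1%}")
```

Under uniform sampling every key gets roughly 0.1% of traffic; under the skewed distribution the hottest key takes a double-digit share, which is what actually hammers a cache or a row lock.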

The Safety Framework

```yaml
production_load_test_rules:
  # Rule 1: Always use real traffic patterns, not uniform
  traffic_pattern: "replay_production_logs"  # Not synthetic

  # Rule 2: Gradual ramp-up, never step function
  ramp_up:
    start: "10% of current traffic"
    increment: "10% every 5 minutes"
    max: "200% of current traffic"

  # Rule 3: Automatic abort conditions
  abort_if:
    error_rate: "> 1%"
    p99_latency: "> 2x baseline"
    cpu_utilization: "> 90%"
    any_circuit_breaker: "open"

  # Rule 4: Business hours only
  allowed_window: "Tue-Thu 10am-2pm CT"

  # Rule 5: Team readiness
  requires:
    - ic_designated: true
    - rollback_plan: documented
    - stakeholders_notified: true
```
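Rule 2's ramp doesn't have to be hand-written; it can be expanded from the three parameters above. A sketch of what that expansion might look like (the function name and shape are illustrative, not our actual tooling):

```python
def ramp_schedule(baseline_rps, start_pct=10, increment_pct=10,
                  max_pct=200, step_minutes=5):
    """Expand 'start at 10%, +10% every 5 min, cap at 200%' into
    a list of (minute_offset, target_rps) steps."""
    steps = []
    pct = start_pct
    minute = 0
    while pct <= max_pct:
        steps.append((minute, baseline_rps * pct // 100))
        pct += increment_pct
        minute += step_minutes
    return steps

# With a 3,000 RPS baseline: 300 RPS at t=0 up to 6,000 RPS at t=95 min
schedule = ramp_schedule(3000)
print(schedule[0], schedule[-1])  # (0, 300) (95, 6000)
```

Generating the schedule keeps the ramp honest: nobody can fat-finger a step function into the test config.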

Our Load Testing Tool: k6

```javascript
// k6 load test script
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

const errorRate = new Rate('errors');
const latency = new Trend('request_latency');

export const options = {
  // Note: stage targets are virtual users (VUs), not RPS directly;
  // effective RPS depends on request duration and think time
  stages: [
    { duration: '5m', target: 100 },   // Ramp to 100 VUs
    { duration: '5m', target: 200 },   // Ramp to 200 VUs
    { duration: '5m', target: 500 },   // Ramp to 500 VUs
    { duration: '10m', target: 500 },  // Hold at 500 VUs
    { duration: '5m', target: 1000 },  // Push to 1000 VUs
    { duration: '10m', target: 1000 }, // Hold at 1000 VUs
    { duration: '5m', target: 0 },     // Ramp down
  ],
  thresholds: {
    'http_req_duration': ['p(99)<500'], // p99 under 500ms
    'errors': ['rate<0.01'],            // Error rate under 1%
  },
};

export default function () {
  // Simulate real user behavior with weighted endpoints
  const endpoints = [
    { url: '/api/products', weight: 40 },
    { url: '/api/search?q=shoes', weight: 25 },
    { url: '/api/cart', weight: 20 },
    { url: '/api/checkout', weight: 10 },
    { url: '/api/user/profile', weight: 5 },
  ];

  const rand = Math.random() * 100;
  let cumWeight = 0;

  for (const endpoint of endpoints) {
    cumWeight += endpoint.weight;
    if (rand <= cumWeight) {
      const res = http.get(`https://api.production.com${endpoint.url}`);
      check(res, {
        'status is 200': (r) => r.status === 200,
        'latency < 500ms': (r) => r.timings.duration < 500,
      });
      errorRate.add(res.status !== 200);
      latency.add(res.timings.duration);
      break;
    }
  }

  sleep(Math.random() * 2); // Random think time
}
```
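Rule 1 calls for replaying production logs rather than shaping synthetic traffic like the script above. The core of a replayer is preserving the recorded inter-arrival gaps. A minimal Python sketch (the log line format and the `send` callback are assumptions, not our actual tooling; a real replayer shards this across workers):

```python
import time
from datetime import datetime

def parse_log_line(line):
    """Parse 'ISO_TIMESTAMP METHOD PATH' lines (assumed access-log format)."""
    ts, method, path = line.split(" ", 2)
    return datetime.fromisoformat(ts), method, path

def replay(lines, send, speedup=1.0):
    """Replay requests, preserving recorded inter-arrival gaps.

    speedup > 1 compresses time, which is how a replay can push
    traffic past 100% of the recorded baseline.
    """
    events = [parse_log_line(line) for line in lines]
    t0 = events[0][0]
    wall0 = time.monotonic()
    for ts, method, path in events:
        target = (ts - t0).total_seconds() / speedup
        wait = target - (time.monotonic() - wall0)
        if wait > 0:
            time.sleep(wait)
        send(method, path)
```

Replaying with `speedup=1.2` is "120% of current traffic" with the real burstiness intact, which uniform synthetic load can never reproduce.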

The Abort System

```python
import time

def monitor_load_test(test_id):
    """Continuously monitor and auto-abort if thresholds breached."""
    while test_is_running(test_id):
        metrics = get_current_metrics()

        if metrics['error_rate'] > 0.01:
            abort_load_test(test_id, reason="Error rate exceeded 1%")
            alert_team(f"Load test aborted: error rate {metrics['error_rate']:.2%}")
            return

        if metrics['p99_latency'] > metrics['baseline_p99'] * 2:
            abort_load_test(test_id, reason="Latency 2x above baseline")
            alert_team(f"Load test aborted: p99 {metrics['p99_latency']}ms")
            return

        if any(cb.state == 'OPEN' for cb in get_circuit_breakers()):
            abort_load_test(test_id, reason="Circuit breaker opened")
            alert_team("Load test aborted: circuit breaker tripped")
            return

        time.sleep(10)  # Check every 10 seconds
```
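The abort conditions themselves are easier to unit-test when pulled out of the polling loop into a pure function. A sketch along those lines (the metric keys mirror the monitor above; the circuit-breaker representation is an assumption):

```python
def abort_reason(metrics, breaker_states):
    """Return a human-readable abort reason, or None if the test may continue.

    Mirrors the monitor's thresholds: >1% errors, p99 over 2x baseline,
    or any circuit breaker open.
    """
    if metrics["error_rate"] > 0.01:
        return f"Error rate exceeded 1%: {metrics['error_rate']:.2%}"
    if metrics["p99_latency"] > metrics["baseline_p99"] * 2:
        return f"Latency 2x above baseline: p99 {metrics['p99_latency']}ms"
    if any(state == "OPEN" for state in breaker_states):
        return "Circuit breaker opened"
    return None

healthy = {"error_rate": 0.002, "p99_latency": 180, "baseline_p99": 150}
print(abort_reason(healthy, ["CLOSED", "CLOSED"]))  # None
```

With the decision isolated, the thresholds can be exercised in CI without standing up a load test at all.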

What We Learned

Production load tests revealed:

  1. Connection pool exhaustion at 800 RPS: invisible in staging (different pool settings)
  2. Cache stampede during cold start: the staging cache was always warm
  3. Database lock contention on inventory updates: 100 rows vs 10M rows
  4. CDN rate limiting at 500 RPS from a single AZ: multi-origin was never tested
  5. Memory leak under sustained load: only visible after 30+ minutes

None of these showed up in staging. All of them would have caused production outages.

Results

Before production load testing:

- Incidents caused by scale issues: 3-4 per quarter
- Time to find a scaling bottleneck: days (post-incident)

After:

- Incidents caused by scale issues: 0-1 per quarter
- Time to find a scaling bottleneck: hours (proactive)

If you want AI-powered load testing that predicts breaking points before they happen, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
