These are the notes from Chapter 22: Addressing Cascading Failures from the book Site Reliability Engineering, How Google Runs Production Systems.
This is a post in a series. The previous post can be seen here:
SRE book notes: Software Engineering in SRE
Because cascading failures are hard to predict, the testing strategies are the most insightful part of this chapter, IMHO. Consider reading the rest of it, as it covers the causes and prevention strategies as well.
You should test your service to determine how it behaves under heavy load in order to gain confidence that it won’t enter a cascading failure under various circumstances.
Load test components until they break. As load increases, a component typically handles requests successfully until it reaches a point at which it can’t handle more requests. At this point, the component should ideally start serving errors or degraded results in response to additional load, but not significantly reduce the rate at which it successfully handles requests. A component that is highly susceptible to a cascading failure will start crashing or serving a very high rate of errors when it becomes overloaded; a better-designed component will instead be able to reject a few requests and survive.
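As a rough illustration (not from the book), here is a minimal load-shedding sketch in Go. It assumes an HTTP service and a hypothetical in-flight limit of 100 requests, and rejects anything beyond that with a fast 503 rather than queueing work until the process falls over:

```go
package main

import (
	"net/http"
	"time"
)

// maxInFlight caps concurrent requests; the real value would come from
// load testing the component until it breaks (100 is a hypothetical number).
const maxInFlight = 100

var inFlight = make(chan struct{}, maxInFlight)

func handler(w http.ResponseWriter, r *http.Request) {
	select {
	case inFlight <- struct{}{}:
		defer func() { <-inFlight }()
	default:
		// Overloaded: reject this request quickly instead of queueing
		// until the process runs out of memory or CPU and crashes.
		http.Error(w, "overloaded, try again later", http.StatusServiceUnavailable)
		return
	}

	// Simulated work standing in for the real request handling.
	time.Sleep(50 * time.Millisecond)
	w.Write([]byte("ok\n"))
}

func main() {
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}
```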
If you’re load testing a stateful service or a service that employs caching, your load test should track state between multiple interactions and check correctness at high load, which is often where subtle concurrency bugs hit.
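A hedged sketch of what such a test might look like in Go, assuming a hypothetical key/value HTTP API with /set and /get endpoints (the endpoint names and parameters are made up): many workers write keys and immediately read them back, reporting any value that does not survive the round trip under load.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"sync"
)

// Hypothetical base URL of the service under test.
const baseURL = "http://localhost:8080"

func worker(id, iterations int, errs chan<- error) {
	for i := 0; i < iterations; i++ {
		key := fmt.Sprintf("w%d-k%d", id, i)
		val := fmt.Sprintf("v%d", i)

		// Write, then immediately read back: the test tracks state between
		// interactions, which is where subtle concurrency bugs tend to show up.
		setResp, err := http.PostForm(baseURL+"/set", url.Values{"key": {key}, "value": {val}})
		if err != nil {
			errs <- err
			continue
		}
		setResp.Body.Close()

		getResp, err := http.Get(baseURL + "/get?key=" + url.QueryEscape(key))
		if err != nil {
			errs <- err
			continue
		}
		body, _ := io.ReadAll(getResp.Body)
		getResp.Body.Close()
		if string(body) != val {
			errs <- fmt.Errorf("key %s: want %q, got %q", key, val, body)
		}
	}
}

func main() {
	const workers, iterations = 50, 200
	errs := make(chan error, workers*iterations)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) { defer wg.Done(); worker(id, iterations, errs) }(w)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		fmt.Println("violation under load:", err)
	}
}
```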
If you believe your system has proper protections against being overloaded, consider performing failure tests in a small slice of production to find the point at which the components in your system fail under real traffic. These limits may not be adequately reflected by synthetic load test traffic, so real traffic tests may provide more realistic results than load tests, at the risk of causing user-visible pain.
Test your noncritical backends, and make sure their unavailability does not interfere with the critical components of your service.
Your requests may significantly slow down and consume resources waiting for noncritical backends to finish.
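A common guard is a short deadline on every call to a noncritical backend, so a slow or dead backend can only degrade the response instead of tying up resources. A minimal Go sketch, with a hypothetical backend URL and a made-up 100 ms budget:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

// Hypothetical noncritical backend (e.g., a recommendations service).
const noncriticalURL = "http://noncritical-backend.internal/annotate"

// fetchAnnotations is best-effort: the short deadline bounds how long a
// request can wait on the noncritical backend, so a slow or dead backend
// cannot tie up threads, memory, and connections on the critical path.
func fetchAnnotations(ctx context.Context) string {
	ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, noncriticalURL, nil)
	if err != nil {
		return ""
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// Degrade gracefully: serve the critical response without annotations.
		log.Printf("noncritical backend unavailable, degrading: %v", err)
		return ""
	}
	defer resp.Body.Close()
	// Parsing of the response body is omitted in this sketch.
	return "annotations"
}

func main() {
	annotations := fetchAnnotations(context.Background())
	log.Printf("serving critical response; annotations=%q", annotations)
}
```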
If heavy load causes most servers to crash as soon as they become healthy, you can get the service up and running again by:
- Addressing the initial triggering condition (by adding capacity, for example).
- Reducing load enough so that the crashing stops. Consider being aggressive here—if the entire service is crash-looping, only allow, say, 1% of the traffic through.
- Allowing the majority of the servers to become healthy.
- Gradually ramping up the load.
This strategy allows caches to warm up, connections to be established, etc., before load returns to normal levels.
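As a sketch of the "let a trickle through, then ramp up" idea, here is a probabilistic admission gate in Go; the percentages and timings are invented for illustration and would be driven by an operator or by monitoring in practice.

```go
package main

import (
	"math/rand"
	"net/http"
	"sync/atomic"
	"time"
)

// admitFraction is the share of traffic currently let through, as a
// percentage. It starts very low so crash-looping servers can come up,
// warm their caches, and re-establish connections.
var admitFraction atomic.Int64

func gate(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if rand.Int63n(100) >= admitFraction.Load() {
			http.Error(w, "shedding load during recovery", http.StatusServiceUnavailable)
			return
		}
		next(w, r)
	}
}

func main() {
	admitFraction.Store(1) // start by allowing roughly 1% of traffic through

	// Ramp up gradually; this schedule is illustrative, not prescriptive.
	go func() {
		for _, pct := range []int64{5, 10, 25, 50, 100} {
			time.Sleep(2 * time.Minute)
			admitFraction.Store(pct)
		}
	}()

	http.HandleFunc("/", gate(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	}))
	http.ListenAndServe(":8080", nil)
}
```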
Without proper care, some system changes meant to reduce background errors or otherwise improve the steady state can expose the service to greater risk of a full outage. Retrying on failures, shifting load away from unhealthy servers, killing unhealthy servers, adding caches to improve performance or reduce latency: all of these might be implemented to improve the normal case, but can increase the chance of a large-scale failure. Be careful when evaluating changes to ensure that one outage is not being traded for another.
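For example, retries are far less dangerous when they are bounded and jittered, so a struggling backend is not hit with a multiple of its normal load. A minimal Go sketch, where callBackend, the attempt limit, and the backoff values are all placeholder assumptions:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// callBackend stands in for an RPC to a downstream service.
func callBackend() error {
	if rand.Float64() < 0.3 {
		return errors.New("transient error")
	}
	return nil
}

// callWithRetries retries a small, bounded number of times with jittered
// exponential backoff. The cap keeps a struggling backend from receiving
// several times its normal load purely from retries, which is one way a
// "reliability improvement" turns into a cascading failure.
func callWithRetries(maxAttempts int) error {
	backoff := 50 * time.Millisecond
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = callBackend(); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		// Sleep for the backoff plus random jitter, then double the backoff.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff))))
		backoff *= 2
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	if err := callWithRetries(3); err != nil {
		fmt.Println(err)
	} else {
		fmt.Println("ok")
	}
}
```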
If you liked this post, consider subscribing to my newsletter Bit Maybe Wise.
You can also follow me on Twitter and Mastodon.
Photo by Bradyn Trollip on Unsplash