jameslaneovermind

Solving Meta's top 4 outage causes: 1/4 Unexpected Dependencies

In 2022, Francois Richard delivered an excellent talk at SREcon EMEA about how Meta drained every backbone router simultaneously (you remember, it was that outage). Here were their top 4 trending root causes of outages:

  1. Configuration updates
  2. System Overload
  3. Unexpected Dependencies
  4. Complexity Migrations

Unexpected Dependencies

Let's start with unexpected dependencies and why they are so difficult to identify, even for companies like Meta. The problem starts with the tool stack. Observability tools are usually the first line of defence, but they measure outputs such as metrics, logs and traces. While useful, these outputs require a good mental model and a deep understanding of the application to interpret. What we are talking about here, though, are unexpected issues, which often fall outside our mental model, our observability tools, or both.

When these types of outages happen, they can be complex to resolve because the system's behaviour contradicts our understanding of how it should work. This leads to confusion and requires individuals to rebuild their mental model of the system on the fly, as mentioned in the brilliant STELLA report. In Meta's case, services went down globally for close to six hours.

The solution? Blast Radius.

By measuring config changes (inputs) instead of outputs we can:

  1. Ensure that the configuration and current state of a system are readily accessible.
  2. Enable users to easily discover the potential impact of their intended changes and which areas might be affected.
  3. Provide users with the means to validate that their modifications have not caused any issues downstream.

Using our GitHub Action (or doing it manually), you can go from Terraform Plan → Blast Radius. The blast radius is based on your live AWS state, not on Terraform, which lets you see what might break. It:

  • Includes resources not managed by Terraform
  • Discovers dependencies even if they were created manually
  • Shows live data, not out-of-date CMDB data
  • Does all of this with read-only access, no agents, no telemetry, and no input from you. If you had to tell us how your apps are architected, we'd hardly find any unexpected dependencies, would we?

This means that before making a change, you would be able to identify any potential unexpected dependencies in the blast radius. 
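
To make the idea concrete, here is a minimal sketch of how a blast radius could be computed once a dependency graph has been discovered. It is not Overmind's actual implementation. The function name `blast_radius`, the `dependents` mapping, and every resource name are hypothetical and chosen only for illustration. The sketch simply walks outward, breadth first, from the resources being changed.

```python
from collections import deque


def blast_radius(dependents, changed, max_depth=None):
    """Map every resource reachable from `changed` to its distance from the change.

    `dependents` maps a resource to the resources that depend on it
    (a hypothetical shape, not a real Overmind data structure).
    """
    distances = {resource: 0 for resource in changed}
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        depth = distances[current]
        # Stop expanding once the optional depth limit is reached.
        if max_depth is not None and depth >= max_depth:
            continue
        for neighbour in dependents.get(current, ()):
            if neighbour not in distances:
                distances[neighbour] = depth + 1
                queue.append(neighbour)
    return distances


# Hypothetical live state: key -> resources that depend on the key.
dependents = {
    "sg-web": ["ec2-web-1", "ec2-web-2"],  # assume ec2-web-2 was created by hand
    "ec2-web-1": ["alb-public"],
    "ec2-web-2": ["alb-public"],
    "alb-public": ["route53-www"],
    "rds-main": ["ec2-web-1"],
}

# Changing the security group touches everything downstream of it.
radius = blast_radius(dependents, ["sg-web"])
for resource, distance in sorted(radius.items(), key=lambda item: item[1]):
    print(distance, resource)
```

Because the walk follows edges found in the live environment rather than in the Terraform plan, a manually created resource such as the hypothetical `ec2-web-2` still shows up in the result, while the unrelated `rds-main` does not.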

What's next? (2/4) Configuration Updates

We're not stopping with just blast radius. Once you've decided to apply your changes, track them with Overmind. Since we've already worked out all the dependencies, we can tell you if your changes have broken something downstream, even if you didn't know it existed.
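
One way to picture this kind of tracking is a before-and-after comparison limited to the resources in the blast radius. The sketch below is an assumption about how such a check might look, not a description of how Overmind works. The function `changed_in_radius`, the snapshot shape, and all resource names are hypothetical.

```python
def changed_in_radius(before, after, radius):
    """Return resources in the blast radius whose observed state differs.

    `before` and `after` map a resource name to a state dict
    (a hypothetical snapshot format used only for this example).
    """
    findings = {}
    for resource in sorted(radius):
        old, new = before.get(resource), after.get(resource)
        if old != new:
            findings[resource] = {"before": old, "after": new}
    return findings


# Hypothetical blast radius and snapshots taken around a change.
radius = {"sg-web", "ec2-web-1", "alb-public"}
before = {
    "sg-web": {"ingress_ports": [80, 443]},
    "ec2-web-1": {"status": "running"},
    "alb-public": {"healthy_targets": 2},
    "rds-main": {"status": "available"},
}
after = {
    "sg-web": {"ingress_ports": [443]},
    "ec2-web-1": {"status": "running"},
    "alb-public": {"healthy_targets": 0},
    "rds-main": {"status": "available"},
}

findings = changed_in_radius(before, after, radius)
for resource, states in findings.items():
    print(resource, states["before"], "->", states["after"])
```

In this made-up example the intended change to `sg-web` is reported alongside the load balancer losing its healthy targets, which is the downstream breakage nobody planned for.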
Want to try it?

Sign up and start calculating blast radius now! It's free for individuals. If you're interested in a team plan, contact us.


Overmind is a SaaS Terraform impact analysis tool. It discovers your AWS infrastructure so that it can calculate the blast radius of an application change, including resources managed outside of Terraform. This helps you identify the causes of outages by showing you which changes caused which problems. It also helps you deploy changes faster by giving you an impact analysis report before any change is made. From this report you can judge whether a change can be made confidently or should be held back because it's too risky, preventing outages in the first place.

Note: This is beta software and we'd love any feedback, either on Discord or by booking a meeting.
