DEV Community

Fault Tolerance in Distributed Systems: Strategies and Case Studies

Nikita Vetoshkin on October 18, 2023

The complex technological web that supports our daily lives has grown into a vast network of distributed systems. It is especially visible in the p...
Olivia Tencredi

Very thoughtful article. Thanks!
My question: Could you elaborate on the role of performance isolation mechanisms in Google's fault-tolerance approach and how they ensure uninterrupted service for clients in the face of unpredictable resource usage patterns?

Nikita Vetoshkin

Hey, @oliviatencredi! Thanks for your interest, this is actually a very deep question.

I'd start with a definition of "unpredictable". If we think of probabilities as a measure of our ignorance, then for Google usage patterns are predictable in 99.9% of cases, as they (and all planet-scale operators) have an automated feedback system in place:

  • assess current and predicted demand based on previous periods
  • provide this as an input to capacity planning teams and services
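To make that feedback loop concrete, here is a toy sketch, not anything Google actually runs. It assumes demand is tracked as one number per period; the names `forecast_demand` and `plan_capacity`, the moving-average forecast, and the 25% headroom are all made up for illustration.

```python
import math
from statistics import mean


def forecast_demand(history, window=3):
    """Predict next-period demand as the mean of the last `window` periods.

    A moving average is an assumption made for this sketch; real planners
    use far richer models (seasonality, launches, regional trends).
    """
    return mean(history[-window:])


def plan_capacity(history, headroom=0.25, window=3):
    """Return capacity to provision: the forecast plus a safety margin."""
    predicted = forecast_demand(history, window)
    # Round up so the plan never falls below the forecast plus headroom.
    return math.ceil(predicted * (1 + headroom))


# Hypothetical demand for four previous periods, in arbitrary units.
previous_periods = [100, 120, 110, 130]
capacity = plan_capacity(previous_periods)
```

The output of `plan_capacity` is what would be handed to capacity planning as input; the headroom is the part that later shows up as slack.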

That handles unpredictability and manages resource usage on a global scale. Local fluctuations are never caught by this, and black swan events do happen. Replication and autoscaling play a crucial role here (coupled with scalable design), but the most interesting trick builds on two observations:

  • replication accounts for surges of demand
  • unused resources are overprovisioning, which is not cheap

The trick is to find workloads that can occupy all available slack resources, like a gas, BUT at a best-effort QoS. It can be... YouTube video encoding, running MapReduce jobs, etc. These jobs are compressible: delayed execution is totally fine for them.
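A minimal sketch of that idea, assuming a single pool of CPUs and a simple first-fit pass over a batch queue. The `Job` type and `fill_slack` function are hypothetical names for illustration only, and this is not how Google's scheduler is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpus: int


def fill_slack(capacity, serving_demand, batch_queue):
    """Admit best-effort jobs into whatever serving traffic leaves free.

    Serving demand always wins: batch jobs only see the leftover slack,
    and anything that does not fit is deferred rather than failed.
    """
    slack = max(capacity - serving_demand, 0)
    admitted, deferred = [], []
    for job in batch_queue:
        if job.cpus <= slack:
            admitted.append(job)
            slack -= job.cpus
        else:
            deferred.append(job)
    return admitted, deferred


queue = [Job("video-encode", 30), Job("mapreduce", 20), Job("thumbnails", 10)]

# Quiet period: serving uses 60 of 100 CPUs, so 40 CPUs of slack get filled.
quiet_admitted, quiet_deferred = fill_slack(100, 60, queue)

# Surge: serving needs 95 CPUs, so batch work is squeezed out and waits.
surge_admitted, surge_deferred = fill_slack(100, 95, queue)
```

In the quiet period the encode and thumbnail jobs run while the MapReduce job waits; during the surge everything is deferred, which is acceptable precisely because these jobs tolerate delay.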

Some details and more links to follow can be found at research.google/pubs/pub49065/, a great overview of years of Google's experience.

Andrei Babichev

Well written!