The complex technological web that supports our daily lives has grown into a vast network of distributed systems. It is especially visible in the p...
Very thoughtful article. Thanks!
My question: Could you elaborate on the role of performance isolation mechanisms in Google's fault-tolerance approach and how they ensure uninterrupted service for clients in the face of unpredictable resource usage patterns?
Hey, @oliviatencredi! Thanks for your interest, this is actually a very deep question.
I'd start with a definition of "unpredictable". If we think of probabilities as a measure of our ignorance, then for Google usage patterns are predictable in 99.9% of cases, because they (and all planet-scale operators) have an automated feedback system in place:
That is how unpredictability is handled and resource usage is managed on a global scale. Local fluctuations are never caught by it, and black swan events do happen. Replication and autoscaling play a crucial role here (coupled with scalable design), but the most interesting trick is:
The trick is to find some workloads that can occupy all available slack resources like a gas, BUT at a best-effort QoS. It can be... YouTube video encoding, running MapReduce jobs, etc. They are compressible: delayed execution is totally fine for these kinds of jobs.
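To make the "like a gas" idea concrete, here is a toy sketch in Python. It is only an illustration under my own assumptions, not how Google's schedulers are implemented: the `Machine` class, the `rebalance` method, the job names, and the core counts are all hypothetical. Serving work always gets what it uses, and best-effort jobs expand into whatever is left, then shrink when serving load spikes.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """Hypothetical machine model; capacity is measured in CPU cores."""
    capacity: float
    serving_usage: float = 0.0                  # cores used by latency-sensitive work
    batch: dict = field(default_factory=dict)   # best-effort job name -> cores granted

    def slack(self) -> float:
        # Whatever the serving workload does not use right now.
        return max(0.0, self.capacity - self.serving_usage)

    def rebalance(self, batch_demand: dict) -> dict:
        # Hand out slack to best-effort jobs in order; they never
        # take cores away from the serving workload.
        remaining = self.slack()
        self.batch = {}
        for name, wanted in batch_demand.items():
            grant = min(wanted, remaining)
            self.batch[name] = grant
            remaining -= grant
        return self.batch


m = Machine(capacity=16.0)
demand = {"video_encode": 10.0, "mapreduce": 10.0}

m.serving_usage = 4.0            # quiet period: lots of slack
quiet = m.rebalance(demand)

m.serving_usage = 14.0           # traffic spike: batch jobs get squeezed
spike = m.rebalance(demand)

print(quiet)
print(spike)
```

In the quiet period the batch jobs fill all 12 spare cores; during the spike they are compressed down to the 2 cores that remain, and the squeezed job simply finishes later. That is the whole point of best-effort QoS: the delay is absorbed by work that does not care about it.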
Some details and more links to follow can be found at research.google/pubs/pub49065/, a great overview of years of Google's experience.
Well written!