Discussion on: 🤔 Explain Distributed Storage - and how it goes down for github / uilicious / cloud / etc

Eugene Cheah • Edited

For the Amazon 2017 case it was human error, and I'm quite sure that since then they have added more safeguards to prevent the exact same mistake from happening.

However, to answer that question in more general terms ...

Typically, for half a cluster at that scale to fail, it is most commonly from either:

  • Human or code error (most common)
  • Network partition error

It's rarely a single server failing on its own.

Most of these 1K+ server systems can actually handle gradual node-by-node or zone-by-zone failure really well (roughly ~10% of the fleet), especially with an active team monitoring and adjusting things, and sufficient time between failures. For example, if I had 2 million servers (I wish) and for some reason the 0.2 million in Korea were destroyed, I could reconfigure the remaining 1.8 million to reject the failed 0.2 million and ensure they hold sufficient replicas among themselves. The rest of the world would not notice the system being down (from their fallout shelters).
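
To make the "ensure they hold sufficient replicas among themselves" step concrete, here is a minimal sketch of how a cluster might plan re-replication after losing a zone. All names (nodes, shards, the replication factor) are hypothetical and not tied to any particular system:

```python
# Sketch: after losing a zone, check every shard still has enough healthy
# replicas among the surviving nodes, and plan copies where it does not.

REPLICATION_FACTOR = 3  # illustrative target replica count

def rebalance_after_zone_loss(shards, surviving_nodes):
    """shards: dict of shard_id -> set of node ids holding a replica.
    surviving_nodes: set of node ids still reachable after the failure."""
    repair_plan = {}
    for shard_id, replica_nodes in shards.items():
        healthy = replica_nodes & surviving_nodes
        missing = REPLICATION_FACTOR - len(healthy)
        if missing > 0:
            # Pick surviving nodes that do not already hold this shard.
            candidates = sorted(surviving_nodes - healthy)
            repair_plan[shard_id] = candidates[:missing]
    return repair_plan

# Example: shard "s1" lost one of its three replicas when zone "kr" went down.
shards = {"s1": {"kr-1", "us-1", "eu-1"}, "s2": {"us-1", "us-2", "eu-2"}}
surviving = {"us-1", "us-2", "eu-1", "eu-2"}
print(rebalance_after_zone_loss(shards, surviving))
# {'s1': ['eu-2']}  -> copy shard s1 onto eu-2 to get back to 3 replicas
```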

Side note: for clusters at this scale, the majority vote may not be just 50% + 1; it is typically configured with a 60%-80% threshold.
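
As a rough illustration of that threshold idea, a quorum check might look something like this (a sketch only; the exact rule and threshold vary from system to system):

```python
# Quorum check with a configurable threshold instead of a hard-coded 50% + 1.

def has_quorum(votes_received: int, cluster_size: int, threshold: float = 0.7) -> bool:
    """Return True if enough members voted to proceed.
    threshold=0.5 behaves like a classic majority; large clusters often
    use something in the 0.6-0.8 range for extra safety margin."""
    return votes_received > cluster_size * threshold

print(has_quorum(501, 1000, threshold=0.5))  # True  (classic 50% + 1)
print(has_quorum(501, 1000, threshold=0.7))  # False (needs more than 700 votes)
```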

However, all of this works only if our networking (i.e. the internet) is working as designed, and if the above concepts are actually implemented correctly in code (sometimes it's simply bugs 😶).

Depending on where you live, your internet connection to certain parts of the world may depend on only a handful of fiber optic cables, or even just one, as ISPs may preconfigure certain IP addresses onto a fixed route.

(World map of the major fiber optic cables, for reference)

And all it takes from there is for one of these fiber choke points to either go down (which infamously happened to the entire country of Armenia), or to be misconfigured by the ISP or the data center router (which can effectively come down to a single line of configuration or code error).

The core backbone of the internet is very heavily dependent on being properly configured by every trusted party involved with it. And sometimes things happen, like Google accidentally disconnecting Japan.

Hence, when such a network failure happens, the cluster can suddenly be split into two, with half the nodes on either side of the broken link, and each side failing the other to protect its data.
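
In code terms, each half of the split typically runs the same rule: if it cannot see a quorum of its peers, it fences writes to protect its data. A minimal sketch, with an illustrative 70% threshold rather than any specific product's behaviour:

```python
# Each half of a partitioned cluster checks whether it still sees a quorum;
# a side that does not stops accepting writes to avoid diverging data.

QUORUM_THRESHOLD = 0.7  # illustrative value

def on_partition(reachable_peers: int, cluster_size: int) -> str:
    if reachable_peers > cluster_size * QUORUM_THRESHOLD:
        return "stay-active"    # keeps serving reads and writes
    return "fence-writes"       # goes read-only / fails requests

# A 1000-node cluster split roughly in half: neither side clears a 70% quorum,
# so both sides fence writes until the link is restored.
print(on_partition(520, 1000))  # fence-writes
print(on_partition(480, 1000))  # fence-writes
```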

Even then, once the connection is restored (and for misconfiguration errors, the outage might have lasted only a few minutes), to prevent data corruption the two halves of the cluster will very commonly scan and verify each replica by comparing every one of them, a process which can take hours.
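
The verification pass itself is conceptually just comparing digests of every replica and flagging the ones that disagree for repair. Real systems usually use Merkle trees or per-range checksums so they don't have to reread everything; this naive sketch (hypothetical helper names) only shows the idea:

```python
# Compare a digest of each shard's data on both halves of the formerly
# partitioned cluster, and report the shards whose copies diverged.
import hashlib

def digest(replica_bytes: bytes) -> str:
    return hashlib.sha256(replica_bytes).hexdigest()

def find_divergent_shards(half_a: dict, half_b: dict) -> list:
    """half_a / half_b: shard_id -> bytes, one dict per side of the split.
    Returns the shard ids whose copies no longer match."""
    divergent = []
    for shard_id in half_a.keys() & half_b.keys():
        if digest(half_a[shard_id]) != digest(half_b[shard_id]):
            divergent.append(shard_id)
    return divergent

a = {"s1": b"hello", "s2": b"world"}
b = {"s1": b"hello", "s2": b"w0rld"}   # s2 diverged during the split
print(find_divergent_shards(a, b))     # ['s2'] -> needs repair
```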

Cheers, hopefully this helps highlight two of the many possible reasons.