Evan Lin

Posted on • Originally published at evanlin.com on

TIL: Byzantine Generals Problem in Real-World Distributed Systems

title: [TIL][Reading Notes] Byzantine Generals Problem in the Real World of Distributed Systems (note from CloudFlare - A Byzantine failure in the real world)
published: false
date: 2021-01-15 00:00:00 UTC
tags: 
canonical_url: http://www.evanlin.com/til-byzantine-fail-cloudflare/
---

![](https://miro.medium.com/max/700/1*kJJpYLrKZ5hgByA-q3Zkjw.jpeg)

# Preface

Introductions to the Raft algorithm usually exclude Byzantine failures from the failure model. Unexpectedly, Cloudflare's incident report from last November used a real-world Byzantine failure as its title. I'll use that report to organize some thoughts.

## What is Byzantine failure

In a distributed system, nodes reach agreement through a consensus protocol: each node reports what it intends to do, or votes for a leader, and the other nodes confirm it. If a node tells some members A and other members B, so that the group cannot reach consensus or ends up in an unexpected state, the fault is called a Byzantine failure. Consensus algorithms such as Paxos and Raft usually assume that Byzantine failures do not occur, because tolerating them raises the complexity of consensus to another level.
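To make the "tells some members A and other members B" case concrete, here is a minimal sketch in Python. It is not Paxos or Raft: it assumes a toy protocol in which every honest node simply picks the most common value it received, and the names `decide` and `run_round` are hypothetical. It only shows how one equivocating node can leave honest nodes with different decisions.

```python
from collections import Counter


def decide(received):
    """Pick the most common value received; ties break alphabetically."""
    counts = Counter(received)
    best = max(counts.values())
    return min(value for value, count in counts.items() if count == best)


def run_round(honest_votes, faulty_messages):
    """Return the decision of each honest node after one voting round.

    honest_votes:    node name -> value that node broadcasts to everyone.
    faulty_messages: node name -> value the faulty node sent to that node.
    """
    decisions = {}
    for node in honest_votes:
        # Every honest node sees all honest votes (including its own) ...
        received = list(honest_votes.values())
        # ... plus whatever the faulty node chose to tell this node.
        received.append(faulty_messages[node])
        decisions[node] = decide(received)
    return decisions


honest_votes = {"n1": "A", "n2": "B"}

# The faulty node sends the same value to everyone: the group agrees.
consistent = run_round(honest_votes, {"n1": "A", "n2": "A"})

# The faulty node tells n1 "A" and n2 "B": the honest nodes disagree.
equivocating = run_round(honest_votes, {"n1": "A", "n2": "B"})

print(consistent)
print(equivocating)
```

In the second call each honest node sees a different majority, which is exactly the situation that algorithms assuming non-Byzantine nodes are not designed to handle.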

#### Reference articles:

- [Wiki Byzantine Generals Problem](https://zh.wikipedia.org/wiki/%E6%8B%9C%E5%8D%A0%E5%BA%AD%E5%B0%86%E5%86%9B%E9%97%AE%E9%A2%98)

## About Cloudflare's recovery mechanism

Before digging into the harder questions, the incident report offers another interesting angle: it shows how Cloudflare builds redundancy into the systems it maintains.

### Service backup mechanism

- Each service runs on a set of rack servers.
- Each machine is connected to two switches.
- Each rack has two or more power supplies.
- Each server stores its data on RAID 10 ([that is, RAID 1+0: striping across mirrored disks](https://en.wikipedia.org/wiki/Nested_RAID_levels#RAID_10_(RAID_1+0))).
- Each rack has at least three machines.

## The problem that occurred

![](https://blog.cloudflare.com/content/images/2020/11/image1-20.png)

(**Image explanation**: Top left is Server 1, right is Server 2, and below is Server 3, which is also the Leader)

- A network problem left Server 1 unable to reach the Leader (Server 3), while Server 2 could still reach it.
- As a result, Server 1 and Server 2 held inconsistent views of the cluster:
  - Server 1 believed that the Leader (Server 3) was offline.
  - Server 2 believed that the Leader was running normally.
- This disagreement is why Cloudflare calls the problem a Byzantine failure.
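The two conflicting views can be reproduced with a small heartbeat simulation. This is only a sketch under stated assumptions: it is not etcd's or Raft's actual implementation, the timeout value and the names `ELECTION_TIMEOUT` and `leader_views` are hypothetical, and it assumes the fault stops the Leader's heartbeats from reaching Server 1 while they still reach Server 2, which is what the two views above imply.

```python
ELECTION_TIMEOUT = 3  # hypothetical: ticks without a heartbeat before suspecting the leader


def leader_views(heartbeat_delivered, ticks):
    """Return each follower's opinion of the leader after `ticks` ticks.

    heartbeat_delivered: follower name -> True if the leader's heartbeats
    reach that follower, False if the network drops them.
    """
    last_heartbeat = {follower: 0 for follower in heartbeat_delivered}
    for now in range(1, ticks + 1):
        for follower, delivered in heartbeat_delivered.items():
            if delivered:
                # The follower hears from the leader on this tick.
                last_heartbeat[follower] = now
    views = {}
    for follower, seen_at in last_heartbeat.items():
        silent_for = ticks - seen_at
        views[follower] = (
            "leader offline" if silent_for >= ELECTION_TIMEOUT else "leader alive"
        )
    return views


# Assumption: heartbeats from Server 3 (the Leader) reach Server 2 but not Server 1.
views = leader_views({"server1": False, "server2": True}, ticks=5)
print(views)
```

The leader itself never crashed in this model. The followers disagree only because they observe it through different network paths, so Server 1 would start an election while Server 2 sees no reason to.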

# Reference

- [Cloudflare Dashboard and Cloudflare API service issues](https://www.cloudflarestatus.com/incidents/9ggr0k6dwzwg?_ga=2.204546386.37818800.1609918736-1905359649.1609918736)
- [A Byzantine failure in the real world](https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/)
- [Raft does not Guarantee Liveness in the face of Network Faults](https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/)
- [Byzantine Generals Problem - Wikipedia (Chinese)](https://zh.wikipedia.org/wiki/%E6%8B%9C%E5%8D%A0%E5%BA%AD%E5%B0%86%E5%86%9B%E9%97%AE%E9%A2%98)
- [Raft lecture (Raft user study) by Diego Ongaro](https://www.youtube.com/watch?v=YbZ3zDzDnrw)
- [The Cloudflare Blog](https://blog.cloudflare.com/)
- [Improving the Resiliency of Our Infrastructure DNS Zone](https://blog.cloudflare.com/improving-the-resiliency-of-our-infrastructure-dns-zone/)
- [Link aggregation - Wikipedia](https://en.wikipedia.org/wiki/Link_aggregation#Link_Aggregation_Control_Protocol)
- [The power of the adversary](https://decentralizedthoughts.github.io/2019-06-07-modeling-the-adversary/)
- [Pull requests · etcd-io/etcd](https://github.com/etcd-io/etcd/pulls)
- [byzantine generals problem - Google Search](https://www.google.com/search?q=byzantine+generals+problem&sxsrf=ALeKk02ykB_xPVEN1o-7eVpMRgnk0z8R5g:1610015200344&tbm=isch&source=iu&ictx=1&fir=Ykr9zvzdD0RtHM%2CQrxo5tRgIuvd0M%2C_&vet=1&usg=AI4_-kQpPeUE1vmVPMIHsWCVog2PlYnURw&sa=X&ved=2ahUKEwjOreWAzonuAhXSIqYKHewTDusQ_h16BAgXEAE#imgrc=105uZAuhRI_3BM)
- [Understanding the Byzantine Generals’ Problem (and how it affects you), by Anthony Stevens (Coinmonks, Medium)](https://medium.com/coinmonks/a-note-from-anthony-if-you-havent-already-please-read-the-article-gaining-clarity-on-key-787989107969)
- [Raft Consensus Algorithm](https://raft.github.io/)