State machine replication

Introduction

In computer science, state machine replication (SMR) or state machine approach is a general method for implementing a fault-tolerant service by replicating servers and coordinating client interactions with server replicas. The approach also provides a framework for understanding and designing replication management protocols [Wikipedia].

What is Fault-tolerant?

Fault tolerance is simply the property that guarantees the provision of service without problems even if some of the replicated copies break down.

Type of failure

There are many reasons why a server may not be able to provide service. For example, human error, natural disasters, and machine failure. These failures are referred to as Fail-Stop that simply cannot be serviced.

Apart from fail stops, there is Byzantine Failure. Byzantine failure is easier to understand if you simply imagine that a node has been hijacked. In other words, with a stop failure, if you send a message after the failure, you will not receive a reply (because you did nothing). But with a Byzantine failure, if you message it, it may send you a fake message, or it may ignore you and make it look like it is stopped. Also, multiple Byzantine nodes might cooperate to do bad things.

How many failures are tolerated?

First, consider a stop fault. Look at the image below.

This situation has N=3 and F=1. Node B is stopped. In other words, there is no response when a message is sent to node B. In this case, node A and node C can check with each other to find the correct value. Let's take the idea further. If node A stops, then node C has no way to determine if its value is correct. In other words, if N=3, we know that it will not work well if F more than 1. In summary, the required condition for assuming only stop faults is (N >= 2F + 1).

Next, we will deal with Byzantine failure. What if node B in the image above is a Byzantine failure? In that case, when node A wants to check the consistency of the values, it checks node B and node C for the values from them. However, node B may send a false message because of a Byzantine fault. In other words, node A cannot determine the value if it receives two different values. This is the same from the standpoint of node C. In other words, assuming Byzantine failure, N=3 and F=1 are not sufficient.

Look at the image below.

This situation has N=4 and F=1. This situation is also the same, with node B having a Byzantine failure and all other nodes normal. In this case, even if node A receives a false value from the Byzantine node, it can still determine which is the correct value by majority vote if it receives the correct values from the other two normal nodes. In summary, the required condition for assuming Byzantine faults is (N >= 3F + 1).

State machine Approach

The following is from [Wikipedia].

Place copies of the State Machine on multiple, independent servers.
Receive client requests, interpreted as Inputs to the State Machine.
Choose an ordering for the Inputs.
Execute Inputs in the chosen order on each server.
Respond to clients with the Output from the State Machine.
Monitor replicas for differences in State or Output.

Specific examples are shown below. The blue circles indicate the servers (replicas) that make up the state machine replicas. The arrows indicate that an update request occurred to the replicas indicated by the arrows in the order of their numbers.

In this case, if the process is successful, all updates will be performed in the same order in all replicas, as shown in the image below.

Comment

In this article, state machine replication was explained. It is easy to imagine that the number of messages exchanged between replicas would be enormous in order for all replicas to perform the update process in the same order. For this reason, research is still being conducted on protocols to reduce the communication cost of state machine replication. Examples include Fast Paxos, EPaxos, and Mencius. Please look forward to my future explanation of these as well.

Thank you for reading to the end.

DEV Community

State machine replication

Introduction

What is Fault-tolerant?

Type of failure

How many failures are tolerated?

State machine Approach

Comment

Top comments (0)

Read next

Github Actions: how to deploy application to AWS ECS and migrate database with Flyway

Presenter Pattern in Ruby on Rails: Implementation, Pros, and Cons.

Product Landing Page

test