<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sergey Fedorov</title>
    <description>The latest articles on DEV Community by Sergey Fedorov (@sergefdrv).</description>
    <link>https://dev.to/sergefdrv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1116763%2F4c5f574e-de9b-4b68-b9ce-ccef6e253bee.jpg</url>
      <title>DEV Community: Sergey Fedorov</title>
      <link>https://dev.to/sergefdrv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sergefdrv"/>
    <language>en</language>
    <item>
      <title>A Sketch of Reversible Deterministic Concurrency for Distributed Protocols</title>
      <dc:creator>Sergey Fedorov</dc:creator>
      <pubDate>Fri, 30 May 2025 17:51:15 +0000</pubDate>
      <link>https://dev.to/replica-io/a-sketch-of-reversible-deterministic-concurrency-for-distributed-protocols-6k8</link>
      <guid>https://dev.to/replica-io/a-sketch-of-reversible-deterministic-concurrency-for-distributed-protocols-6k8</guid>
      <description>&lt;p&gt;This post presents preliminary results of elaborating the idea that was &lt;a href="https://github.com/orgs/replica-io/discussions/49#discussioncomment-12436415" rel="noopener noreferrer"&gt;introduced on project's GitHub&lt;/a&gt; and which can be referred to as &lt;em&gt;reversible deterministic concurrency&lt;/em&gt;. We try to make those ideas a little more concrete and apply them to modelling some well-known distributed protocols.&lt;/p&gt;

&lt;p&gt;The original post can be found &lt;a href="https://replica-io.dev/blog/2025/05/30/a-sketch-of-reversible-deterministic-concurrency-for-distributed-protocols" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Designing, verifying, correctly implementing and later improving core mechanisms of complex distributed, decentralized systems, such as Byzantine fault tolerant consensus, is notoriously difficult and error-prone. One of the biggest challenges here is dealing with the inherently concurrent nature of distributed systems. So there is an underlying problem of structuring the inherently concurrent logic of distributed protocols that we don’t really know how to solve in a simple, flexible, and reliable way. Embarrassingly often, we approach this rather awkwardly and unsurprisingly end up with awfully complicated, obscure, and fragile code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/replica-io/replica-io/issues/47" rel="noopener noreferrer"&gt;Considering what modelling approach to initially adopt in the Replica_IO framework&lt;/a&gt;, an approach that would allow &lt;em&gt;both specifying and implementing&lt;/em&gt; complex distributed protocols in a &lt;em&gt;natural&lt;/em&gt; way, I decided to challenge the status quo and try to rethink the conventional approaches to modeling distributed systems and the ways of expressing distributed protocols. So came up the idea of reversible deterministic concurrency. The idea was inspired by a number of &lt;a href="https://github.com/replica-io/replica-io/issues/7" rel="noopener noreferrer"&gt;explored models of computation and programming models&lt;/a&gt;, as well as some principles from information theory, modern physics, reversible and quantum computing, such as conservation of information.&lt;/p&gt;

&lt;p&gt;The long-term goal is to develop this idea further into an approach for &lt;em&gt;specifying&lt;/em&gt; core mechanisms of concurrent, distributed systems, &lt;em&gt;implementing&lt;/em&gt; them in Rust code, as well as &lt;em&gt;verifying&lt;/em&gt; their correctness. The approach will be used as the foundation for the Replica_IO framework. Following this approach, it should be safe and easy to express concurrency and communication within the framework in a structured and composable way.&lt;/p&gt;

&lt;p&gt;In order to get some intuition about the idea, we'll begin by discussing some basic principles behind it. Then we'll try modelling some well-known distributed protocols following those principles, namely Bracha's reliable broadcast and the Tendermint consensus protocol.&lt;/p&gt;

&lt;p&gt;Please bear in mind that this write-up is an early sketch and that many details of the approach presented here, including terminology, will likely undergo significant changes in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach
&lt;/h2&gt;

&lt;p&gt;We conceptually think of concurrent systems as consisting of &lt;em&gt;deterministic&lt;/em&gt; components that can be connected with each other and interact &lt;em&gt;concurrently&lt;/em&gt;. The only means of interacting with a component is through its &lt;em&gt;inputs&lt;/em&gt; and &lt;em&gt;outputs&lt;/em&gt;. Each input and output conveys a certain type of data and is typed accordingly. Inputs and outputs matching in type can be connected &lt;em&gt;pairwise&lt;/em&gt;, so that data can flow through the connection. Connecting inputs and outputs of different components enables data flow between those components.&lt;/p&gt;
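&lt;p&gt;As a loose analogy, a single typed connection between a matching output and input can be pictured in Rust as a channel joining two "components"; this is only an illustrative sketch, not the framework's actual machinery:&lt;/p&gt;

```rust
use std::sync::mpsc::channel;
use std::thread;

// Two "components" joined by one typed connection: the producer's output
// feeds the consumer's matching input, and a data item flows through it.
fn connect_and_run() -> i64 {
    let (output, input) = channel::<i64>();
    let producer = thread::spawn(move || output.send(21).unwrap());
    let consumer = thread::spawn(move || input.recv().unwrap() * 2);
    producer.join().unwrap();
    consumer.join().unwrap()
}

fn main() {
    assert_eq!(connect_and_run(), 42);
    println!("ok");
}
```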

&lt;h3&gt;
  
  
  Base and composite components, extrinsic inputs and outputs
&lt;/h3&gt;

&lt;p&gt;Components can be &lt;em&gt;composite&lt;/em&gt;, i.e. consist of lower-level sub-components connected together. The remaining unconnected inputs and outputs become the inputs and outputs of the composite component. The process of decomposition bottoms out at &lt;em&gt;base components&lt;/em&gt; that are considered primitive and not decomposed further. Apart from base components, there can be &lt;em&gt;extrinsic inputs and outputs&lt;/em&gt; for exchanging information with the environment. The extrinsic inputs and outputs can represent interfaces for interacting with entities that are considered external to the modeled system, or they can be &lt;em&gt;sources&lt;/em&gt; of auxiliary data items and &lt;em&gt;sinks&lt;/em&gt; to dispose of excessive data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Input-output pairs, interaction lines, and two kinds of interaction
&lt;/h3&gt;

&lt;p&gt;Inputs and outputs can only appear in complementary pairs. This applies to inputs and outputs of base components, and therefore of composite components, as well as to extrinsic inputs and outputs. Moreover, in a complete system, every input must be connected to exactly one output matching in type. The input-output pairs connected one to another thus form chains, or &lt;em&gt;interaction lines&lt;/em&gt;. Data items flowing along different interaction lines can interact with each other by redistributing the information contained within them when they arrive at the same base component. So there are &lt;em&gt;two kinds of interaction&lt;/em&gt; within the system: passing data items from one component to another along interaction lines, and exchanging information between data items on different interaction lines within base components.&lt;/p&gt;

&lt;h3&gt;
  
  
  Determinism and reversibility
&lt;/h3&gt;

&lt;p&gt;All components are conceptually &lt;em&gt;deterministic&lt;/em&gt;: provided all the component's assumptions hold true, the output values must be completely determined by the input values and nothing else. Although the availability of output values can depend both on the availability of the input values and on the input values themselves, neither the output values nor their availability may per se depend on the relative order in which the input values become available. (It is worth noting here that this determinism is only conceptual: later we'll see how some input values may be left unspecified and, in normal operation, get chosen on the spot, possibly depending on the availability and values of other inputs.) Moreover, the components must also be &lt;em&gt;reversible&lt;/em&gt;, so that the original input values can always be recovered from the resulting output values.&lt;/p&gt;

&lt;p&gt;One can view the determinism and reversibility requirements as a consequence of making all information flow within the system explicit. If all information that is used to determine the output values is contained in the input values then the inputs uniquely determine the outputs and the computation is deterministic. Similarly, if the output values collectively contain all the information that was contained in the original input values then the inputs can be recovered from the outputs and the computation is reversible. Determinism and reversibility are dual to each other: if a computation carried out in one direction is reversible then the same computation carried out in the opposite direction is deterministic.&lt;/p&gt;

&lt;p&gt;By requiring that components be conceptually deterministic, we make non-determinism explicit and structurally evident, rather than emergent or accidental, while treating concurrency implicitly, as emerging naturally from data availability and the system’s structure rather than from control flow. This should improve modularity and facilitate &lt;em&gt;compositional reasoning&lt;/em&gt;. Together with the reversibility requirement, it ensures that there is no hidden data flow; it also enables backward reasoning, backtracking mechanisms, and &lt;em&gt;reverse debugging&lt;/em&gt;, and it would facilitate &lt;em&gt;management of resources&lt;/em&gt;, such as memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Simple components
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Sum
&lt;/h4&gt;

&lt;p&gt;We can represent components graphically. For example, the following figure depicts a component that simply adds one integer value to another:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd0l5hgn0yw885gqtibt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd0l5hgn0yw885gqtibt.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The component is represented by a rectangle divided into horizontal sections, one section for each input-output pair, i.e. interaction line. The component is labeled above by its name, &lt;code&gt;Sum&lt;/code&gt;; its interaction lines are labeled as &lt;code&gt;x: Int&lt;/code&gt; and &lt;code&gt;y: Int&lt;/code&gt;, where &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; denote the names of the interaction lines and &lt;code&gt;Int&lt;/code&gt; is the type of data items conveyed by them. The inputs are on the left side of the component whereas the outputs are on the right side. The component takes two integers as input values, denoted as &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;, along lines &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt;, and returns two integers as output values; the first input value goes through unchanged, whereas the second one becomes the sum of the two input values. The first input is required to determine both outputs, and the second input is required to determine the second output. Since this component has just two interaction lines, one of which leaves the value unchanged, we can alternatively depict it as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v91wzj6g77o62ppjgut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v91wzj6g77o62ppjgut.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is easy to see that this component is deterministic, i.e. the output values are completely determined by the input values, no matter in which order the inputs become available. It is also not hard to see that this component is reversible, i.e. the original input values can always be recovered from the resulting output values. We can depict the inverse of the component as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8nd6igzvhaymion6n2x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq8nd6igzvhaymion6n2x.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Generalized controlled swap
&lt;/h4&gt;

&lt;p&gt;Let's consider another example. The following component represents a kind of generalized controlled swap operation&lt;sup id="fnref1"&gt;1&lt;/sup&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhz459vjx4dnmnxi9jb4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhz459vjx4dnmnxi9jb4u.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;or depicted alternatively:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtz7nmhvegq5s3jsce1q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtz7nmhvegq5s3jsce1q.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It has three interaction lines: one control line, &lt;code&gt;c&lt;/code&gt;, and two target lines, &lt;code&gt;t1&lt;/code&gt; and &lt;code&gt;t2&lt;/code&gt;. The value of the control line goes through unchanged, but it determines the output values of the target lines: if the control value is true, the values of the target lines get swapped; otherwise they simply go through unchanged. The controlled swap component is its own inverse.&lt;/p&gt;
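&lt;p&gt;Again ignoring availability and treating all values as present, &lt;code&gt;CSwap&lt;/code&gt; can be sketched as a plain function:&lt;/p&gt;

```rust
// Generalized controlled swap: the control value passes through unchanged
// and determines whether the two target values are swapped.
fn cswap(c: bool, t1: i64, t2: i64) -> (bool, i64, i64) {
    if c { (c, t2, t1) } else { (c, t1, t2) }
}

fn main() {
    assert_eq!(cswap(true, 1, 2), (true, 2, 1));
    assert_eq!(cswap(false, 1, 2), (false, 1, 2));
    // Self-inverse: applying the component twice restores the inputs.
    let (c, a, b) = cswap(true, 1, 2);
    assert_eq!(cswap(c, a, b), (true, 1, 2));
    println!("ok");
}
```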

&lt;h3&gt;
  
  
  Concurrency of interaction lines
&lt;/h3&gt;

&lt;p&gt;Interaction lines are highly &lt;em&gt;concurrent&lt;/em&gt;: an output value becomes available as soon as it can be obtained from the input values that are available for the interaction. For example, in the &lt;code&gt;CSwap&lt;/code&gt; component, once the value on the control line is available, each of the target outputs remains dependent only on one of the target inputs and becomes available as soon as that input value is available, concurrently and independently of each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  Demand-driven nature of interactions
&lt;/h3&gt;

&lt;p&gt;Interactions are &lt;em&gt;demand-driven&lt;/em&gt;: certain extrinsic inputs and some inputs of base components may demand a value from the connected output, and this demand propagates further along the inputs required to determine the demanded output value until it eventually reaches available values and triggers the interactions required to satisfy the demand. Sinks are never demanding, but they will consume the value when it becomes available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unspecified values
&lt;/h3&gt;

&lt;p&gt;Certain inputs of base components may be connected to special sources of &lt;em&gt;unspecified&lt;/em&gt; values. Those unspecified values are not real; they are just placeholders that stand for concrete values. During normal operation, the component will choose appropriate concrete values for such unspecified values and act as if they were guessed correctly and provided by the environment. When debugging the system or during verification, the sources of unspecified values can be forced to provide specific values; in that case, the component has no choice but to take the forced value as is. One can think of an unspecified value as a "superposition" of all possible values of the corresponding type that will "collapse" to a concrete value when going through the base component.&lt;/p&gt;

&lt;h4&gt;
  
  
  The &lt;code&gt;Def&lt;/code&gt; component
&lt;/h4&gt;

&lt;p&gt;The following component is meant to be used with such sources of unspecified values:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq4e7pqpr1q9t30fxr3k3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq4e7pqpr1q9t30fxr3k3.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Both interaction lines of this component conceptually go through unchanged; however, if the value of the second input is unspecified, as denoted in the picture by the special source labeled with &lt;code&gt;?&lt;/code&gt;, then the component chooses it to be equal to the value of the first input. This is effectively a way to copy the value of the first input in normal operation, but the second line can be forced to an arbitrary value, e.g. to model a Byzantine failure during verification. To avoid clutter, we will often abbreviate the &lt;code&gt;Def&lt;/code&gt; component in diagrams as follows:&lt;/p&gt;
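&lt;p&gt;The behaviour of &lt;code&gt;Def&lt;/code&gt; can be sketched as follows, with a hypothetical &lt;code&gt;Source&lt;/code&gt; type standing in for a second-line source that is either unspecified or forced to a value:&lt;/p&gt;

```rust
// Hypothetical stand-in for the second-line source: either a source of
// unspecified values or a value forced by the environment.
#[derive(Debug, PartialEq)]
enum Source {
    Unspecified,
    Forced(i64),
}

// Def: both lines conceptually pass through unchanged, but an unspecified
// second value is chosen to equal the first, effectively copying it.
fn def(first: i64, second: Source) -> (i64, i64) {
    match second {
        Source::Unspecified => (first, first),
        Source::Forced(v) => (first, v),
    }
}

fn main() {
    assert_eq!(def(7, Source::Unspecified), (7, 7)); // normal operation: copy
    assert_eq!(def(7, Source::Forced(42)), (7, 42)); // forced, e.g. a modelled fault
    println!("ok");
}
```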

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fog3eokuder89vcrze1lc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fog3eokuder89vcrze1lc.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;CSwap&lt;/code&gt; with unspecified control
&lt;/h4&gt;

&lt;p&gt;The control line of the controlled swap component can also be left unspecified, e.g. as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7o6ur711qczuw10t85ie.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7o6ur711qczuw10t85ie.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;or in an abbreviated form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4xmzxy6m3jc3t7gp0kk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc4xmzxy6m3jc3t7gp0kk.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the diagram above, the control line is connected to a source of unspecified values, and the second target output is connected to a sink (which is never demanding). When there is a demand for the first target output value, &lt;code&gt;w&lt;/code&gt;, both of the target inputs are demanded; the demand for &lt;code&gt;w&lt;/code&gt; is satisfied by taking either &lt;code&gt;u&lt;/code&gt; or &lt;code&gt;v&lt;/code&gt;, whichever becomes available first, and the value of the control line is chosen accordingly. If both target inputs are available when the target output is demanded, the value of the control line is chosen to be false, i.e. the component is biased towards passing the values of the target lines through without swapping them.&lt;/p&gt;
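&lt;p&gt;The choice of the control value can be sketched as a function of target-input availability, with &lt;code&gt;Option&lt;/code&gt; standing in for availability (an illustration, not framework code):&lt;/p&gt;

```rust
// The control value is chosen so that the first target output can be
// satisfied from whichever input is available, biased towards false
// (no swap) when both are available.
fn choose_control(u: Option<i64>, v: Option<i64>) -> Option<bool> {
    match (u, v) {
        (Some(_), _) => Some(false),   // u available (possibly both): pass through
        (None, Some(_)) => Some(true), // only v available: swap it onto the first line
        (None, None) => None,          // nothing available yet; no choice made
    }
}

fn main() {
    assert_eq!(choose_control(Some(1), Some(2)), Some(false));
    assert_eq!(choose_control(None, Some(2)), Some(true));
    assert_eq!(choose_control(None, None), None);
    println!("ok");
}
```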

&lt;h3&gt;
  
  
  Synchronization
&lt;/h3&gt;

&lt;p&gt;In some cases, we need to make sure that a certain value only becomes available if some other value also becomes available, without changing the values themselves. This can be achieved with the following synchronization primitive:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcezokiyvfexcf1uu9un2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcezokiyvfexcf1uu9un2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Values &lt;code&gt;u&lt;/code&gt; and &lt;code&gt;v&lt;/code&gt; go through the bar unchanged, but only once they both become available. Thus values &lt;code&gt;u&lt;/code&gt; and &lt;code&gt;v&lt;/code&gt; on the right side mean something more than those on the left side: a value becoming available on any line on the right side implies the existence and availability of some value on the other line as well.&lt;/p&gt;
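&lt;p&gt;Modelling availability with &lt;code&gt;Option&lt;/code&gt;, the synchronization bar can be sketched as:&lt;/p&gt;

```rust
// The synchronization bar: both values pass through unchanged, but only
// once both are available.
fn sync_bar<A, B>(u: Option<A>, v: Option<B>) -> Option<(A, B)> {
    match (u, v) {
        (Some(a), Some(b)) => Some((a, b)),
        _ => None, // if either value is missing, neither is available downstream
    }
}

fn main() {
    assert_eq!(sync_bar(Some(1), None::<&str>), None);
    assert_eq!(sync_bar(Some(1), Some("x")), Some((1, "x")));
    println!("ok");
}
```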

&lt;h3&gt;
  
  
  Signals
&lt;/h3&gt;

&lt;p&gt;For similar reasons, it may also be useful to have &lt;em&gt;signalling&lt;/em&gt; interaction lines of unit type, which convey data items that can take only a single possible value. Such signal values have no meaning other than their existence, i.e. availability. Signalling lines work naturally with synchronization bars:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfgv2xobxfrmjcko1kfh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frfgv2xobxfrmjcko1kfh.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;or in an abbreviated form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ooid6jmipp3ljadnzza.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ooid6jmipp3ljadnzza.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the diagram above, values &lt;code&gt;u&lt;/code&gt; and &lt;code&gt;v&lt;/code&gt; go through unchanged, but &lt;code&gt;v&lt;/code&gt; becomes available on the right side only when &lt;code&gt;u&lt;/code&gt; becomes available. However, because there are separate synchronization bars and an additional signal line, the availability of &lt;code&gt;v&lt;/code&gt; has no effect on the availability of &lt;code&gt;u&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Delays and timers
&lt;/h3&gt;

&lt;p&gt;In order to represent time-dependent behavior of the system, we introduce the &lt;em&gt;delay&lt;/em&gt; component:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frugfukjykexpg7nqtgm1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frugfukjykexpg7nqtgm1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;or alternatively:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqj29zfmpfu3gczzr8uy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqj29zfmpfu3gczzr8uy.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Values &lt;code&gt;d&lt;/code&gt; and &lt;code&gt;v&lt;/code&gt; go through the delay component unchanged, but, in normal operation, value &lt;code&gt;v&lt;/code&gt; becomes available on the second output only after a delay of at least duration &lt;code&gt;d&lt;/code&gt;.&lt;/p&gt;
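&lt;p&gt;A deliberately naive, blocking sketch of the delay component; a real implementation would stay non-blocking, and &lt;code&gt;std::thread::sleep&lt;/code&gt; here merely illustrates the "at least &lt;code&gt;d&lt;/code&gt;" availability guarantee:&lt;/p&gt;

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Both values pass through unchanged; the second becomes available only
// after at least duration d has elapsed.
fn delay(d: Duration, v: i64) -> (Duration, i64) {
    sleep(d);
    (d, v)
}

fn main() {
    let d = Duration::from_millis(50);
    let start = Instant::now();
    let (d_out, v_out) = delay(d, 7);
    assert!(start.elapsed() >= d); // the "at least d" guarantee
    assert_eq!((d_out, v_out), (d, 7)); // values pass through unchanged
    println!("ok");
}
```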

&lt;p&gt;Using the delay component, we can construct a timer component that passes an arbitrary value through the &lt;code&gt;trigger&lt;/code&gt; line, and then, after a given amount of time provided on the &lt;code&gt;duration&lt;/code&gt; line, lets a unit value pass through the &lt;code&gt;signal&lt;/code&gt; line (which would normally be connected to a source of an immediately available unit value):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsu7a8v2jivgcumah8bhc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsu7a8v2jivgcumah8bhc.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  One-shot and composite lines
&lt;/h3&gt;

&lt;p&gt;Individual inputs and outputs, and therefore interaction lines composed thereof, are conceptually &lt;em&gt;one-shot&lt;/em&gt;, i.e. they can convey only a single data item. However, we can bundle multiple inputs/outputs together and thus construct &lt;em&gt;composite inputs/outputs and interaction lines&lt;/em&gt;. There is &lt;em&gt;no synchronization&lt;/em&gt; imposed by composing interaction lines together, i.e. each individual line remains &lt;em&gt;concurrent in composition&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We can compose two interaction lines together and decompose them back by constructing and deconstructing a pair, graphically represented as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8k9cti96acchtlk49a1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8k9cti96acchtlk49a1.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also uniformly bundle an arbitrary number of lines together as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zs7jw487yiwy4e3vbv6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1zs7jw487yiwy4e3vbv6.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the diagram above, the source labeled with the bang symbol &lt;code&gt;!&lt;/code&gt; denotes an empty source. (With just two lines being multiplexed uniformly, the notation above appears the same as the one for constructing a pure pair; in such cases, the ambiguity will be resolved by the context.)&lt;/p&gt;

&lt;p&gt;Using this notation, we can also bundle a virtually indefinite number of lines together:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fojp1hhj1to1qtfeykyfk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fojp1hhj1to1qtfeykyfk.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that the line composition primitives are not components: there is neither interaction nor synchronization between individual interaction lines caused by the composition; this is just notational convenience for dealing with multiple interaction lines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Transposition
&lt;/h3&gt;

&lt;p&gt;It may be useful to rearrange the elements within a composition. We can transpose the last two levels of a composition as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwp579193zh37a63s1vxb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwp579193zh37a63s1vxb.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The transposer connects line x&lt;sub&gt;i,j&lt;/sub&gt; from composite line x&lt;sup&gt;mn&lt;/sup&gt; on the left to line y&lt;sub&gt;j,i&lt;/sub&gt; of composite line y&lt;sup&gt;nm&lt;/sup&gt; on the right. For example, it will rearrange&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ [u, v, w], [x, y, z] ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;into&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ [u, x], [v, y], [w, z] ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
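&lt;p&gt;Over nested vectors, the transposer's effect on values can be sketched as a hypothetical helper (not framework code):&lt;/p&gt;

```rust
// Transposes the last two levels of a composition: element (i, j) of the
// input becomes element (j, i) of the output.
fn transpose(m: &[Vec<char>]) -> Vec<Vec<char>> {
    let cols = m.first().map_or(0, |row| row.len());
    (0..cols)
        .map(|j| m.iter().map(|row| row[j]).collect())
        .collect()
}

fn main() {
    // The example from the text: [[u, v, w], [x, y, z]] -> [[u, x], [v, y], [w, z]]
    let input = vec![vec!['u', 'v', 'w'], vec!['x', 'y', 'z']];
    let expected = vec![vec!['u', 'x'], vec!['v', 'y'], vec!['w', 'z']];
    assert_eq!(transpose(&input), expected);
    println!("ok");
}
```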



&lt;h3&gt;
  
  
  The &lt;code&gt;Select&lt;/code&gt; component
&lt;/h3&gt;

&lt;p&gt;Using composite interaction lines we can define a component representing a choice from an arbitrary number of options:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75tln5hh1uzqrkmghozn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75tln5hh1uzqrkmghozn.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;Select&lt;/code&gt; component, the first input, &lt;code&gt;choice&lt;/code&gt;, determines which element of the second, composite input, &lt;code&gt;options&lt;/code&gt;, becomes available on the third output, &lt;code&gt;chosen&lt;/code&gt;. The &lt;code&gt;chosen&lt;/code&gt; input is supposed to be connected to a source of unspecified values. The &lt;code&gt;choice&lt;/code&gt; input can also be left unspecified; in that case, the component chooses it so as to select the element of &lt;code&gt;options&lt;/code&gt; that becomes available first.&lt;/p&gt;
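&lt;p&gt;The value-level behaviour of &lt;code&gt;Select&lt;/code&gt; can be sketched as follows, with &lt;code&gt;Option&lt;/code&gt; standing in for the availability of each element of the composite &lt;code&gt;options&lt;/code&gt; line (an illustration, not framework code):&lt;/p&gt;

```rust
// Select: a specified choice demands exactly that option; an unspecified
// choice resolves to the first option that becomes available.
fn select(choice: Option<usize>, options: &[Option<i64>]) -> Option<(usize, i64)> {
    match choice {
        Some(i) => options.get(i).copied().flatten().map(|v| (i, v)),
        None => options
            .iter()
            .enumerate()
            .find_map(|(i, o)| o.map(|v| (i, v))),
    }
}

fn main() {
    let opts = [None, Some(20), Some(30)];
    assert_eq!(select(Some(2), &opts), Some((2, 30))); // explicit choice
    assert_eq!(select(None, &opts), Some((1, 20)));    // first available option
    println!("ok");
}
```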

&lt;h3&gt;
  
  
  Uncluttering diagrams
&lt;/h3&gt;

&lt;p&gt;To unclutter diagrams, we may depict several connections in one stroke and omit sources of unspecified values and sinks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafiw3ikxiro3bmsmd3l7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafiw3ikxiro3bmsmd3l7.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Repetitive patterns
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Replication box
&lt;/h4&gt;

&lt;p&gt;We will often need to connect one or more components in a repetitive pattern. To represent such patterns in a diagram, we can depict a single instance of the repeating part of the diagram enclosed in a &lt;em&gt;replication box&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplqzq32ximvwnulvcoz9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplqzq32ximvwnulvcoz9.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the diagram above, the component &lt;code&gt;G&lt;/code&gt; enclosed in a replication box is virtually repeated for each element of the composite lines entering and exiting the replication box.&lt;/p&gt;
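&lt;p&gt;The replication-box semantics can be sketched as mapping a component over the elements of a composite line (an illustrative analogy, not an actual implementation):&lt;/p&gt;

```python
# Illustrative sketch: a replication box virtually repeats one component G
# for each element of the composite lines entering and exiting the box.

def replicate(component, composite_inputs):
    """Apply `component` to each element of the composite input line."""
    return [component(x) for x in composite_inputs]

double = lambda x: 2 * x  # stands in for the enclosed component G
assert replicate(double, [1, 2, 3]) == [2, 4, 6]
```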

&lt;h4&gt;
  
  
  Recursive loops
&lt;/h4&gt;

&lt;p&gt;We can use the replication box to represent recursive patterns of connections as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9u7j15d288gtbf5ldnid.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9u7j15d288gtbf5ldnid.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Domains and projections
&lt;/h3&gt;

&lt;p&gt;Each interaction line belongs to a certain &lt;em&gt;domain&lt;/em&gt;, which characterizes its location. For example, we can define a separate domain for each node of a distributed system. Interaction lines cannot cross domain boundaries, but &lt;em&gt;some components may span multiple domains&lt;/em&gt; and thus enable &lt;em&gt;cross-domain interaction&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We can then derive a &lt;em&gt;projection&lt;/em&gt; of a system model onto a subset of its domains. Connections with base components that span the projection boundary will be represented as extrinsic inputs and outputs. So we can project a model of the whole distributed system separately onto each node and derive the local logic of individual nodes. The base components that represent means of communication between nodes, which are necessarily cross-domain, will be replaced by extrinsic inputs and outputs standing for gateways to the networking layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Assume-guarantee reasoning and modelling of faults
&lt;/h3&gt;

&lt;p&gt;The behavior of individual components could be specified in terms of the assumptions about their inputs and the guarantees about their outputs. This would enable &lt;em&gt;assume-guarantee reasoning&lt;/em&gt; about correctness of the system and &lt;em&gt;compositional verification&lt;/em&gt;. In the context of distributed systems, benign node faults can be modeled by suppressing some extrinsic sources belonging to the corresponding domains, whereas Byzantine faults can be modeled by forcing some of the extrinsic sources with arbitrary values ("garbage in — garbage out").&lt;/p&gt;
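&lt;p&gt;As a rough illustration, fault injection on extrinsic sources might be sketched like this (all names here are hypothetical and chosen purely for the example):&lt;/p&gt;

```python
# Hedged sketch of fault injection on an extrinsic source (illustrative
# only; a real model would act on sources within the faulty node's domain).
SUPPRESSED = object()  # marker: the source never produces a value

def apply_fault(source_value, fault=None, forced_value=None):
    """Model node faults on an extrinsic source.

    fault=None        -- correct node: the value passes through unchanged
    fault="benign"    -- crash-style fault: the source is suppressed
    fault="byzantine" -- the source is forced with an arbitrary value
    """
    if fault == "benign":
        return SUPPRESSED
    if fault == "byzantine":
        return forced_value  # garbage in -- garbage out
    return source_value

assert apply_fault("vote") == "vote"
assert apply_fault("vote", fault="benign") is SUPPRESSED
assert apply_fault("vote", fault="byzantine", forced_value="junk") == "junk"
```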

&lt;h3&gt;
  
  
  Confining interaction lines
&lt;/h3&gt;

&lt;p&gt;The total number of interaction lines in complex components can quickly become too large. Therefore, components should only expose a subset of input-output pairs sufficient to guarantee determinism and reversibility under the component's assumptions, &lt;em&gt;confining&lt;/em&gt; the remaining interaction lines by connecting them to sources and sinks inside the component. For example, in distributed protocols, the exact set of votes forming a valid quorum for certain decisions of the protocol logic often does not affect the outcomes, provided the assumptions about the maximal fraction of faulty nodes hold; the corresponding interaction lines can therefore be confined within a component representing the protocol.&lt;/p&gt;

&lt;h2&gt;
  
  
  Modelling Distributed Protocols
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bracha's Reliable Broadcast
&lt;/h3&gt;

&lt;p&gt;Now we'll try to apply the approach to modelling a relatively simple distributed protocol, namely Bracha's broadcast&lt;sup id="fnref2"&gt;2&lt;/sup&gt;. Bracha's broadcast is a foundational distributed protocol implementing a Byzantine fault-tolerant reliable broadcast mechanism in the asynchronous model. The protocol allows one party to broadcast a message such that all correct parties eventually deliver it and agree on the delivered message, provided that fewer than a third of the parties are corrupt. Being asynchronous, the protocol makes no timing assumptions but requires that every message sent by a correct process be eventually received.&lt;/p&gt;

&lt;p&gt;For the sake of simplicity, we will model a one-shot version of the protocol that only allows broadcasting a single value from a designated sender party. The protocol can be expressed in traditional event-oriented pseudo-code as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;upon broadcast(m): // only sender
    send message &amp;lt;SEND, m&amp;gt; to all

upon receiving a message &amp;lt;SEND, m&amp;gt; from the sender:
    send message &amp;lt;ECHO, m&amp;gt; to all

upon receiving n-f messages &amp;lt;ECHO, m&amp;gt; and not having sent a READY message:
    send message &amp;lt;READY, m&amp;gt; to all

upon receiving f+1 messages &amp;lt;READY, m&amp;gt; and not having sent a READY message:
    send message &amp;lt;READY, m&amp;gt; to all

upon receiving n-f messages &amp;lt;READY, m&amp;gt;:
    deliver(m)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
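&lt;p&gt;The pseudo-code above can be transcribed into a small runnable per-node state machine. This is only a sketch: sender authentication, per-sender vote deduplication, and actual message transport are omitted, and the class and field names are ours:&lt;/p&gt;

```python
# A minimal transcription of the event-oriented pseudo-code as a per-node
# state machine. Outgoing messages are collected in `outbox` instead of
# being sent over a network.
from collections import Counter

class BrachaNode:
    def __init__(self, n, f):
        self.n, self.f = n, f
        self.echoes = Counter()   # m -- number of ECHO votes received
        self.readies = Counter()  # m -- number of READY votes received
        self.sent_ready = False
        self.delivered = None
        self.outbox = []          # (tag, m) pairs this node sends to all

    def broadcast(self, m):       # only the designated sender calls this
        self.outbox.append(("SEND", m))

    def on_message(self, tag, m):
        if tag == "SEND":
            self.outbox.append(("ECHO", m))
        elif tag == "ECHO":
            self.echoes[m] += 1
            if self.echoes[m] >= self.n - self.f and not self.sent_ready:
                self.sent_ready = True
                self.outbox.append(("READY", m))
        elif tag == "READY":
            self.readies[m] += 1
            if self.readies[m] >= self.f + 1 and not self.sent_ready:
                self.sent_ready = True
                self.outbox.append(("READY", m))
            if self.readies[m] >= self.n - self.f and self.delivered is None:
                self.delivered = m

node = BrachaNode(n=4, f=1)
for _ in range(3):
    node.on_message("ECHO", "v")   # n - f = 3 echoes trigger READY
for _ in range(3):
    node.on_message("READY", "v")  # n - f = 3 readies trigger delivery
assert ("READY", "v") in node.outbox
assert node.delivered == "v"
```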



&lt;h4&gt;
  
  
  Peer-to-peer links
&lt;/h4&gt;

&lt;p&gt;We will start by modelling communication between individual nodes in the system. Each party in the protocol is represented by a node, and each node is modeled as a separate domain. For the nodes to communicate, those domains need to be able to interact through some kind of cross-domain component. So we define the following component for that purpose:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjaqfkaq40yyo7rmm27po.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjaqfkaq40yyo7rmm27po.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Link&lt;/code&gt; component allows sending an authenticated value from one node to another. If the &lt;code&gt;rcv&lt;/code&gt; input is left unspecified, as intended, the component chooses the value from the &lt;code&gt;snd&lt;/code&gt; input as the value of the &lt;code&gt;rcv&lt;/code&gt; output. The &lt;code&gt;snd&lt;/code&gt; line belongs to the message sender's domain, whereas the &lt;code&gt;rcv&lt;/code&gt; line belongs to the destination node's domain. If we treat &lt;code&gt;Link&lt;/code&gt; as a composite component, we can model it as shown on the right side of the diagram above, where the internal interaction line in the middle belongs to a separate domain and represents the communication medium.&lt;/p&gt;

&lt;h4&gt;
  
  
  Weak broadcast
&lt;/h4&gt;

&lt;p&gt;Although the underlying communication mechanism is point-to-point message exchange, a closer look reveals that the basic communication pattern used in the reliable broadcast protocol is sending the same message to all nodes. Let's refer to this pattern as weak broadcast and represent it as the following component:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12e2kin231layslwv6de.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12e2kin231layslwv6de.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;bc&lt;/code&gt; line belongs to the sending node's domain, whereas the elements of the composite &lt;code&gt;dlvr&lt;/code&gt; line belong to the corresponding receiving nodes' domains. We can model a corrupt sender equivocating (sending different messages to different receivers) by forcing some of the sources in its domain with arbitrary values. The component is mainly composed of &lt;code&gt;Link&lt;/code&gt; and &lt;code&gt;Def&lt;/code&gt; components replicated for each receiving node. The &lt;code&gt;Def&lt;/code&gt; components create an individual copy of the &lt;code&gt;bc&lt;/code&gt; value for each instance of &lt;code&gt;Link&lt;/code&gt;. Note the recursive pattern, used together with the replication box, that forms a kind of forward loop in order to thread the value from the &lt;code&gt;bc&lt;/code&gt; input through the replicated components.&lt;/p&gt;
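&lt;p&gt;The weak-broadcast pattern, including equivocation by a corrupt sender, can be sketched as follows (an illustrative model with hypothetical names, not actual framework code):&lt;/p&gt;

```python
# Illustrative sketch of the weak-broadcast pattern: the sender's value is
# copied and pushed through one Link per receiver. A corrupt sender can
# equivocate by forcing some of the per-link sources with different values.

def weak_broadcast(value, n, forced=None):
    """Return the list of values delivered at each of the n receivers."""
    forced = forced or {}
    # Each receiver's link carries its own copy of the broadcast value,
    # unless the (corrupt) sender forces that link with something else.
    return [forced.get(i, value) for i in range(n)]

# A correct sender delivers the same value everywhere.
assert weak_broadcast("m", 3) == ["m", "m", "m"]
# An equivocating sender delivers a different value to receiver 1.
assert weak_broadcast("m", 3, forced={1: "m2"}) == ["m", "m2", "m"]
```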

&lt;h4&gt;
  
  
  Quorums
&lt;/h4&gt;

&lt;p&gt;Another kind of component that we will need is for collecting quorums of values received from peer nodes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqw6fcpmczwz6y7t4xfpy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqw6fcpmczwz6y7t4xfpy.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;WeakQuorum&lt;/code&gt; and &lt;code&gt;StrongQuorum&lt;/code&gt; components are identical except for the number of votes required to reach a quorum: &lt;code&gt;f+1&lt;/code&gt; and &lt;code&gt;n-f&lt;/code&gt;, respectively. &lt;code&gt;WeakQuorum&lt;/code&gt; ensures that there is at least one vote from a correct node, whereas &lt;code&gt;StrongQuorum&lt;/code&gt; ensures that any two such quorums intersect in at least one correct node. Moreover, &lt;code&gt;StrongQuorum&lt;/code&gt; ensures that any of its quorums contains at least as many votes from correct nodes as the total number of votes required by &lt;code&gt;WeakQuorum&lt;/code&gt;. The &lt;code&gt;votes&lt;/code&gt; line represents the values received from peer nodes, whereas the &lt;code&gt;value&lt;/code&gt; line represents the value supported by a sufficient quorum of votes. If the &lt;code&gt;value&lt;/code&gt; input is left unspecified, the component chooses the value corresponding to the earliest quorum of votes becoming available; if the &lt;code&gt;value&lt;/code&gt; input is forced, the component makes the complementary output available only once a quorum becomes available on the &lt;code&gt;votes&lt;/code&gt; input.&lt;/p&gt;
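&lt;p&gt;The quorum thresholds and the properties claimed above are easy to check numerically, assuming the standard resilience bound &lt;code&gt;n &amp;gt; 3f&lt;/code&gt;:&lt;/p&gt;

```python
# A quick numeric check of the quorum thresholds, assuming n is greater
# than 3f (the standard resilience bound for Byzantine reliable broadcast).

def weak_quorum(n, f):
    return f + 1   # must contain at least one vote from a correct node

def strong_quorum(n, f):
    return n - f

n, f = 4, 1
# A weak quorum cannot consist of faulty votes only:
assert weak_quorum(n, f) > f
# Any two strong quorums overlap in at least one correct node:
assert 2 * strong_quorum(n, f) - n > f
# A strong quorum contains at least weak_quorum(n, f) correct votes:
assert strong_quorum(n, f) - f >= weak_quorum(n, f)
```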

&lt;h4&gt;
  
  
  Overall protocol model
&lt;/h4&gt;

&lt;p&gt;Now we can model Bracha's reliable broadcast protocol as a cross-domain component with an interface identical to that of &lt;code&gt;WB&lt;/code&gt; but providing the guarantees of reliable broadcast:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c1y5uucijpczkg57u6n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c1y5uucijpczkg57u6n.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this diagram, we omit sources of unspecified values and sinks. We also omit types but indicate the dimensionality of composite lines. We label the instances of the &lt;code&gt;WB&lt;/code&gt; component with the names &lt;code&gt;send&lt;/code&gt;, &lt;code&gt;echo&lt;/code&gt;, and &lt;code&gt;ready&lt;/code&gt;, indicating the kind of protocol message communicated through each of them.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;send&lt;/code&gt; component is used to send a value from the designated sender node to all nodes, whereas the &lt;code&gt;echo&lt;/code&gt; and &lt;code&gt;ready&lt;/code&gt; components are replicated for each node and represent all-to-all communication. Note that the composite &lt;code&gt;dlvr&lt;/code&gt; outputs of the &lt;code&gt;WB&lt;/code&gt; components replicated for each node sending &lt;code&gt;ECHO&lt;/code&gt; and &lt;code&gt;READY&lt;/code&gt; messages go through transposers outside of the replication boxes. Remember that individual lines of the &lt;code&gt;dlvr&lt;/code&gt; output belong to the corresponding receiving nodes' domains, but we need to connect the &lt;code&gt;votes&lt;/code&gt; inputs of the quorum components so that they all belong to the same node's domain. The transposers rearrange the composite lines to achieve that.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Select&lt;/code&gt; component, replicated for each node, represents a choice between two alternative causes for sending a &lt;code&gt;READY&lt;/code&gt; message: either receiving a strong quorum of &lt;code&gt;ECHO&lt;/code&gt; messages or receiving a weak quorum of &lt;code&gt;READY&lt;/code&gt; messages from other nodes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Consistent broadcast and modularized protocol model
&lt;/h4&gt;

&lt;p&gt;We should recognize that a part of this reliable broadcast protocol actually represents a consistent broadcast protocol. Consistent broadcast guarantees that receivers agree on the delivered values, but it does not guarantee that all correct receivers eventually deliver the value even if some correct receivers deliver. We can represent this sub-protocol as a cross-domain component:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focsyq8mfp0rr4s8u9m0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focsyq8mfp0rr4s8u9m0y.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and then restructure the reliable broadcast protocol as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nvxjr63e2c1hz5vd90y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nvxjr63e2c1hz5vd90y.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tendermint Consensus
&lt;/h3&gt;

&lt;p&gt;Now we'll try to apply the approach to a more complicated protocol and sketch a model of the Tendermint consensus protocol&lt;sup id="fnref3"&gt;3&lt;/sup&gt;. Tendermint is a Byzantine fault-tolerant sequence consensus protocol (a.k.a. total order broadcast or atomic broadcast), which allows nodes to agree on a single growing sequence of values where values can be proposed by different nodes. The protocol adopts the partially synchronous model, i.e. it assumes that eventually all messages are delivered within a certain time bound, although there might be periods of asynchrony in the system. The timing assumptions complicate modelling by adding more non-determinism, and they appear in the protocol in the form of timeouts.&lt;/p&gt;

&lt;h4&gt;
  
  
  Gossip communication
&lt;/h4&gt;

&lt;p&gt;The Tendermint protocol relies upon a gossip-based mechanism for communication between nodes: a node can broadcast a message through the gossip mechanism, and it will be delivered at all correct nodes. Moreover, if a correct node delivers a message from the gossip mechanism, then the same message is guaranteed to be delivered at all other correct nodes. We will model this communication mechanism as the following cross-domain component:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiolul6pxh3ei9sa2nf89.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiolul6pxh3ei9sa2nf89.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;bc&lt;/code&gt; line belongs to the sending node and represents the set of values broadcast through this instance of the gossip mechanism; each element of the &lt;code&gt;dlvr&lt;/code&gt; line belongs to the corresponding receiving node and represents the set of values delivered from that instance of the gossip mechanism.&lt;/p&gt;

&lt;p&gt;Individual elements in the &lt;code&gt;Set&lt;/code&gt; type are concurrent and can become available independently of each other. We can think of the &lt;code&gt;Set&amp;lt;T&amp;gt;&lt;/code&gt; type as a composition of single-bit lines, each representing the presence or absence of a particular value of &lt;code&gt;T&lt;/code&gt; in the set.&lt;/p&gt;
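&lt;p&gt;The bit-line intuition for the &lt;code&gt;Set&lt;/code&gt; type can be sketched over a finite universe of values (an illustration only; the function name is ours):&lt;/p&gt;

```python
# Sketch of the Set-type intuition: a composition of single-bit lines, one
# per possible value of the element type, each indicating the presence or
# absence of that value in the set. Each bit can become available
# independently of the others.

def set_as_bits(universe, present):
    """Encode a set over a finite universe of values as presence bits."""
    return {value: (value in present) for value in universe}

universe = ["a", "b", "c"]
assert set_as_bits(universe, {"a", "c"}) == {"a": True, "b": False, "c": True}
```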

&lt;p&gt;If the &lt;code&gt;bc&lt;/code&gt; line belongs to a correct node, then the set of values available on the &lt;code&gt;bc&lt;/code&gt; input becomes available on all elements of the &lt;code&gt;dlvr&lt;/code&gt; output that belong to correct nodes. Moreover, if any value becomes available on one correct node's element of &lt;code&gt;dlvr&lt;/code&gt;, then the same value also becomes available on all correct nodes' elements of &lt;code&gt;dlvr&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Correct nodes in the Tendermint protocol only ever broadcast a single value through each instance of the &lt;code&gt;Gossip&lt;/code&gt; component; however, Byzantine nodes may broadcast multiple values and all correct nodes would eventually deliver all of those values from that instance of the &lt;code&gt;Gossip&lt;/code&gt; component.&lt;/p&gt;

&lt;h4&gt;
  
  
  Overall protocol model
&lt;/h4&gt;

&lt;p&gt;Nodes in the Tendermint protocol decide on a single value at each &lt;em&gt;height&lt;/em&gt; in the growing sequence of values. At each height, the protocol may need one or multiple &lt;em&gt;rounds&lt;/em&gt; in order to reach agreement between nodes and determine the decision value to &lt;em&gt;commit&lt;/em&gt; at that height. Each round consists of the &lt;em&gt;proposal&lt;/em&gt;, &lt;em&gt;prevote&lt;/em&gt;, and &lt;em&gt;precommit&lt;/em&gt; stages. So we can model the overall protocol as the following cross-domain component:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovqh9lqqikhey19gby8b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovqh9lqqikhey19gby8b.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, again, we omit sources of unspecified values and sinks, and omit most of the types but indicate the dimensionality of composite lines with superscript indices.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Tendermint&lt;/code&gt; component represents the overall protocol. Its composite &lt;code&gt;value&lt;/code&gt; input represents the values that would be proposed by each node (index &lt;code&gt;n&lt;/code&gt;) in each round (index &lt;code&gt;r&lt;/code&gt;) at each height (index &lt;code&gt;h&lt;/code&gt;); the &lt;code&gt;proposer&lt;/code&gt; input represents the designated proposer for each round at each height; the &lt;code&gt;decision&lt;/code&gt; output represents the decision value of each node at each height. Finally, the composite &lt;code&gt;aux&lt;/code&gt; line represents the auxiliary lines added for each height in order to satisfy the determinism and reversibility requirements.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Height&lt;/code&gt; component represents the protocol logic for each height. Its interaction lines are mostly identical to those of the &lt;code&gt;Tendermint&lt;/code&gt; component, without the last level of composition, except that the auxiliary line is represented more explicitly. Namely, the &lt;code&gt;aux&lt;/code&gt; line of &lt;code&gt;Tendermint&lt;/code&gt; is composed of the following lines of &lt;code&gt;Height&lt;/code&gt;: &lt;code&gt;decisionRound&lt;/code&gt;, representing the round number in which each node commits the decision value, and &lt;code&gt;timeouts&lt;/code&gt; and &lt;code&gt;aborts&lt;/code&gt;, representing the timeouts and aborts that occur at each node in each round.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;Height&lt;/code&gt; component
&lt;/h4&gt;

&lt;p&gt;We model the &lt;code&gt;Height&lt;/code&gt; component as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4bvzvqga4wpz2hn1q9h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4bvzvqga4wpz2hn1q9h.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Round&lt;/code&gt; component represents a single instance of the repetitive part of the protocol, which includes the proposal, prevote, and precommit stages: a designated proposer broadcasts a proposal with a decision value to all nodes, which can then broadcast prevote and precommit votes for the proposed value. At each node, instead of voting for the proposal, the prevote and precommit stages may time out, in which case the node broadcasts a special &lt;code&gt;nil&lt;/code&gt; vote, or abort, in which case the node does not broadcast any vote in that stage of the round; the propose stage can also abort. As we'll see later, the rounds and stages are coordinated by the signals from the &lt;code&gt;pmSignals&lt;/code&gt; input. Those signals are required for the round stages to broadcast a proposal or vote, to time out, or to abort. The actual timeout and abort decisions are represented by the &lt;code&gt;timeouts&lt;/code&gt; and &lt;code&gt;aborts&lt;/code&gt; lines.&lt;/p&gt;

&lt;p&gt;The Tendermint protocol defines certain conditions that a proposal must satisfy in order to be voted for. Those conditions depend on the outcomes of the previous round, so there is a forward loop, represented by the &lt;code&gt;loopFwd&lt;/code&gt; line, that connects successive instances of the replicated &lt;code&gt;Round&lt;/code&gt; component and conveys the required values from the previous to the next round.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Commit&lt;/code&gt; component represents the decision logic of the protocol that is replicated for each node. Given the proposal and precommit votes from each round, it determines the single decision by selecting the value proposed in one of the rounds, the decision round, for which there exists a strong quorum of matching precommit votes.&lt;/p&gt;
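&lt;p&gt;The decision rule of the &lt;code&gt;Commit&lt;/code&gt; component can be sketched in Python (a hypothetical simplification: votes are tallied per round, and the first round backed by a strong quorum of precommits matching its proposal wins):&lt;/p&gt;

```python
# Hedged sketch of the Commit decision rule: scan the rounds in order and
# pick the first one whose proposal is matched by a strong quorum (n - f)
# of precommit votes for the same value.
from collections import Counter

def commit(proposals, precommits, n, f):
    """proposals:  round -- proposed value;
    precommits: round -- list of precommit votes (nil votes are None).
    Return (decision_round, decision_value), or None if no round qualifies.
    """
    for r in sorted(proposals):
        votes = Counter(v for v in precommits.get(r, []) if v is not None)
        if votes[proposals[r]] >= n - f:
            return r, proposals[r]
    return None

n, f = 4, 1
proposals = {0: "x", 1: "y"}
# Round 0 lacks a strong quorum (only 2 of the required 3 matching votes).
precommits = {0: ["x", None, "x"], 1: ["y", "y", "y"]}
assert commit(proposals, precommits, n, f) == (1, "y")
```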

&lt;p&gt;The &lt;code&gt;Pacemaker&lt;/code&gt; component coordinates the rounds and stages of the protocol by making the corresponding signals available according to the required dynamics of the protocol. We'll see those signals connected to the corresponding components in the next diagram.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;Round&lt;/code&gt; component
&lt;/h4&gt;

&lt;p&gt;We model the &lt;code&gt;Round&lt;/code&gt; component as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmtxyrke6xpe2c0iskek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmmtxyrke6xpe2c0iskek.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;Propose&lt;/code&gt;, &lt;code&gt;Prevote&lt;/code&gt;, and &lt;code&gt;Precommit&lt;/code&gt; components represent the stages of the protocol logic and determine the values of the corresponding messages to broadcast through the gossip mechanism. &lt;code&gt;Prevote&lt;/code&gt; and &lt;code&gt;Precommit&lt;/code&gt; are replicated so that there is an individual instance for each node. Note that there are separate instances of the &lt;code&gt;Gossip&lt;/code&gt; component for each node and stage, except for the proposal, which is broadcast only by the designated proposer node; those instances are replicated in the overall protocol component for each round and height. This way, there is no need to include the message tag or the height and round numbers in the values broadcast through the &lt;code&gt;Gossip&lt;/code&gt; component instances. The composite lines representing all-to-all communication go through transposers outside of the replication boxes in order to rearrange their components appropriately.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;validVR&lt;/code&gt; and &lt;code&gt;lockedVR&lt;/code&gt; lines are composite lines representing &lt;code&gt;(validValue, validRound)&lt;/code&gt; and &lt;code&gt;(lockedValue, lockedRound)&lt;/code&gt;, respectively, with values corresponding to the variables of the same names in the protocol pseudo-code as described in the original paper&lt;sup id="fnref3"&gt;3&lt;/sup&gt;. Basically, &lt;code&gt;validVR&lt;/code&gt; is used to determine whether the proposer should propose a new value or re-propose a value from one of the previous rounds, whereas &lt;code&gt;lockedVR&lt;/code&gt; determines which proposals are safe to cast a prevote for. As we can see from the diagram, the values in those lines are updated in the &lt;code&gt;Precommit&lt;/code&gt; component. The &lt;code&gt;prevoteQVals&lt;/code&gt; line is a composite line each element of which corresponds to the prevote, if any, supported by a strong quorum in one of the previous rounds. Note how the &lt;code&gt;prevoteQVals&lt;/code&gt; and &lt;code&gt;round&lt;/code&gt; lines are updated for each node in a replication box: the composite &lt;code&gt;prevoteQVals&lt;/code&gt; line is extended with an additional line connected to the corresponding output of the &lt;code&gt;Precommit&lt;/code&gt; component, whereas &lt;code&gt;round&lt;/code&gt; is simply incremented by the component labeled &lt;code&gt;+1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We will not decompose the components further since the main point was to make a sketch of how we can model the overall structure of the Tendermint protocol, namely the heights, rounds, stages, the interconnection between them, and represent the dynamics of the protocol.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We explored a possible way of applying the ideas of reversible deterministic concurrency to modelling distributed protocols. We tried to make the vague ideas a little more concrete and give a better intuition of what an approach based on those ideas may look like by considering some guiding principles, spelling out some details, introducing some primitives and patterns, and expressing all this in a graphical notation. We considered modelling distributed systems as a whole and then deriving local projections for individual nodes. We also considered how we can model node failures, both benign and Byzantine. Finally, we saw what some concrete, well-known distributed protocols may look like when modeled following this approach.&lt;/p&gt;

&lt;p&gt;We also anticipated some potential benefits of the approach, such as modularity, composability, compositional reasoning and verification, absence of hidden data flows, enhanced debugging and verification methods, resource management, representing a distributed system as a whole and then deriving the logic and implementation for its local nodes.&lt;/p&gt;

&lt;p&gt;This is just the beginning of developing those ideas into a practical solution. We'll need to see what this approach means for different aspects of designing and implementing distributed protocols. We'll need to invent some kind of textual notation and develop a clear, consistent concept codifying the core principles. We'll need to find a way of implementing those ideas in real code, examine their expressivity and limitations, and better understand the benefits and drawbacks. In other words, we need to find a good way of turning this into a solid foundation for our framework.&lt;/p&gt;

&lt;p&gt;If you like the project and find it valuable, please &lt;a href="https://github.com/sponsors/replica-io" rel="noopener noreferrer"&gt;support&lt;/a&gt; its further development! 🙏&lt;br&gt;
&lt;a href="https://github.com/sponsors/replica-io" class="crayons-btn crayons-btn--primary" rel="noopener noreferrer"&gt;❤️ Replica_IO&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;If you have any thoughts you would like to share or any questions regarding this post, please add a comment &lt;a href="https://github.com/orgs/replica-io/discussions/80" rel="noopener noreferrer"&gt;here&lt;/a&gt;. You are also welcome to &lt;a href="https://github.com/orgs/replica-io/discussions/new" rel="noopener noreferrer"&gt;start a new discussion&lt;/a&gt; or chime in on &lt;a href="https://discord.replica-io.dev/" rel="noopener noreferrer"&gt;our Discord&lt;/a&gt; server.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;Controlled swap of individual bits is known as &lt;a href="https://en.wikipedia.org/wiki/Fredkin_gate" rel="noopener noreferrer"&gt;Fredkin gate&lt;/a&gt; in &lt;a href="https://en.wikipedia.org/wiki/Reversible_computing" rel="noopener noreferrer"&gt;reversible&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Quantum_computing" rel="noopener noreferrer"&gt;quantum&lt;/a&gt; computing; here we generalize it to operate on target values of arbitrary type. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;To learn about Bracha's broadcast in detail, please refer to the &lt;a href="https://ecommons.cornell.edu/bitstream/handle/1813/6430/84-590.pdf" rel="noopener noreferrer"&gt;original paper&lt;/a&gt;, where it was introduced as a key building block of an asynchronous consensus protocol. The following post may also be helpful: &lt;a href="https://decentralizedthoughts.github.io/2020-09-19-living-with-asynchrony-brachas-reliable-broadcast/" rel="noopener noreferrer"&gt;Living with Asynchrony: Bracha's Reliable Broadcast&lt;/a&gt;, as well as &lt;a href="https://dcl.epfl.ch/site/_media/education/sdc_byzconsensus.pdf" rel="noopener noreferrer"&gt;these lecture notes&lt;/a&gt;. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;For a more precise and detailed description of the Tendermint protocol please refer to the &lt;a href="https://arxiv.org/pdf/1807.04938" rel="noopener noreferrer"&gt;original paper&lt;/a&gt;. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>distributedsystems</category>
      <category>decentralizedcomputing</category>
      <category>faulttolerance</category>
      <category>replication</category>
    </item>
    <item>
      <title>On Frameworks for Implementing Distributed Protocols</title>
      <dc:creator>Sergey Fedorov</dc:creator>
      <pubDate>Thu, 29 Aug 2024 15:33:10 +0000</pubDate>
      <link>https://dev.to/replica-io/on-frameworks-for-implementing-distributed-protocols-37a0</link>
      <guid>https://dev.to/replica-io/on-frameworks-for-implementing-distributed-protocols-37a0</guid>
      <description>&lt;p&gt;This post concludes the second phase of the &lt;a href="https://github.com/replica-io/replica-io/issues/7" rel="noopener noreferrer"&gt;state-of-the-art exploration&lt;/a&gt; in the scope of milestone &lt;a href="https://github.com/replica-io/replica-io/milestone/1" rel="noopener noreferrer"&gt;M0.1&lt;/a&gt; of the Replica_IO project, namely exploration of existing frameworks for implementing distributed protocols. It shares the main conclusions drawn from exploring 7 different frameworks.&lt;/p&gt;

&lt;p&gt;The original post can be found &lt;a href="https://replica-io.dev/blog/2024/08/27/on-frameworks-for-implementing-distributed-protocols" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A companion video is available on YouTube:&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/oRQG6EBzVe4"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploring Distributed Protocol Frameworks
&lt;/h2&gt;

&lt;p&gt;To make a real breakthrough, such as the one the Replica_IO project aims at, it is important to learn from prior attempts to deal with the problem. Having explored how real-world code bases typically implement core distributed protocols like consensus, and having summarized the findings in &lt;a href="https://replica-io.dev/blog/2024/03/04/on-implementation-of-distributed-prtocols" rel="noopener noreferrer"&gt;the previous post&lt;/a&gt;, I continued the exploration by surveying existing attempts to find a better approach to the problem, i.e. different frameworks for implementing distributed protocols.&lt;/p&gt;

&lt;p&gt;I looked at the documentation and examples provided by each of the frameworks, as well as into their implementation, in order to figure out how they model and structure distributed systems, what kind of notation is used to specify and implement distributed protocols in those frameworks, what is their approach to communication, concurrency, and composition of protocol components, and what they offer for ensuring the correctness of protocols and their implementations.&lt;/p&gt;

&lt;p&gt;After having explored each of the frameworks, I summarized and shared some of my findings. You can find those overviews on &lt;a href="https://github.com/replica-io/replica-io/wiki/State-of-the-art-exploration" rel="noopener noreferrer"&gt;this wiki page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here is the full list of 7 frameworks, based on different programming languages, that I explored&lt;sup id="fnref1"&gt;1&lt;/sup&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/consensus-shipyard/mir" rel="noopener noreferrer"&gt;Mir&lt;/a&gt; — a framework for implementing, debugging, and analyzing distributed protocols (based on Go);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/SecureSolutionsLab/Atlas" rel="noopener noreferrer"&gt;Atlas&lt;/a&gt; — a modular framework for building distributed mechanisms focused on configurability and performance (based on Rust);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/pfouto/babel-core" rel="noopener noreferrer"&gt;Babel&lt;/a&gt; — a generic framework for implementing and executing distributed protocols (based on Java);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://distal.github.io/" rel="noopener noreferrer"&gt;DiStaL&lt;/a&gt; — a framework for implementing and executing distributed protocols (based on Scala);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dzufferey/psync" rel="noopener noreferrer"&gt;PSync&lt;/a&gt; — a framework for implementing and verifying fault-tolerant distributed protocols (based on Scala);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/DistributedComponents/disel" rel="noopener noreferrer"&gt;Disel&lt;/a&gt; — a framework for implementation and compositional machine-assisted verification of distributed systems and their clients (based on Coq);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/uwplse/verdi" rel="noopener noreferrer"&gt;Verdi&lt;/a&gt; — a framework for implementing and formally verifying distributed systems (based on Coq).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the subsequent sections, I will share some of the observations and conclusions I made while exploring those frameworks. I decided to structure the discussion around the following aspects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;model&lt;/em&gt;: how distributed systems and their components are modeled;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;structure&lt;/em&gt;: how distributed protocol components are structured and composed together;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;notation&lt;/em&gt;: what kind of notation is used to specify and implement distributed protocols;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;operation&lt;/em&gt;: how distributed protocol components are executed and interact with each other;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;verification&lt;/em&gt;: how distributed protocols and their implementations are verified for correctness.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But before we go into details, I would like to note that most of those frameworks were purely academic efforts, and almost all of them seem to be abandoned now; unfortunately, they do not seem to have found practical use. Nevertheless, it was good to learn from them. Let's now dive in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model
&lt;/h2&gt;

&lt;p&gt;The way that systems and their components are modeled has a profound effect on the structure and shape of their specifications and implementation, on the operational aspects and verification for correctness. We can consider different levels of abstraction when modeling distributed systems: the high-level model of the system as a whole, the model of individual nodes within the system, as well as individual protocols and components within the nodes. Let's take a look at how the explored frameworks model distributed systems and their components.&lt;/p&gt;

&lt;p&gt;The common approach is to model distributed systems at high level as &lt;a href="https://en.wikipedia.org/wiki/Transition_system" rel="noopener noreferrer"&gt;state transition systems&lt;/a&gt; where the global system state changes upon external and internal events, such as client requests, messages exchanged within the system, and timeouts. In this model, the global system state includes the states of all individual nodes and their components, as well as the environment state. The environment state usually includes the state of the network the nodes communicate through, in particular the messages in transit. Transitioning from one state to another happens according to a global &lt;em&gt;transition function&lt;/em&gt; triggered by events. The transition function receives the current state of the system together with the triggering event and returns the new system state.&lt;/p&gt;
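&lt;p&gt;As a minimal illustration of this model, here is a sketch of a global transition function in Python. All names here are hypothetical and invented for the sketch; they are not taken from any of the frameworks:&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical minimal model: the global state combines per-node states
# with the messages currently in transit in the network.
@dataclass(frozen=True)
class SystemState:
    nodes: dict        # node id -> opaque local node state
    in_transit: tuple  # messages "in the network", as (destination, payload)

@dataclass(frozen=True)
class Deliver:
    msg: tuple         # the in-transit message being delivered

def transition(state, event, node_step):
    """Apply one delivery event: the target node's step function consumes
    the payload and returns its new local state plus newly sent messages."""
    dst, payload = event.msg
    new_local, sent = node_step(state.nodes[dst], payload)
    nodes = dict(state.nodes)
    nodes[dst] = new_local
    remaining = tuple(m for m in state.in_transit if m is not event.msg)
    return SystemState(nodes=nodes, in_transit=remaining + tuple(sent))

# Example: a trivial protocol where each delivered message increments a counter.
def counter_step(local, payload):
    return local + 1, []  # new local state, no messages sent

s0 = SystemState(nodes={"a": 0, "b": 0}, in_transit=(("a", "ping"),))
s1 = transition(s0, Deliver(s0.in_transit[0]), counter_step)
print(s1.nodes)  # {'a': 1, 'b': 0}
```

&lt;p&gt;Note how all effects, including message sending, are captured in the returned state rather than performed directly; this is what makes the model amenable to exhaustive reasoning.&lt;/p&gt;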

&lt;p&gt;The explored frameworks follow the &lt;a href="https://en.wikipedia.org/wiki/Message_passing" rel="noopener noreferrer"&gt;message-passing&lt;/a&gt; approach, where the states of individual nodes and protocol components are disjoint, i.e. different parts do not share pieces of state. Interaction between nodes happens through the network by sending and receiving messages. Sending and receiving of messages are modeled as events modifying the global system state by updating the set of messages in transit in the network state and, in case of message receiving, the state of the target component in the destination node. For example, in Disel, the system state includes a "message soup" for each protocol, which models the current state and the history of the network. In Disel's abstract model, messages are never removed from the message soup, instead they are marked either as active or consumed.&lt;/p&gt;
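&lt;p&gt;The "message soup" idea can be sketched as follows. This is an illustrative Python model with made-up names; Disel itself is a Coq framework and captures this logically rather than as a mutable data structure:&lt;/p&gt;

```python
# Illustrative model of a Disel-style "message soup": messages are never
# removed, only marked as active or consumed, so the full network history
# is retained. (Hypothetical names; not Disel's actual formalization.)
class MessageSoup:
    def __init__(self):
        self.entries = []  # [payload, status] pairs; history is kept forever

    def send(self, payload):
        self.entries.append([payload, "active"])

    def consume(self, payload):
        """Mark the first active occurrence of payload as consumed."""
        for entry in self.entries:
            if entry[0] == payload and entry[1] == "active":
                entry[1] = "consumed"
                return True
        return False

    def history(self):
        return [(p, s) for p, s in self.entries]

soup = MessageSoup()
soup.send("prepare")
soup.send("commit")
soup.consume("prepare")
print(soup.history())  # [('prepare', 'consumed'), ('commit', 'active')]
```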

&lt;p&gt;Node and network failures are modeled as special events, e.g. dropping or duplicating messages in the network, disabling normal event handling in faulty nodes or (partially) resetting their state. Some of the frameworks only consider crash faults, leaving &lt;a href="https://en.wikipedia.org/wiki/Byzantine_fault" rel="noopener noreferrer"&gt;Byzantine faults&lt;/a&gt; for future work. The system models in most of the frameworks do not seem to include timing assumptions, so they can be considered asynchronous. In contrast, PSync employs the Heard-Of model based on communication-closed rounds, which provides an illusion of simple synchronous communication on top of the partial synchrony of the actual underlying network. In this model, protocol execution proceeds in explicit rounds, alternating communication with protocol state transitions based on the set of messages received during the round. Network and node faults in this model are unified, and the network assumptions are specified in terms of the heard-of sets.&lt;/p&gt;

&lt;p&gt;Communication between nodes is predominantly modeled as fire-and-forget message delivery, where messages can be reordered or dropped by the network abstraction. Though, in Verdi, protocols are first modeled with an idealistic, reliable network semantics, which can then be translated into weaker fault models using verified system transformers. Babel allows modeling protocols in stronger communication models using communication channel abstractions, which represent communication mechanisms with different properties and guarantees.&lt;/p&gt;

&lt;p&gt;Individual components within nodes are commonly modeled as sequential, event-driven state machines interacting with each other via reliable message passing. The message passing normally follows the one-to-one one-way or request-response patterns; however, some of the frameworks also support one-to-many notifications between components. Some frameworks (e.g., Verdi) explicitly distinguish between the events that are external to the distributed protocol, like client requests, and internal events, like exchanging messages between protocol components and nodes within the distributed system.&lt;/p&gt;

&lt;p&gt;To overcome the limitations of strict state separation between protocol components in the abstract model, Disel allows coupling protocols via inter-protocol behavioral dependencies, called send-hooks, which allow restricted logical access to other protocol's state. Disel doesn't seem to strictly follow the message-passing model for protocol components within the same node, supporting generic composition of protocol components. Disel also provides mechanisms to establish stronger properties of protocols and their combinations by strengthening them with additional inductive invariants.&lt;/p&gt;

&lt;p&gt;As we can see, although there are some interesting variations and extensions, the underlying system model in the explored frameworks is largely the same, namely the one of a state transition system composed of components, which are sequential, event-driven state machines with disjoint state, interacting via message passing. This is remarkably similar to how distributed protocols are usually implemented in real-world code bases, which do not use any framework, as I described in &lt;a href="https://replica-io.dev/blog/2024/03/04/on-implementation-of-distributed-prtocols" rel="noopener noreferrer"&gt;the previous post&lt;/a&gt;. This approach tends to shift the focus more to the operational rather than functional and logical aspects of the system. There I also pointed out, when discussing how protocol implementations attempt to &lt;a href="https://replica-io.dev/blog/2024/03/04/on-implementation-of-distributed-prtocols#evading-concurrency" rel="noopener noreferrer"&gt;evade concurrency&lt;/a&gt;, that this approach may complicate the implementation and cause fragmentation of the protocol logic. Perhaps adopting the same kind of underlying model is one of the reasons why the explored frameworks do not seem to have made a real breakthrough in designing and implementing distributed protocols.&lt;/p&gt;

&lt;h2&gt;
  
  
  Structure
&lt;/h2&gt;

&lt;p&gt;In all of the explored frameworks, protocol implementations rely on some kind of runtime or shim provided by the framework, hiding low-level concerns from the implementation of the protocol logic. In most of the frameworks, the runtime is responsible for coordinating the execution of protocol components and the interaction between them; some of the frameworks also provide dedicated interfaces for communication over the network and for setting timers. Protocol components normally need to be registered within the runtime before they can function within the system.&lt;/p&gt;

&lt;p&gt;In most cases, protocol components within the same node can interact by simply sending some kind of internal messages to each other, without establishing any explicit connection. In contrast, components in Atlas require explicit orchestration of the interaction with the rest of the system. Disel instead supports composition of components expressed as effectful functional programs; it also allows loosely coupling protocol components via inter-protocol behavioral dependencies, called send-hooks, at the formal specification level.&lt;/p&gt;

&lt;p&gt;Since the runtime in Mir is only responsible for coordinating the execution of protocol components and interaction between them, there are special components that provide functionality for sending and receiving messages over the network, as well as for setting local timers and for cryptographic operations. Mir explicitly distinguishes between passive components, which can only produce events synchronously, as a result of processing input events provided by the runtime, and active components, which can asynchronously produce new events on their own.&lt;/p&gt;

&lt;p&gt;There are frameworks that provide means for enhancing protocol components with additional properties. Disel provides a protocol combinator &lt;code&gt;ProtocolWithIndInv&lt;/code&gt;, which allows elaborating a protocol by strengthening it with additional inductive invariants. Verdi provides verified system transformers, which allow transforming protocols specified, implemented, and verified in an idealistic, reliable network semantics into an equivalent implementation, preserving the transformed system properties under a weaker fault model. PSync provides a wrapper class for protocol round instances that supports updating progress conditions and synchronizing rounds in a Byzantine setting.&lt;/p&gt;

&lt;p&gt;There are two main approaches to structure interaction of distributed protocols with the rest of the application. Some frameworks provide a mechanism for interacting with protocol components by sending and receiving messages, same as protocol components interact with each other. Other frameworks allow defining dedicated interfaces for that purpose, featuring callable methods or special IO events. In some of the frameworks, protocol components are supplied with some kind of handles to trigger side effects, such as sending messages or setting up timers; in other frameworks, side effects only happen after returning the control back to the runtime, e.g. through the return value or using a monadic structure.&lt;/p&gt;

&lt;p&gt;Protocol components commonly consist of the component's state and protocol logic structured as handlers modifying the state or state transition functions, which are triggered by the runtime upon certain events or conditions. In Distal and Disel, the handlers/transitions can be augmented with guarding conditions that must hold in order to trigger the action.&lt;/p&gt;
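&lt;p&gt;To make the idea of guarded handlers concrete, here is a hypothetical upon-style rule with a guarding condition, sketched in Python. It mimics the general pattern, not the actual notation of Distal or Disel:&lt;/p&gt;

```python
# Hypothetical sketch of upon-style rules with guarding conditions, in the
# spirit of the declarative notation described above (not an actual API of
# Distal or Disel).
class Component:
    def __init__(self):
        self.state = {"votes": set(), "decided": False}
        self.rules = []  # list of (event_type, guard, action)

    def upon(self, event_type, guard):
        """Register an action to run upon event_type, but only if guard holds."""
        def register(action):
            self.rules.append((event_type, guard, action))
            return action
        return register

    def handle(self, event_type, payload):
        for etype, guard, action in self.rules:
            if etype == event_type and guard(self.state, payload):
                action(self.state, payload)

c = Component()

@c.upon("vote", guard=lambda st, p: not st["decided"])
def collect(st, p):
    st["votes"].add(p)
    if len(st["votes"]) >= 3:  # quorum of 3 votes, chosen for illustration
        st["decided"] = True

for voter in ["n1", "n2", "n2", "n3", "n4"]:
    c.handle("vote", voter)
print(c.state["decided"], sorted(c.state["votes"]))  # True ['n1', 'n2', 'n3']
```

&lt;p&gt;Here the guard keeps the handler from firing once a decision has been reached, so the late vote from &lt;code&gt;n4&lt;/code&gt; is simply ignored.&lt;/p&gt;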

&lt;p&gt;The round-based model in PSync imposes a particular structure of protocol components. Protocol components in PSync must specify a sequence of protocol rounds. Each round, in general, consists of methods to: initialize the round, send messages at the beginning of the round, process received messages, and finally update the internal state before transitioning into the next round.&lt;/p&gt;
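&lt;p&gt;This round structure can be rendered as an abstract interface, sketched here in Python with illustrative names (PSync itself is Scala-based, and its actual API differs):&lt;/p&gt;

```python
# Illustrative rendering of the four-method round structure described above.
# (PSync is Scala-based; these names are made up for the sketch.)
class Round:
    def init(self, state): ...              # prepare round-local data
    def send(self, state): ...              # message sent at the start of the round
    def process(self, state, mailbox): ...  # handle messages heard this round
    def update(self, state, result): ...    # update state before the next round

class EchoRound(Round):
    """Toy round: every node broadcasts its value and adopts the maximum heard."""
    def init(self, state):
        pass
    def send(self, state):
        return state["value"]
    def process(self, state, mailbox):
        return max(mailbox.values())
    def update(self, state, result):
        state["value"] = result

# Lock-step execution of one round across three simulated nodes,
# assuming all-to-all communication with no faults.
states = {"a": {"value": 1}, "b": {"value": 7}, "c": {"value": 3}}
rnd = EchoRound()
for st in states.values():
    rnd.init(st)
mailbox = {n: rnd.send(st) for n, st in states.items()}
for st in states.values():
    rnd.update(st, rnd.process(st, mailbox))
print([st["value"] for st in states.values()])  # [7, 7, 7]
```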

&lt;p&gt;PSync and Disel explicitly separate the protocol specification from its implementation, whereby the specification is used to formally verify the implementation. Protocol specifications in PSync consist of protocol properties, as well as safety and liveness predicates (assumptions). In order to aid automated verification in PSync, the specification should also include round and/or phase invariants; round transition relations are automatically derived from the code, although this imposes certain limitations on the code. In Disel, high-level abstract protocol specifications are defined in terms of state-space coherence predicates and send/receive transitions.&lt;/p&gt;

&lt;p&gt;The configuration of protocol components within nodes and their internal structure can be more static or dynamic. For example, in Babel, protocol components and various components' callbacks are registered within the runtime dynamically. The configuration of top-level components in Mir is rather static, for it cannot be changed after initialization, but there is a special component, called factory module, that supports creating sub-components dynamically. Such flexibility at runtime can make formal verification particularly hard, so the frameworks focused on formal verification (PSync, Disel, and Verdi) tend to be rather static with respect to the configuration and internal structure of protocol components.&lt;/p&gt;

&lt;p&gt;Overall, the abstract model adopted by a framework, e.g. the model of a generic state transition system with message-passing or the Heard-Of model based on communication-closed rounds, largely determines how protocol specifications and implementations are structured within the framework. The framework's features like formal verification can add further restrictions, e.g. restricting runtime flexibility of the configuration and internal structure of protocol components.&lt;/p&gt;

&lt;h2&gt;
  
  
  Notation
&lt;/h2&gt;

&lt;p&gt;All of the explored frameworks use the general-purpose programming language they are based on as the primary means for expressing protocol implementations. To enhance ergonomics and expressiveness, most of the frameworks introduce some notational extensions, e.g. elements of an embedded domain-specific language (eDSL).&lt;/p&gt;

&lt;p&gt;Using a regular general-purpose programming language as the basis allows tapping directly into the comprehensive features of the language, such as the type system, polymorphism, inheritance, and metaprogramming. On the other hand, implementing protocol components in regular code can produce some undesirable effects such as nondeterminism, e.g. when iterating over built-in unordered data structures, which is problematic for reproducible simulation-based testing. For that reason, Mir provides some utility functions that should be used in place of usual idiomatic code to avoid that kind of nondeterminism. Rich expressiveness of general-purpose languages can also make automatic verification difficult, e.g. PSync is quite limited in what kind of Scala constructs it supports in automatic derivation of transition relations from the code.&lt;/p&gt;
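&lt;p&gt;The nondeterministic-iteration problem and its usual fix are easy to demonstrate. The helper below is a made-up illustration of the general idea, not one of Mir's actual utility functions (Mir is written in Go, where map iteration order is deliberately randomized):&lt;/p&gt;

```python
# Iteration order over unordered collections (e.g. a set of peer ids) can vary
# between runs, which breaks reproducible simulation-based testing. A common
# fix is to always iterate in a canonical (sorted) order.
# (sorted_iter is a made-up helper name, used only for illustration.)

def sorted_iter(unordered):
    """Deterministic iteration: always visit elements in sorted order."""
    return iter(sorted(unordered))

peers = {"node-b", "node-a", "node-c"}

# Nondeterministic across runs (order depends on the hash seed):
#   for p in peers: ...
# Deterministic replacement:
visit_order = list(sorted_iter(peers))
print(visit_order)  # ['node-a', 'node-b', 'node-c']
```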

&lt;p&gt;In most cases, the notational extensions introduced by the frameworks serve to make the code more declarative and concise. One target of such enhancements is providing a convenient and clear way of expressing the typical event-oriented pseudo-code notation with &lt;code&gt;upon&lt;/code&gt; statements. For example, the protocol logic in Distal is implemented by defining rules and the corresponding actions. The rules are expressed in a declarative style resembling typical pseudo-code found in the literature: they specify event predicates, such as the message type and matching conditions, and also provide the means to specify composite events, which are triggered by collections of messages.&lt;/p&gt;

&lt;p&gt;Another area of applying notational enhancements is for expressing certain actions performed by the protocol logic and overcoming the limitations of the host language. For example, Distal provides a special notation for sending messages, discarding received messages, as well as for scheduling actions to be executed in the future. Disel extends Coq's specification language Gallina with effectful commands (actions), such as sending and receiving messages, reading from local state, monadic sequential composition, and general recursion. For message sending and receiving actions, Disel provides transition wrappers, which lift low-level operations on the networking layer to the level of well-typed program primitives corresponding to the protocol specification. Similarly, Verdi provides a monad for expressing the actions of sending messages, emitting output events, and manipulating the current state, as well as convenience notation for various monadic bindings.&lt;/p&gt;

&lt;p&gt;Finally, expressing protocol properties, assumptions, and invariants, as well as the related annotations in the protocol implementation, can also benefit from notational enhancements. PSync, for instance, defines a DSL for expressing properties, predicates, and invariants, in which one of the main primitives is the notion of domain, representing a set of values of a certain type with universal and existential quantification defined for it, as well as set comprehension. Disel features a special notation for representing the higher-order Hoare types of program fragments.&lt;/p&gt;

&lt;p&gt;Many elements of a DSL can be implemented using common programming language techniques like polymorphism, inheritance, composition, higher-order functions, and the type system features. This way, Distal implements most of its DSL as ordinary methods and convenience aliases. However, this approach has certain limitations, and such techniques as metaprogramming (macros) or code generation are often required. For instance, in Distal, the key element of the DSL is implemented as a macro; in PSync, automatic derivation of transition relations from the code is also implemented as a macro. Code generation in Mir reduces the amount of hand-written boilerplate code by processing Protobuf definitions annotated with special extensions. Disel and Verdi take advantage of Coq's syntax extension features.&lt;/p&gt;

&lt;p&gt;Clear and convenient notation for defining protocol specifications and their implementation is crucial for ergonomics and expressiveness. Building upon ordinary code, clever use of common programming language techniques, such as polymorphism, inheritance, composition, higher-order functions, and the type system features, should be the preferred approach for achieving notational expressiveness. Such techniques as metaprogramming and code generation can greatly help overcome the limitations of that approach or further improve the notation, but they can make the framework more complicated and, therefore, should be employed judiciously.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operation
&lt;/h2&gt;

&lt;p&gt;In terms of operation, the core part in most of the frameworks is some kind of runtime or engine that orchestrates the execution of protocol components and their interaction. For concurrent execution of protocol components, the runtimes mostly rely on conventional concurrency mechanisms used in the corresponding language's ecosystem, such as goroutines in Go and execution pools in Java and Scala. Atlas, notably, is based on native OS threads rather than an async Rust runtime like Tokio, presumably due to the performance overhead of the latter.&lt;/p&gt;

&lt;p&gt;There are two main approaches to implementing interaction based on message-passing between protocol components and with the networking layer: using a central event dispatching loop or through explicit channels established between components. Mir is a good example of the former approach, whereas Atlas follows the latter one. Since different protocol components normally operate asynchronously, there is often some kind of event queue or buffer placed between the components to accommodate for the asynchrony.&lt;/p&gt;

&lt;p&gt;This raises an issue of preventing unbounded growth of those queues. Mir addresses this problem by temporarily blocking the influx of external events from active modules when the amount of events buffered in internal queues exceeds certain thresholds. Atlas relies on flow control provided by bounded buffered channels for communication between components.&lt;/p&gt;
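&lt;p&gt;The flow-control idea, a bounded buffer that blocks the producer when the consumer falls behind, can be sketched with a standard bounded queue. This is an illustration of the general mechanism, not the actual Mir or Atlas code:&lt;/p&gt;

```python
import queue
import threading

# Backpressure via a bounded buffer: when the consumer falls behind, the
# producer blocks on put() instead of letting the queue grow without bound.
# (Illustrative sketch; not the actual Mir or Atlas mechanism.)
events = queue.Queue(maxsize=2)  # bounded channel between two components
consumed = []

def consumer():
    while True:
        ev = events.get()
        if ev is None:  # sentinel: shut down
            break
        consumed.append(ev)

t = threading.Thread(target=consumer)
t.start()
for i in range(10):
    events.put(i)       # blocks whenever the buffer already holds 2 events
events.put(None)
t.join()
print(len(consumed))  # 10
```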

&lt;p&gt;In general, garbage collection is an important aspect of operation in distributed protocol implementations. Sometimes it can happen automatically, e.g. in PSync, received message sets are automatically discarded by the runtime upon transitioning into a new round. However, in some cases, it requires special care. For example, in Distal, protocol components should explicitly discard automatically buffered messages that become irrelevant for further execution of the protocol, in order to avoid unbounded growth of state and slowing down the evaluation of rules. In Mir, there is an interesting pattern for performing garbage collection of a component's internal state, whereby each disposable piece of state is assigned a numerical retention index, and the component removes the pieces of state whose retention index is below a specified value upon processing a dedicated garbage collection event emitted by another component.&lt;/p&gt;
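&lt;p&gt;The retention-index pattern can be sketched as follows (hypothetical names; Mir's actual Go implementation differs):&lt;/p&gt;

```python
# Sketch of the retention-index pattern described above (hypothetical names;
# Mir's actual Go implementation differs). Each disposable piece of state is
# keyed by a numeric retention index; a garbage-collection event carries the
# lowest index that must be kept, and everything below it is dropped.
class ComponentState:
    def __init__(self):
        self.by_index = {}  # retention index -> disposable pieces of state

    def store(self, index, piece):
        self.by_index.setdefault(index, []).append(piece)

    def on_gc_event(self, keep_from):
        """Handle a GC event: drop all state with retention index below keep_from."""
        self.by_index = {i: v for i, v in self.by_index.items() if i >= keep_from}

st = ComponentState()
for view in range(5):
    st.store(view, f"certificate-for-view-{view}")
st.on_gc_event(keep_from=3)  # e.g. views 0..2 have become irrelevant
print(sorted(st.by_index))  # [3, 4]
```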

&lt;p&gt;Communication between nodes in most of the frameworks is implemented in a simplistic manner, providing best-effort message delivery, following the fire-and-forget communication style. Babel is a notable exception, since it introduces a notion of communication channel abstraction, where different channel types can represent different communication mechanisms with different properties and guarantees, e.g. more reliable message delivery, multiplexing connections, and φ-accrual failure detection.&lt;/p&gt;

&lt;p&gt;So orchestrating the execution and coordinating interaction of protocol components often require some kind of runtime provided by the framework. The runtime should take care of asynchrony, garbage collection, communication, and coordination within the node, preventing unbounded growth of internal state and ensuring good performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verification
&lt;/h2&gt;

&lt;p&gt;Not all of the explored frameworks are concerned with verifying correctness of protocols and their implementations. Verification is the primary area of focus for Disel and Verdi, as well as PSync. Verification is not a major concern in Mir, though it provides some mechanisms that can be considered as lightweight verification of protocol implementation correctness.&lt;/p&gt;

&lt;p&gt;Mir includes support for recording, inspecting, modifying, and replaying traces of events being passed by the node engine between the components, which can be very helpful for debugging, but also for performing correctness analysis. Mir also comes with a simple discrete-event simulation runtime that can be used for randomized reproducible testing with simulated time.&lt;/p&gt;

&lt;p&gt;Full-fledged formal verification of protocol specifications and implementation in Disel and Verdi relies on machine-assisted theorem proving, whereas PSync attempts to automate the process. Theorem proving, even if machine-assisted, is a very difficult, time-consuming task and requires special expertise. Automated verification, on the other hand, is in general undecidable and can only be achieved with certain restrictions to the system model and the protocol implementation.&lt;/p&gt;

&lt;p&gt;Formal verification requires formal specification of the assumptions and the required properties. In Disel, protocol specifications are defined in terms of state-space coherence predicates and send/receive transitions. In Verdi, the correct behavior of the protocol is specified as a logical predicate over its possible traces of events. Formal specifications in PSync are expressed in terms of protocol properties, expressed in a fragment of a first-order temporal logic, as well as safety and liveness predicates (assumptions), expressed in terms of cardinalities of heard-of sets.&lt;/p&gt;

&lt;p&gt;In these frameworks, the method of formally proving the correctness is typically by induction, constructing inductive state invariants. Coming up with appropriate inductive invariants constitutes the greatest difficulty in the verification effort and requires special skills. &lt;/p&gt;

&lt;p&gt;In PSync, the simple round-based structure with lock-step semantics makes protocol implementations amenable to automated verification of safety and liveness properties. The verification, though, requires providing inductive invariants at the boundaries between rounds, and the automated verification problem only becomes decidable under certain constraints.&lt;/p&gt;

&lt;p&gt;The correctness of protocol implementation in Verdi is proved directly, whereas Disel employs a different approach. For implementing protocol specification, Disel provides a DSL, embedded shallowly into Coq, that extends Coq's specification language Gallina. Disel programs and their fragments are assigned types corresponding to the protocol specification in higher-order separation-style Hoare logic. For message sending and receiving actions, Disel provides transition wrappers, which lift low-level operations on the networking layer to the level of well-typed program primitives corresponding to the protocol specification. Well-typed programs are guaranteed to be &lt;em&gt;correct by construction&lt;/em&gt; w.r.t. the protocol specifications.&lt;/p&gt;

&lt;p&gt;In Verdi, once the protocol is specified, implemented, and verified in an idealistic reliable network semantics, it can be translated using &lt;em&gt;verified system transformers&lt;/em&gt; into an equivalent implementation, preserving the transformed system properties under a weaker fault model. &lt;/p&gt;

&lt;p&gt;In Disel, protocol specifications can be generic and parameterized. The generic protocol specifications with their proven invariants can be used in composition. Disel provides mechanisms to establish stronger properties of generic protocols and their combinations by elaborating the protocol, i.e. strengthening it with additional inductive invariants.&lt;/p&gt;

&lt;p&gt;Correctness is very important for the critical fault-tolerant protocols; therefore, the verification aspect deserves special attention. There are relatively easy-to-apply lightweight methods, such as randomized reproducible testing with discrete-event simulation, and there are sophisticated formal methods typically requiring special skills and expertise, such as machine-assisted theorem proving. However, some particularly structured system models may enable automated reasoning and make formal verification more practical, though at the cost of additional limitations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Having explored these 7 frameworks for implementing distributed protocols, I found most of them not sufficiently developed for practical use, though some of the ideas and techniques employed there were worth exploring. Being purely academic efforts, most of the frameworks seem to have been abandoned after exploring some ideas and publishing the results. As of the time of this writing, Mir is perhaps the most mature, well documented, and up-to-date one among these frameworks.&lt;/p&gt;

&lt;p&gt;Focusing on some particular aspects of implementing distributed protocols, such as unifying and standardizing components, testing and debugging, notational convenience, or formal verification, while mostly neglecting the remaining aspects, may be appropriate for academic research, but this is certainly not sufficient for achieving practical adoption. Moreover, the programming languages that most of the frameworks are based on would make it hard to integrate the protocol implementations directly into larger code bases written in other languages, not to mention that having to learn a less commonly used language like Scala or Coq is an additional obstacle to adoption. Please refer to the &lt;a href="https://replica-io.dev/blog/2024/03/04/on-implementation-of-distributed-prtocols" rel="noopener noreferrer"&gt;previous post&lt;/a&gt; about practical aspects of implementing distributed protocols.&lt;/p&gt;

&lt;p&gt;Most important, perhaps, is that all of the frameworks seem to adopt essentially the same model of a state transition system and to mimic the event-oriented notation with &lt;code&gt;upon&lt;/code&gt; statements, as found in typical pseudo-code in the literature on distributed protocols. This largely impacts how protocol implementations are expressed and structured in those frameworks.&lt;/p&gt;

&lt;p&gt;I believe that if we want to make a real breakthrough in &lt;em&gt;both designing and implementing&lt;/em&gt; distributed protocols, which is what the Replica_IO project is about, then we should first of all be innovative, challenge the status quo, and think holistically, taking into account all relevant aspects. Trying to rethink the conventional distributed system model and the way of expressing distributed protocols would be a good start.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;Having explored some distributed protocol implementations and frameworks for implementing distributed protocols, it is now a good time to clearly state the problems in designing and implementing distributed protocols to focus on for the rest of milestone &lt;a href="https://github.com/replica-io/replica-io/milestone/1" rel="noopener noreferrer"&gt;M0.1&lt;/a&gt; and start gathering ideas on how we could approach them. Apart from that, there are still some potentially related concepts, approaches, and techniques worth looking into as part of the initial state-of-the-art exploration. The exploration tasks are tracked in the scope of &lt;a href="https://github.com/replica-io/replica-io/issues/7" rel="noopener noreferrer"&gt;this issue&lt;/a&gt; on GitHub.&lt;/p&gt;

&lt;p&gt;Once the initial exploratory stage is over, it will be time to come up with key ideas concerning core principles that will guide the process of designing and implementing generic components within the framework (milestone &lt;a href="https://github.com/replica-io/replica-io/milestone/1" rel="noopener noreferrer"&gt;M0.1&lt;/a&gt;). Then those ideas will be developed into clearly formulated concepts (milestone &lt;a href="https://github.com/replica-io/replica-io/milestone/2" rel="noopener noreferrer"&gt;M0.2&lt;/a&gt;), and their feasibility will be verified with code (milestone &lt;a href="https://github.com/replica-io/replica-io/milestone/3" rel="noopener noreferrer"&gt;M0.3&lt;/a&gt;). After that, prototype, MVP, and production versions of the framework will be developed and released (milestones &lt;a href="https://github.com/replica-io/replica-io/milestone/4" rel="noopener noreferrer"&gt;M1&lt;/a&gt;, &lt;a href="https://github.com/replica-io/replica-io/milestone/5" rel="noopener noreferrer"&gt;M2&lt;/a&gt;, and &lt;a href="https://github.com/replica-io/replica-io/milestone/6" rel="noopener noreferrer"&gt;M3&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;This does not mean at all that exploration, ideation, and prototyping will not take place at later stages; the milestones simply define the framework's general level of maturity. The framework will continuously evolve and expand, and at some point it will become a de facto standard for implementing critical fault-tolerant systems, providing a growing collection of easy-to-use, reliable, and efficient distributed replication mechanisms.&lt;/p&gt;

&lt;p&gt;If you like the project and find it valuable, please &lt;a href="https://github.com/sponsors/replica-io" rel="noopener noreferrer"&gt;support&lt;/a&gt; its further development! 🙏&lt;br&gt;
&lt;a href="https://github.com/sponsors/replica-io" class="ltag_cta ltag_cta--branded" rel="noopener noreferrer"&gt;❤️ Replica_IO&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;If you have any thoughts you would like to share or any questions regarding this post, please add a comment &lt;a href="https://github.com/orgs/replica-io/discussions/43" rel="noopener noreferrer"&gt;here&lt;/a&gt;. You are also welcome to &lt;a href="https://github.com/orgs/replica-io/discussions/new" rel="noopener noreferrer"&gt;start a new discussion&lt;/a&gt; or chime in on &lt;a href="https://discord.replica-io.dev/" rel="noopener noreferrer"&gt;our Discord&lt;/a&gt; server.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;If you know of some other framework for implementing distributed protocols that I should have looked into, please let me know. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>distributedsystems</category>
      <category>decentralizedcomputing</category>
      <category>faulttolerance</category>
      <category>replication</category>
    </item>
    <item>
      <title>On Implementation of Distributed Protocols</title>
      <dc:creator>Sergey Fedorov</dc:creator>
      <pubDate>Fri, 05 Apr 2024 11:10:27 +0000</pubDate>
      <link>https://dev.to/replica-io/on-implementation-of-distributed-protocols-266c</link>
      <guid>https://dev.to/replica-io/on-implementation-of-distributed-protocols-266c</guid>
      <description>&lt;p&gt;This post concludes the first phase of the &lt;a href="https://github.com/replica-io/replica-io/issues/7"&gt;state-of-the-art exploration&lt;/a&gt; in the scope of milestone &lt;a href="https://github.com/replica-io/replica-io/milestone/1"&gt;M0.1&lt;/a&gt; of the &lt;a href="https://replica-io.dev/"&gt;Replica_IO&lt;/a&gt; project, namely exploration of selected notable distributed protocol implementations. It shares the main conclusions drawn from exploring 14 different code bases and outlines the key areas of focus for the next steps developing the Replica_IO framework.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The original post can be found &lt;a href="https://replica-io.dev/blog/2024/03/04/on-implementation-of-distributed-prtocols"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A companion video is available on YouTube: &lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Q6wW8NqtpGw"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploring Distributed Protocol Implementations
&lt;/h2&gt;

&lt;p&gt;I believe that discovering neat, yet practical, solutions to complicated problems demands serious, deliberate preparation. Clearly, before being able to come up with such solutions, one first needs to acquire a deep understanding of the problem and identify the relevant aspects and requirements. It is also important to learn from prior attempts to deal with the problem. Otherwise, it would be naive to expect any significant advancement beyond the status quo.&lt;/p&gt;

&lt;p&gt;Since Replica_IO aims at making a breakthrough in designing and implementing distributed protocols, I decided to start exploring the state of the art by selecting and looking into a number of notable distributed protocol implementations. Although I already had experience implementing such protocols myself, I decided to dive in and see for myself how others had approached this challenge. I wanted to learn from those projects, better understand the typical requirements and difficulties coming up in real-world use cases, and, perhaps, discover some interesting techniques or ideas along the way, as well as to identify the key areas of focus for the next steps.&lt;/p&gt;

&lt;p&gt;So I onboarded myself into one code base after the other, as if I were to work on it. I was focused on the general structure of code, node-to-node communication mechanisms, the implementation details of the core protocols ensuring consistency between nodes, as well as mechanisms for monitoring and controlling execution of the protocol. I tried my best to understand &lt;em&gt;how&lt;/em&gt; those protocols are structured and implemented. After having explored each of the code bases, I summarized and shared some of my findings. You can find those overviews on &lt;a href="https://github.com/replica-io/replica-io/wiki/State-of-the-art-exploration"&gt;this wiki page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here is the full list of the code bases, written in different programming languages, that I explored&lt;sup id="fnref1"&gt;1&lt;/sup&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://github.com/tendermint/tendermint"&gt;Tendermint Core&lt;/a&gt; / &lt;a href="https://github.com/cometbft/cometbft"&gt;CometBFT&lt;/a&gt; — a state machine replication engine (written in Go);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/etcd-io/raft"&gt;etcd Raft&lt;/a&gt; — a library for maintaining replicated state machines (written in Go);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/aptos-labs/aptos-core/tree/aptos-cli-v1.0.13/consensus"&gt;AptosBFT&lt;/a&gt; — a consensus component supporting state machine replication in the Aptos blockchain (written in Rust);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/bft-smart/library"&gt;BFT-SMaRt&lt;/a&gt; — a library implementing BFT-SMaRt, a state machine replication system (written in Java);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/SmartBFT-Go/consensus"&gt;SmartBFT-Go&lt;/a&gt; — a library implementing state machine replication inspired by BFT-SMaRt (written in Go);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/paritytech/substrate"&gt;Substrate&lt;/a&gt; — a framework for building application-specific blockchains (written in Rust);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sigp/lighthouse"&gt;Lighthouse&lt;/a&gt; — an Ethereum consensus client (written in Rust);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/algorand/go-algorand"&gt;Algorand&lt;/a&gt; — a blockchain based on the Algorand consensus protocol (written in Go);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/ava-labs/avalanchego"&gt;Avalanche&lt;/a&gt; — a blockchain platform based on the Avalanche consensus protocol (written in Go);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dfinity/ic"&gt;Internet Computer blockchain&lt;/a&gt; (ICP) — a general-purpose blockchain system developed by the DFINITY Foundation (written in Rust);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/MystenLabs/sui"&gt;Sui&lt;/a&gt; — a smart contract platform based on Narwhal and Bullshark protocols (written in Rust);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/apache/zookeeper"&gt;Apache ZooKeeper&lt;/a&gt; — a distributed coordination, synchronization, and configuration service (written in Java);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/apache/kafka"&gt;Apache Kafka&lt;/a&gt; — a distributed event streaming platform implementing a variant of the Raft consensus protocol (written in Java, integrated with Scala);&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/input-output-hk/cardano-node"&gt;Cardano&lt;/a&gt; — a blockchain platform based on the Ouroboros family of consensus protocols (written in Haskell).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the subsequent sections, I will share with you some of the observations and conclusions I made while exploring those code bases. I decided to structure the discussion around the following aspects of implementing distributed protocols:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;complexity&lt;/em&gt;: what makes distributed protocols hard to reason about and implement;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;correctness&lt;/em&gt;: how to ensure that the implementation guarantees the required properties;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;resource utilization&lt;/em&gt;: how to prevent ineffective expenditure of limited computing resources;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;maintainability&lt;/em&gt;: how to manage long-running distributed systems and diagnose issues;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;flexibility&lt;/em&gt;: how to achieve high adaptability, reusability, and evolvability of code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Complexity
&lt;/h2&gt;

&lt;p&gt;Distributed, fault-tolerant protocols are notoriously hard to implement, and there are justifiable reasons for that. This is primarily because that kind of system consists of largely independent nodes communicating through a potentially unstable, unreliable network; some of the nodes may fail in different ways. The protocol is required to tolerate, within certain bounds, such unfavorable conditions and keep working reliably. More than that, it is supposed to deliver decent performance using limited resources. All this adds a great deal of &lt;em&gt;inherent, essential complexity&lt;/em&gt;, which we simply cannot remove without weakening our requirements.&lt;/p&gt;

&lt;p&gt;However, when it comes to actually designing and implementing these protocols, there is also another kind of complexity: &lt;em&gt;incidental, non-essential complexity&lt;/em&gt;. This kind of complexity, though being closely related, does not strictly belong to the problem. We incidentally introduce it because we are not aware of or fail to recognize a simpler way of solving the problem at hand.&lt;/p&gt;

&lt;p&gt;Incidental complexity can start creeping in when trying to understand and interpret a protocol specification&lt;sup id="fnref2"&gt;2&lt;/sup&gt;, which is often too far from the realities of software engineering. Simply the way a protocol is specified can misguide the engineer trying to implement it and induce all sorts of difficulties. For example, pseudo-code in scientific papers is often defined in terms of global, unstructured variables and omits concurrency issues.&lt;/p&gt;

&lt;p&gt;When implementing a distributed protocol in one of the conventional programming languages, chances are that the implementation will simply employ some general techniques commonly used in that language's ecosystem. Such general techniques may be very powerful and universal, but by freely using this unconstrained power and flexibility, we can easily end up with a code base that is very hard to understand and maintain. For example, dealing with concurrency and synchronization using low-level primitives in the implementation of high-level protocol logic clutters the code and multiplies the complexity.&lt;/p&gt;

&lt;p&gt;Haste is another great source of incidental complexity. There is always temptation to cut corners, especially when under time pressure. Imprudently copying approaches from elsewhere, adding temporary workarounds and ad hoc patches makes code entangled and poorly structured.&lt;/p&gt;

&lt;p&gt;Using advanced features and sophisticated techniques can also add unnecessary complexity. This is ambivalent, though: advanced machinery can actually help to express the implementation more conveniently and simplify reasoning about it, but only when it is hidden behind a simple, clear, and easy-to-use interface.&lt;/p&gt;

&lt;p&gt;It is pretty clear that introducing additional complexity is generally bad. But does it really matter? Couldn't we just implement the thing somehow, test it well, and simply tolerate the additional complexity? Well, surely, with rigorous testing, we can be sufficiently confident that our implementation is correct. However, in that case, making a small change, e.g. applying a simple fix to address a major issue discovered later, can reportedly&lt;sup id="fnref3"&gt;3&lt;/sup&gt; take months of work. So it would be very hard to further improve, adapt, or reuse such an implementation.&lt;/p&gt;

&lt;p&gt;We need a structured, yet flexible enough, approach guiding us away from incidental complexity if we wish to avoid wasted effort and foster innovation in the field. Let's look in more detail at the sources of complexity in implementations of distributed protocols.&lt;/p&gt;

&lt;h3&gt;
  
  
  Modularity
&lt;/h3&gt;

&lt;p&gt;We can deal with a complex problem, such as implementing a distributed protocol, by dividing it into smaller, simpler problems, solving them individually, and then combining the solutions to finally address the original problem. This is, basically, what modularity is about. In this process, it is crucial how we divide the problem, what kind of pieces we get, and how we combine them back together.&lt;/p&gt;

&lt;h4&gt;
  
  
  Granularity
&lt;/h4&gt;

&lt;p&gt;First of all, modularity comes in different levels of &lt;em&gt;granularity&lt;/em&gt;. When implementing a large component, such as a state machine replication engine, we can define its external dependencies, then split the component into several chunks of functionality and stop there. That is certainly better than having to deal with a complete monolith, but this level of modularity would still be too &lt;em&gt;coarse&lt;/em&gt;. Instead, we can continue decomposing the sub-components further until we end up with reasonably small and simple, yet non-trivial, components; this is &lt;em&gt;fine&lt;/em&gt; modularity.&lt;/p&gt;

&lt;p&gt;All of the explored code bases exhibit some level of modularity. It is quite common to separate concerns by delegating pieces of functionality to external components. This way, most of the code bases clearly separate implementation of the protocol logic from such functionality as communication between nodes, producing and verifying cryptographic signatures, persistent storage, executing transactions on the replicated state, etc. Most of the implementations also separate dispatching of events, such as inbound messages, from their handling; there is typically a component responsible for classifying events and a number of components responsible for handling specific event types. Quite often there are separate components implementing the protocol logic specific to different roles that a node can play, e.g. leader and follower, or modes of operation, e.g. synchronization and normal operation. It is also common to separate different logical stages of the protocol, e.g. creating a proposal, validating a proposal, finalizing the decision. Another common pattern is to have a separate class of component responsible for maintaining state for each of the remote peers the node communicates with.&lt;/p&gt;
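&lt;p&gt;The separation of event dispatching from event handling mentioned above can be sketched as follows (a Python sketch with hypothetical event types and handlers): one component classifies events, while handlers registered per event type contain the actual protocol logic.&lt;/p&gt;

```python
class Dispatcher:
    """Classifies events by type and routes them to registered handlers,
    separating dispatch from the protocol logic that handles each event."""

    def __init__(self):
        self.handlers = {}

    def register(self, event_type, handler):
        self.handlers.setdefault(event_type, []).append(handler)

    def dispatch(self, event):
        # Route the event to every handler registered for its type.
        results = []
        for handler in self.handlers.get(event["type"], []):
            results.append(handler(event))
        return results

# Hypothetical handlers for two event types.
def on_proposal(event):
    return f"validated proposal {event['id']}"

def on_vote(event):
    return f"counted vote from {event['sender']}"

dispatcher = Dispatcher()
dispatcher.register("proposal", on_proposal)
dispatcher.register("vote", on_vote)
```

&lt;p&gt;Events of an unknown type simply fall through; a real implementation would typically log or reject them.&lt;/p&gt;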

&lt;p&gt;Some implementations go further and introduce smaller components, e.g. encapsulating the state of each individual proposal or representing the logic of counting votes and determining if there is a sufficient quorum. Nevertheless, there still remain components that are too complicated and hard to follow, so this modularity cannot be considered fine. To combat complexity, we need to learn how to achieve fine modularity.&lt;/p&gt;
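&lt;p&gt;As an example of such a smaller component, here is a minimal sketch (in Python, with illustrative names) of the vote-counting logic: it deduplicates votes per voter and reports a decision once a quorum of 2f+1 out of n ≥ 3f+1 replicas agrees on a value.&lt;/p&gt;

```python
class QuorumTracker:
    """Counts distinct votes for a value and reports whether a quorum
    of 2f+1 out of n (with n at least 3f+1) replicas has been reached."""

    def __init__(self, n_replicas, f_faulty):
        assert n_replicas >= 3 * f_faulty + 1, "need n at least 3f+1"
        self.quorum = 2 * f_faulty + 1
        self.votes = {}  # value -> set of voter ids

    def add_vote(self, voter, value):
        # A set per value deduplicates repeated votes from the same voter.
        self.votes.setdefault(value, set()).add(voter)

    def decided(self):
        # Return the decided value once some value reaches a quorum.
        for value, voters in self.votes.items():
            if len(voters) >= self.quorum:
                return value
        return None
```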

&lt;h4&gt;
  
  
  Decomposition
&lt;/h4&gt;

&lt;p&gt;As one can cut bricks at different angles, so one can decompose components into sub-components in different ways. One way is to focus on the &lt;em&gt;operational&lt;/em&gt; aspects, i.e. on how the pieces of implementation are going to be executed. With this approach, components would be primarily organized around actual data and control flow. This has a profound effect on the structure of the implementation.&lt;/p&gt;

&lt;p&gt;Focusing on the operational aspects, protocol implementations will tend to be represented as stateful components, or as collections of stateful components, reacting to external events. This naturally leads to applying the object-oriented approach to structuring the whole implementation, in which the protocol logic is mostly expressed as modifying pieces of a component's state in response to &lt;em&gt;individual&lt;/em&gt; inbound events and optionally producing new outbound events.&lt;/p&gt;

&lt;p&gt;Although pieces of functionality tend to imply some state, individual events in a distributed system mostly happen as logical consequences of some other, causally related events in the scope of a larger &lt;em&gt;distributed&lt;/em&gt; process. Thus structuring the implementation around event handling might &lt;em&gt;not&lt;/em&gt; help to clearly express the overall protocol logic.&lt;/p&gt;

&lt;p&gt;The way we approach decomposition also greatly impacts such properties of code as &lt;a href="https://en.wikipedia.org/wiki/Coupling_(computer_programming)"&gt;&lt;em&gt;coupling&lt;/em&gt;&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Cohesion_(computer_science)"&gt;&lt;em&gt;cohesion&lt;/em&gt;&lt;/a&gt;, i.e. the degree of interdependence between different components and the strength of relationship between the elements inside components, respectively. Loose coupling and high cohesion are generally desirable.&lt;/p&gt;

&lt;p&gt;Failing to recognize the significance of implicit logical connections and properly express them often causes a higher degree of coupling between components, i.e. entanglement. It is particularly important to distinguish essential from incidental complexity here. Sometimes complications, such as &lt;em&gt;circular dependencies&lt;/em&gt;, may occur naturally and represent essential features of the protocol logic, e.g. &lt;em&gt;recursiveness&lt;/em&gt;. For example, some internal events should often be treated, for the most part, the same way as equivalent external events. This can be achieved by looping those events back for handling in the protocol implementation.&lt;/p&gt;

&lt;p&gt;Organizing components in a more structured way helps to manage dependencies between them, e.g. they can be arranged in &lt;em&gt;layers&lt;/em&gt; or &lt;em&gt;hierarchically&lt;/em&gt;, chained together in &lt;em&gt;pipelines&lt;/em&gt;, etc. For example, in the Algorand implementation, the core logic of the consensus protocol is structured as a hierarchical state machine. The layered approach is well exemplified in the &lt;a href="https://docs.rs/tower/0.4.12/tower/"&gt;&lt;code&gt;tower&lt;/code&gt;&lt;/a&gt; networking library, which is used by the Sui implementation. In the Apache ZooKeeper implementation, client requests are processed using a pipeline of request processing components chained together.&lt;/p&gt;
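&lt;p&gt;A pipeline of chained request processors, in the spirit of Apache ZooKeeper's request-processing chain, might look like this minimal Python sketch (the stage names are made up):&lt;/p&gt;

```python
class Processor:
    """One stage in a request-processing pipeline; each stage transforms
    the request and hands it to the next stage in the chain."""

    def __init__(self, name, transform, next_processor=None):
        self.name = name
        self.transform = transform
        self.next = next_processor

    def process(self, request):
        request = self.transform(request)
        if self.next is not None:
            return self.next.process(request)
        return request

def build_pipeline(stages):
    """Chain stages back to front so each one points at its successor."""
    nxt = None
    for name, transform in reversed(stages):
        nxt = Processor(name, transform, nxt)
    return nxt

# Hypothetical stages, each marking the request as it passes through.
pipeline = build_pipeline([
    ("validate", lambda r: {**r, "valid": True}),
    ("log",      lambda r: {**r, "logged": True}),
    ("commit",   lambda r: {**r, "committed": True}),
])
```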

&lt;p&gt;The amount of mutable internal state maintained by components also matters. Making components more static can often help to simplify the implementation. For example, in the Sui implementation, most of the consensus-specific components are static in terms of consensus configuration, i.e. instead of supporting reconfiguration directly in those components, they are simply recreated upon reconfiguration.&lt;/p&gt;
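&lt;p&gt;The recreate-instead-of-reconfigure idea can be sketched like this (in Python; the &lt;code&gt;Committee&lt;/code&gt; component is hypothetical): the component is immutable with respect to its configuration, and reconfiguration simply produces a fresh instance for the next epoch.&lt;/p&gt;

```python
class Committee:
    """A component that is static with respect to its configuration;
    reconfiguration creates a fresh instance for the next epoch instead
    of mutating this one."""

    def __init__(self, epoch, validators):
        self.epoch = epoch
        self.validators = frozenset(validators)  # fixed for the lifetime

    def reconfigure(self, new_validators):
        # Return a new component for the next epoch; self stays unchanged.
        return Committee(self.epoch + 1, new_validators)
```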

&lt;p&gt;Many components require certain &lt;em&gt;context&lt;/em&gt; or &lt;em&gt;environment&lt;/em&gt;, i.e. they depend on some common piece of state or functionality like information about prior communication with the remote peer, access to persistent storage, diagnostic logging, etc. This is usually accomplished by capturing references to the environment inside the component or passing it explicitly. In functional programming, one can represent the environment with a &lt;a href="https://en.wikipedia.org/wiki/Monad_(functional_programming)"&gt;monadic&lt;/a&gt; interface. Some programming languages provide special features for that purpose, e.g. &lt;a href="https://docs.scala-lang.org/tour/implicit-parameters.html"&gt;contextual parameters in Scala&lt;/a&gt;. Interacting with the context from within a component should be convenient but clearly constrained by the component's interface.&lt;/p&gt;
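&lt;p&gt;One simple way to make such a context explicit, sketched here in Python with made-up facilities, is to pass an environment object into the component, so that every interaction with shared state goes through a visible, constrained dependency:&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class Environment:
    """Explicit context a component depends on: here just a log sink and
    a key-value store standing in for persistent storage."""
    log: list = field(default_factory=list)
    storage: dict = field(default_factory=dict)

class PeerSession:
    """A component that interacts with shared facilities only through
    the environment handed to it, keeping the dependency visible."""

    def __init__(self, peer_id, env):
        self.peer_id = peer_id
        self.env = env

    def record_message(self, msg):
        self.env.log.append(f"{self.peer_id}: {msg}")
        self.env.storage[self.peer_id] = msg  # remember the last message
```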

&lt;p&gt;We would like to avoid, as far as possible, &lt;em&gt;fragmentation&lt;/em&gt; of the core logic and to facilitate &lt;em&gt;local reasoning&lt;/em&gt;, so that it is easier to reason about correctness, especially when introducing changes, without being too concerned about larger scopes. We need to shift the focus more onto the &lt;em&gt;functional&lt;/em&gt; and &lt;em&gt;logical&lt;/em&gt; aspects, i.e. what the pieces of implementation achieve and how they ensure the desired outcome, so that the protocol implementation better reflects causal dependencies and logical connections.&lt;/p&gt;

&lt;h4&gt;
  
  
  Composability
&lt;/h4&gt;

&lt;p&gt;Even a highly modular implementation is not necessarily highly &lt;a href="https://en.wikipedia.org/wiki/Composability"&gt;&lt;em&gt;composable&lt;/em&gt;&lt;/a&gt;, i.e. such that its components can easily be recombined and reused. Components that are not composable are hard to reuse. Moreover, composability has huge transformative potential: by unlocking the true power of expressiveness and flexibility, we can push the limits and uncover a new dimension of possibilities for finding better solutions, whether to fix a flaw in an existing implementation or to design and implement something completely new.&lt;/p&gt;

&lt;p&gt;Composability primarily emerges from the properties of individual components and the way they can be combined together. It demands components that are not only loosely coupled, but &lt;em&gt;generic&lt;/em&gt; as well. It also requires &lt;em&gt;unified&lt;/em&gt; means of abstraction and combination that satisfy certain properties, such as &lt;a href="https://en.wikipedia.org/wiki/Closure_(mathematics)"&gt;closure&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Associative_property"&gt;associativity&lt;/a&gt;, while preserving principal properties of individual components in combination.&lt;/p&gt;
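&lt;p&gt;Plain function composition is the simplest illustration of such unified means of combination (a Python sketch with hypothetical message-handling stages): composing two handlers yields another handler (closure), and the grouping of compositions does not affect the result (associativity).&lt;/p&gt;

```python
def compose(f, g):
    """Combine two handlers into one: the result is again a handler
    (closure), and grouping does not matter (associativity)."""
    return lambda x: g(f(x))

# Hypothetical message-processing stages, each marking the message.
def authenticate(msg):
    return {**msg, "authenticated": True}

def deduplicate(msg):
    return {**msg, "deduplicated": True}

def decode(msg):
    return {**msg, "decoded": True}

# Two groupings of the same three stages.
left = compose(compose(authenticate, deduplicate), decode)
right = compose(authenticate, compose(deduplicate, decode))
```

&lt;p&gt;Both groupings denote the same pipeline, which is exactly what lets larger combinations be built up freely from smaller ones.&lt;/p&gt;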

&lt;p&gt;All of the explored code bases were meant to implement only a specific protocol or a close family of protocols, except Substrate, which is supposed to support a wide range of protocols. Most of the implementations define abstract interfaces for major components, employing various forms of &lt;a href="https://en.wikipedia.org/wiki/Polymorphism_(computer_science)"&gt;polymorphism&lt;/a&gt;, and apply the &lt;a href="https://en.wikipedia.org/wiki/Dependency_inversion_principle"&gt;dependency inversion principle&lt;/a&gt;. This makes components replaceable and can help to reduce coupling. Being able to replace a component allows using alternative implementations of the component, e.g. for unit testing. However, most of those dependency inversion interfaces seem to only make sense within a very specific, predefined structure of the whole implementation, i.e. they are mostly about decomposing rather than recomposing. Truly composable components, on the other hand, are those that can be put together in an open-ended range of new, surprising combinations.&lt;/p&gt;
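&lt;p&gt;A typical dependency inversion interface of this kind can be sketched as follows (in Python; all names are illustrative): the protocol component depends only on an abstract transport, which a unit test replaces with an in-memory fake.&lt;/p&gt;

```python
from abc import ABC, abstractmethod

class Transport(ABC):
    """Abstract interface the protocol core depends on (dependency
    inversion): production code would wrap real sockets, tests a fake."""

    @abstractmethod
    def send(self, peer, message): ...

class FakeTransport(Transport):
    """In-memory replacement used for unit testing the protocol logic."""

    def __init__(self):
        self.sent = []

    def send(self, peer, message):
        self.sent.append((peer, message))

class Broadcaster:
    """Protocol component written against the abstract Transport only."""

    def __init__(self, transport, peers):
        self.transport = transport
        self.peers = peers

    def broadcast(self, message):
        for peer in self.peers:
            self.transport.send(peer, message)
```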

&lt;p&gt;There are, however, some examples of composability in the explored code bases. The communication layer in the Sui implementation takes advantage of the layered design of &lt;a href="https://github.com/MystenLabs/anemo"&gt;&lt;code&gt;anemo&lt;/code&gt;&lt;/a&gt;, a peer-to-peer networking library based on &lt;a href="https://docs.rs/tower/0.4.12/tower/"&gt;&lt;code&gt;tower&lt;/code&gt;&lt;/a&gt;: it processes RPC requests from other nodes through pipelines composed out of &lt;code&gt;tower&lt;/code&gt; middleware layers provided by the &lt;code&gt;anemo&lt;/code&gt; library. The state machine representing the core logic of the consensus protocol in the Algorand implementation consists of uniformly defined event handlers organized in a hierarchy of event routers dispatching events to the corresponding handler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Asynchrony_(computer_programming)"&gt;Asynchrony&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Concurrency_(computer_science)"&gt;concurrency&lt;/a&gt; make achieving composability particularly challenging. Components implemented using usual concurrent programming techniques based on &lt;a href="https://en.wikipedia.org/wiki/Lock_(computer_science)"&gt;lock-based&lt;/a&gt; synchronization primitives fall short of composability: a simple combination of individually correct components may easily fail to ensure consistency or cause a deadlock. Ensuring correctness in this model often requires breaking abstractions and dealing with synchronization directly in an awkward and error-prone way. Alternative concurrent programming techniques, such as &lt;a href="https://en.wikipedia.org/wiki/Software_transactional_memory"&gt;software transactional memory&lt;/a&gt; (STM) used in the Cardano implementation, can help to overcome these issues without compromising on modularity and composability. More on asynchrony and concurrency in the next section.&lt;/p&gt;
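&lt;p&gt;A classic illustration of the problem (a Python sketch with hypothetical accounts): two components whose methods are individually protected by their own locks can still deadlock when composed, e.g. by two opposite transfers running concurrently; the conventional workaround, acquiring locks in a global order, leaks synchronization concerns through the abstraction.&lt;/p&gt;

```python
import threading

class Account:
    """Individually correct: every operation takes the account's own lock."""

    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    """Naively locking src then dst can deadlock against an opposite
    transfer. The usual fix leaks through the abstraction: impose a
    global lock order, here by account name."""
    first, second = sorted([src, dst], key=lambda a: a.name)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

a = Account("a", 100)
b = Account("b", 100)

# Two opposite transfers; without the global lock order they could deadlock.
t1 = threading.Thread(target=transfer, args=(a, b, 30))
t2 = threading.Thread(target=transfer, args=(b, a, 10))
t1.start(); t2.start()
t1.join(); t2.join()
```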

&lt;p&gt;Functional programming places a significant emphasis on composability. One of its core principles is to break down programs into smaller, reusable functions that avoid side effects and can easily be combined to create more complex functionality. This encourages a more &lt;em&gt;declarative notation&lt;/em&gt;, which often results in code that is easier to reason about. The approaches and techniques employed in functional programming, such as &lt;a href="https://en.wikipedia.org/wiki/Immutable_object"&gt;immutability&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Lazy_evaluation"&gt;lazy evaluation&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Monad_(functional_programming)"&gt;monads&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Effect_system"&gt;effect systems&lt;/a&gt;, etc., are therefore a great source of ideas for enhancing composability.&lt;/p&gt;

&lt;p&gt;Composability is indispensable for &lt;em&gt;future-proof&lt;/em&gt; software solutions. This property doesn't necessarily emerge together with modularity, though; on the contrary, achieving it may be challenging, especially in the inherently concurrent context of distributed systems. Therefore, we should approach this proactively and deliberately design for composability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Concurrency
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Concurrent_computing"&gt;Concurrent programming&lt;/a&gt; is a way to structure code into multiple &lt;em&gt;threads of control&lt;/em&gt;, i.e. concurrent tasks whose executions can overlap. Observable effects caused by individual tasks can interleave during concurrent execution. Understanding and reasoning about concurrent code requires a more complex mental model than sequential programming does. Perhaps the main source of complexity in concurrent programming is &lt;em&gt;nondeterminism&lt;/em&gt;: concurrent programs can produce different results depending on the exact timing of external events and task execution.&lt;/p&gt;

&lt;p&gt;Concurrent programming is known to be error-prone. Concurrent tasks accessing shared resources generally require some form of coordination. Depending on the available mechanisms for interaction and communication between concurrent tasks, there may be different methods of coordinating them and controlling concurrency, e.g. &lt;a href="https://en.wikipedia.org/wiki/Lock_(computer_science)"&gt;lock-based synchronization primitives&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Message_passing"&gt;message passing&lt;/a&gt;, and &lt;a href="https://en.wikipedia.org/wiki/Software_transactional_memory"&gt;software transactional memory&lt;/a&gt;. However, properly applying those methods in a nontrivial system often becomes complicated and requires a great deal of care. When concurrent tasks happen to &lt;em&gt;interfere&lt;/em&gt; with each other in unanticipated ways, subtle issues, such as &lt;a href="https://en.wikipedia.org/wiki/Race_condition#In_software"&gt;race conditions&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Deadlock"&gt;deadlocks&lt;/a&gt;, and &lt;a href="https://en.wikipedia.org/wiki/Starvation_(computer_science)"&gt;resource starvation&lt;/a&gt;, may start manifesting themselves.&lt;/p&gt;

&lt;p&gt;Concurrent programs with mutable memory shared between threads can suffer from &lt;a href="https://en.wikipedia.org/wiki/Race_condition#Data_race"&gt;&lt;em&gt;data races&lt;/em&gt;&lt;/a&gt;. A data race is basically a situation where one thread accesses a memory location whereas another thread can simultaneously perform a conflicting write to that memory location. Preventing data races is not only important to avoid memory corruption; this can also significantly simplify the mental model.&lt;/p&gt;
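&lt;p&gt;A minimal sketch of the pattern (using Python threads for illustration; in languages with true shared-memory parallelism the unsynchronized version would be a genuine data race): the lock makes the read-modify-write in &lt;code&gt;increment&lt;/code&gt; atomic, so the outcome becomes deterministic.&lt;/p&gt;

```python
import threading

class Counter:
    """Shared mutable state; without the lock, the read-modify-write in
    increment() could interleave between threads and lose updates."""

    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def increment(self):
        with self.lock:  # makes the read-modify-write atomic
            self.value += 1

def hammer(counter, times):
    for _ in range(times):
        counter.increment()

counter = Counter()
threads = [threading.Thread(target=hammer, args=(counter, 1000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```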

&lt;p&gt;Normally, we can assume &lt;a href="https://en.wikipedia.org/wiki/Sequential_consistency"&gt;sequential consistency&lt;/a&gt; in concurrent programs that are free of data races. In essence, a sequentially consistent execution of a concurrent program must be equivalent to &lt;em&gt;some&lt;/em&gt; &lt;em&gt;sequential&lt;/em&gt; execution, respecting the order and semantics of operations specified in the program. So such executions are linear schedules, each representing a possible concurrent interleaving of the program. &lt;/p&gt;

&lt;p&gt;Execution schedules that only differ in the interleaving of operations local to threads of execution, i.e. operations not visible to other threads or externally, are effectively equivalent. Therefore, the number of possible distinct schedules depends on the number of &lt;em&gt;non-local&lt;/em&gt; operations in the execution, i.e. operations used to communicate between threads or cause externally visible effects, and it grows exponentially.&lt;/p&gt;

&lt;p&gt;Concurrent programming is an effective model of computation, but it is more complex and requires an appropriate approach in order to avoid subtle concurrency issues. Data race freedom is a particularly desirable property since it simplifies the model by providing sequential consistency. Under that model, reducing the number of non-local operations can greatly help to further simplify reasoning about the concurrent program.&lt;/p&gt;

&lt;h4&gt;
  
  
  Approaches to Concurrency
&lt;/h4&gt;

&lt;p&gt;Programming languages support different approaches to concurrency; they provide different features and different concurrency mechanisms in their runtimes and ecosystems. The explored code bases are written in the following languages: Java, Go, Rust, and Haskell. Let's have a look at how those code bases approach concurrency, depending on the choice of programming language.&lt;/p&gt;

&lt;p&gt;The code bases written in traditional mainstream languages like Java tend to achieve concurrency by explicitly spawning &lt;a href="https://en.wikipedia.org/wiki/Thread_(computing)#Kernel_threads"&gt;&lt;em&gt;OS threads&lt;/em&gt;&lt;/a&gt;, which communicate through &lt;em&gt;shared mutable memory&lt;/em&gt; and synchronize with &lt;a href="https://en.wikipedia.org/wiki/Lock_(computer_science)"&gt;&lt;em&gt;lock-based primitives&lt;/em&gt;&lt;/a&gt;. Those implementations are normally structured into objects exposing thread-safe methods to interact with them concurrently. This approach is well known and established in the industry; it is therefore &lt;em&gt;widely supported&lt;/em&gt; in the ecosystems built around those languages. Newer system programming languages like Rust usually provide support for this approach, as well.&lt;/p&gt;

&lt;p&gt;This low-level approach gives the programmer a &lt;em&gt;high level of control&lt;/em&gt; as it directly reflects how concurrency is actually achieved by the system. On the other hand, it requires a lot of care since properly using low-level synchronization primitives together is tricky and &lt;em&gt;error-prone&lt;/em&gt;. Moreover, OS threads are relatively &lt;em&gt;expensive&lt;/em&gt;, and, therefore, building highly concurrent programs by frequently spawning short-lived threads on demand is impractical. Instead, programs are often organized into a small number of long-running threads; though using thread pools can help to achieve more flexibility. Most importantly, as mentioned before, concurrent components synchronized with lock-based primitives suffer from &lt;em&gt;poor composability&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The Go language has built-in support for concurrency based on &lt;a href="https://en.wikipedia.org/wiki/Computer_multitasking#Preemptive_multitasking"&gt;preemptive multitasking&lt;/a&gt; with lightweight &lt;a href="https://en.wikipedia.org/wiki/Thread_(computing)#User_threads"&gt;user threads&lt;/a&gt; called &lt;em&gt;goroutines&lt;/em&gt;. It encourages &lt;em&gt;message-passing&lt;/em&gt; style of communication and synchronization between goroutines through blocking, optionally buffered FIFO &lt;em&gt;channels&lt;/em&gt;; though it also supports traditional lock-based synchronization. The built-in &lt;a href="https://go.dev/ref/spec#Select_statements"&gt;&lt;code&gt;select&lt;/code&gt; statement&lt;/a&gt; can be used to combine several channel operations in order to perform a single pseudo-randomly selected operation that is ready to proceed; unless there is a default case, the &lt;code&gt;select&lt;/code&gt; statement blocks until at least one of the operations can proceed. &lt;/p&gt;

&lt;p&gt;The &lt;code&gt;select&lt;/code&gt; statement in Go allows &lt;em&gt;composing&lt;/em&gt; multiple potentially blocking operations on channels into a single operation. For that reason, some of the explored code bases, e.g. SmartBFT-Go, occasionally use Go channels in place of traditional lock-based synchronization primitives in order to combine them with channel operations in a single &lt;code&gt;select&lt;/code&gt; statement. &lt;/p&gt;

&lt;p&gt;Go does not restrict access to &lt;em&gt;shared mutable data&lt;/em&gt; by concurrent goroutines, so &lt;em&gt;data races&lt;/em&gt; are still possible. Go provides quite &lt;em&gt;limited control&lt;/em&gt; over the runtime managing execution of goroutines, thus making fine-tuning and controlling concurrent execution difficult.&lt;/p&gt;

&lt;p&gt;The Rust language emphasizes safety without sacrificing performance. Thanks to the &lt;a href="https://doc.rust-lang.org/book/ch04-00-understanding-ownership.html"&gt;ownership model&lt;/a&gt; and strong type system, it can effectively ensure at compile time that the code is &lt;em&gt;free of data races&lt;/em&gt;. Being a system programming language, Rust supports concurrent programming with OS threads and shared memory, which are useful to optimize performance and for implementing other styles of concurrency, such as message passing. Rust's ownership and type system features prevent accidental sharing of mutable state between threads. &lt;/p&gt;

&lt;p&gt;The explored code bases written in Rust primarily rely on the &lt;a href="https://en.wikipedia.org/wiki/Asynchrony_(computer_programming)"&gt;asynchronous&lt;/a&gt; programming features of Rust to achieve concurrency. Async Rust can be seen as a form of &lt;a href="https://en.wikipedia.org/wiki/Cooperative_multitasking"&gt;cooperative multitasking&lt;/a&gt; where asynchronous, non-blocking computations are represented with the &lt;a href="https://docs.rs/futures/latest/futures/future/trait.Future.html"&gt;&lt;code&gt;Future&lt;/code&gt;&lt;/a&gt; trait (interface). Rust futures are &lt;em&gt;passive&lt;/em&gt;, i.e. they have to be actively driven by &lt;em&gt;polling&lt;/em&gt; in order to make progress. &lt;/p&gt;

&lt;p&gt;Ultimately, asynchronous code in Rust requires some &lt;em&gt;executor&lt;/em&gt; function that can drive a future by polling it to completion. There is an open-ended &lt;em&gt;choice of async runtimes&lt;/em&gt; in the Rust ecosystem, which provide executors. &lt;a href="https://tokio.rs/"&gt;Tokio&lt;/a&gt; is one of the most widely used runtimes in the Rust ecosystem; all of the explored code bases written in Rust are based on it. One can create &lt;em&gt;specialized runtimes&lt;/em&gt;, e.g. Sui has a simulator that provides a drop-in replacement for Tokio and supports deterministic, randomized execution.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://en.wikipedia.org/wiki/Async/await"&gt;async/await syntax&lt;/a&gt; in Rust helps writing asynchronous fragments of code very close to normal, synchronous code. The &lt;code&gt;async&lt;/code&gt; keyword introduces an &lt;em&gt;async context&lt;/em&gt; by constructing a future from the corresponding piece of code; the &lt;code&gt;await&lt;/code&gt; expression can be used within an async context to poll another future and yield control if that future is not yet ready to produce a value. &lt;/p&gt;

&lt;p&gt;Apart from using the async/await syntax, Rust futures can be &lt;em&gt;composed&lt;/em&gt; together using various combinators provided by the &lt;a href="https://docs.rs/futures/latest/futures/"&gt;&lt;code&gt;futures&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://docs.rs/tokio/latest/tokio/"&gt;&lt;code&gt;tokio&lt;/code&gt;&lt;/a&gt;, and other crates. In particular, the &lt;code&gt;select&lt;/code&gt; macro allows polling multiple futures simultaneously until one of them completes, similar to the &lt;code&gt;select&lt;/code&gt; statement in Go; the &lt;code&gt;join&lt;/code&gt; macro polls multiple futures to completion. There are also streams for asynchronously producing a sequence of values and asynchronous channels for communication between asynchronous tasks. &lt;/p&gt;

&lt;p&gt;Asynchronous Rust is evolving rapidly; thus, it may still lack maturity, and its documentation and best practices are less established. Many developers find programming in asynchronous Rust quite challenging and sometimes counter-intuitive, e.g. when dealing with cancellation and long-running or blocking operations, and due to the passive nature of futures. Although Rust prevents some concurrency problems like data races, concurrent code is still vulnerable to other types of concurrency bugs (e.g., deadlocks and logic errors) and requires deep understanding and careful design.&lt;/p&gt;

&lt;p&gt;Finally, concurrency in Haskell is based on lightweight &lt;a href="https://en.wikipedia.org/wiki/Thread_(computing)#User_threads"&gt;user threads&lt;/a&gt;. Haskell allows throwing asynchronous exceptions from one thread to another. Handling asynchronous exceptions safely requires great care in &lt;a href="https://en.wikipedia.org/wiki/Critical_section"&gt;critical sections&lt;/a&gt;, i.e. when manipulating shared resources. Since Haskell is a &lt;a href="https://en.wikipedia.org/wiki/Purely_functional_programming"&gt;purely functional&lt;/a&gt; programming language, it does not explicitly support shared mutable memory for communication between threads. One of the mechanisms for normal communication between Haskell threads is &lt;a href="https://hackage.haskell.org/package/base-4.19.0.0/docs/Control-Concurrent-MVar.html"&gt;&lt;code&gt;MVar&lt;/code&gt;&lt;/a&gt;, a synchronizing variable, which can act as a synchronized container for shared state or as a one-place channel. Concurrent Haskell &lt;em&gt;prevents data races&lt;/em&gt;, but using &lt;code&gt;MVar&lt;/code&gt;s is susceptible to other concurrency bugs, such as race conditions, deadlocks, etc.&lt;/p&gt;

&lt;p&gt;Another mechanism for concurrent communication widely used in the Haskell ecosystem is &lt;a href="https://en.wikipedia.org/wiki/Software_transactional_memory"&gt;Software Transactional Memory&lt;/a&gt; (STM). STM is an &lt;a href="https://en.wikipedia.org/wiki/Optimistic_concurrency_control"&gt;optimistic concurrency&lt;/a&gt; mechanism that allows transactions over shared mutable variables (transactional variables or &lt;a href="https://hackage.haskell.org/package/stm-2.5.2.1/docs/Control-Concurrent-STM-TVar.html"&gt;&lt;code&gt;TVar&lt;/code&gt;&lt;/a&gt;s) to be &lt;em&gt;safely composed&lt;/em&gt; and &lt;a href="https://hackage.haskell.org/package/stm-2.5.2.1/docs/Control-Monad-STM.html#v:atomically"&gt;atomically executed&lt;/a&gt;, without exposing the implementation details. STM transactions can &lt;a href="https://hackage.haskell.org/package/stm-2.5.2.1/docs/Control-Monad-STM.html#v:retry"&gt;block&lt;/a&gt; on an &lt;em&gt;arbitrary&lt;/em&gt; condition; alternative STM transactions can be composed together using the &lt;a href="https://hackage.haskell.org/package/stm-2.5.2.1/docs/Control-Monad-STM.html#v:orElse"&gt;&lt;code&gt;orElse&lt;/code&gt;&lt;/a&gt; combinator. The Haskell type system ensures that STM transactions cannot have undesired side effects and thus are safe to roll back and retry. &lt;/p&gt;

&lt;p&gt;Building various custom concurrency abstractions and combinators with STM is relatively easy and safe, thanks to &lt;em&gt;high composability&lt;/em&gt;. For instance, in Cardano, concurrent components expose STM transactions for retrieving relevant pieces of their mutable current state; the components then interact by combining and atomically executing such STM queries from other components and atomically updating the corresponding pieces of their own mutable state.&lt;sup id="fnref4"&gt;4&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;However, STM has some limitations and caveats. First of all, &lt;em&gt;composable&lt;/em&gt; &lt;em&gt;multi-way&lt;/em&gt; communication between threads cannot be expressed in STM. That is because STM transactions cannot produce a visible side effect while being blocked. This is closely related to another limitation: STM does not provide &lt;em&gt;fairness&lt;/em&gt; for threads waiting in a blocked STM transaction. In contrast, &lt;code&gt;MVar&lt;/code&gt;s guarantee fair scheduling of threads blocked on the same &lt;code&gt;MVar&lt;/code&gt;. STM also incurs some overhead in terms of memory and performance costs, which depends on the size of transactions, though it can sometimes help build more efficient mechanisms. Long-running STM transactions can suffer from starvation, i.e. being repeatedly rolled back and retried. Finally, Haskell, similar to Go, provides quite limited control over its runtime.&lt;/p&gt;

&lt;p&gt;To summarize, the traditional mainstream approach to concurrency based on explicit, low-level synchronization primitives and communication directly through shared mutable memory is well known and established, but it is tricky, error-prone, and suffers from poor composability. Restricting concurrent access to shared memory, e.g. with the ownership model as in Rust or immutability as in Haskell, can help prevent data races. Communication and synchronization through message-passing primitives like channels and using combinators like select can improve composability. Spawning short-lived OS threads may be too expensive; thread pools and lightweight user-thread runtimes can help to achieve more flexibility, though relying on a concurrency runtime adds a dependency that is not always replaceable or adjustable. Another approach to concurrency with good flexibility and composability is asynchronous programming with cooperative multitasking and async/await syntax, as exemplified by Rust. Software transactional memory is a highly composable and flexible approach to concurrency, though it has some restrictions, adds overhead, and does not guarantee fairness.&lt;/p&gt;

&lt;h4&gt;
  
  
  Evading Concurrency
&lt;/h4&gt;

&lt;p&gt;Given all the challenges with concurrent programming, why not try to avoid concurrency as much as possible? Some of the explored code bases go quite far down this route and implement almost all of the core protocol logic as a &lt;em&gt;completely sequential state machine&lt;/em&gt;, perhaps only offloading long-running operations (e.g., computationally intensive cryptography) to dedicated concurrent execution pools. Let's consider the consequences of this approach.&lt;/p&gt;

&lt;p&gt;First of all, distributed systems are &lt;em&gt;inherently concurrent&lt;/em&gt; because they, by definition, consist of multiple nodes running largely independently. Thus, each node needs to handle events (e.g., messages received over the network or expired timeouts) originating from different sources &lt;em&gt;asynchronously&lt;/em&gt;, i.e. independent of its main program flow and handling of events from other sources. Moreover, the protocol logic must also reflect the concurrent nature of the system.&lt;/p&gt;

&lt;p&gt;So some parts of the protocol are fundamentally sequential, e.g. delivering totally ordered transactions, whereas some parts are fundamentally concurrent, e.g. handling of messages received over the network from different peer nodes. Some parts &lt;em&gt;may&lt;/em&gt; be concurrent, but don't have to, e.g. validation of the subsequent messages while finishing processing of the current one.&lt;/p&gt;

&lt;p&gt;Attempting to implement an essentially concurrent part of the protocol in a sequential manner, i.e. without using concurrent programming techniques, necessarily requires explicitly maintaining and switching contexts. Not only does this add boilerplate code and entangle it with the protocol logic; more importantly, it causes &lt;em&gt;fragmentation&lt;/em&gt; of the core protocol logic because such an artificially sequential component still has to multiplex handling of asynchronous events. Therefore, what in concurrent programming could have been naturally expressed as a blocking operation becomes an abrupt return of control, breaking out of the sequential component.&lt;/p&gt;

&lt;p&gt;There is another problem with multiplexed handling of asynchronous events in sequential code, namely controlling the flow of events from concurrent sources. Consider a situation where a sequential component is given an event to handle that it cannot yet fully process because, in order to make a decision on how to react to the event, it first needs to handle some other events, e.g. it has to complete the current round of the protocol before participating in a new one. Since the sequential component cannot block and wait, and has to return control, there are basically two options: dropping the event or putting it aside into some kind of buffer. In the first case, the event source cannot assume that all events it emits will be reliably handled and has to take this into account in its logic, e.g. emit an equivalent event under some conditions later. In the second case, there should be some way to enforce a reasonable bound on the amount of buffered pending events without compromising the protocol properties, e.g. emitting further events only after having received an acknowledgement from the destination. This can add a lot more complexity to the protocol implementation.&lt;/p&gt;

&lt;p&gt;So concurrency cannot be easily evaded in distributed systems. Attempting to avoid using concurrent programming techniques complicates the implementation and causes fragmentation of the protocol logic in code. On the other hand, when done appropriately, designing for concurrency and using concurrent programming techniques can actually be advantageous. It boils down to recognizing inherently concurrent and sequential parts of the protocol and finding appropriate ways to express this distinction in code. Those parts of the protocol that are neither inherently concurrent nor sequential may nevertheless benefit from being implemented as concurrent: Designing for concurrency can guide towards better decoupling of components while concurrent execution can help to achieve higher responsiveness and performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Nondeterminism
&lt;/h3&gt;

&lt;p&gt;As mentioned in the previous section, distributed systems are inherently concurrent and therefore nondeterministic. If we think of nondeterminism in terms of events happening in the system then it can manifest itself as unpredictable events or their order. For example, requests from external agents (clients, users, etc.), values produced by a random-number generator, or node failures are not known in advance; the same set of messages may arrive at different nodes in different order due to unpredictable delays in communication; timeouts may happen due to unexpectedly long delays. Inner workings of nodes can introduce additional, implementation-specific nondeterminism, e.g. unspecified order of iteration over unordered collections, scheduling of concurrent tasks, etc. To some extent, the purpose of distributed protocols can be seen as confining nondeterminism within certain constraints in order to maintain the required invariants.&lt;/p&gt;

&lt;p&gt;Nondeterministic steps in protocol execution introduce alternative state transitions, thus expanding the &lt;a href="https://en.wikipedia.org/wiki/State_space_(computer_science)"&gt;&lt;em&gt;state space&lt;/em&gt;&lt;/a&gt;. This complicates reasoning about distributed protocols, as well as implementing and verifying them, because it often requires considering a large number of possible executions. Nondeterministic execution also makes &lt;em&gt;reproducing&lt;/em&gt; problems and debugging particularly challenging. Therefore, it is desirable to control nondeterminism or attempt to eliminate it.&lt;/p&gt;

&lt;p&gt;Some of the explored code bases constrain nondeterminism by implementing parts of the protocol as &lt;em&gt;deterministic&lt;/em&gt; state machines. Inherently nondeterministic aspects, such as time, randomness, and asynchronous operations, are abstracted out of those state machines. Randomness, as well as current time, can be supplied to the state machine through abstract interfaces provided as dependencies. Alternatively, the current timestamp can be supplied to the state machine at each step explicitly in the input. Time can also be represented in terms of an abstract logical clock maintained by the state machine, which is then advanced with special tick events periodically supplied to the state machine. Asynchronous operations can be requested by the state machine by emitting special output events; the result is then supplied back as special input events. This approach is very close to evading concurrency discussed before and therefore is associated with the same kind of disadvantages.&lt;/p&gt;

&lt;p&gt;In Haskell, as a strictly typed purely functional programming language, ordinary functions are deterministic (&lt;a href="https://en.wikipedia.org/wiki/Referential_transparency"&gt;referentially transparent&lt;/a&gt;) in the mathematical sense: given the same input, they must produce the same result. Nondeterministic computations are expressed using the &lt;a href="https://en.wikipedia.org/wiki/Monad_(functional_programming)"&gt;monadic&lt;/a&gt; interface. Only IO actions, when executed in the &lt;a href="https://hackage.haskell.org/package/base-4.19.0.0/docs/GHC-IO.html#t:IO"&gt;IO monad&lt;/a&gt;, can cause side effects and produce nondeterministic results. This is enforced by the type system. Cardano takes advantage of this by making most of its code polymorphic in the main IO-like monad. This allows fully controlling nondeterminism by choosing the main monad implementation.&lt;/p&gt;

&lt;p&gt;Being able to control nondeterminism is particularly useful for testing and debugging. This allows creating &lt;em&gt;reproducible&lt;/em&gt; test environments, as well as &lt;a href="https://en.wikipedia.org/wiki/Discrete-event_simulation"&gt;discrete-event simulation&lt;/a&gt; for faster-than-real-time simulation of time delays. For example, Cardano uses a simulation environment for the IO monad that closely follows core Haskell packages; Sui has a simulator based on &lt;a href="https://github.com/madsim-rs/madsim"&gt;&lt;code&gt;madsim&lt;/code&gt;&lt;/a&gt; that provides an API-compatible replacement for the &lt;a href="https://tokio.rs/"&gt;Tokio&lt;/a&gt; runtime and intercepts various POSIX API calls in order to enforce determinism. Both allow running the same code in production as in the simulator for testing.&lt;/p&gt;

&lt;p&gt;Nondeterminism is an important aspect of distributed systems, so it should be clearly expressed in the implementation. Type system features can help with that. Confining nondeterminism within &lt;em&gt;natural&lt;/em&gt; boundaries of components can reduce complexity and simplify reasoning about the protocol implementation. Simulated execution of unmodified code with controlled nondeterminism is a very effective technique in testing and debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Communication
&lt;/h3&gt;

&lt;p&gt;Communication is at the core of distributed systems where individual nodes need to coordinate in order to act as a coherent system. Nodes in a distributed system interact with each other by exchanging &lt;em&gt;peer-to-peer&lt;/em&gt; (P2P) messages. The communication happens over an &lt;em&gt;unreliable&lt;/em&gt; network medium that only provides &lt;em&gt;best-effort, unordered delivery&lt;/em&gt; of data packets, i.e. it may fail to deliver individual packets or deliver them out of order. Moreover, nodes can fail, and, in general, it may be impossible to determine precisely if a peer node has failed or its messages were simply dropped or delayed in the network. Nodes can also differ in processing power and experience different traffic load. Therefore, it is important to manage the rate of data transmission using &lt;a href="https://en.wikipedia.org/wiki/Flow_control_(data)"&gt;&lt;em&gt;flow control&lt;/em&gt;&lt;/a&gt; mechanisms, as well as to retransmit lost pieces of data. This can contribute significantly to the overall complexity of distributed protocols and their implementation.&lt;/p&gt;

&lt;h4&gt;
  
  
  Communication Layers
&lt;/h4&gt;

&lt;p&gt;Most of the explored implementations use SSL/TLS over TCP/IP as a transport layer for P2P communication. Establishing a TCP connection takes a few packet round trips over the network. Moreover, operating systems impose limits on the number of open TCP connections per process because they consume system resources. For those reasons, communication layers based on TCP establish long-lived connections with remote peers and try to keep the number of open connections low. This often means that the transport-level connections have to be &lt;em&gt;multiplexed&lt;/em&gt; into multiple logical sub-streams.&lt;/p&gt;

&lt;p&gt;Substrate and Lighthouse use &lt;a href="https://libp2p.io/"&gt;&lt;code&gt;libp2p&lt;/code&gt;&lt;/a&gt; as a networking stack for communication between nodes. The &lt;code&gt;libp2p&lt;/code&gt; framework is a versatile, modular peer-to-peer networking stack. It provides a collection of abstractions, mechanisms, and protocols for facilitating communication in P2P systems. In particular, &lt;code&gt;libp2p&lt;/code&gt; supports multiple transport mechanisms (TCP, QUIC, WebSocket, WebTransport, etc.), encryption schemes (TLS and Noise), and stream multiplexing. Higher-level protocols in &lt;code&gt;libp2p&lt;/code&gt; are implemented on top of reliable, ordered, bidirectional binary streams, which are transparently encrypted and multiplexed by the framework.&lt;/p&gt;

&lt;p&gt;The communication layer in Sui is based on &lt;a href="https://github.com/MystenLabs/anemo"&gt;&lt;code&gt;anemo&lt;/code&gt;&lt;/a&gt;, a peer-to-peer networking library built on top of &lt;a href="https://en.wikipedia.org/wiki/QUIC"&gt;QUIC&lt;/a&gt;. QUIC is a modern higher-level network transport protocol layered over UDP. It has built-in support for encryption and multiplexing. Similar to TCP connections, QUIC streams are reliable, ordered, and bidirectional, providing flow control (backpressure), but they are cheap and almost instantaneous to open once an initial connection is established. The &lt;code&gt;anemo&lt;/code&gt; library takes advantage of the efficient stream-multiplexing capability of QUIC; &lt;code&gt;libp2p&lt;/code&gt; also uses the built-in capabilities of QUIC when it is used as a transport mechanism.&lt;/p&gt;

&lt;p&gt;So there may be several levels of communication abstractions. There are low-level transport protocols like UDP or TCP, medium-level ones like QUIC, and comprehensive high-level networking stacks like &lt;code&gt;libp2p&lt;/code&gt;. Higher-level mechanisms can be built on top of lower-level layers. Sometimes, it makes sense to fuse several layers, e.g. QUIC efficiently embeds security into the transport layer. In order to simplify implementation of higher-level layers, it is desirable to take advantage of those properties that are already guaranteed by lower-level layers, e.g. reliable, ordered delivery and flow control provided by commonly used transport layers such as TCP and QUIC.&lt;/p&gt;

&lt;h4&gt;
  
  
  Styles of Communication
&lt;/h4&gt;

&lt;p&gt;There are different ways to organize communication between nodes. The most common styles of communication in the explored code bases are &lt;em&gt;request-response&lt;/em&gt; and &lt;em&gt;fire-and-forget&lt;/em&gt; message delivery. The &lt;em&gt;request-response&lt;/em&gt; style follows the remote procedure call (RPC) pattern: the initiator node sends a message to the remote node, and the latter is expected to respond back. In the fire-and-forget style, the initiator node unidirectionally sends messages to the remote node without waiting for a response. Another style of communication, which is also often used in the explored implementations, is &lt;em&gt;gossiping&lt;/em&gt;, where nodes publish and disseminate pieces of information among themselves in an indirect and random manner. Cardano uses a &lt;em&gt;session-based&lt;/em&gt; style of communication, where peers establish continuous bidirectional communication channels and exchange messages according to some stateful communication protocol.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;fire-and-forget&lt;/em&gt; message delivery is a very simple style of communication. It does not mandate any acknowledgement from the remote node, so it can only provide a best-effort delivery guarantee. Messages that cannot be handled for any reason are often simply dropped, e.g. when a message queue is full. Usually, there is also no guarantee about ordering of messages. Higher-level code needs to take care of such things as flow control, retransmission of lost messages, as well as determining and maintaining the context to handle messages in. On the other hand, this style can be expressed with a non-blocking interface. That allows sending a message to a group of remote nodes at once, which is a simple form of best-effort multi-/broadcasting. Some implementations provide a blocking or asynchronous variant of the interface, giving more control over data flow within the local node. For example, in Substrate, the sender should wait until it acquires a free slot in the outgoing message buffer; the slot reservation is then consumed to enqueue a message.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;request-response&lt;/em&gt; style is a simple type of session-based communication: sending a request initiates a new session, which normally terminates with reception of the corresponding response. Sessions can terminate abnormally, e.g. upon a timeout. The request-response style demands blocking or asynchronous interface on the sender side since it should wait for and handle the eventual response or error. This provides a context for response messages linking them to the corresponding requests. However, the communication layer treats each individual session independently. More complex patterns of interaction have to be split into a number of one-shot request-response sessions. Multiple sessions may be initiated concurrently, and the communication layer needs to keep track of those one-shot sessions starting, running, and finishing concurrently.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;session-based&lt;/em&gt; style of communication is connection-oriented and supports &lt;em&gt;stateful&lt;/em&gt; interaction between nodes. Communication sessions are established between individual nodes and represent reliable, ordered, bidirectional message streams. This provides a context for the messages being exchanged between nodes and implies blocking or asynchronous interface. Thanks to reliable and ordered delivery, the context establishes causal relationship between individual messages. Relying on those assumptions can greatly simplify the protocol implementation while taking advantage of the guarantees commonly provided by stream-based transport layers. This style of communication is quite generic and can express many different patterns of interaction. Combined with built-in flow control (backpressure), it is particularly suited for implementing &lt;em&gt;consumer-driven&lt;/em&gt; communication. On the other hand, session-based communication cannot directly express multi-/broadcasting primitives and can induce additional latency in certain patterns of interaction. Though higher-level communication mechanisms built on top of session-based communication can implement multi-/broadcasting, whereas using &lt;a href="https://en.wikipedia.org/wiki/Protocol_pipelining"&gt;pipelining&lt;/a&gt; techniques can help to hide latency and achieve good performance.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;gossip-style&lt;/em&gt; communication designates probabilistic broadcasting in a relay network of nodes. It resembles the best-effort broadcasting in the fire-and-forget message delivery style. The key difference is that data in the gossip-style communication can propagate from one node to another in multiple hops rather than being received directly from the source node. This makes it suitable for sparsely connected networks. Therefore, gossip communication can scale well in large networks. It can provide, with high probability, eventual delivery of a bounded amount of data under normal network conditions. This style of communication implies a publish-subscribe interface. Similar to the fire-and-forget message delivery, the interface is largely stateless and can be non-blocking. Under the hood, it is often implemented using the advertise-request-response pattern of communication: nodes advertise available pieces of data to their neighbors and exchange the missing parts with them following the request-response pattern. Efficient gossip implementations require adaptive network topology and advanced data dissemination techniques, which can make them fairly complicated.&lt;/p&gt;

&lt;p&gt;An interesting example of using the gossip-style communication is artifact pools in the Internet Computer blockchain. Artifact pools in ICP are structured collections of artifacts, generic pieces of data produced by the local replica or received from other nodes. The gossip layer is responsible for synchronizing artifact pools between nodes. Nodes communicate with each other through the artifact pools by adding/removing/moving artifacts to/from/between pool sections. Higher-level code is responsible for artifact validation; it also determines retention and prioritization policies.&lt;/p&gt;

&lt;p&gt;It is easy to notice that some styles of communication can be implemented in terms of others. The request-response style is a reduced form of session-based communication, which is more generic and expressive. Both can be implemented on top of the fire-and-forget delivery using some message retransmission and acknowledgement protocol. Conversely, the fire-and-forget message delivery can be implemented on top of a reliable session-based communication using bounded lossy message queues. Similarly, gossip mechanisms can be implemented using any of the other styles of communication, though the implementations may differ in complexity.&lt;/p&gt;
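
&lt;p&gt;To make the reduction concrete, here is a minimal sketch of request-response built on top of fire-and-forget delivery with retransmission. The deterministic &lt;code&gt;LossyLink&lt;/code&gt;, which drops every other message, is invented purely for illustration:&lt;/p&gt;

```python
class LossyLink:
    """Fire-and-forget delivery; for illustration it deterministically
    drops every odd-numbered send (a stand-in for packet loss)."""
    def __init__(self):
        self.sends = 0
        self.inbox = []

    def send(self, msg):
        self.sends += 1
        if self.sends % 2 == 0:   # even sends get through, odd ones are lost
            self.inbox.append(msg)

    def recv(self):
        return self.inbox.pop(0) if self.inbox else None

def request(reqs, resps, payload, responder, max_attempts=5):
    """Request-response on top of fire-and-forget: retransmit the request
    until a response arrives (or the retry budget is exhausted)."""
    for _ in range(max_attempts):
        reqs.send(payload)                    # the request may be lost
        msg = reqs.recv()
        if msg is not None:
            resps.send(responder(msg))        # the reply may be lost too
        reply = resps.recv()
        if reply is not None:
            return reply
    raise TimeoutError("no response after retransmissions")

reqs, resps = LossyLink(), LossyLink()
assert request(reqs, resps, "ping", str.upper) == "PING"
assert reqs.sends == 4    # the request had to be retransmitted several times
```

&lt;p&gt;Note that the responder may process duplicate copies of the request, which is the usual at-least-once behavior of such retransmission schemes; real protocols deduplicate using sequence numbers.&lt;/p&gt;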

&lt;p&gt;Different styles of communication have different properties that can significantly influence the shape of code built around them. Some are strictly more expressive than others, yet the less expressive ones are not redundant, because they may admit more efficient implementations. In order to avoid accidental complexity when implementing distributed protocols, it is important to have a range of communication mechanisms with aligned interfaces and clearly defined properties.&lt;/p&gt;

&lt;h4&gt;
  
  
  Internal Communication
&lt;/h4&gt;

&lt;p&gt;Apart from interaction between nodes, there is also communication between concurrent tasks within the same node. This internal communication shares some similarity with communication between nodes. The main difference is in the communication medium: while different nodes communicate through an unreliable and slow network, internal communication happens through fast and reliable shared memory. Some programming models and techniques make the similarity particularly prominent, e.g. the &lt;a href="https://en.wikipedia.org/wiki/Actor_model"&gt;actor model&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Communicating_sequential_processes"&gt;communicating sequential processes&lt;/a&gt; (CSP), &lt;a href="https://en.wikipedia.org/wiki/Remote_procedure_call"&gt;remote procedure calls&lt;/a&gt; (RPC), etc.&lt;/p&gt;

&lt;p&gt;Any piece of shared memory can act as a communication channel between internal components. Such a channel can be established by simply sharing a reference to the corresponding piece of memory. Internal messages do not need translation into/from a binary representation; they can be simply shared by reference. The request-response style of communication can be implemented as a simple invocation of blocking or asynchronous procedures (functions); invoking non-blocking procedures (functions) without a return value corresponds to the fire-and-forget message delivery style. Obviously, such procedures need to be safe for concurrent invocation.&lt;/p&gt;

&lt;p&gt;The session-based communication style can be implemented for internal communication using the constructs commonly known as &lt;a href="https://en.wikipedia.org/wiki/Channel_(programming)"&gt;&lt;em&gt;channels&lt;/em&gt;&lt;/a&gt; (e.g. &lt;a href="https://go.dev/ref/spec#Channel_types"&gt;channels&lt;/a&gt; in Go, &lt;a href="https://docs.rs/tokio/latest/tokio/sync/index.html#message-passing"&gt;Tokio channels&lt;/a&gt; in Rust) or concurrent &lt;em&gt;queues&lt;/em&gt; (e.g. &lt;a href="https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/LinkedBlockingQueue.html"&gt;&lt;code&gt;LinkedBlockingQueue&lt;/code&gt;&lt;/a&gt; and other concurrent &lt;a href="https://docs.oracle.com/javase/8/docs/api/java/util/Queue.html"&gt;queues&lt;/a&gt; in Java). Those constructs are among the fundamental mechanisms of &lt;em&gt;communication and coordination&lt;/em&gt; between concurrent components. Channels can be &lt;em&gt;buffered&lt;/em&gt; or &lt;em&gt;unbuffered&lt;/em&gt;. Buffered channels and queues can hold items being sent through them without blocking the sender. In contrast, sending to or receiving from an unbuffered channel acts as a rendezvous point: it synchronizes the sender and the receiver at the point of communication.&lt;/p&gt;

&lt;p&gt;Buffered channels and queues that can hold more than a single item may return items to a receiver in a different order. FIFO is the most commonly used ordering policy, in which items are returned in the same order as they were inserted. LIFO is another option, in which the most recently inserted item is returned first. One can think of many other options, such as priority queues. The preferred ordering policy depends on the purpose of communication.&lt;/p&gt;
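
&lt;p&gt;Python's standard &lt;code&gt;queue&lt;/code&gt; module happens to provide all three ordering policies directly, which makes the difference easy to demonstrate:&lt;/p&gt;

```python
import queue

fifo = queue.Queue()          # items come out in insertion order
lifo = queue.LifoQueue()      # the most recently inserted item comes out first
prio = queue.PriorityQueue()  # the smallest item comes out first

for q in (fifo, lifo, prio):
    for item in [(2, "b"), (1, "a"), (3, "c")]:
        q.put(item)

def drain(q):
    return [q.get() for _ in range(q.qsize())]

assert drain(fifo) == [(2, "b"), (1, "a"), (3, "c")]
assert drain(lifo) == [(3, "c"), (1, "a"), (2, "b")]
assert drain(prio) == [(1, "a"), (2, "b"), (3, "c")]
```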

&lt;p&gt;Buffered channels and queues can be &lt;em&gt;bounded&lt;/em&gt; or &lt;em&gt;unbounded&lt;/em&gt;. The bounded version imposes a hard limit on the number of items they can hold. Unbounded channels and queues usually provide a simple &lt;em&gt;non-blocking&lt;/em&gt; interface for inserting new items. However, they require some additional mechanism to prevent accumulating an unbounded number of items, e.g. blocking ingress of external events when internal buffers grow above a certain threshold, or relying on time-based assumptions such as throttling the data flow or imposing expiration times on the items. Such mechanisms can make reasoning about the protocol implementation more complicated.&lt;/p&gt;

&lt;p&gt;Bounded channels and queues usually provide a blocking or asynchronous interface. They can also support non-blocking insertion of new items, but then they must discard some items when there is no capacity left. There may be different eviction policies. The simplest one is to discard the item being inserted. Otherwise, the new item is inserted, but some of the buffered items must be discarded, e.g. the least recently inserted one. As with the ordering policy, there are many other options, and the choice depends on the purpose of communication.&lt;/p&gt;
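
&lt;p&gt;Both eviction policies can be demonstrated with the Python standard library: a bounded &lt;code&gt;queue.Queue&lt;/code&gt; with non-blocking insertion discards the item being inserted, while a &lt;code&gt;deque&lt;/code&gt; with &lt;code&gt;maxlen&lt;/code&gt; evicts the least recently inserted one:&lt;/p&gt;

```python
import queue
from collections import deque

# Policy 1: discard the item being inserted when the buffer is full.
bounded = queue.Queue(maxsize=2)
for item in ["a", "b", "c"]:
    try:
        bounded.put_nowait(item)
    except queue.Full:
        pass                       # "c" is dropped; "a" and "b" survive
assert [bounded.get() for _ in range(2)] == ["a", "b"]

# Policy 2: insert the new item, evicting the least recently inserted one.
ring = deque(maxlen=2)
for item in ["a", "b", "c"]:
    ring.append(item)              # "a" is evicted to make room for "c"
assert list(ring) == ["b", "c"]
```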

&lt;p&gt;It is also worth mentioning buffered channels with a single-item buffer. They can be convenient for communicating a single item from one concurrent component to another, e.g. sending a response message back to the requester. The &lt;a href="https://docs.rs/tokio/latest/tokio/sync/oneshot/index.html"&gt;&lt;code&gt;oneshot&lt;/code&gt;&lt;/a&gt; channel in Tokio is a good example of such a channel type. &lt;a href="https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/CompletableFuture.html"&gt;&lt;code&gt;CompletableFuture&lt;/code&gt;&lt;/a&gt; in Java can also be considered a kind of single-item buffered channel, as well as synchronizing variables &lt;a href="https://hackage.haskell.org/package/base-4.19.0.0/docs/Control-Concurrent-MVar.html"&gt;&lt;code&gt;MVar&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://hackage.haskell.org/package/stm-2.5.2.1/docs/Control-Concurrent-STM-TMVar.html"&gt;&lt;code&gt;TMVar&lt;/code&gt;&lt;/a&gt; in Haskell. Another interesting example of a single-item buffered channel is the &lt;a href="https://docs.rs/tokio/latest/tokio/sync/watch/index.html"&gt;&lt;code&gt;watch&lt;/code&gt;&lt;/a&gt; channel in Tokio: it always keeps the last value sent to it. The &lt;code&gt;watch&lt;/code&gt; channel is useful for watching for changes to a value from multiple concurrent components. Transactional variables (&lt;a href="https://hackage.haskell.org/package/stm-2.5.3.0/docs/Control-Concurrent-STM-TVar.html"&gt;&lt;code&gt;TVar&lt;/code&gt;&lt;/a&gt;s) in Haskell are somewhat similar to watch channels since STM transactions can be suspended until one of the &lt;a href="https://hackage.haskell.org/package/stm-2.5.2.1/docs/Control-Concurrent-STM-TVar.html"&gt;&lt;code&gt;TVar&lt;/code&gt;&lt;/a&gt;s they have read from has been updated.&lt;/p&gt;
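
&lt;p&gt;A minimal analogue of a watch-style channel might look as follows. This is a hypothetical sketch built on &lt;code&gt;threading.Condition&lt;/code&gt;, not Tokio's implementation; it keeps only the last value sent and lets receivers block until the value changes past the version they have already seen:&lt;/p&gt;

```python
import threading

class Watch:
    """A toy watch channel: keeps only the last value sent; receivers
    block until the value changes past the version they have seen."""
    def __init__(self, initial):
        self._value = initial
        self._version = 0
        self._cond = threading.Condition()

    def send(self, value):
        with self._cond:
            self._value = value          # the previous value is overwritten
            self._version += 1
            self._cond.notify_all()

    def wait_newer(self, seen_version):
        with self._cond:
            self._cond.wait_for(lambda: self._version > seen_version)
            return self._value, self._version

watch = Watch(initial=0)
results = []

def watcher():
    value, _ = watch.wait_newer(0)
    results.append(value)

t = threading.Thread(target=watcher)
t.start()
watch.send(1)        # an intermediate value may be overwritten...
watch.send(2)        # ...only some fresh value is guaranteed to be observed
t.join()
assert results in ([1], [2])
```

&lt;p&gt;That intermediate values may be lost is exactly the intended semantics of a last-value channel: subscribers care about the current state, not the full history of changes.&lt;/p&gt;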

&lt;p&gt;Channels and queues often serve as fundamental constructs to implement &lt;a href="https://en.wikipedia.org/wiki/Message_passing"&gt;message passing&lt;/a&gt; between concurrent components. They can be used to implement various styles of internal communication and higher-level components. For example, implementations of components for communication between nodes often use channels and queues as internal message buffers.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://en.wikipedia.org/wiki/Publish%E2%80%93subscribe_pattern"&gt;publish-subscribe&lt;/a&gt; design pattern resembles the gossip style of communication. It can be implemented for internal communication as an event bus or broadcast channel. Same as channels and queues, it can be buffered or unbuffered, bounded or unbounded. Unless messages can be dropped, unbuffered and bounded buffered implementations only support non-blocking publishing/broadcasting of messages if no subscriber blocks.&lt;/p&gt;

&lt;p&gt;Similar to communication between nodes, different mechanisms and styles of internal communication have different properties that can significantly influence the shape of code. Therefore, it is equally important to have a range of internal communication mechanisms with aligned interfaces and clearly defined properties. The similarity between mechanisms for internal communication and communication between nodes provides an interesting perspective and can help to come up with better abstractions for communication.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resilience
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Fault-tolerant&lt;/em&gt; distributed systems are meant to tolerate (within limits) faults of individual nodes due to crashes, network partitioning, malfunctioning, or even malicious behavior. Crash fault tolerant (CFT) systems, e.g. Apache Kafka and Apache Zookeeper, are relatively simple since they only need to withstand node crashes and network partitioning. Byzantine fault tolerant (BFT) systems, e.g. public blockchains, are designed to withstand arbitrary (including malicious) behavior of a fraction of nodes and thus are significantly more complicated. There are two sides to the issue: preventing faulty or malicious nodes from compromising the whole system and recovering failed nodes to rejoin the system.&lt;/p&gt;

&lt;p&gt;Theoretically, fault-tolerant distributed protocols are designed so that they guarantee their safety and liveness properties despite the presence of faulty nodes in the system, provided that certain assumptions hold. In practice, those guarantees are only provided if the implementation ensures that the required assumptions actually hold. This is particularly challenging in BFT systems meant to operate in adversarial environments. Nodes in such systems can be subject to various attacks, such as denial-of-service (DoS) attacks through resource exhaustion. &lt;em&gt;Fairness&lt;/em&gt; between peers is another concern since it may also impact resilience.&lt;/p&gt;

&lt;p&gt;To mitigate those risks, many implementations maintain &lt;em&gt;reputation&lt;/em&gt; metrics for remote peers and apply &lt;em&gt;rate-limiting&lt;/em&gt; or &lt;em&gt;throttling&lt;/em&gt; techniques. Peer reputation is based on the observable behavior of the peer, such as protocol violations, timeouts, and performance. Nodes normally disconnect from remote peers whose reputation drops below a certain threshold, as well as reject inbound connections from those peers. Conversely, peers with higher reputation may be preferred for communication in sparsely connected systems.&lt;/p&gt;
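
&lt;p&gt;A reputation tracker of this kind can be as simple as the following sketch; the event names, penalty values, and ban threshold are invented for illustration:&lt;/p&gt;

```python
class PeerReputation:
    """A toy reputation tracker: peers accumulate penalty points for
    observed misbehavior and are banned past a threshold."""
    PENALTIES = {"protocol_violation": 50, "timeout": 10, "slow_response": 2}
    BAN_THRESHOLD = 100     # accumulated penalty points

    def __init__(self):
        self.penalties = {}

    def report(self, peer, event):
        # Penalties are weighted by severity: a protocol violation is
        # strong evidence of misbehavior, a timeout is weak evidence.
        self.penalties[peer] = self.penalties.get(peer, 0) + self.PENALTIES[event]

    def is_banned(self, peer):
        return self.penalties.get(peer, 0) >= self.BAN_THRESHOLD

rep = PeerReputation()
rep.report("peer-a", "protocol_violation")
rep.report("peer-a", "protocol_violation")   # 50 + 50 reaches the threshold
rep.report("peer-b", "timeout")
assert rep.is_banned("peer-a") and not rep.is_banned("peer-b")
```

&lt;p&gt;Real implementations additionally decay penalties over time, so that transient faults do not permanently exclude an otherwise honest peer.&lt;/p&gt;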

&lt;p&gt;In its threat-aware design approach, Cardano emphasizes &lt;em&gt;detecting protocol violations as early as possible&lt;/em&gt; in the operational cycle, at the point where the data is available but the least resources have been expended to process it&lt;sup id="fnref5"&gt;5&lt;/sup&gt;. For instance, block and transaction relaying is interleaved with validation to avoid circulating invalid data in the system. This approach works well with &lt;em&gt;stateful&lt;/em&gt; &lt;em&gt;consumer-driven&lt;/em&gt; communication between nodes: inbound messages must be syntactically well-formed and semantically valid in the context of information previously received from the peer node.&lt;/p&gt;

&lt;p&gt;In order to allow failed nodes to efficiently restore and safely rejoin the system, some parts of the protocol state can be persisted in stable storage. This is usually implemented as a &lt;a href="https://en.wikipedia.org/wiki/Write-ahead_logging"&gt;write-ahead log&lt;/a&gt; (WAL), an append-only stable storage used for crash recovery. Certain events are first recorded in the log before the corresponding actions are taken, e.g. before sending messages to other nodes. This allows the node to restore and continue participating in the protocol from where it stopped, without violating the protocol. Persistence mechanisms are also required to support recovery from a massive system crash, i.e. to provide the &lt;a href="https://en.wikipedia.org/wiki/Durability_(database_systems)"&gt;durability&lt;/a&gt; property.&lt;/p&gt;
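
&lt;p&gt;A minimal write-ahead log might look as follows. This sketch only illustrates the append-then-act and replay-on-restart pattern, not a production WAL format (which would add checksums, segmentation, and compaction):&lt;/p&gt;

```python
import os
import tempfile

class WriteAheadLog:
    """A toy write-ahead log: records are appended and flushed to
    stable storage before the corresponding action is taken."""
    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a") as f:
            f.write(record + "\n")
            f.flush()
            os.fsync(f.fileno())   # ensure the record hits stable storage

    def replay(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [line.rstrip("\n") for line in f]

path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = WriteAheadLog(path)
wal.append("vote view=1 value=x")   # logged before the vote is actually sent
wal.append("vote view=2 value=y")

# After a crash, a fresh instance replays the log to restore its state
# and avoid, e.g., voting differently in a view it already voted in.
assert WriteAheadLog(path).replay() == ["vote view=1 value=x",
                                        "vote view=2 value=y"]
```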

&lt;p&gt;Early detection of protocol violations is advantageous, and the implementation structure should allow that. There should be a clear path for propagating information about detected protocol violations and other anomalies to adjust peer reputation metrics and take appropriate measures. Persistence mechanisms, such as write-ahead logging, are required for durability, as well as for safe and efficient node recovery.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optimization
&lt;/h3&gt;

&lt;p&gt;Practical distributed systems require not only reliability but also efficiency. Simplistic designs and implementations unfortunately tend to exhibit poor performance, whereas we would like our systems to scale well and provide decent &lt;em&gt;throughput&lt;/em&gt; and &lt;em&gt;latency&lt;/em&gt;. Improving those characteristics demands optimization at protocol and implementation levels. Great effort has been put into optimizing distributed protocols during decades of active research. This gave rise to a range of elaborate protocols attempting to achieve ever higher performance. On the implementation level, there also exists a variety of technical means for increasing efficiency. Optimizations, however, often add more complexity and make protocols harder to reason about and implement.&lt;/p&gt;

&lt;p&gt;Protocol-level optimizations may involve using more complex communication patterns and topologies. Protocol phase &lt;em&gt;pipelining&lt;/em&gt;, i.e. participating with a single message in multiple protocol phases at once, and &lt;em&gt;speculative execution&lt;/em&gt; are common techniques to improve responsiveness. &lt;em&gt;Batching&lt;/em&gt;, as well as advanced cryptography such as &lt;a href="https://en.wikipedia.org/wiki/Threshold_cryptosystem"&gt;threshold signatures&lt;/a&gt;, helps to reduce communication overhead. State-of-the-art protocols are often based on &lt;em&gt;advanced data structures&lt;/em&gt;, such as &lt;a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph"&gt;directed acyclic graphs&lt;/a&gt; (DAGs). At the implementation level, &lt;em&gt;on-demand execution&lt;/em&gt; and &lt;em&gt;caching&lt;/em&gt; are often used to avoid performing unnecessary or duplicate operations.&lt;/p&gt;

&lt;p&gt;Communication contributes significantly to the overall overhead in distributed systems and, therefore, is a clear target for optimization. Point-to-point &lt;a href="https://en.wikipedia.org/wiki/Protocol_pipelining"&gt;protocol &lt;em&gt;pipelining&lt;/em&gt;&lt;/a&gt;, i.e. continuous sending of requests without waiting for the corresponding responses, can greatly increase performance by hiding high network latency. Widely used transport protocols, such as TCP, tend to perform best under &lt;em&gt;steady data flow&lt;/em&gt;. Moreover, keeping multiple network connections consumes additional system resources. Therefore, implementations commonly &lt;em&gt;multiplex&lt;/em&gt; multiple logical communication streams through a single network connection. Minimizing &lt;a href="https://en.wikipedia.org/wiki/Head-of-line_blocking"&gt;&lt;em&gt;head-of-line blocking&lt;/em&gt;&lt;/a&gt; effects may require &lt;em&gt;flow control&lt;/em&gt; mechanisms at the level of individual logical streams; large pieces of data should be transmitted through a multiplexed connection &lt;em&gt;in chunks&lt;/em&gt;. Specific kinds of communication, e.g. state synchronization in blockchain systems, can benefit from dedicated, specialized communication mechanisms.&lt;/p&gt;
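
&lt;p&gt;The effect of pipelining on latency can be illustrated with a toy tick-based model; the &lt;code&gt;run&lt;/code&gt; function below is invented for illustration and assumes instantaneous sends and a fixed round-trip time:&lt;/p&gt;

```python
def run(requests, rtt, window):
    """Tick-based model of point-to-point pipelining: up to `window`
    requests may be in flight at once; each response arrives `rtt`
    ticks after its request is sent. Returns the tick at which the
    last response arrives."""
    in_flight = []                      # arrival ticks of pending responses
    tick = 0
    for _ in requests:
        if len(in_flight) == window:
            tick = in_flight.pop(0)     # wait for the oldest response
        in_flight.append(tick + rtt)    # send the next request immediately
    return in_flight[-1]

# Without pipelining (window=1) each request pays the full round trip;
# a window of 4 requests in flight hides most of the latency.
assert run(range(8), rtt=10, window=1) == 80
assert run(range(8), rtt=10, window=4) == 20
```

&lt;p&gt;The model deliberately ignores bandwidth and processing costs; its point is only that overlapping requests turns a per-request latency cost into a per-window one.&lt;/p&gt;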

&lt;p&gt;Interaction between concurrent components and across levels of abstraction is also subject to fine-tuning and optimization. &lt;em&gt;Prioritization&lt;/em&gt; and flexible &lt;em&gt;policies&lt;/em&gt; can help to maximize performance. For example, the system may perform better when certain concurrent tasks or communication paths have higher priority. Internal communication, as well as communication between nodes, may be optimized through prioritization and retention policies applied to individual messages or kinds of messages. This sort of optimization requires deep understanding of the protocol and its inner workings.&lt;/p&gt;

&lt;p&gt;Expensive low-level operations, such as spawning threads, blocking on locks, and copying data, can become a hidden cause of suboptimal performance. There are well-known techniques that can help to avoid unnecessary low-level overhead. For example, &lt;a href="https://en.wikipedia.org/wiki/Thread_pool"&gt;&lt;em&gt;execution pools&lt;/em&gt;&lt;/a&gt; avoid the overhead associated with creation and destruction of threads for executing short-lived concurrent tasks; &lt;a href="https://en.wikipedia.org/wiki/Non-blocking_algorithm"&gt;&lt;em&gt;non-blocking algorithms&lt;/em&gt;&lt;/a&gt; can improve performance by avoiding unnecessary suspension of thread execution; &lt;a href="https://en.wikipedia.org/wiki/Zero-copy"&gt;&lt;em&gt;zero-copy&lt;/em&gt;&lt;/a&gt; techniques focus on eliminating excessive copying of data.&lt;/p&gt;

&lt;p&gt;Improving performance characteristics of distributed systems may require nontrivial changes in the underlying protocols and their implementations. The structure of the code should be flexible enough to support such changes. Some optimizations can be confined within boundaries of abstract components, whereas some may require crossing the borders of modularity. &lt;em&gt;Flexible&lt;/em&gt; and &lt;em&gt;composable&lt;/em&gt; primitives and interfaces &lt;em&gt;designed for optimization&lt;/em&gt; would help to fully realize the potential of distributed systems in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Correctness
&lt;/h2&gt;

&lt;p&gt;Correctness is absolutely essential for implementation of distributed fault-tolerant protocols since they are critical for ensuring reliability of the whole system. &lt;a href="https://en.wikipedia.org/wiki/Formal_verification"&gt;&lt;em&gt;Formal verification&lt;/em&gt;&lt;/a&gt; methods allow confirming protocol correctness in terms of desired properties. Applying those methods requires that the protocol is described precisely with a &lt;a href="https://en.wikipedia.org/wiki/Formal_specification"&gt;&lt;em&gt;formal specification&lt;/em&gt;&lt;/a&gt;. However, the way protocols are actually implemented in code tends to differ significantly from the notation used in protocol specifications. This discrepancy is clearly a potential source of errors. There are different methods that can help to acquire higher confidence in correctness of the protocol implementation.&lt;/p&gt;

&lt;p&gt;Testing is an established practice to examine correctness of software. Comprehensive testing of complex systems happens at different levels, and modularity of the code supports more effective testing by isolating functionalities, enabling independent unit testing, simplifying integration testing, and promoting code reuse. Some code bases include &lt;em&gt;dedicated interfaces&lt;/em&gt; and &lt;em&gt;hooks&lt;/em&gt; to facilitate testing; &lt;a href="https://man.freebsd.org/cgi/man.cgi?query=fail"&gt;&lt;em&gt;fail points&lt;/em&gt;&lt;/a&gt; are a technique for injecting errors and other behavior at runtime for testing purposes, used in Aptos and Sui. In Algorand, each component of the hierarchical state machine implementing the consensus protocol can perform &lt;em&gt;pre- and post-condition checks&lt;/em&gt; to validate whether it conforms to its contract. Most code bases perform diagnostic &lt;em&gt;logging&lt;/em&gt; or &lt;em&gt;tracing&lt;/em&gt; that can also be useful for testing, e.g. to check invariants in property-based testing.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Deterministic &lt;a href="https://en.wikipedia.org/wiki/Discrete-event_simulation"&gt;discrete-event simulation&lt;/a&gt;&lt;/em&gt; is a powerful technique that can be used for performing &lt;em&gt;randomized&lt;/em&gt; but &lt;em&gt;reproducible&lt;/em&gt; testing. For example, Sui, Apache Kafka, and Cardano employ this technique. It works by running the code within a special runtime that supports deterministic, randomized execution of concurrent code, as well as faster-than-real-time simulation of time delays. This technique can be used to run an entire network in a single process, with &lt;em&gt;simulated network&lt;/em&gt; latency and packet loss. To ensure deterministic execution, the simulation approach usually requires that the code is generic over the sources of local time and randomness; it can also rely on code instrumentation techniques. The key advantage of this approach is that it allows running precisely the same code in the simulator for testing as in production.&lt;/p&gt;
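
&lt;p&gt;The core of such a simulator is small: a priority queue of timestamped events, a virtual clock, and a seeded source of randomness. The following sketch is a hypothetical minimal version, not the runtime used by any of the systems mentioned above:&lt;/p&gt;

```python
import heapq
import random

class Simulator:
    """A toy deterministic discrete-event simulator: virtual time jumps
    straight to the next scheduled event, and all randomness is seeded."""
    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.now = 0.0
        self._queue = []        # (time, sequence number, callback)
        self._seq = 0           # tie-breaker for events at the same time

    def schedule(self, delay, callback):
        heapq.heappush(self._queue, (self.now + delay, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, callback = heapq.heappop(self._queue)
            callback()

def experiment(seed):
    """Deliver two messages with random simulated network latency."""
    log = []
    sim = Simulator(seed)
    def deliver(msg):
        log.append((round(sim.now, 6), msg))
    for msg in ("prepare", "commit"):
        sim.schedule(sim.rng.uniform(0.01, 0.1), lambda m=msg: deliver(m))
    sim.run()       # runs faster than real time: no actual sleeping
    return log

# The same seed reproduces exactly the same schedule, so a failing run
# can be replayed for debugging.
assert experiment(seed=7) == experiment(seed=7)
assert len(experiment(seed=7)) == 2
```

&lt;p&gt;A full simulation runtime additionally virtualizes the network between many in-process nodes and injects packet loss and crashes from the same seeded randomness.&lt;/p&gt;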

&lt;p&gt;Certain correctness properties of code can be ensured statically, i.e. at compile time. Those checks rely on the programming language's type system. Software engineers can take advantage of &lt;em&gt;type safety&lt;/em&gt; features to implement components in a way that makes them &lt;em&gt;safe by construction&lt;/em&gt;. For example, Cardano uses the &lt;a href="https://github.com/input-output-hk/typed-protocols/tree/typed-protocols-0.1.0.5"&gt;&lt;code&gt;typed-protocols&lt;/code&gt;&lt;/a&gt; package, a generic framework for implementing application-level protocols, which is based on a simple form of &lt;a href="https://en.wikipedia.org/wiki/Session_type"&gt;session typing&lt;/a&gt;.&lt;sup id="fnref6"&gt;6&lt;/sup&gt; Within this framework, protocols are described as state machines encoded into Haskell types. The allowed transitions between states correspond to messages exchanged between the peers, so the protocol state determines which messages are allowed to be sent or must be accepted when received, at type level. This simplifies protocol implementation, allows early detection of protocol violations, and makes the protocols themselves deadlock-free by construction. More advanced type-level programming techniques may allow achieving impressive levels of type safety; however, such code may be significantly harder to implement, understand, and maintain.&lt;/p&gt;

&lt;p&gt;Ensuring correctness in distributed systems is a complex task. Protocols and their properties can be formally specified and verified. Expressing the protocol specification and its implementation in notations as similar as possible could help to ensure equivalence between the two. Modular and generic structure of code, as well as various testing support features within the code base, enable more effective testing. Support for deterministic discrete-event simulation is particularly powerful for reproducible randomized testing. Finally, type safety techniques like session types and typestates can eliminate certain kinds of programming errors at compile time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resource Utilization
&lt;/h2&gt;

&lt;p&gt;Real computing systems are fundamentally bounded in the amount of available resources. Computers operate with limited computational power, memory, storage, and network bandwidth. Operating systems impose further limits on such system resources as threads, open network connections, file handles, etc. Practical systems are required to prevent &lt;a href="https://en.wikipedia.org/wiki/Resource_leak"&gt;resource leaks&lt;/a&gt;, as well as to ensure fair and efficient utilization of available resources.&lt;/p&gt;

&lt;p&gt;Some resources, such as allocated memory, open file handles and network connections, spawned concurrent tasks and threads, may require explicit actions to release them properly when they are no longer needed. Failing to release resources promptly is known as a resource leak. It can cause &lt;a href="https://en.wikipedia.org/wiki/Resource_starvation"&gt;resource starvation&lt;/a&gt;, slowdowns, and instability in the system. Relying on explicit releasing of acquired resources is known to be error-prone. Automatically releasing resources based on &lt;em&gt;lifetimes&lt;/em&gt; and &lt;em&gt;lexical scopes&lt;/em&gt; is a more robust form of &lt;a href="https://en.wikipedia.org/wiki/Resource_management_(computing)"&gt;resource management&lt;/a&gt;. Sometimes the encompassing lexical scope's lifetime is longer than the resource's natural life cycle, e.g. when managing concurrent tasks, so that strict lexical scoping becomes inappropriate. In such cases, resources may be managed more explicitly within the scope, but with a fallback mechanism to track resources and ensure that any remaining resource gets released when leaving the scope.&lt;sup id="fnref7"&gt;7&lt;/sup&gt;&lt;/p&gt;
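
&lt;p&gt;Python's &lt;code&gt;contextlib.ExitStack&lt;/code&gt; is one example of such a fallback mechanism: resources are registered with the stack as they are acquired within the scope, and any still-registered resource is released, in reverse order, when the scope is left, even if an exception escapes. The &lt;code&gt;Resource&lt;/code&gt; class below is invented for illustration:&lt;/p&gt;

```python
from contextlib import ExitStack

released = []

class Resource:
    """A toy resource that records when it is released."""
    def __init__(self, name):
        self.name = name
    def close(self):
        released.append(self.name)

try:
    with ExitStack() as stack:
        for name in ("socket", "file", "task"):
            resource = Resource(name)
            stack.callback(resource.close)   # register the fallback release
        raise RuntimeError("something went wrong mid-scope")
except RuntimeError:
    pass

# All tracked resources were released, in reverse acquisition order,
# despite the exception.
assert released == ["task", "file", "socket"]
```

&lt;p&gt;When a resource legitimately outlives the scope, it can be detached from the stack (in &lt;code&gt;ExitStack&lt;/code&gt; terms, via &lt;code&gt;pop_all&lt;/code&gt;) and handed over to a longer-lived owner, which keeps the fallback guarantee intact for everything that remains.&lt;/p&gt;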

&lt;p&gt;Concurrency often makes resource management more challenging. First of all, concurrent tasks running in the background are a kind of resource that needs to be released when no longer needed. Moreover, they can acquire other resources that should be released when the task is terminated, even in case of asynchronous cancellation. In simple cases, there is a limited number of long-running concurrent tasks, which are responsible for releasing the resources acquired by them, and their termination is synchronized with the main task; short-lived jobs can run concurrently on execution pools that distribute those jobs among a number of long-running concurrent tasks. When more flexibility is desired, &lt;a href="https://en.wikipedia.org/wiki/Structured_concurrency"&gt;&lt;em&gt;structured concurrency&lt;/em&gt;&lt;/a&gt; can help manage concurrent code in a more organized and predictable manner by organizing concurrent tasks into a structured hierarchy with well-defined scopes and lifetimes.&lt;sup id="fnref8"&gt;8&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;Individual parts of a distributed system may operate at different pace. Moreover, for performance reasons, it is common to apply pipelining techniques when communicating with remote nodes, i.e. proceed without waiting for a response or acknowledgement from the remote node in order to hide network latency. Compensating for the delays and variability in throughput demands some kind of explicit or implicit buffering, e.g. buffered channels, send/receive queues, pending request trackers, out-of-context message buffers, etc. The amount of buffered state tends to grow under certain conditions, e.g. under heavy load or during network instability. Therefore, there should be some mechanisms to prevent &lt;em&gt;unbounded growth of state&lt;/em&gt; without compromising liveness. Examples include backpressure, rate limiting, and item expiration and eviction policies.&lt;/p&gt;

&lt;p&gt;In adversarial environments, potential DoS attacks through &lt;em&gt;resource exhaustion&lt;/em&gt; are a major threat. An adversary may attempt to exhaust a node's resources, such as network bandwidth, memory, and computational capacity. In order to effectively mitigate such attacks, they should be prohibitively expensive for the attacker relative to the amount of resources consumed from honest participants. &lt;em&gt;Early detection of protocol violations&lt;/em&gt; and &lt;em&gt;consumer-driven data flow&lt;/em&gt;, as employed in Cardano, can reduce the amount of resources expended by the nodes under attack. It may also be useful to track resource expenditure caused by processing messages from remote peers, as done in Avalanche, and apply &lt;em&gt;fair throttling&lt;/em&gt; to communication channels.&lt;/p&gt;

&lt;p&gt;Proper resource management is indispensable in long-running systems. It can be particularly challenging combined with concurrency. Reliable resource management approaches, e.g. based on lifetimes and lexical scopes, as well as structured concurrency, should be, when applicable, preferred to relying on explicit hand-coded releasing of acquired resources. Potential growth of state due to buffering requires mechanisms for ensuring bounded memory usage. Resource exhaustion attacks should be anticipated in adversarial environments and mitigated by minimizing their impact on honest nodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Maintainability
&lt;/h2&gt;

&lt;p&gt;Maintenance of distributed systems is challenging. Those systems are usually long-running critical parts of infrastructure with high reliability requirements. They are complex systems consisting of multiple nodes, often operated independently by different entities. Publicly available deployments are also subject to malicious attacks. Thus effective maintenance of distributed systems demands comprehensive mechanisms and tools.&lt;/p&gt;

&lt;p&gt;First of all, deploying distributed systems may require specific &lt;em&gt;bootstrapping&lt;/em&gt; procedures in order to ensure a secure setup for the whole system. Different distributed protocols may have different requirements and rely on different assumptions for the &lt;em&gt;setup phase&lt;/em&gt;.&lt;sup id="fnref9"&gt;9&lt;/sup&gt; Protocol implementations should be clear about the requirements and assumptions for their setup phase. Long-running, highly available distributed systems should be capable of &lt;em&gt;upgrading&lt;/em&gt; individual nodes with newer versions of the protocol implementation without disrupting the whole system. This requires designing for &lt;em&gt;backward and forward compatibility&lt;/em&gt;. Similarly, failed nodes should be able to &lt;em&gt;recover&lt;/em&gt; and &lt;em&gt;rejoin&lt;/em&gt; the system safely and efficiently. Moreover, it is also desired that the system is able to safely recover from a massive crash, i.e. provide &lt;a href="https://en.wikipedia.org/wiki/Durability_(database_systems)"&gt;durability&lt;/a&gt;. Therefore, the protocols should be designed and implemented with a clear recovery procedure.&lt;/p&gt;

&lt;p&gt;Distributed system administrators need mechanisms and tools for monitoring individual nodes in order to analyze the system and promptly detect anomalies. Developers also need effective mechanisms for analyzing, diagnosing issues, and identifying bugs in protocol implementations. &lt;em&gt;Logging&lt;/em&gt;, &lt;em&gt;tracing&lt;/em&gt;, and collecting &lt;em&gt;metrics&lt;/em&gt; are common &lt;em&gt;observability&lt;/em&gt; techniques to allow monitoring and obtaining diagnostic information from the system; most of the explored code bases use these techniques. &lt;a href="https://opentelemetry.io/"&gt;OpenTelemetry&lt;/a&gt; and &lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt; are popular open-source monitoring solutions, which are used in many of the explored code bases.&lt;/p&gt;

&lt;p&gt;Diagnostic logging typically refers to emitting and recording chronological &lt;em&gt;textual&lt;/em&gt; messages that capture important events happening during the execution of software. Messages in diagnostic logs are traditionally assigned a severity level that can be used to disable logging of messages below a certain severity level, e.g. debug messages. Log messages can also carry structured data along with the formatted text, e.g. key-value context fields. Logging can be organized hierarchically, reflecting the structure of components within the system. Messages in &lt;em&gt;hierarchical logging&lt;/em&gt; are usually automatically enriched with context from higher-level components.&lt;/p&gt;

&lt;p&gt;Tracing is somewhat similar to logging, but it is focused on capturing a detailed view of the flow of execution in the system. Tracing records are primarily &lt;em&gt;structured&lt;/em&gt; rather than textual and reflect &lt;em&gt;causal relationships&lt;/em&gt;. In particular, &lt;em&gt;distributed tracing&lt;/em&gt; is tracking of events caused by processing individual logical operations, such as user requests or transactions, across different components of a distributed system. A distributed trace is associated with a single logical operation and consists of spans linked with causal relationships where each span represents a particular activity within the operation. Spans normally contain structured data describing the corresponding activity and timing information.&lt;/p&gt;
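
&lt;p&gt;The span structure described above can be sketched as follows (an illustrative toy, not any real tracing API): each span carries a trace identifier, a causal link to its parent span, and timing information.&lt;/p&gt;

```rust
// A minimal sketch of distributed-tracing spans (illustrative toy): each
// span records which activity it represents, its timing, and a causal
// link to its parent span within the same trace.

use std::time::{Duration, Instant};

#[derive(Debug)]
struct Span {
    trace_id: u64,          // identifies the whole logical operation
    span_id: u64,           // identifies this particular activity
    parent_id: Option<u64>, // causal link; None for the root span
    name: &'static str,
    started: Instant,
    elapsed: Option<Duration>,
}

impl Span {
    fn root(trace_id: u64, name: &'static str) -> Span {
        Span { trace_id, span_id: 1, parent_id: None, name, started: Instant::now(), elapsed: None }
    }

    // A child span inherits the trace id and records its causal parent.
    fn child(&self, span_id: u64, name: &'static str) -> Span {
        Span {
            trace_id: self.trace_id,
            span_id,
            parent_id: Some(self.span_id),
            name,
            started: Instant::now(),
            elapsed: None,
        }
    }

    fn finish(&mut self) {
        self.elapsed = Some(self.started.elapsed());
    }
}

fn main() {
    let mut request = Span::root(42, "handle-request");
    let mut validate = request.child(2, "validate");
    validate.finish();
    request.finish();
    println!("{validate:?}");
}
```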

&lt;p&gt;Metrics represent numeric measurements that describe the system's behavior over time. Metrics are typically collected and aggregated at regular intervals. They can include various types of information such as CPU and memory utilization, latency, error rates, throughput, queue lengths, etc. There are different kinds of metrics; the most widely used are counters, gauges, and histograms. A counter is a cumulative metric monotonically increasing over time; a gauge expresses the current value of some measurement; a histogram records sampled observations in a statistical representation.&lt;/p&gt;
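
&lt;p&gt;The three metric kinds can be sketched in a few lines of Rust (an illustrative toy, not a real metrics library):&lt;/p&gt;

```rust
// Minimal sketches of the three common metric kinds (illustrative, not a
// real metrics library): a monotonically increasing counter, a gauge
// holding the current value, and a histogram bucketing observations.

#[derive(Default)]
struct Counter { total: u64 }
impl Counter {
    fn inc(&mut self, by: u64) { self.total += by; } // only ever grows
}

#[derive(Default)]
struct Gauge { value: f64 }
impl Gauge {
    fn set(&mut self, v: f64) { self.value = v; } // may go up or down
}

struct Histogram { bounds: Vec<f64>, buckets: Vec<u64> }
impl Histogram {
    fn new(bounds: Vec<f64>) -> Self {
        let buckets = vec![0; bounds.len() + 1]; // last bucket catches the rest
        Histogram { bounds, buckets }
    }
    fn observe(&mut self, v: f64) {
        // Count the observation in the first bucket whose upper bound fits it.
        let i = self.bounds.iter().position(|&b| v <= b).unwrap_or(self.bounds.len());
        self.buckets[i] += 1;
    }
}

fn main() {
    let mut requests = Counter::default();
    requests.inc(1);
    let mut queue_len = Gauge::default();
    queue_len.set(3.0);
    let mut latency = Histogram::new(vec![0.01, 0.1, 1.0]);
    latency.observe(0.05); // falls into the second bucket
    println!("requests={} queue={} buckets={:?}",
             requests.total, queue_len.value, latency.buckets);
}
```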

&lt;p&gt;Observability is a &lt;a href="https://en.wikipedia.org/wiki/Cross-cutting_concern"&gt;cross-cutting concern&lt;/a&gt;. Most implementations define abstract interfaces for logging, tracing, and capturing metrics and require them as dependencies across components; some use code instrumentation techniques. Cardano uses an interesting approach to implement observability features, called &lt;em&gt;contravariant tracing&lt;/em&gt;, in which domain-specific values are provided to domain-agnostic processors. The &lt;a href="https://en.wikipedia.org/wiki/Functor#Covariance_and_contravariance"&gt;contravariance&lt;/a&gt; property allows domain-agnostic tracers to be adapted to stand in where a domain-specific tracer is required. This discourages the use of textual encoding for diagnostic logging/tracing in favor of dedicated domain-specific event types. Contravariant tracing can also be used to collect metrics.&lt;/p&gt;
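
&lt;p&gt;The essence of contravariant tracing can be sketched in Rust (an illustrative approximation of the idea; the names are made up, and Cardano's actual implementation is in Haskell): a tracer consuming values of type A plus a function from B to A yields a tracer consuming values of type B.&lt;/p&gt;

```rust
// A minimal sketch of contravariant tracing: a Tracer<A> consumes values
// of type A, and `contramap` adapts a domain-agnostic tracer into one
// accepting domain-specific events (all names are illustrative).

struct Tracer<A> {
    emit: Box<dyn Fn(&A)>,
}

impl<A: 'static> Tracer<A> {
    fn new(f: impl Fn(&A) + 'static) -> Self {
        Tracer { emit: Box::new(f) }
    }

    fn trace(&self, event: &A) {
        (self.emit)(event);
    }

    // Contravariance: from Tracer<A> and a function B -> A, derive Tracer<B>.
    fn contramap<B: 'static>(self, f: impl Fn(&B) -> A + 'static) -> Tracer<B> {
        Tracer::new(move |b| self.trace(&f(b)))
    }
}

// A domain-specific event type, handled by a domain-agnostic processor.
#[derive(Debug)]
enum ConsensusEvent {
    Proposed { round: u64 },
    Decided { round: u64 },
}

fn main() {
    // A domain-agnostic tracer that just prints strings...
    let string_tracer: Tracer<String> = Tracer::new(|s| println!("{s}"));
    // ...adapted to stand in where a ConsensusEvent tracer is required.
    let consensus_tracer: Tracer<ConsensusEvent> =
        string_tracer.contramap(|e: &ConsensusEvent| format!("{e:?}"));
    consensus_tracer.trace(&ConsensusEvent::Proposed { round: 1 });
    consensus_tracer.trace(&ConsensusEvent::Decided { round: 1 });
}
```

&lt;p&gt;The same adaptation step is what lets a single generic backend serve many protocol components, each with its own dedicated event type.&lt;/p&gt;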

&lt;p&gt;Detailed logging and tracing can add significant overhead. When logging a large amount of diagnostic data is expensive, logging can be &lt;em&gt;sampled&lt;/em&gt;, producing only a subset of the total messages based on a predetermined sampling rate or criteria. Contravariant tracing incurs zero runtime cost if the program is compiled with tracing disabled; this holds even for a tracer that ignores only certain types of events.&lt;/p&gt;

&lt;p&gt;Fault-tolerant distributed protocols should be designed and implemented with clear bootstrapping, upgrading, and recovery procedures. Note that upgradability relies on backward and forward compatibility of the implementation. It is also worth considering the durability feature, i.e. the ability to safely recover the system from a massive crash. There should be seamless support for usual observability and diagnostic mechanisms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Flexibility
&lt;/h2&gt;

&lt;p&gt;Flexible software is able to adapt to changing requirements without having to undergo extensive restructuring. Flexibility is crucial for &lt;em&gt;adoption&lt;/em&gt;, &lt;em&gt;reuse&lt;/em&gt;, and &lt;em&gt;evolution&lt;/em&gt; of code. Each explicit or implicit &lt;em&gt;assumption&lt;/em&gt; or &lt;em&gt;requirement&lt;/em&gt; imposed on how the code can be used is an additional &lt;em&gt;constraint&lt;/em&gt; reducing flexibility. The explored code bases were primarily meant to implement particular protocols or serve specific purposes rather than to address fundamental needs for implementing distributed systems in general. This is a common approach to building software, but it tends to result in rather limited flexibility of the code.&lt;/p&gt;

&lt;p&gt;In general, highly &lt;em&gt;modular&lt;/em&gt; and &lt;em&gt;composable&lt;/em&gt; code is also more flexible. Clear &lt;a href="https://en.wikipedia.org/wiki/Separation_of_concerns"&gt;&lt;em&gt;separation of concerns&lt;/em&gt;&lt;/a&gt; through abstract interfaces and &lt;a href="https://en.wikipedia.org/wiki/Dependency_inversion_principle"&gt;dependency inversion&lt;/a&gt; contributes to flexibility by enabling interchangeable components, as well as facilitating easier code modifications and extensions. Flexibility can also be enhanced with &lt;em&gt;generic&lt;/em&gt; and &lt;em&gt;configurable&lt;/em&gt; components. &lt;a href="https://en.wikipedia.org/wiki/Generic_programming"&gt;&lt;em&gt;Generic programming&lt;/em&gt;&lt;/a&gt; techniques, such as &lt;a href="https://en.wikipedia.org/wiki/Parametric_polymorphism"&gt;parametric polymorphism&lt;/a&gt;, encourage the development of more generic and adaptable components that can be used in different contexts without modification.&lt;/p&gt;
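
&lt;p&gt;A tiny Rust sketch of these ideas (all names are hypothetical): a component that depends only on an abstract transport trait, so any conforming implementation can be swapped in without modifying the component.&lt;/p&gt;

```rust
// A sketch of dependency inversion plus parametric polymorphism
// (illustrative): the broadcaster depends only on the abstract Transport
// trait, so implementations remain interchangeable.

trait Transport {
    fn send(&mut self, to: &str, msg: &str);
}

// A generic component: works with any Transport, chosen at compile time.
struct Broadcaster<T: Transport> {
    peers: Vec<String>,
    transport: T,
}

impl<T: Transport> Broadcaster<T> {
    fn broadcast(&mut self, msg: &str) {
        for peer in &self.peers {
            self.transport.send(peer, msg);
        }
    }
}

// One interchangeable implementation: records sends for inspection.
#[derive(Default)]
struct RecordingTransport {
    sent: Vec<(String, String)>,
}

impl Transport for RecordingTransport {
    fn send(&mut self, to: &str, msg: &str) {
        self.sent.push((to.to_string(), msg.to_string()));
    }
}

fn main() {
    let mut b = Broadcaster {
        peers: vec!["node-1".into(), "node-2".into()],
        transport: RecordingTransport::default(),
    };
    b.broadcast("hello");
    println!("{:?}", b.transport.sent);
}
```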

&lt;p&gt;The ability to seamlessly integrate into larger systems is another aspect of flexibility required for adoption and reuse of protocol implementations. Since larger systems may opt for different programming languages and runtime environments, it is important to support interfacing with other languages and impose minimal runtime requirements. Rust is particularly suitable to implement robust software components for integrating into other languages and environments due to its rich language features, zero-cost abstractions, predictable performance, safe memory management without a garbage collector, and the ability to use custom concurrency runtimes.&lt;/p&gt;

&lt;p&gt;Designing for flexibility promotes adoption, reuse, and evolution of code. Following this approach should be a deliberate choice from the beginning. Avoiding strong constraints, assumptions, and requirements, and aiming at modularity and composability with generic and configurable components, makes for greater flexibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;I learned a lot while exploring those 14 code bases. I have acquired a much deeper understanding of what is important for a practical distributed protocol implementation and what the typical challenges are. I have seen different approaches in use and discovered some interesting ideas and techniques scattered around. Still, I find the ways distributed protocols are implemented quite unsatisfactory. Even for an engineer experienced in implementing this kind of protocol, most of the code bases were fairly hard to comprehend and follow. I can imagine how much effort it took and how painful it was to first make them work, as well as to improve them later.&lt;/p&gt;

&lt;h3&gt;
  
  
  Status Quo
&lt;/h3&gt;

&lt;p&gt;Most of the time, it was rather hard to follow the main protocol, its causal dependencies, and its logical connections in code that was presumably structured around operational aspects: fragmented, entangled, and cluttered. Structuring the protocol implementation directly around simplistic communication mechanisms that forgo the reliable, ordered delivery guarantees provided by transport layers, expressing concurrency and synchronization explicitly in terms of low-level mechanisms based on shared mutable memory and lock-based primitives, or attempting to evade concurrency in favor of sequential state machines: all of these seem to fragment the protocol logic across the code base, shift the focus towards operational technicalities, and clutter the code with boilerplate and hand-coded flow control, context switching, resource management, etc. Invasive ad hoc optimizations, patches, and cross-cutting concerns also contribute to muddling the code. Often insufficient modularity, unclear structure, excessive coupling, and an abundance of mutable state complicate the matter further. The use of sophisticated techniques and the lack of inline documentation present additional obstacles to understanding.&lt;/p&gt;

&lt;p&gt;It seems barely possible to fully convince oneself that the majority of the implementations actually correspond closely to the original protocol and guarantee the claimed properties under unfavorable conditions. The way the protocols are expressed in code does not look anything like a formal specification. The ever-present possibility of subtle issues such as race conditions, deadlocks, resource starvation, and, in some languages, data races manifesting themselves in such complicated code bases does not inspire confidence either. Unconstrained nondeterminism and an abundance of non-local operations result in state space explosion, making rigorous test-based verification infeasible. Only a few of the implementations support reproducible testing with deterministic discrete-event simulation of unmodified code.&lt;/p&gt;

&lt;p&gt;It also seems unclear how many of the implementations would behave under certain high-load conditions, e.g. under a denial-of-service attack. Protocol violations are not always detected at the earliest possible stage of processing incoming data; many implementations lack mechanisms for propagating information about detected anomalies towards lower communication layers in order to restrict communication with offending nodes. The majority of the implementations employ the push style of communication and forgo the flow control mechanisms of transport layers, so individual remote nodes can potentially send arbitrary amounts of data that the receiving node has to deal with in time. Uncomfortably often there are unbounded buffers and queues with no clear mechanism to control the growth of state. Unreliable explicit hand-coded resource management can cause resource leaks, including concurrent tasks left dangling in the background.&lt;/p&gt;

&lt;p&gt;In terms of observability, most of the protocol implementations rely on simple logging with context fields and collect various metrics. However, this may not provide enough details and context for effective debugging and analysis of the protocol execution.&lt;/p&gt;

&lt;p&gt;The explored code bases are quite specific to particular protocols, execution environments, and use cases. Modularity there is rather coarse and most of the components are not meant to be reused or recombined; tight coupling is also not rare. This harms adaptability and reusability of the code, making it inflexible.&lt;/p&gt;

&lt;h3&gt;
  
  
  We Can Do Better
&lt;/h3&gt;

&lt;p&gt;I think we can do much better. We should not waste our efforts reinventing the wheel over and over again and repeating the same mistakes. Builders should be able to focus on implementing the functionality specific to their solutions without having to figure out how to approach implementing the tricky but critically important distributed protocols. There should be a framework that solves the problem of implementing distributed protocols once and for all: a framework rich with easy-to-use, reliable primitives and components that can be taken as is or mixed and matched as needed, a framework that guides towards robust and understandable code, a framework that supports analyzing, monitoring, testing, and debugging protocol implementations, a framework that is reasonably efficient and can be easily integrated into various environments.&lt;/p&gt;

&lt;p&gt;The framework should guide away from incidental, non-essential complexity and allow expressing protocol implementations in clear and understandable code. Protocol implementations should be structured primarily around functional and logical aspects with clear separation of concerns, with operational technicalities and sophisticated techniques hidden behind simple and clearly defined abstractions. Fine modularity of reasonably small and simple components, expressed in a more declarative notation with a reduced number of non-local operations, should facilitate local reasoning. Components should have minimal internal state, as well as clearly defined requirements, properties, and external dependencies.&lt;/p&gt;

&lt;p&gt;Concurrency requires special attention since it is unavoidable, tricky, and can add a great deal of complexity, whereas designing for concurrency can actually be advantageous in terms of code structure and modularity. Working with concurrency should be as safe, easy, and efficient as possible. Low-level concurrency mechanisms, such as OS threads, lock-based synchronization primitives, and shared mutable memory, should only be used for implementing the internals of higher-level, safer, and easier-to-use concurrency mechanisms, such as concurrency runtimes. Expressing concurrent parts of the protocol in code should feel as natural as expressing sequential ones. This can be achieved with syntactical means, abstractions, and a concurrency model that recognizes any causally independent operations as potentially concurrent.&lt;/p&gt;

&lt;p&gt;Since communication interfaces can greatly affect the structure of code built around them, we need a range of communication mechanisms with aligned interfaces and clearly defined properties. There should be different levels of abstraction for communication. Higher levels should take advantage of the properties already guaranteed by lower-level layers, such as the reliable, ordered communication channels with flow control provided by commonly used transport layers. Expressing communication in stateful sessions can help capture the causal relationships between individual messages and greatly simplify protocol implementations. Similarity between internal and external communication can suggest better abstractions.&lt;/p&gt;
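
&lt;p&gt;One way to make the causal link between messages explicit can be sketched with plain standard-library channels (an illustrative toy, not the framework's actual design): each request carries its own reply channel, so the response is tied to the request by construction rather than reconstructed from message ordering.&lt;/p&gt;

```rust
use std::sync::mpsc;
use std::thread;

// A sketch of session-style communication (illustrative): each request
// carries a dedicated reply channel, making the causal relationship
// between a request and its response explicit in the types.

struct Request {
    payload: u64,
    reply: mpsc::Sender<u64>,
}

// The server side of the session: answer each request on its own reply
// channel (the doubling "service" is just for illustration).
fn serve(rx: mpsc::Receiver<Request>) {
    for req in rx {
        let _ = req.reply.send(req.payload * 2);
    }
}

fn main() {
    let (tx, rx) = mpsc::channel::<Request>();
    let server = thread::spawn(move || serve(rx));

    // Client session: wait for the response causally linked to this
    // particular request on the dedicated reply channel.
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(Request { payload: 21, reply: reply_tx }).unwrap();
    println!("{}", reply_rx.recv().unwrap()); // prints 42

    drop(tx); // closing the request channel lets the server loop end
    server.join().unwrap();
}
```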

&lt;p&gt;Flexibility is extremely important to make the framework applicable to an open-ended space of use cases. Therefore, we should by all means avoid strong constraints, assumptions, and requirements. The framework should support integration with different programming languages and runtime environments. Its components should be generic and configurable. It should also support backward and forward compatibility. Composability is critical for ensuring great adaptability, reusability, and evolvability. It requires unified means of abstraction and combination. Generic programming techniques, such as parametric polymorphism, can be used to make components generic; functional and asynchronous programming techniques can be great sources of ideas for enhancing composability, particularly with concurrency.&lt;/p&gt;

&lt;p&gt;Correctness of distributed protocol implementations should be verifiable, in terms of both safety and liveness properties. Formal verification methods are able to provide rigorous assurance about the correctness of protocols and their implementations. However, since formal verification involves exhaustively analyzing all possible states of a system, it may become infeasible for large and complex components. Fine modularity, components amenable to local reasoning, as well as reducing the number of non-local and nondeterministic operations, can help make formal verification more tractable. In order to maintain equivalence between a formally verified protocol specification and its implementation in code, the implementation should be expressed as close to the formal specification as possible, preferably using an identical notation. Type safety techniques, such as ownership, typestates, session types, and linear and uniqueness types, can greatly help to ensure correctness of the code by making it virtually safe by construction in terms of certain properties. Hybrid approaches combining formal verification with other testing methods can be used to achieve decently high assurance about correctness where purely formal methods become infeasible. Deterministic discrete-event simulation of unmodified code is a particularly powerful technique to complement other verification methods with randomized, reproducible testing. Confining nondeterministic aspects behind abstract interfaces in code and being able to control nondeterminism during simulation is key to enabling reproducible testing.&lt;/p&gt;
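
&lt;p&gt;As an illustration of the typestate idea mentioned above (a hypothetical toy protocol, not any real one): each protocol phase is a distinct type, and transitions consume the previous state, so invalid sequences simply fail to compile.&lt;/p&gt;

```rust
// A sketch of the typestate pattern: each phase is a distinct type, and
// transitions consume the old state, so invalid sequences (e.g. deciding
// before voting) are rejected by the compiler. The phases here are
// hypothetical, not taken from any real protocol.

struct Proposed { value: u64 }
struct Voted { value: u64, votes: u32 }
struct Decided { value: u64 }

impl Proposed {
    fn collect_votes(self, votes: u32) -> Voted {
        Voted { value: self.value, votes }
    }
}

impl Voted {
    // Deciding is only possible from the Voted state, and only with a quorum.
    fn decide(self, quorum: u32) -> Result<Decided, Voted> {
        if self.votes >= quorum {
            Ok(Decided { value: self.value })
        } else {
            Err(self)
        }
    }
}

fn main() {
    let proposed = Proposed { value: 42 };
    let voted = proposed.collect_votes(3);
    // `proposed` was consumed above; using it again would not compile.
    match voted.decide(2) {
        Ok(d) => println!("decided on {}", d.value),
        Err(_) => println!("no quorum yet"),
    }
}
```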

&lt;p&gt;Distributed protocols and their implementations should provide strong guarantees even under unfavorable conditions, especially those supposed to be deployed in adversarial environments. The framework should employ a reliable approach to resource management in concurrent code, e.g. one based on lifetimes and lexical scopes, such as structured concurrency. There should be mechanisms for flow control preventing unlimited growth of state and ensuring bounded memory usage. The framework should emphasize threat-aware design. The potential impact of resource exhaustion (DoS) attacks should be minimized with early detection of protocol violations and by propagating information about detected anomalies, so as to maintain peer reputation metrics and take appropriate measures. For that reason, consumer-driven patterns of communication should be preferred. There should also be mechanisms for safe and efficient recovery of failed nodes from persistent storage. Supporting durability, i.e. safe recovery after a massive system crash, is also desirable.&lt;/p&gt;
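
&lt;p&gt;Scope-based resource management for concurrency can be illustrated with Rust's standard &lt;code&gt;std::thread::scope&lt;/code&gt;: every thread spawned inside the scope is guaranteed to have finished before the scope returns, so no task can dangle in the background.&lt;/p&gt;

```rust
use std::thread;

// A sketch of scope-based (structured) concurrency: with
// std::thread::scope, all spawned threads are joined by the time the
// scope returns, tying task lifetimes to a lexical scope.

fn parallel_sum(chunks: &[Vec<u64>]) -> u64 {
    thread::scope(|s| {
        // Scoped threads may borrow `chunks` because the scope guarantees
        // they finish before the borrow ends.
        let handles: Vec<_> = chunks
            .iter()
            .map(|chunk| s.spawn(move || chunk.iter().sum::<u64>()))
            .collect();
        // Joining is explicit here, but even on panic the scope itself
        // waits for all spawned threads before unwinding further.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let chunks = vec![vec![1, 2, 3], vec![4, 5], vec![6]];
    println!("{}", parallel_sum(&chunks)); // prints 21
}
```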

&lt;p&gt;Protocols and their implementations should be clear about bootstrapping and recovery requirements and procedures. Upgradability requires backward and forward compatibility. There should be seamless support for the usual observability and diagnostic mechanisms, such as logging, tracing, and collecting metrics. It may also be useful to provide mechanisms for tracking resource expenditure caused by processing incoming data. In place of simple logging with context fields, it seems advantageous to support structured distributed tracing using domain-specific trace event types. This kind of tracing could also be suitable for collecting metrics. It is important to minimize the incurred overhead when tracing is disabled. Code instrumentation can help to avoid cluttering code with tracing boilerplate.&lt;/p&gt;

&lt;p&gt;The framework should provide good performance and support various optimizations, such as speculative and on-demand execution, caching, and flexible prioritization policies. To support protocol-level optimizations, the framework should allow expressing complex communication patterns. The communication layer should prevent undesired effects such as head-of-line blocking, optimize data flow, and take advantage of the properties provided by the transport layers. Lightweight user threads and non-blocking algorithms allow achieving high concurrency without compromising efficiency. Zero-copy techniques can be used to eliminate unnecessary copying of data.&lt;/p&gt;

&lt;p&gt;So we need a structured, yet flexible enough, approach guiding away from incidental complexity towards understandability, fine modularity, and composability. The framework's components should be generic and configurable, allowing local reasoning about the implementation. Its concurrency and communication abstractions should be safe and easy to use, structured, and composable. We should be serious about correctness and resilience against unfavorable conditions. The framework should also cater for maintenance needs and provide great observability and diagnostic mechanisms. It should deliver decent performance and allow for various optimizations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;Having explored those implementations of distributed protocols, it has become much clearer to me what is worth focusing on while developing the Replica_IO framework. I define the following key areas of focus:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;simplicity&lt;/em&gt;: making protocol implementations well structured and understandable;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;flexibility&lt;/em&gt;: keeping the framework adaptable, widely applicable, and evolvable;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;reliability&lt;/em&gt;: ensuring that protocol correctness is verifiable and the implementation is resilient;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;efficiency&lt;/em&gt;: allowing for various optimizations and delivering good performance;&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;maintainability&lt;/em&gt;: catering for maintenance needs and providing great diagnostic mechanisms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Achieving all of that at once is obviously not realistic. Therefore, the primary focus will initially be on simplicity, flexibility, and reliability, without neglecting the remaining aspects. Of particular interest are the matters of structure and notation supporting composability in concurrency and communication mechanisms, as well as controlling nondeterminism.&lt;/p&gt;

&lt;p&gt;Exploring distributed protocol implementations was the first phase of the initial state-of-the-art exploration. The next step is to select and examine some existing frameworks for developing distributed protocols in order to find out how they attempt to approach the problem and, perhaps, also discover some interesting techniques or ideas employed there. Then there are some potentially related concepts, approaches, and techniques worth looking into. The exploration tasks are tracked in the scope of &lt;a href="https://github.com/replica-io/replica-io/issues/7"&gt;this issue&lt;/a&gt; on GitHub.&lt;/p&gt;

&lt;p&gt;Once the initial exploratory stage is over, it will be time to come up with key ideas concerning the core principles that will guide the process of designing and implementing generic components within the framework (milestone &lt;a href="https://github.com/replica-io/replica-io/milestone/1"&gt;M0.1&lt;/a&gt;). Then those ideas will be developed into clearly formulated concepts (milestone &lt;a href="https://github.com/replica-io/replica-io/milestone/2"&gt;M0.2&lt;/a&gt;), and their feasibility will be verified with code (milestone &lt;a href="https://github.com/replica-io/replica-io/milestone/3"&gt;M0.3&lt;/a&gt;). After that, prototype, MVP, and production versions of the framework will be developed and released (milestones &lt;a href="https://github.com/replica-io/replica-io/milestone/4"&gt;M1&lt;/a&gt;, &lt;a href="https://github.com/replica-io/replica-io/milestone/5"&gt;M2&lt;/a&gt;, and &lt;a href="https://github.com/replica-io/replica-io/milestone/6"&gt;M3&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;This does not at all mean that exploration, ideation, and prototyping will not take place at later stages; the milestones simply define the framework's general level of maturity. The framework will continuously evolve and expand, at some point becoming a de facto standard for implementing critical fault-tolerant systems and providing a growing collection of easy-to-use, reliable, and efficient distributed replication mechanisms.&lt;/p&gt;

&lt;p&gt;If you like the project and find it valuable, please &lt;a href="https://github.com/sponsors/replica-io"&gt;support&lt;/a&gt; its further development! 🙏&lt;br&gt;
&lt;a href="https://github.com/sponsors/replica-io" class="ltag_cta ltag_cta--branded"&gt;❤️ Replica_IO&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;If you have any thoughts you would like to share or any questions regarding this post, please add a comment &lt;a href="https://github.com/orgs/replica-io/discussions/35"&gt;here&lt;/a&gt;. You are also welcome to &lt;a href="https://github.com/orgs/replica-io/discussions/new/choose"&gt;start a new discussion&lt;/a&gt; or chime in on &lt;a href="https://discordapp.com/invite/CzPfN75URD"&gt;our Discord&lt;/a&gt; server.&lt;/p&gt;




&lt;ol&gt;

&lt;li id="fn1"&gt;
&lt;p&gt;If you know of some other implementation that I should have absolutely looked into for some reason, please let me know. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn2"&gt;
&lt;p&gt;Actually, incidental complexity can start creeping in even earlier, into the way we &lt;em&gt;think&lt;/em&gt; about distributed systems, but let's not go into this here. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn3"&gt;
&lt;p&gt;Zarko Milosevic, CTO at Informal Systems, &lt;a href="https://www.youtube.com/watch?v=c4BQ7v-CQfk&amp;amp;t=296s"&gt;tells&lt;/a&gt; in his invited talk at &lt;a href="https://research.protocol.ai/sites/consensusday23/"&gt;ConsensusDays 23&lt;/a&gt; how a small protocol change addressing a major issue resulted in months of implementation work on the Tendermint code base. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn4"&gt;
&lt;p&gt;In his talk "&lt;a href="https://www.youtube.com/watch?v=qlAKyivFxGQ"&gt;Using STM for Modular Concurrency&lt;/a&gt;", Duncan Coutts expands on the approach to concurrency employed by Cardano. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn5"&gt;
&lt;p&gt;More about the threat-aware design approach in &lt;a href="https://iohk.io/en/research/library/papers/introduction-to-the-design-of-the-data-diffusion-and-networking-for-cardano-shelley/"&gt;"Introduction to the design of the Data Diffusion and Networking for Cardano Shelley"&lt;/a&gt;. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn6"&gt;
&lt;p&gt;The &lt;code&gt;typed-protocols&lt;/code&gt; framework was presented in the talk "&lt;a href="https://skillsmatter.com/skillscasts/14633-45-minute-talk-by-duncan-coutts"&gt;Well-Typed Communication Protocols&lt;/a&gt;" by Duncan Coutts. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn7"&gt;
&lt;p&gt;&lt;a href="https://github.com/input-output-hk/ouroboros-consensus/blob/release-ouroboros-consensus-0.8.0.0/ouroboros-consensus/src/ouroboros-consensus/Ouroboros/Consensus/Util/ResourceRegistry.hs#L80"&gt;&lt;code&gt;ResourceRegistry&lt;/code&gt;&lt;/a&gt; used in Cardano is an example of a fallback mechanism based on lexical scoping for preventing resource leaks. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn8"&gt;
&lt;p&gt;Nathaniel J. Smith elaborates on structured concurrency in great detail in his blog post &lt;a href="https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/"&gt;"Notes on structured concurrency, or: Go statement considered harmful"&lt;/a&gt;. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;li id="fn9"&gt;
&lt;p&gt;&lt;a href="https://decentralizedthoughts.github.io/2019-07-19-setup-assumptions/"&gt;This post&lt;/a&gt; discusses the setup phase in distributed systems. ↩&lt;/p&gt;
&lt;/li&gt;

&lt;/ol&gt;

</description>
      <category>distributedsystems</category>
      <category>decentralizedcomputing</category>
      <category>faulttolerance</category>
      <category>replication</category>
    </item>
    <item>
      <title>The Story Behind Replica_IO</title>
      <dc:creator>Sergey Fedorov</dc:creator>
      <pubDate>Fri, 05 Apr 2024 10:49:45 +0000</pubDate>
      <link>https://dev.to/replica-io/the-story-behind-replicaio-3036</link>
      <guid>https://dev.to/replica-io/the-story-behind-replicaio-3036</guid>
      <description>&lt;p&gt;This post tells how the &lt;a href="https://replica-io.dev/"&gt;Replica_IO&lt;/a&gt; project originated and explains the&lt;br&gt;
motivation behind it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The original post can be found &lt;a href="https://replica-io.dev/blog/2023/09/04/the-story-behind-replica_io"&gt;here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  My Background
&lt;/h2&gt;

&lt;p&gt;I'd like to start by telling you a bit about my professional background.&lt;br&gt;
I'm a research engineer with quite some experience in software&lt;br&gt;
engineering. I began working as a software engineer back in 2009.&lt;/p&gt;

&lt;p&gt;For the first 7 years, I was mostly focused on developing low-level system&lt;br&gt;
software: I worked with such things as the Linux kernel, microcontrollers,&lt;br&gt;
hardware emulation, and trusted execution. Back then, I particularly&lt;br&gt;
enjoyed contributing to &lt;a href="http://qemu.org/"&gt;QEMU&lt;/a&gt;, a generic,&lt;br&gt;
open-source machine emulator and virtualizer. My contributions included&lt;br&gt;
enhancing emulation of the ARM platform and enabling multithreading&lt;br&gt;
support in the generic binary translation engine.&lt;/p&gt;

&lt;p&gt;In 2016, I took a big leap and moved into research and development in&lt;br&gt;
the areas of blockchain and distributed, decentralized systems. Soon&lt;br&gt;
enough, I became absolutely excited about this, and since then I have&lt;br&gt;
kept expanding my knowledge and experience in that area, in particular&lt;br&gt;
in designing and implementing distributed protocols. During that period,&lt;br&gt;
apart from proprietary work, I worked on the following open-source&lt;br&gt;
projects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="http://github.com/hyperledger-labs/minbft"&gt;MinBFT Hyperledger Lab&lt;/a&gt; — an implementation of the MinBFT
consensus protocol as a pluggable component. I was the main author,
contributor, and maintainer of the project.&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://github.com/filecoin-project/mir"&gt;Mir&lt;/a&gt; — a framework for implementing, debugging, and analyzing
distributed protocols. My main contribution was implementation of
the checkpointing mechanism, protocol garbage collection, and
reproducible testing with simulated time.&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://fil.space/#components"&gt;Interplanetary Consensus (IPC)&lt;/a&gt; — a framework to enable
on-demand horizontal scalability of the Filecoin blockchain. My
main contribution was redesign and implementation of the atomic
cross-chain transaction execution protocol in Rust.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementing Distributed Protocols
&lt;/h2&gt;

&lt;p&gt;As excited as I was about distributed systems, after a while I&lt;br&gt;
started feeling like there's something fundamentally wrong in how we&lt;br&gt;
usually design and implement them.&lt;/p&gt;

&lt;p&gt;Distributed protocols are notoriously complex, and it took academia&lt;br&gt;
significant effort to develop a solid theoretical foundation for them.&lt;br&gt;
Due to inherent concurrency, reasoning about distributed systems&lt;br&gt;
is quite tricky, and there are lots of pitfalls where one gets trapped&lt;br&gt;
pretty quickly, unless being extremely careful. Still, I find this&lt;br&gt;
really fascinating because I particularly love digging deep and&lt;br&gt;
thinking thoroughly.&lt;/p&gt;

&lt;p&gt;However, the way distributed protocols are conventionally described on&lt;br&gt;
paper makes it hardly possible to implement them correctly with&lt;br&gt;
confidence; it's simply too far from the realities of software&lt;br&gt;
engineering. Not only do academic papers often neglect details of&lt;br&gt;
practical importance, but the language and notation used there also&lt;br&gt;
require nontrivial translation to the languages and patterns&lt;br&gt;
commonly used in programming. Add to that the typical issues that&lt;br&gt;
inevitably come up when programming concurrent systems, plus time&lt;br&gt;
pressure, and we end up with a great mess that one can hardly&lt;br&gt;
comprehend and maintain.&lt;/p&gt;

&lt;p&gt;Moreover, it seems like those engineers who get their hands dirty and&lt;br&gt;
implement distributed protocols for practical use tend to jump in and&lt;br&gt;
try applying whatever approach they are used to or that is implied&lt;br&gt;
by the surrounding system. Although one can certainly learn a lot from&lt;br&gt;
such experiments (and I'm doing that), it's generally a waste of effort&lt;br&gt;
when one simply needs to get the thing reliably working. More than&lt;br&gt;
that, since this kind of code is quite hard to get right, inevitable&lt;br&gt;
mistakes creep into such implementations and lurk there unnoticed.&lt;br&gt;
Even when some of those mistakes get revealed, individual projects are&lt;br&gt;
usually too busy and too specific to keep following and effectively&lt;br&gt;
learning from each other.&lt;/p&gt;

&lt;p&gt;Having implemented a couple of distributed protocols myself, I find&lt;br&gt;
this status quo deeply unsatisfactory, especially when it comes to&lt;br&gt;
distributed replication mechanisms such as consensus protocols. After&lt;br&gt;
all, they are supposed to ensure consistency and availability in such&lt;br&gt;
critical computing systems as distributed coordination services,&lt;br&gt;
distributed databases, and blockchains. There is an opinion that the&lt;br&gt;
main obstacle to wider adoption of distributed, decentralized systems,&lt;br&gt;
particularly those capable of tolerating arbitrary (Byzantine) faults,&lt;br&gt;
is their requirement for additional resources and reduced performance.&lt;br&gt;
While it's certainly true that high reliability doesn't come for free,&lt;br&gt;
I think the concerns regarding complexity actually matter a lot in&lt;br&gt;
the end; it's simply hard to get it right.&lt;/p&gt;

&lt;p&gt;I think decentralized Byzantine fault-tolerant mechanisms should&lt;br&gt;
prevail in future computing systems, and we can do a much better job&lt;br&gt;
working towards that. I believe such complex problems can have neat&lt;br&gt;
solutions that are not only efficient but also easy to use. Clearly,&lt;br&gt;
discovering and developing such solutions takes quite some effort.&lt;br&gt;
There have surely been attempts to solve this problem, apparently not&lt;br&gt;
very successful ones. But since I like to think of myself as someone&lt;br&gt;
who discovers smart solutions to hard problems, I'm not too scared;&lt;br&gt;
I'm stubborn enough 😄&lt;/p&gt;

&lt;h2&gt;
  Replica_IO
&lt;/h2&gt;

&lt;p&gt;So I had been thinking about this for years but never managed to find&lt;br&gt;
room for seriously working on it. Then, in February 2023, I was&lt;br&gt;
affected by a lay-off at &lt;a href="https://protocol.ai/"&gt;Protocol Labs&lt;/a&gt; and had to&lt;br&gt;
leave; by that time, I had worked with the company as a long-term&lt;br&gt;
collaborator, a Research Engineer in the &lt;a href="https://research.protocol.ai/groups/consensuslab/"&gt;ConsensusLab&lt;/a&gt;&lt;br&gt;
group, for almost a year. After a while, I realized that this was&lt;br&gt;
actually a great chance to finally start working on what I had been&lt;br&gt;
dreaming of.&lt;/p&gt;

&lt;p&gt;Initially, I thought I would just take a break and spend some time on&lt;br&gt;
a hobby project. I already had a name for it — Replica_IO — which had&lt;br&gt;
come to my mind a few months before, as I had once again been thinking&lt;br&gt;
about communication between replicas in a distributed replication&lt;br&gt;
system. However, once I started asking myself about my real intention&lt;br&gt;
behind this, I realized that it was much bigger than just playing with&lt;br&gt;
a pet project: what I really want is to make a breakthrough in how&lt;br&gt;
distributed systems are designed and implemented!&lt;/p&gt;

&lt;p&gt;In March 2023, I decided to found the Replica_IO project and work on&lt;br&gt;
it full time as an independent research engineer. Since I believe in&lt;br&gt;
open source, open innovation and collaboration, I also wanted to make&lt;br&gt;
it radically open and started developing it entirely in the open from&lt;br&gt;
day one. I &lt;a href="https://github.com/replica-io/replica-io/issues/2"&gt;described&lt;/a&gt; the project's purpose, goals, and&lt;br&gt;
approach, &lt;a href="https://github.com/replica-io/replica-io/issues/3"&gt;created&lt;/a&gt; its logo, &lt;a href="https://github.com/replica-io/replica-io/issues/1"&gt;defined&lt;/a&gt; the initial&lt;br&gt;
roadmap, and started working on the &lt;a href="https://github.com/replica-io/replica-io/milestone/1"&gt;first milestone&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At the time of this writing, I'm &lt;a href="https://github.com/replica-io/replica-io/issues/7"&gt;exploring&lt;/a&gt; some relevant&lt;br&gt;
state of the art, &lt;a href="https://github.com/replica-io/replica-io/wiki/State-of-the-art-exploration"&gt;summarizing&lt;/a&gt; the findings.&lt;br&gt;
Approaching this in a systematic way lets me dive deeper into the&lt;br&gt;
problem, form a more educated opinion, find some inspiration, and&lt;br&gt;
ultimately come up with effective ideas for achieving the project's&lt;br&gt;
key technical objectives.&lt;/p&gt;

&lt;p&gt;I understand how ambitious the goals of this project are and that it&lt;br&gt;
may take a long time to get there, but I'm absolutely sure it is worth&lt;br&gt;
the effort. I'm surprised by how much attention the project has&lt;br&gt;
already attracted, and I would like to see great experts from the&lt;br&gt;
relevant fields get involved and help make it real. I also count on&lt;br&gt;
getting enough support for this initiative, and I'm grateful to those&lt;br&gt;
who have already been helping 🙏&lt;/p&gt;

&lt;p&gt;If you'd like to learn more about this project, please visit the&lt;br&gt;
&lt;a href="//replica-io.dev/about"&gt;About&lt;/a&gt; page and watch &lt;a href="https://youtu.be/oJlryr6bMCo"&gt;this talk&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>decentralizedcomputing</category>
      <category>distributedsystems</category>
      <category>faulttolerance</category>
      <category>replication</category>
    </item>
  </channel>
</rss>
