Understanding System Models in Distributed system

#distributedsystems #softwareengineering #learning #software

A model in a distributed system is an abstraction built on top of the actual system, which encodes several assumptions about how the system will behave, the types of failures it may encounter, what it can handle, how the communication links and timing will behave, and so on. Understanding a model of a system provides the ability to reason about the distributed system, helping us identify where it can be applied and where it cannot.

We will explore some popular models to deepen our understanding.

Models Based on Communication Links

Fair-loss link: This model assumes that messages can be lost or duplicated. Therefore, retransmission of messages is needed to ensure delivery.
Reliable link: This model assumes that messages are not duplicated (similar to data segments in TCP) and that no data is lost. If you are given a fair-loss link but want to implement a system where message loss or duplication is unacceptable, you must ensure that duplicate messages are de-duplicated on the receiver’s side. This adds extra work, diverting focus from core business logic. Therefore, it’s crucial to choose the appropriate model for the use case.
Authenticated link: This model has all the assumptions of a reliable link, with the addition that the receiver can authenticate the sender (e.g., adding TLS on top of a reliable link makes it an authenticated link).

Models Based on Failures

Arbitrary fault: In this model, the core business algorithm can fail due to bugs or unexpected issues. However, the model assumes multiple entities and tolerates the failure of up to one-third of them, allowing the system to continue operating correctly. For example, software controlling a rover’s wheels on Mars ensures that if one server fails or crashes, another can take over the task.
Crash recovery: In this model, processes don’t deviate from their core algorithm. If a failure occurs, the system should restart and return online, although restarting may lead to the loss of online state (e.g., logs or memory caches). A good example is a production database. You don’t want your name to be saved with errors, and operations like updating entries in a database should not introduce bugs. There are replication mechanisms, such as RISK, to ensure recovery in case of a crash.
Crash-stop: This model assumes there are no bugs in the algorithm, and if a crash occurs, the system does not recover. Developers working on these models don't need to worry about recovery or state management, so the algorithms are simpler and easier. These models are common in IoT devices or sensors.

Models Based on Timing Assumptions

Synchronous model: This model assumes that any operation (e.g., sending a message) takes no longer than a certain amount of time. Obviously, this is unrealistic since there can be network failures or system delays.
Asynchronous model: This model assumes that any operation can take an unbounded amount of time. While this may seem less practical, it can simplify problem-solving in certain cases, which we will explore in future blog posts.
Partially synchronous model: This model assumes that the system is synchronous most of the time.

In our upcoming blogs, we will explore different models and scenarios where they may fail.

Conclusion

In conclusion, choosing the appropriate model based on the system's requirements is critical to avoid unnecessary overhead and ensure reliable performance. Future discussions will explore various models and scenarios where these assumptions may fail.

Here are links to my previous posts on distributed systems:

Feel free to check them out and share your thoughts!