lostghost

Distributed Applications. Part 1 - Overview

In this blog series we will examine distributed systems and the modern practices associated with them. This will not be an academic series but a practical one - there won't be citations of papers, only evaluations of software products, many of which were originally based on a paper in a scientific journal. For further reading, I recommend "Designing Data-Intensive Applications" by Martin Kleppmann.

So what is a distributed system? The accepted definition is a software system whose components run on different networked computers. Now why is the networked part important - why can't threads or processes on the same computer form a distributed system? Because the network is unreliable compared to memory. It can also be slower than disk once you add up all the overhead: a modern NVMe drive reads at ~7 GB/s, while 100 Gb Ethernet tops out at ~12.5 GB/s of raw bandwidth (100 Gb/s ÷ 8 bits per byte), before protocol overhead and latency eat into that.

This unreliable network introduces the core challenges faced by distributed systems:

  • It's hard to know whether your message was processed by the recipient (see the sketch after this list)
  • A netsplit (network partition) splits the shared understanding of state
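
To make the first point concrete, here is a minimal Python sketch of a request/reply exchange; the host, port and payload are made up for illustration. When the call times out, the sender cannot tell whether the request was lost, the reply was lost, or the recipient is just slow.

```python
import socket

def send_request(host: str, port: int, payload: bytes, timeout: float = 2.0):
    # Hypothetical request/reply exchange over TCP.
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
        try:
            return conn.recv(4096)  # wait for the recipient's acknowledgement
        except socket.timeout:
            # Ambiguous outcome: the request may have been lost, the reply may
            # have been lost, or the recipient may be slow but still processing.
            # A timeout alone cannot distinguish these cases.
            return None
```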

This is on top of challenges faced by a concurrent program - a good distributed program is also a good concurrent program. The distributed solution is more general.

A separate but related issue is horizontal scaling. A system's processing capacity should grow with the amount of hardware added to it, without the gains getting eaten up by the extra coordination overhead. For stateless compute this is simple enough - spawn more copies of the application - while for a stateful database, sharding needs to be employed, and that involves careful coordination.
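
As a rough illustration of what sharding means, here is a minimal Python sketch assuming a fixed set of four shards; the shard names are made up. Each key is deterministically mapped to one shard, so every node routes the same key to the same place.

```python
import hashlib

# Hypothetical shard names - in reality these would be connection strings.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key: str) -> str:
    # A stable hash ensures every node agrees on where a given key lives.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

print(shard_for("user:42"))    # always routes to the same shard
print(shard_for("user:1337"))  # likely a different shard
```

Note that naive modulo hashing like this reshuffles almost every key when the shard count changes - which is why real systems reach for consistent hashing or range partitioning, plus a coordinated rebalancing step.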

And finally, fault tolerance. At any moment, there is a chance that your program hits a bug and breaks, or slows down to the point of being unresponsive. If your system consists of a large number of such components, something, somewhere is always broken - and your system needs to tolerate that. The same mindset would also make for a good local program, but a local program can simply crash completely when anything breaks. Once again, the problem of distributed computing is the more general one.

And really, the central problem of distributed computing is distributed state - that's why there's so much talk of databases, since they are responsible for it. Each database presents its own tradeoffs, which you need to keep in mind for your actual application. This is in addition to the problems of local state - namely, modeling it correctly and letting it evolve - you may want to look into Domain-Driven Design for that.

What is distributed state? To an individual program, running on an individual computer, there is no local or distributed state, there is only state - the only one it has. Much like to a person, there is no subjective or objective reality, there is only reality - the one they experience. Objective reality, much like distributed state, is a contrivance - it is simply whatever the individual participants can agree on. This means that for distributed state, you need an algorithm - one that, within pre-determined failure conditions, can make it so individual participants can always agree on an objective reality, on distributed state.

Say there is a netsplit: 5 out of 9 nodes recorded an operation, 4 out of 9 didn't, and the two groups can't reach each other. What is the reality - did the operation happen? That is, once again, determined by the algorithm, which is usually based on a rule of thumb: if one node is unavailable, it's most likely because that node is broken; if many nodes are unavailable, the majority decides what the distributed state is, because the majority is more fault-tolerant, having more nodes. But deciding that, in the moment, there is no valid shared state - that we should stop accepting operations and wait for the other nodes to become available again - is also a valid design: you are trading availability for consistency, which is the dilemma posed by the CAP theorem. A possible solution is replication - if the remaining 5 nodes can assume ownership over entities previously owned by the other 4 nodes, because they hold all the necessary data, the system can continue accepting operations. The 4 nodes would need to realise that they are in the minority, stop accepting operations based on stale data, catch up on the data once the majority is reachable again, and only then accept operations again.
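
Here is a minimal Python sketch of that majority rule, assuming a fixed cluster of 9 nodes; the node names are illustrative. Only the side of the split that can still reach a majority keeps accepting operations.

```python
CLUSTER_SIZE = 9
QUORUM = CLUSTER_SIZE // 2 + 1  # 5 out of 9 nodes

def can_accept_operations(reachable_nodes: set) -> bool:
    # The majority side of a netsplit keeps going; the minority side must
    # stop, wait for connectivity, catch up, and only then serve again.
    return len(reachable_nodes) >= QUORUM

print(can_accept_operations({"n1", "n2", "n3", "n4", "n5"}))  # True - majority side
print(can_accept_operations({"n6", "n7", "n8", "n9"}))        # False - minority side
```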

Another common fault is a transient network outage - a message gets lost, and the sender gets no reply. Should the sender send the message again, at the risk of doing the same thing twice (like removing money from someone's bank account, which you don't want to do twice by accident)? This is where the at-least-once and at-most-once delivery guarantees of messaging frameworks come from, and one needs to design around whichever guarantee is chosen.
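
One common way to design around at-least-once delivery is an idempotency key: the sender attaches a unique key to the request and retries freely, while the recipient remembers which keys it has already applied. Here is a minimal in-memory Python sketch; the account names and amounts are made up.

```python
import uuid

processed_keys = set()
balances = {"alice": 100}

def withdraw(account: str, amount: int, idempotency_key: str) -> None:
    # A duplicate delivery of the same request is silently ignored.
    if idempotency_key in processed_keys:
        return
    balances[account] -= amount
    processed_keys.add(idempotency_key)

key = str(uuid.uuid4())
withdraw("alice", 30, key)
withdraw("alice", 30, key)  # retry after a lost reply - applied only once
print(balances["alice"])    # 70, not 40
```

In a real system the set of processed keys would live in the same durable store as the balance and be updated in the same transaction - otherwise a crash between the two writes reintroduces the problem.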

To summarise: regular applications need to correctly model internal state based on the real world, and perform operations on that state that are useful to the business. Distributed applications additionally need an algorithm to agree on shared state, and in doing so, decide on the tradeoff between consistency and availability in the event of outages and netsplits.

In the next post, we will examine the common patterns applications employ to deal with distributed state. Thank you for reading!
