lostghost

Distributed Applications. Part 1 - Overview

In this blog series we will examine distributed systems and the modern practices associated with them. This will not be an academic series but a practical one - there won't be citations of papers, only evaluations of software products, many of which were originally based on a paper in a scientific journal. For further reading, I recommend "Designing Data-Intensive Applications" by Martin Kleppmann.

So what is a distributed system? The accepted definition is a software system whose components run on different networked computers. Now why is the networked part important - why can't threads or processes on the same computer form a distributed system? Because the network is unreliable compared to memory. It can also be slower than disk once you add up all the overhead: a modern NVMe drive reads at ~7 GB/s, while 100 Gb Ethernet tops out at ~12.5 GB/s of raw bandwidth (100 Gb/s ÷ 8 bits per byte), before protocol overhead and latency eat into that.

This unreliable network introduces the core challenges faced by distributed systems:

  • It's hard to know whether your message was processed by the recipient (see the sketch after this list)
  • A netsplit (network partition) splits the shared understanding of state
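
To make the first point concrete, here is a minimal Python sketch of a request/reply exchange; the host, port and payload are made up for illustration. When the call times out, the sender cannot tell whether the request was lost, the reply was lost, or the recipient is just slow.

```python
import socket

def send_request(host: str, port: int, payload: bytes, timeout: float = 2.0):
    # Hypothetical request/reply exchange over TCP.
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(payload)
        try:
            return conn.recv(4096)  # wait for the recipient's acknowledgement
        except socket.timeout:
            # Ambiguous outcome: the request may have been lost, the reply may
            # have been lost, or the recipient may be slow but still processing.
            # A timeout alone cannot distinguish these cases.
            return None
```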

This is on top of challenges faced by a concurrent program - a good distributed program is also a good concurrent program. The distributed solution is more general.

A separate but related issue is horizontal scaling. A system's processing capacity should grow with the amount of hardware added to it, without the gains getting eaten up by the extra coordination overhead. For stateless compute this is simple enough - spawn more copies of the application - while for a stateful database, sharding needs to be employed, and that involves careful coordination.
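
As a rough illustration of what sharding means, here is a minimal Python sketch assuming a fixed set of four shards; the shard names are made up. Each key is deterministically mapped to one shard, so every node routes the same key to the same place.

```python
import hashlib

# Hypothetical shard names - in reality these would be connection strings.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key: str) -> str:
    # A stable hash ensures every node agrees on where a given key lives.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

print(shard_for("user:42"))    # always routes to the same shard
print(shard_for("user:1337"))  # likely a different shard
```

Note that naive modulo hashing like this reshuffles almost every key when the shard count changes - which is why real systems reach for consistent hashing or range partitioning, plus a coordinated rebalancing step.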

And finally, fault tolerance. At any moment, there is a chance that your program hits a bug and breaks, or slows down to the point of being unresponsive. If your system consists of a large number of such components, something, somewhere is always broken - and your system needs to tolerate that. The same mindset would also make for a good local program, but a local program can simply crash completely when anything breaks. Once again, the problem of distributed computing is the more general one.

And really, the central problem of distributed computing is distributed state - that's why there's so much talk of databases, since they are responsible for it. Each database presents its own tradeoffs, which you need to keep in mind for your actual application. This is in addition to the problems of local state - namely, modeling it correctly and letting it evolve - you may want to look into Domain-Driven Design for that.

What is distributed state? To an individual program, running on an individual computer, there is no local or distributed state, there is only state - the only one it has. Much like to a person, there is no subjective or objective reality, there is only reality - the one they experience. Objective reality, much like distributed state, is a contrivance - it is simply whatever the individual participants can agree on. This means that for distributed state, you need an algorithm - one that, within pre-determined failure conditions, can make it so individual participants can always agree on an objective reality, on distributed state.

Say there is a netsplit: 5 out of 9 nodes recorded an operation, 4 out of 9 didn't, and the two groups can't reach each other. What is the reality - did the operation happen? That is, once again, determined by the algorithm, which is usually based on a rule of thumb: if one node is unavailable, it's most likely because that node is broken; if many nodes are unavailable, the majority decides what the distributed state is, because the majority is more fault-tolerant, having more nodes. But deciding that, in the moment, there is no valid shared state - that we should stop accepting operations and wait for the other nodes to become available again - is also a valid design: you are trading availability for consistency, which is the dilemma posed by the CAP theorem. A possible solution is replication - if the remaining 5 nodes can assume ownership over entities previously owned by the other 4 nodes, because they hold all the necessary data, the system can continue accepting operations. The 4 nodes would need to realise that they are in the minority, stop accepting operations based on stale data, catch up on the data once the majority is reachable again, and only then accept operations again.
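
Here is a minimal Python sketch of that majority rule, assuming a fixed cluster of 9 nodes; the node names are illustrative. Only the side of the split that can still reach a majority keeps accepting operations.

```python
CLUSTER_SIZE = 9
QUORUM = CLUSTER_SIZE // 2 + 1  # 5 out of 9 nodes

def can_accept_operations(reachable_nodes: set) -> bool:
    # The majority side of a netsplit keeps going; the minority side must
    # stop, wait for connectivity, catch up, and only then serve again.
    return len(reachable_nodes) >= QUORUM

print(can_accept_operations({"n1", "n2", "n3", "n4", "n5"}))  # True - majority side
print(can_accept_operations({"n6", "n7", "n8", "n9"}))        # False - minority side
```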

Another common fault is a transient network outage - a message gets lost, and the sender gets no reply. Should the sender send the message again, at the risk of doing the same thing twice (like removing money from someone's bank account, which you don't want to do twice by accident)? This is where the at-least-once and at-most-once delivery guarantees of messaging frameworks come from, and one needs to design around whichever guarantee is chosen.
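
One common way to design around at-least-once delivery is an idempotency key: the sender attaches a unique key to the request and retries freely, while the recipient remembers which keys it has already applied. Here is a minimal in-memory Python sketch; the account names and amounts are made up.

```python
import uuid

processed_keys = set()
balances = {"alice": 100}

def withdraw(account: str, amount: int, idempotency_key: str) -> None:
    # A duplicate delivery of the same request is silently ignored.
    if idempotency_key in processed_keys:
        return
    balances[account] -= amount
    processed_keys.add(idempotency_key)

key = str(uuid.uuid4())
withdraw("alice", 30, key)
withdraw("alice", 30, key)  # retry after a lost reply - applied only once
print(balances["alice"])    # 70, not 40
```

In a real system the set of processed keys would live in the same durable store as the balance and be updated in the same transaction - otherwise a crash between the two writes reintroduces the problem.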

To summarise: regular applications need to correctly model internal state based on the real world, and perform operations on that state that are useful to the business. Distributed applications additionally need an algorithm to agree on shared state, and in doing so, decide on the tradeoff between consistency and availability in the event of outages and netsplits.

In the next post, we will examine the common patterns applications employ to deal with distributed state. Thank you for reading!
