Which steps ensure the robustness of a small distributed system?


I'm a student working on a college project. I'd call myself an experienced programmer: I had been tinkering with distributed systems for a long time before this project, but I never took them seriously.

This project is fairly small as a distributed system, so it's easy to make a working implementation. Now I have to make it robust and write a report about it.

Searching the web is frustrating. There are only long reads in the form of books; shorter ones, such as reports or papers, focus on a single aspect such as scalability. They are all aimed at domain experts working on large systems in industry. The eight fallacies of distributed computing do remind me of things to do, but they are still abstract concepts, not a how-to.

So, how do I make a distributed system robust (in practice, secure and fast), and how do I prove it (are there any metrics I can measure and put in a scientific report)?

Many thanks!


Hi there,

In a practical sense, the biggest issue with the robustness of distributed systems is the extra failure modes: each distributed piece can, and will, fail independently of the others. I'm assuming your system has separate processes serving the front-end, the back-end and a database. With that in mind, think about:

  • What does your front-end do when the back-end fails?
  • What does the back-end do when the database fails?
  • How will the front-end know the back-end has failed?
  • How will the back-end know the database failed?
  • How will the front-end know when the back-end has recovered?
  • How will the back-end know when the database has recovered?
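One way to approach the "how will X know" questions above is a periodic health probe with a small state machine on top. Here is a minimal sketch, not a definitive implementation: `HealthTracker`, `probe_tcp` and the threshold of three failures are names and values I made up for illustration. The caller runs the probe on a timer and feeds each result into the tracker.

```python
import socket
import time


def probe_tcp(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


class HealthTracker:
    """Turns a stream of probe results into up/down state (hypothetical helper)."""

    def __init__(self, failure_threshold=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.clock = clock
        self.consecutive_failures = 0
        self.healthy = True
        self.transitions = []  # list of (timestamp, "down" or "up")

    def record(self, probe_ok):
        """Record one probe result and return the current health state."""
        if probe_ok:
            self.consecutive_failures = 0
            if not self.healthy:
                # First successful probe after an outage: dependency recovered.
                self.healthy = True
                self.transitions.append((self.clock(), "up"))
        else:
            self.consecutive_failures += 1
            if self.healthy and self.consecutive_failures >= self.failure_threshold:
                # Enough failures in a row: treat the dependency as down.
                self.healthy = False
                self.transitions.append((self.clock(), "down"))
        return self.healthy
```

Requiring several consecutive failures avoids flapping on a single dropped packet, and the recorded transitions give you timestamps for when each outage started and ended.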

Typically people will horizontally scale the front-end and back-end components (keeping them stateless if possible), which increases the ability of these layers to survive outages, but the database often becomes the biggest issue unless it is also mirrored in some form.

Typical measurements for these things are in terms of availability, but unless you are serving a large number of requests regularly, these metrics may be meaningless because downtime doesn't impact anyone.
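For a report, availability can be computed either from request counts or from outage intervals. Below is a rough sketch of both, plus the mean time to recover; the function names and the `(start, end)` tuple format for outages are my own assumptions, not a standard API.

```python
def request_availability(successes, total):
    """Fraction of requests that succeeded during the measurement window."""
    if total == 0:
        # With no traffic the metric says nothing, so refuse to report it.
        raise ValueError("no requests observed, availability is undefined")
    return successes / total


def time_availability(outages, window_seconds):
    """Fraction of the window the system was up.

    outages is a list of (start, end) timestamps in seconds.
    """
    downtime = sum(end - start for start, end in outages)
    return 1.0 - downtime / window_seconds


def mean_time_to_recover(outages):
    """Average outage duration in seconds, or 0.0 if there were no outages."""
    if not outages:
        return 0.0
    return sum(end - start for start, end in outages) / len(outages)
```

For example, two 30-second outages in a one-hour test window give `time_availability([(0, 30), (100, 130)], 3600)`, which is 1 - 60/3600, roughly 98.3%.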

To prove some of these capabilities of your system, you might want to have a look at what Netflix has done with their AWS workload (medium.com/netflix-techblog/the-ne...) if you are running on AWS or a similar cloud provider.

Depending on your implementation language, make sure you use appropriate timeouts for each distributed connection and potentially look at more advanced patterns like Circuit Breakers and Bulkheads (I'll leave you to Google these).
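To make the timeout and Circuit Breaker ideas concrete, here is a minimal sketch in Python using only the standard library. `CircuitBreaker`, `CircuitOpenError`, `fetch` and the default thresholds are hypothetical names and values chosen for illustration; real libraries offer more complete versions of this pattern.

```python
import time
import urllib.request


def fetch(url, timeout=2.0):
    """HTTP GET with an explicit timeout so a hung server cannot block forever."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call a dependency it considers down."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds before a retry
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: fail fast instead of waiting on a dead dependency.
                raise CircuitOpenError("dependency marked as down")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                # Open the circuit, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # Success: close the circuit and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

A back-end could wrap each database or HTTP call as `breaker.call(fetch, url)` and return a fallback response when `CircuitOpenError` is raised.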

P.S. Making distributed systems robust is (a) not a binary thing and (b) not easy :-) You will eventually have failure modes you haven't anticipated. Being able to recover from them quickly is probably more important than avoiding them in the first place.

Khoa Che profile image
A hacker by definition of the Jargon File | Be able to develop websites, desktop GUIs, libraries that he can imagine.
