My Key Takeaways from DDIA Chapter 1: Reliability, Scalability, and Maintainability

#systemdesign #distributedsystems #architecture #computerscience

This is an introductory but useful chapter to start with, since it covers the high-level fundamentals you need to think about before designing systems.
Here are my notes on the parts that felt most important to me.

Reliability:

The application should continue to work correctly even when things go wrong, so design it to handle the mistakes users make and the places where it could crash. When discussing this, prefer a term like "resilient": "fault tolerant" is slightly misleading, since no system can tolerate every possible kind of fault.
The difference between a fault and a failure: a fault is usually one component deviating from its expected behavior, while a failure is the system as a whole no longer providing the service users need.
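To make the fault vs. failure distinction concrete, here is a minimal sketch of my own (not from the book). The replica setup and the names `read_from_replica` and `resilient_read` are hypothetical: one replica being down is a fault, and the read only becomes a failure when every replica is down.

```python
class ReplicaDown(Exception):
    """Raised when a single (hypothetical) replica cannot serve a read."""


def read_from_replica(replica_id, healthy):
    # A fault: this one component is not behaving as expected.
    if replica_id not in healthy:
        raise ReplicaDown(f"replica {replica_id} is down")
    return f"value from replica {replica_id}"


def resilient_read(replica_ids, healthy):
    # Try each replica in turn so that one fault does not become a failure.
    for replica_id in replica_ids:
        try:
            return read_from_replica(replica_id, healthy)
        except ReplicaDown:
            continue
    # Only when every replica is faulty does the system as a whole fail.
    raise RuntimeError("all replicas are down: system failure")


# Replica 1 is faulty, but the read still succeeds from replica 2.
print(resilient_read([1, 2, 3], healthy={2, 3}))
```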

The types of faults the system needs to handle are:

hardware faults (less for application developers to worry about, though they can be tied to software problems too)
software errors (bugs in code can cause cascading failures)
human errors (like wrong config files).

Scalability:

Scalability is the system's ability to cope with increased load, and it depends on how the system is designed. It's meaningless to say "X is scalable" or "Y doesn't scale" on its own; the useful question is how the system copes when load grows in a particular way.
To talk about load, first describe it with load parameters chosen for your app's requirements: requests per second, the ratio of reads to writes in the database, or the number of active users in a chat room.
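As a rough illustration, two of these load parameters can be computed from simple counters. This is only a sketch: the `LoadSnapshot` class, its fields, and the numbers are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class LoadSnapshot:
    """Hypothetical counters collected over a measurement window."""
    reads: int
    writes: int
    window_seconds: float

    def requests_per_second(self):
        # Total requests divided by the length of the window.
        return (self.reads + self.writes) / self.window_seconds

    def read_write_ratio(self):
        # Reads per write; infinite if there were no writes at all.
        return self.reads / self.writes if self.writes else float("inf")


snapshot = LoadSnapshot(reads=9000, writes=1000, window_seconds=10)
print(snapshot.requests_per_second())  # 1000.0
print(snapshot.read_write_ratio())     # 9.0
```

Which parameters matter depends on the app: a read-heavy system like this one would likely be tuned differently from a write-heavy one.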
Once load is defined, look at what happens when a load parameter increases: how performance changes if the resources (CPU, memory, etc.) stay the same, and how much you need to add to keep performance unchanged.
There are various ways to cope with load, like horizontal and vertical scaling, but there is no single scaling strategy or secret sauce that works for every system; the right choice depends on the system's load parameters and requirements.

Latency and response time

Latency and response time are often used interchangeably, but they are not the same. Latency is the time a request spends waiting to be handled, while response time is what the client sees: the total time from sending the request to getting a response. Response time therefore includes the time to actually process the request plus network delays, queueing delays, etc.
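A tiny worked example of how those pieces add up. The function name and all the millisecond values are made-up assumptions; the point is only that the same request can have a very different response time depending on queueing.

```python
def response_time_ms(network_ms, queueing_ms, service_ms):
    # What the client sees is the sum of all the delays along the way.
    return network_ms + queueing_ms + service_ms


# Same request and same processing work, but the second one got stuck in a queue.
fast = response_time_ms(network_ms=20, queueing_ms=0, service_ms=30)
slow = response_time_ms(network_ms=20, queueing_ms=400, service_ms=30)
print(fast, slow)  # 50 450
```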

A good way to monitor response times is to sort them from fastest to slowest and look at percentiles. The median (50th percentile, or p50) shows how long users typically wait: half of requests are faster and half are slower. To see how bad the outliers are, look at higher percentiles such as p95, p99, and p99.9. Big companies like Amazon set response time requirements for internal services at the 99.9th percentile, even though it only affects 1 in 1,000 requests.
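Here is a minimal sketch of computing these percentiles with the nearest-rank method. The `percentile` helper and the sample data are my own assumptions for illustration; real monitoring tools usually compute this for you, often with approximation algorithms.

```python
import math


def percentile(sorted_times, p):
    # Nearest-rank method: the smallest sample such that at least
    # p percent of all samples are less than or equal to it.
    # round() guards against floating-point noise before ceil().
    rank = math.ceil(round(p * len(sorted_times) / 100, 9))
    return sorted_times[max(rank, 1) - 1]


# Made-up sample of 1,000 response times in milliseconds:
# most requests are fast, a few are slow, and a handful are very slow.
response_times_ms = sorted([100] * 950 + [500] * 45 + [3000] * 5)

print(percentile(response_times_ms, 50))    # 100
print(percentile(response_times_ms, 99))    # 500
print(percentile(response_times_ms, 99.9))  # 3000
```

The median looks great here, yet the slowest requests take 30 times longer, which is why the higher percentiles are worth tracking as well.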

Maintainability:

You don't walk away from a system once it's built; you have to maintain it, so design it to be easy to maintain from the start. Important things to keep in mind are:

operability (making it easy for operations teams to keep the system running)
simplicity (easy to understand; reduce complexity as much as you can by adding the right abstractions)
evolvability (easy to change in the future, so it should be modifiable and extensible).
