Vinicius Uehara

Reliability, scalability and maintainability

Designing Data-Intensive Applications, Chapter 1

  • I recently picked up Martin Kleppmann's DDIA book, and over the next few days I'll be sharing some notes on its chapters
  • Also reading it? Feel free to connect, let's talk about it!

Reliability

  • Roughly, "continuing to work correctly, even when things go wrong"
    • Things go wrong -> fault
    • Systems that anticipate and can cope with faults are called fault-tolerant or resilient
    • Fault =/= Failure
      • Fault: one component deviating from spec
      • Failure: the whole system stops providing the required service to the user
    • ❌ Reduce fault probability to zero ❌
    • ✅ Design fault-tolerance mechanisms that prevent faults from causing failures ✅
      • There are cases where prevention is better, though, as with security matters
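
A minimal sketch (mine, not the book's) of a fault-tolerance mechanism: wrapping a flaky component in retries with a fallback, so a fault doesn't become a user-visible failure. `fetch_recommendations` is a hypothetical caller.

```python
import time

def call_with_tolerance(fetch, fallback, retries=3, delay=0.2):
    """Retry a flaky component a few times; degrade gracefully instead of failing."""
    for attempt in range(retries):
        try:
            return fetch()                     # the component may be faulty...
        except Exception:
            time.sleep(delay * (attempt + 1))  # small backoff before retrying
    return fallback()                          # ...but the user still gets a response

# Hypothetical usage: recommendations are optional, so an empty list is an
# acceptable degraded response while the recommender misbehaves.
# recommendations = call_with_tolerance(fetch_recommendations, lambda: [])
```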

Hardware faults

  • Crashed hard disks, faulty RAM, lack of maintenance, etc.
    • First response: adding redundancy to individual hardware components
  • ⬆ applications' computing demands ⬆ rate of hardware faults
    • And in common cloud services, flexibility and elasticity are prioritized over single-machine reliability
    • Software fault tolerance techniques!

Software faults

  • Harder to anticipate
  • Correlated across nodes
    • Which causes more system failures than uncorrelated hardware faults
  • Examples include:
    • Software bugs given a particular input
    • Runaway processes (infinite loops)
    • Downtime in some service that the system depends on
    • Cascading failures (a timeout sketch follows this list)
  • Prevention includes:
    • Carefully thinking about interactions and assumptions about the system
    • Automated/manual testing
    • Process isolation
    • Measuring, monitoring and analyzing system behavior in production
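
To make the "dependency downtime" and "cascading failures" points concrete, here's a rough sketch (my own, not from the book) of bounding a dependency call with a timeout so a stalled service can't tie up all of our workers. The profile service URL and response shape are invented.

```python
import requests  # third-party HTTP client, assumed to be installed

def get_profile(user_id: int) -> dict:
    """Call a (hypothetical) profile service, but never wait on it forever."""
    try:
        resp = requests.get(
            f"https://profile.internal/users/{user_id}",
            timeout=(0.5, 2.0),  # connect timeout, read timeout (seconds)
        )
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # A down or slow dependency becomes a handled fault here, instead of
        # blocked workers piling up and taking the rest of the system with them.
        return {"user_id": user_id, "display_name": "unknown"}
```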

Human errors

  • How do we make our systems reliable, in spite of unreliable humans?
    • Designing systems in a way that minimizes opportunities for error. Eg: abstractions, APIs and admin dashboards (high-level interface)
    • Non-production environments (so people can explore and experiment safely)
    • Automated/manual testing to cover corner cases
    • Easy rollback options
    • Deploying code gradually + feature flags (trunk-based development); see the sketch after this list
    • Detailed monitoring and observability. Eg: performance metrics and error rates
    • Good management practices and training
  • Reliability matters for companies big and small (potential loss of revenue and damage to reputation)
    • Balance between reliability and costs
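
A rough sketch of what "deploying gradually + feature flags" can look like (the flag name, percentage and checkout functions are made up): the new code path only runs for a small, configurable slice of users, and "rollback" is flipping a number instead of redeploying.

```python
import hashlib

ROLLOUT_PERCENT = {"new_checkout_flow": 5}  # hypothetical flag, rolled out to 5% of users

def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministically bucket a user into 0..99 and compare against the rollout %."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < ROLLOUT_PERCENT.get(flag, 0)

def checkout(user_id: str):
    # new_checkout / old_checkout are stand-ins for the two code paths
    if is_enabled("new_checkout_flow", user_id):
        return new_checkout(user_id)   # risky new path, small audience
    return old_checkout(user_id)       # "rollback" = set the percentage to 0
```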

Scalability

  • Ability to cope with increased load
    • If the system grows in a particular way, what are our options for coping with the growth?

Load

  • Actually, first we have to consider what load is
  • Load parameters depend on the system architecture
    • Requests per second to a web server
    • Ratio of reads to writes in a database
    • Number of simultaneously active users in a chat room
    • Hit rate on a cache, etc.
  • Twitter's key load parameter: the distribution of followers per user (fan-out load)
    • SELECT-ing all tweets for the home timeline at read time VS maintaining a per-user timeline cache that is updated when tweets are posted (a rough sketch of both is below)
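
A very rough sketch of the two approaches (table, column and function names are mine, not Twitter's): approach 1 assembles the home timeline with a query at read time; approach 2 fans each tweet out to per-follower timeline caches at write time, which is exactly why the follower distribution becomes the key load parameter.

```python
# Approach 1: pull at read time -- cheap writes, expensive home-timeline reads
HOME_TIMELINE_QUERY = """
    SELECT tweets.*
    FROM tweets
    JOIN follows ON follows.followee_id = tweets.sender_id
    WHERE follows.follower_id = :user_id
    ORDER BY tweets.sent_at DESC
    LIMIT 100;
"""

# Approach 2: fan out at write time -- one tweet becomes many cache inserts,
# which is why the distribution of followers per user is the key load parameter
def post_tweet(db, cache, sender_id, tweet):
    db.insert("tweets", tweet)
    for follower_id in db.followers_of(sender_id):   # can be tens of millions
        cache.prepend(f"home_timeline:{follower_id}", tweet)
```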

Performance

  • Batch processing -> throughput (records processed/second or time to complete a job)
  • Online systems -> response time (request -> response)
  • We should think of response time as a distribution of values, not a single number
    • Since the response time will be different on every request
    • Random added latency, packet loss, garbage collection pauses, mechanical vibrations, etc.
  • Average response time is NOT a good metric
    • It does not tell how many users experienced that delay
  • Use percentiles instead: sort the response times from fastest to slowest and take the median (half of the requests complete in less than that time, half take longer)
    • Also known as p50
  • To know how bad the outliers are: p95, p99, p999
    • Also known as tail latencies
  • If p95 = 1.5 seconds, 5 out of 100 requests take 1.5 seconds or more
  • Tail latencies are important, since they affect users' experience of the service
    • 100ms increase in response time reduces sales by 1% (Amazon)
    • 1 sec slowdown reduces customer satisfaction metrics by 16%
  • These percentiles are used when defining SLAs and SLOs
    • "The service is considered up if it has a median response time under X ms and a p99 under 1 s"
  • Monitoring response times:
    • On an ongoing basis
    • Keep a rolling window of the response times of requests in the last 10 minutes
    • Every minute, calculate the median and other percentiles over that window and plot them on a graph (sketch below)
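
A minimal sketch of that monitoring idea: keep a rolling window of recent response times and compute the median and tail percentiles from it. Keeping a raw in-memory list is a simplification; real systems use approximations such as histograms (e.g. HdrHistogram or t-digest) to make this cheap.

```python
import time
from collections import deque

window = deque()  # (timestamp, response_time_ms) pairs for the last 10 minutes

def record(response_time_ms, now=None):
    now = now if now is not None else time.time()
    window.append((now, response_time_ms))
    while window and window[0][0] < now - 600:   # drop entries older than 10 minutes
        window.popleft()

def percentile(p):
    """p in [0, 100]: percentile(50) is the median, percentile(99) is p99."""
    samples = sorted(rt for _, rt in window)
    if not samples:
        return 0.0
    index = min(len(samples) - 1, int(len(samples) * p / 100))
    return samples[index]

# Every minute, plot percentile(50), percentile(95) and percentile(99) on a dashboard.
```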

Coping with load

  • Rethink the architecture on every order of magnitude load increase
  • Scaling up (moving to a more powerful machine) vs scaling out (distributing load across machines)
    • Elastic systems (automatically add computing resources as load increases)
    • Useful when the increase in load is unpredictable (a toy sketch follows this list)
  • Architecture of large systems is highly specific to the application
    • There is no such thing as a one-size-fits-all architecture
    • 100,000 requests of 1 kB each per second vs 3 requests of 2 GB each per minute
    • An architecture that scales well will be built around the load parameters
    • Iterate quickly on features VS scale to hypothetical future load
    • However, scalable architectures are built from general-purpose building-blocks (arranged in familiar patterns)
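
A toy sketch of the "elastic" idea (the capacity numbers and caps are invented): derive an instance count from an observed load parameter. Real autoscalers add smoothing, cooldowns and cost limits on top of this.

```python
def desired_instances(requests_per_sec, capacity_per_instance=1000, max_instances=100):
    """Pick an instance count for the observed load, within sane bounds."""
    needed = -(-requests_per_sec // capacity_per_instance)   # ceiling division
    return max(1, min(needed, max_instances))                # never zero, never unbounded

# e.g. desired_instances(4200) -> 5 instances at ~1000 req/s each
```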

Maintainability

  • Most of the cost of software is in ongoing maintenance, not in the initial development
    • Fixing bugs, investigating failures, adapting to new platforms, modifying use cases, repaying technical debt, adding new features, etc

Operability

  • Good software cannot run reliably with bad operations
  • Operations squad responsibilities include:
    • System health monitoring/restoring from bad state
    • Tracking down the cause of problems (e.g. failures or degraded performance)
    • Keeping everything up to date
    • Anticipating future problems and solving them
    • Establishing good practices and development tools
    • Performing complex management tasks (such as platform migrations)
    • Maintaining system security
    • Writing docs about the system
  • Good operability of data systems includes:
    • Good monitoring (visibility)
    • Automation and integration with standard tools
    • Good documentation of operations
    • Good and predictable default behavior, but with the option to override it (see the sketch after this list)
    • Self-healing
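
One small illustration of "predictable defaults with the option to override them" (field and environment variable names are invented): ship sensible defaults, and let operators change them without touching code.

```python
import os
from dataclasses import dataclass

@dataclass
class ServerConfig:
    # Predictable defaults, so the system behaves sensibly out of the box...
    request_timeout_s: float = 30.0
    max_connections: int = 100

    @classmethod
    def from_env(cls) -> "ServerConfig":
        # ...with an escape hatch for operators who need to override them
        return cls(
            request_timeout_s=float(os.getenv("REQUEST_TIMEOUT_S", 30.0)),
            max_connections=int(os.getenv("MAX_CONNECTIONS", 100)),
        )
```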

Simplicity

  • Simple and expressive code instead of bloated and complex
  • Symptoms of complexity:
    • Tight coupling of modules
    • Tangled dependencies
    • Inconsistent naming
    • Hacks to solve performance issues, etc
  • When the system is harder for developers to understand and reason about (hidden assumptions, unintended consequences, unexpected interactions), the risk of introducing new bugs is increased
  • Making the system simpler -> removing accidental complexity
    • Through abstractions (e.g. a façade; a sketch follows)
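
A small sketch of an abstraction hiding accidental complexity behind a façade (all the collaborator objects are hypothetical):

```python
class ReportingFacade:
    """A small, stable interface in front of several tangled subsystems."""

    def __init__(self, warehouse, billing, mailer):   # hypothetical collaborators
        self._warehouse = warehouse
        self._billing = billing
        self._mailer = mailer

    def send_monthly_report(self, customer_id):
        usage = self._warehouse.usage_for(customer_id)
        invoice = self._billing.invoice_for(customer_id, usage)
        self._mailer.send(customer_id, invoice)

# Callers depend only on send_monthly_report(); the tangle behind it can be
# simplified or swapped out without touching every call site.
```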

Evolvability

  • Constant flux: we learn new facts, new use cases emerge, business priorities change, users request new features, the architecture changes, etc.
  • Agile development
  • The ease of modifying a system is linked with its simplicity
    • Easy-to-understand systems are easier to modify than complex ones
