angDecoder
CHAPTER 1: Reliable, Scalable, and Maintainable Applications

1. Fundamentals of Data-Intensive Applications

Modern applications are often data-intensive, not compute-intensive. Key challenges include:

  • Amount of data: Handling large volumes.
  • Data complexity: Managing intricate structures.
  • Data change speed: Processing updates in real time.

Effective systems must focus on reliability, scalability, and maintainability.

1.1. Reliability

Systems must function correctly under:

  • Hardware failures: Disk crashes, network outages.
  • Software bugs: Unexpected behavior.
  • Human errors: Mistakes in configuration or operations.

1.2. Scalability

Systems should handle growth in:

  • Data volume: Large datasets.
  • Traffic volume: High user requests.
  • Complexity: Evolving use cases.

1.3. Maintainability

Systems must remain easy to work on over time:

  • Multiple contributors can modify and maintain it.
  • Adapts to new features and requirements.
  • Minimizes technical debt and operational issues.

2. Reliability

Reliability in software means:

  • The application performs the expected function.
  • It tolerates user mistakes or unexpected usage.
  • Performance meets the requirements under expected load and data volume.
  • It prevents unauthorized access and abuse.

a. Faults: When a component deviates from its specification.
b. Failures: When the system stops providing the required service.

Fault-tolerant or resilient systems anticipate and cope with faults.

Testing fault-tolerant systems can involve deliberately triggering faults. For example, Netflix's Chaos Monkey tests system resilience by introducing faults intentionally.
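The idea can be sketched with a tiny in-process fault injector. This is illustrative only: the real Chaos Monkey terminates whole VM instances in production, whereas this toy raises exceptions inside one process; all names here are made up.

```python
import random

def inject_faults(probability):
    """Decorator that raises an error with the given probability,
    so callers' error-handling paths get exercised regularly."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                raise RuntimeError("injected fault")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(probability=0.1)  # fail roughly 10% of calls
def handle_request(payload):
    return {"status": "ok", "echo": payload}

def call_with_retry(payload, attempts=5):
    # A fault-tolerant caller: retry on injected faults, degrade gracefully.
    for _ in range(attempts):
        try:
            return handle_request(payload)
        except RuntimeError:
            continue
    return {"status": "degraded"}
```

Running faults continuously, rather than waiting for rare real outages, forces the retry and fallback logic to stay correct.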

2.1. Hardware Faults

Modern systems favor software fault-tolerance techniques alongside or instead of hardware redundancy. These systems offer operational benefits, such as:

  • Tolerating machine loss: Enables continuous operation despite hardware failures.
  • Rolling upgrades: Allows patching or rebooting machines one node at a time without system downtime.

2.2. Software Errors

  • Bug causing all instances of an application server to crash with a bad input (e.g., leap second on June 30, 2012, affecting Linux kernel).
  • Service dependency that slows down, becomes unresponsive, or returns corrupted responses.
  • Cascading failures: A fault in one component triggers faults in others [10].
  • Runaway process consuming shared resources (CPU, memory, disk, network).

2.3. Human Errors

Humans design, build, and operate software systems, and operator configuration errors are a leading cause of outages. Ways to make systems reliable despite unreliable humans include:

  • Designing interfaces and APIs that make it easy to do the right thing and hard to do the wrong thing.
  • Providing sandbox environments where people can experiment safely without affecting real users.
  • Testing thoroughly, from unit tests to whole-system integration tests.
  • Making recovery fast and easy (e.g., quick rollback of configuration changes, gradual rollouts).
  • Setting up detailed and clear monitoring (telemetry) of performance metrics and error rates.

3. Scalability

Scalability refers to a system's ability to handle increased load.

3.1. Describing Load

  • Load parameters quantify load (e.g., requests per second, read/write ratio, active users, cache hit rate).
  • Which parameters matter depends on the system's architecture; sometimes the average case is what matters, and sometimes a few extreme cases are the source of bottlenecks.

3.1.1. Case Study: Twitter's Operations and Scaling Challenges

Twitter handles two primary operations:

  1. Post Tweet: A user publishes a message to their followers.

    • Load:
      • Average: 4.6k requests/sec.
      • Peak: Over 12k requests/sec.
    • Process: Add the tweet to a global collection.
  2. Home Timeline: A user views tweets from followed accounts.

    • Load: 300k requests/sec.
    • Process:
      • Lookup followed users.
      • Fetch and merge their tweets, sorted by time.

How should the home timeline be assembled?

Option 1: On-Demand Generation

  • Process: Fetch and merge tweets from followed users when requested.
  • Simplifies write operations (posting a tweet is lightweight).

Option 2: Cached Home Timeline

  • Maintain a cache for each user (like a tweet mailbox).
  • Reads are cheap as results are pre-computed.
  • Write operations become heavier (more work to update multiple caches).

Option 1 vs. Option 2

Twitter switched to Option 2 (cached home timelines). This proved more effective because home timeline reads occur almost two orders of magnitude more often than tweet writes. By shifting more work to write time, the system reduces the work done at read time, improving scalability and overall performance.
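The two approaches can be contrasted with a minimal in-memory sketch. The data structures and function names below are illustrative, not Twitter's actual implementation, and a sequence number stands in for a timestamp.

```python
from collections import defaultdict

tweets = []                    # global collection of (seq, author, text)
followers = defaultdict(set)   # author -> set of follower usernames
timelines = defaultdict(list)  # user -> precomputed home timeline (Option 2)

def post_tweet(author, text):
    entry = (len(tweets), author, text)
    tweets.append(entry)
    # Option 2: fan out on write -- push the tweet into every follower's cache.
    for user in followers[author]:
        timelines[user].append(entry)

def home_timeline_on_demand(following):
    # Option 1: fan out on read -- scan and merge followed users' tweets per request.
    return sorted(t for t in tweets if t[1] in following)

def home_timeline_cached(user):
    # Option 2: reads are cheap; the merging work already happened at write time.
    return timelines[user]
```

With 300k timeline reads per second against ~5k tweet writes per second, paying the fan-out cost once per write is far cheaper than scanning the global collection on every read.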

3.2. Describing Performance

To analyze system performance under increasing load, consider two perspectives:

  1. Fixed Resources:

    • Increase a load parameter (e.g., requests/sec) while keeping system resources (CPU, memory, bandwidth) constant.
  2. Scaling Resources:

    • Increase a load parameter and determine how much system resources need to scale to maintain unchanged performance.

Latency vs. Response Time: **Response time** is the total time observed by the client, including service time (the time to process the request) plus network and queueing delays. **Latency**, by contrast, is the duration a request spends waiting to be handled (awaiting service).
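A toy single-server FIFO queue makes the distinction concrete: each request's response time is its queueing delay plus its service time. The fixed service time and the function name are simplifying assumptions for illustration.

```python
def response_times(arrivals, service_time):
    """Simulate one server handling requests in arrival order.
    Returns the response time each client observes."""
    finish = 0.0
    out = []
    for t in sorted(arrivals):
        start = max(t, finish)   # wait (queueing delay) if the server is busy
        finish = start + service_time
        out.append(finish - t)   # response time = queueing delay + service time
    return out

# Three requests arriving simultaneously: identical service time,
# but later requests queue behind earlier ones.
print(response_times([0.0, 0.0, 0.0], service_time=1.0))  # [1.0, 2.0, 3.0]
```

Even though the service time is constant, response times grow for requests stuck in the queue, which is exactly why the two terms should not be conflated.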

3.2.1. Using Percentiles for Response Times

  • Percentiles: Provide a clearer picture of the response time distribution than averages.

    • Median (p50): Half of requests complete faster than the median, and half are slower.
    • Higher percentiles such as p95 and p99 show how severe the slowest requests are.
    • Percentiles are commonly used in SLOs (Service Level Objectives) and SLAs (Service Level Agreements) to define expected performance.
  • Head-of-Line Blocking:

    • A few slow requests can delay the processing of subsequent requests, creating bottlenecks.
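A small nearest-rank percentile function shows why percentiles beat averages for skewed response-time distributions (the sample data is invented for illustration):

```python
def percentile(values, p):
    """p-th percentile by the nearest-rank method (p in 0..100)."""
    vals = sorted(values)
    idx = max(0, int(round(p / 100 * len(vals))) - 1)
    return vals[idx]

# Response times in milliseconds; two outliers dominate the mean.
times_ms = [32, 35, 40, 41, 45, 50, 55, 120, 800, 1500]

print(percentile(times_ms, 50))  # 45  (the typical user's experience)
print(percentile(times_ms, 99))  # 1500 (the unlucky tail)
```

For this sample the mean is about 272 ms, yet half of all users saw 45 ms or better; the median and the high percentiles together describe the distribution far more honestly than the average does.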

4. Maintainability

  • Cost of Maintenance:

    • Most software costs arise not from initial development but from ongoing maintenance: fixing bugs, keeping systems operational, investigating failures, and repaying technical debt.
  • Goal: Minimize maintenance pain and avoid creating legacy systems by focusing on these key design principles:

a. Operability:

  • Ensure operations teams can manage the system efficiently and keep it running smoothly.
  • Can be achieved with good documentation, timely security patches, and monitoring.

b. Simplicity:

  • Reduce system complexity to make it easier for new engineers to understand.
  • Can be achieved through abstraction.
  • Note: Simplicity here refers to system design, not the user interface.

c. Evolvability:

  • Enable future changes to the system for unanticipated use cases as requirements evolve.
  • Also referred to as extensibility, modifiability, or plasticity.
