System Design Pillars

Isha Mudgal — Wed, 18 Jun 2025 13:55:43 +0000

1. ⚡ Availability
Definition:
The proportion of time a system is accessible and operational when users need it.

Key Points:

Measured as percentage uptime (e.g., 99.99% availability allows about 52.6 minutes of downtime per year).

Requires redundancy: multiple instances, failovers.

Common techniques: Load balancers, health checks, replicas.
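
The uptime arithmetic above can be checked with a short sketch. The function name `downtime_minutes_per_year` is made up for illustration, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an uptime percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Each extra "nine" cuts the downtime budget by a factor of ten.
for target in (99.0, 99.9, 99.99):
    print(f"{target}% uptime allows {downtime_minutes_per_year(target):.1f} min/year of downtime")
```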

2. 🛡️ Reliability
Definition:
The system's ability to function correctly and consistently over time.

Key Points:

Reliability ≠ Availability. A system can be available but return incorrect results.

Achieved through: fault detection, retries, data replication, and monitoring.

Measured with metrics like MTBF (Mean Time Between Failures).
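
As a rough illustration of MTBF, the sketch below computes it from total uptime and a failure count. It also relates MTBF to availability through MTTR (Mean Time To Repair), which this post does not cover and is brought in here as an assumption. The function names and sample numbers are hypothetical.

```python
def mtbf_hours(total_uptime_hours: float, failure_count: int) -> float:
    """Mean Time Between Failures: operating time divided by number of failures."""
    if failure_count <= 0:
        raise ValueError("need at least one failure to compute MTBF")
    return total_uptime_hours / failure_count


def steady_state_availability(mtbf: float, mttr: float) -> float:
    """Availability = MTBF / (MTBF + MTTR). MTTR is an assumption added for this sketch."""
    return mtbf / (mtbf + mttr)


# Made-up sample: 1,000 hours of operation, 4 failures, 30 minutes to repair each.
mtbf = mtbf_hours(1000, 4)
availability = steady_state_availability(mtbf, mttr=0.5)
print(f"MTBF = {mtbf} h, availability = {availability:.4%}")
```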

3. 📈 Scalability
Definition:
A system’s ability to handle increased load without performance loss.

Key Points:

Vertical Scaling: Add more power to a single machine.

Horizontal Scaling: Add more machines (preferred for web-scale).

Involves sharding, caching, stateless services, and distributed queues.
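
One simple way to picture sharding is hash-based routing: every key maps deterministically to one of N shards. The sketch below is a minimal illustration, not a production scheme; `shard_for` and the key format are hypothetical.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index in [0, shard_count) using a stable hash."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # sha256 is used instead of built-in hash(), which is randomized per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# The same key always lands on the same shard.
for user_id in ("user:1", "user:2", "user:3"):
    print(user_id, "-> shard", shard_for(user_id, 4))
```

Plain modulo routing remaps most keys when the shard count changes; consistent hashing is a common way to reduce that, but it is outside the scope of this sketch.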

4. 🔧 Maintainability
Definition:
How easily a system can be understood, updated, and fixed.

Key Points:

High maintainability = faster iterations and fewer bugs.

Achieved with: clean code, modular architecture, automated tests, observability.

Reduces system downtime and tech debt over time.

5. 🧯 Fault Tolerance
Definition:
The system’s ability to keep running even when some components fail.

Key Points:

Examples: retry logic, failover systems, circuit breakers.

Closely tied with availability and reliability.

Design principle: “Design for failure” — assume things will go wrong.
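
A circuit breaker, one of the examples listed above, might look roughly like this. It is a simplified sketch under assumed defaults (3 consecutive failures, 30-second reset); the class and parameter names are invented for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open; call skipped")
            # Reset timeout elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open (or re-open after a failed trial) once the threshold is reached.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast while the circuit is open gives the struggling dependency time to recover instead of piling more requests onto it.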

Abstraction

Isha Mudgal — Tue, 17 Jun 2025 15:59:44 +0000

1. Why Are Abstractions Important?
Abstractions simplify complex systems by hiding lower-level implementation details. They:

Allow developers to focus on solving higher-level problems.

Enable modularity: changes inside one layer don't affect the others as long as its interface stays stable.

Promote reuse, scalability, and testability.

Are foundational in both OS (e.g., file systems) and distributed systems (e.g., databases, RPC).

2. Network Abstractions: Remote Procedure Calls (RPC)
RPC enables communication between services across a network, hiding the complexity of:

Data serialization/deserialization

Network communication protocols

Low-level retry logic and transport error handling

Advantages:

Makes remote calls look like local method calls.

Reduces boilerplate for service-to-service interactions.

Popular frameworks: gRPC, Apache Thrift, Apache Avro (primarily a serialization format that also defines an RPC protocol).

Caveats:

Network issues can introduce latency/failures.

An RPC is not a local function call: developers must still account for distributed system challenges (timeouts, retries, partial failures, etc.).
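
To make the caveats concrete, here is a minimal sketch of wrapping a remote call with bounded retries and exponential backoff. It does not use any real RPC framework: `RpcTimeout`, `call_with_retries`, and the `remote_call` argument are hypothetical stand-ins for whatever a client stub would provide.

```python
import random
import time


class RpcTimeout(Exception):
    """Hypothetical error standing in for a transport-level timeout."""


def call_with_retries(remote_call, *, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Invoke remote_call(), retrying on RpcTimeout with exponential backoff and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return remote_call()
        except RpcTimeout as err:
            last_error = err
            if attempt < attempts - 1:
                # Delay doubles each attempt; jitter spreads out retries from many clients.
                delay = base_delay * (2 ** attempt)
                sleep(delay + random.uniform(0, delay))
    raise last_error
```

Retrying is only safe when the remote operation is idempotent; otherwise a request that timed out but actually succeeded may be applied twice.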

3. Spectrum of Consistency Models
Choosing a consistency model in a distributed system means trading off availability, latency, and correctness.

Models include:

Strong Consistency – Reads always reflect the latest completed write. Simplifies reasoning but costs latency and, during partitions, availability.

Eventual Consistency – All replicas eventually converge. Faster but may show stale data.

Causal Consistency – Preserves cause-effect relationships.

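Read-Your-Writes / Monotonic Reads – Session guarantees. Read-your-writes ensures a session always observes its own updates; monotonic reads ensures a session never sees data older than what it has already read.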

Tradeoffs: By the CAP theorem, during a network partition a system must choose between consistency and availability.
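
The difference between a stale eventually consistent read and a read-your-writes guarantee can be shown with a toy model. This is a sketch with one primary and one lagging follower, using version numbers to track progress; all class and function names are hypothetical.

```python
class Replica:
    """A single-value store with a version counter."""
    def __init__(self):
        self.version = 0
        self.value = None


class Session:
    """Remembers the highest version this client has written."""
    def __init__(self):
        self.last_written_version = 0


def write(primary, session, value):
    primary.version += 1
    primary.value = value
    session.last_written_version = primary.version


def replicate(primary, follower):
    """Asynchronous replication step: the follower catches up with the primary."""
    follower.version, follower.value = primary.version, primary.value


def read_your_writes(session, replicas):
    """Read from the first replica that has caught up with this session's writes."""
    for replica in replicas:
        if replica.version >= session.last_written_version:
            return replica.value
    raise RuntimeError("no replica has caught up with this session yet")


primary, follower, session = Replica(), Replica(), Session()
write(primary, session, "v1")
print("plain follower read:", follower.value)            # stale until replication runs
print("read-your-writes:", read_your_writes(session, [follower, primary]))
replicate(primary, follower)
print("follower after replication:", follower.value)     # replicas have converged
```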

4. The Spectrum of Failure Models
Failures are inevitable in distributed systems. Types include:

Crash Failures – Server stops working (e.g., power loss).

Omission Failures – Messages dropped or not sent.

Timing Failures – A response arrives outside the expected time window.

Byzantine Failures – Arbitrary/malicious behavior.

Key takeaways:

Design for failure: Use timeouts, retries, replication, and monitoring.

Systems must tolerate partial failures without breaking completely.

Consensus protocols (e.g., Paxos, Raft) let nodes agree on a value despite crashed nodes and lost messages; on their own they do not tolerate Byzantine failures.
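
The failure model also determines how many nodes a cluster needs. As a small sketch: majority-quorum protocols such as Paxos and Raft need 2f + 1 nodes to survive f crash failures, while the classical bound for Byzantine fault-tolerant agreement is 3f + 1 nodes. The helper names below are made up for illustration.

```python
def majority_quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def tolerated_crash_failures(cluster_size: int) -> int:
    """Crash failures a majority-quorum cluster can survive (needs 2f + 1 nodes)."""
    return (cluster_size - 1) // 2


def min_nodes_for_byzantine(faulty_nodes: int) -> int:
    """Classical lower bound of 3f + 1 nodes to tolerate f Byzantine nodes."""
    return 3 * faulty_nodes + 1


for size in (3, 4, 5):
    print(f"{size} nodes: quorum={majority_quorum(size)}, "
          f"survives {tolerated_crash_failures(size)} crash failure(s)")
```

Note that going from 3 to 4 nodes raises the quorum size without tolerating any additional failure, which is why odd cluster sizes are the usual choice.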

DEV Community: Isha Mudgal
