The book Designing Distributed Systems by Brendan Burns presents the fundamental principles needed to deal with the complexity of distributed systems. In Chapter 2, "Key Concepts of Distributed Systems," the author discusses essential topics for developing scalable, reliable, and resilient systems. Below is a summary of the main concepts discussed:
APIs and RPC
APIs define the communication contract between services, specifying the available operations and the data to be exchanged. RPC (Remote Procedure Call) is a technique that lets one service call functions on a remote server as if they were local. In other words, while the API describes what can be done, RPC is one of the ways to perform those actions remotely.
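To make the difference concrete, here is a minimal sketch (not from the book) using Go's standard net/rpc package: the Calculator type and its Add method are the API contract, and client.Call is the remote procedure call that invokes it over the network.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/rpc"
)

// Args is the request payload defined by the API contract.
type Args struct{ A, B int }

// Calculator is the service; its exported methods form the API.
type Calculator struct{}

// Add follows the net/rpc method signature: (args, reply) error.
func (c *Calculator) Add(args *Args, reply *int) error {
	*reply = args.A + args.B
	return nil
}

func main() {
	// Server side: register the service and listen on a local port.
	rpc.Register(new(Calculator))
	ln, err := net.Listen("tcp", "127.0.0.1:4321")
	if err != nil {
		log.Fatal(err)
	}
	go rpc.Accept(ln)

	// Client side: dial the server and invoke the remote method
	// almost as if it were a local function call.
	client, err := rpc.Dial("tcp", "127.0.0.1:4321")
	if err != nil {
		log.Fatal(err)
	}
	var sum int
	if err := client.Call("Calculator.Add", &Args{A: 2, B: 3}, &sum); err != nil {
		log.Fatal(err)
	}
	fmt.Println("2 + 3 =", sum) // 2 + 3 = 5
}
```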
Latency
Latency is the time between sending a request and receiving the response. In distributed systems, this time tends to be higher due to communication between multiple machines. The higher the latency, the worse the user experience. Therefore, monitoring and optimizing latency is essential to ensure a fast and efficient system.
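As a simple illustration, latency can be measured by timing a request on the client side. The URL below is just a placeholder:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	// Placeholder endpoint; replace with a real service URL.
	url := "https://example.com/health"

	start := time.Now()
	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	latency := time.Since(start)

	fmt.Printf("status=%d latency=%s\n", resp.StatusCode, latency)
}
```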
Percentiles
Averages can be misleading. For example, imagine that 90% of requests are fast, but 10% take a long time. While the average might seem good, users who fall into that 10% will have a poor experience. This is where percentiles come in:
- p95: Indicates the latency within which 95% of requests complete.
- p99: Indicates the latency within which 99% of requests complete.
These percentiles help identify issues that the average may hide.
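For example, here is a rough sketch (not from the book, using the simple nearest-rank method) of computing p95 and p99 from latency samples shaped like the scenario above: 90 fast requests and 10 slow ones.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the nearest-rank percentile (p in 0-100)
// of the given latency samples.
func percentile(samples []time.Duration, p float64) time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// Nearest-rank: ceil(p/100 * N), converted to a zero-based index.
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Fake samples: 90 fast requests and 10 slow ones.
	var samples []time.Duration
	for i := 0; i < 90; i++ {
		samples = append(samples, 20*time.Millisecond)
	}
	for i := 0; i < 10; i++ {
		samples = append(samples, 2*time.Second)
	}

	fmt.Println("p95:", percentile(samples, 95)) // 2s
	fmt.Println("p99:", percentile(samples, 99)) // 2s
}
```

The average of these samples is about 218 ms, which looks acceptable, yet p95 and p99 both report the 2-second tail that the slowest users actually experience.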
Reliability
Reliability is essential in distributed systems. The system must continue functioning even when parts of it fail. To ensure this, strategies such as the following are used:
- Replication: Keeping copies of data in multiple locations.
- Redundancy: Having multiple instances of the same service.
- Retries: Repeating operations that temporarily failed.
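A minimal sketch of the retry strategy with exponential backoff follows; callService here is a hypothetical flaky dependency that fails intermittently.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// callService is a stand-in for a flaky remote call (hypothetical).
func callService() error {
	if rand.Intn(3) != 0 { // fails roughly 2 out of 3 times
		return errors.New("temporary failure")
	}
	return nil
}

// retry runs fn up to maxAttempts times with exponential backoff.
func retry(maxAttempts int, baseDelay time.Duration, fn func() error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		delay := baseDelay * time.Duration(1<<(attempt-1)) // 100ms, 200ms, 400ms, ...
		fmt.Printf("attempt %d failed (%v), retrying in %s\n", attempt, err, delay)
		time.Sleep(delay)
	}
	return fmt.Errorf("all %d attempts failed: %w", maxAttempts, err)
}

func main() {
	if err := retry(5, 100*time.Millisecond, callService); err != nil {
		fmt.Println("giving up:", err)
		return
	}
	fmt.Println("success")
}
```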
Idempotency
Idempotency ensures that performing the same operation multiple times will have the same effect, without causing inconsistencies. For example, if a user clicks the payment button twice, the system should only register one charge. This feature prevents serious errors in systems that may receive repeated requests.
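A common way to implement this is an idempotency key sent by the client: the server records each key and ignores repeats. The sketch below keeps the keys in memory for simplicity; a real system would persist them.

```go
package main

import (
	"fmt"
	"sync"
)

// PaymentProcessor deduplicates charges by idempotency key,
// so retrying the same request does not double-charge.
type PaymentProcessor struct {
	mu        sync.Mutex
	processed map[string]bool // idempotency key -> already handled
}

func NewPaymentProcessor() *PaymentProcessor {
	return &PaymentProcessor{processed: make(map[string]bool)}
}

// Charge registers a payment only the first time a given key is seen.
func (p *PaymentProcessor) Charge(idempotencyKey string, amount float64) string {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.processed[idempotencyKey] {
		return "duplicate request ignored"
	}
	p.processed[idempotencyKey] = true
	return fmt.Sprintf("charged %.2f", amount)
}

func main() {
	proc := NewPaymentProcessor()
	// The client sends the same key twice (e.g. a double click on "pay").
	fmt.Println(proc.Charge("order-123", 49.90)) // charged 49.90
	fmt.Println(proc.Charge("order-123", 49.90)) // duplicate request ignored
}
```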
Delivery Semantics
Delivery semantics define how messages are delivered between services:
- At-most-once: The message may be lost, but it will never be duplicated.
- At-least-once: The message will always arrive, but it may be delivered more than once.
- Exactly-once: The message arrives only once (ideal model but difficult to implement in practice).
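Because exactly-once is so hard to achieve, a common approach (a sketch of the idea, not something prescribed by the book) is to combine at-least-once delivery with a consumer that deduplicates by message ID, which gives effectively exactly-once processing:

```go
package main

import "fmt"

// Message carries a unique ID so consumers can detect redelivery.
type Message struct {
	ID   string
	Body string
}

// Consumer processes messages delivered at-least-once and
// skips IDs it has already seen.
type Consumer struct {
	seen map[string]bool
}

func (c *Consumer) Handle(m Message) {
	if c.seen[m.ID] {
		fmt.Println("skipping duplicate:", m.ID)
		return
	}
	c.seen[m.ID] = true
	fmt.Println("processing:", m.ID, m.Body)
}

func main() {
	c := &Consumer{seen: make(map[string]bool)}
	// The broker redelivers msg-1 because an ack was lost (at-least-once).
	deliveries := []Message{
		{ID: "msg-1", Body: "create order"},
		{ID: "msg-1", Body: "create order"}, // duplicate delivery
		{ID: "msg-2", Body: "send email"},
	}
	for _, m := range deliveries {
		c.Handle(m)
	}
}
```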
Relational Integrity
Relational integrity ensures that the data in the system maintains the correct relationships. For example, an order must always be linked to an existing customer. Without this guarantee, the system could store orders with no reference to who placed them.
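In a relational database this rule is enforced with a foreign key constraint; the in-memory sketch below expresses the same idea at the application level:

```go
package main

import (
	"errors"
	"fmt"
)

// Store keeps customers and orders and refuses orders that reference
// an unknown customer (the same rule a database enforces with a
// foreign key constraint).
type Store struct {
	customers map[string]bool
	orders    map[string]string // order ID -> customer ID
}

func NewStore() *Store {
	return &Store{customers: make(map[string]bool), orders: make(map[string]string)}
}

func (s *Store) AddCustomer(id string) { s.customers[id] = true }

func (s *Store) AddOrder(orderID, customerID string) error {
	if !s.customers[customerID] {
		return errors.New("relational integrity violation: unknown customer " + customerID)
	}
	s.orders[orderID] = customerID
	return nil
}

func main() {
	s := NewStore()
	s.AddCustomer("cust-1")
	fmt.Println(s.AddOrder("order-1", "cust-1")) // <nil> (accepted)
	fmt.Println(s.AddOrder("order-2", "cust-9")) // integrity violation
}
```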
Data Consistency
Maintaining data consistency in distributed systems is one of the biggest challenges. There are two main models:
- Strong consistency: All nodes in the system see the same information immediately. This is ideal for critical operations, such as bank transfers.
- Eventual consistency: Different parts of the system may see different information for a while, but the data will eventually align. This model is common in systems that prioritize high availability. The choice between strong and eventual consistency depends on a balance between accuracy, performance, and availability.
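The toy sketch below mimics eventual consistency with a primary replica and a secondary that receives writes asynchronously (the delay is artificial): a read from the secondary right after the write is stale, but the replicas converge shortly afterwards. Strong consistency would require the write to be confirmed on both replicas before returning.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Replica holds one copy of a key-value store.
type Replica struct {
	mu   sync.Mutex
	data map[string]string
}

func (r *Replica) Get(k string) string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.data[k]
}

func (r *Replica) Set(k, v string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.data[k] = v
}

func main() {
	primary := &Replica{data: map[string]string{}}
	secondary := &Replica{data: map[string]string{}}

	// The write goes to the primary; replication to the secondary is
	// asynchronous, with an artificial delay.
	primary.Set("balance", "100")
	go func() {
		time.Sleep(50 * time.Millisecond)
		secondary.Set("balance", primary.Get("balance"))
	}()

	// A read from the secondary right after the write may be stale...
	fmt.Println("secondary (immediately):", secondary.Get("balance")) // likely ""

	// ...but once replication catches up, the replicas converge.
	time.Sleep(100 * time.Millisecond)
	fmt.Println("secondary (later):", secondary.Get("balance")) // "100"
}
```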
Orchestration and Kubernetes
Manually managing many distributed services would be infeasible, and this is where orchestration comes in: it automates tasks such as deployment, scaling, and monitoring. Kubernetes is the most widely used orchestration tool today. It manages containers, distributes the workload, and keeps the system running even when parts of it fail.
Health Checks
Health checks are mechanisms used to monitor whether services are functioning correctly. In Kubernetes, there are two main types of checks:
- Liveness: Verifies whether the application is "alive." If this check fails, Kubernetes restarts the container.
- Readiness: Verifies whether the application is ready to receive requests. If it is not, Kubernetes stops sending it traffic until it becomes ready again.
These checks keep the system healthy and prevent users from being affected by problematic instances.
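In code, liveness and readiness are usually just two HTTP endpoints that Kubernetes probes. The sketch below uses the conventional /healthz and /readyz paths (the actual livenessProbe and readinessProbe settings live in the pod spec, which is not shown here):

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

func main() {
	// ready flips to true once the app has finished warming up
	// (e.g. caches loaded, connections established).
	var ready atomic.Bool

	// Liveness: answers 200 as long as the process can serve HTTP at all.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: answers 200 only when the app is ready for traffic.
	http.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			http.Error(w, "warming up", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	// Simulate startup work, then mark the service as ready.
	go func() {
		// ... load caches, open connections, etc. ...
		ready.Store(true)
	}()

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```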
Liked the content? Then be sure to follow me for more articles and updates on my social channels: