Ewerson Vieira Nascimento

Resilience Strategies for Microservices

Resilience in microservices starts with the acknowledgment that failures are inevitable, and therefore, systems must be designed to adapt when these failures occur. This involves implementing a group of strategies aimed at minimizing the risks of data loss and ensuring the continuity of important business transactions. Techniques such as data replication, distributed transaction management, and real-time monitoring help safeguard the integrity and availability of critical services. By anticipating failures and preparing for them, systems can maintain functionality and performance even under adverse conditions.

The goal of this article is to walk through some of the concepts and strategies that address resilience in microservices, since there are many actions we can take to keep our applications running when things go wrong.

Photo by [David Pupăză](https://unsplash.com/@davfts) on [Unsplash](https://unsplash.com)

The starting point for resilience in this context is the concept of protecting and being protected. Each service should be capable of self-preservation, maintaining consistent response times regardless of external issues, and avoiding requests to systems known to be failing, using mechanisms like circuit breakers. Microservices should also steer around slow systems, which can be even more harmful than complete failures.

Health Check

Health checks are vital for maintaining resilience in microservices. They should assess all system dependencies to provide a comprehensive status of the system’s health. An unhealthy system can often recover, or self-heal, if it stops receiving traffic temporarily. There are two types of health checks: passive and active. In a passive health check, the system becomes aware of its health status only upon receiving a request, such as when a service fails to respond correctly and triggers an alert. Active health checks involve the system continuously monitoring its own health or being monitored by other systems. For example, in an e-commerce platform, an active health check might involve the catalog service periodically checking the database connection and the API gateway checking the status of all microservices it routes traffic to, ensuring any issues are detected and addressed proactively.
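For illustration, here is a minimal sketch of an active health check endpoint in Go. The `/healthz` path and the `checkDatabase` helper are assumptions made for this example, not part of any particular framework; the point is that the endpoint inspects a real dependency and returns a 503 so traffic is routed away until the service recovers.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// checkDatabase stands in for a real dependency check (e.g. pinging the
// database with a short timeout). It is a hypothetical helper for this sketch.
func checkDatabase() error {
	return nil
}

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		// Assess critical dependencies; on failure, report 503 so callers and
		// load balancers stop sending traffic and the service can self-heal.
		if err := checkDatabase(); err != nil {
			http.Error(w, "unhealthy: "+err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ok"))
	})

	srv := &http.Server{Addr: ":8080", ReadTimeout: 5 * time.Second}
	log.Fatal(srv.ListenAndServe())
}
```

An orchestrator such as Kubernetes can then probe this endpoint periodically and route traffic away from instances that report themselves unhealthy.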

Rate Limit

Rate limiting is a technique for protecting a system by ensuring it only handles the load it was designed to support. By controlling the number of requests a service can handle within a given timeframe, rate limiting prevents system overloads and maintains performance. It can be customized based on client type, allowing different rate limits for various users or services. In an API-driven platform, for instance, regular users might be allowed 100 requests per minute, while premium users are allowed 1,000. This keeps the system stable and responsive, even under high demand, by prioritizing and managing traffic according to the designed capacity and each client's tier.
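As a rough sketch, the per-tier limits above could be enforced with a token-bucket middleware in Go using the golang.org/x/time/rate package. The `X-Client-Tier` header and the tier names are hypothetical; a real system would derive the tier from authentication and usually keep one limiter per client rather than one per tier.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"golang.org/x/time/rate"
)

// One shared token bucket per tier, to keep the sketch short; a real system
// would typically keep one limiter per client, derived from authentication.
var limits = map[string]*rate.Limiter{
	"regular": rate.NewLimiter(rate.Every(time.Minute/100), 100),   // ~100 req/min
	"premium": rate.NewLimiter(rate.Every(time.Minute/1000), 1000), // ~1,000 req/min
}

func rateLimit(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		limiter, ok := limits[r.Header.Get("X-Client-Tier")] // hypothetical header
		if !ok {
			limiter = limits["regular"]
		}
		if !limiter.Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", rateLimit(mux)))
}
```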

Circuit Breaker

The circuit breaker pattern is essential for enhancing system resilience by managing requests to potentially failing services. It functions by denying requests when necessary to prevent overload and cascading failures.

Circuit Breaker illustrated

In the closed state, it allows requests to proceed normally. If a service failure is detected, the circuit breaker transitions to the open state, blocking all requests to the failing service to prevent further issues. After a specified time, it moves to the half-open state, permitting a limited number of requests to test if the service has recovered. We could use the circuit breaker pattern in a payment processing system: if the payment gateway becomes unresponsive, the circuit breaker opens, stopping all payment attempts. After a cooldown period, it enters the half-open state, allowing a few transactions to check if the gateway is back to normal before fully reopening.
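Below is a minimal, hand-rolled sketch of the three states in Go; production systems would more likely rely on a battle-tested library or a service mesh policy. The failure threshold, cooldown, and the simplified half-open behavior are assumptions made for the example.

```go
package breaker

import (
	"errors"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker is open and calls are short-circuited.
var ErrOpen = errors.New("circuit breaker is open")

type Breaker struct {
	mu          sync.Mutex
	failures    int           // consecutive failures seen so far
	maxFailures int           // consecutive failures before opening
	cooldown    time.Duration // how long to stay open before probing (half-open)
	openedAt    time.Time
	open        bool
}

func New(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn unless the breaker is open. After the cooldown it lets a probe
// request through (half-open); a success closes the breaker again.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.open && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen // open: fail fast instead of hitting the failing service
	}
	b.mu.Unlock()

	err := fn() // closed or half-open: let the request through

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.open = true
			b.openedAt = time.Now()
		}
		return err
	}
	b.failures = 0
	b.open = false // success closes the breaker
	return nil
}
```

A payment client could then create `b := breaker.New(5, 30*time.Second)` and wrap each call as `b.Call(func() error { return callGateway() })`, where `callGateway` stands in for the real request; while the gateway is considered down, callers get `ErrOpen` immediately instead of waiting on timeouts.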

API Gateway

An API Gateway is a component of a microservices architecture that acts as a single entry point, blocking inappropriate requests before they reach the system. It enforces policies for rate limiting, health checks, and other crucial functions, thereby managing traffic and protecting the services behind it.

API Gateway usage example

In an e-commerce platform, an API Gateway could limit the number of requests per user, check the health of backend services before routing requests, and block malicious traffic, ensuring the system remains stable and efficient.
An API Gateway also helps avoid the “Death Star” scenario (where a vast number of microservices calling each other directly leads to complexity and potential failures) by routing requests through private API Gateways instead of direct calls between microservices.
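As a very small illustration of the entry-point idea, a gateway can be sketched in Go as a reverse proxy that owns the public address and forwards to private backends. The service names, ports, and path handling are made up for the example; dedicated gateway products add far more, such as authentication, per-client rate limiting, and health-aware routing.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical private backend addresses; a real gateway would discover
	// these dynamically and check their health before routing.
	catalog, _ := url.Parse("http://catalog:8080")
	orders, _ := url.Parse("http://orders:8080")

	mux := http.NewServeMux()
	// Path handling is simplified: real gateways usually rewrite or strip prefixes.
	mux.Handle("/catalog/", httputil.NewSingleHostReverseProxy(catalog))
	mux.Handle("/orders/", httputil.NewSingleHostReverseProxy(orders))

	// Policies such as rate limiting and auth would wrap mux here, so clients
	// never call the microservices directly.
	log.Fatal(http.ListenAndServe(":80", mux))
}
```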

Service Mesh

A Service Mesh is a dedicated infrastructure layer that controls network traffic between microservices via proxies, offloading responsibilities from individual services.

[Istio](https://istio.io/v1.13/about/service-mesh/) service mesh

It avoids the need for each service to implement its own protection mechanisms. Key features include mutual TLS (mTLS) for secure communication and policies for circuit breakers, retries, timeouts, and fault injection. In a financial services application, for example, a Service Mesh can manage secure connections between services, enforce retry policies for transient errors, and implement circuit breakers to prevent cascading failures, all without altering the services themselves, thereby enhancing overall resilience and security.

Working Asynchronously

Working asynchronously involves sending information to a message broker, ensuring that the message is received, and allowing interested systems to consume it at their own pace. This approach helps handle a higher volume of requests and reduces the risk of data loss if a server is down, as messages are stored and processed later.
In an e-commerce platform, when a user places an order, the order details could be sent to a message broker like Kafka. This allows the order processing system to handle the request asynchronously, ensuring that even if the order processing service is temporarily down, the message will be stored and processed once the service is back online. Developers must have good knowledge of the message broker or streaming system to effectively implement and manage asynchronous workflows.
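As a sketch of the producer side, assuming Kafka as the broker and the segmentio/kafka-go client (both choices are assumptions for this example), publishing an order could look like this; the topic name, broker address, and order fields are illustrative.

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/segmentio/kafka-go"
)

type Order struct {
	ID     string  `json:"id"`
	UserID string  `json:"user_id"`
	Total  float64 `json:"total"`
}

func main() {
	// Hypothetical broker address and topic name.
	w := &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"),
		Topic: "orders",
	}
	defer w.Close()

	payload, _ := json.Marshal(Order{ID: "order-123", UserID: "user-42", Total: 99.90}) // error omitted for brevity

	// The producer only needs the broker to accept the message; the order
	// processing service consumes it later, at its own pace.
	if err := w.WriteMessages(context.Background(), kafka.Message{
		Key:   []byte("order-123"),
		Value: payload,
	}); err != nil {
		log.Fatal("failed to publish order:", err)
	}
}
```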

Retry

Retry mechanisms are used to handle transient failures by attempting a request again after an initial failure. Using exponential backoff with jitter, the system increases the wait time between retries and adds randomness to prevent simultaneous retries across multiple systems.

Comparison of Backoff Algorithms for retry policies

Example: in a payment processing system, if a transaction request fails due to a temporary issue, the system retries the request with increasing delays (e.g., 1s, 2s, 4s) and introduces random delays to avoid all retries happening simultaneously. This approach improves the likelihood of success while preventing overload and collisions in the system.
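A minimal sketch of exponential backoff with jitter in Go might look like the following; the attempt count and base delay are illustrative parameters.

```go
package retry

import (
	"math/rand"
	"time"
)

// Do runs fn up to attempts times. After each failure it waits with an
// exponentially growing delay plus random jitter, so many clients do not
// retry in lockstep. base must be greater than zero.
func Do(attempts int, base time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		backoff := base << uint(i)                           // base, 2*base, 4*base, ...
		jitter := time.Duration(rand.Int63n(int64(backoff))) // random extra in [0, backoff)
		time.Sleep(backoff + jitter)
	}
	return err // last error after exhausting all attempts
}
```

The jitter matters because, after a shared dependency recovers, thousands of clients retrying on the same schedule would otherwise hit it at exactly the same instant.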

Delivery Guarantees

Delivery guarantees ensure that messages are reliably delivered and processed in a messaging system. In Kafka, the producer-side guarantee is controlled by the acks setting, which can be 0, 1, or -1 (all):

  • 0: No acknowledgment is required from the broker. This is the fastest but least reliable. An Uber app might send user/driver location updates with 0 acknowledgment to minimize latency, accepting that some updates might be lost.

  • 1: The message is acknowledged once the leader broker has written it, without waiting for the other replicas. This strikes a balance between speed and reliability.

  • -1: The message must be acknowledged by all in-sync replicas, providing the highest reliability but with increased latency.

Choosing the appropriate guarantee depends on the required trade-off between reliability and performance for different use cases.
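Sticking with the hypothetical segmentio/kafka-go producer from earlier, the guarantee is chosen through the writer's RequiredAcks field; the topic and broker address below are illustrative.

```go
package acks

import "github.com/segmentio/kafka-go"

// newLocationWriter builds a producer tuned for latency over reliability
// (acks = 0), appropriate for high-frequency location updates where
// occasional loss is acceptable.
func newLocationWriter() *kafka.Writer {
	return &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"), // hypothetical broker address
		Topic: "driver-locations",
		// acks = 0: fire and forget. Use kafka.RequireOne (acks = 1) or
		// kafka.RequireAll (acks = -1) for messages that must not be lost.
		RequiredAcks: kafka.RequireNone,
	}
}
```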

Complex Situations

In complex situations, such as when a message broker goes down, several challenges arise: potential message loss, and the question of whether your system will remain operational at all. To ensure resilience, it’s crucial to implement strategies like:

  1. Durable Storage: Ensure the message broker uses durable storage to prevent data loss. Kafka, for example, persists messages to disk and replicates them across multiple brokers.

  2. Retry Mechanisms: Implement retry logic to handle temporary failures and reconnect to the broker once it’s back online.

  3. Fallback Strategies: Design the system to switch to a fallback mechanism or queue messages temporarily if the broker is unavailable.

In an order processing system, if the message broker fails, messages might be temporarily stored in a local queue and retried once the broker is operational again, ensuring the system can recover without complete downtime.
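A rough sketch of that fallback idea, assuming a hypothetical `publish` function that wraps the real broker client: messages that cannot be delivered are buffered locally and flushed later. An in-memory buffer like this is lost if the process itself crashes, which is exactly the gap the transactional outbox pattern described next closes.

```go
package fallback

import "sync"

// Publisher tries the broker first and buffers messages locally when it is
// unavailable; Flush retries the buffered messages once the broker recovers.
type Publisher struct {
	mu      sync.Mutex
	pending [][]byte
	publish func(msg []byte) error // placeholder for the real broker client call
}

func New(publish func([]byte) error) *Publisher {
	return &Publisher{publish: publish}
}

// Send delivers the message if possible, otherwise queues it for later.
func (p *Publisher) Send(msg []byte) {
	if err := p.publish(msg); err != nil {
		// Broker unavailable: queue locally instead of dropping the message.
		p.mu.Lock()
		p.pending = append(p.pending, msg)
		p.mu.Unlock()
	}
}

// Flush retries everything buffered while the broker was down; call it
// periodically or when a health check reports the broker is back.
func (p *Publisher) Flush() {
	p.mu.Lock()
	defer p.mu.Unlock()
	remaining := p.pending[:0]
	for _, msg := range p.pending {
		if err := p.publish(msg); err != nil {
			remaining = append(remaining, msg) // still failing: keep for next flush
		}
	}
	p.pending = remaining
}
```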

Transactional Outbox

A transactional outbox ensures reliable message delivery by saving messages to a database table before sending them to the message broker. If the broker is down, the messages are stored temporarily in this table and are sent to the broker once it becomes available again. One use of this pattern is in an e-commerce system: when an order is placed, the order details are saved to a transactional outbox table in the database. If the message broker is temporarily unavailable, the system keeps the order details in the outbox table. When the broker is back online, a process reads from this table and sends the pending messages to the broker, ensuring no data is lost and the system remains resilient.
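A sketch of the write path, assuming a relational database with Postgres-style placeholders and illustrative table names, could look like this; a separate relay job would poll the `outbox` table and publish the unpublished rows to the broker.

```go
package outbox

import (
	"context"
	"database/sql"
)

// SaveOrderWithOutbox writes the order and its outgoing event in the same
// database transaction. A separate relay process later reads the outbox table
// and publishes pending rows to the broker, so nothing is lost if the broker
// is down. Table and column names are illustrative.
func SaveOrderWithOutbox(ctx context.Context, db *sql.DB, orderID string, payload []byte) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	if _, err := tx.ExecContext(ctx,
		`INSERT INTO orders (id, payload) VALUES ($1, $2)`, orderID, payload); err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO outbox (aggregate_id, topic, payload, published) VALUES ($1, 'orders', $2, false)`,
		orderID, payload); err != nil {
		return err
	}
	return tx.Commit() // order and event become visible atomically
}
```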

Receipt Guarantees

Receipt guarantees ensure that once a message has been successfully processed, the consumer acknowledges it so the message broker can discard it, preventing reprocessing. This is achieved by having the system send a confirmation to the broker after successfully handling the message. In a payment system, for example, the acknowledgment is sent to the broker only after a payment message has been successfully processed.
To optimize performance and reduce downtime, it’s essential to align the prefetch settings with the expected volume, determined through stress tests and hardware checks.
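With the hypothetical kafka-go consumer below, the offset is committed only after the handler succeeds, so a message that fails processing is redelivered instead of being silently discarded; the group, topic, and broker values are illustrative.

```go
package consumer

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

// Run processes payment messages and only commits (acknowledges) an offset
// after the message was handled successfully.
func Run(ctx context.Context, process func([]byte) error) {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"}, // hypothetical broker
		GroupID: "payments",
		Topic:   "payments",
	})
	defer r.Close()

	for {
		m, err := r.FetchMessage(ctx) // fetch without committing
		if err != nil {
			return
		}
		if err := process(m.Value); err != nil {
			log.Println("processing failed, message will be redelivered:", err)
			continue // do not commit: the broker keeps the message
		}
		if err := r.CommitMessages(ctx, m); err != nil { // acknowledge only after success
			log.Println("commit failed:", err)
		}
	}
}
```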

Idempotence and Fallback Policies

Example of when Idempotence is crucial

Idempotence guarantees that processing a message multiple times does not produce different results, which is crucial for avoiding duplicated data and side effects. In banking systems, if a deposit message is received twice due to a retry, idempotence ensures the deposit is applied only once, preventing duplicate entries.

Fallback policies handle scenarios where the primary operation fails. So, in this context, if a deposit transaction fails to process due to a temporary issue, a fallback policy might involve retrying the operation or applying a compensating action to correct the error. This ensures that the system remains reliable and can recover from failures without affecting data consistency.
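A sketch of an idempotent deposit handler, assuming a `processed_messages` table whose primary key is the message ID (table and column names are made up for the example):

```go
package deposits

import (
	"context"
	"database/sql"
)

// ApplyDeposit credits an account at most once per message ID. If the same
// message is delivered again, the insert into processed_messages fails and
// the deposit is skipped instead of being applied twice.
func ApplyDeposit(ctx context.Context, db *sql.DB, messageID, accountID string, amount int64) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Record the message ID first; a primary-key violation means it was
	// already processed. A real implementation would inspect the error to
	// distinguish a duplicate from other database failures.
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO processed_messages (id) VALUES ($1)`, messageID); err != nil {
		return nil // duplicate delivery: treat as already applied
	}
	if _, err := tx.ExecContext(ctx,
		`UPDATE accounts SET balance = balance + $1 WHERE id = $2`, amount, accountID); err != nil {
		return err
	}
	return tx.Commit()
}
```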

Observability

Observability is about monitoring and understanding the behavior of a microservices system. It involves:

  1. Application Performance Monitoring (APM): Tracks overall application performance, identifying bottlenecks and issues. It can highlight slow response times in your system.

  2. Distributed Tracing: Allows tracking of requests as they travel through various microservices. For instance, distributed tracing can show the journey of a user order from placement to payment and shipping.

  3. Custom Metrics: Provides insights into specific aspects of system performance, like the number of successful transactions or error rates.

  4. Custom Spans: Each span represents a step in the system’s processing, helping to pinpoint where delays or failures occur. A span could track the time taken to validate a user’s credentials.

  5. OpenTelemetry: A standard framework for collecting and exporting observability data, including traces and metrics. It helps integrate various observability tools, providing a unified view of the system’s performance.

Using these tools can give you deep insights into your microservices’ behavior, ensuring that issues are quickly identified and resolved.
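As a small example of custom spans, a credential-validation step could be wrapped with the OpenTelemetry Go SDK as below; the tracer name and attribute are illustrative, and a real service would also configure an exporter (for example, to an OTLP collector) at startup.

```go
package auth

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// ValidateCredentials wraps the credential check in a custom span so the time
// spent in this step shows up in distributed traces.
func ValidateCredentials(ctx context.Context, userID string) error {
	ctx, span := otel.Tracer("auth-service").Start(ctx, "validate-credentials")
	defer span.End()

	span.SetAttributes(attribute.String("user.id", userID))

	// The actual credential validation would run here, passing ctx to
	// downstream calls so their spans join the same trace.
	_ = ctx
	return nil
}
```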

Conclusion

When building microservices, thinking about resilience from the outset is crucial. It ensures that your system can withstand and recover from failures, maintaining continuity and reliability. The techniques discussed in this article — such as health checks, rate limiting, circuit breakers, API Gateways, and Service Meshes — provide a comprehensive toolkit for enhancing resilience. Each method addresses different aspects of system stability and performance, from managing traffic and preventing overloads to ensuring secure communication and reliable message delivery. The key point is to always have a proactive approach to resilience to thrive in unpredictable environments.
