In a system built on a microservices architecture, services often need to collaborate to get the job done. But how does the system behave if a microservice is down? What happens if a service is unavailable, or not responsive enough to handle requests? How can this be handled gracefully, with minimal effect on the user experience? In this article, I will explain the circuit breaker pattern, which provides a solution to this problem.
First of all, we must accept that if a service is unavailable or unresponsive, there is nothing we can do until the service is up again and the cause of the problem is solved. That means our only options are to fail fast and be fault tolerant. To understand why the circuit breaker is important, let's see how the error might be handled without one.
Assume that a user sends a request to buy some stock, and somewhere on the path of that transaction the transactions microservice needs information from the users microservice, but the users microservice is down or the network is slow. If the timeout value is 10 seconds, the user will only learn about the error after 10 seconds, and every other user trying to make a transaction will also wait 10 seconds before seeing an error. The same happens on each retry by the user. Obviously, this is not acceptable, especially in a realtime system where time is critical and users expect things to happen instantaneously.
That's where a circuit breaker comes in.
The circuit breaker is a proxy that a microservice receives requests through. It checks two things regarding that microservice:
1- availability: Is the microservice online?
2- responsiveness: Can the microservice handle requests?
By monitoring these two things, the circuit breaker decides whether to allow requests to reach the microservice. This allows the system to fail fast and hence be more fault tolerant.
Initially, the circuit breaker is closed (it allows requests) and it keeps monitoring the microservice at all times until the microservice is either unavailable or unresponsive. In that case, the circuit breaker opens and it automatically rejects all subsequent requests without sending them to the microservice or waiting for the timeout period. Eventually when the microservice is available and responsive again, the circuit closes and allows requests to reach the microservice normally. The most important part of that process is the monitoring.
One way to monitor the microservice is an endpoint that simply replies with OK. The circuit breaker can keep pinging the endpoint: if it replies, the service is up; if it doesn't, it's down. However, this approach is limited, since it only checks for availability but not responsiveness. In order to work, a service might need a database, and if the database is unavailable then the service is not responsive and should not receive requests even though it is online.
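To make the closed/open behaviour and the ping-based monitoring concrete, here is a minimal Python sketch. The class name `PingCircuitBreaker`, the `ping` callable and the `refresh` method are all hypothetical names chosen for illustration; a real setup would run the health check from a timer and call an actual health endpoint.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised immediately when the circuit is open (fail fast)."""


class PingCircuitBreaker:
    """Proxy in front of a service; a health ping decides its state."""

    def __init__(self, ping: Callable[[], bool]) -> None:
        # Hypothetical callable: returns True if the health endpoint replied OK.
        self._ping = ping
        # The circuit starts closed, so requests are allowed through.
        self.is_open = False

    def refresh(self) -> None:
        """Run one health check; meant to be called periodically."""
        try:
            healthy = self._ping()
        except Exception:
            healthy = False  # no reply counts as "down"
        self.is_open = not healthy

    def call(self, request: Callable[[], T]) -> T:
        """Forward the request only while the circuit is closed."""
        if self.is_open:
            raise CircuitOpenError("service is down, request rejected without waiting")
        return request()
```

With this sketch, `breaker.call(...)` forwards requests while the ping succeeds. Once a `refresh()` sees the ping fail, every call raises `CircuitOpenError` at once instead of waiting for a timeout, and the first successful ping closes the circuit again. It also inherits the limitation described above: it only learns whether the endpoint answers, not whether the service can actually do its work.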
To solve that, another approach is a synthetic transaction. This is done by calling a real endpoint in the service with fake data; depending on whether the request is processed or not, the circuit closes or opens. But this means that all services in all environments must be aware that this is a fake request and must not treat it as a real transaction.
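As a rough illustration of a synthetic transaction, the sketch below marks fake requests with a header so the service can run them through its real code path without persisting anything. The header name `X-Synthetic-Request`, the toy `OrderService` and the `synthetic_probe` helper are assumptions made up for this example, not part of any particular framework.

```python
from dataclasses import dataclass, field
from typing import Dict, List

SYNTHETIC_HEADER = "X-Synthetic-Request"  # hypothetical header name


@dataclass
class OrderService:
    """Toy stand-in for a real service that depends on a database."""

    db_available: bool = True
    orders: List[dict] = field(default_factory=list)

    def buy(self, headers: Dict[str, str], order: dict) -> str:
        if not self.db_available:
            raise ConnectionError("database unreachable")
        if order["quantity"] <= 0:
            raise ValueError("quantity must be positive")
        if headers.get(SYNTHETIC_HEADER) == "true":
            # Same checks as a real order, but nothing is persisted.
            return "accepted (synthetic)"
        self.orders.append(order)
        return "accepted"


def synthetic_probe(service: OrderService) -> bool:
    """Return True if a fake order makes it through the real endpoint."""
    fake_order = {"symbol": "TEST", "quantity": 1}
    try:
        service.buy({SYNTHETIC_HEADER: "true"}, fake_order)
        return True
    except Exception:
        return False
```

A circuit breaker could call `synthetic_probe` periodically and open the circuit when it returns `False`. Unlike a plain ping, the probe fails when the database is unreachable, but every downstream component has to recognise the marker, which is exactly the cost mentioned above.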
That's why another approach, real-user monitoring, is more commonly used.
This approach monitors the requests made by users in realtime, tracks their response times and analyzes the trends in them. If the response time keeps rising above the mean, the service is assumed to be overloaded and the circuit opens. The circuit breaker then moves to a half-open state, in which it only allows a small number of requests and monitors them. When the response time is back to normal, the circuit is closed again and all requests are handled normally.
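The closed, open and half-open cycle described above can be sketched in Python as follows. This is a simplified, single-threaded illustration under stated assumptions: "overloaded" is taken to mean that the mean latency of the last `window` requests exceeds a configured baseline multiplied by `factor`, and the circuit moves from open to half-open after a fixed cool-down. All names and parameters (`LatencyCircuitBreaker`, `baseline_s`, `cooldown_s`, `trial_calls`) are hypothetical, and errors raised by the request itself are not counted here.

```python
import time
from collections import deque
from statistics import mean
from typing import Callable, Deque, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a request is rejected because the circuit is open."""


class LatencyCircuitBreaker:
    def __init__(self, baseline_s: float, factor: float = 2.0, window: int = 5,
                 cooldown_s: float = 30.0, trial_calls: int = 3,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit_s = baseline_s * factor   # latency above this counts as slow
        self.recent: Deque[float] = deque(maxlen=window)
        self.cooldown_s = cooldown_s
        self.trial_calls = trial_calls
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0
        self.trial_ok = 0

    def _open(self) -> None:
        self.state = "open"
        self.opened_at = self.clock()
        self.recent.clear()

    def call(self, request: Callable[[], T]) -> T:
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("circuit open: request rejected immediately")
            self.state = "half-open"         # cool-down over: admit trial requests
            self.trial_ok = 0
        started = self.clock()
        result = request()
        self._record(self.clock() - started)
        return result

    def _record(self, latency_s: float) -> None:
        if self.state == "half-open":
            if latency_s > self.limit_s:
                self._open()                 # still slow: reopen the circuit
            else:
                self.trial_ok += 1
                if self.trial_ok >= self.trial_calls:
                    self.state = "closed"    # response time is back to normal
            return
        self.recent.append(latency_s)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and mean(self.recent) > self.limit_s:
            self._open()
```

The `clock` parameter is injectable so the behaviour can be exercised without real waiting. In a concurrent service, the half-open state would also need to cap how many trial requests are in flight at once, which this sketch leaves out.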
This can also be threshold-based: a threshold is set for failed requests, and the circuit breaker waits for requests to actually fail that many times before opening the circuit and, later, going to the half-open state.
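A minimal sketch of the threshold-based variant is shown below. It assumes the conventional sequence in which the circuit opens once the number of consecutive failures reaches the threshold, and a single probe request is let through after a reset timeout (the half-open step). The names `ThresholdCircuitBreaker`, `failure_threshold` and `reset_timeout_s` are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a request is rejected because the circuit is open."""


class ThresholdCircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0                         # consecutive failures so far
        self.opened_at: Optional[float] = None    # None means the circuit is closed

    def call(self, request: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open: request rejected immediately")
            # Reset timeout elapsed: half-open, so this request acts as a probe.
        try:
            result = request()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Open the circuit, or reopen it after a failed probe.
                self.opened_at = self.clock()
            raise
        # A success resets the counter and closes the circuit.
        self.failures = 0
        self.opened_at = None
        return result
```

Counting consecutive failures is only one possible policy; a failure rate over a time window is another common choice, and it needs more bookkeeping.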
In both cases, this approach needs a cache or storage to keep track of the requests, and its overhead depends on the efficiency of the storage used.
Now that we know what a circuit breaker is, we have two options to implement it:
1- Implement it from scratch.
2- Use one that's provided by a service that you already use.
As for the first one, it's relatively easy to implement and there are many examples online so I will not get into that and I will focus on the second option.
If you're using an API gateway, chances are it already has a circuit breaker that you can configure. For example, Tyk and KrakenD both have their own circuit breaker implementations that you can simply use. Other open-source software such as Elasticsearch also ships configurable circuit breakers, although in Elasticsearch's case they guard against operations that would exhaust memory rather than protect calls between services. Many orchestration and integration frameworks, like Apache Camel, also provide circuit breakers.
Failing fast is important for most systems and crucial for realtime systems. But even though the circuit breaker is useful, it does affect performance, since it adds a new layer that every request needs to go through. In software engineering there are no silver bullets: using microservices introduces new problems, and those problems are solved by approaches like the circuit breaker, which in turn affect performance. Things can also get more complex in distributed systems.
In short, there are always tradeoffs, and it's a matter of what suits your use case and what you can afford to sacrifice in order to improve other aspects of your system.