In this article, I'll try to explain what the circuit breaker pattern is.
A fail-fast system is one that immediately reports at its interface any condition that is likely to indicate a failure. - Wikipedia
So if one or more parts of a system fail, a fail-fast system detects this and stops operating in its normal way. Instead, it can behave in a way that is defined for that failure, such as returning a fallback when something goes wrong. A fail-fast system often checks the system's state at various points, so failures can be detected early.
A wise saying in system design is:
If a system fails, then let it fail fast and fail in a safe state.
Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of some of its components. - Wikipedia
At this point, can you relate a fail-fast system to a fault-tolerant system?
Well, a fault-tolerant system continues operating properly, rather than crashing or refusing requests, when some of its components fail. That means a fault-tolerant system can detect a failure and knows what to do when one occurs, just like a fail-fast system.
So, a fault-tolerant system shares a property with a fail-fast system: it monitors the system's state and can detect failures when they occur. But when it detects a failure, it does not stop operating. Instead, it behaves in a way specifically designed for that failure, such as returning a fallback response.
And this is what a circuit breaker is.
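To make the difference concrete, here is a minimal Python sketch. The function names, the `ServiceUnavailable` exception, and the `backend_healthy` flag are all made up for illustration. The first function fails fast, and the second wraps it so the caller still gets a well-defined fallback answer.

```python
class ServiceUnavailable(Exception):
    """Raised immediately when a dependency is known to be unhealthy."""


def fetch_profile(user_id, backend_healthy):
    # Fail fast: report the problem at the interface instead of
    # attempting work that is likely to fail.
    if not backend_healthy:
        raise ServiceUnavailable("profile backend is down")
    return {"id": user_id, "name": "user-%d" % user_id}


def fetch_profile_with_fallback(user_id, backend_healthy):
    # Fault tolerant: detect the failure, then keep serving with a
    # degraded but well-defined response.
    try:
        return fetch_profile(user_id, backend_healthy)
    except ServiceUnavailable:
        return {"id": user_id, "name": "unknown"}
```

In a real system the health flag would not be passed in by hand. It would come from some kind of monitoring, which is the part a circuit breaker provides.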
A circuit is a path through which electricity flows. A circuit breaker is a component in the circuit that monitors the current flowing through it.
You have probably seen one in your house.
A circuit breaker works like this: if the current in the circuit exceeds a threshold, the breaker trips and opens the circuit. No electricity can then flow through the circuit, so the excess current cannot damage the electrical components.
The circuit breaker is a very popular pattern for making a system resilient and fault-tolerant. In the microservice world, one service can depend on many others. And because microservices are deployed by separate teams, any service can experience downtime for any reason: maintenance may be under way, or the server or pods may have crashed. This downtime can have a ripple effect on other services. In this article, I will describe the circuit breaker pattern and how to make a service resilient by introducing it.
Say there are two microservices:
Service A --- Service B
Suppose Service B is down. Then Service A will see every call it makes to Service B fail, whether with a connection timeout, a request timeout, or some other error. If Service B is down for 20 minutes, Service A will experience these failures for roughly 20 minutes.
What could Service A do if it were intelligent? Well, it could monitor the requests it sends to Service B. So it could think:
Hey... I have sent 10 requests to Service B and all of them failed. I think Service B is down. So, you know what, I will not send any more requests to Service B. Instead, I will have a fallback response and send that to the client who is calling me.
Well, as you can see, the circuit breaker monitors the state of the system. In our example, it monitors the success status of the requests and, based on that, decides that some component is not working. So it can detect the failure like a fail-fast system. But it does not stop operating. It still serves every incoming request, but instead of the normal response it sends a fallback response. This is what makes the whole system fault-tolerant.
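Service A's reasoning can be sketched in a few lines of Python. This is only an illustration: the `FailureCounter` class and its parameter names are hypothetical, and the limit of 10 consecutive failures is simply the number from the example above.

```python
class FailureCounter:
    """Counts consecutive failures of calls to a downstream service."""

    def __init__(self, max_failures=10):
        self.max_failures = max_failures
        self.consecutive_failures = 0

    def call(self, request, fallback):
        # Once enough calls in a row have failed, skip the downstream
        # service entirely and answer with the fallback.
        if self.consecutive_failures >= self.max_failures:
            return fallback()
        try:
            result = request()
        except Exception:
            self.consecutive_failures += 1
            return fallback()
        # A success resets the count.
        self.consecutive_failures = 0
        return result
```

Note that this sketch never tries Service B again once it has given up. A real circuit breaker also needs a way to find out that the downstream service has recovered.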
The circuit in the circuit breaker has three phases:
- Closed
- Open
- Half Open
When both services are up and communicating, requests are always allowed. This means the circuit is closed, so the path is established, like the following:
Service A ------ Service B
When Service B is down, the circuit is opened and no requests are passed to Service B, like the following:
Service A XXXXXXXXX Service B
In this case, all the requests are handled by the fallback response from Service A.
In the Half Open phase, a small number of requests are passed to Service B. Usually, when your circuit is in the Open state, you don't want to stay in that state forever. Instead, you should wait for some time and then check again whether the downstream service is up. But when you check, you also don't want to send all of your requests to the downstream service. Instead, you should send a small number of requests to see if it is up. While you are doing this, the circuit is in the Half Open state.
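Here is a rough Python sketch of that waiting-and-probing logic. The `ProbeGate` class, the 30-second wait, and the limit of 3 probe requests are assumptions for illustration. The clock is passed in so the behavior can be tried without actually waiting.

```python
import time


class ProbeGate:
    """Decides whether a request may go downstream after the circuit opened."""

    def __init__(self, wait_seconds=30.0, max_probes=3, clock=time.monotonic):
        self.wait_seconds = wait_seconds
        self.max_probes = max_probes
        self.clock = clock
        self.opened_at = clock()  # moment the circuit was opened
        self.probes_sent = 0

    def state(self):
        # Stay OPEN until the wait time has elapsed, then become HALF_OPEN.
        if self.clock() - self.opened_at < self.wait_seconds:
            return "OPEN"
        return "HALF_OPEN"

    def allow_request(self):
        if self.state() == "OPEN":
            return False
        # HALF_OPEN: let only a small number of probe requests through.
        if self.probes_sent < self.max_probes:
            self.probes_sent += 1
            return True
        return False
```

What happens after the probes (closing the circuit again or re-opening it) depends on whether they succeed, which this sketch leaves out.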
Let's see a diagram to get a clearer idea.
- The state of the CircuitBreaker changes from CLOSED to OPEN when the failure rate (or the percentage of slow calls) is equal to or greater than a configurable threshold. For example, when more than 50% of the recorded calls have failed.
The failure rate and slow call rate can be calculated only if a minimum number of calls has been recorded. For example, if the minimum number of required calls is 10, then at least 10 calls must be recorded before the failure rate can be calculated. If only 9 calls have been evaluated, the CircuitBreaker will not trip open even if all 9 calls have failed.
After a wait duration has elapsed, the CircuitBreaker state changes from OPEN to HALF OPEN and permits a configurable number of calls to see if the downstream service is still unavailable or has become available again.
- If the failure rate or slow call rate of those calls is equal to or greater than the configured threshold, the state changes back to OPEN.
- If the failure rate and slow call rate are below the threshold, the state changes back to CLOSED.
When the circuit is in the OPEN state, all calls are rejected by throwing an exception.
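The transitions above can be put together in a small Python sketch. This is a simplified illustration under my own assumptions, not how any particular library implements it. It tracks only the failure rate and ignores slow calls. The window of recorded calls is the same size as the minimum number of calls. It is single-threaded, and the names `CircuitBreaker` and `CallNotPermitted` and all the default values are made up for this example.

```python
import time
from collections import deque


class CallNotPermitted(Exception):
    """Raised when a call is rejected because the circuit is OPEN."""


class CircuitBreaker:
    def __init__(self, failure_rate_threshold=50.0, minimum_calls=10,
                 wait_seconds=30.0, half_open_calls=3, clock=time.monotonic):
        self.failure_rate_threshold = failure_rate_threshold
        self.minimum_calls = minimum_calls
        self.wait_seconds = wait_seconds
        self.half_open_calls = half_open_calls
        self.clock = clock
        self.state = "CLOSED"
        self.opened_at = 0.0
        self.outcomes = deque(maxlen=minimum_calls)  # True means the call failed

    def _transition(self, state):
        self.state = state
        if state == "OPEN":
            self.opened_at = self.clock()
        # HALF_OPEN is judged on its own small batch of probe calls.
        size = self.half_open_calls if state == "HALF_OPEN" else self.minimum_calls
        self.outcomes = deque(maxlen=size)

    def _record(self, failed):
        self.outcomes.append(failed)
        if len(self.outcomes) < self.outcomes.maxlen:
            return  # not enough recorded calls to compute a failure rate
        rate = 100.0 * sum(self.outcomes) / len(self.outcomes)
        if rate >= self.failure_rate_threshold:
            self._transition("OPEN")
        elif self.state == "HALF_OPEN":
            self._transition("CLOSED")

    def call(self, func):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.wait_seconds:
                raise CallNotPermitted("circuit is OPEN")
            self._transition("HALF_OPEN")
        try:
            result = func()
        except Exception:
            self._record(True)
            raise
        self._record(False)
        return result
```

A caller would typically catch `CallNotPermitted` (and the downstream error itself) and return its fallback response there.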
And this is the concept of the Circuit Breaker.
Now, you can of course implement this yourself. But there are some really cool circuit breaker libraries that take care of it for us.
Netflix has its own circuit breaker library called Hystrix. The description of the library says,
Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services, and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.
Netflix uses this in its eco-system to ensure fault tolerance which is really necessary when you have hundreds or even more microservices.
But the problem with Hystrix is that it is no longer in active development and is currently in maintenance mode.
An alternative to Hystrix is Resilience4j. This is another fantastic library. The description says,
Resilience4j is a fault tolerance library designed for Java8 and functional programming
It has really cool features, and the documentation is rich with examples that show how to configure it.
So, I hope this article helps you and I hope you enjoyed it. If you have any questions or any confusion, please feel free to comment and we can have a nice discussion.
See you in another article. Till then, let's enjoy life :D