In this article, I'll try to explain what the circuit breaker pattern is.
Fail-fast system
a fail-fast system is one that immediately reports at its interface any condition that is likely to indicate a failure. - Wikipedia
So if one or more parts of a system fail, a fail-fast system detects this and stops operating in its normal way. Instead, it can behave in a way defined for that failure, such as returning a fallback when something goes wrong. A fail-fast system often checks its state at regular intervals so that failures are detected early.
A wise saying in system design is
If a system fails, then let it fail fast and fail in a safe state.
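As a small illustration of failing fast, here is a minimal sketch that checks a setting at the interface and reports the problem immediately. The names `ConfigError`, `load_timeout_seconds`, and the `timeout_seconds` key are made up for this example.

```python
class ConfigError(ValueError):
    """Raised as soon as an invalid setting is seen (hypothetical error type)."""


def load_timeout_seconds(settings: dict) -> float:
    # Fail fast: report the problem right here at the interface,
    # instead of letting a bad value cause trouble much later.
    if "timeout_seconds" not in settings:
        raise ConfigError("timeout_seconds is missing")
    timeout = settings["timeout_seconds"]
    is_number = isinstance(timeout, (int, float)) and not isinstance(timeout, bool)
    if not is_number or timeout <= 0:
        raise ConfigError(f"timeout_seconds must be a positive number, got {timeout!r}")
    return float(timeout)


valid_timeout = load_timeout_seconds({"timeout_seconds": 2})
```

The caller finds out about a bad value at load time, not minutes later when a request hangs.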
Fault-tolerant system
Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of some of its components. - Wikipedia
At this point, can you relate a fail-fast system to a fault-tolerant system?
Well, a fault-tolerant system continues operating properly, rather than crashing or refusing to serve requests, when some components fail. That means a fault-tolerant system can detect a failure and knows what to do when one occurs, just like a fail-fast system.
So a fault-tolerant system shares a property with a fail-fast system: it monitors the system's state and detects failures when they occur. But when it detects a failure, it does not stop operating. Instead, it behaves in a way specifically designed for that failure, such as returning a fallback response.
And this is what a circuit-breaker is.
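The fallback idea can be sketched in a few lines. This is only an illustration: `call_with_fallback`, `fetch_recommendations`, and `default_recommendations` are hypothetical names, and a real service would likely catch narrower exception types.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    # Detect the failure, but keep serving by switching to the fallback.
    try:
        return primary()
    except Exception:
        return fallback()


def fetch_recommendations() -> list:
    # Stand-in for a remote call that is currently failing.
    raise ConnectionError("downstream service is down")


def default_recommendations() -> list:
    return ["bestsellers"]


result = call_with_fallback(fetch_recommendations, default_recommendations)
```

Note that this wrapper alone still attempts the failing call every single time. What a circuit breaker adds is the decision to stop calling for a while.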
Circuit Breaker
A circuit is a path through which electricity flows. A circuit breaker is a component in the circuit that monitors the current flowing through it.
You probably saw this in your house.
This is a circuit breaker. If the current in the circuit exceeds a threshold, the breaker trips and opens the circuit. No electricity can then flow through the circuit, so the excess current cannot damage the electrical components.
Circuit breaker pattern
The circuit breaker is a very popular pattern for making a system resilient and fault-tolerant. In the microservice world, one service can depend on many others. And because microservices are often deployed by separate teams, any service can experience downtime for any reason: maintenance may be in progress, or servers or pods may have crashed. This downtime can have a ripple effect on other services. In this article, I will describe the circuit breaker pattern and how to make a service resilient by introducing it.
Say there are two microservices, where Service A calls Service B:
Service A --- Service B
Now suppose Service B is down. Then Service A will find that all the calls it makes to Service B are failing. The cause might be a connection timeout, a request timeout, or anything else. If Service B is down for 20 minutes, Service A will keep experiencing these failures for 20 minutes or so.
What could Service A do if it were intelligent? Well, it could monitor the requests it sends to Service B. So it could think,
Hey... I have sent 10 requests to Service B and all of them failed. I think Service B is down. So, you know what, I will not send any more requests to Service B. Instead, I will send a fallback response to the client who is calling me.
As you can see, the circuit breaker monitors the state of the system. In our example, it monitors the success status of the requests, and based on that it decides that some component is not working. So it can detect a failure like a fail-fast system, but it does not stop operating. It still serves every incoming request, but instead of the normal response it sends a fallback response. This is what makes the whole system fault-tolerant.
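Service A's reasoning ("10 requests failed, so I will stop asking") can be sketched roughly like this. `ConsecutiveFailureGuard` and `call_service_b` are hypothetical names, and counting failures in a row is just one simple way to make the decision.

```python
from typing import Callable


class ConsecutiveFailureGuard:
    """Stops calling a dependency after N failures in a row (hypothetical helper)."""

    def __init__(self, max_failures: int = 10) -> None:
        self.max_failures = max_failures
        self.failures_in_a_row = 0

    def call(self, request: Callable[[], str], fallback: str) -> str:
        if self.failures_in_a_row >= self.max_failures:
            # "I think Service B is down": skip the call entirely.
            return fallback
        try:
            response = request()
        except Exception:
            self.failures_in_a_row += 1
            return fallback
        self.failures_in_a_row = 0  # a success clears the streak
        return response


attempts = {"count": 0}


def call_service_b() -> str:
    # Stand-in for a request to Service B while it is down.
    attempts["count"] += 1
    raise TimeoutError("request timed out")


guard = ConsecutiveFailureGuard(max_failures=10)
responses = [guard.call(call_service_b, fallback="cached response") for _ in range(25)]
```

All 25 clients get the fallback, but only the first 10 requests are actually sent to Service B. This sketch never tries Service B again, though, which is why a real circuit breaker needs more than one state.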
Deep Dive
A circuit breaker has three states.
- Closed
- Open
- Half Open
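In code, these three states are often modelled as an enumeration. The sketch below is one possible way to write them down together with the transitions this article describes. `State`, `ALLOWED_TRANSITIONS`, and `can_transition` are names chosen for the example.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # requests flow to the downstream service
    OPEN = "open"            # requests are rejected and the fallback is used
    HALF_OPEN = "half_open"  # a few trial requests are let through


# Which state changes are possible, following the article's description.
ALLOWED_TRANSITIONS = {
    State.CLOSED: {State.OPEN},
    State.OPEN: {State.HALF_OPEN},
    State.HALF_OPEN: {State.OPEN, State.CLOSED},
}


def can_transition(current: State, target: State) -> bool:
    return target in ALLOWED_TRANSITIONS[current]
```

For instance, `can_transition(State.OPEN, State.CLOSED)` is `False` here, because an open circuit has to pass through Half Open before it closes again.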
Closed Circuit
When both services are UP and communicating, requests are always allowed through. This means the circuit is closed and the path is established, like the following:
Service A ------ Service B
Open Circuit
When Service B is down, the circuit is opened and no requests are passed to Service B, like the following:
Service A XXXXXXXXX Service B
In this case, Service A answers all requests with the fallback response.
Half Open
In this state, a small number of requests are passed to Service B. When your circuit is in the Open state, you don't want to stay there forever. Instead, you should wait for some time and then check whether the downstream service is UP again. But while checking, you also don't want to send all of your requests to the downstream service. Instead, you send a small number of requests to see whether it has recovered. The state in which you do this is called the Half Open state.
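The "only a small number of requests" part can be sketched as a simple gate. `HalfOpenGate` and `permitted_trial_calls` are hypothetical names, and the limit of 3 is an arbitrary choice for the example.

```python
class HalfOpenGate:
    """Lets only a limited number of trial requests through (hypothetical helper)."""

    def __init__(self, permitted_trial_calls: int = 3) -> None:
        self.permitted_trial_calls = permitted_trial_calls
        self.trial_calls_started = 0

    def try_acquire(self) -> bool:
        # True means "send this request to the downstream service";
        # False means "answer with the fallback instead".
        if self.trial_calls_started < self.permitted_trial_calls:
            self.trial_calls_started += 1
            return True
        return False


gate = HalfOpenGate(permitted_trial_calls=3)
decisions = [gate.try_acquire() for _ in range(5)]
# decisions is [True, True, True, False, False]
```

Only the first three requests reach the downstream service. Their outcomes then decide whether the circuit closes or opens again.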
Let's look at a diagram to get a clearer idea.
Circuit diagram
When State is Changed
- The state of the CircuitBreaker changes from CLOSED to OPEN when the failure rate (or percentage of slow calls) is equal to or greater than a configurable threshold, for example when 50% or more of the recorded calls have failed.
The failure rate and slow call rate can be calculated only if a minimum number of calls were recorded. For example, if the minimum number of required calls is 10, then at least 10 calls must be recorded before the failure rate can be calculated. If only 9 calls have been evaluated the CircuitBreaker will not trip open even if all 9 calls have failed.
- After a wait time duration has elapsed, the CircuitBreaker state changes from OPEN to HALF OPEN and permits a configurable number of calls to see if the downstream service is still unavailable or has become available again.
- If, in the HALF OPEN state, the failure rate or slow call rate is equal to or greater than the configured threshold, the state changes back to OPEN.
- If, in the HALF OPEN state, the failure rate and slow call rate are below the threshold, the state changes back to CLOSED.
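The rule about the minimum number of calls can be sketched as follows. This is a simplified illustration, not any library's actual implementation: `CountBasedWindow` is a hypothetical class that remembers the outcome of the last `size` calls, and it ignores slow calls entirely.

```python
from collections import deque
from typing import Optional


class CountBasedWindow:
    """Keeps the outcome of the last `size` calls (hypothetical helper)."""

    def __init__(self, size: int = 10, minimum_calls: int = 10) -> None:
        self.minimum_calls = minimum_calls
        self.outcomes = deque(maxlen=size)  # True = failed, False = succeeded

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def failure_rate(self) -> Optional[float]:
        # No rate is reported until enough calls have been recorded.
        if len(self.outcomes) < self.minimum_calls:
            return None
        return 100.0 * sum(self.outcomes) / len(self.outcomes)

    def should_open(self, threshold_percent: float = 50.0) -> bool:
        rate = self.failure_rate()
        return rate is not None and rate >= threshold_percent


window = CountBasedWindow(size=10, minimum_calls=10)
for _ in range(9):
    window.record(failed=True)
after_nine = window.should_open()  # False: only 9 calls recorded so far
window.record(failed=True)
after_ten = window.should_open()   # True: 10 of 10 recorded calls failed
```

With 9 recorded calls the breaker stays closed even though every one of them failed. The 10th call is what makes the failure rate available.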
When the circuit is in the OPEN state, all calls are rejected by throwing an exception.
And this is the concept of the Circuit Breaker.
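To tie the state changes together, here is a minimal single-threaded sketch of the whole state machine. It is an illustration under simplifying assumptions, not how any particular library works: `CircuitBreaker`, `CircuitOpenError`, and all parameter names are made up, the window size equals the minimum number of calls, slow calls are ignored, and the clock is injectable so the wait time can be simulated.

```python
import time
from collections import deque
from typing import Callable


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: float = 50.0, minimum_calls: int = 10,
                 wait_seconds: float = 30.0, trial_calls: int = 3,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.minimum_calls = minimum_calls
        self.wait_seconds = wait_seconds
        self.trial_calls = trial_calls
        self.clock = clock
        self.opened_at = 0.0
        self._move_to("CLOSED")

    def _move_to(self, state: str) -> None:
        self.state = state
        size = self.trial_calls if state == "HALF_OPEN" else self.minimum_calls
        self.outcomes = deque(maxlen=size)  # True = failed call
        if state == "OPEN":
            self.opened_at = self.clock()

    def call(self, func: Callable[[], object]) -> object:
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.wait_seconds:
                raise CircuitOpenError("circuit is open")
            self._move_to("HALF_OPEN")  # wait time elapsed: allow trial calls
        try:
            result = func()
        except Exception:
            self._record(failed=True)
            raise
        self._record(failed=False)
        return result

    def _record(self, failed: bool) -> None:
        self.outcomes.append(failed)
        if len(self.outcomes) < self.outcomes.maxlen:
            return  # not enough recorded calls to judge yet
        rate = 100.0 * sum(self.outcomes) / len(self.outcomes)
        if rate >= self.failure_threshold:
            self._move_to("OPEN")
        elif self.state == "HALF_OPEN":
            self._move_to("CLOSED")


def failing() -> str:
    raise ConnectionError("Service B is down")


now = [0.0]  # fake clock so the example does not have to sleep
breaker = CircuitBreaker(minimum_calls=4, wait_seconds=30.0, trial_calls=2,
                         clock=lambda: now[0])
for _ in range(4):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass
state_after_failures = breaker.state  # "OPEN"

now[0] = 31.0  # the wait time has passed
breaker.call(lambda: "ok")
state_during_trial = breaker.state    # "HALF_OPEN"
breaker.call(lambda: "ok")
state_after_recovery = breaker.state  # "CLOSED"
```

The caller is expected to catch `CircuitOpenError` (and the original exception) and return its fallback response.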
Now, you can of course implement this yourself. But there are some really cool circuit breaker libraries that take care of it for us.
Netflix has its own circuit breaker library called Hystrix. The description of the library says,
Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services, and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.
Netflix uses this in its ecosystem to ensure fault tolerance, which is really necessary when you have hundreds of microservices or more.
But the problem with Hystrix is that it is no longer in active development and is currently in maintenance mode.
An alternative to Hystrix is Resilience4j. This is another fantastic library. The description says,
Resilience4j is a fault tolerance library designed for Java8 and functional programming
It has really cool features, and the documentation is rich with examples showing how to configure it.
So, I hope this article helps you and I hope you enjoyed it. If you have any questions or any confusion, please feel free to comment and we can have a nice discussion.
See you in another article. Till then, let's enjoy life :D
Top comments (8)
"Fault-tolerant" is an adjective that describes the whole system design. A circuit breaker is a library that lives inside your service. When the circuit breaker manages all the things Sajid mentioned in the blog, the system becomes more resilient: it handles failure with a fallback instead of letting the failure cascade and turn the system into a waiting room. Hence, a fault-tolerant service.
Aah, yes. "Waiting room". Was thinking this word but couldn't write it. Thanks Ridit. Nice explanation.
welcome mate!
No. A fault-tolerant system (the circuit breaker) is not a microservice. It's a module that lives in your service. Say your Service A depends on Service B; then you put the circuit breaker inside Service A, where it monitors all the requests that Service A sends to Service B. Based on the parameters you set for the circuit breaker, it moves to one of the three states mentioned in the article.
The circuit breaker keeps its metrics in memory at runtime.
So, long story short, the circuit breaker is a module/library that lives inside your service. It's not a separate service.
Let's suppose my Service A is making requests to Service B, and the minimum number of requests per second needed to check for failures is 10. Now let's say I am experiencing failures at 300 requests per second from A to B (more than 50% of requests have been dropped: the first 149 requests succeeded and the next 151 failed), while B was operating fine at 149 requests per second. Is it really a good idea to just break the connection and send 10 requests per second after some interval to check whether Service B is OK? It will be OK for the first 10 requests but will fail every time after the 149th request. Could the circuit breaker instead decrease the requests per second with a binary approach (300 requests fail, so let's check with 150, then 75, and so on) rather than cutting off immediately and checking with 10 requests per second after some interval? (And if those 10 requests succeed, will I immediately send the 300 pending requests, or will I ramp up step by step to 300?) Could this approach give better performance, since some learning mechanism could tell me how much load the service can really handle, so I can make the best use of its resources rather than abandoning it just because it can't handle the full load? Yes, I am only considering the "server load" case, but this scenario just came to mind and I want to know how this design pattern performs in it.
I have found the answer after some searching, as I had no previous knowledge of the "Circuit Breaker Pattern". The thing is, here - "Every time the call succeeds, we reset the state to as it was in the beginning" - so in my case, or any other, it will perform well. After the 149th request succeeds, it will set its counter to zero. Then the 50% failure rate will be reached at some point, and after 10 subsequent failed requests (100% failure rate) it will break the circuit for any "next" request, not for the previously succeeded request-responses. So, let's say, it will forward the 149 requests and send the fallback for any later ones. You could include that line for readers like me who had no previous knowledge. Thanks for the great article, though!
Hi Fuji, extremely sorry for the late reply. Was stuck! :(
Thanks for your question. A circuit breaker can be of two types: a) counter-based, b) sliding-window-based.
A counter-based circuit breaker monitors a fixed number of requests. As the image in the article shows, the circuit breaker starts in the closed state and monitors the requests. If you configured it to trip when 5 out of 10 requests fail, then the breaker will monitor 10 requests and track how many succeed. It needs to see at least 10 requests first; after the 10th request, it makes a decision. In your example, say the first 149 requests are successful. Nothing breaks. After that, your requests start to fail. If 5 of the next 10 requests fail, the circuit breaker trips and moves to the open state. As I explained in the state transition section of the article, the circuit breaker has a cool-down period during which it waits before making another decision. You can configure this cool-down period. After that, the circuit breaker moves to the half-open state, as mentioned in the article. Here, the circuit breaker lets some requests pass through and rejects the rest. If, in the half-open state, the circuit breaker sees that 5 of the 10 requests it let through failed, it switches back to the open state. Otherwise, it switches to the closed state. So not all 151 failing requests are needed to trip the circuit.
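To make the numbers concrete, here is a small simulation of that scenario. It is only a sketch under one assumption that goes beyond the explanation above: the breaker looks at the last 10 recorded requests after every call and trips once at least 5 of them have failed. `first_trip` is a hypothetical helper written for this example.

```python
from collections import deque
from typing import Iterable, Optional


def first_trip(outcomes: Iterable[bool], window_size: int = 10,
               failures_to_trip: int = 5) -> Optional[int]:
    """Return the 1-based request number at which the circuit opens, or None."""
    window = deque(maxlen=window_size)  # True = failed request
    for number, failed in enumerate(outcomes, start=1):
        window.append(failed)
        if len(window) == window_size and sum(window) >= failures_to_trip:
            return number
    return None


# 149 successful requests followed by 151 failing ones.
outcomes = [False] * 149 + [True] * 151
tripped_at = first_trip(outcomes)  # 154
```

Under this assumption the circuit opens at request 154, after only 5 failures, and the remaining 146 requests get the fallback without ever reaching Service B.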
Hope it helps. If you have questions, please do comment. I liked your question. And again, sorry for the late response.
yeah, so the counter is resetting (starting to monitor) after every success call as I started counting after the first failure? but then after 3 consecutive fails and then 1 success call will again reset the state -_- or the concept is I just monitor (without resetting) consecutive 10 requests then reset and again monitor 10 consecutive requests and again?
and as you said when after failing over 50% request handling (the first 149 was good then consecutive 6 fails) during the cool down period or half-open state, is the server serving that capable 149 requests and not making another call (hold) within short time or it is now not handling any requests? (um confused here) as from our trivial knowledge we know circuit breaker as breaking the full circuit while overload but here we can code to serve the capable 149 request and hold for any new requst (cool down)?
Thanx a lot for the reply <3