There are many possible reasons for failures, from an outage of a third-party service (API, DB, etc.) to a hardware failure, to the “classic” software bug (after all, software developers are humans too, aren’t we? 🤔).
Fault tolerance is a requirement, not a feature.
This post is an attempt to address handling system failures during third-party API communications.
What are some of the failure points?
The network connectivity between the application and the third-party system might be disrupted, causing communication timeouts or lost information.
The third-party system has an internal error or machine failure, causing the application not to receive a response.
The third-party system has not been paid for, e.g. a late account or payment renewal.
We have an internal error or machine failure. Depending on the timing of this failure, two things could happen:
1. Inability to send the request.
2. Inability to receive the response from the third party.
How do we respond to a user request when a failure occurs?
The diagram below attempts to show a process that could be followed to achieve resilience when API communication fails.
In each of the cases described above, a timeout, internal error, or connection error will result in the request not returning the optimal response for the customer.
What actions can we take when a retry will not succeed and is unlikely to help?
Notifications: when such a third-party failure occurs, notifications should be sent so that quick action can be taken, such as calling the customer back to manage expectations and offer a way forward that lets them know you have their back.
Descriptive messages: show messages that help the customer know what to do next, e.g. refresh or retry (a small sketch of this follows).
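To make that concrete, here is a minimal sketch pairing a descriptive user-facing message with an internal notification when a third-party call fails. All names and messages here are my own assumptions for illustration, not from any specific system:

```python
import logging

logger = logging.getLogger("third_party")

# Hypothetical mapping from failure type to a message the customer can act on.
USER_MESSAGES = {
    "timeout": "We couldn't reach our payment provider. Please retry in a few minutes.",
    "declined": "Your request was declined. Please contact support if this looks wrong.",
    "unknown": "Something went wrong on our side. Please refresh and try again.",
}

def notify_team(failure_type: str, request_id: str) -> None:
    # Placeholder: in practice this could post to Slack, PagerDuty, email, etc.
    logger.error(
        "Third-party failure '%s' for request %s - follow up with the customer",
        failure_type, request_id,
    )

def handle_failure(failure_type: str, request_id: str) -> str:
    # Alert the team and return a descriptive message for the customer.
    notify_team(failure_type, request_id)
    return USER_MESSAGES.get(failure_type, USER_MESSAGES["unknown"])
```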
To improve the resilience of the application, we should consider the following patterns:
1. Retry
2. Caching
3. Persistence
4. Circuit Breaker
Retry
A reliable and robust retry solution should be:
Smart - to avoid hammering a failing service, we use exponential backoff when retrying.
Customizable - we need a standard way to handle errors: we should control which errors are retried and which are not, for example by raising exceptions for the errors that should be retried and catching the others. A sketch of such a retry helper follows.
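Here is a minimal sketch of what a smart, customizable retry helper could look like. The helper name, delays, and the RetryableError type are my own assumptions, not a prescribed implementation:

```python
import random
import time


class RetryableError(Exception):
    """Errors worth retrying (timeouts, 5xx responses, etc.)."""


def retry(func, max_attempts=5, base_delay=0.5, max_delay=30.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RetryableError:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Exponential backoff with a little jitter to avoid thundering herds.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay / 2))


# Usage: wrap the third-party call and decide which errors are retryable, e.g.
# def call_provider():
#     resp = requests.post(PROVIDER_URL, json=payload, timeout=5)
#     if resp.status_code >= 500:
#         raise RetryableError(f"provider returned {resp.status_code}")
#     return resp.json()
#
# result = retry(call_provider)
```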
Caching
We record each request in the database before sending the information to an external party.
We should have a status attribute that tracks which part of the process the request is in.
We update the request with a new status based on the response, normally created, declined, or successful.
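A minimal sketch of that flow, using sqlite3 from the standard library. The table name and column layout are assumptions for illustration; the statuses mirror the ones above:

```python
import sqlite3

conn = sqlite3.connect("requests.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS outbound_requests (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'created'  -- created | declined | successful
    )
""")

def record_request(payload: str) -> int:
    # Persist the request *before* sending it, so nothing is lost if the call fails.
    cur = conn.execute(
        "INSERT INTO outbound_requests (payload, status) VALUES (?, 'created')",
        (payload,),
    )
    conn.commit()
    return cur.lastrowid

def update_status(request_id: int, status: str) -> None:
    # Record which part of the process the request reached, based on the response.
    conn.execute(
        "UPDATE outbound_requests SET status = ? WHERE id = ?",
        (status, request_id),
    )
    conn.commit()
```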
Persistence
For more inspiration: Uber manages the reprocessing of their data using Kafka topics and dead-letter queues. See https://eng.uber.com/reliable-reprocessing/.
Uber’s idea would fit many of our needs, so we can design a solution based on it.
First, our error handling solution should identify that there is an error. Then, it needs to grab the relevant context of the error (in this example, the handled HTTP request).
After that, it should send that context to some persistence layer, for later re-execution of the service flow.
When a failure occurs in one of the HTTP requests, the request context (request body, query params, etc.) will be sent as a Kafka message to the first retry Kafka topic.
As for the re-execution part, a consumer polls the first retry topic; if that processing fails too, the message is sent to the second retry topic, and so forth, until it is sent to the dlq ("dead-letter queue") topic for manual analysis. (The number of retry topics in this example is arbitrary; it can be any number.)
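Here is a rough sketch of that flow using the kafka-python client. The topic names, the number of retry topics, and the process_request placeholder are assumptions for illustration, not Uber's actual implementation:

```python
import json
from typing import Optional

from kafka import KafkaConsumer, KafkaProducer

RETRY_TOPICS = ["service-a.retry.1", "service-a.retry.2"]
DLQ_TOPIC = "service-a.dlq"

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def process_request(request_context: dict) -> None:
    # Placeholder for re-executing the original service flow, e.g. replaying
    # the HTTP request from the stored body and query params.
    ...

def publish_for_retry(request_context: dict, failed_topic: Optional[str] = None) -> None:
    """Send a failed request context to the next retry topic, or to the DLQ."""
    if failed_topic is None:
        next_topic = RETRY_TOPICS[0]      # first failure: send to the first retry topic
    elif failed_topic in RETRY_TOPICS[:-1]:
        next_topic = RETRY_TOPICS[RETRY_TOPICS.index(failed_topic) + 1]
    else:
        next_topic = DLQ_TOPIC            # retries exhausted: park it for manual analysis
    producer.send(next_topic, request_context)

def consume_retry_topic(topic: str) -> None:
    """Poll one retry topic and re-execute the flow; escalate on repeated failure."""
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers="localhost:9092",
        group_id="service-a-retries",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        try:
            process_request(message.value)
        except Exception:
            publish_for_retry(message.value, failed_topic=topic)
```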
Let’s see how the solution I described applies to service A below:
Summary
If we implement the solutions I’ve described, we can be much more confident when our system has failures: it requires fewer manual interventions, and most importantly, we sleep well at night!
I hope this post helped you by providing you with some new ideas for handling failures in your applications.
Feel free to comment below with any questions/thoughts :-)