Adding an API Gateway to your application is a good way to centralize some work you usually have to do for all of your API routes, like authentication or validation. But like every software system, it comes with its own problems. Solving errors in the cloud isn't always straightforward, and API Gateway isn't an exception.
AWS API Gateway is an HTTP gateway, and as such, it uses the well-known HTTP status codes to convey its errors to you. Errors in the range of 400 to 499 usually point to a problem with the API client, and errors in the range of 500 to 599 mean something on the server is wrong.
This is a rule of thumb, and it holds as long as your backend has no logic bugs. But nobody is perfect, so a 400 code can still mean your client is right and your backend is wrong. Let's not get ahead of ourselves, though, and look into the errors case by case.
The 400 error is probably the broadest of the client errors. Depending on what AWS service API Gateway is integrating with for the URL, it can mean many things.
A retry usually doesn't help because it means the request doesn't match what that specific API Gateway integration is expecting, and sending it again wouldn't change that.
Reasons for this error include:
- Invalid JSON, such as a missing comma.
- Missing fields, when the upstream service requires a field you left out.
- Wrong data types, when you send a string instead of a number.
- Invalid characters, like whitespace in identifiers.
You can find the required fields, expected data types, and valid characters for a field in the documentation of the AWS service you integrated with API Gateway.
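To catch the causes listed above before a request leaves the client, a small pre-flight check can help. The following is a minimal sketch, not a definitive implementation: the field names (`orderId`, `quantity`), their types, and the identifier pattern are all hypothetical and would come from the documentation of the service you integrate with.

```python
import json
import re

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"orderId": str, "quantity": int}
# Assumed identifier rule: letters, digits, underscores, and hyphens only.
IDENTIFIER_PATTERN = re.compile(r"^[A-Za-z0-9_-]+$")


def validate_payload(raw_body: str) -> list[str]:
    """Return a list of problems; an empty list means the payload looks sendable."""
    try:
        data = json.loads(raw_body)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg}"]
    if not isinstance(data, dict):
        return ["payload must be a JSON object"]

    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[field], bool) or not isinstance(data[field], expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")

    order_id = data.get("orderId")
    if isinstance(order_id, str) and not IDENTIFIER_PATTERN.match(order_id):
        problems.append("invalid characters in orderId")
    return problems
```

Running this check on the client turns a round trip that ends in a 400 into a local error message that names the offending field.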
The 403 error is also known as "Forbidden" and implies a permission issue. Resources you provision in AWS typically act under an IAM role. This role defines what the resource can access and how it can access it. Your API Gateway has an IAM role too, and if it's not configured correctly, it can prevent API Gateway from integrating with a service.
Again, a retry doesn't help here.
If you use end-user authentication with AWS Cognito, every request will get a temporary role related to the Cognito user who issued the request. If this role isn't configured correctly, it can also prevent users from accessing specific resources.
If you're using a custom Lambda authorizer in your API Gateway, this error code could also relate to a problem in that Lambda function.
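To make the Lambda authorizer case concrete, here is a minimal sketch of a token-based authorizer in Python. The token store `VALID_TOKENS` and the user names are hypothetical placeholders; a real authorizer would verify a signed token or ask an identity provider. The returned structure follows the policy document shape API Gateway expects from an authorizer, and a "Deny" effect is what typically reaches the client as a 403.

```python
# Hypothetical token store; a real authorizer would verify a JWT
# or call an identity provider instead.
VALID_TOKENS = {"secret-token-123": "user-42"}


def build_policy(principal_id: str, effect: str, method_arn: str) -> dict:
    """Build the response shape API Gateway expects from a Lambda authorizer."""
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": "execute-api:Invoke",
                    "Effect": effect,  # "Allow" or "Deny"
                    "Resource": method_arn,
                }
            ],
        },
    }


def handler(event: dict, context: object) -> dict:
    """Token-based authorizer: unknown tokens get an explicit Deny."""
    token = event.get("authorizationToken", "")
    method_arn = event["methodArn"]
    user = VALID_TOKENS.get(token)
    if user is None:
        return build_policy("anonymous", "Deny", method_arn)
    return build_policy(user, "Allow", method_arn)
```

If a function like this throws or returns a malformed policy, clients see an error even though their credentials are fine, so the authorizer's own logs are a good first place to look.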
The 404 error usually means your URL is wrong. That's the cause in the vast majority of cases.
If you're sure the URL is right but you're still getting the error, it could come from the service you integrate with API Gateway, for example when you try to access data in that service that isn't there.
A retry only solves this problem if the 404 comes from a race condition: you told the backend to create a resource and tried to access it with the next request, but that request came too soon, and the thing you created wasn't there yet. Such issues happen with eventually consistent data stores like DynamoDB.
The more expensive consistent reads of DynamoDB usually solve this problem.
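As a rough illustration of a consistent read, the sketch below wraps a single lookup. It assumes `table` behaves like a boto3 DynamoDB `Table` resource, whose `get_item` accepts `Key` and `ConsistentRead`; the helper name `get_item_consistent` and the key layout are made up for this example.

```python
from typing import Any, Optional


def get_item_consistent(table: Any, key: dict) -> Optional[dict]:
    """Read one item with a strongly consistent read.

    `table` is assumed to behave like a boto3 DynamoDB Table resource,
    whose get_item accepts Key and ConsistentRead keyword arguments.
    """
    response = table.get_item(Key=key, ConsistentRead=True)
    # DynamoDB omits the "Item" key when nothing matches the given key.
    return response.get("Item")
```

With boto3 installed, the table could be obtained via `boto3.resource("dynamodb").Table("orders")`, where `"orders"` is a placeholder table name.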
The 409 status indicates that your request is trying to do something that conflicts with a resource's current state. A resource could be a record in a DynamoDB table that's integrated with your API. It could be that you tried to create a resource with a specific ID that already exists.
The 409 error is also related to something called a caller reference. This reference marks a request so that it gets executed only once. If you send a request and don't get an answer from the API, you don't know whether the request got lost before or after it reached the API. This usually leads to a retry. If the API hasn't seen the caller reference before, it simply executes the request and responds with an appropriate status code. But if the API has already seen the caller reference, it gives you a 409 status code to indicate your request was accepted when you sent it the first time.
So, a retry usually won't solve this problem and can even be the source of this error code in the first place.
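The caller reference mechanism can be sketched with an in-memory stand-in for the backend. Everything here is illustrative: `IdempotentApi`, its `create` method, and the 201 success code are assumptions made for the example, not the behavior of a specific AWS service.

```python
import uuid


class IdempotentApi:
    """In-memory stand-in for a backend that tracks caller references."""

    def __init__(self) -> None:
        self._seen: dict[str, dict] = {}

    def create(self, caller_reference: str, payload: dict) -> tuple[int, dict]:
        if caller_reference in self._seen:
            # The request was accepted earlier; the retry is reported as a conflict.
            return 409, {"message": "request already accepted"}
        self._seen[caller_reference] = payload
        return 201, payload


def new_caller_reference() -> str:
    """Generate one reference per logical request and reuse it on every retry."""
    return str(uuid.uuid4())
```

A client that retries with the same reference after a lost response can treat the resulting 409 as confirmation that the first attempt went through, rather than as a failure.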
There are two 429 errors you could get from API Gateway. The first one is called "Limit Exceeded Exception," which indicates that you went over an API quota.
API Gateway allows access control via API keys. When creating such a key, you can also define a usage quota, such as 1,000 requests per week. If this quota is reached, API Gateway will respond with a 429.
Normally a retry doesn't solve this problem. You either have to increase the quota for the key, or you have to wait until the next usage period starts.
The best way to get around this issue is to keep your API requests monitored and counted. Check how many requests you send and if you really need to send so many. You can also try to cache responses so that you can reuse them instead of sending duplicate requests that count to your key's quota.
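Counting and caching can live in one small client-side wrapper. The sketch below is a simplified illustration: `CountingCache`, its parameters, and the injected `fetch` function are hypothetical names, and a real client would also need to decide which responses are safe to cache.

```python
import time
from typing import Callable


class CountingCache:
    """Cache responses for a short time and count the calls that reach the API."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}
        self.requests_sent = 0  # calls that actually count against the quota

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # served locally, no quota used
        self.requests_sent += 1
        body = self._fetch(url)
        self._entries[url] = (now, body)
        return body
```

Comparing `requests_sent` against the key's quota over time shows how close you are to the limit and how much the cache is saving.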
The second 429 error is temporary. You get it if you send too many requests at once. For example, if you have an API endpoint connected to a Lambda function, that function is subject to a default limit of 1,000 concurrent invocations.
If you send 1001 in parallel, you get a 429 error, but depending on the time this Lambda function takes to handle a request, you can retry some time later and get a free slot again.
Again, API keys can have limits too. If you have a key that only allows 10 concurrent requests, the upstream service could handle millions, but your 11th parallel request wouldn't go through.
Try to monitor your requests so you see when they get close to the limits of your services, and try to cache responses on your clients so that they won't hammer the API.
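For the temporary kind of 429, a retry with growing pauses gives the backend time to free up a slot. The following is a minimal sketch of exponential backoff with jitter; `call_with_backoff`, the `send` callable (assumed to return an HTTP status code), and the default delays are illustrative choices, not recommended values.

```python
import random
import time
from typing import Callable


def call_with_backoff(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns something other than 429 or attempts run out."""
    status = send()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        # Exponential backoff with full jitter: wait a random time
        # between 0 and base_delay * 2^attempt seconds.
        sleep(random.uniform(0, base_delay * 2 ** attempt))
        status = send()
    return status
```

The random jitter keeps many throttled clients from retrying at the same instant and running into the limit together again.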
The 500 status code might be the most used and most generic HTTP error on this planet. If you get it from an API endpoint that integrates with AWS Lambda, it usually means your code is buggy.
The next source for this error is inconsistent error mapping. Many of the errors we talked about here can become a 500 error when finally landing on your client as a response. You'll get a "limit exceeded," but it will have a 500 status code instead of 429. So you have to extract the right error out of this response, check what the real cause is, and then look at how to solve it.
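Extracting the real error from a 500 response could look roughly like the sketch below. The body layout is an assumption: it expects a JSON object with the error name somewhere in an `errorType`, `__type`, or `message` field, and the `KNOWN_ERRORS` table is a hypothetical mapping you would adapt to what your integration actually returns.

```python
import json

# Assumed mapping from error names found in a response body
# to the status code they would normally carry.
KNOWN_ERRORS = {
    "LimitExceededException": 429,
    "TooManyRequestsException": 429,
    "AccessDeniedException": 403,
    "ConflictException": 409,
    "NotFoundException": 404,
}


def effective_status(status: int, body: str) -> int:
    """Return the status implied by the body when a 500 hides a known error."""
    if status != 500:
        return status
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return status
    if not isinstance(data, dict):
        return status
    # Assumed field names; adjust them to your integration's error format.
    text = " ".join(str(data.get(k, "")) for k in ("errorType", "__type", "message"))
    for name, mapped in KNOWN_ERRORS.items():
        if name in text:
            return mapped
    return status
```

Once the effective status is known, the advice for that status code applies, for example waiting and retrying for a hidden 429.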
Since the error can be anything really, a retry can technically solve that problem, but usually, it doesn't.
If you monitor your system carefully and get one of these every few million requests, it could be that cosmic rays flipped some bits or whatever. Still, if you see a 500 status code more often than that, it's crucial to investigate; it can very well point to an inconsistency that will blow up sooner or later.
A 502 error code is related to the service your API Gateway integrates with. It means that API Gateway couldn't understand the response.
For example, when you throw an error in a Lambda function or the resolved value has an invalid structure, it can lead to a 502 error. If you used EC2 or ECS/EKS, it could also be that API Gateway can't connect to the VM or container because they aren't running (correctly).
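For the Lambda case, one way to avoid a 502 is to make sure the handler always returns a well-formed response, even when the business logic fails. This sketch assumes the Lambda proxy integration, where API Gateway expects `statusCode`, `headers`, and a string `body`; `do_work` is a hypothetical stand-in for your own logic.

```python
import json


def proxy_response(status_code: int, payload: dict) -> dict:
    """Build a response in the shape the Lambda proxy integration expects."""
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),  # the body must be a string, not a dict
        "isBase64Encoded": False,
    }


def do_work(event: dict) -> dict:
    """Hypothetical business logic: greet the name given in the request body."""
    data = json.loads(event.get("body") or "{}")
    if not isinstance(data, dict) or "name" not in data:
        raise ValueError("name is required")
    return {"greeting": f"Hello, {data['name']}"}


def handler(event: dict, context: object) -> dict:
    try:
        result = do_work(event)
    except ValueError as err:
        # Bad input (including malformed JSON) becomes a 400 instead of a crash.
        return proxy_response(400, {"message": str(err)})
    except Exception:
        # Anything unexpected still yields a response API Gateway can parse.
        return proxy_response(500, {"message": "internal error"})
    return proxy_response(200, result)
```

The catch-all branch trades a 502 for a 500, which is still an error but one that carries a body your client can read.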
Retries can help, especially when integrated services are currently restarting.
If you see a 503 error, most of the time, it means the service you're integrating takes too long to answer.
API Gateway has a hard timeout limit of 30 seconds. If your service can't respond in under 30 seconds, API Gateway will assume it's unavailable and stop waiting.
If the work your service does takes around 30 seconds, you should handle things asynchronously. Respond with a 202 Accepted and give the client a way to fetch the results later.
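The asynchronous pattern can be sketched with an in-memory job store. All names here (`JOBS`, `submit_job`, `complete_job`, `get_job`, and the `/jobs/{id}` URL) are hypothetical; a real system would persist jobs and hand the work to a queue or a background worker.

```python
import uuid

# In-memory job store for illustration; a real system would use a database.
JOBS: dict[str, dict] = {}


def submit_job(payload: dict) -> tuple[int, dict]:
    """Accept the work immediately and tell the client where to poll."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "pending", "result": None}
    # A real implementation would enqueue `payload` for a worker here.
    return 202, {"jobId": job_id, "statusUrl": f"/jobs/{job_id}"}


def complete_job(job_id: str, result: dict) -> None:
    """Called by the worker once the slow task has finished."""
    JOBS[job_id] = {"status": "done", "result": result}


def get_job(job_id: str) -> tuple[int, dict]:
    """Status endpoint the client polls until the job is done."""
    job = JOBS.get(job_id)
    if job is None:
        return 404, {"message": "unknown job"}
    return 200, {"jobId": job_id, **job}
```

The client receives the 202 well inside the timeout and polls the status URL until the status switches to `done`.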
If your service usually responds well below 30 seconds but only occasionally goes over the limit, you can solve the problem with retries.
The 504 status code is a bit like 503. The difference is that 504 indicates a DNS or network problem, and 503 indicates a performance problem.
Again, this can be temporary, and a retry might solve it. After all, the internet isn't 100% stable.
But you usually see that issue when an integrated service isn't running, or when the IP or hostname is wrong, either because you entered the wrong value or because it changed after you entered it.
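The retry advice given for the individual status codes can be collected into one small helper. This is only a summary sketch of the rules of thumb in this article: `retry_advice` and its return strings are made up for illustration, and a quota-related 429 would still need to be told apart from a throttling one before retrying.

```python
# Status codes for which a retry can help according to the rules of thumb above.
# Note: a 429 caused by an exhausted quota will not recover until the next period.
RETRYABLE = {429, 502, 503, 504}
# Codes where a retry only helps in special cases (race conditions, rare glitches).
SITUATIONAL = {404, 500}


def retry_advice(status: int) -> str:
    """Map an HTTP status code to a coarse retry recommendation."""
    if status in RETRYABLE:
        return "retry with backoff"
    if status in SITUATIONAL:
        return "investigate first"
    if 400 <= status < 600:
        return "do not retry"
    return "no error"
```

Keeping this decision in one place makes it easier to log which errors were retried and which were surfaced to the caller right away.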
We went over the API Gateway errors you will most likely encounter. As with anything debugging-related, things can get quite messy, especially if you have countless rows of logs to sift through.
The good news is that Dashbird integrates well with API Gateway monitoring and delivers actionable insights straight to your Slack or SMS when things go awry.
Dashbird also works with AWS as their Advanced Technology Partner and uses the AWS Well-Architected Framework to ensure you're on track to performance and cost optimization. If you want to try Dashbird out, it's free for the first 1 million invocations per month.