In case you haven't heard of it before, AWS has an orchestration service called Elastic Beanstalk. Basically, instead of setting up EC2 instances, load balancers, SQS, CloudWatch, etc. manually, you let Beanstalk orchestrate the setup. All you need to do is zip your code (or package it if you use Java), upload the archive, do some setup from a centralised dashboard, and you are done. If you don't need fine-grained control over your setup, this can really help you get a running system quickly.
Beanstalk provides two types of environment: the Web server environment and the Worker environment. In a Web server environment, you get a server configured with your platform of choice (Java, Node.js, Docker, etc.), and you can also set up a load balancer easily. In a Worker environment, you get a server plus an SQS queue for running heavy background jobs or scheduled jobs. I will focus on the Worker environment in this post.
A Beanstalk worker instance runs a daemon process that continuously listens to an SQS queue. Whenever it detects a message in the queue, the daemon sends an HTTP POST request locally to http://localhost/ on port 80 with the contents of the queue message in the body. All your application needs to do is perform the long-running task in response to the POST. You can configure the daemon to post to a different URL path, connect to an existing queue, and adjust the timeouts.
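Besides the dashboard, these daemon settings can also live in an .ebextensions config file inside your source bundle (that folder is covered further down). A minimal sketch, assuming the aws:elasticbeanstalk:sqsd option namespace; the file name, the path /jobs/process and the queue URL are made-up placeholders:

```yaml
# .ebextensions/worker_daemon.config (hypothetical file name)
option_settings:
  aws:elasticbeanstalk:sqsd:
    # Path the daemon POSTs to on localhost, instead of the default "/"
    HttpPath: /jobs/process
    # Read from an existing queue instead of the one Beanstalk creates
    # (placeholder URL, replace with your own queue)
    WorkerQueueURL: https://sqs.us-east-1.amazonaws.com/123456789012/my-existing-queue
```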
There are 3 basic timeouts you can configure from the worker environment dashboard:
1. Connection timeout
This is the number of seconds the daemon waits when it tries to establish a new connection with your application. The value can be between 1 and 60 seconds.
2. Inactivity timeout
This is the number of seconds your application is allowed to take to do the processing and return a response. You can specify between 1 and 36000 seconds.
3. Visibility timeout
This is the number of seconds the message stays locked (hidden from other consumers) before it returns to the queue, and you can set it between 1 and 43200 seconds. I personally find this the confusing part, so I'll try my best to explain.
When the daemon reads a message from the SQS queue, it sends a POST request to your application. If your application can process the request and return a good response (e.g. HTTP 200) within the Inactivity timeout, great. The whole processing is done, and the message is deleted automatically from the queue.
However, if your application fails to respond within the Inactivity timeout, the message is not deleted. Once the Visibility timeout is reached, counted from the time your app started processing the message, it becomes visible in the queue again and is retried.
So, for example, let's say you set the Inactivity timeout to 30 seconds and the Visibility timeout to 60 seconds. You then send a message to the queue at 10:00:00 and the daemon picks it up right away. At 10:00:30, if your application still has not responded after 30 seconds, the daemon flags the request as failed. At 10:01:00, once the 60 seconds are up, the message goes back to the queue and the whole process is repeated until it reaches the Max Retries setting (default 10 times).
Now, what if your application needs 45 seconds to do the heavy background processing in the example above? In this case, at 10:00:00 the request is fired and the processing starts. At 10:00:30, the daemon flags the processing as failed, but the actual processing still continues in the background. At 10:00:45 your app finally responds, but no one is listening. At 10:01:00, the message is back in the queue and the whole heavy processing is repeated even though it actually succeeded. So you will want to set the Inactivity timeout and the Visibility timeout to safe values relative to your expected processing time, and keep both values relatively close. The default settings from AWS put the Inactivity timeout at 299 seconds and the Visibility timeout at 300 seconds, which differ by only 1 second.
Another thing to be careful about is making sure the Visibility timeout is higher than the Inactivity timeout. Consider this example:
You set the Inactivity timeout to 60 seconds and the Visibility timeout to 30 seconds. Your app needs 45 seconds of processing time. In this scenario, when the processing reaches second 30, the message becomes visible in the queue again and is automatically consumed again by your server. Your server ends up doing double work when it is not actually necessary.
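To make the two rules concrete (leave headroom above the processing time, and keep the Visibility timeout just above the Inactivity timeout), here is a rough sketch for the 45-second job from the example. It assumes you keep the settings in an .ebextensions config file under the aws:elasticbeanstalk:sqsd namespace instead of the dashboard; the numbers are illustrative, not recommendations:

```yaml
# .ebextensions/worker_timeouts.config (hypothetical file name)
option_settings:
  aws:elasticbeanstalk:sqsd:
    # Seconds the daemon waits while opening a connection to the app
    ConnectTimeout: 5
    # Seconds the app may take to respond; well above the ~45 s the job needs
    InactivityTimeout: 90
    # Seconds the message stays hidden; slightly above InactivityTimeout
    VisibilityTimeout: 100
    # Attempts before the daemon gives up on a message
    MaxRetries: 10
```

With these values the 45-second job finishes well before the daemon gives up at second 90, and the message cannot reappear in the queue while the app is still allowed to answer.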
Phew! Now we should be able to configure our worker environment properly to avoid timeouts. But then, everything changed when the Nginx configuration attacked.
If you are using Node.js as your worker environment platform, AWS gives you Nginx out of the box to act as a proxy between the actual client and your app. Nginx introduces another layer of timeout, which is 60 seconds by default (that is the default value of Nginx's proxy_read_timeout). Now let's dive into an example of how this might cause you problems.
Let's say your application needs 120 seconds to do its processing, and you have already set both the Visibility timeout and the Inactivity timeout to some safe number, e.g. 300 seconds. When a request starts being processed at 10:00:00, Nginx will spit out a timeout at 10:01:00. The daemon treats this as a genuine error, so it will not delete the message from the queue. Meanwhile, your app continues processing the request in the background until it reaches 120 seconds and shouts the response to no one, since nobody is listening any more. Once we reach 300 seconds, the whole process is repeated.
There is an easy way to tell whether your timeout comes from the Inactivity timeout or from the Nginx timeout. Grab the Nginx access log from the worker instance at /var/log/nginx/access.log, then inspect the HTTP response status of the requests processed by your app. If the status is 499, the daemon closed the connection before a response arrived, which means your app hit the Inactivity timeout. However, if the status is 504, there's a good chance your app hit the Nginx timeout limit.
Alright, so we also need to extend the Nginx timeout. However, extending the Nginx timeout is not as easy as the other timeouts, since you cannot do it from the AWS Console. You can SSH into your worker instance and change the config directly, but then you will need to reapply the change manually every time the instance is rebuilt. There is a better method.
You can add a folder to your application source code and name it .ebextensions. The folder must have exactly this name; no other name will work. Inside the folder, add a file and name it nginx_timeout.config. You can name the file however you want, but a descriptive name is better. One important point about the file name: it must end with the .config extension. Inside that file goes the configuration that overrides the Nginx timeouts.
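A minimal sketch of such a file, not a definitive version. It assumes a platform where the default Nginx config includes /etc/nginx/conf.d/*.conf, and the choice of these three directives is my assumption about which timeouts matter for a proxied worker request:

```yaml
# .ebextensions/nginx_timeout.config
files:
  "/etc/nginx/conf.d/timeout.conf":
    mode: "000644"
    owner: root
    group: root
    content: |
      # Seconds Nginx waits for the app to send a response (default 60)
      proxy_read_timeout 300;
      # Seconds Nginx waits while sending the request to the app
      proxy_send_timeout 300;
      # Seconds Nginx waits while sending the response back to the client
      send_timeout 300;
```

Pick a value at least as large as your Inactivity timeout, otherwise Nginx will still give up before the daemon does. On newer Amazon Linux 2 based platforms, Nginx overrides are placed under .platform/nginx/conf.d/ in the source bundle instead, so check which platform version you are on.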
What this config does is create the file /etc/nginx/conf.d/timeout.conf, which is read automatically by the default Nginx config. That file states that the various Nginx timeout values should be 300 seconds instead of whatever the defaults are, and that's it! Now our worker environment should be good to go and digest all the heavy processing you feed it without hitting timeouts.