AWS Batch is a powerful platform for running batch jobs, but setting it up can be difficult, especially if your environment uses private subnets and you want to make use of AWS PrivateLink.
One common source of frustration is jobs that get stuck in the 'RUNNABLE' state and never progress. Here are some pointers you can use for troubleshooting this issue.
Read through the AWS Batch Troubleshooting guide and 'Why is my AWS Batch job stuck in RUNNABLE status'. If you carefully work through these guides then they should lead you to the source of the issue.
One thing to highlight is:
Container instances need access to communicate with the Amazon ECS service endpoint. This can be through an interface VPC endpoint or through your container instances having public IP addresses.
If you are using a NAT gateway, then this shouldn't be a problem, because that is the standard way of reaching web resources from private subnets.
However, I was configuring service access via PrivateLink, meaning that requests are not routed over the public internet. Using PrivateLink brings with it additional steps for successfully running AWS Batch jobs.
ECS Instances not being created in the ECS cluster
Your batch job goes to RUNNABLE, but you don't see any ECS cluster instances created (refer to 'troubleshooting' for more on this).
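One way to confirm this symptom is to look at how many container instances each cluster reports. The snippet below is a minimal sketch, assuming you already have a response in the shape returned by boto3's ECS `describe_clusters` call; the helper name `clusters_without_instances` and the sample cluster names are hypothetical.

```python
def clusters_without_instances(describe_clusters_response):
    """Return names of clusters that report no registered container instances."""
    return [
        cluster["clusterName"]
        for cluster in describe_clusters_response.get("clusters", [])
        if cluster.get("registeredContainerInstancesCount", 0) == 0
    ]


# Hypothetical, trimmed-down response in the shape returned by
# boto3.client("ecs").describe_clusters(clusters=[...]).
sample_response = {
    "clusters": [
        {"clusterName": "batch-compute-env-a", "registeredContainerInstancesCount": 0},
        {"clusterName": "batch-compute-env-b", "registeredContainerInstancesCount": 2},
    ]
}

print(clusters_without_instances(sample_response))
```

A cluster that stays in this list while jobs sit in RUNNABLE suggests the instances cannot register with ECS, which is what the endpoint setup below addresses.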
Refer to Creating the VPC Endpoints for Amazon ECS. Your container instances are going to need to be able to access the following endpoints:
com.amazonaws.<your region>.ecs-agent
com.amazonaws.<your region>.ecs-telemetry
com.amazonaws.<your region>.ecs
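The three endpoints can be described as request parameters for boto3's EC2 `create_vpc_endpoint` call. This is a rough sketch rather than a complete setup: the region and the VPC, subnet and security group IDs are placeholders, the helper name `ecs_endpoint_requests` is hypothetical, and enabling private DNS is an assumption that may not suit every VPC.

```python
# Sketch only: the region and all resource IDs used below are placeholders.
ECS_ENDPOINT_SUFFIXES = ["ecs-agent", "ecs-telemetry", "ecs"]


def ecs_endpoint_requests(region, vpc_id, subnet_ids, security_group_ids):
    """Build one create_vpc_endpoint request per ECS service endpoint."""
    return [
        {
            "VpcEndpointType": "Interface",
            "VpcId": vpc_id,
            "ServiceName": f"com.amazonaws.{region}.{suffix}",
            "SubnetIds": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
            # Assumption: private DNS is wanted, so the default service
            # hostname resolves to the endpoint inside the VPC.
            "PrivateDnsEnabled": True,
        }
        for suffix in ECS_ENDPOINT_SUFFIXES
    ]


requests = ecs_endpoint_requests(
    "eu-west-1", "vpc-0example", ["subnet-0example"], ["sg-0example"]
)
for request in requests:
    print(request["ServiceName"])
    # With boto3 installed and credentials configured, each request could be sent with:
    # boto3.client("ec2", region_name="eu-west-1").create_vpc_endpoint(**request)
```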
The Security Group associated with each endpoint needs to allow 'all traffic' outbound requests - refer to the troubleshooting guide.
The Security Group associated with the endpoint(s) also needs to accept inbound TCP requests on port 443.
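The inbound rule on port 443 could be expressed as arguments for boto3's EC2 `authorize_security_group_ingress` call. This is a minimal sketch: the security group IDs are placeholders, the helper name `https_ingress_rule` is hypothetical, and restricting the source to the container instances' security group is my assumption rather than something the guides require.

```python
def https_ingress_rule(endpoint_sg_id, instance_sg_id):
    """Build authorize_security_group_ingress arguments allowing TCP 443."""
    return {
        "GroupId": endpoint_sg_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": 443,
                "ToPort": 443,
                # Assumption: only the container instances' security group
                # needs to reach the endpoint.
                "UserIdGroupPairs": [{"GroupId": instance_sg_id}],
            }
        ],
    }


# Placeholder IDs for the endpoint and container instance security groups.
rule = https_ingress_rule("sg-0endpoint", "sg-0instances")
print(rule)
# With boto3 installed and credentials configured, the rule could be applied with:
# boto3.client("ec2").authorize_security_group_ingress(**rule)
```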
Jobs run but then fail with 'CannotPullECRContainerError'
Refer to this guide for setting up VPC Endpoints to ECR.
The key point is that you will need two VPC Endpoints:
com.amazonaws.<your region>.ecr.dkr
com.amazonaws.<your region>.ecr.api
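A quick way to see which of the endpoints named in this post are still missing is to compare them against the service names already present in your VPC. The sketch below assumes you have collected those names yourself, for example from the `ServiceName` field of each entry returned by boto3's EC2 `describe_vpc_endpoints`; the helper name `missing_endpoints` and the sample data are hypothetical. Note that the AWS documentation for ECR endpoints also describes a gateway endpoint for Amazon S3, where image layers are stored, which this check does not cover.

```python
REQUIRED_SUFFIXES = ["ecs-agent", "ecs-telemetry", "ecs", "ecr.dkr", "ecr.api"]


def missing_endpoints(region, existing_service_names):
    """Return the required service names that have no endpoint yet."""
    required = [f"com.amazonaws.{region}.{suffix}" for suffix in REQUIRED_SUFFIXES]
    existing = set(existing_service_names)
    return [name for name in required if name not in existing]


# Hypothetical list of service names already configured in the VPC.
existing = [
    "com.amazonaws.eu-west-1.ecs",
    "com.amazonaws.eu-west-1.ecs-agent",
    "com.amazonaws.eu-west-1.ecs-telemetry",
]

print(missing_endpoints("eu-west-1", existing))
```

With the sample data above, only the two ECR endpoints are reported, which matches the situation where instances join the cluster but image pulls fail.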
I found setup of AWS Batch with VPC Endpoints was not straightforward, but hopefully some of the pointers in this guide can save you time if you are stuck.