Camille He

Build Scheduled AWS Batch Job Infrastructure using Terraform

In this post, I will walk you through how to build a scheduled AWS Batch job infrastructure using Terraform. It focuses on the Terraform infrastructure and module definitions, and on how they work together to build the entire workflow. It doesn't cover the advanced features that AWS Batch provides, and it only uses the official Terraform aws provider and basic Terraform features. In short, this post helps you set up a basic AWS Batch job infrastructure quickly, and you can add more features and enhance the structure as needed.

Prerequisite

  1. AWS CLI (v2) is installed on the local machine.
  2. AWS credentials are set up. We use the credentials to deploy Terraform resources to the target AWS account. In this demo, I use the same AWS profile for the S3 remote backend configuration and for Terraform apply.
  3. Terraform CLI (1.3.4) is installed on the local machine. You can loosen the Terraform version restriction in versions.tf to use other close versions, though I haven't verified that the code works as expected with them. A minimal sketch of versions.tf appears in the Terraform Structure section below.
  4. I'm working on a Mac (macOS Monterey) with an Apple M2 chip, so the aws provider installed is .terraform/providers/registry.terraform.io/hashicorp/aws/5.0.0/darwin_arm64/terraform-provider-aws_v5.0.0_x5. If you are on another operating system, remove the .terraform.lock.hcl file from the source code and let the Terraform CLI install the aws provider build that matches your OS.
  5. In the demo source code, I use the default VPC, subnets, and security group for EC2 instances. You can use customized network resources, but you should take care of network availability if your Batch job needs internet access, for example when the Docker image it uses is hosted in a public Docker registry.

You can find the demo source code at https://github.com/camillehe1992/on-demand-job-in-aws-batch

Terraform Structure

The Terraform structure contains several components, as shown below.

.
├── config.tfbackend  # Remote backend config file
├── data.tf           # File for Terraform data source 
├── main.tf           # The reference of modules
├── outputs.tf        # The outputs of Terraform resources
├── terraform         
│   ├── dev           # Environment specified variables
│   └── modules       # Terraform modules
├── variables.tf      # Terraform input variables that should be passed to the arch module
└── versions.tf       # Defines the versions of Terraform and providers

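For reference, a minimal versions.tf might look like the sketch below. The exact constraints in the demo repository may differ, but the idea is to pin the Terraform and aws provider versions mentioned in the prerequisites.

# versions.tf (illustrative sketch; the repository's constraints may differ)
terraform {
  required_version = "~> 1.3.4"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}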

Remote Backend Configuration (S3)

As a best practice, I save the Terraform state files in a remote location, which here is an S3 bucket. The configuration below specifies the S3 bucket name, region, and the profile used to access the bucket.

# config.tfbackend
region         = "cn-north-1"
bucket         = "tf-state-756143471679-cn-north-1"
profile        = "service.app-deployment-dev-ci-bot"

Another parameter is key, which is provided via the -backend-config option of terraform init, because the path is ENVIRONMENT specific and has to be injected dynamically instead of being hard-coded in the config.tfbackend file. You can use other configuration settings as needed.

terraform init -reconfigure \
    -backend-config=$BACKEND_CONFIG \
    -backend-config="key=$NICKNAME/$ENVIRONMENT/$AWS_REGION/terraform.tfstate"
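For completeness, the backend itself is declared as an empty (partial) configuration in the Terraform code, and the values above plus the dynamic key are merged in at init time. A minimal sketch, assuming the block sits alongside the version constraints:

# Partial S3 backend declaration; bucket, region, profile, and key
# are all supplied via -backend-config during terraform init.
terraform {
  backend "s3" {}
}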

Terraform Main File

The main.tf in the root directory is the core component and entry point of the Terraform structure. I define all resources here, including IAM roles, Batch, CloudWatch Events (EventBridge), and an SNS topic. You can separate them into specific files, for example iam.tf, batch.tf, events.tf, sns.tf, etc. I keep them in one file because the project is not that complex, and it's easy to understand the relationships between resources/modules. You should make your own decision according to your project structure.
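As a rough sketch, the module wiring in main.tf looks something like the following. The module names, paths, inputs, and outputs here are illustrative rather than copied from the repository:

# main.tf (illustrative wiring; see the repository for the real inputs and outputs)
module "iam" {
  source      = "./terraform/modules/iam"
  environment = var.environment
}

module "batch" {
  source               = "./terraform/modules/batch"
  environment          = var.environment
  instance_profile_arn = module.iam.instance_profile_arn
  desired_vcpus        = var.desired_vcpus
}

module "sns" {
  source                       = "./terraform/modules/sns"
  notification_email_addresses = var.notification_email_addresses
}

module "events" {
  source             = "./terraform/modules/events"
  job_queue_arn      = module.batch.job_queue_arn
  job_definition_arn = module.batch.job_definition_arn
  sns_topic_arn      = module.sns.topic_arn
}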

In the main.tf file, I define 8 resource groups: 3 IAM roles, 1 Secrets Manager secret, 1 Batch setup, 2 CloudWatch event rules, and 1 SNS topic. Together they compose the AWS architecture shown in the diagram below.

[Architecture diagram: batch-arch]

Workflow steps:

  1. The user creates a Docker image, uploads it to Amazon ECR or another container registry (for example, DockerHub), and creates a job definition, compute environment, and job queue in AWS Batch. In this repo, we use an official AWS image, public.ecr.aws/amazonlinux/amazonlinux:latest, for demo purposes (see the job definition sketch after this list).
  2. A Batch job is submitted into the job queue, using the job definition, by a CloudWatch Event rule regularly as scheduled.
  3. AWS Batch launches an EC2 instance in the compute environment, pulls the image from the image registry, and creates a container.
  4. The container performs some tasks on your behalf. An email notification is triggered if the job fails.
  5. When the job is done, the container is stopped and removed, and the EC2 instance is shut down automatically by AWS Batch.
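To make step 1 concrete, a minimal job definition using the public amazonlinux image could look like the sketch below. The name, command, and resource values are illustrative, not the repository's exact configuration.

# Illustrative job definition for the demo image.
resource "aws_batch_job_definition" "helloworld" {
  name = "dev-helloworld-jd"
  type = "container"

  container_properties = jsonencode({
    image   = "public.ecr.aws/amazonlinux/amazonlinux:latest"
    command = ["echo", "hello world"]
    resourceRequirements = [
      { type = "VCPU", value = "1" },
      { type = "MEMORY", value = "512" }
    ]
  })
}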

Terraform Modules

For a well-architected and well-organized Terraform structure, I define a module for each resource group. Modules are a key feature in Terraform that helps users manage their resources efficiently. Let's dive into the details of each module in this section.

Batch Module: Defines resources in the Batch service, including the compute environment, job queue, and job definition.
EventBridge Module: Defines resources in EventBridge, including event rules and rule targets. One rule, named submit_batch_job_event, submits the Batch job as scheduled (see the sketch after this list); another, named capture_failed_batch_event, sends out an alert email if a Batch job fails.
IAM Module: Defines IAM resources, including roles, policies, and an instance profile. These roles are used by the AWS Batch resources and EventBridge rules.
SecretManager Module: Defines a secret token that may be used in the job container. It's not required in the demo project, but is included for your reference.
SNS Module: Defines resources in SNS, including a topic and subscription.
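For the scheduling half of the EventBridge module, a scheduled rule can submit a Batch job through a batch_target block, roughly as sketched below. The schedule, names, and var.* inputs are assumptions, not the module's exact code.

# Illustrative scheduled rule that submits a Batch job once a day.
resource "aws_cloudwatch_event_rule" "submit_batch_job_event" {
  name                = "dev-submit-batch-job"
  schedule_expression = "cron(0 4 * * ? *)" # 04:00 UTC daily
}

resource "aws_cloudwatch_event_target" "submit_batch_job" {
  rule     = aws_cloudwatch_event_rule.submit_batch_job_event.name
  arn      = var.job_queue_arn   # target the job queue
  role_arn = var.events_role_arn # role allowed to call batch:SubmitJob

  batch_target {
    job_definition = var.job_definition_arn
    job_name       = "scheduled-helloworld"
  }
}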

Apply/Destroy Terraform Resources

I created a Makefile and a shell script to simplify the apply/destroy process into a single command. You can find the shell script at /scripts/apply.sh in the demo code.

# apply Terraform infrastructure
make apply

# destroy Terraform infrastructure
make destroy

Submit Batch Job Manually

After applying the Terraform resources successfully using make apply, you can submit a Batch job to test the entire workflow. The Batch job is submitted/triggered by a CloudWatch Event (EventBridge) rule once a day as scheduled. However, you can also submit a job manually via the AWS CLI as shown below, or directly from the AWS Console.

Don't forget to update the job definition revision in --job-definition if you have created a new revision. Only the latest revision is ACTIVE.

# Setup AWS_PROFILE with permission to submit batch job
export AWS_PROFILE=service.app-deployment-dev-ci-bot

# Submit a job using CLI
aws batch submit-job \
  --job-name triggered-via-cli \
  --job-definition arn:aws-cn:batch:cn-north-1:756143471679:job-definition/dev-helloworld-jd:1 \
  --job-queue arn:aws-cn:batch:cn-north-1:756143471679:job-queue/dev-helloworld-jq

After the job is submitted successfully, go to AWS Console -> Batch -> Jobs. Select the target job queue from the dropdown list, and your newly submitted job will be listed at the top. A job takes a few minutes to complete, depending on the job processing time and on whether you allocate EC2 instance capacity in advance by setting the variable desired_vcpus to a number greater than 0. If the job fails, an email notification is sent to the topic subscribers you provided in the variable notification_email_addresses.

For cost saving, I set desired_vcpus to 0 by default, which means a new EC2 instance is launched when a new job is submitted and shut down immediately after the job completes. The screenshot below shows the latest job submitted by the CloudWatch Event (EventBridge) rule at 04:00 AM (UTC).
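For reference, the two variables mentioned above might be declared in variables.tf roughly as follows; the types and defaults are assumptions based on the behavior described here.

# Illustrative declarations for the cost and notification settings.
variable "desired_vcpus" {
  description = "vCPUs kept warm in the compute environment; 0 means instances are launched on demand"
  type        = number
  default     = 0
}

variable "notification_email_addresses" {
  description = "Email addresses subscribed to the failed-job SNS topic"
  type        = list(string)
  default     = []
}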

Logging

The logging data is saved to CloudWatch Logs automatically. You can find the logs at the bottom of the job details page (there is some delay syncing logs from CloudWatch Logs). The job details view also provides a link to the log stream of the current job. AWS creates a CloudWatch Logs group named /aws/batch/job automatically the first time you submit a Batch job in a region.

Notes:

  1. In the compute_resources block of the aws_batch_compute_environment resource, the instance_role argument is the ARN of the IAM instance profile, not the IAM role (see the sketch after these notes).
  2. You may encounter a ClientException where deleting the compute environment fails because it still has a relationship with the job queue. This is a well-known issue with the Terraform aws provider that many people have reported. A few workarounds exist, and you can find one that fits your needs with a quick search.
  3. Don't forget to accept the email subscription request when you first deploy the SNS subscription, so that your subscribed email address is able to receive failed-job alerts.
  4. For the secret token, NEVER EVER check in the secret token to your source code. For demo purposes, I export the token in the Makefile using export TF_VAR_my_secret=replace_me, but you should store secret tokens in a better place, for example GitHub Secrets, and inject them as environment variables at runtime.
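To make note 1 concrete, here is a minimal sketch of how the instance profile feeds the compute environment. Resource names, sizes, and the role and data references are illustrative and assumed to be defined in the IAM module and data.tf.

# The instance_role argument takes the instance profile ARN, not the role ARN.
resource "aws_iam_instance_profile" "ecs_instance" {
  name = "dev-batch-ecs-instance-profile"
  role = aws_iam_role.ecs_instance.name # ECS instance role, assumed defined elsewhere
}

resource "aws_batch_compute_environment" "this" {
  compute_environment_name = "dev-helloworld-ce"
  type                     = "MANAGED"
  service_role             = aws_iam_role.batch_service.arn # assumed Batch service role

  compute_resources {
    type               = "EC2"
    instance_role      = aws_iam_instance_profile.ecs_instance.arn # profile ARN goes here
    instance_type      = ["optimal"]
    min_vcpus          = 0
    desired_vcpus      = var.desired_vcpus
    max_vcpus          = 4
    subnets            = data.aws_subnets.default.ids         # default VPC subnets
    security_group_ids = [data.aws_security_group.default.id] # default security group
  }
}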


Done. I always appreciate your comments and ideas. Happy learning!
