Daniel Quackenbush

Getting Started with AWS Batch

AWS Batch is a job scheduler and executor built around a job queue. At my $job, we were evaluating it for dynamic workloads, where a container service is needed to execute work submitted to a queue. The general workflow looks a little like this:

The external worker submits a job -> the job is scheduled on a spot instance -> ECS takes over and executes the task -> results are logged.

Given that most applications communicate with different external systems, there would be a wide variety of IAM configurations and container scripts. For simplicity, and to keep the system generic, this batch job will copy /etc/motd to a parameterized S3 bucket.
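
The snippets that follow assume two Terraform input variables, vpc_id and bucket_name. A minimal sketch of how they might be declared (the descriptions are mine):

# variables.tf - inputs assumed by the snippets below
variable "vpc_id" {
  description = "ID of the existing VPC that the Batch compute environment will run in"
  type        = string
}

variable "bucket_name" {
  description = "Name of the S3 bucket the batch job copies /etc/motd into"
  type        = string
}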

Topology


Configure Foundation

VPC

First, define the VPC and private subnets. In my use case, the VPC is predefined through a variable (vpc_id), and the subnets are looked up dynamically through tags with the key/value pair subnet = private.

data "aws_vpc" "selected" {
id = var.vpc_id
}
data "aws_subnet_ids" "private" {
vpc_id = data.aws_vpc.selected.id
tags = {
subnet = "private"
}
}

Security Groups

For this topology, I am utilizing VPC endpoints so that my containers remain locked down on egress (see the endpoint sketch after the security group); however, AWS' Setting Up with AWS Batch guide notes that you can simply configure open outbound traffic instead.

resource "aws_security_group" "this" {
name = "batch_compute_env"
vpc_id = data.aws_vpc.selected.id
# egress only to VPC + S3 buckets for ECR pulling/test bucket
egress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = [
data.aws_vpc.selected.cidr_block,
"54.231.0.0/17",
"52.216.0.0/15",
"3.5.16.0/21",
"52.92.16.0/20",
"3.5.0.0/20",
]
}
}
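
The endpoint resources themselves aren't part of the gists above; below is a minimal sketch of an S3 gateway endpoint, assuming the private route tables carry the same subnet = private tag used for the subnet lookup. Interface endpoints for ecr.api, ecr.dkr, and CloudWatch Logs would be added in a similar fashion.

data "aws_region" "current" {}

# Route tables for the private subnets; the tag filter mirrors the subnet
# lookup above and is an assumption about how the VPC is tagged.
data "aws_route_tables" "private" {
  vpc_id = data.aws_vpc.selected.id

  tags = {
    subnet = "private"
  }
}

# Gateway endpoint so the containers can reach S3 (ECR image layers and the
# test bucket) without leaving the AWS network.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = data.aws_vpc.selected.id
  service_name      = "com.amazonaws.${data.aws_region.current.name}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = data.aws_route_tables.private.ids
}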

IAM

Following the principle of least privilege is important; however, for simplicity, I am largely using AWS managed policies. Batch with ECS requires two roles: first, the Batch service role, which allows the service to create EC2 instances, create and modify the auto scaling group, and so on; second, the ECS instance role, which serves two purposes here, acting as both the task execution role (the permissions your container needs) and the instance role. It is worth noting that during my research, I did not see a breakdown of task execution role versus task role, a distinction the ECS service itself provides.
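
For comparison, this is the split that plain ECS exposes on its task definitions; the sketch below is purely illustrative and not part of this Batch setup (the role ARNs are placeholders):

# Hypothetical ECS task definition, for comparison only.
# execution_role_arn is what the ECS agent uses (pull the image, write logs),
# while task_role_arn is what the application inside the container assumes.
resource "aws_ecs_task_definition" "role_split_example" {
  family             = "role-split-example"
  execution_role_arn = "arn:aws:iam::123456789012:role/example-task-execution-role"
  task_role_arn      = "arn:aws:iam::123456789012:role/example-task-role"

  container_definitions = jsonencode([
    {
      name      = "app"
      image     = "amazonlinux:2"
      memory    = 128
      essential = true
    }
  ])
}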

Batch Instances/Service Roles:

# Batch Service Role
resource "aws_iam_role" "aws_batch_service_role" {
  name = "aws_batch_service_role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Action": "sts:AssumeRole",
    "Effect": "Allow",
    "Principal": {
      "Service": "batch.amazonaws.com"
    }
  }]
}
EOF
}

resource "aws_iam_role_policy_attachment" "aws_batch_service_role" {
  role       = aws_iam_role.aws_batch_service_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole"
}

# Batch Instance Role
resource "aws_iam_role" "ecs_instance_role" {
  name = "ecs_instance_role"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [{
    "Action": "sts:AssumeRole",
    "Effect": "Allow",
    "Principal": {
      "Service": "ec2.amazonaws.com"
    }
  }]
}
EOF
}

resource "aws_iam_role_policy_attachment" "ecs_instance_policy" {
  role       = aws_iam_role.ecs_instance_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

resource "aws_iam_role_policy_attachment" "ecs_ssm_policy" {
  role       = aws_iam_role.ecs_instance_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM"
}

resource "aws_iam_instance_profile" "ecs_instance_role" {
  name = "ecs_instance_role"
  role = aws_iam_role.ecs_instance_role.name
}

Additional policy for uploading to S3:

data "aws_iam_policy_document" "uploader_policy" {
statement {
actions = [
"s3:PutObject",
]
resources = [
"arn:aws:s3:::${var.bucket_name}/",
]
}
}
resource "aws_iam_policy" "ecs_batch_uploader" {
name = "ecs_batch_uploader"
path = "/"
policy = data.aws_iam_policy_document.uploader_policy.json
}
resource "aws_iam_role_policy_attachment" "ecs_uploader_policy" {
role = aws_iam_role.ecs_instance_role.name
policy_arn = aws_iam_policy.ecs_batch_uploader.arn
}

Configure Batch

Batch is composed of several pieces:

  1. Compute Environment
  2. Queue
  3. Job Definition

Compute Environment

Batch allows you to configure any variety of EC2 instance types. For this proof of concept, I went strictly with "optimal" Spot Instances; however, for production workloads the environment likely won't be as ephemeral, and some On-Demand Instances might be required.

To keep the environment as secure as possible, I created a launch template to accomplish two main purposes:

  • Ensure the base volume is encrypted

Requesting a plain Spot Instance won't encrypt your volumes at rest. To do so, you must define block_device_mappings, making sure encrypted is set to true.

  • Utilize Amazon Linux 2 over 1 to get the latest patches. By default, the ECS-optimized instances ship with Amazon Linux 1, which was last updated in March 2018. Using Parameter Store, AWS provides the ability to dynamically look up the image ID under /aws/service/ecs/optimized-ami/amazon-linux-2.

data "template_file" "container_properties" {
template = file("templates/container_properties.yaml")
vars = {
bucket_name = var.bucket_name
}
}
data "aws_ssm_parameter" "image_id" {
name = "/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id"
}
resource "aws_launch_template" "batch_launch_template" {
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_size = 100
encrypted = true
}
}
image_id = data.aws_ssm_parameter.image_id.value
}
resource "aws_batch_compute_environment" "spot" {
compute_environment_name = "spot-fleet"
compute_resources {
allocation_strategy = "SPOT_CAPACITY_OPTIMIZED"
instance_role = aws_iam_instance_profile.ecs_instance_role.arn
instance_type = [
"optimal",
]
max_vcpus = 256
min_vcpus = 0
desired_vcpus = 4
security_group_ids = [
aws_security_group.this.id,
]
subnets = data.aws_subnet_ids.private.ids
type = "SPOT"
launch_template {
launch_template_id = aws_launch_template.batch_launch_template.id
version = "$Latest"
}
}
service_role = aws_iam_role.aws_batch_service_role.arn
type = "MANAGED"
}
resource "aws_batch_job_queue" "this" {
name = "queue"
state = "ENABLED"
priority = "1"
compute_environments = [
aws_batch_compute_environment.spot.arn,
]
}
resource "aws_batch_job_definition" "example" {
name = "batch-job-definition"
type = "container"
container_properties = jsonencode(yamldecode(
data.template_file.container_properties.rendered
))
}

Queue

The queue is where job definitions get associated with compute environments. In a production design, it's possible to combine different types of compute fleets behind a single queue.
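
A queue can reference more than one compute environment, tried in the order listed. A minimal sketch, assuming a second on-demand environment (aws_batch_compute_environment.on_demand) were defined elsewhere:

resource "aws_batch_job_queue" "mixed" {
  name     = "mixed-queue"
  state    = "ENABLED"
  priority = 1

  # Batch tries the environments in the order listed: Spot capacity first,
  # then the hypothetical on-demand environment.
  compute_environments = [
    aws_batch_compute_environment.spot.arn,
    aws_batch_compute_environment.on_demand.arn, # assumed to exist elsewhere
  ]
}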

Job Definition

This is where the meat of the operation happens. When you submit a job to the queue, you specify the definition you want to use, which acts like the cookbook/playbook for the container. This is important because the definition is where compute requirements, the entry point command, and so on are defined. Job definitions also allow for parameterization, so you can create dynamic workloads.

Below I break down the container properties, which highlight the parameterized implementation:

command:
- aws
- s3
- cp
- /etc/motd
- Ref::BUCKET_NAME
image: "<AWS ACCOUNT ID>.dkr.ecr.<REGION>.amazonaws.com/<AWSCLI REPO>"
memory: 128
vcpus: 1
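
Job definitions can also carry default parameter values, so the Ref::BUCKET_NAME placeholder still resolves when submit-job doesn't pass one. A minimal sketch extending the job definition above (the default bucket value is just an example):

resource "aws_batch_job_definition" "example" {
  name = "batch-job-definition"
  type = "container"

  # Default for the Ref::BUCKET_NAME placeholder; a --parameters value on
  # submit-job overrides it.
  parameters = {
    BUCKET_NAME = "s3://example-default-bucket"
  }

  container_properties = jsonencode(yamldecode(
    data.template_file.container_properties.rendered
  ))
}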

Execution

To wrap it all together, simply submit a job to the queue and watch the magic happen.

aws batch submit-job --job-name test --job-queue queue --job-definition batch-job-definition --parameters BUCKET_NAME=s3://quack-batch-testing