If you use ECS, you probably already know the two ways of running containers: on EC2 instances or on the serverless option, Fargate. If you choose EC2 instances, you have to take care of Auto Scaling groups and keeping the AMI up to date. Fargate, on the other hand, is limited when it comes to shared volumes and the available CPU and memory combinations. However, on 30th of September 2025 AWS announced a hybrid approach: ECS Managed Instances. It promises the savings of EC2 instances with as little management overhead as Fargate. Today, we will try this out!
Base infrastructure
As I assume that you are already an experienced user, I won't cover the basics like VPC and how to use ECS. We will start from a basic infrastructure: VPC, ECS Cluster, Application Load Balancer and some other necessities like IAM roles and Security Groups. The image below shows what we aim to create.
Both our EC2 instances and the ECS service will run in private subnets. The public subnets will only be used by the ALB and by a fck-nat instance, which gives us Internet access and lets us save on NAT Gateway and endpoint costs. You can find the Terraform code here: github.com/ppabis/managed-ecs-instances. I will skip the description of things like vpc.tf.
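For reference, here is a minimal sketch of what vpc.tf could look like, assuming the terraform-aws-modules/vpc/aws module; the CIDRs and AZs are placeholders and the fck-nat instance with its routes is omitted:
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "managed-ecs-vpc" # placeholder name
  cidr = "10.0.0.0/16"

  azs             = ["eu-west-1a", "eu-west-1b"]
  public_subnets  = ["10.0.0.0/24", "10.0.1.0/24"]
  private_subnets = ["10.0.10.0/24", "10.0.11.0/24"]

  # NAT Gateway stays off - private subnets route through the fck-nat instance instead (not shown)
  enable_nat_gateway = false
}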
Defining security for Managed Instances
Up front I can say that we will need to define three things: an IAM role for the capacity provider (which is not the ECS service-linked role), an IAM role and instance profile for the individual EC2 instances, and a security group for them. Let's do it now.
Infrastructure role
The first role we have to create is the infrastructure role. Fortunately, there is already an AWS managed policy that covers all the required permissions, so we just need to attach it. The infrastructure role for the capacity provider trusts the ecs.amazonaws.com service.
data "aws_iam_policy_document" "trust_policy" {
  statement {
    actions = ["sts:AssumeRole"]
    effect  = "Allow"
    principals {
      type        = "Service"
      identifiers = ["ecs.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ecs_infrastructure" {
  name               = "ECSInfraRole"
  assume_role_policy = data.aws_iam_policy_document.trust_policy.json
}

resource "aws_iam_role_policy_attachment" "ecs_infrastructure" {
  role       = aws_iam_role.ecs_infrastructure.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonECSInfrastructureRolePolicyForManagedInstances"
}
EC2 instance role
The next thing we need to define is the role and instance profile for the individual EC2 instances, as they also need to communicate with some AWS services. Just as above, the permissions are already defined in an AWS managed policy. Also note that you should name the instance role starting with ecsInstanceRole to be able to use the default AWS managed policy for the infrastructure role. Otherwise you would need to grant the infrastructure role an additional iam:PassRole permission for your custom role (see the sketch after the code below).
data "aws_iam_policy_document" "trust_policy_instance" {
  statement {
    actions = ["sts:AssumeRole"]
    effect  = "Allow"
    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ecs_managed_instance" {
  name               = "ecsInstanceRoleManaged"
  assume_role_policy = data.aws_iam_policy_document.trust_policy_instance.json
}

resource "aws_iam_instance_profile" "ecs_managed_instance" {
  name = "ECSManagedInstanceProfile"
  role = aws_iam_role.ecs_managed_instance.name
}

resource "aws_iam_role_policy_attachment" "ecs_instance" {
  role       = aws_iam_role.ecs_managed_instance.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonECSInstanceRolePolicyForManagedInstances"
}
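If you do want a differently named instance role, a rough sketch of the extra permission attached to the infrastructure role could look like this (the inline policy name and the exact condition are my assumptions, not taken from the repository):
# Only needed if the instance role name does not start with ecsInstanceRole (assumption)
resource "aws_iam_role_policy" "pass_instance_role" {
  name = "PassCustomInstanceRole"
  role = aws_iam_role.ecs_infrastructure.name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "iam:PassRole"
      Resource = aws_iam_role.ecs_managed_instance.arn
      Condition = {
        StringEquals = { "iam:PassedToService" = "ec2.amazonaws.com" }
      }
    }]
  })
}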
Security group
I will give the instances full egress access, as there are no inbound requirements. This security group mostly serves to tag the traffic coming from the instances, in case you want to allow them to talk to, for example, a VPC interface endpoint.
resource "aws_security_group" "managed-instance-sg" {
  name   = "ManagedECSInstancesSG"
  vpc_id = module.vpc.vpc_id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
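For instance, if you later add a VPC interface endpoint, its security group could reference this one as the allowed source. A sketch, assuming a hypothetical aws_security_group.vpc_endpoints defined elsewhere:
# Allows HTTPS from the managed instances to a (hypothetical) endpoint security group
resource "aws_vpc_security_group_ingress_rule" "endpoint_from_instances" {
  security_group_id            = aws_security_group.vpc_endpoints.id
  referenced_security_group_id = aws_security_group.managed-instance-sg.id
  from_port                    = 443
  to_port                      = 443
  ip_protocol                  = "tcp"
}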
Defining the capacity provider
In order to use ECS Managed Instances, we need to define a capacity provider, just as we would with a classic EC2 Auto Scaling group. This time, however, the parameters are a bit different. Make sure that you use Terraform AWS provider version 6.15 or newer.
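A simple way to enforce that is a version constraint in the providers configuration, for example:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 6.15"
    }
  }
}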
First we define the name of the provider and directly set the cluster for which it will be created. The resource type is aws_ecs_capacity_provider.
resource "aws_ecs_cluster" "ec2-managed" {
  name = "ec2-managed"
}

resource "aws_ecs_capacity_provider" "ec2-managed" {
  name    = "ec2-managed"
  cluster = aws_ecs_cluster.ec2-managed.name

  managed_instances_provider {
    # more parameters follow ...
  }
}
Next we need to fill in the managed_instances_provider block. First, we give it the infrastructure role. I will also add an option to propagate the tags of this capacity provider to the instances, so that we can find them easily.
resource "aws_ecs_capacity_provider" "ec2-managed" {
  # ...
  managed_instances_provider {
    infrastructure_role_arn = aws_iam_role.ecs_infrastructure.arn
    propagate_tags          = "CAPACITY_PROVIDER"

    instance_launch_template {
      # more parameters here ...
    }
  }
}
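Since propagate_tags is set to CAPACITY_PROVIDER, any tags placed on the capacity provider resource itself will end up on the launched instances. The tag below is just an example I would add for discoverability:
resource "aws_ecs_capacity_provider" "ec2-managed" {
  # ...
  tags = {
    Project = "managed-ecs-instances" # example tag, propagated to the EC2 instances
  }
}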
The next nested block is instance_launch_template. Here we define the details of how the EC2 instances should be shaped. Let's start by setting the IAM role/profile for the instances, the monitoring level and the storage size. I will also use the private subnets and the new security group in the networking configuration.
resource "aws_ecs_capacity_provider" "ec2-managed" {
  # ...
  managed_instances_provider {
    # ...
    instance_launch_template {
      ec2_instance_profile_arn = aws_iam_instance_profile.ecs_managed_instance.arn
      monitoring               = "BASIC"

      storage_configuration {
        storage_size_gib = 16
      }

      network_configuration {
        subnets         = module.vpc.private_subnets
        security_groups = [aws_security_group.managed-instance-sg.id]
      }

      instance_requirements {
        # ... more parameters
      }
    }
  }
}
The last nested block inside instance_launch_template is instance_requirements. Here we define the CPU and memory ranges, the instance generations and the CPU manufacturers to select from. You can also optionally request GPUs and FPGAs. I will choose only Graviton instances of the current and previous generation, with 2 to 4 vCPUs and 2 to 16 GB of memory. I also want to exclude bare metal instances and include burstable (t4g) instances.
resource "aws_ecs_capacity_provider" "ec2-managed" {
  # ...
  managed_instances_provider {
    # ...
    instance_launch_template {
      # ...
      instance_requirements {
        memory_mib {
          min = 2048
          max = 16384
        }
        vcpu_count {
          min = 2
          max = 4
        }
        instance_generations  = ["current", "previous"]
        cpu_manufacturers     = ["amazon-web-services"]
        burstable_performance = "included"
        bare_metal            = "excluded"
      }
    }
  }
}
The next part is very important as well. If you now go to the AWS Console and look at the cluster, it will show that the Managed Instances capacity provider is indeed there. However, no capacity provider strategy is set. This is different from Fargate, where you can simply set the launch type to "FARGATE" in the service and you are good to go on any cluster. You can bind the capacity provider directly to the service, but it's cleaner to control it at the cluster level. Thus we need to define the default strategy.
resource "aws_ecs_cluster_capacity_providers" "ec2-managed" {
  cluster_name       = aws_ecs_cluster.ec2-managed.name
  capacity_providers = [aws_ecs_capacity_provider.ec2-managed.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.ec2-managed.name
    base              = 0
    weight            = 100
  }
}
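For completeness, if you preferred binding the capacity provider to a single service instead of making it the cluster default, the service-level strategy would look roughly like this (not used in this setup):
resource "aws_ecs_service" "example" {
  # ... the usual service arguments ...
  capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.ec2-managed.name
    base              = 0
    weight            = 100
  }
}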
Load balancer to preview the service
In order to test the service, I will create a load balancer with an HTTP listener. It is a pretty basic setup for ECS services. It is important to set the target group's target type to ip and configure an appropriate health check. To allow quicker redeployments I will also set a shorter deregistration delay.
resource "aws_security_group" "alb" {
  name   = "test-ecs-managed-alb"
  vpc_id = module.vpc.vpc_id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_lb" "alb" {
  name                       = "test-ecs-managed-lb"
  security_groups            = [aws_security_group.alb.id]
  subnets                    = module.vpc.public_subnets
  enable_deletion_protection = false
}

resource "aws_lb_target_group" "test-ecs" {
  name                 = "test-ecs-managed-tg"
  port                 = 80
  protocol             = "HTTP"
  vpc_id               = module.vpc.vpc_id
  target_type          = "ip"
  deregistration_delay = 10

  health_check {
    path                = "/"
    port                = 80
    protocol            = "HTTP"
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 5
    interval            = 30
    matcher             = "200-499"
  }
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.alb.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.test-ecs.arn
  }
}
We will then be able to use the load balancer and the target group to preview our containers and see if they are working. As the test workload I will use the default Nginx image from the ECR Public Gallery.
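Once the service defined in the next sections is running, previewing it can be as simple as curling the ALB's DNS name:
$ ALB_DNS=$(aws elbv2 describe-load-balancers --names test-ecs-managed-lb \
    --query 'LoadBalancers[0].DNSName' --output text)
$ curl -I "http://$ALB_DNS/"   # expect the default Nginx welcome page once a task is healthy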
Task definition
To create the task definition we first need some IAM roles to attach as the task role and the execution role. The task role doesn't need any permissions for now, as we won't be accessing any AWS services from inside the container. For the execution role we will just use the default managed policy.
data "aws_iam_policy_document" "ecs_task_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "task" {
  name               = "test-managed-ecs-task-role"
  assume_role_policy = data.aws_iam_policy_document.ecs_task_assume_role.json
}

resource "aws_iam_role" "execution" {
  name               = "test-managed-ecs-execution-role"
  assume_role_policy = data.aws_iam_policy_document.ecs_task_assume_role.json
}

resource "aws_iam_role_policy_attachment" "execution" {
  role       = aws_iam_role.execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
Another thing we need is a security group. I will add one that allows the load balancer to reach our Nginx container on port 80, while the container can reach out to the Internet if necessary.
resource "aws_security_group" "service-sg" {
  name   = "test-ecs-service-sg"
  vpc_id = module.vpc.vpc_id

  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
Now we can create the task definition. I will enforce the ARM64 architecture and require the "MANAGED_INSTANCES" compatibility, which is needed for the task to deploy through this capacity provider. As mentioned before, I will use the public ECR Gallery Nginx image and map port 80. We will use the awsvpc networking mode so that a new network interface is created for each of our tasks. For this task I will use 0.5 vCPU and 1 GB of RAM.
resource "aws_ecs_task_definition" "test-ecs" {
  family                   = "test-ecs-task"
  task_role_arn            = aws_iam_role.task.arn
  execution_role_arn       = aws_iam_role.execution.arn
  cpu                      = "512"
  memory                   = "1024"
  network_mode             = "awsvpc"
  requires_compatibilities = ["MANAGED_INSTANCES"]

  runtime_platform {
    operating_system_family = "LINUX"
    cpu_architecture        = "ARM64"
  }

  container_definitions = jsonencode([{
    name      = "web"
    image     = "public.ecr.aws/nginx/nginx:latest"
    essential = true
    portMappings = [{
      containerPort = 80
      hostPort      = 80
      protocol      = "tcp"
    }]
  }])
}
Lastly, we can create the service that will spawn the tasks and register them with the load balancer. For the first try I will just set the desired count to 1. Add capacity_provider_strategy to ignore_changes, as we don't want to bind the service to this capacity provider; otherwise any redeployment will fail.
resource "aws_ecs_service" "test-ecs" {
  name                    = "test-ecs-service"
  cluster                 = aws_ecs_cluster.ec2-managed.name
  desired_count           = 1
  task_definition         = aws_ecs_task_definition.test-ecs.arn
  enable_ecs_managed_tags = true

  network_configuration {
    subnets         = module.vpc.private_subnets
    security_groups = [aws_security_group.service-sg.id]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.test-ecs.arn
    container_name   = "web"
    container_port   = 80
  }

  lifecycle {
    ignore_changes = [capacity_provider_strategy]
  }
}
You can even use arbitrary values for CPU and memory in the task definition, such as 1234 for CPU and 1234 for RAM, as we are not constrained by the available Fargate configurations. I will also try to scale the service up to a large number of containers to see how the EC2 instances get scheduled.
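Scaling up can be done by bumping desired_count in Terraform or ad hoc with the CLI, for example:
$ aws ecs update-service --cluster ec2-managed \
    --service test-ecs-service --desired-count 30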
Scaling up the service
When I scaled the service to 30 tasks (and also had to increase an account quota), the only instances that were created were t4g.small. This seemed quite inefficient, as the host and ECS agent overhead also consume some capacity and larger instances would seem more reasonable. However, it was likely due to the limit on how many network interfaces can be attached to each instance: t4g.small supports up to 3 network interfaces, so two of them can be used by ECS tasks, and t4g.large has the same limit. After several tries I got this weird combination:
$ aws ec2 describe-instances \
    --filters "Name=tag:aws:ecs:clusterName,Values=ec2-managed" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[*].Instances[*].[InstanceType]' \
    --output text | tr '\t' '\n' | sort | uniq -c
      5 im4gn.large
      3 is4gen.large
     16 t4g.small
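You can verify the network interface limits I mentioned with describe-instance-types:
$ aws ec2 describe-instance-types --instance-types t4g.small t4g.large \
    --query 'InstanceTypes[].[InstanceType,NetworkInfo.MaximumNetworkInterfaces]' \
    --output table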
In my opinion this mix looks even worse, as the im4gn and is4gen instances are about twice as expensive as t4g per vCPU, although im4gn does have more memory per core. So I tried setting a large memory requirement to get an r-family instance. I set 10 tasks of 1000 millicores and 3500 MB of RAM each. This gave me just 10 t4g.medium instances, as they have 4 GB of memory each. So I raised the stakes even higher and asked for 14000 MB per task. This time I got 10 r6g.large instances.
$ aws ec2 describe-instances \
    --filters "Name=tag:aws:ecs:clusterName,Values=ec2-managed" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[*].Instances[*].[InstanceType]' \
    --output text | tr '\t' '\n' | sort | uniq -c
     10 r6g.large
     10 t4g.medium
      2 t4g.small
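For reference, the change that finally produced the r6g.large fleet was just the task-level sizing, roughly:
resource "aws_ecs_task_definition" "test-ecs" {
  # ...
  cpu    = "1000"  # 1 vCPU per task
  memory = "14000" # ~14 GB per task, which pushes the scheduler towards memory-optimized instances
}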
As you can see, ECS Managed Instances give us much more flexibility. We can quickly spawn different instance types based on our needs and the configuration is trivial. Cost-wise this setup is also very attractive, even if you don't have reserved instances: Fargate costs roughly $0.04 per hour for a task with 1 vCPU and 1 GB of memory, while for a similar price you can get a t4g.medium with 2 vCPUs and 4 GB of RAM. The startup time is almost negligible, as Bottlerocket (Amazon's container-optimized OS) instances boot within seconds.