If you use ECS, you probably already know the two ways of running containers: on EC2 instances or on the serverless option, Fargate. If you choose EC2 instances, you have to take care of Auto Scaling groups and keeping the AMI up to date. Fargate, on the other hand, is limited when it comes to shared volumes and the available CPU and memory combinations. However, on 30th of September 2025 AWS announced a hybrid approach: ECS Managed Instances. It promises the savings of EC2 instances with as little management overhead as Fargate. Today, we will try this out!
Base infrastructure
As I assume that you are already an experienced user, I won't cover the basics like VPC and how to use ECS. We will start from a basic infrastructure: VPC, ECS Cluster, Application Load Balancer and some other necessities like IAM roles and Security Groups. The image below shows what we aim to create.
Both our EC2 instances and the ECS service will run in private subnets. The public subnets will only be used by the ALB and by a fck-nat instance, which gives us Internet access and lets us save on NAT Gateway and endpoint costs. You can find the Terraform code here: github.com/ppabis/managed-ecs-instances. I will skip the description of things like vpc.tf.
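For reference, here is a minimal sketch of what vpc.tf could look like, assuming the terraform-aws-modules/vpc/aws module; the CIDRs and AZs are placeholders and the fck-nat instance with its routes is omitted:
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "managed-ecs-vpc" # placeholder name
  cidr = "10.0.0.0/16"

  azs             = ["eu-west-1a", "eu-west-1b"]
  public_subnets  = ["10.0.0.0/24", "10.0.1.0/24"]
  private_subnets = ["10.0.10.0/24", "10.0.11.0/24"]

  # NAT Gateway stays off - private subnets route through the fck-nat instance instead (not shown)
  enable_nat_gateway = false
}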
Defining security for Managed Instances
Up front I can say that we will need to define three things: an IAM role for the capacity provider (which is not the ECS service-linked role), an IAM role and instance profile for the individual EC2 instances, and a security group for them. Let's do it now.
Infrastructure role
The first role we have to create is the infrastructure role. Fortunately, there is already an AWS managed policy that covers all the required permissions, so we just need to attach it. The infrastructure role for the capacity provider trusts the ecs.amazonaws.com service.
data "aws_iam_policy_document" "trust_policy" {
  statement {
    actions = ["sts:AssumeRole"]
    effect  = "Allow"
    principals {
      type        = "Service"
      identifiers = ["ecs.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ecs_infrastructure" {
  name               = "ECSInfraRole"
  assume_role_policy = data.aws_iam_policy_document.trust_policy.json
}

resource "aws_iam_role_policy_attachment" "ecs_infrastructure" {
  role       = aws_iam_role.ecs_infrastructure.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonECSInfrastructureRolePolicyForManagedInstances"
}
EC2 instance role
The next thing we need to define is the role and instance profile for the individual EC2 instances, as they also need to communicate with some AWS services. Just as above, the permissions are already defined in an AWS managed policy. Also note that you should name the instance role starting with ecsInstanceRole to be able to use the default AWS managed policy for the infrastructure role. Otherwise you would need to grant the infrastructure role an additional iam:PassRole permission for your custom role (see the sketch after the code below).
data "aws_iam_policy_document" "trust_policy_instance" {
  statement {
    actions = ["sts:AssumeRole"]
    effect  = "Allow"
    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ecs_managed_instance" {
  name               = "ecsInstanceRoleManaged"
  assume_role_policy = data.aws_iam_policy_document.trust_policy_instance.json
}

resource "aws_iam_instance_profile" "ecs_managed_instance" {
  name = "ECSManagedInstanceProfile"
  role = aws_iam_role.ecs_managed_instance.name
}

resource "aws_iam_role_policy_attachment" "ecs_instance" {
  role       = aws_iam_role.ecs_managed_instance.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonECSInstanceRolePolicyForManagedInstances"
}
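If you do want a differently named instance role, a rough sketch of the extra permission attached to the infrastructure role could look like this (the inline policy name and the exact condition are my assumptions, not taken from the repository):
# Only needed if the instance role name does not start with ecsInstanceRole (assumption)
resource "aws_iam_role_policy" "pass_instance_role" {
  name = "PassCustomInstanceRole"
  role = aws_iam_role.ecs_infrastructure.name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "iam:PassRole"
      Resource = aws_iam_role.ecs_managed_instance.arn
      Condition = {
        StringEquals = { "iam:PassedToService" = "ec2.amazonaws.com" }
      }
    }]
  })
}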
Security group
I will give the instances full egress access, as there are no inbound requirements. This security group mostly serves to tag the traffic coming from the instances, in case you want to allow them to talk to, for example, a VPC interface endpoint.
resource "aws_security_group" "managed-instance-sg" {
  name   = "ManagedECSInstancesSG"
  vpc_id = module.vpc.vpc_id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
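For instance, if you later add a VPC interface endpoint, its security group could reference this one as the allowed source. A sketch, assuming a hypothetical aws_security_group.vpc_endpoints defined elsewhere:
# Allows HTTPS from the managed instances to a (hypothetical) endpoint security group
resource "aws_vpc_security_group_ingress_rule" "endpoint_from_instances" {
  security_group_id            = aws_security_group.vpc_endpoints.id
  referenced_security_group_id = aws_security_group.managed-instance-sg.id
  from_port                    = 443
  to_port                      = 443
  ip_protocol                  = "tcp"
}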
Defining the capacity provider
In order to use ECS Managed Instances, we need to define a capacity provider, just as we would with a classic EC2 Auto Scaling group. This time, however, the parameters are a bit different. Make sure that you use Terraform AWS provider version 6.15 or newer.
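A simple way to enforce that is a version constraint in the providers configuration, for example:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 6.15"
    }
  }
}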
First we define the name of the provider and directly set the cluster for which it will be created. The resource type is aws_ecs_capacity_provider.
resource "aws_ecs_cluster" "ec2-managed" {
  name = "ec2-managed"
}

resource "aws_ecs_capacity_provider" "ec2-managed" {
  name    = "ec2-managed"
  cluster = aws_ecs_cluster.ec2-managed.name

  managed_instances_provider {
    # more parameters follow ...
  }
}
Next we need to fill in the managed_instances_provider block. First, we give it the infrastructure role. I will also add an option to propagate the tags of this capacity provider to the instances, so that we can find them easily.
resource "aws_ecs_capacity_provider" "ec2-managed" {
  # ...
  managed_instances_provider {
    infrastructure_role_arn = aws_iam_role.ecs_infrastructure.arn
    propagate_tags          = "CAPACITY_PROVIDER"

    instance_launch_template {
      # more parameters here ...
    }
  }
}
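Since propagate_tags is set to CAPACITY_PROVIDER, any tags placed on the capacity provider resource itself will end up on the launched instances. The tag below is just an example I would add for discoverability:
resource "aws_ecs_capacity_provider" "ec2-managed" {
  # ...
  tags = {
    Project = "managed-ecs-instances" # example tag, propagated to the EC2 instances
  }
}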
The next nested block is instance_launch_template. Here we define the details of how the EC2 instances should be shaped. Let's start by setting the IAM role/profile for the instances, the monitoring level and the storage size. I will also use the private subnets and the new security group in the networking configuration.
resource "aws_ecs_capacity_provider" "ec2-managed" {
  # ...
  managed_instances_provider {
    # ...
    instance_launch_template {
      ec2_instance_profile_arn = aws_iam_instance_profile.ecs_managed_instance.arn
      monitoring               = "BASIC"

      storage_configuration {
        storage_size_gib = 16
      }

      network_configuration {
        subnets         = module.vpc.private_subnets
        security_groups = [aws_security_group.managed-instance-sg.id]
      }

      instance_requirements {
        # ... more parameters
      }
    }
  }
}
The last nested block inside instance_launch_template is instance_requirements. Here we define the CPU and memory ranges, the instance generations and the CPU manufacturers to select from. You can also optionally request GPUs and FPGAs. I will choose only Graviton instances of the current and previous generation, with 2 to 4 vCPUs and 2 to 16 GB of memory. I also want to exclude bare metal instances and include burstable (t4g) instances.
resource "aws_ecs_capacity_provider" "ec2-managed" {
  # ...
  managed_instances_provider {
    # ...
    instance_launch_template {
      # ...
      instance_requirements {
        memory_mib {
          min = 2048
          max = 16384
        }
        vcpu_count {
          min = 2
          max = 4
        }
        instance_generations  = ["current", "previous"]
        cpu_manufacturers     = ["amazon-web-services"]
        burstable_performance = "included"
        bare_metal            = "excluded"
      }
    }
  }
}
The next part is very important as well. If you now go to the AWS Console and look at the cluster, it will show that the Managed Instances capacity provider is indeed there. However, no capacity provider strategy is set. This is different from Fargate, where you can simply set the launch type to "FARGATE" in the service and you are good to go on any cluster. You can bind the capacity provider directly to the service, but it's cleaner to control it at the cluster level. Thus we need to define the default strategy.
resource "aws_ecs_cluster_capacity_providers" "ec2-managed" {
  cluster_name       = aws_ecs_cluster.ec2-managed.name
  capacity_providers = [aws_ecs_capacity_provider.ec2-managed.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.ec2-managed.name
    base              = 0
    weight            = 100
  }
}
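For completeness, if you preferred binding the capacity provider to a single service instead of making it the cluster default, the service-level strategy would look roughly like this (not used in this setup):
resource "aws_ecs_service" "example" {
  # ... the usual service arguments ...
  capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.ec2-managed.name
    base              = 0
    weight            = 100
  }
}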
Load balancer to preview the service
In order to test the service, I will create a load balancer with an HTTP listener. It is a pretty basic setup for ECS services. It is important to set the target group's target type to ip and configure an appropriate health check. To allow quicker redeployments I will also set a shorter deregistration delay.
resource "aws_security_group" "alb" {
  name   = "test-ecs-managed-alb"
  vpc_id = module.vpc.vpc_id

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_lb" "alb" {
  name                       = "test-ecs-managed-lb"
  security_groups            = [aws_security_group.alb.id]
  subnets                    = module.vpc.public_subnets
  enable_deletion_protection = false
}

resource "aws_lb_target_group" "test-ecs" {
  name                 = "test-ecs-managed-tg"
  port                 = 80
  protocol             = "HTTP"
  vpc_id               = module.vpc.vpc_id
  target_type          = "ip"
  deregistration_delay = 10

  health_check {
    path                = "/"
    port                = 80
    protocol            = "HTTP"
    healthy_threshold   = 2
    unhealthy_threshold = 2
    timeout             = 5
    interval            = 30
    matcher             = "200-499"
  }
}

resource "aws_lb_listener" "http" {
  load_balancer_arn = aws_lb.alb.arn
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.test-ecs.arn
  }
}
We will then be able to use the load balancer and the target group to preview our containers and see if they are working. As the test workload I will use the default Nginx image from the ECR Public Gallery.
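Once the service defined in the next sections is running, previewing it can be as simple as curling the ALB's DNS name:
$ ALB_DNS=$(aws elbv2 describe-load-balancers --names test-ecs-managed-lb \
    --query 'LoadBalancers[0].DNSName' --output text)
$ curl -I "http://$ALB_DNS/"   # expect the default Nginx welcome page once a task is healthy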
Task definition
To create the task definition we first need some IAM roles to attach as the task role and the execution role. The task role doesn't need any permissions for now, as we won't be accessing any AWS services from inside the container. For the execution role we will just use the default managed policy.
data "aws_iam_policy_document" "ecs_task_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "task" {
  name               = "test-managed-ecs-task-role"
  assume_role_policy = data.aws_iam_policy_document.ecs_task_assume_role.json
}

resource "aws_iam_role" "execution" {
  name               = "test-managed-ecs-execution-role"
  assume_role_policy = data.aws_iam_policy_document.ecs_task_assume_role.json
}

resource "aws_iam_role_policy_attachment" "execution" {
  role       = aws_iam_role.execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
Another thing we need is a security group. I will add one that allows the load balancer to reach our Nginx container on port 80, while the container can reach out to the Internet if necessary.
resource "aws_security_group" "service-sg" {
  name   = "test-ecs-service-sg"
  vpc_id = module.vpc.vpc_id

  ingress {
    from_port       = 80
    to_port         = 80
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
Now we can create the task definition. I will enforce the ARM64 architecture and require the "MANAGED_INSTANCES" compatibility, which is needed for the task to deploy through this capacity provider. As mentioned before, I will use the public ECR Gallery Nginx image and map port 80. We will use the awsvpc networking mode so that a new network interface is created for each of our tasks. For this task I will use 0.5 vCPU and 1 GB of RAM.
resource "aws_ecs_task_definition" "test-ecs" {
  family                   = "test-ecs-task"
  task_role_arn            = aws_iam_role.task.arn
  execution_role_arn       = aws_iam_role.execution.arn
  cpu                      = "512"
  memory                   = "1024"
  network_mode             = "awsvpc"
  requires_compatibilities = ["MANAGED_INSTANCES"]

  runtime_platform {
    operating_system_family = "LINUX"
    cpu_architecture        = "ARM64"
  }

  container_definitions = jsonencode([{
    name      = "web"
    image     = "public.ecr.aws/nginx/nginx:latest"
    essential = true
    portMappings = [{
      containerPort = 80
      hostPort      = 80
      protocol      = "tcp"
    }]
  }])
}
Lastly, we can create the service that will spawn the tasks and register them with the load balancer. For the first try I will just set the desired count to 1. Add capacity_provider_strategy to ignore_changes, as we don't want to bind the service to this capacity provider; otherwise any redeployment will fail.
resource "aws_ecs_service" "test-ecs" {
  name                    = "test-ecs-service"
  cluster                 = aws_ecs_cluster.ec2-managed.name
  desired_count           = 1
  task_definition         = aws_ecs_task_definition.test-ecs.arn
  enable_ecs_managed_tags = true

  network_configuration {
    subnets         = module.vpc.private_subnets
    security_groups = [aws_security_group.service-sg.id]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.test-ecs.arn
    container_name   = "web"
    container_port   = 80
  }

  lifecycle {
    ignore_changes = [capacity_provider_strategy]
  }
}
You can even use arbitrary values for CPU and memory in the task definition, such as 1234 for CPU and 1234 for RAM, as we are not constrained by the available Fargate configurations. I will also try to scale the service up to a large number of containers to see how the EC2 instances get scheduled.
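Scaling up can be done by bumping desired_count in Terraform or ad hoc with the CLI, for example:
$ aws ecs update-service --cluster ec2-managed \
    --service test-ecs-service --desired-count 30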
Scaling up the service
When I scaled the service to 30 tasks (and also had to increase an account quota), the only instances that were created were t4g.small. This seemed quite inefficient, as the host and ECS agent overhead also consume some capacity and larger instances would seem more reasonable. However, it was likely due to the limit on how many network interfaces can be attached to each instance: t4g.small supports up to 3 network interfaces, so two of them can be used by ECS tasks, and t4g.large has the same limit. After several tries I got this weird combination:
$ aws ec2 describe-instances \
    --filters "Name=tag:aws:ecs:clusterName,Values=ec2-managed" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[*].Instances[*].[InstanceType]' \
    --output text | tr '\t' '\n' | sort | uniq -c
      5 im4gn.large
      3 is4gen.large
     16 t4g.small
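You can verify the network interface limits I mentioned with describe-instance-types:
$ aws ec2 describe-instance-types --instance-types t4g.small t4g.large \
    --query 'InstanceTypes[].[InstanceType,NetworkInfo.MaximumNetworkInterfaces]' \
    --output table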
In my opinion this mix looks even worse, as the im4gn and is4gen instances are about twice as expensive as t4g per vCPU, although im4gn does have more memory per core. So I tried setting a large memory requirement to get an r-family instance. I set 10 tasks of 1000 millicores and 3500 MB of RAM each. This gave me just 10 t4g.medium instances, as they have 4 GB of memory each. So I raised the stakes even higher and asked for 14000 MB per task. This time I got 10 r6g.large instances.
$ aws ec2 describe-instances \
    --filters "Name=tag:aws:ecs:clusterName,Values=ec2-managed" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[*].Instances[*].[InstanceType]' \
    --output text | tr '\t' '\n' | sort | uniq -c
     10 r6g.large
     10 t4g.medium
      2 t4g.small
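For reference, the change that finally produced the r6g.large fleet was just the task-level sizing, roughly:
resource "aws_ecs_task_definition" "test-ecs" {
  # ...
  cpu    = "1000"  # 1 vCPU per task
  memory = "14000" # ~14 GB per task, which pushes the scheduler towards memory-optimized instances
}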
As you can see, ECS Managed Instances give us much more flexibility. We can quickly spawn different instance types based on our needs and the configuration is trivial. Cost-wise this setup is also very attractive, even if you don't have reserved instances: Fargate costs roughly $0.04 per hour for a task with 1 vCPU and 1 GB of memory, while for a similar price you can get a t4g.medium with 2 vCPUs and 4 GB of RAM. The startup time is almost negligible, as Bottlerocket (Amazon's container-optimized OS) instances boot within seconds.