Practical ECS Configurations in Terraform That Make Your Life Easier

Getting started with ECS as a DevOps engineer is a learning journey in itself: a bit like assembling a complex puzzle, every tiny piece needs to click into place before things run smoothly.

Implementing it in Terraform as Infrastructure as Code adds another layer to the story.

Here are a few key considerations that can save your production deployments and help make your solution more robust and reliable.


Propagating Tags Automatically (and Correctly)

Tagging is essential for cost allocation, ownership, and governance, but some ECS resources don’t automatically inherit tags unless you explicitly configure them.

Why is that? When writing Infrastructure as Code, you typically define the Task Definition and the Service. The actual Tasks that run are deployed based on those definitions and inherit most of their configuration from them, but not tags. They appear in your account only when the infrastructure is deployed, which means you need a way to ensure consistent tagging across these ephemeral resources.

To propagate tags correctly in your ECS Service Terraform code, add these two lines:

enable_ecs_managed_tags = true
propagate_tags          = "SERVICE"
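
In context, these two settings sit directly on the aws_ecs_service resource. Here is a minimal sketch; the service name, cluster, and task definition references are hypothetical placeholders, and networking settings are omitted for brevity:

resource "aws_ecs_service" "app" {
  name            = "<service name>"
  cluster         = aws_ecs_cluster.main.id         # hypothetical cluster reference
  task_definition = aws_ecs_task_definition.app.arn # hypothetical task definition reference
  desired_count   = 2

  # Tag the tasks that ECS launches and copy the service's tags onto them
  enable_ecs_managed_tags = true
  propagate_tags          = "SERVICE"

  tags = {
    Project = "<project name>"
    Owner   = "<team name>"
  }
}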

Why this matters

  • Ensures tasks inherit tags from the ECS service – no more untagged tasks floating around.
  • Makes cost allocation and resource ownership clear – essential for reporting and chargebacks.
  • Reduces the risk of “untagged” resources appearing in billing or compliance reports.

This is especially important in larger organizations where tagging policies are enforced.

Reference: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-using-tags.html


Preventing ECS from Lowering Desired Count in Production

In production environments, Terraform shouldn’t override operational realities like autoscaling adjustments or manual scaling during incidents. Without proper handling, a terraform apply could unintentionally scale down your ECS service, causing downtime or degraded performance.

You can prevent this by telling Terraform to ignore changes to the desired_count in your ECS Service resource:

lifecycle {
  ignore_changes = [desired_count]
}

Why this matters

  • Prevents Terraform from scaling your service down unintentionally

  • Avoids surprises during terraform apply

  • Works nicely with Application Auto Scaling or manual scaling during incidents
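
If you pair this with Application Auto Scaling, the scaling side might look roughly like the sketch below (resource names, capacities, and the CPU target are hypothetical):

resource "aws_appautoscaling_target" "ecs_service" {
  min_capacity       = 2
  max_capacity       = 10
  resource_id        = "service/<cluster_name>/<service_name>"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "ecs_cpu" {
  name               = "<scaling policy name>"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.ecs_service.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_service.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_service.service_namespace

  target_tracking_scaling_policy_configuration {
    target_value = 60 # aim for ~60% average CPU across tasks
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}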


Always Create a Custom CloudWatch Log Group

ECS can automatically create CloudWatch log groups for your tasks, but doing so limits your control over important settings like log retention, naming conventions, and cost management. Defining log groups explicitly in Terraform is a best practice that ensures consistency and predictability.

resource "aws_cloudwatch_log_group" "<project_name>_ecs_log_group" {
  name = "/ecs/<log group name>"
  retention_in_days = 1 # current minimum 
}
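
To make sure your containers actually write to this group, reference it from the log configuration inside the container definition of your task definition. A minimal sketch using the awslogs driver (region and stream prefix are placeholders):

# Inside the container definition (within the jsonencode(...) of aws_ecs_task_definition)
logConfiguration = {
  logDriver = "awslogs"
  options = {
    "awslogs-group"         = aws_cloudwatch_log_group.<project_name>_ecs_log_group.name
    "awslogs-region"        = "<aws region>"
    "awslogs-stream-prefix" = "ecs"
  }
}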

Why this matters

  • Controls log retention and avoids infinite log storage
  • Helps with cost optimization
  • Makes log group naming consistent and predictable

Enable ECS Exec for Easier Debugging

ECS Exec is a powerful feature that lets you connect directly into a running container from the AWS Console or CLI — no SSH or bastion host required. This is incredibly useful for troubleshooting production issues safely and quickly.

To enable ECS Exec in your service:

enable_execute_command = true

You also need the proper IAM permissions on the task's IAM role:

resource "aws_iam_policy" "<policy name>" {
  name        = <policy name>
  description = "Give SSM permissions to use ECS exec"

  policy = jsonencode({
    "Version" : "2012-10-17",
    "Statement" : [
      {
        "Effect" : "Allow",
        "Action" : [
          "ssmmessages:CreateControlChannel",
          "ssmmessages:CreateDataChannel",
          "ssmmessages:OpenControlChannel",
          "ssmmessages:OpenDataChannel"
        ],
        "Resource" : "*"
      }
    ]
  })
}
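
This policy must be attached to the task role (the role referenced as task_role_arn in your task definition), not the execution role. A minimal sketch, assuming hypothetical role and policy resource names:

resource "aws_iam_role_policy_attachment" "ecs_exec" {
  role       = aws_iam_role.<task role name>.name # the task role used by the task definition
  policy_arn = aws_iam_policy.<policy name>.arn
}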

How it works

To troubleshoot inside a container, open the running task in the AWS Console and scroll down to the Containers section.

Choose the application container and, in the upper left, click Connect.

A terminal view then opens in the Console with a suggested command to run, similar to the example below:

$ aws ecs execute-command --cluster <cluster_name> \
    --task <task_id> \
    --container <container_name> \
    --interactive --command '/bin/sh'

To check the healthCheck configuration, navigate to:

Task Definitions → choose your Task Definition → Choose a revision → Scroll down to Containers section → Monitoring and logging

Reference: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-exec.html

Why this matters

  • Secure access without opening SSH
  • Extremely helpful for production debugging

Once you use ECS Exec, it’s hard to go back.


Allow ECS to Temporarily Exceed the Desired Task Count During Deployments

By default, ECS can be very conservative during deployments. You can speed up rollouts by allowing temporary overprovisioning.

deployment_maximum_percent         = 200 # example value, explained below
deployment_minimum_healthy_percent = 50  # example value, explained below

How ECS Deployment Percentages Work

When ECS deploys a new version of your service, it doesn't replace all tasks at once. Instead, it gradually shifts traffic from the old tasks to the new ones to avoid downtime. Two key settings control this behaviour: deployment_maximum_percent and deployment_minimum_healthy_percent.

  • deployment_maximum_percent

This defines the maximum number of tasks ECS can run during a deployment, expressed as a percentage of the desired task count.

Example: if your service normally runs 4 tasks and deployment_maximum_percent = 200, ECS can temporarily run up to 8 tasks during the rollout.

This means ECS can start new tasks before stopping old ones, so your service remains available and the deployment completes faster.

  • deployment_minimum_healthy_percent

This defines the minimum number of tasks that must remain healthy during the deployment, also as a percentage of the desired task count.

Example: with 4 desired tasks and deployment_minimum_healthy_percent = 50, ECS ensures at least 2 tasks stay healthy at any time.

This means that even if new tasks fail to start or are unhealthy, ECS keeps enough old tasks running so your application continues serving traffic.
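
Putting the numbers from these examples into the service configuration, a sketch with 4 desired tasks could look like this (values are illustrative):

# With desired_count = 4, ECS may temporarily run up to 8 tasks (200%)
# and must keep at least 2 tasks healthy (50%) during a rollout.
desired_count                      = 4
deployment_maximum_percent         = 200
deployment_minimum_healthy_percent = 50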

Why this matters

  • Enables zero-downtime deployments
  • Allows ECS to start new tasks before stopping old ones

This is especially important for production workloads.


Add Container-Level Health Checks

ECS health checks go beyond simply verifying that a container is running — they allow the scheduler to understand whether your application is actually healthy and ready to serve traffic.

By configuring a health check in your task definition, ECS can automatically replace unhealthy containers, improving reliability and reducing downtime.

healthCheck = {
  command     = ["CMD-SHELL", "curl -f http://localhost/status || exit 1"]
  interval    = 10
  timeout     = 5
  retries     = 3
  startPeriod = 10
}

How it works

  • command – what ECS runs to check the container’s health. Here, it calls a local endpoint /status. If the command fails, the container is marked unhealthy.
  • interval – how often the health check runs (seconds).
  • timeout – how long to wait for a response before considering the check failed (seconds).
  • retries – number of consecutive failures before marking the container unhealthy.
  • startPeriod – initial grace period for a container to start before health checks begin (seconds).
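
For context, this healthCheck map lives inside a container definition of your aws_ecs_task_definition. A minimal sketch with placeholder names (Fargate sizing shown for illustration; IAM roles and logging omitted for brevity):

resource "aws_ecs_task_definition" "app" {
  family                   = "<task family>"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512

  container_definitions = jsonencode([
    {
      name      = "<container_name>"
      image     = "<image uri>"
      essential = true
      healthCheck = {
        command     = ["CMD-SHELL", "curl -f http://localhost/status || exit 1"]
        interval    = 10
        timeout     = 5
        retries     = 3
        startPeriod = 10
      }
    }
  ])
}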

Why this matters

  • ECS can automatically replace unhealthy containers
  • Improves reliability and resilience
  • Makes deployments safer by detecting bad releases early

Always implement application-level health checks rather than just process-level checks. A container might be running but the app inside could still be failing—health checks catch this early.


Final Thoughts

Individually, these configurations may seem minor—but together they make a big difference:

  • Improve operational stability
  • Reduce deployment risk
  • Increase observability and debugging ability
  • Align ECS services with production best practices

If you’re using ECS with Terraform and don’t have these yet, I highly recommend adding them to your baseline.

Let's build great things together! 🚀
