Matt

Posted on • Originally published at fortem.dev

ECS Environment Scheduling: The Complete Guide

How Do You Stop Paying for Idle ECS Environments?

Originally published at https://fortem.dev/blog/ecs-environment-scheduling
Stop paying for ECS dev and staging compute when nobody's using it. Every scheduling approach — AWS-native options, trade-offs, and what teams at fleet scale actually do.


Your dev and staging ECS environments run 168 hours a week. Your team works 40. The other 128 hours are pure waste. This guide covers every approach to scheduling ECS environments — from AWS-native options you can set up today to what actually works when you're managing 20+ environments across multiple accounts.

The math: what you're actually paying

A typical dev environment on ECS Fargate (8 services, 0.5 vCPU and 1 GB memory each) costs around $144/month running 24/7. That's $1,728/year for one environment that your developers use 50 hours a week at most.

| Schedule | Hours/week | Hours/month | Monthly cost |
|---|---|---|---|
| 24/7 (current) | 168 | 730 | $144 |
| Mon–Fri 9am–7pm | 50 | 217 | $43 |
| Mon–Fri 8am–8pm | 60 | 260 | $51 |
| Mon–Sun 8am–10pm | 98 | 425 | $84 |

Switching one 8-service environment from 24/7 to business hours saves $101/month. At 10 environments that's $1,010/month — $12,120/year — without changing a single line of application code.
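The figures in the table can be reproduced with a short calculation. This is a rough sketch, assuming Fargate on-demand rates of $0.04048 per vCPU-hour and $0.004445 per GB-hour (Linux/x86, us-east-1). Those rates are an assumption, not something the article states, so check current pricing for your Region. The function names are illustrative.

```python
# Assumed Fargate on-demand rates (Linux/x86, us-east-1); verify for your Region.
VCPU_HOUR_USD = 0.04048
GB_HOUR_USD = 0.004445


def monthly_cost(hours, services=8, vcpu=0.5, memory_gb=1.0):
    """Fargate compute cost in USD for one environment running `hours` per month."""
    per_task_hour = vcpu * VCPU_HOUR_USD + memory_gb * GB_HOUR_USD
    return services * per_task_hour * hours


def hours_per_month(hours_per_week):
    """Average running hours per month for a weekly schedule."""
    return hours_per_week * 52 / 12


if __name__ == "__main__":
    always_on = monthly_cost(730)
    schedules = [("Mon-Fri 9am-7pm", 50), ("Mon-Fri 8am-8pm", 60), ("Mon-Sun 8am-10pm", 98)]
    for label, weekly in schedules:
        cost = monthly_cost(hours_per_month(weekly))
        print(f"{label}: ${cost:.0f}/month, saves ${always_on - cost:.0f}/month")
```

The sketch covers Fargate compute only. Load balancers, NAT gateways, and storage keep billing while tasks are stopped.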

How ECS scheduling works

ECS doesn't have a native "scheduled environment" concept. What you're actually doing is setting the desired count of each ECS service to 0 on a schedule (stop) and back to its normal value on another schedule (start).

When desired count hits 0, ECS drains existing tasks and stops billing for vCPU and memory. Your service definition, load balancer, security groups, and networking remain intact. The environment is "off", not deleted. Starting it means setting desired count back to 1 (or whatever your normal value is).

Key principle — You pay for running tasks, not for service definitions. Desired count = 0 means no tasks running means no Fargate billing. The service configuration costs nothing — only the compute does.

Ready to use — copy this today

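This EventBridge + Lambda setup stops and starts an ECS service on a schedule. Set the CLUSTER_NAME and SERVICE_NAME environment variables on the function (plus NORMAL_COUNT if your normal task count is not 2) and deploy it in the same region as the cluster. It works today with zero additional tools.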


import os

import boto3

ecs = boto3.client("ecs")
CLUSTER = os.environ["CLUSTER_NAME"]
SERVICE = os.environ["SERVICE_NAME"]

def set_desired_count(count: int):
    ecs.update_service(
        cluster=CLUSTER,
        service=SERVICE,
        desiredCount=count,
    )
    print(f"Set {SERVICE} desired count to {count}")

def handler(event, context):
    action = event.get("action", "stop")  # "stop" or "start"
    count = 0 if action == "stop" else int(os.environ.get("NORMAL_COUNT", "2"))
    set_desired_count(count)

Deploy this as a Lambda, then create two EventBridge Scheduler schedules: one that invokes it with { "action": "stop" } on weekdays at 7PM, another with { "action": "start" } at 9AM Mon–Fri. Total cost: zero beyond the Lambda invocations.
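The two schedules can also be created from Python. The following is a minimal sketch, assuming the boto3 `scheduler` client and its `create_schedule` call. The schedule names, the Lambda ARN, and the IAM role ARN (the role EventBridge Scheduler assumes to invoke the function) are hypothetical placeholders.

```python
import json


def build_schedule(name, action, cron, lambda_arn, role_arn, timezone="UTC"):
    """Build keyword arguments for EventBridge Scheduler's create_schedule."""
    return {
        "Name": name,
        "ScheduleExpression": cron,
        "ScheduleExpressionTimezone": timezone,
        "FlexibleTimeWindow": {"Mode": "OFF"},
        "Target": {
            "Arn": lambda_arn,
            "RoleArn": role_arn,  # role Scheduler assumes to invoke the Lambda
            "Input": json.dumps({"action": action}),
        },
    }


def weekday_schedules(lambda_arn, role_arn, timezone="UTC"):
    """Start at 9AM and stop at 7PM, Monday to Friday (hypothetical names)."""
    return [
        build_schedule("ecs-dev-start", "start", "cron(0 9 ? * MON-FRI *)",
                       lambda_arn, role_arn, timezone),
        build_schedule("ecs-dev-stop", "stop", "cron(0 19 ? * MON-FRI *)",
                       lambda_arn, role_arn, timezone),
    ]


def create_schedules(scheduler_client, schedules):
    """Create each schedule using a client such as boto3.client("scheduler")."""
    for kwargs in schedules:
        scheduler_client.create_schedule(**kwargs)
```

With boto3 installed and credentials configured, `create_schedules(boto3.client("scheduler"), weekday_schedules(LAMBDA_ARN, ROLE_ARN))` would register both. Keeping the builders separate from the API call makes the payloads easy to inspect before anything is created.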

Option 1: Application Auto Scaling scheduled actions

Best for: 1–3 environments, simple schedules

Application Auto Scaling supports scheduled scaling actions on ECS services. You define a cron expression and a min/max capacity, and Application Auto Scaling moves the desired count into that range. AWS handles the rest: no Lambda, no EventBridge rules to manage.

Register your ECS service as a scalable target, then create two scheduled actions: one to stop (min = max = 0) and one to start (min = your normal count):

# Register the service as a scalable target
aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --resource-id service/my-cluster/my-service \
  --scalable-dimension ecs:service:DesiredCount \
  --min-capacity 0 \
  --max-capacity 3

# Stop at 7pm UTC (Mon–Fri)
aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id service/my-cluster/my-service \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name stop-evenings \
  --schedule "cron(0 19 ? * MON-FRI *)" \
  --scalable-target-action MinCapacity=0,MaxCapacity=0

# Start at 8am UTC (Mon–Fri)
aws application-autoscaling put-scheduled-action \
  --service-namespace ecs \
  --resource-id service/my-cluster/my-service \
  --scalable-dimension ecs:service:DesiredCount \
  --scheduled-action-name start-mornings \
  --schedule "cron(0 8 ? * MON-FRI *)" \
  --scalable-target-action MinCapacity=1,MaxCapacity=3

Limitations

  • One command per service: 8 services × 2 actions = 16 CLI calls per environment
  • No concept of "environment"; you schedule individual services
  • Schedule changes require updating each service individually
  • No visibility into scheduled state across services or environments
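To make the per-service overhead concrete, here is a sketch that generates the scheduled actions for every service in one environment. It assumes the boto3 `application-autoscaling` client and its `put_scheduled_action` call, and that each service has already been registered as a scalable target. The service names and helper names are hypothetical.

```python
def scheduled_actions(cluster, services, normal_count=1, max_capacity=3):
    """Return put_scheduled_action arguments: one stop and one start per service."""
    actions = []
    for service in services:
        target = {
            "ServiceNamespace": "ecs",
            "ResourceId": f"service/{cluster}/{service}",
            "ScalableDimension": "ecs:service:DesiredCount",
        }
        actions.append({
            **target,
            "ScheduledActionName": "stop-evenings",
            "Schedule": "cron(0 19 ? * MON-FRI *)",
            "ScalableTargetAction": {"MinCapacity": 0, "MaxCapacity": 0},
        })
        actions.append({
            **target,
            "ScheduledActionName": "start-mornings",
            "Schedule": "cron(0 8 ? * MON-FRI *)",
            "ScalableTargetAction": {
                "MinCapacity": normal_count,
                "MaxCapacity": max_capacity,
            },
        })
    return actions


def apply_actions(autoscaling_client, actions):
    """Apply actions using a client such as boto3.client("application-autoscaling")."""
    for kwargs in actions:
        autoscaling_client.put_scheduled_action(**kwargs)
```

For an 8-service environment, `scheduled_actions` yields 16 entries, which matches the call count in the first limitation. A loop removes the typing but not the underlying per-service model.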

Option 2: EventBridge Scheduler + Lambda

Best for: multiple environments, custom logic, per-timezone schedules

EventBridge Scheduler triggers a Lambda function on a cron schedule. The Lambda iterates over all services in an environment (identified by a tag) and sets their desired count. This is the most flexible AWS-native approach — you can handle timezones, environment grouping, and custom logic.

The Lambda function itself is straightforward — iterate over tagged services and update desired count:


import boto3

ecs = boto3.client('ecs')

def handler(event, context):
    desired_count = event['desired_count']  # 0 to stop; any other value restores the saved count
    cluster = event['cluster']
    env_tag = event['environment']          # e.g. "staging"

    # List all services in the cluster
    paginator = ecs.get_paginator('list_services')
    for page in paginator.paginate(cluster=cluster):
        for arn in page['serviceArns']:
            # Describe to get tags
            svc = ecs.describe_services(
                cluster=cluster,
                services=[arn],
                include=['TAGS']
            )['services'][0]

            tags = {t['key']: t['value'] for t in svc.get('tags', [])}

            if tags.get('Environment') == env_tag:
                current = svc['desiredCount']
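                if desired_count == 0:
                    # Store the current count before stopping. Skip services
                    # already at 0 so a repeated stop does not overwrite the
                    # saved count with 0.
                    if current > 0:
                        ecs.tag_resource(
                            resourceArn=arn,
                            tags=[{'key': 'ScheduledDesiredCount',
                                   'value': str(current)}]
                        )
                    ecs.update_service(
                        cluster=cluster,
                        service=arn,
                        desiredCount=0
                    )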
                else:
                    # Restore previous count
                    restore = int(tags.get('ScheduledDesiredCount', '1'))
                    ecs.update_service(
                        cluster=cluster,
                        service=arn,
                        desiredCount=restore
                    )

Then create two EventBridge Scheduler schedules, one for stop and one for start, each passing cluster, environment, and the appropriate desired_count in the input.

What this doesn't solve

  • No UI: schedule changes require code or CLI changes
  • Per-timezone logic gets complex fast (US-east vs EU-west teams)
  • Error handling and alerting on failed starts is your problem
  • At 10+ environments, you're maintaining a scheduling system, not using one

Option 3: Terraform-managed schedules

Best for: teams with strong Terraform discipline and few environments

You can manage scheduled scaling actions directly in Terraform using the aws_appautoscaling_scheduled_action resource. This keeps scheduling configuration version-controlled alongside your infrastructure.

resource "aws_appautoscaling_target" "ecs_target" {
  service_namespace  = "ecs"
  resource_id        = "service/${var.cluster_name}/${var.service_name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 0
  max_capacity       = var.max_capacity
}

resource "aws_appautoscaling_scheduled_action" "stop" {
  name               = "${var.service_name}-stop"
  service_namespace  = "ecs"
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  schedule           = "cron(0 19 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = 0
    max_capacity = 0
  }
}

resource "aws_appautoscaling_scheduled_action" "start" {
  name               = "${var.service_name}-start"
  service_namespace  = "ecs"
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  schedule           = "cron(0 8 ? * MON-FRI *)"

  scalable_target_action {
    min_capacity = var.desired_count
    max_capacity = var.max_capacity
  }
}

Clean and auditable — but it still operates at the service level. Changing a schedule for an environment with 8 services means updating 8 Terraform resources and running apply. For teams where schedules change rarely, this is fine. For teams where developers want to adjust their own environment hours, it becomes a bottleneck.

What breaks at fleet scale

Every approach above works at 1–3 environments. Here's what teams discover when they try to scale it to 15–50 environments across multiple AWS accounts:

Per-service configuration doesn't scale

At 20 environments × 8 services, you have 160 individual Auto Scaling targets to manage. A schedule change for one environment touches 8 resources. A timezone change for one team requires finding and updating those 8 resources across potentially multiple accounts.

No environment-level visibility

None of the AWS-native approaches give you a view of 'which environments are running, which are scheduled, and what their current cost is.' You're looking at individual services in CloudWatch and Cost Explorer, not environments as units.

Timezone complexity multiplies

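EU teams want environments to stop at 18:00 CET. US East teams want 19:00 EST. US West teams want 19:00 PST. Each requires its own schedule, and each schedule has to account for DST: a cron expression evaluated in UTC drifts by an hour when the clocks change, while a per-schedule time zone (which both EventBridge Scheduler and Application Auto Scaling scheduled actions accept) has to be set and kept correct on every schedule. A single Lambda managing this across 20 environments becomes a meaningful maintenance burden.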
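The DST problem is easy to see with the standard library. This sketch converts a local wall-clock stop time to UTC on a winter date and a summer date. The helper name is illustrative, and the IANA zone names are stand-ins for the three teams above.

```python
from datetime import date, datetime, timezone
from zoneinfo import ZoneInfo


def stop_time_utc(day, hour, tz_name):
    """Return the UTC datetime for a local wall-clock hour on a given date."""
    local = datetime(day.year, day.month, day.day, hour, tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)


if __name__ == "__main__":
    teams = [("Europe/Berlin", 18), ("America/New_York", 19), ("America/Los_Angeles", 19)]
    for tz_name, hour in teams:
        winter = stop_time_utc(date(2024, 1, 15), hour, tz_name)
        summer = stop_time_utc(date(2024, 7, 15), hour, tz_name)
        print(f"{tz_name}: {winter:%H:%M} UTC in January, {summer:%H:%M} UTC in July")
```

Every zone shifts by one hour between the two dates, so a single UTC cron expression per team is wrong for part of the year unless something recomputes it or the schedule carries its own time zone.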

Developer self-service breaks down

Developers want to override their environment schedule occasionally — stay late on a sprint, work a weekend. In every AWS-native approach, that override requires console access or a platform engineer intervention. The friction is high enough that teams just leave environments running 24/7 to avoid the hassle.

Failed starts are silent

If an ECS service fails to start after a scheduled start (image pull error, IAM issue, resource limits), the EventBridge rule fires, Lambda runs, desired count updates — but nobody knows the environment didn't come up. You need separate health checking and alerting to catch this.

The pattern we see

Teams start with EventBridge + Lambda at 3 environments. By 10 environments they're spending 2–4 hours a month maintaining the scheduling system. By 20 environments they've either given up and gone back to 24/7, or a platform engineer owns a growing codebase that does nothing except stop and start ECS services on a schedule.

What to track

Regardless of which approach you use, these are the metrics worth monitoring:

Baseline vs. actual spend per environment

Tag all ECS services with Environment, activate it as a cost allocation tag, and group by that tag in Cost Explorer. Baseline = what you'd pay at 24/7. Actual = what you paid. The delta is your scheduling savings.
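The baseline-versus-actual comparison can be scripted against the Cost Explorer API. This is a sketch under assumptions: it parses the response shape of `get_cost_and_usage` grouped by the `Environment` tag, where group keys arrive as `Environment$<value>`. The baseline figures are whatever you computed for 24/7 operation, and the function names are illustrative.

```python
def cost_by_environment(response, tag_key="Environment"):
    """Sum UnblendedCost per tag value from a get_cost_and_usage response."""
    totals = {}
    prefix = f"{tag_key}$"
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            key = group["Keys"][0]
            env = key[len(prefix):] if key.startswith(prefix) else key
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[env] = totals.get(env, 0.0) + amount
    return totals


def scheduling_savings(baseline, actual):
    """Baseline minus actual spend for each environment in the baseline."""
    return {env: baseline[env] - actual.get(env, 0.0) for env in baseline}


# Example call (requires boto3 and credentials):
# response = boto3.client("ce").get_cost_and_usage(
#     TimePeriod={"Start": "2024-05-01", "End": "2024-06-01"},
#     Granularity="MONTHLY",
#     Metrics=["UnblendedCost"],
#     GroupBy=[{"Type": "TAG", "Key": "Environment"}],
# )
```

Untagged spend shows up under an empty environment name, which is a quick way to spot services that are missing the tag.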

Schedule adherence

CloudWatch metric: ECS service desired count (DesiredTaskCount, published when Container Insights is enabled). If an environment should be at 0 from 19:00–08:00 but the desired count is 1, your schedule isn't firing. Set an alarm on a non-zero desired count during expected off-hours.
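An adherence check can also run as a small script. This sketch takes the `services` list returned by the ECS `describe_services` call (each entry carries `serviceName` and `desiredCount`) and flags anything still scaled up outside working hours. The Mon–Fri 08:00–19:00 window and the function names are assumptions, and `now` is expected in the team's local time.

```python
from datetime import datetime


def in_off_hours(now, start_hour=8, stop_hour=19):
    """True on weekends and outside start_hour..stop_hour on weekdays."""
    if now.weekday() >= 5:  # Saturday or Sunday
        return True
    return not (start_hour <= now.hour < stop_hour)


def adherence_violations(services, now):
    """Names of services with a non-zero desiredCount during off-hours."""
    if not in_off_hours(now):
        return []
    return [s["serviceName"] for s in services if s["desiredCount"] > 0]
```

Running this a few minutes after the scheduled stop gives a direct yes-or-no answer on whether the schedule fired.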

Start latency

Time from scheduled start to all services healthy. ECS RunningTaskCount = DesiredCount AND target group healthy host count = DesiredCount. Anything over 3 minutes warrants investigation.

Failed starts

ECS StoppedTaskCount increasing after a scheduled start usually means image pull errors or resource exhaustion. CloudWatch alarm on StoppedTaskCount > 0 for environments in scheduled-start window.
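Start latency and failed starts can both be caught by polling after the scheduled start. The sketch below compares `runningCount` with `desiredCount` from the ECS `describe_services` response and gives up after 3 minutes, the threshold mentioned above. `fetch_services` is a hypothetical callable you supply that returns the current `services` list. Load balancer target health is not checked here.

```python
import time


def services_not_up(services):
    """Return (name, running, desired) for services below their desired count."""
    return [
        (s["serviceName"], s["runningCount"], s["desiredCount"])
        for s in services
        if s["runningCount"] < s["desiredCount"]
    ]


def wait_for_start(fetch_services, timeout_s=180, poll_s=15,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until every service reaches its desired count or the timeout expires.

    Returns (elapsed_seconds, lagging); lagging is empty on success.
    """
    started = clock()
    while True:
        lagging = services_not_up(fetch_services())
        elapsed = clock() - started
        if not lagging or elapsed >= timeout_s:
            return elapsed, lagging
        sleep(poll_s)
```

A non-empty `lagging` list after the timeout is the signal to alert on. The elapsed time on success is the start latency for that environment.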


See your scheduling savings: fortem.dev/ecs-cost-calculator
