Many organizations running Amazon ECS services on EC2 experience predictable traffic patterns — for example, higher demand during business hours and lower utilization during off-hours. Efficient scaling in such environments requires not only the ability to handle load spikes proactively but also to scale down during idle periods to optimize costs.
This post describes an approach to implementing predictive scaling for ECS services using custom CloudWatch metrics and custom termination policies. It also addresses several implementation challenges and outlines solutions that ensure scaling decisions remain accurate, proactive, and non-disruptive.
Overview of Predictive Scaling
Unlike target tracking or step scaling policies that react to metric thresholds, predictive scaling in AWS Auto Scaling analyzes historical data to forecast future capacity requirements. This allows ECS clusters to anticipate incoming traffic and scale up resources before the load actually increases.
However, implementing predictive scaling for ECS services with EC2 launch type introduces a number of challenges:
- Default termination policies do not account for the number of ECS tasks running on each instance.
- Blue/green deployments fragment metric history for scaling.
- Publishing aggregated custom metrics can result in data gaps.
- Predictive scaling only supports proactive scale-up actions and cannot directly scale down tasks.
The following sections describe each challenge and its corresponding solution.
Challenge 1: ECS Termination Policies Do Not Consider Running Tasks
When the desired task count for an ECS service decreases, and the ECS cluster has more EC2 instances than required, the Auto Scaling Group (ASG) determines which instance to terminate based on its termination policy.
AWS’s default termination policies prioritize instance age, availability zone balance, or instance lifecycle state, but they do not consider the number of ECS tasks running on an instance. As a result, an instance hosting multiple active tasks may be terminated while another instance remains mostly idle. This behavior can lead to application downtime or uneven task distribution.
Solution: Custom Termination Policy Lambda
A custom AWS Lambda function can be used as a termination policy for the ASG.
The function is invoked whenever a scale-in event is initiated. It performs the following operations:
- Retrieves the list of EC2 instances registered to the ASG or ECS capacity provider.
- Uses the ECS API to obtain the number of running tasks per instance.
- Identifies the instance with the fewest running tasks.
- Returns the corresponding instance ID to the ASG for termination.
The Lambda function must respond within 2 seconds. Through optimization, the execution time can be consistently reduced to under 1 second, allowing the ASG to complete scale-in operations efficiently and without service disruption.
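For context, the sketch below shows the parts of the scale-in event that this selection logic relies on, together with the response shape the ASG expects. The field values are illustrative, and the real event also carries the candidate instances and additional metadata; the complete function appears in the implementation section.

```python
# Minimal sketch of the invocation contract used by the custom termination
# policy Lambda in this post. Values are illustrative only.
example_event = {
    "AutoScalingGroupName": "ecs-asg",
    "CapacityToTerminate": [
        {"AvailabilityZone": "us-east-1a", "Capacity": 1}
    ],
    # The full event also includes the candidate instances and other metadata.
}

# The function must return the IDs of the instances chosen for termination:
example_response = {"InstanceIDs": ["i-0123456789abcdef0"]}
```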
Challenge 2: Using Predictive Scaling with Blue/Green Deployments
Predictive scaling policies support only the following predefined metrics:
- AverageCPUUtilization
- AverageMemoryUtilization
- ALBRequestCountPerTarget
For most ECS workloads, ALBRequestCountPerTarget provides the most accurate representation of traffic. However, when ECS services use blue/green deployments, each service maintains two target groups, with only one active at a time.
This introduces two problems:
- Metric fragmentation: the request count metric's target group ARN changes whenever a deployment switches traffic, breaking the continuity of the metric data.
- Scaling policy maintenance overhead: each deployment requires a manual update to point the predictive scaling policy at the newly active target group.
Solution: Combined Request Count Metric
To address these issues, a custom metric named CombinedRequestCount is published in CloudWatch.
Using the AWS SDK, request counts from both target groups (blue and green) are aggregated into a single metric namespace. This enables predictive scaling to reference a consistent metric, preserving historical data across deployment switches.
Benefits include:
- Continuous metric history for accurate forecasting.
- Elimination of scaling policy reconfiguration between deployments.
- Simplified metric management through automation.
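The implementation steps later in this post reference the blue and green target groups by hard-coded ARNs. One way to resolve them dynamically, assuming the two target groups follow a predictable naming convention (the names below are hypothetical), is to look them up once and convert each ARN into the dimension value CloudWatch expects:

```python
import boto3

elbv2 = boto3.client('elbv2')

def resolve_target_group_dimensions(blue_name="my-service-blue",
                                     green_name="my-service-green"):
    """Return the CloudWatch 'TargetGroup' dimension values for both target groups."""
    response = elbv2.describe_target_groups(Names=[blue_name, green_name])
    dimensions = []
    for tg in response['TargetGroups']:
        # The dimension value is the trailing portion of the ARN:
        # "targetgroup/<name>/<id>"
        dimensions.append(tg['TargetGroupArn'].split(':')[-1])
    return dimensions
```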
Challenge 3: Publishing Custom Metrics Without Data Gaps
When publishing custom metrics such as CombinedRequestCount through a Lambda function, data gaps can occur due to the timing of CloudWatch metric availability.
For example, fetching request counts at half-minute intervals can sometimes return zero values if the metrics are not yet aggregated by CloudWatch at the time of retrieval. This can mislead the predictive scaling model and reduce forecast accuracy.
Solution: Delayed Metric Collection
Introducing a one-minute delay in the metric publishing Lambda ensures that CloudWatch has finalized data aggregation before the values are retrieved.
The Lambda execution sequence is as follows:
- Wait one minute before collecting data for the previous period.
- Retrieve request counts from both target groups.
- Aggregate and publish the results as CombinedRequestCount.
This approach ensures the continuity and reliability of metric data, which is critical for the predictive scaling algorithm.
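The sequence above implies that the publishing Lambda runs once per minute. One way to set that up, sketched here with an EventBridge schedule and a hypothetical function name (CombinedRequestCountPublisher), is:

```python
import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Hypothetical ARN of the metric-publishing Lambda
function_arn = "arn:aws:lambda:us-east-1:123456789012:function:CombinedRequestCountPublisher"

# Trigger the publisher once per minute
rule_arn = events.put_rule(
    Name="publish-combined-request-count",
    ScheduleExpression="rate(1 minute)",
    State="ENABLED"
)['RuleArn']

events.put_targets(
    Rule="publish-combined-request-count",
    Targets=[{"Id": "metric-publisher", "Arn": function_arn}]
)

# Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName="CombinedRequestCountPublisher",
    StatementId="allow-eventbridge-minutely",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn
)
```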
Challenge 4: Predictive Scaling Does Not Scale Down
While predictive scaling effectively forecasts and provisions additional capacity ahead of demand spikes, it does not support scaling down tasks. Predictive scaling only triggers scale-up actions, relying on historical growth trends rather than current utilization levels.
To achieve optimal resource usage and prevent over-provisioning, predictive scaling must be combined with a reactive scaling policy that can handle downscale events.
Solution: Hybrid Scaling with Target Tracking
A hybrid scaling strategy can be implemented by combining predictive scaling with ECS target tracking.
In this configuration:
- Predictive scaling proactively increases task counts before load spikes.
- Target tracking reduces task counts when utilization drops.
A second custom metric, RequestPerTaskUtilization, is introduced to support this model.
This metric calculates how many requests each ECS task is handling relative to a defined target threshold. The resulting utilization percentage is then published to CloudWatch and used as the input metric for the ECS target tracking policy.
For example:
RequestPerTaskUtilization = (TotalRequests / NumberOfTasks) / TargetRequestsPerTask * 100
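For instance, with 3,000 combined requests in a period, four running tasks, and a target of 1,000 requests per task (illustrative numbers), the published value is (3000 / 4) / 1000 * 100 = 75, so the target tracking policy sees 75% utilization and leaves the task count unchanged.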
With this configuration:
- Predictive scaling ensures readiness for anticipated traffic increases.
- Target tracking ensures cost efficiency during low-demand periods.
This dual-policy approach provides proactive and reactive scaling behaviors, delivering both performance stability and resource optimization.
Implementation
Architecture Diagram
Component Overview
| Component | Description | Purpose |
|---|---|---|
| ECS Cluster (EC2 Launch Type) | Runs containerized services across a fleet of EC2 instances managed by an Auto Scaling Group. | Hosts the workloads that require dynamic scaling. |
| Auto Scaling Group (ASG) | Manages EC2 instances for the ECS cluster. | Adjusts instance count based on predictive and target tracking policies. |
| Custom Termination Lambda | Invoked via an ASG lifecycle hook during scale-in events. | Determines which instance to terminate based on ECS task load. |
| Application Load Balancer (ALB) | Routes incoming traffic to ECS services. | Provides the RequestCountPerTarget metric used for scaling logic. |
| CombinedRequestCount Metric | Custom CloudWatch metric that aggregates requests across blue/green target groups. | Provides a consistent scaling metric unaffected by deployment switches. |
| RequestPerTaskUtilization Metric | Custom CloudWatch metric derived from CombinedRequestCount and ECS task count. | Provides utilization percentage for target tracking policy. |
| Predictive Scaling Policy | Uses historical CombinedRequestCount data. | Proactively scales up ECS tasks before forecasted traffic increases. |
| Target Tracking Scaling Policy | Uses RequestPerTaskUtilization. | Reactively scales down tasks during low load periods. |
Implementation Steps
Below are the main implementation steps with key AWS CLI commands and Lambda code references.
1. Create the Custom CombinedRequestCount Metric
This metric aggregates request counts from both target groups of your blue/green ECS service.
AWS CLI Example
```bash
aws cloudwatch put-metric-data \
  --namespace "ECS/CustomMetrics" \
  --metric-name "CombinedRequestCount" \
  --value 1500 \
  --unit "Count"
```
In production, this command is automated by a Lambda function that:
- Uses `get-metric-statistics` to fetch `RequestCount` for each target group.
- Sums the counts.
- Publishes the total as `CombinedRequestCount`.
Lambda Pseudocode Example
```python
import boto3, datetime

cw = boto3.client('cloudwatch')

def lambda_handler(event, context):
    # Collect data for the minute CloudWatch has already finalized
    # (one-minute delay, see Challenge 3).
    now = datetime.datetime.utcnow()
    start = now - datetime.timedelta(minutes=2)
    end = now - datetime.timedelta(minutes=1)

    # Note: the TargetGroup dimension value is the trailing portion of the
    # target group ARN ("targetgroup/<name>/<id>").
    target_groups = ["arn:aws:elasticloadbalancing:...:tg/blue",
                     "arn:aws:elasticloadbalancing:...:tg/green"]

    # Sum RequestCount across the blue and green target groups
    total_requests = 0
    for tg in target_groups:
        stats = cw.get_metric_statistics(
            Namespace='AWS/ApplicationELB',
            MetricName='RequestCount',
            Dimensions=[{'Name': 'TargetGroup', 'Value': tg}],
            StartTime=start,
            EndTime=end,
            Period=60,
            Statistics=['Sum']
        )
        if stats['Datapoints']:
            total_requests += stats['Datapoints'][0]['Sum']

    # Publish the aggregated value as the custom metric
    cw.put_metric_data(
        Namespace='ECS/CustomMetrics',
        MetricData=[{
            'MetricName': 'CombinedRequestCount',
            'Value': total_requests,
            'Unit': 'Count'
        }]
    )
```
2. Create the RequestPerTaskUtilization Metric
This metric calculates the number of requests handled per ECS task and expresses it as a utilization percentage.
AWS CLI Example
```bash
aws cloudwatch put-metric-data \
  --namespace "ECS/CustomMetrics" \
  --metric-name "RequestPerTaskUtilization" \
  --value 72.5 \
  --unit "Percent"
```
Lambda Logic Pseudocode (Derived from Combined Metric)
```python
import boto3

ecs = boto3.client('ecs')
cw = boto3.client('cloudwatch')

def lambda_handler(event, context):
    cluster = "your-ecs-cluster"
    service = "your-ecs-service"

    # Number of tasks currently running for the service
    response = ecs.describe_services(cluster=cluster, services=[service])
    task_count = response['services'][0]['runningCount']
    if task_count == 0:
        # Avoid division by zero while no tasks are running
        return

    # Example: use CombinedRequestCount from the previous step
    combined_requests = 1500
    target_requests_per_task = 1000

    # Requests handled per task, relative to the target, as a percentage
    utilization = (combined_requests / task_count) / target_requests_per_task * 100

    cw.put_metric_data(
        Namespace='ECS/CustomMetrics',
        MetricData=[{
            'MetricName': 'RequestPerTaskUtilization',
            'Value': utilization,
            'Unit': 'Percent'
        }]
    )
```
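The hardcoded combined_requests value above stands in for the CombinedRequestCount metric published in step 1. A sketch of that lookup, reusing the one-minute delay from Challenge 3, might look like this:

```python
import boto3, datetime

cw = boto3.client('cloudwatch')

def latest_combined_request_count():
    """Fetch the most recent finalized CombinedRequestCount datapoint."""
    now = datetime.datetime.utcnow()
    stats = cw.get_metric_statistics(
        Namespace='ECS/CustomMetrics',
        MetricName='CombinedRequestCount',
        StartTime=now - datetime.timedelta(minutes=2),
        EndTime=now - datetime.timedelta(minutes=1),
        Period=60,
        Statistics=['Sum']
    )
    datapoints = stats['Datapoints']
    return datapoints[0]['Sum'] if datapoints else 0
```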
3. Attach Scaling Policies
- Predictive Scaling (Proactive)
```bash
aws autoscaling put-scaling-policy \
  --policy-name "PredictiveScalingPolicy" \
  --auto-scaling-group-name "ecs-asg" \
  --policy-type "PredictiveScaling" \
  --predictive-scaling-configuration '{
    "MetricSpecifications": [{
      "TargetValue": 1000.0,
      "CustomizedCapacityMetricSpecification": {
        "MetricDataQueries": [{
          "Id": "combinedRequestMetric",
          "MetricStat": {
            "Metric": {
              "Namespace": "ECS/CustomMetrics",
              "MetricName": "CombinedRequestCount"
            },
            "Stat": "Sum"
          }
        }]
      }
    }]
  }'
```
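Two practical notes on this policy: predictive scaling needs roughly 24 hours of metric history before it produces forecasts, and it can be created in ForecastOnly mode to review forecasts before letting it change capacity. When custom metrics are used, the API generally expects a load metric and a scaling metric to accompany a capacity metric, so the configuration above may need additional MetricDataQueries entries depending on your setup. Once the policy exists, the forecast can be inspected with a sketch like the following (names as used above):

```python
import boto3, datetime

autoscaling = boto3.client('autoscaling')

# Inspect what the policy has forecast for the next 24 hours
now = datetime.datetime.utcnow()
forecast = autoscaling.get_predictive_scaling_forecast(
    AutoScalingGroupName="ecs-asg",
    PolicyName="PredictiveScalingPolicy",
    StartTime=now,
    EndTime=now + datetime.timedelta(hours=24)
)

# CapacityForecast holds the predicted capacity per timestamp;
# LoadForecast holds the predicted load metric values.
for ts, value in zip(forecast['CapacityForecast']['Timestamps'],
                     forecast['CapacityForecast']['Values']):
    print(ts, value)
```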
- Target Tracking (Reactive)
```bash
aws application-autoscaling put-scaling-policy \
  --policy-name "ECS-TargetTracking" \
  --service-namespace ecs \
  --resource-id service/your-cluster/your-service \
  --scalable-dimension ecs:service:DesiredCount \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 100.0,
    "CustomizedMetricSpecification": {
      "MetricName": "RequestPerTaskUtilization",
      "Namespace": "ECS/CustomMetrics",
      "Statistic": "Average"
    },
    "ScaleOutCooldown": 60,
    "ScaleInCooldown": 300
  }'
```
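This policy assumes the ECS service has already been registered as a scalable target with Application Auto Scaling; if it has not, a registration step along these lines (capacity bounds are illustrative) is needed before put-scaling-policy will succeed:

```python
import boto3

app_autoscaling = boto3.client('application-autoscaling')

# Register the ECS service as a scalable target (min/max task counts are examples)
app_autoscaling.register_scalable_target(
    ServiceNamespace='ecs',
    ResourceId='service/your-cluster/your-service',
    ScalableDimension='ecs:service:DesiredCount',
    MinCapacity=1,
    MaxCapacity=20
)
```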
4. Attach Custom Termination Policy Lambda
Create a Lambda function and register it as a custom termination policy for the Auto Scaling Group (ASG).
This function ensures that the instance with the fewest running ECS tasks is terminated first, minimizing disruption during scale-ins.
Lambda Pseudocode
```python
import boto3, heapq, re

asg = boto3.client('autoscaling')
ecs = boto3.client('ecs')

def lambda_handler(event, context):
    asg_name = event.get("AutoScalingGroupName")
    capacity_to_terminate = event.get("CapacityToTerminate", [])
    total_capacity = sum(cap["Capacity"] for cap in capacity_to_terminate)

    # Derive the ECS cluster name from the ASG naming convention
    env_match = re.search(r"^(external|internal)-([^-]+)-", asg_name)
    cluster_type, env = env_match.groups()
    cluster_name = f"{cluster_type}-{env}-ecs-cluster"

    # Consider only in-service instances that are not protected from scale-in
    instances = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )['AutoScalingGroups'][0]['Instances']
    eligible = [i for i in instances
                if i["LifecycleState"] == "InService"
                and not i.get("ProtectedFromScaleIn", False)]

    # Count running ECS tasks per container instance
    container_arns = ecs.list_container_instances(cluster=cluster_name)['containerInstanceArns']
    details = ecs.describe_container_instances(
        cluster=cluster_name, containerInstances=container_arns
    )['containerInstances']
    task_counts = {d['ec2InstanceId']: d['runningTasksCount'] for d in details}

    # Select the instances with the fewest running tasks
    selected = heapq.nsmallest(total_capacity, eligible,
                               key=lambda x: task_counts.get(x["InstanceId"], 0))
    return {"InstanceIDs": [i["InstanceId"] for i in selected]}
```
- Register as ASG Termination Policy
```bash
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name ecs-asg \
  --termination-policies "arn:aws:lambda:us-east-1:123456789012:function:CustomTerminationPolicyLambda"
```
This configuration ensures:
- The ASG invokes the Lambda on scale-in events.
- The instance(s) with the fewest ECS tasks are selected for termination.
- Scale-ins occur smoothly without disrupting active workloads.
