<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: chgerkens</title>
    <description>The latest articles on DEV Community by chgerkens (@chgerkens).</description>
    <link>https://dev.to/chgerkens</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F559174%2Fd9c9ce9b-8e0b-4134-a000-0e38c286b55a.jpeg</url>
      <title>DEV Community: chgerkens</title>
      <link>https://dev.to/chgerkens</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/chgerkens"/>
    <language>en</language>
    <item>
      <title>Automatic deletion of unused AWS ECR container images for multi-account AWS ECS services</title>
      <dc:creator>chgerkens</dc:creator>
      <pubDate>Mon, 20 Jan 2025 09:00:00 +0000</pubDate>
      <link>https://dev.to/chgerkens/automatic-deletion-of-unused-aws-ecr-container-images-for-multi-account-aws-ecs-services-4h14</link>
      <guid>https://dev.to/chgerkens/automatic-deletion-of-unused-aws-ecr-container-images-for-multi-account-aws-ecs-services-4h14</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2m8rg2fxs56cpw3e1ku.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2m8rg2fxs56cpw3e1ku.jpg" alt="Image description" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AWS Elastic Container Service (ECS) is often used as a platform for microservices on AWS. In most cases, container images are stored in AWS Elastic Container Registry (ECR). Over time, however, unused images can accumulate, especially with continuous integration and continuous delivery approaches, which wastes storage space and drives up costs. ECR vulnerability scanning also produces many irrelevant findings, as they relate to images that are no longer in use.&lt;/p&gt;

&lt;p&gt;This article shows how you can automatically clean up your central ECR repositories by removing old images that are no longer used by AWS ECS services. The presented approach is designed for a multi-account setup, in which the production environment runs in a different AWS account than UAT/Dev, and can be used with "tag immutability".&lt;/p&gt;

&lt;h2&gt;Why remove old container images?&lt;/h2&gt;

&lt;p&gt;Every image stored in an ECR repository incurs costs: for storage, and for rescanning by Amazon Inspector if Enhanced Scanning is enabled. Removing unused images helps to avoid unnecessary expenses.&lt;/p&gt;

&lt;p&gt;The accumulation of unused images also makes it difficult to identify security vulnerabilities of production services, as vulnerability scan reports cover all images.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/LifecyclePolicies.html" rel="noopener noreferrer"&gt;ECR lifecycle policies&lt;/a&gt; are often used to clean up ECR repositories automatically. You can configure rules based on image tag patterns and image push dates to limit the number of images in a repository. However, lifecycle policy rules cannot check when an image was last used, e.g. actively used in an ECS task.&lt;/p&gt;

&lt;p&gt;Tags such as "production", set during deployment, can be a way to delete unused images via lifecycle policies. However, this is not possible if "tag immutability" is enabled for a repository, and it complicates the deployment process.&lt;br&gt;
Automation scripts such as &lt;a href="https://github.com/awslabs/ecr-cleanup-lambda" rel="noopener noreferrer"&gt;awslabs/ecr-cleanup-lambda&lt;/a&gt; go one step further and take actual image usage into account by analyzing the running ECS tasks.&lt;/p&gt;

&lt;h2&gt;An Automated Multi-Account ECR Cleanup Approach&lt;/h2&gt;

&lt;p&gt;Before we outline the solution approach, let's summarize the requirements and restrictions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ECR container images that have not been used in the last 24 hours should be deleted automatically ("used" means: referenced in an active ECS task definition). The time constraint allows rolling back to previously deployed versions within 24 hours.&lt;/li&gt;
&lt;li&gt;ECR container images can run in multiple ECS services in different AWS accounts (e.g. Production, UAT/Dev).&lt;/li&gt;
&lt;li&gt;ECR repositories are located in a shared AWS account.&lt;/li&gt;
&lt;li&gt;ECR repositories might use "tag immutability" (a tag can only be used once).&lt;/li&gt;
&lt;li&gt;ECR repositories should be explicitly opted in to "automated removal" to prevent accidental loss of images, for example when a repository is also used in Kubernetes or Lambda deployments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The fact that an ECS task can run in different AWS accounts makes it difficult to query active ECS task definitions. We can use &lt;a href="https://docs.aws.amazon.com/config/latest/developerguide/aggregate-data.html" rel="noopener noreferrer"&gt;AWS Config Aggregator&lt;/a&gt; to build a central, searchable resource inventory that includes ECS task definitions of multiple connected AWS Accounts (1).&lt;/p&gt;
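&lt;p&gt;Such a lookup can be sketched with boto3’s &lt;code&gt;select_aggregate_resource_config&lt;/code&gt; API. The aggregator name is a placeholder, and the dotted filter on &lt;code&gt;containerDefinitions.image&lt;/code&gt; is an assumption you should verify against your recorded configuration items:&lt;/p&gt;

```python
import json

def build_active_taskdef_query(image_uri):
    """AWS Config advanced query: find ACTIVE ECS task definitions that
    reference the given container image across all aggregated accounts.
    NOTE: dotted access into the containerDefinitions array is an
    assumption; verify it against your aggregator's recorded schema."""
    return (
        "SELECT resourceId, accountId, awsRegion "
        "WHERE resourceType = 'AWS::ECS::TaskDefinition' "
        "AND configuration.status = 'ACTIVE' "
        f"AND configuration.containerDefinitions.image = '{image_uri}'"
    )

def find_active_task_definitions(aggregator_name, image_uri):
    """Run the query against a Config aggregator (pagination omitted)."""
    import boto3  # deferred so the sketch stays importable without AWS access
    config = boto3.client("config")
    resp = config.select_aggregate_resource_config(
        ConfigurationAggregatorName=aggregator_name,
        Expression=build_active_taskdef_query(image_uri),
    )
    # Results are returned as a list of JSON strings
    return [json.loads(r) for r in resp["Results"]]
```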

&lt;p&gt;Since only container images that have not been part of an active ECS task definition in the last 24 hours should be deleted, we cannot simply query active task definitions and delete all other images. Unfortunately, AWS Config does not allow us to run queries "back in time", only against the current state. To respect the time constraint, whenever a task definition becomes inactive (2), we (re)schedule an "image deletion check" in 24 hours for each container image used in the inactivated task definition (3).&lt;/p&gt;
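&lt;p&gt;The (re)scheduling step can be sketched with Amazon EventBridge Scheduler one-time schedules. The schedule name is derived from the image URI so that re-creating it effectively reschedules the check; the target and role ARNs are placeholders:&lt;/p&gt;

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone

def deletion_check_schedule(image_uri, delay_hours=24, now=None):
    """Build name and one-time schedule expression for an 'image deletion
    check' running delay_hours from now."""
    now = now or datetime.now(timezone.utc)
    run_at = now + timedelta(hours=delay_hours)
    # Schedule names must be unique and at most 64 chars; hash the image URI
    name = "ecr-cleanup-" + hashlib.sha256(image_uri.encode()).hexdigest()[:16]
    expression = "at({})".format(run_at.strftime("%Y-%m-%dT%H:%M:%S"))
    return name, expression

def schedule_deletion_check(image_uri, target_arn, role_arn):
    """Create a one-time schedule invoking the check target (sketch)."""
    import boto3  # deferred so the sketch stays importable without AWS access
    name, expression = deletion_check_schedule(image_uri)
    boto3.client("scheduler").create_schedule(
        Name=name,
        ScheduleExpression=expression,
        FlexibleTimeWindow={"Mode": "OFF"},
        Target={"Arn": target_arn, "RoleArn": role_arn,
                "Input": json.dumps({"image": image_uri})},
        ActionAfterCompletion="DELETE",  # auto-remove one-time schedules
    )
```

&lt;p&gt;&lt;code&gt;create_schedule&lt;/code&gt; fails if a schedule with the same name already exists; in that case, fall back to &lt;code&gt;update_schedule&lt;/code&gt; to push the check out another 24 hours.&lt;/p&gt;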

&lt;p&gt;When an "image deletion check" is triggered 24 hours later, a query to the AWS Config Aggregator is executed that looks for active task definitions containing the specific image (4). If no active task definitions are found, the image can be removed from the repository (5).&lt;/p&gt;

&lt;p&gt;To ensure that only opt-in repositories are automatically cleaned, we can check for the existence of a repository feature tag when scheduling or executing "image deletion checks".&lt;/p&gt;
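&lt;p&gt;The opt-in check could look like the following; the tag key &lt;code&gt;AutomatedCleanup&lt;/code&gt; is a hypothetical name, not part of the original setup:&lt;/p&gt;

```python
def has_cleanup_tag(tags, key="AutomatedCleanup", value="enabled"):
    """ECR returns resource tags as a list of {'Key': ..., 'Value': ...}
    dicts; check whether the opt-in feature tag is present."""
    return any(t["Key"] == key and t["Value"].lower() == value for t in tags)

def cleanup_enabled(repository_arn):
    """Look up the repository's tags and evaluate the opt-in flag (sketch)."""
    import boto3  # deferred so the sketch stays importable without AWS access
    tags = boto3.client("ecr").list_tags_for_resource(
        resourceArn=repository_arn)["tags"]
    return has_cleanup_tag(tags)
```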

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9biwvjihkyqvtg0av23k.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9biwvjihkyqvtg0av23k.gif" alt="AWS Diagram" width="882" height="599"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Managing container images in AWS Elastic Container Registry (ECR) for multi-account AWS Elastic Container Service (ECS) environments can be challenging, particularly when ensuring cost-efficiency and maintaining security through vulnerability scanning. This article has outlined an automated approach to clean up unused ECR container images by leveraging AWS Config Aggregator, EventBridge Rules and EventBridge Scheduler.&lt;/p&gt;

&lt;p&gt;By implementing this solution, you can ensure that only container images actively used in ECS services remain in your repositories. The approach respects the complexities of multi-account setups and considers essential constraints, such as tag immutability and rollback windows. This helps reduce storage costs, improves vulnerability scanning accuracy, and simplifies repository management.&lt;/p&gt;

&lt;p&gt;Adopting this automated cleanup process is a proactive step towards optimizing your AWS infrastructure, enhancing security posture, and fostering better resource management across accounts.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>ecr</category>
    </item>
    <item>
      <title>Circuit Breaker Solution for AWS Lambda Functions</title>
      <dc:creator>chgerkens</dc:creator>
      <pubDate>Wed, 13 Jan 2021 15:30:47 +0000</pubDate>
      <link>https://dev.to/chgerkens/circuit-breaker-solution-for-aws-lambda-functions-1jm1</link>
      <guid>https://dev.to/chgerkens/circuit-breaker-solution-for-aws-lambda-functions-1jm1</guid>
      <description>&lt;p&gt;CloudWatch Metrics and Alarms can be used to add circuit breaker functionality to AWS Lambda functions that are triggered by SQS messages in a non-intrusive and cost-effective way.&lt;br&gt;
You can protect overwhelmed downstream services without the need to make code changes, replay messages from dead letter queues or increase operating costs significantly.&lt;/p&gt;

&lt;h1&gt;Introduction&lt;/h1&gt;

&lt;p&gt;A Serverless architecture frees you from the responsibility of ensuring that your application scales rapidly with increasing demand and stays available even when underlying infrastructure components fail. But as soon as your application calls external APIs — either third-party services hosted somewhere or managed (non-serverless) AWS services — the ideal world starts to crumble. You are confronted with increasing latency, long-running calls and rising error rates.&lt;/p&gt;

&lt;p&gt;A couple of well-known stability patterns exist, such as &lt;em&gt;Timeouts&lt;/em&gt;, &lt;em&gt;Bulkheads&lt;/em&gt;, &lt;em&gt;Decoupling Middleware&lt;/em&gt; and &lt;em&gt;Circuit Breaker&lt;/em&gt; (published in Michael T. Nygard’s book &lt;a href="https://www.oreilly.com/library/view/release-it/9781680500264/" rel="noopener noreferrer"&gt;Release It!&lt;/a&gt;). In the context of Serverless on AWS, you can configure timeouts for Lambda functions and decouple your application from external APIs by putting a message queue like SQS in front of your single-purpose Lambda functions (Bulkheads). But there is currently no straightforward way to apply a circuit breaker to AWS Lambda functions. If your message-processing Lambda functions start to fail repeatedly due to an incident in the downstream service, AWS Lambda will keep retrying to deliver SQS messages to your function (respecting an optionally configured dead-letter queue and maximum receive count). Your function might even get more load, since AWS Lambda scales concurrent invocations based on available messages. If you restrict the function concurrency, AWS Lambda might &lt;a href="https://medium.com/@zaccharles/lambda-concurrency-limits-and-sqs-triggers-dont-mix-well-sometimes-eb23d90122e0" rel="noopener noreferrer"&gt;throttle and fail&lt;/a&gt; to process messages.&lt;/p&gt;

&lt;h1&gt;Circuit Breaker Pattern&lt;/h1&gt;

&lt;p&gt;When a downstream service is in trouble, for instance due to very high load or failing underlying infrastructure components, the idea of the &lt;a href="https://martinfowler.com/bliki/CircuitBreaker.html" rel="noopener noreferrer"&gt;Circuit Breaker Pattern&lt;/a&gt; is to stop an upstream system from making further calls (open state). The downstream service gets a chance to recover, and the upstream system wastes neither time nor operating resources on calls that will probably fail anyway. After some time, the circuit breaker lets a small number of calls through to find out whether the downstream service is operating normally again (half-open state). If a threshold of successful calls is reached, the circuit breaker enables all calls to the downstream service again (closed state).&lt;/p&gt;

&lt;p&gt;Three key aspects are important to implement a circuit breaker:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Detect when a timeout or error threshold is exceeded&lt;/li&gt;
&lt;li&gt;Prevent calls to the downstream service for a certain time&lt;/li&gt;
&lt;li&gt;Allow some calls to pass periodically, to detect if the downstream service has recovered&lt;/li&gt;
&lt;/ol&gt;
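&lt;p&gt;To make the three aspects concrete, here is a minimal in-process sketch of the state machine; the thresholds and the injectable clock are illustration choices:&lt;/p&gt;

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: threshold detection, blocking calls while
    open, and periodic half-open probes after a recovery timeout."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self):
        # After the recovery timeout, let probe calls through (half-open)
        if self.state == "open" and self.clock() - self.opened_at >= self.recovery_timeout:
            self.state = "half-open"
            self.successes = 0
        return self.state != "open"

    def record_success(self):
        if self.state == "half-open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"   # downstream has recovered
                self.failures = 0
        else:
            self.failures = 0

    def record_failure(self):
        # Any failure while half-open, or too many while closed, opens the circuit
        if self.state == "half-open" or self.failures + 1 >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.failures = 0
        else:
            self.failures += 1
```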

&lt;h1&gt;Existing Approaches for AWS Lambda&lt;/h1&gt;

&lt;p&gt;A common approach is to implement a circuit breaker inside your function and use DynamoDB to store the circuit breaker state (like Gunnar Grosch’s &lt;a href="https://github.com/gunnargrosch/failure-lambda" rel="noopener noreferrer"&gt;failure-lambda Node.js implementation&lt;/a&gt; and the &lt;a href="https://www.jeremydaly.com/the-circuit-breaker/" rel="noopener noreferrer"&gt;AWS reference architecture pattern&lt;/a&gt; Jeremy Daly outlines). The Lambda function fails fast before calling the third-party API once a failure threshold has been exceeded. This protects the downstream service, but it does not stop AWS Lambda from polling the upstream queue and invoking your function. You also have to change your Lambda function code, specific to the particular Lambda runtime and programming language. The approach additionally introduces a number of DynamoDB requests, which can significantly increase costs.&lt;/p&gt;

&lt;h1&gt;A Solution based on CloudWatch Alarms and Event Source Mapping&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvoq5ovlvq29p2o68eavb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvoq5ovlvq29p2o68eavb.png" alt="AWS Architecture Diagram" width="763" height="659"&gt;&lt;/a&gt;&lt;br&gt;
This solution relies on CloudWatch metrics and alarms to detect message processing issues caused by the downstream service.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When the number of timeouts or errors exceeds a threshold, a CloudWatch alarm based on Lambda function metrics is triggered. To reduce false alarms, you should use a &lt;a href="https://www.amazonaws.cn/en/new/2020/amazon-cloudwatch-now-allows-you-to-combine-multiple-alarms/" rel="noopener noreferrer"&gt;combination&lt;/a&gt; of ratio and sum metric thresholds. I recommend custom metrics with high-resolution alarms over the AWS-provided function metrics to get a prompt response once a failure situation occurs. &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html" rel="noopener noreferrer"&gt;Log metric filters&lt;/a&gt; can detect errors and timeouts based on your function log streams.&lt;/li&gt;
&lt;li&gt;When the CloudWatch alarm is triggered, a Lambda function disables the event source mapping. AWS Lambda no longer polls the message queue. The circuit breaker is in state “open”.
Once the alarm falls back to OK, an AWS Step Function takes over: it periodically tries to invoke the protected function with a message from the queue. The circuit breaker is in state “half open”.&lt;/li&gt;
&lt;li&gt;If a certain number of trial messages succeed, the step function enables the event source mapping again. AWS Lambda starts polling the queue again. The circuit breaker is back in state “closed”.&lt;/li&gt;
&lt;/ol&gt;
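&lt;p&gt;Steps 2 and 3 can be sketched as a small “breaker” Lambda reacting to the alarm notification. Delivery of the alarm via SNS and the environment variable names are assumptions of this sketch:&lt;/p&gt;

```python
import json

def decide_action(alarm_state):
    """Map a CloudWatch alarm state to a circuit breaker action."""
    return {"ALARM": "disable_mapping",
            "OK": "start_half_open_probe"}.get(alarm_state)

def handler(event, context):
    """Breaker Lambda sketch: assumes the alarm notification arrives via
    SNS and the mapping UUID / state machine ARN come from env vars."""
    import os
    import boto3  # deferred so the sketch stays importable without AWS access
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    action = decide_action(message["NewStateValue"])
    if action == "disable_mapping":
        # Circuit "open": AWS Lambda stops polling the queue
        boto3.client("lambda").update_event_source_mapping(
            UUID=os.environ["EVENT_SOURCE_MAPPING_UUID"], Enabled=False)
    elif action == "start_half_open_probe":
        # Circuit "half open": hand over to the probing Step Function
        boto3.client("stepfunctions").start_execution(
            stateMachineArn=os.environ["PROBE_STATE_MACHINE_ARN"])
```

&lt;p&gt;The Step Function re-enables the mapping via &lt;code&gt;update_event_source_mapping&lt;/code&gt; with &lt;code&gt;Enabled=True&lt;/code&gt; once enough trial messages succeed (step 3).&lt;/p&gt;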

&lt;p&gt;This solution supports any Lambda runtime. No changes to your function code are required. Fixed monthly costs are incurred for CloudWatch alarms and metrics (the AWS free tier can be applied, except for high-resolution alarms). Costs for Step Functions transitions and Lambda function invocations are incurred only in the failure state. On the other hand, you save the costs of unnecessary queue service requests and Lambda invocations.&lt;/p&gt;

&lt;p&gt;I designed the solution for SQS as the function trigger, but other services like Amazon MQ that AWS Lambda integrates via event source mappings should work too.&lt;/p&gt;

&lt;h1&gt;Deploy the solution&lt;/h1&gt;

&lt;p&gt;You can find an implementation of this Circuit Breaker solution on &lt;a href="https://github.com/chgerkens/lambda-circuit-breaker" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;Related Approaches&lt;/h1&gt;

&lt;p&gt;Jeremy Daly’s &lt;a href="https://www.jeremydaly.com/throttling-third-party-api-calls-with-aws-lambda/" rel="noopener noreferrer"&gt;Lambda Orchestrator&lt;/a&gt; pattern goes a step further. It does not rely on the AWS Lambda event source mapping at all to receive messages and invoke Lambda functions. Instead, a long-running Lambda function polls the queue and invokes the processing Lambda function, similar to the “half open” state of the solution described above. The Lambda Orchestrator pattern enables sophisticated ways to throttle third-party API calls, such as respecting API quotas.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>lambda</category>
      <category>resilience</category>
    </item>
  </channel>
</rss>
