Working on a new project recently, I delved into deploying ECS Fargate containers in private subnets, with ingress allowed only through an Application Load Balancer. We chose this configuration primarily for security and firewall reasons, with cost optimization as another important consideration.
The containers also needed egress access to other (non-AWS) services, which is provided through a NAT Gateway.
Note: Some parts of the architecture (like the database) are omitted from this post in order to focus on the relevant components.
With this configuration alone, container images would be fetched from ECR (and S3) through the NAT Gateway, which presents the following challenges:
- Cost Implications of NAT Gateway Usage: The NAT Gateway accrues costs based on a per-GB data processing fee, in addition to an hourly charge. For instance, in the us-east-1 region at the time of writing, it's $0.045 per GB. At first glance, this might seem negligible. But consider this: if your container images are around 400MB, deploying just three containers exceeds 1GB. This can quickly add up, leading to unexpectedly high charges. Instances of such unexpected expenses have been reported (source: Excellent Dev.to article). Furthermore, repeated deployments due to failures can exacerbate this cost, as the image is pulled multiple times.
- Security Concerns with Data Transit: While this post focuses primarily on cost, it's worth noting that routing traffic over the public internet can pose security risks. For a deeper dive into this aspect, refer to AWS's documentation on VPC Endpoints and ECR.
The Networking Behind Docker Image Retrieval in Private Subnets
ECS interacts with three AWS services behind the scenes when pulling Docker images:
- ECR DKR: Used for the Docker Registry APIs. Docker client commands like push and pull engage with this endpoint.
- ECR API: This endpoint handles calls to the Amazon ECR API, facilitating actions like DescribeImages and CreateRepository.
- S3: ECR stores the actual layers of Docker images in AWS-managed S3 buckets, typically named arn:aws:s3:::prod-<region>-starport-layer-bucket.
ECS also needs access to other services, such as ECS telemetry and CloudWatch, but these are not directly involved in the Docker image pull.
Understanding and Mitigating NAT Gateway Traffic
In this section, we'll explore different strategies to minimise NAT gateway traffic and, consequently, its associated costs.
The experiment is to deploy one container instance in each scenario. At each step, we add the VPC endpoint(s) mentioned in the scenario to evaluate the difference.
The infrastructure is created using Terraform and can be found in this git repository. The project uses community-maintained AWS Terraform modules, which simplify the process. The code examples that follow use the vpc-endpoints module to create the Gateway and Interface endpoints.
In addition, I created a custom CloudWatch dashboard with a widget showing the sum of BytesOutToSource (the number of bytes sent through the NAT Gateway to the clients in your VPC) and BytesOutToDestination (the number of bytes sent out through the NAT Gateway to the destination) as an indication of the data processed by the NAT Gateway.
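For reference, such a widget can itself be defined in Terraform. The sketch below is an assumption, not the repo's exact dashboard: the dashboard name, region, and the NAT Gateway ID variable are all hypothetical, but the AWS/NATGateway namespace and metric names are the real ones.

```hcl
# Hypothetical sketch of the NAT Gateway traffic dashboard.
# var.nat_gateway_id is an assumed input holding the NAT Gateway ID.
resource "aws_cloudwatch_dashboard" "nat_traffic" {
  dashboard_name = "nat-gateway-traffic"

  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        width  = 12
        height = 6
        properties = {
          title  = "Total Bytes Out"
          region = "us-east-1"
          stat   = "Sum"
          period = 300
          metrics = [
            # Both directions of NAT Gateway egress, summed in the widget
            ["AWS/NATGateway", "BytesOutToSource", "NatGatewayId", var.nat_gateway_id],
            [".", "BytesOutToDestination", ".", "."]
          ]
        }
      }
    ]
  })
}
```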
The Docker image used in these scenarios is a very simple Node.js image with a size of ~403MB.
That's enough about the setup, let's dive into the scenarios and results.
1. Only NAT Gateway, no VPC endpoints
As we see in the Total Bytes Out widget below, all the data (~414MB) for pulling the Docker image flows through the NAT Gateway.
2. NAT Gateway + S3 Gateway endpoint
Now let's add an S3 Gateway endpoint to the VPC. Gateway endpoints have no cost associated with them; AWS offers them for S3 and DynamoDB.
In this case, we add the S3 endpoint using the vpc-endpoints module:
```hcl
s3 = {
  service             = "s3"
  service_type        = "Gateway"
  private_dns_enabled = true
  route_table_ids     = module.vpc.private_route_table_ids
  policy              = data.aws_iam_policy_document.s3_endpoint_policy.json
  tags                = { Name = "S3 Gateway Endpoint" }
},
```
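For context, entries like the s3 one above live in the endpoints map of the vpc-endpoints submodule. A minimal sketch of the surrounding module call follows; the version constraint and security group wiring are assumptions, not necessarily what the demo repo uses.

```hcl
module "vpc_endpoints" {
  source  = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
  version = "~> 5.0" # assumed version constraint

  vpc_id             = module.vpc.vpc_id
  security_group_ids = [module.vpc.default_security_group_id] # assumed wiring

  endpoints = {
    # s3 = { ... } and the interface endpoints from the later scenarios go here
  }
}
```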
And the corresponding endpoint policy:
```hcl
data "aws_iam_policy_document" "s3_endpoint_policy" {
  statement {
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::prod-${local.region}-starport-layer-bucket/*"] # to access the layer files

    principals {
      type        = "*"
      identifiers = ["*"]
    }
  }
}
```
Important to note here: S3 Gateway endpoints must be created in the same region as the S3 bucket they access.
As we see here, the data processed by the NAT Gateway drops drastically (to ~245KB), confirming that our image layers are now largely transferred through the S3 Gateway endpoint.
Note: If your containers have existing connections to Amazon S3, their connections might be briefly interrupted when you add the Amazon S3 gateway endpoint. Source
3. NAT Gateway + S3 Gateway endpoint + ECR DKR interface endpoints
In the next step, we add an ECR DKR interface endpoint.
```hcl
ecr_dkr = {
  service             = "ecr.dkr"
  private_dns_enabled = true
  subnet_ids          = [module.vpc.private_subnets[0]] # Interface endpoints are priced per AZ
  policy              = data.aws_iam_policy_document.generic_endpoint_policy.json
  tags                = { Name = "ECR DKR Interface Endpoint" }
},
```
See the demo project for details on the endpoint policy.
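As an illustration only (this is an assumption, not necessarily the repo's exact policy), a generic interface-endpoint policy commonly denies any request that does not originate from your own VPC:

```hcl
# Hypothetical sketch of a generic interface-endpoint policy.
# The aws:SourceVpc condition restricts the endpoint to traffic
# coming from this VPC; everything else is denied.
data "aws_iam_policy_document" "generic_endpoint_policy" {
  statement {
    effect    = "Deny"
    actions   = ["*"]
    resources = ["*"]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    condition {
      test     = "StringNotEquals"
      variable = "aws:SourceVpc"
      values   = [module.vpc.vpc_id]
    }
  }
}
```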
Note that interface endpoints also incur an hourly charge and data processing fees, but these tend to be lower than NAT Gateway charges. Depending on how much data the NAT Gateway processes for a particular service, it may make sense to add them for cost optimization.
In this instance the traffic for a single deployment dropped further to ~33KB.
4. NAT Gateway + S3 Gateway endpoint + ECR DKR and API interface endpoints
Adding the ECR API endpoint:
```hcl
ecr_api = {
  service             = "ecr.api"
  private_dns_enabled = true
  subnet_ids          = [module.vpc.private_subnets[0]] # Interface endpoints are priced per AZ
  policy              = data.aws_iam_policy_document.generic_endpoint_policy.json
  tags                = { Name = "ECR API Interface Endpoint" }
},
```
Comparing the scenarios
The results needed to be plotted on a logarithmic scale for visibility. As we see below, the S3 Gateway endpoint has the biggest impact on the data processed by the NAT gateway.
The cost impact
Considering a scenario similar to the article mentioned earlier, how much of a difference could the S3 Gateway endpoint have made?
The article mentions that their NAT Gateway processed 16TB of data with a 500MB Docker image, which amounts to approximately 32,000 image pulls. The repeated pulls were caused by a failing health check, something that can easily happen in real-world scenarios.
Let's simulate the same scenario with our 403MB Docker image.
Without the S3 Gateway endpoint, the NAT Gateway processes ~414MB per deployment.
With the S3 Gateway endpoint, it processes ~0.245MB per deployment.
If there were 32,000 deployments with the image in our example:
1. Without the S3 Gateway endpoint
Data processed: 414MB × 32,000 = 13,248,000MB = 13,248GB
Cost ($0.045/GB): $596.16
2. With the S3 Gateway endpoint
Data processed: 0.245MB × 32,000 = 7,840MB = 7.84GB
Cost ($0.045/GB): $0.3528
This could of course be reduced further with VPC interface endpoints, but since they come with their own costs, it's worth analysing them against the requirements of your specific setup.
Wrapping up
Looking at the data processed by the NAT gateway in different scenarios, I think it's fair to say:
- Definitely consider creating an S3 Gateway endpoint, since it is available at no additional cost and drastically reduces the data processed by the NAT Gateway in this and other scenarios.
- Depending on the number of deployments and security aspects of your architecture, consider using VPC interface endpoints.
If there are questions or feedback, please feel free to reach out!