Hardeep Singh Tiwana

NAT Gateways Killing Your Container Costs? Amazon ECR VPC endpoints to the Rescue

Picture this. Your AWS bill lands, and there it is: $10K in NAT Gateway charges for 3 NAT Gateways in us-east-1. You dig in and see that ~$8K comes from NatGateway-Bytes (data processed) alone, most of it likely tied to ECR image pulls. I've helped teams spot this exact issue using Cost Explorer and VPC Flow Logs, watching container deployments quietly eat budgets. The solution? Amazon ECR VPC endpoints. They cut the NAT bill by more than 75% in one setup I worked on. Let's walk through spotting it, the math, and the flow change.

TL;DR:

  • ECR image pulls through NAT Gateways cost $0.045/GB.
  • VPC Interface Endpoints cost $0.01/GB (78% cheaper).
  • Real example: ~$8.1K/month → ~$1.8K/month = ~$75K annual savings.

💡 Key Takeaways

The Problem: NAT Gateways charge $0.045/GB for data processing. For ECR-heavy workloads this adds up fast: our example case shows $8,010/month in data processing charges alone!

The Solution: Deploy three VPC endpoints to route ECR traffic privately:

  1. ECR API Interface Endpoint (com.amazonaws.<region>.ecr.api)
    • Handles authentication and image manifests
    • Cost: ~$7.30/month per AZ (~$22/month across 3 AZs) + minimal data charges
    • Required: Must deploy in each AZ for high availability
  2. ECR Docker Interface Endpoint (com.amazonaws.<region>.ecr.dkr)
    • Handles Docker pull/push commands
    • Cost: ~$7.30/month per AZ (~$22/month across 3 AZs) + minimal data charges
    • Required: Must deploy in each AZ for high availability
  3. S3 Gateway Endpoint (com.amazonaws.<region>.s3) ⭐ THE MOST CRITICAL ONE
    • Handles actual image layer downloads (99%+ of your data!)
    • Cost: $0.00 (FREE!)
    • Required: Without this, your image layers still hit NAT Gateways

The Savings: For 178,000 GB/month of ECR traffic:

  • Before: $8,108.55/month (NAT Gateways)
  • After: $1,823.80/month (VPC Endpoints)
  • Savings: $6,284.75/month (77.5%) = $75,417/year

Why This Works: ECR stores Docker image layers in S3. The free S3 Gateway endpoint handles 95%+ of your data transfer, while the two paid Interface endpoints handle control plane operations. All three work together to eliminate NAT Gateway data processing charges.

Implementation Time: ~30 minutes with Terraform, plus 48 hours to validate savings in Cost Explorer.

Critical Success Factor: You MUST deploy all three endpoints. Deploying only the ECR endpoints without the S3 Gateway endpoint will save you almost nothing, because the bulk of your data will still flow through NAT Gateways.

Compare both models


Let's start with the Brutal Math: NAT vs. Endpoints Head-to-Head

Think of a standard 3-AZ VPC with private subnets and container workloads. Each NAT Gateway costs $0.045 per hour plus $0.045 per GB processed. Interface endpoints run $0.01 per hour per ENI and $0.01 per GB, which is much better for high volume.

Note: Complete private ECR access needs two VPC interface endpoints per AZ (ecr.api and ecr.dkr), which makes 6 ENIs in a 3-AZ setup, plus one S3 Gateway endpoint for the image layers. The S3 Gateway endpoint modifies route tables and creates no ENIs. If you'd like to read more on this, follow the links at the end of this post.

  • ecr.api → Interface endpoint (ENI per AZ)
  • ecr.dkr → Interface endpoint (ENI per AZ)
  • s3 → Gateway endpoint (NO ENIs, modifies route tables)

NAT Gateway vs VPC Endpoints Cost Comparison

Configuration: 3 AZs with 3 NAT Gateways vs 3 VPC Endpoints

VPC Endpoint Configuration:

  • com.amazonaws.<region>.ecr.api (Interface) - $0.01/hour per AZ + $0.01/GB
  • com.amazonaws.<region>.ecr.dkr (Interface) - $0.01/hour per AZ + $0.01/GB
  • com.amazonaws.<region>.s3 (Gateway) - FREE (no hourly or data charges)

NAT Gateway Configuration:

  • 3 NAT Gateways (one per AZ) - $0.045/hour each + $0.045/GB

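Here's the model, scaled up to the ~$8K data-processing baseline (730 hours a month, 3 NAT Gateways versus 6 interface endpoint ENIs: ecr.api and ecr.dkr in each of 3 AZs, with every GB billed at the $0.01 endpoint rate; the S3 Gateway endpoint adds no hourly charge):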

| Data Volume (GB/mo) | NAT Cost ($) | VPC Endpoint Cost ($) | Monthly Savings ($) | Savings % |
|---|---|---|---|---|
| 100 | 103.05 | 44.80 | 58.25 | 56.5% |
| 500 | 121.05 | 48.80 | 72.25 | 59.7% |
| 1,000 | 143.55 | 53.80 | 89.75 | 62.5% |
| 5,000 | 323.55 | 93.80 | 229.75 | 71.0% |
| 10,000 | 548.55 | 143.80 | 404.75 | 73.8% |
| 50,000 | 2,348.55 | 543.80 | 1,804.75 | 76.8% |
| 100,000 | 4,598.55 | 1,043.80 | 3,554.75 | 77.3% |
| 178,000 | 8,108.55 | 1,823.80 | 6,284.75 | 77.5% |

NAT spend drops like a falling rock. At production scale, you will see ROI in days.
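The table can be reproduced with a few lines of Python. This is a minimal sketch assuming the us-east-1 rates quoted above, 730 hours per month, 3 NAT Gateways, and 6 interface endpoint ENIs, with every GB billed at the $0.01 endpoint rate as the table does. The function names are illustrative only.

```python
HOURS_PER_MONTH = 730
NAT_HOURLY = 0.045       # USD per NAT Gateway per hour (rate quoted in this post)
NAT_PER_GB = 0.045       # USD per GB processed by NAT
ENDPOINT_HOURLY = 0.01   # USD per interface endpoint ENI per hour
ENDPOINT_PER_GB = 0.01   # USD per GB processed by an interface endpoint


def nat_cost(gb, gateways=3):
    """Monthly NAT cost: hourly charge per gateway plus data processing."""
    return round(NAT_HOURLY * gateways * HOURS_PER_MONTH + NAT_PER_GB * gb, 2)


def endpoint_cost(gb, enis=6):
    """Monthly endpoint cost: hourly charge per ENI plus data processing."""
    return round(ENDPOINT_HOURLY * enis * HOURS_PER_MONTH + ENDPOINT_PER_GB * gb, 2)


def savings(gb):
    """Return (monthly savings in USD, savings as a percentage of NAT cost)."""
    nat = nat_cost(gb)
    saved = round(nat - endpoint_cost(gb), 2)
    return saved, round(100 * saved / nat, 1)


for gb in (100, 1_000, 10_000, 178_000):
    saved, pct = savings(gb)
    print(f"{gb:>8,} GB  NAT ${nat_cost(gb):,.2f}  "
          f"endpoints ${endpoint_cost(gb):,.2f}  saves ${saved:,.2f} ({pct}%)")
```

Swap in your own monthly GB figure from Cost Explorer to see roughly where you would land.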


Example use case with assumptions

Assume we have 3 NAT Gateways in us-east-1 processing 178,000 GB of ECR traffic monthly.

Cost Breakdown for Total Monthly Cost: $8,108.55

  1. NAT Gateway Hourly Charges: $98.55

    • $0.045 per hour × 3 NAT Gateways × 730 hours/month
    • This covers the provisioning cost for maintaining 3 NAT Gateways (one per AZ)
  2. Data Processing Charges: $8,010.00

    • $0.045 per GB × 178,000 GB
    • This is the charge for processing all data flowing through the NAT Gateways
  • Per NAT Gateway:

    • Hourly cost: $32.85/month per gateway
    • Data processing (if evenly distributed): $2,670.00/month per gateway

    Important Note: The data processing charge of $8,010 represents the vast majority (98.8%) of our assumed total NAT Gateway costs. Since we're processing ECR (Elastic Container Registry) traffic within the same region, we won't incur additional data transfer charges for the traffic itself, but the NAT Gateway data processing fee still applies.

Prerequisites:

  • Private subnets with NAT Gateway access
  • ECR repositories in the same region
  • Security groups allowing HTTPS (443) from workloads

Hunt Down Those Hidden ECR Pull Fees

Start in AWS Cost Explorer. Under Group by, select the Usage type dimension. Filter Service to EC2 - Other, and filter Usage type group to EC2: NAT Gateway - Data Processed and EC2: NAT Gateway - Running Hours. You'll see NatGateway-Bytes racking up (for example) that $8K at $0.045 per GB, plus NatGateway-Hours for the $0.045 hourly charge per gateway.

Cost Explorer Filters

For proof, enable VPC Flow Logs on your subnets. Flow Logs record IP addresses rather than domain names, so you can't filter on the ecr.api or ecr.dkr hostnames directly. Instead, look for destination port 443 traffic to IP addresses in the relevant AWS service ranges, available via the AWS IP ranges JSON.

Do you see private subnet bytes flooding NAT ENIs? That's the problem. Every pull sends a small request out via NAT, fetches metadata, then hauls gigabytes back, doubling up on processing fees. (If it is an inter-AZ hop, it adds $0.01 per GB more. I caught this pattern adding ~$3,000 a month extra in a recent cluster review.)

Using VPC Flow Logs to Track and Validate ECR Traffic Costs

Before deploying VPC endpoints, you need proof that ECR is actually consuming your NAT Gateway bandwidth. After deployment, you need validation that traffic shifted correctly. VPC Flow Logs provide both.

Step 1: Enable VPC Flow Logs

Enable Flow Logs on your private subnets where container workloads run:

Via AWS CLI:

aws ec2 create-flow-logs \
  --resource-type Subnet \
  --resource-ids subnet-xxxxx subnet-yyyyy subnet-zzzzz \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name /aws/vpc/flowlogs \
  --deliver-logs-permission-arn arn:aws:iam::ACCOUNT_ID:role/flowlogsRole

Via Terraform (see the aws_flow_log resource documentation on the Terraform website):

# Note: vpc_id enables logging VPC-wide, so the NAT Gateway ENIs in the public
# subnets are captured too; use subnet_id instead to scope to a single subnet.
resource "aws_flow_log" "private_subnets" {
  iam_role_arn    = aws_iam_role.flow_logs.arn
  log_destination = aws_cloudwatch_log_group.flow_logs.arn
  traffic_type    = "ALL"
  vpc_id          = aws_vpc.main.id
}

Step 2: Identify Top HTTPS Destinations

Run this CloudWatch Logs Insights query to find your highest-volume HTTPS destinations:

fields @timestamp, srcAddr, dstAddr, dstPort, bytes, action
| filter dstPort = 443
| filter interfaceId like /eni-/
| stats sum(bytes) as totalBytes by dstAddr
| sort totalBytes desc
| limit 50

This shows which destinations consume the most bandwidth on port 443. The top destinations are likely S3 IPs (for ECR image layers).

Step 3: Identify S3 and ECR Service IP Ranges

VPC Flow Logs show IP addresses, not domain names. Download AWS's IP ranges to identify both S3 and ECR traffic:

# Download AWS IP ranges
curl -o ip-ranges.json https://ip-ranges.amazonaws.com/ip-ranges.json

# Inspect services for your region
jq -r '.prefixes[] | select(.region=="us-east-1") | .service' ip-ranges.json | sort -u

ECR doesn't have a designated service value in the JSON, so once you know the available values, narrow the list down to AMAZON and S3 for your region:

# AMAZON covers ECR; the parentheses make the region filter apply to both services
jq -r '.prefixes[] | select((.service=="AMAZON" or .service=="S3") and .region=="us-east-1") | .ip_prefix' ip-ranges.json

Example IP ranges for us-east-1:

44.223.121.0/24
44.223.122.0/24
98.80.195.0/25
98.80.238.0/23
3.5.0.0/19
1.178.4.0/24

Expect more than 95% of the traffic to go to S3:

  • S3 (where ECR stores image layers - 95%+ of your traffic)
  • ECR (API and Docker registry - <5% of your traffic)

Why This Matters: Your 178,000 GB/month is primarily S3 traffic (image layer downloads), not ECR API calls. You must track S3 IPs to see the real cost impact!

(Always check the current AWS IP ranges JSON for your specific region)
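Matching flow-log destinations against those prefixes by eye gets tedious, so here is a small Python sketch that labels an IP using the downloaded JSON. It assumes the standard `prefixes` / `ip_prefix` / `service` / `region` layout of ip-ranges.json; the `SAMPLE` data and its service labels are made up for illustration, and the helper names are hypothetical.

```python
import ipaddress

# Illustrative stand-in for json.load(open("ip-ranges.json")); labels are examples only
SAMPLE = {
    "prefixes": [
        {"ip_prefix": "3.5.0.0/19", "region": "us-east-1", "service": "AMAZON"},
        {"ip_prefix": "3.5.0.0/19", "region": "us-east-1", "service": "S3"},
        {"ip_prefix": "44.223.121.0/24", "region": "us-east-1", "service": "AMAZON"},
    ]
}


def build_index(ip_ranges, region, services=("S3", "AMAZON")):
    """Return [(network, service)] for one region, S3 entries first."""
    index = [
        (ipaddress.ip_network(p["ip_prefix"]), p["service"])
        for p in ip_ranges["prefixes"]
        if p["region"] == region and p["service"] in services
    ]
    # Prefixes can appear under both S3 and the broader AMAZON label,
    # so check S3 first, then prefer longer (more specific) prefixes.
    index.sort(key=lambda item: (item[1] != "S3", -item[0].prefixlen))
    return index


def classify(addr, index):
    """Label a flow-log dstAddr as S3, AMAZON, or OTHER."""
    ip = ipaddress.ip_address(addr)
    for network, service in index:
        if ip in network:
            return service
    return "OTHER"


index = build_index(SAMPLE, "us-east-1")
print(classify("3.5.1.10", index), classify("8.8.8.8", index))
```

With the real file loaded in place of `SAMPLE`, the same two functions can tag every dstAddr from the Step 2 query.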

Step 4: Calculate NAT Gateway ECR+S3 Traffic

Filter Flow Logs for traffic to BOTH S3 and ECR IPs through NAT Gateway ENIs:

NOTE:

  1. Do NOT copy and paste this as is. Update the filter dstAddr like line to match the ranges from the previous command's output.
  2. The /^3\.5\./ and /^52\.94\./ patterns are examples; replace them with the real IP prefixes you want to look for.

fields @timestamp, srcAddr, dstAddr, dstPort, bytes, interfaceId
| filter dstPort = 443
| filter interfaceId like /eni-/ and action = "ACCEPT"
| filter dstAddr like /^3\.5\./ or dstAddr like /^52\.94\./
| stats sum(bytes) as totalBytes by interfaceId, dstAddr
| sort totalBytes desc

Identify NAT Gateway ENIs:

aws ec2 describe-nat-gateways --region us-east-1 \
  --query 'NatGateways[].{NatGatewayId:NatGatewayId, NetworkInterfaceIds:NatGatewayAddresses[].NetworkInterfaceId}' \
  --output table

Cross-reference the ENI IDs from your query results with NAT Gateway ENIs.
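If you export the query results, the cross-reference can be scripted. The sketch below assumes each result row is a dict with `interfaceId` and `totalBytes` keys (a hypothetical export shape) and that you pass in the ENI IDs returned by describe-nat-gateways; the IDs shown are placeholders.

```python
from collections import defaultdict


def nat_bytes_by_eni(rows, nat_eni_ids):
    """Sum totalBytes per ENI, keeping only rows seen on NAT Gateway ENIs."""
    totals = defaultdict(int)
    for row in rows:
        if row["interfaceId"] in nat_eni_ids:
            totals[row["interfaceId"]] += int(row["totalBytes"])
    return dict(totals)


# Placeholder rows and ENI IDs for illustration only
rows = [
    {"interfaceId": "eni-nat-a", "dstAddr": "3.5.1.10", "totalBytes": "900"},
    {"interfaceId": "eni-nat-a", "dstAddr": "3.5.2.20", "totalBytes": "100"},
    {"interfaceId": "eni-task-1", "dstAddr": "3.5.1.10", "totalBytes": "500"},
]
print(nat_bytes_by_eni(rows, {"eni-nat-a", "eni-nat-b"}))
```

Rows on non-NAT ENIs (the workload's own interfaces) are dropped, so the totals reflect only what the NAT Gateways processed.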

💡 Pro Tip: The top destination IPs by bytes will be S3 ranges, not ECR ranges. This confirms that the S3 Gateway endpoint is critical for cost savings!

Step 5: Calculate Monthly Cost Impact

From your Flow Logs query results:

  1. Sum total bytes through NAT Gateway ENIs to S3 + ECR IPs
  2. Convert to GB: totalBytes / 1,000,000,000 (AWS uses decimal GB)
  3. Calculate cost: GB × $0.045

Cost Calculation Example:

  • Flow Logs show: 191,102,976,000 bytes to S3/ECR in the sampled window
  • Convert: 191,102,976,000 / 1,000,000,000 = 191.10 GB
  • Cost of that sample: 191.10 × $0.045 ≈ $8.60
  • For 178,000 GB/month: 178,000 × $0.045 = $8,010/month
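The same conversion as a short Python sketch, assuming decimal GB and the $0.045/GB NAT rate used in this post. The `project_month` helper is an added assumption: it scales a sample window linearly, which only holds if pull traffic is fairly steady.

```python
NAT_PER_GB = 0.045  # USD per GB processed by a NAT Gateway (rate used in this post)


def bytes_to_gb(total_bytes):
    """Convert raw bytes to decimal GB (1 GB = 1,000,000,000 bytes)."""
    return total_bytes / 1_000_000_000


def nat_processing_cost(total_bytes, rate=NAT_PER_GB):
    """NAT data processing cost in USD for the given byte count."""
    return round(bytes_to_gb(total_bytes) * rate, 2)


def project_month(sample_bytes, sample_hours, hours_in_month=730):
    """Assumption: linearly scale a sample window to a full month."""
    return sample_bytes * hours_in_month / sample_hours


print(nat_processing_cost(191_102_976_000))      # the sample from Flow Logs
print(nat_processing_cost(178_000 * 10**9))      # the full month in the example
```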

Traffic Breakdown (typical):

  • S3 image layers: ~177,850 GB (99.91%)
  • ECR API calls: ~50 GB (0.03%)
  • ECR Docker registry: ~100 GB (0.06%)

Step 6: Validate After VPC Endpoint Deployment

After deploying VPC endpoints, confirm traffic shifted to private IPs:

fields @timestamp, srcAddr, dstAddr, dstPort, bytes, interfaceId
| filter dstPort = 443
| filter dstAddr like /^10\./
| filter interfaceId like /eni-/
| stats sum(bytes) as totalBytes by interfaceId
| sort totalBytes desc

What you should see:

  • ✓ Traffic now goes to private 10.x.x.x IPs (VPC endpoint ENIs)
  • ✓ NAT Gateway ENIs show minimal S3/ECR traffic
  • ✓ Total bytes shifted from NAT to VPC endpoints

❌ But this validation method has problems ❌

⚠️ The filter above only matches addresses in 10.0.0.0/8, one of the RFC 1918 private ranges, but VPC endpoint traffic does not always land there:

Gateway Endpoints (S3, DynamoDB)

  • Use prefix list routes (pl-xxx) in the route table, so there is no endpoint ENI to filter on
  • dstAddr shows the actual S3 service IP (public range like 52.x.x.x), not a private address
  • The /^10\./ filter therefore never matches this traffic, even though it no longer passes through NAT

Interface Endpoints (ECR.api, ECR.dkr, etc.)

  • Use PrivateLink IPs in the VPC CIDR (e.g., 10.0.x.x if your VPC is 10.0.0.0/16)
  • dstAddr shows the endpoint ENI IP (private), but only if your VPC CIDR starts with 10.
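To avoid hard-coding 10.x, you can test destinations against your actual VPC CIDR. The Python sketch below is one way to do that offline on exported results; `prefix_regex` is a hypothetical helper that only handles octet-aligned CIDRs (/8, /16, /24) and emits a pattern in the same /^.../ style used in the queries here.

```python
import ipaddress


def in_vpc(addr, vpc_cidrs):
    """True if a flow-log address falls inside any of the VPC's CIDR blocks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in ipaddress.ip_network(cidr) for cidr in vpc_cidrs)


def prefix_regex(cidr):
    """Build a /^a\\.b\\./ style prefix pattern for an octet-aligned IPv4 CIDR."""
    net = ipaddress.ip_network(cidr)
    if net.version != 4 or net.prefixlen not in (8, 16, 24):
        raise ValueError("only /8, /16 and /24 IPv4 CIDRs are supported")
    octets = str(net.network_address).split(".")[: net.prefixlen // 8]
    return "/^" + "".join(octet + r"\." for octet in octets) + "/"


print(in_vpc("10.0.2.100", ["10.0.0.0/16"]))
print(prefix_regex("172.16.0.0/16"))
```

For CIDRs that are not octet-aligned, `in_vpc` on exported rows is the safer check, since a simple prefix pattern cannot express them.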

So what would correct validation queries look like?

1. Interface Endpoints (ECR, etc.) - Check PrivateLink traffic

fields @timestamp, srcAddr, dstAddr, dstPort, bytes, interfaceId
| filter dstPort = 443
# Replace 10\. below with your VPC CIDR prefix
| filter dstAddr like /^10\./
| filter interfaceId like /eni-/
| stats sum(bytes) as totalBytes by dstAddr, interfaceId
| sort totalBytes desc

⚠️ Only works if your VPC CIDR is 10.x.x.x. Replace with your actual CIDR prefix (e.g., 172.16. or 192.168.).

2. Gateway Endpoints (S3) - Check prefix list bypass

# Flow logs have no bucket or hostname fields, so match S3 by IP prefix (example prefix from ip-ranges.json)
# and exclude the NAT Gateway ENI IDs from describe-nat-gateways (the eni-... values are placeholders)
fields @timestamp, srcAddr, dstAddr, dstPort, bytes, interfaceId
| filter dstPort = 443
| filter dstAddr like /^3\.5\./
| filter interfaceId not in ["eni-aaaaaaaa", "eni-bbbbbbbb", "eni-cccccccc"]
| stats sum(bytes) as totalBytes by dstAddr
| sort totalBytes desc

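3. 🎯 NAT Gateway traffic drop (The real validation) 🎯

# NAT Gateways appear in flow logs under their ENI IDs (eni-...), never as nat-...
# Replace the placeholder IDs with the ENIs from describe-nat-gateways
fields @timestamp, srcAddr, dstAddr, dstPort, bytes, interfaceId
| filter dstPort = 443
| filter interfaceId in ["eni-aaaaaaaa", "eni-bbbbbbbb", "eni-cccccccc"]
| stats sum(bytes) as totalBytes by interfaceId
| sort totalBytes desc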

Before endpoints: High bytes on NAT ENIs
After endpoints: Bytes drop significantly on those same ENIs.

🎯 What success looks like 🎯

BEFORE endpoints:

  • NAT ENI: 150 GB to s3.us-east-1.amazonaws.com
  • NAT ENI: 25 GB to 123456789012.dkr.ecr.us-east-1.amazonaws.com

AFTER endpoints:

  • NAT ENI: 5 GB (mostly external APIs)
  • Interface ENI: 25 GB to 10.0.2.100 (ECR.dkr endpoint)
  • S3 traffic: Prefix list route (no NAT ENI)

Key metric: NAT ENI bytes drop. That's your validation.

The /^10\./ filter only catches interface endpoints and only if your VPC uses that range. Use the NAT traffic reduction query instead.
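The before/after comparison boils down to one percentage. A minimal Python sketch, assuming you have summed NAT ENI bytes (or GB) for comparable windows before and after deployment; the 75% threshold mirrors the drop this post expects and is adjustable.

```python
def nat_drop_pct(before, after):
    """Percentage reduction in traffic seen on NAT Gateway ENIs."""
    if before <= 0:
        raise ValueError("'before' must be positive")
    return round(100 * (before - after) / before, 1)


def endpoints_effective(before, after, threshold_pct=75.0):
    """True when NAT traffic fell by at least threshold_pct."""
    return nat_drop_pct(before, after) >= threshold_pct


# Figures from the success example above: 150 GB + 25 GB before, 5 GB after
print(nat_drop_pct(175, 5), endpoints_effective(175, 5))
```

If the check fails, the remaining NAT traffic is worth breaking down by destination again before assuming the endpoints are misconfigured.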

Validate endpoint ENI IDs:

# ECR API endpoint ENIs
aws ec2 describe-vpc-endpoints --region us-east-1 \
  --filters "Name=service-name,Values=com.amazonaws.us-east-1.ecr.api" \
  --query 'VpcEndpoints[*].NetworkInterfaceIds' \
  --output table

# ECR Docker endpoint ENIs
aws ec2 describe-vpc-endpoints --region us-east-1 \
  --filters "Name=service-name,Values=com.amazonaws.us-east-1.ecr.dkr" \
  --query 'VpcEndpoints[*].NetworkInterfaceIds' \
  --output table

# S3 Gateway endpoint (no ENIs - modifies route tables)
aws ec2 describe-vpc-endpoints --region us-east-1 \
  --filters "Name=service-name,Values=com.amazonaws.us-east-1.s3" \
  --query 'VpcEndpoints[*].[VpcEndpointId,VpcEndpointType,RouteTableIds]' \
  --output table

Step 7: Correlate with Cost Explorer

Confirm the cost impact in AWS Cost Explorer:

  1. Navigate to: Cost Explorer → Cost & Usage Reports
  2. Group by: Usage Type
  3. Filter Service: EC2 - Other
  4. Look for:
    • NatGateway-Bytes (should drop ~75%)
    • VpcEndpoint-Bytes (should increase proportionally)
    • Time range: Compare 2 weeks before vs 2 weeks after deployment

Expected results:

- NAT Gateway data processing: $8,010 → ~$0 for ECR traffic in this model (expect a 75%+ drop overall, since other traffic remains)
- VPC Endpoint data processing: $0 → ~$1,780
- Net savings: ~$6,285/month
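The same comparison can be scripted against the Cost Explorer API. The sketch below only summarizes a response; it assumes the shape returned by GetCostAndUsage when grouped by the USAGE_TYPE dimension with the UnblendedCost metric (for example via boto3's `ce` client), and matches usage types by suffix because some regions prepend a region code. The sample response is fabricated for illustration.

```python
from collections import defaultdict

WATCHED = ("NatGateway-Bytes", "NatGateway-Hours",
           "VpcEndpoint-Bytes", "VpcEndpoint-Hours")


def cost_by_usage_type(response):
    """Total UnblendedCost per watched usage type across all time periods."""
    totals = defaultdict(float)
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            usage_type = group["Keys"][0]
            for suffix in WATCHED:
                if usage_type.endswith(suffix):
                    amount = group["Metrics"]["UnblendedCost"]["Amount"]
                    totals[suffix] += float(amount)
    return {name: round(total, 2) for name, total in totals.items()}


# Fabricated response fragment, shaped like a GetCostAndUsage result
sample = {"ResultsByTime": [{"Groups": [
    {"Keys": ["NatGateway-Bytes"],
     "Metrics": {"UnblendedCost": {"Amount": "8010.00", "Unit": "USD"}}},
    {"Keys": ["NatGateway-Hours"],
     "Metrics": {"UnblendedCost": {"Amount": "98.55", "Unit": "USD"}}},
    {"Keys": ["BoxUsage:t3.micro"],
     "Metrics": {"UnblendedCost": {"Amount": "7.00", "Unit": "USD"}}},
]}]}
print(cost_by_usage_type(sample))
```

Run it once on the two weeks before deployment and once on the two weeks after, then compare the two dictionaries.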

Understanding the Three-Endpoint Architecture

Why you need all three endpoints:

  1. ECR API Interface Endpoint (com.amazonaws.us-east-1.ecr.api)
    • Handles authentication, authorization, image manifests
    • Low data volume (~50 GB/month)
    • Cost: $21.90/month (3 AZs × 730 hrs × $0.01) + ~$0.50 data
  2. ECR Docker Interface Endpoint (com.amazonaws.us-east-1.ecr.dkr)
    • Handles Docker pull/push commands, layer discovery
    • Low data volume (~100 GB/month)
    • Cost: $21.90/month (3 AZs × 730 hrs × $0.01) + ~$1.00 data
  3. S3 Gateway Endpoint (com.amazonaws.us-east-1.s3) ← THE CRITICAL ONE
    • Handles actual image layer downloads (99%+ of your data!)
    • High data volume (~177,850 GB/month)
    • Cost: $0.00 (FREE!) ← This is where your savings come from!

Without the S3 Gateway endpoint, your image layer downloads would still hit NAT Gateways even with ECR endpoints deployed!
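To see why the S3 Gateway endpoint matters so much, here is a rough Python sketch of the monthly data charges under each deployment choice. It assumes the traffic split and per-GB rates listed above ($0.045 via NAT, $0.01 via interface endpoints, $0.00 via the S3 Gateway endpoint); hourly fees are left out, and the function name is made up for illustration.

```python
# Monthly traffic split from the example above, in GB
TRAFFIC_GB = {"ecr.api": 50, "ecr.dkr": 100, "s3": 177_850}

NAT_PER_GB = 0.045
INTERFACE_PER_GB = 0.01
GATEWAY_PER_GB = 0.0


def monthly_data_cost(s3_gateway, ecr_interfaces):
    """Data processing charges only, for a given set of deployed endpoints."""
    cost = 0.0
    for service, gb in TRAFFIC_GB.items():
        if service == "s3":
            rate = GATEWAY_PER_GB if s3_gateway else NAT_PER_GB
        else:
            rate = INTERFACE_PER_GB if ecr_interfaces else NAT_PER_GB
        cost += gb * rate
    return round(cost, 2)


print("no endpoints:      ", monthly_data_cost(False, False))
print("ECR endpoints only:", monthly_data_cost(False, True))
print("all three:         ", monthly_data_cost(True, True))
```

With only the ECR interface endpoints, data charges barely move ($8,010.00 to $8,004.75); adding the S3 Gateway endpoint brings them down to $1.50 under these assumptions. That is lower than the $1,780 used in the comparison table, which bills all 178,000 GB at the $0.01 interface rate and so reads as a conservative ceiling.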

Pro Tips for Flow Logs Analysis

  • ✓ Track S3 IPs, not just ECR IPs - S3 is where 95%+ of ECR data flows
  • ✓ Scope Flow Logs to the subnets you need - Reduces log volume and costs
  • ✓ Use CloudWatch Logs Insights - Best for ad-hoc queries and quick analysis
  • ✓ Consider Amazon Athena - Better for large-scale historical analysis
  • ✓ Set up CloudWatch alarms - Alert on unexpected NAT traffic spikes
  • ✓ Tag your resources - Makes NAT Gateways and VPC endpoints easier to identify
  • ✓ Factor in Flow Logs cost - Approximately $0.50/GB ingested to CloudWatch
  • ✓ Mind the aggregation interval - Flow Logs aggregate over 10 minutes by default; the 1-minute option adds log volume
  • ✓ Monitor for 2-4 weeks - Ensures you capture full deployment cycles and traffic patterns

Before and After: Understanding The Traffic Flow

  • Before: ECS Tasks → NAT Gateway → Internet → ECR/S3 (expensive)
  • After: ECS Tasks → VPC Endpoints → AWS Private Network → ECR/S3 (optimized)

Before endpoints

  • A pod in a private subnet hits NAT Gateway for every ECR pull
  • Request goes outbound to the internet, ECR API replies inbound through NAT processing, then Docker layers stream back with massive GBs.
  • Flow Logs show megabytes to NAT ENIs. Cost Explorer's NatGateway-Bytes balloons to $8K.

After endpoints

  • Deploy com.amazonaws.<region>.ecr.api and .ecr.dkr endpoints in each private subnet per AZ, and turn on private DNS.
  • Pod traffic goes straight to the endpoint ENI via PrivateLink, with no NAT or internet gateway in the path.
  • The AWS backbone handles the rest, and ECR layers flow free within the region.
  • Flow Logs shift: zero NAT traffic to ECR domains, all ECR bytes on private 10.x endpoint IPs.
  • In Cost Explorer, NAT usage drops like a falling rock.
  • Look for usage types containing VpcEndpoint-Hours and VpcEndpoint-Bytes under the VPC service to confirm that endpoint costs are starting to appear, at much smaller amounts than NAT was showing.

VPC Endpoint Costs

Rolled this out on a Kubernetes fleet processing 178,000 GB/mo of ECR traffic. NAT crashed from $10K ($8K of it data processed) to $2K for services that still need it. Endpoints totaled $1.8K. Filter on Data Transfer + EC2 in Cost Explorer and you will see EC2: NAT Gateway - Data Processed costs drop sharply, while VpcEndpoint-Hours + VpcEndpoint-Bytes take over at $0.01/GB.

Cost After VPC Interface Endpoints: $1,823.80/month

New Cost Breakdown:

NAT Gateway Costs:

  • Hourly charges: $98.55 (gateways remain for other traffic; this line is not part of the $1,823.80 endpoint total)
  • Data processing: $0.00 (ECR traffic now bypasses NAT entirely)

VPC Interface Endpoint Costs:

  • Hourly charges: $43.80 (2 endpoints × 3 AZs × 730 hours × $0.01/hour)
  • Data processing: $1,780.00 (178,000 GB × $0.01/GB)

The Impact:

  • 💰 Monthly Savings: $6,284.75/month (77.5%)
  • 💰 Annual Savings: $75,417.00/year

What You Need to Deploy:

Required Interface Endpoints (per AZ):

  • ✅ com.amazonaws.us-east-1.ecr.api - For ECR API calls
  • ✅ com.amazonaws.us-east-1.ecr.dkr - For Docker registry operations

Required Gateway Endpoint (VPC-wide - For ECR image layer storage - FREE):

  • ✅ com.amazonaws.us-east-1.s3 - Deploy once per VPC (not per AZ)

A quick and dirty Terraform example:

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.${var.aws_region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id
  policy            = data.aws_iam_policy_document.s3_ecr_access.json

  tags = {
    Name = "s3-gateway"
  }
}

resource "aws_vpc_endpoint" "ecr-dkr-endpoint" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.aws_region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  security_group_ids  = [aws_security_group.ecs_task.id]
  subnet_ids          = aws_subnet.private[*].id

  tags = {
    Name = "ecr-dkr"
  }
}

resource "aws_vpc_endpoint" "ecr-api-endpoint" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.aws_region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  security_group_ids  = [aws_security_group.ecs_task.id]
  subnet_ids          = aws_subnet.private[*].id

  tags = {
    Name = "ecr-api"
  }
}

Validation:

  • Validate with: nslookup api.ecr.us-east-1.amazonaws.com
  • Should resolve to private addresses inside your VPC CIDR (10.x.x.x in this example), not public IPs.
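The DNS check can also be scripted from inside the VPC. This is a small Python sketch using only the standard library; the helper names are illustrative, and `resolve` needs to run on a host in the private subnets for the result to mean anything.

```python
import ipaddress
import socket


def resolve(hostname, port=443):
    """Return the sorted, de-duplicated addresses a hostname resolves to."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def all_private(addresses):
    """True only if at least one address was returned and all are private."""
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )


# Offline illustration with literal addresses
print(all_private(["10.0.2.100", "10.0.3.7"]))
print(all_private(["52.94.228.10"]))
```

On a workload host, `all_private(resolve("api.ecr.us-east-1.amazonaws.com"))` returning True suggests private DNS is pointing at the interface endpoint.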

💡 Pro Tip: The S3 Gateway endpoint is critical but FREE.

  • While the ECR endpoints handle API calls, image layers are stored in S3. The free Gateway endpoint ensures this traffic also bypasses NAT at zero cost, so don't skip it. Without it, your layer downloads will still hit NAT Gateways!

Why Does This Work So Well?

The key is data processing rate difference:

  • NAT Gateway: $0.045/GB
  • VPC Endpoint: $0.01/GB (78% cheaper per GB)

Plus, VPC endpoints provide:

  • Better security - Traffic never leaves AWS network
  • Lower latency - Direct path to ECR
  • Higher reliability - No internet gateway dependency
  • Simplified architecture - Private subnets can pull images directly

Another implementation detail to keep in mind:

Your NAT Gateways stay in place for other internet-bound traffic (software updates, external APIs, etc.), but all ECR image pulls route through the VPC endpoints instead. This is a configuration change, not a replacement, so you get the best of both worlds.

Troubleshooting:

  • DNS not resolving privately? Enable "Private DNS" on the Interface endpoints ✅
  • Still seeing NAT charges? Check that the endpoint security group rules allow 443 inbound ✅
  • Pulls timing out? Verify subnet route tables don't force traffic through an internet gateway ✅
  • Endpoint not appearing in Cost Explorer? Wait 24-48 hours for billing data to populate; check under Service: "VPC" ✅
  • Validate endpoint status: aws ec2 describe-vpc-endpoints --filters "Name=service-name,Values=com.amazonaws.us-east-1.ecr.api" ✅

Troubleshooting Flow Logs Analysis

Issue: Can't find NAT Gateway ENIs in Flow Logs

  • ✅ Verify Flow Logs are enabled on the correct subnets
  • ✅ Check that traffic-type is set to ALL (not just ACCEPT or REJECT)
  • ✅ Wait 10-15 minutes after enabling for data to populate

Issue: S3/ECR IP ranges don't match traffic

  • ✅ AWS IP ranges change periodically - always download the latest JSON
  • ✅ Some regions have additional IP ranges not in the main prefixes
  • ✅ Check for both IPv4 and IPv6 ranges if your VPC supports dual-stack
  • ✅ Remember: Most traffic will be to S3 IPs, not ECR IPs!

Issue: Traffic still shows NAT Gateway after endpoint deployment

  • ✅ Verify private_dns_enabled = true on Interface endpoints
  • ✅ Check security groups allow port 443 from workload subnets
  • ✅ Confirm route tables don't have explicit routes forcing traffic through an internet gateway
  • ✅ Verify the S3 Gateway endpoint is associated with the correct route tables
  • ✅ Test DNS resolution: nslookup api.ecr.us-east-1.amazonaws.com should return private addresses from your VPC CIDR (10.x.x.x in this example)
  • ✅ Test S3 access: nslookup s3.us-east-1.amazonaws.com should resolve (Gateway endpoints don't change DNS)

Issue: Cost Explorer doesn't match Flow Logs calculations

  • ✅ Flow Logs show raw bytes; Cost Explorer uses decimal GB (1 GB = 1,000,000,000 bytes)
  • ✅ Cost Explorer has a 24-48 hour delay for billing data
  • ✅ Ensure you're comparing the same time periods
  • ✅ Check for data transfer charges vs data processing charges
  • ✅ Remember: S3 Gateway endpoint traffic is FREE, so you won't see it in VPC endpoint costs

Issue: Only seeing small data volumes to ECR IPs

  • ✅ This is NORMAL! ECR API/Docker traffic is <5% of total
  • ✅ The bulk of your data goes to S3 IPs (image layers)
  • ✅ If you're only filtering for ECR IPs, you're missing 95%+ of the traffic
  • ✅ Update your query to include S3 IP ranges

Reality Check

This assumes full traffic shift (realistic for ECR-only optimization). Background NAT persists for other internet traffic. Monitor your Cost Explorer's NAT Gateway data processing charges weekly for the first month. You should see a 75%+ drop if ECR is your primary NAT consumer. If not, investigate other high-volume services using VPC Flow Logs.

Next Steps

  1. Run Cost Explorer analysis (5 min)
  2. Deploy endpoints in non-prod (30 min)
  3. Validate with test pulls (10 min)
  4. Monitor for 48 hours
  5. Roll to production during maintenance window
  6. Track Cost Explorer for 2 weeks to confirm savings

Ready to fix it? Create the endpoints in the console or Terraform, tag them like Name:ecr-api for tracking, and test docker pull once private DNS propagates. Budget relief comes fast. Seen this work for you? Share in the comments.

References:
