Geoffrey Wiseman for AWS Community Builders

Tracking down a spike in NAT Gateway charges

If you spend enough time architecting and building solutions for the cloud, you're likely to have either experienced or, at the very least, heard tell of an exciting cloud billing story (📈).

This is my tale of investigating an unexpected spike in NAT Gateway charges in a cloud migration project.

The Project

One of my clients recently migrated a complex on-premises application to AWS, a lift-and-shift project that uses GitHub Actions, Terraform, and Ansible to automate the creation of environment-specific VPCs containing several web applications, several microservices, and supporting infrastructure.

After the new environments were launched, the project team took some time to step back and look for problems. We found and documented points of friction in the automation, we tightened security, and we looked at the month-by-month bills from AWS.

Most of the AWS costs were as expected, but there was one category we didn't have an explanation for: NAT.

What's NAT?

NAT stands for "Network Address Translation". You typically need NAT when you're trying to communicate between a resource that does not have a public IP address and some outside resource. Because the internal machine does not have a public IP address, there's no way to communicate with it directly, so you have another resource (a server, or in this case, an AWS-managed NAT Gateway) translate between an internal address and port and an external address and port.

Unexplained NAT Costs

The VPCs created by the project's automation have private subnets and those subnets use NAT Gateways to communicate with the outside world when necessary.
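If you want to confirm that wiring from the CLI, you can look for the route tables that send traffic through a given NAT gateway; a quick sketch, with a placeholder gateway ID:

# Route tables with a route through a given NAT gateway (placeholder ID)
aws ec2 describe-route-tables \
  --filters Name=route.nat-gateway-id,Values=nat-0123456789abcdef0 \
  --query 'RouteTables[].{RouteTable:RouteTableId,Vpc:VpcId,Subnets:Associations[].SubnetId}'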

However, the NAT costs were high and rising:

Rising NAT Costs

The NAT traffic was surprisingly high -- we were approaching 40 TB of NAT traffic per month, far more than the team would have expected. Where was the traffic coming from, where was it going, and why?

There's a VPC per environment, and some of the environments share an account, so let's see if each VPC is seeing similar amounts of NAT traffic by using CloudWatch Metrics for the managed NAT gateways.

CloudWatch Metrics

Let's look at the metrics for NAT Gateways in CloudWatch. "NATGateway > NAT Gateway Metrics" has a bunch of metrics, but since we're looking at data volume, let's turn on all the ones that refer to "Bytes":

CloudWatch Metrics for NAT Gateways

Ok, it's just one NAT Gateway that seems to be responsible for the majority of the traffic, in both BytesInFromDestination and BytesOutToSource. Both of those metrics track data coming in from outside the VPC -- in other words, resources inside the VPC are downloading large amounts of data:

NAT Gateway Bytes
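The same numbers can be pulled from the CLI if you want to compare gateways without clicking around; a rough sketch, with a placeholder gateway ID and date range:

# Daily inbound bytes for one NAT gateway (gateway ID and dates are placeholders)
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesInFromDestination \
  --dimensions Name=NatGatewayId,Value=nat-0123456789abcdef0 \
  --start-time 2022-06-01T00:00:00Z --end-time 2022-07-01T00:00:00Z \
  --period 86400 --statistics Sum \
  --query 'Datapoints[].[Timestamp,Sum]' --output text | sort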

VPC Flow Logs / CloudWatch Insights

Now that we've narrowed it down to one VPC, how do we figure out where the traffic is coming from and going to? A good next step is VPC Flow Logs, which record traffic metadata in your VPC. Flow logs were already turned on, and there's a convenient AWS Knowledge Center article with tips on using CloudWatch Logs Insights to find the top contributors to traffic through a NAT Gateway.
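The queries from that article can also be run from the CLI; here's a rough sketch that sums bytes by source and destination for the NAT gateway's ENI over the last day (the log group name and ENI ID are placeholders):

# Start a Logs Insights query against the flow log group (placeholder names)
aws logs start-query \
  --log-group-name /vpc/flow-logs \
  --start-time $(($(date +%s) - 86400)) --end-time $(date +%s) \
  --query-string "filter interfaceId = 'eni-0123456789abcdef0'
    | stats sum(bytes) as bytesTransferred by srcAddr, dstAddr
    | sort bytesTransferred desc
    | limit 20"
# The query runs asynchronously; fetch results with the queryId it returns
aws logs get-query-results --query-id <query-id-from-above>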

Looking at traffic going in and out of the NAT gateway, it looks like there are a lot of different internal IP addresses involved:

NAT Gateway Flow Logs - Destination

So what's the source of this data?

NAT Gateway Flow Logs - Source

A quick scan suggests that these are AWS addresses, likely S3. So a variety of internal IP addresses in a single VPC are downloading lots of data from Amazon S3. Why?
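One way to sanity-check that guess is AWS's published ip-ranges.json, which maps its public IP ranges to services and regions; a quick sketch that lists the S3 ranges for this region, which you can then compare against the addresses from the flow logs:

# List the S3 CIDR ranges for ca-central-1, then check the suspect addresses against them
curl -s https://ip-ranges.amazonaws.com/ip-ranges.json \
  | jq -r '.prefixes[] | select(.service == "S3" and .region == "ca-central-1") | .ip_prefix'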

Tracking Down IP Addresses

At this point I checked a couple of things first -- in particular, I was starting to suspect an unhealthy ECS service that wasn't being well monitored, but I didn't spot one on a very quick scan, so I decided to continue with the systematic search. What are these IP addresses associated with?

So, now we look in EC2 > Network Interfaces (ENIs):

Elastic Network Interfaces

Nothing. I searched for several of the IP addresses listed in the flow logs, and none of them are currently active. Suspicious. Something is using an IP address, generating NAT traffic, and then ... going away before I search for it.
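For reference, the same search can be done from the CLI; a sketch with a placeholder address (the real addresses came from the flow logs above):

# Look for an ENI that currently holds a given private IP (placeholder address)
aws ec2 describe-network-interfaces \
  --filters Name=addresses.private-ip-address,Values=10.0.1.123 \
  --query 'NetworkInterfaces[].[NetworkInterfaceId,Description,Status]' \
  --output table
# An empty result means nothing currently owns that address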

Can we find the IP address on CloudTrail?

Yes we can:

❯ aws cloudtrail lookup-events --max-results 20 \
    --lookup-attributes AttributeKey=EventName,AttributeValue=CreateNetworkInterface \
    --end-time "2022-06-10" \
    | jq '.Events[].CloudTrailEvent | fromjson' | jq '.userAgent' | sort | uniq -c
  20 "ecs.amazonaws.com"

Conveniently, the security group that appears in the event also matches the ECS service name. Sounds like my theory was correct -- I just hadn't found the right service yet.
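For completeness, the private IP and security group can be pulled out of those same events; a sketch, with the caveat that the exact shape of responseElements varies by API call, so treat the jq paths as a starting point rather than gospel:

aws cloudtrail lookup-events --max-results 20 \
  --lookup-attributes AttributeKey=EventName,AttributeValue=CreateNetworkInterface \
  --end-time "2022-06-10" \
  | jq '.Events[].CloudTrailEvent | fromjson
        | {ip: .responseElements.networkInterface.privateIpAddress,
           groups: [.responseElements.networkInterface.groupSet.items[]?.groupName]}'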

It's immediately clear from the service events that the service isn't healthy and has been restarting since some event in March without anyone noticing:

{
    "id": "715c6493-9a08-4de2-82d1-afcaf54c54ef",
    "createdAt": "2022-07-13T15:24:18.650000-04:00",
    "message": "(service qa-XXX-service) deregistered 1 targets in (target-group arn:aws:elasticloadbalancing:ca-central-1:004110273183:targetgroup/qa-XXX-service/a7f603824fb1e497)"
},
{
    "id": "0fea699a-3628-4f9d-abe0-dd89a930aa45",
    "createdAt": "2022-07-13T15:24:01.939000-04:00",
    "message": "(service qa-XXX-service) has begun draining connections on 1 tasks."
},
{
    "id": "68e666d1-f489-40a3-9dda-c5b739a7e6ad",
    "createdAt": "2022-07-13T15:24:01.926000-04:00",
    "message": "(service qa-XXX-service) deregistered 1 targets in (target-group arn:aws:elasticloadbalancing:ca-central-1:004110273183:targetgroup/qa-XXX-service/a7f603824fb1e497)"
},
{
    "id": "3cca6545-d38d-4b88-a8b9-1e936f737526",
    "createdAt": "2022-07-13T15:23:30.948000-04:00",
    "message": "(service qa-XXX-service) registered 1 targets in (target-group arn:aws:elasticloadbalancing:ca-central-1:004110273183:targetgroup/qa-XXX-service/a7f603824fb1e497)"
},
{
    "id": "36b50859-babe-47eb-8dcb-ca2c5c661491",
    "createdAt": "2022-07-13T15:23:21.103000-04:00",
    "message": "(service qa-XXX-service) has started 1 tasks: (task d1a70a3629d64febbc91695b4b848d71)."
},
{
    "id": "32f354a3-21ec-4a26-aac0-ceb02805dd16",
    "createdAt": "2022-07-13T15:23:16.401000-04:00",
    "message": "(service qa-XXX-service) has begun draining connections on 1 tasks."
},
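Those service events are what ECS's describe-services call returns; a sketch of pulling just the most recent ones (the cluster and service names are placeholders):

# Show the most recent service events (cluster and service names are placeholders)
aws ecs describe-services --cluster qa-cluster --services qa-XXX-service \
  --query 'services[0].events[:10].[createdAt,message]' --output table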

Each time it restarts, it pulls down a new container image from ECR, incurring NAT bandwidth costs to do it.

What Now?

First, stop the bleeding:

  • Set the service's desired task count to 0 (the CLI version is sketched below).
  • This stops ECS from trying to create new tasks that will never be healthy.
  • It's also a good way to stop ECS from trying to fix a broken service without significantly altering the task definition or service definition, so you can diagnose it later.
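The CLI version of that first step, with placeholder cluster and service names:

# Scale the broken service to zero tasks so ECS stops relaunching it
aws ecs update-service --cluster qa-cluster --service qa-XXX-service --desired-count 0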

At this point, we can watch the NAT Gateway metrics to make sure usage has dropped, and the AWS bill over the next few days to make sure the NAT Gateway costs have dropped with it.
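If you'd rather not wait on the console graphs, Cost Explorer's API can break the bill down daily by usage type; a sketch that filters for the NAT gateway usage types (names like CAN1-NatGateway-Bytes; the dates are placeholders):

aws ce get-cost-and-usage \
  --time-period Start=2022-07-14,End=2022-07-21 \
  --granularity DAILY --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=USAGE_TYPE \
  | jq '.ResultsByTime[]
        | {day: .TimePeriod.Start,
           nat: [.Groups[]
                 | select(.Keys[0] | test("NatGateway"))
                 | {type: .Keys[0], cost: .Metrics.UnblendedCost.Amount}]}'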

Next?

We've solved the immediate problem, but the investigation has identified plenty of areas for further attention:

  • Technical Debt
    • This is an old service that has been replaced by the new environments but hasn't been cleaned up.
    • Those cleanup tasks that had been deferred now have a very real cost.
  • Better Monitoring of ECS Services
    • I've seen an ECS service go into this infinite unhealthy loop before.
    • There are lots of ways to detect and diagnose failing ECS services, but most of them will require some setup on your part.
    • These NAT costs make it clear that the project team will need to invest more time into detecting unhealthy ECS services.
  • NAT Gateway Monitoring
    • One NAT gateway was handling significantly more traffic than the others.
    • We could add an alarm for when NAT gateway usage passes a reasonable limit (see the sketch after this list).
    • We could also consider CloudWatch anomaly detection.
  • Cost Monitoring
    • A cost of this magnitude should probably not have gone unnoticed for months.
    • While the other items still have value, this is just one of a seemingly infinite number of ways your AWS bill can surprise you. Improving our monitoring of ECS services and NAT gateways won't prevent something else from going off the rails in a month or two.
    • There are lots of options here, from budget alerts in AWS Budgets to AWS Cost Anomaly Detection.
  • PrivateLink
    • Without VPC endpoints, traffic between ECS tasks running on Fargate and AWS services like ECR and S3 travels through the NAT gateway.
    • By adding endpoints to the VPC (a gateway endpoint for S3, interface endpoints for ECR), we can route that traffic over AWS's internal network, reducing NAT gateway traffic and costs (see the sketch after this list).
  • Replace Managed NAT Gateways
    • It is possible to run your own NAT instances on EC2 instead of using the AWS-managed gateways.
    • This would be more work to set up and maintain, and I'm not sure it's worth it, but there's an argument to be made.
    • I will admit to having some sympathy for Corey Quinn's "Unpleasant and Not Recommended" tagline.
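
Two of those items lend themselves to quick sketches. For NAT gateway monitoring, a simple CloudWatch alarm on inbound bytes might look something like this (the threshold, gateway ID, and SNS topic are all placeholders):

# Alarm if a NAT gateway processes more than ~100 GB inbound in an hour
aws cloudwatch put-metric-alarm \
  --alarm-name nat-gateway-high-inbound \
  --namespace AWS/NATGateway --metric-name BytesInFromDestination \
  --dimensions Name=NatGatewayId,Value=nat-0123456789abcdef0 \
  --statistic Sum --period 3600 --evaluation-periods 1 \
  --threshold 100000000000 --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:ca-central-1:123456789012:billing-alerts

And for the endpoint route, the S3 gateway endpoint is free, while ECR needs interface endpoints (which have their own hourly and per-GB charges, though typically far less than NAT data processing); again, all IDs are placeholders:

# Gateway endpoint for S3 so S3 traffic bypasses the NAT gateway
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway --service-name com.amazonaws.ca-central-1.s3 \
  --route-table-ids rtb-0123456789abcdef0

# Interface endpoint for ECR image pulls; Fargate needs both ecr.dkr and ecr.api
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface --service-name com.amazonaws.ca-central-1.ecr.dkr \
  --subnet-ids subnet-0123456789abcdef0 --private-dns-enabled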

None of these are immediately critical, but if we don't address them, these and similar issues will recur and bring new and exciting AWS bill surprises with them.
