
VPC-connected Bedrock AgentCore Runtime-hosted agents: beware of NAT Gateway costs!

Last week I received a cost anomaly alert from AWS. The alert pointed at my training account, flagging an unexpected $29 charge under — oddly enough — Amazon Elastic Block Store. The usage type, however, told a different story: NatGateway-Bytes. 659 GB of data had flowed through my NAT Gateway in six days.

I had recently deployed a voice agent on Bedrock AgentCore Runtime in VPC mode, using a NAT Gateway for outbound internet access (required for WebRTC TURN relay) - see my blog post here. The VPC had been created specifically for this agent, so the suspect was obvious. But I wanted ground truth before jumping to conclusions. Was it WebRTC traffic? Something else?

Starting the investigation

My first stop was CloudWatch metrics on the NAT Gateway. The BytesOutToDestination metric (traffic from the container to the internet) showed only 2.1 GB total over the six days. Negligible. But BytesInFromDestination (traffic from the internet into the container through the NAT) told a very different story:

| Date | Inbound through NAT |
| --- | --- |
| Mar 26 | 6.3 GB |
| Mar 27 | 240.3 GB |
| Mar 28 | 149.1 GB |
| Mar 29 | 149.8 GB |
| Mar 30 | 102.3 GB |
| Mar 31 | 15.0 GB |
| Apr 01 | 5.4 GB (partial) |

This imbalance between inbound and outbound flows argued against WebRTC as the culprit: voice media streams would have produced roughly symmetric traffic in both directions.

Moreover, the ActiveConnectionCount metric showed a steady ~90 connections 24/7, even when nobody was using the agent. The hourly pattern was remarkably regular — alternating between ~850 MB and ~430 MB per hour, around the clock.

Just to be sure, I checked CloudTrail for InvokeAgentRuntime events between March 28 and March 30. Zero. No user activity at all during the period with the heaviest traffic. The agent was completely idle.

NAT Gateway traffic vs AgentCore invocations

Enabling VPC Flow Logs

I needed to see where the traffic was coming from. I enabled VPC Flow Logs on the VPC (should I have done that on day 1? Sure, but this was a POC workload), sending them to a CloudWatch log group, and ran a Logs Insights query to identify the top talkers:

stats sum(bytes) as totalBytes by srcAddr, dstAddr, dstPort
| sort totalBytes desc
| limit 20
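If you want to sanity-check the query logic offline, the same aggregation is a few lines of Python. This is a sketch over hypothetical flow-log tuples (the real records live in CloudWatch), grouping on the same fields the Logs Insights query uses:

```python
from collections import defaultdict

# Hypothetical flow-log records: (srcAddr, dstAddr, dstPort, bytes).
# In reality these come out of the VPC Flow Logs in CloudWatch.
records = [
    ("52.216.58.42", "10.0.0.144", 31175, 270_100_000),
    ("16.15.207.229", "10.0.0.144", 62935, 263_700_000),
    ("52.216.58.42", "10.0.0.144", 31175, 1_000_000),
]

# stats sum(bytes) by srcAddr, dstAddr, dstPort
totals = defaultdict(int)
for src, dst, port, nbytes in records:
    totals[(src, dst, port)] += nbytes

# | sort totalBytes desc | limit 20
top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:20]
for (src, dst, port), nbytes in top:
    print(f"{src} -> {dst}:{port}  {nbytes / 1e6:.1f} MB")
```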

The results over a two-hour window showed a handful of IP addresses responsible for all the heavy traffic:

    52.216.58.42 ->       10.0.0.144: 31175     270.1 MB
    16.15.207.229 ->      10.0.0.144: 62935     263.7 MB
    16.15.191.63 ->       10.0.0.144: 25320     263.6 MB
    52.216.12.24 ->       10.0.0.144: 12542     115.8 MB
    3.5.16.209 ->         10.0.0.144: 30762     113.4 MB
    16.15.199.52 ->       10.0.0.144: 49632     113.3 MB
    54.231.160.154 ->     10.0.0.144: 55754      29.6 MB

The 10.0.0.144 address is the NAT Gateway's private IP. All the traffic was flowing from external IPs, through the NAT, to the AgentCore container ENIs in the private subnets.

Identifying the source

I needed to know what service these IPs belonged to. I used my does-this-ip-belong-to-aws tool, which checks IPs against the official AWS IP ranges published at https://ip-ranges.amazonaws.com/ip-ranges.json.
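Under the hood, that check is essentially prefix matching with the standard-library `ipaddress` module. A minimal sketch, using a hand-picked sample of entries in the shape of ip-ranges.json (the real file has thousands of prefixes, so treat these specific entries as illustrative):

```python
import ipaddress

# Sample entries in the format of ip-ranges.json. Fetch the full,
# authoritative list from https://ip-ranges.amazonaws.com/ip-ranges.json.
prefixes = [
    {"ip_prefix": "52.216.0.0/15", "region": "us-east-1", "service": "S3"},
    {"ip_prefix": "16.15.192.0/20", "region": "us-east-1", "service": "S3"},
    {"ip_prefix": "3.5.16.0/21", "region": "us-east-1", "service": "S3"},
]

def lookup(ip: str):
    """Return (service, region) for the first matching prefix, else None."""
    addr = ipaddress.ip_address(ip)
    for p in prefixes:
        if addr in ipaddress.ip_network(p["ip_prefix"]):
            return p["service"], p["region"]
    return None

print(lookup("52.216.58.42"))
```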

Every single high-traffic IP resolved to Amazon S3 in us-east-1!

All the traffic — every last gigabyte — was S3 pulls flowing through the NAT Gateway.

The fix: S3 Gateway Endpoint

The fix is straightforward and free. An S3 Gateway VPC Endpoint routes S3 traffic directly through the AWS network, bypassing the NAT Gateway entirely. Unlike interface endpoints, gateway endpoints have no hourly charge and no data processing fee.

resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.${var.aws_region}.s3"
  route_table_ids = [
    aws_route_table.private.id,
    aws_route_table.public.id,
  ]
}

One `terraform apply` and the NAT Gateway data transfer cost drops to near zero.

This raises a broader question: why would you ever not have an S3 Gateway Endpoint in a VPC? It's free, takes one resource to create, and prevents exactly this kind of surprise. If you're creating VPCs with private subnets and NAT Gateways, add an S3 Gateway Endpoint as a default. There's no downside. S3 Gateway endpoints are good for your wallet, if not for your soul.

The root cause: warm pool recycling

After filing a support case, the Bedrock AgentCore service team identified the root cause.

AgentCore Runtime maintains a warm pool of VMs to ensure low-latency invocations. Each VM in the pool pulls the container image from ECR — and ECR stores image layers in S3. My container image was ~435 MB compressed.

Three things combined to produce the 659 GB bill:

First, the 21 UpdateAgentRuntime API calls I made on March 27 (a day of heavy debugging and redeployment) each triggered an asynchronous warm pool re-provisioning cycle. Multiple rounds of 10-VM provisioning, each pulling the 435 MB image, produced the ~240 GB spike that day.

Second, the warm pool continued recycling VMs over the following days to keep them fresh and ready. With 10 VMs each pulling the image periodically, the steady ~150 GB/day on March 28-30 is consistent with regular recycling.

Third, after approximately 72 hours with no invocations, the warm pool automatically downscaled from 10 VMs to 1 VM. This explains the drop from ~150 GB/day to ~15 GB/day on March 31.
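Back-of-the-envelope, the steady-state numbers hang together. Assuming every pull traverses the NAT at roughly the compressed image size (an approximation — the per-pull bytes and the recycling interval below are derived, not confirmed by AWS):

```python
image_gb = 0.435    # compressed image size, from the support case
daily_gb = 150.0    # steady-state inbound through the NAT (Mar 28-30)
pool_vms = 10       # warm pool size before the 72-hour downscale

pulls_per_day = daily_gb / image_gb        # total image pulls implied per day
pulls_per_vm = pulls_per_day / pool_vms    # pulls attributable to each VM
recycle_minutes = 24 * 60 / pulls_per_vm   # implied recycling interval per VM

print(f"~{pulls_per_day:.0f} pulls/day, one recycle per VM every ~{recycle_minutes:.0f} min")
```

That works out to a VM being recycled (and re-pulling the image) very roughly every three quarters of an hour — plausible for a freshness mechanism, and expensive when every pull crosses a NAT Gateway.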

The warm pool recycling is expected platform behavior — it's what makes AgentCore able to serve requests with low latency. The problem was that all those S3 pulls were routing through my NAT Gateway at $0.045/GB instead of staying on the AWS internal network.
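The arithmetic on the bill checks out, at the us-east-1 NAT Gateway data processing rate:

```python
nat_rate_per_gb = 0.045   # USD per GB processed, us-east-1 NAT Gateway
total_gb = 659            # NAT-processed gigabytes over the six days

cost = nat_rate_per_gb * total_gb
print(f"NAT data processing: ${cost:.2f}")  # ~$29.66, matching the anomaly alert
```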

Spinning up this many VMs for so few invocations seems to me like firing a bazooka at a fly; I wonder how sustainable that is. Then again, AWS has a good track record of operating profitable businesses at scale: who am I to judge?

Anyway, the service team promised to update the documentation so that fewer users face these (frankly) undue charges.

Takeaways

If you're running Bedrock AgentCore Runtime in VPC mode, three things to keep in mind:

  1. add an S3 Gateway Endpoint to your VPC. It's free and eliminates what turned out to be the dominant source of NAT Gateway data transfer costs — ECR image pulls from the warm pool. AWS has confirmed they are updating their VPC documentation to more prominently recommend this. There is genuinely no reason not to have one in every VPC with private subnets.

  2. be mindful of container image size. My 435 MB image, pulled across a 10-VM warm pool with regular recycling, generated hundreds of gigabytes of transfer. Slimming the image (multi-stage builds, fewer dependencies, Alpine base) directly reduces this cost — even with the S3 endpoint in place, smaller images mean faster cold starts.

  3. monitor your NAT Gateway metrics early. The BytesInFromDestination and BytesOutToDestination metrics in CloudWatch will show you if something unexpected is happening. I only noticed because of the cost anomaly alert — by then, $29 had already been spent. VPC Flow Logs combined with CloudWatch Logs Insights made the diagnosis straightforward once I looked.


Paul Santus is an independent cloud consultant at TerraCloud. He helps organizations build and deploy AI-powered applications on AWS. Connect with him on LinkedIn.
