AWS Cost Optimization: When Reducing Cross-Zone Data Transfer Actually Costs More
Introduction
When reviewing AWS costs, data transfer charges often stand out—especially for high-traffic organizations. Unlike compute or storage resources, you can't easily tag data transfer costs to identify which project or environment generated them. The bill just says "data transfer between zones" with a growing number.
Most AWS cost optimization guides focus on the obvious wins:
- Reduce internet egress traffic
- Minimize cross-region data transfer
- Eliminate unnecessary data movement
But cross-zone data transfer is different. AWS designed many services to span multiple availability zones for redundancy: ECS/EKS clusters, RDS, MSK (Managed Kafka), Application Load Balancers. You can't simply force everything into a single zone without sacrificing availability.
This article explores two specific scenarios where the "obvious" cost optimization—eliminating cross-zone traffic—can actually increase your total costs or break your architecture.
A Brief History of NAT in AWS
Let me set the stage with how we got here:
Before AWS (2000s)
Organizations ran bare-metal NAT boxes in colocation facilities for outbound traffic. Cost: $1,500+ for entry-level hardware or third-party appliances.
Early AWS Days (2010-2012)
We launched EC2 instances as NAT boxes. Cost: ~$50/month. Huge improvement.
AWS Managed NAT Gateway (2015+)
AWS launched NAT Gateway as a managed service. Cost: $32/month per gateway + $0.045/GB processed. Even better.
Following legacy convention, all internal resources needing outbound internet access were routed through this single NAT Gateway. Over the years, this created significant cross-zone traffic as instances in zones B and C sent traffic to the NAT in zone A.
No one noticed... until the organization decided to review every penny of AWS spending.
Optimization #1: NAT Gateway Per Zone
The "Obvious" Solution
The optimization seems straightforward:
- Create one NAT Gateway per availability zone
- Configure route tables so each zone uses its local NAT
- Eliminate cross-zone data transfer for outbound traffic
- Bonus: Better availability if one NAT fails
In Terraform, this looks cleaner and more organized. Excellent!
But Wait... Is That Actually Saving Money?
Here's the reality check most teams miss:
NAT Gateway costs: $32/month × 12 months = $384/year per gateway
Cross-zone data transfer: $0.01/GB
For this optimization to break even, each additional NAT Gateway must eliminate at least 38,400 GB (~38.4 TB) of cross-zone traffic per year.
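That break-even can be sanity-checked in a few lines. This is a sketch using the article's illustrative list prices; actual rates vary by region:

```python
# Break-even check for adding one NAT Gateway to avoid cross-zone transfer.
NAT_GATEWAY_MONTHLY_USD = 32.0   # fixed cost per gateway (illustrative)
CROSS_ZONE_USD_PER_GB = 0.01     # inter-AZ data transfer rate

def break_even_gb_per_year(gateway_monthly: float = NAT_GATEWAY_MONTHLY_USD,
                           transfer_per_gb: float = CROSS_ZONE_USD_PER_GB) -> float:
    """GB of cross-zone traffic one extra gateway must eliminate per year to pay for itself."""
    return gateway_monthly * 12 / transfer_per_gb

print(break_even_gb_per_year())  # ~38,400 GB, i.e. ~38.4 TB/year per gateway
```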
Real-World Math
Let's say you have:
- 20 production VPCs
- 3 availability zones per VPC (currently 1 NAT per VPC)
- 3 lower environments (staging, QA, dev) with similar setup
Adding NAT gateways:
- Prod: 20 VPCs × 2 additional NATs = 40 new NATs
- Lower envs: 3 envs × 20 VPCs × 2 additional NATs = 120 new NATs
- Total new NATs: 160
Annual cost of new NATs: 160 × $384 = $61,440/year
Required savings to break even: 6,144 TB of cross-zone traffic eliminated
Actual outbound traffic (check your NAT Gateway metrics): In most organizations, outbound traffic is relatively small—maybe 100-500 GB/month per VPC.
Even if 50% of production traffic crosses zones (lower environments typically carry far less), you're looking at:
- 20 VPCs × 500 GB × 50% × 12 months = 60 TB/year
- Cost at $0.01/GB = $600/year saved
Result: Spend $61,440 to save $600.
That's not optimization; that's spending roughly 100x more than you save.
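Putting the scenario's numbers together, with the same illustrative prices as above:

```python
# Worked version of the scenario: cost of new NAT Gateways vs. cross-zone savings.
NAT_ANNUAL_USD = 384.0        # $32/month x 12 (illustrative)
CROSS_ZONE_USD_PER_GB = 0.01

vpcs_per_env = 20
extra_nats_per_vpc = 2        # zones B and C each get their own gateway
environments = 4              # prod + staging + QA + dev

new_nats = vpcs_per_env * extra_nats_per_vpc * environments
annual_nat_cost = new_nats * NAT_ANNUAL_USD

# Production traffic estimate: 500 GB/month per VPC, half of it crossing zones.
saved_gb_per_year = vpcs_per_env * 500 * 0.5 * 12
annual_savings = saved_gb_per_year * CROSS_ZONE_USD_PER_GB

print(new_nats, annual_nat_cost, annual_savings)
# 160 gateways, ~$61,440/year spent, ~$600/year saved
```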
When It Actually Makes Sense
NAT per zone IS cost-effective when:
- Cross-zone outbound traffic exceeds roughly 3 TB/month per additional gateway
- High availability is critical (worth paying for redundancy)
- You have only a few VPCs (fixed NAT cost is manageable)
Takeaway: Check your actual traffic volumes in CloudWatch metrics before implementing this "optimization."
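One place to get those volumes is the `BytesOutToDestination` metric in the `AWS/NATGateway` CloudWatch namespace. A small helper can convert the `Sum` datapoints returned by `get_metric_statistics` into GB per month; the sample datapoints below are hypothetical, standing in for a live API call:

```python
def monthly_gb(datapoints):
    """Sum CloudWatch 'Sum' statistics (in bytes) and convert to GB."""
    total_bytes = sum(dp["Sum"] for dp in datapoints)
    return total_bytes / 1e9

# Hypothetical daily Sum datapoints, as returned by get_metric_statistics
# for AWS/NATGateway BytesOutToDestination over part of a month.
sample = [{"Sum": 12e9}, {"Sum": 9e9}, {"Sum": 15e9}]
print(monthly_gb(sample))  # 36.0 GB -- nowhere near the ~3 TB/month break-even
```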
A Similar Pattern: VPC Endpoints
VPC Endpoints follow the same pattern:
Historical context:
- VPC Endpoints didn't exist when AWS launched
- ALL traffic to S3, CloudWatch, SSM, Secrets Manager, etc. went through NAT to the internet and back
- Sounds crazy, but that was reality
Today's dilemma:
- VPC Interface Endpoint costs: ~$7.20/month per AZ + $0.01/GB processed
- AWS has hundreds of endpoint-supported services
- Should you create endpoints for all of them?
The answer: Calculate based on actual traffic to each service.
For example:
- S3 endpoint: Probably yes (high traffic volume, and S3's gateway endpoint is free)
- Secrets Manager endpoint: Maybe not (low traffic, API calls only)
- Athena endpoint: Depends on your query patterns
The decision is data-driven, not default.
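A rough break-even for a single interface endpoint, assuming the traffic would otherwise flow through a NAT Gateway at $0.045/GB processing. This is a sketch: prices vary by region, and the comparison ignores cross-zone hops:

```python
# When does an interface endpoint beat routing the same traffic through NAT?
NAT_PER_GB = 0.045               # NAT Gateway data processing (illustrative)
ENDPOINT_PER_GB = 0.01           # interface endpoint data processing
ENDPOINT_MONTHLY_PER_AZ = 7.20   # interface endpoint fixed cost per AZ

def endpoint_break_even_gb(azs: int = 3) -> float:
    """GB/month of service traffic above which the endpoint is cheaper than NAT."""
    fixed = ENDPOINT_MONTHLY_PER_AZ * azs
    return fixed / (NAT_PER_GB - ENDPOINT_PER_GB)

print(round(endpoint_break_even_gb(3), 1))  # ~617 GB/month for a 3-AZ endpoint
```

So a chatty, low-volume service like Secrets Manager rarely clears the bar, while a data-heavy service usually does.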
Optimization #2: Internal ALBs and Service Discovery
This one involves both cost AND architecture tradeoffs.
Another Brief History
Before cloud (2000s-2010s):
Many organizations developed in-house service discovery systems for internal microservices communication. (Different names at different companies, same concept.)
Early AWS days:
No equivalent service existed. Application Load Balancers became the industry standard for both external and internal microservices.
Today:
AWS offers Cloud Map (Service Discovery), which could replace internal ALBs in theory.
What's the Problem with Internal ALBs?
ALBs work great... until you calculate the costs at scale.
Fixed costs per ALB:
- ~$16.20/month ($0.0225/hour × 720 hours)
- Plus: $0.008 per LCU-hour (varies with traffic)
Data transfer costs:
By design, ECS/EKS clusters run instances across availability zones. When Service A calls Service B through an internal ALB:
- Service A → ALB: Cross-zone data transfer ($0.01/GB)
- ALB → Service B: Cross-zone data transfer ($0.01/GB)
Every internal API call potentially incurs 2× cross-zone charges.
Real-World Example
Organization with 80 internal services:
- Production: 80 ALBs
- Staging: 80 ALBs
- QA: 80 ALBs
- Dev: 80 ALBs
ALB fixed costs alone:
- 320 ALBs × $16.20 × 12 months = $62,208/year
Plus:
- LCU charges based on traffic
- Cross-zone data transfer (2× per request)
For high-traffic services, this becomes significant.
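Pulling the fixed costs and transfer charges together (the 10 TB/month internal traffic figure below is a hypothetical for illustration, not from the example):

```python
# Fixed ALB cost from the example, plus the 2x cross-zone transfer per call.
ALB_MONTHLY_USD = 16.20
CROSS_ZONE_USD_PER_GB = 0.01

albs = 80 * 4                          # 80 services x 4 environments
fixed_annual = albs * ALB_MONTHLY_USD * 12

# Hypothetical internal traffic: 10 TB/month, with both hops crossing zones.
monthly_internal_gb = 10_000
transfer_annual = monthly_internal_gb * 2 * CROSS_ZONE_USD_PER_GB * 12

print(fixed_annual, transfer_annual)
# ~$62,208/year fixed + ~$2,400/year transfer at 10 TB/month, before LCU charges
```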
The AWS Service Discovery Alternative
AWS Cloud Map (Service Discovery) eliminates:
- ALB fixed costs
- Cross-zone data transfer charges (direct service-to-service)
- LCU charges
Sounds perfect for cost optimization!
The Architectural Problem
Critical limitation: AWS Service Discovery only works within the same VPC.
This means: Services in different VPCs—even in the same region and same account—cannot communicate via Service Discovery. Cross-VPC service discovery does not work.
Why not? AWS Service Discovery uses DNS-based service discovery with Route 53 private hosted zones. The private hosted zone is associated with a specific VPC. When a service in VPC B tries to discover a service in VPC A:
- VPC B doesn't have access to VPC A's private hosted zone
- DNS resolution fails (can't resolve the service name)
- Even with VPC Peering or Transit Gateway providing network connectivity, DNS discovery still fails
You CAN manually associate the Route 53 hosted zone with multiple VPCs as a workaround, but this:
- Requires manual configuration for each VPC
- Adds operational complexity managing DNS associations
- Defeats the entire purpose of "simple service discovery"
At that point, you're just trading ALB complexity for DNS management complexity.
This breaks a fundamental architecture pattern from the multi-account article: Project Team Autonomy.
If you gave each team their own VPC (Strategy 1 or 3 from the previous article), switching to Service Discovery forces you to either:
- Consolidate teams into fewer VPCs (losing isolation)
- Manually associate Route 53 hosted zones across all VPCs (operational complexity, defeats simplicity)
- Keep ALBs for cross-VPC communication (partial benefit only, still paying for ALBs)
The Real Tradeoff
This isn't a pure cost optimization—it's an architectural tradeoff:
Option A: Keep internal ALBs
- ✅ Teams maintain VPC isolation
- ✅ Cross-VPC communication works seamlessly
- ❌ Higher costs ($62k+/year for 320 ALBs)
- ❌ Cross-zone data transfer overhead
Option B: Switch to Service Discovery
- ✅ Eliminate ALB fixed costs
- ✅ Reduce cross-zone data transfer
- ❌ Lose cross-VPC communication
- ❌ Break team autonomy architecture
- ❌ Requires VPC consolidation or Transit Gateway
For many organizations, the architectural cost of losing team autonomy outweighs the dollar savings.
Key Takeaways
1. "Obvious" Optimizations Aren't Always Savings
The pattern repeats:
- Adding NAT per zone sounds logical
- Creating VPC endpoints for everything seems safe
- Replacing ALBs with Service Discovery looks cost-effective
But each requires actual data to determine if savings exceed new costs.
2. AWS Architecture Forces Tradeoffs
Cross-zone data transfer costs exist because AWS resources are multi-zone by design:
- You can't eliminate the traffic without losing redundancy
- "Optimizations" often just shift costs to different line items
- Sometimes the shifted cost is higher than the original
3. Hidden Costs of "Cheaper" Solutions
When evaluating alternatives, consider:
- Fixed costs: NAT Gateways, VPC Endpoints, ALBs have monthly charges
- Architectural constraints: Service Discovery's VPC limitation
- Operational complexity: Managing more resources costs engineering time
- Availability implications: Single-zone NAT is a single point of failure
4. Calculate Before Implementing
Before any "optimization":
- Check CloudWatch metrics for actual traffic volumes
- Calculate break-even points
- Consider architectural implications
- Factor in operational overhead
$600 saved on data transfer isn't worth $60,000 spent on NAT Gateways.
5. Sometimes the "Expensive" Option is Correct
Paying for internal ALBs might be the right choice if:
- Team autonomy is architecturally important
- Cross-VPC communication is a core requirement
- The cost is acceptable for the business value provided
Cost optimization doesn't mean "minimize every line item." It means "maximize value per dollar spent."
Final Thought
Cross-zone data transfer optimization is like the multi-account architecture problem: AWS's design decisions create tradeoffs that seem fixable with "obvious" solutions, but the obvious solution often creates new problems or higher costs.
The goal isn't to eliminate cross-zone traffic at all costs. The goal is to understand:
- Where your actual costs come from (check the data!)
- What each optimization really costs (including hidden costs)
- Whether the architectural tradeoffs are acceptable
Sometimes the best optimization is accepting the cost and focusing on higher-value improvements elsewhere.
Building cost-effective AWS infrastructure? Optimizing data transfer costs? Share your experiences and lessons learned in the comments.
Connect with me on LinkedIn: https://www.linkedin.com/in/rex-zhen-b8b06632/
I share insights on cloud architecture, SRE practices, and cost optimization strategies. Let's connect and learn together!