<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Leonard O Sullivan</title>
    <description>The latest articles on DEV Community by Leonard O Sullivan (@leonard_o_sullivan).</description>
    <link>https://dev.to/leonard_o_sullivan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3815697%2F57ba7313-1e54-42f2-838e-f70d0f29f562.jpeg</url>
      <title>DEV Community: Leonard O Sullivan</title>
      <link>https://dev.to/leonard_o_sullivan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/leonard_o_sullivan"/>
    <language>en</language>
    <item>
      <title>nat-zero: scale-to-zero NAT instances for AWS</title>
      <dc:creator>Leonard O Sullivan</dc:creator>
      <pubDate>Tue, 28 Apr 2026 07:26:07 +0000</pubDate>
      <link>https://dev.to/leonard_o_sullivan/nat-zero-scale-to-zero-nat-instances-for-aws-55co</link>
      <guid>https://dev.to/leonard_o_sullivan/nat-zero-scale-to-zero-nat-instances-for-aws-55co</guid>
      <description>&lt;p&gt;nat-zero is a Terraform module that replaces always-on NAT infrastructure with on-demand NAT instances that start when your workloads need internet access and shut down when they don't. We built it because we're cheap and our workloads are weird.&lt;/p&gt;

&lt;p&gt;Today we're open sourcing it.&lt;/p&gt;

&lt;h2&gt;The problem&lt;/h2&gt;

&lt;p&gt;At machine.dev we run GPU workloads across every availability zone in six AWS regions. The workloads live in private subnets — no public IPs, no direct internet access. They need NAT to reach the outside world for things like pulling packages and container images.&lt;/p&gt;

&lt;p&gt;Our workloads are sporadic. We have an Intelligent Tiering system that hunts for the cheapest GPU globally, or within whatever regions the user has defined. Some AZs might not see a single job for days. Then suddenly they see fifty.&lt;/p&gt;

&lt;p&gt;AWS gives you two standard options for NAT and neither of them made sense for us:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NAT Gateway&lt;/strong&gt; costs about $36/month per AZ. We operate in every AZ across six regions. That's a lot of AZs, and most of them are sitting empty most of the time. The per-GB data processing charge on top of that would have eaten us alive. NAT Gateway is built for steady-state traffic. Our traffic is the opposite of steady-state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Always-on NAT instances&lt;/strong&gt; (like &lt;a href="https://github.com/AndrewGuenther/fck-nat" rel="noopener noreferrer"&gt;fck-nat&lt;/a&gt;, which is genuinely good) run about $7-8/month per AZ. Better, but we were too tight-fisted to pay for instances running 24/7 in AZs that had zero workloads. Paying for a NAT instance to sit idle in ap-southeast-2b for four days straight because nobody needed a GPU in Sydney this week felt like the kind of waste that keeps you up at night. It did keep me up at night.&lt;/p&gt;

&lt;p&gt;We needed a third option: NAT that scales to zero when nothing is running, starts up when something is, and costs almost nothing in between.&lt;/p&gt;

&lt;p&gt;So we built one.&lt;/p&gt;

&lt;h2&gt;How nat-zero works&lt;/h2&gt;

&lt;p&gt;The core idea is simple: a single Lambda function watches for EC2 instance state changes via EventBridge. When a workload launches in a private subnet, the Lambda starts a NAT instance in that AZ. When the last workload terminates, it stops the NAT and releases the Elastic IP.&lt;/p&gt;
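
&lt;p&gt;As a rough sketch, that trigger is an EventBridge rule matching EC2 instance state-change events. &lt;code&gt;EC2 Instance State-change Notification&lt;/code&gt; is the real detail-type EC2 emits; exactly which states nat-zero subscribes to is an assumption here:&lt;/p&gt;

```python
# A plausible EventBridge event pattern for the trigger described above.
# The detail-type string is the real one emitted by EC2; the specific
# states listed are an assumption, not nat-zero's actual configuration.
import json

EVENT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    # states that change the answer to "are workloads running in this AZ?"
    "detail": {"state": ["running", "stopped", "terminated"]},
}

print(json.dumps(EVENT_PATTERN, indent=2))
```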

&lt;p&gt;The interesting part is how it makes decisions. The Lambda runs a reconciliation loop — it doesn't care what event triggered it. It just looks at the current state of the AZ and takes one corrective action:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workloads?&lt;/th&gt;
&lt;th&gt;NAT State&lt;/th&gt;
&lt;th&gt;EIP?&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Create NAT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Stopped&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Start NAT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Running&lt;/td&gt;
&lt;td&gt;No EIP&lt;/td&gt;
&lt;td&gt;Allocate and attach EIP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Running&lt;/td&gt;
&lt;td&gt;Has EIP&lt;/td&gt;
&lt;td&gt;Converged. Do nothing.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Running&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Stop NAT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Stopped&lt;/td&gt;
&lt;td&gt;Has EIP&lt;/td&gt;
&lt;td&gt;Release EIP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Stopped&lt;/td&gt;
&lt;td&gt;No EIP&lt;/td&gt;
&lt;td&gt;Converged. Do nothing.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One action per invocation, then it returns. The next event triggers the next step. This keeps the logic dead simple and avoids the kind of race conditions that make infrastructure code age you prematurely.&lt;/p&gt;
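
&lt;p&gt;The table reduces to a small pure function. This is an illustrative sketch, not nat-zero's actual code (the Lambda is Go); the names and action strings here are made up:&lt;/p&gt;

```python
# Hypothetical sketch of the one-action-per-invocation reconciliation
# step from the decision table above. AzState, decide, and the action
# strings are illustrative, not nat-zero's API.
from dataclasses import dataclass

@dataclass
class AzState:
    has_workloads: bool   # any workload instances alive in this AZ?
    nat_state: str        # "none", "stopped", or "running"
    has_eip: bool         # is an Elastic IP currently allocated?

def decide(s: AzState) -> str:
    """Return the single corrective action for the observed AZ state."""
    if s.has_workloads:
        if s.nat_state == "none":
            return "create_nat"
        if s.nat_state == "stopped":
            return "start_nat"
        if not s.has_eip:
            return "attach_eip"
        return "noop"            # converged: running with EIP
    if s.nat_state == "running":
        return "stop_nat"
    if s.has_eip:
        return "release_eip"
    return "noop"                # converged: stopped, no EIP
```

&lt;p&gt;Because the function only ever returns one action, each invocation moves the AZ one step closer to convergence and then gets out of the way.&lt;/p&gt;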

&lt;p&gt;The Lambda runs at concurrency of one. This is deliberate. One writer means no duplicate NAT creation, no double EIP allocation, no start/stop races, and no need for distributed locking. Events that arrive while it's running just queue up. Simple beats clever every time.&lt;/p&gt;

&lt;h2&gt;The dual ENI trick&lt;/h2&gt;

&lt;p&gt;Each NAT instance uses two network interfaces — one private, one public — both pre-created by Terraform. When the NAT instance stops and starts, the ENIs stick around. This means your route tables stay pointed at the right place and you don't have to reconfigure anything on restart. The EIP attaches to the public ENI when the NAT is running and gets released when it stops, so you're not paying the $3.60/month public IPv4 charge on idle AZs.&lt;/p&gt;
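
&lt;p&gt;In boto3 terms, the EIP side of that lifecycle looks roughly like this. &lt;code&gt;allocate_address&lt;/code&gt;, &lt;code&gt;associate_address&lt;/code&gt;, &lt;code&gt;disassociate_address&lt;/code&gt;, and &lt;code&gt;release_address&lt;/code&gt; are real EC2 APIs; the wrapper functions and ENI wiring are hypothetical:&lt;/p&gt;

```python
# Illustrative sketch of the EIP lifecycle on the pre-created public ENI.
# The EC2 client is passed in rather than constructed here, so the logic
# can be exercised against a stub. Not nat-zero's actual code.

def attach_eip(ec2, public_eni_id):
    """Allocate an Elastic IP and bind it to the NAT's public ENI."""
    alloc = ec2.allocate_address(Domain="vpc")
    assoc = ec2.associate_address(
        AllocationId=alloc["AllocationId"],
        NetworkInterfaceId=public_eni_id,
    )
    return alloc["AllocationId"], assoc["AssociationId"]

def release_eip(ec2, allocation_id, association_id):
    """Detach and release the EIP once the NAT stops, ending the
    idle public IPv4 charge for this AZ."""
    ec2.disassociate_address(AssociationId=association_id)
    ec2.release_address(AllocationId=allocation_id)
```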

&lt;p&gt;This is the part that makes the scale-to-zero part actually work without breaking your routing. It's a nice bit of engineering that we're quietly proud of, despite being the kind of people who normally downplay everything.&lt;/p&gt;

&lt;h2&gt;What it costs&lt;/h2&gt;

&lt;p&gt;Here's the part we actually care about:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;State&lt;/th&gt;
&lt;th&gt;nat-zero&lt;/th&gt;
&lt;th&gt;fck-nat&lt;/th&gt;
&lt;th&gt;NAT Gateway&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Idle (no workloads)&lt;/td&gt;
&lt;td&gt;~$0.80/month&lt;/td&gt;
&lt;td&gt;~$7-8/month&lt;/td&gt;
&lt;td&gt;~$36+/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active (workloads running)&lt;/td&gt;
&lt;td&gt;~$7-8/month&lt;/td&gt;
&lt;td&gt;~$7-8/month&lt;/td&gt;
&lt;td&gt;~$36+/month&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That $0.80 idle cost is just the EBS volume sitting there waiting. No instance running, no EIP allocated, no meter ticking. When a workload shows up, you're paying the same as a regular &lt;a href="https://github.com/AndrewGuenther/fck-nat" rel="noopener noreferrer"&gt;fck-nat&lt;/a&gt; instance. When it leaves, you're back to pocket change.&lt;/p&gt;

&lt;p&gt;We run across 22 AZs. NAT Gateway would have cost us $792/month. With nat-zero, idle months cost us $17.60. That's the kind of difference that turns your AWS bill from "we need to have a meeting about this" into "that seems fine."&lt;/p&gt;
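
&lt;p&gt;The arithmetic behind those totals, spelled out:&lt;/p&gt;

```python
# Monthly totals for 22 AZs, using the per-AZ figures quoted above.
azs = 22
nat_gateway = azs * 36.00    # always-on NAT Gateway, ~$36/AZ/month
nat_zero_idle = azs * 0.80   # nat-zero idle, ~$0.80/AZ/month (EBS only)

print(f"NAT Gateway:   ${nat_gateway:.2f}/month")    # $792.00/month
print(f"nat-zero idle: ${nat_zero_idle:.2f}/month")  # $17.60/month
```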

&lt;h2&gt;How fast is it?&lt;/h2&gt;

&lt;p&gt;The honest answer: about 10 seconds for a cold start. A NAT instance that's completely new takes roughly 10.7 seconds from workload launch to internet connectivity. Restarting a stopped NAT is about 8.5 seconds. If the NAT is already running, it's instant.&lt;/p&gt;

&lt;p&gt;For us this was fine. Our workloads are resilient to a brief wait for network access. The GPU instances themselves need time to initialize anyway — the NAT instances actually started faster than the workloads did, so in practice nobody was ever waiting for NAT. The 10-second cold start is a number we measured carefully and never actually experienced as a real delay.&lt;/p&gt;

&lt;p&gt;The Lambda itself is a compiled Go binary running on ARM64. Cold start is 55ms. Typical invocation is 400-600ms. Peak memory is 29MB out of 128MB allocated. It's fast because it does very little per invocation, which is the whole point.&lt;/p&gt;

&lt;h2&gt;Why open source&lt;/h2&gt;

&lt;p&gt;Same reason as everything else we open source: it's useful, and keeping it to ourselves doesn't make it more useful. We built nat-zero to solve a real problem we had. Other people running sporadic workloads in private subnets have the same problem. The module is self-contained, well-tested (integration tests run against real AWS infrastructure on every PR), and MIT licensed.&lt;/p&gt;

&lt;p&gt;If your workloads are bursty, spread across multiple AZs, or just not running often enough to justify always-on NAT, this might save you some money. If your workloads are steady-state and high-throughput, NAT Gateway is probably still the right call. We're not pretending this is the right solution for everyone. We're saying it was the right solution for us, and it might be for you too.&lt;/p&gt;

&lt;h2&gt;Try it&lt;/h2&gt;

&lt;p&gt;The repo is at &lt;a href="https://github.com/MachineDotDev/nat-zero" rel="noopener noreferrer"&gt;github.com/MachineDotDev/nat-zero&lt;/a&gt; and the docs are at &lt;a href="https://nat-zero.machine.dev/" rel="noopener noreferrer"&gt;nat-zero.machine.dev&lt;/a&gt;. It's a Terraform module — point it at your VPC, tell it which AZs and subnets to manage, and it handles the rest.&lt;/p&gt;

&lt;p&gt;If you find a bug, open an issue. If you find a way to make it cheaper, we definitely want to hear about it.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
