DEV Community

authagonal

Originally published at authagonal.io

Take your Terraform state off the public internet (without standing up a VPN)

Your Terraform state file is the single most sensitive artifact in your cloud. It is a complete map of every resource you run, and depending on your providers it also holds secrets in plaintext: connection strings, generated passwords, keys. In the usual remote setup that file lives in a cloud storage account with a public endpoint, locked down with nothing more than an access key. If that key leaks, the attacker doesn't have to enumerate your infrastructure. You handed them the diagram.

We sell authentication. An auth vendor that leaves the keys to its own kingdom on the open internet has no business holding yours. So our production state account has no public surface at all. Getting there has three traps, and we walked into the shape of each one before getting it right.

Trap one: the chicken and egg

Remote state needs a backend that already exists before Terraform can run. But the backend is itself infrastructure, and you'd like Terraform to manage it. You cannot store the state of a storage account inside that same storage account before it exists.

The way out is a deliberate two-phase bootstrap. Phase one runs with local state and creates only the foundation: the state storage account, the network it will live in, and the access path. Phase two flips the backend block from local to remote and migrates the now-existing state file up into the account it just created. From that point on, the bootstrap layer manages itself remotely like everything else. It's a few minutes of feeling like you're standing on a ladder you're still building, and then it's done forever.
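The backend flip can be sketched as follows. This is a minimal illustration that assumes Azure and Terraform's `azurerm` backend, a guess based on the vocabulary used here; every resource name in it is a hypothetical placeholder.

```hcl
terraform {
  # Phase one: leave the backend unset (or declare "local") so state stays on
  # disk while the storage account, network, and access path are created.
  #
  # backend "local" {}

  # Phase two: once the account exists, point the backend at it and migrate.
  # Terraform allows only one backend block, so this replaces the one above.
  # All names below are hypothetical placeholders.
  backend "azurerm" {
    resource_group_name  = "rg-tfstate"
    storage_account_name = "examplestate"
    container_name       = "tfstate"
    key                  = "bootstrap.tfstate"
  }
}
```

After swapping the block, running `terraform init -migrate-state` prompts to copy the existing local state into the new backend.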

Trap two: the reach problem, and the VPN gateway tax

Making the account private sounds like one line: turn off public network access and put a private endpoint in front of it. Now the storage account is reachable only from inside your virtual network (VNet). Which is exactly the problem, because the thing that needs to reach it most often, your CI pipeline, is not inside your virtual network. Neither are you.
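For a sense of what the private endpoint involves, here is a rough sketch assuming Azure and the `azurerm` provider. The input variables and resource names are hypothetical, and attribute names can shift between provider versions, so treat it as an outline, not a drop-in module.

```hcl
# Hypothetical inputs; none of these names come from the original setup.
variable "resource_group_name" { type = string }
variable "location" { type = string }
variable "vnet_id" { type = string }
variable "subnet_id" { type = string }
variable "state_storage_account_id" { type = string }

# Private DNS zone so the blob hostname resolves to the private IP
# for clients inside the virtual network.
resource "azurerm_private_dns_zone" "blob" {
  name                = "privatelink.blob.core.windows.net"
  resource_group_name = var.resource_group_name
}

resource "azurerm_private_dns_zone_virtual_network_link" "blob" {
  name                  = "tfstate-blob-link"
  resource_group_name   = var.resource_group_name
  private_dns_zone_name = azurerm_private_dns_zone.blob.name
  virtual_network_id    = var.vnet_id
}

# The private endpoint itself, targeting the blob sub-resource of the
# state storage account.
resource "azurerm_private_endpoint" "state" {
  name                = "pe-tfstate"
  location            = var.location
  resource_group_name = var.resource_group_name
  subnet_id           = var.subnet_id

  private_service_connection {
    name                           = "tfstate-blob"
    private_connection_resource_id = var.state_storage_account_id
    subresource_names              = ["blob"]
    is_manual_connection           = false
  }

  private_dns_zone_group {
    name                 = "default"
    private_dns_zone_ids = [azurerm_private_dns_zone.blob.id]
  }
}
```

Note that this sketch only adds the private path. It deliberately does not touch public network access on the account.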

The textbook answer is to stand up the cloud provider's own access path: a managed VPN gateway, a bastion host, or point-to-site with client certificates. Anyone who has done this knows the tax: the gateway is expensive and slow to provision, point-to-site means minting and rotating client certs, every new operator is a setup ritual, and a bastion is one more box to patch and pay for. It is a lot of standing infrastructure whose entire job is "let trusted people reach a private thing."

We skipped all of it. Instead of a VPN gateway, a small zero-trust connector, the category that Tailscale, Twingate, and Cloudflare Access occupy, runs as a container inside the VNet. It joins an identity-aware mesh. Authorized people and the CI pipeline reach the private endpoint through that mesh, authenticated per identity, with access scoped to exactly the one resource that needs it. No gateway, no public IP, no certificates to rotate, no jump box. CI brings up its connection for the length of a run and tears it down after. The reach problem goes away without the standing-infrastructure bill.

Trap three: the lockdown you cannot do in one step

The obvious instinct is to declare the account private from the start, in the same Terraform apply that creates it. Do that and you cut the cord before the private path exists: the apply needs to write state through the public endpoint, the private endpoint and DNS aren't wired yet, and you lock Terraform, and yourself, out of the very account you're creating. We reasoned through this before triggering it, which is the one time the dry run lives in your head instead of the terminal.

So the lockdown is a deliberate separate step, run only after the network, the private endpoint, and the connector are all up and proven. One command flips public network access to disabled. The moment it lands, the public endpoint is gone, the old IP allowlist becomes irrelevant because there's no public surface left to allow anything onto, and every future run reaches state over the mesh. We verified it the only way that counts: by checking that the account now answers from the private path and refuses the public one.
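The step above is described as a single command. An alternative way to express the same staging inside Terraform, which may differ from what was actually run, is a flag that defaults to open and is flipped in a later apply. This sketch assumes Azure and the `azurerm` provider; the variable and resource names are hypothetical.

```hcl
# Hypothetical inputs and names, for illustration only.
variable "resource_group_name" { type = string }
variable "location" { type = string }

# Stays false during bootstrap; set to true only after the private endpoint,
# DNS, and connector are up and proven.
variable "lock_down_state_account" {
  type    = bool
  default = false
}

resource "azurerm_storage_account" "state" {
  name                     = "examplestate"
  resource_group_name      = var.resource_group_name
  location                 = var.location
  account_tier             = "Standard"
  account_replication_type = "LRS"

  # Public endpoint is available until the flag is flipped.
  public_network_access_enabled = !var.lock_down_state_account
}
```

Under this approach, the run that sets `lock_down_state_account = true` should already be reaching state over the private path, since it still has to write state once the public endpoint is gone.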

The other half: no password to steal

Killing the public endpoint removes the network door. The matching move is making sure there's no standing key behind it either. Our pipeline holds no cloud credential. It authenticates with workload identity federation: the CI system presents a short-lived OIDC token, the cloud trusts that token for one specific repository, and hands back access that expires in minutes. There's no service principal secret sitting in a vault waiting to leak, because there's no secret to begin with.
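As a sketch of the trust relationship, here is roughly what it looks like on Azure with GitHub Actions as the CI system. Both are assumptions, as is the choice of a user-assigned managed identity; the names, the repository in `subject`, and the exact attribute set (which varies by provider version) are illustrative only.

```hcl
# Hypothetical inputs; not taken from the original setup.
variable "resource_group_name" { type = string }
variable "location" { type = string }

# Identity the pipeline assumes. It has no password or client secret.
resource "azurerm_user_assigned_identity" "ci" {
  name                = "id-terraform-ci"
  location            = var.location
  resource_group_name = var.resource_group_name
}

# Trust OIDC tokens from one issuer, for one repository and branch only.
resource "azurerm_federated_identity_credential" "ci" {
  name                = "ci-main-branch"
  resource_group_name = var.resource_group_name
  parent_id           = azurerm_user_assigned_identity.ci.id
  audience            = ["api://AzureADTokenExchange"]
  issuer              = "https://token.actions.githubusercontent.com"
  subject             = "repo:example-org/example-repo:ref:refs/heads/main"
}

# In the pipeline's configuration, the provider exchanges the CI-issued
# token instead of reading a stored secret.
provider "azurerm" {
  features {}
  use_oidc = true
}
```

The `subject` claim is what pins the trust to a specific repository: a token minted for any other repository or branch does not match and is rejected.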

State access follows the same rule. Terraform reads and writes the state blob with a short-lived directory token tied to that identity, not the storage account's access key. So the two things an attacker most wants, a way in and a credential to use it, come out as a private-only endpoint and a token that was already expiring while they read it. Nothing static to lift.
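On Azure with the `azurerm` provider (an assumption), token-based state access comes down to a data-plane role for the pipeline's identity. A minimal sketch, with hypothetical variable names:

```hcl
# Hypothetical inputs: the state account's resource ID and the object ID
# of the identity the pipeline runs as.
variable "state_storage_account_id" { type = string }
variable "ci_principal_id" { type = string }

# Data-plane permission to read and write state blobs using a directory
# token, with no need for the storage account's access key.
resource "azurerm_role_assignment" "ci_state" {
  scope                = var.state_storage_account_id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = var.ci_principal_id
}
```

On the Terraform side, the `azurerm` backend has a `use_azuread_auth` setting that makes it authenticate to the blob with a directory token, and a `use_oidc` setting for obtaining that token from the CI system's OIDC credential.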

Why bother

State files don't usually leak because someone broke encryption. They leak because the account was public, a key ended up in a log or a fork or a laptop, and nothing else stood in the way. Taking the public endpoint away removes the entire "the key leaked" class of incident from the threat model: a leaked key is useless to anyone who is not also on the mesh.

It is the same principle we build the product on: don't gate security behind effort or tier, just do it, because the alternative is the thing you regret at 2am. Every security feature we ship to customers, SSO and SAML, SCIM, MFA, enforced webhooks, audit export, is on at every plan, not held back for an enterprise upsell. See what's included.
