Takafumi Endo

Posted on Jun 30

Let Aurora Sleep: Multi-Tenant SaaS Cost, Reconsidered with AI IaC

#aws #serverless #architecture #postgres

For most multi-tenant SaaS, the default still rhymes: PostgreSQL on Aurora, behind an API, in a private subnet. Stable, well-understood, the shape you'd draw on a whiteboard.

But I keep feeling a mismatch between what that shape costs and what it actually does while a product is small or bursty. A lot of the time, Vercel + Cloudflare + Supabase or Neon would give me similar real-world performance for less. And yet I get pulled back to AWS and Aurora — not for raw performance, but for enterprise requirements: VPC isolation, audit posture, "it has to live in our AWS org."

What changed for me is the third option in that fork. It used to be "cheap edge stack or heavy AWS stack." Now there's a middle path: keep the AWS/Aurora skeleton the enterprise wants, but redesign it so it stops costing like it's always on — and reach that redesign by sparring with an AI against a real dev deployment, instead of needing to already be an infra specialist.

Here's the whole idea in one picture:

BEFORE — everything flows through Aurora
  Client ── API ── VPC App ── Aurora (always warm)
                      │
                      └── NAT ── OpenAI        (NAT always provisioned)

AFTER — Aurora only for control/commit
  Client ── CloudFront / S3            viewer & artifacts (no DB)
         └─ Hot Lambda ── DynamoDB     hot path / cache / counters (no DB)

  Control actions ── VPC Lambda ── Aurora        (wakes on purpose)
  LLM calls ──────── non-VPC Lambda ── OpenAI     (no NAT, no DB)

One scoping note up front: this is primarily for dev, internal tools, bursty early-stage SaaS, and low-frequency enterprise environments — not a blanket "let your production database sleep" recommendation for steady, high-traffic workloads. It's an optimization for a specific shape of usage, not a rejection of the proven production default.

Not a "Postgres is over" piece, and not a best-practice writeup either. This is something I'm still validating — a design space I mostly reached by sparring with Claude Code and Codex while keeping one eye on cost-performance, then deploying to dev to see what actually held. Notes on where my thinking has drifted, written down mostly so I can find out where it's wrong. If anything here is useful, take it as "you can stumble into shapes like this too," not "do it this way."

Where I've landed for now:

The cost pain usually isn't Aurora. It's making Aurora the center of every request.

Demote Aurora to a thin control plane — canonical state, commit, approval, audit — and let it sleep.

Push viewing, LLM hot paths, and ephemeral state to S3 / CloudFront / DynamoDB / non-VPC Lambda.

What makes this reachable for non-specialists is declarative schema + IaC + AI, validated iteratively against a dev deployment.

The default bills like it's always on

The textbook shape — CloudFront → ALB/API Gateway → app → Aurora, pool model with tenant_id everywhere and RLS for isolation — is fine. The problem is what accretes around it to make it reliable: always-on Aurora, RDS Proxy, NAT Gateway, readers, VPC endpoints, logs.

For dev or a 1–30 person workload, the fixed cost of reliability is wildly out of proportion to the traffic. You're paying ledger-grade rent to store scratch work deleted in an hour.

The mismatch isn't "Aurora is expensive." It's "Aurora is expensive when it can never go idle."

Here's the cost shape the redesign is chasing — not exact dollars, but where the fixed costs go:

BEFORE                              AFTER
- Aurora always warm               - Aurora wakes only for control/commit
- NAT always provisioned           - No NAT for LLM egress
- RDS Proxy holding connections    - No Proxy in the sleep path
- DB hit on health/viewer/LLM      - Viewer/LLM/health paths stay DB-free

The reframe: Aurora as a thin control plane

Don't put Aurora in the path of every request.
Make it a thin control plane: canonical state, approval, audit, commit.
Move viewing, LLM execution, short-lived state, and delivery
to S3 / CloudFront / DynamoDB / non-VPC Lambda.

There's a system of work — high-churn, disposable, fine to lose — and a system of record — money, contracts, audit, singular and strict. The mistake is paying record-grade prices for work-grade state. So split by job:

Layer            Job
S3               artifacts, reports, raw payloads (cheap bulk)
CloudFront       viewer delivery; keeps reads off Aurora
DynamoDB         projections, cache, locks, progress, counters (hot path)
Aurora           tenant, RBAC, manifest, lineage, approval, audit, rollups
non-VPC Lambda   outbound LLM calls; no NAT, never touches the DB

Once viewing and the LLM hot path stop touching Aurora, it stops waking for trivia. That, more than any price knob, moves the bill.

One discipline keeps DynamoDB from quietly becoming a second source of truth: anything stored there should be either TTL-bound, recomputable from Aurora/S3, or an explicit live counter with a reconciliation path. If a value is none of those, it probably wants to be in Aurora.
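That rule is mechanical enough to write down as a check. A minimal sketch, assuming a hypothetical StateSpec description for each kind of value you keep in DynamoDB (the class and field names are mine for illustration, not an AWS API). The one real-service detail it relies on is that DynamoDB TTL reads an epoch-seconds number from an attribute you choose.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StateSpec:
    """Hypothetical description of one kind of value kept in DynamoDB."""
    name: str
    ttl_seconds: Optional[int] = None        # expires on its own
    recomputable_from: Optional[str] = None  # "aurora" or "s3"
    reconciled_counter: bool = False         # live counter with a reconciliation path


def placement(spec: StateSpec) -> str:
    """Return "dynamodb" if the value meets one of the three conditions, else "aurora"."""
    if spec.ttl_seconds is not None and spec.ttl_seconds > 0:
        return "dynamodb"
    if spec.recomputable_from in ("aurora", "s3"):
        return "dynamodb"
    if spec.reconciled_counter:
        return "dynamodb"
    return "aurora"


def expires_at(now_epoch: int, spec: StateSpec) -> Optional[int]:
    """Epoch-seconds value to store in the item's TTL attribute, if TTL-bound."""
    if spec.ttl_seconds is None:
        return None
    return now_epoch + spec.ttl_seconds
```

Running something like this over a list of specs in CI is one way to catch a value that has quietly become canonical without anyone deciding it should be.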

Sleep-first only works if the app lets Aurora sleep

Aurora Serverless v2 can scale to zero in supported configurations (a minimum capacity of 0 ACU, written min_acu = 0 here), and while the cluster is paused the compute charge goes to zero (storage still bills). The flag is the easy part; the discipline is not poking it awake.

Bad:  set min=0, but login/health/LLM/viewer all read Aurora -> never sleeps
Good: only login, admin, manifest commit, rollups, audit wake it on purpose

Two silent traps: in this sleep-first setup, RDS Proxy works against you — it keeps database connections around, which prevents pause — and any open user-initiated connection does the same. So sleep-first wants no Proxy, a small pool, short idle timeouts. (None of this is a knock on RDS Proxy in general; it's doing exactly its job, which happens to be the opposite of what you want here.)

The goal isn't to make Aurora cheap by configuration; it's to make the application structurally capable of not needing Aurora most of the time. The min_acu = 0 flag only pays off once the app no longer reaches for the DB on every request.
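One way to hold that line in code is to make the DB handle itself refuse DB-free routes and drop idle connections quickly. A minimal sketch, assuming a hypothetical connect factory (any callable that returns an object with close()) and an illustrative 30-second idle timeout. The route names mirror the "Good" line above; none of this is an AWS or driver API.

```python
import time
from typing import Callable

# Routes allowed to wake Aurora on purpose (from the "Good" line above).
WAKE_ALLOWED = {"login", "admin", "manifest_commit", "rollups", "audit"}


class UnexpectedDbAccess(RuntimeError):
    """Raised when a path that should be DB-free asks for a connection."""


class SleepFriendlyDb:
    """Opens one connection lazily and closes it after a short idle period."""

    def __init__(self, connect: Callable[[], object],
                 idle_timeout_s: float = 30.0,  # illustrative value, tune per workload
                 clock: Callable[[], float] = time.monotonic):
        self._connect = connect
        self._idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._conn = None
        self._last_used = 0.0

    def connection(self, route: str):
        """Return the connection, but only for routes allowed to wake the DB."""
        if route not in WAKE_ALLOWED:
            raise UnexpectedDbAccess(f"{route} must stay DB-free")
        if self._conn is None:
            self._conn = self._connect()
        self._last_used = self._clock()
        return self._conn

    def close_if_idle(self) -> bool:
        """Close the connection once it has been idle past the timeout."""
        if self._conn is None:
            return False
        if self._clock() - self._last_used < self._idle_timeout_s:
            return False
        self._conn.close()
        self._conn = None
        return True
```

Raising on a disallowed route is deliberately loud: in dev, a viewer or health path that reaches for the DB fails immediately instead of silently keeping Aurora awake.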

No NAT: separate "can reach data" from "can reach the internet"

Dropping NAT is usually pitched as savings (it bills per hour and per GB). The better reason is blast radius:

A Lambda that can touch the data cannot go out.
A Lambda that can go out cannot touch the data.

One function with both powers, if compromised, can read and exfiltrate. Split it: a VPC control Lambda reaches RDS but can't egress; a non-VPC egress Lambda calls the external LLM but has no line to the DB. Cheaper and smaller blast radius at once. (Caveat: an /ask flow hands tenant context to the egress function, so minimize the payload, forbid logging it, keep a request_id audit trail.)

The point isn't that non-VPC is magically safe. The egress Lambda still touches the OpenAI key and whatever prompt/context you pass it, so it still deserves a narrow IAM role, a single-purpose secret, no broad Secrets Manager or S3 read permissions, and strict logging rules. The win is narrower and more durable: it cannot both read the database and ship it somewhere.

And the boundary only holds if the network isn't the only thing keeping the egress Lambda away from the DB. Aurora has to be private, its security group must not allow the egress path, and the egress Lambda must have no Data API access or broad Secrets Manager permissions that would quietly recreate a database path through IAM. "Not in the VPC" is necessary, not sufficient.
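The "cannot do both" rule can be checked mechanically once each function's reach is written down. A minimal sketch, assuming a hypothetical FunctionSpec summary per Lambda that you would fill in from your own IaC output (the field names are illustrative). It treats Data API access as a database path through IAM, and flags an egress function that can read the DB credentials secret as a separate warning.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FunctionSpec:
    """Hypothetical summary of one Lambda's reach, filled in from IaC output."""
    name: str
    in_vpc: bool               # attached to the private subnets
    has_internet_egress: bool  # runs outside the VPC, or has a NAT route
    sg_allows_db: bool         # Aurora's security group admits this function
    has_data_api_access: bool  # IAM allows rds-data actions on the cluster
    can_read_db_secret: bool   # IAM allows reading the DB credentials secret


def can_reach_db(f: FunctionSpec) -> bool:
    """True if either the network path or the IAM path leads to the database."""
    network_path = f.in_vpc and f.sg_allows_db
    return network_path or f.has_data_api_access


def violations(functions: List[FunctionSpec]) -> List[str]:
    """List every function that breaks the data/egress split."""
    problems = []
    for f in functions:
        if can_reach_db(f) and f.has_internet_egress:
            problems.append(f"{f.name}: can reach the DB and the internet")
        if f.has_internet_egress and f.can_read_db_secret:
            problems.append(f"{f.name}: egress function can read the DB secret")
    return problems
```

The check is only as honest as the spec you feed it, so the fields are worth deriving from the deployed stack rather than typing in by hand.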

The part that makes this AI-friendly: declarative schema

This design means making the same placement call constantly: canonical Aurora? DynamoDB cache? S3 artifact? hot path? allowed to wake the DB?

Migration-history schema fights you here — to know what is, you replay what happened (001_create… 006_revert…). Declarative schema flips it: you describe the desired current state and let the tool diff it. Both you and the AI read one artifact that says what should be true now.

It doesn't abolish prod migrations — the loop is declarative -> plan -> reviewed migration -> apply -> drift check. You think in declarative state; you apply via migration. That's what fits AI-assisted work: the AI reasons over a stable description, not a changelog.
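To make the "describe the desired state and let the tool diff it" step concrete, here is a toy version of the plan stage. It is a sketch under heavy assumptions: schemas are plain dicts of table -> column -> type, and it only handles tables and columns. Real declarative schema tools handle far more, and this is not the behavior of any particular one.

```python
from typing import Dict, List

Schema = Dict[str, Dict[str, str]]  # table -> column -> type


def plan(current: Schema, desired: Schema) -> List[str]:
    """Return the SQL statements that move `current` to `desired`."""
    steps: List[str] = []
    for table in sorted(desired):
        cols = desired[table]
        if table not in current:
            body = ", ".join(f"{c} {t}" for c, t in cols.items())
            steps.append(f"CREATE TABLE {table} ({body});")
            continue
        for col, typ in cols.items():
            if col not in current[table]:
                steps.append(f"ALTER TABLE {table} ADD COLUMN {col} {typ};")
            elif current[table][col] != typ:
                steps.append(f"ALTER TABLE {table} ALTER COLUMN {col} TYPE {typ};")
        for col in current[table]:
            if col not in cols:
                steps.append(f"ALTER TABLE {table} DROP COLUMN {col};")
    for table in sorted(current):
        if table not in desired:
            steps.append(f"DROP TABLE {table};")
    return steps
```

The output is what goes to review as a migration, and the drift check is the same function run against the live schema: an empty plan means the database matches the description.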

You don't have to be an infra specialist up front

This is the claim I most want to make — and hold most loosely.

A sleep-first, no-NAT, projection-driven design used to need someone who lived in AWS networking. The wall wasn't the idea; it was getting VPC routing, IAM boundaries, and serverless wiring all correct. Most of this design, honestly, isn't something I knew up front — it's where I drifted by sparring with Claude Code and Codex against a real dev deployment: "does this Lambda actually have egress?", "what wakes Aurora here?", "where's the tenant boundary?" With IaC describing the whole thing as code, the feedback is a deployed stack you can poke, not a whiteboard argument — and that's what let me reach a shape like this at all.

It costs more than a managed edge platform and burns dev cycles. But infra design has become something you validate iteratively in dev, rather than something you have to get right up front from expertise alone.

To be precise about the claim: AI doesn't remove the need for infrastructure expertise. It changes the iteration loop — from "know the answer upfront" to "generate, deploy, inspect, and correct faster." You still need the judgment to know what to inspect; you just acquire it by iterating instead of having to bring all of it to the whiteboard.

The honest caveat: the AI will confidently produce wiring that's "plausible but subtly wrong" — a security group more open than you think, a function you believe is sandboxed but isn't. So the loop must include verification you can eyeball: probe the egress, confirm what wakes the DB, read the plan diff. And be clear-eyed that demoting Aurora is really distributed-systems-ification — you trade fixed cost for consistency, projections, and sync. It's a tradeoff, not a free win.
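One of those eyeball checks, "is this security group more open than I think?", is easy to script. A minimal sketch, assuming a hypothetical rule shape (a dict with cidr, from_port, to_port) that you would map from your own plan output, and assuming the default PostgreSQL port 5432. It flags any ingress rule that covers the DB port from a range outside the CIDRs you meant to allow.

```python
import ipaddress
from typing import Dict, List


def too_open(rules: List[Dict], allowed_cidrs: List[str],
             db_port: int = 5432) -> List[Dict]:
    """Return ingress rules that expose db_port beyond the allowed CIDRs."""
    allowed = [ipaddress.ip_network(c) for c in allowed_cidrs]
    flagged = []
    for rule in rules:
        # Skip rules whose port range does not cover the database port.
        if not (rule["from_port"] <= db_port <= rule["to_port"]):
            continue
        net = ipaddress.ip_network(rule["cidr"])
        # subnet_of() only compares networks of the same IP version.
        inside = any(net.version == a.version and net.subnet_of(a)
                     for a in allowed)
        if not inside:
            flagged.append(rule)
    return flagged
```

This only covers CIDR-based rules; rules that reference another security group need their own check.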

When this stops being the right call

Aurora awake < ~100h/mo:       min_acu=0 + wake-ahead (keep going)
~100–250h/mo:                  compare against RDS db.t4g.micro/small
> ~250h/mo:                    always-on RDS, or Aurora min_acu=0.5
prod-like / heavy validation:  Aurora min_acu=0.5

Judge total cost (ACU + NAT + Proxy + endpoints + DynamoDB + S3 + Lambda + CloudWatch), not Aurora alone. For bursty dev, killing Aurora compute and NAT usually wins; for steady production, often it won't. These aren't universal thresholds — they're decision triggers for my environment, and they move with region, log volume, and DynamoDB/S3 usage.
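The triggers and the "total cost, not Aurora alone" rule fit in a few lines. A minimal sketch: the thresholds are the approximate ones from the table above (specific to my environment, as noted), the function names are illustrative, and the cost function deliberately takes your own measured monthly figures rather than assuming any prices.

```python
from typing import Dict

# Every component named in the paragraph above must be accounted for.
COST_COMPONENTS = ("acu", "nat", "proxy", "endpoints",
                   "dynamodb", "s3", "lambda", "cloudwatch")


def recommend(awake_hours_per_month: float, prod_like: bool = False) -> str:
    """Map Aurora awake hours to the decision triggers in the table above."""
    if awake_hours_per_month < 0:
        raise ValueError("awake hours cannot be negative")
    if prod_like:
        return "Aurora min_acu=0.5"
    if awake_hours_per_month < 100:
        return "Aurora min_acu=0 + wake-ahead"
    if awake_hours_per_month <= 250:
        return "compare against RDS db.t4g.micro/small"
    return "always-on RDS, or Aurora min_acu=0.5"


def total_monthly_cost(costs: Dict[str, float]) -> float:
    """Sum all components; refuse to answer if any is missing."""
    missing = [c for c in COST_COMPONENTS if c not in costs]
    if missing:
        raise KeyError(f"missing cost components: {missing}")
    return sum(costs[c] for c in COST_COMPONENTS)
```

Failing on a missing component is the point: a comparison that forgets NAT or CloudWatch is the Aurora-alone comparison in disguise.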

What I'd actually measure

Numbers beat vibes here. If I were deciding whether this is paying off, these are the signals I'd watch:

Aurora awake hours per month (the headline number)
which endpoints wake Aurora — and whether any are supposed to be DB-free
NAT hourly cost avoided by moving egress out of the VPC
DynamoDB counter / reconciliation lag — how stale the cheap layer gets
p95 login latency right after a cold resume (the resume-tax you're accepting)
count of unexpected DB calls from hot paths (this should trend to zero)
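Three of these are simple to compute once the raw events exist. A minimal sketch, assuming you already export awake intervals as (start, end) epoch-second pairs, login latencies as a list of numbers, and DB calls tagged by route name; where those come from in your stack is up to you, and the function names are illustrative.

```python
from typing import Iterable, List, Set, Tuple


def awake_hours(intervals: Iterable[Tuple[float, float]]) -> float:
    """Total awake time in hours, merging overlapping (start, end) intervals."""
    total = 0.0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total / 3600.0


def p95(samples: List[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def unexpected_db_calls(calls: Iterable[str], db_free_routes: Set[str]) -> int:
    """Count DB calls that came from routes meant to be DB-free."""
    return sum(1 for route in calls if route in db_free_routes)
```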

Objections I'd raise myself

"You've rebuilt a distributed system to avoid a DB bill." Sometimes, yes. The boundary I try to hold is: only move work-grade state out of Aurora. If a flow needs cross-store transactions, strict read-after-write semantics everywhere, or complex reconciliation, it probably belongs back in the path. The moment you're reaching for distributed consistency to keep the cheap layer correct, you've moved the boundary too far.
"Stale auth will bite you." The sharp edge. JWT + DynamoDB cache is fine for low-risk reads, but role/budget changes, manifest commits, approvals, and any tenant-data-injecting /ask must consult canonical Aurora. "Fast but stale permissions" is how you ship an authz incident.
"The warmup endpoint is a footgun." A public /warmup doing select 1 leaks almost nothing — but it's a button that wakes Aurora, and a public one can be hammered until sleep-first is meaningless. Behind WAF/rate-limit, single-flight, no-op when recently warm.

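For the warmup objection, the single-flight and "no-op when recently warm" parts can be sketched in-process. This assumes a hypothetical wake callable (for example, one that runs select 1) and an illustrative 300-second window. It only guards a single process; across Lambda instances you would need shared state, which this sketch does not attempt. WAF and rate limiting sit in front of it and are not shown.

```python
import threading
import time
from typing import Callable, Optional


class WarmupGate:
    """Single-flight warmup that is a no-op when the DB was warmed recently."""

    def __init__(self, wake: Callable[[], None],
                 min_interval_s: float = 300.0,  # illustrative window
                 clock: Callable[[], float] = time.monotonic):
        self._wake = wake
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._lock = threading.Lock()
        self._last_warm: Optional[float] = None

    def warm(self) -> str:
        # Single-flight: if another caller is already waking the DB, return.
        if not self._lock.acquire(blocking=False):
            return "in_flight"
        try:
            now = self._clock()
            if (self._last_warm is not None
                    and now - self._last_warm < self._min_interval_s):
                return "already_warm"
            self._wake()
            self._last_warm = self._clock()
            return "woke"
        finally:
            self._lock.release()
```

Returning a status instead of blocking keeps a hammered endpoint cheap: repeated calls inside the window never touch Aurora.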
A checklist before you try this

If you want to see whether your workload fits, run down these before touching anything:

Is your workload bursty / intermittent (not steady high traffic)?
Can viewer reads be served without touching Aurora (CloudFront/S3/DynamoDB)?
Can LLM calls run outside the VPC, with no DB access?
Can the DynamoDB state be TTL-bound, recomputable, or reconciled?
Can you tolerate resume latency on the paths that do wake Aurora?
Do you have metrics for what wakes Aurora, so you can catch leaks?
Is your schema declarative, so you (and the AI) reason over current state, not history?

If most boxes are unchecked, the boring always-on RDS/Aurora is probably still the right call — and that's fine.

Still figuring out the edges

The shift isn't any one AWS feature. It's that the boundary between "only an infra specialist could safely build this" and "a product engineer can reach it by sparring with an AI against a dev deployment" has moved a lot.

Enterprise gravity toward AWS and Aurora is real and isn't leaving. But it no longer forces the always-on, NAT-heavy, everything-through-the-DB default. Keep Aurora as the vault — strict, singular — let it sleep, and run the workbench at the edge, with explicit events carrying workbench changes back into the control plane. Whether this particular shape survives more validation, I don't know yet. What I'm fairly sure of is that the path to trying designs like it is now something you can iterate into — with AI and IaC, deploying and inspecting — instead of having to know it cold beforehand.

If you've run something like this in production — or watched it fall apart — I'd like to hear where it broke for you.

DEV Community