akilesh thuniki
How I Built Multi-Tenant SaaS on AWS (So You Don't Have To)

It was 2:17 AM when my phone lit up with a Slack alert.

Two enterprise customers were seeing each other’s data.

Not all of it — just enough to trigger panic. The kind of bug that doesn’t just wake you up; it makes you question every infrastructure decision you’ve ever made.

That night is why SaaSInfraLab exists.

I was tired of rebuilding the same fragile multi-tenant infrastructure for every new SaaS project and hoping I didn’t miss something critical at 2 AM again.


The Problem: Multi-Tenancy Breaks in Subtle, Expensive Ways

Multi-tenant SaaS sounds straightforward until you’re running real workloads at scale.

Here’s what broke for me repeatedly:

  • Manual tenant onboarding took 2–3 hours per customer
  • Namespace misconfigurations exposed data across tenants
  • Terraform modules were copy-pasted between projects and drifted over time
  • CI/CD pipelines were brittle and hard to reason about
  • AWS costs grew with no per-tenant visibility

At around 40–50 tenants, everything slowed down.

One bad Helm change could impact everyone.
One missed IAM permission could block a deployment.
One rushed fix could leak data.

The problem isn’t Kubernetes or AWS — it’s the lack of structure and repeatability.


The Solution: A Production-Ready, GitOps-Driven SaaS Stack

Instead of patching the same problems again, I stepped back and designed a system with one rule:

Tenant isolation must exist at every layer.

High-Level Approach

I built a modular infrastructure stack with:

  • AWS EKS as the compute foundation
  • Terraform for deterministic infrastructure
  • GitOps (ArgoCD) as the control plane
  • PostgreSQL schema isolation for data
  • Namespaces, quotas, RBAC, and network policies by default

Everything is defined once, versioned, and reused.

No click-ops. No snowflakes.
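In practice, "defined once and reused" means a tenant's entire footprint is one module invocation. A rough Terraform sketch of the idea (the module name and inputs here are illustrative, not the actual SaaSInfraLab code):

```hcl
# Illustrative sketch: one module call per tenant, everything parameterized.
# Module path and variable names are hypothetical.
module "tenant_acme" {
  source = "./modules/tenant"

  tenant_id    = "acme"
  namespace    = "tenant-acme"
  cpu_quota    = "4"
  memory_quota = "8Gi"
  db_schema    = "tenant_acme"
}
```

Adding a tenant becomes a diff in version control, which is exactly what makes the GitOps layer possible.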

Core Design Decisions (and Why)

Kubernetes Namespaces per tenant
This gives clean workload isolation, quota enforcement, and blast-radius control.
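As a sketch, the per-tenant baseline boils down to a namespace plus a quota (names and limits below are illustrative, not the actual SaaSInfraLab manifests):

```yaml
# Illustrative per-tenant baseline: a labeled namespace and a hard quota.
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-acme
  labels:
    tenant: acme
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-acme-quota
  namespace: tenant-acme
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```

The quota is what turns a noisy tenant into a contained problem instead of a cluster-wide one.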

PostgreSQL schemas instead of separate databases
Lower cost, simpler operations, and safe isolation when paired with strict search paths.

// tenantId must be allow-listed first — it is interpolated into SQL
await client.query(`SET search_path TO tenant_${tenantId}`);
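Because the tenant ID is interpolated into SQL, "strict search paths" only stay safe if the ID is validated first. A minimal sketch of that guard (the helper name and the allowed ID format are my assumptions, not from the actual codebase):

```javascript
// Hypothetical helper: builds a SET search_path statement only for
// tenant IDs matching a strict allow-list pattern, so the string
// interpolation above cannot be abused for SQL injection.
function searchPathFor(tenantId) {
  if (!/^[a-z0-9_]{1,48}$/.test(tenantId)) {
    throw new Error(`invalid tenant id: ${tenantId}`);
  }
  return `SET search_path TO tenant_${tenantId}`;
}
```

Then the query becomes `await client.query(searchPathFor(tenantId));`, and a malicious ID like `acme; DROP TABLE users` fails loudly instead of reaching PostgreSQL.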

GitOps for all deployments
ArgoCD watches tenant definitions and applies changes automatically. No manual deploys, no surprises.
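Conceptually, each tenant maps to one ArgoCD Application pointing at its slice of the config repo. A hedged sketch (repo URL, project, and paths are placeholders):

```yaml
# Illustrative ArgoCD Application: one per tenant, auto-synced.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tenant-acme
  namespace: argocd
spec:
  project: tenants
  source:
    repoURL: https://github.com/your-org/tenant-config   # placeholder
    targetRevision: main
    path: tenants/acme
  destination:
    server: https://kubernetes.default.svc
    namespace: tenant-acme
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual drift in the cluster
```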

IRSA + RBAC everywhere
Every pod gets only the AWS permissions it needs — nothing more.
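With IRSA, the binding lives on the service account: pods assume a narrowly scoped IAM role instead of inheriting node permissions. A sketch (the role ARN and names are placeholders):

```yaml
# Illustrative IRSA wiring: the annotation binds this service account's
# pods to one tenant-scoped IAM role. ARN and names are hypothetical.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: acme-api
  namespace: tenant-acme
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/tenant-acme-api
```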

CI/CD Flow

  • CI (GitHub Actions): build images, run tests, push to ECR
  • CD (ArgoCD): syncs manifests, runs per-tenant migrations, deploys safely

Adding a tenant is a config change — not a weekend task.
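That "config change" is roughly a single committed file that the pipeline fans out from. A hypothetical tenant definition (field names are illustrative, not the actual SaaSInfraLab schema):

```yaml
# Illustrative tenant definition: committing a file like this is the
# entire onboarding step; CI/CD and GitOps reconcile everything else.
tenant:
  id: acme
  tier: enterprise
  region: us-east-1
  db_schema: tenant_acme
  quotas:
    cpu: "4"
    memory: 8Gi
```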


Lessons Learned & What I’d Do Differently

If I were starting again:

  • I’d add cost attribution from day one
  • I’d document network policies earlier
  • I’d automate tenant-isolation tests sooner

The biggest takeaway?
Tenant isolation isn’t a single feature.
It’s defense in depth: IAM, network, compute, data, and deployment workflows all working together.

That’s what SaaSInfraLab tries to encode.


Try It Yourself

The entire stack is open source.

Clone it, define your tenants, and deploy a real multi-tenant SaaS foundation in under 30 minutes.

GitHub: https://github.com/SaaSInfraLab

Questions? I’m happy to discuss design decisions or help troubleshoot edge cases.

What’s been your worst infrastructure deployment incident — and how did you prevent it from happening again?
