Part 4: EKS Multi-Cluster Setup — Six Clusters Across Two Regions
Part of the series: Building a Production-Grade DevSecOps Pipeline on AWS
Introduction
Why six clusters instead of one? The answer is isolation and resilience:
┌──────────────────────────────────────────────────────────────────────────┐
│ ONE CLUSTER (anti-pattern) │
│ │
│ Dev pods → same etcd as Production pods │
│ A misconfigured dev deployment can consume all cluster resources │
│ Cluster upgrade = every environment goes down simultaneously │
│ Cost visibility: impossible to attribute spend per environment │
└──────────────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────┐
│ SIX CLUSTERS (this guide) │
│ │
│ myapp-dev-use1 (us-east-1, public endpoint, 2 nodes) │
│ myapp-dev-usw2 (us-west-2, public endpoint, 2 nodes) │
│ myapp-staging-use1 (us-east-1, private endpoint, 2 nodes) │
│ myapp-staging-usw2 (us-west-2, private endpoint, 2 nodes) │
│ myapp-production-use1 (us-east-1, private endpoint, 2+ nodes + Karpenter) │
│ myapp-production-usw2 (us-west-2, private endpoint, 2+ nodes + Karpenter) │
│ │
│ Benefits: │
│ ✓ Complete IAM isolation between environments │
│ ✓ Production upgrade independent of dev │
│ ✓ Regional failover — us-east-1 outage → us-west-2 serves traffic │
│ ✓ Clear cost attribution per cluster tag │
└────────────────────────────────────────────────────────────────────────────┘
Cluster Overview
| Cluster | Region | Endpoint | Nodes | Karpenter |
|---|---|---|---|---|
| myapp-dev-use1 | us-east-1 | Public | 2 × t3.medium | No |
| myapp-dev-usw2 | us-west-2 | Public | 2 × t3.medium | No |
| myapp-staging-use1 | us-east-1 | Private | 2 × t3.medium | No |
| myapp-staging-usw2 | us-west-2 | Private | 2 × t3.medium | No |
| myapp-production-use1 | us-east-1 | Private | 2+ × t3.medium | Yes |
| myapp-production-usw2 | us-west-2 | Private | 2+ × t3.medium | Yes |
EKS Terraform Module
# _modules/eks/main.tf
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = var.cluster_name
  cluster_version = "1.29"

  vpc_id                   = var.vpc_id
  subnet_ids               = var.private_subnet_ids # Nodes always in private subnets
  control_plane_subnet_ids = var.private_subnet_ids

  # Endpoint access: private always on, public only for dev.
  # AWS rejects an empty CIDR list, so always set 0.0.0.0/0 — it only
  # takes effect when public access is enabled.
  cluster_endpoint_private_access      = true
  cluster_endpoint_public_access       = var.public_api
  cluster_endpoint_public_access_cidrs = ["0.0.0.0/0"]

  # Without this the cluster creator IAM role (your Terragrunt role) can't kubectl
  enable_cluster_creator_admin_permissions = true

  # Encrypt Kubernetes secrets in etcd with KMS
  cluster_encryption_config = {
    provider_key_arn = var.kms_key_arn
    resources        = ["secrets"]
  }

  # EKS managed add-ons (AWS manages patching)
  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent              = true
      service_account_role_arn = module.vpc_cni_irsa.iam_role_arn
    }
    aws-ebs-csi-driver = {
      most_recent              = true
      service_account_role_arn = module.ebs_csi_irsa.iam_role_arn
    }
  }

  eks_managed_node_groups = {
    main = {
      instance_types = var.instance_types # ["t3.medium"]
      min_size       = var.min_nodes
      max_size       = var.max_nodes
      desired_size   = var.desired_nodes

      # IMPORTANT: IAM name_prefix has a 38-character limit, and the
      # auto-generated node group role prefix exceeds it for the longer
      # cluster names here (e.g. myapp-production-use1).
      # Fix: use an explicit role name (the IAM name limit is 64 chars).
      iam_role_name            = "${var.cluster_name}-node-group"
      iam_role_use_name_prefix = false

      # The module already attaches the standard node policies (ECR pull,
      # CNI, worker node); SSM enables Session Manager shell access.
      iam_role_additional_policies = {
        AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
      }

      labels = {
        "node-type" = "general"
      }
    }
  }

  # Karpenter discovery tag on the node security group — only needed on
  # production clusters (Karpenter uses it to find the right security group)
  node_security_group_tags = var.karpenter_enabled ? {
    "karpenter.sh/discovery" = var.cluster_name
  } : {}
}
# IRSA for VPC CNI (pod networking)
module "vpc_cni_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name             = "${var.cluster_name}-vpc-cni"
  attach_vpc_cni_policy = true
  vpc_cni_enable_ipv4   = true

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:aws-node"]
    }
  }
}

# IRSA for EBS CSI Driver (persistent volumes)
module "ebs_csi_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name             = "${var.cluster_name}-ebs-csi"
  attach_ebs_csi_policy = true

  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["kube-system:ebs-csi-controller-sa"]
    }
  }
}
# _modules/eks/outputs.tf
output "cluster_name" { value = module.eks.cluster_name }
output "cluster_endpoint" { value = module.eks.cluster_endpoint }
output "cluster_certificate_authority_data" {
  value = module.eks.cluster_certificate_authority_data
}
output "oidc_provider_arn" { value = module.eks.oidc_provider_arn }
output "oidc_provider" { value = module.eks.oidc_provider }
output "node_security_group_id" { value = module.eks.node_security_group_id }
output "node_subnet_ids" { value = var.private_subnet_ids }
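For reference, the module's inputs can be sketched as a minimal variables.tf. The variable names come from the usages above; the type choices and defaults here are my assumptions:

```hcl
# _modules/eks/variables.tf (sketch — types and defaults are assumptions)
variable "cluster_name" { type = string }
variable "vpc_id" { type = string }
variable "private_subnet_ids" { type = list(string) }
variable "kms_key_arn" { type = string }

variable "public_api" {
  type    = bool
  default = false # private endpoint unless explicitly opened (dev only)
}

variable "instance_types" {
  type    = list(string)
  default = ["t3.medium"]
}

variable "min_nodes" { type = number }
variable "max_nodes" { type = number }
variable "desired_nodes" { type = number }

variable "karpenter_enabled" {
  type    = bool
  default = false
}
```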
Per-Environment Terragrunt Configs
Dev (public endpoint):
# live/dev/us-east-1/eks/terragrunt.hcl
include "root" { path = find_in_parent_folders() }
terraform { source = "../../../../_modules/eks" }

dependency "vpc" {
  config_path = "../vpc"
  mock_outputs = {
    vpc_id             = "vpc-mock"
    private_subnet_ids = ["subnet-mock1", "subnet-mock2"]
  }
}

dependency "kms" {
  config_path  = "../kms"
  mock_outputs = { key_arn = "arn:aws:kms:us-east-1:123456789:key/mock" }
}

inputs = {
  cluster_name       = "myapp-dev-use1"
  vpc_id             = dependency.vpc.outputs.vpc_id
  private_subnet_ids = dependency.vpc.outputs.private_subnet_ids
  kms_key_arn        = dependency.kms.outputs.key_arn
  public_api         = true # Dev gets public endpoint for laptop + CI access
  min_nodes          = 2
  max_nodes          = 4
  desired_nodes      = 2
  instance_types     = ["t3.medium"]
  karpenter_enabled  = false
}
Production (private endpoint + Karpenter):
# live/production/us-east-1/eks/terragrunt.hcl
include "root" { path = find_in_parent_folders() }
terraform { source = "../../../../_modules/eks" }

dependency "vpc" { config_path = "../vpc" }
dependency "kms" { config_path = "../kms" }

inputs = {
  cluster_name       = "myapp-production-use1"
  vpc_id             = dependency.vpc.outputs.vpc_id
  private_subnet_ids = dependency.vpc.outputs.private_subnet_ids
  kms_key_arn        = dependency.kms.outputs.key_arn
  public_api         = false # Private endpoint only — no public internet access
  min_nodes          = 2
  max_nodes          = 10
  desired_nodes      = 2
  instance_types     = ["t3.medium"]
  karpenter_enabled  = true # Karpenter manages additional nodes beyond the initial 2
}
Bootstrapping Private Clusters
Staging and production clusters have endpointPublicAccess: false, so kubectl from your laptop or CI cannot reach the API server — the private endpoint is only reachable from inside the VPC (or a network connected to it via VPN or peering). Without a bastion or VPN, you must temporarily enable public access, bootstrap the cluster (install ArgoCD, register spokes, etc.), then lock it back down.
# Step 1: Temporarily enable public access
aws eks update-cluster-config \
--name myapp-production-use1 \
--region us-east-1 \
--profile myapp-prod-use1 \
--resources-vpc-config \
endpointPublicAccess=true,\
endpointPrivateAccess=true,\
publicAccessCidrs="0.0.0.0/0"
# Step 2: Wait 3 minutes — AWS takes time to update the Elastic Network Interfaces
sleep 180
# Step 3: Verify access
kubectl --context prod-use1 get nodes
# Step 4: Bootstrap (install ArgoCD, apply ApplicationSets, etc.)
# ... your bootstrap commands ...
# Step 5: Lock back to private-only
aws eks update-cluster-config \
--name myapp-production-use1 \
--region us-east-1 \
--profile myapp-prod-use1 \
--resources-vpc-config \
endpointPublicAccess=false,\
endpointPrivateAccess=true
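The fixed sleep 180 in Step 2 usually works, but endpoint updates can take anywhere from a couple of minutes to much longer. A more reliable approach is to poll the cluster status until the update settles — a sketch, where wait_for_cluster_active is my own helper, not an AWS CLI command:

```shell
# Poll until the endpoint update finishes (cluster status goes
# UPDATING → ACTIVE) instead of sleeping a fixed 180 seconds.
wait_for_cluster_active() {
  local name=$1 region=$2 profile=$3 status
  for _ in $(seq 1 60); do   # up to ~10 minutes
    status=$(aws eks describe-cluster \
      --name "$name" --region "$region" --profile "$profile" \
      --query 'cluster.status' --output text 2>/dev/null)
    if [ "$status" = "ACTIVE" ]; then
      return 0
    fi
    sleep 10
  done
  echo "cluster $name did not reach ACTIVE in time" >&2
  return 1
}

# usage (in place of the sleep in Step 2):
# wait_for_cluster_active myapp-production-use1 us-east-1 myapp-prod-use1
```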
Do not forget Step 5. A production cluster with a public API endpoint is a security risk — the API server is internet-accessible, relying solely on authentication for protection.
kubectl Context Setup
After terragrunt apply completes for each cluster, add it to your kubeconfig:
# Update kubeconfig for all 6 clusters
aws eks update-kubeconfig \
--name myapp-dev-use1 \
--region us-east-1 \
--alias dev-use1 \
--profile myapp-dev-use1
aws eks update-kubeconfig \
--name myapp-dev-usw2 \
--region us-west-2 \
--alias dev-usw2 \
--profile myapp-dev-usw2
aws eks update-kubeconfig \
--name myapp-staging-use1 \
--region us-east-1 \
--alias staging-use1 \
--profile myapp-staging-use1
aws eks update-kubeconfig \
--name myapp-staging-usw2 \
--region us-west-2 \
--alias staging-usw2 \
--profile myapp-staging-usw2
aws eks update-kubeconfig \
--name myapp-production-use1 \
--region us-east-1 \
--alias prod-use1 \
--profile myapp-prod-use1
aws eks update-kubeconfig \
--name myapp-production-usw2 \
--region us-west-2 \
--alias prod-usw2 \
--profile myapp-prod-usw2
# Verify
kubectl config get-contexts
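Since the aliases and profiles follow one convention (with "production" shortened to "prod"), the six commands can also be collapsed into a loop — a sketch assuming that naming holds:

```shell
# One update-kubeconfig call per cluster. Assumes the naming convention
# above, where "production" shortens to "prod" in alias and profile names.
update_all_kubeconfigs() {
  for SPEC in \
    "dev:us-east-1:use1"        "dev:us-west-2:usw2" \
    "staging:us-east-1:use1"    "staging:us-west-2:usw2" \
    "production:us-east-1:use1" "production:us-west-2:usw2"; do
    ENV=${SPEC%%:*}                          # dev | staging | production
    REGION=${SPEC#*:}; REGION=${REGION%:*}   # us-east-1 | us-west-2
    SUFFIX=${SPEC##*:}                       # use1 | usw2
    SHORT=$ENV
    if [ "$ENV" = "production" ]; then SHORT=prod; fi
    aws eks update-kubeconfig \
      --name "myapp-${ENV}-${SUFFIX}" \
      --region "$REGION" \
      --alias "${SHORT}-${SUFFIX}" \
      --profile "myapp-${SHORT}-${SUFFIX}"
  done
}

# usage: update_all_kubeconfigs && kubectl config get-contexts
```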
EKS Add-ons
EKS managed add-ons are maintained by AWS — they patch security vulnerabilities in CoreDNS, kube-proxy, and vpc-cni without you having to manage Helm releases.
- kube-proxy — handles iptables rules for Service routing
- coredns — in-cluster DNS resolution
- vpc-cni — AWS VPC networking for pods (each pod gets a real VPC IP)
- aws-ebs-csi-driver — allows EKS to provision EBS volumes for PersistentVolumeClaims
Why IRSA for vpc-cni and ebs-csi?
These add-ons need to call AWS APIs (EC2 for ENI management, EC2 for EBS volume ops). Without IRSA they would use the node's EC2 instance profile — giving every pod on the node those permissions. With IRSA, only the specific add-on service account has the permissions.
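Under the hood, IRSA works by putting a condition on the role's trust policy that only the named service account can satisfy through the cluster's OIDC provider. Roughly (account ID and provider URL are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/<OIDC_PROVIDER>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "<OIDC_PROVIDER>:sub": "system:serviceaccount:kube-system:aws-node",
          "<OIDC_PROVIDER>:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
```

The IRSA modules above generate this for you; the sub condition is why the role is scoped to kube-system:aws-node and nothing else.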
Fixing kubectl 401 Errors
If kubectl get nodes returns HTTP 401 Unauthorized, the IAM role you're using is not in the cluster's access entries.
# List current access entries
aws eks list-access-entries \
--cluster-name myapp-production-use1 \
--region us-east-1 \
--profile myapp-prod-use1
# If your OrganizationAccountAccessRole is missing, add it:
ROLE_ARN="arn:aws:iam::591120834781:role/OrganizationAccountAccessRole"
aws eks create-access-entry \
--cluster-name myapp-production-use1 \
--region us-east-1 \
--profile myapp-prod-use1 \
--principal-arn $ROLE_ARN
aws eks associate-access-policy \
--cluster-name myapp-production-use1 \
--region us-east-1 \
--profile myapp-prod-use1 \
--principal-arn $ROLE_ARN \
--policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
--access-scope type=cluster
The root cause: enable_cluster_creator_admin_permissions = true must be explicitly set in the EKS module. If it's missing, Terraform creates the cluster, but the IAM role that ran Terraform never gets an access entry.
AWS Load Balancer Controller
The AWS Load Balancer Controller (LBC) runs in every cluster and watches for Ingress resources with ingressClassName: alb. When it sees one, it provisions an Application Load Balancer in AWS automatically.
Install via Helm (or ArgoCD) after the cluster is up:
# IRSA for LBC
eksctl create iamserviceaccount \
--cluster myapp-production-use1 \
--namespace kube-system \
--name aws-load-balancer-controller \
--attach-policy-arn arn:aws:iam::aws:policy/ElasticLoadBalancingFullAccess \
--override-existing-serviceaccounts \
--approve \
--region us-east-1 \
--profile myapp-prod-use1
# Install controller
helm repo add eks https://aws.github.io/eks-charts
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
-n kube-system \
--set clusterName=myapp-production-use1 \
--set serviceAccount.create=false \
--set serviceAccount.name=aws-load-balancer-controller \
--set region=us-east-1 \
--set vpcId=<VPC_ID>
In this pipeline, the LBC is deployed via ArgoCD ApplicationSet — the Helm release is version-controlled in myapp-gitops/infrastructure/aws-lbc/.
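For reference, an Ingress the controller would act on looks roughly like this (the resource and service names are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp             # illustrative name
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb   # this is what the LBC watches for
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp   # illustrative Service
                port:
                  number: 80
```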
StorageClass for EBS PVCs
kube-prometheus-stack needs persistent storage for Prometheus and Grafana data. With the EBS CSI driver installed, create a StorageClass:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer # Don't provision until pod is scheduled
reclaimPolicy: Retain                   # Don't delete EBS volume if PVC is deleted
parameters:
  type: gp2
  encrypted: "true"
  kmsKeyId: <your-kms-key-arn>
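A PVC against this class stays Pending until a pod that uses it is scheduled — that's the WaitForFirstConsumer effect, which ensures the EBS volume is created in the same AZ as the node. A minimal sketch (the claim name is illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data             # illustrative name
spec:
  accessModes: ["ReadWriteOnce"]    # an EBS volume attaches to one node at a time
  storageClassName: gp2
  resources:
    requests:
      storage: 50Gi
```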
Node Security Group Tags
For Karpenter to manage node lifecycles, it needs to find the cluster's node security group. Tag it during EKS creation:
node_security_group_tags = {
  "karpenter.sh/discovery" = var.cluster_name
}
Similarly, private subnets need the discovery tag:
private_subnet_tags = {
  "karpenter.sh/discovery" = var.cluster_name
}
Verifying All Six Clusters
for CTX in dev-use1 dev-usw2 staging-use1 staging-usw2 prod-use1 prod-usw2; do
  echo "=== $CTX ==="
  kubectl --context "$CTX" get nodes -o wide
done
Expected output:
=== dev-use1 ===
NAME STATUS ROLES AGE VERSION
ip-10-0-8-xx.ec2.internal Ready <none> 5d v1.29.15-eks-ac2d5a0
ip-10-0-16-xx.ec2.internal Ready <none> 5d v1.29.15-eks-ac2d5a0
=== prod-use1 ===
NAME STATUS ROLES AGE VERSION
ip-10-20-8-xx.ec2.internal Ready <none> 5d v1.29.15-eks-ac2d5a0
ip-10-20-16-xx.ec2.internal Ready <none> 5d v1.29.15-eks-ac2d5a0
Summary
By the end of Part 4 you have:
- ✅ Six EKS clusters (Kubernetes 1.29) across three environments and two regions
- ✅ Private endpoints on staging and production (public on dev)
- ✅ KMS encryption for Kubernetes secrets in etcd
- ✅ IAM IRSA for VPC CNI and EBS CSI add-ons
- ✅ AWS Load Balancer Controller installed
- ✅ kubectl contexts configured for all six clusters
- ✅ Karpenter discovery tags on production node security groups and subnets
Screenshot Placeholders
SCREENSHOT: AWS EKS console showing 2 clusters running in production with ACTIVE status
Next: Part 5 — GitOps with ArgoCD: Hub-Spoke Model
Follow the series — next part publishes next Wednesday.
Live system: https://www.matthewoladipupo.dev/health
Runbook: Operations Guide
Source code: myapp-infra | myapp-gitops | myapp


