The Beginning of Your EKS Adventure 🚀☸️
📚 Table of Contents
- Introduction
- Understanding Kubernetes: The Philosophy
- The Kubernetes Architecture
- Enter Amazon EKS
- Core Kubernetes Concepts: The Building Blocks
- Networking in EKS: The Complex Beast
- Storage in Kubernetes
- Security in EKS: Locking It Down
- Setting Up Your First EKS Cluster
- Essential Add-ons and Tools
- Observability: Knowing What's Happening
- Real-World Application Deployment
- Advanced Patterns and Best Practices
- Troubleshooting Common Issues
- GitOps and CI/CD
- Cost Optimization
- Production Readiness Checklist
- Wrapping Up
- Quick Reference Guide
Introduction
Two weeks have passed since the discussion on AWS Container Services. Alex is back at the coffee shop, laptop open, with a slightly worried expression. Sam walks in with two coffees.
Sam: sets down coffee Okay, that text you sent at 2 AM saying "I think I need Kubernetes after all" was concerning. What happened?
Alex: laughs nervously Don't worry, nothing's on fire. The ECS setup is working great actually! But... our CTO just announced we're acquiring a startup. They're running everything on Kubernetes. And our new lead developer keeps talking about "the K8s ecosystem" and "cloud-native patterns."
Sam: Ah, the classic acquisition scenario. So now you need to actually understand Kubernetes, not just know it exists.
Alex: Exactly. And honestly, after working with ECS, I'm curious. What am I missing? Why do people love Kubernetes so much they tattoo the logo on their arms?
Sam: spits out coffee Someone actually did that?
Alex: I saw it on Twitter!
Sam: shakes head Okay, well... let's dive deep. Fair warning: this is going to be a longer conversation. Kubernetes isn't something you grasp in ten minutes.
Alex: I've got all afternoon. Teach me, sensei.
↑ Back to Table of Contents
Understanding Kubernetes: The Philosophy
Sam: Before we touch EKS, you need to understand Kubernetes itself. Think back to our last conversation – I said Kubernetes is complex. But there's a reason for that complexity. It's not accidental.
Kubernetes was built to solve Google's problems: running billions of containers across thousands of machines. It's designed for:
- Declarative configuration: You say what you want, K8s makes it happen
- Self-healing: Things fail, K8s automatically recovers
- Extensibility: You can add functionality without changing core code
- API-driven: Everything is an API resource
Alex: You lost me at "declarative configuration."
Sam: Right, let me explain with an example. Remember ECS?
In ECS (Imperative thinking):
# You tell ECS what to DO
aws ecs create-service --service-name my-app --desired-count 3
aws ecs update-service --service-name my-app --desired-count 5
In Kubernetes (Declarative thinking):
# You tell K8s what you WANT
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
replicas: 5 # I want 5 replicas, K8s makes it happen
# ... rest of config
You apply this file, and Kubernetes continuously works to maintain that state. If a pod dies, K8s starts a new one. If you change replicas from 5 to 10, K8s smoothly adds 5 more.
Alex: So it's like... I'm describing my desired state, not giving commands?
Sam: Exactly! That's the Kubernetes mindset. Everything is about desired state vs. current state.
┌─────────────────────────────────────────┐
│ Kubernetes Control Loop │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Desired │ │ Current │ │
│ │ State │ │ State │ │
│ │ (your YAML) │ │ (reality) │ │
│ └──────┬───────┘ └───────┬──────┘ │
│ │ │ │
│ │ ┌────────────┐ │ │
│ └──►│ Controller │◄──┘ │
│ │ (reconcile)│ │
│ └────────────┘ │
│ │
│ If desired ≠ current, take action │
└─────────────────────────────────────────┘
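You can watch that reconcile loop in action with nothing more than kubectl. A minimal sketch, assuming a Deployment named my-app with the label app=my-app already exists:
# Declare the desired state: 3 replicas
kubectl scale deployment/my-app --replicas=3
# Simulate a failure by deleting one pod
kubectl delete pod "$(kubectl get pods -l app=my-app -o jsonpath='{.items[0].metadata.name}')"
# Watch the controller notice the gap and recreate a pod to get back to 3
kubectl get pods -l app=my-app --watch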
The Kubernetes Architecture
Alex: Okay, I'm buying into the philosophy. Show me how it actually works.
Sam: draws on napkin
Kubernetes has two main parts:
1. Control Plane (the brains)
2. Worker Nodes (the muscle)
Let me break this down:
┌────────────────────────────────────────────────────┐
│ KUBERNETES CLUSTER │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ CONTROL PLANE (Master) │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ │ │
│ │ │ API Server │ │ etcd │ │ │
│ │ │ (kubectl) │ │ (database) │ │ │
│ │ └──────┬───────┘ └──────────────┘ │ │
│ │ │ │ │
│ │ ┌──────┴───────┐ ┌──────────────┐ │ │
│ │ │ Scheduler │ │ Controller │ │ │
│ │ │ (placement) │ │ Manager │ │ │
│ │ └──────────────┘ └──────────────┘ │ │
│ └───────────────────────┬────────────────────┬─┘ │
│ │ │ │
│ ┌───────────────────────▼─────┐ ┌──────────▼──┐ │
│ │ WORKER NODE 1 │ │ WORKER NODE 2│ │
│ │ │ │ │ │
│ │ ┌────────┐ ┌────────┐ │ │ ┌────────┐ │ │
│ │ │ Pod │ │ Pod │ │ │ │ Pod │ │ │
│ │ │┌──────┐│ │┌──────┐│ │ │ │┌──────┐│ │ │
│ │ ││ App ││ ││ App ││ │ │ ││ App ││ │ │
│ │ │└──────┘│ │└──────┘│ │ │ │└──────┘│ │ │
│ │ └────────┘ └────────┘ │ │ └────────┘ │ │
│ │ │ │ │ │
│ │ kubelet | kube-proxy │ │ kubelet... │ │
│ └─────────────────────────────┘ └──────────────┘ │
└────────────────────────────────────────────────────┘
Alex: Walk me through what each piece does.
Sam: Great! Let's follow a request:
1. API Server:
- The front door to Kubernetes
- When you run kubectl apply, it hits the API Server
- Validates requests, authenticates users
- Everything goes through here – it's the only component that talks to etcd
2. etcd:
- Distributed key-value database
- Stores ALL cluster state
- If etcd is lost, the cluster doesn't know what it should be running
- This is why backups are critical!
3. Scheduler:
- Watches for new pods that don't have a node assigned
- Decides which node should run the pod based on:
- Resource requirements (CPU, memory)
- Node constraints (labels, taints, tolerations)
- Affinity/anti-affinity rules
4. Controller Manager:
- Runs multiple controllers (Deployment controller, ReplicaSet controller, etc.)
- Each controller watches for changes and acts on them
- Example: the ReplicaSet controller sees replicas: 5, counts the current pods, and creates more if needed
5. kubelet (on each worker node):
- Ensures containers are running in pods
- Talks to container runtime (Docker, containerd)
- Reports node and pod status back to API Server
6. kube-proxy (on each worker node):
- Handles networking
- Implements Service abstraction
- Routes traffic to appropriate pods
Alex: So when I run kubectl apply -f deployment.yaml, what happens?
Sam: Perfect question! Let's trace it:
Step-by-Step: kubectl apply -f deployment.yaml
1. kubectl → API Server
"Here's a Deployment resource"
2. API Server → etcd
"Store this Deployment spec"
3. Deployment Controller (watching API Server)
"New Deployment! I need to create a ReplicaSet"
4. Deployment Controller → API Server
"Create this ReplicaSet"
5. ReplicaSet Controller (watching)
"New ReplicaSet! I need to create Pods"
6. ReplicaSet Controller → API Server
"Create 3 Pods please"
7. Scheduler (watching for unassigned Pods)
"These Pods need homes... Node 2 has space!"
8. Scheduler → API Server
"Assign these Pods to Node 2"
9. kubelet on Node 2 (watching API Server)
"I have new Pods to run! Starting containers..."
10. kubelet → Container Runtime
"Pull image, create container, start it"
11. kubelet → API Server (continuous)
"Status update: Pod is Running"
Alex: Wow. That's... a lot of moving parts for just starting a container.
Sam: It is! But that complexity gives you power. Each piece can be extended, replaced, or customized. That's why Kubernetes can do things ECS can't.
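You can observe most of that chain yourself after an apply — a rough sketch (resource names will be whatever your Deployment generates):
# The Deployment, the ReplicaSet it created, and the Pods the ReplicaSet created
kubectl get deployment,replicaset,pods -l app=my-app
# Scheduling and container-start steps appear as events on each pod
kubectl describe pod <pod-name>   # check the Events section at the bottom
# Or watch cluster-wide events live while you apply the manifest
kubectl get events --watch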
Enter Amazon EKS
Alex: Okay, so where does EKS fit into all this?
Sam: Amazon EKS (Elastic Kubernetes Service) is AWS's managed Kubernetes. Here's what AWS handles for you:
AWS Manages:
- Control Plane (API Server, etcd, scheduler, controller manager)
- Control plane HA across multiple AZs
- Control plane upgrades
- Control plane scaling
- etcd backups
- API Server endpoint
You Manage:
- Worker nodes (or use Fargate)
- Applications
- Add-ons (though some are managed now)
- Networking configuration
- Security policies
Alex: So it's like... they run the control plane, I run the workers?
Sam: Exactly! Here's the EKS architecture:
┌──────────────────────────────────────────────────────┐
│ AMAZON EKS │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ AWS-MANAGED CONTROL PLANE │ │
│ │ (Runs in AWS VPC, multi-AZ) │ │
│ │ │ │
│ │ ┌─────┐ ┌─────┐ ┌──────┐ ┌──────────┐ │ │
│ │ │ API │ │etcd │ │Sched │ │Controller│ │ │
│ │ │ │ │ │ │ │ │ Manager │ │ │
│ │ └─────┘ └─────┘ └──────┘ └──────────┘ │ │
│ │ │ │
│ │ Exposed via AWS-managed Load Balancer │ │
│ └────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌────────────────────▼───────────────────────────┐ │
│ │ YOUR VPC │ │
│ │ │ │
│ │ ┌──────────────────┐ ┌──────────────────┐ │ │
│ │ │ Subnet 1 (AZ-a) │ │ Subnet 2 (AZ-b) │ │ │
│ │ │ │ │ │ │ │
│ │ │ ┌────────────┐ │ │ ┌────────────┐ │ │ │
│ │ │ │ Node 1 │ │ │ │ Node 2 │ │ │ │
│ │ │ │ (EC2 or │ │ │ │ (EC2 or │ │ │ │
│ │ │ │ Fargate) │ │ │ │ Fargate) │ │ │ │
│ │ │ │ │ │ │ │ │ │ │ │
│ │ │ │ ┌────────┐ │ │ │ │ ┌────────┐ │ │ │
│ │ │ │ │ Pods │ │ │ │ │ │ Pods │ │ │ │
│ │ │ │ └────────┘ │ │ │ │ └────────┘ │ │ │
│ │ │ └────────────┘ │ │ └────────────┘ │ │ │
│ │ └──────────────────┘ └──────────────────┘ │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ AWS Services: ECR, IAM, CloudWatch, VPC, ELB, etc. │
└──────────────────────────────────────────────────────┘
Alex: So I never SSH into the control plane?
Sam: Nope! You can't even see those machines. AWS handles all that. You interact with the cluster through kubectl, which talks to the API Server endpoint AWS provides.
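Wiring kubectl up to that AWS-managed endpoint is a single AWS CLI call, assuming your credentials are allowed to access the cluster:
# Merge the cluster endpoint and auth config into ~/.kube/config
aws eks update-kubeconfig --name my-cluster --region us-east-1
# Verify you can reach the API Server
kubectl cluster-info
kubectl get nodes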
Core Kubernetes Concepts: The Building Blocks
Alex: Alright, I need to understand the actual resources. You mentioned Pods, Deployments, Services... break them all down for me.
Sam: cracks knuckles Here we go. This is where Kubernetes gets rich but also complex.
Pods: The Atomic Unit
Sam: A Pod is the smallest deployable unit in Kubernetes. Think of it as a wrapper around one or more containers that need to run together.
apiVersion: v1
kind: Pod
metadata:
name: my-app-pod
labels:
app: my-app
tier: frontend
spec:
containers:
- name: app-container
image: 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.0
ports:
- containerPort: 8080
env:
- name: DATABASE_URL
value: "postgres://db:5432"
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
Alex: Why would I put multiple containers in one pod?
Sam: Great question! Common patterns:
Sidecar Pattern:
┌─────────────────────────┐
│ Pod │
│ │
│ ┌─────────┐ │
│ │ Main │ │
│ │ App │ │
│ └────┬────┘ │
│ │ shares │
│ ┌────▼────┐ storage │
│ │ Logging │ │
│ │ Sidecar │ │
│ └─────────┘ │
└─────────────────────────┘
Example: Main app writes logs to a shared volume, sidecar container ships logs to CloudWatch.
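A minimal sketch of that logging sidecar — two containers in one Pod sharing an emptyDir volume (image names here are just placeholders):
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  containers:
  - name: main-app
    image: my-app:v1.0              # writes logs to /var/log/app
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  - name: log-shipper
    image: fluent/fluent-bit:2.2    # placeholder log shipper; reads the same directory
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
      readOnly: true
  volumes:
  - name: app-logs
    emptyDir: {}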
Ambassador Pattern:
┌─────────────────────────┐
│ Pod │
│ │
│ ┌─────────┐ │
│ │ Main │───────┐ │
│ │ App │ │ │
│ └─────────┘ │ │
│ │ │
│ ┌─────────┐ │ │
│ │ Proxy/ │◄─────┘ │
│ │ Cache │ │
│ └────┬────┘ │
└───────┼─────────────────┘
│
Network
Example: Main app talks to local proxy, proxy handles connection pooling to database.
Alex: But usually it's just one container per pod?
Sam: Usually, yes! One container per pod is the most common pattern.
ReplicaSets: Maintaining Desired Count
Sam: ReplicaSets ensure a specified number of pod replicas are running at all times.
apiVersion: apps/v1
kind: ReplicaSet
metadata:
name: my-app-rs
spec:
replicas: 3
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
spec:
containers:
- name: app
image: my-app:v1.0
If a pod dies, ReplicaSet creates a new one. If you manually create extra pods with matching labels, ReplicaSet deletes them to maintain exactly 3.
Alex: So ReplicaSets are like the ECS Service desired count?
Sam: Yes! But here's the thing – you almost never create ReplicaSets directly. You use...
Deployments: The Smart Way to Manage Replicas
Sam: Deployments are what you actually use. They manage ReplicaSets for you and provide:
- Rolling updates
- Rollbacks
- Version history
- Pause/resume capabilities
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
labels:
app: my-app
spec:
replicas: 5
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Max 1 extra pod during update
maxUnavailable: 1 # Max 1 pod down during update
selector:
matchLabels:
app: my-app
template:
metadata:
labels:
app: my-app
version: v1.0
spec:
containers:
- name: my-app
image: my-app:v1.0
ports:
- containerPort: 8080
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
Alex: What happens when I update the image?
Sam: Magic! Watch this:
Current State (v1.0):
┌──────────────────────────────────┐
│ Deployment: my-app │
│ ├─ ReplicaSet-abc (5 pods v1.0) │
│ ├─ Pod 1 │
│ ├─ Pod 2 │
│ ├─ Pod 3 │
│ ├─ Pod 4 │
│ └─ Pod 5 │
└──────────────────────────────────┘
Update to v1.1:
$ kubectl set image deployment/my-app my-app=my-app:v1.1
During Rolling Update:
┌──────────────────────────────────┐
│ Deployment: my-app │
│ ├─ ReplicaSet-abc (3 pods v1.0) │
│ │ ├─ Pod 1 │
│ │ ├─ Pod 2 │
│ │ └─ Pod 3 │
│ └─ ReplicaSet-xyz (2 pods v1.1) │
│ ├─ Pod 6 (new!) │
│ └─ Pod 7 (new!) │
└──────────────────────────────────┘
Final State:
┌──────────────────────────────────┐
│ Deployment: my-app │
│ ├─ ReplicaSet-abc (0 pods) │
│ └─ ReplicaSet-xyz (5 pods v1.1) │
│ ├─ Pod 6 │
│ ├─ Pod 7 │
│ ├─ Pod 8 │
│ ├─ Pod 9 │
│ └─ Pod 10 │
└──────────────────────────────────┘
Kubernetes gradually terminates old pods and starts new ones. If something goes wrong:
kubectl rollout undo deployment/my-app
And you're back to v1.0!
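A few related rollout commands worth keeping handy (standard kubectl, nothing specific to this app):
# Watch a rolling update progress until it completes or fails
kubectl rollout status deployment/my-app
# See the revision history the Deployment keeps
kubectl rollout history deployment/my-app
# Roll back to a specific revision instead of just the previous one
kubectl rollout undo deployment/my-app --to-revision=2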
Alex: That's actually really cool. ECS can do blue/green, but this feels more gradual.
Services: Stable Networking
Sam: Here's a problem: Pods are ephemeral. They get IPs when they start, but those IPs change if pods restart. How do you connect to them?
Alex: Um... DNS?
Sam: Kind of! That's where Services come in. A Service provides a stable IP and DNS name for a set of pods.
apiVersion: v1
kind: Service
metadata:
name: my-app-service
spec:
type: ClusterIP # Only accessible within cluster
selector:
app: my-app
ports:
- protocol: TCP
port: 80 # Service port
targetPort: 8080 # Container port
Now any pod can reach your app at my-app-service:80 or my-app-service.default.svc.cluster.local:80
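You can check that DNS name from inside the cluster with a throwaway pod (busybox used here purely as a convenient test image):
# Resolve the Service name and make a request to it from a temporary pod
kubectl run dns-test --image=busybox:1.36 -it --rm --restart=Never -- \
  sh -c "nslookup my-app-service && wget -qO- http://my-app-service:80"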
Service Types:
1. ClusterIP (default):
Internal only, not accessible from outside
┌──────────────────────────┐
│ Cluster │
│ ┌────────┐ │
│ │ Pod A │──────┐ │
│ └────────┘ │ │
│ ▼ │
│ ┌────────────────────┐ │
│ │ Service ClusterIP │ │
│ │ 10.100.200.50 │ │
│ └────────┬───────────┘ │
│ │ │
│ ┌────▼───┐ │
│ │ Pod B │ │
│ └────────┘ │
└──────────────────────────┘
2. NodePort:
Accessible on each node's IP at a static port
┌──────────────────────────┐
│ Cluster │
│ ┌─────────────────────┐ │
│ │ Node 1 │ │
│ │ IP: 10.0.1.50 │ │
│ │ NodePort: 30080 ────┼─┼──► External traffic
│ └─────────┬───────────┘ │
│ │ │
│ ┌─────────▼───────────┐ │
│ │ Service │ │
│ └─────────┬───────────┘ │
│ │ │
│ ┌────▼───┐ │
│ │ Pods │ │
│ └────────┘ │
└──────────────────────────┘
3. LoadBalancer (the one you'll use most in EKS):
apiVersion: v1
kind: Service
metadata:
name: my-app
annotations:
service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
type: LoadBalancer
selector:
app: my-app
ports:
- port: 80
targetPort: 8080
This automatically provisions an AWS Load Balancer!
Internet
│
▼
┌─────────────────────┐
│ AWS Load Balancer │
│ (NLB or ALB) │
└────────┬────────────┘
│
┌─────▼──────┐
│ Service LB │
└─────┬──────┘
│
┌────▼────┐
│ Pods │
└─────────┘
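Once provisioning finishes (it can take a couple of minutes), the load balancer's DNS name shows up on the Service itself:
# EXTERNAL-IP shows the NLB/ALB DNS name once it's ready
kubectl get svc my-app
# Or pull just the hostname
kubectl get svc my-app -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'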
Alex: Wait, so Kubernetes creates an actual AWS Load Balancer?
Sam: Yes! Through the AWS Load Balancer Controller. We'll get to that. But this is where EKS's AWS integration shines.
ConfigMaps and Secrets: Configuration Management
Sam: Applications need configuration. In Kubernetes, you use ConfigMaps for non-sensitive data and Secrets for sensitive data.
ConfigMap example:
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
data:
database_host: "postgres.example.com"
log_level: "info"
app.properties: |
feature.new=true
timeout=30s
Using it in a Pod:
apiVersion: v1
kind: Pod
metadata:
name: my-app
spec:
containers:
- name: app
image: my-app:v1.0
env:
# As environment variables
- name: DATABASE_HOST
valueFrom:
configMapKeyRef:
name: app-config
key: database_host
# Or mount as files
volumeMounts:
- name: config-volume
mountPath: /etc/config
volumes:
- name: config-volume
configMap:
name: app-config
Secrets (similar but encoded):
apiVersion: v1
kind: Secret
metadata:
name: db-secret
type: Opaque
data:
username: YWRtaW4= # base64 encoded
password: cGFzc3dvcmQ=
Alex: Base64 isn't encryption though...
Sam: Correct! Secrets in Kubernetes are just base64 encoded by default. In EKS, you should use AWS Secrets Manager or Parameter Store integration for real security. We'll cover that.
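For completeness, this is how those base64 values are usually produced and read back — which also shows why they're not really secret from anyone with read access:
# kubectl base64-encodes literal values for you
kubectl create secret generic db-secret \
  --from-literal=username=admin \
  --from-literal=password='S3cr3t!'
# Decoding is trivial for anyone allowed to read the Secret
kubectl get secret db-secret -o jsonpath='{.data.password}' | base64 -d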
Networking in EKS: The Complex Beast
Alex: Okay, I need to understand networking because I heard it's complicated.
Sam: deep breath You heard right. Kubernetes networking is... special. Let me break it down.
The Kubernetes Networking Model has three rules:
- Every pod gets its own IP address
- All pods can communicate with all other pods without NAT
- All nodes can communicate with all pods without NAT
Alex: How does EKS implement this?
Sam: Through the Amazon VPC CNI (Container Network Interface) plugin. This is unique to EKS and actually pretty clever.
Traditional Kubernetes:
┌────────────────────────────┐
│ Node (EC2 Instance) │
│ IP: 10.0.1.50 │
│ │
│ ┌──────────────────────┐ │
│ │ Pod 1: 172.16.1.5 │ │
│ └──────────────────────┘ │
│ ┌──────────────────────┐ │
│ │ Pod 2: 172.16.1.6 │ │
│ └──────────────────────┘ │
│ │
│ Pod IPs are virtual │
│ (overlay network) │
└────────────────────────────┘
EKS with VPC CNI:
┌────────────────────────────┐
│ Node (EC2 Instance) │
│ Primary IP: 10.0.1.50 │
│ │
│ ┌──────────────────────┐ │
│ │ Pod 1: 10.0.1.100 │ │ ← Real VPC IP!
│ └──────────────────────┘ │
│ ┌──────────────────────┐ │
│ │ Pod 2: 10.0.1.101 │ │ ← Real VPC IP!
│ └──────────────────────┘ │
│ │
│ Pods get real ENI IPs │
│ from your VPC! │
└────────────────────────────┘
Alex: So pods get actual VPC IP addresses?
Sam: Yes! This means:
Pros:
- Pods can directly communicate with VPC resources (RDS, ElastiCache, etc.)
- VPC Security Groups can apply to pods
- No overlay network performance penalty
- VPC Flow Logs work
Cons:
- You consume IP addresses quickly
- Need to plan VPC CIDR blocks carefully
- ENI limits per instance type matter
Alex: How many pods can I run per node?
Sam: It depends on the EC2 instance type and its ENI limits:
The formula the VPC CNI uses by default is: max pods = ENIs × (IPs per ENI − 1) + 2
t3.small: 3 ENIs × 4 IPs per ENI → 3 × 3 + 2 = 11 pods max
m5.large: 3 ENIs × 10 IPs per ENI → 3 × 9 + 2 = 29 pods max
m5.xlarge: 4 ENIs × 15 IPs per ENI → 4 × 14 + 2 = 58 pods max
(Each ENI's primary IP isn't handed to pods, and the +2 covers host-networking pods like kube-proxy and the CNI itself, which don't consume ENI IPs.)
Alex: What if I run out of IPs?
Sam: You have options:
1. Use larger CIDR blocks:
Bad: 10.0.1.0/28 (16 IPs - ouch!)
Better: 10.0.1.0/24 (256 IPs)
Best: 10.0.0.0/16 (65,536 IPs)
2. Use secondary CIDR blocks:
# Add a secondary CIDR to your VPC
10.0.0.0/16 (primary)
100.64.0.0/16 (secondary for pods)
3. Use Custom Networking:
Pods use a different subnet than the nodes (see the sketch after this list)
4. Use Fargate:
No node IP limits!
5. Use IPv6:
Basically infinite addresses
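Options 2 and 3 are often combined in practice: pods draw their IPs from a secondary-CIDR subnet while nodes stay in the primary. A rough sketch of the custom networking side (subnet and security group IDs are placeholders; see the VPC CNI docs for the full procedure):
# Tell the VPC CNI to use ENIConfig resources for pod ENIs
kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true
# One ENIConfig per availability zone, pointing at the pod subnet in that AZ
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-1a                  # conventionally named after the AZ
spec:
  subnet: subnet-pods-1a            # placeholder: subnet from the secondary CIDR
  securityGroups:
  - sg-0123456789abcdef0            # placeholder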
Ingress: Advanced Routing
Sam: Services get you basic load balancing, but what if you need:
- Path-based routing (/api/users → User Service)
- Host-based routing (api.example.com → API Service)
- SSL termination
- Advanced traffic management
That's where Ingress comes in.
Alex: Is this like ALB path-based routing?
Sam: Exactly! But Kubernetes Ingress is a standard API, and different controllers implement it.
In EKS, you use the AWS Load Balancer Controller:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: app-ingress
annotations:
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/target-type: ip
alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...
spec:
ingressClassName: alb
rules:
- host: api.example.com
http:
paths:
- path: /users
pathType: Prefix
backend:
service:
name: user-service
port:
number: 80
- path: /orders
pathType: Prefix
backend:
service:
name: order-service
port:
number: 80
This creates:
Internet
│
▼
┌─────────────────────────────────┐
│ Application Load Balancer (ALB) │
│ api.example.com │
└────┬─────────────────────┬──────┘
│ │
│ /users │ /orders
▼ ▼
┌──────────┐ ┌──────────┐
│ User │ │ Order │
│ Service │ │ Service │
└────┬─────┘ └────┬─────┘
│ │
┌──▼──┐ ┌──▼──┐
│Pods │ │Pods │
└─────┘ └─────┘
Alex: That's cleaner than managing ALB rules manually!
Sam: Absolutely! And the Ingress controller handles:
- Creating the ALB
- Configuring listeners
- Updating target groups
- Health checks
- SSL certificates from ACM
Storage in Kubernetes
Alex: What about data that needs to persist? Databases, file uploads, etc.?
Sam: Kubernetes storage is built on these concepts:
1. Volumes: Storage attached to a pod's lifecycle
2. Persistent Volumes (PV): Storage that exists independently
3. Persistent Volume Claims (PVC): Request for storage
┌─────────────────────────────────────────┐
│ Storage Architecture │
│ │
│ ┌───────────────┐ │
│ │ Pod │ │
│ │ ┌─────────┐ │ │
│ │ │Container│ │ │
│ │ └────┬────┘ │ │
│ │ │mount │ │
│ │ ┌────▼──────────────┐ │
│ │ │ PVC: my-app-storage│ │
│ │ │ Size: 10Gi │ │
│ │ └────┬───────────────┘ │
│ └───────┼───────────────┘ │
│ │ bound to │
│ ┌───────▼──────────────┐ │
│ │ PV: pv-12345 │ │
│ │ Size: 10Gi │ │
│ │ Type: EBS │ │
│ └───────┬──────────────┘ │
│ │ │
│ ┌───────▼──────────────┐ │
│ │ AWS EBS Volume │ │
│ │ vol-abcdef123 │ │
│ └──────────────────────┘ │
└─────────────────────────────────────────┘
Creating a PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-pvc
spec:
accessModes:
- ReadWriteOnce # Can be mounted by one node at a time
storageClassName: gp3
resources:
requests:
storage: 20Gi
Using it in a Pod:
apiVersion: v1
kind: Pod
metadata:
name: postgres
spec:
containers:
- name: postgres
image: postgres:14
volumeMounts:
- name: postgres-storage
mountPath: /var/lib/postgresql/data
volumes:
- name: postgres-storage
persistentVolumeClaim:
claimName: postgres-pvc
Storage Classes in EKS:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
type: gp3
iops: "3000"
throughput: "125"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
Alex: What storage should I use for different scenarios?
Sam: Good question! Here's my guide:
EBS (via EBS CSI Driver):
✓ Databases (PostgreSQL, MySQL)
✓ Single-writer applications
✗ Multi-node read/write
✗ Cross-AZ access
EFS (via EFS CSI Driver):
✓ Shared file storage
✓ Multi-pod read/write
✓ Cross-AZ access
✗ High-performance databases
✗ Block-level operations
FSx for Lustre:
✓ High-performance computing
✓ Machine learning training
✗ General purpose
S3 (via CSI drivers):
✓ Object storage
✓ Static assets
✗ POSIX filesystem operations
Security in EKS: Locking It Down
Alex: nervously Okay, security. I know this is important but overwhelming.
Sam: Deep breath. Let's tackle this systematically. Security in EKS has multiple layers:
1. IAM Roles for Service Accounts (IRSA)
Sam: This is CRUCIAL in EKS. It lets pods assume IAM roles without needing node-level permissions.
The Old Way (bad):
┌──────────────────────┐
│ EC2 Instance │
│ IAM Role: SuperPower│ ← All pods get this!
│ ┌────┐ ┌────┐ │
│ │Pod1│ │Pod2│ │
│ └────┘ └────┘ │
└──────────────────────┘
Problem: Pod1 needs S3, Pod2 needs DynamoDB
But both get full access!
The EKS Way (good - IRSA):
┌──────────────────────────────────┐
│ EC2 Instance │
│ IAM Role: Minimal │
│ │
│ ┌────────────┐ ┌────────────┐ │
│ │ Pod1 │ │ Pod2 │ │
│ │ IAM Role: │ │ IAM Role: │ │
│ │ S3Access │ │ DynamoAccess│ │
│ └────────────┘ └────────────┘ │
└──────────────────────────────────┘
Setting it up:
# 1. Create IAM role
eksctl create iamserviceaccount \
--name my-app-sa \
--namespace default \
--cluster my-cluster \
--attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
--approve
# 2. Use it in your deployment
apiVersion: v1
kind: ServiceAccount
metadata:
name: my-app-sa
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/my-app-role
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
template:
spec:
serviceAccountName: my-app-sa # Pod gets this role!
containers:
- name: app
image: my-app:v1.0
Alex: So the pod can just call AWS APIs now?
Sam: Exactly! The AWS SDKs automatically pick up the credentials from the service account.
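You can sanity-check IRSA from inside a pod — EKS injects a couple of environment variables, and the SDK exchanges the token for temporary credentials (assumes the AWS CLI is available in the image):
# The webhook injects these into pods using the annotated service account
kubectl exec -it deploy/my-app -- env | grep -E 'AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE'
# The CLI/SDK should now report the pod's IAM role, not the node's
kubectl exec -it deploy/my-app -- aws sts get-caller-identity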
2. Pod Security Standards
Sam: You need to prevent pods from running with dangerous settings:
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
This prevents:
- Running as root
- Privileged containers
- Host network access
- Unsafe volume types
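To actually be admitted into that namespace, a pod has to declare the safe settings explicitly. A minimal container-level securityContext that satisfies the restricted profile looks roughly like this:
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault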
3. Network Policies
Sam: Control which pods can talk to which:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-network-policy
namespace: production
spec:
podSelector:
matchLabels:
app: api
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
app: frontend
ports:
- protocol: TCP
port: 8080
egress:
- to:
- podSelector:
matchLabels:
app: database
ports:
- protocol: TCP
port: 5432
This means:
┌──────────┐
│ Frontend │
│ Pod │───────┐
└──────────┘ │
▼ ✓ Allowed
┌──────────┐
│ API │
│ Pod │
└────┬─────┘
│
▼ ✓ Allowed
┌──────────┐
│ Database │
│ Pod │
└──────────┘
┌──────────┐
│ Random │
│ Pod │───X──► API (Blocked!)
└──────────┘
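A common baseline before adding targeted allow rules like the one above is a default-deny policy for the whole namespace (note that enforcing NetworkPolicies on EKS requires a policy engine, e.g. the VPC CNI's network policy support or Calico):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}        # empty selector = every pod in the namespace
  policyTypes:
  - Ingress
  - Egress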
4. Secrets Management
Sam: Don't use default Kubernetes secrets for production! Use AWS Secrets Manager:
AWS Secrets Manager CSI Driver:
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
name: aws-secrets
spec:
provider: aws
parameters:
objects: |
- objectName: "prod/database/password"
objectType: "secretsmanager"
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
template:
spec:
serviceAccountName: my-app-sa
containers:
- name: app
image: my-app:v1.0
volumeMounts:
- name: secrets
mountPath: "/mnt/secrets"
readOnly: true
volumes:
- name: secrets
csi:
driver: secrets-store.csi.k8s.io
readOnly: true
volumeAttributes:
secretProviderClass: "aws-secrets"
Now your database password is securely fetched from AWS Secrets Manager and mounted as a file!
5. Image Security
Sam: Scan images for vulnerabilities:
# In your CI/CD pipeline
steps:
- name: Build image
run: docker build -t my-app:v1.0 .
- name: Scan with ECR
run: |
aws ecr start-image-scan \
--repository-name my-app \
--image-id imageTag=v1.0
- name: Check scan results
run: |
aws ecr describe-image-scan-findings \
--repository-name my-app \
--image-id imageTag=v1.0
# Fail if critical vulnerabilities found
Alex: This is a lot...
Sam: I know. But security is layers. You don't need everything day one, but you should have a plan to implement them.
↑ Back to Table of Contents
Setting Up Your First EKS Cluster
Alex: Alright, enough theory. How do I actually create an EKS cluster?
Sam: Great! Let's do this properly. You have a few options:
Option 1: eksctl (Easiest)
# Install eksctl
brew install eksctl # Mac
# or download from github.com/eksctl-io/eksctl
# Create cluster
eksctl create cluster \
--name my-cluster \
--region us-east-1 \
--nodegroup-name standard-workers \
--node-type t3.medium \
--nodes 3 \
--nodes-min 2 \
--nodes-max 4 \
--managed
This creates:
- EKS control plane
- VPC with public/private subnets
- Managed node group
- All necessary IAM roles
- VPC CNI, kube-proxy, CoreDNS add-ons
Alex: That's it?
Sam: That's it! In about 15 minutes, you have a production-ready cluster.
Option 2: eksctl with Config File (Better)
# cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: production-cluster
region: us-east-1
version: "1.28"
# Use existing VPC (recommended)
vpc:
id: "vpc-12345"
subnets:
private:
us-east-1a: { id: subnet-aaaa }
us-east-1b: { id: subnet-bbbb }
us-east-1c: { id: subnet-cccc }
# IAM OIDC provider for IRSA
iam:
withOIDC: true
# Managed node groups
managedNodeGroups:
- name: general-purpose
instanceType: t3.medium
minSize: 2
maxSize: 6
desiredCapacity: 3
volumeSize: 20
privateNetworking: true
labels:
workload-type: general
tags:
Environment: production
Team: platform
- name: memory-optimized
instanceType: r5.large
minSize: 1
maxSize: 3
desiredCapacity: 1
privateNetworking: true
labels:
workload-type: memory-intensive
taints:
- key: workload-type
value: memory-intensive
effect: NoSchedule
# Add-ons
addons:
- name: vpc-cni
version: latest
- name: coredns
version: latest
- name: kube-proxy
version: latest
- name: aws-ebs-csi-driver
version: latest
# CloudWatch logging
cloudWatch:
clusterLogging:
enableTypes:
- api
- audit
- authenticator
- controllerManager
- scheduler
eksctl create cluster -f cluster.yaml
Option 3: Terraform (Most Control)
# main.tf
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 19.0"
cluster_name = "production-cluster"
cluster_version = "1.28"
cluster_endpoint_public_access = true
cluster_endpoint_private_access = true
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnets
# EKS Managed Node Group(s)
eks_managed_node_groups = {
general = {
min_size = 2
max_size = 6
desired_size = 3
instance_types = ["t3.medium"]
capacity_type = "ON_DEMAND"
labels = {
Environment = "production"
Workload = "general"
}
tags = {
Team = "platform"
}
}
spot = {
min_size = 0
max_size = 10
desired_size = 2
instance_types = ["t3.medium", "t3a.medium"]
capacity_type = "SPOT"
labels = {
Workload = "batch"
}
taints = [{
key = "spot"
value = "true"
effect = "NoSchedule"
}]
}
}
# Enable IRSA
enable_irsa = true
# Add-ons
cluster_addons = {
vpc-cni = {
most_recent = true
}
kube-proxy = {
most_recent = true
}
coredns = {
most_recent = true
}
aws-ebs-csi-driver = {
most_recent = true
}
}
tags = {
Environment = "production"
Terraform = "true"
}
}
Alex: Which should I use?
Sam:
- eksctl: Quick start, learning, small teams
- eksctl + config file: Most common, good balance
- Terraform: Enterprise, multi-cluster, infrastructure as code
I'd start with eksctl config file. It's declarative, version-controllable, and easy to understand.
↑ Back to Table of Contents
Essential Add-ons and Tools
Alex: You mentioned add-ons. What else do I need?
Sam: Great question! A basic EKS cluster needs several additional components:
1. AWS Load Balancer Controller
Why: Creates ALBs and NLBs from Ingress and Service resources
# Install via Helm
helm repo add eks https://aws.github.io/eks-charts
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
-n kube-system \
--set clusterName=my-cluster \
--set serviceAccount.create=false \
--set serviceAccount.name=aws-load-balancer-controller
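Note the serviceAccount.create=false flag above — the chart expects an IRSA-backed service account to exist already. A rough sketch of creating it (the policy ARN is a placeholder for the controller's IAM policy, which you create from the policy document in the controller's documentation):
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace kube-system \
  --name aws-load-balancer-controller \
  --attach-policy-arn arn:aws:iam::ACCOUNT_ID:policy/AWSLoadBalancerControllerIAMPolicy \
  --approve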
Now you can create ALBs:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: my-ingress
annotations:
alb.ingress.kubernetes.io/scheme: internet-facing
spec:
ingressClassName: alb
# ... rest of config
2. EBS CSI Driver
Why: Persistent volumes using EBS
# Already installed if you used addons in eksctl
# Or install manually
kubectl apply -k "github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.25"
3. Metrics Server
Why: For kubectl top and HorizontalPodAutoscaler
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
4. Cluster Autoscaler
Why: Automatically scale node groups
apiVersion: apps/v1
kind: Deployment
metadata:
name: cluster-autoscaler
namespace: kube-system
spec:
template:
spec:
containers:
- name: cluster-autoscaler
image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --namespace=kube-system
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
- --balance-similar-node-groups
- --skip-nodes-with-system-pods=false
How it works:
Scenario: Need more pods than nodes can handle
1. Pods are pending (not enough resources)
2. Cluster Autoscaler sees pending pods
3. Checks node group can scale up
4. Triggers ASG to add nodes
5. New node joins cluster
6. Pods schedule on new node
Scenario: Nodes are underutilized
1. Node has <50% utilization for 10 minutes
2. All pods can fit elsewhere
3. Cluster Autoscaler cordons node
4. Drains pods (moves them)
5. Terminates node
6. ASG size decreases
5. Karpenter (Alternative to Cluster Autoscaler)
Sam: Karpenter is newer and smarter:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
name: default
spec:
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: node.kubernetes.io/instance-type
operator: In
values: ["t3.medium", "t3.large", "t3.xlarge"]
limits:
resources:
cpu: 1000
providerRef:
name: default
ttlSecondsAfterEmpty: 30
Karpenter vs Cluster Autoscaler:
Cluster Autoscaler:
- Works with node groups
- Slower (waits for ASG)
- Less flexible instance types
Karpenter:
- Directly provisions EC2
- Faster (no ASG)
- Intelligently picks instance types
- Better bin-packing
Alex: Which should I use?
Sam: Start with Cluster Autoscaler (simpler, well-tested). Move to Karpenter when you need faster scaling or better cost optimization.
↑ Back to Table of Contents
Observability: Knowing What's Happening
Alex: How do I monitor all this?
Sam: Observability in K8s has three pillars: Metrics, Logs, and Traces.
Metrics: CloudWatch Container Insights
# Install CloudWatch Agent
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluentd-quickstart.yaml
This gives you:
- Cluster metrics (CPU, memory, disk, network)
- Node metrics
- Pod metrics
- Namespace metrics
- Service metrics
Custom metrics:
apiVersion: v1
kind: Pod
metadata:
name: my-app
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: app
image: my-app:v1.0
# App exposes metrics at :8080/metrics
Logs: FluentBit to CloudWatch
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
namespace: amazon-cloudwatch
data:
fluent-bit.conf: |
[INPUT]
Name tail
Path /var/log/containers/*.log
Parser docker
Tag kube.*
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
[OUTPUT]
Name cloudwatch_logs
Match *
region us-east-1
log_group_name /aws/eks/my-cluster/containers
auto_create_group true
Alex: Can I see logs from kubectl?
Sam: Absolutely!
# View pod logs
kubectl logs my-pod
# Follow logs (like tail -f)
kubectl logs -f my-pod
# Logs from specific container in pod
kubectl logs my-pod -c sidecar-container
# Previous container logs (if crashed)
kubectl logs my-pod --previous
# Logs from all pods with label
kubectl logs -l app=my-app
# Logs from last hour
kubectl logs my-pod --since=1h
Traces: AWS X-Ray
apiVersion: apps/v1
kind: Deployment
metadata:
name: my-app
spec:
template:
spec:
containers:
- name: app
image: my-app:v1.0
env:
- name: AWS_XRAY_DAEMON_ADDRESS
value: xray-service.default:2000
- name: xray-daemon
image: amazon/aws-xray-daemon:latest
ports:
- containerPort: 2000
protocol: UDP
The full observability stack:
┌────────────────────────────────────┐
│ Your Application │
│ - Emits metrics (Prometheus) │
│ - Writes logs (stdout/stderr) │
│ - Sends traces (X-Ray) │
└──────┬──────────────┬──────────┬───┘
│ │ │
│ Metrics │ Logs │ Traces
▼ ▼ ▼
┌──────────────┐ ┌──────────┐ ┌──────────┐
│ CloudWatch │ │FluentBit │ │ X-Ray │
│ Container │ │ │ │ Daemon │
│ Insights │ └────┬─────┘ └────┬─────┘
└──────┬───────┘ │ │
│ ▼ ▼
│ ┌─────────────────────┐
└────────►│ CloudWatch & │
│ X-Ray │
└──────────┬──────────┘
│
┌────────▼──────────┐
│ CloudWatch │
│ Dashboards │
└───────────────────┘
↑ Back to Table of Contents
Real-World Application Deployment
Alex: Okay, let's put this all together. Walk me through deploying a real application.
Sam: Perfect! Let's deploy a three-tier application:
- React frontend
- Node.js API
- PostgreSQL database
Step 1: Namespace and RBAC
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: myapp
labels:
name: myapp
---
# Service account for the app
apiVersion: v1
kind: ServiceAccount
metadata:
name: myapp-sa
namespace: myapp
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/myapp-role
Step 2: Database (PostgreSQL)
# postgres.yaml
apiVersion: v1
kind: Secret
metadata:
name: postgres-secret
namespace: myapp
type: Opaque
data:
password: cG9zdGdyZXM= # base64 encoded
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-pvc
namespace: myapp
spec:
accessModes:
- ReadWriteOnce
storageClassName: gp3
resources:
requests:
storage: 20Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres
namespace: myapp
spec:
serviceName: postgres
replicas: 1
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
containers:
- name: postgres
image: postgres:14
ports:
- containerPort: 5432
env:
- name: POSTGRES_DB
value: myapp
- name: POSTGRES_USER
value: myapp
- name: POSTGRES_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-secret
key: password
volumeMounts:
- name: postgres-storage
mountPath: /var/lib/postgresql/data
subPath: postgres
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
volumes:
- name: postgres-storage
persistentVolumeClaim:
claimName: postgres-pvc
---
apiVersion: v1
kind: Service
metadata:
name: postgres
namespace: myapp
spec:
selector:
app: postgres
ports:
- port: 5432
targetPort: 5432
clusterIP: None # Headless service for StatefulSet
Step 3: Backend API
# api.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: api-config
namespace: myapp
data:
DATABASE_HOST: postgres.myapp.svc.cluster.local
DATABASE_PORT: "5432"
DATABASE_NAME: myapp
LOG_LEVEL: info
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
namespace: myapp
spec:
replicas: 3
selector:
matchLabels:
app: api
template:
metadata:
labels:
app: api
version: v1
spec:
serviceAccountName: myapp-sa
containers:
- name: api
image: 123456789.dkr.ecr.us-east-1.amazonaws.com/myapp-api:v1.0
ports:
- containerPort: 3000
env:
- name: DATABASE_HOST
valueFrom:
configMapKeyRef:
name: api-config
key: DATABASE_HOST
- name: DATABASE_PASSWORD
valueFrom:
secretKeyRef:
name: postgres-secret
key: password
livenessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 5
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "200m"
---
apiVersion: v1
kind: Service
metadata:
name: api
namespace: myapp
spec:
selector:
app: api
ports:
- port: 80
targetPort: 3000
type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-hpa
namespace: myapp
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
Step 4: Frontend
# frontend.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: frontend
namespace: myapp
spec:
replicas: 2
selector:
matchLabels:
app: frontend
template:
metadata:
labels:
app: frontend
spec:
containers:
- name: frontend
image: 123456789.dkr.ecr.us-east-1.amazonaws.com/myapp-frontend:v1.0
ports:
- containerPort: 80
resources:
requests:
memory: "64Mi"
cpu: "50m"
limits:
memory: "128Mi"
cpu: "100m"
---
apiVersion: v1
kind: Service
metadata:
name: frontend
namespace: myapp
spec:
selector:
app: frontend
ports:
- port: 80
targetPort: 80
type: ClusterIP
Step 5: Ingress (ALB)
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: myapp-ingress
namespace: myapp
annotations:
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/target-type: ip
alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:ACCOUNT:certificate/abc123
alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
alb.ingress.kubernetes.io/ssl-redirect: '443'
alb.ingress.kubernetes.io/healthcheck-path: /health
alb.ingress.kubernetes.io/success-codes: '200'
spec:
ingressClassName: alb
rules:
- host: myapp.example.com
http:
paths:
- path: /api
pathType: Prefix
backend:
service:
name: api
port:
number: 80
- path: /
pathType: Prefix
backend:
service:
name: frontend
port:
number: 80
Deploy Everything
# Create namespace
kubectl apply -f namespace.yaml
# Deploy database
kubectl apply -f postgres.yaml
# Wait for database to be ready
kubectl wait --for=condition=ready pod -l app=postgres -n myapp --timeout=300s
# Deploy API
kubectl apply -f api.yaml
# Wait for API
kubectl wait --for=condition=ready pod -l app=api -n myapp --timeout=300s
# Deploy frontend
kubectl apply -f frontend.yaml
# Create ingress
kubectl apply -f ingress.yaml
# Check everything
kubectl get all -n myapp
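A quick smoke test once the ALB is provisioned (DNS setup for myapp.example.com is assumed; the exact paths depend on how your API routes requests):
# The Ingress shows the ALB's DNS name once provisioning finishes
kubectl get ingress myapp-ingress -n myapp
# After pointing myapp.example.com at that ALB (CNAME or Route 53 alias):
curl -i https://myapp.example.com/        # should hit the frontend
curl -i https://myapp.example.com/api     # should hit the API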
The final architecture:
Internet
│
▼
┌─────────────────────────────────────┐
│ Application Load Balancer (ALB) │
│ myapp.example.com │
└──────┬────────────────────┬─────────┘
│ │
│ /api │ /
▼ ▼
┌──────────────┐ ┌──────────────┐
│ API Service │ │ Frontend │
│ (ClusterIP) │ │ Service │
└──────┬───────┘ └──────┬───────┘
│ │
┌──────▼────────┐ ┌──────▼───────┐
│ API Pods x3 │ │Frontend Pods │
│ (Deployment) │ │ x2 │
└──────┬────────┘ └──────────────┘
│
│ connects to
▼
┌──────────────────┐
│ Postgres Service │
│ (Headless) │
└──────┬───────────┘
│
┌──────▼────────────┐
│ Postgres Pod │
│ (StatefulSet) │
└──────┬────────────┘
│
┌──────▼────────────┐
│ EBS Volume │
│ (20Gi gp3) │
└───────────────────┘
Alex: impressed That's a full application!
Sam: And it has:
- High availability (multi-replica)
- Auto-scaling (HPA)
- Persistent storage (EBS)
- TLS/HTTPS (ACM certificate)
- Health checks
- Resource limits
- Proper networking
↑ Back to Table of Contents
Advanced Patterns and Best Practices
Alex: What are some patterns I should know about as I get more advanced?
Sam: Let me share patterns I use in production:
1. Init Containers
Problem: App needs database migrations before starting
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
spec:
template:
spec:
initContainers:
- name: migration
image: myapp-api:v1.0
command: ['npm', 'run', 'migrate']
env:
- name: DATABASE_URL
value: postgres://...
containers:
- name: api
image: myapp-api:v1.0
Flow:
1. Pod created
2. Init container runs migrations
3. If migrations succeed, app container starts
4. If migrations fail, pod fails
2. Pod Disruption Budgets
Problem: Want to ensure minimum availability during updates
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-pdb
spec:
minAvailable: 2
selector:
matchLabels:
app: api
This prevents:
- Draining too many nodes at once
- Updating too many pods simultaneously
- Cluster maintenance killing all pods
3. Resource Quotas
Problem: Teams overconsuming resources
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-quota
namespace: team-a
spec:
hard:
requests.cpu: "10"
requests.memory: 20Gi
limits.cpu: "20"
limits.memory: 40Gi
persistentvolumeclaims: "5"
pods: "50"
4. Affinity and Anti-Affinity
Problem: Want to spread pods across nodes
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
spec:
template:
spec:
affinity:
  podAntiAffinity:
    # Prefer spreading pods across different nodes
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: api
        topologyKey: kubernetes.io/hostname
    # Require spreading pods across availability zones
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: api
      topologyKey: topology.kubernetes.io/zone
Result:
┌──────────────┬──────────────┬──────────────┐
│ AZ-1a │ AZ-1b │ AZ-1c │
├──────────────┼──────────────┼──────────────┤
│ Node 1 │ Node 3 │ Node 5 │
│ - API Pod 1 │ - API Pod 3 │ - API Pod 5 │
│ │ │ │
│ Node 2 │ Node 4 │ │
│ - API Pod 2 │ - API Pod 4 │ │
└──────────────┴──────────────┴──────────────┘
5. Jobs and CronJobs
Problem: Need to run batch tasks
# One-time job
apiVersion: batch/v1
kind: Job
metadata:
name: data-import
spec:
template:
spec:
containers:
- name: import
image: myapp-importer:v1.0
command: ['python', 'import.py']
restartPolicy: Never
backoffLimit: 3
---
# Scheduled job
apiVersion: batch/v1
kind: CronJob
metadata:
name: daily-report
spec:
schedule: "0 2 * * *" # 2 AM daily
jobTemplate:
spec:
template:
spec:
containers:
- name: report
image: myapp-reporter:v1.0
command: ['python', 'report.py']
restartPolicy: Never
6. Service Mesh (AWS App Mesh)
Problem: Need advanced traffic management, retries, circuit breaking
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualNode
metadata:
name: api-v1
spec:
podSelector:
matchLabels:
app: api
version: v1
listeners:
- portMapping:
port: 3000
protocol: http
serviceDiscovery:
dns:
hostname: api.myapp.svc.cluster.local
backends:
- virtualService:
virtualServiceRef:
name: database
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualRouter
metadata:
name: api-router
spec:
listeners:
- portMapping:
port: 3000
protocol: http
routes:
- name: api-route
httpRoute:
match:
prefix: /
action:
weightedTargets:
- virtualNodeRef:
name: api-v1
weight: 90
- virtualNodeRef:
name: api-v2
weight: 10 # Canary deployment!
retryPolicy:
maxRetries: 3
perRetryTimeout:
unit: s
value: 2
↑ Back to Table of Contents
Troubleshooting Common Issues
Alex: What about when things go wrong? How do I debug?
Sam: pulls out troubleshooting playbook This is essential! Let me walk you through common issues:
Issue 1: Pods Not Starting
# Check pod status
kubectl get pods -n myapp
# Output shows:
NAME READY STATUS RESTARTS AGE
api-7d4b6c9f8-abc123 0/1 ImagePullBackOff 0 5m
# Get detailed info
kubectl describe pod api-7d4b6c9f8-abc123 -n myapp
# Look for:
# - Image pull errors (wrong image name, permissions)
# - Resource constraints (not enough CPU/memory)
# - Volume mount errors
Common causes and fixes:
# Problem: ImagePullBackOff
# Fix 1: Check image name
image: 123456789.dkr.ecr.us-east-1.amazonaws.com/myapp:v1.0 # Correct region?
# Fix 2: Check IAM permissions for pulling from ECR
# The node IAM role (or Fargate pod execution role) needs ecr:GetAuthorizationToken,
# ecr:BatchCheckLayerAvailability, ecr:GetDownloadUrlForLayer, ecr:BatchGetImage
# Problem: CrashLoopBackOff
# Check logs
kubectl logs api-7d4b6c9f8-abc123 -n myapp --previous
# Problem: Pending (not scheduling)
# Describe pod to see why
Events:
Type Reason Message
---- ------ -------
Warning FailedScheduling 0/3 nodes available: insufficient memory
# Fix: Either reduce pod requests or add more nodes
Issue 2: Service Not Reachable
# Check if service exists
kubectl get svc -n myapp
# Check if endpoints exist (pods backing the service)
kubectl get endpoints -n myapp
NAME ENDPOINTS AGE
api 10.0.1.50:3000,10.0.1.51:3000 10m
# If no endpoints, pods might not match service selector
kubectl get pods -n myapp --show-labels
# Test connectivity from another pod
kubectl run test --image=busybox -it --rm -- wget -O- api.myapp.svc.cluster.local
Issue 3: Ingress Not Working
# Check if ingress exists
kubectl get ingress -n myapp
# Describe to see ALB creation status
kubectl describe ingress myapp-ingress -n myapp
# Check AWS Load Balancer Controller logs
kubectl logs -n kube-system deployment/aws-load-balancer-controller
# Common issues:
# - IAM permissions for controller
# - Subnet tags missing (kubernetes.io/role/elb=1)
# - Security groups blocking traffic
Issue 4: High Memory/CPU Usage
# Check current usage
kubectl top pods -n myapp
NAME CPU(cores) MEMORY(bytes)
api-7d4b6c9f8-abc123 450m 890Mi # Approaching limits!
# Check if pods are being OOMKilled
kubectl describe pod api-7d4b6c9f8-abc123 -n myapp | grep -A 5 "Last State"
# Fix: Adjust resources
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "1Gi" # Increased
cpu: "500m"
Issue 5: PVC Not Binding
# Check PVC status
kubectl get pvc -n myapp
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS
postgres-pvc Pending gp3
# Describe for details
kubectl describe pvc postgres-pvc -n myapp
# Common issues:
# - Storage class doesn't exist
# - No available volumes
# - Zone mismatch (PVC in us-east-1a, nodes in us-east-1b)
# Fix: Use WaitForFirstConsumer binding mode
volumeBindingMode: WaitForFirstConsumer # Wait until pod is scheduled
Debugging Toolkit
# Get everything in namespace
kubectl get all -n myapp
# Watch resources live
kubectl get pods -n myapp -w
# Exec into running container
kubectl exec -it api-7d4b6c9f8-abc123 -n myapp -- /bin/sh
# Port-forward for local testing
kubectl port-forward svc/api 8080:80 -n myapp
# Now access at localhost:8080
# Check events
kubectl get events -n myapp --sort-by='.lastTimestamp'
# Check resource usage
kubectl top nodes
kubectl top pods -n myapp
# View all resources with labels
kubectl get all -n myapp --show-labels
# Drain node for maintenance
kubectl drain node-name --ignore-daemonsets --delete-emptydir-data
# Cordon node (stop scheduling new pods)
kubectl cordon node-name
↑ Back to Table of Contents
GitOps and CI/CD
Alex: How do I actually deploy updates? Keep editing YAML files and applying them?
Sam: No way! That's where GitOps comes in. Let me show you a proper CI/CD pipeline.
The GitOps Workflow with Flux or ArgoCD
Developer Flow:
1. Developer → Git Commit → Push to main branch
│
▼
2. GitHub Actions (CI)
- Runs tests
- Builds Docker image
- Tags image (git sha)
- Pushes to ECR
- Updates Kubernetes manifest with new image tag
- Commits manifest to GitOps repo
│
▼
3. Flux/ArgoCD (watching GitOps repo)
- Detects change
- Pulls new manifests
- Applies to cluster
│
▼
4. Kubernetes
- Rolling update deployment
- New pods come up
- Old pods terminate
│
▼
5. Developer sees app updated!
Example GitHub Actions workflow:
# .github/workflows/deploy.yaml
name: Build and Deploy
on:
push:
branches: [main]
env:
ECR_REPOSITORY: myapp-api
EKS_CLUSTER: production-cluster
AWS_REGION: us-east-1
jobs:
build-and-deploy:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v3
- name: Configure AWS credentials
uses: aws-actions/configure-aws-credentials@v2
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: ${{ env.AWS_REGION }}
- name: Login to Amazon ECR
id: login-ecr
uses: aws-actions/amazon-ecr-login@v1
- name: Build and push image
  id: build-and-push
env:
ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
IMAGE_TAG: ${{ github.sha }}
run: |
docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" >> $GITHUB_OUTPUT
- name: Update Kubernetes manifest
env:
IMAGE: ${{ steps.build-and-push.outputs.image }}
run: |
# Update image in deployment YAML
sed -i "s|image:.*|image: $IMAGE|g" k8s/deployment.yaml
- name: Commit and push to GitOps repo
run: |
git config user.name "GitHub Actions"
git config user.email "actions@github.com"
git add k8s/deployment.yaml
git commit -m "Update image to ${{ github.sha }}"
git push
ArgoCD Application:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: myapp
namespace: argocd
spec:
project: default
source:
repoURL: https://github.com/company/gitops-repo
targetRevision: main
path: k8s
destination:
server: https://kubernetes.default.svc
namespace: myapp
syncPolicy:
automated:
prune: true # Delete resources not in git
selfHeal: true # Fix manual changes
syncOptions:
- CreateNamespace=true
Alex: So ArgoCD watches git and keeps the cluster in sync?
Sam: Exactly! If someone manually changes something in the cluster, ArgoCD changes it back. Git is the source of truth.
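Day to day you mostly watch that sync loop and occasionally nudge it, either in the ArgoCD UI or with its CLI — a sketch, assuming the argocd CLI is logged in to your ArgoCD server:
# Sync and health status for the application
argocd app get myapp
# Trigger a sync right away instead of waiting for the next poll
argocd app sync myapp
# Show the drift between git and what's live in the cluster
argocd app diff myapp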
↑ Back to Table of Contents
Cost Optimization
Alex: This is great, but I'm worried about costs. Any tips?
Sam: Absolutely! EKS costs can add up. Here's my cost optimization playbook:
1. Right-Size Your Pods
# Install VPA (Vertical Pod Autoscaler)
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
# Create VPA in recommendation mode
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: api-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api
updateMode: "Off" # Just recommend, don't auto-update
# Check recommendations
kubectl describe vpa api-vpa
2. Use Spot Instances
# In eksctl config
managedNodeGroups:
- name: spot-workers
instanceTypes:
- t3.medium
- t3a.medium
- t2.medium
spot: true
minSize: 2
maxSize: 10
labels:
lifecycle: spot
taints:
- key: spot
value: "true"
effect: NoSchedule
# Pods that can tolerate spot
apiVersion: apps/v1
kind: Deployment
metadata:
name: batch-processor
spec:
template:
spec:
tolerations:
- key: spot
operator: Equal
value: "true"
effect: NoSchedule
nodeSelector:
lifecycle: spot
3. Use Fargate for Bursty Workloads
# Fargate profile
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
name: my-cluster
fargateProfiles:
- name: batch-jobs
selectors:
- namespace: batch
labels:
workload-type: batch
4. Implement Pod Disruption Budgets for Safe Downscaling
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-pdb
spec:
minAvailable: 1 # Keep at least one running during disruptions
selector:
matchLabels:
app: api
5. Use Cluster Autoscaler or Karpenter
Cost savings:
- Scales down unused nodes
- Consolidates workloads
- Uses cheapest instance types
6. Monitor and Alert
# CloudWatch alarm for high costs
aws cloudwatch put-metric-alarm \
--alarm-name eks-high-cost \
--alarm-description "Alert if EKS costs exceed threshold" \
--metric-name EstimatedCharges \
--namespace AWS/Billing \
--statistic Maximum \
--period 86400 \
--evaluation-periods 1 \
--threshold 1000 \
--comparison-operator GreaterThanThreshold \
--dimensions Name=Currency,Value=USD
# Note: AWS/Billing metrics are only published in us-east-1
7. Use Graviton (ARM) Instances
managedNodeGroups:
- name: graviton-workers
instanceTypes: [t4g.medium, t4g.large] # Graviton-based
# 20% cheaper than x86!
Cost breakdown example:
Monthly EKS Costs:
Control Plane: $73/month (fixed)
3x t3.medium (x86):
- On-Demand: $90/month
- Spot: ~$27/month (70% savings!)
3x t4g.medium (ARM):
- On-Demand: $72/month (20% savings vs x86!)
- Spot: ~$22/month (75% savings!)
Best Practice Mix:
- 2x t4g.medium on-demand (baseline) $48
- 3x t4g.medium spot (burst) $22
- Control plane $73
----------------------------------------
Total: ~$143/month for 5-node cluster
↑ Back to Table of Contents
Production Readiness Checklist
Alex: Okay, before we wrap up – give me a checklist. What do I need before going to production?
Sam: pulls out final napkin Here's my production readiness checklist:
Infrastructure
- [ ] Multi-AZ deployment (at least 2 AZs)
- [ ] Private subnets for worker nodes
- [ ] NAT Gateways for outbound internet
- [ ] VPC Flow Logs enabled
- [ ] Control plane logging enabled
- [ ] Cluster autoscaling configured
- [ ] Pod autoscaling (HPA) for services
- [ ] Proper instance types selected (right-sized)
Security
- [ ] IRSA configured for pod IAM roles
- [ ] Pod Security Standards enforced
- [ ] Network Policies defined
- [ ] Secrets in AWS Secrets Manager (not K8s secrets)
- [ ] ECR image scanning enabled
- [ ] Security groups properly configured
- [ ] IAM least privilege for all roles
- [ ] Encryption at rest for EBS volumes
- [ ] Encryption in transit (TLS everywhere)
Observability
- [ ] CloudWatch Container Insights installed
- [ ] Logging to CloudWatch configured
- [ ] Metrics collection working
- [ ] Distributed tracing (X-Ray) configured
- [ ] Dashboards created for key metrics
- [ ] Alerts configured for critical issues
- [ ] On-call rotation defined
High Availability
- [ ] Multiple replicas for all services
- [ ] Pod Disruption Budgets defined
- [ ] Liveness/readiness probes on all pods
- [ ] Resource requests/limits set
- [ ] Anti-affinity rules for spreading pods
- [ ] Health checks on load balancers
- [ ] Graceful shutdown configured
Disaster Recovery
- [ ] etcd backups automated (handled by EKS)
- [ ] PV snapshots scheduled
- [ ] Gitops repo backed up
- [ ] Disaster recovery plan documented
- [ ] Recovery tested at least once
- [ ] RTO/RPO defined and achievable
Operations
- [ ] GitOps workflow implemented
- [ ] CI/CD pipeline automated
- [ ] Rollback procedure tested
- [ ] Runbooks for common issues
- [ ] Access control (RBAC) configured
- [ ] Audit logging enabled
- [ ] Change management process defined
Cost Management
- [ ] Resource quotas per namespace
- [ ] Cost tracking by team/project
- [ ] Spot instances for applicable workloads
- [ ] Rightsizing reviewed monthly
- [ ] Unused resources cleaned up
- [ ] Budget alerts configured
Alex: overwhelmed That's a lot...
Sam: It is! But you don't need everything day one. Prioritize:
Week 1: Get it running
- Basic cluster
- Deploy apps
- Basic monitoring
Week 2-4: Make it reliable
- HA configuration
- Proper health checks
- Autoscaling
Month 2-3: Make it secure
- IRSA
- Network policies
- Secrets management
Month 3+: Optimize
- Cost optimization
- Performance tuning
- Advanced features
↑ Back to Table of Contents
Wrapping Up
Alex: closes laptop Wow. That was... comprehensive. My brain is full.
Sam: laughs I know it's a lot. Kubernetes and EKS are powerful but complex. Let me leave you with key takeaways:
1. Start Simple
Day 1: Basic EKS cluster with eksctl
Day 2: Deploy first app
Week 1: Add monitoring
Week 2: Add autoscaling
Month 1: Implement GitOps
Month 2: Advanced features
2. Use Managed Services
- Let AWS manage the control plane
- Use managed node groups
- Leverage AWS integrations (IAM, ALB, EBS)
3. Embrace Declarative Configuration
- Everything in YAML/Git
- Let Kubernetes reconcile
- Don't fight the system
4. Focus on Observability
- You can't fix what you can't see
- Logs, metrics, traces
- Alert on what matters
5. Security is a Journey
- Start with basics (IRSA, network policies)
- Add layers over time
- Never stop improving
Alex: And if I had to explain EKS to my manager in 30 seconds?
Sam: "EKS is AWS's managed Kubernetes service. AWS handles the complex control plane, we focus on running our applications. It gives us industry-standard container orchestration, portability, and a massive ecosystem of tools. It's more complex than ECS but more powerful and portable. Perfect for our growing needs."
Alex: Perfect. One last question – when should I absolutely NOT use EKS?
Sam: Great question!
Don't use EKS if:
- You have a single, simple application → Use App Runner or ECS
- Your team has zero container experience → Start with ECS, migrate later
- You need something running TODAY → Kubernetes has a learning curve
- Budget is extremely tight → Control plane costs $73/month minimum
- You don't need Kubernetes features → Simpler tools exist
DO use EKS if:
- You need Kubernetes for multi-cloud
- You have Kubernetes expertise
- You need advanced orchestration features
- You're building a platform for multiple teams
- You want the K8s ecosystem tools
Alex: stands up Alright, I'm ready. Time to create my first EKS cluster!
Sam: That's the spirit! Remember:
# Your first cluster
eksctl create cluster \
--name my-first-cluster \
--region us-east-1 \
--nodegroup-name workers \
--node-type t3.medium \
--nodes 2 \
--managed
# Deploy something
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80 --type=LoadBalancer
# Check it
kubectl get all
Alex: And when it inevitably breaks?
Sam: hands over card Text me. Or check the logs, describe the pods, and check events. 90% of issues are:
- Wrong image name
- Missing permissions
- Resource constraints
- Configuration errors
Alex: shakes hand Thanks, Sam. Same time next week to discuss service meshes?
Sam: grins Let's start with getting this working first!
↑ Back to Table of Contents
Quick Reference Guide
Essential Commands:
# Cluster management
eksctl create cluster -f cluster.yaml
eksctl get cluster
eksctl delete cluster --name my-cluster
# Context management
kubectl config get-contexts
kubectl config use-context my-cluster
# Resource management
kubectl get pods -n myapp
kubectl describe pod my-pod -n myapp
kubectl logs my-pod -n myapp -f
kubectl exec -it my-pod -n myapp -- /bin/sh
# Apply configurations
kubectl apply -f deployment.yaml
kubectl apply -f . # Apply all YAML in directory
# Scaling
kubectl scale deployment/my-app --replicas=5
kubectl autoscale deployment/my-app --min=2 --max=10 --cpu-percent=70
# Updates
kubectl set image deployment/my-app app=myapp:v2.0
kubectl rollout status deployment/my-app
kubectl rollout undo deployment/my-app
# Debugging
kubectl get events -n myapp --sort-by='.lastTimestamp'
kubectl top pods -n myapp
kubectl top nodes
# Port forwarding
kubectl port-forward svc/my-app 8080:80
Resource Hierarchy:
Cluster
└─ Namespace
├─ Deployment
│ └─ ReplicaSet
│ └─ Pod
│ └─ Container
├─ Service
├─ ConfigMap
├─ Secret
├─ PersistentVolumeClaim
│ └─ PersistentVolume
└─ Ingress
↑ Back to Table of Contents
Alex walks out of the coffee shop, laptop bag over shoulder, ready to build something amazing. Sam orders another coffee and opens a laptop – time to help the next person on their Kubernetes journey.
The Beginning of Your EKS Adventure 🚀☸️