Data Tech Bridge

Kubernetes & EKS Deep Dive: From Zero to Production

Alex walks out of the coffee shop, laptop bag over shoulder, ready to build something amazing. Sam orders another coffee and opens a laptop – time to help the next person on their Kubernetes journey.

The Beginning of Your EKS Adventure 🚀☸️

📚 Table of Contents


Introduction

Two weeks have passed since the discussion on AWS Container Services. Alex is back at the coffee shop, laptop open, with a slightly worried expression. Sam walks in with two coffees.


Sam: sets down coffee Okay, that text you sent at 2 AM saying "I think I need Kubernetes after all" was concerning. What happened?

Alex: laughs nervously Don't worry, nothing's on fire. The ECS setup is working great actually! But... our CTO just announced we're acquiring a startup. They're running everything on Kubernetes. And our new lead developer keeps talking about "the K8s ecosystem" and "cloud-native patterns."

Sam: Ah, the classic acquisition scenario. So now you need to actually understand Kubernetes, not just know it exists.

Alex: Exactly. And honestly, after working with ECS, I'm curious. What am I missing? Why do people love Kubernetes so much they tattoo the logo on their arms?

Sam: spits out coffee Someone actually did that?

Alex: I saw it on Twitter!

Sam: shakes head Okay, well... let's dive deep. Fair warning: this is going to be a longer conversation. Kubernetes isn't something you grasp in ten minutes.

Alex: I've got all afternoon. Teach me, sensei.

↑ Back to Table of Contents

Understanding Kubernetes: The Philosophy

Sam: Before we touch EKS, you need to understand Kubernetes itself. Think back to our last conversation – I said Kubernetes is complex. But there's a reason for that complexity. It's not accidental.

Kubernetes was built to solve Google's problems: running billions of containers across thousands of machines. It's designed for:

  • Declarative configuration: You say what you want, K8s makes it happen
  • Self-healing: Things fail, K8s automatically recovers
  • Extensibility: You can add functionality without changing core code
  • API-driven: Everything is an API resource

Alex: You lost me at "declarative configuration."

Sam: Right, let me explain with an example. Remember ECS?

In ECS (Imperative thinking):

# You tell ECS what to DO
aws ecs create-service --service-name my-app --desired-count 3
aws ecs update-service --service-name my-app --desired-count 5

In Kubernetes (Declarative thinking):

# You tell K8s what you WANT
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 5  # I want 5 replicas, K8s makes it happen
  # ... rest of config

You apply this file, and Kubernetes continuously works to maintain that state. If a pod dies, K8s starts a new one. If you change replicas from 5 to 10, K8s smoothly adds 5 more.
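
For instance, assuming the manifest above is saved as deployment.yaml, the whole workflow is just declaring and re-declaring state (a quick sketch):

# Declare the desired state
kubectl apply -f deployment.yaml

# Change your mind: edit replicas in the YAML and re-apply,
# or declare the new count directly
kubectl scale deployment/my-app --replicas=10

# Watch Kubernetes reconcile toward the new desired state
kubectl get deployment my-app --watch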

Alex: So it's like... I'm describing my desired state, not giving commands?

Sam: Exactly! That's the Kubernetes mindset. Everything is about desired state vs. current state.

┌─────────────────────────────────────────┐
│        Kubernetes Control Loop          │
│                                         │
│  ┌──────────────┐    ┌──────────────┐  │
│  │   Desired    │    │   Current    │  │
│  │    State     │    │    State     │  │
│  │ (your YAML)  │    │ (reality)    │  │
│  └──────┬───────┘    └───────┬──────┘  │
│         │                    │         │
│         │   ┌────────────┐   │         │
│         └──►│ Controller │◄──┘         │
│             │ (reconcile)│             │
│             └────────────┘             │
│                                         │
│  If desired ≠ current, take action     │
└─────────────────────────────────────────┘

↑ Back to Table of Contents


The Kubernetes Architecture

Alex: Okay, I'm buying into the philosophy. Show me how it actually works.

Sam: draws on napkin

Kubernetes has two main parts:

1. Control Plane (the brains)
2. Worker Nodes (the muscle)

Let me break this down:

┌────────────────────────────────────────────────────┐
│               KUBERNETES CLUSTER                    │
│                                                     │
│  ┌──────────────────────────────────────────────┐  │
│  │         CONTROL PLANE (Master)                │  │
│  │                                               │  │
│  │  ┌──────────────┐  ┌──────────────┐         │  │
│  │  │  API Server  │  │    etcd      │         │  │
│  │  │  (kubectl)   │  │  (database)  │         │  │
│  │  └──────┬───────┘  └──────────────┘         │  │
│  │         │                                     │  │
│  │  ┌──────┴───────┐  ┌──────────────┐         │  │
│  │  │  Scheduler   │  │  Controller  │         │  │
│  │  │ (placement)  │  │   Manager    │         │  │
│  │  └──────────────┘  └──────────────┘         │  │
│  └───────────────────────┬────────────────────┬─┘  │
│                          │                    │    │
│  ┌───────────────────────▼─────┐  ┌──────────▼──┐ │
│  │      WORKER NODE 1          │  │ WORKER NODE 2│ │
│  │                             │  │              │ │
│  │  ┌────────┐  ┌────────┐    │  │  ┌────────┐ │ │
│  │  │  Pod   │  │  Pod   │    │  │  │  Pod   │ │ │
│  │  │┌──────┐│  │┌──────┐│    │  │  │┌──────┐│ │ │
│  │  ││ App  ││  ││ App  ││    │  │  ││ App  ││ │ │
│  │  │└──────┘│  │└──────┘│    │  │  │└──────┘│ │ │
│  │  └────────┘  └────────┘    │  │  └────────┘ │ │
│  │                             │  │              │ │
│  │  kubelet | kube-proxy      │  │  kubelet...  │ │
│  └─────────────────────────────┘  └──────────────┘ │
└────────────────────────────────────────────────────┘

Alex: Walk me through what each piece does.

Sam: Great! Let's follow a request:

1. API Server:

  • The front door to Kubernetes
  • When you run kubectl apply, it hits the API Server
  • Validates requests, authenticates users
  • Everything goes through here – it's the only component that talks to etcd

2. etcd:

  • Distributed key-value database
  • Stores ALL cluster state
  • If etcd is lost, the cluster doesn't know what it should be running
  • This is why backups are critical!

3. Scheduler:

  • Watches for new pods that don't have a node assigned
  • Decides which node should run the pod based on:
    • Resource requirements (CPU, memory)
    • Node constraints (labels, taints, tolerations)
    • Affinity/anti-affinity rules

4. Controller Manager:

  • Runs multiple controllers (Deployment controller, ReplicaSet controller, etc.)
  • Each controller watches for changes and acts on them
  • Example: Deployment controller sees replicas: 5, counts current pods, creates more if needed

5. kubelet (on each worker node):

  • Ensures containers are running in pods
  • Talks to the container runtime (containerd; Docker support was removed in Kubernetes 1.24)
  • Reports node and pod status back to API Server

6. kube-proxy (on each worker node):

  • Handles networking
  • Implements Service abstraction
  • Routes traffic to appropriate pods

Alex: So when I run kubectl apply -f deployment.yaml, what happens?

Sam: Perfect question! Let's trace it:

Step-by-Step: kubectl apply -f deployment.yaml

1. kubectl → API Server
   "Here's a Deployment resource"

2. API Server → etcd
   "Store this Deployment spec"

3. Deployment Controller (watching API Server)
   "New Deployment! I need to create a ReplicaSet"

4. Deployment Controller → API Server
   "Create this ReplicaSet"

5. ReplicaSet Controller (watching)
   "New ReplicaSet! I need to create Pods"

6. ReplicaSet Controller → API Server
   "Create 3 Pods please"

7. Scheduler (watching for unassigned Pods)
   "These Pods need homes... Node 2 has space!"

8. Scheduler → API Server
   "Assign these Pods to Node 2"

9. kubelet on Node 2 (watching API Server)
   "I have new Pods to run! Starting containers..."

10. kubelet → Container Runtime
    "Pull image, create container, start it"

11. kubelet → API Server (continuous)
    "Status update: Pod is Running"

Alex: Wow. That's... a lot of moving parts for just starting a container.

Sam: It is! But that complexity gives you power. Each piece can be extended, replaced, or customized. That's why Kubernetes can do things ECS can't.

↑ Back to Table of Contents


Enter Amazon EKS

Alex: Okay, so where does EKS fit into all this?

Sam: Amazon EKS (Elastic Kubernetes Service) is AWS's managed Kubernetes. Here's what AWS handles for you:

AWS Manages:

  • Control Plane (API Server, etcd, scheduler, controller manager)
  • Control plane HA across multiple AZs
  • Control plane upgrades
  • Control plane scaling
  • etcd backups
  • API Server endpoint

You Manage:

  • Worker nodes (or use Fargate)
  • Applications
  • Add-ons (though some are managed now)
  • Networking configuration
  • Security policies

Alex: So it's like... they run the control plane, I run the workers?

Sam: Exactly! Here's the EKS architecture:

┌──────────────────────────────────────────────────────┐
│                   AMAZON EKS                         │
│                                                      │
│  ┌────────────────────────────────────────────────┐ │
│  │    AWS-MANAGED CONTROL PLANE                   │ │
│  │    (Runs in AWS VPC, multi-AZ)                │ │
│  │                                                │ │
│  │  ┌─────┐  ┌─────┐  ┌──────┐  ┌──────────┐   │ │
│  │  │ API │  │etcd │  │Sched │  │Controller│   │ │
│  │  │     │  │     │  │     │  │ Manager  │   │ │
│  │  └─────┘  └─────┘  └──────┘  └──────────┘   │ │
│  │                                                │ │
│  │  Exposed via AWS-managed Load Balancer        │ │
│  └────────────────────┬───────────────────────────┘ │
│                       │                             │
│  ┌────────────────────▼───────────────────────────┐ │
│  │         YOUR VPC                               │ │
│  │                                                │ │
│  │  ┌──────────────────┐    ┌──────────────────┐ │ │
│  │  │  Subnet 1 (AZ-a) │    │ Subnet 2 (AZ-b) │ │ │
│  │  │                  │    │                  │ │ │
│  │  │  ┌────────────┐  │    │  ┌────────────┐ │ │ │
│  │  │  │   Node 1   │  │    │  │   Node 2   │ │ │ │
│  │  │  │ (EC2 or    │  │    │  │ (EC2 or    │ │ │ │
│  │  │  │  Fargate)  │  │    │  │  Fargate)  │ │ │ │
│  │  │  │            │  │    │  │            │ │ │ │
│  │  │  │ ┌────────┐ │  │    │  │ ┌────────┐ │ │ │
│  │  │  │ │  Pods  │ │  │    │  │ │  Pods  │ │ │ │
│  │  │  │ └────────┘ │  │    │  │ └────────┘ │ │ │
│  │  │  └────────────┘  │    │  └────────────┘ │ │ │
│  │  └──────────────────┘    └──────────────────┘ │ │
│  └────────────────────────────────────────────────┘ │
│                                                      │
│  AWS Services: ECR, IAM, CloudWatch, VPC, ELB, etc. │
└──────────────────────────────────────────────────────┘

Alex: So I never SSH into the control plane?

Sam: Nope! You can't even see those machines. AWS handles all that. You interact with the cluster through kubectl, which talks to the API Server endpoint AWS provides.
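
In practice, you point kubectl at that endpoint once and forget about it. A quick sketch, assuming a cluster named my-cluster in us-east-1:

# Writes the EKS API endpoint and auth settings into ~/.kube/config
aws eks update-kubeconfig --region us-east-1 --name my-cluster

# From here on, kubectl talks to the AWS-managed API Server
kubectl get nodes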

↑ Back to Table of Contents


Core Kubernetes Concepts: The Building Blocks

Alex: Alright, I need to understand the actual resources. You mentioned Pods, Deployments, Services... break them all down for me.

Sam: cracks knuckles Here we go. This is where Kubernetes gets rich but also complex.

Pods: The Atomic Unit

Sam: A Pod is the smallest deployable unit in Kubernetes. Think of it as a wrapper around one or more containers that need to run together.

apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
  labels:
    app: my-app
    tier: frontend
spec:
  containers:
  - name: app-container
    image: 123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.0
    ports:
    - containerPort: 8080
    env:
    - name: DATABASE_URL
      value: "postgres://db:5432"
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"

Alex: Why would I put multiple containers in one pod?

Sam: Great question! Common patterns:

Sidecar Pattern:

┌─────────────────────────┐
│         Pod             │
│                         │
│  ┌─────────┐           │
│  │  Main   │           │
│  │  App    │           │
│  └────┬────┘           │
│       │ shares         │
│  ┌────▼────┐  storage  │
│  │ Logging │           │
│  │ Sidecar │           │
│  └─────────┘           │
└─────────────────────────┘

Example: Main app writes logs to a shared volume, sidecar container ships logs to CloudWatch.
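
A minimal sketch of that layout: both containers share an emptyDir volume, the image names are placeholders, and the actual log-shipping config is omitted.

apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  volumes:
  - name: app-logs
    emptyDir: {}                   # shared scratch space, lives as long as the Pod
  containers:
  - name: main-app
    image: my-app:v1.0             # placeholder: your app, writing logs to /var/log/app
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  - name: log-shipper
    image: fluent/fluent-bit:2.2   # placeholder: any log forwarder works here
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
      readOnly: true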

Ambassador Pattern:

┌─────────────────────────┐
│         Pod             │
│                         │
│  ┌─────────┐           │
│  │  Main   │───────┐   │
│  │  App    │       │   │
│  └─────────┘       │   │
│                    │   │
│  ┌─────────┐      │   │
│  │ Proxy/  │◄─────┘   │
│  │ Cache   │           │
│  └────┬────┘           │
└───────┼─────────────────┘
        │
      Network

Example: Main app talks to local proxy, proxy handles connection pooling to database.

Alex: But usually it's just one container per pod?

Sam: Usually, yes! One container per pod is the most common pattern.

↑ Back to Table of Contents

ReplicaSets: Maintaining Desired Count

Sam: ReplicaSets ensure a specified number of pod replicas are running at all times.

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: my-app-rs
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:v1.0

If a pod dies, ReplicaSet creates a new one. If you manually create extra pods with matching labels, ReplicaSet deletes them to maintain exactly 3.

Alex: So ReplicaSets are like the ECS Service desired count?

Sam: Yes! But here's the thing – you almost never create ReplicaSets directly. You use...

↑ Back to Table of Contents

Deployments: The Smart Way to Manage Replicas

Sam: Deployments are what you actually use. They manage ReplicaSets for you and provide:

  • Rolling updates
  • Rollbacks
  • Version history
  • Pause/resume capabilities
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Max 1 extra pod during update
      maxUnavailable: 1  # Max 1 pod down during update
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
        version: v1.0
    spec:
      containers:
      - name: my-app
        image: my-app:v1.0
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

Alex: What happens when I update the image?

Sam: Magic! Watch this:

Current State (v1.0):
┌──────────────────────────────────┐
│  Deployment: my-app              │
│  ├─ ReplicaSet-abc (5 pods v1.0) │
│      ├─ Pod 1                    │
│      ├─ Pod 2                    │
│      ├─ Pod 3                    │
│      ├─ Pod 4                    │
│      └─ Pod 5                    │
└──────────────────────────────────┘

Update to v1.1:
$ kubectl set image deployment/my-app my-app=my-app:v1.1

During Rolling Update:
┌──────────────────────────────────┐
│  Deployment: my-app              │
│  ├─ ReplicaSet-abc (3 pods v1.0) │
│  │   ├─ Pod 1                    │
│  │   ├─ Pod 2                    │
│  │   └─ Pod 3                    │
│  └─ ReplicaSet-xyz (2 pods v1.1) │
│      ├─ Pod 6 (new!)             │
│      └─ Pod 7 (new!)             │
└──────────────────────────────────┘

Final State:
┌──────────────────────────────────┐
│  Deployment: my-app              │
│  ├─ ReplicaSet-abc (0 pods)      │
│  └─ ReplicaSet-xyz (5 pods v1.1) │
│      ├─ Pod 6                    │
│      ├─ Pod 7                    │
│      ├─ Pod 8                    │
│      ├─ Pod 9                    │
│      └─ Pod 10                   │
└──────────────────────────────────┘

Kubernetes gradually terminates old pods and starts new ones. If something goes wrong:

kubectl rollout undo deployment/my-app

And you're back to v1.0!
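
A few companion commands worth knowing (standard kubectl, shown as a quick sketch):

# Watch a rollout progress, or block until it finishes
kubectl rollout status deployment/my-app

# See the revision history the Deployment keeps
kubectl rollout history deployment/my-app

# Roll back to a specific revision instead of just the previous one
kubectl rollout undo deployment/my-app --to-revision=1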

Alex: That's actually really cool. ECS can do blue/green, but this feels more gradual.

↑ Back to Table of Contents

Services: Stable Networking

Sam: Here's a problem: Pods are ephemeral. They get IPs when they start, but those IPs change if pods restart. How do you connect to them?

Alex: Um... DNS?

Sam: Kind of! That's where Services come in. A Service provides a stable IP and DNS name for a set of pods.

apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  type: ClusterIP  # Only accessible within cluster
  selector:
    app: my-app
  ports:
  - protocol: TCP
    port: 80        # Service port
    targetPort: 8080  # Container port

Now any pod can reach your app at my-app-service:80 or my-app-service.default.svc.cluster.local:80
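
A quick way to check what the Service is actually routing to (assuming the Service above):

kubectl get svc my-app-service
kubectl get endpoints my-app-service   # the pod IPs currently behind the Service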

Service Types:

1. ClusterIP (default):

Internal only, not accessible from outside
┌──────────────────────────┐
│      Cluster             │
│  ┌────────┐             │
│  │ Pod A  │──────┐      │
│  └────────┘      │      │
│                  ▼      │
│  ┌────────────────────┐ │
│  │  Service ClusterIP │ │
│  │  10.100.200.50     │ │
│  └────────┬───────────┘ │
│           │             │
│      ┌────▼───┐         │
│      │ Pod B  │         │
│      └────────┘         │
└──────────────────────────┘

2. NodePort:

Accessible on each node's IP at a static port
┌──────────────────────────┐
│      Cluster             │
│  ┌─────────────────────┐ │
│  │ Node 1              │ │
│  │ IP: 10.0.1.50       │ │
│  │ NodePort: 30080 ────┼─┼──► External traffic
│  └─────────┬───────────┘ │
│            │             │
│  ┌─────────▼───────────┐ │
│  │  Service            │ │
│  └─────────┬───────────┘ │
│            │             │
│       ┌────▼───┐         │
│       │  Pods  │         │
│       └────────┘         │
└──────────────────────────┘

3. LoadBalancer (the one you'll use most in EKS):

apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
  - port: 80
    targetPort: 8080

This automatically provisions an AWS Load Balancer!

Internet
   │
   ▼
┌─────────────────────┐
│ AWS Load Balancer   │
│ (NLB or CLB)        │
└────────┬────────────┘
         │
   ┌─────▼──────┐
   │ Service LB │
   └─────┬──────┘
         │
    ┌────▼────┐
    │  Pods   │
    └─────────┘

Alex: Wait, so Kubernetes creates an actual AWS Load Balancer?

Sam: Yes! Through the AWS Load Balancer Controller. We'll get to that. But this is where EKS's AWS integration shines.
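
Once the load balancer is provisioned, its DNS name is reported back on the Service itself. A quick check, assuming the Service above named my-app:

kubectl get svc my-app
# or pull just the DNS name
kubectl get svc my-app -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'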

↑ Back to Table of Contents

ConfigMaps and Secrets: Configuration Management

Sam: Applications need configuration. In Kubernetes, you use ConfigMaps for non-sensitive data and Secrets for sensitive data.

ConfigMap example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  database_host: "postgres.example.com"
  log_level: "info"
  app.properties: |
    feature.new=true
    timeout=30s

Using it in a Pod:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app:v1.0
    env:
    # As environment variables
    - name: DATABASE_HOST
      valueFrom:
        configMapKeyRef:
          name: app-config
          key: database_host
    # Or mount as files
    volumeMounts:
    - name: config-volume
      mountPath: /etc/config
  volumes:
  - name: config-volume
    configMap:
      name: app-config

Secrets (similar but encoded):

apiVersion: v1
kind: Secret
metadata:
  name: db-secret
type: Opaque
data:
  username: YWRtaW4=  # base64 encoded
  password: cGFzc3dvcmQ=

Alex: Base64 isn't encryption though...

Sam: Correct! Secrets in Kubernetes are just base64 encoded by default. In EKS, you should use AWS Secrets Manager or Parameter Store integration for real security. We'll cover that.
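
Side note: even for plain Kubernetes Secrets, you don't have to base64-encode values by hand; kubectl does it for you (example values only):

kubectl create secret generic db-secret \
  --from-literal=username=admin \
  --from-literal=password='not-a-real-password'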

↑ Back to Table of Contents


Networking in EKS: The Complex Beast

Alex: Okay, I need to understand networking because I heard it's complicated.

Sam: deep breath You heard right. Kubernetes networking is... special. Let me break it down.

The Kubernetes Networking Model has three rules:

  1. Every pod gets its own IP address
  2. All pods can communicate with all other pods without NAT
  3. All nodes can communicate with all pods without NAT

Alex: How does EKS implement this?

Sam: Through the Amazon VPC CNI (Container Network Interface) plugin. It's the default CNI on EKS and actually pretty clever.

Traditional Kubernetes:
┌────────────────────────────┐
│  Node (EC2 Instance)       │
│  IP: 10.0.1.50             │
│                            │
│  ┌──────────────────────┐  │
│  │ Pod 1: 172.16.1.5    │  │
│  └──────────────────────┘  │
│  ┌──────────────────────┐  │
│  │ Pod 2: 172.16.1.6    │  │
│  └──────────────────────┘  │
│                            │
│  Pod IPs are virtual       │
│  (overlay network)         │
└────────────────────────────┘

EKS with VPC CNI:
┌────────────────────────────┐
│  Node (EC2 Instance)       │
│  Primary IP: 10.0.1.50     │
│                            │
│  ┌──────────────────────┐  │
│  │ Pod 1: 10.0.1.100    │  │ ← Real VPC IP!
│  └──────────────────────┘  │
│  ┌──────────────────────┐  │
│  │ Pod 2: 10.0.1.101    │  │ ← Real VPC IP!
│  └──────────────────────┘  │
│                            │
│  Pods get real ENI IPs     │
│  from your VPC!            │
└────────────────────────────┘

Alex: So pods get actual VPC IP addresses?

Sam: Yes! This means:

Pros:

  • Pods can directly communicate with VPC resources (RDS, ElastiCache, etc.)
  • VPC Security Groups can apply to pods
  • No overlay network performance penalty
  • VPC Flow Logs work

Cons:

  • You consume IP addresses quickly
  • Need to plan VPC CIDR blocks carefully
  • ENI limits per instance type matter

Alex: How many pods can I run per node?

Sam: It depends on the EC2 instance type and its ENI limits:

Formula: max pods = ENIs × (IPv4 addresses per ENI − 1) + 2
(each ENI's primary IP is reserved; the +2 covers host-network pods like kube-proxy)

t3.small:  3 ENIs × (4 − 1) IPs + 2  = 11 pods
m5.large:  3 ENIs × (10 − 1) IPs + 2 = 29 pods
m5.xlarge: 4 ENIs × (15 − 1) IPs + 2 = 58 pods
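
You can confirm the number Kubernetes actually uses, since the EKS-optimized AMI feeds this formula into the kubelet's max-pods setting. A quick check:

kubectl get nodes -o custom-columns=NAME:.metadata.name,MAXPODS:.status.allocatable.pods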

Alex: What if I run out of IPs?

Sam: You have options:

1. Use larger CIDR blocks:

Bad:  10.0.1.0/28  (16 IPs - ouch!)
Better: 10.0.1.0/24  (256 IPs)
Best: 10.0.0.0/16    (65,536 IPs)

2. Use secondary CIDR blocks:

# Add a secondary CIDR to your VPC
10.0.0.0/16   (primary)
100.64.0.0/16 (secondary for pods)

3. Use Custom Networking:
Pods use a different subnet than the nodes

4. Use Fargate:
No node IP limits!

5. Use IPv6:
Basically infinite addresses

↑ Back to Table of Contents

Ingress: Advanced Routing

Sam: Services get you basic load balancing, but what if you need:

  • Path-based routing (/api/users → User Service)
  • Host-based routing (api.example.com → API Service)
  • SSL termination
  • Advanced traffic management

That's where Ingress comes in.

Alex: Is this like ALB path-based routing?

Sam: Exactly! But Kubernetes Ingress is a standard API, and different controllers implement it.

In EKS, you use the AWS Load Balancer Controller:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:...
spec:
  ingressClassName: alb
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /users
        pathType: Prefix
        backend:
          service:
            name: user-service
            port:
              number: 80
      - path: /orders
        pathType: Prefix
        backend:
          service:
            name: order-service
            port:
              number: 80

This creates:

Internet
   │
   ▼
┌─────────────────────────────────┐
│ Application Load Balancer (ALB) │
│ api.example.com                 │
└────┬─────────────────────┬──────┘
     │                     │
     │ /users              │ /orders
     ▼                     ▼
┌──────────┐         ┌──────────┐
│  User    │         │  Order   │
│  Service │         │  Service │
└────┬─────┘         └────┬─────┘
     │                     │
  ┌──▼──┐               ┌──▼──┐
  │Pods │               │Pods │
  └─────┘               └─────┘

Alex: That's cleaner than managing ALB rules manually!

Sam: Absolutely! And the Ingress controller handles:

  • Creating the ALB
  • Configuring listeners
  • Updating target groups
  • Health checks
  • SSL certificates from ACM
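
After you apply the Ingress, the controller reports the ALB's DNS name back on the resource. A quick check, assuming the Ingress above:

kubectl get ingress app-ingress
# the ADDRESS column shows the ALB DNS name once provisioning finishes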

↑ Back to Table of Contents


Storage in Kubernetes

Alex: What about data that needs to persist? Databases, file uploads, etc.?

Sam: Kubernetes storage is built on these concepts:

1. Volumes: Storage attached to a pod's lifecycle
2. Persistent Volumes (PV): Storage that exists independently
3. Persistent Volume Claims (PVC): Request for storage

┌─────────────────────────────────────────┐
│         Storage Architecture            │
│                                         │
│  ┌───────────────┐                     │
│  │     Pod       │                     │
│  │  ┌─────────┐  │                     │
│  │  │Container│  │                     │
│  │  └────┬────┘  │                     │
│  │       │mount  │                     │
│  │  ┌────▼──────────────┐              │
│  │  │ PVC: my-app-storage│             │
│  │  │ Size: 10Gi         │             │
│  │  └────┬───────────────┘             │
│  └───────┼───────────────┘             │
│          │ bound to                    │
│  ┌───────▼──────────────┐              │
│  │ PV: pv-12345         │              │
│  │ Size: 10Gi           │              │
│  │ Type: EBS            │              │
│  └───────┬──────────────┘              │
│          │                             │
│  ┌───────▼──────────────┐              │
│  │ AWS EBS Volume       │              │
│  │ vol-abcdef123        │              │
│  └──────────────────────┘              │
└─────────────────────────────────────────┘

Creating a PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
spec:
  accessModes:
    - ReadWriteOnce  # Can be mounted by one node at a time
  storageClassName: gp3
  resources:
    requests:
      storage: 20Gi

Using it in a Pod:

apiVersion: v1
kind: Pod
metadata:
  name: postgres
spec:
  containers:
  - name: postgres
    image: postgres:14
    volumeMounts:
    - name: postgres-storage
      mountPath: /var/lib/postgresql/data
  volumes:
  - name: postgres-storage
    persistentVolumeClaim:
      claimName: postgres-pvc

Storage Classes in EKS:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

Alex: What storage should I use for different scenarios?

Sam: Good question! Here's my guide:

EBS (via EBS CSI Driver):
✓ Databases (PostgreSQL, MySQL)
✓ Single-writer applications
✗ Multi-node read/write
✗ Cross-AZ access

EFS (via EFS CSI Driver):
✓ Shared file storage
✓ Multi-pod read/write
✓ Cross-AZ access
✗ High-performance databases
✗ Block-level operations

FSx for Lustre:
✓ High-performance computing
✓ Machine learning training
✗ General purpose

S3 (via CSI drivers):
✓ Object storage
✓ Static assets
✗ POSIX filesystem operations

↑ Back to Table of Contents


Security in EKS: Locking It Down

Alex: nervously Okay, security. I know this is important but overwhelming.

Sam: Deep breath. Let's tackle this systematically. Security in EKS has multiple layers:

1. IAM Roles for Service Accounts (IRSA)

Sam: This is CRUCIAL in EKS. It lets pods assume IAM roles without needing node-level permissions.

The Old Way (bad):

┌──────────────────────┐
│  EC2 Instance        │
│  IAM Role: SuperPower│  ← All pods get this!
│  ┌────┐  ┌────┐     │
│  │Pod1│  │Pod2│     │
│  └────┘  └────┘     │
└──────────────────────┘

Problem: Pod1 needs S3, Pod2 needs DynamoDB
But both get full access!

The EKS Way (good - IRSA):

┌──────────────────────────────────┐
│  EC2 Instance                    │
│  IAM Role: Minimal               │
│                                  │
│  ┌────────────┐  ┌────────────┐ │
│  │   Pod1     │  │   Pod2     │ │
│  │ IAM Role:  │  │ IAM Role:  │ │
│  │ S3Access   │  │ DynamoAccess│ │
│  └────────────┘  └────────────┘ │
└──────────────────────────────────┘

Setting it up:

# 1. Create IAM role
eksctl create iamserviceaccount \
  --name my-app-sa \
  --namespace default \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
  --approve
# 2. Use it in your deployment
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app-sa
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/my-app-role
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    spec:
      serviceAccountName: my-app-sa  # Pod gets this role!
      containers:
      - name: app
        image: my-app:v1.0

Alex: So the pod can just call AWS APIs now?

Sam: Exactly! The AWS SDKs automatically pick up the credentials from the service account.
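
If you want to verify it, exec into a pod and ask AWS who it is. A sketch, assuming the AWS CLI is available inside the image:

kubectl exec deploy/my-app -- aws sts get-caller-identity
# the returned ARN should be the assumed my-app-role, not the node's role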

↑ Back to Table of Contents

2. Pod Security Standards

Sam: You need to prevent pods from running with dangerous settings:

apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

This prevents:

  • Running as root
  • Privileged containers
  • Host network access
  • Unsafe volume types
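
To be admitted into that namespace, containers need a security context along these lines (a fragment showing the usual restricted-profile settings):

    securityContext:
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
      seccompProfile:
        type: RuntimeDefault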

↑ Back to Table of Contents

3. Network Policies

Sam: Control which pods can talk to which:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-network-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432

This means:

┌──────────┐
│ Frontend │
│   Pod    │───────┐
└──────────┘       │
                   ▼ ✓ Allowed
              ┌──────────┐
              │   API    │
              │   Pod    │
              └────┬─────┘
                   │
                   ▼ ✓ Allowed
              ┌──────────┐
              │ Database │
              │   Pod    │
              └──────────┘

┌──────────┐
│ Random   │
│   Pod    │───X──► API  (Blocked!)
└──────────┘

↑ Back to Table of Contents

4. Secrets Management

Sam: Don't use default Kubernetes secrets for production! Use AWS Secrets Manager:

AWS Secrets Manager CSI Driver:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: aws-secrets
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "prod/database/password"
        objectType: "secretsmanager"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    spec:
      serviceAccountName: my-app-sa
      containers:
      - name: app
        image: my-app:v1.0
        volumeMounts:
        - name: secrets
          mountPath: "/mnt/secrets"
          readOnly: true
      volumes:
      - name: secrets
        csi:
          driver: secrets-store.csi.k8s.io
          readOnly: true
          volumeAttributes:
            secretProviderClass: "aws-secrets"

Now your database password is securely fetched from AWS Secrets Manager and mounted as a file!
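
Quick sanity check once the pod is up (the file names depend on the objectName/objectAlias you configured):

kubectl exec deploy/my-app -- ls /mnt/secrets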

↑ Back to Table of Contents

5. Image Security

Sam: Scan images for vulnerabilities:

# In your CI/CD pipeline
steps:
  - name: Build image
    run: docker build -t my-app:v1.0 .

  - name: Scan with ECR
    run: |
      aws ecr start-image-scan \
        --repository-name my-app \
        --image-id imageTag=v1.0

  - name: Check scan results
    run: |
      aws ecr describe-image-scan-findings \
        --repository-name my-app \
        --image-id imageTag=v1.0
      # Fail if critical vulnerabilities found
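
That last comment is doing a lot of work. One way to actually fail the build, as a sketch: wait for the scan, then count critical findings (the query assumes the standard describe-image-scan-findings output shape).

aws ecr wait image-scan-complete --repository-name my-app --image-id imageTag=v1.0
CRITICAL=$(aws ecr describe-image-scan-findings \
  --repository-name my-app --image-id imageTag=v1.0 \
  --query 'imageScanFindings.findingSeverityCounts.CRITICAL' --output text)
if [ "$CRITICAL" != "None" ] && [ "$CRITICAL" -gt 0 ]; then
  echo "Found $CRITICAL critical vulnerabilities, failing the build"
  exit 1
fi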

Alex: This is a lot...

Sam: I know. But security comes in layers. You don't need everything on day one, but you should have a plan to get there.

↑ Back to Table of Contents

Setting Up Your First EKS Cluster

Alex: Alright, enough theory. How do I actually create an EKS cluster?

Sam: Great! Let's do this properly. You have a few options:

Option 1: eksctl (Easiest)

# Install eksctl
brew install eksctl  # Mac
# or download from github.com/weaveworks/eksctl

# Create cluster
eksctl create cluster \
  --name my-cluster \
  --region us-east-1 \
  --nodegroup-name standard-workers \
  --node-type t3.medium \
  --nodes 3 \
  --nodes-min 2 \
  --nodes-max 4 \
  --managed

This creates:

  • EKS control plane
  • VPC with public/private subnets
  • Managed node group
  • All necessary IAM roles
  • VPC CNI, kube-proxy, CoreDNS add-ons

Alex: That's it?

Sam: That's it! In about 15 minutes, you have a production-ready cluster.
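
eksctl also writes the new cluster into your kubeconfig, so you can sanity-check it right away:

kubectl get nodes                 # should show 3 workers in Ready state
kubectl get pods -n kube-system   # CoreDNS, kube-proxy, aws-node (VPC CNI)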

↑ Back to Table of Contents

Option 2: eksctl with Config File (Better)

# cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: production-cluster
  region: us-east-1
  version: "1.28"

# Use existing VPC (recommended)
vpc:
  id: "vpc-12345"
  subnets:
    private:
      us-east-1a: { id: subnet-aaaa }
      us-east-1b: { id: subnet-bbbb }
      us-east-1c: { id: subnet-cccc }

# IAM OIDC provider for IRSA
iam:
  withOIDC: true

# Managed node groups
managedNodeGroups:
  - name: general-purpose
    instanceType: t3.medium
    minSize: 2
    maxSize: 6
    desiredCapacity: 3
    volumeSize: 20
    privateNetworking: true
    labels:
      workload-type: general
    tags:
      Environment: production
      Team: platform

  - name: memory-optimized
    instanceType: r5.large
    minSize: 1
    maxSize: 3
    desiredCapacity: 1
    privateNetworking: true
    labels:
      workload-type: memory-intensive
    taints:
      - key: workload-type
        value: memory-intensive
        effect: NoSchedule

# Add-ons
addons:
  - name: vpc-cni
    version: latest
  - name: coredns
    version: latest
  - name: kube-proxy
    version: latest
  - name: aws-ebs-csi-driver
    version: latest

# CloudWatch logging
cloudWatch:
  clusterLogging:
    enableTypes:
      - api
      - audit
      - authenticator
      - controllerManager
      - scheduler
eksctl create cluster -f cluster.yaml

↑ Back to Table of Contents

Option 3: Terraform (Most Control)

# main.tf
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "production-cluster"
  cluster_version = "1.28"

  cluster_endpoint_public_access  = true
  cluster_endpoint_private_access = true

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  # EKS Managed Node Group(s)
  eks_managed_node_groups = {
    general = {
      min_size     = 2
      max_size     = 6
      desired_size = 3

      instance_types = ["t3.medium"]
      capacity_type  = "ON_DEMAND"

      labels = {
        Environment = "production"
        Workload    = "general"
      }

      tags = {
        Team = "platform"
      }
    }

    spot = {
      min_size     = 0
      max_size     = 10
      desired_size = 2

      instance_types = ["t3.medium", "t3a.medium"]
      capacity_type  = "SPOT"

      labels = {
        Workload = "batch"
      }

      taints = [{
        key    = "spot"
        value  = "true"
        effect = "NoSchedule"
      }]
    }
  }

  # Enable IRSA
  enable_irsa = true

  # Add-ons
  cluster_addons = {
    vpc-cni = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    coredns = {
      most_recent = true
    }
    aws-ebs-csi-driver = {
      most_recent = true
    }
  }

  tags = {
    Environment = "production"
    Terraform   = "true"
  }
}

Alex: Which should I use?

Sam:

  • eksctl: Quick start, learning, small teams
  • eksctl + config file: Most common, good balance
  • Terraform: Enterprise, multi-cluster, infrastructure as code

I'd start with the eksctl config file approach. It's declarative, version-controllable, and easy to understand.

↑ Back to Table of Contents

Essential Add-ons and Tools

Alex: You mentioned add-ons. What else do I need?

Sam: Great question! A basic EKS cluster needs several additional components:

1. AWS Load Balancer Controller

Why: Creates ALBs and NLBs from Ingress and Service resources

# Install via Helm
helm repo add eks https://aws.github.io/eks-charts
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=my-cluster \
  --set serviceAccount.create=false \
  --set serviceAccount.name=aws-load-balancer-controller
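
One gotcha: that --set serviceAccount.create=false assumes you already created an IRSA-backed service account for the controller. A sketch of that prerequisite step (the policy ARN is a placeholder for the controller's published IAM policy, which you create beforehand):

eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace kube-system \
  --name aws-load-balancer-controller \
  --attach-policy-arn arn:aws:iam::ACCOUNT:policy/AWSLoadBalancerControllerIAMPolicy \
  --approve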

Now you can create ALBs:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
spec:
  ingressClassName: alb
  # ... rest of config

↑ Back to Table of Contents

2. EBS CSI Driver

Why: Persistent volumes using EBS

# Already installed if you used addons in eksctl
# Or install manually
kubectl apply -k "github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.25"

↑ Back to Table of Contents

3. Metrics Server

Why: For kubectl top and HorizontalPodAutoscaler

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
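
Give it a minute to start collecting, then verify:

kubectl top nodes
kubectl top pods -A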

↑ Back to Table of Contents

4. Cluster Autoscaler

Why: Automatically scale node groups

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: cluster-autoscaler
        image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
        command:
          - ./cluster-autoscaler
          - --cloud-provider=aws
          - --namespace=kube-system
          - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
          - --balance-similar-node-groups
          - --skip-nodes-with-system-pods=false

How it works:

Scenario: Need more pods than nodes can handle

1. Pods are pending (not enough resources)
2. Cluster Autoscaler sees pending pods
3. Checks node group can scale up
4. Triggers ASG to add nodes
5. New node joins cluster
6. Pods schedule on new node

Scenario: Nodes are underutilized

1. Node has <50% utilization for 10 minutes
2. All pods can fit elsewhere
3. Cluster Autoscaler cordons node
4. Drains pods (moves them)
5. Terminates node
6. ASG size decreases

↑ Back to Table of Contents

5. Karpenter (Alternative to Cluster Autoscaler)

Sam: Karpenter is newer and smarter:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["t3.medium", "t3.large", "t3.xlarge"]
  limits:
    resources:
      cpu: 1000
  providerRef:
    name: default
  ttlSecondsAfterEmpty: 30

Karpenter vs Cluster Autoscaler:

Cluster Autoscaler:
- Works with node groups
- Slower (waits for ASG)
- Less flexible instance types

Karpenter:
- Directly provisions EC2
- Faster (no ASG)
- Intelligently picks instance types
- Better bin-packing

Alex: Which should I use?

Sam: Start with Cluster Autoscaler (simpler, well-tested). Move to Karpenter when you need faster scaling or better cost optimization.

↑ Back to Table of Contents

Observability: Knowing What's Happening

Alex: How do I monitor all this?

Sam: Observability in K8s has three pillars: Metrics, Logs, and Traces.

Metrics: CloudWatch Container Insights

# Install CloudWatch Agent
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluentd-quickstart.yaml

This gives you:

  • Cluster metrics (CPU, memory, disk, network)
  • Node metrics
  • Pod metrics
  • Namespace metrics
  • Service metrics

Custom metrics:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  containers:
  - name: app
    image: my-app:v1.0
    # App exposes metrics at :8080/metrics

↑ Back to Table of Contents

Logs: FluentBit to CloudWatch

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: amazon-cloudwatch
data:
  fluent-bit.conf: |
    [INPUT]
        Name                tail
        Path                /var/log/containers/*.log
        Parser              docker
        Tag                 kube.*

    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443

    [OUTPUT]
        Name                cloudwatch_logs
        Match               *
        region              us-east-1
        log_group_name      /aws/eks/my-cluster/containers
        auto_create_group   true

Alex: Can I see logs from kubectl?

Sam: Absolutely!

# View pod logs
kubectl logs my-pod

# Follow logs (like tail -f)
kubectl logs -f my-pod

# Logs from specific container in pod
kubectl logs my-pod -c sidecar-container

# Previous container logs (if crashed)
kubectl logs my-pod --previous

# Logs from all pods with label
kubectl logs -l app=my-app

# Logs from last hour
kubectl logs my-pod --since=1h

↑ Back to Table of Contents

Traces: AWS X-Ray

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  template:
    spec:
      containers:
      - name: app
        image: my-app:v1.0
        env:
        - name: AWS_XRAY_DAEMON_ADDRESS
          value: "127.0.0.1:2000"  # the xray-daemon sidecar below shares the pod's network namespace
      - name: xray-daemon
        image: amazon/aws-xray-daemon:latest
        ports:
        - containerPort: 2000
          protocol: UDP

The full observability stack:

┌────────────────────────────────────┐
│         Your Application           │
│  - Emits metrics (Prometheus)      │
│  - Writes logs (stdout/stderr)     │
│  - Sends traces (X-Ray)            │
└──────┬──────────────┬──────────┬───┘
       │              │          │
       │ Metrics      │ Logs     │ Traces
       ▼              ▼          ▼
┌──────────────┐ ┌──────────┐ ┌──────────┐
│ CloudWatch   │ │FluentBit │ │ X-Ray    │
│ Container    │ │          │ │ Daemon   │
│ Insights     │ └────┬─────┘ └────┬─────┘
└──────┬───────┘      │            │
       │              ▼            ▼
       │         ┌─────────────────────┐
       └────────►│   CloudWatch &      │
                 │     X-Ray           │
                 └──────────┬──────────┘
                            │
                   ┌────────▼──────────┐
                   │   CloudWatch      │
                   │   Dashboards      │
                   └───────────────────┘

↑ Back to Table of Contents

Real-World Application Deployment

Alex: Okay, let's put this all together. Walk me through deploying a real application.

Sam: Perfect! Let's deploy a three-tier application:

  • React frontend
  • Node.js API
  • PostgreSQL database

Step 1: Namespace and RBAC

# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: myapp
  labels:
    name: myapp
---
# Service account for the app
apiVersion: v1
kind: ServiceAccount
metadata:
  name: myapp-sa
  namespace: myapp
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT:role/myapp-role

Step 2: Database (PostgreSQL)

# postgres.yaml
apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
  namespace: myapp
type: Opaque
data:
  password: cG9zdGdyZXM=  # base64 encoded
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
  namespace: myapp
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: myapp
spec:
  serviceName: postgres
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:14
        ports:
        - containerPort: 5432
        env:
        - name: POSTGRES_DB
          value: myapp
        - name: POSTGRES_USER
          value: myapp
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
        volumeMounts:
        - name: postgres-storage
          mountPath: /var/lib/postgresql/data
          subPath: postgres
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
      volumes:
      - name: postgres-storage
        persistentVolumeClaim:
          claimName: postgres-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: myapp
spec:
  selector:
    app: postgres
  ports:
  - port: 5432
    targetPort: 5432
  clusterIP: None  # Headless service for StatefulSet

Step 3: Backend API

# api.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: api-config
  namespace: myapp
data:
  DATABASE_HOST: postgres.myapp.svc.cluster.local
  DATABASE_PORT: "5432"
  DATABASE_NAME: myapp
  LOG_LEVEL: info
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
        version: v1
    spec:
      serviceAccountName: myapp-sa
      containers:
      - name: api
        image: 123456789.dkr.ecr.us-east-1.amazonaws.com/myapp-api:v1.0
        ports:
        - containerPort: 3000
        env:
        - name: DATABASE_HOST
          valueFrom:
            configMapKeyRef:
              name: api-config
              key: DATABASE_HOST
        - name: DATABASE_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "256Mi"
            cpu: "200m"
---
apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: myapp
spec:
  selector:
    app: api
  ports:
  - port: 80
    targetPort: 3000
  type: ClusterIP
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

Step 4: Frontend

# frontend.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
  namespace: myapp
spec:
  replicas: 2
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: frontend
        image: 123456789.dkr.ecr.us-east-1.amazonaws.com/myapp-frontend:v1.0
        ports:
        - containerPort: 80
        resources:
          requests:
            memory: "64Mi"
            cpu: "50m"
          limits:
            memory: "128Mi"
            cpu: "100m"
---
apiVersion: v1
kind: Service
metadata:
  name: frontend
  namespace: myapp
spec:
  selector:
    app: frontend
  ports:
  - port: 80
    targetPort: 80
  type: ClusterIP

Step 5: Ingress (ALB)

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  namespace: myapp
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:ACCOUNT:certificate/abc123
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/ssl-redirect: '443'
    alb.ingress.kubernetes.io/healthcheck-path: /health
    alb.ingress.kubernetes.io/success-codes: '200'
spec:
  ingressClassName: alb
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api
            port:
              number: 80
      - path: /
        pathType: Prefix
        backend:
          service:
            name: frontend
            port:
              number: 80

Deploy Everything

# Create namespace
kubectl apply -f namespace.yaml

# Deploy database
kubectl apply -f postgres.yaml

# Wait for database to be ready
kubectl wait --for=condition=ready pod -l app=postgres -n myapp --timeout=300s

# Deploy API
kubectl apply -f api.yaml

# Wait for API
kubectl wait --for=condition=ready pod -l app=api -n myapp --timeout=300s

# Deploy frontend
kubectl apply -f frontend.yaml

# Create ingress
kubectl apply -f ingress.yaml

# Check everything
kubectl get all -n myapp

The final architecture:

Internet
   │
   ▼
┌─────────────────────────────────────┐
│  Application Load Balancer (ALB)   │
│  myapp.example.com                  │
└──────┬────────────────────┬─────────┘
       │                    │
       │ /api               │ /
       ▼                    ▼
┌──────────────┐     ┌──────────────┐
│ API Service  │     │   Frontend   │
│ (ClusterIP)  │     │   Service    │
└──────┬───────┘     └──────┬───────┘
       │                    │
┌──────▼────────┐    ┌──────▼───────┐
│  API Pods x3  │    │Frontend Pods │
│  (Deployment) │    │      x2      │
└──────┬────────┘    └──────────────┘
       │
       │ connects to
       ▼
┌──────────────────┐
│ Postgres Service │
│   (Headless)     │
└──────┬───────────┘
       │
┌──────▼────────────┐
│ Postgres Pod      │
│ (StatefulSet)     │
└──────┬────────────┘
       │
┌──────▼────────────┐
│  EBS Volume       │
│  (20Gi gp3)       │
└───────────────────┘

Alex: impressed That's a full application!

Sam: And it has:

  • High availability (multi-replica)
  • Auto-scaling (HPA)
  • Persistent storage (EBS)
  • TLS/HTTPS (ACM certificate)
  • Health checks
  • Resource limits
  • Proper networking

↑ Back to Table of Contents

Advanced Patterns and Best Practices

Alex: What are some patterns I should know about as I get more advanced?

Sam: Let me share patterns I use in production:

1. Init Containers

Problem: App needs database migrations before starting

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  template:
    spec:
      initContainers:
      - name: migration
        image: myapp-api:v1.0
        command: ['npm', 'run', 'migrate']
        env:
        - name: DATABASE_URL
          value: postgres://...
      containers:
      - name: api
        image: myapp-api:v1.0

Flow:

1. Pod created
2. Init container runs migrations
3. If migrations succeed, app container starts
4. If migrations fail, pod fails

↑ Back to Table of Contents

2. Pod Disruption Budgets

Problem: Want to ensure minimum availability during updates

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api

This prevents:

  • Draining too many nodes at once
  • Updating too many pods simultaneously
  • Cluster maintenance killing all pods

↑ Back to Table of Contents

3. Resource Quotas

Problem: Teams overconsuming resources

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    persistentvolumeclaims: "5"
    pods: "50"

↑ Back to Table of Contents

4. Affinity and Anti-Affinity

Problem: Want to spread pods across nodes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          # Prefer different nodes
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: api
              topologyKey: kubernetes.io/hostname
          # Require different availability zones
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: api
            topologyKey: topology.kubernetes.io/zone

Result:

┌──────────────┬──────────────┬──────────────┐
│   AZ-1a      │    AZ-1b     │    AZ-1c     │
├──────────────┼──────────────┼──────────────┤
│ Node 1       │ Node 3       │ Node 5       │
│  - API Pod 1 │  - API Pod 3 │  - API Pod 5 │
│              │              │              │
│ Node 2       │ Node 4       │              │
│  - API Pod 2 │  - API Pod 4 │              │
└──────────────┴──────────────┴──────────────┘
Enter fullscreen mode Exit fullscreen mode

↑ Back to Table of Contents

5. Jobs and CronJobs

Problem: Need to run batch tasks

# One-time job
apiVersion: batch/v1
kind: Job
metadata:
  name: data-import
spec:
  template:
    spec:
      containers:
      - name: import
        image: myapp-importer:v1.0
        command: ['python', 'import.py']
      restartPolicy: Never
  backoffLimit: 3
---
# Scheduled job
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-report
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: report
            image: myapp-reporter:v1.0
            command: ['python', 'report.py']
          restartPolicy: Never
Enter fullscreen mode Exit fullscreen mode
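
A few commands that come in handy with these (names match the manifests above):

# Watch the one-time job and read its logs
kubectl get jobs
kubectl logs job/data-import

# Trigger the CronJob manually instead of waiting for 2 AM
kubectl create job --from=cronjob/daily-report daily-report-manual

# Watch the jobs a CronJob spawns
kubectl get jobs --watch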

↑ Back to Table of Contents

6. Service Mesh (AWS App Mesh)

Problem: Need advanced traffic management, retries, circuit breaking

apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualNode
metadata:
  name: api-v1
spec:
  podSelector:
    matchLabels:
      app: api
      version: v1
  listeners:
    - portMapping:
        port: 3000
        protocol: http
  serviceDiscovery:
    dns:
      hostname: api.myapp.svc.cluster.local
  backends:
    - virtualService:
        virtualServiceRef:
          name: database
---
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualRouter
metadata:
  name: api-router
spec:
  listeners:
    - portMapping:
        port: 3000
        protocol: http
  routes:
    - name: api-route
      httpRoute:
        match:
          prefix: /
        action:
          weightedTargets:
            - virtualNodeRef:
                name: api-v1
              weight: 90
            - virtualNodeRef:
                name: api-v2
              weight: 10  # Canary deployment!
        retryPolicy:
          maxRetries: 3
          perRetryTimeout:
            unit: s
            value: 2
Enter fullscreen mode Exit fullscreen mode

↑ Back to Table of Contents

Troubleshooting Common Issues

Alex: What about when things go wrong? How do I debug?

Sam: pulls out troubleshooting playbook This is essential! Let me walk you through common issues:

Issue 1: Pods Not Starting

# Check pod status
kubectl get pods -n myapp

# Output shows:
NAME                  READY   STATUS              RESTARTS   AGE
api-7d4b6c9f8-abc123  0/1     ImagePullBackOff    0          5m

# Get detailed info
kubectl describe pod api-7d4b6c9f8-abc123 -n myapp

# Look for:
# - Image pull errors (wrong image name, permissions)
# - Resource constraints (not enough CPU/memory)
# - Volume mount errors
Enter fullscreen mode Exit fullscreen mode

Common causes and fixes:

# Problem: ImagePullBackOff
# Fix 1: Check image name
image: 123456789.dkr.ecr.us-east-1.amazonaws.com/myapp:v1.0  # Correct region?

# Fix 2: Check IAM permissions for pulling from ECR
# The node IAM role (not the pod's service account) performs image pulls;
# it needs ECR read access, e.g. the AmazonEC2ContainerRegistryReadOnly policy
# (ecr:GetAuthorizationToken, ecr:BatchGetImage, ecr:GetDownloadUrlForLayer)

# Problem: CrashLoopBackOff
# Check logs
kubectl logs api-7d4b6c9f8-abc123 -n myapp --previous

# Problem: Pending (not scheduling)
# Describe pod to see why
Events:
  Type     Reason            Message
  ----     ------            -------
  Warning  FailedScheduling  0/3 nodes available: insufficient memory

# Fix: Either reduce pod requests or add more nodes
Enter fullscreen mode Exit fullscreen mode
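
For the Pending / insufficient-memory case, it helps to see how much of each node is already reserved before deciding whether to shrink the pod's requests or add nodes:

# What's already requested on each node
kubectl describe nodes | grep -A 8 "Allocated resources"

# What this pod is asking for
kubectl get pod api-7d4b6c9f8-abc123 -n myapp \
  -o jsonpath='{.spec.containers[*].resources.requests}'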

↑ Back to Table of Contents

Issue 2: Service Not Reachable

# Check if service exists
kubectl get svc -n myapp

# Check if endpoints exist (pods backing the service)
kubectl get endpoints -n myapp

NAME   ENDPOINTS                           AGE
api    10.0.1.50:3000,10.0.1.51:3000       10m

# If no endpoints, pods might not match service selector
kubectl get pods -n myapp --show-labels

# Test connectivity from another pod
kubectl run test --image=busybox -it --rm -- wget -O- api.myapp.svc.cluster.local
Enter fullscreen mode Exit fullscreen mode
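
The most common reason for empty endpoints is a selector that doesn't match the pod labels. A minimal sketch of what "matching" means (names and ports follow the examples in this article):

apiVersion: v1
kind: Service
metadata:
  name: api
  namespace: myapp
spec:
  selector:
    app: api            # Must match the labels on the POD template...
  ports:
    - port: 80
      targetPort: 3000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: myapp
spec:
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api        # ...defined here, not the Deployment's own metadata labels
    spec:
      containers:
        - name: api
          image: myapp-api:v1.0
          ports:
            - containerPort: 3000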

↑ Back to Table of Contents

Issue 3: Ingress Not Working

# Check if ingress exists
kubectl get ingress -n myapp

# Describe to see ALB creation status
kubectl describe ingress myapp-ingress -n myapp

# Check AWS Load Balancer Controller logs
kubectl logs -n kube-system deployment/aws-load-balancer-controller

# Common issues:
# - IAM permissions for controller
# - Subnet tags missing (kubernetes.io/role/elb=1)
# - Security groups blocking traffic
Enter fullscreen mode Exit fullscreen mode
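
If the controller logs complain about subnet discovery, the usual fix is tagging the subnets the ALB should use. Roughly like this (the subnet IDs are placeholders for your own):

# Public subnets (internet-facing ALBs)
aws ec2 create-tags \
  --resources subnet-aaaa1111 subnet-bbbb2222 \
  --tags Key=kubernetes.io/role/elb,Value=1

# Private subnets (internal ALBs)
aws ec2 create-tags \
  --resources subnet-cccc3333 subnet-dddd4444 \
  --tags Key=kubernetes.io/role/internal-elb,Value=1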

↑ Back to Table of Contents

Issue 4: High Memory/CPU Usage

# Check current usage
kubectl top pods -n myapp

NAME                  CPU(cores)   MEMORY(bytes)
api-7d4b6c9f8-abc123  450m         890Mi  # Approaching limits!

# Check if pods are being OOMKilled
kubectl describe pod api-7d4b6c9f8-abc123 -n myapp | grep -A 5 "Last State"

# Fix: Adjust resources
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"      # Increased
    cpu: "500m"
Enter fullscreen mode Exit fullscreen mode
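
A quick way to confirm a container was OOMKilled rather than crashing on its own (pod name matches the example above):

kubectl get pod api-7d4b6c9f8-abc123 -n myapp \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Prints "OOMKilled" if the container was killed for exceeding its memory limit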

↑ Back to Table of Contents

Issue 5: PVC Not Binding

# Check PVC status
kubectl get pvc -n myapp

NAME           STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS
postgres-pvc   Pending                                      gp3

# Describe for details
kubectl describe pvc postgres-pvc -n myapp

# Common issues:
# - Storage class doesn't exist
# - No available volumes
# - Zone mismatch (PVC in us-east-1a, nodes in us-east-1b)

# Fix: Use a StorageClass with WaitForFirstConsumer binding mode
# so the volume is created in the zone where the pod is scheduled
volumeBindingMode: WaitForFirstConsumer   # See the StorageClass sketch below
Enter fullscreen mode Exit fullscreen mode
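
For reference, a minimal gp3 StorageClass for the EBS CSI driver with that binding mode might look like this (assuming the aws-ebs-csi-driver add-on is installed):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true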

↑ Back to Table of Contents

Debugging Toolkit

# Get everything in namespace
kubectl get all -n myapp

# Watch resources live
kubectl get pods -n myapp -w

# Exec into running container
kubectl exec -it api-7d4b6c9f8-abc123 -n myapp -- /bin/sh

# Port-forward for local testing
kubectl port-forward svc/api 8080:80 -n myapp
# Now access at localhost:8080

# Check events
kubectl get events -n myapp --sort-by='.lastTimestamp'

# Check resource usage
kubectl top nodes
kubectl top pods -n myapp

# View all resources with labels
kubectl get all -n myapp --show-labels

# Drain node for maintenance
kubectl drain node-name --ignore-daemonsets --delete-emptydir-data

# Cordon node (stop scheduling new pods)
kubectl cordon node-name
Enter fullscreen mode Exit fullscreen mode
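
One more tool worth knowing: if the app image has no shell (e.g. distroless), kubectl exec won't get you far, but an ephemeral debug container will, assuming a reasonably recent cluster version:

# Attach a throwaway busybox container to a running pod
kubectl debug -it api-7d4b6c9f8-abc123 -n myapp \
  --image=busybox --target=api -- /bin/sh

# Or work on a disposable copy of the pod instead of the real one
kubectl debug api-7d4b6c9f8-abc123 -n myapp -it \
  --copy-to=api-debug --container=api -- /bin/sh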

↑ Back to Table of Contents

GitOps and CI/CD

Alex: How do I actually deploy updates? Keep editing YAML files and applying them?

Sam: No way! That's where GitOps comes in. Let me show you a proper CI/CD pipeline.

The GitOps Workflow with Flux or ArgoCD

Developer Flow:

1. Developer → Git Commit → Push to main branch
              │
              ▼
2. GitHub Actions (CI)
   - Runs tests
   - Builds Docker image
   - Tags image (git sha)
   - Pushes to ECR
   - Updates Kubernetes manifest with new image tag
   - Commits manifest to GitOps repo
              │
              ▼
3. Flux/ArgoCD (watching GitOps repo)
   - Detects change
   - Pulls new manifests
   - Applies to cluster
              │
              ▼
4. Kubernetes
   - Rolling update deployment
   - New pods come up
   - Old pods terminate
              │
              ▼
5. Developer sees app updated!
Enter fullscreen mode Exit fullscreen mode

Example GitHub Actions workflow:

# .github/workflows/deploy.yaml
name: Build and Deploy

on:
  push:
    branches: [main]

env:
  ECR_REPOSITORY: myapp-api
  EKS_CLUSTER: production-cluster
  AWS_REGION: us-east-1

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: ${{ env.AWS_REGION }}

      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1

      - name: Build and push image
        id: build-and-push   # Referenced by the next step as steps.build-and-push.outputs.image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .
          docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG
          echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" >> $GITHUB_OUTPUT

      - name: Update Kubernetes manifest
        env:
          IMAGE: ${{ steps.build-and-push.outputs.image }}
        run: |
          # Update image in deployment YAML
          sed -i "s|image:.*|image: $IMAGE|g" k8s/deployment.yaml

      - name: Commit and push to GitOps repo
        run: |
          git config user.name "GitHub Actions"
          git config user.email "actions@github.com"
          git add k8s/deployment.yaml
          git commit -m "Update image to ${{ github.sha }}"
          git push
Enter fullscreen mode Exit fullscreen mode

ArgoCD Application:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/gitops-repo
    targetRevision: main
    path: k8s
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp
  syncPolicy:
    automated:
      prune: true      # Delete resources not in git
      selfHeal: true   # Fix manual changes
    syncOptions:
      - CreateNamespace=true
Enter fullscreen mode Exit fullscreen mode
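
Once the Application exists, you can watch sync status from kubectl or, if the argocd CLI is installed and logged in, from the CLI:

# From kubectl
kubectl get applications -n argocd
kubectl describe application myapp -n argocd

# From the argocd CLI
argocd app get myapp
argocd app sync myapp      # Trigger a sync now instead of waiting for the poll interval
argocd app history myapp   # Past syncs, handy when you need to roll back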

Alex: So ArgoCD watches git and keeps the cluster in sync?

Sam: Exactly! If someone manually changes something in the cluster, ArgoCD changes it back. Git is the source of truth.

↑ Back to Table of Contents

Cost Optimization

Alex: This is great, but I'm worried about costs. Any tips?

Sam: Absolutely! EKS costs can add up. Here's my cost optimization playbook:

1. Right-Size Your Pods

# Install VPA (Vertical Pod Autoscaler)
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

# Create VPA in recommendation mode
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updatePolicy:
    updateMode: "Off"  # Just recommend, don't auto-update

# Check recommendations
kubectl describe vpa api-vpa
Enter fullscreen mode Exit fullscreen mode

2. Use Spot Instances

# In eksctl config
managedNodeGroups:
  - name: spot-workers
    instanceTypes:
      - t3.medium
      - t3a.medium
      - t2.medium
    spot: true
    minSize: 2
    maxSize: 10
    labels:
      lifecycle: spot
    taints:
      - key: spot
        value: "true"
        effect: NoSchedule

# Pods that can tolerate spot
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  template:
    spec:
      tolerations:
      - key: spot
        operator: Equal
        value: "true"
        effect: NoSchedule
      nodeSelector:
        lifecycle: spot
Enter fullscreen mode Exit fullscreen mode

3. Use Fargate for Bursty Workloads

# Fargate profile
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
fargateProfiles:
  - name: batch-jobs
    selectors:
      - namespace: batch
        labels:
          workload-type: batch
Enter fullscreen mode Exit fullscreen mode

4. Implement Pod Disruption Budgets for Safe Downscaling

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 1  # Keep at least one running during disruptions
  selector:
    matchLabels:
      app: api
Enter fullscreen mode Exit fullscreen mode

5. Use Cluster Autoscaler or Karpenter

Cost savings:

  • Scales down unused nodes
  • Consolidates workloads
  • Uses cheapest instance types
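
As a sketch of what the Karpenter route looks like (the exact schema depends on your Karpenter version; this follows the v1 API and assumes an EC2NodeClass named "default" already exists):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # Allow both; Karpenter favors cheaper capacity
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64", "amd64"]      # Graviton and x86
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "100"                            # Cap total provisioned capacity
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized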

6. Monitor and Alert

# CloudWatch alarm for high costs
# Note: AWS billing metrics are published only in us-east-1
aws cloudwatch put-metric-alarm \
  --region us-east-1 \
  --alarm-name eks-high-cost \
  --alarm-description "Alert if estimated charges exceed threshold" \
  --metric-name EstimatedCharges \
  --namespace AWS/Billing \
  --dimensions Name=Currency,Value=USD \
  --statistic Maximum \
  --period 86400 \
  --evaluation-periods 1 \
  --threshold 1000 \
  --comparison-operator GreaterThanThreshold
Enter fullscreen mode Exit fullscreen mode

7. Use Graviton (ARM) Instances

managedNodeGroups:
  - name: graviton-workers
    instanceTypes: [t4g.medium, t4g.large]  # Graviton-based
    # 20% cheaper than x86!
Enter fullscreen mode Exit fullscreen mode
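
One catch with Graviton: the nodes need arm64 images. If you build on an x86 machine or CI runner, a multi-arch build keeps the same tag working on both architectures (registry URL reuses the placeholder from earlier examples):

# Build and push a multi-arch image (amd64 + arm64) with Buildx
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t 123456789.dkr.ecr.us-east-1.amazonaws.com/myapp:v1.0 \
  --push .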

Cost breakdown example:

Monthly EKS Costs:

Control Plane:              $73/month  (fixed)

3x t3.medium (x86):
  - On-Demand: $90/month
  - Spot: ~$27/month         (70% savings!)

3x t4g.medium (ARM):
  - On-Demand: $72/month     (20% savings vs x86!)
  - Spot: ~$22/month         (75% savings!)

Best Practice Mix:
  - 2x t4g.medium on-demand  (baseline)    $48
  - 3x t4g.medium spot       (burst)       $22
  - Control plane                          $73
  ----------------------------------------
  Total: ~$143/month for 5-node cluster
Enter fullscreen mode Exit fullscreen mode

↑ Back to Table of Contents

Production Readiness Checklist

Alex: Okay, before we wrap up – give me a checklist. What do I need before going to production?

Sam: pulls out final napkin Here's my production readiness checklist:

Infrastructure

  • [ ] Multi-AZ deployment (at least 2 AZs)
  • [ ] Private subnets for worker nodes
  • [ ] NAT Gateways for outbound internet
  • [ ] VPC Flow Logs enabled
  • [ ] Control plane logging enabled
  • [ ] Cluster autoscaling configured
  • [ ] Pod autoscaling (HPA) for services
  • [ ] Proper instance types selected (right-sized)

Security

  • [ ] IRSA configured for pod IAM roles
  • [ ] Pod Security Standards enforced
  • [ ] Network Policies defined
  • [ ] Secrets in AWS Secrets Manager (not K8s secrets)
  • [ ] ECR image scanning enabled
  • [ ] Security groups properly configured
  • [ ] IAM least privilege for all roles
  • [ ] Encryption at rest for EBS volumes
  • [ ] Encryption in transit (TLS everywhere)

Observability

  • [ ] CloudWatch Container Insights installed
  • [ ] Logging to CloudWatch configured
  • [ ] Metrics collection working
  • [ ] Distributed tracing (X-Ray) configured
  • [ ] Dashboards created for key metrics
  • [ ] Alerts configured for critical issues
  • [ ] On-call rotation defined

High Availability

  • [ ] Multiple replicas for all services
  • [ ] Pod Disruption Budgets defined
  • [ ] Liveness/readiness probes on all pods
  • [ ] Resource requests/limits set
  • [ ] Anti-affinity rules for spreading pods
  • [ ] Health checks on load balancers
  • [ ] Graceful shutdown configured

Disaster Recovery

  • [ ] etcd backups automated (handled by EKS)
  • [ ] PV snapshots scheduled
  • [ ] GitOps repo backed up
  • [ ] Disaster recovery plan documented
  • [ ] Recovery tested at least once
  • [ ] RTO/RPO defined and achievable

Operations

  • [ ] GitOps workflow implemented
  • [ ] CI/CD pipeline automated
  • [ ] Rollback procedure tested
  • [ ] Runbooks for common issues
  • [ ] Access control (RBAC) configured
  • [ ] Audit logging enabled
  • [ ] Change management process defined

Cost Management

  • [ ] Resource quotas per namespace
  • [ ] Cost tracking by team/project
  • [ ] Spot instances for applicable workloads
  • [ ] Rightsizing reviewed monthly
  • [ ] Unused resources cleaned up
  • [ ] Budget alerts configured

Alex: overwhelmed That's a lot...

Sam: It is! But you don't need everything day one. Prioritize:

Week 1: Get it running

  • Basic cluster
  • Deploy apps
  • Basic monitoring

Week 2-4: Make it reliable

  • HA configuration
  • Proper health checks
  • Autoscaling

Month 2-3: Make it secure

  • IRSA
  • Network policies
  • Secrets management

Month 3+: Optimize

  • Cost optimization
  • Performance tuning
  • Advanced features

↑ Back to Table of Contents

Wrapping Up

Alex: closes laptop Wow. That was... comprehensive. My brain is full.

Sam: laughs I know it's a lot. Kubernetes and EKS are powerful but complex. Let me leave you with key takeaways:

1. Start Simple

Day 1: Basic EKS cluster with eksctl
Day 2: Deploy first app
Week 1: Add monitoring
Week 2: Add autoscaling
Month 1: Implement GitOps
Month 2: Advanced features
Enter fullscreen mode Exit fullscreen mode

2. Use Managed Services

  • Let AWS manage the control plane
  • Use managed node groups
  • Leverage AWS integrations (IAM, ALB, EBS)

3. Embrace Declarative Configuration

  • Everything in YAML/Git
  • Let Kubernetes reconcile
  • Don't fight the system

4. Focus on Observability

  • You can't fix what you can't see
  • Logs, metrics, traces
  • Alert on what matters

5. Security is a Journey

  • Start with basics (IRSA, network policies)
  • Add layers over time
  • Never stop improving

Alex: And if I had to explain EKS to my manager in 30 seconds?

Sam: "EKS is AWS's managed Kubernetes service. AWS handles the complex control plane, we focus on running our applications. It gives us industry-standard container orchestration, portability, and a massive ecosystem of tools. It's more complex than ECS but more powerful and portable. Perfect for our growing needs."

Alex: Perfect. One last question – when should I absolutely NOT use EKS?

Sam: Great question!

Don't use EKS if:

  • You have a single, simple application → Use App Runner or ECS
  • Your team has zero container experience → Start with ECS, migrate later
  • You need something running TODAY → Kubernetes has a learning curve
  • Budget is extremely tight → Control plane costs $73/month minimum
  • You don't need Kubernetes features → Simpler tools exist

DO use EKS if:

  • You need Kubernetes for multi-cloud
  • You have Kubernetes expertise
  • You need advanced orchestration features
  • You're building a platform for multiple teams
  • You want the K8s ecosystem tools

Alex: stands up Alright, I'm ready. Time to create my first EKS cluster!

Sam: That's the spirit! Remember:

# Your first cluster
eksctl create cluster \
  --name my-first-cluster \
  --region us-east-1 \
  --nodegroup-name workers \
  --node-type t3.medium \
  --nodes 2 \
  --managed

# Deploy something
kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80 --type=LoadBalancer

# Check it
kubectl get all
Enter fullscreen mode Exit fullscreen mode

Alex: And when it inevitably breaks?

Sam: hands over card Text me. Or check the logs, describe the pods, and check events. 90% of issues are:

  1. Wrong image name
  2. Missing permissions
  3. Resource constraints
  4. Configuration errors

Alex: shakes hand Thanks, Sam. Same time next week to discuss service meshes?

Sam: grins Let's start with getting this working first!

↑ Back to Table of Contents

Quick Reference Guide

Essential Commands:

# Cluster management
eksctl create cluster -f cluster.yaml
eksctl get cluster
eksctl delete cluster --name my-cluster

# Context management
kubectl config get-contexts
kubectl config use-context my-cluster

# Resource management
kubectl get pods -n myapp
kubectl describe pod my-pod -n myapp
kubectl logs my-pod -n myapp -f
kubectl exec -it my-pod -n myapp -- /bin/sh

# Apply configurations
kubectl apply -f deployment.yaml
kubectl apply -f . # Apply all YAML in directory

# Scaling
kubectl scale deployment/my-app --replicas=5
kubectl autoscale deployment/my-app --min=2 --max=10 --cpu-percent=70

# Updates
kubectl set image deployment/my-app app=myapp:v2.0
kubectl rollout status deployment/my-app
kubectl rollout undo deployment/my-app

# Debugging
kubectl get events -n myapp --sort-by='.lastTimestamp'
kubectl top pods -n myapp
kubectl top nodes

# Port forwarding
kubectl port-forward svc/my-app 8080:80
Enter fullscreen mode Exit fullscreen mode

Resource Hierarchy:

Cluster
  └─ Namespace
      ├─ Deployment
      │   └─ ReplicaSet
      │       └─ Pod
      │           └─ Container
      ├─ Service
      ├─ ConfigMap
      ├─ Secret
      ├─ PersistentVolumeClaim
      │   └─ PersistentVolume
      └─ Ingress
Enter fullscreen mode Exit fullscreen mode


↑ Back to Table of Contents

Alex walks out of the coffee shop, laptop bag over shoulder, ready to build something amazing. Sam orders another coffee and opens a laptop – time to help the next person on their Kubernetes journey.

The Beginning of Your EKS Adventure 🚀☸️

↑ Back to Table of Contents
