S, Sanjay

Kubernetes Explained: The Drama of Pods, Nodes, and the Scheduler Who Hates Everyone

🎬 Let Me Paint a Picture

It's 3:14 AM. Your phone buzzes. PagerDuty.

CRITICAL: payment-service - 0/3 pods ready

You open your laptop, eyes half-closed, and type:

kubectl get pods -n payments
NAME                              READY   STATUS             RESTARTS   AGE
payment-service-7f8d9b6c4-abc12   0/1     CrashLoopBackOff   47         2h
payment-service-7f8d9b6c4-def34   0/1     CrashLoopBackOff   47         2h
payment-service-7f8d9b6c4-ghi56   0/1     CrashLoopBackOff   47         2h

CrashLoopBackOff. The three most terrifying words in the Kubernetes dictionary.

Welcome to Kubernetes Mastery. By the end of this blog, you'll not only understand what every K8s component does; you'll know what to do when they break. Let's go.


🧠 Kubernetes Architecture: The Cast of Characters

Think of Kubernetes as a restaurant:

┌─────────────────────────────────────────────────────────┐
│  CONTROL PLANE (The Kitchen Management)                 │
│                                                         │
│  🧑‍🍳 API Server       = The Maître d' (takes ALL orders) │
│  📒 etcd             = The order book (remembers everything) │
│  🎯 Scheduler        = The seating host (assigns tables) │
│  🔄 Controllers      = The managers (make sure orders   │
│                        are fulfilled)                   │
│  ☁️ Cloud Controller = The landlord (manages building)  │
└─────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────┐
│  DATA PLANE (The Actual Kitchen & Dining Room)          │
│                                                         │
│  🖥️ Nodes       = Tables in the restaurant              │
│  📦 Pods        = Plates of food on the table           │
│  🤖 kubelet     = The waiter at each table              │
│  🔀 kube-proxy  = The runner (routes food to tables)    │
│  🍳 containerd  = The actual cook                       │
└─────────────────────────────────────────────────────────┘

What Really Happens When You kubectl apply

Every time you deploy something, here's the actual flow:

You: kubectl apply -f deployment.yaml
        │
        ▼
   API Server: "Hold on, let me check..."
        │
        ├─ Step 1: AuthN → "Who are you?" (certificate/token)
        ├─ Step 2: AuthZ → "Can you do this?" (RBAC check)
        ├─ Step 3: Admission → "Should we allow this?"
        │          (Webhooks: Kyverno says "no latest tag!")
        ├─ Step 4: Validation → "Is this YAML even valid?"
        └─ Step 5: Write to etcd → "OK, saved."
               │
               ▼
   Controller Manager: "Oh, new Deployment! Let me create a ReplicaSet."
   ReplicaSet Controller: "ReplicaSet says 3 pods. Let me create 3 Pods."
               │
               ▼
   Scheduler: "3 new Pods need homes. Node-1 has CPU.
               Node-2 has a taint. Node-3 is full.
               → Pods go to Node-1 and Node-4."
               │
               ▼
   kubelet (on each node): "I got assigned pods.
               Pulling image... Starting container...
               Health check passed. Reporting ready!"

๐Ÿ” Restaurant analogy: You (the customer) tell the Maรฎtre d' (API Server) you want 3 burgers. The Maรฎtre d' writes it in the order book (etcd). The manager (Controller) tells the kitchen to make 3 burgers. The seating host (Scheduler) figures out which tables have room. The waiter (kubelet) brings the burgers to the right tables.


๐Ÿ—๏ธ AKS Architecture: What Microsoft Manages (And What's Your Problem)

When you use AKS, there's a clear split:

Microsoft's Problem                Your Problem
(Free/SLA-backed)                  (Good luck 🫡)
═══════════════════                ═══════════════════════
✅ API Server                      😰 Your application code
✅ etcd                            😰 Node pool sizing
✅ Controller Manager              😰 Pod configurations
✅ Scheduler                       😰 Networking choices
✅ Control plane upgrades          😰 Your Docker images
                                   😰 Secrets management
                                   😰 Ingress configuration
                                   😰 That one deployment
                                      with no resource limits

🚨 Real-World Disaster #1: The Node Pool That Couldn't Scale

The Error:

Events:
  Warning  FailedScaleUp  cluster-autoscaler
  pod didn't trigger scale-up: 1 max node group size reached

What Happened: The team set max nodes to 5, but Black Friday traffic needed 12. The Cluster Autoscaler wanted to add nodes but was blocked by the max limit. Pods sat in Pending state for 45 minutes.

The Fix:

# Check current autoscaler settings
az aks nodepool show -g rg-prod --cluster-name aks-prod \
  -n userpool --query '{min:minCount, max:maxCount, current:count}'

# Update max nodes (always set 2-3x your expected peak)
az aks nodepool update -g rg-prod --cluster-name aks-prod \
  -n userpool --max-count 20 --min-count 3

# Pro tip: Enable NAP (Node Auto-Provisioning) for fully automated scaling
# (requires the aks-preview CLI extension at the time of writing)
az aks update -g rg-prod -n aks-prod --node-provisioning-mode Auto

💡 Rule of thumb: Set maxCount to 2-3x your normal peak. The Cluster Autoscaler won't scale up if it's not needed; you only pay for what you use.


📦 The Pod Spec: Where 90% of Production Issues Live

If Kubernetes is a restaurant, the Pod spec is the recipe. Get the recipe wrong, and you serve garbage. Here's the production-ready pod spec with every field explained:

Resource Requests & Limits (THE #1 K8s Issue)

resources:
  requests:        # "I need at least this much"
    cpu: 250m      # 0.25 CPU cores (scheduler uses this)
    memory: 256Mi  # Scheduler reserves this on the node
  limits:
    cpu: 1000m     # Can burst up to 1 CPU core
    memory: 512Mi  # HARD LIMIT - exceed this = OOMKilled 💀

🚨 Real-World Disaster #2: The OOMKilled Epidemic

The Error:

$ kubectl describe pod payment-service-xyz
State:          Terminated
Reason:         OOMKilled
Exit Code:      137

What Happened: The Java app was configured with -Xmx512m (512MB heap) but the container memory limit was set to 512Mi. Java heap + overhead (metaspace, threads, JNI) = ~680MB. Container tries to use more than 512Mi → kernel kills it. Pod restarts. Uses 680MB again. Killed again. CrashLoopBackOff.

Translation: The app's memory request was a lie. It asked for 512Mi but actually needed ~700Mi. Kubernetes trusted the lie, and the OOM killer delivered justice.

The Fix:

resources:
  requests:
    memory: 768Mi    # Be honest about what your app needs
  limits:
    memory: 1Gi      # Give it headroom (limit = ~1.3x request for memory)

The Rule:

  • CPU: limit = 2x to 4x request is fine (CPU is compressible; it just gets throttled)
  • Memory: limit = 1.3x to 1.5x request MAX (memory is NOT compressible; exceed it = death)
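For JVM workloads like the one in Disaster #2, another way to avoid the heap-vs-limit mismatch is to let the JVM size its heap from the container's cgroup limit instead of hard-coding -Xmx. A sketch (the JAVA_TOOL_OPTIONS approach and the 75% figure are common conventions, not the only option; tune for your app):

```yaml
containers:
  - name: payment-service
    image: myacr.azurecr.io/payment-service:v1.0.0
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:MaxRAMPercentage=75.0"   # heap = 75% of the container limit
    resources:
      requests:
        memory: 768Mi
      limits:
        memory: 1Gi      # JVM reads this limit and sizes heap to ~768Mi
```

Now raising the memory limit automatically raises the heap in proportion, leaving the remaining 25% for metaspace, threads, and JNI overhead.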

Health Probes: The Three Probe Ensemble

# 1. Startup Probe: "Has the app finished booting?"
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30    # Try 30 times
  periodSeconds: 10       # Every 10 seconds = 5 min max startup
  # Without this: K8s kills slow-starting apps before they're ready!

# 2. Liveness Probe: "Is the app alive?"
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 15
  timeoutSeconds: 5
  # If this fails: K8s RESTARTS the pod

# 3. Readiness Probe: "Can the app serve traffic?"
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3
  # If this fails: K8s removes pod from the Service (no traffic sent)

🚨 Real-World Disaster #3: The Probe That Killed Production

What Happened: A team set the liveness probe path to the same endpoint as their main API, /api/v1/health. During a database connection pool exhaustion, this endpoint hung for 10 seconds. The liveness timeout was 5 seconds. Kubernetes thought the pod was dead. Killed it. New pod starts, also can't connect to the DB. Killed. ALL PODS KILLED SIMULTANEOUSLY.

Result: Complete outage because K8s was trying to "help" by restarting healthy pods.

The Fix:

  1. Liveness probes should check local health only (can the process respond?), NOT dependency health
  2. Readiness probes should check dependencies (is the DB reachable?)
  3. Never point liveness at your main API endpoint
# GOOD: Lightweight liveness check
livenessProbe:
  httpGet:
    path: /healthz     # Returns 200 if process is alive. That's it.
    port: 8080

# GOOD: Dependency-aware readiness check
readinessProbe:
  httpGet:
    path: /ready       # Checks DB connection, cache, etc.
    port: 8080

🌐 Kubernetes Networking: The "Why Can't My Pod Talk to That Pod" Chapter

Service Types Explained (with when to use each)

 ClusterIP (default)
 └─ Internal only. Pod-to-pod communication.
    Use for: microservice → microservice calls
    Cost: Free

 LoadBalancer
 └─ Gets a real Azure Load Balancer (public or internal IP)
    Use for: non-HTTP services (gRPC, TCP, game servers)
    Cost: $18/month + data transfer PER SERVICE 😱

 Ingress
 └─ One LoadBalancer → routes to many services by host/path
    Use for: HTTP/HTTPS services (90% of your apps)
    Cost: One LB cost shared across all services 🎉

 Gateway API (the future)
 └─ Like Ingress but better: multi-tenant, L4+L7, cross-namespace
    Use for: new deployments, forward-thinking architecture
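For reference, the ClusterIP case, the one you'll write most often, can be sketched like this (service and label names are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: payments
spec:
  type: ClusterIP          # the default; omitting `type` gives you the same thing
  selector:
    app: payment-service   # routes to pods carrying this label
  ports:
    - port: 80             # what callers use: http://payment-service.payments
      targetPort: 8080     # the containerPort the app actually listens on
```

Other pods in the cluster reach it via the DNS name payment-service.payments.svc.cluster.local; nothing outside the cluster can.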

🚨 Real-World Disaster #4: The $2,400/Month LoadBalancer Bill

What Happened: Each team created individual Services with type: LoadBalancer for their apps. 12 services × $18/month per LB, plus data transfer on each, added up to $2,400/month just for load balancing.

The Fix: Deploy ONE NGINX Ingress Controller, route all HTTP traffic through it:

# Instead of 12 LoadBalancers, one Ingress:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: main-ingress
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: api.mycompany.com
      http:
        paths:
          - path: /payments
            pathType: Prefix
            backend:
              service:
                name: payment-service
                port:
                  number: 8080
          - path: /users
            pathType: Prefix
            backend:
              service:
                name: user-service
                port:
                  number: 8080

Cost after: One LoadBalancer = ~$18/month. Savings: $2,382/month. You're welcome.


🔐 Kubernetes Security: The Non-Negotiables

The Security Checklist Every Pod Must Pass

spec:
  serviceAccountName: my-app-sa       # Dedicated SA per app
  automountServiceAccountToken: false  # Don't mount unless needed
  securityContext:
    runAsNonRoot: true                 # Never run as root
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault             # syscall filtering
  containers:
    - name: my-app
      image: myacr.azurecr.io/app:v1.2.3@sha256:abc...  # Pin by digest!
      securityContext:
        allowPrivilegeEscalation: false  # Can't become root
        readOnlyRootFilesystem: true     # No writing to filesystem
        capabilities:
          drop: ["ALL"]                  # Drop all Linux capabilities

🚨 Real-World Disaster #5: The Crypto Miner in Your Cluster

The Alert:

Defender for Containers: CRITICAL
"Suspicious container detected: Image contains known cryptomining software"
"Pod 'nginx-proxy-xyz' in namespace 'default' running as root with
hostNetwork: true"

What Happened: Someone deployed a "convenience" nginx image from Docker Hub (not your private ACR). The image was compromised and contained a crypto miner. Because the pod ran as root with hostNetwork: true, it could access the node's network and mine crypto using your Azure bill.

The Fix:

  1. Only allow images from your private ACR:
# Kyverno policy: Block images not from our ACR
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  rules:
    - name: validate-registries
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Images must come from myacr.azurecr.io"
        pattern:
          spec:
            containers:
              - image: "myacr.azurecr.io/*"
  2. Never run pods in the default namespace (no policies are applied there by default)
  3. Scan images in your CI/CD pipeline with Trivy before pushing to ACR
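The Trivy scan can live in CI so a vulnerable image never reaches ACR in the first place. A sketch using the Trivy GitHub Action (the step name and image tag are placeholders; adapt to your own pipeline):

```yaml
- name: Scan image before push
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: myacr.azurecr.io/app:${{ github.sha }}
    severity: HIGH,CRITICAL
    exit-code: "1"        # fail the build on findings instead of pushing
```

With exit-code set to 1, the pipeline stops before the push step, so the compromised "convenience" image from Disaster #5 never makes it into your registry.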

📈 Autoscaling: Making Kubernetes Elastic

Kubernetes has three levels of autoscaling, and you need all of them:

Level 1: HPA (Horizontal Pod Autoscaler)
└─ Adds/removes PODS based on CPU, memory, or custom metrics
   "My service is busy? Add more pod replicas!"

Level 2: KEDA (Kubernetes Event-Driven Autoscaler)
└─ Scales based on EVENTS: queue depth, HTTP requests, cron
   "There are 10,000 messages in the queue? Scale to 50 pods!"
   "It's 3 AM and the queue is empty? Scale to zero!"

Level 3: Cluster Autoscaler
└─ Adds/removes NODES when pods can't be scheduled
   "Pods are Pending because no node has capacity? Add a node!"
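Level 2 in practice: a KEDA ScaledObject that scales a hypothetical order-processor Deployment on Azure Service Bus queue depth, down to zero when idle. The queue name, target, and auth reference are placeholders:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaler
spec:
  scaleTargetRef:
    name: order-processor        # the Deployment to scale
  minReplicaCount: 0             # scale to zero when the queue is empty
  maxReplicaCount: 50
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: orders
        messageCount: "200"      # target messages per replica
      authenticationRef:
        name: servicebus-auth    # a TriggerAuthentication you define separately
```

KEDA manages an HPA under the hood, so the same guardrail thinking (a sane maxReplicaCount your DB can survive) still applies.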

🚨 Real-World Disaster #6: The Autoscaler Death Spiral

What Happened: HPA was configured to scale on CPU. Under load, pods scaled from 3 → 15. But each pod opening connections to the database caused connection pool exhaustion. The DB started returning errors. Error-handling code consumed MORE CPU (logging, retries). HPA saw more CPU → scaled to 30 pods. More DB connections → faster DB collapse. Complete meltdown.

The Fix:

  1. Set maxReplicas in HPA to something your DB can handle
  2. Use connection pooling (PgBouncer for Postgres)
  3. Scale on business metrics (requests/second) not raw CPU
  4. Add a circuit breaker between your app and the DB
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 15           # Cap it! Know your DB's connection limit.
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60   # Don't scale up too fast
      policies:
        - type: Pods
          value: 2                     # Max 2 pods per minute
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second  # Business metric, not CPU!
        target:
          type: AverageValue
          averageValue: "100"

🚀 GitOps: Your Cluster's Single Source of Truth

GitOps = Your Git repository is the single source of truth for your cluster state. No more kubectl apply from laptops. No more "who deployed that?"

Developer pushes to Git
        │
        ▼
  Git Repository (the truth)
        │
        ▼
  GitOps Agent (Flux / ArgoCD)
  watches the repo, detects changes
        │
        ▼
  Applies changes to cluster
  (reconciliation loop: every 1-5 minutes)
        │
        ▼
  Cluster state matches Git ✅

🚨 Real-World Disaster #7: The Rogue kubectl

What Happened: A developer ran kubectl scale deployment payment-service --replicas=1 in production "to test something." This reduced payment processing capacity by 66%. But since there was no GitOps, nobody noticed the drift for 3 hours until load increased and the single replica started dropping requests.

With GitOps: Flux/ArgoCD would have detected the drift within minutes and automatically scaled back to 3 replicas. The desired state in Git always wins.
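With Flux, that reconciliation loop boils down to two small objects: a GitRepository (what to watch) and a Kustomization (what to apply). A sketch with a placeholder repo URL and path:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-config
  namespace: flux-system
spec:
  interval: 1m                   # poll Git every minute
  url: https://github.com/mycompany/platform-config
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m                   # re-reconcile cluster state every 5 minutes
  sourceRef:
    kind: GitRepository
    name: platform-config
  path: ./apps/production
  prune: true                    # delete resources that were removed from Git
```

Had this been in place, the rogue --replicas=1 would have been reverted on the next reconcile, minutes later instead of 3 hours.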


🧪 Quick Reference: The K8s Troubleshooting Flowchart

Pod not starting?
├── Status: Pending
│   ├── "Insufficient cpu/memory" → Node is full
│   │   └─ Fix: Check resource requests, scale node pool
│   ├── "No nodes match pod topology" → Affinity/taint issue
│   │   └─ Fix: Check nodeSelector, tolerations, topology constraints
│   └── "0/3 nodes available: PersistentVolumeClaim not bound"
│       └─ Fix: Check PVC, storage class, disk availability
│
├── Status: ImagePullBackOff
│   ├── "unauthorized: authentication required" → ACR auth failed
│   │   └─ Fix: Check imagePullSecrets or AKS-ACR integration
│   └── "manifest unknown" → Image tag doesn't exist
│       └─ Fix: Check image:tag spelling, verify it exists in registry
│
├── Status: CrashLoopBackOff
│   ├── Exit Code 137 → OOMKilled
│   │   └─ Fix: Increase memory limit
│   ├── Exit Code 1 → App crashed on startup
│   │   └─ Fix: Check logs: kubectl logs <pod> --previous
│   └── Exit Code 0 → App exited successfully (shouldn't for a server)
│       └─ Fix: Check entrypoint/command, app should run indefinitely
│
├── Status: Running but not Ready
│   └── Readiness probe failing
│       └─ Fix: Check probe path, port, and app dependencies
│
└── Status: Terminating (stuck)
    └── Finalizer or preStop hook issue
        └─ Fix: kubectl delete pod <name> --grace-period=0 --force
           (last resort!)
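When you're working through the flowchart, these are the first commands to reach for (the pod and namespace names are from the example at the top of this post; substitute your own):

```shell
# What does Kubernetes itself say about the pod? (events, probe failures, OOM)
kubectl describe pod payment-service-7f8d9b6c4-abc12 -n payments

# Logs from the crashed container, not the freshly restarted one
kubectl logs payment-service-7f8d9b6c4-abc12 -n payments --previous

# Recent namespace events, newest last
kubectl get events -n payments --sort-by=.lastTimestamp

# Exit code of the last termination (137 = OOMKilled)
kubectl get pod payment-service-7f8d9b6c4-abc12 -n payments \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```

describe plus logs --previous answers the CrashLoopBackOff branch in 90% of cases; the events list covers most of the Pending branch.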

🎯 Key Takeaways

  1. Resource requests/limits are the #1 cause of production K8s issues; set them honestly
  2. Liveness probes should check the process, not dependencies; bad probes kill healthy pods
  3. One Ingress Controller beats 12 LoadBalancers every time ($$$)
  4. Pin images by digest in production; tags are mutable and untrustworthy
  5. Autoscaling needs guardrails; an uncapped HPA can create death spirals
  6. GitOps eliminates drift and rogue kubectl changes
  7. Never run pods as root, unless you enjoy donating CPU to crypto miners

🔥 Homework

  1. Run kubectl get pods --all-namespaces | grep -E "CrashLoop|Error|Pending" and fix what you find
  2. Check if any pod in your cluster runs as root: kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.securityContext.runAsNonRoot}{"\n"}{end}'
  3. Calculate how many LoadBalancers your cluster has and whether you can consolidate with an Ingress

Next up in the series: *Terraform State Files: The Diary Your Infrastructure Never Wanted You to Read*, where state file corruption, locking wars, and the dreaded -target flag are decoded with real horror stories.


💬 What's your worst CrashLoopBackOff story? Share it below. There's no judgment here, only solidarity. 🫂
