🎬 Let Me Paint a Picture
It's 3:14 AM. Your phone buzzes. PagerDuty.
CRITICAL: payment-service - 0/3 pods ready
You open your laptop, eyes half-closed, and type:
kubectl get pods -n payments
NAME READY STATUS RESTARTS AGE
payment-service-7f8d9b6c4-abc12 0/1 CrashLoopBackOff 47 2h
payment-service-7f8d9b6c4-def34 0/1 CrashLoopBackOff 47 2h
payment-service-7f8d9b6c4-ghi56 0/1 CrashLoopBackOff 47 2h
CrashLoopBackOff. The three most terrifying words in the Kubernetes dictionary.
Welcome to Kubernetes Mastery. By the end of this post, you'll not only understand what every K8s component does; you'll know what to do when they break. Let's go.
🧠 Kubernetes Architecture: The Cast of Characters
Think of Kubernetes as a restaurant:
┌────────────────────────────────────────────────────────────┐
│ CONTROL PLANE (The Kitchen Management)                     │
│                                                            │
│ 🧑‍🍳 API Server       = The Maître d' (takes ALL orders)     │
│ 📒 etcd              = The order book (remembers everything)│
│ 🎯 Scheduler         = The seating host (assigns tables)   │
│ 🔄 Controllers       = The managers (make sure orders      │
│                        are fulfilled)                      │
│ ☁️ Cloud Controller  = The landlord (manages the building) │
└────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│ DATA PLANE (The Actual Kitchen & Dining Room)              │
│                                                            │
│ 🖥️ Nodes       = Tables in the restaurant                  │
│ 📦 Pods        = Plates of food on the table               │
│ 🤵 kubelet     = The waiter at each table                  │
│ 🏃 kube-proxy  = The runner (routes food to tables)        │
│ 🍳 containerd  = The actual cook                           │
└────────────────────────────────────────────────────────────┘
What Really Happens When You kubectl apply
Every time you deploy something, here's the actual flow:
You: kubectl apply -f deployment.yaml
        │
        ▼
API Server: "Hold on, let me check..."
        │
        ├─ Step 1: AuthN → "Who are you?" (certificate/token)
        ├─ Step 2: AuthZ → "Can you do this?" (RBAC check)
        ├─ Step 3: Admission → "Should we allow this?"
        │          (Webhooks: Kyverno says "no latest tag!")
        ├─ Step 4: Validation → "Is this YAML even valid?"
        └─ Step 5: Write to etcd → "OK, saved."
        │
        ▼
Controller Manager: "Oh, a new Deployment! Let me create a ReplicaSet."
ReplicaSet Controller: "The ReplicaSet says 3 pods. Let me create 3 Pods."
        │
        ▼
Scheduler: "3 new Pods need homes. Node-1 has CPU.
            Node-2 has a taint. Node-3 is full. Node-4 has room.
            → Pods go to Node-1 and Node-4."
        │
        ▼
kubelet (on each node): "I got assigned pods.
            Pulling image... Starting container...
            Health check passed. Reporting ready!"
🍔 Restaurant analogy: You (the customer) tell the Maître d' (API Server) you want 3 burgers. The Maître d' writes it in the order book (etcd). The manager (Controller) tells the kitchen to make 3 burgers. The seating host (Scheduler) figures out which tables have room. The waiter (kubelet) brings the burgers to the right tables.
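The admission step in the flow above is worth seeing in miniature. Here is a toy Python sketch of the "no latest tag" rule that a Kyverno-style webhook would enforce; the `admit` function and the simplified pod-spec dict are illustrative assumptions, not actual Kyverno or Kubernetes API code:

```python
def admit(pod_spec: dict) -> tuple[bool, str]:
    """Toy admission check: reject images with no tag or the ':latest' tag.

    Simplified: ignores registries with a port (e.g. host:5000/app) and
    digest-pinned images, which a real webhook would also have to handle.
    """
    for container in pod_spec.get("containers", []):
        image = container["image"]
        # 'nginx' and 'nginx:latest' are both effectively mutable tags
        tag = image.rsplit(":", 1)[1] if ":" in image else "latest"
        if tag == "latest":
            return False, f"image '{image}' uses the mutable 'latest' tag"
    return True, "allowed"

print(admit({"containers": [{"image": "myacr.azurecr.io/app:v1.2.3"}]}))
print(admit({"containers": [{"image": "nginx:latest"}]}))
```

A real cluster runs this same shape of check server-side, before Step 5 ever writes to etcd, which is why a rejected manifest never becomes cluster state.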
🏗️ AKS Architecture: What Microsoft Manages (And What's Your Problem)
When you use AKS, there's a clear split:
Microsoft's Problem            Your Problem
(Free/SLA-backed)              (Good luck 🫡)
───────────────────            ─────────────────────
API Server                     💰 Your application code
etcd                           💰 Node pool sizing
Controller Manager             💰 Pod configurations
Scheduler                      💰 Networking choices
Control plane upgrades         💰 Your Docker images
                               💰 Secrets management
                               💰 Ingress configuration
                               💰 That one deployment
                                  with no resource limits
🚨 Real-World Disaster #1: The Node Pool That Couldn't Scale
The Error:
Events:
Warning FailedScaleUp cluster-autoscaler
pod didn't trigger scale-up: 1 max node group size reached
What Happened: The team set max nodes to 5, but Black Friday traffic needed 12. The Cluster Autoscaler wanted to add nodes but was blocked by the max limit. Pods sat in Pending state for 45 minutes.
The Fix:
# Check current autoscaler settings
az aks nodepool show -g rg-prod --cluster-name aks-prod \
-n userpool --query '{min:minCount, max:maxCount, current:count}'
# Update max nodes (always set 2-3x your expected peak)
az aks nodepool update -g rg-prod --cluster-name aks-prod \
-n userpool --max-count 20 --min-count 3
# Pro tip: Enable NAP (Node Auto-Provisioning) for fully automated scaling
az aks update -g rg-prod -n aks-prod --enable-node-autoprovision
💡 Rule of thumb: Set maxCount to 2-3x your normal peak. The Cluster Autoscaler won't scale up if it's not needed, so you only pay for what you use.
📦 The Pod Spec: Where 90% of Production Issues Live
If Kubernetes is a restaurant, the Pod spec is the recipe. Get the recipe wrong, and you serve garbage. Here's the production-ready pod spec with every field explained:
Resource Requests & Limits (THE #1 K8s Issue)
resources:
  requests:         # "I need at least this much"
    cpu: 250m       # 0.25 CPU cores (the scheduler uses this)
    memory: 256Mi   # the scheduler reserves this on the node
  limits:
    cpu: 1000m      # can burst up to 1 CPU core
    memory: 512Mi   # HARD LIMIT: exceed this = OOMKilled 💀
🚨 Real-World Disaster #2: The OOMKilled Epidemic
The Error:
$ kubectl describe pod payment-service-xyz
State: Terminated
Reason: OOMKilled
Exit Code: 137
What Happened: The Java app was configured with -Xmx512m (512MB heap) but the container memory limit was set to 512Mi. Java heap + overhead (metaspace, threads, JNI) = ~680MB. The container tries to use more than 512Mi → the kernel kills it. The pod restarts, uses 680MB again, gets killed again. CrashLoopBackOff.
Translation: The app's memory request was a lie. It asked for 512Mi but actually needed ~700Mi. Kubernetes trusted the lie, and the OOM killer delivered justice.
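The arithmetic behind this failure is simple enough to write down. A minimal sketch, with the helper name and the ~170 MiB overhead figure chosen here for illustration (real JVM overhead varies by workload):

```python
def fits_in_limit(heap_mib: int, overhead_mib: int, limit_mib: int) -> bool:
    """Rough check: JVM heap + non-heap overhead must stay under the cgroup limit."""
    return heap_mib + overhead_mib <= limit_mib

# The incident above: -Xmx512m plus ~170 MiB of metaspace/threads/JNI
print(fits_in_limit(512, 170, 512))   # over the limit: OOMKilled, exit code 137
print(fits_in_limit(512, 170, 1024))  # fits: the pod survives
```

The point: the container limit must budget for the *whole process*, not just the heap flag you passed to the JVM.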
The Fix:
resources:
  requests:
    memory: 768Mi   # Be honest about what your app needs
  limits:
    memory: 1Gi     # Give it headroom (limit = ~1.3x request for memory)
The Rule:
- CPU: limit = 2x to 4x request is fine (CPU is compressible; it just gets throttled)
- Memory: limit = 1.3x to 1.5x request MAX (memory is NOT compressible; exceed it = death)
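The rule above is mechanical enough to encode. A sketch, assuming the upper end of each multiplier; the helper name and output shape are made up for this example:

```python
def suggest_limits(cpu_request_m: int, mem_request_mi: int) -> dict:
    """Apply the rule of thumb: CPU limit up to 4x request, memory limit 1.5x request."""
    return {
        "cpu_limit_m": cpu_request_m * 4,           # CPU is compressible: throttled, not killed
        "mem_limit_mi": int(mem_request_mi * 1.5),  # memory is not: keep the multiplier small
    }

# The fixed payment-service values from above: 250m CPU, 768Mi memory
print(suggest_limits(250, 768))
```

Whatever numbers this produces are a starting point; the honest request still has to come from measuring your app under load.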
Health Probes: The Three Probe Ensemble
# 1. Startup Probe: "Has the app finished booting?"
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30   # Try 30 times
  periodSeconds: 10      # Every 10 seconds = 5 min max startup
# Without this: K8s kills slow-starting apps before they're ready!

# 2. Liveness Probe: "Is the app alive?"
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 15
  timeoutSeconds: 5
# If this fails: K8s RESTARTS the pod

# 3. Readiness Probe: "Can the app serve traffic?"
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3
# If this fails: K8s removes the pod from the Service (no traffic sent)
🚨 Real-World Disaster #3: The Probe That Killed Production
What Happened: A team set the liveness probe path to the same endpoint as their main API, /api/v1/health. During a database connection pool exhaustion, this endpoint hung for 10 seconds. The liveness timeout was 5 seconds. Kubernetes thought the pod was dead. Killed it. New pod starts, also can't connect to the DB. Killed. ALL PODS KILLED SIMULTANEOUSLY.
Result: Complete outage because K8s was trying to "help" by restarting healthy pods.
The Fix:
- Liveness probes should check local health only (can the process respond?), NOT dependency health
- Readiness probes should check dependencies (is the DB reachable?)
- Never point liveness at your main API endpoint
# GOOD: Lightweight liveness check
livenessProbe:
  httpGet:
    path: /healthz   # Returns 200 if the process is alive. That's it.
    port: 8080

# GOOD: Dependency-aware readiness check
readinessProbe:
  httpGet:
    path: /ready     # Checks DB connection, cache, etc.
    port: 8080
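On the application side, the split boils down to two very different handlers. A hypothetical Python sketch (function names and signatures are illustrative; your framework's routing would wire these to /healthz and /ready):

```python
def liveness() -> int:
    """Liveness: only 'can the process respond?'. Never touch dependencies here,
    because a failing answer makes Kubernetes RESTART the pod."""
    return 200

def readiness(db_ok: bool, cache_ok: bool) -> int:
    """Readiness: dependency-aware. A failing answer only removes the pod from
    the Service, which is exactly what you want during a DB outage."""
    return 200 if db_ok and cache_ok else 503
```

During the disaster above, a handler shaped like `readiness` would have quietly drained traffic; the dependency-checking `liveness` they actually had triggered a cluster-wide restart storm instead.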
🌐 Kubernetes Networking: The "Why Can't My Pod Talk to That Pod" Chapter
Service Types Explained (with when to use each)
ClusterIP (default)
└─ Internal only. Pod-to-pod communication.
   Use for: microservice → microservice calls
   Cost: Free

LoadBalancer
└─ Gets a real Azure Load Balancer (public or internal IP)
   Use for: non-HTTP services (gRPC, TCP, game servers)
   Cost: $18/month + data transfer PER SERVICE 😱

Ingress
└─ One LoadBalancer → routes to many services by host/path
   Use for: HTTP/HTTPS services (90% of your apps)
   Cost: One LB cost shared across all services 🎉

Gateway API (the future)
└─ Like Ingress but better: multi-tenant, L4+L7, cross-namespace
   Use for: new deployments, forward-thinking architecture
🚨 Real-World Disaster #4: The $2,400/Month LoadBalancer Bill
What Happened: Each team created individual Services with type: LoadBalancer for their apps. 12 services × ($18/month LB + data transfer) = $2,400/month just for load balancers.
The Fix: Deploy ONE NGINX Ingress Controller, route all HTTP traffic through it:
# Instead of 12 LoadBalancers, one Ingress:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: main-ingress
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: api.mycompany.com
      http:
        paths:
          - path: /payments
            pathType: Prefix
            backend:
              service:
                name: payment-service
                port:
                  number: 8080
          - path: /users
            pathType: Prefix
            backend:
              service:
                name: user-service
                port:
                  number: 8080
Cost after: One LoadBalancer = ~$18/month. Savings: $2,382/month. You're welcome.
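You can sanity-check the consolidation math yourself. A sketch using only the flat $18/month figure from above (real Azure LB pricing varies by rule count and region, and the data-transfer portion of the $2,400 bill is excluded here):

```python
LB_MONTHLY_USD = 18  # illustrative flat rate per LoadBalancer, from the figure above

def monthly_savings(num_services: int) -> int:
    """Replacing N per-service LoadBalancers with one Ingress leaves a single LB."""
    return num_services * LB_MONTHLY_USD - LB_MONTHLY_USD

print(monthly_savings(12))  # savings on the flat rate alone; data transfer made the real bill far worse
```

The flat-rate savings look modest; the real win in the incident above came from consolidating the per-service data transfer charges behind one shared entry point.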
🔐 Kubernetes Security: The Non-Negotiables
The Security Checklist Every Pod Must Pass
spec:
  serviceAccountName: my-app-sa          # Dedicated SA per app
  automountServiceAccountToken: false    # Don't mount unless needed
  securityContext:
    runAsNonRoot: true                   # Never run as root
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault               # syscall filtering
  containers:
    - name: my-app
      image: myacr.azurecr.io/app:v1.2.3@sha256:abc...   # Pin by digest!
      securityContext:
        allowPrivilegeEscalation: false  # Can't become root
        readOnlyRootFilesystem: true     # No writing to the filesystem
        capabilities:
          drop: ["ALL"]                  # Drop all Linux capabilities
🚨 Real-World Disaster #5: The Crypto Miner in Your Cluster
The Alert:
Defender for Containers: CRITICAL
"Suspicious container detected: Image contains known cryptomining software"
"Pod 'nginx-proxy-xyz' in namespace 'default' running as root with
hostNetwork: true"
What Happened: Someone deployed a "convenience" nginx image from Docker Hub (not your private ACR). The image was compromised and contained a crypto miner. Because the pod ran as root with hostNetwork: true, it could access the node's network and mine crypto using your Azure bill.
The Fix:
- Only allow images from your private ACR:
# Kyverno policy: Block images not from our ACR
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  rules:
    - name: validate-registries
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Images must come from myacr.azurecr.io"
        pattern:
          spec:
            containers:
              - image: "myacr.azurecr.io/*"
- Never run pods in the default namespace (no policies are applied there by default)
- Scan images in your CI/CD pipeline with Trivy before pushing to ACR
📈 Autoscaling: Making Kubernetes Elastic
Kubernetes has three levels of autoscaling, and you need all of them:
Level 1: HPA (Horizontal Pod Autoscaler)
└─ Adds/removes PODS based on CPU, memory, or custom metrics
   "My service is busy? Add more pod replicas!"

Level 2: KEDA (Kubernetes Event-Driven Autoscaler)
└─ Scales based on EVENTS: queue depth, HTTP requests, cron
   "There are 10,000 messages in the queue? Scale to 50 pods!"
   "It's 3 AM and the queue is empty? Scale to zero!"

Level 3: Cluster Autoscaler
└─ Adds/removes NODES when pods can't be scheduled
   "Pods are Pending because no node has capacity? Add a node!"
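Level 2 deserves a concrete shape, since it's the least familiar of the three. A sketch of a KEDA ScaledObject for an Azure Service Bus queue; the queue name, target Deployment, message-count target, and the servicebus-auth TriggerAuthentication are placeholders chosen for this example:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: payment-queue-scaler
spec:
  scaleTargetRef:
    name: payment-service     # the Deployment KEDA will scale
  minReplicaCount: 0          # scale to zero when the queue is empty
  maxReplicaCount: 50
  triggers:
    - type: azure-servicebus
      metadata:
        queueName: payments
        messageCount: "200"   # target messages per replica
      authenticationRef:
        name: servicebus-auth # TriggerAuthentication holding the connection secret
```

Note the minReplicaCount: 0 line: that's the "3 AM, queue is empty" scenario above, and it's something a plain HPA cannot do.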
🚨 Real-World Disaster #6: The Autoscaler Death Spiral
What Happened: HPA was configured to scale on CPU. Under load, pods scaled from 3 → 15. But each pod opening connections to the database caused connection pool exhaustion. The DB started returning errors. Error-handling code consumed MORE CPU (logging, retries). HPA saw more CPU → scaled to 30 pods. More DB connections → faster DB collapse. Complete meltdown.
The Fix:
- Set maxReplicas in HPA to something your DB can handle
- Use connection pooling (PgBouncer for Postgres)
- Scale on business metrics (requests/second), not raw CPU
- Add a circuit breaker between your app and the DB
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  minReplicas: 3
  maxReplicas: 15                       # Cap it! Know your DB's connection limit.
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Don't scale up too fast
      policies:
        - type: Pods
          value: 2                      # Max 2 pods per minute
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5 min before scaling down
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # Business metric, not CPU!
        target:
          type: AverageValue
          averageValue: "100"
🔁 GitOps: Your Cluster's Single Source of Truth
GitOps = Your Git repository is the single source of truth for your cluster state. No more kubectl apply from laptops. No more "who deployed that?"
Developer pushes to Git
        │
        ▼
Git Repository (the truth)
        │
        ▼
GitOps Agent (Flux / ArgoCD)
watches the repo, detects changes
        │
        ▼
Applies changes to the cluster
(reconciliation loop, every 1-5 minutes)
        │
        ▼
Cluster state matches Git ✅
🚨 Real-World Disaster #7: The Rogue kubectl
What Happened: A developer ran kubectl scale deployment payment-service --replicas=1 in production "to test something." This reduced payment processing capacity by 66%. But since there was no GitOps, nobody noticed the drift for 3 hours until load increased and the single replica started dropping requests.
With GitOps: Flux/ArgoCD would have detected the drift within minutes and automatically scaled back to 3 replicas. The desired state in Git always wins.
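The reconciliation loop at the heart of Flux and ArgoCD is conceptually tiny. A toy Python sketch of one pass (the dict shapes and action strings are invented for illustration; real agents diff full manifests, not replica counts):

```python
def reconcile(desired: dict, actual: dict) -> list[str]:
    """One pass of a GitOps-style loop: emit the actions needed to match Git."""
    actions = []
    for name, replicas in desired.items():
        if actual.get(name) != replicas:
            actions.append(f"scale {name} to {replicas}")
    return actions

# The rogue-kubectl incident: Git says 3 replicas, someone manually scaled to 1
print(reconcile({"payment-service": 3}, {"payment-service": 1}))
```

Run every few minutes, a loop like this is why manual drift simply cannot survive in a GitOps-managed cluster: the desired state in Git always wins.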
🧪 Quick Reference: The K8s Troubleshooting Flowchart
Pod not starting?
├── Status: Pending
│   ├── "Insufficient cpu/memory" → Node is full
│   │   └─ Fix: Check resource requests, scale the node pool
│   ├── "No nodes match pod topology" → Affinity/taint issue
│   │   └─ Fix: Check nodeSelector, tolerations, topology constraints
│   └── "0/3 nodes available: PersistentVolumeClaim not bound"
│       └─ Fix: Check PVC, storage class, disk availability
│
├── Status: ImagePullBackOff
│   ├── "unauthorized: authentication required" → ACR auth failed
│   │   └─ Fix: Check imagePullSecrets or AKS-ACR integration
│   └── "manifest unknown" → Image tag doesn't exist
│       └─ Fix: Check image:tag spelling, verify it exists in the registry
│
├── Status: CrashLoopBackOff
│   ├── Exit Code 137 → OOMKilled
│   │   └─ Fix: Increase the memory limit
│   ├── Exit Code 1 → App crashed on startup
│   │   └─ Fix: Check logs: kubectl logs <pod> --previous
│   └── Exit Code 0 → App exited successfully (shouldn't for a server)
│       └─ Fix: Check entrypoint/command; the app should run indefinitely
│
├── Status: Running but not Ready
│   └── Readiness probe failing
│       └─ Fix: Check probe path, port, and app dependencies
│
└── Status: Terminating (stuck)
    └── Finalizer or preStop hook issue
        └─ Fix: kubectl delete pod <name> --grace-period=0 --force
           (last resort!)
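The CrashLoopBackOff branch of the flowchart can live in a small helper for 3 AM use. A sketch; the function name and message strings are mine, but the exit-code meanings follow the standard 128+signal convention (137 = 128+SIGKILL, 143 = 128+SIGTERM):

```python
def diagnose_exit(code: int) -> str:
    """Map common container exit codes to the flowchart's likely causes."""
    if code == 137:
        return "OOMKilled (or SIGKILL): raise the memory limit"
    if code == 143:
        return "SIGTERM: graceful shutdown during eviction or rollout"
    if code == 1:
        return "app crashed on startup: kubectl logs <pod> --previous"
    if code == 0:
        return "clean exit: a server should not exit; check entrypoint/command"
    return f"exit code {code}: check application logs"

print(diagnose_exit(137))
```

Feed it the Exit Code from kubectl describe pod and you skip straight to the right branch of the tree above.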
🎯 Key Takeaways
- Resource requests/limits are the #1 cause of production K8s issues: set them honestly
- Liveness probes should check the process, not dependencies: bad probes kill healthy pods
- One Ingress Controller beats 12 LoadBalancers every time ($$$)
- Pin images by digest in production: tags are mutable and untrustworthy
- Autoscaling needs guardrails: uncapped HPA can create death spirals
- GitOps eliminates drift and rogue kubectl changes
- Never run pods as root, unless you enjoy donating CPU to crypto miners
🔥 Homework
- Run kubectl get pods --all-namespaces | grep -E "CrashLoop|Error|Pending" and fix what you find
- Check whether any pod in your cluster runs as root: kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.securityContext.runAsNonRoot}{"\n"}{end}'
- Count how many LoadBalancers your cluster has and decide whether you can consolidate them behind an Ingress
Next up in the series: *Terraform State Files: The Diary Your Infrastructure Never Wanted You to Read*, where state file corruption, locking wars, and the dreaded -target flag are decoded with real horror stories.
💬 What's your worst CrashLoopBackOff story? Share it below. There's no judgment here, only solidarity. 🫂