Jyothi Kumar

Kubernetes in Production: Deployments, Scaling, and Troubleshooting the Right Way

So you've got Kubernetes running locally. Maybe you've even deployed a few services to a staging cluster. But production is a different beast — and most tutorials stop right before things get real.

This article covers what actually matters when running Kubernetes in production: reliable deployments, smart scaling, and debugging when things go wrong (because they will).


1. Deployments: Ship Safely Every Time

Use Rolling Updates with Sensible Defaults

Kubernetes uses a rolling update strategy by default, but the default parameters (25% max surge, 25% max unavailable) aren't always production-safe. Always set these explicitly:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

maxUnavailable: 0 ensures no pod is terminated before a healthy replacement is running. This is the single most impactful change you can make to reduce deployment-related downtime.
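
It also pays to watch a rollout while it happens and to know the rollback path before you need it. A minimal sketch using standard kubectl rollout commands (my-app is an illustrative Deployment name):

# Watch the rollout until it completes or fails
kubectl rollout status deployment/my-app -n <namespace>

# Roll back to the previous revision if the new version misbehaves
kubectl rollout undo deployment/my-app -n <namespace>

# Inspect the revision history
kubectl rollout history deployment/my-app -n <namespace>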

Set Readiness and Liveness Probes

Without probes, Kubernetes assumes a pod is ready the moment it starts. That's almost never true.

readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 15
  periodSeconds: 20
  • Readiness probe: controls when traffic is sent to the pod
  • Liveness probe: restarts the pod if it's stuck or deadlocked

If you only implement one thing from this article, make it readiness probes.
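
One caveat: for slow-starting apps, a liveness probe can kill the pod before it ever finishes booting. Kubernetes also supports a startup probe for exactly this case; a minimal sketch (the thresholds are illustrative, size them to your app's worst-case boot time):

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  # Allow up to 30 * 5 = 150 seconds for startup
  failureThreshold: 30
  periodSeconds: 5

Until the startup probe succeeds, liveness and readiness checks are held off, so a slow boot doesn't turn into a restart loop.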

Always Set Resource Requests and Limits

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

Without requests, the scheduler can't make good placement decisions. Without limits, a single misbehaving pod can starve its neighbors. Both will cause you pain in production.
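
Requests and limits also determine the pod's QoS class, which decides eviction order under node pressure: BestEffort pods (nothing set) are evicted first, Guaranteed pods (requests equal to limits) last. You can check the class Kubernetes assigned (my-app-xyz is an illustrative pod name):

kubectl get pod my-app-xyz -n <namespace> -o jsonpath='{.status.qosClass}'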


2. Scaling: Handle Traffic Without Drama

Horizontal Pod Autoscaler (HPA)

HPA scales your pods based on CPU, memory, or custom metrics.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60

A few rules of thumb:

  • Never set minReplicas: 1 for production workloads — you lose high availability
  • Target 60–70% CPU utilization, not 80%+. You want headroom before the next scale event kicks in
  • Give HPA time to stabilize — avoid tuning it based on a single traffic spike (see the behavior sketch after this list)
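
The autoscaling/v2 API lets you encode that stabilization directly in the HPA via the behavior field. A minimal sketch, added under the spec of the HPA above (the window and rate are illustrative starting points, not universal recommendations):

  behavior:
    scaleDown:
      # Wait for 5 minutes of sustained low load before removing pods
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60

This waits five minutes before acting on a lower metric value and then removes at most half the current replicas per minute.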

Cluster Autoscaler

HPA scales pods; Cluster Autoscaler scales nodes. Use both together.

When HPA adds pods and there's no room on existing nodes, Cluster Autoscaler provisions new nodes automatically. When load drops, it removes underutilized nodes to cut costs.

Key config tip: tune --scale-down-utilization-threshold (the default is 0.5). Lowering it makes the autoscaler less eager to remove nodes, which helps avoid aggressive scale-downs that can disrupt workloads.
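
If certain pods should never be displaced by a scale-down (for example, pods holding local state), the standard Cluster Autoscaler honors an opt-out annotation; a minimal sketch on the pod template:

metadata:
  annotations:
    # Cluster Autoscaler will not remove a node running this pod
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

Use it sparingly, since every annotated pod pins its node and works against the cost savings.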

Pod Disruption Budgets (PDBs)

PDBs protect your app during node maintenance or autoscaling events:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app

This tells Kubernetes: "Never voluntarily evict a pod if that would leave fewer than 2 running." Without a PDB, rolling node upgrades can silently take down your entire service.
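
You can confirm the budget is live and see how much disruption headroom exists right now; assuming the PDB above is applied:

# ALLOWED DISRUPTIONS shows how many pods may be voluntarily evicted
kubectl get pdb my-app-pdb -n <namespace>

If ALLOWED DISRUPTIONS reads 0, drains and upgrades will block until more replicas become ready.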


3. Troubleshooting: Debug Like a Pro

Here's a systematic approach when something breaks in production.

Step 1 — Check Pod Status

kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>

Look at the Events section at the bottom of describe output first. It tells you exactly what Kubernetes tried to do and where it failed.

Common states and what they mean:

Status           | Likely Cause
-----------------|-------------
CrashLoopBackOff | App is crashing on startup — check logs
Pending          | No node can schedule the pod — check resource requests or taints
OOMKilled        | Memory limit too low — increase limits or fix a memory leak
ImagePullBackOff | Wrong image name/tag or missing registry credentials
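
OOMKilled is recorded in the container's last terminated state, so even if the pod has already restarted you can still recover it (the jsonpath assumes a single-container pod):

# Prints "OOMKilled" if the last container exit was an OOM kill
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'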

Step 2 — Read the Logs

# Current logs
kubectl logs <pod-name> -n <namespace>

# Previous container instance (if crashing)
kubectl logs <pod-name> -n <namespace> --previous

# Follow live logs
kubectl logs -f <pod-name> -n <namespace>

The --previous flag is critical for CrashLoopBackOff — it shows you logs from the crashed container, not the restarted one.
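
A few more standard kubectl logs flags that help in practice:

# Logs from one container in a multi-container pod
kubectl logs <pod-name> -c <container-name> -n <namespace>

# Limit the output: last 100 lines from the past hour
kubectl logs <pod-name> -n <namespace> --tail=100 --since=1h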

Step 3 — Exec Into the Pod

When logs aren't enough:

kubectl exec -it <pod-name> -n <namespace> -- /bin/sh

From inside the pod you can test DNS resolution, check environment variables, curl internal services, and verify file mounts — all in the actual runtime environment.
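
A few checks worth running from inside, assuming the image ships a shell and basic tools (minimal and distroless images often don't):

# Verify the environment variables the app actually sees
env | sort

# Test cluster DNS (my-service is an illustrative Service name)
nslookup my-service

# Hit an internal service from the pod's own network namespace
wget -qO- http://my-service

If the image has no shell at all, kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name> attaches an ephemeral debug container with its own tooling alongside the app.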

Step 4 — Check Events Cluster-Wide

kubectl get events -n <namespace> --sort-by='.lastTimestamp'

This is often overlooked but invaluable. Node pressure, failed mounts, scheduler failures — all show up here.
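
Filtering to warnings across every namespace cuts through the noise; these are standard kubectl flags:

kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp'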

Step 5 — Inspect Resource Pressure

kubectl top nodes
kubectl top pods -n <namespace>

If nodes are under memory or CPU pressure, they'll start evicting pods. This can look like random pod restarts when the real problem is a noisy neighbor.
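
Note that kubectl top depends on metrics-server being installed in the cluster. To dig into a specific hot node, describe it:

# Look for MemoryPressure/DiskPressure under Conditions and the
# Allocated resources section near the bottom of the output
kubectl describe node <node-name>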


Quick Reference Checklist

Before any production deployment, verify:

  • [ ] Readiness and liveness probes are configured
  • [ ] Resource requests and limits are set
  • [ ] maxUnavailable: 0 in rolling update strategy
  • [ ] HPA is configured with minReplicas >= 2
  • [ ] Pod Disruption Budget exists for critical services
  • [ ] Image tags are pinned (never use :latest in production)

Final Thought

Most Kubernetes outages aren't caused by Kubernetes itself — they're caused by missing probes, absent resource limits, or no disruption budgets. The cluster is doing exactly what it's configured to do. Production-readiness is about closing those gaps before traffic finds them for you.

Got questions or war stories from your own clusters? Drop them in the comments.
