It's not Kubernetes being difficult. It's the assumptions your container was making that Docker quietly satisfied — and Kubernetes doesn't.
You've been here before.
The container runs perfectly on your laptop. docker run works. The app responds. Logs look clean. You push it to your managed Kubernetes cluster — EKS, GKE, AKS, take your pick — and something breaks. The pod crashes with no useful logs. Or it starts, passes health checks, and returns wrong responses. Or it works fine in staging and silently fails in production despite identical manifests.
This isn't bad luck. It's a specific and repeatable class of problem: your container was built with implicit assumptions about its runtime environment, and Docker satisfies those assumptions automatically while Kubernetes does not.
Docker on your laptop is a generous host. It passes through your shell environment, runs containers as your user by default, shares your network namespace, and gives containers as much memory and CPU as they ask for. Kubernetes is a strict host. It enforces isolation, applies resource constraints, manages networking through its own abstraction layer, and runs containers in a security context that may differ significantly from what you tested locally.
Every mismatch between those two environments is a potential failure. Here are the ones I've personally hit — and exactly how to close each gap.
Failure 1: Environment Variables and Secrets That Exist Locally But Not in the Cluster
This is the most common failure and the hardest to diagnose because the error it produces is almost never "environment variable missing." It's usually a downstream failure — a database connection refused, an API call returning 401, a feature that behaves as if it's in the wrong mode.
Locally, your container inherits environment variables from your shell, your .env file, your docker-compose.yml. You've set these up once and forgotten about them. In Kubernetes, none of that exists. The pod gets exactly what you put in the manifest — nothing more.
The failure pattern I've seen most in EKS environments: an application that uses the AWS SDK works locally because the developer's machine has IAM credentials in ~/.aws/credentials. In EKS, those credentials don't exist — the pod needs an IAM role attached via a service account. The app starts, the pod is Running, health checks pass, and every AWS API call silently fails or returns permission errors that look like application bugs.
What catches this:
Always run an environment audit before moving to Kubernetes. Start the container locally with a completely clean environment — no .env file, no inherited shell variables:
# Strip your local environment entirely
docker run --env-file /dev/null myapp:latest
# Or explicitly pass only what Kubernetes will provide
docker run \
  -e DB_HOST=localhost \
  -e APP_ENV=production \
  myapp:latest
If it breaks locally with a clean environment, it will break in Kubernetes. Fix it before it gets there.
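On the Kubernetes side, the same discipline means declaring every variable the app needs directly in the manifest. A minimal sketch, assuming the non-secret values live in a ConfigMap named app-config (a name invented here for illustration):

# Deployment container excerpt: the pod gets only what is declared here
containers:
  - name: myapp
    image: myapp:latest
    envFrom:
      - configMapRef:
          name: app-config    # hypothetical ConfigMap holding non-secret config
    env:
      - name: APP_ENV
        value: "production"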
For secrets in managed clusters, use the platform's native secret injection — AWS Secrets Manager with External Secrets Operator on EKS, GCP Secret Manager on GKE — rather than baking secrets into ConfigMaps or manifests:
# External Secrets Operator pattern for EKS
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: app-secrets
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: prod/myapp/db
        property: password
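The operator materializes an ordinary Kubernetes Secret under the target name (app-secrets above); the container still has to reference it explicitly, for example:

# Container excerpt: pull DB_PASSWORD from the Secret the operator created
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: app-secrets
        key: DB_PASSWORD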
For IAM authentication specifically on EKS, use IRSA (IAM Roles for Service Accounts) — not instance profiles, not hardcoded credentials:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: myapp-sa
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT_ID:role/myapp-role
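The role annotation only takes effect for pods that actually run under this service account, so the Deployment's pod template has to reference it:

# Deployment pod template excerpt
spec:
  serviceAccountName: myapp-sa   # omit this and the pod falls back to the namespace default SA
  containers:
    - name: myapp
      image: myapp:latest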
Failure 2: Resource Limits Causing OOMKill and CPU Throttling
This one presents as the most confusing failure because the symptoms look like application bugs, not infrastructure problems.
OOMKill: the pod runs for a few minutes, then disappears. No error in application logs because the process was killed before it could write one. kubectl describe pod shows OOMKilled in the last state — but only if you look at the right time, because that state rotates out of describe output after the pod restarts. Miss the window and you're debugging a ghost.
CPU throttling: the pod runs, the application responds, but it's slow. Intermittently slow in ways that don't correlate with traffic. This is the cgroup CPU quota applying — your container is being throttled because it requested 200m CPU, hit a burst, and the kernel is enforcing the limit. Locally, docker run with no resource flags gives the container your full machine's CPU. In Kubernetes with limits set, the container gets exactly what you asked for — which may be far less than it needs under load.
What catches this:
Never set resource limits in Kubernetes without first understanding your container's actual consumption profile. Run it under realistic load and measure:
# Watch resource consumption in real time
kubectl top pod myapp-pod --containers
# Get historical metrics if you have metrics-server
kubectl top pods -l app=myapp --sort-by=memory
Set requests and limits based on observed data, not guesses:
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    # Consider not setting CPU limits — only requests
    # CPU limits cause throttling; CPU requests cause scheduling
A pattern worth adopting in production: set memory limits (OOMKill is preferable to a node going down) but be conservative with CPU limits. CPU throttling degrades performance silently; it doesn't crash the pod, so it's far harder to detect. Use CPU requests for scheduling, and monitor actual CPU usage separately.
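One way to see throttling directly is the container's cgroup counters: a nonzero and growing nr_throttled means the quota is being enforced. Treat this as a sketch — it assumes the image has a shell, and the file path depends on whether the node uses cgroup v1 or v2:

# cgroup v2: look at nr_throttled and throttled_usec
kubectl exec -it myapp-pod -- cat /sys/fs/cgroup/cpu.stat

# cgroup v1 equivalent
kubectl exec -it myapp-pod -- cat /sys/fs/cgroup/cpu/cpu.stat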
For OOMKill diagnosis, always check the pod's last state immediately after a crash:
kubectl describe pod myapp-pod | grep -A 10 "Last State"
# Look for: Reason: OOMKilled
Failure 3: Networking and Service Discovery Failures
Locally, your microservices talk to each other via localhost or hostnames defined in docker-compose. In Kubernetes, localhost refers to the pod itself — not other services. Service discovery works through DNS, and that DNS only resolves correctly if your service names, namespaces, and selectors are configured precisely.
The failure I've hit most: an application configured to connect to localhost:5432 for its database — perfectly valid in a Docker Compose setup where the database is a sidecar. In Kubernetes, that connection attempt hits the pod's own loopback interface and fails immediately. The error looks like a database connection failure, not a networking misconfiguration.
The staging-to-production variant: services work in staging because everything is in the default namespace and short DNS names resolve. In production with multiple namespaces, myservice doesn't resolve — myservice.production.svc.cluster.local does. The same manifest, different namespace, different DNS behavior.
What catches this:
Replace all localhost service references with Kubernetes DNS names before deploying. The full DNS format is:
<service-name>.<namespace>.svc.cluster.local
For services in the same namespace, the short name works:
env:
  - name: DB_HOST
    value: "postgres-service"                                      # same namespace
  - name: AUTH_SERVICE_URL
    value: "http://auth-service.auth-namespace.svc.cluster.local"  # cross-namespace
Debug DNS resolution from inside the pod — not from your laptop:
# Exec into the pod and test DNS directly
kubectl exec -it myapp-pod -- nslookup postgres-service
kubectl exec -it myapp-pod -- curl -v http://postgres-service:5432
# If nslookup fails, check CoreDNS
kubectl logs -n kube-system -l k8s-app=kube-dns
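If DNS resolves but connections still fail, check that the Service actually has endpoints — that is, that its selector matches running pods:

# An empty ENDPOINTS column means the Service selector matches no pods
kubectl get endpoints postgres-service
kubectl describe service postgres-service | grep -i selector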
Network policies are the other common gotcha in production managed clusters. Hardened EKS and GKE configurations often apply default-deny network policies, so a service that communicates freely in staging can be silently blocked in production:
# Explicit ingress policy — don't rely on default-allow
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-myapp-ingress
spec:
  podSelector:
    matchLabels:
      app: myapp
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - port: 8080
Failure 4: Readiness and Liveness Probes Misconfigured
This failure is subtle because it's the Kubernetes layer doing exactly what you told it to do — you just told it the wrong thing.
A liveness probe that's too aggressive will kill a pod that's healthy but slow to start — especially JVM applications, Python apps loading large models, or anything with a meaningful initialization phase. The pod starts, Kubernetes probes it at second 10, gets no response because the app isn't ready yet, and kills it. CrashLoopBackOff. The app never had a chance to run.
A readiness probe that's too lenient — or missing entirely — sends traffic to pods that aren't ready. The service shows endpoints, requests route to the new pod, and users get errors during the rollout window.
Locally, neither of these exists. Docker runs your container and leaves it alone.
What catches this:
Configure initialDelaySeconds generously on liveness probes — always longer than your slowest observed startup time:
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # give the app time to start
  periodSeconds: 10
  failureThreshold: 3
  timeoutSeconds: 5

readinessProbe:
  httpGet:
    path: /ready            # separate endpoint from liveness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
Use separate endpoints for liveness and readiness. /healthz for liveness should return 200 as long as the process is alive and not deadlocked. /ready for readiness should verify the application can actually serve traffic — database connected, cache warm, dependencies reachable.
Failure 5: File Permissions and Volume Mount Issues
Locally, your Docker container typically runs as root or as your user — whichever the Dockerfile specifies, with no external enforcement. In managed Kubernetes clusters, particularly on GKE Autopilot and hardened EKS configurations, pods run with runAsNonRoot: true enforced at the namespace or cluster level. If your container expects to write to /app/logs or /tmp/cache as root, it silently fails or crashes with a permission error that's easy to misread.
Volume mounts compound this. A hostPath volume that works in a local Docker setup doesn't exist in a managed cluster. An emptyDir volume mounted at /app/data will be owned by root unless you explicitly set fsGroup — meaning a container running as a non-root user can't write to it.
What catches this:
Always set an explicit security context and test against it locally:
# Pod-level security context
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  runAsGroup: 1000
  fsGroup: 1000                    # ensures volume mounts are group-writable
containers:
  - name: myapp
    securityContext:
      readOnlyRootFilesystem: true # container-level field; forces explicit volume declarations
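With readOnlyRootFilesystem in place, every path the application writes to needs an explicit volume. A minimal sketch for the /app/data case from above, assuming an emptyDir is sufficient:

# Pod spec excerpt: fsGroup above makes these mounts writable for UID 1000
containers:
  - name: myapp
    volumeMounts:
      - name: data
        mountPath: /app/data
      - name: tmp
        mountPath: /tmp            # most apps still need a writable /tmp
volumes:
  - name: data
    emptyDir: {}
  - name: tmp
    emptyDir: {}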
And in your Dockerfile, match the user:
RUN addgroup --system appgroup && adduser --system --ingroup appgroup appuser
RUN chown -R appuser:appgroup /app
USER appuser
Test this locally before pushing to the cluster:
docker run --user 1000:1000 --read-only myapp:latest
If it fails locally with these constraints, it will fail in Kubernetes. Fix the permissions at the image level, not with cluster-level workarounds.
The Underlying Pattern
Every failure above follows the same structure: Docker locally is permissive by default, Kubernetes in production is restrictive by design.
This isn't a Kubernetes flaw. Isolation, resource enforcement, and security contexts exist for good reasons in multi-tenant managed clusters. The problem is that the permissive local environment creates invisible dependencies — on inherited environment variables, on unrestricted resources, on root file access — that your container never had to explicitly declare.
The fix isn't to make Kubernetes more permissive. It's to make your container honest about what it needs.
Build containers that declare their requirements explicitly: environment variables, resource requests, security context, health check endpoints, DNS-based service addressing. Test them under production-like constraints before they reach the cluster. When a container works locally and fails in Kubernetes, the question isn't "what's wrong with Kubernetes" — it's "what assumption was my container making that I didn't know about."
Kubernetes just makes those assumptions visible. Usually at the worst possible time.
Quick Reference: The Local-to-Kubernetes Readiness Checklist
Before promoting any container from local Docker to a managed Kubernetes cluster:
- Environment audit — run locally with clean environment, no inherited shell variables
- IAM/credentials — no local credential files; use IRSA or Workload Identity
- Resource profiling — measure actual CPU and memory under load before setting limits
- DNS references — replace all localhost references with Kubernetes service DNS names
- Probe configuration — separate liveness/readiness endpoints, generous initialDelaySeconds
- Security context — test with runAsNonRoot: true and readOnlyRootFilesystem: true locally
- Volume permissions — set fsGroup on all writable volume mounts
What's the most confusing Docker-to-Kubernetes failure you've debugged? Drop it in the comments — the weirder the better.