DEV Community: Sumit Gautam

The Cloud Cost Spike Nobody Warned Me About

Sumit Gautam — Thu, 21 May 2026 06:07:53 +0000

I've discovered cloud cost problems every possible way. Here's what I learned each time.

I've been on the wrong end of an unexpected AWS bill more than once. And I've discovered those problems every possible way the industry offers.

A billing alert firing at 11pm on a Friday. A client call on a Monday morning where the first words were "why did our AWS bill double?" A routine Cost Explorer review that started as a 10-minute check and turned into a two-hour investigation. And yes, a month-end invoice that was simply higher than it should have been, with no prior warning because nobody had set one.

Each time, the root cause wasn't a bug. It wasn't a misconfiguration in any obvious sense. It was the natural output of infrastructure built by engineers — including me — who understood how AWS services work but hadn't fully internalized how AWS billing works.

Those are not the same thing. And the gap between them is where real money disappears.

This article is about that gap — the specific AWS cost patterns that look like correct architecture until you see the bill, and what I put in place after each incident to make sure it didn't happen the same way twice.

Cost Driver 1: NAT Gateway Data Transfer Charges

This is the one that surprises almost everyone the first time.

NAT Gateway pricing has two components that AWS documents clearly and engineers consistently underestimate in practice. The first is the hourly charge for the gateway existing — roughly $0.045/hour per gateway, about $32/month. Noticeable but expected.

The second is the data processing charge — $0.045 per GB of data that passes through the gateway in either direction. This is the one that generates real bills.

The scenario I hit: a Kubernetes cluster on EKS with pods in private subnets pulling container images from ECR, sending logs to CloudWatch, and making API calls to various AWS services — all routed through a NAT Gateway. A moderately active cluster processing a few hundred GB of data per day generates NAT Gateway charges that dwarf the EC2 costs underneath it.

The architecture is correct. Private subnets with NAT Gateway is the right pattern for production workloads. The billing implication just wasn't modeled.

What fixes this:

For traffic between your resources and AWS services specifically, use VPC Endpoints instead of routing through NAT Gateway. VPC Endpoints keep traffic on the AWS private network — no NAT Gateway processing charge, lower latency, and often better security posture:

# Create a VPC Endpoint for S3 (Gateway type — free)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxxxxxx \
  --service-name com.amazonaws.ap-south-1.s3 \
  --route-table-ids rtb-xxxxxxxx \
  --vpc-endpoint-type Gateway

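# Create Interface Endpoint for ECR (replaces NAT for image pulls)
# Image pulls also need the com.amazonaws.ap-south-1.ecr.api endpoint and the S3 Gateway Endpoint above, because image layers are served from S3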
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-xxxxxxxx \
  --service-name com.amazonaws.ap-south-1.ecr.dkr \
  --vpc-endpoint-type Interface \
  --subnet-ids subnet-xxxxxxxx \
  --security-group-ids sg-xxxxxxxx

For S3 and DynamoDB, Gateway Endpoints are free. For ECR, CloudWatch, Secrets Manager, and other services, Interface Endpoints carry an hourly charge per Availability Zone plus their own per-GB processing charge, but for high-volume workloads they're almost always cheaper than the equivalent NAT Gateway processing charges.

Model this before you build. The break-even point is lower than you expect.
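To make "model this before you build" concrete, here is a minimal sketch of the break-even arithmetic. The NAT Gateway rates are the ones quoted above; the Interface Endpoint rates (`ENDPOINT_HOURLY`, `ENDPOINT_PER_GB`) are assumptions for illustration, so check the pricing page for your region before relying on the numbers.

```python
# Rough monthly comparison: NAT Gateway processing vs. one Interface Endpoint.
HOURS_PER_MONTH = 730

NAT_PER_GB = 0.045        # USD per GB processed (rate quoted in the article)
ENDPOINT_HOURLY = 0.01    # USD per hour per AZ (assumed, varies by region)
ENDPOINT_PER_GB = 0.01    # USD per GB processed (assumed, varies by region)


def nat_processing_cost(gb_per_month: float) -> float:
    """NAT data processing charge only; the hourly charge is paid either way."""
    return gb_per_month * NAT_PER_GB


def endpoint_cost(gb_per_month: float, az_count: int = 1) -> float:
    """Hourly charge for the endpoint in each AZ plus its per-GB charge."""
    fixed = ENDPOINT_HOURLY * HOURS_PER_MONTH * az_count
    return fixed + gb_per_month * ENDPOINT_PER_GB


def break_even_gb(az_count: int = 1) -> float:
    """Monthly volume above which the endpoint beats NAT processing."""
    fixed = ENDPOINT_HOURLY * HOURS_PER_MONTH * az_count
    return fixed / (NAT_PER_GB - ENDPOINT_PER_GB)


monthly_gb = 300 * 30  # "a few hundred GB per day", taken here as 300
print(round(nat_processing_cost(monthly_gb), 2))
print(round(endpoint_cost(monthly_gb, az_count=3), 2))
print(round(break_even_gb(az_count=3), 1))
```

Under these assumed rates, the break-even volume for a three-AZ endpoint is a few hundred GB per month, which a busy cluster can exceed in a day or two.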

Cost Driver 2: Forgotten and Idle Resources

This one is less glamorous than NAT Gateway math but responsible for more wasted spend across more accounts than anything else on this list.

The pattern is consistent: resources get created for a purpose, the purpose ends or changes, the resources remain. Nobody deletes them because nobody owns the cleanup. In a team environment, this compounds — everyone assumes someone else deprovisioned the staging environment from last quarter.

What I found in a Cost Explorer review of a client account:

Unattached EBS volumes from terminated EC2 instances: additional (non-root) volumes persist after instance termination by default unless you explicitly set DeleteOnTermination
Outdated RDS snapshots: automated snapshots expire with the retention window, but manual snapshots never expire on their own and pile up if nobody cleans them up
Idle NAT Gateways in regions where workloads had been decommissioned — $32/month each, several of them, months after the workloads they served were gone
Old AMIs and their associated snapshots — AMIs are easy to create, easy to forget, and each one holds snapshot storage charges indefinitely

None of these are large individually. Together, across an account that had been running for two years without systematic cleanup, they were meaningful.

What fixes this:

Build a cleanup policy into your infrastructure practice, not your quarterly review calendar. At minimum:

# Find unattached EBS volumes
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}' \
  --output table

# Find snapshots older than 90 days (GNU date; on macOS use: date -u -v-90d +%Y-%m-%d)
CUTOFF=$(date -u -d '90 days ago' +%Y-%m-%d)
aws ec2 describe-snapshots \
  --owner-ids self \
  --query "Snapshots[?StartTime<='${CUTOFF}'].{ID:SnapshotId,Size:VolumeSize,Date:StartTime}" \
  --output table

# List active NAT Gateways (cross-check each against your route tables and its CloudWatch BytesOutToDestination metric to spot idle ones)
aws ec2 describe-nat-gateways \
  --filter Name=state,Values=available \
  --query 'NatGateways[*].{ID:NatGatewayId,VPC:VpcId,Created:CreateTime}' \
  --output table

For ongoing governance, enable AWS Config with rules for unattached volumes and idle resources, and use AWS Cost Anomaly Detection — it catches spend pattern changes faster than static billing alerts:

# Create a cost anomaly monitor that tracks spend per AWS service (all services, not only EC2)
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "ServiceSpendMonitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

Tag everything at creation with an owner and a project. Resources without tags in a quarterly audit are candidates for deletion. Make this a policy, not a suggestion.
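A tag audit is easy to automate once resources are listed. The sketch below assumes a policy of two required keys (`Owner` and `Project`, following the advice above) and a hypothetical sample list shaped like the `Tags` field that AWS describe calls return; the `Id` key is an illustrative simplification.

```python
REQUIRED_TAGS = {"Owner", "Project"}  # assumed policy


def missing_tags(resource: dict) -> set:
    """Required tag keys that are absent from a resource's Tags list."""
    present = {tag["Key"] for tag in resource.get("Tags", [])}
    return REQUIRED_TAGS - present


def untagged(resources: list) -> list:
    """IDs of resources missing at least one required tag."""
    return [r["Id"] for r in resources if missing_tags(r)]


# Hypothetical sample data
sample = [
    {"Id": "vol-aaa", "Tags": [{"Key": "Owner", "Value": "team-a"},
                               {"Key": "Project", "Value": "web"}]},
    {"Id": "vol-bbb", "Tags": [{"Key": "Owner", "Value": "team-a"}]},
    {"Id": "vol-ccc"},
]
print(untagged(sample))  # ['vol-bbb', 'vol-ccc']
```

Anything this returns in a scheduled audit is a candidate for a follow-up with the team, and eventually for deletion.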

Cost Driver 3: Data Transfer Between Availability Zones

This is the most invisible cost driver on the list because it requires no misconfiguration and no forgotten resources. It's the direct result of building the high-availability architecture AWS recommends.

AWS charges $0.01 per GB for data transferred between Availability Zones within the same region, and it bills both the sending and the receiving side, so each gigabyte that crosses an AZ boundary effectively costs $0.02. This sounds trivial until you map it against what actually moves between AZs in a real distributed system.

The scenario: a three-tier application deployed across three AZs for availability. Application servers in AZ-A making database calls to RDS in AZ-B. A caching layer in AZ-C that application servers across all three AZs read from. A Kubernetes cluster where pods are scheduled across AZs without affinity rules, meaning a pod in AZ-A routinely calls a service pod in AZ-C. Every one of these cross-AZ calls — database queries, cache reads, inter-service calls — generates data transfer charges.

At low volume, this is background noise. At production scale, cross-AZ transfer costs can match or exceed your compute costs for data-intensive workloads.
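A quick way to see when this stops being background noise is to put a daily volume into the arithmetic. This sketch assumes the $0.01 per GB rate quoted above, charged in each direction; the 500 GB/day figure is purely illustrative.

```python
CROSS_AZ_PER_GB_EACH_WAY = 0.01  # USD, rate quoted in the article


def cross_az_monthly_cost(gb_per_day: float, days: int = 30) -> float:
    """Each GB crossing an AZ boundary is charged once out and once in."""
    return gb_per_day * days * CROSS_AZ_PER_GB_EACH_WAY * 2


# Illustrative volume: 500 GB/day of database, cache and service traffic
print(round(cross_az_monthly_cost(500), 2))
```

Plugging in your own per-day transfer figure from Cost Explorer (usage type containing "DataTransfer-Regional-Bytes") gives a first estimate of what AZ-aware routing could save.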

What fixes this:

The goal is AZ-aware traffic routing — keeping traffic within the same AZ wherever availability requirements permit:

# Kubernetes topology-aware routing
# Prefer pods in the same AZ before routing cross-zone
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    service.kubernetes.io/topology-mode: "Auto"
spec:
  selector:
    app: my-service
  ports:
    - port: 80

For EKS specifically, enable Topology Aware Routing and configure pod affinity rules to co-locate services that communicate frequently:

affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - dependent-service
          topologyKey: topology.kubernetes.io/zone

For RDS, use RDS Proxy in the same AZ as your compute where possible, and be deliberate about which AZ your primary instance sits in relative to your application tier.

Cost Driver 4: S3 Storage and Request Costs

S3 feels cheap because the storage rate is low — $0.023 per GB per month for Standard storage. The request costs are what accumulate unexpectedly.

S3 charges per API request: $0.0004 per 1,000 GET requests, $0.005 per 1,000 PUT/COPY/POST/LIST requests. These numbers are small. Multiplied by millions of requests per day from an application that wasn't designed with S3 request patterns in mind, they add up.

The patterns I've seen generate unexpected S3 costs:

Application code calling ListObjects in a loop instead of paginating correctly — each List call counts as a request, and tight loops can generate thousands per minute
Small file uploads — many small PUTs cost more in request charges than fewer large ones, relevant for logging pipelines that write per-event rather than batching
S3 access logs enabled and writing to the same bucket — access logs generate their own requests, which generate more access logs, compounding the request count
Lifecycle policies absent — objects in Standard storage that should have transitioned to Infrequent Access or Glacier months ago
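The small-file pattern in the list above is the easiest one to quantify. The sketch below uses the PUT rate quoted earlier and compares writing one object per event with batching events into larger objects; the event volume and batch size are illustrative assumptions.

```python
PUT_PER_1000 = 0.005  # USD per 1,000 PUT requests (rate quoted in the article)


def put_request_cost(events_per_day: int, events_per_object: int = 1,
                     days: int = 30) -> float:
    """Monthly PUT charge when events are grouped into objects before upload."""
    objects_per_day = -(-events_per_day // events_per_object)  # ceiling division
    return objects_per_day * days * PUT_PER_1000 / 1000


# Illustrative logging pipeline: 50 million events per day
print(round(put_request_cost(50_000_000), 2))                          # one PUT per event
print(round(put_request_cost(50_000_000, events_per_object=1000), 2))  # batched
```

The storage bill is identical in both cases; only the request count changes, which is why this cost tends to stay hidden until someone breaks the bill down by usage type.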

What fixes this:

Enable S3 Storage Lens at the account level — it gives you per-bucket visibility into request patterns, storage class distribution, and cost drivers without requiring manual investigation:

# Enable S3 Storage Lens default dashboard
aws s3control put-storage-lens-configuration \
  --account-id YOUR_ACCOUNT_ID \
  --config-id default \
  --storage-lens-configuration '{
    "Id": "default",
    "IsEnabled": true,
    "AccountLevel": {
      "BucketLevel": {}
    }
  }'

Add lifecycle policies to every bucket at creation — treat it as a default, not an optimization:

{
  "Rules": [
    {
      "Id": "transition-to-ia",
      "Filter": { "Prefix": "" },
      "Status": "Enabled",
      "Transitions": [
        {
          "Days": 30,
          "StorageClass": "STANDARD_IA"
        },
        {
          "Days": 90,
          "StorageClass": "GLACIER_IR"
        }
      ]
    }
  ]
}

Cost Driver 5: Oversized Instances Running 24/7

This is the simplest cost driver and the one with the most straightforward fix — which is why it's last. Simple doesn't mean small.

The pattern: instances sized for peak load running continuously at 10-20% utilization. Development and staging environments sized to match production. Instances that were right-sized six months ago for a workload that has since shrunk.

On a client engagement I reviewed Cost Explorer and found several m5.2xlarge instances — $0.384/hour, about $276/month each — running continuously at consistently low CPU and memory utilization. They had been provisioned for a load test, the load test had concluded, and the instances had continued running because nobody had a process for decommissioning them after the test.

What fixes this:

Enable AWS Compute Optimizer — it analyzes CloudWatch metrics and produces specific right-sizing recommendations with projected savings:

# Get EC2 right-sizing recommendations
aws compute-optimizer get-ec2-instance-recommendations \
  --query 'instanceRecommendations[*].{
    Instance:instanceArn,
    Finding:finding,
    RecommendedType:recommendationOptions[0].instanceType,
    MonthlySavings:recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value
  }' \
  --output table

For non-production environments, implement instance scheduling — stop instances outside working hours. An instance running 8 hours a day instead of 24 costs 67% less:

# AWS Instance Scheduler via CloudFormation (or use Lambda)
# Simple approach: tag-based stop/start with EventBridge

# Tag instances for scheduling
aws ec2 create-tags \
  --resources i-xxxxxxxxx \
  --tags Key=Schedule,Value=office-hours

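# EventBridge rule that fires at 7pm IST (13:30 UTC) on weekdays
# The rule alone stops nothing: attach a target with `aws events put-targets`, for example a Lambda function that stops instances tagged Schedule=office-hours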
aws events put-rule \
  --name StopDevInstances \
  --schedule-expression "cron(30 13 ? * MON-FRI *)" \
  --state ENABLED
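The 67% figure comes from simple arithmetic on running hours, and the same calculation shows how much more a weekday-only schedule saves. A minimal sketch, assuming on-demand pricing where a stopped instance incurs no compute charge (attached EBS volumes are still billed):

```python
def schedule_savings(hours_per_day: float, days_per_week: int = 7) -> float:
    """Fraction of always-on compute cost saved by running only on schedule."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / (24 * 7)


print(round(schedule_savings(8), 2))                    # 8 hours, every day
print(round(schedule_savings(8, days_per_week=5), 2))   # 8 hours, weekdays only
```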

What Every Discovery Method Taught Me

Every way I've found an AWS cost problem taught me something different.

The billing alert that fired at 11pm taught me to set thresholds before I think I need them — at 50%, 80%, and 100% of expected spend, not just at the number that feels alarming.

The client call on a Monday morning taught me that cost problems in team environments are invisible until they're someone else's problem to escalate. Shared accounts need shared visibility — Cost Explorer access for the whole team, not just the billing owner.

The routine review that turned into two hours taught me that Cost Explorer by service, checked weekly rather than monthly, surfaces anomalies while they're small. By month end, the pattern has been running for weeks.

The surprise invoice taught me the most: the absence of an alert is not the same as the absence of a problem. An unmonitored account is a guarantee of eventual surprise.

The actual lesson across all of them is the same: AWS billing is an observability problem. The same discipline you apply to application monitoring — alerts, anomaly detection, dashboards, regular review — applies to your cloud spend. Without it, cost issues are invisible until they're on an invoice.

The AWS services that generate surprising costs are almost always working exactly as documented. The surprise comes from not modeling the billing implications before the architecture is built, and not monitoring spend with the same rigor as uptime.

Model the billing first. Monitor it like production. Build the architecture second.

Quick Reference: The AWS Cost Governance Checklist

VPC Endpoints for S3, ECR, CloudWatch, Secrets Manager — eliminate NAT Gateway processing for AWS service traffic
Billing alerts at 50%, 80%, 100% of monthly budget threshold
Cost Anomaly Detection enabled at account level
AWS Config rules for unattached EBS volumes and idle resources
Topology Aware Routing on EKS to minimize cross-AZ data transfer
S3 lifecycle policies on every bucket at creation
Compute Optimizer enabled — review recommendations monthly
Instance scheduling for all non-production environments
Mandatory tagging policy — Owner, Project, Environment on every resource

Have you been hit by an unexpected AWS bill? I'd genuinely like to know which service surprised you most — drop it in the comments.

Every DevOps engineer has hit this. Works in Docker, breaks in Kubernetes — no clear error, no obvious reason. Here are the 5 assumptions your container is silently making that Kubernetes won't tolerate.

Sumit Gautam — Mon, 04 May 2026 03:41:16 +0000

Why Your Docker Container Works Locally But Fails in Kubernetes

Sumit Gautam — Sat, 02 May 2026 05:15:46 +0000

It's not Kubernetes being difficult. It's the assumptions your container was making that Docker quietly satisfied — and Kubernetes doesn't.

You've been here before.

The container runs perfectly on your laptop. docker run works. The app responds. Logs look clean. You push it to your managed Kubernetes cluster — EKS, GKE, AKS, take your pick — and something breaks. The pod crashes with no useful logs. Or it starts, passes health checks, and returns wrong responses. Or it worked fine in staging and silently fails in production despite identical manifests.

This isn't bad luck. It's a specific and repeatable class of problem: your container was built with implicit assumptions about its runtime environment, and Docker satisfies those assumptions automatically while Kubernetes does not.

Docker on your laptop is a generous host. With Docker Compose it fills in values from your shell and .env file, it runs the container as root unless the image says otherwise, it lets you publish ports straight onto localhost, and it gives containers as much memory and CPU as the machine can spare. Kubernetes is a strict host. It enforces isolation, applies resource constraints, manages networking through its own abstraction layer, and runs containers in a security context that may differ significantly from what you tested locally.

Every mismatch between those two environments is a potential failure. Here are the ones I've personally hit — and exactly how to close each gap.

Failure 1: Environment Variables and Secrets That Exist Locally But Not in the Cluster

This is the most common failure and the hardest to diagnose because the error it produces is almost never "environment variable missing." It's usually a downstream failure — a database connection refused, an API call returning 401, a feature that behaves as if it's in the wrong mode.

Locally, your container gets its environment variables from your .env file, your docker-compose.yml, and whatever your shell passes through with -e or Compose variable substitution. You've set these up once and forgotten about them. In Kubernetes, none of that exists. The pod gets exactly what you put in the manifest and nothing more.

The failure pattern I've seen most in EKS environments: an application that uses AWS SDK will work locally because the developer's machine has IAM credentials in ~/.aws/credentials. In EKS, those credentials don't exist — the pod needs an IAM role attached via a service account. The app starts, the pod is Running, health checks pass, and every AWS API call silently fails or returns permission errors that look like application bugs.

What catches this:

Always run an environment audit before moving to Kubernetes. Start the container locally with a completely clean environment — no .env file, no inherited shell variables:

# Start with no environment beyond what the image itself defines
docker run --env-file /dev/null myapp:latest

# Or explicitly pass only what Kubernetes will provide
docker run \
  -e DB_HOST=localhost \
  -e APP_ENV=production \
  myapp:latest

If it breaks locally with a clean environment, it will break in Kubernetes. Fix it before it gets there.
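One way to turn a missing variable into an obvious error, rather than a downstream connection failure, is a fail-fast check at application startup. This is a minimal sketch; the variable names in `REQUIRED_ENV` are placeholders taken from the example above, not a list the application is known to need.

```python
import os

# Hypothetical list: replace with the variables your app actually reads
REQUIRED_ENV = ("DB_HOST", "APP_ENV")


def missing_env(required=REQUIRED_ENV, environ=None) -> list:
    """Names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


def check_env_or_exit() -> None:
    """Call first thing at startup so a missing variable fails loudly."""
    missing = missing_env()
    if missing:
        raise SystemExit(
            "Missing required environment variables: " + ", ".join(missing)
        )
```

With this in place, a pod with an incomplete manifest exits immediately with a message that names the variable, which shows up in kubectl logs instead of surfacing later as a 401 or a refused connection.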

For secrets in managed clusters, use the platform's native secret injection — AWS Secrets Manager with External Secrets Operator on EKS, GCP Secret Manager on GKE — rather than baking secrets into ConfigMaps or manifests:

# External Secrets Operator pattern for EKS
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: app-secrets
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: prod/myapp/db
        property: password

For IAM authentication specifically on EKS, use IRSA (IAM Roles for Service Accounts) — not instance profiles, not hardcoded credentials:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: myapp-sa
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT_ID:role/myapp-role

Failure 2: Resource Limits Causing OOMKill and CPU Throttling

This one presents as the most confusing failure because the symptoms look like application bugs, not infrastructure problems.

OOMKill: the pod runs for a few minutes, then disappears. No error in application logs because the process was killed before it could write one. kubectl describe pod shows OOMKilled under Last State, but only for the most recent termination: the record is overwritten by the next restart and is gone entirely if the pod is deleted or rescheduled. Miss the window and you're debugging a ghost.

CPU throttling: the pod runs, the application responds, but it's slow. Intermittently slow in ways that don't correlate with traffic. This is the cgroup CPU quota applying: your container is being throttled because its CPU limit is 200m, it hit a burst, and the kernel is enforcing the limit. Locally, docker run with no resource flags gives the container your full machine's CPU. In Kubernetes with limits set, the container gets exactly what you asked for, which may be far less than it needs under load.

What catches this:

Never set resource limits in Kubernetes without first understanding your container's actual consumption profile. Run it under realistic load and measure:

# Current per-container usage (requires metrics-server)
kubectl top pod myapp-pod --containers

# Compare current memory usage across replicas (kubectl top has no history; use Prometheus or similar for trends)
kubectl top pods -l app=myapp --sort-by=memory

Set requests and limits based on observed data, not guesses:

resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    # Consider not setting CPU limits — only requests
    # CPU limits cause throttling; CPU requests cause scheduling

A pattern worth adopting in production: set memory limits (OOMKill is preferable to a node going down) but think twice before setting CPU limits at all. CPU throttling degrades performance silently; it doesn't crash the pod, so it's far harder to detect. Use CPU requests for scheduling, and monitor actual CPU usage separately.
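The reason a modest CPU limit hurts bursty workloads becomes clearer with the quota arithmetic. The sketch below assumes the default 100 ms CFS scheduling period; the 50 ms burst is an illustrative number.

```python
CFS_PERIOD_MS = 100  # default CFS period (assumed unchanged on your nodes)


def quota_ms(cpu_limit_millicores: int) -> float:
    """CPU time the container may consume in each scheduling period."""
    return cpu_limit_millicores * CFS_PERIOD_MS / 1000


def throttled_ms_per_period(cpu_limit_millicores: int, demand_ms: float) -> float:
    """CPU time per period the container wants but has to wait for."""
    return max(0.0, demand_ms - quota_ms(cpu_limit_millicores))


print(quota_ms(200))                     # budget per 100 ms with a 200m limit
print(throttled_ms_per_period(200, 50))  # a request needing 50 ms of CPU
```

A request that needs 50 ms of CPU under a 200m limit cannot finish inside one period, so it is paused and resumed across several, which is where the intermittent latency comes from. Multi-threaded runtimes drain the same budget faster because all threads share it.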

For OOMKill diagnosis, always check the pod's last state immediately after a crash:

kubectl describe pod myapp-pod | grep -A 10 "Last State"
# Look for: Reason: OOMKilled

Failure 3: Networking and Service Discovery Failures

Locally, your microservices talk to each other via localhost or hostnames defined in docker-compose. In Kubernetes, localhost refers to the pod itself — not other services. Service discovery works through DNS, and that DNS only resolves correctly if your service names, namespaces, and selectors are configured precisely.

The failure I've hit most: an application configured to connect to localhost:5432 for its database. That works locally when the database port is published on the host and the app runs there too, or when both share a network namespace. In Kubernetes, that connection attempt hits the pod's own loopback interface, where nothing is listening unless the database is a container in the same pod, and fails immediately. The error looks like a database connection failure, not a networking misconfiguration.

The staging-to-production variant: services work in staging because everything is in the default namespace and short DNS names resolve. In production with multiple namespaces, myservice doesn't resolve — myservice.production.svc.cluster.local does. The same manifest, different namespace, different DNS behavior.

What catches this:

Replace all localhost service references with Kubernetes DNS names before deploying. The full DNS format is:

<service-name>.<namespace>.svc.cluster.local

For services in the same namespace, the short name works:

env:
  - name: DB_HOST
    value: "postgres-service"  # same namespace
  - name: AUTH_SERVICE_URL
    value: "http://auth-service.auth-namespace.svc.cluster.local"  # cross-namespace
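If service addresses are assembled in application code or a deploy script, building the fully qualified name explicitly avoids depending on which namespace the caller happens to run in. A minimal sketch; the default cluster domain `cluster.local` is an assumption that holds for most clusters but is configurable.

```python
def service_dns(service: str, namespace: str,
                cluster_domain: str = "cluster.local") -> str:
    """Fully qualified in-cluster DNS name for a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


print(service_dns("auth-service", "auth-namespace"))
# auth-service.auth-namespace.svc.cluster.local
```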

Debug DNS resolution from inside the pod — not from your laptop:

# Exec into the pod and test DNS directly
kubectl exec -it myapp-pod -- nslookup postgres-service
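# Postgres does not speak HTTP, so expect a protocol error from curl; a successful TCP connect in the -v output is what matters
kubectl exec -it myapp-pod -- curl -v http://postgres-service:5432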

# If nslookup fails, check CoreDNS
kubectl logs -n kube-system -l k8s-app=kube-dns

Network policies are the other common gotcha in production managed clusters. EKS and GKE often ship with default-deny network policies in hardened configurations. A service that communicates freely in staging can be silently blocked in production:

# Explicit ingress policy — don't rely on default-allow
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-myapp-ingress
spec:
  podSelector:
    matchLabels:
      app: myapp
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - port: 8080

Failure 4: Readiness and Liveness Probes Misconfigured

This failure is subtle because it's the Kubernetes layer doing exactly what you told it to do — you just told it the wrong thing.

A liveness probe that's too aggressive will kill a pod that's healthy but slow to start — especially JVM applications, Python apps loading large models, or anything with a meaningful initialization phase. The pod starts, Kubernetes probes it at second 10, gets no response because the app isn't ready yet, and kills it. CrashLoopBackOff. The app never had a chance to run.

A readiness probe that's too lenient — or missing entirely — sends traffic to pods that aren't ready. The service shows endpoints, requests route to the new pod, and users get errors during the rollout window.

Locally, neither of these exists. Docker runs your container and leaves it alone.

What catches this:

Configure initialDelaySeconds generously on liveness probes — always longer than your slowest observed startup time:

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30    # give the app time to start
  periodSeconds: 10
  failureThreshold: 3
  timeoutSeconds: 5

readinessProbe:
  httpGet:
    path: /ready              # separate endpoint from liveness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3

Use separate endpoints for liveness and readiness. /healthz for liveness should return 200 as long as the process is alive and not deadlocked. /ready for readiness should verify the application can actually serve traffic — database connected, cache warm, dependencies reachable.
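The split between the two endpoints can be sketched with the standard library alone. The `dependencies_ok` function below is a placeholder assumption; a real service would replace it with cheap checks against its own database or cache.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def probe_status(path: str, deps_ok: bool) -> int:
    """HTTP status for a probe path.

    /healthz only says the process is responsive; /ready also requires
    that the dependencies needed to serve traffic are reachable.
    """
    if path == "/healthz":
        return 200
    if path == "/ready":
        return 200 if deps_ok else 503
    return 404


def dependencies_ok() -> bool:
    """Placeholder: replace with real checks (for example a cheap DB query)."""
    return True


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, dependencies_ok()))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Blocks forever; run it in the application's own process or thread."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

Keeping dependency checks out of /healthz matters: if liveness fails whenever the database is down, Kubernetes restarts every pod during a database outage and makes recovery slower.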

Failure 5: File Permissions and Volume Mount Issues

Locally, your Docker container typically runs as root or as your user — whichever the Dockerfile specifies, with no external enforcement. In managed Kubernetes clusters, particularly on GKE Autopilot and hardened EKS configurations, pods run with runAsNonRoot: true enforced at the namespace or cluster level. If your container expects to write to /app/logs or /tmp/cache as root, it silently fails or crashes with a permission error that's easy to misread.

Volume mounts compound this. A hostPath volume that works in a local Docker setup doesn't exist in a managed cluster. An emptyDir volume mounted at /app/data will be owned by root unless you explicitly set fsGroup — meaning a container running as a non-root user can't write to it.

What catches this:

Always set an explicit security context and test against it locally:

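spec:
  securityContext:                   # pod-level fields
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    fsGroup: 1000                    # ensures volume mounts are group-writable
  containers:
    - name: myapp
      image: myapp:latest
      securityContext:               # container-level fields
        readOnlyRootFilesystem: true # force explicit volume declarations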

And in your Dockerfile, match the user:

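# Fixed IDs so the image user matches runAsUser/runAsGroup 1000 (assumes a Debian- or RHEL-style base image where UID and GID 1000 are free)
RUN groupadd --gid 1000 appgroup && useradd --uid 1000 --gid appgroup --no-create-home appuser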
RUN chown -R appuser:appgroup /app
USER appuser

Test this locally before pushing to the cluster:

docker run --user 1000:1000 --read-only myapp:latest

If it fails locally with these constraints, it will fail in Kubernetes. Fix the permissions at the image level, not with cluster-level workarounds.
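A startup check can also report exactly which directories the process cannot write to, instead of failing later on the first log line or cache write. A minimal sketch using only the standard library; the paths in the usage comment are the ones mentioned above and are assumptions about your image layout.

```python
import os
import tempfile


def unwritable(paths) -> list:
    """Paths that are missing or not writable by the current user."""
    return [p for p in paths if not os.access(p, os.W_OK)]


# Demonstration with a scratch directory: the existing directory passes,
# the non-existent child is reported.
with tempfile.TemporaryDirectory() as scratch:
    print(unwritable([scratch, os.path.join(scratch, "missing")]))

# In the application: unwritable(["/app/logs", "/tmp/cache"])
```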

The Underlying Pattern

Every failure above follows the same structure: Docker locally is permissive by default, Kubernetes in production is restrictive by design.

This isn't a Kubernetes flaw. Isolation, resource enforcement, and security contexts exist for good reasons in multi-tenant managed clusters. The problem is that the permissive local environment creates invisible dependencies — on inherited environment variables, on unrestricted resources, on root file access — that your container never had to explicitly declare.

The fix isn't to make Kubernetes more permissive. It's to make your container honest about what it needs.

Build containers that declare their requirements explicitly: environment variables, resource requests, security context, health check endpoints, DNS-based service addressing. Test them under production-like constraints before they reach the cluster. When a container works locally and fails in Kubernetes, the question isn't "what's wrong with Kubernetes" — it's "what assumption was my container making that I didn't know about."

Kubernetes just makes those assumptions visible. Usually at the worst possible time.

Quick Reference: The Local-to-Kubernetes Readiness Checklist

Before promoting any container from local Docker to a managed Kubernetes cluster:

Environment audit — run locally with clean environment, no inherited shell variables
IAM/credentials — no local credential files; use IRSA or Workload Identity
Resource profiling — measure actual CPU and memory under load before setting limits
DNS references — replace all localhost with Kubernetes service DNS names
Probe configuration — separate liveness/readiness endpoints, generous initialDelaySeconds
Security context — test with runAsNonRoot: true and readOnlyRootFilesystem: true locally
Volume permissions — set fsGroup on all writable volume mounts

What's the most confusing Docker-to-Kubernetes failure you've debugged? Drop it in the comments — the weirder the better.

The CI/CD Pipeline That Looked Fine But Was Silently Failing

Sumit Gautam — Wed, 22 Apr 2026 06:26:43 +0000

Everything was green. The deployment succeeded. Production was broken for hours. Here's what I learned.

There's a specific kind of production incident that's worse than an outage.

An outage is loud. Alerts fire, dashboards go red, everyone knows something is wrong. You fix it.

The silent failure is different. The pipeline is green. The deployment says "successful." No alerts fire. And somewhere in production, the wrong code is quietly running — serving stale responses, skipping validation, behaving in ways that don't match what you just merged. Nobody knows yet.

I've been on the wrong end of this more than once. Wrong Docker images deployed due to layer caching. Tests marked as passed that never actually ran. Environment variables from staging quietly bleeding into production. A deployment that reported success while the old version kept serving traffic because the agent never actually finished the job.

Each time, the CI/CD dashboard looked fine. That's what made it dangerous.

This article is about what green pipelines hide — and the specific verification habits that catch these failures before your users do.

Failure 1: The Docker Cache That Deployed Yesterday's Code

This one is subtle enough that it can fool you completely if you're not looking for it.

The scenario: you push a fix, the pipeline runs, the Docker build completes in 12 seconds instead of the usual 4 minutes. You don't think much of it — fast builds are good, right? Deployment succeeds. You check the service, it seems to respond. You close the laptop.

What actually happened: Docker's layer cache served previously built layers. Docker decides whether a COPY . . layer is stale from a checksum of the files in the build context (modification times are not part of it), so a cache hit on that step means the context Docker saw did not contain your change. That is common in CI when a workspace is reused, the checkout is stale, or the build runs from the wrong directory. The image that got deployed was built from code that predated your fix.

The dangerous part is that the build log looks correct. You see your Dockerfile steps. You see layer hashes. Nothing screams "wrong image."

What catches this:

Always embed the Git commit SHA into your image at build time and verify it at deploy time:

# Dockerfile
ARG GIT_COMMIT=unknown
LABEL git-commit=$GIT_COMMIT
ENV GIT_COMMIT=$GIT_COMMIT

# GitHub Actions
- name: Build image
  run: |
    docker build \
      --build-arg GIT_COMMIT=${{ github.sha }} \
      --no-cache \
      -t myapp:${{ github.sha }} .

Then expose this via a /healthz or /version endpoint in your application and verify it immediately post-deployment. If the SHA in the running container doesn't match the SHA that triggered the pipeline — you have a problem, and you know it within seconds, not hours.
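The post-deployment check itself can be a few lines. This sketch assumes a hypothetical /version endpoint that returns JSON with a `git_commit` field; both the endpoint shape and the field name are illustrative, not something the application is known to expose.

```python
import json
import urllib.request


def sha_matches(expected: str, reported: str) -> bool:
    """True when the running build is the commit the pipeline just built.

    Accepts an abbreviated SHA on either side (minimum 7 characters).
    """
    expected, reported = expected.strip().lower(), reported.strip().lower()
    if min(len(expected), len(reported)) < 7:
        return False
    return expected.startswith(reported) or reported.startswith(expected)


def fetch_reported_sha(version_url: str) -> str:
    """Read {"git_commit": "..."} from the hypothetical /version endpoint."""
    with urllib.request.urlopen(version_url, timeout=5) as resp:
        return json.load(resp)["git_commit"]
```

A deploy step would call `fetch_reported_sha` against the live service, compare it with the pipeline's commit SHA using `sha_matches`, and fail the job on a mismatch. Note that the Dockerfile default of `unknown` never matches a real SHA, so a build that forgot the build argument also fails the check.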

For builds where you intentionally use caching for speed, use --cache-from with explicit cache sources rather than relying on local daemon cache. This gives you cache benefits with predictable, auditable behavior.

Failure 2: Tests That Were Skipped But Reported Green

This is the one that genuinely shook my confidence in pipelines for a while.

The scenario: a test suite that passed every run for weeks. No failures, consistent timing. Then a bug reaches production that the tests should have caught — and when you investigate, you find that the test step exited with code 0 (success) without actually running the tests. The framework had a configuration issue, found no test files matching the pattern, reported "0 tests run, 0 failures" and exited cleanly.

Zero failures. Zero tests. Green.

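This happens across test tooling. Defaults vary: pytest exits with code 5 when it collects nothing, and Jest fails on an empty run unless passWithNoTests is set, but Maven Surefire passes with zero tests unless failIfNoTests is enabled, and a wrapper script, a trailing || true, or an over-broad filter can turn any of them green. The tools aren't broken. They did exactly what you asked. You just didn't ask them to verify they ran something.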

What catches this:

# GitHub Actions with pytest
- name: Run tests
  run: |
    pytest --tb=short -q

- name: Verify tests actually ran
  run: |
    COUNT=$(pytest --collect-only -q 2>&1 | tail -1 | grep -oE '^[0-9]+' || true)
    COUNT=${COUNT:-0}  # "no tests collected" has no leading number
    if [ "$COUNT" -lt 10 ]; then
      echo "ERROR: Expected at least 10 tests, found $COUNT"
      exit 1
    fi

Add a minimum test count gate to your pipeline. It feels paranoid until the day it saves you. Also configure your test framework to fail explicitly on empty test runs — most modern frameworks support this:

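# pytest.ini
# pytest exits with code 5 on an empty collection; don't mask that exit code.
# --strict-markers does not detect empty runs; it makes unregistered markers an error.
[pytest]
addopts = --strict-markers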

// jest.config.js
module.exports = {
  passWithNoTests: false, // the default; stated explicitly so nobody flips it unnoticed
};

The principle: a pipeline step that can succeed by doing nothing is a liability.
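
The minimum test count gate can also be written as a small Python helper, which is easier to unit-test than a shell pipeline. This is a sketch, not a definitive implementation: the function names are hypothetical, and the summary-line formats it parses are assumed from recent pytest versions, so check them against the output your version prints.

```python
import re

# Matches summary lines such as "12 tests collected in 0.03s" and
# "3/10 tests collected (7 deselected) in 0.02s" (assumed formats).
_SUMMARY_RE = re.compile(r"^(\d+)(?:/\d+)? tests? collected")


def collected_test_count(collect_output: str) -> int:
    """Return the test count from `pytest --collect-only -q` output."""
    for line in reversed(collect_output.strip().splitlines()):
        match = _SUMMARY_RE.match(line.strip())
        if match:
            return int(match.group(1))
    # "no tests collected" or unrecognized output counts as zero tests.
    return 0


def gate_passes(collect_output: str, minimum: int) -> bool:
    """True only when at least `minimum` tests were collected."""
    return collected_test_count(collect_output) >= minimum
```

Treating unrecognized output as zero is deliberate: if the format changes, the gate fails loudly instead of passing by accident.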

Failure 3: The Wrong Environment Variables in Production

This failure is almost embarrassingly simple — which is exactly why it happens.

The scenario: a deployment to production uses a configuration value from staging. A database connection string, an API endpoint, a feature flag threshold. The application starts fine because the staging value is valid — it just points somewhere wrong. The service runs, the pipeline is green, and for hours your production traffic is quietly hitting staging infrastructure or using misconfigured limits.

In a Jenkins multi-environment setup, this often happens when:

Environment-specific credential bindings aren't properly scoped to the deployment stage
A previous build's workspace has leftover .env files
Variable precedence between pipeline parameters, Jenkins credentials, and application defaults isn't clearly understood

What catches this:

First, never rely on implicit environment variable inheritance in pipelines. Be explicit and loud about what each stage receives:

// Jenkinsfile
stage('Deploy Production') {
  environment {
    APP_ENV = 'production'
    DB_HOST = credentials('prod-db-host')
  }
  steps {
    sh '''
      echo "Deploying to: $APP_ENV"
      echo "DB host prefix: $(printf '%s' "$DB_HOST" | cut -c1-8)..."
      ./deploy.sh
    '''
  }
}

Second, add a post-deployment verification step that queries an endpoint such as /healthz, /config, or /env-check and asserts key environment markers are what you expect:

DEPLOYED_ENV=$(curl -sf https://myapp.prod/healthz | jq -r '.environment')
if [ "$DEPLOYED_ENV" != "production" ]; then
  echo "FATAL: Deployed environment is '$DEPLOYED_ENV', expected 'production'"
  exit 1
fi

This takes 30 seconds to write and catches an entire class of misconfiguration failures permanently.
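
When more than one marker matters, the same assertion generalizes to a comparison of expected and reported values. The sketch below assumes the service returns a flat JSON object; the function name and the keys used in the example are hypothetical.

```python
def config_mismatches(expected: dict, actual: dict) -> list:
    """List every expected marker that is missing or different in `actual`."""
    problems = []
    for key, want in expected.items():
        # A missing key is reported the same way as a wrong value.
        got = actual.get(key, "<missing>")
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems
```

A deploy step would fetch the JSON from the running service, call config_mismatches with the production expectations, print each problem, and exit non-zero if the list is not empty. Only keys named in expected are checked, so extra fields in the response do not cause failures.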

Failure 4: Deployment Succeeded, Old Code Still Running

This one is specifically painful because the deployment tooling is telling you the truth — it did succeed. The problem is that "deployment succeeded" and "new code is serving traffic" are not the same statement.

The scenario: a Kubernetes rollout reports complete. GitHub Actions shows a green checkmark. You hit the service and you're getting responses consistent with the old version. What happened?

Common causes:

Rollout completed but pods are serving from cached image — imagePullPolicy: IfNotPresent on a node that already has the old image with the same tag (the classic latest tag problem)
Old pods didn't terminate cleanly: they're still in Terminating state and still receiving traffic because their removal from the Service's endpoints hasn't propagated to every node's proxy or load balancer yet
The deployment updated but a HorizontalPodAutoscaler or another controller scaled back to old replicas before you checked
The CI agent itself failed mid-job, reported partial success, and the deployment step never fully executed

What catches this:

Never use mutable tags like latest in production Kubernetes manifests. Always deploy with the image SHA or a unique build tag:

# Bad
image: myapp:latest

# Good  
image: myapp:a3f8c21d

Add explicit rollout verification as a pipeline step, not a manual check:

# GitHub Actions
- name: Verify rollout
  run: |
    kubectl rollout status deployment/myapp --timeout=120s

- name: Verify correct image is running
  run: |
    RUNNING_IMAGE=$(kubectl get pods -l app=myapp \
      -o jsonpath='{.items[0].spec.containers[0].image}')
    EXPECTED_IMAGE="myapp:${{ github.sha }}"

    if [ "$RUNNING_IMAGE" != "$EXPECTED_IMAGE" ]; then
      echo "Image mismatch: running $RUNNING_IMAGE, expected $EXPECTED_IMAGE"
      exit 1
    fi
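
The image check above inspects only the first pod in the list. A stricter variant, sketched here in Python, inspects every pod returned by kubectl get pods -l app=myapp -o json. The function name is hypothetical, and skipping pods that carry a deletionTimestamp is a design assumption you may want to tighten.

```python
def pods_with_wrong_image(pod_list: dict, expected_image: str) -> list:
    """Return names of live pods where no container runs `expected_image`."""
    live_pods = [
        pod for pod in pod_list.get("items", [])
        # Pods that are being deleted carry a deletionTimestamp.
        if not pod.get("metadata", {}).get("deletionTimestamp")
    ]
    if not live_pods:
        # An empty list must not count as success.
        raise ValueError("no live pods matched; refusing to report success")

    offenders = []
    for pod in live_pods:
        containers = pod.get("spec", {}).get("containers", [])
        images = [container.get("image") for container in containers]
        if expected_image not in images:
            offenders.append(pod.get("metadata", {}).get("name", "<unnamed>"))
    return offenders
```

Feed it the parsed kubectl JSON and fail the step if it returns any names. Raising on an empty pod list keeps the check from succeeding by doing nothing, for example when the label selector is mistyped.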

For the agent failure case — always configure your CI agents with heartbeat timeouts and ensure your pipeline has explicit failure handling for agent disconnection. A job that loses its agent mid-run should never report green.

Failure 5: The Agent That Quietly Gave Up

This is the most operationally unglamorous failure on this list, and possibly the most common in Jenkins environments.

The scenario: a build agent goes offline, becomes unresponsive, or hits a resource limit mid-job. Depending on your Jenkins configuration, this can result in the job being marked as successful if the failure happens during a non-critical step, or if the agent timeout is set too generously and the job just... stops reporting.

You check the console log. It ends mid-line. No error. No stack trace. Just silence — and a green badge.

What catches this:

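// Jenkinsfile: always set explicit timeouts
pipeline {
  // agent and stages omitted for brevity
  options {
    timeout(time: 30, unit: 'MINUTES')
    retry(1)
  }
  post {
    always {
      script {
        // Caution: result can be null for a build that is still running and
        // passing. Confirm how your Jenkins version populates it in post
        // before relying on this check; currentBuild.currentResult is the
        // never-null alternative.
        if (currentBuild.result == null) {
          currentBuild.result = 'FAILURE'
        }
      }
    }
  }
}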

Monitor agent health as infrastructure — not as an afterthought. Agent failures should fire the same alerts as application failures. If your agents are running in Docker or Kubernetes, treat them with the same resource limits, health checks, and observability you'd apply to any production workload.

The Underlying Principle

Every failure above shares a root cause: the pipeline verified that steps executed, not that outcomes were correct.

A step that runs is not the same as a step that succeeded in the way you intended. Green means the process completed. It does not mean the result is what you think it is.

The discipline of post-deployment verification — checking the SHA, querying the running environment, asserting the test count, confirming the rollout image — closes this gap. It's not extra work. It's the last mile of the deployment that most pipelines are missing.

Build pipelines that are skeptical of themselves. Verify outcomes, not just execution. Treat a deployment as unconfirmed until the running system tells you it's correct — not until your CI dashboard does.

The dashboard will lie to you. Production won't.

Quick Reference: The Verification Checklist

Add these steps to every production deployment pipeline:

Image SHA verification — confirm running container matches the commit that triggered the build
Test count gate — assert minimum number of tests ran, fail on zero
Environment assertion — query running service to confirm correct environment config
Rollout image check — verify deployed pods are running the new image, not a cached version
Agent timeout + null result handling — ensure agent failures produce explicit pipeline failures
Explicit --no-cache policy — or documented, auditable cache-from strategy

None of these take more than 20 lines to implement. Together, they eliminate the entire class of "it looked fine" incidents.

Have you been burned by a silent pipeline failure? I'd genuinely like to hear what broke and what you did to catch it — drop it in the comments.

IPv6 Is "The Future of the Internet" — So Why Did It Break My Streaming App in 2025?

Sumit Gautam — Tue, 14 Apr 2026 08:59:37 +0000

A personal debugging incident that turned into an industry-wide infrastructure audit.

Last week I spent 45-50 minutes convinced my LG WebOS TV or my ISP had quietly broken something. JioHotstar — India's dominant streaming platform — was refusing to play anything. Every title. Every time. Error code DR-6006_X: "We are having trouble playing this video right now."

I did what everyone does. Restarted the router. Restarted the TV. Unplugged everything and waited. Reinstalled the app. Nothing changed, because none of that was the problem.

The fix, once I found it, took ten seconds: I forced my LG TV to use IPv4 directly from the TV's own network settings — leaving my router free to run IPv6 for every other device on the network. JioHotstar worked immediately.

That's a cleaner fix than it sounds. The router doesn't lose IPv6. Your phone, laptop, and other devices are unaffected. Only the TV talks IPv4. But the real question isn't how I fixed it. It's why this broke in the first place, and what it says about where the industry actually stands on IPv6 readiness in 2025.

The short answer: not as far along as anyone wants to admit.

What Actually Failed — and Why Restarting Never Would Have Fixed It

To understand the failure, you need to understand what happens when a smart TV tries to play protected streaming content.

When your LG TV connects to JioHotstar, it doesn't just fetch a video file. It first resolves DNS to locate the platform's servers, negotiates a session, contacts a DRM (Digital Rights Management) license server to verify you're entitled to watch the content, receives a cryptographic key, and then begins streaming. The DR-6006_X error code sits in that DRM handshake layer — not in the video delivery itself. The content never starts because the license exchange never completes.

Here's where IPv6 enters. Modern home routers run what's called a dual-stack configuration: both IPv4 and IPv6 simultaneously. When a dual-stack device resolves a hostname, it typically asks for both A records (IPv4 addresses) and AAAA records (IPv6 addresses). Devices are supposed to implement a mechanism called Happy Eyeballs (RFC 8305): start with the preferred family, usually IPv6, launch an attempt over the other family a fraction of a second later if the first hasn't connected, and use whichever connection completes first.
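
To make the fallback idea concrete, here is a minimal Python sketch of one piece of Happy Eyeballs: ordering candidate addresses so the two families alternate. It is an illustration only. It does not implement the staggered, concurrent connection attempts that RFC 8305 describes, the function name is hypothetical, and it says nothing about how WebOS actually behaves.

```python
import socket


def interleave_families(addresses: list) -> list:
    """Order (family, address) pairs so IPv6 and IPv4 alternate, IPv6 first."""
    v6 = [entry for entry in addresses if entry[0] == socket.AF_INET6]
    v4 = [entry for entry in addresses if entry[0] == socket.AF_INET]
    ordered = []
    for index in range(max(len(v6), len(v4))):
        # Take one address of each family per round, while any remain.
        if index < len(v6):
            ordered.append(v6[index])
        if index < len(v4):
            ordered.append(v4[index])
    return ordered
```

A client that walks this list reaches an IPv4 address on its second attempt even when every IPv6 path is broken. A client that exhausts all IPv6 candidates first, or never tries IPv4 at all, shows the symptom described here. Python's asyncio offers a fuller version of the behavior through the happy_eyeballs_delay and interleave arguments of loop.create_connection.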

LG's WebOS, based on observed behavior, does not implement this fallback reliably. It preferentially routes traffic over IPv6 and appears to fail silently when that path encounters a problem. Since that preference persists on every reconnection, restarting the router or TV changes nothing — you reconnect over the same path every single time.

The most likely explanation for the failure, based on symptoms and error behavior, is that some part of the playback stack (DRM license delivery, CDN routing, or session token validation) doesn't handle IPv6 connections reliably in certain network configurations. I can't confirm exactly where the chain breaks without packet-level access to both sides. But the fix was consistent, repeatable, and immediate, which points clearly at the IP path (the network layer), not the content or the account.

This Isn't Unique to One Platform. It's an Industry-Wide Pattern.

What makes this incident worth writing about is that it isn't unusual. IPv6 compatibility failures in streaming and connected devices follow a remarkably consistent pattern across the industry.

Streaming platforms broadly have CDN routing behavior that differs meaningfully between IPv4 and IPv6. CDN providers maintain separate peering agreements for IPv6 traffic, and edge node coverage isn't uniform — a regional PoP (Point of Presence) may have IPv6 routes that are technically announced but practically unreliable in certain geographies. Users on these paths see buffering on fast connections, or quality adaptation that behaves erratically — symptoms almost impossible to attribute to IP version without infrastructure-level visibility.

Some smart home devices (cameras, doorbells, smart speakers) are quietly problematic on IPv6-preferred networks. Much embedded firmware was written assuming IPv4. Device discovery protocols like mDNS and SSDP behave differently in dual-stack environments, and many IoT vendors appear never to have included IPv6-preferred configurations in their QA test matrix. The result is intermittent connectivity that looks exactly like hardware failure or ISP instability.

Enterprise SaaS applications carry a specific class of IPv6 bug: session token validation tied to IP address. Several categories of HR, ERP, and authentication platforms were built when binding a session to an IPv4 address seemed like reasonable security practice. In dual-stack environments, where the same user can appear at different addresses during a session depending on which path the OS chooses, this breaks authentication flows in ways that are genuinely hard to reproduce and diagnose.

The pattern is consistent: the application works, the network works, but the intersection of a modern network configuration and legacy application assumptions produces a failure that looks random from the outside.

Why the Industry Keeps Deprioritizing This — An Honest Analysis

The economic reasoning behind IPv6 neglect is worth understanding clearly, because it explains why this problem persists despite being well-known.

"It works on IPv4 — what's the business case?" This is the dominant internal conversation at most product companies, and it's genuinely hard to argue against on a quarterly basis. IPv4 still functions. Most users are still on IPv4-dominant configurations. IPv6 failures are intermittent, hard to reproduce in standard QA environments, and — most importantly — users blame their ISP or their device, not the platform. The error rate doesn't surface in dashboards as an IPv6 problem. It shows up as generic playback failures, support tickets, or quietly churned users. The platform never sees the root cause.

Third-party dependency chains are real. DRM systems are not built in-house. Streaming platforms rely on Widevine (Google), FairPlay (Apple), and PlayReady (Microsoft) licensing infrastructure. If any component in that chain — license delivery endpoints, session APIs, token validation services — doesn't fully support IPv6, the platform inherits that limitation regardless of how well their own code handles it. Fixing it means waiting on vendor roadmaps.

CDN IPv6 support is uneven at the edge. Major providers like Akamai, Cloudflare, and AWS CloudFront have strong IPv6 support at their primary nodes. But regional edge coverage is not uniform — particularly in markets like India, Southeast Asia, and parts of Africa. IPv6 route announcements can be technically active while practically unreliable, creating what networking engineers call "black hole routes." Traffic arrives at the edge and disappears. This is invisible unless you're monitoring IPv6 path performance as a separate metric from IPv4.

QA environments default to IPv4. This is arguably the most systemic issue of all. Most developer laptops, staging environments, and CI/CD pipelines run on IPv4. IPv6 failures are never surfaced in development because the development environment can't produce them. By the time the code reaches production users with IPv6-preferred home networks, the bug has been shipped, tested against, and forgotten.

What IPv6 Readiness Actually Looks Like in Practice

For engineering and infrastructure teams, the baseline is:

Add IPv6 explicitly to your QA matrix. Run a staging environment on an IPv6-preferred network. Test every authentication flow, every DRM handshake, every CDN segment request against both stacks — independently and together.
Audit your third-party dependencies. Your DRM vendor, CDN configuration, session management layer, analytics endpoints, and error reporting infrastructure. One IPv4-only dependency can silently break the entire user flow.
Instrument by IP version. Your observability stack should tag requests by IP version so you can see IPv6 error rates as a distinct signal — not buried inside aggregate failure rates where it's invisible.
Don't trust OS-level fallback on smart TV platforms. WebOS, Tizen, Android TV, and FireOS all handle Happy Eyeballs differently. Build explicit connection retry logic with IP version awareness into your client applications rather than assuming the platform handles it correctly.
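
Two of the items above, instrumenting by IP version and auditing dependencies, can be sketched with the Python standard library. Both function names are hypothetical, and the port and the injectable resolver argument are assumptions made so the audit can be exercised without network access.

```python
import ipaddress
import socket


def ip_version_tag(client_address: str) -> str:
    """Return "ipv4" or "ipv6" for use as a metric or log label."""
    ip = ipaddress.ip_address(client_address)
    # IPv4-mapped IPv6 addresses (::ffff:a.b.c.d) are really IPv4 clients.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        return "ipv4"
    return f"ipv{ip.version}"


def hosts_without_ipv6(hostnames: list, resolve=socket.getaddrinfo) -> list:
    """Return the hostnames for which no IPv6 address could be resolved."""
    missing = []
    for host in hostnames:
        try:
            # Ask only for IPv6 results; 443 is an assumed HTTPS port.
            resolve(host, 443, family=socket.AF_INET6)
        except socket.gaierror:
            missing.append(host)
    return missing
```

Attaching the tag to every request log or metric lets IPv6 error rates show up as their own series. The audit covers DNS only: a hostname with AAAA records can still sit behind an unreliable IPv6 path, and results depend on the resolver the check runs against, so treat an empty list as a starting point rather than proof of readiness.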

For end-users dealing with this today:

The cleanest fix is to force IPv4 directly in your TV's network settings rather than disabling IPv6 on the router. This keeps your router and all other devices on IPv6 — only the TV talks IPv4. No network-wide compromise needed.
If your TV doesn't expose IP version settings directly, creating a separate SSID with IPv6 disabled for smart TVs and IoT devices is the next best option.
If you're on a mesh network (Eero, Google Nest, Orbi), check whether IPv6 is enabled by default in the admin panel — many ship with it on, and most don't advertise it clearly.

The Bigger Picture

IPv6 was standardized in 1998. IPv4 address exhaustion has been a formally declared crisis since 2011. In 2025, a user on a modern home network running the protocol the industry has called "the future" for two decades can hit silent, inexplicable streaming failures, and the standard advice is still "restart your router."

This isn't a failure of any single company. It's the accumulated result of thousands of individually rational decisions — by platform teams, CDN vendors, device manufacturers, and DRM providers — to defer IPv6 readiness because IPv4 still works for most users most of the time.

The problem with "most users most of the time" is that it's actively changing. Jio, Airtel, and BSNL in India are all accelerating IPv6 deployment. The population of users on IPv6-preferred networks is growing faster than the industry is closing the compatibility gaps. And because these failures are invisible in aggregate metrics — they look like ISP problems, device problems, anything but platform problems — there's no forcing function to fix them.

The 45 minutes I spent debugging my TV is trivial. Multiplied across millions of users who never find the fix, it's churn, eroded trust, and support volume that gets categorized incorrectly and never traced back to its root cause.

IPv6 readiness is no longer a future concern for streaming platforms, IoT vendors, and enterprise software teams. It is a present-tense gap that the industry's standard testing practices are structurally incapable of detecting.

The router restart won't fix it. The QA matrix needs to.

Have you hit IPv6 compatibility issues on streaming platforms or connected devices? I'd be genuinely interested in what you found — drop it in the comments below.