<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sumit Gautam</title>
    <description>The latest articles on DEV Community by Sumit Gautam (@sumit_gautam_379d5).</description>
    <link>https://dev.to/sumit_gautam_379d5</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3272778%2F2baa36dc-d5b2-4ccb-a61c-f2f4799695d2.png</url>
      <title>DEV Community: Sumit Gautam</title>
      <link>https://dev.to/sumit_gautam_379d5</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sumit_gautam_379d5"/>
    <language>en</language>
    <item>
      <title>The Cloud Cost Spike Nobody Warned Me About</title>
      <dc:creator>Sumit Gautam</dc:creator>
      <pubDate>Thu, 21 May 2026 06:07:53 +0000</pubDate>
      <link>https://dev.to/sumit_gautam_379d5/the-cloud-cost-spike-nobody-warned-me-about-13ph</link>
      <guid>https://dev.to/sumit_gautam_379d5/the-cloud-cost-spike-nobody-warned-me-about-13ph</guid>
      <description>&lt;p&gt;&lt;em&gt;I've discovered cloud cost problems every possible way. Here's what I learned each time.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I've been on the wrong end of an unexpected AWS bill more than once. And I've discovered those problems every possible way the industry offers.&lt;/p&gt;

&lt;p&gt;A billing alert firing at 11pm on a Friday. A client call on a Monday morning where the first words were "why did our AWS bill double?" A routine Cost Explorer review that started as a 10-minute check and turned into a two-hour investigation. And yes, a month-end invoice that was simply higher than it should have been, with no prior warning because nobody had set one.&lt;/p&gt;

&lt;p&gt;Each time, the root cause wasn't a bug. It wasn't a misconfiguration in any obvious sense. It was the natural output of infrastructure built by engineers — including me — who understood how AWS services work but hadn't fully internalized how AWS &lt;em&gt;billing&lt;/em&gt; works.&lt;/p&gt;

&lt;p&gt;Those are not the same thing. And the gap between them is where real money disappears.&lt;/p&gt;

&lt;p&gt;This article is about that gap — the specific AWS cost patterns that look like correct architecture until you see the bill, and what I put in place after each incident to make sure it didn't happen the same way twice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost Driver 1: NAT Gateway Data Transfer Charges
&lt;/h2&gt;

&lt;p&gt;This is the one that surprises almost everyone the first time.&lt;/p&gt;

&lt;p&gt;NAT Gateway pricing has two components that AWS documents clearly and engineers consistently underestimate in practice. The first is the hourly charge for the gateway existing — roughly $0.045/hour per gateway, about $32/month. Noticeable but expected.&lt;/p&gt;

&lt;p&gt;The second is the data processing charge — $0.045 per GB of data that passes through the gateway in either direction. This is the one that generates real bills.&lt;/p&gt;

&lt;p&gt;The scenario I hit: a Kubernetes cluster on EKS with pods in private subnets pulling container images from ECR, sending logs to CloudWatch, and making API calls to various AWS services, with all of that traffic routed through a NAT Gateway. At $0.045 per GB, a cluster pushing 300 GB per day through the gateway pays roughly $400 a month in processing charges alone, which can rival or exceed the EC2 cost of the nodes underneath it.&lt;/p&gt;

&lt;p&gt;The architecture is correct. Private subnets with NAT Gateway is the right pattern for production workloads. The billing implication just wasn't modeled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What fixes this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For traffic between your resources and AWS services specifically, use &lt;strong&gt;VPC Endpoints&lt;/strong&gt; instead of routing through NAT Gateway. VPC Endpoints keep traffic on the AWS private network — no NAT Gateway processing charge, lower latency, and often better security posture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a VPC Endpoint for S3 (Gateway type — free)&lt;/span&gt;
aws ec2 create-vpc-endpoint &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-id&lt;/span&gt; vpc-xxxxxxxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-name&lt;/span&gt; com.amazonaws.ap-south-1.s3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--route-table-ids&lt;/span&gt; rtb-xxxxxxxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-endpoint-type&lt;/span&gt; Gateway

&lt;span class="c"&gt;# Create Interface Endpoint for ECR (replaces NAT for image pulls)&lt;/span&gt;
aws ec2 create-vpc-endpoint &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-id&lt;/span&gt; vpc-xxxxxxxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--service-name&lt;/span&gt; com.amazonaws.ap-south-1.ecr.dkr &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--vpc-endpoint-type&lt;/span&gt; Interface &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--subnet-ids&lt;/span&gt; subnet-xxxxxxxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--security-group-ids&lt;/span&gt; sg-xxxxxxxx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For S3 and DynamoDB, Gateway Endpoints are free. For ECR, CloudWatch, Secrets Manager, and other services, Interface Endpoints carry an hourly charge per AZ plus a per-GB processing charge, but for high-volume workloads they're almost always cheaper than the equivalent NAT Gateway processing charges.&lt;/p&gt;

&lt;p&gt;Model this before you build. The break-even point is lower than you expect.&lt;/p&gt;
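
&lt;p&gt;One way to model it is a short Python sketch like the one below. The NAT Gateway processing rate is the $0.045/GB quoted above. The Interface Endpoint rates ($0.01 per hour per AZ and $0.01 per GB), the 730-hour month, and the traffic volume are assumptions for illustration only, so check the pricing page for your region before trusting the output.&lt;/p&gt;

```python
# Rough monthly cost model: NAT Gateway processing vs. an Interface Endpoint.
# NAT_PER_GB comes from the article; the endpoint rates are ASSUMED values.
HOURS_PER_MONTH = 730            # assumed average month length in hours
NAT_PER_GB = 0.045               # NAT Gateway data processing, USD per GB
ENDPOINT_PER_HOUR = 0.01         # assumed Interface Endpoint rate, per AZ
ENDPOINT_PER_GB = 0.01           # assumed Interface Endpoint processing rate


def nat_processing_cost(gb_per_month):
    """Data processing charge for traffic routed through the NAT Gateway."""
    return gb_per_month * NAT_PER_GB


def endpoint_cost(gb_per_month, az_count=1):
    """Hourly charge per AZ plus per-GB processing for one Interface Endpoint."""
    fixed = ENDPOINT_PER_HOUR * HOURS_PER_MONTH * az_count
    return fixed + gb_per_month * ENDPOINT_PER_GB


def break_even_gb(az_count=1):
    """Monthly GB above which the endpoint is cheaper than NAT processing."""
    fixed = ENDPOINT_PER_HOUR * HOURS_PER_MONTH * az_count
    return fixed / (NAT_PER_GB - ENDPOINT_PER_GB)


if __name__ == "__main__":
    monthly_gb = 200 * 30  # hypothetical: 200 GB/day of AWS-service traffic
    print(f"NAT processing:     ${nat_processing_cost(monthly_gb):,.2f}")
    print(f"Interface endpoint: ${endpoint_cost(monthly_gb, az_count=3):,.2f}")
    print(f"Break-even (3 AZs): {break_even_gb(az_count=3):,.0f} GB/month")
```

&lt;p&gt;Under those assumed rates, one endpoint spread across three AZs pays for itself at roughly 626 GB per month, which is about 21 GB per day for that service.&lt;/p&gt;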




&lt;h2&gt;
  
  
  Cost Driver 2: Forgotten and Idle Resources
&lt;/h2&gt;

&lt;p&gt;This one is less glamorous than NAT Gateway math but responsible for more wasted spend across more accounts than anything else on this list.&lt;/p&gt;

&lt;p&gt;The pattern is consistent: resources get created for a purpose, the purpose ends or changes, the resources remain. Nobody deletes them because nobody owns the cleanup. In a team environment, this compounds — everyone assumes someone else deprovisioned the staging environment from last quarter.&lt;/p&gt;

&lt;p&gt;What I found in a Cost Explorer review of a client account:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unattached EBS volumes&lt;/strong&gt; from terminated EC2 instances: additional (non-root) volumes persist after instance termination by default unless you explicitly set &lt;code&gt;DeleteOnTermination&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outdated RDS snapshots&lt;/strong&gt;: automated snapshots expire with the retention window, but manual snapshots are kept until someone deletes them, so they accumulate long past the retention period you thought you configured&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idle NAT Gateways&lt;/strong&gt; in regions where workloads had been decommissioned — $32/month each, several of them, months after the workloads they served were gone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Old AMIs&lt;/strong&gt; and their associated snapshots — AMIs are easy to create, easy to forget, and each one holds snapshot storage charges indefinitely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these are large individually. Together, across an account that had been running for two years without systematic cleanup, they were meaningful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What fixes this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Build a cleanup policy into your infrastructure practice, not your quarterly review calendar. At minimum:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find unattached EBS volumes&lt;/span&gt;
aws ec2 describe-volumes &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;status,Values&lt;span class="o"&gt;=&lt;/span&gt;available &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table

&lt;span class="c"&gt;# Find snapshots older than 90 days (adjust Owner to your account ID)&lt;/span&gt;
aws ec2 describe-snapshots &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--owner-ids&lt;/span&gt; self &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Snapshots[?StartTime&amp;lt;=`2025-01-01`].{ID:SnapshotId,Size:VolumeSize,Date:StartTime}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table

&lt;span class="c"&gt;# Find NAT Gateways not associated with active route tables&lt;/span&gt;
aws ec2 describe-nat-gateways &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;state,Values&lt;span class="o"&gt;=&lt;/span&gt;available &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'NatGateways[*].{ID:NatGatewayId,VPC:VpcId,Created:CreateTime}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For ongoing governance, enable &lt;strong&gt;AWS Config&lt;/strong&gt; with rules for unattached volumes and idle resources, and use &lt;strong&gt;AWS Cost Anomaly Detection&lt;/strong&gt; — it catches spend pattern changes faster than static billing alerts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a cost anomaly monitor for EC2&lt;/span&gt;
aws ce create-anomaly-monitor &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--anomaly-monitor&lt;/span&gt; &lt;span class="s1"&gt;'{
    "MonitorName": "EC2Monitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tag everything at creation with an owner and a project. Resources without tags in a quarterly audit are candidates for deletion. Make this a policy, not a suggestion.&lt;/p&gt;
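
&lt;p&gt;The audit itself can be as simple as the following sketch. It assumes you have already exported an inventory in which each resource carries a &lt;code&gt;Tags&lt;/code&gt; list of key/value pairs; the &lt;code&gt;Id&lt;/code&gt; field, the sample data, and the required-tag set are hypothetical.&lt;/p&gt;

```python
# Flag resources that are missing required tags. The resource dicts mimic the
# "Tags" list shape used by many AWS describe calls; the "Id" field, the
# sample inventory and the REQUIRED_TAGS set are illustrative assumptions.
REQUIRED_TAGS = {"Owner", "Project"}


def missing_tags(resource):
    """Return the required tag keys that the resource does not carry."""
    present = {tag["Key"] for tag in resource.get("Tags", [])}
    return REQUIRED_TAGS - present


def deletion_candidates(resources):
    """Map resource ID to its missing tag keys, skipping fully tagged ones."""
    return {r["Id"]: sorted(missing_tags(r)) for r in resources if missing_tags(r)}


if __name__ == "__main__":
    inventory = [
        {"Id": "vol-aaa", "Tags": [{"Key": "Owner", "Value": "team-a"},
                                   {"Key": "Project", "Value": "billing"}]},
        {"Id": "vol-bbb", "Tags": [{"Key": "Owner", "Value": "team-a"}]},
        {"Id": "vol-ccc"},
    ]
    for resource_id, missing in deletion_candidates(inventory).items():
        print(resource_id, "is missing", ", ".join(missing))
```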




&lt;h2&gt;
  
  
  Cost Driver 3: Data Transfer Between Availability Zones
&lt;/h2&gt;

&lt;p&gt;This is the most invisible cost driver on the list because it requires no misconfiguration and no forgotten resources. It's the direct result of building the high-availability architecture AWS recommends.&lt;/p&gt;

&lt;p&gt;AWS charges $0.01 per GB for data transferred between Availability Zones within the same region, and it charges in each direction, so every GB that crosses an AZ boundary effectively costs $0.02. This sounds trivial until you map it against what actually moves between AZs in a real distributed system.&lt;/p&gt;

&lt;p&gt;The scenario: a three-tier application deployed across three AZs for availability. Application servers in AZ-A making database calls to RDS in AZ-B. A caching layer in AZ-C that application servers across all three AZs read from. A Kubernetes cluster where pods are scheduled across AZs without affinity rules, meaning a pod in AZ-A routinely calls a service pod in AZ-C. Every one of these cross-AZ calls — database queries, cache reads, inter-service calls — generates data transfer charges.&lt;/p&gt;

&lt;p&gt;At low volume, this is background noise. At production scale, cross-AZ transfer costs can match or exceed your compute costs for data-intensive workloads.&lt;/p&gt;
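
&lt;p&gt;A rough way to put numbers on this is sketched below in Python. The $0.01/GB rate is the one quoted above; the daily traffic volume and the share of calls that cross an AZ boundary are hypothetical inputs you would replace with your own measurements.&lt;/p&gt;

```python
# Cross-AZ transfer cost sketch. The $0.01/GB rate is from the article and is
# billed in each direction; the traffic figures below are hypothetical.
CROSS_AZ_PER_GB_EACH_WAY = 0.01


def cross_az_monthly_cost(gb_per_day, cross_az_fraction, days=30):
    """Monthly charge for the share of traffic that crosses an AZ boundary.

    Each GB is billed twice: once leaving the source AZ and once entering
    the destination AZ.
    """
    cross_az_gb = gb_per_day * cross_az_fraction * days
    return cross_az_gb * CROSS_AZ_PER_GB_EACH_WAY * 2


if __name__ == "__main__":
    # With three AZs and no zone-aware routing, about two thirds of evenly
    # spread calls land in a different AZ than the caller.
    unaware = cross_az_monthly_cost(gb_per_day=2000, cross_az_fraction=2 / 3)
    aware = cross_az_monthly_cost(gb_per_day=2000, cross_az_fraction=0.1)
    print(f"No zone awareness: ${unaware:,.2f}/month")
    print(f"Mostly same-zone:  ${aware:,.2f}/month")
```

&lt;p&gt;With these made-up inputs, the same 2 TB per day of internal traffic costs about $800 a month when spread evenly across zones and about $120 when 90% of calls stay in-zone.&lt;/p&gt;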

&lt;p&gt;&lt;strong&gt;What fixes this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The goal is AZ-aware traffic routing — keeping traffic within the same AZ wherever availability requirements permit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Kubernetes topology-aware routing&lt;/span&gt;
&lt;span class="c1"&gt;# Prefer pods in the same AZ before routing cross-zone&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;service.kubernetes.io/topology-mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Auto"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For EKS specifically, enable &lt;strong&gt;Topology Aware Routing&lt;/strong&gt; and configure pod affinity rules to co-locate services that communicate frequently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;affinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podAffinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;preferredDuringSchedulingIgnoredDuringExecution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
        &lt;span class="na"&gt;podAffinityTerm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;labelSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;matchExpressions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
                &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
                &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;dependent-service&lt;/span&gt;
          &lt;span class="na"&gt;topologyKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;topology.kubernetes.io/zone&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For RDS, use &lt;strong&gt;RDS Proxy&lt;/strong&gt; in the same AZ as your compute where possible, and be deliberate about which AZ your primary instance sits in relative to your application tier.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost Driver 4: S3 Storage and Request Costs
&lt;/h2&gt;

&lt;p&gt;S3 feels cheap because the storage rate is low — $0.023 per GB per month for Standard storage. The request costs are what accumulate unexpectedly.&lt;/p&gt;

&lt;p&gt;S3 charges per API request: $0.0004 per 1,000 GET requests, $0.005 per 1,000 PUT/COPY/POST/LIST requests. These numbers are small. Multiplied by millions of requests per day from an application that wasn't designed with S3 request patterns in mind, they add up.&lt;/p&gt;
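
&lt;p&gt;To see how quickly that multiplication bites, here is a small Python sketch using the request rates quoted above. The event volume and the batch size of 1,000 are hypothetical.&lt;/p&gt;

```python
# S3 request cost sketch using the per-1,000-request rates quoted above.
GET_PER_1000 = 0.0004
PUT_PER_1000 = 0.005   # also applies to COPY, POST and LIST


def request_cost(requests_per_day, rate_per_1000, days=30):
    """Monthly request charge for a steady daily request volume."""
    return requests_per_day * days / 1000 * rate_per_1000


if __name__ == "__main__":
    # Hypothetical logging pipeline: one PUT per event vs. batches of 1,000.
    events_per_day = 50_000_000
    per_event = request_cost(events_per_day, PUT_PER_1000)
    batched = request_cost(events_per_day / 1000, PUT_PER_1000)
    print(f"One PUT per event: ${per_event:,.2f}/month")
    print(f"Batched x1000:     ${batched:,.2f}/month")
```

&lt;p&gt;At 50 million events a day, writing one object per event comes to about $7,500 a month in PUT charges; batching 1,000 events per object brings that to about $7.50.&lt;/p&gt;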

&lt;p&gt;The patterns I've seen generate unexpected S3 costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Application code calling &lt;code&gt;ListObjects&lt;/code&gt; in a loop&lt;/strong&gt; instead of paginating correctly — each List call counts as a request, and tight loops can generate thousands per minute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small file uploads&lt;/strong&gt; — many small PUTs cost more in request charges than fewer large ones, relevant for logging pipelines that write per-event rather than batching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 access logs enabled and writing to the same bucket&lt;/strong&gt; — access logs generate their own requests, which generate more access logs, compounding the request count&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lifecycle policies absent&lt;/strong&gt; — objects in Standard storage that should have transitioned to Infrequent Access or Glacier months ago&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What fixes this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enable S3 Storage Lens at the account level — it gives you per-bucket visibility into request patterns, storage class distribution, and cost drivers without requiring manual investigation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable S3 Storage Lens default dashboard&lt;/span&gt;
aws s3control put-storage-lens-configuration &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--account-id&lt;/span&gt; YOUR_ACCOUNT_ID &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config-id&lt;/span&gt; default &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--storage-lens-configuration&lt;/span&gt; &lt;span class="s1"&gt;'{
    "Id": "default",
    "IsEnabled": true,
    "AccountLevel": {
      "BucketLevel": {}
    }
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add lifecycle policies to every bucket at creation — treat it as a default, not an optimization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transition-to-ia"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Enabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Transitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"Days"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"StorageClass"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"STANDARD_IA"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"Days"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"StorageClass"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GLACIER_IR"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Cost Driver 5: Oversized Instances Running 24/7
&lt;/h2&gt;

&lt;p&gt;This is the simplest cost driver and the one with the most straightforward fix — which is why it's last. Simple doesn't mean small.&lt;/p&gt;

&lt;p&gt;The pattern: instances sized for peak load running continuously at 10-20% utilization. Development and staging environments sized to match production. Instances that were right-sized six months ago for a workload that has since shrunk.&lt;/p&gt;

&lt;p&gt;On a client engagement I reviewed Cost Explorer and found several &lt;code&gt;m5.2xlarge&lt;/code&gt; instances — $0.384/hour, about $276/month each — running continuously at consistently low CPU and memory utilization. They had been provisioned for a load test, the load test had concluded, and the instances had continued running because nobody had a process for decommissioning them after the test.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What fixes this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Enable &lt;strong&gt;AWS Compute Optimizer&lt;/strong&gt; — it analyzes CloudWatch metrics and produces specific right-sizing recommendations with projected savings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get EC2 right-sizing recommendations&lt;/span&gt;
aws compute-optimizer get-ec2-instance-recommendations &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'instanceRecommendations[*].{
    Instance:instanceArn,
    Finding:finding,
    RecommendedType:recommendationOptions[0].instanceType,
    MonthlySavings:recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value
  }'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For non-production environments, implement &lt;strong&gt;instance scheduling&lt;/strong&gt; and stop instances outside working hours. An instance running 8 hours a day instead of 24 costs about 67% less in compute charges (attached EBS volumes are still billed while it is stopped):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# AWS Instance Scheduler via CloudFormation (or use Lambda)&lt;/span&gt;
&lt;span class="c"&gt;# Simple approach: tag-based stop/start with EventBridge&lt;/span&gt;

&lt;span class="c"&gt;# Tag instances for scheduling&lt;/span&gt;
aws ec2 create-tags &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resources&lt;/span&gt; i-xxxxxxxxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tags&lt;/span&gt; &lt;span class="nv"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Schedule,Value&lt;span class="o"&gt;=&lt;/span&gt;office-hours

&lt;span class="c"&gt;# EventBridge rule to stop tagged instances at 7pm IST&lt;/span&gt;
aws events put-rule &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; StopDevInstances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--schedule-expression&lt;/span&gt; &lt;span class="s2"&gt;"cron(30 13 ? * MON-FRI *)"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--state&lt;/span&gt; ENABLED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
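
&lt;p&gt;The savings figure is easy to check with a few lines of Python. This sketch uses the &lt;code&gt;m5.2xlarge&lt;/code&gt; hourly rate quoted earlier and approximates a month as 52/12 weeks; it models compute charges only, and the schedules are examples rather than recommendations.&lt;/p&gt;

```python
# Compute savings from stopping an instance outside working hours.
# The hourly rate is the m5.2xlarge figure quoted earlier; EBS storage for a
# stopped instance is still billed and is not modeled here.
HOURLY_RATE = 0.384


def monthly_compute_cost(hours_per_day, days_per_week=7, weeks_per_month=52 / 12):
    """Approximate monthly compute charge for a given running schedule."""
    return HOURLY_RATE * hours_per_day * days_per_week * weeks_per_month


def savings_fraction(scheduled, always_on):
    """Fraction of the always-on cost saved by the schedule."""
    return 1 - scheduled / always_on


if __name__ == "__main__":
    always_on = monthly_compute_cost(24)
    daily_8h = monthly_compute_cost(8)
    weekdays_8h = monthly_compute_cost(8, days_per_week=5)
    print(f"24/7:           ${always_on:,.2f}")
    print(f"8h every day:   ${daily_8h:,.2f} ({savings_fraction(daily_8h, always_on):.0%} saved)")
    print(f"8h on weekdays: ${weekdays_8h:,.2f} ({savings_fraction(weekdays_8h, always_on):.0%} saved)")
```

&lt;p&gt;Running 8 hours every day saves about 67%. Restricting the schedule to weekdays, as the cron expression above does, pushes that to about 76%.&lt;/p&gt;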






&lt;h2&gt;
  
  
  What Every Discovery Method Taught Me
&lt;/h2&gt;

&lt;p&gt;Every way I've found an AWS cost problem taught me something different.&lt;/p&gt;

&lt;p&gt;The billing alert that fired at 11pm taught me to set thresholds &lt;em&gt;before&lt;/em&gt; I think I need them — at 50%, 80%, and 100% of expected spend, not just at the number that feels alarming.&lt;/p&gt;
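
&lt;p&gt;As a minimal illustration of those thresholds, the Python sketch below turns an expected monthly spend into alert amounts and reports which ones month-to-date spend has already reached. The function names and dollar figures are hypothetical; in practice the amounts would feed whatever budget or alarm tooling you use.&lt;/p&gt;

```python
# Alert thresholds as fractions of expected monthly spend (from the article).
THRESHOLDS = (0.5, 0.8, 1.0)


def alert_amounts(expected_monthly_spend):
    """Return the dollar amount at which each alert should fire."""
    return [round(expected_monthly_spend * t, 2) for t in THRESHOLDS]


def crossed(spend_to_date, expected_monthly_spend):
    """Thresholds (as fractions) already reached by month-to-date spend."""
    return [t for t in THRESHOLDS if spend_to_date >= expected_monthly_spend * t]


if __name__ == "__main__":
    expected = 2000  # hypothetical expected monthly spend in USD
    print("Alert at:", alert_amounts(expected))
    print("Already crossed:", crossed(1700, expected))
```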

&lt;p&gt;The client call on a Monday morning taught me that cost problems in team environments are invisible until they're someone else's problem to escalate. Shared accounts need shared visibility — Cost Explorer access for the whole team, not just the billing owner.&lt;/p&gt;

&lt;p&gt;The routine review that turned into two hours taught me that Cost Explorer by service, checked weekly rather than monthly, surfaces anomalies while they're small. By month end, the pattern has been running for weeks.&lt;/p&gt;

&lt;p&gt;The surprise invoice taught me the most: &lt;strong&gt;the absence of an alert is not the same as the absence of a problem.&lt;/strong&gt; An unmonitored account is a guarantee of eventual surprise.&lt;/p&gt;

&lt;p&gt;The actual lesson across all of them is the same: &lt;strong&gt;AWS billing is an observability problem.&lt;/strong&gt; The same discipline you apply to application monitoring — alerts, anomaly detection, dashboards, regular review — applies to your cloud spend. Without it, cost issues are invisible until they're on an invoice.&lt;/p&gt;

&lt;p&gt;The AWS services that generate surprising costs are almost always working exactly as documented. The surprise comes from not modeling the billing implications before the architecture is built, and not monitoring spend with the same rigor as uptime.&lt;/p&gt;

&lt;p&gt;Model the billing first. Monitor it like production. Build the architecture second.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference: The AWS Cost Governance Checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VPC Endpoints&lt;/strong&gt; for S3, ECR, CloudWatch, Secrets Manager — eliminate NAT Gateway processing for AWS service traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Billing alerts&lt;/strong&gt; at 50%, 80%, 100% of monthly budget threshold&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Anomaly Detection&lt;/strong&gt; enabled at account level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Config rules&lt;/strong&gt; for unattached EBS volumes and idle resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topology Aware Routing&lt;/strong&gt; on EKS to minimize cross-AZ data transfer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 lifecycle policies&lt;/strong&gt; on every bucket at creation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute Optimizer&lt;/strong&gt; enabled — review recommendations monthly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance scheduling&lt;/strong&gt; for all non-production environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mandatory tagging policy&lt;/strong&gt; — Owner, Project, Environment on every resource&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Have you been hit by an unexpected AWS bill? I'd genuinely like to know which service surprised you most — drop it in the comments.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>aws</category>
      <category>serverless</category>
      <category>tutorial</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Every DevOps engineer has hit this. Works in Docker, breaks in Kubernetes — no clear error, no obvious reason. Here are the 5 assumptions your container is silently making that Kubernetes won't tolerate.</title>
      <dc:creator>Sumit Gautam</dc:creator>
      <pubDate>Mon, 04 May 2026 03:41:16 +0000</pubDate>
      <link>https://dev.to/sumit_gautam_379d5/every-devops-engineer-has-hit-this-works-in-docker-breaks-in-kubernetes-no-clear-error-no-1l26</link>
      <guid>https://dev.to/sumit_gautam_379d5/every-devops-engineer-has-hit-this-works-in-docker-breaks-in-kubernetes-no-clear-error-no-1l26</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/sumit_gautam_379d5/why-your-docker-container-works-locally-but-fails-in-kubernetes-3ced" class="crayons-story__hidden-navigation-link"&gt;Why Your Docker Container Works Locally But Fails in Kubernetes&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/sumit_gautam_379d5" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3272778%2F2baa36dc-d5b2-4ccb-a61c-f2f4799695d2.png" alt="sumit_gautam_379d5 profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/sumit_gautam_379d5" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Sumit Gautam
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Sumit Gautam
                
              
              &lt;div id="story-author-preview-content-3598547" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/sumit_gautam_379d5" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3272778%2F2baa36dc-d5b2-4ccb-a61c-f2f4799695d2.png" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Sumit Gautam&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/sumit_gautam_379d5/why-your-docker-container-works-locally-but-fails-in-kubernetes-3ced" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;May 2&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/sumit_gautam_379d5/why-your-docker-container-works-locally-but-fails-in-kubernetes-3ced" id="article-link-3598547"&gt;
          Why Your Docker Container Works Locally But Fails in Kubernetes
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/webdev"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;webdev&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/tutorial"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;tutorial&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/beginners"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;beginners&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/programming"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;programming&lt;/a&gt;
        &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
    </item>
    <item>
      <title>Why Your Docker Container Works Locally But Fails in Kubernetes</title>
      <dc:creator>Sumit Gautam</dc:creator>
      <pubDate>Sat, 02 May 2026 05:15:46 +0000</pubDate>
      <link>https://dev.to/sumit_gautam_379d5/why-your-docker-container-works-locally-but-fails-in-kubernetes-3ced</link>
      <guid>https://dev.to/sumit_gautam_379d5/why-your-docker-container-works-locally-but-fails-in-kubernetes-3ced</guid>
      <description>&lt;p&gt;&lt;em&gt;It's not Kubernetes being difficult. It's the assumptions your container was making that Docker quietly satisfied — and Kubernetes doesn't.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You've been here before.&lt;/p&gt;

&lt;p&gt;The container runs perfectly on your laptop. &lt;code&gt;docker run&lt;/code&gt; works. The app responds. Logs look clean. You push it to your managed Kubernetes cluster — EKS, GKE, AKS, take your pick — and something breaks. The pod crashes with no useful logs. Or it starts, passes health checks, and returns wrong responses. Or it worked fine in staging and silently fails in production despite identical manifests.&lt;/p&gt;

&lt;p&gt;This isn't bad luck. It's a specific and repeatable class of problem: &lt;strong&gt;your container was built with implicit assumptions about its runtime environment, and Docker satisfies those assumptions automatically while Kubernetes does not.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Docker on your laptop is a generous host. Your &lt;code&gt;docker run&lt;/code&gt; flags and Compose files quietly feed it environment variables from your shell, containers run as root unless the image says otherwise, published ports make every dependency reachable on &lt;code&gt;localhost&lt;/code&gt;, and with no resource flags a container can use as much memory and CPU as the host has. Kubernetes is a strict host. It enforces isolation, applies resource constraints, manages networking through its own abstraction layer, and runs containers in a security context that may differ significantly from what you tested locally.&lt;/p&gt;

&lt;p&gt;Every mismatch between those two environments is a potential failure. Here are the ones I've personally hit — and exactly how to close each gap.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure 1: Environment Variables and Secrets That Exist Locally But Not in the Cluster
&lt;/h2&gt;

&lt;p&gt;This is the most common failure and the hardest to diagnose because the error it produces is almost never "environment variable missing." It's usually a downstream failure — a database connection refused, an API call returning 401, a feature that behaves as if it's in the wrong mode.&lt;/p&gt;

&lt;p&gt;Locally, your container picks up environment variables from your &lt;code&gt;.env&lt;/code&gt; file, your &lt;code&gt;docker-compose.yml&lt;/code&gt;, and whatever shell variables those pass through. You've set these up once and forgotten about them. In Kubernetes, none of that exists. The pod gets exactly what you put in the manifest, nothing more.&lt;/p&gt;

&lt;p&gt;The failure pattern I've seen most in EKS environments: an application that uses the AWS SDK works locally because the developer's IAM credentials in &lt;code&gt;~/.aws/credentials&lt;/code&gt; are mounted into the container or passed in as environment variables. In EKS, those credentials don't exist, so the pod needs an IAM role attached via a service account. The app starts, the pod is Running, health checks pass, and every AWS API call fails, often with permission errors that look like application bugs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What catches this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Always run an environment audit before moving to Kubernetes. Start the container locally with a completely clean environment — no &lt;code&gt;.env&lt;/code&gt; file, no inherited shell variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run outside Compose with an empty env file: only the image's own ENV defaults apply&lt;/span&gt;
docker run &lt;span class="nt"&gt;--env-file&lt;/span&gt; /dev/null myapp:latest

&lt;span class="c"&gt;# Or explicitly pass only what Kubernetes will provide&lt;/span&gt;
docker run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;DB_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;localhost &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;APP_ENV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;production &lt;span class="se"&gt;\&lt;/span&gt;
  myapp:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it breaks locally with a clean environment, it will break in Kubernetes. Fix it before it gets there.&lt;/p&gt;
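&lt;p&gt;One way to turn a missing variable into an explicit error instead of a downstream symptom is to validate the environment at startup. The following is a minimal Python sketch, not something taken from the incidents above; the variable names in &lt;code&gt;REQUIRED_VARS&lt;/code&gt; and both helper names are hypothetical:&lt;/p&gt;

```python
import os

# Hypothetical list: replace with the variables your application actually reads.
REQUIRED_VARS = ("DB_HOST", "APP_ENV", "DB_PASSWORD")


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


def require_env(required, environ=None):
    """Raise a descriptive error if any required variable is missing."""
    missing = missing_env_vars(required, environ)
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
```

&lt;p&gt;Calling &lt;code&gt;require_env(REQUIRED_VARS)&lt;/code&gt; as the first step of the entrypoint makes a misconfigured pod exit immediately with a message that names the missing variables, rather than failing later with a connection refused or a 401.&lt;/p&gt;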

&lt;p&gt;For secrets in managed clusters, use the platform's native secret injection — AWS Secrets Manager with External Secrets Operator on EKS, GCP Secret Manager on GKE — rather than baking secrets into ConfigMaps or manifests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# External Secrets Operator pattern for EKS&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ExternalSecret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-secrets&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;refreshInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1h&lt;/span&gt;
  &lt;span class="na"&gt;secretStoreRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-secrets-manager&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterSecretStore&lt;/span&gt;
  &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-secrets&lt;/span&gt;
  &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DB_PASSWORD&lt;/span&gt;
      &lt;span class="na"&gt;remoteRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod/myapp/db&lt;/span&gt;
        &lt;span class="na"&gt;property&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;password&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For IAM authentication specifically on EKS, use IRSA (IAM Roles for Service Accounts) — not instance profiles, not hardcoded credentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-sa&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::ACCOUNT_ID:role/myapp-role&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Failure 2: Resource Limits Causing OOMKill and CPU Throttling
&lt;/h2&gt;

&lt;p&gt;This one presents as the most confusing failure because the symptoms look like application bugs, not infrastructure problems.&lt;/p&gt;

&lt;p&gt;OOMKill: the pod runs for a few minutes, then gets killed. No error in application logs because the process was killed before it could write one. &lt;code&gt;kubectl describe pod&lt;/code&gt; shows &lt;code&gt;OOMKilled&lt;/code&gt; in the last state, but only if you look at the right time: the last state records just the most recent termination, and it is gone entirely once the pod is deleted or replaced. Miss the window and you're debugging a ghost.&lt;/p&gt;

&lt;p&gt;CPU throttling: the pod runs, the application responds, but it's slow. Intermittently slow in ways that don't correlate with traffic. This is the cgroup CPU quota applying: your container has a CPU limit of 200m, hit a burst, and the kernel is enforcing that limit. Locally, &lt;code&gt;docker run&lt;/code&gt; with no resource flags gives the container your full machine's CPU. In Kubernetes with limits set, the container gets at most what the limit allows, which may be far less than it needs under load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What catches this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Never set resource limits in Kubernetes without first understanding your container's actual consumption profile. Run it under realistic load and measure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Current per-container usage (requires metrics-server)&lt;/span&gt;
kubectl top pod myapp-pod &lt;span class="nt"&gt;--containers&lt;/span&gt;

&lt;span class="c"&gt;# Compare current usage across replicas, highest memory first&lt;/span&gt;
kubectl top pods &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myapp &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set requests and limits based on observed data, not guesses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;256Mi"&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;250m"&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512Mi"&lt;/span&gt;
    &lt;span class="c1"&gt;# Consider not setting CPU limits, only requests&lt;/span&gt;
    &lt;span class="c1"&gt;# CPU limits cause throttling; CPU requests drive scheduling&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A pattern worth adopting in production: set memory limits (OOMKill is preferable to a node going down) but be cautious about setting CPU limits at all. CPU throttling degrades performance silently; it doesn't crash the pod, so it's far harder to detect. Use CPU requests for scheduling, and monitor actual CPU usage separately.&lt;/p&gt;
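&lt;p&gt;Because throttling never shows up as a crash, one way to make it visible is to read the container's own cgroup counters. The sketch below assumes cgroup v2, where &lt;code&gt;/sys/fs/cgroup/cpu.stat&lt;/code&gt; exposes &lt;code&gt;nr_periods&lt;/code&gt; and &lt;code&gt;nr_throttled&lt;/code&gt;; the function names are hypothetical, and on cgroup v1 nodes the file lives at a different path:&lt;/p&gt;

```python
from pathlib import Path

CPU_STAT_PATH = Path("/sys/fs/cgroup/cpu.stat")  # assumption: cgroup v2 layout


def parse_cpu_stat(text):
    """Parse 'key value' lines from cpu.stat into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # counters absent or no periods recorded yet
    return stats.get("nr_throttled", 0) / periods


if __name__ == "__main__":
    if CPU_STAT_PATH.exists():
        ratio = throttled_ratio(parse_cpu_stat(CPU_STAT_PATH.read_text()))
        print(f"throttled in {ratio:.1%} of periods")
    else:
        print("cpu.stat not found; this node may use cgroup v1")
```

&lt;p&gt;A ratio that climbs under load while the application slows down is a strong hint that the CPU limit, not the code, is the bottleneck.&lt;/p&gt;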

&lt;p&gt;For OOMKill diagnosis, always check the pod's last state immediately after a crash:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe pod myapp-pod | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 10 &lt;span class="s2"&gt;"Last State"&lt;/span&gt;
&lt;span class="c"&gt;# Look for: Reason: OOMKilled&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Failure 3: Networking and Service Discovery Failures
&lt;/h2&gt;

&lt;p&gt;Locally, your microservices talk to each other via &lt;code&gt;localhost&lt;/code&gt; or hostnames defined in &lt;code&gt;docker-compose&lt;/code&gt;. In Kubernetes, &lt;code&gt;localhost&lt;/code&gt; refers to the pod itself — not other services. Service discovery works through DNS, and that DNS only resolves correctly if your service names, namespaces, and selectors are configured precisely.&lt;/p&gt;

&lt;p&gt;The failure I've hit most: an application configured to connect to &lt;code&gt;localhost:5432&lt;/code&gt; for its database, which is perfectly valid in a local setup where the app shares a network namespace with the database or reaches it through a port published on the host. In Kubernetes, that connection attempt hits the pod's own loopback interface and fails immediately. The error looks like a database connection failure, not a networking misconfiguration.&lt;/p&gt;

&lt;p&gt;The staging-to-production variant: services work in staging because everything is in the default namespace and short DNS names resolve. In production, where services are spread across namespaces, &lt;code&gt;myservice&lt;/code&gt; doesn't resolve from a pod in a different namespace, but &lt;code&gt;myservice.production.svc.cluster.local&lt;/code&gt; does. The same manifest, different namespace, different DNS behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What catches this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Replace all &lt;code&gt;localhost&lt;/code&gt; service references with Kubernetes DNS names before deploying. The full DNS format is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;service-name&amp;gt;.&amp;lt;namespace&amp;gt;.svc.cluster.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For services in the same namespace, the short name works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DB_HOST&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres-service"&lt;/span&gt;  &lt;span class="c1"&gt;# same namespace&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AUTH_SERVICE_URL&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://auth-service.auth-namespace.svc.cluster.local"&lt;/span&gt;  &lt;span class="c1"&gt;# cross-namespace&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
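&lt;p&gt;The same naming rule can be expressed in application code. A minimal Python sketch, assuming the default cluster domain &lt;code&gt;cluster.local&lt;/code&gt; (some clusters override it) and using hypothetical helper names, that builds the fully qualified name and checks whether a name resolves from wherever the code runs:&lt;/p&gt;

```python
import socket


def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Build the fully qualified DNS name of a Kubernetes Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def resolves(hostname):
    """Return True if the name resolves from the current network context."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False
```

&lt;p&gt;For example, &lt;code&gt;resolves(service_fqdn("auth-service", "auth-namespace"))&lt;/code&gt; only gives a meaningful answer when it runs inside a pod, since the cluster's DNS is what knows these names.&lt;/p&gt;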



&lt;p&gt;Debug DNS resolution from inside the pod — not from your laptop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Exec into the pod and test DNS directly&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; myapp-pod &lt;span class="nt"&gt;--&lt;/span&gt; nslookup postgres-service
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; myapp-pod &lt;span class="nt"&gt;--&lt;/span&gt; curl &lt;span class="nt"&gt;-v&lt;/span&gt; http://postgres-service:5432

&lt;span class="c"&gt;# If nslookup fails, check CoreDNS&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="nt"&gt;-l&lt;/span&gt; k8s-app&lt;span class="o"&gt;=&lt;/span&gt;kube-dns
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Network policies are the other common gotcha in production managed clusters. Hardened EKS and GKE setups often add default-deny network policies that staging never had. A service that communicates freely in staging can be silently blocked in production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Explicit ingress policy — don't rely on default-allow&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;allow-myapp-ingress&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
  &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
      &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Failure 4: Readiness and Liveness Probes Misconfigured
&lt;/h2&gt;

&lt;p&gt;This failure is subtle because it's the Kubernetes layer doing exactly what you told it to do — you just told it the wrong thing.&lt;/p&gt;

&lt;p&gt;A liveness probe that's too aggressive will kill a pod that's healthy but slow to start, especially JVM applications, Python apps loading large models, or anything with a meaningful initialization phase. The pod starts, Kubernetes begins probing at second 10, gets no response because the app isn't ready yet, and after a few consecutive failures kills the container. CrashLoopBackOff. The app never had a chance to run.&lt;/p&gt;

&lt;p&gt;A readiness probe that's too lenient — or missing entirely — sends traffic to pods that aren't ready. The service shows endpoints, requests route to the new pod, and users get errors during the rollout window.&lt;/p&gt;

&lt;p&gt;Locally, neither of these exists. Docker runs your container and leaves it alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What catches this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Configure &lt;code&gt;initialDelaySeconds&lt;/code&gt; generously on liveness probes — always longer than your slowest observed startup time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/healthz&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;    &lt;span class="c1"&gt;# give the app time to start&lt;/span&gt;
  &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;timeoutSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;

&lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/ready&lt;/span&gt;              &lt;span class="c1"&gt;# separate endpoint from liveness&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use separate endpoints for liveness and readiness. &lt;code&gt;/healthz&lt;/code&gt; for liveness should return 200 as long as the process is alive and not deadlocked. &lt;code&gt;/ready&lt;/code&gt; for readiness should verify the application can actually serve traffic — database connected, cache warm, dependencies reachable.&lt;/p&gt;
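&lt;p&gt;A minimal sketch of that split, using only Python's standard library. The &lt;code&gt;AppState&lt;/code&gt; class and &lt;code&gt;make_handler&lt;/code&gt; helper are hypothetical names, and a real service would set the readiness flag only after its own dependency checks pass:&lt;/p&gt;

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class AppState:
    """Holds a flag the app sets once its dependencies are reachable."""

    def __init__(self):
        self.ready = threading.Event()


def make_handler(state):
    """Build a request handler class bound to the given AppState."""

    class ProbeHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/healthz":
                # Liveness: the process is up and able to answer.
                self._reply(200, b"ok")
            elif self.path == "/ready":
                # Readiness: accept traffic only once initialization is done.
                if state.ready.is_set():
                    self._reply(200, b"ready")
                else:
                    self._reply(503, b"not ready")
            else:
                self._reply(404, b"not found")

        def _reply(self, status, body):
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep probe traffic out of the application log

    return ProbeHandler
```

&lt;p&gt;To use it, start &lt;code&gt;ThreadingHTTPServer(("0.0.0.0", 8080), make_handler(state))&lt;/code&gt; in a background thread before initialization begins, then call &lt;code&gt;state.ready.set()&lt;/code&gt; once the database and other dependencies respond. Until then &lt;code&gt;/healthz&lt;/code&gt; returns 200 while &lt;code&gt;/ready&lt;/code&gt; returns 503, so the pod stays alive but receives no traffic.&lt;/p&gt;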




&lt;h2&gt;
  
  
  Failure 5: File Permissions and Volume Mount Issues
&lt;/h2&gt;

&lt;p&gt;Locally, your Docker container typically runs as root or as your user — whichever the Dockerfile specifies, with no external enforcement. In managed Kubernetes clusters, particularly on GKE Autopilot and hardened EKS configurations, pods run with &lt;code&gt;runAsNonRoot: true&lt;/code&gt; enforced at the namespace or cluster level. If your container expects to write to &lt;code&gt;/app/logs&lt;/code&gt; or &lt;code&gt;/tmp/cache&lt;/code&gt; as root, it silently fails or crashes with a permission error that's easy to misread.&lt;/p&gt;

&lt;p&gt;Volume mounts compound this. A host directory bind-mounted into a local Docker container has no counterpart on a managed cluster's nodes, where &lt;code&gt;hostPath&lt;/code&gt; volumes are frequently blocked by policy. A persistent volume mounted at &lt;code&gt;/app/data&lt;/code&gt; is often owned by root unless you explicitly set &lt;code&gt;fsGroup&lt;/code&gt;, meaning a container running as a non-root user can't write to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What catches this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Always set an explicit security context and test against it locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
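&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                  &lt;span class="c1"&gt;# pod-level settings&lt;/span&gt;
    &lt;span class="na"&gt;runAsNonRoot&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;runAsUser&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
    &lt;span class="na"&gt;runAsGroup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
    &lt;span class="na"&gt;fsGroup&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;             &lt;span class="c1"&gt;# ensures volume mounts are group-writable&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
      &lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;            &lt;span class="c1"&gt;# container-level setting&lt;/span&gt;
        &lt;span class="na"&gt;readOnlyRootFilesystem&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="c1"&gt;# force explicit volume declarations&lt;/span&gt;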
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in your Dockerfile, match the user:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
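&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pin UID/GID 1000 so the image matches runAsUser/runAsGroup in the manifest&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;addgroup &lt;span class="nt"&gt;--system&lt;/span&gt; &lt;span class="nt"&gt;--gid&lt;/span&gt; 1000 appgroup &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; adduser &lt;span class="nt"&gt;--system&lt;/span&gt; &lt;span class="nt"&gt;--uid&lt;/span&gt; 1000 &lt;span class="nt"&gt;--ingroup&lt;/span&gt; appgroup appuser
&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; appuser:appgroup /app
&lt;span class="c"&gt;# A numeric USER lets Kubernetes verify runAsNonRoot&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; 1000:1000&lt;/span&gt;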
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test this locally before pushing to the cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--user&lt;/span&gt; 1000:1000 &lt;span class="nt"&gt;--read-only&lt;/span&gt; myapp:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it fails locally with these constraints, it will fail in Kubernetes. Fix the permissions at the image level, not with cluster-level workarounds.&lt;/p&gt;
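&lt;p&gt;The same check can run inside the container at startup, so a permissions problem surfaces as one clear message instead of a crash deep in the application. A minimal Python sketch; the directory list and function name are hypothetical:&lt;/p&gt;

```python
import tempfile

# Hypothetical: the directories this application needs to write to.
WRITABLE_DIRS = ("/app/logs", "/tmp/cache")


def unwritable_dirs(paths):
    """Return (path, reason) pairs for directories the current user cannot write to."""
    failures = []
    for path in paths:
        try:
            # Creating and removing a temp file exercises the real permission
            # check, including read-only filesystems and volume ownership.
            with tempfile.NamedTemporaryFile(dir=path):
                pass
        except OSError as exc:
            failures.append((path, exc.strerror or str(exc)))
    return failures


if __name__ == "__main__":
    for path, reason in unwritable_dirs(WRITABLE_DIRS):
        print(f"cannot write to {path}: {reason}")
```

&lt;p&gt;Run under &lt;code&gt;docker run --user 1000:1000 --read-only&lt;/code&gt;, this lists every path that still needs a volume or an ownership fix in the image.&lt;/p&gt;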




&lt;h2&gt;
  
  
  The Underlying Pattern
&lt;/h2&gt;

&lt;p&gt;Every failure above follows the same structure: Docker locally is permissive by default, Kubernetes in production is restrictive by design.&lt;/p&gt;

&lt;p&gt;This isn't a Kubernetes flaw. Isolation, resource enforcement, and security contexts exist for good reasons in multi-tenant managed clusters. The problem is that the permissive local environment creates invisible dependencies — on inherited environment variables, on unrestricted resources, on root file access — that your container never had to explicitly declare.&lt;/p&gt;

&lt;p&gt;The fix isn't to make Kubernetes more permissive. It's to make your container honest about what it needs.&lt;/p&gt;

&lt;p&gt;Build containers that declare their requirements explicitly: environment variables, resource requests, security context, health check endpoints, DNS-based service addressing. Test them under production-like constraints before they reach the cluster. When a container works locally and fails in Kubernetes, the question isn't "what's wrong with Kubernetes" — it's "what assumption was my container making that I didn't know about."&lt;/p&gt;

&lt;p&gt;Kubernetes just makes those assumptions visible. Usually at the worst possible time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference: The Local-to-Kubernetes Readiness Checklist
&lt;/h2&gt;

&lt;p&gt;Before promoting any container from local Docker to a managed Kubernetes cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Environment audit&lt;/strong&gt; — run locally with clean environment, no inherited shell variables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM/credentials&lt;/strong&gt; — no local credential files; use IRSA or Workload Identity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource profiling&lt;/strong&gt; — measure actual CPU and memory under load before setting limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DNS references&lt;/strong&gt; — replace all &lt;code&gt;localhost&lt;/code&gt; with Kubernetes service DNS names&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Probe configuration&lt;/strong&gt; — separate liveness/readiness endpoints, generous &lt;code&gt;initialDelaySeconds&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security context&lt;/strong&gt; — test with &lt;code&gt;runAsNonRoot: true&lt;/code&gt; and &lt;code&gt;readOnlyRootFilesystem: true&lt;/code&gt; locally&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volume permissions&lt;/strong&gt; — set &lt;code&gt;fsGroup&lt;/code&gt; on all writable volume mounts&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;What's the most confusing Docker-to-Kubernetes failure you've debugged? Drop it in the comments — the weirder the better.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>webdev</category>
      <category>tutorial</category>
      <category>beginners</category>
      <category>programming</category>
    </item>
    <item>
      <title>The CI/CD Pipeline That Looked Fine But Was Silently Failing</title>
      <dc:creator>Sumit Gautam</dc:creator>
      <pubDate>Wed, 22 Apr 2026 06:26:43 +0000</pubDate>
      <link>https://dev.to/sumit_gautam_379d5/the-cicd-pipeline-that-looked-fine-but-was-silently-failing-33oe</link>
      <guid>https://dev.to/sumit_gautam_379d5/the-cicd-pipeline-that-looked-fine-but-was-silently-failing-33oe</guid>
      <description>&lt;p&gt;&lt;em&gt;Everything was green. The deployment succeeded. Production was broken for hours. Here's what I learned.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;There's a specific kind of production incident that's worse than an outage.&lt;/p&gt;

&lt;p&gt;An outage is loud. Alerts fire, dashboards go red, everyone knows something is wrong. You fix it.&lt;/p&gt;

&lt;p&gt;The silent failure is different. The pipeline is green. The deployment says "successful." No alerts fire. And somewhere in production, the wrong code is quietly running — serving stale responses, skipping validation, behaving in ways that don't match what you just merged. Nobody knows yet.&lt;/p&gt;

&lt;p&gt;I've been on the wrong end of this more than once. Wrong Docker images deployed due to layer caching. Tests marked as passed that never actually ran. Environment variables from staging quietly bleeding into production. A deployment that reported success while the old version kept serving traffic because the agent never actually finished the job.&lt;/p&gt;

&lt;p&gt;Each time, the CI/CD dashboard looked fine. That's what made it dangerous.&lt;/p&gt;

&lt;p&gt;This article is about what green pipelines hide — and the specific verification habits that catch these failures before your users do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure 1: The Docker Cache That Deployed Yesterday's Code
&lt;/h2&gt;

&lt;p&gt;This one is subtle enough that it can fool you completely if you're not looking for it.&lt;/p&gt;

&lt;p&gt;The scenario: you push a fix, the pipeline runs, the Docker build completes in 12 seconds instead of the usual 4 minutes. You don't think much of it — fast builds are good, right? Deployment succeeds. You check the service, it seems to respond. You close the laptop.&lt;/p&gt;

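&lt;p&gt;What actually happened: Docker's layer cache served layers from a previous build. The &lt;code&gt;COPY . .&lt;/code&gt; instruction keys its cache on a checksum of the copied files' contents and metadata (modification times are ignored), so a cache hit there means the build context never contained your change. The usual culprits in CI are a reused workspace that wasn't updated to the new commit, a &lt;code&gt;.dockerignore&lt;/code&gt; rule that excludes the changed file, or a build pointed at the wrong context directory. Either way, the image that got deployed was built from code that predated your fix.&lt;/p&gt;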

&lt;p&gt;The dangerous part is that the build log &lt;em&gt;looks&lt;/em&gt; correct. You see your Dockerfile steps. You see layer hashes. Nothing screams "wrong image."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What catches this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Always embed the Git commit SHA into your image at build time and verify it at deploy time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;ARG&lt;/span&gt;&lt;span class="s"&gt; GIT_COMMIT=unknown&lt;/span&gt;
&lt;span class="k"&gt;LABEL&lt;/span&gt;&lt;span class="s"&gt; git-commit=$GIT_COMMIT&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; GIT_COMMIT=$GIT_COMMIT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GitHub Actions&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build image&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;docker build \&lt;/span&gt;
      &lt;span class="s"&gt;--build-arg GIT_COMMIT=${{ github.sha }} \&lt;/span&gt;
      &lt;span class="s"&gt;--no-cache \&lt;/span&gt;
      &lt;span class="s"&gt;-t myapp:${{ github.sha }} .&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then expose this via a &lt;code&gt;/healthz&lt;/code&gt; or &lt;code&gt;/version&lt;/code&gt; endpoint in your application and verify it immediately post-deployment. If the SHA in the running container doesn't match the SHA that triggered the pipeline — you have a problem, and you know it within seconds, not hours.&lt;/p&gt;
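
&lt;p&gt;As a minimal sketch of that check, assuming the service answers with JSON that carries a &lt;code&gt;commit&lt;/code&gt; field (the field name and the helper &lt;code&gt;commit_matches&lt;/code&gt; are illustrative assumptions, not part of any framework), the comparison might look like this in Python:&lt;/p&gt;

```python
import json


def commit_matches(version_payload, expected_sha):
    """Return True when the running service reports the commit we just built.

    version_payload: raw JSON text from a hypothetical /version endpoint,
    assumed to look like {"commit": "a3f8c21d..."}.
    expected_sha: the full commit SHA that triggered the pipeline.
    """
    reported = json.loads(version_payload).get("commit", "")
    # Accept an abbreviated SHA from the service, but never an empty or
    # implausibly short one, which would prefix-match almost anything.
    if len(reported) not in range(7, 41):
        return False
    return expected_sha.startswith(reported)
```

&lt;p&gt;A pipeline step would fetch the endpoint, pass the body and the triggering SHA to this function, and exit non-zero when it returns &lt;code&gt;False&lt;/code&gt;. Rejecting short values also catches the case where the &lt;code&gt;unknown&lt;/code&gt; default from the Dockerfile leaked through because the build argument was never passed.&lt;/p&gt;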

&lt;p&gt;For builds where you intentionally use caching for speed, use &lt;code&gt;--cache-from&lt;/code&gt; with explicit cache sources rather than relying on local daemon cache. This gives you cache benefits with predictable, auditable behavior.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure 2: Tests That Were Skipped But Reported Green
&lt;/h2&gt;

&lt;p&gt;This is the one that genuinely shook my confidence in pipelines for a while.&lt;/p&gt;

&lt;p&gt;The scenario: a test suite that passed every run for weeks. No failures, consistent timing. Then a bug reaches production that the tests should have caught — and when you investigate, you find that the test step exited with code &lt;code&gt;0&lt;/code&gt; (success) without actually running the tests. The framework had a configuration issue, found no test files matching the pattern, reported "0 tests run, 0 failures" and exited cleanly.&lt;/p&gt;

&lt;p&gt;Zero failures. Zero tests. Green.&lt;/p&gt;

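&lt;p&gt;Whether an empty run counts as a failure depends on the runner and how it's invoked. pytest exits with code 5 when it collects nothing, and Jest fails unless &lt;code&gt;--passWithNoTests&lt;/code&gt; is set, but that signal is easy to swallow: a wrapper script that ends in &lt;code&gt;|| true&lt;/code&gt;, a CI step that only reacts to specific exit codes, or a build tool such as Maven Surefire, which by default passes when it finds no tests. A run where every test is skipped also exits &lt;code&gt;0&lt;/code&gt;. The tools aren't broken. They did exactly what you asked. You just didn't ask them to verify they ran &lt;em&gt;something&lt;/em&gt;.&lt;/p&gt;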

&lt;p&gt;&lt;strong&gt;What catches this:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GitHub Actions with pytest&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;pytest --tb=short -q&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Verify tests actually ran&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
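    &lt;span class="s"&gt;# Empty grep output (e.g. "no tests collected") must count as 0, not crash the check&lt;/span&gt;
    &lt;span class="s"&gt;COUNT=$(pytest --collect-only -q 2&amp;gt;&amp;amp;1 | tail -1 | grep -oP '^\d+' || true)&lt;/span&gt;
    &lt;span class="s"&gt;if [ "${COUNT:-0}" -lt 10 ]; then&lt;/span&gt;
      &lt;span class="s"&gt;echo "ERROR: Expected at least 10 tests, found ${COUNT:-0}"&lt;/span&gt;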
      &lt;span class="s"&gt;exit 1&lt;/span&gt;
    &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



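&lt;p&gt;Add a minimum test count gate to your pipeline. It feels paranoid until the day it saves you. Also tighten the framework's own configuration so that misconfiguration fails loudly instead of silently shrinking the run. In pytest, &lt;code&gt;--strict-markers&lt;/code&gt; turns a mistyped marker into an error and &lt;code&gt;--strict-config&lt;/code&gt; does the same for unknown keys in the config file. In Jest, pin &lt;code&gt;passWithNoTests&lt;/code&gt; to &lt;code&gt;false&lt;/code&gt; so nobody quietly flips it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# pytest.ini
&lt;/span&gt;&lt;span class="nn"&gt;[pytest]&lt;/span&gt;
&lt;span class="py"&gt;addopts&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;--strict-markers --strict-config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;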

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;jest.config.js&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"passWithNoTests"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The principle: &lt;strong&gt;a pipeline step that can succeed by doing nothing is a liability.&lt;/strong&gt;&lt;/p&gt;
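
&lt;p&gt;A count of collected tests still passes when every test is skipped. As a hedged sketch, assuming the pipeline captures the last line of pytest's output (something like &lt;code&gt;12 passed, 3 skipped in 0.41s&lt;/code&gt;), a gate can count only the tests that actually executed. The helper names &lt;code&gt;executed_test_count&lt;/code&gt; and &lt;code&gt;shortfall&lt;/code&gt; are hypothetical:&lt;/p&gt;

```python
import re

# Outcomes that mean a test body actually ran. Skipped and deselected
# tests are deliberately not counted.
EXECUTED_OUTCOMES = ("passed", "failed", "error", "errors")


def executed_test_count(summary_line):
    """Sum the executed-test counts in a pytest-style summary line."""
    total = 0
    for count, outcome in re.findall(r"(\d+) (\w+)", summary_line):
        if outcome in EXECUTED_OUTCOMES:
            total += int(count)
    return total


def shortfall(summary_line, minimum):
    """Return how many executed tests are missing relative to the minimum."""
    return max(0, minimum - executed_test_count(summary_line))
```

&lt;p&gt;The step fails whenever &lt;code&gt;shortfall&lt;/code&gt; is non-zero. Parsing a human-readable summary is fragile across versions, so treat this as an illustration of the idea and prefer a machine-readable report if your runner produces one.&lt;/p&gt;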




&lt;h2&gt;
  
  
  Failure 3: The Wrong Environment Variables in Production
&lt;/h2&gt;

&lt;p&gt;This failure is almost embarrassingly simple — which is exactly why it happens.&lt;/p&gt;

&lt;p&gt;The scenario: a deployment to production uses a configuration value from staging. A database connection string, an API endpoint, a feature flag threshold. The application starts fine because the staging value is valid — it just points somewhere wrong. The service runs, the pipeline is green, and for hours your production traffic is quietly hitting staging infrastructure or using misconfigured limits.&lt;/p&gt;

&lt;p&gt;In a Jenkins multi-environment setup, this often happens when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Environment-specific credential bindings aren't properly scoped to the deployment stage&lt;/li&gt;
&lt;li&gt;A previous build's workspace has leftover &lt;code&gt;.env&lt;/code&gt; files&lt;/li&gt;
&lt;li&gt;Variable precedence between pipeline parameters, Jenkins credentials, and application defaults isn't clearly understood&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What catches this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, never rely on implicit environment variable inheritance in pipelines. Be explicit and loud about what each stage receives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight groovy"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Jenkinsfile&lt;/span&gt;
&lt;span class="n"&gt;stage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Deploy Production'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;environment&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;APP_ENV&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'production'&lt;/span&gt;
    &lt;span class="n"&gt;DB_HOST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;credentials&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'prod-db-host'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;sh&lt;/span&gt; &lt;span class="s1"&gt;'''
      echo "Deploying to: $APP_ENV"
      # POSIX-safe prefix: the ${VAR:0:8} form is a bashism and fails when sh is dash
      echo "DB host prefix: $(printf '%.8s' "$DB_HOST")..."
      ./deploy.sh
    '''&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Second, add a post-deployment verification step that queries a status endpoint (&lt;code&gt;/healthz&lt;/code&gt;, or a dedicated &lt;code&gt;/config&lt;/code&gt; or &lt;code&gt;/env-check&lt;/code&gt;) and asserts key environment markers are what you expect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;DEPLOYED_ENV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-sf&lt;/span&gt; https://myapp.prod/healthz | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.environment'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DEPLOYED_ENV&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"FATAL: Deployed environment is '&lt;/span&gt;&lt;span class="nv"&gt;$DEPLOYED_ENV&lt;/span&gt;&lt;span class="s2"&gt;', expected 'production'"&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This takes 30 seconds to write and catches an entire class of misconfiguration failures permanently.&lt;/p&gt;
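
&lt;p&gt;The same idea extends to more than one marker. The sketch below assumes the endpoint returns a flat JSON object. The key names in the usage note and the helper &lt;code&gt;config_mismatches&lt;/code&gt; are illustrative assumptions:&lt;/p&gt;

```python
import json


def config_mismatches(reported_json, expected):
    """List the keys whose reported value differs from what production expects.

    reported_json: raw JSON text from a hypothetical /env-check endpoint.
    expected: dict of marker name to the value production must report.
    A key that is missing from the response counts as a mismatch.
    """
    reported = json.loads(reported_json)
    return sorted(
        key for key, value in expected.items() if reported.get(key) != value
    )
```

&lt;p&gt;Called with an expectation such as &lt;code&gt;{"environment": "production", "db_cluster": "prod-primary"}&lt;/code&gt;, it returns the names of every marker that is wrong or absent, so the failure message can say which value leaked in from another environment.&lt;/p&gt;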




&lt;h2&gt;
  
  
  Failure 4: Deployment Succeeded, Old Code Still Running
&lt;/h2&gt;

&lt;p&gt;This one is specifically painful because the deployment tooling is telling you the truth — it &lt;em&gt;did&lt;/em&gt; succeed. The problem is that "deployment succeeded" and "new code is serving traffic" are not the same statement.&lt;/p&gt;

&lt;p&gt;The scenario: a Kubernetes rollout reports complete. GitHub Actions shows a green checkmark. You hit the service and you're getting responses consistent with the old version. What happened?&lt;/p&gt;

&lt;p&gt;Common causes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rollout completed but pods are serving from cached image&lt;/strong&gt; — &lt;code&gt;imagePullPolicy: IfNotPresent&lt;/code&gt; on a node that already has the old image with the same tag (the classic &lt;code&gt;latest&lt;/code&gt; tag problem)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Old pods didn't terminate cleanly&lt;/strong&gt;: they're still in &lt;code&gt;Terminating&lt;/code&gt; state and still receiving traffic because their removal from the Service's endpoints hasn't propagated to every proxy or ingress controller yet&lt;/li&gt;
&lt;li&gt;
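&lt;strong&gt;The deployment updated but another controller reverted it&lt;/strong&gt; before you checked, for example a GitOps reconciler or a second pipeline re-applying the old manifest (a HorizontalPodAutoscaler changes replica counts, not the image)&lt;/li&gt;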
&lt;li&gt;
&lt;strong&gt;The CI agent itself failed mid-job&lt;/strong&gt;, reported partial success, and the deployment step never fully executed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What catches this:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Never use mutable tags like &lt;code&gt;latest&lt;/code&gt; in production Kubernetes manifests. Always deploy with the image SHA or a unique build tag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Bad&lt;/span&gt;
&lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp:latest&lt;/span&gt;

&lt;span class="c1"&gt;# Good  &lt;/span&gt;
&lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp:a3f8c21d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add explicit rollout verification as a pipeline step, not a manual check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GitHub Actions&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Verify rollout&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;kubectl rollout status deployment/myapp --timeout=120s&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Verify correct image is running&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;RUNNING_IMAGE=$(kubectl get pods -l app=myapp \&lt;/span&gt;
      &lt;span class="s"&gt;-o jsonpath='{.items[0].spec.containers[0].image}')&lt;/span&gt;
    &lt;span class="s"&gt;EXPECTED_IMAGE="myapp:${{ github.sha }}"&lt;/span&gt;

    &lt;span class="s"&gt;if [ "$RUNNING_IMAGE" != "$EXPECTED_IMAGE" ]; then&lt;/span&gt;
      &lt;span class="s"&gt;echo "Image mismatch: running $RUNNING_IMAGE, expected $EXPECTED_IMAGE"&lt;/span&gt;
      &lt;span class="s"&gt;exit 1&lt;/span&gt;
    &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
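
&lt;p&gt;The step above inspects only the first pod the API returns. A stricter variant, sketched here under the assumption that the pipeline saves the output of &lt;code&gt;kubectl get pods -l app=myapp -o json&lt;/code&gt;, checks every pod and refuses to pass when the selector matches nothing. Both function names are hypothetical:&lt;/p&gt;

```python
import json


def images_from_kubectl_json(kubectl_output):
    """Map each pod name to the image of its first container."""
    items = json.loads(kubectl_output).get("items", [])
    return {
        item["metadata"]["name"]: item["spec"]["containers"][0]["image"]
        for item in items
    }


def rollout_problems(pod_images, expected_image):
    """Describe every reason the rollout cannot be confirmed; empty means OK."""
    if not pod_images:
        # An empty selector result must not pass: succeeding by checking
        # nothing is exactly the failure mode this step exists to catch.
        return ["no pods matched the selector"]
    return [
        name + " is running " + image
        for name, image in sorted(pod_images.items())
        if image != expected_image
    ]
```

&lt;p&gt;Pods that are still terminating will show up with the old image, so a real pipeline would likely retry this check for a short window before failing.&lt;/p&gt;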



&lt;p&gt;For the agent failure case — always configure your CI agents with heartbeat timeouts and ensure your pipeline has explicit failure handling for agent disconnection. A job that loses its agent mid-run should never report green.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure 5: The Agent That Quietly Gave Up
&lt;/h2&gt;

&lt;p&gt;This is the most operationally unglamorous failure on this list, and possibly the most common in Jenkins environments.&lt;/p&gt;

&lt;p&gt;The scenario: a build agent goes offline, becomes unresponsive, or hits a resource limit mid-job. Depending on your Jenkins configuration, this can result in the job being marked as successful if the failure happens during a non-critical step, or if the agent timeout is set too generously and the job just... stops reporting.&lt;/p&gt;

&lt;p&gt;You check the console log. It ends mid-line. No error. No stack trace. Just silence — and a green badge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What catches this:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight groovy"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Jenkinsfile — always set explicit timeouts&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;time:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;unit:&lt;/span&gt; &lt;span class="s1"&gt;'MINUTES'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;retry&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;always&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;script&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
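        &lt;span class="c1"&gt;// Caution: result may also be null for a healthy build that is still running; confirm the behaviour on your Jenkins version (currentBuild.currentResult is never null)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;currentBuild&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;result&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;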
          &lt;span class="n"&gt;currentBuild&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'FAILURE'&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Monitor agent health as infrastructure — not as an afterthought. Agent failures should fire the same alerts as application failures. If your agents are running in Docker or Kubernetes, treat them with the same resource limits, health checks, and observability you'd apply to any production workload.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Underlying Principle
&lt;/h2&gt;

&lt;p&gt;Every failure above shares a root cause: &lt;strong&gt;the pipeline verified that steps executed, not that outcomes were correct.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A step that runs is not the same as a step that succeeded in the way you intended. Green means the process completed. It does not mean the result is what you think it is.&lt;/p&gt;

&lt;p&gt;The discipline of post-deployment verification — checking the SHA, querying the running environment, asserting the test count, confirming the rollout image — closes this gap. It's not extra work. It's the last mile of the deployment that most pipelines are missing.&lt;/p&gt;

&lt;p&gt;Build pipelines that are skeptical of themselves. Verify outcomes, not just execution. Treat a deployment as unconfirmed until the running system tells you it's correct — not until your CI dashboard does.&lt;/p&gt;

&lt;p&gt;The dashboard will lie to you. Production won't.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Reference: The Verification Checklist
&lt;/h2&gt;

&lt;p&gt;Add these steps to every production deployment pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;strong&gt;Image SHA verification&lt;/strong&gt; — confirm running container matches the commit that triggered the build&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Test count gate&lt;/strong&gt; — assert minimum number of tests ran, fail on zero&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Environment assertion&lt;/strong&gt; — query running service to confirm correct environment config&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Rollout image check&lt;/strong&gt; — verify deployed pods are running the new image, not a cached version&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Agent timeout + null result handling&lt;/strong&gt; — ensure agent failures produce explicit pipeline failures&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Explicit &lt;code&gt;--no-cache&lt;/code&gt; policy&lt;/strong&gt; — or documented, auditable cache-from strategy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these take more than 20 lines to implement. Together, they eliminate the entire class of "it looked fine" incidents.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you been burned by a silent pipeline failure? I'd genuinely like to hear what broke and what you did to catch it — drop it in the comments.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>devops</category>
      <category>githubactions</category>
      <category>docker</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>IPv6 Is "The Future of the Internet" — So Why Did It Break My Streaming App in 2025?</title>
      <dc:creator>Sumit Gautam</dc:creator>
      <pubDate>Tue, 14 Apr 2026 08:59:37 +0000</pubDate>
      <link>https://dev.to/sumit_gautam_379d5/-ipv6-is-the-future-of-the-internet-so-why-did-it-break-my-streaming-app-in-2024-4e73</link>
      <guid>https://dev.to/sumit_gautam_379d5/-ipv6-is-the-future-of-the-internet-so-why-did-it-break-my-streaming-app-in-2024-4e73</guid>
      <description>&lt;p&gt;&lt;em&gt;A personal debugging incident that turned into an industry-wide infrastructure audit.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Last week I spent 45-50 minutes convinced my LG WebOS TV or my ISP had quietly broken something. JioHotstar — India's dominant streaming platform — was refusing to play anything. Every title. Every time. Error code &lt;code&gt;DR-6006_X&lt;/code&gt;: &lt;em&gt;"We are having trouble playing this video right now."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I did what everyone does. Restarted the router. Restarted the TV. Unplugged everything and waited. Reinstalled the app. Nothing changed, because none of that was the problem.&lt;/p&gt;

&lt;p&gt;The fix, once I found it, took ten seconds: I forced my LG TV to use IPv4 directly from the TV's own network settings — leaving my router free to run IPv6 for every other device on the network. JioHotstar worked immediately.&lt;/p&gt;

&lt;p&gt;That's a cleaner fix than it sounds. The router doesn't lose IPv6. Your phone, laptop, and other devices are unaffected. Only the TV talks IPv4. But the real question isn't how I fixed it — it's &lt;em&gt;why this broke in the first place&lt;/em&gt;, and what it says about where the industry actually stands on IPv6 readiness in 2024.&lt;/p&gt;

&lt;p&gt;The short answer: not as far along as anyone wants to admit.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Failed — and Why Restarting Never Would Have Fixed It
&lt;/h2&gt;

&lt;p&gt;To understand the failure, you need to understand what happens when a smart TV tries to play protected streaming content.&lt;/p&gt;

&lt;p&gt;When your LG TV connects to JioHotstar, it doesn't just fetch a video file. It first resolves DNS to locate the platform's servers, negotiates a session, contacts a DRM (Digital Rights Management) license server to verify you're entitled to watch the content, receives a cryptographic key, and &lt;em&gt;then&lt;/em&gt; begins streaming. The &lt;code&gt;DR-6006_X&lt;/code&gt; error code sits in that DRM handshake layer — not in the video delivery itself. The content never starts because the license exchange never completes.&lt;/p&gt;

&lt;p&gt;Here's where IPv6 enters. Modern home routers run what's called a &lt;strong&gt;dual-stack configuration&lt;/strong&gt; — both IPv4 and IPv6 simultaneously. When a device makes a DNS query, it typically receives both &lt;code&gt;A&lt;/code&gt; records (IPv4 addresses) and &lt;code&gt;AAAA&lt;/code&gt; records (IPv6 addresses). Devices are supposed to implement a mechanism called &lt;strong&gt;Happy Eyeballs&lt;/strong&gt; (RFC 8305) — racing both connection types and falling back gracefully if one fails.&lt;/p&gt;
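
&lt;p&gt;To make the fallback idea concrete, here is a simplified Python sketch of just the address-ordering part of Happy Eyeballs: candidates are interleaved by address family so that a broken IPv6 path is followed quickly by an IPv4 attempt. It leaves out the staggered connection attempts and the address sorting the RFC also describes, and the function name &lt;code&gt;interleave_by_family&lt;/code&gt; is my own:&lt;/p&gt;

```python
import ipaddress
from itertools import zip_longest


def interleave_by_family(addresses, first_family=6):
    """Order candidate addresses so the two IP families alternate.

    addresses: resolved addresses as strings (A and AAAA results mixed).
    first_family: the family to try first, 6 or 4.
    """
    parsed = [ipaddress.ip_address(a) for a in addresses]
    preferred = [str(a) for a in parsed if a.version == first_family]
    other = [str(a) for a in parsed if a.version != first_family]
    ordered = []
    for pair in zip_longest(preferred, other):
        # zip_longest pads the shorter list with None; drop the padding.
        ordered.extend(a for a in pair if a is not None)
    return ordered
```

&lt;p&gt;A client that walks this list with a short delay between attempts reaches an IPv4 address on its second try, instead of exhausting every IPv6 address first.&lt;/p&gt;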

&lt;p&gt;LG's WebOS, based on observed behavior, does not implement this fallback reliably. It preferentially routes traffic over IPv6 and appears to fail silently when that path encounters a problem. Since that preference persists on every reconnection, restarting the router or TV changes nothing — you reconnect over the same path every single time.&lt;/p&gt;

&lt;p&gt;The most likely explanation for the failure, based on symptoms and error behavior, is that some part of the playback stack — whether DRM license delivery, CDN routing, or session token validation — doesn't handle IPv6 connections reliably in certain network configurations. I can't confirm exactly where the chain breaks without packet-level access to both sides. But the fix was consistent, repeatable, and immediate — which points clearly at the transport layer, not the content or the account.&lt;/p&gt;




&lt;h2&gt;
  
  
  This Isn't Unique to One Platform. It's an Industry-Wide Pattern.
&lt;/h2&gt;

&lt;p&gt;What makes this incident worth writing about is that it isn't unusual. IPv6 compatibility failures in streaming and connected devices follow a remarkably consistent pattern across the industry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming platforms broadly&lt;/strong&gt; have CDN routing behavior that differs meaningfully between IPv4 and IPv6. CDN providers maintain separate peering agreements for IPv6 traffic, and edge node coverage isn't uniform — a regional PoP (Point of Presence) may have IPv6 routes that are technically announced but practically unreliable in certain geographies. Users on these paths see buffering on fast connections, or quality adaptation that behaves erratically — symptoms almost impossible to attribute to IP version without infrastructure-level visibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some smart home devices&lt;/strong&gt; — cameras, doorbells, smart speakers — are quietly problematic on IPv6-preferred networks. Most embedded firmware was written assuming IPv4. Device discovery protocols like mDNS and SSDP behave differently in dual-stack environments, and the majority of IoT vendors have never included IPv6-preferred configurations in their QA test matrix. The result is intermittent connectivity that looks exactly like hardware failure or ISP instability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise SaaS applications&lt;/strong&gt; carry a specific class of IPv6 bug: session token validation tied to IP address. Several categories of HR, ERP, and authentication platforms were built when binding a session to an IPv4 address seemed like reasonable security practice. In dual-stack environments, where the same user can appear at different addresses during a session depending on which path the OS chooses, this breaks authentication flows in ways that are genuinely hard to reproduce and diagnose.&lt;/p&gt;

&lt;p&gt;The pattern is consistent: &lt;strong&gt;the application works, the network works, but the intersection of a modern network configuration and legacy application assumptions produces a failure that looks random from the outside.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the Industry Keeps Deprioritizing This — An Honest Analysis
&lt;/h2&gt;

&lt;p&gt;The economic reasoning behind IPv6 neglect is worth understanding clearly, because it explains why this problem persists despite being well-known.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"It works on IPv4 — what's the business case?"&lt;/strong&gt; This is the dominant internal conversation at most product companies, and it's genuinely hard to argue against on a quarterly basis. IPv4 still functions. Most users are still on IPv4-dominant configurations. IPv6 failures are intermittent, hard to reproduce in standard QA environments, and — most importantly — &lt;em&gt;users blame their ISP or their device, not the platform.&lt;/em&gt; The error rate doesn't surface in dashboards as an IPv6 problem. It shows up as generic playback failures, support tickets, or quietly churned users. The platform never sees the root cause.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third-party dependency chains are real.&lt;/strong&gt; DRM systems are not built in-house. Streaming platforms rely on Widevine (Google), FairPlay (Apple), and PlayReady (Microsoft) licensing infrastructure. If any component in that chain — license delivery endpoints, session APIs, token validation services — doesn't fully support IPv6, the platform inherits that limitation regardless of how well their own code handles it. Fixing it means waiting on vendor roadmaps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CDN IPv6 support is uneven at the edge.&lt;/strong&gt; Major providers like Akamai, Cloudflare, and AWS CloudFront have strong IPv6 support at their primary nodes. But regional edge coverage is not uniform — particularly in markets like India, Southeast Asia, and parts of Africa. IPv6 route announcements can be technically active while practically unreliable, creating what networking engineers call "black hole routes." Traffic arrives at the edge and disappears. This is invisible unless you're monitoring IPv6 path performance as a separate metric from IPv4.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QA environments default to IPv4.&lt;/strong&gt; This is arguably the most systemic issue of all. Most developer laptops, staging environments, and CI/CD pipelines run on IPv4. IPv6 failures are never surfaced in development because the development environment can't produce them. By the time the code reaches production users with IPv6-preferred home networks, the bug has been shipped, tested against, and forgotten.&lt;/p&gt;




&lt;h2&gt;
  
  
  What IPv6 Readiness Actually Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;For &lt;strong&gt;engineering and infrastructure teams&lt;/strong&gt;, the baseline is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add IPv6 explicitly to your QA matrix.&lt;/strong&gt; Run a staging environment on an IPv6-preferred network. Test every authentication flow, every DRM handshake, every CDN segment request against both stacks — independently and together.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit your third-party dependencies.&lt;/strong&gt; Your DRM vendor, CDN configuration, session management layer, analytics endpoints, and error reporting infrastructure. One IPv4-only dependency can silently break the entire user flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instrument by IP version.&lt;/strong&gt; Your observability stack should tag requests by IP version so you can see IPv6 error rates as a distinct signal — not buried inside aggregate failure rates where it's invisible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't trust OS-level fallback on smart TV platforms.&lt;/strong&gt; WebOS, Tizen, Android TV, and FireOS all handle Happy Eyeballs differently. Build explicit connection retry logic with IP version awareness into your client applications rather than assuming the platform handles it correctly.&lt;/li&gt;
&lt;/ul&gt;
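
&lt;p&gt;Instrumenting by IP version can start small. The sketch below assumes you can export request records as pairs of client address and success flag (that shape, and both function names, are assumptions for illustration) and splits the error rate by IP version:&lt;/p&gt;

```python
import ipaddress
from collections import defaultdict


def ip_version(client_address):
    """Classify a client address as 4 or 6, treating IPv4-mapped IPv6 as 4."""
    parsed = ipaddress.ip_address(client_address)
    if parsed.version == 6 and parsed.ipv4_mapped is not None:
        return 4
    return parsed.version


def error_rate_by_ip_version(requests):
    """Compute the failure ratio per IP version.

    requests: iterable of (client_address, succeeded) pairs.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for client_address, succeeded in requests:
        version = ip_version(client_address)
        totals[version] += 1
        if not succeeded:
            failures[version] += 1
    return {version: failures[version] / totals[version] for version in totals}
```

&lt;p&gt;Even this crude split shows whether IPv6 clients fail more often than IPv4 clients, which is the signal that aggregate failure rates hide.&lt;/p&gt;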

&lt;p&gt;For &lt;strong&gt;end-users&lt;/strong&gt; dealing with this today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The cleanest fix is to force IPv4 directly in your TV's network settings rather than disabling IPv6 on the router. This keeps your router and all other devices on IPv6 — only the TV talks IPv4. No network-wide compromise needed.&lt;/li&gt;
&lt;li&gt;If your TV doesn't expose IP version settings directly, creating a separate SSID with IPv6 disabled for smart TVs and IoT devices is the next best option.&lt;/li&gt;
&lt;li&gt;If you're on a mesh network (Eero, Google Nest, Orbi), check whether IPv6 is enabled by default in the admin panel — many ship with it on, and most don't advertise it clearly.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;IPv6 was standardized in 1998. IPv4 address exhaustion has been a formally declared crisis since 2011. In 2024, a user on a modern home network running the protocol the industry has called "the future" for two decades can hit silent, inexplicable streaming failures — and the standard advice is still "restart your router."&lt;/p&gt;

&lt;p&gt;This isn't a failure of any single company. It's the accumulated result of thousands of individually rational decisions — by platform teams, CDN vendors, device manufacturers, and DRM providers — to defer IPv6 readiness because IPv4 still works for most users most of the time.&lt;/p&gt;

&lt;p&gt;The problem with "most users most of the time" is that it's actively changing. Jio, Airtel, and BSNL in India are all accelerating IPv6 deployment. The population of users on IPv6-preferred networks is growing faster than the industry is closing the compatibility gaps. And because these failures are invisible in aggregate metrics — they look like ISP problems, device problems, anything but platform problems — there's no forcing function to fix them.&lt;/p&gt;

&lt;p&gt;The 45 minutes I spent debugging my TV is trivial. Multiplied across millions of users who never find the fix, it's churn, eroded trust, and support volume that gets categorized incorrectly and never traced back to its root cause.&lt;/p&gt;

&lt;p&gt;IPv6 readiness is no longer a future concern for streaming platforms, IoT vendors, and enterprise software teams. It is a present-tense gap that the industry's standard testing practices are structurally incapable of detecting.&lt;/p&gt;

&lt;p&gt;The router restart won't fix it. The QA matrix needs to.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you hit IPv6 compatibility issues on streaming platforms or connected devices? I'd be genuinely interested in what you found — drop it in the comments below.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>networking</category>
      <category>devops</category>
      <category>cloudengineering</category>
      <category>platformengineering</category>
    </item>
  </channel>
</rss>
