<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hardeep Singh Tiwana</title>
    <description>The latest articles on DEV Community by Hardeep Singh Tiwana (@hstiwana).</description>
    <link>https://dev.to/hstiwana</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3118021%2F910b89fb-a8da-468c-b3fd-58de5fd50d0c.jpg</url>
      <title>DEV Community: Hardeep Singh Tiwana</title>
      <link>https://dev.to/hstiwana</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hstiwana"/>
    <language>en</language>
    <item>
      <title>NAT Gateways Killing Your Container Costs? Amazon ECR VPC endpoints to the Rescue</title>
      <dc:creator>Hardeep Singh Tiwana</dc:creator>
      <pubDate>Fri, 19 Dec 2025 21:32:27 +0000</pubDate>
      <link>https://dev.to/hstiwana/nat-gateways-killing-your-container-costs-amazon-ecr-vpc-endpoints-to-the-rescue-21k5</link>
      <guid>https://dev.to/hstiwana/nat-gateways-killing-your-container-costs-amazon-ecr-vpc-endpoints-to-the-rescue-21k5</guid>
      <description>&lt;p&gt;&lt;strong&gt;Picture this&lt;/strong&gt;. Your AWS bill hits, and there it is: &lt;strong&gt;$10K in NAT Gateway charges&lt;/strong&gt; for 3 NAT GWs in &lt;strong&gt;&lt;code&gt;us-east-1&lt;/code&gt;&lt;/strong&gt;. You started to dig in, and see &lt;strong&gt;~$8K&lt;/strong&gt; comes from &lt;strong&gt;NatGateway-Bytes&lt;/strong&gt; (&lt;strong&gt;Data Processed&lt;/strong&gt;) alone, assuming most of it tied to ECR image pulls. I've helped teams spot this exact issue using Cost Explorer and VPC Flow logs, watching container deployments quietly eat budgets. The solution? Amazon ECR VPC endpoints. They &lt;strong&gt;dropped NAT bills by &amp;gt;75%&lt;/strong&gt; in one setup I worked on. Let's walk through spotting it, the math, and the flow change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;ECR image pulls through NAT Gateways cost $0.045/GB.&lt;/li&gt;
&lt;li&gt;VPC Interface Endpoints cost $0.01/GB (&lt;strong&gt;78% cheaper&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Real example: ~$8K/month → ~$2K/month = &lt;strong&gt;~$70K annual savings&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  💡 Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; NAT Gateways charge $0.045/GB for data processing. For ECR-heavy workloads, this adds up fast: our &lt;strong&gt;example case&lt;/strong&gt; shows $8,010/month in data processing charges alone!&lt;/p&gt;
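&lt;p&gt;A quick sanity check on those numbers (a sketch using this post's assumed 178,000 GB/month and the published NAT Gateway prices):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Reproduce the example NAT Gateway bill
awk 'BEGIN {
  hours = 3 * 0.045 * 730      # 3 NAT GWs x $0.045/hr x 730 hrs/month
  data  = 178000 * 0.045       # 178,000 GB x $0.045/GB processed
  printf "hourly=%.2f data=%.2f total=%.2f\n", hours, data, hours + data
}'
# hourly=98.55 data=8010.00 total=8108.55
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;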

&lt;p&gt;&lt;strong&gt;The Solution:&lt;/strong&gt; Deploy three VPC endpoints to &lt;strong&gt;route ECR traffic privately&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ECR API Interface Endpoint&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;com.amazonaws.&amp;lt;region&amp;gt;.ecr.api&lt;/code&gt;&lt;/strong&gt;)

&lt;ul&gt;
&lt;li&gt;Handles authentication and image manifests&lt;/li&gt;
&lt;li&gt;Cost: ~$22/month per AZ + minimal data charges&lt;/li&gt;
&lt;li&gt;Required: Must deploy in each AZ for high availability&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ECR Docker Interface Endpoint&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;com.amazonaws.&amp;lt;region&amp;gt;.ecr.dkr&lt;/code&gt;&lt;/strong&gt;)

&lt;ul&gt;
&lt;li&gt;Handles Docker pull/push commands&lt;/li&gt;
&lt;li&gt;Cost: ~$22/month per AZ + minimal data charges&lt;/li&gt;
&lt;li&gt;Required: Must deploy in each AZ for high availability&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 Gateway Endpoint&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;com.amazonaws.&amp;lt;region&amp;gt;.s3&lt;/code&gt;&lt;/strong&gt;) &lt;strong&gt;⭐ THE MOST CRITICAL ONE&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Handles actual image layer downloads (99%+ of your data!)&lt;/li&gt;
&lt;li&gt;Cost: &lt;strong&gt;$0.00 (FREE!)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Required:&lt;/strong&gt; Without this, your image layers still hit NAT Gateways&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Savings:&lt;/strong&gt; For 178,000 GB/month of ECR traffic:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before:&lt;/strong&gt; $8,108.55/month (NAT Gateways)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After:&lt;/strong&gt; $1,823.80/month (VPC Endpoints)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings:&lt;/strong&gt; $6,284.75/month &lt;strong&gt;(77.5%) = $75,417/year&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works:&lt;/strong&gt; ECR stores Docker image layers in S3. The free S3 Gateway endpoint handles 99%+ of your data transfer, while the two paid Interface endpoints handle control-plane operations. All three work together to eliminate NAT Gateway data processing charges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implementation Time:&lt;/strong&gt; ~30 minutes with Terraform, plus 48 hours to validate savings in Cost Explorer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critical Success Factor:&lt;/strong&gt; You MUST deploy all three endpoints. Deploying only the ECR endpoints without the S3 Gateway endpoint will save you almost nothing, because the bulk of your data will still flow through NAT Gateways.&lt;/p&gt;
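&lt;p&gt;For reference, here's roughly what creating the three endpoints looks like with the AWS CLI (a sketch: the VPC, subnet, route table, and security group IDs are placeholders to swap for your own):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# 1. ECR API interface endpoint (auth + manifests)
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.api \
  --subnet-ids subnet-aaa subnet-bbb subnet-ccc \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled

# 2. ECR Docker interface endpoint (docker pull/push)
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.dkr \
  --subnet-ids subnet-aaa subnet-bbb subnet-ccc \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled

# 3. S3 gateway endpoint (the layers themselves -- free)
aws ec2 create-vpc-endpoint --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-aaa rtb-bbb rtb-ccc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;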

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpcmr1j07m4pj0mlga1wy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpcmr1j07m4pj0mlga1wy.png" alt="Compare both models" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Let's start with the Brutal Math: NAT vs. Endpoints Head-to-Head
&lt;/h2&gt;

&lt;p&gt;Think standard 3-AZ VPC with private subnets and container workloads. NAT charges &lt;strong&gt;&lt;code&gt;$0.045&lt;/code&gt;&lt;/strong&gt; per hour per AZ plus &lt;strong&gt;&lt;code&gt;$0.045&lt;/code&gt;&lt;/strong&gt; per GB processed. Endpoints run &lt;strong&gt;&lt;code&gt;$0.01&lt;/code&gt;&lt;/strong&gt; per hour per ENI and &lt;strong&gt;&lt;code&gt;$0.01&lt;/code&gt;&lt;/strong&gt; per GB. That 4.5× per-GB difference is what matters at high volume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Complete ECR private access needs three endpoints: the &lt;strong&gt;&lt;code&gt;ecr.api&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;ecr.dkr&lt;/code&gt;&lt;/strong&gt; interface endpoints (one ENI each per AZ, so 6 ENIs total in a 3-AZ setup) plus the &lt;strong&gt;&lt;code&gt;s3&lt;/code&gt;&lt;/strong&gt; gateway endpoint for the layers. The S3 Gateway endpoint modifies route tables and creates no ENIs. If you'd like to read more on this, follow the links at the end of this post.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ecr.api&lt;/code&gt;&lt;/strong&gt; → Interface endpoint (ENI per AZ)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ecr.dkr&lt;/code&gt;&lt;/strong&gt; → Interface endpoint (ENI per AZ)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;s3&lt;/code&gt;&lt;/strong&gt; → Gateway endpoint (NO ENIs, modifies route tables)&lt;/li&gt;
&lt;/ul&gt;
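&lt;p&gt;You can confirm which of these are Interface vs Gateway services in your region before deploying (a sketch; the service names here are the standard catalog names):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List the ECR and S3 endpoint services with their endpoint type
aws ec2 describe-vpc-endpoint-services --region us-east-1 \
  --query "ServiceDetails[?contains(ServiceName, '.ecr.') || ServiceName=='com.amazonaws.us-east-1.s3'].[ServiceName, ServiceType[0].ServiceType]" \
  --output table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;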

&lt;h3&gt;
  
  
  NAT Gateway vs VPC Endpoints Cost Comparison
&lt;/h3&gt;

&lt;p&gt;Configuration: 3 AZs with 3 NAT Gateways vs 3 VPC Endpoints&lt;/p&gt;

&lt;h4&gt;
  
  
  VPC Endpoint Configuration:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;com.amazonaws.&amp;lt;region&amp;gt;.ecr.api (Interface) - $0.01/hour per AZ + $0.01/GB&lt;/li&gt;
&lt;li&gt;com.amazonaws.&amp;lt;region&amp;gt;.ecr.dkr (Interface) - $0.01/hour per AZ + $0.01/GB&lt;/li&gt;
&lt;li&gt;com.amazonaws.&amp;lt;region&amp;gt;.s3 (Gateway) - FREE (no hourly or data charges)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  NAT Gateway Configuration:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;3 NAT Gateways (one per AZ) - $0.045/hour each + $0.045/GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the model, scaled to the $8K spend as data baseline (730 hours a month; 6 interface-endpoint ENIs: &lt;code&gt;ecr.api&lt;/code&gt; and &lt;code&gt;ecr.dkr&lt;/code&gt; in each of 3 AZs, with the free S3 gateway endpoint adding nothing):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data Volume (GB/mo)&lt;/th&gt;
&lt;th&gt;NAT Cost ($)&lt;/th&gt;
&lt;th&gt;VPC Endpoint Cost ($)&lt;/th&gt;
&lt;th&gt;Monthly Savings ($)&lt;/th&gt;
&lt;th&gt;Savings %&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;103.05&lt;/td&gt;
&lt;td&gt;44.80&lt;/td&gt;
&lt;td&gt;58.25&lt;/td&gt;
&lt;td&gt;56.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;121.05&lt;/td&gt;
&lt;td&gt;48.80&lt;/td&gt;
&lt;td&gt;72.25&lt;/td&gt;
&lt;td&gt;59.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;143.55&lt;/td&gt;
&lt;td&gt;53.80&lt;/td&gt;
&lt;td&gt;89.75&lt;/td&gt;
&lt;td&gt;62.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;323.55&lt;/td&gt;
&lt;td&gt;93.80&lt;/td&gt;
&lt;td&gt;229.75&lt;/td&gt;
&lt;td&gt;71.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10,000&lt;/td&gt;
&lt;td&gt;548.55&lt;/td&gt;
&lt;td&gt;143.80&lt;/td&gt;
&lt;td&gt;404.75&lt;/td&gt;
&lt;td&gt;73.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50,000&lt;/td&gt;
&lt;td&gt;2,348.55&lt;/td&gt;
&lt;td&gt;543.80&lt;/td&gt;
&lt;td&gt;1,804.75&lt;/td&gt;
&lt;td&gt;76.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;td&gt;4,598.55&lt;/td&gt;
&lt;td&gt;1,043.80&lt;/td&gt;
&lt;td&gt;3,554.75&lt;/td&gt;
&lt;td&gt;77.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;178,000&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8,108.55&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,823.80&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6,284.75&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;77.5%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Total NAT spend drops like a rock; at production scale, you will see ROI in days.&lt;/p&gt;
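&lt;p&gt;Every row above follows the same two formulas (3 NAT Gateways vs. 6 interface-endpoint ENIs), so you can reproduce any row yourself:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# NAT:       3 GWs  x $0.045/hr x 730 hrs + GB x $0.045
# Endpoints: 6 ENIs x $0.01/hr  x 730 hrs + GB x $0.01  (S3 gateway: free)
awk -v gb=10000 'BEGIN {
  nat = 3 * 0.045 * 730 + gb * 0.045
  ep  = 6 * 0.01  * 730 + gb * 0.01
  printf "nat=%.2f endpoint=%.2f savings=%.1f%%\n", nat, ep, 100 * (nat - ep) / nat
}'
# nat=548.55 endpoint=143.80 savings=73.8%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;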




&lt;h2&gt;
  
  
  Example use case with assumptions
&lt;/h2&gt;

&lt;p&gt;Assume we have 3 NAT Gateways in &lt;strong&gt;&lt;code&gt;us-east-1&lt;/code&gt;&lt;/strong&gt; processing 178,000 GB of ECR traffic monthly.&lt;/p&gt;

&lt;p&gt;Cost Breakdown for Total Monthly Cost: $8,108.55&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;NAT Gateway Hourly Charges: $98.55 &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$0.045 per hour × 3 NAT Gateways × 730 hours/month&lt;/li&gt;
&lt;li&gt;This covers the provisioning cost for maintaining 3 NAT Gateways (one per AZ)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Data Processing Charges: $8,010.00&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$0.045 per GB × 178,000 GB&lt;/li&gt;
&lt;li&gt;This is the charge for processing all data flowing through the NAT Gateways&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Per NAT Gateway:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hourly cost: $32.85/month per gateway&lt;/li&gt;
&lt;li&gt;Data processing (if evenly distributed): $2,670.00/month per gateway&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important Note:&lt;/strong&gt; The data processing charge of $8,010 represents the vast majority (98.8%) of our assumed total NAT Gateway costs. Since we're processing ECR (Elastic Container Registry) traffic within the same region, we won't incur additional data transfer charges for the traffic itself, but the NAT Gateway data processing fee still applies.&lt;/p&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Private subnets with NAT Gateway access&lt;/li&gt;
&lt;li&gt;ECR repositories in the same region&lt;/li&gt;
&lt;li&gt;Security groups allowing HTTPS (443) from workloads&lt;/li&gt;
&lt;/ul&gt;
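&lt;p&gt;If the interface endpoints get their own security group, a minimal HTTPS rule looks like this (a sketch; the group ID and VPC CIDR are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Allow HTTPS from workloads in the VPC to the endpoint ENIs
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 443 \
  --cidr 10.0.0.0/16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;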




&lt;h2&gt;
  
  
  Hunt Down Those Hidden ECR Pull Fees
&lt;/h2&gt;

&lt;p&gt;Start in AWS Cost Explorer. Under Group by, select the &lt;strong&gt;Usage Type&lt;/strong&gt; dimension, then filter to &lt;strong&gt;Service:&lt;/strong&gt; &lt;code&gt;EC2 - Other&lt;/code&gt; and &lt;strong&gt;Usage type group:&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;EC2: NAT Gateway - Data Processed&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;EC2: NAT Gateway - Running Hours&lt;/code&gt;&lt;/strong&gt;. You'll see &lt;strong&gt;&lt;code&gt;NatGateway-Bytes&lt;/code&gt;&lt;/strong&gt; racking up that ~$8K at $0.045 per GB, plus &lt;strong&gt;&lt;code&gt;NatGateway-Hours&lt;/code&gt;&lt;/strong&gt; for the $0.045 hourly hit per AZ.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feoob6pqjjgvqj854f053.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feoob6pqjjgvqj854f053.png" alt="Cost Explorer Filters" width="628" height="1388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For proof&lt;/strong&gt;, enable &lt;strong&gt;VPC Flow Logs&lt;/strong&gt; on your subnets. Filter for port &lt;strong&gt;&lt;code&gt;443&lt;/code&gt;&lt;/strong&gt; traffic to &lt;strong&gt;&lt;code&gt;ecr.api&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;ecr.dkr&lt;/code&gt;&lt;/strong&gt; domains (specifically, look for destination port 443 traffic to IP addresses in the ECR service IP ranges, available via the AWS IP ranges JSON). &lt;/p&gt;

&lt;p&gt;Do you see private subnet bytes flooding NAT ENIs? That's the problem. Every pull sends a small request out via NAT, fetches metadata, then hauls gigabytes back, doubling up on processing fees. (If there's an inter-AZ hop, it adds another $0.01 per GB. I caught this pattern adding ~$3,000 a month in a recent cluster review.)&lt;/p&gt;

&lt;h2&gt;
  
  
  Using VPC Flow Logs to Track and Validate ECR Traffic Costs
&lt;/h2&gt;

&lt;p&gt;Before deploying VPC endpoints, you need proof that ECR is actually consuming your NAT Gateway bandwidth. After deployment, you need validation that traffic shifted correctly. VPC Flow Logs provide both.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Enable VPC Flow Logs
&lt;/h3&gt;

&lt;p&gt;Enable Flow Logs on your private subnets where container workloads run:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Via AWS CLI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ec2 create-flow-logs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-type&lt;/span&gt; Subnet &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--resource-ids&lt;/span&gt; subnet-xxxxx subnet-yyyyy subnet-zzzzz &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--traffic-type&lt;/span&gt; ALL &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--log-destination-type&lt;/span&gt; cloud-watch-logs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--log-group-name&lt;/span&gt; /aws/vpc/flowlogs &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--deliver-logs-permission-arn&lt;/span&gt; arn:aws:iam::ACCOUNT_ID:role/flowlogsRole
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Via Terraform:&lt;/strong&gt; &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/flow_log" rel="noopener noreferrer"&gt;See the &lt;code&gt;aws_flow_log&lt;/code&gt; resource on the Terraform Registry&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_flow_log"&lt;/span&gt; &lt;span class="s2"&gt;"private_subnets"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;iam_role_arn&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;flow_logs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;log_destination&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_cloudwatch_log_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;flow_logs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;traffic_type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ALL"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Identify Top HTTPS Destinations
&lt;/h3&gt;

&lt;p&gt;Run this CloudWatch Logs Insights query to find your highest-volume HTTPS destinations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fields @timestamp, srcAddr, dstAddr, dstPort, bytes, action
| filter dstPort = 443
| filter interfaceId like /eni-/
| stats sum(bytes) as totalBytes by dstAddr
| sort totalBytes desc
| limit 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows which destinations consume the most bandwidth on port 443. The top destinations are likely S3 IPs (for ECR image layers).&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Identify S3 and ECR Service IP Ranges
&lt;/h3&gt;

&lt;p&gt;VPC Flow Logs show IP addresses, not domain names. Download AWS's IP ranges to identify both S3 and ECR traffic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download AWS IP ranges&lt;/span&gt;
curl &lt;span class="nt"&gt;-o&lt;/span&gt; ip-ranges.json https://ip-ranges.amazonaws.com/ip-ranges.json

&lt;span class="c"&gt;# Inspect services for your region&lt;/span&gt;
jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.prefixes[] | select(.region=="us-east-1") | .service'&lt;/span&gt; ip-ranges.json | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you know the correct service values, narrow the query down. Since ECR doesn't have a designated service value, we use AMAZON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Once you know the correct service values, narrow it down, for example:&lt;/span&gt;
jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.prefixes[] | select((.service=="AMAZON" or .service=="S3") and .region=="us-east-1") | .ip_prefix'&lt;/span&gt; ip-ranges.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example IP ranges for &lt;code&gt;us-east-1&lt;/code&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;44.223.121.0/24
44.223.122.0/24
98.80.195.0/25
98.80.238.0/23
3.5.0.0/19
1.178.4.0/24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;You will see &amp;gt;95% of the traffic going to S3:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3 (where ECR stores image layers) - 95%+ of your traffic&lt;/li&gt;
&lt;li&gt;ECR (API and Docker registry) - &amp;lt;5% of your traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Matters:&lt;/strong&gt; Your 178,000 GB/month is primarily S3 traffic (image layer downloads), not ECR API calls. You must track S3 IPs to see the real cost impact!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Always check the current AWS IP ranges JSON for your specific region)&lt;/em&gt;&lt;/p&gt;
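&lt;p&gt;Since you'll need those prefixes as Logs Insights filters in the next step, you can generate the &lt;code&gt;dstAddr like&lt;/code&gt; clauses straight from the JSON (a sketch that matches on the first two octets; refine it for your actual prefixes):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Turn each S3 prefix for the region into a "dstAddr like" clause
jq -r '[.prefixes[]
        | select(.service=="S3" and .region=="us-east-1")
        | .ip_prefix | split(".")[0:2] | join("\\.")]
       | unique
       | map("dstAddr like /^" + . + "\\./")
       | join(" or ")' ip-ranges.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;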

&lt;h3&gt;
  
  
  Step 4: Calculate NAT Gateway ECR+S3 Traffic
&lt;/h3&gt;

&lt;p&gt;Filter Flow Logs for traffic to BOTH S3 and ECR IPs through NAT Gateway ENIs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Do NOT copy-paste this as-is; update the &lt;code&gt;filter dstAddr like&lt;/code&gt; line to match the ranges from the previous command's output.&lt;/li&gt;
&lt;li&gt;Replace &lt;code&gt;/^3\.5\./ or dstAddr like /^52\.94\./&lt;/code&gt; with the real IP ranges you want to look for.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fields @timestamp, srcAddr, dstAddr, dstPort, bytes, interfaceId
| filter dstPort = 443
| filter interfaceId like /eni-/ and action = "ACCEPT"
| filter dstAddr like /^3\.5\./ or dstAddr like /^52\.94\./
| stats sum(bytes) as totalBytes by interfaceId, dstAddr
| sort totalBytes desc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Identify NAT Gateway ENIs:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws ec2 describe-nat-gateways &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'NatGateways[].{NatGatewayId:NatGatewayId, NetworkInterfaceIds:NatGatewayAddresses[].NetworkInterfaceId}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cross-reference the ENI IDs from your query results with the NAT Gateway ENIs.&lt;br&gt;
&lt;strong&gt;💡 Pro Tip:&lt;/strong&gt; The top destination IPs by bytes will be S3 ranges, not ECR ranges. This confirms that the S3 Gateway endpoint is critical for cost savings!&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 5: Calculate Monthly Cost Impact
&lt;/h3&gt;

&lt;p&gt;From your Flow Logs query results:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sum total bytes&lt;/strong&gt; through NAT Gateway ENIs to S3 + ECR IPs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Convert to GB:&lt;/strong&gt; totalBytes / 1,000,000,000 (AWS uses decimal GB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Calculate cost:&lt;/strong&gt; GB × $0.045&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Cost Calculation Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flow Logs show:&lt;/strong&gt; 191,102,976,000 bytes to S3/ECR&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Convert:&lt;/strong&gt; 191,102,976,000 / 1,000,000,000 = 191.10 GB&lt;/li&gt;
&lt;li&gt;Scaled up to the full month's 178,000 GB: 178,000 × $0.045 = &lt;strong&gt;$8,010/month&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
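&lt;p&gt;The same conversion as a one-liner, using the sample byte count from above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# bytes -&gt; decimal GB -&gt; NAT data-processing cost
awk -v bytes=191102976000 'BEGIN {
  gb = bytes / 1000000000
  printf "gb=%.2f cost=%.2f\n", gb, gb * 0.045
}'
# gb=191.10 cost=8.60
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;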

&lt;p&gt;&lt;strong&gt;Traffic Breakdown (typical):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3 image layers: ~177,850 GB (99.91%)&lt;/li&gt;
&lt;li&gt;ECR API calls: ~50 GB (0.03%)&lt;/li&gt;
&lt;li&gt;ECR Docker registry: ~100 GB (0.06%)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Step 6: Validate After VPC Endpoint Deployment
&lt;/h3&gt;

&lt;p&gt;After deploying VPC endpoints, confirm traffic shifted to private IPs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fields @timestamp, srcAddr, dstAddr, dstPort, bytes, interfaceId
| filter dstPort = 443
| filter dstAddr like /^10\./
| filter interfaceId like /eni-/
| stats sum(bytes) as totalBytes by interfaceId
| sort totalBytes desc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you should see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✓ Traffic now goes to private 10.x.x.x IPs (VPC endpoint ENIs)&lt;/li&gt;
&lt;li&gt;✓ NAT Gateway ENIs show minimal S3/ECR traffic&lt;/li&gt;
&lt;li&gt;✓ Total bytes shifted from NAT to VPC endpoints&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ❌ But this validation method has problems ❌
&lt;/h3&gt;

&lt;p&gt;⚠️ The filter above only matches the RFC 1918 &lt;code&gt;10.0.0.0/8&lt;/code&gt; range, but VPC endpoints use different address ranges:&lt;/p&gt;

&lt;p&gt;Gateway Endpoints (S3, DynamoDB)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use prefix list routes (&lt;code&gt;pl-xxx&lt;/code&gt;), not destination IPs in flow logs&lt;/li&gt;
&lt;li&gt;dstAddr shows the actual S3 service IP (public range like &lt;code&gt;52.x.x.x&lt;/code&gt;), not private&lt;/li&gt;
&lt;li&gt;Traffic takes the prefix list route directly, so these flow records never match a NAT &lt;code&gt;interfaceId&lt;/code&gt; filter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Interface Endpoints (&lt;code&gt;ecr.api&lt;/code&gt;, &lt;code&gt;ecr.dkr&lt;/code&gt;, etc.)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;PrivateLink IPs&lt;/strong&gt; in the VPC CIDR (e.g., &lt;code&gt;10.0.x.x&lt;/code&gt; if your VPC is &lt;code&gt;10.0.0.0/16&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;dstAddr shows the endpoint ENI IP (private), but only if your VPC CIDR starts with 10.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  So what would correct validation queries look like?
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;1. Interface Endpoints (ECR, etc.) - Check PrivateLink traffic&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;srcAddr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dstAddr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dstPort&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interfaceId&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;dstPort&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;dstAddr&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/^&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;  &lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Your&lt;/span&gt; &lt;span class="n"&gt;VPC&lt;/span&gt; &lt;span class="n"&gt;CIDR&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;interfaceId&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;eni&lt;/span&gt;&lt;span class="o"&gt;-/&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;totalBytes&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;dstAddr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interfaceId&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt; &lt;span class="n"&gt;totalBytes&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;⚠️ Only works if your VPC CIDR is &lt;code&gt;10.x.x.x&lt;/code&gt;. &lt;strong&gt;Replace with your actual CIDR (e.g., 172.16. or 192.168.)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Gateway Endpoints (S3) - Check prefix list bypass&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;srcAddr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dstAddr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dstPort&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interfaceId&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;dstPort&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;s3BucketName&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="nv"&gt;""&lt;/span&gt; &lt;span class="k"&gt;or&lt;/span&gt; &lt;span class="n"&gt;dstAddr&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;  &lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;S3&lt;/span&gt; &lt;span class="n"&gt;traffic&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;interfaceId&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;nat&lt;/span&gt;&lt;span class="o"&gt;-/&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;  &lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="k"&gt;Not&lt;/span&gt; &lt;span class="n"&gt;NAT&lt;/span&gt; &lt;span class="n"&gt;ENIs&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;totalBytes&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;dstAddr&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt; &lt;span class="n"&gt;totalBytes&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. 🎯 NAT Gateway traffic drop (The real validation)🎯&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;fields&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;srcAddr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dstAddr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dstPort&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;interfaceId&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;dstPort&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;filter&lt;/span&gt; &lt;span class="n"&gt;interfaceId&lt;/span&gt; &lt;span class="k"&gt;like&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;nat&lt;/span&gt;&lt;span class="o"&gt;-/&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;totalBytes&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;interfaceId&lt;/span&gt;
&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;sort&lt;/span&gt; &lt;span class="n"&gt;totalBytes&lt;/span&gt; &lt;span class="k"&gt;desc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Before endpoints:&lt;/strong&gt; High bytes on NAT ENIs&lt;br&gt;
&lt;strong&gt;After endpoints:&lt;/strong&gt; Bytes drop significantly on those same ENIs.&lt;/p&gt;
&lt;h4&gt;
  
  
  🎯 What success looks like 🎯
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;BEFORE endpoints:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NAT ENI: 150 GB to s3.us-east-1.amazonaws.com&lt;/li&gt;
&lt;li&gt;NAT ENI: 25 GB to 123456789012.dkr.ecr.us-east-1.amazonaws.com&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AFTER endpoints:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NAT ENI: 5 GB (mostly external APIs)&lt;/li&gt;
&lt;li&gt;Interface ENI: 25 GB to 10.0.2.100 (ECR.dkr endpoint)&lt;/li&gt;
&lt;li&gt;S3 traffic: Prefix list route (no NAT ENI)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key metric: &lt;strong&gt;NAT ENI bytes drop&lt;/strong&gt;. That's your validation.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;/^10\./&lt;/code&gt; filter only catches interface endpoints and only if your VPC uses that range. Use the NAT traffic reduction query instead.&lt;/p&gt;
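&lt;p&gt;To turn a &lt;code&gt;totalBytes&lt;/code&gt; figure from the queries above into dollars, here is a minimal sketch. The $0.045/GB rate is the us-east-1 NAT Gateway data-processing price assumed throughout this article; verify it for your region.&lt;/p&gt;

```python
NAT_DATA_RATE_PER_GB = 0.045  # USD/GB, us-east-1 NAT data processing (assumed)

def nat_processing_cost(total_bytes: int) -> float:
    """Estimate monthly NAT data-processing cost from raw Flow Logs bytes.

    Billing uses decimal GB: 1 GB = 1,000,000,000 bytes.
    """
    gb = total_bytes / 1_000_000_000
    return round(gb * NAT_DATA_RATE_PER_GB, 2)

# Example: 178,000 GB/month of ECR pull traffic through NAT
print(nat_processing_cost(178_000 * 1_000_000_000))  # → 8010.0
```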

&lt;p&gt;&lt;strong&gt;Validate endpoint ENI IDs:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ECR API endpoint ENIs&lt;/span&gt;
aws ec2 describe-vpc-endpoints &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="s2"&gt;"Name=service-name,Values=com.amazonaws.us-east-1.ecr.api"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'VpcEndpoints[*].NetworkInterfaceIds'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table

&lt;span class="c"&gt;# ECR Docker endpoint ENIs&lt;/span&gt;
aws ec2 describe-vpc-endpoints &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="s2"&gt;"Name=service-name,Values=com.amazonaws.us-east-1.ecr.dkr"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'VpcEndpoints[*].NetworkInterfaceIds'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table

&lt;span class="c"&gt;# S3 Gateway endpoint (no ENIs - modifies route tables)&lt;/span&gt;
aws ec2 describe-vpc-endpoints &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="s2"&gt;"Name=service-name,Values=com.amazonaws.us-east-1.s3"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'VpcEndpoints[*].[VpcEndpointId,VpcEndpointType,RouteTableIds]'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 7: Correlate with Cost Explorer
&lt;/h3&gt;

&lt;p&gt;Confirm the cost impact in AWS Cost Explorer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Navigate to:&lt;/strong&gt; Billing and Cost Management → Cost Explorer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Group by:&lt;/strong&gt; Usage Type&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter Service:&lt;/strong&gt; EC2 - Other&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Look for:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NatGateway-Bytes&lt;/strong&gt; (should drop ~75%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VpcEndpoint-Bytes&lt;/strong&gt; (should increase proportionally)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time range:&lt;/strong&gt; Compare 2 weeks before vs 2 weeks after deployment&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Expected results:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- NAT Gateway data processing: $8,010 → ~$2,000 (75% reduction)
- VPC Endpoint data processing: $0 → ~$1,780
- Net savings: ~$6,285/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
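&lt;p&gt;The savings arithmetic can be sanity-checked directly. Figures are the ones used in this article's example (3 NAT GWs in us-east-1, 178,000 GB/month of pulls):&lt;/p&gt;

```python
# "Before" bill attributed to ECR: NAT data processing plus NAT hourly charges
before = 8_010.00 + 98.55

# "After" bill: interface-endpoint hours plus endpoint data processing
endpoint_cost = 43.80 + 1_780.00

monthly_savings = before - endpoint_cost
print(round(monthly_savings, 2))                 # → 6284.75
print(round(monthly_savings / before * 100, 1))  # → 77.5
print(round(monthly_savings * 12, 2))            # → 75417.0
```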
&lt;h4&gt;
  
  
  Understanding the Three-Endpoint Architecture
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Why you need all three endpoints:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ECR API Interface Endpoint&lt;/strong&gt; (&lt;code&gt;com.amazonaws.us-east-1.ecr.api&lt;/code&gt;)

&lt;ul&gt;
&lt;li&gt;Handles authentication, authorization, image manifests&lt;/li&gt;
&lt;li&gt;Low data volume (~50 GB/month)&lt;/li&gt;
&lt;li&gt;Cost: $21.90/month (3 AZs × 730 hrs × $0.01) + ~$0.50 data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ECR Docker Interface Endpoint&lt;/strong&gt; (&lt;code&gt;com.amazonaws.us-east-1.ecr.dkr&lt;/code&gt;)

&lt;ul&gt;
&lt;li&gt;Handles Docker pull/push commands, layer discovery&lt;/li&gt;
&lt;li&gt;Low data volume (~100 GB/month)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost:&lt;/strong&gt; $21.90/month (3 AZs × 730 hrs × $0.01) + ~$1.00 data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 Gateway Endpoint&lt;/strong&gt; (&lt;code&gt;com.amazonaws.us-east-1.s3&lt;/code&gt;) ← &lt;strong&gt;THE CRITICAL ONE&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Handles actual image layer downloads (99%+ of your data!)&lt;/li&gt;
&lt;li&gt;High data volume (~177,850 GB/month)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost: $0.00 (FREE!)&lt;/strong&gt; ← This is where your savings come from!&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Without the S3 Gateway endpoint&lt;/strong&gt;, your image layer downloads would still hit NAT Gateways even with ECR endpoints deployed!&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Pro Tips for Flow Logs Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;✓ Track S3 IPs, not just ECR IPs&lt;/strong&gt; - S3 is where 95%+ of ECR data flows &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;✓ Enable Flow Logs on private subnets only&lt;/strong&gt; - Reduces log volume and costs &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;✓ Use CloudWatch Logs Insights&lt;/strong&gt; - Best for ad-hoc queries and quick analysis &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;✓ Consider Amazon Athena&lt;/strong&gt; - Better for large-scale historical analysis &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;✓ Set up CloudWatch alarms&lt;/strong&gt; - Alert on unexpected NAT traffic spikes &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;✓ Tag your resources&lt;/strong&gt; - Makes NAT Gateways and VPC endpoints easier to identify &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;✓ Factor in Flow Logs cost&lt;/strong&gt; - Approximately $0.50/GB ingested to CloudWatch &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;✓ Aggregate by 5-minute intervals&lt;/strong&gt; - Reduces log volume without losing insights &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;✓ Monitor for 2-4 weeks&lt;/strong&gt; - Ensures you capture full deployment cycles and traffic patterns&lt;/li&gt;
&lt;/ul&gt;
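&lt;p&gt;On the Flow Logs cost tip above, a quick sketch of the ingestion bill (assumption: ~$0.50/GB ingested into CloudWatch Logs; actual log volume depends on your traffic shape and aggregation interval):&lt;/p&gt;

```python
INGESTION_RATE_PER_GB = 0.50  # USD/GB into CloudWatch Logs (assumed)

def flow_logs_monthly_cost(log_gb_per_month: float) -> float:
    """Estimate monthly Flow Logs ingestion cost."""
    return round(log_gb_per_month * INGESTION_RATE_PER_GB, 2)

# e.g. 20 GB of flow-log records per month
print(flow_logs_monthly_cost(20))  # → 10.0
```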


&lt;h2&gt;
  
  
  Before and After: Understanding The Traffic Flow
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before:&lt;/strong&gt; ECS Tasks → NAT Gateway → Internet → ECR/S3 (expensive)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After:&lt;/strong&gt; ECS Tasks → VPC Endpoints → AWS Private Network → ECR/S3 (optimized)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Before endpoints
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A pod in a private subnet hits the NAT Gateway for every ECR pull&lt;/li&gt;
&lt;li&gt;The request goes outbound to the internet, the ECR API replies back through NAT processing, and then gigabytes of Docker layers stream back the same way.&lt;/li&gt;
&lt;li&gt;Flow Logs show huge byte counts on NAT ENIs. Cost Explorer's &lt;code&gt;NatGateway-Bytes&lt;/code&gt; balloons to $8K.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  After: deploy endpoints
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;com.amazonaws.&amp;lt;region&amp;gt;.ecr.api&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;.ecr.dkr&lt;/code&gt;&lt;/strong&gt; endpoints in each private subnet per AZ, turn on private DNS. &lt;/li&gt;
&lt;li&gt;Pod traffic goes straight to the endpoint ENI via PrivateLink, no NAT or internet gateway. &lt;/li&gt;
&lt;li&gt;AWS backbone handles the rest, ECR layers flow free within the region.&lt;/li&gt;
&lt;li&gt;Flow Logs shift: zero NAT to ECR domains, all bytes on private 10.x endpoint IPs. &lt;/li&gt;
&lt;li&gt;In Cost Explorer, NAT usage drops like a rock. &lt;/li&gt;
&lt;li&gt;Look for usage types containing &lt;strong&gt;&lt;code&gt;VpcEndpoint-Hours&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;VpcEndpoint-Bytes&lt;/code&gt;&lt;/strong&gt; under the &lt;strong&gt;VPC&lt;/strong&gt; service to confirm the new endpoint costs appear, at much smaller amounts than NAT was showing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ngxpvlu0bvo7haec33r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5ngxpvlu0bvo7haec33r.png" alt="VPC Endpoint Costs" width="652" height="1382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rolled this out on a Kubernetes fleet processing 178,000 GB/mo of ECR traffic. NAT crashed from $10K ($8K data processed) to $2K for services that still need it. Endpoints totaled $1.8K. Filter &lt;strong&gt;Data Transfer&lt;/strong&gt; + &lt;strong&gt;EC2&lt;/strong&gt; in Cost Explorer and you will see &lt;strong&gt;EC2: NAT Gateway - Data Processed&lt;/strong&gt; costs drop sharply, while &lt;strong&gt;VpcEndpoint-Hours&lt;/strong&gt; + &lt;strong&gt;VpcEndpoint-Bytes&lt;/strong&gt; take over at $0.01/GB.&lt;/p&gt;
&lt;h2&gt;
  
  
  Cost After VPC Interface Endpoints: $1,823.80/month
&lt;/h2&gt;
&lt;h3&gt;
  
  
  New Cost Breakdown:
&lt;/h3&gt;
&lt;h4&gt;
  
  
  NAT Gateway Costs:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Hourly charges: $98.55 (gateways remain for other traffic)&lt;/li&gt;
&lt;li&gt;Data processing: $0.00 (ECR traffic now bypasses NAT entirely)&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  VPC Interface Endpoint Costs:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Hourly charges: $43.80 (2 endpoints × 3 AZs × 730 hours × $0.01/hour)&lt;/li&gt;
&lt;li&gt;Data processing: $1,780.00 (178,000 GB × $0.01/GB)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  The Impact:
&lt;/h3&gt;

&lt;p&gt;💰 Monthly Savings: $6,284.75/month (77.5%)&lt;br&gt;
💰 Annual Savings: $75,417.00/year&lt;/p&gt;
&lt;h3&gt;
  
  
  What You Need to Deploy:
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Required Interface Endpoints (per AZ):
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;✅ com.amazonaws.us-east-1.ecr.api - For ECR API calls&lt;/li&gt;
&lt;li&gt;✅ com.amazonaws.us-east-1.ecr.dkr - For Docker registry operations&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Required Gateway Endpoint (VPC-wide - For ECR image layer storage - FREE):
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;✅ com.amazonaws.us-east-1.s3 - Deploy once per VPC (not per AZ)&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  A quick and dirty Terraform example
&lt;/h4&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight terraform"&gt;&lt;code&gt;&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_endpoint"&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;service_name&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"com.amazonaws.&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_region&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.s3"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_endpoint_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Gateway"&lt;/span&gt;
  &lt;span class="nx"&gt;route_table_ids&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_route_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private&lt;/span&gt;&lt;span class="p"&gt;[*].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;policy&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_iam_policy_document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;s3_ecr_access&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;json&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"s3-gateway"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_endpoint"&lt;/span&gt; &lt;span class="s2"&gt;"ecr-dkr-endpoint"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;service_name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"com.amazonaws.&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_region&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.ecr.dkr"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_endpoint_type&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Interface"&lt;/span&gt;
  &lt;span class="nx"&gt;private_dns_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_ids&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ecs_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_ids&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private&lt;/span&gt;&lt;span class="p"&gt;[*].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ecr-dkr"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_endpoint"&lt;/span&gt; &lt;span class="s2"&gt;"ecr-api-endpoint"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;service_name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"com.amazonaws.&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="kd"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_region&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.ecr.api"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_endpoint_type&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Interface"&lt;/span&gt;
  &lt;span class="nx"&gt;private_dns_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_ids&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ecs_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_ids&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private&lt;/span&gt;&lt;span class="p"&gt;[*].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ecr-api"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Validation:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Validate with: &lt;code&gt;nslookup api.ecr.us-east-1.amazonaws.com&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Should resolve to private &lt;code&gt;10.x.x.x&lt;/code&gt; addresses, not public IPs.&lt;/li&gt;
&lt;/ul&gt;
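&lt;p&gt;The same check can be scripted. A small sketch, assuming it runs from a host inside the VPC (private DNS only answers privately there); the hostname follows the article's us-east-1 example:&lt;/p&gt;

```python
import ipaddress
import socket

def is_private_ip(ip: str) -> bool:
    """True for private addresses (e.g. 10.x.x.x), False for public IPs."""
    return ipaddress.ip_address(ip).is_private

def resolves_privately(hostname: str) -> bool:
    """Resolve a hostname and report whether it lands on a private address."""
    return is_private_ip(socket.gethostbyname(hostname))

if __name__ == "__main__":
    host = "api.ecr.us-east-1.amazonaws.com"
    verdict = "private" if resolves_privately(host) else "PUBLIC - check endpoint DNS"
    print(host, verdict)
```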

&lt;h2&gt;
  
  
  💡 Pro Tip: The S3 Gateway endpoint is critical but FREE.
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Add a free &lt;strong&gt;S3 Gateway endpoint&lt;/strong&gt; for ECR layer storage access. While ECR endpoints handle API calls, image layers are stored in S3. The Gateway endpoint ensures this traffic also bypasses NAT at zero cost, so don't skip it. ECR stores image layers in S3, and without this endpoint, your layer downloads will still hit NAT Gateways!&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Does This Work So Well?
&lt;/h2&gt;

&lt;p&gt;The key is &lt;strong&gt;data processing rate difference&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NAT Gateway: $0.045/GB&lt;/li&gt;
&lt;li&gt;VPC Endpoint: $0.01/GB (78% cheaper per GB)&lt;/li&gt;
&lt;/ul&gt;
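&lt;p&gt;A quick break-even sketch makes the rate difference concrete: how much monthly ECR traffic covers the interface-endpoint hourly charges? Rates and endpoint counts as used in this article (us-east-1, 2 interface endpoints across 3 AZs):&lt;/p&gt;

```python
NAT_RATE = 0.045       # USD/GB, NAT Gateway data processing (assumed)
ENDPOINT_RATE = 0.01   # USD/GB, interface endpoint data processing (assumed)
HOURLY = 2 * 3 * 730 * 0.01  # 2 endpoints x 3 AZs x 730 hrs x $0.01 = $43.80

savings_per_gb = NAT_RATE - ENDPOINT_RATE  # $0.035 saved per GB shifted
breakeven_gb = HOURLY / savings_per_gb     # volume at which endpoints pay off

print(round(breakeven_gb, 1))  # → 1251.4
```

At roughly 1,251 GB/month the endpoints break even; at 178,000 GB/month they are over a hundred times past that point.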

&lt;p&gt;Plus, VPC endpoints provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better security - Traffic never leaves AWS network&lt;/li&gt;
&lt;li&gt;Lower latency - Direct path to ECR&lt;/li&gt;
&lt;li&gt;Higher reliability - No internet gateway dependency&lt;/li&gt;
&lt;li&gt;Simplified architecture - Private subnets can pull images directly&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Another Implementation detail to keep in mind:
&lt;/h3&gt;

&lt;p&gt;Your NAT Gateways stay in place for other internet-bound traffic (software updates, external APIs, etc.), but all ECR image pulls route through the VPC endpoints instead. This is a configuration change, not a replacement, and you get the best of both worlds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;DNS not resolving privately? Enable "Private DNS" on endpoints ✅&lt;/li&gt;
&lt;li&gt;Still seeing NAT charges? Check security group rules allow 443 inbound ✅&lt;/li&gt;
&lt;li&gt;Pulls timing out? Verify subnet route tables don't force internet gateway ✅&lt;/li&gt;
&lt;li&gt;Endpoint not appearing in Cost Explorer? Wait 24-48 hours for billing data to populate; check under Service: "VPC" ✅&lt;/li&gt;
&lt;li&gt;Validate endpoint status: &lt;code&gt;aws ec2 describe-vpc-endpoints --filters "Name=service-name,Values=com.amazonaws.us-east-1.ecr.api"&lt;/code&gt; ✅&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Troubleshooting Flow Logs Analysis
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Issue: Can't find NAT Gateway ENIs in Flow Logs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Verify Flow Logs are enabled on the correct subnets&lt;/li&gt;
&lt;li&gt;✅ Check that traffic-type is set to ALL (not just &lt;code&gt;ACCEPT&lt;/code&gt; or &lt;code&gt;REJECT&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;✅ Wait 10-15 minutes after enabling for data to populate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Issue: S3/ECR IP ranges don't match traffic&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ AWS IP ranges change periodically - always download the latest JSON&lt;/li&gt;
&lt;li&gt;✅ Some regions have additional IP ranges not in the main prefixes&lt;/li&gt;
&lt;li&gt;✅ Check for both IPv4 and IPv6 ranges if your VPC supports dual-stack&lt;/li&gt;
&lt;li&gt;✅ Remember: Most traffic will be to S3 IPs, not ECR IPs!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Issue: Traffic still shows NAT Gateway after endpoint deployment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Verify private_dns_enabled = true on Interface endpoints&lt;/li&gt;
&lt;li&gt;✅ Check security groups allow port 443 from workload subnets&lt;/li&gt;
&lt;li&gt;✅ Confirm route tables don't have explicit routes forcing internet gateway&lt;/li&gt;
&lt;li&gt;✅ Verify S3 Gateway endpoint is associated with correct route tables&lt;/li&gt;
&lt;li&gt;✅ Test DNS resolution: &lt;code&gt;nslookup api.ecr.us-east-1.amazonaws.com&lt;/code&gt; should return &lt;code&gt;10.x.x.x&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;✅ Test S3 access: nslookup &lt;code&gt;s3.us-east-1.amazonaws.com&lt;/code&gt; should resolve (Gateway endpoints don't change DNS)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Issue: Cost Explorer doesn't match Flow Logs calculations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Flow Logs show raw bytes; Cost Explorer uses decimal GB (1 GB = 1,000,000,000 bytes)&lt;/li&gt;
&lt;li&gt;✅ Cost Explorer has 24-48 hour delay for billing data&lt;/li&gt;
&lt;li&gt;✅ Ensure you're comparing the same time periods&lt;/li&gt;
&lt;li&gt;✅ Check for data transfer charges vs data processing charges&lt;/li&gt;
&lt;li&gt;✅ Remember: S3 Gateway endpoint traffic is FREE, so you won't see it in VPC endpoint costs&lt;/li&gt;
&lt;/ul&gt;
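&lt;p&gt;The decimal-GB point above trips people up often enough to be worth a worked example: billing counts 1 GB as 1e9 bytes, while many tools report binary GiB (2^30 bytes), a ~7% gap.&lt;/p&gt;

```python
raw_bytes = 178_000_000_000_000  # example: 178,000 decimal GB from Flow Logs

decimal_gb = raw_bytes / 1_000_000_000  # what Cost Explorer bills against
binary_gib = raw_bytes / 2**30          # what GiB-based tools report

print(round(decimal_gb))                 # → 178000
print(round(binary_gib))                 # → 165775
print(round(decimal_gb / binary_gib, 4)) # → 1.0737 (constant ~7.4% gap)
```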

&lt;p&gt;&lt;strong&gt;Issue: Only seeing small data volumes to ECR IPs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ This is NORMAL! ECR API/Docker traffic is &amp;lt;5% of total&lt;/li&gt;
&lt;li&gt;✅ The bulk of your data goes to S3 IPs (image layers)&lt;/li&gt;
&lt;li&gt;✅ If you're only filtering for ECR IPs, you're missing 95%+ of the traffic&lt;/li&gt;
&lt;li&gt;✅ Update your query to include S3 IP ranges&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reality Check
&lt;/h2&gt;

&lt;p&gt;This assumes full traffic shift (realistic for ECR-only optimization). Background NAT persists for other internet traffic. Monitor your Cost Explorer's NAT Gateway data processing charges weekly for the first month. You should see a 75%+ drop if ECR is your primary NAT consumer. If not, investigate other high-volume services using VPC Flow Logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Run Cost Explorer analysis (5 min)&lt;/li&gt;
&lt;li&gt;Deploy endpoints in non-prod (30 min)&lt;/li&gt;
&lt;li&gt;Validate with test pulls (10 min)&lt;/li&gt;
&lt;li&gt;Monitor for 48 hours&lt;/li&gt;
&lt;li&gt;Roll to production during maintenance window&lt;/li&gt;
&lt;li&gt;Track Cost Explorer for 2 weeks to confirm savings&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ready to fix it? Create the endpoints in the console or Terraform, tag them (e.g. &lt;code&gt;Name: ecr-api&lt;/code&gt;) for tracking, and test &lt;code&gt;docker pull&lt;/code&gt; once private DNS propagates. Budget relief comes fast. Seen this work for you? Share in the comments.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;References:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/compute/setting-up-aws-privatelink-for-amazon-ecs-and-amazon-ecr/" rel="noopener noreferrer"&gt;Setting up AWS PrivateLink for Amazon ECS, and Amazon ECR&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/containers/using-vpc-endpoint-policies-to-control-amazon-ecr-access/" rel="noopener noreferrer"&gt;Using VPC endpoint policies to control Amazon ECR access&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

</description>
      <category>aws</category>
      <category>costoptimization</category>
      <category>ecr</category>
      <category>privateendpoints</category>
    </item>
    <item>
      <title>Reducing EKS cross-AZ cost using Cilium</title>
      <dc:creator>Hardeep Singh Tiwana</dc:creator>
      <pubDate>Tue, 14 Oct 2025 17:15:46 +0000</pubDate>
      <link>https://dev.to/hstiwana/using-cilium-to-reduce-cross-az-costs-on-aws-5138</link>
      <guid>https://dev.to/hstiwana/using-cilium-to-reduce-cross-az-costs-on-aws-5138</guid>
      <description>&lt;p&gt;As Kubernetes workloads scale on AWS across multiple Availability Zones (AZs), managing inter-AZ traffic efficiently is crucial for performance and cost savings. AWS charges for data transferred between AZs, and Kubernetes’ standard networking can inadvertently increase this cross-zone traffic. Cilium, a modern, eBPF-powered networking and security solution, offers unique capabilities to reduce these costs while improving network visibility and control. This blog merges clear explanations and official resources, providing a comprehensive overview of how Cilium helps optimize cross-AZ traffic on AWS.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Challenge of Cross-AZ Traffic on AWS
&lt;/h3&gt;

&lt;p&gt;AWS bills data transfer whenever network traffic crosses AZ boundaries within the same region (cross-region transfer costs even more; "same region" keeps the focus on the AZ boundary). Kubernetes Service types such as LoadBalancer or NodePort may distribute traffic across nodes in different AZs, leading to increased cross-zone data flow and charges. This is especially impactful at scale, where pod-to-pod communication patterns cause costly inter-AZ hops.&lt;/p&gt;
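&lt;p&gt;To ballpark what that traffic costs: intra-region, cross-AZ transfer is commonly billed at $0.01/GB in each direction, so every GB that crosses an AZ boundary effectively costs $0.02 (rate is an assumption; check current pricing for your region).&lt;/p&gt;

```python
RATE_EACH_DIRECTION = 0.01  # USD/GB, charged on both sending and receiving side

def cross_az_monthly_cost(gb_transferred: float) -> float:
    """Effective cost of cross-AZ traffic: billed in both directions."""
    return round(gb_transferred * RATE_EACH_DIRECTION * 2, 2)

# e.g. 50 TB/month of pod-to-pod chatter crossing AZ boundaries
print(cross_az_monthly_cost(50_000))  # → 1000.0
```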

&lt;h3&gt;
  
  
  How Cilium Limits Cross-AZ Transfer Costs
&lt;/h3&gt;

&lt;p&gt;Cilium employs the Linux kernel's eBPF technology to transform Kubernetes networking with efficiency and deep visibility. Its key features for reducing cross-AZ traffic include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Topology-Aware Routing:&lt;/strong&gt; Cilium supports Kubernetes topology-aware service routing, ensuring traffic stays within the same AZ whenever possible to avoid cross-zone charges. This feature relies on node labels like &lt;code&gt;topology.kubernetes.io/zone&lt;/code&gt; to guide Kubernetes service traffic locality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ENI Mode Integration:&lt;/strong&gt; Cilium's ENI IP Address Management (IPAM) mode assigns pod IPs directly to AWS Elastic Network Interfaces (ENIs) attached to nodes within the same AZ. In this setup, pod traffic routes natively through AWS networking without encapsulation, reducing latency and avoiding cross-AZ data transfers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Advanced IPAM:&lt;/strong&gt; Cilium offers IPAM modes such as ENI and ClusterPool, providing granular control over IP assignment and routing. These modes improve traffic locality by aligning pod IPs with underlying AWS subnet allocation per AZ.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Policy-Driven Traffic Control:&lt;/strong&gt; With Cilium’s rich layer 3 to layer 7 network policies, you can enforce strict AZ-local communication rules or selectively allow cross-AZ traffic only when needed.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Practical Cilium Setup on AWS EKS
&lt;/h3&gt;

&lt;p&gt;Implementing Cilium on EKS involves options from full replacement of AWS VPC CNI to running alongside it in a secondary CNI mode. To optimize cross-AZ traffic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enable Topology-Aware Routing:&lt;/strong&gt; Use Kubernetes service annotations paired with Cilium’s kube-proxy replacement to route traffic preferentially within the same AZ.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------------------------------------------------------+
|                 AWS Region (Multiple AZs)                   |
|                                                             |
|   +------------------------+   +------------------------+   |
|   | Availability Zone A    |   | Availability Zone B    |   |
|   |                        |   |                        |   |
|   | +------------------+   |   | +------------------+   |   |
|   | |  Node A1         |   |   | |  Node B1         |   |   |
|   | |  Pod(s) A        |   |   | |  Pod(s) B        |   |   |
|   | +------------------+   |   | +------------------+   |   |
|   |   |                    |   |   |                    |   |
|   |   | Service Traffic    |   |   | Service Traffic    |   |
|   |   | goes within AZ     |   |   | goes within AZ     |   |
|   |   v                    |   |   v                    |   |
|   | +------------------+   |   | +------------------+   |   |
|   | | Pod(s) A (target)|&amp;lt;--+   | | Pod(s) B (target)|&amp;lt;--+   |
|   | +------------------+   |   | +------------------+   |   |
|   |                        |   |                        |   |
|   +------------------------+   +------------------------+   |
|                                                             |
|  Kubernetes Service                                         |
|  - Annotated with topology.kubernetes.io/zone               |
|  - Cilium replaces kube-proxy, respecting topology hints    |
|  - Routes client traffic preferentially within same AZ      |
+-------------------------------------------------------------+

Legend:
- Service traffic stays within the same AZ
- If Pod targets exist in the same AZ, no cross-AZ routing occurs
- Traffic flows across AZs only if necessary (failover or no local endpoints)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deploy Cilium ENI Mode:&lt;/strong&gt; This maps pod IPs to ENIs tied to the same AZ subnet as the hosting node, enabling native AWS routing and cutting down on costly inter-AZ traffic.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------------------+
|            AWS Availability Zone A (us-west-2a)           |
|                                                           |
|   +-----------------+      +-----------------+            |
|   |   EC2 Node #1   |      |   EC2 Node #2   |            |
|   |                 |      |                 |            |
|   | +-----------+   |      | +-----------+   |            |
|   | | ENI eth0  |---|------| | ENI eth0  |---|----+       |
|   | +-----------+   |      | +-----------+   |    |       |
|   |   |             |      |   |             |    |       |
|   | +-----------+   |      | +-----------+   |    |       |
|   | | Pod A     |   |      | | Pod B     |   |    |       |
|   | +-----------+   |      | +-----------+   |    |       |
|   |                 |      |                 |    |       |
|   +-----------------+      +-----------------+    |       |
|                                                   |       |
|         Native AWS Subnet &amp;amp; Route Table (local)   |       |
+---------------------------------------------------|-------+
                                                    |
                   minimal inter-AZ traffic         |
                                                    |
+---------------------------------------------------|-------+
|            AWS Availability Zone B (us-west-2b)   |       |
|                                                   |       |
|   +-----------------+      +-----------------+    |       |
|   |   EC2 Node #3   |      |   EC2 Node #4   |    |       |
|   |                 |      |                 |    |       |
|   | +-----------+   |      | +-----------+   |    |       |
|   | | ENI eth0  |---|------| | ENI eth0  |---|----+       |
|   | +-----------+   |      | +-----------+   |            |
|   |   |             |      |   |             |            |
|   | +-----------+   |      | +-----------+   |            |
|   | | Pod C     |   |      | | Pod D     |   |            |
|   | +-----------+   |      | +-----------+   |            |
|   |                 |      |                 |            |
|   +-----------------+      +-----------------+            |
|                                                           |
+-----------------------------------------------------------+

Legend:
- ENI: AWS Elastic Network Interface
- Pod: Kubernetes Pod, with IP mapped to ENI in node's AZ subnet
- Native subnet &amp;amp; route: traffic is routed locally within AZ
- Inter-AZ traffic: minimized (only when necessary for HA or failover)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
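&lt;p&gt;A minimal sketch of what enabling ENI mode looks like for a Helm-based Cilium install. Key names below match Cilium 1.14+; check the ENI documentation for your Cilium version before applying:&lt;/p&gt;

```yaml
# Assumed Helm values for Cilium in AWS ENI mode (verify per version).
eni:
  enabled: true             # allocate pod IPs from AWS ENIs
ipam:
  mode: eni
routingMode: native         # no overlay: use the VPC route tables directly
egressMasqueradeInterfaces: eth0
```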



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Leverage Cluster Mesh:&lt;/strong&gt; For multi-cluster or multi-region scenarios, Cluster Mesh manages service endpoints to prefer local pods and restrict unnecessary data flow across zones.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------------------------------------------------------------------+
|                                AWS Region (Multi-AZ)                             |
|                                                                                  |
|    +-------------------------------+      +-------------------------------+      |
|    | Availability Zone A (AZ-a)    |      | Availability Zone B (AZ-b)    |      |
|    |                               |      |                               |      |
|    |  +-------------------------+  |      |  +-------------------------+  |      |
|    |  |    Cluster A (in AZ-a)  |  |      |  |    Cluster B (in AZ-b)  |  |      |
|    |  |                         |  |      |  |                         |  |      |
|    |  |  +------+  +------+     |  |      |  |  +------+  +------+     |  |      |
|    |  |  | Pods |  | Pods |     |  |      |  |  | Pods |  | Pods |     |  |      |
|    |  |  +------+  +------+     |  |      |  |  +------+  +------+     |  |      |
|    |  |                         |  |      |  |                         |  |      |
|    |  |  Traffic stays local    |  |      |  |  Traffic stays local    |  |      |
|    |  |  within AZ and Cluster  |  |      |  |  within AZ and Cluster  |  |      |
|    |  +------------|------------+  |      |  +------------|------------+  |      |
|    +---------------|---------------+      +---------------|---------------+      |
|                    |                                      |                      |
|    Traffic to other clusters stays minimal                |                      |
|    for high availability &amp;amp; resiliency                     |                      |
|                    +--------------------------------------+                      |
|                    |                                                             |
|                    |                                                             |
|            +-------v-------+                                                     |
|            |  Cluster Mesh |                                                     |
|            |  Synchronizes |                                                     |
|            |  Service &amp;amp;    |                                                     |
|            |  Endpoint Info|                                                     |
|            +---------------+                                                     |
|                                                                                  |
|               Resiliency: Failover / backup cluster routes traffic across AZs    |
+----------------------------------------------------------------------------------+ 

Legend:
- Pods communicate locally within their cluster and AZ.
- Traffic to other AZs only for resiliency or failover (Cluster Mesh).
- Cluster Mesh ensures clusters share service status across AZs without unnecessary cross-AZ pod traffic.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
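&lt;p&gt;Once clusters are connected via Cluster Mesh, keeping traffic local is a per-Service choice. A sketch using Cilium's global-service annotations (the service name is illustrative):&lt;/p&gt;

```yaml
# Hypothetical global Service shared across meshed clusters.
# affinity: "local" prefers endpoints in the local cluster and only
# fails over across clusters when no local endpoints are healthy.
apiVersion: v1
kind: Service
metadata:
  name: checkout                         # illustrative name
  annotations:
    service.cilium.io/global: "true"     # expose endpoints mesh-wide
    service.cilium.io/affinity: "local"  # prefer the local cluster
spec:
  selector:
    app: checkout
  ports:
  - port: 80
```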



&lt;p&gt;Many users have reported notable savings on AWS data transfer costs by carefully tuning these settings in real deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits of Using Cilium for Cross-AZ Optimization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost Reduction:&lt;/strong&gt; Keeps data transfer local to the zone, cutting AWS inter-AZ charges that are billed per GB in each direction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved Availability:&lt;/strong&gt; Maintains Kubernetes service resiliency by balancing traffic intelligently but favoring locality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability with Hubble:&lt;/strong&gt; Deep, real-time visibility into pod-to-pod communication paths helps diagnose network flow and optimize topology.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-Grained Security:&lt;/strong&gt; Layer 7 network policies enable precise control over permissible traffic patterns in and across AZs.&lt;/li&gt;
&lt;/ul&gt;
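&lt;p&gt;To make the cost point concrete, here is a back-of-the-envelope model. It assumes AWS's commonly published $0.01/GB inter-AZ charge in each direction (check current pricing for your region); the traffic volume is purely illustrative:&lt;/p&gt;

```python
# Back-of-the-envelope inter-AZ transfer cost model.
# ASSUMPTION: $0.01/GB charged in EACH direction (i.e. $0.02 per GB
# transferred in total); verify against current AWS pricing.
RATE_PER_GB_EACH_WAY = 0.01

def monthly_cross_az_cost(gb_per_month: float, local_fraction: float = 0.0) -> float:
    """Monthly cost of cross-AZ traffic.

    local_fraction is the share of traffic kept in-zone (e.g. by
    topology-aware routing); only the remainder crosses AZ boundaries.
    """
    cross_az_gb = gb_per_month * (1.0 - local_fraction)
    return cross_az_gb * RATE_PER_GB_EACH_WAY * 2  # billed on both sides

# Illustrative volume: 50 TB/month of east-west traffic.
before = monthly_cross_az_cost(50_000)
after = monthly_cross_az_cost(50_000, local_fraction=0.8)
print(f"before: ${before:,.2f}/mo  after: ${after:,.2f}/mo")
```

&lt;p&gt;In this illustration, keeping 80% of the traffic zone-local cuts the line item from $1,000 to $200 a month; your numbers will differ with real traffic volumes.&lt;/p&gt;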

&lt;h3&gt;
  
  
  Challenges to Consider
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complex Configuration:&lt;/strong&gt; Setting up advanced IPAM modes and topology-aware routing requires deeper networking knowledge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning Curve:&lt;/strong&gt; Teams new to eBPF and Cilium’s enhanced policy model may face an adjustment period.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Resource Limits:&lt;/strong&gt; AWS ENI attachment limits and subnet sizing must be carefully managed to avoid capacity bottlenecks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes Version Dependency:&lt;/strong&gt; Some features rely on newer Kubernetes releases supporting topology hints and service routing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Optimizing cross-AZ traffic on AWS Kubernetes clusters is essential for both cost efficiency and application performance. Cilium’s eBPF-driven approach combined with AWS native networking integration offers a modern, powerful solution. While the setup is more involved than stock CNI defaults, the tradeoff is significant savings and greater control. For technical teams ready to invest in advanced networking, Cilium is a compelling choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading and Official Resources
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cilium Topology-Aware Routing: &lt;a href="https://docs.cilium.io/en/stable/networking/topology-aware-routing/" rel="noopener noreferrer"&gt;https://docs.cilium.io/en/stable/networking/topology-aware-routing/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Cilium AWS ENI Mode Documentation: &lt;a href="https://docs.cilium.io/en/stable/networking/aws-eni/" rel="noopener noreferrer"&gt;https://docs.cilium.io/en/stable/networking/aws-eni/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Installing Cilium on EKS in ENI Mode: &lt;a href="https://cilium.io/blog/2025/06/19/eks-eni-install/" rel="noopener noreferrer"&gt;https://cilium.io/blog/2025/06/19/eks-eni-install/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Kubernetes Topology-Aware Service Routing: &lt;a href="https://kubernetes.io/docs/concepts/services-networking/service-topology/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/services-networking/service-topology/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Observability with Hubble: &lt;a href="https://docs.cilium.io/en/stable/operations/hubble/" rel="noopener noreferrer"&gt;https://docs.cilium.io/en/stable/operations/hubble/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Cluster Mesh Overview: &lt;a href="https://docs.cilium.io/en/stable/networking/clustermesh/" rel="noopener noreferrer"&gt;https://docs.cilium.io/en/stable/networking/clustermesh/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Getting Started with Cilium on Amazon EKS: &lt;a href="https://aws.amazon.com/blogs/opensource/getting-started-with-cilium-service-mesh-on-amazon-eks/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/opensource/getting-started-with-cilium-service-mesh-on-amazon-eks/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With these resources and a considered approach, teams can unlock the full potential of Cilium to streamline AWS Kubernetes networking and lower their cross-AZ bill. Happy networking!&lt;/p&gt;

</description>
      <category>cilium</category>
      <category>kubernetes</category>
      <category>aws</category>
      <category>containers</category>
    </item>
    <item>
      <title>Istio in Simple English: Imagine Your Apps Living in a Smart City 🚦🏙️</title>
      <dc:creator>Hardeep Singh Tiwana</dc:creator>
      <pubDate>Fri, 29 Aug 2025 05:02:14 +0000</pubDate>
      <link>https://dev.to/hstiwana/istio-in-simple-english-imagine-your-apps-living-in-a-smart-city-h55</link>
      <guid>https://dev.to/hstiwana/istio-in-simple-english-imagine-your-apps-living-in-a-smart-city-h55</guid>
      <description>&lt;p&gt;After &lt;a href="https://dev.to/hstiwana/understanding-kubernetes-in-simple-english-what-would-kubernetes-look-like-if-it-was-a-global-1bal"&gt;explaining Kubernetes in simple terms&lt;/a&gt;, many have asked about service meshes, particularly Istio. So let’s dive into &lt;strong&gt;Istio&lt;/strong&gt;, a powerful service mesh that helps manage, secure, and observe microservices in a Kubernetes environment.&lt;/p&gt;

&lt;p&gt;If Kubernetes is like a global restaurant franchise, Istio is like the &lt;strong&gt;traffic control and security system&lt;/strong&gt; of a bustling smart city filled with tons of little shops, roads, and delivery trucks all needing to communicate reliably and securely.&lt;/p&gt;

&lt;p&gt;Imagine your collection of microservices as vibrant businesses spread across this city, each handling its own specific job. Some sell bread, others deliver packages, some offer repairs; it’s a complex ecosystem that needs order to thrive.&lt;/p&gt;

&lt;p&gt;Without a city planner, traffic controller, and security patrols, this city becomes chaotic fast, with delivery crashes, wrong shipments, and security breaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Welcome to Istio City: The Smart Traffic &amp;amp; Security Authority 🚦🏙️
&lt;/h2&gt;

&lt;p&gt;Istio is the invisible infrastructure layer that sits &lt;strong&gt;between the services (shops) and their communication networks (roads)&lt;/strong&gt;, helping manage, secure, and monitor traffic moving through your city.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Istio Smart City Architecture: Two Big Departments 🏢🧠
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Control Plane: The City Hall 🏛️
&lt;/h3&gt;

&lt;p&gt;At the heart of Istio’s smart city is the &lt;strong&gt;Control Plane&lt;/strong&gt;, led by a brainy department called &lt;strong&gt;Istiod&lt;/strong&gt;. It works like city hall, responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Traffic Planning and Rules&lt;/strong&gt;: Deciding which roads trucks take, who gets priority, and who must stop. (Traffic management)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security &amp;amp; Identity&lt;/strong&gt;: Issuing ID badges (certificates) to trucks and enforcing checkpoints to block unauthorized visitors. (mTLS, authentication, authorization)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration Distribution&lt;/strong&gt;: Sending new laws and updates to traffic patrols and checkpoints across the city. (Proxy configuration)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service Discovery&lt;/strong&gt;: Keeping track of all active shops and routes in the city.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Data Plane: The Traffic Controllers on the Roads 🚓
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Data Plane&lt;/strong&gt; consists of numerous &lt;strong&gt;Envoy proxies&lt;/strong&gt; that act as local traffic cops and watchdogs stationed alongside each shop or neighborhood. They:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handle the actual flow of traffic between shops (service-to-service communication)&lt;/li&gt;
&lt;li&gt;Enforce traffic rules, security policies, and routing decisions from city hall&lt;/li&gt;
&lt;li&gt;Collect data on traffic patterns to send back to the control plane.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Traffic Tools of Istio City 🛠️
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sidecar Proxies 🛺&lt;/strong&gt;: In the classic model, every shop gets its own personal traffic cop walking right beside it, guiding every visitor in or out. These “sidecar” proxies are attached to each microservice (Pod). They intercept all requests in and out, securing, routing, and monitoring communication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gateways 🚪&lt;/strong&gt;: Big city gates that control traffic coming into the city from outside, handling things like securing communication from outside customers or other cities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Virtual Services 🛣️&lt;/strong&gt;: These are the traffic plans dictating which roads should lead visitors to which shops, including fancy maneuvers like canary releases or A/B testing, sending some visitors down new paths without disrupting the flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Destination Rules 🎯&lt;/strong&gt;: Policies applied to destinations (shops) about how they want visitors handled, controlling load balancing methods, connection pools, and failure recovery behavior.&lt;/li&gt;
&lt;/ul&gt;
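&lt;p&gt;To ground the "traffic plans" idea, here is a sketch of a Virtual Service doing a canary split. The service and subset names are illustrative, and the subsets would be defined in a companion Destination Rule:&lt;/p&gt;

```yaml
# Hypothetical canary: 90% of visitors take the old road, 10% the new one.
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: bakery
spec:
  hosts:
  - bakery                # in-mesh service name (illustrative)
  http:
  - route:
    - destination:
        host: bakery
        subset: v1        # subsets come from a DestinationRule
      weight: 90
    - destination:
        host: bakery
        subset: v2
      weight: 10
```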




&lt;h2&gt;
  
  
  Sidecar Mode 🛺 vs Ambient Mesh Mode 🚕
&lt;/h2&gt;

&lt;p&gt;Istio lets you choose how to deploy your traffic cops:&lt;/p&gt;

&lt;h3&gt;
  
  
  Sidecar Mode 🛺 (Each Shop Has Its Own Traffic Cop)
&lt;/h3&gt;

&lt;p&gt;In Sidecar Mode, every microservice gets its own Envoy proxy sidecar walking alongside. Think of this as assigning a personal traffic cop who manages all the incoming and outgoing traffic for that one shop.&lt;/p&gt;
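&lt;p&gt;In practice, the personal traffic cop is usually assigned automatically: labeling a namespace opts all of its pods into sidecar injection (the namespace name here is illustrative):&lt;/p&gt;

```yaml
# Pods created in this namespace get an Envoy sidecar injected automatically.
apiVersion: v1
kind: Namespace
metadata:
  name: bakery            # illustrative
  labels:
    istio-injection: enabled
```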

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Granular control&lt;/strong&gt; over traffic for every single microservice.&lt;/li&gt;
&lt;li&gt;Supports the full spectrum of Istio features (fine routing, detailed telemetry, strict security).&lt;/li&gt;
&lt;li&gt;Helps direct visitors to the closest shop for faster service.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Challenges:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each sidecar uses CPU and memory; running hundreds or thousands adds real overhead.&lt;/li&gt;
&lt;li&gt;Increased complexity in managing many proxies.&lt;/li&gt;
&lt;li&gt;Slight latency increase as traffic goes through proxies one by one.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Ambient Mesh Mode 🚕 (Smart Roads with Patrol Cars)
&lt;/h3&gt;

&lt;p&gt;In the newer Ambient Mode, Istio shifts from giving every shop a dedicated traffic cop to creating &lt;strong&gt;smart, shared roads&lt;/strong&gt; patrolled by a few highly efficient traffic controllers. Instead of a cop next to every shop, the roads themselves become intelligent.&lt;/p&gt;
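&lt;p&gt;The opt-in works similarly in Ambient Mode, but the namespace joins the smart roads instead of hiring per-shop cops. This uses the label from current ambient documentation; verify it for your Istio version:&lt;/p&gt;

```yaml
# Pods in this namespace are captured by ambient's shared node-level
# proxies (ztunnel) rather than per-pod sidecars.
apiVersion: v1
kind: Namespace
metadata:
  name: bakery            # illustrative
  labels:
    istio.io/dataplane-mode: ambient
```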

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower resource usage: fewer proxies mean better efficiency at scale.&lt;/li&gt;
&lt;li&gt;Easier upgrades and simplified operations since fewer proxies to manage.&lt;/li&gt;
&lt;li&gt;Works well for large-scale deployments or services where full sidecar detail isn’t needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Challenges:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less detailed control at the microservice level right now.&lt;/li&gt;
&lt;li&gt;Some advanced Istio features are still catching up in support.&lt;/li&gt;
&lt;li&gt;Larger security zones; a misconfiguration affects more services.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why Choose Istio? The City’s Edge In Microservices Management 🌟
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Traffic Control&lt;/strong&gt;: Manage traffic flow with retries, timeouts, canary releases, and circuit breakers so the city runs smoothly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt;: Automatic mutual TLS, identity verification, and policies build a zero-trust city protecting shops from unauthorized visitors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt;: Detailed logs, metrics, and tracing give city planners insights into traffic jams before shoppers complain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resilience &amp;amp; Flexibility&lt;/strong&gt;: Quickly redirect traffic, recover from failures, and deploy new service versions without shutting things down.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Challenges in Running Istio City 🚧
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The complexity of Istio’s infrastructure and configuration can be overwhelming for smaller teams.&lt;/li&gt;
&lt;li&gt;Managing sidecar overhead and scaling efficiently requires careful planning.&lt;/li&gt;
&lt;li&gt;Keeping policies consistent in dynamic, multi-cloud environments takes skill.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Diagram: Istio Smart City Analogy
&lt;/h2&gt;

&lt;p&gt;Here is a custom diagram representing the Istio service mesh smart city analogy, showing the difference between Sidecar Mode and Ambient Mesh Mode, with key components and their roles symbolized visually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                 +------------------------+
                 |       ISTIO CITY HALL  |
                 |     (Control Plane /   |
                 |      Istiod Controller)|
                 +-----------+------------+
                             |
           ----------------------------------------
           |                                      |
+-----------------------+             +---------------------------+
|    Sidecar Mode 🛺    |             |  Ambient Mesh Mode 🚕     |
|  (Personal Traffic    |             |  (Smart Shared Roads)     |
|   Cop for Each Shop)  |             |                           |
+-----------------------+             +---------------------------+
| +-------------------+ |             | +-----------------------+ |
| |    Shop A         | |             | |    Neighborhood A     | |
| | [App Container]   | |             | | +-------------------+ | |
| | [Envoy Sidecar]   | |             | | | Shared Patrol Car | | |
| +-------------------+ |             | | +-------------------+ | |
|                       |             | | +-------------------+ | |
| +-------------------+ |             | | |  Shop A, Shop B   | | |
| |    Shop B         | |  &amp;lt;------&amp;gt;   | | +-------------------+ | |
| | [App Container]   | |  Traffic    | +-----------------------+ |
| | [Envoy Sidecar]   | |  Flow       |                           |
| +-------------------+ |             | +-----------------------+ |
|                       |             | |    Neighborhood B     | |
| +-------------------+ |             | | +-------------------+ | |
| |   Gateway (City   | |             | | | Shared Patrol Car | | |
| |      Gate)        | |             | | +-------------------+ | |
| +-------------------+ |             | +-----------------------+ |
|                       |             |                           |
+-----------------------+             +---------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;KEY ROLES:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Control Plane (City Hall): Manages traffic rules, security policies, and distributes configs.&lt;/li&gt;
&lt;li&gt;Data Plane (Sidecars or Ambient Patrols): Enforces traffic routing, security, telemetry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;FEATURES:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sidecar Mode: Proxies attached to each app handle traffic individually.&lt;/li&gt;
&lt;li&gt;Ambient Mode: Smart shared proxies manage traffic for multiple apps collectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;BENEFITS &amp;amp; CHALLENGES:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sidecar: Granular control &amp;amp; full features; resource overhead &amp;amp; complexity.&lt;/li&gt;
&lt;li&gt;Ambient: Lower overhead &amp;amp; simpler ops; less granular control currently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This diagram visually contrasts the two modes with buildings/shops representing microservices and their proxy traffic managers as individual sidecars or shared patrol cars on the roads, under the supervision of the central Control Plane city hall. It highlights the components and their roles within the smart city (Istio) analogy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Closing Thoughts 🍞➡️🚦➡️🌆
&lt;/h2&gt;

&lt;p&gt;If Kubernetes is your global kitchen serving hundreds of dishes simultaneously, Istio is your city’s traffic and security authority ensuring every dish travels safely and smoothly from kitchen to customer. Whether you assign a dedicated traffic cop to every kitchen station with sidecars, or upgrade to smart, shared roads with ambient mesh, Istio empowers your microservices city to grow resilient and secure, freeing up your chefs and bakers to focus on cooking the best apps.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://istio.io/latest/docs/" rel="noopener noreferrer"&gt;Istio Official Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://istio.io/latest/docs/ops/deployment/architecture/" rel="noopener noreferrer"&gt;Istio Architecture Diagram&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.redhat.com/en/topics/microservices/what-is-a-service-mesh" rel="noopener noreferrer"&gt;Service Mesh Explained&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.envoyproxy.io/" rel="noopener noreferrer"&gt;Envoy Proxy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/" rel="noopener noreferrer"&gt;Kubernetes Networking Basics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/hstiwana/understanding-kubernetes-in-simple-english-what-would-kubernetes-look-like-if-it-was-a-global-1bal"&gt;Understanding Kubernetes in Simple English&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Note: This article is a simplified analogy to help understand Istio concepts. Real-world implementations may vary based on specific use cases and configurations.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Let me know if you found this helpful or have any questions! :)
&lt;/h3&gt;

</description>
      <category>istio</category>
      <category>servicemesh</category>
      <category>kubernetes</category>
      <category>microservices</category>
    </item>
    <item>
      <title>Containers in Plain English: The Shipping Container of Tech 🚢🍱</title>
      <dc:creator>Hardeep Singh Tiwana</dc:creator>
      <pubDate>Wed, 16 Jul 2025 17:43:07 +0000</pubDate>
      <link>https://dev.to/hstiwana/-containers-in-plain-english-the-shipping-container-of-tech-1ge</link>
      <guid>https://dev.to/hstiwana/-containers-in-plain-english-the-shipping-container-of-tech-1ge</guid>
      <description>&lt;p&gt;You asked, and I listened! After the great feedback on my &lt;strong&gt;&lt;a href="https://dev.to/hstiwana/understanding-kubernetes-in-simple-english-what-would-kubernetes-look-like-if-it-was-a-global-1bal"&gt;Kubernetes in plain English&lt;/a&gt;&lt;/strong&gt; explanation, many of you requested a similar breakdown for &lt;strong&gt;containers&lt;/strong&gt;. So, here's my attempt to demystify containers for you. Enjoy the read!&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Imagine you’re sending a &lt;strong&gt;meal kit&lt;/strong&gt; 🍱 (your application) from your kitchen to friends around the world. But kitchens everywhere have different equipment (hardware, operating systems), so you worry: &lt;em&gt;What if my recipe needs a special pan, or a rare spice?&lt;/em&gt; Here’s where &lt;strong&gt;containers&lt;/strong&gt; 🚢 come to the rescue.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Container? (The Bento Box of Software 🍱)
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;container&lt;/strong&gt; is like a perfectly-packed, sealed &lt;strong&gt;bento box&lt;/strong&gt; for your meal kit. Inside, you don’t just have the food (your code), but also every little thing needed to &lt;em&gt;make that meal work&lt;/em&gt;—sauces, utensils, spice packets (your dependencies), even instructions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With containers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You send out your &lt;strong&gt;bento box&lt;/strong&gt; and &lt;em&gt;any kitchen&lt;/em&gt; can serve your meal exactly as intended—no confusion, no missing ingredients, no awkward substitutions.&lt;/li&gt;
&lt;li&gt;The recipient opens the box and gets a self-contained meal, ready to enjoy, independent of their own pantry.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Technologies (with Friendly Analogies)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Container Image 🖼️ ≈ Recipe Blueprint 📒:&lt;/strong&gt;
Like a detailed photo and recipe booklet, a &lt;strong&gt;container image&lt;/strong&gt; contains every step and all ingredients needed to construct the meal, lock it in, and ship it anywhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container Engine 🔥 ≈ Chef’s Stove:&lt;/strong&gt;
The &lt;strong&gt;container engine&lt;/strong&gt; (like Docker) is the versatile stove that knows how to cook &lt;em&gt;any&lt;/em&gt; meal packed in one of these bento boxes, regardless of the local kitchen quirks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Host OS 🖥️ ≈ Restaurant Floor:&lt;/strong&gt;
The kitchen floor that supports all the stoves. You can run many containers side by side, each on its own burner, cooking entirely different meals without bumping into each other (thanks to isolation features).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Registry 🗄️ ≈ Recipe Warehouse:&lt;/strong&gt;
A central &lt;em&gt;warehouse&lt;/em&gt; where all recipes (container images) are safely stored and ready to be shipped to any kitchen in seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How Containers Are Different from Traditional Boxes (Virtual Machines)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Containers 🍱&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Virtual Machines 🏢&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ship only the meal, not the whole kitchen; super lightweight&lt;/td&gt;
&lt;td&gt;Each box packs not just the meal, but the &lt;em&gt;entire kitchen&lt;/em&gt; (full OS), making it heavy and bulky&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fast to open, serve, and refresh&lt;/td&gt;
&lt;td&gt;Takes longer to unbox and set up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A dozen containers can share one kitchen floor&lt;/td&gt;
&lt;td&gt;Each needs its own floor space&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why People Love Containers (Benefits)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Portability 🚀:&lt;/strong&gt;
Ship your meal anywhere—laptop, cloud, or on-premise kitchen—with the &lt;em&gt;guarantee it’ll taste the same everywhere&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Efficiency 💡:&lt;/strong&gt;
No wasted space. Spin up dozens of meals (apps) on a single kitchen floor (host machine) without fighting for room.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability 📈:&lt;/strong&gt;
Add more meals during rush hour, or pack them away after lunch (scale up/down instantly). Container orchestrators (think kitchen managers) like Kubernetes can automate this process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency 🔄:&lt;/strong&gt;
Every cook (developer) and diner (user) gets the same meal, every time—no nasty surprises.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Some Real Challenges (The Sour Bits)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complexity 🌀:&lt;/strong&gt;
Once you’re shipping thousands of meal kits around the globe, you need smart kitchen managers (like Kubernetes) to keep everything running smoothly. That adds a new layer of learning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security 🔐:&lt;/strong&gt;
Everyone loves easy-to-share meals, but you must guard against someone sneaking bad ingredients (vulnerabilities) into your kits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visibility 👀:&lt;/strong&gt;
So many small boxes—hard to see what’s inside all of them, making monitoring and troubleshooting tricky.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Containers&lt;/strong&gt; are your secret to stress-free, scalable, and reliable &lt;em&gt;meal delivery&lt;/em&gt;—no matter where or how you cook. They're the magic lunchbox that guarantees your creation looks and tastes the same, from your own kitchen to the cloud’s massive cafeteria 🍱➡️☁️.&lt;/p&gt;

&lt;p&gt;The next time you deploy an app in a container, picture your perfectly packed bento box, ready to delight diners everywhere—just add heat!&lt;/p&gt;




&lt;h2&gt;
  
  
  📖🧠📚Sources, Guides, and Inspiration📖🧠📚:
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/hstiwana/understanding-kubernetes-in-simple-english-what-would-kubernetes-look-like-if-it-was-a-global-1bal"&gt;https://dev.to/hstiwana/understanding-kubernetes-in-simple-english-what-would-kubernetes-look-like-if-it-was-a-global-1bal&lt;/a&gt;&lt;br&gt;
&lt;a href="https://aws.amazon.com/what-is/containerization/" rel="noopener noreferrer"&gt;https://aws.amazon.com/what-is/containerization/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://en.wikipedia.org/wiki/Containerization_(computing)" rel="noopener noreferrer"&gt;https://en.wikipedia.org/wiki/Containerization_(computing)&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.docker.com/resources/what-container/" rel="noopener noreferrer"&gt;https://www.docker.com/resources/what-container/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>containers</category>
      <category>devops</category>
      <category>tutorial</category>
      <category>makeiteasytoremember</category>
    </item>
    <item>
      <title>Understanding Kubernetes in Simple English: What would Kubernetes look like if it was a global restaurant franchise?</title>
      <dc:creator>Hardeep Singh Tiwana</dc:creator>
      <pubDate>Tue, 15 Jul 2025 21:47:18 +0000</pubDate>
      <link>https://dev.to/hstiwana/understanding-kubernetes-in-simple-english-what-would-kubernetes-look-like-if-it-was-a-global-1bal</link>
      <guid>https://dev.to/hstiwana/understanding-kubernetes-in-simple-english-what-would-kubernetes-look-like-if-it-was-a-global-1bal</guid>
      <description>&lt;p&gt;Imagine &lt;strong&gt;Kubernetes&lt;/strong&gt; as a futuristic, global restaurant franchise. Running thousands of branches reliably, efficiently, and securely needs more than good chefs and cooks—it needs an orchestrated symphony of managers, systems, and trusted recipes. Let’s cook up a story that brings Kubernetes concepts to life through the daily operations of this grand culinary operation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Welcome to Kubernetes Kitchen 🍳🎛️!
&lt;/h2&gt;

&lt;p&gt;Each &lt;strong&gt;Application 🍲&lt;/strong&gt; is a Signature Dish served in your restaurant. But a modern dish is more than just the food—it comes with unique instructions, tools, and even a particular type of pan (&lt;em&gt;dependencies&lt;/em&gt;). Every time a plate is prepared, it's following a carefully packed kit: this is our &lt;strong&gt;Container 🍲&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Pod 🍳&lt;/strong&gt; is like a cooking station on the kitchen line, perhaps with several chefs working side by side on the same dish (multiple containers working together tightly).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ReplicaSet 👨‍🍳&lt;/strong&gt; is the restaurant manager responsible for ensuring you always have the right number of cooking stations making a specific dish, so you never run out during the dinner rush. If a cook falls ill or a stove breaks, the manager instantly sets up another station so that service continues uninterrupted.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Deployment 🧑‍💼&lt;/strong&gt; is like the general manager who sets the policy: “We need three pizza stations at all times, and if we ever update the recipe, do it without missing a single order!” The deployment manages the manager (&lt;em&gt;ReplicaSet&lt;/em&gt;), so if there’s a menu change (new dish version), it smoothly transitions from the old to the new without service disruption.&lt;/p&gt;

&lt;p&gt;Your dish’s secret sauce recipe? Those are &lt;strong&gt;Secrets 🔒&lt;/strong&gt;. The shared pantry list? Those are your &lt;strong&gt;ConfigMaps 🗒️&lt;/strong&gt;: detailed notes provided to each station according to your restaurant’s need for consistency.&lt;/p&gt;
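&lt;p&gt;As a small, purely illustrative sketch (the names and values are made up), the pantry list and the secret sauce might look like this in manifest form:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: ConfigMap
metadata:
  name: pantry-list        # shared, openly readable notes
data:
  MENU_LANGUAGE: "en"
  OVEN_TEMP: "220"
---
apiVersion: v1
kind: Secret
metadata:
  name: secret-sauce       # chef-only access
type: Opaque
stringData:
  SAUCE_RECIPE: "tomato, basil, a mystery spice"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;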

&lt;p&gt;Your &lt;strong&gt;Volumes 🧊&lt;/strong&gt; are the fridge or pantry spaces shared by stations, so cooks can store their ingredients and access them anytime—perfect for special prep or long-simmering stocks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes: The Management Backbone 🍳🎛️
&lt;/h3&gt;

&lt;p&gt;Let’s tour the heart of this restaurant empire—&lt;strong&gt;the Kubernetes Cluster&lt;/strong&gt;:&lt;/p&gt;

&lt;h4&gt;
  
  
  The Control Plane: Where Strategy Happens 🍳🎛️
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Server:🛎️&lt;/strong&gt; The busy bee! This is our "head receptionist". Every instruction—from new recipe rollouts to extra cooks for a busy Saturday—passes through here. All guests (users or components) submit their requests to the API Server, who ensures every message is communicated and tracked across the cluster. It is the gatekeeper for all Kubernetes instructions and state changes: all staff must check in here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;etcd:🗄️&lt;/strong&gt; The most important component: the "master recipe safe/vault". Every table booking, pantry stock, dish recipe, and station setup is logged here—a consistent, reliable, distributed database that never loses important notes. It is the secure, central store for recipes, inventory, and reservations (cluster state and configuration).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Controller Manager:🕹️&lt;/strong&gt; The franchise’s "operations manager". Ensures the kitchen floor matches the plan: adding stations when needed and retiring those not in use. If a kitchen promises three pasta stations but one disappears, the controller manager notices and brings another online. It ensures that declarations (from Deployments, ReplicaSets, etc.) match reality, constantly adjusting to maintain the desired state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduler:📅&lt;/strong&gt; The "line/shift supervisor". As new orders (Pods) come in, the scheduler assigns each to the best available station (Node), making sure the workload is balanced, and no kitchen is overburdened.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Controller Manager:🌐&lt;/strong&gt; The "travel/facilities manager". Connects kitchens to new cities (cloud), coordinates services like front door access or equipment delivery. It ensures each restaurant interacts smoothly with its city—whether it’s opening in new locations on a cloud platform or requesting resources from the local infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  On the Kitchen Floor: The Worker Nodes 🍽️🏢
&lt;/h4&gt;

&lt;p&gt;Every restaurant &lt;strong&gt;Node 🍽️🏢&lt;/strong&gt; is a bustling branch, staffed with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;kubelet:👩‍🍳&lt;/strong&gt; The "sous-chef" to the control plane, ensuring every cooking station (Pod) on the node is running as ordered, checking their status, and reporting back upstairs/HQ (API server).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container Runtime:🔥&lt;/strong&gt; The "cooking appliances"—the stove, oven, and pans built for containers (Docker, containerd, etc.)—capable of cooking each dish exactly as packaged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kube Proxy:🚦&lt;/strong&gt; The "waiter and receptionist" team for networking, making sure the correct dishes (services) reach the right tables (network addresses) and handling the kitchen’s communication with guests and with other kitchens.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Special Components for Smooth Operations
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pod:🍳&lt;/strong&gt; The "kitchen station", prepped and loaded with ingredients (containers) for a dish.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service:🏷️&lt;/strong&gt; The "front counter". Customers don’t care which kitchen made their pasta; they ask for “Pasta Al Dente,” and Service directs that request to any available, healthy station.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ConfigMap:🗒️&lt;/strong&gt; The "recipe cards" openly displayed in the kitchen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secret:🔒&lt;/strong&gt; The "locked-away safe" with secret sauce recipes—chef-only access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volume:🧊&lt;/strong&gt; The "shared fridge" for ingredients, accessible as needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🚀 Scaling the Chain: A Day in the Franchise 🚀
&lt;/h3&gt;

&lt;p&gt;Let’s say your hit app, &lt;em&gt;Pizza Deluxe&lt;/em&gt;, is going viral:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;strong&gt;Deployment 🧑‍💼&lt;/strong&gt; (general manager) dictates: “We need 10 pizza stations, always running the latest pizza recipe.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ReplicaSet 👨‍🍳&lt;/strong&gt; ensures exactly 10 active pizza stations (Pods) ready to cook.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Scheduler 📅&lt;/strong&gt; finds the best locations for every new station as demand rises.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kubelet&lt;/strong&gt; on each node confirms, “My stations are prepped and cooking!” If a cook leaves, a new one gets hired automatically.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;API Server 🛎️&lt;/strong&gt; never misses a single order, and &lt;strong&gt;etcd 🗄️&lt;/strong&gt; ensures organizational memory is always correct and up-to-date.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Services 🏷️&lt;/strong&gt; ensure that when customers (users) ask for pizza, their request is sent to the right available Pod, so no patron waits too long.&lt;/li&gt;
&lt;li&gt;Want to update the recipe? Deployment manages a rolling upgrade, introducing new containers gradually, so service never goes down.&lt;/li&gt;
&lt;li&gt;Volumes persist the dough between station rebuilds, Secrets keep the sauce recipe safe, and ConfigMaps post the menu outside each kitchen.&lt;/li&gt;
&lt;/ol&gt;
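&lt;p&gt;The general manager’s standing order above could be written as a Deployment manifest—a sketch only, with an illustrative image name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: pizza-deluxe
spec:
  replicas: 10                 # "10 pizza stations, always running"
  selector:
    matchLabels:
      app: pizza-deluxe
  strategy:
    type: RollingUpdate        # update the recipe without missing an order
  template:
    metadata:
      labels:
        app: pizza-deluxe
    spec:
      containers:
      - name: pizza
        image: registry.example.com/pizza-deluxe:v2   # hypothetical image
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;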

&lt;h2&gt;
  
  
  Kubernetes Kitchen: Visualized
&lt;/h2&gt;

&lt;p&gt;Here’s my attempt to make a conceptual &lt;strong&gt;diagram&lt;/strong&gt; representing the analogy and flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------------------+
|    Kubernetes API       |&amp;lt;------+
|          Server         |       |
+-----------+-------------+       |
            |                     |
            v                     |
+-----------+-----------------+   |
|        Controller Manager   |&amp;lt;--+
+----------------------------+
            |
            v
+--------------------------+
|       Scheduler          |
+-----------+--------------+
            |
            v                 +--------------+
+-----------+-------------+   |   ETCD      |
|    Nodes (Restaurants)  |---| (Recipe DB) |
+-------------------------+   +--------------+
| +---------------+      |
| |   Kubelet     |      |
| +---------------+      |
| | Kube Proxy    |      |
| +---------------+      |
| |ContainerRuntime|     |
| +---------------+      |
| | Pods          |      |
| +------|-------+       |
|        v               |
|   Containers (Dishes)  |
+------------------------+
|
+---&amp;gt; ConfigMap, Secret, Volume (pantry, safe, fridge)
|
+---&amp;gt; Service (front counter)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;(Visualization: Each control plane component “manages” the restaurant network, while each node is a kitchen staffed with all elements needed to make, package, and serve your signature dishes.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As your restaurant chain grows, &lt;strong&gt;Kubernetes&lt;/strong&gt; tirelessly orchestrates every kitchen—ensuring every customer gets a hot, perfectly prepped meal at scale, every time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Imagine Kubernetes as your ultimate franchise operations HQ—scaling, managing, and securing every station, recipe, and service window, so your team can focus on creating the world's best dining experience.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;Want to learn what powers Kubernetes? Read my blog on &lt;a href="https://dev.to/hstiwana/-containers-in-plain-english-the-shipping-container-of-tech-1ge"&gt;Containers in Plain English: The Shipping Container of Tech 🚢🍱&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;📖🧠📚Sources, Guides, and Inspiration📖🧠📚:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Kubernetes Components and Architecture&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/concepts/overview/components/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/overview/components/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/concepts/architecture/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/architecture/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://spacelift.io/blog/kubernetes-architecture" rel="noopener noreferrer"&gt;https://spacelift.io/blog/kubernetes-architecture&lt;/a&gt;&lt;br&gt;
&lt;a href="https://spot.io/resources/kubernetes-architecture/11-core-components-explained/" rel="noopener noreferrer"&gt;https://spot.io/resources/kubernetes-architecture/11-core-components-explained/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://sysdig.com/learn-cloud-native/components-of-kubernetes/" rel="noopener noreferrer"&gt;https://sysdig.com/learn-cloud-native/components-of-kubernetes/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Analogy &amp;amp; Flow: Restaurant Scenario&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://kodekloud.com/blog/day-4-deployments-replicasets-how-kubernetes-runs-and-manages-your-app/" rel="noopener noreferrer"&gt;https://kodekloud.com/blog/day-4-deployments-replicasets-how-kubernetes-runs-and-manages-your-app/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployments, ReplicaSets, and Pods: Lifecycle and Scaling&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://zeet.co/blog/kubernetes-deployment-vs-pod" rel="noopener noreferrer"&gt;https://zeet.co/blog/kubernetes-deployment-vs-pod&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/workloads/controllers/deployment/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.linkedin.com/posts/gabrielokom_kubernetes-deployment-replicaset-activity-7249259381932920832-Bkn0" rel="noopener noreferrer"&gt;https://www.linkedin.com/posts/gabrielokom_kubernetes-deployment-replicaset-activity-7249259381932920832-Bkn0&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>tutorial</category>
      <category>makeiteasytoremember</category>
    </item>
    <item>
      <title>Part2: Kubernetes Backup on Managed Services: What Changes When You Use EKS?</title>
      <dc:creator>Hardeep Singh Tiwana</dc:creator>
      <pubDate>Mon, 23 Jun 2025 16:12:03 +0000</pubDate>
      <link>https://dev.to/hstiwana/part2-kubernetes-backup-on-managed-services-what-changes-when-you-use-eks-30el</link>
      <guid>https://dev.to/hstiwana/part2-kubernetes-backup-on-managed-services-what-changes-when-you-use-eks-30el</guid>
      <description>&lt;p&gt;In my &lt;a href="https://dev.to/hstiwana/kubernetes-backup-strategies-balancing-cost-security-and-availability-3jpd"&gt;previous blog post&lt;/a&gt;, I covered Kubernetes backup strategies for self-managed clusters, highlighting cost, security, and availability. But what happens when you’re using a managed Kubernetes service like Amazon Elastic Kubernetes Service (EKS)? Let’s dive into the key differences and best practices for backing up Kubernetes on managed platforms.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Big Shift: Managed Control Plane
&lt;/h2&gt;

&lt;p&gt;With managed Kubernetes services like Amazon EKS, AWS handles the control plane—including etcd, the API server, and scheduler. &lt;strong&gt;You don’t have direct access to etcd or the control plane components.&lt;/strong&gt; This means you can’t perform traditional etcd snapshots as you would on a self-managed cluster. Instead, your backup strategy must focus on what you can control:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes objects&lt;/strong&gt; (Deployments, Services, ConfigMaps, Secrets, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent data&lt;/strong&gt; (EBS volumes used by your workloads)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Application configurations and manifests&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What to Back Up on EKS
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes Objects:&lt;/strong&gt; Anything you create or manage via the Kubernetes API—workloads, configurations, and policies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent Volumes:&lt;/strong&gt; Data stored on EBS volumes attached to your pods.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Networking and Security:&lt;/strong&gt; Ingress, Network Policies, and RBAC rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application Data:&lt;/strong&gt; For databases or stateful apps, use application-aware backups for consistency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to Back Up on EKS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use Velero for Kubernetes Object Backup
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://velero.io/" rel="noopener noreferrer"&gt;Velero&lt;/a&gt; is the go-to tool for backing up and restoring Kubernetes resources on EKS. It works directly with the Kubernetes API, so it’s perfect for managed services where you can’t access etcd. Velero can back up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;All resources in a namespace or across the cluster&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Persistent volumes (with the right configuration)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Custom resources and configurations&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Velero supports scheduling, retention policies, and can store backups in S3, which integrates well with AWS security and cost controls.&lt;/p&gt;
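&lt;p&gt;As a sketch, a Velero &lt;code&gt;Schedule&lt;/code&gt; resource combining a cron schedule, namespace scope, and retention might look like this (the namespace and schedule values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: prod-daily
  namespace: velero
spec:
  schedule: "0 3 * * *"        # daily at 03:00 UTC
  template:
    includedNamespaces:
    - prod
    ttl: 720h0m0s              # keep backups for 30 days
    snapshotVolumes: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;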

&lt;h3&gt;
  
  
  Back Up Persistent Data
&lt;/h3&gt;

&lt;p&gt;For stateful applications, use Velero’s volume snapshot feature to back up EBS volumes. This ensures your data is protected and can be restored if needed. You can also use application-specific backup tools for databases (e.g., pg_dump, mysqldump) and store the output in S3.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automate and Test
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schedule regular backups&lt;/strong&gt; to minimize data loss.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automate retention&lt;/strong&gt; to delete old backups and control costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test restores&lt;/strong&gt; to ensure your backups are valid and your recovery process works.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Security and Availability
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encryption:&lt;/strong&gt; Use &lt;a href="https://aws.amazon.com/kms/" rel="noopener noreferrer"&gt;AWS KMS&lt;/a&gt; to encrypt backups at rest and in transit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immutable Backups:&lt;/strong&gt; Store backups in S3 with &lt;a href="https://aws.amazon.com/s3/features/object-lock/" rel="noopener noreferrer"&gt;Object Lock&lt;/a&gt; to prevent tampering or deletion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Region Storage:&lt;/strong&gt; Replicate backups across regions for disaster recovery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access Control:&lt;/strong&gt; Use IAM and RBAC to restrict who can create, delete, or restore backups.&lt;/li&gt;
&lt;/ul&gt;
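&lt;p&gt;As one hedged sketch of tying these together, Velero’s AWS plugin can point a &lt;code&gt;BackupStorageLocation&lt;/code&gt; at an Object Lock-enabled bucket and request SSE-KMS encryption (the bucket name, region, and key alias below are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: secure-backups
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: my-eks-backups          # bucket with S3 Object Lock enabled
  config:
    region: us-east-1
    serverSideEncryption: aws:kms   # encrypt objects at rest with KMS
    kmsKeyId: alias/backup-key      # hypothetical key alias
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;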

&lt;h2&gt;
  
  
  Cost Considerations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Storage Tiering:&lt;/strong&gt; Move older backups to cheaper storage like &lt;a href="https://aws.amazon.com/s3/storage-classes/glacier/" rel="noopener noreferrer"&gt;S3 Glacier&lt;/a&gt; to save money.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental Backups:&lt;/strong&gt; Only back up changed data to reduce storage and bandwidth costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retention Policies:&lt;/strong&gt; Automatically delete old backups to avoid unnecessary charges.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What You Can’t Back Up
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Control Plane/etcd:&lt;/strong&gt; Managed by AWS, not accessible for direct backup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node-Level State:&lt;/strong&gt; Unless you use custom tools or scripts, node-level state is typically not backed up by default.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Backup Target&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Self-Managed Kubernetes&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;EKS (Managed Kubernetes)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;etcd/Control Plane&lt;/td&gt;
&lt;td&gt;Yes (manual snapshots)&lt;/td&gt;
&lt;td&gt;No (managed by AWS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes Objects&lt;/td&gt;
&lt;td&gt;Yes (Velero, etcdctl)&lt;/td&gt;
&lt;td&gt;Yes (Velero via API)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persistent Volumes&lt;/td&gt;
&lt;td&gt;Yes (Velero, volume snapshots)&lt;/td&gt;
&lt;td&gt;Yes (Velero, EBS snapshots)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Application Data&lt;/td&gt;
&lt;td&gt;Yes (app-aware tools)&lt;/td&gt;
&lt;td&gt;Yes (app-aware tools, S3 storage)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Networking/Security&lt;/td&gt;
&lt;td&gt;Yes (Velero, GitOps)&lt;/td&gt;
&lt;td&gt;Yes (Velero, GitOps)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Use Velero for disaster recovery and migration.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Automate backups and retention to control costs.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Encrypt and protect backups with AWS security features.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Test your restore process regularly.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Store backups in multiple regions for resilience.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Additional Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://repost.aws/knowledge-center/eks-cluster-back-up-restore" rel="noopener noreferrer"&gt;AWS re:Post: Back up and restore an Amazon EKS cluster&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://trilio.io/kubernetes-disaster-recovery/eks-backup/" rel="noopener noreferrer"&gt;Trilio: EKS Backup Tutorial &amp;amp; Best Practices&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://aws.amazon.com/blogs/containers/backup-and-restore-your-amazon-eks-cluster-resources-using-velero/" rel="noopener noreferrer"&gt;AWS Blog: Backup and restore your Amazon EKS cluster resources using Velero&lt;/a&gt;&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;In summary:&lt;/strong&gt;&lt;br&gt;
When using managed Kubernetes services like EKS, your backup strategy shifts to focus on Kubernetes objects, persistent data, and application configurations—leveraging tools like Velero and AWS storage features for a robust, cost-effective, and secure approach.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>backup</category>
      <category>recovery</category>
      <category>availability</category>
    </item>
    <item>
      <title>Part1: Kubernetes Backup Strategies: Balancing Cost, Security, and Availability</title>
      <dc:creator>Hardeep Singh Tiwana</dc:creator>
      <pubDate>Mon, 23 Jun 2025 15:54:17 +0000</pubDate>
      <link>https://dev.to/hstiwana/kubernetes-backup-strategies-balancing-cost-security-and-availability-3jpd</link>
      <guid>https://dev.to/hstiwana/kubernetes-backup-strategies-balancing-cost-security-and-availability-3jpd</guid>
      <description>&lt;p&gt;Backing up a Kubernetes cluster is a critical task for any organization running containerized workloads. However, it’s not just about what you back up—it’s also about how you do it, how much it costs, and how you ensure your backups are secure and available when needed. This post brings together best practices for Kubernetes backups, with a focus on cost efficiency, robust security, and high availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Back Up in Kubernetes
&lt;/h2&gt;

&lt;p&gt;A comprehensive backup strategy for Kubernetes should include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Configuration and State&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;etcd database:&lt;/strong&gt; Stores all cluster data and is essential for disaster recovery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes objects:&lt;/strong&gt; Deployments, StatefulSets, Services, ConfigMaps, Secrets, and custom resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manifests:&lt;/strong&gt; Store in version control (e.g., Git) for easy recovery and versioning.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Persistent Data&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Persistent Volumes (PVs) and Persistent Volume Claims (PVCs):&lt;/strong&gt; Critical for stateful applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application data:&lt;/strong&gt; Use application-aware backups for databases and other stateful workloads.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Networking and Security&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Services, Ingress, Network Policies:&lt;/strong&gt; Ensure consistent access and security post-restore.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to Back Up Kubernetes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tools and Methods
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;etcd Snapshots:&lt;/strong&gt; Use &lt;code&gt;etcdctl&lt;/code&gt; to create and restore snapshots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Velero:&lt;/strong&gt; Open-source tool for backup, restore, and disaster recovery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volume Snapshots:&lt;/strong&gt; Use Kubernetes’ &lt;code&gt;VolumeSnapshot&lt;/code&gt; API for point-in-time backups of persistent data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitOps:&lt;/strong&gt; Store manifests and configuration in Git for declarative management.&lt;/li&gt;
&lt;/ul&gt;
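&lt;p&gt;For example, a point-in-time snapshot of a PVC via the &lt;code&gt;VolumeSnapshot&lt;/code&gt; API looks roughly like this (the snapshot class and claim names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: db-data-snap
spec:
  volumeSnapshotClassName: csi-snapclass   # hypothetical snapshot class
  source:
    persistentVolumeClaimName: db-data     # the PVC to snapshot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;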

&lt;h3&gt;
  
  
  Example: Velero Backup Command
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;velero backup create my-backup &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include-namespaces&lt;/span&gt; prod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--storage-location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;s3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ttl&lt;/span&gt; 720h &lt;span class="se"&gt;\ &lt;/span&gt;     &lt;span class="c"&gt;# 30-day retention&lt;/span&gt;
  &lt;span class="nt"&gt;--snapshot-volumes&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--volume-snapshot-locations&lt;/span&gt; aws-us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Cost Optimization Strategies
&lt;/h2&gt;

&lt;p&gt;Backing up persistent data can become expensive if not managed carefully. Here are ways to reduce costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Storage Tiering:&lt;/strong&gt; Move older backups to cheaper storage tiers (e.g., AWS S3 Glacier).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental Backups:&lt;/strong&gt; Only back up changed data to minimize storage and network costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retention Automation:&lt;/strong&gt; Automatically delete outdated backups using tools like Velero’s &lt;code&gt;ttl&lt;/code&gt; parameter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deduplication &amp;amp; Compression:&lt;/strong&gt; Reduce backup size with tools like Kasten K10 or TrilioVault.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frequency Tuning:&lt;/strong&gt; Align backup schedules with business needs—daily instead of hourly for non-critical workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Cost Factor&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;High-Cost Approach&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Optimized Approach&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Premium SSD ($$)&lt;/td&gt;
&lt;td&gt;Tiered + compressed ($)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retention&lt;/td&gt;
&lt;td&gt;Manual ($$)&lt;/td&gt;
&lt;td&gt;Automated (free/low)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backup Frequency&lt;/td&gt;
&lt;td&gt;Hourly ($$)&lt;/td&gt;
&lt;td&gt;Daily/weekly ($)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Security Best Practices
&lt;/h2&gt;

&lt;p&gt;Security is a critical aspect of backup management:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encryption:&lt;/strong&gt; Enable AES-256 encryption in transit (TLS) and at rest (e.g., AWS KMS).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immutable Backups:&lt;/strong&gt; Use WORM-compliant storage (e.g., AWS S3 Object Lock) to prevent tampering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access Control:&lt;/strong&gt; Apply RBAC and IAM policies to restrict backup access; audit with CloudTrail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrity Checks:&lt;/strong&gt; Validate backups with checksums and periodic test restores.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Security Measure&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Encryption&lt;/td&gt;
&lt;td&gt;Data encrypted in transit and at rest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Immutable Backups&lt;/td&gt;
&lt;td&gt;Backups cannot be altered or deleted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access Control&lt;/td&gt;
&lt;td&gt;Only authorized users can access backups&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integrity Checks&lt;/td&gt;
&lt;td&gt;Regular validation and test restores&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Availability Considerations
&lt;/h2&gt;

&lt;p&gt;Ensuring backups are available when needed is just as important as creating them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Region Replication:&lt;/strong&gt; Store backups across multiple regions or availability zones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disaster Recovery Drills:&lt;/strong&gt; Regularly test restore procedures to ensure backups are valid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immutable Infrastructure:&lt;/strong&gt; Use Velero with etcd snapshots for cluster-state recovery.&lt;/li&gt;
&lt;/ul&gt;
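&lt;p&gt;For instance, a secondary Velero &lt;code&gt;BackupStorageLocation&lt;/code&gt; in another region can hold a replicated copy of your backups (names below are illustrative; the cross-region copying itself is typically handled by S3 replication):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: dr-us-west-2
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: k8s-backups-us-west-2   # replica bucket in the DR region
  config:
    region: us-west-2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;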

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Availability Feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Region Storage&lt;/td&gt;
&lt;td&gt;Backups stored in multiple geographic locations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regular Test Restores&lt;/td&gt;
&lt;td&gt;Ensures recoverability and backup integrity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Immutable Infrastructure&lt;/td&gt;
&lt;td&gt;Prevents accidental or malicious changes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Cost-Security-Availability Tradeoff Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Goal&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;High-Cost Approach&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Optimized Approach&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Premium SSD ($$)&lt;/td&gt;
&lt;td&gt;Tiered + compressed ($)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Custom encryption ($$)&lt;/td&gt;
&lt;td&gt;Cloud-managed KMS + IAM ($)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Availability&lt;/td&gt;
&lt;td&gt;Real-time replication ($$)&lt;/td&gt;
&lt;td&gt;Multi-region + weekly snaps ($)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Back up both cluster state (etcd) and persistent data (PVs/PVCs).&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Use tools like Velero and Kubernetes’ VolumeSnapshot API for automation.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Optimize costs with storage tiering, incremental backups, and automated retention + Storage Lifecycle Management policies.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ensure security with encryption, immutable backups, and strict access control.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Guarantee availability with multi-region storage and regular test restores.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://velero.io/docs/v1.9/cost-optimization/" rel="noopener noreferrer"&gt;Velero Cost Optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/backup/pricing/" rel="noopener noreferrer"&gt;AWS Backup Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/" rel="noopener noreferrer"&gt;Kubernetes Availability Configs&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By following these guidelines, you can create a robust, cost-effective, and secure backup strategy for your Kubernetes clusters—ensuring your workloads are always protected and recoverable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/hstiwana/part2-kubernetes-backup-on-managed-services-what-changes-when-you-use-eks-30el"&gt;Continue to Part2&lt;/a&gt; &lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>backup</category>
      <category>recovery</category>
      <category>availability</category>
    </item>
    <item>
      <title>Kubernetes Scheduling: podAntiAffinity vs. topologySpreadConstraints</title>
      <dc:creator>Hardeep Singh Tiwana</dc:creator>
      <pubDate>Wed, 18 Jun 2025 18:37:37 +0000</pubDate>
      <link>https://dev.to/hstiwana/kubernetes-scheduling-podantiaffinity-vs-topologyspreadconstraints-41j4</link>
      <guid>https://dev.to/hstiwana/kubernetes-scheduling-podantiaffinity-vs-topologyspreadconstraints-41j4</guid>
      <description>&lt;p&gt;When it comes to deploying resilient and highly available applications in Kubernetes, scheduling constraints are key. Two powerful tools for controlling pod placement are podAntiAffinity and topologySpreadConstraints. While both help manage pod distribution, they serve different purposes and offer distinct advantages. Let’s break down what each does, how they differ, and when to use them.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Imagine you’re running a critical application on Kubernetes. You want to ensure that your pods are spread across different nodes or zones to avoid downtime if a single node fails. This is where scheduling constraints come into play. Kubernetes offers several mechanisms for this, but two of the most important are &lt;strong&gt;&lt;code&gt;podAntiAffinity&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;topologySpreadConstraints&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What Is &lt;code&gt;podAntiAffinity&lt;/code&gt;?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;podAntiAffinity&lt;/strong&gt; is a scheduling rule that prevents certain pods from being co-located on the same node or topology domain (like a zone). It’s designed for scenarios where you want to maximize fault tolerance by ensuring that no two instances of your application run on the same node or zone.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How it works:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Strict separation&lt;/strong&gt;: You can specify that a pod should not run on the same node as another pod with a certain label.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topology key&lt;/strong&gt;: Uses a topologyKey (e.g., &lt;code&gt;kubernetes.io/hostname&lt;/code&gt; for nodes, &lt;code&gt;topology.kubernetes.io/zone&lt;/code&gt; for zones).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enforcement&lt;/strong&gt;: Can be set as &lt;code&gt;requiredDuringSchedulingIgnoredDuringExecution&lt;/code&gt; (strict) or &lt;code&gt;preferredDuringSchedulingIgnoredDuringExecution&lt;/code&gt; (best effort).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Use case:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;podAntiAffinity&lt;/code&gt; when you absolutely must prevent pods from being on the same node or zone—such as for database replicas or critical microservices.&lt;/p&gt;
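&lt;p&gt;As a minimal sketch (the Deployment name and labels are illustrative), the following rule keeps any two replicas of the same app off a shared node:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        podAntiAffinity:
          # Strict: scheduling blocks if no compliant node exists
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: my-app
              topologyKey: kubernetes.io/hostname  # at most one replica per node
      containers:
        - name: my-app
          image: nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Swap the &lt;code&gt;topologyKey&lt;/code&gt; for &lt;code&gt;topology.kubernetes.io/zone&lt;/code&gt; to enforce separation at the zone level instead.&lt;/p&gt;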

&lt;h2&gt;
  
  
  &lt;strong&gt;What Are &lt;code&gt;topologySpreadConstraints&lt;/code&gt;?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;topologySpreadConstraints&lt;/code&gt;&lt;/strong&gt; are a more flexible way to control pod distribution. Instead of just preventing co-location, they allow you to specify how evenly pods should be distributed across topology domains (nodes, zones, regions).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How it works:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Even distribution&lt;/strong&gt;: You can define a &lt;code&gt;maxSkew&lt;/code&gt; to set the maximum allowable difference in pod count between domains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topology key&lt;/strong&gt;: Uses &lt;code&gt;topologyKey&lt;/code&gt; to specify the domain (node, zone, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: You can configure whether scheduling is still allowed when the constraint can’t be met, via &lt;code&gt;whenUnsatisfiable&lt;/code&gt; (&lt;code&gt;DoNotSchedule&lt;/code&gt; or &lt;code&gt;ScheduleAnyway&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hierarchical control&lt;/strong&gt;: Works across multiple levels (nodes within zones, zones within regions).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Use case:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;topologySpreadConstraints&lt;/code&gt; when you want to balance pod distribution for high availability, load balancing, or cost optimization, and are willing to tolerate some imbalance if necessary.&lt;/p&gt;
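&lt;p&gt;As a sketch (again with illustrative names), the same Deployment can spread its replicas evenly across zones with a skew of at most one:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                         # zone counts may differ by at most 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway  # best effort; DoNotSchedule for strict
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: my-app
          image: nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;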

&lt;h3&gt;
  
  
  &lt;strong&gt;Comparison Table&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;podAntiAffinity&lt;/th&gt;
&lt;th&gt;topologySpreadConstraints&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Strict separation&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (but can be close)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Even distribution&lt;/td&gt;
&lt;td&gt;Not guaranteed&lt;/td&gt;
&lt;td&gt;Yes (configurable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Topology flexibility&lt;/td&gt;
&lt;td&gt;Specific (node, zone)&lt;/td&gt;
&lt;td&gt;Hierarchical or flat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scheduling flexibility&lt;/td&gt;
&lt;td&gt;No (can block scheduling)&lt;/td&gt;
&lt;td&gt;Yes (can allow skew)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can You Use Both?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Absolutely! Combining &lt;code&gt;podAntiAffinity&lt;/code&gt; and &lt;code&gt;topologySpreadConstraints&lt;/code&gt; gives you the best of both worlds: strict separation where needed, and balanced distribution for overall resilience.&lt;/p&gt;
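&lt;p&gt;As a sketch (labels are illustrative), a single Pod template can carry both: strict node-level separation from &lt;code&gt;podAntiAffinity&lt;/code&gt; plus best-effort zone balancing from &lt;code&gt;topologySpreadConstraints&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: my-app
          topologyKey: kubernetes.io/hostname  # never two replicas on one node
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway  # tolerate zone imbalance rather than block
      labelSelector:
        matchLabels:
          app: my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;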

&lt;h3&gt;
  
  
  &lt;strong&gt;When to Use Each&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;podAntiAffinity&lt;/strong&gt;: When you must prevent pods from being on the same node or zone (e.g., to avoid single points of failure).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;topologySpreadConstraints&lt;/strong&gt;: When you want to balance pod distribution across your cluster for high availability, load balancing, or cost optimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Understanding the differences between &lt;code&gt;podAntiAffinity&lt;/code&gt; and &lt;code&gt;topologySpreadConstraints&lt;/code&gt; is crucial for designing robust Kubernetes deployments. Use &lt;code&gt;podAntiAffinity&lt;/code&gt; for strict separation and &lt;code&gt;topologySpreadConstraints&lt;/code&gt; for flexible, balanced distribution. Together, they help you build resilient, highly available applications.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>k8s</category>
      <category>deepdive</category>
      <category>kubernetesinternals</category>
    </item>
    <item>
      <title>How does VolumeMount work in Kubernetes? What happens in the backend?</title>
      <dc:creator>Hardeep Singh Tiwana</dc:creator>
      <pubDate>Sun, 04 May 2025 23:34:06 +0000</pubDate>
      <link>https://dev.to/hstiwana/how-does-volumemount-work-in-kubernetes-what-happens-in-the-backend-3k0n</link>
      <guid>https://dev.to/hstiwana/how-does-volumemount-work-in-kubernetes-what-happens-in-the-backend-3k0n</guid>
      <description>&lt;h2&gt;
  
  
  Understanding Kubernetes VolumeMounts, PersistentVolumeClaims, and StorageClasses (with YAML Examples)
&lt;/h2&gt;

&lt;p&gt;Persistent storage is essential for stateful applications in Kubernetes. To manage storage dynamically and reliably, Kubernetes uses a combination of &lt;code&gt;VolumeMounts&lt;/code&gt;, &lt;code&gt;PersistentVolumeClaims&lt;/code&gt; (PVCs), and &lt;code&gt;StorageClasses&lt;/code&gt;. In this post, we’ll demystify how these components work together, and provide practical YAML examples to help you implement them in your clusters.&lt;/p&gt;




&lt;h3&gt;
  
  
  1. VolumeMounts: Connecting Storage to Containers
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;VolumeMount&lt;/code&gt; specifies where a volume should appear inside a container. It references a volume defined at the Pod level and maps it to a directory inside the container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;YAML Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;volume-mount-example&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
      &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-storage&lt;/span&gt;
          &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/usr/share/nginx/html&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-storage&lt;/span&gt;
      &lt;span class="na"&gt;emptyDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;  &lt;span class="c1"&gt;# Ephemeral storage for demonstration&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The volume &lt;code&gt;my-storage&lt;/code&gt; is mounted inside the &lt;code&gt;nginx&lt;/code&gt; container at &lt;code&gt;/usr/share/nginx/html&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Any files written to that path are stored in the &lt;code&gt;emptyDir&lt;/code&gt; volume, which is deleted when the Pod is removed.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2. PersistentVolumeClaims and PersistentVolumes: Decoupling Storage from Pods
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;PersistentVolume (PV):&lt;/strong&gt; A cluster-wide resource representing a piece of storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PersistentVolumeClaim (PVC):&lt;/strong&gt; A request for storage by a user or application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;YAML Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# PersistentVolume (PV)&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolume&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pv-example&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;capacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
  &lt;span class="na"&gt;hostPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/mnt/data&lt;/span&gt;  &lt;span class="c1"&gt;# For demo purposes; use real storage in production&lt;/span&gt;

&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# PersistentVolumeClaim (PVC)&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pvc-example&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The PV is a piece of storage available in the cluster.&lt;/li&gt;
&lt;li&gt;The PVC requests 1Gi of storage with &lt;code&gt;ReadWriteOnce&lt;/code&gt; access.&lt;/li&gt;
&lt;li&gt;Kubernetes binds the PVC to the PV if their requirements match.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3. StorageClasses: Automating and Tiering Storage
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;StorageClass&lt;/code&gt; defines how to provision storage dynamically. It specifies the provisioner (such as a cloud provider’s CSI driver) and parameters for the storage backend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;YAML Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example StorageClass for AWS EBS&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;storage.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;StorageClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ebs-csi-storageclass&lt;/span&gt; &lt;span class="c1"&gt;# This name is used by PersistentVolumeClaim&lt;/span&gt;
&lt;span class="na"&gt;provisioner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ebs.csi.aws.com&lt;/span&gt; &lt;span class="c1"&gt;# Example for AWS EBS CSI provisioner&lt;/span&gt;
&lt;span class="na"&gt;volumeBindingMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WaitForFirstConsumer&lt;/span&gt;
&lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp3&lt;/span&gt; &lt;span class="c1"&gt;# Or your preferred EBS volume type&lt;/span&gt;
  &lt;span class="na"&gt;iops&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5000"&lt;/span&gt; &lt;span class="c1"&gt;# Example IOPS&lt;/span&gt;
  &lt;span class="na"&gt;throughput&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;250"&lt;/span&gt; &lt;span class="c1"&gt;# Example Throughput&lt;/span&gt;
  &lt;span class="na"&gt;encrypted&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt; &lt;span class="c1"&gt;# Example encryption&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a PVC requests &lt;code&gt;storageClassName: ebs-csi-storageclass&lt;/code&gt;, Kubernetes uses this StorageClass to dynamically provision a new PV.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  4. Bringing It All Together: Pod Using a PVC and StorageClass
&lt;/h3&gt;

&lt;p&gt;Here’s how you’d use a PVC (which in turn uses a StorageClass) in a Pod:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;YAML Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# PersistentVolumeClaim using a StorageClass&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dynamic-pvc&lt;/span&gt; &lt;span class="c1"&gt;# This name is used by POD in "volumes" section&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10Gi&lt;/span&gt;
  &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ebs-csi-storageclass&lt;/span&gt; &lt;span class="c1"&gt;# see reference to StorageClass name above&lt;/span&gt;

&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# Pod mounting the PVC&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-pod&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-container&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:latest&lt;/span&gt;
      &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-ebs-volume&lt;/span&gt; &lt;span class="c1"&gt;#This name and name in volumes section should match&lt;/span&gt;
          &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/mnt/data&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-ebs-volume&lt;/span&gt; &lt;span class="c1"&gt;# This name and name in volumeMounts section should match&lt;/span&gt;
      &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dynamic-pvc&lt;/span&gt; &lt;span class="c1"&gt;#see reference in PersistentVolumeClaim above&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Workflow:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The PVC (&lt;code&gt;dynamic-pvc&lt;/code&gt;) requests 10Gi of storage using the &lt;code&gt;ebs-csi-storageclass&lt;/code&gt; StorageClass.&lt;/li&gt;
&lt;li&gt;Kubernetes dynamically provisions a PV using the StorageClass.&lt;/li&gt;
&lt;li&gt;The Pod references the PVC in its &lt;code&gt;volumes&lt;/code&gt; section.&lt;/li&gt;
&lt;li&gt;The PVC is mounted inside the container at &lt;code&gt;/mnt/data&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  5. Lifecycle and Backend Process
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Provisioning&lt;/strong&gt;: If dynamic provisioning is used, the StorageClass’s provisioner creates the storage when a PVC is created.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Binding&lt;/strong&gt;: The PVC is bound to a PV (either static or dynamically provisioned).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mounting&lt;/strong&gt;: When the Pod is scheduled, the kubelet mounts the storage to the container’s filesystem at the specified &lt;code&gt;mountPath&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reclaiming&lt;/strong&gt;: When the PVC is deleted, the PV’s &lt;code&gt;persistentVolumeReclaimPolicy&lt;/code&gt; determines whether the storage is deleted, retained, or recycled (&lt;code&gt;Recycle&lt;/code&gt; is deprecated).&lt;/li&gt;
&lt;/ul&gt;
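&lt;p&gt;For example, to keep the data after its PVC is deleted, set the PV’s reclaim policy to &lt;code&gt;Retain&lt;/code&gt; (the PV name below is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# In the PV spec (dynamically provisioned PVs default to Delete):
spec:
  persistentVolumeReclaimPolicy: Retain

# Or patch an existing PV in place:
# kubectl patch pv pv-example -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;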




&lt;h2&gt;
  
  
  Summary Table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Defined By&lt;/th&gt;
&lt;th&gt;Key Fields&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VolumeMount&lt;/td&gt;
&lt;td&gt;Mounts storage inside a container&lt;/td&gt;
&lt;td&gt;Pod spec (user)&lt;/td&gt;
&lt;td&gt;name, mountPath&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PersistentVolume&lt;/td&gt;
&lt;td&gt;Cluster storage resource&lt;/td&gt;
&lt;td&gt;Admin/Kubernetes&lt;/td&gt;
&lt;td&gt;capacity, accessModes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PersistentVolumeClaim&lt;/td&gt;
&lt;td&gt;Request for storage&lt;/td&gt;
&lt;td&gt;User&lt;/td&gt;
&lt;td&gt;resources, accessModes, storageClassName&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;StorageClass&lt;/td&gt;
&lt;td&gt;Storage “profile” for dynamic provisioning&lt;/td&gt;
&lt;td&gt;Admin&lt;/td&gt;
&lt;td&gt;provisioner, parameters&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Kubernetes Storage: Common Issues, Troubleshooting, and Limitations
&lt;/h2&gt;

&lt;p&gt;Building on the fundamentals of VolumeMounts, PersistentVolumeClaims, and StorageClasses, it’s crucial to understand the real-world challenges that teams face when running stateful workloads in Kubernetes. Here, we’ll cover common issues, troubleshooting strategies, and key limitations you should be aware of.&lt;/p&gt;




&lt;h3&gt;
  
  
  Common Issues with Kubernetes Storage
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. PersistentVolumeClaim (PVC) Not Bound&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The PVC remains in a &lt;code&gt;Pending&lt;/code&gt; state and is not bound to any PersistentVolume (PV).&lt;/li&gt;
&lt;li&gt;Causes include mismatched storage size, access modes, or StorageClass between the PVC and available PVs, or insufficient underlying storage resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Volume Mount Failures&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pods may fail to start with errors related to mounting the volume.&lt;/li&gt;
&lt;li&gt;This can be due to incorrect volume definitions, unavailable storage backends, or node failures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Storage Plugin/CSI Issues&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Problems with the Container Storage Interface (CSI) driver, such as outdated versions or plugin crashes, can prevent volumes from being provisioned or mounted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Pod Stuck in Pending or CrashLoopBackOff&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage-related configuration errors can cause pods to remain in &lt;code&gt;Pending&lt;/code&gt; or repeatedly crash (&lt;code&gt;CrashLoopBackOff&lt;/code&gt;), especially when required volumes are not available or properly mounted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Deletion and Reclaim Policy Problems&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PVs may not be deleted or released as expected due to misconfigured reclaim policies, leading to orphaned resources and wasted storage.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Troubleshooting Kubernetes Storage Issues
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step-by-Step Troubleshooting:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Check Pod Events and Status&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;kubectl describe pod&lt;/code&gt; to view events and error messages related to volume mounting or PVC binding.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Inspect PVC and PV Status&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Run &lt;code&gt;kubectl get pvc&lt;/code&gt; and &lt;code&gt;kubectl describe pvc&lt;/code&gt; to check if the PVC is bound and to see any error messages.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;kubectl get pv&lt;/code&gt; to examine the status and properties of PersistentVolumes.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Verify StorageClass and CSI Driver&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Ensure the correct StorageClass is referenced and the CSI driver is running and up to date (&lt;code&gt;kubectl get pod -n kube-system | grep csi&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Review Node and Network Health&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Check node status and network connectivity to the storage backend, as network issues can prevent volume attachment.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Check for Resource Constraints&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Ensure there are enough resources (CPU, memory, storage) available on nodes to support the requested volumes.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Zone-aware Auto Scaling&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;If your workloads are zone-specific, you’ll need to create separate node groups for each zone, because the &lt;code&gt;cluster-autoscaler&lt;/code&gt; assumes that all nodes in a group are exactly equivalent. If a scale-up event is triggered by a pod that needs a zone-specific PVC (e.g. an EBS volume), the new node may come up in the wrong AZ and the pod will fail to start.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Logs and Observability&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Examine logs from the affected pod and CSI driver for detailed error information. Use monitoring tools to track resource usage and events.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Manual Remediation&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;If a node fails, remove it with &lt;code&gt;kubectl delete node&lt;/code&gt; to trigger pod rescheduling and volume reattachment.&lt;/li&gt;
&lt;li&gt;Deleting and recreating pods or PVCs can sometimes resolve transient issues.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h3&gt;
  
  
  Limitations of Kubernetes Storage
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Storage Backend Compatibility&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not all storage solutions support every Kubernetes feature (e.g., ReadWriteMany access mode is not available on many block storage backends).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Dynamic Provisioning Constraints&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic provisioning relies on properly configured StorageClasses and CSI drivers. Misconfiguration or lack of support for certain features can lead to failed provisioning. Also see the &lt;strong&gt;Zone-aware Auto Scaling&lt;/strong&gt; note in the &lt;strong&gt;Troubleshooting&lt;/strong&gt; section above.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Data Durability and Redundancy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes itself does not provide data replication or backup; this is the responsibility of the storage backend or external tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Performance Overheads&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage performance depends on the underlying infrastructure. Network-attached storage may introduce latency compared to local disks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Scaling and Resource Quotas&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage scalability is limited by the backend and resource quotas. Over-provisioning or lack of quotas can lead to resource contention and degraded performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. Security and Access Controls&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-grained access controls for storage resources may be limited, especially when using some legacy or simple backends.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Example: Troubleshooting a PVC Not Bound
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check PVC status&lt;/span&gt;
kubectl get pvc

&lt;span class="c"&gt;# Describe the PVC for detailed events and errors&lt;/span&gt;
kubectl describe pvc dynamic-pvc

&lt;span class="c"&gt;# Check available PVs and their properties&lt;/span&gt;
kubectl get pv

&lt;span class="c"&gt;# If needed, check StorageClass and CSI driver status&lt;/span&gt;
kubectl get storageclass
kubectl get pod &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system | &lt;span class="nb"&gt;grep &lt;/span&gt;csi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Best Practices
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Always match PVC requests (size, access mode, StorageClass) to available PVs.&lt;/li&gt;
&lt;li&gt;Monitor pod, PVC, and PV events regularly.&lt;/li&gt;
&lt;li&gt;Keep CSI drivers up to date and monitor their health.&lt;/li&gt;
&lt;li&gt;Configure storage quotas and limits to avoid resource exhaustion.&lt;/li&gt;
&lt;li&gt;Choose storage backends that align with your application’s durability, performance, and scalability needs.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;When a Kubernetes Pod that uses a PersistentVolumeClaim (PVC) is deleted, whether due to failure, scaling, or an update, the underlying PersistentVolume (PV) and its data are preserved. Here’s how Kubernetes reattaches the volume to a new Pod:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PVC and PV Binding:&lt;/strong&gt; The PVC remains bound to the PV as long as the PVC resource exists. The binding is tracked by the &lt;code&gt;claimRef&lt;/code&gt; field in the PV, which references the PVC.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pod Replacement:&lt;/strong&gt; When a new Pod is created (for example, by a Deployment or StatefulSet), and it references the same PVC in its &lt;code&gt;.spec.volumes&lt;/code&gt;, Kubernetes schedules the Pod and ensures the volume is reattached and mounted at the specified path inside the new container.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Volume Attachment:&lt;/strong&gt; The kubelet on the target node coordinates with the storage backend (via the CSI driver or in-tree volume plugin) to attach the PV to the node where the new Pod is scheduled. Once attached, the volume is mounted into the container at the path specified in &lt;code&gt;volumeMounts&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Persistence:&lt;/strong&gt; Because the PV is persistent and not tied to any single Pod, the new Pod sees the same data as the previous Pod. This enables seamless failover or rolling updates without data loss.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Kubernetes storage is powerful and flexible. By combining &lt;code&gt;VolumeMounts&lt;/code&gt;, &lt;code&gt;PersistentVolumeClaims&lt;/code&gt;, and &lt;code&gt;StorageClasses&lt;/code&gt;, you can decouple your applications from the underlying storage, automate provisioning, and ensure your workloads have the storage performance and reliability they need.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>volumemounts</category>
      <category>underthehood</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
