<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: srinu nuthi</title>
    <description>The latest articles on DEV Community by srinu nuthi (@srinu_nuthi_5ff587c586662).</description>
    <link>https://dev.to/srinu_nuthi_5ff587c586662</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3983809%2Fcb051504-a84c-4db8-9e57-a52ea689aa69.png</url>
      <title>DEV Community: srinu nuthi</title>
      <link>https://dev.to/srinu_nuthi_5ff587c586662</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/srinu_nuthi_5ff587c586662"/>
    <language>en</language>
    <item>
      <title>How DNS Works: A Practical Guide for DevOps and Developers</title>
      <dc:creator>srinu nuthi</dc:creator>
      <pubDate>Sun, 14 Jun 2026 12:03:00 +0000</pubDate>
      <link>https://dev.to/srinu_nuthi_5ff587c586662/how-dns-works-a-practical-guide-for-devops-and-developers-15dp</link>
      <guid>https://dev.to/srinu_nuthi_5ff587c586662/how-dns-works-a-practical-guide-for-devops-and-developers-15dp</guid>
      <description>&lt;p&gt;Every time you type &lt;code&gt;www.google.com&lt;/code&gt; and hit enter, your computer doesn't actually know where Google lives on the internet — it has to ask for directions. That's DNS: the internet's phonebook, translating human-readable names into machine-readable IP addresses. Let's follow a single DNS query from the moment you press enter to the moment your browser connects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we need DNS
&lt;/h2&gt;

&lt;p&gt;Imagine memorizing &lt;code&gt;142.250.185.46&lt;/code&gt; instead of typing &lt;code&gt;google.com&lt;/code&gt;. IP addresses change, servers move, but domain names stay constant and memorable. DNS is the bridge between human memory and computer networking.&lt;/p&gt;

&lt;h2&gt;
  
  
  The journey of a DNS query
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Browser cache
&lt;/h3&gt;

&lt;p&gt;Before asking anyone, your browser checks its own DNS cache. Each entry has a &lt;strong&gt;TTL (Time To Live)&lt;/strong&gt;. If there's a valid, non-expired entry, the journey ends here in under a millisecond. Let's assume it's a miss.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. OS cache
&lt;/h3&gt;

&lt;p&gt;Your browser asks the OS, which keeps its own cache shared across all apps. On Windows you can view it with &lt;code&gt;ipconfig /displaydns&lt;/code&gt;. Still a miss in our scenario.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Hosts file
&lt;/h3&gt;

&lt;p&gt;The OS checks the hosts file (&lt;code&gt;/etc/hosts&lt;/code&gt; on Unix, &lt;code&gt;C:\Windows\System32\drivers\etc\hosts&lt;/code&gt; on Windows). It's a manual override — admins use it for testing, and you can block domains by pointing them at &lt;code&gt;127.0.0.1&lt;/code&gt;. No entry for public sites, so we move on.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. The recursive resolver
&lt;/h3&gt;

&lt;p&gt;Now we leave your computer. The OS sends the query to a &lt;strong&gt;recursive resolver&lt;/strong&gt; — usually your ISP's, or a public one like Google's &lt;code&gt;8.8.8.8&lt;/code&gt; or Cloudflare's &lt;code&gt;1.1.1.1&lt;/code&gt;. Think of it as a librarian: it doesn't know the answer offhand, but it knows how to find it. The query carries the domain name, the record type (usually &lt;code&gt;A&lt;/code&gt; for IPv4), and your return address.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. The resolver's cache
&lt;/h3&gt;

&lt;p&gt;The resolver serves thousands or millions of users, so popular domains are almost always cached. If &lt;code&gt;www.example.com&lt;/code&gt; is cached and fresh, you get an answer in roughly 10–30 ms. If not, the resolver walks the DNS hierarchy on your behalf, querying the root, TLD, and authoritative servers in turn.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Root servers
&lt;/h3&gt;

&lt;p&gt;At the top of the DNS tree are the &lt;strong&gt;root servers&lt;/strong&gt;: 13 named identities (A through M), each of which is actually a globally distributed cluster reached via anycast. The root doesn't know where your domain is, but it knows who runs &lt;code&gt;.com&lt;/code&gt;, and returns a &lt;strong&gt;referral&lt;/strong&gt; to the &lt;code&gt;.com&lt;/code&gt; TLD servers.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. TLD servers
&lt;/h3&gt;

&lt;p&gt;The resolver asks the &lt;code&gt;.com&lt;/code&gt; TLD servers. They maintain the registry of which &lt;strong&gt;authoritative nameservers&lt;/strong&gt; handle each &lt;code&gt;.com&lt;/code&gt; domain, and return another referral:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;example.com is managed by:
ns1.examplehost.com (198.51.100.1)
ns2.examplehost.com (198.51.100.2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  8. Authoritative nameservers
&lt;/h3&gt;

&lt;p&gt;This is the source of truth. The resolver asks the authoritative nameserver, which looks up the record in its zone file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;www.example.com.    3600    IN    A    93.184.216.34
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;3600&lt;/code&gt; — TTL in seconds (how long to cache)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;IN&lt;/code&gt; — Internet class&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;A&lt;/code&gt; — record type (IPv4; &lt;code&gt;AAAA&lt;/code&gt; for IPv6)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;93.184.216.34&lt;/code&gt; — the IP&lt;/li&gt;
&lt;/ul&gt;
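
&lt;p&gt;To make the field layout concrete, here is a minimal sketch that splits the record line above into named fields. &lt;code&gt;ResourceRecord&lt;/code&gt; and &lt;code&gt;parse_record&lt;/code&gt; are hypothetical names for illustration, and the parser only handles the fully spelled-out form shown here; real zone files allow the TTL and class to be omitted.&lt;/p&gt;

```python
from typing import NamedTuple


class ResourceRecord(NamedTuple):
    name: str
    ttl: int      # seconds a resolver may cache the answer
    rclass: str   # "IN" for Internet
    rtype: str    # "A", "AAAA", ...
    rdata: str    # the value, here an IPv4 address


def parse_record(line: str) -> ResourceRecord:
    """Split a simple 'name ttl class type rdata' line into fields."""
    name, ttl, rclass, rtype, rdata = line.split(maxsplit=4)
    return ResourceRecord(name, int(ttl), rclass, rtype, rdata)


record = parse_record("www.example.com.    3600    IN    A    93.184.216.34")
print(record)
```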

&lt;h3&gt;
  
  
  9–10. The return journey and local caching
&lt;/h3&gt;

&lt;p&gt;The resolver caches the answer (respecting the TTL) and returns it. Your OS caches it, passes it to the browser, which also caches it. Next time, it's instant.&lt;/p&gt;

&lt;h3&gt;
  
  
  11. Making the connection
&lt;/h3&gt;

&lt;p&gt;With the IP in hand, the browser opens a TCP connection (port 80 for HTTP, 443 for HTTPS), completes the TCP handshake (plus a TLS handshake for HTTPS), sends an HTTP GET, and renders the page. The DNS portion of a cold lookup typically takes &lt;strong&gt;20–120 ms&lt;/strong&gt;; when the answer comes from a local cache, it takes under 10 ms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The flow:&lt;/strong&gt; Browser cache → OS cache → Hosts file → Recursive resolver (and its cache) → Root → TLD → Authoritative NS → back to the browser.&lt;/p&gt;
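
&lt;p&gt;The lookup order can be sketched as a small simulation. This is a toy model, not a real resolver: the dictionaries, the &lt;code&gt;resolve&lt;/code&gt; function, and the stubbed &lt;code&gt;recursive_resolver&lt;/code&gt; are all made up for illustration, and the root/TLD/authoritative walk is collapsed into that one stub.&lt;/p&gt;

```python
import time

# Hypothetical in-memory stand-ins for each layer (illustration only).
browser_cache = {}   # name: (ip, expires_at)
os_cache = {}        # name: (ip, expires_at)
hosts_file = {"localhost": "127.0.0.1"}


def recursive_resolver(name):
    """Stub for the upstream resolver; returns (ip, ttl_seconds)."""
    records = {"www.example.com": ("93.184.216.34", 3600)}
    return records[name]


def fresh(cache, name, now):
    """Return the cached IP if present and not expired, else None."""
    entry = cache.get(name)
    if entry is not None and entry[1] > now:
        return entry[0]
    return None


def resolve(name, now=None):
    """Walk the layers in order and report which one answered."""
    now = time.time() if now is None else now
    for label, cache in (("browser cache", browser_cache), ("os cache", os_cache)):
        ip = fresh(cache, name, now)
        if ip is not None:
            return ip, label
    if name in hosts_file:
        return hosts_file[name], "hosts file"
    ip, ttl = recursive_resolver(name)
    # Cache on the way back, honouring the record's TTL.
    os_cache[name] = (ip, now + ttl)
    browser_cache[name] = (ip, now + ttl)
    return ip, "recursive resolver"


print(resolve("www.example.com", now=0))      # cold lookup
print(resolve("www.example.com", now=10))     # within the TTL
print(resolve("www.example.com", now=4000))   # after the TTL has expired
```

&lt;p&gt;Passing &lt;code&gt;now&lt;/code&gt; explicitly keeps the example deterministic; a real client would simply use the current time.&lt;/p&gt;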

&lt;h2&gt;
  
  
  DNS record types beyond A
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A&lt;/strong&gt; — domain → IPv4&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AAAA&lt;/strong&gt; — domain → IPv6&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CNAME&lt;/strong&gt; — alias from one domain to another&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MX&lt;/strong&gt; — mail servers (with priorities for redundancy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TXT&lt;/strong&gt; — arbitrary text; used for domain verification, SPF, DKIM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NS&lt;/strong&gt; — which servers are authoritative&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SOA&lt;/strong&gt; — administrative info for the zone&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PTR&lt;/strong&gt; — reverse of an A record (IP → name), important for mail server legitimacy&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The importance of TTL
&lt;/h2&gt;

&lt;p&gt;Every record has a TTL set by the owner — a balance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Short TTL (60–300s):&lt;/strong&gt; changes propagate fast, but more query load on your nameservers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long TTL (3600–86400s):&lt;/strong&gt; big reduction in query load and better performance, but changes can take up to a day to propagate.&lt;/li&gt;
&lt;/ul&gt;

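&lt;p&gt;Common practice: use long TTLs for stable infrastructure, but &lt;strong&gt;lower the TTL to 5–10 minutes before a planned change&lt;/strong&gt;, then raise it back afterward. Lower it at least one full old TTL ahead of the change (a day ahead for an 86400s record), because resolvers that cached the record under the long TTL keep serving it until that TTL runs out.&lt;/p&gt;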
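
&lt;p&gt;The timing behind that practice is simple arithmetic: a resolver may hold a record for its full TTL. The sketch below works it out for an example cutover; the function names and the cutover date are invented for illustration.&lt;/p&gt;

```python
from datetime import datetime, timedelta


def latest_time_to_lower_ttl(change_at: datetime, old_ttl_seconds: int) -> datetime:
    """Latest moment to publish the short TTL so every cache that still
    holds the long-TTL record has expired it by the time of the change."""
    return change_at - timedelta(seconds=old_ttl_seconds)


def worst_case_stale_until(change_at: datetime, ttl_seconds: int) -> datetime:
    """A resolver that cached the record just before the change can keep
    serving the old answer for one full TTL."""
    return change_at + timedelta(seconds=ttl_seconds)


change = datetime(2026, 6, 20, 9, 0)  # example cutover time (made up)
print(latest_time_to_lower_ttl(change, 86400))  # old TTL of one day
print(worst_case_stale_until(change, 300))      # new TTL of five minutes
```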

&lt;h2&gt;
  
  
  How fast is DNS, really?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Browser cache hit: &amp;lt;1 ms&lt;/li&gt;
&lt;li&gt;OS cache hit: 1–5 ms&lt;/li&gt;
&lt;li&gt;Resolver cache hit: 10–30 ms&lt;/li&gt;
&lt;li&gt;Full recursive query (cold): 20–120 ms&lt;/li&gt;
&lt;li&gt;DNS over HTTPS: +10–50 ms, mostly from setting up the HTTPS connection; later queries can reuse it&lt;/li&gt;
&lt;/ul&gt;

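&lt;p&gt;For popular sites, resolver cache hit rates approach 95%. Referrals from the root and TLD servers are cached too, so even most cache misses skip the root, and root servers end up handling less than 1% of all queries. That's how DNS scales to trillions of queries a day.&lt;/p&gt;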

&lt;h2&gt;
  
  
  Why it's so fast: anycast
&lt;/h2&gt;

&lt;p&gt;Multiple servers worldwide share the same IP. Query &lt;code&gt;8.8.8.8&lt;/code&gt; and BGP routing delivers you to the closest instance in network terms, which is usually also the geographically nearest: Tokyo users typically hit Tokyo, London users hit London. You're not crossing oceans; you're connecting to a server nearby.&lt;/p&gt;

&lt;h2&gt;
  
  
  DNS and CDNs
&lt;/h2&gt;

&lt;p&gt;CDNs (Cloudflare, Akamai, CloudFront) use &lt;strong&gt;GeoDNS&lt;/strong&gt;: the authoritative nameserver looks at where the query comes from (normally the recursive resolver's address, or the client's subnet when the resolver forwards it via EDNS Client Subnet) and returns the optimal IP for that location. Some go further with DNS-based load balancing based on server health or load.&lt;/p&gt;

&lt;h2&gt;
  
  
  The invisible infrastructure
&lt;/h2&gt;

&lt;p&gt;DNS is one of the internet's most critical yet invisible services. Every click begins with a DNS query. When it works, it's invisible; when it fails, the internet seems broken. The next time you press enter, appreciate the journey — in milliseconds, your request cascades through caches, crosses continents, and returns with an answer.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://srinun.in/posts/how-dns-works/" rel="noopener noreferrer"&gt;srinun.in&lt;/a&gt;. I write about DevOps, AWS, and Kubernetes — &lt;a href="https://www.linkedin.com/in/srinivasulu-n-4050973b3/" rel="noopener noreferrer"&gt;connect with me on LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dns</category>
      <category>networking</category>
      <category>devops</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How I Cut AWS CloudWatch Costs by 50%: Moving VPC Flow Logs to S3</title>
      <dc:creator>srinu nuthi</dc:creator>
      <pubDate>Sun, 14 Jun 2026 11:57:49 +0000</pubDate>
      <link>https://dev.to/srinu_nuthi_5ff587c586662/how-i-cut-aws-cloudwatch-costs-by-50-moving-vpc-flow-logs-to-s3-aom</link>
      <guid>https://dev.to/srinu_nuthi_5ff587c586662/how-i-cut-aws-cloudwatch-costs-by-50-moving-vpc-flow-logs-to-s3-aom</guid>
      <description>&lt;p&gt;A customer came to me frustrated about their AWS bill. After reviewing the billing dashboard, we found that &lt;strong&gt;over 50% of their costs were coming from CloudWatch vended logs&lt;/strong&gt;, specifically VPC Flow Logs. Here's how we cut their CloudWatch bill in half in four steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why CloudWatch gets so expensive for VPC Flow Logs
&lt;/h2&gt;

&lt;p&gt;(Vended logs are logs AWS services generate for you automatically — VPC Flow Logs, Route 53 query logs, CloudFront access logs.)&lt;/p&gt;

&lt;p&gt;VPC Flow Logs are incredibly useful, but storing everything in CloudWatch Logs gets pricey fast:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data ingestion charges per GB&lt;/li&gt;
&lt;li&gt;Storage costs that accumulate over time&lt;/li&gt;
&lt;li&gt;No automatic retention policies by default&lt;/li&gt;
&lt;li&gt;Vended logs piling up quietly&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The fix: move VPC Flow Logs to S3
&lt;/h2&gt;

&lt;p&gt;They didn't need real-time querying for most flow logs — mainly weekly security reviews and occasional troubleshooting. Perfect candidates for S3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost comparison (1 TB of logs):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;CloudWatch Logs&lt;/th&gt;
&lt;th&gt;S3 + Parquet&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ingestion / delivery&lt;/td&gt;
&lt;td&gt;$0.50/GB&lt;/td&gt;
&lt;td&gt;per-GB vended-log delivery fee, lower than CloudWatch ingestion (check current pricing)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;$0.03/GB/mo&lt;/td&gt;
&lt;td&gt;$0.023/GB/mo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compression&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;80–90% smaller&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monthly total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$530&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$2.30 storage + delivery + query cost&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
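
&lt;p&gt;The monthly totals follow directly from the per-GB rates. Here is the arithmetic as a short sketch, assuming 1 TB means 1000 GB and Parquet lands at the 90% end of the compression range. It covers only CloudWatch ingestion and storage and S3 storage; any per-GB delivery or query charges are left out, and the function names are illustrative.&lt;/p&gt;

```python
# Rates and volume taken from the comparison table; 1 TB is treated as 1000 GB.
LOG_VOLUME_GB = 1000
CW_INGESTION_PER_GB = 0.50
CW_STORAGE_PER_GB_MONTH = 0.03
S3_STORAGE_PER_GB_MONTH = 0.023
PARQUET_SIZE_REDUCTION = 0.90   # assumption: upper end of the 80-90% range


def cloudwatch_monthly_cost(volume_gb: float) -> float:
    """Ingestion plus one month of storage for the uncompressed volume."""
    return round(volume_gb * (CW_INGESTION_PER_GB + CW_STORAGE_PER_GB_MONTH), 2)


def s3_monthly_storage_cost(volume_gb: float) -> float:
    """One month of S3 Standard storage for the Parquet-compressed volume."""
    stored_gb = volume_gb * (1 - PARQUET_SIZE_REDUCTION)
    return round(stored_gb * S3_STORAGE_PER_GB_MONTH, 2)


print(cloudwatch_monthly_cost(LOG_VOLUME_GB))
print(s3_monthly_storage_cost(LOG_VOLUME_GB))
```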

&lt;h2&gt;
  
  
  How I did it (4 steps)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Send VPC Flow Logs to S3 in Parquet format
&lt;/h3&gt;

&lt;p&gt;In the VPC console, create a flow log with destination &lt;strong&gt;S3&lt;/strong&gt; and log file format &lt;strong&gt;Parquet&lt;/strong&gt;. Parquet auto-compresses by 80–90% — massive storage savings vs plain text.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Set up S3 lifecycle policies
&lt;/h3&gt;

&lt;p&gt;Don't keep everything in S3 Standard forever. A lifecycle rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Day 0–30: S3 Standard (immediate Athena analysis)&lt;/li&gt;
&lt;li&gt;Day 30+: S3 Glacier Instant Retrieval (cheaper, still queryable)&lt;/li&gt;
&lt;li&gt;Day 90+: S3 Glacier Deep Archive (~$0.00099/GB/mo, for compliance; objects must be restored before Athena can read them)&lt;/li&gt;
&lt;li&gt;Day 365: expire&lt;/li&gt;
&lt;/ul&gt;
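
&lt;p&gt;As a rough sketch, the schedule above maps onto an S3 lifecycle configuration like the one below, written as the Python dictionary you would hand to an S3 client call such as boto3's &lt;code&gt;put_bucket_lifecycle_configuration&lt;/code&gt;. The rule ID and the &lt;code&gt;vpc-flow-logs/&lt;/code&gt; prefix are placeholders, not values from the original setup, and the snippet only builds and prints the structure rather than calling AWS.&lt;/p&gt;

```python
import json

# Shape of an S3 lifecycle configuration. The rule ID and the
# "vpc-flow-logs/" prefix are placeholders chosen for this example.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "flow-logs-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": "vpc-flow-logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "GLACIER_IR"},
                {"Days": 90, "StorageClass": "DEEP_ARCHIVE"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```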

&lt;h3&gt;
  
  
  3. Set retention on CloudWatch log groups
&lt;/h3&gt;

&lt;p&gt;The biggest quick win — many log groups had &lt;strong&gt;no retention policy&lt;/strong&gt;, so logs were kept forever. Set sane retention: 7 days for debug, 30 for app logs, 90+ for audit/compliance. Never leave it as "Never expire."&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Query with Amazon Athena when needed
&lt;/h3&gt;

&lt;p&gt;Logs are now in S3, so query them on demand with Athena — you only pay for what you query. Parquet makes those queries fast and cheap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The results
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch costs dropped ~50%&lt;/strong&gt; in the first billing cycle&lt;/li&gt;
&lt;li&gt;Parquet compression cut storage ~85%&lt;/li&gt;
&lt;li&gt;Query performance actually improved with Athena + Parquet&lt;/li&gt;
&lt;li&gt;Retention policies stopped future cost creep&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Don't wait for costs to become a problem. Set billing alerts and review CloudWatch usage monthly — many teams are shocked when they finally check the detailed bill.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Moving VPC Flow Logs from CloudWatch to S3 with Parquet was one of the easiest cost-optimization wins I've done. Direct S3 delivery + Parquet compression + proper retention delivered immediate results. If your AWS bill looks high, start in Cost Explorer and look for vended logs and log groups with no retention.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://srinun.in/posts/reduce-cloudwatch-costs/" rel="noopener noreferrer"&gt;srinun.in&lt;/a&gt;. I write about DevOps, AWS, and Kubernetes — &lt;a href="https://www.linkedin.com/in/srinivasulu-n-4050973b3/" rel="noopener noreferrer"&gt;connect with me on LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>QPS Limit Exceeded on EKS Start-up: The Image Pull Thundering Herd</title>
      <dc:creator>srinu nuthi</dc:creator>
      <pubDate>Sun, 14 Jun 2026 11:51:41 +0000</pubDate>
      <link>https://dev.to/srinu_nuthi_5ff587c586662/qps-limit-exceeded-on-eks-start-up-the-image-pull-thundering-herd-14ge</link>
      <guid>https://dev.to/srinu_nuthi_5ff587c586662/qps-limit-exceeded-on-eks-start-up-the-image-pull-thundering-herd-14ge</guid>
      <description>&lt;p&gt;I scaled our dev EKS cluster down to zero overnight to save cost. The next morning it didn't come back up cleanly — pods got stuck and the events were full of "QPS limit exceeded". The cause wasn't the automation. It was every pod trying to pull its image at the same second. Here's the thundering herd, and how I fixed it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I started stopping the dev cluster at night
&lt;/h2&gt;

&lt;p&gt;A dev cluster doesn't need to run 24/7. There are 168 hours in a week, but a dev environment realistically only needs ~50 (10 hours a day, 5 days a week). So I set up a schedule: scale the node groups to zero at night, bring them back at 8 AM. The control plane stays up; the expensive worker nodes go to zero.&lt;/p&gt;

&lt;p&gt;Savings: roughly &lt;strong&gt;60–70%&lt;/strong&gt; on dev worker-node compute.&lt;/p&gt;
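
&lt;p&gt;The upper end of that range is just the ratio of hours. A quick check, assuming the worker nodes cost nothing while scaled to zero and ignoring start-up overhead:&lt;/p&gt;

```python
HOURS_PER_WEEK = 24 * 7          # 168
DEV_HOURS_PER_WEEK = 10 * 5      # 10 hours a day, 5 days a week

# Fraction of worker-node hours no longer paid for.
savings = 1 - DEV_HOURS_PER_WEEK / HOURS_PER_WEEK
print(f"{savings:.0%}")
```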

&lt;h2&gt;
  
  
  Then the cluster woke up angry
&lt;/h2&gt;

&lt;p&gt;The automation worked perfectly going &lt;em&gt;down&lt;/em&gt;. The problem was going &lt;em&gt;up&lt;/em&gt;. The first morning the cluster scaled back from zero, pods got stuck in &lt;code&gt;ContainerCreating&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failed to pull image "xxxx.dkr.ecr.ap-south-1.amazonaws.com/my-app:latest":
... 429 Too Many Requests
Warning  Failed   kubelet  Error: ErrImagePull
Warning  Failed   kubelet  ... QPS limit exceeded / Rate exceeded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Root cause: everything pulls at once
&lt;/h2&gt;

&lt;p&gt;When a cluster runs normally, pods start at different times, so image pulls are naturally spread out. But when you bring a cluster back from &lt;strong&gt;zero&lt;/strong&gt;, that smooth spread collapses into a single spike:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All node groups scale up together — a batch of fresh nodes joins within the same minute.&lt;/li&gt;
&lt;li&gt;Every node starts with an &lt;strong&gt;empty image cache&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The scheduler places every pending pod from every namespace at once.&lt;/li&gt;
&lt;li&gt;So every kubelet, on every node, fires image pulls to the registry &lt;strong&gt;at the same second&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a classic &lt;strong&gt;thundering herd&lt;/strong&gt;, and it hits two rate limits at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Registry-side (ECR):&lt;/strong&gt; Amazon ECR rate-limits the API calls used during a pull (&lt;code&gt;GetDownloadUrlForLayer&lt;/code&gt;, &lt;code&gt;BatchGetImage&lt;/code&gt;, &lt;code&gt;GetAuthorizationToken&lt;/code&gt;). Hundreds of simultaneous pulls return &lt;code&gt;429 Too Many Requests&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node-side (kubelet):&lt;/strong&gt; Each kubelet also rate-limits pulls via &lt;code&gt;registryPullQPS&lt;/code&gt; and &lt;code&gt;registryBurst&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The key insight: the error has nothing to do with broken images or a down registry. It's purely a &lt;strong&gt;concurrency&lt;/strong&gt; problem — too many pulls in too short a window.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How I fixed it
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Stagger the scale-up instead of big-banging it
&lt;/h3&gt;

&lt;p&gt;The single most effective fix. Instead of scaling all node groups to full size at once, bring capacity back in &lt;strong&gt;phases&lt;/strong&gt; — a few nodes, wait a couple minutes for their images to land, then the rest. Same idea for workloads: restore critical namespaces first, the rest a few minutes later.&lt;/p&gt;
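
&lt;p&gt;One way to plan the phases is to compute the sequence of desired sizes up front. This is a minimal sketch with a hypothetical helper, &lt;code&gt;phased_targets&lt;/code&gt;; the batch sizes are example values, and actually applying each target (and waiting between phases) is left to whatever automation scales your node groups.&lt;/p&gt;

```python
def phased_targets(full_size: int, first_batch: int, step: int) -> list[int]:
    """Return the desired node counts to apply one after another,
    ending at full_size."""
    if not (first_batch >= 1 and step >= 1):
        raise ValueError("first_batch and step must be at least 1")
    if full_size == 0:
        return []
    targets = [min(first_batch, full_size)]
    while targets[-1] != full_size:
        targets.append(min(targets[-1] + step, full_size))
    return targets


# Example: a 12-node group brought back as 2 nodes, then 4 more at a time.
for desired in phased_targets(full_size=12, first_batch=2, step=4):
    print(f"set desired size to {desired}, then wait for image pulls to settle")
```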

&lt;h3&gt;
  
  
  2. Tune the kubelet pull limits
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubelet.config.k8s.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;KubeletConfiguration&lt;/span&gt;
&lt;span class="na"&gt;serializeImagePulls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;   &lt;span class="c1"&gt;# allow parallel pulls per node&lt;/span&gt;
&lt;span class="na"&gt;registryPullQPS&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;          &lt;span class="c1"&gt;# default is 5&lt;/span&gt;
&lt;span class="na"&gt;registryBurst&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;            &lt;span class="c1"&gt;# default is 10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Caution: turning these &lt;em&gt;up&lt;/em&gt; while the &lt;em&gt;registry&lt;/em&gt; is the bottleneck can make ECR throttling worse. Pair it with step 3.&lt;/p&gt;
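
&lt;p&gt;To see why a burst of pulls trips these limits, it helps to model them as a token bucket, which is my understanding of how the kubelet applies &lt;code&gt;registryPullQPS&lt;/code&gt; and &lt;code&gt;registryBurst&lt;/code&gt;. The class below is a simplified stand-in written for this post, not kubelet code, and the count of 30 simultaneous pulls is an arbitrary example.&lt;/p&gt;

```python
class TokenBucket:
    """Minimal token bucket: refills at `qps` tokens per second, holds at most `burst`."""

    def __init__(self, qps: float, burst: int) -> None:
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)
        self.last = 0.0

    def try_accept(self, now: float) -> bool:
        """Take one token if available; otherwise reject immediately."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.qps)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# 30 pulls requested in the same instant against the default 5 QPS / burst 10.
bucket = TokenBucket(qps=5, burst=10)
accepted = sum(bucket.try_accept(now=0.0) for _ in range(30))
print(f"accepted {accepted} of 30 simultaneous pulls")

# One second later the bucket has refilled at 5 tokens per second.
later = sum(bucket.try_accept(now=1.0) for _ in range(30))
print(f"accepted {later} more after one second")
```

&lt;p&gt;Raising the burst lets more of the initial spike through on each node, which is exactly why it can shift the pressure onto the registry.&lt;/p&gt;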

&lt;h3&gt;
  
  
  3. Put a pull-through cache in front of ECR
&lt;/h3&gt;

&lt;p&gt;Set up an &lt;strong&gt;ECR pull-through cache&lt;/strong&gt; rule so images from upstream registries are cached in your own ECR, and make sure the cluster reaches ECR over &lt;strong&gt;VPC interface endpoints&lt;/strong&gt; (&lt;code&gt;ecr.api&lt;/code&gt; and &lt;code&gt;ecr.dkr&lt;/code&gt;, plus the S3 gateway endpoint for layer downloads). Repeated pulls hit a warm cache instead of re-fetching upstream, which is especially valuable for public images with their own aggressive rate limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Pre-pull the hot images
&lt;/h3&gt;

&lt;p&gt;Don't let nodes start with an empty cache: &lt;strong&gt;bake common images into a custom AMI&lt;/strong&gt;, or run a lightweight &lt;strong&gt;image pre-puller DaemonSet&lt;/strong&gt;. Fewer cold pulls means a far smaller herd.&lt;/p&gt;

&lt;h2&gt;
  
  
  The result
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;QPS limit exceeded&lt;/code&gt; / &lt;code&gt;429&lt;/code&gt; errors disappeared on subsequent morning start-ups.&lt;/li&gt;
&lt;li&gt;Pods reached &lt;code&gt;Running&lt;/code&gt; faster.&lt;/li&gt;
&lt;li&gt;We kept the full cost savings of scaling to zero — without the painful wake-up.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;If you only do one thing: stagger the scale-up. Most "QPS limit exceeded" start-up failures vanish the moment you stop bringing the entire cluster back in a single burst.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Scaling a dev cluster to zero overnight is one of the easiest cost wins in Kubernetes — but "scale to zero" quietly turns your start-up from a trickle into a flood. Once I stopped big-banging the start-up and gave ECR breathing room with a cache and pre-pulled images, the mornings got quiet again.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://srinun.in/posts/eks-image-pull-qps-limit/" rel="noopener noreferrer"&gt;srinun.in&lt;/a&gt;. I write about DevOps, AWS, and Kubernetes — &lt;a href="https://www.linkedin.com/in/srinivasulu-n-4050973b3/" rel="noopener noreferrer"&gt;connect with me on LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>aws</category>
      <category>devops</category>
      <category>eks</category>
    </item>
    <item>
      <title>Kubernetes PriorityClass Isn't Enough: Pinning a Pod to AMD Nodes During an ARM Migration</title>
      <dc:creator>srinu nuthi</dc:creator>
      <pubDate>Sun, 14 Jun 2026 11:45:27 +0000</pubDate>
      <link>https://dev.to/srinu_nuthi_5ff587c586662/kubernetes-priorityclass-isnt-enough-pinning-a-pod-to-amd-nodes-during-an-arm-migration-20mf</link>
      <guid>https://dev.to/srinu_nuthi_5ff587c586662/kubernetes-priorityclass-isnt-enough-pinning-a-pod-to-amd-nodes-during-an-arm-migration-20mf</guid>
      <description>&lt;p&gt;We started moving our workloads from AMD (x86) nodes to ARM (Graviton) nodes for the lower price and better performance. Our pipelines now build both architectures, but the frontend's multi-arch build was painfully slow, so we decided to keep the frontend on AMD for now. My first instinct was a PriorityClass. It wasn't enough on its own. Here's why, and the full combination that actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we're moving to ARM
&lt;/h2&gt;

&lt;p&gt;AWS Graviton (ARM) instances are cheaper than their equivalent x86 instances and, for a lot of workloads, faster per dollar. For anyone watching their EKS bill, migrating to ARM is one of the better levers you can pull.&lt;/p&gt;

&lt;p&gt;The catch: your container images have to be built for the target architecture. An image built only for &lt;code&gt;amd64&lt;/code&gt; won't run on an &lt;code&gt;arm64&lt;/code&gt; node. So step one of any ARM migration is making your build pipelines produce &lt;strong&gt;multi-arch images&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "tiny" pipeline change that wasn't so tiny
&lt;/h2&gt;

&lt;p&gt;Building multi-arch images is, on paper, a one-line change with &lt;code&gt;docker buildx&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker buildx build &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--platform&lt;/span&gt; linux/amd64,linux/arm64 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-t&lt;/span&gt; myrepo/app:tag &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--push&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That single &lt;code&gt;--platform linux/amd64,linux/arm64&lt;/code&gt; flag produces one tag backed by a manifest list that references both architectures. Once pushed, the container runtime on each node automatically pulls the variant that matches the node's CPU.&lt;/p&gt;

&lt;p&gt;But there's a cost: you're now building &lt;strong&gt;twice&lt;/strong&gt;. And if your build host is x86, the &lt;code&gt;arm64&lt;/code&gt; half is built under emulation (QEMU), which can be dramatically slower. For most of our services that was fine. For the &lt;strong&gt;frontend&lt;/strong&gt;, the build time ballooned.&lt;/p&gt;

&lt;p&gt;So we decided: migrate everything else to ARM, but keep the &lt;strong&gt;frontend on AMD only&lt;/strong&gt; for now. The challenge then became: how do we &lt;em&gt;guarantee&lt;/em&gt; the frontend always runs on an AMD node?&lt;/p&gt;

&lt;h2&gt;
  
  
  Attempt 1: just use a PriorityClass (spoiler: not enough)
&lt;/h2&gt;

&lt;p&gt;My first thought was a PriorityClass — make the frontend "more important" so it always gets a spot on the AMD nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scheduling.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PriorityClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend-high-priority&lt;/span&gt;
&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000000&lt;/span&gt;
&lt;span class="na"&gt;globalDefault&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Frontend&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;wins&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;contention&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;limited&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;AMD&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;nodes."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is useful — but it does &lt;strong&gt;not&lt;/strong&gt; do what I first assumed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A PriorityClass controls the &lt;em&gt;order&lt;/em&gt; pods are scheduled and whether a pod can &lt;em&gt;preempt&lt;/em&gt; (evict) lower-priority pods to make room. It does &lt;strong&gt;NOT&lt;/strong&gt; pin a pod to a particular node or CPU architecture. With only a PriorityClass, nothing stops the frontend from being scheduled onto an ARM node — where its amd64-only image won't even run.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;PriorityClass answers "who gets scheduled first?" — not "&lt;em&gt;where&lt;/em&gt; does this pod run?" Two different questions, and I was conflating them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real fix: three pieces that each do one job
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. nodeSelector — decides WHERE the pod can land
&lt;/h3&gt;

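&lt;p&gt;This is the piece that actually pins the frontend to x86. Kubernetes automatically labels every node with its architecture (&lt;code&gt;kubernetes.io/arch&lt;/code&gt;), so a &lt;code&gt;nodeSelector&lt;/code&gt; on that label is all it takes:&lt;br&gt;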
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;priorityClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend-high-priority&lt;/span&gt;
      &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;kubernetes.io/arch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;amd64&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myrepo/frontend:tag&lt;/span&gt;   &lt;span class="c1"&gt;# amd64-only is fine now&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;kubernetes.io/arch: amd64&lt;/code&gt;, the scheduler only ever places the frontend on an AMD node. PriorityClass could never have done this.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. PriorityClass — decides WHO wins when AMD nodes are full
&lt;/h3&gt;

&lt;p&gt;Now the AMD nodes are a &lt;em&gt;scarce&lt;/em&gt; resource (we're shrinking them as we move to ARM). If other pods fill them up, the frontend could be stuck &lt;code&gt;Pending&lt;/code&gt;. This is where the PriorityClass earns its keep: when the high-priority frontend can't fit, the scheduler &lt;strong&gt;preempts&lt;/strong&gt; lower-priority pods on the AMD nodes to make room. Their controllers recreate the evicted pods, which can then land on the ARM nodes as long as their images support &lt;code&gt;arm64&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Taints &amp;amp; tolerations — keep everyone else OFF the AMD nodes
&lt;/h3&gt;

&lt;p&gt;Relying on preemption works, but it's reactive — pods get scheduled then evicted, which causes churn. The cleaner approach is to stop other pods from landing on the AMD nodes in the first place. Taint the AMD nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl taint nodes &amp;lt;amd-node&amp;gt; &lt;span class="nv"&gt;workload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;frontend:NoSchedule
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then let only the frontend tolerate it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;      &lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;workload"&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Equal"&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frontend"&lt;/span&gt;
          &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NoSchedule"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the AMD nodes are effectively &lt;strong&gt;reserved&lt;/strong&gt; for the frontend. The PriorityClass becomes a safety net rather than the primary mechanism.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mental model that finally made it click
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;nodeSelector / affinity&lt;/strong&gt; = where a pod is &lt;em&gt;allowed&lt;/em&gt; to go (attraction)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Taints / tolerations&lt;/strong&gt; = which pods a node &lt;em&gt;repels&lt;/em&gt; (reservation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PriorityClass&lt;/strong&gt; = who gets scheduled first and who can evict whom (order)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They're three different questions. "Just a PriorityClass" failed because it only answers the third one.&lt;/p&gt;
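
&lt;p&gt;To see the three answers side by side, here is a minimal sketch of the frontend's pod template with all three fields set. The class name &lt;code&gt;frontend-critical&lt;/code&gt; and the container name are placeholders, not taken from the real setup; the taint key and value match the &lt;code&gt;kubectl taint&lt;/code&gt; command above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;    spec:
      # Order: who is scheduled first and who may preempt whom
      priorityClassName: frontend-critical   # placeholder name
      # Placement: only nodes carrying this label are eligible
      nodeSelector:
        kubernetes.io/arch: amd64
      # Reservation: lets this pod onto the tainted AMD nodes
      tolerations:
        - key: "workload"
          operator: "Equal"
          value: "frontend"
          effect: "NoSchedule"
      containers:
        - name: frontend                     # placeholder name
          image: myrepo/frontend:tag
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A toleration on its own does not pull the pod toward the AMD nodes; it only removes the barrier. That is why the nodeSelector is still needed alongside it.&lt;/p&gt;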

&lt;h2&gt;
  
  
  Gotchas worth knowing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Don't taint your AMD nodes without checking system pods.&lt;/strong&gt; DaemonSets and critical add-ons need to tolerate the taint or run elsewhere, or you'll break things like logging/monitoring agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Preemption causes churn.&lt;/strong&gt; Use &lt;code&gt;preemptionPolicy: Never&lt;/code&gt; if you want priority ordering &lt;em&gt;without&lt;/em&gt; evicting others.&lt;/li&gt;
&lt;li&gt;
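&lt;strong&gt;Keep priority values sane.&lt;/strong&gt; Custom PriorityClass values are capped at 1,000,000,000. The built-in &lt;code&gt;system-cluster-critical&lt;/code&gt; and &lt;code&gt;system-node-critical&lt;/code&gt; classes sit above that cap, so leave them to cluster components and don't assign them to your app.&lt;/li&gt;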
&lt;li&gt;
&lt;strong&gt;This is a transition state.&lt;/strong&gt; The end goal is still a native multi-arch frontend on ARM.&lt;/li&gt;
&lt;/ul&gt;
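
&lt;p&gt;The first two gotchas each come down to a few lines of YAML. As a rough sketch, this toleration could go into a DaemonSet's pod template (&lt;code&gt;spec.template.spec&lt;/code&gt;) so that logging or monitoring agents keep running on the tainted AMD nodes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;      tolerations:
        # Tolerate the workload taint whatever its value is
        - key: "workload"
          operator: "Exists"
          effect: "NoSchedule"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And this is what a non-preempting PriorityClass might look like. The name &lt;code&gt;frontend-no-preempt&lt;/code&gt; and the value &lt;code&gt;100000&lt;/code&gt; are illustrative assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: frontend-no-preempt   # illustrative name
value: 100000                 # illustrative value
# Pods are queued ahead of lower-priority pods
# but never evict pods that are already running
preemptionPolicy: Never
globalDefault: false
description: "High scheduling priority without preemption."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;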

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Scheduling priority and pod placement are not the same thing. A PriorityClass will never keep a pod on a particular architecture — it just decides who goes first. To pin our frontend to AMD nodes, the nodeSelector did the placement, taints reserved the capacity, and the PriorityClass was the safety net.&lt;/p&gt;

&lt;p&gt;If you're partway through an ARM (Graviton) migration and need certain workloads to stay on x86 for a while, reach for all three — and know which problem each one is solving.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://srinun.in/posts/k8s-priorityclass-arm-migration/" rel="noopener noreferrer"&gt;srinun.in&lt;/a&gt;. I write about DevOps, AWS, and Kubernetes — &lt;a href="https://www.linkedin.com/in/srinivasulu-n-4050973b3/" rel="noopener noreferrer"&gt;connect with me on LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>aws</category>
      <category>devops</category>
      <category>eks</category>
    </item>
    <item>
      <title>AWS VPC IPAM: The Most Underrated Feature for Avoiding IP Address Chaos</title>
      <dc:creator>srinu nuthi</dc:creator>
      <pubDate>Sun, 14 Jun 2026 11:42:01 +0000</pubDate>
      <link>https://dev.to/srinu_nuthi_5ff587c586662/aws-vpc-ipam-the-most-underrated-feature-for-avoiding-ip-address-chaos-2e2o</link>
      <guid>https://dev.to/srinu_nuthi_5ff587c586662/aws-vpc-ipam-the-most-underrated-feature-for-avoiding-ip-address-chaos-2e2o</guid>
      <description>&lt;p&gt;Most teams manage their VPC IP ranges in a spreadsheet — until two VPCs overlap, a peering connection refuses to come up, and nobody can remember who owns which CIDR. Amazon VPC IP Address Manager (IPAM) is the feature that quietly solves all of this. It's one of the most underrated tools in AWS networking, and here's why.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem nobody talks about until it breaks
&lt;/h2&gt;

&lt;p&gt;IP address planning feels trivial on day one. You spin up a VPC, pick &lt;code&gt;10.0.0.0/16&lt;/code&gt;, and move on.&lt;/p&gt;

&lt;p&gt;Fast-forward a year. You now have a dozen VPCs across multiple accounts and two regions. Someone documented the CIDRs in a spreadsheet that's already out of date. Then you try to set up a VPC peering connection or attach a VPC to a Transit Gateway and you hit a wall:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Two VPCs were both created with &lt;code&gt;10.0.0.0/16&lt;/code&gt;. Overlapping CIDR ranges can't be peered or routed together. Now you're looking at re-IPing an entire VPC — a painful, high-risk migration — just because nobody had a single source of truth.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the silent tax of growing on AWS: IP sprawl. And it's exactly what IPAM was built to eliminate.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is AWS VPC IPAM?
&lt;/h2&gt;

&lt;p&gt;IPAM is a managed feature of Amazon VPC that gives you one place to plan, track, and monitor every IP address in your AWS environment. Instead of a spreadsheet, you get a live, automated system of record.&lt;/p&gt;

&lt;p&gt;It's built around a few simple concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IPAM&lt;/strong&gt; — the top-level resource that manages everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scopes&lt;/strong&gt; — a container for pools. You get a &lt;em&gt;private&lt;/em&gt; scope (internal RFC 1918 ranges) and a &lt;em&gt;public&lt;/em&gt; scope, kept separate so ranges never collide.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pools&lt;/strong&gt; — collections of CIDR ranges. Pools are &lt;strong&gt;hierarchical&lt;/strong&gt;: a top-level pool can be carved into regional pools, then per-account or per-environment pools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Allocations&lt;/strong&gt; — a CIDR handed out from a pool to a resource like a VPC.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it like a filing cabinet: one big drawer (top pool), folders by region, sub-folders by account or environment.&lt;/p&gt;
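
&lt;p&gt;The filing-cabinet layout can be sketched as a CloudFormation template. Treat this as a minimal sketch rather than a production setup: the resource names, the two operating regions, and the &lt;code&gt;10.0.0.0/12&lt;/code&gt; regional range are assumptions chosen for illustration, and it assumes the stack is deployed in &lt;code&gt;us-east-1&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;AWSTemplateFormatVersion: "2010-09-09"
Description: Sketch of a two-level IPAM pool hierarchy (illustrative values)

Resources:
  # The IPAM itself: the top-level resource
  OrgIpam:
    Type: AWS::EC2::IPAM
    Properties:
      Description: Organization-wide IPAM
      OperatingRegions:
        - RegionName: us-east-1
        - RegionName: us-west-2

  # The big drawer: the overall private range
  TopLevelPool:
    Type: AWS::EC2::IPAMPool
    Properties:
      Description: Top-level private pool
      AddressFamily: ipv4
      IpamScopeId: !GetAtt OrgIpam.PrivateDefaultScopeId
      ProvisionedCidrs:
        - Cidr: 10.0.0.0/8

  # A folder by region: carved out of the top-level pool
  UsEast1Pool:
    Type: AWS::EC2::IPAMPool
    Properties:
      Description: us-east-1 regional pool
      AddressFamily: ipv4
      IpamScopeId: !GetAtt OrgIpam.PrivateDefaultScopeId
      SourceIpamPoolId: !Ref TopLevelPool
      Locale: us-east-1
      ProvisionedCidrs:
        - Cidr: 10.0.0.0/12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Per-account or per-environment pools would follow the same pattern, each pointing its &lt;code&gt;SourceIpamPoolId&lt;/code&gt; at the regional pool above it.&lt;/p&gt;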

&lt;h2&gt;
  
  
  Why it's so underrated: 5 things IPAM does for you
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Automatic CIDR assignment (no more guessing)
&lt;/h3&gt;

&lt;p&gt;Instead of a human picking a CIDR and hoping it's free, you tell IPAM "give this VPC a /24 from the dev pool" and it allocates a non-overlapping range automatically.&lt;/p&gt;
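
&lt;p&gt;In CloudFormation, that request looks roughly like the sketch below. &lt;code&gt;DevPoolId&lt;/code&gt; is a hypothetical parameter standing in for the ID of your dev pool, and the pool's locale has to match the region the VPC is created in.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;Parameters:
  DevPoolId:
    Type: String
    Description: ID of the IPAM pool to allocate from (hypothetical dev pool)

Resources:
  DevVpc:
    Type: AWS::EC2::VPC
    Properties:
      # No CidrBlock here: IPAM picks a free /24 from the pool
      Ipv4IpamPoolId: !Ref DevPoolId
      Ipv4NetmaskLength: 24
      EnableDnsSupport: true
      EnableDnsHostnames: true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;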

&lt;h3&gt;
  
  
  2. Overlap prevention across accounts and regions
&lt;/h3&gt;

&lt;p&gt;Because every allocation comes from a managed pool, IPAM &lt;strong&gt;guarantees you can't hand out the same range twice&lt;/strong&gt;. This is the killer feature for anyone running AWS Organizations with many accounts. No more peering failures or emergency re-IP migrations.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Real utilization monitoring
&lt;/h3&gt;

&lt;p&gt;IPAM continuously tracks how much of each pool is in use and can alert you (via CloudWatch) before a pool runs out.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Public IPv4 cost control
&lt;/h3&gt;

&lt;p&gt;Since February 2024, AWS has charged for &lt;strong&gt;every&lt;/strong&gt; public IPv4 address (about &lt;code&gt;$0.005&lt;/code&gt;/hour each, attached or idle, which works out to roughly $3.65 per address per month), so unused Elastic IPs are real money leaking out monthly. IPAM's &lt;strong&gt;public IP insights&lt;/strong&gt; show every public IPv4, what it's attached to, and whether it's idle, so you can release the ones you don't need.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Audit history and compliance
&lt;/h3&gt;

&lt;p&gt;IPAM keeps a historical record of how your IP space has been allocated over time. When someone asks "what was using this CIDR three months ago?", you have an actual answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Free Tier vs Advanced Tier
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Free Tier:&lt;/strong&gt; basic planning, tracking, and monitoring within a single account and region, plus public IP insights. Great for getting started.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Tier:&lt;/strong&gt; cross-account via AWS Organizations, cross-region pools, automated allocation, compliance monitoring, and usage history. Billed per active managed IP at a small hourly rate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For any real multi-account org, the Advanced Tier almost always pays for itself by preventing even one re-IP migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to get started (quick version)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Create an &lt;strong&gt;IPAM&lt;/strong&gt; in the region you'll manage it from (its home region), and add every region where you run VPCs as an operating region.&lt;/li&gt;
&lt;li&gt;In the private scope, create a &lt;strong&gt;top-level pool&lt;/strong&gt; with your overall range (e.g. &lt;code&gt;10.0.0.0/8&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Carve out &lt;strong&gt;regional and environment pools&lt;/strong&gt; beneath it.&lt;/li&gt;
&lt;li&gt;When creating new VPCs, choose &lt;strong&gt;"allocate CIDR from IPAM pool"&lt;/strong&gt; instead of typing one in.&lt;/li&gt;
&lt;li&gt;Turn on &lt;strong&gt;public IP insights&lt;/strong&gt; and hunt down idle public IPv4 addresses.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Even if you never automate VPC creation, just pointing IPAM at your existing environment to &lt;em&gt;discover&lt;/em&gt; what you already have — and where it overlaps — is worth the setup time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;IPAM isn't flashy. But it solves a problem that gets more expensive to fix with every VPC you add, and most teams only discover it after a painful overlap has already broken something.&lt;/p&gt;

&lt;p&gt;If you're running more than a couple of VPCs, set up IPAM before you need it. Treat your IP space like the shared, finite resource it actually is.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published on &lt;a href="https://srinun.in/posts/aws-vpc-ipam/" rel="noopener noreferrer"&gt;srinun.in&lt;/a&gt;. I write about DevOps, AWS, and Kubernetes — &lt;a href="https://www.linkedin.com/in/srinivasulu-n-4050973b3/" rel="noopener noreferrer"&gt;connect with me on LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>networking</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
