DEV Community: srinu nuthi

EKS Upgrades Are No Longer a One-Way Door: Kubernetes Version Rollback

srinu nuthi — Thu, 02 Jul 2026 16:50:48 +0000

For years, upgrading an EKS control plane was a one-way door. You bumped the Kubernetes minor version, and if something broke afterwards, you had to fix forward — under pressure, in production, with no escape hatch. AWS just changed that: Amazon EKS now supports Kubernetes version rollback.

The problem: upgrades you couldn't take back

You move a cluster from 1.29 to 1.30. The upgrade succeeds. Hours later, something misbehaves — a deprecated API your workloads still call, an add-on that isn't happy on the new version, a controller acting strangely under real traffic.

Until now, there was no going back. The control plane version was a one-way ratchet, and your only option was to debug and patch in place. That fear is exactly why so many teams sit several versions behind.

What AWS shipped

Revert to the previous minor version (roll 1.30 back to 1.29)
Within 7 days of the upgrade completing
Trigger it from the console, AWS CLI, or SDK
No additional cost, in all regions where EKS is available
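
The two hard limits in that list (one minor version back, within 7 days) are easy to encode as a pre-flight check in your own tooling. The sketch below is only an illustration of those rules as described here; the function names are hypothetical and it does not call any AWS API.

```python
from datetime import datetime, timedelta, timezone

ROLLBACK_WINDOW = timedelta(days=7)  # window stated in the article


def parse_minor(version: str) -> tuple[int, int]:
    """Split a version such as '1.30' into (major, minor)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def rollback_allowed(current: str, target: str,
                     upgraded_at: datetime, now: datetime) -> bool:
    """True when target is exactly one minor version behind current
    and the upgrade completed no more than 7 days before `now`."""
    cur_major, cur_minor = parse_minor(current)
    tgt_major, tgt_minor = parse_minor(target)
    one_minor_back = cur_major == tgt_major and cur_minor - tgt_minor == 1
    inside_window = timedelta(0) <= now - upgraded_at <= ROLLBACK_WINDOW
    return one_minor_back and inside_window


upgraded = datetime(2026, 7, 1, 12, 0, tzinfo=timezone.utc)
print(rollback_allowed("1.30", "1.29", upgraded, upgraded + timedelta(days=3)))  # True
print(rollback_allowed("1.30", "1.29", upgraded, upgraded + timedelta(days=8)))  # False
print(rollback_allowed("1.30", "1.28", upgraded, upgraded + timedelta(days=3)))  # False
```

A check like this is useful in a runbook script so that nobody discovers the window has closed in the middle of an incident.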

It's not a blind revert: readiness checks

Before rolling back, EKS evaluates rollback readiness insights — automated checks including:

API compatibility — will your workloads still work on the older version?
Version skew — the allowed gap between control plane and nodes
Add-on compatibility — managed add-ons that must match the target version
Cluster health — general readiness signals

For EKS Auto Mode clusters, EKS rolls back the worker nodes first, then the control plane, honoring your disruption controls — the correct order given version skew rules.
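
The ordering follows from the Kubernetes version skew policy: a kubelet must never be newer than the API server it talks to. A minimal sketch of that check, assuming you have already collected node versions into a dictionary (the node names and the function name are made up for illustration):

```python
def minor(version: str) -> int:
    """Return the minor component of a version like '1.30' or '1.30.2'."""
    return int(version.split(".")[1])


def nodes_blocking_rollback(target_control_plane: str,
                            node_versions: dict[str, str]) -> list[str]:
    """Names of nodes whose kubelet is newer than the target control plane.

    These nodes have to be rolled back (or replaced) before the control
    plane can safely move to the target version.
    """
    target = minor(target_control_plane)
    return sorted(name for name, version in node_versions.items()
                  if minor(version) > target)


nodes = {"node-a": "1.30", "node-b": "1.29", "node-c": "1.30"}
print(nodes_blocking_rollback("1.29", nodes))  # ['node-a', 'node-c']
```

On clusters where you manage node groups yourself, an empty result here is the signal that the control plane can go back without violating skew.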

"In-place upgrade only" — what that means

There are two ways to move an EKS cluster to a new version:

	In-place upgrade	Blue/green (migration)
What upgrades	The existing cluster (1.29 → 1.30)	A new cluster; migrate workloads over
Rollback method	AWS's new 7-day rollback	Point back to the old cluster (it still exists)
Effort	Low — a click / one API call	High — build, migrate, cut over

With blue/green you already have a rollback path (the old cluster is still running), so there's nothing for AWS to revert. The new feature targets the far more common in-place path, where you previously had no safety net at all.

The catch

This is a safety net, not an "undo anytime" button:

In-place upgrades only — blue/green isn't covered (and doesn't need it)
7-day window — once it closes, the previous version is gone

The intent: use the window to validate the new version under real production load, and pull the ripcord fast if it goes wrong.

How this should change your playbook

Still test in non-prod first. Rollback lowers risk; it doesn't remove the need to test.
Treat the 7 days as an active validation window. Deliberately exercise the risky paths — deprecated APIs, finicky add-ons, peak load — while the escape hatch is open.
Know your rollback command before you need it. Don't learn the API during an incident.
Mind version skew on nodes — which is exactly why Auto Mode sequences nodes first.

Takeaway

This quietly removes one of the scariest parts of operating EKS. Upgrades stop being a leap of faith and become something you can validate and, if needed, reverse. If you've been sitting a few versions behind because upgrading felt too risky, it's a good moment to revisit that.

Originally published on srinun.in. I write about DevOps, AWS, and Kubernetes — connect with me on LinkedIn.

How DNS Works: A Practical Guide for DevOps and Developers

srinu nuthi — Sun, 14 Jun 2026 12:03:00 +0000

Every time you type www.google.com and hit enter, your computer doesn't actually know where Google lives on the internet — it has to ask for directions. That's DNS: the internet's phonebook, translating human-readable names into machine-readable IP addresses. Let's follow a single DNS query from the moment you press enter to the moment your browser connects.

Why we need DNS

Imagine memorizing 142.250.185.46 instead of typing google.com. IP addresses change, servers move, but domain names stay constant and memorable. DNS is the bridge between human memory and computer networking.

The journey of a DNS query

1. Browser cache

Before asking anyone, your browser checks its own DNS cache. Each entry has a TTL (Time To Live). If there's a valid, non-expired entry, the journey ends here, in well under a millisecond. Let's assume it's a miss.

2. OS cache

Your browser asks the OS, which keeps its own cache shared across all apps. On Windows you can view it with ipconfig /displaydns. Still a miss in our scenario.

3. Hosts file

The OS checks the hosts file (/etc/hosts on Unix, C:\Windows\System32\drivers\etc\hosts on Windows). It's a manual override — admins use it for testing, and you can block domains by pointing them at 127.0.0.1. No entry for public sites, so we move on.

4. The recursive resolver

Now we leave your computer. The OS sends the query to a recursive resolver — usually your ISP's, or a public one like Google's 8.8.8.8 or Cloudflare's 1.1.1.1. Think of it as a librarian: it doesn't know the answer offhand, but it knows how to find it. The query carries the domain name, the record type (usually A for IPv4), and your return address.

5. The resolver's cache

The resolver serves thousands or millions of users, so popular domains are almost always cached. If www.example.com is cached and fresh, you get an answer in roughly 10–30 ms, most of which is the network round trip to the resolver. If not, the resolver walks the DNS hierarchy on your behalf.

6. Root servers

At the top of the DNS tree are the root servers: 13 named server identities (A through M), each of which is actually a globally distributed cluster reached via anycast. The root doesn't know where your domain is, but it knows who runs .com, and returns a referral to the .com TLD servers.

7. TLD servers

The resolver asks the .com TLD servers. They maintain the registry of which authoritative nameservers handle each .com domain, and return another referral:

example.com is managed by:
ns1.examplehost.com (198.51.100.1)
ns2.examplehost.com (198.51.100.2)

8. Authoritative nameservers

This is the source of truth. The resolver asks the authoritative nameserver, which looks up the record in its zone file:

www.example.com.    3600    IN    A    93.184.216.34

3600 — TTL in seconds (how long to cache)
IN — Internet class
A — record type (IPv4; AAAA for IPv6)
93.184.216.34 — the IP
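
To make the field layout concrete, here is a small sketch that splits a record line of exactly this shape into named fields. It is not a full zone-file parser (no comments, no $ORIGIN, no multi-line records), and the class and function names are my own.

```python
from typing import NamedTuple


class ResourceRecord(NamedTuple):
    name: str    # owner name, e.g. www.example.com.
    ttl: int     # seconds a resolver may cache the answer
    rclass: str  # almost always IN
    rtype: str   # A, AAAA, CNAME, ...
    rdata: str   # record payload, e.g. the IPv4 address


def parse_record(line: str) -> ResourceRecord:
    """Parse 'name ttl class type rdata' separated by arbitrary whitespace."""
    name, ttl, rclass, rtype, rdata = line.split(maxsplit=4)
    return ResourceRecord(name, int(ttl), rclass, rtype, rdata)


record = parse_record("www.example.com.    3600    IN    A    93.184.216.34")
print(record.ttl, record.rtype, record.rdata)  # 3600 A 93.184.216.34
```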

9–10. The return journey and local caching

The resolver caches the answer (respecting the TTL) and returns it. Your OS caches it, passes it to the browser, which also caches it. Next time, it's instant.
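
Every cache in this chain behaves the same way: store the answer with an expiry time derived from the TTL, and treat it as a miss once that time has passed. A minimal sketch of the idea, with an injectable clock so the expiry can be demonstrated without waiting (the class name is hypothetical, not any real resolver's API):

```python
import time
from typing import Callable, Optional


class TtlCache:
    """Tiny DNS-style cache: entries expire ttl_seconds after insertion."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def put(self, name: str, address: str, ttl_seconds: float) -> None:
        self._entries[name] = (address, self._clock() + ttl_seconds)

    def get(self, name: str) -> Optional[str]:
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: drop it and report a miss
            return None
        return address


now = [0.0]  # fake clock, in seconds
cache = TtlCache(clock=lambda: now[0])
cache.put("www.example.com", "93.184.216.34", ttl_seconds=3600)
now[0] = 3599.0
print(cache.get("www.example.com"))  # 93.184.216.34
now[0] = 3600.0
print(cache.get("www.example.com"))  # None
```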

11. Making the connection

With the IP in hand, the browser opens a TCP connection (port 80 for HTTP, 443 for HTTPS), completes the handshake (plus TLS for HTTPS), sends an HTTP GET, and renders the page. A cold lookup that walks the full hierarchy typically takes 20–120 ms; a hit in the browser or OS cache takes well under 10 ms.

The flow: Browser cache → OS cache → hosts file → Recursive Resolver → Root → TLD → Authoritative NS → back to the browser.
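
The referral chain from root to TLD to authoritative server can be imitated with three in-memory tables. This is a toy model only: the tables stand in for real servers, the server labels are invented, and it assumes the zone is always the last two labels of the name (which is not true for names like example.co.uk).

```python
# All three tables are made-up stand-ins for real servers.
ROOT_REFERRALS = {"com.": "com-tld-servers"}
TLD_REFERRALS = {"com-tld-servers": {"example.com.": "ns1.examplehost.com"}}
ZONES = {"ns1.examplehost.com": {"www.example.com.": "93.184.216.34"}}


def resolve(name: str) -> tuple[str, list[str]]:
    """Follow root -> TLD -> authoritative and return (address, steps).

    Raises KeyError when any level has no entry for the name.
    """
    fqdn = name if name.endswith(".") else name + "."
    labels = fqdn.rstrip(".").split(".")
    tld = labels[-1] + "."
    zone = ".".join(labels[-2:]) + "."  # simplification: zone = last two labels

    steps = []
    tld_servers = ROOT_REFERRALS[tld]
    steps.append(f"root -> referral to {tld_servers}")
    nameserver = TLD_REFERRALS[tld_servers][zone]
    steps.append(f"TLD -> referral to {nameserver}")
    address = ZONES[nameserver][fqdn]
    steps.append(f"authoritative -> {fqdn} A {address}")
    return address, steps


address, steps = resolve("www.example.com")
print(address)
for step in steps:
    print(step)
```

A real resolver does the same walk over the network and caches each referral along the way, which is why later lookups under the same TLD skip the root entirely.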

DNS record types beyond A

A — domain → IPv4
AAAA — domain → IPv6
CNAME — alias from one domain to another
MX — mail servers (with priorities for redundancy)
TXT — arbitrary text; used for domain verification, SPF, DKIM
NS — which servers are authoritative
SOA — administrative info for the zone
PTR — reverse of an A record (IP → name), important for mail server legitimacy

The importance of TTL

Every record has a TTL set by the owner — a balance:

Short TTL (60–300s): changes propagate fast, but more query load on your nameservers.
Long TTL (3600–86400s): big reduction in query load and better performance, but changes can take up to a day to propagate.

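Common practice: use long TTLs for stable infrastructure, lower them to 5–10 minutes ahead of a planned change (at least one old-TTL period in advance, so cached copies carrying the long TTL have expired), then raise them back afterward.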
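
The timing matters because a resolver that cached the record just before you lowered the TTL may keep it for the full old TTL. A small sketch of that arithmetic, with hypothetical function names and an example change date chosen purely for illustration:

```python
from datetime import datetime, timedelta


def latest_time_to_lower_ttl(change_at: datetime, old_ttl_seconds: int) -> datetime:
    """Latest moment to publish the short TTL so that every cached copy
    carrying the old TTL has expired by the time of the change."""
    return change_at - timedelta(seconds=old_ttl_seconds)


def worst_case_staleness(ttl_seconds: int) -> timedelta:
    """How long a resolver may keep serving the old answer after a change."""
    return timedelta(seconds=ttl_seconds)


change = datetime(2026, 6, 20, 9, 0)
print(latest_time_to_lower_ttl(change, old_ttl_seconds=86400))  # 2026-06-19 09:00:00
print(worst_case_staleness(300))                                # 0:05:00
```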

How fast is DNS, really?

Browser cache hit: <1 ms
OS cache hit: 1–5 ms
Resolver cache hit: 10–30 ms
Full recursive query (cold): 20–120 ms
DNS over HTTPS: +10–50 ms for encryption

For popular sites, resolver cache hit rates approach 95%, and referrals to the TLD servers are cached too, so root servers handle less than 1% of all queries. That's how DNS scales to trillions of queries a day.

Why it's so fast: anycast

Multiple servers worldwide share the same IP. Query 8.8.8.8 and you're routed to the nearest one — Tokyo users hit Tokyo, London users hit London. You're not crossing oceans; you're connecting to a server nearby.

DNS and CDNs

CDNs (Cloudflare, Akamai, CloudFront) use GeoDNS: the authoritative nameserver examines where the query comes from and returns the optimal IP for that location. Some go further with DNS-based load balancing based on server health or load.

The invisible infrastructure

DNS is one of the internet's most critical yet invisible services. Every click begins with a DNS query. When it works, it's invisible; when it fails, the internet seems broken. The next time you press enter, appreciate the journey — in milliseconds, your request cascades through caches, crosses continents, and returns with an answer.


How I Cut AWS CloudWatch Costs by 50%: Moving VPC Flow Logs to S3

srinu nuthi — Sun, 14 Jun 2026 11:57:49 +0000

A customer came to me frustrated about their AWS bill. After reviewing the billing dashboard, we found that over 50% of their costs were coming from CloudWatch vended logs, specifically VPC Flow Logs. Here's how we cut the CloudWatch portion of that bill roughly in half with the four steps below.

Why CloudWatch gets so expensive for VPC Flow Logs

(Vended logs are logs AWS services generate for you automatically — VPC Flow Logs, Route 53 query logs, CloudFront access logs.)

VPC Flow Logs are incredibly useful, but storing everything in CloudWatch Logs gets pricey fast:

Data ingestion charges per GB
Storage costs that accumulate over time
No automatic retention policies by default
Vended logs piling up quietly

The fix: move VPC Flow Logs to S3

They didn't need real-time querying for most flow logs — mainly weekly security reviews and occasional troubleshooting. Perfect candidates for S3.

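Cost comparison (1 TB of logs, treated as 1,000 GB; storage-side costs only):

	CloudWatch Logs	S3 + Parquet
Ingestion	$0.50/GB	not included here (see note)
Storage	$0.03/GB/mo	$0.023/GB/mo
Compression	none	80–90% smaller
Monthly total	~$530	~$2.30 storage + delivery + query cost

Note: AWS also bills a per-GB vended-log delivery charge when flow logs are delivered to S3, so the real S3 total is higher than the storage figure alone. Check the current CloudWatch pricing page for your region before quoting numbers.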
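
The storage-side arithmetic behind those totals is simple enough to reproduce. The sketch below uses the per-GB prices quoted in the table and assumes 1 TB is treated as 1,000 GB and that Parquet output is 10% of the original size; it deliberately leaves out delivery and Athena query charges, and the function names are my own.

```python
def cloudwatch_monthly_cost(gb: float,
                            ingest_per_gb: float = 0.50,
                            storage_per_gb: float = 0.03) -> float:
    """Ingestion plus one month of storage in CloudWatch Logs."""
    return gb * (ingest_per_gb + storage_per_gb)


def s3_storage_monthly_cost(gb: float,
                            size_after_compression: float = 0.10,
                            storage_per_gb: float = 0.023) -> float:
    """One month of S3 Standard storage for the compressed Parquet files.

    Storage only: delivery and query charges are not modeled here.
    """
    return gb * size_after_compression * storage_per_gb


print(round(cloudwatch_monthly_cost(1000), 2))   # 530.0
print(round(s3_storage_monthly_cost(1000), 2))   # 2.3
```

Plug in your own monthly volume and regional prices to see whether the move is worth it for your account.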

How I did it (4 steps)

1. Send VPC Flow Logs to S3 in Parquet format

In the VPC console, create a flow log with destination S3 and log file format Parquet. Parquet is a compressed columnar format; in this case the files came out roughly 80–90% smaller than the equivalent uncompressed text, which is a large storage saving.

2. Set up S3 lifecycle policies

Don't keep everything in S3 Standard forever. A lifecycle rule:

Day 0–30: S3 Standard (immediate Athena analysis)
Day 30+: S3 Glacier Instant Retrieval (cheaper, still queryable)
Day 90+: S3 Glacier Deep Archive (~$0.00099/GB/mo, for compliance retention)
Day 365: expire
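
That schedule maps onto a single S3 lifecycle rule. The sketch below only builds and prints the rule as JSON in the shape S3's lifecycle configuration expects; it does not call AWS. The rule ID and the `AWSLogs/` prefix are assumptions for illustration, so adjust them to wherever your flow logs actually land.

```python
import json


def flow_log_lifecycle(prefix: str = "") -> dict:
    """Lifecycle configuration mirroring the 30 / 90 / 365 day schedule."""
    return {
        "Rules": [
            {
                "ID": "flow-logs-tiering",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": 30, "StorageClass": "GLACIER_IR"},
                    {"Days": 90, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    }


config = flow_log_lifecycle("AWSLogs/")
print(json.dumps(config, indent=2))
```

The printed JSON can be saved to a file and applied with whatever tooling you already use for bucket configuration.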

3. Set retention on CloudWatch log groups

The biggest quick win — many log groups had no retention policy, so logs were kept forever. Set sane retention: 7 days for debug, 30 for app logs, 90+ for audit/compliance. Never leave it as "Never expire."

4. Query with Amazon Athena when needed

Logs are now in S3, so query them on demand with Athena — you only pay for what you query. Parquet makes those queries fast and cheap.

The results

CloudWatch costs dropped ~50% in the first billing cycle
Parquet compression cut storage ~85%
Query performance actually improved with Athena + Parquet
Retention policies stopped future cost creep

Pro tip: Don't wait for costs to become a problem. Set billing alerts and review CloudWatch usage monthly — many teams are shocked when they finally check the detailed bill.

Takeaway

Moving VPC Flow Logs from CloudWatch to S3 with Parquet was one of the easiest cost-optimization wins I've done. Direct S3 delivery + Parquet compression + proper retention delivered immediate results. If your AWS bill looks high, start in Cost Explorer and look for vended logs and log groups with no retention.


QPS Limit Exceeded on EKS Start-up: The Image Pull Thundering Herd

srinu nuthi — Sun, 14 Jun 2026 11:51:41 +0000

I scaled our dev EKS cluster down to zero overnight to save cost. The next morning it didn't come back up cleanly — pods got stuck and the events were full of "QPS limit exceeded". The cause wasn't the automation. It was every pod trying to pull its image at the same second. Here's the thundering herd, and how I fixed it.

Why I started stopping the dev cluster at night

A dev cluster doesn't need to run 24/7. There are 168 hours in a week, but a dev environment realistically only needs ~50 (10 hours a day, 5 days a week). So I set up a schedule: scale the node groups to zero at night, bring them back at 8 AM. The control plane stays up; the expensive worker nodes go to zero.

Savings: roughly 60–70% on dev worker-node compute.
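
The upper bound of that range falls straight out of the hours. A quick sketch of the arithmetic, assuming worker nodes are billed only for the hours they run (the function name is made up):

```python
HOURS_PER_WEEK = 7 * 24  # 168


def weekly_savings_fraction(hours_needed: float) -> float:
    """Fraction of worker-node hours saved by running only hours_needed per week."""
    return 1 - hours_needed / HOURS_PER_WEEK


print(f"{weekly_savings_fraction(10 * 5):.1%}")  # 70.2%
```

Real savings land a little lower than this ceiling because nodes take time to start and drain, and anything that stays up overnight (the control plane, storage, load balancers) is still billed.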

Then the cluster woke up angry

The automation worked perfectly going down. The problem was going up. The first morning the cluster scaled back from zero, pods got stuck in ContainerCreating:

Failed to pull image "xxxx.dkr.ecr.ap-south-1.amazonaws.com/my-app:latest":
... 429 Too Many Requests
Warning  Failed   kubelet  Error: ErrImagePull
Warning  Failed   kubelet  ... QPS limit exceeded / Rate exceeded

Root cause: everything pulls at once

When a cluster runs normally, pods start at different times, so image pulls are naturally spread out. But when you bring a cluster back from zero, that smooth spread collapses into a single spike:

All node groups scale up together — a batch of fresh nodes joins within the same minute.
Every node starts with an empty image cache.
The scheduler places every pending pod from every namespace at once.
So every kubelet, on every node, fires image pulls to the registry at the same second.

This is a classic thundering herd, and it hits two rate limits at once:

Registry-side (ECR): Amazon ECR rate-limits the API calls used during a pull (GetDownloadUrlForLayer, BatchGetImage, GetAuthorizationToken). Hundreds of simultaneous pulls return 429 Too Many Requests.
Node-side (kubelet): Each kubelet also rate-limits pulls via registryPullQPS and registryBurst.

The key insight: the error has nothing to do with broken images or a down registry. It's purely a concurrency problem — too many pulls in too short a window.

How I fixed it

1. Stagger the scale-up instead of big-banging it

The single most effective fix. Instead of scaling all node groups to full size at once, bring capacity back in phases — a few nodes, wait a couple minutes for their images to land, then the rest. Same idea for workloads: restore critical namespaces first, the rest a few minutes later.
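
A phased scale-up is just a list of intermediate node counts with a pause between them. The sketch below shows one way to structure it; `set_size` and `wait` are placeholders you would wire up to your own node group API and sleep function, and the batch sizes and pause are illustrative, not tuned values.

```python
from typing import Callable


def phased_scale_up(desired: int, first_batch: int, batch: int) -> list[int]:
    """Cumulative node counts to request at each phase."""
    if desired <= 0 or first_batch <= 0 or batch <= 0:
        raise ValueError("desired, first_batch and batch must be positive")
    current = min(first_batch, desired)
    targets = [current]
    while current < desired:
        current = min(current + batch, desired)
        targets.append(current)
    return targets


def run_plan(targets: list[int],
             set_size: Callable[[int], None],
             wait: Callable[[float], None],
             pause_seconds: float = 120) -> None:
    """Apply each target size, pausing between phases so image pulls can land."""
    for index, target in enumerate(targets):
        set_size(target)
        if index < len(targets) - 1:
            wait(pause_seconds)


plan = phased_scale_up(desired=12, first_batch=2, batch=4)
print(plan)  # [2, 6, 10, 12]
run_plan(plan, set_size=lambda n: print(f"scale to {n}"), wait=lambda s: None)
```

In a real job, `wait` would be `time.sleep`, or better, a loop that polls until the previous batch of nodes reports Ready.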

2. Tune the kubelet pull limits

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serializeImagePulls: false   # allow parallel pulls per node
registryPullQPS: 10          # default is 5
registryBurst: 20            # default is 10

Caution: turning these up while the registry is the bottleneck can make ECR throttling worse. Pair it with step 3.
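
To see what the two numbers mean, it helps to model them as a token bucket: the burst is the bucket size and the QPS is the refill rate. The sketch below is a generic token-bucket simulation, which is my understanding of how the kubelet's limiter behaves rather than a copy of its code, so treat the mapping as an assumption.

```python
class TokenBucket:
    """Generic token bucket: refills at `qps` tokens per second, holds at most `burst`."""

    def __init__(self, qps: float, burst: int) -> None:
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)  # assume the bucket starts full
        self.last = 0.0

    def try_accept(self, now: float) -> bool:
        """Take one token if one is available at time `now` (seconds); never blocks."""
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.qps)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def accepted_at_once(pulls: int, qps: float, burst: int) -> int:
    """How many of `pulls` simultaneous requests get through at t=0."""
    bucket = TokenBucket(qps, burst)
    return sum(bucket.try_accept(now=0.0) for _ in range(pulls))


print(accepted_at_once(30, qps=5, burst=10))   # 10 with the default values
print(accepted_at_once(30, qps=10, burst=20))  # 20 with the tuned values
```

Raising the limits lets more pulls through per node in the first second, which is exactly why it can push more load onto the registry if that is the real bottleneck.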

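3. Use an ECR pull-through cache for upstream images

Set up an ECR pull-through cache rule for images that come from upstream public registries, and make sure the cluster reaches ECR over VPC interface endpoints (plus the S3 gateway endpoint, since ECR serves image layers from S3). Repeated pulls then hit the copy cached in your own ECR instead of re-fetching upstream, which is especially valuable for public images whose registries enforce their own aggressive rate limits. Note that the cache does not raise ECR's own API limits for your private images; staggering (step 1) and pre-pulling (step 4) cover that side.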

4. Pre-pull the hot images

Don't let nodes start with an empty cache: bake common images into a custom AMI, or run a lightweight image pre-puller DaemonSet. Fewer cold pulls means a far smaller herd.

The result

The QPS limit exceeded / 429 errors disappeared on subsequent morning start-ups.
Pods reached Running faster.
We kept the full cost savings of scaling to zero — without the painful wake-up.

If you only do one thing: stagger the scale-up. Most "QPS limit exceeded" start-up failures vanish the moment you stop bringing the entire cluster back in a single burst.

Takeaway

Scaling a dev cluster to zero overnight is one of the easiest cost wins in Kubernetes — but "scale to zero" quietly turns your start-up from a trickle into a flood. Once I stopped big-banging the start-up and gave ECR breathing room with a cache and pre-pulled images, the mornings got quiet again.


Kubernetes PriorityClass Isn't Enough: Pinning a Pod to AMD Nodes During an ARM Migration

srinu nuthi — Sun, 14 Jun 2026 11:45:27 +0000

We started moving our workloads from x86 (amd64) nodes, which I'll call AMD nodes here, to ARM (Graviton, arm64) nodes for the lower price and better performance. Our pipelines now build both architectures, but the frontend's multi-arch build was painfully slow, so we decided to keep the frontend on AMD for now. My first instinct was a PriorityClass. It wasn't enough on its own. Here's why, and the full combination that actually works.

Why we're moving to ARM

AWS Graviton (ARM) instances are cheaper than their equivalent x86 instances and, for a lot of workloads, faster per dollar. For anyone watching their EKS bill, migrating to ARM is one of the better levers you can pull.

The catch: your container images have to be built for the target architecture. An image built only for amd64 won't run on an arm64 node. So step one of any ARM migration is making your build pipelines produce multi-arch images.

The "tiny" pipeline change that wasn't so tiny

Building multi-arch images is, on paper, a one-line change with docker buildx:

docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t myrepo/app:tag \
  --push .

That single --platform linux/amd64,linux/arm64 produces a manifest with both architectures. Once pushed, the container runtime on each node automatically pulls the variant that matches the node's CPU.

But there's a cost: you're now building twice. And if your build host is x86, the arm64 half is built under emulation (QEMU), which can be dramatically slower. For most of our services that was fine. For the frontend, the build time ballooned.

So we decided: migrate everything else to ARM, but keep the frontend on AMD only for now. The challenge then became: how do we guarantee the frontend always runs on an AMD node?

Attempt 1: just use a PriorityClass (spoiler: not enough)

My first thought was a PriorityClass — make the frontend "more important" so it always gets a spot on the AMD nodes.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: frontend-high-priority
value: 1000000
globalDefault: false
description: "Frontend wins contention on the limited AMD nodes."

This is useful — but it does not do what I first assumed:

A PriorityClass controls the order pods are scheduled and whether a pod can preempt (evict) lower-priority pods to make room. It does NOT pin a pod to a particular node or CPU architecture. With only a PriorityClass, nothing stops the frontend from being scheduled onto an ARM node — where its amd64-only image won't even run.

PriorityClass answers "who gets scheduled first?" — not "where does this pod run?" Two different questions, and I was conflating them.

The real fix: three pieces that each do one job

1. nodeSelector — decides WHERE the pod can land

This is the piece that actually pins the frontend to x86. Kubernetes labels every node with its architecture automatically:

spec:
  template:
    spec:
      priorityClassName: frontend-high-priority
      nodeSelector:
        kubernetes.io/arch: amd64
      containers:
        - name: frontend
          image: myrepo/frontend:tag   # amd64-only is fine now

With kubernetes.io/arch: amd64, the scheduler only ever places the frontend on an AMD node. PriorityClass could never have done this.

2. PriorityClass — decides WHO wins when AMD nodes are full

Now the AMD nodes are a scarce resource (we're shrinking them as we move to ARM). If other pods fill them up, the frontend could be stuck Pending. This is where the PriorityClass earns its keep: when the high-priority frontend can't fit, the scheduler preempts lower-priority pods on the AMD nodes to make room, and those evicted pods get rescheduled onto the ARM nodes (provided their images support arm64).

3. Taints & tolerations — keep everyone else OFF the AMD nodes

Relying on preemption works, but it's reactive — pods get scheduled then evicted, which causes churn. The cleaner approach is to stop other pods from landing on the AMD nodes in the first place. Taint the AMD nodes:

kubectl taint nodes <amd-node> workload=frontend:NoSchedule

Then let only the frontend tolerate it:

      tolerations:
        - key: "workload"
          operator: "Equal"
          value: "frontend"
          effect: "NoSchedule"

Now the AMD nodes are effectively reserved for the frontend. The PriorityClass becomes a safety net rather than the primary mechanism.

The mental model that finally made it click

nodeSelector / affinity = where a pod is allowed to go (attraction)
Taints / tolerations = which pods a node repels (reservation)
PriorityClass = who gets scheduled first and who can evict whom (order)

They're three different questions. "Just a PriorityClass" failed because it only answers the third one.
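
The split becomes obvious in a toy model of the scheduler's filtering step. The sketch below is a heavy simplification (tolerations are modeled as exact-match taints, and only hard constraints are considered); the class names are my own and none of this is the real scheduler's code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str = "NoSchedule"


@dataclass
class Node:
    name: str
    labels: dict
    taints: list = field(default_factory=list)


@dataclass
class Pod:
    name: str
    priority: int = 0
    node_selector: dict = field(default_factory=dict)
    tolerations: list = field(default_factory=list)  # modeled as exact-match taints


def feasible(pod: Pod, node: Node) -> bool:
    """Placement check: priority plays no part in it."""
    selector_ok = all(node.labels.get(k) == v for k, v in pod.node_selector.items())
    taints_ok = all(taint in pod.tolerations for taint in node.taints)
    return selector_ok and taints_ok


reserved = Taint("workload", "frontend")
nodes = [
    Node("amd-1", {"kubernetes.io/arch": "amd64"}, [reserved]),
    Node("arm-1", {"kubernetes.io/arch": "arm64"}),
]
frontend = Pod("frontend", 1_000_000, {"kubernetes.io/arch": "amd64"}, [reserved])
priority_only = Pod("frontend-priority-only", 1_000_000)
api = Pod("api")

# Priority only decides the order in which pods are considered.
for pod in sorted([api, priority_only, frontend], key=lambda p: -p.priority):
    print(pod.name, [n.name for n in nodes if feasible(pod, n)])
```

In this model the pod that carries only a high priority is feasible on arm-1 alone, which is exactly the failure mode: high priority, wrong architecture. The pod with the nodeSelector and toleration is feasible on amd-1 only.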

Gotchas worth knowing

Don't taint your AMD nodes without checking system pods. DaemonSets and critical add-ons need to tolerate the taint or run elsewhere, or you'll break things like logging/monitoring agents.
Preemption causes churn. Use preemptionPolicy: Never if you want priority ordering without evicting others.
Keep priority values sane. Don't set your app above system-cluster-critical / system-node-critical.
This is a transition state. The end goal is still a native multi-arch frontend on ARM.

Takeaway

Scheduling priority and pod placement are not the same thing. A PriorityClass will never keep a pod on a particular architecture — it just decides who goes first. To pin our frontend to AMD nodes, the nodeSelector did the placement, taints reserved the capacity, and the PriorityClass was the safety net.

If you're partway through an ARM (Graviton) migration and need certain workloads to stay on x86 for a while, reach for all three — and know which problem each one is solving.


AWS VPC IPAM: The Most Underrated Feature for Avoiding IP Address Chaos

srinu nuthi — Sun, 14 Jun 2026 11:42:01 +0000

Most teams manage their VPC IP ranges in a spreadsheet — until two VPCs overlap, a peering connection refuses to come up, and nobody can remember who owns which CIDR. Amazon VPC IP Address Manager (IPAM) is the feature that quietly solves all of this. It's one of the most underrated tools in AWS networking, and here's why.

The problem nobody talks about until it breaks

IP address planning feels trivial on day one. You spin up a VPC, pick 10.0.0.0/16, and move on.

Fast-forward a year. You now have a dozen VPCs across multiple accounts and two regions. Someone documented the CIDRs in a spreadsheet that's already out of date. Then you try to set up a VPC peering connection or attach a VPC to a Transit Gateway and you hit a wall:

Two VPCs were both created with 10.0.0.0/16. Overlapping CIDR ranges can't be peered or routed together. Now you're looking at re-IPing an entire VPC — a painful, high-risk migration — just because nobody had a single source of truth.
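
If all you have today is the spreadsheet, a few lines of Python's standard `ipaddress` module will at least tell you where the overlaps are. A minimal sketch, with made-up VPC names and CIDRs:

```python
from ipaddress import ip_network
from itertools import combinations


def find_overlaps(vpcs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of VPC names whose CIDR ranges overlap."""
    networks = {name: ip_network(cidr) for name, cidr in vpcs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(networks), 2)
        if networks[a].overlaps(networks[b])
    ]


vpcs = {
    "prod": "10.0.0.0/16",
    "dev": "10.0.0.0/16",
    "shared": "10.1.0.0/16",
}
print(find_overlaps(vpcs))  # [('dev', 'prod')]
```

This only detects the problem after the fact, which is the gap IPAM closes by preventing the overlapping allocation in the first place.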

This is the silent tax of growing on AWS: IP sprawl. And it's exactly what IPAM was built to eliminate.

What is AWS VPC IPAM?

IPAM is a managed feature of Amazon VPC that gives you one place to plan, track, and monitor every IP address in your AWS environment. Instead of a spreadsheet, you get a live, automated system of record.

It's built around a few simple concepts:

IPAM — the top-level resource that manages everything.
Scopes — a container for pools. You get a private scope (internal RFC 1918 ranges) and a public scope, kept separate so ranges never collide.
Pools — collections of CIDR ranges. Pools are hierarchical: a top-level pool can be carved into regional pools, then per-account or per-environment pools.
Allocations — a CIDR handed out from a pool to a resource like a VPC.

Think of it like a filing cabinet: one big drawer (top pool), folders by region, sub-folders by account or environment.

Why it's so underrated: 5 things IPAM does for you

1. Automatic CIDR assignment (no more guessing)

Instead of a human picking a CIDR and hoping it's free, you tell IPAM "give this VPC a /24 from the dev pool" and it allocates a non-overlapping range automatically.
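
Conceptually, that request is a search for the first free block of the right size inside the pool. The sketch below is a simple first-fit allocator to illustrate the idea; it is not how IPAM is implemented internally, and the pool and existing allocations are invented.

```python
from ipaddress import ip_network


def allocate(pool: str, prefix_len: int, allocated: list[str]) -> str:
    """Return the first /prefix_len block in pool that overlaps no existing allocation."""
    taken = [ip_network(cidr) for cidr in allocated]
    for candidate in ip_network(pool).subnets(new_prefix=prefix_len):
        if not any(candidate.overlaps(existing) for existing in taken):
            return str(candidate)
    raise RuntimeError(f"no free /{prefix_len} left in {pool}")


dev_pool = "10.20.0.0/16"
in_use = ["10.20.0.0/24", "10.20.1.0/24"]
print(allocate(dev_pool, 24, in_use))  # 10.20.2.0/24
```

The important property is that the caller never chooses the range; it only states the size it needs, and the allocator guarantees the result is free.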

2. Overlap prevention across accounts and regions

Because every allocation comes from a managed pool, IPAM guarantees you can't hand out the same range twice. This is the killer feature for anyone running AWS Organizations with many accounts. No more peering failures or emergency re-IP migrations.

3. Real utilization monitoring

IPAM continuously tracks how much of each pool is in use and can alert you (via CloudWatch) before a pool runs out.

4. Public IPv4 cost control

Since AWS started charging for every public IPv4 address (about $0.005/hour each, attached or idle), unused Elastic IPs are real money leaking out monthly. IPAM's public IP insights show every public IPv4, what it's attached to, and whether it's idle — so you can release the ones you don't need.
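
At that hourly rate the monthly figure adds up quickly. A quick sketch of the arithmetic, assuming roughly 730 hours in a month and the $0.005/hour rate quoted above (the count of 40 addresses is just an example):

```python
HOURLY_RATE = 0.005    # USD per public IPv4 address per hour, as quoted above
HOURS_PER_MONTH = 730  # average month


def monthly_cost(address_count: int, hours: float = HOURS_PER_MONTH) -> float:
    """Monthly charge for the given number of public IPv4 addresses."""
    return address_count * HOURLY_RATE * hours


print(f"${monthly_cost(1):.2f}")   # $3.65
print(f"${monthly_cost(40):.2f}")  # $146.00
```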

5. Audit history and compliance

IPAM keeps a historical record of how your IP space has been allocated over time. When someone asks "what was using this CIDR three months ago?", you have an actual answer.

Free Tier vs Advanced Tier

Free Tier — basic planning, tracking, and monitoring within a single account and region. Great for getting started.
Advanced Tier — cross-account via AWS Organizations, cross-region pools, automated allocation, public IP insights, compliance monitoring, and usage history. Billed per active managed IP at a small hourly rate.

For any real multi-account org, the Advanced Tier almost always pays for itself by preventing even one re-IP migration.

How to get started (quick version)

Create an IPAM in your operating region.
In the private scope, create a top-level pool with your overall range (e.g. 10.0.0.0/8).
Carve out regional and environment pools beneath it.
When creating new VPCs, choose "allocate CIDR from IPAM pool" instead of typing one in.
Turn on public IP insights and hunt down idle public IPv4 addresses.

Pro tip: Even if you never automate VPC creation, just pointing IPAM at your existing environment to discover what you already have — and where it overlaps — is worth the setup time.

Takeaway

IPAM isn't flashy. But it solves a problem that gets exponentially more expensive the longer you ignore it — and most teams only discover it after a painful overlap has already broken something.

If you're running more than a couple of VPCs, set up IPAM before you need it. Treat your IP space like the shared, finite resource it actually is.
