srinu nuthi

Posted on Jun 14 • Originally published at srinun.in

Kubernetes PriorityClass Isn't Enough: Pinning a Pod to AMD Nodes During an ARM Migration

#aws #eks #kubernetes #devops

We started moving our workloads from x86 nodes (amd64 in Kubernetes terms; I'll call them "AMD" nodes throughout, though the label covers Intel instances too) to ARM (Graviton) nodes for the lower price and better performance. Our pipelines now build for both architectures, but the frontend's multi-arch build was painfully slow, so we decided to keep the frontend on AMD for now. My first instinct was a PriorityClass. It wasn't enough on its own. Here's why, and the full combination that actually works.

Why we're moving to ARM

AWS Graviton (ARM) instances are cheaper than their equivalent x86 instances and, for a lot of workloads, faster per dollar. For anyone watching their EKS bill, migrating to ARM is one of the better levers you can pull.

The catch: your container images have to be built for the target architecture. An image built only for amd64 won't run on an arm64 node. So step one of any ARM migration is making your build pipelines produce multi-arch images.

The "tiny" pipeline change that wasn't so tiny

Building multi-arch images is, on paper, a one-line change with docker buildx:

docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t myrepo/app:tag \
  --push .

That single --platform linux/amd64,linux/arm64 flag produces a manifest list (a multi-arch image index) with one entry per architecture. Once pushed, the container runtime on each node automatically pulls the variant that matches the node's CPU architecture.

But there's a cost: you're now building twice. And if your build host is x86 and you haven't attached a native arm64 builder, the arm64 half is built under emulation (QEMU), which can be dramatically slower. For most of our services that was fine. For the frontend, the build time ballooned.

So we decided: migrate everything else to ARM, but keep the frontend on AMD only for now. The challenge then became: how do we guarantee the frontend always runs on an AMD node?

Attempt 1: just use a PriorityClass (spoiler: not enough)

My first thought was a PriorityClass — make the frontend "more important" so it always gets a spot on the AMD nodes.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: frontend-high-priority
value: 1000000
globalDefault: false
description: "Frontend wins contention on the limited AMD nodes."

This is useful — but it does not do what I first assumed:

A PriorityClass controls the order pods are scheduled and whether a pod can preempt (evict) lower-priority pods to make room. It does NOT pin a pod to a particular node or CPU architecture. With only a PriorityClass, nothing stops the frontend from being scheduled onto an ARM node — where its amd64-only image won't even run.

PriorityClass answers "who gets scheduled first?" — not "where does this pod run?" Two different questions, and I was conflating them.
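
To make the gap concrete, here is a minimal sketch of a Deployment with only the PriorityClass attached. The Deployment name, labels, and replica count are illustrative assumptions rather than our real manifest; the point is what the pod spec does not contain.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend              # hypothetical name
spec:
  replicas: 2                 # illustrative value
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      # Affects scheduling order and preemption only.
      priorityClassName: frontend-high-priority
      # No nodeSelector, affinity, or tolerations here, so the scheduler
      # may pick any node with room, arm64 nodes included.
      containers:
        - name: frontend
          image: myrepo/frontend:tag   # amd64-only image
```

Nothing in this spec mentions architecture, which is exactly why the pod can end up on an ARM node.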

The real fix: three pieces that each do one job

1. nodeSelector — decides WHERE the pod can land

This is the piece that actually pins the frontend to x86. The kubelet labels every node with its architecture automatically (kubernetes.io/arch), so a nodeSelector on that label is all it takes:

spec:
  template:
    spec:
      priorityClassName: frontend-high-priority
      nodeSelector:
        kubernetes.io/arch: amd64
      containers:
        - name: frontend
          image: myrepo/frontend:tag   # amd64-only is fine now

With kubernetes.io/arch: amd64, the scheduler only ever places the frontend on an AMD node. PriorityClass could never have done this.
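
If you prefer node affinity, the same hard constraint can probably be expressed as below. This is a sketch of an equivalent form, not what we deployed; it is more verbose, but it leaves room for extra expressions later (for example, additional label matches).

```yaml
spec:
  template:
    spec:
      priorityClassName: frontend-high-priority
      affinity:
        nodeAffinity:
          # Hard requirement, checked at scheduling time only.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - amd64
      containers:
        - name: frontend
          image: myrepo/frontend:tag
```

For a single label match like this one, the plain nodeSelector is the simpler choice.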

2. PriorityClass — decides WHO wins when AMD nodes are full

Now the AMD nodes are a scarce resource (we're shrinking them as we move to ARM). If other pods fill them up, the frontend could be stuck Pending. This is where the PriorityClass earns its keep: when the high-priority frontend can't fit, the scheduler preempts lower-priority pods on the AMD nodes to make room. Their controllers then recreate the evicted pods, which can land on the ARM nodes as long as their images are multi-arch and nothing pins them to amd64.

3. Taints & tolerations — keep everyone else OFF the AMD nodes

Relying on preemption works, but it's reactive — pods get scheduled then evicted, which causes churn. The cleaner approach is to stop other pods from landing on the AMD nodes in the first place. Taint the AMD nodes:

kubectl taint nodes <amd-node> workload=frontend:NoSchedule

Then let only the frontend tolerate it:

      tolerations:
        - key: "workload"
          operator: "Equal"
          value: "frontend"
          effect: "NoSchedule"

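Now the AMD nodes are effectively reserved for the frontend: no new pod without the toleration can be scheduled there. Note that NoSchedule does not evict pods already running on the node, so existing pods stay until they are rescheduled. The toleration also only permits the frontend on the tainted nodes; the nodeSelector is still what steers it there. The PriorityClass becomes a safety net rather than the primary mechanism.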
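
Putting the three pieces together, the frontend Deployment might look roughly like this. The Deployment name, labels, and replica count are assumptions for illustration; the priorityClassName, nodeSelector, and tolerations are the ones shown above.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend                  # hypothetical name
spec:
  replicas: 2                     # illustrative value
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      # Who wins when the AMD nodes are full.
      priorityClassName: frontend-high-priority
      # Where the pod is allowed to run.
      nodeSelector:
        kubernetes.io/arch: amd64
      # Lets the pod past the taint on the AMD nodes.
      tolerations:
        - key: "workload"
          operator: "Equal"
          value: "frontend"
          effect: "NoSchedule"
      containers:
        - name: frontend
          image: myrepo/frontend:tag
```

Dropping any one of the three fields brings back one of the problems described earlier: wrong architecture, a Pending frontend, or crowded AMD nodes.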

The mental model that finally made it click

nodeSelector / affinity = where a pod is allowed to go (attraction)
Taints / tolerations = which pods a node repels (reservation)
PriorityClass = who gets scheduled first and who can evict whom (order)

They're three different questions. "Just a PriorityClass" failed because it only answers the third one.

Gotchas worth knowing

Don't taint your AMD nodes without checking system pods. DaemonSets and critical add-ons need to tolerate the taint or run elsewhere, or you'll break things like logging/monitoring agents.
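Preemption causes churn. Set preemptionPolicy: Never on the PriorityClass if you want priority ordering in the scheduling queue without evicting other pods.
Keep priority values sane. Kubernetes caps user-defined values at 1,000,000,000, below the built-in system-cluster-critical and system-node-critical classes, so pick a value that leaves headroom instead of racing to the cap.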
This is a transition state. The end goal is still a native multi-arch frontend on ARM.
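
Two of these gotchas are easier to see in YAML. The sketch below shows a non-preempting variant of the PriorityClass and a DaemonSet that tolerates the AMD taint. The names frontend-high-priority-nonpreempting and node-log-agent, and the agent image, are hypothetical and stand in for whatever you actually run.

```yaml
# Variant of the PriorityClass that orders the queue but never evicts.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: frontend-high-priority-nonpreempting   # hypothetical name
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "Frontend is scheduled ahead of lower-priority pods but does not preempt them."
---
# Hypothetical node agent that must keep running on the tainted AMD nodes.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-log-agent            # hypothetical name
spec:
  selector:
    matchLabels:
      app: node-log-agent
  template:
    metadata:
      labels:
        app: node-log-agent
    spec:
      tolerations:
        - key: "workload"
          operator: "Equal"
          value: "frontend"
          effect: "NoSchedule"
      containers:
        - name: agent
          image: myrepo/log-agent:tag   # hypothetical multi-arch image
```

Keep in mind the trade-off: with preemptionPolicy: Never the frontend no longer evicts anyone, so the PriorityClass stops being a safety net for full AMD nodes and the taint has to carry the reservation on its own.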

Takeaway

Scheduling priority and pod placement are not the same thing. A PriorityClass will never keep a pod on a particular architecture — it just decides who goes first. To pin our frontend to AMD nodes, the nodeSelector did the placement, taints reserved the capacity, and the PriorityClass was the safety net.

If you're partway through an ARM (Graviton) migration and need certain workloads to stay on x86 for a while, reach for all three — and know which problem each one is solving.

