<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hein Reyneke</title>
    <description>The latest articles on DEV Community by Hein Reyneke (@dreamsword981).</description>
    <link>https://dev.to/dreamsword981</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3816948%2Fe937a758-2e48-4fd0-b157-3076304d75d4.jpeg</url>
      <title>DEV Community: Hein Reyneke</title>
      <link>https://dev.to/dreamsword981</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dreamsword981"/>
    <language>en</language>
    <item>
      <title>We migrated 3 EKS clusters on AWS to Karpenter. Here's what nobody warns you about.</title>
      <dc:creator>Hein Reyneke</dc:creator>
      <pubDate>Tue, 10 Mar 2026 13:18:36 +0000</pubDate>
      <link>https://dev.to/dreamsword981/we-migrated-3-eks-clusters-on-aws-to-karpenter-heres-what-nobody-warns-you-about-2e9k</link>
      <guid>https://dev.to/dreamsword981/we-migrated-3-eks-clusters-on-aws-to-karpenter-heres-what-nobody-warns-you-about-2e9k</guid>
      <description>&lt;p&gt;Most teams switch for the cost savings. They stay for the headaches they didn't see coming.&lt;/p&gt;

&lt;p&gt;Before Karpenter, we were running fixed managed node groups — over-provisioned, expensive, and sized by gut feel. Cluster Autoscaler helped, but it scaled based on what you pre-configured, not what your workloads actually needed. You were still making the sizing decisions. You were just automating the slow parts.&lt;/p&gt;

&lt;p&gt;Karpenter changes the contract entirely. It provisions exactly what your pods ask for. Which sounds great, until you realise your pods have been lying about what they need.&lt;/p&gt;

&lt;p&gt;Here's what actually happened, and what to do instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bootstrap deadlock nobody mentions&lt;/strong&gt;&lt;br&gt;
Karpenter can't run on nodes it provisions; its controller pods carry affinity rules that actively avoid them. If you remove your managed node group too early, Karpenter has nowhere to live.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Always keep a small managed node group tainted with karpenter.sh/controller. It runs Karpenter and CoreDNS. Nothing else.&lt;/p&gt;
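
&lt;p&gt;A minimal sketch of that dedicated node group as an eksctl ClusterConfig fragment. The name, instance type, and sizes here are illustrative, not what we ran:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Small managed node group reserved for Karpenter and CoreDNS
managedNodeGroups:
  - name: karpenter-system      # illustrative name
    instanceType: t3.medium     # illustrative size
    minSize: 2                  # two nodes so the controller survives a node loss
    maxSize: 3
    taints:
      - key: karpenter.sh/controller
        value: "true"
        effect: NoSchedule      # only pods that tolerate this taint land here
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Karpenter's Helm values and the CoreDNS deployment then need a matching toleration (and ideally a nodeSelector) so they land on this group and nothing else does.&lt;/p&gt;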

&lt;p&gt;&lt;strong&gt;Karpenter only sees requests. Not limits.&lt;/strong&gt;&lt;br&gt;
If your pods have requests of 200Mi but limits of 1Gi, Karpenter sizes the node for the requests, packs 5 pods onto it, and thinks it's fine.&lt;br&gt;
When those pods actually use 400Mi each, real usage is double what the node was sized for. The node explodes.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Align requests closer to actual usage. Use VPA in recommendation-only mode to get data from real historical usage, not guesses.&lt;/p&gt;
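
&lt;p&gt;A sketch of VPA in recommendation-only mode, assuming the VPA CRDs are installed and a Deployment named api exists (both names are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-recommender        # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                  # hypothetical workload
  updatePolicy:
    updateMode: "Off"          # recommend only; never evict or resize pods
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Read the numbers back with kubectl describe vpa api-recommender and fold them into your requests yourself.&lt;/p&gt;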

&lt;p&gt;&lt;strong&gt;Never allow "small" instances in production NodePools&lt;/strong&gt;&lt;br&gt;
On smaller instance types, DaemonSets alone (aws-node, kube-proxy, ebs-csi, log shippers) can eat 30-40% of allocatable memory before a single application pod lands.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Remove small instance sizes from your NodePool requirements. Set a minimum memory floor. We use 4Gi as the baseline.&lt;/p&gt;
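
&lt;p&gt;One way to express that floor is Karpenter's well-known instance-memory label, whose values are in MiB. A Gt on 4095 keeps everything at or above the 4Gi baseline (the NodePool name is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-memory
          operator: Gt
          values: ["4095"]     # MiB; only instances with at least 4Gi of memory
&lt;/code&gt;&lt;/pre&gt;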

&lt;p&gt;&lt;strong&gt;PDBs won't save you from OOMKill&lt;/strong&gt;&lt;br&gt;
PodDisruptionBudgets protect against voluntary disruptions: consolidation, node drains. An OOMKill is involuntary.&lt;br&gt;
The kernel doesn't ask permission. It kills the process instantly.&lt;br&gt;
Worse: pods deployed at the same time leak memory at the same rate and OOM simultaneously. Suddenly nginx has no healthy backends. 503s for everyone.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; For Node.js workloads, set your V8 heap limit below your container memory limit. Give the garbage collector a chance before the kernel gets involved.&lt;/p&gt;
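
&lt;p&gt;For a container with a 1Gi limit, that looks roughly like this. The 768 value is an illustrative heap cap that leaves headroom for buffers, native memory, and the runtime itself:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;containers:
  - name: api                  # hypothetical container
    resources:
      requests:
        memory: 512Mi
      limits:
        memory: 1Gi
    env:
      - name: NODE_OPTIONS
        value: "--max-old-space-size=768"  # V8 heap cap in MiB, below the 1Gi limit
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now V8 works the garbage collector hard around 768Mi instead of letting the process sail straight into the kernel's 1Gi wall.&lt;/p&gt;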

&lt;p&gt;&lt;strong&gt;Missing IAM permissions show up as cryptic failures&lt;/strong&gt;&lt;br&gt;
Everyone knows Karpenter needs the obvious EC2 and IAM permissions.&lt;br&gt;
Almost nobody mentions iam:AddRoleToInstanceProfile and iam:RemoveRoleFromInstanceProfile. Missing these doesn't throw a clear error. Karpenter just quietly fails to launch nodes, and you spend an hour staring at logs.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Audit your Karpenter controller IAM role before your first migration, not during it.&lt;/p&gt;
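
&lt;p&gt;The two easy-to-miss statements look like this as a policy fragment. Scope Resource down to your Karpenter instance profiles in a real policy; the wildcard here just keeps the sketch short:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "KarpenterInstanceProfileRoleBinding",
      "Effect": "Allow",
      "Action": [
        "iam:AddRoleToInstanceProfile",
        "iam:RemoveRoleFromInstanceProfile"
      ],
      "Resource": "*"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;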

&lt;p&gt;&lt;strong&gt;The migration order matters more than you think&lt;/strong&gt;&lt;br&gt;
Infrastructure dependencies have a sequence — get it wrong and you'll be untangling failures instead of validating success.&lt;br&gt;
The same applies to environment rollouts. Don't treat lower environments as a formality. They exist to surface exactly the problems your production can't afford to have.&lt;br&gt;
&lt;strong&gt;Fix:&lt;/strong&gt; Define your rollout order before you start and treat each environment as a proper stage gate with explicit validation criteria.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The real lesson?&lt;/strong&gt;&lt;br&gt;
Karpenter is genuinely excellent. But it shifts responsibility — from "provision nodes manually" to "understand exactly what your pods need."&lt;/p&gt;

&lt;p&gt;Cluster Autoscaler let you get away with sloppy resource definitions. Karpenter exposes them. Loudly. In production.&lt;/p&gt;

&lt;p&gt;Get your requests and limits honest before you migrate. Everything else is manageable.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>kubernetes</category>
      <category>devops</category>
      <category>karpenter</category>
    </item>
  </channel>
</rss>
