<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mark Johnson</title>
    <description>The latest articles on DEV Community by Mark Johnson (@tazmainiandevil).</description>
    <link>https://dev.to/tazmainiandevil</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3935934%2Fd3eb3178-4e19-4b3b-aabc-3af1261404ca.jpeg</url>
      <title>DEV Community: Mark Johnson</title>
      <link>https://dev.to/tazmainiandevil</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tazmainiandevil"/>
    <language>en</language>
    <item>
      <title>Production-Ready GPU Inference Autoscaling on EKS with Karpenter, KEDA, and Dragonfly</title>
      <dc:creator>Mark Johnson</dc:creator>
      <pubDate>Sun, 17 May 2026 09:06:04 +0000</pubDate>
      <link>https://dev.to/tazmainiandevil/production-ready-gpu-inference-autoscaling-on-eks-with-karpenter-keda-and-dragonfly-2f1p</link>
      <guid>https://dev.to/tazmainiandevil/production-ready-gpu-inference-autoscaling-on-eks-with-karpenter-keda-and-dragonfly-2f1p</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; This architecture uses Karpenter + KEDA + Dragonfly on EKS to scale GPU inference pods from zero, pull model images quicker, and cut GPU spend with spot-first provisioning. Cold starts are 84s; warm starts are 7s (with small image). Everything is GitOps-driven via ArgoCD and fully reproducible with Terraform.&lt;/p&gt;

&lt;p&gt;If you’re tired of paying for GPU nodes that sit idle half the day, or waiting minutes for cold starts when traffic suddenly spikes, this guide is for you.&lt;/p&gt;

&lt;p&gt;Most teams running GPU inference on Kubernetes eventually hit the same wall:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPUs are expensive&lt;/li&gt;
&lt;li&gt;Traffic is spiky&lt;/li&gt;
&lt;li&gt;Cold starts are painful&lt;/li&gt;
&lt;li&gt;Large model images make everything worse&lt;/li&gt;
&lt;li&gt;Scaling is either too slow or too costly&lt;/li&gt;
&lt;li&gt;GitOps workflows often break when autoscaling enters the picture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture solves all of that.&lt;/p&gt;

&lt;p&gt;It gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale‑to‑zero&lt;/strong&gt; when idle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast burst capacity&lt;/strong&gt; when demand arrives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predictable cost&lt;/strong&gt; with spot‑first provisioning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal cold‑start pain&lt;/strong&gt;, even with 8–40 GB model images&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;A clean GitOps + IaC workflow&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compatibility with both small and large GPU workloads&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It works whether you run a single &lt;strong&gt;g4dn.xlarge&lt;/strong&gt; or a fleet of &lt;strong&gt;A100s&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;All code lives in the companion repo: 👉 &lt;a href="https://github.com/Tazmainiandevil/eks-gpu-inference-autoscaling" rel="noopener noreferrer"&gt;https://github.com/Tazmainiandevil/eks-gpu-inference-autoscaling&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Architecture Exists&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;GPU nodes are expensive. A single &lt;strong&gt;p4d.24xlarge&lt;/strong&gt; costs ~$32/hr on‑demand.&lt;/p&gt;

&lt;p&gt;If your traffic is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;quiet at night&lt;/li&gt;
&lt;li&gt;busy during the day&lt;/li&gt;
&lt;li&gt;unpredictable during peak hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…then paying for always‑on GPU capacity is pure waste.&lt;/p&gt;

&lt;p&gt;But scaling to zero introduces its own problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU node provisioning&lt;/li&gt;
&lt;li&gt;Large image pulls&lt;/li&gt;
&lt;li&gt;Model weight loading&lt;/li&gt;
&lt;li&gt;CUDA initialisation&lt;/li&gt;
&lt;li&gt;Slow ECR bandwidth&lt;/li&gt;
&lt;li&gt;Cold-start latency that can stretch into minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture eliminates the false choice between:&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;Always‑on GPU waste&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;Cold‑start disasters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;and gives you:&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Elastic GPU capacity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Predictable cost&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Fast warm starts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Production‑grade reliability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s built from components that each solve a specific pain point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Karpenter&lt;/strong&gt; → fast, flexible GPU node provisioning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KEDA / Knative&lt;/strong&gt; → pod‑level autoscaling for async + HTTP workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dragonfly&lt;/strong&gt; → P2P image distribution that eliminates ECR bottlenecks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA Device Plugin + GFD&lt;/strong&gt; → predictable GPU scheduling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ArgoCD&lt;/strong&gt; → GitOps-driven reproducibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform&lt;/strong&gt; → infrastructure as code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, they give you GPU elasticity without the cost burn.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Who This Architecture Is For&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Small workloads (1 GPU)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You want scale‑to‑zero&lt;/li&gt;
&lt;li&gt;You want predictable cold‑start behaviour&lt;/li&gt;
&lt;li&gt;You want to avoid paying for idle GPUs&lt;/li&gt;
&lt;li&gt;You want a clean GitOps workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Large workloads (10–100+ GPUs)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;You need fast burst capacity&lt;/li&gt;
&lt;li&gt;You need to avoid ECR bottlenecks&lt;/li&gt;
&lt;li&gt;You need spot‑first provisioning with fallback&lt;/li&gt;
&lt;li&gt;You need observability‑driven autoscaling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Common real‑world fits&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;LLM inference APIs&lt;/li&gt;
&lt;li&gt;Async video/image processing&lt;/li&gt;
&lt;li&gt;Model evaluation pipelines&lt;/li&gt;
&lt;li&gt;Embedding services&lt;/li&gt;
&lt;li&gt;Chatbots and RAG systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Benchmark Results + Load Test Summary
&lt;/h2&gt;

&lt;p&gt;Before diving into the architecture, let’s answer the only question that really matters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does this actually work?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes, and the numbers are the payoff for everything that follows.&lt;/p&gt;

&lt;p&gt;These results come from two instrumented runs of &lt;code&gt;validate-scaling.sh&lt;/code&gt; on a real &lt;strong&gt;EKS 1.35&lt;/strong&gt; cluster in &lt;strong&gt;eu‑west‑2&lt;/strong&gt;, using &lt;strong&gt;T4 GPUs&lt;/strong&gt; (&lt;code&gt;g4dn.xlarge&lt;/code&gt; spot).&lt;/p&gt;

&lt;h3&gt;
  
  
  Cold vs Warm Start (Dragonfly P2P Cache + existing node)
&lt;/h3&gt;

&lt;p&gt;Note: a 130 MB stub image was used for the tests.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Cold Start&lt;/th&gt;
&lt;th&gt;Warm Start (Dragonfly P2P)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Queue push → node Ready&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;37s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queue push → pods serving&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;84s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scale‑to‑zero&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;307s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;285s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 84s cold start breaks into two phases: EC2 launch (~37s) and ECR image pull (~47s), covered in detail in the Scaling Lifecycle section below.&lt;/p&gt;

&lt;p&gt;Warm start &lt;strong&gt;84s → 7s (12× faster)&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Scale‑to‑Zero Behaviour&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Scale‑in is dominated by &lt;strong&gt;KEDA’s cooldownPeriod&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Default: &lt;strong&gt;300s&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Observed: &lt;strong&gt;285–307s&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After cooldown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Karpenter’s &lt;code&gt;whenEmpty&lt;/code&gt; + &lt;code&gt;consolidateAfter&lt;/code&gt; drains the node&lt;/li&gt;
&lt;li&gt;EC2 instance terminates cleanly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matches the configured behaviour almost exactly.&lt;/p&gt;
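&lt;p&gt;The cooldown above comes straight from the ScaledObject. A minimal sketch of the relevant fields (names and the queue URL are illustrative, not taken from the repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-inference          # illustrative name
spec:
  scaleTargetRef:
    name: gpu-inference        # the inference Deployment
  minReplicaCount: 0           # enables scale-to-zero
  pollingInterval: 30          # seconds between queue checks
  cooldownPeriod: 300          # idle seconds before scaling back to zero
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.eu-west-2.amazonaws.com/111111111111/inference-jobs  # placeholder
        awsRegion: eu-west-2
        queueLength: "5"       # target messages per replica
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Lowering &lt;code&gt;cooldownPeriod&lt;/code&gt; trades idle cost for more frequent cold starts.&lt;/p&gt;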

&lt;h3&gt;
  
  
  Load Test Summary
&lt;/h3&gt;

&lt;p&gt;A 90‑second load test with 5 concurrent workers and a fixed 150ms processing time produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput:&lt;/strong&gt; 602–607 requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;p50 latency:&lt;/strong&gt; ~175ms (150ms processing + ~25ms HTTP overhead)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU:&lt;/strong&gt; NVIDIA T4, 15,360 MiB VRAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dragonfly mirror:&lt;/strong&gt; PASS (warm run served entirely from P2P cache)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation score:&lt;/strong&gt; 24/25 checks passed

&lt;ul&gt;
&lt;li&gt;The single FAIL was a consolidation timing edge case in the test script, not an architecture issue&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Trade-offs Accepted&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;84s cold start is the cost of full scale-to-zero&lt;/li&gt;
&lt;li&gt;Dragonfly adds operational complexity&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pollingInterval: 30&lt;/code&gt; means up to 30s of lag before scale-out&lt;/li&gt;
&lt;li&gt;Spot-first means occasional interruptions&lt;/li&gt;
&lt;li&gt;ECR image-per-model means large images, no hot-swapping weights&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;This architecture is intentionally simple to understand at a glance, yet flexible enough to support everything from a single‑GPU dev cluster to a 100‑GPU production fleet.&lt;/p&gt;

&lt;p&gt;It’s built around three principles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;System pods should never depend on burst GPU capacity&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Baseline GPU capacity (if any) should be predictable and stable&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Burst GPU capacity should scale to zero and back without friction&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s how that looks in practice.&lt;/p&gt;

&lt;p&gt;A clean, production‑ready AWS layout:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC (10.0.0.0/16)
├── Public subnets
│   └── NAT Gateway, Internet Gateway
└── Private subnets
    └── EKS nodes, VPC endpoints (ECR, S3, STS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Node Groups &amp;amp; NodePools&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The cluster uses a &lt;strong&gt;three‑tier node strategy&lt;/strong&gt;:&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;1. Always‑On System Node Group (Managed Node Group)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CoreDNS&lt;/li&gt;
&lt;li&gt;Karpenter&lt;/li&gt;
&lt;li&gt;KEDA&lt;/li&gt;
&lt;li&gt;Knative control plane&lt;/li&gt;
&lt;li&gt;Prometheus&lt;/li&gt;
&lt;li&gt;Dragonfly seed nodes&lt;/li&gt;
&lt;li&gt;ArgoCD&lt;/li&gt;
&lt;li&gt;Kyverno&lt;/li&gt;
&lt;li&gt;External Secrets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Default: &lt;code&gt;t3.medium&lt;/code&gt; (cheap, predictable)&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;2. Baseline GPU Node Group (Optional)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Used in production when you want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;predictable minimum GPU capacity&lt;/li&gt;
&lt;li&gt;warm image caches&lt;/li&gt;
&lt;li&gt;stable throughput during business hours&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Default: &lt;strong&gt;0 GPUs in dev&lt;/strong&gt;, &lt;strong&gt;4 GPUs in prod&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;3. Karpenter NodePools (Burst Capacity)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Two NodePools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;general&lt;/strong&gt; → CPU-only spot nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gpu&lt;/strong&gt; → GPU spot nodes with on‑demand fallback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GPU NodePool supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;g4dn.xlarge&lt;/code&gt;, &lt;code&gt;g4dn.2xlarge&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;g5.xlarge&lt;/code&gt;, &lt;code&gt;g5.2xlarge&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Spot-first, on-demand fallback&lt;/li&gt;
&lt;li&gt;GPU taints to prevent CPU workloads from landing on GPU nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separation is critical:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Without separate NodePools, Karpenter may place a CPU-only workload on a GPU node, wasting expensive capacity.&lt;/p&gt;
&lt;/blockquote&gt;
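
&lt;p&gt;A sketch of such a GPU NodePool (values illustrative; the repo's manifests are the source of truth):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    spec:
      taints:
        - key: gpu
          value: "true"
          effect: NoSchedule               # keeps CPU-only pods off GPU nodes
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g4dn.xlarge", "g4dn.2xlarge", "g5.xlarge", "g5.2xlarge"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]    # Karpenter prefers spot, falls back to on-demand
  limits:
    nvidia.com/gpu: "8"                    # guardrail against runaway provisioning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;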
&lt;h2&gt;
  
  
  &lt;strong&gt;Addons (Platform Layer)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;All installed via ArgoCD:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Karpenter v1.9.0&lt;/strong&gt; – node autoprovisioner&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NVIDIA Device Plugin v0.19.1&lt;/strong&gt; – exposes &lt;code&gt;nvidia.com/gpu&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU Feature Discovery&lt;/strong&gt; – labels GPU nodes with VRAM, product, CUDA version&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knative Serving v1.21.1&lt;/strong&gt; – HTTP autoscaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kourier v1.21.0&lt;/strong&gt; – Knative networking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KEDA v2.19.0&lt;/strong&gt; – async autoscaling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kube-prometheus-stack v82.10.4&lt;/strong&gt; – metrics + Grafana&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dragonfly v1.6.15&lt;/strong&gt; – P2P image distribution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DCGM Exporter&lt;/strong&gt; – GPU metrics for Prometheus&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kyverno&lt;/strong&gt; – policies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External Secrets Operator&lt;/strong&gt; – secret management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is declarative, GitOps-driven, and reproducible.&lt;/p&gt;
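
&lt;p&gt;Each addon is a declarative ArgoCD Application. A minimal sketch (chart source and sync options are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: keda
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://kedacore.github.io/charts   # upstream Helm repo
    chart: keda
    targetRevision: 2.19.0
  destination:
    server: https://kubernetes.default.svc
    namespace: keda
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual drift
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;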
&lt;h3&gt;
  
  
  High-Level Architecture Diagram
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fphgfto1ngli7zb4lz12a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fphgfto1ngli7zb4lz12a.png" width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Architecture Works&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;System pods are isolated from GPU churn&lt;/li&gt;
&lt;li&gt;GPU workloads only land on GPU nodes&lt;/li&gt;
&lt;li&gt;GPU nodes scale to zero cleanly&lt;/li&gt;
&lt;li&gt;Dragonfly eliminates ECR bottlenecks&lt;/li&gt;
&lt;li&gt;Karpenter provisions nodes fast&lt;/li&gt;
&lt;li&gt;KEDA/Knative scale pods predictably&lt;/li&gt;
&lt;li&gt;GitOps keeps everything reproducible&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Scaling Lifecycle (End‑to‑End)
&lt;/h2&gt;

&lt;p&gt;To tune this architecture effectively, you need to understand how the components interact from the moment a job arrives to the moment the GPU node disappears.&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;mental model&lt;/strong&gt; that makes everything else click:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;KEDA is the signal (“we need more pods”).&lt;/strong&gt; &lt;strong&gt;Karpenter is the muscle (“here are the nodes to run them”).&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Below is the complete end‑to‑end flow, including timings from real benchmark runs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14dlnhpiyyayj0j3rqfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14dlnhpiyyayj0j3rqfo.png" width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Spot Interruption Handling
&lt;/h3&gt;

&lt;p&gt;Spot interruptions are handled cleanly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS sends a &lt;strong&gt;2‑minute warning&lt;/strong&gt; to an SQS queue&lt;/li&gt;
&lt;li&gt;Karpenter listens to that queue&lt;/li&gt;
&lt;li&gt;When a termination notice arrives:

&lt;ul&gt;
&lt;li&gt;Karpenter &lt;strong&gt;cordons&lt;/strong&gt; the node&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drains&lt;/strong&gt; all pods&lt;/li&gt;
&lt;li&gt;Schedules replacements on other nodes&lt;/li&gt;
&lt;li&gt;Terminates the instance gracefully&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU jobs in flight when the notice arrives have up to 2 minutes to complete*&lt;/li&gt;
&lt;li&gt;Pods are drained gracefully before the instance terminates&lt;/li&gt;
&lt;li&gt;Replacements are scheduled on other nodes before termination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Spot-first is safe because the architecture is interruption-aware.&lt;/p&gt;

&lt;p&gt;*The deployment sets &lt;code&gt;terminationGracePeriodSeconds: 120&lt;/code&gt;, matching AWS’s 2-minute spot interruption notice; this gives the pod the full window to finish in-flight work before Kubernetes hard-kills it. Whether jobs are actually preserved depends on the inference server handling SIGTERM: it needs to stop accepting new requests and wait for in-flight work to complete. The stub doesn’t implement this (it’s not a real inference server). For production workloads, add a SIGTERM handler or a &lt;code&gt;preStop&lt;/code&gt; hook to drain the queue.&lt;/p&gt;
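
&lt;p&gt;The relevant pod-spec pieces for a production server might look like this (the &lt;code&gt;/drain&lt;/code&gt; endpoint is hypothetical; substitute whatever your inference server exposes):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;spec:
  terminationGracePeriodSeconds: 120   # matches the 2-minute spot notice
  containers:
    - name: inference
      image: 111111111111.dkr.ecr.eu-west-2.amazonaws.com/inference:latest  # placeholder
      lifecycle:
        preStop:
          exec:
            # hypothetical: stop accepting work, then wait for in-flight jobs
            command: ["sh", "-c", "curl -s -X POST localhost:8080/drain; sleep 110"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;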
&lt;h2&gt;
  
  
  &lt;strong&gt;Timings (Cold vs Warm)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;From real benchmark runs on g4dn.xlarge spot, eu-west-2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Node Ready:&lt;/strong&gt; 37s cold → 3s warm&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pods Serving:&lt;/strong&gt; 84s cold → 7s warm&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale-to-zero:&lt;/strong&gt; 285–307s (matches KEDA cooldown)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cold starts break down into two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;EC2 launch (~37s)&lt;/strong&gt; – the unavoidable floor: AWS booting a GPU instance, AL2023 initialising, and the NVIDIA device plugin advertising &lt;code&gt;nvidia.com/gpu&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image pull (~47s)&lt;/strong&gt; – even a 130 MB stub image takes ~47s pulled directly from ECR on a cold node; for real inference images (8–40 GB) this phase dominates&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The warm start eliminates both. The GPU node is already running, so no provisioning is needed. Dragonfly has already seeded the P2P cache from the cold pull, so ECR isn’t contacted again; layers are served peer-to-peer in under a second.&lt;/p&gt;

&lt;p&gt;Dragonfly is the difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;47s ECR pull&lt;/strong&gt; – cold node, image fetched directly from ECR&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt;5s P2P delivery&lt;/strong&gt; – image served from Dragonfly peer cache; ECR is never contacted&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In production, “warm” is a deliberate choice: setting &lt;code&gt;consolidateAfter: 2h&lt;/code&gt; on the GPU NodePool keeps nodes alive between traffic peaks, preserving both the warm node and the Dragonfly cache. In dev, consolidation is 3 minutes, so the warm state is transient.&lt;/p&gt;
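
&lt;p&gt;That choice lives in the NodePool’s disruption block; sketched:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;disruption:
  consolidationPolicy: WhenEmpty
  consolidateAfter: 3m    # dev: reclaim idle GPU nodes quickly
  # consolidateAfter: 2h  # prod: keep warm nodes and the Dragonfly cache between peaks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;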
&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Lifecycle Matters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This lifecycle is the backbone of the entire architecture. It explains why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KEDA and Karpenter complement each other&lt;/li&gt;
&lt;li&gt;Dragonfly is mandatory for large images&lt;/li&gt;
&lt;li&gt;GPU nodes can scale to zero safely&lt;/li&gt;
&lt;li&gt;Cold starts are predictable&lt;/li&gt;
&lt;li&gt;Warm starts are extremely fast&lt;/li&gt;
&lt;li&gt;Spot-first provisioning is viable&lt;/li&gt;
&lt;li&gt;GitOps remains stable even under churn&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you understand this flow, the rest of the architecture becomes intuitive.&lt;/p&gt;
&lt;h2&gt;
  
  
  Key Design Decisions
&lt;/h2&gt;

&lt;p&gt;These are the decisions that determine whether your GPU autoscaling architecture will be fast, predictable, and cost‑efficient or painful, flaky, and expensive.&lt;/p&gt;

&lt;p&gt;Each choice below is deliberate, battle‑tested, and grounded in real‑world constraints.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;1. Karpenter vs Cluster Autoscaler vs EKS Auto Mode&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Karpenter&lt;/th&gt;
&lt;th&gt;Cluster Autoscaler&lt;/th&gt;
&lt;th&gt;EKS Auto Mode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fastest&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slow&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU flexibility&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;High&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost control&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Excellent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom AMIs&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spot-first&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scale-to-zero&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supports mixed GPU fleets&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Verdict&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you care about &lt;strong&gt;GPU cost&lt;/strong&gt;, &lt;strong&gt;flexibility&lt;/strong&gt;, and &lt;strong&gt;scale‑to‑zero&lt;/strong&gt;, &lt;strong&gt;Karpenter wins&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Why&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Karpenter provisions nodes directly via &lt;strong&gt;EC2 RunInstances / CreateFleet&lt;/strong&gt;, bypassing ASGs entirely. This gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;faster provisioning&lt;/li&gt;
&lt;li&gt;more instance types&lt;/li&gt;
&lt;li&gt;spot-first with fallback&lt;/li&gt;
&lt;li&gt;custom AMIs&lt;/li&gt;
&lt;li&gt;GPU-aware scheduling&lt;/li&gt;
&lt;li&gt;consolidation (automatic scale-in)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;EKS Auto Mode is simpler but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;costs more&lt;/li&gt;
&lt;li&gt;hides configuration&lt;/li&gt;
&lt;li&gt;doesn’t support custom AMIs&lt;/li&gt;
&lt;li&gt;doesn’t support DRA&lt;/li&gt;
&lt;li&gt;doesn’t support mixed GPU fleets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cluster Autoscaler is stable but slow and ASG-bound.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;2. GPU Resource Allocation: Device Plugin vs DRA&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This single choice determines what autoscaling strategies are even possible.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Device Plugin (what this architecture uses)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The NVIDIA device plugin exposes &lt;code&gt;nvidia.com/gpu&lt;/code&gt; as a schedulable resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;simple&lt;/li&gt;
&lt;li&gt;predictable&lt;/li&gt;
&lt;li&gt;stable&lt;/li&gt;
&lt;li&gt;widely supported&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;compatible with Karpenter&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Dynamic Resource Allocation (DRA)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;DRA is the future of GPU scheduling in Kubernetes.&lt;/p&gt;

&lt;p&gt;It introduces &lt;strong&gt;ResourceClaims&lt;/strong&gt; with CEL expressions for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fractional GPUs&lt;/li&gt;
&lt;li&gt;MIG profiles&lt;/li&gt;
&lt;li&gt;time‑slicing&lt;/li&gt;
&lt;li&gt;MPS&lt;/li&gt;
&lt;li&gt;Blackwell hardware (where DRA is required)&lt;/li&gt;
&lt;li&gt;structured parameters (added in Kubernetes 1.31)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But today:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Karpenter does not support DRA.&lt;/strong&gt; (Tracked in &lt;code&gt;kubernetes-sigs/karpenter#1231&lt;/code&gt;, no published ETA.)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If a pod uses &lt;code&gt;resourceClaims&lt;/code&gt;, Karpenter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ignores it entirely&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;cannot provision nodes for it&lt;/li&gt;
&lt;li&gt;leaves pods stuck in Pending&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;EKS Auto Mode has the same limitation.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;If you need DRA today&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You must use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cluster Autoscaler + managed node groups&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;static GPU capacity&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;no scale-to-zero&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Quick comparison&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Device Plugin&lt;/th&gt;
&lt;th&gt;DRA&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Allocation granularity&lt;/td&gt;
&lt;td&gt;Whole GPU&lt;/td&gt;
&lt;td&gt;Fractional, MIG, time-slice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Karpenter compatible&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EKS Auto Mode compatible&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cluster Autoscaler compatible&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU sharing&lt;/td&gt;
&lt;td&gt;Uniform&lt;/td&gt;
&lt;td&gt;Per-pod flexible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Blackwell / P6e‑GB200&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Bursty inference, Karpenter scaling&lt;/td&gt;
&lt;td&gt;Multi-tenant sharing, MIG, Blackwell&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Future-proofing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once DRA reaches GA &lt;strong&gt;and&lt;/strong&gt; Karpenter supports it, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;swap the device plugin for a DRA driver&lt;/li&gt;
&lt;li&gt;keep your pod specs the same&lt;/li&gt;
&lt;li&gt;gain fractional GPUs + MIG + time-slicing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture is designed to evolve cleanly when that day comes.&lt;/p&gt;
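
&lt;p&gt;For context, a DRA claim looks roughly like this (the API group and attribute names follow the evolving upstream design and may change before GA):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: one-t4-gpu
spec:
  devices:
    requests:
      - name: gpu
        deviceClassName: gpu.nvidia.com
        selectors:
          - cel:
              # illustrative CEL expression; attribute names depend on the DRA driver
              expression: device.attributes["gpu.nvidia.com"].productName == "Tesla-T4"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;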

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Always Set GPU Limits on the NodePool&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is a critical safety guardrail.&lt;/p&gt;

&lt;p&gt;Karpenter’s default limits cover CPU and memory, &lt;strong&gt;not GPUs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Without:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;…a misconfigured job can trigger &lt;strong&gt;unbounded GPU node provisioning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The repo includes this limit by default.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. GPU Node Labeling with GPU Feature Discovery (GFD)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When a GPU node joins the cluster, you need more than just &lt;code&gt;nvidia.com/gpu: "1"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You need to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VRAM&lt;/li&gt;
&lt;li&gt;GPU model&lt;/li&gt;
&lt;li&gt;CUDA version&lt;/li&gt;
&lt;li&gt;GPU count&lt;/li&gt;
&lt;li&gt;Architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GFD annotates nodes with labels like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;nvidia.com/&lt;/span&gt;&lt;span class="py"&gt;gpu.memory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;40Gi&lt;/span&gt;
&lt;span class="err"&gt;nvidia.com/&lt;/span&gt;&lt;span class="py"&gt;gpu.product&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;A100-SXM4-40GB&lt;/span&gt;
&lt;span class="err"&gt;nvidia.com/&lt;/span&gt;&lt;span class="py"&gt;gpu.count&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;8&lt;/span&gt;
&lt;span class="err"&gt;nvidia.com/&lt;/span&gt;&lt;span class="py"&gt;gpu.cuda.version&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;12.2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
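
&lt;p&gt;These labels make hardware targeting trivial; for example, a workload that needs a T4 can pin itself with a node selector (label value illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;nodeSelector:
  nvidia.com/gpu.product: Tesla-T4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;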



&lt;h3&gt;
  
  
  &lt;strong&gt;5. Dragonfly vs. ECR-only pulls&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Dragonfly turns image distribution from:&lt;/p&gt;

&lt;p&gt;❌ &lt;strong&gt;O(N)&lt;/strong&gt; pulls from ECR&lt;/p&gt;

&lt;p&gt;into&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;O(1)&lt;/strong&gt; pull + &lt;strong&gt;P2P fan-out&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3dpj8o9blsiu7b740dih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3dpj8o9blsiu7b740dih.png" width="800" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For large images (8–40 GB), this is the difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;15–30 minutes cold start&lt;/strong&gt; (with a real 8–40 GB image, not the 130 MB stub)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;seconds warm start&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dragonfly is mandatory for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;large model images&lt;/li&gt;
&lt;li&gt;burst scaling&lt;/li&gt;
&lt;li&gt;multi-node inference fleets&lt;/li&gt;
&lt;li&gt;cost-efficient ECR usage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;6. Why Separate CPU and GPU NodePools&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you don’t separate them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Karpenter may place CPU-only workloads on GPU nodes&lt;/li&gt;
&lt;li&gt;GPU nodes become expensive general-purpose nodes&lt;/li&gt;
&lt;li&gt;GPU capacity becomes unpredictable&lt;/li&gt;
&lt;li&gt;Consolidation becomes less effective&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The GPU NodePool carries a taint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gpu=true:NoSchedule
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only pods that explicitly tolerate it can land on GPU nodes.&lt;/p&gt;
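
&lt;p&gt;A GPU workload opts in by pairing a toleration for that taint with a GPU request, roughly:&lt;/p&gt;

```yaml
spec:
  tolerations:
    - key: gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: inference
      resources:
        limits:
          nvidia.com/gpu: "1"   # a GPU request keeps it off CPU nodes, too
```
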

&lt;p&gt;This guarantees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no accidental GPU waste&lt;/li&gt;
&lt;li&gt;predictable GPU scheduling&lt;/li&gt;
&lt;li&gt;clean scale-to-zero behaviour&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;7. Why ECR + Dragonfly Instead of EFS for Model Storage&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Some teams store model weights on EFS and mount them into pods.&lt;/p&gt;

&lt;p&gt;This architecture intentionally avoids that.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;EFS is slower than local NVMe&lt;/li&gt;
&lt;li&gt;EFS adds per-GB and per-IOPS cost&lt;/li&gt;
&lt;li&gt;EFS introduces cold-start latency&lt;/li&gt;
&lt;li&gt;EFS is a single point of contention&lt;/li&gt;
&lt;li&gt;ECR + Dragonfly is faster, cheaper, and scales better&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Packaging model weights inside container images gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;versioned, immutable artifacts&lt;/li&gt;
&lt;li&gt;reproducible deployments&lt;/li&gt;
&lt;li&gt;compatibility with GitOps&lt;/li&gt;
&lt;li&gt;P2P distribution via Dragonfly&lt;/li&gt;
&lt;li&gt;no shared filesystem bottlenecks&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;8. MIG, Time-Slicing, and Multi-Tenant GPU Sharing&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This architecture does &lt;strong&gt;not&lt;/strong&gt; use MIG or time-slicing today because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They require DRA when using Kubernetes-native MIG scheduling&lt;/li&gt;
&lt;li&gt;Karpenter does not support DRA&lt;/li&gt;
&lt;li&gt;GPU Operator introduces complexity not needed for burst inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the architecture is designed to evolve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GFD already exposes MIG-capable metadata&lt;/li&gt;
&lt;li&gt;NodePools can be extended with MIG profiles&lt;/li&gt;
&lt;li&gt;DRA migration will be seamless once supported&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Autoscaling Components (KEDA + Knative)
&lt;/h2&gt;

&lt;p&gt;GPU autoscaling isn’t just about provisioning nodes; it’s about scaling &lt;strong&gt;pods&lt;/strong&gt; in a way that matches your workload pattern.&lt;/p&gt;

&lt;p&gt;This architecture supports &lt;strong&gt;two autoscaling modes&lt;/strong&gt;, each optimized for a different inference pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;KEDA&lt;/strong&gt; → async, queue-driven, batch, event-driven&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knative&lt;/strong&gt; → HTTP, interactive, low-latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They can coexist in the same cluster without interfering with each other.&lt;/p&gt;

&lt;p&gt;Let’s break them down.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;KEDA – Autoscaling for Async Workloads&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;KEDA is ideal when inference is &lt;strong&gt;queue-driven&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQS&lt;/li&gt;
&lt;li&gt;Kafka&lt;/li&gt;
&lt;li&gt;Redis&lt;/li&gt;
&lt;li&gt;RabbitMQ&lt;/li&gt;
&lt;li&gt;Prometheus metrics&lt;/li&gt;
&lt;li&gt;Cron triggers&lt;/li&gt;
&lt;li&gt;Custom metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why KEDA?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Async workloads have a natural buffer (the queue), which makes them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bursty&lt;/li&gt;
&lt;li&gt;unpredictable&lt;/li&gt;
&lt;li&gt;latency-tolerant&lt;/li&gt;
&lt;li&gt;throughput-oriented&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;KEDA shines here because it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scales from &lt;strong&gt;0 → N&lt;/strong&gt; based on queue depth&lt;/li&gt;
&lt;li&gt;supports &lt;strong&gt;Prometheus&lt;/strong&gt; as a trigger&lt;/li&gt;
&lt;li&gt;integrates cleanly with Karpenter&lt;/li&gt;
&lt;li&gt;works with any container runtime&lt;/li&gt;
&lt;li&gt;is simple, predictable, and transparent&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why Prometheus Instead of SQS?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You &lt;em&gt;can&lt;/em&gt; scale directly from SQS, but Prometheus is better for GPU inference.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Prometheus gives you:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;source-agnostic metrics&lt;/strong&gt; (you can switch queues without changing autoscaling logic)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;richer semantics&lt;/strong&gt; (queue depth, processing time, GPU utilization, backlog age)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;unified observability&lt;/strong&gt; (everything in one place: Grafana, alerts, dashboards)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;better control&lt;/strong&gt; (you can scale on &lt;em&gt;derived&lt;/em&gt; metrics, not just raw queue length)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: Scale on &lt;code&gt;gpu_job_queue_depth&lt;/code&gt; instead of raw SQS message count.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Example KEDA ScaledObject (Simplified)&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;minReplicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="na"&gt;pollingInterval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;span class="na"&gt;cooldownPeriod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;
&lt;span class="na"&gt;triggers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;metricName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu_job_queue_depth&lt;/span&gt;
      &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Key behaviours&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;minReplicaCount: 0&lt;/code&gt; → &lt;strong&gt;scale-to-zero&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pollingInterval: 30&lt;/code&gt; → polls Prometheus every 30 seconds&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cooldownPeriod: 300&lt;/code&gt; → prevents flapping&lt;/li&gt;
&lt;li&gt;Prometheus trigger → flexible, observable, debuggable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the async backbone of the architecture.&lt;/p&gt;

&lt;p&gt;Note: Set &lt;code&gt;pollingInterval: 15&lt;/code&gt; if you need tighter scaling response; the tradeoff is double the Prometheus query rate. With &lt;code&gt;pollingInterval: 30&lt;/code&gt;, a job arriving just after a poll can wait up to 30s before KEDA acts. 37s node + 47s image + 30s KEDA lag = ~114s worst case.&lt;/p&gt;
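
&lt;p&gt;For completeness, a full ScaledObject also carries a &lt;code&gt;scaleTargetRef&lt;/code&gt; pointing at the Deployment it scales. A sketch with illustrative names:&lt;/p&gt;

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-worker
spec:
  scaleTargetRef:
    name: gpu-worker              # the Deployment KEDA scales
  minReplicaCount: 0
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc:9090  # assumed address
        query: sum(gpu_job_queue_depth)
        threshold: "5"
```
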

&lt;h2&gt;
  
  
  &lt;strong&gt;Knative – Autoscaling for HTTP Workloads&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Knative is ideal when inference is &lt;strong&gt;HTTP-driven&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;chatbots&lt;/li&gt;
&lt;li&gt;LLM APIs&lt;/li&gt;
&lt;li&gt;embedding endpoints&lt;/li&gt;
&lt;li&gt;image/video processing APIs&lt;/li&gt;
&lt;li&gt;interactive services&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why Knative?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Knative gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;HTTP autoscaling&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;concurrency-based scaling&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;scale-to-zero&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;built-in activator&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;request buffering&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;automatic cold-start mitigation&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Knative is effectively “serverless for Kubernetes,” but with GPU support.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Concurrency Model (Important)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Knative scales based on &lt;strong&gt;concurrency&lt;/strong&gt;, not CPU or memory.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;containerConcurrency: 1&lt;/code&gt; → 1 request per pod&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;containerConcurrency: 10&lt;/code&gt; → 10 requests per pod&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For GPU inference, you typically want:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1–2&lt;/strong&gt; for large models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4–8&lt;/strong&gt; for small models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10+&lt;/strong&gt; for embedding workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives you predictable latency and throughput.&lt;/p&gt;
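
&lt;p&gt;Wired into a Knative Service, the concurrency target looks roughly like this (service name and image are illustrative):&lt;/p&gt;

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: llm-api                   # hypothetical service
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"   # allow scale-to-zero
    spec:
      containerConcurrency: 2     # large model: 1-2 in-flight requests per pod
      containers:
        - image: registry.example.com/llm:latest # placeholder image
          resources:
            limits:
              nvidia.com/gpu: "1"
```
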

&lt;h3&gt;
  
  
  &lt;strong&gt;The Activator (Cold-Start Mitigation)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When a Knative service is scaled to zero:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;strong&gt;activator&lt;/strong&gt; receives the first request&lt;/li&gt;
&lt;li&gt;buffers it&lt;/li&gt;
&lt;li&gt;triggers scale-up&lt;/li&gt;
&lt;li&gt;forwards the request once the pod is ready&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5xx errors while scaling from zero&lt;/li&gt;
&lt;li&gt;connection resets&lt;/li&gt;
&lt;li&gt;client timeouts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s one of the reasons Knative is so good for GPU inference APIs.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;KEDA vs Knative – When to Use Which&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Queue-driven&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;KEDA&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch jobs&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;KEDA&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Async pipelines&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;KEDA&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTTP APIs&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Knative&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interactive inference&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Knative&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chatbots / RAG&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Knative&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mixed workloads&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Both&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Can they coexist?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Yes – and they do in this architecture.&lt;/p&gt;

&lt;p&gt;KEDA handles async pipelines. Knative handles HTTP inference. Both scale pods. Karpenter provisions nodes for both.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why Not Use Only One?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KEDA is not optimized for HTTP&lt;/li&gt;
&lt;li&gt;Knative is not optimized for queues&lt;/li&gt;
&lt;li&gt;Both have different scaling semantics&lt;/li&gt;
&lt;li&gt;Both solve different cold-start problems&lt;/li&gt;
&lt;li&gt;Both integrate cleanly with Karpenter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture uses the right tool for each job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dragonfly + containerd Wiring (AL2023, certs.d, hosts.toml)
&lt;/h2&gt;

&lt;p&gt;Dragonfly is the component that turns GPU autoscaling from “theoretically possible” into “practically fast.” But Dragonfly only works if &lt;strong&gt;containerd is wired correctly&lt;/strong&gt;, and on AL2023 the defaults changed in a way that breaks most online guides.&lt;/p&gt;

&lt;p&gt;This section explains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;why AL2023 ignores inline registry config&lt;/li&gt;
&lt;li&gt;why you must use &lt;code&gt;certs.d&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;how the DaemonSet writes &lt;code&gt;hosts.toml&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;how Dragonfly intercepts all ECR pulls&lt;/li&gt;
&lt;li&gt;what happens when it doesn’t&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one of the most important operational details in the entire system.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The AL2023 containerd Change (Critical)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;EKS 1.35 uses &lt;strong&gt;containerd 1.7.x&lt;/strong&gt; on AL2023.&lt;/p&gt;

&lt;p&gt;On this AMI, containerd is configured with &lt;code&gt;config_path&lt;/code&gt;, which means it &lt;strong&gt;ignores inline registry configuration&lt;/strong&gt; inside &lt;code&gt;/etc/containerd/config.toml&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Instead, it uses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/etc/containerd/certs.d/&amp;lt;registry&amp;gt;/hosts.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;inline registry mirrors do nothing&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Dragonfly will not intercept pulls&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;containerd silently falls back to ECR&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;you lose all P2P benefits&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the #1 cause of “Dragonfly isn’t working” reports.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Correct Wiring: certs.d + hosts.toml&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To ensure Dragonfly intercepts all ECR pulls, you must create:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/etc/containerd/certs.d/&amp;lt;account&amp;gt;.dkr.ecr.&amp;lt;region&amp;gt;.amazonaws.com/hosts.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With contents like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="py"&gt;server&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://&amp;lt;registry&amp;gt;"&lt;/span&gt;

&lt;span class="nn"&gt;[host."http://127.0.0.1:4001"]&lt;/span&gt;
  &lt;span class="py"&gt;capabilities&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"pull"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"resolve"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells containerd:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“When pulling from ECR, talk to Dragonfly first.”&lt;/li&gt;
&lt;li&gt;“Dragonfly will fetch from ECR if needed.”&lt;/li&gt;
&lt;li&gt;“Otherwise, use the P2P cache.”&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;DaemonSet: Writing hosts.toml Automatically&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The repo includes a DaemonSet that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;runs on &lt;strong&gt;every node&lt;/strong&gt; (including GPU nodes)&lt;/li&gt;
&lt;li&gt;detects the correct ECR registry hostname&lt;/li&gt;
&lt;li&gt;writes the appropriate &lt;code&gt;hosts.toml&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;restarts containerd if needed&lt;/li&gt;
&lt;li&gt;validates the configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;every node routes pulls through Dragonfly&lt;/li&gt;
&lt;li&gt;no manual configuration&lt;/li&gt;
&lt;li&gt;no drift&lt;/li&gt;
&lt;li&gt;no surprises during scale-out&lt;/li&gt;
&lt;/ul&gt;
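
&lt;p&gt;A minimal sketch of such a DaemonSet is below; the repo’s version also restarts containerd and validates the result, and the registry hostname here is a placeholder you must fill in:&lt;/p&gt;

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dragonfly-registry-config
spec:
  selector:
    matchLabels:
      app: dragonfly-registry-config
  template:
    metadata:
      labels:
        app: dragonfly-registry-config
    spec:
      tolerations:
        - operator: Exists            # run on every node, including tainted GPU nodes
      containers:
        - name: configure
          image: public.ecr.aws/docker/library/busybox:stable
          command:
            - sh
            - -c
            - |
              REGISTRY="<account>.dkr.ecr.<region>.amazonaws.com"   # placeholder
              mkdir -p /host/etc/containerd/certs.d/"$REGISTRY"
              cat > /host/etc/containerd/certs.d/"$REGISTRY"/hosts.toml <<EOF
              server = "https://$REGISTRY"

              [host."http://127.0.0.1:4001"]
                capabilities = ["pull", "resolve"]
              EOF
              sleep infinity          # keep the pod Ready after writing the config
          volumeMounts:
            - name: containerd-config
              mountPath: /host/etc/containerd
      volumes:
        - name: containerd-config
          hostPath:
            path: /etc/containerd
```
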

&lt;h3&gt;
  
  
  &lt;strong&gt;Validating Dragonfly Wiring&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The repo includes validation checks in &lt;code&gt;validate-scaling.sh&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;confirms Dragonfly mirror is active&lt;/li&gt;
&lt;li&gt;confirms containerd is using hosts.toml&lt;/li&gt;
&lt;li&gt;confirms warm pulls come from P2P cache&lt;/li&gt;
&lt;li&gt;confirms no ECR bandwidth is used&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A successful warm run shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dragonfly mirror: PASS warm run served entirely from P2P cache
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see this, you’re good.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why Not Use ECR Pull-Through Cache Instead?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it still requires O(N) pulls&lt;/li&gt;
&lt;li&gt;it does not provide P2P fan-out&lt;/li&gt;
&lt;li&gt;it does not reduce cold-start time&lt;/li&gt;
&lt;li&gt;it does not reduce ECR throttling&lt;/li&gt;
&lt;li&gt;it does not help with large images&lt;/li&gt;
&lt;li&gt;it does not reduce NAT Gateway costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dragonfly solves all of these.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitOps with ArgoCD (App‑of‑Apps Pattern)
&lt;/h2&gt;

&lt;p&gt;Autoscaling GPU infrastructure is only half the story. The other half is making sure the entire platform (Karpenter, KEDA, Knative, Dragonfly, the NVIDIA plugin, workloads, and policies) is deployed &lt;strong&gt;consistently&lt;/strong&gt;, &lt;strong&gt;declaratively&lt;/strong&gt;, and &lt;strong&gt;without manual kubectl&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That’s where &lt;strong&gt;ArgoCD&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;This architecture uses a clean, production‑grade &lt;strong&gt;App‑of‑Apps&lt;/strong&gt; pattern that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deploys the entire cluster from a single root manifest&lt;/li&gt;
&lt;li&gt;separates platform from workloads&lt;/li&gt;
&lt;li&gt;enforces ordering via sync waves&lt;/li&gt;
&lt;li&gt;self‑heals drift&lt;/li&gt;
&lt;li&gt;integrates cleanly with Terraform&lt;/li&gt;
&lt;li&gt;keeps everything GitOps‑driven&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s break it down.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The App‑of‑Apps Model&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At the top level, there is a single ArgoCD Application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;argocd/app-of-apps.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This root application points ArgoCD at the &lt;code&gt;argocd/&lt;/code&gt; directory, which contains &lt;strong&gt;four ApplicationSets&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;platform-helm&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;platform-kustomize&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;security-infra&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;apps&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each ApplicationSet deploys a logical slice of the platform.&lt;/p&gt;

&lt;p&gt;This gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clean separation of concerns&lt;/li&gt;
&lt;li&gt;predictable ordering&lt;/li&gt;
&lt;li&gt;reproducible environments&lt;/li&gt;
&lt;li&gt;zero manual installation steps&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Four ApplicationSets&lt;/strong&gt;
&lt;/h3&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. platform-helm&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Deploys all Helm‑based platform components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Karpenter&lt;/li&gt;
&lt;li&gt;KEDA&lt;/li&gt;
&lt;li&gt;Prometheus + Grafana&lt;/li&gt;
&lt;li&gt;Kyverno&lt;/li&gt;
&lt;li&gt;OpenCost&lt;/li&gt;
&lt;li&gt;Dragonfly&lt;/li&gt;
&lt;li&gt;NVIDIA device plugin&lt;/li&gt;
&lt;li&gt;DCGM Exporter&lt;/li&gt;
&lt;li&gt;External Secrets Operator&lt;/li&gt;
&lt;li&gt;Pushgateway&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are the operators and services that form the backbone of the cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. platform-kustomize&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Deploys all Kustomize‑based platform components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Knative Serving&lt;/li&gt;
&lt;li&gt;Kourier&lt;/li&gt;
&lt;li&gt;Karpenter NodePool + EC2NodeClass&lt;/li&gt;
&lt;li&gt;Kyverno policies&lt;/li&gt;
&lt;li&gt;External Secrets ClusterSecretStore&lt;/li&gt;
&lt;li&gt;Grafana admin ExternalSecret&lt;/li&gt;
&lt;li&gt;Prometheus alerting rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layer configures the platform operators installed by &lt;code&gt;platform-helm&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. security-infra&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Deploys foundational security primitives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Namespaces&lt;/li&gt;
&lt;li&gt;Pod Security Standards (PSS)&lt;/li&gt;
&lt;li&gt;NetworkPolicies (default deny)&lt;/li&gt;
&lt;li&gt;ResourceQuotas&lt;/li&gt;
&lt;li&gt;PriorityClasses&lt;/li&gt;
&lt;li&gt;PodDisruptionBudgets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures the cluster is secure &lt;em&gt;before&lt;/em&gt; any operators or workloads are deployed.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. apps&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Deploys the actual inference workloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KEDA Deployment + ScaledObject&lt;/li&gt;
&lt;li&gt;Knative Service&lt;/li&gt;
&lt;li&gt;Any additional inference pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layer depends on all CRDs and operators being ready.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Sync Waves (Critical for First Install)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ArgoCD sync waves ensure everything is applied in the correct order.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Wave 0 – security-infra&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Creates namespaces + PSS labels. Operators cannot start without these.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Wave 1 – platform-helm + platform-kustomize&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Installs operators and their CRDs. Internal sub-waves ensure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CRDs are installed before CRD-dependent manifests&lt;/li&gt;
&lt;li&gt;NodePools apply only after Karpenter is ready&lt;/li&gt;
&lt;li&gt;Kyverno policies apply only after Kyverno is running&lt;/li&gt;
&lt;li&gt;External Secrets apply only after ESO is ready&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Wave 2 – apps&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Deploys inference workloads only after:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KEDA is ready&lt;/li&gt;
&lt;li&gt;Knative is ready&lt;/li&gt;
&lt;li&gt;Karpenter is ready&lt;/li&gt;
&lt;li&gt;Dragonfly is ready&lt;/li&gt;
&lt;li&gt;NVIDIA plugin is ready&lt;/li&gt;
&lt;li&gt;GPU nodes can be provisioned&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents race conditions and broken first installs.&lt;/p&gt;
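
&lt;p&gt;Ordering is expressed with a single annotation on each Application; the wave number here follows the scheme above:&lt;/p&gt;

```yaml
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "2"   # apps deploy after waves 0 and 1
```
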

&lt;h3&gt;
  
  
  &lt;strong&gt;GitOps Workflow&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once the root Application is applied:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; argocd/app-of-apps.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ArgoCD takes over:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;installs all platform components&lt;/li&gt;
&lt;li&gt;installs all workloads&lt;/li&gt;
&lt;li&gt;self-heals drift&lt;/li&gt;
&lt;li&gt;re-syncs on every push to &lt;code&gt;main&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;enforces declarative state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;manual helm installs&lt;/li&gt;
&lt;li&gt;kubectl apply&lt;/li&gt;
&lt;li&gt;one-off scripts&lt;/li&gt;
&lt;li&gt;snowflake clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is GitOps-driven.&lt;/p&gt;

&lt;h2&gt;
  
  
  Terraform + Infrastructure (AWS Layer)
&lt;/h2&gt;

&lt;p&gt;Everything in this architecture sits on top of a clean, predictable AWS foundation. Terraform owns the infrastructure layer (VPC, EKS, IAM, node groups, Pod Identity, and VPC endpoints), while ArgoCD owns the Kubernetes layer.&lt;/p&gt;

&lt;p&gt;This separation gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reproducible clusters&lt;/li&gt;
&lt;li&gt;safe upgrades&lt;/li&gt;
&lt;li&gt;clean GitOps workflows&lt;/li&gt;
&lt;li&gt;zero manual configuration&lt;/li&gt;
&lt;li&gt;predictable GPU autoscaling behaviour&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s break down the key infrastructure components and why they matter.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;VPC Endpoints (Critical for GPU Autoscaling)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;GPU autoscaling stresses ECR and STS more than almost any other workload. Without VPC endpoints, every image pull and every Pod Identity credential request goes through the NAT Gateway.&lt;/p&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;slow image pulls&lt;/li&gt;
&lt;li&gt;NAT bottlenecks&lt;/li&gt;
&lt;li&gt;surprise NAT bills&lt;/li&gt;
&lt;li&gt;unpredictable cold-start behaviour&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture provisions four endpoints:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. ecr.api (Interface Endpoint)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Authenticates ECR pulls.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. ecr.dkr (Interface Endpoint)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Serves the Docker registry API: image manifests and layer requests (the actual layer bytes redirect to S3).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. s3 (Gateway Endpoint)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ECR stores image layers in S3; this endpoint keeps that traffic on the AWS backbone.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. sts (Interface Endpoint)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Required for EKS Pod Identity credential resolution.&lt;/p&gt;
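
&lt;p&gt;As a sketch, the two endpoint flavours look like this in Terraform (the VPC module outputs and region variable are assumptions about your setup):&lt;/p&gt;

```hcl
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc.private_subnets
  private_dns_enabled = true
}

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = module.vpc.private_route_table_ids
}
```
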

&lt;h3&gt;
  
  
  &lt;strong&gt;Why these matter&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With these endpoints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;image pulls stay inside the VPC&lt;/li&gt;
&lt;li&gt;Dragonfly can seed the P2P cache faster&lt;/li&gt;
&lt;li&gt;Pod Identity works without NAT&lt;/li&gt;
&lt;li&gt;cold starts become predictable&lt;/li&gt;
&lt;li&gt;NAT data transfer costs for ECR/S3/STS drop to near-zero&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one of the most important infrastructure optimizations in the entire architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;EKS Pod Identity (Not IRSA)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This architecture uses &lt;strong&gt;EKS Pod Identity&lt;/strong&gt;, not IRSA, for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Karpenter&lt;/li&gt;
&lt;li&gt;EBS CSI&lt;/li&gt;
&lt;li&gt;any AWS-integrated components&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why Pod Identity?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;no OIDC thumbprint management&lt;/li&gt;
&lt;li&gt;no service account annotations&lt;/li&gt;
&lt;li&gt;no IAM role trust policy complexity&lt;/li&gt;
&lt;li&gt;no race conditions during bootstrap&lt;/li&gt;
&lt;li&gt;simpler, safer, cleaner&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Terraform provisions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the IAM roles&lt;/li&gt;
&lt;li&gt;the Pod Identity associations&lt;/li&gt;
&lt;li&gt;the eks-pod-identity-agent addon&lt;/li&gt;
&lt;li&gt;the STS VPC endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Karpenter can call EC2 APIs&lt;/li&gt;
&lt;li&gt;EBS CSI can provision volumes&lt;/li&gt;
&lt;li&gt;workloads can access AWS services&lt;/li&gt;
&lt;li&gt;no NAT Gateway dependency&lt;/li&gt;
&lt;/ul&gt;
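
&lt;p&gt;A Pod Identity association is a single Terraform resource; the namespace, service account, and role name here are illustrative:&lt;/p&gt;

```hcl
resource "aws_eks_pod_identity_association" "karpenter" {
  cluster_name    = module.eks.cluster_name
  namespace       = "karpenter"
  service_account = "karpenter"
  role_arn        = aws_iam_role.karpenter_controller.arn
}
```
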

&lt;h3&gt;
  
  
  &lt;strong&gt;IAM Requirements for Karpenter&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Karpenter needs permissions for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ec2:RunInstances&lt;/code&gt; / &lt;code&gt;CreateFleet&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;instance type discovery&lt;/li&gt;
&lt;li&gt;spot price discovery&lt;/li&gt;
&lt;li&gt;AMI resolution via SSM&lt;/li&gt;
&lt;li&gt;IAM instance profile management&lt;/li&gt;
&lt;li&gt;pricing API access (&lt;code&gt;pricing:GetProducts&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Missing any one of these causes a different failure mode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;nodes stuck in NotReady&lt;/li&gt;
&lt;li&gt;nodes failing to join&lt;/li&gt;
&lt;li&gt;pods stuck in Pending&lt;/li&gt;
&lt;li&gt;Karpenter provisioning loops&lt;/li&gt;
&lt;li&gt;silent failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The repo includes the full, correct IAM policy.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Node Join Authorization&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Karpenter-provisioned nodes must be authorized to join the cluster.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;terraform-aws-modules/eks&lt;/code&gt; v21.x:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;this is handled automatically via &lt;strong&gt;EKS access entries&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re on an older version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you must manually update &lt;code&gt;aws-auth&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;otherwise nodes will register but remain &lt;strong&gt;NotReady&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a common pitfall in DIY setups, but it is fully automated here.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;GPU AMIs (AL2023)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This architecture uses the &lt;strong&gt;AL2023 GPU AMI&lt;/strong&gt;, which includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NVIDIA drivers pre-installed&lt;/li&gt;
&lt;li&gt;containerd 1.7.x&lt;/li&gt;
&lt;li&gt;correct kernel modules&lt;/li&gt;
&lt;li&gt;correct GPU runtime configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it does &lt;strong&gt;not&lt;/strong&gt; include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the NVIDIA device plugin&lt;/li&gt;
&lt;li&gt;GPU Feature Discovery&lt;/li&gt;
&lt;li&gt;DCGM Exporter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are installed via ArgoCD.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why AL2023?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;faster boot times&lt;/li&gt;
&lt;li&gt;better GPU driver stability&lt;/li&gt;
&lt;li&gt;better containerd performance&lt;/li&gt;
&lt;li&gt;better security posture&lt;/li&gt;
&lt;li&gt;long-term support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;DCGM Exporter (GPU Metrics)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;DCGM Exporter provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU utilization&lt;/li&gt;
&lt;li&gt;GPU memory usage&lt;/li&gt;
&lt;li&gt;temperature&lt;/li&gt;
&lt;li&gt;power draw&lt;/li&gt;
&lt;li&gt;ECC errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics feed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus&lt;/li&gt;
&lt;li&gt;Grafana dashboards&lt;/li&gt;
&lt;li&gt;autoscaling decisions (if desired)&lt;/li&gt;
&lt;li&gt;alerting rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It schedules only on GPU nodes via:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;node affinity&lt;/li&gt;
&lt;li&gt;GPU taints&lt;/li&gt;
&lt;li&gt;tolerations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures clean separation between CPU and GPU observability.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Terraform Owns the Infrastructure, ArgoCD Owns the Platform&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This separation is intentional:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Terraform owns:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;VPC&lt;/li&gt;
&lt;li&gt;Subnets&lt;/li&gt;
&lt;li&gt;NAT Gateway&lt;/li&gt;
&lt;li&gt;VPC endpoints&lt;/li&gt;
&lt;li&gt;EKS cluster&lt;/li&gt;
&lt;li&gt;Node groups&lt;/li&gt;
&lt;li&gt;IAM roles&lt;/li&gt;
&lt;li&gt;Pod Identity&lt;/li&gt;
&lt;li&gt;EBS CSI&lt;/li&gt;
&lt;li&gt;Karpenter IAM&lt;/li&gt;
&lt;li&gt;Security groups&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;ArgoCD owns:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Karpenter&lt;/li&gt;
&lt;li&gt;KEDA&lt;/li&gt;
&lt;li&gt;Knative&lt;/li&gt;
&lt;li&gt;Dragonfly&lt;/li&gt;
&lt;li&gt;NVIDIA plugin&lt;/li&gt;
&lt;li&gt;DCGM Exporter&lt;/li&gt;
&lt;li&gt;Prometheus&lt;/li&gt;
&lt;li&gt;Kyverno&lt;/li&gt;
&lt;li&gt;External Secrets&lt;/li&gt;
&lt;li&gt;Workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clean separation of concerns&lt;/li&gt;
&lt;li&gt;safe upgrades&lt;/li&gt;
&lt;li&gt;reproducible clusters&lt;/li&gt;
&lt;li&gt;GitOps-driven platform&lt;/li&gt;
&lt;li&gt;no manual configuration drift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this separation in place, the architecture becomes fast, predictable, and production-ready.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;GPU VRAM Sizing (Quick Reference)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Choosing the right GPU instance type is mostly about &lt;strong&gt;VRAM&lt;/strong&gt;. Model size (in parameters) determines how much VRAM you need for inference, especially in &lt;strong&gt;fp16&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This table gives you a fast, practical reference for common AWS GPU instances.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;GPU VRAM Sizing Table&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance Type&lt;/th&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Max Model Size (fp16)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;g4dn.xlarge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1× T4&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;~7B params&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;g5.xlarge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1× A10G&lt;/td&gt;
&lt;td&gt;24 GB&lt;/td&gt;
&lt;td&gt;~13B params&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;g5.12xlarge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4× A10G&lt;/td&gt;
&lt;td&gt;96 GB&lt;/td&gt;
&lt;td&gt;~70B params&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;p4d.24xlarge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8× A100&lt;/td&gt;
&lt;td&gt;320 GB&lt;/td&gt;
&lt;td&gt;~180B params (tensor parallel)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Notes&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;fp16 models require roughly &lt;strong&gt;2 bytes per parameter&lt;/strong&gt;, plus overhead.&lt;/li&gt;
&lt;li&gt;Larger models (30B–180B) require &lt;strong&gt;tensor parallelism&lt;/strong&gt; across multiple GPUs.&lt;/li&gt;
&lt;li&gt;For vLLM, throughput scales with both VRAM and memory bandwidth.&lt;/li&gt;
&lt;li&gt;For Triton, VRAM determines batch size and concurrency.&lt;/li&gt;
&lt;li&gt;For embedding models, VRAM requirements are much lower, often 4–8 GB is enough.&lt;/li&gt;
&lt;/ul&gt;
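
&lt;p&gt;As a quick sanity check of the 2-bytes-per-parameter rule against the table above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;7B params × 2 bytes/param ≈ 14 GB of weights
16 GB (T4) - 14 GB        ≈  2 GB left for KV cache and activations

→ a 7B fp16 model fits a g4dn.xlarge, but with little headroom;
  longer contexts or larger batches need the 24 GB A10G instead.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;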

&lt;h2&gt;
  
  
  Node Autoscaling (Karpenter Good Practices)
&lt;/h2&gt;

&lt;p&gt;Karpenter is the engine that makes GPU autoscaling fast, flexible, and cost‑efficient, but only if you configure it correctly. GPU workloads amplify every mistake: a misconfigured NodePool can cost thousands per month or cause multi‑minute cold starts.&lt;/p&gt;

&lt;p&gt;This section distills the best practices that make GPU autoscaling predictable, safe, and cheap.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Spot‑First Provisioning with On‑Demand Fallback&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Spot GPUs are dramatically cheaper:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;g4dn.xlarge: ~70% cheaper&lt;/li&gt;
&lt;li&gt;g5.xlarge: ~60% cheaper&lt;/li&gt;
&lt;li&gt;p4d: often 50% cheaper&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But spot capacity is not guaranteed.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The correct pattern is:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Try spot first&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Fallback to on‑demand if spot unavailable&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Never block workloads waiting for spot&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is implemented via:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;requirements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/capacity-type&lt;/span&gt;
    &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spot"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on-demand"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Karpenter automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;attempts spot&lt;/li&gt;
&lt;li&gt;falls back to on‑demand&lt;/li&gt;
&lt;li&gt;retries spot later during consolidation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives you the best of both worlds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;low cost&lt;/li&gt;
&lt;li&gt;high reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Separate CPU and GPU NodePools&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is &lt;strong&gt;mandatory&lt;/strong&gt; for predictable GPU autoscaling.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If CPU workloads can land on GPU nodes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU nodes become expensive general-purpose nodes&lt;/li&gt;
&lt;li&gt;consolidation becomes ineffective&lt;/li&gt;
&lt;li&gt;GPU capacity becomes unpredictable&lt;/li&gt;
&lt;li&gt;scale-to-zero breaks&lt;/li&gt;
&lt;li&gt;Karpenter may refuse to terminate GPU nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The fix:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;GPU NodePool has a taint: &lt;code&gt;gpu=true:NoSchedule&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;GPU workloads have a matching toleration&lt;/li&gt;
&lt;li&gt;CPU workloads do &lt;strong&gt;not&lt;/strong&gt; tolerate it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This guarantees:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU nodes only run GPU workloads&lt;/li&gt;
&lt;li&gt;CPU workloads never waste GPU capacity&lt;/li&gt;
&lt;li&gt;GPU nodes can scale to zero cleanly&lt;/li&gt;
&lt;/ul&gt;
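
&lt;p&gt;On the workload side, the matching toleration for the &lt;code&gt;gpu=true:NoSchedule&lt;/code&gt; taint is a few lines in the pod spec:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;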

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Set GPU Limits on the NodePool (Critical)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Without GPU limits, Karpenter may over‑provision GPU nodes.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;runaway provisioning&lt;/li&gt;
&lt;li&gt;unexpected multi‑GPU nodes&lt;/li&gt;
&lt;li&gt;cost explosions&lt;/li&gt;
&lt;li&gt;scheduling mismatches&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is one of the most important safety guardrails in the entire architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Use Multiple GPU Instance Types (Flexibility = Speed)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Allowing multiple GPU types dramatically improves provisioning speed.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;requirements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node.kubernetes.io/instance-type&lt;/span&gt;
    &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;g4dn.xlarge&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;g4dn.2xlarge&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;g5.xlarge&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;g5.2xlarge&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;5. Consolidation Windows (Scale‑In Behaviour)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Karpenter’s consolidation logic determines when GPU nodes are terminated.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Recommended settings:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dev:&lt;/strong&gt; &lt;code&gt;consolidateAfter: 3m&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;prod:&lt;/strong&gt; &lt;code&gt;consolidateAfter: 2h&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;dev: fast feedback, low cost&lt;/li&gt;
&lt;li&gt;prod: avoid flapping during traffic spikes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consolidation only triggers when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no pods are running&lt;/li&gt;
&lt;li&gt;no pods are pending&lt;/li&gt;
&lt;li&gt;no pods are terminating&lt;/li&gt;
&lt;li&gt;no pods are using local storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures safe, predictable scale‑in.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;6. Use&lt;/strong&gt; &lt;code&gt;consolidationPolicy: WhenEmpty&lt;/code&gt; &lt;strong&gt;for GPU NodePools&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This ensures GPU nodes terminate once they’ve been empty for the consolidation window.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;disruption&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;consolidationPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WhenEmpty&lt;/span&gt;
  &lt;span class="na"&gt;consolidateAfter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;7. Use GFD Labels for VRAM‑Aware Scheduling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;GPU Feature Discovery labels nodes with VRAM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="err"&gt;nvidia.com/&lt;/span&gt;&lt;span class="py"&gt;gpu.memory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;16Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows you to schedule workloads based on VRAM requirements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nvidia.com/gpu.memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;24Gi"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is essential when mixing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;T4 (16 GB)&lt;/li&gt;
&lt;li&gt;A10G (24 GB)&lt;/li&gt;
&lt;li&gt;A100 (40–80 GB)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It prevents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OOM errors&lt;/li&gt;
&lt;li&gt;wasted capacity&lt;/li&gt;
&lt;li&gt;unpredictable scheduling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;8. Prefer Smaller GPU Nodes for Bursty Workloads&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For bursty inference workloads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;g4dn.xlarge&lt;/strong&gt; (1× T4)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;g5.xlarge&lt;/strong&gt; (1× A10G)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;are ideal.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;faster provisioning&lt;/li&gt;
&lt;li&gt;faster warm starts&lt;/li&gt;
&lt;li&gt;better bin-packing&lt;/li&gt;
&lt;li&gt;less fragmentation&lt;/li&gt;
&lt;li&gt;easier scale-to-zero&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Large multi‑GPU nodes (A100, H100) are best for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;large LLMs&lt;/li&gt;
&lt;li&gt;tensor parallelism&lt;/li&gt;
&lt;li&gt;high-throughput batch inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But they are slower to provision and harder to scale elastically.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;9. Don’t Use DaemonSets on GPU Nodes (Unless Safe)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Most DaemonSets block scale-to-zero.&lt;/p&gt;

&lt;p&gt;If you must run a DaemonSet on GPU nodes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;use tolerations&lt;/li&gt;
&lt;li&gt;use node affinity&lt;/li&gt;
&lt;li&gt;use &lt;code&gt;nodeSelector&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;ensure it’s lightweight&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples that are safe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DCGM Exporter&lt;/li&gt;
&lt;li&gt;Node Problem Detector&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples that break scale-to-zero:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;logging agents&lt;/li&gt;
&lt;li&gt;service meshes&lt;/li&gt;
&lt;li&gt;sidecar-heavy stacks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture avoids all of those on GPU nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;10. Use Pod Anti‑Affinity for Multi‑GPU Workloads&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you run multiple GPU pods per node, use anti‑affinity to avoid:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU contention&lt;/li&gt;
&lt;li&gt;VRAM fragmentation&lt;/li&gt;
&lt;li&gt;unpredictable latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;podAntiAffinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requiredDuringSchedulingIgnoredDuringExecution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;labelSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inference&lt;/span&gt;
      &lt;span class="na"&gt;topologyKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/hostname&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures one GPU pod per node unless explicitly desired.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security Hardening (Practical, Production‑Ready Defaults)
&lt;/h2&gt;

&lt;p&gt;GPU inference clusters often start as “just get it running,” but once you move toward production, you need guardrails. This architecture includes a &lt;strong&gt;minimal, sane, production‑ready security baseline&lt;/strong&gt; that doesn’t get in your way but protects you from the most common failure modes.&lt;/p&gt;

&lt;p&gt;Everything here is deployed via ArgoCD in the &lt;code&gt;security-infra&lt;/code&gt; ApplicationSet.&lt;/p&gt;

&lt;p&gt;Let’s walk through the components.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Pod Security Standards (PSS) – Namespaces Enforced by Default&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Namespaces are labeled individually based on what runs in them; there is no one-size-fits-all PSS level for a cluster that includes system DaemonSets, GPU drivers, and application workloads:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Namespace&lt;/th&gt;
&lt;th&gt;enforce&lt;/th&gt;
&lt;th&gt;audit&lt;/th&gt;
&lt;th&gt;warn&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;platform&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;restricted&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;baseline&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;baseline&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Internal services; no privilege needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;inference&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;baseline&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;baseline&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;baseline&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No privilege needed; baseline avoids over-restricting debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;monitoring&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;privileged&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;baseline&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;baseline&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;DCGM Exporter and node-exporter require host device access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;external-secrets&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;baseline&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;baseline&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;baseline&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No privilege needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kube-system&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;em&gt;(unlabelled)&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;–&lt;/td&gt;
&lt;td&gt;NVIDIA device plugin requires &lt;code&gt;privileged: true&lt;/code&gt;; EKS leaves unenforced by default&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Inference pods themselves run without elevated privileges; the NVIDIA device plugin, which &lt;em&gt;does&lt;/em&gt; require host-level access, runs in &lt;code&gt;kube-system&lt;/code&gt;, where PSS is not enforced.&lt;/p&gt;
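
&lt;p&gt;In manifest form, a row of that table is just the standard upstream PSS labels on the namespace, e.g. for &lt;code&gt;inference&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Namespace
metadata:
  name: inference
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/warn: baseline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;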

&lt;h3&gt;
  
  
  &lt;strong&gt;2. NetworkPolicies – Default Deny Everywhere&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Every namespace gets a &lt;strong&gt;default deny&lt;/strong&gt; NetworkPolicy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no pod-to-pod traffic unless explicitly allowed&lt;/li&gt;
&lt;li&gt;no pod-to-node traffic unless allowed&lt;/li&gt;
&lt;li&gt;no cross-namespace traffic unless allowed&lt;/li&gt;
&lt;/ul&gt;
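
&lt;p&gt;A default-deny policy is short; here is a sketch for the &lt;code&gt;inference&lt;/code&gt; namespace (the policy name is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: inference
spec:
  podSelector: {}        # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;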

&lt;h3&gt;
  
  
  &lt;strong&gt;3. ResourceQuotas – Prevent Runaway Workloads&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Each namespace gets a ResourceQuota to prevent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;runaway pod creation&lt;/li&gt;
&lt;li&gt;runaway PVC creation&lt;/li&gt;
&lt;li&gt;runaway GPU requests&lt;/li&gt;
&lt;li&gt;accidental cluster exhaustion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;limits.cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
&lt;span class="na"&gt;limits.memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;64Gi&lt;/span&gt;
&lt;span class="na"&gt;requests.nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;workloads cannot consume all GPUs&lt;/li&gt;
&lt;li&gt;workloads cannot starve system components&lt;/li&gt;
&lt;li&gt;misconfigurations cannot take down the cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. PriorityClasses – System Pods Always Win&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Two PriorityClasses are defined:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;system-critical&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Used by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Karpenter&lt;/li&gt;
&lt;li&gt;KEDA&lt;/li&gt;
&lt;li&gt;Knative control plane&lt;/li&gt;
&lt;li&gt;Prometheus&lt;/li&gt;
&lt;li&gt;Dragonfly seed nodes&lt;/li&gt;
&lt;li&gt;External Secrets&lt;/li&gt;
&lt;li&gt;Kyverno&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;workload-default&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Used by inference workloads.&lt;/p&gt;
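
&lt;p&gt;A PriorityClass itself is tiny; here is a sketch of &lt;code&gt;system-critical&lt;/code&gt; (the numeric value is an assumption, it only needs to exceed &lt;code&gt;workload-default&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: system-critical
value: 1000000          # assumed; higher than workload-default
globalDefault: false
description: "Platform components that must schedule before workloads"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;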

&lt;h3&gt;
  
  
  &lt;strong&gt;5. PodDisruptionBudgets – Protect Critical Components&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;PDBs are applied to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Karpenter&lt;/li&gt;
&lt;li&gt;KEDA&lt;/li&gt;
&lt;li&gt;Knative activator&lt;/li&gt;
&lt;li&gt;Prometheus&lt;/li&gt;
&lt;li&gt;Dragonfly seed nodes&lt;/li&gt;
&lt;/ul&gt;
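
&lt;p&gt;Each PDB follows the same shape; here is a sketch for KEDA (the namespace and label selector are assumptions for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: keda-operator
  namespace: keda
spec:
  minAvailable: 1        # keep at least one replica during voluntary disruption
  selector:
    matchLabels:
      app: keda-operator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;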

&lt;h3&gt;
  
  
  &lt;strong&gt;6. Kyverno Policies – Guardrails, Not Handcuffs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Kyverno audits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;require-gpu-request&lt;/code&gt;&lt;/strong&gt; – pods must declare &lt;code&gt;nvidia.com/gpu&lt;/code&gt; in resource limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;require-gpu-toleration&lt;/code&gt;&lt;/strong&gt; – pods must tolerate the &lt;code&gt;gpu=true:NoSchedule&lt;/code&gt; taint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;require-pdb-for-gpu-pods&lt;/code&gt;&lt;/strong&gt; – pods must carry the &lt;code&gt;app&lt;/code&gt; label so a PodDisruptionBudget can select them&lt;/li&gt;
&lt;/ul&gt;
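
&lt;p&gt;As a sketch, the first of those rules can be expressed in audit mode like this (the namespace scoping is an assumption; the pattern follows Kyverno’s standard require-limits idiom):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-gpu-request
spec:
  validationFailureAction: Audit
  rules:
    - name: require-gpu-limit
      match:
        any:
          - resources:
              kinds:
                - Pod
              namespaces:
                - inference
      validate:
        message: "GPU pods must set nvidia.com/gpu in resources.limits."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    nvidia.com/gpu: "?*"   # any non-empty value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;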

&lt;h3&gt;
  
  
  &lt;strong&gt;Why Kyverno?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;simpler than OPA Gatekeeper&lt;/li&gt;
&lt;li&gt;easier to reason about&lt;/li&gt;
&lt;li&gt;integrates cleanly with GitOps&lt;/li&gt;
&lt;li&gt;policies are YAML, not Rego&lt;/li&gt;
&lt;li&gt;great for platform teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kyverno is deployed via ArgoCD and configured in the &lt;code&gt;platform-kustomize&lt;/code&gt; layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;7. External Secrets Operator – No Secrets in Git&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Secrets come from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Secrets Manager&lt;/li&gt;
&lt;li&gt;AWS SSM Parameter Store&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not from Git.&lt;/p&gt;
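
&lt;p&gt;A typical &lt;code&gt;ExternalSecret&lt;/code&gt; looks like this (the store name and Secrets Manager path are placeholders, not values from this repo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: inference-api-key
  namespace: inference
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager
  target:
    name: inference-api-key      # Kubernetes Secret created by the operator
  data:
    - secretKey: api-key
      remoteRef:
        key: prod/inference/api-key   # path in AWS Secrets Manager (placeholder)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;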

&lt;h3&gt;
  
  
  &lt;strong&gt;8. No DaemonSets on GPU Nodes (Except DCGM Exporter)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Most DaemonSets block scale-to-zero, but Karpenter’s &lt;code&gt;WhenEmpty&lt;/code&gt; policy defines “empty” as no non-DaemonSet pods remaining. DaemonSet pods like DCGM Exporter are excluded from that check, so they don’t prevent GPU node consolidation.&lt;/p&gt;

&lt;p&gt;This means the rule is: keep DaemonSets off GPU nodes unless they serve a GPU-specific purpose.&lt;/p&gt;

&lt;p&gt;This architecture ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;logging agents run only on system nodes&lt;/li&gt;
&lt;li&gt;service meshes run only on system nodes&lt;/li&gt;
&lt;li&gt;monitoring agents run only on system nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;only DCGM Exporter&lt;/strong&gt; runs on GPU nodes, and the benchmark confirms GPU nodes still consolidate cleanly (285–307s scale-in) with it running.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps GPU nodes ephemeral and cheap.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;9. Ingress Hardening (Knative + Kourier)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Knative + Kourier is configured with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTPS termination&lt;/li&gt;
&lt;li&gt;mTLS between components&lt;/li&gt;
&lt;li&gt;strict routing&lt;/li&gt;
&lt;li&gt;no wildcard hosts&lt;/li&gt;
&lt;li&gt;no public access unless explicitly enabled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This protects inference APIs from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;accidental exposure&lt;/li&gt;
&lt;li&gt;cross-namespace routing&lt;/li&gt;
&lt;li&gt;misconfigured hostnames&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Gotchas (Common Pitfalls &amp;amp; How to Avoid Them)
&lt;/h2&gt;

&lt;p&gt;Even with a clean architecture, GPU autoscaling has sharp edges. These are the issues that most commonly break scale‑to‑zero, slow down cold starts, or cause unpredictable behaviour.&lt;/p&gt;

&lt;p&gt;This section is a &lt;strong&gt;checklist&lt;/strong&gt;; if something isn’t working, start here.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;A. Dragonfly Not Intercepting Pulls&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo08lkeunou9nuhmfc86p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo08lkeunou9nuhmfc86p.png" width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the #1 cause of slow warm starts.&lt;/p&gt;

&lt;p&gt;Fixes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensure registry hostname matches exactly&lt;/li&gt;
&lt;li&gt;Ensure &lt;code&gt;hosts.toml&lt;/code&gt; exists under &lt;code&gt;certs.d&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ensure Dragonfly proxy service is reachable&lt;/li&gt;
&lt;li&gt;Ensure containerd restarted after config change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If warm starts take &amp;gt;10 seconds, this is the culprit.&lt;/p&gt;
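
&lt;p&gt;For reference, a minimal &lt;code&gt;hosts.toml&lt;/code&gt; has this shape (the registry host is a placeholder, and the proxy endpoint shown is Dragonfly’s conventional local dfdaemon port, verify it against your deployment):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;# /etc/containerd/certs.d/REGISTRY_HOST/hosts.toml
server = "https://REGISTRY_HOST"

[host."http://127.0.0.1:65001"]      # dfdaemon proxy (default port; confirm yours)
  capabilities = ["pull", "resolve"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;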

&lt;h3&gt;
  
  
  &lt;strong&gt;B. Karpenter Provisioning Issues&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feswsei99437wie7rpyvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feswsei99437wie7rpyvd.png" width="800" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Common causes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing EC2 IAM permissions&lt;/li&gt;
&lt;li&gt;Pod Identity not bound&lt;/li&gt;
&lt;li&gt;NodePool missing GPU instance types&lt;/li&gt;
&lt;li&gt;Using DRA instead of device plugin&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If pods are Pending with no nodes created, start here.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;C. Pods Stuck in Pending&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffaeoczaa8gyzujo4xtot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffaeoczaa8gyzujo4xtot.png" width="800" height="190"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Usually caused by scheduling constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing GPU toleration&lt;/li&gt;
&lt;li&gt;Wrong VRAM selector&lt;/li&gt;
&lt;li&gt;Wrong instance type selector&lt;/li&gt;
&lt;li&gt;DaemonSet occupying GPU nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If pods are Pending &lt;em&gt;after&lt;/em&gt; nodes are created, this is the section to check.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;D. Slow Cold Starts&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm585beyboislbn77se92.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm585beyboislbn77se92.png" width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cold node starts should be ~30–40 seconds.&lt;/p&gt;

&lt;p&gt;If they’re &amp;gt;60 seconds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPC endpoints missing&lt;/li&gt;
&lt;li&gt;Dragonfly not intercepting pulls&lt;/li&gt;
&lt;li&gt;Large images pulled directly from ECR&lt;/li&gt;
&lt;li&gt;Knative activator buffering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is almost always a networking or image distribution issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;E. Scale-to-Zero Not Working&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8abuxjq2t1vmcgmuvjf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj8abuxjq2t1vmcgmuvjf.png" width="800" height="190"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GPU nodes should terminate cleanly after cooldown.&lt;/p&gt;

&lt;p&gt;If they don’t:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KEDA cooldown too long&lt;/li&gt;
&lt;li&gt;DaemonSets blocking drains&lt;/li&gt;
&lt;li&gt;PDBs too strict&lt;/li&gt;
&lt;li&gt;Sidecars preventing termination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the most common cause of “GPU nodes never scale in.”&lt;/p&gt;
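
&lt;p&gt;When tuning the cooldown, the relevant knobs live on the KEDA &lt;code&gt;ScaledObject&lt;/code&gt;; here is a sketch (the Prometheus query, threshold, and names are illustrative, not this repo’s values):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference
  namespace: inference
spec:
  scaleTargetRef:
    name: inference
  minReplicaCount: 0       # enables scale-to-zero
  cooldownPeriod: 300      # seconds idle before dropping to zero (assumed)
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc:9090
        query: sum(rate(http_requests_total{app="inference"}[2m]))
        threshold: "10"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;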

&lt;h3&gt;
  
  
  &lt;strong&gt;F. Unpredictable GPU Costs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvneuf9s95f8dhogo178e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvneuf9s95f8dhogo178e.png" width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If GPU costs spike unexpectedly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU workloads landed on GPU nodes&lt;/li&gt;
&lt;li&gt;GPU NodePool missing limits&lt;/li&gt;
&lt;li&gt;Spot-only with no fallback&lt;/li&gt;
&lt;li&gt;Consolidation disabled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why the architecture enforces taints, tolerations, and limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started (Quickstart)
&lt;/h2&gt;

&lt;p&gt;This architecture is fully reproducible. You can deploy the entire stack (VPC, EKS, Karpenter, KEDA, Knative, Dragonfly, NVIDIA plugin, workloads) using Terraform + ArgoCD with no manual steps.&lt;/p&gt;

&lt;p&gt;You can deploy the full architecture in a few steps.&lt;/p&gt;

&lt;p&gt;For complete instructions, see the repo’s README.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Clone the repo&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Tazmainiandevil/eks-gpu-inference-autoscaling
&lt;span class="nb"&gt;cd &lt;/span&gt;eks-gpu-inference-autoscaling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;2. Run the setup script&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This injects your GitHub org, environment, and repo paths into the ArgoCD manifests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./scripts/setup.sh &lt;span class="nt"&gt;--env&lt;/span&gt; dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It updates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;argocd/app-of-apps.yaml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;argocd/applicationset-platform-kustomize.yaml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;argocd/applicationset-security.yaml&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;argocd/applicationset-apps.yaml&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…and validates the structure before applying.&lt;/p&gt;
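The substitution the script performs is conceptually simple. A hypothetical sketch of the idea (the placeholder names, file path, and variables below are invented for illustration; the real `setup.sh` in the repo is authoritative):

```shell
# Hypothetical sketch of the kind of substitution setup.sh performs:
# write a manifest with placeholders, then inject org and environment.
cat > /tmp/app-of-apps.yaml <<'EOF'
repoURL: https://github.com/__GITHUB_ORG__/eks-gpu-inference-autoscaling
targetRevision: __ENV__
EOF

GITHUB_ORG="my-org"
ENV="dev"
sed -i "s/__GITHUB_ORG__/${GITHUB_ORG}/; s/__ENV__/${ENV}/" /tmp/app-of-apps.yaml
cat /tmp/app-of-apps.yaml
```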

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Install ArgoCD&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace argocd
kubectl apply &lt;span class="nt"&gt;-n&lt;/span&gt; argocd &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: For production, install ArgoCD via its Helm chart with pinned chart version and custom values (admin password, RBAC, SSO). The raw manifest install above is sufficient for this demo but should not be used as-is in a shared cluster.&lt;/p&gt;
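A sketch of what that production-style pinned install might look like. The chart repo URL is Argo's official Helm repo; the version variable and values filename are placeholders you would replace with your own tested values:

```shell
# Pinned Helm install sketch: ARGOCD_CHART_VERSION and the values file
# are placeholders -- pin to a chart version you have actually tested.
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
helm install argocd argo/argo-cd \
  --namespace argocd --create-namespace \
  --version "${ARGOCD_CHART_VERSION}" \
  -f my-argocd-values.yaml   # admin password, RBAC, SSO settings
```

Pinning the chart version means upgrades happen deliberately through Git, not whenever the upstream `stable` manifest changes.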

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Apply the root Application&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ArgoCD deploys the entire platform automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; argocd/app-of-apps.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
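For reference, a root "app of apps" Application typically looks something like this. This is a minimal sketch with example field values; the repo's `argocd/app-of-apps.yaml` is authoritative:

```yaml
# Illustrative root Application: it points ArgoCD at the argocd/
# directory of the repo, which in turn contains the ApplicationSets
# for the platform, security, and app layers.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/Tazmainiandevil/eks-gpu-inference-autoscaling
    targetRevision: main
    path: argocd
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to Git state
```

The `automated` sync policy with `selfHeal` is what makes the cluster converge on Git without further kubectl commands.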



&lt;h3&gt;
  
  
  &lt;strong&gt;5. Validate scaling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Once the cluster is healthy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./validate-scaling.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
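If you prefer to verify each layer by hand instead of (or in addition to) the script, a few spot checks cover most of it. The `inference` namespace here is an assumption; adjust to wherever your workloads actually run:

```shell
# Manual spot checks for each scaling layer:
kubectl get nodepools                               # Karpenter NodePools registered
kubectl get nodes -L karpenter.sh/capacity-type     # spot vs on-demand mix
kubectl get scaledobjects --all-namespaces          # KEDA scalers bound to workloads
kubectl get pods -n inference -w                    # watch pods scale from zero
```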



&lt;p&gt;For full installation details, configuration options, and environment-specific guidance, see the repo.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;GPU autoscaling isn’t magic; it’s engineering. And when you put the right pieces together, the results speak for themselves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cold starts become predictable&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Warm starts become fast&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPU nodes scale to zero cleanly&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Costs stay under control&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The platform stays stable under load&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Everything is reproducible through GitOps&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture works because every layer reinforces the others:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terraform gives you a clean, secure AWS foundation&lt;/li&gt;
&lt;li&gt;ArgoCD turns the cluster into a self‑healing system&lt;/li&gt;
&lt;li&gt;Karpenter provisions GPU nodes quickly and cheaply&lt;/li&gt;
&lt;li&gt;KEDA and Knative scale workloads intelligently&lt;/li&gt;
&lt;li&gt;Dragonfly eliminates ECR bottlenecks&lt;/li&gt;
&lt;li&gt;NVIDIA’s device plugin + GFD make GPU scheduling predictable&lt;/li&gt;
&lt;li&gt;Security guardrails keep everything safe without slowing you down&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The end result is a platform that can run anything from a single‑GPU dev workload to a bursty, multi‑GPU production fleet, without manual intervention, without snowflake clusters, and without surprise bills.&lt;/p&gt;

&lt;p&gt;If you want to explore the code, deploy the architecture, or adapt it to your own workloads, the full repository is here: 👉 &lt;a href="https://github.com/Tazmainiandevil/eks-gpu-inference-autoscaling" rel="noopener noreferrer"&gt;https://github.com/Tazmainiandevil/eks-gpu-inference-autoscaling&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a foundation you can build on, for LLMs, embeddings, vision models, batch inference, streaming pipelines, or whatever comes next.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;References and Further Reading&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://karpenter.sh/docs/" rel="noopener noreferrer"&gt;Karpenter Documentation&lt;/a&gt; – NodePool, EC2NodeClass, disruption policies, and the v1.x migration guide&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://keda.sh/docs/" rel="noopener noreferrer"&gt;KEDA Documentation&lt;/a&gt; – ScaledObject reference, Prometheus scaler, cron trigger, and fallback configuration&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/NVIDIA/k8s-device-plugin" rel="noopener noreferrer"&gt;NVIDIA Kubernetes Device Plugin&lt;/a&gt; – DaemonSet that exposes &lt;code&gt;nvidia.com/gpu&lt;/code&gt; as a schedulable resource&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/kubernetes-sigs/karpenter/issues/1231" rel="noopener noreferrer"&gt;Karpenter DRA Support – Issue #1231&lt;/a&gt; – Track progress on Karpenter + DRA compatibility; no ETA as of Karpenter v1.9.0&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/dragonflyoss/Dragonfly2" rel="noopener noreferrer"&gt;Dragonfly P2P Image Distribution&lt;/a&gt; – Peer-to-peer OCI distribution that prevents ECR bandwidth saturation during fleet scale-out&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://prometheus-operator.dev/" rel="noopener noreferrer"&gt;kube-prometheus-stack&lt;/a&gt; – Prometheus Operator, Grafana, and Alertmanager for Kubernetes; provides the metrics that drive KEDA scaling&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>aws</category>
      <category>devops</category>
      <category>gpu</category>
    </item>
  </channel>
</rss>
