Dr M O Faruque Sarker
Optimising GenAI/ML workloads in AWS EKS with Karpenter

After returning from AWS Summit London 2026, I did some research on running AI/ML workloads on AWS EKS with Karpenter. With some assistance from Gemini, I turned my notes from various talks into this guide, which walks through the intricacies of deploying and scaling Generative AI (GenAI) workloads on AWS EKS, leveraging the power of Karpenter.

Why GenAI/ML Infrastructure Sizing is Hard 📏

The initial challenge with GenAI/ML workloads often stems from translating business requirements into technical specifications. A request like "I have 10,000 users, and my LLM needs to respond fast" is a common starting point, but it provides insufficient detail for hardware selection. We need to convert this into a concrete workload model.

From Business Vision to Technical Metrics

| Metric Type | Requirement Example | Technical Translation |
| --- | --- | --- |
| Application pattern | Agentic application | Interactive / agentic workload model |
| Input load | 5,000 tokens IN | Average prompt length (context window) |
| Output load | 500 tokens OUT | Average generated response length |
| Throughput | 25 requests per second | Concurrent inference calls (RPS) |
| Workload nature | Not spiky, stable workload | Predictable demand, minimal burstiness |
| Experience | End-to-end latency < 5 seconds | Maximum acceptable time for a full response |
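As a quick sanity check, the workload model above translates into an aggregate token throughput. This is a deliberately rough back-of-the-envelope sketch using the example figures from the table; real prefill and decode costs differ per token:

```python
# Back-of-the-envelope workload model: convert the business inputs
# above into aggregate token throughput. All figures are the example
# values from the table, not measurements.
TOKENS_IN = 5_000      # average prompt length (context window)
TOKENS_OUT = 500       # average generated response length
RPS = 25               # sustained requests per second

# Tokens the system must ingest (prefill) and emit (decode) each second.
prefill_tps = TOKENS_IN * RPS    # tokens/s processed during prefill
decode_tps = TOKENS_OUT * RPS    # tokens/s generated during decode
total_tps = prefill_tps + decode_tps

print(f"Prefill: {prefill_tps:,} tok/s, Decode: {decode_tps:,} tok/s, "
      f"Total: {total_tps:,} tok/s")
```

Even at a modest 25 RPS, the 10:1 input-to-output ratio means prefill dominates the raw token budget, which is worth keeping in mind when benchmarking later.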

LLM inference systems are rarely "sized" by a formula; they are measured and optimized. We follow a structured lifecycle to reach the optimal architecture.


From Estimation to Deployment: A 4-Step Framework 🗺️

To consistently meet service level objectives with the lowest possible cost, we utilise a methodical, four-step process:

| Step | Phase | Key Activities |
| --- | --- | --- |
| 1 | Business inputs | Define the user base, the specific use case, and the desired user experience. |
| 2 | Workload model | Calculate tokens per second (TPS), requests per second (RPS), and define Service Level Objectives (SLOs). |
| 3 | Benchmark & experiment | Validate SLOs through concurrency optimization and reproducible testing across different solutions and parameters. |
| 4 | Optimal architecture | Finalize GPU selection and runtime configuration to meet performance and cost targets. |

Building the Foundation: Cluster and OS Selection 🏗️

For maximum flexibility and control over your data plane, we recommend Amazon EKS with self-managed node groups. AWS handles the control plane, while you manage the data plane: networking, security, monitoring, scaling, and storage.

When choosing a Node OS, the goal is to have the smallest, most secure footprint with pre-baked drivers for your AI/ML workloads.

EKS Node Operating System Options

| OS Option | Best For | Included Features |
| --- | --- | --- |
| Amazon Linux 2023 (AL2023) | General-purpose AI/ML | EFA drivers, NVIDIA kernel drivers, NVIDIA runtime, MIG support |
| Bottlerocket | Security & performance | Minimal OS, EFA drivers, NVIDIA kernel drivers, NVIDIA runtime, MIG support, NVIDIA device plugin, faster boot times for scaling |
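As an illustration, selecting Bottlerocket for GPU nodes can be done in a Karpenter EC2NodeClass. This is a minimal sketch assuming the Karpenter v1 API; the IAM role name and `karpenter.sh/discovery` tag values are placeholders for your environment:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-bottlerocket
spec:
  amiSelectorTerms:
    - alias: bottlerocket@latest     # minimal OS with pre-baked NVIDIA drivers
  role: my-karpenter-node-role        # placeholder: your node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster    # placeholder cluster tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```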

Modern Autoscaling with Karpenter ⚡

In Kubernetes, we distinguish between two types of scaling: application scaling and data plane scaling.

Kubernetes Scaling Mechanisms

| Scaling Type | Tooling | Description |
| --- | --- | --- |
| Application scaling | Vertical Pod Autoscaler (VPA), Horizontal Pod Autoscaler (HPA), Kubernetes Event-Driven Autoscaling (KEDA) | Adjusts the resources (CPU, memory) or the number of pods for an application based on demand or custom metrics. |
| Data plane scaling | Cluster Autoscaler (CAS), Karpenter | Manages the underlying compute nodes (EC2 instances) of your Kubernetes cluster. |

Karpenter is particularly effective for AI/ML workloads due to its ability to provision the exact GPU instance needed for a pending pod in seconds, making it more dynamic and efficient than the traditional Cluster Autoscaler.

When an LLM processes a request, it typically involves two phases:

  1. Prefill Phase: The model processes all input tokens before any output is generated. This phase dominates the "Time To First Token" (TTFT), a critical latency metric, so long prompts noticeably increase TTFT. Performance is measured in tokens processed per second.
  2. Decode Phase: The model generates output tokens sequentially, one at a time. Performance here is measured by the time per output token (inter-token latency), so long responses stretch overall end-to-end latency.
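The two phases compose into end-to-end latency. A simplified model, ignoring queuing and batching effects, is E2E ≈ TTFT + output tokens × time per output token; the timing figures below are illustrative assumptions, not benchmarks:

```python
# Simplified end-to-end latency model for one LLM request:
#   E2E ≈ TTFT (dominated by prefill) + tokens_out * TPOT (decode).
# The example timings are assumed values for illustration only.
def e2e_latency(ttft_s: float, tokens_out: int, tpot_s: float) -> float:
    """Approximate end-to-end latency in seconds."""
    return ttft_s + tokens_out * tpot_s

# Example: 0.5 s to first token, 500 output tokens at 8 ms per token.
latency = e2e_latency(ttft_s=0.5, tokens_out=500, tpot_s=0.008)
print(f"{latency:.1f} s")  # 4.5 s, inside the < 5 s SLO from the workload model
```

Note how decode dominates for long responses: 4 of the 4.5 seconds here are sequential token generation, which is why output length caps matter for latency SLOs.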

Capacity Management with Karpenter NodePools

Karpenter allows for sophisticated capacity management using NodePools, enabling precise control over your compute resources:

| NodePool Type | Use Case | Implementation Strategy |
| --- | --- | --- |
| Single NodePool | General-purpose workloads | Simple configuration for consistent compute performance. |
| Multiple NodePools | Workload isolation | Isolate compute for accelerated vs. non-accelerated workloads or different GPU types. |
| Weighted NodePools | Prioritization & cost | Define an order across pools so preferred GPUs are used first, or to prioritize cost-effective instances. |
| Static NodePools | Baseline / reserved load | Utilize On-Demand Capacity Reservations (ODCRs) or ML Capacity Blocks for guaranteed capacity. |
| Bursting NodePools | Spiky workloads | Allow instance-type diversification (e.g., across G4, G5, P4) to ensure rapid scaling and availability during demand spikes. |
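A sketch of the weighted-pool pattern follows (Karpenter v1 API). The pool names, instance choices, GPU limit, and the `gpu-bottlerocket` EC2NodeClass reference are all illustrative assumptions; Karpenter tries the higher-weight pool first and falls back to the diversified pool when it cannot satisfy the request:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: preferred-gpu           # tried first (higher weight)
spec:
  weight: 100
  limits:
    nvidia.com/gpu: "8"         # cap this pool at 8 GPUs total
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-bottlerocket  # placeholder EC2NodeClass
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p4d.24xlarge"]   # pinned instance type; back with an ODCR
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: burst-gpu               # fallback pool: diversified for availability
spec:
  weight: 10
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-bottlerocket
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g4dn", "g5", "g6"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
```

Diversifying the fallback pool across instance families and capacity types is what lets Karpenter keep scaling during a demand spike when any single instance type is constrained.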

Evaluating Optimal Architecture 📊

To truly find the "sweet spot" for your GenAI architecture, thorough evaluation is essential.

Reproducible Load Testing & Experimentation

  1. Design Load Tests: Use tools like Locust to create realistic load tests with real data, simulating user behavior and workload patterns.
  2. Test Different Solutions: Run reproducible tests against various deployment options:
    • Amazon Bedrock: Managed service for foundation models.
    • Amazon SageMaker: Fully managed service for ML lifecycle.
    • Self-managed EKS: Granular control over infrastructure.
  3. Evaluate Hardware: Test against different GPU architectures (e.g., G-series, P-series) and AWS ML chips (Inferentia, Trainium).
  4. Tune Runtime Parameters: Experiment with parameters like KV cache optimization and batch size to maximize performance.
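A reproducible concurrency sweep can be sketched with nothing but the Python standard library. The inference call below is a stub you would replace with a real HTTP request to your endpoint; tools like Locust give you the same idea with realistic user simulation:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str) -> str:
    """Stub inference call -- replace with a real request to your endpoint."""
    time.sleep(0.01)  # simulated model latency
    return "response"

def sweep(concurrency_levels, requests_per_level=50):
    """Measure p50/p95 latency and throughput at each concurrency level."""
    results = {}
    for c in concurrency_levels:
        latencies = []
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=c) as pool:
            def timed(_):
                t0 = time.perf_counter()
                call_model("benchmark prompt")
                latencies.append(time.perf_counter() - t0)
            list(pool.map(timed, range(requests_per_level)))
        elapsed = time.perf_counter() - start
        results[c] = {
            "p50_s": statistics.median(latencies),
            "p95_s": statistics.quantiles(latencies, n=20)[18],
            "rps": requests_per_level / elapsed,
        }
    return results

for c, r in sweep([1, 4, 16]).items():
    print(f"concurrency={c:>2}  p50={r['p50_s']*1000:.1f} ms  rps={r['rps']:.1f}")
```

Running the same sweep against each candidate (Bedrock, SageMaker, self-managed EKS) with identical prompts and request counts is what makes the comparison reproducible.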

AWS GPU, ML Accelerators, and FPGA Landscape 🏎️

AWS offers a diverse range of hardware tailored for different stages of the ML lifecycle:

| Category | Instance Families | Best Use Case |
| --- | --- | --- |
| NVIDIA GPUs | G4dn, G5, G6, P4, P5 | General inference and training; high compatibility with existing ML frameworks. |
| AWS ML chips | Inferentia (Inf), Trainium (Trn) | Cost-optimized inference (Inf) and high-scale training (Trn) for deep learning. |
| Specialized accelerators | DL, VT, F1/F2 (FPGA) | Deep learning, video transcoding, and custom logic with reconfigurable hardware. |

The Trade-off: Latency vs. Throughput ⚖️

GPU sizing doesn't follow a simple formula; it's a critical trade-off between latency (how quickly a single request is processed) and throughput (how many requests are processed over time). At low concurrency, each request is fast but hardware is underutilized. As concurrency increases, hardware utilization improves, but individual request latency can suffer.

Your objective for performance testing is to find the right concurrency balance with the optimal number of compute instances and GPUs that deliver acceptable end-to-end latency and requests per second (RPS).

Case Study: Latency Requirements Directly Drive Cost

| Scenario | Instance Count | Concurrent Executions | E2E Latency | RPS | Cost-Efficiency |
| --- | --- | --- | --- | --- | --- |
| Underutilized | 1x instance (8 GPUs) | 1x | 2.5 s | 0.4 | Fast response, but very high cost per request. |
| Fully saturated | 1x instance (8 GPUs) | 128x | 10 s | 12.8 | Highly utilized hardware, but potentially misses latency SLOs. |
| Optimized | 2x instances (16 GPUs) | 64x (per node) | 5 s | 25.6 | Great value; balanced performance and cost-efficiency. |

As this case study illustrates, strict latency requirements (e.g., < 2.5s) can necessitate significantly more instances and GPUs, directly impacting costs. By carefully balancing concurrency and hardware, you can find the optimal architecture that aligns with both your performance and budget goals.
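The arithmetic behind the case study is straightforward: in a closed-loop model, RPS ≈ (instances × concurrent executions per instance) / end-to-end latency. A quick sketch reproducing the table's numbers:

```python
# Reproduce the case-study numbers: rps = instances * concurrency / latency.
# Values are taken directly from the scenario table above.
scenarios = {
    "Underutilized": dict(instances=1, concurrency=1, latency_s=2.5),
    "Fully saturated": dict(instances=1, concurrency=128, latency_s=10.0),
    "Optimized": dict(instances=2, concurrency=64, latency_s=5.0),
}
for name, s in scenarios.items():
    rps = s["instances"] * s["concurrency"] / s["latency_s"]
    print(f"{name:>15}: {rps:g} RPS")
```

Plugging your own measured latencies into this relation shows immediately how many instances a given RPS target implies at each concurrency level.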


By embracing this structured approach, you'll not only run your AI/ML workloads on AWS EKS with Karpenter, but you'll also optimize them for peak performance and cost-efficiency.
