Dr M O Faruque Sarker
Optimising GenAI/ML workloads in AWS EKS with Karpenter

After returning from AWS Summit London 2026, I did some research on running AI/ML workloads on AWS EKS with Karpenter. With some assistance from Gemini, I turned my notes from various talks into this guide, which walks through the intricacies of deploying and scaling Generative AI (GenAI) workloads on AWS EKS, leveraging the power of Karpenter.

Why GenAI/ML Infrastructure Sizing is Hard 📏

The initial challenge with GenAI/ML workloads often stems from translating business requirements into technical specifications. A request like "I have 10,000 users, and my LLM needs to respond fast" is a common starting point, but it provides insufficient detail for hardware selection. We need to convert this into a concrete workload model.

From Business Vision to Technical Metrics

| Metric Type | Requirement Example | Technical Translation |
| --- | --- | --- |
| Application pattern | Agentic application | Interactive / agentic workload model |
| Input load | 5,000 tokens IN | Average prompt length (context window) |
| Output load | 500 tokens OUT | Average generated response length |
| Throughput | 25 requests per second | Concurrent inference calls (RPS) |
| Workload nature | Not spiky, stable workload | Predictable demand, minimal burstiness |
| Experience | End-to-end latency < 5 seconds | Maximum acceptable time for a full response |
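As a quick sanity check, the workload model above translates into an aggregate token throughput. This is a deliberately rough back-of-the-envelope sketch using the example figures from the table; real prefill and decode costs differ per token:

```python
# Back-of-the-envelope workload model: convert the business inputs
# above into aggregate token throughput. All figures are the example
# values from the table, not measurements.
TOKENS_IN = 5_000      # average prompt length (context window)
TOKENS_OUT = 500       # average generated response length
RPS = 25               # sustained requests per second

# Tokens the system must ingest (prefill) and emit (decode) each second.
prefill_tps = TOKENS_IN * RPS    # tokens/s processed during prefill
decode_tps = TOKENS_OUT * RPS    # tokens/s generated during decode
total_tps = prefill_tps + decode_tps

print(f"Prefill: {prefill_tps:,} tok/s, Decode: {decode_tps:,} tok/s, "
      f"Total: {total_tps:,} tok/s")
```

Even at a modest 25 RPS, the 10:1 input-to-output ratio means prefill dominates the raw token budget, which is worth keeping in mind when benchmarking later.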

LLM inference systems are rarely "sized" by a formula; they are measured and optimized. We follow a structured lifecycle to reach the optimal architecture.


From Estimation to Deployment: A 4-Step Framework 🗺️

To consistently meet service level objectives with the lowest possible cost, we utilise a methodical, four-step process:

| Step | Phase | Key Activities |
| --- | --- | --- |
| 1 | Business inputs | Define the user base, the specific use case, and the desired user experience. |
| 2 | Workload model | Calculate tokens per second (TPS), requests per second (RPS), and define Service Level Objectives (SLOs). |
| 3 | Benchmark & experiment | Validate SLOs through concurrency optimization and reproducible testing across different solutions and parameters. |
| 4 | Optimal architecture | Finalize GPU selection and runtime configuration to meet performance and cost targets. |

Building the Foundation: Cluster and OS Selection 🏗️

For maximum flexibility and control over your data plane, we recommend Amazon EKS with self-managed node groups. AWS handles the control plane, while you manage the data plane: networking, security, monitoring, scaling, and storage.

When choosing a Node OS, the goal is to have the smallest, most secure footprint with pre-baked drivers for your AI/ML workloads.

EKS Node Operating System Options

| OS Option | Best For | Included Features |
| --- | --- | --- |
| Amazon Linux 2023 (AL2023) | General-purpose AI/ML | EFA drivers, NVIDIA kernel drivers, NVIDIA runtime, MIG support |
| Bottlerocket | Security & performance | Minimal OS, EFA drivers, NVIDIA kernel drivers, NVIDIA runtime, MIG support, NVIDIA device plugin, faster boot times for scaling |
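As an illustration, selecting Bottlerocket for GPU nodes can be done in a Karpenter EC2NodeClass. This is a minimal sketch assuming the Karpenter v1 API; the IAM role name and `karpenter.sh/discovery` tag values are placeholders for your environment:

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu-bottlerocket
spec:
  amiSelectorTerms:
    - alias: bottlerocket@latest     # minimal OS with pre-baked NVIDIA drivers
  role: my-karpenter-node-role        # placeholder: your node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster    # placeholder cluster tag
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```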

Modern Autoscaling with Karpenter ⚡

In Kubernetes, we distinguish between two types of scaling: application scaling and data plane scaling.

Kubernetes Scaling Mechanisms

| Scaling Type | Tooling | Description |
| --- | --- | --- |
| Application scaling | Vertical Pod Autoscaler (VPA), Horizontal Pod Autoscaler (HPA), Kubernetes Event-Driven Autoscaling (KEDA) | Adjusts the resources (CPU, memory) or the number of pods for an application based on demand or custom metrics. |
| Data plane scaling | Cluster Autoscaler (CAS), Karpenter | Manages the underlying compute nodes (EC2 instances) of your Kubernetes cluster. |

Karpenter is particularly effective for AI/ML workloads due to its ability to provision the exact GPU instance needed for a pending pod in seconds, making it more dynamic and efficient than the traditional Cluster Autoscaler.

When an LLM processes a request, it typically involves two phases:

  1. Prefill Phase: The model processes all input tokens before any output is generated. This phase dominates the "Time To First Token" (TTFT), a critical latency metric, so long prompts noticeably increase TTFT. Performance is measured in tokens processed per second.
  2. Decode Phase: The model generates output tokens sequentially, one at a time. Performance here is measured by the time per output token (inter-token latency), so long responses stretch overall end-to-end latency.
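The two phases compose into end-to-end latency. A simplified model, ignoring queuing and batching effects, is E2E ≈ TTFT + output tokens × time per output token; the timing figures below are illustrative assumptions, not benchmarks:

```python
# Simplified end-to-end latency model for one LLM request:
#   E2E ≈ TTFT (dominated by prefill) + tokens_out * TPOT (decode).
# The example timings are assumed values for illustration only.
def e2e_latency(ttft_s: float, tokens_out: int, tpot_s: float) -> float:
    """Approximate end-to-end latency in seconds."""
    return ttft_s + tokens_out * tpot_s

# Example: 0.5 s to first token, 500 output tokens at 8 ms per token.
latency = e2e_latency(ttft_s=0.5, tokens_out=500, tpot_s=0.008)
print(f"{latency:.1f} s")  # 4.5 s, inside the < 5 s SLO from the workload model
```

Note how decode dominates for long responses: 4 of the 4.5 seconds here are sequential token generation, which is why output length caps matter for latency SLOs.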

Capacity Management with Karpenter NodePools

Karpenter allows for sophisticated capacity management using NodePools, enabling precise control over your compute resources:

| NodePool Type | Use Case | Implementation Strategy |
| --- | --- | --- |
| Single NodePool | General-purpose workloads | Simple configuration for consistent compute performance. |
| Multiple NodePools | Workload isolation | Isolate compute for accelerated vs. non-accelerated workloads or different GPU types. |
| Weighted NodePools | Prioritization & cost | Define an order across pools so preferred GPUs are used first, or to prioritize cost-effective instances. |
| Static NodePools | Baseline / reserved load | Utilize On-Demand Capacity Reservations (ODCRs) or ML Capacity Blocks for guaranteed capacity. |
| Bursting NodePools | Spiky workloads | Allow instance-type diversification (e.g., across G4, G5, P4) to ensure rapid scaling and availability during demand spikes. |
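A sketch of the weighted-pool pattern follows (Karpenter v1 API). The pool names, instance choices, GPU limit, and the `gpu-bottlerocket` EC2NodeClass reference are all illustrative assumptions; Karpenter tries the higher-weight pool first and falls back to the diversified pool when it cannot satisfy the request:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: preferred-gpu           # tried first (higher weight)
spec:
  weight: 100
  limits:
    nvidia.com/gpu: "8"         # cap this pool at 8 GPUs total
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-bottlerocket  # placeholder EC2NodeClass
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["p4d.24xlarge"]   # pinned instance type; back with an ODCR
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: burst-gpu               # fallback pool: diversified for availability
spec:
  weight: 10
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-bottlerocket
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["g4dn", "g5", "g6"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
```

Diversifying the fallback pool across instance families and capacity types is what lets Karpenter keep scaling during a demand spike when any single instance type is constrained.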

Evaluating Optimal Architecture 📊

To truly find the "sweet spot" for your GenAI architecture, thorough evaluation is essential.

Reproducible Load Testing & Experimentation

  1. Design Load Tests: Use tools like Locust to create realistic load tests with real data, simulating user behavior and workload patterns.
  2. Test Different Solutions: Run reproducible tests against various deployment options:
    • Amazon Bedrock: Managed service for foundation models.
    • Amazon SageMaker: Fully managed service for ML lifecycle.
    • Self-managed EKS: Granular control over infrastructure.
  3. Evaluate Hardware: Test against different GPU architectures (e.g., G-series, P-series) and AWS ML chips (Inferentia, Trainium).
  4. Tune Runtime Parameters: Experiment with parameters like KV cache optimization and batch size to maximize performance.
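A reproducible concurrency sweep can be sketched with nothing but the Python standard library. The inference call below is a stub you would replace with a real HTTP request to your endpoint; tools like Locust give you the same idea with realistic user simulation:

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def call_model(prompt: str) -> str:
    """Stub inference call -- replace with a real request to your endpoint."""
    time.sleep(0.01)  # simulated model latency
    return "response"

def sweep(concurrency_levels, requests_per_level=50):
    """Measure p50/p95 latency and throughput at each concurrency level."""
    results = {}
    for c in concurrency_levels:
        latencies = []
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=c) as pool:
            def timed(_):
                t0 = time.perf_counter()
                call_model("benchmark prompt")
                latencies.append(time.perf_counter() - t0)
            list(pool.map(timed, range(requests_per_level)))
        elapsed = time.perf_counter() - start
        results[c] = {
            "p50_s": statistics.median(latencies),
            "p95_s": statistics.quantiles(latencies, n=20)[18],
            "rps": requests_per_level / elapsed,
        }
    return results

for c, r in sweep([1, 4, 16]).items():
    print(f"concurrency={c:>2}  p50={r['p50_s']*1000:.1f} ms  rps={r['rps']:.1f}")
```

Running the same sweep against each candidate (Bedrock, SageMaker, self-managed EKS) with identical prompts and request counts is what makes the comparison reproducible.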

AWS GPU, ML Accelerators, and FPGA Landscape 🏎️

AWS offers a diverse range of hardware tailored for different stages of the ML lifecycle:

| Category | Instance Families | Best Use Case |
| --- | --- | --- |
| NVIDIA GPUs | G4dn, G5, G6, P4, P5 | General inference and training; high compatibility with existing ML frameworks. |
| AWS ML chips | Inferentia (Inf), Trainium (Trn) | Cost-optimized inference (Inf) and high-scale training (Trn) for deep learning. |
| Specialized accelerators | DL, VT, F1/F2 (FPGA) | Deep learning, video transcoding, and custom logic with reconfigurable hardware. |

The Trade-off: Latency vs. Throughput ⚖️

GPU sizing doesn't follow a simple formula; it's a critical trade-off between latency (how quickly a single request is processed) and throughput (how many requests are processed over time). At low concurrency, each request is fast but hardware is underutilized. As concurrency increases, hardware utilization improves, but individual request latency can suffer.

Your objective for performance testing is to find the right concurrency balance with the optimal number of compute instances and GPUs that deliver acceptable end-to-end latency and requests per second (RPS).

Case Study: Latency Requirements Directly Drive Cost

| Scenario | Instance Count | Concurrent Executions | E2E Latency | RPS | Cost-Efficiency |
| --- | --- | --- | --- | --- | --- |
| Underutilized | 1x instance (8 GPUs) | 1x | 2.5 s | 0.4 | Fast response, but very high cost per request. |
| Fully saturated | 1x instance (8 GPUs) | 128x | 10 s | 12.8 | Highly utilized hardware, but potentially misses latency SLOs. |
| Optimized | 2x instances (16 GPUs) | 64x (per node) | 5 s | 25.6 | Great value; balanced performance and cost-efficiency. |

As this case study illustrates, strict latency requirements (e.g., < 2.5s) can necessitate significantly more instances and GPUs, directly impacting costs. By carefully balancing concurrency and hardware, you can find the optimal architecture that aligns with both your performance and budget goals.
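The arithmetic behind the case study is straightforward: in a closed-loop model, RPS ≈ (instances × concurrent executions per instance) / end-to-end latency. A quick sketch reproducing the table's numbers:

```python
# Reproduce the case-study numbers: rps = instances * concurrency / latency.
# Values are taken directly from the scenario table above.
scenarios = {
    "Underutilized": dict(instances=1, concurrency=1, latency_s=2.5),
    "Fully saturated": dict(instances=1, concurrency=128, latency_s=10.0),
    "Optimized": dict(instances=2, concurrency=64, latency_s=5.0),
}
for name, s in scenarios.items():
    rps = s["instances"] * s["concurrency"] / s["latency_s"]
    print(f"{name:>15}: {rps:g} RPS")
```

Plugging your own measured latencies into this relation shows immediately how many instances a given RPS target implies at each concurrency level.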


By embracing this structured approach, you'll not only run your AI/ML workloads on AWS EKS with Karpenter, but you'll also optimize them for peak performance and cost-efficiency.
