
Sagar Parmar for AWS Community Builders


NVIDIA GPU Operator Explained: Simplifying GPU Workloads on Kubernetes

Integrating NVIDIA GPUs with Kubernetes.

Introduction

While GPUs have long been a staple in industries like gaming, video editing, CAD, and 3D rendering, their role has evolved dramatically over the years. Originally designed to handle graphics-intensive tasks, GPUs have proven to be powerful tools for a wide range of computationally demanding applications. Today, their ability to perform massive parallel processing has made them indispensable in modern fields such as data science, artificial intelligence and machine learning (AI/ML), robotics, cryptocurrency mining, and scientific computing. This shift was catalysed by the introduction of CUDA (Compute Unified Device Architecture) by NVIDIA in 2007, which unlocked the potential of GPUs for general-purpose computing. As a result, GPUs are no longer just graphics accelerators; they're now at the heart of cutting-edge innovation across industries.

In this blog post, we will discuss the NVIDIA GPU Operator on Kubernetes and how to deploy it on a Kubernetes cluster.

Why run GPU workloads on Kubernetes?

Running GPU workloads on Kubernetes offers significant advantages: it enables developers to seamlessly schedule and run GPU-powered applications, and it simplifies the deployment and scaling of these workloads. With Kubernetes, workloads can be easily scaled up or down based on demand, while features like Role-Based Access Control (RBAC) provide isolation and multi-tenancy for secure, shared environments. Additionally, Kubernetes supports the creation of multi-cloud GPU clusters, allowing organizations to leverage GPU resources across different cloud providers with consistent orchestration and control.

In this article, we’ll explore the GPU-Kubernetes integration stack in depth with the help of NVIDIA GPU Operator. From the host operating system to the Kubernetes control plane, we’ll peel back each layer to understand the components required to make GPUs work seamlessly within a Kubernetes environment. More importantly, we’ll uncover why each component matters and how they interact with one another.

How are GPUs integrated into Kubernetes without using the GPU Operator?

Three Foundational Layers

Kubernetes excels at managing standard compute workloads, but orchestrating high-performance hardware like GPUs introduces unique challenges. Before diving into the GPU Operator, it's important to understand the three foundational layers required to run GPU workloads in Kubernetes. Think of it as a recipe: each step must be correctly configured for the GPU to function seamlessly within the cluster.

Step 1: The Host Operating System

The NVIDIA Device Driver and the CUDA Toolkit

Everything begins at the host level. The NVIDIA device driver is the critical software that communicates directly with the GPU hardware. A key requirement here is version compatibility between the driver and the CUDA toolkit embedded in your container image. This compatibility matrix must be accurate; any mismatch can break GPU functionality.
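
As a quick sanity check on the host (a minimal sketch, assuming the NVIDIA driver is already installed), nvidia-smi reports the installed driver version and, in its header, the highest CUDA version that driver supports, which you can compare against the CUDA toolkit version baked into your container images:

# Print just the installed driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# The header of the full output also shows the maximum supported CUDA version
nvidia-smi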

Step 2: The Container Runtime (e.g., Docker, containerd, CRI-O, runc)

Bridging the gap between the container runtime and the GPU.

Next, we need a bridge between the container runtime (e.g., Docker, containerd, CRI-O) and the host GPU. This is where the NVIDIA Container Toolkit comes in.

Core Functions of the Toolkit:

  • GPU Access Enablement: Provides essential libraries like libnvidia-container and nvidia-container-cli to configure runtimes for GPU access.

  • Runtime Configuration: Injects GPU device files, drivers, and environment variables into containers via runtime hooks (e.g., updates to /etc/containerd/config.toml).

  • Device Plugin Dependency: The NVIDIA Device Plugin relies on the toolkit to expose GPU resources to Kubernetes.

  • Abstraction Layer: Allows containers to use GPUs without bundling drivers or CUDA libraries inside the image, keeping containers lightweight and portable.

Without this toolkit, containers remain unaware of the GPU hardware on the node.
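
To make this concrete, here is a minimal sketch of what the toolkit's nvidia-ctk helper does on a node (assuming the NVIDIA Container Toolkit package is already installed on the host):

# Register the NVIDIA runtime in /etc/containerd/config.toml and restart containerd
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd

# For Docker-based setups the equivalent is:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker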

Step 3: The Kubernetes Orchestration Layer

The Device Plugin and its role.

Finally, Kubernetes needs to recognize and schedule GPU resources. This is achieved through the NVIDIA Device Plugin, which runs as a DaemonSet on GPU-enabled nodes.

Core Functions of the Device Plugin:

  • GPU Discovery & Advertising: Detects available GPUs and registers them with the Kubelet as extended resources (e.g., nvidia.com/gpu).

  • Resource Allocation: When a pod requests a GPU, the plugin ensures the container receives the correct device files, drivers, and environment variables.

  • Health Monitoring: Continuously checks GPU health and updates Kubernetes to prevent scheduling on faulty devices.

  • GPU Sharing & Partitioning: Maximizes utilization via advanced features:

    • Time-Slicing: Allows multiple containers to share a single GPU’s compute power.

    • Multi-Instance GPU (MIG): Partitions high-end GPUs (like the A100) into multiple, fully isolated hardware instances.

    • Virtual GPU (vGPU): Enables the sharing of a single GPU among multiple virtual machines.
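
Without the GPU Operator, this plugin is typically installed on its own, for example via its Helm chart (a minimal sketch based on the upstream k8s-device-plugin project; check its README for the current repository URL and chart version):

# Deploy only the NVIDIA device plugin; the driver and container toolkit
# must already be installed and configured on every GPU node
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace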

Why Scaling GPU Workloads in Kubernetes Is Hard, and How Operators Help

The Challenge of Scale: Moving from One Machine to a Fleet.

The three-layer setup we discussed works well on a single machine. But things get complicated when you scale to a production-grade Kubernetes cluster with hundreds or thousands of nodes. That’s when the manual approach starts to fall apart and the real operational pain begins.

Manually managing an entire fleet introduces a massive operational challenge that can bring projects to a grinding halt. You’re navigating a minefield of issues:

  • Driver Compatibility: Different GPU models require different, specific driver versions.

  • Configuration Drift: Nodes inevitably fall out of sync over time.

  • Risky Upgrades: The upgrade process becomes a high-risk nightmare.

  • Doubled Workload: You often end up managing two completely separate software stacks, one for CPU nodes and another for GPU nodes, effectively doubling your workload.

To solve these scaling challenges, the Kubernetes community embraced a powerful cloud-native pattern: the Operator. Think of it as an automated expert, a robotic administrator that continuously monitors your cluster and handles all the tedious, error-prone tasks for you. It brings consistency, reliability, and automation to GPU management at scale.

The GPU Operator works in a control loop, constantly observing the state of your nodes and ensuring they match the desired configuration you’ve defined. This means no more manual setup, no more configuration drift, and no more juggling separate software stacks for CPU and GPU nodes. Instead, you get consistency, reliability, and automation at scale.

This shift from manual management to automated orchestration is what makes the Operator pattern so transformative. It turns GPU infrastructure from a fragile, high-maintenance setup into a resilient, self-healing system.

How the NVIDIA GPU Operator Works

The Operator establishes a consistent, automated workflow for every node in your cluster. It eliminates manual intervention through a streamlined, three-step process:

  • Discovery: It first identifies which nodes physically possess GPUs.

  • Installation and Configuration: In the required order, it automatically installs the necessary containerised drivers, configures the Container Toolkit, and deploys the device plugin along with monitoring tools.

  • Validation: This final step is critical: the Operator validates that every component is working perfectly before allowing Kubernetes to schedule any AI workloads on that node.

This process guarantees reliability and prevents misconfigured nodes from disrupting GPU-intensive applications.
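
The discovery step leans on Node Feature Discovery, which labels every node that exposes an NVIDIA PCI device (vendor ID 0x10de). A quick way to see which nodes were detected (a sketch; the label name follows current GPU Operator documentation):

# Nodes on which Node Feature Discovery found an NVIDIA PCI device (vendor ID 10de)
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true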

Installing the NVIDIA GPU Operator

Installing the NVIDIA GPU Operator in Kubernetes is straightforward with Helm. The Operator automates the deployment and configuration of all essential GPU components including drivers, the container toolkit, and device plugins across your cluster. To ensure a smooth setup, follow a step-by-step approach.

Prerequisites:

Before proceeding, please make sure that you have met the following prerequisites:

  1. Operating System Requirements for the GPU Operator:

    • To use the NVIDIA GPU Driver container for your workloads, all GPU-enabled worker nodes must share the same operating system version.

    • If you need to mix different operating systems across GPU nodes, you must pre-install the NVIDIA GPU Driver manually on each respective node instead of using the containerized driver.

    • CPU-only nodes have no OS restrictions, as the GPU Operator does not manage or configure them.

  2. Helm is installed.

  3. You have permission to execute kubectl commands against the target cluster.

Installation steps:

  1. Add NVIDIA Helm Repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
 && helm repo update

2. Install GPU Operator

helm install --wait --generate-name \
 -n gpu-operator --create-namespace \
 nvidia/gpu-operator \
 --version=v25.10.0

If the NVIDIA driver or toolkit is already installed on your nodes, you can disable either or both during GPU Operator deployment by using the following flags:

--set driver.enabled=false
--set toolkit.enabled=false
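
For example, on a cluster where the driver is already pre-installed on the hosts, the install command from step 2 becomes:

helm install --wait --generate-name \
 -n gpu-operator --create-namespace \
 nvidia/gpu-operator \
 --version=v25.10.0 \
 --set driver.enabled=false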

3. Verify the installation by checking the status of the deployed resources.

kubectl get pods -n gpu-operator

You should see the GPU operator components running in the namespace.
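
Two further checks can help confirm the rollout (a sketch; the exact columns of the ClusterPolicy output vary slightly between Operator versions):

# The Operator deploys its per-node components as DaemonSets in this namespace
kubectl get daemonsets -n gpu-operator

# The ClusterPolicy custom resource reports the overall state of the Operator;
# it should become ready once every component has been validated
kubectl get clusterpolicies.nvidia.com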

4. We can also inspect the nodes to confirm that the nodes with GPUs are configured correctly.

kubectl describe nodes
Name:               sagar.rajput27@live.com
Roles:              worker
Labels:             node-role.kubernetes.io/worker=true
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=pre-installed
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.mig-manager=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.nvsm=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.mode=compute
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-H100-PCIe
                    nvidia.com/gpu.replicas=1
...
Annotations:        nvidia.com/gpu-driver-upgrade-enabled: true
                    projectcalico.org/IPv4Address: 10.*.*.*/*
                    projectcalico.org/IPv4VXLANTunnelAddr: 10.*.*.*
...
Capacity:
  cpu:                64
  ephemeral-storage:  32758Mi
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             527533864Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                64
  ephemeral-storage:  32631789953
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             527533864Ki
  nvidia.com/gpu:     1
  pods:               110

We can see that the node with GPU hardware attached has GPU-related labels and annotations added to it. Additionally, the GPU resources are visible under the Capacity and Allocatable sections.
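
Because the Operator's discovery components apply these labels, they also give you a quick way to list every GPU node in the cluster:

# List only the nodes labelled as having NVIDIA GPU hardware
kubectl get nodes -l nvidia.com/gpu.present=true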

Verification by running a sample GPU application

We can test the setup by deploying the CUDA vectoradd application provided by NVIDIA on our cluster. This image is an NVIDIA CUDA sample that demonstrates vector addition, a basic GPU computation.

Under the resources → limits section of this manifest, you’ll notice nvidia.com/gpu: 1. This instructs Kubernetes to schedule the Pod on a node equipped with an NVIDIA GPU.

cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
pod/cuda-vectoradd created

Now we can check the logs

kubectl logs pod/cuda-vectoradd

Logs Output

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
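
Once the test passes, the sample pod can be removed:

kubectl delete pod cuda-vectoradd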

Our cluster is now ready to run GPU workloads.

GPU Sharing to Maximize GPU Utilization

GPUs are expensive, high-performance hardware, and leaving them idle is a waste of valuable resources. Once your GPUs are up and running in Kubernetes, the real challenge becomes efficient sharing. The goal is to extract maximum value from every single card.

The GPU Operator makes this easy by allowing you to configure advanced sharing strategies declaratively. For example:

  • MIG (Multi-Instance GPU): Physically partitions a single GPU into multiple, fully isolated instances, each with dedicated memory and compute.

  • MPS (Multi-Process Service): Enables concurrent execution of multiple GPU processes.

  • Time-Slicing: Ideal for development workloads that only need occasional GPU access.
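
As a concrete illustration of the declarative approach, here is a minimal time-slicing sketch following the pattern in NVIDIA's time-slicing documentation; the ConfigMap name, the "any" config key, and the replica count are illustrative and should be adapted to your cluster:

# Device plugin sharing config: advertise each physical GPU as
# 4 schedulable nvidia.com/gpu replicas
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

The Operator is then pointed at this ConfigMap, for example through the devicePlugin.config.name and devicePlugin.config.default Helm values (or by updating the ClusterPolicy); consult the Operator documentation for your release for the exact mechanism.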

GPU Sharing Strategy

The optimal GPU sharing strategy depends entirely on your specific workload requirements and operational goals. The choice involves balancing factors like performance isolation, dynamic flexibility, and raw utilization. A workload that demands predictable performance in a multi-tenant cluster has very different needs than an interactive development workload.

Optional GPU Operator Components: Streamlining Data Movement

The GPU Operator includes additional components that are not enabled by default, such as GPUDirect RDMA and GPUDirect Storage. These tools are designed to streamline data movement between GPUs and other system components, effectively bypassing traditional bottlenecks like the CPU and system memory.

GPUDirect RDMA (Remote Direct Memory Access)

GPUDirect RDMA enables direct memory access between GPUs and PCIe devices (such as NICs or storage adapters), without involving the CPU or system RAM. This is ideal for High-Performance Computing (HPC) and AI training, where latency is critical.

Direct communication between NVIDIA GPUs

Benefits:

  • Lower latency: Data moves directly between the GPU and the device.

  • Reduced CPU load: Frees up CPU cycles for compute tasks.

  • Higher bandwidth: Enables faster data transfer for distributed workloads.

Use Cases:

  • GPU-to-GPU communication across nodes

  • Real-time inference at the edge

  • High-speed networking in HPC clusters

GPUDirect Storage

GPUDirect Storage allows GPUs to read data directly from NVMe or other storage devices, again bypassing the CPU and system memory. This is essential for AI/ML workloads that need to access and process large datasets quickly.

GPUDirect Storage: A Direct Path Between Storage and GPU Memory

Benefits:

  • Faster data ingestion: Minimizes I/O bottlenecks during training or inference.

  • Efficient data pipeline: Direct flow from storage to GPU memory.

  • Simplified architecture: Eliminates unnecessary memory copies and CPU involvement.

Use Cases:

  • Large-scale deep learning training

  • Data analytics pipelines

  • Scientific simulations with massive datasets

Both technologies are part of NVIDIA’s strategy to optimize data movement for GPU workloads. By enabling direct communication paths between GPUs and external devices, they unlock higher performance, lower latency, and better resource utilization in Kubernetes environments where scalability and efficiency are critical.
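
Both components are disabled by default and can be opted into at install time (a sketch; the flag names reflect recent GPU Operator releases, and both features have additional hardware, driver, and networking prerequisites documented by NVIDIA):

# Enable GPUDirect RDMA and GPUDirect Storage support in the Operator
helm install --wait --generate-name \
 -n gpu-operator --create-namespace \
 nvidia/gpu-operator \
 --set driver.rdma.enabled=true \
 --set gds.enabled=true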

Summary

Integrating NVIDIA GPUs into Kubernetes typically involves a complex, three-layer manual setup: host drivers, the container toolkit, and the Kubernetes device plugin. This approach works for single machines but creates massive operational challenges like configuration drift and incompatible drivers at scale.

The NVIDIA GPU Operator is the solution. It uses the Operator pattern to automate the entire lifecycle, acting as a “robotic administrator” that discovers GPUs, installs the necessary software stack in the correct order, validates the setup, and streamlines maintenance.

The core benefit? Simplifying your infrastructure so you can focus on AI workloads, not operational headaches.

I hope you found this post informative and engaging. I would love to hear your thoughts on this post, so do start a conversation on Twitter or LinkedIn.

Bye-Bye

References

About the NVIDIA GPU Operator (docs.nvidia.com)

GPUDirect: Enhancing Data Movement and Access for GPUs (developer.nvidia.com)
