As larger GPU-based workloads (like building Models) become more of a need for organizations, so does the need for GPUs. The problem is that they're incredibly expensive. For example, if you want to run a few Nvidia H100s in a Kubernetes cluster in the cloud, you're looking at hundreds of thousands of dollars in cost.
With GPU sharing, that cost, along with the overall management of the infrastructure, decreases significantly.
In this blog post, you'll learn how to implement sharing of Nvidia GPUs in Kubernetes.
The Benefit
As it stands right now, GPUs are incredibly expensive. If you look at a lot of the large “AI Factories” out of places like OpenAI, they’re spending billions of dollars just to have enough hardware to run Models.
There are two big things to think about here:
- That’s a lot of hardware/infrastructure to manage.
- It’s very expensive.
Having the ability to share one GPU across multiple Pods, much like memory, CPU, and Worker Nodes are shared in a Kubernetes cluster, not only allows organizations to take advantage of GPUs, but also keeps costs down and makes GPU-based workloads more readily available.
Nvidia Operator Deployment
The Nvidia GPU Operator helps ease the pain of getting hardware and software to communicate, much like any other Driver. Even on a desktop or a laptop, a Driver is the software that makes the hardware available to, and able to communicate with, the operating system.
As all Kubernetes Operators do, the Nvidia GPU Operator extends Kubernetes so the cluster is able to use the GPU itself.
💡
Operators allow you to extend the Kubernetes API to work with an API of your choosing. They also contain a Controller, which ensures that the current state of the deployment matches the desired state. For example, if you have a Controller that's watching Pods, it'll ensure that however many Replicas are supposed to be deployed are, in fact, deployed.
First, install the Nvidia Driver. This example uses Google's driver installer DaemonSet for GKE nodes running Container-Optimized OS.
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
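The installer runs as a DaemonSet in the kube-system Namespace. Before moving on, you can confirm its Pods rolled out; the name filter below is an assumption based on the DaemonSet defined in the manifest above.
kubectl get pods -n kube-system | grep nvidia-driver-installer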
Next, create the gpu-operator Namespace, as that's where the Helm Chart for the Nvidia GPU Operator will be installed.
kubectl create ns gpu-operator
Because the GPU Operator deploys system-critical Pods and can take a significant amount of resources within a Kubernetes cluster, set up a Resource Quota to ensure that no more than 100 of those system-critical Pods can run in the gpu-operator Namespace at once.
kubectl apply -f - << EOF
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-operator-quota
  namespace: gpu-operator
spec:
  hard:
    pods: 100
  scopeSelector:
    matchExpressions:
    - operator: In
      scopeName: PriorityClass
      values:
      - system-node-critical
      - system-cluster-critical
EOF
Install the Nvidia GPU Operator via Helm.
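If Nvidia's Helm repository isn't already configured on your machine, add it first. This assumes the standard NGC chart repository location.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update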
helm install --wait gpu-operator \
-n gpu-operator \
nvidia/gpu-operator \
--set hostPaths.driverInstallDir=/home/kubernetes/bin/nvidia \
--set toolkit.installDir=/home/kubernetes/bin/nvidia \
--set cdi.enabled=true \
--set cdi.default=true \
--set driver.enabled=false
Wait 2-4 minutes and then check to ensure that the Pods within the GPU Operator are up and operational.
kubectl get pods -n gpu-operator
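If you'd rather not guess at the timing, one option (an addition to the original steps, assuming a recent GPU Operator release that reports readiness on the ClusterPolicy's .status.state) is to block until the ClusterPolicy reports ready:
kubectl wait --for=jsonpath='{.status.state}'=ready clusterpolicies.nvidia.com/cluster-policy --timeout=300s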
Once the Pods are up, confirm the GPU works by deploying a test Nvidia Pod.
kubectl apply -f - << EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: vectoradd
    image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
Once the Pod completes, check its logs; you should see output indicating the vector addition test passed.
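A minimal check, using the Pod name from the manifest above:
kubectl logs cuda-vectoradd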
Next, you'll learn how to slice a GPU, which means one GPU can be used by multiple Pods.
GPU Slicing
When you hear “slicing”, it’s a method of taking one GPU and allowing it to be used across more than one Pod.
💡
There's also a method called MPS (Multi-Process Service), but it seems like time slicing is used the most right now.
The first thing you'll do is set up a Config Map for the slicing.
One thing to point out is the replica count. Notice how the replica count currently says 4? That means four (4) Pods can use the GPU. If you bumped it up to ten, ten (10) Pods could share the GPU. This, of course, depends on the type of GPU and whether the resources are available, just like any other piece of hardware.
kubectl apply -f - << EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: plugin-config
  namespace: gpu-operator
data:
  time-slicing: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4
  mps: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      mps:
        renameByDefault: false
        resources:
        - name: nvidia.com/gpu
          replicas: 4
EOF
Next, patch the Cluster Policy so the device plugin picks up Nvidia's time-slicing configuration.
cat > patch.yaml << EOF
spec:
  devicePlugin:
    config:
      name: plugin-config
      default: time-slicing
EOF
kubectl patch clusterpolicies.nvidia.com/cluster-policy --type=merge --patch-file=patch.yaml
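After the patch, the Operator restarts the device plugin Pods so they pick up the new configuration. You can keep an eye on that with a simple filter (assuming the Pod names contain "device-plugin", as they do in a default GPU Operator install):
kubectl get pods -n gpu-operator | grep device-plugin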
Run the following command to confirm that the GPU is being shared; with the replica count set to 4, the node's allocatable nvidia.com/gpu count should now show 4 instead of 1.
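The command assumes $GPU_NODE_NAME holds the name of a GPU node. One way to grab it, assuming the Operator's feature discovery has labeled GPU nodes with nvidia.com/gpu.present=true (adjust the selector if your nodes are labeled differently):
GPU_NODE_NAME=$(kubectl get nodes -l nvidia.com/gpu.present=true -o jsonpath='{.items[0].metadata.name}')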
kubectl describe node $GPU_NODE_NAME | grep "Allocatable:" -A7
You can now use a Job to test out that sharing/slicing the GPU worked.
💡
You can also use a Pod or whatever else you’d like. This is just an example.
kubectl apply -f - << EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: dcgm-prof-tester
spec:
  parallelism: 4
  template:
    metadata:
      labels:
        app: dcgm-prof-tester
    spec:
      restartPolicy: OnFailure
      containers:
      - name: dcgmproftester12
        image: nvcr.io/nvidia/cloud-native/dcgm:3.3.8-1-ubuntu22.04
        command: ["/usr/bin/dcgmproftester12"]
        args: ["--no-dcgm-validation", "-t 1004", "-d 30"]
        resources:
          limits:
            nvidia.com/gpu: 1
        securityContext:
          capabilities:
            add: ["SYS_ADMIN"]
EOF
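To confirm the slicing worked, check that all four Pods from the Job run at the same time; with -o wide, you should see them all scheduled on the same GPU node. The label selector comes from the Job template above.
kubectl get pods -l app=dcgm-prof-tester -o wide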
Congrats! You've not only deployed workloads on Kubernetes that use GPUs, but also shared a single GPU across multiple Pods.