Bogdan Doncea
Splitting One GPU Across Multiple Kubernetes Pods — Without MIG, Without Enterprise Licenses

A years-old GPU frustration, a conference discovery, and a 2AM PoC that actually worked

The Problem I've Been Carrying for Years

If you work with AI or video at scale and you're not at one of the big hyperscalers, you've probably hit this wall before: you have GPUs, and you're wasting them.

Not because your workloads don't need GPU — they do. But because individually, each workload is small. AI inference services rarely saturate a whole card. Processing jobs spin up, eat some compute, and die. Embedding models, classifiers, lightweight LLMs — they each need a slice of a GPU, not the whole thing. None of them come close to maxing out the hardware on their own. And yet, in a typical Kubernetes setup, each one claims an entire GPU card and sits there hoarding it while the rest goes to waste.

I've been building a platform that runs multiple AI and video processing workloads in parallel — inference services, enrichment pipelines, on-demand processing jobs. The kind of system where a lot of different things need GPU access at the same time, but no single one of them needs a whole card. The stack is K8s, Kafka, Redis, some databases, and a handful of Python and Java services. And GPUs — always the GPUs.

The GPU problem specifically: we have T4s and L40S nodes, and we could never properly share them between pods without playing with fire.

We tried two things:

  • NVIDIA Time-Slicing — Easy to set up via the GPU Operator and it looks good on paper. In practice, for streaming and transcoding workloads it was a non-starter. Time-slicing serialises GPU access, which introduces jitter and latency spikes — exactly what you cannot have when you're processing live video or audio. Frames drop, buffers stall, quality degrades. We turned it off fast.

  • Plain Docker with --gpus device=0 — which I'll get into below. We actually used this for a long time, and it worked — sort of.

So we did what any team does when the tooling isn't there: we built it ourselves.


Our Homegrown Solution — Reinventing the Wheel (But It Spun)

A while back, my team built an internal orchestration layer around a simple reality: Kubernetes GPU support was too coarse for what we needed, so we worked around it.

The split was straightforward in concept: CPU-based tasks ran as K8s pods, GPU-based tasks ran as Docker containers. Everything that didn't need a GPU lived happily in the cluster as proper Kubernetes workloads — ETL pipelines, API services, data processing, the full stack. But the moment a task needed GPU, it stepped outside K8s entirely.

For GPU tasks, we had a purpose-built orchestrator service. This orchestrator had to run on the same node as the GPU — because it talked to the local Docker daemon directly to spin up containers there. We enforced this with node affinity rules, pinning the orchestrator to the GPU node so it could reach the Docker API and launch containers on that specific machine. When a GPU task came in, the orchestrator started a Docker container with --gpus device=N, the task ran, the container was torn down. All GPU-based AI work happened this way — plain Docker containers on the GPU node, completely outside Kubernetes.

# GPU tasks — Docker containers on the GPU node, launched via the local Docker API
import docker

docker_client = docker.from_env()  # only reachable from the GPU node itself
container = docker_client.containers.run(
    image="our-model-server:latest",
    detach=True,
    device_requests=[
        docker.types.DeviceRequest(device_ids=["0"], capabilities=[["gpu"]])
    ],
    environment={
        "CUDA_VISIBLE_DEVICES": "0",
        "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:512"  # hint only, not enforced
    }
)
# CPU tasks — normal K8s pods, no special handling needed
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "etl-task"},          # illustrative
    "spec": {
        "containers": [{
            "name": "worker",
            "image": "our-etl-image:latest",   # illustrative
            "resources": {
                "limits": {"cpu": "2", "memory": "4Gi"}
            }
        }]
    }
}
k8s_client.create_namespaced_pod(namespace="workloads", body=pod_manifest)

It worked. We ran it in production. The team was proud of it — and honestly, it was solid engineering given the constraints. But the problems were always there:

  • No VRAM isolation — Docker containers on the same GPU shared memory completely. One greedy process could OOM the rest, and when it happened everything fell over at once.
  • GPU workloads living outside K8s — a whole class of tasks with no K8s lifecycle management, no health checks, no rolling restarts. A permanent special case that needed permanent special handling.
  • Node affinity as a constraint, not a choice — the orchestrator had to be pinned to the GPU node to reach the Docker daemon. Scaling to multiple GPU nodes meant more orchestrators, more complexity, more things to coordinate.
  • No per-container GPU metrics — visibility into who was using what meant scraping nvidia-smi and correlating PIDs manually. Fragile and tedious.
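That last point deserves a picture. Our per-process attribution boiled down to scraping nvidia-smi's CSV output and joining it against container PIDs by hand. A simplified sketch of the idea (the function names are illustrative, not our actual tooling):

```python
import subprocess

def parse_compute_apps(csv_text: str) -> dict[int, int]:
    """Map PID -> VRAM usage (MiB) from nvidia-smi's CSV compute-apps output."""
    usage = {}
    for line in csv_text.strip().splitlines():
        pid_s, mem_s = (field.strip() for field in line.split(","))
        usage[int(pid_s)] = int(mem_s)
    return usage

def gpu_usage_by_pid() -> dict[int, int]:
    """Query the real nvidia-smi on the host (only works on a GPU node)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_compute_apps(out)
```

From there we still had to map PIDs back to containers ourselves, and any change in nvidia-smi's output quietly broke the whole chain.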

We knew it was technical debt. We just couldn't find anything better. Until Amsterdam.


KubeCon Europe 2026 — Amsterdam

I went to KubeCon this year primarily to answer one question: is there something in the cloud-native ecosystem that handles sub-GPU partitioning on lower-end hardware without requiring H100s?

The talks were good. The hallway track was better. I had conversations with people from platform teams at AI startups, SaaS companies, and a few cloud providers. The picture that emerged was clear — and a little frustrating. The majority weren't even wrestling with this problem. They were on cloud, spinning up GPU instances on demand, scaling out horizontally whenever they needed more compute. GPU sharing? Why bother when you can just add another node?

But for teams running on-prem or on fixed GPU budgets — and there were more of us in that room than the cloud-native crowd might assume — the story was different. We either wasted GPU resources with whole-GPU-per-pod allocations, paid the H100 tax to get MIG, or built our own solutions. Same wall, different paint.

I attended a session on GPU resource management and heard mentions of several tools — GPU Operator, DRA (Dynamic Resource Allocation, which is still maturing in K8s 1.31/1.32), KAI Scheduler, and then something I hadn't heard of before: HAMi.

There was a session that stopped me mid-scroll: "Dynamic, Smart, Stable GPU-Sharing Middleware In Kubernetes". Five minutes in I had stopped taking notes on anything else. The talk walked through exactly the problem I'd been living with — sub-GPU partitioning on hardware that doesn't support MIG — and presented HAMi as the answer. Software-level vGPU, hard VRAM isolation, any CUDA GPU, K8s native.

What made it land even harder was that HAMi had also been mentioned earlier in the keynotes. Not as a footnote — as a legitimate part of the GPU sharing story on Kubernetes.


2AM in the Hotel Room — The PoC That Shouldn't Have Happened

The city kept us out until midnight. Amsterdam will do that. I said goodbye to everyone, walked back to the hotel, and should have gone straight to sleep — full day of sessions, a lot of walking, and an early morning talk the next day.

Instead I opened the laptop.

I'd been turning HAMi over in my head since that session. Not casually — obsessively. I had my MicroK8s home lab accessible remotely. I had a GPU sitting idle. I had all the context from the past year of fighting this exact problem loaded in my head. I genuinely could not wait until I got home to try it. The idea of going to sleep without at least attempting the install felt physically uncomfortable in the way only a very specific kind of engineering nerd will understand.

So there I was, at 2AM Amsterdam time, laptop on the hotel desk, SSH tunnel back home, microk8s helm3 repo add running. Extremely classic.

Three hours later I had a working HAMi installation, two pods running on the same physical GPU with separate VRAM slices, and nvidia-smi showing exactly what I'd spent two years trying to achieve. I didn't go to sleep until I saw that output. Totally worth it.

Let me tell you what HAMi actually is, because the name is a bit opaque.

HAMi (Heterogeneous AI Computing Virtualization Middleware) — formerly known as k8s-vGPU-scheduler — is a CNCF Sandbox project that provides software-level GPU virtualization for Kubernetes. It works on any CUDA GPU, including your T4s, RTX cards, L40S, and others that don't support MIG.

The core mechanism is elegant: HAMi injects a shared library (libvgpu.so) into each container via LD_PRELOAD. This library intercepts every cudaMalloc call at the CUDA API level. If your pod's cumulative VRAM allocation would exceed its configured limit, HAMi returns CUDA_ERROR_OUT_OF_MEMORY — a hard wall. The pod dies. The other pods sharing that GPU are completely unaffected.

Physical GPU (RTX 3080 — 10GB)
       ↓
  NVIDIA Driver
       ↓
  libvgpu.so  ←── HAMi injects this via LD_PRELOAD
  (intercepts cudaMalloc, enforces per-pod limits)
       ↓
  ┌─────────────┐    ┌─────────────┐
  │   Pod A     │    │   Pod B     │
  │  2GB limit  │    │  3GB limit  │
  │  25% cores  │    │  40% cores  │
  └─────────────┘    └─────────────┘

This is fundamentally different from every other approach:

  • Unlike MIG, it doesn't require specific hardware
  • Unlike time-slicing, it enforces VRAM isolation (not just temporal sharing)
  • Unlike MPS, a failing pod doesn't crash the shared context
  • Unlike plain Docker, it's K8s-native and actually enforces limits
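To make the VRAM wall concrete, here's a toy model of the accounting the shim performs. This is my own simplification for illustration, not HAMi source code: every allocation is checked against the pod's cumulative total before it's allowed through.

```python
class VGPUError(Exception):
    """Stands in for CUDA_ERROR_OUT_OF_MEMORY."""

class VGPUAccountant:
    """Toy per-pod VRAM accounting, as an interception shim might do it."""

    def __init__(self, limit_bytes: int):
        self.limit = limit_bytes
        self.allocated = 0

    def malloc(self, size: int) -> None:
        if self.allocated + size > self.limit:
            # Hard wall: refuse the allocation, touch nothing else
            raise VGPUError(f"allocation would exceed {self.limit} bytes")
        self.allocated += size  # here the real shim passes through to cudaMalloc

    def free(self, size: int) -> None:
        self.allocated -= size

pod_a = VGPUAccountant(limit_bytes=2 * 1024**3)  # a 2 GiB slice
pod_a.malloc(1500 * 1024**2)                     # fine: 1.5 GiB fits
```

The key property is that the check is local to the pod: a refused allocation in one pod changes nothing for its neighbours.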

What I Actually Built at 2AM — The PoC

I tested this on my home lab machine (NVIDIA GeForce RTX 3080, 10GB VRAM) running MicroK8s. Here's the full stack I set up, with the actual files I used.

Step 1 — Enable GPU Support and Install HAMi

# Enable microk8s GPU addon
microk8s enable gpu

# Install cert-manager (HAMi's webhook needs it)
microk8s kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml
microk8s kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/instance=cert-manager \
  -n cert-manager --timeout=180s

# Add HAMi helm repo
microk8s helm3 repo add hami-charts https://project-hami.github.io/HAMi/
microk8s helm3 repo update

# Get K8s version (--short is deprecated in newer kubectl)
K8S_VERSION=$(microk8s kubectl version -o json | python3 -c "
import sys, json, re
v = json.load(sys.stdin)['serverVersion']['gitVersion'].lstrip('v')
print(re.split(r'[+\-]', v)[0])
")

# Install HAMi
microk8s helm3 install hami hami-charts/hami \
  --namespace kube-system \
  --set scheduler.kubeScheduler.imageTag=v${K8S_VERSION} \
  --set devicePlugin.nvidiaDriverPath=/usr/local/nvidia \
  --set scheduler.defaultSchedulerPolicy.gpuMemory=true \
  --set scheduler.defaultSchedulerPolicy.gpuCores=true

# CRITICAL: Label the GPU node — without this, the device-plugin DaemonSet stays at DESIRED: 0
microk8s kubectl label node <your-node-name> gpu=on

Gotcha #1: The gpu=on label is mandatory. The HAMi device-plugin DaemonSet has a nodeSelector that requires it. I lost 20 minutes on this before I understood why DESIRED: 0 wasn't moving.

Step 2 — Two Workers Sharing One GPU

Here's where it gets interesting. I deployed two PyTorch workloads simultaneously on the same physical GPU, each with hard VRAM limits.

gpu_worker_a.yaml — light workload, 20% VRAM (~2GB), 25% SM cores:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-worker-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-worker
      instance: worker-a
  template:
    metadata:
      labels:
        app: gpu-worker
        instance: worker-a
    spec:
      schedulerName: hami-scheduler          # critical — tells K8s to use HAMi's scheduler
      containers:
      - name: gpu-worker
        image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
        command: ["python3", "-u", "-c"]
        args:
        - |
          import torch, time, os
          pod = os.environ.get('POD_NAME', 'worker-a')
          device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
          print(f'[{pod}] device={device} gpu={torch.cuda.get_device_name(0)}', flush=True)

          # Allocate 1.5GB resident tensor (well within 2GB limit)
          elements = (1500 * 1024 * 1024) // 4
          blob = torch.zeros(elements, dtype=torch.float32, device=device)
          print(f'[{pod}] VRAM allocated: {torch.cuda.memory_allocated() // 1024**2}MB', flush=True)

          a = torch.randn(1024, 1024, device=device, dtype=torch.float16)
          b = torch.randn(1024, 1024, device=device, dtype=torch.float16)
          i = 0
          while True:
              c = torch.matmul(a, b)
              torch.cuda.synchronize()
              i += 1
              if i % 100 == 0:
                  print(f'[{pod}] iter={i} vram={torch.cuda.memory_allocated() // 1024**2}MB', flush=True)
              time.sleep(0.1)
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: PYTHONUNBUFFERED
          value: "1"
        resources:
          limits:
            nvidia.com/gpu: "1"                    # REQUIRED — HAMi trigger, without this it ignores the pod
            nvidia.com/gpucores: "25"              # 25% SM core cap (soft throttle)
            nvidia.com/gpumem-percentage: "20"     # 20% of VRAM = ~2048MB hard wall

gpu_worker_b.yaml — heavier workload, 30% VRAM (~3GB), 40% SM cores:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-worker-b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-worker
      instance: worker-b
  template:
    metadata:
      labels:
        app: gpu-worker
        instance: worker-b
    spec:
      schedulerName: hami-scheduler
      containers:
      - name: gpu-worker
        image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
        command: ["python3", "-u", "-c"]
        args:
        - |
          import torch, time, os
          pod = os.environ.get('POD_NAME', 'worker-b')
          device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
          print(f'[{pod}] device={device} gpu={torch.cuda.get_device_name(0)}', flush=True)

          elements = (2000 * 1024 * 1024) // 4
          blob = torch.zeros(elements, dtype=torch.float32, device=device)
          print(f'[{pod}] VRAM allocated: {torch.cuda.memory_allocated() // 1024**2}MB', flush=True)

          a = torch.randn(2048, 2048, device=device, dtype=torch.float16)
          b = torch.randn(2048, 2048, device=device, dtype=torch.float16)
          i = 0
          while True:
              c = torch.matmul(a, b)
              torch.cuda.synchronize()
              i += 1
              if i % 100 == 0:
                  print(f'[{pod}] iter={i} vram={torch.cuda.memory_allocated() // 1024**2}MB', flush=True)
              time.sleep(0.05)
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: PYTHONUNBUFFERED
          value: "1"
        resources:
          limits:
            nvidia.com/gpu: "1"
            nvidia.com/gpucores: "40"
            nvidia.com/gpumem-percentage: "30"     # 30% = ~3072MB hard wall
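The percentages take a bit of mental math against the card's total VRAM, so I kept a tiny helper around (my own convenience function, nothing HAMi-specific) to sanity-check the slices:

```python
import math

def gpumem_percentage(desired_mib: int, total_mib: int) -> int:
    """Smallest whole percentage of total VRAM that covers desired_mib."""
    return math.ceil(desired_mib * 100 / total_mib)

# Assuming a nominal 10240 MiB for the RTX 3080:
print(gpumem_percentage(2048, 10240))   # 20 -> worker A's slice
print(gpumem_percentage(3072, 10240))   # 30 -> worker B's slice
```

HAMi also has an absolute-MiB resource (nvidia.com/gpumem, which shows up later in the scheduler's quota metrics) if percentages feel too indirect.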

Deploy them:

microk8s kubectl apply -f gpu_worker_a.yaml
microk8s kubectl apply -f gpu_worker_b.yaml

# Watch logs from both simultaneously
microk8s kubectl logs -l app=gpu-worker --prefix=true -f

The Output That Made Me Pump My Fist at 2AM

[pod/gpu-worker-a-.../gpu-worker] [worker-a] device=cuda gpu=NVIDIA GeForce RTX 3080
[pod/gpu-worker-b-.../gpu-worker] [worker-b] device=cuda gpu=NVIDIA GeForce RTX 3080
[pod/gpu-worker-b-.../gpu-worker] [worker-b] VRAM allocated: 2000MB
[pod/gpu-worker-a-.../gpu-worker] [worker-a] VRAM allocated: 1500MB
[pod/gpu-worker-b-.../gpu-worker] [worker-b] iter=100 vram=2032MB
[pod/gpu-worker-a-.../gpu-worker] [worker-a] iter=100 vram=1514MB
[pod/gpu-worker-b-.../gpu-worker] [worker-b] iter=200 vram=2032MB
[pod/gpu-worker-a-.../gpu-worker] [worker-a] iter=200 vram=1514MB

And on the host, nvidia-smi showed what I'd been trying to achieve for a long time:

+-------------------------------------------------------------------------------------+
| Processes:                                                                          |
|  GPU   GI   CI   PID     Type  Process name                             GPU Memory |
|======================================================================================|
|    0   N/A  N/A  116033  C     python3                                    1828MiB  |
|    0   N/A  N/A  116034  C     python3                                    2860MiB  |
+-------------------------------------------------------------------------------------+

Two processes. Same physical GPU. Both running. Separately allocated VRAM slices. No code changes in the application.


The Gotchas — Things That Would Have Wrecked Me Without Debugging

This was a 2AM session, so I hit every wall possible. Here are the real ones:

Gotcha #1: nvidia.com/gpu: "1" is mandatory

HAMi keys off nvidia.com/gpu as its entry point. If you set only gpucores and gpumem-percentage, HAMi skips the pod entirely: you'll see "FilteringFailed: does not request any resource" in the scheduler logs, but the pod still gets scheduled by the fallback default scheduler — without any VRAM isolation.

Check the scheduler logs during pod creation to confirm:

microk8s kubectl logs -n kube-system \
  $(microk8s kubectl get pod -n kube-system \
    -l app.kubernetes.io/component=hami-scheduler \
    -o jsonpath='{.items[0].metadata.name}') \
  -c vgpu-scheduler-extender --since=2m | grep "allocate success"

You want to see: "device allocate success" allocate device={"NVIDIA":[{"Usedmem":2048,"Usedcores":25}]}

Gotcha #2: The device-plugin DaemonSet starts with DESIRED: 0

The HAMi device-plugin DaemonSet has nodeSelector: gpu: "on". Without labelling your node, the DaemonSet sits idle and HAMi's CUDA shim never gets injected. You'll think everything is working (pods schedule, run, use GPU) but there's no isolation happening.

# This is required — do it right after HAMi install
microk8s kubectl label node <your-node-name> gpu=on

Gotcha #3: bind-phase stuck at "allocating"

If pods show hami.io/bind-phase: allocating (not success), the device-plugin wasn't running when the pods were first scheduled. Delete the pods — Kubernetes recreates them, and this time the device-plugin will properly inject the shim.

microk8s kubectl delete pod -l app=gpu-worker
# They reschedule automatically via the Deployment

Verify:

# Must be non-empty
microk8s kubectl exec $POD_A -- env | grep CUDA_DEVICE_MEMORY_SHARED_CACHE

# Must say "success"
microk8s kubectl get pods -l app=gpu-worker -o yaml | grep "bind-phase"

HAMi vs Our Homebrew Docker Approach — An Honest Comparison

Having lived with both, here's the real difference.

What Plain Docker Actually Does (and Doesn't Do)

When you run docker run --gpus device=0, Docker mounts /dev/nvidia0, /dev/nvidiactl, and /dev/nvidia-uvm into the container. That's it. Every container pointed at the same GPU device sees the whole GPU. There is no VRAM wall.

# Two containers, same GPU, no isolation
docker run --gpus device=0 -e NVIDIA_VISIBLE_DEVICES=0 pytorch/pytorch:latest python3 -c "
import torch
# This will happily allocate ALL available VRAM
blob = torch.zeros(9_000_000_000 // 4, dtype=torch.float32, device='cuda')
print(f'Allocated: {torch.cuda.memory_allocated() // 1024**2}MB')
"

If container A runs that script while container B is also on GPU 0 — container B OOMs. Both processes die or degrade. There's no fence between them.

The only mitigation available in pure Docker is application-level:

# This is a suggestion, not enforcement
torch.cuda.set_per_process_memory_fraction(0.5, device=0)

This requires modifying application code, applies differently per framework, and can be bypassed accidentally or intentionally.

What HAMi Actually Does

HAMi injects libvgpu.so via LD_PRELOAD into each container's process. This library wraps every CUDA memory function. When your process calls cudaMalloc(size), HAMi checks your pod's cumulative allocation against its configured limit. If you'd exceed it, it returns CUDA_ERROR_OUT_OF_MEMORY immediately. No negotiation.

Container calls cudaMalloc(1GB)
       ↓
libvgpu.so intercepts
       ↓
cumulative_alloc + 1GB > pod_limit?
  YES → return CUDA_ERROR_OUT_OF_MEMORY  (your pod, your problem)
  NO  → pass through to real cudaMalloc

The other pods on the same GPU are completely unaffected. Their VRAM slices are spatially isolated — different physical memory pages.

The Comparison Table

| Dimension | Our Docker Approach | HAMi on K8s |
|---|---|---|
| VRAM enforcement | ❌ Application hint only | ✅ Hard wall at the CUDA API |
| OOM blast radius | ❌ Whole GPU | ✅ Per-container only |
| K8s native | ❌ Separate Docker API path | ✅ Full K8s integration |
| App code changes | ⚠️ Required for hints | ✅ Zero changes |
| Auto-recovery | ❌ Manual | ✅ K8s Deployment handles it |
| Monitoring | ❌ DIY nvidia-smi scripts | ✅ Built-in Prometheus endpoints |
| Setup complexity | ✅ Simple | ⚠️ HAMi + K8s required |

HAMi's Built-In Monitoring — This Part Surprised Me

I expected to need to wire up dcgm-exporter, configure Prometheus scrape configs, and build Grafana dashboards from scratch. Instead, HAMi ships two Prometheus metric endpoints out of the box.

Port :31992 (device-plugin, real-time per-container):

curl -s http://localhost:31992/metrics | grep -v "^#"
vGPU_device_memory_usage_in_bytes{podname="gpu-worker-a",...} 1.82884864e+09
vGPU_device_memory_usage_in_bytes{podname="gpu-worker-b",...} 2.39507968e+09
vGPU_device_memory_limit_in_bytes{podname="gpu-worker-a",...} 2.147483648e+09
vGPU_device_memory_limit_in_bytes{podname="gpu-worker-b",...} 3.221225472e+09
Device_utilization_desc_of_container{podname="gpu-worker-a",...} 12
Device_utilization_desc_of_container{podname="gpu-worker-b",...} 31
HostCoreUtilization{deviceuuid="GPU-53aae475-...",...} 14
HostGPUMemoryUsage{deviceuuid="GPU-53aae475-...",...} 5.82e+09

Port :31993 (scheduler, allocation view):

curl -s http://localhost:31993/metrics | grep -v "^#"
GPUDeviceSharedNum{...} 2
GPUDeviceCoreAllocated{...} 65
GPUDeviceMemoryAllocated{...} 5.36870912e+09
vGPUCoreAllocated{podname="gpu-worker-a",...} 25
vGPUCoreAllocated{podname="gpu-worker-b",...} 40
vGPUMemoryAllocated{podname="gpu-worker-a",...} 2.147483648e+09
vGPUMemoryAllocated{podname="gpu-worker-b",...} 3.221225472e+09
QuotaUsed{quotaName="nvidia.com/gpumem", quotanamespace="default",...} 5120

GPUDeviceSharedNum: 2 — two containers sharing one GPU, confirmed from HAMi's perspective.

Wire these to Prometheus with ServiceMonitors and you have a full observability story:

hami_service_monitoring.yaml:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-scheduler-metrics
  namespace: observability
  labels:
    release: kube-prom-stack
spec:
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-scheduler
      app.kubernetes.io/instance: hami
  endpoints:
  - port: monitor          # → pod :9395
    interval: 10s
    path: /metrics
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-metrics
  namespace: observability
  labels:
    release: kube-prom-stack
spec:
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
      app.kubernetes.io/instance: hami
  endpoints:
  - port: monitorport      # → pod :9394
    interval: 5s
    path: /metrics
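Once those ServiceMonitors are live, per-pod VRAM headroom is a single PromQL expression. These queries assume the metric names above reach Prometheus unmodified:

```promql
# Fraction of its VRAM slice each pod is actually using
vGPU_device_memory_usage_in_bytes / vGPU_device_memory_limit_in_bytes

# Alerting candidate: any pod sitting above 90% of its slice
vGPU_device_memory_usage_in_bytes / vGPU_device_memory_limit_in_bytes > 0.9
```

That second query is a much earlier warning than waiting for a pod to hit its hard wall in production.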

Understanding the Limits — Soft vs Hard

This is important to get right before you put HAMi in production.

VRAM (gpumem-percentage) — Hard enforcement. HAMi intercepts cudaMalloc in userspace. When your pod exceeds its limit, it gets CUDA_ERROR_OUT_OF_MEMORY. This is deterministic, reliable, and completely isolates the impact to the offending pod.

SM cores (gpucores) — Soft enforcement. HAMi doesn't have a hardware mechanism to limit SM core usage on non-MIG GPUs. Instead, it monitors GPU utilization and injects cudaDeviceSynchronize() + sleep cycles to throttle kernel submissions when a pod exceeds its core budget. This is best-effort — expect ±5-10% deviation from your configured cap. The GPU doesn't enforce this at hardware level.

gpumem-percentage: "20"   →  Hard. If exceeded → CUDA_ERROR_OUT_OF_MEMORY. Deterministic.
gpucores: "25"            →  Soft. Best-effort ±5-10%. Not a hardware guarantee.
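For intuition about what "best-effort" means here, this is a toy version of utilization-based throttling. It's my own simplification of the idea, not HAMi's actual algorithm: sample utilization, and stretch the delay injected between kernel submissions in proportion to the overshoot.

```python
def throttle_delay(utilization_pct: float, budget_pct: float,
                   base_delay_s: float = 0.001,
                   max_delay_s: float = 0.05) -> float:
    """Delay to inject before the next kernel launch: zero while under
    budget, growing with the overshoot once the pod exceeds its core cap."""
    if utilization_pct <= budget_pct:
        return 0.0
    overshoot = (utilization_pct - budget_pct) / budget_pct
    return min(max_delay_s, base_delay_s * (1 + 10 * overshoot))
```

Because the control loop reacts to measurements after the fact, the actual utilization oscillates around the cap instead of sitting exactly on it, which is where that ±5-10% deviation comes from.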

For most use cases involving multiple AI workloads sharing a GPU, the hard VRAM wall is what matters most. SM throttling is a nice-to-have for fairness but not a safety guarantee.

If you need hard SM guarantees, you're on MIG territory — A100/H100 only.
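One practical consequence of the hard wall: it surfaces as a CUDA out-of-memory error, not a scheduler eviction, so workloads that can degrade gracefully (smaller batch, lower resolution) should catch it and retry. A framework-agnostic sketch, where try_alloc stands in for whatever builds your batch on the GPU and oom_error is whatever your framework raises (recent PyTorch raises torch.cuda.OutOfMemoryError):

```python
def run_with_backoff(try_alloc, batch_sizes, oom_error=MemoryError):
    """Try batch sizes largest-first, falling back on each OOM."""
    for size in batch_sizes:
        try:
            return try_alloc(size)
        except oom_error:
            continue  # hit the pod's VRAM wall; try the next smaller batch
    raise RuntimeError("smallest batch still exceeds this pod's VRAM slice")
```

That turns a hard pod crash into a logged fallback, and either way the neighbouring pods never notice.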


What This Means for My Platform

Coming home from Amsterdam with a working HAMi PoC changes the architecture conversation significantly.

Before: Two-tier GPU management. Docker API for short-lived containers. K8s for long-lived pods. Homebrew GPU pool tracker. No unified monitoring. No VRAM isolation. Multiple separate failure modes.

After (planned): Single K8s cluster. HAMi handles all GPU slicing. Inference pods, processing jobs, and batch workloads all described as K8s Deployments or Jobs with nvidia.com/gpumem-percentage limits. Unified observability via HAMi's Prometheus endpoints. Automatic rescheduling on failure. Namespace quotas per team.

The short-lived GPU job use case specifically — I'm confident HAMi can handle it. On-demand workloads with predictable, bounded VRAM usage are exactly what sub-GPU partitioning is designed for. You can pack several of them onto a single GPU that used to be allocated whole to one process at a time.
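Concretely, one of those short-lived jobs would be a plain K8s Job with the HAMi limits attached. A sketch with placeholder names and numbers:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: on-demand-transcode          # placeholder name
spec:
  backoffLimit: 2
  ttlSecondsAfterFinished: 300       # clean up finished jobs automatically
  template:
    spec:
      schedulerName: hami-scheduler  # route through HAMi, same as the Deployments
      restartPolicy: Never
      containers:
      - name: worker
        image: our-transcoder:latest       # placeholder image
        resources:
          limits:
            nvidia.com/gpu: "1"                # HAMi entry point, required
            nvidia.com/gpucores: "20"          # soft SM cap
            nvidia.com/gpumem-percentage: "15" # hard VRAM wall
```

Same resource vocabulary as the long-lived workers, but with Job lifecycle semantics for free.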


Try It Yourself — Full PoC Files

Everything I built is in the files below. You need MicroK8s, an NVIDIA GPU, and about 30 minutes.

HAMi GPU Sharing on MicroK8s

Split a single physical NVIDIA GPU across multiple Kubernetes pods using HAMi (Heterogeneous AI Computing Virtualization Middleware). No MIG, no hardware partitioning — works on consumer GPUs.

How It Works

HAMi injects a CUDA shim via LD_PRELOAD into each container. The shim intercepts cudaMalloc and kernel launch calls to enforce per-pod limits:

  • VRAM — hard cap; pod is OOM-killed if it exceeds its allocation
  • GPU cores — soft cap via kernel submission throttling (±5–10% deviation is normal)
Physical GPU (e.g. RTX 3080 — 10 GB VRAM)
├── gpu-worker-a  →  20% VRAM (~2 GB)  +  25% SM cores
└── gpu-worker-b  →  30% VRAM (~3 GB)  +  40% SM cores

Both pods run truly in parallel on different SMs. Time-slicing only occurs under SM contention.

Prerequisites

  • Ubuntu 22.04 / 24.04
  • NVIDIA driver installed on host (nvidia-smi works)
  • MicroK8s installed (snap install microk8s --classic)
  • helm3

The files:

  • sanity_check.yaml — Verify GPU access before installing HAMi
  • gpu_worker_a.yaml — Worker A deployment (20% VRAM, 25% cores)
  • gpu_worker_b.yaml — Worker B deployment (30% VRAM, 40% cores)
  • hami_service_monitoring.yaml — Prometheus ServiceMonitors for both HAMi endpoints
  • grafana_dashboard.yaml — Auto-importing Grafana dashboard via ConfigMap

Quick start (after MicroK8s is running with GPU addon enabled):

# 1. Install cert-manager
microk8s kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml
microk8s kubectl wait --for=condition=ready pod \
  -l app.kubernetes.io/instance=cert-manager -n cert-manager --timeout=180s

# 2. Install HAMi
microk8s helm3 repo add hami-charts https://project-hami.github.io/HAMi/
microk8s helm3 repo update
K8S_VERSION=$(microk8s kubectl version -o json | python3 -c \
  "import sys,json,re; v=json.load(sys.stdin)['serverVersion']['gitVersion'].lstrip('v'); print(re.split(r'[+\-]',v)[0])")
microk8s helm3 install hami hami-charts/hami --namespace kube-system \
  --set scheduler.kubeScheduler.imageTag=v${K8S_VERSION} \
  --set devicePlugin.nvidiaDriverPath=/usr/local/nvidia \
  --set scheduler.defaultSchedulerPolicy.gpuMemory=true \
  --set scheduler.defaultSchedulerPolicy.gpuCores=true

# 3. Label your GPU node
microk8s kubectl label node <your-node-name> gpu=on

# 4. Deploy workers
microk8s kubectl apply -f gpu_worker_a.yaml
microk8s kubectl apply -f gpu_worker_b.yaml

# 5. Watch the magic
microk8s kubectl logs -l app=gpu-worker --prefix=true -f &
watch -n 2 nvidia-smi

# 6. Verify HAMi's view of the split
curl -s http://localhost:31993/metrics | grep -v "^#"

Final Thoughts — From One Infrastructure Nerd to Another

I've been at this GPU sharing problem for some time. MIG was the dream but not the reality for most hardware budgets. Time-slicing was a band-aid. Our homebrew solution was genuinely good engineering but was always technical debt waiting to be paid.

HAMi is the first thing I've found that genuinely plugs the gap — software-level VRAM isolation on commodity GPUs, K8s native, zero application changes, and built-in observability. It's not magic: the SM throttling is soft, the setup requires K8s knowledge, and there's still a ceiling of what you can pack onto a 10GB card. But it's real, it works, and it's an open CNCF project with active development.

The fact that I found it at KubeCon, started a PoC at 2AM, and was watching two pods cleanly share an RTX 3080 before I went to sleep — that's a pretty good endorsement.

If you're running AI workloads on Kubernetes and you're wasting GPU budget on whole-GPU-per-pod allocations, give HAMi a look. Your platform budget will thank you.

Top comments (2)

klement Gunndu

The LD_PRELOAD CUDA interception is a smart approach. Curious how HAMi handles multi-process training — if two pods run DDP on the same physical GPU, does the shim coordinate memory limits across processes or per-container only?

Bogdan Doncea

The shim enforces limits per-container only — there's no cross-pod coordination at runtime. Each libvgpu.so instance intercepts cuMemAlloc in its own process and has zero visibility into what the other pod's shim has allocated.