<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bogdan Doncea</title>
    <description>The latest articles on DEV Community by Bogdan Doncea (@bogdandoncea).</description>
    <link>https://dev.to/bogdandoncea</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3848061%2Fa122bf83-5b76-4e08-8c22-e499400140ec.jpeg</url>
      <title>DEV Community: Bogdan Doncea</title>
      <link>https://dev.to/bogdandoncea</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bogdandoncea"/>
    <language>en</language>
    <item>
      <title>Splitting One GPU Across Multiple Kubernetes Pods — Without MIG, Without Enterprise Licenses</title>
      <dc:creator>Bogdan Doncea</dc:creator>
      <pubDate>Sat, 28 Mar 2026 18:45:47 +0000</pubDate>
      <link>https://dev.to/bogdandoncea/splitting-one-gpu-across-multiple-kubernetes-pods-without-mig-without-enterprise-licenses-2f02</link>
      <guid>https://dev.to/bogdandoncea/splitting-one-gpu-across-multiple-kubernetes-pods-without-mig-without-enterprise-licenses-2f02</guid>
      <description>&lt;p&gt;&lt;em&gt;A years-old GPU frustration, a conference discovery, and a 2AM PoC that actually worked&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem I've Been Carrying for Years
&lt;/h2&gt;

&lt;p&gt;If you work with AI or video at scale and you're not at one of the big hyperscalers, you've probably hit this wall before: &lt;strong&gt;you have GPUs, and you're wasting them&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Not because your workloads don't need GPU — they do. But because individually, each workload is small. AI inference services rarely saturate a whole card. Processing jobs spin up, eat some compute, and die. Embedding models, classifiers, lightweight LLMs — they each need a slice of a GPU, not the whole thing. And yet, in a typical Kubernetes setup, each one claims an entire GPU card and sits there hoarding it while the rest goes to waste.&lt;/p&gt;

&lt;p&gt;I've been building a platform that runs multiple AI and video processing workloads in parallel — inference services, enrichment pipelines, on-demand processing jobs. The kind of system where a lot of different things need GPU access at the same time, but no single one of them needs a whole card. The stack is K8s, Kafka, Redis, some databases, and a handful of Python and Java services. And GPUs — always the GPUs.&lt;/p&gt;

&lt;p&gt;The GPU problem specifically: &lt;strong&gt;we have T4s and L40S nodes, and we could never properly share them between pods without playing with fire&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We tried two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;NVIDIA Time-Slicing&lt;/strong&gt; — Easy to set up via the GPU Operator and it looks good on paper. In practice, for streaming and transcoding workloads it was a non-starter. Time-slicing serialises GPU access, which introduces jitter and latency spikes — exactly what you cannot have when you're processing live video or audio. Frames drop, buffers stall, quality degrades. We turned it off fast.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plain Docker with &lt;code&gt;--gpus device=0&lt;/code&gt;&lt;/strong&gt; — which I'll get into. We actually used this for a long time, and it worked — sort of.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we did what any team does when the tooling isn't there: &lt;strong&gt;we built it ourselves&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Our Homegrown Solution — Reinventing the Wheel (But It Spun)
&lt;/h2&gt;

&lt;p&gt;A while back, my team built an internal orchestration layer around a simple reality: Kubernetes GPU support was too coarse for what we needed, so we worked around it.&lt;/p&gt;

&lt;p&gt;The split was straightforward in concept: CPU-based tasks ran as K8s pods, GPU-based tasks ran as Docker containers. Everything that didn't need a GPU lived happily in the cluster as proper Kubernetes workloads — ETL pipelines, API services, data processing, the full stack. But the moment a task needed GPU, it stepped outside K8s entirely.&lt;/p&gt;

&lt;p&gt;For GPU tasks, we had a purpose-built orchestrator service. This orchestrator had to run on the same node as the GPU — because it talked to the local Docker daemon directly to spin up containers there. We enforced this with node affinity rules, pinning the orchestrator to the GPU node so it could reach the Docker API and launch containers on that specific machine. When a GPU task came in, the orchestrator started a Docker container with &lt;code&gt;--gpus device=N&lt;/code&gt;, the task ran, the container was torn down. All GPU-based AI work happened this way — plain Docker containers on the GPU node, completely outside Kubernetes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GPU tasks — Docker containers on the GPU node, launched via local Docker API
&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;docker_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;our-model-server:latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;detach&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_requests&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;docker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;types&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DeviceRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;capabilities&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;environment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PYTORCH_CUDA_ALLOC_CONF&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_split_size_mb:512&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# hint only, not enforced
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CPU tasks — normal K8s pods, no special handling needed
&lt;/span&gt;&lt;span class="n"&gt;pod_manifest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;apiVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Pod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spec&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;containers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4Gi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;k8s_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_namespaced_pod&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;workloads&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pod_manifest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;It worked. We ran it in production. The team was proud of it — and honestly, it was solid engineering given the constraints. But the problems were always there:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No VRAM isolation&lt;/strong&gt; — Docker containers on the same GPU shared memory completely. One greedy process could OOM the rest, and when it happened everything fell over at once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU workloads living outside K8s&lt;/strong&gt; — a whole class of tasks with no K8s lifecycle management, no health checks, no rolling restarts. A permanent special case that needed permanent special handling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node affinity as a constraint, not a choice&lt;/strong&gt; — the orchestrator had to be pinned to the GPU node to reach the Docker daemon. Scaling to multiple GPU nodes meant more orchestrators, more complexity, more things to coordinate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No per-container GPU metrics&lt;/strong&gt; — visibility into who was using what meant scraping &lt;code&gt;nvidia-smi&lt;/code&gt; and correlating PIDs manually. Fragile and tedious.&lt;/li&gt;
&lt;/ul&gt;
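
&lt;p&gt;For flavour, here is a sketch of the kind of correlation script this forced on us. The &lt;code&gt;nvidia-smi&lt;/code&gt; query flags are real; the cgroup parsing is best-effort and runtime-dependent, so treat it as illustrative rather than something to copy.&lt;/p&gt;

```python
import re
import subprocess

def parse_compute_apps(csv_text):
    """Parse `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader`."""
    procs = []
    for line in csv_text.strip().splitlines():
        pid_str, mem_str = (part.strip() for part in line.split(","))
        mib = int(re.sub(r"[^0-9]", "", mem_str))  # "1523 MiB" becomes 1523
        procs.append((int(pid_str), mib))
    return procs

def container_id_for_pid(pid):
    """Best-effort: pull a container ID out of /proc/PID/cgroup (layout varies by runtime)."""
    try:
        with open(f"/proc/{pid}/cgroup") as fh:
            match = re.search(r"[0-9a-f]{64}", fh.read())
            return match.group(0)[:12] if match else None
    except OSError:
        return None

def main():
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, mib in parse_compute_apps(out):
        print(pid, mib, container_id_for_pid(pid))

# main()  # run this on the GPU node itself
```

In practice it broke whenever the container runtime changed its cgroup layout — exactly the fragility the bullet above describes.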

&lt;p&gt;We knew it was technical debt. We just couldn't find anything better. Until Amsterdam.&lt;/p&gt;


&lt;h2&gt;
  
  
  KubeCon Europe 2026 — Amsterdam
&lt;/h2&gt;

&lt;p&gt;I went to KubeCon this year primarily to answer one question: &lt;strong&gt;is there something in the cloud-native ecosystem that handles sub-GPU partitioning on lower-end hardware without requiring H100s?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The talks were good. The hallway track was better. I had conversations with people from platform teams at AI startups, SaaS companies, and a few cloud providers. The picture that emerged was clear — and a little frustrating. The majority weren't even wrestling with this problem. They were on cloud, spinning up GPU instances on demand, scaling out horizontally whenever they needed more compute. GPU sharing? Why bother when you can just add another node?&lt;/p&gt;

&lt;p&gt;But for teams running on-prem or on fixed GPU budgets — and there were more of us in that room than the cloud-native crowd might assume — the story was different. We either wasted GPU resources with whole-GPU-per-pod allocations, paid the H100 tax to get MIG, or built our own solutions. Same wall, different paint.&lt;/p&gt;

&lt;p&gt;I attended a session on GPU resource management and heard mentions of several tools — GPU Operator, DRA (Dynamic Resource Allocation, which matured through K8s 1.31/1.32 and has since gone GA), KAI Scheduler, and then something I hadn't heard of before: &lt;strong&gt;HAMi&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There was a session that stopped me mid-scroll: &lt;em&gt;"Dynamic, Smart, Stable GPU-Sharing Middleware In Kubernetes"&lt;/em&gt;. Five minutes in I had stopped taking notes on anything else. The talk walked through exactly the problem I'd been living with — sub-GPU partitioning on hardware that doesn't support MIG — and presented &lt;strong&gt;HAMi&lt;/strong&gt; as the answer. Software-level vGPU, hard VRAM isolation, any CUDA GPU, K8s native.&lt;/p&gt;

&lt;p&gt;What made it land even harder was that HAMi had also been mentioned earlier in the keynotes. Not as a footnote — as a legitimate part of the GPU sharing story on Kubernetes.&lt;/p&gt;


&lt;h2&gt;
  
  
  2AM in the Hotel Room — The PoC That Shouldn't Have Happened
&lt;/h2&gt;

&lt;p&gt;The city kept us out until midnight. Amsterdam will do that. I said goodbye to everyone, walked back to the hotel, and should have gone straight to sleep — full day of sessions, a lot of walking, and an early morning talk the next day.&lt;/p&gt;

&lt;p&gt;Instead I opened the laptop.&lt;/p&gt;

&lt;p&gt;I'd been turning HAMi over in my head since that session. Not casually — obsessively. I had my MicroK8s home lab accessible remotely. I had a GPU sitting idle. I had all the context from the past year of fighting this exact problem loaded in my head. I genuinely could not wait until I got home to try it. The idea of going to sleep without at least attempting the install felt physically uncomfortable in the way only a very specific kind of engineering nerd will understand.&lt;/p&gt;

&lt;p&gt;So there I was, at 2AM Amsterdam time, laptop on the hotel desk, SSH tunnel back home, &lt;code&gt;microk8s helm3 repo add&lt;/code&gt; running. Extremely classic.&lt;/p&gt;

&lt;p&gt;Three hours later I had a working HAMi installation, two pods running on the same physical GPU with separate VRAM slices, and &lt;code&gt;nvidia-smi&lt;/code&gt; showing exactly what I'd spent two years trying to achieve. I didn't go to sleep until I saw that output. Totally worth it.&lt;/p&gt;

&lt;p&gt;Let me tell you what HAMi actually is, because the name is a bit opaque.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HAMi (Heterogeneous AI Computing Virtualization Middleware)&lt;/strong&gt; — formerly known as &lt;code&gt;k8s-vGPU-scheduler&lt;/code&gt; — is a CNCF Sandbox project that provides software-level GPU virtualization for Kubernetes. It works on &lt;strong&gt;any CUDA GPU&lt;/strong&gt;, including your T4s, RTX cards, L40S, and others that don't support MIG.&lt;/p&gt;

&lt;p&gt;The core mechanism is elegant: HAMi injects a shared library (&lt;code&gt;libvgpu.so&lt;/code&gt;) into each container via &lt;code&gt;LD_PRELOAD&lt;/code&gt;. This library intercepts every &lt;code&gt;cudaMalloc&lt;/code&gt; call at the CUDA API level. If your pod's cumulative VRAM allocation would exceed its configured limit, HAMi returns &lt;code&gt;CUDA_ERROR_OUT_OF_MEMORY&lt;/code&gt; — a hard wall. The pod dies. The other pods sharing that GPU are completely unaffected.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Physical GPU (RTX 3080 — 10GB)
       ↓
  NVIDIA Driver
       ↓
  libvgpu.so  ←── HAMi injects this via LD_PRELOAD
  (intercepts cudaMalloc, enforces per-pod limits)
       ↓
  ┌─────────────┐    ┌─────────────┐
  │   Pod A     │    │   Pod B     │
  │  2GB limit  │    │  3GB limit  │
  │  25% cores  │    │  40% cores  │
  └─────────────┘    └─────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
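
&lt;p&gt;To make the accounting concrete, here is a toy Python model of the budget check. This is purely illustrative (HAMi's real enforcement lives in C inside &lt;code&gt;libvgpu.so&lt;/code&gt;), but the idea is the same: count every allocation against a per-pod budget and refuse the one that would cross it.&lt;/p&gt;

```python
class VGPUAllocator:
    """Toy model of HAMi-style VRAM accounting: a hard per-pod budget.

    Real HAMi enforces this inside the container by intercepting CUDA
    allocation calls via LD_PRELOAD; this only sketches the bookkeeping.
    """

    def __init__(self, limit_mb):
        self.limit_mb = limit_mb
        self.used_mb = 0

    def cuda_malloc(self, size_mb):
        if self.used_mb + size_mb > self.limit_mb:
            # At this point the intercepted call returns CUDA_ERROR_OUT_OF_MEMORY.
            raise MemoryError(
                f"CUDA_ERROR_OUT_OF_MEMORY: {size_mb}MB requested, "
                f"{self.limit_mb - self.used_mb}MB left of {self.limit_mb}MB budget"
            )
        self.used_mb += size_mb
        return self.used_mb

pod_a = VGPUAllocator(limit_mb=2048)   # worker-a: ~2GB slice
pod_b = VGPUAllocator(limit_mb=3072)   # worker-b: ~3GB slice

pod_a.cuda_malloc(1500)                # fits: worker-a's resident blob
try:
    pod_a.cuda_malloc(1024)            # would cross the wall, so it is refused
except MemoryError as err:
    print(err)
pod_b.cuda_malloc(2000)                # pod_b is unaffected by pod_a's OOM
```

Because the check happens per allocation call inside the container, the application sees an ordinary CUDA OOM rather than a host-side kill — and neighbouring pods never notice.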


&lt;p&gt;This is fundamentally different from every other approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unlike &lt;strong&gt;MIG&lt;/strong&gt;, it doesn't require specific hardware&lt;/li&gt;
&lt;li&gt;Unlike &lt;strong&gt;time-slicing&lt;/strong&gt;, it enforces VRAM isolation (not just temporal sharing)&lt;/li&gt;
&lt;li&gt;Unlike &lt;strong&gt;MPS&lt;/strong&gt;, a failing pod doesn't crash the shared context&lt;/li&gt;
&lt;li&gt;Unlike &lt;strong&gt;plain Docker&lt;/strong&gt;, it's K8s-native and actually enforces limits&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  What I Actually Built at 2AM — The PoC
&lt;/h2&gt;

&lt;p&gt;I tested this on my home lab machine (NVIDIA GeForce RTX 3080, 10GB VRAM) running MicroK8s. Here's the full stack I set up, with the actual files I used.&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 1 — Enable GPU Support and Install HAMi
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Enable microk8s GPU addon&lt;/span&gt;
microk8s &lt;span class="nb"&gt;enable &lt;/span&gt;gpu

&lt;span class="c"&gt;# Install cert-manager (HAMi's webhook needs it)&lt;/span&gt;
microk8s kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml
microk8s kubectl &lt;span class="nb"&gt;wait&lt;/span&gt; &lt;span class="nt"&gt;--for&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ready pod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-l&lt;/span&gt; app.kubernetes.io/instance&lt;span class="o"&gt;=&lt;/span&gt;cert-manager &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; cert-manager &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;180s

&lt;span class="c"&gt;# Add HAMi helm repo&lt;/span&gt;
microk8s helm3 repo add hami-charts https://project-hami.github.io/HAMi/
microk8s helm3 repo update

&lt;span class="c"&gt;# Get K8s version (--short is deprecated in newer kubectl)&lt;/span&gt;
&lt;span class="nv"&gt;K8S_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;microk8s kubectl version &lt;span class="nt"&gt;-o&lt;/span&gt; json | python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
import sys, json, re
v = json.load(sys.stdin)['serverVersion']['gitVersion'].lstrip('v')
print(re.split(r'[+&lt;/span&gt;&lt;span class="se"&gt;\-&lt;/span&gt;&lt;span class="s2"&gt;]', v)[0])
"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Install HAMi&lt;/span&gt;
microk8s helm3 &lt;span class="nb"&gt;install &lt;/span&gt;hami hami-charts/hami &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kube-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; scheduler.kubeScheduler.imageTag&lt;span class="o"&gt;=&lt;/span&gt;v&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;K8S_VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; devicePlugin.nvidiaDriverPath&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/nvidia &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; scheduler.defaultSchedulerPolicy.gpuMemory&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; scheduler.defaultSchedulerPolicy.gpuCores&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# CRITICAL: Label the GPU node — without this, the device-plugin DaemonSet stays at DESIRED: 0&lt;/span&gt;
microk8s kubectl label node &amp;lt;your-node-name&amp;gt; &lt;span class="nv"&gt;gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;on
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Gotcha #1&lt;/strong&gt;: The &lt;code&gt;gpu=on&lt;/code&gt; label is mandatory. The HAMi device-plugin DaemonSet has a nodeSelector that requires it. I lost 20 minutes on this before I understood why &lt;code&gt;DESIRED: 0&lt;/code&gt; wasn't moving.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Step 2 — Two Workers Sharing One GPU
&lt;/h3&gt;

&lt;p&gt;Here's where it gets interesting. I deployed two PyTorch workloads simultaneously on the same physical GPU, each with hard VRAM limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;gpu_worker_a.yaml&lt;/code&gt;&lt;/strong&gt; — light workload, 20% VRAM (~2GB), 25% SM cores:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-worker-a&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-worker&lt;/span&gt;
      &lt;span class="na"&gt;instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;worker-a&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-worker&lt;/span&gt;
        &lt;span class="na"&gt;instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;worker-a&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;schedulerName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hami-scheduler&lt;/span&gt;          &lt;span class="c1"&gt;# critical — tells K8s to use HAMi's scheduler&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-worker&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-u"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;import torch, time, os&lt;/span&gt;
          &lt;span class="s"&gt;pod = os.environ.get('POD_NAME', 'worker-a')&lt;/span&gt;
          &lt;span class="s"&gt;device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')&lt;/span&gt;
          &lt;span class="s"&gt;print(f'[{pod}] device={device} gpu={torch.cuda.get_device_name(0)}', flush=True)&lt;/span&gt;

          &lt;span class="s"&gt;# Allocate 1.5GB resident tensor (well within 2GB limit)&lt;/span&gt;
          &lt;span class="s"&gt;elements = (1500 * 1024 * 1024) // 4&lt;/span&gt;
          &lt;span class="s"&gt;blob = torch.zeros(elements, dtype=torch.float32, device=device)&lt;/span&gt;
          &lt;span class="s"&gt;print(f'[{pod}] VRAM allocated: {torch.cuda.memory_allocated() // 1024**2}MB', flush=True)&lt;/span&gt;

          &lt;span class="s"&gt;a = torch.randn(1024, 1024, device=device, dtype=torch.float16)&lt;/span&gt;
          &lt;span class="s"&gt;b = torch.randn(1024, 1024, device=device, dtype=torch.float16)&lt;/span&gt;
          &lt;span class="s"&gt;i = 0&lt;/span&gt;
          &lt;span class="s"&gt;while True:&lt;/span&gt;
              &lt;span class="s"&gt;c = torch.matmul(a, b)&lt;/span&gt;
              &lt;span class="s"&gt;torch.cuda.synchronize()&lt;/span&gt;
              &lt;span class="s"&gt;i += 1&lt;/span&gt;
              &lt;span class="s"&gt;if i % 100 == 0:&lt;/span&gt;
                  &lt;span class="s"&gt;print(f'[{pod}] iter={i} vram={torch.cuda.memory_allocated() // 1024**2}MB', flush=True)&lt;/span&gt;
              &lt;span class="s"&gt;time.sleep(0.1)&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POD_NAME&lt;/span&gt;
          &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;fieldRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;fieldPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;metadata.name&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PYTHONUNBUFFERED&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;                    &lt;span class="c1"&gt;# REQUIRED — HAMi trigger, without this it ignores the pod&lt;/span&gt;
            &lt;span class="na"&gt;nvidia.com/gpucores&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;25"&lt;/span&gt;              &lt;span class="c1"&gt;# 25% SM core cap (soft throttle)&lt;/span&gt;
            &lt;span class="na"&gt;nvidia.com/gpumem-percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;20"&lt;/span&gt;     &lt;span class="c1"&gt;# 20% of VRAM = ~2048MB hard wall&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
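
&lt;p&gt;Because &lt;code&gt;gpumem-percentage&lt;/code&gt; is relative to the physical card, it pays to sanity-check the slice arithmetic up front. A quick sketch, assuming the 3080 reports 10240MB:&lt;/p&gt;

```python
TOTAL_VRAM_MB = 10240  # what my RTX 3080 reports

workers = {"worker-a": 20, "worker-b": 30}  # the gpumem-percentage values

slices = {name: TOTAL_VRAM_MB * pct // 100 for name, pct in workers.items()}
print(slices)  # {'worker-a': 2048, 'worker-b': 3072}

# The two hard walls together leave half the card free for more pods:
assert sum(slices.values()) == 5120
```

That matches the &lt;code&gt;~2048MB hard wall&lt;/code&gt; comment in the manifest above.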


&lt;p&gt;&lt;strong&gt;&lt;code&gt;gpu_worker_b.yaml&lt;/code&gt;&lt;/strong&gt; — heavier workload, 30% VRAM (~3GB), 40% SM cores:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-worker-b&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-worker&lt;/span&gt;
      &lt;span class="na"&gt;instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;worker-b&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-worker&lt;/span&gt;
        &lt;span class="na"&gt;instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;worker-b&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;schedulerName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hami-scheduler&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-worker&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-u"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;import torch, time, os&lt;/span&gt;
          &lt;span class="s"&gt;pod = os.environ.get('POD_NAME', 'worker-b')&lt;/span&gt;
          &lt;span class="s"&gt;device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')&lt;/span&gt;
          &lt;span class="s"&gt;print(f'[{pod}] device={device} gpu={torch.cuda.get_device_name(0)}', flush=True)&lt;/span&gt;

          &lt;span class="s"&gt;elements = (2000 * 1024 * 1024) // 4&lt;/span&gt;
          &lt;span class="s"&gt;blob = torch.zeros(elements, dtype=torch.float32, device=device)&lt;/span&gt;
          &lt;span class="s"&gt;print(f'[{pod}] VRAM allocated: {torch.cuda.memory_allocated() // 1024**2}MB', flush=True)&lt;/span&gt;

          &lt;span class="s"&gt;a = torch.randn(2048, 2048, device=device, dtype=torch.float16)&lt;/span&gt;
          &lt;span class="s"&gt;b = torch.randn(2048, 2048, device=device, dtype=torch.float16)&lt;/span&gt;
          &lt;span class="s"&gt;i = 0&lt;/span&gt;
          &lt;span class="s"&gt;while True:&lt;/span&gt;
              &lt;span class="s"&gt;c = torch.matmul(a, b)&lt;/span&gt;
              &lt;span class="s"&gt;torch.cuda.synchronize()&lt;/span&gt;
              &lt;span class="s"&gt;i += 1&lt;/span&gt;
              &lt;span class="s"&gt;if i % 100 == 0:&lt;/span&gt;
                  &lt;span class="s"&gt;print(f'[{pod}] iter={i} vram={torch.cuda.memory_allocated() // 1024**2}MB', flush=True)&lt;/span&gt;
              &lt;span class="s"&gt;time.sleep(0.05)&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POD_NAME&lt;/span&gt;
          &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;fieldRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;fieldPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;metadata.name&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PYTHONUNBUFFERED&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
            &lt;span class="na"&gt;nvidia.com/gpucores&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;40"&lt;/span&gt;
            &lt;span class="na"&gt;nvidia.com/gpumem-percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;30"&lt;/span&gt;     &lt;span class="c1"&gt;# 30% = ~3072MB hard wall&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
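&lt;p&gt;Before deploying, worker-b's numbers are worth a quick sanity check: the 2000&nbsp;MB blob has to fit under the 30% wall. A few lines of arithmetic, assuming the RTX 3080's 10&nbsp;GiB as in my setup:&lt;/p&gt;

```python
# Sanity check on worker-b's manifest values (10 GiB card assumed).
total_mib = 10240                     # RTX 3080 VRAM in MiB
wall_mib = total_mib * 30 // 100      # nvidia.com/gpumem-percentage: "30"

elements = (2000 * 1024 * 1024) // 4  # same expression as in the manifest
blob_mib = elements * 4 // 2**20      # float32 = 4 bytes per element

headroom_mib = wall_mib - blob_mib
print(f"wall={wall_mib}MiB blob={blob_mib}MiB headroom={headroom_mib}MiB")
# wall=3072MiB blob=2000MiB headroom=1072MiB
```

So the steady-state blob leaves about 1&nbsp;GiB under the wall for the matmul workspace.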


&lt;p&gt;Deploy them:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;microk8s kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; gpu_worker_a.yaml
microk8s kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; gpu_worker_b.yaml

&lt;span class="c"&gt;# Watch logs from both simultaneously&lt;/span&gt;
microk8s kubectl logs &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpu-worker &lt;span class="nt"&gt;--prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  The Output That Made Me Pump My Fist at 2AM
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;[pod/gpu-worker-a-.../gpu-worker] [worker-a] device=cuda gpu=NVIDIA GeForce RTX 3080
[pod/gpu-worker-b-.../gpu-worker] [worker-b] device=cuda gpu=NVIDIA GeForce RTX 3080
[pod/gpu-worker-b-.../gpu-worker] [worker-b] VRAM allocated: 2000MB
[pod/gpu-worker-a-.../gpu-worker] [worker-a] VRAM allocated: 1500MB
[pod/gpu-worker-b-.../gpu-worker] [worker-b] iter=100 vram=2032MB
[pod/gpu-worker-a-.../gpu-worker] [worker-a] iter=100 vram=1514MB
[pod/gpu-worker-b-.../gpu-worker] [worker-b] iter=200 vram=2032MB
[pod/gpu-worker-a-.../gpu-worker] [worker-a] iter=200 vram=1514MB
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;And on the host, &lt;code&gt;nvidia-smi&lt;/code&gt; showed what I'd been trying to achieve for a long time:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;+-------------------------------------------------------------------------------------+
| Processes:                                                                          |
|  GPU   GI   CI   PID     Type  Process name                             GPU Memory |
|======================================================================================|
|    0   N/A  N/A  116033  C     python3                                    1828MiB  |
|    0   N/A  N/A  116034  C     python3                                    2860MiB  |
+-------------------------------------------------------------------------------------+
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Two processes. Same physical GPU. Both running. Separately allocated VRAM slices. &lt;strong&gt;No code changes in the application.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  The Gotchas — Things That Would Have Wrecked Me Without Debugging
&lt;/h2&gt;

&lt;p&gt;This was a 2AM session, so I hit every wall possible. Here are the real ones:&lt;/p&gt;
&lt;h3&gt;
  
  
  Gotcha #1: &lt;code&gt;nvidia.com/gpu: "1"&lt;/code&gt; is mandatory
&lt;/h3&gt;

&lt;p&gt;HAMi's NVIDIA device counter needs to see &lt;code&gt;nvidia.com/gpu&lt;/code&gt; as the entry point. Setting only &lt;code&gt;gpucores&lt;/code&gt; and &lt;code&gt;gpumem-percentage&lt;/code&gt; without it causes HAMi to skip the pod completely. You'll see &lt;code&gt;"FilteringFailed: does not request any resource"&lt;/code&gt; in the scheduler logs but the pod will still get scheduled (by the fallback default scheduler) — without any VRAM isolation.&lt;/p&gt;

&lt;p&gt;Check the scheduler logs during pod creation to confirm:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;microk8s kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="si"&gt;$(&lt;/span&gt;microk8s kubectl get pod &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-l&lt;/span&gt; app.kubernetes.io/component&lt;span class="o"&gt;=&lt;/span&gt;hami-scheduler &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.items[0].metadata.name}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; vgpu-scheduler-extender &lt;span class="nt"&gt;--since&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2m | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"allocate success"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You want to see: &lt;code&gt;"device allocate success" allocate device={"NVIDIA":[{"Usedmem":2048,"Usedcores":25}]}&lt;/code&gt;&lt;/p&gt;
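&lt;p&gt;For reference, the smallest &lt;code&gt;resources&lt;/code&gt; block HAMi will act on looks like this (the percentages are illustrative):&lt;/p&gt;

```yaml
# Minimal HAMi-visible limits: the gpu counter is the entry point,
# gpucores / gpumem-percentage then refine the slice.
resources:
  limits:
    nvidia.com/gpu: "1"                # mandatory: HAMi keys off this
    nvidia.com/gpucores: "25"          # soft SM budget, percent
    nvidia.com/gpumem-percentage: "20" # hard VRAM wall, percent
```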
&lt;h3&gt;
  
  
  Gotcha #2: The device-plugin DaemonSet starts with DESIRED: 0
&lt;/h3&gt;

&lt;p&gt;The HAMi device-plugin DaemonSet has &lt;code&gt;nodeSelector: gpu: "on"&lt;/code&gt;. Without labelling your node, the DaemonSet sits idle and HAMi's CUDA shim never gets injected. You'll think everything is working (pods schedule, run, use GPU) but there's no isolation happening.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# This is required — do it right after HAMi install&lt;/span&gt;
microk8s kubectl label node &amp;lt;your-node-name&amp;gt; &lt;span class="nv"&gt;gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;on
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Gotcha #3: bind-phase stuck at "allocating"
&lt;/h3&gt;

&lt;p&gt;If pods show &lt;code&gt;hami.io/bind-phase: allocating&lt;/code&gt; (not &lt;code&gt;success&lt;/code&gt;), the device-plugin wasn't running when the pods were first scheduled. Delete the pods — Kubernetes recreates them, and this time the device-plugin will properly inject the shim.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;microk8s kubectl delete pod &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpu-worker
&lt;span class="c"&gt;# They reschedule automatically via the Deployment&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Must be non-empty&lt;/span&gt;
microk8s kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nv"&gt;$POD_A&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nb"&gt;env&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;CUDA_DEVICE_MEMORY_SHARED_CACHE

&lt;span class="c"&gt;# Must say "success"&lt;/span&gt;
microk8s kubectl get pods &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpu-worker &lt;span class="nt"&gt;-o&lt;/span&gt; yaml | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"bind-phase"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  HAMi vs Our Homebrew Docker Approach — An Honest Comparison
&lt;/h2&gt;

&lt;p&gt;Having lived with both, here's the real difference.&lt;/p&gt;
&lt;h3&gt;
  
  
  What Plain Docker Actually Does (and Doesn't Do)
&lt;/h3&gt;

&lt;p&gt;When you run &lt;code&gt;docker run --gpus device=0&lt;/code&gt;, Docker mounts &lt;code&gt;/dev/nvidia0&lt;/code&gt;, &lt;code&gt;/dev/nvidiactl&lt;/code&gt;, and &lt;code&gt;/dev/nvidia-uvm&lt;/code&gt; into the container. That's it. Every container pointed at the same GPU device sees &lt;strong&gt;the whole GPU&lt;/strong&gt;. There is no VRAM wall.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Two containers, same GPU, no isolation&lt;/span&gt;
docker run &lt;span class="nt"&gt;--gpus&lt;/span&gt; &lt;span class="nv"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;NVIDIA_VISIBLE_DEVICES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 pytorch/pytorch:latest python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"
import torch
# This will happily allocate ALL available VRAM
blob = torch.zeros(9_000_000_000 // 4, dtype=torch.float32, device='cuda')
print(f'Allocated: {torch.cuda.memory_allocated() // 1024**2}MB')
"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If container A runs that script while container B is also on GPU 0 — container B OOMs. Both processes die or degrade. There's no fence between them.&lt;/p&gt;

&lt;p&gt;The only mitigation available in pure Docker is application-level:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This is a suggestion, not enforcement
&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set_per_process_memory_fraction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This requires modifying application code, applies differently per framework, and can be bypassed accidentally or intentionally.&lt;/p&gt;
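&lt;p&gt;If you're stuck with the hint, you can at least make it configurable without rebuilding images by driving it from the environment. A sketch — &lt;code&gt;GPU_MEM_FRACTION&lt;/code&gt; is a made-up variable you'd set in the container spec yourself:&lt;/p&gt;

```python
import os

def requested_fraction(default: float = 1.0) -> float:
    """Read a cooperative VRAM fraction from the environment.

    GPU_MEM_FRACTION is a hypothetical variable; nothing enforces it,
    it only feeds PyTorch's per-process hint.
    """
    raw = os.environ.get("GPU_MEM_FRACTION", "")
    try:
        frac = float(raw)
    except ValueError:
        return default
    return min(max(frac, 0.05), 1.0)  # clamp to a sane range

frac = requested_fraction()
# In the application you would then apply the hint (still not enforcement):
#   torch.cuda.set_per_process_memory_fraction(frac, device=0)
print(f"would cap this process at {frac:.0%} of VRAM")
```

It's still a suggestion the process can bypass — which is exactly the gap HAMi closes.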
&lt;h3&gt;
  
  
  What HAMi Actually Does
&lt;/h3&gt;

&lt;p&gt;HAMi injects &lt;code&gt;libvgpu.so&lt;/code&gt; via &lt;code&gt;LD_PRELOAD&lt;/code&gt; into each container's process. This library wraps every CUDA memory function. When your process calls &lt;code&gt;cudaMalloc(size)&lt;/code&gt;, HAMi checks your pod's cumulative allocation against its configured limit. If you'd exceed it, it returns &lt;code&gt;CUDA_ERROR_OUT_OF_MEMORY&lt;/code&gt; immediately. No negotiation.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Container calls cudaMalloc(1GB)
       ↓
libvgpu.so intercepts
       ↓
cumulative_alloc + 1GB &amp;gt; pod_limit?
  YES → return CUDA_ERROR_OUT_OF_MEMORY  (your pod, your problem)
  NO  → pass through to real cudaMalloc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
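&lt;p&gt;The decision above is easy to model. A toy sketch of the accounting — illustration only; the real shim is C code inside &lt;code&gt;libvgpu.so&lt;/code&gt;, and these names are invented:&lt;/p&gt;

```python
class CudaOutOfMemory(Exception):
    """Stands in for CUDA_ERROR_OUT_OF_MEMORY."""

class VgpuMemoryWall:
    """Tracks a pod's cumulative device allocations against its limit."""

    def __init__(self, limit_bytes: int):
        self.limit = limit_bytes
        self.allocated = 0

    def malloc(self, size: int) -> None:
        # The check is against the pod's cumulative total, not one call.
        if self.allocated + size > self.limit:
            raise CudaOutOfMemory(
                f"{self.allocated + size} bytes would exceed {self.limit}"
            )
        self.allocated += size  # pass through to the real cudaMalloc

    def free(self, size: int) -> None:
        self.allocated -= size

# A pod with a ~3 GiB wall (30% of a 10 GiB card):
wall = VgpuMemoryWall(limit_bytes=3 * 1024**3)
wall.malloc(2000 * 1024**2)      # the 2000 MB blob from worker-b: fits
try:
    wall.malloc(1500 * 1024**2)  # would push the total past the wall
except CudaOutOfMemory as exc:
    print("denied:", exc)        # other pods on the GPU are untouched
```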


&lt;p&gt;The other pods on the same GPU are completely unaffected. Their VRAM slices are spatially isolated — different physical memory pages.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Comparison Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Our Docker Approach&lt;/th&gt;
&lt;th&gt;HAMi on K8s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;VRAM enforcement&lt;/td&gt;
&lt;td&gt;❌ Application hint only&lt;/td&gt;
&lt;td&gt;✅ Hard wall (CUDA shim)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OOM blast radius&lt;/td&gt;
&lt;td&gt;❌ Whole GPU&lt;/td&gt;
&lt;td&gt;✅ Per-container only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;K8s native&lt;/td&gt;
&lt;td&gt;❌ Docker API separate&lt;/td&gt;
&lt;td&gt;✅ Full K8s integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;App code changes&lt;/td&gt;
&lt;td&gt;⚠️ Required for hints&lt;/td&gt;
&lt;td&gt;✅ Zero changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-recovery&lt;/td&gt;
&lt;td&gt;❌ Manual&lt;/td&gt;
&lt;td&gt;✅ K8s Deployment handles it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;❌ DIY nvidia-smi scripts&lt;/td&gt;
&lt;td&gt;✅ Built-in Prometheus endpoints&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup complexity&lt;/td&gt;
&lt;td&gt;✅ Simple&lt;/td&gt;
&lt;td&gt;⚠️ HAMi + K8s required&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  HAMi's Built-In Monitoring — This Part Surprised Me
&lt;/h2&gt;

&lt;p&gt;I expected to have to wire up dcgm-exporter, configure Prometheus scrape configs, and build Grafana dashboards from scratch. Instead, HAMi ships two Prometheus metric endpoints out of the box.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Port &lt;code&gt;:31992&lt;/code&gt;&lt;/strong&gt; (device-plugin, real-time per-container):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:31992/metrics | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"^#"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;vGPU_device_memory_usage_in_bytes&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;podname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gpu-worker-a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mf"&gt;1.82884864&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;09&lt;/span&gt;
&lt;span class="n"&gt;vGPU_device_memory_usage_in_bytes&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;podname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gpu-worker-b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mf"&gt;2.39507968&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;09&lt;/span&gt;
&lt;span class="n"&gt;vGPU_device_memory_limit_in_bytes&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;podname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gpu-worker-a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mf"&gt;2.147483648&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;09&lt;/span&gt;
&lt;span class="n"&gt;vGPU_device_memory_limit_in_bytes&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;podname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gpu-worker-b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mf"&gt;3.221225472&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;09&lt;/span&gt;
&lt;span class="n"&gt;Device_utilization_desc_of_container&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;podname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gpu-worker-a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;
&lt;span class="n"&gt;Device_utilization_desc_of_container&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;podname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gpu-worker-b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;31&lt;/span&gt;
&lt;span class="n"&gt;HostCoreUtilization&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;deviceuuid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"GPU-53aae475-..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;
&lt;span class="n"&gt;HostGPUMemoryUsage&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;deviceuuid&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"GPU-53aae475-..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mf"&gt;5.82&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;09&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
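&lt;p&gt;Those usage/limit gauges are all you need for a headroom check. Plugging in worker-a's sample values from above:&lt;/p&gt;

```python
# VRAM headroom from the two gauges above (worker-a's sample values).
usage_bytes = 1.82884864e9   # vGPU_device_memory_usage_in_bytes
limit_bytes = 2.147483648e9  # vGPU_device_memory_limit_in_bytes

used_fraction = usage_bytes / limit_bytes
headroom_mib = (limit_bytes - usage_bytes) / 2**20

print(f"worker-a: {used_fraction:.1%} of its wall, {headroom_mib:.0f} MiB left")
```

The same ratio is what you'd alert on in Prometheus.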


&lt;p&gt;&lt;strong&gt;Port &lt;code&gt;:31993&lt;/code&gt;&lt;/strong&gt; (scheduler, allocation view):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:31993/metrics | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"^#"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight prometheus"&gt;&lt;code&gt;&lt;span class="n"&gt;GPUDeviceSharedNum&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="n"&gt;GPUDeviceCoreAllocated&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;65&lt;/span&gt;
&lt;span class="n"&gt;GPUDeviceMemoryAllocated&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mf"&gt;5.36870912&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;09&lt;/span&gt;
&lt;span class="n"&gt;vGPUCoreAllocated&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;podname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gpu-worker-a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;
&lt;span class="n"&gt;vGPUCoreAllocated&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;podname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gpu-worker-b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;
&lt;span class="n"&gt;vGPUMemoryAllocated&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;podname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gpu-worker-a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mf"&gt;2.147483648&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;09&lt;/span&gt;
&lt;span class="n"&gt;vGPUMemoryAllocated&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;podname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"gpu-worker-b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mf"&gt;3.221225472&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;09&lt;/span&gt;
&lt;span class="n"&gt;QuotaUsed&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;quotaName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"nvidia.com/gpumem"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;quotanamespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;5120&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;GPUDeviceSharedNum: 2&lt;/code&gt; — two containers sharing one GPU, confirmed from HAMi's perspective.&lt;/p&gt;

&lt;p&gt;Wire these to Prometheus with ServiceMonitors and you have a full observability story:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;hami_service_monitoring.yaml&lt;/code&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring.coreos.com/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceMonitor&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hami-scheduler-metrics&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;observability&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;release&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-prom-stack&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;namespaceSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchNames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app.kubernetes.io/component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hami-scheduler&lt;/span&gt;
      &lt;span class="na"&gt;app.kubernetes.io/instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hami&lt;/span&gt;
  &lt;span class="na"&gt;endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitor&lt;/span&gt;          &lt;span class="c1"&gt;# → pod :9395&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/metrics&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitoring.coreos.com/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceMonitor&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hami-device-plugin-metrics&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;observability&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;release&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-prom-stack&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;namespaceSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchNames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app.kubernetes.io/component&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hami-device-plugin&lt;/span&gt;
      &lt;span class="na"&gt;app.kubernetes.io/instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hami&lt;/span&gt;
  &lt;span class="na"&gt;endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;monitorport&lt;/span&gt;      &lt;span class="c1"&gt;# → pod :9394&lt;/span&gt;
    &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5s&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/metrics&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
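&lt;p&gt;With the metrics flowing, a basic alert falls out of the usage/limit pair. A sketch — the rule name and 90% threshold are mine, and it assumes the same &lt;code&gt;release: kube-prom-stack&lt;/code&gt; label convention as the ServiceMonitors above:&lt;/p&gt;

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: hami-vram-pressure        # illustrative name
  namespace: observability
  labels:
    release: kube-prom-stack
spec:
  groups:
  - name: hami.vram
    rules:
    - alert: PodNearVramWall
      expr: |
        vGPU_device_memory_usage_in_bytes
          / vGPU_device_memory_limit_in_bytes > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "{{ $labels.podname }} is above 90% of its HAMi VRAM wall"
```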



&lt;h2&gt;
  
  
  Understanding the Limits — Soft vs Hard
&lt;/h2&gt;

&lt;p&gt;This is important to get right before you put HAMi in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VRAM (gpumem-percentage) — Hard enforcement.&lt;/strong&gt; HAMi intercepts &lt;code&gt;cudaMalloc&lt;/code&gt; in userspace. When your pod exceeds its limit, it gets &lt;code&gt;CUDA_ERROR_OUT_OF_MEMORY&lt;/code&gt;. This is deterministic, reliable, and completely isolates the impact to the offending pod.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SM cores (gpucores) — Soft enforcement.&lt;/strong&gt; HAMi doesn't have a hardware mechanism to limit SM core usage on non-MIG GPUs. Instead, it monitors GPU utilization and injects &lt;code&gt;cudaDeviceSynchronize()&lt;/code&gt; + sleep cycles to throttle kernel submissions when a pod exceeds its core budget. This is best-effort — expect ±5-10% deviation from your configured cap. Nothing enforces it at the hardware level.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;gpumem-percentage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;20"&lt;/span&gt;   &lt;span class="s"&gt;→  Hard. If exceeded → CUDA_ERROR_OUT_OF_MEMORY. Deterministic.&lt;/span&gt;
&lt;span class="na"&gt;gpucores&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;25"&lt;/span&gt;            &lt;span class="s"&gt;→  Soft. Best-effort ±5-10%. Not a hardware guarantee.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;For most use cases involving multiple AI workloads sharing a GPU, the hard VRAM wall is what matters most. SM throttling is a nice-to-have for fairness but not a safety guarantee.&lt;/p&gt;

&lt;p&gt;If you need hard SM guarantees, you're in MIG territory — Ampere-or-newer datacenter cards (A30/A100/H100 class).&lt;/p&gt;


&lt;h2&gt;
  
  
  What This Means for My Platform
&lt;/h2&gt;

&lt;p&gt;Coming home from Amsterdam with a working HAMi PoC changes the architecture conversation significantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;: Two-tier GPU management. Docker API for short-lived containers. K8s for long-lived pods. Homebrew GPU pool tracker. No unified monitoring. No VRAM isolation. Multiple separate failure modes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After (planned)&lt;/strong&gt;: Single K8s cluster. HAMi handles all GPU slicing. Inference pods, processing jobs, and batch workloads all described as K8s Deployments or Jobs with &lt;code&gt;nvidia.com/gpumem-percentage&lt;/code&gt; limits. Unified observability via HAMi's Prometheus endpoints. Automatic rescheduling on failure. Namespace quotas per team.&lt;/p&gt;

&lt;p&gt;For the short-lived GPU job use case specifically, I'm confident HAMi can handle it. On-demand workloads with predictable, bounded VRAM usage are exactly what sub-GPU partitioning is designed for. You can pack several of them onto a single GPU that used to be allocated whole to one process at a time.&lt;/p&gt;


&lt;h2&gt;
  
  
  Try It Yourself — Full PoC Files
&lt;/h2&gt;

&lt;p&gt;Everything I built is in the files below. You need MicroK8s, an NVIDIA GPU, and about 30 minutes.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/bogdancstrike" rel="noopener noreferrer"&gt;
        bogdancstrike
      &lt;/a&gt; / &lt;a href="https://github.com/bogdancstrike/HAMi-kubernetes-gpu-partitioning-demo" rel="noopener noreferrer"&gt;
        HAMi-kubernetes-gpu-partitioning-demo
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;HAMi GPU Sharing on MicroK8s&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;Split a single physical NVIDIA GPU across multiple Kubernetes pods using &lt;a href="https://github.com/Project-HAMi/HAMi" rel="noopener noreferrer"&gt;HAMi&lt;/a&gt; (Heterogeneous AI Computing Virtualization Middleware). No MIG, no hardware partitioning — works on consumer GPUs.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;How It Works&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;HAMi injects a CUDA shim via &lt;code&gt;LD_PRELOAD&lt;/code&gt; into each container. The shim intercepts &lt;code&gt;cudaMalloc&lt;/code&gt; and kernel launch calls to enforce per-pod limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VRAM&lt;/strong&gt; — hard cap; pod is OOM-killed if it exceeds its allocation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU cores&lt;/strong&gt; — soft cap via kernel submission throttling (±5–10% deviation is normal)&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;Physical GPU (e.g. RTX 3080 — 10 GB VRAM)
├── gpu-worker-a  →  20% VRAM (~2 GB)  +  25% SM cores
└── gpu-worker-b  →  30% VRAM (~3 GB)  +  40% SM cores
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Both pods run truly in parallel on different SMs. Time-slicing only occurs under SM contention.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Prerequisites&lt;/h2&gt;
&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Ubuntu 22.04 / 24.04&lt;/li&gt;
&lt;li&gt;NVIDIA driver installed on host (&lt;code&gt;nvidia-smi&lt;/code&gt; works)&lt;/li&gt;
&lt;li&gt;MicroK8s installed (&lt;code&gt;snap install microk8s --classic&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;helm3&lt;/code&gt;…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/bogdancstrike/HAMi-kubernetes-gpu-partitioning-demo" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;The files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;sanity_check.yaml&lt;/code&gt;&lt;/strong&gt; — Verify GPU access before installing HAMi&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gpu_worker_a.yaml&lt;/code&gt;&lt;/strong&gt; — Worker A deployment (20% VRAM, 25% cores)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;gpu_worker_b.yaml&lt;/code&gt;&lt;/strong&gt; — Worker B deployment (30% VRAM, 40% cores)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;hami_service_monitoring.yaml&lt;/code&gt;&lt;/strong&gt; — Prometheus ServiceMonitors for both HAMi endpoints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;grafana_dashboard.yaml&lt;/code&gt;&lt;/strong&gt; — Auto-importing Grafana dashboard via ConfigMap&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Quick start (after MicroK8s is running with GPU addon enabled):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Install cert-manager&lt;/span&gt;
microk8s kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/cert-manager/cert-manager/releases/download/v1.14.4/cert-manager.yaml
microk8s kubectl &lt;span class="nb"&gt;wait&lt;/span&gt; &lt;span class="nt"&gt;--for&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ready pod &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-l&lt;/span&gt; app.kubernetes.io/instance&lt;span class="o"&gt;=&lt;/span&gt;cert-manager &lt;span class="nt"&gt;-n&lt;/span&gt; cert-manager &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;180s

&lt;span class="c"&gt;# 2. Install HAMi&lt;/span&gt;
microk8s helm3 repo add hami-charts https://project-hami.github.io/HAMi/
microk8s helm3 repo update
&lt;span class="nv"&gt;K8S_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;microk8s kubectl version &lt;span class="nt"&gt;-o&lt;/span&gt; json | python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"import sys,json,re; v=json.load(sys.stdin)['serverVersion']['gitVersion'].lstrip('v'); print(re.split(r'[+&lt;/span&gt;&lt;span class="se"&gt;\-&lt;/span&gt;&lt;span class="s2"&gt;]',v)[0])"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
microk8s helm3 &lt;span class="nb"&gt;install &lt;/span&gt;hami hami-charts/hami &lt;span class="nt"&gt;--namespace&lt;/span&gt; kube-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; scheduler.kubeScheduler.imageTag&lt;span class="o"&gt;=&lt;/span&gt;v&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;K8S_VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; devicePlugin.nvidiaDriverPath&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/nvidia &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; scheduler.defaultSchedulerPolicy.gpuMemory&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; scheduler.defaultSchedulerPolicy.gpuCores&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# 3. Label your GPU node&lt;/span&gt;
microk8s kubectl label node &amp;lt;your-node-name&amp;gt; &lt;span class="nv"&gt;gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;on

&lt;span class="c"&gt;# 4. Deploy workers&lt;/span&gt;
microk8s kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; gpu_worker_a.yaml
microk8s kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; gpu_worker_b.yaml

&lt;span class="c"&gt;# 5. Watch the magic&lt;/span&gt;
microk8s kubectl logs &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpu-worker &lt;span class="nt"&gt;--prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; &amp;amp;
watch &lt;span class="nt"&gt;-n&lt;/span&gt; 2 nvidia-smi

&lt;span class="c"&gt;# 6. Verify HAMi's view of the split&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://localhost:31993/metrics | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"^#"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Final Thoughts — From One Infrastructure Nerd to Another
&lt;/h2&gt;

&lt;p&gt;I've been at this GPU sharing problem for some time. MIG was the dream but not the reality for most hardware budgets. Time-slicing was a band-aid. Our homebrew solution was genuinely good engineering, but it was always technical debt waiting to be paid.&lt;/p&gt;

&lt;p&gt;HAMi is the first thing I've found that genuinely plugs the gap — software-level VRAM isolation on commodity GPUs, K8s native, zero application changes, and built-in observability. It's not magic: the SM throttling is soft, the setup requires K8s knowledge, and there's still a ceiling on what you can pack onto a 10GB card. But it's real, it works, and it's an open CNCF project with active development.&lt;/p&gt;

&lt;p&gt;The fact that I found it at KubeCon, had a working PoC by 2AM, and was watching two pods cleanly share an RTX 3080 before I went to sleep — that's a pretty good endorsement.&lt;/p&gt;

&lt;p&gt;If you're running AI workloads on Kubernetes and you're wasting GPU budget on whole-GPU-per-pod allocations, give HAMi a look. Your platform budget will thank you.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>devops</category>
      <category>cloudnative</category>
    </item>
  </channel>
</rss>
