<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stelia Developers</title>
    <description>The latest articles on DEV Community by Stelia Developers (@steliadevs).</description>
    <link>https://dev.to/steliadevs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3806265%2F1ff16a74-a81d-487e-aca1-8c47291b88d3.jpg</url>
      <title>DEV Community: Stelia Developers</title>
      <link>https://dev.to/steliadevs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/steliadevs"/>
    <language>en</language>
    <item>
      <title>No more CPU fights: how we build truly isolated cloud compute</title>
      <dc:creator>Stelia Developers</dc:creator>
      <pubDate>Thu, 16 Apr 2026 08:08:32 +0000</pubDate>
      <link>https://dev.to/steliadevs/no-more-cpu-fights-how-we-build-truly-isolated-cloud-compute-4438</link>
      <guid>https://dev.to/steliadevs/no-more-cpu-fights-how-we-build-truly-isolated-cloud-compute-4438</guid>
      <description>&lt;p&gt;&lt;em&gt;by &lt;a href="https://www.linkedin.com/in/peter-bangert/" rel="noopener noreferrer"&gt;Peter Bangert&lt;/a&gt;, Senior Platform Engineer, Stelia&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At &lt;a href="//stelia.ai"&gt;Stelia AI&lt;/a&gt;, we have our own cloud infrastructure for self-hosting and self-delivery. Among our products — ranging from managed Kubernetes to Slurm — we provide an isolated GPU/compute instance comparable to EC2.&lt;/p&gt;

&lt;p&gt;Our goal is simple: deliver compute instances (read: virtual machines) with strict CPU ownership, NUMA locality, and zero cross-tenant contention, all inside Kubernetes.&lt;/p&gt;

&lt;p&gt;To accomplish this, we run virtual machines inside pods and pass host file descriptors into the hypervisor to attach paravirtualised (virtio) networking and storage devices. This gives us the operational flexibility of Kubernetes while maintaining VM-level isolation.&lt;/p&gt;

&lt;p&gt;But doing this correctly requires careful control of CPU, memory, NUMA topology, and thread placement.&lt;/p&gt;

&lt;p&gt;Let’s walk through how we built it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This article is a technical guide. First, it introduces our tech stack (virtual machines running on Kubernetes) and the motivation behind these choices, particularly how we at Stelia AI see the future of the AI infrastructure industry. Next, it introduces the Kubernetes Topology Manager and the basics you need to set up your own NUMA-aware pod scheduling. Lastly, it goes a step further and covers our approach to ensuring vCPU thread isolation for our virtual machines.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Kubernetes
&lt;/h2&gt;

&lt;p&gt;There is a prevailing trend that AI/ML workloads are migrating to Kubernetes; &lt;a href="https://www.cncf.io/announcements/2026/01/20/kubernetes-established-as-the-de-facto-operating-system-for-ai-as-production-use-hits-82-in-2025-cncf-annual-cloud-native-survey/" rel="noopener noreferrer"&gt;CNCF reports&lt;/a&gt; that as of early 2026, 66% of organisations use Kubernetes for inferencing. &lt;/p&gt;

&lt;p&gt;While we at Stelia AI have experience with other infrastructure stacks, such as OpenStack or plain bare metal, we chose Kubernetes because it has evolved beyond a container orchestrator into a universal control plane. By leveraging Custom Resource Definitions (CRDs), we treat our isolated VMs as first-class objects, benefiting from K8s’ robust object serialisation, RBAC, and API versioning out of the box. And by using the “Reconciler” controller pattern, we move away from “fire-and-forget” imperative scripts/IaC to more fine-grained state control. &lt;/p&gt;
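&lt;p&gt;To give a flavour of what “VMs as first-class objects” means in practice, a custom resource for an instance might look like the sketch below. This is illustrative only; the &lt;code&gt;ComputeInstance&lt;/code&gt; kind, group, and field names are hypothetical, not our actual CRD schema.&lt;/p&gt;

```yaml
# Hypothetical custom resource; the kind, API group, and fields are
# illustrative assumptions, not Stelia's real schema.
apiVersion: compute.example.io/v1alpha1
kind: ComputeInstance
metadata:
  name: customer-vm-01
spec:
  cpus: 20            # exclusive cores (request == limit for Guaranteed QoS)
  memory: 64Gi
  hugepages: 64Gi
  gpus: 1
  numaPolicy: best-effort
```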

&lt;h2&gt;
  
  
  Why Virtual Machines
&lt;/h2&gt;

&lt;p&gt;By running virtual machines inside pods, we use Kubernetes as the management plane (scheduling, networking, and storage) and the VM as the execution boundary. Compared with giving tenants direct pod/container access, this hardens security by mitigating the escape risk of a syscall exploit, while PCIe passthrough or SR-IOV keeps performance near hardware-native.&lt;/p&gt;

&lt;h2&gt;
  
  
  NUMA Locality in Kubernetes
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reference: &lt;a href="https://www.redhat.com/en/blog/topology-aware-scheduling-in-kubernetes-part-1-the-high-level-business-case" rel="noopener noreferrer"&gt;Topology Aware Scheduling in Kubernetes&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Given our requirement to build infrastructure suitable for AI inferencing and training, we need to take lessons from classic HPC.&lt;/p&gt;

&lt;p&gt;Not all memory is created equal. Non-Uniform Memory Access (NUMA) means that a specific group of CPU cores has “local” access to a specific bank of RAM.&lt;/p&gt;

&lt;p&gt;The current &lt;a href="https://www.techpowerup.com/forums/threads/amd-recommends-epyc-processors-for-everyday-ai-server-tasks.334004/" rel="noopener noreferrer"&gt;gold standard for everyday AI workloads is the AMD EPYC&lt;/a&gt; dual-socket processor, a flexible workhorse that balances performance, efficiency, and security. Below is a diagram of a dual-socket system. Ideally, for HPC/AI workloads, you want your processes and threads running in a single socket, or NUMA node: this avoids the NUMA hop of requesting memory across the interconnect, which can cost 15–30% of performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcs7gn8w3gr026lkxl1ps.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcs7gn8w3gr026lkxl1ps.webp" alt="Redhat diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPU Locality:&lt;/strong&gt; In addition, GPUs are connected via the PCIe bus, so if your workload is running on the wrong socket, the data path between the CPU and GPU becomes a congested highway, resulting in jitter and tail latency.&lt;/p&gt;

&lt;p&gt;To add another dimension, the number of NUMA nodes per socket (NPS) is configurable on most systems via the UEFI/BIOS. Typical NPS levels are 1, 2, and 4, meaning on a dual-socket system you can configure up to 8 NUMA nodes. For our use case, we configure our systems with a single NUMA node per socket, so 2 in total.&lt;/p&gt;

&lt;h2&gt;
  
  
  But dual-socket systems aren’t that complex, so is this overkill?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi88bec1zxwdpfm9w60qx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi88bec1zxwdpfm9w60qx.jpg" alt="Supermicro meme"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In May 2023, Supermicro announced an &lt;a href="https://www.supermicro.com/en/pressreleases/supermicro-leads-industry-first-eight-socket-and-four-socket-servers-most-demanding" rel="noopener noreferrer"&gt;8-socket server based on Intel CPUs&lt;/a&gt;, so it seems dual sockets are only the beginning. The sooner we adapt our Kubernetes topology management to this fast-growing world of bigger servers and larger workloads, the more prepared we will be moving forward. &lt;/p&gt;

&lt;h2&gt;
  
  
  Solving Resource Contention
&lt;/h2&gt;

&lt;p&gt;When building and operating GPU compute resources, it's imperative that when a customer requests 20 CPUs for their virtual instance, they get 20 CPUs. To dedicate CPUs to instances, we need everything else running on the system to stay off those cores. &lt;/p&gt;

&lt;p&gt;The traditional approach was the &lt;code&gt;isolcpus&lt;/code&gt; kernel argument, but that is deprecated; cpusets in cgroups are the way to go.&lt;/p&gt;

&lt;p&gt;Previously, when running our infrastructure on OpenStack or Bare-Metal, we had a tedious approach involving:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Creating specific cgroups for different process types (system, instances, user) &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Manual CPU pinning (taskset)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;cpuset manipulation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because we now operate inside Kubernetes, we can instead leverage three resource managers and the component that coordinates them:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;CPU Manager&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory Manager&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Device Manager&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Coordinated by the &lt;strong&gt;Topology Manager&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Kubernetes Topology Manager Deep Dive
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reference: &lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/" rel="noopener noreferrer"&gt;Control Topology Management Policies on a node&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The Topology Manager coordinates CPU, memory, and device allocation to ensure NUMA alignment. It acts as a central coordinator that polls three specific sub-managers to see what they have available:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/" rel="noopener noreferrer"&gt;&lt;strong&gt;CPU Manager&lt;/strong&gt;&lt;/a&gt;: Decides which specific cores to assign (for &lt;code&gt;Guaranteed&lt;/code&gt; QoS pods). &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.io/docs/tasks/administer-cluster/memory-manager/" rel="noopener noreferrer"&gt;&lt;strong&gt;Memory Manager&lt;/strong&gt;&lt;/a&gt;: Handles pinning memory pages to specific NUMA nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Device Manager&lt;/strong&gt;: Manages high-speed peripherals like GPUs or SR-IOV NICs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each manager sends back a &lt;strong&gt;Topology Hint&lt;/strong&gt;, which is a bitmask saying, “I can satisfy this request using NUMA node 0 or node 1.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0ggudqxzc93goevxh5w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo0ggudqxzc93goevxh5w.jpeg" alt="Kubernetes memory manager diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Topology Manager then merges these hints according to a configured policy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Topology Manager Policies
&lt;/h2&gt;

&lt;p&gt;The Topology Manager's responsibility is to coordinate how resources are allocated across NUMA nodes. Its goal is to align CPUs, memory, and devices so workloads avoid costly cross-socket communication, but there is a tunable policy to dictate its strictness.&lt;/p&gt;

&lt;p&gt;The Topology Manager supports several policies:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;none&lt;/code&gt;: No NUMA alignment is attempted. This is the default behaviour, and resources may come from any NUMA node.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;best-effort&lt;/code&gt;: Attempts to align resources on the same NUMA node, but does not strictly enforce it. If alignment is not possible, the pod is still admitted.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;restricted&lt;/code&gt;: Requires NUMA alignment when possible. If alignment cannot be achieved, the pod is rejected.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;single-numa-node&lt;/code&gt;: The strictest policy. All requested resources must come from a single NUMA node, or the pod will not be scheduled.&lt;/p&gt;
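&lt;p&gt;The merge step plus the policy decision can be sketched in a few lines. This is a conceptual illustration of hint merging, not kubelet’s actual implementation:&lt;/p&gt;

```rust
// Conceptual sketch of Topology Manager hint merging; not kubelet's code.
// Each provider (CPU, memory, device manager) proposes a bitmask of NUMA
// nodes that can satisfy the request; the merged hint is the intersection.

#[derive(PartialEq)]
enum Policy {
    BestEffort,
    Restricted,
}

/// Intersect one hint bitmask from each provider. Bit n set means
/// "NUMA node n works for me"; a merged value of 0 means no alignment.
fn merge_hints(hints: &[u8]) -> u8 {
    hints.iter().fold(0xFF, |acc, h| acc & h)
}

/// Decide admission: restricted rejects unaligned pods, best-effort admits.
fn admit(policy: Policy, merged: u8) -> bool {
    merged != 0 || policy == Policy::BestEffort
}

fn main() {
    // CPUs free on node 0, memory on either node, GPU wired to node 0:
    let merged = merge_hints(&[0b01, 0b11, 0b01]);
    assert_eq!(merged, 0b01); // everything can align on NUMA node 0

    // No common node: restricted rejects the pod, best-effort still admits.
    let clash = merge_hints(&[0b01, 0b10]);
    assert!(!admit(Policy::Restricted, clash));
    assert!(admit(Policy::BestEffort, clash));
}
```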

&lt;h2&gt;
  
  
  Our Policy
&lt;/h2&gt;

&lt;p&gt;Our goal is to provide performance-optimised resource allocation whenever possible, but ultimately we need the flexibility to accommodate instance requests that span well beyond a single NUMA node; therefore, we chose the &lt;code&gt;best-effort&lt;/code&gt; policy. &lt;/p&gt;

&lt;p&gt;This is the configuration we are currently using in our kubelet configuration. Please consult the Kubernetes documentation linked in this article to understand the purpose of the reserved memory/CPU settings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;cpuManagerPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;static&lt;/span&gt;
&lt;span class="na"&gt;memoryManagerPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Static&lt;/span&gt;
&lt;span class="na"&gt;topologyManagerPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;best-effort&lt;/span&gt;
&lt;span class="na"&gt;topologyManagerScope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pod&lt;/span&gt;
&lt;span class="na"&gt;reservedSystemCPUs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0,64"&lt;/span&gt;
&lt;span class="na"&gt;kubeReserved&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3"&lt;/span&gt;
  &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;
&lt;span class="na"&gt;systemReserved&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
  &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1Gi"&lt;/span&gt;
&lt;span class="na"&gt;evictionHard&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;memory.available&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1Gi"&lt;/span&gt;
  &lt;span class="na"&gt;nodefs.available&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10%"&lt;/span&gt;
  &lt;span class="na"&gt;imagefs.available&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;15%"&lt;/span&gt;
&lt;span class="na"&gt;reservedMemory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;numaNode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;numaNode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Important: Guaranteed QoS requirement
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Reference: &lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/" rel="noopener noreferrer"&gt;Kubernetes Pod Quality of Service Classes&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Topology Manager alignment only works reliably for pods in the &lt;code&gt;Guaranteed&lt;/code&gt; QoS class.&lt;/p&gt;

&lt;p&gt;This requires that CPU and memory requests exactly match their limits, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4"&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8Gi"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a pod is in the &lt;code&gt;Guaranteed&lt;/code&gt; class:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;strong&gt;CPU Manager&lt;/strong&gt; assigns exclusive CPU cores&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;Memory Manager&lt;/strong&gt; can enforce NUMA-local allocations&lt;/li&gt;
&lt;li&gt;the &lt;strong&gt;Topology Manager&lt;/strong&gt; can coordinate resource placement across subsystems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pods in the &lt;code&gt;Burstable&lt;/code&gt; or &lt;code&gt;BestEffort&lt;/code&gt; QoS classes do not receive these guarantees, and their resources may be spread across NUMA nodes even when a topology policy is enabled.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hidden problem: not all processes belong to the customer
&lt;/h2&gt;

&lt;p&gt;If we look at all the running processes within the container that manages the hypervisor responsible for running the customer's virtual machine, we see something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;gpu-instance:/#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ps
&lt;span class="go"&gt;PID   USER     TIME  COMMAND
    1 root      0:00 instance-runner
   24 dnsmasq   0:00 dnsmasq 
   30 root      5:08 qemu-system-x86_64 
   35 root      5:08 virtiofsd
   37 root      0:00 bash
   43 root      0:00 ps
&lt;/span&gt;&lt;span class="gp"&gt;gpu-instance:/#&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even though Kubernetes gave the pod exclusive CPUs, all processes inside the pod share those CPUs.&lt;/p&gt;

&lt;p&gt;That includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;vCPU threads (customer workload)&lt;/li&gt;
&lt;li&gt;I/O threads&lt;/li&gt;
&lt;li&gt;QEMU management threads&lt;/li&gt;
&lt;li&gt;Runtime threads from our controller&lt;/li&gt;
&lt;li&gt;virtiofsd (shared filesystem)&lt;/li&gt;
&lt;li&gt;dnsmasq (DHCP server)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We needed isolation &lt;em&gt;inside&lt;/em&gt; the container. Ideally, everything besides the QEMU vCPU threads would run on other cores, meaning we need to move those processes to another cgroup.&lt;/p&gt;

&lt;h2&gt;
  
  
  About Control Group v2
&lt;/h2&gt;

&lt;p&gt;As we discussed previously regarding the Pod Quality of Service classes, these classes are realised logically as different cgroups by kubelet. Cgroups are a Linux kernel feature that allows processes to be organised into hierarchical groups to limit, account for, and isolate resource usage. &lt;br&gt;
Let's take a look at the cgroups currently present on our worker node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/sys/fs/cgroup/kubepods.slice
|   |-- kubepods-besteffort.slice
|   |   |-- kubepods-besteffort-pod0233c9aa_e18f_4614_97ba_94606228ec2f.slice
|   |   |   |-- cri-containerd-56b896a784d2bcb4c4a1877bc9350e0c412d9af98ec185ad3989aad78da0fead.scope
|   |   |   `-- cri-containerd-f486e076c03e5d720e7cf05abafd4456bff78d250ee896014ce79d65fd631d7b.scope
|   |   |-- kubepods-besteffort-pod232fe706_7428_4502_b880_19f28aa8ca3d.slice
|   |   |   |-- cri-containerd-4623cf1d300610228d5926a1f0c532dc9dd61db027d1d89a47a188e04871a73c.scope
|   |   |   |-- cri-containerd-72a8fb24580415a79e1b1b1142e6235ef61ca8ec2c2e7fec08da9574aea21810.scope
|   |-- kubepods-burstable.slice
|   |   |-- kubepods-burstable-pod332102fa_8018_4db4_9acc_50dd2f3a3460.slice
|   |   |   |-- cri-containerd-4fde7a33f9dc03c0d52d582266597e8f89d4d2bed6fd27232709eeb2dd34be0c.scope
|   |   |   |-- cri-containerd-7aca7509f4d290a8bcb7280c929cc85411fc803302dccf768aad3d937a169e95.scope
|   |-- kubepods-pod3ceda12c_3ee9_40fa_9f42_61428fe654a6.slice
|   |   |-- cri-containerd-0e16c65fb6a8d15dbd90aa3f1e07f6ba709d1adb47eee0af5aa6d1f92ffd4d04.scope
|   |   |-- cri-containerd-b02c37e654f2f8356cf2641f75a94194ecf3b8dbdb38d61de3644a67e49045a2.scope
|   |   `-- cri-containerd-eb6218013c1ba67fc5288216059150c3d8946da7ad6dd8cfdc7516621d80c260.scope
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we see the QoS classes we previously described realised as cgroup &lt;em&gt;slices&lt;/em&gt; (the unit of cgroups managed by systemd): best-effort and burstable pods live under their own slices, while guaranteed pods sit directly under &lt;code&gt;kubepods.slice&lt;/code&gt;. Within each slice we see pods, and within those pods, the individual containers. &lt;/p&gt;

&lt;p&gt;For burstable and best-effort pod cgroups, you won't see much in terms of resource definitions, since kubelet and the topology manager don't provide any guarantees on resource isolation; however, you will see limits in place, such as &lt;code&gt;memory.max&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;root@worker1:/sys/fs/cgroup#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod332102fa_8018_4db4_9acc_50dd2f3a3460.slice/cri-containerd-4fde7a33f9dc03c0d52d582266597e8f89d4d2bed6fd27232709eeb2dd34be0c.scope/memory.max
&lt;span class="go"&gt;1073741824
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, for our guaranteed pods, you will see something more interesting: &lt;code&gt;cpuset.cpus&lt;/code&gt; defines the specific CPUs kubelet reserves for the instance (in this case 4 CPUs, 14–17):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;root@worker1:/sys/fs/cgroup#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;kubepods.slice/kubepods-pod416de7c2_21de_472d_817c_fa9d4306cb7d.slice/cri-containerd-5baab67f1daf9297c60974e50ad9f94f288077161a7c41376478e79aae671e07.scope/cpuset.cpus
&lt;span class="go"&gt;14-17
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a guaranteed pod is created, kubelet consults its record of currently guaranteed CPUs in &lt;code&gt;/var/lib/kubelet/cpu_manager_state&lt;/code&gt;, subtracts the requested number of CPUs from its &lt;code&gt;defaultCpuSet&lt;/code&gt;, allocates them to the requesting pod, and then removes those CPUs from every burstable and best-effort pod's cgroup. &lt;br&gt;
Let's take a look. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Currently our kubelet’s cpu_manager_state shows that CPUs 2–11 on this 20-core system are available for shared use, and the rest are reserved for instances.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"policyName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"static"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"defaultCpuSet"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2-11"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"entries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"3ceda12c-3ee9-40fa-9f42-61428fe654a6"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"instance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0-1,12-13"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"416de7c2-21de-472d-817c-fa9d4306cb7d"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"instance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"14-17"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"b41f38d8-2c21-4168-af77-6d0cc5482dcb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"instance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"18-19"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"checksum"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3944432841&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;If we look at the cpusets allocated to the burstable and best-effort cgroups, they should coincide with the default cpuset:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;root@worker1:/sys/fs/cgroup#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod0233c9aa_e18f_4614_97ba_94606228ec2f.slice/cri-containerd-f486e076c03e5d720e7cf05abafd4456bff78d250ee896014ce79d65fd631d7b.scope/cpuset.cpus
&lt;span class="go"&gt;2-11
&lt;/span&gt;&lt;span class="gp"&gt;root@worker1:/sys/fs/cgroup#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod332102fa_8018_4db4_9acc_50dd2f3a3460.slice/cri-containerd-4fde7a33f9dc03c0d52d582266597e8f89d4d2bed6fd27232709eeb2dd34be0c.scope/cpuset.cpus
&lt;span class="go"&gt;2-11
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
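&lt;p&gt;The bookkeeping above is easy to mirror in code. Below is a small sketch (illustrative, not our production code) that parses cpuset list strings such as &lt;code&gt;"2-11"&lt;/code&gt; or &lt;code&gt;"0-1,12-13"&lt;/code&gt; (the format used by both &lt;code&gt;cpu_manager_state&lt;/code&gt; and &lt;code&gt;cpuset.cpus&lt;/code&gt;) and checks that an instance's allocation never overlaps the shared pool:&lt;/p&gt;

```rust
// Sketch: parse cpuset list strings ("2-11", "0-1,12-13") as found in
// cpu_manager_state and cpuset.cpus, and verify that guaranteed instance
// allocations are disjoint from the shared defaultCpuSet.

use std::collections::BTreeSet;

/// Expand a cpuset list string into the set of CPU ids it names.
fn parse_cpuset(list: &str) -> BTreeSet<u32> {
    let mut cpus = BTreeSet::new();
    for part in list.split(',').filter(|p| !p.trim().is_empty()) {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let lo: u32 = lo.trim().parse().expect("bad range start");
                let hi: u32 = hi.trim().parse().expect("bad range end");
                cpus.extend(lo..=hi); // inclusive range, e.g. 2-11
            }
            None => {
                cpus.insert(part.trim().parse().expect("bad cpu id"));
            }
        }
    }
    cpus
}

fn main() {
    let shared = parse_cpuset("2-11");          // defaultCpuSet
    let instance = parse_cpuset("0-1,12-13");   // one guaranteed entry
    assert_eq!(shared.len(), 10);
    assert_eq!(instance.len(), 4);
    // The CPU Manager guarantees exclusivity: no overlap with the pool.
    assert!(shared.is_disjoint(&instance));
    println!("shared pool: {:?}", shared);
}
```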

&lt;h2&gt;
  
  
  So how can we isolate the threads within our cgroup from one another?
&lt;/h2&gt;

&lt;p&gt;Use &lt;a href="https://docs.kernel.org/admin-guide/cgroup-v2.html#threads" rel="noopener noreferrer"&gt;threaded cgroups&lt;/a&gt;?&lt;/p&gt;

&lt;p&gt;A simple solution to isolate our worker threads from all the other processes running within our Pod would be to create two child cgroups of the cgroup our Pod belongs to, make them threaded cgroups, and split the threads between them.&lt;/p&gt;
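&lt;p&gt;Mechanically, the cgroup v2 interface for this is just file writes: create the child directory, then write &lt;code&gt;threaded&lt;/code&gt; into its &lt;code&gt;cgroup.type&lt;/code&gt; file. The sketch below takes a caller-supplied root path, so it can be exercised against a scratch directory rather than a live cgroup mount (where the kernel, not us, creates &lt;code&gt;cgroup.type&lt;/code&gt;):&lt;/p&gt;

```rust
// Sketch of the cgroup v2 file operations for creating a threaded child
// cgroup. `root` is caller-supplied so the layout can be exercised against
// a scratch directory instead of a real /sys/fs/cgroup mount.

use std::fs;
use std::io;
use std::path::{Path, PathBuf};

fn make_threaded_child(root: &Path, name: &str) -> io::Result<PathBuf> {
    let dir = root.join(name);
    fs::create_dir_all(&dir)?;
    // On a real cgroup2 hierarchy this write flips the cgroup into
    // threaded mode; the kernel rejects it if a parent has enabled a
    // domain controller with no threaded support.
    fs::write(dir.join("cgroup.type"), "threaded\n")?;
    Ok(dir)
}

fn main() -> io::Result<()> {
    // Demonstrate against a scratch directory, not a live cgroup mount.
    let root = std::env::temp_dir().join("cgroup-demo");
    fs::create_dir_all(&root)?;
    let vcpu = make_threaded_child(&root, "vcpu")?;
    assert_eq!(fs::read_to_string(vcpu.join("cgroup.type"))?, "threaded\n");
    Ok(())
}
```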

&lt;p&gt;However, there is a limitation within Control Groups v2: when certain domain controllers are enabled, child cgroups cannot be made threaded, because those controllers do not support thread-granularity resource control. &lt;/p&gt;

&lt;p&gt;Currently, the following controllers are threaded and can be enabled in a threaded cgroup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cpu&lt;/li&gt;
&lt;li&gt;cpuset&lt;/li&gt;
&lt;li&gt;perf_event&lt;/li&gt;
&lt;li&gt;pids&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What this means is that, since we have the Memory Manager enabled (which relies on the memory and hugetlb controllers, neither of which is threaded), this is not an option.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;root@worker1:/sys/fs/cgroup#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;kubepods.slice/kubepods-pod416de7c2_21de_472d_817c_fa9d4306cb7d.slice/cgroup.subtree_control
&lt;span class="go"&gt;cpuset cpu io memory hugetlb pids rdma misc
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a result of these limitations, in order to accomplish our goal, we must develop our own solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Instance Cgroup Controller
&lt;/h2&gt;

&lt;p&gt;To outline the overall approach, first we will define the goals and criteria for what we want to accomplish.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an Instance Cgroup controller, which we will deploy as a DaemonSet on all worker nodes&lt;/li&gt;
&lt;li&gt;The Instance Cgroup controller will manage a float cgroup and allow instances to register new cgroups&lt;/li&gt;
&lt;li&gt;New cgroup registration will take parameters read from the current instance's cgroup (cpuset)&lt;/li&gt;
&lt;li&gt;Develop our Instance Runner to then move all processes to the newly registered cgroup&lt;/li&gt;
&lt;li&gt;All non-QEMU worker pids/threads should be placed within the float cgroup, and all vCPU threads should be placed in the registered cgroup&lt;/li&gt;
&lt;li&gt;During teardown, deregister the cgroup and move pids back to their original cgroup&lt;/li&gt;
&lt;/ol&gt;
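&lt;p&gt;Goal 5 requires telling vCPU threads apart from QEMU’s I/O and management threads. One workable signal (assuming QEMU’s convention of naming KVM vCPU threads like “CPU 0/KVM”, which you should verify against your QEMU version) is the per-thread &lt;code&gt;comm&lt;/code&gt; value under &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/task/&amp;lt;tid&amp;gt;/comm&lt;/code&gt;. A sketch of such a classifier:&lt;/p&gt;

```rust
// Sketch of a vCPU thread classifier. This assumes QEMU's thread-naming
// convention, where KVM vCPU threads get a comm value like "CPU 0/KVM";
// verify against your QEMU version before relying on it.

/// True if a per-thread comm value names a KVM vCPU thread.
fn is_vcpu_thread(comm: &str) -> bool {
    let comm = comm.trim(); // comm files end with a trailing newline
    comm.starts_with("CPU ") && comm.ends_with("/KVM")
}

fn main() {
    assert!(is_vcpu_thread("CPU 0/KVM\n"));
    assert!(is_vcpu_thread("CPU 17/KVM"));
    // Everything else belongs in the float cgroup.
    assert!(!is_vcpu_thread("qemu-system-x86\n"));
    assert!(!is_vcpu_thread("virtiofsd"));
    assert!(!is_vcpu_thread("dnsmasq"));
}
```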

&lt;p&gt;An architecture of this approach can be seen below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybv7wmaon4b5geilwqgm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybv7wmaon4b5geilwqgm.png" alt="Instance Cgroup controller architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Set up Instance Cgroup Controller
&lt;/h2&gt;

&lt;p&gt;The Instance Cgroup Controller is responsible for managing CPU topology and cgroup placement for all virtual machine instances running on a node. Conceptually, it acts as a local resource coordinator that sits between the instance lifecycle manager and the Linux kernel’s cgroup interfaces.&lt;/p&gt;

&lt;p&gt;The primary goals of the controller are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create and maintain the cgroup hierarchy for float processes and instances&lt;/li&gt;
&lt;li&gt;Provide a registration and deregistration API for the Instance Runner&lt;/li&gt;
&lt;li&gt;Reconcile cpusets and cgroups statelessly, and manage cleanup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Establishing the Cgroup Hierarchy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The controller manages a dedicated subtree under the system cgroup hierarchy. The structure looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;steliа/
 ├─ instance-&amp;lt;pod-uuid&amp;gt;/
 ├─ float/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;stelia/&lt;/code&gt; is the root cgroup managed by our runtime.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;instance-&amp;lt;pod-uuid&amp;gt;/&lt;/code&gt; represents the cgroup assigned to a specific VM instance.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;float/&lt;/code&gt; is a shared execution pool where processes temporarily run before they are pinned to dedicated CPUs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next step will be to enable the required domain controllers for this cgroup hierarchy and then make these sub-cgroups (float, instances) threaded cgroup types. &lt;/p&gt;

&lt;p&gt;It is important to note that we will only enable the cpuset domain controller in our cgroup hierarchy, as we allocate hugepages to the virtual machine. Hugepage allocation comes out of kubelet's allocation and remains static, because it is bound to the process at start time, so it is not affected when the pid is moved. The stelia cgroup also has no hugetlb controller enabled (one doesn't exist for threaded cgroups), so the kernel will not touch these resources.&lt;/p&gt;

&lt;p&gt;Setting this up in Rust will appear as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;io&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;path&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="cd"&gt;/// Ensure the stelia cgroup topology exists and is configured correctly.&lt;/span&gt;
&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;ensure_topology&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;stelia_cgroup&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// e.g. /sys/fs/cgroup/stelia&lt;/span&gt;
    &lt;span class="n"&gt;float_cgroup&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// e.g. /sys/fs/cgroup/stelia/float&lt;/span&gt;
    &lt;span class="n"&gt;total_cpus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// e.g. "0-19 (from cpu_manager_state)"&lt;/span&gt;
    &lt;span class="n"&gt;mems&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// e.g. "0,1" number of memory domains (from memory_manager_state)"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="c1"&gt;// Create the root stelia cgroup&lt;/span&gt;
    &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;create_dir_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stelia_cgroup&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Configure cpuset settings for the root cgroup&lt;/span&gt;
    &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stelia_cgroup&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cpuset.mems"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;mems&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stelia_cgroup&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cpuset.cpus"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;total_cpus&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Enable cpuset controller for children&lt;/span&gt;
    &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stelia_cgroup&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cgroup.subtree_control"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"+cpuset"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Create the shared "float" cgroup&lt;/span&gt;
    &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;create_dir_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;float_cgroup&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Configure the float cgroup&lt;/span&gt;
    &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;float_cgroup&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cgroup.type"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"threaded"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;float_cgroup&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cpuset.mems"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;mems&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;float_cgroup&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"cpuset.cpus"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;total_cpus&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Registration and Deregistration endpoints&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using JSON-RPC over a Unix domain socket is a good fit for communication between the Instance Runner and the Instance Cgroup Controller because it provides a lightweight RPC mechanism without introducing unnecessary networking overhead. Since both components run on the same node, there is no need to traverse the full IP or HTTP networking stack and they can instead communicate through a Unix filesystem socket. &lt;/p&gt;
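&lt;p&gt;As a rough illustration of this control channel, a minimal line-delimited server loop can be written over a Unix domain socket with only the standard library. This is a sketch: the socket path and stub response are hypothetical, and a real server would parse the request and dispatch on its &lt;code&gt;method&lt;/code&gt; field.&lt;/p&gt;

```rust
use std::io::{BufRead, BufReader, Write};
use std::os::unix::net::UnixListener;

/// Minimal line-delimited JSON-RPC loop over a Unix domain socket.
/// Sketch only: a real server parses the request and dispatches on "method".
fn serve(socket_path: &str) -> std::io::Result<()> {
    let listener = UnixListener::bind(socket_path)?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        let mut request = String::new();
        // Read one newline-terminated JSON-RPC request.
        BufReader::new(stream.try_clone()?).read_line(&mut request)?;
        // Reply with a stub result; the id is hardcoded for illustration.
        writeln!(stream, r#"{{"jsonrpc":"2.0","result":{{}},"id":1}}"#)?;
    }
    Ok(())
}
```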

&lt;p&gt;A simplified JSON-RPC request from the Instance Runner, which provides its pod uuid and cpuset, might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"jsonrpc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"registerCgroup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"uuid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cpuset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4-7"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The controller responds with the newly registered instance cgroup path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"jsonrpc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cgroup_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/sys/fs/cgroup/stelia/instance-123"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
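&lt;p&gt;On the controller side, the handler behind &lt;code&gt;registerCgroup&lt;/code&gt; can be sketched roughly as follows. This is a hypothetical helper following the hierarchy described above; error handling and validation of the requested cpuset against kubelet state are omitted.&lt;/p&gt;

```rust
use std::fs;
use std::io::Result;
use std::path::{Path, PathBuf};

/// Create and configure a threaded instance cgroup for a registering VM.
/// Sketch only; the real handler also validates the requested cpuset.
fn register_cgroup(stelia_cgroup: &Path, uuid: &str, cpuset: &str, mems: &str) -> Result<PathBuf> {
    let instance = stelia_cgroup.join(format!("instance-{uuid}"));
    fs::create_dir_all(&instance)?;
    // Threaded cgroups allow placing individual threads via cgroup.threads.
    fs::write(instance.join("cgroup.type"), "threaded")?;
    fs::write(instance.join("cpuset.mems"), mems)?;
    // Restrict the instance cgroup to the CPUs kubelet allocated to the pod.
    fs::write(instance.join("cpuset.cpus"), cpuset)?;
    Ok(instance)
}
```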



&lt;h2&gt;
  
  
  Reconciliation loop and cleanup
&lt;/h2&gt;

&lt;p&gt;Every time an event occurs (such as an instance registering or deregistering), the controller performs a reconciliation pass. The reconciliation process does two main things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Recalculate Float CPUs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The float cgroup is updated using the current &lt;code&gt;defaultCpuSet&lt;/code&gt; value from kubelet's &lt;code&gt;/var/lib/kubelet/cpu_manager_state&lt;/code&gt;. If the &lt;code&gt;defaultCpuSet&lt;/code&gt; differs from the cpuset currently assigned to the float cgroup, it is updated.&lt;/p&gt;

&lt;p&gt;This is done by writing the new value to &lt;code&gt;/sys/fs/cgroup/stelia/float/cpuset.cpus&lt;/code&gt;. An example &lt;code&gt;cpu_manager_state&lt;/code&gt; file looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"policyName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"static"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"defaultCpuSet"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2-11"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"entries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"3ceda12c-3ee9-40fa-9f42-61428fe654a6"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"instance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0-1,12-13"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"416de7c2-21de-472d-817c-fa9d4306cb7d"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"instance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"14-17"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"b41f38d8-2c21-4168-af77-6d0cc5482dcb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"instance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"18-19"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"checksum"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3944432841&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
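&lt;p&gt;A minimal sketch of the recalculation step, using a naive string scan for illustration (production code would deserialise the state file properly):&lt;/p&gt;

```rust
/// Extract the "defaultCpuSet" value from kubelet's cpu_manager_state JSON.
/// Naive string extraction for illustration only; a real implementation
/// would use a JSON parser.
fn default_cpuset(state: &str) -> Option<String> {
    let key = "\"defaultCpuSet\":";
    let start = state.find(key)? + key.len();
    // Skip whitespace and the opening quote, then read up to the closing quote.
    let rest = state[start..].trim_start().strip_prefix('"')?;
    let end = rest.find('"')?;
    Some(rest[..end].to_string())
}
```

If the extracted value differs from the float cgroup's current &lt;code&gt;cpuset.cpus&lt;/code&gt;, the controller writes the new value to that file.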



&lt;ol start="2"&gt;
&lt;li&gt;Cleanup: remove stale cgroups&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Stale instance cgroups are cleaned up if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They do not appear in the kubelet CPU manager state&lt;/li&gt;
&lt;li&gt;They contain no processes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevents resource leaks if an instance crashes or kubelet restarts. If both &lt;code&gt;cgroup.procs&lt;/code&gt; and &lt;code&gt;cgroup.threads&lt;/code&gt; are empty, the cgroup can be considered stale and safely removed.&lt;/p&gt;
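&lt;p&gt;A sketch of this staleness check (assuming cgroup v2, where &lt;code&gt;rmdir&lt;/code&gt; succeeds on an empty cgroup even though its control files are still visible):&lt;/p&gt;

```rust
use std::fs;
use std::path::Path;

/// Remove an instance cgroup if it holds no processes and no threads.
/// Returns true if the cgroup was removed.
fn remove_if_stale(instance_cgroup: &Path) -> std::io::Result<bool> {
    let procs = fs::read_to_string(instance_cgroup.join("cgroup.procs"))?;
    let threads = fs::read_to_string(instance_cgroup.join("cgroup.threads"))?;
    if procs.trim().is_empty() && threads.trim().is_empty() {
        // On a real cgroup hierarchy, rmdir removes the empty cgroup.
        fs::remove_dir(instance_cgroup)?;
        return Ok(true);
    }
    Ok(false)
}
```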

&lt;h2&gt;
  
  
  Step 2: manage processes from Instance Runner
&lt;/h2&gt;

&lt;p&gt;Once the cgroup hierarchy and CPU topology are established, the Instance Runner becomes responsible for placing VM processes into the correct cgroups and assigning them CPUs.&lt;/p&gt;

&lt;p&gt;The process typically involves the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Register with the Instance Cgroup controller&lt;/li&gt;
&lt;li&gt;Move all PIDs to the float cgroup&lt;/li&gt;
&lt;li&gt;Discover QEMU worker threads&lt;/li&gt;
&lt;li&gt;Move QEMU worker threads to the instance cgroup&lt;/li&gt;
&lt;li&gt;Pin QEMU worker threads to specific CPUs&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Registering with the controller
&lt;/h2&gt;

&lt;p&gt;The Instance Runner is able to discover its pod-uuid and the cpuset allocated to this container/pod by reading from &lt;code&gt;/proc/self/cpuset&lt;/code&gt; as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;gpu-instance:/#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/self/cpuset
&lt;span class="go"&gt;/kubepods.slice/kubepods-podb9e4f71c_64a5_4f8d_8574_41363cbaf8e5.slice/cri-containerd-506962bc49f80b0ae7461edb1dbefdd626c4ece2b4bf103747a402982e91bf39.scope
&lt;/span&gt;&lt;span class="gp"&gt;gpu-instance:/#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /sys/fs/cgroup/kubepods.slice/kubepods-podb9e4f71c_64a5_4f8d_8574_41363cbaf8e5.slice/cri-containerd-506962bc49f80b0ae7461edb1dbefdd626c4ece2b4bf103747a402982e91bf39.scope/cpuset.cpus
&lt;span class="go"&gt;18-19
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
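&lt;p&gt;From this cgroup path, the pod UID can be recovered with a small helper. This is a hypothetical sketch: it assumes the systemd cgroup driver's &lt;code&gt;kubepods-pod&amp;lt;uid&amp;gt;.slice&lt;/code&gt; naming, in which &lt;code&gt;-&lt;/code&gt; is encoded as &lt;code&gt;_&lt;/code&gt;.&lt;/p&gt;

```rust
/// Extract the pod UID from a kubepods cgroup path such as
/// "/kubepods.slice/kubepods-pod<uid>.slice/cri-containerd-<id>.scope".
/// Assumes the systemd cgroup driver's naming scheme.
fn pod_uid_from_cgroup_path(path: &str) -> Option<String> {
    let start = path.find("kubepods-pod")? + "kubepods-pod".len();
    let rest = &path[start..];
    let end = rest.find(".slice")?;
    // kubelet replaces '-' with '_' in systemd unit names; reverse that.
    Some(rest[..end].replace('_', "-"))
}
```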



&lt;p&gt;With this information, the Instance Runner can construct a registration request over a JSON-RPC Unix domain socket, which is provided as a volume mount to the Instance Runner pod. This provides a lightweight control plane without requiring shared state between components.&lt;/p&gt;

&lt;p&gt;A typical registration request looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"jsonrpc"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"registerCgroup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"uuid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"123"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cpuset"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4-7"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response returns the assigned instance cgroup and CPU allocation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Moving processes into the float cgroup
&lt;/h2&gt;

&lt;p&gt;Having received our newly registered cgroup, we first move all processes within the container to the float cgroup (which runs on shared CPUs), with the goal of later moving the vCPU threads to the CPU-isolated cgroup.&lt;/p&gt;

&lt;p&gt;Let's take a look back at our process namespace from before. This is a list of all running processes in our container.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;gpu-instance:/#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ps
&lt;span class="go"&gt;PID   USER     TIME  COMMAND
    1 root      0:00 instance-runner
   24 dnsmasq   0:00 dnsmasq 
   30 root      5:08 qemu-system-x86_64 
   35 root      5:08 virtiofsd
   37 root      0:00 bash
   43 root      0:00 ps
&lt;/span&gt;&lt;span class="gp"&gt;gpu-instance:/#&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The simplest way to move all processes to another cgroup is to move pid 1 (the main process) before spawning any subprocesses; that way, all spawned processes inherit the parent's cgroup.&lt;/p&gt;

&lt;p&gt;So to move the main process to the float cgroup, this is as simple as writing the pid to the float cgroup's &lt;code&gt;cgroup.procs&lt;/code&gt; file.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;1 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /sys/fs/cgroup/stelia/float/cgroup.procs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
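&lt;p&gt;In the Instance Runner this is a one-line write, done at startup before any subprocesses are spawned. The cgroup path is parameterised here for illustration; in practice it is the &lt;code&gt;float&lt;/code&gt; path returned by the controller.&lt;/p&gt;

```rust
use std::fs;
use std::path::Path;

/// Move the calling process into the float cgroup. Because cgroup membership
/// is inherited on fork, every subsequently spawned child lands there too.
fn join_float_cgroup(float_cgroup: &Path) -> std::io::Result<()> {
    fs::write(float_cgroup.join("cgroup.procs"), std::process::id().to_string())
}
```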



&lt;h2&gt;
  
  
  Caveat: preventing Kubernetes cgroup cleanup
&lt;/h2&gt;

&lt;p&gt;Because instances are launched inside Kubernetes pods, we must account for the kubelet’s cgroup reconciliation logic. The kubelet periodically cleans up cgroups that it believes are unused. If all processes are moved out of a container cgroup, the kubelet will delete it.&lt;/p&gt;

&lt;p&gt;To prevent this, we keep a placeholder process inside the original Kubernetes cgroup. For this, the Instance Runner will spawn an infinite sleep process within the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;process&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Command&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;spawn_placeholder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;io&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nn"&gt;Command&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sleep"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"infinity"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.spawn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This process never participates in the instance runtime but acts as a &lt;em&gt;cgroup anchor&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Discovering vCPU Threads via QMP
&lt;/h2&gt;

&lt;p&gt;Now that all processes have been moved into the float cgroup and can utilise all CPUs on the node, we need to isolate our vCPU ‘worker’ threads. How do we discover these threads?&lt;/p&gt;

&lt;p&gt;QEMU exposes runtime information through the QEMU Machine Protocol (QMP). We use this interface to discover the thread IDs for each virtual CPU. Specifically, we call the QMP command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;query-cpus-fast
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
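&lt;p&gt;An abbreviated &lt;code&gt;query-cpus-fast&lt;/code&gt; reply has the following shape (field names follow the QMP schema; the thread IDs shown are illustrative):&lt;/p&gt;

```json
{
  "return": [
    { "cpu-index": 0, "thread-id": 30124, "target": "x86_64", "qom-path": "/machine/unattached/device[0]" },
    { "cpu-index": 1, "thread-id": 30125, "target": "x86_64", "qom-path": "/machine/unattached/device[2]" }
  ]
}
```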



&lt;p&gt;This returns information about all vCPU threads. In Rust, this might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;try_get_vcpu_threads&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;qmp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;QmpClient&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qmp&lt;/span&gt;
        &lt;span class="nf"&gt;.execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"query-cpus-fast"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;.await&lt;/span&gt;
        &lt;span class="nf"&gt;.context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"query-cpus-fast failed"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;vcpus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;QmpCpuInfo&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_value&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nd"&gt;debug!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Found {} QEMU vCPU threads"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vcpus&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vcpus&lt;/span&gt;&lt;span class="nf"&gt;.into_iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="py"&gt;.thread_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.collect&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each returned &lt;code&gt;thread_id&lt;/code&gt; corresponds to a Linux thread representing a QEMU vCPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  Moving vCPU Threads to the instance cgroup
&lt;/h2&gt;

&lt;p&gt;Once the vCPU thread IDs are known, they are moved into the instance cgroup, similar to how we moved processes before, but by writing to &lt;code&gt;cgroup.threads&lt;/code&gt; instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &amp;lt;tid&amp;gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /sys/fs/cgroup/stelia/sgcinstance-xxx/cgroup.threads
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Pinning threads to specific CPUs
&lt;/h2&gt;

&lt;p&gt;After the threads are placed in the correct cgroup, the final step is to assign each thread a CPU affinity. This ensures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;minimal scheduling contention&lt;/li&gt;
&lt;li&gt;NUMA locality (as provided by kubelet allocation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example Rust implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;nix&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sched&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;sched_setaffinity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CpuSet&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;nix&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;unistd&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Pid&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;pin_thread&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;nix&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;CpuSet&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="nf"&gt;.set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;sched_setaffinity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Pid&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_raw&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;set&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each vCPU thread is pinned to one CPU from the instance’s allocated CPU set.&lt;/p&gt;
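&lt;p&gt;Mapping threads to CPUs requires expanding the allocated cpuset into an ordered list; a small helper for kubelet-style range strings (e.g. "0-1,12-13") might look like:&lt;/p&gt;

```rust
/// Expand a cpuset specification such as "0-1,12-13" into an ordered CPU
/// list, so vCPU thread N can be pinned to the N-th allocated CPU.
fn parse_cpuset(spec: &str) -> Option<Vec<usize>> {
    let mut cpus = Vec::new();
    for part in spec.split(',') {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let lo: usize = lo.trim().parse().ok()?;
                let hi: usize = hi.trim().parse().ok()?;
                cpus.extend(lo..=hi);
            }
            None => cpus.push(part.trim().parse().ok()?),
        }
    }
    Some(cpus)
}
```

Zipping this list with the thread IDs returned by &lt;code&gt;query-cpus-fast&lt;/code&gt; yields the (tid, cpu) pairs passed to &lt;code&gt;pin_thread&lt;/code&gt;.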

&lt;h2&gt;
  
  
  Resulting process layout
&lt;/h2&gt;

&lt;p&gt;After initialisation, the respective cgroups end up with a structure similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stelia-io/
├─ float/
│   ├─ qemu
│   ├─ virtiofsd
│   ├─ dnsmasq
│   └─ instance runner
│
└─ instance-123/
    └─ vcpu threads (pinned)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Meanwhile, the original Kubernetes pod cgroup still contains only the placeholder process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubepods/
└─ pod-abc123/
    └─ placeholder (sleep infinity)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach allows us to combine Kubernetes lifecycle management with low-level CPU isolation without fighting kubelet’s cgroup garbage-collection logic.&lt;/p&gt;
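&lt;p&gt;The migration step itself is mundane: under cgroup v2, moving a thread into a group is a single write of its TID into that group's &lt;code&gt;cgroup.threads&lt;/code&gt; file. A hedged sketch (paths and names are illustrative; the real controller handles errors, races, and controller delegation):&lt;/p&gt;

```rust
use std::fs;
use std::io;
use std::path::Path;

// Sketch: migrate a thread into a cgroup-v2 group by writing its TID
// into <root>/<group>/cgroup.threads. The root is a parameter so the
// function can be exercised against a scratch directory; on a real
// host it would be /sys/fs/cgroup.
fn migrate_thread(root: &Path, group: &str, tid: i32) -> io::Result<()> {
    fs::write(root.join(group).join("cgroup.threads"), tid.to_string())
}

fn main() -> io::Result<()> {
    // Dry run against a scratch directory standing in for /sys/fs/cgroup.
    let root = std::env::temp_dir().join("cgroup-demo");
    fs::create_dir_all(root.join("stelia-io/instance-123"))?;
    migrate_thread(&root, "stelia-io/instance-123", 4242)?;
    println!(
        "{}",
        fs::read_to_string(root.join("stelia-io/instance-123/cgroup.threads"))?
    );
    Ok(())
}
```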

&lt;h2&gt;
  
  
  Running on a dual-socket system
&lt;/h2&gt;

&lt;p&gt;So, after all this work, what do the results look like on an actual dual-socket system? After configuring the kubelet as described above on our AMD EPYC machine with 128 CPUs and 256 GiB of memory, let's first observe the NUMA domain of each CPU on the system. We have ~55k 2 MiB hugepages configured on each NUMA node (~108 GiB):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;NUMA:
  NUMA node(s):              2
  NUMA node0 CPU(s):         0-31,64-95
  NUMA node1 CPU(s):         32-63,96-127

&lt;/span&gt;&lt;span class="gp"&gt;root@gpu-node:~#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /sys/devices/system/node/node&lt;span class="k"&gt;*&lt;/span&gt;/hugepages/hugepages-2048kB/nr_hugepages
&lt;span class="go"&gt;55552
55296
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will start by creating an instance with 40 CPUs and 80 GiB of hugepages. If we look at the CPU and hugepage allocation from kubelet in our container's cgroup, we see that the instance's CPUs and hugepages are localised to NUMA node 0:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;root@gpu3:~#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /sys/fs/cgroup/kubepods.slice/kubepods-podce32ebd2_f905_449e_b74b_4231a84b88e1.slice/cri-containerd-8a7496cc1fc043b490ebf3c3ec51d939cbeb988febb88a7c96b1dc871ae6a4e4.scope/cpuset.cpus
&lt;span class="go"&gt;1-20,65-84
&lt;/span&gt;&lt;span class="gp"&gt;root@gpu-node:~#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /sys/fs/cgroup/kubepods.slice/kubepods-podce32ebd2_f905_449e_b74b_4231a84b88e1.slice/cri-containerd-8a7496cc1fc043b490ebf3c3ec51d939cbeb988febb88a7c96b1dc871ae6a4e4.scope/hugetlb.2MB.numa_stat
&lt;span class="go"&gt;total=8589934592 N0=8589934592 N1=0
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once our controllers have completed their cgroup registration and process migrations, we can run a script that reports, for each process, the cpuset it is allowed to run on and the CPU it is currently running on. Every auxiliary process is confined to the defaultCpuSet (all CPUs except those allocated to the instance), while exactly 40 QEMU vCPU threads are each pinned to a dedicated CPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;root@gpu3:~#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;./pid_affinity.sh
&lt;span class="go"&gt;
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="go"&gt;Inspecting: dnsmasq
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="go"&gt;
======================================================================
==== dnsmasq PID: 3794608 ====
======================================================================

==== dnsmasq cgroup ====
0::/stelia-io/float

==== dnsmasq main thread affinity ====
pid 3794608's current affinity list: 0,21-64,85-127

==== Listing dnsmasq threads ====

TID        THREAD_NAME               ALLOWED_CPUS         RUNNING_ON
---------- ------------------------- -------------------  ----------
3794608    dnsmasq                   0,21-64,85-127       39


&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="go"&gt;Inspecting: QEMU
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="go"&gt;
======================================================================
==== QEMU PID: 3794651 ====
======================================================================

==== QEMU cgroup ====
0::/stelia-io/float

==== QEMU main thread affinity ====
pid 3794651's current affinity list: 0,21-64,85-127

==== Listing QEMU threads ====

TID        THREAD_NAME               ALLOWED_CPUS         RUNNING_ON
---------- ------------------------- -------------------  ----------
3794651    qemu-system-x86           0,21-64,85-127       62
3794652    qemu-system-x86           0,21-64,85-127       124
3794654    qemu-system-x86           0,21-64,85-127       61
3794656    qemu-system-x86           1                    1
3794657    qemu-system-x86           2                    2
3794658    qemu-system-x86           3                    3
3794659    qemu-system-x86           4                    4
3794660    qemu-system-x86           5                    5
3794661    qemu-system-x86           6                    6
3794663    qemu-system-x86           7                    7
3794664    qemu-system-x86           8                    8
3794665    qemu-system-x86           9                    9
3794666    qemu-system-x86           10                   10
3794667    qemu-system-x86           11                   11
3794668    qemu-system-x86           12                   12
3794669    qemu-system-x86           13                   13
3794670    qemu-system-x86           14                   14
3794671    qemu-system-x86           15                   15
3794672    qemu-system-x86           16                   16
3794673    qemu-system-x86           17                   17
3794674    qemu-system-x86           18                   18
3794675    qemu-system-x86           19                   19
3794676    qemu-system-x86           20                   20
3794677    qemu-system-x86           65                   65
3794678    qemu-system-x86           66                   66
3794679    qemu-system-x86           67                   67
3794680    qemu-system-x86           68                   68
3794681    qemu-system-x86           69                   69
3794682    qemu-system-x86           70                   70
3794683    qemu-system-x86           71                   71
3794684    qemu-system-x86           72                   72
3794685    qemu-system-x86           73                   73
3794686    qemu-system-x86           74                   74
3794687    qemu-system-x86           75                   75
3794688    qemu-system-x86           76                   76
3794689    qemu-system-x86           77                   77
3794690    qemu-system-x86           78                   78
3794691    qemu-system-x86           79                   79
3794692    qemu-system-x86           80                   80
3794693    qemu-system-x86           81                   81
3794694    qemu-system-x86           82                   82
3794695    qemu-system-x86           83                   83
3794696    qemu-system-x86           84                   84
3794737    kvm-nx-lpage-re           0,21-64,85-127       52


&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="go"&gt;Inspecting: Instance Runner
&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="c"&gt;#####################################################################&lt;/span&gt;
&lt;span class="go"&gt;
======================================================================
==== Instance Runner PID: 3794442 ====
======================================================================

==== Instance Runner cgroup ====
0::/stelia-io/float

==== Instance Runner main thread affinity ====
pid 3794442's current affinity list: 0,21-64,85-127

==== Listing Instance Runner threads ====

TID        THREAD_NAME               ALLOWED_CPUS         RUNNING_ON
---------- ------------------------- -------------------  ----------
3794442    instance-runner           0,21-64,85-127       85
3794456    tokio-runtime-w           0,21-64,85-127       40
3794457    tokio-runtime-w           0,21-64,85-127       10
3794458    tokio-runtime-w           0,21-64,85-127       83
3794459    tokio-runtime-w           0,21-64,85-127       12
3794460    tokio-runtime-w           0,21-64,85-127       10
3794461    tokio-runtime-w           0,21-64,85-127       82

==== Done ====
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we showed how to run NUMA-localised, thread-isolated virtual machines in Kubernetes for AI and HPC workloads. By leveraging Kubernetes’ CPU, Memory, and Topology Managers, we preserve exclusive CPU ownership and NUMA locality. Our Instance Cgroup Controller and Instance Runner isolate vCPU threads from runtime processes, pinning them to the allocated CPUs. The result is deterministic performance, low latency, and strong tenant isolation, all while fully operating inside Kubernetes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations and future work
&lt;/h2&gt;

&lt;p&gt;NUMA localisation is critical for HPC and AI workloads. However, there is a known architectural limitation in Kubernetes:&lt;/p&gt;

&lt;h2&gt;
  
  
  Topology Manager complexity
&lt;/h2&gt;

&lt;p&gt;The Topology Manager merges NUMA hints using bitmask intersections. In worst-case scenarios, the merging algorithm has exponential characteristics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;O(2^n)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where &lt;em&gt;n&lt;/em&gt; is the number of NUMA nodes.&lt;/p&gt;

&lt;p&gt;Today this is manageable: most production systems are dual-socket (2 NUMA nodes), although AMD EPYC can expose several NUMA domains per socket depending on its NPS configuration.&lt;/p&gt;

&lt;p&gt;But future systems are changing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4-socket systems&lt;/li&gt;
&lt;li&gt;8-socket high-density compute nodes&lt;/li&gt;
&lt;li&gt;CXL memory expansion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If Kubernetes runs on 8-socket NUMA systems, the current merging algorithm may become computationally expensive during scheduling.&lt;/p&gt;

&lt;p&gt;This is currently a known upstream issue; the generated bitmask hints are also not filtered down to the preferred affinity assignments before merging:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kubernetes/kubernetes/issues/131738?source=post_page-----bdb41facb7e3---------------------------------------" rel="noopener noreferrer"&gt;Production-Grade Container Scheduling and Management - kubernetes&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we create a 1-GPU pod on a machine with 8 NUMA nodes (AMD CPU + NVIDIA 4090D), the hint providers for CPU, memory, hugepages, and GPU each generate approximately 255 hints. During the hint merging phase, the topology manager needs to evaluate 255⁴ (over 4.2 billion) possible hint combinations. In our testing, this process took nearly 21 minutes.&lt;/p&gt;
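&lt;p&gt;The arithmetic is easy to reproduce: each provider can emit up to 2^n - 1 non-empty NUMA bitmasks, and the merge phase walks the cross-product across providers, so the candidate count is the product of the per-provider hint counts. A back-of-envelope sketch (a model of the combinatorics, not the kubelet's actual code):&lt;/p&gt;

```rust
// Back-of-envelope model of Topology Manager hint merging: with one
// hint list per resource provider, the merge phase evaluates the
// cross-product of all lists, i.e. the product of their lengths.
fn merge_candidates(hints_per_provider: &[u64]) -> u64 {
    hints_per_provider.iter().product()
}

fn main() {
    // 2 NUMA nodes => up to 2^2 - 1 = 3 non-empty bitmasks per provider.
    // Four providers (CPU, memory, hugepages, GPU):
    println!("{}", merge_candidates(&[3, 3, 3, 3])); // 81
    // 8 NUMA nodes => 2^8 - 1 = 255 hints per provider:
    println!("{}", merge_candidates(&[255, 255, 255, 255])); // 4228250625
}
```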

</description>
      <category>architecture</category>
      <category>cloud</category>
      <category>kubernetes</category>
      <category>performance</category>
    </item>
    <item>
      <title>Why we chose Ceph as part of our storage-related solutions for production-scale AI</title>
      <dc:creator>Stelia Developers</dc:creator>
      <pubDate>Thu, 19 Mar 2026 13:23:01 +0000</pubDate>
      <link>https://dev.to/steliadevs/why-we-chose-ceph-as-part-of-our-storage-related-solutions-for-production-scale-ai-22c5</link>
      <guid>https://dev.to/steliadevs/why-we-chose-ceph-as-part-of-our-storage-related-solutions-for-production-scale-ai-22c5</guid>
      <description>&lt;p&gt;In the fast-paced world of DevOps and cloud infrastructure, there is a natural gravitation toward tools that offer instant gratification. We value the "Day 1" experience: the single binary download, the five-minute setup, and the immediate results. When a tool allows you to go from zero to a working prototype in the time it takes to drink a coffee, it gains adoption rapid-fire.&lt;/p&gt;

&lt;p&gt;However, when you are architecting modern AI-ready cloud infrastructure from the ground up, the laws of physics – and the definition of success – are fundamentally different. We aren't simply hosting static websites or lightweight user databases. We are building the high-throughput pipelines required to feed petabytes of training data into hungry H100/H200 GPU clusters. We are managing Retrieval-Augmented Generation (RAG) workflows where millisecond latency isn't just a metric; it’s the difference between a functional product and a failed user experience.&lt;/p&gt;

&lt;p&gt;In this high-stakes environment, the pressure to take infrastructure shortcuts is overwhelming. For years, the industry standard advice for object storage has been MinIO. If you ask a room full of startup technical leaders what to use for S3-compatible storage, their answer will be MinIO because it’s simple, fast, and works out of the box.&lt;/p&gt;

&lt;p&gt;And they are not wrong. MinIO is an impressive piece of engineering. It is incredibly fast and offers a developer experience that feels like magic on Day 1.&lt;/p&gt;

&lt;p&gt;But at &lt;a href="//stelia.ai"&gt;Stelia&lt;/a&gt;, we realised early on that we couldn't optimise for Day 1. We had to optimise for Day 1,000. We are building a fortress for organisations' models, not a playground for prototypes. When we examined the long-term trajectory of the storage landscape, we saw a divergence between the free code and the paid product that was becoming too wide to ignore.&lt;/p&gt;

&lt;p&gt;We faced a critical architectural choice: build our platform on technologies that offer ease of use but introduce significant supply chain risk, or choose the hard option and undertake the engineering rigour required to build on a true, community-governed foundation.&lt;/p&gt;

&lt;p&gt;We chose the hard option; we chose to invest in long-term durability. And as a result, we selected Ceph as one part of our storage-related solutions.&lt;/p&gt;

&lt;p&gt;Below, we outline why we made that decision, and why we believe it ensures organisations' data is safer, cheaper, and more performant with us in the long run.&lt;/p&gt;

&lt;h2&gt;
  
  
  The evolution of open source business models
&lt;/h2&gt;

&lt;p&gt;To understand why we moved away from the "easy" option, it is important to look at the business context without cynicism. Infrastructure companies need to monetise, and the "Open Core" model is a standard path. However, the strategies companies use to achieve profitability have profound downstream effects on the users building upon their software.&lt;/p&gt;

&lt;p&gt;Over the last few years, we have witnessed a slow, calculated pivot in the object storage market. This wasn't an overnight change. It was a gradual evolution that has made it increasingly difficult for infrastructure providers to rely on certain open-source projects without incurring massive enterprise licensing costs or legal complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The licensing complexity (AGPLv3)
&lt;/h2&gt;

&lt;p&gt;The first sign of this shift occurred in 2021, when the licensing landscape for MinIO changed from the permissive Apache 2.0 license to the GNU AGPLv3.&lt;/p&gt;

&lt;p&gt;For the uninitiated, the distinction between these licenses is massive. Apache 2.0 is the ‘do what you want, just give us credit’ license. It allows for broad innovation and integration without legal strings attached.&lt;/p&gt;

&lt;p&gt;AGPLv3, however, is designed to close the "SaaS loophole". It essentially states that if you modify the software and let users interact with it over a network (which is the definition of a cloud service), you must make your modified source code available to those users as well.&lt;/p&gt;

&lt;p&gt;For a hobbyist or a student, this distinction is irrelevant. But for a corporation building a proprietary AI platform, AGPLv3 must be assessed with caution. It introduces legal ambiguity. The question is: “Does linking our internal orchestration layers to the storage backend potentially require us to open-source our proprietary app?”&lt;/p&gt;

&lt;p&gt;The answer is "maybe." In the world of enterprise risk management, "maybe" is a stop sign. This licensing move forces many companies into a corner: purchase a commercial license to avoid the headache, or accept some compliance risks. We wanted a foundation where the legal ground wouldn't shift beneath our feet.&lt;/p&gt;

&lt;h2&gt;
  
  
  The feature gap
&lt;/h2&gt;

&lt;p&gt;Beyond the license, we began to notice a growing feature delta – a widening gap between what is available in the GitHub repository and what is sold in the enterprise binary.&lt;/p&gt;

&lt;p&gt;The most visible casualty of this shift was the Web Management Console. In earlier iterations, the open-source version provided a robust user interface for managing buckets, users, identity policies, and lifecycle rules. It was a true single pane of glass for administrators.&lt;/p&gt;

&lt;p&gt;Over time, however, the community version of this console was stripped down. Critical administrative features – such as OpenID Connect (OIDC) and LDAP integration for identity management, tiering configurations, and deep observability metrics – were removed or hidden behind the enterprise paywall. Today, the open-source console functions primarily as a file browser.&lt;/p&gt;

&lt;p&gt;If you want the full administrative suite to manage a multi-petabyte cluster, you are now expected to pay for the enterprise product. For us, this signalled that the open-source version was no longer viewed as a standalone product, but rather as a demo for the paid tier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Entering maintenance mode
&lt;/h2&gt;

&lt;p&gt;Perhaps the most challenging development for DevOps teams has been the operational friction introduced recently. With the open-source edition effectively entering what many in the community call "maintenance mode," the project has ceased to be a living, breathing foundation for new infrastructure.&lt;/p&gt;

&lt;p&gt;Innovation has been bifurcated. Performance tuning, AI-specific optimisations, and advanced replication features are increasingly channelled exclusively into the commercial product. Even more disruptive was the change in how binaries and Docker images are distributed.&lt;/p&gt;

&lt;p&gt;In a modern, containerised world, the inability to easily pull a verified, stable, and compliant image from a standard registry is a major hurdle. It forces teams to compile from source or rely on unverified third-party builds, introducing security risks into the supply chain. You cannot build a platform today on software that is essentially frozen in time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The alternative: Ceph - an open-source ecosystem
&lt;/h2&gt;

&lt;p&gt;When we decided to look for a different path, we turned to Ceph.&lt;/p&gt;

&lt;p&gt;Ceph is an open-source ecosystem, not just a product. Often described as the ‘Linux of Storage’, Ceph is a distributed storage platform that delivers Object, Block, and File storage on top of a single, unified data plane.&lt;/p&gt;

&lt;p&gt;The primary differentiator for us wasn't only the code; it was the governance.&lt;/p&gt;

&lt;p&gt;MinIO is controlled by a single corporation.&lt;/p&gt;

&lt;p&gt;Ceph, by contrast, is governed by the Ceph Foundation under the umbrella of the Linux Foundation. Its board includes representatives from industry giants like Red Hat, IBM, Canonical, and scientific organisations like CERN. There is no single leader who can wake up tomorrow and decide to deprecate the open-source version. The code truly belongs to the community.&lt;/p&gt;

&lt;p&gt;This governance structure aligns perfectly with our philosophy. We wanted a storage layer that would be as open and reliable in ten years as it is today.&lt;/p&gt;

&lt;p&gt;In fact, CERN is the ultimate showcase for Ceph. They don't just sit on the board; they rely on Ceph to manage over 100 petabytes of storage that underpins the IT infrastructure for the &lt;a href="https://home.cern/science/accelerators/large-hadron-collider" rel="noopener noreferrer"&gt;Large Hadron Collider&lt;/a&gt;. It is the high-performance backbone for their OpenStack cloud used by thousands of physicists to analyse particle collision data. For those sceptical about manageability, CERN's engineering team regularly publishes "Ten-year retrospective" talks on YouTube. These videos detail how a small team manages this massive, mission-critical environment using the exact same open-source code we use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical deep dive: architecture &amp;amp; data placement
&lt;/h2&gt;

&lt;p&gt;Governance aside, the technical differences between Ceph and its competitors are profound. If you are a developer or an architect, it is important to understand why Ceph is historically considered harder to use, and why that complexity buys you scalability that other systems struggle to match.&lt;/p&gt;

&lt;p&gt;The core difference lies in how these systems answer a simple question: "Where do I put this file?"&lt;/p&gt;

&lt;h2&gt;
  
  
  The "pool" problem in rigid architectures
&lt;/h2&gt;

&lt;p&gt;Many object storage systems use a hashing ring architecture combined with erasure coding. In an ideal world, this creates a ‘shared-nothing’ architecture where every node is identical. This is fantastic for speed in small, static setups.&lt;/p&gt;

&lt;p&gt;However, this rigidity creates a massive problem when it's time to scale. In many of these systems, you cannot simply add one hard drive to a cluster. You generally have to scale by adding ‘server pools.’&lt;/p&gt;

&lt;p&gt;Imagine you start with a cluster of 4 nodes, each with 4 drives (16 drives total). If you run out of space, you typically cannot just plug a new 20TB drive into an empty slot. To maintain the geometry of the erasure coding, you often have to add another symmetrical set of 16 drives. This step-function scaling is incredibly expensive.&lt;/p&gt;

&lt;p&gt;Furthermore, these systems often lack automatic rebalancing. If you add a new pool of drives, new data is written there, but the old data stays on the old, full drives. You end up with "hot" and "cold" spots in your cluster. Your total throughput is limited by the performance of the new pool, rather than the aggregate power of the whole cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ceph and the CRUSH approach
&lt;/h2&gt;

&lt;p&gt;Ceph takes a radically different approach. It eliminates the need for a central lookup table or rigid server pools using an algorithm called CRUSH (Controlled Replication Under Scalable Hashing).&lt;/p&gt;

&lt;p&gt;In legacy storage systems, a central Metadata Server acts like a librarian.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Request:&lt;/strong&gt; "Where is &lt;code&gt;training_data_batch_1.json&lt;/code&gt;?"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Librarian:&lt;/strong&gt; Checks database... "It is on Drive 4, Sector 2."&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As clusters grow to petabyte scale, this ‘librarian’ becomes a bottleneck. If the database gets too big or the librarian gets overwhelmed, the entire cloud slows down.&lt;/p&gt;

&lt;p&gt;Ceph fires the librarian.&lt;/p&gt;

&lt;p&gt;Instead, Ceph distributes a "map" of the cluster to every client (your application).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Request:&lt;/strong&gt; "I want to write training_data_batch_1.json."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Client:&lt;/strong&gt; Runs the CRUSH algorithm locally. "Mathematically, given the current state of the cluster, this file must go to OSD #4."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; The client talks directly to OSD #4.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the clients calculate data placement themselves, there is no central gateway bottleneck. You can hammer a Ceph cluster with millions of IOPS, and because the clients are doing the maths, the cluster scales linearly.&lt;/p&gt;
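&lt;p&gt;A toy illustration of the idea (emphatically &lt;em&gt;not&lt;/em&gt; the real CRUSH algorithm, which walks a weighted device hierarchy with replica placement rules): when placement is a pure function of the object name and the shared cluster map, every client computes the same answer independently, so no central lookup service is consulted.&lt;/p&gt;

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy stand-in for CRUSH: placement as a pure function of the object
// name and the cluster size. Every client holding the same "map"
// computes the same OSD, so no central librarian is needed.
fn place(object: &str, num_osds: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    object.hash(&mut hasher);
    hasher.finish() % num_osds
}

fn main() {
    // Two independent "clients" agree on where the object lives.
    let a = place("training_data_batch_1.json", 16);
    let b = place("training_data_batch_1.json", 16);
    assert_eq!(a, b);
    println!("both clients computed OSD #{a}");
}
```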

&lt;h2&gt;
  
  
  Self-healing data
&lt;/h2&gt;

&lt;p&gt;This architectural difference shines when hardware fails – which, at scale, happens inevitably.&lt;/p&gt;

&lt;p&gt;In Ceph, if we add a single new hard drive, the cluster detects it. The CRUSH map updates to reflect the new capacity. The cluster then automatically begins moving data from full drives to the new empty drive in the background. It balances itself like water finding its level.&lt;/p&gt;

&lt;p&gt;Conversely, if a drive dies, Ceph marks it as "down" and immediately begins reconstructing the missing data bits onto the remaining survivors using its internal redundancy. We can sleep through a drive failure and replace it during standard business hours, knowing the data has already healed itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The complexity myth and the Kubernetes solution
&lt;/h2&gt;

&lt;p&gt;The strongest argument against Ceph has historically been: "But it's so hard to manage."&lt;/p&gt;

&lt;p&gt;Five years ago, we would have agreed. Managing a Ceph cluster used to require deep expertise in Linux internals, manual editing of text configuration files, and hand-calculating placement groups. It was a beast.&lt;/p&gt;

&lt;p&gt;But the landscape has changed dramatically with the rise of Kubernetes and Rook.&lt;/p&gt;

&lt;p&gt;Rook is a Cloud Native Computing Foundation (CNCF) project that acts as an "operator" for Ceph. It brings cloud-native automation to storage. Rook handles the dirty work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt; It automates the rollout of the storage daemons.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Upgrades:&lt;/strong&gt; Want to upgrade Ceph? Change one line of YAML, and Rook handles the rolling restart, ensuring data safety the whole time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Expansion:&lt;/strong&gt; Plug in new drives, and Rook detects them, provisions the Object Storage Daemons (OSDs), and begins the rebalancing process.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
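&lt;p&gt;As a flavour of what that looks like, here is an illustrative (not production) &lt;code&gt;CephCluster&lt;/code&gt; fragment; bumping the &lt;code&gt;image&lt;/code&gt; tag is the one-line change that triggers Rook's rolling upgrade:&lt;/p&gt;

```yaml
# Illustrative Rook CephCluster fragment (names and versions are examples).
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18.2.4   # bump this tag to upgrade the cluster
  dataDirHostPath: /var/lib/rook
  storage:
    useAllNodes: true
    useAllDevices: true
```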

&lt;p&gt;Rook has democratised Ceph. It brings the ‘Day 1’ experience of Ceph much closer to the simplicity of other tools, without sacrificing the Day 1,000 power and freedom.&lt;/p&gt;

&lt;h2&gt;
  
  
  The developer cheat sheet
&lt;/h2&gt;

&lt;p&gt;For the engineers and architects evaluating their options, here is how the two stacks compare in the current landscape:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkl2ykthsrlv5y4hrkst.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkl2ykthsrlv5y4hrkst.jpg" alt="Developer cheat sheet table" width="800" height="1071"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't rent your foundation
&lt;/h2&gt;

&lt;p&gt;Our decision to choose Ceph wasn't about finding the easiest path, it was about finding the most sustainable one.&lt;/p&gt;

&lt;p&gt;It was about moving away from platforms which historically demonstrated a willingness to remove features, change licenses, and freeze open-source code. Eventually, those costs trickle down to the customer – either in the form of higher prices to cover enterprise licensing fees or, worse, forced migrations when the free version becomes unmaintainable.&lt;/p&gt;

&lt;p&gt;We will not pass that supply chain risk on to our customers.&lt;/p&gt;

&lt;p&gt;We chose Ceph because it allows us to offer organisations a storage layer that is battle-tested, infinitely scalable, and free from the threat of vendor lock-in.&lt;/p&gt;

&lt;p&gt;Ultimately:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We handle the complexity:&lt;/strong&gt; Ceph is complex under the hood. We take on the burden of tuning CRUSH maps, managing deep scrubbing, and balancing placement groups so customers just get a fast, resilient S3 endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We control the costs:&lt;/strong&gt; Because we aren't paying a per-terabyte tax to a proprietary software vendor, we don't have to charge customers one either. That means better egress rates and lower storage costs for your models.&lt;/p&gt;

&lt;p&gt;In the AI gold rush, many vendors optimise for speed to market. We focus on building infrastructure that remains dependable, performant and resilient when systems reach production scale.&lt;/p&gt;

</description>
      <category>ceph</category>
      <category>cloudstorage</category>
      <category>architecture</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why understanding application behaviour is the prerequisite for scaling AI</title>
      <dc:creator>Stelia Developers</dc:creator>
      <pubDate>Tue, 10 Mar 2026 13:48:25 +0000</pubDate>
      <link>https://dev.to/steliadevs/why-understanding-application-behaviour-is-the-prerequisite-for-scaling-ai-4m05</link>
      <guid>https://dev.to/steliadevs/why-understanding-application-behaviour-is-the-prerequisite-for-scaling-ai-4m05</guid>
      <description>&lt;p&gt;As AI systems move from experimental pilots into production-critical enterprise applications, the question of how to scale them reliably is front of mind.&lt;/p&gt;

&lt;p&gt;Scaling AI and ML workloads has long been assumed to be achievable through the linear approach of adding more and more infrastructure, an approach proven successful for earlier generations of web applications and databases. We see it baked into technical teams across the enterprise landscape: provisioning more GPUs as inference latency degrades, and accelerating infrastructure procurement conversations as soon as training jobs stall.&lt;/p&gt;

&lt;p&gt;But in reality, scaling AI applications for reliable and lasting performance doesn’t begin with the infrastructure. It begins with determining application behaviour and ensuring that the solution you design supports the specific performance priorities required.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“You cannot scale what you do not understand. Understanding application behaviour dictates hosting and delivery success.” – Dave Hughes, Stelia CTO&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Decisions around scaling typically begin with “how much” before answering “what kind”. By flipping these conversations on their head, we consider how different workload types express distinct behavioural traits, and how architecting with these traits in mind enables production-scale delivery of enterprise applications.&lt;/p&gt;

&lt;h2&gt;Why behavioural traits must define requirements&lt;/h2&gt;

&lt;p&gt;Every application has distinct objectives and operational constraints that shape how it behaves under load. Understanding these behavioural traits is key to revealing which architectural requirements matter most for achieving performance at scale.&lt;/p&gt;

&lt;p&gt;For example, a multiplayer gaming server’s highest priority is supporting concurrent users, which in production translates to holding thousands of persistent connections with continuous bidirectional data flow. A Minecraft server with 100 players logged in for 19-hour sessions demands long-lived stateful connections where session state must survive server restarts and memory must remain stable over extended periods.&lt;/p&gt;

&lt;p&gt;Compare this to an e-commerce platform, where users adding items to a cart trigger short-lived HTTP requests, stateless interactions and variable, bursty traffic – the performance priorities change completely.&lt;/p&gt;

&lt;p&gt;Each application’s behavioural traits directly correspond to the unique architectural requirements that performance at scale demands. While a gaming server with these performance demands requires connection-aware load balancing and graceful connection draining, an e-commerce platform’s architectural challenge shifts entirely toward sudden traffic spikes that demand elastic compute provisioning and cache efficiency.&lt;/p&gt;
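&lt;p&gt;To make “graceful connection draining” concrete, here is a minimal sketch in Python’s asyncio of a server that, on shutdown, stops accepting new connections and then waits for in-flight ones to finish rather than cutting them off. The &lt;code&gt;DrainingServer&lt;/code&gt; class and its trivial echo-style handler are hypothetical illustrations, not part of any particular game-server framework:&lt;/p&gt;

```python
# Sketch of graceful connection draining for a long-lived stateful workload.
# All names here are illustrative, not from a real framework.
import asyncio


class DrainingServer:
    def __init__(self) -> None:
        self._active: set = set()          # tasks for in-flight connections
        self._server = None

    async def _handle(self, reader, writer) -> None:
        task = asyncio.current_task()
        self._active.add(task)
        try:
            data = await reader.read(100)  # one request per connection
            writer.write(data.upper())     # trivial stand-in for real work
            await writer.drain()
        finally:
            writer.close()
            await writer.wait_closed()
            self._active.discard(task)

    async def start(self, host: str = "127.0.0.1", port: int = 0) -> int:
        self._server = await asyncio.start_server(self._handle, host, port)
        # port=0 asks the OS for an ephemeral port; return the one we got
        return self._server.sockets[0].getsockname()[1]

    async def shutdown(self) -> None:
        # Stop accepting new connections first...
        self._server.close()
        await self._server.wait_closed()
        # ...then let any in-flight connections finish naturally.
        if self._active:
            await asyncio.gather(*self._active, return_exceptions=True)
```

&lt;p&gt;A connection-aware load balancer pairs with this pattern by routing no new sessions to an instance once draining begins, so existing players are never dropped mid-session.&lt;/p&gt;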

&lt;p&gt;In practice, no single definition of “an application” should exist within scaling discussions: application behaviour spans multiple patterns, each demanding different scaling strategies and driving entirely different architectural choices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The table below illustrates some of the considerations different applications require:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqldour2i6irbm0oruhz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqldour2i6irbm0oruhz.jpg" alt="A table showing application type, connection pattern, and primary challenges" width="800" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The gap between theoretical scaling and enterprise reality&lt;/h2&gt;

&lt;p&gt;While beginning with application behaviour under load is the ideal approach, the reality is that most enterprise applications evolve organically from prototypes designed for immediate functionality, without architectural foresight of the requirements production scale will bring.&lt;/p&gt;

&lt;p&gt;At Stelia, we are often approached by teams struggling to progress successful pilots born from incremental feature additions, where scale was dismissed as a future problem until it became an urgent imperative. By this point, retrofitting an application designed without foresight costs both resources and time, as architectural decisions that made sense at prototype scale must be undone to remove production-scale blockers.&lt;/p&gt;

&lt;p&gt;In the current market, understanding how an application actually behaves under load from the outset is both a technical and strategic priority. Organisations cannot afford to lose competitive advantage due to hidden scaling constraints that could have been addressed earlier. When behavioural constraints become visible early, modification can be targeted rather than speculative, enabling faster time to market and more reliable production performance.&lt;/p&gt;

&lt;h2&gt;How can enterprises change tack to enable effective scaling of AI workloads?&lt;/h2&gt;

&lt;p&gt;Closing the gap between a behaviour-first approach and the reality of moving enterprise pilots to production scale requires a fundamental restructuring of approach. This transformation begins with visibility, progresses through targeted modification, and concludes with infrastructure decisions that support the application’s actual behaviour rather than fighting against it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Identify behavioural constraints from the outset.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Understanding application behaviour must begin with instrumentation under realistic load conditions: profiling to determine where time is actually spent, where memory grows, and how data moves through the system.&lt;/p&gt;

&lt;p&gt;These observations reveal the constraints that determine whether the application can scale, and where modifications may be required.&lt;/p&gt;
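&lt;p&gt;As a minimal sketch of what this first step can look like in practice, the snippet below uses Python’s built-in &lt;code&gt;cProfile&lt;/code&gt; and &lt;code&gt;tracemalloc&lt;/code&gt; to capture where time is spent and how much memory a code path allocates under a synthetic load. The &lt;code&gt;handle_request&lt;/code&gt; function is a hypothetical stand-in for a real application code path; genuine instrumentation would run against production-shaped traffic:&lt;/p&gt;

```python
# Sketch: profile a workload to see where time goes and where memory grows.
import cProfile
import io
import pstats
import tracemalloc


def handle_request(payload: list) -> int:
    # Hypothetical stand-in for a real code path: builds an intermediate
    # list (memory growth) and does some CPU work (time spent).
    intermediate = [x * x for x in payload]
    return sum(intermediate)


def profile_under_load(requests: int = 1000):
    tracemalloc.start()                      # track allocations
    profiler = cProfile.Profile()
    profiler.enable()                        # track time per function
    for _ in range(requests):
        handle_request(list(range(200)))
    profiler.disable()
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
    return out.getvalue(), peak_bytes


if __name__ == "__main__":
    report, peak = profile_under_load()
    print(report)                            # top 5 functions by cumulative time
    print(f"peak traced memory: {peak} bytes")
```

&lt;p&gt;The same idea extends to production tooling – continuous profilers, APM agents, memory trackers – but even a sketch like this surfaces which functions dominate runtime and how allocation grows with load.&lt;/p&gt;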

&lt;p&gt;&lt;strong&gt;2. Modify the application to remove scaling blockers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With constraints in full view, the changes required will be based entirely on the application’s behavioural profile. These application-level changes can be made before infrastructure compensations are implemented to hide inefficiencies.&lt;/p&gt;

&lt;p&gt;Modifications made at this stage create a dynamic whereby infrastructure supports well-behaved applications rather than attempting to fix poorly architected ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Architect hosting aligned to true behaviour.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Only after understanding and modifying an application’s behaviour can infrastructure decisions be made effectively: instance types, orchestration patterns, and data locality strategies all flow directly from understanding an application’s performance requirements under load.&lt;/p&gt;

&lt;p&gt;The behavioural traits identified at the outset translate directly into concrete architectural choices, and infrastructure is designed to support requirements rather than forcing the application to conform to whatever infrastructure is available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Set appropriate governance and security boundaries.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Inevitably, different behavioural patterns demand different governance and security approaches. Real-time inference serving sensitive data operates under entirely different compliance and security requirements than batch training on anonymised datasets.&lt;/p&gt;

&lt;p&gt;Data residency, access controls, and audit requirements must align with both the application’s behaviour and the sensitivity of the data it processes.&lt;/p&gt;

&lt;h2&gt;Why full-stack expertise is essential&lt;/h2&gt;

&lt;p&gt;Executing this approach successfully, however, requires fluency across the entire stack. Application development, infrastructure provisioning, and performance optimisation are typically treated as separate disciplines with separate teams, but effective scaling demands understanding how these layers interact in operational environments.&lt;/p&gt;

&lt;p&gt;Such fluency across the stack is rare. Most organisations have deep expertise in one layer but lack the cross-stack fluency needed to diagnose behavioural constraints, modify applications appropriately, and architect infrastructure that supports the resulting behaviour.&lt;/p&gt;

&lt;p&gt;This is not a criticism of existing teams; it reflects how technical specialisation has evolved. But it does create a capability gap that must be addressed, either through building internal expertise or partnering with those who possess this holistic systems understanding. The teams that scale AI workloads successfully in this next phase of AI impact will be those that treat operationalising AI at scale as a unified problem rather than a set of isolated challenges.&lt;/p&gt;

&lt;h2&gt;Reframing the scaling question&lt;/h2&gt;

&lt;p&gt;Scaling AI workloads effectively doesn’t come down to a question of infrastructure capacity, but one of understanding: how the application behaves under load, what constraints that behaviour creates, and how to architect systems that support rather than fight that behaviour.&lt;/p&gt;

&lt;p&gt;The organisations moving successfully from pilot to production are those that begin with observation rather than procurement. They instrument to understand actual runtime characteristics, modify applications to address the constraints those characteristics reveal, and only then make infrastructure decisions based on how the modified application actually performs.&lt;/p&gt;

&lt;p&gt;This approach requires a shift in how scaling problems are framed, flipping the conversation from “how much infrastructure is required?” to “what kind of application are we dealing with, and what does it need to operate effectively at scale?” Answer these questions first, and the infrastructure decisions follow naturally.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>architecture</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
