DEV Community: Harshit Luthra

Two tenants, one GPU, and no wall between them

Harshit Luthra — Thu, 02 Jul 2026 19:15:37 +0000

Originally published at harshit.cloud on 2026-07-25.

Run nvidia-smi inside a pod in one namespace, then inside a pod in another, on a cluster using GPU time-slicing. Look at the GPU UUID and the PCI bus ID.

This is the last part. Seven parts built the fleet, wired it, scaled it, watched it, routed to it, and kept it alive through deploys. This one is about the moment more than one team shares it, which is where the assumptions that held for a single tenant quietly stop being true. The load-bearing one, the thing most teams get wrong: a Kubernetes namespace is not a wall on the GPU.

the isolation you don't have

A namespace is an API-scoping and RBAC boundary. It has nothing to do with the hardware. The device plugin advertises nvidia.com/gpu resources and the scheduler places pods onto them regardless of which namespace asked, so "one GPU, four replicas" under time-slicing means the scheduler can land a pod from team-a and a pod from team-b on the same physical die with no memory partition between them. What isolation you actually get depends entirely on the sharing mechanism, and it's worth stating exactly:

Fig. 2 · the only two rows with a real memory wall are "whole GPU" and "MIG." MPS caps are software. Time-slicing has nothing. If your mental model was "different namespace, different memory," this is the row that corrects it.

The config that creates this situation is ordinary and common. This is a device-plugin time-slicing setup, the kind people turn on to raise utilization:

sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4          # one physical GPU -> four schedulable "GPUs", zero isolation

MPS swaps that block for per-client soft caps (CUDA_MPS_PINNED_DEVICE_MEM_LIMIT for memory, plus a compute percentage). The driver enforces them, not the hardware, and every MPS client on the GPU shares one failure domain. This isn't only a noisy-neighbor concern. Security researchers have shown covert and side channels through the GPU's shared uncore engines, readable with unprivileged NVML calls, that bypass both MPS and MIG partitioning, and GPU DRAM that isn't zeroed on context teardown has leaked data across tenants under time-slicing. The one-line defensive setting: turn on renameByDefault: true so a shared GPU advertises as nvidia.com/gpu.shared, and a tenant can't request a "shared" GPU thinking it's a private one.

Key Insight: A Kubernetes namespace isolates the API and RBAC, never GPU memory. Under time-slicing or MPS, two tenants can sit on the same physical die with no hardware wall between their VRAM. If tenants don't trust each other, only MIG or a whole-GPU allocation is safe. This is the correction most GPU-cluster security models are missing.

MIG when tenants can't trust each other

MIG is the answer when the boundary has to be real. It carves the physical card into hardware partitions, each with its own DRAM slice, L2, and SMs, so one instance cannot read another's memory regardless of driver or kernel bugs. Expose it through the GPU Operator in mixed strategy, which advertises each profile as its own resource:

kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
  -p='[{"op":"replace","path":"/spec/mig/strategy","value":"mixed"}]'

The node then advertises MIG profiles as first-class schedulable resources, and a tenant requests a hardware-isolated slice by name:

resources:
  limits:
    nvidia.com/mig-1g.10gb: 1     # one hardware-isolated 10GB partition

The tradeoff is the one from part 1: partitions are fixed at configure time, reconfiguring drains the node's GPU pods, and a workload can't burst past its slice. That rigidity is the price of a real wall. For untrusted multi-tenancy it's a price worth paying; for a single trusted team it's usually not.

quotas that don't waste the cluster

Isolation stops one tenant from reading another's memory. Quotas stop one tenant from eating the whole cluster. The blunt version is a ResourceQuota capping GPU requests per namespace (extended resources are quotable only through the requests. prefix; a limits. entry is not accepted):

apiVersion: v1
kind: ResourceQuota
metadata: { name: team-a-gpu-quota, namespace: team-a }
spec:
  hard:
    requests.nvidia.com/gpu: "8"

That's static, though, and it strands GPUs whenever a team is idle. Kueue fixes that by admitting jobs against quota and letting teams in a shared cohort borrow each other's idle GPUs while guaranteeing each its floor back on demand. A ClusterQueue holds the quota; the cohortName is what enables borrowing:

apiVersion: kueue.x-k8s.io/v1beta2
kind: ClusterQueue
metadata: { name: team-a-cq }
spec:
  cohortName: "gpu-pool"                 # teams in one cohort lend/borrow idle GPUs
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: h100-flavor
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 8                  # guaranteed floor
        borrowingLimit: 4                # cap on what it can borrow from the cohort

One version trap worth checking before you copy that: current Kueue serves v1beta2, where the field is spec.cohortName (and a cohort can be its own CRD with fair-share weights). Plenty of clusters still run v1beta1, where the field is a plain spec.cohort string and the standalone Cohort doesn't exist. kubectl get crd clusterqueues.kueue.x-k8s.io -o jsonpath='{.spec.versions[*].name}' tells you which you have.

NVIDIA's KAI scheduler (the open-sourced core of Run:ai) models the same idea as a hierarchical Queue with a quota (the deserved floor) and an overQuotaWeight (your proportional claim on idle GPUs above it), plus time-based fair-share so a team that under-used earlier gets favored later.
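Going back to the ClusterQueue above, the floor-and-borrow arithmetic is easy to sanity-check by hand. Here is a minimal Python sketch of the most GPUs one queue could hold at once. It is a simplification under stated assumptions, not Kueue's admission logic: max_admissible_gpus and cohort_idle are hypothetical names, and lending limits, preemption, and fair-share weights are all ignored.

```python
def max_admissible_gpus(nominal_quota: int, borrowing_limit: int, cohort_idle: int) -> int:
    """Upper bound on GPUs one queue can hold: its floor plus what it may borrow.

    cohort_idle is a hypothetical input: the GPUs that other queues in the
    same cohort are not using right now.
    """
    borrowable = min(borrowing_limit, max(cohort_idle, 0))
    return nominal_quota + borrowable


# team-a-cq from the manifest above: nominalQuota 8, borrowingLimit 4
print(max_admissible_gpus(8, 4, cohort_idle=10))  # cohort has plenty idle -> 12
print(max_admissible_gpus(8, 4, cohort_idle=2))   # cohort is nearly full  -> 10
```

The borrowingLimit is the ceiling, and the cohort's idle capacity decides how much of that ceiling is reachable at any given moment.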

the endpoint nobody locked

The most common GPU-cluster security hole isn't exotic. It's a vLLM pod with a Service and no authentication, reachable from anywhere on the cluster or, worse, the internet: free inference, prompt exfiltration, model theft. Lock it at three layers. Default-deny ingress so only the gateway can reach the model server:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: default-deny-ingress, namespace: team-a }
spec:
  podSelector: {}
  policyTypes: ["Ingress"]

Then an authenticated, TLS-terminating gateway in front (Gateway API HTTPRoute with auth attached upstream), and a least-privilege ServiceAccount so a compromised inference pod can't read the rest of the cluster's secrets:

apiVersion: v1
kind: ServiceAccount
metadata: { name: vllm-sa, namespace: team-a }
automountServiceAccountToken: false     # the model server never calls the API server

Which connects to the last quiet failure: secrets. Gated models return a 403 at download time when the pod has no valid Hugging Face token, and that surfaces as exactly the CrashLoopBackOff from part 7, except this time it really is the app. Mount the token from a Secret (or an external manager), never bake it into the image, and pin the model by digest so you don't silently load a tampered checkpoint:

env:
- name: HF_TOKEN
  valueFrom:
    secretKeyRef: { name: hf-token, key: token }

Prefer .safetensors over pickle formats (loading a pickle can execute arbitrary code), pin the model revision to a commit, and verify its checksum. The supply chain for a 140GB weight file deserves the same suspicion as any other dependency, which is the whole thesis of the lazy-security series if you want the longer version.

the frontier: confidential computing

One emerging piece, worth knowing exists even if you're not deploying it yet. Confidential computing extends a CPU trusted execution environment to the GPU, so even the cloud operator hosting your node can't read the weights or activations in VRAM. H100 CC-mode is GA: the driver, running inside a confidential VM, encrypts everything crossing the PCIe bus, and CUDA apps run unmodified once trust is established. Blackwell extends it with NVLink encryption for multi-GPU confidential domains. Independent benchmarks put the overhead under 5% for typical LLM inference (it's the CPU-to-GPU I/O encryption that costs, so small-batch workloads pay a bit more). On Kubernetes it lands through Kata Containers and Confidential Containers, and it's still maturing operationally. For regulated or sensitive-IP inference, it's the direction; for most workloads today, it's a section to file away.

who's paying for the idle GPUs

The last shared-cluster problem is money, and it's the biggest one, because cluster GPU utilization commonly sits at 30 to 40%. The gap between GPUs bought and GPUs doing work is enormous, and the fix is making teams see their own idle GPU-hours. Enforce a team label on every GPU pod (a policy engine like Kyverno rejects pods without one), export per-GPU metrics with DCGM, and attribute GPU-hours per team:

validate:
  message: "GPU pods must carry a 'team' label for chargeback."
  pattern:
    metadata:
      labels: { team: "?*" }

With the label enforced, DCGM's per-pod metrics roll up into GPU-hours per team, and tagging inference requests with an X-Team header at the gateway takes it down to token-level attribution. Showback doesn't reclaim a single GPU by itself. It just makes the waste visible to the people who can, which turns out to be most of the battle.
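The roll-up itself is small. A minimal sketch, assuming you have already joined per-GPU utilization samples to each pod's team label: the (team, utilization) tuples, the gpu_hours_by_team name, and the 60-second scrape interval are illustrative assumptions, not a DCGM or Prometheus API.

```python
from collections import defaultdict


def gpu_hours_by_team(samples, scrape_interval_s=60):
    """Turn per-GPU utilization samples into (allocated, idle) GPU-hours per team.

    samples: iterable of (team, utilization) with utilization in 0.0..1.0,
    one tuple per allocated GPU per scrape interval (a hypothetical shape).
    """
    allocated_s = defaultdict(float)
    idle_s = defaultdict(float)
    for team, utilization in samples:
        allocated_s[team] += scrape_interval_s
        idle_s[team] += scrape_interval_s * (1.0 - utilization)
    return {
        team: (allocated_s[team] / 3600, idle_s[team] / 3600)
        for team in allocated_s
    }


# One GPU held for an hour at 50% utilization, another for 30 minutes fully busy.
samples = [("team-a", 0.5)] * 60 + [("team-b", 1.0)] * 30
for team, (allocated, idle) in gpu_hours_by_team(samples).items():
    print(f"{team}: {allocated:.2f} GPU-hours allocated, {idle:.2f} idle")
```

The idle column is the one to put in front of each team, since it is the number they can actually act on.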

That's eight parts, and it's time to say the thing the whole series was circling. A GPU deployment is a dozen layers under one pod, a small network in one box, a large one between boxes, a set of graphs that tell you the truth, an autoscaler that turns it off, a router that doubles it for free, a set of probes that keep it breathing, and a tenancy model that decides who it hurts when it breaks. The silicon was the easy part. Everything that makes it hard lives in the layers around it, and every one of those layers is a place you can be the person who saw it coming, or the one explaining it in the incident review. That was the job the entire time. The GPUs were never the point.

Your model isn't crashing, your probe is

Harshit Luthra — Thu, 02 Jul 2026 19:15:33 +0000

Originally published at harshit.cloud on 2026-07-23.

The pods were in CrashLoopBackOff and the logs said nothing. vLLM started, printed its usual banner, began loading weights, and then died. Restarted, loaded again, died again. It looked exactly like a broken build, so the first hour went into the model, the image, the CUDA version, everything except the actual culprit, which was a fourteen-line probe config nobody had looked at.

This is part 7. The first six parts built the fleet, wired it, scaled it, watched it, and routed to it. This part is about keeping a model server alive through the three things that routinely kill it: a health probe that fires too early, a node drain that cuts a request in half, and a model-version rollout that drops traffic on the floor. None of it is glamorous. All of it is what stands between you and a 3am page.

your model isn't crashing, your probe is

Here's the tell, and once you've seen it once you never miss it again:

vLLM binds its HTTP port almost immediately, but /health on :8000 only returns 200 once the weights are loaded and the engine is warm. For a 70B on a cold pull that's several minutes. A default liveness probe with a small initialDelaySeconds starts checking during that window, gets a connection refused, decides the container is dead, and the kubelet kills it. It never finishes loading, so it never passes, so it loops forever. The exit code is 137 (128 + 9, the signal number for SIGKILL), which is the kubelet's fingerprint, not an application crash.

The fix is a startupProbe. It holds the liveness and readiness probes off entirely until it succeeds, and its failureThreshold × periodSeconds is your total load budget:

startupProbe:
  httpGet: { path: /health, port: 8000 }
  periodSeconds: 10
  failureThreshold: 60      # 60 × 10s = 600s (10 min) to finish loading
livenessProbe:
  httpGet: { path: /health, port: 8000 }
  periodSeconds: 10
  failureThreshold: 3       # after startup, catch a real hang in 30s

The reason this beats just cranking initialDelaySeconds: 600 on the liveness probe is that once the startup probe passes, liveness reverts to its tight interval and still catches a genuine deadlock in thirty seconds. A giant liveness delay would blind you to real hangs for ten minutes after every restart. With the startup probe in place, the pods ride through the load and come up clean:

$ kubectl get pods -l app=vllm-llama3-70b
NAME                               READY   STATUS    RESTARTS   AGE
vllm-llama3-70b-7f4b9c6d8f-9m2tq   1/1     Running   0          8m03s
vllm-llama3-70b-7f4b9c6d8f-c8xvn   1/1     Running   0          8m03s

Key Insight: A model that takes minutes to load needs a startupProbe, not a bigger liveness delay. Without one, the probe that's supposed to detect a dead server is the thing killing a healthy one, and the crash loop looks exactly like an application bug. This is the single most common self-inflicted LLM serving outage.
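The probe budget is plain multiplication, so it is worth computing rather than guessing. A minimal Python sketch: the function names and the 25% headroom factor are assumptions for illustration, not anything Kubernetes prescribes.

```python
import math


def startup_budget_s(failure_threshold: int, period_s: int) -> int:
    """Total time a startupProbe allows before the kubelet gives up."""
    return failure_threshold * period_s


def failure_threshold_for(worst_case_load_s: float, period_s: int = 10,
                          headroom: float = 1.25) -> int:
    """Smallest failureThreshold whose budget covers the load time plus headroom.

    headroom is a hypothetical safety factor; pick one from your own load times.
    """
    return math.ceil(worst_case_load_s * headroom / period_s)


print(startup_budget_s(60, 10))         # the startupProbe above -> 600
print(startup_budget_s(3, 10))          # the livenessProbe above -> 30
print(failure_threshold_for(480, 10))   # an 8-minute worst-case load -> 60
```

Measure the worst-case load on a cold node, not a warm one, before plugging a number in.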

draining without dropping a stream

The next one bites on every deploy and every node scale-down. When Kubernetes deletes a pod, it sends SIGTERM and removes the pod from the service endpoints at the same instant. vLLM handles SIGTERM correctly (it stops accepting new requests and finishes the in-flight ones), but if the load balancer is still routing to it during that beat, new requests land on a server that's shutting down. And if the grace period is too short, the kubelet SIGKILLs the process mid-generation, dropping a half-finished response the client has to retry from scratch.

Two fields fix it: a preStop hook that sleeps long enough for the endpoint removal to propagate before SIGTERM reaches vLLM, and a terminationGracePeriodSeconds that covers that sleep plus your longest in-flight generation:

terminationGracePeriodSeconds: 210   # preStop(15s) + longest decode(~180s) + margin
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 15"]

The sequence is worth internalizing, because the parallelism is the point: the endpoint removal and the preStop sleep happen at the same time, so by the time SIGTERM actually reaches vLLM, the load balancer has already stopped sending it traffic.

Fig. 2 · the drain, in order. The whole job of preStop is to buy time for the load balancer to stop routing before vLLM stops answering.
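The grace-period number is worth deriving instead of copying, because the preStop sleep is paid out of the same budget: the grace-period clock starts when termination begins, not when SIGTERM arrives. A minimal Python sketch of that arithmetic, with hypothetical function names and a margin you should set yourself:

```python
def termination_grace_period_s(pre_stop_sleep_s: int, longest_decode_s: int,
                               margin_s: int = 15) -> int:
    """Grace period that covers the preStop sleep, the longest decode, and a margin."""
    return pre_stop_sleep_s + longest_decode_s + margin_s


def drain_window_s(grace_period_s: int, pre_stop_sleep_s: int) -> int:
    """Time the server has between SIGTERM and SIGKILL once preStop has finished."""
    return max(grace_period_s - pre_stop_sleep_s, 0)


grace = termination_grace_period_s(pre_stop_sleep_s=15, longest_decode_s=180)
print(grace)                        # matches the manifest above -> 210
print(drain_window_s(grace, 15))    # what vLLM gets to finish in-flight work -> 195
```

If the drain window comes out shorter than your longest generation, the kubelet will still cut a stream in half, no matter how long the sleep is.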

surviving a node drain

Parts 3 and 5 leaned on Karpenter to consolidate and remove idle GPU nodes. That same consolidation, unguarded, will happily evict every replica of a service at once and take it to zero. The guard is a PodDisruptionBudget, and for a small pool of expensive GPU replicas you want maxUnavailable: 1 so at most one goes down for any voluntary disruption:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-llama3-70b-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels: { app: vllm-llama3-70b }

Karpenter drains through the Eviction API, which respects PDBs, so this is enough to keep the service serving through a consolidation. For a pod that's mid-critical-work and must not be interrupted at all, there's a stronger lever, the karpenter.sh/do-not-disrupt annotation, which excludes its node from consolidation entirely:

metadata:
  annotations:
    karpenter.sh/do-not-disrupt: "true"

Use it deliberately, though. Leave it on permanently and you've told Karpenter it can never reclaim that node, which is how you end up back in part 5's problem of expensive GPUs that never scale down. The PDB is the always-on floor; the annotation is for pods you're actively protecting.

one replica, many pods

When a model is too big for one node (the 405B from part 3, tensor-parallel across eight GPUs and pipeline-parallel across two nodes), a single replica is a group of pods, and a normal Deployment can't express that. LeaderWorkerSet can. It treats a leader plus N-1 workers as one unit: replicas is the number of these groups, size is the pods per group, and RecreateGroupOnPodRestart means if any pod in the group dies the whole group restarts, which is correct, because a tensor-parallel replica missing one member is dead weight:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm
spec:
  replicas: 2                    # two independent model replicas
  leaderWorkerTemplate:
    size: 2                      # 2 pods each: one leader + one worker
    restartPolicy: RecreateGroupOnPodRestart

The leader starts the Ray head and vLLM's OpenAI server with --tensor-parallel-size × --pipeline-parallel-size equal to the total GPUs across the group; the workers join the Ray cluster via the injected LWS_LEADER_ADDRESS. The catch from part 2 still applies with force: the group has to land on NVLink-connected GPUs or the cross-node collective crawls, so pair LWS with gang scheduling and topology-aware placement (Kueue's TAS, or the KAI scheduler). Rolling updates are first-class through rolloutStrategy with maxUnavailable and maxSurge, so you can roll a multi-node replica without taking the whole service down.
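The parallelism flags and the LeaderWorkerSet size have to agree, and it is an easy place to be off by a node. A minimal Python sketch of the check, assuming one pod per node and a uniform GPU count per node; lws_group_shape is a hypothetical helper, not part of LWS or vLLM.

```python
def lws_group_shape(tensor_parallel: int, pipeline_parallel: int,
                    gpus_per_node: int) -> dict:
    """Derive the LeaderWorkerSet group size from the parallelism settings.

    Assumes one pod per node and the same number of GPUs on every node.
    """
    total_gpus = tensor_parallel * pipeline_parallel
    if total_gpus % gpus_per_node != 0:
        raise ValueError("total GPUs must be a multiple of GPUs per node")
    size = total_gpus // gpus_per_node
    return {"gpus_per_group": total_gpus, "size": size, "workers": size - 1}


# The shape described above: tensor-parallel 8, pipeline-parallel 2, 8 GPUs per node.
print(lws_group_shape(8, 2, 8))
# {'gpus_per_group': 16, 'size': 2, 'workers': 1}
```

That result lines up with size: 2 in the manifest: one leader, one worker, sixteen GPUs per replica.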

rolling out a new model without an outage

The last way to drop traffic is deploying a new model version badly. KServe makes the careful version cheap: set canaryTrafficPercent on an InferenceService and point storageUri at the new weights, and it splits traffic between the current good revision (which it tracks automatically) and the new one.

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama3-chat
spec:
  predictor:
    canaryTrafficPercent: 10       # 10% to the new revision, 90% stays on last-good
    model:
      storageUri: "s3://models/llama3-chat/v2"

You watch the split live, and the metrics from part 4 (TTFT, error rate, and any quality signal) decide whether it graduates:

$ kubectl get isvc llama3-chat
NAME          URL                       READY   PREV   LATEST   LATESTREADYREVISION
llama3-chat   http://llama3-chat...     True    90     10       llama3-chat-predictor-00002

Promotion is deleting the canaryTrafficPercent field and re-applying: all traffic shifts to the new revision and the old one scales to zero. Blue-green is the same mechanism flipped straight to 100; shadow is mirroring live traffic to the candidate without returning its responses, so you can compare outputs with zero user risk. Two things to know before you rely on it. First, this traffic-splitting is a serverless (Knative) mode feature; in raw deployment mode you split with Gateway API route weights instead. Second, version the weights, the serving config, and the prompt template together as one revision, or your canary metrics are comparing two things that differ in ways you didn't track. (And don't reach for ModelMesh for multi-model serving; the project is archived.)

That's the service staying up through loads, drains, and deploys. The last thing between you and a calm on-call isn't the software at all, it's other people: the tenants sharing your cluster, the ones who can reach your endpoint, and a GPU-memory boundary that turns out to be nowhere near where you think it is. That's the final part, and it's the one most likely to end up in an incident review.

The cheapest speedup is your load balancer

Harshit Luthra — Thu, 02 Jul 2026 19:14:53 +0000

Originally published at harshit.cloud on 2026-07-18.

A team I know had a slow chatbot and did the obvious thing: added replicas. TTFT barely moved. They added more. Still slow, and now the bill was worse. Every GPU showed healthy utilization, the queue metrics from part 4 were climbing, and adding capacity wasn't buying the speedup capacity is supposed to buy. Then someone swapped the Kubernetes Service in front of the pods for a prefix-cache-aware router, changed nothing else, and the same eight GPUs got more than twice as fast. The problem was never the GPUs or the count. It was the load balancer, quietly throwing away the most expensive thing the servers had built.

This is part 6. Parts 1 through 5 built the fleet, watched it, and scaled it. This part is about the layer in front of the fleet: how requests get assigned to replicas, and why the default answer (round-robin, the same load balancing you'd use for a stateless web app) is the wrong one for LLM inference. It's the highest ratio of payoff to effort in the whole series, because the win comes from a routing decision, not from hardware you have to buy.

the load balancer that throws away your cache

Here's the thing a normal load balancer doesn't know: LLM replicas are not stateless. Each vLLM replica keeps its own KV cache (from part 2, the running memory of tokens it has already processed), and it also keeps a prefix cache: if two requests share the same opening tokens (a long system prompt, a RAG document, the history of a chat), the second one can reuse the first one's cached computation and skip prefill for the shared tokens. Prefill is the compute-bound, expensive half of a request. A prefix cache hit makes the shared portion nearly free.

But each replica has its own prefix cache. A round-robin load balancer scatters requests across replicas blind to what each one holds. So a request that shares a 2,000-token prefix with something served thirty seconds ago lands on a different replica that never saw it, re-runs the entire prefill from scratch, and fills its KV cache doing redundant work. Multiply that across a prefix-heavy workload and the fleet spends most of its compute re-prefilling prompts it already processed, on a different pod. The caches sit there full of answers to questions that keep getting routed elsewhere.
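The scattering effect is easy to reproduce in a toy. The Python sketch below is an illustration under loud assumptions, not a model of vLLM or of any real router: each replica gets a small LRU cache of whole prefixes (capacity 20 is made up), traffic cycles through the prefixes in order, and "sticky by prefix" is a crude stand-in for cache-aware routing.

```python
from collections import OrderedDict


def round_robin(i: int, prefix: int, n_replicas: int) -> int:
    """Next replica in rotation, blind to what it has cached."""
    return i % n_replicas


def sticky_by_prefix(i: int, prefix: int, n_replicas: int) -> int:
    """Always send the same prefix to the same replica (a stand-in for cache-aware routing)."""
    return prefix % n_replicas


def hit_rate(route, n_requests=9000, n_prefixes=150, n_replicas=8, capacity=20):
    """Fraction of requests that find their prefix in the chosen replica's LRU cache."""
    caches = [OrderedDict() for _ in range(n_replicas)]
    hits = 0
    for i in range(n_requests):
        prefix = i % n_prefixes
        cache = caches[route(i, prefix, n_replicas)]
        if prefix in cache:
            hits += 1
            cache.move_to_end(prefix)      # mark as most recently used
        else:
            cache[prefix] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used prefix
    return hits / n_requests


print(f"round-robin: {hit_rate(round_robin):.1%}")
print(f"sticky:      {hit_rate(sticky_by_prefix):.1%}")
```

The exact percentages are artifacts of the toy's cyclic traffic and tiny cache, so don't read them as predictions. The shape is the point: the same caches, the same requests, and the hit rate depends almost entirely on who gets sent where.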

the numbers, from one honest benchmark

This isn't a theoretical gain, and the cleanest measurement of it is a benchmark Andy Golubev ran on EKS in June 2026. The setup was deliberately boring: Qwen2.5-7B on vLLM, eight g5.xlarge nodes with one A10G each, one decode replica per node. Two runs, identical in every way except the front door. First run: a plain Kubernetes Service doing round-robin. Second run: an llm-d prefix-cache-aware router. Same model, same eight GPUs, same eight replicas. The workload was vllm bench serve replaying 9,000 prompts that shared a pool of 150 long (2,048-token) prefixes, which is exactly the RAG-and-chat shape real traffic takes.

The gap was not subtle.

Fig. 2 · the Golubev EKS numbers. The one that matters most is the last one: round-robin ran the cache pinned at 99% and thrashing; the smart router left it headroom, because it stopped generating redundant work.

Read the cache-hit line again: 11% to 93%. Round-robin was reusing almost nothing; the aware router reused almost everything. Mean TTFT dropped from a genuinely broken 19 seconds to under a second. And the fleet stopped queueing (waiting requests went from ~180 to zero) not because it got more capacity, but because it stopped wasting the capacity it had. Google reports the same shape from GKE's managed version: TTFT improvements up to 96% at peak load on prefix-heavy workloads. The lesson is uncomfortable and freeing at once. Before you buy a ninth GPU, check whether your router is making the eight you have re-do each other's work.

routing to the replica that already knows

The mechanism is simpler than it sounds. vLLM emits events about what its prefix cache holds. A cache-aware router consumes those events and keeps a live picture of which replica has which prefixes cached. When a request arrives, instead of picking the next replica in rotation, the router picks the one most likely to already hold the request's prefix, so the cache hit actually happens.

That decision point has a name in the Kubernetes world: the Endpoint Picker, or EPP. It doesn't just look at prefix locality. A good EPP scores every candidate replica on several signals at once and sends the request to the best total score:

prefix-cache locality: does this replica already hold the request's opening tokens? Longer match, higher score.
load: what's this replica's KV-cache utilization and queue depth right now? A replica that's already saturated scores lower even if it has the prefix, so you don't stampede one hot pod.
LoRA affinity: if you serve multiple fine-tuned adapters, is the right adapter already loaded here? Loading one is not free, so prefer the replica that has it.

The router balances those against each other, and it can queue or shed when everything is overloaded. This is the difference between "which pod is next" and "which pod will serve this request fastest given what it already has warm." Same request, same pods, a decision made with information the round-robin balancer never had.
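The three signals above can be sketched as one weighted score. This is a minimal Python illustration, not the Endpoint Picker's actual algorithm: the Replica fields, the 0.4/0.5/0.1 weights, and the max_queue normalizer are all assumptions chosen to make the trade-off visible.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    cached_prefixes: list = field(default_factory=list)  # token tuples this replica has cached
    kv_utilization: float = 0.0                          # 0.0..1.0
    queue_depth: int = 0
    loaded_adapters: set = field(default_factory=set)


def longest_prefix_match(tokens, cached_prefixes) -> int:
    """Length of the longest cached prefix that matches the start of the request."""
    best = 0
    for prefix in cached_prefixes:
        matched = 0
        for a, b in zip(tokens, prefix):
            if a != b:
                break
            matched += 1
        best = max(best, matched)
    return best


def score(replica, tokens, adapter=None, max_queue=32) -> float:
    """Weighted sum of prefix locality, spare capacity, and LoRA affinity (weights are hypothetical)."""
    locality = longest_prefix_match(tokens, replica.cached_prefixes) / max(len(tokens), 1)
    load = max(replica.kv_utilization, min(replica.queue_depth / max_queue, 1.0))
    lora = 1.0 if adapter is None or adapter in replica.loaded_adapters else 0.0
    return 0.4 * locality + 0.5 * (1.0 - load) + 0.1 * lora


def pick(replicas, tokens, adapter=None):
    """Send the request to the replica with the best total score."""
    return max(replicas, key=lambda r: score(r, tokens, adapter))


system_prompt = tuple(range(8))
request = system_prompt + (100, 101)
warm = Replica("warm", [system_prompt], kv_utilization=0.5)
cold = Replica("cold", [], kv_utilization=0.1)
hot = Replica("hot", [system_prompt], kv_utilization=0.99, queue_depth=40)

print(pick([warm, cold, hot], request).name)  # warm: has the prefix and has room
print(pick([cold, hot], request).name)        # cold: the saturated pod loses despite its cache
```

The second call is the stampede guard in miniature: a saturated replica loses to an empty one even though it holds the prefix.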

the standard nobody had two years ago

The reason this is worth a whole chapter now, and wasn't in 2024, is that it stopped being a bespoke hack and became a Kubernetes standard. The Gateway API Inference Extension is a SIG-Networking project that adds LLM-aware routing on top of the ordinary Gateway API. It introduces an InferencePool, which is a group of pods that share the same accelerator, base model, and model server (the LLM-shaped version of a Service), and it wires an Endpoint Picker into the request path through Envoy's external-processing (ext-proc) protocol, so the proxy calls out to the EPP for a decision on every request.

Fig. 3 · the request path. The gateway is a normal Envoy proxy; the intelligence lives in the Endpoint Picker it consults per request, over the same ext-proc hook Envoy already uses for auth and rate limiting.

The project has reached GA, with InferencePool graduating to a stable v1 API. (One honest caveat: the exact GA milestone landed over early-to-mid 2026, so pin the version you're deploying rather than trusting a blog's date, this one included.) Google productized it as GKE Inference Gateway, which is GA and is literally powered by the llm-d router underneath. So the "advanced" thing here is also increasingly the default, k8s-native thing, which is exactly the production framing this series cares about.

the whole stack, named

"llm-d" gets thrown around as if it were a server. It isn't; it's an assembly, and it's clearer to name the pieces:

vLLM is the model server. It owns the per-replica KV and prefix cache and emits the cache events the router needs.
KServe is the serving control plane, exposing an LLMInferenceService custom resource so you describe the model and its serving config as a normal Kubernetes object.
The inference gateway is Envoy plus the Gateway API Inference Extension: the data plane plus the LLM-aware routing above.
The router itself is that L7 proxy plus the Endpoint Picker, making the per-request decision from cache, load, and LoRA signals.
Disaggregated prefill/decode from part 3 is an optional add-on here, with the KV cache handed between pools over a connector.

If you're not ready to adopt the full gateway, the vLLM project's own production-stack helm chart is a lighter on-ramp: it deploys a router service in front of your vLLM pods that already does model-aware and prefix-aware routing, plus KV-cache offload through LMCache. Same idea, smaller commitment. Either way, the thing you're installing is a router that knows what a KV cache is.

watching the router

Part 4 said the metrics that matter live in the serving engine. Add one more surface: the router. The signals that tell you whether the routing layer is earning its keep are the ones that moved in that benchmark. Prefix cache hit rate is the headline. A round-robin fleet sits low (that 11%); a well-routed prefix-heavy fleet should sit high (the 93%). If you turned on cache-aware routing and the hit rate didn't climb, either your workload doesn't actually share prefixes or the router isn't seeing the cache events.

Key Insight: Prefix cache hit rate is the one dashboard number that proves the router is working. Watch it alongside per-replica KV utilization and waiting-request count. If hit rate is high and evenly spread, routing is doing its job. If one replica is pinned while others idle, your scorer is over-weighting cache locality and stampeding a hot pod. If hit rate is flat near zero, your traffic isn't prefix-heavy and this whole chapter buys you little.

That last line is the honest boundary. Cache-aware routing is close to free money for workloads with real prefix reuse: RAG over a shared corpus, long system prompts, multi-turn chat, agents replaying context. For traffic where every prompt is unique and short, there's no cache to reuse and the fancy router mostly just adds a hop. The KV-connector interfaces underneath are also still moving, so expect some churn if you build on the bleeding edge. Know which workload you have before you reach for this, the same way you'd check the topology in part 2 before promising a throughput number.

The router is the front door. What sits behind it, the probes that decide when a replica is ready, the way you roll a new model version out without dropping a request, the tenants all fighting over the same pool of GPUs, is where a production inference platform actually gets hard. That's where the series goes next.

Scaling GPU inference to zero and back

Harshit Luthra — Thu, 02 Jul 2026 19:14:48 +0000

Originally published at harshit.cloud on 2026-07-16.

The finance dashboard is what started it. A staging cluster running two H100 nodes around the clock, mostly to serve a demo that got maybe forty requests a day, all of them during business hours. The nodes sat idle from 7pm to 9am and every weekend, billing the whole time. Somebody added scale-to-zero over a Friday. Monday morning the first person to open the demo waited six minutes for a response, assumed it was broken, and filed a bug. The bill went down and the product got worse, which is scale-to-zero working exactly as designed and nobody being happy about it.

This is part 5. Parts 1 through 4 covered building GPU infrastructure and seeing what it's doing. This part is about the thing that actually shows up on the invoice: a GPU you're paying for while it does nothing. Scaling GPU inference elastically, all the way to zero when there's no traffic, is the biggest lever on the bill. It's also genuinely hard, because unlike a stateless web pod that starts in a second, a GPU replica has to drag a hundred gigabytes of model weights onto the card before it can serve a single token. The whole post is about that gap.

the cold-start tax

Scaling a web service to zero is free because starting a new pod is nearly instant. Scaling an LLM to zero is expensive because starting a replica is not. Add up what has to happen before the first token, on a cold node:

Node provisioning. The cluster asks the cloud for a GPU node, waits for it to boot and join. One to five minutes, and that assumes the GPU is even available to hand you (more on that at the end).
Image pull. A CUDA plus vLLM container image is commonly 5 to 15 GB. On a fresh node with a cold cache, pulling and unpacking it is minutes, not seconds.
Model load. Llama-3-70B in BF16 is about 140 GB of weights, usually sitting in object storage. Reading that over the network and moving it onto the GPU is the big one, and done naively it's several minutes.
Warmup. CUDA graph capture, torch.compile, and a few dummy forward passes to populate caches. Tens of seconds before the first real request is fast.

Stack those up and a naive 70B cold start runs six to nine minutes, more if those 140 GB of weights come cold over a slow path. That's the tax. Everything else in this post is a way to stop paying it, so that scale-to-zero saves the money without the six-minute Monday.
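The model-load line is the one you can estimate on a napkin, because it is just size over throughput. A minimal Python sketch: the two throughput figures are hypothetical placeholders, so substitute what your storage path actually sustains.

```python
def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight size in GB: billions of parameters times bytes per parameter."""
    return params_billion * bytes_per_param


def load_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to move the weights at a sustained throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb / throughput_gb_per_s


size = weights_gb(70, 2)  # 70B parameters in BF16 (2 bytes each) -> 140 GB
for label, rate in [("slow path, 0.5 GB/s", 0.5), ("fast path, 2 GB/s", 2.0)]:
    print(f"{label}: {load_seconds(size, rate) / 60:.1f} min")
```

A fourfold difference in sustained throughput is the difference between a load that dominates the cold start and one that is a footnote to node provisioning.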

One cruel interaction hides in here: a naive Kubernetes health probe treats that multi-minute load as a failure. A liveness probe with default timing restarts the pod mid-load, and now you have a crash loop that looks like a vLLM bug but is really a probe firing too early. Give the pod a startupProbe whose budget covers your worst-case load time (raising the liveness initialDelaySeconds above it is the cruder alternative, because it also delays catching a real hang after every restart), and add a preStop sleep so in-flight requests drain before the pod goes down. It's the single most common self-inflicted serving outage, and it costs nothing to avoid once you've seen it once.

scale on the queue, not the GPU

Before scaling to zero, get scaling to anything right, and the first mistake is scaling on GPU utilization. It's the obvious metric and it's the wrong one, for the reason part 4 laid out: a GPU can sit at 95% while the queue is empty, or at 40% while requests pile up. Scaling on GPU-util adds replicas late and removes them at the wrong time.

Scale on the serving signal instead. The metric that actually tracks unmet demand is the queue: vllm:num_requests_waiting, or KV-cache utilization as a leading indicator. You'll still find plenty of setups scaling on GPU-util or cache percentage, and those aren't wrong so much as indirect; the point is to scale on demand rather than on how busy the chip happens to look, with tail latency as the guardrail. In practice that means KEDA (the Kubernetes event-driven autoscaler) with a Prometheus trigger reading that metric, or an HPA wired to the same value through the Prometheus adapter. For traffic that's genuinely request-driven and spiky, KEDA's HTTP add-on can scale on in-flight request count directly. The shape is the same: pick the metric that means "users are waiting," set the threshold at the per-replica capacity you measured with the load test in part 4, and let it add replicas before the queue becomes TTFT.
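The arithmetic behind "scale on the queue" is small enough to write down. This is a minimal sketch of what an average-value target implies: total waiting requests divided by the per-replica threshold, rounded up and clamped. Real autoscalers add tolerance bands, stabilization windows, and cooldowns on top, and the function name and defaults here are hypothetical.

```python
import math


def desired_replicas(total_waiting: float, target_per_replica: float,
                     min_replicas: int = 0, max_replicas: int = 8) -> int:
    """Replicas needed so each one carries at most target_per_replica waiting requests."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(total_waiting / target_per_replica)
    return max(min_replicas, min(max_replicas, raw))


# Nine requests waiting across the fleet, each replica sized for four.
print(desired_replicas(total_waiting=9, target_per_replica=4))
```

The threshold is the number that matters: set it from the measured per-replica capacity, not from a guess.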

actually reaching zero

Scaling to zero is a special case of scaling, with one extra problem: when you're at zero replicas, there's nothing running to receive the request that's supposed to wake you up. Something has to catch that first request and hold it while a replica spins up.

KEDA covers the wake-up half with an activation threshold: below it, the deployment sits at zero; the first event scales it to one. A metric trigger doesn't hold the request that caused the event, though; for HTTP traffic that is the job of the HTTP add-on's interceptor. Knative Serving builds the pattern in more deliberately, with an activator component that buffers incoming requests while a cold replica starts, then releases them once it's ready, so the request is slow but not dropped. KServe (in its Knative-backed serverless mode) exposes this as a simple minReplicas: 0 on an InferenceService, and it's the most common way teams get GPU model servers to zero on Kubernetes.

Fig. 2 · the lifecycle, and the unlucky first request. Everything after this figure is about shrinking the cold-start box so that request waits seconds, not minutes.

The honest catch is that reaching zero and the cold-start tax are the same coin. Zero replicas is where the savings are, and it's also where the first request back pays the full six minutes. Everything below is about making that first request cost seconds instead, because a scale-to-zero setup with a six-minute cold start is a cost win and a product loss, and you rarely get to keep both.

There's also a breakeven worth running before you build any of this. Scale-to-zero wins when utilization is low and spiky; it loses somewhere around half-time, where a dedicated node is both cheaper than paying serverless rates by the second and free of cold starts entirely. If a GPU is busy more than half the day, don't scale it to zero. Right-size it and leave it on.
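The breakeven is one division. The sketch below assumes the simplest possible model: a dedicated node billed for every hour, and a scale-to-zero rate billed only while busy. Both hourly rates are hypothetical, picked so that the premium is 2x and the breakeven lands at the half-time mark mentioned above; plug in your own quotes.

```python
def breakeven_busy_fraction(dedicated_per_hour: float, on_demand_per_hour: float) -> float:
    """Busy fraction above which the always-on node is the cheaper option."""
    if on_demand_per_hour <= 0:
        raise ValueError("on_demand_per_hour must be positive")
    return dedicated_per_hour / on_demand_per_hour


def cheaper_option(busy_fraction: float, dedicated_per_hour: float,
                   on_demand_per_hour: float) -> str:
    """Compare hourly cost: pay-while-busy versus pay-always."""
    scale_to_zero_cost = busy_fraction * on_demand_per_hour
    return "scale-to-zero" if scale_to_zero_cost < dedicated_per_hour else "dedicated"


DEDICATED = 3.0  # hypothetical $/hour for a reserved GPU node
ON_DEMAND = 6.0  # hypothetical $/hour equivalent for per-second billing

print(breakeven_busy_fraction(DEDICATED, ON_DEMAND))
print(cheaper_option(0.2, DEDICATED, ON_DEMAND))
print(cheaper_option(0.7, DEDICATED, ON_DEMAND))
```

This ignores the cost of cold starts themselves, which only pushes the breakeven lower.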

killing the image pull

The container image is the first fixable minute. Two families of fix.

The first is to stop pulling the whole thing before you start. SOCI (Seekable OCI), lazy-loading via a containerd snapshotter, lets a container start running against an index of the image and fetch the actual bytes on demand, so the model server boots while the layers it doesn't need yet are still downloading. eStargz (the containerd stargz snapshotter) and Nydus do the same lazy-pull trick with different formats; GKE's Image Streaming is the managed version, and it cut a 5.4 GB Triton image's start from 191 seconds to 30. There's also a variant for the case where you will read the whole image anyway (AI images touch most of their bytes immediately): SOCI's parallel-pull mode just parallelizes the download and unpack instead of lazy-loading, and AWS measured roughly 60% off the pull time of a 10 GB image.

The second is to not pull over the network at all. Pre-bake the image into the node's disk image so it's local before the pod schedules. Run an in-cluster pull-through cache or a peer-to-peer image mirror (Spegel and friends) so the second node to need an image gets it from a neighbor instead of the registry. Pre-pull hot images with a DaemonSet so they're warm before traffic arrives. None of these is clever; all of them beat pulling 15 GB cold from a registry when a hundred replicas try it at once.

killing the model load

The bigger minute is the model weights, and this is where the newer tooling has moved fast. Loading 140 GB of safetensors off a disk or object store the default way is serial and slow. The fixes stream instead.

NVIDIA's Run:ai Model Streamer reads weights from object storage in many parallel streams straight onto the GPU, overlapping download with load, and vLLM supports it directly (--load-format runai_streamer). NVIDIA's own benchmark took an 8B model's S3 load from 28 seconds at four streams to under five at thirty-two. CoreWeave's tensorizer (--load-format tensorizer) serializes the model into a format that streams from S3 or local disk with near-zero deserialization overhead. Both turn a multi-minute load into tens of seconds. Underneath, safetensors already supports memory-mapped zero-copy loading, which helps when the file is local.

And local is the other half. Cache the weights on the node's NVMe (an instance-store disk) so a restart reads from local flash instead of re-downloading. Or mount a shared, fast filesystem so every replica reads the same warm copy, but mind the access mode: a ReadWriteOnce PVC can be mounted from only one node, so a second replica scheduled onto another node sits in ContainerCreating waiting for a volume it will never get, and anything that scales past a single node needs ReadWriteMany (EFS, FSx for Lustre, NFS, CephFS). That RWO-to-RWX switch is a classic thing to discover the first time an autoscale event never becomes a second pod. Some teams ship the model as its own OCI artifact and let the image machinery above handle it (KServe's modelcar pattern). The principle is the same one from part 2: the bandwidth between the weights and the GPU is the bottleneck, so shorten that path.

The frontier technique skips loading altogether. GPU memory snapshots (Modal and Cerebrium both ship this, built on the CUDA checkpoint/restore API in recent driver branches) checkpoint a fully warmed replica, weights on the card and CUDA graphs already captured, then restore that image straight onto a GPU. Because it bypasses weight load, torch.compile, and graph capture in one move, it's the only approach that also kills the warmup tax. Modal reports a vLLM Qwen model dropping from 45 seconds to 5, and Cerebrium measured cold starts down 71 to 88 percent. The catch is portability: a snapshot is pinned to a specific GPU model and driver branch, so it's a per-SKU artifact, not a universal one.

Key Insight: Scale-to-zero is a tradeoff, not a free win. Every second of cold start is latency the first user eats; every minute of warm idle is money you burn. You buy down the cold start with lazy image pulls, model streaming, and a warm node pool, then set a minimum replica floor high enough that your p99 cold start stays inside the SLO. "Zero when truly idle, one when it might not be" beats a dogmatic zero.

Fig. 3 · the fix menu, one column per stage. Snapshots are the only trick in the third column, which is why they're the frontier: everything else leaves the warmup tax standing.

turning off idle nodes

Scaling pods to zero doesn't save anything if the expensive GPU node they were on keeps running. The node has to go too. This is the autoscaler's job below the pod level: Karpenter consolidates workloads and removes nodes that are empty or underutilized (consolidateAfter, plus disruption budgets so it doesn't yank capacity mid-request), and the older cluster-autoscaler scales node groups down on the same principle. An idle GPU node costs thirty to a hundred-plus dollars a day, so it should be gone within a few minutes of emptying; letting it linger is the single most common way GPU bills quietly balloon.

For predictable traffic, the laziest win is scheduled scaling. If the demo only serves business hours, a cron trigger that scales the floor to zero at 7pm and back to one at 9am captures most of the savings with none of the cold-start risk during the day. And to hide cold starts when you do scale up, keep a warm node in reserve with low-priority placeholder pods (over-provisioning): real work evicts the placeholders instantly and lands on a node that already has the image, so the pod cold start doesn't also pay the node cold start.

spot, and the capacity trap

Two closing realities that scale-to-zero runs into. Spot instances make idle capacity cheap, and for inference they can work (unlike the gang-scheduled training from part 3, a single inference replica dying is survivable). But a spot GPU can be reclaimed with about two minutes' notice, so you need a node-termination handler that drains in-flight requests and a plan for where the replacement comes from.

Which is the trap: scaling up assumes there's a GPU to scale up onto. In 2026, popular GPUs are not always available on demand, and a scale-to-zero service that can't reacquire an H100 at 9am Monday is worse than one that never scaled down. The mitigations are the capacity blocks and reservations from part 3, an on-demand fallback when spot is dry, and a warm floor for anything with a real SLO. Scale-to-zero is a cost strategy, not a capacity strategy, and confusing the two is how you save money right up until the morning you can't get your GPUs back.

That's the cost side handled. The fleet scales with demand and turns itself off when it's idle, and the first user back doesn't wait five minutes for the privilege. But there's one lever left, and it's the strangest one, because it costs nothing to pull. Everything so far has quietly assumed that a request, once it arrives, lands on some replica and gets served. Part 6 is about that word "some." It turns out that which replica you pick, out of a pool that all look identical from the outside, can make the exact same hardware more than twice as fast. The load balancer, of all the unglamorous things, is where the last big win hides.

What a green GPU dashboard hides

Harshit Luthra — Thu, 02 Jul 2026 19:14:09 +0000

Originally published at harshit.cloud on 2026-07-11.

The GPU dashboard was a wall of green. Every card pinned at 90-something percent utilization, power near TDP, temperatures fine, no XID errors. By every signal from part 1 of this series, the box was healthy and working hard. Meanwhile the on-call channel had three messages from product asking why the chatbot took eight seconds to say its first word. The hardware was busy. The users were furious. Both were true.

That is the trap of monitoring GPU inference with GPU metrics. A card at 92% utilization tells you a kernel is running. It tells you nothing about whether requests are piling up in a queue, whether the KV cache is full and requests are being evicted, or whether the first token is landing in 200 milliseconds or eight seconds. This is part 4 of the series. Parts 1 through 3 built the thing: the stack under a pod, the wires in a box, the network between boxes. This part is about seeing what it's doing once real traffic hits it, and the short version is that the numbers users feel live in the serving engine, not on the GPU.

the four numbers users actually feel

Before any dashboard, get the vocabulary straight, because LLM latency is not one number. A request has a shape, and four measurements describe it.

TTFT, time to first token, is how long the user stares at a blinking cursor before anything appears. It's dominated by prefill (processing the whole prompt) plus however long the request sat in a queue before the server picked it up. This is the number product complains about.

ITL, inter-token latency, is the gap between consecutive tokens once generation starts. (Some tools report TPOT, time per output token, the averaged version of the same thing; benchmarks like GuideLLM show both side by side, so don't treat them as interchangeable in a report.) It's what makes text feel like it's streaming smoothly or stuttering out. Decode is memory-bandwidth-bound, so ITL degrades as you pack more concurrent requests onto a GPU.

End-to-end latency is the whole request, first byte to last. For a long generation it's mostly output_tokens × ITL, which means it's as much about how much the model says as how fast it says it.
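Put as arithmetic, the usual approximation is TTFT for the first token plus one ITL for each token after it. The sketch below assumes a constant ITL, which real servers don't give you, so treat it as a sizing estimate only. The example numbers are made up.

```python
def e2e_seconds(ttft_s: float, itl_s: float, output_tokens: int) -> float:
    """Approximate end-to-end latency: first token, then one ITL per remaining token."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + (output_tokens - 1) * itl_s


# Same server, same speed: a short answer versus a long one.
print(round(e2e_seconds(ttft_s=0.2, itl_s=0.03, output_tokens=50), 2))
print(round(e2e_seconds(ttft_s=0.2, itl_s=0.03, output_tokens=1000), 2))
```

With these inputs TTFT is a rounding error on the long generation, which is the sense in which end-to-end latency is mostly about how much the model says.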

Throughput is output tokens per second across all requests, and it moves in the opposite direction from latency. Batch more requests together and throughput climbs while per-request latency gets worse. There's no single "fast" setting. There's a frontier, and where you sit on it is a product decision (a chatbot wants low TTFT, a batch summarization job wants raw throughput) rather than a tuning default.

The number that captures both at once is goodput: the requests per second you can serve while still meeting your latency SLO. A server can post gorgeous raw throughput while quietly blowing p99 TTFT for half its users, so goodput (throughput, filtered by "did it meet the SLO") is the only throughput figure worth quoting.
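Goodput is easy to compute once you have per-request measurements. This is a minimal sketch: count only the requests that met both SLOs, then divide by the window. The Request record, the SLO values, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    ttft_s: float  # measured time to first token
    itl_s: float   # measured mean inter-token latency


def goodput_rps(requests: list[Request], window_s: float,
                ttft_slo_s: float, itl_slo_s: float) -> float:
    """Requests per second that met both latency SLOs within the window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    good = sum(1 for r in requests
               if r.ttft_s <= ttft_slo_s and r.itl_s <= itl_slo_s)
    return good / window_s


# Ten requests in two seconds; four of them waited eight seconds for a first token.
sample = [Request(0.2, 0.03)] * 6 + [Request(8.0, 0.03)] * 4
raw = len(sample) / 2.0
good = goodput_rps(sample, window_s=2.0, ttft_slo_s=1.0, itl_slo_s=0.05)
print(f"raw throughput {raw} req/s, goodput {good} req/s")
```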

what vLLM actually tells you

vLLM exposes a Prometheus /metrics endpoint, and once you've read it a few times the health of the server is obvious at a glance. The metrics that matter split into two groups: what the queue is doing, and what the KV cache is doing.

The queue first. vllm:num_requests_running is how many requests are being served right now; vllm:num_requests_waiting is how many are stuck in line because the server can't fit them yet. A healthy server has a running count near its batch capacity and a waiting count near zero. When num_requests_waiting starts climbing and staying up, that eight-second TTFT has arrived, and no GPU metric will show it. vllm:request_queue_time_seconds measures the wait directly.

Then the KV cache (the model's running memory of the tokens it has already processed, which grows with every active request and every token generated). vllm:kv_cache_usage_perc is the fraction of KV cache in use. This is the one to stare at, because when it approaches 100% vLLM has to start preempting: evicting a half-finished request to free memory, then recomputing it from scratch later. vllm:num_preemptions_total counts that happening. A rising preemption rate means the server is thrashing, doing the same work twice, and every latency number is about to get worse.

Fig. 2 · the same server, healthy and drowning. GPU utilization is high in both; the queue, the cache, and the preemption counter are what separate a busy server from a failing one.

The rest fill in the picture. vllm:time_to_first_token_seconds and vllm:inter_token_latency_seconds are histograms, so you alert on the p95 or p99, not the average (the average hides the user who waited twelve seconds). vllm:prompt_tokens_total and vllm:generation_tokens_total give you real throughput, computed with rate() (vLLM removed its old pre-averaged throughput gauges, so you do the division yourself). And vllm:prefix_cache_hits_total over vllm:prefix_cache_queries_total tells you how much prompt reuse you're getting, which matters enormously for the RAG and multi-turn workloads from part 2. It's also the number a smart router watches to decide which replica to send a request to, which is the whole subject of part 6.
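"Do the division yourself" means taking two samples of a counter and dividing the difference by the time between them. The sketch below shows that for token throughput and for the prefix-cache hit ratio. It is a simplification: Prometheus's rate() also handles counter resets and extrapolation, which this version refuses to guess at. The sample values are invented.

```python
def rate_per_second(prev: float, curr: float, seconds: float) -> float:
    """Per-second rate between two samples of a monotonically increasing counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if curr < prev:
        raise ValueError("counter went backwards (reset?); handle resets upstream")
    return (curr - prev) / seconds


def hit_ratio(hits_delta: float, queries_delta: float) -> float:
    """Fraction of prefix-cache queries that hit over the same window."""
    return hits_delta / queries_delta if queries_delta > 0 else 0.0


# Two scrapes of a generation-token counter, 60 seconds apart (made-up values).
print(rate_per_second(prev=1_000, curr=7_000, seconds=60))
print(hit_ratio(hits_delta=300, queries_delta=400))
```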

One warning that will save you an afternoon: these names drift between releases. The V1 engine renamed the KV-cache gauge (it used to be gpu_cache_usage_perc) and swapped the per-output-token metric to inter_token_latency_seconds. Diff the live /metrics output of your exact build before you copy anyone's PromQL, this post included.

Key Insight: GPU utilization tells you the chip is busy; it never tells you users are waiting. The metric that predicts an angry inbox is vllm:num_requests_waiting climbing, usually because vllm:kv_cache_usage_perc hit its ceiling and the server started evicting half-finished requests. Watch the queue and the cache, not GPU-util.

the same story in SGLang and Triton

The engine changes, the questions don't. SGLang exposes its own Prometheus metrics behind --enable-metrics: sglang:num_running_reqs and sglang:num_queue_reqs are the running-and-waiting pair, sglang:token_usage is the KV-cache fraction, and sglang:cache_hit_rate reports how often its RadixAttention prefix cache paid off. (One gotcha of exactly the kind above: SGLang flipped the metric prefix from sglang: to sglang_ in v0.5.4, and the bundled Grafana dashboard hasn't caught up, so a fresh install can read "No Data" until you fix the prefix.) If you built on SGLang for its prefix-sharing (the reason to pick it in part 2), that last one is how you confirm the bet is paying.

Triton with the TensorRT-LLM backend reports through Triton's metrics endpoint instead: request and queue durations, inflight-batcher stats, per-model success and failure counts. Different names, same three questions every time: how long are requests waiting, is the cache saturated, is the tail latency inside the SLO.

The lesson worth internalizing is that these are all the same dashboard with different labels. Queue depth, cache utilization, tail TTFT, tail ITL. Learn to read one engine and you can read all of them.

compute-bound, memory-bound, or queue-bound

The reason to keep the GPU metrics from part 1 next to the serving metrics is that together they diagnose why the server is slow, which is the only thing that tells you what to do about it. Three shapes cover most incidents.

If TTFT is high and num_requests_waiting is high but kv_cache_usage_perc is low, you're queue-bound: requests are backing up faster than you can start them, and the fix is more replicas (which is part 5). If ITL is degrading and DCGM_FI_PROF_DRAM_ACTIVE is pinned while tensor activity isn't, you're memory-bound on decode, and the fix is a smaller batch, quantization, or better KV-cache management. If tensor cores are saturated during prefill and DRAM isn't, you're compute-bound, which for inference usually means very long prompts and points at chunked prefill or a prompt-length limit.
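Those three shapes can be written as a small decision function. This is a sketch of the reasoning, not a tuned detector: the Snapshot fields are hypothetical names for values you would pull from the serving engine and DCGM, and the 0.8 and 0.5 cutoffs are assumptions standing in for "pinned" and "low".

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    ttft_over_slo: bool
    itl_degrading: bool
    requests_waiting: int
    kv_cache_usage: float  # 0..1, KV-cache fraction from the serving engine
    dram_active: float     # 0..1, e.g. DCGM_FI_PROF_DRAM_ACTIVE
    tensor_active: float   # 0..1, tensor-core activity from DCGM


PINNED = 0.8  # assumed cutoff for "saturated"
LOW = 0.5     # assumed cutoff for "low"


def diagnose(s: Snapshot) -> str:
    """Map a metrics snapshot onto the three incident shapes described above."""
    if s.ttft_over_slo and s.requests_waiting > 0 and s.kv_cache_usage < LOW:
        return "queue-bound: add replicas"
    if s.itl_degrading and s.dram_active >= PINNED and s.tensor_active < PINNED:
        return "memory-bound: smaller batch, quantization, or better KV-cache management"
    if s.tensor_active >= PINNED and s.dram_active < PINNED:
        return "compute-bound: chunked prefill or a prompt-length limit"
    return "unclassified"


print(diagnose(Snapshot(True, False, 12, 0.3, 0.4, 0.3)))
```

Note that every branch needs one serving metric and one GPU metric, which is the whole argument for putting both on one screen.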

Fig. 3 · the serving metric tells you something is wrong; the GPU metric next to it tells you which kind of wrong. You need both panels on the same screen.

Neither set of metrics is enough alone. GPU metrics without serving metrics miss the queue entirely. Serving metrics without GPU metrics can't tell a memory-bound stall from a compute-bound one. The dashboard that works has both, side by side, on one screen.

what to alert on

Most teams over-alert on the GPU and under-alert on the experience. The page that matters fires on the user's SLO, not the hardware's vitals. A working starter set:

TTFT p99 over budget for N minutes. This is the customer-facing SLO. Everything else is a leading indicator of this.
num_requests_waiting sustained above zero. A brief spike is fine; a standing queue means you're under-provisioned and the next thing to break is TTFT.
Preemption rate climbing. num_preemptions_total moving means the KV cache is saturated and the server is recomputing evicted work. It's the early warning before latency falls off a cliff.
Error rate. Request failures and, specifically, CUDA out-of-memory events, which on an inference server usually mean a batch or context-length setting is too aggressive.
The part-1 hardware alerts still stand. XID errors, thermal throttle, ECC. A dying GPU shows up as latency variance long before it shows up as an error, so keep those wired.

Tail latency is the whole game here. Alerting on average TTFT is how you find out about an outage from the customer instead of the pager, because the average stays calm while your p99 is on fire.
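A tiny example makes the average-versus-tail gap concrete. The data is invented: a hundred requests, ninety-eight of them fast and two that waited twelve seconds. The percentile uses the nearest-rank method, which is one of several reasonable definitions.

```python
import math
import statistics


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


ttft = [0.2] * 98 + [12.0] * 2  # seconds, made-up sample

mean = statistics.fmean(ttft)
p99 = percentile_nearest_rank(ttft, 99)
print(f"mean TTFT {mean:.3f}s, p99 TTFT {p99:.1f}s")
```

An alert on the mean with a one-second threshold never fires here; an alert on p99 fires immediately.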

tracing a single slow request

Aggregate metrics tell you the fleet is unhealthy. They don't tell you why this request took nine seconds. For that you want per-request tracing, and the ecosystem has standardized on OpenTelemetry's GenAI semantic conventions: spans carry gen_ai.* attributes (the model, input and output token counts, the request parameters) so a single request's journey through the gateway, the queue, prefill, and decode is one connected trace. When a specific user reports a slow response, a trace tells you whether it sat in a queue, hit a cache miss, or just asked for a 4,000-token essay. The metrics say the kitchen is slow; the trace shows you which order got lost.

proving it before prod

You don't want to discover your latency frontier during a launch. Load-test the serving endpoint before it sees real traffic, with a tool that speaks LLM rather than plain HTTP. vLLM ships vllm bench serve (the old benchmark_serving.py), which replays a request distribution, reports TTFT, ITL, and throughput percentiles, and computes goodput directly if you hand it SLO thresholds. GuideLLM (now a Red Hat project) does the same with a sweep mode that finds your safe operating range on its own; NVIDIA's genai-perf covers the Triton side, though NVIDIA is steering new work to its successor AIPerf; and the Kubernetes serving working group's inference-perf standardizes the numbers across engines so you can compare vLLM to SGLang honestly. Whatever you pick, the output you care about is the same: the curve of tail latency against offered load, and the load at which p99 TTFT crosses your SLO. That crossover is the goodput ceiling, the real per-replica capacity, and it's the input to everything in part 5.

Because that's the thing this post sets up. Once you can see the queue building and the cache saturating, the obvious next question is: why am I staring at these graphs manually at 2am instead of having the queue depth add a replica by itself, and drop it again when the traffic goes home. That's scaling, and scaling GPUs that cost six dollars an hour is its own kind of problem.

Scaling GPUs past one box

Harshit Luthra — Thu, 02 Jul 2026 19:14:04 +0000

Originally published at harshit.cloud on 2026-07-09.

When Meta trained Llama 3 405B on 16,384 H100s, the cluster hit 419 unexpected interruptions over 54 days. That's one failure every three hours or so, for 54 days straight, and about 78% of them were hardware. GPUs and their HBM3 memory accounted for roughly half. They also watched the datacenter's power draw swing by tens of megawatts as thousands of GPUs idled and resumed in sync, which is a sentence that should make any infrastructure engineer sit up.

This is the part of the series where the GPU stops being the interesting component. Part 1 was the stack under one pod, part 2 was one box and the wires inside it. Once you cross the node boundary, the network becomes the machine, failures become continuous rather than exceptional, and the scheduler decides whether your very expensive cluster does useful work or deadlocks against itself. Everything here is about the stuff between the boxes.

the network is the machine

Synchronous training does an all-reduce of the entire gradient (every parameter, billions of them) every single step. That collective is a hard barrier: the slowest link gates every GPU in the job. Add nodes and your compute scales, but the all-reduce now spans more links and more hops, and the chance that one of them is slow or dead grows with it. This is why, past one node, communication rather than FLOPs sets your scaling efficiency, and why the network gear deserves as much attention as the GPUs.

The fabric itself is a two-horse race. InfiniBand is the incumbent for dedicated training superclusters: the generation ladder runs EDR 100Gb, HDR 200Gb, NDR 400Gb (Quantum-2 switches, ConnectX-7 NICs), and now XDR 800Gb (Quantum-X800, ConnectX-8). Most DGX SuperPODs and the big named clusters run it. RoCE v2, RDMA over Ethernet, is winning share on cost and on letting existing Ethernet teams reuse what they know. The catch with RoCE is that it needs a carefully tuned lossless fabric (Priority Flow Control plus ECN marking) or you get congestion storms, and getting that right at scale is its own discipline.

Meta is the proof that Ethernet can do it. They built two 24,576-GPU H100 clusters, one on RoCE and one on Quantum-2 InfiniBand, and trained Llama 3 405B on the RoCE one with no network bottleneck, after co-designing the topology, the PFC/ECN thresholds, and an all-reduce-aware load balancer. That's the honest framing: RoCE works at scale, but Meta spent real engineering to make it work. Two more pieces earn their keep on either fabric. GPUDirect RDMA lets the NIC DMA straight into GPU memory, skipping a bounce through host RAM, and without it every hop stages through system memory. SHARP does the reduction inside the switch ASIC, so gradients get summed in the network instead of shuttled between every node, which on the newest Blackwell fabrics is a large multiplier on effective all-reduce bandwidth.

NCCL across the wire

The same NCCL from part 2 handles inter-node collectives, and the failure mode here is specific and common: NCCL silently falls back to TCP sockets when it can't find or use the RDMA path, and the job "works" while running an order of magnitude too slow. The env vars that prevent that are worth pinning in your launcher.

NCCL_IB_HCA names which RDMA NICs to use, and getting it wrong means NCCL picks one NIC and loses your rail parallelism. NCCL_SOCKET_IFNAME has to point at the real data-plane interface, not eth0 or lo, a classic container and Kubernetes trap. NCCL_CROSS_NIC=0 on a rail-optimized fabric keeps a ring on the same rail instead of hopping across them. NCCL_IB_GID_INDEX is the RoCE gotcha: the wrong GID index gives you no traffic or a silent slow path. On a fresh cluster the first all-reduce is routinely two to ten times slower than optimal until the env vars, the GID index, the PFC/ECN config, and the topology file are all correct. The bring-up ritual is always nccl-tests: run all_reduce_perf, measure the achieved bus bandwidth, compare it against what 400 or 800 Gb should give you, and don't trust the cluster until the number is close.

who schedules the gang

Here's the failure that surprises people coming from web infrastructure. Vanilla Kubernetes schedules pods independently, one at a time, with no concept of a job that needs all its pods at once. Give it a 4-node training job and it will happily place 3 pods and leave the 4th Pending forever, holding three nodes of GPUs idle. Run two such jobs and they can each grab most of what the other needs and starve each other indefinitely. Distributed training is all-or-nothing, and a scheduler that doesn't know that will deadlock your cluster. The first time it happens you assume you're out of GPUs. You're not. They're all sitting idle, reserved by pods that will never get their partners.

Fig. 2 · gang scheduling in one picture. The left side is how a cluster quietly wedges itself; the right side is the fix, and the reason every GPU scheduler below exists.

The fix is gang scheduling: admit all N pods together or none, so partial allocations can't happen. The tools that provide it each have an honest drawback:

Slurm is the HPC default and has gang scheduling and topology awareness built in, plus new block scheduling for aligning jobs to NVL72 racks. Its weakness is that containers and multitenancy are bolted on (Pyxis plus Enroot), and it's a poor fit for long-running inference services. That gap is why Slurm-on-Kubernetes projects like SchedMD's Slinky and CoreWeave's SUNK exist.
Vanilla Kubernetes has no gang scheduling and no topology awareness by default. You add a batch scheduler on top; you don't run training on the default scheduler.
Volcano is the de facto CNCF choice: gang scheduling via PodGroups, queues, fair-share. It runs as a second scheduler that bypasses the default, which complicates coexistence with normally-scheduled workloads, and gang scheduling itself costs maybe 10–15% utilization because resources sit idle waiting for the full gang.
Kueue is the Kubernetes-native answer for queueing and quota, and it cooperates with the default scheduler instead of replacing it. The tradeoff is that it does admission and quota, not fine-grained placement, so you still need scheduler plugins underneath for the actual gang and topology binding.
Run:ai is the commercial option, now NVIDIA-owned, with fractional GPU and pooling. NVIDIA open-sourced the core as KAI Scheduler (Apache 2.0, now CNCF Sandbox), so a free path exists, but the full enterprise feature set stays paid and KAI is young as a standalone project.
YuniKorn brings strong hierarchical-queue multitenancy from the Spark world, at the cost of being another full scheduler replacement with a smaller AI-specific ecosystem than Volcano.

The cross-cutting truth is that gang scheduling trades utilization for progress. Holding GPUs idle while you wait for the full gang is the price of not deadlocking, and it's a price worth paying.
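The deadlock and its fix fit in a toy simulation. This is a deliberately simplified model, assuming one pod per node and a scheduler that places one pod at a time in round-robin order; no real scheduler works exactly this way, but the failure shape is the same.

```python
def schedule_independent(free_nodes: int, jobs: dict[str, int]) -> dict[str, int]:
    """Place pods one at a time, round-robin, with no notion of a job needing all its pods."""
    placed = {name: 0 for name in jobs}
    progress = True
    while free_nodes > 0 and progress:
        progress = False
        for name, need in jobs.items():
            if free_nodes > 0 and placed[name] < need:
                placed[name] += 1
                free_nodes -= 1
                progress = True
    return placed


def schedule_gang(free_nodes: int, jobs: dict[str, int]) -> dict[str, int]:
    """Admit a job only if every one of its pods fits at once."""
    placed = {}
    for name, need in jobs.items():
        if need <= free_nodes:
            placed[name] = need
            free_nodes -= need
        else:
            placed[name] = 0
    return placed


def runnable(placed: dict[str, int], jobs: dict[str, int]) -> list[str]:
    """Jobs that got all of their pods and can actually start."""
    return [name for name, need in jobs.items() if placed[name] == need]


jobs = {"job-a": 4, "job-b": 4}  # two 4-node training jobs, six free nodes
naive = schedule_independent(6, jobs)
gang = schedule_gang(6, jobs)
print("independent:", naive, "runnable:", runnable(naive, jobs))
print("gang:       ", gang, "runnable:", runnable(gang, jobs))
```

The independent run holds all six nodes and starts nothing. The gang run starts one job and leaves two nodes idle, which is the utilization cost the text describes.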

splitting the model

When a model outgrows one GPU, there are three axes to split it on, and real training combines them. Data parallelism replicates the whole model on each GPU and all-reduces gradients; it's the simplest, but it only works when the model plus its optimizer states fit on one card. Tensor parallelism splits individual matrix multiplies across GPUs and is communication-heavy, so you keep it inside a node on NVLink (TP=8 is the common ceiling). Pipeline parallelism cuts the layers into stages across nodes and passes activations point-to-point, which is cheap enough to cross the network.

Fig. 3 · the standard frontier recipe: tensor-parallel inside the NVLink domain, pipeline-parallel across nodes, data-parallel on top. Match each split to the bandwidth it can afford.

For the common case of "my model doesn't fit but I want to stay in native PyTorch," FSDP2 shards the parameters, gradients, and optimizer states across GPUs and reconstructs each layer on the fly via all-gather, prefetching the next shard to overlap communication with compute. DeepSpeed's ZeRO does the same idea in stages: stage 1 shards optimizer states, stage 2 adds gradients, stage 3 adds parameters and is functionally equivalent to FSDP. For frontier scale and maximum MFU you reach for Megatron-Core and combine tensor, pipeline, and sequence parallelism into the 3D (now 4D, with expert parallelism for MoE) recipe: TP=8 inside the node, pipeline across nodes, data parallelism on top.
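The staging is easiest to see as memory arithmetic. The sketch below uses the accounting from the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer state (FP32 master weights, momentum, variance). It ignores activations, buffers, and fragmentation, so real usage is higher. The 70B and 64-GPU figures are just an example.

```python
def zero_bytes_per_gpu(params: int, n_gpus: int, stage: int) -> float:
    """Model-state bytes per GPU under ZeRO stage 0-3 with mixed-precision Adam."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = params * 2.0  # 16-bit parameters
    grads = params * 2.0    # 16-bit gradients
    optim = params * 12.0   # FP32 master weights + Adam momentum + Adam variance
    if stage >= 1:
        optim /= n_gpus     # stage 1 shards optimizer state
    if stage >= 2:
        grads /= n_gpus     # stage 2 also shards gradients
    if stage >= 3:
        weights /= n_gpus   # stage 3 also shards parameters
    return weights + grads + optim


PARAMS = 70_000_000_000
for stage in range(4):
    gb = zero_bytes_per_gpu(PARAMS, n_gpus=64, stage=stage) / 1e9
    print(f"stage {stage}: {gb:8.1f} GB per GPU")
```

Stage 0 is plain data parallelism, and its number is why it stops working: the model states alone exceed any single card long before the model is "large".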

Checkpointing is the reliability workhorse and used to be the tax that made frequent saves unaffordable. Async distributed checkpointing fixed that by writing state in a background thread that overlaps the next iterations; TorchTitan reports 5–15x lower checkpoint overhead than synchronous saves. That matters directly because of the failure rate: at one interruption every three hours, you want to checkpoint on the order of tens of minutes, and torchrun's elastic mode restarts the job from the last snapshot when a node dies. The newer torchft goes further, recovering a failed replica from a healthy peer without restarting the whole job.
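One way to sanity-check "tens of minutes" is Young's classic first-order approximation for the checkpoint interval, sqrt(2 x checkpoint cost x mean time between failures). The post doesn't use this formula; I'm bringing it in as a rough cross-check. The 54-day and 419-interruption figures are the Llama 3 numbers quoted at the top, and the 30-second checkpoint cost is an assumption.

```python
import math


def mtbf_hours(days: float, interruptions: int) -> float:
    """Mean time between interruptions over a run."""
    if interruptions <= 0:
        raise ValueError("interruptions must be positive")
    return days * 24 / interruptions


def young_interval_s(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's approximation for the checkpoint interval that minimizes lost time."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)


mtbf_h = mtbf_hours(days=54, interruptions=419)
interval_s = young_interval_s(checkpoint_cost_s=30, mtbf_s=mtbf_h * 3600)
print(f"MTBF {mtbf_h:.2f} h, suggested interval {interval_s / 60:.1f} min")
```

The cheaper the checkpoint, the shorter the interval the formula suggests, which is why async checkpointing changes how often you can afford to save.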

when the model won't fit one node

Serving crosses the node boundary for the same reason training does: the model, or its KV cache, exceeds one node's total HBM. DeepSeek-V3 at 671B, Llama 405B, the big mixture-of-experts (MoE) models. You split them with tensor and pipeline parallelism across nodes, and for MoE you add wide expert parallelism, spreading experts across many nodes so each GPU holds few experts but sees a large batch per expert.

The pattern that's become standard is disaggregated prefill and decode. Prefill (processing the prompt) is compute-bound; decode (generating tokens one at a time) is memory-bandwidth-bound. Running them in one pool means prefill work stalls decode latency. Splitting them into separate worker pools lets you scale each for its own bottleneck and transfer the KV cache between them over RDMA. It isn't a free win, though. Moving the KV cache between pools costs bandwidth, so disaggregation pays off when prefill interference is genuinely the bottleneck (long prompts, high concurrency) and can be net-negative when it isn't. DeepSeek runs it because at their scale it clearly is; a chatbot with short prompts might not need it at all.
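The transfer cost is worth estimating before you commit to two pools. The sketch below sizes the KV cache per token (keys and values, per layer, per KV head) and divides by link bandwidth. The model shape is my assumption of the published Llama-3-70B config (80 layers, 8 KV heads, head dimension 128, BF16), and the link is assumed to run at full line rate, which real transfers won't.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV-cache bytes for one token: keys and values for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def transfer_ms(prompt_tokens: int, bytes_per_token: int, link_gbit_s: float) -> float:
    """Milliseconds to ship one prompt's KV cache over a link at full line rate."""
    if link_gbit_s <= 0:
        raise ValueError("link_gbit_s must be positive")
    bytes_per_second = link_gbit_s * 1e9 / 8
    return prompt_tokens * bytes_per_token / bytes_per_second * 1e3


# Assumed Llama-3-70B shape: 80 layers, 8 KV heads (GQA), head_dim 128, BF16.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
print(per_token, "bytes per token")
print(round(transfer_ms(prompt_tokens=8000, bytes_per_token=per_token, link_gbit_s=400), 1), "ms")
```

Compare that figure to the prefill time for the same prompt: if the transfer is a small fraction of prefill, disaggregation has room to pay off; if prompts are short, the overhead is most of what you bought.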

Fig. 4 · prefill and decode want different hardware ratios, so modern stacks run them as separate pools and ship the KV cache between them. DeepSeek-V3 runs a 32-GPU prefill unit in front of a 320-GPU decode pool.

DeepSeek's own deployment runs a 32-GPU prefill unit (4 nodes, expert-parallel across 32) feeding a much larger decode pool, and an LMSYS reproduction on 96 H100s hit 52,000 input and 22,000 output tokens per second per node. On Kubernetes the primitive that expresses "this is one model replica made of many pods" is LeaderWorkerSet: one leader, N workers, scheduled and scaled as a unit, which is exactly what gang scheduling and topology-aware placement need to bite on. NVIDIA Dynamo and llm-d sit on top: Dynamo for distributed serving with a KV-cache-aware router, and llm-d for KV-cache-aware routing on the Gateway API Inference Extension. That routing layer turns out to be one of the biggest free wins in the whole stack, which is why it gets its own part 6.

everything fails at scale

Reliability stops being a checkbox and becomes the main event. The Llama 3 numbers from the top of this post are the reference point, and ByteDance's MegaScale run on 12,288 GPUs tells the same story: 55.2% MFU and more than a hundred failure-recovery events over a few weeks. The failure you can't see coming is silent data corruption (SDC), where a GPU computes a wrong number without erroring. It doesn't crash or log anything. It just hands back the wrong answer, and your loss curve grows a mysterious kink a few hours later. Meta caught six such events in 54 days; Google reports SDC-related disruptions roughly every one to two weeks. A single corrupted gradient contaminates the global update across every worker, which is why SDC is now a first-class reliability topic with its own whitepapers.

Key Insight: At cluster scale, hardware failure is the normal state, not an exception you engineer away. A 16k-GPU run loses a GPU every few hours. You can't stop that, so the whole game is checkpointing often enough and keeping enough hot spares that a dead node costs you minutes instead of the whole run.

The straggler is the SDC's cousin: a GPU or link that's degraded but not dead, quietly throttling a synchronous job because collectives move at the speed of the slowest member. Detecting it at scale is genuinely hard, so systems run periodic self-check diagnostics that pause the job, measure NVLink and compute per node, diagnose, and resume from checkpoint. Imbue open-sourced their bare-metal playbook for a 4,088-H100 cluster: check VBIOS and baseboard firmware, the Mellanox OFED stack, PCIe link speed and width, then run matmuls to measure actual NVLink bandwidth, cordon anything that fails, and swap in a hot spare automatically. That last part is the operating model at scale. You don't fix nodes in the critical path; you drain them and pull from a spare pool, because the cluster is always partially broken and the job has to keep moving.

Fig. 5 · the failure breakdown from Meta's Llama 3 405B run. GPUs and their memory are roughly half; the rest is the long tail an at-scale operator plans around, not against.

paying for it

The economics are why all of the above matters. At $2–6 per GPU-hour, a 16,000-GPU cluster idling during a recovery burns roughly $500 to $1,600 a minute, and gang scheduling means one bad node can idle the whole job. That's the real argument for the reliability engineering: not uptime for its own sake, but goodput, the useful training throughput net of failures and restarts.
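The arithmetic behind the idle cost is worth having on hand. This is a minimal sketch using the price range and cluster size quoted above; the goodput helper and its example hours are hypothetical additions for illustration.

```python
def idle_burn_per_minute(gpus: int, usd_per_gpu_hour: float) -> float:
    """Dollars spent per minute while every GPU in the job sits idle."""
    return gpus * usd_per_gpu_hour / 60.0


def goodput_fraction(wall_clock_hours: float, lost_hours: float) -> float:
    """Share of wall-clock time that produced useful training progress."""
    return (wall_clock_hours - lost_hours) / wall_clock_hours


for price in (2.0, 6.0):
    print(f"${price:.2f}/GPU-hour -> ${idle_burn_per_minute(16_000, price):,.0f} per idle minute")

# Hypothetical: a 24-hour window with 1.5 hours lost to restarts and recovery.
print(f"goodput: {goodput_fraction(24, 1.5):.1%}")
```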

Fig. 6 · the on-demand spread between neoclouds and hyperscalers is wide, and it moves monthly. Big training runs almost never pay on-demand; they live on reserved capacity or capacity blocks.

The pricing spread is wide and moves every month. Neoclouds like Lambda, CoreWeave, and Nebius run an H100 around $2.5–3.5 per hour; AWS lists closer to $6.88. B200s are $5–6 on neoclouds and north of $14 on AWS. Reserved commitments of one to twelve months cut 16–40% off on-demand, and that's where most large training capacity actually lives. Spot is 30–70% cheaper and nearly unusable for gang-scheduled training, because losing any one node preempts the whole synchronous job and reacquiring N contiguous, topology-aligned nodes on spot is a fantasy. That gap is why capacity blocks exist: AWS EC2 Capacity Blocks and GCP's Dynamic Workload Scheduler let you reserve co-located GPUs for a fixed window, booked weeks ahead, because on-demand can't guarantee the topology and reserved is too long a commit for one run. AWS raised those block prices about 15% in early 2026, which tells you which way demand is going.

That's the whole arc of building this stuff. A GPU deployment is not a GPU. It's a dozen layers under one pod, a small network inside one box, and a large one between boxes, and the interesting failures always live in the wiring rather than the silicon. Meta's cluster lost a GPU every three hours and still trained a frontier model, because the whole apparatus around the GPUs (the fabric, the scheduler, the checkpointing, the spare pool) was built to keep moving while parts of it were on fire. That's how you build it. The next part is how you watch it once real traffic arrives, which turns out to be a different problem than watching the GPUs.

One box, eight GPUs, and the wires between them

Harshit Luthra — Thu, 02 Jul 2026 19:13:24 +0000

Originally published at harshit.cloud on 2026-07-04.

We bought two boxes that were supposed to be identical. Eight H100s each, same rack, same image. On one of them a tensor-parallel serve of Llama-3-70B did about 3,000 tokens a second. On the other it did 900, with every GPU pinned at high utilization the whole time. Same model, same code, same card count. The difference was a PCIe switch and a BIOS setting nobody had checked, and it took most of a day to find because every dashboard said both boxes were healthy.

That's the thing about a multi-GPU box. It looks like a bag of eight GPUs. It behaves like a small, opinionated network, and the wiring between the cards matters more than the cards. This is part 2 of the series. Part 1 was the twelve-layer stack under a single GPU pod. This one stays inside one chassis: how the GPUs talk, how to read the topology, why NCCL is slow, and how a 70B model actually lands on the hardware. Part 3 leaves the box.

the box is a network, not a bag of GPUs

The first question about any multi-GPU server is how the GPUs are wired, because that sets a hard ceiling on everything above it. There are two very different animals sold as "8-GPU servers."

An HGX or DGX baseboard wires all eight GPUs through a bank of NVSwitches. Every GPU reaches every other GPU at the full NVLink rate, non-blocking. On H100 and H200 that's NVLink 4 at 900 GB/s per GPU. On B200 it's NVLink 5 at 1.8 TB/s. That flat, full-bandwidth mesh is the reason you can split a model eight ways and have the shards talk fast enough to keep up.

A cheaper "8x PCIe" box has no NVSwitch. The GPUs hang off PCIe switches and the CPU root complexes, and GPU-to-GPU traffic crawls through PCIe, often routed up through the CPU. PCIe Gen5 x16 is about 128 GB/s bidirectional, Gen4 about 64. At 900 GB/s, NVLink 4 is roughly seven times faster than Gen5 and fourteen times faster than Gen4. That gap is the entire reason tensor parallelism cares about your topology. The two "identical" boxes in the opening weren't identical: one had the NVSwitch mesh, the other routed two of its GPU pairs across a PCIe switch with a BIOS feature quietly strangling them.

One units trap that trips up everyone reading spec sheets: NVIDIA quotes NVLink bandwidth bidirectionally. A100's "600 GB/s" is 300 each way. Pick a convention, state it once, and don't compare someone's unidirectional number to your bidirectional one.

Key Insight: Two servers with the same eight GPUs can differ by 3× on the same job. Before you promise a throughput number, run nvidia-smi topo -m and confirm the GPUs talk over NVLink (the NV# entries), not over PCIe routed through the CPU (the SYS entries).

Fig. 2 · bandwidth per GPU across the interconnects you'll actually meet. The jump from PCIe to NVLink is the one that decides whether a split model keeps up with itself.

reading nvidia-smi topo -m

You don't have to guess at any of this. nvidia-smi topo -m prints the whole connection matrix, and learning to read it is the single most useful GPU-ops skill after nvidia-smi itself. Every cell tells you how one GPU reaches another, and the symbols form a quality ladder from best to worst:

NV#: connected by # bonded NVLinks. Best. NV18 means eighteen links, full H100 mesh.
PIX: a single PCIe bridge, same switch. Fine.
PXB: multiple PCIe bridges, but not across the CPU host bridge.
PHB: crosses a PCIe host bridge, through the CPU, same NUMA node.
NODE: crosses host bridges within a NUMA node.
SYS: crosses the inter-socket link between CPUs. Worst. This is CPU-to-CPU-to-GPU.

If the GPUs in your tensor-parallel group show NV18 to each other, you're golden. If any pair shows SYS, your collective operations are dragging across the socket interconnect and you've found your 900-tokens-a-second box. The same matrix has a column for GPU-to-NIC affinity, which matters enormously for the multi-node story in part 3: you want the NIC on the same PCIe complex as the GPU it feeds.
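The ladder is easy to automate. The sketch below ranks each GPU-to-GPU cell and reports the worst link in a group. The four-GPU matrix is a hand-written, simplified stand-in for real nvidia-smi topo -m output (no affinity columns, no NIC rows, no terminal escape codes), so treat the parsing as an assumption to adapt rather than a drop-in tool.

```python
LADDER = ["NV", "PIX", "PXB", "PHB", "NODE", "SYS"]  # best -> worst


def link_rank(cell: str) -> int:
    """Position on the quality ladder; any NV# cell counts as the best rung."""
    if cell.startswith("NV"):
        return 0
    return LADDER.index(cell)


def worst_link(matrix_text: str):
    """Return (src, dst, symbol) for the slowest GPU-to-GPU path in the matrix."""
    rows = [line.split() for line in matrix_text.strip().splitlines()]
    gpus = [name for name in rows[0] if name.startswith("GPU")]
    worst = None
    for row in rows[1:]:
        src, cells = row[0], row[1:1 + len(gpus)]
        if not src.startswith("GPU"):
            continue  # skip NIC rows and legend lines
        for dst, cell in zip(gpus, cells):
            if cell == "X":
                continue  # a GPU's link to itself
            if worst is None or link_rank(cell) > link_rank(worst[2]):
                worst = (src, dst, cell)
    return worst


# Hypothetical, simplified matrix for a box where GPU3 sits on the other socket.
SAMPLE = """
        GPU0    GPU1    GPU2    GPU3
GPU0     X      NV18    NV18    SYS
GPU1    NV18     X      NV18    SYS
GPU2    NV18    NV18     X      SYS
GPU3    SYS     SYS     SYS      X
"""

print(worst_link(SAMPLE))
```

Anything worse than NV# inside a tensor-parallel group is worth a closer look before the box goes into service.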

NCCL picks a road

Every framework that splits work across GPUs (PyTorch DDP and FSDP, DeepSpeed, Megatron, the tensor-parallel path in vLLM) does its cross-GPU communication through NCCL, NVIDIA's collectives library. NCCL is where "the GPUs need to agree on a number" turns into actual bytes on actual wires, and it auto-picks the road.

Inside one box it prefers, in order: peer-to-peer over NVLink (best), peer-to-peer over PCIe, shared host memory (staged through RAM), then network sockets (worst, and a sign something's misconfigured). When an all-reduce (the step where every GPU merges its numbers with all the others and ends up with the combined result) is slow, the debugging loop is almost always the same handful of moves:

# 1. what did NCCL actually choose?
NCCL_DEBUG=INFO python train.py 2>&1 | grep -iE 'via|transport|channel'
#   "via P2P/direct pointer" = good. "via SHM" or "via NET/Socket" intra-node = bad.

# 2. confirm the GPUs are NVLinked, not routed over SYS
nvidia-smi topo -m

# 3. benchmark against the ceiling
all_reduce_perf -b 8 -e 4G -f 2 -g 8   # from nccl-tests

# 4. if a hang clears when you disable P2P, you have an ACS/IOMMU problem
NCCL_P2P_DISABLE=1 python train.py

The env vars worth knowing are few. NCCL_DEBUG=INFO tells you what topology and transport NCCL chose. NCCL_P2P_LEVEL and NCCL_P2P_DISABLE control peer-to-peer. NCCL_SOCKET_IFNAME picks the bootstrap interface, and pointing it at the wrong one (lo, docker0) is a classic way to make init hang. NCCL_TOPO_FILE lets you hand NCCL a topology description, which you sometimes need on cloud VMs because virtualized PCI hides the real affinity and NCCL guesses wrong. On a healthy 8-GPU NVSwitch box you usually touch none of these and it just works. The trouble starts on anything cheaper or virtualized.
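When reading all_reduce_perf output, the column to compare against the hardware is bus bandwidth rather than algorithm bandwidth. As documented by nccl-tests, for all-reduce the bus bandwidth is the algorithm bandwidth scaled by 2(n-1)/n. A small sketch of that conversion, with a made-up timing as input:

```python
def allreduce_algbw(bytes_reduced: float, seconds: float) -> float:
    """Algorithm bandwidth: payload size divided by wall time, in bytes/s."""
    return bytes_reduced / seconds


def allreduce_busbw(bytes_reduced: float, seconds: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce: algbw * 2 * (n - 1) / n."""
    return allreduce_algbw(bytes_reduced, seconds) * 2 * (n_ranks - 1) / n_ranks


# Hypothetical measurement: a 4 GiB all-reduce across 8 GPUs taking 20 ms.
size_bytes = 4 * 2**30
busbw = allreduce_busbw(size_bytes, 0.020, 8)
print(f"bus bandwidth: {busbw / 1e9:.0f} GB/s")
```

The scaling factor is what makes the result roughly independent of rank count, so it can be held up against the interconnect's rated speed. If the number lands far below what the topology should deliver, the transport NCCL picked is the first suspect.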

the invisible tax: NUMA and ACS

Two settings below the framework quietly decide whether your bandwidth numbers are real, and neither shows up in a GPU dashboard.

The first is NUMA pinning. A multi-socket server splits its PCIe lanes and RAM between CPU sockets, and each GPU is physically wired to one socket. Run your process on socket 0 while it drives a GPU hung off socket 1, and every host-to-device copy crosses the inter-socket link. NCCL's low-latency protocol stages data through a pinned CPU buffer, so this hits communication, not just data loading. The fix is a one-liner: numactl --cpunodebind=0 --membind=0 <cmd>, matched to the socket that owns the GPU. nvidia-smi topo -m prints the affinity so you know which socket that is. It's genuinely deflating to spend an afternoon profiling and find the answer was a numactl prefix, but that's most of this job.

The second is PCIe ACS, Access Control Services, and it's the one that cost us most of that day. ACS forces PCIe peer-to-peer transactions to route up through the CPU root complex so the platform can police them. That defeats direct GPU-to-GPU DMA across a PCIe switch: latency climbs, throughput collapses, and NCCL can hang outright. ACS has to be off for peer-to-peer to work across a switch. You check it with lspci -vvv | grep -i acsctl and disable it in BIOS or with a setpci loop over the bridges. Its cousin, the IOMMU, does the same routing-through-the-root-complex thing, which is exactly why passthrough GPUs on cloud VMs often show degraded peer-to-peer: the isolation that makes virtualization safe is the isolation that makes GPUDirect slow. If nvidia-smi insists your GPUs "are not P2P capable," the shortlist is ACS enabled, IOMMU on, consumer cards, or GPUs on different root complexes. p2pBandwidthLatencyTest from the CUDA samples confirms which.

fitting a 70B model on the box

The practical question most teams actually have is: how many GPUs does my model need, and how do I split it. Tensor parallelism splits every layer's weight matrices across GPUs, which means every token, at every layer, triggers a collective to recombine the partial results. That makes TP communication-bound and latency-sensitive, which is the real reason it wants NVLink. On a PCIe-only box, tensor parallelism is frequently slower than just pipelining the layers, and vLLM's own guidance says as much: no NVLink, prefer --pipeline-parallel-size over --tensor-parallel-size.

The memory math for Llama-3-70B is a worked example worth carrying around:

Weights: 70B params times 2 bytes for BF16 is about 140 GB. That already doesn't fit on one 80GB card.
KV cache (the model's running memory of the tokens it has already processed): roughly 2.5 GB per sequence at 8K context, climbing to tens of GB at 128K. This is what eats whatever VRAM the weights left behind and sets your max batch size.
Real footprint with activations and framework overhead lands north of 200 GB.
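The first two lines of that math can be reproduced in a few lines of Python. The architecture constants (80 layers, 8 KV heads under grouped-query attention, head dimension 128) are the published Llama-3-70B configuration as I understand it, not values stated above, so verify them against the model's config file before relying on the output.

```python
def weight_bytes(params: float, bytes_per_param: int = 2) -> float:
    """Bytes needed for the weights alone (2 bytes per parameter for BF16)."""
    return params * bytes_per_param


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """KV cache bytes per token: one key and one value vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Assumed Llama-3-70B shape (check against the model's config.json).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)
print(f"weights: {weight_bytes(70e9) / 1e9:.0f} GB")
for context in (8 * 1024, 128 * 1024):
    gib = per_token * context / 2**30
    print(f"KV cache per sequence at {context:,} tokens: {gib:.1f} GiB")
```

Multiply the per-sequence figure by the number of concurrent sequences to see how quickly the cache overtakes the weights as the dominant consumer.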

So Llama-3-70B at BF16 wants at least two H100 80GB cards (--tensor-parallel-size 2), or four A100 40GB, or a fistful of 24GB cards. Treat that 160 GB as a floor: it holds the 140 GB of weights with about 20 GB left for KV cache, so the 200 GB working footprint points at four 80GB cards once batch sizes or context lengths grow. Tensor parallelism splits the KV cache too, so two GPUs buy you roughly double the batch headroom, not just room for the weights. Quantize to FP8 on Hopper or Blackwell, or INT4 with AWQ, and the weights roughly halve or quarter, which can drop a 70B onto a single card at some quality cost. You can quantize the KV cache too (--kv-cache-dtype=fp8_e5m2), which shrinks the biggest consumer of leftover VRAM once batch sizes climb. And a rule worth internalizing: tensor-parallel a model only when it genuinely doesn't fit one GPU. For a model that does fit, run several replicas at --tensor-parallel-size 1 instead, because TP's per-layer communication is pure overhead you're paying for nothing when one card already holds the whole model. The constraint people forget: your TP size has to divide the number of attention heads evenly, so you don't get to pick arbitrary GPU counts, and on Kubernetes it also has to equal the pod's nvidia.com/gpu limit or the server won't start. (You learn this the moment --tensor-parallel-size 6 refuses to start and the error message does nothing to help.)
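The divisibility rule is quick to check before you order hardware. A minimal sketch, assuming Llama-3-70B's 64 attention heads (a figure from the model's published configuration, not from the text above):

```python
def valid_tp_sizes(num_attention_heads: int, max_gpus: int) -> list[int]:
    """Tensor-parallel sizes up to max_gpus that divide the head count evenly."""
    return [tp for tp in range(1, max_gpus + 1) if num_attention_heads % tp == 0]


# Assumed head count for Llama-3-70B; read it from the model config in practice.
print(valid_tp_sizes(64, 8))
```

On an eight-GPU box that leaves 1, 2, 4, and 8, which is why a six-way split never gets off the ground.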

the caveats that page you

Dense GPU boxes fail in ways that a single card never does, and most of them present as "one GPU is a little slow" rather than a clean error.

Thermal throttling is first. Eight cards share airflow, and a fully loaded chassis runs hot. Data-center GPUs start clocking down as they approach roughly 85°C, and nvidia-smi --query-gpu=clocks_throttle_reasons.active --format=csv will tell you whether it's a software thermal slowdown, a hardware one, or a power cap. Power capping is the sibling: nvidia-smi -pl sets a limit below the card's TDP (700W on an H100 SXM), and a rack without enough power budget will cap every card and slow the whole box uniformly.

Then the straggler problem, which is the nastiest because it hides. Collective operations synchronize every step, so the slowest GPU sets the pace for all of them. One card that's throttling, power-capped, or quietly degraded drags the entire tensor-parallel group, and the symptom is maddening: every GPU shows high utilization, throughput is low, and nothing errors. You find it by looking at per-GPU clocks and temperature and hunting the outlier.
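Hunting the outlier can be a one-function job once the per-GPU clocks are in hand (for example from nvidia-smi --query-gpu=index,clocks.sm --format=csv,noheader,nounits). The sketch below flags any GPU whose SM clock sits well below the group median. The 10% tolerance and the sample readings are arbitrary illustrative choices, not recommended thresholds.

```python
from statistics import median


def find_stragglers(sm_clocks_mhz: dict[int, float], tolerance: float = 0.10) -> list[int]:
    """GPU indices whose SM clock is more than `tolerance` below the group median."""
    typical = median(sm_clocks_mhz.values())
    floor = typical * (1 - tolerance)
    return sorted(gpu for gpu, mhz in sm_clocks_mhz.items() if mhz < floor)


# Hypothetical readings from an eight-GPU box under load; GPU 3 is throttling.
clocks = {0: 1980, 1: 1975, 2: 1980, 3: 1410, 4: 1980, 5: 1965, 6: 1980, 7: 1980}
print(find_stragglers(clocks))
```

Comparing against the median rather than the rated clock keeps the check meaningful on a box that is uniformly power-capped, where every card is slow but none is the straggler.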

The rest of the list is worth a scan before you sign off on a box:

PCIe lane starvation. A card silently negotiating Gen3 x4 instead of Gen5 x16 is pure mystery slowness. lspci -vv shows the negotiated LnkSta.
NVLink errors. nvidia-smi nvlink -e shows CRC and replay counters. Rising counts mean a flaky link or cable and degraded bandwidth.
Oversubscribed PCIe switch. Cheap boxes put several GPUs behind one switch uplink, and they contend for it.
The /dev/shm trap, if you're on Kubernetes. NCCL stages some intra-node transfers through shared memory, and a container's default 64 MiB /dev/shm hangs multi-GPU serving with no useful error. Mount an emptyDir with medium: Memory at /dev/shm. It's in every production vLLM manifest, and it's why tensor-parallel works on a bare VM but hangs in a pod.
Xid 79, the "fell off the bus" from part 1, shows up here too when a card overheats or loses power delivery under a full load it never saw in acceptance testing.

picking an inference engine

If the box is for serving rather than training, the engine choice matters, and by 2026 the field has settled. All the live engines converge on the same two tricks: continuous batching (decide the batch membership every decode step so the GPU never idles on the slowest request) and paged KV cache (manage the cache like OS virtual memory so you don't waste most of your VRAM on fragmentation). The differences are in the scheduler and the compile strategy.

Fig. 3 · where the four engines land. Start at vLLM; move only when a profiled bottleneck points somewhere specific.

Start with vLLM. It has the broadest model and hardware coverage, it invented PagedAttention, and it stands up with sane defaults faster than anything else. Reach for TensorRT-LLM when a profiled bottleneck justifies the cost, because it delivers the highest raw throughput and lowest latency on NVIDIA hardware but makes you compile a per-model, per-GPU engine and run a heavier ops burden for it. Reach for SGLang when your workload shares prefixes: RAG, multi-turn chat, agents, or high-concurrency MoE, where its RadixAttention prefix cache pulls meaningfully ahead. And don't start new work on TGI; HuggingFace put it in maintenance mode and points people at vLLM or SGLang themselves. One layer up from the engine, NVIDIA's Dynamo wraps any of these for distributed, disaggregated serving with a KV-cache-aware router, and NIM packages them as prebuilt microservices. On a single box the raw engine is what you're tuning, but those are the names you'll meet the moment you scale out.

The two boxes from the opening ran the same vLLM. Same engine, same flags, one served three times the traffic. The engine was never the variable. The wires were. Which is the whole lesson of a single node, and also the reason part 3 is about the wires between nodes, where the same story plays out at a hundred times the scale and a hundred times the cost.

The dozen layers under a GPU pod

Harshit Luthra — Thu, 02 Jul 2026 19:13:19 +0000

Originally published at harshit.cloud on 2026-07-02.

The pod was stuck in Pending and the node had eight H100s sitting idle. kubectl describe node said nvidia.com/gpu: 0. nvidia-smi on the host printed all eight cards, healthy, 40°C, nothing running. So the hardware was fine, the driver was fine, and Kubernetes was convinced there were zero GPUs in a box that cost more than my car.

That gap is the whole job. A GPU pod doesn't run on a GPU. It runs on about a dozen layers stacked between the silicon and your container, and any one of them can be quietly broken while every layer above and below it looks green. If you've shipped normal apps on Kubernetes, a GPU pod looks identical right up until it doesn't: underneath sits a stack of hardware and driver pieces a web pod never touches, and that's where the surprises live. This series is about running those layers in production without getting paged. Part 1 is the anatomy: what the layers are, what breaks at each one, and which numbers actually tell you the truth. Part 2 is a single box with eight GPUs and the wires between them. Part 3 is scaling past one box, where the network becomes the machine. Part 4 is watching the whole thing under real traffic. Part 5 is scaling it to zero when nobody's using it, without the bill or the cold start eating you. Part 6 is the routing layer in front of it all, where the right load balancer buys a 2× speedup for free. Part 7 keeps it breathing through loads and deploys. Part 8 is sharing it with other teams without getting burned by the tenants, or by a security boundary that isn't where you think.

the dozen layers between silicon and your pod

Start at the bottom and climb. Each layer trusts the one under it and lies to the one above it when things go wrong.

The silicon is the GPU itself plus the NVSwitch fabric that wires the GPUs on a board together, plus the NIC (a ConnectX-7 or BlueField-3) that carries traffic off the box. This is where hardware faults live: ECC errors (bit-flips in memory the card catches and corrects), thermal throttle, a card that stops answering on the PCIe bus.

Firmware sits on the card. Modern GPUs have a GSP, a GPU System Processor, a little RISC-V core that runs firmware and offloads work the host driver used to do. When you hear about a GPU "hanging" for no visible reason, the GSP firmware timing out is a common culprit.

The kernel driver is nvidia.ko and friends (nvidia-uvm, nvidia-peermem). It's a kernel module, which means it's compiled against your exact kernel headers. Upgrade the kernel without rebuilding the module and the driver won't load. This is the layer that breaks on a Tuesday because someone patched the base image.

The userspace driver is libcuda.so, the CUDA driver API. It ships with the driver, not with CUDA, and this trips people up constantly. nvidia-smi talks to the driver through NVML, which is why nvidia-smi can work while your actual CUDA program fails: they're using different entry points into the same stack.

The CUDA runtime is libcudart plus the math and collective libraries: cuBLAS, cuDNN, NCCL. Here's the thing nobody tells you on day one. PyTorch and TensorFlow wheels bundle their own copy of all of this. When you pip install torch==2.x+cu128, you are installing CUDA 12.8, cuDNN, and NCCL inside the wheel. The host node doesn't need a CUDA install at all. It needs a driver new enough to satisfy that bundled runtime, and nothing more. Once that clicks, half the version confusion evaporates.

The NVIDIA Container Toolkit (nvidia-ctk, libnvidia-container, currently 1.19.1) is the bridge between the host driver and the container. At container start it injects /dev/nvidia* and the driver libraries into the container's filesystem. Miss this and you get the classic symptom: nvidia-smi works on the host, fails inside the pod.

The container runtime is containerd or CRI-O running runc underneath. It has to be told to use the nvidia runtime. If default_runtime_name in /etc/containerd/config.toml isn't set, pods land on the node with no GPU access and no obvious error. (Recent GPU Operator versions wire this through the NRI/CDI plugin instead of a default runtime, but the failure mode is identical: get it wrong and the pod sees no GPU.)

The device plugin (0.19.x) is the piece that talks to the kubelet. It counts the GPUs and advertises them as nvidia.com/gpu: 8, or nvidia.com/mig-1g.10gb: 56 if you're slicing. When this crashes or can't reach the driver, you get nvidia.com/gpu: 0 and pods stuck Pending. That was my incident at the top. The device plugin had crash-looped after a driver-container restart and never re-registered.

Node Feature Discovery and GPU Feature Discovery label the node with what it has: GPU model, memory, compute capability, MIG profile, driver version. The scheduler reads those labels to place pods. Wrong labels, wrong placement.

The GPU Operator (v26.3.x) is the thing that installs and manages every layer above the kernel as a set of DaemonSets. It runs the driver as a container, wires the toolkit, deploys the device plugin, DCGM, the MIG manager. It's a huge convenience and one more control loop to debug when it gets stuck reconciling.

On top, the scheduler (kube-scheduler, or Kueue, Volcano, Run:ai, or Slurm) decides which pod lands on which GPU. And finally the workload: a training job that needs all N GPUs at once or nothing, or an inference server that would happily take a seventh of one card.

Twelve layers. The reason GPU infra feels harder than normal infra isn't any single layer. It's that the failure at layer 3 shows up as a symptom at layer 9, and the tooling at layer 9 has no idea layer 3 exists.

Key Insight: The layer that breaks and the symptom you notice are rarely the same layer. The pod is Pending up at the scheduler, but the cause is a driver that didn't load down near the metal. When a GPU pod misbehaves, debug from the bottom of the stack up, not the top down.

One part of this stack is quietly being rebuilt under you. The device plugin advertising nvidia.com/gpu: 8 is a flat count: a pod asks for a number and gets whatever GPUs the node has. Kubernetes 1.34 (August 2025) made Dynamic Resource Allocation (DRA) generally available, and it's the eventual replacement for that model. DRA is a ResourceClaim API, the way a PersistentVolumeClaim is for storage, so a pod can ask for "two NVLink-connected GPUs with at least 40GB each" instead of a bare count, and it's how rack-scale multi-node NVLink (GB200 NVL72) gets scheduled at all. NVIDIA ships a DRA driver through the GPU Operator. The device plugin is still the common path in mid-2026, but a stack diagram drawn today should treat it as the model DRA is replacing, not the permanent one.

the version matrix that pages you

The single most common self-inflicted outage is a version mismatch, and the matrix has four axes: driver, CUDA toolkit, cuDNN, and framework. The good news is that three compatibility mechanisms mean you rarely have to line up all four exactly. The bad news is that nobody explains which mechanism they're relying on, so it feels like luck.

Backward compatibility is the easy one. A newer driver runs older CUDA binaries, always. An R580 driver runs a CUDA 12 app and a CUDA 13 app without complaint. So keeping the driver ahead of everything is safe.

CUDA minor-version compatibility is the one you lean on daily. Any CUDA 12.x toolkit runs on any driver that supports 12.0. A driver from the 525 era will run a CUDA 12.8 binary, because they share the CUDA 12 major. This is why the framework-bundles-its-own-CUDA pattern works: the wheel carries CUDA 12.8, your node has some 12-capable driver, and they meet in the middle.

Forward compatibility is the escape hatch. The cuda-compat package ships a newer user-mode libcuda that lets a newer CUDA major run on an older driver branch. It's how you run a CUDA 13 app on a node still pinned to an R535 driver you can't upgrade yet. It only works on data-center GPUs, and it's a deliberate override, not something to build on.

Here's the current data-center driver picture as of mid-2026:

Branch	Type	EOL	CUDA
R535	LTS	June 2026	12
R580	LTS	June 2028	13
R595	Production	March 2027	13
R610	New Feature	Aug 2026	13

If you take one thing from this section: manage the driver, let the framework carry the rest. Pin your training and serving images to NGC base images (nvcr.io/nvidia/pytorch:25.xx) where NVIDIA has already matched CUDA, cuDNN, and NCCL for you, and you delete an entire class of 2am pages.

One current gotcha worth flagging if you're on Blackwell. Drivers from 580.65.06 turn on Coherent Driver Memory Management by default for GB200 and GH200 on Kubernetes, and CDMM is incompatible with both MIG and GPUDirect Storage right now. If you buy GB200s planning to slice them with MIG, check that first, because it's not going to be on the datasheet.

slicing one GPU three ways

A single H100 has 80GB of HBM (its fast on-board memory) and enough compute to serve dozens of small models. Handing a whole card to a workload that uses 8% of it is how you set money on fire. There are three ways to share a GPU, and they are not interchangeable. The difference is isolation.

Fig. 2 · the same GPU shared three ways. Isolation drops as you move right; utilization convenience goes up. Pick by how much you trust the tenants.

MIG (Multi-Instance GPU) is hardware partitioning, available on the data-center cards (A100, A30, H100, H200, B200) but not the smaller inference GPUs like the L4 or A10G, which fall back to MPS. It cuts one GPU into up to seven instances, each with its own SMs (the GPU's compute cores), its own dedicated slice of HBM, its own memory controller and L2. A fault in one instance doesn't touch the others. That's real hardware isolation, the kind you want when tenants don't trust each other. The profiles are named [compute]g.[memory]gb. On an H100 80GB you get seven 1g.10gb slices, or two 3g.40gb, or one 7g.80gb that's just the whole card back. An H200's 141GB gives you seven 1g.18gb slices; a B200's 180GB gives seven 1g.23gb. The quirk that catches everyone: compute fractions go 1/7, 2/7, 3/7, 4/7, 7/7. There is no 5g or 6g profile. Memory is quantized in eighths. So a 1g slice gets one-seventh of the compute but one-eighth of the memory, and the arithmetic never quite lines up.
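The sevenths-versus-eighths mismatch is easier to see written out. This sketch maps each compute size to its default number of memory slices on an 80GB card and prints the resulting profile name with both fractions. The mapping table reflects the standard profiles as I understand them. Profile names on other cards round the memory figure differently (the H200 and B200 names above are examples), so the naming logic here is only meant for the 80GB case.

```python
from fractions import Fraction

# Compute slices (out of 7) -> default memory slices (out of 8).
MEMORY_SLICES = {1: 1, 2: 2, 3: 4, 4: 4, 7: 8}


def mig_profile(compute_slices: int, total_memory_gb: int = 80) -> str:
    """Default MIG profile name for a compute size on an 80GB card."""
    memory_gb = total_memory_gb * MEMORY_SLICES[compute_slices] // 8
    return f"{compute_slices}g.{memory_gb}gb"


def shares(compute_slices: int) -> tuple[Fraction, Fraction]:
    """(compute share, memory share) of the whole GPU for a profile."""
    return Fraction(compute_slices, 7), Fraction(MEMORY_SLICES[compute_slices], 8)


for g in MEMORY_SLICES:
    compute, memory = shares(g)
    print(f"{mig_profile(g):>8}: {compute} of compute, {memory} of memory")
```

The 3g profile is the clearest case of the mismatch: three-sevenths of the compute paired with half of the memory.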

MPS (Multi-Process Service) is a daemon that multiplexes several processes into one GPU context so their kernels run genuinely concurrently, not round-robin. You can cap each client's memory (CUDA_MPS_PINNED_DEVICE_MEM_LIMIT) and compute share (CUDA_MPS_ACTIVE_THREAD_PERCENTAGE). What you don't get is hardware memory protection or clean fault isolation. One client that OOMs can take its neighbors down with it. MPS is for high-throughput inference where you own all the tenants and want better SM utilization than time-slicing gives you.

Time-slicing is the crude one. The device plugin just lies about the GPU count, advertising one physical card as ten replicas. Ten pods land, and they take turns via context switching. There is no memory partitioning and no fault isolation at all. If one pod grabs 70GB of an 80GB card, the other nine OOM. It's fine for notebooks, CI, and dev clusters where the work is bursty and nobody's SLA depends on it. It has no business in front of production traffic.

	Memory isolation	Fault isolation	Concurrency	Use for
MIG	hardware	hardware	spatial	multi-tenant, untrusted, prod serving
MPS	soft caps	weak	true concurrent	trusted high-throughput inference
Time-slicing	none	none	round-robin	dev, notebooks, CI

You can stack them: partition a card into seven MIG slices, then time-slice each slice for burstier workloads. Most teams don't need that. Most teams need to notice they're running one 8% workload per $30k card and switch to 7× 1g.10gb.

the metrics that lie to you

Here's the number that ruins more capacity planning than any other: GPU-Util. When nvidia-smi shows 95% and everyone relaxes, they've misread it. DCGM_FI_DEV_GPU_UTIL answers exactly one question: was a kernel running during the sample window. It says nothing about how much of the silicon that kernel used. You can see 95% GPU utilization and 25% of the actual compute capacity in use at the same moment, and both numbers are honest about different things. Somebody will still screenshot the 95% into a capacity deck, and now you're being asked to buy more of a card you're already wasting.

Fig. 3 · GPU-Util says a kernel ran. MFU says how much of the chip did useful work. The gap between them is memory stalls, collective communication, and non-matmul overhead you paid for anyway.

The number that matters for training is MFU, Model FLOPs Utilization: the model FLOPs per second your run actually achieves, divided by the hardware's theoretical peak FLOPs per second. It's the metric the Llama and PaLM papers report, and 35–50% is considered excellent for large-scale training. The gap between "a kernel ran" and "the chip did useful math" is memory-bandwidth stalls, NCCL all-reduce waiting on the network, attention softmax and layernorm that aren't matmuls, optimizer steps, and activation recompute. All of it counts against your wall clock. None of it counts as useful FLOPs.
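A rough sketch of that ratio, assuming the common approximation of about 6 FLOPs per parameter per token for dense transformer training (forward plus backward). The helper name, the 7B-parameter run, the token rate, and the 989 TFLOPS dense BF16 peak are illustrative assumptions, not measurements from any real fleet:

```shell
# mfu PARAMS TOKENS_PER_SEC NUM_GPUS PEAK_TFLOPS_PER_GPU
# Prints MFU as a percentage. Assumes ~6 FLOPs per parameter per token,
# which only approximates dense transformer training.
mfu() {
  awk -v p="$1" -v tps="$2" -v n="$3" -v peak="$4" 'BEGIN {
    achieved = 6 * p * tps        # model FLOPs per second the run delivered
    ceiling  = n * peak * 1e12    # theoretical peak FLOPs per second for the job
    printf "%.1f\n", 100 * achieved / ceiling
  }'
}

# Hypothetical run: 7e9 parameters, 60,000 tokens/s, 8 GPUs, 989 peak TFLOPS each.
mfu 7e9 60000 8 989   # prints 31.9
```

Activation recompute is deliberately left out of the numerator: the hardware does those FLOPs, but the model didn't need them, which is why MFU sits below whatever the profiler reports.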

The money follows directly. On an 8-GPU H100 node at roughly $3 per GPU-hour, the difference between 25% and 45% MFU is about 1.8× the effective cost per token. Sticker price per GPU-hour is the number vendors compete on. Utilization is the number that actually sets your bill.
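The same arithmetic as a sketch: what an hour of one useful PFLOPS costs at a given MFU. The $3 price is the figure above; the 989 TFLOPS peak and the helper name are assumptions for illustration:

```shell
# cost_per_useful_pflops_hour PRICE_PER_GPU_HOUR PEAK_TFLOPS MFU_PERCENT
# Dollars paid for one hour of one PFLOPS of useful model math.
cost_per_useful_pflops_hour() {
  awk -v price="$1" -v peak="$2" -v mfu="$3" 'BEGIN {
    useful_pflops = peak / 1000 * mfu / 100   # useful PFLOPS one GPU delivers
    printf "%.2f\n", price / useful_pflops
  }'
}

cost_per_useful_pflops_hour 3 989 25   # prints 12.13
cost_per_useful_pflops_hour 3 989 45   # prints 6.74
```

12.13 divided by 6.74 is the 1.8× from the paragraph above. The sticker price never moved.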

For real telemetry, DCGM (Data Center GPU Manager) is the layer, and dcgm-exporter scrapes it into Prometheus. The fields worth putting on a dashboard from day one:

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE: Tensor Core utilization, the one that tracks real training throughput.
DCGM_FI_PROF_DRAM_ACTIVE: HBM bandwidth in use. High here with low tensor activity means you're memory-bound.
DCGM_FI_DEV_FB_USED: HBM used. Your OOM early-warning.
DCGM_FI_DEV_XID_ERRORS: the last Xid code. The single most important reliability signal on the box.
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL and the row-remap fields: memory health, trending toward RMA.
DCGM_FI_DEV_CLOCK_THROTTLE_REASONS: a bitmask telling you whether the card is throttling on power, thermals, or a reliability limit.
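The throttle-reasons value is the only field in that list you can't read at a glance, so here is a minimal decoder sketch. The bit values follow the NVML clocks-throttle-reason constants as I understand them (check them against the nvml.h that ships with your driver), and the short labels are my own shorthand, not official names:

```shell
# decode_throttle MASK: print the reasons set in a throttle-reasons bitmask.
# Bit values are assumed from NVML; labels are informal shorthand.
decode_throttle() {
  mask=$1
  out=
  for pair in 1:gpu_idle 2:app_clocks_setting 4:sw_power_cap 8:hw_slowdown \
              16:sync_boost 32:sw_thermal 64:hw_thermal 128:hw_power_brake; do
    bit=${pair%%:*}
    name=${pair#*:}
    if [ $(( mask & bit )) -ne 0 ]; then
      out="$out $name"
    fi
  done
  echo "${out# }"
}

decode_throttle 68   # 64 + 4, prints: sw_power_cap hw_thermal
```

A value of 1 (idle) is harmless. Anything with a thermal or power-brake bit set under load is worth a look at cooling and power delivery before blaming the model.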

dcgmi diag -r 3 runs about twelve minutes of escalating health checks (memory bandwidth, PCIe, NVLink, thermals under load) and is the thing to run before you trust a node you just recovered.

the Xid codes worth memorizing

Xid errors are the NVIDIA driver's way of reporting a GPU fault to the kernel log, whether the cause turns out to be the application, the driver, or the hardware. They land in dmesg as NVRM: Xid (PCI:0000:xx:00): <code>. An Xid is not always fatal, but it's never nothing. A handful are worth knowing on sight because they change what you do next.
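Pulling the code out of that log line is a one-line sed. A sketch, assuming the line shape shown above; the sample line, its timestamp, and its pid are made up for illustration:

```shell
# xid_codes: read kernel log lines on stdin, print "<pci-address> <xid>" per Xid line.
xid_codes() {
  sed -n 's/.*NVRM: Xid (PCI:\([^)]*\)): \([0-9][0-9]*\).*/\1 \2/p'
}

# Hypothetical sample line in the format described above.
printf '%s\n' '[1234.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4242, GPU has fallen off the bus.' \
  | xid_codes   # prints: 0000:3b:00 79
```

On a node, dmesg | xid_codes | sort | uniq -c gives a per-card count, which is the "watch the trend" view in one pipeline.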

Fig. 4 · the Xid triage most on-call runbooks converge on. The split that matters is app-fault versus hardware-fault, because one restarts a pod and the other cordons a node.

Xid 13, 31, 43 are usually your fault, not the hardware's: illegal memory access, a bad kernel, a page fault from application code. Restart the workload, look at the model, not the card.

Xid 48 is a double-bit ECC error, uncorrectable. Xid 94 is a contained ECC error, where the damage stayed inside the offending app and the other apps on the card survived. Xid 95 is the uncontained version, where the blast radius crossed apps, and the GPU has to be reset before anything restarts on it. The 94/95 split is the one to internalize: contained means drain politely, uncontained means the card is compromised until reset.

Xid 63 and 64 are row-remapping events. Modern HBM can retire bad memory rows the way a disk retires bad sectors. A 63 is the card recording that it did this; persistent 64s (remap failures) mean it's running out of spare rows and it's an RMA candidate. Watch the trend, don't panic on the first one.

Xid 79 is the one that ruins a night: "GPU has fallen off the bus." The card stopped answering on PCIe entirely. Thermal, power delivery, seating, a dying board. The node needs a reset, and if the same physical slot throws it again, that's a card headed back for RMA. Field reports put it around 3% of H100 deployments in the first year, which sounds small until you multiply it by a thousand-GPU fleet.

Xid 119 and 120 are the GSP firmware timing out. Reset the GPU. On a few driver versions it's common enough that ops teams disable the GSP firmware as a workaround, which tells you how much fun that particular bug is.

The remediation ladder most teams settle on is boring and effective: app restart or driver reload or node reboot clears roughly 60% of incidents within fifteen minutes; anything that survives that gets dcgmi diag -r 3; anything that fails the diag gets cordoned and sent back.
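The codes above collapse into a small lookup. This is a sketch of the app-versus-hardware split, not a vendor table: the action labels are my own, and the step for Xid 48 and for unknown codes is an assumption, since the text above names none:

```shell
# xid_triage CODE: print "<class> <next-step>" for the Xid codes discussed above.
xid_triage() {
  case $1 in
    13|31|43) echo "app restart-workload" ;;   # usually application faults
    63|64)    echo "hardware watch-trend" ;;   # row remapping; persistent 64s point to RMA
    94)       echo "hardware drain-node" ;;    # contained ECC error
    95)       echo "hardware reset-gpu" ;;     # uncontained ECC error
    79)       echo "hardware reset-node" ;;    # GPU fell off the bus
    119|120)  echo "hardware reset-gpu" ;;     # GSP firmware timeout
    48)       echo "hardware run-diag" ;;      # assumption: no step is named for a DBE
    *)        echo "unknown run-diag" ;;       # assumption: fall back to dcgmi diag -r 3
  esac
}

xid_triage 79   # prints: hardware reset-node
```

The first word is what decides whether on-call restarts a pod or cordons a node. The second is only a starting point for the runbook.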

what to actually care about

If you're standing up GPU infrastructure and wondering where to spend your attention, the honest ranking isn't the one the marketing implies. It's roughly this. Get the driver-and-toolkit layer boringly stable, because that's where the self-inflicted outages live. Instrument MFU and the Xid stream before you instrument anything pretty, because those two tell you whether you're wasting money and whether the hardware is dying. Decide your sharing model (MIG for multi-tenant, whole cards for training) before you have tenants, because retrofitting isolation is miserable. And treat the twelve-layer stack as the thing it is: a place where a green dashboard at layer 9 can sit directly on top of a card that fell off the bus at layer 1.

This is also usually the point where teams bring in someone who has already burned a few weeks on Xid codes and MIG partitioning, rather than doing it live on their own GPU bill. GPU and ML infrastructure builds are part of the independent infrastructure consulting work I take on.

The pod that was stuck Pending at the top of this post came back the moment the device plugin re-registered. Fifteen seconds of fix, forty minutes of staring at a healthy nvidia-smi wondering how the machine could see eight GPUs that Kubernetes swore didn't exist. That distance, between what the hardware knows and what the scheduler believes, is where most of your on-call rotation lives too. The next part goes inside a single box with eight of these cards and the wires that decide whether they cooperate or just sit next to each other.

The git commands I actually run every day

Harshit Luthra — Wed, 20 May 2026 19:10:05 +0000

Originally published at harshit.cloud on 2026-05-20.

I've been using git for a decade and most of what I type still fits on a single hand. The 200-page Pro Git book is wonderful and almost none of it survives contact with a real Tuesday. What survives is a small, boring set of commands that get rerun constantly.

This post is that list, ordered by how often my fingers actually type them. Aliases are from the oh-my-zsh git plugin (enabled in most zsh configs that exist); the full command sits next to the alias so it's portable.

the daily eight

These are the ones I'd type in my sleep. If you're not using all eight already, picking them up pays back inside a week.

gst

git status

gst

I run this between every other command. It's the cheapest sanity check git has. Branch, ahead/behind, staged, unstaged, untracked. Two seconds. If you only learn one alias, learn this one.

glola

git log --oneline --graph --decorate --all

glola | head -30

The one true log. Graph of every branch (local + remote), one line per commit, colored refs. Pipe through head because most of the time you only care about the last 20-30 commits.

gd / gds

git diff / git diff --staged

gd          # what's changed but not staged
gds         # what's staged and about to be committed

gds before every commit. If you set delta as your pager (brew install git-delta, then pager = delta under [core] in ~/.gitconfig), the output stops being painful to read.

gcam

git commit -a -m

gcam "fix: trailing slash in webhook URL"

Quick one-line commits for small fixes. For anything bigger I drop the -m and let $EDITOR open so I can write a proper message with a body.

gpsup

git push --set-upstream origin <current-branch>

gpsup

First push of a new branch. The full command is annoying to type, so gpsup figures out the current branch name itself. After the first push, plain gp (just git push) works because upstream is set.

gco / gcb

git checkout / git checkout -b

gco main             # switch to main
gco -                # switch to previous branch
gcb feature/login    # create + switch to new branch

gco - is the one to notice. Like cd - for branches. When you're bouncing between two branches all day, it's the same five characters each way instead of typing the name.

gpf

git push --force-with-lease

gpf

After rebasing or amending. Always use --force-with-lease, never --force. The lease version refuses to push if someone else has pushed to your branch since your last fetch, saving you from silently overwriting a teammate's work. There is no good reason to ever type --force in 2026.

gfa

git fetch --all --prune

gfa

Refresh every remote, prune deleted remote branches. Run before you start anything that depends on knowing the current state of the world. The --prune half is what makes the cleanup ritual below work.

checkout recent branches

git branch lists alphabetically, which is useless. What you actually want is "that branch from Tuesday," which means sorting by last commit:

git config --global alias.recent \
  "for-each-ref --sort=-committerdate refs/heads/ \
   --format='%(HEAD) %(color:yellow)%(refname:short)%(color:reset) \
             %(color:green)(%(committerdate:relative))%(color:reset) %(contents:subject)'"

git recent | head -10

That covers looking. For switching, pipe the same list into fzf and you never type a branch name again:

# fco: fuzzy-checkout a recent branch
fco() {
  local branch
  branch=$(git for-each-ref --sort=-committerdate refs/heads/ \
             --format='%(refname:short)' \
           | fzf --height 40% --reverse \
                 --preview 'git log --oneline --decorate --color=always -15 {}')
  [ -n "$branch" ] && git checkout "$branch"
}

Branches arrive sorted by recency, so the one you want is almost always in the top three. Type two letters of its name, Enter, done. The preview pane shows the branch's recent commits so you can confirm it's the right Tuesday. gco - still wins for bouncing between exactly two branches; fco wins for everything else. (brew install fzf if you don't have it. You want it for Ctrl-R history search anyway.)

the cleanup ritual

Run this weekly. If you've ever scrolled through 80 stale branches looking for the one you actually want, you already know why.

The easy half deletes every local branch whose tip is already in main:

gfa
# match main/master by exact name so branches like "maintenance" are not skipped
git branch --merged main | grep -vE '^\*|^[[:space:]]*(main|master)$' | xargs -n1 git branch -d

Works only if your team uses merge commits. Most don't. GitHub's "Squash and merge" creates a brand-new commit on main with a different SHA, so git branch --merged never catches your local branch. Its commits aren't in main's history at all.

The workaround: after gfa, any branch whose tracked remote was deleted shows as [gone]. Those are usually your merged-and-deleted PRs.

Usually, not always. [gone] only means the remote tracking branch is gone. Nearly always that's a squash-merged PR whose branch GitHub auto-deleted. But it can also be a branch you pushed, someone deleted server-side, and you never merged. So don't force-delete every [gone] branch with git branch -D. I once watched one show [gone] while it still held 26 unmerged commits; a force-delete there loses them for good.

So check each [gone] branch for patch-equivalence against the base before deleting. Squash-merges get caught, genuinely unmerged work gets kept. This lives in my ~/.gitconfig as git gone:

# ~/.gitconfig, under [alias]  →  run as: git gone
gone = "!f() { \
    git fetch --all --prune; \
    base=$(git symbolic-ref --short -q refs/remotes/origin/HEAD 2>/dev/null); base=${base:-origin/main}; \
    for b in $(git for-each-ref --format='%(refname:short) %(upstream:track)' refs/heads \
               | awk '$2==\"[gone]\"{print $1}'); do \
      if [ -z \"$(git cherry \"$base\" \"$b\" | grep '^+')\" ]; then git branch -D \"$b\"; \
      else echo \"kept $b (commits not in $base)\"; fi; \
    done; }; f"

One command does the whole ritual: the git fetch --all --prune prunes the dead remote refs, then the loop deletes the merged local branches in the same pass. No separate gfa first.

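git cherry compares by patch-id, not SHA, one commit at a time. A branch whose commits each have a patch-equivalent in the base (a single-commit PR that was squash-merged, or a rebase-merged one) shows every commit as - and gets deleted; a branch with real unmerged work shows + lines and stays. The limit: a multi-commit branch squashed into one commit has no per-commit equivalent in the base, so it also shows + and is kept. The check fails safe. The -D is only reached after patch-equivalence is proven, so it never eats unmerged work.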
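Because git cherry works commit by commit, a multi-commit branch that was squashed into a single commit will generally still show + lines. A common workaround, sketched here and not part of the alias above, is to synthesize one throwaway commit holding the branch's whole diff and ask git cherry about that instead. The function name is made up:

```shell
# squash_merged BASE BRANCH: succeed if BRANCH's combined change is already in BASE.
squash_merged() {
  base=$1
  branch=$2
  # Fork point between the base and the branch.
  merge_base=$(git merge-base "$base" "$branch") || return 2
  # One dangling commit with the branch's final tree on top of the fork point:
  # the shape a squash-merge produces. No ref, index, or worktree is touched.
  probe=$(git commit-tree "$branch^{tree}" -p "$merge_base" -m squash-probe) || return 2
  # git cherry prints "- <sha>" when a patch-equivalent commit exists in the base.
  case $(git cherry "$base" "$probe") in
    "- "*) return 0 ;;
    *)     return 1 ;;
  esac
}
```

Usage would be something like squash_merged origin/main feature/login && git branch -D feature/login. The probe commit is unreferenced, so a later gc collects it. Patch-ids can still miss when the PR was edited during merge, so treat a non-match as "keep", exactly as the alias does.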

Or install git-trim (brew install git-trim), which does the same classification and more. It catches squash-merges even when the tracking ref isn't [gone], and skips diverged branches by default.

git trim                # dry-run
git trim --confirm      # actually delete

This is the closest thing to "did my PR ship?" you can ask git directly.

the weekly four

Not in your fingers yet, but should be.

`git switch` and `git restore`

git switch -c new-feature           # create + switch
git restore --staged file.txt       # unstage
git restore --source=abc123 file.go # restore single file from any commit

switch and restore split the four jobs checkout used to do. The one I reach for most is restore --source=<sha> <path>. Translation: "grab this single file from three commits ago without touching anything else."

interactive rebase with autosquash

git commit --fixup=abc123       # fixup commit targeting abc123
# ... keep working ...
git rebase -i --autosquash main # all fixups slot into place automatically

The single biggest workflow win I've found in ten years of git. While reviewing your own PR you find a bug four commits back. Don't fix it on top. --fixup=<sha> creates a commit targeting the offender, and the autosquash rebase squashes everything into place when you're done. Install git-absorb (brew install git-absorb) and it even picks the target SHA for you: edit the files, run git absorb --and-rebase, done.

`git reflog`, the universal undo

git reflog
git reset --hard HEAD@{5}

Every change to HEAD is logged for 90 days by default (30 days for entries no longer reachable from the current tip). Bad rebase? reflog. Deleted branch? reflog. There is almost nothing in git you can't undo if you know about it.

`git worktree`

git worktree add ../proj-hotfix hotfix/prod-down
git worktree remove ../proj-hotfix

Need to fix a prod bug while halfway through a feature? Don't stash. worktree add gives you a second checkout in a sibling directory, sharing the same .git. I use it constantly for "let me review your PR" without leaving my own branch.

set it once

Five config lines and a scheduled background job. Enable, forget.

git config --global rerere.enabled true          # remember conflict resolutions, replay them
git config --global push.default current         # `git push` pushes current branch to same name
git config --global push.autoSetupRemote true    # first push sets upstream automatically
git config --global diff.algorithm histogram     # cleaner diffs than the default myers
git config --global merge.conflictStyle zdiff3   # conflict markers include the common ancestor
git maintenance start                            # background gc/prefetch on a schedule

autoSetupRemote (Git 2.37 or newer) retires gpsup entirely. zdiff3 (Git 2.35 or newer) shows the original code both sides diverged from; once you've used it, plain <<<<<<< markers feel like flying blind.

when something is broken

Not daily, but when the question is "when did this start," nothing else answers it:

git log -S "functionName"          # pickaxe: commits where this string was added or removed
git blame -w -C -C -C file.go      # blame the logic's actual author, not the formatter
git log -p --follow file.go        # full file history, including across renames
git range-diff @{u}...@            # what a rebase actually changed; run before force-pushing

-S searches the content of the diff, not commit messages. Different thing entirely from --grep. And plain blame gives credit to whoever last ran Prettier; -w -C -C -C follows the code across whitespace changes, moves, and file boundaries to the person who wrote the logic.

the four tools worth installing today

fzf (brew install fzf). Powers the fco branch picker above, plus fuzzy Ctrl-R history.
git-absorb (brew install git-absorb). Auto-fixup commits without picking SHAs.
delta (brew install git-delta). Diff and blame output that doesn't hurt to look at.
lazygit (brew install lazygit). TUI for the operations that are tedious on CLI: partial commits, stash management, conflict resolution.

This post used to end with two AI shell helpers for the stuff git can't tell you; those now live in their own TIL.

ten years in, the surprise

After a decade, the command I run most isn't commit. It isn't push. It's gst, hundreds of times a day, between every other operation. The most-used git command in my shell is the one that does nothing.

A single zsh function for one-line AI answers that knows when to pre-type the command

Harshit Luthra — Wed, 20 May 2026 07:01:10 +0000

Originally published at harshit.cloud on 2026-05-20.

I kept opening a chat tab just to ask "what's the kubectl command for decoding a secret" or "convert 42 GiB to bytes". The context switch was costing more than the answer was worth.

Wrapping an AI CLI into a single shell function fixed it. The interesting part is print -z, plus one heuristic that needs more care than it looks.

the function

# p: one-shot AI query. Examples: `p whats 2 + 2`, `p kubectl secret decode grafana`
# Smart dispatch: if the answer looks like a runnable command, pre-type it into
# the next prompt (print -z). Otherwise print to stdout. Math/facts get printed,
# commands get queued for you to review and press Enter.
p() {
  emulate -L zsh
  setopt NO_GLOB
  if [ $# -eq 0 ]; then
    echo "usage: p <question or task>" >&2
    return 1
  fi
  local out
  out=$(pi -p --no-session --append-system-prompt 'Answer in ONE line. No preamble, no explanation, no markdown, no code fences. For shell/kubectl/git/etc requests output only the command. For factual or math questions output only the answer.' "$*" \
        | tr -d '\000-\037' \
        | sed 's/^[[:space:]]*//;s/[[:space:]]*$//')
  if [ -z "$out" ]; then
    return 1
  fi
  local first="${out%% *}"
  if [[ "$first" == [a-zA-Z_]* ]] && whence -p "$first" >/dev/null 2>&1; then
    print -z -- "$out"
  else
    print -r -- "$out"
  fi
}
alias p='noglob p'

pi is just whatever AI CLI you have. Swap in claude -p, llm, gh copilot suggest, ollama run. The pattern doesn't care about the backend.

what it feels like

$ p whats 2 + 2
4

$ p capital of mongolia
Ulaanbaatar

$ p regex for matching an email
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}

$ p kubectl secret decode grafana
# next prompt now shows, cursor at the end:
$ kubectl get secret grafana -o go-template='{{range $k,$v := .data}}{{$k}}: {{$v | base64decode}}{{"\n"}}{{end}}'█

$ p find all log files modified today
# next prompt:
$ find . -type f -name "*.log" -mtime -1█

Same two-letter command for both. Answers go to stdout, commands go to the prompt buffer where you can edit them before pressing Enter.

the key idea: `print -z` for runnable output

print -z is the trick that makes this design work. It pushes text onto the zsh line editor, i.e. into your next prompt, pre-typed and ready. Compared to every alternative:

Strategy	Speed	Safety	Friction
`eval "$(...)"`	fastest	bad, auto-runs model output	none
Pipe to `pbcopy`	medium	safe	switch focus, paste
Print to stdout	medium	safe	select + copy + paste
`print -z`	fastest	safe, you press Enter	none

The mental model: print -z is what Ctrl-R history search does when you accept a result. Native zsh. You always see and approve the command before it runs.

the heuristic: when is the answer a command?

The smart dispatch decides between print -z (pre-type) and print -r (stdout) by looking at the first word of the answer:

if [[ "$first" == [a-zA-Z_]* ]] && whence -p "$first" >/dev/null 2>&1; then
  print -z -- "$out"
else
  print -r -- "$out"
fi

Two checks, both load-bearing:

First char is a letter or underscore. Excludes digits (4), symbols ([, /, (), and anything else that obviously isn't a command name.
whence -p resolves it to a PATH executable. Not just "this name exists in the shell", but specifically a real binary on disk.

Why whence -p and not command -v? Read on.

the footgun: oh-my-zsh numeric aliases

My first attempt used command -v "$first" as the heuristic. It looked right. It failed in a way that took a minute to spot.

When I ran p whats 2 + 2, the answer was 4, but nothing appeared in my terminal. The function exited cleanly with status 0. No error.

What had happened: oh-my-zsh's core lib/directories.zsh (loaded for every oh-my-zsh user, no plugin required) aliases 1 through 9 to cd -1 ... cd -9 for jumping around the directory stack. So command -v 4 returned true. 4 was a recognized alias, and the function tried to print -z 4 into my prompt buffer.

In a real interactive shell, that would have stuffed 4 into my prompt invisibly (it'd appear when I hit Enter). In my non-interactive test (zsh -ic '...') it disappeared into the void because there's no line editor to render the stuffed buffer.

The fix has two parts:

[[ "$first" == [a-zA-Z_]* ]] alone would have caught it, because 4 doesn't start with a letter.
whence -p instead of command -v makes it doubly safe. whence -p only matches binaries in PATH, ignoring aliases, functions, and builtins. Aliases like 4 → cd -4 are filtered out.

Either check alone would have caught the bug. Having both means the next time I add a feature here, I don't have to remember which one was load-bearing.
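For readers outside zsh, here is roughly what the two checks amount to, spelled out in POSIX sh as a sketch. on_path and looks_like_command are made-up names, and this is not how whence -p is implemented. It relies on sh-style word splitting of $PATH, so it is meant for sh or bash, not for pasting into zsh:

```shell
# on_path NAME: succeed only if NAME is an executable file in some $PATH directory.
# Aliases, functions, and builtins never match, because only the disk is consulted.
on_path() {
  case $1 in
    ''|*/*) return 1 ;;
  esac
  old_ifs=$IFS
  IFS=:
  for dir in $PATH; do
    if [ -f "${dir:-.}/$1" ] && [ -x "${dir:-.}/$1" ]; then
      IFS=$old_ifs
      return 0
    fi
  done
  IFS=$old_ifs
  return 1
}

# looks_like_command ANSWER: the two-part heuristic applied to the first word.
looks_like_command() {
  first=${1%% *}
  case $first in
    [a-zA-Z_]*) on_path "$first" ;;
    *)          return 1 ;;
  esac
}
```

looks_like_command "ls -la" succeeds; looks_like_command "4" fails on the first check without ever touching the filesystem, whatever aliases the shell has loaded.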

defensive details that earn their keep

Three small things prevent subtle bugs:

`noglob` on the alias

alias p='noglob p'

Without this, p list all *.log files would have zsh expand *.log against the current directory before the function ever sees it. With noglob, the glob characters pass through literally. The same trick works for any command whose arguments routinely contain glob characters.

`emulate -L zsh` + `setopt NO_GLOB`

emulate -L zsh
setopt NO_GLOB

emulate -L zsh resets shell options to defaults, scoped to this function only (the -L means local, so they restore on return). NO_GLOB is belt-and-suspenders for callers that bypass the alias (command p ..., \p ..., or scripts that don't see your aliases).

output sanitization

tr -d '\000-\037' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//'

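tr -d '\000-\037' strips all C0 control characters: the ESC byte (\033) that starts every ANSI escape sequence, stray nulls, tabs, and newlines, so a multi-line answer collapses onto one line. It removes only the ESC byte, though. The printable tail of a color code such as [32m survives, so strip whole escape sequences first if your CLI emits color. Critical for print -z because control characters in the payload corrupt the line editor's display.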

sed then trims leading and trailing whitespace, which the model usually adds even when told not to.
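One thing to know about the tr step: it deletes the ESC byte but leaves the printable remainder of an ANSI color code behind. A sketch of a stricter pipeline, assuming the CLI only emits simple CSI sequences (ESC, [, parameters, one letter). strip_ansi and sanitize are hypothetical names:

```shell
esc=$(printf '\033')

# strip_ansi: remove whole CSI sequences such as ESC[32m, not just the ESC byte.
strip_ansi() {
  sed "s/${esc}\[[0-9;]*[A-Za-z]//g"
}

# sanitize: the original pipeline with strip_ansi placed in front of it.
sanitize() {
  strip_ansi | tr -d '\000-\037' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//'
}

printf '\033[32mls\033[0m\n' | tr -d '\000-\037'; echo     # prints: [32mls[0m
printf '\033[32m  kubectl get pods \033[0m\n' | sanitize   # prints: kubectl get pods
```

With plain tr the first word becomes [32mls, which fails the letter check and sends a perfectly good command to stdout instead of the prompt buffer.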

why `"$*"` and not `"$@"`

"$*" joins all positional args into one string with spaces between them. "$@" would pass them as separate args, which most AI CLIs would concatenate anyway, but some treat the first positional as the prompt and the rest as files (the @file.txt convention is common). Joining explicitly avoids that ambiguity.

If your CLI supports -- to end option parsing, prefer:

your-ai-cli -p ... -- "$*"

pi doesn't accept --, hence the bare "$*".
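The difference between the two expansions is easy to see with a counter. A tiny sketch with made-up helper names; it behaves the same in sh, bash, and zsh:

```shell
count_args() { echo "$#"; }

ask() {
  count_args "$*"   # always one argument: the whole question as a single string
  count_args "$@"   # one argument per word the caller typed
}

ask kubectl secret decode grafana   # prints 1, then 4
```

The backend CLI sees exactly what count_args sees, which is why the function hands it a single joined string.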

the system-prompt nudge actually matters

Without --append-system-prompt, even with -p, the default coding-assistant prompt wraps shell commands in code fences and adds a one-sentence intro. That breaks print -z (code fences are not commands) and clutters the terminal.

The phrasing that worked best:

Answer in ONE line. No preamble, no explanation, no markdown, no code fences. For shell/kubectl/git/etc requests output only the command. For factual or math questions output only the answer.

"No markdown, no code fences" is doing most of the work. Without it you get backtick-wrapped output that print -z would happily push into your prompt as `kubectl get pods`, which is not a runnable command.

why this beats the chat UI for short questions

Action	Chat UI	`p`
Switch context	yes	no
Round-trip latency	~3-5s + UI	~1-2s
Output format	markdown, prose	bare answer or pre-typed command
Get command into shell	select + copy + paste	already in your prompt
Session pollution	yes	no (`--no-session`)
Glob-expansion footgun	n/a	guarded (`noglob`)

For anything longer than a paragraph the chat UI is still better. For "what's the syntax for X" or "the command for Y", the terminal is the right place to put the answer.

the one substitution that fixed it

command -v → whence -p. One swap. The rest of the function (the noglob, the emulate -L zsh, the control-char strip) was already doing its job. The bug was trusting that "this name resolves in the shell" meant "this name is a binary on disk." It doesn't, and on any zsh with oh-my-zsh loaded it especially doesn't.

How to bypass corporate MDM and AI gateways on Claude Code

Harshit Luthra — Mon, 18 May 2026 16:59:08 +0000

Originally published at harshit.cloud on 2026-05-08.

If you're reading this, there's roughly an 80% chance your company rolled out an MDM last quarter, your network team wedged Claude API traffic through an AI gateway around the same time, and now Claude Code boots with MCPs you didn't pick while forwarding your prompts somewhere you haven't audited. /mcp shows three servers nothing in your repo touches. env | grep ANTHROPIC returns a base URL on a domain you've never seen. The experience got worse and nobody asked you.

This post covers both leashes. The MDM one is fixable in 12 lines of zsh. The AI gateway one depends on how deep your network team went.

what's an MDM, in three sentences

MDM stands for Mobile Device Management. Jamf, Kandji, Intune, Workspace ONE, whichever agent enrolled your laptop on day one. It owns parts of /Library, can write files there as root with the system-immutable flag set, and re-pushes them on a schedule, which is why a plain rm doesn't survive. For Claude Code, the relevant directory is /Library/Application Support/ClaudeCode/.

the managed-settings situation

The two files doing the work are /Library/Application Support/ClaudeCode/managed-settings.json and /Library/Application Support/ClaudeCode/managed-mcp.json. Claude Code reads them on startup, treats them as the highest-priority settings layer, and merges them over whatever you have in ~/.claude/settings.json. Anything IT puts in there wins: forced MCPs, forced skills, allowed and denied permission lists, and the env block that can set ANTHROPIC_BASE_URL. That last one is how the AI gateway routing gets wired into Claude Code in the first place.

why `rm` doesn't work

First instinct fails, and not in a way that's obvious:

sudo rm "/Library/Application Support/ClaudeCode/managed-settings.json"
# rm: managed-settings.json: Operation not permitted

Root isn't enough. The MDM agent sets the file's system-immutable flag with chflags schg after writing it. That flag blocks deletion even by root until it's cleared. The macOS chflags(1) man page is the receipt. schg is the "system immutable" flag, and the file "may not be changed, moved, or deleted" while it's set.

Confirm it on your own machine:

ls -lO "/Library/Application Support/ClaudeCode/managed-settings.json"
# -rw-r--r--  1 root  wheel  schg  482 May 14 09:11 managed-settings.json

schg in column five is the marker.

The detail that matters: managed-settings.json is the same config layer your ~/.claude/settings.json uses. The IT copy just lives under /Library, is owned by root, and has the schg flag set. The merge logic doesn't know which file came from a human.

the cleanup script

One thing worth flagging before you run this. On macOS, the schg flag is normally clearable by root for files outside SIP-protected paths, and /Library/Application Support/ClaudeCode/ is not SIP-protected. So sudo chflags noschg works as written. If your MDM also writes its config into a SIP-protected location (rare for application config, more common for system extensions), you'd need Recovery Mode Terminal to clear those, which is a different conversation. The script's 2>/dev/null will silently swallow that failure, so if reruns don't seem to take, that's where to look.

Save this as /usr/local/sbin/claudecode-cleanup.sh, make it executable, run with sudo:

#!/bin/zsh
FILES=(
  "/Library/Application Support/ClaudeCode/managed-settings.json"
  "/Library/Application Support/ClaudeCode/managed-mcp.json"
)
for f in "${FILES[@]}"; do
  # Clear immutable flag if file exists, then remove
  [ -e "$f" ] && /usr/bin/chflags noschg "$f" 2>/dev/null
  /bin/rm -f "$f"
done

sudo chmod 755 /usr/local/sbin/claudecode-cleanup.sh
sudo /usr/local/sbin/claudecode-cleanup.sh

Two lines do the real work. chflags noschg clears the immutable bit. rm -f removes the file. The 2>/dev/null swallows the noise on a clean machine where the file isn't there.

Restart Claude Code. /mcp should be back to whatever you actually installed, and /permissions should be whatever's in ~/.claude/settings.json instead of whatever IT decided you needed.

the launchd arms race

I'd love to tell you this is permanent. It isn't.

MDM agents sync on a schedule. Every 15 minutes, every hour, on login, depending on profile. When they sync, they notice the file is gone, put it back, and re-apply the schg flag. You'll watch managed-mcp.json reappear like a horror-movie villain you keep stabbing.

A few options, in increasing order of trouble you're inviting:

Run the script on a launchd LaunchAgent that fires at login. Once per session. Low impact, low effectiveness, but if your MDM only syncs at login this is enough.
Run it on a launchd timer with a 60-second interval. Now you're in an arms race with the sync schedule. Works until someone in IT notices a config-drift alert for your hostname.
Block the MDM agent's outbound DNS. Effective, loud, and the kind of thing that gets your laptop wiped on the next compliance audit.

I run the first one. The MDM gets its login telemetry, my dev environment isn't broken for the hour or so between syncs, nobody opens a ticket. Pick the option that matches how much you actually want to fight this.

Minimal ~/Library/LaunchAgents/cloud.harshit.claudecode-cleanup.plist:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key><string>cloud.harshit.claudecode-cleanup</string>
  <key>ProgramArguments</key>
  <array>
    <string>/usr/bin/sudo</string>
    <string>-n</string>
    <string>/usr/local/sbin/claudecode-cleanup.sh</string>
  </array>
  <key>RunAtLoad</key><true/>
</dict>
</plist>

sudo -n only works if you've added a NOPASSWD line for that exact script in /etc/sudoers.d/claudecode-cleanup. Which the MDM might rewrite. The arms race goes deeper than you think.

the AI gateway angle

The other leash sits at the network layer. Companies route Claude API traffic through a gateway (Cloudflare AI Gateway, Portkey, LiteLLM, internal proxies) so they can log prompts, strip PII, enforce per-user quotas, or quietly downgrade Opus calls to Haiku when the monthly bill spikes. Claude Code respects ANTHROPIC_BASE_URL and will talk to whatever endpoint it points at, as long as your OAuth token or API key authenticates there.

Two routing patterns to recognize:

The env block in managed-settings.json. IT sets ANTHROPIC_BASE_URL=https://ai-gw.corp.example.com/v1 inside the env section of the managed file. Claude Code reads it on startup. Same fix as the MCP file. The cleanup script above already kills this.
System proxy plus a corporate root CA. Your laptop has a "Corporate Root CA" in keychain, and either an https.proxy setting or transparent network interception routes api.anthropic.com traffic through the gateway. Deleting managed-settings.json does nothing here. The interception lives below the application layer.

To tell which one you have, run this in a fresh shell:

env | grep -i anthropic
# If you see ANTHROPIC_BASE_URL, it's the env block.

curl -v https://api.anthropic.com/v1/messages 2>&1 | grep -iE 'issuer|subject|server certificate'
# If the cert chain is signed by your corporate CA, it's transparent interception.

bypassing the gateway

For the env-block case, the cleanup script already does the work. Restart your shell after running it:

unset ANTHROPIC_BASE_URL
env | grep -i anthropic
# (empty)

For the transparent-proxy case, your options shrink:

Personal hotspot for sensitive sessions. Burns mobile data, leaves no trail through the gateway. Most realistic option for an individual contributor.
WireGuard or Tailscale out to a personal node. Works if your MDM profile allows it. Many block third-party VPNs through MDM restrictions such as system-extension allowlists (the com.apple.system-extension-policy payload).
Personal device for personal work. Boring answer. The one that holds up in HR if it ever comes up.

What doesn't work: removing the corporate root CA from keychain. It's pinned by an MDM payload and gets re-added on next sync, same pattern as managed-settings.json.

should you actually do this

Worth saying out loud: both leashes exist because someone at your company had a reason. Compliance, data residency, an incident from six months ago whose postmortem nobody can find.

If the forced MCP is internal-secrets-lookup and the gateway logs prompts to a SOC pipeline, your team probably wants you using it. If the MCP is corporate-docs-mcp pointed at a 404 and the gateway downgrades Opus to Haiku because someone misread an invoice, you're deleting dead weight.

The script doesn't know which. Ask before you script. Most MDM platforms support per-user opt-out scopes, and one polite Slack message to IT beats a launchd plist.

what these scripts don't do

The cleanup clears two files. It does not:

Stop the MDM agent.
Touch ~/.claude/settings.json. Your settings stay yours.
Handle /Library/Application Support/ClaudeCode/managed-permissions.json if your MDM uses one. Add it to the FILES array.
Survive a reboot or a sync. The agent re-pushes on next check-in.
Defeat a transparent proxy with a pinned corporate CA. Use the hotspot.

If you wanted a permanent escape from corporate IT, you wouldn't be reading a blog about chflags.

Lazy SRE's guide to secure systems, part 5: the dev laptop is the perimeter

Harshit Luthra — Mon, 18 May 2026 16:58:52 +0000

Originally published at harshit.cloud on 2026-05-03.

In June 2024, Mandiant published the writeup for the Snowflake mass-extortion campaign. Ticketmaster, Santander, AT&T, LendingTree, Advance Auto Parts. Roughly 165 Snowflake tenants in total had data extracted from their warehouses. The defining detail wasn't sophistication. It was the laptop.

Mandiant traced the entry point to infostealer malware (Lumma, RedLine, Vidar variants) running on contractor and developer machines. Their report described the affected devices as personal systems also used for gaming and downloading pirated software. The infostealer harvested every credential the browser had ever saved, including the Snowflake login that didn't have MFA enforced. The attackers walked through the front door of a Fortune 500's data warehouse.

This is part 5. Earlier parts covered npm (Part 1), GitHub Actions (Part 2), the unsexy infrastructure list (Part 3), and DNS auth records (Part 4). Part 5 is about the laptop. The piece of hardware on an engineer's desk that has every SSH key, AWS profile, kubeconfig, GitHub PAT, Slack token, and Stripe key they have ever used to do their job.

The thesis from Part 1 stands. Future You at 3am will not run an EDR scan after every browser extension install. The config that prevents the extension from being installed in the first place is the one that runs while you sleep: the MDM that whitelists, the disk encryption that protects what gets stolen, the hardware MFA that survives the keylogger.

MDM is the table you set first

Mobile Device Management is the thing every small startup skips and every enterprise has. The bad-faith reason is that it's expensive and annoying. The honest reason in 2026 is that the free options have caught up.

For a 15-person Apple-heavy team, the lazy stack is Apple Business Manager (free, Apple-only) plus Fleet (OSS, free under 300 endpoints on the self-hosted path, generous free tier on Fleet's cloud). Apple Business Manager assigns a Mac to your organization at first boot, before the user creates a personal Apple ID on it. Fleet runs the osquery agent on every machine and lets you push configuration profiles (the same plist payloads Jamf would push) plus query inventory in SQL syntax.

The lazy default config profile, in plain English:

Require FileVault. Escrow the recovery key to MDM. If the laptop walks, the disk is encrypted; if the user forgets the password, you can recover.
Require auto-lock at five minutes idle, password to wake. Not a screensaver.
Block unsigned package installs, restrict the Mac App Store to managed Apple IDs only.
Require macOS updates within fourteen days of release. The fourteen days lets you skip a known-bad point release; longer than fourteen is negligence.
Block AirDrop on the corporate Wi-Fi, restrict USB external storage to read-only (or block entirely if your workflow doesn't need it).
Install osquery via MDM, enrolled to your Fleet server.

For Linux and Windows in the mix, Fleet covers both with the same osquery agent and the same query syntax. The MDM-config-profile half is Microsoft Intune (included with Microsoft 365 Business Premium) or Workspace ONE's free tier. Either way, the stack is "Fleet for inventory and detections + a platform-specific MDM for enforcement."

The lazy fix for the most common gap: a weekly cron that runs one Fleet query, "every laptop without FileVault enabled," and posts a Slack alert with the user's name. The conversation that follows is "we found your machine, can you enable it today", not a six-month audit.
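A minimal sketch of the Slack half of that cron job. It assumes the list of offending hosts has already been exported from Fleet as "hostname,user" lines (the export step is not shown; osquery's disk_encryption table is one place to start). SLACK_WEBHOOK_URL is a placeholder for a Slack incoming-webhook URL, and the file paths in the comments are hypothetical.

```shell
#!/bin/sh
# Sketch of the weekly FileVault nag. Reads "hostname,user" lines on stdin
# and builds one Slack message. The Fleet export is assumed, not shown.

# Turn "hostname,user" lines into a Slack webhook JSON payload.
# Assumes hostnames and usernames contain no quotes or backslashes.
build_payload() {
  msg="Laptops without FileVault enabled:"
  while IFS=, read -r host user; do
    [ -n "$host" ] || continue
    # Literal backslash-n: JSON's newline escape, rendered by Slack.
    msg="$msg\\n- $host ($user)"
  done
  printf '{"text": "%s"}\n' "$msg"
}

# Post the payload; does nothing unless SLACK_WEBHOOK_URL is set.
post_to_slack() {
  [ -n "${SLACK_WEBHOOK_URL:-}" ] || return 0
  curl -sS -X POST -H 'Content-Type: application/json' \
    --data "$1" "$SLACK_WEBHOOK_URL"
}

# A wrapper script would end with:
#   post_to_slack "$(build_payload)"
# and a crontab entry (Mondays 09:00, hypothetical paths) would look like:
#   0 9 * * 1  /usr/local/bin/filevault-nag.sh < /var/tmp/no-filevault.csv
```

Keeping the payload builder separate from the POST makes it easy to dry-run the message before pointing it at a real channel.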

hardware keys, one-time spend

YubiKey 5 NFC is $50. Buy two per engineer: one for the desk, one for the bag. Total for 15 engineers: $1,500, one-time, capital expense, deductible.

What it gets you:

WebAuthn / FIDO2 for SSO login (Google, Okta, GitHub, Cloudflare, AWS): a keylogger can record every keystroke and still never get the second factor.
SSH key storage in hardware. ssh-keygen -t ed25519-sk -O resident writes the key into the YubiKey. The private key never exists on disk.
PIV smartcard for VPN auth, plus the OpenPGP applet for code signing (gpg --card-edit).
TOTP fallback for the SaaS that hasn't shipped WebAuthn yet.
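One way to check that the SSH half actually landed is to audit public keys by type. This is a sketch under one assumption: OpenSSH's FIDO2-backed key types are the ones prefixed sk- (sk-ssh-ed25519@openssh.com and sk-ecdsa-sha2-nistp256@openssh.com), so anything else is treated as a key that can sit on disk. The function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: flag SSH public keys that are not hardware-backed.

# Reads public-key lines ("type base64 comment", no options prefix) on stdin
# and prints "hardware" or "software" followed by the key comment.
audit_keys() {
  while read -r type _blob comment; do
    case "$type" in
      ''|'#'*) continue ;;   # skip blank lines and comments
      sk-ssh-ed25519@openssh.com|sk-ecdsa-sha2-nistp256@openssh.com)
        echo "hardware $comment" ;;
      *)
        echo "software $comment" ;;
    esac
  done
}
```

Run it as cat ~/.ssh/*.pub | audit_keys on a laptop, or feed it an authorized_keys file from a server. Lines that begin with authorized_keys options would need stripping first.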

The free alternative for the SaaS that doesn't support hardware keys is passkeys. Passkeys are WebAuthn under the hood, also phishing-resistant, built into iOS, macOS, Android, Windows Hello, Chrome, and Safari. Free. The catch is sync: if the engineer's iCloud is compromised, so is the passkey. Hardware keys aren't synced; they are a physical token. The lazy answer is both: passkeys for low-risk auth, YubiKeys for the keys that gate production.

Cost: $1,500 one-time for 15 engineers. The cheapest line item in this post for what it gets you.

EDR is where the budget goes

Endpoint Detection and Response is the part of this stack that costs real money. For OSS-only, the answer is osquery + Wazuh, which works but requires writing detections by hand. For a 15-person team with one platform engineer, "write your own EDR detections" is not a project anyone will finish.

The honest 2026 small-team answer is Microsoft Defender for Business at $3/user/month. It ships in Microsoft 365 Business Premium (also useful if you're on M365 anyway), has acceptable macOS coverage, and includes managed detections written by Microsoft's security team. Cost for 15 engineers: $540/year. CrowdStrike Falcon Go is $60/endpoint/year if you want the strongest detection at small-team scale; same math, $900/year for 15.

Fig. 2 · three configurations. Pick the middle bar unless you have a reason.

The lazy stance: Defender for Business if you're on Microsoft 365 already. Falcon Go if you're not on M365 and want managed detection without the OSS-engineer overhead. osquery + Wazuh only if you have a security engineer with bandwidth to maintain the detections, which most 15-person startups don't. Pretending otherwise is how you end up with a fancy SIEM nobody reads.

the password manager and browser hygiene argument

1Password Business at ~$8/user/month. Bitwarden Teams at $4. Apple Passwords (or 1Password Families) if you're Mac-only and don't need shared vaults. Pick one and stop arguing about it on the team's #tools channel.

The point of the password manager isn't strong passwords. The point is:

One place for credentials, audited.
Shared vaults for vendor logins, instead of "share the password in Slack DM" hygiene.
Breach notifications when a saved password appears in a public breach corpus.
Masked email aliases (1Password feature, Apple's Hide My Email equivalent): every signup gets a separate alias, every spam list is contained.

Browser hygiene matters because the infostealers in the Snowflake campaign harvested credentials from the browser's saved-password store. Specifically:

Enforce browser auto-updates via MDM. Both Chrome and Edge expose policy keys for this; Firefox via policies.json.
Block sync of work browser profiles to personal Google/Apple accounts. The "I signed into Chrome with my personal account and now all my work bookmarks are in someone else's cloud" leak is real.
Block "developer mode" extension installs. Force extensions to come from the Chrome Web Store; force the Web Store to honor the org's allowlist via the ExtensionInstallAllowlist policy.
Disable browser password saving entirely. Everything routes through the password manager.
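A hedged sketch of what part of that list looks like as Chrome policy on a Linux host, where managed policies are read from JSON files under /etc/opt/chrome/policies/managed/. On macOS the same keys would normally arrive through an MDM configuration profile instead. The extension ID, the example.com domain, and the file name are placeholders, and this covers only password saving, the extension allowlist, and sign-in restriction.

```shell
#!/bin/sh
# Sketch: write a Chrome managed-policy file.
# ALLOWED_EXTENSION_ID and the example.com pattern are placeholders.
ALLOWED_EXTENSION_ID="${ALLOWED_EXTENSION_ID:-aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa}"

# write_chrome_policy DIR: writes DIR/hygiene.json
write_chrome_policy() {
  mkdir -p "$1" || return 1
  cat > "$1/hygiene.json" <<EOF
{
  "PasswordManagerEnabled": false,
  "ExtensionInstallBlocklist": ["*"],
  "ExtensionInstallAllowlist": ["$ALLOWED_EXTENSION_ID"],
  "RestrictSigninToPattern": ".*@example[.]com"
}
EOF
}

# On a Linux host (needs root):
#   write_chrome_policy /etc/opt/chrome/policies/managed
```

The blocklist wildcard plus an explicit allowlist is the deny-by-default shape: nothing installs unless its ID is listed. Check chrome://policy afterwards to confirm the browser picked the file up.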

Total: $1,440/year for 15 engineers on 1Password Business. $720 on Bitwarden Teams. $0 on Apple Passwords if it covers your needs. Pick a line and walk it.

the personal device problem

The Snowflake breach was about contractors using personal Macs for work. The lazy answer at a 15-person startup might surprise: corp-issue every contractor a laptop. Yes, including the four-hour-a-week consultant.

A refurbished MacBook Air with 16GB RAM is roughly $700 from Apple's Certified Refurbished store. The cost of a Snowflake-scale breach starts at $370K (the reported AT&T ransom) and ends in the customer-churn and legal-exposure column. That ransom alone would buy more than 500 refurbished laptops, so the break-even point on hardware-for-contractors is a single serious incident, ever.

Fig. 3 · same laptop, different enrollment. The right panel is the one where Mandiant doesn't write your name down.

What "no work on personal devices" actually requires:

Contract clause: hardware is issued, personal-device use for work is prohibited.
MDM enrollment at first boot via Apple Business Manager (or Windows Autopilot).
Disabled iCloud personal sign-in; only managed Apple IDs.
Wipe via MDM on offboarding, before reissue.
No "I can just SSH from home for ten minutes" escape hatch. The escape hatch is what the contractor will use the day they get phished.

This is the section of the post that gets the most pushback. The pushback is right about cost and wrong about risk. Run the math at your scale; it runs the same direction every time.

the receipts

For 15 engineers, the first-year laptop security budget:

YubiKey 5 × 30 keys (two per engineer): $1,500, one-time.
Fleet (OSS self-hosted on a small VPS): $240/year.
Microsoft Defender for Business: $540/year. Substitute Falcon Go at $900 if not on M365, or osquery+Wazuh at $0 if you have a security engineer.
1Password Business: $1,440/year. Or Bitwarden Teams at $720. Or Apple Passwords at $0.
Refurbished corp laptops for non-employee contractors: ~$700 per, as needed.

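Total recurring: roughly $780–$2,220/year for 15 engineers with Fleet and Defender, depending on the password-manager line (Apple Passwords at $0 up to 1Password Business at $1,440). Swapping Defender for Falcon Go adds $360. Add the one-time YubiKey spend and the first year lands at $2,280–$3,720. Call it $13–$21 per engineer per month.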
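To rerun the numbers at a different team size, here is a small sketch that recomputes the first year from the line items in the list above. It holds Fleet and Defender fixed and varies only the password-manager price. The variable and function names are illustrative, and the prices are the ones quoted in this post.

```shell
#!/bin/sh
# Sketch: recompute the first-year budget from the listed line items (USD).
ENGINEERS=15
YUBIKEYS=$((ENGINEERS * 2 * 50))     # two $50 keys each, one-time
FLEET=240                            # per year, self-hosted on a small VPS
DEFENDER=$((ENGINEERS * 3 * 12))     # $3/user/month

# first_year PM_PRICE: first-year total for a password manager priced at
# PM_PRICE dollars per user per month.
first_year() {
  pm=$((ENGINEERS * $1 * 12))
  echo $((YUBIKEYS + FLEET + DEFENDER + pm))
}

# per_engineer_month TOTAL: whole dollars, rounded to nearest.
per_engineer_month() {
  echo $(( ($1 + ENGINEERS * 6) / (ENGINEERS * 12) ))
}

# 0 = Apple Passwords, 4 = Bitwarden Teams, 8 = 1Password Business
for price in 0 4 8; do
  total=$(first_year "$price")
  echo "pm=\$$price/user/month first_year=\$$total per_engineer_month=\$$(per_engineer_month "$total")"
done
```

Change ENGINEERS and rerun. The YubiKey line is the only one-time cost, so year two drops by that amount.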

What it catches: every infostealer that hits a managed laptop (Defender flags it), every credential that lives in the browser (replaced by the password manager), every login that doesn't have phishing-resistant MFA (the YubiKey is required), every personal device touching production (blocked by the no-BYOD policy).

What it doesn't catch: a determined adversary with physical access and unlimited time. A laptop in a hotel room with no FileVault is owned. A laptop with FileVault and a YubiKey left in the USB-A port overnight is owned slower. Neither situation is what this stack is built for; it is built for the infostealer that landed on the contractor's personal Mac.

If you do one thing this week, buy two YubiKeys for yourself, enroll them on GitHub, Google, and Okta, and turn off SMS-based MFA on each. Total cost: $100, one hour. Then do the rest of the team next quarter.