<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manikandan T</title>
    <description>The latest articles on DEV Community by Manikandan T (@manikandan_t_6d72e32ac4e8).</description>
    <link>https://dev.to/manikandan_t_6d72e32ac4e8</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3922966%2F9f69e4ac-5138-476e-ac55-023629425266.png</url>
      <title>DEV Community: Manikandan T</title>
      <link>https://dev.to/manikandan_t_6d72e32ac4e8</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/manikandan_t_6d72e32ac4e8"/>
    <language>en</language>
    <item>
      <title>The GPU Cold Starts Nobody Warns You About: Autoscaling LLM Inference on Kubernetes</title>
      <dc:creator>Manikandan T</dc:creator>
      <pubDate>Wed, 03 Jun 2026 18:23:20 +0000</pubDate>
      <link>https://dev.to/manikandan_t_6d72e32ac4e8/the-gpu-cold-starts-nobody-warns-you-about-autoscaling-llm-inference-on-kubernetes-88j</link>
      <guid>https://dev.to/manikandan_t_6d72e32ac4e8/the-gpu-cold-starts-nobody-warns-you-about-autoscaling-llm-inference-on-kubernetes-88j</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;When an LLM inference pod scales on Kubernetes, it doesn't start in seconds like a CPU service. It hits a sequential chain of bottlenecks that can stall Time-to-First-Token (TTFT) for 10+ minutes:&lt;/p&gt;

&lt;p&gt;Each phase blocks the next. Optimizing one shifts the bottleneck downstream - you must address all four.&lt;/p&gt;

&lt;p&gt;The reason these traps hurt so much is that a GPU inference pod doesn't come up in seconds like a stateless web service. It moves through a sequential chain where each stage blocks the next:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83f17gi5ofy0ir23otd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F83f17gi5ofy0ir23otd2.png" alt=" " width="800" height="437"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Modern inference containers (vLLM + PyTorch + CUDA + NCCL) are 10–20GB. Model weights for production models (Llama-3 70B, DeepSeek-V3) exceed 130GB. This is the reality of GPU autoscaling - every cold start moves tens of gigabytes before generating a single token.&lt;/p&gt;

&lt;p&gt;This post covers practical solutions for each phase, issues I hit during implementation, and cloud-native alternatives across providers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 1: Node Provisioning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Karpenter Over Cluster Autoscaler&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're still on Cluster Autoscaler for inference workloads, switch to Karpenter. The key differences that matter for GPU scaling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event-driven - reacts to pending pods in milliseconds vs CA's 10s+ polling loop&lt;/li&gt;
&lt;li&gt;Direct cloud API - bypasses ASGs/VMSS entirely, selects instance types dynamically&lt;/li&gt;
&lt;li&gt;Workload-aware - evaluates GPU, memory, CPU, taints, affinity constraints together and picks the optimal instance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For inference specifically, the important Karpenter configuration is restricting your NodePool to a consistent GPU family. This matters downstream for compile cache sharing (covered in Phase 4).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NodePool&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-inference&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requirements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/capacity-type&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
          &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on-demand"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node.kubernetes.io/instance-type&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
          &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p4d.24xlarge"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# A100 instances - consistent GPU architecture&lt;/span&gt;
      &lt;span class="na"&gt;taints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia.com/gpu&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
          &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NoSchedule&lt;/span&gt;
  &lt;span class="na"&gt;disruption&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;consolidationPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WhenEmpty&lt;/span&gt;
    &lt;span class="na"&gt;consolidateAfter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Scaling Strategy: Scale When N Pods Are&amp;nbsp;Running&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of warm pools with pause pods, configure your HPA/KEDA to trigger node provisioning proactively when existing pods approach saturation. The idea: when you have N pods running and load reaches a threshold, scale to N+1 before all pods are saturated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;keda.sh/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ScaledObject&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm-scaler&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm-inference&lt;/span&gt;
  &lt;span class="na"&gt;minReplicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
  &lt;span class="na"&gt;triggers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prometheus&lt;/span&gt;
      &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;serverAddress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus:9090&lt;/span&gt;
        &lt;span class="na"&gt;metricName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm_pending_requests&lt;/span&gt;
        &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sum(vllm:num_requests_waiting)&lt;/span&gt;
        &lt;span class="na"&gt;threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10"&lt;/span&gt;  &lt;span class="c1"&gt;# Scale when queue depth exceeds threshold&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures Karpenter sees a pending pod early, begins provisioning, and the new pod is ready before existing pods are overwhelmed. You never scale from zero - you scale from N to N+1.&lt;/p&gt;

&lt;p&gt;The hard floor remains: VM allocation + OS boot + GPU driver initialization + kubelet registration = 90–120 seconds for GPU instances. GPU driver init is the hidden cost - NVIDIA driver + device enumeration adds 20–30s that CPU instances don't pay. The only way past this floor is having the node already running.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud-Specific Provisioning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla6cjd3fwtk1ze5a9ks1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla6cjd3fwtk1ze5a9ks1.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 2: Container Image&amp;nbsp;Pull
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A vLLM container image is 10–20GB. When Karpenter provisions a fresh node, downloading from DockerHub/ECR/ACR saturates NAT gateways. Multiple nodes pulling simultaneously = &lt;code&gt;ImagePullBackOff&lt;/code&gt; errors and 3-5 minute delays.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Lazy Pulling (eStargz/SOCI) Fails for GPU Workloads&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Lazy pulling restructures images to allow on-demand file extraction via FUSE. The container "starts" immediately and fetches files as needed via HTTP Range Requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this is catastrophic for ML inference:&lt;/strong&gt; Python ML runtimes import thousands of shared objects sequentially (&lt;code&gt;torch&lt;/code&gt;, &lt;code&gt;libcudart&lt;/code&gt;, &lt;code&gt;triton&lt;/code&gt;, etc.). Each uncached import = synchronous HTTP round-trip through FUSE. The result: pull time drops dramatically, but application readiness (time to first successful &lt;code&gt;/health&lt;/code&gt; response) regresses badly - often by an order of magnitude or more. The latency shifts from a one-time bulk download to thousands of small, serialized network round-trips spread across the Python import sequence.&lt;/p&gt;

&lt;p&gt;The registry becomes a runtime dependency. Every &lt;code&gt;import torch&lt;/code&gt; call blocks on network I/O. Do not use lazy pulling for LLM inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spegel: P2P Image Distribution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/spegel-org/spegel" rel="noopener noreferrer"&gt;Spegel&lt;/a&gt; is the practical P2P solution for GPU clusters - stateless, lightweight, zero control plane overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;DaemonSet on every node (must tolerate GPU taints)&lt;/li&gt;
&lt;li&gt;Advertises SHA256 layer digests via Kademlia DHT&lt;/li&gt;
&lt;li&gt;containerd configured to route pulls through localhost mirror&lt;/li&gt;
&lt;li&gt;New nodes query DHT, stream layers from peers over internal VPC network&lt;/li&gt;
&lt;li&gt;404 fallback to external registry if layer not found in cluster
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;helm upgrade --install spegel oci://ghcr.io/spegel-org/helm-charts/spegel \&lt;/span&gt;
  &lt;span class="s"&gt;--namespace spegel --create-namespace \&lt;/span&gt;
  &lt;span class="s"&gt;--version v0.7.1 \&lt;/span&gt;
  &lt;span class="s"&gt;--set "spegel.mirrorResolveTimeout=5s" \&lt;/span&gt;
  &lt;span class="s"&gt;--set "spegel.mirrorResolveRetries=5" \&lt;/span&gt;
  &lt;span class="s"&gt;--set "tolerations[0].key=nvidia.com/gpu" \&lt;/span&gt;
  &lt;span class="s"&gt;--set "tolerations[0].operator=Exists" \&lt;/span&gt;
  &lt;span class="s"&gt;--set "tolerations[0].effect=NoSchedule"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical requirement:&lt;/strong&gt; For Spegel to work, at least one node must already have the image cached (this is why you scale from N to N+1, not from zero).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compatibility:&lt;/strong&gt;&lt;br&gt;
Refer - &lt;a href="https://spegel.dev/docs/getting-started/#compatibility" rel="noopener noreferrer"&gt;https://spegel.dev/docs/getting-started/#compatibility&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v1qh0dh5nvmta7me79l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8v1qh0dh5nvmta7me79l.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Issue: Default mirrorResolveTimeout Is Too Aggressive&lt;/strong&gt;&lt;br&gt;
Spegel's default &lt;code&gt;mirrorResolveTimeout&lt;/code&gt; is 20ms. This is extremely tight Kademlia DHT lookups that exceed 20ms fall back to the upstream registry, even when a peer has the layer. This explains why you might see ~90% hit rates instead of near-100%. Increasing to 5s with 5 retries gives the P2P network enough time to resolve peers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Issue: containerd 2.1 Breaks Spegel&amp;nbsp;Silently&lt;/strong&gt;&lt;br&gt;
This was the most time-consuming debugging issue. If you're running AL2023, Ubuntu 24.04, or any OS with containerd 2.1, there are three breaking defaults:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. &lt;code&gt;use_local_image_pull = false&lt;/code&gt; (default)&lt;/strong&gt;&lt;br&gt;
containerd 2.1 routes all pulls through a new transfer service (io.containerd.transfer.v1). This transfer service does NOT honor registry mirrors in hosts.toml. Spegel is silently bypassed - every pull goes to the external registry regardless of mirror configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. &lt;code&gt;discard_unpacked_layers = true&lt;/code&gt; (default)&lt;/strong&gt;&lt;br&gt;
containerd discards compressed layers after extraction. Spegel needs preserved layers to serve them to peers. Without them, the P2P network fails silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Registry-specific &lt;code&gt;hosts.toml&lt;/code&gt; overrides &lt;code&gt;_default/hosts.toml&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
If you create docker.io/hosts.toml in your node userData, it overrides Spegel's _default/hosts.toml mirror configuration. Spegel's init container creates the _default/hosts.toml - do NOT create registry-specific host files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix - containerd overrides in node userData:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: EC2NodeClass userData for AL2023&lt;/span&gt;
&lt;span class="na"&gt;userData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;MIME-Version: 1.0&lt;/span&gt;
  &lt;span class="s"&gt;Content-Type: multipart/mixed; boundary="BOUNDARY"&lt;/span&gt;

  &lt;span class="s"&gt;--BOUNDARY&lt;/span&gt;
  &lt;span class="s"&gt;Content-Type: application/node.eks.aws&lt;/span&gt;

  &lt;span class="s"&gt;---&lt;/span&gt;
  &lt;span class="s"&gt;apiVersion: node.eks.aws/v1alpha1&lt;/span&gt;
  &lt;span class="s"&gt;kind: NodeConfig&lt;/span&gt;
  &lt;span class="s"&gt;spec:&lt;/span&gt;
    &lt;span class="s"&gt;containerd:&lt;/span&gt;
      &lt;span class="s"&gt;config: |&lt;/span&gt;
        &lt;span class="s"&gt;[plugins."io.containerd.cri.v1.images".registry]&lt;/span&gt;
          &lt;span class="s"&gt;config_path = "/etc/containerd/certs.d"&lt;/span&gt;
        &lt;span class="s"&gt;[plugins."io.containerd.cri.v1.images"]&lt;/span&gt;
          &lt;span class="s"&gt;discard_unpacked_layers = false&lt;/span&gt;
          &lt;span class="s"&gt;use_local_image_pull = true&lt;/span&gt;
  &lt;span class="s"&gt;--BOUNDARY--&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cloud-Native Image Solutions&lt;/p&gt;

&lt;p&gt;These outperform P2P for most use cases:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsto8o8qmnsmngtye59on.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsto8o8qmnsmngtye59on.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; Cloud-native streaming &amp;gt; Spegel P2P &amp;gt; Lazy pulling. Use Spegel for multi-cloud or when cloud-native options are unavailable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 3: Model Weight&amp;nbsp;Loading
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Math&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T_transfer = Payload_Size / Bandwidth
130 GB over 10 Gbps = ~104 seconds (theoretical minimum)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real-world transfers from S3/HuggingFace Hub: 3–5 minutes due to TCP overhead and rate limiting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared Storage (PVC) for Pre-Downloaded Weights&lt;/strong&gt;&lt;br&gt;
The standard pattern: pre-download weights to a ReadWriteMany PVC, mount it in inference pods.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-cache-pvc&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteMany&lt;/span&gt;
  &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;efs-sc&lt;/span&gt;  &lt;span class="c1"&gt;# Or Azure Files / Filestore&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;200Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pre-download via Job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Job&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-download&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;downloader&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;
          &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sh"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
              &lt;span class="s"&gt;pip install huggingface_hub[hf_xet]&lt;/span&gt;
              &lt;span class="s"&gt;huggingface-cli download meta-llama/Llama-3-70B-Instruct \&lt;/span&gt;
                &lt;span class="s"&gt;--local-dir /shared/models/Llama-3-70B-Instruct&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HF_TOKEN&lt;/span&gt;
              &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;secretKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hf-token&lt;/span&gt;
                  &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;token&lt;/span&gt;
          &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shared-cache&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/shared&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shared-cache&lt;/span&gt;
          &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-cache-pvc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; Shared filesystems (EFS/NFS/Azure Files) have IOPS bottlenecks when many nodes read 130GB concurrently. EFS elastic mode delivers ~130 MB/s - that's 15x slower than local NVMe.&lt;/p&gt;

&lt;h2&gt;
  
  
  Node-Local NVMe: The Performance Path
&lt;/h2&gt;

&lt;p&gt;A100 instances (p4d/p5 on AWS, ND-series on Azure, A2/A3 on GKE) include physically attached NVMe disks delivering ~2,000+ MB/s reads. Use them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NVMe Setup: Approaches by Environment&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Option 1: Karpenter instanceStorePolicy (recommended for EKS)&lt;/strong&gt;&lt;br&gt;
Karpenter's EC2NodeClass supports &lt;code&gt;instanceStorePolicy: RAID0&lt;/code&gt;, which auto-formats all NVMe instance store devices as RAID0 and mounts them to &lt;code&gt;/mnt/k8s-disks/0&lt;/code&gt;. Kubelet uses this as ephemeral storage, so &lt;code&gt;emptyDir&lt;/code&gt; volumes are automatically NVMe-backed. No userData script, no privileged pods, and no manual device detection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.k8s.aws/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EC2NodeClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpu-nodeclass&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;amiSelectorTerms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alias&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;al2023@latest&lt;/span&gt;
  &lt;span class="na"&gt;instanceStorePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RAID0&lt;/span&gt;
  &lt;span class="c1"&gt;# ... other fields (role, subnets, security groups, etc.)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the cleanest approach for Karpenter one line in the EC2NodeClass spec. The NVMe is ready before any pod schedules, and using &lt;code&gt;emptyDir&lt;/code&gt; instead of &lt;code&gt;hostPath&lt;/code&gt; means no hardcoded mount paths and no Pod Security Policy concerns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 2: userData/cloud-init&lt;/strong&gt;&lt;br&gt;
For environments without Karpenter (self-managed ASGs, other provisioners), format and mount NVMe during node boot via userData:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# EC2NodeClass userData - formats NVMe at boot&lt;/span&gt;
&lt;span class="na"&gt;userData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
  &lt;span class="s"&gt;MIME-Version: 1.0&lt;/span&gt;
  &lt;span class="s"&gt;Content-Type: multipart/mixed; boundary="BOUNDARY"&lt;/span&gt;

  &lt;span class="s"&gt;--BOUNDARY&lt;/span&gt;
  &lt;span class="s"&gt;Content-Type: text/x-shellscript&lt;/span&gt;

  &lt;span class="s"&gt;#!/bin/bash&lt;/span&gt;
  &lt;span class="s"&gt;# Format and mount instance store NVMe (skip boot volume nvme0)&lt;/span&gt;
  &lt;span class="s"&gt;DEVICES=$(ls /dev/nvme*n1 2&amp;gt;/dev/null | grep -v nvme0)&lt;/span&gt;
  &lt;span class="s"&gt;if [ -n "$DEVICES" ]; then&lt;/span&gt;
    &lt;span class="s"&gt;DEVICE_COUNT=$(echo "$DEVICES" | wc -l)&lt;/span&gt;
    &lt;span class="s"&gt;if [ "$DEVICE_COUNT" -gt 1 ]; then&lt;/span&gt;
      &lt;span class="s"&gt;yum install -y mdadm || apt-get install -y mdadm&lt;/span&gt;
      &lt;span class="s"&gt;echo "y" | mdadm --create /dev/md0 --level=0 \&lt;/span&gt;
        &lt;span class="s"&gt;--raid-devices=$DEVICE_COUNT $DEVICES&lt;/span&gt;
      &lt;span class="s"&gt;mkfs.ext4 -F /dev/md0&lt;/span&gt;
      &lt;span class="s"&gt;mkdir -p /mnt/fast-disks&lt;/span&gt;
      &lt;span class="s"&gt;mount /dev/md0 /mnt/fast-disks&lt;/span&gt;
    &lt;span class="s"&gt;else&lt;/span&gt;
      &lt;span class="s"&gt;mkfs.ext4 -F $DEVICES&lt;/span&gt;
      &lt;span class="s"&gt;mkdir -p /mnt/fast-disks&lt;/span&gt;
      &lt;span class="s"&gt;mount $DEVICES /mnt/fast-disks&lt;/span&gt;
    &lt;span class="s"&gt;fi&lt;/span&gt;
    &lt;span class="s"&gt;chmod 777 /mnt/fast-disks&lt;/span&gt;
  &lt;span class="s"&gt;fi&lt;/span&gt;

  &lt;span class="s"&gt;--BOUNDARY&lt;/span&gt;
  &lt;span class="s"&gt;Content-Type: application/node.eks.aws&lt;/span&gt;
  &lt;span class="s"&gt;# ... rest of NodeConfig (containerd overrides etc.)&lt;/span&gt;
  &lt;span class="s"&gt;--BOUNDARY--&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; AL2023 uses &lt;code&gt;dnf&lt;/code&gt;, not &lt;code&gt;yum&lt;/code&gt;. This approach still avoids privileged DaemonSets and Pod Security Policy violations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 3: Cloud-managed NVMe provisioning&lt;/strong&gt;&lt;br&gt;
Azure Container Storage auto-provisions NVMe RAID0 StoragePools on ND-series VMs (similar to Karpenter's &lt;code&gt;instanceStorePolicy&lt;/code&gt;). GKE local SSDs can be configured via node pool settings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglyz1c9ct7tg372jy8js.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglyz1c9ct7tg372jy8js.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 4: Privileged DaemonSet (development/testing only)&lt;/strong&gt;&lt;br&gt;
A DaemonSet with &lt;code&gt;privileged: true&lt;/code&gt; and &lt;code&gt;hostPID: true&lt;/code&gt; can format NVMe drives post-boot. However, this is typically blocked in production by Pod Security Standards (Restricted/Baseline), OPA/Gatekeeper/Kyverno policies, and compliance requirements (SOC2, PCI-DSS). Only use this in development clusters where security policies are relaxed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;InitContainer: PVC → NVMe&amp;nbsp;Sync&lt;/strong&gt;&lt;br&gt;
Once NVMe is available (via any method above), use an initContainer to copy the pre-downloaded model from the shared PVC to NVMe-backed storage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;initContainers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-sync&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alpine:latest&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sh"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;echo "Syncing model from shared PVC to NVMe-backed storage..."&lt;/span&gt;
        &lt;span class="s"&gt;cp -r /shared/models/Llama-3-70B-Instruct /nvme/models/&lt;/span&gt;
        &lt;span class="s"&gt;echo "Sync complete."&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shared-cache&lt;/span&gt;
        &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/shared&lt;/span&gt;
        &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvme-storage&lt;/span&gt;
        &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/nvme&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;instanceStorePolicy: RAID0&lt;/code&gt;, mount the NVMe-backed volume as &lt;code&gt;emptyDir: {}&lt;/code&gt; (kubelet places it on the NVMe mount automatically). With the userData approach, use &lt;code&gt;hostPath: { path: /mnt/fast-disks }&lt;/code&gt;. The &lt;code&gt;emptyDir&lt;/code&gt; approach is preferred because it avoids hardcoded paths, works with Pod Security Standards, and Kubernetes manages the lifecycle.&lt;/p&gt;

&lt;p&gt;Each pod pays the PVC to NVMe copy cost (~60–90s for 130GB at EFS elastic throughput). With emptyDir, each pod copies independently (emptyDir is per-pod), but the copy from EFS to NVMe is still far faster than reading directly from EFS during inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cloud-Specific Model&amp;nbsp;Storage
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fueg3awdvf182w4pblyh8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fueg3awdvf182w4pblyh8.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GKE's Hyperdisk ML is notable: a single pre-populated volume mounted read-only across 2,500 pods eliminates all multi-node download redundancy.&lt;/p&gt;

&lt;h2&gt;
  
  
  NVIDIA ModelExpress / NIXL (Experimental)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; I have not tested this personally. The following is from NVIDIA documentation and community reports. Including it because it represents the theoretical fastest path for weight distribution in multi-node GPU clusters.&lt;/p&gt;

&lt;p&gt;For environments with RDMA/InfiniBand interconnects (multi-node A100/H100 clusters), NVIDIA ModelExpress enables P2P weight distribution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New worker communicates with metadata server (Redis sidecar)&lt;/li&gt;
&lt;li&gt;Locates active GPU worker running the same model&lt;/li&gt;
&lt;li&gt;Streams tensors directly from active worker's GPU memory via RDMA&lt;/li&gt;
&lt;li&gt;Zero storage dependency for scale-out
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="c"&gt;# UCX transport configuration for NIXL
&lt;/span&gt;&lt;span class="n"&gt;UCX_TLS&lt;/span&gt;=&lt;span class="n"&gt;rc_x&lt;/span&gt;,&lt;span class="n"&gt;rc&lt;/span&gt;,&lt;span class="n"&gt;dc_x&lt;/span&gt;,&lt;span class="n"&gt;dc&lt;/span&gt;,&lt;span class="n"&gt;cuda_copy&lt;/span&gt;
&lt;span class="n"&gt;UCX_RNDV_SCHEME&lt;/span&gt;=&lt;span class="n"&gt;get_zcopy&lt;/span&gt;
&lt;span class="n"&gt;MODEL_EXPRESS_NO_SHARED_STORAGE&lt;/span&gt;=&lt;span class="m"&gt;1&lt;/span&gt;  &lt;span class="c"&gt;# gRPC fallback when shared storage unavailable
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the fastest possible weight distribution - active GPU memory to new GPU memory over InfiniBand. Relevant for large TP deployments on H100 clusters with EFA/InfiniBand interconnects.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 4: fastsafetensors and Weight Loading Optimization
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Standard Loading&amp;nbsp;Path&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3frysxdpva8hbqz61yo.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3frysxdpva8hbqz61yo.gif" alt=" " width="799" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The CPU RAM acts as a bounce buffer - data passes through without computation, purely as a transfer intermediary. For 130GB of weights on A100 with 80GB VRAM, this means multiple sequential PCIe transfers with CPU orchestration. Multi-minute load times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;fastsafetensors: Faster Weight&amp;nbsp;Loading&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;fastsafetensors provides two loading paths:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. GDS path:&lt;/strong&gt; On GDS-optimized distributed filesystems (Lustre, WekaFS), it uses NVIDIA GPUDirect Storage to DMA directly from storage to GPU VRAM, bypassing CPU entirely. Performance: 4.8x to 7.5x speedup over standard loading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. POSIX I/O path (nogds):&lt;/strong&gt; On local NVMe/ext4 or when GDS drivers are unavailable, it falls back to an optimized POSIX I/O path. This is still significantly faster than standard loading when reading from NVMe (~2,000+ MB/s) vs shared filesystems like EFS (~130 MB/s).&lt;br&gt;
The key insight: the biggest performance gain comes from NVMe vs shared filesystem, not from GDS bypassing the CPU. Moving model weights from EFS to local NVMe is a ~15x bandwidth improvement regardless of whether GDS is active.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffb25hi6387ukbo8mwf42.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffb25hi6387ukbo8mwf42.gif" alt=" " width="799" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enable in vLLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--load-format"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fastsafetensors"&lt;/span&gt;
&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;USE_FASTSAFETENSOR&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GDS: When It Matters and When It Doesn't&lt;/strong&gt;&lt;br&gt;
GDS (GPUDirect Storage) enables direct DMA from storage to GPU VRAM, bypassing CPU bounce buffers. But GDS is a filesystem-level optimization, not just a driver. It requires a GDS-optimized filesystem to function:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GDS-optimized filesystems:&lt;/strong&gt; Lustre (FSx for Lustre), WekaFS, GPFS these support the &lt;code&gt;cuFile&lt;/code&gt; API natively. On these, fastsafetensors delivers 4.8–7.5x speedup via true DMA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Local NVMe/ext4:&lt;/strong&gt; Not GDS-optimized. Even with &lt;code&gt;nvidia_fs.ko&lt;/code&gt; loaded, GDS runs in compatibility mode (CPU bounce buffer, no faster than POSIX I/O) or doesn't engage at all. fastsafetensors detects this and falls back to its &lt;code&gt;nogds&lt;/code&gt; path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to check GDS status on your node:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if nvidia-gds package is installed&lt;/span&gt;
dpkg &lt;span class="nt"&gt;-l&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;nvidia-gds    &lt;span class="c"&gt;# Debian/Ubuntu&lt;/span&gt;
rpm &lt;span class="nt"&gt;-qa&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;nvidia-gds     &lt;span class="c"&gt;# RHEL/AL2023&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if GDS kernel module is loaded&lt;/span&gt;
lsmod | &lt;span class="nb"&gt;grep &lt;/span&gt;nvidia_fs
&lt;span class="c"&gt;# From inside a container, check vLLM's detection&lt;/span&gt;
&lt;span class="c"&gt;# Look for this in logs:&lt;/span&gt;
&lt;span class="c"&gt;# "GDS not enabled, setting nogds=True"  ← GDS NOT available&lt;/span&gt;
&lt;span class="c"&gt;# No such message = GDS is active&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GDS Compatibility by Instance:&lt;/strong&gt;&lt;br&gt;
GDS support depends on the driver stack (nvidia-fs kernel module, MOFED/OFED drivers) and critically, the storage filesystem. Any modern NVIDIA datacenter GPU can technically do GDS if the correct drivers are installed AND the filesystem supports the &lt;code&gt;cuFile&lt;/code&gt; API.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhp5174p5w1owtqz7k3nv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhp5174p5w1owtqz7k3nv.png" alt=" " width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Issue I hit:&lt;/strong&gt; On g6 instances (L4, testing a smaller model), vLLM logged &lt;code&gt;GDS not enabled, setting nogds=True&lt;/code&gt;. The AL2023 GPU AMI does not include &lt;code&gt;nvidia-gds&lt;/code&gt;. But even installing GDS would not have helped here the model was on local NVMe/ext4, which is not a GDS-optimized filesystem. The real fix was moving model weights from EFS to NVMe, not installing GDS drivers.&lt;/p&gt;

&lt;p&gt;On datacenter instances (p4d/p5/ND-series/A3) the MOFED + nvidia-fs stack is pre-installed. fastsafetensors delivers its full 4.8–7.5x GDS speedup only when reading from a GDS-optimized distributed filesystem (e.g., FSx for Lustre). When reading from local NVMe/ext4 on these same instances, the GDS bypass does not engage the speedup comes from NVMe bandwidth, not GDS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NUMA Binding for Multi-Socket A100/H100 Servers&lt;/strong&gt;&lt;br&gt;
On p4d (8x A100) and p5 (8x H100), the machine has multiple NUMA domains. Without binding, Python workers may drift across CPU sockets, causing cross-NUMA memory access during Tensor Parallel sharding.&lt;br&gt;
The standard approach is to wrap vLLM with &lt;code&gt;numactl&lt;/code&gt; or configure NUMA affinity via environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VLLM_WORKER_MULTIPROC_METHOD&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spawn"&lt;/span&gt;  &lt;span class="c1"&gt;# Required for multi-GPU - default 'fork' causes issues with CUDA contexts&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For explicit NUMA pinning, wrap the entrypoint with &lt;code&gt;numactl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numactl"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--cpunodebind=0"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--membind=0"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-m"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vllm.entrypoints.openai.api_server"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pins each GPU worker to the CPU cores and memory closest to its PCIe lanes. Primarily affects steady-state throughput rather than startup latency, but prevents throughput degradation under TP configurations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Some vLLM versions expose a &lt;code&gt;--enable-numa&lt;/code&gt; flag. Verify availability in your target version - the NUMA interface has changed across releases. The &lt;code&gt;VLLM_WORKER_MULTIPROC_METHOD=spawn&lt;/code&gt; env var is the stable requirement for multi-GPU setups.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 5: torch.compile Cache Persistence
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;&lt;br&gt;
After weights are loaded, vLLM uses torch.compile to trace CUDA graphs for the decode loop. TorchInductor generates low-level device kernels optimized for the specific GPU architecture.&lt;/p&gt;

&lt;p&gt;Default FULL_AND_PIECEWISE mode captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monolithic decode graphs for uniform sequence lengths&lt;/li&gt;
&lt;li&gt;Segmented piecewise graphs for variable prefill dimensions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This compilation is strictly serial, CPU-bound. Every new pod pays this penalty independently. Compilation time scales with model size and sequence length range: a 7B model on a single GPU compiles in ~20–30s, while a 70B model across 8x A100 with large context ranges can take 2–5 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix: Shared Compile Cache on RWX&amp;nbsp;Storage&lt;/strong&gt;&lt;br&gt;
Redirect the compile cache to shared storage. First pod compiles, all subsequent pods reuse.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VLLM_CACHE_ROOT&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/shared/compile_cache"&lt;/span&gt;  &lt;span class="c1"&gt;# Points to RWX PVC (EFS/Azure Files/Filestore)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;Cache structure&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="s"&gt;/shared/compile_cache/&lt;/span&gt;
  &lt;span class="s"&gt;torch_compile_cache/&lt;/span&gt;
    &lt;span class="s"&gt;&amp;lt;hash&amp;gt;/&lt;/span&gt;                         &lt;span class="c1"&gt;# model config + PyTorch version + GPU compute capability&lt;/span&gt;
      &lt;span class="s"&gt;rank_0_0/&lt;/span&gt;
        &lt;span class="s"&gt;backbone/&lt;/span&gt;
          &lt;span class="s"&gt;transformed_code.py&lt;/span&gt;       &lt;span class="c1"&gt;# Compiled Python code&lt;/span&gt;
          &lt;span class="s"&gt;computation_graph.py&lt;/span&gt;      &lt;span class="c1"&gt;# Graph structure&lt;/span&gt;
          &lt;span class="s"&gt;inductor_cache/&lt;/span&gt;           &lt;span class="c1"&gt;# Final compiled kernels&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cached artifacts are environment-specific - they're safe to reuse only across pods sharing the same GPU architecture, CUDA, PyTorch, and vLLM version. vLLM derives this by hashing a long list of config and PyTorch factors, and the underlying Inductor kernels are compiled to architecture-specific cubins - so the binding to a specific GPU is real even though it isn't a single tidy "compute capability" field. First pod writes the cache; subsequent pods in the same environment hit and load directly. This is vendor-confirmed: vLLM's docs state the compiled artifacts can be reused across machines with the same environment, and explicitly recommend generating the cache once and sharing it among instances for autoscaling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result from implementation (7B model, single L4 GPU, max_model_len=12000):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fab2xgcjered3k9tqefrr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fab2xgcjered3k9tqefrr.png" alt=" " width="799" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Log evidence of a successful hit:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4j7re0xrn55etbsi34cp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4j7re0xrn55etbsi34cp.png" alt=" " width="800" height="64"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's a 60% reduction on a 7B model from a single environment variable. For larger models (70B on 8x A100), fresh compilation can take 2–5 minutes - the cache hit savings scale proportionally, typically reducing to 10–20s of cache loading.&lt;/p&gt;

&lt;h2&gt;
  
  
  Issue: GPU Architecture Specificity
&lt;/h2&gt;

&lt;p&gt;The compile cache hash includes GPU compute capability:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A100 (compute 8.0) → cache hash: 8d22bdd77e
H100 (compute 9.0) → cache hash: f04cb94f7b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;An A100 cache CANNOT be reused on an H100 node.&lt;/strong&gt; If your NodePool allows mixed GPU types, pods on different architectures face full recompilation despite cache existing.&lt;br&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Restrict NodePool to a single GPU family:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;requirements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node.kubernetes.io/instance-type&lt;/span&gt;
    &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;p4d.24xlarge"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# All nodes = A100, same compute capability&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why Phase 1 matters - consistent GPU architecture enables compile cache sharing cluster-wide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compilation Mode Reference&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr55kadh0zm516g3izgze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr55kadh0zm516g3izgze.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Complete Optimized Deployment&lt;/strong&gt;&lt;br&gt;
Putting it all together - vLLM on A100 with all optimizations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm-optimized&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm-optimized&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm-optimized&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia.com/gpu&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Exists&lt;/span&gt;
          &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NoSchedule&lt;/span&gt;
      &lt;span class="na"&gt;initContainers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-sync&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alpine:latest&lt;/span&gt;
          &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sh"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
              &lt;span class="s"&gt;echo "Copying model from shared PVC to NVMe-backed emptyDir..."&lt;/span&gt;
              &lt;span class="s"&gt;cp -r /shared/models/Llama-3-70B-Instruct /nvme/models/&lt;/span&gt;
          &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shared-cache&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/shared&lt;/span&gt;
              &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvme-storage&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/nvme&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm/vllm-openai:latest&lt;/span&gt;
          &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numactl"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--cpunodebind=0"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--membind=0"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;
                    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python3"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-m"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vllm.entrypoints.openai.api_server"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--model"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/nvme/models/Llama-3-70B-Instruct"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--tensor-parallel-size"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--load-format"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fastsafetensors"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--gpu-memory-utilization"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.90"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--dtype"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bfloat16"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--host"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--port"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000"&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;USE_FASTSAFETENSOR&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VLLM_WORKER_MULTIPROC_METHOD&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spawn"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VLLM_CACHE_ROOT&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/shared/compile_cache"&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;96"&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1024Gi"&lt;/span&gt;
            &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
          &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvme-storage&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/nvme&lt;/span&gt;
              &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shared-cache&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/shared&lt;/span&gt;
          &lt;span class="na"&gt;startupProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health&lt;/span&gt;
              &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
            &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
            &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
          &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health&lt;/span&gt;
              &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
            &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
            &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvme-storage&lt;/span&gt;
          &lt;span class="na"&gt;emptyDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;  &lt;span class="c1"&gt;# NVMe-backed via instanceStorePolicy: RAID0&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;shared-cache&lt;/span&gt;
          &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-cache-pvc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The &lt;code&gt;nvme-storage&lt;/code&gt; volume uses &lt;code&gt;emptyDir: {}&lt;/code&gt; which is automatically NVMe-backed when the EC2NodeClass has &lt;code&gt;instanceStorePolicy: RAID0&lt;/code&gt;. For non-Karpenter environments using userData-formatted NVMe, replace with &lt;code&gt;hostPath: { path: /mnt/fast-disks }&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cloud-Native Reference Architectures
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AWS EKS&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Karpenter (EC2NodeClass) → p4d.24xlarge (8x A100)
  → Bottlerocket + EBS Snapshot (zero-second image pull)
  → EFS PVC (compile cache, RWX)
  → NVMe hostPath (model weights, PCIe speed)
  → vLLM + fastsafetensors + VLLM_CACHE_ROOT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Azure AKS&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAP (AKSNodeClass) → ND A100 v4
  → ACR Artifact Streaming (~5s image availability)
  → Azure Container Storage (NVMe RAID0, auto-provisioned)
  → Azure Files PVC (compile cache, RWX)
  → vLLM + fastsafetensors + VLLM_CACHE_ROOT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Google GKE&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ComputeClasses (NAP) → A2/A3 (A100/H100)
  → GKE Image Streaming (transparent remote mount)
  → Hyperdisk ML (model weights, ReadOnlyMany, 2500 pods)
  → GCS/Filestore PVC (compile cache, RWX)
  → vLLM + fastsafetensors + VLLM_CACHE_ROOT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Comparison&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2nyon6q2teu99mj2eay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2nyon6q2teu99mj2eay.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Expected Timings (Theoretical)
&lt;/h2&gt;

&lt;p&gt;With all optimizations on A100 infrastructure (p4d.24xlarge), scaling from N to N+1 pods where N &amp;gt;= 1:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdtffgcq43r30malk091.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhdtffgcq43r30malk091.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notes on these numbers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Node provisioning 90–120s includes:&lt;/strong&gt; EC2/VM API call (~5s) + instance allocation (~15–30s) + OS boot (~15s) + NVIDIA driver initialization (~20–30s) + kubelet registration + node ready (~10–15s). GPU driver init is the hidden cost most people underestimate - it adds 20–30s that CPU instances don't pay.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spegel P2P 30–60s:&lt;/strong&gt; A 15–20GB image over VPC internal networking (p4d has 100 Gbps baseline). Spegel uses HTTP-based transfer over TCP, not RDMA - real throughput is well below wire speed. DHT lookups and multi-peer coordination add overhead. Pin to v0.7.1 and tune mirrorResolveTimeout to 5s for reliable hit rates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud-native streaming 5–15s:&lt;/strong&gt; EBS Snapshot = data pre-attached at boot (near-instant). ACR/GKE streaming = remote filesystem mount with on-demand paging (container starts before full download).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PVC → NVMe sync 60–90s:&lt;/strong&gt; 130GB at EFS elastic throughput (~1.5 GB/s burst). Shared filesystem read speed is the bottleneck here. On Azure with premium NVMe StoragePool, or with S3/Blob direct download, this can be faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weight loading 15–25s:&lt;/strong&gt; 130GB / 8 GPUs = ~16GB per GPU. PCIe Gen4 x16 = 32 GB/s per GPU theoretical. With safetensors metadata parsing overhead, effective throughput is ~40–60% of theoretical. On local NVMe/ext4, fastsafetensors uses POSIX I/O (not GDS) GDS only engages on distributed filesystems like Lustre.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compile cache load 10–20s for 70B:&lt;/strong&gt; Loading pre-compiled graphs from shared storage for all TP ranks. Larger models have more compilation units to deserialize.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Node Provisioning&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Restrict NodePool to a single GPU instance type - this enables compile cache sharing across nodes&lt;/li&gt;
&lt;li&gt;Karpenter's security groups on provisioned nodes may differ from managed node groups - cross-SG ingress rules are needed for pod networking&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;consolidateAfter&lt;/code&gt; appropriately - too aggressive and you lose nodes that could serve the next scale event&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Image Pull (Spegel /&amp;nbsp;P2P)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;containerd 2.1 on AL2023/Ubuntu 24.04 defaults &lt;code&gt;use_local_image_pull=false&lt;/code&gt; - Spegel is silently bypassed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;discard_unpacked_layers=true&lt;/code&gt; breaks P2P layer serving - must be explicitly overridden&lt;/li&gt;
&lt;li&gt;Registry-specific &lt;code&gt;docker.io/hosts.toml&lt;/code&gt; overrides Spegel's &lt;code&gt;_default/hosts.toml&lt;/code&gt; - do not create registry-specific host files in userData&lt;/li&gt;
&lt;li&gt;Spegel requires at least one existing node with the cached image - useless for true scale-from-zero, design around scale-from-N&lt;/li&gt;
&lt;li&gt;Default mirrorResolveTimeout of 20ms is too aggressive Kademlia DHT lookups exceeding this fall back to upstream. Tune to 5s with 5 retries for better hit rates&lt;/li&gt;
&lt;li&gt;Large base layers (PyTorch/CUDA) may not have distribution source labels in the image manifest, causing DHT lookup failures - these fall back to registry pulls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Model Weight&amp;nbsp;Loading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EFS/NFS throughput (~130 MB/s) is slower than local NVMe (~2000 MB/s) - use NVMe for weight reads, shared PVC for pre-download and cache sharing. Throughput may vary based on file system used.&lt;/li&gt;
&lt;li&gt;The initContainer PVC → NVMe copy is a one-time cost per node - subsequent pods on the same node find the model on NVMe and skip it&lt;/li&gt;
&lt;li&gt;For 130GB+ models, PVC → NVMe copy takes 60–90s - this is unavoidable on first boot but is a one-time cost&lt;/li&gt;
&lt;li&gt;Use Karpenter's &lt;code&gt;instanceStorePolicy: RAID0&lt;/code&gt; for NVMe provisioning (preferred) or userData/cloud-init as a fallback. Avoid privileged DaemonSets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;fastsafetensors /&amp;nbsp;GDS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The primary speedup is NVMe vs shared filesystem (~15x bandwidth) not GDS bypassing the CPU&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nogds=True&lt;/code&gt; in vLLM logs is expected and not a problem on local NVMe/ext4 fastsafetensors' POSIX I/O path is still fast&lt;/li&gt;
&lt;li&gt;GDS only provides its 4.8–7.5x speedup on GDS-optimized distributed filesystems (Lustre, WekaFS, GPFS) not on local ext4/XFS&lt;/li&gt;
&lt;li&gt;Even with nvidia-gds installed and nvidia_fs loaded, GDS runs in compatibility mode on ext4 (CPU bounce buffer, same as POSIX)&lt;/li&gt;
&lt;li&gt;Datacenter instances (p4d/p5/ND-series/A3) ship with MOFED + nvidia-fs pre-installed but GDS only engages when paired with a GDS-optimized filesystem&lt;/li&gt;
&lt;li&gt;For local NVMe setups, fastsafetensors is still worth using its optimized POSIX I/O path is efficient on NVMe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;torch.compile Cache&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highest ROI optimization - one env var (&lt;code&gt;VLLM_CACHE_ROOT&lt;/code&gt;), trivial to implement, 60%+ compile time reduction&lt;/li&gt;
&lt;li&gt;Cache is GPU-architecture-specific - A100 cache (8.0)&amp;nbsp;!= H100 cache (9.0), never mix in same NodePool&lt;/li&gt;
&lt;li&gt;First pod after a PyTorch version upgrade or model config change will recompile - cache invalidation is hash-based&lt;/li&gt;
&lt;li&gt;The shared PVC must be ReadWriteMany - first pod writes, subsequent pods read&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;General&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Solve every phase - optimizing one just exposes the next bottleneck&lt;/li&gt;
&lt;li&gt;Cloud-native solutions (image streaming, Hyperdisk ML) give the biggest wins with least operational complexity&lt;/li&gt;
&lt;li&gt;The physical limit is PCIe bandwidth + VM boot time - everything else is software-solvable&lt;/li&gt;
&lt;li&gt;Scale from N &amp;gt;= 1, not from zero - this enables Spegel P2P, warm NVMe caches, and avoids the worst-case cold start&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;&lt;br&gt;
The infrastructure code, Kubernetes manifests, and deployment configurations referenced in this post are available in the companion repository:&lt;br&gt;
GitHub Repository - &lt;a href="https://github.com/Manikandan-t/gpu-autoscaling-accelerator" rel="noopener noreferrer"&gt;GPU Autoscaling Accelerator&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;This post synthesizes research from multiple design iterations and hands-on implementation on EKS with Karpenter, Spegel, fastsafetensors, and persistent torch.compile caching.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>72B Parameters, Zero Quantization, One GPU: Benchmarking Qwen2-VL on AMD MI300X</title>
      <dc:creator>Manikandan T</dc:creator>
      <pubDate>Wed, 13 May 2026 08:02:13 +0000</pubDate>
      <link>https://dev.to/manikandan_t_6d72e32ac4e8/72b-parameters-zero-quantization-one-gpu-benchmarking-qwen2-vl-on-amd-mi300x-15mh</link>
      <guid>https://dev.to/manikandan_t_6d72e32ac4e8/72b-parameters-zero-quantization-one-gpu-benchmarking-qwen2-vl-on-amd-mi300x-15mh</guid>
      <description>&lt;p&gt;I loaded Qwen2-VL-72B-Instruct at full BF16 precision on a single GPU, served 64 concurrent DocVQA streams, and kept the system stable at 99.5% KV cache utilization - all for $1.99/hour on the AMD Developer Cloud.&lt;/p&gt;

&lt;p&gt;This post walks through exactly how I did it: the hardware economics that make it possible, the deployment configuration that makes it stable, and the benchmark results that prove it works.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Building enterprise-grade visual RAG architectures - Invoice extraction, contract intelligence, automated RFP processing, document QA, OCR-heavy PDF understanding, and long-context retrieval pipelines - requires vision-language models that don't hallucinate structural details. Qwen2-VL-72B is still one of the most capable open-weights models for these tasks.&lt;/p&gt;

&lt;p&gt;The problem is running it. A 72-billion parameter model in BF16 precision consumes roughly 144GB of VRAM just to load the weights. Traditional 80GB GPUs force you into aggressive 4-bit quantization, which severely degrades OCR accuracy and multimodal reasoning.&lt;/p&gt;

&lt;p&gt;The AMD Instinct MI300X changes the deployment calculus entirely. With 192GB of HBM3 memory, it fits the full unquantized model on a single GPU and leaves 48GB of headroom for KV caches and concurrent workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Economics: MI300X vs. A100 and H100
&lt;/h2&gt;

&lt;p&gt;Before diving into deployment details, let's address the cost question — because hardware costs cannot be evaluated in a vacuum. You have to evaluate the cost per usable gigabyte of VRAM required to serve your specific model.&lt;/p&gt;

&lt;h3&gt;
  
  
  The NVIDIA 80GB Constraint
&lt;/h3&gt;

&lt;p&gt;If you deploy on NVIDIA infrastructure using A100 (80GB) or H100 (80GB) GPUs, a single GPU is physically incapable of loading Qwen2-VL-72B unquantized. You are forced into one of two compromises.&lt;/p&gt;

&lt;p&gt;The first option is aggressive quantization: crush the model down to 4-bit (AWQ/GPTQ) to fit it on a single 80GB card. This severely degrades OCR and multimodal reasoning capabilities — exactly the capabilities you need for enterprise document processing.&lt;/p&gt;

&lt;p&gt;The second option is tensor parallelism (TP=2): provision a multi-GPU node and shard the model across two cards using &lt;code&gt;--tensor-parallel-size 2&lt;/code&gt;. This works, but it introduces cross-device NCCL communication overhead on every forward pass, inflating inter-token latency beyond what the raw memory bandwidth would suggest.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Cost Breakdown
&lt;/h3&gt;

&lt;p&gt;Using standard tier-2 cloud pricing (Lambda Cloud, CoreWeave - generally cheaper than AWS/GCP on-demand):&lt;/p&gt;

&lt;p&gt;A 2x A100 (80GB) node runs approximately $3.00 to $4.00 per hour per card. You get the 160GB of pooled VRAM you need, but on older Ampere architecture with slower memory bandwidth, plus the NCCL overhead between cards.&lt;/p&gt;

&lt;p&gt;A 2x H100 (80GB) node runs approximately $6.00 to $8.00+ per hour per card. Hopper is blazing fast, but you are paying for two cards' worth of compute just to get 160GB of pooled VRAM - and you still carry the TP=2 communication overhead.&lt;/p&gt;

&lt;p&gt;A single AMD MI300X (192GB) node on the AMD Developer Cloud costs $1.99 per hour (Price may vary for production).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5ujfry4uhdtikbemc2i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5ujfry4uhdtikbemc2i.png" alt="AMD Developer Cloud - MI300X" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtvjp6ie67t3qy0jabiu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqtvjp6ie67t3qy0jabiu.png" alt="1 GPU vs 8 GPU Comparison" width="800" height="305"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architectural Advantage
&lt;/h3&gt;

&lt;p&gt;The MI300X doesn't just cut the hourly cost by 50-75%. It completely eliminates the complexity of multi-GPU tensor parallelism. There is no cross-device communication overhead. The inter-token latency is bounded strictly by the 5.3 TB/s memory bandwidth of a single HBM3 pool — and my stress test benchmarks confirmed ITL of 43-66ms at the synchronous baseline, which validates that the memory subsystem delivers on its theoretical bandwidth promises.&lt;/p&gt;

&lt;p&gt;For enterprise teams scaling visual RAG pipelines, this shifts the unit economics of multimodal inference from prohibitive to profitable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hardware and Environment
&lt;/h2&gt;

&lt;p&gt;I provisioned the environment on the AMD Developer Cloud. Here are the system specifications:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPU:&lt;/strong&gt; 1x AMD Instinct MI300X (192GB HBM3 VRAM)&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Compute:&lt;/strong&gt; 20 vCPUs, 240GB RAM&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Boot Storage:&lt;/strong&gt; 720GB NVMe SSD&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Scratch Storage:&lt;/strong&gt; 5TB NVMe SSD&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Software Stack:&lt;/strong&gt; Ubuntu 22.04, ROCm 7.2.0, Docker  &lt;/p&gt;

&lt;p&gt;The 192GB VRAM is the critical specification. With ~144GB consumed by the model weights, that leaves approximately 48GB of headroom. That 48GB is what allows processing massive base64-encoded images, maintaining large context windows, and handling concurrent batch requests without triggering OOM errors.&lt;/p&gt;
&lt;h3&gt;
  
  
  Preparing NVMe Storage
&lt;/h3&gt;

&lt;p&gt;The 5TB NVMe scratch disk needs to be mounted for the HuggingFace cache. Downloading 144GB of weights to the boot disk will exhaust space and throttle loading times.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Format the scratch disk with XFS (excellent large-file and parallel I/O handling)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;wipefs &lt;span class="nt"&gt;-af&lt;/span&gt; /dev/vdc1
&lt;span class="nb"&gt;sudo &lt;/span&gt;mkfs.xfs &lt;span class="nt"&gt;-f&lt;/span&gt; /dev/vdc1
&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /mnt/models
&lt;span class="nb"&gt;sudo &lt;/span&gt;mount /dev/vdc1 /mnt/models
&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; &lt;span class="nv"&gt;$USER&lt;/span&gt;:&lt;span class="nv"&gt;$USER&lt;/span&gt; /mnt/models

&lt;span class="c"&gt;# Point HuggingFace cache to the NVMe drive&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /mnt/models/huggingface
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/mnt/models/huggingface
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"export HF_HOME=/mnt/models/huggingface"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the cache on NVMe, subsequent container restarts load the full 144GB of weights into VRAM in seconds rather than minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deploying vLLM on MI300X
&lt;/h2&gt;

&lt;p&gt;Deploying vLLM on AMD hardware requires passing the correct kernel drivers into the Docker container. Unlike NVIDIA's &lt;code&gt;--gpus all&lt;/code&gt; flag, the ROCm ecosystem requires direct device passthrough of the KFD (Kernel Fusion Driver) and DRI (Direct Rendering Infrastructure) interfaces.&lt;/p&gt;

&lt;p&gt;Here is the production deployment command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; vllm-qwen2-vl-72b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt; host &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ipc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/dev/kfd &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/dev/dri &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--group-add&lt;/span&gt; video &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--group-add&lt;/span&gt; render &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /mnt/models:/mnt/models:rw &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;HF_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/mnt/models/huggingface &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;VLLM_USE_TRITON_FLASH_ATTN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; Qwen/Qwen2-VL-72B-Instruct &lt;span class="se"&gt;\&lt;/span&gt;
  vllm/vllm-openai-rocm:v0.20.1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dtype&lt;/span&gt; bfloat16 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.92 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 16384 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-num-seqs&lt;/span&gt; 64 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-num-batched-tokens&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-chunked-prefill&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why Each Flag Matters
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Host integration&lt;/strong&gt; (&lt;code&gt;--network host --ipc=host&lt;/code&gt;): Bypassing Docker's bridge network eliminates overhead, which is critical for benchmarking true API latency. Host IPC is required for efficient shared memory operations between vLLM's internal processes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ROCm passthrough&lt;/strong&gt; (&lt;code&gt;--device=/dev/kfd --device=/dev/dri --group-add video --group-add render&lt;/code&gt;): This is how the container communicates with the CDNA architecture of the MI300X. If your container fails to start with mysterious ROCm errors, the cause is almost always a permissions issue with these device paths or missing group additions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Precision&lt;/strong&gt; (&lt;code&gt;--dtype bfloat16&lt;/code&gt;): BF16 is the optimal datatype for MI300X. It provides the same dynamic range as FP32, preventing the numerical overflow issues that occur with standard FP16 during the massive attention matrix multiplications in 72B+ models. The MI300X Matrix Core technology natively supports BF16 — do not force FP16 on this architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory management&lt;/strong&gt; (&lt;code&gt;--gpu-memory-utilization 0.92&lt;/code&gt;): This tells vLLM to reserve 92% of the 192GB VRAM. After loading the model weights, the remaining allocation is dedicated entirely to the KV cache block pool, managed by vLLM's PagedAttention system. The engine carved out 32.18 GiB specifically for the KV cache, providing 105,440 tokens of cache capacity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrency limits&lt;/strong&gt; (&lt;code&gt;--max-num-seqs 64&lt;/code&gt;, &lt;code&gt;--max-num-batched-tokens 8192&lt;/code&gt;): These define the batching boundaries to prevent OOM under heavy load. With 64 maximum concurrent sequences and 8192 tokens per batch, the scheduler has enough room to interleave requests without exhausting the KV cache blocks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunked prefill&lt;/strong&gt; (&lt;code&gt;--enable-chunked-prefill&lt;/code&gt;): This is non-negotiable for multimodal models. Vision inputs generate massive prompt token counts — a single high-resolution document image can tokenize into thousands of visual tokens. Without chunked prefill, a single massive document would monopolize the entire prefill pipeline, stalling all other requests in the batch. Chunked prefill breaks the initial prompt processing into smaller chunks and interleaves them with decode steps from other in-flight requests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4yzmrt8eig0ryhz2zkj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4yzmrt8eig0ryhz2zkj.png" alt="VLLM version-0.20.1" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyfr11onx21bh90qqkbf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyfr11onx21bh90qqkbf.png" alt="ASGI server running on port 8000" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because vLLM exposes an OpenAI-compatible API, this endpoint is a drop-in replacement for existing application logic. LangChain, LlamaIndex, or custom agentic workflows can point directly to &lt;code&gt;localhost:8000/v1&lt;/code&gt; without modifying the integration layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  Monitoring AMD GPUs During Inference
&lt;/h2&gt;

&lt;p&gt;If you come from the NVIDIA ecosystem, your muscle memory will reach for &lt;code&gt;nvidia-smi&lt;/code&gt;. On consumer AMD cards, you might try &lt;code&gt;radeontop&lt;/code&gt;. Neither works for data center CDNA architectures like the MI300X.&lt;/p&gt;

&lt;p&gt;The correct tool is &lt;code&gt;amd-smi&lt;/code&gt; (or &lt;code&gt;rocm-smi&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;watch &lt;span class="nt"&gt;-n&lt;/span&gt; 2 amd-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftr3wkfjpw01auetpsvgm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftr3wkfjpw01auetpsvgm.png" alt="Memory Info of MI300X" width="800" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjcvyqbvm2w9wymjnrvqv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjcvyqbvm2w9wymjnrvqv.png" alt="Memory Info of MI300X with utilisation bar" width="800" height="620"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two things to note from this output. First, the VRAM usage stays relatively static during inference because vLLM pre-allocates the entire KV cache block pool at startup based on the &lt;code&gt;0.92&lt;/code&gt; utilization flag. What fluctuates is power draw and GPU utilization, which spike during the compute-heavy prefill phases of multimodal requests. Second, &lt;code&gt;amd-smi&lt;/code&gt; sometimes aggregates memory differently than &lt;code&gt;nvidia-smi&lt;/code&gt;. Trust the vLLM engine logs — specifically the "GPU KV cache usage" percentage reported every 10 seconds — for the most accurate view of your KV cache block utilization.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmarking the Deployment
&lt;/h2&gt;

&lt;p&gt;To validate this infrastructure for production document processing, I used GuideLLM to run two distinct benchmarking phases against the live endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase A: Synthetic Stress Test — VRAM Saturation Sweep
&lt;/h3&gt;

&lt;p&gt;This test was designed to push the KV cache to its absolute breaking point using maximum-context synthetic prompts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;guidellm benchmark &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"http://localhost:8000/v1"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; &lt;span class="s2"&gt;"Qwen/Qwen2-VL-72B-Instruct"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; sweep &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s2"&gt;"prompt_tokens=8192,output_tokens=1024"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-seconds&lt;/span&gt; 300 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--warmup&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; ./results-stress &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--outputs&lt;/span&gt; json,html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sweep profile automatically escalates from synchronous (1 request at a time) through throughput-maximizing batches and then across increasing constant-rate loads. This produces a full performance curve from idle to saturated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flov2g9lwqdguiusgbogp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flov2g9lwqdguiusgbogp.png" alt="Cache Hit Rate Log" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkm7bd9wdxv76zmfxfbmr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkm7bd9wdxv76zmfxfbmr.png" alt="constant-rate strategies, with input tokens fixed at 8,211 and output at 1,024 per request" width="800" height="729"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawjtght8j8txc5a30fnw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawjtght8j8txc5a30fnw.png" alt="latency and throughput statistics" width="800" height="801"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The critical result from Phase A: the synchronous baseline ITL was 39.6ms (median), climbing to 66.8ms at higher concurrency. This proves the MI300X HBM3 memory bandwidth is delivering. Anything under 100ms ITL feels instantaneous to a human reader in a streaming interface.&lt;/p&gt;

&lt;p&gt;At peak load, the KV cache hit 99.5% utilization — and the system survived. This is where chunked prefill earns its keep. Without it, sending a massive batch of new prompts to a system at 99% KV cache capacity would cause an immediate OOM crash. Chunked prefill allows the scheduler to break incoming prefill work into small blocks, filling the remaining gaps without exceeding physical limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase B: Enterprise DocVQA Workload
&lt;/h3&gt;

&lt;p&gt;Synthetic data validates the infrastructure. Real data validates the architecture. I used the &lt;code&gt;lmms-lab/DocVQA&lt;/code&gt; dataset, throwing 64 concurrent streams at the GPU to simulate a heavily loaded internal document analysis tool.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;guidellm benchmark &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"http://localhost:8000/v1"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; &lt;span class="s2"&gt;"Qwen/Qwen2-VL-72B-Instruct"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; concurrent &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rate&lt;/span&gt; 64 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s2"&gt;"lmms-lab/DocVQA"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-args&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "DocVQA"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-seconds&lt;/span&gt; 120 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--warmup&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; ./results-doc &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--outputs&lt;/span&gt; json,html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3iimu0wbx4af4v43n6o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh3iimu0wbx4af4v43n6o.png" alt="DocVQA dataset - Parquet files downloading from HuggingFace, ConcurrentProfile resolved at 64 streams" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq157bq9skf91ey5359cr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq157bq9skf91ey5359cr.png" alt="DocVQA concurrent load - KV cache fluctuating between 21% and 94.8% as real document images cycle" width="800" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsp7wgdjyb36dbb13yxgj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsp7wgdjyb36dbb13yxgj.png" alt="median TTFT of 38.5 seconds, median ITL of 1,879ms, total throughput of 2,621 tokens/sec" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmv355gkps5jyoccr04jm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmv355gkps5jyoccr04jm.png" alt="median input of 4,996 tokens per request, median output of 19 tokens, median image input of 3.86M pixels per request" width="800" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The DocVQA results tell a different story than the synthetic test — and that's the point. Real multimodal workloads are fundamentally harder than synthetic text. Each document image tokenizes into thousands of visual tokens (median 4,996 input tokens per request, with 3.86 million pixels of image data), which means the prefill phase dominates. The median TTFT of 38.5 seconds at 64 concurrent streams reflects the GPU working through massive vision encoder computations for dozens of simultaneous documents.&lt;/p&gt;

&lt;p&gt;The system completed 46 requests in 110 seconds with 64 concurrent streams — no errors, no OOMs, no crashes. The server throughput of 2,621 total tokens per second demonstrates that even under extreme multimodal concurrency, the architecture remains stable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Understanding the Latency Pipeline
&lt;/h2&gt;

&lt;p&gt;To build reliable systems on top of these numbers, you need to understand what happens between the moment a user submits a request and the moment they see the complete response. The inference pipeline has two fundamentally different computational phases, and each one is bottlenecked by a different hardware resource.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2fnucf2q21eo3vyl908.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2fnucf2q21eo3vyl908.png" alt="prefill-decode latency pipeline" width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  TTFT: Time To First Token (Compute-Bound)
&lt;/h3&gt;

&lt;p&gt;TTFT measures the time between request submission and the first generated token appearing. For multimodal models, TTFT is dominated by the prefill phase — the GPU must process the base64 image through the vision encoder (ViT), project the visual embeddings into the LLM's token space, concatenate them with the text prompt tokens, and then perform the full self-attention computation over the entire combined sequence to populate the KV cache.&lt;/p&gt;

&lt;p&gt;This is a compute-bound operation. The GPU cores are doing dense matrix multiplications across thousands of visual tokens. Under the 64-stream DocVQA load, TTFT was 38.5 seconds (median) — each request is competing for compute time with 63 other in-flight prefill and decode operations.&lt;/p&gt;

&lt;p&gt;In production, if TTFT is too high for your SLA, the levers are: reduce &lt;code&gt;--max-num-seqs&lt;/code&gt; to limit concurrency (trading throughput for latency), tune &lt;code&gt;--max-num-batched-tokens&lt;/code&gt; to prioritize individual request latency, or scale horizontally by adding more MI300X nodes behind a load balancer.&lt;/p&gt;

&lt;h3&gt;
  
  
  ITL: Inter-Token Latency (Memory-Bandwidth-Bound)
&lt;/h3&gt;

&lt;p&gt;Once prefill completes and the KV cache is populated, the model enters the decode phase. It generates one token at a time in an autoregressive loop. Each token generation requires reading the entire 144GB of model weights from HBM3 VRAM to the compute units.&lt;/p&gt;

&lt;p&gt;This is a memory-bandwidth-bound operation. The GPU cores are fast enough — they are waiting on data delivery from memory. This is why the MI300X's 5.3 TB/s HBM3 bandwidth matters so much. My synthetic stress test showed a synchronous ITL baseline of 39.6ms, which aligns closely with the theoretical minimum: 144GB of weights divided by 5.3 TB/s bandwidth equals roughly 27ms per token, with the remainder accounted for by attention computation over the KV cache, kernel launch overhead, and scheduling latency.&lt;/p&gt;

&lt;p&gt;At higher concurrency, ITL rises because the memory bus is shared across all in-flight decode operations. Under the synthetic sweep, ITL scaled gracefully from 39.6ms (synchronous) to 66.8ms (highest constant rate) — a 1.7x increase despite a 6x increase in concurrency. Under the DocVQA workload at 64 concurrent streams, the median ITL was 1,879ms, reflecting the extreme memory pressure of simultaneously maintaining KV caches for 64 high-resolution document contexts.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Chunked Prefill Prevents Catastrophic Failure
&lt;/h3&gt;

&lt;p&gt;During Phase A, the KV cache hit 99.5% utilization. Without chunked prefill, a new incoming request at this point would attempt to allocate its full prefill budget in one shot — and fail with an OOM crash, potentially taking down the entire serving process.&lt;/p&gt;

&lt;p&gt;Chunked prefill changes this behavior. Instead of processing the entire prompt in a single monolithic computation, the scheduler breaks the prefill into smaller chunks (bounded by &lt;code&gt;--max-num-batched-tokens&lt;/code&gt;). Between chunks, it interleaves decode steps from other in-flight requests. This means the system can gradually allocate KV cache blocks as they become available from completed requests, rather than demanding the full allocation upfront. The result is graceful degradation under pressure rather than catastrophic failure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;VRAM reporting nuances.&lt;/strong&gt; The &lt;code&gt;amd-smi&lt;/code&gt; tool and the VRAM bar visualization sometimes report different figures than what vLLM's internal engine logs show. This is because &lt;code&gt;amd-smi&lt;/code&gt; reports total GPU memory allocation (including driver overhead, CUDA graphs, and pre-allocated buffers), while vLLM reports specifically on KV cache block utilization. For production monitoring, instrument against the vLLM &lt;code&gt;/metrics&lt;/code&gt; Prometheus endpoint, which exposes &lt;code&gt;vllm:gpu_cache_usage_perc&lt;/code&gt; directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The BF16 imperative.&lt;/strong&gt; Do not attempt FP16 on MI300X for models of this size. BF16 is natively supported by the Matrix Core technology, maintains FP32-equivalent dynamic range, and avoids the precision loss that causes output degradation in 72B+ parameter models. This is not a preference — it is a correctness requirement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ROCm is production-ready.&lt;/strong&gt; The ROCm 7.2 + vLLM v0.20.1 stack ran stable through sustained stress testing with zero crashes. For teams evaluating AMD as an alternative to NVIDIA for inference workloads, the ecosystem has matured significantly. The primary friction point is in the initial Docker configuration (device passthrough and group permissions), not in runtime stability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SHM sizing.&lt;/strong&gt; If you encounter cross-process communication errors in vLLM, pass &lt;code&gt;--shm-size 8g&lt;/code&gt; to your Docker run command. This is not always required but resolves intermittent failures in certain multi-worker configurations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reproduce This
&lt;/h2&gt;

&lt;p&gt;The exact commands used in this post:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Mount NVMe and set HF cache&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;wipefs &lt;span class="nt"&gt;-af&lt;/span&gt; /dev/vdc1 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;mkfs.xfs &lt;span class="nt"&gt;-f&lt;/span&gt; /dev/vdc1
&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /mnt/models &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;mount /dev/vdc1 /mnt/models
&lt;span class="nb"&gt;sudo chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; &lt;span class="nv"&gt;$USER&lt;/span&gt;:&lt;span class="nv"&gt;$USER&lt;/span&gt; /mnt/models
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /mnt/models/huggingface
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;HF_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/mnt/models/huggingface
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 2. Launch vLLM (ROCm, v0.20.1)&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; vllm-qwen2-vl-72b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--network&lt;/span&gt; host &lt;span class="nt"&gt;--ipc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/dev/kfd &lt;span class="nt"&gt;--device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/dev/dri &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--group-add&lt;/span&gt; video &lt;span class="nt"&gt;--group-add&lt;/span&gt; render &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; /mnt/models:/mnt/models:rw &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;HF_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/mnt/models/huggingface &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;VLLM_USE_TRITON_FLASH_ATTN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  vllm/vllm-openai-rocm:v0.20.1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; Qwen/Qwen2-VL-72B-Instruct &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dtype&lt;/span&gt; bfloat16 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--gpu-memory-utilization&lt;/span&gt; 0.92 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-model-len&lt;/span&gt; 16384 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-num-seqs&lt;/span&gt; 64 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-num-batched-tokens&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-chunked-prefill&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 3. Stress test (synthetic sweep)&lt;/span&gt;
guidellm benchmark &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"http://localhost:8000/v1"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; &lt;span class="s2"&gt;"Qwen/Qwen2-VL-72B-Instruct"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; sweep &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s2"&gt;"prompt_tokens=8192,output_tokens=1024"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-seconds&lt;/span&gt; 300 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--warmup&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; ./results-stress &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--outputs&lt;/span&gt; json,html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 4. DocVQA benchmark (64 concurrent streams)&lt;/span&gt;
guidellm benchmark &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"http://localhost:8000/v1"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; &lt;span class="s2"&gt;"Qwen/Qwen2-VL-72B-Instruct"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile&lt;/span&gt; concurrent &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rate&lt;/span&gt; 64 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data&lt;/span&gt; &lt;span class="s2"&gt;"lmms-lab/DocVQA"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-args&lt;/span&gt; &lt;span class="s1"&gt;'{"name": "DocVQA"}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-seconds&lt;/span&gt; 120 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--warmup&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; ./results-doc &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--outputs&lt;/span&gt; json,html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The AMD Instinct MI300X fundamentally alters how we architect enterprise AI infrastructure. Loading a 72-billion parameter multimodal model with zero quantization, dedicating 32GB to the KV cache, and serving 64 concurrent document analysis streams on a single node at $1.99/hour - this is a capability that did not exist at this price point 12 months ago.&lt;/p&gt;

&lt;p&gt;For ML engineering teams building automated document processing, visual data extraction, or complex agentic systems, the VRAM constraints of 80GB hardware have forced painful compromises between model quality and deployment feasibility. The MI300X, paired with ROCm 7.2 and vLLM's advanced scheduling (chunked prefill, PagedAttention), provides a stable, powerful foundation for production-grade unquantized inference - at a fraction of the cost of equivalent NVIDIA configurations.&lt;/p&gt;

&lt;p&gt;And AMD is continuing to push the memory boundary further. The Instinct MI325X extends capacity to 256GB HBM3E, targeting massive MoE and ultra-long-context inference workloads. Beyond that, the Instinct MI350X and MI355X move into next-generation CDNA4 territory with 288GB HBM3E, positioning AMD aggressively for frontier-scale enterprise AI.&lt;/p&gt;

&lt;p&gt;What makes this trajectory especially significant is not just raw capacity - it is architectural simplification. Today, deploying a 72B model unquantized on NVIDIA means splitting weights across multiple GPUs, engineering around KV cache exhaustion, and accepting the latency overhead of cross-device communication. With 192GB-class accelerators, those constraints disappear for this model class. With 256–288GB, they disappear for even larger architectures - MoE models, ultra-long-context workloads, and multi-modal pipelines that would currently require four or more 80GB cards.&lt;/p&gt;

&lt;p&gt;For enterprise AI engineering, the shift from 80GB-class to 192–288GB-class accelerators is not incremental. It fundamentally changes what becomes practical in production: fewer nodes, simpler serving topologies, lower operational complexity, and - critically - no quantization tax on model quality.&lt;/p&gt;

</description>
      <category>vllm</category>
      <category>rocm</category>
      <category>mi300x</category>
      <category>genai</category>
    </item>
  </channel>
</rss>
