<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Oleksii Nizhegolenko</title>
    <description>The latest articles on DEV Community by Oleksii Nizhegolenko (@ratibor78).</description>
    <link>https://dev.to/ratibor78</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3936724%2F7684339b-1105-4e04-81b4-3aa879415b3f.jpg</url>
      <title>DEV Community: Oleksii Nizhegolenko</title>
      <link>https://dev.to/ratibor78</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ratibor78"/>
    <language>en</language>
    <item>
      <title>Running OpenAI's gpt-oss-20b with 128k Context on a Single L4 GPU</title>
      <dc:creator>Oleksii Nizhegolenko</dc:creator>
      <pubDate>Tue, 19 May 2026 08:47:44 +0000</pubDate>
      <link>https://dev.to/ratibor78/running-openais-gpt-oss-20b-with-128k-context-on-a-single-l4-gpu-27ac</link>
      <guid>https://dev.to/ratibor78/running-openais-gpt-oss-20b-with-128k-context-on-a-single-l4-gpu-27ac</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fot4qvi6oipzfvcqo1917.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fot4qvi6oipzfvcqo1917.png" alt=" " width="800" height="255"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Alexey Nizhegolenko&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
DevOps Engineer, AgentOps Engineer, AI Infrastructure Engineer&lt;/p&gt;



&lt;p&gt;This is the second article in my series on self-hosting LLMs on GKE. In the &lt;a href="https://dev.to/ratibor78/running-gemma-4-26b-on-gke-with-a-single-l4-gpu-4l6g"&gt;first article&lt;/a&gt; I covered deploying Gemma4 26B with a 28,000 token context window. This time I'll show you something more impressive: &lt;code&gt;openai/gpt-oss-20b&lt;/code&gt; running with a &lt;strong&gt;128,000 token context&lt;/strong&gt; on the same single L4 GPU.&lt;/p&gt;

&lt;p&gt;The setup has been running in production since November 2025, for about 6 months, with no major incidents. That's the kind of track record worth writing about.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why gpt-oss-20b?
&lt;/h2&gt;

&lt;p&gt;OpenAI released &lt;code&gt;gpt-oss-20b&lt;/code&gt; in August 2025 as their first open-weight language model since GPT-2. It's a 21B-parameter Mixture-of-Experts model (~3.6B active parameters per token) with mxfp4 quantization built in: the weights ship already compressed in the microscaling FP4 format, so no separate AWQ or GPTQ quantization step is needed.&lt;/p&gt;

&lt;p&gt;Three things make it stand out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;128k context window&lt;/strong&gt; - this is the main reason to pick this model over alternatives. Most quantized models on a 24GB L4 GPU are limited to 20-64k tokens. gpt-oss-20b reaches 128k through the combination of mxfp4 weights (~13GB on disk) and an fp8 KV cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built-in reasoning&lt;/strong&gt; - the model uses chain-of-thought reasoning internally. In API responses, you'll see a &lt;code&gt;reasoning_content&lt;/code&gt; field with the model's thought process before the final answer. This is useful for complex analytical tasks where you want to understand how the model reached its conclusion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI tool calling format&lt;/strong&gt; - natively compatible with &lt;code&gt;--tool-call-parser openai&lt;/code&gt;, which means it drops in as a replacement for OpenAI API clients without any prompt engineering changes.&lt;/p&gt;
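&lt;p&gt;As a rough sketch of what consuming these fields looks like on the client side, the snippet below builds an OpenAI-style chat request with one tool and pulls the answer, the reasoning, and any tool calls out of a response. The &lt;code&gt;get_weather&lt;/code&gt; tool, the helper names, and the sample response are made up for illustration, and the exact response layout may differ between vLLM versions, so treat the field names as assumptions to check against your own server.&lt;/p&gt;

```python
import json


def build_chat_request(prompt, tools=None, model="openai/gpt-oss-20b"):
    """Build an OpenAI-style chat completion payload as a plain dict."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if tools:
        payload["tools"] = tools
        payload["tool_choice"] = "auto"
    return payload


def split_message(response):
    """Return (answer, reasoning, tool_calls) from a chat completion dict.

    Assumes the reasoning arrives in message["reasoning_content"], as
    described above; missing fields fall back to empty values.
    """
    message = response["choices"][0]["message"]
    answer = message.get("content") or ""
    reasoning = message.get("reasoning_content") or ""
    calls = []
    for call in message.get("tool_calls") or []:
        fn = call["function"]
        # Tool arguments are delivered as a JSON-encoded string.
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return answer, reasoning, calls


# Hypothetical tool definition in the OpenAI function-calling schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Hand-written sample response, not real model output.
SAMPLE_RESPONSE = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "reasoning_content": "The user asks about Kyiv; call the tool.",
            "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": '{"city": "Kyiv"}',
                },
            }],
        }
    }]
}

if __name__ == "__main__":
    request = build_chat_request("What's the weather in Kyiv?", [WEATHER_TOOL])
    print(json.dumps(request, indent=2))
    print(split_message(SAMPLE_RESPONSE))
```

&lt;p&gt;Keeping the parsing in one small function makes it easy to adjust if your vLLM version names the reasoning field differently.&lt;/p&gt;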
&lt;h2&gt;
  
  
  Hardware and Cost
&lt;/h2&gt;

&lt;p&gt;Same hardware as Part 1 - &lt;code&gt;g2-standard-4&lt;/code&gt; with one NVIDIA L4 GPU (24GB VRAM, 4 vCPU, 16GB RAM).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance type&lt;/th&gt;
&lt;th&gt;On-demand price&lt;/th&gt;
&lt;th&gt;Spot price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;g2-standard-4 (1x L4)&lt;/td&gt;
&lt;td&gt;~$0.70/hr&lt;/td&gt;
&lt;td&gt;~$0.21/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This article uses a &lt;strong&gt;standard on-demand&lt;/strong&gt; node pool to keep the setup simple and predictable. The spot-based, cost-optimised variant and the zone-aware failover architecture that makes spot safe to run in production are the subject of Part 3.&lt;/p&gt;
&lt;h2&gt;
  
  
  How 128k Context Fits on 24GB VRAM
&lt;/h2&gt;

&lt;p&gt;The real answer is simpler than you might expect - it fits entirely in GPU RAM. From the actual startup logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model weights (mxfp4):   13.72 GiB
KV cache (fp8):           4.17 GiB  →  182,336 tokens available
CUDA graphs:              0.60 GiB
Total:                   ~18.5 GiB out of 24 GiB (--gpu-memory-utilization 0.85)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU KV cache size: 182,336 tokens
Maximum concurrency for 128,000 tokens per request: 2.68x
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A KV cache of about 182k tokens covers a full 128k context and, by vLLM's own estimate, leaves room for more than two such requests at once. No CPU offloading is needed.&lt;/p&gt;
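&lt;p&gt;To sanity-check the numbers above, here is a small back-of-the-envelope script. The constants are copied from the startup logs; the helper names are mine. Note that the naive ratio of cache tokens to context length comes out lower than the 2.68x that vLLM reports. My assumption is that vLLM's estimate accounts for attention layers that only cache a short sliding window, so treat the naive figure as a conservative bound rather than a contradiction.&lt;/p&gt;

```python
# Figures copied from the startup logs quoted above.
GIB_WEIGHTS = 13.72        # model weights (mxfp4)
GIB_KV_CACHE = 4.17        # KV cache (fp8)
GIB_CUDA_GRAPHS = 0.60     # CUDA graphs
KV_CACHE_TOKENS = 182_336  # "GPU KV cache size"
MAX_MODEL_LEN = 128_000    # --max-model-len


def total_gib():
    """Sum of the three memory consumers listed in the logs."""
    return GIB_WEIGHTS + GIB_KV_CACHE + GIB_CUDA_GRAPHS


def naive_concurrency(cache_tokens=KV_CACHE_TOKENS, context=MAX_MODEL_LEN):
    """Full-length requests that fit if every token needed one cache slot."""
    return cache_tokens / context


if __name__ == "__main__":
    print(f"total GPU memory used: {total_gib():.2f} GiB")
    print(f"naive concurrency:     {naive_concurrency():.2f}x")
```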

&lt;p&gt;&lt;strong&gt;Why does &lt;code&gt;--swap-space 6&lt;/code&gt; exist in the config then?&lt;/strong&gt; It's a safety net: if the KV cache ever overflows under unusual load patterns, vLLM can spill to CPU RAM instead of dropping requests. In practice, it hasn't been needed in 6 months of production. The fp8 KV cache combined with mxfp4 weights is efficient enough that everything fits comfortably on the GPU.&lt;/p&gt;

&lt;p&gt;A large part of why this works at 128k is &lt;strong&gt;mxfp4 quantization&lt;/strong&gt;. The weights stay in microscaling FP4 format on the GPU and load at 13.72 GiB, about 2GB less than the 15.55 GiB the Gemma4 model from Part 1 needed, and that extra headroom goes directly into the KV cache budget.&lt;/p&gt;
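&lt;p&gt;For a feel of what microscaling FP4 costs per weight, here is a minimal calculation. It assumes the MX layout as I understand it: 4-bit elements in blocks of 32 that share one 8-bit scale. Those three numbers are my assumption about the format, not something taken from the vLLM logs, and the result applies only to tensors actually stored in mxfp4.&lt;/p&gt;

```python
def bits_per_weight(element_bits, scale_bits, block_size):
    """Average storage cost of one weight in a block-scaled format."""
    return element_bits + scale_bits / block_size


def gib_for(params, bits):
    """GiB needed to store `params` weights at `bits` bits each."""
    return params * bits / 8 / 2**30


# Assumed MXFP4 layout: 4-bit values, one 8-bit scale per 32 values.
MXFP4_BITS = bits_per_weight(element_bits=4, scale_bits=8, block_size=32)
BF16_BITS = 16

if __name__ == "__main__":
    print(f"mxfp4: {MXFP4_BITS} bits per weight")
    print(f"bf16 / mxfp4 size ratio: {BF16_BITS / MXFP4_BITS:.2f}x")
    print(f"1B weights in mxfp4: {gib_for(1e9, MXFP4_BITS):.3f} GiB")
    print(f"1B weights in bf16:  {gib_for(1e9, BF16_BITS):.3f} GiB")
```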

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GKE cluster (Standard mode); the examples use &lt;code&gt;us-central1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kubectl&lt;/code&gt; configured&lt;/li&gt;
&lt;li&gt;Google Artifact Registry for Docker images&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Create the GPU Node Pool
&lt;/h2&gt;

&lt;p&gt;A standard on-demand node pool, single zone, scale-to-zero, one L4 GPU at peak:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud container node-pools create l4-gptoss &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YOUR_CLUSTER_NAME &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1-a &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--machine-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;g2-standard-4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--accelerator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia-l4,count&lt;span class="o"&gt;=&lt;/span&gt;1,gpu-driver-version&lt;span class="o"&gt;=&lt;/span&gt;latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-autoscaling&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--min-nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--node-labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gpt-oss-20b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--node-taints&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia.com/gpu&lt;span class="o"&gt;=&lt;/span&gt;present:NoSchedule &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scopes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cloud-platform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--node-labels=service=gpt-oss-20b&lt;/code&gt; label is what the StatefulSet's &lt;code&gt;nodeSelector&lt;/code&gt; targets, and the &lt;code&gt;nvidia.com/gpu&lt;/code&gt; taint keeps non-GPU workloads off this pool.&lt;/p&gt;
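&lt;p&gt;If the interplay between the label, the taint, and the toleration is unclear, the following sketch models it in a few lines. It is a simplified illustration of the Kubernetes matching rules, not the real scheduler: it ignores resources, affinity, and everything else, and the pod and node dictionaries are hypothetical stand-ins for the real objects.&lt;/p&gt;

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint.

    Simplified from the Kubernetes rules: an empty toleration effect
    matches any effect, an empty key with operator Exists matches any
    key, and operator Equal additionally requires equal values.
    """
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    operator = toleration.get("operator", "Equal")
    if not key:
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def can_schedule(pod, node):
    """Check nodeSelector labels and hard taints only."""
    labels = node.get("labels", {})
    for name, value in pod.get("nodeSelector", {}).items():
        if labels.get(name) != value:
            return False
    for taint in node.get("taints", []):
        # PreferNoSchedule is a soft preference, so it never blocks here.
        if taint["effect"] not in ("NoSchedule", "NoExecute"):
            continue
        if not any(tolerates(t, taint) for t in pod.get("tolerations", [])):
            return False
    return True


# Hypothetical objects mirroring the node pool and pod spec in this article.
GPU_NODE = {
    "labels": {"service": "gpt-oss-20b"},
    "taints": [
        {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}
    ],
}
VLLM_POD = {
    "nodeSelector": {"service": "gpt-oss-20b"},
    "tolerations": [
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
    ],
}
WEB_POD = {"nodeSelector": {}, "tolerations": []}

if __name__ == "__main__":
    print("vLLM pod fits GPU node:", can_schedule(VLLM_POD, GPU_NODE))
    print("web pod fits GPU node: ", can_schedule(WEB_POD, GPU_NODE))
```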

&lt;h2&gt;
  
  
  Step 2: Prepare the vLLM Image
&lt;/h2&gt;

&lt;p&gt;gpt-oss-20b needs a vLLM build with mxfp4 support; this setup uses v0.12.0. Pull the image, retag it, and push it to your Artifact Registry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull vllm/vllm-openai:v0.12.0

docker tag vllm/vllm-openai:v0.12.0 &lt;span class="se"&gt;\&lt;/span&gt;
  us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:v0.12.0

docker push us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:v0.12.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Create Namespace and Secrets
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;openai/gpt-oss-20b&lt;/code&gt; is a public model, so no Hugging Face token is required. You only need an API key to protect your vLLM endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace gptoss-multi

&lt;span class="c"&gt;# API key for protecting the vLLM endpoint&lt;/span&gt;
kubectl create secret generic vllm-api-multi &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;VLLM_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-api-key-here &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; gptoss-multi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
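&lt;p&gt;Rather than typing a key by hand, you may prefer to generate a random one. A minimal sketch using Python's standard &lt;code&gt;secrets&lt;/code&gt; module follows; the function name and the 32-byte length are my choices, not a vLLM requirement.&lt;/p&gt;

```python
import secrets


def new_api_key(num_bytes=32):
    """Return a URL-safe random token to use as the endpoint's API key."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Paste the printed value in place of your-api-key-here.
    print(new_api_key())
```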



&lt;h2&gt;
  
  
  Step 4: Deploy gpt-oss-20b
&lt;/h2&gt;

&lt;p&gt;Here's the complete manifest: a ClusterIP Service plus the StatefulSet. Scheduling is intentionally minimal - a simple &lt;code&gt;nodeSelector&lt;/code&gt; targeting the &lt;code&gt;service: gpt-oss-20b&lt;/code&gt; label, plus a toleration for the GPU taint. No node affinity rules; the zone-aware scheduling logic comes in Part 3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gptoss-multi&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gptoss-multi&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gptoss-20b&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gptoss-20b&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;StatefulSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gptoss-20b&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gptoss-multi&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gptoss-20b&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gptoss-multi&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gptoss-20b&lt;/span&gt;
  &lt;span class="na"&gt;updateStrategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RollingUpdate&lt;/span&gt;
  &lt;span class="na"&gt;persistentVolumeClaimRetentionPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;whenDeleted&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Delete&lt;/span&gt;
    &lt;span class="na"&gt;whenScaled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Retain&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gptoss-20b&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;terminationGracePeriodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
      &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-oss-20b&lt;/span&gt;
      &lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia.com/gpu&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Exists&lt;/span&gt;
          &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NoSchedule&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:v0.12.0&lt;/span&gt;
          &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IfNotPresent&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--model&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;openai/gpt-oss-20b&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--api-key&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;$(VLLM_API_KEY)&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--gpu-memory-utilization&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.85"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--max-model-len&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;128000"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--swap-space&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;6"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--tensor-parallel-size&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--max-num-seqs&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--max-num-partial-prefills&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--max-num-batched-tokens&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8128"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--kv-cache-dtype&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;fp8&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--enable-auto-tool-choice&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--tool-call-parser&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--host&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--port&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000"&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HF_HOME&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/models&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XDG_CACHE_HOME&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/models/.xdg-cache&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TRITON_CACHE_DIR&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/models/.triton&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VLLM_API_KEY&lt;/span&gt;
              &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;secretKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VLLM_API_KEY&lt;/span&gt;
                  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm-api-multi&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VLLM_LOGGING_LEVEL&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INFO&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NVIDIA_VISIBLE_DEVICES&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NVIDIA_DRIVER_CAPABILITIES&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;compute,utility&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LD_LIBRARY_PATH&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/kubernetes/bin/nvidia/lib64:/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:/usr/lib:/lib&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TORCH_CUDA_ARCH_LIST&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8.9"&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
              &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
              &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
          &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health&lt;/span&gt;
              &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
              &lt;span class="na"&gt;scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTP&lt;/span&gt;
            &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
            &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
            &lt;span class="na"&gt;timeoutSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
            &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;12&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2Gi&lt;/span&gt;
              &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
            &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3500m"&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;12Gi&lt;/span&gt;
              &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
          &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-cache&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/models&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dshm&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/dev/shm&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-lib64&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/kubernetes/bin/nvidia/lib64&lt;/span&gt;
              &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-bin&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/kubernetes/bin/nvidia/bin&lt;/span&gt;
              &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dshm&lt;/span&gt;
          &lt;span class="na"&gt;emptyDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;medium&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Memory&lt;/span&gt;
            &lt;span class="na"&gt;sizeLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;6Gi&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-lib64&lt;/span&gt;
          &lt;span class="na"&gt;hostPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/kubernetes/bin/nvidia/lib64&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Directory&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-bin&lt;/span&gt;
          &lt;span class="na"&gt;hostPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/kubernetes/bin/nvidia/bin&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Directory&lt;/span&gt;
  &lt;span class="na"&gt;volumeClaimTemplates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
      &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-cache&lt;/span&gt;
      &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
        &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;standard-rwo&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; gptoss-20b.yaml
kubectl logs &lt;span class="nt"&gt;-f&lt;/span&gt; statefulset/gptoss-20b &lt;span class="nt"&gt;-n&lt;/span&gt; gptoss-multi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's what a healthy startup looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Architecture confirmed - custom GptOss model class
&lt;span class="go"&gt;INFO [model.py] Resolved architecture: GptOssForCausalLM
INFO [model.py] Using max model len 128000

&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;mxfp4 quantization confirmed, Marlin kernel selected
&lt;span class="go"&gt;INFO [mxfp4.py] Using Marlin backend
WARNING: Your GPU does not have native support for FP4 computation.
         Weight-only FP4 compression will be used via Marlin kernel.

&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Weights loaded - note 13.72 GiB vs 15.55 GiB &lt;span class="k"&gt;for &lt;/span&gt;Gemma4
&lt;span class="go"&gt;INFO [default_loader.py] Loading weights took 73.54 seconds
INFO [gpu_model_runner.py] Model loading took 13.7193 GiB memory and 104.729963 seconds

&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;torch.compile from cache - 13 seconds instead of ~90
&lt;span class="go"&gt;INFO [monitor.py] torch.compile takes 13.67 s in total

&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;KV cache - this is the key number
&lt;span class="go"&gt;INFO [gpu_worker.py] Available KV cache memory: 4.17 GiB
INFO [kv_cache_utils.py] GPU KV cache size: 182,336 tokens
INFO [kv_cache_utils.py] Maximum concurrency for 128,000 tokens per request: 2.68x

&lt;/span&gt;&lt;span class="gp"&gt;#&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;Server ready
&lt;span class="go"&gt;INFO: Application startup complete.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



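&lt;p&gt;The WARNING about FP4 support is expected and not a problem. The L4 is an Ada Lovelace GPU (compute capability 8.9, sm_89), while native FP4 computation requires Blackwell (compute capability 10.0 and later). The Marlin kernel handles this transparently: the weights stay compressed in FP4 and are dequantized for computation, with no quality impact.&lt;/p&gt;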

&lt;h2&gt;
  
  
  Key Configuration Decisions Explained
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;--gpu-memory-utilization 0.85&lt;/code&gt; instead of 0.96-0.97?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We need to leave headroom for the CPU swap mechanism. When the KV cache overflows from GPU to CPU RAM, vLLM needs free GPU memory for the swap buffers. Using 0.97 here will cause OOM under load with long contexts. 0.85 is the stable value we've validated over 6 months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;--max-num-seqs 3&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With 128k context, each sequence can occupy a huge amount of KV cache. Allowing too many parallel sequences risks exhausting both GPU and CPU swap memory simultaneously. Three concurrent sequences is the conservative limit that keeps the deployment stable under real-world load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;--max-num-batched-tokens 8128&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This limits how many tokens get processed per engine step. With long-context requests, an uncapped value here can cause prefill spikes that OOM the GPU. 8128 gives a good balance between throughput and stability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;--max-num-partial-prefills 1&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For very long prompts, vLLM splits prefill across multiple steps (chunked prefill). This flag caps how many requests can be in a partially prefilled state concurrently. Setting it to 1 means only one long prompt is being chunk-prefilled at any time, which keeps memory usage predictable during long-context ingestion.&lt;/p&gt;
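
&lt;p&gt;To make the effect of the token budget concrete, here is a minimal Python sketch of the arithmetic. It assumes a simplified model in which one request is prefilled at a time and each engine step spends the whole &lt;code&gt;--max-num-batched-tokens&lt;/code&gt; budget on that prompt, ignoring decode tokens that share the budget. The function name &lt;code&gt;prefill_steps&lt;/code&gt; is illustrative, not a vLLM API.&lt;/p&gt;

```python
import math

MAX_NUM_BATCHED_TOKENS = 8128  # value of --max-num-batched-tokens in this deployment


def prefill_steps(prompt_tokens, budget=MAX_NUM_BATCHED_TOKENS):
    """Estimate how many engine steps a prompt needs under chunked prefill.

    Simplified model (an assumption): one request is prefilled at a time and
    the whole per-step token budget goes to that prompt.
    """
    return math.ceil(prompt_tokens / budget)


for tokens in (94, 8076, 128000):
    print(tokens, "prompt tokens:", prefill_steps(tokens), "step(s)")
```

&lt;p&gt;Under this model the 8,076-token benchmark prompt still fits in a single step, while a full 128k prompt is spread over 16 steps, each bounded in size.&lt;/p&gt;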

&lt;p&gt;&lt;strong&gt;Why 60Gi PVC instead of 30Gi?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model weights are ~13GB but the torch.compile cache, XDG cache, and Triton cache for a 128k context model are significantly larger than for a 28k model. 60Gi gives comfortable headroom.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expose the API
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gptoss-ingress&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gptoss-multi&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/ingress.class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/backend-protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HTTP"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-read-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;600"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-send-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;600"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-body-size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50m"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-buffering&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;off"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gptoss.yourdomain.com&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
            &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
            &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gptoss-multi&lt;/span&gt;
                &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the &lt;code&gt;proxy-read-timeout: 600&lt;/code&gt;: with 128k context, requests can spend a long time in prefill and generation. The default nginx timeout of 60 seconds will kill long-running requests mid-generation.&lt;/p&gt;
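
&lt;p&gt;A rough way to see why 60 seconds is too short is to estimate the largest response that can finish inside the timeout. The sketch below assumes a non-streaming request, where the backend sends nothing until the full completion is ready, and plugs in the throughput and TTFT figures measured in the performance section of this article. The helper name &lt;code&gt;max_output_tokens&lt;/code&gt; is made up for illustration.&lt;/p&gt;

```python
def max_output_tokens(timeout_s, ttft_s, tokens_per_s):
    """Largest response (in tokens) that finishes before the proxy timeout.

    Assumes a non-streaming request: the backend sends nothing until the
    full completion is ready, so the whole generation must fit in the timeout.
    """
    return int((timeout_s - ttft_s) * tokens_per_s)


# Measured values from this article: ~52 tok/s decode, ~1.47 s TTFT at 8k context.
print(max_output_tokens(60, 1.47, 52))   # default nginx timeout
print(max_output_tokens(600, 1.47, 52))  # timeout configured in the Ingress
```

&lt;p&gt;With these inputs the default timeout allows roughly 3,000 output tokens, and less once prefill for a much longer prompt is included, while 600 seconds leaves room for about ten times that.&lt;/p&gt;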

&lt;p&gt;Test it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://gptoss.yourdomain.com/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer your-api-key"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Explain how Kubernetes scheduling works."}],
    "max_tokens": 500
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance Results
&lt;/h2&gt;

&lt;p&gt;All numbers are measured from our production instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test 1 - Short context (94 prompt tokens, 500 output):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://gptoss.yourdomain.com/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer your-api-key"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Write a detailed technical explanation of how Kubernetes scheduling works..."}],
    "max_tokens": 500
  }'&lt;/span&gt;

&lt;span class="c"&gt;# real 0m9.505s  →  500 tokens / 9.5s = ~52 tokens/sec&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test 2 - Long context (8,076 prompt tokens, 200 output):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ~8k tokens of context&lt;/span&gt;
python3 &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"print('word ' * 8000)"&lt;/span&gt; | xargs &lt;span class="nt"&gt;-I&lt;/span&gt;&lt;span class="o"&gt;{}&lt;/span&gt; curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://gptoss.yourdomain.com/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer your-api-key"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;model&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;openai/gpt-oss-20b&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;messages&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: [{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;role&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;content&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;{} Summarize the above.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}], &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;max_tokens&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: 200}"&lt;/span&gt;

&lt;span class="c"&gt;# real 0m6.113s  →  ~53 tokens/sec generation, TTFT ~1.47 sec&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test 3 - 3 parallel requests:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..3&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://gptoss.yourdomain.com/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer your-api-key"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Explain distributed systems consistency models in detail."}], "max_tokens": 500}'&lt;/span&gt; &amp;amp;
&lt;span class="k"&gt;done
&lt;/span&gt;&lt;span class="nb"&gt;wait&lt;/span&gt;
&lt;span class="c"&gt;# real 0m16.1s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Results summary:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Short context (94 tok)&lt;/th&gt;
&lt;th&gt;Long context (8k tok)&lt;/th&gt;
&lt;th&gt;3 parallel&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~52 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~53 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~32 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTFT&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;237ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1.47 sec&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~410ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt tokens&lt;/td&gt;
&lt;td&gt;94&lt;/td&gt;
&lt;td&gt;8,076&lt;/td&gt;
&lt;td&gt;77&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KV cache usage&lt;/td&gt;
&lt;td&gt;&amp;lt;1%&lt;/td&gt;
&lt;td&gt;&amp;lt;1%&lt;/td&gt;
&lt;td&gt;&amp;lt;1%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The most important finding here: &lt;strong&gt;throughput stays flat regardless of prompt length&lt;/strong&gt;. 52 tok/s with 94 tokens vs 53 tok/s with 8,076 tokens. The mxfp4 quantization handles long contexts extremely efficiently. The only cost of longer context is TTFT: prefilling 8k tokens takes ~1.47 seconds vs 237ms for a short prompt, which is expected because prefill work grows with prompt length.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reasoning Feature
&lt;/h2&gt;

&lt;p&gt;One thing worth calling out: the model exposes its internal reasoning process. Every response includes a &lt;code&gt;reasoning_content&lt;/code&gt; field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reasoning_content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"We need to explain distributed systems consistency models in detail. 
  Likely include eventual consistency, strong consistency, linearizability, sequential 
  consistency, causal consistency... We'll structure: 1. Intro. 2. CAP theorem. 
  3. ACID vs BASE..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the model's chain-of-thought before generating the final answer. For analytical tasks, debugging agent failures, or building explainable AI pipelines, this is genuinely useful - you can see exactly how the model reasoned through a problem.&lt;/p&gt;

&lt;p&gt;Note that &lt;code&gt;content&lt;/code&gt; is &lt;code&gt;null&lt;/code&gt; in the response above - the reasoning model separates thinking from output. Your client needs to handle both fields.&lt;/p&gt;
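
&lt;p&gt;A minimal sketch of what handling both fields can look like on the client side. It assumes the response follows the usual OpenAI chat completions layout (&lt;code&gt;choices[0].message&lt;/code&gt;); the helper name &lt;code&gt;split_message&lt;/code&gt; and the sample payload are hypothetical.&lt;/p&gt;

```python
def split_message(response):
    """Return (reasoning, answer) from a chat completion response dict.

    Either field may be missing or null, so both are normalized to strings.
    """
    message = response["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning, answer


# Hypothetical response shaped like the example above (content is null).
example = {
    "choices": [
        {"message": {"reasoning_content": "We need to explain...", "content": None}}
    ]
}
reasoning, answer = split_message(example)
print("reasoning:", reasoning)
print("answer:", answer or "(empty)")
```

&lt;p&gt;Normalizing &lt;code&gt;null&lt;/code&gt; to an empty string keeps downstream code from failing on string operations when only the reasoning part is present.&lt;/p&gt;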

&lt;h2&gt;
  
  
  Cost Breakdown
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Cost/month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;g2-standard-4 on-demand&lt;/td&gt;
&lt;td&gt;~$500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVC 60GB standard-rwo&lt;/td&gt;
&lt;td&gt;~$6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$506/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 52 tok/s running 24/7:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;52 tokens/sec × 3600 × 24 × 30 = ~134 million tokens/month theoretical
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 20% average utilization: &lt;strong&gt;~27 million tokens/month&lt;/strong&gt; for $506.&lt;/p&gt;

&lt;p&gt;(Part 3 cuts this cost roughly 3x by moving to spot instances, with a failover architecture that makes spot safe to run.)&lt;/p&gt;
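
&lt;p&gt;The same arithmetic as a short Python sketch, extended to a cost per million tokens. The inputs are the figures from this section; the 20% utilization is an assumed average, and the result only counts single-request throughput.&lt;/p&gt;

```python
TOKENS_PER_SECOND = 52        # measured single-request throughput
SECONDS_PER_MONTH = 3600 * 24 * 30
MONTHLY_COST_USD = 506        # instance + PVC from the cost table
UTILIZATION = 0.20            # assumed average utilization

theoretical = TOKENS_PER_SECOND * SECONDS_PER_MONTH
realistic = theoretical * UTILIZATION
cost_per_million = MONTHLY_COST_USD / (realistic / 1_000_000)

print(f"theoretical: {theoretical:,} tokens/month")
print(f"at 20%: {realistic:,.0f} tokens/month")
print(f"cost: ${cost_per_million:.2f} per million tokens")
```

&lt;p&gt;With these inputs the sketch gives 134,784,000 tokens per month at full load and about $18.77 per million generated tokens at 20% utilization.&lt;/p&gt;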

&lt;h2&gt;
  
  
  Compared to the Gemma4 Article
&lt;/h2&gt;

&lt;p&gt;If you read Part 1, here's how the two models compare side by side:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;gpt-oss-20b&lt;/th&gt;
&lt;th&gt;Gemma 4 26B AWQ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Context window&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;128,000 tokens&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;28,000 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;~52 tok/s&lt;/td&gt;
&lt;td&gt;~51 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TTFT (short)&lt;/td&gt;
&lt;td&gt;237ms&lt;/td&gt;
&lt;td&gt;84ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weights size&lt;/td&gt;
&lt;td&gt;~13GB (mxfp4)&lt;/td&gt;
&lt;td&gt;~16GB (AWQ int4)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;VRAM for weights&lt;/td&gt;
&lt;td&gt;13.72 GiB&lt;/td&gt;
&lt;td&gt;15.55 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KV cache pool (GPU)&lt;/td&gt;
&lt;td&gt;4.17 GiB&lt;/td&gt;
&lt;td&gt;3.12 GiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;KV cost per token&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~24.5 KB&lt;/strong&gt; (GQA, fp8)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~112.7 KB&lt;/strong&gt; (global head_dim=512)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Max tokens in KV&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;182,336&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;29,709&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU util setting&lt;/td&gt;
&lt;td&gt;0.85&lt;/td&gt;
&lt;td&gt;0.97&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;✅ built-in&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calling&lt;/td&gt;
&lt;td&gt;openai format&lt;/td&gt;
&lt;td&gt;gemma4 format&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;td&gt;Apache 2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HuggingFace access&lt;/td&gt;
&lt;td&gt;gated&lt;/td&gt;
&lt;td&gt;public&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why such a huge difference in context despite similar KV cache size?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the most interesting technical finding. The answer lies in how each model's attention is shaped.&lt;/p&gt;

&lt;p&gt;From the vLLM startup logs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gpt-oss-20b: 4.17 GiB for 182,336 tokens → &lt;strong&gt;~24.5 KB per token&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Gemma4 26B: 3.12 GiB for 29,709 tokens → &lt;strong&gt;~112.7 KB per token&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The standard formula for KV cache memory per token is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bytes per token = 2 × layers × KV_heads × head_dim × bytes_per_element
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;gpt-oss-20b is a &lt;strong&gt;24-layer&lt;/strong&gt; model with 8 KV heads (GQA — 64 query heads grouped onto just 8 KV heads) and a head_dim of 64. With fp8 KV cache (1 byte per element):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2 × 24 × 8 × 64 × 1 = 24,576 bytes ≈ 24.6 KB per token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That matches the ~24.5 KB/token we see in the logs almost exactly, and &lt;code&gt;4.17 GiB ÷ 24,576 bytes ≈ 182,000&lt;/code&gt;, which agrees with the 182,336-token KV pool vLLM reports to within the rounding of the 4.17 GiB figure. So there is no mystery in the per-token number and no hidden reduction happening: aggressive GQA (8 KV heads instead of 64), a small head_dim of 64, and fp8 KV cache are what make each token cheap. vLLM computes the headline token capacity using exactly this uniform per-token cost.&lt;/p&gt;
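
&lt;p&gt;The calculation above can be reproduced with a few lines of Python. This is only a sketch of the formula, using the model dimensions quoted in this article and the pool size from the startup log; the function name is illustrative.&lt;/p&gt;

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_element):
    """KV cache bytes per token: keys and values (factor 2) across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element


per_token = kv_bytes_per_token(layers=24, kv_heads=8, head_dim=64, bytes_per_element=1)
pool_bytes = 4.17 * 1024**3  # "Available KV cache memory: 4.17 GiB" from the log
capacity = int(pool_bytes // per_token)

print(per_token)  # bytes per token with fp8 KV cache
print(capacity)   # approximate token capacity of the pool
```

&lt;p&gt;The computed capacity lands within about 0.1% of the 182,336 tokens in the log, which is consistent with the pool size being printed with only two decimals.&lt;/p&gt;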

&lt;p&gt;&lt;strong&gt;So, where does Sliding Window Attention (SWA) actually help?&lt;/strong&gt; Not in the per-token headline - in concurrency. gpt-oss-20b alternates layer types: roughly half of its 24 layers use full global attention, the other half use a tight 128-token sliding window. For long-context requests, the sliding-window layers do &lt;strong&gt;not&lt;/strong&gt; grow with prompt length - they stay bounded at 128 tokens whether the prompt is 4k or 128k. So a real 128k request only pays the full per-token price on about half its layers; the sliding-window half is effectively free at length.&lt;/p&gt;

&lt;p&gt;This is exactly what the log line reports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;INFO [kv_cache_utils.py] Maximum concurrency for 128,000 tokens per request: 2.68x
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A naive reading would expect &lt;code&gt;182,336 ÷ 128,000 ≈ 1.42x&lt;/code&gt; concurrency for 128k requests. vLLM reports &lt;strong&gt;2.68x&lt;/strong&gt; - nearly double — because its memory manager understands the hybrid SWA structure and knows a 128k sequence costs roughly half the uniform estimate (only the ~12 full-attention layers accumulate full-length KV; the ~12 sliding-window layers plateau at 128 tokens). That ~1.9x uplift over the naive ratio &lt;em&gt;is&lt;/em&gt; the SWA payoff - it buys concurrency headroom, not a cheaper headline per-token figure.&lt;/p&gt;
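
&lt;p&gt;A back-of-the-envelope sketch of that reasoning in Python. It assumes an exact 12/12 split between full-attention and sliding-window layers and charges the sliding-window layers only for the bare 128-token window; both are simplifications, and this is not how vLLM's allocator is implemented.&lt;/p&gt;

```python
BYTES_PER_LAYER_TOKEN = 2 * 8 * 64 * 1   # K and V, 8 KV heads, head_dim 64, fp8
FULL_LAYERS = 12                         # assumed: half of the 24 layers
SWA_LAYERS = 12                          # assumed: the other half
WINDOW = 128                             # sliding-window size in tokens
POOL_TOKENS = 182_336                    # "GPU KV cache size" from the log
CONTEXT = 128_000

pool_bytes = POOL_TOKENS * 24 * BYTES_PER_LAYER_TOKEN

# Naive: every layer stores KV for the full context.
naive = POOL_TOKENS / CONTEXT

# Hybrid: sliding-window layers are capped at WINDOW tokens.
request_bytes = BYTES_PER_LAYER_TOKEN * (
    FULL_LAYERS * CONTEXT + SWA_LAYERS * min(WINDOW, CONTEXT)
)
hybrid = pool_bytes / request_bytes

print(f"naive:  {naive:.2f}x")
print(f"hybrid: {hybrid:.2f}x")
```

&lt;p&gt;This simplified model gives about 1.42x for the naive case and about 2.85x for the hybrid case. It lands somewhat above the 2.68x that vLLM reports, presumably because the real allocator reserves more for the sliding-window layers than the bare window, but it reproduces the near-doubling.&lt;/p&gt;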

&lt;p&gt;In contrast, Gemma4 26B uses a heavy heterogeneous attention architecture: most layers are local sliding-window layers at head_dim=256, but a few global attention layers use a much larger head_dim=512 (8 query heads grouped onto 4 KV heads via GQA). It's those wide head_dim=512 global layers that dominate the KV budget. The startup logs flag the split explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;INFO [config.py] Gemma4 model has heterogeneous head dimensions
     (head_dim=256, global_head_dim=512).
     Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemma4's global attention layers with a massive head_dim=512 cost dramatically more per token, pushing its combined average overhead to ~112.7 KB per token - roughly 4.6x heavier than gpt-oss-20b's ~24.5 KB.&lt;/p&gt;

&lt;p&gt;This explains the gap:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gpt-oss-20b: 4.17 GiB ÷ ~24.5 KB/token ≈ &lt;strong&gt;182k tokens&lt;/strong&gt; of headline KV pool - cheap per token thanks to aggressive GQA + small head_dim + fp8, plus ~2.68x effective concurrency at 128k because the sliding-window layers don't grow with context&lt;/li&gt;
&lt;li&gt;Gemma4: 3.12 GiB ÷ ~112.7 KB/token ≈ &lt;strong&gt;29k tokens&lt;/strong&gt; - heavy global attention dimensions for maximum recall accuracy at the cost of density&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Choose gpt-oss-20b when you need long context on budget hardware or an OpenAI-compatible tool-calling drop-in. Choose Gemma 4 when you need lower TTFT or native vision input.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest Assessment
&lt;/h2&gt;

&lt;p&gt;Same caveat as in Part 1 - this is a quantized model, not the full cloud API. The mxfp4 quantization is more aggressive than AWQ int4, which can affect quality on tasks requiring precise numerical reasoning or very long coherent outputs.&lt;/p&gt;

&lt;p&gt;In practice we haven't noticed quality issues for the use cases we run: document analysis, structured data extraction, automation agents, and code review. For these tasks, the model performs well and the 128k context is genuinely useful - you can feed entire codebases or long documents without chunking.&lt;/p&gt;

&lt;p&gt;Data privacy remains the core advantage. Everything runs inside your VPC.&lt;/p&gt;

&lt;h2&gt;
  
  
  6 Months of Production Data
&lt;/h2&gt;

&lt;p&gt;The setup has been running since November 2025. A few things we learned over time:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pod restarts are fast and self-healing&lt;/strong&gt; - when the node is recycled (GKE node auto-upgrade, maintenance, or a manual node pool operation), the StatefulSet pod is rescheduled, picks up the PVC with cached weights and torch.compile artifacts, and is back up in ~3 minutes. No data loss, no manual intervention. (Surviving &lt;em&gt;spot preemption&lt;/em&gt; - and eliminating even that ~3-minute gap with a multi-zone replica architecture - is exactly what Part 3 covers.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory is stable&lt;/strong&gt; - no OOM events in 6 months with the &lt;code&gt;--max-num-seqs 3&lt;/code&gt; limit. The CPU swap mechanism handles occasional long-context requests without instability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The model is consistent&lt;/strong&gt; - response quality and latency have been stable. No drift or degradation observed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Afterword
&lt;/h2&gt;

&lt;p&gt;Running a model with 128k context on a $0.70/hr instance felt ambitious when we started. Six months later, it's just infrastructure that runs. The key insight is that mxfp4 quantization combined with aggressive GQA and Sliding Window Attention is what makes 128k context genuinely feasible on 24GB VRAM - not a hack, but an architectural decision that vLLM understands and optimizes for natively.&lt;/p&gt;

&lt;p&gt;This deployment is deliberately simple - a single replica on a standard on-demand node, with minimal scheduling. The next article in this series builds directly on it: a zone-aware multi-node &lt;strong&gt;spot&lt;/strong&gt; setup with a Kubernetes controller and automatic failover that guarantees at least one replica is always serving - the architecture that makes this both cheap (~3x cost reduction) and truly resilient in production.&lt;/p&gt;

&lt;p&gt;If you have questions or feedback, feel free to reach out.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Running Gemma 4 26B with 256k Context on GKE with a Single L4 GPU</title>
      <dc:creator>Oleksii Nizhegolenko</dc:creator>
      <pubDate>Mon, 18 May 2026 07:01:46 +0000</pubDate>
      <link>https://dev.to/ratibor78/running-gemma-4-26b-on-gke-with-a-single-l4-gpu-4l6g</link>
      <guid>https://dev.to/ratibor78/running-gemma-4-26b-on-gke-with-a-single-l4-gpu-4l6g</guid>
      <description>&lt;p&gt;&lt;em&gt;Alexey Nizhegolenko DevOps Engineer, AI Infrastructure Engineer&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When you start looking at self-hosting large language models in 2026, the options can feel overwhelming. You have dozens of models, quantization formats, inference engines, and cloud configurations to choose from. In this article, I'll show you a straightforward, step-by-step way to deploy &lt;strong&gt;Gemma4 26B-A4B&lt;/strong&gt; on Google Kubernetes Engine using a single NVIDIA L4 GPU - and push the context window all the way to 256,000 tokens. Real numbers, real mistakes, no marketing fluff.&lt;/p&gt;

&lt;p&gt;This is the first article in a series. Here we focus on getting a stable, production-ready deployment on a standard (non-spot) L4 instance. Later in the series, we'll look at how to build a resilient multi-zone setup with spot instances to cut costs significantly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Gemma4 26B-A4B?
&lt;/h2&gt;

&lt;p&gt;Gemma4 is Google DeepMind's latest open-weight model family released in April 2026 under Apache 2.0. The 26B-A4B variant is a Mixture-of-Experts (MoE) architecture - 26 billion total parameters but only ~4 billion active per token. This means it punches well above its weight in terms of quality while keeping inference costs reasonable. The architecture also supports up to 1M token context natively, which makes it an interesting target for squeezing out maximum context on constrained hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware and Cost
&lt;/h2&gt;

&lt;p&gt;An NVIDIA L4 GPU comes with 24GB VRAM. On GCP, the smallest instance with an L4 is &lt;code&gt;g2-standard-4&lt;/code&gt; (4 vCPU, 16GB RAM).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance type&lt;/th&gt;
&lt;th&gt;On-demand price&lt;/th&gt;
&lt;th&gt;Spot price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;g2-standard-4 (1x L4)&lt;/td&gt;
&lt;td&gt;~$0.70/hr&lt;/td&gt;
&lt;td&gt;~$0.21/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For this article, we use the on-demand instance for stability. The spot option and how to handle preemptions will be covered later in this series.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;Before starting, you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A GKE cluster (Standard mode, not Autopilot) in any region with GPU available&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kubectl&lt;/code&gt; configured to connect to your cluster&lt;/li&gt;
&lt;li&gt;Google Artifact Registry repository for your Docker images&lt;/li&gt;
&lt;li&gt;Basic familiarity with Kubernetes StatefulSets and PersistentVolumes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Create the GPU Node Pool
&lt;/h2&gt;

&lt;p&gt;Create a dedicated node pool for your LLM workload. We use &lt;code&gt;g2-standard-4&lt;/code&gt; with one L4 GPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud container node-pools create l4-llm &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YOUR_CLUSTER_NAME &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1-b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--machine-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;g2-standard-4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--accelerator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia-l4,count&lt;span class="o"&gt;=&lt;/span&gt;1,gpu-driver-version&lt;span class="o"&gt;=&lt;/span&gt;latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-autoscaling&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--min-nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--node-labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;l4-llm &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--node-taints&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia.com/gpu&lt;span class="o"&gt;=&lt;/span&gt;present:NoSchedule &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scopes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cloud-platform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth noting here. We start with &lt;code&gt;--num-nodes=0&lt;/code&gt; and enable autoscaling, so the node spins up automatically when we deploy our StatefulSet. The &lt;code&gt;--node-taints&lt;/code&gt; flag ensures that only pods tolerating the GPU taint land on this expensive node pool. The &lt;code&gt;--scopes=cloud-platform&lt;/code&gt; flag is important: without it, the node can't pull images from Artifact Registry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Prepare the vLLM Docker Image
&lt;/h2&gt;

&lt;p&gt;We use the official vLLM image with Gemma 4 support. Pull it and push it to your own Artifact Registry so that fresh nodes pull the image from inside your project, which reduces traffic cost at pod startup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Authenticate Docker with Artifact Registry&lt;/span&gt;
gcloud auth configure-docker us-central1-docker.pkg.dev

&lt;span class="c"&gt;# Pull the Gemma 4 compatible vLLM image&lt;/span&gt;
docker pull vllm/vllm-openai:gemma4-0505-cu129

&lt;span class="c"&gt;# Tag and push to your registry&lt;/span&gt;
docker tag vllm/vllm-openai:gemma4-0505-cu129 &lt;span class="se"&gt;\&lt;/span&gt;
  us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:gemma4-0505-cu129

docker push &lt;span class="se"&gt;\&lt;/span&gt;
  us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:gemma4-0505-cu129
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The image is ~27GB - this is normal. It bundles CUDA runtime, PyTorch, vLLM, and all dependencies into a self-contained package.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Create the Namespace
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Deploy Gemma 4 26B
&lt;/h2&gt;

&lt;p&gt;Now for the main part. Here's the complete manifest, a ClusterIP Service followed by the StatefulSet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;StatefulSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
  &lt;span class="na"&gt;updateStrategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RollingUpdate&lt;/span&gt;
  &lt;span class="na"&gt;persistentVolumeClaimRetentionPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;whenDeleted&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Retain&lt;/span&gt;
    &lt;span class="na"&gt;whenScaled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Retain&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;terminationGracePeriodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
      &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;l4-llm&lt;/span&gt;
      &lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia.com/gpu&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Exists&lt;/span&gt;
          &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NoSchedule&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:gemma4-0505-cu129&lt;/span&gt;
          &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IfNotPresent&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--model&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--served-model-name&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gemma-4-26b-a4b&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--gpu-memory-utilization&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.97"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--kv-cache-dtype&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;fp8&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--max-model-len&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;256000"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--tensor-parallel-size&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--max-num-seqs&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--enable-chunked-prefill&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--max-num-batched-tokens&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4096"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--enable-auto-tool-choice&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--tool-call-parser&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gemma4&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--reasoning-parser&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gemma4&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--async-scheduling&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--limit-mm-per-prompt&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{"image":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"audio":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"video":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0}'&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--host&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--port&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000"&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HF_HOME&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/models&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XDG_CACHE_HOME&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/models/.xdg-cache&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TRITON_CACHE_DIR&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/models/.triton&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VLLM_LOGGING_LEVEL&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INFO&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NVIDIA_VISIBLE_DEVICES&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NVIDIA_DRIVER_CAPABILITIES&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;compute,utility&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TORCH_CUDA_ARCH_LIST&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8.9"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LD_LIBRARY_PATH&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/kubernetes/bin/nvidia/lib64:/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:/usr/lib:/lib&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
              &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
              &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
          &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health&lt;/span&gt;
              &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
              &lt;span class="na"&gt;scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTP&lt;/span&gt;
            &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
            &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
            &lt;span class="na"&gt;timeoutSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
            &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;90&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2Gi&lt;/span&gt;
              &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
            &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3500m"&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;12Gi&lt;/span&gt;
              &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
          &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-cache&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/models&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dshm&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/dev/shm&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-lib64&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/kubernetes/bin/nvidia/lib64&lt;/span&gt;
              &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-bin&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/kubernetes/bin/nvidia/bin&lt;/span&gt;
              &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dshm&lt;/span&gt;
          &lt;span class="na"&gt;emptyDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;medium&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Memory&lt;/span&gt;
            &lt;span class="na"&gt;sizeLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2Gi&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-lib64&lt;/span&gt;
          &lt;span class="na"&gt;hostPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/kubernetes/bin/nvidia/lib64&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Directory&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-bin&lt;/span&gt;
          &lt;span class="na"&gt;hostPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/kubernetes/bin/nvidia/bin&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Directory&lt;/span&gt;
  &lt;span class="na"&gt;volumeClaimTemplates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
      &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-cache&lt;/span&gt;
      &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
        &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;standard-rwo&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now apply it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; gemma4-26b.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Happens on First Start
&lt;/h2&gt;

&lt;p&gt;The first startup takes around 8-10 minutes total. Here's the breakdown:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;vLLM image pull from Artifact Registry&lt;/strong&gt; (~5 min on a fresh node) - the image is 27GB and needs to be pulled once per node. On subsequent restarts on the same node, &lt;code&gt;imagePullPolicy: IfNotPresent&lt;/code&gt; skips this step entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model download from HuggingFace&lt;/strong&gt; (~2 min) - vLLM automatically downloads &lt;code&gt;cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit&lt;/code&gt; to the PVC. This also happens only once - the PVC persists across pod restarts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weight loading into GPU&lt;/strong&gt; (~90 sec) - 16GB of AWQ-quantized weights are loaded from the PVC into VRAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;torch.compile&lt;/strong&gt; (~50 sec first run, ~11 sec from cache) - JIT compilation of CUDA kernels; the result is saved to the cache on the PVC.&lt;/li&gt;
&lt;/ol&gt;
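
&lt;p&gt;As a quick sanity check on that total, the stage estimates from the list can simply be added up. This is a minimal Python sketch: the dictionary keys are illustrative names, and the durations are the approximate figures quoted above, not new measurements.&lt;/p&gt;

```python
# Approximate first-start stage durations quoted in the list above, in seconds.
FIRST_START_STAGES = {
    "image_pull": 5 * 60,
    "model_download": 2 * 60,
    "weight_loading": 90,
    "torch_compile": 50,
}


def total_minutes(stages: dict[str, int]) -> float:
    """Sum the stage durations and convert the result to minutes."""
    return sum(stages.values()) / 60


if __name__ == "__main__":
    print(f"first start: ~{total_minutes(FIRST_START_STAGES):.1f} min")
```

&lt;p&gt;The sum comes to about 9.3 minutes, which sits inside the 8-10 minute range.&lt;/p&gt;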

&lt;p&gt;Watch the progress:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs &lt;span class="nt"&gt;-f&lt;/span&gt; statefulset/gemma4-26b &lt;span class="nt"&gt;-n&lt;/span&gt; gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's what a healthy startup looks like with annotations so you know what to expect at each stage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# vLLM resolves the model architecture - confirms Gemma4 is recognized correctly
INFO [model.py] Resolved architecture: Gemma4ForConditionalGeneration
INFO [model.py] Using max model len 256000

# Text-only mode confirmed - multimodal encoders skipped, saving ~1GB VRAM
INFO [registry.py] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.

# Model weights downloading from HuggingFace to PVC (first run only)
INFO [weight_utils.py] Time spent downloading weights: 114.785540 seconds

# Weights loaded from PVC into GPU VRAM - 16GB in ~87 seconds
INFO [weight_utils.py] Filesystem type for checkpoints: EXT4. Checkpoint size: 16.01 GiB.
INFO [default_loader.py] Loading weights took 87.56 seconds
INFO [gpu_model_runner.py] Model loading took 15.55 GiB memory and 90.185187 seconds

# torch.compile - on first run ~50 sec, from cache ~11 sec
INFO [backends.py] Directly load the compiled graph(s) for compile range (1, 4096) from the cache, took 2.789 s
INFO [monitor.py] torch.compile took 10.94 s in total

# Final VRAM budget - this is the key number to watch
INFO [gpu_worker.py] Available KV cache memory: 5.26 GiB
INFO [kv_cache_utils.py] GPU KV cache size: 459,627 tokens
INFO [kv_cache_utils.py] Maximum concurrency for 256,000 tokens per request: 1.80x

# Server is ready
INFO: Application startup complete.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You're looking for this line to confirm everything worked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO: Application startup complete.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Second Start and Beyond
&lt;/h3&gt;

&lt;p&gt;This is where the PVC pays off. On every subsequent restart:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No download&lt;/strong&gt; - model already on PVC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;torch.compile from cache&lt;/strong&gt; - 11 seconds instead of 50&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total cold start: ~2 min 30 sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We measured this precisely across multiple restarts - the numbers are consistent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pod start:      15:08:57
Weights loaded: 15:11:02  (87 sec from PVC)
torch.compile:  15:11:15  (11 sec from cache)
Server ready:   15:11:29

Total from pod start: ~2 min 32 sec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
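
&lt;p&gt;The total can be re-derived from the timestamps themselves. The sketch below is plain Python with no cluster access; the dictionary keys are illustrative names and the values are copied from the timeline above.&lt;/p&gt;

```python
from datetime import datetime

# Timestamps copied from the restart timeline above (same day, HH:MM:SS).
TIMELINE = {
    "pod_start": "15:08:57",
    "weights_loaded": "15:11:02",
    "torch_compile_done": "15:11:15",
    "server_ready": "15:11:29",
}


def seconds_between(start: str, end: str) -> int:
    """Return whole seconds elapsed between two HH:MM:SS timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


if __name__ == "__main__":
    total = seconds_between(TIMELINE["pod_start"], TIMELINE["server_ready"])
    print(f"cold start: {total // 60} min {total % 60} sec")
```

&lt;p&gt;By the same arithmetic, the weights finish loading 125 seconds after pod start. Since the load itself took 87 seconds, roughly 38 seconds pass before loading begins. That gap is an inference from the timestamps, not a separately measured number.&lt;/p&gt;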



&lt;p&gt;On subsequent restarts on the same node (image already cached):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pod start → Server ready: ~2 min 32 sec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Key Configuration Decisions Explained
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why AWQ 4-bit quantization?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The original Gemma 4 26B model weights are ~52GB in BF16, so they simply won't fit on a 24GB L4. We use &lt;code&gt;cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit&lt;/code&gt;, which brings the checkpoint down to ~16GB. This format uses Marlin INT4 kernels under the hood and runs well on L4 (Ada Lovelace, compute capability 8.9, i.e. &lt;code&gt;sm_89&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;We also tried FP8 and NVFP4 quantizations. FP8 doesn't fit either (~26GB). NVFP4 requires Blackwell (compute capability 10.0+), so it won't run on L4 at all.&lt;/p&gt;
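
&lt;p&gt;The sizes quoted in this section follow from parameter count times bits per weight. Here is a rough Python sketch of that arithmetic; the function name is illustrative, and the estimate deliberately ignores quantization scales and any layers kept at higher precision.&lt;/p&gt;

```python
def raw_weight_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Raw weight size in GB (1e9 bytes), ignoring quantization scales and metadata."""
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    for name, bits in [("BF16", 16), ("FP8", 8), ("INT4", 4)]:
        print(f"{name}: ~{raw_weight_size_gb(26, bits):.0f} GB")
```

&lt;p&gt;The raw INT4 figure (~13GB) is lower than the ~16GB checkpoint. The difference is presumably quantization scales plus layers that stay unquantized, but that breakdown is an assumption, not something we measured.&lt;/p&gt;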

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;--limit-mm-per-prompt '{"image": 0, "audio": 0, "video": 0}'&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemma 4 is a multimodal model. Even if you don't use images or audio, vLLM reserves GPU memory for the multimodal encoders during profiling. Setting all limits to 0 puts vLLM into text-only mode, which frees about 1GB of VRAM. This is essential for reaching 256,000 token context on a 24GB card.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;--kv-cache-dtype fp8&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The KV cache stores the attention key and value tensors for every token already processed, so its size grows with context length. Using fp8 instead of bf16 halves the memory footprint per token, giving more room for longer contexts. The quality impact is minimal for typical chat and document analysis tasks.&lt;/p&gt;
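
&lt;p&gt;To put a number on this, the startup log reports 5.26 GiB of KV cache holding 459,627 tokens. The Python sketch below derives the effective per-token cost from those two figures and estimates what the same pool would hold at twice the width. It assumes the pool size stays at 5.26 GiB and that capacity scales linearly with bytes per entry, both of which are simplifications.&lt;/p&gt;

```python
GIB = 1024 ** 3

# Figures reported in the startup log with --kv-cache-dtype fp8.
KV_POOL_GIB = 5.26
KV_POOL_TOKENS_FP8 = 459_627
MAX_MODEL_LEN = 256_000


def bytes_per_token(pool_gib: float, pool_tokens: int) -> float:
    """Effective KV cache cost per token, averaged over the whole pool."""
    return pool_gib * GIB / pool_tokens


def tokens_at_double_width(pool_tokens: int) -> int:
    """Rough capacity if every KV entry took twice the bytes (bf16 instead of fp8)."""
    return pool_tokens // 2


if __name__ == "__main__":
    kib = bytes_per_token(KV_POOL_GIB, KV_POOL_TOKENS_FP8) / 1024
    print(f"fp8: ~{kib:.1f} KiB per token")
    bf16_tokens = tokens_at_double_width(KV_POOL_TOKENS_FP8)
    print(f"bf16 estimate: ~{bf16_tokens:,} tokens, 256k fits: {bf16_tokens >= MAX_MODEL_LEN}")
```

&lt;p&gt;Under those assumptions a bf16 cache would hold roughly 230k tokens, short of a single 256k request, which is why fp8 matters for this setup.&lt;/p&gt;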

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;--gpu-memory-utilization 0.97&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We leave 3% (~720MB) as a safety buffer for CUDA runtime allocations and temporary buffers during forward passes. At 0.97 we get 5.26 GiB for KV cache — enough for 459,627 tokens and 1.80x concurrency at 256k context.&lt;/p&gt;
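
&lt;p&gt;Both numbers in that paragraph are simple ratios. A small Python sketch, using the nominal 24GB card size and the token counts from the startup log (the function names are illustrative):&lt;/p&gt;

```python
def headroom_mb(total_vram_gb: float, utilization: float) -> float:
    """VRAM left outside vLLM's budget, in MB (1 GB = 1000 MB)."""
    return total_vram_gb * (1.0 - utilization) * 1000


def max_concurrency(kv_pool_tokens: int, max_model_len: int) -> float:
    """How many full-length requests fit in the KV cache pool at once."""
    return kv_pool_tokens / max_model_len


if __name__ == "__main__":
    print(f"headroom: ~{headroom_mb(24, 0.97):.0f} MB")
    print(f"concurrency at 256k: {max_concurrency(459_627, 256_000):.2f}x")
```

&lt;p&gt;This reproduces the ~720MB buffer and the 1.80x concurrency figure that vLLM prints at startup.&lt;/p&gt;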

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;--enable-chunked-prefill&lt;/code&gt; with &lt;code&gt;--max-num-batched-tokens 4096&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the key to unlocking 256k context on a single L4. Without chunked prefill, vLLM profiles the GPU by running a forward pass with all &lt;code&gt;max_model_len&lt;/code&gt; tokens at once — at 256k tokens that profiling pass alone would require more VRAM than the card has.&lt;/p&gt;

&lt;p&gt;With chunked prefill enabled, vLLM breaks large prefills into chunks of &lt;code&gt;max_num_batched_tokens&lt;/code&gt; (4096 tokens per step). This doesn't limit your context length — a 256k token document still gets fully processed — it just means the processing happens in 4096-token chunks instead of one massive pass. The profiling overhead drops dramatically, freeing several extra gigabytes for KV cache.&lt;/p&gt;

&lt;p&gt;The tradeoff: time to first token (TTFT) for very long prompts grows with the number of chunks, since the prefill now runs as a sequence of 4096-token steps instead of a single pass. For document analysis and agent pipelines, this is completely acceptable.&lt;/p&gt;
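
&lt;p&gt;To get a feel for how many prefill steps a prompt needs, here is a minimal Python sketch. It assumes a single request running alone, with each scheduler step filling one full chunk; in practice vLLM also interleaves decode tokens from other requests, so treat the result as a lower bound. The function name is illustrative.&lt;/p&gt;

```python
import math


def prefill_steps(prompt_tokens: int, max_num_batched_tokens: int = 4096) -> int:
    """Number of chunked-prefill steps needed to process one prompt."""
    return math.ceil(prompt_tokens / max_num_batched_tokens)


if __name__ == "__main__":
    for prompt_tokens in (4_096, 28_000, 256_000):
        print(f"{prompt_tokens:>7,} tokens -> {prefill_steps(prompt_tokens)} steps")
```

&lt;p&gt;A full 256k-token prompt works out to 63 steps, while a 28k-token prompt needs only 7.&lt;/p&gt;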

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;--max-num-seqs 8&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;During startup profiling, vLLM reserves CUDA graphs for &lt;code&gt;max_num_seqs&lt;/code&gt; simultaneous sequences. With the default of 256 sequences, vLLM would attempt to profile and capture graphs for decode batches of up to 256 concurrent sequences, which does not fit in 24GB of VRAM at a 256k token context. Setting &lt;code&gt;--max-num-seqs 8&lt;/code&gt; limits graph capture to sizes [1, 2, 4, 8, 16], consuming only 0.08 GiB for CUDA graphs while leaving 5.26 GiB for the KV cache pool (459,627 tokens, per the startup log). In practice, each of the 8 concurrent users gets a dynamically allocated share of that pool: a single active user can use the full 256k context, while 8 simultaneous users get about 57k tokens each if the pool is split evenly.&lt;/p&gt;
&lt;h2&gt;
  
  
  Performance Results
&lt;/h2&gt;

&lt;p&gt;All numbers below are measured from a real running instance, not estimated. We used &lt;code&gt;curl&lt;/code&gt; with &lt;code&gt;time&lt;/code&gt; for latency and pulled &lt;code&gt;/metrics&lt;/code&gt; for the rest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single request (500 output tokens):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://gemma4.yourdomain.com/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "Write a detailed technical explanation of how Kubernetes scheduling works..."}],
    "max_tokens": 500
  }'&lt;/span&gt;

&lt;span class="c"&gt;# real 0m9.783s  →  500 tokens / 9.78s = ~51 tokens/sec&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Results from &lt;code&gt;/metrics&lt;/code&gt; endpoint:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;1 request&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Throughput per request&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~51 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to First Token (TTFT)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~84ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KV cache usage&lt;/td&gt;
&lt;td&gt;&amp;lt;1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queue time&lt;/td&gt;
&lt;td&gt;~0ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Prefix cache hit rate&lt;/strong&gt; from metrics: 256/1557 tokens = &lt;strong&gt;~16.4%&lt;/strong&gt;. This means repeated system prompts or common conversation prefixes are being served from cache, reducing compute. In production, with consistent system prompts, the hit rate will be significantly higher.&lt;/p&gt;
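&lt;p&gt;If you want to recompute that ratio from a raw &lt;code&gt;/metrics&lt;/code&gt; dump, a small parser is enough. This is a sketch under assumptions: the two counter names in the sample are what I would expect from recent vLLM builds and may differ in yours, so check your own output first. The sample text is made up to match the numbers above, and it assumes samples carry no trailing timestamp.&lt;/p&gt;

```python
def sum_counter(metrics_text, name):
    """Sum every sample of one counter in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        metric, _, value = line.rpartition(" ")
        # Match the bare name or the name followed by a label set.
        if metric == name or metric.startswith(name + "{"):
            total += float(value)
    return total


def hit_rate(metrics_text, hits_name, queries_name):
    """Fraction of queried prefix tokens that were served from cache."""
    queries = sum_counter(metrics_text, queries_name)
    if queries == 0:
        return 0.0
    return sum_counter(metrics_text, hits_name) / queries


# Illustrative sample only; counter names are assumptions, not confirmed.
SAMPLE = """\
# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{model_name="gemma-4-26b-a4b"} 1557.0
# TYPE vllm:prefix_cache_hits_total counter
vllm:prefix_cache_hits_total{model_name="gemma-4-26b-a4b"} 256.0
"""

rate = hit_rate(SAMPLE, "vllm:prefix_cache_hits_total", "vllm:prefix_cache_queries_total")
print(f"prefix cache hit rate: {rate:.1%}")
```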

&lt;p&gt;With 256,000 tokens, you can process entire codebases, full legal contracts, lengthy research papers, or long-running agent conversations with extensive tool call history - all in a single pass without chunking or summarization tricks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context Length Evolution: From 28k to 256k
&lt;/h2&gt;

&lt;p&gt;The journey from the initial 28,000 token deployment to 256,000 tokens happened entirely through parameter tuning — same hardware, same model, same &lt;code&gt;gpu_memory_utilization&lt;/code&gt;. Here's the full progression:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;KV cache pool&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Max concurrency (KV pool / context)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Original (max-num-seqs=8, no chunked prefill)&lt;/td&gt;
&lt;td&gt;29,709 tokens&lt;/td&gt;
&lt;td&gt;28k&lt;/td&gt;
&lt;td&gt;1.06x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intermediate (chunked prefill enabled)&lt;/td&gt;
&lt;td&gt;280,517 tokens&lt;/td&gt;
&lt;td&gt;64k&lt;/td&gt;
&lt;td&gt;4.38x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;130k config&lt;/td&gt;
&lt;td&gt;395,527 tokens&lt;/td&gt;
&lt;td&gt;130k&lt;/td&gt;
&lt;td&gt;3.04x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;256k config (current)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;459,627 tokens&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;256k&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.80x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The jump from 28k to 256k came entirely from two parameter changes: enabling chunked prefill with a small &lt;code&gt;max_num_batched_tokens&lt;/code&gt;, and reducing &lt;code&gt;max_num_seqs&lt;/code&gt; to 2. No hardware changes, no model changes, no additional cost.&lt;/p&gt;

&lt;p&gt;At 256k, concurrency drops to 1.80x - meaning a single request is always safe and two shorter requests can run simultaneously. For internal tooling, this is perfectly fine: you're trading multi-user throughput for maximum document size per request.&lt;/p&gt;
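&lt;p&gt;The concurrency column is simply the KV cache pool divided by the configured context length. The short Python sketch below reproduces it from the table values, reading "28k", "64k", "130k" and "256k" as decimal thousands (an assumption that matches the published ratios). &lt;code&gt;max_concurrency&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

```python
# (configuration, KV cache pool in tokens, configured context in tokens)
CONFIGS = [
    ("Original", 29_709, 28_000),
    ("Intermediate", 280_517, 64_000),
    ("130k config", 395_527, 130_000),
    ("256k config", 459_627, 256_000),
]


def max_concurrency(pool_tokens, context_tokens):
    """How many full-length requests fit in the KV cache pool at once."""
    return pool_tokens / context_tokens


for name, pool, context in CONFIGS:
    print(f"{name}: {max_concurrency(pool, context):.2f}x")
```

&lt;p&gt;A value just under 2x means two requests only fit together if neither uses the full window, which is why this configuration behaves like a single-user setup for very long documents.&lt;/p&gt;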

&lt;h2&gt;
  
  
  Expose the API
&lt;/h2&gt;

&lt;p&gt;To expose the model inside your cluster via HTTP, you need an Ingress controller. The most common options in 2026 are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NGINX Ingress Controller&lt;/strong&gt; - the community standard, works on any Kubernetes cluster including GKE&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GKE Gateway API&lt;/strong&gt; - the newer GKE-native approach using &lt;code&gt;HTTPRoute&lt;/code&gt; resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kong Gateway&lt;/strong&gt; - popular choice if you need API key auth, rate limiting, or routing logic on top&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The example below uses NGINX Ingress, which you can install with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm &lt;span class="nb"&gt;install &lt;/span&gt;ingress-nginx ingress-nginx/ingress-nginx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; ingress-nginx &lt;span class="nt"&gt;--create-namespace&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add an Ingress to make the model accessible within your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-ingress&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/backend-protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HTTP"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-read-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;600"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-send-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;600"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-buffering&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;off"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4.yourdomain.com&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
            &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
            &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
                &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://gemma4.yourdomain.com/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "Hello, what can you do?"}],
    "max_tokens": 200
  }'&lt;/span&gt; | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API is fully OpenAI-compatible. You can drop it in as a replacement for any OpenAI client by changing the &lt;code&gt;base_url&lt;/code&gt; to &lt;code&gt;http://gemma4.yourdomain.com/v1&lt;/code&gt;.&lt;/p&gt;
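&lt;p&gt;If you would rather not pull in a client library, the same call can be made with the Python standard library. This is a minimal sketch: the host and model name are the placeholders used in this article, and &lt;code&gt;build_chat_request&lt;/code&gt; and &lt;code&gt;send&lt;/code&gt; are hypothetical helper names. The 600-second timeout mirrors the Ingress proxy timeouts above.&lt;/p&gt;

```python
import json
import urllib.request

# Placeholder host from the Ingress example; replace with your own.
BASE_URL = "http://gemma4.yourdomain.com/v1"


def build_chat_request(prompt, model="gemma-4-26b-a4b", max_tokens=200, base_url=BASE_URL):
    """Build (but do not send) a chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=600):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


request = build_chat_request("Hello, what can you do?")
print(request.get_method(), request.full_url)
```

&lt;p&gt;Calling &lt;code&gt;send(request)&lt;/code&gt; performs the actual HTTP call. In an OpenAI-compatible response the generated text should be under &lt;code&gt;choices[0].message.content&lt;/code&gt;.&lt;/p&gt;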

&lt;h2&gt;
  
  
  Honest Assessment: What This Model Is Good For
&lt;/h2&gt;

&lt;p&gt;It's worth being upfront: this is a quantized community model, not the latest frontier API. AWQ 4-bit compression introduces some quality degradation compared to the full BF16 weights, and at 256k context and &lt;code&gt;max_num_seqs=2&lt;/code&gt; this is effectively a single-user setup. If you need high-concurrency inference, this configuration trades that for maximum document size.&lt;/p&gt;

&lt;p&gt;That said, in practice it handles a wide range of tasks well - log analysis, PR reviews, writing automation agents, summarizing technical documents, simple code generation, and internal tooling. For these workloads, the quality is more than good enough, and the tradeoff is clear.&lt;/p&gt;

&lt;p&gt;The bigger advantage is data privacy. With a self-hosted model, your data never leaves your infrastructure. Cloud API providers may retain prompts for some period, and whether they can be used for model improvement depends on the provider, the plan, and your opt-out settings. If you're processing internal systems data, customer information, or proprietary business logic, keeping inference in-house removes that question entirely. Your data stays yours.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Tried That Didn't Work
&lt;/h2&gt;

&lt;p&gt;I think it's useful to share the approaches we tested before landing on this setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regional Persistent Disks&lt;/strong&gt; - We tried using a regional PD (replicated across two zones) to avoid zonal lock-in. GCP doesn't support regional PDs on G2 instances (the GPU series). This is a hard limitation, not a configuration issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GCS FUSE&lt;/strong&gt; - Mounting the model weights directly from a GCS bucket sounds elegant. In practice, vLLM reads all 16GB of weights sequentially at startup, and GCS FUSE has no prefetching for this pattern. After 4+ minutes on the first shard with no progress, we abandoned this approach. The init container + gsutil copy is faster and simpler.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High &lt;code&gt;--max-num-seqs&lt;/code&gt; with large context&lt;/strong&gt; - Our initial attempt at 130k used &lt;code&gt;--max-num-seqs 8&lt;/code&gt; without chunked prefill. vLLM failed to start: the profiling pass tried to reserve memory for 8 × 130k = over 1 million tokens simultaneously, which is physically impossible on 24GB VRAM. The fix is &lt;code&gt;--max-num-seqs 2&lt;/code&gt; combined with &lt;code&gt;--enable-chunked-prefill&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Breakdown
&lt;/h2&gt;

&lt;p&gt;Running this setup full-time on on-demand pricing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Cost/month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;g2-standard-4 on-demand&lt;/td&gt;
&lt;td&gt;~$500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVC 30GB standard-rwo&lt;/td&gt;
&lt;td&gt;~$3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCS model storage 17GB&lt;/td&gt;
&lt;td&gt;~$0.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$503/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A self-hosted model runs 24/7. At a conservative 45 tokens/sec on a single request (slightly below the ~51 tok/s measured above), that's:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;45 tokens/sec × 3600 sec/hr × 24 hr × 30 days = ~116.6 million tokens/month theoretical maximum
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 20% average utilization (which is realistic for internal tooling), you're looking at &lt;strong&gt;~23 million tokens/month&lt;/strong&gt; from a single request stream for a flat $503. The measurements above cover one request at a time; concurrent requests would raise the aggregate figure.&lt;/p&gt;

&lt;p&gt;For comparison, as of May 2026 OpenAI's current models are priced at:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Output tokens (per 1M)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4 (flagship)&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1&lt;/td&gt;
&lt;td&gt;$8.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o (legacy)&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At GPT-5.4 rates, 23 million output tokens would cost about &lt;strong&gt;$345/month&lt;/strong&gt;; against the much cheaper GPT-4o-mini at $0.60/M, about &lt;strong&gt;$14/month&lt;/strong&gt;. On output-token price alone, the flat $503 breaks even at roughly 33.5 million tokens/month against GPT-5.4 (about 29% utilization of a single stream) and at roughly 838 million against GPT-4o-mini, which a single stream cannot reach. The cost case therefore depends on sustained utilization, concurrent requests, and long prompts, since this comparison counts output tokens only and API input tokens are billed too.&lt;/p&gt;
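&lt;p&gt;The arithmetic is easy to rerun with your own throughput and prices. The Python sketch below uses the 45 tokens/sec figure, the $503 monthly total, and two of the per-million output prices from the table above. It counts output tokens only, and the function names are hypothetical helpers for illustration.&lt;/p&gt;

```python
SECONDS_PER_MONTH = 3600 * 24 * 30  # 30-day month, as in the estimate above
MONTHLY_COST_USD = 503.0            # total from the cost table


def monthly_tokens(tokens_per_sec, utilization=1.0):
    """Tokens generated in a 30-day month by one request stream."""
    return tokens_per_sec * SECONDS_PER_MONTH * utilization


def api_cost_usd(tokens, usd_per_million):
    """What the same number of output tokens would cost on a metered API."""
    return tokens / 1_000_000 * usd_per_million


def break_even_tokens(usd_per_million, fixed_cost_usd=MONTHLY_COST_USD):
    """Monthly output tokens at which the flat cost equals the API bill."""
    return fixed_cost_usd / usd_per_million * 1_000_000


tokens = monthly_tokens(45, utilization=0.20)
print(f"tokens/month at 20% utilization: {tokens:,.0f}")
for model, price in (("GPT-5.4", 15.00), ("GPT-4o-mini", 0.60)):
    print(f"{model}: API cost ${api_cost_usd(tokens, price):,.2f}, "
          f"break-even at {break_even_tokens(price):,.0f} tokens/month")
```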

&lt;p&gt;In &lt;strong&gt;Part 2&lt;/strong&gt; we'll show how to cut the compute cost by 70% using spot instances with a zone-aware architecture that automatically recovers from preemptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Afterword
&lt;/h2&gt;

&lt;p&gt;This setup is running in production and being used across the team for various tasks - log analysis, automated code reviews, agent pipelines, and document processing. The main takeaway is that running a capable 26B model with a genuine 256k context window on a single L4 GPU is very achievable in 2026 - the tooling has matured significantly. The tricky parts are around understanding how vLLM's memory profiling interacts with context length. Once you know which levers to pull - chunked prefill, conservative &lt;code&gt;max_num_seqs&lt;/code&gt;, fp8 KV cache, text-only mode - the numbers move dramatically without touching the hardware at all.&lt;/p&gt;

&lt;p&gt;If you have questions or want to share your experience with similar setups, feel free to reach out. The next article in this series will cover multi-zone spot deployments with automatic failover. Stay tuned.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>googlecloud</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
