<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Oleksii Nizhegolenko</title>
    <description>The latest articles on DEV Community by Oleksii Nizhegolenko (@ratibor78).</description>
    <link>https://dev.to/ratibor78</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3936724%2F7684339b-1105-4e04-81b4-3aa879415b3f.jpg</url>
      <title>DEV Community: Oleksii Nizhegolenko</title>
      <link>https://dev.to/ratibor78</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ratibor78"/>
    <language>en</language>
    <item>
      <title>Running Gemma 4 26B on GKE with a Single L4 GPU</title>
      <dc:creator>Oleksii Nizhegolenko</dc:creator>
      <pubDate>Mon, 18 May 2026 07:01:46 +0000</pubDate>
      <link>https://dev.to/ratibor78/running-gemma-4-26b-on-gke-with-a-single-l4-gpu-4l6g</link>
      <guid>https://dev.to/ratibor78/running-gemma-4-26b-on-gke-with-a-single-l4-gpu-4l6g</guid>
      <description>&lt;p&gt;&lt;em&gt;Alexey Nizhegolenko DevOps Engineer, AI Infrastructure Engineer&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When you start looking at self-hosting large language models in 2026, the options can feel overwhelming. You have dozens of models, quantization formats, inference engines, and cloud configurations to choose from. In this article, I'll show you a straightforward, step-by-step way to deploy &lt;strong&gt;Gemma 4 26B-A4B&lt;/strong&gt; on Google Kubernetes Engine using a single NVIDIA L4 GPU - with real numbers, real mistakes, and no marketing fluff.&lt;/p&gt;

&lt;p&gt;This is the first article in a series. Here we focus on getting a stable, production-ready deployment on a standard (non-spot) L4 instance. In the next article, we'll look at how to build a resilient multi-zone setup with spot instances to cut costs significantly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Gemma 4 26B-A4B?
&lt;/h2&gt;

&lt;p&gt;Gemma 4 is Google DeepMind's latest open-weight model family, released in April 2026 under Apache 2.0. The 26B-A4B variant is a Mixture-of-Experts (MoE) architecture - 26 billion total parameters but only ~4 billion active per token. Per-token compute is therefore closer to that of a ~4B dense model, which keeps inference costs reasonable while quality stays well above that size class. The catch is that all 26B parameters still have to fit in GPU memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware and Cost
&lt;/h2&gt;

&lt;p&gt;An NVIDIA L4 GPU comes with 24GB VRAM. On GCP, the smallest instance with an L4 is &lt;code&gt;g2-standard-4&lt;/code&gt; (4 vCPU, 16GB RAM).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instance type&lt;/th&gt;
&lt;th&gt;On-demand price&lt;/th&gt;
&lt;th&gt;Spot price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;g2-standard-4 (1x L4)&lt;/td&gt;
&lt;td&gt;~$0.70/hr&lt;/td&gt;
&lt;td&gt;~$0.21/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For this article we use the on-demand instance for stability. The spot option and how to handle preemptions will be covered in Part 2.&lt;/p&gt;
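
&lt;p&gt;To put the hourly prices in monthly terms, here is a rough sketch in Python. It assumes about 730 hours per month and takes the approximate prices from the table above as inputs. Actual GCP billing varies by region and over time, and the function name is only for illustration.&lt;/p&gt;

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month (8760 / 12)

# Approximate prices from the table above, in USD per hour
ON_DEMAND_PER_HOUR = 0.70
SPOT_PER_HOUR = 0.21


def monthly_cost(price_per_hour: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of keeping one node running for the given number of hours."""
    return round(price_per_hour * hours, 2)


on_demand = monthly_cost(ON_DEMAND_PER_HOUR)
spot = monthly_cost(SPOT_PER_HOUR)
savings_pct = round(100 * (1 - SPOT_PER_HOUR / ON_DEMAND_PER_HOUR))

print(f"on-demand: ~${on_demand}/month")
print(f"spot:      ~${spot}/month ({savings_pct}% cheaper)")
```

&lt;p&gt;These figures cover only the node itself. Persistent disk, network egress, and the rest of the cluster are billed separately.&lt;/p&gt;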

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;Before starting, you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A GKE cluster (Standard mode, not Autopilot) in a region where L4 GPUs are available&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kubectl&lt;/code&gt; configured to connect to your cluster&lt;/li&gt;
&lt;li&gt;Google Artifact Registry repository for your Docker images&lt;/li&gt;
&lt;li&gt;Basic familiarity with Kubernetes StatefulSets and PersistentVolumes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Create the GPU Node Pool
&lt;/h2&gt;

&lt;p&gt;Create a dedicated node pool for your LLM workload. We use &lt;code&gt;g2-standard-4&lt;/code&gt; with one L4 GPU:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud container node-pools create l4-llm &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;YOUR_CLUSTER_NAME &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--zone&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;us-central1-b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--machine-type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;g2-standard-4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--accelerator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia-l4,count&lt;span class="o"&gt;=&lt;/span&gt;1,gpu-driver-version&lt;span class="o"&gt;=&lt;/span&gt;latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--num-nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--enable-autoscaling&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--min-nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-nodes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--node-labels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;service&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;l4-llm &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--node-taints&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia.com/gpu&lt;span class="o"&gt;=&lt;/span&gt;present:NoSchedule &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scopes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cloud-platform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things are worth noting here. We start with &lt;code&gt;--num-nodes=0&lt;/code&gt; and enable autoscaling, so the node spins up automatically when we deploy our StatefulSet. The &lt;code&gt;--node-taints&lt;/code&gt; flag keeps pods without a matching toleration off this expensive node pool, so only GPU workloads land on it. The &lt;code&gt;--scopes=cloud-platform&lt;/code&gt; flag is important: without it, the node can't pull images from Artifact Registry.&lt;/p&gt;
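
&lt;p&gt;To make the taint behaviour concrete, here is a simplified Python model of how Kubernetes matches a toleration against a taint. It is a sketch of the documented matching rules, not the scheduler's actual code, and the class and function names are made up for illustration. The taint is the one from the command above; the toleration is the one used by the StatefulSet in Step 4.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # Kubernetes default when the field is omitted
    value: str = ""
    effect: str = ""  # an empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Simplified version of the Kubernetes toleration matching rules."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # The value is ignored; an empty key with Exists tolerates every taint
        return toleration.key in ("", taint.key)
    # Equal: both key and value must match
    return toleration.key == taint.key and toleration.value == taint.value


gpu_taint = Taint("nvidia.com/gpu", "present", "NoSchedule")
llm_pod = Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
ordinary_pod_tolerations = []  # a typical workload declares none

print(tolerates(llm_pod, gpu_taint))
print(any(tolerates(t, gpu_taint) for t in ordinary_pod_tolerations))
```

&lt;p&gt;Because the toleration uses &lt;code&gt;Exists&lt;/code&gt;, the taint value (&lt;code&gt;present&lt;/code&gt;) does not matter. A pod with no tolerations never matches, which is what keeps ordinary workloads off the GPU node.&lt;/p&gt;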

&lt;h2&gt;
  
  
  Step 2: Prepare the vLLM Docker Image
&lt;/h2&gt;

&lt;p&gt;We use the official vLLM image with Gemma 4 support. Pull it and push it to your Artifact Registry, so that new nodes pull the image from within GCP instead of from Docker Hub, which reduces traffic cost on startup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Authenticate Docker with Artifact Registry&lt;/span&gt;
gcloud auth configure-docker us-central1-docker.pkg.dev

&lt;span class="c"&gt;# Pull the Gemma 4 compatible vLLM image&lt;/span&gt;
docker pull vllm/vllm-openai:gemma4-0505-cu129

&lt;span class="c"&gt;# Tag and push to your registry&lt;/span&gt;
docker tag vllm/vllm-openai:gemma4-0505-cu129 &lt;span class="se"&gt;\&lt;/span&gt;
  us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:gemma4-0505-cu129

docker push &lt;span class="se"&gt;\&lt;/span&gt;
  us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:gemma4-0505-cu129
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The image is ~27GB - this is normal. It bundles CUDA runtime, PyTorch, vLLM, and all dependencies into a self-contained package.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Create the Namespace
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Deploy Gemma 4 26B
&lt;/h2&gt;

&lt;p&gt;Now for the main part. Here's the complete manifest, a ClusterIP Service plus the StatefulSet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;StatefulSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
  &lt;span class="na"&gt;updateStrategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RollingUpdate&lt;/span&gt;
  &lt;span class="na"&gt;persistentVolumeClaimRetentionPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;whenDeleted&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Retain&lt;/span&gt;
    &lt;span class="na"&gt;whenScaled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Retain&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;terminationGracePeriodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
      &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;l4-llm&lt;/span&gt;
      &lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia.com/gpu&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Exists&lt;/span&gt;
          &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NoSchedule&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vllm&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:gemma4-0505-cu129&lt;/span&gt;
          &lt;span class="na"&gt;imagePullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IfNotPresent&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--model&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--served-model-name&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gemma-4-26b-a4b&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--gpu-memory-utilization&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.97"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--kv-cache-dtype&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;fp8&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--max-model-len&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;28000"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--tensor-parallel-size&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--max-num-seqs&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--max-num-batched-tokens&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;28000"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--enable-auto-tool-choice&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--tool-call-parser&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gemma4&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--reasoning-parser&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;gemma4&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--async-scheduling&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--limit-mm-per-prompt&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{"image":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"audio":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"video":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0}'&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--host&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--port&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8000"&lt;/span&gt;
          &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HF_HOME&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/models&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;XDG_CACHE_HOME&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/models/.xdg-cache&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TRITON_CACHE_DIR&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/models/.triton&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VLLM_LOGGING_LEVEL&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INFO&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NVIDIA_VISIBLE_DEVICES&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;all&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NVIDIA_DRIVER_CAPABILITIES&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;compute,utility&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CUDA_VISIBLE_DEVICES&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TORCH_CUDA_ARCH_LIST&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8.9"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LD_LIBRARY_PATH&lt;/span&gt;
              &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/kubernetes/bin/nvidia/lib64:/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:/usr/lib:/lib&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
              &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
              &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
          &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health&lt;/span&gt;
              &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8000&lt;/span&gt;
              &lt;span class="na"&gt;scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HTTP&lt;/span&gt;
            &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
            &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
            &lt;span class="na"&gt;timeoutSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
            &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;90&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2Gi&lt;/span&gt;
              &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
            &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3500m"&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;12Gi&lt;/span&gt;
              &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
          &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-cache&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/models&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dshm&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/dev/shm&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-lib64&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/kubernetes/bin/nvidia/lib64&lt;/span&gt;
              &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-bin&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/kubernetes/bin/nvidia/bin&lt;/span&gt;
              &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dshm&lt;/span&gt;
          &lt;span class="na"&gt;emptyDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;medium&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Memory&lt;/span&gt;
            &lt;span class="na"&gt;sizeLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2Gi&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-lib64&lt;/span&gt;
          &lt;span class="na"&gt;hostPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/kubernetes/bin/nvidia/lib64&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Directory&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-bin&lt;/span&gt;
          &lt;span class="na"&gt;hostPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/home/kubernetes/bin/nvidia/bin&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Directory&lt;/span&gt;
  &lt;span class="na"&gt;volumeClaimTemplates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
      &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model-cache&lt;/span&gt;
      &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
        &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;standard-rwo&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now apply it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; gemma4-26b.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Happens on First Start
&lt;/h2&gt;

&lt;p&gt;The first startup takes around 8-10 minutes total. Here's the breakdown:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;vLLM image pull from Artifact Registry&lt;/strong&gt; (~5 min on a fresh node) - the image is 27GB and needs to be pulled once per node. On subsequent restarts on the same node, &lt;code&gt;imagePullPolicy: IfNotPresent&lt;/code&gt; skips this step entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model download from HuggingFace&lt;/strong&gt; (~2 min) - vLLM automatically downloads &lt;code&gt;cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit&lt;/code&gt; to the PVC. This also happens only once - the PVC persists across pod restarts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weight loading into GPU&lt;/strong&gt; (~90 sec) - 16GB of AWQ-quantized weights are loaded from the PVC into VRAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;torch.compile&lt;/strong&gt; (~90 sec) - JIT compilation of CUDA kernels; the result is saved to the PVC cache.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Watch the progress:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs &lt;span class="nt"&gt;-f&lt;/span&gt; statefulset/gemma4-26b &lt;span class="nt"&gt;-n&lt;/span&gt; gemma4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's what a healthy startup looks like with annotations so you know what to expect at each stage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# vLLM resolves the model architecture - confirms Gemma4 is recognized correctly
INFO [model.py] Resolved architecture: Gemma4ForConditionalGeneration
INFO [model.py] Using max model len 28000

# Text-only mode confirmed - multimodal encoders skipped, saving ~1GB VRAM
INFO [registry.py] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.

# Model weights downloading from HuggingFace to PVC (first run only)
INFO [weight_utils.py] Time spent downloading weights: 114.785540 seconds

# Weights loaded from PVC into GPU VRAM - 16GB in ~87 seconds
INFO [weight_utils.py] Filesystem type for checkpoints: EXT4. Checkpoint size: 16.01 GiB.
INFO [default_loader.py] Loading weights took 87.47 seconds
INFO [gpu_model_runner.py] Model loading took 15.55 GiB memory and 89.977503 seconds

# torch.compile - ~90 sec on a first run; the lines below are from a cached run (~10 sec)
INFO [backends.py] Directly load the compiled graph(s) for compile range (1, 28000) from the cache, took 2.750 s
INFO [monitor.py] torch.compile took 10.21 s in total

# Final VRAM budget - this is the key number to watch
INFO [gpu_worker.py] Available KV cache memory: 3.12 GiB
INFO [kv_cache_utils.py] GPU KV cache size: 29,709 tokens
INFO [kv_cache_utils.py] Maximum concurrency for 28,000 tokens per request: 1.06x

# Server is ready
INFO: Application startup complete.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;Available KV cache memory&lt;/code&gt; comes out below ~2.95 GiB, vLLM will fail at startup with &lt;code&gt;--max-model-len 28000&lt;/code&gt;, because the cache can't hold even a single full-length request. Either lower the context length or increase &lt;code&gt;--gpu-memory-utilization&lt;/code&gt; slightly.&lt;/p&gt;
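
&lt;p&gt;The threshold can be estimated from the numbers in the log above. The sketch below assumes KV cache memory scales linearly with the number of tokens, which is an approximation, and the variable names are only for illustration.&lt;/p&gt;

```python
GIB = 1024 ** 3

# Figures taken from the startup log above
available_kv_gib = 3.12
kv_cache_tokens = 29_709
max_model_len = 28_000

# Average KV cache footprint of one token, derived from the two log lines
bytes_per_token = available_kv_gib * GIB / kv_cache_tokens

# Memory needed to hold one request at full context length
required_gib = max_model_len * bytes_per_token / GIB

# How many full-length requests fit into the cache at once
concurrency = kv_cache_tokens / max_model_len

print(f"KV cache per token:   ~{bytes_per_token / 1024:.0f} KiB")
print(f"needed for {max_model_len} tokens: ~{required_gib:.2f} GiB")
print(f"max concurrency:      {concurrency:.2f}x")
```

&lt;p&gt;With the logged values this lands at roughly 2.94 GiB for one 28,000-token request, in line with the ~2.95 GiB threshold, and the concurrency figure agrees with the 1.06x that vLLM reports.&lt;/p&gt;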

&lt;p&gt;You're looking for this line to confirm everything worked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INFO: Application startup complete.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
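
&lt;p&gt;Once that line appears, a quick smoke test against the API is a reasonable next step. The sketch below is one way to do it with the Python standard library. It assumes the Service has been forwarded to your machine, for example with &lt;code&gt;kubectl port-forward svc/gemma4-26b 8080:80 -n gemma4&lt;/code&gt;, and it uses vLLM's OpenAI-compatible &lt;code&gt;/v1/chat/completions&lt;/code&gt; endpoint with the name set by &lt;code&gt;--served-model-name&lt;/code&gt;. The local port, helper names, and &lt;code&gt;max_tokens&lt;/code&gt; value are assumptions.&lt;/p&gt;

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # assumption: local port-forward to the Service
MODEL = "gemma-4-26b-a4b"           # matches --served-model-name in the manifest


def build_chat_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the message content of the first choice."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

&lt;p&gt;With the port-forward running, &lt;code&gt;print(ask("Reply with one short sentence."))&lt;/code&gt; should return a completion. Since the reasoning parser is enabled in this deployment, part of the model output may arrive in a separate reasoning field rather than in &lt;code&gt;content&lt;/code&gt;.&lt;/p&gt;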



&lt;h3&gt;
  
  
  Second Start and Beyond
&lt;/h3&gt;

&lt;p&gt;This is where the PVC pays off. On every subsequent restart:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No download&lt;/strong&gt; - model already on PVC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;torch.compile from cache&lt;/strong&gt; - 10 seconds instead of 90&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total cold start from pod start: ~2 min 33 sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We measured this precisely across multiple restarts - the numbers are consistent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scale up:      07:24:56
Pod running:   07:30:02  (image pull ~5 min on fresh node)
Weights loaded: 07:32:05  (87 sec from PVC)
torch.compile:  07:32:15  (10 sec from cache)
Server ready:   07:32:35

Total from scale up: ~7 min 39 sec (first time, includes image pull)
Total from pod start: ~2 min 33 sec (weights + compile from cache)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On subsequent restarts on the same node (image already cached):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pod start → Server ready: ~2 min 33 sec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
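
&lt;p&gt;The durations quoted above follow directly from the timestamps. A small sketch to recompute them, with illustrative helper names and the timestamps copied from the timeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime

TIME_FORMAT = "%H:%M:%S"

def elapsed_seconds(start, end):
    """Seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds())

def format_duration(seconds):
    minutes, secs = divmod(seconds, 60)
    return f"{minutes} min {secs} sec"

timeline = {
    "scale_up": "07:24:56",
    "pod_running": "07:30:02",
    "server_ready": "07:32:35",
}

from_scale_up = elapsed_seconds(timeline["scale_up"], timeline["server_ready"])
from_pod_start = elapsed_seconds(timeline["pod_running"], timeline["server_ready"])
print("From scale up: ", format_duration(from_scale_up))
print("From pod start:", format_duration(from_pod_start))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This gives 7 min 39 sec from scale up and 2 min 33 sec from pod start, so roughly five minutes of the first start is image pull rather than model loading.&lt;/p&gt;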



&lt;h2&gt;
  
  
  Key Configuration Decisions Explained
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why AWQ 4-bit quantization?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The original Gemma 4 26B model weights are ~52GB in BF16 - they simply won't fit on a 24GB L4. We use &lt;code&gt;cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit&lt;/code&gt;, which brings the checkpoint down to ~16GB. This format uses Marlin INT4 kernels under the hood and runs well on L4 (Ada Lovelace, compute capability 8.9).&lt;/p&gt;

&lt;p&gt;We also tried FP8 and NVFP4 quantizations. FP8 doesn't fit either (~26GB of weights on a 24GB card). NVFP4 requires Blackwell-generation GPUs (compute capability 10.0 and newer; 9.0 is Hopper) and won't run on L4 at all.&lt;/p&gt;
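
&lt;p&gt;The weight sizes above come from simple arithmetic on parameter count and bits per weight. A rough sketch follows. It ignores quantization scales and any layers kept in higher precision, which is presumably why the real AWQ checkpoint is ~16GB rather than the naive 13GB. The function name is illustrative.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def weight_size_gb(params_billion, bits_per_weight):
    """Naive weight footprint in GB: parameters x bits / 8, no metadata overhead."""
    return params_billion * bits_per_weight / 8

L4_VRAM_GB = 24

for name, bits in [("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    size = weight_size_gb(26, bits)
    headroom = L4_VRAM_GB - size
    print(f"{name}: ~{size:.0f} GB weights, {headroom:+.0f} GB headroom on a 24 GB L4")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Only the 4-bit variant leaves positive headroom, and that headroom still has to cover activations, the CUDA context, and the KV cache.&lt;/p&gt;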

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;--limit-mm-per-prompt '{"image": 0, "audio": 0, "video": 0}'&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemma 4 is a multimodal model. Even if you don't use images or audio, vLLM reserves GPU memory for the multimodal encoders during profiling. Setting all limits to 0 puts vLLM into text-only mode, which frees about 1GB of VRAM. This is what allows us to reach 28,000 token context instead of 24,000.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;--kv-cache-dtype fp8&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The KV cache stores the attention key and value tensors for every token already processed, so they don't have to be recomputed at each decode step. Using fp8 instead of bf16 halves the memory per cached token, giving more room for longer contexts. The quality impact is minimal for typical chat and document analysis tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why &lt;code&gt;--gpu-memory-utilization 0.97&lt;/code&gt;?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We found that 0.96 leaves the KV cache 40MB short for 28,000 tokens. Bumping to 0.97 makes it work. We don't go higher to keep a safety buffer against OOM on spot instances.&lt;/p&gt;
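
&lt;p&gt;To see why a single percentage point matters, here is a rough estimate of the KV cache budget at 0.96 versus 0.97. It rests on two assumptions that are not from the measurements above: the total VRAM figure (23034 MiB, what &lt;code&gt;nvidia-smi&lt;/code&gt; typically reports for an L4, so check your own node) and the simplification that weights and activation memory stay fixed, so every extra byte of budget goes to the KV cache.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Assumption: total memory as reported by nvidia-smi on an L4 (verify on your node)
TOTAL_VRAM_GIB = 23034 / 1024

# From the startup log at --gpu-memory-utilization 0.97
KV_CACHE_GIB_AT_097 = 3.12
KV_CACHE_TOKENS_AT_097 = 29_709

def kv_cache_gib_at(utilization):
    """Estimated KV cache memory at another utilization, all else held fixed."""
    return KV_CACHE_GIB_AT_097 + (utilization - 0.97) * TOTAL_VRAM_GIB

def required_kv_gib(max_model_len):
    """KV cache needed for one full-length request, scaled linearly from the log."""
    return max_model_len * KV_CACHE_GIB_AT_097 / KV_CACHE_TOKENS_AT_097

needed = required_kv_gib(28_000)
for util in (0.96, 0.97):
    available = kv_cache_gib_at(util)
    margin_mib = (available - needed) * 1024
    print(f"util={util}: {available:.2f} GiB available, margin {margin_mib:+.0f} MiB")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Under these assumptions 0.96 ends up a few tens of MiB short of what 28,000 tokens need, which is in line with the ~40MB shortfall we saw, while 0.97 leaves a small positive margin.&lt;/p&gt;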

&lt;h2&gt;
  
  
  Performance Results
&lt;/h2&gt;

&lt;p&gt;All numbers below are measured from a real running instance, not estimated. We used &lt;code&gt;curl&lt;/code&gt; with &lt;code&gt;time&lt;/code&gt; for latency and pulled &lt;code&gt;/metrics&lt;/code&gt; for the rest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single request (500 output tokens):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;time &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://gemma4-26b.yourdomain.com/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "Write a detailed technical explanation of how Kubernetes scheduling works..."}],
    "max_tokens": 500
  }'&lt;/span&gt;

&lt;span class="c"&gt;# real 0m9.783s  →  500 tokens / 9.78s = ~51 tokens/sec&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4 parallel requests (batch simulation):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;i &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;1..4&lt;span class="o"&gt;}&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://gemma4.yourdomain.com/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "gemma-4-26b-a4b", "messages": [...], "max_tokens": 500}'&lt;/span&gt; &amp;amp;
&lt;span class="k"&gt;done
&lt;/span&gt;&lt;span class="nb"&gt;wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Results from &lt;code&gt;/metrics&lt;/code&gt; endpoint:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;1 request&lt;/th&gt;
&lt;th&gt;4 parallel&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Throughput per request&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~51 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~47 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Time to First Token (TTFT)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~84ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1.94 sec&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KV cache usage&lt;/td&gt;
&lt;td&gt;&amp;lt;1%&lt;/td&gt;
&lt;td&gt;6.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Queue time&lt;/td&gt;
&lt;td&gt;~0ms&lt;/td&gt;
&lt;td&gt;~0ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Requests running simultaneously&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few things worth noting here. At 4 parallel requests, per-request throughput barely drops (-8%) - vLLM's async scheduling batches the decode steps efficiently. TTFT increases significantly though: from 84ms to ~1.9 seconds. This is expected behaviour when the GPU is processing multiple prefills at once. For interactive use cases, keep concurrency low. For batch processing, throughput is what matters and it holds well.&lt;/p&gt;
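
&lt;p&gt;The trade-off is easier to see as numbers. A minimal sketch using the measurements from the table (helper names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def percent_change(before, after):
    return (after - before) / before * 100

def aggregate_throughput(per_request_tok_s, concurrency):
    """Total tokens/sec across all concurrent requests."""
    return per_request_tok_s * concurrency

single_tok_s, parallel_tok_s = 51.0, 47.0   # per-request throughput from the table
single_ttft_ms, parallel_ttft_ms = 84, 1940  # time to first token from the table

drop = percent_change(single_tok_s, parallel_tok_s)
total = aggregate_throughput(parallel_tok_s, 4)
ttft_ratio = parallel_ttft_ms / single_ttft_ms

print(f"Per-request throughput change: {drop:.1f}%")
print(f"Aggregate throughput at 4 parallel: {total:.0f} tok/s")
print(f"TTFT is {ttft_ratio:.0f}x higher at 4 parallel")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;So four parallel requests give roughly 188 tok/s in aggregate for a ~7.8% per-request slowdown, at the price of a TTFT about 23 times higher.&lt;/p&gt;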

&lt;p&gt;&lt;strong&gt;Prefix cache hit rate&lt;/strong&gt; from metrics: 256/1557 tokens = &lt;strong&gt;~16.4%&lt;/strong&gt;. This means repeated system prompts or common conversation prefixes are being served from cache, reducing compute. In production with consistent system prompts the hit rate will be significantly higher.&lt;/p&gt;
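
&lt;p&gt;If you want to track this ratio over time, it can be computed from the Prometheus text that &lt;code&gt;/metrics&lt;/code&gt; returns. The sketch below parses a hard-coded sample instead of calling the server. The counter names are an assumption based on recent vLLM versions, so check the output of your own &lt;code&gt;/metrics&lt;/code&gt; endpoint before relying on them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sample; counter names are assumed, values are the ones quoted above
SAMPLE_METRICS = """\
# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{model_name="gemma-4-26b-a4b"} 1557.0
# TYPE vllm:prefix_cache_hits_total counter
vllm:prefix_cache_hits_total{model_name="gemma-4-26b-a4b"} 256.0
"""

def parse_counters(metrics_text):
    """Sum sample values per metric name from Prometheus text exposition format."""
    counters = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        counters[name] = counters.get(name, 0.0) + float(value)
    return counters

def prefix_cache_hit_rate(metrics_text):
    counters = parse_counters(metrics_text)
    queries = counters.get("vllm:prefix_cache_queries_total", 0.0)
    hits = counters.get("vllm:prefix_cache_hits_total", 0.0)
    return hits / queries if queries else 0.0

print(f"Prefix cache hit rate: {prefix_cache_hit_rate(SAMPLE_METRICS):.1%}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;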

&lt;p&gt;The 28,000 token context is enough for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyzing large log files or error traces without chunking&lt;/li&gt;
&lt;li&gt;Reviewing pull requests with extensive diffs and context&lt;/li&gt;
&lt;li&gt;Processing long technical documents or reports in a single pass&lt;/li&gt;
&lt;li&gt;Multi-turn agent conversations with rich tool call history&lt;/li&gt;
&lt;li&gt;Code analysis across entire modules or services&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Expose the API
&lt;/h2&gt;

&lt;p&gt;To expose the model over HTTP under a stable hostname, you need an Ingress controller in front of the Service. The most common options in 2026 are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NGINX Ingress Controller&lt;/strong&gt; - the community standard, works on any Kubernetes cluster including GKE&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GKE Gateway API&lt;/strong&gt; - the newer GKE-native approach using &lt;code&gt;HTTPRoute&lt;/code&gt; resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kong Gateway&lt;/strong&gt; - popular choice if you need API key auth, rate limiting, or routing logic on top&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The example below uses NGINX Ingress, which you can install with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;ingress-nginx ingress-nginx/ingress-nginx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; ingress-nginx &lt;span class="nt"&gt;--create-namespace&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add an Ingress that routes the hostname to the model's Service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-ingress&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/ingress.class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/backend-protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HTTP"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-read-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;600"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-send-timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;600"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/proxy-buffering&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;off"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4.yourdomain.com&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
            &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
            &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemma4-26b&lt;/span&gt;
                &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://gemma4.yourdomain.com/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "model": "gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "Hello, what can you do?"}],
    "max_tokens": 200
  }'&lt;/span&gt; | python3 &lt;span class="nt"&gt;-m&lt;/span&gt; json.tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API is fully OpenAI-compatible. You can drop it in as a replacement for any OpenAI client by changing the &lt;code&gt;base_url&lt;/code&gt; to &lt;code&gt;http://gemma4.yourdomain.com/v1&lt;/code&gt;.&lt;/p&gt;
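
&lt;p&gt;As a minimal sketch of what a client call looks like, here is the same request as the curl test, built with the Python standard library only. The hostname is the placeholder from the Ingress above and the helper names are made up for illustration. With the official &lt;code&gt;openai&lt;/code&gt; package you would pass the same URL as &lt;code&gt;base_url&lt;/code&gt; instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
import urllib.request

BASE_URL = "http://gemma4.yourdomain.com/v1"  # placeholder host from the Ingress
MODEL = "gemma-4-26b-a4b"

def build_chat_request(prompt, max_tokens=200, base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, max_tokens=200):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Uncomment once the hostname resolves to your Ingress:
# print(chat("Hello, what can you do?"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;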

&lt;h2&gt;
  
  
  Honest Assessment: What This Model Is Good For
&lt;/h2&gt;

&lt;p&gt;It's worth being upfront: this is a quantized community model, not the latest frontier API. AWQ 4-bit compression introduces some quality degradation compared to the full BF16 weights, and 28,000 tokens is modest compared to the 1M+ context of some cloud models.&lt;/p&gt;

&lt;p&gt;That said, in practice it handles a wide range of tasks well - log analysis, PR reviews, writing automation agents, summarizing technical documents, simple code generation, and internal tooling. For these workloads the quality is more than good enough, and the tradeoff is clear.&lt;/p&gt;

&lt;p&gt;The bigger advantage is data privacy. With a self-hosted model, your data never leaves your infrastructure. With a cloud API, prompts are sent to a third party, and depending on the provider and plan they may be retained for some time or used for model improvement unless you opt out. If you're processing internal systems data, customer information, or proprietary business logic - that's a meaningful difference. Your data stays yours.&lt;/p&gt;

&lt;h2&gt;
  
  
  What We Tried That Didn't Work
&lt;/h2&gt;

&lt;p&gt;I think it's useful to share the approaches we tested before landing on this setup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regional Persistent Disks&lt;/strong&gt; - We tried using a regional PD (replicated across two zones) to avoid zonal lock-in. GCP doesn't support regional PDs on G2 instances (the GPU series). This is a hard limitation, not a configuration issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GCS FUSE&lt;/strong&gt; - Mounting the model weights directly from a GCS bucket sounds elegant. In practice, vLLM reads all 16GB of weights sequentially at startup, and GCS FUSE has no prefetching for this pattern. After 4+ minutes on the first shard with no progress, we abandoned this approach. The init container + gsutil copy is faster and simpler.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Breakdown
&lt;/h2&gt;

&lt;p&gt;Running this setup full-time on on-demand pricing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource&lt;/th&gt;
&lt;th&gt;Cost/month&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;g2-standard-4 on-demand&lt;/td&gt;
&lt;td&gt;~$500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PVC 30GB standard-rwo&lt;/td&gt;
&lt;td&gt;~$3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCS model storage 17GB&lt;/td&gt;
&lt;td&gt;~$0.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$503/month&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A self-hosted model runs 24/7. At a conservative 45 tokens/sec on a single request stream, that's:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;45 tokens/sec × 3600 sec/hr × 24 hr × 30 days = ~116.6 million tokens/month theoretical maximum
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 20% average utilization (which is realistic for internal tooling), you're looking at &lt;strong&gt;~23 million tokens/month&lt;/strong&gt; for a flat $503. With 4 parallel requests at ~47 tokens/sec each (~188 tokens/sec aggregate), the theoretical ceiling rises to roughly 487 million tokens/month.&lt;/p&gt;

&lt;p&gt;For comparison, as of May 2026 OpenAI's current models are priced at:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Output tokens (per 1M)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.4 (flagship)&lt;/td&gt;
&lt;td&gt;$15.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4.1&lt;/td&gt;
&lt;td&gt;$8.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o (legacy)&lt;/td&gt;
&lt;td&gt;$10.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini&lt;/td&gt;
&lt;td&gt;$0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At GPT-5.4 rates, 23 million output tokens would cost about &lt;strong&gt;$350/month&lt;/strong&gt;, and at GPT-4o-mini's $0.60/M only about &lt;strong&gt;$14/month&lt;/strong&gt;. So on a single request stream at 20% utilization, the flat $503 does not win on price alone. The break-even point against GPT-5.4 is roughly 33.5 million output tokens/month ($503 / $15 per 1M), which is about 7% of the 4-parallel ceiling. Against GPT-4o-mini it is around 838 million tokens/month, more than the ~487 million ceiling derived from the 4-parallel measurement. The cost case depends on sustained, batched usage and on which API tier you would otherwise pay for.&lt;/p&gt;
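
&lt;p&gt;The throughput and cost arithmetic is easy to get wrong by a factor of a thousand, so here is a short script to recompute it. It uses the 45 tokens/sec figure, the $503 flat cost, and the output prices from the table. Input token pricing is ignored and the function names are illustrative.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;SECONDS_PER_MONTH = 3600 * 24 * 30  # 30-day month

def monthly_tokens(tokens_per_sec, utilization=1.0):
    """Tokens generated in a month at the given average utilization."""
    return tokens_per_sec * SECONDS_PER_MONTH * utilization

def api_cost_usd(tokens, usd_per_million):
    return tokens / 1_000_000 * usd_per_million

def break_even_tokens(flat_cost_usd, usd_per_million):
    """Monthly output tokens at which the API bill equals the flat cost."""
    return flat_cost_usd / usd_per_million * 1_000_000

FLAT_COST_USD = 503
PRICES = {"GPT-5.4": 15.00, "GPT-4o-mini": 0.60}  # USD per 1M output tokens, from the table

ceiling = monthly_tokens(45)
typical = monthly_tokens(45, utilization=0.20)
print(f"Single-stream ceiling: {ceiling / 1e6:.1f}M tokens/month")
print(f"At 20% utilization:    {typical / 1e6:.1f}M tokens/month")
for model, price in PRICES.items():
    cost = api_cost_usd(typical, price)
    even = break_even_tokens(FLAT_COST_USD, price) / 1e6
    print(f"{model}: API cost ${cost:,.2f}, break-even at {even:.1f}M tokens/month")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Swap in your own measured throughput, utilization, and the API prices you actually pay to see where your break-even sits.&lt;/p&gt;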

&lt;p&gt;In &lt;strong&gt;Part 2&lt;/strong&gt; we'll show how to cut the compute cost by 70% using spot instances with a zone-aware architecture that automatically recovers from preemptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Afterword
&lt;/h2&gt;

&lt;p&gt;This setup is running in production and being used across the team for various tasks - log analysis, automated code reviews, agent pipelines, and document processing. After several weeks of use, it's proven to be stable and reliable. The main takeaway is that running a capable 26B model on a single L4 GPU is very achievable in 2026 - the tooling has matured significantly. The tricky parts are around quantization format compatibility and squeezing out the last few gigabytes of VRAM for context length.&lt;/p&gt;

&lt;p&gt;If you have questions or want to share your experience with similar setups, feel free to reach out. The next article in this series will cover multi-zone spot deployments with automatic failover. Stay tuned.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>googlecloud</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
