Oleksii Nizhegolenko

Posted on May 18

Running Gemma 4 26B on GKE with a Single L4 GPU

#devops #kubernetes #googlecloud #ai

Alexey Nizhegolenko DevOps Engineer, AI Infrastructure Engineer

When you start looking at self-hosting large language models in 2026, the options can feel overwhelming. You have dozens of models, quantization formats, inference engines, and cloud configurations to choose from. In this article, I'll show you a straightforward, step-by-step way to deploy Gemma 4 26B-A4B on Google Kubernetes Engine using a single NVIDIA L4 GPU - with real numbers, real mistakes, and no marketing fluff.

This is the first article in a series. Here we focus on getting a stable, production-ready deployment on a standard (non-spot) L4 instance. In the next article, we'll look at how to build a resilient multi-zone setup with spot instances to cut costs significantly.

Why Gemma 4 26B-A4B?

Gemma 4 is Google DeepMind's latest open-weight model family, released in April 2026 under Apache 2.0. The 26B-A4B variant is a Mixture-of-Experts (MoE) architecture - 26 billion total parameters but only ~4 billion active per token. This keeps the compute cost per token close to that of a ~4B dense model, while quality benefits from the larger total parameter count. All 26 billion parameters still have to fit in GPU memory, though, which is why quantization matters later in this article.

Hardware and Cost

An NVIDIA L4 GPU comes with 24GB VRAM. On GCP, the smallest instance with an L4 is g2-standard-4 (4 vCPU, 16GB RAM).

Instance type	On-demand price	Spot price
g2-standard-4 (1x L4)	~$0.70/hr	~$0.21/hr

For this article we use the on-demand instance for stability. The spot option and how to handle preemptions will be covered in Part 2.
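To put the hourly prices in monthly terms, here is a small sketch. It assumes a 30-day month of continuous uptime (720 hours) and uses the approximate prices from the table above; the function name is only illustrative.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, node runs 24/7

def monthly_cost(hourly_usd: float, hours: int = HOURS_PER_MONTH) -> float:
    """Cost of keeping one instance up for the whole month."""
    return hourly_usd * hours

on_demand = monthly_cost(0.70)   # approximate on-demand price from the table
spot = monthly_cost(0.21)        # approximate spot price from the table
saving = 1 - spot / on_demand    # fraction saved by running on spot

print(f"on-demand: ${on_demand:.0f}/month")
print(f"spot:      ${spot:.0f}/month")
print(f"saving:    {saving:.0%}")
```

Actual prices vary by region and change over time, so treat the result as a rough budget figure.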

Requirements

Before starting, you need:

A GKE cluster (Standard mode, not Autopilot) in any region with GPU available
kubectl configured to connect to your cluster
Google Artifact Registry repository for your Docker images
Basic familiarity with Kubernetes StatefulSets and PersistentVolumes

Step 1: Create the GPU Node Pool

Create a dedicated node pool for your LLM workload. We use g2-standard-4 with one L4 GPU:

gcloud container node-pools create l4-llm \
  --cluster=YOUR_CLUSTER_NAME \
  --zone=us-central1-b \
  --machine-type=g2-standard-4 \
  --accelerator=type=nvidia-l4,count=1,gpu-driver-version=latest \
  --num-nodes=0 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=1 \
  --node-labels=service=l4-llm \
  --node-taints=nvidia.com/gpu=present:NoSchedule \
  --scopes=cloud-platform

A few things worth noting here. We start with --num-nodes=0 and enable autoscaling, so the node spins up automatically when we deploy our StatefulSet. The --node-taints flag ensures only pods that tolerate the GPU taint land on this expensive node pool. The --scopes=cloud-platform flag is important too: to pull images from Artifact Registry, the node needs a sufficient access scope and its service account needs read access to the repository.

Step 2: Prepare the vLLM Docker Image

We use the official vLLM image with Gemma 4 support. Pull it and push it to your Artifact Registry, so nodes fetch it from a registry in your own project instead of Docker Hub, which reduces traffic cost at startup:

# Authenticate Docker with Artifact Registry
gcloud auth configure-docker us-central1-docker.pkg.dev

# Pull the Gemma 4 compatible vLLM image
docker pull vllm/vllm-openai:gemma4-0505-cu129

# Tag and push to your registry
docker tag vllm/vllm-openai:gemma4-0505-cu129 \
  us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:gemma4-0505-cu129

docker push \
  us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:gemma4-0505-cu129

The image is ~27GB - this is normal. It bundles CUDA runtime, PyTorch, vLLM, and all dependencies into a self-contained package.

Step 3: Create the Namespace

kubectl create namespace gemma4

Step 4: Deploy Gemma 4 26B

Now for the main part. Here's the complete manifest, a ClusterIP Service plus the StatefulSet. Save it as gemma4-26b.yaml:

apiVersion: v1
kind: Service
metadata:
  name: gemma4-26b
  namespace: gemma4
  labels:
    app: gemma4-26b
spec:
  type: ClusterIP
  selector:
    app: gemma4-26b
  ports:
    - name: http
      port: 80
      targetPort: 8000
      protocol: TCP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gemma4-26b
  namespace: gemma4
  labels:
    app: gemma4-26b
spec:
  serviceName: gemma4-26b
  replicas: 1
  selector:
    matchLabels:
      app: gemma4-26b
  updateStrategy:
    type: RollingUpdate
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  template:
    metadata:
      labels:
        app: gemma4-26b
    spec:
      terminationGracePeriodSeconds: 60
      nodeSelector:
        service: l4-llm
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:gemma4-0505-cu129
          imagePullPolicy: IfNotPresent
          args:
            - --model
            - cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
            - --served-model-name
            - gemma-4-26b-a4b
            - --gpu-memory-utilization
            - "0.97"
            - --kv-cache-dtype
            - fp8
            - --max-model-len
            - "28000"
            - --tensor-parallel-size
            - "1"
            - --max-num-seqs
            - "8"
            - --max-num-batched-tokens
            - "28000"
            - --enable-auto-tool-choice
            - --tool-call-parser
            - gemma4
            - --reasoning-parser
            - gemma4
            - --async-scheduling
            - --limit-mm-per-prompt
            - '{"image": 0, "audio": 0, "video": 0}'
            - --host
            - 0.0.0.0
            - --port
            - "8000"
          env:
            - name: HF_HOME
              value: /models
            - name: XDG_CACHE_HOME
              value: /models/.xdg-cache
            - name: TRITON_CACHE_DIR
              value: /models/.triton
            - name: VLLM_LOGGING_LEVEL
              value: INFO
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
            - name: TORCH_CUDA_ARCH_LIST
              value: "8.9"
            - name: LD_LIBRARY_PATH
              value: /home/kubernetes/bin/nvidia/lib64:/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:/usr/lib:/lib
          ports:
            - name: http
              containerPort: 8000
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 90
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
              nvidia.com/gpu: "1"
            limits:
              cpu: "3500m"
              memory: 12Gi
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-cache
              mountPath: /models
            - name: dshm
              mountPath: /dev/shm
            - name: nvidia-lib64
              mountPath: /home/kubernetes/bin/nvidia/lib64
              readOnly: true
            - name: nvidia-bin
              mountPath: /home/kubernetes/bin/nvidia/bin
              readOnly: true
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
        - name: nvidia-lib64
          hostPath:
            path: /home/kubernetes/bin/nvidia/lib64
            type: Directory
        - name: nvidia-bin
          hostPath:
            path: /home/kubernetes/bin/nvidia/bin
            type: Directory
  volumeClaimTemplates:
    - apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: model-cache
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: standard-rwo
        resources:
          requests:
            storage: 30Gi

Now apply it:

kubectl apply -f gemma4-26b.yaml

What Happens on First Start

The first startup takes around 8-10 minutes total. Here's the breakdown:

vLLM image pull from Artifact Registry (~5 min on a fresh node) - the image is 27GB and needs to be pulled once per node. On subsequent restarts on the same node, imagePullPolicy: IfNotPresent skips this step entirely.
Model download from HuggingFace (~2 min) - vLLM automatically downloads cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit to the PVC. This also happens only once - the PVC persists across pod restarts.
Weight loading into GPU (~90 sec) - 16GB of AWQ quantized weights loaded from PVC into VRAM
torch.compile (~90 sec) - JIT compilation of CUDA kernels, result saved to PVC cache

Watch the progress:

kubectl logs -f statefulset/gemma4-26b -n gemma4

Here's what a healthy startup looks like with annotations so you know what to expect at each stage:

# vLLM resolves the model architecture - confirms Gemma4 is recognized correctly
INFO [model.py] Resolved architecture: Gemma4ForConditionalGeneration
INFO [model.py] Using max model len 28000

# Text-only mode confirmed - multimodal encoders skipped, saving ~1GB VRAM
INFO [registry.py] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.

# Model weights downloading from HuggingFace to PVC (first run only)
INFO [weight_utils.py] Time spent downloading weights: 114.785540 seconds

# Weights loaded from PVC into GPU VRAM - 16GB in ~87 seconds
INFO [weight_utils.py] Filesystem type for checkpoints: EXT4. Checkpoint size: 16.01 GiB.
INFO [default_loader.py] Loading weights took 87.47 seconds
INFO [gpu_model_runner.py] Model loading took 15.55 GiB memory and 89.977503 seconds

# torch.compile - ~90 sec on first run; the lines below are from a later run that loads from cache (~10 sec)
INFO [backends.py] Directly load the compiled graph(s) for compile range (1, 28000) from the cache, took 2.750 s
INFO [monitor.py] torch.compile took 10.21 s in total

# Final VRAM budget - this is the key number to watch
INFO [gpu_worker.py] Available KV cache memory: 3.12 GiB
INFO [kv_cache_utils.py] GPU KV cache size: 29,709 tokens
INFO [kv_cache_utils.py] Maximum concurrency for 28,000 tokens per request: 1.06x

# Server is ready
INFO: Application startup complete.

If Available KV cache memory is below ~2.95 GiB, your --max-model-len 28000 will fail at startup. Either lower the context length or increase --gpu-memory-utilization slightly.
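The ~2.95 GiB threshold can be approximated from the log values themselves. This is a rough sketch that assumes KV cache memory scales linearly with token count (vLLM allocates in blocks, so the real cutoff can differ slightly); the variable names are illustrative.

```python
# Values copied from the startup log above
available_kv_gib = 3.12      # "Available KV cache memory"
kv_capacity_tokens = 29_709  # "GPU KV cache size"
max_model_len = 28_000       # --max-model-len

# Memory one cached token costs, derived from the two log values
gib_per_token = available_kv_gib / kv_capacity_tokens

# Minimum KV cache memory needed to hold one full-length request
required_gib = max_model_len * gib_per_token
headroom_mib = (available_kv_gib - required_gib) * 1024

print(f"per token: {gib_per_token * 1024 * 1024:.0f} KiB")
print(f"required:  {required_gib:.2f} GiB")
print(f"headroom:  {headroom_mib:.0f} MiB")
print(f"max concurrency: {kv_capacity_tokens / max_model_len:.2f}x")
```

If you change --max-model-len, plugging the new value into the same calculation gives a first estimate of whether it will fit before you restart the pod.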

You're looking for this line to confirm everything worked:

INFO: Application startup complete.

Second Start and Beyond

This is where the PVC pays off. On every subsequent restart:

No download - model already on PVC
torch.compile from cache - 10 seconds instead of 90
Total cold start: ~2 min 33 sec (pod start to server ready, image already on the node)

We measured this precisely across multiple restarts - the numbers are consistent:

Scale up:      07:24:56
Pod running:   07:30:02  (image pull ~5 min on fresh node)
Weights loaded: 07:32:05  (87 sec from PVC)
torch.compile:  07:32:15  (10 sec from cache)
Server ready:   07:32:35

Total from scale up: ~7 min 39 sec (fresh node, includes image pull)
Total from pod start: ~2 min 33 sec (weights + compile from cache)

On subsequent restarts on the same node (image already cached):

Pod start → Server ready: ~2 min 33 sec
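The totals follow directly from the timestamps. A minimal sketch for recomputing them from your own measurements (the helper names are illustrative):

```python
from datetime import datetime

def elapsed(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

def fmt_duration(seconds: int) -> str:
    return f"{seconds // 60} min {seconds % 60} sec"

# Timestamps copied from the measurement above
scale_up, pod_running, server_ready = "07:24:56", "07:30:02", "07:32:35"

print("scale up -> pod running:", fmt_duration(elapsed(scale_up, pod_running)))
print("pod running -> ready:   ", fmt_duration(elapsed(pod_running, server_ready)))
print("scale up -> ready:      ", fmt_duration(elapsed(scale_up, server_ready)))
```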

Key Configuration Decisions Explained

Why AWQ 4-bit quantization?

The original Gemma 4 26B model weights are ~52GB in BF16 - they simply won't fit on a 24GB L4. We use cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit, which brings them down to ~16GB. This format uses Marlin INT4 kernels under the hood and runs well on L4 (Ada Lovelace, compute capability 8.9, i.e. sm_89).

We also tried FP8 and NVFP4 quantizations. FP8 doesn't fit either (~26GB). NVFP4 requires Blackwell (compute capability 10.0 or newer) and won't run on L4 at all.
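The sizes quoted above follow from simple arithmetic on the parameter count. The sketch below only counts raw weight bytes; it is an approximation, and the function name is illustrative.

```python
TOTAL_PARAMS = 26e9  # 26 billion total parameters; every expert must be resident in VRAM

def weights_gb(bits_per_param: float, params: float = TOTAL_PARAMS) -> float:
    """Raw weight size in GB (10^9 bytes), ignoring scales and unquantized layers."""
    return params * bits_per_param / 8 / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    size = weights_gb(bits)
    fits = "fits" if size < 24 else "does not fit"
    print(f"{name}: ~{size:.0f} GB -> {fits} in 24 GB before KV cache")
```

The pure 4-bit figure comes out below the ~16GB checkpoint reported in the logs. The difference is presumably quantization scales and layers kept at higher precision, but that is an assumption and not something the logs confirm.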

Why --limit-mm-per-prompt '{"image": 0, "audio": 0, "video": 0}'?

Gemma 4 is a multimodal model. Even if you don't use images or audio, vLLM reserves GPU memory for the multimodal encoders during profiling. Setting all limits to 0 puts vLLM into text-only mode, which frees about 1GB of VRAM. This is what allows us to reach 28,000 token context instead of 24,000.

Why --kv-cache-dtype fp8?

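The KV cache stores the attention key and value tensors for every token already processed, so its size grows linearly with context length. Using fp8 (1 byte per element) instead of bf16 (2 bytes) halves the memory footprint, giving more room for longer contexts. The quality impact is minimal for typical chat and document analysis tasks.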
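To see what that halving means here, this sketch scales the token capacity from the startup log to a bf16 cache with the same memory budget. It assumes capacity is inversely proportional to bytes per element; the names are illustrative.

```python
BYTES_PER_ELEMENT = {"fp8": 1, "bf16": 2}

def token_capacity(fp8_tokens: int, dtype: str) -> int:
    """Scale the measured fp8 token capacity to another KV cache dtype
    for the same memory budget."""
    return fp8_tokens * BYTES_PER_ELEMENT["fp8"] // BYTES_PER_ELEMENT[dtype]

measured_fp8_tokens = 29_709  # "GPU KV cache size" from the startup log
max_model_len = 28_000        # --max-model-len

for dtype in ("fp8", "bf16"):
    tokens = token_capacity(measured_fp8_tokens, dtype)
    ok = "enough" if tokens >= max_model_len else "not enough"
    print(f"{dtype}: ~{tokens:,} tokens -> {ok} for --max-model-len {max_model_len}")
```

Under these assumptions a bf16 cache would hold only about half the tokens, well short of the 28,000-token context.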

Why --gpu-memory-utilization 0.97?

We found that 0.96 leaves the KV cache 40MB short for 28,000 tokens. Bumping to 0.97 makes it work. We don't go higher to keep a safety buffer against OOM on spot instances.
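A back-of-the-envelope model shows why one step of 0.01 makes the difference. The total memory figure below is an assumption (check nvidia-smi on your node), and the model treats weight and activation memory as fixed, so read it as an estimate only.

```python
TOTAL_GPU_GIB = 22.5  # assumption: memory reported for one L4; verify with nvidia-smi

kv_at_097_gib = 3.12                   # "Available KV cache memory" from the log at 0.97
required_gib = 28_000 * 3.12 / 29_709  # memory for one 28,000-token request, from the log

def kv_cache_gib(utilization: float) -> float:
    """Estimate KV cache memory at another utilization, assuming weights and
    activation memory stay constant and only the KV budget changes."""
    return kv_at_097_gib + (utilization - 0.97) * TOTAL_GPU_GIB

for util in (0.95, 0.96, 0.97):
    margin_mib = (kv_cache_gib(util) - required_gib) * 1024
    status = "ok" if margin_mib >= 0 else "too small"
    print(f"{util:.2f}: margin {margin_mib:+.0f} MiB -> {status}")
```

With these inputs the estimate at 0.96 falls short by a few tens of MiB, the same order as the shortfall we observed.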

Performance Results

All numbers below are measured from a real running instance, not estimated. We used curl with time for latency and pulled /metrics for the rest.

Single request (500 output tokens):

time curl -s http://gemma4.yourdomain.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "Write a detailed technical explanation of how Kubernetes scheduling works..."}],
    "max_tokens": 500
  }'

# real 0m9.783s  →  500 tokens / 9.78s = ~51 tokens/sec

4 parallel requests (batch simulation):

for i in {1..4}; do
  curl -s http://gemma4.yourdomain.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "gemma-4-26b-a4b", "messages": [...], "max_tokens": 500}' &
done
wait

Results from /metrics endpoint:

Metric	1 request	4 parallel
Throughput per request	~51 tok/s	~47 tok/s
Time to First Token (TTFT)	~84ms	~1.94 sec
KV cache usage	<1%	6.1%
Queue time	~0ms	~0ms
Requests running simultaneously	1	4

A few things worth noting here. At 4 parallel requests, per-request throughput barely drops (-8%) - vLLM's async scheduling batches the decode steps efficiently. TTFT increases significantly though: from 84ms to ~1.9 seconds. This is expected behaviour when the GPU is processing multiple prefills at once. For interactive use cases, keep concurrency low. For batch processing, throughput is what matters and it holds well.
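The throughput figures can be reproduced as follows. Note that dividing max_tokens by wall time assumes the model actually generated all 500 tokens and that the result includes time to first token, so it slightly understates pure decode speed. The names are illustrative.

```python
def tokens_per_sec(output_tokens: int, wall_seconds: float) -> float:
    """End-to-end rate: includes time to first token, not just decode speed."""
    return output_tokens / wall_seconds

single = tokens_per_sec(500, 9.783)    # from the `time curl` run above
per_request_parallel = 47.0            # per-request rate with 4 parallel requests
aggregate = 4 * per_request_parallel   # total tokens/sec across the batch
drop = 1 - per_request_parallel / round(single)

print(f"single request:     {single:.1f} tok/s")
print(f"4 parallel (total): {aggregate:.0f} tok/s")
print(f"per-request drop:   {drop:.0%}")
```

The aggregate number is the one that matters for batch workloads: each request gets slightly slower, but the GPU produces several times more tokens per second overall.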

Prefix cache hit rate from metrics: 256/1557 tokens = ~16.4%. This means repeated system prompts or common conversation prefixes are being served from cache, reducing compute. In production with consistent system prompts the hit rate will be significantly higher.

The 28,000 token context is enough for:

Analyzing large log files or error traces without chunking
Reviewing pull requests with extensive diffs and context
Processing long technical documents or reports in a single pass
Multi-turn agent conversations with rich tool call history
Code analysis across entire modules or services

Expose the API

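Inside the cluster, the ClusterIP Service is already reachable at http://gemma4-26b.gemma4.svc.cluster.local (port 80). To expose the model over HTTP under a hostname, you need an Ingress controller. The most common options in 2026 are: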

NGINX Ingress Controller - the community standard, works on any Kubernetes cluster including GKE
GKE Gateway API - the newer GKE-native approach using HTTPRoute resources
Kong Gateway - popular choice if you need API key auth, rate limiting, or routing logic on top

The example below uses NGINX Ingress, which you can install with:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace

Add an Ingress to route a hostname to the model's Service:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gemma4-ingress
  namespace: gemma4
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
spec:
  ingressClassName: nginx
  rules:
    - host: gemma4.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gemma4-26b
                port:
                  number: 80

Test it:

curl -s http://gemma4.yourdomain.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "Hello, what can you do?"}],
    "max_tokens": 200
  }' | python3 -m json.tool

The API is OpenAI-compatible: vLLM serves the standard /v1/chat/completions endpoint, so you can point most OpenAI clients at it by changing the base_url to http://gemma4.yourdomain.com/v1.
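As an illustration, here is a minimal client sketch that uses only the Python standard library. The host and model name are the ones used in this article; the function names are hypothetical, and the response parsing assumes the usual OpenAI chat completion shape (choices[0].message.content).

```python
import json
import urllib.request

BASE_URL = "http://gemma4.yourdomain.com/v1"  # Ingress host from this article
MODEL = "gemma-4-26b-a4b"                     # value of --served-model-name

def build_chat_request(prompt: str, max_tokens: int = 200) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str, timeout: float = 600.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling chat("Hello, what can you do?") from a machine that can resolve the Ingress host should return the reply text. The 600 second timeout mirrors the proxy timeouts set on the Ingress.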

Honest Assessment: What This Model Is Good For

It's worth being upfront: this is a quantized community model, not the latest frontier API. AWQ 4-bit compression introduces some quality degradation compared to the full BF16 weights, and 28,000 tokens is modest compared to the 1M+ context of some cloud models.

That said, in practice it handles a wide range of tasks well - log analysis, PR reviews, writing automation agents, summarizing technical documents, simple code generation, and internal tooling. For these workloads the quality is more than good enough, and the tradeoff is clear.

The bigger advantage is data privacy. With a self-hosted model, your data never leaves your infrastructure. With cloud APIs, prompt retention and use for model improvement depend on the provider, plan, and settings, so you have to check and trust their terms. If you're processing internal systems data, customer information, or proprietary business logic - that's a meaningful difference. Your data stays yours.

What We Tried That Didn't Work

I think it's useful to share the approaches we tested before landing on this setup.

Regional Persistent Disks - We tried using a regional PD (replicated across two zones) to avoid zonal lock-in. GCP doesn't support regional PDs on G2 instances (the GPU series). This is a hard limitation, not a configuration issue.

GCS FUSE - Mounting the model weights directly from a GCS bucket sounds elegant. In practice, vLLM reads all 16GB of weights sequentially at startup, and in our tests GCS FUSE did not prefetch for this pattern. After 4+ minutes on the first shard with no progress, we abandoned this approach. Copying the weights onto the PVC once (an init container running gsutil, or the Hugging Face download used in this article) is faster and simpler.

Cost Breakdown

Running this setup full-time on on-demand pricing:

Resource	Cost/month
g2-standard-4 on-demand	~$500
PVC 30GB standard-rwo	~$3
GCS model storage 17GB	~$0.40
Total	~$503/month

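A self-hosted model runs 24/7. At 45 tokens/sec on a single request stream, that's:

45 tokens/sec × 3600 sec/hr × 24 hr × 30 days = ~116.6 million output tokens/month theoretical maximum

With 4 parallel requests at ~47 tokens/sec each (about 188 tokens/sec in aggregate, from the Performance Results table), the ceiling rises to ~487 million tokens/month. At 20% average utilization of a single stream (which is realistic for internal tooling), you're looking at ~23 million tokens/month for a flat $503, or roughly $21.60 per 1M output tokens.

For comparison, as of May 2026 OpenAI's current models are priced at:

Model	Output tokens (per 1M)
GPT-5.4 (flagship)	$15.00
GPT-4.1	$8.00
GPT-4o (legacy)	$10.00
GPT-4o-mini	$0.60

At GPT-5.4 rates, 23 million tokens would cost about $350/month, and at GPT-4o-mini's $0.60/M about $14/month. So at low, single-stream utilization the API is cheaper on raw token price. The break-even against GPT-5.4 is $503 / $15 per 1M = ~33.5 million output tokens/month, which is about 29% utilization of a single stream, or about 7% of the 4-parallel ceiling. Against GPT-4o-mini the break-even (~838 million tokens/month) is above the throughput we measured, so there the case for self-hosting rests on privacy and control rather than price.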
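The capacity and break-even arithmetic in this section can be reproduced with a few lines of Python. The 45 tokens/sec rate, the $503 monthly cost, and the per-1M prices come from the text above; the function names are only illustrative, and input tokens are ignored for simplicity.

```python
SECONDS_PER_MONTH = 3600 * 24 * 30   # 30-day month, as in the text
MONTHLY_COST_USD = 503               # total from the cost table

def monthly_tokens(tokens_per_sec: float, utilization: float = 1.0) -> float:
    """Output tokens generated in a month at a given average utilization."""
    return tokens_per_sec * SECONDS_PER_MONTH * utilization

def usd_per_million(tokens: float, cost: float = MONTHLY_COST_USD) -> float:
    """Effective self-hosted price per 1M output tokens."""
    return cost / (tokens / 1e6)

def break_even_tokens(api_usd_per_million: float, cost: float = MONTHLY_COST_USD) -> float:
    """Monthly output tokens at which the API bill equals the flat cost."""
    return cost / api_usd_per_million * 1e6

ceiling = monthly_tokens(45)         # single request stream, 100% busy
typical = monthly_tokens(45, 0.20)   # 20% average utilization

print(f"ceiling:    {ceiling / 1e6:.1f}M tokens/month")
print(f"at 20%:     {typical / 1e6:.1f}M tokens/month, ${usd_per_million(typical):.2f} per 1M")
print(f"break-even: {break_even_tokens(15.00) / 1e6:.1f}M tokens/month vs $15.00 per 1M")
```

Swapping in your own measured throughput and expected utilization is the quickest way to see on which side of the break-even your workload falls.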

In Part 2 we'll show how to cut the compute cost by 70% using spot instances with a zone-aware architecture that automatically recovers from preemptions.

Afterword

This setup is running in production and being used across the team for various tasks - log analysis, automated code reviews, agent pipelines, and document processing. After several weeks of use it's proven to be stable and reliable. The main takeaway is that running a capable 26B model on a single L4 GPU is very achievable in 2026 - the tooling has matured significantly. The tricky parts are around quantization format compatibility and squeezing out the last few gigabytes of VRAM for context length.

If you have questions or want to share your experience with similar setups, feel free to reach out. The next article in this series will cover multi-zone spot deployments with automatic failover. Stay tuned.

DEV Community