DEV Community: Oleksii Nizhegolenko

Running OpenAI's gpt-oss-20b with 128k Context on a Single L4 GPU

Oleksii Nizhegolenko — Tue, 19 May 2026 08:47:44 +0000

Alexey Nizhegolenko

DevOps Engineer, AgentOps Engineer, AI Infrastructure Engineer

This is the second article in my series on self-hosting LLMs on GKE. In the first article I covered deploying Gemma4 26B with a 28,000 token context window. This time I'll show you something more impressive: openai/gpt-oss-20b running with a 128,000 token context on the same single L4 GPU.

The setup has been running in production since November 2025, for about 6 months, with no major incidents. That's the kind of track record worth writing about.

Why gpt-oss-20b?

OpenAI released gpt-oss-20b in August 2025 as their first open-weight model since GPT-2. It's a 21B-parameter Mixture-of-Experts model (~3.6B active parameters per token) with mxfp4 quantization built-in: the MoE weights ship already compressed in the microscaling FP4 format (4-bit values with a shared scale per small block), so you get a roughly 4-bit footprint without running a separate quantization step such as AWQ or GPTQ.

Three things make it stand out:

128k context window - this is the main reason to pick this model over alternatives. Most quantized models on a 24GB L4 GPU are limited to 20-64k tokens. gpt-oss-20b achieves 128k through the combination of compact mxfp4 weights (~13GB on disk), an fp8 KV cache, and an attention layout that keeps the per-token KV cost low.

Built-in reasoning - the model uses chain-of-thought reasoning internally. In API responses, you'll see a reasoning_content field with the model's thought process before the final answer. This is useful for complex analytical tasks where you want to understand how the model reached its conclusion.

OpenAI tool calling format - natively compatible with --tool-call-parser openai, which means it drops in as a replacement for OpenAI API clients without any prompt engineering changes.

Hardware and Cost

Same hardware as Part 1 - g2-standard-4 with one NVIDIA L4 GPU (24GB VRAM, 4 vCPU, 16GB RAM).

Instance type	On-demand price	Spot price
g2-standard-4 (1x L4)	~$0.70/hr	~$0.21/hr

This article uses a standard on-demand node pool to keep the setup simple and predictable. The spot-based, cost-optimised variant and the zone-aware failover architecture that makes spot safe to run in production are the subject of Part 3.

How 128k Context Fits on 24GB VRAM

The real answer is simpler than you might expect - it fits entirely in GPU RAM. From the actual startup logs:

Model weights (mxfp4):   13.72 GiB
KV cache (fp8):           4.17 GiB  →  182,336 tokens available
CUDA graphs:              0.60 GiB
Total:                   ~18.5 GiB out of 24 GiB (0.85 utilisation)

GPU KV cache size: 182,336 tokens
Maximum concurrency for 128,000 tokens per request: 2.68x

182k tokens of KV cache covers 128k context with room for more than two simultaneous requests. No CPU offloading needed.

Why does --swap-space 6 exist in the config then? It's a safety net: if the KV cache ever overflows under unusual load patterns, vLLM can spill to CPU RAM instead of dropping requests. In practice, it hasn't been used in 6 months of production. The fp8 KV cache combined with mxfp4 weights is efficient enough that everything fits comfortably on the GPU.

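A big part of why this works at 128k is the compact weight footprint. mxfp4 stores the weights in a microscaling FP4 format, which is a 4-bit format in the same class as AWQ INT4, so the saving comes from the smaller model rather than from the format alone: 13.72 GiB of weights here versus 15.55 GiB for the Gemma4 26B AWQ build from Part 1. That is roughly 2GB of extra VRAM, and the headroom goes directly into the KV cache budget. The rest of the explanation is the per-token KV cost, which the comparison with Gemma4 later in this article breaks down.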

Requirements

GKE cluster (Standard mode); the examples use us-central1
kubectl configured
Google Artifact Registry for Docker images

Step 1: Create the GPU Node Pool

A standard on-demand node pool, single zone, scale-to-zero, one L4 GPU at peak:

gcloud container node-pools create l4-gptoss \
  --cluster=YOUR_CLUSTER_NAME \
  --zone=us-central1-a \
  --machine-type=g2-standard-4 \
  --accelerator=type=nvidia-l4,count=1,gpu-driver-version=latest \
  --num-nodes=0 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=1 \
  --node-labels=service=gpt-oss-20b \
  --node-taints=nvidia.com/gpu=present:NoSchedule \
  --scopes=cloud-platform

The --node-labels=service=gpt-oss-20b label is what the StatefulSet's nodeSelector targets, and the nvidia.com/gpu taint keeps non-GPU workloads off this pool.

Step 2: Prepare the vLLM Image

This deployment uses vLLM v0.12.0, which includes gpt-oss and mxfp4 support. Pull the image and push it to your Artifact Registry:

docker pull vllm/vllm-openai:v0.12.0

docker tag vllm/vllm-openai:v0.12.0 \
  us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:v0.12.0

docker push us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:v0.12.0

Step 3: Create Namespace and Secrets

openai/gpt-oss-20b is a public model, no HuggingFace token required. You only need an API key to protect your vLLM endpoint:

kubectl create namespace gptoss-multi

# API key for protecting the vLLM endpoint
kubectl create secret generic vllm-api-multi \
  --from-literal=VLLM_API_KEY=your-api-key-here \
  -n gptoss-multi
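The your-api-key-here placeholder above needs a real value. A minimal sketch for generating one, assuming any sufficiently long random string is acceptable as the key (the function name generate_api_key is illustrative, not part of vLLM):

```python
import secrets


def generate_api_key(num_bytes: int = 32) -> str:
    """Return a URL-safe random token to use as the bearer key."""
    # token_urlsafe draws from the OS CSPRNG and base64url-encodes the bytes.
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    print(generate_api_key())
```

Paste the printed value into the --from-literal=VLLM_API_KEY=... argument instead of the placeholder.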

Step 4: Deploy gpt-oss-20b

Here's the complete StatefulSet manifest. Scheduling is intentionally minimal - a simple nodeSelector targeting the service: gpt-oss-20b label, plus a toleration for the GPU taint. No node affinity rules; the zone-aware scheduling logic comes in Part 3.

apiVersion: v1
kind: Service
metadata:
  name: gptoss-multi
  namespace: gptoss-multi
  labels:
    app: gptoss-20b
spec:
  type: ClusterIP
  selector:
    app: gptoss-20b
  ports:
    - name: http
      port: 80
      targetPort: 8000
      protocol: TCP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gptoss-20b
  namespace: gptoss-multi
  labels:
    app: gptoss-20b
spec:
  serviceName: gptoss-multi
  replicas: 1
  selector:
    matchLabels:
      app: gptoss-20b
  updateStrategy:
    type: RollingUpdate
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete
    whenScaled: Retain
  template:
    metadata:
      labels:
        app: gptoss-20b
    spec:
      terminationGracePeriodSeconds: 30
      nodeSelector:
        service: gpt-oss-20b
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:v0.12.0
          imagePullPolicy: IfNotPresent
          args:
            - --model
            - openai/gpt-oss-20b
            - --api-key
            - $(VLLM_API_KEY)
            - --gpu-memory-utilization
            - "0.85"
            - --max-model-len
            - "128000"
            - --swap-space
            - "6"
            - --tensor-parallel-size
            - "1"
            - --max-num-seqs
            - "3"
            - --max-num-partial-prefills
            - "1"
            - --max-num-batched-tokens
            - "8128"
            - --kv-cache-dtype
            - fp8
            - --enable-auto-tool-choice
            - --tool-call-parser
            - openai
            - --host
            - 0.0.0.0
            - --port
            - "8000"
          env:
            - name: HF_HOME
              value: /models
            - name: XDG_CACHE_HOME
              value: /models/.xdg-cache
            - name: TRITON_CACHE_DIR
              value: /models/.triton
            - name: VLLM_API_KEY
              valueFrom:
                secretKeyRef:
                  key: VLLM_API_KEY
                  name: vllm-api-multi
            - name: VLLM_LOGGING_LEVEL
              value: INFO
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
            - name: LD_LIBRARY_PATH
              value: /home/kubernetes/bin/nvidia/lib64:/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:/usr/lib:/lib
            - name: TORCH_CUDA_ARCH_LIST
              value: "8.9"
          ports:
            - name: http
              containerPort: 8000
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            initialDelaySeconds: 60
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 12
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
              nvidia.com/gpu: "1"
            limits:
              cpu: "3500m"
              memory: 12Gi
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-cache
              mountPath: /models
            - name: dshm
              mountPath: /dev/shm
            - name: nvidia-lib64
              mountPath: /home/kubernetes/bin/nvidia/lib64
              readOnly: true
            - name: nvidia-bin
              mountPath: /home/kubernetes/bin/nvidia/bin
              readOnly: true
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 6Gi
        - name: nvidia-lib64
          hostPath:
            path: /home/kubernetes/bin/nvidia/lib64
            type: Directory
        - name: nvidia-bin
          hostPath:
            path: /home/kubernetes/bin/nvidia/bin
            type: Directory
  volumeClaimTemplates:
    - apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: model-cache
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: standard-rwo
        resources:
          requests:
            storage: 60Gi

Apply it:

kubectl apply -f gptoss-20b.yaml
kubectl logs -f statefulset/gptoss-20b -n gptoss-multi

Here's what a healthy startup looks like:

# Architecture confirmed - custom GptOss model class
INFO [model.py] Resolved architecture: GptOssForCausalLM
INFO [model.py] Using max model len 128000

# mxfp4 quantization confirmed, Marlin kernel selected
INFO [mxfp4.py] Using Marlin backend
WARNING: Your GPU does not have native support for FP4 computation.
         Weight-only FP4 compression will be used via Marlin kernel.

# Weights loaded - note 13.72 GiB vs 15.55 GiB for Gemma4
INFO [default_loader.py] Loading weights took 73.54 seconds
INFO [gpu_model_runner.py] Model loading took 13.7193 GiB memory and 104.729963 seconds

# torch.compile from cache - 13 seconds instead of ~90
INFO [monitor.py] torch.compile takes 13.67 s in total

# KV cache - this is the key number
INFO [gpu_worker.py] Available KV cache memory: 4.17 GiB
INFO [kv_cache_utils.py] GPU KV cache size: 182,336 tokens
INFO [kv_cache_utils.py] Maximum concurrency for 128,000 tokens per request: 2.68x

# Server ready
INFO: Application startup complete.

The WARNING about FP4 support is expected and not a problem. The L4 is an Ada Lovelace GPU (compute capability 8.9, sm_89), while native FP4 compute requires Blackwell (sm_100 and newer). The Marlin kernel handles this transparently with no quality impact.
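If you want to check the KV cache numbers automatically after each rollout, the two log lines can be pulled out with a regular expression. This is a small sketch that assumes the log wording shown above stays the same; the function name parse_kv_report is hypothetical.

```python
import re

# Patterns follow the startup log lines quoted in this article.
KV_SIZE_RE = re.compile(r"GPU KV cache size:\s*([\d,]+)\s*tokens")
CONCURRENCY_RE = re.compile(
    r"Maximum concurrency for ([\d,]+) tokens per request:\s*([\d.]+)x"
)


def parse_kv_report(log_text: str) -> dict:
    """Extract KV cache capacity and concurrency from startup log text."""
    report = {}
    size = KV_SIZE_RE.search(log_text)
    if size:
        report["kv_tokens"] = int(size.group(1).replace(",", ""))
    conc = CONCURRENCY_RE.search(log_text)
    if conc:
        report["request_tokens"] = int(conc.group(1).replace(",", ""))
        report["max_concurrency"] = float(conc.group(2))
    return report


sample = """INFO [gpu_worker.py] Available KV cache memory: 4.17 GiB
INFO [kv_cache_utils.py] GPU KV cache size: 182,336 tokens
INFO [kv_cache_utils.py] Maximum concurrency for 128,000 tokens per request: 2.68x
"""
print(parse_kv_report(sample))
```

Feed it the output of kubectl logs and alert if kv_tokens drops below the context length you rely on.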

Key Configuration Decisions Explained

Why --gpu-memory-utilization 0.85 instead of 0.96-0.97?

We need to leave headroom for the CPU swap mechanism. When the KV cache overflows from GPU to CPU RAM, vLLM needs free GPU memory for the swap buffers. Using 0.97 here will cause OOM under load with long contexts. 0.85 is the stable value we've validated over 6 months.

Why --max-num-seqs 3?

With 128k context, each sequence can occupy a huge amount of KV cache. Allowing too many parallel sequences risks exhausting both GPU and CPU swap memory simultaneously. Three concurrent sequences is the conservative limit that keeps the deployment stable under real-world load.

Why --max-num-batched-tokens 8128?

This limits how many tokens get processed per engine step. With long-context requests, an uncapped value here can cause prefill spikes that OOM the GPU. 8128 gives a good balance between throughput and stability.

Why --max-num-partial-prefills 1?

For very long prompts, vLLM splits prefill across multiple steps (chunked prefill). Setting this to 1 means only one request can be in a partially prefilled state at a time, which keeps memory usage predictable during long-context ingestion.

Why 60Gi PVC instead of 30Gi?

The model weights are ~13GB but the torch.compile cache, XDG cache, and Triton cache for a 128k context model are significantly larger than for a 28k model. 60Gi gives comfortable headroom.

Expose the API

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gptoss-ingress
  namespace: gptoss-multi
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
spec:
  ingressClassName: nginx
  rules:
    - host: gptoss.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gptoss-multi
                port:
                  number: 80

Note the proxy-read-timeout of 600 seconds: with a 128k context, requests can spend a long time in prefill, and the default nginx timeout of 60 seconds would kill long-context requests mid-generation.

Test it:

curl -s http://gptoss.yourdomain.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Explain how Kubernetes scheduling works."}],
    "max_tokens": 500
  }'

Performance Results

All numbers are measured from our production instance.

Test 1 - Short context (94 prompt tokens, 500 output):

time curl -s http://gptoss.yourdomain.com/v1/chat/completions \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-20b",
    "messages": [{"role": "user", "content": "Write a detailed technical explanation of how Kubernetes scheduling works..."}],
    "max_tokens": 500
  }'

# real 0m9.505s  →  500 tokens / 9.5s = ~52 tokens/sec

Test 2 - Long context (8,076 prompt tokens, 200 output):

# ~8k tokens of context
python3 -c "print('word ' * 8000)" | xargs -I{} curl -s \
  http://gptoss.yourdomain.com/v1/chat/completions \
  -H "Authorization: Bearer your-api-key" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"openai/gpt-oss-20b\", \"messages\": [{\"role\": \"user\", \"content\": \"{} Summarize the above.\"}], \"max_tokens\": 200}"

# real 0m6.113s  →  ~53 tokens/sec generation, TTFT ~1.47 sec
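Passing a ~40 KB prompt through xargs and nested shell quoting is fragile on some platforms. As an alternative sketch, the same request body can be built in Python and serialized with json.dumps, which handles the escaping. The function name build_long_context_payload is hypothetical; the model name and prompt text mirror the test above.

```python
import json


def build_long_context_payload(n_words: int = 8000, max_tokens: int = 200) -> bytes:
    """Build the JSON body for the long-context test as UTF-8 bytes."""
    filler = "word " * n_words  # same filler text as the shell one-liner
    body = {
        "model": "openai/gpt-oss-20b",
        "messages": [
            {"role": "user", "content": filler + "Summarize the above."}
        ],
        "max_tokens": max_tokens,
    }
    return json.dumps(body).encode("utf-8")


payload = build_long_context_payload()
print(len(payload), "bytes")
```

Write the bytes to a file such as payload.json and send it with curl -d @payload.json, keeping the same headers as in the test above.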

Test 3 - 3 parallel requests:

for i in {1..3}; do
  curl -s http://gptoss.yourdomain.com/v1/chat/completions \
    -H "Authorization: Bearer your-api-key" \
    -H "Content-Type: application/json" \
    -d '{"model": "openai/gpt-oss-20b", "messages": [{"role": "user", "content": "Explain distributed systems consistency models in detail."}], "max_tokens": 500}' &
done
wait
# real 0m16.1s

Results summary:

Metric	Short context (94 tok)	Long context (8k tok)	3 parallel
Throughput	~52 tok/s	~53 tok/s	~32 tok/s per request
TTFT	237ms	~1.47 sec	~410ms
Prompt tokens	94	8,076	77
KV cache usage	<1%	<1%	<1%

The most important finding here: generation throughput stays flat across these prompt lengths. 52 tok/s with 94 tokens vs 53 tok/s with 8,076 tokens. The only cost of the longer context is TTFT - prefilling 8k tokens takes ~1.47 seconds vs 237ms for a short prompt, which is expected since prefill work grows with prompt length.
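For reference, here is how the single-request and parallel figures relate, worked from the wall-clock timings of Test 1 and Test 3. This is a rough sketch that assumes every request generated its full 500 tokens; because it divides by total wall time (TTFT included), the results differ slightly from decode-only rates.

```python
def tokens_per_second(output_tokens: int, wall_seconds: float) -> float:
    """Average output rate over the whole request, TTFT included."""
    return output_tokens / wall_seconds


single = tokens_per_second(500, 9.505)                # Test 1
per_request_parallel = tokens_per_second(500, 16.1)   # Test 3, one stream
aggregate_parallel = tokens_per_second(3 * 500, 16.1) # Test 3, all streams

print(f"single:      {single:.1f} tok/s")
print(f"per request: {per_request_parallel:.1f} tok/s")
print(f"aggregate:   {aggregate_parallel:.1f} tok/s")
```

Each parallel stream is slower than a lone request, but the three together deliver well above the single-request rate, which is the point of batching.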

The Reasoning Feature

One thing worth calling out: the model exposes its internal reasoning process. Every response includes a reasoning_content field:

{
  "reasoning_content": "We need to explain distributed systems consistency models in detail. 
  Likely include eventual consistency, strong consistency, linearizability, sequential 
  consistency, causal consistency... We'll structure: 1. Intro. 2. CAP theorem. 
  3. ACID vs BASE...",
  "content": null
}

This is the model's chain-of-thought before generating the final answer. For analytical tasks, debugging agent failures, or building explainable AI pipelines, this is genuinely useful - you can see exactly how the model reasoned through a problem.

Note that content is null in the response above - the reasoning model separates thinking from output. Your client needs to handle both fields.
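A minimal sketch of handling both fields on the client side. It assumes the message object sits at choices[0].message, as in OpenAI-compatible chat completion responses, and that either field may be missing or null; the helper name split_reasoning is hypothetical.

```python
def split_reasoning(message: dict) -> tuple[str, str]:
    """Return (reasoning, answer), treating missing or null fields as empty."""
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning, answer


# Example shaped like the response fragment above.
message = {
    "reasoning_content": "We need to explain distributed systems consistency models...",
    "content": None,
}
reasoning, answer = split_reasoning(message)
if not answer:
    # No final answer came back; the caller decides whether to retry
    # or to surface the reasoning text instead.
    print("no final answer, reasoning length:", len(reasoning))
```

Normalizing null to an empty string keeps downstream string handling from failing when only one of the two fields is populated.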

Cost Breakdown

Resource	Cost/month
g2-standard-4 on-demand	~$500
PVC 60GB standard-rwo	~$6
Total	~$506/month

At 52 tok/s running 24/7:

52 tokens/sec × 3600 × 24 × 30 = ~134 million tokens/month theoretical

At 20% average utilization: ~27 million tokens/month for $506.

(Part 3 cuts this roughly 3x by moving to spot instances, with a failover architecture that makes spot safe to run.)
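The same arithmetic as a short sketch, extended to a cost per million output tokens. It uses the single-stream 52 tok/s figure and the $506 monthly total from the table; the 20% utilization is the article's assumption, and the function names are illustrative. Multiplying out gives about 134.8 million tokens per month at full load.

```python
SECONDS_PER_MONTH = 3600 * 24 * 30


def monthly_tokens(tok_per_sec: float, utilization: float = 1.0) -> float:
    """Tokens generated in a 30-day month at the given average utilization."""
    return tok_per_sec * SECONDS_PER_MONTH * utilization


def cost_per_million(monthly_cost_usd: float, tokens: float) -> float:
    """USD per one million generated tokens."""
    return monthly_cost_usd / (tokens / 1_000_000)


theoretical = monthly_tokens(52)
realistic = monthly_tokens(52, 0.20)

print(f"theoretical: {theoretical / 1e6:.1f}M tokens/month")
print(f"at 20%:      {realistic / 1e6:.1f}M tokens/month")
print(f"cost at 20%: ${cost_per_million(506, realistic):.2f} per 1M tokens")
```

Batching several requests raises the aggregate rate, so the real cost per token under concurrent load would be lower than this single-stream estimate.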

Compared to the Gemma4 Article

If you read Part 1, here's how the two models compare side by side:

	gpt-oss-20b	Gemma 4 26B AWQ
Context window	128,000 tokens	28,000 tokens
Throughput	~52 tok/s	~51 tok/s
TTFT (short)	237ms	84ms
Weights size	~13GB (mxfp4)	~16GB (AWQ int4)
VRAM for weights	13.72 GiB	15.55 GiB
KV cache pool (GPU)	4.17 GiB	3.12 GiB
KV cost per token	~24.5 KB (GQA, fp8)	~112.7 KB (global head_dim=512)
Max tokens in KV	182,336	29,709
GPU util setting	0.85	0.97
Reasoning	✅ built-in	❌
Tool calling	openai format	gemma4 format
License	Apache 2.0	Apache 2.0
HuggingFace access	public	public

Why such a huge difference in context despite similar KV cache size?

This is the most interesting technical finding. The answer lies in how each model's attention is shaped.

From the vLLM startup logs:

gpt-oss-20b: 4.17 GiB for 182,336 tokens → ~24.5 KB per token
Gemma4 26B: 3.12 GiB for 29,709 tokens → ~112.7 KB per token

The standard formula for KV cache memory per token is:

Bytes per token = 2 × layers × KV_heads × head_dim × bytes_per_element

gpt-oss-20b is a 24-layer model with 8 KV heads (GQA — 64 query heads grouped onto just 8 KV heads) and a head_dim of 64. With fp8 KV cache (1 byte per element):

2 × 24 × 8 × 64 × 1 = 24,576 bytes (24 KiB, or ~24.6 KB) per token

That matches the ~24.5 KB/token we see in the logs almost exactly - and 4.17 GiB ÷ 24,576 bytes ≈ 182,190 tokens, which agrees with the 182,336-token KV pool vLLM reports once you allow for the rounding in the 4.17 GiB figure. So there is no mystery in the per-token number and no hidden reduction happening: aggressive GQA (8 KV heads instead of 64), a small head_dim of 64, and fp8 KV cache are what make each token cheap. vLLM computes the headline token capacity using this uniform per-token cost.
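The per-token formula is easy to check in a few lines. This sketch plugs in the gpt-oss-20b values quoted above (24 layers, 8 KV heads, head_dim 64, 1 byte per fp8 element) and divides the reported 4.17 GiB pool by the result; the function name is illustrative.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    """Uniform KV cost per token: keys and values (factor 2) for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element


per_token = kv_bytes_per_token(layers=24, kv_heads=8, head_dim=64,
                               bytes_per_element=1)
capacity = int(4.17 * GIB / per_token)

print(per_token, "bytes per token")
print(capacity, "tokens fit in 4.17 GiB")
```

The computed capacity lands a little under the 182,336 tokens in the log because 4.17 GiB is itself a rounded value.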

So, where does Sliding Window Attention (SWA) actually help? Not in the per-token headline - in concurrency. gpt-oss-20b alternates layer types: roughly half of its 24 layers use full global attention, the other half use a tight 128-token sliding window. For long-context requests, the sliding-window layers do not grow with prompt length - they stay bounded at 128 tokens whether the prompt is 4k or 128k. So a real 128k request only pays the full per-token price on about half its layers; the sliding-window half is effectively free at length.

This is exactly what the log line reports:

INFO [kv_cache_utils.py] Maximum concurrency for 128,000 tokens per request: 2.68x

A naive reading would expect 182,336 ÷ 128,000 ≈ 1.42x concurrency for 128k requests. vLLM reports 2.68x, nearly double, because its memory manager understands the hybrid SWA structure and knows a 128k sequence costs roughly half the uniform estimate (only the ~12 full-attention layers accumulate full-length KV; the ~12 sliding-window layers plateau at 128 tokens). That ~1.9x uplift over the naive ratio is the SWA payoff - it buys concurrency headroom, not a cheaper headline per-token figure.
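An idealized back-of-the-envelope version of that reasoning, assuming exactly 12 full-attention layers and 12 sliding-window layers capped at 128 tokens. This is not vLLM's actual accounting, so it lands somewhat above the 2.68x the server reports; it only shows where the near-doubling comes from.

```python
def hybrid_kv_bytes(seq_len: int, full_layers: int = 12, window_layers: int = 12,
                    window: int = 128, kv_heads: int = 8, head_dim: int = 64,
                    bytes_per_element: int = 1) -> int:
    """KV bytes for one sequence when some layers only keep a sliding window."""
    per_layer_token = 2 * kv_heads * head_dim * bytes_per_element
    full = full_layers * seq_len * per_layer_token
    windowed = window_layers * min(seq_len, window) * per_layer_token
    return full + windowed


UNIFORM_BYTES_PER_TOKEN = 24_576
pool_bytes = 182_336 * UNIFORM_BYTES_PER_TOKEN

uniform = pool_bytes / (128_000 * UNIFORM_BYTES_PER_TOKEN)
hybrid = pool_bytes / hybrid_kv_bytes(128_000)

print(f"uniform estimate: {uniform:.2f}x")
print(f"hybrid estimate:  {hybrid:.2f}x")
```

The hybrid estimate comes out close to twice the uniform one, in line with the uplift visible in the log line.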

In contrast, Gemma4 26B uses a heavy heterogeneous attention architecture: most layers are local sliding-window layers at head_dim=256, but a few global attention layers use a much larger head_dim=512 (8 query heads grouped onto 4 KV heads via GQA). It's those wide head_dim=512 global layers that dominate the KV budget. The startup logs flag the split explicitly:

INFO [config.py] Gemma4 model has heterogeneous head dimensions
     (head_dim=256, global_head_dim=512).
     Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.

Gemma4's global attention layers with a massive head_dim=512 cost dramatically more per token, pushing its combined average overhead to ~112.7 KB per token - roughly 4.6x heavier than gpt-oss-20b's ~24.5 KB.

This explains the gap:

gpt-oss-20b: 4.17 GiB ÷ ~24.5 KB/token ≈ 182k tokens of headline KV pool. Each token is cheap thanks to aggressive GQA + small head_dim + fp8, and effective concurrency at 128k reaches ~2.68x because the sliding-window layers don't grow with context
Gemma4: 3.12 GiB ÷ ~112.7 KB/token ≈ 29k tokens - heavy global attention dimensions for maximum recall accuracy at the cost of density

Choose gpt-oss-20b when you need long context on budget hardware or an OpenAI-compatible tool-calling drop-in. Choose Gemma 4 when you need lower TTFT or native vision input.

Honest Assessment

Same caveat as in Part 1 - this is a quantized model, not the full cloud API. mxfp4 is a 4-bit format, in the same class as AWQ int4, and 4-bit weights can affect quality on tasks requiring precise numerical reasoning or very long coherent outputs.

In practice we haven't noticed quality issues for the use cases we run: document analysis, structured data extraction, automation agents, and code review. For these tasks, the model performs well and the 128k context is genuinely useful - you can feed entire codebases or long documents without chunking.

Data privacy remains the core advantage. Everything runs inside your VPC.

6 Months of Production Data

The setup has been running since November 2025. A few things we learned over time:

Pod restarts are fast and self-healing - when the node is recycled (GKE node auto-upgrade, maintenance, or a manual node pool operation), the StatefulSet pod is rescheduled, picks up the PVC with cached weights and torch.compile artifacts, and is back up in ~3 minutes. No data loss, no manual intervention. (Surviving spot preemption - and eliminating even that ~3-minute gap with a multi-zone replica architecture - is exactly what Part 3 covers.)

Memory is stable - no OOM events in 6 months with the --max-num-seqs 3 limit. Occasional long-context requests are served without instability, and the CPU swap space has stayed an unused safety net.

The model is consistent - response quality and latency have been stable. No drift or degradation observed.

Afterword

Running a model with 128k context on a $0.70/hr instance felt ambitious when we started. Six months later, it's just infrastructure that runs. The key insight is that mxfp4 quantization combined with aggressive GQA and Sliding Window Attention is what makes 128k context genuinely feasible on 24GB VRAM - not a hack, but an architectural decision that vLLM understands and optimizes for natively.

This deployment is deliberately simple - a single replica on a standard on-demand node, with minimal scheduling. The next article in this series builds directly on it: a zone-aware multi-node spot setup with a K8S controller and automatic failover that guarantees at least one replica is always serving - the architecture that makes this both cheap (~3x cost reduction) and truly resilient in production.

If you have questions or feedback, feel free to reach out.

Running Gemma 4 26B with 256k Context on GKE with a Single L4 GPU

Oleksii Nizhegolenko — Mon, 18 May 2026 07:01:46 +0000

Alexey Nizhegolenko DevOps Engineer, AI Infrastructure Engineer

When you start looking at self-hosting large language models in 2026, the options can feel overwhelming. You have dozens of models, quantization formats, inference engines, and cloud configurations to choose from. In this article, I'll show you a straightforward, step-by-step way to deploy Gemma4 26B-A4B on Google Kubernetes Engine using a single NVIDIA L4 GPU - and push the context window all the way to 256,000 tokens. Real numbers, real mistakes, no marketing fluff.

This is the first article in a series. Here we focus on getting a stable, production-ready deployment on a standard (non-spot) L4 instance. In the next article, we'll look at how to build a resilient multi-zone setup with spot instances to cut costs significantly.

Why Gemma4 26B-A4B?

Gemma4 is Google DeepMind's latest open-weight model family released in April 2026 under Apache 2.0. The 26B-A4B variant is a Mixture-of-Experts (MoE) architecture - 26 billion total parameters but only ~4 billion active per token. This means it punches well above its weight in terms of quality while keeping inference costs reasonable. The architecture also supports up to 1M token context natively, which makes it an interesting target for squeezing out maximum context on constrained hardware.

Hardware and Cost

An NVIDIA L4 GPU comes with 24GB VRAM. On GCP, the smallest instance with an L4 is g2-standard-4 (4 vCPU, 16GB RAM).

Instance type	On-demand price	Spot price
g2-standard-4 (1x L4)	~$0.70/hr	~$0.21/hr

For this article, we use the on-demand instance for stability. The spot option and how to handle preemptions will be covered in Part 2.

Requirements

Before starting, you need:

A GKE cluster (Standard mode, not Autopilot) in any region with GPU available
kubectl configured to connect to your cluster
Google Artifact Registry repository for your Docker images
Basic familiarity with Kubernetes StatefulSets and PersistentVolumes

Step 1: Create the GPU Node Pool

Create a dedicated node pool for your LLM workload. We use g2-standard-4 with one L4 GPU:

gcloud container node-pools create l4-llm \
  --cluster=YOUR_CLUSTER_NAME \
  --zone=us-central1-b \
  --machine-type=g2-standard-4 \
  --accelerator=type=nvidia-l4,count=1,gpu-driver-version=latest \
  --num-nodes=0 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=1 \
  --node-labels=service=l4-llm \
  --node-taints=nvidia.com/gpu=present:NoSchedule \
  --scopes=cloud-platform

A few things worth noting here. We start with --num-nodes=0 and enable autoscaling - the node will spin up automatically when we deploy our StatefulSet. The --node-taints ensures only GPU workloads land on this expensive node pool. The --scopes=cloud-platform is important: without it, the node can't pull images from Artifact Registry.

Step 2: Prepare the vLLM Docker Image

We use the official vLLM image with Gemma4 support. Pull it and push it to your Artifact Registry to reduce traffic cost on model startup:

# Authenticate Docker with Artifact Registry
gcloud auth configure-docker us-central1-docker.pkg.dev

# Pull the Gemma 4 compatible vLLM image
docker pull vllm/vllm-openai:gemma4-0505-cu129

# Tag and push to your registry
docker tag vllm/vllm-openai:gemma4-0505-cu129 \
  us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:gemma4-0505-cu129

docker push \
  us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:gemma4-0505-cu129

The image is ~27GB - this is normal. It bundles CUDA runtime, PyTorch, vLLM, and all dependencies into a self-contained package.

Step 3: Create the Namespace

kubectl create namespace gemma4

Step 4: Deploy Gemma 4 26B

Now for the main part. Here's the complete StatefulSet manifest:

apiVersion: v1
kind: Service
metadata:
  name: gemma4-26b
  namespace: gemma4
  labels:
    app: gemma4-26b
spec:
  type: ClusterIP
  selector:
    app: gemma4-26b
  ports:
    - name: http
      port: 80
      targetPort: 8000
      protocol: TCP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gemma4-26b
  namespace: gemma4
  labels:
    app: gemma4-26b
spec:
  serviceName: gemma4-26b
  replicas: 1
  selector:
    matchLabels:
      app: gemma4-26b
  updateStrategy:
    type: RollingUpdate
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  template:
    metadata:
      labels:
        app: gemma4-26b
    spec:
      terminationGracePeriodSeconds: 60
      nodeSelector:
        service: l4-llm
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:gemma4-0505-cu129
          imagePullPolicy: IfNotPresent
          args:
            - --model
            - cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
            - --served-model-name
            - gemma-4-26b-a4b
            - --gpu-memory-utilization
            - "0.97"
            - --kv-cache-dtype
            - fp8
            - --max-model-len
            - "256000"
            - --tensor-parallel-size
            - "1"
            - --max-num-seqs
            - "8"
            - --enable-chunked-prefill
            - --max-num-batched-tokens
            - "4096"
            - --enable-auto-tool-choice
            - --tool-call-parser
            - gemma4
            - --reasoning-parser
            - gemma4
            - --async-scheduling
            - --limit-mm-per-prompt
            - '{"image": 0, "audio": 0, "video": 0}'
            - --host
            - 0.0.0.0
            - --port
            - "8000"
          env:
            - name: HF_HOME
              value: /models
            - name: XDG_CACHE_HOME
              value: /models/.xdg-cache
            - name: TRITON_CACHE_DIR
              value: /models/.triton
            - name: VLLM_LOGGING_LEVEL
              value: INFO
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
            - name: TORCH_CUDA_ARCH_LIST
              value: "8.9"
            - name: LD_LIBRARY_PATH
              value: /home/kubernetes/bin/nvidia/lib64:/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:/usr/lib:/lib
          ports:
            - name: http
              containerPort: 8000
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 90
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
              nvidia.com/gpu: "1"
            limits:
              cpu: "3500m"
              memory: 12Gi
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-cache
              mountPath: /models
            - name: dshm
              mountPath: /dev/shm
            - name: nvidia-lib64
              mountPath: /home/kubernetes/bin/nvidia/lib64
              readOnly: true
            - name: nvidia-bin
              mountPath: /home/kubernetes/bin/nvidia/bin
              readOnly: true
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
        - name: nvidia-lib64
          hostPath:
            path: /home/kubernetes/bin/nvidia/lib64
            type: Directory
        - name: nvidia-bin
          hostPath:
            path: /home/kubernetes/bin/nvidia/bin
            type: Directory
  volumeClaimTemplates:
    - apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: model-cache
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: standard-rwo
        resources:
          requests:
            storage: 30Gi

Now apply it:

kubectl apply -f gemma4-26b.yaml

What Happens on First Start

The first startup takes around 8-10 minutes total. Here's the breakdown:

vLLM image pull from Artifact Registry (~5 min on a fresh node) - the image is 27GB and needs to be pulled once per node. On subsequent restarts on the same node, imagePullPolicy: IfNotPresent skips this step entirely.
Model download from HuggingFace (~2 min) - vLLM automatically downloads cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit to the PVC. This also happens only once - the PVC persists across pod restarts.
Weight loading into GPU (~90 sec) - 16GB of AWQ-quantized weights are loaded from the PVC into VRAM.
torch.compile (~50 sec first run, ~11 sec from cache) - JIT compilation of CUDA kernels; the result is saved to the PVC cache.
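
As a rough check on the 8-10 minute figure, the stage durations from the list can simply be summed. This is a minimal sketch: the stage names are labels picked for illustration, the durations are the approximate values quoted above, and it assumes the stages run strictly one after another.

```python
# Approximate first-start stage durations in seconds (values from the list above).
# The dictionary keys are illustrative labels, not vLLM or Kubernetes identifiers.
FIRST_START_STAGES = {
    "image_pull": 5 * 60,
    "model_download": 2 * 60,
    "weight_loading": 90,
    "torch_compile": 50,
}


def total_startup_seconds(stages: dict[str, int]) -> int:
    """Sum the per-stage durations, assuming the stages do not overlap."""
    return sum(stages.values())


def format_duration(seconds: int) -> str:
    """Render a number of seconds as 'M min S sec'."""
    minutes, rest = divmod(seconds, 60)
    return f"{minutes} min {rest} sec"


total = total_startup_seconds(FIRST_START_STAGES)
print(format_duration(total))  # 9 min 20 sec
```

The sum lands inside the quoted range; scheduling and container start overhead are not included.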

Watch the progress:

kubectl logs -f statefulset/gemma4-26b -n gemma4

Here's what a healthy startup looks like with annotations so you know what to expect at each stage:

# vLLM resolves the model architecture - confirms Gemma4 is recognized correctly
INFO [model.py] Resolved architecture: Gemma4ForConditionalGeneration
INFO [model.py] Using max model len 256000

# Text-only mode confirmed - multimodal encoders skipped, saving ~1GB VRAM
INFO [registry.py] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.

# Model weights downloading from HuggingFace to PVC (first run only)
INFO [weight_utils.py] Time spent downloading weights: 114.785540 seconds

# Weights loaded from PVC into GPU VRAM - 16GB in ~87 seconds
INFO [weight_utils.py] Filesystem type for checkpoints: EXT4. Checkpoint size: 16.01 GiB.
INFO [default_loader.py] Loading weights took 87.56 seconds
INFO [gpu_model_runner.py] Model loading took 15.55 GiB memory and 90.185187 seconds

# torch.compile - on first run ~50 sec, from cache ~11 sec
INFO [backends.py] Directly load the compiled graph(s) for compile range (1, 4096) from the cache, took 2.789 s
INFO [monitor.py] torch.compile took 10.94 s in total

# Final VRAM budget - this is the key number to watch
INFO [gpu_worker.py] Available KV cache memory: 5.26 GiB
INFO [kv_cache_utils.py] GPU KV cache size: 459,627 tokens
INFO [kv_cache_utils.py] Maximum concurrency for 256,000 tokens per request: 1.80x

# Server is ready
INFO: Application startup complete.

You're looking for this line to confirm everything worked:

INFO: Application startup complete.
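
If you want to check these values automatically rather than by eye, the relevant log lines are easy to parse. The sketch below is an assumption-laden helper, not part of vLLM: the function name and result keys are hypothetical, and the regular expressions are written against the exact log wording shown above, which may change between vLLM versions.

```python
import re

# Sample lines copied from the startup log above.
SAMPLE_LOG = """\
INFO [gpu_worker.py] Available KV cache memory: 5.26 GiB
INFO [kv_cache_utils.py] GPU KV cache size: 459,627 tokens
INFO [kv_cache_utils.py] Maximum concurrency for 256,000 tokens per request: 1.80x
INFO: Application startup complete.
"""


def parse_startup_log(text: str) -> dict:
    """Extract the KV cache budget and readiness flag from vLLM startup logs."""
    result = {"ready": "Application startup complete." in text}

    mem = re.search(r"Available KV cache memory: ([\d.]+) GiB", text)
    if mem:
        result["kv_cache_gib"] = float(mem.group(1))

    size = re.search(r"GPU KV cache size: ([\d,]+) tokens", text)
    if size:
        result["kv_cache_tokens"] = int(size.group(1).replace(",", ""))

    conc = re.search(
        r"Maximum concurrency for ([\d,]+) tokens per request: ([\d.]+)x", text
    )
    if conc:
        result["max_model_len"] = int(conc.group(1).replace(",", ""))
        result["max_concurrency"] = float(conc.group(2))

    return result


info = parse_startup_log(SAMPLE_LOG)
print(info)
```

In practice you would feed it the output of the kubectl logs command instead of the sample string.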

Second Start and Beyond

This is where the PVC pays off. On every subsequent restart:

No download - model already on PVC
torch.compile from cache - 11 seconds instead of 50
Total cold start: ~2 min 30 sec

We measured this precisely across multiple restarts - the numbers are consistent:

Pod start:      15:08:57
Weights loaded: 15:11:02  (87 sec from PVC)
torch.compile:  15:11:15  (11 sec from cache)
Server ready:   15:11:29

Total from pod start: ~2 min 32 sec
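
The elapsed times can be recomputed from the timestamps. A minimal sketch, assuming all timestamps fall on the same day; the milestone labels are just names used here for illustration.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> int:
    """Seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


POD_START = "15:08:57"
MILESTONES = {
    "weights_loaded": "15:11:02",
    "torch_compile_done": "15:11:15",
    "server_ready": "15:11:29",
}

# Offset of each milestone from pod start, in seconds.
offsets = {name: elapsed_seconds(POD_START, ts) for name, ts in MILESTONES.items()}
print(offsets)  # {'weights_loaded': 125, 'torch_compile_done': 138, 'server_ready': 152}
```

152 seconds is the 2 min 32 sec total quoted above.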

Key Configuration Decisions Explained

Why AWQ 4-bit quantization?

The original Gemma 4 26B model weights are ~52GB in BF16, so the model simply won't fit on a 24GB L4. We use cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit, which brings it down to ~16GB. This format uses Marlin INT4 kernels under the hood and runs well on L4 (Ada Lovelace, sm_89, i.e. compute capability 8.9).

We tried FP8 and NVFP4 quantizations. FP8 doesn't fit either (~26GB). NVFP4 requires Blackwell (compute capability 10.0+) and won't run on L4 at all.
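
The sizes quoted here follow from simple arithmetic on parameter count and bits per weight. The sketch below is a back-of-the-envelope estimate only; the function name is hypothetical, and it ignores quantization metadata (scales, zero points) and any layers kept in higher precision, which is presumably why the real AWQ checkpoint is ~16GB rather than the raw 13GB.

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


VRAM_GB = 24  # NVIDIA L4

for label, bits in [("BF16", 16), ("FP8", 8), ("4-bit", 4)]:
    size = weight_size_gb(26, bits)
    verdict = "fits" if size < VRAM_GB else "does not fit"
    print(f"{label}: ~{size:.0f} GB of raw weights, {verdict} in {VRAM_GB} GB")
```

Note that "fits" here only means the weights themselves are smaller than VRAM; the KV cache and runtime buffers still need room on top.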

Why --limit-mm-per-prompt '{"image": 0, "audio": 0, "video": 0}'?

Gemma 4 is a multimodal model. Even if you don't use images or audio, vLLM reserves GPU memory for the multimodal encoders during profiling. Setting all limits to 0 puts vLLM into text-only mode, which frees about 1GB of VRAM. This is essential for reaching 256,000 token context on a 24GB card.

Why --kv-cache-dtype fp8?

KV cache stores intermediate attention states. Using fp8 instead of bf16 halves the memory footprint, giving more room for longer contexts. The quality impact is minimal for typical chat and document analysis tasks.
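
To put a number on this, the effective per-token KV cost can be derived from the startup log (5.26 GiB for 459,627 tokens). This is a rough sketch under a stated assumption: that the per-token cost scales linearly with element width, so bf16 would cost twice as much per token as fp8. The function name is made up for this example.

```python
GIB = 1024 ** 3


def bytes_per_token(kv_cache_gib: float, kv_cache_tokens: int) -> float:
    """Effective KV cache cost per token, derived from the vLLM startup log."""
    return kv_cache_gib * GIB / kv_cache_tokens


# Values reported in the startup log with --kv-cache-dtype fp8.
fp8_cost = bytes_per_token(5.26, 459_627)
print(f"fp8: ~{fp8_cost / 1024:.1f} KiB per token")  # roughly 12 KiB

# Assumption: bf16 doubles the per-token cost, so the same memory holds half the tokens.
bf16_tokens = int(5.26 * GIB / (2 * fp8_cost))
print(f"bf16 pool estimate: ~{bf16_tokens:,} tokens")
```

Under that assumption the bf16 pool would hold fewer than 256,000 tokens, so a single full-length request would no longer fit.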

Why --gpu-memory-utilization 0.97?

We leave 3% (~720MB) as a safety buffer for CUDA runtime allocations and temporary buffers during forward passes. At 0.97 we get 5.26 GiB for KV cache — enough for 459,627 tokens and 1.80x concurrency at 256k context.

Why --enable-chunked-prefill with --max-num-batched-tokens 4096?

This is the key to unlocking 256k context on a single L4. Without chunked prefill, vLLM profiles the GPU by running a forward pass with all max_model_len tokens at once — at 256k tokens that profiling pass alone would require more VRAM than the card has.

With chunked prefill enabled, vLLM breaks large prefills into chunks of max_num_batched_tokens (4096 tokens per step). This doesn't limit your context length — a 256k token document still gets fully processed — it just means the processing happens in 4096-token chunks instead of one massive pass. The profiling overhead drops dramatically, freeing several extra gigabytes for KV cache.

The tradeoff: time to first token (TTFT) for very long prompts increases proportionally, since the prefill now takes more steps. For document analysis and agent pipelines, this is completely acceptable.
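
To get a feel for the number of prefill steps involved, here is a simplified model of the chunking. It is a sketch only: the helper names are hypothetical, and it ignores the fact that the real scheduler may share the per-step token budget with other requests, so actual step counts can differ.

```python
import math


def prefill_steps(prompt_tokens: int, max_num_batched_tokens: int = 4096) -> int:
    """Number of scheduler steps needed to prefill a prompt in fixed-size chunks."""
    return math.ceil(prompt_tokens / max_num_batched_tokens)


def chunk_bounds(prompt_tokens: int, chunk: int = 4096) -> list[tuple[int, int]]:
    """(start, end) token offsets of each prefill chunk; the last one may be shorter."""
    return [
        (start, min(start + chunk, prompt_tokens))
        for start in range(0, prompt_tokens, chunk)
    ]


print(prefill_steps(28_000))        # 7
print(prefill_steps(256_000))       # 63
print(chunk_bounds(256_000)[-1])    # (253952, 256000)
```

A full 256k prompt therefore takes on the order of 63 prefill steps instead of one, which is where the extra TTFT comes from.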

Why --max-num-seqs 2?

During startup profiling, vLLM captures CUDA graphs for batch sizes up to max_num_seqs. With the default of 256 sequences, it would profile and capture graphs for far more concurrent decode sizes than a 24GB card can accommodate at 256k context. Setting --max-num-seqs 2 keeps graph capture to a handful of small batch sizes, so CUDA graphs take very little memory and 5.26 GiB is left for the KV cache pool (459,627 tokens). In practice, concurrent requests draw from that pool dynamically: a single active request can use the full 256k context, while two simultaneous requests share the pool at roughly 230k tokens each.

Performance Results

All numbers below are measured from a real running instance, not estimated. We used curl with time for latency and pulled /metrics for the rest.

Single request (500 output tokens):

time curl -s http://gemma4.yourdomain.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "Write a detailed technical explanation of how Kubernetes scheduling works..."}],
    "max_tokens": 500
  }'

# real 0m9.783s  →  500 tokens / 9.78s = ~51 tokens/sec

Results from /metrics endpoint:

Metric	1 request
Throughput per request	~51 tok/s
Time to First Token (TTFT)	~84ms
KV cache usage	<1%
Queue time	~0ms

Prefix cache hit rate from metrics: 256/1557 tokens = ~16.4%. This means repeated system prompts or common conversation prefixes are being served from cache, reducing compute. In production, with consistent system prompts, the hit rate will be significantly higher.
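
Both derived figures are simple ratios, shown here as a small sketch with hypothetical helper names. Keep in mind that the throughput number comes from wall-clock time, so it includes TTFT and network overhead and slightly understates pure decode speed.

```python
def tokens_per_second(output_tokens: int, wall_seconds: float) -> float:
    """End-to-end throughput for one request, measured from wall-clock time."""
    return output_tokens / wall_seconds


def hit_rate(cached_tokens: int, queried_tokens: int) -> float:
    """Fraction of prompt tokens served from the prefix cache."""
    return cached_tokens / queried_tokens if queried_tokens else 0.0


# 500 output tokens in 9.783 s, as reported by `time curl` above.
print(f"{tokens_per_second(500, 9.783):.1f} tok/s")  # 51.1 tok/s

# 256 cached tokens out of 1557 queried, as read from /metrics.
print(f"{hit_rate(256, 1557):.1%}")                   # 16.4%
```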

With 256,000 tokens, you can process entire codebases, full legal contracts, lengthy research papers, or long-running agent conversations with extensive tool call history - all in a single pass without chunking or summarization tricks.

Context Length Evolution: From 28k to 256k

The journey from the initial 28,000 token deployment to 256,000 tokens happened entirely through parameter tuning — same hardware, same model, same gpu_memory_utilization. Here's the full progression:

Configuration	KV cache pool	Context	Concurrency
Original (max-num-seqs=8, no chunked prefill)	29,709 tokens	28k	1.06x
Intermediate (chunked prefill enabled)	280,517 tokens	64k	4.38x
130k config	395,527 tokens	130k	3.04x
256k config (current)	459,627 tokens	256k	1.80x

The jump from 28k to 256k came entirely from two parameter changes: enabling chunked prefill with a small max_num_batched_tokens, and reducing max_num_seqs to 2. No hardware changes, no model changes, no additional cost.

At 256k, concurrency drops to 1.80x - meaning a single request is always safe and two shorter requests can run simultaneously. For internal tooling, this is perfectly fine: you're trading multi-user throughput for maximum document size per request.
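
The concurrency column is just the KV cache pool divided by the context length. A minimal sketch that reproduces the table, assuming "28k", "64k" and "130k" mean 28,000, 64,000 and 130,000 tokens; the variable and function names are illustrative.

```python
# (label, KV cache pool in tokens, max context in tokens), from the table above.
CONFIGS = [
    ("Original", 29_709, 28_000),
    ("Intermediate", 280_517, 64_000),
    ("130k config", 395_527, 130_000),
    ("256k config", 459_627, 256_000),
]


def max_concurrency(pool_tokens: int, context_tokens: int) -> float:
    """How many full-length requests fit in the KV cache pool at once."""
    return pool_tokens / context_tokens


rows = {label: f"{max_concurrency(pool, ctx):.2f}x" for label, pool, ctx in CONFIGS}
for label, value in rows.items():
    print(f"{label}: {value}")
```

Any value below 1.00x would mean a single full-length request no longer fits, which is the limit to watch when raising the context length further.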

Expose the API

To expose the model inside your cluster via HTTP, you need an Ingress controller. The most common options in 2026 are:

NGINX Ingress Controller - the community standard, works on any Kubernetes cluster including GKE
GKE Gateway API - the newer GKE-native approach using HTTPRoute resources
Kong Gateway - popular choice if you need API key auth, rate limiting, or routing logic on top

The example below uses NGINX Ingress, which you can install with:

helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace

Add an Ingress to make the model accessible within your cluster:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gemma4-ingress
  namespace: gemma4
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
spec:
  ingressClassName: nginx
  rules:
    - host: gemma4.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gemma4-26b
                port:
                  number: 80

Test it:

curl -s http://gemma4.yourdomain.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-26b-a4b",
    "messages": [{"role": "user", "content": "Hello, what can you do?"}],
    "max_tokens": 200
  }' | python3 -m json.tool

The API is fully OpenAI-compatible. You can drop it in as a replacement for any OpenAI client by changing the base_url to http://gemma4.yourdomain.com/v1.
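
For clients that don't use an OpenAI SDK, the same request can be made from Python's standard library. This is a minimal sketch: the hostname is the placeholder from the Ingress example, and build_chat_request and chat are hypothetical helper names. It assumes the standard OpenAI chat completions response shape.

```python
import json
import urllib.request

# Placeholder host from the Ingress example; replace with your own.
BASE_URL = "http://gemma4.yourdomain.com/v1"


def build_chat_request(
    prompt: str,
    model: str = "gemma-4-26b-a4b",
    max_tokens: int = 200,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 600.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling chat("Hello, what can you do?") should return the same text as the curl test above. The 600-second timeout mirrors the proxy timeouts in the Ingress annotations, since long prompts can take a while to prefill.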

Honest Assessment: What This Model Is Good For

It's worth being upfront: this is a quantized community model, not the latest frontier API. AWQ 4-bit compression introduces some quality degradation compared to the full BF16 weights, and at 256k context and max_num_seqs=2 this is effectively a single-user setup. If you need high-concurrency inference, this configuration trades that for maximum document size.

That said, in practice it handles a wide range of tasks well - log analysis, PR reviews, writing automation agents, summarizing technical documents, simple code generation, and internal tooling. For these workloads, the quality is more than good enough, and the tradeoff is clear.

The bigger advantage is data privacy. With a self-hosted model, your data never leaves your infrastructure. Cloud API providers may retain prompts for some period, and whether they can be used for model improvement depends on the provider, the plan, and your settings. If you're processing internal systems data, customer information, or proprietary business logic, not having to rely on those terms is a meaningful difference. Your data stays yours.

What We Tried That Didn't Work

I think it's useful to share the approaches we tested before landing on this setup.

Regional Persistent Disks - We tried using a regional PD (replicated across two zones) to avoid zonal lock-in. GCP doesn't support regional PDs on G2 instances (the GPU series). This is a hard limitation, not a configuration issue.

GCS FUSE - Mounting the model weights directly from a GCS bucket sounds elegant. In practice, vLLM reads all 16GB of weights sequentially at startup, and GCS FUSE has no prefetching for this pattern. After 4+ minutes on the first shard with no progress, we abandoned this approach. The init container + gsutil copy is faster and simpler.

High --max-num-seqs with large context - Our initial attempt at 130k used --max-num-seqs 8 without chunked prefill. vLLM failed to start: the profiling pass tried to reserve memory for 8 × 130k = over 1 million tokens simultaneously, which is physically impossible on 24GB VRAM. The fix is --max-num-seqs 2 combined with --enable-chunked-prefill.

Cost Breakdown

Running this setup full-time on on-demand pricing:

Resource	Cost/month
g2-standard-4 on-demand	~$500
PVC 30GB standard-rwo	~$3
GCS model storage 17GB	~$0.40
Total	~$503/month

A self-hosted model runs 24/7. At 45 tokens/sec on a single request, that's:

45 tokens/sec × 3600 sec/hr × 24 hr × 30 days = ~116.6 million tokens/month theoretical maximum

Even at 20% average utilization (which is realistic for internal tooling), you're looking at ~23 million output tokens/month for a flat $503.

For comparison, as of May 2026 OpenAI's current models are priced at:

Model	Output tokens (per 1M)
GPT-5.4 (flagship)	$15.00
GPT-4.1	$8.00
GPT-4o (legacy)	$10.00
GPT-4o-mini	$0.60

At GPT-5.4 rates, 23 million output tokens would cost about $345/month, and at GPT-4o-mini's $0.60/M about $14/month. On output tokens alone, the flat $503 therefore does not pay for itself at 20% single-stream utilization: break-even against GPT-5.4 sits at roughly 33.5 million output tokens/month, or about 29% utilization. API providers also bill input tokens, which this estimate ignores, so prompt-heavy long-context workloads may reach break-even sooner.
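
The arithmetic behind this comparison is easy to rerun with your own throughput and prices. A minimal sketch: the prices are the ones quoted in the table, the function names are illustrative, and only output tokens are counted.

```python
SECONDS_PER_MONTH = 3600 * 24 * 30
FLAT_MONTHLY_COST_USD = 503


def monthly_tokens(tokens_per_sec: float, utilization: float = 1.0) -> float:
    """Output tokens generated per 30-day month at a given average utilization."""
    return tokens_per_sec * SECONDS_PER_MONTH * utilization


def api_cost_usd(tokens: float, price_per_million: float) -> float:
    """What the same number of output tokens would cost at a per-million price."""
    return tokens / 1_000_000 * price_per_million


def break_even_tokens(flat_cost_usd: float, price_per_million: float) -> float:
    """Monthly output tokens at which the API bill equals the flat hosting cost."""
    return flat_cost_usd / price_per_million * 1_000_000


theoretical_max = monthly_tokens(45)          # 116,640,000 tokens
at_20_percent = monthly_tokens(45, 0.20)      # 23,328,000 tokens

print(f"GPT-5.4:     ${api_cost_usd(at_20_percent, 15.00):,.2f}")   # $349.92
print(f"GPT-4o-mini: ${api_cost_usd(at_20_percent, 0.60):,.2f}")    # $14.00

needed = break_even_tokens(FLAT_MONTHLY_COST_USD, 15.00)
print(f"Break-even vs GPT-5.4: {needed / theoretical_max:.0%} utilization")  # 29%
```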

In Part 2 we'll show how to cut the compute cost by 70% using spot instances with a zone-aware architecture that automatically recovers from preemptions.

Afterword

This setup is running in production and being used across the team for various tasks - log analysis, automated code reviews, agent pipelines, and document processing. The main takeaway is that running a capable 26B model with a genuine 256k context window on a single L4 GPU is very achievable in 2026 - the tooling has matured significantly. The tricky parts are around understanding how vLLM's memory profiling interacts with context length. Once you know which levers to pull - chunked prefill, conservative max_num_seqs, fp8 KV cache, text-only mode - the numbers move dramatically without touching the hardware at all.

If you have questions or want to share your experience with similar setups, feel free to reach out. The next article in this series will cover multi-zone spot deployments with automatic failover. Stay tuned.