Alexey Nizhegolenko DevOps Engineer, AI Infrastructure Engineer
When you start looking at self-hosting large language models in 2026, the options can feel overwhelming. You have dozens of models, quantization formats, inference engines, and cloud configurations to choose from. In this article, I'll show you a straightforward, step-by-step way to deploy Gemma4 26B-A4B on Google Kubernetes Engine using a single NVIDIA L4 GPU - with real numbers, real mistakes, and no marketing fluff.
This is the first article in a series. Here we focus on getting a stable, production-ready deployment on a standard (non-spot) L4 instance. In the next article, we'll look at how to build a resilient multi-zone setup with spot instances to cut costs significantly.
Why Gemma4 26B-A4B?
Gemma4 is Google DeepMind's latest open-weight model family released in April 2026 under Apache 2.0. The 26B-A4B variant is a Mixture-of-Experts (MoE) architecture - 26 billion total parameters but only ~4 billion active per token. This means it punches well above its weight in terms of quality while keeping inference costs reasonable.
Hardware and Cost
An NVIDIA L4 GPU comes with 24GB VRAM. On GCP, the smallest instance with an L4 is g2-standard-4 (4 vCPU, 16GB RAM).
| Instance type | On-demand price | Spot price |
|---|---|---|
| g2-standard-4 (1x L4) | ~$0.70/hr | ~$0.21/hr |
For this article we use the on-demand instance for stability. The spot option and how to handle preemptions will be covered in Part 2.
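To put the hourly prices in monthly terms, here is a small sketch of the arithmetic. It assumes a 30-day month of continuous uptime and uses the approximate prices from the table, so treat the results as ballpark figures, not a quote.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, node never scaled down


def monthly_cost(hourly_usd: float, hours: int = HOURS_PER_MONTH) -> float:
    """Cost of running one instance continuously for a month."""
    return round(hourly_usd * hours, 2)


on_demand = monthly_cost(0.70)  # approximate on-demand price from the table
spot = monthly_cost(0.21)       # approximate spot price from the table
savings = 1 - spot / on_demand

print(f"on-demand: ${on_demand:.2f}/month")
print(f"spot:      ${spot:.2f}/month")
print(f"savings:   {savings:.0%}")
```

On these inputs the on-demand node comes to about $504/month and the spot node to about $151/month, a 70% difference.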
Requirements
Before starting, you need:
- A GKE cluster (Standard mode, not Autopilot) in any region with GPU available
- kubectl configured to connect to your cluster
- Google Artifact Registry repository for your Docker images
- Basic familiarity with Kubernetes StatefulSets and PersistentVolumes
Step 1: Create the GPU Node Pool
Create a dedicated node pool for your LLM workload. We use g2-standard-4 with one L4 GPU:
gcloud container node-pools create l4-llm \
--cluster=YOUR_CLUSTER_NAME \
--zone=us-central1-b \
--machine-type=g2-standard-4 \
--accelerator=type=nvidia-l4,count=1,gpu-driver-version=latest \
--num-nodes=0 \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=1 \
--node-labels=service=l4-llm \
--node-taints=nvidia.com/gpu=present:NoSchedule \
--scopes=cloud-platform
A few things worth noting here. We start with --num-nodes=0 and enable autoscaling - the node will spin up automatically when we deploy our StatefulSet. The --node-taints ensures only GPU workloads land on this expensive node pool. The --scopes=cloud-platform is important: without it, the node can't pull images from Artifact Registry.
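To make the taint mechanics concrete, here is a minimal Python sketch of the matching rule the scheduler applies. It is a simplified model, not the Kubernetes implementation: the `Taint` and `Toleration` classes and the `schedulable` helper are illustrative names, and only the `Exists` and `Equal` operators are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""             # empty key with Exists matches every key
    operator: str = "Equal"   # Kubernetes default operator
    value: str = ""
    effect: str = ""          # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """True if this toleration matches the taint."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True               # value is ignored
    return tol.value == taint.value


def schedulable(tolerations: list[Toleration], taints: list[Taint]) -> bool:
    """A pod fits a node only if every hard taint is tolerated."""
    hard = [t for t in taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in hard)


# The taint from the node pool command and the toleration used by the vLLM pod.
gpu_taint = Taint("nvidia.com/gpu", "present", "NoSchedule")
vllm_toleration = Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")

print(schedulable([vllm_toleration], [gpu_taint]))  # pod with the toleration
print(schedulable([], [gpu_taint]))                 # ordinary pod without it
```

The taint only keeps other pods off the GPU node. It is the nodeSelector on `service: l4-llm` that keeps the vLLM pod on it.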
Step 2: Prepare the vLLM Docker Image
We use the official vLLM image with Gemma4 support. Pull it and push it to your Artifact Registry to reduce traffic cost on model startup:
# Authenticate Docker with Artifact Registry
gcloud auth configure-docker us-central1-docker.pkg.dev
# Pull the Gemma 4 compatible vLLM image
docker pull vllm/vllm-openai:gemma4-0505-cu129
# Tag and push to your registry
docker tag vllm/vllm-openai:gemma4-0505-cu129 \
us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:gemma4-0505-cu129
docker push \
us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:gemma4-0505-cu129
The image is ~27GB - this is normal. It bundles CUDA runtime, PyTorch, vLLM, and all dependencies into a self-contained package.
Step 3: Create the Namespace
kubectl create namespace gemma4
Step 4: Deploy Gemma 4 26B
Now for the main part. Here's the complete manifest, a Service plus the StatefulSet. Save it as gemma4-26b.yaml:
apiVersion: v1
kind: Service
metadata:
  name: gemma4-26b
  namespace: gemma4
  labels:
    app: gemma4-26b
spec:
  type: ClusterIP
  selector:
    app: gemma4-26b
  ports:
    - name: http
      port: 80
      targetPort: 8000
      protocol: TCP
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gemma4-26b
  namespace: gemma4
  labels:
    app: gemma4-26b
spec:
  serviceName: gemma4-26b
  replicas: 1
  selector:
    matchLabels:
      app: gemma4-26b
  updateStrategy:
    type: RollingUpdate
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  template:
    metadata:
      labels:
        app: gemma4-26b
    spec:
      terminationGracePeriodSeconds: 60
      nodeSelector:
        service: l4-llm
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: us-central1-docker.pkg.dev/YOUR_PROJECT/tools/vllm-openai:gemma4-0505-cu129
          imagePullPolicy: IfNotPresent
          args:
            - --model
            - cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit
            - --served-model-name
            - gemma-4-26b-a4b
            - --gpu-memory-utilization
            - "0.97"
            - --kv-cache-dtype
            - fp8
            - --max-model-len
            - "28000"
            - --tensor-parallel-size
            - "1"
            - --max-num-seqs
            - "8"
            - --max-num-batched-tokens
            - "28000"
            - --enable-auto-tool-choice
            - --tool-call-parser
            - gemma4
            - --reasoning-parser
            - gemma4
            - --async-scheduling
            - --limit-mm-per-prompt
            - '{"image": 0, "audio": 0, "video": 0}'
            - --host
            - 0.0.0.0
            - --port
            - "8000"
          env:
            - name: HF_HOME
              value: /models
            - name: XDG_CACHE_HOME
              value: /models/.xdg-cache
            - name: TRITON_CACHE_DIR
              value: /models/.triton
            - name: VLLM_LOGGING_LEVEL
              value: INFO
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
            - name: TORCH_CUDA_ARCH_LIST
              value: "8.9"
            - name: LD_LIBRARY_PATH
              value: /home/kubernetes/bin/nvidia/lib64:/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:/usr/lib:/lib
          ports:
            - name: http
              containerPort: 8000
              protocol: TCP
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 90
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
              nvidia.com/gpu: "1"
            limits:
              cpu: "3500m"
              memory: 12Gi
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: model-cache
              mountPath: /models
            - name: dshm
              mountPath: /dev/shm
            - name: nvidia-lib64
              mountPath: /home/kubernetes/bin/nvidia/lib64
              readOnly: true
            - name: nvidia-bin
              mountPath: /home/kubernetes/bin/nvidia/bin
              readOnly: true
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
        - name: nvidia-lib64
          hostPath:
            path: /home/kubernetes/bin/nvidia/lib64
            type: Directory
        - name: nvidia-bin
          hostPath:
            path: /home/kubernetes/bin/nvidia/bin
            type: Directory
  volumeClaimTemplates:
    - apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: model-cache
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: standard-rwo
        resources:
          requests:
            storage: 30Gi
Now apply it:
kubectl apply -f gemma4-26b.yaml
What Happens on First Start
The first startup takes around 8-10 minutes total. Here's the breakdown:
- vLLM image pull from Artifact Registry (~5 min on a fresh node) - the image is 27GB and needs to be pulled once per node. On subsequent restarts on the same node, imagePullPolicy: IfNotPresent skips this step entirely.
- Model download from HuggingFace (~2 min) - vLLM automatically downloads cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit to the PVC. This also happens only once - the PVC persists across pod restarts.
- Weight loading into GPU (~90 sec) - 16GB of AWQ quantized weights loaded from PVC into VRAM
- torch.compile (~90 sec) - JIT compilation of CUDA kernels, result saved to PVC cache
Watch the progress:
kubectl logs -f statefulset/gemma4-26b -n gemma4
Here's what a healthy startup looks like with annotations so you know what to expect at each stage:
# vLLM resolves the model architecture - confirms Gemma4 is recognized correctly
INFO [model.py] Resolved architecture: Gemma4ForConditionalGeneration
INFO [model.py] Using max model len 28000
# Text-only mode confirmed - multimodal encoders skipped, saving ~1GB VRAM
INFO [registry.py] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
# Model weights downloading from HuggingFace to PVC (first run only)
INFO [weight_utils.py] Time spent downloading weights: 114.785540 seconds
# Weights loaded from PVC into GPU VRAM - 16GB in ~87 seconds
INFO [weight_utils.py] Filesystem type for checkpoints: EXT4. Checkpoint size: 16.01 GiB.
INFO [default_loader.py] Loading weights took 87.47 seconds
INFO [gpu_model_runner.py] Model loading took 15.55 GiB memory and 89.977503 seconds
# torch.compile - on first run ~90 sec, from cache ~10 sec
INFO [backends.py] Directly load the compiled graph(s) for compile range (1, 28000) from the cache, took 2.750 s
INFO [monitor.py] torch.compile took 10.21 s in total
# Final VRAM budget - this is the key number to watch
INFO [gpu_worker.py] Available KV cache memory: 3.12 GiB
INFO [kv_cache_utils.py] GPU KV cache size: 29,709 tokens
INFO [kv_cache_utils.py] Maximum concurrency for 28,000 tokens per request: 1.06x
# Server is ready
INFO: Application startup complete.
If Available KV cache memory is below ~2.95 GiB, your --max-model-len 28000 will fail at startup. Either lower the context length or increase --gpu-memory-utilization slightly.
You're looking for this line to confirm everything worked:
INFO: Application startup complete.
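The KV cache figures in the startup log can be cross-checked with a few lines of arithmetic. This sketch assumes KV cache memory scales linearly with the token count, and the 3.12 GiB figure is rounded in the log, so expect small deviations.

```python
GIB = 1024 ** 3

# Values copied from the startup log above.
kv_cache_gib = 3.12
kv_cache_tokens = 29_709
max_model_len = 28_000

# Memory one cached token occupies, derived from the two log lines.
bytes_per_token = kv_cache_gib * GIB / kv_cache_tokens

# Memory needed to hold one full-length request, and how many fit.
required_gib = max_model_len * bytes_per_token / GIB
concurrency = kv_cache_tokens / max_model_len

print(f"KV cache per token: {bytes_per_token / 1024:.1f} KiB")
print(f"needed for {max_model_len} tokens: {required_gib:.2f} GiB")
print(f"max concurrency at full context: {concurrency:.2f}x")
```

A full 28,000-token request needs about 2.94 GiB, which is where the ~2.95 GiB threshold comes from, and the 1.06x concurrency matches the log line.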
Second Start and Beyond
This is where the PVC pays off. On every subsequent restart:
- No download - model already on PVC
- torch.compile from cache - 10 seconds instead of 90
- Total cold start: ~2 min 33 sec from pod start when the image is already on the node
We measured this precisely across multiple restarts - the numbers are consistent:
Scale up: 07:24:56
Pod running: 07:30:02 (image pull ~5 min on fresh node)
Weights loaded: 07:32:05 (87 sec from PVC)
torch.compile: 07:32:15 (10 sec from cache)
Server ready: 07:32:35
Total from scale up: ~7 min 39 sec (first time, includes image pull)
Total from pod start: ~2 min 33 sec (weights + compile from cache)
On subsequent restarts on the same node (image already cached):
Pod start → Server ready: ~2 min 33 sec
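If you want to reproduce the durations from your own timestamps, a short script does the subtraction. The timestamps below are the ones from the measurement above. The helper names are just for illustration.

```python
from datetime import datetime

# Timestamps copied from the measurement above.
TIMELINE = [
    ("Scale up", "07:24:56"),
    ("Pod running", "07:30:02"),
    ("Weights loaded", "07:32:05"),
    ("torch.compile", "07:32:15"),
    ("Server ready", "07:32:35"),
]


def seconds_between(start: str, end: str) -> int:
    """Elapsed seconds between two HH:MM:SS stamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


def pretty(seconds: int) -> str:
    minutes, secs = divmod(seconds, 60)
    return f"{minutes} min {secs:02d} sec"


times = dict(TIMELINE)
# Duration of each phase, keyed by the event that ends it.
phases = {
    later: seconds_between(t0, t1)
    for (_, t0), (later, t1) in zip(TIMELINE, TIMELINE[1:])
}
total = seconds_between(times["Scale up"], times["Server ready"])
from_pod_start = seconds_between(times["Pod running"], times["Server ready"])

for name, secs in phases.items():
    print(f"{name:<15} +{pretty(secs)}")
print("total from scale up: ", pretty(total))
print("total from pod start:", pretty(from_pod_start))
```

The 123 seconds between "Pod running" and "Weights loaded" cover process start-up plus the 87 seconds of weight loading. The final 20 seconds are whatever vLLM still does between compilation and serving.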
Key Configuration Decisions Explained
Why AWQ 4-bit quantization?
The original Gemma 4 26B model weights are ~52GB in BF16 - it simply won't fit on a 24GB L4. We use cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit, which brings it down to ~16GB. This format uses Marlin INT4 kernels under the hood and runs well on L4 (Ada Lovelace, compute capability 8.9, i.e. sm_89).
We tried FP8 and NVFP4 quantizations. FP8 doesn't fit either (~26GB). NVFP4 requires Blackwell (compute capability 10.0+) and won't run on L4 at all.
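The sizes above follow from a simple rule of thumb: parameters times bits per parameter. Here is a rough sketch of that estimate. It counts raw weights only, so it is a lower bound and not a prediction of the checkpoint size.

```python
TOTAL_PARAMS = 26e9  # all experts must sit in VRAM, even if only ~4B are active per token
L4_VRAM_GB = 24

BITS_PER_PARAM = {"bf16": 16, "fp8": 8, "int4": 4}


def weight_gb(params: float, bits: int) -> float:
    """Raw weight size in GB, ignoring scales, zero points and unquantized layers."""
    return params * bits / 8 / 1e9


for name, bits in BITS_PER_PARAM.items():
    size = weight_gb(TOTAL_PARAMS, bits)
    verdict = "fits" if size < L4_VRAM_GB else "does not fit"
    print(f"{name:>5}: {size:5.1f} GB of weights, {verdict} in {L4_VRAM_GB} GB")
```

The 4-bit estimate comes out at 13 GB, while the log reports a 16.01 GiB checkpoint. The gap is expected: quantized checkpoints also store scaling metadata and typically leave some tensors at higher precision. "Fits" here only means the weights fit; the KV cache and activations need room on top.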
Why --limit-mm-per-prompt '{"image": 0, "audio": 0, "video": 0}'?
Gemma 4 is a multimodal model. Even if you don't use images or audio, vLLM reserves GPU memory for the multimodal encoders during profiling. Setting all limits to 0 puts vLLM into text-only mode, which frees about 1GB of VRAM. This is what allows us to reach 28,000 token context instead of 24,000.
Why --kv-cache-dtype fp8?
The KV cache stores the attention key and value tensors for every token already processed, so they don't have to be recomputed at each decode step. Using fp8 instead of bf16 halves the memory footprint, giving more room for longer contexts. The quality impact is minimal for typical chat and document analysis tasks.
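Why exactly half? A simplified per-token formula makes it visible. The layer and head counts below are hypothetical placeholders, not Gemma 4's real configuration, and the formula ignores architecture details such as sliding-window layers. The ratio between the two dtypes does not depend on those values.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


# Hypothetical architecture values, NOT Gemma 4's real config.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

bf16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, dtype_bytes=2)
fp8 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, dtype_bytes=1)
ratio = bf16 / fp8

# Capacity reported in the startup log with fp8, and what the same
# memory would hold at bf16 under this simple model.
fp8_capacity_tokens = 29_709
bf16_capacity_tokens = int(fp8_capacity_tokens / ratio)

print(f"bf16 / fp8 bytes per token: {ratio:.0f}x")
print(f"estimated bf16 capacity: {bf16_capacity_tokens} tokens")
```

Under this model the same KV budget would hold roughly 14,850 tokens at bf16, well short of the 28,000-token context.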
Why --gpu-memory-utilization 0.97?
We found that 0.96 leaves the KV cache 40MB short for 28,000 tokens. Bumping to 0.97 makes it work. We don't go higher to keep a safety buffer against OOM on spot instances.
Performance Results
All numbers below are measured from a real running instance, not estimated. We used curl with time for latency and pulled /metrics for the rest.
Single request (500 output tokens):
time curl -s http://gemma4.yourdomain.com/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-4-26b-a4b",
"messages": [{"role": "user", "content": "Write a detailed technical explanation of how Kubernetes scheduling works..."}],
"max_tokens": 500
}'
# real 0m9.783s → 500 tokens / 9.78s = ~51 tokens/sec
4 parallel requests (batch simulation):
for i in {1..4}; do
curl -s http://gemma4.yourdomain.com/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "gemma-4-26b-a4b", "messages": [...], "max_tokens": 500}' &
done
wait
Results from /metrics endpoint:
| Metric | 1 request | 4 parallel |
|---|---|---|
| Throughput per request | ~51 tok/s | ~47 tok/s |
| Time to First Token (TTFT) | ~84ms | ~1.94 sec |
| KV cache usage | <1% | 6.1% |
| Queue time | ~0ms | ~0ms |
| Requests running simultaneously | 1 | 4 |
A few things worth noting here. At 4 parallel requests, per-request throughput barely drops (-8%) - vLLM's async scheduling batches the decode steps efficiently. TTFT increases significantly though: from 84ms to ~1.9 seconds. This is expected behaviour when the GPU is processing multiple prefills at once. For interactive use cases, keep concurrency low. For batch processing, throughput is what matters and it holds well.
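If you prefer a script over a shell loop, the same parallel test can be sketched in Python with only the standard library. The hostname is the placeholder from the Ingress example. `send_chat` and `benchmark` are illustrative names. The token count is read from the `usage.completion_tokens` field of the OpenAI-style response.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://gemma4.yourdomain.com/v1"  # placeholder host from the Ingress example
MODEL = "gemma-4-26b-a4b"


def send_chat(prompt: str, max_tokens: int = 500) -> int:
    """POST one chat completion and return the number of generated tokens."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)["usage"]["completion_tokens"]


def benchmark(send, prompts):
    """Run all prompts concurrently; return (per-request tok/s list, aggregate tok/s)."""
    def timed(prompt):
        t0 = time.perf_counter()
        tokens = send(prompt)
        return tokens, time.perf_counter() - t0

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        results = list(pool.map(timed, prompts))
    wall = time.perf_counter() - start

    per_request = [tokens / elapsed for tokens, elapsed in results]
    aggregate = sum(tokens for tokens, _ in results) / wall
    return per_request, aggregate


# Usage against a live endpoint:
# per_request, aggregate = benchmark(send_chat, ["Explain Kubernetes scheduling."] * 4)
```

Per-request numbers from this script include prefill and network time, so they will read slightly lower than the pure decode throughput from /metrics.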
Prefix cache hit rate from metrics: 256/1557 tokens = ~16.4%. This means repeated system prompts or common conversation prefixes are being served from cache, reducing compute. In production with consistent system prompts the hit rate will be significantly higher.
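The hit rate is just hits divided by queries, and you can compute it from the Prometheus text that /metrics returns. In this sketch the two counter names are assumptions about how your vLLM version labels them, so check your own /metrics output. The sample text is hand-written from the numbers above, not captured output.

```python
def parse_counters(metrics_text: str) -> dict[str, float]:
    """Sum Prometheus samples by metric name, ignoring labels and comment lines."""
    totals: dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def hit_rate(
    metrics_text: str,
    hits: str = "vllm:prefix_cache_hits_total",        # assumed metric name
    queries: str = "vllm:prefix_cache_queries_total",  # assumed metric name
) -> float:
    totals = parse_counters(metrics_text)
    queried = totals.get(queries, 0.0)
    return totals.get(hits, 0.0) / queried if queried else 0.0


# Hand-written sample using the token counts quoted above.
sample = """\
# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{model_name="gemma-4-26b-a4b"} 1557.0
# TYPE vllm:prefix_cache_hits_total counter
vllm:prefix_cache_hits_total{model_name="gemma-4-26b-a4b"} 256.0
"""

print(f"prefix cache hit rate: {hit_rate(sample):.1%}")
```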
The 28,000 token context is enough for:
- Analyzing large log files or error traces without chunking
- Reviewing pull requests with extensive diffs and context
- Processing long technical documents or reports in a single pass
- Multi-turn agent conversations with rich tool call history
- Code analysis across entire modules or services
Expose the API
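The ClusterIP Service is already reachable from inside the cluster at http://gemma4-26b.gemma4.svc.cluster.local (with the default cluster domain). To expose the model over HTTP under its own hostname, you need an Ingress controller. The most common options in 2026 are: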
- NGINX Ingress Controller - the community standard, works on any Kubernetes cluster including GKE
- GKE Gateway API - the newer GKE-native approach using HTTPRoute resources
- Kong Gateway - popular choice if you need API key auth, rate limiting, or routing logic on top
The example below uses NGINX Ingress, which you can install with:
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace
Add an Ingress to route requests for your hostname to the Service:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: gemma4-ingress
  namespace: gemma4
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
spec:
  ingressClassName: nginx
  rules:
    - host: gemma4.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: gemma4-26b
                port:
                  number: 80
Test it:
curl -s http://gemma4.yourdomain.com/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-4-26b-a4b",
"messages": [{"role": "user", "content": "Hello, what can you do?"}],
"max_tokens": 200
}' | python3 -m json.tool
The API is fully OpenAI-compatible. You can drop it in as a replacement for any OpenAI client by changing the base_url to http://gemma4.yourdomain.com/v1.
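For scripts that don't use an OpenAI SDK, the same call can be made with Python's standard library. This is a minimal sketch: `chat_request` and `ask` are illustrative helper names, and the host is the placeholder from the Ingress above.

```python
import json
import urllib.request


def chat_request(base_url: str, model: str, prompt: str, max_tokens: int = 200) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(chat_request(base_url, model, prompt), timeout=600) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Usage against a live endpoint:
# print(ask("http://gemma4.yourdomain.com/v1", "gemma-4-26b-a4b", "Hello, what can you do?"))
```

The model name must match the value passed to --served-model-name, not the HuggingFace repository name.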
Honest Assessment: What This Model Is Good For
It's worth being upfront: this is a quantized community model, not the latest frontier API. AWQ 4-bit compression introduces some quality degradation compared to the full BF16 weights, and 28,000 tokens is modest compared to the 1M+ context of some cloud models.
That said, in practice it handles a wide range of tasks well - log analysis, PR reviews, writing automation agents, summarizing technical documents, simple code generation, and internal tooling. For these workloads the quality is more than good enough, and the tradeoff is clear.
The bigger advantage is data privacy. With a self-hosted model, your data never leaves your infrastructure. Cloud API providers may retain your prompts for a period and, depending on the provider and plan, use them for model improvement. If you're processing internal systems data, customer information, or proprietary business logic - that's a meaningful difference. Your data stays yours.
What We Tried That Didn't Work
I think it's useful to share the approaches we tested before landing on this setup.
Regional Persistent Disks - We tried using a regional PD (replicated across two zones) to avoid zonal lock-in. GCP doesn't support regional PDs on G2 instances (the GPU series). This is a hard limitation, not a configuration issue.
GCS FUSE - Mounting the model weights directly from a GCS bucket sounds elegant. In practice, vLLM reads all 16GB of weights sequentially at startup, and GCS FUSE has no prefetching for this pattern. After 4+ minutes on the first shard with no progress, we abandoned this approach. The init container + gsutil copy is faster and simpler.
Cost Breakdown
Running this setup full-time on on-demand pricing:
| Resource | Cost/month |
|---|---|
| g2-standard-4 on-demand | ~$500 |
| PVC 30GB standard-rwo | ~$3 |
| GCS model storage 17GB | ~$0.40 |
| Total | ~$503/month |
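A self-hosted model runs 24/7. At 45 tokens/sec on a single request, that's:
45 tokens/sec × 3600 sec/hr × 24 hr × 30 days = ~117 million tokens/month theoretical maximum
That is the ceiling for one request at a time. With 4 parallel requests at ~47 tokens/sec each (~188 tokens/sec aggregate, per the measurements above), the ceiling is closer to ~487 million tokens/month.
Even at 20% average utilization (which is realistic for internal tooling), you're looking at ~23 million tokens/month single-stream, or ~97 million with 4-way batching, for a flat $503.
For comparison, as of May 2026 OpenAI's current models are priced at:
| Model | Output tokens (per 1M) |
|---|---|
| GPT-5.4 (flagship) | $15.00 |
| GPT-4.1 | $8.00 |
| GPT-4o (legacy) | $10.00 |
| GPT-4o-mini | $0.60 |
At GPT-5.4 rates, 23 million output tokens would cost ~$350/month and 97 million ~$1,460/month, so against flagship pricing the flat $503 pays off once you keep the GPU busy with concurrent requests: break-even is ~34 million output tokens/month. Against the much cheaper GPT-4o-mini at $0.60/M, the same volumes cost only ~$14 to ~$58/month, so the case for self-hosting there rests on data privacy and control, not on price.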
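To find the break-even volume for your own numbers, divide the flat monthly cost by the per-token price. The sketch below uses the $503 total and the output-token prices from the tables above. It ignores input-token charges and engineering time, so it is a first-order estimate only.

```python
FLAT_MONTHLY_USD = 503.0  # total from the cost table above

# Output-token prices per 1M tokens, copied from the comparison table.
PRICE_PER_MILLION = {
    "GPT-5.4": 15.00,
    "GPT-4.1": 8.00,
    "GPT-4o": 10.00,
    "GPT-4o-mini": 0.60,
}


def break_even_tokens(flat_usd: float, usd_per_million: float) -> float:
    """Monthly output tokens at which the API bill equals the flat cost."""
    return flat_usd / usd_per_million * 1_000_000


def sustained_tokens_per_sec(tokens_per_month: float, days: int = 30) -> float:
    """Average generation rate needed to produce that volume in a month."""
    return tokens_per_month / (days * 24 * 3600)


for model, price in PRICE_PER_MILLION.items():
    tokens = break_even_tokens(FLAT_MONTHLY_USD, price)
    rate = sustained_tokens_per_sec(tokens)
    print(f"{model:<12} break-even: {tokens / 1e6:7.1f}M tokens/month "
          f"(~{rate:.1f} tok/s sustained)")
```

Compare the sustained rate for each model with the throughput you measure on your own workload to see which side of the line you land on.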
In Part 2 we'll show how to cut the compute cost by 70% using spot instances with a zone-aware architecture that automatically recovers from preemptions.
Afterword
This setup is running in production and being used across the team for various tasks - log analysis, automated code reviews, agent pipelines, and document processing. After several weeks of use it's proven to be stable and reliable. The main takeaway is that running a capable 26B model on a single L4 GPU is very achievable in 2026 - the tooling has matured significantly. The tricky parts are around quantization format compatibility and squeezing out the last few gigabytes of VRAM for context length.
If you have questions or want to share your experience with similar setups, feel free to reach out. The next article in this series will cover multi-zone spot deployments with automatic failover. Stay tuned.