Google released Gemma 4 yesterday. By the time I went to bed, I had it deployed on my home lab, running real coding benchmarks at 96 tokens per second.
The catch: no official llama.cpp image supported the gemma4 architecture yet. The stock CUDA images crash with `unknown model architecture: 'gemma4'`. So I built it from source, on the same Kubernetes cluster that serves inference.
This post is about what it took to go from "model dropped" to "running in production" in about two hours on consumer hardware.
The Setup
My home inference server (I call it ShadowStack):
- 2x NVIDIA RTX 5060 Ti (16GB each, 32GB total VRAM)
- AMD Ryzen 9 7900X, 64GB DDR5
- Ubuntu 24.04, MicroK8s
- NVIDIA driver 590.48.01 (CUDA 13.1)
Everything is managed by LLMKube, a Kubernetes operator I built for running llama.cpp inference. One CRD to define the model, one CRD to define the service, the operator handles the rest.
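To make the two-CRD split concrete, here's a sketch of what that pair of resources might look like. The `apiVersion`, field names, and layout are my illustrative guesses, not LLMKube's actual schema; only the `InferenceService` kind is named in this post, and the model URL is the same GGUF used later.

```yaml
# Hypothetical sketch -- apiVersion and field names are illustrative,
# not LLMKube's real schema.
apiVersion: llmkube.dev/v1alpha1
kind: Model
metadata:
  name: gemma4-26b
spec:
  source: https://huggingface.co/Trilogix1/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-Q4_K_M.gguf
  quantization: Q4_K_M
---
apiVersion: llmkube.dev/v1alpha1
kind: InferenceService
metadata:
  name: gemma4-26b
spec:
  modelRef: gemma4-26b
  gpu:
    count: 2
  contextSize: 32768
```

The point of the split is that the model definition (what to download, how it's quantized) lives independently of the serving definition (GPUs, context size), so the same model can back multiple services.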
Step 1: The Architecture Problem
First attempt: the stock server-cuda13 image (the CUDA 13 build of llama.cpp):
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma4'
Gemma 4 support hadn't shipped in any tagged llama.cpp release yet; it existed only at HEAD.
Step 2: Build From HEAD On-Cluster
I have a Kaniko build pipeline on the cluster from a previous project (TurboQuant benchmarking). I wrote a Dockerfile that clones llama.cpp HEAD and builds with CUDA targeting SM 86 (Ampere) and SM 120 (Blackwell):
FROM nvidia/cuda:12.8.0-devel-ubuntu24.04 AS builder
RUN git clone --depth 1 https://github.com/ggml-org/llama.cpp.git /build/llama.cpp
WORKDIR /build/llama.cpp
# The devel image ships only the libcuda driver *stub*; symlink it to the
# .so.1 name the linker looks for and put the stubs dir on the library path.
# The real libcuda.so.1 is injected at runtime by the NVIDIA container runtime.
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so \
/usr/local/cuda/lib64/stubs/libcuda.so.1
ENV LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LIBRARY_PATH}
RUN cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="86;120" \
-DCMAKE_BUILD_TYPE=Release \
&& cmake --build build --target llama-server -j$(nproc)
A Kaniko Job on the cluster built this in about 15 minutes and pushed it to my local container registry. The same cluster that runs inference also builds its own inference server. No external CI needed.
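The build Job itself is short. This is a sketch rather than my exact manifest: the Job name and how the Dockerfile reaches the pod (here, a mount at Kaniko's default `/workspace` context dir, e.g. from a ConfigMap or a git-sync init container, omitted for brevity) are assumptions; the `--dockerfile`, `--context`, and `--destination` flags are standard Kaniko executor flags, and the destination matches the image used in the deploy step.

```yaml
# Sketch of the on-cluster Kaniko build Job. Registry credentials
# (a mounted docker config secret) are omitted.
apiVersion: batch/v1
kind: Job
metadata:
  name: llama-server-gemma4-build
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: kaniko
        image: gcr.io/kaniko-project/executor:latest
        args:
        - --dockerfile=Dockerfile
        - --context=dir:///workspace   # Dockerfile mounted here
        - --destination=registry.defilan.net/llama-server-latest:gemma4
```

Once the Job completes, the image is in the local registry and every node on the cluster can pull it.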
Step 3: Deploy
llmkube deploy gemma4-26b --gpu --accelerator cuda --gpu-count 2 \
--source https://huggingface.co/Trilogix1/gemma-4-26B-A4B-it-GGUF/resolve/main/gemma-4-26B-A4B-it-Q4_K_M.gguf \
--image registry.defilan.net/llama-server-latest:gemma4 \
--flash-attn --jinja --context 32768
The model is 15.6 GB at Q4_K_M. With both GPUs, that leaves about 16 GB for KV cache. Plenty for 32K context.
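The VRAM budget is simple arithmetic from the numbers above:

```python
# Back-of-the-envelope VRAM budget for the dual-GPU deployment.
total_vram_gb = 2 * 16.0   # two RTX 5060 Ti cards
model_gb = 15.6            # Q4_K_M weights, split across both GPUs

headroom_gb = total_vram_gb - model_gb
print(f"headroom for KV cache + overhead: {headroom_gb:.1f} GB")
# With the weights split roughly evenly, each card holds ~7.8 GB of
# weights and keeps ~8 GB free for its share of the 32K-token KV cache.
```

The actual KV cache footprint depends on the model's layer count and head dimensions, but with ~16 GB of headroom there's comfortable margin at 32K context.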
The operator downloaded the model, created the Deployment with the right GPU flags, set up health probes, and exposed an OpenAI-compatible endpoint. From the deploy command to the first inference request took about 3 minutes, most of it model download time.
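Because the endpoint speaks the standard OpenAI chat-completions format, any OpenAI-compatible client works against it. Here's a minimal stdlib-only sketch; the service URL and port are assumptions (substitute whatever address the operator exposes in your cluster):

```python
import json
from urllib import request

# Assumed in-cluster address; replace with your actual Service endpoint.
URL = "http://gemma4-26b.default.svc.cluster.local:8080/v1/chat/completions"

def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    # Standard OpenAI-style chat-completions request body.
    return {
        "model": "gemma4-26b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt: str) -> str:
    req = request.Request(
        URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    # Extract the assistant message from the first choice.
    return body["choices"][0]["message"]["content"]
```

Point an existing tool's `base_url` at the service and it just works; no SDK changes needed.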
The Numbers
Single Request
| Metric | Value |
|---|---|
| Generation | 96 tok/s |
| Prompt processing | 128 tok/s |
| Model size (Q4_K_M) | 15.6 GB |
| Active parameters per token | 4B (MoE) |
Under Load (4 concurrent workers, 2 minutes)
| Metric | Value |
|---|---|
| Aggregate throughput | 170 tok/s |
| Total requests | 110 |
| Error rate | 0% |
| P50 latency | ~2s per request |
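The concurrency numbers came from a small driver script along these lines. This is a sketch, not my actual harness: the request function is left pluggable (in practice it POSTs to the endpoint and returns the completion's token count), and the helper names are mine.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_load(send, workers=4, duration_s=120):
    """Drive `send` (a callable returning tokens generated per request)
    from `workers` threads for `duration_s` seconds; return aggregate stats."""
    deadline = time.monotonic() + duration_s
    results = []  # (elapsed_s, tokens) per request; list.append is thread-safe

    def worker():
        while time.monotonic() < deadline:
            t0 = time.monotonic()
            tokens = send()
            results.append((time.monotonic() - t0, tokens))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(worker)
        # exiting the with-block waits for all workers to finish

    total_tokens = sum(tok for _, tok in results)
    return {
        "requests": len(results),
        "tokens_per_s": total_tokens / duration_s,
    }
```

Aggregate throughput is simply total tokens generated divided by wall-clock duration, which is how the 170 tok/s figure is computed.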
For context, the generic benchmarks floating around say Gemma 4 26B-A4B "exceeds 40 tok/s on consumer hardware." We're doing 96 tok/s on a single request and 170 tok/s aggregate under concurrent load. The dual-GPU split and the MoE architecture (only 4B parameters active per token) make this model surprisingly fast.
Real Coding Benchmarks
I didn't just run "hello world" tests. I fed it actual bug reports from my own project and asked it to generate fixes.
Bug: GPU Rolling Update Deadlock
The issue: Kubernetes rolling updates deadlock on GPU workloads because the new pod can't schedule (old pod holds GPUs) and the old pod won't terminate (waiting for new pod to be Ready).
Gemma 4's response: correctly identified that GPU workloads should use Recreate strategy instead of RollingUpdate, with a conditional check on GPU count. Showed the chain-of-thought reasoning, considered edge cases, and verified against the pattern before outputting.
Time: 10.6 seconds for a 1024-token response including the full reasoning chain.
Bug: Stale Endpoints After Deletion
The issue: deleting an InferenceService leaves orphaned Kubernetes Endpoints.
Gemma 4's response: generated a complete UnregisterEndpoint method with DNS name sanitization, Service and Endpoints deletion, NotFound error handling, and logging. Production-quality Go code on the first try.
Time: 11.1 seconds.
Code Generation: Ginkgo BDD Tests
I asked it to write tests following an existing pattern in the codebase. It generated 4 correct test cases with BeforeEach setup, proper assertions, and the right Gomega matchers. Used ContainElements for present checks and NotTo(ContainElement()) for absent checks, matching the exact conventions from the rest of the test suite.
Time: 12.3 seconds.
What This Actually Means
I'm not claiming Gemma 4 replaces Claude or GPT-4. It doesn't. The reasoning is shallower on complex multi-step problems, and it occasionally cuts off mid-response at the token limit.
What I am claiming: the gap between "Google releases a new model" and "it's running on your hardware fixing real bugs" has shrunk to hours, not weeks. The pieces are:
- GGUF quantization appears on HuggingFace within hours of a model release
- llama.cpp HEAD usually has architecture support on day one (the tokenizer and template fixes were already committed)
- Kaniko or similar tools let you build from source on-cluster without a separate CI pipeline
- A Kubernetes operator (in my case, LLMKube) lets you deploy with one command and get health checks, metrics, and an OpenAI-compatible API
This is the same workflow regardless of whether the model is Gemma 4, Qwen3.5, Llama, or whatever ships next week. The infrastructure is model-agnostic.
The Hardware Math
This entire setup cost about $2,400:
- 2x RTX 5060 Ti: ~$800
- Ryzen 9 7900X + motherboard + RAM + SSD + case + PSU: ~$1,600
Running 24/7, the system draws about 50-60W idle and 500-600W under full inference load. At $0.12/kWh, that's roughly $30-50/month in electricity for unlimited inference.
Compare to API costs: at OpenAI's pricing for a comparable model, 110 requests in 2 minutes would cost roughly $5-10. Scale that to continuous use and the hardware pays for itself in a month or two.
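The electricity figure checks out with simple arithmetic. The break-even comparison is sensitive to which API rate you assume, so I've left it as a function to plug your own numbers into rather than hard-coding a price:

```python
# Worst-case electricity cost: full inference load, 24/7.
load_kw = 0.55          # midpoint of the 500-600 W full-load range
rate = 0.12             # $/kWh
hours = 24 * 30
monthly_cost = load_kw * hours * rate
print(f"~${monthly_cost:.0f}/month at continuous full load")
# Idle draw (50-60 W) brings a mixed-duty-cycle month down toward $30.

def break_even_months(hardware_cost, api_cost_per_month, power_cost_per_month):
    # Months until hardware spend is offset by avoided API spend.
    return hardware_cost / (api_cost_per_month - power_cost_per_month)
```

For example, if your comparable API usage would run $250/month, $2,400 of hardware minus ~$50/month of power pays back in about a year; heavier usage shortens that considerably.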
Try It
LLMKube is open source (Apache 2.0): github.com/defilantech/llmkube
If you have a GPU and a Kubernetes cluster (even a single-node K3s or MicroK8s), you can deploy any GGUF model with:
helm install llmkube llmkube/llmkube
llmkube deploy llama-3.1-8b --gpu
For Gemma 4 specifically, you'll need a custom llama.cpp image until the official builds ship with gemma4 architecture support. The Dockerfile above works.
Benchmarks run on April 2, 2026 on ShadowStack (2x RTX 5060 Ti, 32GB VRAM, Blackwell SM 12.0, CUDA 13.1, driver 590.48.01). Gemma 4 26B-A4B-it Q4_K_M via llama.cpp built from HEAD commit f851fa5a.