Running, Building, and Optimizing with Gemma 4

#ai #machinelearning #tutorial #gemma

Gemma 4 isn't just an open-weight model you download; it's a toolkit for running AI fully offline, building real generative applications, and squeezing maximum performance out of whatever hardware you have. This guide covers the practical side: local setup, offline app architecture, GenAI application patterns, and performance tuning.

Running Gemma 4 Locally

The fastest path to running Gemma 4 on your own machine is through Ollama, which wraps quantized GGUF weights in a simple CLI and local API — no GPU required for smaller models.

# Install Ollama, then pull a Gemma 4 variant
ollama pull gemma4:e4b      # edge-friendly, multimodal
ollama pull gemma4:26b      # MoE, faster decode
ollama pull gemma4:31b      # dense, max quality

# Run interactively
ollama run gemma4:e4b "Summarize this transaction log"

# Or hit the local REST API
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:e4b",
  "prompt": "roses are red"
}'
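
The same REST call can be made from application code. Below is a minimal Python sketch using only the standard library. The helper names `build_generate_payload` and `generate` are made up for illustration; `stream` and `options.num_ctx` are fields of Ollama's generate endpoint, and the 120-second timeout is an arbitrary assumption.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # same local endpoint as the curl example


def build_generate_payload(model, prompt, num_ctx=None):
    """Build the JSON body for /api/generate, asking for one non-streamed reply."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if num_ctx is not None:
        # num_ctx is Ollama's context-window option; omit it to use the server default
        payload["options"] = {"num_ctx": num_ctx}
    return payload


def generate(model, prompt, num_ctx=None, url=OLLAMA_URL, timeout=120):
    """POST the payload to a running Ollama server and return the generated text."""
    body = json.dumps(build_generate_payload(model, prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running, `generate("gemma4:e4b", "roses are red")` should return the completion as a single string.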

For raw inference speed rather than convenience, llama.cpp with CUDA acceleration outperforms Ollama's wrapper: benchmarks on an NVIDIA DGX Spark show llama.cpp reaching ~65 tokens/sec on the 26B MoE model, versus ~60 tokens/sec for Ollama and ~45 tokens/sec for vLLM's NVFP4 path. vLLM remains the better choice once you need a production-grade, OpenAI-compatible API server with batching rather than a single local session.

Tool	Best For	Notes
Ollama	Quick local setup, prototyping	GGUF quantized, simplest CLI/API
llama.cpp	Raw inference speed	Direct CUDA/Metal control, no wrapper overhead
vLLM	Production API serving	Batching, concurrent requests, OpenAI-compatible
MLX	Apple Silicon	Native Metal acceleration on Mac

On Apple Silicon, performance tracks unified memory, since capacity limits which models fit and bandwidth bounds decode speed: an M1 8GB Mac manages roughly 12 tokens/sec on a 12B Q4 model, while an M4 Max 48GB comfortably runs larger models at ~35 tokens/sec.

Building Offline and Low-Connectivity AI Applications with Gemma

Gemma's on-device sizes (E2B and E4B) are purpose-built for environments with unreliable or absent connectivity — a pattern directly relevant to logistics and fintech deployments in areas with inconsistent network coverage.

- Use Google AI Edge Gallery or AICore on Android to embed E2B/E4B directly into a mobile app, with no server round-trip required
- Target edge boards like NVIDIA Jetson Orin Nano for IoT and drone-based systems that must reason locally (e.g., an Atoovis-style delivery drone classifying obstacles without a live connection)
- Quantize aggressively (Q4_K_M or IQ4_XS) so the model fits in constrained RAM on field devices, trading a small quality drop for a much smaller memory footprint
- Design a "store-and-forward" pattern: the local model handles inference and decision-making offline, then syncs logs or embeddings to your backend (e.g., MongoDB Atlas) once connectivity returns
- For voice-driven offline use cases like field agent check-ins, use E2B/E4B's native audio input instead of a separate speech-to-text pipeline, reducing both latency and points of failure

This architecture matters for fintech agent apps operating in rural Nigeria or similar low-bandwidth regions, where a cloud-dependent LLM call would simply fail rather than degrade gracefully.
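
The store-and-forward pattern above can be sketched as a small durable queue. This is one possible shape, not a prescribed design: the `Outbox` class, its table layout, and the caller-supplied `send` function are all hypothetical, and SQLite is just an assumed choice of on-device storage.

```python
import json
import sqlite3
import time


class Outbox:
    """Durable local queue: record results offline, forward them when online."""

    def __init__(self, path="outbox.db"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "created_at REAL NOT NULL, "
            "payload TEXT NOT NULL)"
        )
        self.db.commit()

    def store(self, record):
        """Persist one inference result (any JSON-serializable dict)."""
        self.db.execute(
            "INSERT INTO outbox (created_at, payload) VALUES (?, ?)",
            (time.time(), json.dumps(record)),
        )
        self.db.commit()

    def pending(self):
        """Number of records still waiting to be synced."""
        return self.db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]

    def forward(self, send):
        """Send queued records oldest-first, stopping at the first failure.

        `send` is a caller-supplied function that returns True on success.
        A record is deleted only after `send` confirms it, so nothing is
        lost if the connection drops mid-sync.
        """
        sent = 0
        rows = self.db.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
        for row_id, payload in rows:
            if not send(json.loads(payload)):
                break
            self.db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
            self.db.commit()
            sent += 1
        return sent
```

The app calls `store()` after each local inference and `forward()` whenever connectivity returns. Delivery is at-least-once, so the backend should tolerate an occasional duplicate.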

Building Generative AI Applications with Gemma 4

Beyond chatbots, Gemma 4's function-calling and structured JSON output make it suitable as a reasoning layer inside existing backend systems rather than a bolted-on feature.

- Agentic backend services: wire Gemma 4's function-calling directly into your Node.js/Express routes so the model decides which internal API to call (e.g., checking a Kredi Bank balance or triggering a VTU top-up) instead of parsing free-text intent yourself
- Document and receipt processing: feed scanned invoices or bank statements through Gemma 4's vision input for OCR plus structured extraction, replacing brittle regex-based parsers
- Local coding assistants: run Gemma 4 inside tools like OpenCode or via llama.cpp to get an offline pair-programmer for sensitive codebases you don't want sent to a third-party API
- RAG pipelines: combine Gemma 4 with LangChain and a local vector store to build a retrieval-augmented assistant over your own documentation or transaction history, entirely self-hosted
- Multilingual support apps: leverage Gemma 4's 140+ language coverage to serve customer support or bill-payment flows in local languages without a separate translation layer
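
For the agentic backend pattern, the core piece is a dispatcher that maps the model's tool calls onto an allow-list of internal functions. The sketch below is in Python, though the same logic carries over to an Express route. It assumes the Ollama-style response shape, where `message.tool_calls` holds entries with `function.name` and `function.arguments`. The `get_balance` and `top_up` functions are hypothetical stand-ins with made-up return values.

```python
import json


# Hypothetical internal APIs; a real service would call its own backend here.
def get_balance(account_id):
    return {"account_id": account_id, "balance": 1500.0}


def top_up(phone, amount):
    return {"phone": phone, "amount": amount, "status": "queued"}


# Allow-list of functions the model is permitted to invoke.
TOOLS = {"get_balance": get_balance, "top_up": top_up}


def dispatch_tool_calls(message):
    """Run each tool call in a chat response message and collect the results.

    Assumed shape:
    {"tool_calls": [{"function": {"name": ..., "arguments": {...}}}]}
    """
    results = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        name = function["name"]
        arguments = function.get("arguments") or {}
        if isinstance(arguments, str):
            # Assumption: some servers return arguments as a JSON string
            arguments = json.loads(arguments)
        if name not in TOOLS:
            results.append({"tool": name, "error": "unknown tool"})
            continue
        try:
            results.append({"tool": name, "result": TOOLS[name](**arguments)})
        except TypeError as exc:
            # The model supplied missing or unexpected arguments
            results.append({"tool": name, "error": str(exc)})
    return results
```

Keeping the allow-list explicit means a hallucinated tool name turns into an error entry you can feed back to the model, rather than an arbitrary function call.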

A minimal Gradio-based coding assistant, for example, pairs Ollama's local API with a simple web UI to demo tool-calling and live code editing in an afternoon — a useful pattern for internal developer tools.

Optimizing Gemma for Performance and Efficiency

Most "Gemma is slow" complaints trace back to three fixable issues: CPU fallback, the wrong quantization, and an oversized context window.

Check GPU utilization first. If GPU usage stays at 0% during inference, the model silently fell back to CPU — expect only 1-5 tokens/sec until this is fixed.
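
One way to put a number on this is to compute decode throughput from the timing fields Ollama includes in a finished `/api/generate` response: `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The sketch below is a rough heuristic. The function names are made up, and the 5 tokens/sec threshold is simply the upper end of the range quoted above. Low throughput alone doesn't prove a CPU fallback, so treat a hit as a prompt to check GPU utilization (`ollama ps` also reports where the loaded model sits).

```python
def decode_tokens_per_second(response):
    """Decode throughput from a final (done) /api/generate response.

    Ollama reports `eval_count` (generated tokens) and `eval_duration`
    (nanoseconds spent generating them).
    """
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / (duration_ns / 1e9)


def looks_like_cpu_fallback(response, threshold=5.0):
    """Flag throughput at or below the 1-5 tokens/sec range typical of CPU fallback."""
    return decode_tokens_per_second(response) <= threshold
```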

Pick quantization deliberately:

Quantization	Size (12B)	Speed	Quality	Best For
Q4_K_M	~7 GB	Fastest	Good	Daily use, most tasks
Q5_K_M	~8.5 GB	Fast	Better	When quality matters
Q6_K	~10 GB	Medium	Very good	Balanced
Q8_0	~13 GB	Slow	Near-original	Quality-critical tasks
FP16	~24 GB	Slowest	Original	Only with ample VRAM
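
The sizes in the table follow from simple arithmetic: parameters times bits per weight, divided by eight. Here is a back-of-envelope sketch. The bits-per-weight figures are approximate, assumed values, because the K-quants mix precisions across tensors, so real file sizes vary by architecture.

```python
# Approximate bits per weight for common GGUF quantizations.
# Assumed, rounded values: K-quants mix precisions, so real files vary.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def estimate_weights_gb(params_billions, quant):
    """Estimated weight size in GB (10^9 bytes), excluding KV cache and runtime overhead."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


for quant in BITS_PER_WEIGHT:
    print(f"{quant:7s} ~{estimate_weights_gb(12, quant):.1f} GB")
```

For a 12B model these estimates land close to the table's figures. Budget extra memory on top for the KV cache and runtime buffers.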

Watch context length; it isn't free. VRAM use climbs and speed drops sharply as context grows: a 12B Q4 model runs at full speed with a 2K context but drops to roughly a quarter of that speed at 256K context, consuming 30GB+ of VRAM in the process.

Manage the KV cache. Long-running conversations accumulate key-value cache that eats VRAM over time — reset sessions periodically or cap the cache size for long-lived services.
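
To see why context and KV cache cost so much, it helps to estimate the cache size directly. The sketch below uses the standard formula for a full-attention cache. The layer, head, and dimension counts are hypothetical values chosen for illustration, not Gemma's actual architecture.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Size of a full-attention KV cache for one sequence.

    Each layer stores one key and one value vector (hence the factor of 2)
    per KV head per token. bytes_per_value=2 corresponds to FP16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical architecture for illustration only, not Gemma's real dimensions.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for ctx in (2_048, 8_192, 32_768, 131_072):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.2f} GiB")
```

The cache grows linearly with context, so each doubling of the window doubles its memory. Architectures that use sliding-window attention on some layers, or a quantized cache, need less than this estimate.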

Use Quantization-Aware Training (QAT) checkpoints where available. Google has released QAT versions of Gemma 4 that preserve much more quality at int4 precision than post-training quantization alone, making it realistic to run a 12B model with a 16K context window on as little as 8GB of VRAM, even on older GPUs like a GTX 1080 Ti.

Mixture-of-Experts changes the math. The 26B MoE model activates only ~3.8B parameters per token versus 30B+ for the dense model, delivering roughly 6x faster decode speed at comparable quality — a strong default choice when latency matters more than peak raw capability.

# Example: capping context at 8K tokens for a responsive local API
# (set on the server; inside an `ollama run` session, use: /set parameter num_ctx 8192)
OLLAMA_CONTEXT_LENGTH=8192 ollama serve
