DEV Community

Agbo, Daniel Onuoha

Running, Building, and Optimizing with Gemma 4

Gemma 4 isn't just an open-weight model you download. It's a toolkit for running AI fully offline, building real generative applications, and squeezing maximum performance out of whatever hardware you have. This guide covers the practical side: local setup, offline app architecture, GenAI application patterns, and performance tuning.

Running Gemma 4 Locally

The fastest path to running Gemma 4 on your own machine is through Ollama, which wraps quantized GGUF weights in a simple CLI and local API — no GPU required for smaller models.

# Install Ollama, then pull a Gemma 4 variant
ollama pull gemma4:e4b      # edge-friendly, multimodal
ollama pull gemma4:26b      # MoE, faster decode
ollama pull gemma4:31b      # dense, max quality

# Run interactively
ollama run gemma4:e4b "Summarize this transaction log"

# Or hit the local REST API
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:e4b",
  "prompt": "roses are red"
}'
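
The same REST call can be made from application code. The sketch below uses only the Python standard library and assumes Ollama is listening on its default port, as in the curl example. The helper names `build_generate_payload` and `generate` are illustrative, not part of any Ollama SDK. Setting `"stream": false` asks the endpoint for a single JSON object instead of a stream of chunks.

```python
import json
import urllib.request

# Default local Ollama endpoint, same as the curl example.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for /api/generate with streaming turned off."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send one prompt to the local Ollama server and return the generated text."""
    body = json.dumps(build_generate_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # With "stream": false the reply is one JSON object with a "response" field.
        return json.loads(response.read())["response"]
```

With the server running and the model pulled, `generate("gemma4:e4b", "roses are red")` should return the completion as a string.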

For raw inference speed rather than convenience, llama.cpp with CUDA acceleration outperforms Ollama's wrapper — benchmarks on an NVIDIA DGX Spark show llama.cpp hitting ~65 tokens/sec versus ~60 tok/s for Ollama and ~45 tok/s for vLLM's NVFP4 path on the 26B MoE model. vLLM remains the better choice once you need a production-grade OpenAI-compatible API server with batching, rather than a single local session.

| Tool | Best For | Notes |
|---|---|---|
| Ollama | Quick local setup, prototyping | GGUF quantized, simplest CLI/API |
| llama.cpp | Raw inference speed | Direct CUDA/Metal control, no wrapper overhead |
| vLLM | Production API serving | Batching, concurrent requests, OpenAI-compatible |
| MLX | Apple Silicon | Native Metal acceleration on Mac |

On Apple Silicon, performance scales directly with unified memory: an M1 8GB Mac manages roughly 12 tokens/sec on a 12B Q4 model, while an M4 Max 48GB comfortably runs larger models at ~35 tokens/sec.

Building Offline and Low-Connectivity AI Applications with Gemma

Gemma's on-device sizes (E2B and E4B) are purpose-built for environments with unreliable or absent connectivity — a pattern directly relevant to logistics and fintech deployments in areas with inconsistent network coverage.

  • Use Google AI Edge Gallery or AICore on Android to embed E2B/E4B directly into a mobile app, with no server round-trip required
  • Target edge boards like NVIDIA Jetson Orin Nano for IoT and drone-based systems that must reason locally (e.g., an Atoovis-style delivery drone classifying obstacles without a live connection)
  • Quantize aggressively (Q4_K_M or IQ4_XS) so the model fits in constrained RAM on field devices, trading a small quality drop for a much smaller memory footprint
  • Design a "store-and-forward" pattern: the local model handles inference and decision-making offline, then syncs logs or embeddings to your backend (e.g., MongoDB Atlas) once connectivity returns
  • For voice-driven offline use cases like field agent check-ins, use E2B/E4B's native audio input instead of a separate speech-to-text pipeline, reducing both latency and points of failure

This architecture matters for fintech agent apps operating in rural Nigeria or similar low-bandwidth regions, where a cloud-dependent LLM call would simply fail rather than degrade gracefully.
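
The store-and-forward pattern from the list above can be sketched with a small on-device queue. This is one possible shape, not a prescribed design: the `OutboxQueue` class, the `outbox` table, and the `send` callback are all hypothetical names, and SQLite stands in for whatever local storage the device offers. The `send` callback is where a real app would post to its backend.

```python
import json
import sqlite3
from typing import Callable


class OutboxQueue:
    """Durable local queue: records are stored on-device and forwarded later."""

    def __init__(self, path: str = "outbox.db") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )
        self.db.commit()

    def store(self, record: dict) -> None:
        """Persist one inference result or decision while offline."""
        self.db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(record),))
        self.db.commit()

    def pending(self) -> int:
        """Number of records still waiting to be synced."""
        return self.db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]

    def forward(self, send: Callable[[dict], bool]) -> int:
        """Send queued records oldest-first and stop at the first failure.

        A record is deleted only after `send` reports success, so a dropped
        connection leaves the remainder queued for the next attempt.
        """
        sent = 0
        rows = self.db.execute("SELECT id, payload FROM outbox ORDER BY id").fetchall()
        for row_id, payload in rows:
            try:
                ok = send(json.loads(payload))
            except OSError:  # covers ConnectionError, timeouts, and similar network errors
                ok = False
            if not ok:
                break
            self.db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
            self.db.commit()
            sent += 1
        return sent
```

Because a record can be sent successfully and then fail to delete if the app crashes in between, the backend should treat uploads as idempotent, for example by keying on a client-generated ID.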

Building Generative AI Applications with Gemma 4

Beyond chatbots, Gemma 4's function-calling and structured JSON output make it suitable as a reasoning layer inside existing backend systems rather than a bolted-on feature.

  • Agentic backend services: wire Gemma 4's function-calling directly into your Node.js/Express routes so the model decides which internal API to call (e.g., checking a Kredi Bank balance or triggering a VTU top-up) instead of parsing free-text intent yourself
  • Document and receipt processing: feed scanned invoices or bank statements through Gemma 4's vision input for OCR plus structured extraction, replacing brittle regex-based parsers
  • Local coding assistants: run Gemma 4 inside tools like OpenCode or via llama.cpp to get an offline pair-programmer for sensitive codebases you don't want sent to a third-party API
  • RAG pipelines: combine Gemma 4 with LangChain and a local vector store to build a retrieval-augmented assistant over your own documentation or transaction history, entirely self-hosted
  • Multilingual support apps: leverage Gemma 4's 140+ language coverage to serve customer support or bill-payment flows in local languages without a separate translation layer
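
As a rough illustration of the agentic backend idea in the list above, the sketch below routes a structured tool call to an internal function. It assumes the model has been prompted to emit a JSON object of the form `{"name": ..., "arguments": {...}}`; that shape, the `TOOLS` registry, and both handler functions are hypothetical stand-ins rather than Gemma 4's or any framework's actual interface.

```python
import json
from typing import Callable, Dict


# Hypothetical internal APIs; real handlers would call your own services.
def check_balance(account_id: str) -> dict:
    return {"account_id": account_id, "balance": 1500.0}


def vtu_top_up(phone: str, amount: float) -> dict:
    return {"phone": phone, "amount": amount, "status": "queued"}


# Allow-list of functions the model may invoke.
TOOLS: Dict[str, Callable[..., dict]] = {
    "check_balance": check_balance,
    "vtu_top_up": vtu_top_up,
}


def dispatch_tool_call(raw_model_output: str) -> dict:
    """Parse a JSON tool call and run the matching handler, or return an error dict."""
    try:
        call = json.loads(raw_model_output)
    except json.JSONDecodeError:
        return {"error": "model output was not valid JSON"}
    if not isinstance(call, dict):
        return {"error": "tool call must be a JSON object"}

    name = call.get("name")
    arguments = call.get("arguments", {})
    handler = TOOLS.get(name) if isinstance(name, str) else None
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    if not isinstance(arguments, dict):
        return {"error": "arguments must be a JSON object"}
    try:
        return handler(**arguments)
    except TypeError as exc:  # missing or unexpected argument names
        return {"error": f"bad arguments for {name}: {exc}"}
```

Returning an error dict instead of raising lets the calling route feed the failure back to the model for another attempt, and the allow-list keeps the model from reaching anything you did not register.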

A minimal Gradio-based coding assistant, for example, pairs Ollama's local API with a simple web UI to demo tool-calling and live code editing in an afternoon — a useful pattern for internal developer tools.

Optimizing Gemma for Performance and Efficiency

Most "Gemma is slow" complaints trace back to three fixable issues: CPU fallback, the wrong quantization, and an oversized context window.

Check GPU utilization first. If GPU usage stays at 0% during inference, the model silently fell back to CPU — expect only 1-5 tokens/sec until this is fixed.
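
On NVIDIA hardware, one way to automate that check is to poll `nvidia-smi` while a generation is in flight. The sketch below assumes `nvidia-smi` is on the PATH; the function names are illustrative. It only reads utilization and does not diagnose why a fallback happened.

```python
import subprocess
from typing import List


def parse_gpu_utilization(csv_output: str) -> List[int]:
    """Parse one utilization percentage per line, skipping blanks and non-numeric values."""
    values = []
    for line in csv_output.splitlines():
        text = line.strip()
        if text.isdigit():
            values.append(int(text))
    return values


def read_gpu_utilization() -> List[int]:
    """Query current utilization (percent) for every visible NVIDIA GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_utilization(result.stdout)


def looks_like_cpu_fallback(utilization: List[int]) -> bool:
    """True when no GPU shows any load; only meaningful while inference is running."""
    return all(value == 0 for value in utilization)
```

Call `looks_like_cpu_fallback(read_gpu_utilization())` a second or two after starting a prompt; a `True` result during active generation suggests the model is running on the CPU.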

Pick quantization deliberately:

| Quantization | Size (12B) | Speed | Quality | Best For |
|---|---|---|---|---|
| Q4_K_M | ~7 GB | Fastest | Good | Daily use, most tasks |
| Q5_K_M | ~8.5 GB | Fast | Better | When quality matters |
| Q6_K | ~10 GB | Medium | Very good | Balanced |
| Q8_0 | ~13 GB | Slow | Near-original | Quality-critical tasks |
| FP16 | ~24 GB | Slowest | Original | Only with ample VRAM |
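
The table can be turned into a simple selection rule: take the highest-quality quantization whose weights fit in VRAM with some room to spare. The sizes below are the approximate 12B figures from the table; the 2 GB default headroom for the KV cache and runtime overhead is an assumed placeholder to tune for your own setup.

```python
from typing import Optional

# Approximate 12B model sizes in GB from the table, ordered from lowest to highest quality.
QUANT_SIZES_GB = [
    ("Q4_K_M", 7.0),
    ("Q5_K_M", 8.5),
    ("Q6_K", 10.0),
    ("Q8_0", 13.0),
    ("FP16", 24.0),
]


def pick_quantization(vram_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the highest-quality quantization that fits, or None if nothing does."""
    budget = vram_gb - headroom_gb
    best = None
    for name, size_gb in QUANT_SIZES_GB:
        if size_gb <= budget:
            best = name  # later entries are higher quality, so keep overwriting
    return best
```

Under these assumptions a 12 GB card lands on Q6_K, while an 8 GB card returns `None`, which is a signal to pick a smaller model or reduce the headroom by shortening the context.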

Watch context length, because it isn't free. VRAM use climbs and speed drops sharply as context grows: a 12B Q4 model runs at full speed with a 2K context but drops to roughly a quarter of that speed at 256K context, consuming 30GB+ of VRAM in the process.
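
To see why context is expensive, it helps to estimate the key-value cache directly. The sketch below uses the standard formula for a plain transformer with full attention in every layer. The layer count, KV head count, and head dimension in the usage note are placeholder values, not Gemma 4's real dimensions, and models that use sliding-window attention in some layers will need less than this upper bound.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 2 bytes for FP16 cache entries
) -> int:
    """Estimate KV cache size assuming full attention in every layer."""
    # Two tensors (keys and values) per layer, one head_dim vector per KV head per token.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB for easier reading."""
    return num_bytes / 1024**3
```

With the placeholder shape of 32 layers, 4 KV heads, and a head dimension of 128, the cache costs 64 KiB per token: 0.5 GiB at an 8K context and 16 GiB at 256K. The growth is linear in context length, which is why capping the window is usually the cheapest memory saving available.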

Manage the KV cache. Long-running conversations accumulate key-value cache that eats VRAM over time — reset sessions periodically or cap the cache size for long-lived services.
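
One way to cap cache growth from the application side is to trim the conversation before each request, so the prompt never exceeds a fixed token budget. The sketch below keeps a leading system message and the most recent turns. The `role`/`content` message shape follows the common chat convention, and the four-characters-per-token estimate is a crude assumption that a real tokenizer should replace.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Crude heuristic (about 4 characters per token); swap in a real tokenizer for accuracy."""
    return max(1, len(text) // 4)


def trim_history(messages: List[Dict[str, str]], max_tokens: int) -> List[Dict[str, str]]:
    """Keep a leading system message (if any) plus the newest messages that fit the budget."""
    system = messages[:1] if messages and messages[0]["role"] == "system" else []
    rest = messages[len(system):]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept: List[Dict[str, str]] = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return system + kept[::-1]  # restore chronological order
```

Dropping the oldest turns loses information, so long-lived services sometimes replace them with a short summary instead; that variation is outside this sketch.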

Use Quantization-Aware Training (QAT) checkpoints where available. Google has released QAT versions of Gemma 4 that preserve much more quality at int4 precision than post-training quantization alone, making it realistic to run a 12B model with a 16K context window on as little as 8GB of VRAM, even on older GPUs like a GTX 1080 Ti.

Mixture-of-Experts changes the math. The 26B MoE model activates only ~3.8B parameters per token versus 30B+ for the dense model, delivering roughly 6x faster decode speed at comparable quality — a strong default choice when latency matters more than peak raw capability.

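# Example: capping context at 8K tokens for a responsive local API (num_ctx is Ollama's context-length option)
curl http://localhost:11434/api/generate -d '{"model": "gemma4:e4b", "prompt": "roses are red", "options": {"num_ctx": 8192}}'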
