Gemma 4 isn't just an open-weight model you download. It's a toolkit for running AI fully offline, building real generative applications, and squeezing maximum performance out of whatever hardware you have. This guide covers the practical side: local setup, offline app architecture, GenAI application patterns, and performance tuning.
Running Gemma 4 Locally
The fastest path to running Gemma 4 on your own machine is through Ollama, which wraps quantized GGUF weights in a simple CLI and local API — no GPU required for smaller models.
```bash
# Install Ollama, then pull a Gemma 4 variant
ollama pull gemma4:e4b   # edge-friendly, multimodal
ollama pull gemma4:26b   # MoE, faster decode
ollama pull gemma4:31b   # dense, max quality

# Run interactively
ollama run gemma4:e4b "Summarize this transaction log"

# Or hit the local REST API
curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:e4b",
  "prompt": "roses are red"
}'
```
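The same endpoint can be called from application code. The sketch below uses only the Python standard library and assumes an Ollama server is listening on the default port shown above. The helper names `build_generate_request` and `generate` are illustrative, not part of any Ollama client library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint used above


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]


# Usage (requires a running Ollama server with the model already pulled):
# print(generate("gemma4:e4b", "roses are red"))
```

Setting `"stream": False` returns one JSON object instead of a stream of chunks, which keeps the parsing trivial for scripts and batch jobs.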
For raw inference speed rather than convenience, llama.cpp with CUDA acceleration outperforms Ollama's wrapper: benchmarks on an NVIDIA DGX Spark show llama.cpp at ~65 tokens/sec, versus ~60 tokens/sec for Ollama and ~45 tokens/sec for vLLM's NVFP4 path, all on the 26B MoE model. vLLM remains the better choice once you need a production-grade OpenAI-compatible API server with batching, rather than a single local session.
| Tool | Best For | Notes |
|---|---|---|
| Ollama | Quick local setup, prototyping | GGUF quantized, simplest CLI/API |
| llama.cpp | Raw inference speed | Direct CUDA/Metal control, no wrapper overhead |
| vLLM | Production API serving | Batching, concurrent requests, OpenAI-compatible |
| MLX | Apple Silicon | Native Metal acceleration on Mac |
On Apple Silicon, unified memory is the main constraint: capacity decides which models fit, and memory bandwidth largely sets decode speed. An M1 8GB Mac manages roughly 12 tokens/sec on a 12B Q4 model, while an M4 Max 48GB comfortably runs larger models at ~35 tokens/sec.
Building Offline and Low-Connectivity AI Applications with Gemma
Gemma's on-device sizes (E2B and E4B) are purpose-built for environments with unreliable or absent connectivity — a pattern directly relevant to logistics and fintech deployments in areas with inconsistent network coverage.
- Use Google AI Edge Gallery or AICore on Android to embed E2B/E4B directly into a mobile app, with no server round-trip required
- Target edge boards like NVIDIA Jetson Orin Nano for IoT and drone-based systems that must reason locally (e.g., an Atoovis-style delivery drone classifying obstacles without a live connection)
- Quantize aggressively (Q4_K_M or IQ4_XS) so the model fits in constrained RAM on field devices, trading a small quality drop for a much smaller memory footprint
- Design a "store-and-forward" pattern: the local model handles inference and decision-making offline, then syncs logs or embeddings to your backend (e.g., MongoDB Atlas) once connectivity returns
- For voice-driven offline use cases like field agent check-ins, use E2B/E4B's native audio input instead of a separate speech-to-text pipeline, reducing both latency and points of failure
This architecture matters for fintech agent apps operating in rural Nigeria or similar low-bandwidth regions, where a cloud-dependent LLM call would simply fail rather than degrade gracefully.
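The store-and-forward pattern from the list above can be sketched with a local SQLite outbox. This is a minimal illustration under stated assumptions, not a production sync layer: the class name, table layout, and the `send` callback are all hypothetical, and a real uploader (for example one writing to MongoDB Atlas) would replace `send`.

```python
import json
import sqlite3
from typing import Callable


class StoreAndForwardQueue:
    """Persist inference results locally and forward them once a network is available."""

    def __init__(self, db_path: str = "outbox.db") -> None:
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS outbox ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
        )
        self.conn.commit()

    def store(self, record: dict) -> None:
        """Save one record durably; safe to call while fully offline."""
        self.conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)", (json.dumps(record),)
        )
        self.conn.commit()

    def pending(self) -> int:
        """Number of records still waiting to be forwarded."""
        return self.conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]

    def forward(self, send: Callable[[dict], bool]) -> int:
        """Send queued records oldest-first and stop at the first failure.

        `send` is a hypothetical uploader that returns True on success. A record
        is deleted only after it has been sent, so a dropped connection loses nothing.
        """
        sent = 0
        rows = self.conn.execute(
            "SELECT id, payload FROM outbox ORDER BY id"
        ).fetchall()
        for row_id, payload in rows:
            try:
                ok = send(json.loads(payload))
            except OSError:  # network-level failures are treated as "still offline"
                ok = False
            if not ok:
                break
            self.conn.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
            self.conn.commit()
            sent += 1
        return sent
```

Because a record can be sent and then fail to delete if the device loses power in between, the backend should treat uploads as idempotent, for instance by keying on a client-generated ID.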
Building Generative AI Applications with Gemma 4
Beyond chatbots, Gemma 4's function-calling and structured JSON output make it suitable as a reasoning layer inside existing backend systems rather than a bolted-on feature.
- Agentic backend services: wire Gemma 4's function-calling directly into your Node.js/Express routes so the model decides which internal API to call (e.g., checking a Kredi Bank balance or triggering a VTU top-up) instead of parsing free-text intent yourself
- Document and receipt processing: feed scanned invoices or bank statements through Gemma 4's vision input for OCR plus structured extraction, replacing brittle regex-based parsers
- Local coding assistants: run Gemma 4 inside tools like OpenCode or via llama.cpp to get an offline pair-programmer for sensitive codebases you don't want sent to a third-party API
- RAG pipelines: combine Gemma 4 with LangChain and a local vector store to build a retrieval-augmented assistant over your own documentation or transaction history, entirely self-hosted
- Multilingual support apps: leverage Gemma 4's 140+ language coverage to serve customer support or bill-payment flows in local languages without a separate translation layer
A minimal Gradio-based coding assistant, for example, pairs Ollama's local API with a simple web UI to demo tool-calling and live code editing in an afternoon — a useful pattern for internal developer tools.
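The agentic backend idea, where the model picks the internal API and your code executes it, comes down to a small dispatcher. The sketch below is written in Python rather than Node.js for brevity, and it assumes the model has been prompted to emit a JSON object of the form `{"name": ..., "arguments": {...}}`. The two tool functions and their return values are hypothetical stand-ins for real services.

```python
import json
from typing import Any, Callable, Dict


# Hypothetical internal APIs; real ones would call your banking or top-up services.
def check_balance(account_id: str) -> dict:
    return {"account_id": account_id, "balance": 1500.0}


def vtu_top_up(phone: str, amount: float) -> dict:
    return {"phone": phone, "amount": amount, "status": "queued"}


# Only functions listed here can ever be invoked by the model.
TOOLS: Dict[str, Callable[..., Any]] = {
    "check_balance": check_balance,
    "vtu_top_up": vtu_top_up,
}


def dispatch_tool_call(model_output: str) -> dict:
    """Parse a JSON tool call emitted by the model and run the matching function.

    Assumed shape: {"name": "<tool>", "arguments": {...}}. Unknown tools and
    malformed output return an error dict instead of raising, so the route
    handler can pass the error back to the model or the user.
    """
    try:
        call = json.loads(model_output)
        name = call["name"]
        arguments = call.get("arguments", {})
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        return {"error": "malformed tool call"}
    tool = TOOLS.get(name) if isinstance(name, str) else None
    if tool is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return {"result": tool(**arguments)}
    except TypeError as exc:  # wrong or missing arguments
        return {"error": str(exc)}
```

Routing every call through an explicit allowlist means a hallucinated tool name fails safely instead of reaching arbitrary code.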
Optimizing Gemma for Performance and Efficiency
Most "Gemma is slow" complaints trace back to three fixable issues: CPU fallback, the wrong quantization, and an oversized context window.
Check GPU utilization first. If GPU usage stays at 0% during inference, the model silently fell back to CPU — expect only 1-5 tokens/sec until this is fixed.
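Decode speed can also be measured directly from an Ollama response, which reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The sketch below turns those two fields into tokens/sec and flags a likely CPU fallback using the 5 tokens/sec upper bound mentioned above. The threshold and function names are assumptions for illustration, and low throughput is only a hint: `ollama ps` or `nvidia-smi` remain the direct checks.

```python
CPU_FALLBACK_THRESHOLD = 5.0  # tokens/sec; upper end of the 1-5 range cited above


def decode_tokens_per_second(response: dict) -> float:
    """Compute decode speed from a non-streaming Ollama response.

    `eval_count` is the number of generated tokens and `eval_duration` is the
    generation time in nanoseconds.
    """
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / (duration_ns / 1e9)


def looks_like_cpu_fallback(response: dict) -> bool:
    """Heuristic: throughput at or below the threshold suggests CPU-only inference."""
    return decode_tokens_per_second(response) <= CPU_FALLBACK_THRESHOLD
```

Logging this value per request makes a silent fallback visible in monitoring instead of surfacing as a vague "it feels slow" report.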
Pick quantization deliberately:
| Quantization | Size (12B) | Speed | Quality | Best For |
|---|---|---|---|---|
| Q4_K_M | ~7 GB | Fastest | Good | Daily use, most tasks |
| Q5_K_M | ~8.5 GB | Fast | Better | When quality matters |
| Q6_K | ~10 GB | Medium | Very good | Balanced |
| Q8_0 | ~13 GB | Slow | Near-original | Quality-critical tasks |
| FP16 | ~24 GB | Slowest | Original | Only with ample VRAM |
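The sizes in the table follow from simple arithmetic: parameters times bits per weight, divided by 8. The sketch below reproduces that estimate. The bits-per-weight values are rough assumed averages for each GGUF quantization type, not exact figures, and real files vary with architecture and with which tensors are kept at higher precision.

```python
# Approximate effective bits per weight for common GGUF quantization types.
# These are rough assumed averages; real files vary by architecture.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def estimate_weights_gb(params_billions: float, quant: str) -> float:
    """Estimate weight size in GB (10^9 bytes): parameters x bits / 8."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


# Print an estimate for a 12B model at each quantization level.
for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{estimate_weights_gb(12, quant):.1f} GB")
```

For a 12B model the estimates land close to the table's figures. Keep in mind this covers weights only; the KV cache and runtime buffers come on top.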
Watch context length, because it isn't free. VRAM use climbs and speed drops sharply as context grows: a 12B Q4 model runs at full speed with a 2K context but drops to roughly a quarter of that speed at 256K context, consuming 30GB+ of VRAM in the process.
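The memory cost of context comes mostly from the key-value cache, which grows linearly with the number of tokens. A back-of-the-envelope estimate, assuming plain full attention in every layer and an FP16 cache, is sketched below. The layer count, KV head count, and head dimension are hypothetical values chosen only to illustrate the scaling; they are not Gemma 4's actual architecture.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 2 bytes for an FP16 cache
) -> int:
    """KV cache size for one sequence, assuming full attention in every layer.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * context_tokens * bytes_per_value


# Hypothetical architecture numbers, chosen only to illustrate the scaling.
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for ctx in (2_048, 8_192, 32_768):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, ctx) / 2**30
    print(f"{ctx:>6} tokens -> {gib:.2f} GiB")
```

With these illustrative numbers each token costs 192 KiB, so an 8K context needs 1.5 GiB and every doubling of context doubles the cache. Models that use sliding-window attention in some layers, or a quantized cache, need less than this upper bound.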
Manage the KV cache. Long-running conversations accumulate key-value cache that eats VRAM over time — reset sessions periodically or cap the cache size for long-lived services.
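One simple way to cap growth in a long-lived chat service is to trim the history before each request, so the prompt (and the cache built from it) never exceeds a fixed budget. The sketch below keeps system messages and the newest turns that fit. The 4-characters-per-token estimate is a crude assumption; a real service would count tokens with the model's tokenizer.

```python
from typing import Dict, List

Message = Dict[str, str]


def approx_tokens(text: str) -> int:
    """Crude token estimate (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_history(messages: List[Message], max_tokens: int) -> List[Message]:
    """Keep system messages plus the newest turns that fit in `max_tokens`."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(approx_tokens(m["content"]) for m in system)
    kept: List[Message] = []
    for message in reversed(turns):  # walk from newest to oldest
        cost = approx_tokens(message["content"])
        if cost > budget:
            break  # stop at the first turn that no longer fits
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))
```

Dropping the oldest turns first preserves recent context, which usually matters most. Summarizing the dropped turns into a short note is a common refinement when older details still matter.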
Use Quantization-Aware Training (QAT) checkpoints where available. Google has released QAT versions of Gemma 4 that preserve much more quality at int4 precision than post-training quantization alone, making it realistic to run a 12B model with a 16K context window on as little as 8GB of VRAM, even on older GPUs like a GTX 1080 Ti.
Mixture-of-Experts changes the math. The 26B MoE model activates only ~3.8B parameters per token, whereas the dense 31B model uses all of its parameters for every token. The result is roughly 6x faster decode speed at comparable quality, a strong default choice when latency matters more than peak raw capability.
```bash
# Example: capping context at 8K tokens for a responsive local session
ollama run gemma4:e4b
# then, at the interactive prompt:
/set parameter num_ctx 8192
```