Mamoor Ahmad

Building a Fully Offline AI Coding Assistant with Gemma 4, No Cloud Required 🤖

Gemma 4 Challenge: Write about Gemma 4 Submission

Your code never leaves your machine. Your API bill is zero. Your assistant still works on a plane. ✈️

That's the pitch. Here's how to actually build it.

🤔 Why Go Offline in 2026?


Three reasons pushed me (and a lot of other devs) toward local AI:

  1. 💰 Cost. If you're running coding sessions multiple times a day, API bills add up fast. A one-time hardware investment pays for itself in months.

  2. 🔒 Privacy. Some codebases — client work, proprietary algorithms, internal tools — should never touch someone else's server.

  3. 🛡️ Resilience. Cloud APIs throttle, go down, and change pricing. A local model just runs.

Gemma 4 finally makes this practical. Previous Gemma generations scored 6.6% on function-calling benchmarks — basically useless for agentic coding. Gemma 4 31B scores 86.4% on the same benchmark. 🤯

That's the jump that makes "local coding assistant" go from toy to tool.


🧰 What You'll Need

⚙️ Hardware

| Model | Min RAM | Recommended | Best For |
| --- | --- | --- | --- |
| 🟢 E4B (Edge) | 4 GB | 8 GB | Raspberry Pi, Jetson Nano |
| 🔵 26B MoE | 16 GB (Q4) | 24 GB | M4 MacBook Pro, RTX 4070 |
| 🟣 31B Dense | 32 GB (Q4) | 48 GB+ | M4 Max, RTX 4090, GB10 |

The sweet spot for most developers: 26B MoE on a 24 GB machine. It activates only 3.8B parameters per token (Mixture of Experts), so it's fast — often faster than the bigger 31B despite being "smaller."


📦 Software

  • Ollama or llama.cpp to serve the model locally
  • The Continue.dev extension for VS Code / JetBrains (chat and tab autocomplete)
  • Optionally, Codex CLI if you prefer agentic coding from the terminal
  • huggingface-cli if you want to download GGUF files yourself

🚀 Step 1: Get the Model

Option A: Ollama — The Easy Path ☕

# Install Ollama (macOS, Linux, Windows)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the model — this downloads ~16 GB for the 26B MoE
ollama pull gemma4:26b

# Or the smaller edge model if you're on limited hardware
ollama pull gemma4:4b

# Verify it works 🎉
ollama run gemma4:26b "Write a Python function to merge two sorted lists"

That's it. You now have a local AI that can write code. Seriously.
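
Prefer to sanity-check it from code instead of the CLI? Ollama listens on localhost:11434 with a local REST API. Here's a minimal sketch, assuming the requests package is installed and the model tag matches what you pulled:

import requests

# Ask the local Ollama server for a one-shot completion (non-streaming).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4:26b",
        "prompt": "Write a Python function to merge two sorted lists.",
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])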

Option B: llama.cpp — For Power Users 🔧

llama.cpp gives you more control over quantization, context length, and memory usage. This matters on constrained hardware.

# Install via Homebrew (macOS)
brew install llama.cpp

# Or build from source for GPU support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # NVIDIA
# or: cmake -B build -DGGML_METAL=ON  # Apple Silicon
cmake --build build --config Release -j

Download the GGUF file from Hugging Face:

# 26B MoE Q4 — best balance of quality and speed
huggingface-cli download gg-hf-gg/gemma-4-26B-A4B-it-GGUF \
  gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --local-dir ./models/

Start the server with the right flags (every flag here matters ⚠️):

llama-server \
  -m ./models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --port 1234 \
  -ngl 99 \
  -c 32768 \
  -np 1 \
  --jinja \
  -ctk q8_0 \
  -ctv q8_0

🔑 What each flag does:

| Flag | Purpose |
| --- | --- |
| -ngl 99 | 🚀 Offload all layers to the GPU |
| -c 32768 | 📏 32K context window (increase if you have RAM) |
| -np 1 | 🎯 Single slot — multiple slots multiply KV cache memory |
| --jinja | 🔌 Required for Gemma 4's tool-calling template |
| -ctk q8_0 -ctv q8_0 | 💾 Quantize KV cache from ~940 MB to ~499 MB |

⚠️ Do NOT use the -hf flag to auto-download — it silently pulls a 1.1 GB vision projector that will OOM on 24 GB machines. Learn from my pain. 😅
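
Before pointing an editor at the server, confirm it actually answers. llama-server speaks the OpenAI-compatible chat API on the port you passed, so a minimal check from Python looks like this (a sketch; the model name is arbitrary for a single-model server, and requests is assumed to be installed):

import requests

# Hit llama-server's OpenAI-compatible endpoint on the port from --port 1234.
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "gemma-4-26b",  # name is not checked when only one model is loaded
        "messages": [
            {"role": "user", "content": "Write a Python function to merge two sorted lists."}
        ],
        "max_tokens": 512,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])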


🔌 Step 2: Connect It to Your Editor

Continue.dev (VS Code / JetBrains) 💻

Continue is an open-source AI code assistant that runs in your IDE. It supports Ollama and llama.cpp out of the box.

Install:

  1. Open VS Code → Extensions → Search "Continue" → Install
  2. Open ~/.continue/config.json (or use the Continue settings UI)

Config for Ollama:

{
  "models": [
    {
      "title": "Gemma 4 26B (Local)",
      "provider": "ollama",
      "model": "gemma4:26b",
      "contextLength": 32768
    }
  ],
  "tabAutocompleteModel": {
    "title": "Gemma 4 E4B (Autocomplete)",
    "provider": "ollama",
    "model": "gemma4:4b"
  }
}

Config for llama.cpp:

{
  "models": [
    {
      "title": "Gemma 4 26B (llama.cpp)",
      "provider": "openai",
      "model": "gemma-4-26b",
      "apiBase": "http://localhost:1234/v1",
      "contextLength": 32768
    }
  ]
}

💡 Pro tip: Use the 4B model for tab autocomplete (fast, low memory) and the 26B model for chat/explain/refactor (smarter, slower). This dual-model setup gives you the best of both worlds! 🏆

Codex CLI — Terminal Power Users ⌨️

If you prefer agentic coding from the terminal:

# Install Codex CLI
npm install -g @openai/codex

# Run with local model
codex --oss -m gemma4:26b

# Or with llama.cpp backend
codex --oss -m http://localhost:1234/v1

In Codex CLI's config.toml, set:

[model]
wire_api = "responses"
web_search = "disabled"  # llama.cpp rejects this tool type

⚙️ Step 3: Tune for Your Hardware

🟡 16 GB Machine (MacBook Air M3/M4, Budget Builds)

# Use the E4B model — still surprisingly capable
ollama pull gemma4:4b

# Or squeeze the 26B MoE with aggressive quantization
ollama pull gemma4:26b-q3_K_M

In Continue, lower contextLength to 8192 to save memory.

🔵 24 GB Machine (M4 Pro, RTX 4070/4080) — ⭐ Sweet Spot

The 26B MoE at Q4_K_M fits comfortably:

# Ollama
ollama pull gemma4:26b

# Or llama.cpp with optimized KV cache
llama-server -m ./models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --port 1234 -ngl 99 -c 32768 -np 1 --jinja \
  -ctk q8_0 -ctv q8_0

🟣 48 GB+ Machine (M4 Max, RTX 4090, Workstations)

Run the 31B Dense for maximum quality:

ollama pull gemma4:31b

# Or with full context
llama-server -m ./models/gemma-4-31B-it-Q4_K_M.gguf \
  --port 1234 -ngl 99 -c 65536 -np 1 --jinja

📊 Step 4: Real-World Benchmark

I tested the same coding task across all configurations:

"Write a parse_csv_summary function with error handling, write tests, and run them."


| Config | Quality | Time | Tool Calls | Verdict |
| --- | --- | --- | --- | --- |
| ☁️ GPT-5.4 (Cloud) | ★★★★★ | 65s | 3 | Type hints, exception chaining, clean |
| 🖥️ 31B Dense (48 GB) | ★★★★☆ | 7 min | 3 | Functional, solid, no cleanup needed |
| 26B MoE (24 GB) | ★★★☆☆ | 4 min | 10 | Functional but messy — dead code, retries |
| 📱 E4B (8 GB) | ★★☆☆☆ | 2 min | 15+ | Basic tasks only, struggles with multi-file |

🎯 Key takeaway: The 31B Dense on capable hardware gets close to cloud quality. The 26B MoE is fast and functional but needs more human oversight. The E4B is great for autocomplete, not for agentic coding.

⚡ Speed Comparison

The 26B MoE is deceptively fast. Despite being a "26B" model, it only activates 3.8B parameters per token:

| Model | Speed on M4 Pro | Why |
| --- | --- | --- |
| 🚀 26B MoE | ~52 tok/s | Only reads ~1.9 GB/token from memory |
| 🐢 31B Dense | ~10 tok/s | Reads all 31.2B params per token |

The MoE architecture means the model is reading less memory per token, so it flies on bandwidth-limited hardware. 🏎️
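
A quick back-of-envelope check of where those numbers come from. The byte counts below are rough assumptions (Q4 quantization at roughly 0.5 bytes per weight), not measurements, but they show why the gap exists:

# Rule of thumb: tokens/sec is capped by memory bandwidth / bytes read per token.
BYTES_PER_WEIGHT_Q4 = 0.5          # ~4 bits per weight at Q4 (approximation)

moe_active = 3.8e9                 # 26B MoE activates ~3.8B params per token
dense_total = 31.2e9               # 31B Dense reads every weight per token

moe_gb = moe_active * BYTES_PER_WEIGHT_Q4 / 1e9     # ≈ 1.9 GB per token
dense_gb = dense_total * BYTES_PER_WEIGHT_Q4 / 1e9  # ≈ 15.6 GB per token

print(f"26B MoE:   ~{moe_gb:.1f} GB read per token")
print(f"31B Dense: ~{dense_gb:.1f} GB read per token")
print(f"The MoE moves ~{dense_gb / moe_gb:.0f}x less data per token")

Real throughput comes in below the pure bandwidth limit (compute, KV cache reads, and framework overhead all take a cut), but that roughly 8x difference in memory traffic is why the MoE ends up several times faster in practice.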


🎯 Step 5: Prompt Engineering for Local Models

Local models need better prompting than cloud models. Here are patterns that actually work:

📝 System Prompt Template

You are a coding assistant running locally. You have access to these tools:
- Read: Read a file from the filesystem
- Write: Write content to a file
- Execute: Run a shell command

Rules:
1. Read the existing code before making changes.
2. Write tests for any new function you create.
3. Run the tests and fix failures.
4. Keep changes minimal — don't refactor unrelated code.
5. If you're unsure, explain your reasoning before acting.

💡 Tips That Actually Help

  • 🎯 Be specific about file paths. Local models hallucinate paths more than cloud models. Say src/utils/parser.ts, not "the parser file."
  • 📋 One task at a time. Don't ask for a full feature. Ask for "write the function," then "write the tests," then "run the tests."
  • 📖 Provide examples. Show the model what you want with a small example before asking it to generate.
  • 🔧 Use structured output. Gemma 4 supports native JSON output. Use it for tool calls and structured responses; a minimal sketch follows below.
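
Here's one way to do that through Ollama's API, which can constrain the reply to valid JSON via the format field (the model tag and prompt are just examples):

import requests

# Constrain the local model's reply to valid JSON using Ollama's "format" option.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma4:26b",
        "messages": [
            {
                "role": "user",
                "content": "Which files would you read before adding a --verbose flag "
                           "to a CLI tool? Answer as JSON with a single key 'files'.",
            }
        ],
        "format": "json",
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["message"]["content"])  # e.g. {"files": ["src/cli.py", ...]}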

🐛 Common Pitfalls (Learn From My Pain)

💥 "Ollama hangs on long prompts"

This is a known Flash Attention bug on Apple Silicon with Gemma 4.

Fix: Use llama.cpp instead, or wait for Ollama v0.20.6+.

💥 "Tool calls land in the wrong field"

Ollama v0.20.3 has a streaming bug that routes Gemma 4 tool-call responses to the reasoning output instead of tool_calls.

Fix: Update to v0.20.5+ or use llama.cpp.

💥 "Out of memory on startup"

If you launch llama.cpp with the -hf flag, it downloads a 1.1 GB vision projector you don't need.

Fix: Use a direct -m path to the GGUF file instead.

💥 "Codex CLI rejects my model"

Set web_search = "disabled" in config — Codex CLI sends a web_search_preview tool type that llama.cpp doesn't recognize.


🏗️ Architecture: The Full Offline Stack

Here's what the complete setup looks like:


┌─────────────────────────────────────────────┐
│              Your Editor (VS Code)           │
│  ┌─────────────────────────────────────────┐ │
│  │         Continue.dev Extension           │ │
│  │  ┌──────────┐    ┌──────────────────┐   │ │
│  │  │  💬 Chat  │    │  ⚡ Autocomplete │   │ │
│  │  │  Refactor │    │  (E4B model)     │   │ │
│  │  └─────┬────┘    └────────┬─────────┘   │ │
│  └────────┼──────────────────┼─────────────┘ │
└───────────┼──────────────────┼───────────────┘
            │                  │
     ┌──────▼──────┐    ┌─────▼──────┐
     │  🖥️ llama.cpp│    │  📦 Ollama  │
     │  :1234       │    │   :11434   │
     │  (26B/31B)   │    │   (E4B)    │
     └──────┬──────┘    └─────┬──────┘
            │                  │
     ┌──────▼──────────────────▼──────┐
     │       🔒 Local GPU / CPU       │
     │    No data leaves this box     │
     └────────────────────────────────┘
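
If you want to script against that stack directly, the routing idea is simple: send small prompts to the E4B on Ollama and everything else to the 26B behind llama.cpp. A minimal sketch, where the length cutoff and model names are arbitrary choices for illustration:

import requests

OLLAMA_CHAT = "http://localhost:11434/api/chat"              # E4B: fast, cheap
LLAMACPP_CHAT = "http://localhost:1234/v1/chat/completions"  # 26B: slower, smarter

def ask_local(prompt: str) -> str:
    """Route a prompt to the small or large local model based on a crude size heuristic."""
    if len(prompt) < 200:  # arbitrary cutoff: short questions go to the edge model
        resp = requests.post(OLLAMA_CHAT, json={
            "model": "gemma4:4b",
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        }, timeout=300)
        return resp.json()["message"]["content"]
    resp = requests.post(LLAMACPP_CHAT, json={
        "model": "gemma-4-26b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
    }, timeout=600)
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask_local("Explain what a KV cache is in one sentence."))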

🤷 When to Use Cloud Instead

Be honest about limitations:

✅ Use Local For:

  • Day-to-day coding, refactoring, explaining code
  • Writing tests, documentation, boilerplate
  • Working with sensitive/proprietary codebases
  • Offline environments (✈️ flights, ☕ cafes, 🏢 secure facilities)

❌ Use Cloud For:

  • Complex multi-file architectural changes
  • Tasks requiring reasoning across 10+ files
  • When you need the absolute highest code quality
  • Large-scale codebase migrations

🔮 What's Next

The local AI space is moving fast. Some things to watch:

  • 🧬 Gemma 4 fine-tuning — Use Unsloth to fine-tune on your own codebase. A domain-specific adapter can dramatically improve quality.
  • 🔀 Multi-model pipelines — Route simple tasks to E4B (fast), complex tasks to 26B/31B (smart). The AI router pattern is catching on.
  • 👁️ Vision + Code — Gemma 4 processes images natively. Feed it a screenshot of a UI, get the code (see the sketch below). This is massively underrated.
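
For example, through Ollama's chat API, which accepts base64-encoded images alongside the message. This is a sketch that assumes your Ollama build wires up Gemma 4's image input; the filename and prompt are placeholders:

import base64
import requests

# Send a UI screenshot to the local multimodal model and ask for matching markup.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma4:26b",
        "messages": [{
            "role": "user",
            "content": "Generate the HTML and CSS for this UI.",
            "images": [image_b64],
        }],
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["message"]["content"])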

🎬 The Bottom Line

You don't need a $10K rig. A 24 GB laptop with Gemma 4 26B MoE gives you a coding assistant that:

  • ✅ Handles 80% of daily tasks
  • ✅ Costs nothing per query
  • ✅ Never phones home
  • ✅ Works offline
  • ✅ Keeps your code private

That's not a compromise — that's a paradigm shift. 🚀


All benchmarks were run locally on consumer hardware. No cloud APIs were harmed in the making of this post.


Found this useful? Drop a ❤️ and share it with a friend who's tired of API bills!

Questions? Hit me up in the comments — I'll help you troubleshoot your setup. 👇
