Mamoor Ahmad

Building a Fully Offline AI Coding Assistant with Gemma 4, No Cloud Required 🤖

Gemma 4 Challenge: Write about Gemma 4 Submission

Your code never leaves your machine. Your API bill is zero. Your assistant still works on a plane. ✈️

That's the pitch. Here's how to actually build it.

🤔 Why Go Offline in 2026?


Three reasons pushed me (and a lot of other devs) toward local AI:

  1. 💰 Cost. If you're running coding sessions multiple times a day, API bills add up fast. A one-time hardware investment pays for itself in months.

  2. 🔒 Privacy. Some codebases — client work, proprietary algorithms, internal tools — should never touch someone else's server.

  3. 🛡️ Resilience. Cloud APIs throttle, go down, and change pricing. A local model just runs.

Gemma 4 finally makes this practical. Previous Gemma generations scored 6.6% on function-calling benchmarks — basically useless for agentic coding. Gemma 4 31B scores 86.4% on the same benchmark. 🤯

That's the jump that makes "local coding assistant" go from toy to tool.


🧰 What You'll Need

⚙️ Hardware

| Model | Min RAM | Recommended | Best For |
| --- | --- | --- | --- |
| 🟢 E4B (Edge) | 4 GB | 8 GB | Raspberry Pi, Jetson Nano |
| 🔵 26B MoE | 16 GB (Q4) | 24 GB | M4 MacBook Pro, RTX 4070 |
| 🟣 31B Dense | 32 GB (Q4) | 48 GB+ | M4 Max, RTX 4090, GB10 |

The sweet spot for most developers: 26B MoE on a 24 GB machine. It activates only 3.8B parameters per token (Mixture of Experts), so it's fast — often faster than the bigger 31B despite being "smaller."


📦 Software

  • Ollama or llama.cpp to serve the model locally
  • The Continue.dev extension for VS Code / JetBrains (chat and tab autocomplete)
  • Optionally, Codex CLI if you prefer agentic coding from the terminal
  • huggingface-cli if you want to download GGUF files yourself

🚀 Step 1: Get the Model

Option A: Ollama — The Easy Path ☕

# Install Ollama (macOS, Linux, Windows)
curl -fsSL https://ollama.com/install.sh | sh

# Pull the model — this downloads ~16 GB for the 26B MoE
ollama pull gemma4:26b

# Or the smaller edge model if you're on limited hardware
ollama pull gemma4:4b

# Verify it works 🎉
ollama run gemma4:26b "Write a Python function to merge two sorted lists"

That's it. You now have a local AI that can write code. Seriously.
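
Prefer to sanity-check it from code instead of the CLI? Ollama listens on localhost:11434 with a local REST API. Here's a minimal sketch, assuming the requests package is installed and the model tag matches what you pulled:

import requests

# Ask the local Ollama server for a one-shot completion (non-streaming).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma4:26b",
        "prompt": "Write a Python function to merge two sorted lists.",
        "stream": False,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])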

Option B: llama.cpp — For Power Users 🔧

llama.cpp gives you more control over quantization, context length, and memory usage. This matters on constrained hardware.

# Install via Homebrew (macOS)
brew install llama.cpp

# Or build from source for GPU support
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON  # NVIDIA
# or: cmake -B build -DGGML_METAL=ON  # Apple Silicon
cmake --build build --config Release -j

Download the GGUF file from Hugging Face:

# 26B MoE Q4 — best balance of quality and speed
huggingface-cli download gg-hf-gg/gemma-4-26B-A4B-it-GGUF \
  gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --local-dir ./models/

Start the server with the right flags (every flag here matters ⚠️):

llama-server \
  -m ./models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --port 1234 \
  -ngl 99 \
  -c 32768 \
  -np 1 \
  --jinja \
  -ctk q8_0 \
  -ctv q8_0

🔑 What each flag does:

| Flag | Purpose |
| --- | --- |
| -ngl 99 | 🚀 Offload all layers to the GPU |
| -c 32768 | 📏 32K context window (increase if you have RAM) |
| -np 1 | 🎯 Single slot — multiple slots multiply KV cache memory |
| --jinja | 🔌 Required for Gemma 4's tool-calling template |
| -ctk q8_0 -ctv q8_0 | 💾 Quantize KV cache from ~940 MB to ~499 MB |

⚠️ Do NOT use the -hf flag to auto-download — it silently pulls a 1.1 GB vision projector that will OOM on 24 GB machines. Learn from my pain. 😅
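
Before pointing an editor at the server, confirm it actually answers. llama-server speaks the OpenAI-compatible chat API on the port you passed, so a minimal check from Python looks like this (a sketch; the model name is arbitrary for a single-model server, and requests is assumed to be installed):

import requests

# Hit llama-server's OpenAI-compatible endpoint on the port from --port 1234.
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "gemma-4-26b",  # name is not checked when only one model is loaded
        "messages": [
            {"role": "user", "content": "Write a Python function to merge two sorted lists."}
        ],
        "max_tokens": 512,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])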


🔌 Step 2: Connect It to Your Editor

Continue.dev (VS Code / JetBrains) 💻

Continue is an open-source AI code assistant that runs in your IDE. It supports Ollama and llama.cpp out of the box.

Install:

  1. Open VS Code → Extensions → Search "Continue" → Install
  2. Open ~/.continue/config.json (or use the Continue settings UI)

Config for Ollama:

{
  "models": [
    {
      "title": "Gemma 4 26B (Local)",
      "provider": "ollama",
      "model": "gemma4:26b",
      "contextLength": 32768
    }
  ],
  "tabAutocompleteModel": {
    "title": "Gemma 4 E4B (Autocomplete)",
    "provider": "ollama",
    "model": "gemma4:4b"
  }
}

Config for llama.cpp:

{
  "models": [
    {
      "title": "Gemma 4 26B (llama.cpp)",
      "provider": "openai",
      "model": "gemma-4-26b",
      "apiBase": "http://localhost:1234/v1",
      "contextLength": 32768
    }
  ]
}

💡 Pro tip: Use the 4B model for tab autocomplete (fast, low memory) and the 26B model for chat/explain/refactor (smarter, slower). This dual-model setup gives you the best of both worlds! 🏆

Codex CLI — Terminal Power Users ⌨️

If you prefer agentic coding from the terminal:

# Install Codex CLI
npm install -g @openai/codex

# Run with local model
codex --oss -m gemma4:26b

# Or with llama.cpp backend
codex --oss -m http://localhost:1234/v1

In Codex CLI's config.toml, set:

[model]
wire_api = "responses"
web_search = "disabled"  # llama.cpp rejects this tool type

⚙️ Step 3: Tune for Your Hardware

🟡 16 GB Machine (MacBook Air M3/M4, Budget Builds)

# Use the E4B model — still surprisingly capable
ollama pull gemma4:4b

# Or squeeze the 26B MoE with aggressive quantization
ollama pull gemma4:26b-q3_K_M

In Continue, lower contextLength to 8192 to save memory.

🔵 24 GB Machine (M4 Pro, RTX 4070/4080) — ⭐ Sweet Spot

The 26B MoE at Q4_K_M fits comfortably:

# Ollama
ollama pull gemma4:26b

# Or llama.cpp with optimized KV cache
llama-server -m ./models/gemma-4-26B-A4B-it-Q4_K_M.gguf \
  --port 1234 -ngl 99 -c 32768 -np 1 --jinja \
  -ctk q8_0 -ctv q8_0

🟣 48 GB+ Machine (M4 Max, RTX 4090, Workstations)

Run the 31B Dense for maximum quality:

ollama pull gemma4:31b

# Or with full context
llama-server -m ./models/gemma-4-31B-it-Q4_K_M.gguf \
  --port 1234 -ngl 99 -c 65536 -np 1 --jinja

📊 Step 4: Real-World Benchmark

I tested the same coding task across all configurations:

"Write a parse_csv_summary function with error handling, write tests, and run them."


| Config | Quality | Time | Tool Calls | Verdict |
| --- | --- | --- | --- | --- |
| ☁️ GPT-5.4 (Cloud) | ★★★★★ | 65s | 3 | Type hints, exception chaining, clean |
| 🖥️ 31B Dense (48 GB) | ★★★★☆ | 7 min | 3 | Functional, solid, no cleanup needed |
| 26B MoE (24 GB) | ★★★☆☆ | 4 min | 10 | Functional but messy — dead code, retries |
| 📱 E4B (8 GB) | ★★☆☆☆ | 2 min | 15+ | Basic tasks only, struggles with multi-file |

🎯 Key takeaway: The 31B Dense on capable hardware gets close to cloud quality. The 26B MoE is fast and functional but needs more human oversight. The E4B is great for autocomplete, not for agentic coding.

⚡ Speed Comparison

The 26B MoE is deceptively fast. Despite being a "26B" model, it only activates 3.8B parameters per token:

| Model | Speed on M4 Pro | Why |
| --- | --- | --- |
| 🚀 26B MoE | ~52 tok/s | Only reads ~1.9 GB/token from memory |
| 🐢 31B Dense | ~10 tok/s | Reads all 31.2B params per token |

The MoE architecture means the model is reading less memory per token, so it flies on bandwidth-limited hardware. 🏎️
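
A quick back-of-envelope check of where those numbers come from. The byte counts below are rough assumptions (Q4 quantization at roughly 0.5 bytes per weight), not measurements, but they show why the gap exists:

# Rule of thumb: tokens/sec is capped by memory bandwidth / bytes read per token.
BYTES_PER_WEIGHT_Q4 = 0.5          # ~4 bits per weight at Q4 (approximation)

moe_active = 3.8e9                 # 26B MoE activates ~3.8B params per token
dense_total = 31.2e9               # 31B Dense reads every weight per token

moe_gb = moe_active * BYTES_PER_WEIGHT_Q4 / 1e9     # ≈ 1.9 GB per token
dense_gb = dense_total * BYTES_PER_WEIGHT_Q4 / 1e9  # ≈ 15.6 GB per token

print(f"26B MoE:   ~{moe_gb:.1f} GB read per token")
print(f"31B Dense: ~{dense_gb:.1f} GB read per token")
print(f"The MoE moves ~{dense_gb / moe_gb:.0f}x less data per token")

Real throughput comes in below the pure bandwidth limit (compute, KV cache reads, and framework overhead all take a cut), but that roughly 8x difference in memory traffic is why the MoE ends up several times faster in practice.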


🎯 Step 5: Prompt Engineering for Local Models

Local models need better prompting than cloud models. Here are patterns that actually work:

📝 System Prompt Template

You are a coding assistant running locally. You have access to these tools:
- Read: Read a file from the filesystem
- Write: Write content to a file
- Execute: Run a shell command

Rules:
1. Read the existing code before making changes.
2. Write tests for any new function you create.
3. Run the tests and fix failures.
4. Keep changes minimal — don't refactor unrelated code.
5. If you're unsure, explain your reasoning before acting.

💡 Tips That Actually Help

  • 🎯 Be specific about file paths. Local models hallucinate paths more than cloud models. Say src/utils/parser.ts, not "the parser file."
  • 📋 One task at a time. Don't ask for a full feature. Ask for "write the function," then "write the tests," then "run the tests."
  • 📖 Provide examples. Show the model what you want with a small example before asking it to generate.
  • 🔧 Use structured output. Gemma 4 supports native JSON output. Use it for tool calls and structured responses; a minimal sketch follows below.
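
Here's one way to do that through Ollama's API, which can constrain the reply to valid JSON via the format field (the model tag and prompt are just examples):

import requests

# Constrain the local model's reply to valid JSON using Ollama's "format" option.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma4:26b",
        "messages": [
            {
                "role": "user",
                "content": "Which files would you read before adding a --verbose flag "
                           "to a CLI tool? Answer as JSON with a single key 'files'.",
            }
        ],
        "format": "json",
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["message"]["content"])  # e.g. {"files": ["src/cli.py", ...]}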

🐛 Common Pitfalls (Learn From My Pain)

💥 "Ollama hangs on long prompts"

This is a known Flash Attention bug on Apple Silicon with Gemma 4.

Fix: Use llama.cpp instead, or wait for Ollama v0.20.6+.

💥 "Tool calls land in the wrong field"

Ollama v0.20.3 has a streaming bug that routes Gemma 4 tool-call responses to the reasoning output instead of tool_calls.

Fix: Update to v0.20.5+ or use llama.cpp.

💥 "Out of memory on startup"

If you launch llama.cpp with the -hf flag, it downloads a 1.1 GB vision projector you don't need.

Fix: Use a direct -m path to the GGUF file instead.

💥 "Codex CLI rejects my model"

Set web_search = "disabled" in config — Codex CLI sends a web_search_preview tool type that llama.cpp doesn't recognize.


🏗️ Architecture: The Full Offline Stack

Here's what the complete setup looks like:


┌─────────────────────────────────────────────┐
│              Your Editor (VS Code)           │
│  ┌─────────────────────────────────────────┐ │
│  │         Continue.dev Extension           │ │
│  │  ┌──────────┐    ┌──────────────────┐   │ │
│  │  │  💬 Chat  │    │  ⚡ Autocomplete │   │ │
│  │  │  Refactor │    │  (E4B model)     │   │ │
│  │  └─────┬────┘    └────────┬─────────┘   │ │
│  └────────┼──────────────────┼─────────────┘ │
└───────────┼──────────────────┼───────────────┘
            │                  │
     ┌──────▼──────┐    ┌─────▼──────┐
     │  🖥️ llama.cpp│    │  📦 Ollama  │
     │  :1234       │    │   :11434   │
     │  (26B/31B)   │    │   (E4B)    │
     └──────┬──────┘    └─────┬──────┘
            │                  │
     ┌──────▼──────────────────▼──────┐
     │       🔒 Local GPU / CPU       │
     │    No data leaves this box     │
     └────────────────────────────────┘
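
If you want to script against that stack directly, the routing idea is simple: send small prompts to the E4B on Ollama and everything else to the 26B behind llama.cpp. A minimal sketch, where the length cutoff and model names are arbitrary choices for illustration:

import requests

OLLAMA_CHAT = "http://localhost:11434/api/chat"              # E4B: fast, cheap
LLAMACPP_CHAT = "http://localhost:1234/v1/chat/completions"  # 26B: slower, smarter

def ask_local(prompt: str) -> str:
    """Route a prompt to the small or large local model based on a crude size heuristic."""
    if len(prompt) < 200:  # arbitrary cutoff: short questions go to the edge model
        resp = requests.post(OLLAMA_CHAT, json={
            "model": "gemma4:4b",
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        }, timeout=300)
        return resp.json()["message"]["content"]
    resp = requests.post(LLAMACPP_CHAT, json={
        "model": "gemma-4-26b",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 1024,
    }, timeout=600)
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask_local("Explain what a KV cache is in one sentence."))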

🤷 When to Use Cloud Instead

Be honest about limitations:

✅ Use Local For:

  • Day-to-day coding, refactoring, explaining code
  • Writing tests, documentation, boilerplate
  • Working with sensitive/proprietary codebases
  • Offline environments (✈️ flights, ☕ cafes, 🏢 secure facilities)

❌ Use Cloud For:

  • Complex multi-file architectural changes
  • Tasks requiring reasoning across 10+ files
  • When you need the absolute highest code quality
  • Large-scale codebase migrations

🔮 What's Next

The local AI space is moving fast. Some things to watch:

  • 🧬 Gemma 4 fine-tuning — Use Unsloth to fine-tune on your own codebase. A domain-specific adapter can dramatically improve quality.
  • 🔀 Multi-model pipelines — Route simple tasks to E4B (fast), complex tasks to 26B/31B (smart). The AI router pattern is catching on.
  • 👁️ Vision + Code — Gemma 4 processes images natively. Feed it a screenshot of a UI, get the code (see the sketch below). This is massively underrated.
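
For example, through Ollama's chat API, which accepts base64-encoded images alongside the message. This is a sketch that assumes your Ollama build wires up Gemma 4's image input; the filename and prompt are placeholders:

import base64
import requests

# Send a UI screenshot to the local multimodal model and ask for matching markup.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma4:26b",
        "messages": [{
            "role": "user",
            "content": "Generate the HTML and CSS for this UI.",
            "images": [image_b64],
        }],
        "stream": False,
    },
    timeout=600,
)
print(resp.json()["message"]["content"])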

🎬 The Bottom Line

You don't need a $10K rig. A 24 GB laptop with Gemma 4 26B MoE gives you a coding assistant that:

  • ✅ Handles 80% of daily tasks
  • ✅ Costs nothing per query
  • ✅ Never phones home
  • ✅ Works offline
  • ✅ Keeps your code private

That's not a compromise — that's a paradigm shift. 🚀


All benchmarks were run locally on consumer hardware. No cloud APIs were harmed in the making of this post.


Found this useful? Drop a ❤️ and share it with a friend who's tired of API bills!

Questions? Hit me up in the comments — I'll help you troubleshoot your setup. 👇
