DEV Community

Binary Ink

I Tested Gemma 4 on My Laptop and Turned It Into a Free Intelligence Layer for My AI Apps

How a $0 local model replaced $10/day in API calls across four production modules


I've been building MasterCLI — a multi-module AI-native desktop platform written in Go, React, and PostgreSQL. It includes a RAG knowledge base, a multi-agent discussion forum, and an orchestration hub (Nexus).

All of these modules were calling cloud APIs (GPT-4o-mini, Claude) for tasks like classifying user queries, extracting structured data from documents, and preprocessing messages. That's roughly $10/day in API costs just for classification and extraction — tasks that don't need frontier-model intelligence.

Then Google released Gemma 4 (8B) and I decided to test it locally. Here's what I found, and how I integrated it into four production modules in one afternoon.

The Setup: Nothing Fancy

  • Laptop: Regular gaming laptop with an RTX 3070 Ti (8GB VRAM)
  • Model: Gemma 4 8B, Q4_K_M quantization (9.6GB on disk)
  • Runtime: Ollama v0.20.0
  • OS: Windows 11

The model doesn't even fit entirely in VRAM — it partially offloads to system RAM. This is a real-world test, not a cloud GPU benchmark.

ollama pull gemma4
ollama list
# gemma4:latest  9.6 GB  Q4_K_M

The Benchmark: Surprises Everywhere

Speed: Consistent ~25 tok/s

Across all tests, generation speed held steady:

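| Task                    | Tokens | Time  | Speed      |
|-------------------------|--------|-------|------------|
| Simple Q&A              | 11     | 0.6s  | 19.8 tok/s |
| Go code generation      | 600    | 25.7s | 23.4 tok/s |
| Chinese JSON extraction | 500    | 18.5s | 27.1 tok/s |
| Intent classification   | 9      | 0.4s  | 25.6 tok/s |
| Tool calling            | 34     | 1.3s  | 27.1 tok/s |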

Prompt processing was much faster: 120-850 tok/s depending on batch size.
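
If you want to reproduce numbers like these, Ollama reports token counts and durations (in nanoseconds) on the final streamed chunk, in the fields eval_count, eval_duration, prompt_eval_count, and prompt_eval_duration. Here is a minimal sketch of the conversion to tok/s. The sample chunk in main is made up for illustration and is not one of my recorded runs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// chatMetrics holds the timing fields Ollama attaches to the final
// chunk of a response (the one with "done": true).
// All durations are reported in nanoseconds.
type chatMetrics struct {
	PromptEvalCount    int   `json:"prompt_eval_count"`
	PromptEvalDuration int64 `json:"prompt_eval_duration"`
	EvalCount          int   `json:"eval_count"`
	EvalDuration       int64 `json:"eval_duration"`
}

// tokensPerSecond converts a token count and a nanosecond duration to tok/s.
// It returns 0 when the duration is missing or not positive.
func tokensPerSecond(tokens int, durationNanos int64) float64 {
	if durationNanos <= 0 {
		return 0
	}
	return float64(tokens) / (float64(durationNanos) / float64(time.Second))
}

func main() {
	// Hypothetical final chunk, invented for this example.
	final := `{"done":true,"prompt_eval_count":120,"prompt_eval_duration":400000000,"eval_count":600,"eval_duration":25700000000}`

	var m chatMetrics
	if err := json.Unmarshal([]byte(final), &m); err != nil {
		panic(err)
	}
	fmt.Printf("generation: %.1f tok/s\n", tokensPerSecond(m.EvalCount, m.EvalDuration))
	fmt.Printf("prompt:     %.1f tok/s\n", tokensPerSecond(m.PromptEvalCount, m.PromptEvalDuration))
}
```

Measuring from the model's own counters rather than wall-clock time keeps network and JSON overhead out of the figure.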

Discovery #1: It's a Thinking Model

This was the biggest surprise. When I first ran the tests, responses appeared empty. After debugging the streaming output, I discovered Gemma 4 is a thinking model — like DeepSeek-R1 or o1.

For complex questions, the response looks like this:

{"message":{"role":"assistant","content":"","thinking":"Here's a thinking process..."}}
{"message":{"role":"assistant","content":"","thinking":" to arrive at..."}}
// ... many thinking tokens ...
{"message":{"role":"assistant","content":"The three main patterns are..."}}

The model spends tokens on chain-of-thought reasoning in the thinking field before producing the final answer in content.
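
To make the two fields concrete, here is a small sketch of a reader that accumulates thinking and content separately from a newline-delimited JSON stream. The type and function names (chatChunk, collectStream, collectString) are mine, and the struct only models the fields shown above plus the done flag.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// chatChunk models only the parts of a streamed chunk used here.
type chatChunk struct {
	Message struct {
		Content  string `json:"content"`
		Thinking string `json:"thinking"`
	} `json:"message"`
	Done bool `json:"done"`
}

// collectStream reads one JSON object per line and returns the
// concatenated thinking text and the concatenated answer text.
func collectStream(r io.Reader) (string, string, error) {
	var thinking, content strings.Builder
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 0, 64*1024), 1<<20) // tolerate long lines
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var chunk chatChunk
		if err := json.Unmarshal([]byte(line), &chunk); err != nil {
			return "", "", fmt.Errorf("decode chunk: %w", err)
		}
		thinking.WriteString(chunk.Message.Thinking)
		content.WriteString(chunk.Message.Content)
		if chunk.Done {
			break
		}
	}
	if err := sc.Err(); err != nil {
		return "", "", err
	}
	return thinking.String(), content.String(), nil
}

// collectString is a convenience wrapper for in-memory streams.
func collectString(stream string) (string, string, error) {
	return collectStream(strings.NewReader(stream))
}

func main() {
	stream := `{"message":{"role":"assistant","content":"","thinking":"Here's a thinking process"}}
{"message":{"role":"assistant","content":"","thinking":" to arrive at an answer."}}
{"message":{"role":"assistant","content":"The three main patterns are..."},"done":true}
`
	thinking, content, err := collectString(stream)
	if err != nil {
		panic(err)
	}
	fmt.Printf("thinking: %q\nanswer:   %q\n", thinking, content)
}
```

A client that only reads content sees an empty string until the reasoning phase ends, which is exactly the "empty response" symptom I hit.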

The critical parameter is "think": false, which disables this behavior. Here is the difference in wall-clock time:

| Task            | think=true | think=false | Speedup |
|-----------------|------------|-------------|---------|
| Classification  | 6.9s       | 0.9s        | 7.7x    |
| JSON extraction | 19.4s      | 4.3s        | 4.5x    |
| Code generation | 26.7s      | 13.3s       | 2x      |

For structured extraction and classification, think=false is essential: in my tests the output quality was the same, without the reasoning overhead.

Discovery #2: Ollama API Quirks

Two gotchas that cost me an hour of debugging:

  1. /api/generate is broken for Gemma 4 — the response field is always empty (tokens are generated but not decoded to text). You must use /api/chat instead.

  2. Tool calling needs num_predict >= 2048 — with smaller budgets, thinking tokens consume the entire allocation and tool calls never emit. With enough headroom, the model is smart enough to skip thinking and call tools directly (34 tokens, 1.3s).

Discovery #3: Tool Calling is Excellent

Given this tool definition:

{
  "name": "search_contracts",
  "parameters": {
    "query": {"type": "string"},
    "min_budget": {"type": "number"},
    "category": {"type": "string", "enum": ["IT","construction","services"]}
  }
}

And the prompt: "Find IT contracts over 5M CNY"

Gemma 4 correctly inferred:

{
  "name": "search_contracts",
  "arguments": {
    "category": "IT",
    "min_budget": 5000000,
    "query": "IT contracts"
  }
}

34 tokens, 1.3 seconds. No thinking needed. This makes it viable for real-time tool routing.
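
Before executing a call like this, it is worth validating what the model produced. The sketch below decodes the name/arguments object shown above into a typed struct and rejects unknown tools or categories. In Ollama's chat response the same object arrives nested under message.tool_calls[].function; this example takes the inner object directly. The Go names (toolCall, searchContractsArgs, parseSearchContracts) and the non-negative budget check are my own assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolCall matches the shape printed above: a tool name plus raw arguments.
type toolCall struct {
	Name      string          `json:"name"`
	Arguments json.RawMessage `json:"arguments"`
}

// searchContractsArgs is a typed view of the search_contracts parameters.
type searchContractsArgs struct {
	Query     string  `json:"query"`
	MinBudget float64 `json:"min_budget"`
	Category  string  `json:"category"`
}

// allowedCategories mirrors the enum in the tool definition.
var allowedCategories = map[string]bool{"IT": true, "construction": true, "services": true}

// parseSearchContracts decodes and validates a model-issued tool call.
func parseSearchContracts(raw string) (searchContractsArgs, error) {
	var call toolCall
	if err := json.Unmarshal([]byte(raw), &call); err != nil {
		return searchContractsArgs{}, fmt.Errorf("decode tool call: %w", err)
	}
	if call.Name != "search_contracts" {
		return searchContractsArgs{}, fmt.Errorf("unexpected tool %q", call.Name)
	}
	var args searchContractsArgs
	if err := json.Unmarshal(call.Arguments, &args); err != nil {
		return searchContractsArgs{}, fmt.Errorf("decode arguments: %w", err)
	}
	if !allowedCategories[args.Category] {
		return searchContractsArgs{}, fmt.Errorf("category %q is not in the enum", args.Category)
	}
	if args.MinBudget < 0 { // assumed business rule, not part of the tool schema
		return searchContractsArgs{}, fmt.Errorf("min_budget must not be negative")
	}
	return args, nil
}

func main() {
	raw := `{"name":"search_contracts","arguments":{"category":"IT","min_budget":5000000,"query":"IT contracts"}}`
	args, err := parseSearchContracts(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", args)
}
```

A small model can still hallucinate an enum value or a tool name, so a cheap check like this keeps bad calls away from the database.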

The Architecture: Tiered Intelligence

Based on the benchmarks, I designed a two-tier system:

User Request
    |
    v
+------------------+
|  Gemma 4 (local) |  <-- Fast classification, extraction, routing
|  think=false     |      Latency: <1s to ~4s, Cost: $0
|  ~25 tok/s       |
+--------+---------+
         |
    +----+----+
    | Simple  | --> Return directly (classification, extraction, tags)
    | Complex | --> Escalate to cloud
    +----+----+
         v
+------------------+
| Claude/GPT (API) |  <-- Complex reasoning, long-form generation
| High quality     |      Latency: 2-10s, Pay per token
+------------------+

The key insight: most "intelligence" tasks in a multi-module app are simple classification and extraction — exactly what a local 8B model excels at.
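
The diagram can be sketched as a small router. This is an illustration of the pattern, not the MasterCLI code: Router, Triage, Answerer, and demoRoute are hypothetical names, and the "classify:" prefix rule in the demo stands in for a real triage call to the local model.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Answerer is any model backend that turns a prompt into an answer.
type Answerer func(ctx context.Context, prompt string) (string, error)

// Triage reports whether a request is simple enough for the local tier.
type Triage func(ctx context.Context, prompt string) (bool, error)

// Router tries the local tier first and escalates to the cloud tier.
type Router struct {
	Triage Triage
	Local  Answerer
	Cloud  Answerer
}

// Handle returns the answer and the name of the tier that produced it.
func (r Router) Handle(ctx context.Context, prompt string) (string, string, error) {
	simple, err := r.Triage(ctx, prompt)
	if err == nil && simple {
		if out, localErr := r.Local(ctx, prompt); localErr == nil {
			return out, "local", nil
		}
	}
	// Triage failed, the local model failed, or the task is complex: escalate.
	out, err := r.Cloud(ctx, prompt)
	if err != nil {
		return "", "", fmt.Errorf("cloud tier: %w", err)
	}
	return out, "cloud", nil
}

// demoRoute wires in-memory stand-ins for both tiers so the sketch runs offline.
func demoRoute(prompt string) (string, string, error) {
	r := Router{
		Triage: func(_ context.Context, p string) (bool, error) {
			return strings.HasPrefix(p, "classify:"), nil
		},
		Local: func(_ context.Context, _ string) (string, error) {
			return "local answer", nil
		},
		Cloud: func(_ context.Context, _ string) (string, error) {
			return "cloud answer", nil
		},
	}
	return r.Handle(context.Background(), prompt)
}

func main() {
	for _, p := range []string{"classify: refund request", "write a migration plan"} {
		answer, tier, err := demoRoute(p)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-26q -> %s (%s)\n", p, answer, tier)
	}
}
```

Falling through to the cloud on any local error means a crashed or overloaded Ollama degrades cost, not availability.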

Four Integrations in One Afternoon

P1: Master RAG — Query Classification Middleware

The RAG knowledge base has 80+ domains and 7 namespaces. Previously, users had to manually specify domains: ["ai-ml"] in their searches.

Now Gemma 4 auto-classifies:

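func (k *DB) ClassifyQuery(ctx context.Context, query string) *QueryClassification {
    result, err := k.ollama.QuickClassify(ctx, classifyPrompt, query)
    if err != nil {
        return nil // no classification on error
    }
    // result: {domains: ["ai-ml"], namespaces: ["code"], search_mode: "hybrid"}
    return parseClassification(result) // hypothetical helper that decodes the JSON
}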

Result: <1s to auto-detect domain/namespace. Users just type their query naturally.
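
One practical detail when trusting a small model with routing: its reply may contain extra prose around the JSON, or name a domain that does not exist. The sketch below shows one way to guard against both. The field names come from the example reply above; the function name, the allow-list argument, and the fallback to "hybrid" when search_mode is missing are assumptions of mine.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// QueryClassification mirrors the fields in the example reply.
type QueryClassification struct {
	Domains    []string `json:"domains"`
	Namespaces []string `json:"namespaces"`
	SearchMode string   `json:"search_mode"`
}

// parseClassification extracts the first JSON object from a model reply,
// drops domains that are not in knownDomains, and fills in a default mode.
func parseClassification(raw string, knownDomains map[string]bool) (*QueryClassification, error) {
	start, end := strings.Index(raw, "{"), strings.LastIndex(raw, "}")
	if start < 0 || end < start {
		return nil, fmt.Errorf("no JSON object in model reply")
	}
	var qc QueryClassification
	if err := json.Unmarshal([]byte(raw[start:end+1]), &qc); err != nil {
		return nil, fmt.Errorf("decode classification: %w", err)
	}
	kept := qc.Domains[:0]
	for _, d := range qc.Domains {
		if knownDomains[d] {
			kept = append(kept, d)
		}
	}
	qc.Domains = kept
	if qc.SearchMode == "" {
		qc.SearchMode = "hybrid" // assumed default
	}
	return &qc, nil
}

func main() {
	known := map[string]bool{"ai-ml": true, "golang": true}
	reply := `Sure! {"domains":["ai-ml","made-up"],"namespaces":["code"],"search_mode":"hybrid"}`
	qc, err := parseClassification(reply, known)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *qc)
}
```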

P2: Forum — Message Preprocessing

The multi-agent discussion forum runs 3+1 AI agents (Claude, Codex, Gemini + coordinator). Each message was going to the cloud for analysis.

Now messages are preprocessed locally — in a goroutine so it doesn't block the discussion:

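func (s *Server) handleSpeak(agentID, content string) {
    go func() {
        // Detached context so a slow model call cannot outlive its budget.
        // The 5s timeout is an assumed value; tune it for your hardware.
        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()
        if meta := s.preprocessMessage(ctx, agentID, content); meta != nil {
            s.hub.Publish("forum:post:meta", meta)
        }
    }()
    // ... save post and advance turn (not blocked) ...
}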

Result: Intent classification, sentiment analysis, and topic extraction — all in <1s, invisible to the discussion flow.

P3: Nexus — Tool Routing

Nexus orchestrates multiple AI agent terminals. When creating a new agent session, the system now classifies the task intent:

User: "What design patterns are used in the codebase?"
Gemma4: module=code, confidence=0.87, hint=grep

This is exposed as both an internal routing signal and a standalone MCP tool (classify_intent).

Bonus: The Duck Secretary Gets a Brain

MasterCLI's Dashboard has a mascot, a yellow rubber duck secretary, that scans the project state and generates daily briefings. Before Gemma 4, it produced mechanical summaries like "28 task(s) ready, 10 active goal(s)".

Now it generates actual insights:

Before: "28 task(s) ready, 10 active goal(s)"

After:  "The Browser module currently has the largest backlog, with 11 pending tasks.
         B-13, B-14, and B-15 are ready to begin.
         Prioritizing this batch today would also help create a more stable foundation for Dashboard and Nexus."

The key was prompt compression: a long prompt (180 chars, 5 requirements) took 19.7s. A one-line prompt (50 chars) with compact data produced equally good output in 4.3s. The duck is now genuinely useful.
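
As an illustration of what "compact data" can look like, here is a sketch that flattens per-module backlog counts into a single short line behind a one-sentence instruction. The struct, the function name, the prompt wording, and the Nexus count are invented for the example. Only the Browser figures come from the briefing above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// moduleBacklog is a hypothetical summary of one module's task state.
type moduleBacklog struct {
	Name    string
	Pending int
	Ready   []string
}

// compactPrompt renders the backlog as one dense line, largest backlog first,
// prefixed by a single short instruction.
func compactPrompt(mods []moduleBacklog) string {
	sorted := append([]moduleBacklog(nil), mods...) // do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Pending > sorted[j].Pending })

	parts := make([]string, 0, len(sorted))
	for _, m := range sorted {
		p := fmt.Sprintf("%s:%d", m.Name, m.Pending)
		if len(m.Ready) > 0 {
			p += " ready=" + strings.Join(m.Ready, ",")
		}
		parts = append(parts, p)
	}
	return "One-sentence priority insight. Data: " + strings.Join(parts, "; ")
}

func main() {
	mods := []moduleBacklog{
		{Name: "Nexus", Pending: 8},
		{Name: "Browser", Pending: 11, Ready: []string{"B-13", "B-14", "B-15"}},
	}
	fmt.Println(compactPrompt(mods))
}
```

The point is to move structure out of the instructions and into the data: the model has fewer requirements to juggle and fewer prompt tokens to process.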

The Go Client: 150 Lines

Each module gets a lightweight Ollama chat client — the same pattern, ~150 lines of Go:

type OllamaChat struct {
    endpoint   string // "http://localhost:11434"
    model      string // "gemma4"
    httpClient *http.Client
}

func (o *OllamaChat) QuickClassify(ctx context.Context, system, input string) (string, error) {
    // POST /api/chat with stream=true, think=false, num_predict=128
    // Concatenate streaming chunks, return content
}

Key configuration rules:

  • Always use /api/chat, never /api/generate (Gemma 4 bug)
  • think: false for classification/extraction (4.5-7.7x faster in the benchmarks above)
  • num_predict: 2048 for tool calling (needs headroom)
  • Streaming mode to capture both thinking and content fields
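
Putting those rules together, here is one possible body for QuickClassify. Treat it as a sketch rather than the production client: the struct and method signature follow the snippet above, while the request map, the error handling, and the fakeTransport scaffolding (there only so the example runs without a live Ollama) are my assumptions.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
)

type OllamaChat struct {
	endpoint   string // "http://localhost:11434"
	model      string // "gemma4"
	httpClient *http.Client
}

// QuickClassify posts to /api/chat with stream=true, think=false and a small
// num_predict budget, then concatenates the streamed content chunks.
func (o *OllamaChat) QuickClassify(ctx context.Context, system, input string) (string, error) {
	body, err := json.Marshal(map[string]any{
		"model": o.model,
		"messages": []map[string]string{
			{"role": "system", "content": system},
			{"role": "user", "content": input},
		},
		"stream":  true,
		"think":   false,
		"options": map[string]any{"num_predict": 128},
	})
	if err != nil {
		return "", err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, o.endpoint+"/api/chat", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := o.httpClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("ollama: unexpected status %s", resp.Status)
	}

	var out strings.Builder
	dec := json.NewDecoder(resp.Body) // the stream is a sequence of JSON objects
	for {
		var chunk struct {
			Message struct {
				Content string `json:"content"`
			} `json:"message"`
			Done bool `json:"done"`
		}
		if err := dec.Decode(&chunk); err == io.EOF {
			break
		} else if err != nil {
			return "", err
		}
		out.WriteString(chunk.Message.Content)
		if chunk.Done {
			break
		}
	}
	return strings.TrimSpace(out.String()), nil
}

// fakeTransport is hypothetical scaffolding: it answers in-process with a
// canned two-chunk stream instead of contacting a real server.
type fakeTransport struct{}

func (fakeTransport) RoundTrip(r *http.Request) (*http.Response, error) {
	if r.URL.Path != "/api/chat" {
		return nil, fmt.Errorf("unexpected path %s", r.URL.Path)
	}
	stream := `{"message":{"content":"ques"},"done":false}` + "\n" +
		`{"message":{"content":"tion"},"done":true}` + "\n"
	return &http.Response{
		StatusCode: http.StatusOK,
		Status:     "200 OK",
		Header:     make(http.Header),
		Body:       io.NopCloser(strings.NewReader(stream)),
		Request:    r,
	}, nil
}

// runDemo exercises the client against the fake transport.
func runDemo() (string, error) {
	c := &OllamaChat{
		endpoint:   "http://localhost:11434",
		model:      "gemma4",
		httpClient: &http.Client{Transport: fakeTransport{}},
	}
	return c.QuickClassify(context.Background(), "Classify the intent in one word.", "How do I reset my password?")
}

func main() {
	label, err := runDemo()
	if err != nil {
		panic(err)
	}
	fmt.Println(label)
}
```

To talk to a real Ollama instance, drop the Transport field (or use http.DefaultClient) and keep everything else unchanged.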

Cost Analysis

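| Metric                 | Before (Cloud API) | After (Local Gemma 4) |
|------------------------|--------------------|-----------------------|
| RAG classification     | ~$7/day            | $0                    |
| Forum preprocessing    | ~$8/day            | $0                    |
| Nexus routing          | ~$1/day            | $0                    |
| Duck Secretary insight | ~$1/day            | $0                    |
| Total                  | ~$17/day           | $0 + electricity      |

Annual savings: ~$6,200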
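
For anyone plugging in their own numbers, the arithmetic behind the table is a sum of the per-module daily costs multiplied by 365. This small sketch uses the figures from the table and ignores electricity, which I did not measure.

```go
package main

import "fmt"

// annualSavings sums per-module daily API costs (in dollars) and
// projects them over a 365-day year.
func annualSavings(dailyCosts []float64) (daily, annual float64) {
	for _, c := range dailyCosts {
		daily += c
	}
	return daily, daily * 365
}

func main() {
	// RAG classification, forum preprocessing, Nexus routing, Duck Secretary.
	daily, annual := annualSavings([]float64{7, 8, 1, 1})
	fmt.Printf("daily: $%.0f, annual: $%.0f\n", daily, annual)
}
```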

The tradeoff: ~25 tok/s means you can't use it for long-form generation. But for classification, extraction, and routing? It's free and fast enough.

Lessons Learned

  1. Gemma 4 is a thinking model — if you don't know this, your responses look empty. Use think: false for production workloads.

  2. 8B models are production-ready for structured tasks — classification, extraction, tool calling. Don't overpay for intelligence you don't need.

  3. The Ollama API has model-specific quirks — always test with your specific model. Gemma 4 breaks the generate endpoint.

  4. Hybrid architecture wins — local models for fast/cheap tasks, cloud for complex reasoning. The routing logic itself can run on the local model.

  5. Go + Ollama streaming is straightforward — the /api/chat streaming protocol is simple JSON lines. No SDK needed.

Going Deeper

The hybrid architecture in this article — local models for routing, cloud models for reasoning — is one of the patterns I cover in depth in my two books:

"Production MCP Servers with Go" covers the full lifecycle of building MCP servers like the ones powering Master RAG: tool calling, resource management, authentication, testing, and deployment.

"Building AI Coding Agents" goes wider — agent loops, context management, safety models, eval frameworks, and multi-agent orchestration. The model routing pattern from Chapter 6 is exactly what this article implements with Gemma 4.

Both are based on the same production codebase described here.


Have you tested Gemma 4 locally? What's your experience with hybrid local/cloud architectures? I'd love to hear about your setup in the comments.


Tags: #gemma4 #ollama #golang #ai #mcp #localllm #devtools

Series: Building AI-Native Applications with Go
