This is a submission for the Gemma 4 Challenge: Write About Gemma 4
A Write track submission for the Gemma 4 Challenge — an honest look at local AI tool-use on consumer hardware, and the architecture that made it work.
Local AI models are having a moment. You can pull Gemma 4 in a single command, run it on hardware you already own, and have a private, capable LLM running in minutes. That part is solved.
The part nobody talks about? Giving it tools.
Not "tools" in the abstract sense. I mean: let Gemma 4 search the web, read your files, query a database, send a Slack message, spin up a Cloudflare Worker — the things that make an AI agent actually useful in production. That part is not solved. Not out of the box. Not on Windows. Not without hitting walls that will cost you days.
This is the story of how I got Gemma 4 running with 150+ MCP tools on a Windows machine, what broke, what I learned, and the architecture that finally held.
Why Gemma 4
I want to be upfront about intentionality here — the judging rubric asks for it, so let's talk about it directly.
I chose Gemma 4 for one reason: it runs on hardware my clients actually own.
I work with small engineering teams, often in compliance-sensitive environments where "just use the API" isn't an option. Data can't leave the building. Models need to be auditable. And the budget for a dedicated GPU cluster doesn't exist.
Gemma 4 hits a specific sweet spot, and I went with the E2B variant specifically:
- Runs on a workstation-class CPU with no dedicated GPU required (I'm on a Dell T3610 — nothing exotic)
- Capable enough to handle tool-use routing with reasonable accuracy at that size
- Natively multimodal out of the box — vision came for free
- Local-first, which matters in healthcare and legal contexts where data can't leave the building
- Apache 2.0 licensed — no usage restrictions for commercial client work
- Maintained by Google DeepMind with a clear roadmap
I'm not running it because it's the most powerful model. I'm running it because it's the most deployable model for the environments I actually work in. That distinction matters.
The Problem With Local AI + Tools
Most tutorials stop at "run the model." The interesting engineering starts when you ask: how does this model call an external service?
The answer involves MCP — the Model Context Protocol, an open standard from Anthropic that defines how AI models communicate with tools. It's the right answer. But on Windows, it is a brutal debugging experience.
Here's what I ran into:
Problem 1: DNS rebinding protection kills local servers
MCP Python servers using FastMCP have DNS rebinding protection enabled by default. When you're running a local gateway, this silently blocks connections. The error message tells you nothing useful. The fix is a single line — TransportSecuritySettings(enable_dns_rebinding_protection=False) — but you won't find it in the official docs.
Problem 2: UTF-16 LE encoding corrupts Python files
If you edit any MCP server config in Notepad on Windows and save it, you may have silently corrupted the file. Notepad defaults to UTF-16 LE. Python expects UTF-8. The file looks fine in any editor. It just doesn't run. At all.
Problem 3: Claude Desktop has an 8-server limit
Exceed 8 registered MCP servers and Claude Desktop silently drops servers on a 60-second timeout. No warning. No error. Servers just disappear. This one cost me an afternoon.
Problem 4: docker-compose vs docker compose
On older Docker installs (which enterprise machines often have), docker compose (v2 syntax) doesn't exist. docker-compose (v1) does. Scripts that work on your dev machine fail silently on the client's server. Every time.
Problem 5: stdout/stderr deadlocks with subprocess
If you're building an MCP server that shells out to other processes, capture_output=True in Python's subprocess will deadlock when the child process writes enough output to fill the pipe buffer. This is a known Python issue. The fix is writing to temp files. The symptoms look like your server just... hangs.
I documented all of these in an earlier dev.to post. The point here is: tool-use on Windows is not plug-and-play, and Gemma 4 is no exception.
The Architecture
Once I understood the failure modes, I built a gateway layer that sits between Gemma 4 and every tool it needs to call.
Here's what it looks like:
┌─────────────────────────────────────────────────────┐
│ Gemma 4 (Ollama) │
│ Running locally on T3610 │
└─────────────────────┬───────────────────────────────┘
│ OpenAI-compatible API
▼
┌─────────────────────────────────────────────────────┐
│ MCP Gateway (Docker) │
│ Unified tool routing layer │
│ Port 8089 → 8009 (internal) │
└──────┬────────┬────────┬────────┬───────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
Web Search Files Notion Cloudflare
Databases Git Slack Calendar
...150+ tools total
The key design decision: Gemma 4 never talks to tools directly. It talks to the gateway. The gateway handles all the MCP plumbing, server registration, error recovery, and routing. Gemma just sees a clean tool list and calls them by name.
This matters because:
- Model portability — I can swap Gemma 4 for any other OpenAI-compatible model without touching the tool layer
- Failure isolation — when a tool server crashes (and they do), it doesn't take down the whole session
- Auditability — every tool call goes through a single chokepoint I can log, inspect, and gate
What Gemma 4 Can Actually Do In This Setup
Once the architecture was solid, I tested Gemma 4 on tool-use tasks that matter in real workflows:
File operations: Reading, writing, and summarizing local files. Solid. This is where smaller models shine — the task is clear, the context is bounded.
Multi-step research: "Search for recent papers on X, save a summary to Notion, and create a calendar reminder to review it." This worked about 70% of the time. The failure mode is mid-chain context loss — Gemma 4 sometimes forgets it's mid-task after a tool result comes back with a lot of tokens.
Code operations: Reading a repo, identifying a bug, writing a fix. Reasonable for small files. Falls apart on large codebases where context window becomes a real constraint.
Calendar and communication tools: Gemma 4 handles these well. The tasks are short, the schemas are simple, and the model doesn't need to maintain long chains.
The honest limitation: Gemma 4 is not GPT-4 at multi-step agentic tasks. If your workflow requires 8+ chained tool calls with complex state management, you'll see degradation. For single and double-hop tool use, it handles itself well.
Why This Architecture Generalizes
I want to make a point that goes beyond this specific setup.
The model will keep getting better. Gemma 5, 6 — whatever comes next will handle multi-step tool-use with more reliability. But the infrastructure problem — routing, reliability, security, Windows compatibility — that doesn't get solved by a better model.
The gateway layer I built is model-agnostic. The same Docker Compose stack that runs Gemma 4 today can route Llama 4, Mistral, or any future open-source model tomorrow. The tools don't change. The clients don't change. Only the model does.
That's the bet I'm making: the infrastructure for local AI tool-use is the durable asset. The models are the commodity.
Getting Started
If you want to replicate this setup, here's the minimum viable path:
Prerequisites:
- Docker (any recent version)
- Ollama installed locally
- A Windows machine with at least 16GB RAM (32GB recommended for Gemma 4)
Step 1: Pull Gemma 4 via Ollama
ollama pull gemma4:e2b
Step 2: Verify it runs
ollama run gemma4:e2b "List 3 things you can help me with"
Step 3: Point your gateway at Ollama's OpenAI-compatible endpoint
http://localhost:11434/v1
Ollama exposes an OpenAI-compatible API by default. Your gateway, any MCP-compatible client, or any OpenAI SDK can talk to it without modification.
Step 4: Register your tools
This is where the gateway earns its keep. Instead of configuring each MCP server individually for Gemma, you register them once at the gateway level. The model gets a unified tool list. You get one place to manage everything.
What I'd Tell Someone Starting This Today
- Expect Windows-specific failures. They're not your fault. They're documented failure modes that affect everyone. Know them going in.
- Use a gateway layer. Don't wire tools directly to the model. You'll regret it the first time a tool server crashes mid-session.
- Start with single-hop tool calls. Verify the plumbing works before building multi-step agents. One successful file read tells you more than a complex workflow that fails mysteriously.
- Pick the right model for the task. Gemma 4 is excellent for bounded, local, compliance-sensitive workflows. It is not a replacement for frontier models on complex agentic tasks. Know what you're optimizing for.
- The infrastructure is the product. The model is a component.
Closing
Local AI + tools on Windows is solvable. It just requires understanding the failure modes, designing around them, and building an infrastructure layer that the model can rely on.
Gemma 4 made this story possible because it runs on hardware that's actually available in the environments that need this most. That's not a small thing.
If you're building something similar, I'm happy to talk through the architecture in the comments. The failure modes I listed above have documented fixes — no reason to rediscover them the hard way.
Top comments (0)