Tijo Gaucher

Running Gemma 4 next to your agent runtime: notes from a small shop

My brother Brandon and I run RapidClaw. Most days it's just the two of us, a handful of customers, and a few agents chugging along in production. A few months ago we started putting small open-weight models on the same box as the agent runtime — mostly Gemma 4, a bit of Phi-4 for comparison, some Qwen. This is a short write-up of what's actually worked and what hasn't.

Nothing revolutionary here. I'm writing it because I searched for "agent + local Gemma" a bunch of times last quarter and mostly found benchmark posts, not lived-experience notes.

The thing we noticed

The newest small models are small enough that they fit on the same machine as the agent loop. That's the whole observation. Gemma 4 4B runs fine on a 24 GB GPU next to a Node process running our agent code. Phi-4 14B is tight but works. A year ago you needed a separate inference box, which meant a network hop, which meant we just paid a hosted API and moved on.

Now the tradeoff is different. You can keep the hosted model for the hard stuff and quietly route the cheap, high-volume calls to the local model. Hybrid, not replacement.

What we actually do

We have four agents running in production right now. One of them — the one that classifies incoming support messages and decides which of the other agents to hand off to — used to make a hosted-model call per message. That single agent was roughly 80% of our inference spend because it ran on every message, even the obvious ones.

We moved that classifier to Gemma 4 4B on the same box. The agent framework is unchanged; it just points at a local OpenAI-compatible endpoint (we're using Ollama for now; llama.cpp's server also works). The other three agents still call the hosted models when they need to reason about something real.
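If you're curious what the swap looks like from the agent's side, here's a minimal sketch. It uses the openai npm client with the base URL pointed at Ollama's OpenAI-compatible endpoint; the model tag and the classify prompt are placeholders for illustration, not our production code.

```typescript
import OpenAI from "openai";

// Same client library the hosted path uses; only the endpoint changes.
// Ollama serves an OpenAI-compatible API on /v1 by default.
const local = new OpenAI({
  baseURL: "http://localhost:11434/v1",
  apiKey: "ollama", // required by the client, ignored by Ollama
});

const MODEL = "gemma-4b"; // placeholder: use whatever tag your local install has

// Hypothetical classifier call: one short, terse prompt per message.
async function classify(message: string): Promise<string> {
  const res = await local.chat.completions.create({
    model: MODEL,
    messages: [
      {
        role: "system",
        content:
          "Classify the support message. Reply with exactly one of: billing, bug, sales, other.",
      },
      { role: "user", content: message },
    ],
    temperature: 0,
  });
  return res.choices[0].message.content?.trim() ?? "other";
}
```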

That's it. One local model, four agents, one box. No Kubernetes, no model router, no fancy fallback chain.

Numbers from our box

Single machine, RTX 4090, one of our production workers. Measured over a week in March on real traffic, not a synthetic benchmark.

| Path | Median latency | p95 | Cost per 1k calls |
| --- | --- | --- | --- |
| Hosted Sonnet-class | 1.8s | 4.2s | ~$4.50 |
| Hosted mini/flash-class | 0.9s | 2.1s | ~$0.60 |
| Gemma 4 4B, local, same box | 0.25s | 0.6s | ~$0.04* |

*Local cost is amortized GPU + power on a box we were already paying for. If you had to rent a GPU just for this, the numbers flip hard — more on that below.

For the classifier workload specifically, Gemma 4 is good enough. It's not as sharp as the big hosted models, but "is this message a billing question or a bug report" doesn't need the big hosted models. We compared a week of its outputs against the hosted model's outputs on the same messages — they agreed on about 94% of them. The 6% where they disagreed were mostly ambiguous messages where the hosted model wasn't obviously right either.
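The comparison itself is trivial to run. A sketch, assuming you've logged both models' labels for the same messages:

```typescript
// Hypothetical shape: one record per message, one label from each path.
type LabeledPair = { hosted: string; local: string };

function agreementRate(pairs: LabeledPair[]): number {
  const agreed = pairs.filter((p) => p.hosted === p.local).length;
  return agreed / pairs.length;
}

// The disagreements are the interesting part; eyeball them by hand.
const disagreements = (pairs: LabeledPair[]) =>
  pairs.filter((p) => p.hosted !== p.local);
```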

Gotchas we hit

Cold starts are real. The first request after the model unloads took 8–15 seconds. We pin the model in memory with a keepalive. Obvious in hindsight.
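For Ollama specifically, the pin looks something like this. It hits the native API rather than the OpenAI-compatible route, because that's where the keep_alive knob lives; you can also just set OLLAMA_KEEP_ALIVE=-1 in the server's environment instead.

```typescript
// keep_alive: -1 asks Ollama to keep the model resident indefinitely.
// An empty prompt loads the model without generating anything.
async function pinModel(model: string): Promise<void> {
  await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ model, prompt: "", keep_alive: -1 }),
  });
}
```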

VRAM math is tighter than you think. Gemma 4 4B at Q4, plus an 8k context window, plus our Node process, plus the occasional burst of parallel requests: we hit OOM twice in the first week. We now cap concurrent local calls at 3 and queue the rest. Nothing fancy.
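"Nothing fancy" in practice is a small semaphore in front of the local client. A sketch of the shape we use, not our exact code:

```typescript
// At most N local calls in flight; everything else queues.
class Semaphore {
  private queue: Array<() => void> = [];
  private available: number;

  constructor(max: number) {
    this.available = max;
  }

  private async acquire(): Promise<void> {
    if (this.available > 0) {
      this.available--;
      return;
    }
    await new Promise<void>((resolve) => this.queue.push(resolve));
  }

  private release(): void {
    const next = this.queue.shift();
    if (next) next(); // hand the slot straight to the next waiter
    else this.available++;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}

const localSlots = new Semaphore(3); // matches our cap of 3 concurrent local calls
// usage: const label = await localSlots.run(() => classify(message));
```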

Prompt formats drift. A prompt that worked cleanly on the hosted model produced mush on Gemma. Small models are less forgiving of vague instructions. We ended up maintaining two prompt versions — one terse and explicit for Gemma, one more conversational for the hosted model. Not ideal but it's only two prompts.
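To give a feel for the split (the actual prompts are ours; these are illustrative):

```typescript
const CLASSIFIER_PROMPTS = {
  // Small local model: terse, explicit, no room for interpretation.
  local:
    "Classify the support message. Output exactly one word: billing | bug | sales | other. No explanation.",
  // Hosted model: tolerates (and benefits from) a more conversational framing.
  hosted:
    "You triage incoming support messages for a small hosting company. Read the message and decide whether it is primarily a billing question, a bug report, a sales inquiry, or something else, then answer with one of: billing, bug, sales, other.",
} as const;
```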

Eval is annoying but necessary. You can't just swap models and hope. We built a small eval set (about 200 labeled messages) and run it whenever we change the local model or the prompt. Takes five minutes. Worth it.
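The whole eval harness fits in a page. A sketch, reusing the classify function from earlier; the eval file format here is hypothetical:

```typescript
import { readFileSync } from "node:fs";

type EvalCase = { message: string; expected: string };

// Run the labeled set through the local model and print misses for review.
async function runEval(path: string): Promise<void> {
  const cases: EvalCase[] = JSON.parse(readFileSync(path, "utf8"));
  let correct = 0;
  for (const c of cases) {
    const got = await classify(c.message);
    if (got === c.expected) correct++;
    else
      console.log(
        `MISS: expected ${c.expected}, got ${got} :: ${c.message.slice(0, 80)}`
      );
  }
  const pct = ((100 * correct) / cases.length).toFixed(1);
  console.log(`${correct}/${cases.length} correct (${pct}%)`);
}
```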

When not to bother

Honestly, most people reading this probably shouldn't do this yet. A few cases where it doesn't make sense:

  • Low volume. If you're making under ~10k inference calls a day, the hosted APIs are cheaper than any GPU you'd rent. Local only wins at volume (see the back-of-envelope sketch after this list).
  • You don't already have a box. If you're renting a GPU purely to run Gemma 4, the math only works if you're saturating it. We could do this because we already had machines running the agent runtime with idle GPU capacity.
  • The task actually needs the big model. If you're doing code generation or multi-step planning, Gemma 4 4B will frustrate you. Use the hosted model and stop fighting it.
  • You're early. If you're pre-product-market-fit, every hour spent on inference optimization is an hour not spent on the thing users actually care about. We only did this after the classifier bill started showing up in the monthly.
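The back-of-envelope on the volume point, using the cost column from the table above. The GPU rental price is an assumption for illustration, not a quote:

```typescript
const callsPerMonth = 10_000 * 30; // 300k calls at the ~10k/day threshold

const hostedMiniPer1k = 0.6; // from our table
const localPer1k = 0.04; // amortized, on a box we already own

const hostedMonthly = (callsPerMonth / 1000) * hostedMiniPer1k; // $180/month
const localMonthly = (callsPerMonth / 1000) * localPer1k; // $12/month

// Assumed rental for a 24 GB GPU you'd otherwise have to lease: ~$250/month.
// $180 hosted vs $12 + $250 rented: at this volume, renting a GPU loses.
// On a box you already own, the $250 drops out and local wins easily.
```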

What I'd try next

Phi-4 14B for one of the agents that does light reasoning over structured data. We haven't moved it yet because the quality bar is higher and I haven't built the eval set for it. Probably in April.

Also curious about Qwen 2.5 for a multilingual case we have, but that's further out.


That's the whole post. Nothing dramatic — a classifier moved, a bill went down, we learned some boring operational lessons. Small open-weight models finally being small enough to share a box with the agent runtime is, for us, the thing that made any of this viable.


Tijo Gaucher runs RapidClaw (rapidclaw.dev) with his brother Brandon — managed hosting for AI agents. If you're running agents and curious about hybrid local/hosted setups, the site has more.
