This is a submission for the Gemma 4 Challenge: Write About Gemma 4
I Ditched Cloud LLMs for Gemma 4 4B: A DevOps Engineer's 48-Hour Real...
The hardcoded values and duplicate components problem resonates — I've seen that pattern repeatedly in code that ships under time pressure, whether it's a hackathon or a production hotfix. One thing I've found helpful when returning to old code is running a static analysis pass first (ESLint with stricter rules than you had originally) because it surfaces the structural issues faster than manual review, and gives you a prioritized list of what to fix before you even understand the logic again. The "step by step" approach you describe is the right call; rewriting from scratch almost always underestimates how much implicit knowledge is buried in the messy version.
Great tip, Ofri! I actually started doing exactly that after this experiment: running terraform validate + tflint before feeding configs to Gemma. It catches the structural noise so the model can focus on the architectural logic instead of syntax gotchas. The "implicit knowledge" point is gold; that's why I never trust AI to rewrite from scratch without my review. Step by step preserves context; rewrite destroys it.
I tend to run it through a dependency grapher first. That way you have a relationship table that maps everything to its source. That tends to make it much faster and easier for the LLM to detect where and why things are hard-coded and what breaks if it changes.
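A minimal sketch of the pre-check idea from this exchange (run terraform validate and tflint first, then hand the findings to the model along with the config). The function names and the prompt wording are hypothetical, and the call to the local model is left out; only the two CLI commands come from the thread.

```python
import subprocess


def run_check(cmd, cwd="."):
    """Run a CLI check and return its combined output, or a note if the tool is missing."""
    try:
        result = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    except FileNotFoundError:
        return f"({cmd[0]} not installed)"
    return (result.stdout + result.stderr).strip()


def precheck(cwd="."):
    """Collect findings from both tools, keyed by tool name."""
    return {
        "terraform validate": run_check(["terraform", "validate", "-no-color"], cwd),
        "tflint": run_check(["tflint"], cwd),
    }


def build_review_prompt(config_text, findings):
    """Put tool findings ahead of the config so the model can skip syntax-level issues."""
    header = (
        "The static checks below already ran. Do not repeat their findings; "
        "review the architecture instead."
    )
    sections = [header]
    for tool, output in findings.items():
        sections.append(f"## {tool} output\n{output or '(no findings)'}")
    sections.append(f"## Terraform config\n{config_text}")
    return "\n\n".join(sections)
```

Calling build_review_prompt(open("main.tf").read(), precheck()) would produce the text to send to whatever local endpoint is serving the model.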
4 agents on 8GB is tidy work, I'll give you that. But I have to ask: what's the latency hit when you shard the KV cache that aggressively? I get the concurrency appeal, but for my daily DevOps tasks (log triage, config review), one clean 4B instance at full speed beats 4 throttled agents fighting for VRAM. Different workloads, different math. Out of curiosity, what's your typical TTFT with that setup?
Surprisingly, with LM Studio in Vulkan mode (RX 9060 XT 8GB), latency isn't that bad. Concurrency fits in the VRAM and compute, but the problem is the KV cache. With a shared KV cache for a simple task, e.g. "Find the cheapest flight from A-Z on site {X}", the initializing phase is about 2 seconds (400 tokens) per agent when initializing all at once, after which they run at a smooth 96 tps each. Using standard LLM tools, that would kill it, but using my MCP it ran pretty decently. When I switched to headful browsers, it was on average about a second per action (typing uses scripting, so it executes without LLM inference). I'll need to set up and run a benchmark for exact figures for you; I'll get on that tomorrow and respond with the hard data. Overall, testing E2B vs Qwen 4B, Qwen 4B consistently performed worse. I'll run tests tomorrow, including tests with the new drafters Google released, to see how they compare vs the standard 4B.
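As a rough back-of-the-envelope check on those figures, the arithmetic can be sketched like this. The 2 s init, 400 tokens, 96 tps and 4 agents come from the thread; the 480-token response length is a hypothetical value picked only for illustration, and treating the init phase as prompt processing is an assumption.

```python
def agent_task_seconds(init_seconds, output_tokens, tokens_per_second):
    """Estimated wall-clock time for one agent: init phase plus token generation."""
    return init_seconds + output_tokens / tokens_per_second


INIT_SECONDS = 2.0    # from the thread: ~2 s init per agent
INIT_TOKENS = 400     # from the thread: tokens handled during init
DECODE_TPS = 96       # from the thread: tokens per second per agent
AGENTS = 4            # from the thread: concurrent agents on the 8GB card
OUTPUT_TOKENS = 480   # hypothetical response length, not a measured value

# Implied per-agent rate during init: 400 / 2.0 = 200 tokens per second.
init_tps = INIT_TOKENS / INIT_SECONDS

# Combined generation rate across all agents: 4 * 96 = 384 tokens per second.
aggregate_tps = AGENTS * DECODE_TPS

# One agent producing 480 tokens: 2.0 + 480 / 96 = 7.0 seconds.
task_seconds = agent_task_seconds(INIT_SECONDS, OUTPUT_TOKENS, DECODE_TPS)

print(init_tps, aggregate_tps, task_seconds)
```

The longer the agents run after init, the smaller the 2 s share becomes, which is why the init hit matters less for long-lived tasks.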
The $847 framing is convincing but it ignores a hidden cost: self-hosting means you also inherit eval pipeline maintenance, model swaps when better weights drop, and GPU babysitting. At 4B params it's fine. Past 30B in production it stops looking like savings.
Absolutely fair point, Valentin! You're right — my "savings" calculation only holds at 4B. The moment you need 31B Dense in production, the math flips: you're now in GPU rental / hardware depreciation territory. I deliberately stayed in the "laptop-friendly" zone because that's where the ROI is today for solo devs and small teams. Past 30B, managed cloud starts looking cheap again. The sweet spot is knowing exactly where your workload crosses that line.
Now imagine running Kimi K2.5 locally... Sounds insane, but if you can't afford data leaks, some people are forced to buy an entire rack just for it. The main thing with the cost of running local is that you need to make the most of everything, e.g. running a draft model alongside it and using parallelism to scale it. With Gemma 4 E2B, I can fit around 4 concurrent agents in parallel on an 8GB card with some hacking (weights and KV cache), which isn't too bad. With something like a 7900 XTX with 24GB, you can scale the context window, or you can up the concurrency (albeit barely, given the compute limit). But having 5 agents run a task at once, e.g. scraping a dependency graph for hard-coded values, insecure API endpoints, or browser agents for UI fuzzing, you get a pretty decent workforce for a few bucks that at the very least cuts your consumption on cloud.
96 tps per agent with shared KV cache on 8GB is honestly impressive — I expected way worse. The 2s init hit is acceptable if the agents run for a while after.
Would love to see those benchmark numbers when you have them, especially the drafter comparison. My gut says Google's draft models will crush standard 4B on TTFT, but curious if the quality trade-off is noticeable for non-creative tasks.
Also: MCP for browser automation — are you using Playwright MCP or something custom? I've been meaning to wire that up for infra health checks.
Custom. It hooks into your native Chromium browser; for the pods, it uses barebones Chromium on an Alpine image. While it's set up to use a fresh account each run, the purpose is to reuse accounts, so you can build a web presence with your agents and have your core logins saved in a key vault, so you can execute tasks like booking flights, not just finding them (though I'm using throwaway pods for finding the data and an authenticated agent browser for executing the booking). I've been preoccupied lately with other projects, namely Doccit (Autonomous Accounting Suite), Windows-MCP (custom sandbox orchestration layer that automates anything inside Windows), V.A.L.I.D. demo apps (e.g. JabuDemo for a cash-deposit-to-digital-money company), and a V.A.L.I.D.-based demo app for replacing fast food industries' entire infrastructure (mobile + web app, kitchen management software, POS software, drive-thru management, stock management, driver tracking, order priority orchestrator, batch and forecasting for kitchen management, etc.). But I'll see when I can get to it and dump the logs for you.
To be fair, multi-agent batching changes the math: hardware amortized across 4-5 concurrent tasks brings per-agent cost way down. What doesn't get amortized is the maintenance. KV cache hacking, weight juggling, draft model coordination: that's full-time infra work. For a team that already owns that layer, the 'few bucks' holds. For a team that doesn't, there's a hidden FTE inside the number.
This is the kind of reality check local LLMs need. They’re not always better than cloud models, but the privacy and control angle is huge for DevOps work. Logs, configs, scripts, internal notes, and messy debugging context are exactly the things people hesitate to paste into cloud tools. Even if the model is smaller, reducing that hesitation can change the workflow completely.
You nailed it, Varsha! That "hesitation" you mention is the real killer. Before Gemma 4B, I'd catch myself sanitizing logs before pasting them into a cloud tool — stripping IPs, renaming services... by the time the prompt was "safe", I'd already solved half the problem manually. Local AI removes that friction entirely. Privacy isn't just compliance, it's productivity.
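For a sense of what that manual sanitizing step amounts to when a prompt does have to leave the machine, here is a minimal sketch. The function name, the placeholder string, and the alias mapping are all hypothetical; the regex only targets IPv4-shaped strings, so it will also catch things like four-part version numbers and will miss IPv6 entirely.

```python
import re

# Matches IPv4-shaped tokens such as 10.0.3.17 (no range validation on the octets).
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def sanitize(log_text, service_aliases):
    """Replace IPv4 addresses with a placeholder and swap real service names for aliases."""
    redacted = IPV4_PATTERN.sub("<ip>", log_text)
    for real_name, alias in service_aliases.items():
        redacted = redacted.replace(real_name, alias)
    return redacted


print(sanitize("payments-api 10.0.3.17 timeout", {"payments-api": "service-a"}))
```

Even scripted, the alias map has to be maintained by hand, which is the friction the comment describes.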
Exactly. Privacy becomes a workflow problem, not just a security checkbox. If engineers have to pause, sanitize, and rethink every prompt, the tool already lost half its value.
The $847 breakdown is the tell, and it's the most common cost mistake in AI infra: log summarization, Terraform reviews, and stacktrace explanations are all low-stakes, high-volume tasks that never needed a frontier model in the first place. You weren't overpaying because cloud is expensive; you were overpaying because you ran a Ferrari for the grocery run. That's exactly why "local vs cloud" is slightly the wrong axis. The real lever is routing: a 4B local model handles the bulk grunt work for ~free, and you reserve a cloud frontier call only for the genuinely hard reasoning that earns the price. Going all-local trades a variable bill for a fixed one and a quality ceiling; the cheaper answer for most teams is task-tiered routing, i.e. the cheapest model that clears the bar per task. That's the discipline I bake into Moonshift: cost tracks the difficulty of the work, not a flat default. After 48 hours, where did Gemma 4B actually fall short enough that you'd still reach for the cloud? Was it the cryptic-stacktrace reasoning, or did it hold up better than expected there too?
"Ferrari for the grocery run" — I'm stealing that, Harjot.
You're absolutely right: the $847 wasn't a cloud problem, it was a routing problem. I was defaulting to GPT-4o for everything because the API key was already there. Convenience tax. To answer your question: Gemma 4B fell short on multi-hop reasoning. Single stacktrace? Fine. Stacktrace + cross-service correlation + "which deployment caused this cascade"? That's where I still reach for the cloud. The 4B describes symptoms beautifully but struggles with systemic root cause.
Your task-tiered routing idea is exactly where I landed after the experiment. My current setup: 4B local for logs/configs/docs, cloud only for "I have no idea what's happening and need to think out loud." The discipline isn't easy — old habits die hard — but that's the architecture I'm building toward.
I'm curious: does Moonshift automate that routing decision, or does the dev still manually pick the tier per task?
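The tiering described in this reply (4B local for logs/configs/docs, cloud for multi-hop or "no idea what's happening" cases) could be sketched as a small routing function. This is only an illustration under assumptions: the task labels, the multi_hop flag, and the function name are hypothetical, and it says nothing about how Moonshift actually decides.

```python
# Task types the reply says stay on the local 4B model.
LOCAL_TASKS = {"logs", "configs", "docs"}


def pick_tier(task_type, multi_hop=False):
    """Return "local" or "cloud" for a task.

    multi_hop marks work that needs cross-service correlation or root-cause
    reasoning, which the reply says still goes to the cloud.
    """
    if multi_hop:
        return "cloud"
    if task_type in LOCAL_TASKS:
        return "local"
    # Anything unrecognized is treated as the hard, open-ended case.
    return "cloud"


print(pick_tier("logs"))                  # local
print(pick_tier("logs", multi_hop=True))  # cloud
print(pick_tier("incident"))              # cloud
```

Here the developer still labels each task by hand; automating that label is the open question asked above.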
Appreciate how practical this was. The "$847 question" is exactly the kind of trigger that makes local models feel less ideological and more operational. I also liked the framing around intentional model selection: for a lot of internal DevOps workflows, the winning setup is not the smartest model in absolute terms, but the one that keeps sensitive logs inside the perimeter while being fast enough to use every day.
Thanks Vic! 🙏 Exactly, the "intentional" part is what changed my workflow. It's not about being anti-cloud, it's about being conscious of where each task lands. The perimeter vs. performance trade-off became an architectural decision, not just a cost one. Appreciate you catching that nuance!