MxGuru

Posted on May 18 • Originally published at sovereignhive.com.au

What Gemma 4 Actually Unlocks for a Local Security Swarm (And Why I Don't Use the Same Variant Everywhere)

#devchallenge #gemmachallenge #gemma


This is a submission for the Gemma 4 Challenge: Write About Gemma 4

I've been building an offline, multi-tier adversarial agent swarm on a single workstation — an RTX 5070 (12GB VRAM), a Ryzen 9 9950X3D, zero cloud calls, zero external dependencies, and zero vendor content restrictions. The swarm acts as an autonomous "Blue Team": it audits, scans, correlates threats, and, where appropriate, simulates the attacker side of an engagement against the assets it protects.

When the Gemma 4 family dropped, the question I had wasn't whether I should use it. A local-first, capable, open-license, multimodal model with a 128K context window is an automatic yes. The genuinely interesting question was: which variant goes where?

That's the question I think most "I tried the new model" posts skip past. The Gemma 4 lineup isn't just one model cut into three sizes. It's three distinct architectural answers to three different deployment problems. Picking the right one per role is where you find real leverage.

The Lineup, Architecturally

For anyone who hasn't pulled the spec sheet yet:

Gemma 4 E2B / E4B — Small effective-parameter models built for the edge: phones, browsers, ambient compute. Fast time-to-first-token, tiny VRAM footprint, and you can run many of them concurrently.
Gemma 4 26B MoE — Mixture-of-Experts. Total parameters are massive, but only a fraction activate per token. Designed for high throughput with strong reasoning on a per-task basis. It takes up space in memory, but it's computationally much cheaper to run than its parameter count suggests.
Gemma 4 31B Dense — Server-grade local. Every parameter fires on every token. Predictable inference cost and generally the strongest reasoning ceiling of the three, but carries the highest VRAM tax and latency floor.

All three share the same training lineage, the same 128K context window, and the same multimodal head. They differ entirely on activation patterns, footprint, and what kind of work they are built to absorb.
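To make "footprint" concrete, here is a rough back-of-the-envelope sketch. It counts weights only (no KV cache, activations, or runtime overhead) and takes the parameter counts at face value from the variant names, which is an assumption; the weight_gib helper is hypothetical and not part of any Gemma tooling.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed for model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Nominal sizes read off the variant names (assumption, not measured values).
for name, params in [("26B MoE", 26.0), ("31B Dense", 31.0)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits}-bit: {weight_gib(params, bits):.1f} GiB")
```

Under these assumptions, even 4-bit weights for the 31B come to roughly 14.4 GiB, which is more than a 12GB card holds before any context is allocated. An MoE still has to keep every expert resident, so its sparse activation saves compute rather than memory.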

Casting Models by RBAC Tier

The swarm uses a 6-tier zero-trust Role-Based Access Control (RBAC) system. Tier 0 is the most privileged: supervisors that can spawn, terminate, and de-escalate other agents. Tier 5 is the least privileged: ambient scanners that watch logs, file changes, and network deltas. Every privileged action routes through a hardcoded PermissionGate that doesn't care what the model wants; if the tier doesn't permit it, the call dies.
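A minimal sketch of what a gate like this could look like, assuming tiers are numbered 0 (most privileged) through 5 (least privileged). Only the PermissionGate name comes from the description above; the action table, the Agent type, and the method names are hypothetical illustrations, not the swarm's actual code.

```python
from dataclasses import dataclass
from typing import Any, Callable


class PermissionDenied(Exception):
    """Raised when an agent's tier does not permit an action."""


# Hypothetical action table: the least privileged tier (highest number)
# that is still allowed to perform each action.
MAX_TIER_FOR_ACTION = {
    "spawn_agent": 1,
    "terminate_agent": 1,
    "correlate_alerts": 3,
    "read_logs": 5,
}


@dataclass(frozen=True)
class Agent:
    name: str
    tier: int  # 0 = most privileged, 5 = least privileged (assumption)


class PermissionGate:
    """Hardcoded check that runs before any privileged call."""

    def check(self, agent: Agent, action: str) -> None:
        limit = MAX_TIER_FOR_ACTION.get(action)
        # Unknown actions are denied by default (zero-trust).
        if limit is None or agent.tier > limit:
            raise PermissionDenied(f"{agent.name} (tier {agent.tier}) may not {action}")

    def run(self, agent: Agent, action: str, fn: Callable[..., Any], *args: Any) -> Any:
        self.check(agent, action)
        return fn(*args)
```

The point of the design is that the check sits outside the model: model output can request an action, but only the table decides whether it runs.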

This matters for model casting because more privileged tiers don't just need smarter agents; they need slower, more deliberate ones. A supervisor that fires off twenty execution plans a second is a massive liability. Conversely, an ambient scanner that thinks for three seconds before flagging a file change is useless.

So, the question per tier is: how much reasoning depth, how much latency tolerance, and how many instances do we need concurrently?

Where Each Variant Earns Its Slot

The E2B / E4B at the edges (Tiers 4–5). Ambient watchers, log diffing, simple anomaly flagging, and "is this string weird" classification. The work here is high-volume, mostly pattern-shaped, and low stakes per call. I need several of these running concurrently with zero VRAM drama. A small model that returns a token in tens of milliseconds and lets me run multiples in parallel easily beats a 31B Dense that locks the GPU for seconds. Edge Gemma 4 is built for exactly this shape of work.

The 26B MoE in the middle (Tiers 2–3). Triage, correlation, and threat synthesis ("you've got fifteen of these alerts — is an attack chain forming?"). The MoE architecture fits here for a specific reason: middle-tier work is bursty. You have quiet stretches followed by a sudden need to reason hard about a correlated set of events. MoE's sparse activation means we get 31B-class reasoning without the relentless compute tax of a dense model. The 128K context window pays for itself here too, allowing triage agents to ingest a long correlation window of events in a single shot.

The 31B Dense at the top (Tiers 0–1), with caveats. Supervisors, planners, and adversarial scenario generation. Dense earns its slot here because top-tier reasoning needs to be predictable. When an MoE routes similar queries to different expert mixes, the depth and quality of its reasoning can vary from one call to the next. For a supervisor agent deciding whether to spawn a sub-agent at a different privilege tier, I want consistent, repeatable behavior more than peak throughput. Dense delivers that.
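The casting above can be written down as a small lookup table. This is a sketch that assumes tiers are numbered 0 (most privileged) through 5 (least privileged); the variant labels are plain strings rather than real model identifiers, and variant_for_tier is a hypothetical helper.

```python
# Tier -> variant casting, as described in the three paragraphs above.
# Assumption: tier 0 is the most privileged, tier 5 the least.
CASTING_BY_TIER = {
    0: "31B Dense",
    1: "31B Dense",
    2: "26B MoE",
    3: "26B MoE",
    4: "E2B/E4B",
    5: "E2B/E4B",
}


def variant_for_tier(tier: int) -> str:
    """Return the variant cast for a tier, failing loudly on unknown tiers."""
    try:
        return CASTING_BY_TIER[tier]
    except KeyError:
        raise ValueError(f"unknown tier: {tier}") from None
```

Keeping the table explicit makes the casting auditable: changing which variant serves a tier is a one-line diff rather than a behavior buried in agent code.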

The Caveat: On a single-card 12GB 5070, a 31B Dense model is the heavyweight in the room. It cannot coexist with the MoE and a stack of edge models without aggressive quantization and careful orchestration. Mine is gated through an HTTP inference queue: agents request inference, the gateway serializes the high-cost calls, and the small models keep running in their own lane. It's not glamorous infrastructure, but it's what makes the casting work.
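The two-lane idea can be sketched with asyncio, leaving out the HTTP layer entirely. Everything here is an assumption for illustration: the InferenceGateway class, the HEAVY_VARIANTS set (which variants count as high-cost is a guess), and fake_model, which stands in for a real inference backend.

```python
import asyncio
from typing import Awaitable, Callable

# Assumption: only the 31B Dense is treated as a high-cost call.
HEAVY_VARIANTS = {"31B Dense"}

ModelCall = Callable[[str, str], Awaitable[str]]


class InferenceGateway:
    """Serializes heavy calls; lets light calls run in a bounded parallel lane."""

    def __init__(self, light_slots: int = 4) -> None:
        self._heavy_lane = asyncio.Lock()                  # one heavy call at a time
        self._light_lane = asyncio.Semaphore(light_slots)  # several light calls at once

    async def infer(self, variant: str, prompt: str, call_model: ModelCall) -> str:
        lane = self._heavy_lane if variant in HEAVY_VARIANTS else self._light_lane
        async with lane:
            return await call_model(variant, prompt)


async def fake_model(variant: str, prompt: str) -> str:
    """Stand-in for a real inference backend."""
    await asyncio.sleep(0.01)
    return f"[{variant}] {prompt}"


async def demo() -> list:
    gateway = InferenceGateway(light_slots=2)
    jobs = [
        ("31B Dense", "plan response"),
        ("E2B/E4B", "classify log line"),
        ("31B Dense", "review plan"),
        ("E2B/E4B", "diff file change"),
    ]
    return await asyncio.gather(*(gateway.infer(v, p, fake_model) for v, p in jobs))


if __name__ == "__main__":
    for line in asyncio.run(demo()):
        print(line)
```

The heavy lane is a lock, so a second supervisor request waits for the first to finish, while the semaphore lets edge-model calls proceed without queuing behind it.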

What I Actively Avoid

Based on this architecture, here are a few patterns I actively avoid:

Don't use the 31B Dense everywhere just because it's the strongest. Latency at the bottom tier kills a swarm's situational awareness. You'll miss live events because your "ambient" watchers are blocked behind a heavy inference floor.

Don't put the MoE on supervisor duty. I like the model. I just don't want stochastic expert routing inside the agent that decides whether another agent gets disk-write permissions.

Don't put the E2B/E4B on triage. Edge models are great at answering "is this weird?", but weak at "what does it mean across these fifteen events?" Triage is the rung where context and parameter count win, not throughput.

The Takeaway

The Gemma 4 release is remarkable because the variants are legitimately different tools, not just three sizes of the same hammer. The MoE isn't "the 31B but smaller," and the E2B isn't "the E4B but worse." Each one is shaped for a specific class of work.

For a local-first, zero-cloud security swarm, the answer turned out to be all three at once, casting them by tier rather than picking a default. The model that wins on a benchmark is rarely the right model for every role inside a complex system.

That's the lesson I'd transfer out of this exercise: when a model family ships with real architectural variation, the lazy move is picking a favorite. The valuable move is asking which variant belongs in which slot — and building the orchestration to run them side by side.

Top comments (4)

Gilder Miller • May 19

Love the variant separation here. Most people grab the biggest model, but matching role to architecture is where real performance lives.
The MoE routing point on supervisors is sharp. Stochastic behavior in privilege decisions is a bad idea. Running three variants on one 5070 is impressive. What quantization are you using for the edge models?

MxGuru • May 19

Glad it landed. The MoE/privilege overlap is exactly where we ended up too — deterministic dispatch, no stochastic routing anywhere in the supervisor tier. Quantization-wise we run per-role rather than uniform: roles doing pattern-matching tolerate aggressive compression, roles making judgment calls get more headroom. The specific mapping I'll keep close for now.😀

Gilder Miller • May 19

Per-role quantization is the right call. Pattern matching can survive compression because speed matters more than precision there. Judgment roles need that extra headroom to stay reliable.
How do you draw the line when compression starts degrading output? Downstream evals or something more real-time?
It helped me a lot.

MxGuru • May 20

Both, honestly — they cover different failure modes. The floor is offline: each role has its own eval suite tied to what it actually does, and a quant level only ships if it clears that bar. But static evals miss slow drift, so the live signal is cheaper — I periodically shadow a sample against a full-precision reference and watch the delta, plus treat downstream validation failures and cross-agent disagreement as a canary. When a compressed role starts getting overruled more often, that's usually the first sign it's lost the plot before any benchmark notices.
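A minimal sketch of the live shadow-sampling signal described in this reply, under stated assumptions: outputs are compared by exact match (reasonable for labels, too strict for free text), and the window size, threshold, and DriftMonitor name are hypothetical placeholders rather than the values or code actually in use.

```python
from collections import deque


class DriftMonitor:
    """Tracks how often a quantized role disagrees with a full-precision reference."""

    def __init__(self, window: int = 200, max_disagreement: float = 0.1,
                 min_samples: int = 20) -> None:
        self._outcomes = deque(maxlen=window)  # True means the outputs disagreed
        self._max_disagreement = max_disagreement
        self._min_samples = min_samples

    def record(self, quantized_output: str, reference_output: str) -> None:
        self._outcomes.append(quantized_output != reference_output)

    @property
    def disagreement_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return sum(self._outcomes) / len(self._outcomes)

    def degraded(self) -> bool:
        # Stay quiet until there are enough samples to trust the rate.
        return (len(self._outcomes) >= self._min_samples
                and self.disagreement_rate > self._max_disagreement)
```

The same rolling-rate structure would work for the "overruled more often" canary by recording whether a downstream agent overrode the role's output instead of comparing against a reference.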