hatyibei

Posted on May 24

Building a local multi-agent system with Gemma 4: E4B commander, specialist comparisons across the family, and a reward function that hacked itself

#devchallenge #gemmachallenge #gemma


This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Gemma 4 ships as a family: E2B, E4B, and Dense variants that share a training recipe, tokenizer, and instruction-following behavior. The interesting design question for us wasn't which model to pick. It was which size goes where, and what happens when you push the smaller variant past what it does naturally.

We built a local multi-agent system that routes legal queries and chat-history queries to specialist agents, all on a single RTX 5090. Gemma 4 E4B did the routing and the final judge pass. For the specialist role, we tried Gemma 4 E2B, Gemma 3-1B, Bonsai-8B at 1-bit quantization, and eventually TinySwallow 1.5B—training the trainable ones with the same GRPO reward design and running them against the same evaluation suite.

The numbers across that comparison are most of this post. Some specific things we found:

Gemma 4 E4B as a commander: 99.5% routing accuracy on a 660-case eval after four prompt iterations, at 0.44s average latency.
Gemma 4 E2B trained with GRPO: viable as a specialist, but training was roughly 17-39x slower per step than Gemma 3-1B with the same design (300-700 seconds against ~18), and the reward signal was substantially less stable.
A reward function we wrote that scored "はい" (Japanese for "yes") at +1.50, tied with our best legitimate answer. The model found it in three training steps.
Two completely independent specialist failures—Bonsai overflowing context and the chat agent throwing KeyError—that both passed our integration evaluation cleanly because the Commander's judge step was robust enough to hide them.
One specific thing the Gemma 4 family gave us for this work: a size split (E2B / E4B / Dense) that maps onto specialist / commander / heavy-reasoning roles on consumer hardware, with the think=True/False toggle on the commander tier so one variant covers both fast routing and slower reasoning. Qwen 3.5/3.6 has a comparable structure; the practical reason we landed on Gemma 4 specifically is at the bottom of this post.

All running locally on one consumer GPU. On-premise because legal documents and internal chat couldn't leave our infrastructure. The two-minute version is at the bottom if you want to skip the long version.

What we were building

Two capabilities for internal business support:

Legal RAG. Natural-language queries against a database of Japanese statutory law (944 articles across ten major regulations covering data privacy, cybersecurity, copyright, telecommunications).
Internal chat search. Semantic search over our team's chat history (around 17,000 messages) so people could ask "did anyone discuss X last quarter?"

Cloud APIs weren't available to us. Everything ran on one workstation:

Component	Specification
GPU	NVIDIA RTX 5090
VRAM	31.8 GB
OS	Ubuntu via WSL2

The VRAM ceiling drives most of what follows. To keep multiple models resident simultaneously, none of them could be in the 70B class. That meant small specialists for the retrieval-grounded tasks and a medium-sized commander to coordinate them.

The Gemma 4 family decision

Here's the design we landed on after the comparisons:

User query
   │
   ▼
┌─────────────────────────────────────────────────────────────┐
│  Commander: Gemma 4 E4B (via Ollama, think=False)           │
│  → Routes to: law | chat | both | reasoning | none          │
└────────────────────────────┬────────────────────────────────┘
                             │
       ┌─────────────────────┼──────────────────────┐
       ▼                     ▼                      ▼
   [law]               [chat]              [reasoning | none]
   LawAgent            ChatAgent           Commander direct
   TinySwallow 1.5B    TinySwallow 1.5B    (E4B, think=True)
   + GRPO LoRA v5      + GRPO LoRA v5      no RAG
   + multilingual-e5   + multilingual-e5
   + bge-reranker-v2   + bge-reranker-v2
   944 articles        ~17K messages
       │                     │
       └──────────┬──────────┘
                  ▼
┌─────────────────────────────────────────────────────────────┐
│  Commander Judge (E4B, think=True)                          │
│  → Synthesizes agent outputs into final response            │
└─────────────────────────────────────────────────────────────┘

E4B at the top, TinySwallow at the bottom. The E4B placement was easy. The specialist position took a comparison to settle.

E4B is the smallest variant in the Gemma 4 family that consistently emits valid JSON for routing decisions and produces a coherent judge pass when asked to synthesize specialist output. It's also small enough to leave VRAM headroom for two specialist instances running underneath. The shared training recipe and tokenizer across the family meant Gemma 4 E2B was the natural specialist candidate—same prompting habits, same tokenization, same instruction-following style. So we started there.

We compared:

Gemma 4 E2B (Gemma 4 family, smaller specialist)
Gemma 3-1B (predecessor generation, similar parameter scale)
TinySwallow 1.5B (Qwen2.5-1.5B base, Japanese-tuned by SakanaAI)
Bonsai-8B (Qwen3-8B at 1-bit quantization, 1.15 GB file)

Spoiler: the eventual choice was TinySwallow, for reasons specific to Japanese specialization. But the comparison data along the way is what made it clear, and one of those data points is specifically about training Gemma 4 E2B.

Commander: Gemma 4 E4B in detail

A few things made E4B work cleanly for the commander role.

The think=True/False distinction. Routing needs to be fast and structured: the commander reads the user query, picks one of five route labels, and returns a small JSON object. We run it with think=False. Judging, where the commander reads the specialist outputs and synthesizes a final response, benefits from explicit reasoning. We run it with think=True. Same model, two modes, different costs, no extra deployment.

Ollama over raw HuggingFace transformers. Our initial setup loaded E4B with AutoModelForCausalLM. Loading 2130 layers took 35+ seconds cold and the model's VRAM kept fighting with the specialists running underneath. Switching to Ollama solved both: the model stays resident as a server, first tokens come back in milliseconds, and 43 of 43 layers offload to the GPU cleanly. Routing latency settled at an average of 0.44 seconds, with p95 at 0.59 seconds, across a 660-case eval.

JSON output reliability. The routing prompt asks for a structured {"route": "<one of five>", "reasoning": "<one sentence>"} response. E4B emits valid JSON on essentially every case in our evaluation set. Some smaller models in the same family-size range would occasionally produce unstructured prose or partial JSON; E4B didn't.
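A minimal sketch of how the routing reply could be validated before anything is dispatched. The five route labels and the two JSON keys come from the prompt described above; the function name, the fallback route, and the `valid` flag are assumptions for illustration, not our production code.

```python
import json

# The five labels the routing prompt allows.
ROUTES = {"law", "chat", "both", "reasoning", "none"}


def parse_route(raw: str, fallback: str = "none") -> dict:
    """Parse a commander reply into a dict with route, reasoning, and valid.

    Anything that is not a JSON object with a known route label falls
    back to `fallback` and is flagged as invalid so it can be counted.
    """
    invalid = {"route": fallback, "reasoning": "", "valid": False}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return invalid
    if not isinstance(data, dict):
        return invalid
    route = data.get("route")
    if not isinstance(route, str) or route not in ROUTES:
        return invalid
    return {
        "route": route,
        "reasoning": str(data.get("reasoning", "")),
        "valid": True,
    }
```

Counting the `valid` flag over an eval set gives a direct number for "emits valid JSON on essentially every case" instead of an impression.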

The 660-case routing iteration story (which we'll come back to) showed the commander could be tuned to 99.5% routing accuracy through prompt engineering alone—no fine-tuning required. That's the kind of capability you want at the commander tier: stable, predictable, fast.

The specialist problem: why fine-tuning facts in didn't work

Before we got to GRPO, we tried the obvious thing on TinySwallow: fine-tune it on legal QA pairs so it would know the law.

It didn't work. LoRA with r=64, alpha=128, three epochs, loss converged cleanly to 0.162. The model picked up the style of legal answers ("「個人情報」とは、生存する個人に関する情報であって…", roughly "'Personal information' means information about a living individual that…") but not the facts. Hallucinated article numbers held steady. Even with the relevant article in the prompt, the model mostly ignored it and produced confident-sounding fabrications.

The paper that helped us think about why: Yue et al. (2025), "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?" (arXiv:2504.13837, NeurIPS 2025 Oral). Our reading: RL-trained models outperform their base at small k, but at large k the base model achieves a higher pass@k. The capabilities that surface during training are already latent in the base model. RL makes them more accessible. It doesn't introduce new ones.
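To make the small-k versus large-k point concrete, here is a sketch using the standard unbiased pass@k estimator (from n samples with c correct, pass@k = 1 - C(n-c, k) / C(n, k)). The per-problem counts below are invented purely to show the crossover; they are not from the paper or from our runs.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong samples exist, so any draw of k contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a list of per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Invented numbers: correct samples out of 100 on two problems.
# The "base" model solves both problems, but rarely.
# The "RL" model solves one problem often and the other never.
N_SAMPLES = 100
BASE_COUNTS = [5, 2]
RL_COUNTS = [60, 0]
```

With these toy counts the RL-style profile wins at k=1 and the base-style profile wins at k=64, which is the shape of the result as we read it: sharper on what was already reachable, not broader.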

For a 1.5B model—and equally for Gemma 4 E2B at roughly 2B—"knowing more legal facts" is probably outside the base capability range. So we split the architecture:

Knowledge goes into RAG. The 944-article ChromaDB is the source of truth.
Behavior is shaped through reinforcement learning. The model has to actually use the retrieved context instead of bypassing it.

This split—knowledge via RAG, behavior via GRPO—is the design principle that lets us apply the same reward design across multiple specialist candidates. The reward function checks "did you use the context faithfully?" rather than "did you give the right answer?" Any small model that can be trained to use context becomes a viable specialist.

The reward hacking story

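We picked Group Relative Policy Optimization (GRPO), which is essentially PPO with the value model removed: the baseline comes from the mean reward of a group of completions sampled for the same prompt. It was introduced with DeepSeekMath and popularized by DeepSeek-R1. Rewards come from Python functions you write yourself. No separate reward model.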
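A minimal sketch of the group-relative part, assuming the common formulation where each completion's reward is standardized against the other completions sampled for the same prompt. The `eps` value and the use of the population standard deviation are assumptions; trainers differ on both details.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Standardize rewards within one group of completions.

    Each completion is scored relative to its siblings, which is what
    lets GRPO drop the learned value model that PPO needs.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

One consequence matters for everything that follows: if every completion in a group gets the same reward, every advantage is zero and that prompt contributes no gradient.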

We started training TinySwallow. The first version had three reward functions:

context_citation_reward: does the response cite text from the retrieved context?
hallucination_penalty_reward: penalty for fake article numbers, plus a +1.0 bonus for appropriate refusal ("情報が見つかりませんでした", "no information was found").
uncertainty_reward: small reward for hedging, small penalty for length.

The v1 training ran, reward improved from -0.04 to +0.09, mediocre but moving. We assumed the citation reward was doing useful work. It wasn't: the function tried to match full phrases but the model's output and the retrieved context differed in whitespace and quotation styling, so every comparison failed. We'd been training on noise plus the refusal bonus.
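One way to make that comparison robust, sketched under assumptions: normalize both sides (Unicode NFKC, then strip whitespace and quotation marks) and look for a shared run of characters. The function names, the quote list, and the 8-character window are illustrative choices, not a description of the actual v2 fix.

```python
import re
import unicodedata

# Quote characters to ignore when comparing (an assumed list).
_QUOTES = "「」『』“”‘’\"'"
_STRIP = re.compile("[\\s" + re.escape(_QUOTES) + "]+")


def normalize(text: str) -> str:
    """Fold width variants, then drop whitespace and quote marks."""
    return _STRIP.sub("", unicodedata.normalize("NFKC", text))


def cites_context(response: str, context: str, min_len: int = 8) -> bool:
    """True if some min_len-character run of the response appears in the context."""
    r, c = normalize(response), normalize(context)
    return any(r[i:i + min_len] in c for i in range(len(r) - min_len + 1))
```

A naive `response in context` check fails on the pair used in the test below even though the response is quoting the context, which is the failure mode described above.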

We fixed that for v2, added a penalty for "hedge-then-hallucinate" patterns, and ran the full 526-example dataset through training. The metrics that came out:

mean_length:           27 tokens
halluc/mean:           +1.0
frac_reward_zero_std:  0.75

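Twenty-seven tokens. frac_reward_zero_std: 0.75 means that on 75% of training prompts, every generation in the group received exactly the same reward. With no spread inside a group, the group-relative advantage is zero and those prompts teach the model nothing. The model had collapsed.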
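For readers who haven't met the metric, this is how we read frac_reward_zero_std, re-implemented as a sketch rather than copied from any trainer: the fraction of prompt groups whose rewards have zero spread. The tolerance and the sample rewards are assumptions.

```python
from statistics import pstdev


def frac_reward_zero_std(grouped_rewards, tol=1e-8):
    """Fraction of groups in which every completion got the same reward."""
    if not grouped_rewards:
        return 0.0
    flat = sum(1 for group in grouped_rewards if pstdev(group) <= tol)
    return flat / len(grouped_rewards)


# Illustrative batch: three collapsed groups and one with real spread.
EXAMPLE_BATCH = [
    [1.5, 1.5, 1.5, 1.5],
    [1.5, 1.5, 1.5, 1.5],
    [1.5, 1.5, 1.5, 1.5],
    [0.62, 1.5, -1.3, 1.5],
]
```

Note that the metric is about rewards, not text: completions can differ in wording and still land on the same score.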

To check what was happening, we wrote test cases against our own reward functions. One of them was simply this:

('hack（超短文）', 'はい。')

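Japanese for "yes," labeled "hack: ultra-short text." We expected our reward function to penalize it heavily. Instead (the good row is a longer answer with a real citation, the bad row cites a fabricated article, and 合計 is the total):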

ケース               citation  halluc    uncert    合計
good（引用・長め）   +0.62     +0.00     +0.00     +0.62
bad（架空）          +0.00     -0.80     -0.50     -1.30
hack（超短文）       +0.00     +1.00     +0.50     +1.50  ← maximum

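The cheapest way to collect the +1.0 from hallucination_penalty_reward was to say "yes." Or nothing. Or anything short that didn't contain a fake article number. Add the +0.50 it picked up from uncertainty_reward and a one-word answer totaled +1.50, well above the +0.62 our good citation example earned. We'd designed reward hacking into our own training signal. The model found it in three steps.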
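The check that exposed this generalizes into a small audit: score a handful of degenerate probes and flag any that do at least as well as a known-good answer. The sketch below uses a toy stand-in for the flawed hallucination reward; its regex and logic are assumptions chosen to reproduce the +1.0 / 0.0 / -0.8 pattern in the table, not our real function.

```python
import re

ARTICLE = re.compile(r"第\d+条")


def toy_halluc_reward(response: str, context: str) -> float:
    """Toy stand-in for the flawed v2 hallucination reward."""
    cited = ARTICLE.findall(response)
    if any(article not in context for article in cited):
        return -0.8  # cites an article number the context doesn't contain
    # Flaw: citing nothing at all is treated as an appropriate refusal.
    return 1.0 if not cited else 0.0


def audit(reward_fn, good: str, probes, context: str):
    """Return the probes that score at least as well as the good answer."""
    good_score = reward_fn(good, context)
    return [p for p in probes if reward_fn(p, context) >= good_score]


CONTEXT = "第30条の4 著作物は、情報解析の用に供する場合には利用することができる。"
GOOD = "第30条の4によれば、情報解析の用に供する場合には利用できます。"
PROBES = ["はい。", "", "第999条により禁止されています。"]
```

Running `audit(toy_halluc_reward, GOOD, PROBES, CONTEXT)` flags the one-word answer and the empty string. A healthy reward design should return an empty list.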

The fix evolved over two more versions:

v3 added a sliding-window length penalty. It pushed mean_length to 74 tokens briefly, then died at step 62 of 131 without saving an adapter as the hacking patterns returned. We didn't restart.
v4 deleted the +1.0 refusal bonus entirely and scaled the citation reward by 2x to widen the gap between good answers and silence. The model still preferred refusal to hallucination, but now had to actively earn positive score by citing real context. mean_length rose to 174-200, frac_reward_zero_std dropped to 0, and average eval score went from baseline +0.85 to +1.15.
v5 changed the strategy: instead of trying to train hallucination out of free-form generation, constrain the output structure so hallucination had nowhere to go. We forced a three-section format, 【条文引用】 (article citation) / 【解説】 (explanation) / 【範囲外】 (out of scope), and wrote rewards that checked each section against retrieved context independently. Average eval rose to +1.44, with format compliance and citation grounding both saturating during training.
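A sketch of what section-wise checking can look like. The three section markers and the reward ceilings (1.0 for format, 2.0 for citation) match what this post reports; the parsing regex, the -1.0 floor, and the line-by-line grounding rule are assumptions, and the `toy_` names signal that these are not the production reward functions.

```python
import re

SECTIONS = re.compile(r"【条文引用】(.*?)【解説】(.*?)【範囲外】(.*)", re.DOTALL)


def split_sections(response: str):
    """Return (citation, explanation, out_of_scope), or None if malformed."""
    match = SECTIONS.search(response)
    if match is None:
        return None
    return tuple(part.strip() for part in match.groups())


def toy_format_reward(response: str) -> float:
    """+1.0 when all three markers appear in order, else -1.0."""
    return 1.0 if split_sections(response) is not None else -1.0


def toy_citation_reward(response: str, context: str) -> float:
    """Up to 2.0, scaled by the share of quoted lines found in the context."""
    sections = split_sections(response)
    if sections is None:
        return 0.0
    lines = [ln.strip() for ln in sections[0].splitlines() if ln.strip()]
    if not lines:
        return 0.0
    grounded = sum(1 for ln in lines if ln in context)
    return 2.0 * grounded / len(lines)
```

Under this shape a one-word answer can no longer win: it fails the format check and earns nothing for citation.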

The v5 design is what we then applied across the family.

Same reward design, three different models: what we found

Here's where Gemma 4 specifically gets interesting. Once we'd worked out the v5 reward design on TinySwallow, we ran the same design—same 526-example dataset, same 131-step training run, same group size of 4, same max_new_tokens of 300—on Gemma 3-1B and Gemma 4 E2B as alternative specialist candidates. All three models trained with reward functions [format_compliance_reward, citation_grounded_reward, explanation_grounded_reward].

The training profiles were strikingly different.

Gemma 3-1B + GRPO v5 trained cleanly. At step ~10 the metrics looked like:

mean_length:                          298.5 (clipped at 300)
format_compliance_reward/mean:        -0.175 → climbing
citation_grounded_reward/mean:        +1.47
frac_reward_zero_std:                 0.05
step_time:                            ~18 seconds

By step ~80, format compliance had risen to around 0.83 (close to its 1.0 ceiling), citation grounded held around 1.55-1.57 (out of 2.0 maximum), and frac_reward_zero_std had risen to around 0.40-0.42. The training was stable and the model converged toward saturating both rewards.

Gemma 4 E2B + GRPO v5 trained with the same script, same data, same reward design. The training profile was visibly harder:

mean_length:                          284 → 277 → 287 → 294 (oscillating)
format_compliance_reward/mean:        +0.07 → +0.26 → +0.32 → +0.17 (unstable)
citation_grounded_reward/mean:        +1.53 → +1.16 → +1.26 → +1.34 (oscillating)
frac_reward_zero_std:                 0.575 → 0.575 → 0.60 → 0.65 → 0.72 (rising)
step_time:                            300-700 seconds per step

Two things stand out. First, per-step time was far longer than on Gemma 3-1B: roughly 18 seconds per step there against 300-700 seconds per step on Gemma 4 E2B, which by those figures is about 17-39x. The full 131-step run that took roughly 40 minutes on Gemma 3-1B took us most of a day on Gemma 4 E2B. We don't have a definitive root cause documented for this. Candidate explanations include the larger model size (2011 layers loaded for E2B vs 340 for Gemma 3-1B), the multimodal processor architecture, or our specific Ollama/transformers setup. Whatever the cause, the time difference is real and reproducible.

Second, the reward signal was substantially less stable. format_compliance_reward/mean oscillated in the +0.07 to +0.32 range rather than climbing monotonically toward 1.0 the way Gemma 3-1B's did. frac_reward_zero_std rose to 0.72, meaning that on 72% of prompts every completion in the group earned the same reward and so contributed no learning signal. On TinySwallow, that pattern would have signaled the same reward hacking problem we'd seen before. Entropy stayed high (2.5-2.9) where Gemma 3-1B converged tightly (0.17-0.25).

TinySwallow + GRPO v5 sits between these in shape, ultimately converging to the +1.44 average eval result above.

Without further investigation we wouldn't claim Gemma 4 E2B is "worse" as a specialist—the convergence profile we observed could be specific to our reward shape, our dataset, or our training configuration. But for anyone planning to GRPO-train Gemma 4 E2B as a specialist in a similar setup, we'd say: budget significantly more wall-clock time per step, expect a less stable reward signal in the early phases, and consider whether the same GPU hours might give you a cleaner training curve on a smaller predecessor model.

After this comparison, the deciding factor for our final specialist choice came down to language specialization. TinySwallow is Qwen2.5-1.5B with TAID distillation from Qwen2.5-32B, specifically tuned for Japanese. For a system answering Japanese legal queries and Japanese chat history, that specialization tipped the result. The GRPO-shaped TinySwallow became our production specialist.

Bonsai-8B: a model we wanted to use and couldn't

One more candidate worth talking about because it's a genuinely interesting model, and the reason we couldn't use it is structural rather than about quality.

Bonsai-8B from Prism ML (Apache 2.0) is a 1-bit quantized (1.125 bits per weight) Qwen3-8B variant. The GGUF file is 1.15 GB. On benchmarks it outscores Llama 3.1 8B at FP16 (70.5 vs 67.1). Inference on our hardware was faster than TinySwallow.

We ran a 12-case integration evaluation against Bonsai and against TinySwallow with the v5 LoRA:

Specialist config	Avg score	Routing acc	Format	Keyword (law cases)
TinySwallow + GRPO LoRA v5	+3.42	100% (12/12)	100%	83.3%
Bonsai-8B (run 1)	+3.23	91.7% (11/12)	100%	83.3%
Bonsai-8B (run 2)	+3.20	91.7% (11/12)	100%	83.3%

TinySwallow wins on score and routing. But that wasn't the deciding factor. The deciding factor was in the Bonsai eval logs:

[L2] 著作権法 第30条の4（AI学習データ）
  Q: 著作権法において情報解析を目的とする著作物の利用について教えてください。
[LawAgent] 処理中...
  ⚠️ LawAgent エラー: [BonsaiGenerator] API エラー HTTP 400:
  {"error":{"code":400,"message":"request (4922 tokens) exceeds
   the available context size (4096 tokens), try increasing it",
   "type":"exceed_context_size_error",
   "n_prompt_tokens":4922,"n_ctx":4096}}
  ★ スコア: +4.00

The same error reproduced on every L2 case across three Bonsai eval runs. 4922 tokens overflowing a 4096 context window. For RAG with three retrieved law articles plus chat history, we routinely sent 4000-5000 token prompts. Bonsai would have required a custom retrieval strategy that pre-summarized articles before sending them in, another hop, another failure mode.
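For comparison, the simplest guard against this overflow is a token budget: keep retrieved documents in rank order until the next one would not fit. This is a different technique from the pre-summarization mentioned above, and we did not ship it; it trades recall for a prompt that always fits. `count_tokens` is a placeholder for the serving model's real tokenizer, and the 300-token reserve mirrors the max_new_tokens we used in training.

```python
def fit_to_context(docs, count_tokens, n_ctx, reserved):
    """Keep ranked docs until the prompt budget is exhausted.

    docs         -- retrieved texts, best match first
    count_tokens -- callable returning the token count of a text
    n_ctx        -- context window of the serving model
    reserved     -- tokens held back for instructions and generation
    """
    budget = n_ctx - reserved
    kept, used = [], 0
    for doc in docs:
        cost = count_tokens(doc)
        if used + cost > budget:
            # Stop at the first doc that doesn't fit so rank order is preserved.
            break
        kept.append(doc)
        used += cost
    return kept, used
```

With a 4096-token window, a 4922-token set of three articles would lose its lowest-ranked article instead of returning HTTP 400. Whether dropping that article is acceptable depends on how often the answer lives in it.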

At the small-model scale, context window matters more than parameter count. Bonsai's 8B parameters at 1-bit didn't help us when the model couldn't hold the prompts. The Gemma 4 family's longer native context made this a non-issue for E2B and E4B.

There's a second thing about that eval log worth holding on to. L2 scored +4.00 despite the API error. Come back to that in two sections.

Routing iteration on Gemma 4 E4B: 660 cases, four prompt versions

The commander's routing decision is a five-way classification (law / chat / both / reasoning / none). It sounds easy until you actually try it.

We built a 660-case evaluation set:

Route	Cases	Source
`law`	308	Auto-sampled from `law_qa_all.jsonl`
`chat`	67	Hand-written
`both`	40	Hand-written
`none`	159	General questions, programming, casual
`reasoning`	86	Logical analysis, comparative critique

Then we iterated the routing prompt that E4B saw:

Version	Accuracy	Per-class detail
v1	90.5% (597/660)	`none` collapsed to 62.9% (100/159). Most errors were `none → reasoning`
v2	92.3% (609/660)	Tightened the `reasoning` definition. `none` rose to 98% but `reasoning` itself collapsed from 98% to 42%
v3	97.1% (641/660)	Added contrastive examples on the `none` vs `reasoning` boundary
v4	99.5% (657/660)	Added 17 few-shot labeled examples covering edge cases

The v2 result is the most instructive. We tightened the criteria for reasoning to pull none accuracy up, and immediately broke reasoning. Boundary problems in prompt-engineered classification don't resolve through tighter natural-language definitions. Concrete contrastive examples that show the line do.
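The v2 regression was only visible because we looked at accuracy per route, not just the headline number. A minimal sketch of that breakdown follows; the function names are illustrative, and the example data simplifies v1 by sending all 59 `none` errors to `reasoning`, where the real run only sent most of them there.

```python
from collections import Counter


def per_class_accuracy(gold, pred):
    """Accuracy for each gold label, as a dict of label to fraction."""
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    return {label: correct[label] / total[label] for label in total}


def confusions(gold, pred):
    """Count each (gold, predicted) pair among the errors."""
    return Counter((g, p) for g, p in zip(gold, pred) if g != p)


# Simplified v1 picture for the `none` class: 100 of 159 correct.
GOLD_NONE = ["none"] * 159
PRED_NONE = ["none"] * 100 + ["reasoning"] * 59
```

Tracking the `confusions` counter across prompt versions shows whether an edit fixed a boundary or just moved the errors to the other side of it.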

The remaining three errors at v4 are genuinely ambiguous label cases (a bias-variance tradeoff question, a DB normalization question, an architecture comparison question) where reasonable humans would disagree about whether the user wanted a reasoning walkthrough or a none definition.

The point worth noting for Gemma 4 specifically: this entire iteration happened with the same E4B model. No fine-tuning, no retraining. Four prompt revisions and 17 added few-shot examples moved routing accuracy from 90.5% to 99.5%. That's the kind of trainability-through-prompting you want at the commander tier.

Two failures the integration eval didn't catch

The Bonsai L2 case scoring +4.00 despite a 4096-token context error wasn't an isolated quirk in the evaluation. It was a structural property of how we'd set up the integration eval: the Commander's think=True judge pass is robust enough to produce reasonable-looking output even when one specialist is throwing errors. The eval scored what came out of the Commander, not whether the specialist actually did its job.

We noticed this when sampling production logs for a different reason. The daily JSONL files in our logs/commander_agent/ directory had entries like this:

{"timestamp": "[REDACTED]",
 "question": "最近のミーティングでAIツール導入について議論されていましたか？",
 "route_decision": "chat",
 "agent_outputs": {"chat": {"output": "Error: 'law'", ...}}}

Error: 'law'. On every chat query. On every both query that involved chat.

The root cause was a VRAM-saving optimization we'd added when LawAgent and ChatAgent shared the same base model:

class CommanderSystem:
    def __init__(self, ...):
        # Share generator across agents if base model is identical
        shared_gen = None
        if load_law and load_chat and law_base == chat_base:
            shared_gen = LawGenerator(model_path=law_base, lora_path=lora_path)
        self.law_agent  = LawAgent(shared_generator=shared_gen)
        self.chat_agent = ChatAgent(shared_generator=shared_gen)  # ← LawGenerator passed in

class ChatAgent:
    def __init__(self, shared_generator=None, ...):
        if shared_generator is not None:
            self.generator = shared_generator   # ← uses LawGenerator unchanged

LawGenerator.generate() formats retrieved context with f"【{d['law']} {d['article']}】\n{d['text']}". Chat retrieval returns documents shaped like {'ts': ..., 'user': ..., 'channel': ..., 'text': ...}. The chat agent tried to access d['law'] on a chat document, got KeyError: 'law', and the error was caught and serialized into the log as a string. Downstream logic ignored it.

The 660-case routing evaluation never caught this because it only checked routing decisions, not agent outputs. The 12-case integration eval used the Commander's judge step, which, as with the Bonsai context errors, was robust enough to produce reasonable-looking output even when one specialist threw an exception. The eval reported a pass while chat retrieval was broken.

The fix:

class ChatAgent:
    def __init__(self, shared_generator=None, ...):
        if shared_generator is not None:
            # Borrow model and tokenizer; build our own Generator
            # that expects chat-shaped documents.
            # __new__ allocates a ChatGenerator without running its
            # __init__; any other attribute __init__ would set must
            # be assigned here as well.
            self.generator = self.ChatGenerator.__new__(self.ChatGenerator)
            self.generator.model     = shared_generator.model
            self.generator.tokenizer = shared_generator.tokenizer
            # Log message reads "Generator: using shared instance"
            print("[ChatAgent] Generator: 共有インスタンスを使用")

These were two completely unrelated failure modes—Bonsai overflowing its context window on legal RAG, ChatAgent's KeyError on chat queries—but they failed the integration eval the same way. They both reached the Commander's judge step as error text, the judge generated prose around the error text, and the keyword check passed because the keyword happened to appear somewhere in that prose.

The lesson is about how robust orchestrators interact with evaluations. An integration eval that scores only the orchestrator output cannot detect failures that the orchestrator is robust to. With a think=True Gemma 4 commander, the orchestrator is robust to a lot. Score the specialists separately, not just the final output. Sample production logs.
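A sketch of the specialist-level check we should have had from the start: scan each log record's agent outputs for error text before the judge ever sees them. The record shape follows the JSONL entry shown earlier; the marker list and function names are assumptions, and substring matching is deliberately crude.

```python
import json

# Strings that indicate a specialist failed (an assumed, incomplete list).
ERROR_MARKERS = ("Error:", "エラー", "exceed_context_size_error")


def specialist_failures(record: dict) -> list:
    """Return the names of agents whose output looks like an error."""
    failed = []
    outputs = record.get("agent_outputs", {})
    for agent, result in outputs.items():
        text = str(result.get("output", "")) if isinstance(result, dict) else str(result)
        if any(marker in text for marker in ERROR_MARKERS):
            failed.append(agent)
    return failed


def scan_jsonl(lines):
    """Yield (question, failed_agents) for every record with a failure."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        failed = specialist_failures(record)
        if failed:
            yield record.get("question", ""), failed
```

Run against a day's log file with `scan_jsonl(open(path, encoding="utf-8"))`, this would have listed every chat query from the first day. A legitimate answer that happens to contain "Error:" would be a false positive, so structured error fields are the better long-term fix.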

Final results

After the KeyError fix and final integration:

ケース数        : 30 件 (エラー: 0 件)
平均スコア      : +3.38
ルーティング正解: 100.0%  (30/30)
フォーマット正解: 100.0%
キーワード正解  : 92.9%  (lawケースのみ)
レイテンシ      : 平均 17.14秒  最大 88.11秒  P95 38.65秒

Metric	Value
Routing accuracy	99.5% (660-case eval), 100% (30-case full eval)
Format compliance	100% (30/30)
Average score	+3.38 (4.0 is the per-case ceiling)
Keyword accuracy on law cases	92.9%
Routing latency	avg 0.44s, p95 0.59s
Full-pipeline latency	avg 17.14s, p95 38.65s
Hardware	Single RTX 5090, 31.8 GB VRAM
Models loaded simultaneously	Gemma 4 E4B via Ollama, TinySwallow 1.5B + GRPO LoRA v5 (shared across both specialists)
External services	Zero. ChromaDB local, multilingual-e5-large local, bge-reranker-v2-m3 local

An actual production query against this configuration:

Q: 先週の進捗報告をまとめて

A:
【会話引用】
[2026年02月02日 / team-member-a / #progress-reports]
先週頑張ったので今週ゆっくり働きます ...

【解説】
先週の進捗報告では、システム改善、内部レビュー資料の作成、
効率化の検討、そして次フェーズに関する調査などが挙げられています。

【範囲外】
この会話履歴からは、具体的なタスクの詳細や担当者の名前などはわかりません。

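(Names and channel are anonymized; the structure and message dates are real. In English: the query asks for a summary of last week's progress reports. The 【会話引用】 (conversation citation) section quotes a message saying "I worked hard last week, so I'll take it easy this week." The 【解説】 (explanation) section lists system improvements, internal review materials, efficiency work, and next-phase research. The 【範囲外】 (out of scope) section notes that task details and owner names can't be determined from the history.)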

The three-section structure the model was forced into during GRPO training carries cleanly to inference. Citation, explanation, and out-of-scope disclaimer stay separate. The Commander's think=True judge pass then rewrites this into the conversational register the user actually wants.

Two-minute version

A few things about working with the Gemma 4 family at this scale.

Gemma 4 E4B is a strong multi-agent commander. The think=True/False distinction lets you cover fast routing and slow reasoning with one model. The Ollama integration is clean (43 of 43 layers offloaded, milliseconds first-token latency). And routing accuracy can be tuned from 90.5% to 99.5% through prompt iteration alone, no fine-tuning required. Same model, two modes, 660 cases of evidence.

Gemma 4 E2B as a GRPO-trained specialist is viable but harder than the predecessor generation. Same data, same reward design, same 131 training steps: Gemma 3-1B trained at ~18 seconds per step with stable rewards converging cleanly. Gemma 4 E2B trained at 300-700 seconds per step with oscillating rewards and rising frac_reward_zero_std. Budget significantly more wall-clock time, and consider whether the same compute would give you a cleaner curve on a smaller model.

At 1.5-2B parameters, don't try to fine-tune facts in. Use RAG for knowledge and reserve reinforcement learning for behavior shaping. This matches what Yue et al. found about RL surfacing latent capability rather than adding new capability. For factual recall at this scale, retrieval beats fine-tuning.

Audit your reward landscape by asking what the cheapest way to reach the top score is. If a one-word answer scores well, your reward function is broken. We learned this when our +1.0 "appropriate refusal" bonus drowned out a +0.62 "good citation" reward and the model collapsed to 27-token answers, with every completion in the group earning the same reward on 75% of training prompts. Structural fixes (remove the bonus, then enforce a three-section output format) worked better than incremental reward tweaks.

Integration evaluations that score orchestrator output cannot catch failures the orchestrator is robust to. A think=True Gemma 4 commander is robust to a lot, including specialist API errors and KeyError exceptions surfaced as string output. Two independent specialist failures—Bonsai's 4096-token context overflow and a ChatAgent KeyError—both passed our integration eval cleanly. Score specialists separately. Sample production logs.

At the small-model scale, context window can beat parameter count. Bonsai-8B at 1.15 GB outscores most 8B models on benchmarks, but its 4096-token context made it unusable for our RAG. The Gemma 4 family's native long context handled our 4000-5000 token RAG prompts without intervention.

How this fits the moment

April 2026 was a remarkable month for open-weight releases. Gemma 4 shipped on April 2 with native multimodal across most of its variants. Alibaba's Qwen 3.5 had been available since February with a similar family span (0.8B through 397B, hybrid thinking, 256K context, Apache 2.0), and Qwen 3.6 followed in April with a thinking-preservation feature. Several other strong families—Llama 4, Phi-4-reasoning, GLM-5.1, DeepSeek V4—shipped in adjacent weeks.

What pulled us toward Gemma 4 specifically was the VRAM math. Multi-agent on a single RTX 5090 means hosting a commander and one or two specialists concurrently in 31.8 GB of VRAM. The E4B/E2B size split fit that constraint cleanly—E4B for the routing and judge tier, room left over for a small specialist running underneath. Qwen 3.5/3.6 would have given us a comparable size split (4B/9B for the medium tier, 0.8B-2B for the specialist tier), and would be a reasonable choice for the same project today. The methodology in this post—family-internal commander/specialist split with shared GRPO behavior shaping—should transfer to any recent family with comparable structure.

The architecture turns out to be a reasonable instantiation of what the IPA's Open Data Spaces (ODS) specification calls L1 data sovereignty. Enterprise data (legal documents, internal chat) gets processed in place. The AI inference layer runs locally on open weights. We were already working inside the constraint ODS later articulated as the Agentic AI-era data infrastructure pattern.

There's a companion submission to this challenge: Diagent, a tool that turns hand-drawn workflow sketches into verified LangGraph agents. Diagent uses Gemma 4 E4B for the two jobs at its core—multimodal parsing of the sketch into a Mermaid diagram, and reasoning-mode generation of LangGraph code—then statically audits and sandbox-executes what it produced. This post is about operating a multi-agent system at runtime; Diagent is about generating and verifying agent workflows at design time. Different stages of the same craft: building AI agent systems with the Gemma 4 family, locally, on consumer GPUs.

References

Gemma 4 family by Google DeepMind (Apache 2.0). Gemma 3 family by Google DeepMind (Apache 2.0). TinySwallow-1.5B-Instruct by SakanaAI (Apache 2.0). Bonsai-8B by Prism ML (Apache 2.0). Ollama for serving. ChromaDB for vector storage. multilingual-e5-large for embeddings. bge-reranker-v2-m3 for reranking.

Research that shaped the design:

Yue et al. (2025), "Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?" — arXiv:2504.13837 (NeurIPS 2025 Oral / ICML 2025 AI4MATH Best Paper)
Zhang et al. (2024), "RAFT: Adapting Language Model to Domain Specific RAG" — arXiv:2403.10131
Asai et al. (2023), "Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection" — arXiv:2310.11511
IPA (2026), "Open Data Spaces (ODS) Reference Architecture Model" — https://www.ipa.go.jp/digital/opendataspaces/

This is the runtime-control half of a two-part submission. The design-time half—Diagent, a sketch-to-verified-agent tool also built on Gemma 4 E4B—is linked here: https://dev.to/hatyibei/diagent-when-the-static-auditor-and-the-sandbox-disagree-whos-right-4ijg.

All numbers in this post come from actual evaluation and training runs. Logs available on request.
