DEV Community

Discussion on: I Added Three Rules to Gemma 4. The MoE Searched. The Dense Model Refused.

Robin Converse

The MoE vs Dense failure mode distinction is exactly what I found too — from a different angle.

I was hitting empty responses on the 26B MoE through Ollama’s /v1/chat/completions endpoint. Turned out the reasoning trace was exhausting the token budget before any output reached the content field. The model was doing all its work in a reasoning layer that never surfaced to the caller.
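A minimal sketch of how a caller could spot this from the outside, assuming the endpoint returns the standard OpenAI-style response shape (`choices[0].message.content` and `choices[0].finish_reason`). The function name and the three labels are hypothetical, chosen only for illustration:

```python
from typing import Any, Dict


def classify_chat_response(resp: Dict[str, Any]) -> str:
    """Classify an OpenAI-style chat completion response dict.

    Returns one of three illustrative labels:
      "ok"               - the content field holds visible text
      "budget_exhausted" - content is empty and generation hit the token cap
      "empty"            - content is empty for some other reason
    """
    choice = resp["choices"][0]
    message = choice.get("message") or {}
    content = (message.get("content") or "").strip()
    if content:
        return "ok"
    # finish_reason == "length" means generation stopped at the token limit,
    # which matches a reasoning trace consuming the whole budget.
    if choice.get("finish_reason") == "length":
        return "budget_exhausted"
    return "empty"
```

An empty `content` together with `finish_reason == "length"` is the signature described above: tokens were spent, but none of them landed in the field the caller reads.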

Your observation about “tuning architecture not size” maps directly to what I was seeing. The MoE’s reasoning behavior isn’t incidental — it’s structural. It reasons differently than the dense model at an architectural level, not just a capability level.

I ended up routing to Ollama’s native /api/generate endpoint instead, which handles the reasoning/output split cleanly.
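Roughly what that routing can look like, as a sketch rather than my exact setup. It assumes a local Ollama on its default port and uses the documented `/api/generate` request fields (`model`, `prompt`, `stream`, `options.num_predict`) and the `response` field of the reply. The helper names and the default token limit are made up for the example:

```python
import json
import urllib.request
from typing import Any, Dict

# Default address of a local Ollama server; adjust for your deployment.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_payload(model: str, prompt: str, max_tokens: int) -> Dict[str, Any]:
    """Build a non-streaming request body for Ollama's native generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream of chunks
        "options": {"num_predict": max_tokens},  # cap on generated tokens
    }


def extract_answer(body: Dict[str, Any]) -> str:
    """Pull the visible answer text out of a generate response."""
    return (body.get("response") or "").strip()


def generate(model: str, prompt: str, max_tokens: int = 2048) -> str:
    """Send the prompt to Ollama and return the answer text (hypothetical helper)."""
    data = json.dumps(build_generate_payload(model, prompt, max_tokens)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as r:
        return extract_answer(json.load(r))
```

Calling `generate(...)` with whatever model tag you have pulled locally returns only the answer text. Whether the reasoning trace comes back in a separate field depends on the Ollama version and model, so the sketch reads `response` and nothing else.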

Found two upstream bugs in the process — documented here if useful: dev.to/cloudninealt/self-hosting-gemma-4-for-production-automation-revealed-two-ollama-bugs-1oo4

Different deployment context (sovereign self-hosted infrastructure vs your production chat router) but the same underlying architecture behavior surfacing in both.

Great work on the Arabic production testing — the false negative failure mode on the dense model is a finding worth knowing about.

Ali Afana

Robin — this is a really helpful complement. You found a mechanism for what I observed behaviorally.
My "architectural slots for sequential sub-behavior" was a guess from the outside, looking at output patterns. You hit direct evidence: the 26B MoE was doing substantial reasoning work that exhausted the token budget before any output reached the content field. That's not metaphorical reasoning capacity. That's measurable token expenditure on internal computation the OpenAI-compatible interface doesn't surface.
Two angles on the same architectural fact, from completely different deployment contexts — sovereign self-hosted infra and a production e-commerce chat router shouldn't be running into the same artifact, but they are. Your trace data lets me trust the hypothesis more than my matrix alone justified.
The endpoint detail is the kind of thing I would have missed for weeks. /v1/chat/completions smoothing over the reasoning/output split is exactly the wrong abstraction to inherit from OpenAI. For closed models with no reasoning trace it doesn't matter, but for MoE-with-reasoning it's a silent failure surface. I'm on Google AI Studio, not Ollama, but it's worth checking whether the same thing is happening on the Gemini API side too. It might explain part of the latency I was seeing.
Reading your bugs writeup now. Documenting the upstream bugs and getting them filed is the part of the work that's invisible from the outside and matters most.