I thought Gemma 4's reasoning traces were wasting tokens. During testing, I realized they were acting as an audit layer for automation. That realization changed how I designed an n8n node for self-hosted AI workflows.
In most automation systems, the model output is the only thing the operator sees. But once AI starts triggering downstream workflows, hidden reasoning becomes operationally important. If the model is making decisions on behalf of a business, the logic path matters as much as the final response.
Here's what I built, what I found, and what it means for AI automation on owned infrastructure.
What I Built
An n8n community node that connects any n8n workflow to a self-hosted Gemma 4 26B MoE endpoint. The node calls Ollama's native /api/generate API, returns clean text, and works with a custom model called triava-prod — a Gemma 4 26B derivative with Triava Labs' brand voice baked in.
The tagline for Triava Labs is "Your model. Your voice. Your business." This node operationalizes that idea.
Repo: github.com/triavalabs/n8n-nodes-triava
The Infrastructure
Everything runs on a single Hetzner CCX33 server: Ollama serving the model, Caddy as reverse proxy, Let's Encrypt for SSL.
No GPU cluster.
No cloud API dependency.
One server, owned infrastructure, real inference.
triava-prod is a Q4_K_M quantization of Gemma 4 26B MoE — 25.8B parameters loaded, roughly 4B active per token. Built using Ollama's Modelfile system with a custom system prompt that encodes Triava's brand voice:
SYSTEM "You are a direct, professional AI assistant for independent operators.
Reply with the answer only. Never show reasoning, drafts, or thinking process.
Match the operator's voice and tone. Be concise unless asked for detail."
Why Gemma 4 26B MoE
The MoE design gives high-capability reasoning behavior at roughly 4B active-parameter inference cost per token. That means it runs at practical throughput on a single owned server — which is the whole point of sovereign infrastructure. A model that requires an A100 cluster isn't sovereign in any meaningful sense for an independent operator or small agency.
Gemma 4 also introduced native system-role support. That matters specifically for this project because the brand voice IS a system prompt. The whole pipeline depends on reliable system-role adherence and consistent on-voice output.
Then I actually tested it in production-like conditions:
- Cold inference on a Hetzner CCX33: ~16-31 seconds via
/api/generatefor a full brand-voice response - Output quality: coherent, on-tone, holds the voice across 150+ word outputs
The model reasons before writing.
What initially looked like a bug turned out to be a feature.
What I Actually Discovered
Two upstream Ollama bugs, found through methodical testing during Phase 2 build.
Bug 1 — /v1/chat/completions returns empty content for all Gemma 4 models
When using Gemma 4 via the OpenAI-compatible endpoint, the content field is always empty. The reasoning trace exhausts the max_tokens budget before any final output is generated. I confirmed this affects the base gemma4:26b model too — it's not a Modelfile issue.
I diagnosed it with five comparative curl tests: three against /v1/chat/completions (all empty), two against /api/generate (both clean). The native endpoint folds reasoning and output into one response field and runs 4× faster — ~16s vs ~60s.
Decision: the node targets /api/generate. This isn't a workaround — it's the correct endpoint for Gemma 4 on Ollama right now.
Bug 2 — Long system prompts return empty responses on the 26B MoE
The Gemma 4 26B MoE returns empty output when the combined system prompt exceeds roughly 500 characters. Dense models handle the same prompt correctly. This is isolated to the MoE architecture.
Practical implication: triava-prod already has the brand voice baked into its Modelfile. Passing an additional system prompt from the node adds to the total and can hit the threshold. The fix: leave the System Prompt field blank when using triava-prod. The voice lives in the model, not in the API call.
This is what "your voice" actually means architecturally. The brand voice isn't injected per-call. It lives in the model you own.
The Demo
The workflow: Manual Trigger → Triava node → brand-voice content.
n8n Workflow
↓
Triava Node
↓
Ollama /api/generate
↓
triava-prod (Gemma 4 26B MoE)
↓
Reasoning + Output
↓
Clean text → downstream automation
The reasoning trace can optionally be retained for auditability before execution.
The prompt:
Write 3 X post variants (max 280 characters each) about this topic.
Topic: Building a sovereign n8n node for Gemma 4
Angle: The reasoning trace is an auditing layer, not noise
Voice: First-person, direct, technically honest. No hype.
Number them V1, V2, V3. No hashtags.
Output from triava-prod:
V1: Building a sovereign n8n node for Gemma 4. The reasoning trace isn't noise; it's an auditing layer. You need to be able to verify the logic behind the automation.
V2: I'm working on a Gemma 4 node for n8n. I'm treating the reasoning trace as an audit log rather than extra text. For sovereign workflows, seeing the steps taken is more important than a clean output.
V3: Developing an n8n node for Gemma 4. The reasoning trace is often seen as clutter, but I'm using it as an auditing layer. It provides the transparency necessary to trust automated decisions.
Model Selection — The Honest Summary
I picked the 26B MoE. I tested it. I found two real bugs. I made deliberate engineering decisions based on what the tests showed.
The 26B MoE delivers high-capability reasoning behavior at ~4B active-parameter inference cost on hardware an independent operator can actually own. It has native system-role support that makes brand-voice workflows possible. And its reasoning behavior — which initially looked like a problem — turns out to be an auditing layer that makes the model's logic inspectable before it triggers downstream automation.
If automation is going to make decisions on behalf of operators, the reasoning layer cannot remain invisible.
That last point isn't something I planned to write about. It's something I observed. Which is the only kind of model-selection story worth telling.
What's Next
The OpenAI-compatible path (/v1/chat/completions) is a real goal for Triava Labs — if the upstream Ollama issue gets resolved, the node's architecture is already designed to support it. That's a v1.5 roadmap item, not a contest deliverable.
The node is at github.com/triavalabs/n8n-nodes-triava. npm publish is in progress via GitHub Actions with provenance.
Triava Labs v1 is in active development at triavalabs.com. The node is the first production component of the broader Triava Labs infrastructure.
The deeper lesson from this build was that self-hosting a model is only part of sovereignty. The other part is being able to inspect the model's reasoning before automation turns it into action.
Update — May 16, 2026
Since publishing, an unexpected cross-article thread emerged with @alimafana, who independently hit complementary Gemma 4 26B MoE failure modes from a completely different deployment context — a production Arabic e-commerce chat router on Google AI Studio rather than self-hosted Ollama.
Their finding: MoE and Dense handle ambiguous instructions in opposite ways. Same prompt, two architectures, inverse failures.
The intersection: both findings point to the same underlying picture — each Gemma 4 variant has its own tax, paid on different inputs. Their behavioral observation from the application layer and my infrastructure-level bug documentation appear to be two angles on the same architectural reality.
The upstream bugs filed:
-
Ollama issue #15288 —
/v1/chat/completionsempty content for all Gemma 4 models - Ollama issue #15428 — long system prompts return empty responses on the 26B MoE
Related:
- @alimafana's submission — "I Added Three Rules to Gemma 4. The MoE Searched. The Dense Model Refused."
Update — May 20, 2026
Ran the uncapped re-run we discussed in the comments. The MoE on sovereign Ollama handles all six scenarios correctly across temperatures 0.3 / 0.7 / 1.0 when the budget isn't capped — which doesn't reproduce @alimafana's Dense regression on the MoE side. Consistent with what we both expected.
The unexpected part: even uncapped on /api/generate, there's a measurable gap between eval_count and the characters returned in the response field, and it widens with query difficulty.
- Grounded retrieval ("white shirt size L"): 389 tokens / 155 chars (~2.5 chars per token)
- Under-specified retrieval ("formal but soft"): 1,096–1,379 tokens / 281–321 chars (~0.2–0.3 chars per token)
Temperature isn't the driver — the gap holds at both 0.3 and 0.7. The query type is.
What this means architecturally: the audit-layer framing in the original article holds, but the audit layer is even more elusive than the article suggested. On /v1/chat/completions the reasoning eats the budget and the response is empty (#15288). On /api/generate the reasoning happens, the budget absorbs it, and you see a clean final answer — but the work itself isn't in any field the API returns. Bugs #15288 and #15428 still stand. This is a third observation about the same architecture, not a contradiction of either.
Methodology and full 18-call CSV available on request.
Update — May 21, 2026
Status check on both upstream Ollama issues, four days after publishing:
#15288 — /v1/chat/completions empty content for Gemma 4 — was closed as completed on April 3. The maintainer (@jmorganca) and a collaborator (@rick-github) identified that the empty content field is caused by max_tokens being insufficient to accommodate Gemma 4's reasoning plus the final output. Workarounds: raise max_tokens, set reasoning_effort: "none", or use /api/generate (the path Triava took). My characterization of the endpoint as "broken" was a step too strong — the more honest framing is that /api/generate was the simpler integration path for Gemma 4's reasoning behavior, but /v1/chat/completions works with the right configuration. The architectural choice still holds; the framing gets calibrated.
#15428 — gemma4:26b MoE empty response on long system prompts — was real on Ollama 0.20.x. Multiple users independently confirmed it (@wiltongorske, @semidark, @cymise, @maxbanton). It was fixed somewhere between 0.20.x and 0.21.x, likely as part of the gemma4 renderer rework that brought tokenization counts back to expected values. I verified this morning on Ollama 0.23.1 against the same model SHA: the empty-response behavior is gone, prompt_eval_count is back to the expected ~329 tokens (vs. 1,423 on 0.20.3), and content/thinking fields populate correctly. Comment posted on the issue confirming resolved-in-0.21+.
The bake-voice-into-the-Modelfile architectural decision still holds independent of #15428 — keeping the voice in the model rather than the API call is the right design for sovereignty regardless of whether long system prompts work via the chat endpoint. What changed is the framing: it's a deliberate architectural choice aligned with the thesis, not an engineering workaround forced by a bug.
Update — May 23, 2026
The Scenario 6 substitution observation from the May 20 update — same response across all three temperatures — turned out to be the visible edge of something more general.
Jiwon SEO (Hashevolution/JAMES-RAG-Evol) ran a confirming sweep on gemma4:e4b (PR #440) and found the same pattern at smaller parameter count: substitution-mode prompts produce identical output across runs at T=0.2, regardless of cap budget. The mode genuinely bypasses the sampling layer.
I ran a 2×2 cross-model sweep on gemma4:26b MoE this morning (cap × prompt-type, n=20/cell) — full data in triavalabs/gemma4-26b-mode-split. Three findings:
- Substitution is bit-for-bit deterministic on 26b — 40/40 calls, 1 unique response, eval_count 38 flat. Same canonical text every time at T=0.2.
- Cap budget is irrelevant when task fits within cap — 10× cap change produces no behavioral difference for either mode.
- 26b synthesis is ~9× more token-efficient than e4b — equivalent decisions in 1/9th the tokens, 100% success vs 70%. Parameter count appears to buy reasoning efficiency, not just capacity.
The collaboration now has a three-axis joint paper structure: mode split (qualitative), workload gradient (JAMES quantitative), and model-scale efficiency (cross-model + answer-convergence). Headline: "Substitution is free. Synthesis costs in proportion to what it has to invent."
Joint research artifacts:
- PR #440 — Jiwon's e4b V3'.e sweep
- PR #453 — Direction 4 result: e4b determinism replication + cross-model convergence
- Issue #448 — cross-stack analysis thread
-
triavalabs/gemma4-26b-mode-split— my 26b companion data + analysis
What started as an audit-layer observation in this article has become a cross-stack research thread with three named contributors and a productization path (Direction 5: auto-routing on JAMES's Provider Contract using the mode split as a design primitive). The work isn't finished — Ali Afana's deployment-context data on managed Gemini (Track 3) lands mid-June. But the architecture pattern is in hand.
Built by Robin Converse · Triava Labs · "Your model. Your voice. Your business."


Top comments (7)
Robin — the audit-layer reframe is the thing that's going to stick with me from this. "The reasoning trace isn't noise; it's the auditing layer that makes the model's logic inspectable before automation turns it into action" reframes a cost I'd been treating as latency overhead into something structural.
I came at the same architecture pair from a different deployment context — Gemma 4 26B MoE and 31B Dense as the customer-facing reply in an Arabic e-commerce chat router, on Google AI Studio rather than self-hosted Ollama — and ran into the inverse of your Bug 2. You documented MoE silently failing on long system prompts where Dense handles them correctly. I documented Dense regressing into false-negative refusals under a three-rule prompt where MoE handled the same instruction correctly.
Different inputs, opposite architectures failing, same underlying picture: each variant has its own tax, paid on different inputs, and the taxes don't cancel out. The fact that this surfaces on both Hetzner-hosted Ollama and managed Google AI Studio is the part that interests me — suggests it's the model, not the inference stack.
One thing I'm now reconsidering after reading you: in my Round 2 I capped Gemma at temperature 0.3 and floored max_tokens at 400 to keep responses tight. If reasoning is the audit layer, those caps might have been cutting it short — which would explain the Dense regression more cleanly than my "ambiguity collapse" hypothesis alone does. Worth re-running with a higher token budget and inspecting the reasoning trace as primary signal rather than treating output as primary signal.
Filing #15288 and #15428 upstream while shipping the node is the contribution that's invisible from the outside and structural to the ecosystem. Both of us hitting Gemma 4 26B MoE failure modes in different deployment contexts in the same week probably means there are five more people running into this and not writing it up.
The "each variant has its own tax, paid on different inputs" framing is the cleaner version of what I was trying to say. That's the one worth keeping.
Your temperature 0.3 / max_tokens 400 caps are almost certainly the culprit for the Dense regression. If the audit layer needs token budget to complete its work before committing to output, capping it forces a choice between reasoning depth and response length — and Dense apparently resolves that differently than MoE under constraint. Re-running with reasoning trace as primary signal is exactly the right call.
The "five more people not writing it up" observation is probably conservative. The failure modes are silent enough that most people will attribute it to prompt quality and move on. That's why filing upstream matters — it gives the next person something to find.
If you do re-run with the higher budget and want to compare notes on what the trace looks like under different constraints, worth doing. Two deployment contexts, same architecture behavior, different failure surfaces — that's a more complete picture than either of us has alone.
Agreed on all of it. I'm going to re-run with the budget uncapped and the trace as primary signal — probably next week, after I finish the Arabic localization layer that prompted the original test. Will share what surfaces. The "silent enough that most people attribute to prompt quality and move on" line is the right diagnosis of why these bugs don't get filed.
Got curious sooner than I planned — ran the uncapped re-run on the sovereign side last night. Posted the topline in an Update on the article body; sharing the longer breakdown here in case it's useful when you get to the Arabic localization round.
Setup: Gemma 4 26B MoE on the same Hetzner CCX33, via /api/generate, num_predict: -1, six scenarios across three temperatures (0.3 / 0.7 / 1.0). System prompt held under 500 chars to avoid #15428. Three scenarios are your verbatim text from the original thread on your article — grounded retrieval, the under-100-shekels multi-attribute case, the not-polyester negation. The other three fill the ranking-vs-control taxonomy Vic landed on later in your thread (clean absence, long-tail low-confidence, the overlapping substitution-plus-decision case Vadym surfaced).
On the Dense regression hypothesis: the MoE on the sovereign stack handles every scenario correctly across all three temperatures. Grounded retrieval lists both matches. Multi-attribute correctly applies the under-100 filter. Negation correctly excludes the polyester item and returns everything else. Absence honestly refuses and offers alternatives. The false-refusal failure you documented on Dense doesn't appear on the MoE under any temperature when the budget isn't capped. That's consistent with what we both expected — the re-run doesn't reproduce your Round 2 finding on the MoE side, because the MoE wasn't the one regressing.
On the audit layer. This is the part that ended up more interesting than I expected. Three datapoints, same model, same endpoint, same uncapped budget:
The grounded query produces ~2.5 chars of visible response per output token. The under-specified query produces ~0.2–0.3 chars per token — roughly a 10x widening of the gap between tokens generated and characters surfaced. Temperature isn't the driver (the gap holds at both 0.3 and 0.7); the query type is.
Being careful with what I claim from this. eval_count reports output tokens, the response field returns the user-facing text, and there's a substantial measurable difference between the two on harder queries. The mechanism — whether those tokens are reasoning the model produced and Ollama dropped, sampling revisions, special tokens, or something else — isn't visible from the API. What's defensible is the observation: on the MoE, via /api/generate, with the budget uncapped, the model demonstrably does more work on under-specified queries than reaches the response field, and that work scales with query difficulty, not with temperature.
Which connects back to your "ambiguity collapse" framing, but reframes it slightly. The cap wasn't just starving the reasoning — even uncapped, the reasoning isn't fully accessible through either endpoint. On /v1/chat/completions it eats the budget and you see nothing (#15288). On /api/generate it happens, the budget absorbs it, and you see a clean final answer — but the work itself isn't in any field the API returns.
One small separate observation worth flagging. Scenario 6 ("what's the return policy on the linen shirt") returned the identical response — same wording, 47 chars — at T=0.3, T=0.7, and T=1.0. Latencies dropped with temperature (36s, 29s, 23s). The MoE appears to treat canonical-text retrieval as a different operating mode than synthesis — temperature has no effect because there's no decision being made. That maps onto Vadym's substitution-vs-decision boundary cleanly: when the canonical text exists in context, the MoE just substitutes; the "decide whether to invoke" step doesn't live in the model on this query class. Probably worth its own follow-up.
Standing by for whatever surfaces on your Arabic side — the cap-vs-uncap comparison on the managed Gemini API will be the more interesting half of the picture.
Robin — your hypothesis was the one worth testing. Ran the single-variable change on the Gemini side today: kept the Arabic frame and temp 0.3 from v2, raised max_tokens from 400 to 4096. Same six scenarios, 26B MoE + 31B Dense, twelve calls. Dense recovered on every scenario including the s2 false-refusal that anchored the original article. Same direction as your finding — when reasoning has room, the failure mode evaporates. Cap was doing the work.
What I retract from the original: the "architecture-mediated divergence" framing. Both architectures hit the cap from opposite directions — MoE in Round 1 on s2 stalling, Dense in Round 2 on s2 refusing — but both recover the same way. The architectural difference is real but quantitative: different token-budget taxes for the same multi-step reasoning, not the qualitative "MoE has slots Dense doesn't" framing I shipped.
On your eval_count gap: I couldn't reproduce it on the Gemini side. The managed API surfaces usage as {prompt_tokens, completion_tokens, total_tokens} only — no eval_count equivalent, no separate audit on hidden work. Your finding is the kind of behavior /api/generate exposes that /v1beta/models/X:generateContent doesn't. The cross-validation is asymmetric for that reason — your telemetry catches a class of behavior my endpoint can't.
Your s6 observation — canonical-text retrieval as a distinct operating mode, temperature-invariant — is the one I want to think about longest. Maps onto Vadym's substitution boundary cleaner than anything else on the thread. Predicts that the right architecture for "answer the policy question" vs "synthesize a recommendation" might be a router decision, not a single-model choice. Worth its own piece.
Wrote up the re-run as a follow-up. Sent you the draft — your call on shape and timing.
Thank you! Good post.
The the core insight is a good catch. Did you consider any alternatives before settling on this?
Great stuff — followed for more! 👍