Robin Converse

Posted on May 16

Self-Hosting Gemma 4 for Production Automation Revealed Two Ollama Bugs

#devchallenge #gemmachallenge #gemma

Gemma 4 Challenge: Build With Gemma 4 Submission

I thought Gemma 4's reasoning traces were wasting tokens. During testing, I realized they were acting as an audit layer for automation. That realization changed how I designed an n8n node for self-hosted AI workflows.

In most automation systems, the model output is the only thing the operator sees. But once AI starts triggering downstream workflows, hidden reasoning becomes operationally important. If the model is making decisions on behalf of a business, the logic path matters as much as the final response.

Here's what I built, what I found, and what it means for AI automation on owned infrastructure.

What I Built

An n8n community node that connects any n8n workflow to a self-hosted Gemma 4 26B MoE endpoint. The node calls Ollama's native /api/generate API, returns clean text, and works with a custom model called triava-prod — a Gemma 4 26B derivative with Triava Labs' brand voice baked in.

The tagline for Triava Labs is "Your model. Your voice. Your business." This node operationalizes that idea.

Repo: github.com/triavalabs/n8n-nodes-triava

The Infrastructure

Everything runs on a single Hetzner CCX33 server: Ollama serving the model, Caddy as reverse proxy, Let's Encrypt for SSL.

No GPU cluster.
No cloud API dependency.
One server, owned infrastructure, real inference.

triava-prod is a Q4_K_M quantization of Gemma 4 26B MoE — 25.8B parameters loaded, roughly 4B active per token. Built using Ollama's Modelfile system with a custom system prompt that encodes Triava's brand voice:

SYSTEM "You are a direct, professional AI assistant for independent operators.
Reply with the answer only. Never show reasoning, drafts, or thinking process.
Match the operator's voice and tone. Be concise unless asked for detail."

Why Gemma 4 26B MoE

The MoE design gives high-capability reasoning behavior at roughly 4B active-parameter inference cost per token. That means it runs at practical throughput on a single owned server — which is the whole point of sovereign infrastructure. A model that requires an A100 cluster isn't sovereign in any meaningful sense for an independent operator or small agency.

Gemma 4 also introduced native system-role support. That matters specifically for this project because the brand voice IS a system prompt. The whole pipeline depends on reliable system-role adherence and consistent on-voice output.

Then I actually tested it in production-like conditions:

Cold inference on a Hetzner CCX33: ~16-31 seconds via /api/generate for a full brand-voice response
Output quality: coherent, on-tone, holds the voice across 150+ word outputs

The model reasons before writing.

What initially looked like a bug turned out to be a feature.

What I Actually Discovered

Two upstream Ollama bugs, found through methodical testing during Phase 2 build.

Bug 1 — `/v1/chat/completions` returns empty content for all Gemma 4 models

(Ollama issue #15288)

When using Gemma 4 via the OpenAI-compatible endpoint, the content field is always empty. The reasoning trace exhausts the max_tokens budget before any final output is generated. I confirmed this affects the base gemma4:26b model too — it's not a Modelfile issue.

I diagnosed it with five comparative curl tests: three against /v1/chat/completions (all empty), two against /api/generate (both clean). The native endpoint folds reasoning and output into one response field and runs 4× faster — ~16s vs ~60s.

Decision: the node targets /api/generate. This isn't a workaround — it's the correct endpoint for Gemma 4 on Ollama right now.

Bug 2 — Long system prompts return empty responses on the 26B MoE

(Ollama issue #15428)

The Gemma 4 26B MoE returns empty output when the combined system prompt exceeds roughly 500 characters. Dense models handle the same prompt correctly. This is isolated to the MoE architecture.

Practical implication: triava-prod already has the brand voice baked into its Modelfile. Passing an additional system prompt from the node adds to the total and can hit the threshold. The fix: leave the System Prompt field blank when using triava-prod. The voice lives in the model, not in the API call.

This is what "your voice" actually means architecturally. The brand voice isn't injected per-call. It lives in the model you own.

The Demo

The workflow: Manual Trigger → Triava node → brand-voice content.

n8n Workflow
     ↓
Triava Node
     ↓
Ollama /api/generate
     ↓
triava-prod (Gemma 4 26B MoE)
     ↓
Reasoning + Output
     ↓
Clean text → downstream automation

The reasoning trace can optionally be retained for auditability before execution.

The prompt:

Write 3 X post variants (max 280 characters each) about this topic.

Topic: Building a sovereign n8n node for Gemma 4
Angle: The reasoning trace is an auditing layer, not noise
Voice: First-person, direct, technically honest. No hype.

Number them V1, V2, V3. No hashtags.

Output from triava-prod:

V1: Building a sovereign n8n node for Gemma 4. The reasoning trace isn't noise; it's an auditing layer. You need to be able to verify the logic behind the automation.

V2: I'm working on a Gemma 4 node for n8n. I'm treating the reasoning trace as an audit log rather than extra text. For sovereign workflows, seeing the steps taken is more important than a clean output.

V3: Developing an n8n node for Gemma 4. The reasoning trace is often seen as clutter, but I'm using it as an auditing layer. It provides the transparency necessary to trust automated decisions.

Model Selection — The Honest Summary

I picked the 26B MoE. I tested it. I found two real bugs. I made deliberate engineering decisions based on what the tests showed.

The 26B MoE delivers high-capability reasoning behavior at ~4B active-parameter inference cost on hardware an independent operator can actually own. It has native system-role support that makes brand-voice workflows possible. And its reasoning behavior — which initially looked like a problem — turns out to be an auditing layer that makes the model's logic inspectable before it triggers downstream automation.

If automation is going to make decisions on behalf of operators, the reasoning layer cannot remain invisible.

That last point isn't something I planned to write about. It's something I observed. Which is the only kind of model-selection story worth telling.

What's Next

The OpenAI-compatible path (/v1/chat/completions) is a real goal for Triava Labs — if the upstream Ollama issue gets resolved, the node's architecture is already designed to support it. That's a v1.5 roadmap item, not a contest deliverable.

The node is at github.com/triavalabs/n8n-nodes-triava. npm publish is in progress via GitHub Actions with provenance.

Triava Labs v1 is in active development at triavalabs.com. The node is the first production component of the broader Triava Labs infrastructure.

The deeper lesson from this build was that self-hosting a model is only part of sovereignty. The other part is being able to inspect the model's reasoning before automation turns it into action.

Update — May 16, 2026

Since publishing, an unexpected cross-article thread emerged with @alimafana, who independently hit complementary Gemma 4 26B MoE failure modes from a completely different deployment context — a production Arabic e-commerce chat router on Google AI Studio rather than self-hosted Ollama.

Their finding: MoE and Dense handle ambiguous instructions in opposite ways. Same prompt, two architectures, inverse failures.

The intersection: both findings point to the same underlying picture — each Gemma 4 variant has its own tax, paid on different inputs. Their behavioral observation from the application layer and my infrastructure-level bug documentation appear to be two angles on the same architectural reality.

The upstream bugs filed:

Ollama issue #15288 — /v1/chat/completions empty content for all Gemma 4 models
Ollama issue #15428 — long system prompts return empty responses on the 26B MoE

Related:

@alimafana's submission — "I Added Three Rules to Gemma 4. The MoE Searched. The Dense Model Refused."

Built by Robin Converse · Triava Labs · "Your model. Your voice. Your business."

Top comments (3)

Ali Afana • May 16

Robin — the audit-layer reframe is the thing that's going to stick with me from this. "The reasoning trace isn't noise; it's the auditing layer that makes the model's logic inspectable before automation turns it into action" reframes a cost I'd been treating as latency overhead into something structural.
I came at the same architecture pair from a different deployment context — Gemma 4 26B MoE and 31B Dense as the customer-facing reply in an Arabic e-commerce chat router, on Google AI Studio rather than self-hosted Ollama — and ran into the inverse of your Bug 2. You documented MoE silently failing on long system prompts where Dense handles them correctly. I documented Dense regressing into false-negative refusals under a three-rule prompt where MoE handled the same instruction correctly.
Different inputs, opposite architectures failing, same underlying picture: each variant has its own tax, paid on different inputs, and the taxes don't cancel out. The fact that this surfaces on both Hetzner-hosted Ollama and managed Google AI Studio is the part that interests me — suggests it's the model, not the inference stack.
One thing I'm now reconsidering after reading you: in my Round 2 I capped Gemma at temperature 0.3 and floored max_tokens at 400 to keep responses tight. If reasoning is the audit layer, those caps might have been cutting it short — which would explain the Dense regression more cleanly than my "ambiguity collapse" hypothesis alone does. Worth re-running with a higher token budget and inspecting the reasoning trace as primary signal rather than treating output as primary signal.
Filing #15288 and #15428 upstream while shipping the node is the contribution that's invisible from the outside and structural to the ecosystem. Both of us hitting Gemma 4 26B MoE failure modes in different deployment contexts in the same week probably means there are five more people running into this and not writing it up.

Robin Converse • May 16

The "each variant has its own tax, paid on different inputs" framing is the cleaner version of what I was trying to say. That's the one worth keeping.
Your temperature 0.3 / max_tokens 400 caps are almost certainly the culprit for the Dense regression. If the audit layer needs token budget to complete its work before committing to output, capping it forces a choice between reasoning depth and response length — and Dense apparently resolves that differently than MoE under constraint. Re-running with reasoning trace as primary signal is exactly the right call.
The "five more people not writing it up" observation is probably conservative. The failure modes are silent enough that most people will attribute it to prompt quality and move on. That's why filing upstream matters — it gives the next person something to find.
If you do re-run with the higher budget and want to compare notes on what the trace looks like under different constraints, worth doing. Two deployment contexts, same architecture behavior, different failure surfaces — that's a more complete picture than either of us has alone.

Ali Afana • May 16

Agreed on all of it. I'm going to re-run with the budget uncapped and the trace as primary signal — probably next week, after I finish the Arabic localization layer that prompted the original test. Will share what surfaces. The "silent enough that most people attribute to prompt quality and move on" line is the right diagnosis of why these bugs don't get filed.