<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Robin Converse</title>
    <description>The latest articles on DEV Community by Robin Converse (@cloudninealt).</description>
    <link>https://dev.to/cloudninealt</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F577245%2F67494a6c-062c-413e-be2a-76933938bbce.jpeg</url>
      <title>DEV Community: Robin Converse</title>
      <link>https://dev.to/cloudninealt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/cloudninealt"/>
    <language>en</language>
    <item>
      <title>Self-Hosting Gemma 4 for Production Automation Revealed Two Ollama Bugs</title>
      <dc:creator>Robin Converse</dc:creator>
      <pubDate>Sat, 16 May 2026 11:27:31 +0000</pubDate>
      <link>https://dev.to/cloudninealt/self-hosting-gemma-4-for-production-automation-revealed-two-ollama-bugs-1oo4</link>
      <guid>https://dev.to/cloudninealt/self-hosting-gemma-4-for-production-automation-revealed-two-ollama-bugs-1oo4</guid>
      <description>&lt;p&gt;I thought Gemma 4's reasoning traces were wasting tokens. During testing, I realized they were acting as an audit layer for automation. That realization changed how I designed an n8n node for self-hosted AI workflows.&lt;/p&gt;

&lt;p&gt;In most automation systems, the model output is the only thing the operator sees. But once AI starts triggering downstream workflows, hidden reasoning becomes operationally important. If the model is making decisions on behalf of a business, the logic path matters as much as the final response.&lt;/p&gt;

&lt;p&gt;Here's what I built, what I found, and what it means for AI automation on owned infrastructure.&lt;/p&gt;




&lt;h2&gt;What I Built&lt;/h2&gt;

&lt;p&gt;An n8n community node that connects any n8n workflow to a self-hosted Gemma 4 26B MoE endpoint. The node calls Ollama's native &lt;code&gt;/api/generate&lt;/code&gt; API, returns clean text, and works with a custom model called &lt;code&gt;triava-prod&lt;/code&gt; — a Gemma 4 26B derivative with Triava Labs' brand voice baked in.&lt;/p&gt;

&lt;p&gt;The tagline for Triava Labs is "Your model. Your voice. Your business." This node operationalizes that idea.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/triavalabs/n8n-nodes-triava" rel="noopener noreferrer"&gt;github.com/triavalabs/n8n-nodes-triava&lt;/a&gt;&lt;/p&gt;
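&lt;p&gt;The node's core request is a single POST to the native endpoint. Here is a minimal Python sketch of the request and response shape (the field names follow Ollama's &lt;code&gt;/api/generate&lt;/code&gt; schema; the sample response body is illustrative, not real model output):&lt;/p&gt;

```python
import json

# Default Ollama port; behind Caddy this would be the HTTPS hostname instead.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_payload(model: str, prompt: str, system: str = "") -> dict:
    """Build the JSON body for Ollama's native /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if system:
        # Optional per-call system prompt. Leave it empty for triava-prod,
        # whose voice is already baked into the Modelfile (see Bug 2 below).
        payload["system"] = system
    return payload

def extract_text(response_body: str) -> str:
    """The native endpoint folds reasoning and output into one 'response' field."""
    return json.loads(response_body)["response"]

payload = build_generate_payload("triava-prod:latest", "Write one X post about sovereign AI.")

# Illustrative response body, shaped like Ollama's non-streaming reply:
sample = '{"model": "triava-prod:latest", "response": "Own the model, own the voice.", "done": true}'
print(extract_text(sample))
```

&lt;p&gt;Setting &lt;code&gt;stream&lt;/code&gt; to false is the natural fit for a workflow node: one request, one complete JSON object, no token stream to reassemble.&lt;/p&gt;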




&lt;h2&gt;The Infrastructure&lt;/h2&gt;

&lt;p&gt;Everything runs on a single Hetzner CCX33 server: Ollama serving the model, Caddy as reverse proxy, Let's Encrypt for SSL.&lt;/p&gt;

&lt;p&gt;No GPU cluster.&lt;br&gt;
No cloud API dependency.&lt;br&gt;
One server, owned infrastructure, real inference.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;triava-prod&lt;/code&gt; is a Q4_K_M quantization of Gemma 4 26B MoE — 25.8B parameters loaded, roughly 4B active per token. Built using Ollama's Modelfile system with a custom system prompt that encodes Triava's brand voice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SYSTEM "You are a direct, professional AI assistant for independent operators.
Reply with the answer only. Never show reasoning, drafts, or thinking process.
Match the operator's voice and tone. Be concise unless asked for detail."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
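&lt;p&gt;For context, a derivative like this is built from a short Modelfile. A sketch (the actual &lt;code&gt;triava-prod&lt;/code&gt; Modelfile isn't published; &lt;code&gt;FROM&lt;/code&gt; and &lt;code&gt;SYSTEM&lt;/code&gt; are standard Ollama Modelfile directives, and the base model tag here is assumed):&lt;/p&gt;

```
# Illustrative Modelfile, not the published triava-prod file
FROM gemma4:26b

SYSTEM "You are a direct, professional AI assistant for independent operators.
Reply with the answer only. Never show reasoning, drafts, or thinking process.
Match the operator's voice and tone. Be concise unless asked for detail."
```

&lt;p&gt;Then &lt;code&gt;ollama create triava-prod -f Modelfile&lt;/code&gt; registers it, and it serves like any other local model.&lt;/p&gt;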






&lt;h2&gt;Why Gemma 4 26B MoE&lt;/h2&gt;

&lt;p&gt;The MoE design gives high-capability reasoning behavior at roughly 4B active-parameter inference cost per token. That means it runs at practical throughput on a single owned server — which is the whole point of sovereign infrastructure. A model that requires an A100 cluster isn't sovereign in any meaningful sense for an independent operator or small agency.&lt;/p&gt;

&lt;p&gt;Gemma 4 also introduced native system-role support. That matters specifically for this project because the brand voice &lt;em&gt;is&lt;/em&gt; a system prompt. The whole pipeline depends on reliable system-role adherence and consistent on-voice output.&lt;/p&gt;

&lt;p&gt;Then I actually tested it in production-like conditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cold inference on a Hetzner CCX33: &lt;strong&gt;~16-31 seconds&lt;/strong&gt; via &lt;code&gt;/api/generate&lt;/code&gt; for a full brand-voice response&lt;/li&gt;
&lt;li&gt;Output quality: coherent, on-tone, holds the voice across 150+ word outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model reasons before writing.&lt;/p&gt;

&lt;p&gt;What initially looked like a bug turned out to be a feature.&lt;/p&gt;




&lt;h2&gt;What I Actually Discovered&lt;/h2&gt;

&lt;p&gt;Two upstream Ollama bugs, found through methodical testing during the Phase 2 build.&lt;/p&gt;

&lt;h3&gt;Bug 1 — &lt;code&gt;/v1/chat/completions&lt;/code&gt; returns empty content for all Gemma 4 models&lt;/h3&gt;

&lt;p&gt;(&lt;a href="https://github.com/ollama/ollama/issues/15288" rel="noopener noreferrer"&gt;Ollama issue #15288&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;When using Gemma 4 via the OpenAI-compatible endpoint, the content field is always empty. The reasoning trace exhausts the &lt;code&gt;max_tokens&lt;/code&gt; budget before any final output is generated. I confirmed this affects the base &lt;code&gt;gemma4:26b&lt;/code&gt; model too — it's not a Modelfile issue.&lt;/p&gt;

&lt;p&gt;I diagnosed it with five comparative curl tests: three against &lt;code&gt;/v1/chat/completions&lt;/code&gt; (all empty), two against &lt;code&gt;/api/generate&lt;/code&gt; (both clean). The native endpoint folds reasoning and output into one &lt;code&gt;response&lt;/code&gt; field and runs 4× faster — ~16s vs ~60s.&lt;/p&gt;

&lt;p&gt;Decision: the node targets &lt;code&gt;/api/generate&lt;/code&gt;. This isn't a workaround — it's the correct endpoint for Gemma 4 on Ollama right now.&lt;/p&gt;
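&lt;p&gt;The difference in response shape is easy to see side by side. A minimal Python sketch (the response bodies are illustrative; the field paths follow the OpenAI-compatible and native Ollama schemas, and the empty &lt;code&gt;content&lt;/code&gt; mirrors what issue #15288 describes):&lt;/p&gt;

```python
import json

def openai_content(body: str) -> str:
    """OpenAI-compatible endpoint: text lives at choices[0].message.content."""
    return json.loads(body)["choices"][0]["message"]["content"]

def native_content(body: str) -> str:
    """Native /api/generate: reasoning and output are folded into 'response'."""
    return json.loads(body)["response"]

# Illustrative bodies. The chat completion comes back empty after the
# reasoning trace exhausts max_tokens; the native endpoint returns clean text.
chat_body = '{"choices": [{"message": {"role": "assistant", "content": ""}, "finish_reason": "length"}]}'
gen_body = '{"model": "gemma4:26b", "response": "A clean brand-voice answer.", "done": true}'

print(repr(openai_content(chat_body)))  # empty string, per issue #15288
print(native_content(gen_body))
```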

&lt;h3&gt;Bug 2 — Long system prompts return empty responses on the 26B MoE&lt;/h3&gt;

&lt;p&gt;(&lt;a href="https://github.com/ollama/ollama/issues/15428" rel="noopener noreferrer"&gt;Ollama issue #15428&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;The Gemma 4 26B MoE returns empty output when the combined system prompt exceeds roughly 500 characters. Dense models handle the same prompt correctly. This is isolated to the MoE architecture.&lt;/p&gt;

&lt;p&gt;Practical implication: &lt;code&gt;triava-prod&lt;/code&gt; already has the brand voice baked into its Modelfile. Passing an additional system prompt from the node adds to the total and can hit the threshold. The fix: leave the System Prompt field blank when using &lt;code&gt;triava-prod&lt;/code&gt;. The voice lives in the model, not in the API call.&lt;/p&gt;

&lt;p&gt;This is what "your voice" actually means architecturally. The brand voice isn't injected per-call. It lives in the model you own.&lt;/p&gt;
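&lt;p&gt;Until the upstream fix lands, a length guard in the calling code is cheap insurance. A Python sketch (the ~500-character threshold is the empirical figure from issue #15428, not a documented limit; the helper and its warning text are mine):&lt;/p&gt;

```python
# Empirical threshold from Ollama issue #15428, not a documented limit.
MOE_SYSTEM_PROMPT_LIMIT = 500

def check_system_prompt(baked_in: str, per_call: str) -> list[str]:
    """Warn when the combined system prompt risks an empty 26B MoE response."""
    warnings = []
    if baked_in and per_call:
        warnings.append("Model already carries a baked-in system prompt; "
                        "leave the per-call System Prompt field blank.")
    total = len(baked_in) + len(per_call)
    if total > MOE_SYSTEM_PROMPT_LIMIT:
        warnings.append(f"Combined system prompt is {total} chars; "
                        f"the 26B MoE may return an empty response.")
    return warnings

# triava-prod: the voice lives in the Modelfile, so the per-call field stays empty.
print(check_system_prompt(baked_in="x" * 300, per_call=""))  # no warnings
```

&lt;p&gt;The node-level decision above is the zero-code version of the same rule: the voice ships in the model, so the API call carries none of it.&lt;/p&gt;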




&lt;h2&gt;The Demo&lt;/h2&gt;

&lt;p&gt;The workflow: Manual Trigger → Triava node → brand-voice content.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;n8n Workflow
     ↓
Triava Node
     ↓
Ollama /api/generate
     ↓
triava-prod (Gemma 4 26B MoE)
     ↓
Reasoning + Output
     ↓
Clean text → downstream automation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reasoning trace can optionally be retained for auditability before execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fku43vorlmu31hr3okcx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fku43vorlmu31hr3okcx2.png" alt="Triava node executing in n8n — model set to triava-prod:latest, message prompt visible, system prompt blank, inference running" width="800" height="592"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write 3 X post variants (max 280 characters each) about this topic.

Topic: Building a sovereign n8n node for Gemma 4
Angle: The reasoning trace is an auditing layer, not noise
Voice: First-person, direct, technically honest. No hype.

Number them V1, V2, V3. No hashtags.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz8sviuk47p9101rh7ha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyz8sviuk47p9101rh7ha.png" alt="Triava node output showing Node executed successfully with V1 V2 V3 brand-voice X post variants in the JSON panel" width="800" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Output from &lt;code&gt;triava-prod&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;V1:&lt;/strong&gt; Building a sovereign n8n node for Gemma 4. The reasoning trace isn't noise; it's an auditing layer. You need to be able to verify the logic behind the automation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;V2:&lt;/strong&gt; I'm working on a Gemma 4 node for n8n. I'm treating the reasoning trace as an audit log rather than extra text. For sovereign workflows, seeing the steps taken is more important than a clean output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;V3:&lt;/strong&gt; Developing an n8n node for Gemma 4. The reasoning trace is often seen as clutter, but I'm using it as an auditing layer. It provides the transparency necessary to trust automated decisions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/ZQmto4PvzC0"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;




&lt;h2&gt;Model Selection — The Honest Summary&lt;/h2&gt;

&lt;p&gt;I picked the 26B MoE. I tested it. I found two real bugs. I made deliberate engineering decisions based on what the tests showed.&lt;/p&gt;

&lt;p&gt;The 26B MoE delivers high-capability reasoning behavior at ~4B active-parameter inference cost on hardware an independent operator can actually own. It has native system-role support that makes brand-voice workflows possible. And its reasoning behavior — which initially looked like a problem — turns out to be an auditing layer that makes the model's logic inspectable before it triggers downstream automation.&lt;/p&gt;

&lt;p&gt;If automation is going to make decisions on behalf of operators, the reasoning layer cannot remain invisible.&lt;/p&gt;

&lt;p&gt;That last point isn't something I planned to write about. It's something I observed. Which is the only kind of model-selection story worth telling.&lt;/p&gt;




&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;The OpenAI-compatible path (&lt;code&gt;/v1/chat/completions&lt;/code&gt;) is a real goal for Triava Labs — if the upstream Ollama issue gets resolved, the node's architecture is already designed to support it. That's a v1.5 roadmap item, not a contest deliverable.&lt;/p&gt;

&lt;p&gt;The node is at &lt;a href="https://github.com/triavalabs/n8n-nodes-triava" rel="noopener noreferrer"&gt;github.com/triavalabs/n8n-nodes-triava&lt;/a&gt;. npm publish is in progress via GitHub Actions with provenance.&lt;/p&gt;

&lt;p&gt;Triava Labs v1 is in active development at triavalabs.com. The node is the first production component of the broader Triava Labs infrastructure.&lt;/p&gt;

&lt;p&gt;The deeper lesson from this build was that self-hosting a model is only part of sovereignty. The other part is being able to inspect the model's reasoning before automation turns it into action.&lt;/p&gt;

&lt;h2&gt;Update — May 16, 2026&lt;/h2&gt;

&lt;p&gt;Since publishing, an unexpected cross-article thread emerged with &lt;a href="https://dev.to/alimafana"&gt;@alimafana&lt;/a&gt;, who independently hit complementary Gemma 4 26B MoE failure modes from a completely different deployment context — a production Arabic e-commerce chat router on Google AI Studio rather than self-hosted Ollama.&lt;/p&gt;

&lt;p&gt;Their finding: MoE and Dense handle ambiguous instructions in opposite ways. Same prompt, two architectures, inverse failures.&lt;/p&gt;

&lt;p&gt;The intersection: both findings point to the same underlying picture — each Gemma 4 variant has its own tax, paid on different inputs. Their behavioral observation from the application layer and my infrastructure-level bug documentation appear to be two angles on the same architectural reality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The upstream bugs filed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/ollama/ollama/issues/15288" rel="noopener noreferrer"&gt;Ollama issue #15288&lt;/a&gt; — &lt;code&gt;/v1/chat/completions&lt;/code&gt; empty content for all Gemma 4 models&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/ollama/ollama/issues/15428" rel="noopener noreferrer"&gt;Ollama issue #15428&lt;/a&gt; — long system prompts return empty responses on the 26B MoE&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/alimafana/i-added-three-rules-to-gemma-4-the-moe-searched-the-dense-model-refused-1j18"&gt;@alimafana's submission&lt;/a&gt; — "I Added Three Rules to Gemma 4. The MoE Searched. The Dense Model Refused."&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Built by Robin Converse · &lt;a href="https://triavalabs.com" rel="noopener noreferrer"&gt;Triava Labs&lt;/a&gt; · "Your model. Your voice. Your business."&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
    </item>
  </channel>
</rss>
