<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Turbo31150</title>
    <description>The latest articles on DEV Community by Turbo31150 (@turbo31150).</description>
    <link>https://dev.to/turbo31150</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3388861%2Fa5615e6d-0da8-4292-a8fe-580a99a6f56b.png</url>
      <title>DEV Community: Turbo31150</title>
      <link>https://dev.to/turbo31150</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/turbo31150"/>
    <language>en</language>
    <item>
      <title>How I built a 6-node 12-GPU on-prem AI cluster running 1000+ agents</title>
      <dc:creator>Turbo31150</dc:creator>
      <pubDate>Tue, 19 May 2026 20:10:07 +0000</pubDate>
      <link>https://dev.to/turbo31150/how-i-built-a-6-node-12-gpu-on-prem-ai-cluster-running-1000-agents-3203</link>
      <guid>https://dev.to/turbo31150/how-i-built-a-6-node-12-gpu-on-prem-ai-cluster-running-1000-agents-3203</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;TL;DR — 6 machines, 12 GPUs, 1,000+ concurrent agents, P95 18 ms, voice &amp;lt;300 ms, 280,741 lines of Python, 44 MIT repos. Vs Azure OpenAI: 7-month break-even on a 50K€ deployment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why I built this
&lt;/h2&gt;

&lt;p&gt;I'm Franck. Toulouse, France. Over 3 years I paid roughly €280,000 to Azure + OpenAI before doing the math properly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: 1.2s voice round-trip — incompatible with the voice-first UX I wanted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance&lt;/strong&gt;: customer data on US servers. Not GDPR-native, just GDPR-compliant-on-paper.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quotas&lt;/strong&gt;: random throttling at the worst times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lock-in&lt;/strong&gt;: Azure outage = my product offline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I decided to rebuild everything on-prem. This is the result.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cluster
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;6 machines, 3 tiers, 12 GPUs total, &amp;lt;5ms inter-node latency.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tier 1 — GPU compute (heavy inference)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;M1 "La Créatrice"&lt;/strong&gt; — Ryzen 5700X3D, 6× RTX 3080+, 46 GB RAM. Primary LLM node, runs qwen3.5-9b, qwen3.5-35b-a3b, deepseek-r1, the Claude 4.5/4.6 distillations, and the Whisper CUDA pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;M2 "Le Forge"&lt;/strong&gt; — multi-GPU NVIDIA, secondary inference, failover from M1 in 1.3s.&lt;/li&gt;
&lt;/ul&gt;
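
&lt;p&gt;The post gives the M1 to M2 failover time but not the mechanism. One common way to get this behaviour is priority-ordered routing behind a health probe; a minimal sketch follows. The node names come from the list above, while &lt;code&gt;pick_node&lt;/code&gt;, the &lt;code&gt;is_healthy&lt;/code&gt; callback, and the priority order are assumptions made for illustration, not the author's code.&lt;/p&gt;

```python
from typing import Callable, Optional, Sequence

# Assumed priority order: M1 is the primary LLM node, M2 the secondary.
NODE_PRIORITY = ("M1", "M2")


def pick_node(
    is_healthy: Callable[[str], bool],
    nodes: Sequence[str] = NODE_PRIORITY,
) -> Optional[str]:
    """Return the first node, in priority order, that passes the health probe.

    Returns None when no node is healthy so the caller can queue or reject.
    """
    for node in nodes:
        if is_healthy(node):
            return node
    return None


if __name__ == "__main__":
    down = {"M1"}  # simulate the primary going offline
    print(pick_node(lambda name: name not in down))  # M2
```

&lt;p&gt;In a setup like this, the observed failover time is bounded mostly by the probe interval and the probe timeout, neither of which the post specifies.&lt;/p&gt;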

&lt;h3&gt;
  
  
  Tier 2 — CPU/RAM (orchestration, memory)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;M3 "Le Cerveau"&lt;/strong&gt; — high-RAM CPU node. PostgreSQL + Redis + Pinecone. Runs the orchestrator, the 3-quorum consensus engine (M1+M2+M3), and the analytics/monitoring agents.&lt;/li&gt;
&lt;/ul&gt;
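
&lt;p&gt;The agreement rule behind the 3-quorum engine is not spelled out in the post. A plain majority vote across the three nodes is one plausible reading; the sketch below assumes exactly that. The function name &lt;code&gt;quorum_decision&lt;/code&gt;, the default quorum of 2, and the example answers are hypothetical.&lt;/p&gt;

```python
from collections import Counter
from typing import Mapping, Optional


def quorum_decision(votes: Mapping[str, str], quorum: int = 2) -> Optional[str]:
    """Return the answer backed by at least `quorum` voters, else None.

    With three voters and quorum=2, at most one answer can reach the quorum,
    so ties cannot produce an ambiguous winner.
    """
    if not votes:
        return None
    answer, count = Counter(votes.values()).most_common(1)[0]
    return answer if count >= quorum else None


if __name__ == "__main__":
    # Made-up answers from the three voting nodes.
    print(quorum_decision({"M1": "approve", "M2": "approve", "M3": "reject"}))  # approve
    print(quorum_decision({"M1": "approve", "M2": "reject", "M3": "defer"}))  # None
```

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; on disagreement leaves the fallback policy (retry, escalate, or abstain) to the caller, which keeps the voting rule itself easy to test.&lt;/p&gt;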

&lt;h3&gt;
  
  
  Tier 3 — production / work
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;M4 "Bridge Windows"&lt;/strong&gt; — Windows 11, 2 GPUs, runs the live trading bot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;M5 "Interface Relay"&lt;/strong&gt; — Linux i5-6500, 15 GB RAM. Dev interface, 15+ MCP servers, Claude Code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;M6 "Mobile Ops"&lt;/strong&gt; — laptop. SSH + VPN. Client demos and on-site ops.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The 9 layers I added on top of Ubuntu
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;L9 — Voice / conversational (Whisper CUDA STT, Piper TTS, wake word, 50+ languages)&lt;/li&gt;
&lt;li&gt;L8 — Multi-agent orchestration (MCP-native, consensus engine)&lt;/li&gt;
&lt;li&gt;L7 — Trading consensus engine (multi-model voting GPT/Gemini/Claude)&lt;/li&gt;
&lt;li&gt;L6 — Browser + web automation (Chrome DevTools Protocol)&lt;/li&gt;
&lt;li&gt;L5 — MCP tool registry (88+ handlers)&lt;/li&gt;
&lt;li&gt;L4 — GPU cluster management (Docker Swarm, failover &amp;lt;2s)&lt;/li&gt;
&lt;li&gt;L3 — Domino pipeline engine (835 chains)&lt;/li&gt;
&lt;li&gt;L2 — systemd service layer (98 units)&lt;/li&gt;
&lt;li&gt;L1 — Linux boot integration (GRUB hooks, ZRAM, kernel params)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Concurrent agents&lt;/td&gt;
&lt;td&gt;1,000+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P95 latency (cluster internal)&lt;/td&gt;
&lt;td&gt;18 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Voice pipeline end-to-end&lt;/td&gt;
&lt;td&gt;&amp;lt;300 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggregate throughput&lt;/td&gt;
&lt;td&gt;67 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Python lines&lt;/td&gt;
&lt;td&gt;280,741&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Public repos&lt;/td&gt;
&lt;td&gt;44 (all MIT)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
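
&lt;p&gt;For readers who want to reproduce a P95 figure on their own cluster, here is a minimal sketch of the nearest-rank percentile. The post does not say which percentile method or measurement window was used, so the method, the function name &lt;code&gt;percentile_nearest_rank&lt;/code&gt;, and the sample latencies are all assumptions.&lt;/p&gt;

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Return the nearest-rank percentile of `samples` (pct between 0 and 100)."""
    if not samples:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples)
    # Nearest-rank: the ceil(pct/100 * n)-th smallest value, clamped to [1, n].
    rank = min(len(ordered), max(1, math.ceil(pct * len(ordered) / 100)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Synthetic latencies in milliseconds, invented for this example.
    latencies_ms = [12, 14, 15, 15, 16, 16, 17, 17, 17, 18,
                    18, 18, 18, 18, 18, 18, 18, 18, 18, 40]
    print(percentile_nearest_rank(latencies_ms, 95))  # 18
```

&lt;p&gt;The single 40 ms outlier in the synthetic data does not move the P95, which is why a P95 should be read alongside a P99 or a maximum when tail latency matters.&lt;/p&gt;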

&lt;h2&gt;
  
  
  Cost comparison (1M tokens/day, team of 10)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;€/month&lt;/th&gt;
&lt;th&gt;P95 latency&lt;/th&gt;
&lt;th&gt;Concurrent agents&lt;/th&gt;
&lt;th&gt;Data residency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Azure OpenAI&lt;/td&gt;
&lt;td&gt;1,500&lt;/td&gt;
&lt;td&gt;800 ms-3 s&lt;/td&gt;
&lt;td&gt;~20&lt;/td&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS Bedrock&lt;/td&gt;
&lt;td&gt;1,800&lt;/td&gt;
&lt;td&gt;700 ms-2.5 s&lt;/td&gt;
&lt;td&gt;~15&lt;/td&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral Cloud&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;td&gt;400-800 ms&lt;/td&gt;
&lt;td&gt;~30&lt;/td&gt;
&lt;td&gt;EU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JARVIS OS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1,000+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Air-gapped&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a 50K€ turn-key deployment, &lt;strong&gt;break-even vs Azure is 7 months&lt;/strong&gt;, and the marginal cost is zero after that.&lt;/p&gt;
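
&lt;p&gt;The arithmetic behind the 7-month figure is not shown in the post. A minimal sketch, assuming break-even is simply the upfront cost divided by the monthly cloud bill, with electricity, maintenance, and depreciation ignored: using the roughly €280,000 paid over 3 years quoted at the top of the post, a 50K€ deployment pays back in about 6.4 months, which rounds up to 7. Using the €1,500/month Azure row of the table instead, it takes about 33 months. Which baseline the 7-month claim rests on is not stated, so both are computed.&lt;/p&gt;

```python
import math


def break_even_months(upfront_eur: float, monthly_cloud_eur: float) -> float:
    """Months until cumulative cloud spend equals the upfront cost.

    Simplifying assumption: on-prem running costs are treated as zero.
    """
    if not monthly_cloud_eur > 0:
        raise ValueError("monthly cloud spend must be positive")
    return upfront_eur / monthly_cloud_eur


if __name__ == "__main__":
    deployment = 50_000        # EUR, turn-key deployment price from the post
    historical = 280_000 / 36  # EUR/month, the 3-year Azure + OpenAI spend
    table_rate = 1_500         # EUR/month, Azure OpenAI row of the table
    print(math.ceil(break_even_months(deployment, historical)))  # 7
    print(math.ceil(break_even_months(deployment, table_rate)))  # 34
```

&lt;p&gt;The payback period is very sensitive to the cloud baseline, so it is worth plugging in your own monthly bill before comparing.&lt;/p&gt;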

&lt;h2&gt;
  
  
  What I sell now
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JARVIS OS turn-key&lt;/strong&gt; — 20K€ to 250K€ depending on scope.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;62 PDF training courses&lt;/strong&gt; — from €39, 293 hours of content based on production code (+48 private).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI infra audit&lt;/strong&gt; — €1,500, report in 48h.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1-to-1 mentorship&lt;/strong&gt; — €250/h.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fractional CTO&lt;/strong&gt; — day rate (TJM) €1,000-1,150, or permanent contract (CDI) at €85-95K. Toulouse / remote.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Honest weaknesses
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Consensus voting&lt;/strong&gt; is empirical. No formal verification of the agreement function.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tier-2 failure&lt;/strong&gt; (M3 down) is the weakest scenario — orchestrator dies, cluster keeps inferring but loses persistent memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP protocol bet&lt;/strong&gt; — if Anthropic deprecates parts of MCP, I have 88 handlers to refactor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;kWh-per-token efficiency&lt;/strong&gt; — cloud probably wins on aggregate energy per token, on-prem wins on marginal cost.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Site: &lt;a href="https://jarvis-delmas.netlify.app" rel="noopener noreferrer"&gt;https://jarvis-delmas.netlify.app&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Code: &lt;a href="https://github.com/Turbo31150" rel="noopener noreferrer"&gt;https://github.com/Turbo31150&lt;/a&gt; (44 MIT repos)&lt;/li&gt;
&lt;li&gt;Contact: &lt;a href="mailto:miningexpert31@gmail.com"&gt;miningexpert31@gmail.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're running anything similar, at home or for a client, I'd love to compare notes.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>infrastructure</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
