TL;DR — 6 machines, 12 GPUs, 1,000+ concurrent agents, P95 18 ms, voice <300 ms, 280,741 lines of Python, 44 MIT repos. Vs Azure OpenAI: 7-month break-even on a 50K€ deployment.
Why I built this
I'm Franck. Toulouse, France. Over 3 years I paid roughly €280,000 to Azure + OpenAI before doing the math properly:
- Latency: 1.2s voice round-trip — incompatible with the voice-first UX I wanted.
- Compliance: customer data on US servers. Not GDPR-native, just GDPR-compliant-on-paper.
- Quotas: random throttling at the worst times.
- Lock-in: Azure outage = my product offline.
I decided to rebuild everything on-prem. This is the result.
The cluster
6 machines, 3 tiers, 12 GPUs total, <5ms inter-node latency.
Tier 1 — GPU compute (heavy inference)
- M1 "La Créatrice" — Ryzen 5700X3D, 6× RTX 3080+, 46 GB RAM. Primary LLM node, runs qwen3.5-9b, qwen3.5-35b-a3b, deepseek-r1, the Claude 4.5/4.6 distillations, and the Whisper CUDA pipeline.
- M2 "Le Forge" — multi-GPU NVIDIA, secondary inference, failover from M1 in 1.3s.
Tier 2 — CPU/RAM (orchestration, memory)
- M3 "Le Cerveau" — high-RAM CPU node. PostgreSQL + Redis + Pinecone. Runs the orchestrator, the 3-quorum consensus engine (M1+M2+M3), and the analytics/monitoring agents.
Tier 3 — production / work
- M4 "Bridge Windows" — Windows 11, 2 GPUs, trading bot live.
- M5 "Interface Relay" — Linux i5-6500, 15 GB RAM. Dev interface, 15+ MCP servers, Claude Code.
- M6 "Mobile Ops" — laptop. SSH + VPN. Client demos and on-site ops.
The 9 layers I added on top of Ubuntu
- L9 — Vocal / conversational (Whisper CUDA STT, Piper TTS, wake word, 50+ languages)
- L8 — Multi-agent orchestration (MCP-native, consensus engine)
- L7 — Trading consensus engine (multi-model voting GPT/Gemini/Claude)
- L6 — Browser + web automation (Chrome DevTools Protocol)
- L5 — MCP tool registry (88+ handlers)
- L4 — GPU cluster management (Docker Swarm, failover <2s)
- L3 — Domino pipeline engine (835 chains)
- L2 — systemd service layer (98 units)
- L1 — Linux boot integration (GRUB hooks, ZRAM, kernel params)
Real numbers
| Metric | Value |
|---|---|
| Concurrent agents | 1,000+ |
| P95 latency (cluster internal) | 18 ms |
| Voice pipeline end-to-end | <300 ms |
| Aggregate throughput | 67 tok/s |
| Python lines | 280,741 |
| Public repos | 44 (all MIT) |
Cost comparison (1M tokens/day, team of 10)
| Provider | €/month | P95 | Concurrent agents | Data residency |
|---|---|---|---|---|
| Azure OpenAI | 1,500 | 800ms-3s | ~20 | US |
| AWS Bedrock | 1,800 | 700ms-2.5s | ~15 | US |
| Mistral Cloud | 800 | 400-800ms | ~30 | EU |
| JARVIS OS | 0 | 18 ms | 1,000+ | Air-gapped |
For a 50K€ turn-key deployment, break-even vs Azure is 7 months, and the marginal cost is zero after that.
What I sell now
- JARVIS OS turn-key — 20K€ to 250K€ depending on scope.
- 62 PDF trainings — from €39, 293h of content based on production code (+48 private).
- IA infra audit — €1,500, report in 48h.
- 1-to-1 mentorship — €250/h.
- Fractional CTO — TJM €1,000-1,150 / CDI €85-95K. Toulouse / remote.
Honest weaknesses
- Consensus voting is empirical. No formal verification of the agreement function.
- Tier-2 failure (M3 down) is the weakest scenario — orchestrator dies, cluster keeps inferring but loses persistent memory.
- MCP protocol bet — if Anthropic deprecates parts of MCP, I have 88 handlers to refactor.
- kWh-per-token efficiency — cloud probably wins on aggregate watts/token, on-prem wins on marginal cost.
Links
- Site: https://jarvis-delmas.netlify.app
- Code: https://github.com/Turbo31150 (44 MIT repos)
- Contact: miningexpert31@gmail.comIf you're running anything similar — at home or for a client — I'd love to compare notes.
Top comments (0)