porokka
I Built a Local AI Operating System Over Easter Weekend

RTX 3090, WSL2, Ollama, ReAct agent loop, 20 skills. The hard part wasn't the AI.


I've been building information management systems for 30 years. Big data was the same problem as before with different hype. Data vault too. Most of these "revolutions" are just repackaging — the underlying problem of getting data from A to B, storing it, and querying it hasn't changed much.

So I was skeptical about local AI assistants. Tried a few. They were basically chatbots with a microphone. Not interesting.

What changed my mind was the ReAct pattern — the idea of a loop that can actually use tools, not just generate text. That's a different thing. So over Easter weekend I sat down and built JARVIS OS. It's been running for a week, has 10 GitHub pulls already, and I'm still adding to it.

The name is intentional. J.A.R.V.I.S — Just A Rather Very Intelligent System — was Tony Stark's AI, voiced by Paul Bettany. In the films he eventually became Vision, and F.R.I.D.A.Y took over as Stark's assistant. The system supports all four personalities. If you're going to spend a weekend building a personal AI, do the lore properly.


Hardware

Windows 11 gaming PC, RTX 3090 24GB, WSL2 running Ubuntu. Ollama accesses the 3090 via CUDA passthrough — no dual boot, no VM overhead. Windows handles audio routing and home automation network calls; the Linux layer runs all inference and Python services.

GPU        NVIDIA RTX 3090 — 24GB VRAM
OS         Windows 11 + WSL2 (Ubuntu)
RAM        64GB
Storage    M.2 512GB + M.2 256GB + 2.5" SSD 512GB

Models

Model choice comes down to both VRAM and architecture.

Model                VRAM   Architecture      Role
qwen3:8b             ~5GB   dense             fast queries, always loaded
qwen3:30b-a3b       ~18GB   MoE, 3B active    reasoning, tool use
qwen3-coder:30b     ~18GB   MoE, 3B active    code tasks
Orpheus 3B TTS       ~3GB   dense             speech synthesis, always loaded
FLUX via ComfyUI    ~12GB   diffusion         image generation (FLUX capacitor)
Whisper STT          ~1GB   encoder-only      speech recognition, always loaded

The -a3b suffix on the Qwen3 30B models means roughly 3B parameters activate per forward pass — Mixture of Experts architecture. The full 30B knowledge is in the weights but you're only paying compute for a fraction of it. That's how 30B fits in 18GB and still runs fast enough to be usable.

Total if everything loaded simultaneously: ~57GB. The 3090 has 24GB. So nothing is loaded all at once — Whisper, qwen3:8b, and Orpheus stay resident, everything else swaps in on demand. The 30B LLMs and FLUX can't coexist. When image generation comes in, the skill unloads the active LLM, runs FLUX, reloads. Takes 15-20 seconds. JARVIS acknowledges immediately so it doesn't feel frozen.
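A minimal sketch of that swap policy, using the rough VRAM figures from the table above. The function and variable names are mine, and the eviction logic is illustrative, not the real implementation:

```python
# Hypothetical VRAM planner illustrating the swap policy: resident models
# (Whisper, qwen3:8b, Orpheus) never unload; on-demand models are evicted
# until the incoming model fits the 24GB budget. Sizes in GB, from the table.
RESIDENT = {"whisper": 1, "qwen3:8b": 5, "orpheus-tts": 3}
ON_DEMAND = {"qwen3:30b-a3b": 18, "qwen3-coder:30b": 18, "flux": 12}
BUDGET_GB = 24

def models_to_unload(loaded: set[str], incoming: str) -> list[str]:
    """Return which on-demand models must be evicted before `incoming` fits."""
    sizes = {**RESIDENT, **ON_DEMAND}
    used = sum(sizes[m] for m in loaded)
    evict = []
    for m in sorted(loaded):
        if m in RESIDENT:
            continue  # the always-loaded trio never swaps out
        if used + sizes[incoming] <= BUDGET_GB:
            break
        evict.append(m)
        used -= sizes[m]
    return evict

# FLUX arriving while a 30B model is loaded forces the unload described above
print(models_to_unload({"whisper", "qwen3:8b", "orpheus-tts", "qwen3:30b-a3b"},
                       "flux"))   # → ['qwen3:30b-a3b']
```

In Ollama itself the unload can be triggered by sending a request with `keep_alive: 0`, which tells the server to release the model immediately.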

More GPUs incoming

Currently waiting to scavenge my kids' PCs for a spare RTX 3090 and RTX 2080. Plan is dedicated GPUs per model slot — no more swapping. I already know how to wire risers and manage multi-GPU from building a crypto mining rig, so the hardware side is straightforward. The interesting part will be extending the skill system to route inference across multiple Ollama instances by GPU.

Audio chain

TTS output routes through the Denon AVR-X4100W center channel. In 5.1 mixing the center channel carries all dialogue — it's why voices in films feel anchored in the room. Same principle here. Orpheus 3B through a calibrated 5.1 system sounds completely different from a desktop speaker.

The Denon is also controlled by JARVIS via HTTP API. Input switching, volume, surround modes. "Hey JARVIS, headphones mode" just works.
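For illustration, a sketch of that control path. The `/goform/formiPhoneAppDirect.xml` endpoint and the serial-style commands (`MV` for master volume, `SI` for source input) come from the community-documented Denon control protocol; the IP, port, command names, and "headphones mode" sequence here are assumptions that vary by model and setup:

```python
import urllib.request

# Assumed AVR address on the local network
AVR_HOST = "192.168.1.50"

def denon_url(command: str) -> str:
    """Build the control URL for a raw Denon command like 'SIGAME' or 'MV40'."""
    return f"http://{AVR_HOST}:8080/goform/formiPhoneAppDirect.xml?{command}"

def send(command: str) -> None:
    """Fire a single command at the receiver (no response parsing needed)."""
    urllib.request.urlopen(denon_url(command), timeout=2).read()

# "Hey JARVIS, headphones mode" as a command sequence: switch input, drop volume
HEADPHONES_MODE = ["SIAUX1", "MV30"]
```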


Architecture

JARVIS Architecture
Three layers: router, ReAct loop, skills.

Router

Not every query needs a 30B model. "Play Radio Nova" doesn't need AI at all — it needs a pattern match and a function call. The router handles this before anything reaches a model:

fast    → qwen3:8b          casual, quick
reason  → qwen3:30b-a3b     analysis, tool use
code    → qwen3-coder:30b   code tasks
deep    → qwen3:30b-a3b     strategy, long context
cloud   → Claude API        fallback for complex tasks
direct  → no model          simple commands, pattern match

A lot of what feels "AI" in voice assistants is actually just string matching. Keeping that explicit and separate from the model layer matters for latency and predictability.
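A minimal sketch of that split — direct pattern matches short-circuit before anything touches a model. The patterns, keyword heuristics, and handler names are illustrative, not the real router:

```python
import re

# Direct commands: pure string matching, zero model latency.
# Handler names like "radio.play" are hypothetical skill tool names.
DIRECT_COMMANDS = {
    r"^play radio (\w+)$": "radio.play",
    r"^(volume|vol) (up|down)$": "denon.volume",
}

def route(query: str) -> tuple[str, str]:
    """Return (tier, target). 'direct' bypasses the model layer entirely."""
    q = query.lower().strip()
    for pattern, handler in DIRECT_COMMANDS.items():
        if re.match(pattern, q):
            return ("direct", handler)
    if any(kw in q for kw in ("refactor", "function", "bug", "code")):
        return ("code", "qwen3-coder:30b")
    if len(q.split()) > 12 or "why" in q or "plan" in q:
        return ("reason", "qwen3:30b-a3b")
    return ("fast", "qwen3:8b")

print(route("Play Radio Nova"))   # → ('direct', 'radio.play')
```

The point is structural: the lookup table runs first, so a misfiring 30B model can never slow down or mangle a command that was never ambiguous in the first place.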

ReAct loop

This is the interesting part. react_server.py on port 7900 runs a Think → Tool → Observe loop, up to 5 iterations, until it has a final answer. The planner decides which tools are needed, calls them, reads the results, decides if it needs more.

What makes it feel responsive: when a multi-step command comes in, JARVIS immediately says "yes sir, working on it" — hardcoded, fires before any model inference starts. The AI doesn't need to confirm receipt of a task. That's a status update. Keep it deterministic, fire it instantly.

Command received
      ↓
"Yes sir, working on it."  ← hardcoded, immediate
      ↓
ReAct loop starts
  Think → which tools?
  Tool  → execute
  Observe → result
  Repeat ×5 max
      ↓
Final response via TTS

Getting this loop to feel natural — the right pause lengths, when to speak mid-task, when to stay quiet — took more tuning than the AI parts.
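The loop above can be skeletoned like this. The planner and tool registry are stand-ins; only the Think → Tool → Observe shape and the 5-iteration cap come from the description:

```python
from typing import Callable

MAX_ITERATIONS = 5  # hard cap from the loop described above

def react_loop(query: str, plan: Callable, tools: dict[str, Callable]) -> str:
    observations: list[str] = []
    for _ in range(MAX_ITERATIONS):
        # Think: the planner sees the query plus everything observed so far
        step = plan(query, observations)
        if step["action"] == "final":
            return step["answer"]
        # Tool → Observe: execute the chosen tool, feed the result back in
        result = tools[step["action"]](**step.get("args", {}))
        observations.append(f"{step['action']}: {result}")
    return "I ran out of steps before finishing that, sir."
```

In the real system `plan` would be an LLM call that emits the next action as structured output; here it is just any callable with that contract, which also makes the loop trivially testable.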

Skills

20 skills, 35+ tools. Self-contained Python modules in skills/. Drop a file in, restart, it's live. Each skill registers its own tools and keywords.
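A sketch of what that drop-in loading could look like, assuming each skill module exposes a `TOOLS` dict mapping tool names to callables — the attribute name and package layout are assumptions, not the real interface:

```python
import importlib
import pkgutil

def load_skills(package_name: str = "skills") -> dict[str, object]:
    """Import every module under the skills package and collect its tools."""
    registry: dict[str, object] = {}
    package = importlib.import_module(package_name)
    for mod_info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{mod_info.name}")
        # Each skill registers tools like {"radio.play": callable}
        for name, fn in getattr(module, "TOOLS", {}).items():
            registry[name] = fn
    return registry
```

The appeal of the convention is that "drop a file in, restart, it's live" needs no central registration: discovery is just a directory scan.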

The range:

- Home automation: Denon, Shield, LG TV, Panasonic UB9000, Philips Hue, Plex, internet radio
- Memory: MemPalace vector DB + Obsidian vault, 2000+ memories (MemPalace is a project by Milla Jovovich — yes, that one, Fifth Element and Resident Evil; apparently she codes too)
- AI image gen: FLUX via ComfyUI (the FLUX capacitor, obviously)
- Dev tools: git, shell, web search, network scan
- Comms: Twilio calls/SMS, email via SMTP/IMAP

The Twilio integration is more useful than expected. JARVIS answers calls, takes messages, sends me a summary. When I'm in flow I don't want to deal with the phone.

There are also smaller utility skills that don't sound impressive but get used constantly: KeePass password lookup, Windows application and source switching, clipboard management. "Switch to browser" or "open Spotify" without touching the keyboard.

The Denon night mode is a good example of a composed skill — "Hey JARVIS, night mode" switches the AVR input to headphones, drops volume to a comfortable level, and dims the Hue lights. Three devices, one command, no cloud.

Chaining

The more interesting capability is multi-tool chains that feel like a single intent. I used to have an Alexa routine called "set the mood" — it turned the Hue lights red and started Barry White on Spotify. Fine, but limited.

The same command through JARVIS now chains: Hue goes red, Barry White starts on Spotify via Shield, TV switches to the Shield input, and a fireplace video plays on the 4K screen. Five actions, one sentence, everything I own cooperating. Could add more — Denon surround mode, lock the front door, whatever. The chain is just a skill that calls other skills.
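In sketch form, a composed chain is just a function walking a list of other skills' tools from the registry — the tool names and arguments here are illustrative:

```python
def set_the_mood(registry: dict) -> list[str]:
    """One intent fanning out to several skills; returns the tools invoked."""
    steps = [
        ("hue.set_color", {"color": "red"}),
        ("spotify.play", {"query": "Barry White", "device": "shield"}),
        ("tv.set_input", {"input": "shield"}),
        ("plex.play", {"title": "fireplace 4k"}),
    ]
    done = []
    for tool, args in steps:
        registry[tool](**args)   # each entry is another skill's registered tool
        done.append(tool)
    return done
```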

That's the gap between a voice assistant and an operating system. Alexa executes routines you pre-configure through an app. JARVIS figures out the chain from the intent and the available tools.


The Hard Parts

Whisper hallucinating

Whisper doesn't know if it's hearing you or the room. Radio playing, TV on in the background — it will confidently transcribe music as speech. Two separate problems here.

Ambient audio — still being tuned. Silence detection thresholds and energy levels help, but it's not perfect. A 30B model running in the living room with Radio Nova on is a real environment, not a quiet headset setup.

JARVIS transcribing its own voice — this one turned out to be trivial. TTS output comes from the computer, so JARVIS already knows exactly what it said. When Whisper transcribes the mic input, just strip that text out. No audio level detection, no speaker diarization, no VAD heuristics — string match and remove. Done. Sometimes the obvious solution is the right one.
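The whole fix is a string replacement. A minimal version (the helper name is mine):

```python
def strip_self_echo(transcript: str, recent_tts: list[str]) -> str:
    """Remove anything JARVIS recently said from the mic transcript."""
    cleaned = transcript
    for said in recent_tts:
        cleaned = cleaned.replace(said, "")
    return " ".join(cleaned.split())  # collapse the leftover whitespace

print(strip_self_echo(
    "Yes sir, working on it. turn off the lights",
    ["Yes sir, working on it."],
))   # → 'turn off the lights'
```

In practice Whisper may paraphrase the echo slightly, so a fuzzier match (token overlap, edit distance) might be needed, but exact substring removal covers the common case.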

End of utterance detection

Wake word fires fine. But when do you know the command is finished? Someone says "Hey JARVIS" then pauses to think, then continues. Simple silence timeout doesn't work well. This is still being tuned.
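One heuristic worth sketching: extend the silence timeout when the partial transcript ends in a word that usually continues. The thresholds and word list below are illustrative, not the tuned values:

```python
# Words that suggest the speaker is mid-thought, so silence after them
# should not close the utterance yet. List and thresholds are illustrative.
CONTINUATION_WORDS = {"and", "then", "but", "um", "uh", "so", "to"}
BASE_TIMEOUT = 0.8      # seconds of silence before closing a normal utterance
EXTENDED_TIMEOUT = 2.0  # grace period after a continuation word

def utterance_finished(partial: str, silence_s: float) -> bool:
    words = partial.lower().split()
    mid_thought = bool(words) and words[-1] in CONTINUATION_WORDS
    timeout = EXTENDED_TIMEOUT if mid_thought else BASE_TIMEOUT
    return silence_s >= timeout
```

This doesn't solve the "pauses to think after the wake word" case on its own, but it cheaply distinguishes "turn off the lights" from "turn off the lights and…".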

Not everything should be AI

"Play Radio Nova" is a deterministic command. It maps to one function. Running it through a language model adds 2-3 seconds of latency and introduces the possibility of misinterpretation for zero benefit. Part of building this was being explicit about which commands need intelligence and which ones just need a lookup table.

The system is more reliable and faster because of this distinction, not despite it.


The Brain You Can Read

Every cloud AI assistant has memory in some form. You have no idea what's in it, why it remembers what it does, or what it got wrong. It's a black box by design.

JARVIS memory is an Obsidian vault. Markdown files on disk. You can open them, read them, edit them, delete a bad memory, add context manually. If JARVIS has a wrong assumption about a project you just fix the file. If you want it to know something specific you write it down. The "brain" is inspectable.

This changes the relationship with the system. With Alexa or any cloud assistant you're guessing what it knows. With JARVIS you can see it. I have notes on every active project, decisions made, technical context, preferences — all readable in Obsidian on any device, all available to JARVIS as context on every query.

MemPalace sits alongside this as the episodic layer — vector search over everything JARVIS has done and said, 2000+ memories so far. Obsidian is the structured knowledge you write deliberately. MemPalace is the experience log that accumulates automatically. Together they give JARVIS both long-term structured context and recall of specific past interactions.

The practical result: "what did I decide about the BullishBeat sell model?" returns the actual decision, with context, from whenever I made it. Not a hallucinated summary — a real note, retrieved by semantic search.
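A toy version of that retrieval, with bag-of-words vectors standing in for real embeddings so the ranking logic is visible — MemPalace's actual embedding model and API are not shown here:

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Stand-in embedding: word counts instead of a neural embedding."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recall(query: str, memories: list[str], k: int = 1) -> list[str]:
    """Return the k memories most similar to the query."""
    q = embed(query)
    return sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)[:k]

memories = [
    "decided to keep the BullishBeat sell model threshold at 0.7",
    "night mode dims the hue lights and drops denon volume",
]
# The BullishBeat note ranks first despite no exact phrase match
print(recall("what did I decide about the BullishBeat sell model?", memories))
```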


Remote Access: MCP Server

Built a remote MCP server running on the JARVIS machine. From my laptop, anywhere, I get access to the Obsidian vault and all code repos via Unifi Teleport.

Laptop (Claude / MCP client)
        ↓  Unifi Teleport
JARVIS machine
        ↓
  MCP server
  ├── Obsidian vault
  └── code repos

Unifi Teleport is worth mentioning specifically — if you already have a Unifi router it costs nothing and takes five minutes. No port forwarding, no VPN config, no dynamic DNS. Secure tunnel on demand.

Result: Claude on my laptop has the same context as home JARVIS. Same vault, same memory, same project state. The vault is the source of truth, the MCP server makes it accessible from anywhere.


What's Next

Coding console — a terminal embedded in the HUD where JARVIS can execute code and show output inline. The remote MCP setup makes this interesting because it could work from the laptop too, against the home machine's repos.

Self-building skills — this is the more interesting one. The infrastructure for it already exists: MemPalace records everything JARVIS does, Obsidian holds structured knowledge about workflows and projects. The missing piece is a feedback loop. When JARVIS notices a recurring task pattern in MemPalace, or reads a documented workflow in the vault, it drafts a new skill module and drops it in skills/. Next time that task comes in it's a deterministic handler, not an LLM call.

This is different from model self-learning — the weights don't change. But the system still learns. Persistent memory means JARVIS accumulates context and experience over time. Self-built skills mean its behavior changes based on that experience — new capabilities appear, recurring tasks get faster and more precise, patterns become tools. The combination of MemPalace, Obsidian, and a growing skill set is what makes this an evolving system rather than a static one. The vault already has the knowledge, the loop just needs to close.


What I'd Do Differently

AI-first taught me where AI isn't needed. I built this AI-first deliberately — I wanted to run LLMs and see what they could do. Through actually using it I discovered that some commands are faster and more precise without any model involved. "Play Radio Nova" doesn't need reasoning, it needs a function call. I wouldn't have known which commands those were without building it the other way first. The deterministic handlers came after, informed by real usage.

ComfyUI from day one. I tried to use the FLUX API directly first. It didn't work reliably for local setups. ComfyUI as the wrapper is better anyway — queue management, workflow saving, VRAM control.

Voice last. The AI parts were straightforward. Voice in a real environment with music and ambient audio is a harder problem. If I were doing it again, I'd get everything else solid first, then add voice.


Get It

github.com/porokka/jarvis-os

Windows 11 + WSL2 or native Linux. NVIDIA GPU required. Install scripts handle the rest — Ollama, model pulls, Next.js, MemPalace.


Sami Porokka — developer, Tallinn. Poro-IT OÜ
