<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: hitansu parichha</title>
    <description>The latest articles on DEV Community by hitansu parichha (@hitansu_parichha_6a28ea00).</description>
    <link>https://dev.to/hitansu_parichha_6a28ea00</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3928570%2F92d22f08-3cf2-42a4-b8c5-754a6cb43c5f.jpg</url>
      <title>DEV Community: hitansu parichha</title>
      <link>https://dev.to/hitansu_parichha_6a28ea00</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hitansu_parichha_6a28ea00"/>
    <language>en</language>
    <item>
      <title>I Built a Fully Local Iron Man J.A.R.V.I.S. on Gemma 4 — Auto Model Switching, Screen Vision, Wake Word, and 4-Tier Memory System</title>
      <dc:creator>hitansu parichha</dc:creator>
      <pubDate>Wed, 13 May 2026 06:46:08 +0000</pubDate>
      <link>https://dev.to/hitansu_parichha_6a28ea00/i-built-a-fully-local-iron-man-jarvis-on-gemma-4-auto-model-switching-screen-vision-wake-4ho4</link>
      <guid>https://dev.to/hitansu_parichha_6a28ea00/i-built-a-fully-local-iron-man-jarvis-on-gemma-4-auto-model-switching-screen-vision-wake-4ho4</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Write About Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>ai</category>
    </item>
    <item>
      <title>I Built a Fully Local Iron Man J.A.R.V.I.S. on Gemma 4 — Auto Model Switching, Screen Vision, Wake Word, and 4-Tier Memory</title>
      <dc:creator>hitansu parichha</dc:creator>
      <pubDate>Wed, 13 May 2026 06:44:59 +0000</pubDate>
      <link>https://dev.to/hitansu_parichha_6a28ea00/i-built-a-fully-local-iron-man-jarvis-on-gemma-4-auto-model-switching-screen-vision-wake-3j9f</link>
      <guid>https://dev.to/hitansu_parichha_6a28ea00/i-built-a-fully-local-iron-man-jarvis-on-gemma-4-auto-model-switching-screen-vision-wake-3j9f</guid>
      <description>&lt;h1&gt;
  
  
  I Built a Fully Local Iron Man J.A.R.V.I.S. on Gemma 4 — Auto Model Switching, Screen Vision, Wake Word, and 4-Tier Memory
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Good evening, Sir. All systems are online."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Six months ago I had a simple idea: &lt;strong&gt;stop renting intelligence from the cloud and build something that lives entirely on my MacBook.&lt;/strong&gt; Something that watches my screen, listens for my voice, manages my files, writes code, and remembers everything — without a single byte leaving the machine. A real J.A.R.V.I.S., not a chatbot wrapper.&lt;/p&gt;

&lt;p&gt;Today I'm sharing the result: &lt;strong&gt;Project J.A.R.V.I.S. v5.0&lt;/strong&gt; — a fully local AI operating system built on Gemma 4, running on a MacBook Pro M4 Pro (48 GB unified memory). No OpenAI API keys. No subscriptions. No data leaving the machine (except when I explicitly flip it to online mode). Five completed phases, 13 specialist agents, a four-tier memory system, live screen vision, wake word detection, and an autonomous complexity router that picks the right model for every single request.&lt;/p&gt;

&lt;p&gt;Let me show you exactly how it works — and more importantly, &lt;em&gt;why Gemma 4 made this possible when nothing else could.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Gemma 4? The Honest Answer
&lt;/h2&gt;

&lt;p&gt;Before I walk through the architecture, let me justify the model choice — because the challenge's judging criteria specifically ask for intentional model selection.&lt;/p&gt;

&lt;p&gt;I tried this project with other local models first. The problem was always the same: you either got a fast, small model that hallucinated too much on complex tasks, or you got a large model that worked well but took ~3 seconds to respond to "what time is it." Neither is acceptable for an always-on personal OS.&lt;/p&gt;

&lt;p&gt;Gemma 4 solved this with its model family structure:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;What it is&lt;/th&gt;
&lt;th&gt;RAM on M4 Pro&lt;/th&gt;
&lt;th&gt;Role in JARVIS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma4:e4b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4B effective params, MoE&lt;/td&gt;
&lt;td&gt;~10 GB&lt;/td&gt;
&lt;td&gt;Always-on backbone (never unloads)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma4:26b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;26B A4B MoE&lt;/td&gt;
&lt;td&gt;~18 GB&lt;/td&gt;
&lt;td&gt;Code specialist + Deep screen vision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qwen3.5:27b-q4_K_M&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;27B dense (pairing model)&lt;/td&gt;
&lt;td&gt;~16 GB&lt;/td&gt;
&lt;td&gt;Planner/Orchestrator/Researcher&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three critical things made Gemma 4 the only viable choice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Native multimodal in the same model.&lt;/strong&gt; &lt;code&gt;gemma4:26b&lt;/code&gt; handles both text and images natively. This means the screen vision agent and code specialist use the &lt;em&gt;exact same loaded model&lt;/em&gt; — zero extra RAM for vision capability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The E4B is genuinely good.&lt;/strong&gt; Most "small" models at 4B parameters are toys. Gemma 4 E4B (4 billion &lt;em&gt;effective&lt;/em&gt; parameters via MoE routing) handles routing, auditing, voice triage, passive screen watching, and memory distillation — five separate roles — fast enough that the user never feels latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The 128K context window.&lt;/strong&gt; My &lt;code&gt;JARVIS_CORE.md&lt;/code&gt; persona file is ~4,000 tokens. It gets prepended to every single agent prompt. With a 128K context, this is trivial. With older models, this would eat 8–16% of the context budget on every call.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: 10 Phases, 5 Complete
&lt;/h2&gt;

&lt;p&gt;The full system is designed as 10 phases. Here's where we are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase 1  ✅  Foundation: FastAPI gateway, complexity router, agent registry, dual-mode
Phase 2  ✅  Security: Sandbox executor, audit logs, path/network guards
Phase 3  ✅  Voice Engine: Whisper STT, Kokoro TTS, wake word, conversation loop
Phase 4  ✅  Memory: ChromaDB + Graphiti temporal graph + nightly distiller
Phase 5  ✅  Screen Vision: Passive watcher, deep analysis, proactive suggestions
Phase 6  🔨  Computer Control: PyAutoGUI, browser automation
Phase 7  🔨  Multi-Agent Teams: Parallel specialist delegation
Phase 8  🔨  MCP Skills Library: 8 MCP servers, 500+ tool integrations
Phase 9  🔨  Persona Engine: Emotional state, adaptive tone
Phase 10 🔨  Packaging: dmg installer, auto-update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let me walk through each completed phase in depth.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 1: The Complexity Router — The Brain Behind Model Switching
&lt;/h2&gt;

&lt;p&gt;This is the feature I'm most proud of. Every message that comes into JARVIS goes through the &lt;code&gt;ComplexityRouter&lt;/code&gt; first. It assigns a score from &lt;strong&gt;1 to 10&lt;/strong&gt; and routes to the appropriate model automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Score 1-4  → gemma4:e4b       (receptionist — always loaded)
Score 5-7  → qwen3.5:27b      (orchestrator — loaded on demand)
Score 8    → gemma4:26b       (code specialist — loaded on demand)
Score 9-10 → orchestrator + specialist delegation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's the actual scoring logic from &lt;code&gt;core_engine/router.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;classify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;msg_lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;word_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg_lower&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;candidate_scores&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="c1"&gt;# Rule 1: Very short / greeting → score 1-2
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;word_count&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg_lower&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_LIGHT_GREETINGS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;candidate_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Rule 2: Light-medium factual patterns → score 3-4
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg_lower&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_LIGHT_MEDIUM_PATTERNS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;candidate_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Rule 3: Medium planning/research/comms → score 5-6
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg_lower&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_MEDIUM_KEYWORDS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;candidate_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Rule 4: Code-related keywords → score 7-8
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg_lower&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;kw&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_CODE_KEYWORDS&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;candidate_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Rule 5: Very long messages → score 9
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;word_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;candidate_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Rule 6: Multi-domain "research AND implement" → score 9-10
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg_lower&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_VERY_COMPLEX_MULTI&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;candidate_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;candidate_scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is rule-based with LLM fallback planned for Phase 7. The key insight: &lt;strong&gt;rule-based routing is faster and more predictable than asking an LLM to route itself.&lt;/strong&gt; For an always-on system, latency on the routing decision itself matters.&lt;/p&gt;

&lt;h3&gt;
  
  
  The RAM Guard
&lt;/h3&gt;

&lt;p&gt;The single most important constraint in the system: &lt;code&gt;gemma4:26b&lt;/code&gt; (~18 GB) and &lt;code&gt;qwen3.5:27b-q4_K_M&lt;/code&gt; (~16 GB) cannot be loaded simultaneously — that's ~34 GB combined, leaving only ~14 GB for the OS on a 48 GB machine. The &lt;code&gt;ModeManager&lt;/code&gt; enforces this as a hard rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Large model RAM guard — these two must NEVER coexist
&lt;/span&gt;&lt;span class="n"&gt;_LARGE_MODEL_A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4:26b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;_LARGE_MODEL_B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen3.5:27b-q4_K_M&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before loading either model, the gateway checks which large model (if any) is currently resident and unloads it first. This makes model switching take ~2-3 seconds but prevents OOM crashes entirely.&lt;/p&gt;
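&lt;p&gt;A minimal sketch of that guard logic, with the loader and unloader as stand-ins for the real Ollama requests (the class shape here is my assumption, not the project's actual API):&lt;/p&gt;

```python
# Sketch of the RAM guard: at most one large model resident at a time.
# _load/_unload are placeholders for real Ollama load / keep_alive=0 calls.
_LARGE_MODELS = {"gemma4:26b", "qwen3.5:27b-q4_K_M"}

class ModeManagerSketch:
    def __init__(self):
        self.resident_large = None  # which large model is in RAM, if any

    def _unload(self, model):
        self.resident_large = None  # placeholder for the unload request

    def _load(self, model):
        self.resident_large = model  # placeholder for the load request

    def ensure_loaded(self, model):
        if model in _LARGE_MODELS:
            other = self.resident_large
            if other is not None and other != model:
                self._unload(other)  # evict the other large model first
            self._load(model)
        return model
```

&lt;p&gt;Calling &lt;code&gt;ensure_loaded&lt;/code&gt; for one large model evicts the other, which is where the ~2-3 second switching cost comes from.&lt;/p&gt;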

&lt;h3&gt;
  
  
  Offline/Online Dual Mode
&lt;/h3&gt;

&lt;p&gt;Every agent routes through &lt;code&gt;ModeManager&lt;/code&gt;, which abstracts the backend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OFFLINE mode:&lt;/strong&gt; Ollama at &lt;code&gt;localhost:11434&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ONLINE mode:&lt;/strong&gt; Vertex AI (Gemini 2.5 Pro/Flash/Flash-Lite) with &lt;strong&gt;automatic fallback to Ollama&lt;/strong&gt; on any Vertex failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The online model assignment mirrors the offline complexity tiers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Complexity 8+  → gemini-2.5-pro    (matches gemma4:26b tier)
Complexity 5-7 → gemini-2.5-flash  (matches qwen3.5:27b tier)
Complexity 1-4 → gemini-2.5-flash-lite (matches gemma4:e4b tier)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Privacy rule enforced in both modes:&lt;/strong&gt; &lt;code&gt;voice_triage&lt;/code&gt; always routes to local &lt;code&gt;gemma4:e4b&lt;/code&gt;, never to Vertex AI — even in online mode. Voice commands are private.&lt;/p&gt;
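&lt;p&gt;The dispatch rule, including the voice-triage privacy carve-out and the automatic fallback, can be sketched as follows (the function names are illustrative, not the actual &lt;code&gt;ModeManager&lt;/code&gt; API):&lt;/p&gt;

```python
# Sketch of dual-mode dispatch: online tiers mirror the offline bands,
# voice triage never leaves the machine, and any Vertex failure falls
# back to Ollama. Function names are illustrative assumptions.
def pick_online_model(score):
    if score in range(1, 5):
        return "gemini-2.5-flash-lite"   # gemma4:e4b tier
    if score in range(5, 8):
        return "gemini-2.5-flash"        # qwen3.5:27b tier
    return "gemini-2.5-pro"              # gemma4:26b tier

def dispatch(agent, score, online, call_vertex, call_ollama):
    # privacy rule: voice commands are always handled locally
    if agent == "voice_triage" or not online:
        return call_ollama(score)
    try:
        return call_vertex(pick_online_model(score))
    except Exception:
        return call_ollama(score)  # automatic fallback on Vertex failure
```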




&lt;h2&gt;
  
  
  Phase 2: The Security Sandbox
&lt;/h2&gt;

&lt;p&gt;Every agent action passes through &lt;code&gt;SecurityEnforcer&lt;/code&gt; before execution. This isn't optional middleware — it's enforced at the gateway level.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SecurityEnforcer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Central security orchestration layer.
    Coordinates PathGuard, NetworkGuard, and AuditManager.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The security stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PathGuard&lt;/strong&gt; — blocks access outside allowed directories; &lt;code&gt;~/&lt;/code&gt; and above requires explicit allowlisting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NetworkGuard&lt;/strong&gt; — allowlist of permitted domains; blocks all others including internal network calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AuditManager&lt;/strong&gt; — SHA-256 hash-chained audit log; every action is cryptographically linked to the previous entry. API keys are automatically redacted via regex before logging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PendingAction queue&lt;/strong&gt; — file deletions require &lt;em&gt;two separate confirmations&lt;/em&gt;, with a 5-minute expiry window. If the user doesn't confirm twice within 5 minutes, the action is cancelled.&lt;/li&gt;
&lt;/ul&gt;
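&lt;p&gt;The two-confirmation queue with its expiry window can be sketched like this (the class shape and return values are assumptions for illustration):&lt;/p&gt;

```python
import time
import uuid
from operator import lt

EXPIRY_SECONDS = 300  # 5-minute confirmation window

class PendingActionQueue:
    """Sketch: destructive actions need two confirmations before expiry."""
    def __init__(self):
        self._pending = {}

    def propose(self, description):
        action_id = str(uuid.uuid4())
        self._pending[action_id] = {"desc": description, "confirms": 0,
                                    "created": time.monotonic()}
        return action_id

    def confirm(self, action_id, now=None):
        entry = self._pending.pop(action_id, None)
        if entry is None:
            return "unknown"
        now = time.monotonic() if now is None else now
        if not lt(now - entry["created"], EXPIRY_SECONDS):
            return "expired"  # window elapsed: the action is cancelled
        entry["confirms"] += 1
        if entry["confirms"] == 2:
            return "execute"  # second confirmation: safe to run
        self._pending[action_id] = entry
        return "awaiting_second_confirmation"
```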

&lt;p&gt;The security policy lives in &lt;code&gt;sandbox/jarvis_security.yaml&lt;/code&gt; — a human-readable YAML file where you can add rules without touching Python. &lt;code&gt;sudo&lt;/code&gt; and admin commands are completely blocked at the policy level, not just the prompt level.&lt;/p&gt;




&lt;h2&gt;
  
  
  Phase 3: The Voice Engine — Wake Word to Spoken Response
&lt;/h2&gt;

&lt;p&gt;The voice pipeline is a full conversation loop, not a single-shot transcription:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Idle state
    ↓ (wake word detected: "Hey Jarvis")
Recording (VAD auto-stops on silence)
    ↓
Transcribing (Whisper large-v3)
    ↓
Processing (ComplexityRouter → Agent → Response)
    ↓
Speaking (Kokoro-82M TTS, audio streamed)
    ↓
Conversation mode (60-second window, no wake word needed for follow-ups)
    ↓ (farewell word OR 60s idle)
Idle state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The farewell word detection is multilingual — the system understands English, Hindi (Devanagari), and Hinglish out of the box:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;FAREWELL_WORDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;goodbye&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bye&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sleep&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stand by&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dismissed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# Hindi
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;अलविदा&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;सो जाओ&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;शुभ रात्रि&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# Hinglish
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alvida&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;so jao&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bas itna hi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;TTS model selection:&lt;/strong&gt; Kokoro-82M runs at ~15ms per sentence on the M4 Pro's MPS backend. Whisper large-v3 loads lazily on first voice command and stays resident — initial load ~3 seconds, subsequent calls ~200ms for a typical spoken sentence.&lt;/p&gt;

&lt;p&gt;The voice session manager uses asyncio throughout. The wake word detector runs in a background thread, but hands off to &lt;code&gt;asyncio.run_coroutine_threadsafe&lt;/code&gt; for everything downstream — so the voice pipeline and the FastAPI gateway share the same event loop cleanly.&lt;/p&gt;
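&lt;p&gt;That thread-to-loop handoff pattern, reduced to a runnable sketch (the handler here is a stand-in for the real transcribe/route/speak pipeline):&lt;/p&gt;

```python
import asyncio
import threading

async def handle_utterance(text):
    # stand-in for the real transcribe, route, respond, speak pipeline
    return "Processed: " + text

def wake_word_thread(loop, results):
    # runs in a plain thread; only *schedules* work on the asyncio loop
    future = asyncio.run_coroutine_threadsafe(handle_utterance("hey jarvis"), loop)
    results.append(future.result(timeout=5))  # blocks this thread only

async def main():
    loop = asyncio.get_running_loop()
    results = []
    t = threading.Thread(target=wake_word_thread, args=(loop, results))
    t.start()
    while t.is_alive():
        await asyncio.sleep(0.01)  # keep the loop free to run the coroutine
    return results[0]
```

&lt;p&gt;Note the polling loop instead of &lt;code&gt;t.join()&lt;/code&gt;: joining from inside the coroutine would block the event loop and deadlock the handoff.&lt;/p&gt;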




&lt;h2&gt;
  
  
  Phase 4: The Four-Tier Memory System
&lt;/h2&gt;

&lt;p&gt;This is what separates JARVIS from a standard chatbot. There are four memory tiers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;TIER 1 — PROCEDURAL  : JARVIS_CORE.md (persona, rules, user profile)
                       Injected first in every prompt. KV-cached by Ollama.
                       Cost after first request: ~0ms.

TIER 2 — EPISODIC    : memory_vault/logs/YYYY-MM-DD.log
                       Raw conversation log. Never injected directly.
                       Input for nightly distillation.

TIER 3 — SEMANTIC    : ChromaDB (vector similarity) + Graphiti (temporal graph)
                       Top 5 relevant facts injected silently into every prompt.

TIER 4 — COMPILED WIKI: memory_vault/wiki/
                       Synthesized Markdown knowledge base.
                       Built nightly from Tiers 2 and 3.
                       Human-readable and human-editable.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;GraphitiStore&lt;/code&gt; component uses &lt;strong&gt;bi-temporal modeling&lt;/strong&gt; — every fact has both a &lt;code&gt;valid_from&lt;/code&gt; and &lt;code&gt;valid_to&lt;/code&gt; timestamp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"User prefers Redux"   → valid_from: Jan 1 | valid_to: Mar 15 (superseded)
"User prefers Zustand" → valid_from: Mar 15 | valid_to: None  (CURRENT)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When JARVIS learns a new contradicting fact, it automatically closes the old one rather than stacking conflicting facts. This means memory gets &lt;em&gt;smarter and more accurate&lt;/em&gt; over time — old facts don't poison new queries.&lt;/p&gt;
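&lt;p&gt;A minimal in-memory sketch of that supersession rule (the real &lt;code&gt;GraphitiStore&lt;/code&gt; persists to a temporal graph; this toy version only shows the &lt;code&gt;valid_from&lt;/code&gt;/&lt;code&gt;valid_to&lt;/code&gt; bookkeeping):&lt;/p&gt;

```python
from datetime import datetime, timezone

class FactStoreSketch:
    """Toy bi-temporal store: a new fact closes the old one on the same subject."""
    def __init__(self):
        self.facts = []  # each: {"subject", "value", "valid_from", "valid_to"}

    def assert_fact(self, subject, value, when=None):
        when = when or datetime.now(timezone.utc)
        for fact in self.facts:
            # close any still-open fact on the same subject
            if fact["subject"] == subject and fact["valid_to"] is None:
                fact["valid_to"] = when
        self.facts.append({"subject": subject, "value": value,
                           "valid_from": when, "valid_to": None})

    def current(self, subject):
        for fact in self.facts:
            if fact["subject"] == subject and fact["valid_to"] is None:
                return fact["value"]
        return None
```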

&lt;p&gt;The nightly distillation job (run at 2 AM, when the system is idle, via APScheduler) reads the day's episode log, extracts durable facts, and:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Writes vector embeddings to ChromaDB&lt;/li&gt;
&lt;li&gt;Writes episodes to Graphiti with contradiction detection&lt;/li&gt;
&lt;li&gt;Updates &lt;code&gt;wiki/user_profile.md&lt;/code&gt; with the compiled view&lt;/li&gt;
&lt;/ol&gt;
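&lt;p&gt;Reduced to a sketch, the distillation step looks something like this (the &lt;code&gt;FACT:&lt;/code&gt; marker and the report shape are invented for illustration; in the real system gemma4:e4b does the extraction):&lt;/p&gt;

```python
def distill(day_log_lines):
    # stand-in extraction: the real distiller asks gemma4:e4b to pull
    # durable facts out of the raw episode log
    facts = []
    for line in day_log_lines:
        stripped = line.strip()
        if stripped.startswith("FACT:"):
            facts.append(stripped[len("FACT:"):].strip())
    report = {
        "embeddings_written": len(facts),  # step 1: ChromaDB vectors
        "episodes_written": len(facts),    # step 2: Graphiti episodes
        "wiki_updated": bool(facts),       # step 3: wiki/user_profile.md
    }
    return facts, report
```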

&lt;p&gt;Memory correction commands JARVIS understands naturally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;"Jarvis, forget that I use Redux."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Jarvis, what do you know about me?"&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Jarvis, do not learn from the next 10 minutes."&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Jarvis, show me my coding wiki."&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Phase 5: Screen Vision — Watching Without Being Asked
&lt;/h2&gt;

&lt;p&gt;The screen engine runs as a background thread, taking a screenshot every 2 seconds and running it through &lt;code&gt;gemma4:e4b&lt;/code&gt; passive analysis. If a suggestion is generated AND the cooldown period has elapsed (default 120 seconds), JARVIS speaks up.&lt;/p&gt;

&lt;p&gt;The passive watcher uses a &lt;strong&gt;two-tier vision model approach&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Passive (gemma4:e4b):&lt;/strong&gt; Always-on. Fast. Shared model — no extra RAM cost. Detects what app is open, what file is being edited, current context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep (gemma4:26b):&lt;/strong&gt; On-demand. Full multimodal analysis with the same model used for code. Only loaded when the situation requires deeper understanding (complex UI, code review, error diagnosis).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;ScreenVision&lt;/code&gt; component returns structured output for every capture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TypeScript file auth.ts, async function handleLogin at line 26&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app_detected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vscode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TypeScript file auth.ts line 26&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;suggestions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The handleLogin function is not handling the rejected promise...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;screenshot_b64&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Only populated in deep mode
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model_used&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma4:e4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;SuggestionEngine&lt;/code&gt; ranks suggestions by relevance and enforces the &lt;strong&gt;Proactive Suggestion Protocol&lt;/strong&gt; defined in &lt;code&gt;JARVIS_CORE.md&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maximum one suggestion every 3 minutes&lt;/li&gt;
&lt;li&gt;Always starts with &lt;em&gt;"Sorry to interrupt, Sir."&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Always ends with &lt;em&gt;"Shall I?"&lt;/em&gt; — never acts without confirmation&lt;/li&gt;
&lt;/ul&gt;
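&lt;p&gt;The cooldown-plus-phrasing contract can be sketched as follows (the timing value is taken from the protocol above; the class shape is an assumption):&lt;/p&gt;

```python
from operator import lt

COOLDOWN_SECONDS = 180  # "maximum one suggestion every 3 minutes"

class SuggestionEngineSketch:
    """Sketch: enforce the cooldown and the mandated opening/closing phrases."""
    def __init__(self):
        self.last_spoken_at = None

    def maybe_speak(self, suggestion, now):
        if self.last_spoken_at is not None:
            if lt(now - self.last_spoken_at, COOLDOWN_SECONDS):
                return None  # still inside the cooldown window
        self.last_spoken_at = now
        return "Sorry to interrupt, Sir. " + suggestion + " Shall I?"
```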




&lt;h2&gt;
  
  
  The JARVIS_CORE.md Persona File — The Secret Architecture Piece
&lt;/h2&gt;

&lt;p&gt;One piece of the system that isn't obvious from the directory structure: &lt;code&gt;JARVIS_CORE.md&lt;/code&gt; is not just a prompt file. It's the &lt;strong&gt;KV-cache anchor&lt;/strong&gt; for the entire system.&lt;/p&gt;

&lt;p&gt;When Ollama processes the first request with &lt;code&gt;JARVIS_CORE.md&lt;/code&gt; prepended, it caches the key-value attention vectors for those ~4,000 tokens. Every subsequent request that starts with the same &lt;code&gt;JARVIS_CORE.md&lt;/code&gt; prefix costs &lt;strong&gt;~0ms&lt;/strong&gt; for that portion — Ollama serves it from cache.&lt;/p&gt;

&lt;p&gt;This is why the file contains the user profile, personality definition, memory architecture, anti-patterns, response format taxonomy (12 response types), wit calibration levels (0-4), and operating rules — all in one place, all cached, all cost-free after the first request.&lt;/p&gt;
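&lt;p&gt;The prefix discipline that makes this work is simple but strict; a sketch (the helper name is mine):&lt;/p&gt;

```python
def build_prompt(persona_text, agent_instructions, user_message):
    # the persona prefix must be byte-identical on every request; any
    # change to it invalidates the cached KV state for the whole prefix
    return "\n\n".join([persona_text, agent_instructions, user_message])
```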

&lt;p&gt;The response taxonomy is worth highlighting. Every incoming message is classified into one of 12 types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TYPE 1  — FACTUAL_SIMPLE     TYPE 7  — CODE_DEBUG
TYPE 2  — FACTUAL_LIST       TYPE 8  — TASK_CONFIRM
TYPE 3  — OPINION_ANALYSIS   TYPE 9  — RESEARCH_SUMMARY
TYPE 4  — COMPARISON         TYPE 10 — PLAN_STRATEGY
TYPE 5  — CODE_WRITE         TYPE 11 — CASUAL_CHAT
TYPE 6  — CODE_EXPLAIN       TYPE 12 — SYSTEM_STATUS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each type has both a text format and a voice format. In voice mode, markdown characters are forbidden — the model is instructed to produce natural spoken transitions ("First... Second... And finally...") instead of bullet points and headers.&lt;/p&gt;
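&lt;p&gt;As a toy illustration of the voice rule (this is not the project's formatter), turning list items into spoken transitions looks something like:&lt;/p&gt;

```python
SPOKEN_TRANSITIONS = ["First", "Second", "Third", "Next", "Then"]

def to_voice(items):
    """Render list items as natural speech instead of markdown bullets."""
    spoken = []
    last = len(items) - 1
    for i, item in enumerate(items):
        word = SPOKEN_TRANSITIONS[min(i, len(SPOKEN_TRANSITIONS) - 1)]
        if i == last and last > 0:
            word = "And finally"  # close the run of transitions
        spoken.append(f"{word}... {item.strip('-* ')}")  # drop bullet chars
    return " ".join(spoken)
```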




&lt;h2&gt;
  
  
  The 13-Agent Registry
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent                  Model (Offline)          Always-on?   Notes
─────────────────────────────────────────────────────────────────────────────
Receptionist           gemma4:e4b               Yes          Router + simple chat
Manager/Planner        qwen3.5:27b-q4_K_M       On-demand    Plans, coordinates
Code Specialist        gemma4:26b               On-demand    Write/debug/refactor
Screen Vision Passive  gemma4:e4b               Yes          2s scan, shared model
Screen Vision Deep     gemma4:26b               On-demand    Full analysis + control
Browser/Shopping       qwen3.5:27b-q4_K_M       On-demand    Puppeteer MCP
Research               qwen3.5:27b-q4_K_M       On-demand    Brave Search + Firecrawl
Auditor/QA             gemma4:e4b               Yes          Reviews outputs
Memory Distiller       gemma4:e4b               Yes          Nightly 2 AM job
File Manager           qwen3.5:27b-q4_K_M       On-demand    Filesystem MCP
Voice Triage           gemma4:e4b               Yes          Always local; privacy-first
System Control         qwen3.5:27b-q4_K_M       On-demand    OS commands via sandbox
Communication          qwen3.5:27b-q4_K_M       On-demand    Gmail/Slack/Jira
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five agents share &lt;code&gt;gemma4:e4b&lt;/code&gt; and stay resident forever — that's the 10 GB base cost that never goes away. The other eight are on-demand, with the RAM guard preventing simultaneous loading of the two large models.&lt;/p&gt;
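&lt;p&gt;In sketch form the registry is just declarative data. The model names below follow the table; the field names are my own shorthand:&lt;/p&gt;

```python
# Illustrative slice of the 13-agent registry (field names hypothetical).
AGENTS = {
    "receptionist":   {"model": "gemma4:e4b", "always_on": True},
    "voice_triage":   {"model": "gemma4:e4b", "always_on": True},
    "auditor":        {"model": "gemma4:e4b", "always_on": True},
    "screen_passive": {"model": "gemma4:e4b", "always_on": True},
    "distiller":      {"model": "gemma4:e4b", "always_on": True},
    "code":           {"model": "gemma4:26b", "always_on": False},
    "planner":        {"model": "qwen3.5:27b-q4_K_M", "always_on": False},
}

def resident_models(registry):
    """The set of models that must stay loaded at all times."""
    return {a["model"] for a in registry.values() if a["always_on"]}
```

&lt;p&gt;Every always-on agent resolves to the same resident &lt;code&gt;gemma4:e4b&lt;/code&gt; instance, which is how five agents share one 10 GB model.&lt;/p&gt;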




&lt;h2&gt;
  
  
  The RAM Budget in Practice
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;State                         RAM Used    Free (of 48 GB)
────────────────────────────────────────────────────────
macOS baseline                ~19.6 GB    ~28.4 GB
+ gemma4:e4b (always-on)      ~29.6 GB    ~18.4 GB
+ Code task → load 26b        ~37.6 GB    ~10.4 GB  ✅ Safe
+ Planning → unload 26b,
  load qwen3.5:27b             ~35.6 GB    ~12.4 GB  ✅ Safe
⚠️ BLOCKED: both large models ~53.6 GB       N/A    ❌ Hard block
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The hard block isn't a warning — the gateway refuses to route to a model if loading it would violate the RAM constraint. The user gets a graceful degradation message and JARVIS falls back to &lt;code&gt;gemma4:e4b&lt;/code&gt; for the task.&lt;/p&gt;
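&lt;p&gt;A minimal sketch of that guard, assuming per-model footprint figures and a safety margin (both hypothetical here):&lt;/p&gt;

```python
FALLBACK_MODEL = "gemma4:e4b"  # always resident, always safe
HEADROOM_GB = 2.0              # hypothetical safety margin

def route(model, loaded, footprint_gb, free_gb):
    """Refuse to load a model that would blow the RAM budget; degrade
    gracefully to the resident e4b instead of crashing the machine."""
    if model in loaded:
        return model  # already resident, nothing to check
    if free_gb - footprint_gb[model] >= HEADROOM_GB:
        return model  # safe to load
    return FALLBACK_MODEL  # hard block: graceful degradation
```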




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Prerequisites: macOS with Apple Silicon (M1 or later), Ollama installed, Python 3.11+.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone and enter the project&lt;/span&gt;
git clone https://github.com/Hitansu2004/Jarvis
&lt;span class="nb"&gt;cd &lt;/span&gt;jarvis

&lt;span class="c"&gt;# Run setup (creates venv, installs deps, generates .env)&lt;/span&gt;
./setup.sh

&lt;span class="c"&gt;# Pull the Gemma 4 models&lt;/span&gt;
ollama pull gemma4:e4b
ollama pull gemma4:26b

&lt;span class="c"&gt;# Start the gateway&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
uvicorn core_engine.gateway:app &lt;span class="nt"&gt;--reload&lt;/span&gt;

&lt;span class="c"&gt;# Test it&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/chat &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"message": "Good evening, Jarvis. What can you do?"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To enable voice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install audio deps (macOS)&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;portaudio
pip &lt;span class="nb"&gt;install &lt;/span&gt;pyaudio &lt;span class="nt"&gt;--break-system-packages&lt;/span&gt;

&lt;span class="c"&gt;# Start with voice&lt;/span&gt;
&lt;span class="nv"&gt;VOICE_ENABLED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true &lt;/span&gt;uvicorn core_engine.gateway:app &lt;span class="nt"&gt;--reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check status and agent list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:8000/status
curl http://localhost:8000/agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;.env.example&lt;/code&gt; file documents every configurable parameter across all 10 phases. Start with the defaults, which are tuned for a 48 GB M4 Pro; the complexity thresholds and model assignments are all environment variables if your machine differs.&lt;/p&gt;
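&lt;p&gt;For example, a complexity-routing knob might be read like this (variable names are hypothetical; the real keys live in &lt;code&gt;.env.example&lt;/code&gt;):&lt;/p&gt;

```python
import os

# Hypothetical env keys, shown only to illustrate the pattern.
COMPLEXITY_THRESHOLD = float(os.getenv("JARVIS_COMPLEXITY_THRESHOLD", "0.6"))
HEAVY_MODEL = os.getenv("JARVIS_HEAVY_MODEL", "gemma4:26b")

def pick_model(complexity):
    """Simple turns stay on the resident e4b; heavy ones escalate."""
    if complexity >= COMPLEXITY_THRESHOLD:
        return HEAVY_MODEL
    return "gemma4:e4b"
```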




&lt;h2&gt;
  
  
  What Gemma 4 Unlocked That Nothing Else Could
&lt;/h2&gt;

&lt;p&gt;Let me be specific about this, because it's the core of the build:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemma 4 E4B as a shared multi-role model.&lt;/strong&gt; Running five separate agents on one always-loaded model is only possible because E4B is genuinely capable despite its size. Receptionist, Auditor, Voice Triage, Passive Screen Vision, and Memory Distiller all run on it. With every other 4B model I tried, the quality degraded below acceptable for at least two of those roles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemma 4 26B's native vision without a separate model.&lt;/strong&gt; Screen vision and code review on the same loaded model. This single fact saved ~10 GB of RAM (no separate vision model) and eliminated one entire model-switching operation from the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The MoE efficiency.&lt;/strong&gt; Both Gemma 4 models use Mixture-of-Experts. &lt;code&gt;gemma4:26b&lt;/code&gt; has 26B parameters but only 4B are active on any given token. This is why the RAM footprint (~18 GB) is dramatically lower than a 26B dense model would be (~52+ GB). Without MoE, this entire architecture is impossible on consumer hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 128K context window.&lt;/strong&gt; &lt;code&gt;JARVIS_CORE.md&lt;/code&gt; + memory retrieval + the actual conversation can comfortably fit in context. With a 4K or 8K context window, the persona file alone would crowd out the memory system.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next (Phases 6-10)
&lt;/h2&gt;

&lt;p&gt;Phase 6 brings &lt;strong&gt;computer control&lt;/strong&gt; — PyAutoGUI integration so JARVIS can click, type, and navigate on behalf of the user, with every action requiring confirmation through the security sandbox.&lt;/p&gt;

&lt;p&gt;Phase 7 activates &lt;strong&gt;multi-agent teams&lt;/strong&gt; — the orchestrator can spawn parallel specialist agents for complex tasks, with results synthesized back through an Auditor QA pass before delivery.&lt;/p&gt;

&lt;p&gt;Phase 8 wires up the &lt;strong&gt;MCP Skills Library&lt;/strong&gt; — 8 MCP servers are already registered in &lt;code&gt;skills_mcp/mcp_registry.json&lt;/code&gt;, including Filesystem, GitHub, Puppeteer, Brave Search, Composio (500+ app integrations), and PyAutoGUI. The registry exists now; Phase 8 activates the connections.&lt;/p&gt;

&lt;p&gt;Phase 10 is the goal: a &lt;code&gt;.dmg&lt;/code&gt; installer that anyone with an Apple Silicon Mac can download and run. No cloud dependencies. No subscriptions. A personal AI OS that's yours.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;There's a philosophical point underneath all the engineering: &lt;strong&gt;your AI assistant should not require a corporate server to function.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The voice triage agent — the part of JARVIS that hears your voice commands and decides what to do with them — is hardcoded to &lt;code&gt;gemma4:e4b&lt;/code&gt; regardless of the operation mode. Even if you've switched JARVIS to online mode for heavier tasks, voice never leaves the machine. This isn't a setting. It's enforced at the gateway level.&lt;/p&gt;
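&lt;p&gt;Sketched, that enforcement is a branch in the gateway's model selection rather than a configuration value (function names here are mine):&lt;/p&gt;

```python
VOICE_MODEL = "gemma4:e4b"  # pinned in code, not in .env

def select_model(request_kind, preferred):
    """Voice triage ignores the caller's preference and the operation
    mode entirely: audio-derived text never routes off the machine."""
    if request_kind == "voice_triage":
        return VOICE_MODEL
    return preferred
```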

&lt;p&gt;Every conversation logs to a local file. The nightly distillation runs on your own CPU. The temporal knowledge graph lives in &lt;code&gt;memory_vault/kuzu_db/&lt;/code&gt; on your filesystem. Your AI gets smarter over time, and none of that learning ever touches a cloud database.&lt;/p&gt;

&lt;p&gt;Gemma 4 made this possible. A capable, efficient, multimodal open model family that runs well on consumer hardware is the technical prerequisite for this entire architecture. The E4B model being genuinely useful is what allows five always-on agents without breaking the RAM budget. The 26B model's native vision support is what makes screen understanding practical. The MoE efficiency is what makes the math work on 48 GB.&lt;/p&gt;

&lt;p&gt;If you want to build something similar, the repo is linked below. The &lt;code&gt;.env.example&lt;/code&gt; file is extensively documented. The test suite (50+ tests across all five phases) serves as the best architecture documentation.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with Python 3.11, FastAPI, Ollama, Gemma 4, PyTorch 2.6 MPS, ChromaDB, Graphiti, Kuzu, Whisper large-v3, Kokoro-82M, and a lot of late-night IST sessions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Author: Hitansu Parichha | Software Engineer at Nisum Technologies&lt;/em&gt;&lt;/p&gt;

</description>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>ai</category>
      <category>devchallenge</category>
    </item>
  </channel>
</rss>
