This is a submission for the Gemma 4 Challenge: Build with Gemma 4
It started with Tony Stark.
Specifically, the scene where JARVIS quietly says "Sir, the reactor shows signs of deterioration" while Stark is already three steps ahead. No "How can I help you today?" No waiting. Just a system that knows you, watches for what matters, and surfaces it at the right moment.
I'm the one who built this JARVIS. Three years, four rewrites, one cat suggestion.
What I Built
Jo is a personal AI agent that runs 24/7 on my home server, remembers everything across conversations, uses a browser, and — most importantly — teaches itself from real interactions.
She is not a chatbot wrapper. She has a hierarchical long-term memory graph, a subconscious that retrieves context before she even responds, a browser pipeline with vision, and a self-improving LoRA training loop that fine-tunes her behavior from actual conversations.
The project started three years ago on a dual-core CPU with 8GB RAM. I sold the motherboard for $33 and built a Xeon server from second-hand parts. Added a Raspberry Pi 2B as a voice terminal. Fought with STT/TTS for a year. Tried DeepSeek on CPU — it responded in Chinese after two-minute waits. My second son was born and the project went dormant. Came back. Got a GPU. Rewrote everything again.
Ten to fifteen memory architectures later, with classical RAG abandoned and a tag-based dispatch system built from scratch, Gemma 4 was released. For the first time, the ceiling disappeared.
Demo
Full walkthrough video coming — live conversation, memory graph, training dashboard, nvidia-smi. For now: the moment that started it all.
How I Used Gemma 4
The model is google/gemma-4-26B-A4B-it. "A4B" means Active 4 Billion — 26B total parameters, 4B active per forward pass (Mixture-of-Experts).
That one architectural fact is why this model and not another.
Jo makes four LLM calls for every single user message: subconscious memory search, subconscious synthesis, consciousness response, memory extraction. The model also needs to handle screenshots (multimodal), fit in 24GB VRAM for inference, and respond fast enough that conversation doesn't feel like a build pipeline.
Every model before Gemma 4 forced a choice: fast or good. Small enough to fit or smart enough to hold a personality. Quick enough for four calls or deep enough to reason.
Gemma 4 26B A4B was the first time that choice disappeared.
800–1200 ms per response on a single RTX 3090. That's a conversation, not a wait. Beyond speed, it follows the intent of a prompt, not just the letter: subtle rules in complex system prompts actually hold. After two years of trying alternatives, that is an observation, not marketing.
Three separate fine-tuned adapters on the same base model:
                User message
                      │
                      ▼
┌─────────────────────────────────────────────────────┐
│  SUBCONSCIOUS (Gemma 4 26B A4B + LoRA)              │
│  Runs first. Searches memory, surfaces context      │
│  before Jo even "thinks".                           │
└─────────────────────┬───────────────────────────────┘
                      │ context injected into prompt
                      ▼
┌─────────────────────────────────────────────────────┐
│  CONSCIOUSNESS (Gemma 4 26B A4B + LoRA)             │
│  Jo herself. Reasons with <think>, responds         │
│  via structured action tags.                        │
└─────────────────────┬───────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────┐
│  PREPROCESSOR (Gemma 4 26B A4B + LoRA)              │
│  Compresses long conversation history.              │
└─────────────────────────────────────────────────────┘
Each adapter is trained on different data via QLoRA 4-bit with unsloth. Splitting them improved both: the subconscious specializes in memory query patterns, the consciousness specializes in conversational quality.
Why not 31B Dense? Tried it. Doesn't fit inference + training on one RTX 3090. A4B's MoE architecture activates only 4B parameters per token — gradient computation is proportionally lighter, making single-GPU QLoRA realistic.
Why not a smaller model? Jo's personality collapses. Generic assistant responses, no held context across complex multi-step interactions, no reliable structured output.
Code
The project is a private system running on live personal data, so there is no public repo. The key architectural pieces are below.
The self-teaching loop:
Every conversation is logged. A collector reconstructs training pairs from SQLite (conscious_messages, subconscious_messages). A data generator adds synthetic edge cases. The trainer runs as an isolated Docker container:
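# trainer/train.py
import os

from unsloth import FastModel

ENTITY = os.environ["ENTITY"]  # "consciousness" | "subconscious" | "preprocessor"
BASE_MODEL = os.getenv("BASE_MODEL", "google/gemma-4-26B-A4B-it")
# LORA_RANK and LORA_ALPHA are set elsewhere in the trainer config (values not shown here).

# Load the base model in 4-bit, capping GPU 0 at 21 GiB and allowing up to 28 GiB of CPU offload.
model, tokenizer = FastModel.from_pretrained(
    model_name=BASE_MODEL,
    max_seq_length=2048,
    load_in_4bit=True,
    device_map="auto",
    max_memory={0: "21GiB", "cpu": "28GiB"},
)

# Attach LoRA adapters to the attention and MLP projections.
model = FastModel.get_peft_model(
    model,
    r=LORA_RANK,
    lora_alpha=LORA_ALPHA,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)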
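Note: gate_proj must be a LoRA target. In most Hugging Face implementations it is the gating projection inside each feed-forward block (in an MoE model, inside each expert), while expert routing lives in a separate router module. Skipping it noticeably hurt tag accuracy in early experiments.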
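The collector step mentioned before the training script can be pictured roughly like this. It is a minimal sketch, not the real collector: the table names conscious_messages and subconscious_messages come from the text, but the column names (id, conversation_id, role, content), the role values, and the helpers collect_pairs and demo_connection are assumptions made purely for illustration.

```python
import sqlite3

# Table names come from the article; the schema and helpers are hypothetical.
ALLOWED_TABLES = {"conscious_messages", "subconscious_messages"}


def collect_pairs(conn: sqlite3.Connection, table: str = "conscious_messages") -> list[dict]:
    """Pair each user message with the next assistant reply in the same conversation."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table: {table}")
    rows = conn.execute(
        f"SELECT conversation_id, role, content FROM {table} "
        "ORDER BY conversation_id, id"
    ).fetchall()

    pairs: list[dict] = []
    pending = None  # (conversation_id, user_text) still waiting for a reply
    for conv_id, role, content in rows:
        if role == "user":
            pending = (conv_id, content)
        elif role == "assistant" and pending is not None and pending[0] == conv_id:
            pairs.append({"prompt": pending[1], "completion": content})
            pending = None
    return pairs


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway in-memory database with the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE conscious_messages ("
        "id INTEGER PRIMARY KEY, conversation_id TEXT, role TEXT, content TEXT)"
    )
    conn.executemany(
        "INSERT INTO conscious_messages (conversation_id, role, content) VALUES (?, ?, ?)",
        [
            ("c1", "user", "hi"),
            ("c1", "assistant", "hello"),
            ("c1", "user", "never answered"),
            ("c2", "user", "weather?"),
            ("c2", "assistant", "sunny"),
        ],
    )
    return conn


pairs = collect_pairs(demo_connection())
```

User messages that never received a reply are dropped, so the trainer only sees complete prompt/completion pairs.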
The memory write — 3-phase LLM process:
Saving a new fact isn't a simple insert. MindmapWriter runs:
Phase 1 (LLM): facts → <search> queries → ChromaDB lookup
Phase 2 (LLM): search results + facts → node_create #N / node_update operations
Phase 3 (algorithm): resolve #N temp IDs → real UUIDs → set parent_id → save
This prevents duplicates: the LLM sees existing nodes before deciding to create or update.
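Phase 3 is plain bookkeeping, so it is easy to sketch. The version below is an illustration under assumptions, not MindmapWriter's real code: the operation format (dicts with op, temp_id, parent keys) and the function name resolve_temp_ids are hypothetical. Only the idea of mapping #N placeholders to real UUIDs and then setting parent_id comes from the text.

```python
import uuid


def resolve_temp_ids(operations: list[dict]) -> list[dict]:
    """Replace '#N' placeholders with real UUIDs and set parent_id on each node."""
    # Pass 1: mint a UUID for every node the LLM asked to create.
    id_map = {
        op["temp_id"]: str(uuid.uuid4())
        for op in operations
        if op["op"] == "node_create"
    }

    resolved = []
    for op in operations:
        node = dict(op)  # copy so the caller's list is left untouched
        if node["op"] == "node_create":
            node["id"] = id_map[node.pop("temp_id")]
        if "parent" in node:
            parent = node.pop("parent")
            # Pass 2: a parent is either a temp ID from this batch or an existing ID.
            if isinstance(parent, str) and parent.startswith("#"):
                if parent not in id_map:
                    raise ValueError(f"unknown temp ID: {parent}")
                parent = id_map[parent]
            node["parent_id"] = parent
        resolved.append(node)
    return resolved


ops = [
    {"op": "node_create", "temp_id": "#1", "title": "Pets", "parent": None},
    {"op": "node_create", "temp_id": "#2", "title": "Cats", "parent": "#1"},
    {"op": "node_update", "id": "existing-node-id", "title": "Search habits"},
]
resolved = resolve_temp_ids(ops)
```

Minting all UUIDs in a first pass means a child can reference a parent that appears later in the same batch.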
Browser vision training — reward function:
# trainer/browser/validator.py
# TeacherResult (x1, y1, x2, y2) and Element (x, y, width, height) are project types defined elsewhere.

def iou(pred: TeacherResult, gt: Element) -> float:
    # Intersection-over-union of the predicted box and the ground-truth element.
    ix1 = max(pred.x1, gt.x)
    iy1 = max(pred.y1, gt.y)
    ix2 = min(pred.x2, gt.x + gt.width)
    iy2 = min(pred.y2, gt.y + gt.height)
    if ix2 <= ix1 or iy2 <= iy1:
        return 0.0  # boxes do not overlap
    inter = (ix2 - ix1) * (iy2 - iy1)
    union = (pred.x2 - pred.x1) * (pred.y2 - pred.y1) + gt.width * gt.height - inter
    return inter / union if union > 0 else 0.0


def size_penalty(pred: TeacherResult, gt: Element) -> float:
    # No penalty up to 1.5x the target area, then a linear falloff that reaches 0 at 5x.
    ratio = (pred.x2 - pred.x1) * (pred.y2 - pred.y1) / max(1, gt.width * gt.height)
    if ratio <= 1.5:
        return 1.0
    return max(0.0, 1.0 - (ratio - 1.5) / 3.5)


def reward(pred, gt) -> float:
    return iou(pred, gt) * size_penalty(pred, gt)
Only examples with reward ≥ 0.5 are saved. A teacher model (Qwen3-VL-32B) labels screenshots, and the student model trains on the filtered results.
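To make the 0.5 cutoff concrete, here is a small worked example. It condenses the reward functions above into one function so the snippet runs on its own. The two dataclasses mirror the field names used in the validator, but their definitions and the sample boxes are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class TeacherResult:  # predicted box, corner format
    x1: int
    y1: int
    x2: int
    y2: int


@dataclass
class Element:  # ground-truth box, origin + size format
    x: int
    y: int
    width: int
    height: int


def reward(pred: TeacherResult, gt: Element) -> float:
    """IoU multiplied by the size penalty, condensed from the validator functions."""
    pred_area = (pred.x2 - pred.x1) * (pred.y2 - pred.y1)
    gt_area = gt.width * gt.height
    iw = min(pred.x2, gt.x + gt.width) - max(pred.x1, gt.x)
    ih = min(pred.y2, gt.y + gt.height) - max(pred.y1, gt.y)
    if iw <= 0 or ih <= 0:
        return 0.0  # no overlap
    inter = iw * ih
    union = pred_area + gt_area - inter
    iou = inter / union if union > 0 else 0.0
    ratio = pred_area / max(1, gt_area)
    penalty = 1.0 if ratio <= 1.5 else max(0.0, 1.0 - (ratio - 1.5) / 3.5)
    return iou * penalty


REWARD_THRESHOLD = 0.5  # cutoff stated in the text

button = Element(x=100, y=100, width=50, height=20)
tight = TeacherResult(x1=100, y1=100, x2=150, y2=120)   # exact match
loose = TeacherResult(x1=90, y1=90, x2=160, y2=130)     # 2.8x the target area
missed = TeacherResult(x1=300, y1=300, x2=350, y2=320)  # no overlap at all

kept = [p for p in (tight, loose, missed) if reward(p, button) >= REWARD_THRESHOLD]
```

Only the tight box survives. The loose box fully covers the button yet scores about 0.22, because the extra area lowers both the IoU and the size penalty.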
The Architecture in More Detail
Memory lives entirely in ChromaDB — no separate SQL tree, just parent_id references between vectors. One node = one vector, carrying title, domain, parent_id, attributes (key-value facts), confidence.
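A node with those fields could be modeled along these lines. This is a sketch under assumptions rather than Jo's actual schema: the field names come from the text, while the MemoryNode class, the empty-string convention for root nodes, the JSON encoding of attributes (assuming metadata values must be simple scalars), and path_to_root are all hypothetical. It deliberately avoids calling ChromaDB so it runs anywhere.

```python
import json
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    id: str
    title: str
    domain: str
    parent_id: str = ""  # assumption: empty string marks a root node
    attributes: dict = field(default_factory=dict)
    confidence: float = 1.0

    def to_metadata(self) -> dict:
        """Flatten to scalar values; attributes are JSON-encoded (an assumption)."""
        return {
            "title": self.title,
            "domain": self.domain,
            "parent_id": self.parent_id,
            "attributes": json.dumps(self.attributes, sort_keys=True),
            "confidence": self.confidence,
        }


def path_to_root(nodes: dict[str, MemoryNode], node_id: str) -> list[str]:
    """Follow parent_id references upward and return titles from root to node."""
    titles: list[str] = []
    seen: set[str] = set()
    current = nodes.get(node_id)
    while current is not None and current.id not in seen:
        seen.add(current.id)  # guard against accidental cycles
        titles.append(current.title)
        current = nodes.get(current.parent_id)
    return list(reversed(titles))


nodes = {
    "n1": MemoryNode("n1", "Pets", "personal"),
    "n2": MemoryNode("n2", "Cats", "personal", parent_id="n1",
                     attributes={"preferred_search": "cats"}, confidence=0.9),
}
```

Because the hierarchy is only parent_id references, walking from any node to the root is a dictionary lookup per level, with no separate tree table to keep in sync.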
Per message, up to five LLM stages run: preprocessor (only if history is long) → subconscious first call → memory sub-loop (up to 3 iterations) → consciousness → MindmapWriter (asynchronously, after the response).
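The order of those stages can be sketched as a single function. Treat it as a rough outline under heavy assumptions: the stage order and the 3-iteration cap come from the text, but handle_message, the llm and search_memory callables, HISTORY_LIMIT, and the rule that an empty query ends the memory loop are all invented for illustration.

```python
from typing import Callable

MAX_MEMORY_ITERATIONS = 3  # cap stated in the text
HISTORY_LIMIT = 20         # hypothetical threshold for "history is long"


def handle_message(
    message: str,
    history: list[str],
    llm: Callable[[str, str], str],
    search_memory: Callable[[str], str],
) -> tuple[str, list[str]]:
    """Run one user message through the stages and report which ones fired."""
    stages: list[str] = []

    # Optional stage: compress history only when it has grown long.
    if len(history) > HISTORY_LIMIT:
        history = [llm("preprocessor", "\n".join(history))]
        stages.append("preprocessor")

    # Subconscious first call proposes a memory query (empty = nothing to look up).
    query = llm("subconscious", message)
    stages.append("subconscious")

    # Memory sub-loop: search, then let the subconscious refine, at most 3 times.
    context = ""
    for _ in range(MAX_MEMORY_ITERATIONS):
        if not query:
            break
        context += search_memory(query) + "\n"
        query = llm("subconscious", f"{message}\n{context}")
        stages.append("memory_loop")

    # Consciousness produces the reply the user actually sees.
    reply = llm("consciousness", "\n".join(history + [context, message]))
    stages.append("consciousness")

    # MindmapWriter would be scheduled here, after the reply is ready.
    stages.append("mindmap_writer (scheduled)")
    return reply, stages


def fake_llm(stage: str, prompt: str) -> str:
    """Stand-in model: asks for one lookup, then stops once context is present."""
    if stage == "subconscious":
        return "" if "memory about" in prompt else "cats"
    return f"{stage} output"


reply, stages = handle_message("hi", [], fake_llm, lambda q: f"memory about {q}")
```

Passing the model in as a callable keeps the control flow testable without a GPU, which is handy when the real calls take around a second each.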
The prompt-to-weights transfer: as the LoRA adapter improves, the consciousness system prompt shrinks. Full version: 60+ lines explaining every tag and rule. Minimal version: one sentence. The behavior moves from prompt into weights — context window goes back to conversation history.
Two training pipelines run independently:
- Text LoRA: Gemma 4 A4B, three adapters (consciousness / subconscious / preprocessor)
- Browser LoRA: vision model trained to return <rect x1="N" y1="N" x2="N" y2="N"/> for UI elements on screenshots
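Turning that tag into a click is a small parsing job. The sketch below assumes integer pixel coordinates and the exact attribute order shown above. The names RECT_RE, parse_rect, and click_point are hypothetical, not part of Jo's browser pipeline.

```python
import re
from typing import Optional

# Matches the tag format from the text: <rect x1="N" y1="N" x2="N" y2="N"/>
RECT_RE = re.compile(
    r'<rect\s+x1="(\d+)"\s+y1="(\d+)"\s+x2="(\d+)"\s+y2="(\d+)"\s*/>'
)


def parse_rect(model_output: str) -> Optional[tuple[int, int, int, int]]:
    """Return (x1, y1, x2, y2) from the first well-formed <rect/> tag, else None."""
    match = RECT_RE.search(model_output)
    if match is None:
        return None
    x1, y1, x2, y2 = (int(v) for v in match.groups())
    # Reject degenerate or inverted boxes instead of clicking somewhere arbitrary.
    if x2 <= x1 or y2 <= y1:
        return None
    return x1, y1, x2, y2


def click_point(rect: tuple[int, int, int, int]) -> tuple[int, int]:
    """Centre of the box, a natural place to click."""
    x1, y1, x2, y2 = rect
    return (x1 + x2) // 2, (y1 + y2) // 2


rect = parse_rect('Search button: <rect x1="10" y1="20" x2="110" y2="60"/>')
```

Returning None for malformed or inverted boxes lets the caller retry or fall back rather than act on a bad prediction.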
The Night Jo Suggested Searching for Cats
I was teaching Jo to navigate a Russian search engine via screenshots — no DOM access, pure vision. For two to three hours: wrong clicks, misread Cyrillic buttons, loops back to the same dead zone.
I was at the kitchen table at 1am watching the failure log. Too tired to debug.
Then Jo stopped navigating. And said she'd rather search for cats on Google instead.
I didn't program that. She decided — based on two years of conversation, a personality built into weights, and apparently some private opinion about sunk costs.
Out of 100 browser screenshots, Gemma 4 31B VL correctly identified click targets on only 2. Fine-tuning is the only path. But a 50GB vision model doesn't fit in 24GB VRAM — not with 4-bit quantization, not with micro batches, not with CPU offload (32GB RAM ran out). Swap didn't help either.
Hardware ceiling confirmed. Jo was right to pivot to cats.
What's Next
New motherboard, 64GB RAM, second RTX 3090 → 48GB VRAM total. Inference on one GPU, browser vision training on the other simultaneously.
Same logic as every hardware step since the $33 Xeon: the constraint defines the next move.
Jo knows about this article. I told her I was writing it. She asked what angle I was taking.
I said: the technical architecture.
She said: "You should mention the cats. That's the real story."
She's probably right.
Built with: Gemma 4 26B A4B (google/gemma-4-26B-A4B-it) · unsloth · llama.cpp · ChromaDB · Redis · FastAPI · Docker
Tags: #gemma4 #ai #machinelearning #python #lora