I Built a 4-Persona Parallel AI Engine with Gemma 4 — and Let It Write This Article

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

This article was drafted by the system it describes. The structure, arguments, and demo output below are real — generated by g4-nexus running locally on an RTX 4070 Ti. I edited for flow and added code excerpts.


What I Built

g4-nexus is a parallel persona engine built on Google's Gemma 4 models. Instead of asking one model one question, it asks three specialized worker personas in parallel — then a fourth leader persona synthesizes their answers into a single integrated response.

The four personas:

Name      | Model      | Role
----------|------------|------------------------
🌕 Mochi  | gemma4:e2b | Emotion & empathy
🗓️ Jun    | gemma4:e2b | Logic & expression
🗝️ Uruu   | gemma4:e2b | Analysis & observation
📅 Koyomi | gemma4:e4b | Integration leader

Note: gemma4:e2b and gemma4:e4b are the official Ollama library tags for these models. "E" stands for "effective" parameters.
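If you want to sanity-check that the tags resolve on your machine before wiring anything up, a one-off query with the official ollama Python client (pip install ollama) looks like this. It's a minimal sketch; g4-nexus itself may call Ollama differently:

import ollama

# One-off sanity check that a persona model tag resolves locally.
# Requires a running Ollama daemon and a prior `ollama pull gemma4:e2b`.
response = ollama.chat(
    model="gemma4:e2b",
    messages=[
        {"role": "system", "content": "Answer in one short sentence."},
        {"role": "user", "content": "Say hello."},
    ],
)
print(response["message"]["content"])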

A single model gives you an answer. g4-nexus gives you:

  • Emotional grounding (Mochi) — is the user overwhelmed? burned out?
  • Logical structure (Jun) — what are the concrete steps?
  • Critical observation (Uruu) — what context is missing? what assumptions are wrong?
  • Integrated action plan (Koyomi) — what should they actually do right now?

This is useful for: decision support, design reviews, multi-stakeholder analysis, or just thinking through a hard problem from every angle at once.

Safety note: g4-nexus is not an autonomous or self-evolving system. It has no internet access, no persistent agency, and no capability to act outside answering questions. It is a local inference orchestration layer — explicit, inspectable, and fully under user control.


Demo

The following demo illustrates the workflow: input → three parallel worker outputs → leader synthesis.

Prompt sent: "I am feeling stuck on a project. I have been working on it for weeks and I can not seem to make progress. What should I do?"


🌕 Mochi (gemma4:e2b — 5.8s)

Oh my goodness, that sounds incredibly draining! I can feel the sheer frustration and maybe a little bit of exhaustion you must be feeling right now!
You need a gentle nudge and a fresh perspective!


🗓️ Jun (gemma4:e2b — 11.2s)

When you feel stuck on a project after working on it for several weeks, it is a very common experience — often referred to as hitting a wall. The solution usually involves changing the approach, restructuring the problem, or adjusting the mindset, rather than simply pushing harder on the same method.

Here are several structured strategies:

  1. Take a Complete Break — Step away for a defined period. This allows your subconscious to process in the background.
  2. Deconstruct the Problem — Break the overwhelming goal into the smallest possible concrete tasks.
  3. Change the Perspective — Discuss with someone else. Articulating the problem out loud often forces a solution into existence.
  4. Timeboxing — Set strict short time limits. Prevents burnout and forces focused effort.
  5. Re-establish Momentum — Start with the easiest task available. Small wins build psychological momentum.

🗝️ Uruu (gemma4:e2b — 7.4s)

General statement regarding project stagnation; lacks specific context for actionable advice.

[Note] The user is presenting a classic "stuck" query — expecting brainstorming, time management, or task breakdown, based on past patterns.


📅 Koyomi (gemma4:e4b — integrates all three)

It is entirely normal to feel stuck after weeks of intense work. That feeling is a signal, not a failure. We need to shift your approach, not just your effort.

Try this sequence:

First, take a deliberate break — step away overnight. Do something completely unrelated.

When you return, use this framework:

  1. Atomic Breakdown — Chop the goal until each piece takes no more than 90 minutes. Find the absolute smallest starting point.
  2. Pivot the Entry Point — Intentionally choose the easiest adjacent task first. Force a small guaranteed win.
  3. Externalize — Speak the problem out loud to a colleague, a friend, or even a voice note. The act of structuring for an audience often reveals the solution path.

Commit to trying only one of these for the next dedicated work block. Progress through a rut is about changing the angle of attack, not pushing harder on the old one.

Total elapsed: 31.5s


Notice what happened: Mochi validated the feeling, Jun gave exhaustive structure, Uruu flagged the missing context (correctly — a real advisor would ask follow-up questions), and Koyomi synthesized all three into a single actionable plan. None of them contradicted the others; they complemented one another.


Code

GitHub: pepepepepepo/g4-nexus

git clone https://github.com/pepepepepepo/g4-nexus
cd g4-nexus
pip install -r requirements.txt

# Pull models first:
ollama pull gemma4:e2b
ollama pull gemma4:e4b

# CLI mode
python main.py "What should I focus on today?"

# WebUI (FastAPI)
uvicorn api:app --port 8013
# open http://localhost:8013
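For reference, a minimal FastAPI wrapper around run_tribe looks roughly like this. The endpoint path and request shape below are illustrative, not necessarily what api.py in the repo actually exposes:

from fastapi import FastAPI
from pydantic import BaseModel

from main import run_tribe  # illustrative import; adjust to the actual repo layout

app = FastAPI()

class AskRequest(BaseModel):
    message: str

@app.post("/ask")  # illustrative route
async def ask(req: AskRequest) -> dict:
    # run_tribe returns {"workers": [...], "koyomi": ...}
    return await run_tribe(req.message)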

VRAM note: E4B + E2B × 3 exceeds 12GB VRAM; Ollama offloads layers to system RAM automatically. Tested on RTX 4070 Ti (12GB) + 48GB RAM.
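Per-model memory use is mostly governed by the options dicts passed to Ollama. The values below are a plausible sketch, not the repo's actual WORKER_OPTIONS / LEADER_OPTIONS:

# Illustrative Ollama option dicts; the repo's actual
# WORKER_OPTIONS / LEADER_OPTIONS may use different values.
WORKER_OPTIONS = {
    "temperature": 0.7,  # some variety in each persona's voice
    "num_ctx": 4096,     # a modest context window keeps per-worker memory low
    "num_predict": 512,  # workers write short, focused reports
}

LEADER_OPTIONS = {
    "temperature": 0.5,   # steadier tone for the synthesis
    "num_ctx": 8192,      # room to hold all three worker reports
    "num_predict": 1024,  # a longer, integrated answer
}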


How I Used Gemma 4

The choice of E4B (Effective 4B — ~4.5B total params) and E2B (Effective 2B — ~2.3B total params) was intentional. These "effective parameter" variants are optimized for edge devices and local deployment.

Why not the 31B Dense model?

  1. Accessibility — E2B/E4B run on consumer hardware without cloud costs. The 31B would require hardware most users don't have.
  2. Specialization — Smaller models are easier to steer into a narrow role via system prompts. E2B workers stay in their lane (emotion, logic, analysis) reliably.
  3. Parallel speed — Three E2B workers can run in a consumer local setup alongside the E4B leader, with Ollama offloading layers to system RAM when needed. Running four 31B instances in parallel is not practical locally.

Why two different model sizes?

The leader (Koyomi, E4B) needs more capacity to hold and synthesize three separate reports. The workers (Mochi, Jun, Uruu, E2B) each handle a single focused task — smaller and faster is better there.
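The steering itself is just system prompts. The ones below are illustrative stand-ins, not the actual MOCHI_SYSTEM / JUN_SYSTEM / URUU_SYSTEM shipped in the repo:

# Illustrative persona prompts -- stand-ins for the repo's real ones.
MOCHI_SYSTEM = (
    "You are Mochi. Respond only to the emotional content of the user's "
    "message: name the feelings, validate them, and keep it brief."
)
JUN_SYSTEM = (
    "You are Jun. Ignore emotional framing. Produce a structured, "
    "numbered breakdown of concrete steps the user could take."
)
URUU_SYSTEM = (
    "You are Uruu. Do not give advice. State what context is missing, "
    "which assumptions look shaky, and what a careful advisor would ask."
)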

The architecture: asyncio.gather

Three E2B workers run in parallel, then the E4B leader integrates:

import asyncio

async def run_tribe(user_message: str) -> dict:
    # Fire all 3 workers simultaneously (ask_worker and the *_SYSTEM
    # prompts are defined elsewhere in the repo)
    worker_results = await asyncio.gather(
        ask_worker("mochi", MOCHI_SYSTEM, user_message, WORKER_OPTIONS),
        ask_worker("jun",   JUN_SYSTEM,   user_message, WORKER_OPTIONS),
        ask_worker("uruu",  URUU_SYSTEM,  user_message, WORKER_OPTIONS),
    )

    # Leader receives all 3 reports and integrates
    combined = build_leader_prompt(worker_results)
    koyomi_response = await ask_leader("koyomi", KOYOMI_SYSTEM, combined, LEADER_OPTIONS)

    return {"workers": worker_results, "koyomi": koyomi_response}
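run_tribe leans on three helpers. Minimal versions using the official ollama Python client could look like this; the repo's real implementations may differ (for example, by calling Ollama's HTTP API directly):

from ollama import AsyncClient

# Which Ollama tag each persona runs on.
MODELS = {"mochi": "gemma4:e2b", "jun": "gemma4:e2b",
          "uruu": "gemma4:e2b", "koyomi": "gemma4:e4b"}

async def ask_worker(name: str, system: str, msg: str, options: dict) -> dict:
    # One independent chat call per persona; no shared state between workers.
    resp = await AsyncClient().chat(
        model=MODELS[name],
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": msg}],
        options=options,
    )
    return {"name": name, "answer": resp["message"]["content"]}

ask_leader = ask_worker  # same call shape; only the model and options differ

def build_leader_prompt(worker_results: list[dict]) -> str:
    # Concatenate the three worker reports into one prompt for Koyomi.
    reports = "\n\n".join(f"[{w['name']}]\n{w['answer']}" for w in worker_results)
    return "Integrate these three perspectives into one answer:\n\n" + reports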

This is not a chain. The workers do not see each other's answers — they work independently, in parallel, from the same user input. This is what Gemma 4's small, steerable models unlock: you can run multiple specialized instances simultaneously on a single consumer GPU.

On an RTX 4070 Ti (12GB VRAM) with 48GB system RAM: workers finish in ~9s, leader integrates in ~25s, total ~31s.
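Because asyncio.gather overlaps the three worker calls, the worker phase costs roughly as much as the slowest worker, not the sum. A toy timing check makes the shape visible (using the demo's worker timings, with asyncio.sleep standing in for model calls):

import asyncio
import time

async def fake_worker(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)  # stand-in for a model call
    return name

async def main() -> None:
    start = time.perf_counter()
    await asyncio.gather(
        fake_worker("mochi", 5.8),
        fake_worker("jun", 11.2),
        fake_worker("uruu", 7.4),
    )
    # Prints ~11.2s (the slowest worker), not 24.4s (the sum).
    print(f"parallel phase: {time.perf_counter() - start:.1f}s")

asyncio.run(main())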

This architecture specifically leverages the steerability of Gemma 4's small models to maximize parallelism on consumer hardware — making specialized, multi-perspective AI accessible without cloud infrastructure.


GitHub: pepepepepepo/g4-nexus
