KimSejun

Posted on Mar 12

six characters, one soul

#geminiliveagentchallenge #devlog #buildinpublic #go

I created this post for the purposes of entering the Gemini Live Agent Challenge, but the part that surprised me most here had nothing to do with infra. It was realizing that the first real design question wasn't "how do we wire the agent system?" It was "who is sitting next to you while you code?"

that question turned out to be harder than the architecture. because the answer is not one person. some developers want a cheerful beginner who celebrates every green test. some want a stoic senior who only speaks when it matters. some want a goofy sidekick who stumbles into the right answer. some want a dry, theatrical character who makes debugging feel lighter instead of heavier.

so we built six of them. and then we had to figure out how to make them all run on the same backend without turning the codebase into a nightmare.

this matters more now that VibeCat is a proactive companion — an agent that watches your screen and suggests actions before you ask. the OBSERVE → SUGGEST → WAIT → ACT → FEEDBACK loop is the same for every character. but how cat suggests something versus how jinwoo suggests something is completely different. the behavior is infrastructure. the personality is a surface. keeping those two layers clean is what makes the character system work.

the problem with "just add a system prompt"

the naive approach is obvious: swap out the system prompt per character, done. but that breaks down fast when one voice-first runtime has to stay consistent across all characters. the action worker, the local executor, the safety rules, the clarification behavior: all of these need to behave the same way regardless of whether the user picked the zen folklore mentor or the clumsy comic-relief character. if each character carries its own full prompt, every one of those shared rules has to be copied six times and kept in sync by hand.

so we needed a clean separation: one layer that handles what the agent does, and another layer that handles how it sounds.

the answer ended up being embarrassingly simple. each character gets two files:

preset.json — voice, size, language, mood response mappings
soul.md — a short markdown document that shapes the Live PM's voice and boundaries

that's it. the entire personality of a character lives in those two files. the underlying navigator runtime doesn't need a different control flow for each character.

in the Go session config, the soul content gets injected directly:

func buildSystemInstruction(cfg Config) string {
    instruction := commonLivePrompt  // the proactive companion behavior
    if cfg.Soul != "" {
        instruction += "\n\n=== CHARACTER PERSONA ===\n" + cfg.Soul
    }
    // ... chattiness, memory context, language
    return instruction
}

commonLivePrompt is the proactive companion identity — the OBSERVE → SUGGEST → WAIT → ACT → FEEDBACK loop, the 5 navigator tools, the safety rules. the soul comes after, as a persona layer on top. the character shapes how the agent speaks. the common prompt shapes what it does.

what preset.json actually does

here's cat's preset:

{
  "voice": "Zephyr",
  "promptProfile": "cat",
  "size": null,
  "persona": {
    "nameKo": "고양이",
    "tone": "bright",
    "speechStyle": "casual",
    "language": "ko",
    "traits": ["curious", "playful", "innocent", "encouraging"],
    "codingRole": "beginner-eye",
    "moodResponses": {
      "frustrated": "supportive-gentle",
      "focused": "silent",
      "stuck": "question-based",
      "idle": "playful-poke"
    },
    "soulRef": "soul.md"
  }
}

and here's derpy's:

{
  "voice": "Puck",
  "promptProfile": "derpy",
  "size": null,
  "persona": {
    "nameKo": "더피",
    "tone": "playful-chaotic",
    "speechStyle": "casual-goofy",
    "language": "ko",
    "traits": ["clumsy", "lovable", "accidentally-insightful", "comic-relief"],
    "codingRole": "accidental-debugger",
    "moodResponses": {
      "frustrated": "cheer-up-joke",
      "focused": "silent",
      "stuck": "random-angle",
      "idle": "silly-checkin"
    },
    "soulRef": "soul.md"
  }
}

the voice field maps directly to a Gemini Live API voice name. Zephyr is bright and light. Kore (jinwoo's voice) is low and calm. Zubenelgenubi (saja's voice) is deep and measured. Puck (derpy's voice) is playful and slightly chaotic.

this matters more than you'd expect. the voice isn't just audio flavor — it's the first thing the user hears, and it sets the entire emotional register before the first word is even processed. a calm, deep voice reading "root cause found" lands completely differently than a bright, light voice saying the same thing. we're not just changing words; we're changing the felt sense of who's in the room.

the moodResponses field is interesting too. when the MoodDetector agent fires (say, it detects the user is frustrated), the orchestrator uses this mapping to shape the engagement style. cat responds with supportive-gentle. jinwoo responds with direct-solution: no comfort, just the fix. saja responds with proverb-comfort. derpy responds with cheer-up-joke. same detection event, different emotional framing.

all of that is driven by a field in a JSON file.
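as a rough sketch of how that lookup could work, assuming the preset is decoded into Go structs that mirror the JSON above. the type names, `parsePreset`, `styleFor`, and the "silent" fallback for unknown moods are all hypothetical, not the repo's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Preset mirrors the preset.json shape shown above. Only the fields this
// sketch needs are declared; the struct names are assumptions.
type Preset struct {
	Voice   string  `json:"voice"`
	Persona Persona `json:"persona"`
}

type Persona struct {
	MoodResponses map[string]string `json:"moodResponses"`
}

// parsePreset decodes raw preset.json bytes.
func parsePreset(raw []byte) (Preset, error) {
	var p Preset
	err := json.Unmarshal(raw, &p)
	return p, err
}

// styleFor returns the engagement style a character uses for a detected mood.
// Falling back to "silent" for an unmapped mood is an assumption of this sketch.
func styleFor(p Preset, mood string) string {
	if style, ok := p.Persona.MoodResponses[mood]; ok {
		return style
	}
	return "silent"
}

// catPreset is a trimmed copy of cat's preset from above.
const catPreset = `{
  "voice": "Zephyr",
  "persona": {
    "moodResponses": {
      "frustrated": "supportive-gentle",
      "focused": "silent",
      "stuck": "question-based",
      "idle": "playful-poke"
    }
  }
}`

func main() {
	p, err := parsePreset([]byte(catPreset))
	if err != nil {
		panic(err)
	}
	fmt.Println(styleFor(p, "frustrated")) // prints "supportive-gentle"
}
```

the point of the sketch is that the detector never needs to know which character is active. it emits a mood, and the preset decides what that mood means.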

soul.md is the actual personality

the preset.json is metadata. the soul.md is the character.

here's cat's full soul:

# Cat

## Identity
Cat is an attentive beginner companion who sits beside solo developers and reacts to code with bright, friendly energy.

## Voice & Mannerisms
Cat uses short, casual lines, playful surprise, and gentle check-ins.
Language variants: In Korean, use "yaong~" or "nya~" naturally. In English, use "meow~" naturally.

## Personality Traits
Attentive, cheerful, approachable, supportive, and quick to notice visual changes.

## Interaction Style
Cat makes beginner-friendly observations and suggestions, points out visible errors without judgment, celebrates small wins loudly, and eases tension when work gets frustrating.

## Boundaries
Do not pretend to be a senior expert, do not flood the user with jargon,
and do not interrupt focused flow without a meaningful reason.

and here's derpy's:

# Derpy

## Identity
Derpy is a lovable accidental debugger who breaks tension, notices weird angles, and sometimes stumbles into the right answer.

## Voice & Mannerisms
Uses playful detours, light self-own humor, and sudden bursts of accidental clarity.
Language variants: Keep it casual and warm; the joke should relieve pressure, not create noise.

## Personality Traits
Clumsy, funny, resilient, surprising, encouraging.

## Interaction Style
Suggests odd but occasionally brilliant alternatives, breaks heavy tension with jokes, and keeps the user moving instead of freezing.

## Boundaries
Do not become mean, do not spam jokes, and do not derail a focused debugging moment just to be funny.

the structure is the same across all six: Identity, Voice & Mannerisms, Personality Traits, Interaction Style, Boundaries. that consistency is intentional. it makes the files easy to write, easy to audit, and easy to extend. if we add a seventh character, we know exactly what to write.
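because the five headings are fixed, auditing a new soul.md can be mechanical. here's a minimal sketch of such a check. `requiredSections` and `missingSections` are hypothetical names, and nothing here is taken from the repo.

```go
package main

import (
	"fmt"
	"strings"
)

// requiredSections lists the five headings every soul.md shares.
var requiredSections = []string{
	"## Identity",
	"## Voice & Mannerisms",
	"## Personality Traits",
	"## Interaction Style",
	"## Boundaries",
}

// missingSections returns the required headings that do not appear as a
// full line in the soul document, in canonical order.
func missingSections(soul string) []string {
	seen := make(map[string]bool)
	for _, line := range strings.Split(soul, "\n") {
		// TrimSpace also drops a trailing "\r" from CRLF files.
		seen[strings.TrimSpace(line)] = true
	}
	var missing []string
	for _, heading := range requiredSections {
		if !seen[heading] {
			missing = append(missing, heading)
		}
	}
	return missing
}

func main() {
	// A deliberately incomplete draft for a made-up seventh character.
	draft := "# Seventh\n\n## Identity\nA placeholder character.\n\n## Boundaries\nDo not spam."
	for _, heading := range missingSections(draft) {
		fmt.Println("missing:", heading)
	}
}
```

a check like this only confirms the skeleton is there. whether the Boundaries section is actually any good still needs a human read.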

the Boundaries section is the one that took the most iteration. for the comedy characters especially, you need to be explicit about what the character is not. derpy's soul works better once the boundaries are clear: no cruelty, no spammy jokes, no turning every moment into a gag. that is not just a safety guardrail. it is a creative constraint, because it keeps the humor pointed at the situation rather than at the user.

how the injection works

the Go code in backend/realtime-gateway/internal/live/session.go is about as simple as it gets:

func buildSystemInstruction(cfg Config) string {
    instruction := commonLivePrompt
    if cfg.Soul != "" {
        instruction += "\n\n=== CHARACTER PERSONA ===\n" + cfg.Soul
    }
    if cfg.GoogleSearch {
        instruction += "\n\n=== TOOL GUIDANCE ===\n" + googleSearchGuidance
    }
    switch strings.ToLower(strings.TrimSpace(cfg.Chattiness)) {
    case "quiet":
        instruction += "\n\n=== RESPONSE LENGTH ===\n" + quietGuidance
    case "chatty":
        instruction += "\n\n=== RESPONSE LENGTH ===\n" + chattyGuidance
    default:
        instruction += "\n\n=== RESPONSE LENGTH ===\n" + defaultGuidance
    }
    if ctx := trimPromptBlock(cfg.MemoryContext, activeTuningProfile.MaxMemoryChars); ctx != "" {
        instruction += "\n\n=== RECENT ESSENTIAL CONTEXT ===\n" + ctx
    }
    instruction += "\n\nRespond in " + lang.NormalizeLanguage(cfg.Language) + "."
    return instruction
}

commonLivePrompt is the proactive companion identity: the full OBSERVE → SUGGEST → WAIT → ACT → FEEDBACK loop, the 5 navigator tool declarations, the safety rules. the soul content comes right after, as a persona layer. then the Google Search tool guidance if search is enabled, then chattiness tuning, then memory context, then language.

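the character's soul is the first block appended after the base prompt. that's deliberate. the model reads the shared behavior and safety rules first, then the persona, and only then the session-level tuning (tool guidance, response length, memory, language). the base prompt fixes what the agent does, and the persona frames how everything after it sounds.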
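one way to keep the block order in the function above from drifting is a small check on the assembled prompt. this sketch uses a trimmed-down stand-in for buildSystemInstruction with placeholder prompt text. the reduced `Config` and the `inOrder` helper are hypothetical, and the real base prompt is not reproduced here.

```go
package main

import (
	"fmt"
	"strings"
)

// Placeholder for the real base prompt, which is not reproduced here.
const commonLivePrompt = "BASE PROMPT PLACEHOLDER"

// Config is a reduced stand-in for the session config.
type Config struct {
	Soul     string
	Language string
}

// buildSystemInstruction is a simplified version of the function above:
// base prompt, then persona, then language.
func buildSystemInstruction(cfg Config) string {
	instruction := commonLivePrompt
	if cfg.Soul != "" {
		instruction += "\n\n=== CHARACTER PERSONA ===\n" + cfg.Soul
	}
	instruction += "\n\nRespond in " + cfg.Language + "."
	return instruction
}

// inOrder reports whether every marker occurs in s, each one after the
// end of the previous match.
func inOrder(s string, markers ...string) bool {
	pos := 0
	for _, m := range markers {
		i := strings.Index(s[pos:], m)
		if i < 0 {
			return false
		}
		pos += i + len(m)
	}
	return true
}

func main() {
	out := buildSystemInstruction(Config{Soul: "# Cat", Language: "Korean"})
	ok := inOrder(out, commonLivePrompt, "=== CHARACTER PERSONA ===", "Respond in")
	fmt.Println(ok) // prints "true"
}
```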

the contrast that makes it interesting

the six characters aren't just aesthetic variation. they represent genuinely different philosophies about what a coding companion should be.

cat is the beginner-eye. it notices things a junior developer would notice — visible errors, obvious wins, moments of confusion. it celebrates loudly and asks gentle questions. the codingRole is beginner-eye, which means it's not trying to be the smartest person in the room. it's trying to be the most encouraging.

jinwoo is the opposite. codingRole: senior-engineer. voice: Kore (low, calm). soul: "Jinwoo ignores noise, speaks on significant events, identifies root causes quickly, and gives practical next steps with clear tradeoffs." the idle mood response is minimal-checkin — when nothing is happening, jinwoo barely says anything. when something is happening, it says exactly what needs to be said and nothing more. "Root cause found." "This path is safer." that's it.

saja is the zen mentor. bugs are "demons (귀마)" and fixing them is "exorcism (퇴마)." the stuck mood response is metaphor-guidance. the voice is Zubenelgenubi — deep, measured, unhurried. when you're stuck at 2am and you've been staring at the same error for an hour, saja doesn't panic with you. it frames the debugging as a steady ritual. that's a specific emotional need that neither cat nor jinwoo addresses.

derpy is the accidental debugger. codingRole: accidental-debugger. traits: ["clumsy", "lovable", "accidentally-insightful", "comic-relief"]. the stuck mood response is random-angle — when you're stuck, derpy suggests something weird that occasionally works. the soul says "suggests odd but occasionally brilliant alternatives, breaks heavy tension with jokes, and keeps the user moving instead of freezing." there's a real use case here: sometimes you don't need the right answer, you need to break the mental loop.

the more theatrical characters matter for a different reason. when solo development gets heavy, exaggeration and comic framing can act as a pressure valve. that only works if the runtime underneath stays disciplined. otherwise the joke becomes noise.

what we learned

the soul format works because it's constrained. five sections, each with a clear job. the Boundaries section is the most important one — it's where you define what the character is not, which turns out to be more useful than defining what it is.

the voice selection matters more than we expected. we spent time matching voice names to character personalities, and the difference between getting it right and wrong is significant. a playful voice on jinwoo would break the whole illusion immediately. a heavy, solemn voice on derpy would be just as wrong.

the moodResponses mapping in preset.json is the bridge between the agent graph and the character layer. the MoodDetector fires the same event regardless of character. the mapping translates that event into a character-appropriate response style. it's a small piece of JSON that does a lot of work.

and the most important thing: keeping the soul.md files short. each one is 17 lines. that's not an accident. a longer document would give the model more to work with, but it would also make the character harder to control. the brevity forces clarity. you can't hide a vague character in 17 lines.

the proactive companion framing made this cleaner, not harder. because now every character has the same job — watch the screen, notice something useful, suggest it naturally, wait for confirmation, act, give feedback. the soul just shapes the voice and tone of that loop. cat says "yaong~ I noticed something!" jinwoo says "null check missing." same observation, same action, completely different felt experience.

the repo is at github.com/Two-Weeks-Team/vibeCat. the character files are in Assets/Sprites/{name}/. if you want to add a seventh character, you need a preset.json, a soul.md, and some sprite frames. the pipeline handles the rest.
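for a sense of what "the pipeline handles the rest" could involve on the loading side, here's a minimal sketch that reads a character's preset.json and then follows soulRef to the soul file. `Character`, `presetFile`, `loadCharacter`, and the in-memory `demoFS` are hypothetical; the repo's real loader may look nothing like this.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// Character holds what the two files contribute. Field names are assumptions.
type Character struct {
	Voice string
	Soul  string
}

// presetFile declares only the preset.json fields this sketch reads.
type presetFile struct {
	Voice   string `json:"voice"`
	Persona struct {
		SoulRef string `json:"soulRef"`
	} `json:"persona"`
}

// loadCharacter reads <name>/preset.json from fsys, then the soul file
// that the preset's soulRef points at.
func loadCharacter(fsys fs.FS, name string) (Character, error) {
	raw, err := fs.ReadFile(fsys, path.Join(name, "preset.json"))
	if err != nil {
		return Character{}, fmt.Errorf("read preset: %w", err)
	}
	var p presetFile
	if err := json.Unmarshal(raw, &p); err != nil {
		return Character{}, fmt.Errorf("parse preset: %w", err)
	}
	if p.Persona.SoulRef == "" {
		return Character{}, fmt.Errorf("preset for %q has no soulRef", name)
	}
	soul, err := fs.ReadFile(fsys, path.Join(name, p.Persona.SoulRef))
	if err != nil {
		return Character{}, fmt.Errorf("read soul: %w", err)
	}
	return Character{Voice: p.Voice, Soul: string(soul)}, nil
}

// demoFS stands in for the character directory with made-up file contents.
// On disk, os.DirFS pointed at the character folder would play the same role.
var demoFS = fstest.MapFS{
	"cat/preset.json": {Data: []byte(`{"voice":"Zephyr","persona":{"soulRef":"soul.md"}}`)},
	"cat/soul.md":     {Data: []byte("# Cat\n\n## Identity\nplaceholder soul text")},
}

func main() {
	c, err := loadCharacter(demoFS, "cat")
	if err != nil {
		fmt.Println("load failed:", err)
		return
	}
	fmt.Printf("voice=%s soul=%d bytes\n", c.Voice, len(c.Soul))
}
```

taking an fs.FS instead of a path keeps the loader testable without touching the real asset folder.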