DEV Community

KimSejun

the cat that watches your screen: building vibecat for the gemini live agent challenge

I wrote this post as an entry for the Gemini Live Agent Challenge, and honestly it feels less like a submission writeup and more like a field report from a two-week sprint that nearly broke me. VibeCat is a macOS desktop companion for solo developers — an animated cat that sits on your screen, watches your code, hears your voice, remembers what you were doing yesterday, and speaks up when it matters. Not a chatbot. Not a sidebar. A colleague.

The idea came from a real problem: coding alone is weird. There's no one to catch your typos, no one to notice you've been staring at the same function for 45 minutes, no one to say "hey, that test just went green" when you're too deep in the next problem to notice. VibeCat is the empty chair next to you, filled.

the architecture that took three rewrites to get right

The challenge rules require GenAI SDK + ADK + Gemini Live API + VAD — all four, together, with strict client/backend separation. That constraint turned out to be a gift, because it forced a clean three-layer split that I probably wouldn't have designed otherwise.

```
macOS Swift Client → Realtime Gateway (Go + GenAI SDK) → ADK Orchestrator (Go + ADK)
```

The client does exactly three things: captures the screen, captures audio, and renders the cat. No model calls. No API keys. The Gemini API key lives in GCP Secret Manager and never touches the client binary. The client authenticates with a device UUID — no account creation, no OAuth dance, just a UUID generated on first launch that gets exchanged for a short-lived JWT. Zero onboarding friction.

```mermaid
flowchart LR
    U["User (Voice + Screen)"] --> C[macOS Client\nSwift 6 / SwiftUI]
    C -->|WebSocket| RG[Realtime Gateway\nGo + GenAI SDK\nCloud Run]
    RG -->|Live API| GL[Gemini Live API]
    RG -->|HTTP POST /analyze| AO[ADK Orchestrator\nGo + ADK Go SDK\nCloud Run]
    AO --> W1["Wave 1 ∥\nVision + Memory"]
    AO --> W2["Wave 2 ∥\nMood + Celebration"]
    AO --> W3["Wave 3 →\nMediator → Scheduler\n→ Engagement → Search"]
    W1 & W3 --> FS[(Firestore)]
    W3 --> GS[Google Search]
    RG --> SM[Secret Manager]
    RG & AO --> OBS[Cloud Trace +\nCloud Logging +\nADK Telemetry]
```

The Realtime Gateway is a Go service running on Cloud Run. It accepts WebSocket connections from the Swift client, initializes a Gemini Live API session via the GenAI SDK, and proxies audio bidirectionally. PCM 16kHz 16-bit mono goes up, PCM 24kHz comes back down. The gateway also handles session resumption — if the connection drops, the client sends its last resumption handle and Gemini picks up the conversation where it left off.

The ADK Orchestrator is a separate Cloud Run service. The gateway calls it via HTTP POST /analyze every time a screen capture comes in. The orchestrator runs the 9-agent graph and returns a decision: should the cat speak, what should it say, what emotion should it express, how urgent is it.

nine agents, three waves, one decision

The 9-agent graph is where most of the interesting engineering happened. The naive approach — run all agents sequentially — was too slow. Initial measurements showed 3.5 seconds from screen capture to spoken response. That's too long. The cat would feel laggy, disconnected from what you're actually doing.

The fix was ADK's ParallelAgent. The graph now runs in three waves:

Wave 1 (Perception, parallel): VisionAgent and MemoryAgent run simultaneously. VisionAgent sends the screenshot to gemini-3.1-flash-lite-preview and gets back a structured analysis — what app is open, what errors are visible, what the developer is doing. MemoryAgent hits Firestore and retrieves cross-session context — what topics this developer has been working on, what was unresolved last time.

Wave 2 (Emotion, parallel): MoodDetector and CelebrationTrigger run simultaneously on the vision output. MoodDetector classifies frustration level from signals like repeated edits to the same line, error frequency, and silence duration. CelebrationTrigger looks for success signals — tests going green, builds completing, git commits.

Wave 3 (Decision, sequential): Mediator, AdaptiveScheduler, EngagementAgent, and SearchBuddy run in order. Mediator decides whether to speak at all — it enforces cooldown periods, checks significance thresholds, and can bypass the cooldown entirely if CelebrationTrigger fired. AdaptiveScheduler adjusts timing thresholds based on interaction rate. EngagementAgent handles proactive outreach when the developer has been silent too long. SearchBuddy fires Google Search grounding when the developer appears stuck or frustrated.

```go
// graph.go — the actual ADK agent graph construction
wave1, err := parallelagent.New(parallelagent.Config{
    AgentConfig: agent.Config{
        Name:        "wave1_perception",
        Description: "Parallel: Vision analysis + Memory retrieval",
        SubAgents:   []agent.Agent{visionAgent, memoryAgent},
    },
})

wave2, err := parallelagent.New(parallelagent.Config{
    AgentConfig: agent.Config{
        Name:        "wave2_emotion",
        Description: "Parallel: Mood detection + Celebration check",
        SubAgents:   []agent.Agent{moodAgent, celebrationAgent},
    },
})

wave3, err := sequentialagent.New(sequentialagent.Config{
    AgentConfig: agent.Config{
        Name:        "wave3_decision",
        Description: "Sequential: Decision agents that depend on perception + emotion results",
        SubAgents:   []agent.Agent{mediatorAgent, schedulerAgent, engagementAgent, searchAgent},
    },
})

graph, err := sequentialagent.New(sequentialagent.Config{
    AgentConfig: agent.Config{
        Name:        "vibecat_graph",
        Description: "VibeCat 9-agent orchestration: perception(parallel) → emotion(parallel) → decision(sequential)",
        SubAgents:   []agent.Agent{wave1, wave2, wave3},
    },
})
```

The result: 3.5 seconds down to 2.1 seconds — a 40% latency improvement just from parallelizing the perception and emotion waves. The sequential constraint on Wave 3 is intentional: Mediator needs both vision and mood results before it can make a good decision, and SearchBuddy needs to know if Mediator decided to speak before it wastes a Google Search call.

the VAD configuration that actually works

Getting VAD right took longer than I expected. The Gemini Live API's automaticActivityDetection is powerful but the defaults are tuned for phone calls, not developer workflows. Developers type, think, mutter, and have long silences that aren't conversation endings.

```go
prefixPadding := int32(300)
silenceDuration := int32(500)
lc.RealtimeInputConfig = &genai.RealtimeInputConfig{
    AutomaticActivityDetection: &genai.AutomaticActivityDetection{
        StartOfSpeechSensitivity: genai.StartSensitivityLow,
        EndOfSpeechSensitivity:   genai.EndSensitivityLow,
        PrefixPaddingMs:          &prefixPadding,
        SilenceDurationMs:        &silenceDuration,
    },
    ActivityHandling: genai.ActivityHandlingStartOfActivityInterrupts,
    TurnCoverage:     genai.TurnCoverageTurnIncludesOnlyActivity,
}
```

Low sensitivity on both start and end of speech. 300ms prefix padding so the beginning of a sentence isn't clipped. 500ms silence duration so the cat doesn't interrupt mid-thought. StartOfActivityInterrupts means the developer can barge in and cut the cat off mid-sentence — critical for not being annoying. TurnIncludesOnlyActivity means silence doesn't get sent to the model, which keeps the context window clean.

I also explicitly disabled Google Search on the Live API side. Adding it caused Gemini to consider searching on every voice response, adding 5-10 seconds of latency even for simple "hey, what's up" exchanges. Search is handled exclusively by the ADK SearchBuddy agent on the screen-analysis pipeline, which only fires selectively.

affective dialog and urgency-based TTS routing

The gateway supports two speech paths. For real-time conversation, audio comes back from Gemini Live API directly — that's the gemini-2.5-flash-native-audio-latest model with AffectiveDialog enabled. Affective dialog means the model adjusts its vocal tone based on emotional context — it sounds more gentle when you're frustrated, more excited when you're celebrating.

For proactive observations from the ADK orchestrator — the cat noticing something on your screen and deciding to comment — the gateway routes through a separate TTS path using gemini-2.5-flash-preview-tts. This is the urgency-based routing: high-urgency signals (errors, build failures) go through TTS immediately, bypassing the normal cooldown. Low-urgency observations queue and wait for a natural break.

```go
// tts/client.go (excerpt — error handling and chunk extraction elided)
const defaultModel = "gemini-2.5-flash-preview-tts"

func (c *Client) StreamSpeak(ctx context.Context, cfg Config, sink AudioSink) error {
    ttsCtx, cancel := context.WithTimeout(ctx, ttsTimeout)
    defer cancel()
    // streams PCM chunks to the AudioSink as they arrive
    for resp, err := range c.genai.Models.GenerateContentStream(ttsCtx, c.model, genai.Text(text), genConfig) {
        // ... (check err, extract PCM bytes from resp, write to sink)
    }
    return nil
}
```

The Swift client receives PCM audio chunks and plays them through AVFoundation. The chat bubble appears in sync with speech using a KDC-pattern (key-duration-content) timing system — the bubble pops up when audio starts, stays visible for the duration of the spoken content, then fades. It auto-sizes to the text content and uses edge avoidance so it doesn't render off-screen when the cat is near a monitor edge.

the /healthz bug that cost me two hours

Cloud Run has a health check system. By default, it hits /healthz to determine if a container is ready to receive traffic. I had named my health endpoint /health (no z). The service would deploy, Cloud Run would hit /healthz, get a 404, decide the service was unhealthy, and refuse to route traffic to it.

The symptom was maddening: the container logs showed the service starting fine, the Go server was listening, but every request from the client got a 503. I spent two hours checking TLS configuration, IAM permissions, and network policies before I thought to check what Cloud Run was actually hitting. The fix was one line:

```go
// was: mux.HandleFunc("/health", healthHandler)
// now:
mux.HandleFunc("/health", healthHandler)
mux.HandleFunc("/healthz", healthHandler)  // Cloud Run default probe
```

Both endpoints now exist. /health returns connection count and service status. /healthz is an alias. Two hours of debugging for one line of code. Classic.

six characters, six souls

Each character is more than a sprite sheet. Every character has a soul.md — a personality document that gets injected into the system prompt at session start. The cat's soul is bright and beginner-friendly. Jinwoo's soul is a silent senior engineer who speaks rarely but precisely. Trump's soul is a bombastic hype-man who calls every successful commit "the greatest commit in the history of software development."

```json
// cat/preset.json
{
  "voice": "Zephyr",
  "persona": {
    "tone": "bright",
    "speechStyle": "casual",
    "traits": ["curious", "playful", "innocent", "encouraging"],
    "codingRole": "beginner-eye",
    "moodResponses": {
      "frustrated": "supportive-gentle",
      "focused": "silent",
      "stuck": "question-based",
      "idle": "playful-poke"
    }
  }
}
```

The soul content gets appended to the common system prompt at session setup time, server-side. The client sends a setup WebSocket frame with the character name and the gateway loads the soul from disk. Character switching mid-session creates a new Gemini Live session with the new soul injected — the resumption handle is discarded because the persona change is intentional.

The six voices — Zephyr, Puck, Kore, Schedar, Zubenelgenubi, Fenrir — are Gemini's prebuilt voices, each matched to the character's personality. Kore for Jinwoo because it's calm and measured. Fenrir for Trump because it's energetic and slightly unhinged. Puck for Derpy because it sounds like it's about to trip over something.

observability: knowing what the cat is actually doing

One thing I got right early was observability. Both Cloud Run services emit structured JSON logs via log/slog, ship traces to Cloud Trace via OpenTelemetry, and the ADK orchestrator uses ADK's built-in telemetry to trace agent execution.

```go
// main.go — Cloud Trace initialization
traceExporter, traceErr := texporter.New(texporter.WithProjectID(projectID))
if traceErr != nil {
    slog.Warn("cloud trace init failed — tracing disabled", "error", traceErr)
} else {
    tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(traceExporter))
    otel.SetTracerProvider(tp)
    defer tp.Shutdown(context.Background())
}
```

Every WebSocket connection gets a trace span. Every ADK orchestrator call gets a child span. Every agent in the graph gets its own span via ADK telemetry. In Cloud Trace, you can see exactly which wave took how long, which agent was the bottleneck, and whether the latency improvement from parallelization is holding up in production.

The gateway also tracks session metrics in Firestore: utterance count, response count, interrupt count, response rate, interrupt rate. The AdaptiveScheduler reads these metrics and adjusts cooldown thresholds — if the developer is interrupting the cat frequently, it backs off and speaks less. If the developer is engaging, it becomes more proactive.

what actually broke (and what I had to fix myself)

Honest section: I was using delegated implementation agents to help write the backend code. They kept erroring on the ADK Go SDK — the API surface changed between v0.4 and v0.5, and the agents were generating code against the old API. parallelagent.New signature changed. sequentialagent.Config field names changed. The agents would generate plausible-looking code that didn't compile.

I ended up implementing the agent graph directly, reading the ADK Go SDK source on GitHub to understand the actual v0.5 API. The graph construction code above is what I wrote by hand after the third failed agent attempt. Sometimes the fastest path is just doing it yourself.

The Swift client had its own set of surprises. The full-screen transparent overlay for mouse tracking was the trickiest: the window needs to be transparent, non-activating, and always-on-top, but also needs to pass through mouse events to the apps underneath. Getting that combination right in SwiftUI required dropping down to AppKit for the window configuration. The overlay is an NSPanel with ignoresMouseEvents = true and level = .floating, with a SwiftUI view hosted inside it.

where it ended up

VibeCat is deployed on Cloud Run in asia-northeast3. Two services: realtime-gateway and adk-orchestrator. Firestore for persistence. Secret Manager for the API key. Artifact Registry for container images. Cloud Build for CI/CD.

The macOS client is a Swift 6 app with a status bar icon, a transparent overlay for the cat sprite, a chat bubble that appears when the cat speaks, and a decision overlay HUD that shows what the cat is thinking in real time. Screen capture runs at configurable intervals. Audio capture streams PCM 16kHz to the gateway continuously when the microphone is active.

The 9 agents are all wired and running. The 3-wave parallel execution is live. The 6 characters are selectable from the status bar menu. Cross-session memory works — the cat remembers what you were debugging yesterday. The celebration trigger fires when tests pass. The mood detector backs off when you're in flow.

It's not perfect. The latency is still noticeable on cold starts. The mood detection is heuristic-based and sometimes misreads focused silence as frustration. The chat bubble occasionally clips on ultra-wide monitors. But it works, it's deployed, and it does the thing it was supposed to do: fill the empty chair.

Building this for the Gemini Live Agent Challenge forced me to think carefully about what "agentic" actually means in a real product context. It's not about having many agents — it's about having agents that each do one thing well, that compose cleanly, and that produce decisions fast enough to feel natural. The 3-wave parallel architecture is the thing I'm most proud of. It's the difference between a cat that feels alive and a cat that feels like it's loading.

The stack: Swift 6 + Go 1.24 + google.golang.org/genai + google.golang.org/adk v0.5.0 + Gemini Live API + VAD. GCP: Cloud Run, Firestore, Secret Manager, Cloud Trace, Cloud Logging, Artifact Registry. Models: gemini-2.5-flash-native-audio-latest for live conversation, gemini-2.5-flash-preview-tts for proactive speech, gemini-3.1-flash-lite-preview for vision analysis.

The cat is watching. It knows you've been on that function for too long.

