There's a specific kind of frustration that comes from building AI tools that are technically impressive but feel fundamentally wrong to use. You've built something that can do incredible things — but only when you tell it exactly what to do. It sits there, waiting. Watching. Saying nothing.
That was VibeCat three weeks ago.
I'd built a voice-controlled desktop agent that could navigate Chrome, type into terminals, trigger IDE shortcuts, open URLs — all through natural speech. The Gemini Live API integration was solid. The function calling worked. The accessibility tree traversal was clean. And yet every demo felt like I was operating a very sophisticated remote control. "Open YouTube." "Search for this." "Press Command-S."
The agent was reactive. And reactive felt wrong.
I spent a few days studying the best existing desktop automation agents I could find — the ones that had won competitions, the ones that developers actually used in their workflows. And I noticed something they all had in common: they wait for commands. Every single one. You tell them what to do, they do it, they report back. The interaction model is fundamentally request-response, even when the interface is voice.
That's not how a good colleague works.
A good colleague sitting next to you while you code doesn't wait for you to ask "hey, is there a bug in this function?" They glance at your screen, notice the null check is missing, and say "hey, that might throw if the response is empty — want me to add a guard?" Then they wait for you to say yes or no. They don't act without permission. But they also don't wait for you to notice the problem yourself.
That's the gap I wanted to close.
So I rewrote VibeCat's core identity from the ground up. Not the code — the prompt. The system instruction that shapes how Gemini Live understands its role.
The old prompt was essentially: "You are a voice assistant that can control the desktop. When the user asks you to do something, use these tools."
The new one starts like this:
```
=== VIBECAT: YOUR PROACTIVE DESKTOP COMPANION ===

You are VibeCat, a proactive AI companion for developer workflows on macOS.
You are NOT a passive tool that waits for commands. You are an attentive
colleague who watches the screen, understands context, and proactively
suggests helpful actions.
```
That's not just marketing copy. That framing changes everything about how the model behaves. When you tell Gemini it's a passive tool, it acts like one. When you tell it it's an attentive colleague, it starts noticing things.
The prompt then defines the core loop explicitly:
```
SUGGESTION FLOW (always follow this pattern):
1. OBSERVE: notice something relevant on screen via video frames
2. SUGGEST: propose a specific helpful action in a friendly, natural tone
3. WAIT: let the user confirm with "sure", "go ahead", "yeah", etc.
4. ACT: call the appropriate tool to execute
5. FEEDBACK: confirm what you did and ask if it helped
```
OBSERVE → SUGGEST → WAIT → ACT → FEEDBACK. Five steps. The WAIT step is the one that makes this feel safe rather than scary. The agent never acts without permission. But it also never stays silent when it has something useful to say.
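As a mental model, the loop is a strict little state machine: each stage has exactly one legal successor, and WAIT is the only stage that can short-circuit the cycle when the user declines. This is an illustrative sketch with made-up types, not VibeCat's actual implementation:

```go
package main

import "fmt"

// Stage models the five-step suggestion flow. Illustrative only.
type Stage int

const (
	Observe Stage = iota
	Suggest
	Wait
	Act
	Feedback
)

// Next returns the following stage. From Wait, a declined suggestion
// loops back to Observe instead of proceeding to Act.
func Next(s Stage, confirmed bool) Stage {
	switch s {
	case Observe:
		return Suggest
	case Suggest:
		return Wait
	case Wait:
		if confirmed {
			return Act
		}
		return Observe // user said no: go back to watching
	case Act:
		return Feedback
	default: // Feedback
		return Observe // cycle restarts
	}
}

func main() {
	fmt.Println(Next(Wait, true))  // 3 (Act)
	fmt.Println(Next(Wait, false)) // 0 (Observe)
}
```

The key property is that there is no edge from SUGGEST straight to ACT: every action is gated behind an explicit confirmation.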
The prompt gives concrete examples of what proactive behavior looks like in practice:
- See the user coding for a long time → "You have been working hard. Want me to play some music on YouTube?"
- See a code issue or missing logic → "I notice there is a gap in this code. Want me to add the missing part?"
- See a basic terminal command → "By the way, ls with dash al gives more detail. Want me to try that instead?"
- See an error message → "I see an error there. Want me to look up the docs for that?"
- See a test failing → "That test failed. Want me to re-run it with verbose output?"
These aren't hypothetical. I've seen VibeCat do all of these in actual use. The test failure one is my favorite — you run your tests, one fails, and before you've even processed what went wrong, VibeCat says "that test failed, want me to rerun with verbose output?" You say yeah, it runs `go test -v ./...`, and you're already reading the detailed output before you would have even typed the command.
That's the feeling I was chasing. That's what "proactive" actually means in practice.
Now let me talk about the technical implementation, because the prompt is only half the story.
VibeCat registers five function-calling tools with Gemini Live. I want to explain why exactly these five, because the choice matters.
```go
func navigatorToolDeclarations() *genai.Tool {
	return &genai.Tool{
		FunctionDeclarations: []*genai.FunctionDeclaration{
			{Name: "navigate_text_entry", ...},
			{Name: "navigate_hotkey", ...},
			{Name: "navigate_focus_app", ...},
			{Name: "navigate_open_url", ...},
			{Name: "navigate_type_and_submit", ...},
		},
	}
}
```
`navigate_text_entry` — types text into a focused field. The key design decision here is the `submit` parameter: default true for search boxes, terminals, and URL bars; false for form fields where you just want to fill in text. This distinction matters because "type this into the search box" and "fill in this form field" are different actions with different expected outcomes.

`navigate_hotkey` — sends keyboard shortcuts. This is the workhorse for app-specific actions. YouTube play/pause is `["space"]`; the Antigravity IDE file picker is `["command", "p"]`. The tool accepts an optional target app name — if provided, it focuses that app first, then sends the hotkey. This lets you say "pause YouTube" while you're in your IDE and have it work correctly.

`navigate_focus_app` — switches to an application by name. Simple, but essential: you can't do anything useful if you're sending keystrokes to the wrong app.

`navigate_open_url` — opens a URL in the default browser. This one gets used constantly for the proactive suggestions. "Want me to look up the docs for that error?" → `navigate_open_url` with the relevant documentation URL.

`navigate_type_and_submit` — types text and optionally presses Enter. This is the terminal command tool. When VibeCat suggests running `ls -la` instead of `ls`, it uses this to type the command and submit it.
Five tools. Not ten, not twenty. Five. The constraint forces clarity about what the agent can actually do, and it makes the function calling more reliable because Gemini has fewer choices to get confused about.
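The declaration bodies are elided in the snippet above. To give a sense of what one looks like on the wire, here's a plausible shape for `navigate_text_entry` as the JSON schema that function calling expects. The `text` and `submit` parameter names follow the description above, but the description strings are my own wording, not copied from the repo:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// textEntryDeclaration returns a plausible wire-format shape for the
// navigate_text_entry tool. Descriptions are illustrative guesses.
func textEntryDeclaration() map[string]any {
	return map[string]any{
		"name":        "navigate_text_entry",
		"description": "Type text into the currently focused field.",
		"parameters": map[string]any{
			"type": "object",
			"properties": map[string]any{
				"text": map[string]any{
					"type":        "string",
					"description": "The text to type.",
				},
				"submit": map[string]any{
					"type": "boolean",
					"description": "Press Enter after typing. Default true for " +
						"search boxes, terminals, and URL bars; false for form fields.",
				},
			},
			"required": []string{"text"},
		},
	}
}

func main() {
	out, _ := json.MarshalIndent(textEntryDeclaration(), "", "  ")
	fmt.Println(string(out))
}
```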
The harder engineering problem was sequential multi-step execution.
When VibeCat needs to do something like "open YouTube and search for focus music," that's actually three steps: focus Chrome, navigate to YouTube, type the search query. Gemini might try to emit all three function calls in one response. That doesn't work — you need to wait for each step to complete before starting the next one.
The solution is the pendingFC mechanism. The session state tracks a single pending function call at a time:
```go
type liveSessionState struct {
	// ...
	pendingFCMu             sync.Mutex
	pendingFCID             string
	pendingFCName           string
	pendingFCTaskID         string
	pendingFCText           string
	pendingFCTarget         string
	pendingFCSteps          []navigatorStep
	pendingFCCurrentStep    string
	pendingFCStepRetryCount int
	// ...
}
```
When a function call comes in, it gets queued. The handler executes it, waits for the result, sends the tool response back to Gemini, and only then processes the next step. This keeps the execution sequential and predictable, even when Gemini wants to batch multiple actions.
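That serialization can be sketched with a single worker goroutine draining a channel: even if Gemini emits three calls in one response, they execute strictly one at a time. This is a minimal stand-in with hypothetical types, not the gateway's real handler, which also threads each tool response back to Gemini before dequeuing the next call:

```go
package main

import (
	"fmt"
	"sync"
)

// functionCall is a stand-in for a Gemini tool call. Illustrative only.
type functionCall struct {
	ID   string
	Name string
}

// executor serializes tool calls: callers enqueue, and a single worker
// executes one call fully before picking up the next.
type executor struct {
	queue chan functionCall
	wg    sync.WaitGroup
}

func newExecutor(run func(functionCall) string) *executor {
	e := &executor{queue: make(chan functionCall, 16)}
	e.wg.Add(1)
	go func() {
		defer e.wg.Done()
		for fc := range e.queue {
			// In the real gateway, this is where the step executes,
			// the result is awaited, and the tool response is sent
			// back to Gemini before the next call is dequeued.
			result := run(fc)
			fmt.Printf("%s -> %s\n", fc.Name, result)
		}
	}()
	return e
}

func (e *executor) enqueue(fc functionCall) { e.queue <- fc }

func (e *executor) close() {
	close(e.queue)
	e.wg.Wait()
}

func main() {
	ex := newExecutor(func(fc functionCall) string { return "ok" })
	// Gemini may batch several calls; they still run strictly in order.
	ex.enqueue(functionCall{ID: "1", Name: "navigate_focus_app"})
	ex.enqueue(functionCall{ID: "2", Name: "navigate_open_url"})
	ex.enqueue(functionCall{ID: "3", Name: "navigate_text_entry"})
	ex.close()
}
```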
But what happens when a step fails?
This is where self-healing comes in. The retry logic is simple but effective:
```go
retryCount := ls.incrementFCStepRetry()
if retryCount <= 2 {
	// retry with alternative grounding source
	if retryCount == 2 && retryStep.FallbackActionType != "" {
		// use fallback action type on second retry
	}
	slog.Info("navigator FC self-healing retry",
		"step_id", retryStep.ID,
		"retry", retryCount,
		"status", refreshMsg.Status)
}
```
Max 2 retries. On the first retry, it tries an alternative grounding source — if the accessibility tree lookup failed, try CDP. On the second retry, it uses the fallback action type if one is defined. After 2 retries, it fails gracefully and tells the user what happened.
The grounding sources are what make this work. VibeCat has three ways to understand and interact with the screen:
Accessibility (AX) — the native macOS accessibility tree. This is the primary source. Every UI element has an AX role, label, and value. For most desktop apps, this is all you need.
Chrome DevTools Protocol (CDP) — direct browser element interaction via chromedp. This is the fallback for Chrome when the AX tree doesn't have enough detail. CDP can click specific DOM elements, read page content, take screenshots of specific regions. It's slower than AX but more precise for complex web UIs.
Vision — Gemini screenshot analysis via the ADK orchestrator. When both AX and CDP fail, or when you need to verify that an action actually worked, you take a screenshot and ask Gemini to analyze it. "Did the search query get entered correctly?" "Is the YouTube video playing?" This is the slowest path but the most reliable for verification.
Triple-source grounding. The agent tries the fast path first, falls back to the slower paths if needed, and always verifies the result for risky actions.
The vision verification piece deserves more detail because it's the part that makes the feedback loop actually trustworthy.
After executing a risky or complex action, VibeCat requests a screen capture from the client:
```go
type pendingVisionVerification struct {
	fcID     string
	fcName   string
	fcText   string
	fcTarget string
	taskID   string
	observed string
	imgCh    chan visionCapturePayload
}
```
The client sends back a JPEG screenshot. The gateway forwards it to the ADK orchestrator, which uses Gemini to analyze whether the action succeeded. The result comes back as a structured response — success, failure, or uncertain — and VibeCat uses that to decide what to say to the user.
This is why VibeCat can say "Done! The fix is applied" with actual confidence rather than just assuming the action worked. It checked.
The UX piece that I underestimated was the feedback loop itself.
Users hate silence. When you ask an AI to do something and it goes quiet for 3 seconds, you don't know if it's working, if it failed, if it misunderstood you. That uncertainty is exhausting. It makes you distrust the system even when it's working correctly.
VibeCat solves this with processingStateMsg — a message type that the gateway sends to the client during execution to show what's happening:
```go
type processingStateMsg struct {
	Type        string `json:"type"`
	Flow        string `json:"flow"`
	TraceID     string `json:"traceId"`
	Stage       string `json:"stage"`
	Label       string `json:"label"`
	Detail      string `json:"detail,omitempty"`
	Tool        string `json:"tool,omitempty"`
	SourceCount *int   `json:"sourceCount,omitempty"`
	Active      bool   `json:"active"`
}
```
The client shows these as status updates in the overlay HUD. "Focusing Chrome..." → "Navigating to YouTube..." → "Typing search query..." → "Done." You always know what's happening. The silence is gone.
The navigator overlay panel in the Swift client shows grounding badges — little indicators of which source (AX, CDP, Vision) is being used for each step. It's a small thing but it makes the agent feel transparent rather than magical-and-opaque.
Here's a real example of the full flow working end-to-end.
I'm in my IDE, staring at a Go function. VibeCat is watching through the screen capture stream. It notices the function has a potential nil dereference — the code does result.Data[0] without checking if result.Data is empty.
VibeCat says: "I notice there might be a nil dereference in that function — result.Data could be empty. Want me to add a bounds check?"
I say: "Yeah, go ahead."
VibeCat calls navigate_focus_app with "Antigravity", then navigate_hotkey with ["command", "i"] to open the inline prompt, then navigate_type_and_submit with the specific fix to apply. The IDE's AI assistant applies the change. VibeCat requests a screenshot, the ADK orchestrator confirms the code changed, and VibeCat says: "Done! The bounds check is in place. Want me to run the tests to make sure it compiles?"
That whole interaction took about 8 seconds. I didn't type anything. I didn't navigate any menus. I just said "yeah."
That's the thing I was trying to build. That's what proactive means.
The architecture that makes this possible is a Go WebSocket gateway running on Cloud Run, connected to Gemini Live API for real-time voice and vision, with a separate ADK orchestrator for screenshot analysis and confidence escalation. The macOS client is native Swift — screen capture, accessibility execution, overlay UI, voice transport. All the AI reasoning stays server-side.
The Gemini Live API is doing a lot of heavy lifting here. It's receiving video frames from the screen capture stream, audio from the microphone, and it's maintaining a continuous conversation context across all of that. The function calling happens within that same live session — Gemini decides to call a tool, the gateway handles it, sends back the result, and the conversation continues. No round-trips to a separate API. No context loss between turns.
The ProactiveAudio flag in the session config enables Gemini's built-in proactivity features:
```go
if cfg.ProactiveAudio {
	t := true
	lc.Proactivity = &genai.ProactivityConfig{
		ProactiveAudio: &t,
	}
}
```
This tells Gemini it's allowed to speak without being spoken to — to initiate suggestions based on what it sees. Combined with the system prompt that defines how to be proactive, this is what enables the OBSERVE → SUGGEST flow.
Building VibeCat for this challenge has genuinely changed how I think about desktop AI agents. The reactive model — where you tell the agent what to do — is the wrong mental model. It's a voice-controlled remote control, not a colleague.
The proactive model is harder to build. You have to think carefully about when to speak and when to stay quiet. You have to make the suggestions feel natural rather than intrusive. You have to earn the user's trust before they'll let you act on their behalf. But when it works, it feels qualitatively different from anything I've built before.
The agent is watching. It's thinking. And when it has something useful to say, it says it.
That's the version of desktop AI I want to use every day.
VibeCat is open source and submitted to the Gemini Live Agent Challenge (UI Navigator category). The full implementation — system prompt, FC tool declarations, pendingFC mechanism, self-healing retry, vision verification, CDP integration — is all in the repo. If you're building something similar, I hope the technical details here are useful.
The code is messy in places. The retry logic has edge cases I haven't handled yet. The vision verification adds latency I'm still optimizing. But the core loop works, and it feels right in a way that the reactive version never did.
OBSERVE → SUGGEST → WAIT → ACT → FEEDBACK.
That's VibeCat.
I created this post for the purposes of entering the Gemini Live Agent Challenge.