KimSejun
submitting vibecat: what 3 weeks of building a desktop AI actually taught me

I wrote this post as an entry for the Gemini Live Agent Challenge. But it isn't a pitch; it's what I actually learned.

Three weeks ago, VibeCat was a failed video transcription app called Missless. Today it's a proactive desktop companion that watches your screen, suggests actions before you ask, and moves your mouse to execute them. We pivoted, rebuilt from scratch, shipped to Google Cloud Run, and submitted to Devpost with hours to spare.

Here's the honest retrospective.

the pivot that saved us

Missless was supposed to do real-time video transcription with Gemini. It worked — technically. But the demo was boring. You'd talk, text appeared on screen, end of demo. No one would watch that for 4 minutes.

The pivot happened on March 4th. We asked ourselves: what if instead of processing video passively, the AI could see the screen and act on what it sees? Not a chatbot. Not a voice assistant. A colleague who happens to be a cat.

VibeCat's core loop came from a single frustrated user request: "LOOK! DECIDE! MOVE! CLICK! VERIFY!" — five words, shouted in all caps. That became the product.

VibeCat — Your Proactive Desktop Companion

the architecture that actually shipped

```
macOS Client (Swift 6.2)
  → WebSocket
    → Realtime Gateway (Go, Cloud Run)
      → Gemini Live API (voice + vision + FC)
      → ADK Orchestrator (screenshot analysis)
      → Firestore (session state)
      → Cloud Logging (observability)
```

Three services. One WebSocket connection. Five function calling tools. That's it.

The simplicity was intentional. We had 3 weeks. Every architectural decision was "what's the minimum that works reliably?" We considered an event-driven microservices setup with separate vision, NLU, and action services. We considered a local-first architecture with on-device models. We considered both and built neither. One gateway, one orchestrator, one client.

The pendingFC mechanism — where function calls queue up and execute strictly one at a time with verification between each — was the most important architectural decision. It added latency but eliminated an entire category of bugs where Gemini would fire three actions simultaneously and corrupt the UI state.
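The queue can be sketched in Go (the gateway's language). Every name here is illustrative, not the actual implementation: the real executor drives the mouse and keyboard, and verification runs through screenshot analysis.

```go
package main

import "fmt"

// FunctionCall is a minimal stand-in for one tool invocation requested
// by the model (the real struct carries more fields).
type FunctionCall struct {
	Name string
	Args map[string]any
}

// PendingFC queues function calls and drains them strictly one at a
// time, verifying each action before the next is allowed to run.
type PendingFC struct {
	queue   []FunctionCall
	execute func(FunctionCall) error // performs the action (mouse, keyboard, ...)
	verify  func(FunctionCall) bool  // e.g. post-action screenshot analysis
}

func (p *PendingFC) Enqueue(fc FunctionCall) {
	p.queue = append(p.queue, fc)
}

// Drain stops at the first execution or verification failure, so the
// agent can re-plan instead of compounding a corrupted UI state.
func (p *PendingFC) Drain() error {
	for len(p.queue) > 0 {
		fc := p.queue[0]
		p.queue = p.queue[1:]
		if err := p.execute(fc); err != nil {
			return fmt.Errorf("%s failed: %w", fc.Name, err)
		}
		if !p.verify(fc) {
			return fmt.Errorf("%s did not pass verification", fc.Name)
		}
	}
	return nil
}
```

The serialization is the point: even if the model emits three calls in one turn, they land in the queue and execute one by one, each gated by verification.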

three scenarios, three different nightmares

YouTube Music (S1): The hardest. YouTube renders player controls on <canvas>, invisible to the Accessibility API. Our first approach — keyboard shortcuts — worked but looked robotic. Our second approach — vision-based mouse control — looked incredible but failed 40% of the time because Gemini's coordinate estimates were off by 20+ pixels on Retina displays. The solution: try AX first, fall back to CDP (Chrome DevTools Protocol), then fall back to vision coordinates. Self-healing with max 2 retries. Final success rate: 94%.

Code Enhancement (S2): VibeCat reads your code in Antigravity IDE, proactively suggests improving documentation, types "Enhance the comments for this code" into Gemini Chat, and lets the AI rewrite. This was surprisingly stable — 100% success rate after the first week. The trick was using navigate_text_entry with the AX tree instead of trying to click into the text field.

Terminal Automation (S3): VibeCat switches to Terminal, runs go vet ./..., and verifies the output. Also 100% after stabilization. Terminal is the most AX-friendly app on macOS — every element is properly labeled and positioned.

what gemini live API actually enables

I'd used Gemini's regular API before. The Live API is a different experience entirely.

The killer feature isn't voice or vision individually — it's that they exist in the same session simultaneously. VibeCat can see a screenshot, hear the user say "yeah, fix that," understand both inputs in context, and issue a function call — all in one streaming session with sub-second latency.

Function Calling over Live API was the primitive that made proactive desktop control possible. Without it, we'd need a separate intent classification step, a separate action planning step, and a separate execution step — each adding latency and losing context. With FC, Gemini does all three in one inference pass.

The gotcha: Gemini sometimes hallucinates tool usage. It says "I've typed the command" without actually calling the tool. Our inject-text approach had a 40% failure rate from this. The fix was simple but non-obvious — send actual user voice input instead of programmatic text injection. When the user speaks, Gemini takes the FC path; when we inject text, Gemini sometimes takes the "just respond" path.

the self-healing engine

Most automation agents fail and stop. VibeCat fails, switches to a different approach, and tries again — all while narrating what it's doing to the user.

```
🔍 Analyzing screen...
▶️ Clicking play button [AX]
⚠️ Button not found — retrying with CDP
▶️ Clicking play button [CDP]
⚠️ CDP target unavailable — retrying with Vision
▶️ Moving cursor to (847, 423) [Vision]
✅ Music is playing!
```

Three grounding sources (Accessibility API, Chrome DevTools Protocol, Vision coordinates), max 2 retries, post-action verification via ADK screenshot analysis. The transparent narration turned out to be more important than the retry logic itself — users who see the recovery process trust the system. Users who see silent failure don't.

Enhanced code with Gemini analysis panel

System architecture diagram

Running on Google Cloud Platform

the demo video pipeline

The demo video deserves its own post, but briefly: we built a fully automated pipeline with Gemini TTS (Zephyr voice for the cat, Charon for narration), MiniMax cloned voice for the human narrator, background music from the actual YouTube Music song played in the demo, and ffmpeg for composition. The dubbing script is a JSON file with millisecond-precision timestamps. Running one shell script regenerates the entire video from source clips.
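As a sketch of how millisecond cues can drive the mix: the JSON schema below is a guess at what such a dubbing script could contain, not our actual format, but ffmpeg's adelay filter is the standard way to shift each clip to its offset before mixing.

```go
package main

import "fmt"

// Cue is one dubbed line. This schema is a hypothetical illustration,
// not the project's real dubbing-script format.
type Cue struct {
	StartMs int    `json:"start_ms"`
	File    string `json:"file"`
}

// FFmpegDelayArgs builds ffmpeg arguments that load each voice clip and
// shift it to its millisecond offset via adelay (one delay value per
// stereo channel), ready to be mixed into the final track.
func FFmpegDelayArgs(cues []Cue) []string {
	args := []string{}
	filter := ""
	for i, c := range cues {
		args = append(args, "-i", c.File)
		if i > 0 {
			filter += ";"
		}
		filter += fmt.Sprintf("[%d]adelay=%d|%d[a%d]", i, c.StartMs, c.StartMs, i)
	}
	return append(args, "-filter_complex", filter)
}
```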

numbers

  • 17 dev.to devlogs published during the challenge
  • 3 demo scenarios all passing E2E
  • 5 FC tools for desktop control
  • 80+ key codes mapped in AccessibilityNavigator
  • 3 grounding sources with automatic fallback
  • 10+ Cloud Run deployments during the final week
  • 144 seconds of demo video, fully dubbed with 3 distinct voices
  • 1 cat who never sleeps

what I'd tell someone starting a similar project

Start with the demo, not the architecture. We spent the first week building infrastructure and the last week desperately trying to make it demo-ready. If I could restart, I'd record a fake demo video on day one and work backward from "what needs to be real."

Proactive AI is harder than reactive AI. It's easy to make an agent that responds to commands. It's hard to make one that speaks up at the right moment without being annoying. The confirmation gate — always waiting for user approval — was the single most important UX decision. It makes proactive feel safe instead of scary.

Narrate everything. Silent processing feels broken. Transparent processing feels collaborative. Show the user what the AI is doing, why it's doing it, which tool it's using, and whether it succeeded. The overlay panel cost us two days to build and was worth every hour.

Gemini Live API + Function Calling is genuinely powerful. Real-time multimodal input with structured tool invocation in a single session — this combination enables interaction patterns that weren't possible before. It's not perfect (hallucinated tool calls are real), but it's the right foundation for desktop AI agents.

VibeCat started as a joke name. "Vibe coding, but with a cat." Three weeks later, the cat watches your screen, suggests improvements, moves your mouse, and verifies its own work. It's still a cat. But now it's a pretty capable one.
