KimSejun

teaching a cat to use a mouse — literally

I wrote this post as my entry for the Gemini Live Agent Challenge, and honestly, this was the feature that almost broke us.

Our user's feedback was blunt: "Why aren't you using vision to control the mouse directly?" And then, more specifically: "The cursor should glide smoothly, find its target visually, move again, and click — that's the WOW factor."

He was right. Sending keyboard shortcuts and accessibility API calls is reliable, but it looks like a script running. A cursor that glides across the screen, finds its target visually, and clicks — that looks like intelligence.

So we built the LOOK → DECIDE → MOVE → CLICK → VERIFY pipeline.

the five-stage pipeline

Here's what happens when VibeCat decides to click something on your screen:

LOOK — VibeCat captures a screenshot via ScreenCaptureKit. This isn't a polling loop; it's triggered when the gateway's proactive companion decides an action is needed. The screenshot goes to Gemini's vision model along with the current AX (Accessibility) snapshot for context.

DECIDE — Gemini analyzes the screenshot and returns a target. This could be "the Play button on YouTube Music at approximately (847, 423)" or "the text field labeled 'Search' in the Antigravity IDE sidebar." The key insight: we don't just get coordinates. We get a semantic description of what to click and why, which feeds into the transparent feedback overlay.

MOVE — animateCursorTo in AccessibilityNavigator.swift smoothly interpolates the cursor position over ~300ms using a cubic smoothstep easing curve. This is purely cosmetic, but it's what makes VibeCat feel like a colleague reaching for the mouse rather than a teleporting robot.

import AppKit

func animateCursorTo(_ target: CGPoint, duration: TimeInterval = 0.3) {
    // NSEvent.mouseLocation is in Cocoa coordinates (origin bottom-left),
    // but CGEvent expects CoreGraphics coordinates (origin top-left) — flip Y.
    let screenHeight = NSScreen.screens[0].frame.height
    let cocoaStart = NSEvent.mouseLocation
    let start = CGPoint(x: cocoaStart.x, y: screenHeight - cocoaStart.y)
    let steps = max(1, Int(duration * 60)) // ~60fps
    for i in 0...steps {
        let t = Double(i) / Double(steps)
        let eased = t * t * (3 - 2 * t) // smoothstep: cubic ease-in-out
        let x = start.x + (target.x - start.x) * eased
        let y = start.y + (target.y - start.y) * eased
        CGEvent(mouseEventSource: nil, mouseType: .mouseMoved,
                mouseCursorPosition: CGPoint(x: x, y: y),
                mouseButton: .left)?.post(tap: .cghidEventTap)
        // Blocks the calling thread; run this off the main thread in practice.
        Thread.sleep(forTimeInterval: duration / Double(steps))
    }
}
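The easing math is easy to sanity-check in isolation. Here is a minimal Go port of the interpolation, as a standalone helper of my own (not code from the project):

```go
package main

import "fmt"

// smoothstep eases t in [0,1] with zero velocity at both ends,
// which is what makes the cursor look like it accelerates and settles.
func smoothstep(t float64) float64 {
	return t * t * (3 - 2*t)
}

// interpolate returns the position at progress t between start and end.
func interpolate(start, end, t float64) float64 {
	return start + (end-start)*smoothstep(t)
}

func main() {
	// Endpoints are exact, and the midpoint lands exactly halfway.
	fmt.Println(smoothstep(0.0), smoothstep(0.5), smoothstep(1.0)) // 0 0.5 1
	fmt.Println(interpolate(100, 900, 0.5))                        // 500
}
```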

CLICK — A CGEvent mouse click at the current cursor position. Simple, but the timing matters — we add a 50ms delay after the final move to let the OS register the cursor position before clicking.

VERIFY — Another screenshot capture, sent to the ADK Orchestrator for vision analysis. "Did the button state change? Is the expected content now visible?" If verification fails, the self-healing engine kicks in with an alternative grounding strategy.
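Stitched together, the five stages are a short loop. Here is a hedged Go sketch of what a gateway-side driver could look like; the Pipeline type and its stage functions are stand-ins of my own for the real ScreenCaptureKit, Gemini, and CGEvent calls, not the project's actual API:

```go
package main

import (
	"fmt"
	"time"
)

// Target is what DECIDE returns: coordinates plus a semantic description
// that feeds the transparent feedback overlay.
type Target struct {
	X, Y int
	Why  string
}

// Stage functions are injected so the pipeline itself stays testable.
type Pipeline struct {
	Look   func() []byte                     // screenshot bytes
	Decide func(shot []byte) (Target, error) // vision model picks a target
	Move   func(t Target)                    // animated cursor move (~300ms)
	Click  func()                            // mouse click at cursor position
	Verify func() bool                       // second screenshot + vision check
}

// Run executes LOOK -> DECIDE -> MOVE -> CLICK -> VERIFY once.
func (p *Pipeline) Run() (bool, error) {
	shot := p.Look()
	target, err := p.Decide(shot)
	if err != nil {
		return false, err
	}
	p.Move(target)
	// Give the OS a beat to register the cursor position before clicking.
	time.Sleep(50 * time.Millisecond)
	p.Click()
	return p.Verify(), nil
}

func main() {
	p := &Pipeline{
		Look:   func() []byte { return nil },
		Decide: func([]byte) (Target, error) { return Target{847, 423, "the Play button"}, nil },
		Move:   func(t Target) { fmt.Printf("glide to (%d,%d): %s\n", t.X, t.Y, t.Why) },
		Click:  func() { fmt.Println("click") },
		Verify: func() bool { return true },
	}
	ok, _ := p.Run()
	fmt.Println("verified:", ok)
}
```

In the real system, a false result from Verify hands control to the self-healing engine rather than just returning.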

three grounding sources, one fallback chain

The real complexity isn't in clicking — it's in finding the right thing to click. VibeCat uses three grounding sources in priority order:

  1. Accessibility API (AX) — The gold standard. macOS exposes UI elements with roles, labels, and positions. When it works, it's pixel-perfect. But YouTube Music renders its player controls on a <canvas> element — completely invisible to AX.

  2. Chrome DevTools Protocol (CDP) — For browser elements AX can't see. Our Go gateway runs chromedp to query DOM elements, get bounding boxes, and execute JavaScript. This catches most canvas-rendered controls.

  3. Vision coordinates — The last resort. Send a screenshot to Gemini, ask "where is the play button?", get approximate pixel coordinates. Less reliable, but it works on literally anything visible on screen.

The self-healing engine (max 2 retries) walks down this chain automatically:

Step 1: Try AX targeting
  → Failed (element not found in AX tree)
Step 2: Try CDP targeting  
  → Failed (Chrome not exposing this element via CDP)
Step 3: Try vision coordinates
  → Got (847, 423), move cursor, click
  → Verify: screenshot shows music is now playing ✓

Cat confirms music is playing

the YouTube Music problem

YouTube Music was our hardest surface. The player controls are canvas-rendered, the site is a single-page app that mutates state without URL changes, and the search results list doesn't expose individual items as clickable AX elements.

Our solution was multi-layered:

  1. Open YouTube Music via navigate_open_url with the search query pre-filled in the URL
  2. Wait for results to load (vision verification of the page state)
  3. Use vision to find the target song/playlist
  4. animateCursorTo to the result
  5. Click via CGEvent
  6. Verify playback started via CDP document.querySelector('video').paused === false
  7. If verification fails, fallback to video.play() via JavaScript injection

We ran this sequence 5 times consecutively in our rehearsal protocol. It passed every time — but only after we added the video.play() fallback. Pure vision-based clicking had about a 60% success rate on first attempt because Gemini's coordinate estimates were sometimes off by 20-30 pixels.
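The verify-then-fallback step (items 6 and 7 above) can be isolated behind a single "evaluate JavaScript" closure. A hedged Go sketch: evalJS is my stand-in for a CDP evaluate call via chromedp, and the JS expressions match the ones described above:

```go
package main

import (
	"errors"
	"fmt"
)

// evalJS abstracts evaluating a JS expression in the page (CDP/chromedp
// in the real gateway) and returning its boolean result.
type evalJS func(expr string) (bool, error)

// ensurePlayback verifies the <video> element is playing and, if not,
// falls back to starting it directly via JavaScript injection.
func ensurePlayback(eval evalJS) error {
	playing, err := eval("document.querySelector('video').paused === false")
	if err != nil {
		return err
	}
	if playing {
		return nil
	}
	// Fallback: bypass clicking entirely and start playback in-page.
	if _, err := eval("document.querySelector('video').play(), true"); err != nil {
		return err
	}
	playing, err = eval("document.querySelector('video').paused === false")
	if err != nil {
		return err
	}
	if !playing {
		return errors.New("playback did not start after video.play() fallback")
	}
	return nil
}

func main() {
	// Fake page state: the click missed, so the first check fails,
	// then play() flips it.
	paused := true
	eval := func(expr string) (bool, error) {
		if expr == "document.querySelector('video').play(), true" {
			paused = false
			return true, nil
		}
		return !paused, nil
	}
	fmt.Println("playback ok:", ensurePlayback(eval) == nil)
}
```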

80 key codes and counting

Beyond mouse control, AccessibilityNavigator.swift maps 80+ macOS key codes for keyboard automation. Things like Cmd+Shift+5 to start screen recording, Cmd+Tab to switch apps, or Ctrl+A to select all text in Terminal. Each key code was manually verified across our three gold-tier surfaces: Antigravity IDE, Terminal, and Chrome.
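For illustration, here is what a slice of such a table can look like, sketched in Go rather than the project's Swift. The virtual key codes come from Carbon's Events.h (kVK_ANSI_5, kVK_Tab, kVK_ANSI_A); the Chord type, entry names, and modifier flags are my own invention:

```go
package main

import "fmt"

// Mod is a modifier bitmask (values illustrative, mirroring CGEventFlags semantics).
type Mod uint8

const (
	Cmd Mod = 1 << iota
	Shift
	Ctrl
)

// Chord pairs a macOS virtual key code (Carbon Events.h) with modifiers.
type Chord struct {
	KeyCode uint16
	Mods    Mod
}

// A small slice of the table; the real map covers 80+ key codes.
var chords = map[string]Chord{
	"screen-record": {0x17, Cmd | Shift}, // kVK_ANSI_5 -> Cmd+Shift+5
	"app-switch":    {0x30, Cmd},         // kVK_Tab    -> Cmd+Tab
	"select-all":    {0x00, Ctrl},        // kVK_ANSI_A -> Ctrl+A (Terminal)
}

func main() {
	c := chords["screen-record"]
	fmt.Printf("keycode 0x%02X, cmd=%v shift=%v\n",
		c.KeyCode, c.Mods&Cmd != 0, c.Mods&Shift != 0)
}
```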

The overlay panel shows all of this in real time — which grounding source is being used, which step of the pipeline we're in, and whether the last verification passed or failed. Users never see a black box. They see VibeCat working.

what I'd do differently

Honestly? I'd invest more in vision coordinate calibration. The 20-30 pixel offset on Retina displays cost us hours of debugging. We eventually solved it by preferring semantic AX targeting wherever possible and only falling back to raw coordinates as a last resort. But if we'd built a proper coordinate calibration system (test click → verify → adjust offset) from day one, the vision path would have been much more reliable.
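The calibration loop described here (test click, verify, adjust offset) can be as small as a damped running estimate of the miss vector. A hypothetical Go sketch, nothing like this exists in the project:

```go
package main

import "fmt"

// Calibrator tracks a systematic (dx, dy) bias in vision coordinates by
// comparing where the target actually was with where we clicked.
type Calibrator struct {
	DX, DY float64
	Alpha  float64 // damping factor in (0, 1]
}

// Adjust applies the current offset estimate to a raw vision coordinate.
func (c *Calibrator) Adjust(x, y float64) (float64, float64) {
	return x + c.DX, y + c.DY
}

// Observe folds in one test-click result: targetX/Y is the element center
// found in the verification screenshot, clickedX/Y is where we clicked.
// The residual miss (after the current correction) shrinks each round.
func (c *Calibrator) Observe(targetX, targetY, clickedX, clickedY float64) {
	c.DX += c.Alpha * (targetX - clickedX)
	c.DY += c.Alpha * (targetY - clickedY)
}

func main() {
	c := &Calibrator{Alpha: 0.5}
	// Vision is consistently ~25px left of the true target on this display.
	for i := 0; i < 6; i++ {
		clickedX, clickedY := c.Adjust(100, 200) // click at corrected estimate
		c.Observe(125, 200, clickedX, clickedY)  // verification finds center at (125, 200)
	}
	x, _ := c.Adjust(100, 200)
	fmt.Printf("corrected x = %.1f\n", x) // converges toward 125
}
```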

The cursor animation, though? That was worth every line of code. When VibeCat smoothly moves the mouse to a YouTube search result and clicks it — people's eyes light up. That's the moment it stops being a demo and starts feeling like the future.
