DEV Community

KimSejun

the day vibecat stopped being a screen-watching demo

I wrote this post as part of my entry for the Gemini Live Agent Challenge, and this was the day the project got a lot less cute and a lot more real.

For a while, the easy way to describe VibeCat was: "it's a cat on your desktop that watches your screen and comments on what you're doing."

That line worked. People got it immediately. It also let me hide from the harder question.

Can it actually do anything useful?

Watching is a good demo. Acting is a product.

And to be fair to the earlier version: that screen-watching phase was not fake work. It taught me what context mattered, what annoyed me, and what the system kept getting almost-right. It was just incomplete.

The moment that became obvious was text entry. If the user says, "type this here," there are only two acceptable outcomes:

  1. the system finds the right field and types the text
  2. the system says it cannot safely verify the target

Everything else is fake confidence.

where the old framing broke

The old companion framing made it easy to focus on observation.

  • what app is open
  • what error is visible
  • whether the user sounds frustrated
  • whether the current screen looks important

That is all still useful. But none of it answers the practical question: is the currently focused element actually the search field in front of you?

A screenshot can tell you a lot. It cannot give you permission to click blindly.

That was the product boundary I needed to respect.

the first time it felt real

The first really convincing moment was not a giant workflow. It was tiny.

Chrome was open. The docs site was already on screen. I said, "type gemini live api here."

The client checked the frontmost app, the focused element, and the accessibility role. The worker verified that the target looked like text input. Then it inserted the text and refreshed context afterward.

That was it.

No fireworks. No theatrical demo beat. Just a boring action landing in the right place.
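The flow above can be sketched in a few lines. To be clear, this is my illustration, not VibeCat's actual code: the function names (`focused_element`, `insert_text`, `refresh_context`) and the accessibility role names are hypothetical stand-ins for whatever the real client/worker split does.

```python
# Hypothetical sketch of the verify-then-type flow described above.
# All names here are illustrative, not VibeCat's real API.

TEXT_INPUT_ROLES = {"AXTextField", "AXTextArea", "AXSearchField"}

def type_here(text, ui):
    """Insert text only after verifying the focused element is a text input."""
    element = ui.focused_element()
    if element is None or element.role not in TEXT_INPUT_ROLES:
        # Outcome 2: refuse rather than click blindly.
        return "cannot safely verify the target"
    ui.insert_text(element, text)
    ui.refresh_context()  # re-read state so the next decision sees the result
    return f"typed into {element.role} in {ui.frontmost_app()}"
```

The important part is the early return: the only two exits are "verified and typed" or "refused", which is exactly the two acceptable outcomes from the text-entry contract.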

That was the moment VibeCat stopped feeling like a mascot wrapped around an LLM and started feeling like a UI navigator.

what the product promise became

The contract is much sharper now.

If intent is clear and the action is low-risk, VibeCat acts.

If the request is ambiguous, it asks one short question.

If the request is risky, it stops and asks for explicit confirmation.

If the target is unclear, it drops to guided mode instead of guessing.

That last part matters the most. There is a huge difference between "I think the input field is somewhere near the top left" and "I found a focused text input and verified it after insertion."

The first one sounds smart. The second one is actually useful.
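Those four branches fit in one small decision function. Again, a sketch: how intent clarity, risk, and target verification actually get scored is not shown here, and the names are mine, not VibeCat's.

```python
from enum import Enum

class Decision(Enum):
    ACT = "act"
    ASK = "ask one short question"
    CONFIRM = "ask for explicit confirmation"
    GUIDED = "drop to guided mode"

def decide(intent_clear: bool, risky: bool, target_verified: bool) -> Decision:
    """Mirror of the contract: never guess, never act on an unverified target."""
    if not intent_clear:
        return Decision.ASK      # ambiguous request -> one short question
    if risky:
        return Decision.CONFIRM  # risky -> explicit confirmation first
    if not target_verified:
        return Decision.GUIDED   # unclear target -> guided mode, not guessing
    return Decision.ACT          # clear intent + low risk + verified target
```

Note the ordering: ambiguity is resolved before risk, and risk before targeting, so a risky-and-ambiguous request gets the clarifying question first.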

why the screen still matters

The screen did not become irrelevant. It just stopped being the whole story.

Now the useful context is a combination of:

  • current app
  • window title
  • focused element role and label
  • selected text
  • accessibility snapshot
  • the latest visual state

That combination is what turns a passive observer into an executor.

The screen tells you what world you are in. Accessibility tells you what object you can safely touch. Verification tells you whether the action actually landed.

That triangle is the product.
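One way to picture that combination is as a single context snapshot handed to the executor. The field names below are my guesses at a shape, not VibeCat's real schema, and the role set is an assumption borrowed from macOS-style accessibility roles.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical context snapshot combining the signals listed above.
TEXT_INPUT_ROLES = {"AXTextField", "AXTextArea", "AXSearchField"}

@dataclass
class ContextSnapshot:
    app: str                      # current app
    window_title: str
    focused_role: Optional[str]   # accessibility role of the focused element
    focused_label: Optional[str]
    selected_text: Optional[str]
    ax_snapshot: dict             # accessibility tree summary
    screenshot_id: Optional[str]  # handle to the latest visual state

    def safe_text_target(self) -> bool:
        """Accessibility, not pixels, decides what can safely be touched."""
        return self.focused_role in TEXT_INPUT_ROLES
```

The screenshot is carried along for "what world am I in", but the act/refuse decision hangs off the accessibility fields.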

the trade i made on purpose

This pivot cost me some of the original companion magic.

The older version had more ambient personality. It could feel like a creature hanging out on your desktop and reacting to your mood.

I still like that version emotionally.

But for a real product, and especially for a challenge entry, "acts safely on natural intent" is a much stronger promise than "sometimes notices things on its own."

That is the trade I made, and I think it was the right one.

The cat is still there. The voices are still there. The screen analysis still matters. But now those things serve the action loop instead of replacing it.

And once I saw that clearly, I couldn't go back to the older pitch.

The cat can still watch your screen.

It just has a job now.


Building VibeCat for the Gemini Live Agent Challenge. Source: github.com/Two-Weeks-Team/vibeCat
