why i stopped letting nine agents argue over one click
I wrote this post as an entry for the Gemini Live Agent Challenge, but it is really about admitting that I spent a while solving the wrong problem.
For a few days, VibeCat looked incredible in architecture diagrams. I had named agents. I had parallel waves. I had boxes for mood, celebration, engagement, memory, search, mediation. Every time I added one more box, the system felt more sophisticated.
Then I tried to use it for an actual desktop action.
Not a grand demo. Something boring. "Open the official docs." The kind of request that should feel instant.
And that was the moment the architecture stopped feeling smart and started feeling expensive.
The graph itself wasn't wrong. It was just sitting in the wrong part of the product.
Those older posts were honest snapshots of the project at the time. The graph solved real problems. It just wasn't the thing that should own every user-facing action.
the embarrassing realization
I had been treating "how many capabilities exist" as if it were the same question as "how many active decision-makers should be in the hot path."
Those are not the same thing.
VibeCat absolutely does have many capabilities. It can analyze the screen, keep memory, do research, reason about ambiguity, classify risk, and decide whether a step should run locally or not.
But when the user says something concrete like:
- "open the official docs"
- "type this in the search box"
- "run that again"
nobody cares that the internal graph is elegant. They care whether the system moves now, and whether it moves safely.
I had built a system that was very good at explaining itself and not yet strict enough about acting.
what changed
The turning point was realizing that the product is easier to understand in three planes:
Gemini Live + VAD -> talks to the user
navigator worker -> decides the next safe step
local macOS executor -> actually focuses, types, clicks, verifies
That is the part I should have led with from the beginning.
The always-on Live session is the PM. It handles the messy human side: interruptions, vague requests, clarification, short confirmations, "no, not that tab, the other one."
The worker is much less charming. It has one job: take an actionable request, classify it, decide whether it is ambiguous or risky, plan one step, then wait for verification.
The local executor is narrower still. It looks at the current app, the focused element, the AX tree, and the current window state, then tries to perform exactly one step without pretending confidence it doesn't have.
Once I drew the system that way, the product made more sense immediately.
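The three planes can be sketched as narrow interfaces that only talk downward. This is a hypothetical sketch under illustrative names (`LiveSession`, `NavigatorWorker`, `LocalExecutor`, `Step`), not VibeCat's actual code:

```python
from dataclasses import dataclass

@dataclass
class Step:
    step_id: str
    action: str        # e.g. "open", "type", "click"
    target: str        # identity of the UI element the step expects
    risky: bool = False

class LocalExecutor:
    """Plane 3: performs exactly one step on the desktop and verifies it."""
    def run(self, step: Step) -> str:
        # Real code would hit the macOS AX tree here; this sketch just pretends.
        performed = True
        return f"done:{step.step_id}" if performed else f"failed:{step.step_id}"

class NavigatorWorker:
    """Plane 2: classifies the request and plans exactly one step."""
    def __init__(self, executor: LocalExecutor):
        self.executor = executor

    def handle(self, request: str) -> str:
        step = Step(step_id="s1", action="open", target="official docs")
        if step.risky:
            return "that looks risky; confirm first?"
        return self.executor.run(step)

class LiveSession:
    """Plane 1: talks to the user, forwards only concrete requests."""
    def __init__(self, worker: NavigatorWorker):
        self.worker = worker

    def on_user_utterance(self, text: str) -> str:
        if self._is_vague(text):
            return "which tab do you mean?"  # clarify instead of acting
        return self.worker.handle(text)

    def _is_vague(self, text: str) -> bool:
        # Toy heuristic: a bare "that" is ambiguous, "run that again" is not.
        words = text.split()
        return "that" in words and "again" not in words
```

The point of the shape, not the toy logic: ambiguity is absorbed at the top plane, planning in the middle, and execution at the bottom, so no plane has to do another plane's job.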
the part i did not throw away
This is the part I wish I had explained better in the public posts: I did not "discover that multi-agent systems are fake" or anything dramatic like that.
The 9-agent graph was useful. It still is useful.
It is just better as a background intelligence lane than as the thing that every single UI action has to march through.
Memory still helps. Research still helps. Low-confidence screen analysis still helps. Session summaries still help. Multimodal checks still help.
But those capabilities should come in when they add accuracy, not because I am emotionally attached to the architecture.
That was the real pivot: the intelligence stayed, but it moved behind the worker.
one rule fixed half the product
The biggest practical improvement came from one boring rule:
only one executable task can be active at a time.
Before that, a lot of weird bugs shared the same root cause:
- the right action in the wrong app
- typing into the wrong field after the UI changed
- continuing an old plan because a stale refresh arrived late
- silently juggling two user intents at once and doing neither well
Once the system had exactly one current task, one current step, and one verification loop, a lot of the magic stopped being magical and started being debuggable.
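The one-task rule can be sketched as a single slot guarded by step ids, so a stale refresh from a superseded step is dropped instead of acted on. The names here (`TaskSlot`, `on_refresh`) are illustrative assumptions, not the real implementation:

```python
import itertools

class TaskSlot:
    """Holds exactly one current task, one current step id, one verify loop."""
    _ids = itertools.count(1)

    def __init__(self):
        self.task = None
        self.step_id = None

    def start(self, task: str) -> int:
        # A new user intent replaces the old task instead of running alongside it.
        self.task = task
        self.step_id = next(self._ids)
        return self.step_id

    def on_refresh(self, step_id: int, screen_state: str) -> bool:
        # A refresh tagged with a superseded step id is ignored; this is what
        # stops the system from continuing an old plan on a stale screen.
        if step_id != self.step_id:
            return False
        self.verify(screen_state)
        return True

    def verify(self, screen_state: str) -> None:
        ...  # check that the screen actually reflects the step we just took
```

Every bug in the list above becomes a rejected `on_refresh` call instead of a wrong action, which is exactly the "debuggable instead of magical" trade.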
That trade is worth it every time.
I would much rather have a desktop agent that feels slightly stricter than one that feels "clever" right up until it pastes into the wrong input field.
the request that made it obvious
The request that finally broke my attachment to the old framing was text entry.
If the user says, "type gemini live api here," the system cannot answer with a pretty explanation about context. It has to either find the field and type into it, or admit it cannot verify the target.
That means the hot path needs very boring things:
- focus state
- target identity
- step ids
- risk checks
- post-action refresh
- replacement logic if the user changes their mind mid-flight
That is not where I want a council of equal agents debating the meaning of the moment.
That is where I want one worker making one decision.
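That one decision can be sketched as a short checklist over the focus state. The field names, role string, and risk words below are assumptions for illustration, not VibeCat's real executor:

```python
from dataclasses import dataclass

@dataclass
class FocusState:
    app: str
    element_role: str     # e.g. "AXTextField", from the AX tree
    element_label: str

# Illustrative risk list; a real check would be much broader.
RISKY_WORDS = {"password", "card"}

def type_step(text: str, target_label: str, focus: FocusState) -> str:
    # 1. Target identity: refuse to type if we cannot verify the field.
    if focus.element_role != "AXTextField" or focus.element_label != target_label:
        return "cannot verify target field"
    # 2. Risk check: anything sensitive needs explicit confirmation first.
    if any(w in focus.element_label.lower() for w in RISKY_WORDS):
        return "needs confirmation"
    # 3. Act; a post-action refresh would then confirm the text landed.
    return f"typed {text!r} into {focus.element_label}"
```

Either the checks pass and one step runs, or the system admits it cannot verify the target, which is the only honest answer for "type gemini live api here."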
what this changed emotionally
This pivot also fixed something less technical: I stopped feeling like I had to constantly defend the architecture.
Before, when I described VibeCat, I kept reaching for "graph," "specialists," "waves," and "agents." Those words were accurate, but they were not the thing a user would actually trust.
Now the explanation is simpler, and that simplicity is earned:
there is one thing talking to you.
there is one thing deciding the next step.
there is one thing on your Mac that can do the step and verify it.
That is a product shape.
And honestly, it is the first version of the system that feels like it deserves to exist outside a demo.
Building VibeCat for the Gemini Live Agent Challenge. Source: github.com/Two-Weeks-Team/vibeCat