the empty chair problem — why I'm building a desktop AI instead of another chatbot

KimSejun
When you code alone, the chair next to you is empty. This post is my entry for the Gemini Live Agent Challenge, but I'm really writing it because this thought has been stuck in my head for weeks and I need to get it out.

I've been a solo developer for a while. Most of my day looks like this: stare at code, google something, stare more, realize the bug was a typo, stare even more. Nobody catches the typo. Nobody notices I've been stuck on the same function for 40 minutes. Nobody says "hey, maybe take a break" when I start angrily deleting code. And nobody celebrates when the tests finally pass at 2am.

So I'm building VibeCat. It's a macOS desktop companion — an animated cat (or one of 5 other characters) that sits on your screen, watches your code, hears your voice, and sometimes speaks up. Not a chatbot that waits for your prompt. A colleague that sees, judges, and occasionally tells you your code is broken.

the difference between a chatbot and a colleague

This distinction matters. A chatbot answers questions. You ask, it responds. That's a vending machine with better UI.

A colleague does something fundamentally different. They:

  • See your screen without being asked ("hey, line 23 has a type mismatch")
  • Hear your tone and know when you're frustrated
  • Remember that yesterday you were stuck on the auth module
  • Judge whether to speak or shut up (this is the hardest part)
  • Celebrate when things work ("tests passed! nice work")
  • Search for answers when you're clearly lost
  • Adapt their timing to your flow state

That's not one AI model doing one thing. It's a pile of distinct behaviors, each needing its own logic and its own triggers. So I decomposed the "colleague" into 9 agents.

the spike that almost killed the project before it started

Before writing a single line of production code, I had to answer one question: does the Go SDK for Gemini Live API actually work? Because the entire architecture depends on it.

The mandatory stack for the challenge is GenAI SDK + ADK + Gemini Live API + VAD (voice activity detection). All four. No exceptions. And everything has to run through a backend — the client can never call Gemini directly.

So I wrote two spike programs. The first one:

import (
    "context"
    "log"

    "google.golang.org/genai"
)

genaiClient, err := genai.NewClient(ctx, &genai.ClientConfig{
    APIKey:  apiKey,
    Backend: genai.BackendGeminiAPI,
})
if err != nil {
    log.Fatal(err)
}

session, err := genaiClient.Live.Connect(ctx, "gemini-2.5-flash-native-audio-latest", &genai.LiveConnectConfig{
    ResponseModalities: []genai.Modality{genai.ModalityAudio},
    SpeechConfig: &genai.SpeechConfig{
        VoiceConfig: &genai.VoiceConfig{
            PrebuiltVoiceConfig: &genai.PrebuiltVoiceConfig{
                VoiceName: "Zephyr",
            },
        },
    },
})
if err != nil {
    log.Fatal(err)
}
defer session.Close()

It compiled. It connected. Gemini said hello. I nearly fell off my chair.

The second spike was for ADK — Google's Agent Development Kit. I needed to prove that Go agents could be wired into a graph:

myAgent, err := agent.New(agent.Config{
    Name:        "test_agent",
    Description: "does a thing",
    // Run streams results back via Go 1.23 range-over-function iterators.
    Run:         func(ctx agent.InvocationContext) iter.Seq2[*session.Event, error] { ... },
})
if err != nil {
    log.Fatal(err)
}

The iter.Seq2 pattern took me a minute. ADK uses Go 1.23+ range-over-function iterators to stream agent results back to the caller. Not the most intuitive API, but it works. The sequentialagent package (google.golang.org/adk/agent/workflowagents/sequentialagent) chains agents so each runs in order.

Both spikes passed. The SDK surface was real, documented, and functional. Project greenlit.

125 tasks, 13 days, 1 developer

Here's the math that doesn't work: 125 implementation tasks. 13 days until deadline. That's 9.6 tasks per day, every day, with no breaks.

So I made a brutal prioritization call: demo-driven development. If a feature appears in the 4-minute demo video, it's P0. Everything else is P1 or cut.

The architecture ended up as three layers:

macOS Client (Swift 6)
    ↕ WebSocket
Realtime Gateway (Go + GenAI SDK)
    ↕ HTTP POST /analyze
ADK Orchestrator (Go + ADK Go SDK → 9-agent graph)

The client does UI, screen capture, and audio playback. It never touches Gemini directly. The gateway proxies WebSocket audio to Gemini's Live API. The orchestrator runs the 9-agent graph on screen captures and returns speech decisions.

No monolith. No shortcuts. The challenge rules require the client-backend split, and honestly it's the right architecture anyway — you don't want API keys sitting on someone's Mac.

what's next

Tomorrow I wire the 9 agents. Each one gets a Go package, a specific role, and a clear rule for when it should shut up. The Mediator agent — the one that decides whether to speak — is going to be the hardest. Making AI talk is easy. Making AI know when to be quiet? That's the real engineering.

28 commits in. 4,870 lines to go. The chair is still empty, but not for long.

Building VibeCat for the Gemini Live Agent Challenge. Source: github.com/Two-Weeks-Team/vibeCat
