missless failed at real-time video — so we pivoted to vibeCat
Three weeks of work. A working WebSocket proxy, Cloud Run deployment, Lyria BGM generation, 75 commits. And then the real-time video generation just... didn't work.
This post is my entry write-up for the Gemini Live Agent Challenge. If you read my earlier posts about missless — the WebSocket cascade from hell, the security holes, the 3am debugging sessions — this is the post where that story ends and a new one begins.
the promise that broke
missless was supposed to be a "virtual reunion" app. Upload a video of someone you miss, the AI reconstructs their personality and voice, and you have a real-time conversation — with video. Not just audio. Video. A face that moves, reacts, speaks back to you.
The audio side worked beautifully. Gemini Live API handled voice synthesis, the Go backend proxied WebSocket streams, Cloud Run kept it alive. I had real-time voice conversations with AI-reconstructed personas and it felt genuinely moving.
But the product vision required real-time video generation. A face on screen that moves its lips when it talks, that shifts expression when you say something emotional. That was the whole point — you're not just hearing someone you miss, you're seeing them.
And that's where everything fell apart.
why real-time video generation killed us
The technical reality was brutal:
Latency. Video generation models can't produce frames fast enough for real-time conversation. We needed <200ms per frame to feel natural. The best we got was 2-3 seconds per frame. That's a slideshow, not a reunion.
Consistency. Even when frames arrived, the face wasn't consistent across frames. The person looked slightly different every time — different lighting, different angle, subtle uncanny valley shifts. In audio, minor inconsistencies are forgivable. In video, they're horrifying.
Cost. Every frame is a model inference. At 15fps for a 10-minute conversation, that's 9,000 inferences. The API costs alone made the product unviable for anything beyond a demo.
The challenge stack constraint. The Gemini Live Agent Challenge requires GenAI SDK + ADK + Gemini Live API + VAD. All four. missless used GenAI SDK and Live API, but it had no ADK integration — it was a single-agent system. I could force-fit ADK, but the judges would see a bolted-on agent graph that didn't justify its existence.
I spent a week trying to solve the latency problem. Pre-generating frames, caching expressions, interpolating between keyframes. None of it felt right. The product was fighting the technology instead of riding it.
the moment I stopped pretending
I was writing the submission documentation and got to the "ADK agent graph" section. I stared at the empty space for twenty minutes. Because there was no agent graph. missless was one WebSocket session proxying to one Gemini Live connection. That's a pipe, not an orchestration system.
The challenge isn't asking for "a touching AI concept." It's asking for a live, multimodal, backend-first agent system where multiple agents collaborate in real time. missless could do voice. It couldn't do video. And it couldn't do multi-agent orchestration without lying about the architecture.
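That "pipe" really was the whole architecture. A minimal sketch of its shape, with plain channels standing in for the client WebSocket and the Gemini Live stream (all names here are illustrative, not the missless code):

```go
package main

import (
	"bytes"
	"fmt"
)

// pipe is the entire missless "architecture" in miniature: two copy
// loops and nothing else. No routing, no decisions, no agents, which
// is why it is a proxy rather than an orchestration system. Channels
// stand in for the real client WebSocket and the Gemini Live stream.
func pipe(clientIn, upstreamOut, upstreamIn, clientOut chan []byte) {
	go func() { // client -> model
		for msg := range clientIn {
			upstreamOut <- msg
		}
		close(upstreamOut)
	}()
	go func() { // model -> client
		for msg := range upstreamIn {
			clientOut <- msg
		}
		close(clientOut)
	}()
}

func main() {
	clientIn := make(chan []byte)
	upstreamOut := make(chan []byte)
	upstreamIn := make(chan []byte)
	clientOut := make(chan []byte)

	pipe(clientIn, upstreamOut, upstreamIn, clientOut)

	// Fake "model" side: upper-cases whatever arrives.
	go func() {
		for msg := range upstreamOut {
			upstreamIn <- bytes.ToUpper(msg)
		}
		close(upstreamIn)
	}()

	clientIn <- []byte("hello")
	fmt.Printf("%s\n", <-clientOut) // HELLO
	close(clientIn)
}
```

There is nowhere in that loop for a second agent to live. That was the point I could no longer write around.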
So I closed the missless repo and opened a new directory.
what vibeCat is
VibeCat is a macOS desktop companion for solo developers. Instead of trying to reconstruct a human face in real time (which doesn't work), it puts an animated character on your screen — a cat, a goofy tiger, a zen monk, a chibi dictator — that watches your code, hears your voice, and speaks up when it matters.
The critical difference: VibeCat is built around 9 agents, not 1.
| Agent | Role |
|---|---|
| VisionAgent | Analyzes screen captures for errors |
| MoodDetector | Senses frustration from patterns |
| Mediator | Decides whether to speak or stay silent |
| AdaptiveScheduler | Adjusts timing to developer flow |
| EngagementAgent | Proactive outreach after silence |
| MemoryAgent | Cross-session context via Firestore |
| CelebrationTrigger | Detects success moments |
| SearchBuddy | Google Search grounding |
| VAD | Real-time voice with barge-in |
Eight of the nine run through ADK's sequential agent graph; VAD sits in the real-time voice path rather than in the graph:
```go
graph, _ := sequentialagent.New(sequentialagent.Config{
	Name: "vibecat_graph",
	SubAgents: []agent.Agent{
		memoryAgent, visionAgent, moodDetector, mediator,
		adaptiveScheduler, engagementAgent, celebrationTrigger,
		searchBuddy,
	},
})
```
No fake video generation. No uncanny valley. Just a sprite-animated character driven by a real multi-agent pipeline that decides what to say, when to say it, and — most importantly — when to shut up.
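To make "when to shut up" concrete, here is the kind of gate a mediator can apply. Everything in it, the `Signals` fields, the cooldown, the thresholds, is my illustration rather than VibeCat's actual Mediator:

```go
package main

import "fmt"

// Signals is an illustrative summary of what upstream agents report.
// The fields and thresholds below are a sketch, not VibeCat's schema.
type Signals struct {
	ErrorOnScreen         bool    // from something like VisionAgent
	Frustration           float64 // from something like MoodDetector, 0..1
	SecondsSinceLastSpoke int     // time since the companion last spoke
}

// shouldSpeak is a hypothetical mediator gate: speak only when there is
// something worth saying AND we are not interrupting too often.
func shouldSpeak(s Signals) bool {
	if s.SecondsSinceLastSpoke < 60 {
		return false // cooldown: staying silent is the default
	}
	return s.ErrorOnScreen || s.Frustration > 0.7
}

func main() {
	fmt.Println(shouldSpeak(Signals{ErrorOnScreen: true, SecondsSinceLastSpoke: 120})) // true
	fmt.Println(shouldSpeak(Signals{Frustration: 0.9, SecondsSinceLastSpoke: 30}))     // false: still in cooldown
}
```

The design choice worth noting: silence is the default and speaking needs a reason, not the other way around.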
what transfers from missless
The missless work wasn't wasted. The Go backend patterns move directly:
- WebSocket proxy to Gemini Live API — same `client.Live.Connect()` pattern
- Cloud Run deployment — same region, same Docker setup, same `/readyz` lesson (never `/healthz`)
- JWT auth — same token flow
- The debugging instincts — missless taught me that Cloud Run services start fine with wrong env vars and silently break
The WebSocket cascade from hell? That debugging session directly informed how I structured VibeCat's gateway. Every silent failure I found in missless became a startup validation check in VibeCat.
what's next
Starting fresh on VibeCat. The plan:
- Go gateway — GenAI SDK, WebSocket proxy, JWT auth
- Go ADK orchestrator — 9-agent sequential graph
- Swift 6 macOS client — ScreenCaptureKit, sprite animation, audio playback
- 6 characters with unique voices and personalities
- Cloud Run deployment to `asia-northeast3`
- E2E verification suite
I'll be writing about the build as I go. Real code, real errors, real numbers.
missless taught me that a good idea can still be the wrong submission shape. The real-time video dream was beautiful, but the technology isn't there yet. What is there: real-time audio, screen understanding, multi-agent decision-making, and the ability to put something in the empty chair next to a solo developer.
Even if that something is a cat.
Building VibeCat for the Gemini Live Agent Challenge. Source: github.com/Two-Weeks-Team/vibeCat