KimSejun
missless failed at real-time video — so we pivoted to vibeCat

Three weeks of work. A working WebSocket proxy, Cloud Run deployment, Lyria BGM generation, 75 commits. And then the real-time video generation just... didn't work.

This post is part of my entry for the Gemini Live Agent Challenge. If you read my earlier posts about missless — the WebSocket cascade from hell, the security holes, the 3am debugging sessions — this is where that story ends and a new one begins.

the promise that broke

missless was supposed to be a "virtual reunion" app. Upload a video of someone you miss, the AI reconstructs their personality and voice, and you have a real-time conversation — with video. Not just audio. Video. A face that moves, reacts, speaks back to you.

The audio side worked beautifully. Gemini Live API handled voice synthesis, the Go backend proxied WebSocket streams, Cloud Run kept it alive. I had real-time voice conversations with AI-reconstructed personas and it felt genuinely moving.

But the product vision required real-time video generation. A face on screen that moves its lips when it talks, that shifts expression when you say something emotional. That was the whole point — you're not just hearing someone you miss, you're seeing them.

And that's where everything fell apart.

why real-time video generation killed us

The technical reality was brutal:

  1. Latency. Video generation models can't produce frames fast enough for real-time conversation. We needed <200ms per frame to feel natural. The best we got was 2-3 seconds per frame. That's a slideshow, not a reunion.

  2. Consistency. Even when frames arrived, the face wasn't consistent across frames. The person looked slightly different every time — different lighting, different angle, subtle uncanny valley shifts. In audio, minor inconsistencies are forgivable. In video, they're horrifying.

  3. Cost. Every frame is a model inference. At 15fps for a 10-minute conversation, that's 9,000 inferences. The API costs alone made the product unviable for anything beyond a demo.

  4. The challenge stack constraint. The Gemini Live Agent Challenge requires GenAI SDK + ADK + Gemini Live API + VAD. All four. missless used GenAI SDK and Live API, but it had no ADK integration — it was a single-agent system. I could force-fit ADK, but the judges would see a bolted-on agent graph that didn't justify its existence.
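To make the cost math in point 3 concrete, here's a back-of-the-envelope sketch. The per-inference price is a made-up placeholder, not a real API rate — plug in the actual number and the conclusion only gets worse:

```go
package main

import "fmt"

func main() {
	const (
		fps         = 15
		minutes     = 10
		costPerCall = 0.002 // hypothetical $/inference, NOT a real API rate
	)
	// One model inference per generated frame.
	frames := fps * 60 * minutes
	fmt.Printf("%d inferences, ~$%.2f per conversation\n",
		frames, float64(frames)*costPerCall)
}
```

Even at a fraction of a cent per frame, every ten-minute conversation costs dollars — fine for a demo, fatal for a product.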

I spent a week trying to solve the latency problem. Pre-generating frames, caching expressions, interpolating between keyframes. None of it felt right. The product was fighting the technology instead of riding it.

the moment I stopped pretending

I was writing the submission documentation and got to the "ADK agent graph" section. I stared at the empty space for twenty minutes. Because there was no agent graph. missless was one WebSocket session proxying to one Gemini Live connection. That's a pipe, not an orchestration system.

The challenge isn't asking for "a touching AI concept." It's asking for a live, multimodal, backend-first agent system where multiple agents collaborate in real time. missless could do voice. It couldn't do video. And it couldn't do multi-agent orchestration without lying about the architecture.

So I closed the missless repo and opened a new directory.

what vibeCat is

VibeCat is a macOS desktop companion for solo developers. Instead of trying to reconstruct a human face in real time (which doesn't work), it puts an animated character on your screen — a cat, a goofy tiger, a zen monk, a chibi dictator — that watches your code, hears your voice, and speaks up when it matters.

The critical difference: VibeCat is built around 9 agents, not 1.

| Agent | Role |
| --- | --- |
| VisionAgent | Analyzes screen captures for errors |
| MoodDetector | Senses frustration from patterns |
| Mediator | Decides whether to speak or stay silent |
| AdaptiveScheduler | Adjusts timing to developer flow |
| EngagementAgent | Proactive outreach after silence |
| MemoryAgent | Cross-session context via Firestore |
| CelebrationTrigger | Detects success moments |
| SearchBuddy | Google Search grounding |
| VAD | Real-time voice with barge-in |

Eight of the nine run through ADK's sequential agent graph (VAD lives in the real-time audio path rather than the graph):

graph, _ := sequentialagent.New(sequentialagent.Config{
    Name:      "vibecat_graph",
    SubAgents: []agent.Agent{
        memoryAgent, visionAgent, moodDetector, mediator,
        adaptiveScheduler, engagementAgent, celebrationTrigger,
        searchBuddy,
    },
})

No fake video generation. No uncanny valley. Just a sprite-animated character driven by a real multi-agent pipeline that decides what to say, when to say it, and — most importantly — when to shut up.
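The "when to shut up" part is the Mediator's job. Its real logic isn't shown in this post, so here's a hypothetical sketch of the kind of gate it implements — the signal names, fields, and thresholds are illustrative, not VibeCat's actual code:

```go
package main

import "fmt"

// Signals is a hypothetical summary of what upstream agents report.
type Signals struct {
	ErrorOnScreen   bool    // from VisionAgent
	Frustration     float64 // 0..1, from MoodDetector
	SecondsSinceMsg int     // time since the character last spoke
}

// shouldSpeak is an illustrative mediator gate: speak only when there's
// something worth saying and the developer isn't being nagged.
func shouldSpeak(s Signals) bool {
	if s.SecondsSinceMsg < 60 {
		return false // spoke recently — stay quiet for at least a minute
	}
	return s.ErrorOnScreen || s.Frustration > 0.7
}

func main() {
	fmt.Println(shouldSpeak(Signals{ErrorOnScreen: true, SecondsSinceMsg: 120})) // true
	fmt.Println(shouldSpeak(Signals{Frustration: 0.3, SecondsSinceMsg: 300}))    // false
}
```

The design point is that silence is the default: the graph has to argue its way *into* speaking, not out of it.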

what transfers from missless

The missless work wasn't wasted. The Go backend patterns move directly:

  • WebSocket proxy to Gemini Live API — same client.Live.Connect() pattern
  • Cloud Run deployment — same region, same Docker setup, same /readyz lesson (never /healthz)
  • JWT auth — same token flow
  • The debugging instincts — missless taught me that Cloud Run services start fine with wrong env vars and silently break

The WebSocket cascade from hell? That debugging session directly informed how I structured VibeCat's gateway. Every silent failure I found in missless became a startup validation check in VibeCat.

what's next

Starting fresh on VibeCat. The plan:

  1. Go gateway — GenAI SDK, WebSocket proxy, JWT auth
  2. Go ADK orchestrator — 9-agent sequential graph
  3. Swift 6 macOS client — ScreenCaptureKit, sprite animation, audio playback
  4. 6 characters with unique voices and personalities
  5. Cloud Run deployment to asia-northeast3
  6. E2E verification suite

I'll be writing about the build as I go. Real code, real errors, real numbers.

missless taught me that a good idea can still be the wrong submission shape. The real-time video dream was beautiful, but the technology isn't there yet. What is there: real-time audio, screen understanding, multi-agent decision-making, and the ability to put something in the empty chair next to a solo developer.

Even if that something is a cat.


Building VibeCat for the Gemini Live Agent Challenge. Source: github.com/Two-Weeks-Team/vibeCat

