KimSejun


75 commits, 48 hours, and the websocket cascade from hell

I am staring at my terminal and the numbers don't feel real. 75 commits in the last 48 hours. My git log looks like a crime scene. I wrote this post as an entry for the Gemini Live Agent Challenge, but right now it reads more like a post-mortem for my sanity.

The project is missless.co. The idea is simple on paper: you upload a YouTube video of someone you miss, and our AI analyzes their personality, voice, and mannerisms to create a real-time voice persona. You talk, they talk back. It uses a Go backend to proxy the Gemini Live API and a Next.js frontend for the PWA experience.

Everything worked on my local machine. Of course it did. Then I deployed to Cloud Run, and the universe decided to remind me that networking is hard.

The Cloud Run WebSocket Meltdown

The moment the container went live, everything broke. It wasn't just one bug. It was a cascade of failures that felt like a targeted attack by the ghost of debugging past.

First, the WebSocket URL. In development I had hardcoded ws://localhost:18080. When I moved to production, I tried to be clever with a protocol-relative URL in the frontend. It failed because Cloud Run terminates TLS at its edge, so the scheme my frontend saw wasn't the one it expected. I had to fix the connection logic to check window.location.protocol explicitly and switch between ws and wss.
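The actual check is one line of frontend JavaScript, but it's exactly the kind of mapping you get wrong at 3 AM. Here's the same decision table sketched in Go (the backend's language), purely for illustration:

```go
package main

import "fmt"

// wsScheme maps the page's protocol to the matching WebSocket scheme.
// The real check lives in the Next.js frontend; this Go version just
// illustrates the mapping: a TLS page must use wss, anything else ws.
func wsScheme(pageProtocol string) string {
	if pageProtocol == "https:" {
		return "wss"
	}
	return "ws"
}

func main() {
	fmt.Println(wsScheme("https:")) // production behind Cloud Run TLS -> wss
	fmt.Println(wsScheme("http:"))  // local development -> ws
}
```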

Then came the origin check. I had a middleware that was supposed to protect the WebSocket endpoint, but it was blocking Cloud Run's own internal URL. I was literally locking myself out of my own server.
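The fix was to make the allowlist aware of Cloud Run's own *.run.app hostname. A minimal sketch of the check (the hostnames and helper name are assumptions, not the real middleware):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// originAllowed reports whether a WebSocket Origin header should be
// accepted. It permits the production domain plus Cloud Run's own
// *.run.app URL, which is the host my original middleware forgot about.
func originAllowed(origin string) bool {
	u, err := url.Parse(origin)
	if err != nil || u.Host == "" {
		return false
	}
	host := u.Hostname()
	switch {
	case host == "missless.co":
		return true
	case strings.HasSuffix(host, ".run.app"): // Cloud Run's service URL
		return true
	case host == "localhost": // local development
		return true
	}
	return false
}

func main() {
	fmt.Println(originAllowed("https://missless.co"))
	fmt.Println(originAllowed("https://missless-abc123-uc.a.run.app"))
	fmt.Println(originAllowed("https://evil.example.com"))
}
```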

But the real "head desk" moment was the session auth. In missless, the onboarding flow starts before you even log in. You paste a YouTube URL, the AI starts analyzing, and you start a "preview" conversation. My middleware was expecting a session cookie that didn't exist yet. I had to completely rethink the auth flow to allow "anonymous" WebSocket connections that are tied to a temporary session ID before upgrading them to a full user account.
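The shape of the fix, stripped down to a sketch (the store and method names here are hypothetical, and the real thing persists sessions rather than keeping them in a map): an anonymous connection gets a temporary session ID with no user attached, and logging in later upgrades that same session in place.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync"
)

// sessionStore is a hypothetical in-memory store mapping session IDs
// to an optional user ID. "" means anonymous.
type sessionStore struct {
	mu       sync.Mutex
	sessions map[string]string // session ID -> user ID
}

func newSessionStore() *sessionStore {
	return &sessionStore{sessions: make(map[string]string)}
}

// StartAnonymous issues a temporary session ID with no user attached,
// so the preview conversation can begin before login.
func (s *sessionStore) StartAnonymous() string {
	b := make([]byte, 16)
	rand.Read(b)
	id := hex.EncodeToString(b)
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessions[id] = ""
	return id
}

// Upgrade binds a real user ID to an existing anonymous session,
// so nothing from the preview conversation is lost at login.
func (s *sessionStore) Upgrade(sessionID, userID string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.sessions[sessionID]; !ok {
		return false
	}
	s.sessions[sessionID] = userID
	return true
}

func main() {
	store := newSessionStore()
	id := store.StartAnonymous()
	fmt.Println("preview session issued:", id != "")
	fmt.Println("upgraded after login:", store.Upgrade(id, "user_123"))
}
```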

The most embarrassing bug, though, was in internal/live/proxy.go. I had a Wait() function that was supposed to keep the proxy alive while the audio stream was active. For some reason, I had set a 5-second timeout on the context.

// The bug that killed every session after 5 seconds
func (p *Proxy) Wait() {
    ctx, cancel := context.WithTimeout(p.ctx, 5*time.Second) // WHY?
    defer cancel()
    <-ctx.Done()
}

In production, every single reunion session died exactly five seconds after it started. Users would say "Hello?", the AI would start to breathe, and then... connection closed. I had to change it to block for the full lifetime of the session. It seems obvious now, but at 3 AM, that 5-second timeout felt like a good safety measure. It wasn't.

Security Holes and Path Traversal

Once the WebSockets were finally staying open, I did a quick security audit. It was a disaster.

I found a path traversal vulnerability in the album sharing page. We allow users to view "memory albums" generated during their sessions. The handler was taking an ID and looking up a file. I realized that with a bit of clever string manipulation, someone could have requested ../../etc/passwd or my .env file. I had to strip all directory separators and validate the ID against a strict regex before even touching the filesystem.

Then there was the rate limiting. Or rather, the lack of it. I realized that our WebSocket endpoint had zero protection. Anyone with a script could have opened a thousand connections and burned through my Gemini API credits in minutes. I had to scramble to add an IP-based rate limiter using a simple memory store.

I also had to tune the Cloud Run concurrency. I had it set to 80, which is the default. But for a WebSocket-heavy app that maintains a persistent connection to the Gemini Live API, 80 was way too high. The container was choking. I dropped it down to a more manageable level to ensure each connection had enough CPU headroom for the audio processing.

The "SafeGo" Sweep

In Go, it is so easy to just type go func() { ... } and move on. But in a production server, that is a recipe for a silent death. If one of those goroutines panics, the whole server goes down.

I spent three hours today hunting down every bare go keyword and replacing it with a util.SafeGo() wrapper. This wrapper catches panics, logs the stack trace, and sends a notification to my error tracker. It also helps with the "initiateShutdown" logic I had to build. I noticed that when a user closed their browser, some of my backend goroutines were just hanging around, leaking memory. I had to implement a proper cleanup signal that propagates through the context to kill those orphans.

The Test Coverage Rollercoaster

I decided to be a "good developer" and push for better test coverage. I got the auth package from 0% to 94%. I got the handler package up to 48%. I even wrote tests for the live/reconnect logic, which is notoriously hard to simulate.

But then I saw the regression.

I migrated our Firestore logic from the deprecated Batch API to the new BulkWriter for better performance. I thought I was being smart. But when I ran the coverage report, the memory package dropped from 96.9% to 63.9%. The store package dropped from 96.6% to 50.0%.

It turns out that BulkWriter introduces a lot of asynchronous code paths that my existing synchronous tests weren't hitting. The "real" Firestore connection I added for integration testing also made some of my mocks obsolete. It was a gut punch. You think you are making the code better, but the metrics tell you that you are flying blind. I'm still working on bringing those numbers back up.

Generating BGM with Lyria

The one bright spot in this sprint was the BGM generation. I wanted missless to have a cinematic feel, so I built a CLI tool that connects directly to the Lyria RealTime WebSocket API.

I defined six specific moods: warm, romantic, nostalgic, playful, emotional, and farewell. The tool sends a prompt to Lyria, gets the audio stream back, and saves it as a high-quality 30-second loop. Now, when the AI detects a shift in the conversation, it can trigger a "change_atmosphere" tool that swaps the background music in real-time.

// Snippet from the BGM CLI tool
func generateMood(mood string) error {
    prompt := fmt.Sprintf("A %s cinematic background track for a virtual reunion, soft piano, ambient", mood)
    // Connect to Lyria WS...
    // Stream audio to file...
    return nil
}

It's a small detail, but hearing a nostalgic piano track swell up just as the AI starts talking about a shared memory from ten years ago... it makes all the WebSocket bugs feel worth it.

Where we are now

We are at 87.5% of the V7 plan. 21 out of 24 major tasks are done. I have 14 open issues on GitHub, mostly related to frontend polish and those coverage regressions. All the Go tests are passing with the -race flag, which is a miracle given how many goroutines I'm juggling.

I'm exhausted. My eyes are blurry. But missless is alive. It's running on Cloud Run, it's secure (mostly), and it's actually starting to feel like the "virtual reunion" I imagined.

Now, if you'll excuse me, I'm going to go sleep for a few hours before I find another 75 commits to make.
