making Go speak real-time — our Gemini Live API WebSocket proxy
The first time I got the audio proxy working, the cat meowed in Gemini's voice — a full 3 seconds of distorted PCM noise that sounded like a dial-up modem possessed by a cheerful robot. I'd set the sample rate wrong. 24kHz audio interpreted as 16kHz sounds like a cursed lullaby.
This post is my entry for the Gemini Live Agent Challenge, where I'm building VibeCat.
The core challenge was simple to state, hard to build: the macOS client can't talk to Gemini directly. Challenge rules require a backend, and you never put API keys on someone's Mac. So I needed a WebSocket proxy in Go that sits between the Swift client and Gemini Live API — receiving raw audio from one side, forwarding it to the other, and doing it fast enough that conversation feels natural.
the architecture (deceptively simple)
```
Swift Client  ←→  [wss://gateway/ws/live]  ←→  Go Gateway  ←→  Gemini Live API

   PCM 16kHz mono →                            →  PCM 16kHz
                  ←  PCM 24kHz                 ←  PCM 24kHz
```
On paper, it's a pipe. Audio goes in one side, comes out the other. I told myself this would take a day. It took three. The first day was the "it works!" day. The second was the "why did it stop working?" day. The third was the "oh, WebSocket connections are secretly fragile" day.
connecting to Gemini
After the modem-cat incident, I triple-checked sample rates. The GenAI Go SDK makes the connection surprisingly clean:
```go
session, err := m.client.Live.Connect(ctx, "gemini-2.0-flash-live-001", liveConfig)
```
One line. But building that liveConfig is where it gets interesting:
```go
func buildLiveConfig(cfg Config) *genai.LiveConnectConfig {
    lc := &genai.LiveConnectConfig{}
    if cfg.Voice != "" {
        lc.SpeechConfig = &genai.SpeechConfig{
            VoiceConfig: &genai.VoiceConfig{
                PrebuiltVoiceConfig: &genai.PrebuiltVoiceConfig{
                    VoiceName: cfg.Voice, // "Zephyr", "Puck", etc.
                },
            },
        }
    }
    lc.RealtimeInputConfig = &genai.RealtimeInputConfig{
        AutomaticActivityDetection: &genai.AutomaticActivityDetection{
            Disabled: false, // VAD must be enabled — challenge requirement
        },
    }
    return lc
}
```
VAD (Voice Activity Detection) is mandatory. When AutomaticActivityDetection is enabled, Gemini handles turn-taking automatically — it detects when you stop talking and starts responding. It also supports barge-in: if you interrupt mid-response, Gemini stops and listens.
audio streaming
Sending audio to Gemini:
```go
func (s *Session) SendAudio(pcmData []byte) error {
    return s.gemini.SendRealtimeInput(genai.LiveRealtimeInput{
        Audio: &genai.Blob{
            MIMEType: "audio/pcm;rate=16000",
            Data:     pcmData,
        },
    })
}
```
The MIME type matters. audio/pcm;rate=16000 means raw PCM, 16-bit, 16kHz, mono. I know because I got it wrong — passed audio/pcm without the rate parameter, and Gemini interpreted my voice as white noise. No error. No warning. Just silence on the other end and me talking to myself in an empty apartment at midnight.
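Since the rate parameter is so easy to forget, I now build the MIME string through a tiny helper. This one is a hypothetical sketch, not the actual gateway code, but it shows the shape:

```go
package main

import "fmt"

// pcmMime builds the MIME type Gemini Live expects for raw PCM audio.
// The rate parameter is mandatory; omitting it silently turns speech into noise.
func pcmMime(sampleRate int) string {
    return fmt.Sprintf("audio/pcm;rate=%d", sampleRate)
}

func main() {
    fmt.Println(pcmMime(16000)) // audio/pcm;rate=16000
    fmt.Println(pcmMime(24000)) // audio/pcm;rate=24000
}
```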
Receiving from Gemini is a loop that runs in its own goroutine:
```go
func receiveFromGemini(ctx context.Context, conn *websocket.Conn, sess *live.Session, connID string) {
    for {
        msg, err := sess.Receive()
        if err != nil {
            return // Gemini closed the stream or the context was cancelled
        }
        if msg.ServerContent != nil && msg.ServerContent.ModelTurn != nil {
            for _, part := range msg.ServerContent.ModelTurn.Parts {
                if part.InlineData != nil && len(part.InlineData.Data) > 0 {
                    // 24kHz PCM chunks go straight to the client as binary frames
                    if err := conn.WriteMessage(websocket.BinaryMessage, part.InlineData.Data); err != nil {
                        return
                    }
                }
            }
        }
        if msg.ServerContent != nil && msg.ServerContent.TurnComplete {
            sendJSON(conn, map[string]string{"type": "turnComplete"})
        }
        if msg.ServerContent != nil && msg.ServerContent.Interrupted {
            sendJSON(conn, map[string]string{"type": "interrupted"})
        }
    }
}
```
Gemini sends audio in chunks via InlineData.Data. Each chunk is a PCM frame at 24kHz that goes straight to the client as a binary WebSocket message. Text events (transcriptions, turn completions, interruptions) go as JSON text frames.
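The sendJSON used above is a small helper, not an SDK call. Here's a minimal sketch of it, written against a one-method interface so it can be exercised without a live socket (the real helper wraps *websocket.Conn and guards writes with a mutex, since the receive loop and the ping goroutine share the connection):

```go
package main

import (
    "encoding/json"
    "fmt"
)

const textMessage = 1 // same value as websocket.TextMessage in gorilla/websocket

// messageWriter is the one method of *websocket.Conn that sendJSON needs.
type messageWriter interface {
    WriteMessage(messageType int, data []byte) error
}

// sendJSON marshals v and writes it to the client as a single text frame.
func sendJSON(w messageWriter, v any) error {
    data, err := json.Marshal(v)
    if err != nil {
        return err
    }
    return w.WriteMessage(textMessage, data)
}

// fakeConn records frames so the helper can be tested without a network.
type fakeConn struct{ frames [][]byte }

func (f *fakeConn) WriteMessage(_ int, data []byte) error {
    f.frames = append(f.frames, data)
    return nil
}

func main() {
    c := &fakeConn{}
    sendJSON(c, map[string]string{"type": "turnComplete"})
    fmt.Println(string(c.frames[0])) // {"type":"turnComplete"}
}
```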
the zombie killer
Day two's lesson: WebSocket connections die in weird ways. The client closes their laptop. The network drops. The process crashes. In all these cases, the server-side connection sits there, alive but silent — a zombie. I found this out because my test server accumulated 14 dead connections over a weekend. Each one holding a Gemini Live session open. Each one costing API credits for nothing.
```go
const (
    pingInterval  = 15 * time.Second
    zombieTimeout = 45 * time.Second
)

rawConn.SetReadDeadline(time.Now().Add(zombieTimeout))
rawConn.SetPongHandler(func(string) error {
    // Every pong pushes the deadline forward; silence lets it expire.
    rawConn.SetReadDeadline(time.Now().Add(zombieTimeout))
    return nil
})

// Ping goroutine
go func() {
    ticker := time.NewTicker(pingInterval)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            rawConn.WriteControl(websocket.PingMessage, nil, time.Now().Add(5*time.Second))
        }
    }
}()
```
Every 15 seconds, the server pings the client. If the client doesn't pong within 45 seconds, the read deadline expires and the connection gets cleaned up. The Gemini session closes, the registry removes the connection, and resources are freed.
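The cleanup path that paragraph describes can be sketched as a registry keyed by connection ID. The types here are illustrative stand-ins, not the actual gateway code; in the real thing the stored value bundles the client socket and the Gemini session:

```go
package main

import (
    "fmt"
    "sync"
)

// closer matches anything owning resources: the Gemini session,
// the client socket, or a struct bundling both.
type closer interface{ Close() error }

// Registry tracks live connections so zombies can be reaped in one place.
type Registry struct {
    mu    sync.Mutex
    conns map[string]closer
}

func NewRegistry() *Registry {
    return &Registry{conns: make(map[string]closer)}
}

func (r *Registry) Add(id string, c closer) {
    r.mu.Lock()
    defer r.mu.Unlock()
    r.conns[id] = c
}

// Remove closes the connection's resources and drops it from the map.
// Called when the read deadline expires or the client disconnects cleanly.
func (r *Registry) Remove(id string) {
    r.mu.Lock()
    defer r.mu.Unlock()
    if c, ok := r.conns[id]; ok {
        c.Close()
        delete(r.conns, id)
    }
}

type fakeSession struct{ closed bool }

func (s *fakeSession) Close() error { s.closed = true; return nil }

func main() {
    reg := NewRegistry()
    s := &fakeSession{}
    reg.Add("conn-1", s)
    reg.Remove("conn-1")
    fmt.Println(s.closed) // true
}
```

Centralizing teardown in one Remove call means the read loop, the ping goroutine, and any admin endpoint all free resources the same way.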
session resumption
Gemini Live sessions have a time limit. When the server sends a GoAway signal, you have a few seconds to save the resumption handle and reconnect:
```go
if msg.SessionResumptionUpdate != nil && msg.SessionResumptionUpdate.NewHandle != "" {
    sess.ResumptionHandle = msg.SessionResumptionUpdate.NewHandle
    sendJSON(conn, map[string]any{
        "type":             "setupComplete",
        "sessionId":        connID,
        "resumptionHandle": sess.ResumptionHandle,
    })
}
```
The client saves the handle. On reconnect, it sends the handle in the setup message, and the gateway passes it to SessionResumptionConfig. Gemini picks up where it left off. No lost context, no repeated introductions.
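Wiring the saved handle back in on reconnect looks roughly like this. The field names follow my reading of the GenAI Go SDK, so treat them as an assumption and check them against the SDK docs:

```go
// Sketch: reconnecting with a saved resumption handle.
// Assumption: LiveConnectConfig exposes SessionResumption / Handle as named here.
func buildResumeConfig(base *genai.LiveConnectConfig, handle string) *genai.LiveConnectConfig {
    if handle != "" {
        base.SessionResumption = &genai.SessionResumptionConfig{
            Handle: handle, // the token Gemini sent in SessionResumptionUpdate
        }
    }
    return base
}
```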
JWT auth
Every WebSocket connection requires a valid JWT:
```go
mux.Handle("/ws/live", auth.Middleware(jwtMgr, ws.Handler(registry, liveMgr, adkClient)))
```
The client first calls POST /api/v1/auth/register with an API key, gets back a signed JWT with 24-hour expiry, then passes it as Bearer <token> in the WebSocket upgrade request. No token, no connection. Bad token, 401.
The whole gateway is about 300 lines of WebSocket handler code and 170 lines of Live session management. Not counting the auth layer. For a real-time bidirectional audio proxy with authentication, session resumption, and zombie detection — that's compact.
But the line count doesn't capture the real work. The real work was the modem-cat at midnight, the 14 zombie connections leaking credits, the missing MIME parameter that turned my voice into silence. The code is simple because I made every mistake first.
The proxy works now. Audio goes in, the cat talks back, and it sounds like an actual voice — not a dial-up modem anymore. That feels like progress.
Building VibeCat for the Gemini Live Agent Challenge. Source: github.com/Two-Weeks-Team/vibeCat