John Jomel Pangilinan

Building Blackbird: A Real-Time AI Co-Pilot with Gemini 2.5 Flash Live API

This article was written as an entry for the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge


Screen sharing is everywhere — sales calls, support sessions, remote onboarding. But the person sharing their screen is always on their own. They have to think, talk, navigate, and perform all at once.

Blackbird is a real-time AI co-pilot that changes that. It sees your screen, hears the conversation, and whispers private coaching through your headphones — all powered by Gemini 2.5 Flash's multimodal Live API.

In this post I'll walk through how the system works, the architecture decisions, and what I learned building real-time multimodal agents on Google Cloud.

What Blackbird Does

Blackbird runs alongside any screen-sharing call, invisible to the other party:

  • Screen vision: Captures the screen every 5 seconds and streams JPEG frames to Gemini
  • Private voice coaching: Proactive spoken guidance only the agent hears
  • Real-time translation: Speak in one language, your counterpart hears another — live
  • Guided actions: The AI identifies UI elements on the customer's screen and can execute clicks with approval
  • Native desktop overlays: An Electron companion renders annotation rings and step-by-step guide cards directly on the desktop

The Core: Gemini Live API as a Persistent Multimodal Session

The entire system revolves around one idea: a persistent Gemini Live API session that receives interleaved audio and video in real time and responds with streamed audio, text, or tool calls.

Here's the connection setup from server/gemini-session.ts:

import { GoogleGenAI, type Session, Modality } from "@google/genai";

this.session = await getGenAI().live.connect({
  model: "gemini-2.5-flash-native-audio-preview-12-2025",
  config: {
    responseModalities: [Modality.AUDIO],
    systemInstruction: {
      parts: [{ text: this.systemInstruction }],
    },
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: { voiceName: "Puck" },
      },
    },
    tools: [{ functionDeclarations: toolDeclarations }],
    inputAudioTranscription: {},
    outputAudioTranscription: {},
  },
  callbacks: {
    onmessage: (msg) => {
      if (msg.toolCall?.functionCalls) {
        this.handleToolCalls(msg.toolCall.functionCalls);
      }
      if (msg.serverContent?.modelTurn?.parts) {
        for (const part of msg.serverContent.modelTurn.parts) {
          if (part.inlineData?.data) {
            this.callbacks.onAudio(part.inlineData.data);
          }
        }
      }
    },
  },
});

This single session handles everything — speech recognition, vision understanding, response generation, and tool calling — all in one persistent WebSocket connection. No orchestration layer, no chaining multiple API calls. Just one live session that sees and hears simultaneously.

Streaming Audio and Video Into the Session

Audio from the user's microphone is captured at 16kHz mono PCM16 using an AudioWorklet processor, converted to base64, and sent to Gemini:

sendAudio(base64Audio: string): void {
  this.session.sendRealtimeInput({
    audio: {
      data: base64Audio,
      mimeType: "audio/pcm;rate=16000",
    },
  });
}
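For context, the conversion step between the worklet and `sendAudio` looks roughly like this. This is an illustrative sketch, not the repo's actual worklet code — the helper names are assumptions:

```typescript
// Convert the Float32 samples an AudioWorklet produces into 16-bit
// little-endian PCM, then base64-encode them for sendRealtimeInput.
function floatTo16BitPCM(samples: Float32Array): Uint8Array {
  const buf = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buf);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] before scaling to the int16 range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // true = little-endian
  }
  return new Uint8Array(buf);
}

function pcmToBase64(pcm: Uint8Array): string {
  // Buffer in Node; a btoa-based equivalent would be used in the browser.
  return Buffer.from(pcm).toString("base64");
}
```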

Screen frames are captured every 5 seconds, JPEG-compressed at 70% quality (max 1280px width), and interleaved into the same session:

sendFrame(base64Jpeg: string): void {
  this.session.sendRealtimeInput({
    video: {
      data: base64Jpeg,
      mimeType: "image/jpeg",
    },
  });
}
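The resize math behind that capture step can be sketched as follows (the helper name is an assumption; in the browser, the scaled frame would be drawn to a canvas and exported with `canvas.toDataURL("image/jpeg", 0.7)`):

```typescript
// Cap frame width at 1280px while preserving the aspect ratio.
// Frames narrower than the cap pass through untouched.
function scaledSize(
  width: number,
  height: number,
  maxWidth = 1280
): { width: number; height: number } {
  if (width <= maxWidth) return { width, height };
  const scale = maxWidth / width;
  return { width: maxWidth, height: Math.round(height * scale) };
}
```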

Gemini processes both streams together. It can respond to what it hears ("the customer just asked about pricing") and what it sees ("they're on the settings page, the toggle they need is in the top right") in the same turn.

Architecture: Custom Server with 4 WebSocket Endpoints

The backend is a custom Node.js server that wraps Next.js and manages multiple WebSocket servers:

// server/index.ts
const server = http.createServer(nextHandler);

server.on("upgrade", (req, socket, head) => {
  const { pathname } = new URL(req.url!, `http://${req.headers.host}`);

  if (pathname === "/ws")            wssCopilot.handleUpgrade(req, socket, head, ...);
  if (pathname === "/ws/translate")  wssTranslate.handleUpgrade(req, socket, head, ...);
  if (pathname === "/ws/elevenlabs") wssElevenlabs.handleUpgrade(req, socket, head, ...);
  if (pathname === "/ws/room")       wssRoom.handleUpgrade(req, socket, head, ...);
});

Each endpoint creates its own Gemini Live session with different system instructions:

| Endpoint | Purpose | Gemini Config |
| --- | --- | --- |
| `/ws` | Single-user co-pilot | Vision + audio + tools |
| `/ws/translate` | Real-time translation | Audio only, language-mapped voice |
| `/ws/room` | Shared agent+customer session | Dual-participant awareness + tools |
| `/ws/elevenlabs` | Translation with voice cloning | Gemini for translation text, ElevenLabs for TTS |
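As a sketch, the per-endpoint wiring might look like this — the instruction strings and helper name here are illustrative placeholders, not the repo's actual prompts:

```typescript
// Illustrative only: the real system instructions live in the repo.
const ENDPOINT_INSTRUCTIONS: Record<string, string> = {
  "/ws": "You are a private co-pilot; whisper concise coaching to the agent.",
  "/ws/translate": "You are a real-time interpreter; translate everything spoken.",
  "/ws/room": "You assist an agent and a customer sharing one session.",
  "/ws/elevenlabs": "Output translation text only; TTS is handled downstream.",
};

// Pick the system instruction for a WebSocket upgrade path.
function instructionFor(pathname: string): string {
  const instruction = ENDPOINT_INSTRUCTIONS[pathname];
  if (!instruction) throw new Error(`unknown endpoint: ${pathname}`);
  return instruction;
}
```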

Function Calling: Giving the Agent Hands

Gemini's function calling lets Blackbird take actions mid-conversation. Four tools are declared:

export const toolDeclarations = [
  {
    name: "web_search",
    description: "Search the web for real-time information",
    parameters: {
      type: "object",
      properties: { query: { type: "string" } },
      required: ["query"],
    },
  },
  {
    name: "guide_customer",
    description: "Send step-by-step instructions to the customer's screen",
    parameters: {
      type: "object",
      properties: {
        title: { type: "string" },
        steps: { type: "array", items: { type: "string" } },
      },
    },
  },
  {
    name: "guided_action",
    description: "Click a specific location on the customer's screen",
    parameters: {
      type: "object",
      properties: {
        x: { type: "number" }, // normalized 0-1
        y: { type: "number" },
        label: { type: "string" },
      },
    },
  },
  {
    name: "save_session",
    description: "Save a structured call summary",
    parameters: {
      type: "object",
      properties: {
        title: { type: "string" },
        content: { type: "string" },
      },
    },
  },
];

When Gemini calls guided_action, the coordinates are sent to the Electron companion app, which renders a pulsing annotation ring at that position and — with customer approval — executes the click via platform-native APIs.
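Because Gemini returns normalized 0-1 coordinates, the companion has to map them to absolute pixels before drawing the ring. A minimal sketch (the function name is an assumption):

```typescript
// Map Gemini's normalized (0-1) click target to absolute screen pixels.
// The display size would come from Electron's screen.getPrimaryDisplay().
function toScreenCoords(
  nx: number,
  ny: number,
  screenWidth: number,
  screenHeight: number
): { x: number; y: number } {
  return {
    x: Math.round(nx * screenWidth),
    y: Math.round(ny * screenHeight),
  };
}
```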

The tool response flows back into the same Live session:

this.session.sendToolResponse({
  functionResponses: {
    id: call.id,
    name,
    response: { output },
  },
});

Gemini then incorporates the result into its next spoken response. The entire loop — hearing a request, seeing the screen, calling a tool, getting the result, speaking the response — happens in a single persistent session.

Real-Time Translation: Two Gemini Sessions Per Room

For the shared session mode, Blackbird creates two translation sessions — one for each direction:

// System instruction for translation sessions
`You are a professional real-time interpreter.
Translate everything spoken into ${targetLanguage}.`

Each session is configured with a language-appropriate voice. The config maps 30+ languages to Gemini voice presets:

export const VOICE_MAP: Record<string, string> = {
  Japanese: "Kore",
  Spanish: "Aoede",
  Korean: "Kore",
  French: "Aoede",
  // ...
};

The agent speaks English → Gemini translates to Japanese with the Kore voice → customer hears Japanese. The customer speaks Japanese → Gemini translates to English with the Puck voice → agent hears English. Both happen simultaneously in real time.
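Looking up a voice from that map needs a fallback for unmapped languages. A sketch, using an abbreviated copy of the map so the snippet is self-contained — the "Puck" fallback is an assumption:

```typescript
// Abbreviated copy of the voice map from the config above.
const VOICE_MAP: Record<string, string> = {
  Japanese: "Kore",
  Spanish: "Aoede",
  Korean: "Kore",
  French: "Aoede",
};

// Resolve a Gemini voice preset, falling back to a default
// for languages without a mapped voice.
function voiceFor(language: string): string {
  return VOICE_MAP[language] ?? "Puck";
}
```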

Deploying to Google Cloud Run

The production deployment uses Cloud Build and Cloud Run, configured for WebSocket support:

# deploy/deploy.sh
gcloud run deploy blackbird \
  --image "gcr.io/${PROJECT_ID}/blackbird" \
  --region "${REGION}" \
  --platform managed \
  --allow-unauthenticated \
  --memory 1Gi \
  --timeout 3600 \
  --concurrency 80 \
  --session-affinity \
  --set-env-vars "GOOGLE_API_KEY=${GOOGLE_API_KEY},GOOGLE_CLOUD_PROJECT=${PROJECT_ID}"

Key settings:

  • --session-affinity: Critical for WebSocket connections — ensures a client stays on the same instance
  • --timeout 3600: Long-lived connections for hour-long calls
  • --memory 1Gi: Handles multiple concurrent Gemini sessions

Cloud Firestore persists session summaries when Gemini calls the save_session tool:

const doc = await firestore.collection("sessions").add({
  title,
  summary: content,
  timestamp: new Date(),
});

The Electron Companion: Native Overlays

The desktop companion is an Electron tray app that bridges the web app to the native desktop. It runs a local WebSocket server on port 52836:

  • The web app auto-detects it and switches from browser screen capture to native desktopCapturer
  • Guided actions render as transparent fullscreen overlays — pulsing rings at target coordinates
  • Step-by-step guide cards appear as interactive overlays the customer can follow
  • The overlay is click-through by default, becoming interactive only when guide cards are shown

What I Learned

Gemini's Live API is genuinely different. Most AI integrations are request-response. The Live API is a persistent session that accumulates context from both audio and video over time. The model doesn't just answer questions — it notices things and proactively speaks up. That's what makes the co-pilot feel like a real partner rather than a chatbot.

Tool calling mid-conversation is powerful. Gemini deciding on its own to call guide_customer or save_session based on the flow of conversation — without explicit user commands — makes the agent feel autonomous. The function calling interface is clean and the round-trip back into the session is seamless.

Latency is everything in real-time audio. PCM chunk sizes, AudioWorklet buffer management, little-endian PCM16 encoding — every millisecond matters. The native audio preview model's streaming response keeps the experience conversational rather than turn-based.
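For intuition on why chunk sizing matters, the arithmetic is simple (the chunk sizes below are examples, not the values Blackbird uses):

```typescript
// At 16 kHz mono, chunk duration scales linearly with sample count:
// every extra sample of buffering adds 1/16 ms of latency.
function chunkDurationMs(samples: number, sampleRate = 16000): number {
  return (samples / sampleRate) * 1000;
}

// e.g. a 512-sample chunk is 32 ms of audio, while buffering
// 4096 samples would already add 256 ms before anything is sent.
```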

Try It Yourself

The project is open source: github.com/jomspangilinan/blackbird

git clone https://github.com/jomspangilinan/blackbird.git
cd blackbird
npm install
cp .env.example .env
# Add your Gemini API key to .env
npm run dev

All you need is a Google AI Studio API key to get started.


Built for the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge
