Md Enayetur Rahman

Building Custom Audio Streaming Apps with Google ADK

My Journey into Real-Time Voice Interaction

After building basic agents and exploring streaming capabilities, I wanted to take things to the next level: creating custom web applications that enable real-time, bidirectional voice communication with AI agents. This led me to explore two different approaches for building audio streaming applications using Google ADK: Server-Sent Events (SSE) and WebSockets.

I'll walk you through both implementations, explaining the differences, when to use each approach, and sharing the code so you can build your own custom audio streaming applications.

Understanding Custom Audio Streaming

Unlike the standard ADK web interface, custom audio streaming applications give you complete control over the user experience. You can build:

  • Real-time voice conversations with AI agents
  • Custom UI/UX tailored to your specific needs
  • Bidirectional audio streaming for natural conversation flow
  • Text and audio modes that can switch dynamically
  • Production-ready applications with proper session management

Both implementations use Google's Gemini Live API models, which support native audio streaming capabilities.

Two Approaches: SSE vs WebSockets

I built two versions of the same application to understand the trade-offs:

Server-Sent Events (SSE) Approach

  • Server → Client: SSE stream for real-time updates
  • Client → Server: HTTP POST requests
  • Best for: Simpler implementations, when you don't need true bidirectional streaming

WebSocket Approach

  • Server ↔ Client: Full bidirectional WebSocket connection
  • Best for: True real-time bidirectional communication, lower latency

Let me show you how I built both!

Building the SSE-Based Audio Streaming App

Step 1: Project Setup

I started by setting up the project structure:

C:\Agent\Custom_Audio_Streaming_app_SSE>cd adk-docs\examples\python\snippets\streaming\adk-streaming\app

The project structure looks like this:

adk-streaming/
└── app/
    ├── main.py                    # FastAPI server
    ├── google_search_agent/
    │   └── agent.py              # The agent definition
    ├── static/
    │   ├── index.html            # Frontend HTML
    │   └── js/
    │       ├── app.js            # Main client logic
    │       ├── audio-player.js   # Audio playback worklet
    │       └── audio-recorder.js # Audio capture worklet
    └── requirements.txt

Step 2: Understanding the Agent

The agent uses a Gemini Live API model with Google Search capabilities:

from google.adk.agents import Agent
from google.adk.tools import google_search

root_agent = Agent(
    name="google_search_agent",
    model="gemini-2.0-flash-live-001",  # Live API model for audio
    description="Agent to answer questions using Google Search.",
    instruction="Answer the question using the Google Search tool.",
    tools=[google_search],
)

Key Points:

  • Uses gemini-2.0-flash-live-001 - a model that supports the Gemini Live API
  • Includes google_search tool for real-time information retrieval
  • Configured for both text and audio responses

Step 3: Server-Side Implementation (SSE)

The FastAPI server handles two main endpoints:

SSE Endpoint (Server → Client)

@app.get("/events/{user_id}")
async def sse_endpoint(user_id: int, is_audio: str = "false"):
    """SSE endpoint for agent to client communication"""

    # Start agent session
    user_id_str = str(user_id)
    live_events, live_request_queue = await start_agent_session(
        user_id_str,
        is_audio == "true"
    )

    # Store the request queue for this user
    active_sessions[user_id_str] = live_request_queue

    async def event_generator():
        try:
            async for data in agent_to_client_sse(live_events):
                yield data
        finally:
            # Clean up the stored request queue when the client disconnects
            live_request_queue.close()
            active_sessions.pop(user_id_str, None)

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
        }
    )

How it works:

  • Creates an ADK streaming session using start_agent_session() (sketched below)
  • Streams events from the agent to the client via SSE
  • Stores the live_request_queue for client-to-agent communication
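
For reference, here's a minimal sketch of what start_agent_session() can look like, modeled on the ADK streaming quickstart. The APP_NAME value is a placeholder, and exact signatures may differ slightly between ADK versions:

from google.adk.agents import LiveRequestQueue
from google.adk.agents.run_config import RunConfig, StreamingMode
from google.adk.runners import InMemoryRunner

from google_search_agent.agent import root_agent

APP_NAME = "adk_streaming_app"  # placeholder app name

async def start_agent_session(user_id: str, is_audio: bool = False):
    """Start a live agent session and return its event stream and request queue."""
    runner = InMemoryRunner(app_name=APP_NAME, agent=root_agent)

    # Each user gets their own session
    session = await runner.session_service.create_session(
        app_name=APP_NAME,
        user_id=user_id,
    )

    # One modality per session: AUDIO or TEXT
    run_config = RunConfig(
        streaming_mode=StreamingMode.BIDI,
        response_modalities=["AUDIO" if is_audio else "TEXT"],
    )

    # The queue carries client messages to the agent; run_live() yields
    # the agent's events back as an async stream.
    live_request_queue = LiveRequestQueue()
    live_events = runner.run_live(
        session=session,
        live_request_queue=live_request_queue,
        run_config=run_config,
    )
    return live_events, live_request_queue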

HTTP POST Endpoint (Client → Server)

@app.post("/send/{user_id}")
async def send_message_endpoint(user_id: int, request: Request):
    """HTTP endpoint for client to agent communication"""

    live_request_queue = active_sessions.get(str(user_id))
    if not live_request_queue:
        return {"error": "Session not found"}

    message = await request.json()
    mime_type = message["mime_type"]
    data = message["data"]

    if mime_type == "text/plain":
        content = Content(role="user", parts=[Part.from_text(text=data)])
        live_request_queue.send_content(content=content)
    elif mime_type == "audio/pcm":
        decoded_data = base64.b64decode(data)
        live_request_queue.send_realtime(
            Blob(data=decoded_data, mime_type=mime_type)
        )

    return {"status": "sent"}

Key Concepts:

  • send_content(): Used for text messages in "turn-by-turn mode" - signals a complete turn
  • send_realtime(): Used for audio in "realtime mode" - continuous data flow without turn boundaries
  • Audio data is Base64-encoded for JSON transport (see the agent_to_client_sse() sketch below)
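
The agent_to_client_sse() generator used by the SSE endpoint is where that encoding happens. A rough sketch, assuming the same {mime_type, data} message shape the client parses (the event-inspection logic is illustrative and may need adjusting for your ADK version):

import base64
import json

async def agent_to_client_sse(live_events):
    """Convert agent events into SSE 'data:' lines for the browser."""
    async for event in live_events:
        # Pull the first content part off the event, if any
        part = event.content and event.content.parts and event.content.parts[0]
        if not part:
            continue

        # Audio output: forward raw PCM as Base64 so it survives JSON transport
        if part.inline_data and part.inline_data.mime_type.startswith("audio/pcm"):
            payload = {
                "mime_type": "audio/pcm",
                "data": base64.b64encode(part.inline_data.data).decode("ascii"),
            }
            yield f"data: {json.dumps(payload)}\n\n"

        # Text output: stream partial chunks as they arrive
        elif part.text and event.partial:
            payload = {"mime_type": "text/plain", "data": part.text}
            yield f"data: {json.dumps(payload)}\n\n"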

Step 4: Client-Side Implementation (SSE)

The client uses EventSource for SSE and fetch for HTTP POST:

// SSE connection for server-to-client
const sse_url = "http://" + window.location.host + "/events/" + sessionId;
const send_url = "http://" + window.location.host + "/send/" + sessionId;
let eventSource = null;

function connectSSE() {
  eventSource = new EventSource(sse_url + "?is_audio=" + is_audio);

  eventSource.onmessage = function (event) {
    const message_from_server = JSON.parse(event.data);

    // Handle audio data
    if (message_from_server.mime_type == "audio/pcm" && audioPlayerNode) {
      audioPlayerNode.port.postMessage(base64ToArray(message_from_server.data));
    }

    // Handle text data
    if (message_from_server.mime_type == "text/plain") {
      // Display streaming text
      displayText(message_from_server.data);
    }
  };
}

// HTTP POST for client-to-server
async function sendMessage(message) {
  await fetch(send_url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(message),
  });
}

Audio Handling:

  • Uses Web Audio API with AudioWorklets for low-latency audio processing
  • Buffers audio data in 0.2-second intervals before sending, since each client → server upload is a separate HTTP POST
  • Decodes Base64 audio from server for playback

Building the WebSocket-Based Audio Streaming App

Key Differences in WebSocket Implementation

The WebSocket version uses a single bidirectional connection instead of separate SSE and HTTP endpoints.

Server-Side: WebSocket Endpoint

@app.websocket("/ws/{user_id}")
async def websocket_endpoint(websocket: WebSocket, user_id: int, is_audio: str):
    """Client websocket endpoint"""

    await websocket.accept()

    user_id_str = str(user_id)
    live_events, live_request_queue = await start_agent_session(
        user_id_str,
        is_audio == "true"
    )

    # Run bidirectional messaging concurrently
    agent_to_client_task = asyncio.create_task(
        agent_to_client_messaging(websocket, live_events)
    )
    client_to_agent_task = asyncio.create_task(
        client_to_agent_messaging(websocket, live_request_queue)
    )

    try:
        # Wait for either task to complete
        tasks = [agent_to_client_task, client_to_agent_task]
        done, pending = await asyncio.wait(
            tasks,
            return_when=asyncio.FIRST_EXCEPTION
        )
    finally:
        live_request_queue.close()

Key Features:

  • Single WebSocket connection handles both directions
  • Two concurrent async tasks manage bidirectional communication (sketched below)
  • Proper cleanup on disconnect
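
The two messaging coroutines referenced above can be sketched roughly like this, reusing the same {mime_type, data} JSON format as the SSE version (the internals are illustrative, not the exact sample code):

import base64
import json

from google.genai.types import Blob, Content, Part

async def agent_to_client_messaging(websocket, live_events):
    """Forward agent events to the browser over the WebSocket."""
    async for event in live_events:
        part = event.content and event.content.parts and event.content.parts[0]
        if not part:
            continue
        if part.inline_data and part.inline_data.mime_type.startswith("audio/pcm"):
            await websocket.send_text(json.dumps({
                "mime_type": "audio/pcm",
                "data": base64.b64encode(part.inline_data.data).decode("ascii"),
            }))
        elif part.text and event.partial:
            await websocket.send_text(json.dumps({
                "mime_type": "text/plain",
                "data": part.text,
            }))

async def client_to_agent_messaging(websocket, live_request_queue):
    """Forward browser messages to the agent via the live request queue."""
    while True:
        message = json.loads(await websocket.receive_text())
        if message["mime_type"] == "text/plain":
            content = Content(role="user", parts=[Part.from_text(text=message["data"])])
            live_request_queue.send_content(content=content)
        elif message["mime_type"] == "audio/pcm":
            decoded = base64.b64decode(message["data"])
            live_request_queue.send_realtime(Blob(data=decoded, mime_type="audio/pcm"))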

Client-Side: WebSocket Connection

const ws_url = "wss://" + window.location.host + "/ws/" + sessionId;
let websocket = null;

function connectWebsocket() {
  websocket = new WebSocket(ws_url + "?is_audio=" + is_audio);

  websocket.onopen = function () {
    console.log("WebSocket connection opened.");
    document.getElementById("sendButton").disabled = false;
  };

  websocket.onmessage = function (event) {
    const message_from_server = JSON.parse(event.data);
    // Handle audio/text messages
  };

  websocket.onclose = function () {
    // Auto-reconnect logic
    setTimeout(connectWebsocket, 5000);
  };
}

function sendMessage(message) {
  if (websocket && websocket.readyState == WebSocket.OPEN) {
    websocket.send(JSON.stringify(message));
  }
}

Advantages:

  • Lower latency - no HTTP overhead for each message
  • True bidirectional streaming
  • Simpler client code - single connection to manage
  • Better for real-time audio streaming

Understanding Audio Processing

Both implementations use the same audio processing approach:

Audio Worklets

The applications use Web Audio API AudioWorklets for low-latency audio processing:

  • Audio Recorder Worklet: Captures microphone input as PCM audio
  • Audio Player Worklet: Plays PCM audio from the server

Audio Flow

  1. Input: Microphone → AudioRecorderWorklet → PCM data → Base64 encode → Send to server
  2. Output: Server → Base64 audio → Decode → AudioPlayerWorklet → Speakers

Key Configuration

run_config = RunConfig(
    streaming_mode=StreamingMode.BIDI,
    response_modalities=["AUDIO"] if is_audio else ["TEXT"],
    session_resumption=types.SessionResumptionConfig(),
    output_audio_transcription=types.AudioTranscriptionConfig() if is_audio else None,
)

Important Notes:

  • You must choose exactly ONE modality per session (either TEXT or AUDIO)
  • output_audio_transcription provides text representation of audio for UI display
  • session_resumption allows a dropped live connection to be resumed, improving reliability

Comparing SSE vs WebSocket

Feature          | SSE                                 | WebSocket
Bidirectional    | No (separate HTTP POST)             | Yes (single connection)
Latency          | Higher (HTTP overhead)              | Lower (direct connection)
Complexity       | Simpler (standard HTTP)             | More complex (protocol handling)
Audio Buffering  | Required (0.2s intervals)           | Can send immediately
Browser Support  | Excellent                           | Excellent
Best For         | Simpler apps, less frequent updates | Real-time apps, frequent updates

Key Learnings

1. Modality Selection is Critical

You cannot use both TEXT and AUDIO modalities simultaneously in the same session. You must choose one:

modality = "AUDIO" if is_audio else "TEXT"
run_config = RunConfig(
    response_modalities=[modality],  # Single modality only!
)

2. Turn-by-Turn vs Realtime Mode

  • send_content(): For text messages - signals a complete turn, triggers immediate response
  • send_realtime(): For audio - continuous data flow, enables natural conversation

3. Session Management

Both implementations use InMemorySessionService for session management. In production, you'd want to use a persistent session service for reliability across server restarts.
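
For example, ADK's database-backed session service can be swapped in for the in-memory one. A minimal sketch, assuming a local SQLite file (the URL and app name are placeholders):

from google.adk.runners import Runner
from google.adk.sessions import DatabaseSessionService

from google_search_agent.agent import root_agent

# Persist sessions in a database instead of process memory
session_service = DatabaseSessionService(db_url="sqlite:///./adk_sessions.db")

runner = Runner(
    agent=root_agent,
    app_name="adk_streaming_app",  # placeholder app name
    session_service=session_service,
)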

4. Audio Encoding

Audio must be Base64-encoded for JSON transport:

  • Server → Client: base64.b64encode(audio_data).decode("ascii")
  • Client → Server: base64.b64decode(data)
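
A tiny round-trip illustration (raw_pcm stands in for a real chunk of 16-bit PCM bytes):

import base64
import json

raw_pcm = b"\x00\x01\x02\x03"  # placeholder for a captured PCM chunk

# Server → client: binary PCM becomes an ASCII-safe JSON payload
outgoing = json.dumps({
    "mime_type": "audio/pcm",
    "data": base64.b64encode(raw_pcm).decode("ascii"),
})

# Client → server: decode the Base64 field back into bytes
incoming = json.loads(outgoing)
assert base64.b64decode(incoming["data"]) == raw_pcm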

5. Web Audio API Requirements

  • Requires user gesture to start (browser security)
  • AudioWorklets provide low-latency processing
  • PCM format (16-bit, 16kHz) is standard

Running the Applications

SSE Version

cd Custom_Audio_Streaming_app_SSE/adk-docs/examples/python/snippets/streaming/adk-streaming/app
pip install -r requirements.txt
uvicorn main:app --reload

WebSocket Version

cd Custom_Audio_Streaming_app_websocket/adk-streaming-ws/app
pip install -r requirements.txt
uvicorn main:app --reload

Both applications will be available at http://localhost:8000

What's Next?

Building these custom audio streaming applications opened up many possibilities:

  1. Production Deployment: Explore Cloud Run or GKE deployment strategies
  2. Session Persistence: Implement persistent session storage for production
  3. Advanced Features: Voice activity detection, audio compression, context window management
  4. Multi-User Support: Scale to handle multiple concurrent users
  5. Custom UI/UX: Build domain-specific interfaces for different use cases

Key Takeaways

  • SSE is simpler but WebSockets provide better real-time performance
  • Audio streaming requires careful encoding/decoding between Base64 and binary
  • Modality selection is critical - choose TEXT or AUDIO, not both
  • Web Audio API AudioWorklets enable low-latency audio processing
  • Session management is important for production applications
  • ADK makes it surprisingly straightforward to build custom streaming applications
