Md Enayetur Rahman

Building Custom Audio Streaming Apps with Google ADK

My Journey into Real-Time Voice Interaction

After building basic agents and exploring streaming capabilities, I wanted to take things to the next level: creating custom web applications that enable real-time, bidirectional voice communication with AI agents. This led me to explore two different approaches for building audio streaming applications using Google ADK: Server-Sent Events (SSE) and WebSockets.

I'll walk you through both implementations, explaining the differences, when to use each approach, and sharing the code so you can build your own custom audio streaming applications.

Understanding Custom Audio Streaming

Unlike the standard ADK web interface, custom audio streaming applications give you complete control over the user experience. You can build:

  • Real-time voice conversations with AI agents
  • Custom UI/UX tailored to your specific needs
  • Bidirectional audio streaming for natural conversation flow
  • Text and audio modes that can switch dynamically
  • Production-ready applications with proper session management

Both implementations use Google's Gemini Live API models, which support native audio streaming capabilities.

Two Approaches: SSE vs WebSockets

I built two versions of the same application to understand the trade-offs:

Server-Sent Events (SSE) Approach

  • Server → Client: SSE stream for real-time updates
  • Client → Server: HTTP POST requests
  • Best for: Simpler implementations, when you don't need true bidirectional streaming

WebSocket Approach

  • Server ↔ Client: Full bidirectional WebSocket connection
  • Best for: True real-time bidirectional communication, lower latency

Let me show you how I built both!

Building the SSE-Based Audio Streaming App

Step 1: Project Setup

I started by setting up the project structure:

C:\Agent\Custom_Audio_Streaming_app_SSE>cd adk-docs\examples\python\snippets\streaming\adk-streaming\app

The project structure looks like this:

adk-streaming/
└── app/
    ├── main.py                    # FastAPI server
    ├── google_search_agent/
    │   └── agent.py              # The agent definition
    ├── static/
    │   ├── index.html            # Frontend HTML
    │   └── js/
    │       ├── app.js            # Main client logic
    │       ├── audio-player.js   # Audio playback worklet
    │       └── audio-recorder.js # Audio capture worklet
    └── requirements.txt

Step 2: Understanding the Agent

The agent uses a Gemini Live API model with Google Search capabilities:

from google.adk.agents import Agent
from google.adk.tools import google_search

root_agent = Agent(
    name="google_search_agent",
    model="gemini-2.0-flash-live-001",  # Live API model for audio
    description="Agent to answer questions using Google Search.",
    instruction="Answer the question using the Google Search tool.",
    tools=[google_search],
)

Key Points:

  • Uses gemini-2.0-flash-live-001 - a model that supports the Gemini Live API
  • Includes google_search tool for real-time information retrieval
  • Configured for both text and audio responses

Step 3: Server-Side Implementation (SSE)

The FastAPI server handles two main endpoints:

SSE Endpoint (Server → Client)

@app.get("/events/{user_id}")
async def sse_endpoint(user_id: int, is_audio: str = "false"):
    """SSE endpoint for agent to client communication"""

    # Start agent session
    user_id_str = str(user_id)
    live_events, live_request_queue = await start_agent_session(
        user_id_str,
        is_audio == "true"
    )

    # Store the request queue for this user
    active_sessions[user_id_str] = live_request_queue

    async def event_generator():
        try:
            async for data in agent_to_client_sse(live_events):
                yield data
        finally:
            # Clean up the stored request queue when the client disconnects
            live_request_queue.close()
            active_sessions.pop(user_id_str, None)

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
        }
    )

How it works:

  • Creates an ADK streaming session using start_agent_session() (sketched below)
  • Streams events from the agent to the client via SSE
  • Stores the live_request_queue for client-to-agent communication
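
For reference, here's a minimal sketch of what start_agent_session() can look like, modeled on the ADK streaming quickstart. The APP_NAME value is a placeholder, and exact signatures may differ slightly between ADK versions:

from google.adk.agents import LiveRequestQueue
from google.adk.agents.run_config import RunConfig, StreamingMode
from google.adk.runners import InMemoryRunner

from google_search_agent.agent import root_agent

APP_NAME = "adk_streaming_app"  # placeholder app name

async def start_agent_session(user_id: str, is_audio: bool = False):
    """Start a live agent session and return its event stream and request queue."""
    runner = InMemoryRunner(app_name=APP_NAME, agent=root_agent)

    # Each user gets their own session
    session = await runner.session_service.create_session(
        app_name=APP_NAME,
        user_id=user_id,
    )

    # One modality per session: AUDIO or TEXT
    run_config = RunConfig(
        streaming_mode=StreamingMode.BIDI,
        response_modalities=["AUDIO" if is_audio else "TEXT"],
    )

    # The queue carries client messages to the agent; run_live() yields
    # the agent's events back as an async stream.
    live_request_queue = LiveRequestQueue()
    live_events = runner.run_live(
        session=session,
        live_request_queue=live_request_queue,
        run_config=run_config,
    )
    return live_events, live_request_queue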

HTTP POST Endpoint (Client → Server)

@app.post("/send/{user_id}")
async def send_message_endpoint(user_id: int, request: Request):
    """HTTP endpoint for client to agent communication"""

    live_request_queue = active_sessions.get(str(user_id))
    if not live_request_queue:
        return {"error": "Session not found"}

    message = await request.json()
    mime_type = message["mime_type"]
    data = message["data"]

    if mime_type == "text/plain":
        content = Content(role="user", parts=[Part.from_text(text=data)])
        live_request_queue.send_content(content=content)
    elif mime_type == "audio/pcm":
        decoded_data = base64.b64decode(data)
        live_request_queue.send_realtime(
            Blob(data=decoded_data, mime_type=mime_type)
        )

    return {"status": "sent"}

Key Concepts:

  • send_content(): Used for text messages in "turn-by-turn mode" - signals a complete turn
  • send_realtime(): Used for audio in "realtime mode" - continuous data flow without turn boundaries
  • Audio data is Base64-encoded for JSON transport (see the agent_to_client_sse() sketch below)
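
The agent_to_client_sse() generator used by the SSE endpoint is where that encoding happens. A rough sketch, assuming the same {mime_type, data} message shape the client parses (the event-inspection logic is illustrative and may need adjusting for your ADK version):

import base64
import json

async def agent_to_client_sse(live_events):
    """Convert agent events into SSE 'data:' lines for the browser."""
    async for event in live_events:
        # Pull the first content part off the event, if any
        part = event.content and event.content.parts and event.content.parts[0]
        if not part:
            continue

        # Audio output: forward raw PCM as Base64 so it survives JSON transport
        if part.inline_data and part.inline_data.mime_type.startswith("audio/pcm"):
            payload = {
                "mime_type": "audio/pcm",
                "data": base64.b64encode(part.inline_data.data).decode("ascii"),
            }
            yield f"data: {json.dumps(payload)}\n\n"

        # Text output: stream partial chunks as they arrive
        elif part.text and event.partial:
            payload = {"mime_type": "text/plain", "data": part.text}
            yield f"data: {json.dumps(payload)}\n\n"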

Step 4: Client-Side Implementation (SSE)

The client uses EventSource for SSE and fetch for HTTP POST:

// SSE connection for server-to-client
const sse_url = "http://" + window.location.host + "/events/" + sessionId;
const send_url = "http://" + window.location.host + "/send/" + sessionId;
let eventSource = null;

function connectSSE() {
  eventSource = new EventSource(sse_url + "?is_audio=" + is_audio);

  eventSource.onmessage = function (event) {
    const message_from_server = JSON.parse(event.data);

    // Handle audio data
    if (message_from_server.mime_type == "audio/pcm" && audioPlayerNode) {
      audioPlayerNode.port.postMessage(base64ToArray(message_from_server.data));
    }

    // Handle text data
    if (message_from_server.mime_type == "text/plain") {
      // Display streaming text
      displayText(message_from_server.data);
    }
  };
}

// HTTP POST for client-to-server
async function sendMessage(message) {
  await fetch(send_url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(message),
  });
}

Audio Handling:

  • Uses Web Audio API with AudioWorklets for low-latency audio processing
  • Buffers audio data in 0.2-second intervals before sending, since each client → server upload is a separate HTTP POST
  • Decodes Base64 audio from server for playback

Building the WebSocket-Based Audio Streaming App

Key Differences in WebSocket Implementation

The WebSocket version uses a single bidirectional connection instead of separate SSE and HTTP endpoints.

Server-Side: WebSocket Endpoint

@app.websocket("/ws/{user_id}")
async def websocket_endpoint(websocket: WebSocket, user_id: int, is_audio: str):
    """Client websocket endpoint"""

    await websocket.accept()

    user_id_str = str(user_id)
    live_events, live_request_queue = await start_agent_session(
        user_id_str,
        is_audio == "true"
    )

    # Run bidirectional messaging concurrently
    agent_to_client_task = asyncio.create_task(
        agent_to_client_messaging(websocket, live_events)
    )
    client_to_agent_task = asyncio.create_task(
        client_to_agent_messaging(websocket, live_request_queue)
    )

    try:
        # Wait for either task to complete
        tasks = [agent_to_client_task, client_to_agent_task]
        done, pending = await asyncio.wait(
            tasks,
            return_when=asyncio.FIRST_EXCEPTION
        )
    finally:
        live_request_queue.close()

Key Features:

  • Single WebSocket connection handles both directions
  • Two concurrent async tasks manage bidirectional communication (sketched below)
  • Proper cleanup on disconnect
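
The two messaging coroutines referenced above can be sketched roughly like this, reusing the same {mime_type, data} JSON format as the SSE version (the internals are illustrative, not the exact sample code):

import base64
import json

from google.genai.types import Blob, Content, Part

async def agent_to_client_messaging(websocket, live_events):
    """Forward agent events to the browser over the WebSocket."""
    async for event in live_events:
        part = event.content and event.content.parts and event.content.parts[0]
        if not part:
            continue
        if part.inline_data and part.inline_data.mime_type.startswith("audio/pcm"):
            await websocket.send_text(json.dumps({
                "mime_type": "audio/pcm",
                "data": base64.b64encode(part.inline_data.data).decode("ascii"),
            }))
        elif part.text and event.partial:
            await websocket.send_text(json.dumps({
                "mime_type": "text/plain",
                "data": part.text,
            }))

async def client_to_agent_messaging(websocket, live_request_queue):
    """Forward browser messages to the agent via the live request queue."""
    while True:
        message = json.loads(await websocket.receive_text())
        if message["mime_type"] == "text/plain":
            content = Content(role="user", parts=[Part.from_text(text=message["data"])])
            live_request_queue.send_content(content=content)
        elif message["mime_type"] == "audio/pcm":
            decoded = base64.b64decode(message["data"])
            live_request_queue.send_realtime(Blob(data=decoded, mime_type="audio/pcm"))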

Client-Side: WebSocket Connection

const ws_url = "wss://" + window.location.host + "/ws/" + sessionId;
let websocket = null;

function connectWebsocket() {
  websocket = new WebSocket(ws_url + "?is_audio=" + is_audio);

  websocket.onopen = function () {
    console.log("WebSocket connection opened.");
    document.getElementById("sendButton").disabled = false;
  };

  websocket.onmessage = function (event) {
    const message_from_server = JSON.parse(event.data);
    // Handle audio/text messages
  };

  websocket.onclose = function () {
    // Auto-reconnect logic
    setTimeout(connectWebsocket, 5000);
  };
}

function sendMessage(message) {
  if (websocket && websocket.readyState == WebSocket.OPEN) {
    websocket.send(JSON.stringify(message));
  }
}

Advantages:

  • Lower latency - no HTTP overhead for each message
  • True bidirectional streaming
  • Simpler client code - single connection to manage
  • Better for real-time audio streaming

Understanding Audio Processing

Both implementations use the same audio processing approach:

Audio Worklets

The applications use Web Audio API AudioWorklets for low-latency audio processing:

  • Audio Recorder Worklet: Captures microphone input as PCM audio
  • Audio Player Worklet: Plays PCM audio from the server

Audio Flow

  1. Input: Microphone → AudioRecorderWorklet → PCM data → Base64 encode → Send to server
  2. Output: Server → Base64 audio → Decode → AudioPlayerWorklet → Speakers

Key Configuration

run_config = RunConfig(
    streaming_mode=StreamingMode.BIDI,
    response_modalities=["AUDIO"] if is_audio else ["TEXT"],
    session_resumption=types.SessionResumptionConfig(),
    output_audio_transcription=types.AudioTranscriptionConfig() if is_audio else None,
)

Important Notes:

  • You must choose exactly ONE modality per session (either TEXT or AUDIO)
  • output_audio_transcription provides text representation of audio for UI display
  • session_resumption allows a dropped live connection to be resumed, improving reliability

Comparing SSE vs WebSocket

Feature          | SSE                                 | WebSocket
Bidirectional    | No (separate HTTP POST)             | Yes (single connection)
Latency          | Higher (HTTP overhead)              | Lower (direct connection)
Complexity       | Simpler (standard HTTP)             | More complex (protocol handling)
Audio Buffering  | Required (0.2s intervals)           | Can send immediately
Browser Support  | Excellent                           | Excellent
Best For         | Simpler apps, less frequent updates | Real-time apps, frequent updates

Key Learnings

1. Modality Selection is Critical

You cannot use both TEXT and AUDIO modalities simultaneously in the same session. You must choose one:

modality = "AUDIO" if is_audio else "TEXT"
run_config = RunConfig(
    response_modalities=[modality],  # Single modality only!
)

2. Turn-by-Turn vs Realtime Mode

  • send_content(): For text messages - signals a complete turn, triggers immediate response
  • send_realtime(): For audio - continuous data flow, enables natural conversation

3. Session Management

Both implementations use InMemorySessionService for session management. In production, you'd want to use a persistent session service for reliability across server restarts.
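
For example, ADK's database-backed session service can be swapped in for the in-memory one. A minimal sketch, assuming a local SQLite file (the URL and app name are placeholders):

from google.adk.runners import Runner
from google.adk.sessions import DatabaseSessionService

from google_search_agent.agent import root_agent

# Persist sessions in a database instead of process memory
session_service = DatabaseSessionService(db_url="sqlite:///./adk_sessions.db")

runner = Runner(
    agent=root_agent,
    app_name="adk_streaming_app",  # placeholder app name
    session_service=session_service,
)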

4. Audio Encoding

Audio must be Base64-encoded for JSON transport:

  • Server → Client: base64.b64encode(audio_data).decode("ascii")
  • Client → Server: base64.b64decode(data)
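
A tiny round-trip illustration (raw_pcm stands in for a real chunk of 16-bit PCM bytes):

import base64
import json

raw_pcm = b"\x00\x01\x02\x03"  # placeholder for a captured PCM chunk

# Server → client: binary PCM becomes an ASCII-safe JSON payload
outgoing = json.dumps({
    "mime_type": "audio/pcm",
    "data": base64.b64encode(raw_pcm).decode("ascii"),
})

# Client → server: decode the Base64 field back into bytes
incoming = json.loads(outgoing)
assert base64.b64decode(incoming["data"]) == raw_pcm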

5. Web Audio API Requirements

  • Requires user gesture to start (browser security)
  • AudioWorklets provide low-latency processing
  • PCM format (16-bit, 16kHz) is standard

Running the Applications

SSE Version

cd Custom_Audio_Streaming_app_SSE/adk-docs/examples/python/snippets/streaming/adk-streaming/app
pip install -r requirements.txt
uvicorn main:app --reload

WebSocket Version

cd Custom_Audio_Streaming_app_websocket/adk-streaming-ws/app
pip install -r requirements.txt
uvicorn main:app --reload

Both applications will be available at http://localhost:8000

What's Next?

Building these custom audio streaming applications opened up many possibilities:

  1. Production Deployment: Explore Cloud Run or GKE deployment strategies
  2. Session Persistence: Implement persistent session storage for production
  3. Advanced Features: Voice activity detection, audio compression, context window management
  4. Multi-User Support: Scale to handle multiple concurrent users
  5. Custom UI/UX: Build domain-specific interfaces for different use cases

Key Takeaways

  • SSE is simpler but WebSockets provide better real-time performance
  • Audio streaming requires careful encoding/decoding between Base64 and binary
  • Modality selection is critical - choose TEXT or AUDIO, not both
  • Web Audio API AudioWorklets enable low-latency audio processing
  • Session management is important for production applications
  • ADK makes it surprisingly straightforward to build custom streaming applications
