My Journey into Real-Time Voice Interaction
After building basic agents and exploring streaming capabilities, I wanted to take things to the next level: creating custom web applications that enable real-time, bidirectional voice communication with AI agents. This led me to explore two different approaches for building audio streaming applications using Google ADK: Server-Sent Events (SSE) and WebSockets.
I'll walk you through both implementations, explaining the differences, when to use each approach, and sharing the code so you can build your own custom audio streaming applications.
Understanding Custom Audio Streaming
Unlike the standard ADK web interface, custom audio streaming applications give you complete control over the user experience. You can build:
- Real-time voice conversations with AI agents
- Custom UI/UX tailored to your specific needs
- Bidirectional audio streaming for natural conversation flow
- Text and audio modes that can switch dynamically
- Production-ready applications with proper session management
Both implementations use Google's Gemini Live API models, which support native audio streaming capabilities.
Two Approaches: SSE vs WebSockets
I built two versions of the same application to understand the trade-offs:
Server-Sent Events (SSE) Approach
- Server → Client: SSE stream for real-time updates
- Client → Server: HTTP POST requests
- Best for: Simpler implementations, when you don't need true bidirectional streaming
WebSocket Approach
- Server ↔ Client: Full bidirectional WebSocket connection
- Best for: True real-time bidirectional communication, lower latency
Let me show you how I built both!
Building the SSE-Based Audio Streaming App
Step 1: Project Setup
I started by setting up the project structure:
```
C:\Agent\Custom_Audio_Streaming_app_SSE>cd adk-docs\examples\python\snippets\streaming\adk-streaming\app
```
The project structure looks like this:
```
adk-streaming/
└── app/
    ├── main.py                   # FastAPI server
    ├── google_search_agent/
    │   └── agent.py              # The agent definition
    ├── static/
    │   ├── index.html            # Frontend HTML
    │   └── js/
    │       ├── app.js            # Main client logic
    │       ├── audio-player.js   # Audio playback worklet
    │       └── audio-recorder.js # Audio capture worklet
    └── requirements.txt
```
Step 2: Understanding the Agent
The agent uses a Gemini Live API model with Google Search capabilities:
```python
from google.adk.agents import Agent
from google.adk.tools import google_search

root_agent = Agent(
    name="google_search_agent",
    model="gemini-2.0-flash-live-001",  # Live API model for audio
    description="Agent to answer questions using Google Search.",
    instruction="Answer the question using the Google Search tool.",
    tools=[google_search],
)
```
Key Points:
- Uses `gemini-2.0-flash-live-001`, a model that supports the Gemini Live API
- Includes the `google_search` tool for real-time information retrieval
- Works with both text and audio responses
Step 3: Server-Side Implementation (SSE)
The FastAPI server handles two main endpoints:
SSE Endpoint (Server → Client)
```python
@app.get("/events/{user_id}")
async def sse_endpoint(user_id: int, is_audio: str = "false"):
    """SSE endpoint for agent-to-client communication"""
    # Start the agent session
    user_id_str = str(user_id)
    live_events, live_request_queue = await start_agent_session(
        user_id_str, is_audio == "true"
    )

    # Store the request queue so the /send endpoint can reach this session
    active_sessions[user_id_str] = live_request_queue

    def cleanup():
        # Close the queue and drop the session when the stream ends
        live_request_queue.close()
        active_sessions.pop(user_id_str, None)

    async def event_generator():
        try:
            async for data in agent_to_client_sse(live_events):
                yield data
        finally:
            cleanup()

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
        },
    )
```
How it works:
- Creates an ADK streaming session using `start_agent_session()`
- Streams events from the agent to the client via SSE
- Stores the `live_request_queue` for client-to-agent communication
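The `start_agent_session()` helper isn't shown in the snippet above. Here's a minimal sketch of what it does, closely following the ADK streaming quickstart (the `APP_NAME` constant and the exact `run_live()` signature are assumptions to verify against your ADK version):

```python
from google.adk.agents import LiveRequestQueue
from google.adk.agents.run_config import RunConfig, StreamingMode
from google.adk.runners import InMemoryRunner

from google_search_agent.agent import root_agent

APP_NAME = "adk-streaming"  # hypothetical app name for this sketch

async def start_agent_session(user_id: str, is_audio: bool = False):
    """Start an ADK live session; return (live_events, live_request_queue)."""
    runner = InMemoryRunner(app_name=APP_NAME, agent=root_agent)
    session = await runner.session_service.create_session(
        app_name=APP_NAME, user_id=user_id
    )
    run_config = RunConfig(
        streaming_mode=StreamingMode.BIDI,
        response_modalities=["AUDIO"] if is_audio else ["TEXT"],
    )
    # The queue carries client input in; run_live() yields agent events out
    live_request_queue = LiveRequestQueue()
    live_events = runner.run_live(
        session=session,
        live_request_queue=live_request_queue,
        run_config=run_config,
    )
    return live_events, live_request_queue
```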
HTTP POST Endpoint (Client → Server)
```python
import base64

from google.genai.types import Blob, Content, Part

@app.post("/send/{user_id}")
async def send_message_endpoint(user_id: int, request: Request):
    """HTTP endpoint for client to agent communication"""
    live_request_queue = active_sessions.get(str(user_id))
    if not live_request_queue:
        return {"error": "Session not found"}

    message = await request.json()
    mime_type = message["mime_type"]
    data = message["data"]

    if mime_type == "text/plain":
        # A complete user turn: triggers an immediate response
        content = Content(role="user", parts=[Part.from_text(text=data)])
        live_request_queue.send_content(content=content)
    elif mime_type == "audio/pcm":
        # Continuous audio: decode Base64 back to raw PCM bytes
        decoded_data = base64.b64decode(data)
        live_request_queue.send_realtime(
            Blob(data=decoded_data, mime_type=mime_type)
        )

    return {"status": "sent"}
```
Key Concepts:
- `send_content()`: used for text messages in "turn-by-turn mode"; signals a complete turn
- `send_realtime()`: used for audio in "realtime mode"; a continuous data flow without turn boundaries
- Audio data is Base64-encoded for JSON transport
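The `agent_to_client_sse()` generator consumed by the SSE endpoint turns agent events into SSE `data:` lines. A sketch along the lines of the ADK quickstart (the exact event fields, such as `event.partial` and `part.inline_data`, are assumptions to check against your ADK version):

```python
import base64
import json

async def agent_to_client_sse(live_events):
    """Yield agent events as SSE-formatted JSON messages."""
    async for event in live_events:
        # Forward turn-completion / interruption signals to the client
        if event.turn_complete or event.interrupted:
            message = {
                "turn_complete": event.turn_complete,
                "interrupted": event.interrupted,
            }
            yield f"data: {json.dumps(message)}\n\n"
            continue

        part = event.content and event.content.parts and event.content.parts[0]
        if not part:
            continue

        # Audio parts: Base64-encode raw PCM bytes for JSON transport
        if (part.inline_data and part.inline_data.mime_type
                and part.inline_data.mime_type.startswith("audio/pcm")):
            message = {
                "mime_type": "audio/pcm",
                "data": base64.b64encode(part.inline_data.data).decode("ascii"),
            }
            yield f"data: {json.dumps(message)}\n\n"
        # Text parts: stream partial chunks as they arrive
        elif part.text and event.partial:
            message = {"mime_type": "text/plain", "data": part.text}
            yield f"data: {json.dumps(message)}\n\n"
```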
Step 4: Client-Side Implementation (SSE)
The client uses EventSource for SSE and fetch for HTTP POST:
```javascript
// SSE connection for server-to-client
const sse_url = "http://" + window.location.host + "/events/" + sessionId;
const send_url = "http://" + window.location.host + "/send/" + sessionId;
let eventSource = null;

function connectSSE() {
  eventSource = new EventSource(sse_url + "?is_audio=" + is_audio);

  eventSource.onmessage = function (event) {
    const message_from_server = JSON.parse(event.data);

    // Handle audio data
    if (message_from_server.mime_type == "audio/pcm" && audioPlayerNode) {
      audioPlayerNode.port.postMessage(base64ToArray(message_from_server.data));
    }

    // Handle text data
    if (message_from_server.mime_type == "text/plain") {
      // Display streaming text
      displayText(message_from_server.data);
    }
  };
}

// HTTP POST for client-to-server
async function sendMessage(message) {
  await fetch(send_url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(message),
  });
}
```
Audio Handling:
- Uses Web Audio API with AudioWorklets for low-latency audio processing
- Buffers microphone audio in 0.2-second chunks before sending, since each client-to-server message is a separate HTTP POST
- Decodes Base64 audio from server for playback
Building the WebSocket-Based Audio Streaming App
Key Differences in WebSocket Implementation
The WebSocket version uses a single bidirectional connection instead of separate SSE and HTTP endpoints.
Server-Side: WebSocket Endpoint
```python
import asyncio

@app.websocket("/ws/{user_id}")
async def websocket_endpoint(websocket: WebSocket, user_id: int, is_audio: str):
    """Client websocket endpoint"""
    await websocket.accept()

    user_id_str = str(user_id)
    live_events, live_request_queue = await start_agent_session(
        user_id_str, is_audio == "true"
    )

    # Run bidirectional messaging concurrently
    agent_to_client_task = asyncio.create_task(
        agent_to_client_messaging(websocket, live_events)
    )
    client_to_agent_task = asyncio.create_task(
        client_to_agent_messaging(websocket, live_request_queue)
    )

    try:
        # Wait until either direction finishes or raises
        tasks = [agent_to_client_task, client_to_agent_task]
        done, pending = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_EXCEPTION
        )
    finally:
        live_request_queue.close()
```
Key Features:
- Single WebSocket connection handles both directions
- Two concurrent async tasks manage bidirectional communication
- Proper cleanup on disconnect
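The two messaging coroutines themselves aren't shown above. Roughly, they mirror the SSE helpers but read and write over the socket (again, the event fields are assumptions to verify against your ADK version):

```python
import base64
import json

from google.genai.types import Blob, Content, Part

async def agent_to_client_messaging(websocket, live_events):
    """Forward agent events to the browser over the WebSocket."""
    async for event in live_events:
        if event.turn_complete or event.interrupted:
            await websocket.send_text(json.dumps({
                "turn_complete": event.turn_complete,
                "interrupted": event.interrupted,
            }))
            continue
        part = event.content and event.content.parts and event.content.parts[0]
        if not part:
            continue
        if (part.inline_data and part.inline_data.mime_type
                and part.inline_data.mime_type.startswith("audio/pcm")):
            # Audio: Base64-encode PCM bytes for JSON transport
            await websocket.send_text(json.dumps({
                "mime_type": "audio/pcm",
                "data": base64.b64encode(part.inline_data.data).decode("ascii"),
            }))
        elif part.text and event.partial:
            # Text: stream partial chunks as they arrive
            await websocket.send_text(json.dumps({
                "mime_type": "text/plain",
                "data": part.text,
            }))

async def client_to_agent_messaging(websocket, live_request_queue):
    """Forward browser messages to the agent."""
    while True:
        message = json.loads(await websocket.receive_text())
        if message["mime_type"] == "text/plain":
            # Turn-by-turn: a complete user turn
            live_request_queue.send_content(
                Content(role="user", parts=[Part.from_text(text=message["data"])])
            )
        elif message["mime_type"] == "audio/pcm":
            # Realtime: continuous audio chunks
            live_request_queue.send_realtime(
                Blob(data=base64.b64decode(message["data"]), mime_type="audio/pcm")
            )
```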
Client-Side: WebSocket Connection
```javascript
// Use ws:// on http pages and wss:// on https pages
const wsProtocol = window.location.protocol === "https:" ? "wss://" : "ws://";
const ws_url = wsProtocol + window.location.host + "/ws/" + sessionId;
let websocket = null;

function connectWebsocket() {
  websocket = new WebSocket(ws_url + "?is_audio=" + is_audio);

  websocket.onopen = function () {
    console.log("WebSocket connection opened.");
    document.getElementById("sendButton").disabled = false;
  };

  websocket.onmessage = function (event) {
    const message_from_server = JSON.parse(event.data);
    // Handle audio/text messages
  };

  websocket.onclose = function () {
    // Auto-reconnect logic
    setTimeout(connectWebsocket, 5000);
  };
}

function sendMessage(message) {
  if (websocket && websocket.readyState == WebSocket.OPEN) {
    websocket.send(JSON.stringify(message));
  }
}
```
Advantages:
- Lower latency - no HTTP overhead for each message
- True bidirectional streaming
- Simpler client code - single connection to manage
- Better for real-time audio streaming
Understanding Audio Processing
Both implementations use the same audio processing approach:
Audio Worklets
The applications use Web Audio API AudioWorklets for low-latency audio processing:
- Audio Recorder Worklet: Captures microphone input as PCM audio
- Audio Player Worklet: Plays PCM audio from the server
Audio Flow
- Input: Microphone → AudioRecorderWorklet → PCM data → Base64 encode → Send to server
- Output: Server → Base64 audio → Decode → AudioPlayerWorklet → Speakers
Key Configuration
```python
from google.adk.agents.run_config import RunConfig, StreamingMode
from google.genai import types

run_config = RunConfig(
    streaming_mode=StreamingMode.BIDI,
    response_modalities=["AUDIO"] if is_audio else ["TEXT"],
    session_resumption=types.SessionResumptionConfig(),
    output_audio_transcription=types.AudioTranscriptionConfig() if is_audio else None,
)
```
Important Notes:
- You must choose exactly ONE response modality per session (either TEXT or AUDIO)
- `output_audio_transcription` provides a text representation of the audio for UI display
- `session_resumption` enables improved reliability and recovery
Comparing SSE vs WebSocket
| Feature | SSE | WebSocket |
|---|---|---|
| Bidirectional | No (separate HTTP POST) | Yes (single connection) |
| Latency | Higher (HTTP overhead) | Lower (direct connection) |
| Complexity | Simpler (standard HTTP) | More complex (protocol handling) |
| Audio Buffering | Required (0.2s intervals) | Can send immediately |
| Browser Support | Excellent | Excellent |
| Best For | Simpler apps, less frequent updates | Real-time apps, frequent updates |
Key Learnings
1. Modality Selection is Critical
You cannot use both TEXT and AUDIO modalities simultaneously in the same session. You must choose one:
```python
modality = "AUDIO" if is_audio else "TEXT"
run_config = RunConfig(
    response_modalities=[modality],  # Single modality only!
)
```
2. Turn-by-Turn vs Realtime Mode
- `send_content()`: for text messages; signals a complete turn and triggers an immediate response
- `send_realtime()`: for audio; a continuous data flow that enables natural conversation
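A minimal sketch of the contrast, assuming `live_request_queue` comes from your session setup and `pcm_chunk` from the recorder worklet:

```python
from google.genai.types import Blob, Content, Part

# Turn-by-turn mode: send a complete text turn; the agent responds immediately
live_request_queue.send_content(
    Content(role="user", parts=[Part.from_text(text="What's the weather in Tokyo?")])
)

# Realtime mode: stream raw PCM chunks continuously; the model detects
# turn boundaries (e.g. via voice activity detection) on its own
live_request_queue.send_realtime(Blob(data=pcm_chunk, mime_type="audio/pcm"))
```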
3. Session Management
Both implementations use InMemorySessionService for session management. In production, you'd want to use a persistent session service for reliability across server restarts.
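For example, ADK ships a database-backed session service that can be swapped in; a sketch, assuming `DatabaseSessionService` is available in your ADK version:

```python
from google.adk.sessions import DatabaseSessionService

# Sessions persist across server restarts when backed by a real database
session_service = DatabaseSessionService(db_url="sqlite:///./sessions.db")
```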
4. Audio Encoding
Audio must be Base64-encoded for JSON transport:
- Server → Client: `base64.b64encode(audio_data).decode("ascii")`
- Client → Server: `base64.b64decode(data)`
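A tiny self-contained round trip (the placeholder bytes stand in for a real PCM chunk):

```python
import base64

# Stand-in for a 16-bit, 16 kHz PCM chunk from the recorder worklet
pcm_chunk = b"\x00\x01" * 160

# Server → client: encode raw bytes as ASCII-safe text for JSON
encoded = base64.b64encode(pcm_chunk).decode("ascii")

# Client → server: decode back to the original raw bytes
assert base64.b64decode(encoded) == pcm_chunk
```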
5. Web Audio API Requirements
- Requires user gesture to start (browser security)
- AudioWorklets provide low-latency processing
- 16-bit PCM at 16 kHz is the standard microphone input format (the Live API returns audio at 24 kHz)
Running the Applications
SSE Version
```bash
cd Custom_Audio_Streaming_app_SSE/adk-docs/examples/python/snippets/streaming/adk-streaming/app
pip install -r requirements.txt
uvicorn main:app --reload
```
WebSocket Version
```bash
cd Custom_Audio_Streaming_app_websocket/adk-streaming-ws/app
pip install -r requirements.txt
uvicorn main:app --reload
```
Both applications will be available at http://localhost:8000
What's Next?
Building these custom audio streaming applications opened up many possibilities:
- Production Deployment: Explore Cloud Run or GKE deployment strategies
- Session Persistence: Implement persistent session storage for production
- Advanced Features: Voice activity detection, audio compression, context window management
- Multi-User Support: Scale to handle multiple concurrent users
- Custom UI/UX: Build domain-specific interfaces for different use cases
Key Takeaways
- SSE is simpler but WebSockets provide better real-time performance
- Audio streaming requires careful encoding/decoding between Base64 and binary
- Modality selection is critical - choose TEXT or AUDIO, not both
- Web Audio API AudioWorklets enable low-latency audio processing
- Session management is important for production applications
- ADK makes it surprisingly straightforward to build custom streaming applications
Resources
- Custom Audio Streaming with WebSockets Documentation - Complete WebSocket guide
- Custom Audio Streaming with SSE Documentation - Complete SSE guide
- GitHub Repository - My code repository with both implementations
- YouTube Video - Video walkthrough of building streaming applications
- ADK Multi-Tool Agent Quickstart - Learn about tools and agents
- Gemini Live API Documentation - Understanding Live API capabilities