The Problem
Choosing a university course is one of the highest-stakes decisions a young person makes — and the experience is still stuck in the past. University websites are mazes of PDFs. Course advisors are booked weeks out. And for international students navigating visa requirements, ATAR cutoffs, and scholarship deadlines from overseas, the information gap is even wider.
I wanted to see what happens when you put a knowledgeable, always-available voice AI in front of a prospective student. Not a chatbot. A real conversation.
That became Waypoint — and Clara, Kingsford University's AI course counsellor.
What Clara Does
Clara is a real-time voice agent. You speak to her. She responds with audio. And while she's talking, structured cards appear in a sidebar — course tiles, scholarship cards, markdown knowledge docs, booking confirmations — all in real time, without waiting for her to finish speaking.
A typical session looks like this:
"Hi Clara, I have a friend from China who wants to explore courses at Kingsford."
(share a school transcript photo)
"Yeah, can you show me some of those courses?"
"My friend would also like to explore potential scholarships."
"If my friend wants to understand the next steps to apply, what information do you have?"
(share a Google Maps view of campus)
"I'm also looking at your campus via Google Maps. I'd like to look around."
In that single session: course cards, scholarship cards, info cards with rendered markdown tables, and Clara describing the International Centre grounded in actual knowledge base data — not hallucination.
The Architecture
```
Browser (HTML/JS)
  │  PCM audio 16 kHz + image/jpeg frames
  ▼
FastAPI WebSocket (/ws/{client_id})  ←  Cloud Run
  │
  ▼
ADK Runner (InMemorySessionService)
  │  LiveRequestQueue — bidirectional audio + vision
  ▼
Gemini Live API (gemini-live-2.5-flash-native-audio · Vertex AI)
  │  function_call → tool result
  ▼
7 ADK Tools → Cloud SQL PostgreSQL + pgvector
  │
  └─ display_data → WebSocket side-channel → Browser cards
```
The key insight is the card side-channel: display_data is an ADK tool that, instead of returning data for the LLM to summarise, pushes a JSON card payload directly over the WebSocket to the browser. Cards appear while Clara is still speaking — not after.
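The pattern is simple to sketch. In this minimal version, `active_sockets` (a map of client_id to WebSocket) and the card payload shape are assumptions for illustration, not the actual Waypoint implementation:

```python
import json

# Assumed registry of connected browsers: client_id -> WebSocket.
active_sockets: dict = {}

async def display_data(card_type: str, payload: dict, client_id: str) -> dict:
    """ADK tool sketch: push a card straight to the browser instead of
    returning bulk data for the model to narrate."""
    ws = active_sockets.get(client_id)
    if ws is None:
        return {"status": "no_client"}
    # Side-channel: the card goes over the WebSocket immediately,
    # while the model may still be mid-sentence.
    await ws.send_text(json.dumps(
        {"type": "card", "card_type": card_type, "data": payload}
    ))
    # A tiny ack is all the model sees; the card data never enters
    # the audio turn, so there is nothing for it to read aloud.
    return {"status": "displayed", "card_type": card_type}
```

Because the tool's return value is just an acknowledgement, the model can keep speaking naturally instead of summarising a wall of JSON.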
The Technical Stack
| Layer | Choice |
|---|---|
| Agent framework | google-adk |
| Model | gemini-live-2.5-flash-native-audio via Vertex AI |
| Backend | FastAPI + Uvicorn on Cloud Run |
| Database | Cloud SQL PostgreSQL 16 + pgvector |
| Embeddings | gemini-embedding-001 (1536-dim) |
| Frontend | Plain HTML/JS — no build step |
| CI/CD | Cloud Build (cloudbuild.yaml) — auto-deploys on git push to main |
| Secrets | Secret Manager |
The Hardest Parts
1. ADK's audio wire format
ADK's `GeminiLlmConnection.send_realtime()` sends microphone audio using the deprecated `mediaChunks` wire format, but the native audio model only runs voice activity detection on chunks delivered under the `audio` key. Without that key, Clara never "hears" the user's microphone.
The fix was a monkey-patch on ADK's GeminiLlmConnection class:
```python
from google.adk.models.gemini_llm_connection import GeminiLlmConnection
from google.genai import types

async def patched_send_realtime(self, input):
    if isinstance(input, types.Blob):
        # Send under the `audio` key so the native model's VAD fires.
        await self._gemini_session.send_realtime_input(audio=input)
    # ... handle other types (ActivityStart, ActivityEnd) ...

GeminiLlmConnection.send_realtime = patched_send_realtime
```
This isn't documented anywhere. It took a full day of reading ADK source code and watching WebSocket frames to find it.
2. Tool response camelCase conversion
ADK's send_tool_response() applies recursive camelCase conversion to tool result keys — so career_outcomes becomes careerOutcomes. The native audio model rejects this with a 1011 internal error, crashing the session mid-conversation.
The fix: bypass send_tool_response() entirely and manually construct the function response JSON:
```python
import json
from google.genai import types

async def patched_send_content(self, content):
    if not content.parts:
        return
    if content.parts[0].function_response:
        # Build the function-response payload by hand so tool result
        # keys keep their original snake_case (no recursive conversion).
        function_responses = [
            p.function_response for p in content.parts if p.function_response
        ]
        payload = json.dumps({
            "tool_response": {
                "functionResponses": [
                    {"id": fr.id, "name": fr.name, "response": fr.response}
                    for fr in function_responses
                ]
            }
        })
        await self._gemini_session._ws.send(payload)
    else:
        # Pass non-tool content through the normal path.
        await self._gemini_session.send(
            input=types.LiveClientContent(turns=[content], turn_complete=True)
        )
```
3. Turn isolation
The native audio model produces garbled or empty audio if a tool call and spoken text occur in the same turn. This was the hardest behavioural problem to diagnose — Clara would sometimes respond with audio AND call a tool simultaneously, producing silence.
The solution was an explicit rule in the system prompt:
"TURN ISOLATION: In any single turn, you must EITHER speak OR call a tool. Never both. If you are calling a tool, remain completely silent."
4. Context window compression
ADK's SlidingWindow compression throws a 1008 disconnect on the native audio BIDI stream. Removing it entirely from RunConfig fixed session stability.
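The resulting config is mostly notable for what it omits. A sketch, assuming the field names from current google-adk streaming examples (check your ADK version; the point here is the absent compression setting, not the exact fields):

```python
from google.adk.agents.run_config import RunConfig, StreamingMode

run_config = RunConfig(
    streaming_mode=StreamingMode.BIDI,
    response_modalities=["AUDIO"],
    # Deliberately NO context-window compression: enabling ADK's
    # SlidingWindow here triggered 1008 disconnects on the native
    # audio BIDI stream.
)
```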
The Evaluation Suite
Before submitting, I wanted evidence that Clara actually works — not just vibes from a demo run. I built a 3-layer automated eval suite:
Layer 1 — Tool correctness (23 assertions)
Direct DB calls for all 7 tools. Does search_scholarships(type="International") return exactly 1 result? Does book_campus_tour reject party_size > 6?
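Layer 1 is plain assertions against the tool functions. A sketch of that style, with the two tools stubbed on canned data for illustration (the real suite calls the DB-backed implementations, and the scholarship names here are invented):

```python
# Stub tools standing in for the real Cloud SQL-backed versions.
def search_scholarships(type: str) -> list:
    data = [
        {"name": "Example International Award", "type": "International"},
        {"name": "Example Domestic Award", "type": "Domestic"},
    ]
    return [s for s in data if s["type"] == type]

def book_campus_tour(party_size: int) -> dict:
    # Business rule from the eval: tours cap at 6 people.
    if party_size > 6:
        return {"error": "party_size must be 6 or fewer"}
    return {"status": "booked", "party_size": party_size}

def run_layer1() -> int:
    """Run the assertions; return how many passed."""
    passed = 0
    assert len(search_scholarships(type="International")) == 1; passed += 1
    assert "error" in book_campus_tour(party_size=7); passed += 1
    assert book_campus_tour(party_size=4)["status"] == "booked"; passed += 1
    return passed
```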
Layer 2 — Tool routing (24 queries)
Send natural-language queries to Gemini text API with Clara's system prompt. Does "Tell me about the Bachelor of Nursing" route to get_course_detail? Does "Hi, how are you?" correctly suppress all tool calls?
Layer 2b — Multi-turn routing (17 turns)
4 full counselling conversations. After asking about engineering courses, does "tell me more about the cybersecurity one" route to get_course_detail with the correct course inferred from context?
Result: 63/64 passed (98%)
The one miss: "I'm strong in science and prefer studying online" → routed to null instead of recommend_courses. A genuine edge case where the model treated the preference statement as conversation rather than a recommendation request.
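The Layer 2 scoring itself reduces to comparing the model's chosen tool against an expected route. A sketch of that loop, where `route_query` stands in for the real call that sends the query plus Clara's system prompt to the Gemini text API and returns the tool name it chose (or `None` when no tool was called); the two routes shown are from the examples above:

```python
# Expected routes; `None` means the model should stay conversational.
EXPECTED_ROUTES = {
    "Tell me about the Bachelor of Nursing": "get_course_detail",
    "Hi, how are you?": None,  # small talk must suppress tool calls
}

def check_routing(route_query) -> tuple:
    """route_query: query -> tool name the model chose, or None."""
    passed = failed = 0
    for query, expected in EXPECTED_ROUTES.items():
        if route_query(query) == expected:
            passed += 1
        else:
            failed += 1
    return passed, failed
```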
```shell
# Reproducible — clone the repo and run:
python eval_suite.py
```
Vision Input
Clara can see what you share. When a student shares a school transcript photo, Clara reads the grades and recommends matching courses. When they share a Google Maps view of the campus, Clara retrieves grounded information about that building from the knowledge base.
The image pipeline is straightforward with ADK:
```python
# Browser sends image as base64 over WebSocket
image_data = base64.b64decode(msg["data"])
await queue.send_realtime(
    types.Blob(data=image_data, mime_type="image/jpeg")
)
```
The harder part was the agent instruction. By default, Clara would respond conversationally to building photos ("It looks like a busy part of campus!") rather than calling search_knowledge. The fix was an explicit rule:
"CAMPUS BUILDINGS & LOCATIONS: If the student shares an image of a campus building or map, call search_knowledge with a query about that location. Do NOT respond from memory."
What I'd Do Differently
Use DatabaseSessionService from day one — even for a hackathon. InMemorySessionService resets on every reconnect, which means Clara forgets the conversation if the WebSocket drops. This was fine for demos but would fail in production.
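The swap is small, assuming ADK's `DatabaseSessionService` and a SQLAlchemy-style connection string (the URL below is illustrative, not Waypoint's):

```python
from google.adk.sessions import DatabaseSessionService

# Persists sessions across reconnects, unlike InMemorySessionService.
session_service = DatabaseSessionService(
    db_url="postgresql://user:pass@host/waypoint"  # illustrative URL
)
```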
Add a conversation replay tool — a way to replay a session from logs for debugging. When Clara gives a wrong answer in a live voice session, it's very hard to reproduce without exact audio.
Test the eval suite against the production model earlier — I ran Layer 2/2b against gemini-2.5-flash (text API), but production runs gemini-live-2.5-flash-native-audio. They're separately fine-tuned variants. The routing held up, but there were subtle differences in tool-calling tendency I only discovered during manual testing.
Try It
Live demo: https://waypoint-881109238433.us-central1.run.app