The Problem
Choosing a university course is one of the highest-stakes decisions a young person makes — and the experience is still stuck in the past. University websites are mazes of PDFs. Course advisors are booked weeks out. And for international students navigating visa requirements, ATAR cutoffs, and scholarship deadlines from overseas, the information gap is even wider.
I wanted to see what happens when you put a knowledgeable, always-available voice AI in front of a prospective student. Not a chatbot. A real conversation.
That became Waypoint — and Clara, Kingsford University's AI course counsellor.
What Clara Does
Clara is a real-time voice agent. You speak to her. She responds with audio. And while she's talking, structured cards appear in a sidebar — course tiles, scholarship cards, markdown knowledge docs, booking confirmations — all in real time, without waiting for her to finish speaking.
A typical session looks like this:
"Hi Clara, I have a friend from China who wants to explore courses at Kingsford."
(share a school transcript photo)
"Yeah, can you show me some of those courses?"
"My friend would also like to explore potential scholarships."
"If my friend wants to understand the next steps to apply, what information do you have?"
(share a Google Maps view of campus)
"I'm also looking at your campus via Google Maps. I'd like to look around."
In that single session: course cards, scholarship cards, info cards with rendered markdown tables, and Clara describing the International Centre grounded in actual knowledge base data — not hallucination.
The Architecture
```
Browser (HTML/JS)
  │  PCM audio 16 kHz + image/jpeg frames
  ▼
FastAPI WebSocket (/ws/{client_id})  ←  Cloud Run
  │
  ▼
ADK Runner (InMemorySessionService)
  │  LiveRequestQueue — bidirectional audio + vision
  ▼
Gemini Live API (gemini-live-2.5-flash-native-audio · Vertex AI)
  │  function_call → tool result
  ▼
7 ADK Tools → Cloud SQL PostgreSQL + pgvector
  │
  └─ display_data → WebSocket side-channel → Browser cards
```
The key insight is the card side-channel: display_data is an ADK tool that, instead of returning data for the LLM to summarise, pushes a JSON card payload directly over the WebSocket to the browser. Cards appear while Clara is still speaking — not after.
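The pattern is simple to sketch. In this minimal version, `active_sockets` (a map of client_id to WebSocket) and the card payload shape are assumptions for illustration, not the actual Waypoint implementation:

```python
import json

# Assumed registry of connected browsers: client_id -> WebSocket.
active_sockets: dict = {}

async def display_data(card_type: str, payload: dict, client_id: str) -> dict:
    """ADK tool sketch: push a card straight to the browser instead of
    returning bulk data for the model to narrate."""
    ws = active_sockets.get(client_id)
    if ws is None:
        return {"status": "no_client"}
    # Side-channel: the card goes over the WebSocket immediately,
    # while the model may still be mid-sentence.
    await ws.send_text(json.dumps(
        {"type": "card", "card_type": card_type, "data": payload}
    ))
    # A tiny ack is all the model sees; the card data never enters
    # the audio turn, so there is nothing for it to read aloud.
    return {"status": "displayed", "card_type": card_type}
```

Because the tool's return value is just an acknowledgement, the model can keep speaking naturally instead of summarising a wall of JSON.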
The Technical Stack
| Layer | Choice |
|---|---|
| Agent framework | google-adk |
| Model | gemini-live-2.5-flash-native-audio via Vertex AI |
| Backend | FastAPI + Uvicorn on Cloud Run |
| Database | Cloud SQL PostgreSQL 16 + pgvector |
| Embeddings | gemini-embedding-001 (1536-dim) |
| Frontend | Plain HTML/JS — no build step |
| CI/CD | Cloud Build (cloudbuild.yaml) — auto-deploys on git push to main |
| Secrets | Secret Manager |
The Hardest Parts
1. ADK's audio wire format
ADK's `GeminiLlmConnection.send_realtime()` sends microphone audio using the deprecated `mediaChunks` wire format, but the native audio model only runs voice activity detection on chunks delivered under the `audio` key. Without that key, Clara never "hears" the user's microphone.
The fix was a monkey-patch on ADK's GeminiLlmConnection class:
```python
from google.adk.models.gemini_llm_connection import GeminiLlmConnection
from google.genai import types

async def patched_send_realtime(self, input):
    if isinstance(input, types.Blob):
        # Send under the `audio` key so the native model's VAD fires.
        await self._gemini_session.send_realtime_input(audio=input)
    # ... handle other types (ActivityStart, ActivityEnd) ...

GeminiLlmConnection.send_realtime = patched_send_realtime
```
This isn't documented anywhere. It took a full day of reading ADK source code and watching WebSocket frames to find it.
2. Tool response camelCase conversion
ADK's send_tool_response() applies recursive camelCase conversion to tool result keys — so career_outcomes becomes careerOutcomes. The native audio model rejects this with a 1011 internal error, crashing the session mid-conversation.
The fix: bypass send_tool_response() entirely and manually construct the function response JSON:
```python
import json
from google.genai import types

async def patched_send_content(self, content):
    if not content.parts:
        return
    if content.parts[0].function_response:
        # Build the function-response payload by hand so tool result
        # keys keep their original snake_case (no recursive conversion).
        function_responses = [
            p.function_response for p in content.parts if p.function_response
        ]
        payload = json.dumps({
            "tool_response": {
                "functionResponses": [
                    {"id": fr.id, "name": fr.name, "response": fr.response}
                    for fr in function_responses
                ]
            }
        })
        await self._gemini_session._ws.send(payload)
    else:
        # Pass non-tool content through the normal path.
        await self._gemini_session.send(
            input=types.LiveClientContent(turns=[content], turn_complete=True)
        )
```
3. Turn isolation
The native audio model produces garbled or empty audio if a tool call and spoken text occur in the same turn. This was the hardest behavioural problem to diagnose — Clara would sometimes respond with audio AND call a tool simultaneously, producing silence.
The solution was an explicit rule in the system prompt:
"TURN ISOLATION: In any single turn, you must EITHER speak OR call a tool. Never both. If you are calling a tool, remain completely silent."
4. Context window compression
ADK's SlidingWindow compression throws a 1008 disconnect on the native audio BIDI stream. Removing it entirely from RunConfig fixed session stability.
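The resulting config is mostly notable for what it omits. A sketch, assuming the field names from current google-adk streaming examples (check your ADK version; the point here is the absent compression setting, not the exact fields):

```python
from google.adk.agents.run_config import RunConfig, StreamingMode

run_config = RunConfig(
    streaming_mode=StreamingMode.BIDI,
    response_modalities=["AUDIO"],
    # Deliberately NO context-window compression: enabling ADK's
    # SlidingWindow here triggered 1008 disconnects on the native
    # audio BIDI stream.
)
```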
The Evaluation Suite
Before submitting, I wanted evidence that Clara actually works — not just vibes from a demo run. I built a 3-layer automated eval suite:
Layer 1 — Tool correctness (23 assertions)
Direct DB calls for all 7 tools. Does search_scholarships(type="International") return exactly 1 result? Does book_campus_tour reject party_size > 6?
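Layer 1 is plain assertions against the tool functions. A sketch of that style, with the two tools stubbed on canned data for illustration (the real suite calls the DB-backed implementations, and the scholarship names here are invented):

```python
# Stub tools standing in for the real Cloud SQL-backed versions.
def search_scholarships(type: str) -> list:
    data = [
        {"name": "Example International Award", "type": "International"},
        {"name": "Example Domestic Award", "type": "Domestic"},
    ]
    return [s for s in data if s["type"] == type]

def book_campus_tour(party_size: int) -> dict:
    # Business rule from the eval: tours cap at 6 people.
    if party_size > 6:
        return {"error": "party_size must be 6 or fewer"}
    return {"status": "booked", "party_size": party_size}

def run_layer1() -> int:
    """Run the assertions; return how many passed."""
    passed = 0
    assert len(search_scholarships(type="International")) == 1; passed += 1
    assert "error" in book_campus_tour(party_size=7); passed += 1
    assert book_campus_tour(party_size=4)["status"] == "booked"; passed += 1
    return passed
```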
Layer 2 — Tool routing (24 queries)
Send natural-language queries to Gemini text API with Clara's system prompt. Does "Tell me about the Bachelor of Nursing" route to get_course_detail? Does "Hi, how are you?" correctly suppress all tool calls?
Layer 2b — Multi-turn routing (17 turns)
4 full counselling conversations. After asking about engineering courses, does "tell me more about the cybersecurity one" route to get_course_detail with the correct course inferred from context?
Result: 63/64 passed (98%)
The one miss: "I'm strong in science and prefer studying online" → routed to null instead of recommend_courses. A genuine edge case where the model treated the preference statement as conversation rather than a recommendation request.
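The Layer 2 scoring itself reduces to comparing the model's chosen tool against an expected route. A sketch of that loop, where `route_query` stands in for the real call that sends the query plus Clara's system prompt to the Gemini text API and returns the tool name it chose (or `None` when no tool was called); the two routes shown are from the examples above:

```python
# Expected routes; `None` means the model should stay conversational.
EXPECTED_ROUTES = {
    "Tell me about the Bachelor of Nursing": "get_course_detail",
    "Hi, how are you?": None,  # small talk must suppress tool calls
}

def check_routing(route_query) -> tuple:
    """route_query: query -> tool name the model chose, or None."""
    passed = failed = 0
    for query, expected in EXPECTED_ROUTES.items():
        if route_query(query) == expected:
            passed += 1
        else:
            failed += 1
    return passed, failed
```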
```shell
# Reproducible — clone the repo and run:
python eval_suite.py
```
Vision Input
Clara can see what you share. When a student shares a school transcript photo, Clara reads the grades and recommends matching courses. When they share a Google Maps view of the campus, Clara retrieves grounded information about that building from the knowledge base.
The image pipeline is straightforward with ADK:
```python
# Browser sends image as base64 over WebSocket
image_data = base64.b64decode(msg["data"])
await queue.send_realtime(
    types.Blob(data=image_data, mime_type="image/jpeg")
)
```
The harder part was the agent instruction. By default, Clara would respond conversationally to building photos ("It looks like a busy part of campus!") rather than calling search_knowledge. The fix was an explicit rule:
"CAMPUS BUILDINGS & LOCATIONS: If the student shares an image of a campus building or map, call search_knowledge with a query about that location. Do NOT respond from memory."
What I'd Do Differently
Use DatabaseSessionService from day one — even for a hackathon. InMemorySessionService resets on every reconnect, which means Clara forgets the conversation if the WebSocket drops. This was fine for demos but would fail in production.
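The swap is small, assuming ADK's `DatabaseSessionService` and a SQLAlchemy-style connection string (the URL below is illustrative, not Waypoint's):

```python
from google.adk.sessions import DatabaseSessionService

# Persists sessions across reconnects, unlike InMemorySessionService.
session_service = DatabaseSessionService(
    db_url="postgresql://user:pass@host/waypoint"  # illustrative URL
)
```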
Add a conversation replay tool — a way to replay a session from logs for debugging. When Clara gives a wrong answer in a live voice session, it's very hard to reproduce without exact audio.
Test the eval suite against the production model earlier — I ran Layer 2/2b against gemini-2.5-flash (text API), but production runs gemini-live-2.5-flash-native-audio. They're separately fine-tuned variants. The routing held up, but there were subtle differences in tool-calling tendency I only discovered during manual testing.
Try It
Live demo: https://waypoint-881109238433.us-central1.run.app