This post was created for the purposes of entering the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge
## What is Ekaette?
Ekaette is a configurable multimodal AI voice and messaging assistant for customer-facing businesses. Customers can call a phone number, speak naturally, send photos or videos on WhatsApp mid-call, and continue the same conversation across channels without repeating themselves.
It supports 6 industry templates (electronics, hotel, automotive, fashion, telecom, aviation) and is configurable per tenant and company without changing backend code.
Try it live:
- 📞 Call: +2342017001127 (Africa's Talking SIP)
- 💬 WhatsApp: +2348124975729
GitHub: github.com/ogabasseyy/ekaette
## The Problem
Most customer service lines still rely on static recordings, long hold queues, and rigid call-centre routing. Customers spend significant time on hold for simple requests, while urgent needs sit behind generic queue systems that understand neither intent nor priority.
We wanted to build an assistant that replaces that experience entirely — one that understands intent in real time on a live call, responds immediately when the task is simple, and continues the journey across voice and messaging without losing context.
## Architecture
Ekaette is a split real-time system running on Google Cloud:
- Cloud Run (Main HTTP Service) — Africa's Talking voice/SMS webhooks, WhatsApp webhooks, admin APIs, callback orchestration, text channel runtime
- Cloud Run (Live Voice Service) — Dedicated long-lived WebSocket sessions for real-time voice streaming via the Gemini Live API
- SIP Bridge VM (GCE) — Converts Africa's Talking RTP/G.711 audio to PCM 16kHz for Gemini, with echo suppression, noise reduction, and VAD
All channels converge on one agent graph built with Google ADK 1.26.0.
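The G.711-to-PCM step on the SIP bridge can be sketched in pure Python. This is an illustrative sketch only: `decode_ulaw` and `upsample_2x` are hypothetical names, and a production bridge would use an optimized DSP path rather than per-sample Python.

```python
# Illustrative sketch of the SIP bridge's audio conversion:
# G.711 mu-law bytes (8 kHz) -> signed 16-bit PCM -> naive 2x upsample to 16 kHz.
# Function names are hypothetical; production code uses an optimized DSP pipeline.

def decode_ulaw(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    u = ~byte & 0xFF                      # mu-law stores the bitwise complement
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude

def upsample_2x(samples: list[int]) -> list[int]:
    """Double the sample rate (8 kHz -> 16 kHz) by linear interpolation."""
    out: list[int] = []
    for i, s in enumerate(samples):
        out.append(s)
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append((s + nxt) // 2)        # midpoint between neighbours
    return out

frame = bytes([0x00, 0xFF, 0x7F])         # three mu-law bytes
pcm = upsample_2x([decode_ulaw(b) for b in frame])
```

Echo suppression, noise reduction, and VAD then run on the resulting PCM stream before it is forwarded to Gemini.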
## How We Used Google AI Models
Ekaette uses 8 specialized Gemini models, each chosen for a specific role:
| Role | Model | Why |
|---|---|---|
| Live voice (all agents) | gemini-live-2.5-flash-native-audio | Bidirectional streaming via `bidiGenerateContent` |
| Text channels | gemini-2.5-pro | WhatsApp/SMS via `Runner.run_async()` |
| Text fallback | gemini-2.5-flash | Automatic fallback when primary is unavailable |
| Vision analysis | gemini-2.5-flash | Device grading and condition assessment |
| Live media analysis | gemini-2.5-pro | Cross-session media analysis during active calls |
| TTS | gemini-2.5-flash-tts | WhatsApp voice note replies |
| Image generation | gemini-3.1-flash-image-preview | Product preview images sent on WhatsApp |
| Image fallback | gemini-2.5-flash-image | Fallback for image generation |
The voice and text pipelines are intentionally separate — text models don't support `bidiGenerateContent`, and voice models don't need `Runner.run_async()`.
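The text-side fallback behaviour boils down to try-primary-then-fallback. A minimal sketch, where `generate_reply`, `call_model`, and the exception type are illustrative stand-ins rather than our actual API:

```python
# Illustrative sketch of primary/fallback model selection on the text pipeline.
# `call_model` stands in for the real SDK call; in production the failure
# signal is an API error from the primary model, not a custom exception.

TEXT_MODEL_ID = "gemini-2.5-pro"        # primary text model
TEXT_FALLBACK_ID = "gemini-2.5-flash"   # used when the primary is unavailable

class ModelUnavailable(Exception):
    pass

def generate_reply(prompt: str, call_model) -> tuple[str, str]:
    """Try the primary model; fall back to the cheaper model on failure."""
    for model_id in (TEXT_MODEL_ID, TEXT_FALLBACK_ID):
        try:
            return model_id, call_model(model_id, prompt)
        except ModelUnavailable:
            continue
    raise RuntimeError("all text models unavailable")

# Simulated call where the primary is down:
def flaky(model_id, prompt):
    if model_id == TEXT_MODEL_ID:
        raise ModelUnavailable(model_id)
    return f"[{model_id}] ok"

used, reply = generate_reply("hi", flaky)
```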
## Agent Architecture
A root orchestrator delegates to 5 specialized sub-agents:
```python
# Simplified from app/agents/ekaette_router/agent.py
def create_ekaette_router(model, channel="voice"):
    return Agent(
        name="ekaette_router",
        model=model,
        instruction=instruction,
        generate_content_config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_budget=256)
        ),
        sub_agents=[
            create_vision_agent(model, channel=channel),
            create_valuation_agent(model, channel=channel),
            create_booking_agent(model, channel=channel),
            create_catalog_agent(model, channel=channel),
            create_support_agent(model, channel=channel),
        ],
        before_agent_callback=before_agent_isolation_guard_and_dedup,
        after_agent_callback=telemetry_after_agent,
        before_model_callback=before_model_inject_config,
        on_tool_error_callback=on_tool_error_emit,
    )

# Two singletons — one per pipeline
ekaette_router = create_ekaette_router(LIVE_MODEL_ID)               # voice
text_router = create_ekaette_router(TEXT_MODEL_ID, channel="text")  # WhatsApp/SMS
```
## Google Cloud Services
| Service | Usage |
|---|---|
| Vertex AI | Gemini Live API for real-time voice, Memory Bank for cross-session recall |
| Cloud Run | Split deployment — main HTTP + dedicated live voice service |
| Firestore | Registry (templates, companies), session state, products, booking slots, knowledge |
| Cloud Storage | Media uploads (photos, videos for trade-in analysis) |
| Cloud Tasks | Async WhatsApp message processing, silence nudges |
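The Cloud Tasks pattern lets the webhook handler acknowledge WhatsApp immediately and process the message asynchronously. A sketch of the task body the worker receives — the worker URL, queue name, and payload fields here are placeholders, not our actual endpoints:

```python
# Sketch of deferring inbound WhatsApp processing to Cloud Tasks. The worker
# URL and payload fields are placeholders; the real enqueue call uses the
# google.cloud.tasks_v2 client.
import json

def build_task(worker_url: str, payload: dict) -> dict:
    """Build a Cloud Tasks HTTP task body for the async WhatsApp worker."""
    return {
        "http_request": {
            "http_method": "POST",
            "url": worker_url,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps(payload).encode(),  # Cloud Tasks expects bytes
        }
    }

task = build_task(
    "https://example-main-service.run.app/tasks/whatsapp",  # placeholder URL
    {"sender": "+234...", "type": "image", "media_id": "abc123"},
)
# In production this dict is handed to
# tasks_v2.CloudTasksClient().create_task(parent=queue_path, task=task).
```

The same queue also drives the silence nudges: a delayed task fires if the customer has not replied within a window.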
## The Hardest Challenges
### Native Audio Function Calling Regression

The GA `gemini-live-2.5-flash-native-audio` model has significantly lower function-calling accuracy than the older preview model. It would hallucinate sub-agent names as direct function calls (`catalog_agent()` instead of `transfer_to_agent(agent_name="catalog_agent")`).
We mitigated this with explicit agent `description=` fields, negative instructions, and an `on_tool_error_callback` that always returns a dict — returning `None` crashes the entire bidi stream (ADK Bug #4005).
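The rule for that callback reduces to "always return a dict, never `None`". A minimal sketch — the signature here is simplified from what ADK actually passes:

```python
# Sketch of a tool-error callback that always returns a dict, so a failed
# tool never propagates None into the live bidi stream (cf. ADK bug #4005).
# The signature is simplified; ADK passes richer context objects.

def on_tool_error_emit(tool_name: str, error: Exception) -> dict:
    """Convert any tool failure into a structured payload the model can speak."""
    return {
        "status": "error",
        "tool": tool_name,
        "message": f"{tool_name} failed: {error}",  # surfaced to the model
    }

result = on_tool_error_emit("check_booking_slots", TimeoutError("upstream timeout"))
```

Because the return value is always a well-formed dict, the model gets something it can apologize about out loud instead of the session dying mid-call.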
### Duplicate Agent Transfers (ADK Bug #3395)
After multiple transfers + session resumption, the model can loop, repeatedly transferring to the same sub-agent. We built a dedup callback that fingerprints each transfer by agent name + content hash and suppresses duplicates within a 2-second cooldown:
```python
# Simplified from app/agents/dedup.py
async def dedup_before_agent(callback_context):
    agent_name = callback_context.agent_name
    state = callback_context.state
    if agent_name == "ekaette_router":
        return None  # never suppress root
    signature = sha1(f"{agent_name}|{content_hash(callback_context.user_content)}")
    last = state.get("temp:dedup_last_signature")
    last_ts = state.get("temp:dedup_last_ts")
    if last == signature and (time.time() - last_ts) < 2.0:
        return types.Content(
            role="model",
            parts=[types.Part(text="I'm already working on that.")],
        )
    state["temp:dedup_last_signature"] = signature
    state["temp:dedup_last_ts"] = time.time()
    return None
```
### Voice Accent Inconsistency
Without voice cloning (not yet available for Gemini native audio), the assistant's accent changed unpredictably between turns, and IPA notation is ignored by the audio model. We solved this by pinning the voice to `Aoede` and using phonetic spelling (*ehkaitay*) in both the system instruction and the greeting trigger.
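Pinning the voice is a small addition to the Live API session setup. Shown here as the raw `bidiGenerateContent` setup JSON — the field names follow the public Live API docs, but treat the fragment as a sketch rather than our exact configuration:

```json
{
  "setup": {
    "model": "gemini-live-2.5-flash-native-audio",
    "generation_config": {
      "speech_config": {
        "voice_config": {
          "prebuilt_voice_config": { "voice_name": "Aoede" }
        }
      }
    }
  }
}
```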
### Cloud Run Scaling for Telephony

A single active voice call ties up a Cloud Run instance with a long-lived WebSocket. With `min-instances=1`, Africa's Talking webhook callbacks got 429 errors because no instance was free — and since the error comes from the Google Frontend, no application logs are emitted. We had to set `min-instances=2` and split voice and HTTP into separate services.
## Key Lessons
The Gemini Live API is powerful but young. Build assuming the model and SDK will surprise you. Invest in callbacks and guardrails early.
Prompt engineering is not enough for production voice AI. Critical workflow decisions must live in the runtime layer, not in prompts. LLMs are strongest when they control expression, not business-critical state transitions.
Voice UX is unforgiving. A 500ms silence gap feels like an eternity on a live call. We built voice fillers, non-blocking tool execution, and context compression (80k → 40k tokens) to keep conversations natural.
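The compression step above can be sketched as dropping the oldest turns until the history fits a token budget. The 4-chars-per-token estimate and the function names are illustrative simplifications, not our production implementation:

```python
# Illustrative sketch of context compression: drop the oldest turns until the
# running history fits a target token budget. The 4-chars-per-token estimate
# is a simplifying assumption, not a real tokenizer.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic

def compress_history(turns: list[str], budget_tokens: int) -> list[str]:
    """Keep the most recent turns whose combined estimate fits the budget."""
    kept: list[str] = []
    total = 0
    for turn in reversed(turns):          # walk newest-first
        cost = estimate_tokens(turn)
        if total + cost > budget_tokens:
            break
        kept.append(turn)
        total += cost
    return list(reversed(kept))           # restore chronological order

history = [f"turn {i}: " + "x" * 400 for i in range(800)]  # roughly 80k tokens
compressed = compress_history(history, budget_tokens=40_000)
```

The newest turns always survive, so the model keeps the live thread of the call while old small talk falls away.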
Split your Cloud Run services for telephony. Long-lived WebSockets and short HTTP webhooks cannot share instances without starving each other.
## What's Next
- Voice cloning when Google releases it for Gemini native audio
- Conversation analytics for quality scoring and conversion tracking
- Deeper industry-specific workflows beyond electronics
- Better memory and customer follow-up across longer time windows
Built by Bassey at Baci Technologies Limited. 641 automated tests. Strict TDD. Real phone calls.
#GeminiLiveAgentChallenge
