Building Ekaette — A Multimodal AI Voice Assistant on Gemini Live API and Google Cloud

Bassey John

This post was created for the purposes of entering the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge


What is Ekaette?

Ekaette is a configurable multimodal AI voice and messaging assistant for customer-facing businesses. Customers can call a phone number, speak naturally, send photos or videos on WhatsApp mid-call, and continue the same conversation across channels without repeating themselves.

It supports 6 industry templates (electronics, hotel, automotive, fashion, telecom, aviation) and is configurable per tenant and company without changing backend code.
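The per-tenant configuration idea can be sketched as follows. This is an illustration only — the field names and template shapes are assumptions, and the real registry lives in Firestore (see the Google Cloud Services section):

```python
# Illustrative per-industry template registry. Field names are hypothetical;
# the production registry is stored in Firestore, not in code.
TEMPLATES = {
    "electronics": {
        "agents": ["vision", "valuation", "catalog", "support"],
        "greeting": "Welcome to {company_name} electronics support.",
    },
    "hotel": {
        "agents": ["booking", "catalog", "support"],
        "greeting": "Thank you for calling {company_name}.",
    },
}

def resolve_company_config(industry: str, company_name: str) -> dict:
    """Merge a tenant's company name into its industry template."""
    template = TEMPLATES[industry]
    return {
        **template,
        "greeting": template["greeting"].format(company_name=company_name),
    }
```

Because the template drives agent selection and greetings, onboarding a new company is a data change, not a code change.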

Try it live:

  • 📞 Call: +2342017001127 (Africa's Talking SIP)
  • 💬 WhatsApp: +2348124975729

GitHub: github.com/ogabasseyy/ekaette


The Problem

Most customer service lines still rely on static recordings, long hold queues, and rigid call-centre routing. Customers spend significant time waiting to solve a simple request, and urgent needs are delayed behind generic queue systems that do not understand intent or priority.

We wanted to build an assistant that replaces that experience entirely — one that understands intent in real time on a live call, responds immediately when the task is simple, and continues the journey across voice and messaging without losing context.


Architecture

Ekaette Architecture

Ekaette is a split real-time system running on Google Cloud:

  • Cloud Run (Main HTTP Service) — Africa's Talking voice/SMS webhooks, WhatsApp webhooks, admin APIs, callback orchestration, text channel runtime
  • Cloud Run (Live Voice Service) — Dedicated long-lived WebSocket sessions for real-time voice streaming via the Gemini Live API
  • SIP Bridge VM (GCE) — Converts Africa's Talking RTP/G.711 audio to PCM 16kHz for Gemini, with echo suppression, noise reduction, and VAD

All channels converge on one agent graph built with Google ADK 1.26.0.
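The SIP bridge's G.711-to-PCM step can be sketched as below — a minimal illustration of µ-law decoding plus naive 8 kHz → 16 kHz upsampling. The real bridge also layers echo suppression, noise reduction, and VAD on top, none of which is shown here:

```python
# Minimal sketch: decode G.711 u-law bytes to signed 16-bit PCM, then
# upsample 8 kHz -> 16 kHz by linear interpolation for Gemini.
def ulaw_decode(byte_val: int) -> int:
    """Decode one u-law byte to a signed 16-bit PCM sample (ITU-T G.711)."""
    byte_val = ~byte_val & 0xFF          # u-law bytes are stored inverted
    sign = byte_val & 0x80
    exponent = (byte_val >> 4) & 0x07
    mantissa = byte_val & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84  # bias 0x84 = 132
    return -sample if sign else sample

def g711_to_pcm16k(ulaw_bytes: bytes) -> list[int]:
    """Decode a u-law frame and double the sample rate by interpolation."""
    pcm_8k = [ulaw_decode(b) for b in ulaw_bytes]
    pcm_16k = []
    for i, s in enumerate(pcm_8k):
        pcm_16k.append(s)
        nxt = pcm_8k[i + 1] if i + 1 < len(pcm_8k) else s
        pcm_16k.append((s + nxt) // 2)   # interpolated midpoint sample
    return pcm_16k
```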


How We Used Google AI Models

Ekaette uses 8 specialized Gemini models, each chosen for a specific role:

| Role | Model | Why |
| --- | --- | --- |
| Live voice (all agents) | gemini-live-2.5-flash-native-audio | Bidirectional streaming via bidiGenerateContent |
| Text channels | gemini-2.5-pro | WhatsApp/SMS via Runner.run_async() |
| Text fallback | gemini-2.5-flash | Automatic fallback when the primary is unavailable |
| Vision analysis | gemini-2.5-flash | Device grading and condition assessment |
| Live media analysis | gemini-2.5-pro | Cross-session media analysis during active calls |
| TTS | gemini-2.5-flash-tts | WhatsApp voice note replies |
| Image generation | gemini-3.1-flash-image-preview | Product preview images sent on WhatsApp |
| Image fallback | gemini-2.5-flash-image | Fallback for image generation |

The voice and text pipelines are intentionally separate — text models don't support bidiGenerateContent, and voice models don't need Runner.run_async().


Agent Architecture

A root orchestrator delegates to 5 specialized sub-agents:

```python
# Simplified from app/agents/ekaette_router/agent.py
def create_ekaette_router(model, channel="voice"):
    return Agent(
        name="ekaette_router",
        model=model,
        instruction=instruction,
        generate_content_config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_budget=256)
        ),
        sub_agents=[
            create_vision_agent(model, channel=channel),
            create_valuation_agent(model, channel=channel),
            create_booking_agent(model, channel=channel),
            create_catalog_agent(model, channel=channel),
            create_support_agent(model, channel=channel),
        ],
        before_agent_callback=before_agent_isolation_guard_and_dedup,
        after_agent_callback=telemetry_after_agent,
        before_model_callback=before_model_inject_config,
        on_tool_error_callback=on_tool_error_emit,
    )

# Two singletons — one per pipeline
ekaette_router = create_ekaette_router(LIVE_MODEL_ID)              # voice
text_router = create_ekaette_router(TEXT_MODEL_ID, channel="text") # WhatsApp/SMS
```

Google Cloud Services

| Service | Usage |
| --- | --- |
| Vertex AI | Gemini Live API for real-time voice; Memory Bank for cross-session recall |
| Cloud Run | Split deployment — main HTTP + dedicated live voice service |
| Firestore | Registry (templates, companies), session state, products, booking slots, knowledge |
| Cloud Storage | Media uploads (photos and videos for trade-in analysis) |
| Cloud Tasks | Async WhatsApp message processing, silence nudges |

The Hardest Challenges

Native Audio Function Calling Regression

The GA gemini-live-2.5-flash-native-audio model has significantly lower function-calling accuracy than the older preview model. It would hallucinate sub-agent names as direct function calls (catalog_agent() instead of transfer_to_agent(agent_name="catalog_agent")).

We mitigated this with explicit agent description= fields, negative instructions, and an on_tool_error_callback that always returns a dict — returning None crashes the entire bidi stream (ADK Bug #4005).
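The error callback can be as simple as the sketch below. The signature and field names are assumptions for illustration — ADK's actual callback parameters may differ — but the invariant is the one described above: every code path returns a dict.

```python
# Sketch of an on_tool_error callback that always returns a dict.
# Returning None from this callback crashes the entire bidi stream
# (ADK Bug #4005), so every branch must yield a serializable payload.
def on_tool_error_emit(tool_name, args, error):
    return {
        "status": "error",
        "tool": tool_name,
        "message": f"{tool_name} failed: {error}",
        "retryable": True,  # lets the model apologize and retry, not die
    }
```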

Duplicate Agent Transfers (ADK Bug #3395)

After multiple transfers + session resumption, the model can loop, repeatedly transferring to the same sub-agent. We built a dedup callback that fingerprints each transfer by agent name + content hash and suppresses duplicates within a 2-second cooldown:

```python
# Simplified from app/agents/dedup.py
import hashlib
import time

def sha1(text: str) -> str:
    return hashlib.sha1(text.encode()).hexdigest()

async def dedup_before_agent(callback_context):
    agent_name = callback_context.agent_name
    state = callback_context.state

    if agent_name == "ekaette_router":
        return None  # never suppress the root orchestrator

    signature = sha1(f"{agent_name}|{content_hash(callback_context.user_content)}")
    last = state.get("temp:dedup_last_signature")
    last_ts = state.get("temp:dedup_last_ts")

    # Guard against a missing timestamp on the first transfer
    if last == signature and last_ts is not None and (time.time() - last_ts) < 2.0:
        # Duplicate transfer within the cooldown — short-circuit the agent
        return types.Content(role="model",
            parts=[types.Part(text="I'm already working on that.")])

    state["temp:dedup_last_signature"] = signature
    state["temp:dedup_last_ts"] = time.time()
    return None
```

Voice Accent Inconsistency

Without voice cloning (not yet available for Gemini native audio), the assistant's accent changed unpredictably between turns. IPA notation is ignored by the audio model. We solved this by pinning the voice to Aoede and using phonetic spelling (ehkaitay) in both the system instruction and greeting trigger.
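The voice pinning looks roughly like the setup payload below. The dict shape mirrors the Live API's speech_config structure, but treat the exact field names as an approximation rather than our production config:

```python
# Sketch of the bidiGenerateContent setup payload that pins the voice.
# Field names follow the Live API speech_config shape; approximate only.
live_config = {
    "response_modalities": ["AUDIO"],
    "speech_config": {
        "voice_config": {"prebuilt_voice_config": {"voice_name": "Aoede"}}
    },
    # Phonetic spelling keeps the name's pronunciation stable across turns;
    # IPA notation is ignored by the native audio model.
    "system_instruction": "You are Ekaette (pronounced 'ehkaitay'). ...",
}
```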

Cloud Run Scaling for Telephony

A single active voice call ties up a Cloud Run instance with a long-lived WebSocket. With min-instances=1, Africa's Talking webhook callbacks got 429 errors because no instance was free. The error comes from Google Frontend — no application logs are emitted. We had to set min-instances=2 and split into separate services.
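The resulting split deployment looks roughly like this — service names and regions are illustrative, not our exact configuration:

```shell
# Hypothetical split deployment. Short HTTP webhooks and long-lived
# WebSockets get separate services so they never starve each other.
gcloud run deploy ekaette-main \
  --source . --region europe-west1 \
  --min-instances 1 --concurrency 80    # fast webhook traffic

gcloud run deploy ekaette-live-voice \
  --source . --region europe-west1 \
  --min-instances 2 --concurrency 1 \
  --timeout 3600                        # one WebSocket call per instance
```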


Key Lessons

  1. The Gemini Live API is powerful but young. Build assuming the model and SDK will surprise you. Invest in callbacks and guardrails early.

  2. Prompt engineering is not enough for production voice AI. Critical workflow decisions must live in the runtime layer, not in prompts. LLMs are strongest when they control expression, not business-critical state transitions.

  3. Voice UX is unforgiving. A 500ms silence gap feels like an eternity on a live call. We built voice fillers, non-blocking tool execution, and context compression (80k → 40k tokens) to keep conversations natural.

  4. Split your Cloud Run services for telephony. Long-lived WebSockets and short HTTP webhooks cannot share instances without starving each other.
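The context compression in lesson 3 can be illustrated with a simple eviction loop. This is a sketch, not our implementation: real token counting would use the model's tokenizer, where here we approximate one token per four characters.

```python
# Illustrative context compression: evict the oldest turns until the
# history fits a token budget (e.g. compressing 80k -> 40k tokens).
def compress_history(turns: list[dict], budget_tokens: int = 40_000) -> list[dict]:
    def approx_tokens(turn: dict) -> int:
        return max(1, len(turn["text"]) // 4)  # rough 4-chars-per-token heuristic

    kept = list(turns)
    total = sum(approx_tokens(t) for t in kept)
    while total > budget_tokens and len(kept) > 1:
        total -= approx_tokens(kept.pop(0))    # evict oldest turn first
    return kept
```

Eviction keeps recent turns intact, which matters on a live call where the model must stay grounded in what was just said.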


What's Next

  • Voice cloning when Google releases it for Gemini native audio
  • Conversation analytics for quality scoring and conversion tracking
  • Deeper industry-specific workflows beyond electronics
  • Better memory and customer follow-up across longer time windows

Built by Bassey at Baci Technologies Limited. 641 automated tests. Strict TDD. Real phone calls.

#GeminiLiveAgentChallenge
