Building Ekaette — A Multimodal AI Voice Assistant on Gemini Live API and Google Cloud

Bassey John

This post was created for the purposes of entering the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge


What is Ekaette?

Ekaette is a configurable multimodal AI voice and messaging assistant for customer-facing businesses. Customers can call a phone number, speak naturally, send photos or videos on WhatsApp mid-call, and continue the same conversation across channels without repeating themselves.

It supports 6 industry templates (electronics, hotel, automotive, fashion, telecom, aviation) and is configurable per tenant and company without changing backend code.
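The per-tenant configuration idea can be sketched as follows. This is an illustration only — the field names and template shapes are assumptions, and the real registry lives in Firestore (see the Google Cloud Services section):

```python
# Illustrative per-industry template registry. Field names are hypothetical;
# the production registry is stored in Firestore, not in code.
TEMPLATES = {
    "electronics": {
        "agents": ["vision", "valuation", "catalog", "support"],
        "greeting": "Welcome to {company_name} electronics support.",
    },
    "hotel": {
        "agents": ["booking", "catalog", "support"],
        "greeting": "Thank you for calling {company_name}.",
    },
}

def resolve_company_config(industry: str, company_name: str) -> dict:
    """Merge a tenant's company name into its industry template."""
    template = TEMPLATES[industry]
    return {
        **template,
        "greeting": template["greeting"].format(company_name=company_name),
    }
```

Because the template drives agent selection and greetings, onboarding a new company is a data change, not a code change.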

Try it live:

  • 📞 Call: +2342017001127 (Africa's Talking SIP)
  • 💬 WhatsApp: +2348124975729

GitHub: github.com/ogabasseyy/ekaette


The Problem

Most customer service lines still rely on static recordings, long hold queues, and rigid call-centre routing. Customers spend significant time waiting to solve a simple request, and urgent needs are delayed behind generic queue systems that do not understand intent or priority.

We wanted to build an assistant that replaces that experience entirely — one that understands intent in real time on a live call, responds immediately when the task is simple, and continues the journey across voice and messaging without losing context.


Architecture

Ekaette Architecture

Ekaette is a split real-time system running on Google Cloud:

  • Cloud Run (Main HTTP Service) — Africa's Talking voice/SMS webhooks, WhatsApp webhooks, admin APIs, callback orchestration, text channel runtime
  • Cloud Run (Live Voice Service) — Dedicated long-lived WebSocket sessions for real-time voice streaming via the Gemini Live API
  • SIP Bridge VM (GCE) — Converts Africa's Talking RTP/G.711 audio to PCM 16kHz for Gemini, with echo suppression, noise reduction, and VAD

All channels converge on one agent graph built with Google ADK 1.26.0.
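The SIP bridge's G.711-to-PCM step can be sketched as below — a minimal illustration of µ-law decoding plus naive 8 kHz → 16 kHz upsampling. The real bridge also layers echo suppression, noise reduction, and VAD on top, none of which is shown here:

```python
# Minimal sketch: decode G.711 u-law bytes to signed 16-bit PCM, then
# upsample 8 kHz -> 16 kHz by linear interpolation for Gemini.
def ulaw_decode(byte_val: int) -> int:
    """Decode one u-law byte to a signed 16-bit PCM sample (ITU-T G.711)."""
    byte_val = ~byte_val & 0xFF          # u-law bytes are stored inverted
    sign = byte_val & 0x80
    exponent = (byte_val >> 4) & 0x07
    mantissa = byte_val & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84  # bias 0x84 = 132
    return -sample if sign else sample

def g711_to_pcm16k(ulaw_bytes: bytes) -> list[int]:
    """Decode a u-law frame and double the sample rate by interpolation."""
    pcm_8k = [ulaw_decode(b) for b in ulaw_bytes]
    pcm_16k = []
    for i, s in enumerate(pcm_8k):
        pcm_16k.append(s)
        nxt = pcm_8k[i + 1] if i + 1 < len(pcm_8k) else s
        pcm_16k.append((s + nxt) // 2)   # interpolated midpoint sample
    return pcm_16k
```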


How We Used Google AI Models

Ekaette uses 8 specialized Gemini models, each chosen for a specific role:

| Role | Model | Why |
| --- | --- | --- |
| Live voice (all agents) | gemini-live-2.5-flash-native-audio | Bidirectional streaming via bidiGenerateContent |
| Text channels | gemini-2.5-pro | WhatsApp/SMS via Runner.run_async() |
| Text fallback | gemini-2.5-flash | Automatic fallback when the primary is unavailable |
| Vision analysis | gemini-2.5-flash | Device grading and condition assessment |
| Live media analysis | gemini-2.5-pro | Cross-session media analysis during active calls |
| TTS | gemini-2.5-flash-tts | WhatsApp voice note replies |
| Image generation | gemini-3.1-flash-image-preview | Product preview images sent on WhatsApp |
| Image fallback | gemini-2.5-flash-image | Fallback for image generation |

The voice and text pipelines are intentionally separate — text models don't support bidiGenerateContent, and voice models don't need Runner.run_async().


Agent Architecture

A root orchestrator delegates to 5 specialized sub-agents:

```python
# Simplified from app/agents/ekaette_router/agent.py
def create_ekaette_router(model, channel="voice"):
    return Agent(
        name="ekaette_router",
        model=model,
        instruction=instruction,
        generate_content_config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_budget=256)
        ),
        sub_agents=[
            create_vision_agent(model, channel=channel),
            create_valuation_agent(model, channel=channel),
            create_booking_agent(model, channel=channel),
            create_catalog_agent(model, channel=channel),
            create_support_agent(model, channel=channel),
        ],
        before_agent_callback=before_agent_isolation_guard_and_dedup,
        after_agent_callback=telemetry_after_agent,
        before_model_callback=before_model_inject_config,
        on_tool_error_callback=on_tool_error_emit,
    )

# Two singletons — one per pipeline
ekaette_router = create_ekaette_router(LIVE_MODEL_ID)              # voice
text_router = create_ekaette_router(TEXT_MODEL_ID, channel="text") # WhatsApp/SMS
```

Google Cloud Services

| Service | Usage |
| --- | --- |
| Vertex AI | Gemini Live API for real-time voice; Memory Bank for cross-session recall |
| Cloud Run | Split deployment — main HTTP + dedicated live voice service |
| Firestore | Registry (templates, companies), session state, products, booking slots, knowledge |
| Cloud Storage | Media uploads (photos and videos for trade-in analysis) |
| Cloud Tasks | Async WhatsApp message processing, silence nudges |

The Hardest Challenges

Native Audio Function Calling Regression

The GA gemini-live-2.5-flash-native-audio model has significantly lower function-calling accuracy than the older preview model. It would hallucinate sub-agent names as direct function calls (catalog_agent() instead of transfer_to_agent(agent_name="catalog_agent")).

We mitigated this with explicit agent description= fields, negative instructions, and an on_tool_error_callback that always returns a dict — returning None crashes the entire bidi stream (ADK Bug #4005).
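The error callback can be as simple as the sketch below. The signature and field names are assumptions for illustration — ADK's actual callback parameters may differ — but the invariant is the one described above: every code path returns a dict.

```python
# Sketch of an on_tool_error callback that always returns a dict.
# Returning None from this callback crashes the entire bidi stream
# (ADK Bug #4005), so every branch must yield a serializable payload.
def on_tool_error_emit(tool_name, args, error):
    return {
        "status": "error",
        "tool": tool_name,
        "message": f"{tool_name} failed: {error}",
        "retryable": True,  # lets the model apologize and retry, not die
    }
```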

Duplicate Agent Transfers (ADK Bug #3395)

After multiple transfers + session resumption, the model can loop, repeatedly transferring to the same sub-agent. We built a dedup callback that fingerprints each transfer by agent name + content hash and suppresses duplicates within a 2-second cooldown:

```python
# Simplified from app/agents/dedup.py
import hashlib
import time

def sha1(text: str) -> str:
    return hashlib.sha1(text.encode()).hexdigest()

async def dedup_before_agent(callback_context):
    agent_name = callback_context.agent_name
    state = callback_context.state

    if agent_name == "ekaette_router":
        return None  # never suppress the root orchestrator

    signature = sha1(f"{agent_name}|{content_hash(callback_context.user_content)}")
    last = state.get("temp:dedup_last_signature")
    last_ts = state.get("temp:dedup_last_ts")

    # Guard against a missing timestamp on the first transfer
    if last == signature and last_ts is not None and (time.time() - last_ts) < 2.0:
        # Duplicate transfer within the cooldown — short-circuit the agent
        return types.Content(role="model",
            parts=[types.Part(text="I'm already working on that.")])

    state["temp:dedup_last_signature"] = signature
    state["temp:dedup_last_ts"] = time.time()
    return None
```

Voice Accent Inconsistency

Without voice cloning (not yet available for Gemini native audio), the assistant's accent changed unpredictably between turns. IPA notation is ignored by the audio model. We solved this by pinning the voice to Aoede and using phonetic spelling (ehkaitay) in both the system instruction and greeting trigger.
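The voice pinning looks roughly like the setup payload below. The dict shape mirrors the Live API's speech_config structure, but treat the exact field names as an approximation rather than our production config:

```python
# Sketch of the bidiGenerateContent setup payload that pins the voice.
# Field names follow the Live API speech_config shape; approximate only.
live_config = {
    "response_modalities": ["AUDIO"],
    "speech_config": {
        "voice_config": {"prebuilt_voice_config": {"voice_name": "Aoede"}}
    },
    # Phonetic spelling keeps the name's pronunciation stable across turns;
    # IPA notation is ignored by the native audio model.
    "system_instruction": "You are Ekaette (pronounced 'ehkaitay'). ...",
}
```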

Cloud Run Scaling for Telephony

A single active voice call ties up a Cloud Run instance with a long-lived WebSocket. With min-instances=1, Africa's Talking webhook callbacks got 429 errors because no instance was free. The error comes from Google Frontend — no application logs are emitted. We had to set min-instances=2 and split into separate services.
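The resulting split deployment looks roughly like this — service names and regions are illustrative, not our exact configuration:

```shell
# Hypothetical split deployment. Short HTTP webhooks and long-lived
# WebSockets get separate services so they never starve each other.
gcloud run deploy ekaette-main \
  --source . --region europe-west1 \
  --min-instances 1 --concurrency 80    # fast webhook traffic

gcloud run deploy ekaette-live-voice \
  --source . --region europe-west1 \
  --min-instances 2 --concurrency 1 \
  --timeout 3600                        # one WebSocket call per instance
```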


Key Lessons

  1. The Gemini Live API is powerful but young. Build assuming the model and SDK will surprise you. Invest in callbacks and guardrails early.

  2. Prompt engineering is not enough for production voice AI. Critical workflow decisions must live in the runtime layer, not in prompts. LLMs are strongest when they control expression, not business-critical state transitions.

  3. Voice UX is unforgiving. A 500ms silence gap feels like an eternity on a live call. We built voice fillers, non-blocking tool execution, and context compression (80k → 40k tokens) to keep conversations natural.

  4. Split your Cloud Run services for telephony. Long-lived WebSockets and short HTTP webhooks cannot share instances without starving each other.
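The context compression in lesson 3 can be illustrated with a simple eviction loop. This is a sketch, not our implementation: real token counting would use the model's tokenizer, where here we approximate one token per four characters.

```python
# Illustrative context compression: evict the oldest turns until the
# history fits a token budget (e.g. compressing 80k -> 40k tokens).
def compress_history(turns: list[dict], budget_tokens: int = 40_000) -> list[dict]:
    def approx_tokens(turn: dict) -> int:
        return max(1, len(turn["text"]) // 4)  # rough 4-chars-per-token heuristic

    kept = list(turns)
    total = sum(approx_tokens(t) for t in kept)
    while total > budget_tokens and len(kept) > 1:
        total -= approx_tokens(kept.pop(0))    # evict oldest turn first
    return kept
```

Eviction keeps recent turns intact, which matters on a live call where the model must stay grounded in what was just said.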


What's Next

  • Voice cloning when Google releases it for Gemini native audio
  • Conversation analytics for quality scoring and conversion tracking
  • Deeper industry-specific workflows beyond electronics
  • Better memory and customer follow-up across longer time windows

Built by Bassey at Baci Technologies Limited. 641 automated tests. Strict TDD. Real phone calls.

#GeminiLiveAgentChallenge
