Building Joshu — A Real-Time Multimodal AI Assistant with Gemini Live
Our journey building a streaming AI agent for the Gemini Live Agent Challenge.
Try Joshu here
# Inspiration
We’ve all been there.
You’re cooking, debugging code, reading documentation, or commuting — and you want to ask your AI assistant something. But the flow breaks:
Open phone → type question → wait → read response → repeat.
Even voice assistants today still feel transactional.
Speak → pause → wait → answer.
That interaction pattern feels closer to issuing commands than having a conversation.
So we asked ourselves a simple question:
What if your AI could actually be present with you?
An assistant that can:
- Hear what you hear
- See what you see
- Watch your screen
- Help in real time
- Act on your behalf
And most importantly — respond immediately, without breaking your flow.
That question led to Joshu, named after the Zen master Zhaozhou (Joshu), famous for giving direct and immediate answers.
We built Joshu for the Gemini Live Agent Challenge, and this post walks through the entire journey — from concept to architecture to the engineering challenges we solved along the way.
# Meet Joshu
Joshu is a real-time multimodal AI personal assistant you talk to like a person.
No typing.
No waiting.
No push-to-talk.
Just conversation.
## Core capabilities
🎙️ Real-time voice conversation
Talk naturally and interrupt anytime — Joshu responds instantly.
📷 Camera vision
Point your phone or laptop camera at something and ask:
“What is this?”
🖥️ Screen sharing
Joshu can watch your screen and help with:
- debugging code
- understanding errors
- reviewing documents
📧 Send emails
Just say:
“Email Priya about tomorrow’s meeting.”
Joshu drafts and sends it.
💾 Save to Drive
Save notes, ideas, or summaries directly to Google Drive using voice.
🔍 Real-time search
Grounded answers using Google Search.
📍 Location awareness
Ask:
“Where am I?”
And Joshu uses device location to answer contextually.
📱 PWA
No install required.
Joshu runs as a Progressive Web App, so it works on any device instantly.
# System Architecture
Most AI apps follow a simple pattern:
User → Request → Model → Response
That pattern breaks for real-time assistants.
We needed continuous streaming instead of request/response.
Our architecture looked like this:
```
Browser (React PWA)
        │
        │ PCM16 audio + JPEG frames
        ▼
WebSocket Relay (Node.js / Express)
        │
        ▼
Gemini 2.5 Flash Live API
        │
        ├─ streams transcript
        └─ streams audio response
```
## How the pipeline works
1️⃣ The browser captures microphone audio and camera frames
2️⃣ Audio + images are streamed continuously via WebSocket
3️⃣ Our Node.js relay server forwards the stream to Gemini
4️⃣ Gemini Live API processes the stream in real time
5️⃣ Gemini streams back:
- transcripts
- generated audio
- tool calls
6️⃣ The frontend plays audio instantly and executes tools
The key difference:
Gemini Live maintains session state across a persistent connection.
This makes the assistant behave like a conversational partner rather than a stateless chatbot.
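To make this concrete, here is a minimal sketch of how the relay might wrap each incoming browser message before forwarding it upstream. The `{ kind, base64 }` client message shape is an assumption for illustration, and the `realtimeInput`/`mediaChunks` payload format is our reading of the Live API's streaming input structure, not a verbatim copy of Joshu's code:

```javascript
// Sketch: wrap a media chunk from the browser in a realtime-input payload.
// Field names (realtimeInput, mediaChunks) reflect the Gemini Live API's
// bidirectional streaming format as we understand it; the client-side
// { kind, base64 } shape is purely illustrative.
function buildRealtimeInput(chunk) {
  const mimeType =
    chunk.kind === 'audio' ? 'audio/pcm;rate=16000' : 'image/jpeg';
  return {
    realtimeInput: {
      mediaChunks: [{ mimeType, data: chunk.base64 }],
    },
  };
}

// The relay would call this for every WebSocket message from the browser.
const msg = buildRealtimeInput({ kind: 'audio', base64: 'AAAA' });
```

Keeping this wrapping step in the relay (rather than the browser) means the client protocol can stay stable even if the upstream API format changes.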
# Minimizing Latency
Real-time conversation only works if latency is extremely low.
The total delay can be approximated as:
```
L_total = L_capture + L_relay + L_gemini + L_tts
```
We optimized each component.
### Audio capture
We streamed **PCM16 audio chunks every ~128ms**, allowing Gemini to detect speech quickly.
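The Web Audio API delivers Float32 samples in the range [-1, 1], while PCM16 uses signed 16-bit integers, so every chunk needs a conversion step before streaming. A minimal sketch of that conversion (at 16 kHz, a ~128 ms chunk is 2048 samples):

```javascript
// Convert Float32 samples ([-1, 1]) from the Web Audio API into PCM16.
// Each sample is clamped, then scaled to the signed 16-bit range.
function floatTo16BitPCM(float32) {
  const pcm16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i]));
    // Negative values scale to -32768, positive to 32767.
    pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm16;
}
```

The clamp matters: microphone transients can briefly exceed ±1, and without it the integer cast would wrap around and produce loud clicks.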
### Relay overhead
Our **Node.js relay server adds <5ms latency**, acting as a lightweight proxy.
### Streaming responses
Gemini streams tokens immediately, allowing **TTS playback before the full response finishes**, improving responsiveness dramatically.
---
# Technology Stack
### Frontend
- React
- TypeScript
- Vite
- Web Audio API
- Canvas API
- Progressive Web App (PWA)
### Backend
- Node.js
- Express
- WebSocket relay
### AI
- Gemini 2.5 Flash Live API
### Google integrations
- Gmail API
- Google Drive API
- Google Search
- Google Identity Services (OAuth)
### Infrastructure
- Google Cloud Run
- Infrastructure as Code (`service.yaml`)
---
# Multimodal Streaming
Joshu processes multiple input streams simultaneously.
## Voice
Microphone audio is streamed as **PCM16 chunks**, allowing Gemini to perform real-time speech recognition and reasoning.
## Camera
Streaming raw video is expensive.
Instead we:
- capture frames from `<video>`
- draw them to `<canvas>`
- compress to JPEG
```javascript
canvas.toDataURL('image/jpeg', 0.8)
```

We throttle capture to **1 FPS** when active, reducing bandwidth by over 80%.
## Screen Sharing
Using the Screen Capture API, Joshu can see the user’s screen and provide contextual help such as:
- debugging stack traces
- reviewing pull requests
- explaining diagrams
# Tool Execution
One of the most powerful parts of Gemini Live is function calling.
When the model decides to perform an action, it emits a structured call
like:
```javascript
send_email({
  to: "priya@example.com",
  subject: "...",
  body: "..."
})
```
Our system then executes the action.
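The execution side can be as simple as a registry mapping the function name Gemini emits to a local handler. A sketch under that assumption; the handler names and return values here are illustrative, not Joshu's actual implementations:

```javascript
// Sketch of a tool-call dispatcher: the name in the model's structured
// call selects a handler. Handlers and argument shapes are illustrative.
const toolHandlers = {
  send_email: (args) => `emailed ${args.to}: ${args.subject}`,
  save_to_drive: (args) => `saved "${args.title}" to Drive`,
};

function executeToolCall(call) {
  const handler = toolHandlers[call.name];
  if (!handler) throw new Error(`unknown tool: ${call.name}`);
  return handler(call.args);
}
```

Rejecting unknown names up front is a small but useful safety net: the model's output, not your code, decides which string arrives here.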
## Secure OAuth Flow
1️⃣ Frontend obtains a short-lived OAuth token
2️⃣ Token is sent to backend proxy
3️⃣ Backend validates token
4️⃣ Proxy calls Gmail / Drive APIs
This keeps credentials secure and avoids exposing sensitive tokens in the browser.
# The Hardest Engineering Problems
Building a real-time multimodal AI system came with unexpected challenges.
## Mobile audio quality
Mobile browsers apply aggressive:
- noise suppression
- echo cancellation
- gain control
These can break speech detection.
We tuned getUserMedia constraints carefully:
```javascript
audio: {
  echoCancellation: true,
  noiseSuppression: false,
  autoGainControl: false
}
```
This allowed Gemini’s own audio processing to work properly.
## Interruptible conversations
Real conversations are interruptible.
Users should be able to talk while the AI is talking.
To support this we needed to:
- detect voice activity
- instantly drain the Web Audio output queue
We implemented this using:
- `AudioContext` with a `ScriptProcessorNode`
- a custom look-ahead scheduling buffer
This was the most difficult engineering challenge in the project.
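The core of the drain behavior can be modeled independently of the Web Audio API: a queue of scheduled output chunks that is flushed the instant voice activity is detected. A simplified sketch (in the real app the queue holds scheduled audio buffers; the class and method names here are illustrative):

```javascript
// Model of the interruption logic: queued audio chunks are dropped the
// moment the user starts speaking, so the assistant goes silent at once
// instead of finishing everything it had already scheduled.
class PlaybackQueue {
  constructor() {
    this.chunks = [];
  }
  enqueue(chunk) {
    this.chunks.push(chunk);
  }
  // Called on voice-activity detection: drop everything not yet played.
  drain() {
    const dropped = this.chunks.length;
    this.chunks = [];
    return dropped;
  }
}
```

The hard part in practice is not the queue itself but stopping audio that the browser has already been handed, which is where the look-ahead scheduling buffer comes in: keeping the scheduled horizon short means there is very little audio that cannot be cancelled.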
## Camera streaming cost
Streaming full video frames is expensive.
We solved this by:
- compressing frames
- throttling to 1 frame per second
- streaming only when the session is active
Bandwidth dropped by ~80%.
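The 1 FPS throttle reduces to a simple timestamp gate: a frame is sent only if enough time has passed since the last one. A minimal sketch (function name and structure are illustrative):

```javascript
// Gate frame capture to at most one frame per `intervalMs`: returns true
// only when enough time has elapsed since the last accepted frame.
function makeFrameGate(intervalMs = 1000) {
  let last = -Infinity;
  return (now) => {
    if (now - last >= intervalMs) {
      last = now;
      return true; // capture, compress, and send this frame
    }
    return false; // skip this frame entirely
  };
}
```

Checking the gate before drawing to the canvas (not after) is what saves the work: skipped frames cost neither JPEG encoding nor bandwidth.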
## Secure tool execution
Calling Gmail directly from the browser introduces:
- CORS issues
- token exposure risks
We built a Cloud Run proxy that validates and forwards requests securely.
## Silent token expiry
OAuth tokens expiring mid-session breaks the assistant.
We implemented background silent re-authentication to refresh tokens without interrupting conversations.
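One way to implement this is to refresh proactively, renewing whenever a token is within a safety margin of its deadline rather than waiting for a 401 mid-conversation. A sketch of the decision logic; the `{ expiresAt }` token shape and the five-minute margin are assumptions for illustration:

```javascript
// Decide whether to trigger a silent refresh: renew when the token is
// within `marginMs` of expiring, so it never dies mid-conversation.
// The token shape ({ expiresAt } in ms) is assumed for this sketch.
function shouldRefresh(token, now, marginMs = 5 * 60 * 1000) {
  return token.expiresAt - now <= marginMs;
}
```

Run on a background timer, this check lets the re-authentication complete while the old token is still valid, so tool calls never see a gap.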
# What We’re Proud Of
By the end of the project we achieved:
✅ True real-time voice conversation (not push-to-talk)
✅ Simultaneous voice + camera + screen input
✅ Tool execution triggered by natural speech
✅ A PWA that installs on any device
✅ Infrastructure as Code deployment to Cloud Run
✅ Silent OAuth refresh for uninterrupted sessions
# What We Learned
## Real-time systems are different
Streaming introduces new problems:
- backpressure
- dropped frames
- reconnections
- partial state
These issues rarely appear in traditional REST APIs.
## Multimodal changes UX
When an AI can see and hear simultaneously, traditional chatbot UX patterns break.
You must rethink:
- turn-taking
- feedback signals
- attention management
## Gemini Live is fundamentally different
Gemini Live maintains persistent conversational context.
You design around it like a conversation, not a request.
## Security becomes subtle
In streaming apps, users never expect authentication prompts mid-conversation.
Handling tokens securely without breaking UX requires careful system design.
# What's Next for Joshu
This is just the beginning.
🧠 Persistent memory
Joshu remembers past conversations and builds long-term context.
📅 Calendar & tasks
Schedule meetings and manage to-dos using voice.
🌐 Multilingual support
Real-time translation and multilingual conversations.
🔌 Plugin system
Users will be able to connect their own APIs and workflows.
📲 Native mobile apps
Deeper OS integration and always-on assistants.
🏥 Domain-specific assistants
Specialized modes for:
- healthcare
- education
- productivity
# Final Thoughts
We started with a simple question:
What would an AI assistant feel like if it were truly present?
Building Joshu showed us that real-time multimodal AI changes everything.
When an assistant can see, hear, speak, and act, the boundary between software and collaborator starts to disappear.
And we’re just getting started.
Built for the Gemini Live Agent Challenge.