Building Joshu — A Real-Time Multimodal AI Assistant with Gemini Live
Our journey building a streaming AI agent for the Gemini Live Agent Challenge.
Try Joshu here
# Inspiration
We’ve all been there.
You’re cooking, debugging code, reading documentation, or commuting — and you want to ask your AI assistant something. But the flow breaks:
Open phone → type question → wait → read response → repeat.
Even voice assistants today still feel transactional.
Speak → pause → wait → answer.
That interaction pattern feels closer to issuing commands than having a conversation.
So we asked ourselves a simple question:
What if your AI could actually be present with you?
An assistant that can:
- Hear what you hear
- See what you see
- Watch your screen
- Help in real time
- Act on your behalf
And most importantly — respond immediately, without breaking your flow.
That question led to Joshu, named after the Zen master Zhaozhou (Joshu), famous for giving direct and immediate answers.
We built Joshu for the Gemini Live Agent Challenge, and this post walks through the entire journey — from concept to architecture to the engineering challenges we solved along the way.
# Meet Joshu
Joshu is a real-time multimodal AI personal assistant you talk to like a person.
No typing.
No waiting.
No push-to-talk.
Just conversation.
## Core capabilities
🎙️ Real-time voice conversation
Talk naturally and interrupt anytime — Joshu responds instantly.
📷 Camera vision
Point your phone or laptop camera at something and ask:
“What is this?”
🖥️ Screen sharing
Joshu can watch your screen and help with:
- debugging code
- understanding errors
- reviewing documents
📧 Send emails
Just say:
“Email Priya about tomorrow’s meeting.”
Joshu drafts and sends it.
💾 Save to Drive
Save notes, ideas, or summaries directly to Google Drive using voice.
🔍 Real-time search
Grounded answers using Google Search.
📍 Location awareness
Ask:
“Where am I?”
And Joshu uses device location to answer contextually.
📱 PWA
No install required.
Joshu runs as a Progressive Web App, so it works on any device instantly.
# System Architecture
Most AI apps follow a simple pattern:
User → Request → Model → Response
That pattern breaks for real-time assistants.
We needed continuous streaming instead of request/response.
Our architecture looked like this:
```
Browser (React PWA)
        │
        │ PCM16 audio + JPEG frames
        ▼
WebSocket Relay (Node.js / Express)
        │
        ▼
Gemini 2.5 Flash Live API
        │
        ├─ streams transcript
        └─ streams audio response
```
## How the pipeline works
1️⃣ The browser captures microphone audio and camera frames
2️⃣ Audio + images are streamed continuously via WebSocket
3️⃣ Our Node.js relay server forwards the stream to Gemini
4️⃣ Gemini Live API processes the stream in real time
5️⃣ Gemini streams back:
- transcripts
- generated audio
- tool calls
6️⃣ The frontend plays audio instantly and executes tools
The key difference:
Gemini Live maintains session state across a persistent connection.
This makes the assistant behave like a conversational partner rather than a stateless chatbot.
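To make this concrete, here is a minimal sketch of how the relay might wrap each incoming browser message before forwarding it upstream. The `{ kind, base64 }` client message shape is an assumption for illustration, and the `realtimeInput`/`mediaChunks` payload format is our reading of the Live API's streaming input structure, not a verbatim copy of Joshu's code:

```javascript
// Sketch: wrap a media chunk from the browser in a realtime-input payload.
// Field names (realtimeInput, mediaChunks) reflect the Gemini Live API's
// bidirectional streaming format as we understand it; the client-side
// { kind, base64 } shape is purely illustrative.
function buildRealtimeInput(chunk) {
  const mimeType =
    chunk.kind === 'audio' ? 'audio/pcm;rate=16000' : 'image/jpeg';
  return {
    realtimeInput: {
      mediaChunks: [{ mimeType, data: chunk.base64 }],
    },
  };
}

// The relay would call this for every WebSocket message from the browser.
const msg = buildRealtimeInput({ kind: 'audio', base64: 'AAAA' });
```

Keeping this wrapping step in the relay (rather than the browser) means the client protocol can stay stable even if the upstream API format changes.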
# Minimizing Latency
Real-time conversation only works if latency is extremely low.
The total delay can be approximated as:
```
L_total = L_capture + L_relay + L_gemini + L_tts
```
We optimized each component.
### Audio capture
We streamed **PCM16 audio chunks every ~128ms**, allowing Gemini to detect speech quickly.
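The Web Audio API delivers Float32 samples in the range [-1, 1], while PCM16 uses signed 16-bit integers, so every chunk needs a conversion step before streaming. A minimal sketch of that conversion (at 16 kHz, a ~128 ms chunk is 2048 samples):

```javascript
// Convert Float32 samples ([-1, 1]) from the Web Audio API into PCM16.
// Each sample is clamped, then scaled to the signed 16-bit range.
function floatTo16BitPCM(float32) {
  const pcm16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i]));
    // Negative values scale to -32768, positive to 32767.
    pcm16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm16;
}
```

The clamp matters: microphone transients can briefly exceed ±1, and without it the integer cast would wrap around and produce loud clicks.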
### Relay overhead
Our **Node.js relay server adds <5ms latency**, acting as a lightweight proxy.
### Streaming responses
Gemini streams tokens immediately, allowing **TTS playback before the full response finishes**, improving responsiveness dramatically.
---
# Technology Stack
### Frontend
- React
- TypeScript
- Vite
- Web Audio API
- Canvas API
- Progressive Web App (PWA)
### Backend
- Node.js
- Express
- WebSocket relay
### AI
- Gemini 2.5 Flash Live API
### Google integrations
- Gmail API
- Google Drive API
- Google Search
- Google Identity Services (OAuth)
### Infrastructure
- Google Cloud Run
- Infrastructure as Code (`service.yaml`)
---
# Multimodal Streaming
Joshu processes multiple input streams simultaneously.
## Voice
Microphone audio is streamed as **PCM16 chunks**, allowing Gemini to perform real-time speech recognition and reasoning.
## Camera
Streaming raw video is expensive.
Instead we:
- capture frames from `<video>`
- draw them to `<canvas>`
- compress to JPEG
```javascript
canvas.toDataURL('image/jpeg', 0.8)
```

We throttle capture to **1 FPS** when active, reducing bandwidth by over 80%.
## Screen Sharing
Using the Screen Capture API, Joshu can see the user’s screen and provide contextual help such as:
- debugging stack traces
- reviewing pull requests
- explaining diagrams
# Tool Execution
One of the most powerful parts of Gemini Live is function calling.
When the model decides to perform an action, it emits a structured call
like:
```javascript
send_email({
  to: "priya@example.com",
  subject: "...",
  body: "..."
})
```
Our system then executes the action.
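The execution side can be as simple as a registry mapping the function name Gemini emits to a local handler. A sketch under that assumption; the handler names and return values here are illustrative, not Joshu's actual implementations:

```javascript
// Sketch of a tool-call dispatcher: the name in the model's structured
// call selects a handler. Handlers and argument shapes are illustrative.
const toolHandlers = {
  send_email: (args) => `emailed ${args.to}: ${args.subject}`,
  save_to_drive: (args) => `saved "${args.title}" to Drive`,
};

function executeToolCall(call) {
  const handler = toolHandlers[call.name];
  if (!handler) throw new Error(`unknown tool: ${call.name}`);
  return handler(call.args);
}
```

Rejecting unknown names up front is a small but useful safety net: the model's output, not your code, decides which string arrives here.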
## Secure OAuth Flow
1️⃣ Frontend obtains a short-lived OAuth token
2️⃣ Token is sent to backend proxy
3️⃣ Backend validates token
4️⃣ Proxy calls Gmail / Drive APIs
This keeps credentials secure and avoids exposing sensitive tokens in the browser.
# The Hardest Engineering Problems
Building a real-time multimodal AI system came with unexpected challenges.
## Mobile audio quality
Mobile browsers apply aggressive:
- noise suppression
- echo cancellation
- gain control
These can break speech detection.
We tuned getUserMedia constraints carefully:
```javascript
audio: {
  echoCancellation: true,
  noiseSuppression: false,
  autoGainControl: false
}
```
This allowed Gemini’s own audio processing to work properly.
## Interruptible conversations
Real conversations are interruptible.
Users should be able to talk while the AI is talking.
To support this we needed to:
- detect voice activity
- instantly drain the Web Audio output queue
We implemented this using:
- `AudioContext` with a `ScriptProcessorNode`
- a custom look-ahead scheduling buffer
This was the most difficult engineering challenge in the project.
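The core of the drain behavior can be modeled independently of the Web Audio API: a queue of scheduled output chunks that is flushed the instant voice activity is detected. A simplified sketch (in the real app the queue holds scheduled audio buffers; the class and method names here are illustrative):

```javascript
// Model of the interruption logic: queued audio chunks are dropped the
// moment the user starts speaking, so the assistant goes silent at once
// instead of finishing everything it had already scheduled.
class PlaybackQueue {
  constructor() {
    this.chunks = [];
  }
  enqueue(chunk) {
    this.chunks.push(chunk);
  }
  // Called on voice-activity detection: drop everything not yet played.
  drain() {
    const dropped = this.chunks.length;
    this.chunks = [];
    return dropped;
  }
}
```

The hard part in practice is not the queue itself but stopping audio that the browser has already been handed, which is where the look-ahead scheduling buffer comes in: keeping the scheduled horizon short means there is very little audio that cannot be cancelled.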
## Camera streaming cost
Streaming full video frames is expensive.
We solved this by:
- compressing frames
- throttling to 1 frame per second
- streaming only when the session is active
Bandwidth dropped by ~80%.
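The 1 FPS throttle reduces to a simple timestamp gate: a frame is sent only if enough time has passed since the last one. A minimal sketch (function name and structure are illustrative):

```javascript
// Gate frame capture to at most one frame per `intervalMs`: returns true
// only when enough time has elapsed since the last accepted frame.
function makeFrameGate(intervalMs = 1000) {
  let last = -Infinity;
  return (now) => {
    if (now - last >= intervalMs) {
      last = now;
      return true; // capture, compress, and send this frame
    }
    return false; // skip this frame entirely
  };
}
```

Checking the gate before drawing to the canvas (not after) is what saves the work: skipped frames cost neither JPEG encoding nor bandwidth.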
## Secure tool execution
Calling Gmail directly from the browser introduces:
- CORS issues
- token exposure risks
We built a Cloud Run proxy that validates and forwards requests securely.
## Silent token expiry
OAuth tokens expiring mid-session breaks the assistant.
We implemented background silent re-authentication to refresh tokens without interrupting conversations.
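One way to implement this is to refresh proactively, renewing whenever a token is within a safety margin of its deadline rather than waiting for a 401 mid-conversation. A sketch of the decision logic; the `{ expiresAt }` token shape and the five-minute margin are assumptions for illustration:

```javascript
// Decide whether to trigger a silent refresh: renew when the token is
// within `marginMs` of expiring, so it never dies mid-conversation.
// The token shape ({ expiresAt } in ms) is assumed for this sketch.
function shouldRefresh(token, now, marginMs = 5 * 60 * 1000) {
  return token.expiresAt - now <= marginMs;
}
```

Run on a background timer, this check lets the re-authentication complete while the old token is still valid, so tool calls never see a gap.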
# What We’re Proud Of
By the end of the project we achieved:
✅ True real-time voice conversation (not push-to-talk)
✅ Simultaneous voice + camera + screen input
✅ Tool execution triggered by natural speech
✅ A PWA that installs on any device
✅ Infrastructure as Code deployment to Cloud Run
✅ Silent OAuth refresh for uninterrupted sessions
# What We Learned
## Real-time systems are different
Streaming introduces new problems:
- backpressure
- dropped frames
- reconnections
- partial state
These issues rarely appear in traditional REST APIs.
## Multimodal changes UX
When an AI can see and hear simultaneously, traditional chatbot UX patterns break.
You must rethink:
- turn-taking
- feedback signals
- attention management
## Gemini Live is fundamentally different
Gemini Live maintains persistent conversational context.
You design around it like a conversation, not a request.
## Security becomes subtle
In streaming apps, users never expect authentication prompts mid-conversation.
Handling tokens securely without breaking UX requires careful system design.
# What's Next for Joshu
This is just the beginning.
🧠 Persistent memory
Joshu remembers past conversations and builds long-term context.
📅 Calendar & tasks
Schedule meetings and manage to-dos using voice.
🌐 Multilingual support
Real-time translation and multilingual conversations.
🔌 Plugin system
Users will be able to connect their own APIs and workflows.
📲 Native mobile apps
Deeper OS integration and always-on assistants.
🏥 Domain-specific assistants
Specialized modes for:
- healthcare
- education
- productivity
# Final Thoughts
We started with a simple question:
What would an AI assistant feel like if it were truly present?
Building Joshu showed us that real-time multimodal AI changes everything.
When an assistant can see, hear, speak, and act, the boundary between software and collaborator starts to disappear.
And we’re just getting started.
Built for the Gemini Live Agent Challenge.