
Gowshik S
I built an autonomous voice agent that sees your screen and acts across your apps — here's how

Most AI agents make the human do the work.

You craft the prompt carefully. You wait. It misunderstands. You rephrase.
You wait again. That's not autonomy — that's a smarter search bar.

I wanted to build something different. An agent where you speak once,
naturally — interrupt mid-sentence, change your mind — and it keeps up
without missing a beat. Then acts. Across your applications. Without
follow-up prompts.

That's Rio.


What Rio actually does

Here's the simplest demo I can describe:

I say: "Log this as a support ticket — my order hasn't arrived in 5 days"

Rio listens, identifies it as a support request, extracts the issue,
severity, and category from my voice, then writes a row to Google Sheets
in real time — ticket ID, timestamp, status OPEN — and confirms back
via voice: "Your ticket #007 has been logged."

No typing. No clicking. One instruction, end-to-end.

And while that's happening, Rio is watching my screen. It knows what
application is open. It can read text off the display. It sees context
I never explicitly told it.


The architecture that makes it possible

Rio runs as a split system — and this is the decision that made everything else work.

Local machine  ←──── binary WebSocket ────→  Cloud Run
  owns hardware                               owns thinking
  mic, screen,                                Gemini Live session,
  UI actions,                                 ADK orchestration,
  file system                                 model routing,
  browser                                     ToolBridge

Why split?

Because the things that need to be fast (audio capture, screen frames,
UI clicks) need to live close to the hardware. And the things that need
to be powerful (model inference, multi-step planning, API integrations)
need to live in the cloud.

Mixing them kills both. Splitting them makes both excellent.


The binary WebSocket protocol

Everything — audio, video, tool calls — flows over a single WebSocket
connection. I designed a simple binary prefix protocol:

  • 0x01 + PCM16 mono audio at 16kHz → goes to Gemini Live session
  • 0x02 + JPEG screenshot bytes → goes to vision pipeline
  • JSON frames → tool calls, tool results, control signals

The key insight: audio and video can't share the same queue without
head-of-line blocking. A large JPEG frame will delay audio by hundreds
of milliseconds if they're treated equally.

Solution: priority queue on the send side. Audio at priority 0, frames
at priority 1. Audio always wins.
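Here's a minimal sketch of that send-side priority queue, using the prefix bytes described above. The class and method names are mine, not Rio's actual code; the point is the shape: `asyncio.PriorityQueue` with a monotonic sequence number as tie-breaker, so audio (priority 0) always drains before frames (priority 1).

```python
import asyncio

AUDIO_PREFIX = b"\x01"   # PCM16 mono audio at 16kHz
VIDEO_PREFIX = b"\x02"   # JPEG screenshot bytes

class PrioritySender:
    """Send-side queue: audio (priority 0) always preempts frames (priority 1)."""

    def __init__(self):
        self._queue = asyncio.PriorityQueue()
        self._seq = 0  # tie-breaker so equal priorities stay FIFO

    def put_audio(self, pcm: bytes):
        self._queue.put_nowait((0, self._next_seq(), AUDIO_PREFIX + pcm))

    def put_frame(self, jpeg: bytes):
        self._queue.put_nowait((1, self._next_seq(), VIDEO_PREFIX + jpeg))

    def _next_seq(self) -> int:
        self._seq += 1
        return self._seq

    async def drain(self, send):
        # Pop in priority order: all pending audio goes out before any frame.
        while not self._queue.empty():
            _, _, payload = self._queue.get_nowait()
            await send(payload)
```

Even if a large JPEG was enqueued first, any audio chunk that arrives before the next send slot jumps ahead of it, which is exactly the head-of-line-blocking fix.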


Native audio — not STT → LLM → TTS

This is the part I'm most proud of architecturally.

Most voice agents are secretly text agents wearing a voice costume:
speech-to-text → LLM → text-to-speech. The latency is acceptable.
The naturalness isn't.

Rio uses gemini-live-2.5-flash-native-audio — Gemini's native audio
model. Voice goes in, voice comes out. No transcription middle layer.
The model hears tone, pace, hesitation. It responds like a conversation,
not a query.

The tradeoff: native audio models are unreliable for function calling.
I had to route all tool execution through a text-based orchestrator via
live_model_tools=False. There's a hidden parsing layer between what
the model says and what the tool dispatcher hears. Getting that clean
took the most iteration of anything in this build.


True interruptions via Silero VAD

Real interruption handling is harder than it sounds.

The naive approach: detect silence, stop playback. This breaks constantly
— background noise, breath sounds, pauses mid-thought all trigger false
interruptions.

The right approach: Silero VAD running in a sounddevice callback thread,
detecting actual voice activity with a trained model. When it triggers,
it hands off to the asyncio event loop via call_soon_threadsafe.

The gotcha: getting that handoff fast enough to feel instantaneous.
VAD detection to playback stop needs to be under ~100ms or it feels
laggy. The threading boundary between sounddevice callbacks and asyncio
is where most of that latency lives.
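The handoff itself looks roughly like this. In Rio the producer thread is the sounddevice audio callback running Silero VAD; in this sketch a plain timer thread stands in for it so the thread-to-loop boundary is visible on its own. All names here are illustrative, not Rio's actual API.

```python
import asyncio
import threading
import time

class InterruptHandler:
    """Thread-safe handoff: a VAD callback thread signals the asyncio loop."""

    def __init__(self, loop: asyncio.AbstractEventLoop):
        self._loop = loop
        self.playback_stopped = asyncio.Event()

    def on_voice_detected(self):
        # Called from the audio thread. Never touch asyncio objects directly
        # from here: schedule the work onto the loop instead.
        self._loop.call_soon_threadsafe(self.playback_stopped.set)

async def main() -> float:
    handler = InterruptHandler(asyncio.get_running_loop())
    # Simulate Silero VAD firing from a separate thread 10 ms from now.
    threading.Timer(0.01, handler.on_voice_detected).start()
    t0 = time.monotonic()
    await handler.playback_stopped.wait()
    return (time.monotonic() - t0) * 1000  # handoff latency in ms
```

The entire cross-thread cost is one `call_soon_threadsafe`; everything downstream (stopping playback, flushing buffers) stays on the loop, which is what keeps the perceived latency under that ~100ms budget.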


The ToolBridge pattern

This is the pattern I'm most likely to reuse on future projects.

The problem: Gemini running in Cloud Run needs to execute actions on
my local machine. It can't do that directly. So I built a bridge:

# Cloud side — when Gemini calls a tool:
import asyncio
import uuid

pending: dict[str, asyncio.Future] = {}  # call_id → future awaiting a result

async def bridge_tool_call(name, args):
    call_id = str(uuid.uuid4())
    future = asyncio.get_running_loop().create_future()
    pending[call_id] = future

    await ws.send_json({
        "type": "tool_call",
        "id": call_id,
        "name": name,
        "args": args
    })

    try:
        return await asyncio.wait_for(future, timeout=30)
    finally:
        pending.pop(call_id, None)  # don't leak entries on timeout

# Cloud side — when a tool_result frame comes back, resolve the future:
def handle_tool_result(msg):
    future = pending.get(msg["id"])
    if future and not future.done():
        future.set_result(msg["result"])

# Local side — when tool_call arrives:
async def handle_tool_call(msg):
    result = await execute_tool(msg["name"], msg["args"])
    await ws.send_json({
        "type": "tool_result",
        "id": msg["id"],
        "result": result
    })

Every tool is a closured async function, isolated per WebSocket connection.
Adding a new tool is 10 lines. The cloud model never knows it's talking
to a local machine.
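The "closured async function" part might look like this on the local side. The tool names and bodies here are hypothetical stand-ins; the design point is that each WebSocket connection gets its own registry closed over its own `ws` and session state, so nothing leaks between connections.

```python
import asyncio

def make_tools(ws):
    """Build this connection's tool registry. Each tool is an async
    closure over `ws` and any per-session state. (Tool names and
    bodies are illustrative, not Rio's actual tool set.)"""

    async def append_sheet_row(args):
        # ...would call the Sheets client bound to this session...
        return {"status": "ok"}

    async def read_screen_text(args):
        # ...would OCR the current screenshot buffer...
        return {"text": "..."}

    return {
        "append_sheet_row": append_sheet_row,
        "read_screen_text": read_screen_text,
    }

async def execute_tool(tools, name, args):
    # Unknown tools produce an error result instead of raising,
    # so the cloud side always gets a tool_result back.
    if name not in tools:
        return {"error": f"unknown tool: {name}"}
    return await tools[name](args)
```

Registering a new tool really is just adding one closure and one dict entry, which is where the "10 lines" figure comes from.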


Struggle detection — ML inside the agent

One thing I haven't seen in other agent demos: behavioral adaptation.

Rio runs a scikit-learn classification model in real time, trained on
user interaction patterns — retry frequency, error rate, response latency
variance. When the model predicts a user is struggling, Rio shifts its
response style proactively. Simpler language. More scaffolding. Without
being asked.

The hard part was feature engineering. Hesitation alone isn't struggle.
A retry alone isn't struggle. The signal lives in the combination —
and finding that combination required building a small labeled dataset
from my own interaction logs.
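A toy version of that classifier, using the three features named above. The training rows here are synthetic placeholders; Rio's actual model is trained on a labeled dataset from real interaction logs, and the feature scaling and model choice may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per interaction window, mirroring the post:
# [retry_frequency, error_rate, response_latency_variance]
# Rows below are synthetic stand-ins for the real labeled dataset.
X = np.array([
    [0.0, 0.0, 0.1],   # smooth session
    [0.1, 0.0, 0.2],
    [0.2, 0.1, 0.3],
    [0.8, 0.6, 1.5],   # struggling: retries + errors + erratic latency
    [0.9, 0.5, 1.2],
    [0.7, 0.7, 1.8],
])
y = np.array([0, 0, 0, 1, 1, 1])  # 1 = struggling

clf = LogisticRegression().fit(X, y)

def is_struggling(retries: float, errors: float, latency_var: float) -> bool:
    """Predict from the combination of signals, not any single one."""
    return bool(clf.predict([[retries, errors, latency_var]])[0])
```

Note that no single feature decides the label: the model weighs all three together, which matches the observation that hesitation alone, or a retry alone, isn't struggle.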

The next evolution: wiring this directly into ADK's LiveRequestQueue
as a streaming tool so the adaptation happens inside the agent loop,
not as an external observer.


The graceful degradation model

Free-tier Gemini API limits are real. 30 RPM disappears fast.

I built a token bucket rate limiter with 4 degradation levels:

Level | Condition  | Rio's behavior
------|------------|--------------------------------
0     | Healthy    | Full capability
1     | 70% bucket | Reduce screenshot frequency
2     | 40% bucket | Disable proactive vision
3     | 10% bucket | Voice only, no tools
4     | Empty      | Graceful hold with user message

The agent never crashes. It degrades. That distinction matters a lot
in a live demo.
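A sketch of that limiter, mapping bucket fill ratio to the levels in the table. The class name and exact thresholds-as-code are my reconstruction; the 30-token capacity matches the 30 RPM free-tier limit mentioned above.

```python
import time

class DegradingTokenBucket:
    """Token bucket whose fill ratio selects a degradation level.

    Capacity 30 matches a 30 RPM limit; refill at 0.5 tokens/sec
    replenishes the full bucket once per minute.
    """

    def __init__(self, capacity: int = 30, refill_per_sec: float = 0.5):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self._last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self._last) * self.refill_per_sec)
        self._last = now

    def try_take(self) -> bool:
        """Consume one token for an API call; False means we're out."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def level(self) -> int:
        """Map fill ratio to the degradation table: 0 = healthy, 4 = empty."""
        self._refill()
        ratio = self.tokens / self.capacity
        if ratio > 0.70:
            return 0   # full capability
        if ratio > 0.40:
            return 1   # reduce screenshot frequency
        if ratio > 0.10:
            return 2   # disable proactive vision
        if ratio > 0.00:
            return 3   # voice only, no tools
        return 4       # graceful hold with user message
```

Callers check `level()` before each capability decision, so behavior ratchets down smoothly instead of hitting a hard 429 wall.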


What I'd do differently

Start with the protocol design. I retrofitted the binary frame protocol
onto an existing JSON-only WebSocket. It worked, but doing it upfront would
have saved two days of refactoring.

Wire the struggle detector earlier. I built it as a standalone pipeline
first, then realized it needed to be inside the agent loop to be useful.
That's a lesson about where ML belongs in agent systems — not as an
observer, but as an actor.

InMemorySessionService is a trap. Fine for development. The moment
Cloud Run cold-starts mid-conversation, your session handle is gone.
Design for persistence from day one.


What's next

The current build is one powerful agent. The next version is four:

  • Orchestrator — ADK + Gemini 2.5 Pro, plans and delegates
  • Live Agent — owns all voice and vision
  • UI Navigator — Playwright + Computer Use API, true visual grounding
  • Creative Agent — interleaved output, mixed-media responses

Connected via A2A protocol with Agent Cards. Rio as a node in a
larger agent network, not a standalone tool.


Try it / follow along

GitHub: https://github.com/Gowshik-S/Gemini-Live-Agent

The repo includes a one-command Cloud Run deployment via service.yaml:

gcloud run services replace cloud/service.yaml

If you're building something in the agent space — especially anything
involving Gemini Live API or ADK — I'd love to compare notes.
Drop a comment or find me on LinkedIn.

One human. One instruction. Rio does the rest.
