I built a desktop app that lets people control any website using only their voice. You talk, it takes a screenshot, sends it to Gemini 2.5 Flash, gets back a structured action, runs it in the browser, and repeats. The whole time, it's narrating what it's doing out loud. Here's how it came together for the Gemini Live Agent Challenge.
The Problem
Picture this: you can't use a mouse. Maybe you can't use a keyboard either. You might have a repetitive strain injury, a motor impairment, or honestly, you might have a broken wrist. The web doesn't really care. It expects you to click tiny buttons, scroll precisely, type into fields, and drag things around.
There are screen readers and voice control tools out there, but they all seem to expect you to learn their language. Memorize commands. Know what things are called in the DOM. Fight with dictation software that mishears every other word.
I wanted something where you could say what you want:
"Go to YouTube and search for lo-fi beats."
No special syntax. No menu navigation. Just plain language.
That's what Sally does.
Sally is a voice-first accessibility agent. It's an Electron desktop app built for people with motor impairments, cognitive disabilities, repetitive strain injuries, or anyone who wants hands-free web browsing. You hold the push-to-talk key (Right Alt on Windows, Right Option on macOS), say what you need, and Sally handles the rest. She navigates, clicks, types, scrolls, and narrates everything she does so you always know what's going on.
What Sally Is Made Of
Quick overview before I get into the details:
- Gemini 2.5 Flash for multimodal vision, speech-to-text, and deciding what action to take next
- `@google/genai` SDK (the official Google Gen AI SDK for Node.js) for all Gemini API calls
- Google Cloud Run as an optional serverless backend that proxies Gemini requests
- Google Cloud Logging for structured observability and session tracking
- A persistent Electron-owned browser with multi-tab support, live screenshots, DOM extraction, and DOM-first action execution
- Electron + React + TypeScript for the cross-platform desktop shell
- ElevenLabs for neural text-to-speech so Sally sounds like a real person
- `uiohook-napi` for a system-wide push-to-talk hotkey that works even when the app isn't in focus
Why I Went with Gemini 2.5 Flash
Honestly, this was one of the easier decisions. I needed three things from a model:
- Vision. It needs to look at a screenshot and actually understand what's on the page.
- Reasoning. Given a user's instruction and what's on screen, it needs to figure out the right next step.
- Speed. Sally runs in a tight loop where every action involves a round-trip to the model. If the model takes 5 seconds per call and a task needs 10 steps, that's almost a minute of waiting. Not acceptable.
Gemini 2.5 Flash nails all three. I send it a base64-encoded PNG screenshot along with the user's instruction, and it comes back with structured JSON telling me exactly what to do. Click this button. Fill in that text field. Scroll down. Go to this URL.
The speed really matters. A single task like "search Google for weather in Tokyo" involves maybe 4–5 model calls. With Flash, each call comes back fast enough that the whole thing feels smooth rather than painful.
I set the temperature to 0.2 and kept it there. When you're automating a browser for someone who can't easily undo a wrong click, you want the model to be predictable. High temperature and creative button-clicking don't mix.
Bonus: Speech-to-Text Too
One thing I didn't expect going in is that Gemini 2.5 Flash also handles speech-to-text. So the same model that looks at screenshots also transcribes voice commands. Fewer moving parts, fewer API keys to manage, fewer things that can break. I kept Whisper as a fallback option, but in the default setup, Gemini handles vision, reasoning, and transcription all in one.
How the Agentic Loop Works
This is the core of Sally. Here's the flow in plain language:
- You talk. Hold the push-to-talk key and say something like "Open Gmail and compose a new email to Mom."
- Sally transcribes your voice into text using Gemini.
- Sally opens or reuses its own browser, captures a live screenshot, and extracts structured page context from the DOM.
- Sally sends everything to Gemini: the screenshot, your instruction, the current page URL and title, the structured page context, and a log of actions already taken in this session.
- Gemini sends back JSON with two things: a short narration sentence and an action to execute.
{
"narration": "I'll click the Compose button for you.",
"action": { "type": "click", "selector": "Compose" }
}
- Sally runs the action in its persistent Electron browser using DOM-first execution to click, type, scroll, or navigate in the live page.
- Sally speaks the narration through ElevenLabs TTS so you hear what just happened.
- Sally waits for the page to settle using loading detection plus bounded delays, then loops back to step 3.
The loop caps out at 40 iterations or 10 minutes, whichever hits first. If Gemini decides the task is complete, it returns a null action and the loop ends naturally. You can also just say "Cancel" at any point and everything stops.
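The loop and its termination conditions can be sketched roughly as follows. This is an illustrative skeleton, not Sally's actual internals: the `step`, `execute`, and `speak` callbacks stand in for the Gemini call, the browser executor, and ElevenLabs TTS.

```typescript
// Illustrative skeleton of the agentic loop. Helper names are placeholders.
type Action = { type: string; selector?: string; url?: string } | null;
type StepResult = { narration: string; action: Action };

const MAX_STEPS = 40;
const MAX_MS = 10 * 60 * 1000; // 10-minute wall-clock cap

async function runTask(
  instruction: string,
  step: (instruction: string, history: string[]) => Promise<StepResult>,
  execute: (action: Exclude<Action, null>) => Promise<void>,
  speak: (text: string) => Promise<void>,
): Promise<string[]> {
  const history: string[] = [];
  const deadline = Date.now() + MAX_MS;
  for (let i = 0; i < MAX_STEPS && Date.now() < deadline; i++) {
    const { narration, action } = await step(instruction, history);
    if (action === null) {           // model says the task is complete
      await speak(narration);
      break;
    }
    await execute(action);           // run the action in the browser first,
    await speak(narration);          // then narrate what just happened
    history.push(`${action.type}:${action.selector ?? action.url ?? ''}`);
  }
  return history;
}
```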
Why It Doesn't Get Stuck in Loops
Each iteration gets a fresh screenshot, so Gemini is always looking at the current state of the page. It doesn't need to remember what the page looked like three steps ago.
But there's a catch: without any memory at all, the model might keep clicking the same button over and over. So I feed it a rolling history of the last 10 actions. The system prompt tells it explicitly: "Do not repeat these steps." That sliding window gives Gemini enough context to know where it's been without bloating the prompt with a full session transcript.
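The sliding window is simple enough to sketch. This is my paraphrase of the idea, with a placeholder prompt string rather than Sally's actual system prompt wording:

```typescript
// Illustrative rolling action history: keep only the last N entries so the
// prompt stays small across long sessions.
const WINDOW = 10;

function pushAction(history: string[], action: string): string[] {
  const next = [...history, action];
  return next.length > WINDOW ? next.slice(next.length - WINDOW) : next;
}

// Render the window into a prompt section telling the model what not to redo.
function historyPromptSection(history: string[]): string {
  if (history.length === 0) return '';
  return `Actions already taken (do not repeat these steps):\n- ${history.join('\n- ')}`;
}
```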
On top of that, Sally has a failure-counting mechanism. If the same action fails twice, it triggers a replan, Gemini reassesses the situation from scratch instead of stubbornly retrying. And if Sally detects that the user is truly stuck on a page, a Browser Rescue Mode kicks in. It analyzes what's blocking progress, identifies alternatives, and suggests a way forward rather than spinning in circles.
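The failure counter boils down to tracking consecutive failures of the same action. A minimal sketch, assuming a threshold of two as described above (the function shape is mine):

```typescript
// Illustrative failure tracker: the same action failing twice in a row
// triggers a replan instead of a blind retry.
function makeFailureTracker(threshold = 2) {
  let lastKey = '';
  let count = 0;
  return (actionKey: string, ok: boolean): 'continue' | 'replan' => {
    if (ok || actionKey !== lastKey) {
      // Success, or a different action: reset the streak.
      lastKey = actionKey;
      count = ok ? 0 : 1;
    } else {
      count++; // same action failed again
    }
    return count >= threshold ? 'replan' : 'continue';
  };
}
```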
Sally Has a Personality
This part took way more effort than I expected. Sally isn't a cold automation script. She has a warm, concise system prompt that keeps her spoken narration calm, short, and human. She uses contractions, acknowledges progress, and stays steady when things break.
The prompt describes her as an assistant who is effectively sitting next to the user, describing what's on screen and taking action on their behalf. Every narration is kept short because it gets spoken aloud. Nobody wants to listen to a paragraph between each click.
All the Actions Sally Can Take
Gemini picks from these action types and Sally executes them:
| Action | What It Does |
|---|---|
| `navigate` | Go to a URL |
| `click` | Click a visible control using semantic DOM matching |
| `fill` | Set the value of a text field or contenteditable surface |
| `type` | Type text into the currently focused element |
| `select` | Pick an option from a dropdown or combobox-like control |
| `press` | Press a key like Enter, Tab, or Escape |
| `hover` | Mouse over an element |
| `focus` | Focus a visible field or control |
| `check` / `uncheck` | Toggle checkboxes, radios, or switches |
| `scroll` / `scroll_up` | Scroll the page down or up |
| `back` | Hit the browser back button |
| `wait` | Pause up to 5 seconds for the page to settle |
| `open_tab` | Open a new browser tab with a URL |
| `switch_tab` | Switch to an existing tab by index |
Smart Destination Resolution
When you tell Sally "Open Gmail" or "Take me to YouTube," she doesn't just shove the text into a Google search. Sally has a built-in destination resolver that knows a dozen popular sites (Gmail, Drive, Docs, YouTube, LinkedIn, GitHub, Notion, Slack, Canva, Amazon, Reddit, and Google Calendar), each with aliases so natural phrasing like "my email" or "my calendar" just works.
If Sally doesn't recognize the destination by name, she tries to construct a likely company URL. And if that doesn't work, she falls back to a Google "I'm Feeling Lucky" search. Direct URLs are detected automatically too. The result is that navigation feels instant for common destinations instead of taking three agentic loop steps just to get to Gmail.
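The resolver's fallback chain can be sketched like this. The alias table below is a small sample for illustration, and the URL-guessing heuristics are my assumptions based on the description above, not Sally's exact rules:

```typescript
// Illustrative destination resolver: alias table first, then a guessed
// company URL, then an "I'm Feeling Lucky" search as a last resort.
const DESTINATIONS: Record<string, string> = {
  gmail: 'https://mail.google.com',
  'my email': 'https://mail.google.com',
  youtube: 'https://www.youtube.com',
  'my calendar': 'https://calendar.google.com',
};

function resolveDestination(spoken: string): string {
  const key = spoken.trim().toLowerCase();
  if (DESTINATIONS[key]) return DESTINATIONS[key];
  // Looks like a bare domain the user spoke directly.
  if (/^[\w.-]+\.[a-z]{2,}$/i.test(key)) return `https://${key}`;
  // Single word: guess a likely company URL.
  if (/^[a-z0-9]+$/.test(key)) return `https://www.${key}.com`;
  // Otherwise let Google's top hit decide.
  return `https://www.google.com/search?btnI=1&q=${encodeURIComponent(key)}`;
}
```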
Smart Home Commands
A fun side feature: Sally recognizes common smart home phrases and rewrites them into browser instructions. If you say "lights on," Sally navigates to home.google.com, finds the light controls, and toggles them. It works for thermostats, fans, and generic devices too. Under the hood it's just regex pattern matching. "Turn on the bedroom lights" gets rewritten to "Go to home.google.com, find the bedroom light, and turn it on." Then the normal agentic loop handles the actual navigation and clicking.
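The rewrite step is plain pattern matching, roughly like this. The two patterns below are examples reconstructed from the article's phrasing, not the full set Sally ships with:

```typescript
// Illustrative smart-home rewrite: regexes that turn common phrases into
// ordinary browser instructions for the agentic loop to carry out.
const HOME_PATTERNS: Array<[RegExp, (m: RegExpMatchArray) => string]> = [
  [/^turn (on|off) the (.+) lights?$/i,
    m => `Go to home.google.com, find the ${m[2]} light, and turn it ${m[1]}.`],
  [/^lights (on|off)$/i,
    m => `Go to home.google.com, find the light controls, and turn them ${m[1]}.`],
];

function rewriteSmartHome(spoken: string): string | null {
  for (const [pattern, rewrite] of HOME_PATTERNS) {
    const m = spoken.trim().match(pattern);
    if (m) return rewrite(m);
  }
  return null; // not a smart-home phrase; fall through to the normal loop
}
```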
Guided Email Compose Flow
Email turned out to be one of the most common use cases during testing, so I gave it special treatment. When you say something like "Send an email to john@example.com about the meeting tomorrow," Sally doesn't just dump that into the agentic loop and hope for the best.
Instead, she detects the email intent, extracts the recipient address from your spoken text (converting "at" to @ and "dot" to . along the way), and breaks the task into guided subtasks: open Gmail, click Compose, fill in the recipient, draft the body, and confirm before sending. At each step, she checks her work, inspecting the Gmail draft to verify the recipient, subject, and body are correct.
This matters because sending an email to the wrong person is the kind of mistake that's hard to undo. Sally flags suspicious-looking email addresses for review and always asks for confirmation before hitting Send.
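The "at"/"dot" normalization is the kind of thing worth a quick sketch. The heuristics below are assumptions based on the description above, not Sally's exact parser:

```typescript
// Illustrative spoken-address normalizer: "john at example dot com"
// becomes john@example.com. Returns null if no plausible address emerges.
function normalizeSpokenEmail(spoken: string): string | null {
  const candidate = spoken
    .toLowerCase()
    .replace(/\s+at\s+/g, '@')   // spoken "at" -> @
    .replace(/\s+dot\s+/g, '.')  // spoken "dot" -> .
    .replace(/\s+/g, '');        // drop remaining whitespace
  return /^[\w.+-]+@[\w-]+(\.[\w-]+)+$/.test(candidate) ? candidate : null;
}
```

A returned `null` is a useful signal too: it's a natural trigger for flagging the address for the user's review before anything gets sent.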
Risky Action Confirmation
Speaking of confirmation: Sally doesn't blindly execute everything Gemini tells her to. Actions like send, submit, purchase, delete, and publish are flagged as risky. Before executing them, Sally pauses, narrates what she's about to do, and waits for your spoken confirmation.
You can say "yes," "go ahead," or "do it" to confirm, or "no," "cancel," or "stop" to abort. Sally listens for these natural affirmative and negative patterns rather than requiring a specific keyword. Safe actions like composing a draft or replying to a thread execute without interruption.
This confirmation system uses its own listening mode with trailing silence detection: it waits for 700ms of silence after you speak to make sure you're done, with a 4-second maximum so things keep moving.
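Matching natural affirmative and negative phrases might look something like this. The phrase lists are sampled from the article's examples plus a couple of obvious variants; the real matcher is presumably richer:

```typescript
// Illustrative confirmation matcher for risky actions: loose phrase lists
// rather than one fixed keyword.
const YES = [/\byes\b/i, /\bgo ahead\b/i, /\bdo it\b/i, /\bconfirm\b/i];
const NO = [/\bno\b/i, /\bcancel\b/i, /\bstop\b/i, /\babort\b/i];

function classifyConfirmation(spoken: string): 'confirm' | 'abort' | 'unclear' {
  // Check negatives first so "no, don't do it" aborts despite containing "do it".
  if (NO.some(p => p.test(spoken))) return 'abort';
  if (YES.some(p => p.test(spoken))) return 'confirm';
  return 'unclear'; // neither pattern matched; ask again or time out
}
```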
Screen Questions and Research Mode
Sally isn't just a button-clicker. You can ask her questions about what's on your screen. "Who is this person?" "How many unread emails do I have?" "What does this error message say?" She'll look at the screenshot and give you a spoken answer.
But it goes further than that. If your question requires more information than what's visible on screen (say, "What's this company's stock price?"), Sally can automatically open the browser and research the answer for you. This auto-research behavior is toggleable in Settings for users who prefer Sally to stick to what's already visible.
Sally also understands assistive browsing questions like:
- "What can I do here?"
- "What buttons are on this page?"
- "What form fields are here?"
- "Read me the errors."
These give you a quick spoken inventory of the page without needing to see it clearly.
Working with the @google/genai SDK
I used the official @google/genai Node.js SDK, and it was genuinely pleasant to work with. Here's what a typical Gemini call looks like in Sally:
import { GoogleGenAI } from '@google/genai';
const genai = new GoogleGenAI({ apiKey });
const result = await genai.models.generateContent({
model: 'gemini-2.5-flash',
contents: [
{
role: 'user',
parts: [
{ inlineData: { mimeType: 'image/png', data: screenshotBase64 } },
{ text: userPrompt }
]
}
],
config: {
systemInstruction: SYSTEM_PROMPT,
responseMimeType: 'application/json',
maxOutputTokens: 512,
temperature: 0.2
}
});
Things I liked about the SDK:
- JSON response mode (`responseMimeType: 'application/json'`) is a huge deal for agentic apps. Instead of parsing free-form text to figure out what the model wants to do, you get clean structured JSON every time.
- Multimodal in one call. The image and the text instruction go in the same request as separate parts. No separate vision API, no file upload step, no preprocessing.
- System instructions. I can keep the behavior and grounding rules separate from the user content.
- Token caps. Sally only ever needs a short narration sentence and a small JSON action object, so capping output helps keep the loop fast.
One Gotcha
The JSON response mode works well almost all the time. But every now and then Gemini wraps the JSON in markdown code fences even though you asked for raw JSON. Easy fix:
const cleaned = text
  .replace(/^```(?:json)?\s*/i, '')
  .replace(/\s*```$/i, '');
const parsed = JSON.parse(cleaned);
Google Cloud Run as the Backend
Sally has two ways to reach Gemini:
- Direct mode. The desktop app calls the Gemini API straight from the user's machine using their API key. No server needed.
- Backend mode. The app talks to a lightweight Express server running on Google Cloud Run, which proxies the request to Gemini.
The Backend Is Tiny
The whole Cloud Run service is a single index.js file with four endpoints:
- `GET /health` — returns the model name so you can verify the deployment is alive
- `POST /api/interpret-screen` — accepts a screenshot and instruction, calls Gemini, returns the narration and action
- `POST /api/answer-screen-question` — accepts a screenshot and question, calls Gemini, and returns a spoken answer plus optional research metadata
- `POST /api/log` — ingests structured log entries from the desktop app for cloud observability
The Dockerfile is Node.js 20 on Alpine Linux, listening on port 8080. Deploying is one command:
gcloud run deploy sally-backend \
--source . --platform managed \
--region us-central1 --allow-unauthenticated \
--set-env-vars "GEMINI_API_KEY=${GEMINI_API_KEY}"
Why Have a Backend?
- You don't want API keys on every device. If Sally is used on shared machines or in a team, the backend holds the key centrally.
- Centralized logging and rate limiting. Much easier to monitor usage, track costs, and throttle requests from one place.
- Easier updates. Swap out the model version or change config on the server without shipping a new desktop build.
The Fallback Is the Best Part
If the Cloud Run backend goes down and a local Gemini API key is configured, Sally quietly switches to direct Gemini API calls. There's a 5-minute cooldown after a backend failure so it doesn't keep retrying a dead endpoint. The user doesn't notice anything. It just keeps working.
Try Cloud Run backend (8-second timeout)
├── Success? → Use the response.
├── Failed? → Start cooldown, switch to direct Gemini API.
└── In cooldown? → Skip backend, go direct.
This means Sally has a hosted path for the hackathon and a local fallback path when backend connectivity is shaky.
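The routing decision above can be captured in a small pure function. The timings match the article (5-minute cooldown); the function shape and names are mine:

```typescript
// Illustrative backend-vs-direct routing with a post-failure cooldown.
const COOLDOWN_MS = 5 * 60 * 1000;

type Route = 'backend' | 'direct';

function chooseRoute(
  now: number,
  lastBackendFailure: number | null, // timestamp of most recent backend failure
  hasLocalKey: boolean,              // is a local Gemini API key configured?
): Route {
  const inCooldown =
    lastBackendFailure !== null && now - lastBackendFailure < COOLDOWN_MS;
  // During cooldown, skip the dead backend if we can go direct instead.
  if (inCooldown && hasLocalKey) return 'direct';
  return 'backend';
}
```

Keeping this pure (time passed in, no I/O) makes the cooldown logic trivial to test, which matters for a path that only triggers when things are already going wrong.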
Cloud Logging for Observability
One thing I added that proved invaluable during development (and will be even more useful in production) is structured cloud logging. Sally ships desktop events (TTS requests, browser task starts, errors, and more) to Google Cloud Logging through the backend's /api/log endpoint.
The logger batches entries (up to 10 per batch, flushed every 5 seconds) to avoid hammering the network on every action. Each log entry carries a severity level (DEBUG through EMERGENCY), so you can filter noise and focus on what matters. On graceful shutdown, the queue flushes so no logs are lost.
The best part: it's entirely optional. There's a toggle in Settings to enable or disable cloud event forwarding. When it's off, everything falls back to local console logging. When it's on, you get a centralized view of what Sally is doing across every session, latency on TTS requests, which tasks triggered browser automation, where things failed and why. For a hackathon demo, being able to pull up Cloud Logging and show exactly what happened behind the scenes is a nice touch.
CI/CD with Cloud Build
For production, I set up a Cloud Build pipeline in cloudbuild.yaml:
- Build the Docker image, tagged with the git short SHA
- Push it to Google Artifact Registry
- Deploy to Cloud Run with resource limits: 512 MB RAM, 1 CPU, max 10 instances, 60-second timeout
Standard Google Cloud CI/CD. Nothing fancy, but it gets the job done.
Making Clicks Work (The DOM Grounding System)
This was probably the hardest part of the whole project. When Gemini looks at a screenshot and says "click Search," it's speaking like a human. It might return "selector": "Search" or "selector": "Submit" or "selector": "the blue Sign In button".
The hard part is turning that human description into a reliable action in the live page.
Instead of relying on a browser-automation-specific fallback chain, Sally now builds a live inventory of visible interactive elements from the DOM. For each candidate element, it collects things like:
- `role`
- `label` or `aria-label`
- visible text
- `placeholder`
- `name`
- tag name
Then it scores candidates against Gemini's selector text and filters them by action type:
- `click` prefers buttons, links, tabs, menu items, checkboxes, radios, and switches
- `fill` prefers textboxes, search boxes, and comboboxes
- `select` prefers selectable controls
- `focus` prefers fields and focusable controls
If there are multiple similar matches, Gemini can also return an index.
Once Sally picks the best target, it executes the action directly in the page DOM and sends the result back into the next Gemini loop iteration. This screenshot + DOM grounding combo is what made the project feel reliable enough to demo.
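To make the scoring idea concrete, here's a minimal sketch. The weights, the `Candidate` shape, and the exact-match/substring split are all my assumptions, not Sally's actual implementation:

```typescript
// Illustrative semantic scorer: rank visible elements against Gemini's
// human-style selector text, with a bonus for action-appropriate roles.
type Candidate = { role: string; label: string; text: string };

function scoreCandidate(selector: string, c: Candidate, preferredRoles: string[]): number {
  const want = selector.toLowerCase();
  const label = c.label.toLowerCase();
  const text = c.text.toLowerCase();
  let score = 0;
  if (label === want || text === want) score += 100;            // exact match wins
  else if (label.includes(want) || text.includes(want)) score += 50;
  // Role bonus only when there's some text match, so the wrong control
  // can't win on role alone.
  if (score > 0 && preferredRoles.includes(c.role)) score += 25;
  return score;
}

function bestCandidate(selector: string, cs: Candidate[], roles: string[]): Candidate | null {
  let best: Candidate | null = null;
  let bestScore = 0;
  for (const c of cs) {
    const s = scoreCandidate(selector, c, roles);
    if (s > bestScore) { best = c; bestScore = s; }
  }
  return best; // null means "nothing matched": report failure to the loop
}
```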
Browser Session Persistence and Multi-Tab Support
This was a huge win for accessibility. Sally owns one persistent Electron browser surface with its own session partition. That means cookies, local storage, and sign-in state can persist across tasks and app restarts.
Without this, every Sally session would start from scratch. For someone with a motor impairment, that's exactly the kind of repetitive friction Sally is supposed to eliminate.
Sally's browser also supports multiple tabs. You can say "open a new tab with Google Docs" or "switch to the YouTube tab," and Sally handles it. Each tab tracks its own loading state, title, URL, and navigation history with back/forward support. The tab bar shows hostnames and loading indicators so you always know what's open.
This is different from reusing the user's actual Chrome profile. The upside is that Sally gets a stable, isolated automation surface that it fully controls. The tradeoff is that it doesn't automatically inherit whatever is already open in the user's everyday browser.
The UI: A Floating Pill
Sally shows up as a small floating bar at the top of your screen. It's 420x48 pixels when idle. Always on top, minimal. It shows the current state:
- Idle — waiting for a voice command
- Listening — recording while you're holding push-to-talk
- Thinking — Gemini is processing
- Acting — Sally is doing something in the browser
When Sally is actively working, a blue border lights up around the edges of your screen so you know automation is happening. The whole thing is React and Tailwind CSS inside Electron.
The bar adapts to what's happening. It has four layouts:
- idle — the default pill
- compact — even smaller at 280x48
- composer — expanded to 360x124 with a text input area
- transcript — 360x104, showing the conversation history
During recording, a real-time waveform visualization shows your audio levels, and a live transcript preview displays what Sally is hearing as you speak. It's a small detail, but it gives you confidence that your voice is being picked up correctly before you release the key.
The philosophy is that Sally should be felt, not seen. She should take up as little space as possible so the user has maximum room for the actual content they're trying to interact with.
The Settings Window
Sally ships with a proper settings UI where you can configure everything without touching a config file:
- API Keys — Enter and validate your Gemini and ElevenLabs keys, with a test button that confirms the key works before saving
- Backend URL — Point Sally at your own Cloud Run deployment and check its health status with one click
- Audio Device — Pick which microphone Sally listens to — important if you have multiple input devices
- Cloud Logging — Toggle whether desktop events get forwarded to Google Cloud Logging
- Auto Research — Control whether screen questions can automatically open the browser to look things up
- Microphone Mute — Toggle mute so the push-to-talk key is ignored without closing the app
There's also a Getting Started guide built right into the settings window with three steps to get up and running. The goal was to make Sally usable out of the box without reading documentation.
Multi-Display Support
Sally works across multiple monitors. Screenshots are captured from the display where your cursor is, so Sally always sees the same screen you're looking at. The blue border overlay targets the correct display too. This was a small thing to implement but matters a lot for anyone with a multi-monitor setup, which is common for accessibility workstations.
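The core of "capture the display the cursor is on" reduces to a hit test over display bounds. A pure sketch (in the real app, Electron's `screen` module does roughly this for you via `getDisplayNearestPoint`; the types and fallback here are mine):

```typescript
// Illustrative display hit test: pick the display whose bounds contain
// the cursor, falling back to the first (primary) display.
type Rect = { x: number; y: number; width: number; height: number };
type Display = { id: number; bounds: Rect };

function displayForCursor(displays: Display[], cursor: { x: number; y: number }): Display {
  const hit = displays.find(d =>
    cursor.x >= d.bounds.x && cursor.x < d.bounds.x + d.bounds.width &&
    cursor.y >= d.bounds.y && cursor.y < d.bounds.y + d.bounds.height);
  return hit ?? displays[0];
}
```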
Challenges and What I Learned
Selectors were the hardest problem. Bridging the gap between what Gemini sees in a screenshot and what the DOM grounding layer can reliably target took the most iteration. The semantic matcher works well, but edge cases like shadow DOM, iframes, and unusual custom controls still trip it up sometimes.
The page-settle problem is trickier than it looks. After every action, Sally waits for the page to stop loading with bounded delays instead of relying on one fixed sleep. Too short and Gemini sees a half-loaded page with spinners. Too long and the whole experience drags. The current version is much better than a single hardcoded delay, but heavy SPAs can still be awkward.
Gemini sometimes sees things that aren't there. It's rare, but occasionally the model identifies a button or link that doesn't actually exist on the page. The DOM grounding layer handles this gracefully by reporting the failure back into the next loop iteration, but it's a good reminder that vision models aren't perfect yet.
Prompt engineering took longer than coding. Getting Sally's personality right, getting the grounding rules tight, getting the action format consistent: the system prompt went through dozens of revisions. The biggest lesson was to be explicit about what the model should not do. Telling Gemini "don't repeat actions you've already taken" and "don't navigate to a URL you're already on" fixed a huge share of the loop problems I was seeing.
Structured JSON output is almost perfect. The responseMimeType: 'application/json' feature is fantastic for building agents. You get clean, parseable output instead of having to regex your way through free-form text. The occasional markdown-wrapped response is a minor annoyance that a small parser fix solves.
What I'd Do Differently
- Write the system prompt before writing code. The prompt is the product. Everything else is just plumbing to execute what the prompt decides.
- Build the DOM grounding layer early. I wasted time trying to make Gemini return perfect CSS selectors. It's a vision model, not a browser inspector. Accept that and build robust page grounding from the start.
- Test on weird websites sooner. Sally worked great on Google properties from day one, but complex React SPAs and Angular apps surfaced issues I didn't see coming.
What's Next
- Smarter page-ready detection instead of bounded settle heuristics
- Better error context so Gemini knows why an action failed, not just that it failed
- Streaming responses to start narrating before the full model response arrives
- More prompt tuning based on real usage patterns
Wrapping Up
Sally started as a hackathon project, but the problem is real. Over a billion people worldwide live with some form of disability, and a lot of them struggle with the precise motor control that web browsing demands. Clicking small buttons, scrolling to the right spot, typing into tiny fields. The barrier isn't understanding or intent. It's the input device.
Combining multimodal AI with a persistent browser plus DOM grounding opens up something genuinely useful. Gemini 2.5 Flash is fast enough for real-time interaction and smart enough to understand what's on screen. Cloud Run makes the backend deployable in one command. The @google/genai SDK makes the whole thing buildable without fighting the tooling.
If you're building with Gemini, I'd really encourage you to look at multimodal use cases beyond chatbots. The model can see. That opens up accessibility tools, testing automation, workflow agents, and a bunch of stuff nobody's thought of yet.
Sally is on GitHub if you want to check it out.
Thanks for reading!