I built a desktop app that lets people control any website using only their voice. You talk, it takes a screenshot, sends it to Gemini 2.5 Flash, gets back a structured action, runs it in the browser, and repeats. The whole time it's narrating what it's doing out loud. Here's how it came together for the Gemini Live Agent Challenge.
The Problem
Picture this: you can't use a mouse. Maybe you can't use a keyboard either. You might have a repetitive strain injury, a motor impairment, or honestly you might just have a broken wrist. The web doesn't really care. It expects you to click tiny buttons, scroll precisely, type into fields, drag things around.
There are screen readers and voice control tools out there, but they all seem to expect you to learn their language. Memorize commands. Know what things are called in the DOM. Fight with dictation software that mishears every other word.
I wanted something where you could just say what you want:
"Go to YouTube and search for lo-fi beats."
No special syntax. No menu navigation. Just plain language.
That's what Sally does.
Sally is a voice-first accessibility agent. It's an Electron desktop app built for people with motor impairments, cognitive disabilities, repetitive strain injuries, or anyone who wants hands-free web browsing. You hold the push-to-talk key (Right Alt), say what you need, and Sally handles the rest. She navigates, clicks, types, scrolls, and narrates everything she does so you always know what's going on.
What Sally Is Made Of
Quick overview before I get into the details:
- Gemini 2.5 Flash for multimodal vision (understanding screenshots), speech-to-text, and deciding what action to take next
- @google/genai SDK (the official Google Gen AI SDK for Node.js) for all Gemini API calls
- Google Cloud Run as an optional serverless backend that proxies Gemini requests
- Playwright for actually controlling the browser (clicking, typing, scrolling, navigating)
- Electron + React + TypeScript for the cross-platform desktop shell
- ElevenLabs for neural text-to-speech so Sally sounds like a real person
- uiohook-napi for a system-wide push-to-talk hotkey that works even when the app isn't in focus
Why I Went with Gemini 2.5 Flash
Honestly, this was one of the easier decisions. I needed three things from a model:
- Vision. It needs to look at a screenshot and actually understand what's on the page.
- Reasoning. Given a user's instruction and what's on screen, it needs to figure out the right next step.
- Speed. Sally runs in a tight loop where every action involves a round-trip to the model. If the model takes 5 seconds per call and a task needs 10 steps, that's almost a minute of waiting. Not acceptable.
Gemini 2.5 Flash nails all three. I send it a base64-encoded PNG screenshot along with the user's instruction, and it comes back with structured JSON telling me exactly what to do. Click this button. Fill in that text field. Scroll down. Go to this URL.
The speed really matters. A single task like "search Google for weather in Tokyo" involves maybe 4-5 model calls (navigate, click search box, type query, press Enter, describe results). With Flash, each call comes back fast enough that the whole thing feels smooth rather than painful.
I set the temperature to 0.2 and kept it there. When you're automating a browser for someone who can't easily undo a wrong click, you want the model to be predictable. High temperature and creative button-clicking don't mix.
Bonus: Speech-to-Text Too
One thing I didn't expect going in is that Gemini 2.5 Flash also handles speech-to-text. So the same model that looks at screenshots also transcribes voice commands. Fewer moving parts, fewer API keys to manage, fewer things that can break. I kept Whisper as a fallback option, but in the default setup Gemini handles vision, reasoning, and transcription all in one.
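Because transcription goes through the same generateContent call as vision, the request differs from the screenshot one only in the inlined media. Here's a minimal sketch of how such a request might be assembled (the helper name and prompt wording are mine, not Sally's actual code):

```typescript
// Sketch: build a transcription request shaped like the screenshot request,
// but with a recorded audio clip inlined instead of a PNG.
// (buildTranscriptionRequest and the prompt text are illustrative assumptions.)
interface GenRequest {
  model: string;
  contents: Array<{ role: string; parts: Array<object> }>;
}

function buildTranscriptionRequest(audioBase64: string): GenRequest {
  return {
    model: 'gemini-2.5-flash',
    contents: [{
      role: 'user',
      parts: [
        // Inline the recorded audio exactly the way the screenshot PNG is inlined.
        { inlineData: { mimeType: 'audio/wav', data: audioBase64 } },
        { text: 'Transcribe this audio exactly. Return only the transcript text.' }
      ]
    }]
  };
}
```

The payload then goes through the same `genai.models.generateContent` call used for screenshots.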
How the Agentic Loop Works
This is the core of Sally. Here's the flow in plain language:
- You talk. Hold the push-to-talk key and say something like "Open Gmail and compose a new email to Mom."
- Sally transcribes your voice into text using Gemini.
- Sally takes a screenshot of whatever the browser is showing right now.
- Sally sends everything to Gemini. The screenshot, your instruction, the current page URL and title, and a log of actions already taken in this session.
- Gemini sends back JSON with two things: a short narration sentence and an action to execute.
```json
{
  "narration": "I'll click the Compose button for you.",
  "action": { "type": "click", "selector": "Compose" }
}
```
- Sally runs the action using Playwright to click, type, scroll, or navigate in a real Chrome browser.
- Sally speaks the narration through ElevenLabs TTS so you hear what just happened.
- Wait 1.5 seconds for the page to finish loading, then loop back to step 3.
The loop caps out at 15 iterations or 3 minutes, whichever hits first. If Gemini decides the task is complete, it returns a null action and the loop ends naturally. You can also just say "Cancel" at any point and everything stops.
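The stop conditions above boil down to a small predicate checked at the top of every iteration. A sketch of that logic (function and variable names are illustrative, not Sally's actual code):

```typescript
// Stop conditions for the agentic loop, per the article: 15 iterations or
// 3 minutes, a null action from Gemini, or a spoken "Cancel" from the user.
const MAX_ITERATIONS = 15;
const MAX_DURATION_MS = 3 * 60 * 1000; // 3 minutes

interface AgentAction { type: string; selector?: string; }

function shouldContinue(
  iteration: number,
  startedAtMs: number,
  lastAction: AgentAction | null,
  nowMs: number,
  cancelled: boolean
): boolean {
  if (cancelled) return false;                              // user said "Cancel"
  if (lastAction === null) return false;                    // Gemini signalled completion
  if (iteration >= MAX_ITERATIONS) return false;            // hard cap on steps
  if (nowMs - startedAtMs >= MAX_DURATION_MS) return false; // hard cap on time
  return true;
}
```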
Why It Doesn't Get Stuck in Loops
Each iteration gets a fresh screenshot, so Gemini is always looking at the current state of the page. It doesn't need to remember what the page looked like three steps ago.
But there's a catch: without any memory at all, the model might keep clicking the same button over and over. So I feed it a rolling history of the last 10 actions. The system prompt tells it explicitly: "do NOT repeat these steps." That sliding window gives Gemini enough context to know where it's been without bloating the prompt with a full session transcript.
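The sliding window itself is simple. A sketch of how the history might be trimmed and folded into the prompt (the window size is from the article; the helper names and exact wording are mine):

```typescript
// Keep only the last N actions so the prompt stays small, and render them
// as a "do NOT repeat" section for the system prompt.
const HISTORY_WINDOW = 10;

function recentActions(history: string[]): string[] {
  // slice(-N) returns the last N entries (or fewer if the history is short).
  return history.slice(-HISTORY_WINDOW);
}

function historySection(history: string[]): string {
  const recent = recentActions(history);
  if (recent.length === 0) return '';
  return 'Actions already taken (do NOT repeat these steps):\n' +
    recent.map((a, i) => `${i + 1}. ${a}`).join('\n');
}
```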
Sally Has a Personality
This part took way more effort than I expected. Sally isn't a cold automation script. She has a personality defined in a ~300-line system prompt. She uses contractions ("I'll", "Let's"), she celebrates small wins ("Got it!", "Alright, we're in!"), and she stays calm when things break ("Hmm, that didn't work. Let me try another way.").
The system prompt describes her as "a helpful friend sitting next to them, describing what's on screen and taking action on their behalf." Every narration is kept short because it gets spoken aloud. Nobody wants to listen to a paragraph between each click.
All the Actions Sally Can Take
Gemini picks from these action types and Playwright executes them:
| Action | What It Does |
|---|---|
| `navigate` | Go to a URL |
| `click` | Click an element by its visible text, ARIA label, or CSS selector |
| `fill` | Clear a text field and type new content |
| `type` | Type text character by character with a 50ms delay (looks human) |
| `select` | Pick an option from a dropdown |
| `press` | Press a key like Enter, Tab, Escape |
| `hover` | Mouse over an element |
| `scroll` / `scroll_up` | Scroll the page down or up |
| `back` | Hit the browser back button |
| `wait` | Pause up to 5 seconds for the page to load |
Smart Home Commands
A fun side feature: Sally recognizes common smart home phrases and rewrites them into browser instructions. If you say "lights on," Sally navigates to home.google.com, finds the light controls, and toggles them. It works for thermostats, fans, and generic devices too. Under the hood it's just regex pattern matching. "Turn on the bedroom lights" gets rewritten to "Go to home.google.com, find the bedroom light, and turn it on." Then the normal agentic loop handles the actual navigation and clicking.
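The rewrite step can be sketched as a table of patterns and templates. These particular rules are illustrative; Sally's actual pattern list is longer:

```typescript
// Sketch of the smart home rewrite: match common phrases with regexes and
// rewrite them into plain browser instructions for the normal agentic loop.
// (Rule list and helper name are illustrative assumptions.)
const SMART_HOME_RULES: Array<[RegExp, (m: RegExpMatchArray) => string]> = [
  [/^(?:turn on|switch on) the (.+) lights?$/i,
    m => `Go to home.google.com, find the ${m[1]} light, and turn it on.`],
  [/^(?:turn off|switch off) the (.+) lights?$/i,
    m => `Go to home.google.com, find the ${m[1]} light, and turn it off.`],
  [/^lights (on|off)$/i,
    m => `Go to home.google.com, find the light controls, and turn them ${m[1]}.`],
];

function rewriteSmartHome(command: string): string | null {
  for (const [pattern, rewrite] of SMART_HOME_RULES) {
    const match = command.trim().match(pattern);
    if (match) return rewrite(match);
  }
  return null; // not a smart home phrase; treat it as a normal instruction
}
```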
Working with the @google/genai SDK
I used the official @google/genai Node.js SDK, and it was genuinely pleasant to work with. Here's what a typical Gemini call looks like in Sally:
```typescript
import { GoogleGenAI } from '@google/genai';

const genai = new GoogleGenAI({ apiKey });

const result = await genai.models.generateContent({
  model: 'gemini-2.5-flash',
  contents: [{
    role: 'user',
    parts: [
      { inlineData: { mimeType: 'image/png', data: screenshotBase64 } },
      { text: userPrompt }
    ]
  }],
  config: {
    systemInstruction: SYSTEM_PROMPT,
    responseMimeType: 'application/json',
    maxOutputTokens: 512,
    temperature: 0.2
  }
});
```
Things I liked about the SDK:
- **JSON response mode.** `responseMimeType: 'application/json'` is a huge deal for agentic apps. Instead of parsing free-form text to figure out what the model wants to do, you get clean structured JSON every time. This single feature saved me hours of fragile parsing code.
- **Multimodal in one call.** The image and the text instruction go in the same request as separate `parts`. No separate vision API, no file upload step, no preprocessing. You just inline the base64 PNG and go. For the agentic loop this is perfect because each iteration is one API call with everything bundled together.
- **System instructions.** I have a ~300-line system prompt, and being able to pass it as `systemInstruction` keeps it separate from user content. It's consistent across every loop iteration without me having to manually glue it into the messages array.
- **Token cap.** I set `maxOutputTokens: 512` because Sally only ever needs a short narration sentence and a small JSON action object. Capping it keeps responses fast and prevents the model from generating unnecessary text.
One Gotcha
The JSON response mode works well almost all the time. But every now and then Gemini wraps the JSON in markdown code fences even though you asked for raw JSON. Easy fix:
````typescript
const cleaned = text.replace(/^```(?:json)?\s*/i, '').replace(/\s*```$/i, '');
const parsed = JSON.parse(cleaned);
````
Two lines. Not a big deal, but good to know about if you're building structured output pipelines.
Google Cloud Run as the Backend
Sally has two ways to reach Gemini:
- Direct mode. The desktop app calls the Gemini API straight from the user's machine using their API key. No server needed.
- Backend mode. The app talks to a lightweight Express server running on Google Cloud Run, which proxies the request to Gemini.
The Backend Is Tiny
The whole Cloud Run service is a single index.js file with two endpoints:
- `GET /health` returns the model name so you can verify the deployment is alive
- `POST /api/interpret-screen` accepts a screenshot and instruction, calls Gemini, returns the narration and action
The Dockerfile is Node.js 20 on Alpine Linux, listening on port 8080. Deploying is one command:
```bash
gcloud run deploy sally-backend \
  --source . --platform managed \
  --region us-central1 --allow-unauthenticated \
  --set-env-vars "GEMINI_API_KEY=${GEMINI_API_KEY}"
```
Why Have a Backend?
- You don't want API keys on every device. If Sally is used on shared machines or in a team, the backend holds the key centrally.
- Centralized logging and rate limiting. Much easier to monitor usage, track costs, and throttle requests from one place.
- Easier updates. Swap out the model version or change config on the server without shipping a new desktop build.
The Fallback Is the Best Part
If the Cloud Run backend goes down (network issue, cold start timeout, whatever), Sally quietly switches to direct Gemini API calls. There's a 5-minute cooldown after a backend failure so it doesn't keep retrying a dead endpoint. The user doesn't notice anything. It just keeps working.
1. Try the Cloud Run backend (8-second timeout).
2. Success? Use the response.
3. Failed? Start the cooldown and switch to the direct Gemini API.
4. Already in cooldown? Skip the backend and go direct.
This means Sally always has a path to Gemini regardless of whether the backend is healthy.
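The routing decision reduces to one small function. A sketch, using the cooldown length from the article (the names are illustrative, not Sally's actual code):

```typescript
// Decide whether a Gemini request should go through the Cloud Run backend
// or straight to the Gemini API, based on the 5-minute failure cooldown.
const COOLDOWN_MS = 5 * 60 * 1000; // 5 minutes after a backend failure

type Route = 'backend' | 'direct';

function chooseRoute(lastBackendFailureMs: number | null, nowMs: number): Route {
  // While in cooldown, skip the backend entirely and go direct.
  if (lastBackendFailureMs !== null && nowMs - lastBackendFailureMs < COOLDOWN_MS) {
    return 'direct';
  }
  return 'backend';
}
```

On a backend failure the caller records the timestamp and retries the same request directly, so the user never sees the switch.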
CI/CD with Cloud Build
For production, I set up a Cloud Build pipeline in cloudbuild.yaml:
- Build the Docker image, tagged with the git short SHA
- Push it to Google Artifact Registry
- Deploy to Cloud Run with resource limits: 512 MB RAM, 1 CPU, max 10 instances, 60-second timeout
Standard Google Cloud CI/CD. Nothing fancy, but it gets the job done.
Making Clicks Work (The Smart Selector System)
This was probably the hardest part of the whole project. When Gemini looks at a screenshot and says "click Search," it's speaking like a human. It might return "selector": "Search" or "selector": "Submit" or "selector": "the blue Sign In button".
Playwright needs something it can actually find in the DOM. So I built a cascading fallback that tries five strategies in order:
1. **CSS selector first.** Sometimes Gemini returns something like `[aria-label='Search']` and it just works.
2. **Visible text.** `page.getByText("Search")` finds elements by what's actually written on them.
3. **ARIA roles.** Try matching against button, link, menuitem, tab, and checkbox roles.
4. **ARIA labels.** `page.getByLabel("Search")` matches `aria-label` attributes.
5. **Placeholder text.** `page.getByPlaceholder("Search")` for input fields.
Each attempt gets a 3-second timeout. If all five fail, Sally reports the failure back to Gemini on the next loop iteration so it can try something different. This handles about 95% of real-world websites.
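Stripped of the Playwright specifics, the cascade is just "try each strategy with a per-attempt timeout, return the first hit." A generic sketch (helper names are mine; in Sally each strategy would wrap one of the Playwright locator calls above):

```typescript
// Resolve a promise to its value, or to null if it rejects or exceeds ms.
// (Rejection is treated as "element not found" for this use case.)
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T | null> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve(null), ms);
    p.then(
      (value) => { clearTimeout(timer); resolve(value); },
      () => { clearTimeout(timer); resolve(null); }
    );
  });
}

// Try each lookup strategy in order; first success wins, null if all fail.
async function tryInOrder<T>(
  strategies: Array<() => Promise<T>>,
  perAttemptMs = 3000 // 3-second timeout per attempt, as in the article
): Promise<T | null> {
  for (const strategy of strategies) {
    // Promise.resolve().then(strategy) also catches synchronous throws.
    const result = await withTimeout(Promise.resolve().then(strategy), perAttemptMs);
    if (result !== null) return result;
  }
  return null; // all strategies failed; report back to Gemini next iteration
}
```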
Browser Profile Persistence
This was a huge win for accessibility. Sally launches the browser using Playwright's launchPersistentContext with the user's actual Chrome profile. That means all your logins, cookies, saved passwords, extensions, and preferences carry over. You don't have to log in to anything.
Without this, every Sally session would start with "please sign in to Gmail... now sign in to YouTube... now sign in to Amazon..." For someone with a motor impairment, that's exactly the kind of tedious, repetitive clicking that Sally is supposed to eliminate.
The only catch is that you need to close Chrome before starting Sally, because Chrome locks its profile directory to one process at a time. Sally launches its own Chrome instance with your profile. If Chrome isn't installed, it falls back to Edge.
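Finding the user's profile directory is platform-dependent. A sketch of resolving the standard Chrome user-data directory to hand to `launchPersistentContext` (the paths are Chrome's documented defaults; the function itself is my illustration, not Sally's code):

```typescript
import * as path from 'node:path';

// Resolve the default Chrome user-data directory for a given platform
// and home directory. (Hypothetical helper; Sally's actual lookup may differ.)
function chromeUserDataDir(platform: string, home: string): string {
  switch (platform) {
    case 'darwin':
      return path.join(home, 'Library', 'Application Support', 'Google', 'Chrome');
    case 'win32':
      return path.join(home, 'AppData', 'Local', 'Google', 'Chrome', 'User Data');
    default: // linux and other unix-likes
      return path.join(home, '.config', 'google-chrome');
  }
}
```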
The UI: A Floating Pill
Sally shows up as a small floating bar at the top of your screen. It's 420x48 pixels when idle. Draggable, always on top, minimal. It shows the current state:
- Idle means it's waiting for a voice command
- Listening means it's recording (you're holding push-to-talk)
- Thinking means Gemini is processing
- Acting means Playwright is doing something in the browser
When Sally is actively working, a blue border lights up around the edges of your screen so you know automation is happening. The whole thing is React and Tailwind CSS inside Electron.
The philosophy is that Sally should be felt, not seen. She should take up as little space as possible so the user has maximum room for the actual content they're trying to interact with.
Challenges and What I Learned
Selectors were the hardest problem. Bridging the gap between what Gemini sees in a screenshot and what Playwright can find in the DOM took the most iteration. The cascading fallback works well, but edge cases like dynamically generated class names, shadow DOM, and iframes still trip it up sometimes.
The 1.5-second page settle delay is a guess. After every action, Sally waits 1.5 seconds for the page to re-render before taking the next screenshot. Too short and Gemini sees a half-loaded page with spinners. Too long and the whole experience drags. 1.5 seconds works for most sites, but heavy SPAs sometimes need more. I want to replace this with something smarter, like watching for network idle or DOM stability.
Gemini sometimes sees things that aren't there. It's rare, but occasionally the model identifies a button or link that doesn't actually exist on the page. The smart selector fallback handles this gracefully (it just can't find the element and moves on), but it's a good reminder that vision models aren't perfect yet.
Prompt engineering took longer than coding. Getting Sally's personality right, getting the grounding rules tight, getting the action format consistent... the system prompt went through dozens of revisions. The biggest lesson was to be explicit about what the model should NOT do. Telling Gemini "don't repeat actions you've already taken" and "don't navigate to a URL you're already on" fixed 90% of the infinite loop problems I was seeing.
Structured JSON output is almost perfect. The responseMimeType: 'application/json' feature is fantastic for building agents. You get clean, parseable output instead of having to regex your way through free-form text. The occasional markdown-wrapped response is a minor annoyance that a two-line fix solves.
What I'd Do Differently
- Write the system prompt before writing code. The prompt is the product. Everything else is just plumbing to execute what the prompt decides.
- Build the selector fallback chain early. I wasted time trying to make Gemini return perfect CSS selectors. It's a vision model, not a browser inspector. Accept that and build robust matching from the start.
- Test on weird websites sooner. Sally worked great on Google properties from day one, but complex React SPAs and Angular apps surfaced issues I didn't see coming.
What's Next
- Smarter page-ready detection instead of a fixed 1.5-second wait
- Multi-tab support for workflows that span multiple pages
- Better error context so Gemini knows why an action failed, not just that it failed
- Streaming responses to start narrating before the full model response arrives
- More prompt tuning based on real usage patterns
Wrapping Up
Sally started as a hackathon project, but the problem is real. Over a billion people worldwide live with some form of disability, and a lot of them struggle with the precise motor control that web browsing demands. Clicking small buttons, scrolling to the right spot, typing into tiny fields. The barrier isn't understanding or intent. It's the input device.
Combining multimodal AI with browser automation opens up something genuinely useful. Gemini 2.5 Flash is fast enough for real-time interaction and smart enough to understand what's on screen. Cloud Run makes the backend deployable in one command. The @google/genai SDK makes the whole thing buildable without fighting the tooling.
If you're building with Gemini, I'd really encourage you to look at multimodal use cases beyond chatbots. The model can see. That opens up accessibility tools, testing automation, workflow agents, and a bunch of stuff nobody's thought of yet.
Sally on GitHub if you want to check it out.
Thanks for reading!