I created this blog to detail my project for the Gemini Live Agent Challenge hackathon. #GeminiLiveAgentChallenge
The Problem: Automation That Breaks the Moment the UI Changes
Every developer has been there. You write a Selenium script, it works perfectly, and then the website updates its CSS class names and the whole thing falls apart. You set up an RPA workflow, it runs fine for a week, and then someone moves a button and it starts clicking the wrong thing.
Traditional automation is brittle because it's blind. It relies on DOM selectors, API hooks, and hardcoded coordinates. It doesn't actually see the screen. It just pokes at it.
But humans don't automate that way. When you ask a colleague to "find the cheapest flight to New York and book it," they open a browser, look at the screen, read what's there, and make decisions based on what they see. They don't need an API. They don't need a DOM inspector. They just need eyes.
That's the gap TaskPilot fills. It's an AI agent that observes your screen the way a human would, understands what it sees using Gemini's multimodal vision, and executes actions based on natural language intent. No selectors. No APIs. No brittle scripts. Just vision, reasoning, and action.
The Architecture: Two Agents, One Interface
TaskPilot has two distinct execution environments that share a single Electron frontend:
Desktop Mode — A TypeScript/Node.js agent (clawd-cursor) that runs locally and controls your actual OS. It can open apps, type text, click buttons, switch windows, and execute multi-app workflows across your entire desktop.
Browser Mode — A Python WebSocket server (computer-use-preview) deployed on Google Cloud Run that spins up a Playwright browser, runs a Gemini Computer Use vision loop, and streams screenshots and reasoning back to the frontend in real time.
The Electron frontend connects to whichever mode the user selects. Desktop mode talks to a local REST API on 127.0.0.1:3847. Browser mode connects to a Cloud Run WebSocket endpoint. The UI is identical either way — live screenshots, a reasoning panel, an action timeline, and a voice input button.
The 5-Layer Pipeline: Why We Don't Always Need Vision
The most important architectural decision in TaskPilot is the layered execution pipeline in clawd-cursor. The core insight: taking a screenshot and sending it to a vision LLM is expensive and slow. Most tasks don't need it.
So we built five layers, each cheaper and faster than the last. The agent tries them in order and only escalates when the current layer can't handle the task.
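The escalation logic can be sketched in a few lines. This is a minimal illustration, not TaskPilot's actual interfaces: the layer functions and the "return None to escalate" convention are assumptions made for the example.

```python
# Illustrative sketch of layered escalation: each layer either handles the
# task or returns None, and only then do we fall through to a costlier layer.
from typing import Callable, Optional

Layer = Callable[[str], Optional[str]]  # returns a result, or None to escalate

def action_router(task: str) -> Optional[str]:
    # Stand-in for Layer 1: pattern matching, no LLM involved
    shortcuts = {"scroll down": "key:PageDown", "copy": "key:Ctrl+C"}
    return shortcuts.get(task.lower())

def a11y_reasoner(task: str) -> Optional[str]:
    # Stand-in for Layer 2: pretend only "click ..." tasks resolve here
    return f"a11y:{task}" if task.startswith("click") else None

def computer_use(task: str) -> Optional[str]:
    # Stand-in for Layer 3: the expensive vision fallback always answers
    return f"vision:{task}"

PIPELINE: list[Layer] = [action_router, a11y_reasoner, computer_use]

def execute(task: str) -> str:
    for layer in PIPELINE:
        result = layer(task)
        if result is not None:  # a layer handled it; stop escalating
            return result
    raise RuntimeError("no layer handled the task")
```

The point of the shape is that cheap layers get first refusal, so the vision model only ever sees the tasks nothing else could handle.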
Layer 0 — Browser (Playwright CDP)
For any task that involves a browser, we go straight to Chrome DevTools Protocol. No screenshots. No LLM. We read the DOM directly, find elements, and interact with them programmatically. This handles a huge chunk of web tasks instantly and for free.
Layer 1 — Action Router (Regex + Keyboard Shortcuts)
A pattern-matching layer that recognizes common intents and maps them to direct actions. "Scroll down" becomes a keyboard shortcut. "Copy" becomes Ctrl+C. "Open Notepad" becomes a shell command. No LLM involved. This layer handles the majority of simple desktop tasks in under a second.
```typescript
// From action-router.ts — pattern matching before any LLM call
const SHORTCUT_PATTERNS = [
  { pattern: /scroll\s+down/i, action: () => keyboard.type(Key.PageDown) },
  { pattern: /copy/i, action: () => keyboard.pressKey(Key.LeftControl, Key.C) },
  { pattern: /open\s+(\w+)/i, action: (match) => shell.exec(`start ${match[1]}`) },
];
```
Layer 1.5 — Smart Interaction (1 LLM Call)
When pattern matching isn't enough, we make a single cheap text LLM call to plan the steps, then execute them via CDP or the accessibility tree. One call, no screenshots.
Layer 2 — A11y Reasoner (Accessibility Tree + Cheap LLM)
We read the OS accessibility tree — the structured representation of every UI element on screen — and feed it to a cheap text model. The model reasons about which element to interact with based on its label, role, and position. Still no screenshots.
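To give a feel for what the text model receives, here is a hypothetical flattening of an accessibility tree into labeled lines. The node shape is deliberately simplified; the real OS accessibility APIs differ per platform and this is not TaskPilot's actual serialization.

```python
# Hypothetical a11y-tree serialization: one indented line per UI element,
# carrying the role, label, and bounds a cheap text model can reason over.
from dataclasses import dataclass, field

@dataclass
class A11yNode:
    role: str
    name: str
    bounds: tuple[int, int, int, int]  # x, y, width, height
    children: list["A11yNode"] = field(default_factory=list)

def flatten(node: A11yNode, depth: int = 0) -> list[str]:
    line = f"{'  ' * depth}{node.role} \"{node.name}\" @ {node.bounds}"
    lines = [line]
    for child in node.children:
        lines.extend(flatten(child, depth + 1))
    return lines
```

A prompt built from these lines is a few hundred tokens for a typical window, versus tens of thousands for a screenshot.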
Layer 3 — Computer Use (Vision LLM)
Only when all else fails do we take a screenshot and send it to Gemini or Anthropic Computer Use. This is the most powerful layer but also the most expensive. By the time we reach it, we've already handled 80%+ of tasks in the layers above.
The performance difference is dramatic:
| Task | Without Pipeline | With Pipeline |
|---|---|---|
| Calculator (255×38) | 43s (18 LLM calls) | 2.6s (0 LLM calls) |
| Notepad (type hello) | 73s | 2.0s |
| File Explorer | 53s | 1.9s |
| Gmail compose | 162s | 21.7s (1 LLM call) |
The Browser Agent: Gemini Computer Use in a Loop
The Python computer-use-preview service is where Gemini's multimodal capabilities really shine. It runs a tight vision loop:
- Capture a screenshot of the Playwright browser
- Send it to Gemini along with the task and conversation history
- Gemini returns a function call (click, type, navigate, scroll, etc.)
- Execute the action in Playwright
- Capture the next screenshot and repeat
```python
# From agent.py — the core Gemini Computer Use loop
async def agent_loop(self) -> str:
    while True:
        response = await self._client.aio.models.generate_content(
            model=self._model_name,
            contents=self._contents,
            config=GenerateContentConfig(
                tools=self._tools,
                system_instruction=SYSTEM_PROMPT,
            ),
        )
        # Extract function calls from the response
        for part in response.candidates[0].content.parts:
            if part.function_call:
                result = await self._execute_action(part.function_call)
                self._contents.append(...)  # append result to history
        if response.candidates[0].finish_reason == FinishReason.STOP:
            return self.final_reasoning
```
Every screenshot is streamed back to the Electron frontend over WebSocket as a base64-encoded JPEG. The user watches the agent think and act in real time — they see the same screen the agent sees, with a cursor indicator showing exactly where it's about to click.
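Assembling one of those frames is a few lines of code. The field names below are illustrative, not TaskPilot's actual wire format:

```python
# Sketch of wrapping a JPEG frame as a JSON text message for the WebSocket.
# Field names ("type", "step", "data", ...) are assumptions for illustration.
import base64
import json

def screenshot_message(jpeg_bytes: bytes, step: int, reasoning: str = "") -> str:
    """Server side: encode one frame plus the agent's current reasoning."""
    return json.dumps({
        "type": "screenshot",
        "step": step,
        "reasoning": reasoning,
        "data": base64.b64encode(jpeg_bytes).decode("ascii"),
    })

def decode_frame(message: str) -> bytes:
    """Frontend side: recover the raw JPEG bytes from a message."""
    return base64.b64decode(json.loads(message)["data"])
```

Base64 inflates each frame by about a third, but it keeps every message plain JSON, which makes the protocol trivial to debug.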
We also implemented screenshot pruning: only the last 3 screenshots are kept in the Gemini context window. Older ones are replaced with text summaries. This keeps token costs manageable for long-running tasks without losing context.
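The pruning idea can be sketched as follows; the part shapes here are simplified dicts rather than the google-genai content types the real code manipulates:

```python
# Sketch of screenshot pruning: keep only the last `keep` image parts in the
# history and downgrade older ones to short text stubs.
def prune_screenshots(history: list[dict], keep: int = 3) -> list[dict]:
    image_indices = [i for i, part in enumerate(history)
                     if part["kind"] == "image"]
    pruned = list(history)
    for i in image_indices[:-keep]:  # every image except the newest `keep`
        pruned[i] = {
            "kind": "text",
            "text": f"[screenshot from step {history[i]['step']} omitted]",
        }
    return pruned
```

Running this before each model call bounds the image payload at three screenshots no matter how long the session runs.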
The WebSocket Server: Bridging Frontend to Agent
The Python server (server.py) is the glue between the Electron frontend and the browser agent. It manages session lifecycle, handles both browser and desktop modes, and routes voice input.
Each connection gets an AgentSession with a dedicated worker thread. Commands from the frontend go into a queue; the worker processes them sequentially. This keeps the WebSocket handler non-blocking while the agent does its work.
```python
class AgentSession:
    def __init__(self, ws, loop):
        self._ws = ws
        self._loop = loop
        self._cmd_queue = queue.Queue()
        self._worker_thread = threading.Thread(
            target=self._worker_loop, daemon=True
        )
        self._worker_thread.start()

    def _worker_loop(self):
        while not self._closed:
            cmd = self._cmd_queue.get(timeout=0.5)
            if cmd["action"] == "run_agent":
                self._run_agent(cmd["query"], cmd["model"], cmd["mode"])
```
For desktop mode, the server proxies the task to the local clawd-cursor REST API via ClawdBridge. The frontend doesn't need to know which backend is handling the task — it just sends a message and receives a stream of screenshots and reasoning updates.
Voice Input: Talking to Your Agent
One of the more satisfying features to build was voice input. The user clicks the microphone button in the Electron UI, speaks their task, and the audio is sent to the Python server for transcription via Google Cloud Speech-to-Text.
```python
# From voice_input.py
class VoiceTranscriber:
    def transcribe(self, audio_bytes: bytes) -> str:
        client = speech.SpeechClient()
        audio = speech.RecognitionAudio(content=audio_bytes)
        config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.WEBM_OPUS,
            sample_rate_hertz=48000,
            language_code="en-US",
        )
        response = client.recognize(config=config, audio=audio)
        return response.results[0].alternatives[0].transcript
```
The transcribed text drops straight into the task input field. It's a small thing but it makes the agent feel much more natural to use — especially for longer, more complex instructions where typing is tedious.
Safety: The Agent Needs to Know When to Ask
Giving an AI agent control of your desktop is powerful. It's also potentially dangerous. We built a safety layer that classifies every action before executing it.
Actions fall into three tiers:
- Auto — Navigation, scrolling, reading. Execute immediately.
- Preview — Typing text, opening files. Show the user what's about to happen.
- Confirm — Sending emails, deleting files, form submissions. Pause and require explicit approval.
The web dashboard (served by the Express server in clawd-cursor) shows pending confirmations in real time. The user can approve or reject any action before it executes. There's also a kill switch that immediately halts the agent.
```typescript
// From safety.ts — tier classification
export class SafetyLayer {
  classify(action: InputAction): SafetyTier {
    if (BLOCKED_PATTERNS.some(p => p.test(action.description))) {
      return SafetyTier.Block;
    }
    if (CONFIRM_PATTERNS.some(p => p.test(action.description))) {
      return SafetyTier.Confirm;
    }
    if (PREVIEW_PATTERNS.some(p => p.test(action.description))) {
      return SafetyTier.Preview;
    }
    return SafetyTier.Auto;
  }
}
```
Provider-Agnostic by Design
We didn't want to lock TaskPilot into a single AI provider. The doctor command runs an interactive setup wizard that scans your environment, detects available providers, tests each one, and recommends the optimal pipeline configuration.
The provider is auto-detected from the API key format:
- `sk-ant-*` → Anthropic
- `sk-*` → OpenAI
- `AIza*` → Gemini
- Local endpoint → Ollama
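A sketch of that prefix check (the real clawd-cursor logic may differ in details, and the endpoint heuristic here is an assumption):

```python
# Detect the AI provider from the credential format described above.
def detect_provider(api_key: str, endpoint: str = "") -> str:
    if endpoint and ("localhost" in endpoint or "127.0.0.1" in endpoint):
        return "ollama"       # local endpoint → Ollama
    if api_key.startswith("sk-ant-"):
        return "anthropic"    # must be checked before the generic "sk-" prefix
    if api_key.startswith("sk-"):
        return "openai"
    if api_key.startswith("AIza"):
        return "gemini"
    raise ValueError("unrecognized credential format")
```

The one subtlety is ordering: `sk-ant-` is a strict superset of `sk-`, so Anthropic has to be tested first.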
You can also mix providers: use Ollama for cheap text tasks (free, runs locally) and Gemini or Anthropic for vision tasks (best quality). The pipeline config is saved to .clawd-config.json and loaded on startup.
For the hackathon submission, Gemini is the primary provider — @google/genai for the TypeScript desktop agent and google-genai for the Python browser agent. Vertex AI mode is supported for cloud deployments.
Challenges We Faced
The "blind clicking" problem. Early versions of the browser agent would sometimes click coordinates that looked right in the screenshot but were slightly off in the actual browser due to scaling and DPI differences. We fixed this by normalizing coordinates relative to the Playwright viewport size and adding a cursor indicator in the frontend so users can see exactly where the agent is clicking.
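The normalization itself is a small pure function. This sketch assumes the model emits coordinates in a fixed 0–1000 grid; the actual model coordinate space may differ, so treat `model_space` as a placeholder.

```python
# Map a model-space click coordinate onto the live viewport, assuming the
# model reports positions in a 0..model_space grid (an assumption here).
def to_viewport(x: float, y: float, viewport_w: int, viewport_h: int,
                model_space: int = 1000) -> tuple[int, int]:
    return (round(x * viewport_w / model_space),
            round(y * viewport_h / model_space))
```

Keeping this in one place means DPI and scaling quirks are corrected once, instead of leaking into every click handler.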
Context window management. Long browser sessions accumulate a lot of screenshots. Sending all of them to Gemini on every iteration would be prohibitively expensive and slow. The screenshot pruning strategy — keeping only the last 3 screenshots and summarizing older ones as text — was the right balance between context retention and cost.
Thread safety in the Python server. The WebSocket handler runs in an async event loop, but the Playwright browser and Gemini client calls are blocking. Getting the threading model right — async WebSocket handler, queue-based worker thread, run_coroutine_threadsafe for sending messages back — took several iterations to get stable.
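The core of that threading model fits in a short, self-contained sketch: a worker thread hands a coroutine back to the event loop with `asyncio.run_coroutine_threadsafe` and blocks only itself while waiting. The `ws_send` stand-in below replaces a real WebSocket send.

```python
import asyncio
import threading

def send_from_worker(loop, ws_send, message):
    # Runs on the worker thread: schedule the coroutine on the event loop,
    # then block the *worker* (never the loop) until it completes.
    future = asyncio.run_coroutine_threadsafe(ws_send(message), loop)
    future.result(timeout=5)

async def main():
    sent = []

    async def ws_send(message):  # stand-in for an async websocket.send(...)
        sent.append(message)

    loop = asyncio.get_running_loop()
    worker = threading.Thread(
        target=send_from_worker, args=(loop, ws_send, "screenshot-1")
    )
    worker.start()
    await asyncio.sleep(0.2)  # let the loop service the scheduled coroutine
    worker.join()
    return sent
```

The invariant that made our version stable is visible here: the event loop thread never blocks on agent work, and the worker thread never touches the WebSocket directly.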
The 5-layer pipeline ordering. Deciding which layer handles which task isn't always obvious. We went through many iterations of the routing logic before settling on the current approach: browser CDP first, then pattern matching, then accessibility tree, then vision. The key insight was that the accessibility tree is almost always faster and cheaper than a screenshot, and it's surprisingly capable for most UI tasks.
Cross-platform desktop control. Windows uses PowerShell for accessibility queries. macOS uses JXA (JavaScript for Automation) and System Events. Linux has neither. We ended up with platform-specific script directories (scripts/mac/) and a runtime check that routes to the right implementation. Linux falls back to browser-only mode.
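The runtime check boils down to a platform switch. Only `scripts/mac/` is named above, so the Windows directory name in this sketch is hypothetical:

```python
import sys

def control_backend() -> str:
    """Route to a platform-specific desktop-control implementation."""
    if sys.platform == "darwin":
        return "scripts/mac"   # JXA + System Events
    if sys.platform == "win32":
        return "scripts/win"   # PowerShell a11y queries (hypothetical path)
    return "browser-only"      # Linux: no native scripts, fall back
```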
What We Learned
Vision is the fallback, not the foundation. The instinct when building a visual agent is to route everything through the vision model. That's wrong. Vision is expensive and slow. Build the cheap layers first — pattern matching, accessibility trees, keyboard shortcuts — and use vision only when they fail. Your users will notice the difference.
The accessibility tree is underrated. Most developers don't think about the OS accessibility tree as an automation primitive. But it's a structured, real-time representation of every UI element on screen, with labels, roles, and positions. For a huge range of tasks, it's more reliable than a screenshot and orders of magnitude cheaper.
Streaming makes agents feel alive. Sending screenshots and reasoning updates to the frontend in real time — rather than waiting for the task to complete — fundamentally changes how the agent feels to use. Users can see the agent thinking. They can intervene if it's going wrong. It transforms a black box into a collaborator.
Safety gates build trust. The confirmation flow for risky actions isn't just a safety feature — it's a trust-building mechanism. When users see the agent pause and ask "I'm about to send this email, confirm?" they feel in control. That feeling of control is what makes people comfortable giving an AI agent access to their desktop.
What's Next
TaskPilot is one of three projects our team built for the Gemini Live Agent Challenge. We also built:
- A real-time visual companion (Live Agents category) — an agent you can talk to naturally that sees through your camera, grounded in Cloud Vision, Document AI, and Natural Language API
- A 3D interactive history explorer (Creative Storyteller category) — a globe you can explore, clicking locations to get rich interleaved historical narratives with Imagen-generated imagery and Cloud TTS narration
Each project targets a different category, but they all share the same philosophy: use the right tool for the right job, ground AI reasoning in structured data, and build experiences that feel genuinely useful rather than impressive demos.
Try It Yourself
GitHub: https://github.com/tarinagarwal/task-pilot
Demo Video: https://vimeo.com/1174159668?share=copy&fl=sv&fe=ci
Built with Gemini (@google/genai, google-genai), Playwright, Google Cloud Run, Artifact Registry, Cloud Speech-to-Text, Cloud Firestore, Terraform, Electron, TypeScript, Python, and Express.
Created for the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge


