Aegis: Building a Vision-First Browser Agent with Gemini and Google ADK
How we taught an AI to navigate any website by looking at it, not parsing it.
Browser automation is broken.
Every Selenium script, every Puppeteer workflow, every RPA bot you've ever deployed shares the same fatal flaw: they depend on the DOM. CSS selectors. XPaths. Fragile identifiers that shatter the moment a website pushes a layout update, renames a class, or restructures a div. You spend more time maintaining selectors than automating tasks.
We've been building automation tools for the wrong layer. Humans don't navigate websites by inspecting elements. They look at the screen, recognize buttons, read labels, and click. The question that led to Aegis was simple: What if an AI agent could do the same?
The Vision: A Universal UI Navigator
Aegis is an AI-powered browser agent that understands web interfaces through pure vision. No DOM parsing. No API integrations. No hardcoded selectors. You describe what you want in natural language, text or voice, and Aegis takes control of a live browser to execute it.
"Open Google Flights and find the cheapest direct flight to London next Friday." "Fill out this insurance form using the data from my spreadsheet." "Navigate the AWS console and spin up a new EC2 instance."
Aegis handles all of these by looking at the screen, identifying interactive elements, reasoning about the next action, and executing it. Step by step, like a human would.
But vision-based navigation alone isn't the breakthrough. The real innovation is what happens between the agent and the user.
The Steering System: Autonomous Agents Need a Wheel
Most browser agents operate on a fire-and-forget model. You issue a command, the agent disappears into a black box, and you hope it did the right thing. That's fine for toy demos. It's unacceptable for anything involving your bank account, your data, or your reputation.
Aegis introduces a real-time steering system built on a core principle: autonomy without observability is just a black box. The system provides three interaction modes:
Steer Mode lets you watch the agent work in real-time and redirect it at any moment. Say "actually, pick the blue one" or "skip that step" and Aegis adjusts mid-task, no restart required.
Interrupt Mode pauses the agent before critical actions. Purchases, form submissions, account deletions. The agent asks for confirmation, you approve or modify, and only then does it proceed.
Queue Mode lets you stack multiple instructions and walk away. Aegis executes them sequentially while you do other work, maintaining a complete task log you can review later.
This isn't just a feature toggle. It's a fundamentally different relationship between human and agent. The user interface reflects this: a three-panel layout with a live browser view showing exactly what the agent sees, a real-time action log narrating every decision, and an always-on input bar for voice or text. Think pair-programming, but for browsing.
Architecture Deep Dive
At its core, Aegis runs a tight loop: Screenshot → Analyze → Decide → Act → Repeat.
Here's how each component fits together:
┌─────────────────────────────────────────────────────┐
│ React Frontend │
│ ScreenView │ ActionLog │ SteeringControl │ InputBar │
│ WebSocket (real-time) │
└──────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ FastAPI Backend │
│ ┌──────────────┐ ┌────────────┐ ┌──────────────┐ │
│ │ Orchestrator │→ │ Analyzer │→ │ Executor │ │
│ │ (ADK Agent) │ │ (Gemini) │ │ (Playwright) │ │
│ └──────────────┘ └────────────┘ └──────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────────┐│
│ │ Navigator │ │ Live Browser ││
│ │ (ADK Tools) │ │ (Chromium) ││
│ └──────────────┘ └──────────────────┘│
└──────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Google Cloud Platform │
│ Cloud Run │ Firestore │ Cloud Storage │ Firebase │
└─────────────────────────────────────────────────────┘
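Stripped of the infrastructure, that loop fits in a dozen lines. Here is a minimal, framework-free sketch of the Screenshot → Analyze → Decide → Act cycle; `capture`, `analyze`, `decide`, and `act` are placeholders for the real Playwright and Gemini calls, and the names are illustrative rather than Aegis's actual API:

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str          # e.g. "click", "type", "navigate"
    done: bool = False   # decide() sets this when the goal is met

def run_loop(capture, analyze, decide, act, max_steps: int = 25) -> list:
    """Screenshot -> Analyze -> Decide -> Act, repeated until done.

    capture() returns raw screenshot bytes, analyze() turns them into
    structured page state, decide() picks the next Step from that state
    and the history so far, and act() executes it in the browser.
    """
    history: list = []
    for _ in range(max_steps):   # hard cap guards against infinite loops
        state = analyze(capture())
        step = decide(state, history)
        history.append(step)
        if step.done:
            break
        act(step)
    return history
```

Everything that follows is about making each of those four callables robust in the real world.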
The Backend: Python, FastAPI, and Four Core Modules
The Executor (executor.py) manages the browser lifecycle through Playwright. It launches headless Chromium at a fixed 1280×720 viewport and exposes action primitives: click(x, y), type(text), scroll(direction), navigate(url), go_back(). Clean, deterministic operations that the AI orchestrates.
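Because the primitives are this small, the executor is easy to fake for offline testing. The sketch below is a dry-run stand-in that records each primitive instead of driving Playwright — illustrative of the interface shape, not Aegis's actual class — and shows why the fixed viewport pays off: coordinates can be clamped defensively before any click:

```python
class DryRunExecutor:
    """Stand-in for the Playwright-backed executor: records each action
    primitive instead of driving a real browser. Useful for testing the
    orchestration logic offline. (Illustrative, not Aegis's real class.)"""

    VIEWPORT = (1280, 720)  # fixed size keeps the model's coordinates stable

    def __init__(self):
        self.log = []

    def _clamp(self, x, y):
        # Never click outside the fixed viewport, even if analysis drifts.
        w, h = self.VIEWPORT
        return min(max(x, 0), w - 1), min(max(y, 0), h - 1)

    async def click(self, x, y):
        self.log.append(("click", *self._clamp(x, y)))

    async def type(self, text):
        self.log.append(("type", text))

    async def scroll(self, direction):
        self.log.append(("scroll", direction))

    async def navigate(self, url):
        self.log.append(("navigate", url))

    async def go_back(self):
        self.log.append(("go_back",))
```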
The Analyzer (analyzer.py) captures full-page screenshots, base64-encodes them, and sends them to Gemini 3 Pro with structured prompts. The model returns a JSON payload identifying page type, interactive elements with bounding-box coordinates, current state context, and navigation recommendations. This is where the magic lives: Gemini's multimodal vision turns raw pixels into structured understanding.
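Trusting that payload blindly would be a mistake, so it pays to validate it before the agent acts on it. Here is a plausible shape for the analyzer's output with a small validator; the field names are hypothetical, not Aegis's actual schema:

```python
import json

# Hypothetical shape of the analyzer's structured output -- the actual
# field names in Aegis may differ.
REQUIRED_KEYS = {"page_type", "elements", "state_summary", "recommendation"}
ELEMENT_KEYS = {"label", "role", "box"}  # box = [x_min, y_min, x_max, y_max]

def parse_analysis(raw: str) -> dict:
    """Validate the JSON the vision model returns before acting on it.
    Rejecting a malformed payload here is cheaper than a misclick later."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"analysis missing keys: {sorted(missing)}")
    for el in data["elements"]:
        if not ELEMENT_KEYS <= el.keys():
            raise ValueError(f"malformed element: {el}")
        if len(el["box"]) != 4:
            raise ValueError(f"box must be [x_min, y_min, x_max, y_max]: {el}")
    return data
```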
The Navigator (navigator.py) bridges analysis and execution into ADK-compatible async tool functions. Each tool has rich docstrings that Gemini uses for function calling and tool selection. This is a critical design choice: the agent doesn't just call functions, it reasons about which function to call based on the documentation embedded in the tools themselves.
async def click_element(x: int, y: int, description: str) -> str:
    """Click on an element at the specified coordinates.

    Use this when you need to interact with a button, link, checkbox,
    or any clickable UI element identified in the screenshot analysis.
    The description should match the element identified by the analyzer.
    """
    await executor.click(x, y)
    screenshot = await analyzer.capture_and_analyze()
    return f"Clicked '{description}' at ({x},{y}). New state: {screenshot.summary}"
The Orchestrator (orchestrator.py) wires everything together as a Google ADK Agent. It receives natural language instructions, decomposes them into steps, and calls navigator tools in sequence. The ADK framework makes this surprisingly elegant: each capability is a well-documented function, and the agent's decision-making is both predictable and extensible.
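The pattern underneath that elegance is worth making concrete: the framework turns each tool's signature and docstring into a manifest the model reasons over when selecting its next call. The stdlib sketch below mimics that derivation — it is a stand-in to show the pattern, not ADK's real internals:

```python
import inspect

def build_tool_manifest(*tools):
    """Mimic what an ADK-style framework does under the hood: turn each
    async tool function's signature and docstring into a manifest the
    model can reason over when choosing which tool to call.
    (A stdlib stand-in to show the pattern, not ADK's internals.)"""
    manifest = []
    for fn in tools:
        sig = inspect.signature(fn)
        manifest.append({
            "name": fn.__name__,
            "parameters": list(sig.parameters),
            "description": inspect.getdoc(fn) or "",
        })
    return manifest

async def click_element(x: int, y: int, description: str) -> str:
    """Click on an element at the specified coordinates."""
    ...

async def type_text(text: str) -> str:
    """Type text into the currently focused field."""
    ...
```

This is why rich docstrings are a design requirement rather than a nicety: they are the only channel through which the model learns what each tool is for.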
The Frontend: React, TypeScript, Real-Time Streaming
The frontend streams live browser screenshots over WebSocket. The ScreenView component renders the agent's viewport in real-time. The ActionLog displays a chronological feed of every decision. The SteeringControl component manages mode transitions. And the InputBar accepts both text and voice input, using Gemini's Live API for bidirectional audio.
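On the wire, each update can travel as one JSON envelope carrying a base64-encoded screenshot plus the narration line for the ActionLog. The envelope below is a hypothetical wire format — Aegis's real message shape may differ — but base64-in-JSON is the usual way to push image frames over a text WebSocket:

```python
import base64
import json
import time

def frame_message(jpeg_bytes: bytes, step: int, narration: str) -> str:
    """Wrap one screenshot plus the agent's narration into a single JSON
    WebSocket message. (Hypothetical envelope; field names are illustrative.)"""
    return json.dumps({
        "type": "frame",
        "step": step,
        "ts": time.time(),
        "narration": narration,   # feeds the ActionLog panel
        "jpeg_b64": base64.b64encode(jpeg_bytes).decode("ascii"),
    })
```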
The voice integration deserves emphasis. This isn't speech-to-text piped into a chatbot. Gemini Live API provides bidirectional audio streaming: the user speaks naturally, the agent transcribes in real-time, executes the task, and narrates progress back through synthesized voice. You can have a conversation with an agent that's actively doing things while you talk. It collapses the feedback loop between instruction and execution to zero.
The Steering Protocol
The steering system required the most careful engineering. Handling mid-task redirects while maintaining context meant building a message queue system that could gracefully interrupt the agent's planning loop without losing the overall goal. When a user sends a steering message in Steer Mode, the orchestrator pauses its current plan, re-analyzes the screen with the new instruction overlaid on the existing context, and resumes with an updated action sequence. The agent doesn't restart. It adapts.
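The adapt-don't-restart behavior boils down to draining a steering queue between planning iterations. The asyncio sketch below shows the control flow in simplified form — the real orchestrator also re-analyzes the screen before resuming, and `replan` here is a placeholder for that heavier step:

```python
import asyncio

async def steerable_loop(plan, steering: asyncio.Queue, act, replan):
    """Run planned steps, draining the steering queue between steps.
    A steering message doesn't restart the task: replan() folds the new
    instruction into the remaining plan and execution continues.
    (Simplified sketch; the real orchestrator also re-analyzes the screen.)"""
    executed = []
    while plan:
        # Non-blocking check: has the user steered since the last step?
        try:
            instruction = steering.get_nowait()
            plan = replan(plan, instruction)   # adapt, don't restart
            continue
        except asyncio.QueueEmpty:
            pass
        step = plan.pop(0)
        await act(step)
        executed.append(step)
    return executed
```

Because the check is non-blocking, a silent user costs nothing; a steering message takes effect at the next step boundary.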
Building with Gemini and Google Cloud
We chose this stack deliberately, and every choice earned its place.
Gemini 3 Pro powers the vision system. We evaluated several multimodal models. Gemini's ability to identify UI elements, read text in screenshots, and return structured coordinate data was consistently the most reliable. The key was prompt engineering: by using a fixed viewport size (1280×720) and structured output schemas, we achieved reliable element identification across radically different website designs. E-commerce sites, government portals, SaaS dashboards, social media feeds. Same model, same prompts, consistent results.
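The fixed viewport is what makes the coordinate math trivial. Gemini's documentation describes bounding boxes as `[y_min, x_min, y_max, x_max]` normalized to a 0–1000 scale; with the viewport pinned at 1280×720, mapping a box to a pixel-space click target is a pure function (the helper below is our illustration of that conversion, not code from Aegis):

```python
VIEWPORT_W, VIEWPORT_H = 1280, 720

def to_pixels(box_1000, w=VIEWPORT_W, h=VIEWPORT_H):
    """Convert a [y_min, x_min, y_max, x_max] box normalized to 0-1000
    (the convention Gemini's docs describe for bounding boxes) into the
    pixel-space click target at the box center. Pinning the viewport at
    1280x720 makes this mapping identical on every page."""
    y0, x0, y1, x1 = box_1000
    cx = (x0 + x1) / 2 * w / 1000
    cy = (y0 + y1) / 2 * h / 1000
    return round(cx), round(cy)
```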
Google ADK was the right abstraction layer. We didn't want to build agent orchestration from scratch. ADK's tool-based architecture maps perfectly onto browser automation: each action is a tool, each tool has documentation, and the agent reasons about tool selection using the same model that analyzes screenshots. The cognitive loop stays unified.
Gemini Live API enabled voice interaction without building a separate speech pipeline. Bidirectional audio streaming, real-time transcription, and synthesis all handled by one API. This let us focus on the UX rather than infrastructure.
Google Cloud Platform handles deployment and persistence. Cloud Run provides containerized auto-scaling for both frontend and backend services. Firestore stores task history, session state, and user preferences. Cloud Storage maintains a screenshot audit trail for every action the agent takes, crucial for debugging, replay, and trust. Firebase Authentication provides secure user management with Google Sign-In.
Challenges and Hard-Won Lessons
Coordinate accuracy was the biggest technical hurdle. Gemini's vision model identifies UI elements reliably, but translating "the blue Submit button in the bottom right" into exact pixel coordinates required iteration. Our solution: a coordinate-validation loop where the agent re-examines its click target before acting. If the re-analysis doesn't confirm the element at those coordinates, the agent re-scans and adjusts. It adds latency but eliminates misclicks.
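The verify-before-click pattern can be sketched as a small wrapper around the click primitive. Here `locate` stands in for a fresh screenshot-plus-analysis pass that returns the element's current center (or `None`); the names are illustrative, not Aegis's API:

```python
async def verified_click(target, locate, click, max_attempts=3):
    """Confirm the target before clicking. locate(description) re-runs the
    screenshot analysis and returns the element's current (x, y) center,
    or None if it can't be confirmed. Clicking only after a fresh
    confirmation trades a little latency for accuracy.
    (Sketch of the pattern; names are illustrative, not Aegis's API.)"""
    for _ in range(max_attempts):
        found = await locate(target)    # fresh look at the screen
        if found is not None:
            x, y = found
            await click(x, y)
            return (x, y)
    return None   # give up; let the planner re-scan the whole page
```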
Latency management was the second challenge. Each screenshot-analyze-act cycle involves a round trip to Gemini. We optimized by streaming WebSocket updates to the frontend so users see progress in real-time even when individual steps take 1-2 seconds. Perceived latency matters more than actual latency when a human is watching.
The biggest lesson wasn't technical at all. The UX of AI agents matters more than their raw capability. A brilliant agent that you can't steer, observe, or trust is less useful than a good agent with great transparency. The steering system didn't exist in our original design. It emerged when we realized that autonomy alone isn't enough. People need control. They need to see what the agent sees. They need to intervene before irreversible actions. The moment we built that, Aegis stopped being a demo and started being a product.
What's Next
Aegis today navigates single browser tabs with human oversight. The roadmap extends in every direction:
Multi-tab orchestration will let the agent work across multiple browser contexts simultaneously, enabling complex cross-site workflows like "compare prices on Amazon, Walmart, and Best Buy, then buy the cheapest option."
Workflow recording and sharing will let users record a navigation session once, then replay it or share it as a reusable automation template. One-time setup, infinite replay.
Plugin ecosystem via MCP will enable community-built tool extensions for domain-specific workflows: e-commerce, HR portals, healthcare systems, government forms. Aegis already supports MCP server integration. The ecosystem just needs to grow.
Mobile UI navigation will extend the vision-based approach to mobile app screenshots via device emulation. Same architecture. New surface area.
The fundamental bet behind Aegis is this: vision-based UI understanding is the right abstraction layer for browser automation. Not the DOM. Not APIs. Not selectors. Vision. And with models like Gemini improving with every generation, the gap between "AI watching a screen" and "human watching a screen" will close fast.
We built Aegis to prove that future is already here.
Aegis was built for the Gemini Live Agent Challenge hackathon, competing in the UI Navigator track. Built with Gemini 3 Pro, Google ADK, Gemini Live API, Playwright, FastAPI, React, and deployed on Google Cloud Platform.
Team: Jesse Newton · Chronos Intelligence Systems
GitHub · Devpost