DEV Community

WonderLab

Open Source Project #107: page-agent — A Pure-JS GUI Agent for the Browser, No Screenshots, No Plugins, No Backend

Introduction

"One script tag. Natural language controls your web app. No screenshots, no browser extensions, no backend, no Python."

This is article #107 in the "One Open Source Project a Day" series. Today's project is page-agent — Alibaba's open-source client-side GUI Agent library.

The dominant approach to making AI operate a browser follows one path: screenshot → multimodal LLM recognizes elements → execute action. That path has two costs: multimodal models (expensive) and a server or headless browser environment (complex infrastructure).

page-agent's answer: serializing the DOM to text is enough. Assign numeric indices to interactive elements, send the text to any LLM, get back "click element 3" as a tool call, execute it directly in the browser. The entire loop stays in-page. No screenshots. No multimodal capability needed.

What You'll Learn

  • Text-based DOM: how to turn a page into an LLM-readable indexed structure
  • ReAct loop architecture: Observe → Think (reflection + action) → Act, fully implemented in TypeScript
  • Reflection-Before-Action model: evaluating the previous step before planning the next
  • Built-in tool system: how click, input, scroll, and JS execution tools are defined
  • Main package vs. core package: page-agent (with UI) and @page-agent/core (pure logic)
  • Chrome extension + MCP Server for cross-page capability

Prerequisites

  • Understanding of LLM function calling (tool use) mechanics
  • Basic knowledge of DOM and browser events
  • Experience with OpenAI SDK or similar APIs

Project Background

Overview

page-agent is a pure client-side GUI Agent library that embeds LLM reasoning directly into web pages. It understands page structure through serialized DOM text and executes actions without leaving the browser.

The core architectural decision: DOM text instead of screenshots. The DOM already fully describes what elements exist on the page, what type they are, and whether they're currently interactive. Serializing this to text is more precise than a screenshot (button labels are never blurry), cheaper (no multimodal models), and faster (DOM reads are synchronous).

The project credits browser-use (server-side Python browser automation) as its inspiration. page-agent positions itself as the client-side counterpart: it runs inside the page rather than controlling a headless browser from a server.

Author / Team

  • Organization: Alibaba
  • Primary Language: TypeScript
  • License: MIT
  • npm packages: page-agent (with UI Panel), @page-agent/core (pure agent logic)
  • Latest version: v1.10.0

Project Stats

  • 📄 License: MIT
  • 📦 npm: page-agent
  • 💻 Stack: TypeScript + npm workspaces + Vite

Features

What It Does

Traditional browser GUI Agent (screenshot path):
Page → Screenshot → Multimodal LLM visual understanding → Return coordinates/elements → Execute
Cost: multimodal API fees + server/headless browser infrastructure

page-agent (text DOM path):
Page → Serialize DOM with indexed interactive elements → Pure text LLM reasoning
    → Return tool call: { click_element_by_index: { index: 2 } }
    → Execute directly in the page
Cost: text LLM (cheaper) + zero backend (pure frontend JS)

Use Cases

  1. SaaS AI Copilot: Embed an AI assistant directly in your product — user says "create a new project named X" and it happens. No backend rewrite needed.
  2. Smart form filling: Compress 20-click workflows in ERP/CRM/admin systems into a single sentence
  3. Accessibility: Add natural language control to any web application — voice commands, screen reader enhancement
  4. Cross-tab Agent: With the Chrome extension, agents can work across tabs (e.g., read data from a spreadsheet tab, fill it into a form in another tab)
  5. MCP control: Through the MCP Server (Beta), external agent clients can control the browser

Quick Start

One-line integration (uses the free demo LLM; for technical evaluation only):

<script src="https://cdn.jsdelivr.net/npm/page-agent@1.10.0/dist/iife/page-agent.demo.js" crossorigin="true"></script>

After loading, a floating Agent panel appears in the bottom-right corner. Type natural language commands directly.

NPM installation (production):

npm install page-agent

Then create an agent and run a task:

import { PageAgent } from 'page-agent'

const agent = new PageAgent({
    model: 'gpt-4o',
    baseURL: 'https://api.openai.com/v1',
    apiKey: 'YOUR_API_KEY',
    language: 'en-US',
})

const result = await agent.execute('Click the login button, then fill in the username as test@example.com')
console.log(result.success, result.message)

Using @page-agent/core (no UI panel, embedded scenarios):

import { PageAgentCore } from '@page-agent/core'
import { PageController } from '@page-agent/page-controller'

const pageController = new PageController({ enableMask: true })
const agent = new PageAgentCore({
    pageController,
    model: 'gpt-4o',
    apiKey: 'sk-...',
    maxSteps: 20,
    onAfterStep: async (agent, history) => {
        console.log('step completed', history.at(-1))
    }
})

const result = await agent.execute('Find the lowest-priced item in the product list and add it to the cart')

Supported models (any OpenAI-compatible endpoint works):

Provider  | Models
OpenAI    | gpt-4o, gpt-4-turbo, gpt-5.2, gpt-5.4
Anthropic | claude-opus-4.8, claude-sonnet-4, claude-haiku-3.5
Alibaba   | qwen3.5-plus, qwen3.6-max, qwen3.6-flash
DeepSeek  | deepseek-chat, deepseek-reasoner
Google    | gemini-2.0-flash (via OpenAI-compatible endpoint)
Local     | Ollama (qwen3:14b, tested on RTX 3090 24GB)

Deep Dive

Text-Based DOM: The Core Technique

page-agent's PageController converts the DOM into an indexed simplified text structure:

Raw DOM (simplified):
<div class="form-container">
  <input type="text" placeholder="Username" />
  <input type="password" placeholder="Password" />
  <button class="btn-primary">Sign In</button>
  <a href="/register">Create account</a>
</div>

Serialized for LLM (interactive elements with indices):
URL: https://example.com/login
Title: Login - My App

[0]<input placeholder="Username" />
[1]<input type="password" placeholder="Password" />
[2]<button>Sign In</button>
[3]<a>Create account</a>

Each interactive element gets a numeric index [N]. The LLM only needs to return "click [2]" or "input 'admin@example.com' into [0]."
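
The serialization step can be sketched in a few lines. This is a minimal illustration that works on plain descriptor objects instead of live DOM nodes, so it runs outside a browser. The ElementInfo shape and the serializeElement and serializePage names are hypothetical and are not part of page-agent's API.

```typescript
// Hypothetical descriptor for one interactive element (not page-agent's real type).
interface ElementInfo {
    tag: string
    attrs: Record<string, string>
    text: string
}

// Render one element as `[index]<tag attr="value">text</tag>`,
// or as a self-closing tag when it has no text content.
function serializeElement(index: number, el: ElementInfo): string {
    const attrs = Object.keys(el.attrs)
        .map((key) => ` ${key}="${el.attrs[key]}"`)
        .join('')
    return el.text
        ? `[${index}]<${el.tag}${attrs}>${el.text}</${el.tag}>`
        : `[${index}]<${el.tag}${attrs} />`
}

// Build the text sent to the LLM: a page header followed by the indexed elements.
function serializePage(url: string, title: string, elements: ElementInfo[]): string {
    const lines = elements.map((el, i) => serializeElement(i, el))
    return [`URL: ${url}`, `Title: ${title}`, '', ...lines].join('\n')
}

const loginPage = serializePage('https://example.com/login', 'Login - My App', [
    { tag: 'input', attrs: { placeholder: 'Username' }, text: '' },
    { tag: 'input', attrs: { type: 'password', placeholder: 'Password' }, text: '' },
    { tag: 'button', attrs: {}, text: 'Sign In' },
    { tag: 'a', attrs: {}, text: 'Create account' },
])
console.log(loginPage)
```

Run against the login form above, the sketch reproduces the serialized text shown earlier. Note that class names and the href are dropped in this example; which attributes the real serializer keeps is not something this sketch tries to model.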

DOM processing pipeline:

Live DOM
    ↓ dom_tree/ module
FlatDomTree (flattened tree with DomNode map)
    ↓ Dehydration (simplification)
Indexed text representation
    ↓
LLM context input
    ↓
LLM returns tool call: { click_element_by_index: { index: 2 } }
    ↓
PageController.clickElement(2) → locate HTMLElement by index → fire click

Only elements meeting all three criteria are indexed (to reduce noise):

  • isVisible: true (the element is in the viewport or can be scrolled into view)
  • isInteractive: true (the element can be clicked, typed into, or selected)
  • isTopElement: true (the element is not covered by an overlapping element)
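
As a rough sketch of that filter: the three flag names come from the criteria above, while the NodeFlags shape and the shouldIndex and assignIndices helpers are assumptions made for illustration. The point is that indices are handed out only to nodes that pass all three checks, so the numbers the LLM sees stay dense.

```typescript
// Flag names follow the article; the node shape itself is a hypothetical simplification.
interface NodeFlags {
    label: string
    isVisible: boolean
    isInteractive: boolean
    isTopElement: boolean
}

// A node is indexed only when all three flags are true.
function shouldIndex(node: NodeFlags): boolean {
    return node.isVisible && node.isInteractive && node.isTopElement
}

// Assign consecutive indices, in document order, to the nodes that pass the filter.
function assignIndices(nodes: NodeFlags[]): Record<number, string> {
    const indexed: Record<number, string> = {}
    let next = 0
    for (const node of nodes) {
        if (shouldIndex(node)) {
            indexed[next] = node.label
            next += 1
        }
    }
    return indexed
}

const indexed = assignIndices([
    { label: 'Sign In button', isVisible: true, isInteractive: true, isTopElement: true },
    { label: 'hidden menu item', isVisible: false, isInteractive: true, isTopElement: true },
    { label: 'heading text', isVisible: true, isInteractive: false, isTopElement: true },
    { label: 'button under a modal', isVisible: true, isInteractive: true, isTopElement: false },
    { label: 'Create account link', isVisible: true, isInteractive: true, isTopElement: true },
])
console.log(indexed)
```

Only the first and last nodes receive an index here; the other three each fail exactly one criterion.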

ReAct Loop Architecture

agent.execute("complete some task")
    ↓
┌─────────────────────────────────────────────┐
│         Main loop (up to maxSteps)           │
│                                             │
│  ① Observe                                  │
│     pageController.updateTree()             │
│     → refresh DOM, get current page text    │
│                                             │
│  ② Think (LLM call)                         │
│     Input: system prompt + history + DOM    │
│     LLM output (Reflection model):          │
│     {                                       │
│       evaluation_previous_goal: "...",  ←── how did the last step go?
│       memory: "...",                    ←── what to remember?
│       next_goal: "...",                 ←── what to do next?
│       action: { click_element_by_index: {index: 2} }
│     }                                       │
│                                             │
│  ③ Act                                      │
│     Execute the tool call returned by LLM   │
│     Record to history (persistent, visible  │
│     to LLM in next steps)                   │
│                                             │
│  ④ Check termination                        │
│     LLM calls done tool → return result     │
│     OR maxSteps reached                     │
└─────────────────────────────────────────────┘
    ↓
ExecutionResult { success, message, history }
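
The loop in the diagram can be approximated with a small amount of code. The sketch below is an assumption-laden illustration, not page-agent's implementation: PageLike, Think, Step, LoopResult, and runLoop are hypothetical names, the page is faked, and the "LLM" is a scripted function so the example runs without a network call.

```typescript
// All names below are illustrative stand-ins, not page-agent's real interfaces.
interface Step {
    evaluation_previous_goal: string
    memory: string
    next_goal: string
    action: Record<string, Record<string, unknown>>
}

interface PageLike {
    observe(): string // returns the serialized page text
    act(tool: string, args: Record<string, unknown>): string // runs one tool call
}

type Think = (pageText: string, history: Step[]) => Promise<Step>

interface LoopResult {
    success: boolean
    message: string
    history: Step[]
}

async function runLoop(page: PageLike, think: Think, maxSteps: number): Promise<LoopResult> {
    const history: Step[] = []
    for (let i = 0; i < maxSteps; i++) {
        const pageText = page.observe() // 1. Observe
        const step = await think(pageText, history) // 2. Think
        history.push(step)
        const tool = Object.keys(step.action)[0]
        if (!tool) return { success: false, message: 'no action returned', history }
        if (tool === 'done') { // 4. Check termination
            return { success: true, message: String(step.action.done.text), history }
        }
        page.act(tool, step.action[tool]) // 3. Act
    }
    return { success: false, message: 'maxSteps reached', history }
}

// Demo: a fake page and a scripted "LLM" that clicks once, then finishes.
let clicked = false
const fakePage: PageLike = {
    observe: () => (clicked ? 'Title: Dashboard' : '[0]<button>Sign In</button>'),
    act: (tool, args) => {
        if (tool === 'click_element_by_index' && args.index === 0) clicked = true
        return 'ok'
    },
}
const scripted: Think = async (pageText) =>
    pageText.indexOf('Dashboard') >= 0
        ? { evaluation_previous_goal: 'Click worked', memory: 'Logged in', next_goal: 'Finish',
            action: { done: { text: 'Reached the dashboard' } } }
        : { evaluation_previous_goal: 'Nothing done yet', memory: '', next_goal: 'Click Sign In',
            action: { click_element_by_index: { index: 0 } } }

const pending = runLoop(fakePage, scripted, 5)
pending.then((result) => console.log(result.success, result.message, result.history.length))
```

In this demo the loop finishes in two steps: one click, then a done call. The termination check runs before the act step because done arrives as an ordinary tool call.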

Reflection-Before-Action Model

Before each LLM call, the previous step's result and full history are passed to the model, which is required to reflect before acting:

{
  "evaluation_previous_goal": "Successfully clicked the login button, page navigated to the dashboard",
  "memory": "Login complete, currently on the dashboard. Still need to find the settings page.",
  "next_goal": "Locate and click the Settings or Account option in the navigation bar",
  "action": {
    "click_element_by_index": { "index": 5 }
  }
}

evaluation_previous_goal forces the model to evaluate the previous step's outcome before proceeding — preventing blind continuation (e.g., if a click triggered nothing, the model should pivot rather than click again).

memory is the short-term memory mechanism: compressing key progress into 1-3 sentences that persist in history, so the LLM doesn't "forget" what it has already accomplished in a long multi-step task.
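
To make the two fields concrete, here is a hedged sketch of how a step could be validated and then folded back into the next prompt. The field names come from the JSON example above; parseStep, formatHistory, and the exact validation rules are assumptions for illustration and may differ from what the library does.

```typescript
// Shape taken from the article's example; the validation rules are assumptions.
interface ReflectionStep {
    evaluation_previous_goal: string
    memory: string
    next_goal: string
    action: Record<string, unknown>
}

// Parse a raw JSON string from the model and reject malformed steps.
function parseStep(raw: string): ReflectionStep {
    const data = JSON.parse(raw) as Partial<ReflectionStep>
    const fields: Array<'evaluation_previous_goal' | 'memory' | 'next_goal'> =
        ['evaluation_previous_goal', 'memory', 'next_goal']
    for (const field of fields) {
        if (typeof data[field] !== 'string') throw new Error(`missing field: ${field}`)
    }
    const action = data.action
    if (typeof action !== 'object' || action === null || Array.isArray(action) ||
        Object.keys(action).length !== 1) {
        throw new Error('action must contain exactly one tool call')
    }
    return data as ReflectionStep
}

// Render earlier steps as text so the next prompt carries the agent's memory forward.
function formatHistory(steps: ReflectionStep[]): string {
    return steps
        .map((s, i) => `Step ${i + 1}: ${s.evaluation_previous_goal} | Memory: ${s.memory}`)
        .join('\n')
}

const raw = JSON.stringify({
    evaluation_previous_goal: 'Successfully clicked the login button',
    memory: 'Login complete, currently on the dashboard',
    next_goal: 'Open the Settings page',
    action: { click_element_by_index: { index: 5 } },
})
const parsed = parseStep(raw)
const historyText = formatHistory([parsed])
console.log(historyText)
```

Because each step's memory is written into the history text, the model sees its own summary again on the next call instead of having to re-derive progress from the page.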

Built-in Tool System

Tools the LLM can call via function calling:

Tool                   | Purpose
click_element_by_index | Click element at specified index
input_text             | Type text into input field at index
select_dropdown_option | Select dropdown option by text content
scroll                 | Vertical scroll (by pages or pixels)
scroll_horizontally    | Horizontal scroll
execute_javascript     | Execute arbitrary JS (AbortSignal supported)
wait                   | Wait 1-10 seconds (for page loads or animations)
ask_user               | Ask the user a question (human-in-the-loop node)
done                   | Task complete, return result
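
A tool call such as { click_element_by_index: { index: 2 } } is just an object with one key, so dispatching it amounts to a lookup by tool name. The sketch below shows that idea only. The ToolHandler signature, the handler bodies, and the argument names for input_text (index, text) are assumptions; the real tools operate on the DOM, while these handlers merely record what they were asked to do.

```typescript
// Hypothetical handler signature; the tool names come from the table above.
type ToolHandler = (args: Record<string, unknown>) => string

// Records what each handler was asked to do, standing in for real DOM operations.
const log: string[] = []

const tools: Record<string, ToolHandler> = {
    click_element_by_index: (args) => {
        log.push(`click ${args.index}`)
        return `Clicked element ${args.index}`
    },
    // Argument names (index, text) are assumed for this sketch.
    input_text: (args) => {
        log.push(`type "${args.text}" into ${args.index}`)
        return `Typed into element ${args.index}`
    },
    done: (args) => `Done: ${args.text}`,
}

// Dispatch an action object such as { click_element_by_index: { index: 2 } }.
function dispatch(action: Record<string, Record<string, unknown>>): string {
    const toolNames = Object.keys(action)
    if (toolNames.length !== 1) throw new Error('expected exactly one tool call')
    const toolName = toolNames[0]
    const handler = tools[toolName]
    if (!handler) throw new Error(`unknown tool: ${toolName}`)
    return handler(action[toolName])
}

const clickResult = dispatch({ click_element_by_index: { index: 2 } })
const typeResult = dispatch({ input_text: { index: 0, text: 'admin@example.com' } })
console.log(clickResult, typeResult, log)
```

Rejecting unknown tool names and multi-key actions up front keeps a malformed model response from silently doing nothing.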

Custom tools (extend or override defaults):

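import { z } from 'zod' // provides the z schema builder used for inputSchema
import { PageAgent } from 'page-agent'

// fetchCurrentUser is assumed to be defined elsewhere in your application
const agent = new PageAgent({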
    model: 'gpt-4o',
    apiKey: '...',
    customTools: {
        // Add a custom tool
        get_current_user: {
            description: 'Get the currently logged-in user information',
            inputSchema: z.object({}),
            execute: async function() {
                const user = await fetchCurrentUser()
                return JSON.stringify(user)
            }
        },
        // Set to null to disable a built-in tool
        execute_javascript: null
    }
})
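
The add, override, and disable semantics described by the comments in that config can be pictured as a simple merge. This is a guess at the behavior written out as code, not the library's actual merge logic; ToolDef and mergeTools are hypothetical names.

```typescript
// Hypothetical, simplified tool definition (the real one also carries an input schema).
interface ToolDef {
    description: string
    execute: () => string
}

// Merge custom tools over the built-in set: a definition adds or overrides,
// and null removes the tool with that name.
function mergeTools(
    builtIn: Record<string, ToolDef>,
    custom: Record<string, ToolDef | null>,
): Record<string, ToolDef> {
    const merged: Record<string, ToolDef> = { ...builtIn }
    for (const toolName of Object.keys(custom)) {
        const tool = custom[toolName]
        if (tool === null) delete merged[toolName]
        else merged[toolName] = tool
    }
    return merged
}

const builtIn: Record<string, ToolDef> = {
    done: { description: 'Task complete', execute: () => 'done' },
    execute_javascript: { description: 'Run arbitrary JS', execute: () => 'ran js' },
}
const merged = mergeTools(builtIn, {
    get_current_user: { description: 'Get the current user', execute: () => '{"name":"demo"}' },
    execute_javascript: null,
})
console.log(Object.keys(merged).sort())
```

After the merge, the model is offered done and get_current_user but no longer sees execute_javascript, which is a reasonable default to consider when the agent runs on pages with sensitive data.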

Monorepo Architecture

packages/
├── page-agent/        → Main package (npm: page-agent), includes UI panel
├── core/              → Pure agent logic (npm: @page-agent/core), no UI
├── llms/              → LLM client, multi-provider support
├── page-controller/   → DOM operations and visual feedback
├── ui/                → Control panel + i18n, decoupled from core
├── extension/         → Chrome extension (WXT + React)
└── website/           → Documentation site (React)

Key module boundary design:

  • page-controller has no knowledge of LLMs — only handles DOM operations
  • llms has no knowledge of page structure — only handles LLM communication
  • core combines them into the ReAct loop
  • page-agent (main package) adds a UI panel on top of core — both are independently usable

How the LLM Client Works

The llms package uses a MacroTool pattern that wraps the reflection model:

// Each LLM call expects this structure back
export interface MacroToolInput extends Partial<AgentReflection> {
    action: Record<string, any>
}

export interface AgentReflection {
    evaluation_previous_goal: string  // "Previous action succeeded/failed because..."
    memory: string                    // "Key facts to remember: ..."
    next_goal: string                 // "Next I will..."
}

The OpenAI-compatible client handles:

  1. Converting tools to OpenAI function-calling format
  2. Building requests with parallel_tool_calls: false (one action per step)
  3. Provider-specific patches (DeepSeek: disable explicit tool_choice; MiniMax: temperature/tool-call compatibility)
  4. Automatic retry with exponential backoff for 429/500 errors
  5. Token usage tracking including prompt cache hits and reasoning tokens
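
Item 4 in the list is the easiest to show in isolation. The following is a minimal sketch of retry with exponential backoff, assuming a doubling schedule and the 100ms base mentioned later in the reliability notes. LlmResponse, withRetry, the retry limit, and the injected sleep function are all illustrative choices, not the client's real internals.

```typescript
// Simplified response shape for this sketch.
interface LlmResponse {
    status: number
    body: string
}

// Statuses the article lists as retryable.
function isRetryable(status: number): boolean {
    return status === 429 || status === 500
}

// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
function backoffDelay(baseMs: number, attempt: number): number {
    return baseMs * Math.pow(2, attempt)
}

// Call once, then retry while the response is retryable and retries remain.
// `sleep` is injected so the demo below can run without real waiting.
async function withRetry(
    call: () => Promise<LlmResponse>,
    maxRetries: number,
    baseMs: number,
    sleep: (ms: number) => Promise<void>,
): Promise<LlmResponse> {
    let response = await call()
    for (let attempt = 0; attempt < maxRetries && isRetryable(response.status); attempt++) {
        await sleep(backoffDelay(baseMs, attempt))
        response = await call()
    }
    return response
}

// Demo: the fake endpoint fails twice (429, then 500) before succeeding.
const statuses = [429, 500, 200]
const waited: number[] = []
const outcome = withRetry(
    async () => ({ status: statuses.shift() ?? 200, body: 'ok' }),
    3,
    100,
    async (ms) => { waited.push(ms) },
)
outcome.then((res) => console.log(res.status, waited))
```

In the demo the two failures cost delays of 100ms and 200ms before the third call succeeds. In a browser, sleep would typically wrap setTimeout in a promise.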

Cross-Page Capability: Chrome Extension + MCP

Chrome Extension (for cross-tab workflows):

  • The agent runs in an extension-controlled context that can switch tabs and read each page's DOM
  • Use case: "Read data from a spreadsheet tab, paste it into a form in another tab"

MCP Server (Beta):

  • Exposes page-agent as MCP tools, letting external agent clients (Claude Desktop, Claude Code) control the browser remotely
  • Use case: connecting browser control capability into a larger agent workflow

Reliability Design

  • Step delay: 400ms between steps — breathing room for page renders and network requests
  • Click wait: 200ms after click operations — ensures DOM updates complete
  • Concurrency guard: Prevents concurrent execute() calls to avoid race conditions
  • AbortSignal support: All tool execution honors cancellation signals; execute_javascript is interruptible
  • Automatic retry: LLM failures (429 rate limits, 500 server errors) auto-retry with exponential backoff (100ms base)
  • Token tracking: Every step records promptTokens, completionTokens, prompt cache hits, and reasoning tokens
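
Of these, the concurrency guard is worth a short sketch because it is easy to get subtly wrong. The version below is one plausible way to do it, assuming the guard rejects a second call outright instead of queueing it; ExecutionGuard is a hypothetical name and page-agent's internal mechanism may differ.

```typescript
// Hypothetical guard; page-agent's internal mechanism may differ.
class ExecutionGuard {
    private running = false

    // Run `task` unless another one is still in flight.
    // The flag is cleared in `finally`, so a failed task does not lock the guard.
    async run<T>(task: () => Promise<T>): Promise<T> {
        if (this.running) throw new Error('execute() is already running')
        this.running = true
        try {
            return await task()
        } finally {
            this.running = false
        }
    }
}

// Demo: the first task stays pending until `release` is called,
// so the second call arrives while the guard is still held.
const guard = new ExecutionGuard()
let release: () => void = () => {}
const first = guard.run(() => new Promise<void>((resolve) => { release = resolve }))
const second = guard.run(async () => {}).then(() => 'ran', () => 'rejected')
release()
const report = Promise.all([first, second]).then(([, secondOutcome]) => secondOutcome)
report.then((secondOutcome) => console.log(secondOutcome))
```

The second call is rejected because it starts before the first task has settled. Once the first task finishes, the guard is free again and later calls go through.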

Resources

Official Links


Summary

page-agent's core insight is that browser automation doesn't need to "look at" the page. The DOM is already a structured description of what exists on the page — serializing it to indexed text lets any text-only LLM understand and operate it directly. No multimodal model, no visual understanding, no coordinate targeting.

This insight dramatically lowers the barrier for GUI agents: no server required, no headless browser, no screenshots, no extra infrastructure — a single script tag or npm package running in the user's browser.

The Reflection-Before-Action model (evaluate last step → plan next step) combined with persistent memory in history is what lets multi-step tasks stay on track. The agent doesn't just chain commands blindly; it continuously re-evaluates whether its actions are having the intended effect.

For developers who want to add AI Copilot capability to a web product, or automate internal tool workflows without touching backend infrastructure, page-agent is one of the lowest-friction options currently available.


