WonderLab

Posted on Jun 27

Open Source Project #107: page-agent — A Pure-JS GUI Agent for the Browser, No Screenshots, No Plugins, No Backend

#opensource #agents #llm #typescript

Introduction

"One script tag. Natural language controls your web app. No screenshots, no browser extensions, no backend, no Python."

This is article #107 in the "One Open Source Project a Day" series. Today's project is page-agent — Alibaba's open-source client-side GUI Agent library.

The dominant approach to making AI operate a browser follows one path: screenshot → multimodal LLM recognizes elements → execute action. That path has two costs: multimodal models (expensive) and a server or headless browser environment (complex infrastructure).

page-agent's answer: serializing the DOM to text is enough. Assign numeric indices to interactive elements, send the text to any LLM, get back "click element 3" as a tool call, execute it directly in the browser. The entire loop stays in-page. No screenshots. No multimodal capability needed.

What You'll Learn

Text-based DOM: how to turn a page into an LLM-readable indexed structure
ReAct loop architecture: Observe → Think (reflection + action) → Act, fully implemented in TypeScript
Reflection-Before-Action model: evaluating the previous step before planning the next
Built-in tool system: how click, input, scroll, and JS execution tools are defined
Single package vs. core package: page-agent (with UI) and @page-agent/core (pure logic)
Chrome extension + MCP Server for cross-page capability

Prerequisites

Understanding of LLM function calling (tool use) mechanics
Basic knowledge of DOM and browser events
Experience with OpenAI SDK or similar APIs

Project Background

Overview

page-agent is a pure client-side GUI Agent library that embeds LLM reasoning directly into web pages. It understands page structure through serialized DOM text and executes actions without leaving the browser.

The core architectural decision: DOM text instead of screenshots. The DOM already fully describes what elements exist on the page, what type they are, and whether they're currently interactive. Serializing this to text is more precise than a screenshot (button labels are never blurry), cheaper (no multimodal models), and faster (DOM reads are synchronous).

The project acknowledges browser-use (server-side Python browser automation) as its inspiration. page-agent positions itself as the client-side counterpart: it runs inside the page rather than controlling a headless browser from a server.

Author / Team

Organization: Alibaba
Primary Language: TypeScript
License: MIT
npm packages: page-agent (with UI Panel), @page-agent/core (pure agent logic)
Latest version: v1.10.0

Project Stats

📄 License: MIT
📦 npm: page-agent
💻 Stack: TypeScript + npm workspaces + Vite

Features

What It Does

Traditional browser GUI Agent (screenshot path):
Page → Screenshot → Multimodal LLM visual understanding → Return coordinates/elements → Execute
Cost: multimodal API fees + server/headless browser infrastructure

page-agent (text DOM path):
Page → Serialize DOM with indexed interactive elements → Pure text LLM reasoning
    → Return tool call: { click_element_by_index: { index: 2 } }
    → Execute directly in the page
Cost: text LLM (cheaper) + zero backend (pure frontend JS)

Use Cases

SaaS AI Copilot: Embed an AI assistant directly in your product — user says "create a new project named X" and it happens. No backend rewrite needed.
Smart form filling: Compress 20-click workflows in ERP/CRM/admin systems into a single sentence
Accessibility: Add natural language control to any web application — voice commands, screen reader enhancement
Cross-tab Agent: With the Chrome extension, agents can work across tabs (e.g., read data from a spreadsheet tab, fill it into a form in another tab)
MCP control: Through the MCP Server (Beta), external agent clients can control the browser

Quick Start

One-line integration (free Demo LLM, technical evaluation only):

<script src="https://cdn.jsdelivr.net/npm/page-agent@1.10.0/dist/iife/page-agent.demo.js" crossorigin="true"></script>

After loading, a floating Agent panel appears in the bottom-right corner. Type natural language commands directly.

NPM installation (production):

npm install page-agent

import { PageAgent } from 'page-agent'

const agent = new PageAgent({
    model: 'gpt-4o',
    baseURL: 'https://api.openai.com/v1',
    apiKey: 'YOUR_API_KEY',
    language: 'en-US',
})

const result = await agent.execute('Click the login button, then fill in the username as test@example.com')
console.log(result.success, result.message)

Using @page-agent/core (no UI panel, embedded scenarios):

import { PageAgentCore } from '@page-agent/core'
import { PageController } from '@page-agent/page-controller'

const pageController = new PageController({ enableMask: true })
const agent = new PageAgentCore({
    pageController,
    model: 'gpt-4o',
    apiKey: 'sk-...',
    maxSteps: 20,
    onAfterStep: async (agent, history) => {
        console.log('step completed', history.at(-1))
    }
})

const result = await agent.execute('Find the lowest-priced item in the product list and add it to the cart')

Supported models (any OpenAI-compatible endpoint works):

Provider	Models
OpenAI	gpt-4o, gpt-4-turbo, gpt-5.2, gpt-5.4
Anthropic	claude-opus-4.8, claude-sonnet-4, claude-haiku-3.5
Alibaba	qwen3.5-plus, qwen3.6-max, qwen3.6-flash
DeepSeek	deepseek-chat, deepseek-reasoner
Google	gemini-2.0-flash (via OpenAI-compatible endpoint)
Local	Ollama (qwen3:14b, tested on RTX 3090 24GB)

Deep Dive

Text-Based DOM: The Core Technique

page-agent's PageController converts the DOM into an indexed simplified text structure:

Raw DOM (simplified):
<div class="form-container">
  <input type="text" placeholder="Username" />
  <input type="password" placeholder="Password" />
  <button class="btn-primary">Sign In</button>
  <a href="/register">Create account</a>
</div>

Serialized for LLM (interactive elements with indices):
URL: https://example.com/login
Title: Login - My App

[0]<input placeholder="Username" />
[1]<input type="password" placeholder="Password" />
[2]<button>Sign In</button>
[3]<a>Create account</a>

Each interactive element gets a numeric index [N]. The LLM only needs to return "click [2]" or "input 'admin@example.com' into [0]."
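The serialized format above can be reproduced with a few lines of string building. The following is a minimal sketch, not page-agent's actual serializer: the `InteractiveNode` shape and the `renderNode` / `serializePage` helpers are hypothetical, and the input is a plain array rather than a live DOM.

```typescript
// Hypothetical, simplified node shape; the real library works on live DOM nodes.
interface InteractiveNode {
  tag: string;
  attrs: Record<string, string>;
  text?: string;
}

// Render one element as `[index]<tag attr="value">text</tag>`.
function renderNode(node: InteractiveNode, index: number): string {
  const attrs = Object.entries(node.attrs)
    .map(([name, value]) => ` ${name}="${value}"`)
    .join("");
  return node.text
    ? `[${index}]<${node.tag}${attrs}>${node.text}</${node.tag}>`
    : `[${index}]<${node.tag}${attrs} />`;
}

// Build the text snapshot: page header, blank line, then one line per element.
function serializePage(url: string, title: string, nodes: InteractiveNode[]): string {
  const lines = nodes.map((node, i) => renderNode(node, i));
  return [`URL: ${url}`, `Title: ${title}`, "", ...lines].join("\n");
}

const snapshot = serializePage("https://example.com/login", "Login - My App", [
  { tag: "input", attrs: { placeholder: "Username" } },
  { tag: "input", attrs: { type: "password", placeholder: "Password" } },
  { tag: "button", attrs: {}, text: "Sign In" },
  { tag: "a", attrs: {}, text: "Create account" },
]);
console.log(snapshot);
```

The point of the sketch is how little the model needs: a URL, a title, and one short line per actionable element.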

DOM processing pipeline:

Live DOM
    ↓ dom_tree/ module
FlatDomTree (flattened tree with DomNode map)
    ↓ Dehydration (simplification)
Indexed text representation
    ↓
LLM context input
    ↓
LLM returns tool call: { click_element_by_index: { index: 2 } }
    ↓
PageController.clickElement(2) → locate HTMLElement by index → fire click

Only elements meeting all three criteria are indexed (to reduce noise):

isVisible: true (the element is in the viewport or can be scrolled into it)
isInteractive: true (the element can be clicked, typed into, or selected)
isTopElement: true (the element is not obscured by an overlapping element)
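The three-flag filter and the index assignment can be sketched as follows. This is an illustration under assumptions: `FlatNode`, `buildIndexMap`, and `resolveIndex` are hypothetical names, and the flags are supplied as plain booleans instead of being computed from layout.

```typescript
// Hypothetical flattened node; the flag names follow the three criteria above.
interface FlatNode {
  id: string;
  isVisible: boolean;
  isInteractive: boolean;
  isTopElement: boolean;
}

// Keep only nodes that pass all three checks and give them sequential indices.
function buildIndexMap(nodes: FlatNode[]): Map<number, FlatNode> {
  const indexed = new Map<number, FlatNode>();
  let nextIndex = 0;
  for (const node of nodes) {
    if (node.isVisible && node.isInteractive && node.isTopElement) {
      indexed.set(nextIndex, node);
      nextIndex += 1;
    }
  }
  return indexed;
}

// Resolve an index from an LLM tool call back to a node; fail loudly if it is unknown.
function resolveIndex(map: Map<number, FlatNode>, index: number): FlatNode {
  const node = map.get(index);
  if (!node) throw new Error(`No interactive element at index ${index}`);
  return node;
}

const sampleNodes: FlatNode[] = [
  { id: "username", isVisible: true, isInteractive: true, isTopElement: true },
  { id: "hidden-token", isVisible: false, isInteractive: true, isTopElement: true },
  { id: "covered-button", isVisible: true, isInteractive: true, isTopElement: false },
  { id: "heading", isVisible: true, isInteractive: false, isTopElement: true },
  { id: "sign-in", isVisible: true, isInteractive: true, isTopElement: true },
];
const indexMap = buildIndexMap(sampleNodes);
console.log(resolveIndex(indexMap, 1).id);
```

Because filtered-out nodes never receive an index, the model cannot address them at all, which is a simple way to keep it from clicking hidden or covered elements.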

ReAct Loop Architecture

agent.execute("complete some task")
    ↓
┌─────────────────────────────────────────────┐
│         Main loop (up to maxSteps)           │
│                                             │
│  ① Observe                                  │
│     pageController.updateTree()             │
│     → refresh DOM, get current page text    │
│                                             │
│  ② Think (LLM call)                         │
│     Input: system prompt + history + DOM    │
│     LLM output (Reflection model):          │
│     {                                       │
│       evaluation_previous_goal: "...",  ←── how did the last step go?
│       memory: "...",                    ←── what to remember?
│       next_goal: "...",                 ←── what to do next?
│       action: { click_element_by_index: {index: 2} }
│     }                                       │
│                                             │
│  ③ Act                                      │
│     Execute the tool call returned by LLM   │
│     Record to history (persistent, visible  │
│     to LLM in next steps)                   │
│                                             │
│  ④ Check termination                        │
│     LLM calls done tool → return result     │
│     OR maxSteps reached                     │
└─────────────────────────────────────────────┘
    ↓
ExecutionResult { success, message, history }
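The loop in the diagram can be condensed into a small driver function. This is a sketch, not the library's code: `runLoop`, `LoopDeps`, and the shape of the `done` arguments (`success`, `text`) are assumptions made for illustration, and the observe/think/act steps are injected so the loop can run without a browser or a real LLM.

```typescript
interface StepOutput {
  evaluation_previous_goal: string;
  memory: string;
  next_goal: string;
  action: Record<string, unknown>;
}

interface HistoryEntry {
  step: number;
  output: StepOutput;
  result: string;
}

interface LoopResult {
  success: boolean;
  message: string;
  history: HistoryEntry[];
}

// Injected dependencies (hypothetical): page snapshot, LLM call, tool execution.
interface LoopDeps {
  observe: () => Promise<string>;
  think: (task: string, history: HistoryEntry[], pageText: string) => Promise<StepOutput>;
  act: (action: Record<string, unknown>) => Promise<string>;
}

async function runLoop(task: string, deps: LoopDeps, maxSteps = 20): Promise<LoopResult> {
  const history: HistoryEntry[] = [];
  for (let step = 1; step <= maxSteps; step++) {
    // 1. Observe: refresh the text snapshot of the page.
    const pageText = await deps.observe();
    // 2. Think: the model sees the task, prior steps, and the current snapshot.
    const output = await deps.think(task, history, pageText);
    // 4. Check termination: a `done` action ends the loop (argument shape is assumed).
    const done = output.action["done"] as { success?: boolean; text?: string } | undefined;
    if (done) {
      history.push({ step, output, result: "done" });
      return { success: done.success ?? true, message: done.text ?? "", history };
    }
    // 3. Act: execute the chosen tool and record the outcome for the next step.
    const result = await deps.act(output.action);
    history.push({ step, output, result });
  }
  return { success: false, message: `Stopped after ${maxSteps} steps`, history };
}
```

Passing a scripted `think` function that returns a click on the first call and `done` on the second is enough to exercise the loop end to end in a unit test.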

Reflection-Before-Action Model

Before each LLM call, the previous step's result and full history are passed to the model, which is required to reflect before acting:

{
  "evaluation_previous_goal": "Successfully clicked the login button, page navigated to the dashboard",
  "memory": "Login complete, currently on the dashboard. Still need to find the settings page.",
  "next_goal": "Locate and click the Settings or Account option in the navigation bar",
  "action": {
    "click_element_by_index": { "index": 5 }
  }
}

evaluation_previous_goal forces the model to evaluate the previous step's outcome before proceeding — preventing blind continuation (e.g., if a click triggered nothing, the model should pivot rather than click again).

memory is the short-term memory mechanism: compressing key progress into 1-3 sentences that persist in history, so the LLM doesn't "forget" what it has already accomplished in a long multi-step task.
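One way to picture how these fields persist is to render past steps back into prompt text. The layout below is an assumption for illustration only (the source does not show the exact prompt format), and `StepRecord` and `renderHistory` are hypothetical names.

```typescript
// One completed step: the model's reflection plus the outcome of its action.
interface StepRecord {
  evaluation_previous_goal: string;
  memory: string;
  next_goal: string;
  actionResult: string;
}

// Turn past steps into a plain-text block that can be appended to the next prompt.
function renderHistory(steps: StepRecord[]): string {
  return steps
    .map((step, i) =>
      [
        `Step ${i + 1}`,
        `Evaluation: ${step.evaluation_previous_goal}`,
        `Memory: ${step.memory}`,
        `Next goal: ${step.next_goal}`,
        `Action result: ${step.actionResult}`,
      ].join("\n"),
    )
    .join("\n\n");
}

const historyText = renderHistory([
  {
    evaluation_previous_goal: "Successfully clicked the login button",
    memory: "Login complete, currently on the dashboard.",
    next_goal: "Open the settings page",
    actionResult: "Clicked element 5",
  },
]);
console.log(historyText);
```

Since the memory line travels with every later request, the model can rely on it instead of re-deriving progress from the current DOM snapshot.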

Built-in Tool System

Tools the LLM can call via function calling:

Tool	Purpose
`click_element_by_index`	Click element at specified index
`input_text`	Type text into input field at index
`select_dropdown_option`	Select dropdown option by text content
`scroll`	Vertical scroll (by pages or pixels)
`scroll_horizontally`	Horizontal scroll
`execute_javascript`	Execute arbitrary JS (AbortSignal supported)
`wait`	Wait 1-10 seconds (for page loads or animations)
`ask_user`	Ask the user a question (human-in-the-loop node)
`done`	Task complete, return result

Custom tools (extend or override defaults):

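import { z } from 'zod' // the example's z.object() is assumed to come from zod
import { PageAgent } from 'page-agent'

// fetchCurrentUser is a placeholder for a function in your own application
const agent = new PageAgent({
    model: 'gpt-4o',
    apiKey: '...',
    customTools: {
        // Add a custom tool
        get_current_user: {
            description: 'Get the currently logged-in user information',
            inputSchema: z.object({}),
            execute: async function() {
                const user = await fetchCurrentUser()
                // Tool results are returned to the LLM as text
                return JSON.stringify(user)
            }
        },
        // Set to null to disable a built-in tool
        execute_javascript: null
    }
})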
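The "extend, override, or set to null to disable" behaviour described above can be sketched as a simple merge. This is an assumption about how such a merge could work, not the library's implementation; `ToolDef`, `ToolOverrides`, and `mergeTools` are hypothetical names.

```typescript
interface ToolDef {
  description: string;
  execute: (input: unknown) => Promise<string>;
}

// A custom entry either defines a tool or is null to disable a built-in one.
type ToolOverrides = Record<string, ToolDef | null>;

function mergeTools(builtIn: Record<string, ToolDef>, custom: ToolOverrides): Record<string, ToolDef> {
  const merged: Record<string, ToolDef> = { ...builtIn };
  for (const [name, tool] of Object.entries(custom)) {
    if (tool === null) {
      delete merged[name]; // null removes the built-in tool
    } else {
      merged[name] = tool; // same name overrides, new name extends
    }
  }
  return merged;
}

const builtInTools: Record<string, ToolDef> = {
  wait: { description: "Wait a few seconds", execute: async () => "waited" },
  execute_javascript: { description: "Run arbitrary JS", execute: async () => "ran" },
};

const activeTools = mergeTools(builtInTools, {
  get_current_user: { description: "Get the current user", execute: async () => "{}" },
  execute_javascript: null,
});
console.log(Object.keys(activeTools).sort());
```

Disabling `execute_javascript` this way means the tool never appears in the schema sent to the model, so the model cannot request it.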

Monorepo Architecture

packages/
├── page-agent/        → Main package (npm: page-agent), includes UI panel
├── core/              → Pure agent logic (npm: @page-agent/core), no UI
├── llms/              → LLM client, multi-provider support
├── page-controller/   → DOM operations and visual feedback
├── ui/                → Control panel + i18n, decoupled from core
├── extension/         → Chrome extension (WXT + React)
└── website/           → Documentation site (React)

Key module boundary design:

page-controller has no knowledge of LLMs — only handles DOM operations
llms has no knowledge of page structure — only handles LLM communication
core combines them into the ReAct loop
page-agent (main package) adds a UI panel on top of core — both are independently usable

How the LLM Client Works

The llms package uses a MacroTool pattern that wraps the reflection model:

// Each LLM call expects this structure back
export interface MacroToolInput extends Partial<AgentReflection> {
    action: Record<string, any>
}

export interface AgentReflection {
    evaluation_previous_goal: string  // "Previous action succeeded/failed because..."
    memory: string                    // "Key facts to remember: ..."
    next_goal: string                 // "Next I will..."
}
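To make the MacroTool idea concrete, here is a hedged sketch of how all tools could be wrapped into one function definition in the OpenAI chat-completions tool format, with the reflection fields sitting next to `action`. The function name `agent_output`, the helpers `buildMacroTool`, `buildRequestBody`, and `parseMacroArguments`, and the exact JSON Schema are assumptions; only the `tools`, `tool_choice`, and `parallel_tool_calls` request fields are standard OpenAI parameters.

```typescript
// Minimal tool description; `parameters` is a JSON Schema object.
interface ToolSpec {
  name: string;
  description: string;
  parameters: Record<string, unknown>;
}

// Hypothetical name for the single wrapping function; the real library may differ.
const MACRO_TOOL_NAME = "agent_output";

// Wrap every tool as one optional property of `action`, next to the reflection fields.
function buildMacroTool(tools: ToolSpec[]) {
  const actionProperties: Record<string, unknown> = {};
  for (const tool of tools) {
    actionProperties[tool.name] = { ...tool.parameters, description: tool.description };
  }
  return {
    type: "function" as const,
    function: {
      name: MACRO_TOOL_NAME,
      description: "Reflect on the previous step, then choose exactly one action.",
      parameters: {
        type: "object",
        properties: {
          evaluation_previous_goal: { type: "string" },
          memory: { type: "string" },
          next_goal: { type: "string" },
          action: { type: "object", properties: actionProperties, minProperties: 1, maxProperties: 1 },
        },
        required: ["action"],
      },
    },
  };
}

// Build a request body that forces the macro tool and allows one tool call per step.
function buildRequestBody(model: string, messages: { role: string; content: string }[], tools: ToolSpec[]) {
  return {
    model,
    messages,
    tools: [buildMacroTool(tools)],
    tool_choice: { type: "function" as const, function: { name: MACRO_TOOL_NAME } },
    parallel_tool_calls: false,
  };
}

// Parse the tool-call `arguments` string and extract the single chosen action.
function parseMacroArguments(rawArguments: string): { toolName: string; input: unknown } {
  const parsed = JSON.parse(rawArguments) as { action?: Record<string, unknown> };
  const entries = Object.entries(parsed.action ?? {});
  const first = entries[0];
  if (entries.length !== 1 || first === undefined) {
    throw new Error(`Expected exactly one action, got ${entries.length}`);
  }
  const [toolName, input] = first;
  return { toolName, input };
}
```

Forcing a single function call means the reflection and the action arrive together in one structured response, so the model cannot return an action without the accompanying fields being part of the same schema.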

The OpenAI-compatible client handles:

Converting tools to OpenAI function-calling format
Building requests with parallel_tool_calls: false (one action per step)
Provider-specific patches (DeepSeek: disable explicit tool_choice; MiniMax: temperature/tool-call compatibility)
Automatic retry with exponential backoff for 429/500 errors
Token usage tracking including prompt cache hits and reasoning tokens
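The retry behaviour in the list above can be sketched as a generic wrapper. This is a minimal illustration, not the client's actual code: `withRetry` and `statusOf` are hypothetical names, the default of three retries is an assumption, and errors are assumed to carry a numeric `status` property.

```typescript
// Status codes treated as transient, per the list above.
const RETRYABLE_STATUSES = new Set([429, 500]);

// Read a numeric `status` from an unknown error value, if present.
function statusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry `call` on transient failures, doubling the delay after each attempt.
async function withRetry<T>(call: () => Promise<T>, maxRetries = 3, baseDelayMs = 100): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const status = statusOf(err);
      const retryable = status !== undefined && RETRYABLE_STATUSES.has(status);
      if (!retryable || attempt >= maxRetries) throw err;
      // With a 100ms base: 100ms, 200ms, 400ms, ...
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```

A call such as `withRetry(() => sendChatRequest(body))` (where `sendChatRequest` stands in for your own request function) would then survive a short burst of rate limiting, while a 400 or 401 still fails immediately.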

Cross-Page Capability: Chrome Extension + MCP

Chrome Extension (for cross-tab workflows):

The agent runs in an extension-controlled context that can switch tabs and read each page's DOM
Use case: "Read data from a spreadsheet tab, paste it into a form in another tab"

MCP Server (Beta):

Exposes page-agent as MCP tools, letting external agent clients (Claude Desktop, Claude Code) control the browser remotely
Use case: connecting browser control capability into a larger agent workflow

Reliability Design

Step delay: 400ms between steps — breathing room for page renders and network requests
Click wait: 200ms after click operations — ensures DOM updates complete
Concurrency guard: Prevents concurrent execute() calls to avoid race conditions
AbortSignal support: All tool execution honors cancellation signals; execute_javascript is interruptible
Automatic retry: LLM failures (429 rate limits, 500 server errors) auto-retry with exponential backoff (100ms base)
Token tracking: Every step records promptTokens, completionTokens, prompt cache hits, and reasoning tokens
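Two of these safeguards, the concurrency guard and cancellable waits, can be sketched in a few lines. This is an illustration under assumptions rather than the library's implementation: `createGuardedExecutor` and `delay` are hypothetical helpers, and only the standard `AbortController` / `AbortSignal` APIs are relied on.

```typescript
// Wrap a task runner so that a second call is rejected while one is in flight.
function createGuardedExecutor<T>(runTask: (task: string, signal: AbortSignal) => Promise<T>) {
  let running = false;
  return async function execute(task: string, signal: AbortSignal): Promise<T> {
    if (running) throw new Error("execute() is already running");
    if (signal.aborted) throw new Error("Aborted before start");
    running = true;
    try {
      return await runTask(task, signal);
    } finally {
      running = false; // always release the guard, even on failure
    }
  };
}

// A pause (e.g. the delay between steps) that ends early when the signal aborts.
function delay(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("Aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(new Error("Aborted"));
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}
```

In use, the caller creates one `AbortController` per task, passes `controller.signal` to `execute`, and calls `controller.abort()` from a stop button; any pending `delay` then rejects instead of waiting out its timer.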

Resources

Official Links

🌟 GitHub: alibaba/page-agent
🚀 Live Demo: alibaba.github.io/page-agent
📖 Documentation: alibaba.github.io/page-agent/docs
📦 npm: page-agent

Summary

page-agent's core insight is that browser automation doesn't need to "look at" the page. The DOM is already a structured description of what exists on the page — serializing it to indexed text lets any text-only LLM understand and operate it directly. No multimodal model, no visual understanding, no coordinate targeting.

This lowers the barrier for GUI agents in concrete terms: no server, no headless browser, no screenshots, and no extra infrastructure. A single script tag or npm package running in the user's browser is enough.

The Reflection-Before-Action model (evaluate last step → plan next step) combined with persistent memory in history is what lets multi-step tasks stay on track. The agent doesn't just chain commands blindly; it continuously re-evaluates whether its actions are having the intended effect.

For developers who want to add AI Copilot capability to a web product, or automate internal tool workflows without touching backend infrastructure, page-agent is one of the lowest-friction options currently available.
