DEV Community

refaat Al Ktifan

Posted on • Originally published at oronts.com

AI Browser Automation Without BrowserBase: What We Built Instead

Oronts Automation

The Paid Browser Automation Market

BrowserBase, Browserless, and similar services charge per-minute or per-session for managed headless browsers. For AI workflows that need to interact with web pages (filling forms, extracting structured data, navigating multi-step processes), these services handle the infrastructure: browser instances, anti-detection, proxies, and session management.

The pricing adds up fast. At $0.10-0.50 per session-minute, a workflow that processes 1,000 pages per day at 2 minutes each costs $200-1,000 per day. For an AI system that runs continuously, that's $6,000-30,000 per month just for browser infrastructure.
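The arithmetic behind those figures is simple enough to keep as a function; this sketch uses the per-minute rates and volumes quoted above (illustrative ranges, not vendor quotes):

```typescript
// Monthly browser-infrastructure cost at a given per-minute session rate.
// Rates and volumes are illustrative, matching the ranges quoted above.
function monthlyCostUSD(pagesPerDay: number, minutesPerPage: number, ratePerMinute: number): number {
    const dailyMinutes = pagesPerDay * minutesPerPage;
    return dailyMinutes * ratePerMinute * 30; // ~30 billing days per month
}

// 1,000 pages/day at 2 minutes each:
console.log(monthlyCostUSD(1000, 2, 0.10)); // low end:  $6,000/month
console.log(monthlyCostUSD(1000, 2, 0.50)); // high end: $30,000/month
```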

We built a self-hosted alternative using Playwright plus an LLM for page understanding. It handles roughly 90% of our use cases at a fraction of the cost. This article covers the architecture; our guides on AI workflow systems and agentic AI cover the higher-level patterns.

The Architecture

┌──────────────────────────────────────────────────────────┐
│                   AI Browser Engine                      │
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │  Task Queue  │  │  Instance    │  │  Session     │    │
│  │  (BullMQ)    │  │  Pool        │  │  Manager     │    │
│  │              │  │  (Playwright │  │  (cookies,   │    │
│  │  Prioritized │  │   browsers)  │  │  localStorage│    │
│  │  Retry logic │  │              │  │  auth state) │    │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘    │
│         │                 │                 │            │
│         ▼                 ▼                 ▼            │
│  ┌────────────────────────────────────────────────────┐  │
│  │              Page Interaction Layer                │  │
│  │                                                    │  │
│  │  1. Navigate to URL                                │  │
│  │  2. Wait for page load                             │  │
│  │  3. Extract page structure (accessibility tree)    │  │
│  │  4. Send structure to LLM for understanding        │  │
│  │  5. LLM returns action plan (click, type, select)  │  │
│  │  6. Execute actions via Playwright                 │  │
│  │  7. Extract structured data from result            │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
└──────────────────────────────────────────────────────────┘

Instance Pooling

Running a new browser for every task is expensive (cold start: 1-3 seconds, memory: 200-400MB per instance). A pool reuses browser instances across tasks.

import { chromium, Browser } from 'playwright';

class BrowserPool {
    private available: Browser[] = [];
    private inUse = new Map<string, Browser>();
    private waitQueue: Array<(value: { browser: Browser; id: string }) => void> = [];
    private maxInstances: number;

    constructor(options: { maxInstances: number }) {
        this.maxInstances = options.maxInstances;
    }

    async acquire(): Promise<{ browser: Browser; id: string }> {
        // Reuse an available instance
        if (this.available.length > 0) {
            const browser = this.available.pop()!;
            const id = crypto.randomUUID();
            this.inUse.set(id, browser);
            return { browser, id };
        }

        // Create new if under limit
        if (this.inUse.size < this.maxInstances) {
            const browser = await chromium.launch({
                headless: true,
                args: [
                    '--no-sandbox',
                    '--disable-setuid-sandbox',
                    '--disable-dev-shm-usage',
                    '--disable-gpu',
                    '--single-process',
                ],
            });
            const id = crypto.randomUUID();
            this.inUse.set(id, browser);
            return { browser, id };
        }

        // Pool exhausted: wait for one to be released
        return new Promise((resolve) => {
            this.waitQueue.push(resolve);
        });
    }

    async release(id: string): Promise<void> {
        const browser = this.inUse.get(id);
        if (!browser) return;

        this.inUse.delete(id);

        // Clear state between tasks
        const pages = browser.contexts();
        for (const context of pages) {
            await context.close();
        }

        // If someone is waiting, give them this instance
        if (this.waitQueue.length > 0) {
            const resolve = this.waitQueue.shift()!;
            const newId = crypto.randomUUID();
            this.inUse.set(newId, browser);
            resolve({ browser, id: newId });
        } else {
            this.available.push(browser);
        }
    }
}

Pool Sizing

Workload                      Pool Size        Memory Required
Light (< 100 pages/hour)      2-3 instances    1-2 GB
Medium (100-500 pages/hour)   5-10 instances   3-5 GB
Heavy (500+ pages/hour)       10-20 instances  5-10 GB

Each Chromium instance uses 200-400MB of RAM. The pool size determines your throughput ceiling and memory requirements. Start small and scale based on actual load.
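The table's rule of thumb can be expressed directly. This sketch assumes each page occupies one browser instance for its full handling time (60 seconds is an illustrative average) and reuses the 200-400MB-per-instance figure from above:

```typescript
// Estimate pool size and memory from expected throughput.
// Assumes one page fully occupies one browser instance while it is processed.
function sizePool(pagesPerHour: number, avgSecondsPerPage: number) {
    const instances = Math.max(1, Math.ceil((pagesPerHour * avgSecondsPerPage) / 3600));
    return {
        instances,
        memoryMB: { min: instances * 200, max: instances * 400 }, // per-instance RAM range from above
    };
}

console.log(sizePool(100, 60)); // light load:  2 instances
console.log(sizePool(500, 60)); // medium load: 9 instances
```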

Session Management

Many workflows require maintaining login state across multiple page interactions. The session manager persists cookies, localStorage, and authentication tokens between tasks.

import { Browser, BrowserContext } from 'playwright';

type Cookies = Awaited<ReturnType<BrowserContext['cookies']>>;

interface SessionState {
    cookies: Cookies;
    localStorage: Record<string, string>;
    lastUsed: number;
}

interface SessionOptions {
    userAgent?: string;
    locale?: string;
    timezone?: string;
}

class SessionManager {
    private sessions = new Map<string, SessionState>();

    // The browser instance comes from the pool; the caller owns its lifecycle.
    async createSession(browser: Browser, id: string, options: SessionOptions): Promise<BrowserContext> {
        const context = await browser.newContext({
            viewport: { width: 1280, height: 720 },
            userAgent: options.userAgent || this.getRandomUserAgent(),
            locale: options.locale || 'en-US',
            timezoneId: options.timezone || 'Europe/Berlin',
        });

        // Restore previous session state if exists
        const existing = this.sessions.get(id);
        if (existing) {
            await context.addCookies(existing.cookies);
            // localStorage restored via page.evaluate after navigation
        }

        return context;
    }

    async saveSession(id: string, context: BrowserContext): Promise<void> {
        const cookies = await context.cookies();
        const pages = context.pages();
        let localStorage = {};

        if (pages.length > 0) {
            localStorage = await pages[0].evaluate(() => {
                const data: Record<string, string> = {};
                for (let i = 0; i < window.localStorage.length; i++) {
                    const key = window.localStorage.key(i);
                    if (key) data[key] = window.localStorage.getItem(key) || '';
                }
                return data;
            });
        }

        this.sessions.set(id, {
            cookies,
            localStorage,
            lastUsed: Date.now(),
        });
    }
}

LLM-Driven Page Understanding

The core innovation: instead of writing CSS selectors or XPath queries for every page, send the page's accessibility tree to an LLM and let it decide which elements to interact with.

async function extractPageStructure(page: Page): Promise<string> {
    // Get the accessibility tree (structured, compact representation)
    const tree = await page.accessibility.snapshot();

    // Convert to a text format the LLM can understand
    return formatAccessibilityTree(tree, {
        maxDepth: 5,
        includeRoles: ['button', 'link', 'textbox', 'combobox', 'checkbox', 'heading'],
        includeText: true,
        includeLabels: true,
    });
}

function formatAccessibilityTree(node: any, options: any, depth = 0): string {
    if (depth > options.maxDepth) return '';
    if (!options.includeRoles.includes(node.role) && depth > 1) {
        // Skip non-interactive elements, but recurse into children
        return (node.children || []).map(c => formatAccessibilityTree(c, options, depth + 1)).join('');
    }

    const indent = '  '.repeat(depth);
    let result = `${indent}[${node.role}] ${node.name || ''}`;
    if (node.value) result += ` value="${node.value}"`;
    result += '\n';

    for (const child of node.children || []) {
        result += formatAccessibilityTree(child, options, depth + 1);
    }
    return result;
}
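Because the formatter is a pure function, it can be sanity-checked without launching a browser. The snippet below runs a trimmed standalone copy of formatAccessibilityTree on a hand-written sample tree (the tree contents are made up for illustration):

```typescript
// Trimmed standalone copy of the formatter above, exercised on a sample tree.
interface AXNode { role: string; name?: string; value?: string; children?: AXNode[] }

const options = { maxDepth: 5, includeRoles: ['button', 'link', 'textbox', 'heading'] };

function formatTree(node: AXNode, depth = 0): string {
    if (depth > options.maxDepth) return '';
    if (!options.includeRoles.includes(node.role) && depth > 1) {
        // Skip non-interactive elements, but recurse into children
        return (node.children || []).map((c) => formatTree(c, depth + 1)).join('');
    }
    const indent = '  '.repeat(depth);
    let result = `${indent}[${node.role}] ${node.name || ''}`;
    if (node.value) result += ` value="${node.value}"`;
    result += '\n';
    for (const child of node.children || []) result += formatTree(child, depth + 1);
    return result;
}

const sample: AXNode = {
    role: 'WebArea', name: 'Contact', children: [
        { role: 'heading', name: 'Contact us' },
        { role: 'textbox', name: 'Email' },
        { role: 'button', name: 'Send' },
    ],
};

console.log(formatTree(sample));
// [WebArea] Contact
//   [heading] Contact us
//   [textbox] Email
//   [button] Send
```

This compact text form is what the LLM sees: a few hundred tokens for a page whose raw HTML might be tens of thousands.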

LLM Action Planning

Send the page structure to the LLM with the task description. The LLM returns a sequence of actions:

async function planActions(pageStructure: string, task: string): Promise<Action[]> {
    const response = await llm.generate({
        model: 'gpt-4o-mini', // Fast model for action planning
        messages: [
            {
                role: 'system',
                content: `You are a browser automation assistant. Given a page structure and a task,
                return a JSON array of actions to accomplish the task.
                Available actions: click(selector), type(selector, text), select(selector, value),
                wait(ms), extract(selector).
                Use the element text/labels to identify targets, not CSS selectors.`,
            },
            {
                role: 'user',
                content: `Page structure:\n${pageStructure}\n\nTask: ${task}`,
            },
        ],
        responseFormat: 'json',
    });

    return JSON.parse(response.text);
}

// Example task: "Fill in the contact form with name Sara Mustermann and email sara.mustermann@beispiel.de"
// LLM returns:
// [
//   { "action": "type", "target": "Name input field", "value": "Sara Mustermann" },
//   { "action": "type", "target": "Email input field", "value": "sara.mustermann@beispiel.de" },
//   { "action": "click", "target": "Submit button" }
// ]
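One caveat: LLM output is untrusted input, and JSON.parse succeeding does not mean the array has the shape the executor expects. A minimal validator (schema matching the example plan above; the helper name is ours) catches malformed plans before they reach the browser:

```typescript
// Validate a parsed LLM response against the action schema shown above.
interface PlannedAction { action: string; target?: string; value?: string | number }

const KNOWN_ACTIONS = new Set(['click', 'type', 'select', 'wait', 'extract']);

function validateActions(parsed: unknown): PlannedAction[] {
    if (!Array.isArray(parsed)) throw new Error('LLM response is not an array');
    return parsed.map((item, i) => {
        if (typeof item !== 'object' || item === null) throw new Error(`Action ${i} is not an object`);
        const a = item as PlannedAction;
        if (!KNOWN_ACTIONS.has(a.action)) throw new Error(`Action ${i}: unknown action "${a.action}"`);
        if (a.action !== 'wait' && typeof a.target !== 'string') throw new Error(`Action ${i}: missing target`);
        return a;
    });
}

// A valid plan passes through; a malformed one throws before touching the browser.
validateActions([{ action: 'type', target: 'Email input field', value: 'sara.mustermann@beispiel.de' }]);
```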

Resolving LLM Actions to Playwright Commands

The LLM returns human-readable targets ("Name input field"). A resolver maps them to Playwright selectors:

import { Page } from 'playwright';

interface Action {
    action: 'click' | 'type' | 'select' | 'wait' | 'extract';
    target: string;
    value?: string;
}

interface ExtractedField {
    field: string;
    value: string | null;
}

async function resolveAndExecute(page: Page, actions: Action[]): Promise<ExtractedField[]> {
    const results: ExtractedField[] = [];

    for (const action of actions) {
        // 'wait' needs no target element
        if (action.action === 'wait') {
            await page.waitForTimeout(Number(action.value));
            continue;
        }

        // Find the element matching the LLM's description
        const element = await findElementByDescription(page, action.target);

        if (!element) {
            throw new Error(`Could not find element: ${action.target}`);
        }

        switch (action.action) {
            case 'click':
                await element.click();
                await page.waitForLoadState('networkidle');
                break;
            case 'type':
                await element.fill(action.value ?? '');
                break;
            case 'select':
                await element.selectOption(action.value ?? '');
                break;
            case 'extract': {
                const text = await element.textContent();
                results.push({ field: action.target, value: text });
                break;
            }
        }
    }

    return results;
}

async function findElementByDescription(page: Page, description: string): Promise<ElementHandle | null> {
    // Try multiple strategies to find the element
    const strategies = [
        // By aria-label
        () => page.$(`[aria-label*="${description}" i]`),
        // By placeholder
        () => page.$(`[placeholder*="${description}" i]`),
        // By visible text
        () => page.$(`text=${description}`),
        // By label association
        () => page.$(`label:has-text("${description}") + input, label:has-text("${description}") input`),
        // By role and name
        () => page.getByRole('textbox', { name: new RegExp(description, 'i') }).first().elementHandle(),
        () => page.getByRole('button', { name: new RegExp(description, 'i') }).first().elementHandle(),
    ];

    for (const strategy of strategies) {
        try {
            const element = await strategy();
            if (element) return element;
        } catch {
            continue;
        }
    }

    return null;
}
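One subtle failure mode in the strategies above: the LLM's description goes straight into new RegExp, so a target like "Email (work)" builds a capture group instead of matching literal parentheses. Escaping metacharacters first is cheap insurance (the helper name is ours):

```typescript
// Escape regex metacharacters so an LLM-provided description matches literally.
function escapeRegExp(text: string): string {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const pattern = new RegExp(escapeRegExp('Email (work)'), 'i');
console.log(pattern.test('email (work)')); // true
console.log(pattern.test('Email work'));   // false
```

Use it in the role-based strategies: `page.getByRole('textbox', { name: new RegExp(escapeRegExp(description), 'i') })`.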

Anti-Detection Basics

Some websites detect and block headless browsers. Basic countermeasures:

const context = await browser.newContext({
    // Randomize viewport
    viewport: {
        width: 1280 + Math.floor(Math.random() * 200),
        height: 720 + Math.floor(Math.random() * 100),
    },

    // Rotate user agents
    userAgent: getRandomUserAgent(),

    // Set realistic locale and timezone
    locale: 'de-DE',
    timezoneId: 'Europe/Berlin',

    // Realistic geolocation
    geolocation: { latitude: 48.1351, longitude: 11.5820 },
    permissions: ['geolocation'],
});

// Override navigator.webdriver (headless detection)
await page.addInitScript(() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
});
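getRandomUserAgent is referenced above but not shown; a minimal version just picks from a curated list. These UA strings are examples — keep the list current, since a stale user agent is itself a detection signal:

```typescript
// Pick a user agent at random from a curated, periodically refreshed list.
const USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
];

function getRandomUserAgent(): string {
    return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}
```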

Note: anti-detection is an arms race. For sites with sophisticated bot detection (Cloudflare, Akamai), self-hosted Playwright will eventually be detected. This is where paid services like BrowserBase add value: they invest continuously in anti-detection. For most business automation tasks (internal tools, partner portals, public data), basic anti-detection is sufficient.

When Paid Tools ARE Worth It

Scenario                                Self-Hosted                                  Paid Service
Internal tool automation                Best choice (no anti-detection needed)       Overkill
Public data extraction (simple)         Good (basic anti-detection works)            Unnecessary
Sites with bot detection                Possible but constant maintenance            Worth it (they handle anti-detection)
High-volume scraping (10K+ pages/day)   Complex (proxy rotation, IP management)      Worth it (managed infrastructure)
Regulated data (GDPR, compliance)       Better (data stays on your infrastructure)   Risk (data goes through third party)
One-time migration                      Good (temporary workload)                    Unnecessary cost

The decision framework: if you're automating internal workflows or processing public data from sites without aggressive bot detection, self-host. If you're doing high-volume extraction from sites with Cloudflare-level protection, pay for a service that handles anti-detection as their core business.

Cost Comparison

Component                     Self-Hosted (monthly)     BrowserBase (monthly)
Compute (5 instances)         $50-100 (container/VPS)   N/A
LLM calls (action planning)   $20-50 (GPT-4o-mini)      N/A
BrowserBase sessions          N/A                       $500-2,000
Proxy service (if needed)     $50-200                   Included
Maintenance                   2-4 hours/month           None
Total (1,000 pages/day)       $120-350/month            $500-2,000/month
Total (10,000 pages/day)      $300-800/month            $3,000-10,000/month

Self-hosting is 3-10x cheaper at scale. The trade-off is maintenance time and anti-detection capability.

Common Pitfalls

  1. No instance pooling. Launching a new browser per task wastes 1-3 seconds on cold start and 200-400MB of RAM. Pool and reuse instances.

  2. Hardcoded CSS selectors. Pages change their DOM structure regularly. LLM-based element identification is more resilient than hardcoded selectors.

  3. No session persistence. Multi-step workflows that require login fail when the session state is lost between steps.

  4. Ignoring anti-detection entirely. Even basic measures (random viewport, user agent rotation, webdriver override) prevent detection on most sites.

  5. Using a large model for action planning. GPT-4o-mini or Claude Haiku are fast enough for page understanding. A large model adds latency without better accuracy for this task.

  6. No timeout on page loads. Some pages load indefinitely (infinite scrolling, slow third-party scripts). Set a navigation timeout and handle it.

  7. Running in production without monitoring. Track success rate, average execution time, and error types per workflow. Alert when success rate drops.
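Pitfall 7 doesn't require a metrics stack on day one: per-workflow in-process counters are enough to start alerting on success-rate drops. A minimal sketch (class name and alert threshold are illustrative choices):

```typescript
// Track per-workflow success rate and duration; flag drops below a threshold.
class WorkflowMetrics {
    private runs = 0;
    private failures = 0;
    private totalMs = 0;

    record(success: boolean, durationMs: number): void {
        this.runs++;
        this.totalMs += durationMs;
        if (!success) this.failures++;
    }

    successRate(): number {
        return this.runs === 0 ? 1 : (this.runs - this.failures) / this.runs;
    }

    avgDurationMs(): number {
        return this.runs === 0 ? 0 : this.totalMs / this.runs;
    }

    // Alert only once there is enough data to make the rate meaningful.
    shouldAlert(minSuccessRate = 0.9): boolean {
        return this.runs >= 10 && this.successRate() < minSuccessRate;
    }
}

const metrics = new WorkflowMetrics();
metrics.record(true, 1200);
metrics.record(false, 3400);
console.log(metrics.successRate()); // 0.5
```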

Key Takeaways

  • Self-hosted Playwright + LLM handles 90% of browser automation use cases. For internal tools, partner portals, and public data without aggressive bot detection, this is the right approach.

  • Instance pooling is essential. Reuse browser instances across tasks. Cold starts and memory allocation are the biggest performance bottleneck.

  • LLM page understanding replaces brittle selectors. Send the accessibility tree to a fast model. Let it decide which elements to interact with. More resilient to page changes than hardcoded CSS selectors.

  • Paid services earn their cost on anti-detection. If your target sites have Cloudflare or similar protection, BrowserBase invests continuously in bypassing it. That's their core business. Don't try to compete.

  • Self-hosting is 3-10x cheaper at scale. But you pay in maintenance time and anti-detection limitations. Make the trade-off consciously.

FIND MORE: https://oronts.com/en/guides/browser-automation-ai-without-paid-tools
