DEV Community

refaat Al Ktifan

Posted on • Originally published at oronts.com

AI Browser Automation Without BrowserBase: What We Built Instead

Oronts Automation

The Paid Browser Automation Market

BrowserBase, Browserless, and similar services charge per-minute or per-session for managed headless browsers. For AI workflows that need to interact with web pages (filling forms, extracting structured data, navigating multi-step processes), these services handle the infrastructure: browser instances, anti-detection, proxies, and session management.

The pricing adds up fast. At $0.10-0.50 per session-minute, a workflow that processes 1,000 pages per day at 2 minutes each costs $200-1,000 per day. For an AI system that runs continuously, that's $6,000-30,000 per month just for browser infrastructure.
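The arithmetic behind those figures is simple enough to keep as a function; this sketch uses the per-minute rates and volumes quoted above (illustrative ranges, not vendor quotes):

```typescript
// Monthly browser-infrastructure cost at a given per-minute session rate.
// Rates and volumes are illustrative, matching the ranges quoted above.
function monthlyCostUSD(pagesPerDay: number, minutesPerPage: number, ratePerMinute: number): number {
    const dailyMinutes = pagesPerDay * minutesPerPage;
    return dailyMinutes * ratePerMinute * 30; // ~30 billing days per month
}

// 1,000 pages/day at 2 minutes each:
console.log(monthlyCostUSD(1000, 2, 0.10)); // low end:  $6,000/month
console.log(monthlyCostUSD(1000, 2, 0.50)); // high end: $30,000/month
```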

We built a self-hosted alternative using Playwright plus an LLM for page understanding. It handles roughly 90% of our use cases at a fraction of the cost. This article covers the architecture; our guides on AI workflow systems and agentic AI cover the higher-level patterns.

The Architecture

┌──────────────────────────────────────────────────────────┐
│                   AI Browser Engine                      │
│                                                          │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │  Task Queue  │  │  Instance    │  │  Session     │    │
│  │  (BullMQ)    │  │  Pool        │  │  Manager     │    │
│  │              │  │  (Playwright │  │  (cookies,   │    │
│  │  Prioritized │  │   browsers)  │  │  localStorage│    │
│  │  Retry logic │  │              │  │  auth state) │    │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘    │
│         │                 │                 │            │
│         ▼                 ▼                 ▼            │
│  ┌────────────────────────────────────────────────────┐  │
│  │              Page Interaction Layer                │  │
│  │                                                    │  │
│  │  1. Navigate to URL                                │  │
│  │  2. Wait for page load                             │  │
│  │  3. Extract page structure (accessibility tree)    │  │
│  │  4. Send structure to LLM for understanding        │  │
│  │  5. LLM returns action plan (click, type, select)  │  │
│  │  6. Execute actions via Playwright                 │  │
│  │  7. Extract structured data from result            │  │
│  └────────────────────────────────────────────────────┘  │
│                                                          │
└──────────────────────────────────────────────────────────┘

Instance Pooling

Running a new browser for every task is expensive (cold start: 1-3 seconds, memory: 200-400MB per instance). A pool reuses browser instances across tasks.

import { chromium, Browser } from 'playwright';

class BrowserPool {
    private available: Browser[] = [];
    private inUse = new Map<string, Browser>();
    private waitQueue: Array<(value: { browser: Browser; id: string }) => void> = [];
    private maxInstances: number;

    constructor(options: { maxInstances: number }) {
        this.maxInstances = options.maxInstances;
    }

    async acquire(): Promise<{ browser: Browser; id: string }> {
        // Reuse an available instance
        if (this.available.length > 0) {
            const browser = this.available.pop()!;
            const id = crypto.randomUUID();
            this.inUse.set(id, browser);
            return { browser, id };
        }

        // Create new if under limit
        if (this.inUse.size < this.maxInstances) {
            const browser = await chromium.launch({
                headless: true,
                args: [
                    '--no-sandbox',
                    '--disable-setuid-sandbox',
                    '--disable-dev-shm-usage',
                    '--disable-gpu',
                    '--single-process',
                ],
            });
            const id = crypto.randomUUID();
            this.inUse.set(id, browser);
            return { browser, id };
        }

        // Pool exhausted: wait for one to be released
        return new Promise((resolve) => {
            this.waitQueue.push(resolve);
        });
    }

    async release(id: string): Promise<void> {
        const browser = this.inUse.get(id);
        if (!browser) return;

        this.inUse.delete(id);

        // Clear state between tasks
        const pages = browser.contexts();
        for (const context of pages) {
            await context.close();
        }

        // If someone is waiting, give them this instance
        if (this.waitQueue.length > 0) {
            const resolve = this.waitQueue.shift()!;
            const newId = crypto.randomUUID();
            this.inUse.set(newId, browser);
            resolve({ browser, id: newId });
        } else {
            this.available.push(browser);
        }
    }
}

Pool Sizing

Workload                      Pool Size        Memory Required
Light (< 100 pages/hour)      2-3 instances    1-2 GB
Medium (100-500 pages/hour)   5-10 instances   3-5 GB
Heavy (500+ pages/hour)       10-20 instances  5-10 GB

Each Chromium instance uses 200-400MB of RAM. The pool size determines your throughput ceiling and memory requirements. Start small and scale based on actual load.
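The table's rule of thumb can be expressed directly. This sketch assumes each page occupies one browser instance for its full handling time (60 seconds is an illustrative average) and reuses the 200-400MB-per-instance figure from above:

```typescript
// Estimate pool size and memory from expected throughput.
// Assumes one page fully occupies one browser instance while it is processed.
function sizePool(pagesPerHour: number, avgSecondsPerPage: number) {
    const instances = Math.max(1, Math.ceil((pagesPerHour * avgSecondsPerPage) / 3600));
    return {
        instances,
        memoryMB: { min: instances * 200, max: instances * 400 }, // per-instance RAM range from above
    };
}

console.log(sizePool(100, 60)); // light load:  2 instances
console.log(sizePool(500, 60)); // medium load: 9 instances
```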

Session Management

Many workflows require maintaining login state across multiple page interactions. The session manager persists cookies, localStorage, and authentication tokens between tasks.

import { Browser, BrowserContext } from 'playwright';

type Cookies = Awaited<ReturnType<BrowserContext['cookies']>>;

interface SessionState {
    cookies: Cookies;
    localStorage: Record<string, string>;
    lastUsed: number;
}

interface SessionOptions {
    userAgent?: string;
    locale?: string;
    timezone?: string;
}

class SessionManager {
    private sessions = new Map<string, SessionState>();

    // The browser instance comes from the pool; the caller owns its lifecycle.
    async createSession(browser: Browser, id: string, options: SessionOptions): Promise<BrowserContext> {
        const context = await browser.newContext({
            viewport: { width: 1280, height: 720 },
            userAgent: options.userAgent || this.getRandomUserAgent(),
            locale: options.locale || 'en-US',
            timezoneId: options.timezone || 'Europe/Berlin',
        });

        // Restore previous session state if exists
        const existing = this.sessions.get(id);
        if (existing) {
            await context.addCookies(existing.cookies);
            // localStorage restored via page.evaluate after navigation
        }

        return context;
    }

    async saveSession(id: string, context: BrowserContext): Promise<void> {
        const cookies = await context.cookies();
        const pages = context.pages();
        let localStorage = {};

        if (pages.length > 0) {
            localStorage = await pages[0].evaluate(() => {
                const data: Record<string, string> = {};
                for (let i = 0; i < window.localStorage.length; i++) {
                    const key = window.localStorage.key(i);
                    if (key) data[key] = window.localStorage.getItem(key) || '';
                }
                return data;
            });
        }

        this.sessions.set(id, {
            cookies,
            localStorage,
            lastUsed: Date.now(),
        });
    }
}

LLM-Driven Page Understanding

The core innovation: instead of writing CSS selectors or XPath queries for every page, send the page's accessibility tree to an LLM and let it decide which elements to interact with.

async function extractPageStructure(page: Page): Promise<string> {
    // Get the accessibility tree (structured, compact representation)
    const tree = await page.accessibility.snapshot();

    // Convert to a text format the LLM can understand
    return formatAccessibilityTree(tree, {
        maxDepth: 5,
        includeRoles: ['button', 'link', 'textbox', 'combobox', 'checkbox', 'heading'],
        includeText: true,
        includeLabels: true,
    });
}

function formatAccessibilityTree(node: any, options: any, depth = 0): string {
    if (depth > options.maxDepth) return '';
    if (!options.includeRoles.includes(node.role) && depth > 1) {
        // Skip non-interactive elements, but recurse into children
        return (node.children || []).map(c => formatAccessibilityTree(c, options, depth + 1)).join('');
    }

    const indent = '  '.repeat(depth);
    let result = `${indent}[${node.role}] ${node.name || ''}`;
    if (node.value) result += ` value="${node.value}"`;
    result += '\n';

    for (const child of node.children || []) {
        result += formatAccessibilityTree(child, options, depth + 1);
    }
    return result;
}
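Because the formatter is a pure function, it can be sanity-checked without launching a browser. The snippet below runs a trimmed standalone copy of formatAccessibilityTree on a hand-written sample tree (the tree contents are made up for illustration):

```typescript
// Trimmed standalone copy of the formatter above, exercised on a sample tree.
interface AXNode { role: string; name?: string; value?: string; children?: AXNode[] }

const options = { maxDepth: 5, includeRoles: ['button', 'link', 'textbox', 'heading'] };

function formatTree(node: AXNode, depth = 0): string {
    if (depth > options.maxDepth) return '';
    if (!options.includeRoles.includes(node.role) && depth > 1) {
        // Skip non-interactive elements, but recurse into children
        return (node.children || []).map((c) => formatTree(c, depth + 1)).join('');
    }
    const indent = '  '.repeat(depth);
    let result = `${indent}[${node.role}] ${node.name || ''}`;
    if (node.value) result += ` value="${node.value}"`;
    result += '\n';
    for (const child of node.children || []) result += formatTree(child, depth + 1);
    return result;
}

const sample: AXNode = {
    role: 'WebArea', name: 'Contact', children: [
        { role: 'heading', name: 'Contact us' },
        { role: 'textbox', name: 'Email' },
        { role: 'button', name: 'Send' },
    ],
};

console.log(formatTree(sample));
// [WebArea] Contact
//   [heading] Contact us
//   [textbox] Email
//   [button] Send
```

This compact text form is what the LLM sees: a few hundred tokens for a page whose raw HTML might be tens of thousands.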

LLM Action Planning

Send the page structure to the LLM with the task description. The LLM returns a sequence of actions:

async function planActions(pageStructure: string, task: string): Promise<Action[]> {
    const response = await llm.generate({
        model: 'gpt-4o-mini', // Fast model for action planning
        messages: [
            {
                role: 'system',
                content: `You are a browser automation assistant. Given a page structure and a task,
                return a JSON array of actions to accomplish the task.
                Available actions: click(selector), type(selector, text), select(selector, value),
                wait(ms), extract(selector).
                Use the element text/labels to identify targets, not CSS selectors.`,
            },
            {
                role: 'user',
                content: `Page structure:\n${pageStructure}\n\nTask: ${task}`,
            },
        ],
        responseFormat: 'json',
    });

    return JSON.parse(response.text);
}

// Example task: "Fill in the contact form with name Sara Mustermann and email sara.mustermann@beispiel.de"
// LLM returns:
// [
//   { "action": "type", "target": "Name input field", "value": "Sara Mustermann" },
//   { "action": "type", "target": "Email input field", "value": "sara.mustermann@beispiel.de" },
//   { "action": "click", "target": "Submit button" }
// ]
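One caveat: LLM output is untrusted input, and JSON.parse succeeding does not mean the array has the shape the executor expects. A minimal validator (schema matching the example plan above; the helper name is ours) catches malformed plans before they reach the browser:

```typescript
// Validate a parsed LLM response against the action schema shown above.
interface PlannedAction { action: string; target?: string; value?: string | number }

const KNOWN_ACTIONS = new Set(['click', 'type', 'select', 'wait', 'extract']);

function validateActions(parsed: unknown): PlannedAction[] {
    if (!Array.isArray(parsed)) throw new Error('LLM response is not an array');
    return parsed.map((item, i) => {
        if (typeof item !== 'object' || item === null) throw new Error(`Action ${i} is not an object`);
        const a = item as PlannedAction;
        if (!KNOWN_ACTIONS.has(a.action)) throw new Error(`Action ${i}: unknown action "${a.action}"`);
        if (a.action !== 'wait' && typeof a.target !== 'string') throw new Error(`Action ${i}: missing target`);
        return a;
    });
}

// A valid plan passes through; a malformed one throws before touching the browser.
validateActions([{ action: 'type', target: 'Email input field', value: 'sara.mustermann@beispiel.de' }]);
```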

Resolving LLM Actions to Playwright Commands

The LLM returns human-readable targets ("Name input field"). A resolver maps them to Playwright selectors:

import { Page } from 'playwright';

interface Action {
    action: 'click' | 'type' | 'select' | 'wait' | 'extract';
    target: string;
    value?: string;
}

interface ExtractedField {
    field: string;
    value: string | null;
}

async function resolveAndExecute(page: Page, actions: Action[]): Promise<ExtractedField[]> {
    const results: ExtractedField[] = [];

    for (const action of actions) {
        // 'wait' needs no target element
        if (action.action === 'wait') {
            await page.waitForTimeout(Number(action.value));
            continue;
        }

        // Find the element matching the LLM's description
        const element = await findElementByDescription(page, action.target);

        if (!element) {
            throw new Error(`Could not find element: ${action.target}`);
        }

        switch (action.action) {
            case 'click':
                await element.click();
                await page.waitForLoadState('networkidle');
                break;
            case 'type':
                await element.fill(action.value ?? '');
                break;
            case 'select':
                await element.selectOption(action.value ?? '');
                break;
            case 'extract': {
                const text = await element.textContent();
                results.push({ field: action.target, value: text });
                break;
            }
        }
    }

    return results;
}

async function findElementByDescription(page: Page, description: string): Promise<ElementHandle | null> {
    // Try multiple strategies to find the element
    const strategies = [
        // By aria-label
        () => page.$(`[aria-label*="${description}" i]`),
        // By placeholder
        () => page.$(`[placeholder*="${description}" i]`),
        // By visible text
        () => page.$(`text=${description}`),
        // By label association
        () => page.$(`label:has-text("${description}") + input, label:has-text("${description}") input`),
        // By role and name
        () => page.getByRole('textbox', { name: new RegExp(description, 'i') }).first().elementHandle(),
        () => page.getByRole('button', { name: new RegExp(description, 'i') }).first().elementHandle(),
    ];

    for (const strategy of strategies) {
        try {
            const element = await strategy();
            if (element) return element;
        } catch {
            continue;
        }
    }

    return null;
}
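One subtle failure mode in the strategies above: the LLM's description goes straight into new RegExp, so a target like "Email (work)" builds a capture group instead of matching literal parentheses. Escaping metacharacters first is cheap insurance (the helper name is ours):

```typescript
// Escape regex metacharacters so an LLM-provided description matches literally.
function escapeRegExp(text: string): string {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const pattern = new RegExp(escapeRegExp('Email (work)'), 'i');
console.log(pattern.test('email (work)')); // true
console.log(pattern.test('Email work'));   // false
```

Use it in the role-based strategies: `page.getByRole('textbox', { name: new RegExp(escapeRegExp(description), 'i') })`.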

Anti-Detection Basics

Some websites detect and block headless browsers. Basic countermeasures:

const context = await browser.newContext({
    // Randomize viewport
    viewport: {
        width: 1280 + Math.floor(Math.random() * 200),
        height: 720 + Math.floor(Math.random() * 100),
    },

    // Rotate user agents
    userAgent: getRandomUserAgent(),

    // Set realistic locale and timezone
    locale: 'de-DE',
    timezoneId: 'Europe/Berlin',

    // Realistic geolocation
    geolocation: { latitude: 48.1351, longitude: 11.5820 },
    permissions: ['geolocation'],
});

// Override navigator.webdriver (headless detection)
await page.addInitScript(() => {
    Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
});
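getRandomUserAgent is referenced above but not shown; a minimal version just picks from a curated list. These UA strings are examples — keep the list current, since a stale user agent is itself a detection signal:

```typescript
// Pick a user agent at random from a curated, periodically refreshed list.
const USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
];

function getRandomUserAgent(): string {
    return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}
```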

Note: anti-detection is an arms race. For sites with sophisticated bot detection (Cloudflare, Akamai), self-hosted Playwright will eventually be detected. This is where paid services like BrowserBase add value: they invest continuously in anti-detection. For most business automation tasks (internal tools, partner portals, public data), basic anti-detection is sufficient.

When Paid Tools ARE Worth It

Scenario                                Self-Hosted                                  Paid Service
Internal tool automation                Best choice (no anti-detection needed)       Overkill
Public data extraction (simple)         Good (basic anti-detection works)            Unnecessary
Sites with bot detection                Possible but constant maintenance            Worth it (they handle anti-detection)
High-volume scraping (10K+ pages/day)   Complex (proxy rotation, IP management)      Worth it (managed infrastructure)
Regulated data (GDPR, compliance)       Better (data stays on your infrastructure)   Risk (data goes through third party)
One-time migration                      Good (temporary workload)                    Unnecessary cost

The decision framework: if you're automating internal workflows or processing public data from sites without aggressive bot detection, self-host. If you're doing high-volume extraction from sites with Cloudflare-level protection, pay for a service that handles anti-detection as their core business.

Cost Comparison

Component                     Self-Hosted (monthly)     BrowserBase (monthly)
Compute (5 instances)         $50-100 (container/VPS)   N/A
LLM calls (action planning)   $20-50 (GPT-4o-mini)      N/A
BrowserBase sessions          N/A                       $500-2,000
Proxy service (if needed)     $50-200                   Included
Maintenance                   2-4 hours/month           None
Total (1,000 pages/day)       $120-350/month            $500-2,000/month
Total (10,000 pages/day)      $300-800/month            $3,000-10,000/month

Self-hosting is 3-10x cheaper at scale. The trade-off is maintenance time and anti-detection capability.

Common Pitfalls

  1. No instance pooling. Launching a new browser per task wastes 1-3 seconds on cold start and 200-400MB of RAM. Pool and reuse instances.

  2. Hardcoded CSS selectors. Pages change their DOM structure regularly. LLM-based element identification is more resilient than hardcoded selectors.

  3. No session persistence. Multi-step workflows that require login fail when the session state is lost between steps.

  4. Ignoring anti-detection entirely. Even basic measures (random viewport, user agent rotation, webdriver override) prevent detection on most sites.

  5. Using a large model for action planning. GPT-4o-mini or Claude Haiku are fast enough for page understanding. A large model adds latency without better accuracy for this task.

  6. No timeout on page loads. Some pages load indefinitely (infinite scrolling, slow third-party scripts). Set a navigation timeout and handle it.

  7. Running in production without monitoring. Track success rate, average execution time, and error types per workflow. Alert when success rate drops.
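Pitfall 7 doesn't require a metrics stack on day one: per-workflow in-process counters are enough to start alerting on success-rate drops. A minimal sketch (class name and alert threshold are illustrative choices):

```typescript
// Track per-workflow success rate and duration; flag drops below a threshold.
class WorkflowMetrics {
    private runs = 0;
    private failures = 0;
    private totalMs = 0;

    record(success: boolean, durationMs: number): void {
        this.runs++;
        this.totalMs += durationMs;
        if (!success) this.failures++;
    }

    successRate(): number {
        return this.runs === 0 ? 1 : (this.runs - this.failures) / this.runs;
    }

    avgDurationMs(): number {
        return this.runs === 0 ? 0 : this.totalMs / this.runs;
    }

    // Alert only once there is enough data to make the rate meaningful.
    shouldAlert(minSuccessRate = 0.9): boolean {
        return this.runs >= 10 && this.successRate() < minSuccessRate;
    }
}

const metrics = new WorkflowMetrics();
metrics.record(true, 1200);
metrics.record(false, 3400);
console.log(metrics.successRate()); // 0.5
```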

Key Takeaways

  • Self-hosted Playwright + LLM handles 90% of browser automation use cases. For internal tools, partner portals, and public data without aggressive bot detection, this is the right approach.

  • Instance pooling is essential. Reuse browser instances across tasks. Cold starts and memory allocation are the biggest performance bottleneck.

  • LLM page understanding replaces brittle selectors. Send the accessibility tree to a fast model. Let it decide which elements to interact with. More resilient to page changes than hardcoded CSS selectors.

  • Paid services earn their cost on anti-detection. If your target sites have Cloudflare or similar protection, BrowserBase invests continuously in bypassing it. That's their core business. Don't try to compete.

  • Self-hosting is 3-10x cheaper at scale. But you pay in maintenance time and anti-detection limitations. Make the trade-off consciously.

FIND MORE: https://oronts.com/en/guides/browser-automation-ai-without-paid-tools
