DEV Community

韩


I Tried Making AI Agents Browse the Web for Me — 7 browser-use Features Nobody Talks About


If you have been watching the AI agent space, you have probably seen browser-use pop up everywhere. With 89k+ GitHub stars, it has become the go-to library for making AI agents interact with websites. But most tutorials only scratch the surface.


Here is the thing: most people treat browser-use as a simple "open page and click" tool. After spending two weeks building production agents with it, I discovered it can do way more — and most of these features are not documented well (or at all).


Why Most Developers Use browser-use Wrong

When you first install browser-use, the instinct is:

import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI

agent = Agent(
    task="Find the price of Bitcoin",
    llm=ChatOpenAI(model="gpt-4o"),  # an LLM is required
)
asyncio.run(agent.run())  # run() is a coroutine, so it must be awaited

This works — but it wastes the power of the library. The real magic is in the customization layer: custom actions, memory hooks, agent chaining, visual validation, and more.


1. Multi-Agent Orchestration with Shared Context

browser-use supports running multiple agents that share a browser session and memory. This is huge for complex workflows.

import asyncio

from browser_use import Agent, Browser, BrowserConfig
from langchain_openai import ChatOpenAI

# Shared browser: agents can see each other's cookies/sessions
browser = Browser(config=BrowserConfig(headless=False))

async def main():
    # Agent 1: Research
    researcher = Agent(
        task="Go to Hacker News and find the top 5 AI news stories today",
        llm=ChatOpenAI(model="gpt-4o"),
        browser=browser,
    )
    await researcher.run()

    # Agent 2: Summarize (same browser session, same logged-in state)
    summarizer = Agent(
        task="For each story found, open it and extract the key insight",
        llm=ChatOpenAI(model="gpt-4o"),
        browser=browser,  # same browser = same cookies, same context
    )
    await summarizer.run()

    await browser.close()

asyncio.run(main())

Why this matters: You can maintain logged-in state across agents. Build a researcher that logs into GitHub, then a coder agent that operates on the authenticated session.
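One gap in the snippet above: the shared browser carries cookies and logged-in state, but nothing hands agent 1's extracted findings to agent 2's prompt. In practice you interpolate the first run's result into the second task. A minimal sketch of that handoff, with plain functions standing in for the agents (the names here are illustrative, not browser-use API; with browser-use you would pull the findings from the first run's history):

```python
# Sketch: explicit handoff between chained agents. Plain functions
# stand in for the agents' extraction step.

def run_researcher() -> list[str]:
    # Stand-in for the researcher agent's extracted findings
    return [
        "OpenAI releases new reasoning model",
        "Anthropic publishes interpretability research",
    ]

def build_summarizer_task(findings: list[str]) -> str:
    # Interpolate the first agent's output into the second agent's task,
    # so the summarizer does not depend on the LLM "remembering" anything
    bullet_list = "\n".join(f"- {story}" for story in findings)
    return (
        "For each story below, open it and extract the key insight:\n"
        + bullet_list
    )

stories = run_researcher()
summarizer_task = build_summarizer_task(stories)
print(summarizer_task)
```

The explicit handoff also makes the pipeline debuggable: you can log exactly what context the second agent received.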


2. Custom Action Definitions (No More Wrong Clicks)

The default agent sometimes clicks the wrong button or fills the wrong field. You can register custom actions on a Controller, which defines the action space the agent is allowed to use. A sketch (method names follow the 0.1.x API; adapt to your version):

from browser_use import Agent, Controller
from browser_use.agent.views import ActionResult
from langchain_openai import ChatOpenAI

controller = Controller()

@controller.action("Click an element, but only buttons, links, or inputs")
async def click_button_only(index: int, browser) -> ActionResult:
    """Restrict the agent to clicking only buttons, links, and inputs."""
    element = (await browser.get_selector_map())[index]
    if element.tag_name.upper() not in ("BUTTON", "A", "INPUT"):
        return ActionResult(error="Element type not allowed")
    # _click_element_node is a private helper in 0.1.x; adapt to your version
    await browser._click_element_node(element)
    return ActionResult(extracted_content=f"Clicked <{element.tag_name}>")

agent = Agent(
    task="Sign up for the newsletter on example.com",
    llm=ChatOpenAI(model="gpt-4o"),
    controller=controller,
)

This pattern came up in HN discussions about AI agent reliability: constraining the action space is one of the most effective ways to reduce hallucinated clicks and improve accuracy.


3. Step-by-Step Recording and Replay

Need to debug what your agent did? Use the built-in step recording:

from browser_use import Agent
from langchain_openai import ChatOpenAI

agent = Agent(
    task="Find the latest release version of the React repo on GitHub",
    llm=ChatOpenAI(model="gpt-4o"),
)
history = await agent.run()  # run() returns an AgentHistoryList

# Inspect every step after the fact
for i, (action, result) in enumerate(
    zip(history.model_actions(), history.action_results()), start=1
):
    print(f"Step {i}: {action}")
    print(f"  Result: {result.extracted_content}")
    if result.error:
        print(f"  ERROR: {result.error}")

# Screenshots are captured per step, and the run can be saved for replay
print(f"{len(history.screenshots())} screenshots captured")
agent.save_history("history.json")

This is invaluable for debugging flaky agents — you can see exactly where things went wrong.
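Once a run's history is saved to disk as JSON, you can also mine it offline, for example to find the first step that failed. A sketch, assuming a simplified step schema; the real file's shape will differ, so inspect your saved history before adapting this:

```python
import json

# Sketch: find the first failing step in a saved run. The JSON shape
# here (a list of steps with "action"/"error" keys) is an assumption
# for illustration only.
saved_run = json.loads("""
[
  {"step": 1, "action": "go_to_url", "error": null},
  {"step": 2, "action": "click_element", "error": null},
  {"step": 3, "action": "input_text", "error": "Element not found"},
  {"step": 4, "action": "click_element", "error": "Timed out"}
]
""")

def first_failure(steps):
    # Return the earliest step that recorded an error, or None
    for step in steps:
        if step["error"]:
            return step
    return None

bad = first_failure(saved_run)
print(f"First failure at step {bad['step']}: {bad['action']} -> {bad['error']}")
```

For flaky agents, diffing these failure points across runs quickly shows whether the agent breaks at the same step every time (a selector problem) or at random steps (a timing problem).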


4. Vision-Based Element Selection (Beyond Text)

browser-use builds its element index from the DOM, which works well for labeled controls. But websites often have unlabeled icons, images, or ambiguous buttons. Vision mode also sends annotated screenshots to the model:

from browser_use import Agent
from langchain_openai import ChatOpenAI

agent = Agent(
    task="Click the settings gear icon on the dashboard",
    llm=ChatOpenAI(model="gpt-4o"),
    use_vision=True,  # model receives screenshots and can identify elements visually
)

This is particularly useful for websites with poor accessibility labels — instead of matching text, the agent identifies elements by visual appearance.


5. Structured Output Extraction (LLM-Friendly)

Instead of raw HTML, extract data into a structured format:

from browser_use import Agent, Controller
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

class JobListing(BaseModel):
    title: str
    company: str
    salary_range: str | None
    remote: bool
    url: str

class JobListings(BaseModel):
    listings: list[JobListing]

# The controller validates the agent's final answer against the schema
controller = Controller(output_model=JobListings)

agent = Agent(
    task="Search for Python developer jobs on LinkedIn and extract the top 10 listings",
    llm=ChatOpenAI(model="gpt-4o"),
    controller=controller,
)
history = await agent.run()

jobs = JobListings.model_validate_json(history.final_result())
print(f"Found {len(jobs.listings)} jobs")  # typed JobListing objects

Many of the 100+ AI agent apps collected in awesome-llm-apps rely on this pattern: structured extraction is the foundation of production-grade scraping agents.
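Once the output is typed, everything downstream is ordinary Python. A sketch of post-processing, with plain dicts standing in for the JobListing models above (field names match that schema) and an in-memory buffer in place of a real file:

```python
import csv
import io

# Sketch: filter extracted listings and export them to CSV.
listings = [
    {"title": "Python Developer", "company": "Acme", "salary_range": "$120k-$150k", "remote": True, "url": "https://example.com/1"},
    {"title": "Backend Engineer", "company": "Globex", "salary_range": None, "remote": False, "url": "https://example.com/2"},
]

# Keep only remote roles
remote_jobs = [job for job in listings if job["remote"]]

# Write to CSV (swap io.StringIO for open("jobs.csv", "w") to get a file)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "company", "salary_range", "remote", "url"])
writer.writeheader()
writer.writerows(remote_jobs)
print(buffer.getvalue())
```

This is the payoff of schema-first extraction: no regexing through raw HTML, just typed records flowing into standard tooling.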


6. Memory Hooks — Give Your Agent Long-Term Context

The biggest limitation of AI agents is forgetting what they did in previous runs. browser-use ships a memory system (procedural memory built on mem0 in recent versions; flag names vary across releases):

from browser_use import Agent
from langchain_openai import ChatOpenAI

agent = Agent(
    task="Compare today's stock price of NVIDIA with last week's",
    llm=ChatOpenAI(model="gpt-4o"),
    enable_memory=True,  # agent periodically summarizes earlier steps into memory
)

Combined with mem0 (58k stars), you can give agents persistent cross-session memory — they learn from past failures and improve over time.
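The idea behind cross-session memory can be sketched with nothing more than a JSON file. This is the pattern, not the mem0 or browser-use API: each run loads prior findings, records new ones, and prepends a summary to the next task prompt.

```python
import json
import tempfile
from pathlib import Path

# Sketch: persistent memory as a JSON file (the pattern mem0 generalizes)
memory_file = Path(tempfile.gettempdir()) / "agent_memory.json"
memory_file.unlink(missing_ok=True)  # start fresh for this demo

def load_memory() -> list[str]:
    if memory_file.exists():
        return json.loads(memory_file.read_text())
    return []

def remember(finding: str) -> None:
    findings = load_memory()
    findings.append(finding)
    memory_file.write_text(json.dumps(findings))

def task_with_memory(task: str) -> str:
    # Inject everything learned so far into the next task prompt
    findings = load_memory()
    if not findings:
        return task
    context = "\n".join(f"- {f}" for f in findings)
    return f"Previously learned:\n{context}\n\nNew task: {task}"

remember("NVDA closed at $900 last week")  # hypothetical finding from a past run
prompt = task_with_memory("Compare today's NVIDIA stock price with last week's")
print(prompt)
```

A real store adds summarization and relevance filtering so the injected context stays small, which is exactly what mem0 handles for you.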


7. Stealth Mode — Bypass Bot Detection

Many sites block headless browsers. browser-use has anti-detection options built in (exact flags vary by version):

from browser_use import Agent, Browser, BrowserConfig
from langchain_openai import ChatOpenAI

config = BrowserConfig(
    headless=False,  # headless browsers are the easiest to fingerprint
    # The flags below exist in recent versions; check your version's BrowserConfig
    stealth=True,  # patch common bot-detection signals
    user_data_dir="/tmp/browser-profile",  # persistent, realistic Chrome profile
)

agent = Agent(
    task="Find flight prices from NYC to London for next weekend",
    llm=ChatOpenAI(model="gpt-4o"),
    browser=Browser(config=config),
)

This came up in r/artificial threads comparing AI agents with browser automation tools like hand-written Playwright scripts: the key advantage is that browser-use absorbs complexity that would otherwise take hundreds of lines of anti-detection code.


The Bigger Picture

browser-use (89k stars) is part of a broader wave of AI-native browser automation. Combined with frameworks like LangChain, n8n, and Dify, you can build agents that research, code, and execute — entirely autonomously.

The gap most developers miss: browser-use is not just a library, it is a platform. The multi-agent orchestration, memory hooks, and vision-based selection are features that take weeks to build yourself but are free when you know they exist.


What is Your Take?

Have you used browser-use in production? I am curious:

  1. What is the most complex automation you have built with it?
  2. Have you hit the limits of custom actions? What did you do?

Drop your answers below — especially interested if you have combined it with other agent frameworks.


Data sources: GitHub browser-use · HN: AI agent reliability · Reddit r/artificial
