I Tried Making AI Agents Browse the Web for Me — 7 browser-use Features Nobody Talks About
If you have been watching the AI agent space, you have probably seen browser-use pop up everywhere. With 89k+ GitHub stars, it has become the go-to library for making AI agents interact with websites. But most tutorials only scratch the surface.
Here is the thing: most people treat browser-use as a simple "open page and click" tool. After spending two weeks building production agents with it, I discovered it can do way more — and most of these features are not documented well (or at all).
Why Most Developers Use browser-use Wrong
When you first install browser-use, the instinct is:
from browser_use import Agent
from langchain_openai import ChatOpenAI

agent = Agent(task="Find the price of Bitcoin", llm=ChatOpenAI(model="gpt-4o"))
await agent.run()  # run() is async; note that Agent also requires an llm, which the naive snippet omits
This works — but it wastes the power of the library. The real magic is in the customization layer: custom actions, memory hooks, agent chaining, visual validation, and more.
1. Multi-Agent Orchestration with Shared Context
browser-use supports running multiple agents that share a browser session and memory. This is huge for complex workflows.
from browser_use import Agent, Browser, BrowserConfig
from langchain_openai import ChatOpenAI

# Shared browser: agents can see each other's cookies and sessions.
# (Browser/BrowserConfig here follow the pre-1.0 browser-use API.)
browser = Browser(config=BrowserConfig(headless=False))

# Agent 1: Research
researcher = Agent(
    task="Go to Hacker News and find the top 5 AI news stories today",
    llm=ChatOpenAI(model="gpt-4o"),
    browser=browser,
)
await researcher.run()  # top-level await: run in a notebook, or wrap in asyncio.run()

# Agent 2: Summarize (same browser session, same logged-in state)
summarizer = Agent(
    task="For each story found, open it and extract the key insight",
    llm=ChatOpenAI(model="gpt-4o"),
    browser=browser,  # same browser = same cookies, same context
)
await summarizer.run()
Why this matters: You can maintain logged-in state across agents. Build a researcher that logs into GitHub, then a coder agent that operates on the authenticated session.
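Here is a sketch of that hand-off, reusing the shared browser object from above. The task strings and repo name are purely illustrative, and credential handling is left out:
# Agent 1 authenticates; Agent 2 inherits the logged-in session
# because both point at the same Browser instance.
login_agent = Agent(
    task="Log into github.com",  # illustrative; wire up credentials yourself
    llm=ChatOpenAI(model="gpt-4o"),
    browser=browser,
)
await login_agent.run()

coder_agent = Agent(
    task="Open the repo 'example/demo-repo' and file an issue titled 'Flaky login test'",
    llm=ChatOpenAI(model="gpt-4o"),
    browser=browser,
)
await coder_agent.run()

await browser.close()  # shut the shared browser down once the chain finishes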
2. Custom Action Definitions (No More Wrong Clicks)
The default agent sometimes clicks the wrong button or fills the wrong field. You can define custom actions that constrain what the agent can do:
from browser_use import Agent, Controller
from browser_use.agent.views import ActionResult
from langchain_openai import ChatOpenAI

controller = Controller()

# Custom actions are registered on a Controller and handed to the Agent.
# (The decorator pattern is browser-use's documented API; the element
# lookup below touches version-dependent internals, so treat it as a sketch.)
@controller.action("Click an element, but only if it is a button, link, or input")
async def click_button_only(index: int, browser) -> ActionResult:
    element = await browser.get_dom_element_by_index(index)
    if element.tag_name.lower() not in ("button", "a", "input"):
        return ActionResult(error="Element type not allowed")
    await browser._click_element_node(element)
    return ActionResult(extracted_content=f"Clicked <{element.tag_name}>")

agent = Agent(
    task="Sign up for the newsletter on example.com",
    llm=ChatOpenAI(model="gpt-4o"),
    controller=controller,  # the agent can only call actions this controller exposes
)
HN discussion: This pattern came up in threads about AI agent reliability; constraining the action space is one of the most effective ways to reduce hallucinated clicks and improve accuracy.
3. Step-by-Step Recording and Replay
Need to debug what your agent did? Use the built-in step recording:
from browser_use import Agent
from browser_use.agent.views import AgentHistoryList
from langchain_openai import ChatOpenAI

agent = Agent(
    task="Find the latest release version of the React repo on GitHub",
    llm=ChatOpenAI(model="gpt-4o"),
)
history: AgentHistoryList = await agent.run()  # run() returns the full step history

# Replay any step. (Per-step attribute names vary by version;
# AgentHistoryList also exposes helpers like action_names(),
# urls(), screenshots(), and errors().)
for i, step in enumerate(history.history):
    print(f"Step {i}: {step.model_output}")
    print(f"  Result: {step.result}")
for error in history.errors():
    if error:
        print(f"  ERROR: {error}")
This is invaluable for debugging flaky agents — you can see exactly where things went wrong.
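To keep runs around for post-mortems, you can also persist the history to disk. A minimal sketch, assuming your version ships AgentHistoryList.save_to_file and the helper accessors; fall back to manual JSON serialization if it does not:
# Persist the run for offline inspection.
history.save_to_file("debug_run.json")  # assumed helper; serialize the fields manually if absent

# Quick post-mortem summary using AgentHistoryList accessors.
print("Pages visited:", history.urls())
print("Actions taken:", history.action_names())
print("Errors:       ", [e for e in history.errors() if e])
print("Final answer: ", history.final_result())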
4. Vision-Based Element Selection (Beyond Text)
By default, browser-use picks elements from the parsed DOM, mostly by text content and attributes. But websites often have icons, images, or ambiguous buttons. For those, enable vision-based selection:
from browser_use import Agent
from langchain_openai import ChatOpenAI

agent = Agent(
    task="Click the settings gear icon on the dashboard",
    llm=ChatOpenAI(model="gpt-4o"),  # must be a vision-capable model
    use_vision=True,  # attach screenshots so the LLM can identify elements visually
)
This is particularly useful for websites with poor accessibility labels — instead of matching text, the agent identifies elements by visual appearance.
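A quick way to check whether vision actually helps on your target site is to run the same task with and without screenshots and compare the action traces. A small experiment sketch, not an official benchmark harness:
# A/B the same task with vision off and on, then compare
# what the agent actually did in each run.
for use_vision in (False, True):
    agent = Agent(
        task="Click the settings gear icon on the dashboard",
        llm=ChatOpenAI(model="gpt-4o"),
        use_vision=use_vision,
    )
    history = await agent.run()
    print(f"vision={use_vision} done={history.is_done()} actions={history.action_names()}")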
5. Structured Output Extraction (LLM-Friendly)
Instead of raw HTML, extract data into a structured format:
from browser_use import Agent, Controller
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

class JobListing(BaseModel):
    title: str
    company: str
    salary_range: str | None
    remote: bool
    url: str

class JobListings(BaseModel):
    listings: list[JobListing]

# Attaching output_model to the Controller makes the agent's final
# answer validate against the schema (browser-use's documented pattern).
controller = Controller(output_model=JobListings)

agent = Agent(
    task="Search for Python developer jobs on LinkedIn and extract the top 10 listings",
    llm=ChatOpenAI(model="gpt-4o"),
    controller=controller,
)
history = await agent.run()

# final_result() returns the raw JSON string; parse it back into typed objects.
result = JobListings.model_validate_json(history.final_result())
print(f"Found {len(result.listings)} jobs")
GitHub stars context: This pattern powers many of the 100+ AI agent apps collected in awesome-llm-apps; structured extraction is the foundation of production-grade scraping agents.
6. Memory Hooks — Give Your Agent Long-Term Context
The biggest limitation of AI agents is forgetting what they did in previous runs. browser-use has a memory hook system:
from browser_use import Agent
from browser_use.agent.memory import MemoryConfig  # requires: pip install "browser-use[memory]"
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")

# Procedural memory: every N steps the agent condenses what it has done
# so far and keeps the summary in context. (Field names follow the
# releases that ship MemoryConfig; check your version.)
memory_config = MemoryConfig(
    llm_instance=llm,
    agent_id="stock_watcher",
    memory_interval=10,  # summarize every 10 steps
)

agent = Agent(
    task="Compare today's NVIDIA stock price with last week's",
    llm=llm,
    enable_memory=True,  # agent remembers and summarizes prior steps
    memory_config=memory_config,
)
Combined with mem0 (58k stars), you can give agents persistent cross-session memory — they learn from past failures and improve over time.
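If you want memory that survives across sessions rather than within one run, you can bolt mem0 on yourself. A rough sketch; the store/recall glue below is our own code, not a browser-use API, and mem0's search return shape differs across versions (this assumes the dict-with-"results" format):
from mem0 import Memory

m = Memory()

# After a run: store what the agent learned.
m.add("NVIDIA price moved sharply after the last earnings call.",
      user_id="stock_watcher")

# Before the next run: recall relevant context and fold it into the task.
hits = m.search("NVIDIA stock price", user_id="stock_watcher")
context = "\n".join(h["memory"] for h in hits["results"])
task = f"Known context:\n{context}\n\nNow compare with today's NVIDIA price."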
7. Stealth Mode — Bypass Bot Detection
Many sites block headless browsers. browser-use has anti-detection built in:
from browser_use import Agent, BrowserProfile, BrowserSession
from langchain_openai import ChatOpenAI

# Stealth support lives on the browser profile in recent versions
# (backed by patchright); exact field names vary across releases.
profile = BrowserProfile(
    stealth=True,                          # patched fingerprint to defeat naive bot detection
    user_data_dir="/tmp/browser-profile",  # persistent, real-looking Chrome profile
    # Leave image loading on: some sites use it as part of anti-bot checks.
)

agent = Agent(
    task="Find flight prices from NYC to London for next weekend",
    llm=ChatOpenAI(model="gpt-4o"),
    browser_session=BrowserSession(browser_profile=profile),
)
Reddit discussion: This came up in r/artificial threads about AI agents replacing hand-written browser automation like Playwright scripts; the key advantage is that browser-use absorbs complexity that would otherwise take hundreds of lines of anti-detection code.
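For contrast, here is a fragment of the boilerplate a raw Playwright script typically starts with; real stealth setups patch many more fingerprint surfaces than this:
from playwright.async_api import async_playwright

async def manual_stealth_setup():
    p = await async_playwright().start()
    # Launch flags that the most naive bot detectors look for first.
    browser = await p.chromium.launch(
        headless=False,
        args=["--disable-blink-features=AutomationControlled"],
    )
    context = await browser.new_context(
        viewport={"width": 1366, "height": 768},
        locale="en-US",
    )
    # Hide the navigator.webdriver flag before any page script runs.
    await context.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    return browser, context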
The Bigger Picture
browser-use (89k stars) is part of a broader wave of AI-native browser automation. Combined with frameworks like LangChain, n8n, and Dify, you can build agents that research, code, and execute — entirely autonomously.
The gap most developers miss: browser-use is not just a library, it is a platform. The multi-agent orchestration, memory hooks, and vision-based selection are features that take weeks to build yourself but are free when you know they exist.
What is Your Take?
Have you used browser-use in production? I am curious:
- What is the most complex automation you have built with it?
- Have you hit the limits of custom actions? What did you do?
Drop your answers below — especially interested if you have combined it with other agent frameworks.
Data sources: GitHub browser-use · HN: AI agent reliability · Reddit r/artificial