Thushara Jayasinghe

🧠 talk2browser – Browser Automation with Everyday Language (Powered by LangGraph)

Ever wanted to automate real browser actions just by describing what you want? Meet talk2browser, a LangGraph-powered agent that turns prompts into real-time web actions and reusable test scripts.

Hi everyone! 👋 I'm excited to share talk2browser, which uses LangGraph's agent orchestration to build a self-improving browser automation system. Inspired by the open-source Browser-Use project, it takes natural language tasks and executes real browser actions while generating reusable test scripts.

📚 Resources

🔗 LangGraph Implementation

talk2browser showcases advanced LangGraph patterns:

  • Agent State Management: Complex browser workflows with conditional transitions using AgentState TypedDict
  • Dynamic Tool Registration: 25+ browser automation tools automatically registered as LangGraph tools via decorators
  • Multi-Step Orchestration: Planning → Execution → Script Generation phases with state persistence
  • Self-Improving Workflows: Action recording and replay capabilities for iterative improvement
  • Vision Integration: YOLOv11-based UI element detection with LLM context injection
  • Sensitive Data Handling: Secure credential management with environment variable injection
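
The environment variable injection mentioned in the last bullet can be pictured with a small sketch. It assumes a hypothetical `${NAME}` placeholder syntax and two hypothetical helpers, `inject_secrets` and `mask_secrets`; the project's own `sensitive_data_service.py` may work quite differently.

```python
import os
import re

# Hypothetical placeholder syntax: ${NAME}. Not taken from talk2browser.
PLACEHOLDER = re.compile(r"\$\{([A-Z0-9_]+)\}")


def inject_secrets(text: str, env=None) -> str:
    """Replace ${NAME} placeholders with values from the environment."""
    env = os.environ if env is None else env

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            raise KeyError(f"missing environment variable: {name}")
        return env[name]

    return PLACEHOLDER.sub(substitute, text)


def mask_secrets(text: str, secrets: dict) -> str:
    """Hide secret values before the text is logged or shown to the LLM."""
    for value in secrets.values():
        if value:  # an empty value would match everywhere, so skip it
            text = text.replace(value, "***")
    return text


demo_env = {"DEMO_PASSWORD": "hunter2"}
filled = inject_secrets("type ${DEMO_PASSWORD} into #password", demo_env)
print(mask_secrets(filled, demo_env))
```

Keeping substitution and masking separate means the real value only exists at the moment a browser action needs it, while logs and prompts see the masked form.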

✨ Key Features

| Feature | Description |
| --- | --- |
| 🗣️ Natural Language Control | Plain English commands for web app testing and automation |
| 📝 Multi-Framework Scripts | Auto-generates Playwright, Cypress, and Selenium code from recorded actions |
| 👁️ Vision Integration | YOLOv11-based UI element detection with bounding box coordinates |
| 🔐 Secure Data Handling | Environment-based credential management with SecretStr support |
| 📊 PDF Report Generation | Comprehensive documentation output with screenshots and structured data |
| ♻️ Repeatable Execution | JSON action recording for consistent replay across unlimited runs |
| 🎯 Element Detection | Smart CSS/XPath selector resolution with hash-based element mapping |
| 🔧 Quality Assurance | Full mypy, flake8, black compliance with automated CI/CD pipeline |
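
To make the "scripts from recorded actions" row concrete, here is a minimal sketch of turning a JSON action log into a Playwright script. The action schema (`tool` and `args` keys), the `TEMPLATES` mapping, and the `actions_to_playwright` helper are all assumptions for illustration; the post does not show the project's real recording format or generator.

```python
import json

# Hypothetical mapping from recorded tool names to Playwright sync API calls.
TEMPLATES = {
    "navigate": "page.goto({url!r})",
    "click": "page.click({selector!r})",
    "fill": "page.fill({selector!r}, {value!r})",
}


def actions_to_playwright(actions_json: str) -> str:
    """Render a JSON list of recorded actions as a Playwright script."""
    lines = [
        "from playwright.sync_api import sync_playwright",
        "",
        "with sync_playwright() as p:",
        "    browser = p.chromium.launch()",
        "    page = browser.new_page()",
    ]
    for action in json.loads(actions_json):
        template = TEMPLATES[action["tool"]]
        lines.append("    " + template.format(**action["args"]))
    lines.append("    browser.close()")
    return "\n".join(lines)


recorded = json.dumps([
    {"tool": "navigate", "args": {"url": "https://github.com/trending"}},
    {"tool": "click", "args": {"selector": "#submit"}},
])
print(actions_to_playwright(recorded))
```

Supporting another framework such as Cypress or Selenium would then mostly be a matter of swapping the template table and the script preamble.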

🧠 Agent Architecture

The LangGraph agent uses a two-node graph with conditional routing:

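from typing import Annotated, Dict, List, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph import StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

class AgentState(TypedDict):
    messages: Annotated[List[BaseMessage], add_messages]
    next: str  # For LangGraph routing
    element_map: Dict[str, str]  # Element hash to xpath mapping
    vision: dict  # Optional vision metadata for LLM context

# Excerpt from inside BrowserAgent (hence `self`); TOOLS is the registered tool list
# Agent workflow: chatbot -> tools -> chatbot (or END)
graph = StateGraph(AgentState)
graph.add_node("agent", self._chatbot)
graph.add_node("tools", ToolNode(TOOLS))
graph.add_conditional_edges("agent", self._route_tools)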
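
The routing function itself is not shown in the post. As a rough sketch of what a function like `_route_tools` might do, the example below uses plain dictionaries instead of LangChain message objects so that it runs without dependencies. The function body is an assumption; only the `"__end__"` string matches LangGraph's `END` constant.

```python
END = "__end__"  # same string value as langgraph.graph.END


def route_tools(state: dict) -> str:
    """Send the run to the tool node if the last message requests tools."""
    messages = state.get("messages", [])
    if not messages:
        return END
    last = messages[-1]
    # An AI message that carries tool calls means there is more work to do.
    return "tools" if last.get("tool_calls") else END


state = {"messages": [{"role": "ai", "tool_calls": [{"name": "click"}]}]}
print(route_tools(state))
```

When the model answers without requesting a tool, the same function returns `END` and the graph stops.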

The agent maintains context across browser sessions. Its ActionService records every tool call with its execution time, arguments, result, and any error, so earlier runs can be replayed and refined.
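
A minimal sketch of that kind of recording, assuming a hypothetical `record_action` decorator and an in-memory `ACTIONS` list; the real ActionService is not shown in this post and likely differs.

```python
import asyncio
import functools
import json
import time

ACTIONS: list[dict] = []  # in-memory log; a real service would persist this


def record_action(func):
    """Record the name, arguments, result, error, and duration of each call."""

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        entry = {"tool": func.__name__, "args": list(args), "kwargs": kwargs}
        start = time.perf_counter()
        try:
            entry["result"] = await func(*args, **kwargs)
            entry["error"] = None
            return entry["result"]
        except Exception as exc:
            entry["result"] = None
            entry["error"] = str(exc)
            raise
        finally:
            # Runs on success and failure, so every call is logged once.
            entry["duration_s"] = time.perf_counter() - start
            ACTIONS.append(entry)

    return wrapper


@record_action
async def navigate(url: str) -> str:
    # Stand-in for a real browser tool.
    return f"navigated to {url}"


asyncio.run(navigate("https://github.com/trending"))
print(json.dumps(ACTIONS, indent=2))
```

Because each entry is plain JSON-serializable data, the same log can feed both replay and script generation.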

Note: The system registers 25+ tools, covering navigation, clicking, form filling, screenshot capture, PDF generation, and script creation.

🚀 Quick Example

Here's how to automate GitHub trending analysis:

import asyncio
from talk2browser.agent.agent import BrowserAgent

async def main():
    # Prepare a test scenario
    task = """Go to https://github.com/trending.
    Extract information about the top 10 trending repositories including:
    - Repository name, owner, description, language, stars, forks, URL
    Create a comprehensive PDF report and generate a Playwright script."""

    async with BrowserAgent(headless=False, info_mode=True) as agent:
        response = await agent.run(task)
        print("Agent response:", response)

asyncio.run(main())

CLI Usage

Or use the CLI with predefined tasks:

python examples/test_agent.py --task github_trending

# Available tasks:
# github_trending, selenium, cypress, playwright, tiktok_trending
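
For readers wiring up a similar entry point, a task selector like this can be built with `argparse`. The sketch below is hypothetical and is not the contents of `examples/test_agent.py`; the task names come from the list above, while the default value is an assumption.

```python
import argparse

TASKS = ["github_trending", "selenium", "cypress", "playwright", "tiktok_trending"]


def parse_args(argv=None) -> argparse.Namespace:
    """Parse a --task flag restricted to the predefined task names."""
    parser = argparse.ArgumentParser(description="Run a predefined task")
    # The default is an assumption made for this sketch.
    parser.add_argument("--task", choices=TASKS, default="github_trending")
    return parser.parse_args(argv)


print(parse_args(["--task", "cypress"]).task)
```

Using `choices` makes `argparse` reject unknown task names with a usage message before any browser is launched.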

🎮 Getting Started

Prerequisites

  • Python 3.10+ (required for modern type hints)
  • Git (for cloning the repository)
  • Anthropic API Key (for Claude LLM functionality)

Installation

git clone https://github.com/talk2silicon/talk2browser
cd talk2browser

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install with development dependencies
pip install -e ".[dev]"  # quotes keep shells such as zsh from expanding the brackets

# Install Playwright browsers
playwright install

# Set up environment variables
cp .env.example .env
# Edit .env and add your ANTHROPIC_API_KEY

Quick Test

python examples/test_agent.py --task github_trending

🔍 Code Quality & Development

This project maintains high code quality through automated checks:

  • 🧹 Code Linting (flake8): Style and syntax checking
  • 🎨 Code Formatting (black): Consistent code formatting
  • 🔍 Type Checking (mypy): Static type analysis with zero errors
  • 🧪 Unit Tests (pytest): Automated testing

Local Development

# Run all quality checks
flake8 src/ tests/
black --check src/ tests/
mypy src/
pytest

# Auto-fix formatting
black src/ tests/

🛠️ Technical Architecture

Core Components

talk2browser/
├── src/talk2browser/
│   ├── agent/              # LangGraph agent implementation
│   │   ├── agent.py        # Main BrowserAgent class
│   │   └── llm_singleton.py # LLM instance management
│   ├── browser/            # Browser interaction layer
│   │   ├── client.py       # PlaywrightClient wrapper
│   │   ├── page.py         # BrowserPage abstraction
│   │   └── page_manager.py # Multi-page session management
│   ├── services/           # Core services
│   │   ├── action_service.py      # Action recording/replay
│   │   ├── sensitive_data_service.py # Secure credential handling
│   │   └── vision_service.py      # YOLOv11 integration
│   ├── tools/              # LangGraph tool registry
│   │   ├── browser_tools.py       # 25+ browser automation tools
│   │   ├── script_tools.py        # Script generation tools
│   │   └── file_system_tools.py   # File/PDF operations
│   └── utils/              # Utility functions
├── examples/               # Example scripts and usage
└── tests/                 # Test suite

Tool Registration System

@tool
@resolve_hash_args
async def click(selector: str, *, timeout: int = 5000) -> str:
    """Click on an element matching the CSS selector."""
    # Automatic tool registration with LangGraph
    # Hash-based element resolution
    # Error handling and logging
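
The post does not show how `resolve_hash_args` is implemented. One plausible reading, sketched here under assumptions, is a decorator that swaps element hashes in string arguments for their stored selectors before the tool runs. The module-level `ELEMENT_MAP` and the stand-in `click` body are hypothetical; in the project the map lives in the agent state.

```python
import asyncio
import functools

# Hypothetical module-level map used only for this sketch.
ELEMENT_MAP = {"#abc123": "xpath=//button[@id='submit']"}


def resolve_hash_args(func):
    """Swap element hashes in string arguments for their stored selectors."""

    def resolve(value):
        if isinstance(value, str):
            # Unknown strings pass through unchanged, so plain CSS still works.
            return ELEMENT_MAP.get(value, value)
        return value

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        args = tuple(resolve(a) for a in args)
        kwargs = {k: resolve(v) for k, v in kwargs.items()}
        return await func(*args, **kwargs)

    return wrapper


@resolve_hash_args
async def click(selector: str, *, timeout: int = 5000) -> str:
    # Stand-in for a real Playwright click.
    return f"clicked {selector}"


result = asyncio.run(click("#abc123"))
print(result)
```

Short hashes keep prompts compact: the LLM only has to repeat a token like `#abc123`, and the long XPath never leaves the tool layer.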

State Management

# Agent maintains persistent state across tool calls
state = {
    "messages": [HumanMessage, AIMessage, ToolMessage],
    "next": "tools",  # or "agent" or END
    "element_map": {"#abc123": "xpath=//button[@id='submit']"},
    "vision": {"detections": [...], "image_path": "..."}
}
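
As an illustration of how the `vision` entry could reach the LLM, the sketch below renders detections as prompt text. The detection fields (`label`, `confidence`, `box`) and the `vision_to_context` helper are assumptions; the post elides the real detection format.

```python
def vision_to_context(vision: dict, min_confidence: float = 0.5) -> str:
    """Render detections as plain text that can be appended to an LLM prompt."""
    lines = []
    for det in vision.get("detections", []):
        if det["confidence"] < min_confidence:
            continue  # drop low-confidence boxes to keep the prompt short
        x1, y1, x2, y2 = det["box"]
        lines.append(
            f"- {det['label']} ({det['confidence']:.2f}) at "
            f"[{x1}, {y1}, {x2}, {y2}]"
        )
    if not lines:
        return ""
    return "Detected UI elements:\n" + "\n".join(lines)


vision = {
    "detections": [
        {"label": "button", "confidence": 0.91, "box": [10, 20, 110, 60]},
        {"label": "icon", "confidence": 0.30, "box": [0, 0, 5, 5]},
    ],
    "image_path": "page.png",
}
print(vision_to_context(vision))
```

Returning an empty string when nothing passes the threshold lets the caller skip the vision section of the prompt entirely.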

🤝 Community Questions

I'd love to hear from the LangChain community:

  1. What real-world automation workflows could benefit from natural language control? (e.g., E2E testing, data extraction, monitoring)
  2. How do you currently approach multi-step browser automation with state persistence across actions?
  3. What LangGraph patterns have you found most effective for conditional routing and error recovery in agent workflows?
  4. How do you handle dynamic web content and element detection in your automation projects?
  5. What's your experience with integrating computer vision (YOLO, OCR) into LangChain/LangGraph workflows?
  6. How do you manage sensitive data and credentials in production automation systems?
  7. What testing frameworks would you most want to see supported for script generation?

⚠️ What to Watch Out For

  • Vision/YOLOv11 Integration: Optional feature. Requires a YOLOv11 model file and additional setup. Not required for core browser automation.
  • Script Summarization: Planned. AI-powered summaries of generated automation scripts are on the roadmap but not yet implemented.
  • PDF Generation: Fully supported. Generates comprehensive PDF reports with execution details and screenshots.
  • Manual Action Override: Partially implemented. Human-in-the-loop/manual override is available for some actions and is being actively enhanced for broader coverage.

🔮 Future Roadmap

  • PDF Script Documentation: Generate comprehensive PDF reports for generated test scripts with execution details and screenshots
  • Script Summarization: AI-powered summaries of generated automation scripts with key actions and validation points
  • Enhanced Manual Action Override: Improved human-in-the-loop capabilities for manual intervention during automation
  • Performance Optimization: Faster element detection and action execution
  • Error Handling: Better recovery from browser automation failures
  • Test Coverage: Expanded unit and integration test suite

🛠️ Technical Stack

  • LangGraph: Agent orchestration and state management
  • Playwright: Browser automation engine with 25+ registered tools
  • Claude 3 Opus/Haiku: Natural language reasoning and planning
  • YOLOv11: Computer vision for UI element detection
  • Python 3.10+: Core implementation with full type safety
  • Pydantic: Data validation and settings management

Looking for feedback, use cases, and contributions! What browser automation challenges could this help solve for your projects? 🤔

Feel free to star ⭐ the repo if you find this interesting!

🏷️ Tags

#langgraph #browser-automation #playwright #ai-agents #test-automation #natural-language #python #claude #computer-vision #pdf-generation
