Ever wanted to automate real browser actions just by describing what you want? Meet talk2browser, a LangGraph-powered agent that turns prompts into real-time web actions and reusable test scripts.
Hi everyone! I'm excited to share talk2browser, which uses LangGraph's agent orchestration to build a self-improving browser automation system. Inspired by the open-source Browser-Use project, it takes natural language tasks, executes them as real browser actions, and generates reusable test scripts along the way.
Resources
- Website: https://www.talk2browser.com
- GitHub Repository: https://github.com/talk2silicon/talk2browser
- Demo Video: YouTube Demo
- License: MIT
LangGraph Implementation
talk2browser showcases advanced LangGraph patterns:
- Agent State Management: complex browser workflows with conditional transitions, using an AgentState TypedDict
- Dynamic Tool Registration: 25+ browser automation tools automatically registered as LangGraph tools via decorators
- Multi-Step Orchestration: Planning → Execution → Script Generation phases with state persistence
- Self-Improving Workflows: action recording and replay for iterative improvement
- Vision Integration: YOLOv11-based UI element detection with LLM context injection
- Sensitive Data Handling: secure credential management via environment variable injection
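Environment-variable injection can be illustrated with a short sketch. This is not the project's sensitive_data_service.py. The {{NAME}} placeholder syntax and both helper functions are hypothetical. The idea is that the prompt and the LLM's tool-call arguments only ever contain the placeholder. The real value is substituted from the environment right before the browser action runs, and it is masked again before anything is logged.

```python
import os
import re
from typing import Mapping, Optional

# Hypothetical placeholder syntax: {{ENV_VAR_NAME}}
PLACEHOLDER = re.compile(r"\{\{([A-Z][A-Z0-9_]*)\}\}")


def inject_secrets(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Swap {{NAME}} placeholders for real values just before a tool runs."""
    source = os.environ if env is None else env

    def _lookup(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in source:
            raise KeyError(f"missing environment variable: {name}")
        return source[name]

    return PLACEHOLDER.sub(_lookup, text)


def mask_secrets(text: str, secrets: Mapping[str, str]) -> str:
    """Put placeholders back before text reaches logs or the LLM."""
    for name, value in secrets.items():
        if value:
            text = text.replace(value, "{{" + name + "}}")
    return text
```

With env = {"LOGIN_PASSWORD": "hunter2"}, the tool argument "{{LOGIN_PASSWORD}}" becomes "hunter2" only inside the tool call. The raw password is therefore never part of the conversation history.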
Key Features
| Feature | Description |
|---|---|
| Natural Language Control | Plain English commands for web app testing and automation |
| Multi-Framework Scripts | Auto-generates Playwright, Cypress, and Selenium code from recorded actions |
| Vision Integration | YOLOv11-based UI element detection with bounding box coordinates |
| Secure Data Handling | Environment-based credential management with SecretStr support |
| PDF Report Generation | Comprehensive documentation output with screenshots and structured data |
| Repeatable Execution | JSON action recording for consistent replay across unlimited runs |
| Element Detection | Smart CSS/XPath selector resolution with hash-based element mapping |
| Quality Assurance | Full mypy, flake8, black compliance with automated CI/CD pipeline |
Agent Architecture
The LangGraph agent uses a two-node graph with conditional routing:
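```python
from typing import Annotated, Dict, List, TypedDict
from langchain_core.messages import BaseMessage
from langgraph.graph import START, StateGraph
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode

class AgentState(TypedDict):
    messages: Annotated[List[BaseMessage], add_messages]
    next: str  # For LangGraph routing
    element_map: Dict[str, str]  # Element hash to XPath mapping
    vision: dict  # Optional vision metadata for LLM context

# Inside BrowserAgent (self._chatbot, self._route_tools and TOOLS are defined there):
# agent -> tools -> agent, or END when no tool is requested
graph = StateGraph(AgentState)
graph.add_node("agent", self._chatbot)
graph.add_node("tools", ToolNode(TOOLS))
graph.add_edge(START, "agent")  # entry point implied by the workflow above
graph.add_conditional_edges("agent", self._route_tools)
graph.add_edge("tools", "agent")  # tool results flow back to the agent
```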
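The snippet references self._route_tools without showing it. A typical router for this agent → tools → agent loop inspects the last message. If the model requested tool calls, it routes to "tools"; otherwise it ends the run. The sketch below is a stand-alone approximation, not the project's code. It assumes LangChain-style messages that expose a tool_calls list, and FakeAIMessage is a hypothetical stand-in so the sketch runs without LangGraph installed. LangGraph also ships a prebuilt tools_condition that performs the same check.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

END = "__end__"  # Same string value as langgraph.graph.END


@dataclass
class FakeAIMessage:
    """Hypothetical stand-in for langchain_core.messages.AIMessage."""

    content: str
    tool_calls: List[Dict[str, Any]] = field(default_factory=list)


def route_tools(state: Dict[str, Any]) -> str:
    """Return the next node: "tools" if the last message requests tools, else END."""
    messages = state.get("messages", [])
    if not messages:
        return END
    last = messages[-1]
    # Messages without a tool_calls attribute (e.g. HumanMessage) simply end the run.
    if getattr(last, "tool_calls", None):
        return "tools"
    return END
```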
The agent maintains context across browser sessions and learns from previous automation patterns through the ActionService, which records every tool call with its execution time, arguments, result, and any error.
Note: The system registers 25+ tools covering navigation, clicking, form filling, screenshot capture, PDF generation, and script creation.
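A decorator can approximate the recording behaviour described for the ActionService, which logs the tool name, arguments, result, error, and execution time. This is a hypothetical sketch using only the standard library. The ActionRecorder class, its record decorator, and the JSON field names are assumptions, not the project's API.

```python
import asyncio
import functools
import inspect
import json
import time
from typing import Any, Awaitable, Callable, Dict, List

AsyncTool = Callable[..., Awaitable[Any]]


class ActionRecorder:
    """Hypothetical recorder that logs every async tool call for later replay."""

    def __init__(self) -> None:
        self.actions: List[Dict[str, Any]] = []

    def record(self, func: AsyncTool) -> AsyncTool:
        sig = inspect.signature(func)

        @functools.wraps(func)
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Store arguments by parameter name, including defaults
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            entry: Dict[str, Any] = {"tool": func.__name__, "args": dict(bound.arguments)}
            start = time.perf_counter()
            try:
                result = await func(*args, **kwargs)
                entry.update(result=result, error=None)
                return result
            except Exception as exc:
                entry.update(result=None, error=repr(exc))
                raise
            finally:
                # Runs on success and failure, so every call is logged with its timing
                entry["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
                self.actions.append(entry)

        return wrapper

    def to_json(self) -> str:
        # default=str keeps non-serializable results from breaking the export
        return json.dumps(self.actions, indent=2, default=str)


recorder = ActionRecorder()


@recorder.record
async def navigate(url: str) -> str:
    await asyncio.sleep(0)  # stand-in for a real page.goto(url)
    return f"Navigated to {url}"


@recorder.record
async def click(selector: str, timeout: int = 5000) -> str:
    raise TimeoutError(f"No element matches {selector}")  # simulated failure
```

Because the log is plain JSON, it can be saved after a run and later replayed or fed to a script generator.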
Quick Example
Here's how to automate GitHub trending analysis:
```python
import asyncio

from talk2browser.agent.agent import BrowserAgent


async def main() -> None:
    # Prepare a test scenario
    task = """Go to https://github.com/trending.
    Extract information about the top 10 trending repositories including:
    - Repository name, owner, description, language, stars, forks, URL
    Create a comprehensive PDF report and generate a Playwright script."""

    async with BrowserAgent(headless=False, info_mode=True) as agent:
        response = await agent.run(task)
        print("Agent response:", response)


if __name__ == "__main__":
    asyncio.run(main())
```
CLI Usage
Or use the CLI with predefined tasks:
```bash
python examples/test_agent.py --task github_trending
# Available tasks:
# github_trending, selenium, cypress, playwright, tiktok_trending
```
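Generating a script from recorded actions is essentially a translation step. The hypothetical sketch below turns a JSON list of recorded actions into a Playwright Python script. The {"tool": ..., "args": {...}} action schema is an assumption. The generated script uses the real sync API calls page.goto, page.click, and page.fill. talk2browser's own script_tools.py may work quite differently, for example by having the LLM write the script.

```python
import json
from typing import Any, Dict, List

HEADER = [
    "from playwright.sync_api import sync_playwright",
    "",
    "",
    "def run() -> None:",
    "    with sync_playwright() as p:",
    "        browser = p.chromium.launch(headless=True)",
    "        page = browser.new_page()",
]
FOOTER = [
    "        browser.close()",
    "",
    "",
    'if __name__ == "__main__":',
    "    run()",
]


def action_to_line(action: Dict[str, Any]) -> str:
    """Translate one recorded action into a Playwright call (hypothetical schema)."""
    tool, args = action["tool"], action["args"]
    if tool == "navigate":
        return f"page.goto({args['url']!r})"
    if tool == "click":
        return f"page.click({args['selector']!r})"
    if tool == "fill":
        return f"page.fill({args['selector']!r}, {args['value']!r})"
    raise ValueError(f"No Playwright mapping for tool: {tool}")


def generate_playwright_script(actions_json: str) -> str:
    """Build a runnable Playwright script from a JSON list of recorded actions."""
    actions: List[Dict[str, Any]] = json.loads(actions_json)
    # Indent each call to sit inside the `with sync_playwright()` block
    body = ["        " + action_to_line(a) for a in actions]
    return "\n".join(HEADER + body + FOOTER) + "\n"
```

Supporting Cypress or Selenium in this design means adding another action_to_line mapping plus a matching header and footer. The recorded actions stay the same.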
Getting Started
Prerequisites
- Python 3.10+ (required for modern type hints)
- Git (for cloning the repository)
- Anthropic API Key (for Claude LLM functionality)
Installation
```bash
git clone https://github.com/talk2silicon/talk2browser
cd talk2browser

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install with development dependencies (quotes stop shells like zsh from globbing the brackets)
pip install -e ".[dev]"

# Install Playwright browsers
playwright install

# Set up environment variables
cp .env.example .env
# Edit .env and add your ANTHROPIC_API_KEY
```
Quick Test
```bash
python examples/test_agent.py --task github_trending
```
Code Quality & Development
This project maintains high code quality through automated checks:
- Code Linting (flake8) - Style and syntax checking
- Code Formatting (black) - Consistent code formatting
- Type Checking (mypy) - Static type analysis with zero errors
- Unit Tests (pytest) - Automated testing
Local Development
```bash
# Run all quality checks
flake8 src/ tests/
black --check src/ tests/
mypy src/
pytest

# Auto-fix formatting
black src/ tests/
```
Technical Architecture
Core Components
```text
talk2browser/
├── src/talk2browser/
│   ├── agent/                     # LangGraph agent implementation
│   │   ├── agent.py               # Main BrowserAgent class
│   │   └── llm_singleton.py       # LLM instance management
│   ├── browser/                   # Browser interaction layer
│   │   ├── client.py              # PlaywrightClient wrapper
│   │   ├── page.py                # BrowserPage abstraction
│   │   └── page_manager.py        # Multi-page session management
│   ├── services/                  # Core services
│   │   ├── action_service.py      # Action recording/replay
│   │   ├── sensitive_data_service.py  # Secure credential handling
│   │   └── vision_service.py      # YOLOv11 integration
│   ├── tools/                     # LangGraph tool registry
│   │   ├── browser_tools.py       # 25+ browser automation tools
│   │   ├── script_tools.py        # Script generation tools
│   │   └── file_system_tools.py   # File/PDF operations
│   └── utils/                     # Utility functions
├── examples/                      # Example scripts and usage
└── tests/                         # Test suite
```
Tool Registration System
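```python
from langchain_core.tools import tool

# resolve_hash_args is talk2browser's own decorator (import path not shown in this post)


@tool
@resolve_hash_args
async def click(selector: str, *, timeout: int = 5000) -> str:
    """Click on an element matching the CSS selector."""
    # @tool turns the function into a LangChain tool that LangGraph's ToolNode can call
    # @resolve_hash_args maps element hashes to full selectors
    # Body omitted: the click itself, error handling and logging
    ...
```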
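The post does not show the resolve_hash_args decorator. Judging from the element_map in the agent state (hash to XPath), its job is presumably to let the LLM pass a short element hash such as #abc123 and have the tool receive the full selector. Below is a hypothetical stand-alone sketch of that idea. The module-level ELEMENT_MAP stands in for wherever the real mapping lives, and the click body is a placeholder rather than a real Playwright call.

```python
import asyncio
import functools
import inspect
from typing import Any, Awaitable, Callable, Dict

# Hypothetical hash -> selector map, filled in when the page's elements are indexed
ELEMENT_MAP: Dict[str, str] = {"#abc123": "xpath=//button[@id='submit']"}


def resolve_hash_args(func: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
    """Replace any string argument that is a known element hash with its selector."""
    sig = inspect.signature(func)

    @functools.wraps(func)
    async def wrapper(*args: Any, **kwargs: Any) -> Any:
        bound = sig.bind(*args, **kwargs)
        for name, value in list(bound.arguments.items()):
            if isinstance(value, str) and value in ELEMENT_MAP:
                bound.arguments[name] = ELEMENT_MAP[value]
        # Unknown values (e.g. ordinary CSS selectors) pass through unchanged
        return await func(*bound.args, **bound.kwargs)

    return wrapper


@resolve_hash_args
async def click(selector: str, *, timeout: int = 5000) -> str:
    # A real tool would call page.click(selector, timeout=timeout) here.
    return f"clicked {selector} (timeout={timeout}ms)"


if __name__ == "__main__":
    print(asyncio.run(click("#abc123")))
```

functools.wraps keeps the name and docstring, and inspect.signature follows __wrapped__ back to the original parameters. This matters if an outer @tool builds its argument schema from them.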
State Management
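```python
from langchain_core.messages import AIMessage, HumanMessage, ToolMessage

# Agent maintains persistent state across tool calls
# (illustrative shape: the message list holds instances of these classes)
state = {
    "messages": [HumanMessage, AIMessage, ToolMessage],
    "next": "tools",  # or "agent" or END
    "element_map": {"#abc123": "xpath=//button[@id='submit']"},
    "vision": {"detections": [...], "image_path": "..."},
}
```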
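This state evolves through partial updates: each node returns one, and LangGraph merges it into the state. Keys annotated with a reducer (here add_messages) are combined with the old value, and all other keys are overwritten. The sketch below imitates that merge in plain Python. It is a simplification. The real add_messages also assigns message IDs and replaces an existing message when an update carries the same ID, and the plain strings stand in for message objects.

```python
from typing import Any, Callable, Dict, List

Reducer = Callable[[Any, Any], Any]


def append_messages(old: List[Any], new: List[Any]) -> List[Any]:
    """Simplified stand-in for langgraph's add_messages: append only."""
    return list(old) + list(new)


# Keys with a reducer are merged; all other keys are last-write-wins.
REDUCERS: Dict[str, Reducer] = {"messages": append_messages}


def apply_update(state: Dict[str, Any], update: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new state with a node's partial update merged in."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        merged[key] = reducer(merged.get(key, []), value) if reducer else value
    return merged
```

This is why a node can return only {"messages": [tool_result]} without wiping out the conversation history or the element map.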
Community Questions
I'd love to hear from the LangChain community:
- What real-world automation workflows could benefit from natural language control? (e.g., E2E testing, data extraction, monitoring)
- How do you currently approach multi-step browser automation with state persistence across actions?
- What LangGraph patterns have you found most effective for conditional routing and error recovery in agent workflows?
- How do you handle dynamic web content and element detection in your automation projects?
- What's your experience with integrating computer vision (YOLO, OCR) into LangChain/LangGraph workflows?
- How do you manage sensitive data and credentials in production automation systems?
- What testing frameworks would you most want to see supported for script generation?
What to Watch Out For
- Vision/YOLOv11 Integration: Optional feature. Requires a YOLOv11 model file and additional setup. Not required for core browser automation.
- Script Summarization (planned): AI-powered summaries of generated automation scripts are on the roadmap but not yet implemented.
- PDF Generation: Fully supported. Generates comprehensive PDF reports with execution details and screenshots.
- Manual Action Override: Partially implemented. Human-in-the-loop/manual override is available for some actions and is being actively enhanced for broader coverage.
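For the optional vision step, the LLM mainly needs a compact text summary of what the detector found. The following is a hypothetical sketch of that formatting step. It assumes each entry in the state's vision "detections" list is a dict with a label, a confidence, and a pixel bbox of [x1, y1, x2, y2]; the post does not show the real vision_service.py output format. The sketch drops low-confidence boxes and reports centre points that a click tool could target.

```python
from typing import Any, Dict, List


def detections_to_context(
    detections: List[Dict[str, Any]], min_confidence: float = 0.5
) -> str:
    """Summarise UI detections as plain text for injection into the LLM prompt."""
    kept = [d for d in detections if d["confidence"] >= min_confidence]
    lines = []
    # Highest-confidence detections first
    for i, det in enumerate(sorted(kept, key=lambda d: d["confidence"], reverse=True)):
        x1, y1, x2, y2 = det["bbox"]
        cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
        lines.append(
            f"[{i}] {det['label']} conf={det['confidence']:.2f} "
            f"bbox=({x1},{y1},{x2},{y2}) center=({cx},{cy})"
        )
    if not lines:
        return "Vision: no UI elements detected above threshold."
    return "Vision detections:\n" + "\n".join(lines)
```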
Future Roadmap
- PDF Script Documentation: Generate comprehensive PDF reports for generated test scripts with execution details and screenshots
- Script Summarization: AI-powered summaries of generated automation scripts with key actions and validation points
- Enhanced Manual Action Override: Improved human-in-the-loop capabilities for manual intervention during automation
- Performance Optimization: Faster element detection and action execution
- Error Handling: Better recovery from browser automation failures
- Test Coverage: Expanded unit and integration test suite
Technical Stack
- LangGraph: Agent orchestration and state management
- Playwright: Browser automation engine behind the 25+ registered tools
- Claude 3 Opus/Haiku: Natural language reasoning and planning
- YOLOv11: Computer vision for UI element detection
- Python 3.10+: Core implementation with full type hints, checked by mypy
- Pydantic: Data validation and settings management
Looking for feedback, use cases, and contributions! What browser automation challenges could this help solve for your projects? ๐ค
Feel free to star ⭐ the repo if you find this interesting!
Tags
#langgraph
#browser-automation
#playwright
#ai-agents
#test-automation
#natural-language
#python
#claude
#computer-vision
#pdf-generation