Open-source web agent outclasses Browser-Use

Andrea — Tue, 08 Apr 2025 17:26:50 +0000

Complete post: https://github.com/nottelabs/open-operator-evals

The race for open-source web agents is heating up, and some very bold claims are being thrown around. We cut through the noise with a fully transparent, reproducible benchmark to get a sense of the current scene. Everything is open, so you can see exactly how different systems perform, and perhaps take a closer look at others' claims ;)

Rank	Provider	Agent Self-Report	LLM Evaluation	Time per Task	Task Reliability
🏆	Notte	86.2%	79.0%	47s	96.6%
2️⃣	Browser-Use	77.3%	60.2%	113s	83.3%
3️⃣	Convergence	38.4%	31.4%	83s	50%

Results are averaged over tasks and then over 8 separate runs to account for the high variance inherent in web agent systems. In our benchmarks, each provider ran each task 8 times using the same configuration, headless mode, and strict limits: 6 minutes or 20 steps maximum—because no one wants an agent burning 80 steps to find a lasagna recipe. Agents had to handle execution and failures autonomously.
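As an illustration of the limits described above, here is a minimal sketch of a run loop that enforces both budgets. It is not the benchmark's actual harness: `run_with_budget` and the `step` callable are hypothetical names, and each provider may count a "step" in its own way.

```python
import time
from typing import Callable, Optional

MAX_STEPS = 20          # step limit stated in the post
MAX_SECONDS = 6 * 60    # 6-minute limit stated in the post


def run_with_budget(
    step: Callable[[int], Optional[str]],
    max_steps: int = MAX_STEPS,
    max_seconds: float = MAX_SECONDS,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call `step` until it returns an answer or a budget is exhausted.

    `step` is a hypothetical stand-in for one agent action: it returns
    None to continue, or a string answer when the agent declares it is done.
    """
    start = clock()
    for i in range(max_steps):
        # Check the wall-clock budget before spending another step.
        if clock() - start > max_seconds:
            return {"status": "timeout", "steps": i, "answer": None}
        answer = step(i)
        if answer is not None:
            return {"status": "done", "steps": i + 1, "answer": answer}
    return {"status": "step_limit", "steps": max_steps, "answer": None}


# Toy agent that "finishes" on its third step.
print(run_with_budget(lambda i: "lasagna recipe" if i == 2 else None))
```

The time check happens only between steps, so a single slow step can still overrun the limit; a real harness would probably need a hard timeout around the step itself.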

Key highlights

You can investigate all replays/logs and reproduce the benchmark yourself 👇🏻

Notte leads the benchmark with 86.2% self-reported success and 79.0% LLM-verified completion. It also has the fastest execution time at 47s per task and the highest task reliability at 96.6% (the percentage of tasks an agent completes successfully at least once when given multiple attempts).
Browser-Use falls notably short of the results claimed in its blog post: 77.3% self-reported success and 60.2% LLM-verified success versus a stated 89%. Without access to their results files, we cannot verify the reported performance.
Convergence performs significantly below the other two, with 38.4% self-reported success and 31.4% LLM-verified success, primarily due to CAPTCHA and bot detection issues. However, it shows strong self-awareness, achieving near-perfect alignment in some runs, which suggests potential for improvement if the detection challenges are overcome.

PS: We are actively hiring software and research engineers 🪩

The metrics

In the main table

Agent Self-Report: The success rate reported by the agent itself across all tasks. This reflects the agent's internal confidence in its performance.
LLM Evaluation: The success rate determined by GPT-4 acting as a judge with WebVoyager's evaluation prompt, assessing the agent's actions and outputs. This provides an independent measure of task completion.
Time per Task: The average execution time in seconds for the agent to attempt and complete a single task. This indicates the efficiency and speed of the agent's operations.
Task Reliability: The percentage of tasks the agent successfully completed at least once across multiple attempts (8 in this benchmark). This metric highlights the agent's ability to handle a diverse set of tasks given sufficient retries, indicating system robustness.
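To make the difference between an average success rate and Task Reliability concrete, here is a minimal sketch. The function names and the toy data are illustrative assumptions, not the benchmark's code or results.

```python
from typing import Dict, List


def success_rate(results: Dict[str, List[bool]]) -> float:
    """Mean success over every attempt of every task."""
    attempts = [ok for runs in results.values() for ok in runs]
    return sum(attempts) / len(attempts)


def task_reliability(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks solved at least once across their attempts."""
    solved = sum(1 for runs in results.values() if any(runs))
    return solved / len(results)


# Toy data, not from the benchmark: 3 tasks, 4 attempts each.
toy = {
    "task_a": [True, True, False, True],
    "task_b": [False, False, True, False],
    "task_c": [False, False, False, False],
}
print(f"success rate: {success_rate(toy):.3f}")
print(f"reliability:  {task_reliability(toy):.3f}")
```

In this toy case only 4 of 12 attempts succeed, yet 2 of 3 tasks are solved at least once, which is why reliability can sit well above the average success rate.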

In the breakdowns

Alignment: Ratio of Agent Self-Report to LLM Evaluation. Values above 1.0 mean the agent overestimates its success; values below 1.0 mean it underestimates. A value close to 1.0 is best, and underestimating (<1.0) is typically preferable to overestimating.
Mismatch: The number of instances where the agent claimed success but the evaluator disagreed. This reveals how often the agent incorrectly assessed its own performance.
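Both breakdown metrics can be computed from per-task pairs of (agent claim, evaluator verdict). The sketch below is an assumption about how such a computation might look, not the benchmark's implementation. The example data is constructed to be consistent with the Browser-Use row 1743016360-5 (0.800, 0.600, 1.333, 6 mismatches), assuming 30 evaluated tasks; the real per-task outcomes are in the repository logs and may be distributed differently.

```python
from typing import List, Tuple


def alignment_and_mismatch(pairs: List[Tuple[bool, bool]]) -> Tuple[float, int]:
    """pairs holds (agent_claimed_success, evaluator_judged_success) per task."""
    self_report = sum(claimed for claimed, _ in pairs) / len(pairs)
    llm_eval = sum(judged for _, judged in pairs) / len(pairs)
    if llm_eval == 0:
        raise ValueError("alignment is undefined when the evaluator passes no task")
    # Mismatch counts only one direction: agent says success, evaluator says failure.
    mismatch = sum(1 for claimed, judged in pairs if claimed and not judged)
    return self_report / llm_eval, mismatch


# Constructed example: 18 agreed successes, 6 false claims, 6 agreed failures.
pairs = [(True, True)] * 18 + [(True, False)] * 6 + [(False, False)] * 6
alignment, mismatch = alignment_and_mismatch(pairs)
print(f"alignment: {alignment:.3f}, mismatch: {mismatch}")
```

Because Mismatch only counts claimed successes that the evaluator rejected, a run can have an alignment below 1.0 and still show a non-zero mismatch count, as in Notte's run 1743001170-7.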

The dataset

WebVoyager is a dataset of ~600 tasks for web agents. Example:

task: Book a journey with return option on same day from Edinburg to Manchester for Tomorrow, and show me the lowest price option available
url: https://www.google.com/travel/flights

An agent navigates the site and returns a success status and an answer. The agent’s self-reported success is unreliable on its own, as agents may misjudge task completion. WebVoyager addresses this with an independent LLM evaluator that judges success based on agent actions and screenshots.

The challenge of high variance

Beyond known limitations like outdated web content, a key issue is the high variance in agent performance. These systems, powered by non-deterministic LLMs and operating on a constantly changing web, often yield inconsistent results. Reasoning errors, execution failures, and unpredictable network behavior make single-run evaluations unreliable. To counter this, we propose to run each task multiple times for a much more accurate view—averaging results helps smooth out randomness and gives a more statistically sound estimate of performance.
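To see how much single runs can differ, the sketch below takes the eight per-run LLM Evaluation scores from the Notte breakdown table later in this post and summarizes them. The choice of summary statistics (sample standard deviation and standard error) is an assumption made for illustration, not something the benchmark reports.

```python
from statistics import mean, stdev

# Per-run LLM Evaluation scores copied from the Notte breakdown table in this post.
notte_llm_eval = [0.857, 0.767, 0.800, 0.733, 0.759, 0.893, 0.759, 0.750]

avg = mean(notte_llm_eval)
spread = stdev(notte_llm_eval)                # sample standard deviation across runs
stderr = spread / len(notte_llm_eval) ** 0.5  # standard error of the mean

print(f"mean over runs:     {avg:.3f}")
print(f"std across runs:    {spread:.3f}")
print(f"stderr of the mean: {stderr:.3f}")
print(f"single-run range:   {min(notte_llm_eval):.3f} to {max(notte_llm_eval):.3f}")
```

The mean comes out at about 0.790, matching the 79.0% headline figure, while individual runs range from 0.733 to 0.893. Reporting any one of those runs alone would give a noticeably different picture.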

WebVoyager30

To reduce variance and improve reproducibility, we sampled WebVoyager30—a 30-task subset across 15 diverse websites. It retains the full dataset’s complexity while enabling practical multi-run evaluation, offering a more reliable benchmark for the community.

Running 30 tasks × 8 times (240 runs total) is far more informative than running 600 tasks once, as it averages out randomness and provides a statistically sounder view of performance. Running all 600 tasks 8× would be ideal but is often impractical due to compute costs and time, making fast and accessible reproduction difficult.

The selected tasks are neither trivial nor overly complex—they reflect the overall difficulty of the full dataset, making this a reasonable and cost-effective proxy.

Breakdowns

Benchmark results breakdown for each provider.

Notte

Provider: notte
Version: [v1.3.3](https://github.com/nottelabs/notte/releases/tag/v1.3.3)
Reasoning: gemini/gemini-2.0-flash

Notte leads the benchmark with 86.2% self-reported success and 79.0% LLM-verified completion, along with the fastest execution time at 47s per task and the highest task reliability at 96.6%. It shows consistent performance, with self-assessments slightly overestimating results. Alignment ratios range from 0.960 to 1.183, with low mismatch counts (mostly 3). Task times are consistently short (45-51s), and run 1743001170-7 came closest to perfect alignment at 0.960.

Runs	Agent Self-Report	LLM Evaluation	Alignment	Mismatch	Time per Task
1743001170-0	0.929	0.857	1.084	3	47s
1743001170-3	0.867	0.767	1.130	3	50s
1743001170-4	0.867	0.800	1.084	3	51s
1743001170-6	0.867	0.733	1.183	4	45s
1743001170-1	0.862	0.759	1.136	3	47s
1743001170-7	0.857	0.893	0.960	1	47s
1743001170-2	0.828	0.759	1.091	2	45s
1743001170-5	0.821	0.750	1.095	3	49s

Browser-Use

Provider: Browser-Use
Version: [v0.1.40](https://github.com/browser-use/browser-use/releases/tag/0.1.40)
Reasoning: openai/gpt-4o

Browser-Use reported an 89% success rate on WebVoyager, but we were unable to replicate these results despite our efforts, both on WebVoyager30 with multiple retries and on the full dataset in a single shot. We also tested different agent and browser configurations, as well as lenient interpretations of ambiguous outcomes, but could not reach their reported performance. Browser-Use shows higher alignment ratios (1.144–1.534), indicating roughly 14–53% overestimation of its abilities. It also has more mismatches (5–8 in all runs but one), reflecting a bigger gap between self-assessment and performance.

Runs	Agent Self-Report	LLM Evaluation	Alignment	Mismatch	Time per Task
1743016360-6	0.833	0.667	1.249	7	98s
1743016360-4	0.815	0.667	1.222	5	119s
1743016360-1	0.808	0.577	1.400	7	127s
1743016360-5	0.800	0.600	1.333	6	95s
1743016360-2	0.786	0.679	1.158	5	132s
1743016360-7	0.767	0.500	1.534	8	105s
1743016360-3	0.708	0.542	1.306	5	113s
1743016360-0	0.667	0.583	1.144	2	118s

Convergence

Provider: Convergence
Version: [a4389c5](https://github.com/convergence-ai/proxy-lite/commit/a4389c599d5f5f77dc18510c879e2e783434766b)
Reasoning: Convergence Proxy-lite

Convergence Proxy-lite performs significantly below competitors at just 38.4% (agent) and 31.4% (evaluation) success rates. However, these results appear heavily impacted by technical issues, as the system frequently triggers Google's CAPTCHA and bot detection services. Despite these limitations, Convergence shows better alignment between self-assessment and evaluation than Browser-Use in most runs, with fewer mismatches (0–4) and one run achieving perfect 1.000 alignment with zero mismatches. This suggests that with improved bot detection handling, Convergence could outperform Browser-Use thanks to its better self-awareness and calibration.

Runs	Agent Self-Report	LLM Evaluation	Alignment	Mismatch	Time per Task
1743114165-6	0.483	0.345	1.400	4	77s
1743114165-0	0.407	0.333	1.222	2	85s
1743114165-3	0.393	0.286	1.374	3	82s
1743114165-4	0.379	0.345	1.099	2	82s
1743114165-5	0.379	0.276	1.373	3	84s
1743114165-7	0.367	0.333	1.102	3	84s
1743114165-2	0.357	0.286	1.248	3	86s
1743114165-1	0.310	0.310	1.000	0	84s

Conclusion

Our open-source agent evaluation reveals notable differences between reported and observed performance. While Notte shows strong capabilities and good self-awareness, other systems exhibit issues with reproducibility and self-assessment. These results underscore the importance of clear, reproducible benchmarks. We encourage the research and engineering community to collaborate on better, more trustworthy evaluation standards.
