OpenAI Operator scores 43% on hard web tasks. We scored 81%. Here are all 300 runs.

#agents #ai #automation #showdev

TinyFish set out to build web agents that solve real-world problems for DoorDash, Google Hotels, ClassPass, and all the smaller businesses trying to keep up with the giants. That's what we do every day at production scale.

But we also wanted to test ourselves against the public benchmarks. Not because benchmarks are the goal. They rarely translate to real-world performance. The constraints aren't realistic, and it doesn't matter if your agent can play interactive chess puzzles on the web. What matters is whether it can solve your problem faster, cheaper, and at scale.

Still, Mind2Web is the most rigorous public evaluation for web agents right now, with 300 tasks across 136 live websites, three difficulty levels, and human evaluation. It's where OpenAI Operator, Claude Computer Use, and Browser Use all have published scores. So we ran it.

We ran TinyFish through the full benchmark in parallel. Here are the results alongside the current leaderboard:

We're submitting to the official leaderboard. In the meantime, we published every run so you don't have to take our word for it.

All 300 tasks, including every failure → [[Public] TinyFish-Mind2Web Agent Runs](https://docs.google.com/spreadsheets/d/1jgRESVlSYygPO4dKKqzPohGUX5b78Ay59422mM29CsU/edit?usp=sharing)

The rest of this post covers what these tasks actually involve, how we failed, and why we think the system works.

What these tasks involve

Mind2Web tasks run on live websites: the actual sites, with all their pop-ups, dynamic pricing, and form validation.

An easy task: "Browse Marriott Bonvoy credit cards on Marriott." Navigate, find the section, view the listings. A few clicks.

A hard task: "Book 4 tickets in the upper section for any Kevin Hart show in New York in the next three months and view ticket prices with estimated fees." That's StubHub. Search events, filter by date range and location, select a show, choose a seating section, set ticket quantity, navigate a pricing page where fees calculate in real time. Ten-plus steps where things change between page loads.

Another hard task: "Find the highest critic-scored red or white wine from Oregon, priced under $40, that pairs well with fish or dessert." Multiple filters in sequence on wineaccess.com, constraint checking against each result, paginated inventory that shifts underneath you.

The benchmark evaluates each intermediate step, not just the final answer. And this is what makes the easy-to-hard drop the most interesting number in the results:

Hard tasks compound errors. Every step is a chance to fail, and failures cascade. At 95% per-step accuracy, a 3-step task succeeds 86% of the time, but a 10-step task succeeds only 60% of the time. At 90% per-step accuracy, the 10-step task drops to 35%.

A system that drops 16 points from easy to hard handles compounding well. A system that drops 58 points was being flattered by easy tasks.
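The compounding arithmetic above is easy to reproduce. This is a minimal sketch that assumes every step fails independently with the same probability, which is a simplification (real steps are neither independent nor equally hard); the function name `task_success_rate` is ours, for illustration only.

```python
def task_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


# The three cases quoted in the text.
for accuracy, steps in [(0.95, 3), (0.95, 10), (0.90, 10)]:
    rate = task_success_rate(accuracy, steps)
    print(f"{accuracy:.0%} per step, {steps} steps: {rate:.0%}")

# 95% per step, 3 steps: 86%
# 95% per step, 10 steps: 60%
# 90% per step, 10 steps: 35%
```

Under this model, a small gain in per-step accuracy matters far more on a 10-step task than on a 3-step one.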

40 failures, all documented

We failed 40 out of 300 tasks. Here's every one, with the reason.

Anti-bot blocks — 12 failures. Sites that blocked execution at the infrastructure level before the agent could attempt the task.

apartments.com accounts for 8 of our 40 failures. If you've tried automating anything on that site, you already know. We ran every task through our own platform with the same proxy routing and infrastructure configuration our customers use in production. Some sites are just that aggressive.

UI interaction limitations — 4 failures. Widget types our execution layer doesn't handle yet.

Edge cases — 24 failures.

Every one of these 300 tasks has a clickable link to the full execution trace. Pick a failure, or pick a pass. Watch what happened.

How the system works

The standard web agent architecture: screenshot the page, send it to a frontier model, ask what to click, repeat. This is how Operator, Claude Computer Use, and Browser Use all work.

It has a scaling problem. A round-trip to a frontier model takes 1-5 seconds per step. Large models are stochastic (same screenshot, different actions), so consistency degrades across long workflows. And paying for a frontier-model call on every step makes the cost per session unworkable at production volume.

We split the problem based on an observation: about 20-30% of steps in a typical web workflow need actual reasoning. Understanding what a page is asking, interpreting an unusual layout, choosing between valid paths. The rest — clicking date pickers, selecting dropdowns, submitting forms, paginating — is mechanical.

The reasoning layer uses large models for the 20-30% that's ambiguous. The execution layer uses small, task-specific models trained on web interaction patterns for the rest. These run in milliseconds, not seconds. Same input, same output. No hallucinated click targets.
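To make the split concrete, here is a rough sketch of step routing. It is not TinyFish's implementation: the `Step` type, the `MECHANICAL_KINDS` set, and the `route` function are all hypothetical, and a real system would presumably decide what counts as "mechanical" with something richer than a fixed lookup.

```python
from dataclasses import dataclass

# Hypothetical labels for the step types the post describes as mechanical.
MECHANICAL_KINDS = {"date_picker", "dropdown", "submit_form", "paginate"}


@dataclass(frozen=True)
class Step:
    kind: str    # what sort of interaction this is (assumed label)
    target: str  # human-readable description of the element


def route(step: Step) -> str:
    """Send mechanical steps to the execution layer, everything else to reasoning."""
    return "execution" if step.kind in MECHANICAL_KINDS else "reasoning"


# An illustrative workflow loosely modeled on the StubHub task.
workflow = [
    Step("interpret_layout", "search results page"),
    Step("dropdown", "ticket quantity"),
    Step("date_picker", "event date range"),
    Step("choose_path", "seating section"),
    Step("paginate", "next page of listings"),
]

layers = [route(step) for step in workflow]
print(layers)
# ['reasoning', 'execution', 'execution', 'reasoning', 'execution']
```

The point of the design is that only the steps routed to "reasoning" pay the latency and variance of a large model; the rest follow a deterministic path.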

The infrastructure layer handles proxy routing, geographic distribution, and reliable execution on sites with strict automation requirements. All of it was running during the benchmark, the same setup our customers use. An agent that reasons perfectly but fails at the execution layer is useless, and this is the layer we're investing the most in right now.

A good example of why: the results we published are one-shot success rates, with no retries and no manual intervention, but we did re-run some failed tasks afterward. Take Task #197 on kaggle.com ("Identify the ongoing competition that offers the highest prize and find the code that received the most votes in that competition"). In our benchmark submission, it failed on an anti-bot block. On a subsequent run, TinyFish automatically reconfigured, switched to a different proxy, and completed the task. You can watch the full execution trace here. That auto-reconfiguration is the differentiator: not just handling execution requirements, but detecting blocks and adapting in real time without human input.
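The general pattern of detecting a block and switching proxy can be sketched as follows. This is an illustration only, not how TinyFish does it: `BlockedError`, `run_with_reconfiguration`, the proxy names, and the fake task are all hypothetical, and the published benchmark numbers were one-shot with no such retries.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class BlockedError(Exception):
    """Raised by a (hypothetical) task runner when the site blocks the session."""


def run_with_reconfiguration(run_task: Callable[[str], T], proxies: Sequence[str]) -> T:
    """Try the task on each proxy in turn, moving on whenever a block is detected."""
    last_error: Optional[BlockedError] = None
    for proxy in proxies:
        try:
            return run_task(proxy)
        except BlockedError as err:
            last_error = err  # blocked: reconfigure by moving to the next proxy
    raise BlockedError(f"blocked on all {len(proxies)} proxies") from last_error


# Stand-in task: the first proxy is blocked, the second gets through.
attempts = []


def fake_task(proxy: str) -> str:
    attempts.append(proxy)
    if proxy == "proxy-a":
        raise BlockedError("anti-bot challenge")
    return f"done via {proxy}"


result = run_with_reconfiguration(fake_task, ["proxy-a", "proxy-b", "proxy-c"])
print(result)    # done via proxy-b
print(attempts)  # ['proxy-a', 'proxy-b']
```

Only block errors trigger a switch here; any other exception propagates unchanged, so a genuine task failure is not masked by retries.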

One API. Natural language in, structured data out.