Computer-Use Agents Hit 66% on OSWorld. The Other 34% Is a Data Problem.

#ai #llm #agents #machinelearning

Two numbers from the last few weeks tell the whole story of where computer-use agents actually are.

The first is from Microsoft's Build 2026 keynote, where the company reframed the PC itself as an "agentic operating system" and open-sourced the Microsoft Agent Framework so agents can run natively on Windows. The second is from Stanford's latest AI Index: agent task success on OSWorld jumped from 12% to 66% in roughly a year. That is a genuinely staggering rate of progress for software that drives a real desktop — clicking, typing, scrolling, navigating menus the way a person does.

But flip the second number around. A 66% success rate means that on a benchmark of ordinary desktop tasks, the best agents still fail roughly one time in three. And these aren't exotic tasks. OSWorld is built from everyday work across Chrome, Thunderbird, the LibreOffice suite, VS Code, GIMP, VLC, and basic OS operations. The agent that books your travel or reconciles your spreadsheet is wrong often enough that you cannot look away.

If you're building on top of computer-use agents, the interesting engineering question isn't "when will the models get good enough." It's "what specifically is breaking in that 34%, and is it a model problem or a data problem?" After looking closely at a lot of agent traces, I think most of it is a data problem. That's actually good news, because data problems are tractable.

Where the 34% actually goes

When you stop reading benchmark headline numbers and start reading individual trajectories, the failures cluster into a few recognizable shapes.

Grounding failures. The agent knows what it wants to do but cannot reliably translate intent into the right pixel. It means to click "Export" and lands on "Export as template." It targets a button that has scrolled three pixels away from where the agent expected it. GUI grounding, meaning the mapping of a described UI element to its actual coordinates and state on screen, is still where a large share of single-step errors originate, and it gets worse on dense enterprise software the base models barely saw in training.
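A grounding error is easy to detect once you have element boundaries for the screen. Here is a minimal sketch of that check. The `Element` type, the bounding boxes, and the `grounding_ok` helper are all hypothetical names for illustration, not part of OSWorld or any agent framework.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Element:
    """A labeled UI element with a pixel bounding box (hypothetical schema)."""
    label: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def element_at(elements: Sequence[Element], px: int, py: int) -> Optional[Element]:
    """Return the first element whose box contains the click, if any."""
    for el in elements:
        if el.contains(px, py):
            return el
    return None


def grounding_ok(elements: Sequence[Element], target_label: str,
                 click: Tuple[int, int]) -> bool:
    """True only if the click landed on the element the agent intended."""
    hit = element_at(elements, *click)
    return hit is not None and hit.label == target_label


# Two adjacent menu entries, as in the "Export" example above (made-up coordinates).
menu = [
    Element("Export", x=100, y=200, width=160, height=24),
    Element("Export as template", x=100, y=224, width=160, height=24),
]
print(grounding_ok(menu, "Export", (120, 210)))  # click inside "Export"
print(grounding_ok(menu, "Export", (120, 230)))  # click drifted one row down
```

Logging this per step gives you a grounding error rate that is independent of whether the overall task happened to succeed.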

Inefficiency that compounds into failure. A sharp paper from this year, OSWorld-Human, hand-annotated the optimal human trajectory for each OSWorld task and then measured how many steps agents actually take. The result: even the best agents use 1.4x to 2.7x more steps than necessary. Extra steps aren't just slow. Every redundant action is another chance to drift off course, exhaust a context window, or trigger an irreversible side effect. Long-horizon desktop work punishes wandering.
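To make the compounding concrete, here is a back-of-the-envelope sketch. It assumes every step succeeds independently with the same probability, which is my simplification and not a claim from OSWorld-Human. The 0.97 per-step figure and the 10-step optimal path are illustrative numbers only.

```python
def task_success(per_step_success: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent, identical steps."""
    return per_step_success ** steps


optimal_steps = 10        # illustrative length of the optimal human path
per_step_success = 0.97   # illustrative, not a measured value

# 1.0x is the optimal path; 1.4x and 2.7x are the overhead range quoted above.
for overhead in (1.0, 1.4, 2.7):
    steps = round(optimal_steps * overhead)
    rate = task_success(per_step_success, steps)
    print(f"{overhead:.1f}x overhead: {steps} steps, success ~{rate:.2f}")
```

Under these assumptions, the same per-step reliability that clears roughly 74% of tasks on the optimal path clears only about 44% at 2.7x the steps. Real steps are not independent, so treat this as intuition and not as a prediction.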

No sense of "done" or "wrong." Agents frequently complete a task, declare victory, and are simply mistaken — the file saved to the wrong folder, the form submitted with a stale value. Or they hit an error dialog and treat it as success. The model has plenty of capability to act and almost no calibrated signal about whether the action achieved the goal.

Notice what these three failure modes have in common. None of them is primarily a reasoning deficit. They're deficits in the data the model learned from and the signal it gets about its own behavior. That distinction is everything.

Why this is a data problem, not a model problem

Training a computer-use agent is, under the hood, mostly supervised fine-tuning on operation trajectories, meaning sequences of (screen state, action) pairs, followed by reinforcement-style refinement. The dominant open trajectory corpora are small and skewed. When OpenCUA-style open trajectory data makes up roughly 30% of a training mix, you're leaning hard on a narrow, mostly Western, mostly consumer slice of how software gets used. The model has seen a thousand ways to compose a Gmail message and almost no examples of your hospital's scheduling system, your bank's internal console, or a Vietnamese-language ERP.

You can't prompt your way out of a distribution gap. If the trajectories that teach the agent how to recover from a failed click, how to verify a save, or how to operate a specialized line-of-business app simply aren't in the data, the agent won't reliably do those things no matter how clever the orchestration layer is. This is why the field is investing so heavily in trajectory construction — reverse task synthesis, pretraining from unlabeled screen-recording video, and human-annotated optimal paths. The bottleneck has moved from architecture to fuel.

There are three categories of data work that move the 34% the most, and they map cleanly onto what reliable agents need.

1. Trajectory correction, not just trajectory collection. Raw recordings of people using software are noisy: dead ends, fat-fingered clicks, idle scrolling. What teaches an agent to be efficient is a corrected trajectory — the redundant steps pruned, the recovery from a mistake annotated as a recovery, the optimal path made explicit. This is painstaking, expert-in-the-loop work, and it's exactly the kind of reasoning and human-feedback data that separates an agent that wanders for 14 steps from one that finishes in 6. Tool-use validation belongs here too: checking that when the agent invokes an action or API, it picked the right one with the right arguments, and labeling the cases where it didn't.

2. Grounding annotation on the software that actually matters. Closing the grounding gap means labeled screen data from the long tail of real applications — element boundaries, states, the difference between an enabled and a disabled control, localized UI in the languages your users actually work in. General-purpose web datasets won't cover your domain. This is multimodal annotation at its least glamorous and most valuable, and it's the work behind every agent that can operate an unfamiliar interface on the first try instead of the fifth.

3. Honest evaluation, including adversarial. A 66% benchmark score on generic tasks tells you almost nothing about how an agent behaves on your workflows, or how it fails when a UI changes underneath it. You need task suites built from your real software, response scoring that catches the silent "completed but wrong" failures, and red-teaming that probes what the agent does when a dialog is ambiguous, a destructive action is one click away, or a prompt-injection trap is sitting in an email it's asked to read. This is the territory of model evaluation and QA — the difference between knowing your agent passes a leaderboard and knowing it's safe to point at a production system.
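The pruning half of trajectory correction (item 1) can be partly automated before an expert reviews the result. Here is a minimal sketch that removes detours, meaning any run of steps that ends in a state the trajectory had already reached. It assumes each screen state can be reduced to a comparable identifier, which is a hypothetical simplification. Real screens need a robust fingerprint, and actions with side effects that don't show up in that fingerprint still need human review.

```python
from typing import Hashable, List, Tuple

Step = Tuple[str, Hashable]  # (action, resulting state id); hypothetical schema


def prune_detours(start_state: Hashable, steps: List[Step]) -> List[Step]:
    """Drop no-op steps and loops that return to an already-visited state."""
    path_states = [start_state]  # states along the kept path, in order
    kept: List[Step] = []
    for action, state in steps:
        if state in path_states:
            # We are back somewhere we have been: discard everything
            # recorded since the first visit to this state.
            idx = path_states.index(state)
            del path_states[idx + 1:]
            del kept[idx:]
        else:
            path_states.append(state)
            kept.append((action, state))
    return kept


raw = [
    ("open_menu", "menu"),           # dead end
    ("close_menu", "inbox"),         # back where we started
    ("click_compose", "compose"),
    ("scroll", "compose"),           # idle scroll, state unchanged
    ("type_body", "compose_filled"),
]
for step in prune_detours("inbox", raw):
    print(step)
```

In this example the five recorded steps collapse to the two that actually moved the task forward. The annotator's job then shifts from deleting noise to deciding which recoveries are worth keeping as labeled recoveries.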

What to do if you're shipping one of these agents

A few concrete takeaways, regardless of which base model you build on.

Instrument your traces before you tune anything. You cannot fix a failure distribution you haven't measured. Capture full (state, action, outcome) tuples and categorize failures by the three buckets above — grounding, inefficiency, verification. The mix tells you where to spend.
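As a starting point, the trace schema can be very small. The sketch below is one possible shape and not a standard format. The `Step`, `Trace`, and `Failure` names and the string-valued fields are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Iterable, List, Optional


class Failure(Enum):
    GROUNDING = "grounding"
    INEFFICIENCY = "inefficiency"
    VERIFICATION = "verification"


@dataclass
class Step:
    state: str    # e.g. a screenshot path or accessibility-tree hash
    action: str   # e.g. "click(412, 88)"
    outcome: str  # what the environment reported afterwards


@dataclass
class Trace:
    task_id: str
    steps: List[Step] = field(default_factory=list)
    failure: Optional[Failure] = None  # None means the task succeeded


def failure_mix(traces: Iterable[Trace]) -> Dict[str, float]:
    """Share of failed traces that falls into each bucket."""
    counts = Counter(t.failure.value for t in traces if t.failure is not None)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


traces = [
    Trace("t1"),
    Trace("t2", failure=Failure.GROUNDING),
    Trace("t3", failure=Failure.GROUNDING),
    Trace("t4", failure=Failure.VERIFICATION),
]
print(failure_mix(traces))
```

The labels themselves still come from a person or a reviewed classifier reading the trace. The point of the schema is only that the label has somewhere to live.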

Treat "done" as a learned skill, not an assumption. Build explicit verification steps and train on examples of detecting failure, not just examples of success. An agent that knows when it's wrong is worth more than one that is right marginally more often.
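On the orchestration side, an explicit verification step means checking the postcondition in the environment and not trusting the agent's own "done." Here is a minimal sketch for a "save the file" task. The `verify_saved` helper and its reason strings are hypothetical, and a real check would depend on your application.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def verify_saved(expected_path: Path, expected_text: str) -> Tuple[bool, str]:
    """Check that the file exists where it should and holds the expected content."""
    if not expected_path.is_file():
        return False, "file missing"   # e.g. saved to the wrong folder
    if expected_path.read_text(encoding="utf-8") != expected_text:
        return False, "stale content"  # e.g. an old value was written
    return True, "ok"


# Demo in a throwaway directory: the agent claims success at each point,
# but only the last state actually satisfies the task.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "report.txt"
    print(verify_saved(target, "Q3 totals"))   # nothing saved yet
    target.write_text("Q2 totals", encoding="utf-8")
    print(verify_saved(target, "Q3 totals"))   # saved, but with a stale value
    target.write_text("Q3 totals", encoding="utf-8")
    print(verify_saved(target, "Q3 totals"))   # postcondition holds
```

The failed checks are also training data: each (trajectory, "agent said done", verifier said no) triple is an example of detecting failure.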

Invest in domain trajectories early. The single highest-leverage thing most teams can do is generate and correct a few hundred high-quality trajectories on their own software, in their own languages. That narrow, well-labeled data tends to outperform far larger volumes of generic web traces for your use case.

Make evaluation adversarial and continuous. UIs drift, models update, and a passing score last month doesn't hold. Bake red-teaming and regression evals into your release process the same way you'd bake in unit tests.
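A regression gate for agent evals can be as simple as a diff against the last known-good run. The sketch below assumes each eval run is summarized as a mapping from task id to pass/fail. Both that format and the task ids are hypothetical.

```python
from typing import Dict, List


def regressions(baseline: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Task ids that passed in the baseline run but do not pass now.

    A task missing from the current run counts as a regression, so a
    silently skipped eval cannot slip through.
    """
    return sorted(
        task for task, passed in baseline.items()
        if passed and not current.get(task, False)
    )


baseline = {"export_pdf": True, "send_invoice": True, "ignore_injected_email": False}
current = {"export_pdf": True, "send_invoice": False, "ignore_injected_email": True}

broken = regressions(baseline, current)
print(broken)
# In CI you might fail the release when `broken` is non-empty, for example:
# if broken: raise SystemExit(f"agent regressions: {broken}")
```

Because agent runs are noisy, a single pass/fail per task is a simplification. Running each task several times and gating on the pass rate is a natural extension.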

The computer-use agent story in 2026 is not really about a capability ceiling. The models can already drive a desktop impressively well. The gap between a 66% demo and a 99% production system is filled with unglamorous, domain-specific, human-in-the-loop data work: corrected trajectories, grounded screens, and evaluation that's honest about how things break. The teams that win the agent race won't be the ones with the cleverest prompt. They'll be the ones who treated their data pipeline as the product.

Disclosure: I work at SyncSoft.AI, where our bilingual, SME-led teams in Vietnam build the trajectory, annotation, and evaluation data behind reliable AI agents. If you're wrestling with the last 34% on your own agents, we're always happy to compare notes — feel free to reach out.