We've all experienced it.
You automate a repetitive workflow, only for everything to break because someone moved a button or redesigned a page. Surprisingly, many AI agents suffer from the exact same problem.
Instead of understanding the underlying task, they're often forced to interact with software the same way humans do: by clicking buttons, reading screens, and navigating menus.
This approach works.
But it raises an important question:
Are we teaching AI systems to solve problems, or are we simply teaching them to use software designed for humans?
In this article, we'll explore why UI navigation has become one of the biggest bottlenecks for AI agents, and why many developers believe the industry is solving the wrong problem.
A Simple Task Isn't Really Simple
Imagine someone wants to check the status of an order.
For a browser-based AI agent, that means loading pages, navigating menus, finding the right button, and reading the result off the screen.
What seems like a simple request actually requires multiple steps. And every step introduces opportunities for failure.
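To get a feel for how quickly those opportunities for failure add up, here is a rough back-of-the-envelope sketch. The step names and the 95% per-step success rate are assumptions made up for illustration, not measurements from any real agent, and the steps are treated as independent.

```typescript
// Hypothetical UI steps for "check order status". The step names are
// illustrative assumptions, not taken from any real agent trace.
const uiSteps: string[] = [
  "open the store's website",
  "log in",
  "navigate to the orders page",
  "locate the order in the list",
  "open the order details",
  "read the status from the page",
];

// Probability that every step succeeds, assuming the steps are independent
// and each one succeeds with the same probability.
function chainSuccessRate(stepCount: number, perStepSuccess: number): number {
  return Math.pow(perStepSuccess, stepCount);
}

// 0.95 is an assumed per-step success rate, chosen only for the example.
const overall = chainSuccessRate(uiSteps.length, 0.95);
console.log(
  `${uiSteps.length} steps at 95% each: ${(overall * 100).toFixed(1)}% end to end`
);
// prints "6 steps at 95% each: 73.5% end to end"
```

Even with a generous per-step rate, six chained steps would fail roughly one time in four under these assumptions.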
The UI Was Built For Humans
Graphical user interfaces exist because humans need visual representations of information. Buttons, menus, and forms are abstractions created for us.
Computers themselves don't need buttons.
Yet many modern AI agents spend enormous amounts of computation trying to interpret screenshots, locate interface elements, and simulate mouse clicks.
They're essentially pretending to be humans.
We've Seen This Problem Before
We saw it in Robotic Process Automation (RPA). Large enterprises invested heavily in automation workflows that depended on screen layouts and specific button locations.
Then a website redesign happened.
Or a menu moved.
Or a label changed.
And suddenly entire workflows stopped working.
Fragile automation became one of the biggest challenges facing RPA systems.
Many AI agents are repeating the same pattern, only with language models replacing traditional scripts.
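The failure mode is easy to reproduce in miniature. The sketch below models a page as a plain list of labeled elements; the `UiElement` type, the element ids, and the labels are all hypothetical and exist only to show how a label change silently breaks a layout-dependent lookup.

```typescript
// A toy model of a page: just the clickable elements and their visible labels.
// Every name here is hypothetical and exists only for illustration.
interface UiElement {
  id: string;
  label: string;
}

type Page = UiElement[];

// Layout-dependent automation: find the button by its exact visible label.
function findButtonByLabel(page: Page, label: string): UiElement | undefined {
  return page.find((el) => el.label === label);
}

const beforeRedesign: Page = [
  { id: "btn-1", label: "My Account" },
  { id: "btn-2", label: "Track Order" },
];

// After a redesign the same feature still exists, but the label changed.
const afterRedesign: Page = [
  { id: "nav-account", label: "Account" },
  { id: "nav-orders", label: "Order Status" },
];

console.log(findButtonByLabel(beforeRedesign, "Track Order")?.id); // prints "btn-2"
console.log(findButtonByLabel(afterRedesign, "Track Order")?.id); // prints undefined
```

Nothing about the underlying feature changed between the two pages. Only the label did, and that was enough to break the script.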
Browser Agents Pay a Heavy Price
Research from companies building browser agents has also highlighted challenges such as:
- Slow execution
- Increased token costs
- Fragility
- Ambiguous actions
The Industry Is Optimizing The Wrong Layer
Today's AI systems are incredibly powerful.
Yet some of them spend most of their time performing tasks like:
- Finding buttons.
- Waiting for pages to load.
- Reading screenshots.
- Navigating menus.
These aren't the user's actual goals.
Users don't want a button clicked.
Users want a problem solved.
There's a difference.
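One way to picture that difference is a task-level call that names the goal directly. This is only a sketch: `getOrderStatus`, the `OrderStatus` type, and the in-memory order data are hypothetical stand-ins, not an API from any particular product or library.

```typescript
// Hypothetical task-level interface: the caller states the goal,
// not the clicks needed to reach it.
type OrderStatus = "processing" | "shipped" | "delivered";

// An in-memory Map stands in for whatever backend a real system would query.
const orders = new Map<string, OrderStatus>([
  ["A-1001", "shipped"],
  ["A-1002", "processing"],
]);

// Returns the status for a known order, or undefined if the id is unknown.
function getOrderStatus(orderId: string): OrderStatus | undefined {
  return orders.get(orderId);
}

console.log(getOrderStatus("A-1001")); // prints "shipped"
console.log(getOrderStatus("Z-9999")); // prints undefined
```

No buttons, menus, or screenshots are involved, so a page redesign has nothing to break here.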
The Irony
Humans navigate interfaces because we have no alternative.
Machines do.
And yet we're investing enormous amounts of compute teaching AI systems how to behave like users instead of asking whether user interfaces are even the right abstraction for machines.
The result is an interesting paradox:
Some of the world's most advanced AI models spend their time doing something web browsers have been doing since the 1990s.
GitHub: github.com/Hobbydefiningdoctory/capman
Capman-site: capman
capman v0.6.2 — TypeScript, MIT licence, dual CJS/ESM, zero runtime dependencies beyond zod.


