If you have been following AI news, you’ve probably come across the term AI agents quite often lately, especially vertical agents. They are said to be the inevitable next step of the AI revolution, supposedly able to mimic and eventually replace human labor in almost every industry. And we’re not just talking about the digital world (i.e., computers): the physical world is expected to follow soon.
Here are some recent announcements that have been in the news, some of which I have covered in previous blog posts.
Anthropic's Computer Use – A feature for Claude 3.5 Sonnet that enables it to interact with computer interfaces via specific tools (computer, text editor, bash commands). This is available as a beta feature.
OpenAI's Operator – A web browsing agent released in January 2025 that can perform online tasks like filling out forms and ordering items through its own browser.
Salesforce's Agentforce – A platform introduced in September 2024 as part of Salesforce's investment in AI agents for its customer relationship management (CRM) offerings. It is designed to automate tasks such as generating reports and summarizing conversations, with the aim of boosting productivity for businesses.
This all sounds very exciting but also a bit scary. However, many nuances get lost in public debates—especially when considering what is truly possible today and what limitations exist, given the technologies available right now.
In this blog post, I want to discuss the real challenges that even these groundbreaking advancements face today.
Before we dive into the challenges, it’s important to distinguish between two types of agents: horizontal and vertical.
Horizontal Agents
Horizontal agents, sometimes called all-rounders, are designed for use in any setting and are not tailored to a single use case. They come equipped with core skills, much like humans do: vision, language understanding, and the ability to take action (which, in software, means clicking, scrolling, etc.). These capabilities aren't entirely new; earlier solutions existed, for example under the umbrella of RPA (Robotic Process Automation).
However, Generative AI makes these capabilities far more accessible, cheaper, and better than before.
One concrete example of recent progress is Anthropic’s Computer Use, which I’ve discussed in different contexts and for varied use cases. OpenAI’s Operator, which takes a similar approach but works through its own browser, was also released; I’ll talk about the differences in upcoming posts.
Vertical Agents
Vertical agents are slightly different: they have specialized knowledge (APIs, data, workflows, etc.) about a specific domain and are designed for that domain. For instance, you could have an agent for marketing that monitors a campaign or acts on behalf of the user on social media. Many companies find this type of solution appealing because it not only solves their specific problems but also tends to be more predictable than generic (horizontal) agents. We’ll come back to this point later.
Right now, there are numerous platforms for building and publishing these kinds of agents in various forms—like workflows on Make.com or n8n, chatbots on CustomGPTs or Poe actors, and plenty more in the open-source space with LangChain, LlamaIndex, and so forth.
The Big Challenges (Especially for Horizontal Agents)
Despite how exciting AI agents are, there are still challenges we can’t ignore—especially for horizontal agents, which are seen as the biggest potential competitors to human labor. For business owners, the idea of replacing a $20-per-hour workforce with a piece of software that costs 10 cents per task (or even less) is quite tempting.
But as you might guess, we’re nowhere near that stage yet. Here’s why:
1. Performance
As mentioned, a horizontal agent like Computer Use can control a computer on behalf of a user. It’s not even tied to a specific operating system because it generally:
- Knows most common operating systems.
- Leverages its core skills: vision, language understanding, and the ability to click or type.
It definitely works better than ever before, making unplanned tasks—like “Find a gift for my 10-year-old daughter on eBay”—possible with little or no user intervention.
Remember, this piece of software was never explicitly programmed for that specific task. It could even shop on a website it has never seen before, figuring out from what it “sees” and “reads” how to proceed. This is unprecedented.
But: It is slow—much slower than a typical human doing the same task.
And what’s more, it remains slow even if it has done the same task a thousand times. This is different from how humans learn—a major drawback.
Of course, you could fine-tune it, but then it becomes more of a vertical agent than a horizontal one.
The main reason for its slowness is that the model has to handle many steps, including:
- Taking a screenshot
- Analyzing the image
- Reading a large volume of text
- Interpreting this text
- Finally taking action
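Here is a minimal sketch of that loop in Python. Everything in it is hypothetical: the screen is a plain dictionary, and `take_screenshot`, `choose_action`, and `apply_action` are stand-ins I made up for illustration, not part of any real agent API. The point is only to show that even a trivial task runs through the full observe, decide, act cycle several times.

```python
from dataclasses import dataclass, field


@dataclass
class StepLog:
    """Counts how often each stage of the loop runs."""
    counts: dict = field(default_factory=dict)

    def tick(self, stage: str) -> None:
        self.counts[stage] = self.counts.get(stage, 0) + 1


def take_screenshot(screen: dict) -> dict:
    # Stand-in for a real screen capture: returns a copy of the fake screen.
    return dict(screen)


def choose_action(observation: dict, goal_url: str) -> str:
    # Stand-in for the model call that analyzes the image, reads and
    # interprets the text, and decides what to do next.
    if not observation["browser_open"]:
        return "click_browser_icon"
    if observation["url"] != goal_url:
        return "type_url"
    return "done"


def apply_action(screen: dict, action: str, goal_url: str) -> None:
    # Stand-in for the actual mouse or keyboard action.
    if action == "click_browser_icon":
        screen["browser_open"] = True
    elif action == "type_url":
        screen["url"] = goal_url


def run_agent(goal_url: str, max_steps: int = 10) -> StepLog:
    screen = {"browser_open": False, "url": ""}
    log = StepLog()
    for _ in range(max_steps):
        observation = take_screenshot(screen)
        log.tick("screenshot")
        action = choose_action(observation, goal_url)
        log.tick("model_call")
        if action == "done":
            break
        apply_action(screen, action, goal_url)
        log.tick("action")
    return log


log = run_agent("example.com")
print(log.counts)  # {'screenshot': 3, 'model_call': 3, 'action': 2}
```

Even this toy task (open the browser, enter a URL) needs three screenshots and three model calls. In a real agent, each of those model calls is where most of the waiting happens.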
2. Models Are Perfectionists
Models always interpret everything in detail. They don’t skim or first look for key phrases to grasp the broader context. This makes the process even slower because of the vast amount of information on most websites—navigation menus, ads, etc.—which humans would typically ignore or barely notice.
Of course, you can tweak the prompt and the tools you use (such as reading files, opening a web page, etc.), but this is an extra effort that may not work in other environments.
3. Cost
Given all the steps mentioned above, you can imagine that this doesn’t just impact speed; it also affects the overall cost. Again, I’m mostly referring to horizontal agents here. Vertical agents would skip some of these steps and optimize the rest, making them much faster and cheaper—but of course, they can only be used for very specific scenarios.
The cost of horizontal agents is unpredictable. A simple scenario like booking a flight might cost 30 cents, or it might cost $2–$3—or even more. It depends on how well the model can reason, interpret, and take the correct actions.
The Context Window
One big driver of cost is the context window: the information you must provide to the agent to accomplish the task, including any information collected during the session. The key issue is that transformer-based models have no long-term memory. Each conversation starts from scratch, with no knowledge of previous conversations, so the model cannot learn from ongoing interaction. Everything the agent needs to remember within a session has to be sent again with each request, so the context, and with it the cost, grows with every step.
Example: Booking a Flight
- Take a screenshot of the desktop
- Move the cursor to the browser icon
- Click the browser icon
- Take another screenshot to see the default page
- Move the cursor to the address bar
- Click the address bar
- ...
I won't list everything else, but keep in mind that each step is an API call with a request, processing, and response, which is highly dependent on network latency, token generation speed, and other factors.
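To get a feeling for how the context window drives cost, here is a rough sketch. It assumes that every API call resends the full history so far and that nothing is truncated or cached. The token counts and the price per million tokens are illustrative assumptions of mine, not measurements or real pricing.

```python
def cumulative_input_tokens(steps: int, base_tokens: int, tokens_per_step: int) -> int:
    """Total input tokens sent when every call resends the whole history."""
    total = 0
    context = base_tokens
    for _ in range(steps):
        total += context            # this call sends the full context so far
        context += tokens_per_step  # the new screenshot and result are appended
    return total


def input_cost_usd(total_tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count into dollars at an assumed price."""
    return total_tokens / 1_000_000 * usd_per_million_tokens


BASE_TOKENS = 1_000      # assumed size of the instructions and task description
TOKENS_PER_STEP = 1_500  # assumed tokens added per step (screenshot plus text)
PRICE = 3.0              # assumed USD per million input tokens

for steps in (10, 20):
    tokens = cumulative_input_tokens(steps, BASE_TOKENS, TOKENS_PER_STEP)
    print(steps, "steps:", tokens, "input tokens,",
          round(input_cost_usd(tokens, PRICE), 2), "USD")
# 10 steps send 77500 input tokens in total, 20 steps send 305000
```

Under these assumptions, doubling the number of steps roughly quadruples the input tokens. That is why a task where the agent needs a few extra attempts can end up costing several times more than a smooth run.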
In essence, not an easy job (it's amazing that people do these things countless times a day without thinking twice).
4. Predictability
Working out the return on investment (ROI) of an AI-driven setup may sound straightforward for an enterprise: for example, you measure how long your assistant takes on average to complete a task and multiply that by an hourly rate.
This could work for vertical (specialized) agents, but it’s trickier for horizontal agents because of one major factor: unpredictability.
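The back-of-the-envelope calculation described above can be sketched like this. I'm assuming that "assistant" means a human paid by the hour, and I'm reusing the $20-per-hour figure and the per-task agent costs mentioned earlier in this post. The three-minute task duration is an illustrative assumption.

```python
def human_cost_per_task(avg_minutes: float, hourly_rate_usd: float) -> float:
    """Labor cost of one task: average duration times the hourly rate."""
    return avg_minutes / 60 * hourly_rate_usd


def savings_per_task(avg_minutes: float, hourly_rate_usd: float,
                     agent_cost_usd: float) -> float:
    """Positive if the agent is cheaper than the human, negative otherwise."""
    return human_cost_per_task(avg_minutes, hourly_rate_usd) - agent_cost_usd


HOURLY_RATE = 20.0  # the $20-per-hour figure used earlier in this post
AVG_MINUTES = 3.0   # assumed average task duration (illustrative)

for agent_cost in (0.10, 0.30, 3.00):
    delta = savings_per_task(AVG_MINUTES, HOURLY_RATE, agent_cost)
    print(f"agent cost {agent_cost:.2f} USD -> savings per task {delta:+.2f} USD")
```

With these numbers the human cost is $1.00 per task, so the agent saves money at 10 or 30 cents per task and loses money at $3. The formula is simple; the hard part is that for a horizontal agent you don't know in advance which of those costs you will get.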
When you ask an LLM like ChatGPT the same question multiple times, don’t expect to get the exact same answer each time.
Here’s an example:
[Image: the first answer]
And then three seconds later:
[Image: the second answer to the same question]
The second answer is different, even though we gave exactly the same input.
On top of that, think about how often the environment (i.e., the input) changes while the agent is working. Probably very often.
Every small change in the environment makes the answer deviate even more. This isn’t necessarily a bad thing, but it’s simply unpredictable. One time a task might take three minutes, another time it might struggle to scroll or interpret an image and take ten minutes. In an unattended setting, it could even cause damage if the AI has access to important things like your email or credit card.
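One common source of this variation can be shown with a toy example. A language model produces a probability distribution over possible next tokens, and the next token is typically sampled from that distribution rather than always being the most likely one. The sketch below is not how any particular product is configured; the action names and scores are made up, and `sample_token` is a hypothetical helper.

```python
import math
import random


def softmax(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens: list[str], logits: list[float],
                 temperature: float, rng: random.Random) -> str:
    """Pick the next token: greedy at temperature 0, random otherwise."""
    if temperature == 0:
        return tokens[logits.index(max(logits))]
    probs = softmax(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


ACTIONS = ["click", "scroll", "type"]  # made-up candidate next actions
LOGITS = [2.0, 1.5, 0.5]               # made-up model scores for each action

rng = random.Random()  # unseeded, so results vary between runs
print([sample_token(ACTIONS, LOGITS, 1.0, rng) for _ in range(5)])
print([sample_token(ACTIONS, LOGITS, 0.0, rng) for _ in range(5)])  # always "click"
```

The first line changes from run to run even though the input is identical; the second is always the same. For an agent, one different choice early on can send the rest of the session down a different path.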
Wrap-Up
In this blog post, we talked about the amazing advent of AI agents, but also about the real-world challenges we face today, including the speed, cost, and predictability of general-purpose, horizontal agents. These challenges stand in the way of progress toward what we call AGI, a term full of hope and fear.
But we're all aware that AI will keep developing rapidly in both hardware (chips) and software (models), making things that are unimaginable today a reality.