DEV Community

RichBrian020120


AI Agent Automation for Computers: Overcoming the High Costs of Operator

For the past few months, I have been researching how to make AI agents truly practical. My goal is to build an AI that can operate a computer on its own, similar to products such as Operator or Manus: users enter a command, and the agent completes tasks such as opening Excel, filling out forms, sending emails, browsing the web, and running scripts.

Initially, my idea was clear: provide users with a cloud-based virtual desktop where the AI agent could perform operations. This approach would eliminate the need for users to install complex software while enabling tasks to be executed automatically in the background without interfering with their local devices. However, when I started implementing this solution, I realized the challenges were far greater than expected.

Why Is the Operator Model So Expensive?

My first version of the solution was based on the approach taken by Operator and Manus:

  1. Users send tasks via a web interface or API (e.g., "Help me organize this Excel file").
  2. The AI agent connects to a Cloud PC and operates applications using remote desktop access.
  3. The AI agent parses the UI and executes mouse and keyboard operations to complete the task.
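The three steps above amount to an observe-decide-act loop. The sketch below shows that loop with the screen capture, the model call, and the input injection passed in as plain functions. `run_agent`, `Action`, and its action names are hypothetical placeholders, not the actual interfaces of Operator or Manus. The structural point is that every step costs one screenshot plus one model call, which is where the costs described next come from.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Action:
    # Hypothetical action vocabulary: "click", "type", or "done".
    kind: str
    x: int = 0
    y: int = 0
    text: str = ""

def run_agent(
    task: str,
    capture_screen: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Observe-decide-act loop; returns the actions that were executed."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screen()               # grab the current remote-desktop frame
        action = decide(task, screenshot, history)  # model parses the UI and picks one action
        if action.kind == "done":
            break
        execute(action)                             # forward as mouse/keyboard input
        history.append(action)
    return history

# Dry run with stubs: a scripted "model" clicks, types, then reports completion.
script = iter([Action("click", 120, 40), Action("type", text="Q3 totals"), Action("done")])
sent: List[Action] = []
actions = run_agent(
    "Help me organize this Excel file",
    capture_screen=lambda: b"<png bytes>",
    decide=lambda task, shot, hist: next(script),
    execute=sent.append,
)
print([a.kind for a in actions])  # ['click', 'type']
```

Passing the three stages in as functions keeps the loop identical whether the target is a cloud PC, a cloud phone, or a local stub like the one above.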

In theory, this process is straightforward: users don’t need to install any software, all tasks run in the cloud, and multiple cloud desktops can be operated in parallel for higher throughput. During testing, however, I ran into several cost problems that were close to unacceptable:

  • High Computational Overhead for UI Parsing: To allow the AI agent to read screen information, understand the UI, and perform actions, I had to frequently call large AI models. Since desktop UIs are complex, even basic operations like "clicking a button" required extensive computation and context maintenance, leading to significantly higher costs than expected.
  • Expensive Cloud PC Costs: Windows cloud servers are far more expensive than regular Linux servers. Each user needed a dedicated desktop environment, meaning I had to maintain costly remote instances even if users ran only a few tasks.
  • Slow Task Execution: The AI had to parse the UI before executing mouse and keyboard actions. Even for simple text editing, it was several times slower than a human user. Additionally, network latency and remote desktop loading times further degraded the experience.
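The first two problems can be put into a rough back-of-the-envelope model. This is a minimal sketch, assuming monthly cost is dominated by model tokens spent per step and by the hours a dedicated instance is kept running. The function name and every input value are hypothetical placeholders, not figures from my tests or any provider's price list.

```python
from typing import Tuple

def monthly_cost_per_user(
    tasks_per_month: int,
    steps_per_task: int,
    tokens_per_step: int,
    usd_per_million_tokens: float,
    reserved_instance_hours: float,
    instance_usd_per_hour: float,
) -> Tuple[float, float]:
    """Return (model_cost, instance_cost) in USD for one user over one month."""
    # Each step sends a screenshot plus context to the model.
    model_tokens = tasks_per_month * steps_per_task * tokens_per_step
    model_cost = model_tokens * usd_per_million_tokens / 1_000_000
    # A dedicated desktop is billed for the hours it stays running, not for tasks completed.
    instance_cost = reserved_instance_hours * instance_usd_per_hour
    return model_cost, instance_cost

# Placeholder inputs for illustration only, not measured prices.
print(monthly_cost_per_user(10, 20, 2_000, 5.0, 100, 0.5))  # (2.0, 50.0)
```

With light usage the reserved-instance term dominates, which is exactly the "costly remote instances even if users ran only a few tasks" problem; the model term grows with every extra step the agent needs.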

Finding Cost Reduction Strategies: From Operator to Open-Source Models

Seeing that AI execution costs far exceeded subscription revenue, I began exploring ways to lower costs.

Initially, I considered:

  1. Finding cheaper AI solutions, such as replacing the OpenAI API with open-source large models.
  2. Optimizing UI parsing to reduce unnecessary computations and improve AI operation accuracy.
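One simple way to cut unnecessary UI-parsing calls, offered here as an illustrative technique rather than what I ended up shipping, is to skip the model call entirely when the screen has not changed since the last parse. This minimal sketch keys a small cache on a SHA-256 hash of the screenshot bytes; `ParseCache` and the `parse` callback are hypothetical names standing in for whatever model call does the parsing.

```python
import hashlib
from typing import Callable, Dict

class ParseCache:
    """Reuse a previous UI parse when the screenshot bytes are unchanged."""

    def __init__(self, parse: Callable[[bytes], dict], max_entries: int = 128):
        self._parse = parse  # the expensive model call (hypothetical)
        self._cache: Dict[str, dict] = {}
        self._max_entries = max_entries
        self.model_calls = 0

    def get(self, screenshot: bytes) -> dict:
        key = hashlib.sha256(screenshot).hexdigest()
        if key not in self._cache:
            if len(self._cache) >= self._max_entries:
                # Evict the oldest inserted entry (dicts preserve insertion order).
                self._cache.pop(next(iter(self._cache)))
            self.model_calls += 1
            self._cache[key] = self._parse(screenshot)
        return self._cache[key]

cache = ParseCache(parse=lambda shot: {"bytes_seen": len(shot)})
for frame in [b"frame-a", b"frame-a", b"frame-b", b"frame-a"]:
    cache.get(frame)
print(cache.model_calls)  # 2
```

Exact byte hashing is the weakest form of this idea: a ticking clock or blinking cursor changes the bytes without changing anything the agent cares about, so a real version would likely hash only the relevant region or use a perceptual hash.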

This led me to explore AI agents that run locally, and along the way I came across an open-source project called UI-TARS.

Key Features of UI-TARS:

  • Allows AI to analyze the UI from screenshots and execute actions without repeatedly calling expensive cloud-based AI models.
  • Parses the UI with a vision-language model that can run locally, which significantly reduces API call costs.
  • Supports mobile interfaces, not just Windows desktops.
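A screenshot-based agent ultimately has to turn its textual output into a point on a specific screen. The sketch below assumes a hypothetical action syntax, `click(x=0.42, y=0.17)`, with coordinates normalized to [0, 1]; this is not UI-TARS's documented output format, so check the project's documentation for the real one. It illustrates why the same model output can drive both a 1920×1080 desktop and a 1080×2400 phone.

```python
import re
from dataclasses import dataclass

@dataclass
class PixelAction:
    kind: str
    x: int
    y: int

# Hypothetical action syntax: kind(x=<0..1>, y=<0..1>).
_ACTION_RE = re.compile(r"^(\w+)\(x=([0-9.]+),\s*y=([0-9.]+)\)$")

def to_pixels(action_text: str, width: int, height: int) -> PixelAction:
    """Convert a normalized-coordinate action into pixel coordinates for one screen."""
    m = _ACTION_RE.match(action_text.strip())
    if m is None:
        raise ValueError(f"unrecognized action: {action_text!r}")
    kind, nx, ny = m.group(1), float(m.group(2)), float(m.group(3))
    if not (0.0 <= nx <= 1.0 and 0.0 <= ny <= 1.0):
        raise ValueError("coordinates must be normalized to [0, 1]")
    # Clamp to the last valid pixel so x=1.0 / y=1.0 stay on screen.
    x = min(int(nx * width), width - 1)
    y = min(int(ny * height), height - 1)
    return PixelAction(kind, x, y)

print(to_pixels("click(x=0.5, y=0.25)", 1080, 2400))  # PixelAction(kind='click', x=540, y=600)
```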

This gave me a new idea:

What if AI agents didn’t have to use cloud PCs but instead operated in a more lightweight cloud environment?

Cloud Phones vs. Cloud PCs: Why Cloud Phones Are Better for AI Automation

While studying UI-TARS, I discovered that it not only supported desktop applications but also adapted well to mobile UIs. I then experimented with replacing cloud PCs with cloud phones, and the results were surprisingly good.

Key Differences Between Cloud Phones and Cloud PCs:

  • Lower Computing Resource Requirements: The same server can host significantly more cloud phone instances than cloud PC instances, reducing operational costs.
  • Simpler and More Stable UI: Mobile UI structures are less complex and more stable than desktop applications, making it easier for AI agents to parse them and execute actions.
  • Lighter Interaction Logic: Instead of simulating a mouse and keyboard, AI agents can perform direct touch-based interactions, reducing error rates.
  • More Flexible Bulk Automation: Compared to cloud PCs, cloud phones can run multiple apps simultaneously and switch between multiple accounts more easily, making them more adaptable.
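On an Android cloud phone, touch-based interaction can be as thin as a handful of shell commands. This is a minimal sketch, assuming the cloud phone exposes an ADB connection; the helper names are hypothetical, while `input tap`, `input swipe`, `input text`, `input keyevent`, and `screencap -p` are standard Android shell commands. `emulator-5554` is the serial a first local Android emulator normally gets; a cloud phone would have its own.

```python
import shlex
import subprocess
from typing import List

def adb(serial: str, *args: str) -> List[str]:
    """Build an adb argv targeting one device by serial."""
    return ["adb", "-s", serial, *args]

def tap(serial: str, x: int, y: int) -> List[str]:
    return adb(serial, "shell", "input", "tap", str(x), str(y))

def swipe(serial: str, x1: int, y1: int, x2: int, y2: int, duration_ms: int = 300) -> List[str]:
    return adb(serial, "shell", "input", "swipe",
               str(x1), str(y1), str(x2), str(y2), str(duration_ms))

def type_text(serial: str, text: str) -> List[str]:
    # `input text` reads %s as a space; quote the result for the device-side shell.
    return adb(serial, "shell", "input", "text", shlex.quote(text.replace(" ", "%s")))

def press_back(serial: str) -> List[str]:
    return adb(serial, "shell", "input", "keyevent", "KEYCODE_BACK")

def screenshot(serial: str) -> List[str]:
    # exec-out passes the PNG bytes through stdout unmodified.
    return adb(serial, "exec-out", "screencap", "-p")

def run(cmd: List[str]) -> bytes:
    """Execute a command and return stdout (needs adb on PATH and a reachable device)."""
    return subprocess.run(cmd, check=True, capture_output=True).stdout

print(tap("emulator-5554", 540, 600))
# ['adb', '-s', 'emulator-5554', 'shell', 'input', 'tap', '540', '600']
```

Building argv lists separately from `run` keeps the action layer testable without a device and makes it easy to log exactly what the agent sent.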

My Experiment:

  1. Running common productivity apps (e.g., Google Docs, Excel, Slack) on cloud phones.
  2. Letting AI agents execute automation tasks on cloud phones (e.g., filling out forms, bulk replying to messages, generating documents).
  3. Comparing AI performance on cloud PCs vs. cloud phones.
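Step 3 comes down to aggregating per-run measurements for each environment. This is a minimal sketch, assuming every run logs its attributed cost, wall-clock time, and whether it succeeded; `Run`, `summarize`, and `compare` are hypothetical names, and the sample runs are placeholders, not my measurements.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

@dataclass
class Run:
    environment: str  # e.g. "cloud_pc" or "cloud_phone"
    cost_usd: float   # model + instance cost attributed to this run
    seconds: float    # wall-clock time to finish the task
    succeeded: bool

def summarize(runs: List[Run]) -> Dict[str, Dict[str, float]]:
    """Average cost, average time, and success rate per environment."""
    grouped: Dict[str, List[Run]] = {}
    for run in runs:
        grouped.setdefault(run.environment, []).append(run)
    return {
        env: {
            "avg_cost": mean(r.cost_usd for r in rs),
            "avg_seconds": mean(r.seconds for r in rs),
            "success_rate": sum(r.succeeded for r in rs) / len(rs),
        }
        for env, rs in grouped.items()
    }

def compare(summary: Dict[str, Dict[str, float]],
            baseline: str = "cloud_pc", candidate: str = "cloud_phone") -> Dict[str, float]:
    base, cand = summary[baseline], summary[candidate]
    return {
        # 0.6 would mean the candidate costs 60% less per run than the baseline.
        "cost_reduction": 1 - cand["avg_cost"] / base["avg_cost"],
        # 2.0 would mean the candidate finishes tasks twice as fast on average.
        "speedup": base["avg_seconds"] / cand["avg_seconds"],
    }

# Placeholder runs for illustration only.
runs = [Run("cloud_pc", 2.0, 90.0, True), Run("cloud_pc", 2.0, 90.0, False),
        Run("cloud_phone", 1.0, 45.0, True), Run("cloud_phone", 1.0, 45.0, True)]
print(compare(summarize(runs)))  # {'cost_reduction': 0.5, 'speedup': 2.0}
```

Tracking success rate alongside cost matters: a cheaper environment that fails more often can end up costing more per completed task.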

Final Results:

  • AI execution costs dropped by over 60% since cloud phones require significantly less computation for UI parsing.
  • Task execution speed doubled because AI agents could interact directly with apps rather than simulating inefficient mouse and keyboard inputs.
  • Overall experience became more stable, as mobile UIs change less frequently than web UIs, reducing AI agent failure rates.

Through this exploration, I ultimately replaced cloud PCs with cloud phones, achieving a more cost-effective, efficient, and stable AI agent solution.

However, the question remains: Is this approach suitable for all AI agent applications?

  • For data processing, text editing, and office automation, cloud phones are already a great fit and significantly lower costs.
  • For professional software (Photoshop, AutoCAD, etc.), cloud PCs might still be necessary, as cloud phones currently lack sufficient capabilities.
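The split above suggests a simple routing rule: default to the cheaper cloud phone and fall back to a cloud PC only when a task needs desktop-only software. This is a minimal sketch; the `Backend` enum, `choose_backend`, and the contents of `DESKTOP_ONLY_APPS` are hypothetical, and a real list would grow from observed failures.

```python
from enum import Enum
from typing import Iterable

class Backend(Enum):
    CLOUD_PHONE = "cloud_phone"
    CLOUD_PC = "cloud_pc"

# Hypothetical set of apps assumed to need a full desktop environment.
DESKTOP_ONLY_APPS = {"photoshop", "autocad"}

def choose_backend(required_apps: Iterable[str]) -> Backend:
    """Route to a cloud PC only when some required app is desktop-only."""
    if any(app.strip().lower() in DESKTOP_ONLY_APPS for app in required_apps):
        return Backend.CLOUD_PC
    return Backend.CLOUD_PHONE

print(choose_backend(["Google Docs", "Slack"]).value)  # cloud_phone
print(choose_backend(["AutoCAD"]).value)               # cloud_pc
```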

Project website: https://www.androidcloud.ai/

Telegram: https://t.me/+KkKBnuniG99iNzI1

If you have better suggestions or solutions, please leave a comment.
