GOROman

Posted on Mar 13

CUA - Computer-Using Agent: The AI That Operates Your Computer

#openai #cua #chatgpt #ai

What is Computer-Using Agent (CUA)?

OpenAI's "Computer-Using Agent" (CUA) is an AI model that operates computers through graphical user interfaces (GUIs). It can see the screen, click buttons, fill out forms, and navigate menus much as a human user would.

CUA powers "Operator," a research preview that lets users carry out complex computer tasks by giving natural language instructions. For example, you could say "Book a flight to New York next week," and Operator would navigate travel websites, search for flights, and complete the booking process.

How CUA Works

CUA processes pixel data from screenshots to understand what's happening on screen and interacts with interfaces like a human would. This eliminates the need for platform-specific APIs to operate various applications. The process is divided into three main steps:

1. Perception

CUA takes screenshots of the computer screen to understand the context of the digital environment. These visual inputs form the foundation for decision-making.

2. Reasoning

Using chain-of-thought reasoning, CUA evaluates its observations and tracks progress across intermediate steps. By analyzing past and current screenshots, the system dynamically adapts to new challenges and unexpected changes.

3. Action

Using a virtual mouse and keyboard, CUA executes tasks like typing, clicking, and scrolling. For sensitive tasks, such as handling login credentials or solving CAPTCHA challenges, the system requests user confirmation to ensure security.

This structured workflow allows CUA to navigate complex multi-step tasks and self-correct when encountering errors, making it a powerful tool for digital problem-solving.
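
The three steps above can be read as a loop. The sketch below is a minimal illustration in JavaScript, not OpenAI's implementation: `takeScreenshot`, `decideNextAction`, and `performAction` are hypothetical callbacks supplied by the caller, and the action shape (a `type` field, with `'done'` meaning the task is finished) is an assumption made for the example.

```javascript
// Minimal perceive -> reason -> act loop.
// All three callbacks are hypothetical stand-ins, not part of any real API.
async function runAgentLoop({ takeScreenshot, decideNextAction, performAction, maxSteps = 20 }) {
  const history = []; // past screenshots and actions, available to the reasoning step
  for (let step = 0; step < maxSteps; step++) {
    const screenshot = await takeScreenshot();                   // 1. Perception
    const action = await decideNextAction(screenshot, history);  // 2. Reasoning
    if (action.type === 'done') {
      return { finished: true, steps: step, history };
    }
    await performAction(action);                                 // 3. Action
    history.push({ screenshot, action });
  }
  // Step budget exhausted without the model reporting completion.
  return { finished: false, steps: maxSteps, history };
}

// Usage with stubbed callbacks: the "model" clicks, types, then reports completion.
const scriptedActions = [
  { type: 'click', x: 10, y: 20 },
  { type: 'type', text: 'Tokyo weather' },
  { type: 'done' },
];
const performed = [];
const loopResult = runAgentLoop({
  takeScreenshot: async () => 'fake-screenshot',
  decideNextAction: async (_shot, history) => scriptedActions[history.length],
  performAction: async (action) => { performed.push(action.type); },
});
```

The `maxSteps` cap is a design choice for the sketch: an agent that never reports completion should stop rather than keep clicking.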

CUA's Key Capabilities and Benchmarks

CUA sets new benchmarks in both computer use and browser-based tasks, demonstrating flexibility across different environments. Its performance has been evaluated using platforms like OSWorld, WebArena, and WebVoyager:

OSWorld: CUA achieved a 38.1% success rate for general computer-use tasks, far exceeding the previous state-of-the-art (SOTA) result of 22.0%.
WebArena: On this benchmark, which simulates real-world tasks in e-commerce and content management, CUA scored 58.1%, outperforming the prior SOTA of 36.2%.
WebVoyager: Testing live website interactions (e.g., Amazon, GitHub), CUA reached an 87% success rate.

These benchmarks highlight CUA's ability to operate effectively across digital environments using a single general interface of screen, mouse, and keyboard. However, there is still room for improvement in more complex scenarios, such as WebArena tasks, where human success rates are higher.
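
To make the size of these improvements concrete, the figures quoted above can be compared directly. The snippet below uses only the OSWorld and WebArena numbers from this section; the field names are chosen for the example.

```javascript
// Success rates (in percent) as quoted in this section.
const benchmarks = [
  { name: 'OSWorld', cua: 38.1, previousSota: 22.0 },
  { name: 'WebArena', cua: 58.1, previousSota: 36.2 },
];

// Absolute gain in percentage points and relative gain in percent.
const gains = benchmarks.map(({ name, cua, previousSota }) => ({
  name,
  pointGain: Number((cua - previousSota).toFixed(1)),
  relativeGainPercent: Math.round((cua / previousSota - 1) * 100),
}));

console.log(gains);
// OSWorld:  +16.1 points (about 73% relative improvement)
// WebArena: +21.9 points (about 60% relative improvement)
```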

CUA's Advanced Features: Flexibility and Adaptability

One of the most remarkable aspects of CUA is its ability to break tasks into multi-step plans and adapt dynamically when faced with challenges. For example, if a webpage fails to load correctly or a task diverts from the expected path, CUA can adjust its strategy in real-time. This flexibility stems from its integration of GUI perception with structured problem-solving.

CUA Safety Protocols

OpenAI has prioritized safety in the development of CUA, recognizing the risks associated with giving AI systems access to digital environments. Key measures include:

User Confirmation: For sensitive actions like handling passwords or personal data, CUA requests explicit user approval.
Operator System Card: A detailed system card outlines the safeguards in place, ensuring transparency and accountability.

The initial release of CUA through the Operator research preview is limited to Pro-tier users in the U.S., allowing OpenAI to gather real-world feedback and refine the system before global deployment.

CUA Applications and Future Potential

CUA opens the door to a wide range of applications, from automating mundane workflows to enabling accessibility solutions for individuals with physical limitations. Potential use cases include:

Administrative Assistance: Automating tasks like form-filling, data entry, and email management.
E-commerce Management: Navigating and updating online store inventories or processing customer orders.
Education and Research: Assisting with data collection and organization from online resources.

By removing the need for task-specific APIs, CUA can work with any application that exposes a GUI, which makes it easier for developers and businesses to integrate into existing workflows.

CUA Challenges and Next Steps

While CUA represents a significant step forward, it still faces limitations:

Complex Task Handling: Success rates on intricate tasks, such as those in WebArena, show room for improvement compared to human performance.
Generalization: Adapting to a broader range of use cases and digital environments will require further training and fine-tuning.
Ethical Considerations: Ensuring responsible use and minimizing risks of misuse remain ongoing priorities.

OpenAI's strategy of phased deployment allows for continuous improvement, leveraging user feedback to refine the system. As CUA progresses, it could redefine the boundaries of AI-human collaboration.

Summary

Computer-Using Agent (CUA) changes how AI interacts with digital interfaces. Because it can understand screenshots and operate a mouse and keyboard, CUA can use computers much as humans do.

While still in its early stages, it already shows impressive results and will enable more automation and assistance with digital tasks in the future. OpenAI's commitment to safety and phased deployment approach ensures that CUA is developed responsibly and can be leveraged to improve our digital lives.

My Hands-on Experience with CUA

Hello, I'm GOROman. I've had the chance to try out the CUA technology described above, where "AI operates a computer by itself" - something that sounds like science fiction. It was more impressive than I expected, so I'd like to share my experience.

OpenAI's "Computer-Using Agent" - even the name conveys how powerful this is. Simply put, it's a system where AI looks at the screen, makes judgments like "I should click here" or "I should fill out this form," and actually controls the mouse and keyboard.

It feels like we're approaching the sci-fi world of Jarvis from Iron Man.

How the Computer-Using Agent Works

Knowing the technical details beforehand makes the API behavior easier to follow, so let me briefly explain how it works.

The system consists of three main components:

Visual System: AI that captures and analyzes the screen
Decision Engine: AI that determines what actions to take
Control System: The part that actually moves the mouse and keyboard

For example, if you instruct it to "check the weather and send it by email":

Visual AI: "Currently displaying desktop screen. Browser icon is in the bottom left."
Decision AI: "Should open the browser first"
Control System: "Click on Chrome icon in the bottom left"

Visual AI: "Browser is open. There's a search bar."
Decision AI: "Input 'Tokyo weather' in the search bar"
Control System: "Click on search bar and type 'Tokyo weather'"

This is how the components hand work to one another, step by step, until the task is done.

API Implementation: Easier Than Expected

When I actually tried the API, the implementation was surprisingly simple. Here's a basic implementation example in Node.js:

import axios from 'axios';

// Basic configuration
const API_KEY = process.env.OPENAI_API_KEY; // ← Set OPENAI_API_KEY in your environment instead of hard-coding the key
const API_URL = 'https://api.openai.com/v1/computer-agent';

// Function to request tasks from the agent
async function runComputerAgent(taskDescription) {
  try {
    const response = await axios.post(
      API_URL,
      {
        task: taskDescription,
        allowed_apps: ["chrome", "notepad", "excel"], // Apps allowed to use
        screen_recording: true, // Recording operations (for debugging)
        confirmation_level: "high" // Level of confirmation required before important operations
      },
      {
        headers: {
          'Authorization': `Bearer ${API_KEY}`,
          'Content-Type': 'application/json'
        }
      }
    );

    return response.data;
  } catch (error) {
    console.error('An error occurred:', error.response?.data || error);
    throw error;
  }
}

// Usage example
async function main() {
  const result = await runComputerAgent(
    "Search for 'latest AI news' on Google and summarize the first 3 article titles in Notepad"
  );

  console.log('Task ID:', result.task_id);
  console.log('Status URL:', result.status_url);
}

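// runComputerAgent already logs the error; just mark the process as failed
main().catch(() => {
  process.exitCode = 1;
});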

With just this code, the AI will search Google and summarize the results in Notepad on your PC. It's almost frighteningly simple.
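
The example above logs a task ID and a status URL but does not wait for the task to finish. One way to do that is to poll the status URL. The sketch below is an assumption-heavy illustration: the response shape (`{ status: 'running' | 'completed' | 'failed' }`) is not documented in this post, and `fetchStatus` is a hypothetical function passed in by the caller so the polling logic can be tried without a network.

```javascript
// Poll a status endpoint until the task leaves the "running" state.
// The response shape ({ status, error?, ... }) is an assumption for this sketch;
// adjust it to whatever the API actually returns.
async function waitForTask(statusUrl, fetchStatus, { intervalMs = 2000, maxAttempts = 30 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const state = await fetchStatus(statusUrl);
    if (state.status === 'completed') return state;
    if (state.status === 'failed') {
      throw new Error(`Task failed: ${state.error ?? 'unknown error'}`);
    }
    // Still running: wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Task did not finish after ${maxAttempts} attempts`);
}

// Usage with a stub that reports "running" twice and then "completed".
const fakeStates = [
  { status: 'running' },
  { status: 'running' },
  { status: 'completed', output: 'done' },
];
let pollCount = 0;
const pollResult = waitForTask(
  'https://example.invalid/status/123',
  async () => fakeStates[pollCount++],
  { intervalMs: 1 }
);
```

With axios, `fetchStatus` could be something like `(url) => axios.get(url, { headers }).then((res) => res.data)`, reusing the same authorization header as the request above.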

Use Cases: What Can It Do?

After trying it out, here are some use cases I found particularly useful:

1. Automating Data Collection

// Example of collecting stock data daily
const collectStockData = async (ticker) => {
  return await runComputerAgent(`
    1. Open Yahoo Finance stock chart for ${ticker}
    2. Download the past month's data as CSV
    3. Open the downloaded CSV and extract only the closing prices
    4. Append the extracted data to stock_data.csv (in the format of date, ticker, closing price)
  `);
};

// Process multiple stocks
const tickers = ['AAPL', 'MSFT', 'GOOGL'];
for (const ticker of tickers) {
  await collectStockData(ticker);
}

Previously, we had to write scraping code using Selenium or Puppeteer, but now we can do it with just natural language instructions. What's more, it can adapt even if the screen UI changes!
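
Since the agent writes the file by operating a GUI, it is worth checking the result before relying on it. Here is a minimal sketch of such a check. It assumes ISO dates (`YYYY-MM-DD`) and uppercase tickers; the post only specifies the column order (date, ticker, closing price), so treat the pattern as a placeholder. The sample values are made up.

```javascript
// Return every non-empty line that does not look like "YYYY-MM-DD,TICKER,123.45".
// The date and ticker formats are assumptions; only the column order comes from the post.
function findInvalidRows(csvText) {
  const rowPattern = /^\d{4}-\d{2}-\d{2},\s*[A-Z.]{1,6},\s*\d+(\.\d+)?$/;
  return csvText
    .split(/\r?\n/)
    .map((line, index) => ({ line: line.trim(), lineNumber: index + 1 }))
    .filter(({ line }) => line !== '' && !rowPattern.test(line));
}

// Made-up sample content: two well-formed rows, one malformed row, one blank line.
const sampleCsv = [
  '2024-03-01,AAPL,100.00',
  '2024-03-01, MSFT, 200.5',
  'March 1,GOOGL,n/a',
  '',
].join('\n');

const invalidRows = findInvalidRows(sampleCsv);
console.log(invalidRows); // only the "March 1" row (line 3) is reported
```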

2. Automating Tedious Tasks

Monthly expense reports can be really tedious... These can now be delegated:

await runComputerAgent(`
  1. Log in to the company portal and open the expense report page
  2. Open "expenses-2024-03.xlsx" on the desktop
  3. Transfer the Excel data to the expense report form
  4. Upload "receipts.zip" as an attachment
  5. Review the content and press the submit button
`);

3. Automated Testing

UI testing can be really cumbersome, especially for applications where the frontend changes frequently, making test code maintenance difficult.

// Example test function
const testUserRegistration = async () => {
  return await runComputerAgent(`
    1. Open browser to http://myapp.local
    2. Click the "Register" button
    3. Fill out the form with the following information:
       - Username: testuser123
       - Email: test@example.com
       - Password: TestPass123!
       - Confirm Password: TestPass123!
    4. Click the "Register" button
    5. Verify that a registration success message is displayed
    6. Verify that it navigated to the login page
    7. Save test results to a JSON file
  `);
};

The big difference from conventional UI testing tools is that you don't need to worry about low-level implementations like "where the button is on the screen" or "what CSS selector to use." It recognizes elements just like a human would, thinking "ah, the register button is here."

Issues I Noticed and Countermeasures

I also noticed some issues when using it:

1. Authentication and Security

Naturally, the AI can see your passwords, which is quite concerning. A solution could be:

// Example for handling sensitive information
const loginTask = await runComputerAgent(`
  1. Open the login page in the browser
  2. Enter "admin" in the username field
  3. Notify me when ready to enter the password (human will input)
`);

// ↑ Here the AI stops and prompts the human to enter the password
// After the human enters the password, continue

// continueTask is a placeholder for whatever "resume task" call you have available;
// it is not defined in this post
await continueTask(loginTask.task_id, "Password entry complete. Please continue");

I recommend this hybrid approach where humans intervene only for sensitive parts.

2. Error Handling

The AI can get confused when unexpected pop-ups or error dialogs appear. You can address this as follows:

await runComputerAgent(`
  1. Open the application
  2. Import data
  3. If a "File not found" error appears:
     a. Click "OK"
     b. Select "sample.csv" and import
  4. If asked "Do you want to update?":
     a. Click "Yes"
`);

It's reassuring to specify anticipated error patterns and how to handle them in advance.
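
If you write many instructions like this, it can help to generate them from structured data rather than editing long template strings by hand. The helper below is a small sketch of that idea. `buildInstructions` and its `when`/`actions` fields are names invented for this example, not part of any API.

```javascript
// Build a numbered instruction string from plain steps plus "if X, then do Y" handlers.
function buildInstructions(steps, errorHandlers = []) {
  const lines = steps.map((step, i) => `${i + 1}. ${step}`);
  errorHandlers.forEach(({ when, actions }, i) => {
    lines.push(`${steps.length + i + 1}. If ${when}:`);
    actions.forEach((action, j) => {
      const letter = String.fromCharCode(97 + j); // a, b, c, ...
      lines.push(`   ${letter}. ${action}`);
    });
  });
  return lines.join('\n');
}

// Rebuilds the instruction text from the example above.
const instructions = buildInstructions(
  ['Open the application', 'Import data'],
  [
    {
      when: 'a "File not found" error appears',
      actions: ['Click "OK"', 'Select "sample.csv" and import'],
    },
    { when: 'asked "Do you want to update?"', actions: ['Click "Yes"'] },
  ]
);
console.log(instructions);
```

The resulting string can be passed to `runComputerAgent` as the task description.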

Advanced API Usage Techniques

Here are some more advanced usage patterns:

1. Custom Pre-processing and Post-processing

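// Example of multi-step processing
import { createInterface } from 'node:readline/promises';

// Ask a question on the terminal and return the trimmed, lowercased answer
async function promptUser(question) {
  const rl = createInterface({ input: process.stdin, output: process.stdout });
  const answer = await rl.question(`${question} `);
  rl.close();
  return answer.trim().toLowerCase();
}
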
async function processWebData() {
  // 1. Data acquisition phase
  const extractionResult = await runComputerAgent(
    "Extract data table from website and save as CSV"
  );

  // 2. Human verification
  console.log("Please verify the data:", extractionResult.output_file);
  const userConfirmation = await promptUser("Is the data correct? (yes/no)");

  if (userConfirmation === 'yes') {
    // 3. Continue: Data processing
    await runComputerAgent(
      `Open and analyze the extracted CSV file ${extractionResult.output_file}`
    );
  } else {
    console.log("Process aborted");
  }
}

2. Parallel Processing

// Execute multiple tasks in parallel
async function parallelTasks() {
  const tasks = [
    runComputerAgent("Task 1 description"),
    runComputerAgent("Task 2 description"),
    runComputerAgent("Task 3 description")
  ];

  const results = await Promise.all(tasks);
  console.log("All tasks completed:", results);
}

However, having multiple agents operate on the same screen can cause confusion, so I recommend using virtual desktops or separate windows.
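
Two further points are worth keeping in mind with `Promise.all`: it rejects as soon as any one task fails, and it starts every task at once. The sketch below shows one possible alternative that caps how many tasks run at the same time and records each outcome separately. `runWithLimit` is a hypothetical helper written for this post, and the stub tasks stand in for real `runComputerAgent` calls.

```javascript
// Run task functions with at most `limit` in flight and collect every outcome,
// in the same { status, value | reason } shape that Promise.allSettled uses.
async function runWithLimit(taskFns, limit = 2) {
  const results = new Array(taskFns.length);
  let next = 0; // index of the next task to start

  async function worker() {
    while (next < taskFns.length) {
      const index = next++;
      try {
        results[index] = { status: 'fulfilled', value: await taskFns[index]() };
      } catch (reason) {
        results[index] = { status: 'rejected', reason };
      }
    }
  }

  const workerCount = Math.min(limit, taskFns.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Usage with stub tasks; in practice each entry would be
// () => runComputerAgent("Task description").
const limitedRun = runWithLimit(
  [
    async () => 'task 1 ok',
    async () => { throw new Error('task 2 failed'); },
    async () => 'task 3 ok',
  ],
  2
);
limitedRun.then((results) => console.log(results.map((r) => r.status)));
```

With `limit` set to 1 the tasks run strictly one after another, which sidesteps the shared-screen problem at the cost of speed.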

Complete Sample Code

I've uploaded the complete code to GitHub.
If you're interested, check it out here: github.com/GOROman/computer-agent-samples
