📯Introducing CUGA

#cuga #opensource #agents #react

CUGA (ConfigUrable Generalist Agent)

Introduction — What is CUGA?

CUGA (ConfigUrable Generalist Agent) is an open-source generalist agent framework from IBM Research, purpose-built for enterprise automation. Designed for developers, CUGA combines and improves on foundational agentic patterns such as ReAct, CodeAct, and Planner-Executor in a modular architecture that enables trustworthy, policy-aware, and composable automation across web interfaces, APIs, and custom enterprise systems.

CUGA achieves state-of-the-art performance on leading benchmarks:

🥇 #1 on AppWorld — a benchmark with 750 real-world tasks across 457 APIs, and
🥈 #2 on WebArena — a complex benchmark for autonomous web agents across application domains.

Key features

Complex task execution: State-of-the-art results across web and API tasks.
Flexible tool integrations: CUGA works across REST APIs via OpenAPI specs, MCP servers, and custom connectors.
Composable agent architecture: CUGA itself can be exposed as a tool to other agents, enabling nested reasoning and multi-agent collaboration.
Configurable reasoning modes: Choose between fast heuristics or deep planning depending on your task’s complexity and latency needs.
Policy-aware instructions (Experimental): CUGA components can be configured with policy-aware instructions to improve alignment of the agent behavior.
Save & Reuse (Experimental): CUGA captures and reuses successful execution paths, enabling consistent and faster behavior across repeated tasks.

ReAct Agents — TL;DR

ReAct (Reasoning and Acting) agents are a powerful framework for AI agents that enhance a Large Language Model’s (LLM) ability to handle complex, multi-step tasks by integrating logical reasoning (Thought) with external action (Action) in a continuous, interleaved loop. The core idea is to prompt the LLM to alternate between articulating its chain-of-thought to plan and track its progress, and executing actions like using a search engine, calling an API, or interacting with a database. This continuous cycle of *Thought → Action → Observation* allows the agent to dynamically gather external information, adjust its plan based on the results of its actions, overcome issues like factual hallucinations, and ultimately reach a more accurate and reliable final answer than reasoning-only or action-only approaches.
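To make the Thought → Action → Observation cycle concrete, here is a minimal sketch in plain Python. It is not CUGA's implementation: `scripted_llm` is a hypothetical stand-in for a real model call, `lookup_revenue` is an invented tool with made-up figures, and the `Action: tool[input]` text format is just one common convention assumed for illustration.

```python
from typing import Callable

# Hypothetical tool registry: each tool maps a string input to a string observation.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_revenue": lambda account: {"acme": "1200000", "globex": "950000"}.get(account, "unknown"),
}


def scripted_llm(transcript: list[str]) -> str:
    """Stand-in for a real LLM call: emits a Thought plus an Action, or a final answer."""
    observations = [line for line in transcript if line.startswith("Observation:")]
    if not observations:
        return "Thought: I need Acme's revenue.\nAction: lookup_revenue[acme]"
    latest = observations[-1].removeprefix("Observation: ")
    return f"Thought: I have the figure.\nFinal Answer: {latest}"


def react_loop(question: str, llm: Callable[[list[str]], str], max_steps: int = 5) -> str:
    """Alternate between model output and tool calls until a final answer appears."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript.append(reply)
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        # Parse "Action: tool[input]" and run the named tool.
        action_line = next(line for line in reply.splitlines() if line.startswith("Action:"))
        name, _, arg = action_line.removeprefix("Action:").strip().partition("[")
        observation = TOOLS[name](arg.rstrip("]"))
        # Feed the result back so the next model call can use it.
        transcript.append(f"Observation: {observation}")
    raise RuntimeError("no final answer within max_steps")


print(react_loop("What is Acme's revenue?", scripted_llm))
```

The key design point is that each observation is appended to the transcript before the next model call, so the model's following Thought can react to what the tool actually returned.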

CodeAct Agents — TL;DR

CodeAct agents are a specialized type of AI agent that use executable Python code as a unified action space to interact with their environment, rather than being limited to pre-defined tools or formatted text commands. Integrated with a sandboxed Python interpreter, a CodeAct agent can generate a code snippet (its “action”), execute it, and then receive the output or any error message as an Observation. This iterative code execution and self-debugging loop allows the agent to dynamically utilize existing software packages, handle complex logic like loops and conditionals, and correct its mistakes in real-time, making it exceptionally powerful for sophisticated tasks such as data analysis, model training, and composing multiple APIs in a flexible manner. Essentially, the CodeAct paradigm turns the Large Language Model (LLM) into an autonomous developer, enabling it to write, run, and refine its own solutions.
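The execute-observe-retry loop described above can be sketched as follows. This is an illustration under stated assumptions, not CUGA's code: `scripted_model` is a hypothetical stand-in for an LLM that first emits buggy code and then a fix, and plain `exec()` is used only for brevity. It is not a sandbox, so a real agent would need an isolated interpreter.

```python
import contextlib
import io
import traceback


def run_action(code: str, namespace: dict) -> str:
    """Execute a code action and return its stdout, or the error text, as the observation.

    WARNING: exec() offers no isolation; this is for illustration only.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception:
        # Keep only the final traceback line, e.g. "NameError: name 'x' is not defined".
        return "Error: " + traceback.format_exc().strip().splitlines()[-1]
    return buffer.getvalue().strip()


def scripted_model(observations: list[str]) -> str:
    """Stand-in for an LLM: the first attempt contains a typo, the second corrects it."""
    if not observations:
        return "print(sum(revenues) / len(revenue))"  # bug: 'revenue' is undefined
    return "print(sum(revenues) / len(revenues))"


def codeact_loop(model, namespace: dict, max_steps: int = 3) -> str:
    """Generate code, run it, and retry with the error as feedback until it succeeds."""
    observations: list[str] = []
    for _ in range(max_steps):
        observation = run_action(model(observations), namespace)
        observations.append(observation)
        if not observation.startswith("Error:"):
            return observation
    raise RuntimeError("no successful action within max_steps")


print(codeact_loop(scripted_model, {"revenues": [100, 200, 300]}))
```

Here the first action fails with a `NameError`, the error text becomes the observation, and the second action succeeds, which mirrors the self-debugging behavior the paragraph describes.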

Planner-Executor Agents — TL;DR

Planner-Executor agents represent an architectural pattern designed to enhance the reliability and efficiency of Large Language Model (LLM) agents by explicitly separating the thinking phase from the action phase. In this framework, the Planner (often a powerful LLM) first generates a comprehensive, multi-step strategy to achieve a complex goal. Subsequently, a separate Executor component is responsible for carrying out the plan one step at a time, using tools or APIs to perform specific actions. This separation provides several benefits: it enforces long-term planning, allows for a faster execution loop by reducing the need to consult the large LLM after every action, and permits the use of smaller, more cost-effective models for the execution phase. If an execution step fails or new information is observed, the agent can re-engage the planner to dynamically revise the remaining plan.

In short, the CUGA project offers enterprise agent builders a powerful, open-source, and modular framework designed for enterprise automation. It combines the foundational agentic patterns described above into a single architecture.

Testing CUGA Locally

Here are the steps to try CUGA locally ⬇️

Prerequisites:

Python 3.12
uv package manager

# In terminal, clone the repository and navigate into it
git clone https://github.com/cuga-project/cuga-agent.git
cd cuga-agent

# 1. Create and activate virtual environment
uv venv --python=3.12 && source .venv/bin/activate

# 2. Install dependencies
uv sync

# 3. Set up environment variables
# Create .env file with your API keys
echo "OPENAI_API_KEY=your-openai-api-key-here" > .env

# 4. Start the demo
cuga start demo

# Chrome will open automatically at https://localhost:8005
# then try sending your task to CUGA: 'get top account by revenue from digital sales'

CUGA supports multiple LLM providers with flexible configuration options. You can configure models through TOML files or override specific settings using environment variables.

## Supported Platforms

- **OpenAI** - GPT models via OpenAI API (also supports LiteLLM via base URL override)
- **IBM WatsonX** - IBM's enterprise LLM platform
- **Azure OpenAI** - Microsoft's Azure OpenAI service
- **RITS** - Internal IBM research platform

## Configuration Priority

1. **Environment Variables** (highest priority)
2. **TOML Configuration** (medium priority)
3. **Default Values** (lowest priority)

### Option 1: OpenAI 🌐

**Setup Instructions:**

1. Create an account at [platform.openai.com](https://platform.openai.com)
2. Generate an API key from your [API keys page](https://platform.openai.com/api-keys)
3. Add to your `.env` file:
   # OpenAI Configuration
   OPENAI_API_KEY=sk-...your-key-here...
   AGENT_SETTING_CONFIG="settings.openai.toml"

   # Optional overrides
   MODEL_NAME=gpt-4o                    # Override model name
   OPENAI_BASE_URL=https://api.openai.com/v1  # Override base URL
   OPENAI_API_VERSION=2024-08-06        # Override API version

Examples

There are several ready-to-run examples in the repository, and one of them can even be tried online.

Below is the "CUGA as MCP" sample, which exposes CUGA as tools on an MCP server 🆓

from cuga.backend.activity_tracker.tracker import ActivityTracker
from cuga.backend.cuga_graph.utils.controller import AgentRunner as CugaAgent, ExperimentResult as AgentResult
from fastmcp import FastMCP
import os

# Initialize components
tracker = ActivityTracker()
mcp = FastMCP("CUGA Running as MCP 🚀")

# Set the environment file to the .env file in the current directory
os.environ["ENV_FILE"] = os.path.join(os.path.dirname(__file__), ".env")
os.environ["MCP_SERVERS_FILE"] = os.path.join(os.path.dirname(__file__), "mcp_servers.yaml")


@mcp.tool
async def run_api_task(task: str) -> str:
    """
    Run a task using API mode only - headless browser automation without GUI interaction
    Args:
        task: The task description to execute

    Returns:
        str: The result of the task execution
    """
    os.environ["DYNA_CONF_ADVANCED_FEATURES__MODE"] = "api"
    cuga_agent = CugaAgent(browser_enabled=False)
    await cuga_agent.initialize_appworld_env()
    task_result: AgentResult = await cuga_agent.run_task_generic(eval_mode=False, goal=task)
    return task_result.answer


@mcp.tool  # register as an MCP tool, like the API and hybrid variants, so clients can call it
async def run_web_task(task: str, start_url: str) -> str:
    """
    Run a task using web mode only - browser automation with GUI interaction
    Args:
        task: The task description to execute
        start_url: The starting URL for the task

    Returns:
        str: The result of the task execution
    """
    os.environ["DYNA_CONF_ADVANCED_FEATURES__MODE"] = "web"
    cuga_agent = CugaAgent(browser_enabled=False)
    try:
        await cuga_agent.initialize_freemode_env(start_url=start_url, browser_mode="browser_only")
        task_result: AgentResult = await cuga_agent.run_task_generic(eval_mode=False, goal=task)
    except Exception:
        if hasattr(cuga_agent, "env") and cuga_agent.env:
            cuga_agent.env.close()
        raise
    else:
        cuga_agent.env.close()
        return task_result.answer


@mcp.tool
async def run_hybrid_task(task: str, start_url: str) -> str:
    """
    Run a task using hybrid mode - combination of API and web interaction
    Args:
        task: The task description to execute
        start_url: The starting URL for the task

    Returns:
        str: The result of the task execution
    """
    os.environ["DYNA_CONF_ADVANCED_FEATURES__MODE"] = "hybrid"
    cuga_agent = CugaAgent(browser_enabled=False)
    try:
        await cuga_agent.initialize_freemode_env(start_url=start_url, browser_mode="browser_only")
        task_result: AgentResult = await cuga_agent.run_task_generic(eval_mode=False, goal=task)
    except Exception:
        if hasattr(cuga_agent, "env") and cuga_agent.env:
            cuga_agent.env.close()
        raise
    else:
        cuga_agent.env.close()
        return task_result.answer


def main():
    """Main function to run the MCP server."""
    print("Starting FastMCP server...")
    print("Server must be running before starting the main application.")
    mcp.run(transport="sse")


if __name__ == "__main__":
    main()

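# Client-side usage (separate from the server file above).
# `mcp_client` is assumed to be an already-connected MCP client, and these
# calls must run inside an async function because they use `await`.

# API Mode - Headless execution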
result = await mcp_client.call_tool("run_api_task", {
    "task": "get my top account by revenue"
})

# Web Mode - Browser GUI execution
result = await mcp_client.call_tool("run_web_task", {
    "task": "navigate to dashboard and get revenue data",
    "start_url": "https://example.com"
})

# Hybrid Mode - Combined execution
result = await mcp_client.call_tool("run_hybrid_task", {
    "task": "analyze data from API and web interface",
    "start_url": "https://example.com"
})

Once the prerequisites are installed and all the API keys are provided, the code can be tested locally.

Conclusion
CUGA (ConfigUrable Generalist Agent) is an open-source agent framework from IBM Research aimed at trustworthy enterprise automation. It reports state-of-the-art results, ranking #1 on AppWorld and #2 on WebArena, which points to strong performance on complex tasks across web interfaces and APIs. By combining and improving on foundational agentic patterns such as ReAct, CodeAct, and Planner-Executor in a modular, composable, and policy-aware architecture, CUGA gives developers a flexible base for building robust, reliable agents that can handle intricate workflows across modern enterprise systems.

Thanks for reading 🤗