Yesterday (March 23rd), Anthropic released Claude Computer Use as a research preview. macOS only, Pro/Max subscribers only. Claude opens apps on your Mac, navigates your browser, fills in spreadsheets.
Watching the demo, my first thought was immediate: "Isn't this just OpenClaw?"
Kind of. But fundamentally different.
What Does "AI Using a Computer" Even Mean
The core of Computer Use Tool is simple. Claude takes a screenshot to see the screen, then controls the mouse and keyboard to get things done. When there's no API integration for an app, it just looks at the screen and clicks — like a human would.
If you're a game developer, this clicks right away. It's the same flow as QA automation testing: screen capture → image matching → input simulation. Think Appium or Sikuli, but the LLM replaces the image matcher. The difference is that instead of pixel-matching, a vision model "understands" the screen, and instead of hardcoded scenarios, natural language commands drive the actions.
The available actions include screenshot, left_click, type, key, and mouse_move as basics. Claude 4.x models add scroll, right_click, double_click, triple_click, left_click_drag, hold_key, and wait. Opus 4.6 even gets zoom — inspecting a specific screen region at full resolution for precise recognition of small UI elements.
It's like zooming into a Slate widget in UE5 to debug pixel-level layout issues. When the AI thinks "what does that tiny button say?", it zooms in to check.
The Agent Loop — The Real Architecture
The real core of Computer Use isn't individual actions. It's the agent loop.
Here's the flow. User says "save a cat picture to my desktop." Claude responds "I'll take a screenshot." Your application actually captures the screenshot and returns it. Claude analyzes the screen and responds "I'll click the browser." Your application executes the click and returns the result.
This repeats until Claude decides the task is done.
# The agent loop skeleton
iterations = 0
while iterations < max_iterations:
    iterations += 1
    response = client.beta.messages.create(
        model=model,
        tools=tools,
        messages=messages,
    )
    # Keep Claude's turn in the history, or the next call will fail
    messages.append({"role": "assistant", "content": response.content})
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = run_tool(block.name, block.input)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result,
            })
    if not tool_results:
        break  # No tool calls = task complete
    messages.append({"role": "user", "content": tool_results})
This is structurally similar to a game server's tick loop. Every tick: check state, make decisions, execute actions, check state again. Except a game loop runs at 16ms per tick, while this involves screenshot capture + API calls + vision analysis per turn, so latency is noticeable. Anthropic themselves recommend focusing on use cases where speed isn't critical.
The crucial point is that Claude doesn't directly connect to your computer. Your code sits in the middle, receives Claude's requests, executes them in the real environment, and feeds back results. Claude says "left_click at (512, 384)" — your code does the actual clicking.
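That mediation layer is just a small dispatcher. Here's a minimal sketch — the executor callbacks stand in for whatever input library you use (pyautogui, xdotool, CGEvent); the callback names are my own, not part of the API:

```python
def dispatch_action(action, executor):
    """Translate one tool_use input dict from Claude into a concrete call.

    `executor` maps operation names to callables supplied by your code;
    the keys ("screenshot", "click", "type_text") are illustrative.
    """
    kind = action["action"]
    if kind == "screenshot":
        return executor["screenshot"]()        # returns base64 image data
    if kind == "left_click":
        x, y = action["coordinate"]            # Claude's requested (x, y)
        return executor["click"](x, y)
    if kind == "type":
        return executor["type_text"](action["text"])
    raise ValueError(f"unsupported action: {kind}")
```

The indirection is the point: swap in an executor that talks to an Xvfb display inside a Docker container and nothing else in the loop changes.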
This architecture means you can run it safely inside a sandbox. Spin up a Docker container with a virtual display (Xvfb), let Claude work in there, and your host system stays untouched.
Coordinate Scaling — A Sneaky Problem
The first practical issue you'll hit is coordinate scaling. The API resizes images to a max of 1568px on the longest edge. A 1512x982 screen gets downsampled to roughly 1330x864. Claude analyzes the smaller image and returns coordinates in that space, but your clicks need to happen in the original resolution.
import math

def get_scale_factor(width, height):
    long_edge = max(width, height)
    total_pixels = width * height
    long_edge_scale = 1568 / long_edge
    total_pixels_scale = math.sqrt(1_150_000 / total_pixels)
    return min(1.0, long_edge_scale, total_pixels_scale)

scale = get_scale_factor(screen_width, screen_height)

# Scale Claude's coordinates back to original resolution
def execute_click(x, y):
    screen_x = x / scale
    screen_y = y / scale
    perform_click(screen_x, screen_y)
Same problem as DPI scaling in UE5 UI — logical coordinates vs. physical coordinates. That bug where Slate widget positions are wrong on Retina displays because you forgot to account for the display scale factor? The exact same thing happens to AI. Without proper inverse scaling, Claude thinks it's clicking a button but actually hits something else entirely.
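To see the inverse scaling work end to end, here's the 1512x982 Retina example from above as a quick round trip (same formula as the snippet, folded into one function):

```python
import math

def scale_factor(width, height, max_edge=1568, max_pixels=1_150_000):
    # Cap both the longest edge and the total pixel count; never upscale.
    return min(1.0, max_edge / max(width, height),
               math.sqrt(max_pixels / (width * height)))

s = scale_factor(1512, 982)                    # ~0.88
api_size = (round(1512 * s), round(982 * s))   # ~(1331, 864): what Claude sees

# A click Claude reports at (665, 432) -- the center of ITS downscaled
# image -- must land at the center of the physical screen, (756, 491).
screen_x, screen_y = 665 / s, 432 / s
```

Without the division by `s`, that click would land ~90 pixels up and to the left of the intended target.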
OpenClaw vs Claude Computer Use — What's Actually Different
Both do "AI controls a computer." But the design philosophy is worlds apart.
OpenClaw is a general-purpose agent platform. Austrian developer Peter Steinberger launched it in November 2025 as "Clawdbot," then renamed it twice — to Moltbot, then OpenClaw — after Anthropic raised trademark concerns. It blew past 200K GitHub stars and became one of the fastest-growing open source projects ever. It's model-agnostic: plug in Claude, GPT, Gemini, or local LLMs. Send commands through WhatsApp, Telegram, or Discord, and it handles the rest on your machine.
Claude Computer Use is an API tool. Claude-model-only, running inside Anthropic's Messages API. You define the tool, Claude requests actions, and you implement the agent loop yourself.
The biggest difference is who controls the environment. OpenClaw installs directly on your local machine, reads and writes files, executes shell commands. Convenient, but risky. An API key storage vulnerability (CVE-2026-25253) exposed plaintext credentials. Infostealers were caught harvesting entire OpenClaw configurations. Major Korean companies including Naver, Kakao, and Baemin banned internal usage. The Dutch data protection authority warned against deploying experimental agents like OpenClaw on systems handling sensitive data.
Claude Computer Use is designed around isolated environments — Docker containers, VMs. Claude never touches the host directly. Your code mediates every action. A prompt injection classifier runs automatically on every screenshot, asking for user confirmation when suspicious instructions are detected.
Here's the analogy. OpenClaw is giving a friend your computer password and telling them to figure it out. Claude Computer Use is putting someone in an isolated room, showing them the screen via CCTV, and relaying their click requests. OpenClaw wins on freedom. Computer Use wins on security, and it's not close.
One more thing. OpenClaw focuses on "life-integrated" features — messenger integration, autonomous scheduling, a Skills marketplace (ClawHub). Claude Computer Use is a building block for developers to embed desktop automation into their own products. OpenClaw is the finished product. Computer Use is the component.
macOS Only — Windows Has to Wait
The Computer Use research preview in Cowork and Claude Code is macOS only. Not Windows. Not Linux. The macOS desktop app must be running. You can assign tasks from your phone via Dispatch, but the actual computer control happens on the Mac.
Anthropic hasn't officially explained why Mac first, but the reasoning seems clear. macOS has well-structured accessibility APIs. Apple Silicon's unified memory architecture is advantageous for vision processing. And frankly, OpenClaw already proved that Mac users have the highest demand for this kind of tool. The OpenClaw community was dominated by "running it 24/7 on a Mac mini M4" stories. Apple Silicon's power efficiency and silence make it the only realistic always-on option.
For game developers, this is a bit disappointing. UE5 development is mostly Windows, build pipelines run on Windows servers. If you wanted to wire Claude Computer Use into build automation, you can't do it through Cowork right now. But the API-level Computer Use Tool has no platform restriction. Spin up a Linux environment in Docker and run it there. The Mac-only limitation applies specifically to the "control my actual computer" feature in Cowork/Claude Code.
Anthropic says they plan to expand platform support after gathering early feedback. Windows support is a matter of when, not if.
So How Many Tokens Does This Actually Burn
Computer Use is expensive not because of individual actions, but because of cumulative context. Every agent loop iteration feeds the entire previous conversation back as input. Loop 1 has 1 screenshot. Loop 8 has 8 screenshots plus all 7 previous responses stacked into the input tokens.
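That growth is easy to model. A back-of-the-envelope sketch using this post's per-item estimates (system+tools ~1,634 tokens, ~100-token user message, ~1,600 per screenshot, ~250 of output and tool_result per earlier loop — estimates, not official figures):

```python
def loop_input_tokens(n, fixed=1_634, user=100, shot=1_600, prev=250):
    """Estimated input tokens for loop n of the agent loop."""
    # fixed prefix + user message + n screenshots + (n - 1) prior turns
    return fixed + user + shot * n + prev * (n - 1)

# Loop 1 comes to ~3,334 tokens; loop 8 alone to ~16,284.
total_input = sum(loop_input_tokens(n) for n in range(1, 9))  # ~78,472
```

The screenshot term grows linearly per loop, so the *sum* over all loops grows quadratically — that's why an 8-loop task costs far more than 8x a 1-loop task.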
Let's start with the fixed costs. Every API call carries a system prompt overhead of ~499 tokens, plus tool definitions (computer at 735 tokens, bash and editor at ~200 each). On top of that, each screenshot costs ~1,600 tokens. That's the official figure from Anthropic's vision docs for a 1024x768 image.
Here's what one loop looks like in concrete numbers.
Loop 1 input:
System prompt 499 tokens
Tool defs (3) 1,135 tokens
User message ~100 tokens
Screenshot (1) 1,600 tokens
────────────────────────
Total ~3,334 tokens
Loop 1 output:
Thinking 1,024 tokens (minimum)
Action response ~150 tokens
────────────────────────
Total ~1,174 tokens
Looks manageable so far. But the problem snowballs as loops accumulate.
Loop 8 input:
System prompt 499 tokens
Tool defs (3) 1,135 tokens
User message ~100 tokens
Screenshots (8) 12,800 tokens ← this is the killer
Previous outputs ~1,400 tokens
tool_results (7) ~350 tokens
────────────────────────
Total ~16,284 tokens
A single turn at loop 8 eats 16K input tokens. Summed from loop 1 through 8, total input hits roughly 78,000 tokens and output around 9,400.
On Sonnet 4.6 pricing: input $3/MTok × 78K = $0.234, output $15/MTok × 9.4K = $0.141. One task costs $0.375. "Save a cat picture to my desktop" is a 40-cent operation.
Switch to Opus 4.6 at $5/$25 per MTok and the same task runs $0.625. Run 100 tasks a day and you're looking at $1,875/month.
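The billing math above, as a function (prices are dollars per million tokens; the 78K/9.4K totals are this post's estimates, not measured values):

```python
def task_cost_usd(input_tokens, output_tokens, in_price, out_price):
    """Cost of one task in dollars, with prices in $ per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

sonnet = task_cost_usd(78_000, 9_400, 3, 15)   # $0.375 per task
opus = task_cost_usd(78_000, 9_400, 5, 25)     # $0.625 per task
monthly = opus * 100 * 30                      # $1,875 at 100 tasks/day
```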
But there's room to optimize. Prompt caching drops the system prompt and tool definition tokens (~1,634 tokens that repeat every turn) to 10% of base price on cache hits. That's a 90% discount on the fixed overhead that repeats every single turn. For async workloads, the Batch API adds another 50% discount.
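In the Messages API, caching is enabled with a cache_control marker. Because the cacheable prefix runs tools → system → messages, a single marker on the final system block covers the tool definitions too. A sketch of building the request kwargs (the prompt text is a placeholder):

```python
def with_prompt_caching(system_prompt, tools, messages):
    """Request kwargs with the fixed prefix (tools + system) marked cacheable."""
    return {
        "system": [{
            "type": "text",
            "text": system_prompt,
            # Caches everything up to and including this block -- the tool
            # definitions and the system prompt -- at ~10% of base input
            # price on cache hits.
            "cache_control": {"type": "ephemeral"},
        }],
        "tools": tools,
        "messages": messages,
    }

kwargs = with_prompt_caching("You are a computer-use agent.", [], [])
```

Pass these straight into `client.beta.messages.create(**kwargs, ...)` alongside the model and beta flags.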
In game dev, there's a fundamental principle: cache anything that repeats every frame. The exact same principle applies here. Cache the system prompt and tool definitions, and if possible, reduce screenshot resolution to save tokens. Dropping from 1024x768 to 800x600 cuts per-screenshot tokens noticeably. Of course, lower resolution means lower vision accuracy — it's a tradeoff. Render resolution vs. framerate, same structure.
Game QA With Computer Use — What Works and What Doesn't
Can you use Claude Computer Use in a UE5 project right now? At the API level, yes. Docker + Linux + Xvfb + browser, then tell Claude to "pull this sprint's bug list from Jira and organize it into a spreadsheet."
But the really interesting part is QA automation. And there's a clear line dividing what's possible from what isn't.
Definitely works: functional QA
Think about traditional UI automation testing. XPath and widget ID based — when the UI changes, every test script breaks. Move one button and 10 tests fail. Computer Use understands the screen visually, so it can find the "Settings" button even after a layout redesign.
Here's what that looks like concretely.
Scenario: Main menu → Settings → Graphics → Change resolution → Apply

Traditional approach (Selenium/Appium style):

    driver.find_element(By.ID, "btn_settings").click()
    driver.find_element(By.XPATH, "//div[@class='graphics-tab']").click()
    → Breaks on every UI refactor

Computer Use approach:

    "Find the settings button in the main menu and click it.
    Navigate to the graphics tab, change resolution to 1920x1080, and apply."
    → Claude reads the screenshot and figures it out
This kind of functional QA is Computer Use's sweet spot. Menu navigation, verifying settings actually apply, login/logout flows, purchasing items and checking inventory. The key: tests where there's a clear expected outcome — "do A, expect B."
Visual regression testing fits naturally too. Capture screenshots of the same scene before and after a patch, then ask Claude to spot differences. It can catch broken textures, clipped UI elements, changed font rendering. The advantage over pixel-diff tools is context: Claude can distinguish "this background color change was intentional" from "this health bar being cut off is a bug."
Localization QA is another strong case. Spin up 10 language builds, tell Claude "navigate every menu and flag any truncated or broken text." A human takes 2-3 hours per language. Claude can run them in parallel.
Gray area: performance QA
Can it find areas where FPS drops below 30? Half yes.
Claude can read an FPS counter from a screenshot. If there's a debug overlay showing "28 fps," it'll recognize that. But the agent loop cycle — screenshot → analysis → action → screenshot — takes several seconds per turn. Catching real-time frame drops is impossible. A 0.5-second stutter between screenshots is invisible.
But it works for slower exploration. "Walk through each zone of the map, read the FPS counter, and log any area below 30fps." It's slow but covers wide areas — exploratory performance profiling. Not a precision benchmark, but good enough for QA teams to generate a rough heatmap of "where's it slow?"
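The logging side of that exploration is plain aggregation. A sketch, assuming Claude's readings have already been parsed into (zone, fps) pairs — the data format is made up for illustration:

```python
from statistics import median

def slow_zones(readings, floor=30):
    """Return zones whose median FPS reading falls below `floor`."""
    by_zone = {}
    for zone, fps in readings:
        by_zone.setdefault(zone, []).append(fps)
    return {zone: median(vals) for zone, vals in by_zone.items()
            if median(vals) < floor}
```

Using the median rather than the minimum keeps a single misread counter from flagging a zone that's actually fine.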
Long-running tests work similarly. "Play the game for 2 hours and read memory usage from the task manager every 30 minutes." The kind of tedious monitoring that nobody wants to sit through.
Can't do it: judging the "feel" of a game
This is the core question. Can AI judge whether a game is fun?
Short answer: no. At least not yet.
The most important and hardest-to-automate area of game QA is "game feel" — the tactile sensation of play. The difference between a hit registering 100ms after pressing the button versus 150ms. Numerically, that's 50ms. But the player feels "something's off." This is what developers call game feel or juice.
Computer Use cannot judge these subtle differences. Three structural reasons make it impossible.
First, temporal resolution is too low. Computer Use is screenshot-based. At several seconds per cycle, there's no way to detect a 50ms input delay difference. That's 3 frames in a 60fps game — completely invisible in a single screenshot.
Second, feel is subjective. The line between "camera shake that adds excitement" and "camera shake that causes motion sickness" varies per person. The difference between a 0.05-second and 0.08-second hitstop (the brief frame freeze on hit for impact feedback) — ask Claude and it'll say the two screenshots look "almost identical." The feedback loop that lives in a player's fingertips can't be reconstructed from images.
Third, contextual accumulation matters. Fun in games comes from flow, not moments. Easy section → hard section → reward. Judging whether the difficulty curve is right requires 30-60 minutes of continuous play while tracking emotional shifts. Running the agent loop for an hour would cost a fortune in tokens, and Claude doesn't have emotional states — it can't tell you "I'm bored right now" or "that was a satisfying challenge."
But there's a workaround.
Claude can't directly judge "fun or not fun." But it can collect proxy metrics that correlate with fun.
Proxy metric collection example:
"Play for 10 minutes and record:
- Number of deaths in the same section
- Time from death to restart (UI flow)
- Time spent reading item descriptions in the shop
- Number of backtracks in the skill tree"
Five deaths in the same section signals a difficulty spike. Reading an item description for 10+ seconds means the tooltip is confusing. Repeatedly backing out of the skill tree suggests the choices aren't intuitive. Claude collects these behavioral patterns, generates a report like "deaths in this section are 3x the average," and the designer makes the judgment call based on data AI collected.
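Turning those behavioral logs into the "3x the average" flag is simple aggregation. A sketch, with a made-up event format:

```python
from collections import Counter

def flag_death_spikes(death_events, multiple=3.0):
    """Flag sections whose death count exceeds `multiple` x the mean count."""
    counts = Counter(event["section"] for event in death_events)
    mean = sum(counts.values()) / len(counts)
    return sorted(section for section, n in counts.items()
                  if n > multiple * mean)
```

Claude fills the event list; the designer reads the flags.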
Animation blending is similar. Tell Claude "repeat the run-to-stop transition and capture screenshots at 5-frame intervals — flag any moment where the character pose looks unnatural." It can catch visual artifacts — arms clipping through the body, feet sinking into the floor. But "does this blend feel good?" still requires a human.
Here's the breakdown.
Definitely possible:
Functional QA (menu nav, settings, purchase flows)
Visual regression (pre/post patch visual diffs)
Localization QA (text truncation, broken strings)
Automated screenshot collection + report generation
Conditionally possible:
Performance monitoring (reading FPS counters — not real-time)
Proxy metric collection (death counts, UI dwell time)
Visual artifact detection (clipping, penetration)
Not possible:
Game feel / tactile feedback judgment
Difficulty balance "appropriateness"
Emotional impact of cinematics
Real-time frame drop detection
Computer Use is a QA assistant, not a QA replacement. It absorbs the repetitive work that human QA hates doing, so humans can focus on the question that matters: "Is this fun?" There's an old saying in the game industry: "Fun isn't in the spreadsheet." If AI fills the spreadsheet, humans can spend their time finding the fun.
"AI can find bugs. AI can't find fun. Not yet."
Next post: I actually do it. I hook Claude Computer Use up to Telegram, throw real tasks at it, and record how many tokens it actually burns, how long the setup takes, and where it breaks. Theory ends here. Next up is the field report.
→ Claude Now Controls Your Desktop 2/2 — Running It From Telegram, a Hands-On Report