Part 1 is on radarlog.kr.
Claude Computer Use is a tool that lets AI see your screen through screenshots and control the mouse and keyboard. In Part 1, I covered the theory — agent loops, coordinate scaling, token cost structures. Clean diagrams, neat calculations. I ended with this line:
"Next post, I'll try it for real."
I tried it. It cost me $3 to send "hi" on Telegram.
One Command, Claude Does the Rest
I enabled Computer Use in Claude Code and typed "send hi on Telegram." Computer Use is not an MCP — it's a built-in feature in Claude Code. You toggle it on in settings. No claude mcp add needed.
From here on, I just watched. I didn't touch the keyboard once after that command.
Claude's first move was requesting Telegram access permissions. macOS threw up an Accessibility permission dialog. I clicked Allow, figured that was it. It wasn't. Every click triggered another permission warning, which stole focus from Telegram, which caused the next click to fail. Infinite loop. Claude eventually asked me to manually add the permission — the only moment in this entire experiment where I touched anything. The lesson: macOS Accessibility permissions bind per-process, not per-app.
After fixing permissions, the real problem started. Claude tried clicking Telegram's message input field, but coordinates kept drifting. The coordinate scaling issue from Part 1 hit hard. The input field sits at the bottom of the screen, right above the Dock — a few pixels of error means you click the Dock instead. Claude tried 15 different coordinates, zoomed in, tried the type tool for direct keyboard input. All failed.
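To make the drift concrete, here's a minimal sketch of the scaling math. The display size (2880x1800, a common Retina panel) and the target coordinates are assumed numbers, not measurements from my machine; the 1568px longest-edge cap is from Anthropic's docs:

```python
def resize_factor(width, height, max_edge=1568):
    """Scale factor the API applies when the longest edge exceeds max_edge."""
    longest = max(width, height)
    return 1.0 if longest <= max_edge else max_edge / longest

def to_screen(x_model, y_model, width, height, max_edge=1568):
    """Map a coordinate in the resized screenshot back to physical pixels."""
    f = resize_factor(width, height, max_edge)
    return round(x_model / f), round(y_model / f)

# Assumed 2880x1800 Retina display: factor is 1568/2880, so every pixel
# in the screenshot the model sees spans ~1.8 physical pixels. A 3-pixel
# error in model space becomes 5-6 physical pixels -- enough to land on
# the Dock when the target sits right above it.
f = resize_factor(2880, 1800)           # ~0.544
x, y = to_screen(850, 930, 2880, 1800)  # a hypothetical guess at the input field
```

At 1024x768 the factor is exactly 1.0 and the model's coordinates are the screen's coordinates, which is the whole argument for the recommended resolutions.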
Around the 20th failure, Claude changed strategy on its own. "Use AppleScript through Bash instead of clicking." I didn't hint "try AppleScript." Claude reasoned through it: screenshot analysis → click failure → focus problem → alternative search — entirely autonomous. AppleScript is macOS's native automation framework that delivers key events directly at the process level, bypassing focus issues. Here's what Claude wrote and executed:
tell application "Telegram" to activate
delay 1
set the clipboard to "Hello"
tell application "System Events"
    tell process "Telegram"
        keystroke "v" using command down
        delay 0.5
        key code 36
    end tell
end tell
Worked first try. Korean text had a separate issue — keystroke "안녕" produces "aa" because AppleScript sends keycodes one at a time, breaking Korean character composition. Claude diagnosed this too and switched to clipboard-based pasting on its own.
On the second attempt, a deep link to radarbot opened a Spanish radar alert bot instead of my @YRadar_bot. Claude found the correct username by checking BotFather, then connected to the right bot.
All I said was "send hi on Telegram." Permission loops, 15 failed clicks, AppleScript discovery, Korean input diagnosis, clipboard workaround, wrong deep link — Claude handled everything autonomously.
Screenshots Scrolling By — Is This Working?
While Computer Use runs, screenshots keep appearing. Claude captures one every turn, so I can see Telegram opening, things getting clicked, more screenshots being taken — all in real time on my screen.
But I can't tell if it's going well or not.
A screenshot appears — OK, something's happening. But did the click succeed or fail? I don't know until the next screenshot arrives. If the screen hasn't changed, it failed. If it changed, it worked. But it's Claude judging that, not me. I'm just staring at screenshots scrolling by.
In gaming terms, it felt like watching a replay. Someone else (the AI) is playing, and I'm spectating. But replays are comfortable because you already know the outcome. This is spectating without knowing — and it's weirdly unnerving. "Did it just click the Dock?" I wonder, but I can't intervene. Claude just moves on to the next attempt.
Sometimes 10+ seconds pass with no new screenshot. Is Claude thinking? API latency? Something broken? No progress indicator. Just waiting. "Command and wait" is the actual Computer Use experience. Control is out of your hands. That's both freeing and unsettling.
Would Lower Resolution Have Helped?
Something I missed in Part 1: Anthropic has recommended resolutions.
General desktop: 1024x768 or 1280x720
Web apps: 1280x800 or 1366x768
Anything above 1920x1080 risks performance degradation, and the API resizes images to max 1568px anyway — higher is pointless. I didn't check my exact Mac resolution, but on a Retina display, physical resolution might trigger resizing even when logical resolution looks reasonable.
The core failure in this experiment was coordinate scaling error when clicking Telegram's input field. If I'd set the display to 1024x768, the API wouldn't resize the screenshots at all, and no resizing means no coordinate drift. Instead of 15 failed clicks, it might have taken one or two.
Same principle as lowering render resolution in games for higher framerate. In Computer Use, lower resolution means higher accuracy. Token costs drop too — Part 1's ~1,600 tokens per screenshot assumes 1024x768. Higher resolution means larger pre-resize images and slower processing.
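The check itself is trivial. A quick sketch, assuming the 1568px longest-edge cap from the docs and a hypothetical 2880x1800 Retina capture (not my measured resolution):

```python
def needs_resize(width, height, max_edge=1568):
    """True if the API will downscale a screenshot of this size."""
    return max(width, height) > max_edge

# The recommended resolutions all pass through untouched:
assert not needs_resize(1024, 768)
assert not needs_resize(1280, 800)
assert not needs_resize(1366, 768)

# A Retina Mac capturing at physical resolution gets downscaled, and every
# coordinate the model emits must then be mapped back to physical pixels --
# that mapping is where the drift comes from.
assert needs_resize(2880, 1800)
print(1568 / 2880)  # downscale factor, ~0.544
```

Note that 1920x1080 already exceeds the cap on its longest edge, which is presumably why Anthropic flags anything above it.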
Next time I use Computer Use, setting virtual display resolution to 1024x768 will be step one. The biggest lesson from this experiment was "coordinate accuracy determines everything," and resolution settings are the first lever to pull.
The $3 "Hello" — Cost Breakdown
Claude's first attempt to send two characters generated ~63 tool calls total. About 20 screenshots, 15 click attempts, 8 keyboard input attempts, 5 AppleScript executions, and 12 miscellaneous actions like wait and zoom.
On the $100 Max plan, usage went from 29% to 32%: three percentage points, and three percent of $100 is $3.
The second attempt, Claude started with deep links + AppleScript — having learned from round one. Core tool calls should have been 3-4, but the wrong deep link plus more click retries pushed it past 20 again.
Done efficiently: $0.30. What actually happened: $3. A 10x difference.
Part 1 calculated theoretical costs at $0.375–0.625 for an 8-iteration loop. In practice, failed attempts dominate the cost. Successful actions are cheap. Retry loops are expensive. Screenshots especially — each one feeds an image through the vision model, burning tokens. Lower resolution would have meant fewer click failures, fewer retry screenshots, and costs closer to that $0.30 target.
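Scaling Part 1's per-iteration figure to this run's call count actually lands close to the observed $3. A back-of-envelope sketch — the linear-scaling assumption and the 6-call "efficient" count are mine, not measurements:

```python
# Part 1's theoretical range for an 8-iteration agent loop.
part1_low, part1_high = 0.375, 0.625
per_iter_low = part1_low / 8    # ~$0.047 per tool call
per_iter_high = part1_high / 8  # ~$0.078 per tool call

# This run: ~63 tool calls. Assumes cost scales linearly with calls,
# which ignores the growing conversation context (so it's a floor).
est_low = per_iter_low * 63     # ~$2.95
est_high = per_iter_high * 63   # ~$4.92

# A hypothetical efficient run: deep link + AppleScript, few screenshots.
est_efficient = per_iter_low * 6  # ~$0.28
```

The low end of the range times 63 calls gives about $2.95, roughly the $3 the usage meter showed — the theory wasn't wrong, the iteration count was.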
Lessons — The Gap Between Theory and Practice
The agent loop from Part 1 is theoretically correct. Screenshot, analyze, act, screenshot again. Clean and elegant.
Practice is not clean.
macOS accessibility permissions binding per-process — not in the docs. Telegram's input field sitting pixels from the Dock — only learned by trying. Korean keystrokes becoming "aa" — only learned by failing. Setting the recommended resolution beforehand would have prevented half the coordinate issues. All four are real-world lessons you won't find in Anthropic's official documentation.
In Part 1, I wrote "Computer Use is a QA assistant, not a QA replacement." After the hands-on, I'd put it differently. Computer Use is a UI manipulation tool, but UI manipulation alone isn't enough. The real power emerges when you combine it with OS-native tools — Bash, AppleScript, deep links.
When I wrote Part 1, I imagined a world where the agent loop runs smoothly. That world isn't here yet. Coordinates drift, clicks land on the Dock, focus disappears. But what impressed me was Claude's autonomous problem-solving. I said "send hi on Telegram." That's it. When clicks failed, Claude tried different coordinates. When that failed, it zoomed in. When that failed, it pulled AppleScript from Bash. The entire debugging chain was performed by Claude alone.
And throughout, all I saw were screenshots scrolling by. Not knowing if it was working or not.
"Giving AI eyes (screenshots) is less powerful than giving it hands (OS-native APIs)."