I pasted a screenshot into Claude Code last week and it changed the wrong button. Not a crash, not a bad idea. It just picked the wrong element on a busy screen and edited that one.
A screenshot carries a whole screen of intent. To you it is obvious which thing you meant. To the agent it is a flat image it re-reads every turn, and on a dense UI it guesses.
What fixed it for me was marking the one element and handing over a small structured reference instead of the raw image:
{
  "element": "Save button",
  "text": "Save",
  "position": { "x": 0.82, "y": 0.12 },
  "intent": "keep it disabled until the form is valid"
}
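If you want to produce a reference like that by hand, here is a minimal Python sketch. It is not how SlimSnap does it. The names `ElementRef` and `from_pixels` are made up for illustration, and it assumes the `x` and `y` values are fractions of the image width and height rather than pixels.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementRef:
    """Hypothetical structured reference to one marked UI element."""

    element: str
    text: str
    x: float  # assumed: horizontal position as a fraction of image width
    y: float  # assumed: vertical position as a fraction of image height
    intent: str

    def to_json(self) -> str:
        # Emit the same shape as the example: a nested "position" object.
        return json.dumps(
            {
                "element": self.element,
                "text": self.text,
                "position": {"x": self.x, "y": self.y},
                "intent": self.intent,
            },
            indent=2,
        )


def from_pixels(element: str, text: str, px: float, py: float,
                width: int, height: int, intent: str) -> ElementRef:
    """Build a reference from a pixel coordinate inside a width x height image."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    if not (0 <= px <= width and 0 <= py <= height):
        raise ValueError("point lies outside the image")
    # Normalise so the reference does not depend on screenshot resolution.
    return ElementRef(element, text, round(px / width, 2),
                      round(py / height, 2), intent)


if __name__ == "__main__":
    ref = from_pixels("Save button", "Save", 1181, 86, 1440, 720,
                      "keep it disabled until the form is valid")
    print(ref.to_json())
```

Normalising the position keeps the reference meaningful even if the screenshot is later resized, and the bounds check catches a mark that landed outside the image.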
That JSON is about 700 tokens. The raw screenshot is around 1,500 tokens on Sonnet and Haiku and up to 4,784 on Opus, and it is billed again on every turn.
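To see how that adds up over a session, here is a rough back-of-the-envelope sketch using only the figures above. It assumes the attachment stays in context and is billed in full on every turn, with no prompt caching, so treat the totals as an upper-bound illustration rather than a real invoice. The function names are mine.

```python
def cumulative_tokens(tokens_per_turn: int, turns: int) -> int:
    """Total input tokens if the same attachment is re-sent on every turn."""
    if tokens_per_turn < 0 or turns < 0:
        raise ValueError("tokens_per_turn and turns must be non-negative")
    return tokens_per_turn * turns


# Per-turn figures quoted in this post (not measured by this script).
FIGURES = {
    "structured reference": 700,
    "screenshot (Sonnet/Haiku)": 1500,
    "screenshot (Opus, upper figure)": 4784,
}


def compare(turns: int) -> dict[str, int]:
    """Cumulative tokens for each attachment type over a number of turns."""
    return {label: cumulative_tokens(cost, turns) for label, cost in FIGURES.items()}


if __name__ == "__main__":
    turns = 10
    for label, total in compare(turns).items():
        print(f"{label}: {total:,} tokens over {turns} turns")
```

The gap grows linearly with the number of turns, so the longer the session, the more the raw screenshot costs relative to the structured reference.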
So the screenshot is the longest prompt you will ever write. The same screenshot, marked and structured, is one of the shortest.
I built a small free Mac tool around this idea (SlimSnap). The tool is not the point, though. The point is that the agent should not have to guess which element you meant. Mark it, and the guessing goes away.