DEV Community

Alex @Bickov

Your AI agent is guessing which element you meant

Coding agents are good at execution and bad at reading a flat image. Give one a screenshot of a busy UI and it has to infer, from pixels alone, which of six similar buttons you were pointing at. Half the time it picks wrong, you re-prompt, and you have burned a round trip plus a few thousand tokens on the image.

I started handing the agent a structured capture instead. Same screenshot, but the element I care about and what I want done with it are explicit:

```json
{
  "target": { "label": "Save draft", "region": [820, 140, 96, 32] },
  "intent": "disable until the form is valid",
  "note": "this is the secondary button, not the primary Save"
}
```
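
A capture like this can also be built and checked in code before it reaches the agent. The sketch below is only an illustration in Python: the `Capture` and `Target` classes are hypothetical names, and reading `region` as `[x, y, width, height]` in pixels is an assumption, since the post does not say what the four numbers mean. It is not any tool's real schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Target:
    label: str
    region: list[int]  # assumed order: [x, y, width, height] in pixels

    def __post_init__(self) -> None:
        # Reject captures that cannot point at a real element.
        if not self.label.strip():
            raise ValueError("label must not be empty")
        if len(self.region) != 4 or any(v < 0 for v in self.region):
            raise ValueError("region must be four non-negative numbers")


@dataclass
class Capture:
    target: Target
    intent: str
    note: str = ""

    def to_json(self) -> str:
        # asdict() recurses into the nested Target dataclass.
        return json.dumps(asdict(self), indent=2)


capture = Capture(
    target=Target(label="Save draft", region=[820, 140, 96, 32]),
    intent="disable until the form is valid",
    note="this is the secondary button, not the primary Save",
)
print(capture.to_json())
```

Validating at construction time means a malformed region fails before it is sent, not after the agent has already acted on it.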

The agent no longer guesses. It acts on the element I named. And the payload is roughly 700 tokens versus the several thousand a raw screenshot costs, so longer sessions stay inside the context window.
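
One way to hand that payload to the agent is to render it as a short text instruction next to the request. This is a minimal sketch, assuming the capture is a plain dict shaped like the JSON example in this post. `render_instruction` is a hypothetical helper, not part of any agent's API, and treating `region` as `[x, y, width, height]` is an assumption.

```python
def render_instruction(capture: dict) -> str:
    """Turn a structured capture into plain text an agent can act on."""
    target = capture["target"]
    x, y, width, height = target["region"]  # assumed order: x, y, width, height
    lines = [
        f'Target element: "{target["label"]}" at x={x}, y={y}, size {width}x{height}',
        f'Requested change: {capture["intent"]}',
    ]
    # The note is optional; include it only when one was provided.
    if capture.get("note"):
        lines.append(f'Note: {capture["note"]}')
    return "\n".join(lines)


capture = {
    "target": {"label": "Save draft", "region": [820, 140, 96, 32]},
    "intent": "disable until the form is valid",
    "note": "this is the secondary button, not the primary Save",
}
print(render_instruction(capture))
```

The target, the intent, and the disambiguating note each get their own line, so the one thing that matters is not buried in the rest of the prompt.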

Structure beats pixels when the next reader is a machine. The picture carries everything at once and signals nothing. A few fields carry the one thing that matters.

Curious whether anyone here has hit the same wall feeding screenshots to agents, and what you did about it.

Flag for voice pass

  • "good at execution and bad at reading a flat image" is a balanced parallel, the exact anti-AI pattern. Rephrase in your voice.
  • "The picture carries everything at once and signals nothing" reads clever/templated. Keep only if it is genuinely how you'd say it.
  • The closing question is a soft generic CTA. Replace with a sharper, real question if you have one.
  • Tags suggestion: #claude #ai #productivity #webdev (same as last republish).
  • The JSON is illustrative. Match it to SlimSnap's real schema field names before posting so it is accurate.
