Why I feed my coding agent JSON instead of screenshots

Claude Code can read images. So can Cursor. So can ChatGPT. I built SlimSnap anyway, and the reason is boring: the image is the wrong shape for the job.

Here is the boring version. The job is "show the agent what is on my screen and have it act." For that job a retina screenshot pasted into a coding session is somewhere between expensive and lossy, depending on how much you care about your token budget, your context window, and the agent acting on the right thing.

The token math

A screenshot pasted to Claude as a vision input is downscaled and billed at the API's per-image cap: about 1,568 tokens on Sonnet and Haiku (the models Claude Code uses by default), up to 4,784 tokens on Opus 4.7 and 4.8. Pasted to Codex CLI (which runs on OpenAI's GPT-4o), a typical 1440x900 screenshot in high detail mode runs about 1,105 tokens. Pasted to Gemini CLI on Gemini 2.5, the same image is about 1,548 tokens. The same screen, turned into a SlimSnap JSON document, runs about 700 tokens. That JSON contains the elements, their normalized bounding boxes, their extracted colors, and the OCR text for each.
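
The 1,105 figure can be reproduced by hand. Here is a minimal sketch, assuming OpenAI's published high-detail accounting for GPT-4o: fit the image inside 2048x2048, shrink it so the shortest side is 768 px, then charge 170 tokens per 512 px tile plus a flat 85. The function name is mine, and the sketch assumes images are only ever scaled down, never up.

```python
import math


def gpt4o_high_detail_tokens(width_px: int, height_px: int) -> int:
    """Estimate vision tokens for one image in GPT-4o high detail mode."""
    w, h = float(width_px), float(height_px)

    # Step 1: fit the image inside a 2048 x 2048 square, keeping aspect ratio.
    longest = max(w, h)
    if longest > 2048:
        w, h = w * 2048 / longest, h * 2048 / longest

    # Step 2: shrink so the shortest side is 768 px (assumption: never upscale).
    shortest = min(w, h)
    if shortest > 768:
        w, h = w * 768 / shortest, h * 768 / shortest

    # Step 3: 170 tokens per 512 px tile, plus a flat 85 tokens per image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(gpt4o_high_detail_tokens(1440, 900))  # 1105
```

A 1440x900 capture scales to roughly 1229x768, which is 3 by 2 tiles, so 6 x 170 + 85 = 1,105.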

That is about 55 percent fewer tokens per turn than Sonnet or Gemini, 37 percent fewer than Codex CLI, and up to 85 percent fewer than Opus. It is also the only representation here that carries structured intent the agent can act on.

Representation	Per-turn tokens
Screenshot on Opus 4.7 / 4.8 (max billed)	~4,784
Screenshot on Sonnet / Haiku (max billed)	~1,568
Screenshot on Gemini 2.5	~1,548
Screenshot on Codex CLI / GPT-4o	~1,105
Same screen as SlimSnap JSON	~700
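
The savings percentages follow directly from the table. A small sketch of the arithmetic, using only the per-turn figures above (the helper name is mine):

```python
# Per-turn token counts copied from the table above.
PER_TURN_TOKENS = {
    "Opus 4.7 / 4.8": 4784,
    "Sonnet / Haiku": 1568,
    "Gemini 2.5": 1548,
    "Codex CLI / GPT-4o": 1105,
}
SLIMSNAP_JSON_TOKENS = 700


def percent_fewer(baseline: int, candidate: int) -> int:
    """Whole-number percentage by which candidate undercuts baseline."""
    return round(100 * (1 - candidate / baseline))


savings = {
    name: percent_fewer(tokens, SLIMSNAP_JSON_TOKENS)
    for name, tokens in PER_TURN_TOKENS.items()
}
for name, pct in savings.items():
    print(f"{name}: {pct}% fewer tokens per turn")
```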

For a one-shot question that is a curiosity. For the way I actually use a terminal coding agent, which is a long iterative session where I show it the page state every few prompts, it stops being a curiosity. Twenty turns of screenshots on Sonnet burns about 31k tokens of vision before you've said anything. Twenty turns on Codex CLI is about 22k. On Opus 4.7+ it is about 96k. Twenty turns of SlimSnap JSON is 14k. On a 200k context window, that is the difference between finishing the refactor and getting compacted out mid-session.
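
The session totals are plain multiplication, but it helps to see them against the window. A sketch using the per-turn figures from the table and the 20-turn, 200k-window scenario described above (names are mine):

```python
CONTEXT_WINDOW = 200_000
TURNS = 20

# Per-turn token counts from the table.
PER_TURN_TOKENS = {
    "Opus 4.7 / 4.8": 4784,
    "Sonnet / Haiku": 1568,
    "Codex CLI / GPT-4o": 1105,
    "SlimSnap JSON": 700,
}


def session_cost(per_turn: int, turns: int = TURNS) -> int:
    """Tokens spent on page state alone across a whole session."""
    return per_turn * turns


for name, per_turn in PER_TURN_TOKENS.items():
    total = session_cost(per_turn)
    share = total / CONTEXT_WINDOW
    print(f"{name}: {total:,} tokens, {share:.0%} of the window")
```

On these numbers the Opus session spends close to half of a 200k window on screenshots alone, while the JSON session stays in single digits.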

If you are running an agent all day, the bill matters. If you are running it on a tough refactor, the context matters more.

Structure beats pixels

The token math is the part that wins HN comments. The part that actually matters to whether the agent is helpful is structure.

When you paste a screenshot, the agent has to look at pixels and infer everything: what is a button, what is text, what color is what, what label belongs to what input, where the user is pointing. It does this every turn, because raw pixels are not persistent reasoning state. If you ask a follow-up six prompts later, the agent goes back to the pixels. (Why it sees rather than reads: Claude doesn't OCR your screenshot, it interprets it.)

When you paste structured JSON, the agent reads facts. Element e4 is a button, bbox [0.34, 0.60, 0.32, 0.07] normalized, color #3B82F6, OCR text "Sign up". The next turn it does not re-interpret pixels, it references e4. The reasoning is grounded in the same primitives the next turn will use.

{
  "schema_version": "1.0",
  "captured_at": "2026-05-19T18:17:46Z",
  "screen": { "title": "Create your account", "app": "Safari" },
  "image": { "width_px": 1440, "height_px": 900, "file": "signup.png" },
  "elements": [
    { "id": "e1", "type": "label", "value": "Create your account",
      "bbox": [0.34, 0.18, 0.32, 0.06] },
    { "id": "e2", "type": "input", "value": "Email",
      "bbox": [0.34, 0.34, 0.32, 0.07] },
    { "id": "e3", "type": "input", "value": "Password",
      "bbox": [0.34, 0.46, 0.32, 0.07] },
    { "id": "e4", "type": "button", "value": "Sign up",
      "bbox": [0.34, 0.60, 0.32, 0.07], "color": "#3B82F6" }
  ],
  "estimated_tokens": 712
}
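
Because each element has a stable id and a normalized bbox, a consumer can resolve a reference like e4 to pixels without touching the image. A minimal sketch, assuming the bbox order is [x, y, width, height] as fractions of the image size. The example values are consistent with that reading, but the order is my assumption, and the helper names are hypothetical.

```python
import json

# Trimmed copy of the capture above: only the fields this sketch needs.
CAPTURE = json.loads("""
{
  "image": { "width_px": 1440, "height_px": 900 },
  "elements": [
    { "id": "e2", "type": "input", "value": "Email",
      "bbox": [0.34, 0.34, 0.32, 0.07] },
    { "id": "e4", "type": "button", "value": "Sign up",
      "bbox": [0.34, 0.60, 0.32, 0.07], "color": "#3B82F6" }
  ]
}
""")


def find_element(capture: dict, element_id: str) -> dict:
    """Return the element with the given id, or raise KeyError."""
    for element in capture["elements"]:
        if element["id"] == element_id:
            return element
    raise KeyError(element_id)


def bbox_to_pixels(capture: dict, element: dict) -> tuple:
    """Convert a normalized [x, y, w, h] bbox to whole pixels (assumed order)."""
    x, y, w, h = element["bbox"]
    width = capture["image"]["width_px"]
    height = capture["image"]["height_px"]
    return (round(x * width), round(y * height),
            round(w * width), round(h * height))


button = find_element(CAPTURE, "e4")
print(button["value"], bbox_to_pixels(CAPTURE, button))
```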

Annotations carry the same property. A red rectangle in a PNG is a red rectangle. A red rectangle in SlimSnap JSON has an intent field, a target_ref pointing at the element it overlaps, and an optional callout string. The agent does not have to guess which input I am pointing at. The schema says: this rectangle targets e3, intent is highlight, callout is "this one is misaligned."

Pasting the image and saying "fix the misaligned input" makes the agent guess which input. Pasting the JSON makes the agent act, because the guess is already collapsed.
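
To make the "guess is already collapsed" point concrete, here is a sketch of how a consumer might turn an annotation into an unambiguous instruction. Only the intent, target_ref, and callout fields come from the description above. The shape of the annotation object, the sample annotation on the Password input, and the function name are assumptions for illustration.

```python
def describe_annotation(capture: dict, annotation: dict) -> str:
    """Resolve an annotation's target_ref to a concrete element description."""
    by_id = {element["id"]: element for element in capture["elements"]}
    target = by_id[annotation["target_ref"]]
    line = (f'{annotation["intent"]} {target["type"]} {target["id"]} '
            f'("{target["value"]}")')
    callout = annotation.get("callout")  # callout is optional
    if callout:
        line += f": {callout}"
    return line


capture = {
    "elements": [
        {"id": "e2", "type": "input", "value": "Email"},
        {"id": "e3", "type": "input", "value": "Password"},
    ]
}
# Hypothetical annotation object; the real layout may differ.
annotation = {
    "intent": "highlight",
    "target_ref": "e3",
    "callout": "this one is misaligned",
}

print(describe_annotation(capture, annotation))
```

With two inputs on the page, the reference resolves to exactly one of them, which is the information a bare screenshot leaves the agent to infer.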

Why I built a skill instead of a slash command

Once the capture is structured JSON, the rest is easy. SlimSnap writes a tiny config file at ~/.slimsnap/config.json on every startup and settings change. It contains the default save folder and the filename pattern. Nothing else.

{
  "schema_version": "1.0",
  "default_save_folder": "/Users/alex/Documents/SlimSnap",
  "filename_pattern": "{title}-{timestamp}"
}

The Claude Code skill (free, MIT, separate repo at github.com/bickov/slimsnap-skill) reads that config, lists the save folder, loads the latest JSON file, and parses it into the agent's working context. There is no upload step. There is no slash command to remember. There is no "here is a screenshot." I press the global shortcut, annotate, hit Save, type "fix this" in Claude Code, and the agent already has the structured capture loaded.
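
The discovery steps are small enough to sketch. This is not the skill's actual code, only an illustration of the same three steps: read the config, list the save folder, load the newest JSON file. Treating "latest" as "most recently modified" is my assumption, and the function name is hypothetical.

```python
import json
from pathlib import Path

# Location of the config file described above.
DEFAULT_CONFIG = Path.home() / ".slimsnap" / "config.json"


def load_latest_capture(config_path: Path = DEFAULT_CONFIG):
    """Return the newest capture in the configured save folder, or None."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    folder = Path(config["default_save_folder"])

    # Assumption: "latest" means the most recently modified .json file.
    captures = sorted(folder.glob("*.json"), key=lambda p: p.stat().st_mtime)
    if not captures:
        return None
    return json.loads(captures[-1].read_text(encoding="utf-8"))
```

Because the folder comes from the config rather than a constant, moving the save location in the app needs no change on the consumer side.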

I went with a skill instead of a CLI tool or a Claude Code plugin because skills auto-trigger. The user does not invoke them, the agent does, when the prompt looks like the skill's purpose. "What's in my latest SlimSnap capture?" triggers it. So does "fix the misaligned input from my last screenshot." The skill is invisible until it is useful.

This is the part I want to be obvious to anyone building on top of Claude Code: the contract between the desktop app and the agent is the config file, not a hardcoded path. The skill discovers the save folder. If you move your captures, the skill follows. If you build a different consumer (a CLI, a VS Code extension, a different agent), it reads the same config and works.

The open schema is the actual product

SlimSnap is a Mac app and it is free at launch. The thing that will outlast the app is the schema. It is published as a separate MIT repo at github.com/bickov/slimsnap-schema: a JSON Schema (draft 2020-12), two examples, and a README that explains every field.

The reason this matters: the most common reaction to a new desktop tool from a one-person shop is "what if you disappear." If the schema is open and the consumer is a skill anyone can fork, the answer is "your captures still work, your skill still works, build a different annotator if you want." The data is not locked to the app.

I am not pretending this makes SlimSnap an open-source project. The desktop app is closed. The schema and the skill are open. That is the smallest set of things I can give away and still let people invest in the workflow without trusting me forever.

What this is not

It is not a Claude Code replacement. Claude Code still reads images fine. If you paste a screenshot once for a one-off question, do that. The case for converting to JSON is the case for a workflow: you do this many times a day, you want the agent to be cheap and precise, and you want a clean handoff between the capture and the agent.

It is also not for backend-only work. If you are refactoring a Go service you do not need to show Claude Code your screen. SlimSnap is sharpest for frontend, design-to-code, and bug-reproduction work, where you constantly need to point at something visual and have the agent reason about it.

If that is your loop, the tool, the skill, and the schema are at slimsnap.ai.