DEV Community

Tushar Shukla

Annotation-First UI Pair Programming: A New Workflow Pattern

There is a recurring frustration in AI-assisted frontend development. You describe what you want. The agent writes code. You preview it in the browser. It is close, but not right. The spacing is off, a component overlaps another, the hover state covers the wrong area. You switch back to the chat, type a correction in natural language, hope the agent understands which element you mean, and iterate.

This loop is slow because it relies on the developer translating a visual observation into a text description, and the agent translating that text description back into a spatial understanding of the DOM. Information is lost in both directions.

Annotation-first UI pair programming is a workflow pattern that eliminates this translation step. Instead of describing visual issues in text, the developer annotates them directly on the rendered page. Instead of guessing what the developer means, the agent reads structured metadata about the exact elements and regions that were flagged.

The problem: AI agents cannot see the browser

AI coding agents (Claude Code, Cursor, Copilot, Codex) are powerful at manipulating source code. They read files, understand ASTs, generate diffs, and run tests. But they operate in a text world. When they write frontend code, they cannot verify the visual result.

Consider what happens when an agent generates a React component:

  1. The agent writes JSX based on a text prompt.
  2. The developer previews the result in a browser.
  3. The developer sees that a button is misaligned, a card has wrong padding, or a dropdown renders behind a modal.
  4. The developer describes the problem in chat: "The submit button is too close to the input field, add more margin."
  5. The agent modifies the code based on this description.
  6. Steps 2-5 repeat until the UI looks right.

The bottleneck is step 4. Natural language is imprecise about spatial relationships. "Too close" could mean 4px or 40px. "The submit button" might match multiple elements. The developer knows exactly what is wrong because they can see it. The agent has to infer what is wrong from an incomplete verbal description.

The Stack Overflow 2025 Developer Survey quantified this: 84% of developers now use AI tools, but 66% cite an "almost right" frustration: the code is functionally correct but needs visual correction. For frontend work specifically, this gap is the primary productivity drain.

The solution: annotate visually, fix programmatically

Annotation-first pair programming inverts the communication direction. Instead of the developer describing the problem to the agent, the developer marks the problem directly on the rendered page, and the agent reads a structured representation of what was marked.

Here is the workflow:

  DEVELOPER (browser)                    AI AGENT (code editor)
  --------------------                   ----------------------

  1. Preview rendered UI
         |
  2. Spot visual issue
         |
  3. Annotate element          ------>   4. Read annotations
     or draw region                         via MCP tools
     (click/drag on page)                   (structured data)
         |                                      |
         |                               5. Locate element in
         |                                  source code using
         |                                  selector/path/tag
         |                                      |
         |                               6. Generate code fix
         |                                      |
  7. Preview updated UI        <------   (agent applies diff)
         |
  8. Verify fix or
     add new annotations
         |
  9. Mark resolved             ------>  10. Agent clears
     (or agent auto-clears)                 annotation

The key difference from traditional chat-based iteration: step 3 produces structured data, not a text description. When a developer annotates an element, the system captures:

  • CSS selector: a unique path to the element in the DOM
  • DOM path: the full element hierarchy (e.g., html > body > div.app > main > form > button.submit)
  • Tag name: the HTML element type
  • Bounding box: exact pixel position and size on the page
  • Attributes: id, class, data attributes, ARIA roles
  • Computed styles: current CSS values (in detailed/forensic output levels)
  • Developer comment: free-text description of the issue
  • Intent: fix, change, question, or approve
  • Severity: blocking, important, or suggestion

For region annotations (drawing a rectangle or ellipse on the page), the system also captures shape and geometry data, enabling the developer to flag layout issues that span multiple elements.

The agent receives all of this as structured data: not a screenshot, not a text description, but a JSON object with enough context to map the visual issue back to source code.
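To make that concrete, a single element annotation might serialize to something like the following. This is an illustrative sketch based on the metadata list above, not onUI's exact schema; the field names are assumptions.

```json
{
  "type": "element",
  "selector": "form > button.submit",
  "domPath": "html > body > div.app > main > form > button.submit",
  "tagName": "button",
  "boundingBox": { "x": 340, "y": 520, "width": 120, "height": 36 },
  "attributes": { "class": "submit", "type": "submit" },
  "comment": "Too close to the input field above; needs more spacing.",
  "intent": "fix",
  "severity": "important"
}
```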

Why structured annotations beat screenshots

A natural question: why not just send the agent a screenshot?

Screenshots are pixels. To extract actionable information from a screenshot, the model has to perform visual reasoning: identify elements, estimate their positions, and map them back to code. This is computationally expensive, error-prone, and loses precision. A screenshot does not tell you the CSS selector for an element or its computed margin value. It shows you what things look like, not what they are.

Structured annotations are metadata. They tell the agent: "This is a button.submit element, at position (340, 520), with computed margin-top: 4px, and the developer says it needs more spacing." The agent can immediately search for that selector in the source code and modify the relevant CSS.

The two approaches also differ in information density. A screenshot of a full page contains thousands of visual details, most of which are irrelevant to the specific issue. An annotation contains only the flagged issue, with precisely the metadata needed to fix it. This is a better fit for AI model context windows, where every token counts.

A reference implementation: onUI

onUI is an open-source (GPL-3.0) Chrome/Edge/Firefox browser extension that implements the annotation-first workflow pattern. It provides two capture modes:

Annotate mode: Click any element on the page (or Shift-click to multi-select) to open an annotation dialog. Set the intent (fix, change, question, approve), severity (blocking, important, suggestion), and write a comment. The extension captures the element's selector, DOM path, bounding box, attributes, and optionally computed styles.

Draw mode: Drag to draw a rectangle or ellipse region on the page. This captures the region's shape and geometry along with the annotation metadata. Useful for layout issues, spacing problems, or visual groupings that do not map to a single DOM element.

Annotations are stored locally and exposed to AI agents through a local MCP (Model Context Protocol) server called onui-local. The server runs over stdio: no cloud backend, no account, and no data leaves the developer's machine.

The MCP tool surface

The onui-local MCP server exposes 8 tools that AI agents can call:

Reading annotations:

  • onui_list_pages: discover which pages have annotations
  • onui_get_annotations: get all annotations for a specific URL
  • onui_get_report: generate a formatted report at four detail levels (compact, standard, detailed, forensic)
  • onui_search_annotations: search with filters for status, severity, intent, and free-text query

Managing annotations:

  • onui_update_annotation_metadata: update status, intent, severity, or comment for a single annotation
  • onui_bulk_update_annotation_metadata: batch update multiple annotations
  • onui_delete_annotation: remove a specific annotation
  • onui_clear_page_annotations: clear all annotations for a page URL
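For instance, an agent triaging only the blocking issues on a page might call the search tool with filters. The argument names below follow the pattern of the onui_get_report call shown later in the post and are illustrative, not a verbatim API reference:

```json
{
  "tool": "onui_search_annotations",
  "arguments": {
    "pageUrl": "http://localhost:3000/dashboard",
    "severity": "blocking",
    "status": "open"
  }
}
```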

Report detail levels

The four output levels control how much metadata the agent receives:

| Level | Includes |
|-------|----------|
| compact | Selector, comment, intent, severity |
| standard | + DOM path, tag name, bounding box |
| detailed | + attributes, computed styles, region geometry |
| forensic | + React component names, viewport coordinates, full document metrics |

The developer chooses the output level in the extension's settings panel. For most fixes, standard provides enough context. For complex layout debugging, detailed or forensic gives the agent the full picture.
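As an illustration of the tradeoff, a compact entry carries only the four essentials, leaving the agent to locate the element by selector alone. A hypothetical compact entry (the exact report format is assumed here, not documented):

```json
{
  "selector": "button.submit",
  "comment": "Needs more spacing above the button",
  "intent": "fix",
  "severity": "important"
}
```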

Example agent flow

Here is a concrete example of the workflow in practice:

  1. A developer is building a dashboard. They preview it in Chrome and notice the sidebar overlaps the main content area on narrow viewports.

  2. They enable onUI on the tab, switch to Draw mode, and draw a rectangle around the overlapping region. They set intent to "fix," severity to "blocking," and write: "Sidebar overlaps main content below 1024px viewport width."

  3. Their AI agent (Claude Code, Cursor, etc.) queries the MCP server:

{
  "tool": "onui_get_report",
  "arguments": {
    "pageUrl": "http://localhost:3000/dashboard",
    "level": "detailed"
  }
}
  4. The agent receives structured data including the region geometry, overlapping elements, and the developer's comment. It searches the codebase for the sidebar component, identifies the CSS layout rules, and generates a fix using a container query or media query.

  5. The developer previews the fix. If the overlap is resolved, they either delete the annotation or the agent calls onui_clear_page_annotations to clean up.
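The fix the agent generates in step 4 could be as simple as a media query. The class names below are hypothetical, since the dashboard's actual markup is not shown in the post:

```css
/* Below 1024px, stop pinning the sidebar over the content
   and let it stack above the main area instead. */
@media (max-width: 1023px) {
  .dashboard__sidebar {
    position: static;
    width: 100%;
  }
  .dashboard__layout {
    flex-direction: column;
  }
}
```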

Setup

onUI can be installed from the Chrome Web Store or via a one-command installer:

# macOS/Linux, install extension + MCP bridge
curl -fsSL https://github.com/onllm-dev/onUI/releases/latest/download/install.sh | bash -s -- mcp

MCP server registration for Claude Code:

{
  "mcpServers": {
    "onui-local": {
      "command": "node",
      "args": [
        "/path/to/onUI/packages/mcp-server/dist/bin/onui-cli.js",
        "mcp"
      ]
    }
  }
}

The server auto-registers for Claude Code and Codex when those CLIs are installed.

When annotation-first works best

This workflow pattern is most effective when:

  • The issue is visual, not logical. Misalignment, wrong colors, broken responsive layouts, overlapping elements. The developer can see the problem instantly but describing it in text is ambiguous.
  • The codebase is large. The agent needs help finding which component owns the broken element. A CSS selector or DOM path narrows the search dramatically.
  • Multiple issues need fixing in one pass. Batch-annotating 5-10 issues and sending them all to the agent is faster than 5-10 rounds of chat.
  • Privacy matters. Local-only storage means no screenshots or page content are uploaded to third-party servers. The annotation data stays on the developer's machine and is only read by the locally-running MCP server.
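When several annotations arrive in one batch, an agent will usually want to address blocking issues before suggestions and skip "approve" annotations entirely, since those need no code change. A minimal sketch of that triage step; the Annotation shape is an assumption mirroring the metadata fields listed earlier, not onUI's actual types:

```typescript
type Severity = "blocking" | "important" | "suggestion";

interface Annotation {
  selector: string;
  comment: string;
  intent: "fix" | "change" | "question" | "approve";
  severity: Severity;
}

// Lower rank = fix first.
const severityRank: Record<Severity, number> = {
  blocking: 0,
  important: 1,
  suggestion: 2,
};

// Drop "approve" annotations and order the rest so blocking
// issues are handled before important ones and suggestions.
function triage(annotations: Annotation[]): Annotation[] {
  return annotations
    .filter((a) => a.intent !== "approve")
    .sort((a, b) => severityRank[a.severity] - severityRank[b.severity]);
}
```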

The pattern is less useful for purely functional bugs (wrong data, broken API calls) where the issue is not visible in the rendered UI, or for greenfield generation where there is no existing UI to annotate.

Beyond onUI: the pattern is portable

The annotation-first workflow is not tied to any single tool. The core idea (human marks visual issues, agent reads structured metadata) can be implemented with different capture mechanisms and different transport protocols.

What makes the pattern work is the structured bridge between human visual perception and agent code manipulation. As MCP adoption grows (the MCP SDK now has 97M+ monthly npm downloads), more tools will likely implement this pattern. The combination of a browser-side capture layer and an MCP-exposed tool surface is a template that other projects can follow.

The question is not whether AI agents will get better at understanding visual UI. They will. The question is how we bridge the gap in the meantime, and annotation-first pair programming is a practical answer available today.
