Akshay Padamata
Out-of-the-box personal AI agent that controls your PC

The DPI Scaling Problem in Desktop Automation (And How We Fixed It)

You're automating your desktop. An AI looks at a screenshot, identifies a button,
and returns coordinates: (523, 412). You click there.

On a regular display, it works. On a Retina/4K display, it doesn't.

Here's why—and how we solved it.

The Coordinate Nightmare

When you automate a GUI, you need to know where to click. There are two approaches:

Approach 1: Screenshot + LLM Vision (Current Standard)

Most automation tools work like this:

Screenshot taken at 2880x1800 (Retina)
    ↓
Resize to 1280x720 (for API efficiency)
    ↓
LLM analyzes: "Button is at (640, 360) in screenshot space"
    ↓
Scale back to logical: (1280, 720)  ← wrong factor: the 2.0x DPR sneaks in here
    ↓
Apply DPI scaling (2.0x) again: (2560, 1440)
    ↓
Click at (2560, 1440) on a desktop that is only 1440x900 logical
    ↓
🔴 Miss. Off-screen. Clicked wrong button.

Three coordinate transformations. Three opportunities for error.

On a Retina display with 1.5x or 2.0x DPI scaling, typical error is ±50 pixels. On a 32" 4K display, it's even worse.

This is why screenshot-based automation fails on high-DPI displays. It's not the LLM's fault—the coordinate system itself is broken.
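To make the failure concrete, here is a small sketch of the two pipelines using the numbers from the diagram above. The function names are illustrative, not from any real tool:

```javascript
// Display: 2880x1800 physical pixels at 2.0x DPI => 1440x900 logical.
// Screenshot is resized to 1280x720 before being sent to the LLM.
const PHYS = { w: 2880, h: 1800 };
const DPR = 2.0;
const SHOT = { w: 1280, h: 720 };

// Correct pipeline: one scale per axis from screenshot space to logical space.
// (Note the resize also changed the aspect ratio, so x and y scale differently.)
function shotToLogical(x, y) {
  const logicalW = PHYS.w / DPR; // 1440
  const logicalH = PHYS.h / DPR; // 900
  return {
    x: x * (logicalW / SHOT.w), // 1440/1280 = 1.125
    y: y * (logicalH / SHOT.h), // 900/720  = 1.25
  };
}

// Buggy pipeline from the diagram: scale by 2x to "logical", then apply
// the 2.0x DPI factor again, so the DPR ends up applied twice.
function shotToLogicalBuggy(x, y) {
  return { x: x * 2 * DPR, y: y * 2 * DPR };
}

const target = { x: 640, y: 360 }; // LLM's answer in screenshot space
console.log(shotToLogical(target.x, target.y));      // { x: 720, y: 450 }: on the 1440x900 desktop
console.log(shotToLogicalBuggy(target.x, target.y)); // { x: 2560, y: 1440 }: off-screen
```

The double-applied DPR is the classic version of this bug, but any of the three transforms can introduce its own rounding or scale error.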

Approach 2: DOM Injection (Our Approach)

What if we asked the browser itself where things are?

const element = document.querySelector('[data-agentref="45"]');
const rect = element.getBoundingClientRect();

// getBoundingClientRect returns CSS pixels
const cssX = rect.left + rect.width / 2;
const cssY = rect.top + rect.height / 2;

// JavaScript knows the DPR—multiply by it
const physicalX = cssX * window.devicePixelRatio;
const physicalY = cssY * window.devicePixelRatio;

// Return physical pixels to the automation engine
return { x: physicalX, y: physicalY };

One transformation. Built into the browser. No guessing.

Result on Retina: ±2px error (rounding only). No hallucination. No DPI confusion.

Why This Matters

Imagine automating:

  • Form filling: Click field 1, fill, click field 2, fill. One wrong click = wrong field.
  • Data entry: Click row 47 in a spreadsheet. Off by 50px? You click row 48. Uh oh.
  • Precise clicking: Close this modal, click the OK button. Miss = automation breaks.

Screenshot-based automation fails silently on these tasks. Users blame the AI. The AI is actually correct—the coordinate system is broken.

How We Built It

At solnetex (https://solnetex.com), we're automating desktops from your phone. We needed pixel-perfect clicks.

Here's the architecture:

1. Inject Element References

When we extract the DOM, we inject data-agentref attributes:

<!-- Before -->
<button class="primary" id="submit-btn">Save</button>

<!-- After (in AI context) -->
<button class="primary" id="submit-btn" data-agentref="42">Save</button>
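The injection step could look roughly like this. The elements here are plain objects so the sketch is self-contained and testable; in a real page you would walk the DOM and call `el.setAttribute('data-agentref', ref)` instead:

```javascript
// Sketch of ref injection (names are illustrative, not the actual solnetex code).
let nextRef = 0;

function injectRefs(elements) {
  const byRef = new Map();
  for (const el of elements) {
    const ref = String(nextRef++);
    el.attrs['data-agentref'] = ref; // in a page: el.setAttribute('data-agentref', ref)
    byRef.set(ref, el);              // lookup table for later "dom_click ref=N" commands
  }
  return byRef;
}

const els = [
  { tag: 'button', attrs: { id: 'submit-btn' }, text: 'Save' },
  { tag: 'a', attrs: {}, text: 'Cancel' },
];
const byRef = injectRefs(els);
console.log(byRef.get('0').text); // "Save"
```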

The AI gets a semantic tree with refs:

[ref=42] button "Save" @(523, 412)
[ref=43] link "Cancel" @(320, 412)
[ref=44] input "Name" @(100, 200)
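Each line of that tree can be derived from an element's bounding rect. A sketch, with assumed helper names and a plain rect standing in for `getBoundingClientRect()`:

```javascript
// Center of a DOMRect-shaped object, rounded to whole pixels.
function center(rect) {
  return {
    x: Math.round(rect.left + rect.width / 2),
    y: Math.round(rect.top + rect.height / 2),
  };
}

// One line of the semantic tree, in the format shown above.
function treeLine(ref, role, name, rect) {
  const { x, y } = center(rect);
  return `[ref=${ref}] ${role} "${name}" @(${x}, ${y})`;
}

console.log(treeLine(42, 'button', 'Save', { left: 493, top: 398, width: 60, height: 28 }));
// -> [ref=42] button "Save" @(523, 412)
```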

2. Use Refs for Clicking (Not Coordinates)

Instead of "click at (523, 412)", the AI says "dom_click ref=42".

// Script injected into the page handles: dom_click ref=42 (relayed from the Node.js runtime)
const element = document.querySelector('[data-agentref="42"]');

// Let JavaScript get the exact position
const rect = element.getBoundingClientRect();
const screenX = (rect.left + rect.width / 2) * window.devicePixelRatio;
const screenY = (rect.top + rect.height / 2) * window.devicePixelRatio;

// Return native screen coordinates
return { screenX, screenY };

3. Handle Edge Cases

  • Contenteditable (Discord, Slack): Real mouse click (needs focus)
  • Simple inputs: JS .click() works fine
  • SPA widgets (Google Docs): Real mouse click (synthetic .click() events bypass the app's own event handling)
  • Links: JS .click() navigates reliably
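A dispatcher over those edge cases might look like this. The heuristics and names are assumptions for illustration, not the actual solnetex code:

```javascript
// Pick a click strategy for an element: 'js' for a programmatic .click(),
// 'mouse' for a real OS-level mouse click at the element's coordinates.
function clickStrategy(el) {
  // contenteditable surfaces (Discord, Slack) need a real click to gain focus
  if (el.contentEditable === 'true') return 'mouse';
  // canvas-rendered widgets never see synthetic events
  if (el.tagName === 'CANVAS') return 'mouse';
  // links, buttons, and plain inputs respond reliably to .click()
  if (el.tagName === 'A' || el.tagName === 'BUTTON' || el.tagName === 'INPUT') return 'js';
  return 'mouse'; // when unsure, use the real thing
}

console.log(clickStrategy({ tagName: 'A' })); // "js"
```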

4. Fallback Chain

Try DOM injection (fast, accurate)
  ↓ (if JS blocked)
Try A11y tree snapshots (medium cost, good accuracy)
  ↓ (if unavailable)
Fall back to screenshots (expensive, last resort)

Result: 80% of clicks are fast and accurate. 20% are slower but still work.
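The chain above can be sketched as an ordered list of locator strategies, each tried until one succeeds (function names are placeholders):

```javascript
// Try each locator strategy in order; a strategy returns coordinates,
// returns null, or throws if it is unavailable (e.g. JS blocked).
async function locate(ref, strategies) {
  for (const tryStrategy of strategies) {
    try {
      const pos = await tryStrategy(ref);
      if (pos) return pos;
    } catch {
      // strategy unavailable: fall through to the next one
    }
  }
  throw new Error(`all locator strategies failed for ref=${ref}`);
}

// usage (hypothetical strategy functions):
//   await locate('42', [domInject, a11ySnapshot, screenshotVision]);
```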

The Real Problem We Solved

It's not about being "smarter" than LLMs. It's about removing coordinate guessing entirely.

The LLM is good at:

  • Understanding intent ("click the save button")
  • Analyzing page semantics ("this is a modal dialog")

The LLM is bad at:

  • Predicting pixel coordinates on displays it's never seen
  • Understanding DPI scaling
  • Knowing the difference between a Retina 27" and a 4K 32"

By using DOM injection, we let the browser do what it's good at (measuring), and let the LLM do what it's good at (reasoning).

This is the real insight: You don't need a smarter vision model. You need a better coordinate system.

Why OpenClaw Doesn't Do This

OpenClaw is primarily for terminal automation and API calls. GUI automation is secondary. They use Puppeteer (headless Chrome), which can control browsers, but they still rely on screenshots for coordinate guessing.

It works for their use case. But if you're automating high-DPI desktop apps, it breaks down.

The Trade-offs

DOM injection wins:

  • ✅ Pixel-perfect accuracy
  • ✅ Screen-size independent (no scaling needed)
  • ✅ DPI handled natively
  • ✅ Works on ANY display size

DOM injection loses:

  • ❌ Requires JavaScript enabled (some browsers block it)
  • ❌ Only works in browsers (can't click system dialogs)
  • ❌ Requires JS injection setup (initial latency)

Screenshots win:

  • ✅ Works everywhere (desktop, mobile, any app)
  • ✅ No JS needed

Screenshots lose:

  • ❌ Coordinate guessing (inaccurate on high-DPI)
  • ❌ Expensive (5000+ tokens per screenshot)
  • ❌ Hallucination risk (LLM might miss the button entirely)

What We're Using This For

We built Agent Pro to automate your desktop from your phone. No setup required. Sign in with Cleer, describe what you want automated, use your phone.

The DPI scaling problem was one of the first issues we hit. Once we solved it, everything got faster and more reliable.

The Takeaway

If you're building desktop automation:

  1. Don't rely solely on screenshot + LLM vision—it breaks on high-DPI displays
  2. Use the browser's native APIs—getBoundingClientRect() knows more than your LLM
  3. Separate concerns—let JavaScript measure, let AI reason
  4. Design for fallbacks—screenshots are your safety net, not your foundation

The coordinate system matters more than the vision model.


Have you hit this problem? How did you solve it? Drop a comment below.

If you want to try pixel-perfect automation from your phone, Agent Pro is live for Cleer users.




