<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Akshay Padamata</title>
    <description>The latest articles on DEV Community by Akshay Padamata (@akshay_padamata_8dcc7821d).</description>
    <link>https://dev.to/akshay_padamata_8dcc7821d</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3804287%2F7969acab-01fc-4ee7-b01e-e4384ba9da75.jpg</url>
      <title>DEV Community: Akshay Padamata</title>
      <link>https://dev.to/akshay_padamata_8dcc7821d</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akshay_padamata_8dcc7821d"/>
    <language>en</language>
    <item>
      <title>Out of the box personal AI agent - Controls your pc</title>
      <dc:creator>Akshay Padamata</dc:creator>
      <pubDate>Tue, 03 Mar 2026 16:25:19 +0000</pubDate>
      <link>https://dev.to/akshay_padamata_8dcc7821d/out-of-the-box-personal-ai-agent-controls-your-pc-1ahe</link>
      <guid>https://dev.to/akshay_padamata_8dcc7821d/out-of-the-box-personal-ai-agent-controls-your-pc-1ahe</guid>
      <description>&lt;h1&gt;
  
  
  The DPI Scaling Problem in Desktop Automation (And How We Fixed It)
&lt;/h1&gt;

&lt;p&gt;You're automating your desktop. An AI looks at a screenshot, identifies a button, and returns coordinates: (523, 412). You click there.&lt;/p&gt;

&lt;p&gt;On a regular display, it works. On a Retina/4K display, it doesn't.&lt;/p&gt;

&lt;p&gt;Here's why—and how we solved it.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Coordinate Nightmare
&lt;/h2&gt;

&lt;p&gt;When you automate a GUI, you need to know where to click. There are two approaches:&lt;/p&gt;
&lt;h3&gt;
  
  
  Approach 1: Screenshot + LLM Vision (Current Standard)
&lt;/h3&gt;

&lt;p&gt;Most automation tools work like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Screenshot taken at 2880x1800 (Retina) 
    ↓
Resize to 1280x720 (for API efficiency) 
    ↓
LLM analyzes: "Button is at (640, 360) in screenshot space"
    ↓
Scale back to logical: (1280, 720)
    ↓
Apply DPI scaling (2.0x): (2560, 1440)
    ↓
Click at (2560, 1440)
    ↓
🔴 Miss. The point was scaled twice, so the click lands off-target or off-screen. The button is really at (1440, 900) in physical pixels.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Three coordinate transformations. Three opportunities for error.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On a high-DPI display with 1.5x or 2.0x scaling, the typical error is around ±50 pixels. On a 32" 4K display, it's even worse.&lt;/p&gt;

&lt;p&gt;This is why screenshot-based automation fails on high-DPI displays. It's not the LLM's fault—the coordinate system itself is broken.&lt;/p&gt;
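&lt;p&gt;For comparison, here is a minimal sketch of what a correct mapping for the pipeline above could look like. It scales the screenshot point exactly once, with a separate factor per axis, and only then derives logical points. The function name &lt;code&gt;mapScreenshotPoint&lt;/code&gt; and its parameters are hypothetical and not taken from any particular tool.&lt;/p&gt;

```javascript
// Hypothetical helper: map a point from a resized screenshot back to the display.
// All names here are illustrative assumptions, not part of any specific tool.
function mapScreenshotPoint(point, resized, physical, dpr) {
  // Use one factor per axis, because a resize may change the aspect ratio.
  const scaleX = physical.width / resized.width;
  const scaleY = physical.height / resized.height;

  // Physical (device) pixels: scale exactly once.
  const physicalPoint = {
    x: Math.round(point.x * scaleX),
    y: Math.round(point.y * scaleY),
  };

  // Logical points, for click APIs that expect DPI-independent units.
  const logicalPoint = {
    x: Math.round(physicalPoint.x / dpr),
    y: Math.round(physicalPoint.y / dpr),
  };

  return { physicalPoint, logicalPoint };
}

// The numbers from the flow above: 2880x1800 display, 1280x720 screenshot, 2.0x DPI.
const mapped = mapScreenshotPoint(
  { x: 640, y: 360 },
  { width: 1280, height: 720 },
  { width: 2880, height: 1800 },
  2.0
);
console.log(mapped);
```

&lt;p&gt;With these inputs the physical point is (1440, 900) and the logical point is (720, 450). Even when the math is right, the result still depends on the model reading the screenshot accurately.&lt;/p&gt;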

&lt;h3&gt;
  
  
  Approach 2: DOM Injection (Our Approach)
&lt;/h3&gt;

&lt;p&gt;What if we asked the browser itself where things are?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-agentref="45"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getBoundingClientRect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// getBoundingClientRect returns CSS pixels&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cssX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;left&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;width&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cssY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;top&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;height&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// JavaScript knows the DPR—multiply by it&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;physicalX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cssX&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;devicePixelRatio&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;physicalY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cssY&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;devicePixelRatio&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Return physical pixels to the automation engine&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;physicalX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;y&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;physicalY&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;One transformation. Built into the browser. No guessing.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Result on Retina: ±2px error (rounding only). No hallucination. No DPI confusion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Imagine automating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Form filling:&lt;/strong&gt; Click field 1, fill, click field 2, fill. One wrong click = wrong field.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data entry:&lt;/strong&gt; Click row 47 in a spreadsheet. Off by 50px? You click row 48. Uh oh.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precise clicking:&lt;/strong&gt; Close this modal, click the OK button. Miss = automation breaks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Screenshot-based automation fails silently on these tasks. Users blame the AI. The AI is actually correct—the coordinate system is broken.&lt;/p&gt;

&lt;h2&gt;
  
  
  How We Built It
&lt;/h2&gt;

&lt;p&gt;At solnetex (&lt;a href="https://solnetex.com" rel="noopener noreferrer"&gt;https://solnetex.com&lt;/a&gt;), we're automating desktops from your phone. We needed pixel-perfect clicks.&lt;/p&gt;

&lt;p&gt;Here's the architecture:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Inject Element References
&lt;/h3&gt;

&lt;p&gt;When we extract the DOM, we inject &lt;code&gt;data-agentref&lt;/code&gt; attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="c"&gt;&amp;lt;!-- Before --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;button&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"primary"&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"submit-btn"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Save&lt;span class="nt"&gt;&amp;lt;/button&amp;gt;&lt;/span&gt;

&lt;span class="c"&gt;&amp;lt;!-- After (in AI context) --&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;button&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"primary"&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"submit-btn"&lt;/span&gt; &lt;span class="na"&gt;data-agentref=&lt;/span&gt;&lt;span class="s"&gt;"42"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Save&lt;span class="nt"&gt;&amp;lt;/button&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI gets a semantic tree with refs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ref=42] button "Save" @(523, 412)
[ref=43] link "Cancel" @(320, 412)
[ref=44] input "Name" @(100, 200)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
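&lt;p&gt;As a rough illustration, one line of that tree could be produced from a plain record per element. The record shape (&lt;code&gt;ref&lt;/code&gt;, &lt;code&gt;role&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;) and the function name are assumptions made for this sketch.&lt;/p&gt;

```javascript
// Hypothetical formatter for the ref list handed to the model.
// The entry fields are assumed for illustration only.
function formatRefLine(entry) {
  return `[ref=${entry.ref}] ${entry.role} "${entry.name}" @(${entry.x}, ${entry.y})`;
}

const entries = [
  { ref: 42, role: "button", name: "Save", x: 523, y: 412 },
  { ref: 43, role: "link", name: "Cancel", x: 320, y: 412 },
  { ref: 44, role: "input", name: "Name", x: 100, y: 200 },
];

// One line per interactive element, joined into a compact text tree.
const tree = entries.map(formatRefLine).join("\n");
console.log(tree);
```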



&lt;h3&gt;
  
  
  2. Use Refs for Clicking (Not Coordinates)
&lt;/h3&gt;

&lt;p&gt;Instead of "click at (523, 412)", the AI says "dom_click ref=42".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Node.js runtime receives "dom_click ref=42" and evaluates this snippet in the page&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-agentref="42"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Let JavaScript get the exact position&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rect&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;element&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getBoundingClientRect&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;screenX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;left&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;width&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;devicePixelRatio&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;screenY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;top&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;rect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;height&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;devicePixelRatio&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Return physical-pixel coordinates, relative to the viewport&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;screenX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;screenY&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
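&lt;p&gt;One detail worth spelling out: &lt;code&gt;getBoundingClientRect()&lt;/code&gt; is relative to the viewport, so a real OS-level mouse click also needs the viewport's position on screen. The sketch below assumes the engine already knows that offset (window position plus browser toolbars) in CSS pixels. The name &lt;code&gt;rectCenterToScreen&lt;/code&gt; and the &lt;code&gt;viewportOrigin&lt;/code&gt; parameter are hypothetical.&lt;/p&gt;

```javascript
// Hypothetical conversion from a viewport rectangle to physical screen pixels.
// viewportOrigin is the viewport's top-left corner on screen, in CSS pixels.
// How it is obtained depends on the automation engine and is assumed here.
function rectCenterToScreen(rect, viewportOrigin, dpr) {
  const cssX = viewportOrigin.x + rect.left + rect.width / 2;
  const cssY = viewportOrigin.y + rect.top + rect.height / 2;
  // Convert CSS pixels to physical pixels once, at the very end.
  return { x: Math.round(cssX * dpr), y: Math.round(cssY * dpr) };
}

// A button centered at (523, 412) in the viewport, with 88 CSS pixels of
// browser chrome above the page, on a 2.0x display.
const target = rectCenterToScreen(
  { left: 483, top: 396, width: 80, height: 32 },
  { x: 0, y: 88 },
  2
);
console.log(target);
```

&lt;p&gt;This sketch assumes a single monitor and a page zoom of 100%. Setups with several monitors at different scale factors would need extra handling.&lt;/p&gt;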



&lt;h3&gt;
  
  
  3. Handle Edge Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Contenteditable&lt;/strong&gt; (Discord, Slack): Real mouse click (needs focus)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple inputs&lt;/strong&gt;: JS .click() works fine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SPA widgets&lt;/strong&gt; (Google Docs): Real mouse click (the app may ignore synthetic JavaScript events)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Links&lt;/strong&gt;: JS .click() navigates reliably&lt;/li&gt;
&lt;/ul&gt;
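&lt;p&gt;The rules in this list can be sketched as a small decision function. The flags &lt;code&gt;isContentEditable&lt;/code&gt; and &lt;code&gt;isSpaWidget&lt;/code&gt; and the strategy names are assumptions for illustration. How a real engine detects each case is not shown here.&lt;/p&gt;

```javascript
// Hypothetical rule table mirroring the edge cases listed above.
// The info flags are assumed to be computed elsewhere by the engine.
function chooseClickStrategy(info) {
  // Rich text editors need real focus, so use an OS-level click.
  if (info.isContentEditable) {
    return "real-mouse";
  }
  // Complex app widgets may ignore synthetic events.
  if (info.isSpaWidget) {
    return "real-mouse";
  }
  // Simple inputs and links respond to element.click().
  return "js-click";
}

console.log(chooseClickStrategy({ isContentEditable: true, isSpaWidget: false }));
console.log(chooseClickStrategy({ isContentEditable: false, isSpaWidget: false }));
```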

&lt;h3&gt;
  
  
  4. Fallback Chain
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Try DOM injection (fast, accurate)
  ↓ (if JS blocked)
Try A11y tree snapshots (medium cost, good accuracy)
  ↓ (if unavailable)
Fall back to screenshots (expensive, last resort)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; About 80% of clicks take the fast, accurate DOM path. The other 20% go through a slower fallback but still work.&lt;/p&gt;
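&lt;p&gt;A fallback chain like this can be sketched as a loop over ordered strategies. The version below is a simplified, synchronous illustration with stubbed strategies. A real engine would most likely use asynchronous calls, and the names &lt;code&gt;locateWithFallback&lt;/code&gt; and &lt;code&gt;locate&lt;/code&gt; are hypothetical.&lt;/p&gt;

```javascript
// Hypothetical fallback runner: try each strategy in order, return the first hit.
function locateWithFallback(strategies, ref) {
  const failures = [];
  for (const strategy of strategies) {
    try {
      const point = strategy.locate(ref);
      if (point) {
        return { via: strategy.name, point };
      }
      failures.push(strategy.name + ": no result");
    } catch (err) {
      // Record why this strategy failed, then move on to the next one.
      failures.push(strategy.name + ": " + err.message);
    }
  }
  throw new Error("All strategies failed (" + failures.join("; ") + ")");
}

// Stubbed strategies for demonstration only: DOM injection is blocked here,
// so the accessibility tree answers before screenshots are tried.
const demoStrategies = [
  { name: "dom", locate: function () { throw new Error("JS blocked"); } },
  { name: "a11y", locate: function () { return { x: 1046, y: 824 }; } },
  { name: "screenshot", locate: function () { return { x: 1040, y: 830 }; } },
];

const located = locateWithFallback(demoStrategies, 42);
console.log(located.via, located.point);
```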

&lt;h2&gt;
  
  
  The Real Problem We Solved
&lt;/h2&gt;

&lt;p&gt;It's not about being "smarter" than LLMs. It's about &lt;strong&gt;removing coordinate guessing entirely&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The LLM is good at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding intent ("click the save button")&lt;/li&gt;
&lt;li&gt;Analyzing page semantics ("this is a modal dialog")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The LLM is bad at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predicting pixel coordinates on displays it's never seen&lt;/li&gt;
&lt;li&gt;Understanding DPI scaling&lt;/li&gt;
&lt;li&gt;Knowing the difference between a Retina 27" and a 4K 32"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;By using DOM injection, we let the browser do what it's good at (measuring), and let the LLM do what it's good at (reasoning).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the real insight: You don't need a smarter vision model. You need a better coordinate system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why OpenClaw Doesn't Do This
&lt;/h2&gt;

&lt;p&gt;OpenClaw is primarily for terminal automation and API calls. GUI automation is secondary. They use Puppeteer, a Node.js library that drives Chrome, but they still rely on screenshots for coordinate guessing.&lt;/p&gt;

&lt;p&gt;It works for their use case. But if you're automating high-DPI desktop apps, it breaks down.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Trade-offs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DOM injection wins:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Pixel-perfect accuracy&lt;/li&gt;
&lt;li&gt;✅ Screen-size independent (no scaling needed)&lt;/li&gt;
&lt;li&gt;✅ DPI handled natively&lt;/li&gt;
&lt;li&gt;✅ Works on ANY display size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DOM injection loses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Requires JavaScript execution in the page (some browsers or site policies block it)&lt;/li&gt;
&lt;li&gt;❌ Only works in browsers (can't click system dialogs)&lt;/li&gt;
&lt;li&gt;❌ Requires JS injection setup (initial latency)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Screenshots win:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Works everywhere (desktop, mobile, any app)&lt;/li&gt;
&lt;li&gt;✅ No JS needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Screenshots lose:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;❌ Coordinate guessing (inaccurate on high-DPI)&lt;/li&gt;
&lt;li&gt;❌ Expensive (5000+ tokens per screenshot)&lt;/li&gt;
&lt;li&gt;❌ Hallucination risk (LLM might miss the button entirely)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What We're Using This For
&lt;/h2&gt;

&lt;p&gt;We built Agent Pro to automate your desktop from your phone. No setup required. Sign in with Cleer, describe what you want automated, use your phone.&lt;/p&gt;

&lt;p&gt;The DPI scaling problem was one of the first issues we hit. Once we solved it, everything got faster and more reliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;If you're building desktop automation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Don't rely solely on screenshot + LLM vision&lt;/strong&gt;—it breaks on high-DPI displays&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use the browser's native APIs&lt;/strong&gt;—getBoundingClientRect() knows more than your LLM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate concerns&lt;/strong&gt;—let JavaScript measure, let AI reason&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design for fallbacks&lt;/strong&gt;—screenshots are your safety net, not your foundation&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The coordinate system matters more than the vision model.&lt;/p&gt;




&lt;p&gt;Have you hit this problem? How did you solve it? Drop a comment below.&lt;/p&gt;

&lt;p&gt;If you want to try pixel-perfect automation from your phone, &lt;a href="https://cleer.ai" rel="noopener noreferrer"&gt;Agent Pro&lt;/a&gt; is live for Cleer users.&lt;/p&gt;




</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
