We Gave AI Eyes and Hands on Windows — Here's How
AI coding assistants can write 500 lines of code in seconds. But ask them to click a button? They're blind.
While building https://orbination.com — our AI platform launching next month — we hit this wall hard. Our agents needed to interact with the actual Windows desktop.
Open apps. Click through dialogs. Read what's on screen. Test their own work.
Nothing out there did what we needed. So we built it.
The Problem
We tried the screenshot approach first. Take a screenshot, send it to the AI, let it figure out where to click.
It was:
- Slow — each screenshot costs thousands of vision tokens
- Unreliable — the AI guesses coordinates from pixels
- Expensive — 15 screenshots to navigate one menu
We needed the AI to know what's on screen, not guess from images.
The Solution: Read the Actual UI
Windows has a built-in accessibility layer called UIAutomation. Every button, input field, menu item, and checkbox exposes itself to this system with:
- Exact text/label
- Exact position (bounding rectangle)
- Control type (button, input, text, tab...)
- Interaction patterns (can I click it? type in it? toggle it?)
Instead of sending screenshots, we send this:
[button] "Save" @ 450,320
[input] "Search..." @ 200,60
[tab] "Settings" @ 120,35
Three lines of text instead of a 1MB image. The AI knows exactly what to click and where.
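The compact element-per-line format above can be sketched as follows. This is illustrative Python (the actual server is a .NET 8 executable), and the field names here are assumptions, not the server's real types:

```python
# Sketch: compress UI elements into compact text lines for the model.
# (Illustrative Python; the real implementation is C#.)
from dataclasses import dataclass

@dataclass
class UiElement:
    control_type: str   # e.g. "button", "input", "tab"
    name: str           # accessible label
    x: int              # clickable point, e.g. bounding-rect center
    y: int

def to_prompt_lines(elements):
    """Render elements as '[type] "name" @ x,y' lines -- a few hundred
    tokens instead of a multi-megabyte screenshot."""
    return "\n".join(f'[{e.control_type}] "{e.name}" @ {e.x},{e.y}'
                     for e in elements)
```

Feeding the model structured text like this also sidesteps coordinate guessing entirely: the click point comes from the accessibility tree, not from pixels.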
We wrapped this into an MCP server — the open protocol that connects AI assistants to external tools. Single .NET 8 executable. No Python. No Node.js. No Selenium.
Then Reality Hit
Problem 1: Dark Themes
Half the apps we use have dark themes. Standard OCR on a dark screenshot?
0 lines detected.
We built automatic dark theme enhancement:
if (IsDarkImage(bitmap))
{
    // Invert colors + boost contrast 1.4x
    using var enhanced = EnhanceForOcr(bitmap);
    result = RunOcrEngine(enhanced, language);
}
Check darkness first, enhance once, OCR once. Single pass.
Result: 0 → 37 lines detected on draw.io's dark interface.
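The darkness check and enhancement can be as simple as a mean-luminance threshold followed by invert-and-stretch. A sketch in Python (the real code is C# operating on GDI+ bitmaps; the 0.35 threshold is an assumption for illustration):

```python
# Sketch of dark-theme detection and OCR pre-enhancement.
# (Illustrative Python; real implementation is C#. Threshold is assumed.)

def is_dark(pixels):
    """pixels: list of (r, g, b). Dark if mean luminance is low."""
    lum = [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels]
    return sum(lum) / len(lum) < 0.35 * 255

def enhance_for_ocr(pixels, contrast=1.4):
    """Invert to dark-on-light, then boost contrast around mid-gray,
    clamping each channel back into 0..255."""
    out = []
    for r, g, b in pixels:
        out.append(tuple(
            max(0, min(255, int((255 - c - 128) * contrast + 128)))
            for c in (r, g, b)))
    return out
```

Inverting first matters: OCR engines are trained on dark text over light backgrounds, so a dark theme becomes a normal document after one pass.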
Problem 2: Web Apps Inside Iframes
UIAutomation can't see inside web-rendered dialogs and iframes. A dark-themed "OK" button in a web dialog? Invisible to UIAutomation.
We built an automatic OCR fallback into click_element:
click_element "OK"
→ UIAutomation: not found
→ Capture window → detect dark theme → enhance → OCR → find "OK" → click center
→ Result: Clicked "OK" @ 523,418 (OCR fallback) ✓
One tool call. The AI doesn't even know which strategy was used — it just works.
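The fallback chain above can be sketched as a try-cheap-first dispatcher. Illustrative Python only; `uia_find`, `ocr_find`, and `send_click` are hypothetical stand-ins for the C# internals:

```python
# Sketch of click_element's fallback chain.
# (Illustrative Python; helper functions are hypothetical stand-ins.)

def click_element(label, uia_find, ocr_find, send_click):
    """Try the cheap, exact UIAutomation path first; fall back to the
    screenshot -> enhance -> OCR path only on a miss."""
    hit = uia_find(label)          # exact accessible name + rect
    strategy = "uia"
    if hit is None:
        hit = ocr_find(label)      # OCR the window, match text, get center
        strategy = "ocr"
    if hit is None:
        return None                # genuinely not found
    x, y = hit
    send_click(x, y)               # one click, whichever path resolved it
    return (x, y, strategy)
```

Returning which strategy fired is useful for logging, even if the model itself never needs to know.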
Problem 3: Multi-Monitor DPI Chaos
Three monitors, different scaling, negative coordinates for left-side displays. GetWindowRect returns different values depending on DPI awareness.
One line fixed it:
SetProcessDpiAwarenessContext(DPI_AWARENESS_CONTEXT_PER_MONITOR_AWARE_V2);
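To see why the one-liner matters: without per-monitor awareness, Windows hands the process DPI-virtualized (logical) coordinates, so the same on-screen point has different numbers depending on each monitor's scale factor. A toy illustration of that mismatch (pure arithmetic, not part of the server):

```python
# Sketch: logical vs physical pixels under per-monitor scaling.
# (Illustrative only; the actual fix is the single API call above,
# which makes Windows report physical coordinates directly.)

def logical_to_physical(point, scale):
    """On a 150% monitor, logical (100, 100) is physical (150, 150).
    A DPI-unaware process clicks the logical point -- the wrong pixel."""
    x, y = point
    return (round(x * scale), round(y * scale))
```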
Problem 4: Speed
Navigating draw.io required 15+ individual tool calls. Click menu, wait, click submenu, wait, select all, paste, wait, click OK...
We built run_sequence — batch multiple actions in a single MCP call:
focus "Chrome"
wait 300
hotkey ctrl+a
wait 200
hotkey ctrl+v
wait 500
One tool call instead of six. The round-trip latency between AI and tools is the real bottleneck, not the actions themselves.
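A run_sequence interpreter reduces to parsing `verb arg` lines and dispatching them in order. A sketch, with the executor supplied by the caller (illustrative Python; the real tool is part of the C# server):

```python
# Sketch of a run_sequence interpreter.
# (Illustrative Python; verbs mirror the example script above.)

def run_sequence(script, executor):
    """Parse 'verb arg' lines and dispatch them in order, so one MCP
    round-trip covers what would otherwise be N separate tool calls."""
    results = []
    for line in script.strip().splitlines():
        verb, _, arg = line.strip().partition(" ")
        executor(verb, arg)        # e.g. focus / wait / hotkey
        results.append((verb, arg))
    return results
```

The batching win is entirely in the transport: each eliminated round-trip saves a full model turn, which dwarfs the milliseconds the actions themselves take.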
Problem 5: Which Windows Are Actually Visible?
The AI would try to interact with windows hidden behind other windows. We built grid-based occlusion detection:
Chrome (chrome) @ -2060,-1461 ← 100% visible
VS Code (Code) @ -1500,-800 ← 65% visible
Explorer (explorer) @ -1400,-700 ← 0% visible [OCCLUDED]
24px grid cells, process windows front-to-back, calculate visible fraction. The AI now knows which windows are actually usable.
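The grid algorithm can be sketched as first-writer-wins painting: walk windows front to back, and each cell belongs to the topmost window that covers it. Illustrative Python (the real server works in physical screen coordinates with 24 px cells):

```python
# Sketch of grid-based occlusion detection.
# (Illustrative Python; real implementation is C#.)

def visible_fraction(window, windows_front_to_back, cell=24):
    """windows are (left, top, right, bottom) rects, frontmost first.
    Paint front-to-back onto a cell grid; a cell belongs to the first
    window covering it. Return the fraction of the target's cells it owns."""
    owner = {}
    for idx, (l, t, r, b) in enumerate(windows_front_to_back):
        for cx in range(l // cell, (r - 1) // cell + 1):
            for cy in range(t // cell, (b - 1) // cell + 1):
                owner.setdefault((cx, cy), idx)   # first writer wins
    target = windows_front_to_back.index(window)
    l, t, r, b = window
    cells = [(cx, cy)
             for cx in range(l // cell, (r - 1) // cell + 1)
             for cy in range(t // cell, (b - 1) // cell + 1)]
    owned = sum(1 for c in cells if owner[c] == target)
    return owned / len(cells)
```

Floor division keeps this correct for the negative coordinates that left-of-primary monitors produce.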
The Biggest Lesson
Don't screenshot. OCR first.
When we switched from screenshot-first to OCR-first workflows, everything improved:
┌────────────────────────────────┬─────────────┬───────┬─────────────┐
│ Approach │ Tokens │ Speed │ Reliability │
├────────────────────────────────┼─────────────┼───────┼─────────────┤
│ Screenshot → guess coordinates │ ~5000/image │ Slow │ Low │
├────────────────────────────────┼─────────────┼───────┼─────────────┤
│ OCR → exact text + coordinates │ ~200/scan │ Fast │ High │
└────────────────────────────────┴─────────────┴───────┴─────────────┘
We embedded this as ServerInstructions in the MCP server itself. Every AI client that connects receives the optimal workflow automatically:
options.ServerInstructions = """
Observation Priority:
1. ocr_window — exact text with click coordinates
2. get_window_details — UI elements with types
3. list_windows — window visibility %
4. screenshot_to_file — ONLY for final verification
""";
The AI learns the right approach on connection. No configuration needed.
The Demo
We told Claude Code: "Create an architecture diagram in draw.io."
The AI:
- list_windows — found Chrome
- navigate_to_url — opened app.diagrams.net
- ocr_window — read the storage dialog, saw "Create New Diagram"
- click_element "Create New Diagram" — UIAutomation click
- click_element "Create" — created blank diagram
- set_clipboard — prepared the XML
- click_menu_item "Extras" > "Edit Diagram" — one call for menu navigation
- run_sequence — Ctrl+A, Ctrl+V to paste XML
- click_element "OK" — applied the diagram
- ocr_window — verified all elements rendered correctly
Dark themed web app. Multiple dialogs. File save confirmation. All handled autonomously.
What's In The Box
45+ tools organized into:
- Vision — scan_desktop, list_windows, ocr_window, get_window_details
- Interaction — click_element (with OCR fallback), interact, fill_form, click_menu_item
- Batch — run_sequence, focus_and_hotkey
- Input — mouse_click, keyboard_hotkey, paste_text
- Capture — screenshot_to_file, screenshot_window (PrintWindow API — works when obscured)
Under the hood:
- Win32 P/Invoke (EnumWindows, SendInput, PrintWindow)
- UIAutomation with CacheRequest (single cross-process call per window)
- Windows.Media.Ocr with dark theme auto-enhancement
- Grid-based window occlusion analysis
- 30-second smart cache with per-window refresh
- Per-monitor DPI awareness
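The per-window cache in that list reduces to a TTL map keyed by window handle: re-scan a window's UI tree only when its entry has expired. A sketch (illustrative Python; the real cache lives in the C# server):

```python
# Sketch of a 30-second smart cache with per-window refresh.
# (Illustrative Python; keys stand in for Win32 window handles.)
import time

class WindowCache:
    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl, self.clock, self._store = ttl, clock, {}

    def get(self, hwnd, refresh):
        """Return cached data for hwnd, calling refresh(hwnd) only when
        this window's entry is missing or older than ttl. Other windows'
        entries are untouched -- refresh is per-window, not global."""
        entry = self._store.get(hwnd)
        now = self.clock()
        if entry is None or now - entry[0] > self.ttl:
            entry = (now, refresh(hwnd))
            self._store[hwnd] = entry
        return entry[1]
```

Injecting the clock keeps expiry testable without sleeping.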
This Is v1
This is the first open-source release. It works — we use it daily building Orbination. But it's v1.
What we think could make it great with community input:
- Chrome DevTools Protocol integration for direct browser control
- Better OCR text matching (word boundaries, fuzzy matching)
- Linux/macOS ports using AT-SPI and Accessibility API
- Workflow recording and replay
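For the fuzzy-matching item on that list, one plausible shape is a normalized similarity score over OCR lines with a rejection threshold, so `Diagrarn` still matches `Diagram` but unrelated text does not. A sketch only; the threshold and approach are assumptions, not a committed design:

```python
# Sketch of fuzzy OCR text matching (roadmap item, not shipped code).
# (Illustrative Python; 0.8 threshold is an assumption.)
from difflib import SequenceMatcher

def best_ocr_match(target, lines, threshold=0.8):
    """Pick the OCR line most similar to the target label, ignoring
    case, and reject weak matches below the threshold."""
    scored = [(SequenceMatcher(None, target.lower(), line.lower()).ratio(),
               line) for line in lines]
    score, line = max(scored)
    return line if score >= threshold else None
```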
We believe desktop control is the missing layer for truly autonomous AI agents. Not agents that just generate text — agents that use computers like humans do.
Orbination launches next month. This is the first piece we're sharing.
Get It
git clone https://github.com/amichail-1/Orbination-AI-Desktop-Vision-Control.git
cd Orbination-AI-Desktop-Vision-Control/DesktopControlMcp
dotnet build -c Release
claude mcp add desktop-control -- "bin\Release\net8.0-windows\DesktopControlMcp.exe"
GitHub: https://github.com/amichail-1/Orbination-AI-Desktop-Vision-Control
MIT License. Built by https://leia.gr for https://orbination.com.