We Gave AI Eyes and Hands on Windows — Here's How
AI coding assistants can write 500 lines of code in seconds. But ask them to click a button? They're blind.
While building https://orbination.com — our AI platform launching next month — we hit this wall hard. Our agents needed to interact with the actual Windows desktop.
Open apps. Click through dialogs. Read what's on screen. Test their own work.
Nothing out there did what we needed. So we built it.
The Problem
We tried the screenshot approach first. Take a screenshot, send it to the AI, let it figure out where to click.
It was:
- Slow — each screenshot costs thousands of vision tokens
- Unreliable — the AI guesses coordinates from pixels
- Expensive — 15 screenshots to navigate one menu
We needed the AI to know what's on screen, not guess from images.
The Solution: Read the Actual UI
Windows has a built-in accessibility layer called UIAutomation. Every button, input field, menu item, and checkbox exposes itself to this system with:
- Exact text/label
- Exact position (bounding rectangle)
- Control type (button, input, text, tab...)
- Interaction patterns (can I click it? type in it? toggle it?)
Instead of sending screenshots, we send this:
[button] "Save" @ 450,320
[input] "Search..." @ 200,60
[tab] "Settings" @ 120,35
Three lines of text instead of a 1MB image. The AI knows exactly what to click and where.
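The compact element-per-line format above can be sketched as follows. This is illustrative Python (the actual server is a .NET 8 executable), and the field names here are assumptions, not the server's real types:

```python
# Sketch: compress UI elements into compact text lines for the model.
# (Illustrative Python; the real implementation is C#.)
from dataclasses import dataclass

@dataclass
class UiElement:
    control_type: str   # e.g. "button", "input", "tab"
    name: str           # accessible label
    x: int              # clickable point, e.g. bounding-rect center
    y: int

def to_prompt_lines(elements):
    """Render elements as '[type] "name" @ x,y' lines -- a few hundred
    tokens instead of a multi-megabyte screenshot."""
    return "\n".join(f'[{e.control_type}] "{e.name}" @ {e.x},{e.y}'
                     for e in elements)
```

Feeding the model structured text like this also sidesteps coordinate guessing entirely: the click point comes from the accessibility tree, not from pixels.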
We wrapped this into an MCP server — the open protocol that connects AI assistants to external tools. Single .NET 8 executable. No Python. No Node.js. No Selenium.
Then Reality Hit
Problem 1: Dark Themes
Half the apps we use have dark themes. Standard OCR on a dark screenshot?
0 lines detected.
We built automatic dark theme enhancement:
if (IsDarkImage(bitmap))
{
    // Invert colors + boost contrast 1.4x
    using var enhanced = EnhanceForOcr(bitmap);
    result = RunOcrEngine(enhanced, language);
}
Check darkness first, enhance once, OCR once. Single pass.
Result: 0 → 37 lines detected on draw.io's dark interface.
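The darkness check and enhancement can be as simple as a mean-luminance threshold followed by invert-and-stretch. A sketch in Python (the real code is C# operating on GDI+ bitmaps; the 0.35 threshold is an assumption for illustration):

```python
# Sketch of dark-theme detection and OCR pre-enhancement.
# (Illustrative Python; real implementation is C#. Threshold is assumed.)

def is_dark(pixels):
    """pixels: list of (r, g, b). Dark if mean luminance is low."""
    lum = [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels]
    return sum(lum) / len(lum) < 0.35 * 255

def enhance_for_ocr(pixels, contrast=1.4):
    """Invert to dark-on-light, then boost contrast around mid-gray,
    clamping each channel back into 0..255."""
    out = []
    for r, g, b in pixels:
        out.append(tuple(
            max(0, min(255, int((255 - c - 128) * contrast + 128)))
            for c in (r, g, b)))
    return out
```

Inverting first matters: OCR engines are trained on dark text over light backgrounds, so a dark theme becomes a normal document after one pass.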
Problem 2: Web Apps Inside Iframes
UIAutomation can't see inside web-rendered dialogs and iframes. A dark-themed "OK" button in a web dialog? Invisible to UIAutomation.
We built an automatic OCR fallback into click_element:
click_element "OK"
→ UIAutomation: not found
→ Capture window → detect dark theme → enhance → OCR → find "OK" → click center
→ Result: Clicked "OK" @ 523,418 (OCR fallback) ✓
One tool call. The AI doesn't even know which strategy was used — it just works.
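The fallback chain above can be sketched as a try-cheap-first dispatcher. Illustrative Python only; `uia_find`, `ocr_find`, and `send_click` are hypothetical stand-ins for the C# internals:

```python
# Sketch of click_element's fallback chain.
# (Illustrative Python; helper functions are hypothetical stand-ins.)

def click_element(label, uia_find, ocr_find, send_click):
    """Try the cheap, exact UIAutomation path first; fall back to the
    screenshot -> enhance -> OCR path only on a miss."""
    hit = uia_find(label)          # exact accessible name + rect
    strategy = "uia"
    if hit is None:
        hit = ocr_find(label)      # OCR the window, match text, get center
        strategy = "ocr"
    if hit is None:
        return None                # genuinely not found
    x, y = hit
    send_click(x, y)               # one click, whichever path resolved it
    return (x, y, strategy)
```

Returning which strategy fired is useful for logging, even if the model itself never needs to know.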
Problem 3: Multi-Monitor DPI Chaos
Three monitors, different scaling, negative coordinates for left-side displays. GetWindowRect returns different values depending on DPI awareness.
One line fixed it:
SetProcessDpiAwarenessContext(DPI_AWARENESS_CONTEXT_PER_MONITOR_AWARE_V2);
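To see why the one-liner matters: without per-monitor awareness, Windows hands the process DPI-virtualized (logical) coordinates, so the same on-screen point has different numbers depending on each monitor's scale factor. A toy illustration of that mismatch (pure arithmetic, not part of the server):

```python
# Sketch: logical vs physical pixels under per-monitor scaling.
# (Illustrative only; the actual fix is the single API call above,
# which makes Windows report physical coordinates directly.)

def logical_to_physical(point, scale):
    """On a 150% monitor, logical (100, 100) is physical (150, 150).
    A DPI-unaware process clicks the logical point -- the wrong pixel."""
    x, y = point
    return (round(x * scale), round(y * scale))
```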
Problem 4: Speed
Navigating draw.io required 15+ individual tool calls. Click menu, wait, click submenu, wait, select all, paste, wait, click OK...
We built run_sequence — batch multiple actions in a single MCP call:
focus "Chrome"
wait 300
hotkey ctrl+a
wait 200
hotkey ctrl+v
wait 500
One tool call instead of six. The round-trip latency between AI and tools is the real bottleneck, not the actions themselves.
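A run_sequence interpreter reduces to parsing `verb arg` lines and dispatching them in order. A sketch, with the executor supplied by the caller (illustrative Python; the real tool is part of the C# server):

```python
# Sketch of a run_sequence interpreter.
# (Illustrative Python; verbs mirror the example script above.)

def run_sequence(script, executor):
    """Parse 'verb arg' lines and dispatch them in order, so one MCP
    round-trip covers what would otherwise be N separate tool calls."""
    results = []
    for line in script.strip().splitlines():
        verb, _, arg = line.strip().partition(" ")
        executor(verb, arg)        # e.g. focus / wait / hotkey
        results.append((verb, arg))
    return results
```

The batching win is entirely in the transport: each eliminated round-trip saves a full model turn, which dwarfs the milliseconds the actions themselves take.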
Problem 5: Which Windows Are Actually Visible?
The AI would try to interact with windows hidden behind other windows. We built grid-based occlusion detection:
Chrome (chrome) @ -2060,-1461 ← 100% visible
VS Code (Code) @ -1500,-800 ← 65% visible
Explorer (explorer) @ -1400,-700 ← 0% visible [OCCLUDED]
24px grid cells, process windows front-to-back, calculate visible fraction. The AI now knows which windows are actually usable.
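The grid algorithm can be sketched as first-writer-wins painting: walk windows front to back, and each cell belongs to the topmost window that covers it. Illustrative Python (the real server works in physical screen coordinates with 24 px cells):

```python
# Sketch of grid-based occlusion detection.
# (Illustrative Python; real implementation is C#.)

def visible_fraction(window, windows_front_to_back, cell=24):
    """windows are (left, top, right, bottom) rects, frontmost first.
    Paint front-to-back onto a cell grid; a cell belongs to the first
    window covering it. Return the fraction of the target's cells it owns."""
    owner = {}
    for idx, (l, t, r, b) in enumerate(windows_front_to_back):
        for cx in range(l // cell, (r - 1) // cell + 1):
            for cy in range(t // cell, (b - 1) // cell + 1):
                owner.setdefault((cx, cy), idx)   # first writer wins
    target = windows_front_to_back.index(window)
    l, t, r, b = window
    cells = [(cx, cy)
             for cx in range(l // cell, (r - 1) // cell + 1)
             for cy in range(t // cell, (b - 1) // cell + 1)]
    owned = sum(1 for c in cells if owner[c] == target)
    return owned / len(cells)
```

Floor division keeps this correct for the negative coordinates that left-of-primary monitors produce.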
The Biggest Lesson
Don't screenshot. OCR first.
When we switched from screenshot-first to OCR-first workflows, everything improved:
┌────────────────────────────────┬─────────────┬───────┬─────────────┐
│ Approach │ Tokens │ Speed │ Reliability │
├────────────────────────────────┼─────────────┼───────┼─────────────┤
│ Screenshot → guess coordinates │ ~5000/image │ Slow │ Low │
├────────────────────────────────┼─────────────┼───────┼─────────────┤
│ OCR → exact text + coordinates │ ~200/scan │ Fast │ High │
└────────────────────────────────┴─────────────┴───────┴─────────────┘
We embedded this as ServerInstructions in the MCP server itself. Every AI client that connects receives the optimal workflow automatically:
options.ServerInstructions = """
Observation Priority:
1. ocr_window — exact text with click coordinates
2. get_window_details — UI elements with types
3. list_windows — window visibility %
4. screenshot_to_file — ONLY for final verification
""";
The AI learns the right approach on connection. No configuration needed.
The Demo
We told Claude Code: "Create an architecture diagram in draw.io."
The AI:
- list_windows — found Chrome
- navigate_to_url — opened app.diagrams.net
- ocr_window — read the storage dialog, saw "Create New Diagram"
- click_element "Create New Diagram" — UIAutomation click
- click_element "Create" — created blank diagram
- set_clipboard — prepared the XML
- click_menu_item "Extras" > "Edit Diagram" — one call for menu navigation
- run_sequence — Ctrl+A, Ctrl+V to paste XML
- click_element "OK" — applied the diagram
- ocr_window — verified all elements rendered correctly
Dark themed web app. Multiple dialogs. File save confirmation. All handled autonomously.
What's In The Box
45+ tools organized into:
- Vision — scan_desktop, list_windows, ocr_window, get_window_details
- Interaction — click_element (with OCR fallback), interact, fill_form, click_menu_item
- Batch — run_sequence, focus_and_hotkey
- Input — mouse_click, keyboard_hotkey, paste_text
- Capture — screenshot_to_file, screenshot_window (PrintWindow API — works when obscured)
Under the hood:
- Win32 P/Invoke (EnumWindows, SendInput, PrintWindow)
- UIAutomation with CacheRequest (single cross-process call per window)
- Windows.Media.Ocr with dark theme auto-enhancement
- Grid-based window occlusion analysis
- 30-second smart cache with per-window refresh
- Per-monitor DPI awareness
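The per-window cache in that list reduces to a TTL map keyed by window handle: re-scan a window's UI tree only when its entry has expired. A sketch (illustrative Python; the real cache lives in the C# server):

```python
# Sketch of a 30-second smart cache with per-window refresh.
# (Illustrative Python; keys stand in for Win32 window handles.)
import time

class WindowCache:
    def __init__(self, ttl=30.0, clock=time.monotonic):
        self.ttl, self.clock, self._store = ttl, clock, {}

    def get(self, hwnd, refresh):
        """Return cached data for hwnd, calling refresh(hwnd) only when
        this window's entry is missing or older than ttl. Other windows'
        entries are untouched -- refresh is per-window, not global."""
        entry = self._store.get(hwnd)
        now = self.clock()
        if entry is None or now - entry[0] > self.ttl:
            entry = (now, refresh(hwnd))
            self._store[hwnd] = entry
        return entry[1]
```

Injecting the clock keeps expiry testable without sleeping.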
This Is v1
This is the first open-source release. It works — we use it daily building Orbination. But it's v1.
What we think could make it great with community input:
- Chrome DevTools Protocol integration for direct browser control
- Better OCR text matching (word boundaries, fuzzy matching)
- Linux/macOS ports using AT-SPI and Accessibility API
- Workflow recording and replay
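For the fuzzy-matching item on that list, one plausible shape is a normalized similarity score over OCR lines with a rejection threshold, so `Diagrarn` still matches `Diagram` but unrelated text does not. A sketch only; the threshold and approach are assumptions, not a committed design:

```python
# Sketch of fuzzy OCR text matching (roadmap item, not shipped code).
# (Illustrative Python; 0.8 threshold is an assumption.)
from difflib import SequenceMatcher

def best_ocr_match(target, lines, threshold=0.8):
    """Pick the OCR line most similar to the target label, ignoring
    case, and reject weak matches below the threshold."""
    scored = [(SequenceMatcher(None, target.lower(), line.lower()).ratio(),
               line) for line in lines]
    score, line = max(scored)
    return line if score >= threshold else None
```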
We believe desktop control is the missing layer for truly autonomous AI agents. Not agents that just generate text — agents that use computers like humans do.
Orbination launches next month. This is the first piece we're sharing.
Get It
git clone https://github.com/amichail-1/Orbination-AI-Desktop-Vision-Control.git
cd Orbination-AI-Desktop-Vision-Control/DesktopControlMcp
dotnet build -c Release
claude mcp add desktop-control -- "bin\Release\net8.0-windows\DesktopControlMcp.exe"
GitHub: https://github.com/amichail-1/Orbination-AI-Desktop-Vision-Control
MIT License. Built by https://leia.gr for https://orbination.com.