ScreenHand: Give AI Agents Eyes and Hands on Your Desktop (Open Source MCP Server)

What is ScreenHand?

ScreenHand is an open-source MCP (Model Context Protocol) server that gives AI agents native desktop control on macOS and Windows. Think of it as giving Claude, Cursor, or any MCP-compatible AI the ability to see your screen and interact with it — clicking buttons, typing text, navigating apps, and automating browser workflows.

Why We Built It

AI agents are powerful reasoners, but they're blind. They can write code but can't click a button. They can draft an email but can't send it. ScreenHand bridges this gap with 82 MCP tools spanning:

  • Desktop automation — click, type, scroll, drag, OCR, accessibility tree
  • Browser control via CDP — navigate, fill forms, click with anti-detection, execute JS
  • Anti-detection — human-like typing delays, stealth mode, realistic mouse movements
  • Memory system — persistent learning from errors and patterns across sessions
  • Job system — multi-step persistent jobs with worker daemon
  • Playbooks — reusable automation sequences for Instagram, X/Twitter, LinkedIn, YouTube, Reddit, Discord, and more
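Under MCP, each of these surfaces as a tool that the client invokes with a `tools/call` request over stdio. A minimal TypeScript sketch of what such a request looks like (the tool name `desktop_click` and its arguments are illustrative, not ScreenHand's actual schema):

```typescript
// Shape of an MCP tools/call request as sent over stdio.
// The tool name "desktop_click" and its arguments below are hypothetical
// examples, not ScreenHand's documented tool schema.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

function makeToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

const req = makeToolCall(1, "desktop_click", { x: 640, y: 360, button: "left" });
console.log(JSON.stringify(req));
```

The client never calls platform APIs directly; it only ever speaks this one protocol, and the server decides whether the request is served by the native bridge, the CDP adapter, or a playbook.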

Architecture

```
MCP Client (Claude, Cursor, etc.)
  | stdio (Model Context Protocol)
mcp-desktop.ts — 82 tools, Zod validation
  |
  +-- Native Bridge (Swift on macOS, C# on Windows)
  |     JSON-RPC over stdio, accessibility APIs
  |
  +-- CDP Chrome Adapter
  |     Chrome DevTools Protocol for browser automation
  |
  +-- Session Supervisor — lease management, recovery
  +-- Job System — persistent multi-step workflows
  +-- Playbook Engine — battle-tested automation recipes
  +-- Memory Service — learns from errors across sessions
```
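The native bridge runs as a child process and exchanges one JSON message per line over stdio. A minimal TypeScript sketch of that framing (the `ax.click` method and its params are hypothetical placeholders, not ScreenHand's actual bridge API):

```typescript
// Line-delimited JSON-RPC framing between mcp-desktop.ts and the native
// bridge. The method "ax.click" and its params are illustrative only.
type RpcRequest = { jsonrpc: "2.0"; id: number; method: string; params: object };
type RpcResponse = {
  jsonrpc: "2.0";
  id: number;
  result?: unknown;
  error?: { code: number; message: string };
};

function encodeRequest(id: number, method: string, params: object): string {
  const req: RpcRequest = { jsonrpc: "2.0", id, method, params };
  return JSON.stringify(req) + "\n"; // one request per line on the child's stdin
}

function decodeResponse(line: string): RpcResponse {
  return JSON.parse(line) as RpcResponse; // one response per line on its stdout
}

const wire = encodeRequest(7, "ax.click", { elementId: "btn-send" });
const reply = decodeResponse('{"jsonrpc":"2.0","id":7,"result":{"clicked":true}}');
```

Line framing keeps the bridge trivially parseable from Swift or C# without any length-prefix bookkeeping, at the cost of forbidding raw newlines inside messages (JSON string escaping handles that).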

Quick Start

```shell
npm install -g screenhand
```

Add to your MCP client config:

```json
{
  "mcpServers": {
    "screenhand": {
      "command": "npx",
      "args": ["screenhand"]
    }
  }
}
```

That's it. Your AI agent now has eyes and hands.

Battle-Tested Playbooks

We've built and tested playbooks for 8 platforms:

| Platform | Actions Tested |
| --- | --- |
| Instagram | Like, comment, save, DM, create post, follow, search |
| X/Twitter | Like, reply, retweet, bookmark, create post, DM, follow |
| LinkedIn | Post, like, comment, connect, message, search |
| YouTube | Upload video, like, comment, subscribe, search |
| Reddit | Upvote, comment, create post, search |
| Discord | Join server, navigate channels, send messages, DM |
| Threads | Like, reply, repost, create post, follow, search |
| n8n | Create workflows, add nodes, execute, publish |

Each playbook documents real selectors, error patterns, and workarounds discovered through live testing.
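Conceptually, a playbook is an ordered list of steps, each pairing an action with a selector and an optional fallback. A hypothetical TypeScript sketch of such a shape (the field names and the Reddit selectors are illustrative, not ScreenHand's actual playbook format):

```typescript
// Hypothetical playbook shape: an ordered list of steps, each with a
// selector discovered through live testing and an optional fallback.
// Field names and selectors here are illustrative only.
interface PlaybookStep {
  action: "click" | "type" | "waitFor";
  selector: string;        // CSS selector for the target element
  text?: string;           // payload for "type" steps
  fallback?: PlaybookStep; // tried if the primary selector fails
}

interface Playbook {
  platform: string;
  name: string;
  steps: PlaybookStep[];
}

const upvote: Playbook = {
  platform: "reddit",
  name: "upvote",
  steps: [
    { action: "waitFor", selector: "shreddit-post" },
    { action: "click", selector: 'button[aria-label="Upvote"]' },
  ],
};
```

Encoding error patterns and fallbacks as data rather than code is what lets a playbook survive the small selector churn these platforms ship every few weeks.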

Key Technical Insights

Building desktop automation taught us things docs don't cover:

  • Reddit uses shadow DOM: action buttons live under `shreddit-post.shadowRoot`
  • X/Twitter needs a JS dispatch of `mousedown` + `mouseup` + `click` to open retweet menus
  • LinkedIn uses the Quill editor: target `.ql-editor` for posts and comments
  • YouTube uploads work via the `DataTransfer` API — no file picker needed
  • Discord message actions only appear on hover — need a `mouseover` dispatch first
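The X/Twitter workaround above, for instance, boils down to evaluating a small script in the page through CDP's `Runtime.evaluate`. A sketch of a helper that builds that script (the selector in the usage example is an illustrative placeholder, not a tested X/Twitter selector):

```typescript
// Builds a JS expression that dispatches mousedown + mouseup + click on an
// element, suitable for passing to CDP Runtime.evaluate. The selector used
// in the example below is an illustrative placeholder.
function buildDispatchScript(selector: string): string {
  return `
    (() => {
      const el = document.querySelector(${JSON.stringify(selector)});
      if (!el) return false;
      for (const type of ["mousedown", "mouseup", "click"]) {
        el.dispatchEvent(new MouseEvent(type, { bubbles: true, cancelable: true }));
      }
      return true;
    })()`;
}

const script = buildDispatchScript('[data-testid="retweet"]');
```

Dispatching the full `mousedown`/`mouseup`/`click` sequence matters because some widgets listen on the press events rather than the synthesized `click`, so a bare `el.click()` silently does nothing.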

Open Source

ScreenHand is AGPL-3.0 licensed and available on GitHub.

Star us if you find it useful! We'd love contributions — especially new platform playbooks.



Built by Clazro Technology. We believe AI agents should be able to do everything a human can on a computer.
