Leon

Posted on Mar 30 • Edited on Apr 6 • Originally published at taprun.dev

Why Browser Automation Needs a Protocol, Not More AI

#ai #automation #mcp

Most AI browser automation tools today work the same way: the LLM looks at the page, decides what to click, clicks it, looks again, decides again. Every. Single. Step.

This is fundamentally wrong. Here's why, and what we built instead.

The Problem: AI-Per-Step is Broken

Tools like Browser-Use and Stagehand are impressive demos. But in production:

Slow: Each step needs an LLM call (1-3 seconds). A 10-step workflow takes 10-30 seconds in model calls alone.
Expensive: Every interaction costs tokens. At $0.01-0.10 per run, 1000 runs a day is $10-100 a day.
Unreliable: LLMs hallucinate. Different results each run. No determinism.
Fragile: The AI might click the wrong button, misread a selector, or get confused by a modal.

The core issue: operating an interface is a solved problem the moment you figure out how. The hard part is understanding the page. That's what AI is good at. The easy part is executing the same steps again. That doesn't need AI at all.

The Insight: Separate Understanding from Execution

What if AI only runs once — to analyze the site and create a deterministic script — and then that script runs forever?

forge_inspect → forge_verify → forge_save → run forever
   AI analyzes     AI tests       AI saves    zero AI, zero tokens

This is Tap: an interface protocol for AI agents.
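
The forge-once, run-forever flow can be sketched in a few lines. This is an illustration only: `createTapCache` and `fakeForge` are hypothetical names, and `fakeForge` stands in for the one-time AI analysis rather than reproducing what forge_inspect, forge_verify, and forge_save actually do.

```javascript
// Hypothetical sketch: `forge` stands in for the one-time AI step.
// None of these names are part of Tap's real API.
function createTapCache(forge) {
  const scripts = new Map(); // key -> forged deterministic script
  let forgeCalls = 0;

  return {
    // Return the saved script, forging it only on the first request.
    get(key) {
      if (!scripts.has(key)) {
        forgeCalls += 1;
        scripts.set(key, forge(key));
      }
      return scripts.get(key);
    },
    forgeCalls: () => forgeCalls,
  };
}

// Stand-in for the AI: returns a plain function that needs no model to run.
const fakeForge = (key) => () => [{ tap: key, status: "ok" }];

const cache = createTapCache(fakeForge);
const results = [];
for (let i = 0; i < 1000; i++) {
  results.push(cache.get("x/post")());
}

console.log(cache.forgeCalls()); // 1
console.log(results.length);     // 1000
```

The model is consulted once; the other 999 runs execute the saved function directly.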

The Protocol: 8 + 16

Tap defines a minimal, complete contract for operating any interface:

8 kernel primitives — the irreducible atoms of interaction:

eval · pointer · keyboard · nav · wait · screenshot · tap · capabilities

16 stdlib operations — composed from the kernel:

click · type · hover · scroll · pressKey · select · upload · dialog
fetch · find · cookies · download · waitFor · waitForNetwork · ssrState · storage

A new runtime implements the 8 kernel methods and immediately gets all 16 stdlib operations, plus every existing script. Today: Chrome Extension + Playwright. Tomorrow: Android, iOS, desktop apps.
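
To make the layering concrete, here is a rough sketch of how two stdlib operations could be composed from kernel primitives. The primitive names (`eval`, `pointer`, `keyboard`) come from the list above, but their signatures here are assumptions, and `makeStdlib` and `makeFakeKernel` are hypothetical helpers, not Tap's actual code.

```javascript
// Hypothetical sketch: kernel signatures are assumed, not Tap's real API.
function makeStdlib(kernel) {
  // Resolve a selector to the center point of the matching element via `eval`.
  async function centerOf(selector) {
    const expr = `(() => {
      const el = document.querySelector(${JSON.stringify(selector)});
      if (!el) return null;
      const r = el.getBoundingClientRect();
      return { x: r.left + r.width / 2, y: r.top + r.height / 2 };
    })()`;
    const point = await kernel.eval(expr);
    if (!point) throw new Error(`No element matches ${selector}`);
    return point;
  }

  return {
    async click(selector) {
      const { x, y } = await centerOf(selector);
      await kernel.pointer("click", x, y);
    },
    async type(selector, text) {
      const { x, y } = await centerOf(selector);
      await kernel.pointer("click", x, y); // focus the field first
      await kernel.keyboard("type", text);
    },
  };
}

// A fake runtime that records kernel calls, so the composition
// can be checked without a browser.
function makeFakeKernel() {
  const calls = [];
  return {
    calls,
    async eval() { calls.push(["eval"]); return { x: 10, y: 20 }; },
    async pointer(action, x, y) { calls.push(["pointer", action, x, y]); },
    async keyboard(action, text) { calls.push(["keyboard", action, text]); },
  };
}

const kernel = makeFakeKernel();
const page = makeStdlib(kernel);
page.type("#q", "hello").then(() => {
  console.log(kernel.calls.map(([op]) => op)); // [ 'eval', 'pointer', 'keyboard' ]
});
```

Under this design, a runtime author only has to supply the kernel object; `click` and `type` never touch the browser directly.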

What a Tap Looks Like

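// API-first: fetch data directly
export default {
  site: "bilibili", name: "hot",
  extract: async () => {
    // credentials: 'include' sends the browser's existing session cookies
    const res = await fetch('https://api.bilibili.com/x/web-interface/ranking/v2',
      { credentials: 'include' })
    if (!res.ok) throw new Error(`ranking request failed: HTTP ${res.status}`)
    const data = await res.json()
    // Keep only the fields the tap returns; views is converted to a string
    return data.data.list.map(v => ({
      title: v.title, author: v.owner.name,
      views: String(v.stat.view)
    }))
  }
}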

// Action: operate the interface
export default {
  site: "x", name: "post",
  args: { content: { type: "string" } },
  async run(page, args) {
    await page.nav('https://x.com/compose/post')
    await page.type('[data-testid="tweetTextarea_0"]', args.content)
    await page.click('[data-testid="tweetButton"]')
    return [{ status: 'posted' }]
  }
}

No LLM. No tokens. Pure JavaScript. Runs in under 1 second.
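
Because a tap is plain JavaScript, it can be dry-run against a stub page with no browser at all. The sketch below copies the action tap above into a constant; `makeStubPage`, `checkArgs`, and `dryRun` are hypothetical helpers, and the argument-checking semantics are an assumption about what the `args` declaration means, not Tap's documented behavior.

```javascript
// The action tap from above, assigned to a constant so it can be called directly.
const postTap = {
  site: "x", name: "post",
  args: { content: { type: "string" } },
  async run(page, args) {
    await page.nav("https://x.com/compose/post");
    await page.type('[data-testid="tweetTextarea_0"]', args.content);
    await page.click('[data-testid="tweetButton"]');
    return [{ status: "posted" }];
  },
};

// Hypothetical stub: records each page operation instead of driving a browser.
function makeStubPage() {
  const log = [];
  const record = (op) => async (...params) => { log.push([op, ...params]); };
  return { log, nav: record("nav"), type: record("type"), click: record("click") };
}

// Assumed semantics: every declared arg must be present with the declared type.
function checkArgs(tap, args) {
  for (const [name, spec] of Object.entries(tap.args ?? {})) {
    if (typeof args[name] !== spec.type) {
      throw new TypeError(`${tap.site}/${tap.name}: "${name}" must be a ${spec.type}`);
    }
  }
}

// Validate the args, run the tap against a fresh stub, return rows plus the op log.
async function dryRun(tap, args) {
  checkArgs(tap, args);
  const page = makeStubPage();
  const rows = await tap.run(page, args);
  return { rows, log: page.log };
}

dryRun(postTap, { content: "hello" }).then(({ rows, log }) => {
  console.log(rows);                  // [ { status: 'posted' } ]
  console.log(log.map(([op]) => op)); // [ 'nav', 'type', 'click' ]
});
```

Running `dryRun` twice with the same arguments produces the same log, which is the determinism the per-step approach cannot offer.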

81 Skills Across 41 Sites

The community has already forged taps for GitHub, Reddit, Hacker News, X/Twitter, YouTube, Bilibili, Zhihu, Xiaohongshu, Weibo, Medium, arXiv, and many more.

curl -fsSL https://raw.githubusercontent.com/LeonTing1010/tap/master/install.sh | sh
tap install    # Clone 81 community skills
tap list       # See them all

It's an MCP server — works with Claude Code, Cursor, Windsurf, or any MCP-compatible agent:

{
  "mcpServers": {
    "tap": { "command": "tap", "args": ["mcp"] }
  }
}

The Economics

                 AI-per-step       Tap
Forge cost       n/a               ~$0.05 (one-time)
Run cost         $0.01-0.10/run    $0.00
1000 runs        $10-100           $0.05 total
Latency          10-30s            <1s
Deterministic    No                Yes
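
The arithmetic behind these figures is simple enough to check by hand. The sketch below uses the rough estimates from this article (not measurements) and works in integer cents to avoid floating-point rounding; the function names are illustrative.

```javascript
// Rough estimates from this article, expressed in cents. Not measurements.
const AI_PER_RUN_CENTS = [1, 10];  // $0.01-0.10 per run
const TAP_FORGE_CENTS = 5;         // ~$0.05, paid once
const TAP_PER_RUN_CENTS = 0;
const LLM_CALL_SECONDS = [1, 3];   // per step

// Total cost in dollars after `runs` executions, for both approaches.
function totalCostUsd(runs) {
  const [low, high] = AI_PER_RUN_CENTS;
  return {
    aiPerStep: [(runs * low) / 100, (runs * high) / 100],
    tap: (TAP_FORGE_CENTS + runs * TAP_PER_RUN_CENTS) / 100,
  };
}

// Number of runs after which cumulative AI-per-step cost reaches the forge cost.
function breakEvenRuns(perRunCents) {
  return Math.ceil(TAP_FORGE_CENTS / perRunCents);
}

// Time spent waiting on model calls for a workflow of `steps` steps.
function aiLatencySeconds(steps) {
  const [low, high] = LLM_CALL_SECONDS;
  return [steps * low, steps * high];
}

console.log(totalCostUsd(1000)); // { aiPerStep: [ 10, 100 ], tap: 0.05 }
console.log(breakEvenRuns(1));   // 5
console.log(breakEvenRuns(10));  // 1
console.log(aiLatencySeconds(10)); // [ 10, 30 ]
```

On these estimates, the one-time forge cost is recovered within the first 1 to 5 runs.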

Try It

GitHub: github.com/LeonTing1010/tap
Skills: github.com/LeonTing1010/tap-skills

Forge once, run forever. That's the idea.

Tap is AGPL-3.0 licensed. ~1,800 lines of Deno. Zero dependencies.

Try it: taprun.dev | GitHub
