DEV Community

Leon

Why Browser Automation Needs a Protocol, Not More AI

Most AI browser automation tools today work the same way: the LLM looks at the page, decides what to click, clicks it, looks again, and decides again. Every. Single. Step.

This is fundamentally wrong. Here's why, and what we built instead.

The Problem: AI-Per-Step is Broken

Tools like Browser-Use and Stagehand are impressive demos. But in production:

  • Slow: Each step needs an LLM call (1-3 seconds). A 10-step workflow takes 10-30 seconds.
  • Expensive: Every interaction costs tokens. Running 1000 times/day = real money.
  • Unreliable: LLMs hallucinate. Different results each run. No determinism.
  • Fragile: The AI might click the wrong button, misread a selector, or get confused by a modal.
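The latency bullet can be checked with a back-of-the-envelope sketch. It assumes the steps run sequentially and that the LLM call dominates each step; the function name is illustrative and not part of any tool.

```javascript
// Rough latency estimate for an AI-per-step workflow.
// Assumption: steps run one after another and each waits on one LLM call.
function estimateLatencySeconds(steps, perCallSeconds) {
  return {
    min: steps * perCallSeconds.min,
    max: steps * perCallSeconds.max,
  };
}

const estimate = estimateLatencySeconds(10, { min: 1, max: 3 });
console.log(`${estimate.min}-${estimate.max} seconds`); // prints "10-30 seconds"
```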

The core issue: operating an interface is a solved problem the moment you figure out how. The hard part is understanding the page. That's what AI is good at. The easy part is executing the same steps again. That doesn't need AI at all.

The Insight: Separate Understanding from Execution

What if AI only runs once — to analyze the site and create a deterministic script — and then that script runs forever?

forge_inspect → forge_verify → forge_save → run forever
   AI analyzes     AI tests       AI saves    zero AI, zero tokens

This is Tap: an interface protocol for AI agents.
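A minimal sketch of the "forge once, run forever" idea follows. Everything in it is hypothetical: `forge` stands in for the AI-driven inspect/verify/save pipeline, and the in-memory `Map` stands in for wherever saved scripts actually live. It only illustrates that the expensive step runs once per script.

```javascript
// Hypothetical sketch: the AI step runs once, every later run is plain code.
const saved = new Map(); // "site/name" -> promise of a deterministic script
let aiCalls = 0;

// Stand-in for the one-time AI step. A real forge would inspect the site,
// verify the script, and save it; this one returns a canned function.
async function forge(site, name) {
  aiCalls += 1;
  return async () => [{ site, name, status: 'ok' }];
}

// Run a tap, forging it only if no saved script exists yet.
async function runTap(site, name) {
  const key = `${site}/${name}`;
  if (!saved.has(key)) {
    // Store the promise so concurrent callers share a single forge.
    saved.set(key, forge(site, name));
  }
  const script = await saved.get(key);
  return script();
}

async function demo() {
  for (let i = 0; i < 1000; i++) {
    await runTap('example', 'hot');
  }
  return aiCalls;
}

const forgeDemo = demo();
forgeDemo.then((calls) => console.log(`AI calls: ${calls}`)); // prints "AI calls: 1"
```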

The Protocol: 8 + 16

Tap defines a minimal, complete contract for operating any interface:

8 kernel primitives — the irreducible atoms of interaction:

eval · pointer · keyboard · nav · wait · screenshot · tap · capabilities

16 stdlib operations — composed from the kernel:

click · type · hover · scroll · pressKey · select · upload · dialog
fetch · find · cookies · download · waitFor · waitForNetwork · ssrState · storage

A new runtime implements the 8 kernel methods and instantly gets all 16 stdlib operations, plus every existing script. Today: Chrome Extension + Playwright. Tomorrow: Android, iOS, desktop apps.
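To illustrate how stdlib operations could be composed from kernel primitives, here is a hedged sketch. The kernel method signatures (`eval(source)`, `pointer(action, x, y)`, `keyboard(action, text)`) are assumptions made for this example, not Tap's actual API, and `withStdlib` and `createFakeKernel` are hypothetical helpers.

```javascript
// Hypothetical sketch: build stdlib operations on top of kernel primitives.
function withStdlib(kernel) {
  // Resolve a selector to the center point of the element via `eval`.
  async function locate(selector) {
    const point = await kernel.eval(`(() => {
      const el = document.querySelector(${JSON.stringify(selector)});
      if (!el) return null;
      const r = el.getBoundingClientRect();
      return { x: r.left + r.width / 2, y: r.top + r.height / 2 };
    })()`);
    if (!point) throw new Error(`No element matches ${selector}`);
    return point;
  }

  return {
    ...kernel,
    // click = locate the element, then press the pointer at its center
    async click(selector) {
      const { x, y } = await locate(selector);
      await kernel.pointer('click', x, y);
    },
    // type = click to focus, then send the text through the keyboard
    async type(selector, text) {
      await this.click(selector);
      await kernel.keyboard('type', text);
    },
  };
}

// A fake runtime that records calls instead of driving a browser.
function createFakeKernel() {
  const calls = [];
  return {
    calls,
    async eval(_source) { return { x: 10, y: 20 }; },
    async pointer(action, x, y) { calls.push(['pointer', action, x, y]); },
    async keyboard(action, text) { calls.push(['keyboard', action, text]); },
  };
}

const kernel = createFakeKernel();
const page = withStdlib(kernel);
const typed = page.type('#search', 'hello');
```

Under these assumptions, a new runtime only has to supply the three kernel methods; `click` and `type` come along unchanged.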

What a Tap Looks Like

// API-first: fetch data directly
export default {
  site: "bilibili", name: "hot",
  extract: async () => {
    const res = await fetch('https://api.bilibili.com/x/web-interface/ranking/v2',
      { credentials: 'include' })
    const data = await res.json()
    return data.data.list.map(v => ({
      title: v.title, author: v.owner.name,
      views: String(v.stat.view)
    }))
  }
}
An action tap drives the page instead:
// Action: operate the interface
export default {
  site: "x", name: "post",
  args: { content: { type: "string" } },
  async run(page, args) {
    await page.nav('https://x.com/compose/post')
    await page.type('[data-testid="tweetTextarea_0"]', args.content)
    await page.click('[data-testid="tweetButton"]')
    return [{ status: 'posted' }]
  }
}

No LLM. No tokens. Pure JavaScript. Runs in under 1 second.
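Because an action tap is plain JavaScript that only talks to `page`, it can be dry-run against a stand-in object. The sketch below repeats the action tap above as a plain object (so it runs without a module loader) and passes it a hypothetical recording page; `createRecordingPage` is an illustration, not part of Tap.

```javascript
// The action tap from above, written as a plain object.
const postTap = {
  site: 'x', name: 'post',
  args: { content: { type: 'string' } },
  async run(page, args) {
    await page.nav('https://x.com/compose/post');
    await page.type('[data-testid="tweetTextarea_0"]', args.content);
    await page.click('[data-testid="tweetButton"]');
    return [{ status: 'posted' }];
  },
};

// Hypothetical recording page: logs each operation instead of performing it.
function createRecordingPage() {
  const log = [];
  const record = (op) => async (...params) => { log.push([op, ...params]); };
  return { log, nav: record('nav'), type: record('type'), click: record('click') };
}

const recorder = createRecordingPage();
const dryRun = postTap.run(recorder, { content: 'hello' }).then((rows) => {
  console.log(recorder.log.map(([op]) => op).join(' -> ')); // prints "nav -> type -> click"
  return rows;
});
```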

81 Skills Across 41 Sites

The community has already forged taps for GitHub, Reddit, Hacker News, X/Twitter, YouTube, Bilibili, Zhihu, Xiaohongshu, Weibo, Medium, arXiv, and many more.

curl -fsSL https://raw.githubusercontent.com/LeonTing1010/tap/master/install.sh | sh
tap install    # Clone 81 community skills
tap list       # See them all

Tap is also an MCP server, so it works with Claude Code, Cursor, Windsurf, or any MCP-compatible agent:

{
  "mcpServers": {
    "tap": { "command": "tap", "args": ["mcp"] }
  }
}

The Economics

                AI-per-step       Tap
Forge cost      -                 ~$0.05 (one-time)
Run cost        $0.01-0.10/run    $0.00
1000 runs       $10-100           $0.05 total
Latency         10-30s            <1s
Deterministic   No                Yes
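The 1000-run row follows directly from the per-run figures in the table. A small sketch of that arithmetic, using only the numbers quoted above (the function name is illustrative, and costs are kept in cents to stay in integers):

```javascript
// Total cost = one-time forge cost + per-run cost * number of runs.
function totalCostCents({ forgeCents, perRunCents }, runs) {
  return forgeCents + perRunCents * runs;
}

const runs = 1000;
const aiPerStepLow = totalCostCents({ forgeCents: 0, perRunCents: 1 }, runs);
const aiPerStepHigh = totalCostCents({ forgeCents: 0, perRunCents: 10 }, runs);
const tap = totalCostCents({ forgeCents: 5, perRunCents: 0 }, runs);

const dollars = (cents) => `$${(cents / 100).toFixed(2)}`;
console.log(`AI-per-step: ${dollars(aiPerStepLow)}-${dollars(aiPerStepHigh)}`); // prints "AI-per-step: $10.00-$100.00"
console.log(`Tap: ${dollars(tap)}`); // prints "Tap: $0.05"
```

With these figures, the one-time forge cost is recovered within roughly the first five runs.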

Try It

Forge once, run forever. That's the idea.


Tap is AGPL-3.0 licensed. ~1,800 lines of Deno. Zero dependencies.
