
Leon

Posted on • Originally published at taprun.dev

rtrvr.ai vs Taprun: cheaper LLM-at-runtime still isn't zero tokens

rtrvr.ai is a polished entrant in the browser-agent space. Their architecture is genuinely interesting — "DOM-native" processing with "Smart DOM Compression", 25× cheaper than vision-based alternatives, 81% SOTA accuracy on their reported benchmark. They ship a Chrome extension, a Cloud dashboard, an API, an MCP server, a CLI, and even a WhatsApp bot. Their landing page lists 10 named competitors. The pricing mirrors Taprun almost exactly — $9.99 / $29.99 / $99.99 / $499.99 per month.

So when I first read their docs, the obvious question was: do they do what Taprun does? Because if they do, Taprun is in trouble.

They don't. And the reason sits on a single architectural line every browser-agent tool has to pick a side of.

The line: is the LLM at authoring, or at runtime?

There are two fundamentally different ways to point an LLM at a browser:

  • LLM at runtime. The model is called every time the automation executes. Each run is an inference pass. Optimisations make that pass smaller — DOM compression, accessibility trees instead of screenshots, smaller models — but the pass is still there.

  • LLM at authoring. The model is called once, during setup. It reads the site, figures out the structure, and emits a deterministic program. From then on, the program runs. The LLM never gets called again.

Browser Use, Stagehand, Playwright MCP, and rtrvr.ai all sit on the runtime side. They differ in how they call the LLM — vision vs DOM, big model vs small model, whole page vs compressed — but not in whether they call it.

Taprun sits on the authoring side. tap forge runs the LLM once to author a .tap.js file; from then on, the tap CLI executes that file forever with zero inference.

This distinction isn't marketing. It's the interpreter-versus-compiler split: Python evaluates every expression at runtime; compiled C did that work once, at compile time. You pick a side based on whether the workload repeats.
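The split is easy to see in code. Here is a toy sketch with a stubbed "LLM" whose only job is to count tokens (the 26K figure matches rtrvr's reported compressed-DOM pass; nothing here is either product's actual code):

```javascript
// Stub "LLM" that just tracks how many tokens each architecture spends.
let tokensSpent = 0;
const llm = {
  interpret(task)       { tokensSpent += 26_000; return { rows: 25 }; },          // per-run pass
  generateProgram(task) { tokensSpent += 26_000; return "return { rows: 25 };"; } // one-time pass
};

// Side 1: LLM at runtime -- an inference pass on every execution.
const runAtRuntime = (task) => llm.interpret(task);

// Side 2: LLM at authoring -- one inference pass emits a deterministic program.
const forge = (task) => new Function(llm.generateProgram(task));

for (let i = 0; i < 3; i++) runAtRuntime("reddit hot");
const runtimeTokens = tokensSpent;            // grows with every run

const tap = forge("reddit hot");              // tokens paid exactly once
for (let i = 0; i < 3; i++) tap();            // deterministic replay, zero tokens
const authoringTokens = tokensSpent - runtimeTokens;

console.log(runtimeTokens, authoringTokens);  // 78000 26000
```

Three runs each way: the runtime side is already at 3× the cost, and the gap widens linearly forever.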

Where rtrvr.ai is genuinely strong

Credit where it's due. rtrvr gets a lot right:

  • Form-factor breadth. Chrome extension, Cloud dashboard, Sheets integration, REST API, MCP server, CLI, embeddable widget, WhatsApp bot. Each one hits a different user at a different moment. This is good distribution design.

  • "No CDP = No Failures" framing. Chrome DevTools Protocol automation is bot-detectable; their architecture avoids it. This is a real reliability argument.

  • 25× cheaper vs vision. Replacing screenshots (~114K tokens) with accessibility trees and DOM compression (~26K tokens) is a meaningful improvement. Their claimed $0.12 per task is genuine progress.

  • BYOK path. "Bring your own key or local endpoint" collapses their cost to roughly your LLM provider's invoice. Clever.

  • Explicit competitive comparison. They list 10 competitors by name on their landing page. It's confident. It invites comparison. That's a good sign.

If your use case is agentic exploration — new sites, unknown tasks, one-off interactions — rtrvr is a serious tool. I'd reach for it myself.

Where the LLM-at-runtime model hits a ceiling

The ceiling isn't quality. It's structural.

Per-run cost scales linearly with runs. 26K tokens per task × 1,000 runs/day = 26M tokens/day. At Gemini Flash Lite rates that's real money; at Gemini Pro rates it's ~$260/day. rtrvr's own pricing acknowledges this: the Basic tier is 1,500 credits/month, which at "5 credits/task" is ~300 tasks. A single production workflow running every 5 minutes (288 runs/day) burns through that budget before the end of day two.
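The arithmetic, spelled out. The $10-per-million-token rate is my assumption for a Pro-class model; substitute your provider's real pricing:

```javascript
const tokensPerTask = 26_000;                      // rtrvr's compressed-DOM pass
const tokensPerDay  = tokensPerTask * 1_000;       // 1,000 runs/day
const dollarsPerDay = (tokensPerDay / 1e6) * 10;   // assumed $10 per 1M tokens

const tasksPerMonth = 1_500 / 5;                   // Basic tier: 1,500 credits at 5/task
const runsPerDay    = (24 * 60) / 5;               // 5-minute cron cadence
const daysOfBudget  = tasksPerMonth / runsPerDay;  // ~1.04 days of budget

console.log(tokensPerDay, dollarsPerDay, tasksPerMonth, runsPerDay); // 26000000 260 300 288
```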

Output variance is by design. When the same page, same prompt, same task produces slightly different extractions across runs, you can't build monitoring around it. Row count fluctuation isn't "a bug" when the system is designed to re-interpret the page every time. The 81% SOTA accuracy number is a fine benchmark result, but it means 19% of invocations are wrong in some way, and you don't know which 19%.

"Self-healing" still pays tokens to heal. Every browser-agent tool in this category markets "self-healing". What they mean is: when the selector breaks, the LLM re-runs to figure out the new one. That's real, and it's useful — but it is reactive. The task has already failed (or silently returned garbage) before healing kicks in, and every heal is another inference pass.

What Taprun does differently — structurally, not just cheaper

Taprun moves the LLM to authoring time. Once.

```shell
# Authoring: LLM inspects once, emits deterministic code
$ tap forge https://reddit.com/r/programming
✓ Inspected: REST API detected at oauth.reddit.com
✓ Verified: 25 rows, score 95/100
✓ Saved: reddit/hot.tap.js  (pure JavaScript, on your disk)

# Runtime: no LLM, no tokens, same output every time
$ tap reddit hot     # 25 rows, ~200 ms, $0.00
$ tap reddit hot     # 25 rows, ~200 ms, $0.00
$ tap reddit hot     # 25 rows, ~200 ms, $0.00
```

Because the output is deterministic, monitoring is tractable. Because execution is deterministic, row count is a health signal. Because the program is on your disk, it works offline and doesn't depend on anyone's cloud.
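What "row count is a health signal" looks like in practice. A minimal sketch (the checkHealth helper is mine, not part of Taprun):

```javascript
// With a deterministic program, any drift in row count means the SITE
// changed -- not that the model felt different today.
function checkHealth(rows, expected, tolerance = 0) {
  const drift = Math.abs(rows.length - expected);
  return drift <= tolerance
    ? { ok: true }
    : { ok: false, reason: `expected ${expected} rows, got ${rows.length}` };
}

// Stand-in for the output of a forged .tap.js run:
const rows = Array.from({ length: 25 }, (_, i) => ({ id: i }));

console.log(checkHealth(rows, 25));                 // { ok: true }
console.log(checkHealth(rows.slice(0, 19), 25).ok); // false -> alert, don't retry
```

Try building that check on top of a probabilistic extractor: a 19-row result might be breakage, or might just be the model's mood.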

And the "self-healing" axis flips from reactive to proactive:

```shell
$ tap doctor --auto reddit hot
✗ selector div.thing — gone since last run
⚠ fingerprint diff: ↑ 2 structural changes
✓ heal bundle ready — current code + git history + page snapshot
```

tap doctor checks a structural fingerprint before the run fires. If the site drifted, the run doesn't even start — you get a diff of what changed and a bundle your AI agent can patch offline. No retry tokens. No silent bad data.

Where the numbers actually land

Take a workflow that runs every 5 minutes — 288 runs/day, ~8,640 runs/month. Not extreme; this is a single production scraper.

  • Browser Use: 8,640 × $0.50 = $4,320/month (lower bound)
  • rtrvr.ai Basic ($9.99): 1,500 credits / 5 per task = 300 tasks. You're over budget by day two. Need Scale ($499.99) — 60,000 credits covers ~12,000 tasks. BYOK path lowers the number but your Gemini bill replaces it.
  • Taprun Free: 8,640 runs × $0 = $0/month. The $9/mo tier only comes into play if you want AI to forge new taps for you; the $29/mo tier if you want auto-repair on cron.

At 10 runs a day, none of this matters. At 10 runs a minute, it's the only thing that matters.
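For a rough sense of where the crossover sits, using only numbers already quoted in this post, and pessimistically charging Taprun its $9 paid tier rather than Free:

```javascript
const perTaskCost   = 0.12;   // rtrvr's claimed cost per task
const flatMonthly   = 9;      // Taprun's lowest paid tier
const breakevenRuns = Math.round(flatMonthly / perTaskCost);

console.log(breakevenRuns);   // 75 runs/month -- about 2.5 runs/day
```

Below ~2.5 runs a day, per-task pricing wins. The 5-minute cron above does 75 runs before lunch on day one.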

When to pick each

Pick rtrvr.ai when:

  • You're doing agentic exploration — new sites, undefined tasks, high variance in what you're extracting
  • You want a polished cloud dashboard and don't mind hosted state
  • You need WhatsApp or embeddable widget form factors
  • Your per-task count stays under ~300/month, or you're comfortable with a $499/mo scale tier

Pick Taprun when:

  • You run the same automation more than once — and want to know the output is the same every time
  • You want the program on your disk, version-controlled, not saved in someone's dashboard
  • You want structural fingerprint diffs, not retry loops, as your breakage story
  • Your scale is "every 5 minutes forever" and you want the bill to stay $9/mo
  • You want to keep working offline and in sandboxed environments where outbound LLM calls aren't allowed

They're not really competitors — they're different tools for different moments. Use rtrvr to figure out what you want to extract. Use Taprun once you know.

The one-line summary

rtrvr made LLM-at-runtime 25× cheaper than the vision-based baseline. Taprun made it zero. Those aren't points on the same line.
