DEV Community

Leon

We Ran 15,000 Browser Automations. The Failure That Matters Most Is Invisible to Your Monitoring.

Half of our YouTube automation runs return 0 rows. Status: ok. No exception thrown. No error logged. The program finishes in about 20 seconds and hands back an empty array, silently.

We didn't know this until we looked at the traces.

Over the past few months, Tap has executed 15,455 automation programs across real websites — Reddit, GitHub, Bilibili, Xiaohongshu, YouTube, Twitter, and more. The traces are structured JSON: site, tap name, status, rows returned, duration, error message if any. We analyzed all of them. What we found disagrees with the conventional mental model of how browser automations break.

The Reliability Table Nobody Publishes

Here are the actual numbers. Each row is a real platform. Hard error rate is the fraction of runs that threw an exception. Silent empty rate is the fraction of successful runs (status: ok) that returned zero rows.

| Platform | Total runs | Hard error % | Silent empty % | Effective failure % | Avg duration |
|---|---|---|---|---|---|
| Twitter / X | 128 | 0% | 0% | 0% | 154 ms |
| GitHub | 437 | 0% | 0.2% | 0.2% | 3,644 ms |
| Reddit | 688 | 13.8% | 6.4% | 19.4% | 4,075 ms |
| Xiaohongshu | 361 | 15.8% | 6.6% | 21.5% | 9,054 ms |
| Bilibili | 259 | 30.1% | 18.2% | 43.1% | 2,666 ms |
| Weibo | 38 | 36.8% | 0% | 36.8% | 4,644 ms |
| YouTube | 49 | 30.6% | 50.0% | 65.3% | 20,273 ms |
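The effective failure column is not a simple sum: silent empties are measured over successful runs only, so they are weighted by the success rate. A quick sketch of the arithmetic:

```python
def effective_failure(hard: float, silent_of_ok: float) -> float:
    """Combine the hard error rate with the silent empty rate,
    where silent_of_ok is measured over status-ok runs only."""
    return hard + (1 - hard) * silent_of_ok

# YouTube: 30.6% hard errors, 50% of ok runs return zero rows
print(round(effective_failure(0.306, 0.50), 3))  # → 0.653
```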

GitHub and Twitter are near-zero failure. YouTube is the opposite: two out of three runs either throw an error or return nothing. The 50% silent empty rate is more alarming than the 30.6% hard error rate — at least hard errors are visible.

The Failure Mode You're Not Tracking

Here's the part that surprised us most. We expected "element not found" to be the dominant failure. The conventional model: selector breaks, automation throws, you fix the selector. Obvious, visible, actionable.

The actual numbers:

  • `Element not found` (explicit selector failure): 5 occurrences
  • `Cannot read properties of undefined (reading 'url')` (implicit structural failure): 176 occurrences

The ratio is 35:1 in favor of the failure mode your monitoring doesn't catch.

What does `Cannot read properties of undefined (reading 'url')` actually mean? The selector found something. The extraction ran. The automation didn't crash during navigation. It returned data — a list of objects — but the objects no longer have a url field. The downstream code hits undefined and throws.

This is a structural drift failure, not a selector failure. The DOM element is there. The page loaded. The program traversed the right nodes. But the shape of the data those nodes return has changed — a field that was always present quietly stopped being present.
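A toy illustration of why the extraction succeeds and only the field access fails — in Python rather than the JS the error comes from, and with made-up payloads:

```python
# Two hypothetical extraction results: same selector, same traversal,
# but one experiment variant renamed the field.
variant_a = [{"title": "Video A", "url": "https://example.com/a"}]
variant_b = [{"title": "Video A", "jumpUrl": "https://example.com/a"}]

def collect_urls(items):
    # Downstream code that assumes `url` is always present
    return [item["url"] for item in items]

collect_urls(variant_a)      # fine
try:
    collect_urls(variant_b)  # Python raises KeyError; in JS this is
                             # "Cannot read properties of undefined"
except KeyError:
    print("structural drift: the field moved, the selector never failed")
```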

The sites affected, in order of frequency:

  • Bilibili (videos, articles, analytics, benchmark, content-ideas, stats, trending)
  • Algora bounties
  • IssueHunt bounties
  • Douyin (search, hot)
  • Zhihu search
  • X/Twitter (notifications, trending)
  • Xiaohongshu search
  • Weibo search
  • Baidu hot
  • Hacker News hot
  • ProductHunt forum comments
  • TechCrunch latest
  • Ars Technica news

That list spans Chinese platforms, Western platforms, social networks, news sites, and developer bounty boards. The failure mode is not platform-specific. It's inherent to how browser automation interacts with any site that changes its rendering.

Why Your Monitoring Doesn't See This

Consider what's happening at the infrastructure layer when this failure occurs:

  • HTTP response: 200
  • Page loaded successfully: yes
  • Navigation completed: yes
  • Automation process exited: 0
  • Exception thrown: eventually — but only after the extraction, when downstream code accesses the malformed object

Most monitoring stacks see a successful process exit followed by an application exception. But the harder version is when the object does have a url field — it just points to something different. A related item section. A sponsored result. A pagination link that got included in the data array.

In those cases: status ok, rows returned, no exception, wrong data. Pydantic passes. Row count checks pass. Prometheus reports a healthy process. OTel has nothing to report. The only signal is semantic: these URLs aren't the URLs you wanted.
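Catching that last case means checking what the URLs mean, not just that they exist. A minimal sketch, assuming rows are dicts with a url field; the regex is a hypothetical pattern for YouTube watch pages, not anything from Tap:

```python
import re

# Hypothetical semantic contract: every row's URL must be a watch page,
# not a pagination or sponsored link that leaked into the array.
WATCH_URL = re.compile(r"^https://www\.youtube\.com/watch\?v=[\w-]+")

def semantically_valid(rows) -> bool:
    # Empty results fail too: a silent empty is not a pass.
    return bool(rows) and all(WATCH_URL.match(r.get("url", "")) for r in rows)

rows = [
    {"url": "https://www.youtube.com/watch?v=abc123"},
    {"url": "https://www.youtube.com/results?page=2"},  # pagination leak
]
print(semantically_valid(rows))  # → False: shape is right, meaning is wrong
```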

The Platform Reliability Pattern

GitHub and Twitter have published APIs that their web UIs reflect. A GitHub repository page structure is stable because it's owned by the same team that maintains the underlying data model.

Bilibili, Douyin, Xiaohongshu, and Weibo run aggressive A/B experiments on their rendering layer — sometimes multiple experiments simultaneously for different user cohorts. The same page, loaded twice in the same session, can return different DOM structures. The url field on a video card might be in item.url in one experiment variant and item.jumpUrl in another.

YouTube fails for a different reason entirely: aggressive anti-bot measures that return empty results instead of blocking requests. A request that would return a 429 or a CAPTCHA against a naive scraper returns 200 with an empty content container on a logged-out browser session. Status: ok. Rows: 0. Duration: 20 seconds of wasted compute.

What Catches This, and What Doesn't

| Tool | Catches hard error? | Catches silent empty? | Catches wrong data (right shape)? |
|---|---|---|---|
| Process monitoring | Yes | No | No |
| Pydantic / type validation | Yes | Sometimes | No |
| Row count threshold | No | Yes | No |
| Health contracts (range + pattern + drift) | Yes | Yes | Yes |
| Structural fingerprinting | Yes | Yes | Signals change, not interpretation |

The only layer that catches all three failure classes is a contract that validates semantics — not just shape. A min_rows check catches silent empties. A pattern check on URLs catches wrong-source data. A drift check catches distribution shifts that look valid but represent changed behavior.
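A sketch of what such a three-part contract could look like — illustrative names, not Tap's actual API:

```python
from dataclasses import dataclass, field
import re
import statistics

@dataclass
class HealthContract:
    """Range + pattern + drift checks on a tap's returned rows."""
    min_rows: int
    url_pattern: str
    drift_tolerance: float = 0.5           # allow ±50% around the recent mean
    history: list = field(default_factory=list)

    def check(self, rows) -> list:
        violations = []
        if len(rows) < self.min_rows:      # range: catches silent empties
            violations.append("min_rows")
        pat = re.compile(self.url_pattern)
        if any(not pat.match(r.get("url", "")) for r in rows):
            violations.append("url_pattern")   # pattern: catches wrong-source data
        if len(self.history) >= 5:             # drift: catches changed behavior
            mean = statistics.mean(self.history)
            if abs(len(rows) - mean) > self.drift_tolerance * mean:
                violations.append("drift")
        self.history.append(len(rows))
        return violations

contract = HealthContract(min_rows=1, url_pattern=r"^https://")
print(contract.check([]))  # → ['min_rows']
```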

What We'd Do Differently

Treat silent empties as first-class failures. A run returning zero rows should be suspicious by default. Most automations that legitimately return zero rows are edge cases. Most that return zero rows unexpectedly are broken. The difference is detectable with a min_rows contract.

Fingerprint before running, not after. The structural drift that causes `Cannot read properties of undefined` is detectable in the DOM before you run your extraction logic. A fingerprint check is cheaper than a full tap execution.
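One way to build such a check — a sketch, not Tap's implementation — is to hash the tag-and-class skeleton of the nodes the tap depends on and compare it against the last known-good value:

```python
import hashlib

def fingerprint(elements) -> str:
    """Hash a (tag, class list) skeleton of the nodes a tap extracts from."""
    skeleton = "|".join(f"{tag}.{'.'.join(classes)}" for tag, classes in elements)
    return hashlib.sha256(skeleton.encode()).hexdigest()[:16]

known_good = fingerprint([("div", ["video-card"]), ("a", ["video-card__link"])])
current = fingerprint([("div", ["video-card"]), ("a", ["jump-link"])])  # renamed class

if current != known_good:
    print("structural drift detected before the extraction ran")
```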

Treat Chinese platforms as a separate reliability tier. The A/B experiment cadence is genuinely different. A tap targeting Bilibili needs shorter contract drift windows and more frequent health checks than one targeting GitHub.

Duration is a signal. Our YouTube taps average 20 seconds per run and fail 65% of the time. That's not slow extraction — that's waiting for content that's not coming. A timeout contract that fires at 8 seconds would catch most of these early.
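A duration contract can be sketched with a thread-pool timeout; the 8-second budget is the threshold the trace data suggests, and the tap callable is a stand-in:

```python
import concurrent.futures

def run_with_budget(tap, budget_s: float = 8.0):
    """Run a tap callable, raising if it exceeds the duration budget."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(tap)
        try:
            return future.result(timeout=budget_s)
        except concurrent.futures.TimeoutError:
            raise RuntimeError(f"duration contract violated (> {budget_s}s)")
```

Note that the worker thread keeps running after the timeout fires; a production version would also tear down the underlying browser session.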


The trace data from 15,455 runs is the most honest answer we have to "what actually breaks in browser automation?"

The answer: silent structural drift, not explicit selector failure. The sites that change fastest break most. The failures that matter most are the ones that look like success.


Built with Tap — browser automation programs that run forever.
