DEV Community

Leon

We Ran 15,000 Browser Automations. The Failure That Matters Most Is Invisible to Your Monitoring.

Half of our YouTube automation runs return 0 rows. Status: ok. No exception thrown. No error logged. The program finishes in about 20 seconds and hands back an empty array, silently.

We didn't know this until we looked at the traces.

Over the past few months, Tap has executed 15,455 automation programs across real websites — Reddit, GitHub, Bilibili, Xiaohongshu, YouTube, Twitter, and more. The traces are structured JSON: site, tap name, status, rows returned, duration, error message if any. We analyzed all of them. What we found disagrees with the conventional mental model of how browser automations break.

The Reliability Table Nobody Publishes

Here are the actual numbers. Each row is a real platform. Hard error rate is the fraction of runs that threw an exception. Silent empty rate is the fraction of successful runs (status: ok) that returned zero rows.

| Platform | Total runs | Hard error % | Silent empty % | Effective failure % | Avg duration |
|---|---|---|---|---|---|
| Twitter / X | 128 | 0% | 0% | 0% | 154 ms |
| GitHub | 437 | 0% | 0.2% | 0.2% | 3,644 ms |
| Reddit | 688 | 13.8% | 6.4% | 19.4% | 4,075 ms |
| Xiaohongshu | 361 | 15.8% | 6.6% | 21.5% | 9,054 ms |
| Bilibili | 259 | 30.1% | 18.2% | 43.1% | 2,666 ms |
| Weibo | 38 | 36.8% | 0% | 36.8% | 4,644 ms |
| YouTube | 49 | 30.6% | 50.0% | 65.3% | 20,273 ms |
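The effective failure column is not a simple sum: silent empties are measured over successful runs only, so they are weighted by the success rate. A quick sketch of the arithmetic:

```python
def effective_failure(hard: float, silent_of_ok: float) -> float:
    """Combine the hard error rate with the silent empty rate,
    where silent_of_ok is measured over status-ok runs only."""
    return hard + (1 - hard) * silent_of_ok

# YouTube: 30.6% hard errors, 50% of ok runs return zero rows
print(round(effective_failure(0.306, 0.50), 3))  # → 0.653
```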

GitHub and Twitter are near-zero failure. YouTube is the opposite: two out of three runs either throw an error or return nothing. The 50% silent empty rate is more alarming than the 30.6% hard error rate — at least hard errors are visible.

The Failure Mode You're Not Tracking

Here's the part that surprised us most. We expected "element not found" to be the dominant failure. The conventional model: selector breaks, automation throws, you fix the selector. Obvious, visible, actionable.

The actual numbers:

  • `Element not found` (explicit selector failure): 5 occurrences
  • `Cannot read properties of undefined (reading 'url')` (implicit structural failure): 176 occurrences

The ratio is 35:1 in favor of the failure mode your monitoring doesn't catch.

What does `Cannot read properties of undefined (reading 'url')` actually mean? The selector found something. The extraction ran. The automation didn't crash during navigation. It returned data — a list of objects — but the objects no longer have a url field. The downstream code hits undefined and throws.

This is a structural drift failure, not a selector failure. The DOM element is there. The page loaded. The program traversed the right nodes. But the shape of the data those nodes return has changed — a field that was always present quietly stopped being present.
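A toy illustration of why the extraction succeeds and only the field access fails — in Python rather than the JS the error comes from, and with made-up payloads:

```python
# Two hypothetical extraction results: same selector, same traversal,
# but one experiment variant renamed the field.
variant_a = [{"title": "Video A", "url": "https://example.com/a"}]
variant_b = [{"title": "Video A", "jumpUrl": "https://example.com/a"}]

def collect_urls(items):
    # Downstream code that assumes `url` is always present
    return [item["url"] for item in items]

collect_urls(variant_a)      # fine
try:
    collect_urls(variant_b)  # Python raises KeyError; in JS this is
                             # "Cannot read properties of undefined"
except KeyError:
    print("structural drift: the field moved, the selector never failed")
```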

The sites affected, in order of frequency:

  • Bilibili (videos, articles, analytics, benchmark, content-ideas, stats, trending)
  • Algora bounties
  • IssueHunt bounties
  • Douyin (search, hot)
  • Zhihu search
  • X/Twitter (notifications, trending)
  • Xiaohongshu search
  • Weibo search
  • Baidu hot
  • Hacker News hot
  • ProductHunt forum comments
  • TechCrunch latest
  • Ars Technica news

That list spans Chinese platforms, Western platforms, social networks, news sites, and developer bounty boards. The failure mode is not platform-specific. It's inherent to how browser automation interacts with any site that changes its rendering.

Why Your Monitoring Doesn't See This

Consider what's happening at the infrastructure layer when this failure occurs:

  • HTTP response: 200
  • Page loaded successfully: yes
  • Navigation completed: yes
  • Automation process exited: 0
  • Exception thrown: eventually — but only after the extraction, when downstream code accesses the malformed object

Most monitoring stacks see a successful process exit followed by an application exception. But the harder version is when the object does have a url field — it just points to something different. A related item section. A sponsored result. A pagination link that got included in the data array.

In those cases: status ok, rows returned, no exception, wrong data. Pydantic passes. Row count checks pass. Prometheus reports a healthy process. OTel has nothing to report. The only signal is semantic: these URLs aren't the URLs you wanted.
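Catching that last case means checking what the URLs mean, not just that they exist. A minimal sketch, assuming rows are dicts with a url field; the regex is a hypothetical pattern for YouTube watch pages, not anything from Tap:

```python
import re

# Hypothetical semantic contract: every row's URL must be a watch page,
# not a pagination or sponsored link that leaked into the array.
WATCH_URL = re.compile(r"^https://www\.youtube\.com/watch\?v=[\w-]+")

def semantically_valid(rows) -> bool:
    # Empty results fail too: a silent empty is not a pass.
    return bool(rows) and all(WATCH_URL.match(r.get("url", "")) for r in rows)

rows = [
    {"url": "https://www.youtube.com/watch?v=abc123"},
    {"url": "https://www.youtube.com/results?page=2"},  # pagination leak
]
print(semantically_valid(rows))  # → False: shape is right, meaning is wrong
```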

The Platform Reliability Pattern

GitHub and Twitter have published APIs that their web UIs reflect. A GitHub repository page structure is stable because it's owned by the same team that maintains the underlying data model.

Bilibili, Douyin, Xiaohongshu, and Weibo run aggressive A/B experiments on their rendering layer — sometimes multiple experiments simultaneously for different user cohorts. The same page, loaded twice in the same session, can return different DOM structures. The url field on a video card might be in item.url in one experiment variant and item.jumpUrl in another.

YouTube fails for a different reason entirely: aggressive anti-bot measures that return empty results instead of blocking requests. A request that would return a 429 or a CAPTCHA against a naive scraper returns 200 with an empty content container on a logged-out browser session. Status: ok. Rows: 0. Duration: 20 seconds of wasted compute.

What Catches This, and What Doesn't

| Tool | Catches hard error? | Catches silent empty? | Catches wrong data (right shape)? |
|---|---|---|---|
| Process monitoring | Yes | No | No |
| Pydantic / type validation | Yes | Sometimes | No |
| Row count threshold | No | Yes | No |
| Health contracts (range + pattern + drift) | Yes | Yes | Yes |
| Structural fingerprinting | Yes | Yes | Signals change, not interpretation |

The only layer that catches all three failure classes is a contract that validates semantics — not just shape. A min_rows check catches silent empties. A pattern check on URLs catches wrong-source data. A drift check catches distribution shifts that look valid but represent changed behavior.
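A sketch of what such a three-part contract could look like — illustrative names, not Tap's actual API:

```python
from dataclasses import dataclass, field
import re
import statistics

@dataclass
class HealthContract:
    """Range + pattern + drift checks on a tap's returned rows."""
    min_rows: int
    url_pattern: str
    drift_tolerance: float = 0.5           # allow ±50% around the recent mean
    history: list = field(default_factory=list)

    def check(self, rows) -> list:
        violations = []
        if len(rows) < self.min_rows:      # range: catches silent empties
            violations.append("min_rows")
        pat = re.compile(self.url_pattern)
        if any(not pat.match(r.get("url", "")) for r in rows):
            violations.append("url_pattern")   # pattern: catches wrong-source data
        if len(self.history) >= 5:             # drift: catches changed behavior
            mean = statistics.mean(self.history)
            if abs(len(rows) - mean) > self.drift_tolerance * mean:
                violations.append("drift")
        self.history.append(len(rows))
        return violations

contract = HealthContract(min_rows=1, url_pattern=r"^https://")
print(contract.check([]))  # → ['min_rows']
```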

What We'd Do Differently

Treat silent empties as first-class failures. A run returning zero rows should be suspicious by default. Most automations that legitimately return zero rows are edge cases. Most that return zero rows unexpectedly are broken. The difference is detectable with a min_rows contract.

Fingerprint before running, not after. The structural drift that causes `Cannot read properties of undefined` is detectable in the DOM before you run your extraction logic. A fingerprint check is cheaper than a full tap execution.
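One way to build such a check — a sketch, not Tap's implementation — is to hash the tag-and-class skeleton of the nodes the tap depends on and compare it against the last known-good value:

```python
import hashlib

def fingerprint(elements) -> str:
    """Hash a (tag, class list) skeleton of the nodes a tap extracts from."""
    skeleton = "|".join(f"{tag}.{'.'.join(classes)}" for tag, classes in elements)
    return hashlib.sha256(skeleton.encode()).hexdigest()[:16]

known_good = fingerprint([("div", ["video-card"]), ("a", ["video-card__link"])])
current = fingerprint([("div", ["video-card"]), ("a", ["jump-link"])])  # renamed class

if current != known_good:
    print("structural drift detected before the extraction ran")
```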

Treat Chinese platforms as a separate reliability tier. The A/B experiment cadence is genuinely different. A tap targeting Bilibili needs shorter contract drift windows and more frequent health checks than one targeting GitHub.

Duration is a signal. Our YouTube taps average 20 seconds per run and fail 65% of the time. That's not slow extraction — that's waiting for content that's not coming. A timeout contract that fires at 8 seconds would catch most of these early.
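A duration contract can be sketched with a thread-pool timeout; the 8-second budget is the threshold the trace data suggests, and the tap callable is a stand-in:

```python
import concurrent.futures

def run_with_budget(tap, budget_s: float = 8.0):
    """Run a tap callable, raising if it exceeds the duration budget."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(tap)
        try:
            return future.result(timeout=budget_s)
        except concurrent.futures.TimeoutError:
            raise RuntimeError(f"duration contract violated (> {budget_s}s)")
```

Note that the worker thread keeps running after the timeout fires; a production version would also tear down the underlying browser session.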


The trace data from 15,455 runs is the most honest answer we have to "what actually breaks in browser automation?"

The answer: silent structural drift, not explicit selector failure. The sites that change fastest break most. The failures that matter most are the ones that look like success.


Built with Tap — browser automation programs that run forever.
