Leon

Posted on Apr 4 • Originally published at taprun.dev

Your Scraper Is Broken Right Now. You Just Don't Know It Yet.

#webdev #automation #ai #scraping

Somewhere in your infrastructure, a scraper is returning empty arrays. Your dashboard shows stale numbers. A report your team relies on has been wrong since Tuesday.

You won't find out until someone complains.

The Silent Failure Problem

Most scrapers don't crash loudly. They fail quietly.

"Instead of throwing an error when a page structure changes, they return empty arrays... A scraper that fails silently poisons your data for days or weeks before anyone notices."

This happens because scrapers have no health contract. They extract whatever they find — and when the site changes, "whatever they find" is nothing. No error. No alert. Just empty data flowing downstream.

The Maintenance Tax

When you do notice, you're back to fixing selectors. Again.

"Every time a website redesigns or updates their layout, I'm manually fixing selectors and rewriting parts of the workflow. It's eating up hours every month."

"Maintaining tests can take up to 50% of the time for QA test automation engineers."

The loop is always the same: build automation → site changes → selectors break → spend hours fixing → repeat.

The AI Agent Tax

AI browser agents promise to solve this by re-interpreting the page on every run. But they introduce two new problems:

1. Cost compounds.

"The program cost $1.05 to run. So doing it at any scale quickly becomes a little bit silly."

2. Reliability degrades at each step.

"If each step has a .95 chance of completing successfully, after not very many steps you have a pretty small overall probability of success."

95% per step sounds great. But a 10-step workflow succeeds only about 60% of the time overall (0.95^10 ≈ 0.60). AI agents trade one problem (brittle selectors) for another (probabilistic failure).
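To make both taxes concrete, here is a rough sketch of the arithmetic. It assumes every step succeeds independently with the same probability, and the 1,000-run figure is only an illustrative volume, not a number from the quotes above. The function names are made up for this example.

```javascript
// Probability that a workflow of `steps` steps succeeds end to end,
// assuming each step succeeds independently with probability `p`.
function overallSuccess(p, steps) {
  return Math.pow(p, steps);
}

// Total spend for a number of runs at a fixed per-run cost.
function totalCost(costPerRun, runs) {
  return costPerRun * runs;
}

// Print how quickly reliability decays as the workflow gets longer.
for (const steps of [1, 5, 10, 20]) {
  const pct = (overallSuccess(0.95, steps) * 100).toFixed(1);
  console.log(`${steps} steps: ${pct}% overall`);
}

// 1,000 runs is an assumed volume, chosen only for illustration.
console.log(`1,000 runs at $1.05 each: $${totalCost(1.05, 1000).toFixed(2)}`);
```

Real steps are rarely independent, so treat the curve as a rough guide rather than a prediction.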

A Different Approach: Health Contracts

What if your automation had a contract that defined what "healthy" looks like?

// Built into every program
health: {
  min_rows: 5,           // must return at least 5 results
  non_empty: ["title"]   // "title" field must never be empty
}

Now, instead of silently returning empty arrays, the system knows when something is wrong:

$ tap doctor
hackernews/hot    ✓ ok     30 rows  (245ms)
google/trends     ✗ fail   0 rows   min_rows: expected ≥5, got 0
github/trending   ✓ ok     25 rows  (1.2s)
bbc/news          ✗ fail   3 rows   min_rows: expected ≥5, got 3

Two failures caught. Before your data went bad. Before anyone complained.
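If you want a feel for what evaluating such a contract might involve, here is a minimal sketch. It is not Tap's actual implementation: `checkHealth` is a hypothetical helper, and it only assumes the two contract fields shown earlier (`min_rows` and `non_empty`).

```javascript
// Hypothetical evaluator for a health contract with `min_rows` and `non_empty`.
// Returns a list of human-readable failures; an empty list means healthy.
function checkHealth(rows, contract) {
  const failures = [];

  // Rule 1: the result set must contain at least `min_rows` rows.
  const minRows = contract.min_rows ?? 0;
  if (rows.length < minRows) {
    failures.push(`min_rows: expected ≥${minRows}, got ${rows.length}`);
  }

  // Rule 2: every listed field must be present and non-blank in every row.
  for (const field of contract.non_empty ?? []) {
    const missing = rows.filter((row) => {
      const value = row[field];
      return value === undefined || value === null || String(value).trim() === "";
    }).length;
    if (missing > 0) {
      failures.push(`non_empty: "${field}" is empty in ${missing} row(s)`);
    }
  }

  return failures;
}

const contract = { min_rows: 5, non_empty: ["title"] };
console.log(checkHealth([], contract));
```

The point of returning a list instead of throwing is that a scheduler can report every violated rule at once, then decide whether to alert or to heal.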

Watch: Real-Time Change Detection

Health checks catch breakage. But what about legitimate changes?

"I built Site Spy after missing a visa appointment slot because a government page changed and I didn't notice for two weeks."

$ tap watch hackernews hot --every 10m
2026-04-04T10:00  +added   "Show HN: Tap"  score=342
2026-04-04T10:10  +added   "Rust 2.0 announced"  score=128
2026-04-04T10:10  -removed "Old post fell off"  score=12

Run your program on an interval, diff the results, and output only what changed. Pipe that output to a file, a Slack webhook, or another program.
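The diff step is the interesting part. A minimal sketch, assuming rows are plain objects and that the title is a good enough identity for a row (both are assumptions for this example; `diffRows` is a hypothetical helper, not part of Tap):

```javascript
// Compare two snapshots and report rows that appeared or disappeared.
// `keyOf` decides what makes two rows "the same"; title is an assumed default.
function diffRows(previous, current, keyOf = (row) => row.title) {
  const previousKeys = new Set(previous.map(keyOf));
  const currentKeys = new Set(current.map(keyOf));
  return {
    added: current.filter((row) => !previousKeys.has(keyOf(row))),
    removed: previous.filter((row) => !currentKeys.has(keyOf(row))),
  };
}

// Two sample snapshots, taken one interval apart.
const before = [
  { title: "Old post fell off", score: 12 },
  { title: "Show HN: Tap", score: 342 },
];
const after = [
  { title: "Show HN: Tap", score: 350 },
  { title: "Rust 2.0 announced", score: 128 },
];

const changes = diffRows(before, after);
for (const row of changes.added) console.log(`+added   "${row.title}"  score=${row.score}`);
for (const row of changes.removed) console.log(`-removed "${row.title}"  score=${row.score}`);
```

Note that keying on title alone ignores score changes on a row that is still present. Whether that counts as a change depends on what you are watching for.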

The Self-Healing Loop

Put it all together:

# 1. AI writes a deterministic program once
$ tap forge "scrape Hacker News top stories"
✓ Saved: hackernews/hot.tap.js

# 2. Run forever at $0
$ tap hackernews hot
30 rows (245ms) Cost: $0.00

# 3. Watch for data changes
$ tap watch hackernews hot --every 1h

# 4. Daily health check
$ tap doctor --schedule "0 6 * * *"

# 5. Auto-heal when something breaks
$ tap doctor --auto

The loop: forge → run → watch → doctor → heal → run. You sleep. Your automations don't stop.

Why This Works

The key insight: AI should run at authoring time, not at runtime.

Forge uses AI once to write a deterministic program
Run executes it with zero AI — $0 per execution, 100% deterministic
Doctor detects breakage via health contracts — no AI needed
Heal re-invokes AI only when the site actually changes

99% of runs need zero AI. You only pay for intelligence when the world changes.
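Here is a toy version of that division of labor. Everything in it is a stand-in: `runWithHealing`, the fake programs, and the fake healer are hypothetical and only illustrate the control flow, where the expensive call sits behind a failed health check.

```javascript
// Run a deterministic program, check its health, and invoke the healer
// (the only step that would use AI) at most `maxHeals` times on failure.
function runWithHealing(program, isHealthy, heal, maxHeals = 1) {
  let aiCalls = 0;
  let rows = program();
  while (!isHealthy(rows) && aiCalls < maxHeals) {
    program = heal(program); // the only place AI would be invoked
    aiCalls += 1;
    rows = program();
  }
  return { rows, aiCalls, healthy: isHealthy(rows), program };
}

// Stand-ins for illustration only.
const brokenProgram = () => [];
const fixedProgram = () => [{ title: "a" }, { title: "b" }];
const isHealthy = (rows) => rows.length >= 2;
const fakeHeal = () => fixedProgram;

// A healthy program never touches the healer; a broken one triggers it.
console.log(runWithHealing(fixedProgram, isHealthy, fakeHeal).aiCalls);
console.log(runWithHealing(brokenProgram, isHealthy, fakeHeal).aiCalls);
```

Capping the number of heal attempts matters: if the healer cannot produce a passing program, you want an alert, not an unbounded bill.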

Tap is open source. 195+ pre-built automations included. Getting started takes 2 minutes.

Originally published at taprun.dev/blog
