Web scraping has a recurring enemy: the 403. Sites add bot detection, anti-scraping tools update their challenges, and scrapers that worked fine last week start silently failing. The usual fix is manual — check the logs, diagnose the cause, update the config, redeploy. I wanted to see if an agent could handle that loop instead.
So I built a self-healing scraper. After each crawl, a Claude-powered agent reads the failure logs, probes the broken domains with escalating fetch strategies, and rewrites the config automatically. By the next run, it's already fixed itself.
## How it works
The project has two parts: a scraper and a self-healing agent.
### The scraper
`main.py` is a straightforward Python scraper driven entirely by a `config.json` file. Each domain entry tells the scraper which URLs to fetch and how to fetch them:
```json
{
  "id": "books",
  "zyte": true,
  "browser_html": false,
  "urls": ["https://www.bookstoscrape.co.uk/products/..."]
}
```
There are three fetch modes:

- **Direct** — a plain `requests.get()`. Fast, free, works for sites that don't block bots.
- **Zyte API (`httpResponseBody`)** — routes the request through Zyte's residential proxy network. Good for sites that block datacenter IPs.
- **Zyte API (`browserHtml`)** — spins up a real browser via Zyte, executes JavaScript, and returns the fully-rendered DOM. Required for sites using JS-based bot challenges.
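The post doesn't show `main.py`'s fetch code, but assuming Zyte's public HTTP API (a POST to `https://api.zyte.com/v1/extract` with the API key as the basic-auth username), the three modes might dispatch roughly like this sketch:

```python
import base64

import requests

ZYTE_ENDPOINT = "https://api.zyte.com/v1/extract"


def build_zyte_payload(url: str, browser_html: bool) -> dict:
    # browserHtml asks Zyte to render the page in a real browser;
    # httpResponseBody returns the raw body (base64-encoded) via the proxy network.
    if browser_html:
        return {"url": url, "browserHtml": True}
    return {"url": url, "httpResponseBody": True}


def fetch(url: str, use_zyte: bool, browser_html: bool, api_key: str = "") -> str:
    """Fetch one URL using the mode chosen by its config entry."""
    if not use_zyte:
        return requests.get(url, timeout=30).text  # direct mode

    resp = requests.post(
        ZYTE_ENDPOINT,
        auth=(api_key, ""),  # Zyte API key goes in the basic-auth username slot
        json=build_zyte_payload(url, browser_html),
        timeout=60,
    )
    resp.raise_for_status()
    data = resp.json()
    if browser_html:
        return data["browserHtml"]  # already plain HTML
    return base64.b64decode(data["httpResponseBody"]).decode("utf-8", "replace")
```

The endpoint name and payload keys follow Zyte's documented API; everything else (function names, timeouts) is illustrative.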
Every request is logged to scraper.log in the same format:
```
2026-03-14 09:12:01 url=https://... domain_id=scan status=200
```
If a request throws any exception, it's recorded as a 403. That keeps the log clean and gives the agent a consistent signal to act on.
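A wrapper in that spirit (the real helper in `main.py` isn't shown, so names here are hypothetical) could look like:

```python
import logging

logger = logging.getLogger("scraper")


def log_fetch(fetch_fn, url: str, domain_id: str):
    """Call fetch_fn(url) -> (status, body); record any exception as a 403."""
    try:
        status, body = fetch_fn(url)
    except Exception:
        # Network errors, timeouts, and real 403s all collapse into one
        # consistent signal for the agent to act on.
        status, body = 403, None
    logger.info("url=%s domain_id=%s status=%s", url, domain_id, status)
    return status, body
```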
### The self-healing agent
agent.py is a Claude-powered agent that runs after each crawl. It uses the Claude Agent SDK and has access to three tools: Read, Bash, and Edit — enough to operate completely autonomously.
The agent works through a staged process:

- **Read the log** — finds every domain that returned a 403
- **Cross-reference the config** — skips domains already configured to use Zyte
- **Stage 1 probe** — uses the `zyte-api` CLI to fetch one URL per failing domain with `httpResponseBody`, then inspects the page `<title>`
- **Challenge detection** — if the title contains phrases like "Just a moment", "Checking your browser", or "Verifying you are human", the page is flagged as a bot challenge
- **Stage 2 probe** — challenge pages are re-probed using `browserHtml`, which runs a real browser to bypass JS-based detection
- **Config update** — the agent edits `config.json` directly, setting `zyte: true` and/or `browser_html: true` for domains that now work
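The first step reduces to parsing the uniform log format. The agent does this with its Read and Bash tools, but the underlying logic is simple enough to sketch:

```python
import re

# Matches the fixed log format: "... url=<url> domain_id=<id> status=<code>"
LINE_RE = re.compile(r"url=(\S+)\s+domain_id=(\S+)\s+status=(\d+)")


def failing_domains(log_text: str) -> dict:
    """Map each domain_id with at least one 403 to its failing URLs."""
    failures: dict = {}
    for match in LINE_RE.finditer(log_text):
        url, domain_id, status = match.groups()
        if status == "403":
            failures.setdefault(domain_id, []).append(url)
    return failures
```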
The next crawl automatically uses the right fetch strategy. No manual intervention needed.
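In the agent itself this reasoning happens in natural language, but a rule-based equivalent of the challenge-detection and config-update decision might look like this sketch (phrase list taken from the stages above; function names are mine):

```python
CHALLENGE_PHRASES = (
    "just a moment",
    "checking your browser",
    "verifying you are human",
)


def is_challenge(title: str) -> bool:
    """Heuristic applied to the <title> of a Stage 1 probe result."""
    lowered = title.lower()
    return any(phrase in lowered for phrase in CHALLENGE_PHRASES)


def probe_verdict(title: str) -> dict:
    """Config patch implied by a Stage 1 (httpResponseBody) probe."""
    if is_challenge(title):
        # Bot challenge: Stage 2 re-probes with a real browser, and on
        # success the agent writes browser_html: true into the config.
        return {"zyte": True, "browser_html": True}
    # The proxy network alone was enough.
    return {"zyte": True, "browser_html": False}
```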
## Design decisions
### Config-driven, not code-driven
Everything lives in config.json. Adding a new domain is a one-liner, and the scraper doesn't need to know anything about individual sites — it just reads the config and follows instructions. The agent writes to the same file, so the loop closes itself naturally.
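Concretely, the scraper's main loop needs nothing beyond the config. Assuming `config.json` is a top-level list of entries shaped like the example earlier (the post doesn't show the file's outer structure), it reduces to:

```python
import json


def iter_jobs(config_path: str = "config.json"):
    """Yield (domain_id, url, zyte, browser_html) for every configured URL."""
    with open(config_path) as f:
        entries = json.load(f)  # assumed: a list of domain entries
    for entry in entries:
        for url in entry["urls"]:
            yield (
                entry["id"],
                url,
                entry.get("zyte", False),
                entry.get("browser_html", False),
            )
```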
### Graduated fetch strategy
Not every site needs an expensive browser render. By escalating from direct to httpResponseBody to [browserHtml](https://www.zyte.com/zyte-api/headless-browser/) only when necessary, I keep costs manageable. Browser renders are slower and consume more API credits — reserving them for sites that actually need them makes a meaningful difference at scale.
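The ladder can be written down as a single escalation step over a config entry (a sketch; the agent makes this decision in prose rather than code):

```python
def escalate(entry: dict) -> dict:
    """Return a copy of a config entry bumped one tier up the ladder:
    direct -> Zyte httpResponseBody -> Zyte browserHtml."""
    if not entry.get("zyte"):
        return {**entry, "zyte": True, "browser_html": False}
    if not entry.get("browser_html"):
        return {**entry, "browser_html": True}
    return dict(entry)  # already at the most expensive tier
```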
### Letting the agent handle the heuristics
The challenge detection logic — matching titles against known bot-detection phrases — is exactly the kind of fuzzy heuristic that's tedious to maintain as code but natural for a language model to reason about. Claude also handles edge cases gracefully: if the zyte-api CLI isn't installed, if the log is empty, if a domain is already correctly configured. A rule-based script would need explicit handling for every one of those scenarios.
## The limitations
It's worth being honest about where this approach falls short.
It's reactive, not proactive. The agent only runs after a failed crawl. If a site starts blocking mid-run, those URLs fail silently until the next cycle.
Title-based detection is fragile. Most bot-challenge pages say "Just a moment…" — but a legitimate site could theoretically use that phrase. A false positive would cause the scraper to wastefully use browser rendering where it isn't needed.
One URL per domain. The agent probes only the first failing URL for each domain. Different URL patterns on the same domain can have different bot-detection behaviour, which this doesn't account for.
No rollback. Once the config is updated, there's no way to detect if a Zyte setting later stops working and revert it automatically.
Cost opacity. The scraper logs HTTP status codes, not Zyte API credit consumption. There's no visibility into what each domain actually costs to fetch.
## Where I'd take it next
Smarter challenge detection. Rather than keyword-matching on the title, the agent could read the full page HTML and make a more nuanced call — is this a product page, a login wall, or a soft block with a CAPTCHA? Each requires a different response.
Proactive monitoring. A lightweight probe running daily against each configured domain, independent of the main crawl, would let the agent update the config before a full scrape run hits a known-bad configuration.
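The post doesn't specify a scheduler, but one minimal way to wire this up would be a crontab along these lines (`probe.py` is a hypothetical probe-only entry point that doesn't exist yet):

```
# Nightly crawl at 02:00, healing agent immediately after
0 2 * * * cd /opt/scraper && python main.py && python agent.py
# Daily lightweight probe, decoupled from the main crawl
0 8 * * * cd /opt/scraper && python probe.py
```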
Per-URL config. Right now zyte and browser_html are set at the domain level. Some sites serve static product pages on one path and JS-rendered category pages on another — granular per-URL settings would handle that cleanly.
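A per-URL schema might look like this (hypothetical; in the current format `zyte` and `browser_html` apply to the whole domain):

```json
{
  "id": "books",
  "zyte": true,
  "urls": [
    { "url": "https://example.com/products/1", "browser_html": false },
    { "url": "https://example.com/category/fiction", "browser_html": true }
  ]
}
```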
Structured data extraction. Right now parse_page only pulls the page title. The natural next step is structured product extraction — price, availability, name, images — either via CSS selectors in the config or Zyte's product extraction type, which uses ML models to parse product data from any page.
Multi-agent parallelism. The self-healing loop is currently a single agent. As the config grows, a coordinator could spawn one subagent per failing domain, each running its own probe pipeline concurrently. The Claude Agent SDK supports subagents natively, so this would be a relatively small change.
The core idea is simple: a scraper that observes its own failures and reconfigures itself. What I found interesting about building it wasn't the scraping itself — it was seeing how little scaffolding the agent actually needed. Three tools, a clear task, and it handles the diagnostic work that would otherwise fall to me.