DEV Community

John Rooney for Extract by Zyte


I gave Claude access to a web scraping API

If you've worked with Claude for any length of time, you've probably noticed it can do a lot more than answer questions. With the right setup, it can take actions — running scripts, processing files, working through multi-step workflows autonomously. Skills are what make that possible.

What is a skill?

A skill is a small, self-contained instruction set that tells Claude how to use a specific tool or script to accomplish a well-defined task. Technically, it's a markdown file — a SKILL.md — that describes what a tool does, when to reach for it, and exactly how to run it. Claude reads that file and follows the instructions as part of a larger workflow.

Skills are designed to be composable. Each one does one thing well, and they're built to hand off to each other. The fetcher skill retrieves HTML. The parser skill extracts data from it. The compare skill turns multiple parsed outputs into a structured comparison. Together, they form a complete scraping pipeline — and Claude orchestrates the whole thing.

See our skills here: https://github.com/zytelabs/claude-webscraping-skills

The skill format looks like this:

```yaml
---
name: fetcher
description: "Fetches raw HTML from a URL using httpx, with automatic fallback to Zyte API if blocked."
---
```

That front matter is what Claude uses to match the right skill to the right task. The description is deliberately precise: it tells Claude not just what the skill does, but how it does it, so Claude can reason about whether it's the right tool for the job.

What the fetcher skill does

The fetcher skill's job is exactly what it sounds like: given a URL, fetch the raw HTML and return it. It uses httpx as its primary HTTP client — a modern, performant Python library well suited to scraping workloads.

What makes it more than a simple wrapper is the fallback logic. A significant number of sites actively block automated requests. Without a fallback, a blocked request just fails, and you're left manually diagnosing why. The fetcher skill handles this automatically. If a request comes back with a BLOCKED status, it retries via Zyte API, which provides built-in unblocking. Most of the time, you get your HTML without ever needing to intervene.

When to use it

The skill's SKILL.md is explicit about this:

```markdown
## When to use
Use this skill when the user provides one or more URLs and asks you to fetch,
retrieve, scrape, or get the HTML or page content.
```

In practice, that means any time you're starting a scraping or data extraction task and you have a URL to work from. It's the entry point for the pipeline.

How it works

The instructions in the skill file are straightforward:

```markdown
## Instructions
1. Run `fetcher.py` with the URL as an argument:
   python fetcher.py <url>

2. If the script returns a successful HTML response, return the HTML to the
   conversation for use in the next step.
3. If the script returns a `BLOCKED` status, re-run with the `--zyte` flag:
   python fetcher.py <url> --zyte

4. Inform the user if a URL could not be fetched after both attempts.
```

The two-step process keeps things efficient. httpx is fast and lightweight, so it handles the majority of requests without needing to route through Zyte API. The fallback only kicks in when it's needed. If both attempts fail, Claude surfaces that to you clearly rather than silently moving on.

For multiple URLs, the script runs once per URL — there's no batching — so Claude loops through a list sequentially.
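That loop can be pictured as a small driver script. Everything here is hypothetical glue (the skill simply has Claude run the command once per URL), but it shows the per-URL retry shape:

```python
import subprocess
import sys

def looks_blocked(output: str) -> bool:
    """The fetcher signals a block by printing a BLOCKED status (per the skill)."""
    return "BLOCKED" in output

def fetch_all(urls):
    """Run fetcher.py once per URL, falling back to --zyte on a block."""
    results = {}
    for url in urls:
        proc = subprocess.run(
            [sys.executable, "fetcher.py", url],
            capture_output=True, text=True,
        )
        if looks_blocked(proc.stdout):
            proc = subprocess.run(
                [sys.executable, "fetcher.py", url, "--zyte"],
                capture_output=True, text=True,
            )
        results[url] = proc.stdout  # raw HTML, or an error report
    return results
```

In the skill setup Claude performs this sequencing itself; the point is that each URL gets at most two attempts before failure is reported.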

Transparency about failure

One detail worth highlighting is the final instruction: inform the user if a URL could not be fetched after both attempts. This might seem obvious, but it reflects a design principle worth being explicit about. A skill that silently drops failed URLs would produce incomplete data downstream, and you might not notice until you're looking at a comparison table with missing rows. Surfacing failures immediately keeps the pipeline honest.

What comes next

The fetcher skill's output is raw HTML — exactly what the parser skill expects as its input. The two are designed to be used in sequence. Once you have the HTML, the parser skill takes over, extracting structured data through JSON-LD or CSS selectors depending on what the page contains.

That handoff is documented in the skill's notes:

```markdown
## Notes
- For multiple URLs, run the script once per URL
- Pass the raw HTML output into the Parser skill for extraction
```

The pipeline continues from there.
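To make that handoff concrete, here is a minimal sketch of the JSON-LD half of the extraction, using only the Python standard library. It illustrates the technique, not the parser skill's actual code:

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        # Script contents arrive via handle_data; parse non-empty JSON payloads.
        if self._in_jsonld and data.strip():
            self.items.append(json.loads(data))

def extract_jsonld(html: str) -> list:
    """Return every JSON-LD object embedded in the given HTML."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return parser.items
```

Pages without JSON-LD would fall through to the CSS-selector path mentioned above, which needs a proper selector library rather than this stdlib sketch.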

Do you need a skill?

Skills are a good fit when you have a well-defined, repeatable task that benefits from consistent behaviour across many runs. Fetching HTML from a URL is a clear example: the inputs and outputs are predictable, the fallback logic is always the same, and packaging that into a skill means Claude applies it reliably without you having to re-explain the process each time.

Read our breakdown of Skills vs MCP vs Web Scraping Copilot (our VS Code extension) here.

That said, skills aren't always the right tool. If you only need to fetch a handful of pages once, asking Claude to write a quick httpx script directly may be faster and more flexible. Similarly, if your target sites have unusual behaviour — rate limiting, JavaScript rendering, login walls, or multi-step navigation — a bespoke Scrapy spider built with Zyte API gives you far more control than a general-purpose fetch wrapper. Scrapy's middleware architecture, item pipelines, and scheduling make it better suited to large-scale or complex crawls where you need precise control over every aspect of the request cycle.

The fetcher skill sits in the middle: more structured than an ad hoc script, less complex than a full Scrapy project. It's the right choice when you want Claude to handle straightforward retrieval as part of a larger automated workflow, without the overhead of setting up and maintaining a dedicated spider.
