John Rooney for Extract by Zyte

Build Scrapy spiders in 23.54 seconds with this free Claude skill

I built a Claude skill that generates Scrapy spiders in under 30 seconds — ready to run, ready to extract good data. In this post I'll walk through what I built, the design decisions behind it, and where I think it can go next.

What it does

The skill takes a single input: a category or product listing URL. From there, Claude generates a complete, runnable Scrapy spider as a single Python script. No project setup, no configuration files, no boilerplate to write. Just a script you can run immediately.
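To make that concrete, here is a minimal sketch of the kind of single-file spider the skill produces. This is illustrative, not the skill's actual output: the class name, URLs, and settings are placeholders, and it assumes `scrapy` and `scrapy-zyte-api` are installed with `ZYTE_API_KEY` set in the environment.

```python
# Illustrative sketch only -- not the skill's literal output.
# Assumes: pip install scrapy scrapy-zyte-api, and ZYTE_API_KEY in the environment.
import scrapy
from scrapy.crawler import AsyncCrawlerProcess


class ExampleComSpider(scrapy.Spider):
    name = "example_com"
    custom_settings = {
        # The scrapy-zyte-api add-on wires Zyte API in as the transport.
        "ADDONS": {"scrapy_zyte_api.Addon": 500},
    }

    def start_requests(self):
        # Step 1: ask for productNavigation on the listing page.
        yield scrapy.Request(
            "https://example.com/category/widgets",
            meta={"zyte_api_automap": {"productNavigation": True}},
            callback=self.parse_navigation,
        )

    def parse_navigation(self, response):
        nav = response.raw_api_response["productNavigation"]
        # Step 2: request product extraction for each product URL found.
        for item in nav.get("items", []):
            yield scrapy.Request(
                item["url"],
                meta={"zyte_api_automap": {"product": True}},
                callback=self.parse_product,
            )
        # Follow pagination until the API stops returning a next page.
        next_page = nav.get("nextPage")
        if next_page:
            yield response.follow(
                next_page["url"],
                meta={"zyte_api_automap": {"productNavigation": True}},
                callback=self.parse_navigation,
            )

    def parse_product(self, response):
        # The structured product record comes straight from the API.
        yield response.raw_api_response["product"]


if __name__ == "__main__":
    process = AsyncCrawlerProcess(
        settings={"FEEDS": {"example_com.jsonl": {"format": "jsonlines"}}}
    )
    process.crawl(ExampleComSpider)
    process.start()
```

Running `python example_com.py` is the whole workflow: no `scrapy startproject`, no settings module, no items file.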

Here's what that looks like in practice. I opened Claude Code in an empty folder with dependencies installed, activated the skill, and said: "Create a spider for this site" — and pasted a URL.

Within seconds, the script was generated. I ran it, watched the products roll in, piped the output through jq, and had clean structured product data. Start to finish: under a minute.
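The jq step is just a sanity check that each line of the output is one well-formed product record. A quick equivalent in Python, with made-up records standing in for real output:

```python
import json

# Each line of the spider's .jsonl output is one product record.
# These two records are illustrative, not real scraped data.
lines = [
    '{"name": "Widget A", "price": "19.99", "currency": "USD"}',
    '{"name": "Widget B", "price": "24.50", "currency": "USD"}',
]

products = [json.loads(line) for line in lines]
names = [p["name"] for p in products]
print(names)  # -> ['Widget A', 'Widget B']
```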

Why a single-file script, not a full Scrapy project?

Scrapy is usually a full project — multiple files, lots of moving parts, a proper setup process. Running it from a script instead is generally discouraged for production work, but for this use case it's actually the right call.

The goal here is what I'd call pump-and-dump scraping: give Claude a URL, get a spider, run it for a couple of days, move on. It's not designed to scrape millions of products every day for years. For that kind of scale you need proper infrastructure, robust monitoring, and serious logging. This isn't that — and that's intentional.

What you do get, even in the single-file approach, is almost everything Scrapy offers: middleware, automatic retries, and concurrency handling. You'd have to build all of that yourself with a plain requests script. Scrapy gives it to you for free, even when running from a script.
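Those behaviours are all plain Scrapy settings, so a single-file spider can opt into them via `custom_settings`. The specific values below are illustrative defaults, not the skill's actual configuration:

```python
# Illustrative Scrapy settings a single-file spider can pass via custom_settings.
# With a plain requests script you would have to hand-roll each of these.
SETTINGS = {
    "RETRY_ENABLED": True,
    "RETRY_TIMES": 2,              # automatic retries on failed responses
    "CONCURRENT_REQUESTS": 16,     # parallel in-flight requests
    "DOWNLOAD_DELAY": 0.25,       # polite spacing between requests
    "AUTOTHROTTLE_ENABLED": True,  # adapt request rate to observed latency
}
```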

The key design decision: AI extraction

The other major call I made was to lean entirely on Zyte API's AI extraction rather than generating CSS or XPath selectors.

Specifically, the skill chains two extraction types together: productNavigation on the category or listing page, which returns the product URLs and the next-page link, and product on each product URL, which returns structured product data including name, price, availability, brand, SKU (stock keeping unit), description, images, and more.
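The shape of the two chained calls looks like this (field names follow the Zyte API's extraction types; the URLs are illustrative):

```python
# First call: extract navigation data from the listing page.
navigation_request = {
    "url": "https://example.com/category/widgets",
    "productNavigation": True,
}
# The response's productNavigation object carries the product links
# and, when present, the next-page link.

# Second call: extract a structured record from each product page.
product_request = {
    "url": "https://example.com/p/widget-a",
    "product": True,
}
# The response's product object carries name, price, availability,
# brand, SKU, description, images, and so on.
print(sorted(navigation_request))  # -> ['productNavigation', 'url']
```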

This means the spider doesn't need to know anything about the structure of the site it's crawling. There are no selectors to generate, no schema to define, no user confirmation step. The AI on Zyte's end handles all of that. It does cost slightly more than a raw HTTP request, but given how little time it takes to go from URL to working spider, the trade-off makes sense.

I've hardcoded httpResponseBody as the extraction source — it's faster and more cost-efficient than browser rendering. If a site is JavaScript-heavy and you're not getting the data you need, you can switch to browserHtml with a one-line change. The spider logs a warning to remind you of this.
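The one-line change is the `extractFrom` option on the extraction request. A sketch of the toggle, using the Zyte API's option names:

```python
# Extraction source for the product call:
# "httpResponseBody" is faster and cheaper; "browserHtml" renders JavaScript first.
product_options = {"extractFrom": "httpResponseBody"}

# For a JavaScript-heavy site, swap in:
# product_options = {"extractFrom": "browserHtml"}

zyte_api_params = {
    "product": True,
    "productOptions": product_options,
}
print(zyte_api_params["productOptions"]["extractFrom"])  # -> httpResponseBody
```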

The use case is deliberately narrow

This skill is designed for e-commerce sites, and only e-commerce sites. That's not a limitation I stumbled into — it's a feature.

Because the scope is narrow, the spider structure is simple and predictable: category pages with pagination, product links, and detail pages. Zyte API's productNavigation and product extraction types handle this reliably. Widening the scope to arbitrary crawling would require a lot more of Scrapy's machinery and would quickly exceed what makes sense for a lightweight script like this.

What it doesn't do: deep subcategory crawling, link discovery, or full-site crawls. If a page renders all its products without pagination, that still works fine: the next-page lookup simply returns nothing and the crawl ends.
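The termination condition falls out of the productNavigation result naturally. A small sketch of that logic (the helper name is mine, not the skill's):

```python
def next_page_url(navigation: dict):
    """Return the next-page URL from a productNavigation result, or None.

    When a site renders everything on one page, "nextPage" is simply
    absent from the result and the crawl stops on its own.
    """
    next_page = navigation.get("nextPage")
    return next_page.get("url") if next_page else None


print(next_page_url({"nextPage": {"url": "https://example.com/widgets?page=2"}}))
print(next_page_url({"items": []}))  # -> None
```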

Logging and output

I replaced Scrapy's default logging with Rich logging, which gives cleaner terminal output. Scrapy's logs are verbose in ways that aren't useful when you're running a short-lived script — I wanted something concise enough that if something went wrong, it would be obvious at a glance.

Output goes to a .jsonl file named after the spider, alongside a plain .log file. Both are derived from the spider name, which is itself derived from the domain. Run example_com.py, get example_com.jsonl and example_com.log.
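A plausible sketch of that derivation (the skill's actual implementation may differ): strip the `www.` prefix and turn dots into underscores so the name is a valid Python module name.

```python
from urllib.parse import urlparse


def spider_name(url: str) -> str:
    """Derive a spider name from the domain: dots and dashes become underscores."""
    host = urlparse(url).netloc.removeprefix("www.")
    return host.replace(".", "_").replace("-", "_")


name = spider_name("https://www.example.com/category/widgets")
print(name)             # -> example_com
print(f"{name}.jsonl")  # -> example_com.jsonl
```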

Where this goes next

The immediate next step I have in mind is selector-based extraction as an alternative path — useful for sites where the AI extraction isn't quite right, or where you want more control over exactly what gets pulled.

The longer-term vision is running this fully agentically. URLs get submitted somewhere — a queue, a database table, a form — an agent picks them up, builds the spider, and maybe runs a quick validation. The spider then goes into a pool to be run on a schedule, and data lands in a database rather than a flat file. Give Claude access to a virtual private server (VPS) via terminal and most of this is achievable without much extra infrastructure. The skill is already the hard part.

Download the skill

The skill is free to download and use. It's a single .skill file you can install directly into Claude Code. You'll need:

pip install scrapy scrapy-zyte-api rich
export ZYTE_API_KEY=your_key_here

Scrapy 2.13 or above is required for AsyncCrawlerProcess.

The link to the repo and the skill download are in the video description, and here. If you've built something similar, or have thoughts on the design decisions — especially around the extraction approach or the logging setup — I'd love to hear it in the comments. GitHub links to your own scrapers are very welcome too.

If you're interested in more agentic scraping patterns, I also built a Claude skill that helps spiders recover from excessive bans — you can watch that video here.
