AI-enabled code editors can now conjure scraping code on command. But anyone who has used a generic coding agent to build a spider knows what comes next: a plausible-looking file that falls apart the moment it hits a real website. The selectors are fragile, the error handling is missing, and the structure ignores everything Scrapy actually expects from production code.
The problem is not the AI. It's the prompts, the context, and knowing where to let the agent drive and where to stay in control. This article walks through using opencode to build Scrapy spiders that are actually deployable, covering setup, the prompts that work, and the pitfalls that will burn you if you are not careful.
## Why opencode works well for scraping projects
Most AI coding agents are designed around general-purpose software projects. opencode is different: it is terminal-native, model-agnostic, and designed to operate inside your actual working directory. It reads your project, understands your file structure, and writes code into the files that already exist rather than pasting snippets into a chat window.
For Scrapy projects specifically, this matters. A spider is not a standalone script. It depends on items, settings, middlewares, pipelines, and page objects. An agent that can see all of those files at once produces far better output than one operating on a blank context.
opencode also supports custom commands stored as Markdown files. That means you can encode your own Scrapy conventions as reusable prompts and call them every time you start a new spider, without retyping the same context.
## Getting set up

Install opencode with the one-liner:

```bash
curl -fsSL https://opencode.ai/install | bash
```

On macOS and Linux, the Homebrew tap gives you the fastest updates:

```bash
brew install anomalyco/tap/opencode
```
On Windows, use WSL for the best experience. The `choco install opencode` route works, but the terminal experience is noticeably smoother inside a Linux environment.
Once installed, connect your model provider. The /connect command in the terminal user interface walks you through it. If you want to avoid managing API keys from multiple providers, opencode Zen gives you a curated set of pre-tested models through a single subscription at opencode.ai/auth.
For scraping work, choose a model with a large context window. Spider files, page objects, items, and a sample HTML fixture can easily fill 20,000 tokens before you have written a single prompt. Models with at least 64k context are the practical minimum.
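To get a rough sense of how much context your project files will consume before you even type a prompt, the common ~4-characters-per-token heuristic is close enough for budgeting. A minimal sketch (the file list in the comment is illustrative):

```python
from pathlib import Path

def estimate_tokens(paths):
    """Rough token estimate using the ~4 chars/token rule of thumb."""
    total_chars = sum(len(Path(p).read_text(errors="ignore")) for p in paths)
    return total_chars // 4

# Example: budget a spider, its page objects, and one HTML fixture
# files = ["spiders/books.py", "pages.py", "tests/fixtures/book_detail.html"]
# print(estimate_tokens(files))
```

A single saved product page can easily be 100 KB of HTML, which is roughly 25,000 tokens on its own — which is why trimming fixtures before pasting them pays off.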
## Initialize your Scrapy project first

Before you open opencode, scaffold your Scrapy project as you normally would:

```bash
scrapy startproject myproject
cd myproject
```

Then initialize opencode inside the project root:

```bash
opencode init
```
This creates an AGENTS.md file. Commit it. opencode reads this file on every session to understand how your project is structured. Fill it with the conventions your project follows: which item classes exist, which middlewares are active, whether you are using scrapy-poet page objects, and which version of Zyte API or other HTTP backends you are using. The more context AGENTS.md carries, the less you repeat yourself in prompts.
A minimal AGENTS.md for a Scrapy project looks like this:

```markdown
# Project conventions

- Python 3.12, Scrapy 2.12
- All spiders use scrapy-poet page objects (never parse in the spider class itself)
- Item classes are defined in items.py using dataclasses
- Zyte API is configured via scrapy-zyte-api; ZYTE_API_KEY is in .env
- Settings live in settings.py; never hardcode values in spider files
- All spiders output to JSON Lines via the FEEDS setting
- Test fixtures live in tests/fixtures/ as .html files
```
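If your AGENTS.md promises JSON Lines output via FEEDS, make sure settings.py actually backs that up. A minimal version of that setting — the output path is just an example:

```python
# settings.py — JSON Lines export per spider, as referenced in AGENTS.md
FEEDS = {
    "output/%(name)s.jsonl": {
        "format": "jsonlines",
        "encoding": "utf8",
        "overwrite": True,
    },
}
```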
## The prompts that actually work
Generic prompts produce generic code. The prompts below are tested patterns that produce Scrapy-idiomatic output.
### Starting a new spider
The most common mistake is asking opencode to "write a spider for X." That produces a working script, not a Scrapy spider. Be specific about structure:
```
Create a Scrapy spider for https://books.toscrape.com that:

- Uses a scrapy-poet page object called BookListPage for list pages and BookDetailPage for detail pages
- Extracts: title, price, availability, star rating, and product URL
- Handles pagination by following the "next" link
- Stores results in a BookItem dataclass in items.py
- Does not put any CSS selector logic inside the spider class itself

Start with the page objects in pages.py, then write the spider in spiders/books.py.
```
The explicit constraint against putting selectors in the spider class is important. Without it, the agent will inline everything, which defeats scrapy-poet's purpose and makes the code harder to test.
### Asking for resilient selectors
Generated selectors are often too specific. They target a class that is only present on one layout variant, or chain through five levels of nesting that will break on the next site deploy.
Prompt the agent to justify its selector choices:
```
Write the CSS selectors for BookDetailPage. For each field, explain why you chose
that selector over alternatives. Prefer attribute-based selectors (like [itemprop] or
[data-*]) over class names where both options exist.
```
This produces more defensive selectors and, more importantly, gives you enough reasoning to judge whether to accept them.
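To see why attribute-based selectors are the safer default, compare the two styles on a typical product snippet. This sketch uses the standard library's ElementTree instead of Scrapy's parsel, purely to keep it self-contained; the HTML is illustrative:

```python
import xml.etree.ElementTree as ET

html = """<div class="col-sm-6 product_main">
  <h1 itemprop="name">A Light in the Attic</h1>
  <p class="price_color" itemprop="price">£51.77</p>
</div>"""

root = ET.fromstring(html)

# Attribute-based: [itemprop] encodes meaning, so it survives a CSS redesign
price = root.find(".//*[@itemprop='price']").text

# Class-based: price_color is a styling hook and can vanish on the next deploy
price_fragile = root.find(".//p[@class='price_color']").text

print(price)          # £51.77 — both work today...
print(price_fragile)  # £51.77 — ...but only the first is likely to survive a redesign
```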
### Adding error handling
The agent will skip error handling unless you ask for it explicitly:
```
Add error handling to BookDetailPage:

- If price is missing, log a warning and return None (do not raise an exception)
- If star rating cannot be parsed, default to 0
- Add a try/except around the availability field and log the raw text if parsing fails
```
Never assume the agent will add graceful degradation on its own. It optimizes for the happy path.
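The parsing logic the prompt asks for looks roughly like this in plain Python — function names and the rating format are illustrative; in a real page object these helpers would sit behind the field properties:

```python
import logging
import re

logger = logging.getLogger(__name__)

# Class-name style ratings as used on books.toscrape.com ("star-rating Three")
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_price(raw):
    """Return the price as a float, or None (with a warning) if it is missing."""
    if raw is None:
        logger.warning("price missing from page")
        return None
    match = re.search(r"[\d.]+", raw)
    if match is None:
        logger.warning("could not parse price from %r", raw)
        return None
    return float(match.group())

def parse_star_rating(raw):
    """Map a rating word ('Three') to an int, defaulting to 0 when unparseable."""
    return RATING_WORDS.get(raw, 0)
```

Note the pattern: log and degrade, never raise. One unparseable field should cost you one field, not the whole item.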
### Writing tests
opencode is genuinely useful for generating pytest fixtures and test scaffolding. Give it a concrete fixture to work from:
```
Write a pytest test for BookDetailPage using the HTML fixture at
tests/fixtures/book_detail.html. Test that:

- title is extracted as a non-empty string
- price is a float greater than zero
- availability is one of: "In stock", "Out of stock"
- star_rating is an integer between 0 and 5

Use pytest parametrize if testing multiple fixture variants.
```
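It helps to know the shape of the test file you should get back, so you can judge the agent's output. A stripped-down sketch, using a dict in place of the extracted item so it runs standalone — in practice you would swap in the page object's output built from the fixture:

```python
# tests/test_book_detail.py — sketch; `item` stands in for the page object output
item = {
    "title": "A Light in the Attic",
    "price": 51.77,
    "availability": "In stock",
    "star_rating": 3,
}

def test_title():
    assert isinstance(item["title"], str) and item["title"]

def test_price():
    assert isinstance(item["price"], float) and item["price"] > 0

def test_availability():
    assert item["availability"] in ("In stock", "Out of stock")

def test_star_rating():
    assert item["star_rating"] in range(6)

if __name__ == "__main__":  # also collectable by pytest as-is
    for fn in (test_title, test_price, test_availability, test_star_rating):
        fn()
    print("all checks passed")
```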
## Pitfalls to watch for

### The agent assumes the HTML is static
By default, any spider the agent generates will use response.css() or response.xpath() on raw HTML. If your target site renders content with JavaScript, those selectors return nothing. Before you run any generated spider, check whether the target page is JavaScript-rendered by viewing source in your browser. If the data you need is absent from the raw HTML, prompt the agent to use Zyte API's headless browser or a Playwright download handler instead of a plain HTTP request.
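When the data is JavaScript-rendered, the scrapy-zyte-api route looks roughly like this. Treat it as a sketch of the classic configuration style and check the Zyte documentation for the current recommended setup, since the integration also has a newer add-on style:

```python
# settings.py — route requests through Zyte API (classic configuration style)
DOWNLOAD_HANDLERS = {
    "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider, request browser-rendered HTML for JS-heavy pages:
# yield scrapy.Request(url, meta={"zyte_api_automap": {"browserHtml": True}})
```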
### Selectors written against one page break on others
The agent writes selectors against whatever HTML you give it. If you paste a single product page, it will produce selectors that work on that product page. Run the spider against 10 or 20 URLs from the same site before treating the selectors as reliable.
Ask the agent to help you validate coverage:
```
Here are three different product page HTML snippets from the same site (pasted below).
Identify any selectors in BookDetailPage that would fail on snippet 2 or snippet 3,
and suggest more robust alternatives.
```
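Beyond asking the agent, you can measure coverage mechanically: crawl the sample, export JSON Lines, and count how often each field came back empty. A small stdlib sketch (the field names are illustrative):

```python
import json
from collections import Counter

def empty_field_rates(jsonl_lines):
    """Share of items in which each field is missing or empty."""
    items = [json.loads(line) for line in jsonl_lines if line.strip()]
    empties = Counter()
    for item in items:
        for field, value in item.items():
            if value in (None, "", []):
                empties[field] += 1
    return {field: count / len(items) for field, count in empties.items()}

sample = [
    '{"title": "Book A", "price": 51.77, "availability": "In stock"}',
    '{"title": "Book B", "price": null, "availability": "In stock"}',
]
print(empty_field_rates(sample))  # {'price': 0.5} — price failed on half the sample
```

A field that is empty on 40% of a sample is almost always a selector problem, not a data problem.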
### Context window exhaustion mid-session
Long sessions that involve large HTML files, multiple spider files, and back-and-forth debugging will eventually exhaust the model's context. When this happens, the agent starts contradicting earlier decisions or forgetting your project conventions.
The fix is to keep sessions short and focused. One session per spider, or one session per refactor task. Use your AGENTS.md to carry conventions across sessions rather than re-explaining them in chat.
### Generated settings override your existing configuration
When the agent writes setup instructions, it often suggests adding settings directly to settings.py. If you already have a settings file, this can clobber existing values or introduce conflicts. Review every settings change the agent proposes before accepting it.
### The agent does not know about anti-bot measures
opencode has no knowledge of whether a site actively blocks scrapers. It will happily generate a spider that will be blocked immediately in production. Anti-bot handling, rate limiting, and request fingerprinting are your responsibility to layer in. Zyte API handles the blocking and fingerprinting side; you still need to configure the integration yourself rather than expecting the agent to know it is necessary.
## Useful custom commands for scraping
opencode custom commands let you encode reusable prompts as Markdown files in ~/.config/opencode/commands/. Here are three worth setting up for any Scrapy workflow.
### `user:new-spider`

```markdown
# New scrapy-poet spider

Create a new Scrapy spider for the URL provided by the user.

- Use scrapy-poet page objects (list page + detail page if applicable)
- Put all selector logic in page objects, nothing in the spider class
- Use item dataclasses from items.py (create new ones if needed)
- Include pagination handling
- Add logging for missing fields (warning level)
- Write page objects first, then the spider

Ask the user for the target URL before starting.
```
### `user:harden-selectors`

```markdown
# Harden selectors

Review the page objects in the current file. For each CSS or XPath selector:

1. Identify whether it targets a class, ID, tag, or attribute
2. If it targets a class name, suggest an attribute-based or structural alternative
3. Flag any selectors that chain more than three levels deep as fragile

Output a revised version of the file with improved selectors and inline comments
explaining each change.
```
### `user:gen-tests`

```markdown
# Generate pytest tests

Given a page object file and an HTML fixture provided by the user:

1. Write a pytest test file that covers all extracted fields
2. Test that required fields are non-null and the correct type
3. Test that optional fields handle absence gracefully (None, not exception)
4. Use parametrize if multiple fixture variants are present

Ask for the fixture file path before starting.
```
## Where opencode fits in the workflow
Think of opencode as a fast first-draft tool, not an autonomous spider factory. The right workflow is:
- Scaffold the project and write AGENTS.md manually
- Use opencode to generate page objects and the spider skeleton
- Review every selector by hand before trusting it
- Run the spider against a sample of real URLs and inspect the output
- Use opencode to patch failures and write tests
- Handle anti-bot, rate limiting, and deployment yourself
The agent saves the most time on the repetitive structural work: boilerplate item classes, pagination logic, field extraction scaffolding, and test stubs. The judgment calls around which selectors are robust, whether a site is JavaScript-rendered, and how to handle blocking remain entirely in your hands.
That division of labor is what makes this approach work at production scale rather than just for prototypes.
## Try it yourself

Install opencode, initialize it in an existing Scrapy project, and start with the `user:new-spider` custom command above. Pick a publicly accessible, static site like books.toscrape.com to test the workflow before applying it to a site with more complexity.

For JavaScript-rendered sites and anything with active anti-bot measures, pair opencode's code generation with Zyte API to handle the access layer. You can sign up for a free trial and have a working integration running in minutes. The Zyte documentation covers the scrapy-zyte-api configuration in detail.
