DEV Community: Extract by Zyte

The compiler caught a lot. It didn't catch enough.

John Rooney — Sun, 31 May 2026 18:20:02 +0000

I built a small web scraping framework in Rust, mostly with an AI doing the typing. It's called ferrous — a Colly-style collector: register CSS selector callbacks, queue URLs, write JSONL. About 700 lines. The pitch I kept hearing, and half-believed, was that Rust and LLMs are a good match now: the borrow checker is a correctness oracle the model can lean on, so the class of bugs that plagues AI-written Python just won't compile.

That's true. It's also where the story gets uncomfortable, because the build was green and the code was still wrong.

How I worked

I'm not a Rust native. ferrous was partly an excuse to get fluent — build something real instead of reading about lifetimes — and partly a test of how far an LLM could carry the typing while I drove. The loop was plain: describe the next change in English, let the model write the Rust, read what came back, run cargo, move on. It kept observations.md as a running design journal, one entry per change, each with a short rationale for the decision it made.

That setup has a soft spot, and it's the whole point of this post. When you drive a language you don't fully know, the only reviewer you've got with real authority is the compiler. Everything past that — is this idiomatic, is it the right abstraction, does it actually do what the journal claims — depends on already knowing what correct looks like. Which is exactly the knowledge a learner doesn't have yet. Keep that in mind through the next part, because it's the difference between the one bug I could have caught and the six I couldn't.

The bug that the toolchain told me wasn't there

I ship two fetch backends. The default goes through the Zyte API; an optional one, gated behind a wreq feature, makes direct requests with browser TLS emulation. Each has an example. After a refactor that added URL resolution (ctx.resolve_and_visit(href), so callbacks stop hand-building absolute URLs), I had the model update the examples to use it. It did, and it wrote up the change in observations.md:

Added ctx.resolve(href) and ctx.resolve_and_visit(href) ... The examples were updated to use ctx.resolve_and_visit(&href).

cargo test passed. cargo check passed. I believed the journal. Here is the actual line in the feature-gated example:

.on_html("li.next a", |el, ctx| {
    if let Some(href) = el.attr("href") {
        ctx.resolve_and_visit(href);   // href is String; the method wants &str
    }
})

resolve_and_visit takes &str. href is a String. That does not compile — String doesn't coerce to &str in argument position, only &String does. The Zyte example got the &; the wreq one didn't. The model fixed one of the two call sites it claimed to have fixed, and reported both as done.

So why was the build green? Because the broken example is behind a feature flag, and cargo check and cargo test don't build it by default. You have to ask:

$ cargo check --all-targets --all-features
error[E0308]: mismatched types
  --> examples/books_direct.rs:20:39
   |
20 |                 ctx.resolve_and_visit(href);
   |                     ----------------- ^^^^ expected `&str`, found `String`

This is the part worth sitting with. The compiler would have caught it — it's a textbook E0308, the friendliest error rustc produces, complete with help: consider borrowing here. The type system did its job perfectly. It just never ran on that file, because the default build target set didn't include it, and nobody — not me, not the model — pointed it at the path where the error lived. The oracle was switched off for exactly the line that needed it, and the green checkmark covered the gap.

What the compiler buys you, honestly

I don't want to undersell the good part, because it's real and it's specific. The most interesting moment in the whole project was the model refactoring the Element type. Originally it stored matched HTML as a string and re-parsed it on every field access — three accessors, three parses of the same fragment. The fix was to parse once and store the scraper::Html. But Html is !Send, and the crawl loop runs callbacks inside spawned tasks. The model reasoned about this explicitly:

Html is !Send but this is safe: Element is only created and consumed inside synchronous callbacks and is never stored in an Arc or sent across threads.

And it structured the crawl loop to hold that invariant — parse the document, run every callback, drain the results into owned Vecs, and drop the Html before the first .await:

let (all_visits, all_items) = {
    let doc = Html::parse_document(&html);
    // ... run callbacks, collect owned results ...
    (all_visits, all_items)
    // doc dropped here, before any await
};

In Python this reasoning is a comment you hope stays true. In Rust, if the model had gotten it wrong — held the Html across the await, stuffed an Element into the task's captured state — it wouldn't compile. The Send bound is load-bearing. That's the version of "the compiler helps the AI" that actually holds up: it converts a class of concurrency mistakes into build failures, and the model can lean on that to attempt refactors it would otherwise have to be timid about.

The same goes for the ordinary stuff. Widening push_item(Value) to push_item<T: Serialize>, swapping a spin-polling semaphore loop for a JoinSet, threading a FetchError enum through the fetch path — these landed cleanly and idiomatically, because they're well-trodden Rust patterns and the types kept the model honest about the seams.

The class of bugs that compiles fine

Then I had a second model do a senior-review pass over the finished code, and it found seven things the green build was hiding. None of them are type errors. All of them compile.

The library panics on normal user mistakes — an invalid selector, a missing API key, a bad output path all unwrap/expect their way to a crash, even though run() returns a ScrapeResult as if failure were representable. concurrency(0) doesn't error; the inner while tasks.len() < 0 loop never spawns anything, tasks.is_empty() is immediately true, and the crawl exits having silently processed nothing. join_next().await discards its Result, so a panicking callback vanishes and the run still reports success. And the stats are quietly miscounted: fetch_errors never increments for HTTP 4xx/5xx — only for network and parse failures — so a crawl that 500s on every page can report fetch_errors: 0 and had_errors() == false. The README, meanwhile, promises that successful status codes are tracked; they aren't recorded at all.

Every one of these is the same shape of mistake: code that satisfies the types and the test suite while doing the wrong thing. The compiler has nothing to say about whether concurrency(0) is meaningful, whether a 500 counts as an error, or whether your docs match your behavior. Those are semantic claims, and semantics is exactly the layer where LLMs are fluent and confident and wrong — and it's the layer Rust doesn't police.

There's a tidy demonstration of this sitting in the repo: the design journal and the README both describe the intended state of the code, not its actual state. The journal says both examples were fixed; one wasn't. The README says successful status codes are tracked; they aren't. The model writes the world as it meant to leave it, and prose has no type checker.

You can't review what you can't read

Go back through those seven findings and ask what it would have taken to catch each one by reading. concurrency(0) exiting silently: you'd have to know that while tasks.len() < 0 is vacuously false, then trace what an empty JoinSet does on the next line. The fetch_errors miscount: you'd have to be holding the intended meaning of "error" in your head and notice the HttpError arm bumps a status counter where you expected it to bump the error count. The missing & on href that wouldn't compile: you'd have to know that String doesn't coerce to &str in argument position, the single Rust fact the entire bug turns on.

None of these are exotic. They're the things you internalize after enough hours in the language. But that's precisely the trap when a learner pairs with an AI: the model emits code that looks right, reads fluently, and compiles, and the only way to know it's wrong is to already know the thing you were hoping the model would handle for you. Knowing Rust well wouldn't have stopped the model from writing the concurrency(0) hole — it would have stopped me from nodding past it in review.

So the green build gets promoted past its pay grade. When you can't evaluate the semantics yourself, "it compiles" slides from not obviously broken to works, because compiling is the only check you're actually equipped to read. That promotion is the real hazard of writing an unfamiliar language with a model that writes it confidently. The bugs didn't come from the model being bad at Rust — it handles the syntax better than I do. They came from neither of us being positioned to see the gap between code that satisfies the types and code that does the right thing. The model can't see it because it's a semantic claim about intent; I couldn't see it because I didn't yet know what right looked like.

So, is Rust better for AI now?

For the bugs Rust can see, yes, unambiguously, and more than I expected before I watched the !Send refactor go through. The borrow checker and the trait system catch a real category of AI error at compile time, and that lets a model attempt more aggressive changes without silently corrupting state. The floor is genuinely higher than in a dynamic language.

But the floor isn't the problem. Most of what was wrong with ferrous compiled, passed its tests, and was described as working by documentation that was itself false. The compiler raised the floor; it did nothing for the ceiling, and feature flags punched a hole in even the floor by hiding a target from the default build. "It compiles, the tests are green, and the model says it's done" turned out to be three independent false comforts stacked on top of each other.

The lesson I'd actually act on has two parts. The mechanical one is cheap: the green checkmark is scoped and the model doesn't know the scope, so run --all-features --all-targets on every check, read the diff instead of the summary of the diff, and treat anything the model asserts about its own output — "both examples updated," "status codes tracked" — as a claim to verify rather than a result to log.

The other part is slower and matters more. If you're using an AI to write a language you're still learning, the AI is not a substitute for learning it — it's the thing that makes the learning feel optional right up until a 500 gets counted as a success. The type system is a fast, narrow oracle that flags the bugs it can see and stays silent on the ones that matter most. Closing that silence is on you, and you can only close it by knowing the language well enough to read what the model wrote and see where it's quietly wrong. I came out of ferrous knowing more Rust than I went in with. That, more than the framework, was the point — and it's the only thing that would have caught the other six.

Why Your Scraper Breaks Without Warning (And How to Fix That)

John Rooney — Sun, 31 May 2026 18:08:26 +0000

Most scraper failures don't raise exceptions. The spider finishes, the pipeline writes a file, the process exits with code 0 — and the output contains 0 items, or 800 instead of 8,000, or fields that are all empty strings. A CSS selector stopped matching after a site redesign. A field that used to be present is now conditional. The "next page" link moved.

None of these produce tracebacks. Without explicit checks, you won't know until someone looks at the data.

This post covers three layers of defence: validating individual items before they reach storage, checking aggregate counts at the end of a run, and setting up logging that's actually useful.

Why selectors fail silently

BeautifulSoup and Scrapy's CSS selectors return None or an empty list when nothing matches — they don't raise. The problem propagates depending on what you do next.

# This raises AttributeError if price_element is None
price_element = book.find("p", class_="price_color")
price = price_element.text.strip()  # AttributeError: 'NoneType' object has no attribute 'text'

That's a loud failure, which is fine. But this version is silent:

price = book.css("p.price_color::text").get(default="")

get(default="") is convenient, but it means a changed selector produces empty strings in your output rather than an error. You end up with a file full of {"title": "Some Book", "price": "", "rating": ""} records that look complete until you actually check the values.

The fix is validation — checking that the fields you extracted are actually populated before writing anything.

Item validation

A validation function takes an item dict and raises if any required field is missing or empty:

REQUIRED_FIELDS = ["title", "price", "rating"]

def validate_item(item: dict) -> dict:
    problems = [f for f in REQUIRED_FIELDS if not item.get(f)]
    if problems:
        raise ValueError(f"Missing fields {problems}: {item}")
    return item

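not item.get(f) catches None, "", [], and 0: every falsy value, including the defaults that get() produces when a selector fails. If 0 is a valid value for a field, test for the specific empty values instead, for example item.get(f) in (None, "", []). Checking only item.get(f) is None would let empty strings through again.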
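If zero is a legitimate value for one of your fields, a variant that rejects only the "nothing was extracted" values might look like the sketch below. The names is_missing and validate_item_allow_zero are illustrative and not part of the original scraper.

```python
REQUIRED_FIELDS = ["title", "price", "rating"]


def is_missing(value) -> bool:
    # None, "" and [] mean the selector found nothing; 0 and False are kept.
    return value is None or value == "" or value == []


def validate_item_allow_zero(item: dict) -> dict:
    # Same contract as validate_item, but a numeric 0 passes validation.
    problems = [f for f in REQUIRED_FIELDS if is_missing(item.get(f))]
    if problems:
        raise ValueError(f"Missing fields {problems}: {item}")
    return item
```

The trade-off is that you now have to list the empty values you care about explicitly, which is more typing but makes the rule visible.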

In a standalone scraper:

import requests
from bs4 import BeautifulSoup
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger(__name__)

REQUIRED_FIELDS = ["title", "price", "rating"]

def validate_item(item: dict) -> dict:
    problems = [f for f in REQUIRED_FIELDS if not item.get(f)]
    if problems:
        raise ValueError(f"Missing fields {problems}: {item}")
    return item

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

resp = session.get("https://books.toscrape.com/", timeout=15)
resp.encoding = "utf-8"
soup = BeautifulSoup(resp.text, "html.parser")

items = []
errors = 0

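for book in soup.find_all("article", class_="product_pod"):
    try:
        # Extraction happens inside the try: a missing element raises
        # AttributeError, TypeError, KeyError, or IndexError, not ValueError.
        raw = {
            "title":  book.find("h3").find("a")["title"],
            "price":  book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        }
        items.append(validate_item(raw))
    except (AttributeError, TypeError, KeyError, IndexError, ValueError) as e:
        logger.warning("Skipping invalid item: %s", e)
        errors += 1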

logger.info("Scraped %d valid items, %d errors", len(items), errors)

Whether to skip invalid items or abort depends on the use case. For a scrape where partial data is usable, log and skip. For a scrape where every record matters, raise immediately.

Validating in a Scrapy pipeline

In Scrapy, validation belongs in a pipeline that runs before the storage pipelines. The ItemAdapter class normalises access across Scrapy Items, dataclasses, and plain dicts:

# pipelines.py
from itemadapter import ItemAdapter
import logging

logger = logging.getLogger(__name__)

REQUIRED_FIELDS = {"title", "price", "rating"}


class ItemValidatorPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        missing = REQUIRED_FIELDS - {
            k for k, v in adapter.items()
            if v not in (None, "", [])
        }

        if missing:
            raise ValueError(
                f"Item missing required fields: {missing} | {dict(adapter)}"
            )

        return item

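Raising Scrapy's DropItem from a pipeline is the idiomatic way to discard an item: Scrapy logs it as a warning and later pipelines never see it. Raising ValueError, as above, also keeps the item out of the later pipelines, but Scrapy reports it as a processing error with a traceback. Neither stops the spider. Enable the validator at a lower priority number than your storage pipelines so it runs first: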

# settings.py
ITEM_PIPELINES = {
    "myproject.pipelines.ItemValidatorPipeline": 100,
    "myproject.pipelines.JsonlPipeline":         300,
    "myproject.pipelines.SqlitePipeline":        400,
}

Count checking

Per-item validation catches bad fields, but not a situation where the spider simply stopped finding items at all. A selector change that breaks the outer find_all("article", class_="product_pod") call produces zero items with zero errors — everything looks fine.

A count check at the end of a run catches this:

MIN_EXPECTED = 10

logger.info("Scraped %d valid items, %d errors", len(items), errors)

if len(items) < MIN_EXPECTED:
    logger.error(
        "Item count %d below threshold %d — check selectors or site structure",
        len(items),
        MIN_EXPECTED,
    )

In Scrapy, this belongs in an extension that connects to the spider_closed signal:

# extensions.py
from scrapy import signals
import logging

logger = logging.getLogger(__name__)


class ItemCountChecker:
    MIN_ITEMS = 10

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        self.item_count = 0
        spider.crawler.signals.connect(
            self.item_scraped, signal=signals.item_scraped
        )

    def item_scraped(self, item, spider):
        self.item_count += 1

    def spider_closed(self, spider):
        if self.item_count < self.MIN_ITEMS:
            logger.error(
                "Spider %s closed with only %d items (threshold: %d). "
                "Check selectors or site structure.",
                spider.name,
                self.item_count,
                self.MIN_ITEMS,
            )
        else:
            logger.info(
                "Spider %s: %d items scraped.", spider.name, self.item_count
            )

Enable in settings:

EXTENSIONS = {
    "myproject.extensions.ItemCountChecker": 500,
}

The threshold needs thought. A scraper that usually returns 800 items should probably alert below 700, not below 10. Track a few runs first, then set the threshold relative to your expected baseline.
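One way to set a threshold relative to a baseline is to take the median of recent run totals and allow a fixed percentage drop. This is a sketch under assumptions: the count_threshold name, the 15% tolerance, and the history numbers are all made up for illustration.

```python
from statistics import median


def count_threshold(recent_counts: list[int], tolerance: float = 0.15) -> int:
    """Return the minimum acceptable item count given recent run totals."""
    if not recent_counts:
        raise ValueError("need at least one previous run to set a baseline")
    baseline = median(recent_counts)          # robust to one unusually bad run
    return int(baseline * (1 - tolerance))    # allow a `tolerance` fraction of drift


history = [812, 798, 805, 790, 801]   # item counts from earlier runs (made up)
threshold = count_threshold(history)
print(threshold)  # 680
```

The median is used rather than the mean so that a single broken run in the history doesn't drag the baseline down with it.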

Logging that's useful

Scrapy's default logging is verbose at INFO level: request counts, middleware messages, and stats flood the output along with the things you actually care about. Two settings control how much is kept and where it goes:

# settings.py
LOG_LEVEL = "INFO"      # keep INFO so your own messages are not filtered out
LOG_FILE  = "scrape.log"

Scrapy applies LOG_LEVEL to the log handler it installs, so setting it to "WARNING" would also discard INFO messages from your own pipelines and extensions. To quieten the internals while keeping your own INFO output, leave LOG_LEVEL at "INFO" and raise the level on Scrapy's loggers instead. Because Scrapy uses Python's standard logging module, that is one call, made after Scrapy has configured logging (for example in an extension's from_crawler):

import logging
logging.getLogger("scrapy").setLevel(logging.WARNING)   # silence INFO from Scrapy internals

For standalone scrapers, the format matters. The default basicConfig format, %(levelname)s:%(name)s:%(message)s, has no timestamp. At minimum include the timestamp and level:

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

When something goes wrong at 2am, the timestamp is how you correlate a failure with a site change or a rate-limit event.

The three signals that something is broken

In practice, silent scraper failures show up three ways:

A count drop with no errors. Item count is 0 or far below baseline, error count is 0, the spider exited cleanly. This is a selector change — the outer container no longer matches.

A count that looks right but fields are empty. Item count is normal, but price and rating are empty strings across all records. An inner selector changed while the outer one still matches.

A count drop accompanied by validation errors. Item count is lower than expected, and the log contains Skipping invalid item messages. A field is now conditional — present on some items, absent on others, probably due to a new page layout.

Each failure mode points at a different selector level. Logging both counts and individual errors separately makes it easier to tell which one you're dealing with.
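The three signals can be turned into a small end-of-run check. The sketch below assumes you keep four counters per run; the diagnose function, its parameter names, and the 0.8 drop ratio are illustrative choices, not something the scrapers above already do.

```python
def diagnose(item_count: int, baseline: int, validation_errors: int,
             empty_field_items: int, drop_ratio: float = 0.8) -> str:
    """Map end-of-run counters to the most likely silent failure mode."""
    count_dropped = item_count < baseline * drop_ratio
    if count_dropped and validation_errors == 0:
        # Few or no items and nothing rejected: the container stopped matching.
        return "outer selector: container no longer matches"
    if count_dropped and validation_errors > 0:
        # Some items rejected: a field is now present only on some records.
        return "conditional field: present on some items, absent on others"
    if empty_field_items > 0:
        # Count is normal but fields are blank: an inner selector changed.
        return "inner selector: container matches but fields are empty"
    return "ok"


print(diagnose(item_count=0, baseline=800, validation_errors=0, empty_field_items=0))
```

Logging the returned string at ERROR level when it isn't "ok" gives whoever reads the log a starting point rather than just a number.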

Putting it together

A scraper with validation, count checking, and useful logging doesn't need to be complex. These three additions together cover the most common silent failure modes:

Validate each item before writing — catches changed inner selectors
Check the final count against a threshold — catches broken outer selectors
Log counts and errors at the end of every run — makes failures visible without manual inspection

For a production scraper that runs on a schedule, route the logs somewhere you'll see them. LOG_FILE to a location your monitoring system watches, or send the error-level messages to a webhook. A scraper that runs silently and produces empty output for a week is worse than one that fails loudly on the first run.
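Sending error-level messages to a webhook can be done with a small logging.Handler subclass. This is a minimal sketch: the WebhookHandler class, the {"text": ...} payload shape, and the example.invalid URL are assumptions, so adjust them to whatever your webhook expects. The send argument exists only so the handler can be exercised without a network.

```python
import json
import logging
import urllib.request


class WebhookHandler(logging.Handler):
    """Forward ERROR-and-above records to a webhook as a JSON POST."""

    def __init__(self, url: str, send=None):
        super().__init__(level=logging.ERROR)
        self.url = url
        # `send` is injectable so the handler can be tested offline.
        self._send = send or self._post

    def _post(self, payload: bytes) -> None:
        req = urllib.request.Request(
            self.url,
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=10):
            pass

    def emit(self, record: logging.LogRecord) -> None:
        try:
            payload = json.dumps({"text": self.format(record)}).encode("utf-8")
            self._send(payload)
        except Exception:
            self.handleError(record)  # alerting must never crash the scraper


# Demo with a fake sender that just collects payloads in a list.
sent = []
demo_logger = logging.getLogger("scraper.alerts.demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(WebhookHandler("https://example.invalid/hook", send=sent.append))

demo_logger.info("Scraped 812 items")                   # below ERROR: not forwarded
demo_logger.error("Item count 0 below threshold 700")   # forwarded
```

In a real run you would drop the send argument and attach the handler to the same logger your count check writes to.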

Tags: python scrapy webscraping tutorial

How to Store What You Scrape (Without Making a Mess)

John Rooney — Sun, 31 May 2026 18:07:43 +0000

The default for a first scraper is usually printing to stdout or dumping everything into a JSON array. Both work for a single test run. Neither works reliably at scale — a JSON array can't be read incrementally, and stdout isn't a data format.

The format you choose matters more than it seems. The wrong one creates problems that appear later: a 2GB JSON file that has to be fully parsed before you can read a single record; a CSV that corrupts silently when a field contains a comma; a database that slows to a crawl because nobody added an index. This post covers the three formats that cover most scraping use cases, and when to reach for each.

JSONL over JSON arrays

A JSON array written to a file is convenient until the scrape crashes halfway through, at which point you have a truncated, unparseable file. You also can't append to it without reading the whole thing first.

JSONL (JSON Lines) writes one JSON object per line. Each line is independently parseable. A crashed scrape leaves a valid file up to the last complete line. Appending is open("file", "a"). Streaming a large file is a for line in f loop.

import json
import requests
from bs4 import BeautifulSoup
from pathlib import Path

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
}

session = requests.Session()
session.headers.update(HEADERS)

output = Path("books.jsonl")

with output.open("w") as f:
    resp = session.get("https://books.toscrape.com/", timeout=15)
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")

    for book in soup.find_all("article", class_="product_pod"):
        item = {
            "title":  book.find("h3").find("a")["title"],
            "price":  book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        }
        f.write(json.dumps(item) + "\n")

# Reading back
with output.open() as f:
    records = [json.loads(line) for line in f]

print(f"Read {len(records)} records")
print(records[0])

Output:

Read 20 records
{'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three'}

For a paginated scrape, open the file in append mode ("a") and write after each page. If the scrape is interrupted on page 15 of 50, you still have the first 14 pages intact. You can resume from page 15 without re-scraping the ones you already have.
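A resumable version needs to know which page each record came from. The sketch below assumes every record carries a "page" key (not something the example above writes), and it always re-scrapes the last page on disk because that page may be incomplete. The helper names are illustrative.

```python
import json
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    """Read records, ignoring a truncated final line left by a crash."""
    records = []
    if not path.exists():
        return records
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                break  # partial last line: stop here
    return records


def append_page(path: Path, page: int, items: list[dict]) -> None:
    """Append one page of items, tagging each record with its page number."""
    with path.open("a", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps({**item, "page": page}) + "\n")


def resume_page(path: Path) -> int:
    """Return the page to restart from, discarding the last (possibly partial) page."""
    records = load_jsonl(path)
    if not records:
        return 1
    last_page = max(r["page"] for r in records)
    kept = [r for r in records if r["page"] < last_page]
    with path.open("w", encoding="utf-8") as f:   # rewrite without the last page
        for r in kept:
            f.write(json.dumps(r) + "\n")
    return last_page
```

Re-fetching one page costs a single request and avoids both gaps and duplicates, which is usually a better deal than trying to work out how far into the page the crash happened.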

CSV with DictWriter

CSV is the right format when the output has a fixed, flat schema and you need to open it in a spreadsheet or feed it to a tool that expects CSV. For nested data or variable fields, JSONL is more flexible.

import csv
import requests
from bs4 import BeautifulSoup
from pathlib import Path

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
}

session = requests.Session()
session.headers.update(HEADERS)

FIELDS = ["title", "price", "rating"]
output = Path("books.csv")

with output.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()

    resp = session.get("https://books.toscrape.com/", timeout=15)
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")

    for book in soup.find_all("article", class_="product_pod"):
        writer.writerow({
            "title":  book.find("h3").find("a")["title"],
            "price":  book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        })

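Two things to get right here. First, newline="" in the open() call is required on Windows; without it, the csv module's \r\n line endings are translated a second time and every row is followed by a blank line. Second, DictWriter with explicit fieldnames fixes the column order, but by default it raises ValueError when a row contains a key that isn't in fieldnames. Pass extrasaction="ignore" if you want extra keys dropped instead, which matters when scraped data is messier than expected. Keys missing from a row are written as empty strings.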

DictWriter also handles quoting automatically — if a field contains a comma or a quote character, it wraps the field correctly. Manual string concatenation doesn't.
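The quoting behaviour is easy to check in memory. The sketch below writes one row to a StringIO buffer and reads it back; the book title and the scraped_from key are made up for the example, and extrasaction="ignore" is what makes DictWriter drop the key that isn't in fieldnames.

```python
import csv
import io

buf = io.StringIO(newline="")
writer = csv.DictWriter(buf, fieldnames=["title", "price"], extrasaction="ignore")
writer.writeheader()
writer.writerow({
    "title": 'Eats, Shoots & "Leaves"',   # contains a comma and double quotes
    "price": "£9.99",
    "scraped_from": "listing",            # not in fieldnames: dropped
})
csv_text = buf.getvalue()
print(csv_text)

# Round-trip: the reader recovers the original field intact.
rows = list(csv.DictReader(io.StringIO(csv_text, newline="")))
print(rows[0]["title"])
```

The title is written as "Eats, Shoots & ""Leaves""": the whole field is wrapped in quotes and the embedded quotes are doubled, which is exactly what a hand-rolled ",".join() would get wrong.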

SQLite for deduplication and querying

SQLite is worth reaching for when you're doing repeated scrapes of the same site (price monitoring, availability checking) and need to avoid re-inserting data you already have. INSERT OR IGNORE skips any row that would violate a UNIQUE constraint, with no error raised.

import sqlite3
import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
}

session = requests.Session()
session.headers.update(HEADERS)

conn = sqlite3.connect("books.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS books (
        title  TEXT,
        price  TEXT,
        rating TEXT,
        UNIQUE(title)
    )
""")
conn.commit()

resp = session.get("https://books.toscrape.com/", timeout=15)
resp.encoding = "utf-8"
soup = BeautifulSoup(resp.text, "html.parser")

inserted = 0
skipped  = 0

for book in soup.find_all("article", class_="product_pod"):
    cur = conn.execute(
        "INSERT OR IGNORE INTO books (title, price, rating) VALUES (?, ?, ?)",
        (
            book.find("h3").find("a")["title"],
            book.find("p", class_="price_color").text.strip(),
            book.find("p", class_="star-rating")["class"][1],
        ),
    )
    # rowcount is 1 for a new row and 0 when the UNIQUE constraint skipped it
    if cur.rowcount == 1:
        inserted += 1
    else:
        skipped += 1  # already in the DB

conn.commit()
conn.close()

print(f"Inserted: {inserted}, skipped (duplicate): {skipped}")

The parameterised query (? placeholders) is not optional. String formatting SQL with scraped data is a SQL injection vulnerability even when you control the scraper — the scraped site controls the data.

For repeated scrapes where you want to track price history rather than deduplicate, swap the UNIQUE constraint and INSERT OR IGNORE for a timestamp column and a plain INSERT. Each run adds new rows without touching old ones.
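A price-history table along those lines might look like the sketch below. The price_history table, the record_price helper, the index, and the second price are all illustrative; the example uses an in-memory database so it leaves nothing on disk.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")   # in-memory DB for the example
conn.execute("""
    CREATE TABLE IF NOT EXISTS price_history (
        title      TEXT NOT NULL,
        price      TEXT NOT NULL,
        scraped_at TEXT NOT NULL
    )
""")
# An index on (title, scraped_at) keeps per-title history lookups fast.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_history_title ON price_history (title, scraped_at)"
)


def record_price(conn, title, price, scraped_at=None):
    # No UNIQUE constraint: every run adds a new row with its own timestamp.
    scraped_at = scraped_at or datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO price_history (title, price, scraped_at) VALUES (?, ?, ?)",
        (title, price, scraped_at),
    )


record_price(conn, "A Light in the Attic", "£51.77", "2026-05-30T09:00:00+00:00")
record_price(conn, "A Light in the Attic", "£49.99", "2026-05-31T09:00:00+00:00")
conn.commit()

history = conn.execute(
    "SELECT price, scraped_at FROM price_history WHERE title = ? ORDER BY scraped_at",
    ("A Light in the Attic",),
).fetchall()
print(history)
```

Storing the timestamp as an ISO 8601 string in UTC means plain text ordering matches chronological ordering, so ORDER BY scraped_at does the right thing without any date parsing.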

Scrapy pipelines

In Scrapy, storage belongs in item pipelines. A pipeline is a class with a process_item method that receives each item as it comes off the spider. Pipelines can be chained — one validates, one writes to JSONL, one writes to SQLite — and each gets called in the order defined in ITEM_PIPELINES in your settings.

# pipelines.py
import json
import sqlite3
from itemadapter import ItemAdapter


class JsonlPipeline:
    def open_spider(self, spider):
        self.file = open("output.jsonl", "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(ItemAdapter(item).asdict())
        self.file.write(line + "\n")
        return item  # must return item for the next pipeline to receive it


class SqlitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("output.db")
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS items (
                title  TEXT,
                price  TEXT,
                rating TEXT,
                UNIQUE(title)
            )
        """)
        self.conn.commit()

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)
        self.conn.execute(
            "INSERT OR IGNORE INTO items (title, price, rating) VALUES (?, ?, ?)",
            (adapter["title"], adapter["price"], adapter["rating"]),
        )
        self.conn.commit()
        return item

Enable in settings.py:

ITEM_PIPELINES = {
    "myproject.pipelines.JsonlPipeline":   300,
    "myproject.pipelines.SqlitePipeline":  400,
}

The integer values are priority — lower numbers run first. The return item at the end of each process_item is mandatory; if you forget it, the next pipeline in the chain receives None.

open_spider and close_spider are the correct place to open and close file handles and database connections. Opening them in __init__ means they persist for the lifetime of the pipeline object even when no spider is running.

Choosing a format

Format	Use when
JSONL	Default for most scrapes; variable fields; needs to survive interruption
CSV	Fixed flat schema; downstream tools expect CSV; spreadsheet output
SQLite	Repeated scrapes of the same targets; need to query across runs; deduplication
PostgreSQL / MySQL	Multiple scrapers writing concurrently; data volumes above ~10GB

The jump to a full database server is rarely needed for a single-scraper project. SQLite handles hundreds of millions of rows on modern hardware without issue, provided you're not writing from multiple processes simultaneously.

Tags: python scrapy webscraping tutorial

Scraping Paginated Sites Without Getting It Wrong

John Rooney — Sun, 31 May 2026 18:07:12 +0000

Pagination is where a lot of scrapers quietly go wrong — not with errors, but with missing data. A scraper that stops two pages early produces no exceptions. Neither does one that re-fetches the same page in a loop. The output just looks thin, and you may not notice until you check it against the actual record count.

There are three common pagination patterns. Identifying which one you're dealing with takes 30 seconds in DevTools; implementing it correctly takes another ten minutes. This post covers all three.

Pattern 1: Page number in the URL

The simplest pattern. The URL contains a page parameter — either as a query string (?page=2) or as part of the path (/catalogue/page-2.html). Increment it until you get a 404 or an empty result set.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://books.toscrape.com/catalogue/"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

all_books = []
page = 1

while True:
    url = f"{BASE}page-{page}.html"
    resp = session.get(url, timeout=15)

    if resp.status_code == 404:
        break  # past the last page

    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")
    books = soup.find_all("article", class_="product_pod")

    if not books:
        break  # empty page — also done

    for book in books:
        all_books.append({
            "title":  book.find("h3").find("a")["title"],
            "price":  book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        })

    print(f"Page {page}: {len(books)} books")
    page += 1

print(f"\nTotal: {len(all_books)} books")

Two termination conditions, not one. Some sites return an empty 200 for out-of-range pages rather than a 404. Check both.

Pattern 2: Following the "next" link

A cleaner approach for HTML-paginated sites: let the page tell you where to go next, rather than constructing URLs yourself. Most paginated sites include a "Next" link in the HTML. Follow it until it disappears.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://books.toscrape.com/catalogue/"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

url = "https://books.toscrape.com/catalogue/page-1.html"
all_books = []

while url:
    resp = session.get(url, timeout=15)
    resp.raise_for_status()  # an error page has no "next" link and would end the loop silently
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")

    for book in soup.find_all("article", class_="product_pod"):
        all_books.append({
            "title":  book.find("h3").find("a")["title"],
            "price":  book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        })

    next_btn = soup.select_one("li.next a")
    url = urljoin(BASE, next_btn["href"]) if next_btn else None

print(f"Scraped {len(all_books)} books")

The urljoin(BASE, next_btn["href"]) call is worth noting. The href in a "next" link is often relative (page-2.html, ../page-2.html). urljoin resolves it against the base URL correctly regardless of what form the relative path takes. Concatenating strings instead will break on unusual relative paths.
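To make that concrete, here is a small sketch comparing urljoin with plain concatenation. The three href forms are illustrative values, not ones copied from a particular page. It also resolves against the URL of the page that was just fetched (what requests exposes as resp.url) rather than a fixed BASE; that is my assumption about what holds up best if a site moves between directories.

```python
from urllib.parse import urljoin

BASE = "https://books.toscrape.com/catalogue/"
current_page = "https://books.toscrape.com/catalogue/page-1.html"

# Illustrative href forms a "next" link might take
hrefs = ["page-2.html", "/catalogue/page-2.html", "../catalogue/page-2.html"]

# Resolve each href against the page it was found on
resolved = [urljoin(current_page, href) for href in hrefs]
# The naive alternative: glue the href onto a fixed prefix
concatenated = [BASE + href for href in hrefs]

for href, good, naive in zip(hrefs, resolved, concatenated):
    print(f"{href!r:28} urljoin -> {good}")
    print(f"{'':28} concat  -> {naive}")
```

All three hrefs resolve to the same page-2 URL with urljoin. Concatenation only gets the first one right; the other two end up with a doubled path segment or a literal ../ in the URL.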

Pattern 3: API cursor / continuation token

JSON APIs often paginate differently. Instead of page numbers, they return a token or flag telling you whether more results exist, and sometimes a cursor to pass back on the next request.

The simplest version: a has_next boolean.

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "application/json",
    "Accept-Language": "en-GB,en;q=0.9",
})

all_quotes = []
page = 1

while True:
    resp = session.get(
        "https://quotes.toscrape.com/api/quotes",
        params={"page": page},
        timeout=15,
    )
    resp.raise_for_status()
    data = resp.json()

    all_quotes.extend(data["quotes"])
    print(f"Page {page}: {len(data['quotes'])} quotes")

    if not data["has_next"]:
        break
    page += 1

print(f"\nTotal: {len(all_quotes)} quotes")

Some APIs use a cursor instead — the response includes a next_cursor or next_page_token field that you pass as a parameter on the subsequent request. The structure changes but the loop logic is the same: keep going until the cursor field is null or absent.

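# Generic cursor pattern (endpoint and field names are placeholders)
# Reuses the session from the previous example
items = []
params = {"limit": 100}

while True:
    resp = session.get("https://example.com/api/items", params=params, timeout=15)
    resp.raise_for_status()
    data = resp.json()

    items.extend(data["results"])

    cursor = data.get("next_cursor")
    if not cursor:
        break  # null or absent cursor: last page reached
    params["cursor"] = cursor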
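The same loop can be wrapped in a generator so the stopping logic is testable without a network. This is a sketch under assumptions: iter_cursor_pages, fetch_page, and the results / next_cursor field names are hypothetical, and the in-memory FAKE_PAGES dict stands in for a real API. The repeated-cursor check and the max_pages cap are extra guards I've added, not something every API needs.

```python
from typing import Callable, Iterator, Optional


def iter_cursor_pages(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 1000,
) -> Iterator[dict]:
    """Yield items from a cursor-paginated API until the cursor runs out."""
    cursor = None
    seen_cursors = set()
    for _ in range(max_pages):
        data = fetch_page(cursor)
        yield from data["results"]

        cursor = data.get("next_cursor")
        if not cursor:
            return  # null or absent cursor: last page reached
        if cursor in seen_cursors:
            raise RuntimeError(f"Cursor {cursor!r} repeated; stopping")
        seen_cursors.add(cursor)
    raise RuntimeError(f"Gave up after {max_pages} pages")


# Stand-in for an HTTP call: a hypothetical three-page API held in memory
FAKE_PAGES = {
    None: {"results": [1, 2], "next_cursor": "a"},
    "a": {"results": [3, 4], "next_cursor": "b"},
    "b": {"results": [5], "next_cursor": None},
}

items = list(iter_cursor_pages(lambda cursor: FAKE_PAGES[cursor]))
print(items)
```

In real use, fetch_page would be a small function that calls session.get() with the cursor in params and returns resp.json().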

Rate limiting

Sending requests as fast as the network allows is not scraping — it's a load test. Most sites will rate-limit or block traffic that arrives faster than a human could generate it. A 1-2 second delay between pages is a reasonable starting point; adjust based on the site's response times and any explicit rate-limit headers it sends.

import time

while url:
    resp = session.get(url, timeout=15)
    soup = BeautifulSoup(resp.text, "html.parser")
    # ... process page ...

    next_btn = soup.select_one("li.next a")
    url = urljoin(BASE, next_btn["href"]) if next_btn else None

    if url:
        time.sleep(1)  # only sleep if there's another request coming

Sleeping after the last page is unnecessary. Put the sleep before the next request or, as above, after confirming there is a next request.

For higher-volume work, time.sleep with a fixed value is blunt. A better approach uses a random delay within a range — time.sleep(random.uniform(0.5, 2.0)) — which avoids the metronomic request timing that fixed delays produce.
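A minimal sketch of that jittered delay as a helper. The name polite_sleep and the injectable sleep argument are my own additions; the injection is only there so the function can be exercised without actually waiting.

```python
import random
import time


def polite_sleep(low: float = 0.5, high: float = 2.0, sleep=time.sleep) -> float:
    """Sleep for a random duration between low and high seconds and return it."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


# Dry run: record the delays instead of actually sleeping
recorded = []
for _ in range(5):
    polite_sleep(0.5, 2.0, sleep=recorded.append)

print([round(d, 2) for d in recorded])
```

In the pagination loop, polite_sleep() would replace the fixed time.sleep(1) call, still guarded by the if url check.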

Duplicate URL detection

Some sites have inconsistent pagination — "next" links that eventually loop back, or page parameters that wrap around. A simple seen_urls set catches this before it turns into an infinite loop:

from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

BASE = "https://books.toscrape.com/catalogue/"
url = "https://books.toscrape.com/catalogue/page-1.html"
seen_urls = set()
all_books = []

while url:
    if url in seen_urls:
        print(f"Loop detected at {url} — stopping")
        break
    seen_urls.add(url)

    resp = session.get(url, timeout=15)
    resp.raise_for_status()  # fail loudly instead of parsing an error page
    resp.encoding = "utf-8"
    soup = BeautifulSoup(resp.text, "html.parser")

    for book in soup.find_all("article", class_="product_pod"):
        all_books.append({
            "title":  book.find("h3").find("a")["title"],
            "price":  book.find("p", class_="price_color").text.strip(),
            "rating": book.find("p", class_="star-rating")["class"][1],
        })

    next_btn = soup.select_one("li.next a")
    url = urljoin(BASE, next_btn["href"]) if next_btn else None

print(f"Scraped {len(all_books)} books across {len(seen_urls)} pages")
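A seen_urls set only catches exact string matches. If a site spells the same page several ways, normalising the URL before the lookup can help. The sketch below is one possible approach, not a universal rule: canonical_url is a hypothetical helper, and it assumes fragments and query-parameter order don't change the page, which isn't true of every site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Normalise a URL so trivially different spellings compare equal."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lower-case scheme and host, sort query parameters, drop the fragment
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path, query, ""))


seen_urls = set()
for candidate in [
    "https://example.com/list?page=2&sort=asc",
    "https://EXAMPLE.com/list?sort=asc&page=2#top",
]:
    seen_urls.add(canonical_url(candidate))

print(seen_urls)
```

Both spellings collapse to a single entry, so the second one would be recognised as already visited.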

Using Scrapy's CrawlSpider

If you're building on Scrapy, CrawlSpider handles link following automatically via rules. This is the idiomatic Scrapy approach for sites where pagination follows a consistent pattern:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BooksSpider(CrawlSpider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/catalogue/page-1.html"]

    rules = (
        # Follow "next page" links
        Rule(
            LinkExtractor(restrict_css="li.next a"),
            callback="parse_page",
            follow=True,
        ),
    )

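    def parse_start_url(self, response):
        # Rule callbacks only run for followed links, so parse page 1 here
        return self.parse_page(response)

    def parse_page(self, response):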
        for book in response.css("article.product_pod"):
            yield {
                "title":  book.css("h3 a::attr(title)").get(),
                "price":  book.css("p.price_color::text").get(default="").strip(),
                "rating": book.css("p.star-rating::attr(class)").get(default="").split()[-1],
            }

CrawlSpider deduplicates URLs by default (Scrapy's built-in duplicate filter handles it), respects DOWNLOAD_DELAY in your settings, and handles retries. For a site with straightforward pagination, it removes most of the boilerplate above.

One thing to know: CrawlSpider calls the rules on every response, including the ones your callback generates. If a page both contains items and a "next" link, the rule fires correctly — but if you override parse() directly on a CrawlSpider, you'll break the rule processing. Use a separate callback method, as above.

Quick decision guide

Situation	Approach
URL has `/page/2` or `?page=2`	Increment, stop on 404 or empty
Page has a "Next" link in HTML	Follow href with `urljoin`, stop when absent
JSON API with `has_next` flag	Loop until flag is false
JSON API with cursor/token	Pass cursor back each request, stop when null
Building on Scrapy	`CrawlSpider` + `LinkExtractor`

Tags: python scrapy webscraping tutorial

How to Handle JavaScript-Rendered Pages Without a Full Browser

John Rooney — Sun, 31 May 2026 18:06:27 +0000

The HTML that requests downloads is what the server sends before any JavaScript runs. For a large and growing number of sites, that document is nearly empty — a shell with <script> tags that populate the content after the browser executes them. BeautifulSoup finds nothing because there is nothing to find in the source.

Two approaches handle this. The first is to skip the HTML entirely and call the underlying API the JavaScript is already talking to. The second is to use a real browser. The first option is faster, more reliable, and available more often than people expect.

Identifying a JS-rendered page

The test is straightforward: compare what you see in "View Source" against what you see in DevTools' Elements panel. View Source shows the raw server response — exactly what requests receives. The Elements panel shows the live DOM after JavaScript has run.

If your target data appears in the Elements panel but not in View Source, it's JS-rendered.

The other common indicator: requests returns a 200 with a short body. A page that renders 50 product cards in the browser but returns 2KB of HTML to a plain HTTP request is loading its content dynamically.
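The View Source check can also be scripted: count how often the element you care about appears in the raw response. A minimal sketch using only the standard library follows; ClassCounter and count_in_source are hypothetical names, and the two HTML strings are made-up stand-ins for a server-rendered page and a JS shell.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags that carry a given CSS class."""

    def __init__(self, css_class: str):
        super().__init__()
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.count += 1


def count_in_source(html: str, css_class: str) -> int:
    parser = ClassCounter(css_class)
    parser.feed(html)
    return parser.count


# Two made-up responses: a server-rendered page and a JS shell
server_rendered = '<div class="quote">A</div><div class="quote big">B</div>'
js_shell = '<div id="app"></div><script src="/static/app.js"></script>'

print(count_in_source(server_rendered, "quote"))
print(count_in_source(js_shell, "quote"))
```

A count of zero for a class you can see in the Elements panel is the signal that the content arrives after JavaScript runs.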

Option 1: Find the underlying API

When a page loads content via JavaScript, that JavaScript has to get the data from somewhere. Usually it makes an XHR or Fetch request to a JSON API. That same API is available to your scraper — with no browser required.

To find it, open DevTools, go to the Network tab, filter by "Fetch/XHR", then reload the page. Watch for requests that return JSON. Click on one, check the "Preview" tab — if it contains your data, you've found the endpoint.

Here's a concrete example. quotes.toscrape.com/js/ renders its quotes via JavaScript. The raw HTML response contains zero quote elements:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

resp = session.get("https://quotes.toscrape.com/js/", timeout=15)
soup = BeautifulSoup(resp.text, "html.parser")
print(len(soup.find_all("div", class_="quote")))  # 0

But the JavaScript is fetching from /api/quotes. That endpoint returns clean JSON and supports pagination via a page parameter:

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "application/json",
    "Accept-Language": "en-GB,en;q=0.9",
})

all_quotes = []
page = 1

while True:
    resp = session.get(
        "https://quotes.toscrape.com/api/quotes",
        params={"page": page},
        timeout=15,
    )
    resp.raise_for_status()
    data = resp.json()

    all_quotes.extend(data["quotes"])
    print(f"Page {page}: {len(data['quotes'])} quotes")

    if not data["has_next"]:
        break
    page += 1

print(f"\nTotal: {len(all_quotes)} quotes")

Output:

Page 1: 10 quotes
Page 2: 10 quotes
...
Page 10: 10 quotes

Total: 100 quotes

This is faster than any browser-based approach and produces cleaner data. Change the Accept header to application/json when hitting JSON endpoints — some APIs check it.

When inspecting Network requests, look for:

URLs containing /api/, /graphql, /v1/, /data/, or .json
Requests where the Preview tab shows an object or array
Query parameters like page, offset, cursor, limit — these tell you the pagination model upfront
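If you export a list of request URLs from the Network tab, the checklist above can be applied mechanically. This is a rough sketch: describe_request and the two constants are hypothetical, the hints are just the ones listed above, and the content type passed in the example is an assumed value rather than one I've read off the response.

```python
from urllib.parse import parse_qs, urlsplit

API_PATH_HINTS = ("/api/", "/graphql", "/v1/", "/data/")
PAGINATION_PARAMS = {"page", "offset", "cursor", "limit"}


def describe_request(url: str, content_type: str = "") -> dict:
    """Apply the checklist above to one request seen in the Network tab."""
    parts = urlsplit(url)
    path = parts.path.lower()
    return {
        "path_hint": path.endswith(".json") or any(h in path for h in API_PATH_HINTS),
        "json_response": "json" in content_type.lower(),
        "pagination_params": sorted(PAGINATION_PARAMS & set(parse_qs(parts.query))),
    }


print(describe_request(
    "https://quotes.toscrape.com/api/quotes?page=2",
    content_type="application/json",  # assumed value for illustration
))
```

None of the three signals is proof on its own; treat a match as a prompt to open the response and look.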

Not every site exposes a clean API. Some assemble their data server-side, some use GraphQL, some obfuscate the endpoints. When the API route isn't viable, use a browser.

Option 2: Playwright

Playwright drives a real browser, so what it sees is identical to what a user sees. It's slower than requests and consumes more memory, but it's the correct tool when JavaScript execution is unavoidable.

Install:

pip install playwright
python -m playwright install chromium

Basic scrape of the same JS-rendered page:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    page.goto("https://quotes.toscrape.com/js/", wait_until="networkidle")

    quotes = page.locator("div.quote").all()
    for q in quotes:
        text   = q.locator("span.text").inner_text()
        author = q.locator("small.author").inner_text()
        print(f"{text[:70]}... — {author}")

    browser.close()

wait_until="networkidle" tells Playwright to wait until there are no ongoing network requests for at least 500ms — enough time for most JS-driven content to load. For pages with long-running background requests (analytics pings, chat widgets), "domcontentloaded" or a specific element wait is more reliable:

# Wait for a specific element rather than network quiet
page.goto("https://quotes.toscrape.com/js/")
page.wait_for_selector("div.quote")

page.locator() is preferable to page.query_selector_all(). Locators are lazy — they don't execute until you call a method on them — and they retry automatically if the element isn't immediately present. This makes them more tolerant of pages that render content in stages.

Blocking unnecessary resources

By default, Playwright fetches everything: stylesheets, images, fonts, analytics scripts. For scraping, none of that matters. Blocking it cuts load time noticeably:

from playwright.sync_api import sync_playwright

def block_non_essential(route):
    if route.request.resource_type in ("image", "font", "stylesheet", "media"):
        route.abort()
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.route("**/*", block_non_essential)

    page.goto("https://quotes.toscrape.com/js/", wait_until="networkidle")

    texts = page.locator("div.quote span.text").all_inner_texts()
    print(f"Found {len(texts)} quotes")

    browser.close()

The page.route() call intercepts every request. For anything that matches image, font, stylesheet, or media, it calls route.abort() instead of letting it through. The data-carrying requests — the HTML document and any XHR calls — continue normally.

Using Playwright to discover hidden APIs

Even when you intend to use Playwright for the actual scraping, it can show you API calls you didn't know existed. Register a response handler before navigation:

from playwright.sync_api import sync_playwright

api_calls = []

def capture_api(response):
    content_type = response.headers.get("content-type", "")
    if "json" in content_type:
        api_calls.append(response.url)

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.on("response", capture_api)

    page.goto("https://quotes.toscrape.com/js/", wait_until="networkidle")

    print("JSON responses observed:")
    for url in api_calls:
        print(f"  {url}")

    browser.close()

If any JSON responses appear, check whether they contain your target data. If they do, you can switch from Playwright to a plain requests call against that URL — no browser needed for subsequent runs.

Choosing between the two approaches

API scraping is worth the investigation time. A requests loop that pages through a JSON API runs 10-50x faster than Playwright, uses a fraction of the memory, and produces structured data without parsing HTML. If the endpoint exists and isn't authenticated in a way you can't replicate, use it.

Playwright is the right answer when: the page builds its content from multiple sources with no single API, authentication involves cookies set by JavaScript challenges, or content only appears after user interactions like scroll events or button clicks.

The two approaches also compose. Use Playwright to log in and capture session cookies, then hand those cookies to a requests.Session for the actual data collection:

from playwright.sync_api import sync_playwright
import requests

# Step 1: get session cookies via browser
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/login")
    page.fill("#username", "you@example.com")
    page.fill("#password", "your-password")
    page.click("button[type=submit]")
    page.wait_for_url("**/dashboard")
    cookies = page.context.cookies()
    browser.close()

# Step 2: transfer cookies to requests session
session = requests.Session()
for cookie in cookies:
    session.cookies.set(cookie["name"], cookie["value"], domain=cookie["domain"])

# Step 3: scrape with requests
resp = session.get("https://example.com/api/data", timeout=15)
print(resp.json())

Tags: python webscraping playwright tutorial

Why Your Scraper Works in the Browser But Fails in Python

John Rooney — Sun, 31 May 2026 18:05:05 +0000

When a requests.get() call returns a 403, or a 200 with an "Access Denied" body, the first instinct is usually to blame the site. But the more likely explanation is that the server received a request that doesn't look anything like what a browser sends — and responded accordingly.

HTTP servers see every header your client sends. A bare requests call sends four. Chrome sends around fifteen, and the values are specific enough that the gap is obvious server-side. This post covers what that gap looks like, why it matters, and how to close it.

What requests actually sends by default

Start a fresh Python session and inspect what requests puts on the wire:

import requests

session = requests.Session()
req = requests.Request("GET", "https://example.com")
prepared = session.prepare_request(req)

for header, value in prepared.headers.items():
    print(f"{header}: {value}")

Output:

User-Agent: python-requests/2.33.1
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

Four headers. That's it. A real Chrome browser on the same request would send around fifteen, and the content of each one is meaningfully different.

The User-Agent is the most obvious signal — python-requests/2.33.1 is not subtle — but it's also the least interesting one to fixate on, because it's rarely the only reason a request fails. The deeper issue is the overall fingerprint: the combination of which headers are present, in what order, and with what values.

What a browser actually sends

For a standard top-level page navigation, Chrome sends something like this:

User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8
Accept-Language: en-GB,en;q=0.9
Accept-Encoding: gzip, deflate, br
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Connection: keep-alive

A few things stand out:

Accept is specific. The browser advertises exactly which content types it can handle, with quality weights (q=0.9). The requests default of */* says "anything goes," which is an unusual declaration for something claiming to be a browser.

Sec-Fetch-* headers are a family added by Chrome in 2019. Sec-Fetch-Dest: document says the request is fetching a top-level document. Sec-Fetch-Mode: navigate says it's a navigation between documents, and Sec-Fetch-User: ?1 adds that a user action triggered it. Sec-Fetch-Site: none says there's no initiating site (i.e., the URL was typed directly or opened from a bookmark). These headers don't affect what most sites return, but sites that check for them will immediately identify an absent set as non-browser traffic.

Accept-Language identifies the browser's locale. Absent entirely in a raw requests call.
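The gap can be checked mechanically. The sketch below compares a set of outgoing headers against the navigation headers listed above; missing_browser_headers and the header list are hypothetical helpers of mine, and the list covers only the headers this post mentions, not everything a current Chrome build sends.

```python
BROWSER_NAVIGATION_HEADERS = [
    "User-Agent", "Accept", "Accept-Language", "Accept-Encoding",
    "Upgrade-Insecure-Requests", "Sec-Fetch-Dest", "Sec-Fetch-Mode",
    "Sec-Fetch-Site", "Sec-Fetch-User",
]


def missing_browser_headers(sent: dict) -> list:
    """Return browser navigation headers absent from `sent` (case-insensitive)."""
    present = {name.lower() for name in sent}
    return [h for h in BROWSER_NAVIGATION_HEADERS if h.lower() not in present]


# The four defaults shown earlier for a bare requests call
requests_defaults = {
    "User-Agent": "python-requests/2.33.1",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "*/*",
    "Connection": "keep-alive",
}

print(missing_browser_headers(requests_defaults))
```

This only checks presence. A header that is present with an implausible value, like Accept: */*, still needs fixing by hand.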

Building a session with browser-like headers

The right approach is to set your headers once on a Session object, not on every individual request. Sessions also handle cookies automatically across requests, which matters as soon as you need to scrape pages behind a login or track state.

import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}

session = requests.Session()
session.headers.update(HEADERS)

With this session, every session.get() and session.post() call will include these headers automatically. You can override individual headers per-request by passing a headers dict to the call — the per-request dict is merged with the session headers, with the per-request values winning on collision:

# Adds Referer for this request only; all other session headers still apply
resp = session.get(
    "https://example.com/products/",
    headers={"Referer": "https://example.com/"},
)

A working example

books.toscrape.com is a site built specifically for scraping practice. Here's a complete working scraper using the session approach:

import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}

session = requests.Session()
session.headers.update(HEADERS)

resp = session.get("https://books.toscrape.com/", timeout=15)
resp.raise_for_status()
resp.encoding = "utf-8"  # server omits charset in Content-Type; set it explicitly

soup = BeautifulSoup(resp.text, "html.parser")
books = soup.find_all("article", class_="product_pod")

for book in books:
    title  = book.find("h3").find("a")["title"]
    price  = book.find("p", class_="price_color").text.strip()
    rating = book.find("p", class_="star-rating")["class"][1]
    print(f"{title} | {price} | {rating} stars")

Output:

A Light in the Attic | £51.77 | Three stars
Tipping the Velvet | £53.74 | One stars
Soumission | £50.10 | One stars
Sharp Objects | £47.82 | Four stars
Sapiens: A Brief History of Humankind | £54.23 | Five stars
...

One note on resp.encoding = "utf-8": when a server sends Content-Type: text/html without a charset parameter, requests defaults to ISO-8859-1 per the HTTP spec. That produces garbled currency symbols (£ becomes Â£). Setting it explicitly to UTF-8 before accessing resp.text fixes it. Alternatively, pass resp.content (raw bytes) to BeautifulSoup and let it detect the encoding from the HTML meta tags — but that can fail on malformed pages, so explicit is safer.
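The garbling is easy to reproduce without a network request. This short sketch decodes the same UTF-8 bytes both ways; the price string is just the first one from the output above.

```python
raw = "£51.77".encode("utf-8")      # bytes as they travel over the wire

wrong = raw.decode("iso-8859-1")    # what the ISO-8859-1 fallback produces
right = raw.decode("utf-8")         # what setting resp.encoding = "utf-8" produces

print(wrong)
print(right)
```

The pound sign is two bytes in UTF-8, and ISO-8859-1 reads each byte as its own character, which is where the stray Â comes from.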

When you need to scrape multiple pages concurrently

If you're pulling many pages and speed matters, httpx gives you an async-compatible API with the same session model. The header setup is identical; you just swap the client:

import asyncio
import httpx
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}

async def fetch_pages(urls: list[str]) -> list[httpx.Response]:
    async with httpx.AsyncClient(
        headers=HEADERS,
        timeout=15,
        follow_redirects=True,
    ) as client:
        tasks = [client.get(url) for url in urls]
        return await asyncio.gather(*tasks)

urls = [
    "https://books.toscrape.com/catalogue/page-1.html",
    "https://books.toscrape.com/catalogue/page-2.html",
    "https://books.toscrape.com/catalogue/page-3.html",
]

responses = asyncio.run(fetch_pages(urls))

for url, resp in zip(urls, responses):
    soup = BeautifulSoup(resp.text, "html.parser")
    books = soup.find_all("article", class_="product_pod")
    print(f"{url} -> {len(books)} books")

Output:

https://books.toscrape.com/catalogue/page-1.html -> 20 books
https://books.toscrape.com/catalogue/page-2.html -> 20 books
https://books.toscrape.com/catalogue/page-3.html -> 20 books

httpx.AsyncClient fetches all three pages concurrently rather than sequentially. For 3 pages the difference is negligible; for 50 it's significant.

One thing to watch: by default asyncio.gather() propagates the first exception raised by any request, so a single failure costs you the results of the whole batch. In production you'd want to wrap each client.get() call in a try/except, or pass return_exceptions=True to gather() so a failed request comes back as a value instead.
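Here is a sketch of the return_exceptions=True variant. To keep it runnable offline, fake_get is a hypothetical stand-in for client.get() that fails for one made-up URL; the splitting of results into ok and failed is the part that carries over to the httpx version.

```python
import asyncio


async def fake_get(url: str) -> str:
    """Stand-in for client.get(); fails for one hypothetical URL."""
    await asyncio.sleep(0)
    if "broken" in url:
        raise ConnectionError(f"could not fetch {url}")
    return f"<html>{url}</html>"


async def fetch_all(urls: list) -> tuple:
    results = await asyncio.gather(
        *(fake_get(u) for u in urls),
        return_exceptions=True,  # failures come back as values, not raises
    )
    # gather() preserves input order, so results line up with urls
    ok = {u: r for u, r in zip(urls, results) if not isinstance(r, BaseException)}
    failed = {u: r for u, r in zip(urls, results) if isinstance(r, BaseException)}
    return ok, failed


urls = [
    "https://example.com/a",
    "https://example.com/broken",
    "https://example.com/c",
]
ok, failed = asyncio.run(fetch_all(urls))
print(f"{len(ok)} succeeded, {len(failed)} failed: {list(failed)}")
```

The failed dict keeps the exception objects, so the failing URLs can be logged or retried afterwards.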

A helper to catch silent failures

A 200 response doesn't mean you got what you wanted. Some sites return 200 with a block page or a challenge in the body. Add a quick check to your fetch function:

import requests


def fetch(session: requests.Session, url: str, **kwargs) -> requests.Response:
    resp = session.get(url, timeout=15, **kwargs)
    resp.raise_for_status()

    body = resp.text.lower()
    if "access denied" in body or "captcha" in body or "enable javascript" in body:
        raise ValueError(f"Block signal detected in response body for {url}")

    return resp

This won't catch every case — some challenge pages use different wording, and JavaScript-rendered content simply won't appear in the HTML at all — but it catches the obvious ones early and turns a silent data quality problem into a loud error.

Quick reference: headers you should always set

Header	Why it matters
`User-Agent`	Identifies the client; `python-requests/x.x` is an instant flag
`Accept`	Advertises supported content types; `*/*` is unusual for browser traffic
`Accept-Language`	Absent in raw `requests`; browsers always send it
`Upgrade-Insecure-Requests`	Signals that the client prefers an HTTPS response; Chrome sends it on navigation requests
`Sec-Fetch-Dest`	Part of the Fetch metadata spec; absent headers are a signal
`Sec-Fetch-Mode`	As above
`Sec-Fetch-Site`	As above
`Sec-Fetch-User`	Indicates user-initiated navigation

Copy the values directly from your browser's DevTools for the site you're targeting. The values above are correct for Chrome on Linux navigating to a top-level page; they differ slightly for XHR requests, POST submissions, and sub-resource loads (images, scripts).

What this doesn't solve

Better headers get you further than the bare requests default, but they're not a complete solution. Two scenarios they won't help with:

JavaScript-rendered content. If the data you want is injected by JavaScript after the initial HTML loads, requests and httpx will never see it. The HTML response simply won't contain those elements. The fix for that is a real browser via Playwright or Selenium — or finding the underlying API call the JavaScript is making (often easier and more reliable). That's what the next post in this series covers.

TLS fingerprinting. The HTTP headers your scraper sends are one signal; the TLS handshake is another. Some detection systems check the cipher suite order and TLS extension profile your client presents, which differs from Chrome's even if your headers match exactly. requests uses Python's ssl module, which has a distinct fingerprint. Addressing that requires a library like curl_cffi that wraps libcurl's TLS stack and can impersonate Chrome's handshake.

For most sites and most scraping tasks, the session approach above is enough to get started. When it isn't, the failure mode is usually clear: you'll get 403s, CAPTCHAs, or suspiciously empty responses regardless of how you set your headers.

Tags: python scrapy webscraping tutorial

Building a self-hosted browser scraping service (is it more hassle than it's worth?)

John Rooney — Wed, 27 May 2026 13:54:31 +0000

There is a version of this project that is not worth doing. If you need browser rendering for a handful of URLs, pointing Playwright at a local binary and running it is fine. If you need to scale to thousands of requests and you want someone else to manage infrastructure, fingerprinting, proxies, and binary maintenance, Zyte API's headless browser handles all of that without any of what follows.

But if you want to understand exactly how a browser scraping service works at the infrastructure level, or you have a steady workload that you want running on hardware you already own, building one yourself teaches you things that matter. This article documents what that build required, the decisions behind each part of it, and the places where I would reach for Zyte API instead.

The architecture: separating the browser from the code that drives it

The foundational point is that Playwright is a control library, not a browser. It speaks Chrome DevTools Protocol (CDP) to whatever binary you point it at, and the binary is entirely separate from the library. This distinction is what makes a remote browser service possible.

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        # Local: Playwright launches and manages the browser itself
        # browser = await p.chromium.launch()

        # Remote: Playwright connects to a browser running elsewhere
        browser = await p.chromium.connect("ws://192.168.1.100:3000")

        # From here, the API is identical
        context = await browser.new_context()
        page = await context.new_page()
        await page.goto("https://example.com")
        await browser.close()

asyncio.run(main())

When you call p.chromium.connect(), the library stays on your machine and the browser runs on the server. Your scraping scripts become clients of a persistent browser service, which means multiple projects can share one browser instance, and the hardware running the browser can be completely separate from the hardware running your code.

The finished setup is four things working together: a patched Chromium binary (CloakBrowser), a virtual framebuffer so the browser runs headed on a machine with no display (Xvfb), the Playwright server process that accepts WebSocket connections, and Docker with supervisord managing the whole thing.

I am running this on an HP ProDesk 405 G6 with a Ryzen 4650G and 32GB of RAM. It is a small form factor desktop that draws very little power, runs Linux natively, and handles 16 concurrent browser contexts without difficulty.

Why the choice of binary matters

When a browser is put into automation mode, it is supposed to advertise that fact. navigator.webdriver = true is in the W3C WebDriver spec, not an incidental side effect of Playwright. Detection is not finding a bug in your setup; it is reading a flag the spec requires.

The detection surface has three distinct layers. At the JavaScript level there are visible properties: navigator.webdriver, the shape of navigator.plugins, and the presence or absence of window.chrome. These can be overridden before the page loads, but the overrides are detectable because the property descriptor and prototype chain look different from what a native property would produce. At the binary level there are internal automation flags and the CDP debugging port being open on localhost, which pages can probe via timing differences in connection failures. At the network level, TLS handshake characteristics and HTTP/2 settings are compiled into the browser's network stack and cannot be changed from JS or from Playwright settings.

# navigator.webdriver in a standard Playwright browser
> navigator.webdriver
true

# In a patched binary, the property is removed at the source
> navigator.webdriver
undefined

Projects like CloakBrowser patch Chromium at the C++ level before compilation, which means the signals are never emitted rather than overridden after the fact. A JS-level patch leaves something to detect; a binary-level patch does not. This is the reason patched binaries exist rather than simply using playwright-stealth or similar libraries.

Getting CloakBrowser into the container requires a specific step: Playwright maintains two Chromium slot directories, and you need to replace both.

# Replace both the full Chromium slot and the headless shell slot
RUN npx playwright install chromium \
    && SLOT=$(ls /root/.cache/ms-playwright/ | grep '^chromium-') \
    && CHROME_DIR="/root/.cache/ms-playwright/$SLOT/chrome-linux64" \
    && rm -rf "$CHROME_DIR" \
    && cp -r /browsers/chromium/. "$CHROME_DIR/" \
    && chmod +x "$CHROME_DIR/chrome" \
    # Playwright prefers the headless shell for headless mode
    # Replace this slot too or Playwright will ignore your patched binary
    && HS_SLOT=$(ls /root/.cache/ms-playwright/ | grep '^chromium_headless_shell-') \
    && HS_DIR="/root/.cache/ms-playwright/$HS_SLOT/chrome-headless-shell-linux64" \
    && rm -rf "$HS_DIR" \
    && cp -r /browsers/chromium/. "$HS_DIR/" \
    && mv "$HS_DIR/chrome" "$HS_DIR/chrome-headless-shell" \
    && chmod +x "$HS_DIR/chrome-headless-shell"

The second slot (chromium_headless_shell) is what Playwright uses when it runs in headless mode. If you only replace the first slot, Playwright silently falls back to its bundled binary and your patched version is never used. This took several hours to diagnose, and the only way to catch it was watching ps aux during an active browser session to read the actual binary path in the process arguments.
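
The ps aux check can be scripted so it is repeatable after every image rebuild. This is a minimal sketch, assuming the browser cache lives under an ms-playwright directory as in the Dockerfile above; the function name playwright_binaries and the pattern are illustrative, not part of Playwright:

```python
import re

# Assumption: browser slots live under .../ms-playwright/<slot>/..., matching
# the paths used in the Dockerfile above. Slot names start with "chromium".
SLOT_PATTERN = re.compile(r"(\S*/ms-playwright/(chromium[\w.-]*)/\S+)")


def playwright_binaries(ps_output: str) -> dict[str, str]:
    """Map each Playwright slot seen in `ps` output to a binary path running from it."""
    found: dict[str, str] = {}
    for line in ps_output.splitlines():
        match = SLOT_PATTERN.search(line)
        if match:
            path, slot = match.groups()
            # Keep the first path seen per slot; child processes repeat it.
            found.setdefault(slot, path)
    return found
```

Feeding it the text of ps aux captured during an active session (for example via subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout) shows which slot the running browser actually came from.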

Why headed mode, and why that requires Xvfb

Headless mode is one of the more reliable detection signals available. The browser reports different screen properties, WebGL behaves differently, the font rendering pipeline changes, and the User-Agent string typically contains HeadlessChrome rather than Chrome. The fingerprint for headless Chromium has been studied for years by antibot vendors.

Running the browser headed via Xvfb (X Virtual Framebuffer) removes this entire category of signal. Xvfb provides a virtual display that the browser renders into without needing a physical monitor. The browser has no idea it is running on a headless machine; its screen properties, rendering pipeline, and UA string all reflect a genuine headed session.

# Install Xvfb alongside browser dependencies
RUN apt-get install -y xvfb

# Set the display environment variable
ENV DISPLAY=:99

Xvfb itself is started by supervisord (configured in the next section). The Playwright server startup script only points the browser at display :99 before launching the server:

#!/bin/bash
export DISPLAY=":99"
export PLAYWRIGHT_CHROMIUM_USE_HEADLESS_NEW=0
export PW_TEST_HEADED=1
exec npx playwright run-server --port 3000 --host 0.0.0.0

The tradeoff is slightly higher memory per context compared to headless mode. On 32GB of RAM running 16 concurrent contexts, this is not a practical constraint.

Why supervisord rather than a simpler process setup

A browser service is not one process. It is Xvfb, the Playwright server, and eventually many browser child processes. Docker's default model is one foreground process per container, which does not fit. A shell script with basic process management works until something crashes out of order; supervisord handles ordering, monitoring, and restart behavior cleanly.

[program:xvfb]
command=Xvfb :99 -screen 0 1920x1080x24 -ac
autorestart=true
priority=10

[program:playwright]
command=/start-playwright.sh
autorestart=true
priority=20
startsecs=3
stdout_logfile=/var/log/supervisor/playwright.log
stderr_logfile=/var/log/supervisor/playwright.log

The priority ordering starts Xvfb before Playwright, and autorestart brings either process back up if it crashes. One detail worth noting: environment variables set in supervisord's environment directive do not reliably propagate into child processes. The Playwright startup script sets them directly with export to avoid this.

Concurrency: one browser instance, many contexts

A single browser instance runs multiple isolated contexts. Each context has separate cookies, separate session storage, and separate state, so contexts behave like independent browser profiles sharing one process. For most scraping workloads, one instance with a pool of contexts is the right model: you avoid the startup cost of launching a new process for each request while maintaining clean isolation between sessions.

The async queue pattern works well here. Workers pull URLs from the queue, create a context, scrape, close the context, and immediately pick up the next URL. A 403 response puts the URL back on the queue after a backoff delay, and the worker then carries on with other jobs.

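async def worker(worker_id, browser, queue, results):
    while True:
        # Wait for the next job. get_nowait() would raise QueueEmpty as soon
        # as the queue is momentarily empty, even if a retry is still pending.
        url, attempt = await queue.get()
        try:
            result, should_retry = await scrape(browser, url)

            if result:
                results.append(result)
            elif should_retry and attempt < MAX_RETRIES:
                await asyncio.sleep(2 * attempt)
                await queue.put((url, attempt + 1))
        finally:
            # Always mark the job done so queue.join() cannot hang
            queue.task_done()

# Spin up N workers against the same browser instance
workers = [
    asyncio.create_task(worker(i, browser, queue, results))
    for i in range(CONCURRENCY)
]
await queue.join()

# The workers loop forever, so cancel them once the queue is drained
for task in workers:
    task.cancel()
await asyncio.gather(*workers, return_exceptions=True)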
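
To watch the requeue behaviour end to end without a browser, here is a self-contained sketch of the same pattern. fake_scrape is a stand-in for scrape(), and the values of MAX_RETRIES and CONCURRENCY, the shortened backoff, and the rule that a URL containing "flaky" fails once are all assumptions made for the demo:

```python
import asyncio

MAX_RETRIES = 3   # assumed value for the demo
CONCURRENCY = 4   # assumed value, kept small for the demo


async def fake_scrape(url, attempt):
    """Stand-in for scrape(): a URL containing 'flaky' fails on its first attempt."""
    await asyncio.sleep(0)  # yield to the event loop as real I/O would
    if "flaky" in url and attempt == 1:
        return None, True   # no result, but worth retrying
    return {"url": url, "attempt": attempt}, False


async def worker(queue, results):
    while True:
        url, attempt = await queue.get()
        try:
            result, should_retry = await fake_scrape(url, attempt)
            if result:
                results.append(result)
            elif should_retry and attempt < MAX_RETRIES:
                await asyncio.sleep(0.01 * attempt)  # shortened backoff for the demo
                await queue.put((url, attempt + 1))
        finally:
            queue.task_done()


async def crawl(urls):
    queue = asyncio.Queue()
    results = []
    for url in urls:
        queue.put_nowait((url, 1))
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(CONCURRENCY)]
    await queue.join()   # returns once every queued item is marked done
    for task in workers:
        task.cancel()    # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Because the retry is put back on the queue before task_done() is called for the failed attempt, queue.join() cannot return while a retry is still outstanding.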

Proxy credentials go in new_context() per context, not at the browser level. Using residential proxies with sticky sessions means the same exit IP handles the full page load and all subresource requests, which matters for sites that correlate requests within a session.

context = await browser.new_context(
    proxy={
        "server": "http://proxy-provider:port",
        "username": "your-username",
        "password": "your-password",
    },
    locale="en-US",
    timezone_id="America/New_York",
    viewport={"width": 1920, "height": 1080},
)

# Block unnecessary resource types to reduce proxy bandwidth
await context.route(
    "**/*",
    lambda route: route.abort()
    if route.request.resource_type in ("image", "media", "font", "stylesheet")
    else route.continue_()
)

Blocking images, fonts, and stylesheets at the context level cuts proxy bandwidth significantly without affecting the data you are trying to extract. At 16 concurrent contexts on the ProDesk, throughput is limited by proxy response time rather than CPU or memory.

What this setup requires of you

The list of things you need to manage and maintain: binary updates as antibot vendors adapt to CloakBrowser, Docker image rebuilds when Playwright updates and the slot structure changes, proxy provider accounts and rotation logic, Xvfb stability under load, supervisord configuration, and the ongoing work of tuning context settings for new target sites.

This is not a set-it-and-forget-it infrastructure. It is a platform that requires active maintenance, and the engineering time that goes into it is real. As the challenges of scaling Playwright and Puppeteer make clear, the operational surface of a browser scraping operation grows quickly once you move beyond a single machine.

When Zyte API is the better answer

For many use cases, Zyte API removes the operational overhead described above entirely. Zyte API's headless browser is a purpose-built scraping browser with proxy management, session handling, and unblocking built in. You make a request, you get a rendered page. The binary maintenance, the fingerprint tuning, the proxy rotation, and the infrastructure management are handled for you.

The comparison with a self-hosted setup comes down to three questions.

Scale and cost. At low to medium volume, a home server with existing hardware costs only proxy fees and electricity. At high volume, the per-request pricing of a managed service can be more economical than the engineering time required to maintain and scale your own infrastructure.

Maintenance tolerance. Antibot vendors update their detection logic continuously. Staying ahead of them with a self-hosted binary means tracking binary releases, testing against real targets, and rebuilding regularly. Zyte API abstracts this.

Integration depth. If you are already working in the Scrapy ecosystem, Scrapy Cloud and the Zyte API Scrapy integration give you a managed pipeline with monitoring, scheduling, and data delivery. Building the equivalent from scratch on self-hosted infrastructure is a significant project.

A working self-hosted setup with a patched binary, residential proxies, and 16 concurrent contexts gets through a meaningful range of real targets. For targets that require it and for workloads that justify the maintenance overhead, it is a legitimate option. For everything else, start with Zyte API for free and skip the part where you watch ps aux for 30 minutes trying to figure out why Playwright is launching the wrong binary.

The repo

The complete setup, including the Dockerfile, docker-compose configuration, and supervisord setup, is available at [github link]. CloakBrowser is sourced separately from their releases page and is not included in the repository. The Dockerfile handles replacing both Playwright Chromium slots with the patched build once you have the tarball in place.

How to write and publish a Python package to PyPI

John Rooney — Mon, 11 May 2026 09:57:41 +0000

I wanted to publish my Scrapy download handler to PyPI, and uv made it incredibly easy. Here's how.

At some point, every scraping developer writes the same middleware twice.

The first time is in project A. It works well, so when project B comes along you copy it across. Then project C. Then a colleague asks for it. You email them the file. They make changes. Now there are four versions of the same code living in four different repositories, diverging slowly, none of them getting each other's bug fixes.

The solution is a package. Publish it once to the Python Package Index (PyPI) and every project that needs it can install it with pip install your-package. Updates go to everyone at once. The code has a home.

This guide walks through the full process using uv, a fast, modern Python toolchain that replaces pip, virtualenv, pip-tools, twine, and build with a single tool. We will write a reusable Scrapy download handler, structure it as a proper Python package, test it, and publish it to PyPI.

By the end, the package will be installable with:

uv add scrapy-random-delay

And usable in any Scrapy project with two settings in settings.py:

DOWNLOAD_HANDLERS = {
    "http":  "scrapy_random_delay.RandomDelayHandler",
    "https": "scrapy_random_delay.RandomDelayHandler",
}
RANDOM_DELAY_RANGE = (0.5, 3.0)  # seconds

Installing uv

If you do not have uv installed:

# macOS and Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or via pip if you prefer
pip install uv

Verify the installation:

uv --version
# uv 0.5.0 (or later)

uv is written in Rust and is dramatically faster than pip for dependency resolution and installation: typically 10 to 100 times faster. More importantly for this guide, it provides a complete project management workflow that makes creating and publishing packages significantly simpler than the traditional pip, build, and twine toolchain.

What we are building

A Scrapy download handler that adds a random delay drawn from a configurable range before each request, with optional per-domain configuration. It is realistic, self-contained, and exactly the kind of code worth packaging: useful across multiple projects but simple enough to understand completely.

How Scrapy download handlers work

A download handler is responsible for a URL scheme. Scrapy's default HTTP handler takes a request, makes the network call, and returns the response. A custom handler wraps this: add headers, impose a delay, swap the underlying HTTP library. The API is simple:

class MyHandler:
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    async def download_request(self, request, spider):
        # return a Response
        pass

    def close(self):
        pass

Our handler delegates actual HTTP work to Scrapy's built-in HTTPDownloadHandler and adds the delay logic around it.

Creating the project

uv init scaffolds a new package project in one command:

uv init --package scrapy-random-delay
cd scrapy-random-delay

The --package flag tells uv to create a proper importable package rather than a single-script project. The generated structure looks like this:

scrapy-random-delay/
├── src/
│   └── scrapy_random_delay/
│       └── __init__.py
├── pyproject.toml
└── README.md

uv uses the src/ layout by default, placing your package inside src/ so it cannot be accidentally imported from the project root during development, which can mask packaging errors.

Look at what uv generated in pyproject.toml:

[project]
name = "scrapy-random-delay"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = []

[project.scripts]
scrapy-random-delay = "scrapy_random_delay:main"

[build-system]
requires = ["uv_build>=0.11.7,<0.12.0"]
build-backend = "uv_build"

Two things to note: the [project.scripts] entry is scaffolding you can delete, since the package exports a handler class rather than a command-line entry point. The build backend is uv_build, uv's own backend introduced in recent versions. We will fill in the rest of the metadata as we go.

Adding dependencies

Add Scrapy as a runtime dependency:

uv add scrapy

Add development dependencies — pytest and pytest-asyncio for testing:

uv add --dev pytest pytest-asyncio

uv writes these into pyproject.toml automatically and creates a uv.lock lockfile that pins exact versions for reproducible installs. You do not need to manually edit pyproject.toml for dependencies.

The resulting pyproject.toml dependencies section:

[project]
dependencies = [
    "scrapy>=2.12",
]

[dependency-groups]
dev = [
    "pytest>=8.0",
    "pytest-asyncio>=0.24",
]

Writing the package

Add the handler module:

touch src/scrapy_random_delay/handler.py

# src/scrapy_random_delay/handler.py
import asyncio
import random
import logging
from typing import Tuple

from scrapy import Request
from scrapy.crawler import Crawler
from scrapy.http import Response
from scrapy.settings import Settings
from scrapy.core.downloader.handlers.http11 import HTTP11DownloadHandler as HTTPDownloadHandler

logger = logging.getLogger(__name__)


class RandomDelayHandler:
    """
    A Scrapy download handler that adds a random delay before each request.

    Settings
    --------
    RANDOM_DELAY_RANGE : tuple[float, float]
        (min_seconds, max_seconds) delay range. Default: (0.5, 2.0)

    RANDOM_DELAY_PER_DOMAIN : dict[str, tuple[float, float]]
        Per-domain overrides. Keys are domain strings, values are (min, max) tuples.
        Example: {"api.example.com": (1.0, 4.0), "cdn.example.com": (0.0, 0.1)}

    RANDOM_DELAY_VERBOSE : bool
        Log the actual delay chosen for each request. Default: False
    """

    DEFAULT_RANGE: Tuple[float, float] = (0.5, 2.0)

    def __init__(self, settings: Settings, crawler: Crawler):
        self._settings = settings
        self._range = self._parse_range(
            settings.get("RANDOM_DELAY_RANGE", self.DEFAULT_RANGE)
        )
        self._per_domain: dict = settings.getdict("RANDOM_DELAY_PER_DOMAIN", {})
        self._verbose: bool = settings.getbool("RANDOM_DELAY_VERBOSE", False)
        self._delegate = HTTPDownloadHandler(settings, crawler)

    @classmethod
    def from_crawler(cls, crawler: Crawler) -> "RandomDelayHandler":
        return cls(crawler.settings, crawler)

    def _parse_range(self, value) -> Tuple[float, float]:
        try:
            low, high = float(value[0]), float(value[1])
        except (TypeError, ValueError, IndexError) as e:
            raise ValueError(
                f"RANDOM_DELAY_RANGE must be a two-element sequence of numbers, got: {value!r}"
            ) from e

        if low < 0 or high < 0:
            raise ValueError("RANDOM_DELAY_RANGE values must be non-negative")
        if low > high:
            raise ValueError(
                f"RANDOM_DELAY_RANGE minimum ({low}) must not exceed maximum ({high})"
            )
        return low, high

    def _get_delay_range(self, request: Request) -> Tuple[float, float]:
        from urllib.parse import urlparse
        domain = urlparse(request.url).netloc
        if domain in self._per_domain:
            return self._parse_range(self._per_domain[domain])
        return self._range

    async def download_request(self, request: Request, spider) -> Response:
        low, high = self._get_delay_range(request)
        delay = random.uniform(low, high)

        if self._verbose:
            logger.debug(f"Random delay: {delay:.2f}s before {request.url}")

        await asyncio.sleep(delay)
        return await self._delegate.download_request(request, spider)

    def close(self):
        self._delegate.close()
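
One detail in _get_delay_range is worth a second look: urlparse(...).netloc keeps any port number, so an override keyed on api.example.com would not match https://api.example.com:8443/. Below is a minimal sketch of a lookup keyed on the hostname instead. pick_delay_range and its arguments are hypothetical names for illustration, not part of the package above:

```python
from urllib.parse import urlparse

DEFAULT_RANGE = (0.5, 2.0)  # same default as the handler above


def pick_delay_range(url, per_domain, default=DEFAULT_RANGE):
    """Return the (min, max) delay for a URL, matching on hostname without the port."""
    # .hostname drops the port and userinfo and lowercases the name;
    # .netloc would return "api.example.com:8443" for a URL with a port.
    host = urlparse(url).hostname or ""
    low, high = per_domain.get(host, default)
    return float(low), float(high)
```

Whether ports should share a delay range is a design choice; keeping netloc is reasonable if you want per-port overrides.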

Update __init__.py to export the handler and set the version:

# src/scrapy_random_delay/__init__.py
from scrapy_random_delay.handler import RandomDelayHandler

__version__ = "0.1.0"
__all__ = ["RandomDelayHandler"]

Writing the tests

Create the test directory:

mkdir tests
touch tests/__init__.py tests/conftest.py tests/test_handler.py

# tests/conftest.py
import pytest
from unittest.mock import MagicMock
from scrapy.http import Request, Response
from scrapy.utils.test import get_crawler


@pytest.fixture
def crawler():
    return get_crawler()


@pytest.fixture
def make_request():
    def _make(url="https://example.com/products", **kwargs):
        return Request(url=url, **kwargs)
    return _make


@pytest.fixture
def mock_response():
    return MagicMock(spec=Response)

# tests/test_handler.py
import pytest
from unittest.mock import AsyncMock, MagicMock, patch

from scrapy.utils.test import get_crawler
from scrapy_random_delay import RandomDelayHandler


def make_handler(settings_dict: dict = None) -> RandomDelayHandler:
    crawler = get_crawler(settings_dict=settings_dict or {})
    with patch("scrapy_random_delay.handler.HTTPDownloadHandler"):
        return RandomDelayHandler(crawler.settings, crawler)


class TestRangeValidation:
    def test_default_range_applied_when_not_configured(self):
        handler = make_handler()
        assert handler._range == (0.5, 2.0)

    def test_custom_range_accepted(self):
        handler = make_handler({"RANDOM_DELAY_RANGE": (1.0, 5.0)})
        assert handler._range == (1.0, 5.0)

    def test_equal_values_accepted(self):
        handler = make_handler({"RANDOM_DELAY_RANGE": (2.0, 2.0)})
        assert handler._range == (2.0, 2.0)

    def test_zero_range_accepted(self):
        handler = make_handler({"RANDOM_DELAY_RANGE": (0.0, 0.0)})
        assert handler._range == (0.0, 0.0)

    def test_inverted_range_raises(self):
        with pytest.raises(ValueError, match="minimum.*must not exceed maximum"):
            make_handler({"RANDOM_DELAY_RANGE": (5.0, 1.0)})

    def test_negative_range_raises(self):
        with pytest.raises(ValueError, match="non-negative"):
            make_handler({"RANDOM_DELAY_RANGE": (-1.0, 2.0)})

    def test_invalid_type_raises(self):
        with pytest.raises(ValueError, match="two-element sequence"):
            make_handler({"RANDOM_DELAY_RANGE": "bad-value"})


class TestPerDomainOverrides:
    def test_per_domain_range_returned_for_matching_domain(self, make_request):
        handler = make_handler({
            "RANDOM_DELAY_RANGE": (0.5, 1.0),
            "RANDOM_DELAY_PER_DOMAIN": {"api.example.com": [2.0, 4.0]},
        })
        request = make_request(url="https://api.example.com/products")
        assert handler._get_delay_range(request) == (2.0, 4.0)

    def test_default_range_returned_for_unmatched_domain(self, make_request):
        handler = make_handler({
            "RANDOM_DELAY_RANGE": (0.5, 1.0),
            "RANDOM_DELAY_PER_DOMAIN": {"api.example.com": [2.0, 4.0]},
        })
        request = make_request(url="https://other.example.com/products")
        assert handler._get_delay_range(request) == (0.5, 1.0)


class TestDownloadRequest:
    @pytest.mark.asyncio
    async def test_delegates_to_inner_handler(self, make_request, mock_response):
        handler = make_handler({"RANDOM_DELAY_RANGE": (0.0, 0.0)})
        handler._delegate.download_request = AsyncMock(return_value=mock_response)

        result = await handler.download_request(make_request(), MagicMock())

        assert result is mock_response
        handler._delegate.download_request.assert_awaited_once()

    @pytest.mark.asyncio
    async def test_delay_is_within_range(self, make_request, mock_response):
        import time
        handler = make_handler({"RANDOM_DELAY_RANGE": (0.05, 0.1)})
        handler._delegate.download_request = AsyncMock(return_value=mock_response)

        start = time.monotonic()
        await handler.download_request(make_request(), MagicMock())
        elapsed = time.monotonic() - start

        assert 0.05 <= elapsed <= 0.5  # generous upper bound for slow CI

    @pytest.mark.asyncio
    async def test_verbose_mode_logs_delay(self, make_request, mock_response, caplog):
        import logging
        handler = make_handler({
            "RANDOM_DELAY_RANGE": (0.0, 0.0),
            "RANDOM_DELAY_VERBOSE": True,
        })
        handler._delegate.download_request = AsyncMock(return_value=mock_response)

        with caplog.at_level(logging.DEBUG, logger="scrapy_random_delay.handler"):
            await handler.download_request(make_request(), MagicMock())

        assert "Random delay" in caplog.text


class TestClose:
    def test_close_delegates_to_inner_handler(self):
        handler = make_handler()
        handler._delegate.close = MagicMock()
        handler.close()
        handler._delegate.close.assert_called_once()

Configure pytest in pyproject.toml by adding this section:

[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths    = ["tests"]

Run the tests with uv:

uv run pytest

uv run executes the command inside the project's virtual environment, which uv manages automatically. No source .venv/bin/activate needed.

Completing pyproject.toml

Fill in the project metadata. uv has already set the basics; add the rest:

[project]
name            = "scrapy-random-delay"
version         = "0.1.0"
description     = "A Scrapy download handler that adds a configurable random delay before each request."
readme          = "README.md"
license         = { file = "LICENSE" }
requires-python = ">=3.9"
authors         = [
    { name = "Your Name", email = "you@example.com" },
]
keywords        = ["scrapy", "web-scraping", "download-handler", "rate-limiting"]
classifiers     = [
    "Development Status :: 4 - Beta",
    "Intended Audience :: Developers",
    "License :: OSI Approved :: MIT License",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Framework :: Scrapy",
    "Topic :: Internet :: WWW/HTTP",
    "Topic :: Software Development :: Libraries :: Python Modules",
]
dependencies = [
    "scrapy>=2.8",
]

[dependency-groups]
dev = [
    "pytest>=8.0",
    "pytest-asyncio>=0.24",
]

[project.urls]
Homepage      = "https://github.com/yourname/scrapy-random-delay"
Documentation = "https://github.com/yourname/scrapy-random-delay#readme"
Repository    = "https://github.com/yourname/scrapy-random-delay.git"
Issues        = "https://github.com/yourname/scrapy-random-delay/issues"

[build-system]
requires      = ["uv_build>=0.11.7,<0.12.0"]
build-backend = "uv_build"

[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths    = ["tests"]

The classifiers list is used by PyPI for filtering and discovery. The Framework :: Scrapy classifier ensures your package appears when someone browses the Scrapy ecosystem. Browse the full list at pypi.org/classifiers.

Writing the README

# scrapy-random-delay

A Scrapy download handler that adds a configurable random delay before each request.

Scrapy's built-in `DOWNLOAD_DELAY` setting adds a fixed delay between requests.
This package draws a delay from a uniform distribution between a minimum and maximum,
which produces more natural request timing and is configurable per domain.

## Installation

    uv add scrapy-random-delay
    # or
    pip install scrapy-random-delay

## Usage

Add to your Scrapy `settings.py`:

    DOWNLOAD_HANDLERS = {
        "http":  "scrapy_random_delay.RandomDelayHandler",
        "https": "scrapy_random_delay.RandomDelayHandler",
    }

    RANDOM_DELAY_RANGE = (0.5, 3.0)   # seconds (min, max). Default: (0.5, 2.0)

### Per-domain configuration

    RANDOM_DELAY_PER_DOMAIN = {
        "api.example.com": (1.0, 4.0),   # slower for sensitive endpoints
        "cdn.example.com": (0.0, 0.1),   # faster for static assets
    }

### Verbose logging

    RANDOM_DELAY_VERBOSE = True   # logs the delay chosen for each request

## Settings reference

| Setting | Type | Default | Description |
|---|---|---|---|
| `RANDOM_DELAY_RANGE` | `(float, float)` | `(0.5, 2.0)` | Min and max delay in seconds |
| `RANDOM_DELAY_PER_DOMAIN` | `dict` | `{}` | Per-domain delay overrides |
| `RANDOM_DELAY_VERBOSE` | `bool` | `False` | Log each delay to DEBUG |

## Compatibility

- Python 3.9+
- Scrapy 2.8+

## License

MIT

Publishing to PyPI

Create accounts

You need two accounts: test.pypi.org for the test registry, and pypi.org for the real registry that pip install and uv add use. Use the test registry first, since it resets periodically and will not pollute the real index with test uploads. Enable two-factor authentication on both, as PyPI requires it for publishing.

Publish to Test PyPI

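uv has building and publishing built in via uv build and uv publish, so there is no need to install twine or build separately. uv build writes the source distribution and wheel to dist/, and uv publish uploads whatever it finds there.

uv build
uv publish --publish-url https://test.pypi.org/legacy/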

You will be prompted for your Test PyPI username (__token__) and your API token. Create a token at test.pypi.org/manage/account/token.

Verify the Test PyPI installation

uv add scrapy-random-delay --index https://test.pypi.org/simple/

The --index flag adds Test PyPI ahead of the default index. PyPI itself remains the default, so uv can still fall back to it for dependencies like scrapy that are not on Test PyPI.

Verify it works:

import scrapy_random_delay
print(scrapy_random_delay.__version__)   # 0.1.0
from scrapy_random_delay import RandomDelayHandler
print(RandomDelayHandler)

Publish to the real PyPI

Once satisfied:

uv publish

Your package is live. Within a few minutes anyone can install it:

uv add scrapy-random-delay
# or
pip install scrapy-random-delay

Using a token stored in the environment

Rather than typing your token on every publish, store it in an environment variable:

export UV_PUBLISH_TOKEN=pypi-your-token-here
uv publish

Or configure it per-project in a .env file (never commit this):

# .env
UV_PUBLISH_TOKEN=pypi-your-token-here

# .gitignore
.env

Automating releases with GitHub Actions

Manually publishing works fine, but automating it means a new version ships by pushing a version tag, with no manual steps:

# .github/workflows/publish.yml
name: Publish to PyPI

on:
  push:
    tags:
      - "v*"   # triggers on tags like v0.1.0, v1.2.3

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v4

      - name: Run tests
        run: uv run pytest

  publish:
    needs: test
    runs-on: ubuntu-latest
    environment: pypi

    permissions:
      id-token: write   # required for trusted publishing

    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v4

      - name: Build distributions
        run: uv build

      - name: Publish to PyPI
        run: uv publish
        # Uses trusted publishing via OIDC, so no token needs to be stored in secrets
        # Configure at pypi.org/manage/account/publishing/

The workflow is notably short compared to the traditional pip, build, and twine approach, since uv handles both the build and the upload with one tool.

Trusted publishing is worth setting up. It authenticates the GitHub Actions workflow directly with PyPI using OpenID Connect, so you never need to store a PyPI token as a GitHub secret. Configure it once at pypi.org under your account's Publishing settings, specifying your GitHub username, repository name, and workflow file name.

To cut a release:

# Update version in pyproject.toml and src/scrapy_random_delay/__init__.py
git commit -am "Release v0.2.0"
git tag v0.2.0
git push && git push --tags

The workflow runs, tests pass, the package publishes.

Bumping versions

Recent uv releases can bump the version field in pyproject.toml with uv version --bump patch (or minor, or major), but that command does not touch __init__.py. The version lives in two places, pyproject.toml and __init__.py, and both need updating together.

For a small package, doing this by hand is fine. For a larger project, bump-my-version (the maintained successor to bump2version, and the tool that reads the pyproject.toml configuration shown below) or python-semantic-release automate the version increment, git commit, and tag in one command:

uv add --dev bump-my-version
uv run bump-my-version bump patch   # 0.1.0 → 0.1.1
uv run bump-my-version bump minor   # 0.1.0 → 0.2.0
uv run bump-my-version bump major   # 0.1.0 → 1.0.0

Configure it in pyproject.toml:

[tool.bumpversion]
current_version = "0.1.0"
commit          = true
tag             = true

[[tool.bumpversion.files]]
filename = "pyproject.toml"

[[tool.bumpversion.files]]
filename = "src/scrapy_random_delay/__init__.py"

Now bumping the version, committing, tagging, and triggering the publish workflow is a single command.

Testing against multiple Python versions

Add a test matrix to the workflow to catch version-specific issues before they reach users:

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11", "3.12"]

    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v4

      - name: Run tests on Python ${{ matrix.python-version }}
        run: uv run --python ${{ matrix.python-version }} pytest

uv handles Python version management directly: --python 3.9 downloads and uses that Python version without any additional tooling like pyenv.

Keeping your package maintainable

Pin your minimum Scrapy version. scrapy>=2.8 in dependencies sets a floor. Check which Scrapy version introduced the APIs you use and do not allow anything older. uv add scrapy will pick the current latest: tighten the floor to the oldest version you have tested against.

Keep a CHANGELOG. Users upgrading a dependency want to know if anything changed. Keep a CHANGELOG.md with an ## Unreleased section at the top that becomes the next version entry when you tag a release.

Be conservative with dependencies. Every package in dependencies becomes a constraint users must satisfy alongside their own dependencies. A Scrapy extension that depends on pandas and numpy is painful to integrate. Keep the list to what is genuinely required.

Handle deprecation cleanly. When you remove a setting or change an API, keep the old behavior working for at least one minor version and log a deprecation warning. Users who update without reading the changelog will thank you.

Lock files are for applications, not libraries. Commit uv.lock if this is an application. Do not commit it if this is a library, since your users' projects will resolve their own dependency versions and a committed lockfile does not help them.
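
The deprecation advice above can be sketched as a small settings reader. The rename here is purely hypothetical: RANDOM_DELAY was never a setting of this package and stands in for any setting you later replace. A plain dict is used in place of Scrapy's Settings object to keep the example self-contained:

```python
import warnings

# Hypothetical rename for illustration only.
OLD_NAME = "RANDOM_DELAY"
NEW_NAME = "RANDOM_DELAY_RANGE"


def get_delay_range(settings: dict, default=(0.5, 2.0)):
    """Prefer the new setting, but keep honouring the old one with a warning."""
    if NEW_NAME in settings:
        return tuple(settings[NEW_NAME])
    if OLD_NAME in settings:
        warnings.warn(
            f"{OLD_NAME} is deprecated; use {NEW_NAME} instead",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this helper
        )
        return tuple(settings[OLD_NAME])
    return default
```

The old name keeps working for a release cycle, and the warning tells users what to change before it is removed.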

The complete file tree

After following this guide, your project looks like this:

scrapy-random-delay/
├── src/
│   └── scrapy_random_delay/
│       ├── __init__.py          # exports RandomDelayHandler, sets __version__
│       └── handler.py           # the handler implementation
├── tests/
│   ├── __init__.py
│   ├── conftest.py              # shared fixtures
│   └── test_handler.py          # test suite
├── .github/
│   └── workflows/
│       └── publish.yml          # automated test and publish workflow
├── pyproject.toml               # project metadata, dependencies, tool config
├── uv.lock                      # locked dependency versions (omit for libraries)
├── README.md                    # installation and usage documentation
├── LICENSE                      # MIT or whichever license you choose
└── .gitignore                   # include .env, __pycache__, dist/, *.egg-info/

Next steps

Publishing a Scrapy extension to PyPI is the same process regardless of what the extension does: the download handler is just the example. The same pyproject.toml structure, the same uv publish command, and the same GitHub Actions workflow apply to middlewares, pipelines, item types, and extensions. If you want to go deeper on Scrapy's component system and understand where each type of extension fits, the Modern Scrapy Developer's Guide covers the full architecture from spiders through to deployment on Scrapy Cloud.

How to tell if a page uses JavaScript rendering (and what to do about it)

John Rooney — Mon, 11 May 2026 09:55:57 +0000

You write a scraper, test your selectors in the browser, and everything looks right. Then you run the spider and get back nothing. This is the most common point of confusion for developers new to web scraping: the browser shows you the data, your scraper does not find it, and the two are looking at completely different things.

The browser executes JavaScript before you see anything on screen, but your scraper, unless you specifically configure it to do otherwise, does not. It sees the raw HTML the server sent before any JavaScript ran, and on a modern web application that raw HTML is often just a shell: a few <div> tags, some <script> elements, and no actual content.

Figuring out whether a page uses JavaScript rendering takes about two minutes and a browser you already have open, and once you know what you are dealing with, the path forward is clear.

The two-minute test

Open the page you want to scrape. Right-click anywhere on the page and select View Page Source: not Inspect, not DevTools, but View Page Source. This shows you the raw HTML the server sent before the browser ran any JavaScript.

Now search that source for a piece of text you can see on the rendered page: a product name, a price, a headline, anything specific.

If you find it, the content is in the HTML and your scraper can extract it without JavaScript rendering. If you do not find it, the content was injected by JavaScript after the page loaded, your scraper will not find it either, and you need a different approach.

That is the entire test. Everything else in this guide is detail.

Understanding why this happens

It helps to understand the three different ways a page can deliver its content, because each one requires a different scraping approach.

Server-side rendering. The server builds the complete HTML page and sends it. When the browser receives it, all the content is already in the markup. Wikipedia works this way, many news sites work this way, and older e-commerce platforms work this way. This is the easiest case for scraping: requests and Beautiful Soup are sufficient.

Client-side rendering. The server sends a minimal HTML shell with almost no content, plus a JavaScript bundle. The JavaScript runs in the browser, fetches data from an API, and builds the DOM dynamically, which means the content never exists in the original HTML. React, Vue, and Angular applications built as single-page applications typically work this way, and they require either browser rendering or finding and calling the underlying API directly.

Hybrid rendering. The server sends an HTML page with some content already in it (enough for search engines and initial paint), and JavaScript then enhances the page by adding more data, enabling interactivity, and loading supplementary content. Many modern e-commerce and content sites work this way, and depending on which data you need, you may or may not need JavaScript rendering.

The View Source test tells you which case you are in.

Confirming it with Python

The manual test is fast, but if you want to confirm programmatically or build a check into your scraping toolchain, you can fetch the raw response with requests and check it for text you know appears on the rendered page, alongside a few structural heuristics.

import requests
from bs4 import BeautifulSoup

def check_for_js_rendering(url: str, search_text: str) -> dict:
    """
    Fetch a page with plain requests and check whether expected text is present.
    Returns a diagnostic dict.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }

    response = requests.get(url, headers=headers, timeout=15)

    result = {
        "url": url,
        "status_code": response.status_code,
        "content_length": len(response.text),
        "search_text": search_text,
        "found_in_raw_html": search_text.lower() in response.text.lower(),
        "likely_js_rendered": False,
        "notes": [],
    }

    soup = BeautifulSoup(response.text, "lxml")

    # Check for signals that suggest client-side rendering
    script_tags = soup.find_all("script")
    result["script_count"] = len(script_tags)

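    # Common JS framework root elements. An empty root suggests client-side
    # rendering; a populated one is common on server-rendered Next.js or Nuxt pages.
    js_roots = soup.select("#root, #app, #__next, #__nuxt, [data-reactroot]")
    if js_roots:
        root = js_roots[0]
        is_empty = not root.get_text(strip=True)
        result["notes"].append(
            f"Found {'empty' if is_empty else 'populated'} JS framework root: "
            f"<{root.name} id={root.get('id')!r}>"
        )
        if is_empty:
            result["likely_js_rendered"] = True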

    # Very little visible text relative to page size is a strong signal
    visible_text = soup.get_text(strip=True)
    text_ratio = len(visible_text) / max(len(response.text), 1)
    result["text_ratio"] = round(text_ratio, 3)
    if text_ratio < 0.05 and len(response.text) > 5000:
        result["notes"].append(f"Low text ratio ({text_ratio:.1%}) suggests JS-rendered content")
        result["likely_js_rendered"] = True

    # Explicit not-found confirmation
    if not result["found_in_raw_html"]:
        result["likely_js_rendered"] = True
        result["notes"].append(f"'{search_text}' not found in raw HTML — almost certainly JS rendered")

    return result


# Usage
result = check_for_js_rendering(
    url="https://example.com/products/headphones",
    search_text="Wireless Headphones"
)

print(f"Status: {result['status_code']}")
print(f"Content length: {result['content_length']} characters")
print(f"Text ratio: {result['text_ratio']:.1%}")
print(f"Found in raw HTML: {result['found_in_raw_html']}")
print(f"Likely JS rendered: {result['likely_js_rendered']}")
for note in result["notes"]:
    print(f"  -> {note}")

Reading the signals

Even before searching for specific text, certain patterns in the raw HTML tell you what you are dealing with.

Signal one: the HTML is nearly empty

Open View Source and immediately scroll down. A server-rendered page will have recognizable HTML structure: <header>, <main>, product listings, article text, navigation. A client-side rendered page will often look something like this:

<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>My Store</title>
    <link rel="stylesheet" href="/static/css/main.abc123.css">
  </head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.def456.js"></script>
  </body>
</html>

A <div id="root"> or <div id="app"> with nothing inside it is the unmistakable signature of a React or Vue single-page application. Everything visible in the browser was injected by JavaScript into that empty div.

from bs4 import BeautifulSoup
import requests

def has_empty_app_root(url: str) -> bool:
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=15)
    soup = BeautifulSoup(response.text, "lxml")

    # These are the standard mounting points for major JS frameworks
    for selector in ("#root", "#app", "#__next", "#__nuxt", "[data-reactroot]"):
        el = soup.select_one(selector)
        if el and not el.get_text(strip=True):
            print(f"Found empty JS root: {selector}")
            return True

    return False

Signal two: JavaScript framework fingerprints

Even on hybrid-rendered pages, certain markers identify which JavaScript framework is in use:

import requests

def identify_js_framework(html: str) -> list[str]:
    """Identify JavaScript frameworks present on the page."""
    frameworks = []

    checks = [
        ("React",         ["data-reactroot", "data-reactid", "__REACT_DEVTOOLS"]),
        ("Next.js",       ["__NEXT_DATA__", "_next/static", "__next"]),
        ("Vue",           ["data-v-", "__vue__", "nuxt__", "__NUXT__"]),
        ("Angular",       ["ng-version", "_nghost", "ng-app"]),
        ("Nuxt",          ["__NUXT__", "__nuxt"]),
        ("Gatsby",        ["___gatsby", "__PATH_PREFIX__"]),
        ("Svelte",        ["__svelte", "svelte-"]),
    ]

    for framework, markers in checks:
        if any(marker in html for marker in markers):
            frameworks.append(framework)

    return frameworks


response = requests.get("https://example.com", headers={"User-Agent": "Mozilla/5.0"}, timeout=15)
frameworks = identify_js_framework(response.text)
if frameworks:
    print(f"JS frameworks detected: {', '.join(frameworks)}")
    print("Page likely requires JavaScript rendering for full content")
else:
    print("No major JS framework detected — likely server-rendered")

Signal three: the Network tab in DevTools

For any page where you are unsure, the Network tab in browser DevTools gives you a definitive picture. Open DevTools, click the Network tab, and reload the page.

Look at the first request: the one for the HTML document itself. Click it and look at the Response tab. If the response body contains the data you want, you do not need JavaScript rendering. If it contains an empty shell, you do.

While you are there, also filter to Fetch/XHR and check whether the data you want is arriving via an API call in the background, such as a request to /api/products returning JSON. If it is, you may not need browser rendering at all, because you can call that API directly from Python, which is faster and more reliable than rendering the full page. The guide on intercepting XHR and fetch requests covers this in detail.

Signal four: the page works without CSS but breaks without JavaScript

A quick test that sometimes reveals the answer is to open your browser's developer settings, disable JavaScript, and reload the page. If the content disappears or the page shows a "Please enable JavaScript" message, the content is JavaScript-rendered.
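
The same check can be scripted. The sketch below assumes Playwright is installed and uses its java_script_enabled context option to load the page once with scripts and once without. The function names are illustrative, and a slow page may need an explicit wait for a selector before the JavaScript-enabled snapshot is taken.

```python
def classify_js_dependence(html_js_on: str, html_js_off: str, expected_text: str) -> str:
    """Compare two snapshots of one page, taken with and without JavaScript."""
    needle = expected_text.lower()
    if needle in html_js_off.lower():
        return "server-rendered"   # content survives with JavaScript disabled
    if needle in html_js_on.lower():
        return "js-rendered"       # content only appears once scripts run
    return "not-found"             # missing either way; check the search text


def fetch_html(url: str, javascript: bool) -> str:
    """Load url in headless Chromium with JavaScript switched on or off."""
    # Imported lazily so the comparison helper works without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            context = browser.new_context(java_script_enabled=javascript)
            page = context.new_page()
            page.goto(url, wait_until="load")
            return page.content()
        finally:
            browser.close()


def check_page(url: str, expected_text: str) -> str:
    """Fetch both snapshots and classify the page."""
    return classify_js_dependence(
        fetch_html(url, javascript=True),
        fetch_html(url, javascript=False),
        expected_text,
    )
```

Calling check_page("https://example.com/products/headphones", "Wireless Headphones") would return one of the three labels for that page.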

What to do about it

Once you have confirmed the page requires JavaScript rendering, you have three paths forward. They are not mutually exclusive: the right choice depends on what the page is doing, how much data you need, and how much complexity you want to manage.

Path one: find and call the API directly

This is always the first thing to try, because when it works it is the best outcome. Many JavaScript-rendered pages load their data from a JSON API, and if you can find that API, you can call it directly from Python and get clean, structured data without running a browser at all.

Open the Network tab in DevTools, filter to Fetch/XHR, reload the page, and look for requests returning JSON that contains the data you want. Right-click any promising request and select Copy, then Copy as cURL.

import requests

# Reproduced from the cURL command copied from DevTools
response = requests.get(
    "https://example.com/api/v2/products",
    params={"category": "electronics", "page": 1},
    headers={
        "Accept": "application/json",
        "Referer": "https://example.com/products",
        "User-Agent": "Mozilla/5.0",
    },
    timeout=15,
)
response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page

data = response.json()
products = data["products"]
print(f"Found {len(products)} products via API")

If the API requires authentication tokens that are generated by JavaScript on page load, you can extract them first and then call the API directly:

from playwright.sync_api import sync_playwright
import requests

def get_api_token(url: str) -> str | None:
    """Extract an API token from a JS-rendered page."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            page.wait_for_load_state("networkidle")

            # Try common locations for auth tokens. Each lookup returns None
            # when the value is absent; page.get_attribute() would instead wait
            # for the selector and raise on timeout if the meta tag is missing.
            return (
                page.evaluate("() => window.__INITIAL_STATE__?.auth?.token")
                or page.evaluate("() => localStorage.getItem('token')")
                or page.evaluate(
                    "() => document.querySelector('meta[name=\"api-token\"]')?.content"
                )
            )
        finally:
            browser.close()

token = get_api_token("https://example.com/products")
if token:
    response = requests.get(
        "https://example.com/api/products",
        headers={"Authorization": f"Bearer {token}"},
    )
    products = response.json()

Path two: use Playwright or Selenium to render the page

When the API approach is not viable (because the data is not available via a clean API, or the authentication is too complex to reproduce), use a browser automation library to render the page and scrape the resulting DOM.

Playwright is the recommended choice for new projects, since it is faster than Selenium, has a cleaner async API, and supports Chromium, Firefox, and WebKit.

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_with_playwright(url: str) -> list[dict]:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Block images and fonts to speed up page loads
        page.route(
            "**/*.{png,jpg,jpeg,gif,webp,svg,woff,woff2,ttf,eot}",
            lambda route: route.abort()
        )

        page.goto(url)

        # Wait for the content you actually need, not just page load
        page.wait_for_selector("article.product", timeout=10000)

        # Parse with Beautiful Soup — the HTML is now fully rendered
        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, "lxml")
    products = []
    for article in soup.select("article.product"):
        products.append({
            "name": article.select_one(".product-title").get_text(strip=True)
                    if article.select_one(".product-title") else None,
            "price": article.select_one(".price").get_text(strip=True)
                     if article.select_one(".price") else None,
        })

    return products

For async crawls:

import asyncio
from playwright.async_api import async_playwright
from bs4 import BeautifulSoup

async def scrape_page(browser, url: str) -> list[dict]:
    page = await browser.new_page()
    try:
        await page.route(
            "**/*.{png,jpg,jpeg,gif,webp}",
            lambda route: route.abort()
        )
        await page.goto(url)
        await page.wait_for_selector("article.product")
        html = await page.content()
        soup = BeautifulSoup(html, "lxml")
        return [
            {
                "name": a.select_one(".product-title").get_text(strip=True)
                        if a.select_one(".product-title") else None,
            }
            for a in soup.select("article.product")
        ]
    finally:
        await page.close()

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        urls = [
            "https://example.com/products?page=1",
            "https://example.com/products?page=2",
        ]
        tasks = [scrape_page(browser, url) for url in urls]
        results = await asyncio.gather(*tasks)
        await browser.close()
        return [item for page_items in results for item in page_items]

products = asyncio.run(main())

Choosing what to wait for

The most common mistake with Playwright is waiting for the wrong thing. page.wait_for_load_state("load") fires when the initial HTML and scripts have loaded, not when the JavaScript has finished rendering content. Use one of these instead:

# Wait for a specific element you need to be present
page.wait_for_selector(".product-listing")

# Wait for network activity to stop (risky — some pages have background polling)
page.wait_for_load_state("networkidle")

# Wait for a specific API response to complete
with page.expect_response("**/api/products**") as resp_info:
    page.goto(url)
response = resp_info.value
data = response.json()  # intercept the API response directly

# Wait for an element to contain specific text
page.wait_for_function("document.querySelector('.price')?.textContent?.includes('£')")

Path three: use Zyte API

If you are running scrapers at scale, managing a fleet of browser instances is a significant operational burden, since each browser process uses substantial memory (300–500 MB at minimum), headless browsers require careful configuration to avoid detection, and the infrastructure to run them reliably across many concurrent jobs requires real engineering investment.

Zyte API handles browser rendering and bot detection at the infrastructure level. You send a standard HTTP request and get back rendered HTML, with the browser execution, proxy rotation, and fingerprint management handled by the platform.

import requests

response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=("YOUR_API_KEY", ""),
    json={
        "url": "https://example.com/products",
        "browserHtml": True,   # request browser-rendered HTML
    },
)

data = response.json()
rendered_html = data["browserHtml"]

# Parse with Beautiful Soup as normal
from bs4 import BeautifulSoup
soup = BeautifulSoup(rendered_html, "lxml")
products = soup.select("article.product")
print(f"Found {len(products)} products")

In Scrapy, Zyte API integrates via the scrapy-zyte-api package:

# settings.py
ZYTE_API_KEY = "YOUR_API_KEY"
DOWNLOAD_HANDLERS = {
    "http":  "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
}

# Spider: request browser-rendered HTML for specific requests
def start_requests(self):
    yield scrapy.Request(
        "https://example.com/products",
        meta={
            # With automatic mapping enabled, extra Zyte API parameters
            # are passed as a dict under this single key
            "zyte_api_automap": {"browserHtml": True},
        },
    )

How to decide

Is the data in the raw HTML? Run the View Source test. If yes, use requests and Beautiful Soup or Scrapy and skip browser rendering entirely.

Is the data loaded via a visible API call? Check the Network tab with the Fetch/XHR filter. If yes, call the API directly from Python, since it will be faster and more reliable than any rendering approach.

Is the data accessible via the API but protected by a token that requires JavaScript to generate? Extract the token with a single Playwright page load, then call the API directly for the actual data scraping: one browser load, many API calls.

Do you need to interact with the page (scroll, click, fill forms, navigate tabs) to reveal the data? Use Playwright or Selenium.

Are you running this at scale and do not want to manage browser infrastructure? Use Zyte API.
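
The questions above can be written down as a small decision function. This is only a sketch: the class and field names are invented for illustration, and the order of the checks follows the order of the questions, which is an assumption about how to break ties when several answers are yes.

```python
from dataclasses import dataclass


@dataclass
class PageFindings:
    in_raw_html: bool = False         # the View Source test found the data
    visible_api_call: bool = False    # JSON with the data shows under Fetch/XHR
    api_needs_js_token: bool = False  # the API only works with a JS-generated token
    needs_interaction: bool = False   # scroll, click, fill forms, navigate tabs
    at_scale: bool = False            # you do not want to run browsers yourself


def recommend_approach(f: PageFindings) -> str:
    """Map what you observed about a page to one of the approaches above."""
    if f.in_raw_html:
        return "requests + Beautiful Soup, or Scrapy"
    if f.visible_api_call and not f.api_needs_js_token:
        return "call the API directly"
    if f.visible_api_call and f.api_needs_js_token:
        return "extract the token with one Playwright load, then call the API"
    if f.needs_interaction:
        return "Playwright or Selenium"
    if f.at_scale:
        return "Zyte API"
    # JS-rendered with no usable API: fall back to rendering the page
    return "Playwright or Selenium"


print(recommend_approach(PageFindings(visible_api_call=True, api_needs_js_token=True)))
```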

When rendering is slower than you expect

Browser rendering is ten to twenty times slower than a plain HTTP request and uses significantly more memory, but a few practical adjustments make a meaningful difference.

Block unnecessary resources. Images, fonts, video, and tracking scripts add load time without contributing to the data you need:

page.route(
    "**/*.{png,jpg,jpeg,gif,webp,svg,ico,woff,woff2,ttf,mp4,webm}",
    lambda route: route.abort()
)

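# Also block common tracking and analytics domains
def block_trackers(route):
    blocked = ["google-analytics.com", "googletagmanager.com",
               "facebook.net", "doubleclick.net", "hotjar.com"]
    if any(domain in route.request.url for domain in blocked):
        route.abort()
    else:
        # fallback() hands the request to the handler registered earlier;
        # continue_() would send it straight to the network and skip that
        # handler, so the image and font block above would never run
        route.fallback()

page.route("**/*", block_trackers)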

Reuse browser contexts rather than browser instances. Creating a new browser is expensive; creating a new context within an existing browser is cheap:

import asyncio
from playwright.async_api import async_playwright

async def scrape_all(urls: list[str]) -> list[str]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        async def scrape(url):
            # New context per task: isolated cookies and storage, reuses browser process
            context = await browser.new_context()
            try:
                page = await context.new_page()
                await page.goto(url)
                return await page.content()
            finally:
                await context.close()

        try:
            return await asyncio.gather(*[scrape(url) for url in urls])
        finally:
            await browser.close()

Wait for specific elements rather than network idle. networkidle waits until there are no active network connections for 500 ms, and many pages never reach this state because they have background analytics pings. Waiting for the specific element you need is faster and more reliable.

Checking your work

Once you have set up rendering, confirm it is actually returning the content you need:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def verify_rendering(url: str, expected_text: str) -> bool:
    """Confirm that rendering produces the expected content."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        page.wait_for_load_state("networkidle")
        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, "lxml")
    found = expected_text.lower() in soup.get_text().lower()
    print(f"'{expected_text}': {'found ✓' if found else 'not found ✗'}")
    return found

verify_rendering("https://example.com/products/headphones", "Wireless Headphones")

Next steps

Once you know a page requires JavaScript rendering and have chosen your approach, the next challenge is often what happens to the data after you extract it, since cleaning, deduplication, and storage work the same regardless of how the HTML was obtained. If you went the API interception route, the guide on intercepting XHR and fetch requests in the browser covers the full workflow from DevTools discovery to paginated API calls in Python. If you are running Playwright at scale inside Scrapy, web scraping dynamic websites with Zyte API covers how to handle rendering and unblocking without managing browser infrastructure yourself.

What to do when websites change and your spider doesn't know

John Rooney — Mon, 11 May 2026 09:37:11 +0000

Empty-field-rate monitoring catches selectors that return nothing. It does not catch selectors that return something wrong. The most damaging form of schema drift is the kind where a selector keeps producing values, the values are syntactically reasonable, and they are no longer the values you wanted. A price selector that quietly starts returning the financing instalment instead of the sticker price will pass every non-empty check while corrupting your data for as long as the drift goes unnoticed. That is the failure mode this post is about.

Drift comes in several flavours

People talk about "schema drift" as if it were one thing, but in scraping practice there are several kinds of drift, each of which fails differently and demands a different defence.

Drift type	Meaning	Example
DOM/layout drift	The page structure changes	Product cards move from table rows to grid cards
Data contract drift	The meaning or format of a field changes	Price changes from numeric text to "Contact us"
Navigation drift	Discovery paths change	Pagination links disappear, replaced by infinite scroll
Output schema drift	The spider output changes shape	A field is renamed or removed in the item definition

The first kind is the most familiar. The second is the most dangerous. When extraction returns nothing, you can catch it with a simple non-empty assertion. When extraction returns something plausible but wrong, the validation has to be semantic, and most pipelines have nothing in place to do that work.

Consider the financing-price example. Before the redesign, your selector .product-price matched a <span> containing the value $129.99. After the redesign, the same class name is reused for a marketing element that displays $11/mo with affirm. Your extractor still returns a string. The string still contains a dollar sign and a number. A naive validator looks at it, decides it is a price, and accepts it. The data is wrong, but nothing in the pipeline knows that.

The dangerous failure is not always when extraction returns nothing. It is when extraction returns something plausible but wrong.

The empty-field-rate metric from the previous post in this series will catch DOM drift that produces blanks. It will not catch data contract drift that produces something that just looks like a real value. For that, you need an extra layer of defence.

Structural fingerprints as smoke alarms

One way to catch a site change before the data goes wrong is to monitor the page structure itself, separately from the data you extract. The basic idea is simple: hash a fragment of the page that should remain stable, store the hash as a baseline, and compare future fetches against it. If the hash changes, something about the page changed, and you have an early warning.

The naive implementation, hashing the raw HTML of the page or the product container, is too noisy to be useful. Modern pages contain rotating ads, A/B test variants, randomised CSS class names from build tools, recommendation widgets, inventory banners, and inline analytics scripts, all of which change between requests without anything meaningful changing on the page. A raw hash will fire constantly and you will learn to ignore it.

The better pattern is a normalised structural fingerprint. The goal is to capture the shape of the page, the hierarchy of tags and the semantic attributes, while discarding everything that varies cosmetically.

from hashlib import sha256
from lxml import html
from copy import deepcopy

VOLATILE_TAGS = {"script", "style", "noscript", "iframe"}
VOLATILE_ATTRS_PREFIX = ("data-track", "data-analytics", "data-test-id-")
STABLE_ATTRS = {"role", "itemprop", "itemtype"}

def normalize_subtree(element):
    """Return a string representation of structure only, not content or noise."""
    el = deepcopy(element)

    # remove volatile tags entirely; build the list first so the tree is not
    # modified while it is being iterated
    for node in list(el.iter(*VOLATILE_TAGS)):
        parent = node.getparent()
        if parent is not None:
            parent.remove(node)

    parts = []
    for node in el.iter():
        if not isinstance(node.tag, str):
            continue  # skip comments and processing instructions
        # keep tag and stable semantic attributes only
        attrs = []
        for k, v in sorted(node.attrib.items()):
            if k.startswith("aria-") or k in STABLE_ATTRS:
                attrs.append(f"{k}={v}")
            elif k.startswith("data-") and not k.startswith(VOLATILE_ATTRS_PREFIX):
                # intentional data-* attributes: keep the name, drop the value
                attrs.append(k)
        parts.append(f"<{node.tag} {' '.join(attrs)}>")
    return "".join(parts)

def fingerprint(html_str, container_xpath):
    tree = html.fromstring(html_str)
    container = tree.xpath(container_xpath)
    if not container:
        return None
    return sha256(normalize_subtree(container[0]).encode()).hexdigest()

The principle behind the normalisation is to keep the things that should be stable across requests (tag hierarchy, ARIA roles, microdata attributes, intentional data-* attributes) and drop the things that are not (text content, generated class names, scripts, ads, tracking IDs). What remains is a structural fingerprint that changes when the developer of the target site changes the page, and is mostly stable otherwise.

A note on A/B testing: even with normalisation, a single hash mismatch is not a reliable signal of a real change. The site might be serving you a different test variant than the one you fingerprinted last week, and the difference is genuine without being a redesign. The right pattern is to sample more than one fetch before concluding that drift has occurred, and to treat a single mismatch as a prompt for review rather than an automatic alert.

Use fingerprints as smoke alarms, not verdicts. When the hash changes, fire a review task. Do not abort the crawl, do not roll back the deployment, and do not page anyone in the middle of the night. The fingerprint is telling you to look at the page; it is not telling you the page is broken.
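
One way to apply the sampling advice is to fingerprint several fresh fetches of the same URL and look at how many disagree with the baseline. The sketch below is an illustration rather than a tuned policy; the function name and the three labels are invented here, and the inputs are hash strings of the kind a fingerprint function returns (None when the container was missing).

```python
from typing import Optional, Sequence


def assess_fingerprints(baseline: str, samples: Sequence[Optional[str]]) -> str:
    """Classify fingerprints taken from repeated fetches of one URL.

    Returns "stable", "mixed", or "review".
    """
    hashes = [h for h in samples if h is not None]
    if not hashes:
        # the container was missing in every fetch, which deserves a look
        return "review"

    mismatches = sum(1 for h in hashes if h != baseline)
    if mismatches == 0:
        return "stable"
    if mismatches == len(hashes):
        # every fetch disagrees with the baseline: open a review task
        return "review"
    # some fetches still match the baseline, so this may be an A/B variant
    return "mixed"


print(assess_fingerprints("abc", ["abc", "def", "abc"]))  # prints "mixed"
```

A "mixed" result is a reason to keep sampling, while "review" means a person should look at the page. Neither one aborts the crawl.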

Live canary checks before production runs

The fingerprint catches changes after they happen. The canary check catches them before they cost you a full crawl of bad data. The pattern is straightforward: pick a small, stable set of representative URLs, fetch them, run your current extraction logic against them, and assert that the critical fields come back with plausible values.

import pytest
import requests
from myproject.extractors import extract_product

CANARY_URLS = [
    "https://example.com/product/12345",
    "https://example.com/product/67890",
]

@pytest.mark.parametrize("url", CANARY_URLS)
def test_extraction_canary(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    item = extract_product(response.text)

    assert item["title"], f"empty title for {url}"
    assert item["price"], f"empty price for {url}"
    assert _looks_like_price(item["price"]), (
        f"price {item['price']!r} for {url} does not look like a price"
    )
    assert item["availability"] in {"in_stock", "out_of_stock", "preorder"}, (
        f"unexpected availability {item['availability']!r} for {url}"
    )

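import re

# Optional currency marker: a common symbol or a three-letter code such as USD
_CURRENCY = r"(?:[$€£¥]|[A-Z]{3})"
_PRICE_RE = re.compile(
    rf"(?:{_CURRENCY}\s?)?\d{{1,3}}(?:[.,]?\d{{3}})*(?:[.,]\d{{2}})?(?:\s?{_CURRENCY})?"
)

def _looks_like_price(value):
    # rejects "$11/mo" style strings, accepts "$129.99" and "129,99 €"
    return bool(_PRICE_RE.fullmatch(value.strip()))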

The semantic checks are what make this useful. Asserting that the title is non-empty is fine, but asserting that the price actually looks like a price is what catches the financing-string failure mode. The check on availability against a known set rejects values that are syntactically valid strings but no longer in the contract.

Wiring this into CI is a question of cadence. Running canary checks on every commit will produce noise from transient network issues and rate limiting. Running them on a schedule (every few hours, or before each production deployment) gives you a useful signal without the false-positive churn. Failed runs should store the fetched HTML, the extracted item, and the assertion that failed, all as artifacts you can inspect later. A canary that fires and discards the evidence is a canary that wastes your time when you go to investigate.
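
A minimal sketch of storing that evidence, using only the standard library. The helper name and file layout are assumptions; the default directory is chosen to line up with an artifacts folder under tests/canary.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from urllib.parse import urlparse


def save_canary_artifacts(url, html, item, error, artifact_dir="tests/canary/artifacts"):
    """Write the fetched HTML, the extracted item, and the failure to disk."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    slug = urlparse(url).path.strip("/").replace("/", "_") or "index"
    out = Path(artifact_dir) / f"{stamp}_{slug}"
    out.mkdir(parents=True, exist_ok=True)

    (out / "page.html").write_text(html, encoding="utf-8")
    (out / "item.json").write_text(
        json.dumps(item, indent=2, default=str), encoding="utf-8"
    )
    (out / "failure.txt").write_text(f"{url}\n{error}\n", encoding="utf-8")
    return out


def run_check(url, html, item, check):
    """Run check(item); on failure, save the evidence and re-raise."""
    try:
        check(item)
    except AssertionError as exc:
        save_canary_artifacts(url, html, item, exc)
        raise
```

Re-raising after the save keeps the test red while leaving something to inspect afterwards.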

# .github/workflows/canary.yml
name: Extraction canary
on:
  schedule:
    - cron: "0 */4 * * *"
  workflow_dispatch:

jobs:
  canary:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest tests/canary -v
      - if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: canary-failures
          path: tests/canary/artifacts/

The same A/B caveat applies here. If a canary fails on a single fetch, retry on a fresh request before alerting. If it fails consistently across multiple fetches, the change is real.
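
The retry rule can be sketched as a small wrapper. Here run_check is a hypothetical zero-argument callable that performs a fresh fetch and raises AssertionError when extraction looks wrong; the attempt count is an arbitrary illustrative default.

```python
import time


def confirm_failure(run_check, attempts=3, delay_seconds=0.0):
    """Re-run a canary and report whether it fails on every attempt.

    Returns (confirmed, failure_messages).
    """
    failures = []
    for attempt in range(attempts):
        try:
            run_check()
        except AssertionError as exc:
            failures.append(str(exc))
        else:
            # one clean pass: treat earlier failures as a variant or a transient
            return False, failures
        if delay_seconds and attempt < attempts - 1:
            time.sleep(delay_seconds)
    return True, failures
```

Alert only when the first element of the result is True; the collected messages are still worth logging when it is False.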

Where this connects to the rest of the stack

If your spider is part of a system that pulls a lot of data from a small set of high-value sites, an alternative to maintaining selectors and canaries is to skip the selector-based approach for some content types entirely. Zyte API's pageContent data type, released in late 2025, is one example of a route around the problem: it returns the cleaned main content of a page without you having to maintain selectors at all, which means there is no selector to drift against. That trade-off is not right for every project, especially when you need fine-grained structured fields, but it is worth knowing about when the maintenance cost of a selector-based pipeline starts to dominate.

For pipelines that stay selector-based, the combination of structural fingerprints and canary checks is the strongest defence available. Fingerprints flag that the page changed; canaries verify that your extraction still works against the changed page. Neither is sufficient on its own, and both together still rely on the metrics from the previous post to catch the failure modes they miss.

What to do next

Pick the three or four most valuable URLs in your crawl and write canary checks for them with semantic assertions, not just non-empty checks. Add a normalised structural fingerprint for the same URLs and store the baseline. Run both on a schedule before your next production deployment. That alone will catch most of the silent-failure cases that empty-field-rate monitoring lets through.

In the final post of this series, we will look at the third leg of production-ready scraping: making sure that when something does go wrong mid-run, you can restart the crawl without duplicating data or corrupting state.

Scrapy AutoThrottle: How to tune crawl speed without getting blocked

John Rooney — Thu, 30 Apr 2026 19:58:32 +0000

AutoThrottle is one of Scrapy's most useful production features and one of the most commonly misconfigured. Most guides tell you to add four lines to your settings file and move on. This one explains what the algorithm is actually doing, what its assumptions are, why those assumptions sometimes break down, and how to tune it for real crawls.

If you have enabled AutoThrottle and are not sure whether it is working, or if your crawl speed is not what you expect despite having it turned on, this is the post to read.

What AutoThrottle actually measures

The most common misconception about AutoThrottle: it adjusts based on HTTP response codes. It does not. It adjusts based on response latency, the time between sending a request and receiving a response.

AutoThrottle does not back off when you receive 429s, 503s, or any other status code that might indicate rate limiting or blocking. Status codes play one narrow role: a non-200 response is never allowed to lower the delay, because error pages tend to be small and fast. When responses come back quickly, AutoThrottle assumes the server has capacity and reduces the delay between requests. When responses slow down, it assumes load and increases the delay. The goal is to maintain a target number of in-flight concurrent requests to the server, adjusting the inter-request delay to keep actual concurrency close to that target.

This works well when server latency is an accurate signal of server load. It breaks in several common situations:

Content delivery network (CDN)-cached responses. A CDN edge node returns cached content within a few milliseconds. AutoThrottle sees very fast responses and reduces the delay aggressively, potentially hammering the origin server, whose load AutoThrottle never directly observes.
Silent rate limiting. Some sites respond quickly with a 200 that carries a soft-block page, a CAPTCHA, an anti-bot challenge, or empty results. AutoThrottle interprets this as a healthy server and keeps the rate high. You are being blocked; the algorithm does not know.
Rate limiting via response code. A server that returns 429 in milliseconds does not trigger any backoff. A non-200 response cannot lower the delay, but nothing raises it either, so the crawl keeps hitting the cap at its current rate. Latency-based throttling is irrelevant when the server is enforcing a request cap by policy rather than by load.
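
The adjustment behind all three cases is small. The sketch below is a simplified paraphrase of the policy described in Scrapy's AutoThrottle documentation, not the extension's own code: the function name and the min_delay default are illustrative, and edge-case handling is left out.

```python
def next_delay(current_delay, latency, target_concurrency,
               min_delay=0.0, max_delay=60.0):
    """One step of a latency-driven delay update (all values in seconds)."""
    # If a response takes `latency` seconds, sending one request every
    # latency / N seconds keeps roughly N requests in flight.
    target_delay = latency / target_concurrency
    # Move halfway towards the target but never sit below it: rising latency
    # takes effect at once, falling latency is smoothed in over several responses.
    new_delay = max(target_delay, (current_delay + target_delay) / 2.0)
    # Clamp to the configured floor and ceiling.
    return min(max(min_delay, new_delay), max_delay)


# Six fast responses in a row (5 ms each) pull a one-second delay down sharply,
# whatever the response bodies actually contain.
delay = 1.0
for latency in [0.005] * 6:
    delay = next_delay(delay, latency, target_concurrency=1.0)
```

Nothing in the update looks at the response body, which is why a cached page or a fast block page reads as spare capacity.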

Knowing what AutoThrottle measures is what lets you decide when to use it. Before tuning delay settings, it is worth understanding your request requirements thoroughly: the post "The recipe for a request: Scaling data extraction" argues for investigating those requirements carefully before committing to a crawl strategy.

The settings that matter

Five settings control AutoThrottle behavior. Three of them matter for most crawls.

AUTOTHROTTLE_ENABLED = True

# The delay before the algorithm takes over — used for the first few requests
AUTOTHROTTLE_START_DELAY = 1.0

# Ceiling on the computed delay — prevents runaway backoff on very slow servers
AUTOTHROTTLE_MAX_DELAY = 60.0

# Target number of in-flight concurrent requests to the server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Log per-request throttling decisions — useful for tuning
AUTOTHROTTLE_DEBUG = False

`AUTOTHROTTLE_TARGET_CONCURRENCY`

This is the most commonly misunderstood setting. It is not a hard concurrency cap; it is a target. AutoThrottle computes a delay to try to maintain approximately this many in-flight requests at any given time, given the observed latency.

The default of 1.0 means AutoThrottle aims for one request in flight at a time. At 200ms average response time, that translates to roughly five requests per second. At 50ms, roughly 20 requests per second. Raising TARGET_CONCURRENCY makes the crawl more aggressive; at 4.0 with 200ms average latency, AutoThrottle aims for about 20 requests per second.

Also note the interaction with CONCURRENT_REQUESTS_PER_DOMAIN: if that cap is lower than what AutoThrottle's computed delay would allow, the cap takes precedence. Both settings need to be sized for your throughput target.
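
The arithmetic behind those figures, including the per-domain cap, can be sketched as follows. This is a back-of-the-envelope model rather than anything Scrapy exposes: the function name is hypothetical, and the default of 8 is Scrapy's default for CONCURRENT_REQUESTS_PER_DOMAIN.

```python
def approx_rate(avg_latency_seconds, target_concurrency, per_domain_cap=8):
    """Rough steady-state requests per second for one domain.

    The delay settles near latency / target_concurrency, so the rate is
    concurrency / latency. The hard per-domain cap limits how many requests
    can be in flight at once, whatever the target says.
    """
    in_flight = min(target_concurrency, per_domain_cap)
    return in_flight / avg_latency_seconds


for latency, target in [(0.2, 1.0), (0.05, 1.0), (0.2, 4.0)]:
    print(f"{latency * 1000:.0f} ms at target {target}: "
          f"~{approx_rate(latency, target):.0f} req/s")
```

Passing per_domain_cap=2 with a target of 4.0 halves the estimate, which is the case the paragraph above warns about.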

`AUTOTHROTTLE_MAX_DELAY`

Without this, a slow or overloaded server can push computed delays into minutes. The default of 60 seconds is reasonable for polite crawling, but it means a blocked or very slow server will stall your crawl for up to a minute between requests. Set it proportional to your acceptable throughput floor: a ceiling of 10 seconds means the slowest you crawl is one request per 10 seconds.

`AUTOTHROTTLE_START_DELAY`

This is the delay used for the first few requests before the algorithm has enough latency samples to make good decisions. Set it to something in the range of the delay you would use if setting DOWNLOAD_DELAY manually. Too low and the first few requests come in a burst; too high and you waste time at the start of each domain.

Reading the debug output

The most direct way to understand what AutoThrottle is doing is to enable debug logging:

AUTOTHROTTLE_DEBUG = True

This logs one line per response received, at INFO level, showing the state of the slot after the adjustment (the values here are illustrative):

2026-01-15 09:42:11 [scrapy.extensions.throttle] INFO: slot: example.com | conc: 3 | delay:  103 ms (+23) | latency:  310 ms | size: 48211 bytes

slot is the download slot this decision applies to, normally the domain.

conc is the number of requests in flight for that slot when the response arrived. Compare it with TARGET_CONCURRENCY to see whether actual concurrency is matching the target.

delay is the inter-request delay after this adjustment, in milliseconds, with the change from the previous value in parentheses. This is the number that feeds into the actual crawl rate.

latency is the signal AutoThrottle is responding to. Rising latency drives the delay up; falling latency drives it down.

size is the size of the response body in bytes. A sudden drop in size while latency stays low can be a hint that you are being served a block page.

What to look for during tuning:

Delay stuck at MAX_DELAY consistently. The server is responding slowly, or the site is throttling you by deliberately slowing its responses (rare). Consider raising MAX_DELAY if the server is legitimately slow, or investigate whether you are being blocked.
Delay near zero consistently. AutoThrottle is running with almost no restriction. Either the server is very fast or TARGET_CONCURRENCY is too high relative to CONCURRENT_REQUESTS_PER_DOMAIN.
Concurrency consistently below target. The server is slower than expected and AutoThrottle cannot achieve the target without excessive load. Lower TARGET_CONCURRENCY.
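
Once you have collected delay and concurrency samples from the debug output, the three patterns can be checked mechanically. The sketch below assumes the delays have been converted to seconds; the function name, the near_zero value, and the 90% and 80% cut-offs are arbitrary illustrations, not Scrapy settings.

```python
from statistics import mean


def tuning_hints(delays, concurrencies, max_delay, target_concurrency,
                 near_zero=0.01):
    """Map samples read off the debug log to the three patterns above."""
    hints = []
    # Share of samples where the delay sat at the ceiling.
    pinned = sum(1 for d in delays if d >= max_delay) / len(delays)
    if pinned >= 0.9:
        hints.append("delay pinned at MAX_DELAY")
    if mean(delays) <= near_zero:
        hints.append("delay near zero")
    if mean(concurrencies) < 0.8 * target_concurrency:
        hints.append("concurrency below target")
    return hints
```

An empty list means none of the three patterns showed up in the sample, not that the crawl is healthy.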

When to use AutoThrottle, a fixed delay, or neither

AutoThrottle makes sense when you are crawling a domain where server response time is a reliable signal of server load, you want adaptive behavior that maximizes throughput without overloading the server, and you are not under a specific documented rate limit. That covers most general-purpose crawls.

Use a fixed DOWNLOAD_DELAY when you have a specific rate limit to respect: an API with documented caps, or a crawl policy agreed with the site operator. Fixed delays give you auditable, predictable behavior. The Scrapy 2026 release improved retry logic and delay handling, making fixed-delay crawls more reliable in high-volume scenarios.

DOWNLOAD_DELAY = 1.0           # one second between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # randomise between 0.5s and 1.5s

RANDOMIZE_DOWNLOAD_DELAY is on by default and only has an effect when DOWNLOAD_DELAY is non-zero. It makes the timing pattern look less mechanical, which helps with some simple bot detection approaches.
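
The randomisation amounts to drawing each wait uniformly from 0.5 to 1.5 times DOWNLOAD_DELAY. A minimal sketch of that behaviour, with a hypothetical function name:

```python
import random


def randomized_delay(download_delay, rng=random):
    """Pick a wait between 0.5x and 1.5x the configured delay (seconds)."""
    return rng.uniform(0.5 * download_delay, 1.5 * download_delay)


# With DOWNLOAD_DELAY = 1.0 every wait lands between 0.5 s and 1.5 s, and the
# long-run average stays at the configured one second.
waits = [randomized_delay(1.0) for _ in range(5)]
```

The average rate is unchanged, so a documented rate limit expressed as requests per minute is still respected over time, though individual gaps can be as short as half the configured delay.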

Use neither when you are crawling many small domains in parallel where per-domain load is negligible, or when you are using a scraping API such as Zyte API that manages politeness for you. Adding a delay in those cases only reduces throughput without benefiting the target server.

Patterns that trick AutoThrottle

CDN caching

Many high-traffic sites serve content from CDN edges that return cached responses in under 10ms. AutoThrottle sees fast responses and reduces the delay, sometimes to near zero. The origin server may be under significant load from your crawl; CDN latency tells you nothing about it. If you are crawling a CDN-backed site and need to be polite to the origin, use a fixed DOWNLOAD_DELAY rather than AutoThrottle.

Silent rate limiting

A site that serves a soft-block page quickly looks identical to a successfully served page from AutoThrottle's perspective. You are being blocked; the algorithm sees only fast responses.

The diagnostic: if your item rate drops while response latency stays low and AutoThrottle is not backing off, you are being silently blocked. Slowing down will not fix it. The problem is elsewhere in the request fingerprint.
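
That diagnostic can be turned into a simple check over recent crawl stats. The sketch below assumes you already record items extracted per response and response latency somewhere; the function name and both thresholds are illustrative defaults to tune per site.

```python
from statistics import mean


def looks_silently_blocked(baseline_items, recent_items, recent_latencies,
                           drop_ratio=0.5, fast_seconds=0.5):
    """Flag the pattern: items per response fall sharply while latency stays low."""
    baseline = mean(baseline_items)
    if baseline == 0:
        return False  # no healthy baseline to compare against
    item_rate_dropped = mean(recent_items) < drop_ratio * baseline
    still_fast = mean(recent_latencies) < fast_seconds
    return item_rate_dropped and still_fast
```

A True result is a prompt to inspect the fetched HTML, not proof of blocking: a genuinely empty category page produces the same numbers.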

Variable-latency backends

Some sites have pages that respond in 30ms (static HTML, heavily cached) and pages that take 800ms (product pages with real-time inventory lookups). AutoThrottle makes decisions based on recent latency samples and will oscillate: fast pages drive the delay down, then a batch of slow pages backs it up. If the fast pages are not the ones you need to throttle for, lower TARGET_CONCURRENCY to give more headroom, or separate fast and slow page types into different crawl jobs with different throttle settings.
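
The oscillation is easy to reproduce with a toy model. The sketch below uses a simplified version of the delay update described in Scrapy's AutoThrottle documentation; the function names are hypothetical, and the latencies are the 30 ms and 800 ms figures from the paragraph above.

```python
def next_delay(current_delay, latency, target_concurrency, max_delay=60.0):
    """Simplified latency-driven delay update, in seconds."""
    target_delay = latency / target_concurrency
    averaged = (current_delay + target_delay) / 2.0
    return min(max(target_delay, averaged), max_delay)


def simulate(latencies, start_delay, target_concurrency):
    """Return the delay after each response in `latencies`."""
    delay, trace = start_delay, []
    for latency in latencies:
        delay = next_delay(delay, latency, target_concurrency)
        trace.append(round(delay, 3))
    return trace


# Five slow product pages (800 ms), then five cached pages (30 ms), twice over.
latencies = ([0.8] * 5 + [0.03] * 5) * 2
trace = simulate(latencies, start_delay=0.03, target_concurrency=1.0)
print(trace)
```

In this model the delay jumps to 0.8 s on the first slow page, decays towards 0.03 s across the fast batch, then jumps again, so the crawl rate follows page mix rather than server load.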

Settings cheat sheet

Recommended starting points for three common scenarios:

Scenario	`AUTOTHROTTLE_TARGET_CONCURRENCY`	`AUTOTHROTTLE_MAX_DELAY` (s)	`AUTOTHROTTLE_START_DELAY` (s)
Polite crawl of a single domain	1.0	10	2.0
Faster crawl of a cooperative server	4.0	30	1.0
Multi-domain crawl, many parallel slots	2.0	60	1.0

Enable AUTOTHROTTLE_DEBUG = True, run a short crawl, read the output, and adjust from there.
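
For convenience, here are the same starting points as settings dictionaries. The scenario keys and the helper are hypothetical names; the setting names are Scrapy's own, and the values are copied from the table.

```python
# Starting points from the cheat sheet; delays are in seconds.
AUTOTHROTTLE_PRESETS = {
    "polite_single_domain": {
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
        "AUTOTHROTTLE_MAX_DELAY": 10,
        "AUTOTHROTTLE_START_DELAY": 2.0,
    },
    "cooperative_server": {
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 4.0,
        "AUTOTHROTTLE_MAX_DELAY": 30,
        "AUTOTHROTTLE_START_DELAY": 1.0,
    },
    "multi_domain": {
        "AUTOTHROTTLE_TARGET_CONCURRENCY": 2.0,
        "AUTOTHROTTLE_MAX_DELAY": 60,
        "AUTOTHROTTLE_START_DELAY": 1.0,
    },
}


def autothrottle_settings(scenario, debug=True):
    """Build a dict that can be assigned to a spider's custom_settings."""
    return {
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_DEBUG": debug,  # leave on while tuning, turn off afterwards
        **AUTOTHROTTLE_PRESETS[scenario],
    }
```

Assigning autothrottle_settings("polite_single_domain") to a spider's custom_settings attribute applies the preset to that spider only.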

A note on blocking vs rate limiting

AutoThrottle addresses one problem: sending requests too fast for the server to handle comfortably. It does not address fingerprint-based detection, including TLS fingerprints, HTTP header patterns, browser behavior signatures, or IP reputation. Modern anti-bot systems primarily detect scrapers through these signals, not through request rate alone.

If you are following AutoThrottle's guidance on request rate and still getting blocked, the rate is probably not the issue. The problem is in the request fingerprint, and that requires a different solution.

What the Scrapy Maintainer Thinks About AI-Generated Scrapers

Neha Setia — Sat, 11 Apr 2026 15:13:41 +0000

I sat down with Adrian Chaves, one of the lead Scrapy maintainers, who also works at Zyte, to ask him the questions I've been chewing on since Zyte launched Web Scraping Copilot: what happens when an LLM writes your spider (the web scraping code)? What gets easier? What doesn't change?

His answers surprised me. A few highlights:

On vibe coding: Adrian has thoughts about developers treating scraper generation as a black box, and why Scrapy's design philosophy matters more, not less, when an LLM is writing the code.
The bottleneck isn't what you think. He argues the hard part of scraping in 2026 isn't writing code. It's reading pages. And that's the part LLMs still struggle with.
What "good design meeting the future halfway" means. Why frameworks like Scrapy that were built for humans are turning out to be the best frameworks for AI agents too.
Where LLMs actually help. The concrete places where AI makes a scraper developer's life better, and where it just adds complexity.

Full conversation is on the Zyte blog, linked below. If you're building scrapers, thinking about adding AI to your extraction pipeline, or just curious what someone who's been maintaining one of the most widely used scraping frameworks for years thinks about all of this, it's worth a read.

Read the full interview on zyte.com

Happy to discuss in the comments.
What are you using AI for in your scraping workflow right now, and where have you hit walls?

Tags: web scraping • scrapy • ai • opus • anti-bot • Claude AI • sonnet • open source