John Rooney for Extract by Zyte

Posted on May 31 • Edited on Jun 2

Why Your Scraper Breaks Without Warning (And How to Fix It)

#webscraping #python #zyte #programming

Most scraper failures don't raise exceptions. The spider finishes, the pipeline writes a file, the process exits with code 0 — and the output contains 0 items, or 800 instead of 8,000, or fields that are all empty strings. A CSS selector stopped matching after a site redesign. A field that used to be present is now conditional. The "next page" link moved.

None of these produce tracebacks. Without explicit checks, you won't know until someone looks at the data.

This post covers three layers of defence: validating individual items before they reach storage, checking aggregate counts at the end of a run, and setting up logging that's actually useful.

Why selectors fail silently

BeautifulSoup and Scrapy's CSS selectors return None or an empty list when nothing matches; they don't raise. Whether the failure is loud or silent depends on what you do next.

# This raises AttributeError if price_element is None
price_element = book.find("p", class_="price_color")
price = price_element.text.strip()  # AttributeError: 'NoneType' object has no attribute 'text'

That's a loud failure, which is fine. But this version is silent:

price = book.css("p.price_color::text").get(default="")

get(default="") is convenient, but it means a changed selector produces empty strings in your output rather than an error. You end up with a file full of {"title": "Some Book", "price": "", "rating": ""} records that look complete until you actually check the values.

The fix is validation — checking that the fields you extracted are actually populated before writing anything.

Item validation

A validation function takes an item dict and raises if any required field is missing or empty:

REQUIRED_FIELDS = ["title", "price", "rating"]

def validate_item(item: dict) -> dict:
    problems = [f for f in REQUIRED_FIELDS if not item.get(f)]
    if problems:
        raise ValueError(f"Missing fields {problems}: {item}")
    return item

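not item.get(f) catches None, "", [], and 0. The first three are what a failed selector typically leaves behind, but 0 can be a legitimate value. item.get(f) is None allows 0, but it also lets "" and [] through; to allow 0 while still rejecting empty values, test item.get(f) in (None, "", []) instead.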
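
As a minimal sketch of that trade-off, the variant below treats None, blank strings, and empty containers as missing while letting numeric zero through. The is_missing helper and the validate_item_allow_zero name are illustrative, not part of any library.

```python
REQUIRED_FIELDS = ["title", "price", "rating"]


def is_missing(value) -> bool:
    """True for None, empty or whitespace-only strings, and empty containers."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) == 0
    return False  # numbers (including 0) and booleans count as present


def validate_item_allow_zero(item: dict) -> dict:
    """Like validate_item, but a value of 0 is accepted as populated."""
    problems = [f for f in REQUIRED_FIELDS if is_missing(item.get(f))]
    if problems:
        raise ValueError(f"Missing fields {problems}: {item}")
    return item
```

Checking for whitespace-only strings is an extra assumption here; drop the strip() call if a field may legitimately contain only spaces.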

In a standalone scraper:

import requests
from bs4 import BeautifulSoup
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger(__name__)

REQUIRED_FIELDS = ["title", "price", "rating"]

def validate_item(item: dict) -> dict:
    problems = [f for f in REQUIRED_FIELDS if not item.get(f)]
    if problems:
        raise ValueError(f"Missing fields {problems}: {item}")
    return item

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-GB,en;q=0.9",
})

resp = session.get("https://books.toscrape.com/", timeout=15)
resp.encoding = "utf-8"
soup = BeautifulSoup(resp.text, "html.parser")

items = []
errors = 0

for book in soup.find_all("article", class_="product_pod"):
    raw = {
        "title":  book.find("h3").find("a")["title"],
        "price":  book.find("p", class_="price_color").text.strip(),
        "rating": book.find("p", class_="star-rating")["class"][1],
    }
    try:
        items.append(validate_item(raw))
    except ValueError as e:
        logger.warning("Skipping invalid item: %s", e)
        errors += 1

logger.info("Scraped %d valid items, %d errors", len(items), errors)

Whether to skip invalid items or abort depends on the use case. For a scrape where partial data is usable, log and skip. For a scrape where every record matters, raise immediately.
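
One way to make that choice explicit is a flag on the loop that validates items. This is a sketch only: collect_valid and its strict parameter are hypothetical names, and validate_item is repeated so the block runs on its own.

```python
import logging

logger = logging.getLogger(__name__)

REQUIRED_FIELDS = ["title", "price", "rating"]


def validate_item(item: dict) -> dict:
    problems = [f for f in REQUIRED_FIELDS if not item.get(f)]
    if problems:
        raise ValueError(f"Missing fields {problems}: {item}")
    return item


def collect_valid(raw_items, strict: bool = False):
    """Return (valid_items, error_count); re-raise the first failure if strict."""
    items, errors = [], 0
    for raw in raw_items:
        try:
            items.append(validate_item(raw))
        except ValueError as e:
            if strict:
                raise  # every record matters: abort the run
            logger.warning("Skipping invalid item: %s", e)
            errors += 1
    return items, errors
```

With strict=False the run continues and the error count is reported at the end; with strict=True the first invalid record stops the scrape with a traceback.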

Validating in a Scrapy pipeline

In Scrapy, validation belongs in a pipeline that runs before the storage pipelines. The ItemAdapter class normalises access across Scrapy Items, dataclasses, and plain dicts:

# pipelines.py
from itemadapter import ItemAdapter
import logging

logger = logging.getLogger(__name__)

REQUIRED_FIELDS = {"title", "price", "rating"}


class ItemValidatorPipeline:
    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        missing = REQUIRED_FIELDS - {
            k for k, v in adapter.items()
            if v not in (None, "", [])
        }

        if missing:
            raise ValueError(
                f"Item missing required fields: {missing} | {dict(adapter)}"
            )

        return item

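Raising an exception from process_item stops that item from reaching later pipelines; it doesn't stop the spider. Scrapy's DropItem is the idiomatic choice and is logged as a warning, while any other exception, including ValueError, is logged as an error with a traceback. Give the validator a lower priority number than your storage pipelines so it runs first: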

# settings.py
ITEM_PIPELINES = {
    "myproject.pipelines.ItemValidatorPipeline": 100,
    "myproject.pipelines.JsonlPipeline":         300,
    "myproject.pipelines.SqlitePipeline":        400,
}

Count checking

Per-item validation catches bad fields, but not a situation where the spider simply stopped finding items at all. A selector change that breaks the outer find_all("article", class_="product_pod") call produces zero items with zero errors — everything looks fine.

A count check at the end of a run catches this:

MIN_EXPECTED = 10

logger.info("Scraped %d valid items, %d errors", len(items), errors)

if len(items) < MIN_EXPECTED:
    logger.error(
        "Item count %d below threshold %d — check selectors or site structure",
        len(items),
        MIN_EXPECTED,
    )
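
Logging the error still leaves the process exiting with code 0, so a scheduler such as cron or a CI job would record the run as a success. One option, sketched here on the assumption that a non-zero exit status is how your scheduler detects failure, is to turn the count check into an exit code. The exit_code_for_count and finish helpers, and the choice of 1 as the failure code, are illustrative.

```python
import logging
import sys

logger = logging.getLogger(__name__)

MIN_EXPECTED = 10


def exit_code_for_count(count: int, minimum: int = MIN_EXPECTED) -> int:
    """Return 0 when the count meets the threshold, 1 otherwise."""
    if count < minimum:
        logger.error(
            "Item count %d below threshold %d; check selectors or site structure",
            count,
            minimum,
        )
        return 1
    return 0


def finish(items: list) -> None:
    """Call as the last step of the script to exit with a meaningful status."""
    sys.exit(exit_code_for_count(len(items)))
```

Calling finish(items) as the final line of the standalone scraper makes a short run visible to anything that checks the exit status.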

In Scrapy, this belongs in an extension that connects to the spider_closed signal:

# extensions.py
from scrapy import signals
import logging

logger = logging.getLogger(__name__)


class ItemCountChecker:
    MIN_ITEMS = 10

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # Connect every signal up front rather than wiring item_scraped
        # from inside spider_opened.
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.item_scraped, signal=signals.item_scraped)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        # Reset the counter at the start of each run.
        self.item_count = 0

    def item_scraped(self, item, spider):
        # Fires only for items that passed every pipeline.
        self.item_count += 1

    def spider_closed(self, spider):
        if self.item_count < self.MIN_ITEMS:
            logger.error(
                "Spider %s closed with only %d items (threshold: %d). "
                "Check selectors or site structure.",
                spider.name,
                self.item_count,
                self.MIN_ITEMS,
            )
        else:
            logger.info(
                "Spider %s: %d items scraped.", spider.name, self.item_count
            )

Enable in settings:

EXTENSIONS = {
    "myproject.extensions.ItemCountChecker": 500,
}

The threshold needs thought. A scraper that usually returns 800 items should probably alert below 700, not below 10. Track a few runs first, then set the threshold relative to your expected baseline.
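
As a rough sketch of setting the threshold relative to a baseline, the function below takes the item counts from recent runs and allows a percentage drop from their median. The threshold_from_history name and the 15% default tolerance are assumptions for illustration; pick a tolerance that matches how much your source normally fluctuates.

```python
from statistics import median


def threshold_from_history(recent_counts: list, tolerance_pct: int = 15) -> int:
    """Alert threshold: the median of recent runs minus tolerance_pct percent."""
    if not recent_counts:
        raise ValueError("need at least one previous run to set a baseline")
    baseline = median(recent_counts)
    # Round down so the threshold is never stricter than intended.
    return int(baseline * (100 - tolerance_pct) / 100)
```

The median is used rather than the mean so that one unusually bad run in the history doesn't drag the threshold down with it.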

Logging that's useful

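Scrapy's default logging is verbose. At the default DEBUG level it records every request and scraped item, and even at INFO the middleware lists and periodic stats crowd out the messages you actually care about. Two settings are the starting point:

# settings.py
LOG_LEVEL = "INFO"      # handler threshold: INFO and above reach the log
LOG_FILE  = "scrape.log"

LOG_LEVEL sets the threshold on the log handler Scrapy installs, so it applies to every logger, including your own. With LOG_LEVEL = "WARNING", INFO messages from your pipelines and extensions are filtered out no matter what level their loggers are set to. To keep those messages while quieting Scrapy internals, leave LOG_LEVEL at INFO and raise the level of the scrapy logger instead. Because Scrapy uses Python's standard logging module, you can configure loggers separately. Do this after Scrapy has set up logging, for example in an extension's from_crawler, because Scrapy's startup configuration may reset levels set earlier:

import logging
logging.getLogger("scrapy").setLevel(logging.WARNING)  # Scrapy internals: warnings and errors only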

For standalone scrapers, the format matters. The default basicConfig format, %(levelname)s:%(name)s:%(message)s, has no timestamp. At minimum include the timestamp and level; the logger name helps too:

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)

When something goes wrong at 2am, the timestamp is how you correlate a failure with a site change or a rate-limit event.

The three signals that something is broken

In practice, silent scraper failures show up three ways:

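A count drop with no errors. Item count is 0 or far below baseline, error count is 0, the spider exited cleanly. This usually means the outer container selector no longer matches; a moved "next page" link looks similar, with the count falling to roughly one page's worth of items.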

A count that looks right but fields are empty. Item count is normal, but price and rating are empty strings across all records. An inner selector changed while the outer one still matches.

A count drop accompanied by validation errors. Item count is lower than expected, and the log contains Skipping invalid item messages. A field is now conditional — present on some items, absent on others, probably due to a new page layout.

Each failure mode points at a different selector level. Logging both counts and individual errors separately makes it easier to tell which one you're dealing with.
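
To tell the second and third signals apart without reading records by hand, one option is to log a per-field fill rate at the end of the run. This is a sketch; field_fill_rates is a hypothetical helper, and it treats None, "", and [] as empty.

```python
def field_fill_rates(items: list, fields: list) -> dict:
    """Fraction of items (0.0 to 1.0) with a non-empty value for each field."""
    total = len(items)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: sum(
            1 for item in items if item.get(field) not in (None, "", [])
        ) / total
        for field in fields
    }
```

A rate of 0.0 for one field alongside a normal item count matches the second signal; a rate somewhere between 0 and 1 suggests the field has become conditional, as in the third.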

Putting it together

A scraper with validation, count checking, and useful logging doesn't need to be complex. These three additions together cover the most common silent failure modes:

Validate each item before writing — catches changed inner selectors
Check the final count against a threshold — catches broken outer selectors
Log counts and errors at the end of every run — makes failures visible without manual inspection

For a production scraper that runs on a schedule, route the logs somewhere you'll see them. Point LOG_FILE at a location your monitoring system watches, or send the error-level messages to a webhook. A scraper that runs silently and produces empty output for a week is worse than one that fails loudly on the first run.
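
For the webhook route, one possible shape is a logging handler that forwards only records at ERROR and above. This is a sketch using only the standard library; the WebhookHandler class and the {"text": ...} JSON payload are assumptions for illustration, so adapt the payload to whatever your webhook receiver expects.

```python
import json
import logging
import urllib.request


class WebhookHandler(logging.Handler):
    """Forward log records at or above the handler's level to a webhook URL."""

    def __init__(self, url: str, level: int = logging.ERROR, timeout: float = 5.0):
        super().__init__(level=level)
        self.url = url
        self.timeout = timeout

    def build_payload(self, record: logging.LogRecord) -> bytes:
        # Assumed payload shape: a JSON object with a single "text" key.
        return json.dumps({"text": self.format(record)}).encode("utf-8")

    def post(self, body: bytes) -> None:
        request = urllib.request.Request(
            self.url,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request, timeout=self.timeout):
            pass  # the response body is not needed

    def emit(self, record: logging.LogRecord) -> None:
        try:
            self.post(self.build_payload(record))
        except Exception:
            # Never let a failed alert crash the scraper itself.
            self.handleError(record)
```

Attach it once at startup with logging.getLogger().addHandler(WebhookHandler(url)), where url is your own endpoint. The existing logger.error calls for low item counts then reach the webhook without further changes.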

Tags: python scrapy webscraping tutorial