Nobody writes tests for scrapers. I get it. The site changes, your tests break, you feel like you spent Tuesday writing tests for a site you don't control. So you skip them.
Then the site changes again. Your scraper silently returns empty rows. The dashboard goes blank. Your client texts at 11pm. You discover, in the cold light of debug, that this exact failure was deterministic and could have been caught in 30 seconds by a single fixture-based test.
The house always wins.
The 3-item checklist
What scrapers actually need to test:
- Extraction against a frozen HTML fixture. Save a copy of the page once. Run the parser against it. Assert the fields. This catches your bugs.
- Schema validation against a live response. Periodically (daily, weekly), hit one real URL and validate the output shape. This catches their changes.
- Smoke test the full pipeline against a known-good URL. End-to-end. One URL. Asserts that you get one row out, with the expected fields. This catches integration breakage.
You don't need a Jest config or a pytest empire. You need three test files.
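The second item is the one people skip, so here's roughly what it can look like. This is a minimal stdlib-only sketch, not our production code: EXPECTED_FIELDS and validate_row are names invented for illustration, and the three fields are assumed from the example row used in the fixture test in this post. Pydantic or Zod would do the same job with less typing.

```python
# Hypothetical shape check: field names mirror the example row in this post.
EXPECTED_FIELDS = {"author": str, "likes": int, "text": str}


def validate_row(row: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the row is OK."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected_type):
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(row[name]).__name__}"
            )
        elif expected_type is str and not row[name].strip():
            # A present-but-blank string is the classic "silent empty row".
            problems.append(f"empty field: {name}")
    return problems
```

In the scheduled test, scrape one live URL and assert that validate_row(row) == []. When it fails, the message tells you which field drifted instead of leaving you to diff HTML at 11pm.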
The replacement: a fixture-first test in <10 lines
```python
# tests/test_extractor.py
from pathlib import Path

from my_scraper.extract import extract_comment


def test_youtube_comment_extraction():
    html = Path("tests/fixtures/youtube_comment_2026-04-01.html").read_text()
    result = extract_comment(html)
    assert result["author"] == "@somecreator"
    assert result["likes"] == 1247
    assert "great video" in result["text"].lower()
```
Here extract_comment(html) is a pure function: give it HTML, get a dict back. No browser, no network. Runs in milliseconds. Survives a CI minute budget. Catches regressions in the fields you assert on as soon as your parsing code changes.
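If "pure function" sounds abstract, here is one way such an extractor could be shaped. Treat it as a sketch under loud assumptions: the CSS class names (author, likes, text) are invented, real YouTube markup looks nothing like this, and our actual extractor is not shown in this post. The point is only the signature: a string goes in, a dict comes out.

```python
from html.parser import HTMLParser

# Hypothetical mapping from CSS class to output field; real markup will differ.
FIELD_CLASSES = {"author": "author", "likes": "likes", "text": "text"}


class _CommentParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # field whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for css_class, field in FIELD_CLASSES.items():
            if css_class in classes:
                self._current = field
                self.fields.setdefault(field, "")

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: assumes each field element contains only text,
        # so any closing tag ends the current field.
        self._current = None


def extract_comment(html: str) -> dict:
    """Pure function: HTML string in, dict out. No browser, no network."""
    parser = _CommentParser()
    parser.feed(html)
    parser.close()
    fields = parser.fields
    return {
        "author": fields["author"].strip(),
        "likes": int(fields["likes"].strip()),
        "text": fields["text"].strip(),
    }
```

A missing field raises KeyError and a non-numeric like count raises ValueError. In a test that is the behavior you want: loud, not silent.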
Save the fixture by literally hitting the URL once and writing the response to disk:
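```python
# scripts/refresh_fixture.py
import asyncio
from pathlib import Path
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://www.youtube.com/watch?v=...")
        # Freeze the rendered page so the extractor test can run offline.
        Path("tests/fixtures/youtube_comment_2026-04-01.html").write_text(
            await page.content()
        )
        await browser.close()

asyncio.run(main())
```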
Run it once a quarter. When the test starts failing, refresh the fixture, fix the extractor, commit both. That's the loop.
Quick case
On our YouTube comments scraper, fixture-based tests caught three parsing regressions before they ever reached production:
- A rename of the likeCount field, plus a thousand-separator format change.
- A new "pinned" badge that broke our author-name selector.
- A timestamp format change from "2 days ago" to "2d".
All three would have shipped silently. The cron would still run. The CSV would still write. The fields would just be wrong or empty. Instead, the test failed in CI on the PR that introduced the change, fifteen minutes after the fixture was last refreshed.
The cost of writing the test the first time: 20 minutes. The cost of the bugs it caught, if shipped: at minimum a refund and an apology each.
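Two of those three were pure format changes, which is the kind of thing a small normalizer absorbs. The sketch below is illustrative rather than what we shipped: parse_like_count, parse_relative_age, and the unit table are hypothetical names, and the accepted spellings are guesses at what a site might emit.

```python
import re

# Hypothetical unit table: long and short spellings map to one canonical code.
_UNITS = {
    "second": "s", "sec": "s", "s": "s",
    "minute": "m", "min": "m", "m": "m",
    "hour": "h", "hr": "h", "h": "h",
    "day": "d", "d": "d",
    "week": "w", "wk": "w", "w": "w",
    "month": "mo", "mo": "mo",
    "year": "y", "yr": "y", "y": "y",
}
# Number, unit word (optional plural "s"), optional trailing "ago".
_AGE_RE = re.compile(r"^\s*(\d+)\s*([a-z]+?)s?(?:\s+ago)?\s*$", re.IGNORECASE)


def parse_like_count(text: str) -> int:
    """Accept '1247' and '1,247'; raise on anything else (e.g. '1.2K')."""
    digits = re.sub(r"[,\s]", "", text)
    if not re.fullmatch(r"[0-9]+", digits):
        raise ValueError(f"unrecognized like count: {text!r}")
    return int(digits)


def parse_relative_age(text: str) -> tuple[int, str]:
    """Normalize '2 days ago' and '2d' to the same (count, unit) pair."""
    match = _AGE_RE.match(text)
    if not match or match.group(2).lower() not in _UNITS:
        raise ValueError(f"unrecognized age format: {text!r}")
    return int(match.group(1)), _UNITS[match.group(2).lower()]
```

Both functions raise on input they don't recognize instead of returning a default. That is deliberate: the next format change should fail a test, not write an empty column.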
The CTA you didn't ask for
Every actor we ship now starts with three test files:
- tests/test_extract.py: fixture-based unit tests for parsing.
- tests/test_schema.py: Pydantic / Zod schema check on a live URL, run on a schedule.
- tests/test_smoke.py: single-URL end-to-end check on every deploy.
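The smoke test is the smallest of the three. A rough sketch of its core, assuming your pipeline can be called as a function that takes a URL and yields dict rows; smoke_check, REQUIRED_FIELDS, and the run_pipeline parameter are made-up names, so swap in whatever your entry point is actually called.

```python
from typing import Callable, Iterable

REQUIRED_FIELDS = ("author", "likes", "text")  # assumed; mirror your real schema


def smoke_check(run_pipeline: Callable[[str], Iterable[dict]], url: str) -> dict:
    """Run the whole pipeline on one URL and return the first row.

    Raises AssertionError if no row comes back or a required field is
    missing, None, or an empty string.
    """
    rows = list(run_pipeline(url))
    assert rows, f"pipeline returned no rows for {url}"
    first = rows[0]
    missing = [name for name in REQUIRED_FIELDS if first.get(name) in (None, "")]
    assert not missing, f"row is missing or has empty fields: {missing}"
    return first
```

Passing the pipeline in as an argument keeps the check itself testable with a stub. In tests/test_smoke.py you would call it with the real entry point and your one known-good URL.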
It's the most boring testing pyramid you've ever seen and it has paid for itself an embarrassing number of times — the YouTube comments scraper is where it caught the most regressions in 2026.
So:
Open your scraper. Do you have a tests/ folder? Drop "yes" or "no" in the comments. If "no" — what's stopping you?
Agree, disagree, or have a fixture strategy that actually works? Reply.
Written by **Nova Chen**, Automation Dev Advocate at SIÁN Agency. Find more from Nova on dev.to. For custom scraping or automation work, hire SIÁN Agency.
