DEV Community

Sangmin Lee
Sangmin Lee

Posted on • Originally published at claudeguide.io

Automate Web Scraping with Claude Code (2026)

Originally published at claudeguide.io/claude-code-web-scraping-automation

Automate Web Scraping with Claude Code (2026)

To automate web scraping with Claude Code, prompt it to generate a Playwright script that navigates to your target URL, extracts the HTML, and passes it to the Claude API for structured parsing. Claude Code writes the browser automation layer while the Claude API handles the extraction logic — converting messy HTML into clean JSON. This separation means you can scrape any site without writing brittle CSS selectors that break on every redesign in 2026. The combination handles pagination, JavaScript-rendered content, and infinite scroll out of the box, and schedules with a single cron line.


Playwright + Claude Code Workflow

The architecture has two layers. Playwright handles browser interaction (navigation, scrolling, clicking). Claude API handles data extraction (understanding HTML structure and pulling structured fields).

Use this prompt in Claude Code to scaffold the full pipeline:

Build a Python web scraper with two components:
1. A Playwright async scraper that visits URLs from a list, renders each page,
   and captures the full HTML.
2. An Anthropic API client that receives raw HTML and returns a JSON object
   with fields: [title, price, rating, description, sku, availability].

Requirements:
- Playwright uses headless Chromium with a 10s timeout per page
- Claude API calls use claude-haiku-4-5 for cost efficiency
- Enable prompt caching on the system prompt (static extraction schema)
- Save results to output.jsonl with one JSON object per line
- Log failures to errors.log without stopping the run
Enter fullscreen mode Exit fullscreen mode

Claude Code generates a working script in one pass. The key insight is separating navigation (Playwright) from understanding (Claude API) — each does what it is good at.


Structured Data Extraction from HTML

Passing raw HTML to Claude API is reliable but expensive if you send entire pages. Claude Code will trim the HTML to the relevant section first:

import anthropic
import json
from playwright.async_api import async_playwright

client = anthropic.Anthropic()

EXTRACTION_SYSTEM = """You are a structured data extractor. Given raw HTML from a product page,
return ONLY a valid JSON object with these fields:
- title (string)
- price (number, USD, no currency symbol)
- rating (float 0-5, null if absent)
- review_count (integer, null if absent)
- sku (string, null if absent)
- availability (string: "in_stock" | "out_of_stock" | "unknown")
- description (string, first 200 chars of product description)

If a field cannot be found, return null. Never return anything outside the JSON object."""

async def extract_product(html: str, url: str) -

---

## Handling Pagination and Infinite Scroll

Claude Code generates two patterns depending on the site type.

**Pattern 1: Next-page button pagination**

Enter fullscreen mode Exit fullscreen mode


python
async def scrape_paginated(base_url: str, max_pages: int = 50) -


Frequently Asked Questions

Is web scraping legal?

Web scraping occupies a legal gray zone that depends on jurisdiction, what is being scraped, and how. Scraping publicly available, non-personal data is generally lawful in many jurisdictions (a 2022 Ninth Circuit ruling in hiQ v. LinkedIn affirmed this for public data under the CFAA). However, scraping in violation of a site's Terms of Service can expose you to breach-of-contract claims. Scraping personal data (names, emails, contact info) without a lawful basis violates GDPR in the EU and similar privacy laws elsewhere. Always read robots.txt, review the ToS, prefer official APIs, and consult a lawyer for commercial-scale scraping.

How does Claude API improve over regex or CSS selectors for extraction?

CSS selectors and regex break when a site redesigns its HTML structure — which happens constantly. Claude API reads the HTML semantically, so it finds the price field whether it is in a <span class="price">, a data-price attribute, or a JSON-LD script tag. In the benchmark above, a selector-based approach required 12 manual updates over the same 30-day period. The Claude-based extractor needed zero.

What model should I use for extraction — Haiku, Sonnet, or Opus?

Use claude-haiku-4-5 for structured extraction from HTML. It handles this task at 94%+ accuracy and costs roughly 25× less than Sonnet. Upgrade to Sonnet only if you need to extract from very noisy, unstructured content (forum posts, PDFs) or if you need multi-field reasoning across multiple pages simultaneously. See Claude Haiku vs Sonnet vs Opus: Which Model? for a full cost comparison.

How do I handle sites that block headless browsers?

First, check whether you should be scraping the site at all (see the legality question above). If you have a legitimate use case, Playwright with a real user agent string and the webdriver property removed passes most fingerprint checks. For stricter sites, add realistic mouse movement delays between actions. Playwright also supports authenticated sessions via storageState — you can log in once, save the session, and reuse it across runs without re-authenticating.

Can I run the scraper on a schedule without a server?

Yes. On macOS, use launchd (described above) — it is more reliable than cron because it runs even after system sleep. For cloud scheduling with zero infrastructure, a GitHub Actions workflow with a schedule: trigger runs your scraper on any cron expression for free (within 2,000 minutes/month on the free tier). For paid workloads, a simple Fly.io or Railway cron job costs roughly $1–3/month and gives you persistent storage without managing a full VM.

Top comments (0)