Henry Knight

Posted on Jun 7 • Edited on Jun 13

Build a Web Scraper in 30 Minutes with Claude AI

#webdev #python #automation #ai

If you've ever tried to parse messy HTML with BeautifulSoup and regex, you know the pain. Selectors break when the site updates, edge cases multiply, and you end up with a brittle script that works until it doesn't.

Claude offers a different approach. Instead of writing fragile selectors, you describe what you want in plain English and Claude extracts it. Here's how to build a working web scraper in 30 minutes using Python and the Claude API.

What You'll Build

A scraper that:

Fetches any web page with requests
Sends the HTML to Claude
Gets back clean, structured JSON data

No CSS selectors. No XPath. No regex hell.

Prerequisites

Python 3.8+
pip install anthropic requests
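An Anthropic API key from console.anthropic.com, exported as the ANTHROPIC_API_KEY environment variable (anthropic.Anthropic() reads it by default)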

Step 1: Fetch the Page

Start with a basic fetch:

import requests

def fetch_page(url: str) -> str:
    headers = {"User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return response.text

Nothing fancy — just grab the raw HTML. We'll let Claude handle the extraction logic.

Step 2: Extract Data with Claude

Here's the core of the approach. Pass the HTML to Claude with a clear extraction prompt:

import anthropic
import json

client = anthropic.Anthropic()

def extract_data(html: str, what_to_extract: str) -> dict:
    # Crude cap: keep only the first 15,000 characters to limit token usage.
    # Anything after that point is never sent to Claude.
    trimmed = html[:15000]

    message = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Extract the following from this HTML and return ONLY valid JSON:

What to extract: {what_to_extract}

HTML:
{trimmed}

Return only the JSON object, no explanation."""
        }]
    )

    return json.loads(message.content[0].text)

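Claude reads the HTML much as a human would: it uses context, tolerates malformed markup, and returns structured data. There are no selectors to maintain, though anything past the 15,000-character cutoff is never seen, so the data you want has to appear before it.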
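
One practical wrinkle: even when asked for JSON only, a model reply can arrive wrapped in a Markdown code fence or with a sentence around it, and json.loads will then raise. Here is a minimal sketch of a more forgiving parser. The name parse_json_reply is a hypothetical helper for this post, not part of the anthropic SDK, and the fallback only handles replies whose top level is a JSON object.

```python
import json
import re

FENCE = "`" * 3  # three backticks, the Markdown code-fence marker


def parse_json_reply(text: str):
    """Parse JSON from a model reply, tolerating Markdown code fences.

    Hypothetical helper for illustration; not part of the anthropic SDK.
    """
    cleaned = text.strip()
    # Drop a leading fence (optionally tagged, e.g. "json") and a trailing fence.
    cleaned = re.sub(rf"^{FENCE}[a-zA-Z]*\s*", "", cleaned)
    cleaned = re.sub(rf"\s*{FENCE}$", "", cleaned)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span if prose surrounds the JSON.
        start, end = cleaned.find("{"), cleaned.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(cleaned[start:end + 1])
```

To use it, the last line of extract_data could return parse_json_reply(message.content[0].text) instead of calling json.loads directly.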

Step 3: Put It Together

def scrape(url: str, fields: str) -> dict:
    print(f"Fetching {url}...")
    html = fetch_page(url)

    print("Extracting with Claude...")
    data = extract_data(html, fields)

    return data

# Example: scrape Hacker News top stories
result = scrape(
    "https://news.ycombinator.com",
    "list of top 5 post titles and their point counts"
)

print(result)
# {'posts': [{'title': '...', 'points': 342}, ...]}

Run it. For a static page whose content fits within the 15,000-character cutoff, you'll have structured data in seconds.

Why This Works Better Than Traditional Scraping

Traditional scrapers break when:

A class name changes from post-title to article-heading
Data moves to a different DOM node
The site adds an A/B test that shuffles layout

Claude-based scrapers are resilient because Claude understands intent. Tell it "get the product price" and it finds it regardless of which span it's in.

Practical Tips

Trim your HTML first. Most pages have 50KB+ of nav, footer, and script tags you don't need. Slice to the relevant section before sending:

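# Find the main content area before sending to Claude
start = html.find('<main')
end = html.find('</main>', start)
# Fall back to a plain prefix if either tag is missing
if start != -1 and end != -1:
    main_content = html[start:end + len('</main>')]
else:
    main_content = html[:15000]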
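
Even inside the main content area, inline scripts, styles, and comments can eat a large share of the character budget. Below is a rough sketch of a pre-filter that drops them before the HTML goes to Claude. The strip_noise name is hypothetical, and the regexes are a crude token-saving pass rather than a real HTML parser, so treat the result as "good enough to send to a model", not as well-formed markup.

```python
import re

# Matches <script>, <style>, and <noscript> elements including their contents.
_NOISE_TAGS = re.compile(
    r"<(script|style|noscript)\b[^>]*>.*?</\1\s*>",
    flags=re.DOTALL | re.IGNORECASE,
)
# Matches HTML comments, which can span several lines.
_COMMENTS = re.compile(r"<!--.*?-->", flags=re.DOTALL)
# Matches runs of blank lines left behind by the removals.
_BLANK_RUNS = re.compile(r"\n\s*\n+")


def strip_noise(html: str) -> str:
    """Remove script/style/noscript blocks and comments (hypothetical helper)."""
    html = _NOISE_TAGS.sub("", html)
    html = _COMMENTS.sub("", html)
    return _BLANK_RUNS.sub("\n", html).strip()
```

Running strip_noise on the page before slicing it means the 15,000-character cap is spent on visible content rather than on JavaScript.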

Be specific in your prompts. "Extract product info" is vague. "Extract product name, price in USD, and whether it's in stock" tells Claude exactly which fields and units you expect, so the output is far more consistent from run to run.
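
One way to keep prompts specific is to generate them from a field list instead of typing them freehand. The sketch below assumes a hypothetical build_field_spec helper; the key names and descriptions are made up for illustration, and the returned string is meant to be passed as the what_to_extract argument of extract_data.

```python
def build_field_spec(fields: dict) -> str:
    """Turn {field_name: description} into an explicit extraction request.

    Hypothetical helper; the returned string is intended for the
    what_to_extract argument of extract_data().
    """
    lines = ["a JSON object with exactly these keys:"]
    for name, description in fields.items():
        lines.append(f'- "{name}": {description}')
    lines.append("Use null for any value that is not present on the page.")
    return "\n".join(lines)


# Example field list (illustrative names, not tied to any real site).
spec = build_field_spec({
    "name": "product name as a string",
    "price_usd": "price in USD as a number, without the currency symbol",
    "in_stock": "true or false",
})
print(spec)
```

Naming the keys and types up front also makes the result easier to validate, since you know exactly which keys to check for after parsing.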

Cache responses. If you're running the same extraction repeatedly, cache the HTML fetch and only re-run Claude when the content changes.
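
A simple way to do this is to key the cache on a hash of the HTML plus the field description, so Claude is only called when either one changes. This is a minimal sketch under stated assumptions: cached_extract and the .scrape_cache directory are hypothetical names, and the extraction function is passed in as a parameter (for example, extract_data from Step 2).

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_extract(
    html: str,
    fields: str,
    extract: Callable[[str, str], dict],
    cache_dir: str = ".scrape_cache",
) -> dict:
    """Call extract(html, fields) only when this html/fields pair is new.

    Hypothetical helper: results are stored as JSON files named after a
    SHA-256 hash of the inputs.
    """
    key = hashlib.sha256(f"{fields}\n{html}".encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"{key}.json"
    if path.exists():
        # Same content and same request as before: reuse the stored result.
        return json.loads(path.read_text(encoding="utf-8"))

    data = extract(html, fields)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

Pages that embed timestamps or session tokens will hash differently on every fetch, so it helps to hash only the trimmed content you actually send.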

What's Next

This approach works great for one-off extractions. But for sites that require login, JavaScript rendering, or CAPTCHA handling, you need browser automation on top.

I've packaged authenticated scraping, dynamic page handling, retry logic, and Claude extraction into a ready-to-deploy starter kit. If you want to skip setup and go straight to production-grade scraping, check out the Claude Browser Agent Starter Kit — it's what I use for all my automation projects.

The 30-minute version above gets you surprisingly far. Most public pages are fair game. Once you need to go deeper — login flows, JS-heavy SPAs, rate limiting — the starter kit has you covered.

Have questions or want to see a specific site scraped? Drop a comment below.

DEV Community