If you've ever tried to parse messy HTML with BeautifulSoup and regex, you know the pain. Selectors break when the site updates, edge cases multiply, and you end up with a brittle script that works until it doesn't.
Claude AI changes this completely. Instead of writing fragile selectors, you describe what you want in plain English — and Claude extracts it. Here's how to build a working web scraper in 30 minutes using Python and the Claude API.
What You'll Build
A scraper that:
- Fetches any web page with
requests - Sends the HTML to Claude
- Gets back clean, structured JSON data
No CSS selectors. No XPath. No regex hell.
Prerequisites
- Python 3.8+
pip install anthropic requests- An Anthropic API key from console.anthropic.com
Step 1: Fetch the Page
Start with a basic fetch:
import requests
def fetch_page(url: str) -> str:
headers = {"User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)"}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
return response.text
Nothing fancy — just grab the raw HTML. We'll let Claude handle the extraction logic.
Step 2: Extract Data with Claude
Here's the core of the approach. Pass the HTML to Claude with a clear extraction prompt:
import anthropic
import json
client = anthropic.Anthropic()
def extract_data(html: str, what_to_extract: str) -> dict:
# Trim HTML to avoid huge token usage
trimmed = html[:15000]
message = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
messages=[{
"role": "user",
"content": f"""Extract the following from this HTML and return ONLY valid JSON:
What to extract: {what_to_extract}
HTML:
{trimmed}
Return only the JSON object, no explanation."""
}]
)
return json.loads(message.content[0].text)
Claude reads the HTML like a human would — it understands context, handles malformed markup, and returns clean structured data. No selector maintenance required.
Step 3: Put It Together
def scrape(url: str, fields: str) -> dict:
print(f"Fetching {url}...")
html = fetch_page(url)
print("Extracting with Claude...")
data = extract_data(html, fields)
return data
# Example: scrape Hacker News top stories
result = scrape(
"https://news.ycombinator.com",
"list of top 5 post titles and their point counts"
)
print(result)
# {'posts': [{'title': '...', 'points': 342}, ...]}
Run it. You'll have structured data from any page in seconds.
Why This Works Better Than Traditional Scraping
Traditional scrapers break when:
- A class name changes from
post-titletoarticle-heading - Data moves to a different DOM node
- The site adds an A/B test that shuffles layout
Claude-based scrapers are resilient because Claude understands intent. Tell it "get the product price" and it finds it regardless of which span it's in.
Practical Tips
Trim your HTML first. Most pages have 50KB+ of nav, footer, and script tags you don't need. Slice to the relevant section before sending:
# Find the main content area before sending to Claude
start = html.find('<main')
end = html.find('</main>') + 7
main_content = html[start:end] if start != -1 else html[:15000]
Be specific in your prompts. "Extract product info" is vague. "Extract product name, price in USD, and whether it's in stock" gets you clean output every time.
Cache responses. If you're running the same extraction repeatedly, cache the HTML fetch and only re-run Claude when the content changes.
What's Next
This approach works great for one-off extractions. But for sites that require login, JavaScript rendering, or CAPTCHA handling, you need browser automation on top.
I've packaged authenticated scraping, dynamic page handling, retry logic, and Claude extraction into a ready-to-deploy starter kit. If you want to skip setup and go straight to production-grade scraping, check out the Claude Browser Agent Starter Kit — it's what I use for all my automation projects.
The 30-minute version above gets you surprisingly far. Most public pages are fair game. Once you need to go deeper — login flows, JS-heavy SPAs, rate limiting — the starter kit has you covered.
Have questions or want to see a specific site scraped? Drop a comment below.
Top comments (0)