Last month I was scraping product data from 15 different e-commerce sites.
Every site had different HTML. Every site broke my selectors every few weeks. I was spending more time maintaining scrapers than actually using the data.
Then I tried something stupid: I sent the raw HTML to Claude and said "extract the product name, price, and rating."
It worked. On every site. Without a single CSS selector.
## The Old Way (200+ lines per site)

```python
# site1.py
price = soup.find('span', class_='price-tag__amount').text.strip()
name = soup.find('h1', {'data-testid': 'product-title'}).text
rating = soup.find('div', class_='star-rating').get('aria-label')

# site2.py (completely different selectors)
price = soup.select_one('.pdp-price .sale-price').text
name = soup.select_one('#productName').text
rating = soup.select_one('.ratings-count span').text

# site3.py (different again)
# ... you get the idea
```

15 sites × ~15 selectors each = 225 CSS selectors that break whenever a site updates its HTML.
## The New Way (one short function)

```python
import anthropic

def extract(html_text, prompt):
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-6-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Extract from this text. Return JSON only.\n"
                       f"Task: {prompt}\nText:\n{html_text[:5000]}"
        }]
    )
    return response.content[0].text

# Works on ANY site. No selectors.
data = extract(page_text, "Extract: product name, price, rating")
```

That's it. One short function, and it works on every site.
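One practical wrinkle: even when you ask for "JSON only," models sometimes wrap the output in a markdown code fence. A small helper (hypothetical name, not part of the snippet above) makes the response safe to parse:

```python
import json
import re

def parse_llm_json(raw: str) -> dict:
    """Parse JSON from an LLM response, tolerating an optional
    ```json ... ``` markdown wrapper. (Hypothetical helper.)"""
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", raw, re.DOTALL)
    if match:
        raw = match.group(1)
    return json.loads(raw)
```

Feeding the model's reply through this before touching the fields saves you from the occasional stray fence breaking your pipeline.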
## But What About Cost?
This is the first question everyone asks. Here's the math:
- Claude Sonnet: ~$0.003 per page (3 cents per 10 pages)
- 1,000 pages/day = $3/day = $90/month
- Compare to: 20 hours/month maintaining selectors × $50/hr = $1,000/month
AI extraction is 10x cheaper than manual maintenance.
For low-volume scraping (<100 pages/day), cost is basically zero.
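The arithmetic above is easy to adapt to your own volume. A back-of-the-envelope estimator (the ~$0.003/page figure is this post's assumption; your actual rate depends on page size and model pricing):

```python
def monthly_cost(pages_per_day: int, cost_per_page: float = 0.003) -> float:
    """Estimated monthly spend in dollars, assuming 30 days/month."""
    return pages_per_day * cost_per_page * 30

# 1,000 pages/day at ~$0.003/page -> about $90/month
print(monthly_cost(1000))
```

Plug in your own per-page cost once you've measured real token counts; trimming the HTML (as the `[:5000]` slice above does) is the biggest lever.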
## When NOT to Use This

To be honest, LLM extraction isn't always the answer:
- High volume (10K+ pages/day): Selectors are faster and cheaper at scale
- Simple, stable sites: If the HTML never changes, selectors work fine
- Structured APIs available: Always prefer an API over scraping
- Real-time data: LLM adds ~1-2 seconds latency per page
## The Hybrid Approach (What I Actually Use)

```python
def smart_extract(url, prompt):
    # Try CSS selectors first (fast + free)
    result = try_selectors(url)
    if result and result.is_valid():
        return result
    # Fall back to LLM (slower but never breaks)
    return llm_extract(url, prompt)
```
Selectors for speed. LLM as fallback. Best of both worlds.
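`try_selectors()` and `is_valid()` are left abstract above. A minimal sketch of what the result type and validity check could look like (all names here are hypothetical, not from the post's code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Extraction:
    """Result of one extraction attempt. A selector run that only
    half-worked (e.g. price missing) should fail validation so the
    LLM fallback kicks in."""
    name: Optional[str] = None
    price: Optional[float] = None
    rating: Optional[float] = None

    def is_valid(self) -> bool:
        # Require the fields you can't live without; rating is optional
        return self.name is not None and self.price is not None
```

The key design point: validation is what makes the hybrid safe. Without it, a selector that silently matches the wrong element returns garbage instead of triggering the fallback.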
## Your Turn
I'm genuinely curious: are you still writing CSS selectors for scraping, or have you switched to AI extraction?
If you've tried LLM-based scraping, what was your experience? Better? Worse? Weird edge cases?
Drop your story in the comments — I'll share the most interesting ones in a follow-up post.
I open-sourced my extraction code: LLM Data Extraction on GitHub. Star it if you want to try the approach yourself.
More scraping tools: Awesome Web Scraping 2026 — 77+ free tools and APIs.