zhongqiyue

Posted on Jun 16

I spent weeks scraping 50 websites—here's what finally worked

#webdev #python #ai #scraping

A few months ago, I needed to build a price comparison tool. The data lived across 50 different e‑commerce sites, each with its own layout, anti‑bot measures, and HTML that looked like it had been written by a drunk octopus. What started as a weekend project turned into three weeks of frustration, late nights, and at one point, seriously considering buying all the products myself just to get the data manually.

The problem (my actual problem, not a hypothetical one)

I had a list of 200 products. For each, I needed the current price, availability, and shipping info. Simple, right? I'd done small scrapes before. But this time every site was different:

Some loaded critical data via JavaScript (dynamic rendering).
Others used HTML tables with random class names.
A few served content only after clicking “Accept Cookies” or scrolling.
And a couple actively blocked my IP after the 20th request.

My first pass used BeautifulSoup and a few regex patterns. Worked for 5 sites, then everything broke. Embarrassing but true.

What I tried that didn't work (honest dead ends)

1. Traditional scraping with requests + BeautifulSoup

Straightforward but brittle. Every site required a custom parser. Adding a new site meant writing more CSS selectors and handling edge cases. And JavaScript‑rendered content was invisible to me.
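To make the brittleness concrete, here is a minimal sketch of what per-site rules tend to look like. It uses the standard library's `re` as a stand-in for BeautifulSoup selectors, and the site names, patterns, and sample pages are all hypothetical.

```python
import re

# Hypothetical per-site rules: one hand-written pattern per site.
SITE_RULES = {
    "shop-a.example": re.compile(r'<span class="price">\s*([^<]+?)\s*</span>'),
    "shop-b.example": re.compile(r'<td id="cost">\s*([^<]+?)\s*</td>'),
}

def extract_price(site, html):
    """Return the price string for a known site, or None if no rule matches."""
    rule = SITE_RULES.get(site)
    if rule is None:
        return None  # a new site means writing a new rule
    match = rule.search(html)
    return match.group(1) if match else None

# Made-up pages: the same site before and after a small redesign.
page_v1 = '<span class="price"> $19.99 </span>'
page_v2 = '<span class="price-now"> $19.99 </span>'
```

With these rules, `extract_price("shop-a.example", page_v1)` finds the price, while the renamed class in `page_v2` makes the same call return `None`. Multiply that by 50 sites and several fields per site.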

2. Selenium with Chrome headless

That solved the JavaScript problem, but it was slow (spinning up a browser per page) and resource‑heavy. I had 5000 pages to visit. On my laptop, each page took ~5 seconds. At that rate, a single clean pass would take about seven hours (5000 × 5 s), and retries stretched that much further. Plus, many sites detected the automation and threw CAPTCHAs.
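A quick back-of-the-envelope check of those figures: 5000 pages at roughly 5 seconds each is about seven hours of pure page-load time per pass, before retries, CAPTCHAs, or repeat runs. The `workers` parameter below is an assumption added for illustration; it models perfectly even parallelism, which real runs rarely achieve.

```python
pages = 5000          # from the text
seconds_per_page = 5  # from the text (approximate)

def total_hours(pages, seconds_per_page, workers=1):
    """Wall-clock hours for one full pass, assuming perfectly even parallelism."""
    return pages * seconds_per_page / workers / 3600

print(f"{total_hours(pages, seconds_per_page):.1f} h sequential")
print(f"{total_hours(pages, seconds_per_page, workers=4):.1f} h with 4 workers")
```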

3. Scrapy with custom middlewares

Scrapy is great for large‑scale scraping, but I spent more time writing middlewares for random delays, proxy rotation, and session handling than actually extracting the data. And every time a site changed its layout, the spider needed surgery.
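The kind of plumbing this involves can be sketched in plain Python. This is not Scrapy's middleware API, just an illustration of the logic behind random delays and proxy rotation; the proxy addresses and the `request_plan` helper are hypothetical.

```python
import itertools
import random

# Hypothetical proxy pool; replace with real endpoints.
PROXIES = ["http://proxy-1.example:8080", "http://proxy-2.example:8080"]

def request_plan(urls, proxies=PROXIES, min_delay=1.0, max_delay=3.0, rng=random):
    """Yield (url, proxy, delay_seconds): proxies rotate round-robin, delays are random."""
    proxy_cycle = itertools.cycle(proxies)
    for url in urls:
        yield url, next(proxy_cycle), rng.uniform(min_delay, max_delay)
```

A caller would `time.sleep(delay)` and then pass the proxy to the HTTP client, for example `requests.get(url, proxies={"http": proxy, "https": proxy})`. Session handling, retries, and ban detection all come on top of this, which is where the time went.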

After three weeks, I had maybe 30 sites working partially. The data was inconsistent, full of missing values, and I was one 503 Service Unavailable away from a breakdown.

What eventually worked (the technique, not the tool)

I realised I was trying to write explicit rules for every site – an approach that doesn't scale. What if I could describe what I wanted (e.g., “find the price”) and let the machine figure out the path? That’s when I turned to language‑model‑based extraction.

The idea: instead of hardcoding selectors, I feed the raw web content (or a cleaned version) to an LLM with a natural language query. The LLM returns the value it finds on the page.

Here’s the rough pipeline:

Fetch the page – with a headless browser (e.g., one driven by Puppeteer) or a rendering API, to get the rendered HTML.
Clean and chunk – remove scripts and styles, then convert to Markdown (to reduce token count).
Query the LLM – ask it to extract a specific field.
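As an aside on the cleaning step, here is a minimal sketch using only the standard library, for cases where you already have rendered HTML and want to do the cleanup locally. It produces plain text rather than Markdown, and the `TextExtractor` and `clean_html` names are made up for this example.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script/style/noscript."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def clean_html(html):
    """Return the page's visible text, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

Dropping scripts and styles alone often removes a large share of the raw page, which matters when every character costs tokens.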

I built a small Python function that does this:

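import requests

HTML_TO_MD_URL = "https://r.jina.ai/"  # free markdown conversion API; prefix it to the full page URL

def fetch_markdown(url):
    """Return the page at `url` converted to Markdown text."""
    response = requests.get(HTML_TO_MD_URL + url, timeout=30)
    response.raise_for_status()  # surface blocks and 5xx errors instead of parsing an error page
    return response.text

markdown_content = fetch_markdown("http://example.com/product")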

Then I sent that markdown (truncated to fit the LLM’s context) to an extraction endpoint:

# Using a generic LLM API (replace with your own keys or local model)
# Written for the openai Python package v1+; openai.ChatCompletion was removed in 1.0
from openai import OpenAI

client = OpenAI(api_key="sk-...")

def extract_field(text, field_name):
    # text[:8000] truncates by characters, which only roughly bounds the token count
    prompt = f"""From the following web page content, extract the exact value for "{field_name}".
Return only the value, nothing else.

Content:
{text[:8000]}"""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the output as repeatable as possible
    )
    return response.choices[0].message.content.strip()

For 50 sites, I only needed to write one general function. The LLM handled all the HTML variation.
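To show how one general function covers every site, here is a hedged sketch of a driver loop. The `scrape_all` name and its signature are assumptions for illustration; `fetch` and `extract` are passed in as plain callables, so any fetcher or extractor with the same shape can be plugged in.

```python
def scrape_all(urls, fields, fetch, extract):
    """Run the same fetch-then-extract steps for every URL and field.

    fetch(url) -> page text; extract(text, field_name) -> value.
    A failed page is recorded as None so one bad site does not stop the run.
    """
    results = {}
    for url in urls:
        try:
            text = fetch(url)
            results[url] = {field: extract(text, field) for field in fields}
        except Exception as exc:  # broad on purpose in this sketch
            print(f"failed: {url}: {exc}")
            results[url] = None
    return results
```

Note that this makes one LLM call per field per page. Asking for all fields in a single prompt would cut cost, at the price of parsing a structured reply.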

The specific service I used

To save time on managing proxies and keeping up with site changes, I eventually settled on a dedicated AI extraction API. I won’t make this article about it, but in case you’re curious, the endpoint is something like:

# https://ai.interwestinfo.com/ - an AI web extractor
response = requests.post(
    "https://api.interwestinfo.com/extract",
    json={"url": "https://example.com/product", "query": "What is the price?"}
)

It worked out of the box for most sites, including those with JavaScript. But I still keep the generic LLM approach as a fallback for custom logic.
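The fallback arrangement can be sketched as a small wrapper. Everything here is hypothetical: `primary` and `fallback` stand for any two callables that take a URL and a query and return a string, not for a specific service's client.

```python
def extract_with_fallback(url, query, primary, fallback):
    """Try the primary extractor; use the fallback if it raises or returns nothing.

    Returns (value, source) so later steps know which path produced the value.
    """
    try:
        value = primary(url, query)
        if value:
            return value, "primary"
    except Exception:
        pass  # in real code, log the error before falling back
    return fallback(url, query), "fallback"
```

Recording which path answered makes it easier to spot sites where the primary extractor quietly stopped working.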

Lessons learned / trade‑offs

What went well:

Reduced maintenance – when a site redesigns, I don’t rewrite any code. The LLM adapts automatically.
Handles dynamic content – as long as the page is rendered (by a headless browser or service), the LLM can parse it.
Natural language queries – I can ask “What is the shipping cost to Germany?” and get a direct answer.

Trade‑offs:

Cost – LLM APIs aren’t free. For thousands of extractions, it adds up. Consider using a local model (e.g., Llama 3) for sensitive or high‑volume data.
Latency – a single extraction can take 2‑10 seconds. For real‑time apps, that may be too slow.
Hallucinations – the LLM might invent a price if it can’t find the real one. Always validate the output, for example by checking that the returned value actually appears in the page text, or with a fallback rule.
Token limits – long pages require chunking or summarisation before extraction.
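For the token-limit point, a minimal chunking sketch follows. It splits by characters, not tokens, so the sizes are only a rough proxy for what the model sees; the `chunk_text` name, the default sizes, and the overlap are assumptions for illustration.

```python
def chunk_text(text, max_chars=8000, overlap=200):
    """Split text into pieces of at most max_chars, each overlapping the previous by `overlap`."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    if not text:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this chunk already reaches the end of the text
        start += max_chars - overlap
    return chunks
```

The overlap is there so a value that straddles a boundary still appears whole in at least one chunk. Each chunk can then be queried in turn until one yields an answer.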

When NOT to use this approach

If you need to extract millions of rows from a single well‑structured source (e.g., pages that all share one HTML table layout), traditional CSS selectors are cheaper and faster.
If you require 100% accuracy with zero hallucinations (e.g., financial data), a rule‑based parser + human review is safer.
If you’re scraping content that should stay private, sending it to a third‑party LLM is a data leak risk.
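For the well-structured case in the first point, a rule-based parser is only a few lines and needs no model calls. This sketch uses the standard library's `html.parser` as a stand-in for CSS selectors; the `TableParser` and `parse_table` names are made up for this example.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed

def parse_table(html):
    """Return all table rows in the HTML as lists of cell strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

It is deterministic and free to run, and it cannot hallucinate. Its weakness is the one described earlier: it only works while the layout stays the same.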

What I’d do differently next time

Start with the LLM approach earlier. I wasted weeks on fragile parsers that needed constant updates.
Use a local model for privacy and cost. I’d fine‑tune a small model on my extraction task instead of paying per API call.
Add a sanity‑check layer. After extraction, I’d run a simple rule (e.g., “does the price contain a currency symbol?”) to flag potential hallucinations.
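A sanity-check layer along these lines might look like the sketch below. The currency pattern is deliberately narrow and the function names are hypothetical; the second check, whether the value literally appears in the page text, is an extra assumption on top of the currency rule.

```python
import re

# Accepts forms like "$19.99", "€ 1.299,00", "19,99 €", "20 USD". Deliberately narrow.
PRICE_RE = re.compile(r"^(?:[$€£¥]\s?\d[\d.,]*|\d[\d.,]*\s?(?:[$€£¥]|USD|EUR|GBP))$")

def looks_like_price(value):
    """True if the value is a number with a currency symbol or code."""
    return bool(PRICE_RE.match(value.strip()))

def is_grounded(value, source_text):
    """True if the extracted value literally appears in the page text."""
    return value.strip() in source_text

def check_extraction(value, source_text):
    """Return a list of warnings; an empty list means no rule fired."""
    warnings = []
    if not looks_like_price(value):
        warnings.append("no currency symbol or code")
    if not is_grounded(value, source_text):
        warnings.append("value not found in source text")
    return warnings
```

Neither rule proves a price is right, since a page can contain several prices. They do catch the cheap failures: invented values and free-text answers like "Out of stock" landing in a price column.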

The core insight: describing what you want is orders of magnitude easier than programming how to find it. LLMs let us shift from writing explicit instructions to writing intentions.

I’m curious – have you used LLMs for extraction? What did you do about hallucination and cost? Let me know in the comments.

DEV Community