A few months ago, I was building a small side project that needed to compare prices across a dozen e-commerce sites. Simple enough, right? I've written scrapers before. BeautifulSoup, a few clever selectors, maybe a regex or two – that's all you need.
Except it wasn't. Every site had its own HTML structure. Some loaded content via JavaScript. Others deliberately messed up class names. I spent more time updating selectors than writing actual logic. I was maintaining a fragile tower of CSS selectors that crumbled every time a developer at one of those stores decided to rename a div.
Frustrated, I started thinking: what if I stopped trying to parse the HTML structure and instead asked a human to read the page and give me the data? But I don't have a human. I have an API.
What I Tried That Didn't Work
First, the obvious: requests + BeautifulSoup. That worked for maybe 40% of sites. The rest either required JavaScript rendering (Selenium) or had such chaotic markup that my selectors kept breaking. I tried using CSS selector chaining, XPath, even scraping by position on the page. Nothing was robust.
Then I switched to Playwright for headless browsing. That solved the JavaScript problem, but the parsing was still brittle. I'd write a selector like .product-price, and the next week it would be .price--main. I needed something that understood intent, not structure.
The Lightbulb: Ask an LLM
I had been playing with OpenAI's API for other tasks, and it struck me: why not feed the raw HTML (or visible text) to a language model and ask it to extract the fields I need? LLMs are good at understanding natural language. If I tell it "find the price, the product name, and the availability status", it should be able to figure out where that information is in the text, even if the HTML is a mess.
I tested this with a few pages. I stripped the HTML to just the visible text (using html2text or by extracting all text nodes) and sent that along with a prompt. The results were surprisingly good. The model could handle variations like "$49.99" vs "49.99 USD", and it rarely hallucinated non-existent data.
Code Example: A Simple LLM-Powered Scraper
Here's a minimal Python implementation using OpenAI's API. The key is to truncate the text to avoid token limits and to provide clear instructions in the prompt.
import openai
import requests
from bs4 import BeautifulSoup

# Note: openai.ChatCompletion is the pre-1.0 SDK interface (openai<1.0).
# Newer SDK versions expose the same call as client.chat.completions.create().
openai.api_key = "sk-..."


def fetch_visible_text(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.text, 'html.parser')
    # Remove script and style elements
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Get text and clean whitespace
    text = soup.get_text(separator=' ', strip=True)
    # Truncate to first 4000 chars (adjust based on model window)
    return text[:4000]


def extract_data_with_llm(text):
    prompt = f"""
    You are a data extraction assistant. Given the following visible text from a product page,
    extract the product name, price, and availability status (in stock, out of stock, or unknown).
    Return JSON with keys "name", "price", and "availability".
    Text:
    {text}
    """
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the extraction as repeatable as possible
        max_tokens=200
    )
    # The reply is a string that should contain JSON; it is not parsed here
    return response.choices[0].message.content


# Example usage
url = "https://example.com/product/123"
text = fetch_visible_text(url)
result = extract_data_with_llm(text)
print(result)
# Example output: {"name": "Fancy Widget", "price": "$49.99", "availability": "in stock"}
This approach worked for most of my target sites without needing to change a single line of code per site. I just fed it the text, and it extracted the fields.
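One detail the example glosses over: extract_data_with_llm returns the model's reply as a string, and models sometimes wrap JSON in a Markdown code fence. Here is a minimal sketch of turning that reply into a dict. The helper name parse_llm_json is my own for illustration, not part of any SDK, and the fence handling is an assumption about how the reply might be wrapped.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker


def parse_llm_json(reply: str) -> dict:
    """Parse the model's reply into a dict, tolerating a Markdown code fence."""
    cleaned = reply.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (for example one tagged "json") ...
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        # ... and everything from the closing fence onward.
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    data = json.loads(cleaned)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # Keep only the keys the prompt asked for; missing keys become None.
    return {key: data.get(key) for key in ("name", "price", "availability")}
```

Letting json.JSONDecodeError propagate is deliberate in this sketch: a reply that isn't valid JSON is a signal to retry or flag the page, not something to paper over.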
Trade-offs and When NOT to Use This
Let's be honest: this isn't a silver bullet. Here are the downsides I hit:
- Cost: Each page scrape costs a fraction of a cent with GPT-3.5-turbo, but if you're scraping thousands of pages daily, it adds up. For personal projects or small datasets, it's fine.
- Latency: An API call takes 1–3 seconds per page. Scraping 10,000 pages one at a time works out to roughly 3 to 8 hours. Traditional parsing is near-instant.
- Accuracy: LLMs can still make mistakes. Sometimes the model misinterprets a discounted price as the normal price, or returns "out of stock" when the product is just temporarily unavailable. I had to add validation steps (e.g., check that price contains a number).
- Privacy: You're sending page content to a third-party API. If your data is sensitive (e.g., internal dashboards), don't do this. Use a local model (see alternatives below).
- Token limits: Every model has a finite context window, so you can't always send a whole 100KB HTML page. Truncation may lose important context. An alternative is to send the entire page to a model with a larger context window (like GPT-4 Turbo, gpt-4-1106-preview, with 128k tokens), but that's more expensive.
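The validation step from the accuracy point can be sketched roughly like this. It assumes the record has already been parsed into a dict with the three keys from the prompt; validate_record and ALLOWED_AVAILABILITY are names I made up for illustration, and the checks are deliberately loose.

```python
import re

# The three statuses the prompt asks the model to choose from.
ALLOWED_AVAILABILITY = {"in stock", "out of stock", "unknown"}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one extracted record (empty if none)."""
    problems = []

    name = record.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("name is missing or empty")

    price = record.get("price")
    if not isinstance(price, str) or not re.search(r"\d", price):
        problems.append("price does not contain a number")

    availability = record.get("availability")
    if (not isinstance(availability, str)
            or availability.strip().lower() not in ALLOWED_AVAILABILITY):
        problems.append("availability is not one of the expected values")

    return problems
```

A check like this won't catch a discounted price reported as the regular one, since both are valid numbers. It only filters out replies that are structurally wrong.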
Alternatives and Improvements
If you want to avoid vendor lock-in or reduce costs, consider:
- Local models: Run a small LLM like Llama 3.1 8B or Phi-3 locally using Ollama. Speed and privacy improve, but accuracy may drop. I had decent results with Llama 3.1 for simple extractions.
- Hybrid approach: Use CSS selectors for sites with stable structure, and fall back to LLM only when parsing fails. This combines speed with flexibility.
- Prompt engineering: Provide few-shot examples in the prompt to improve accuracy. I also added a system message: "If uncertain, return null for that field."
- Caching: Cache LLM responses for identical pages. You can hash the visible text and reuse results until the page changes.
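The caching idea is simple enough to sketch. This version keeps results in an in-memory dict keyed by a SHA-256 hash of the visible text; a real setup would probably persist the cache to disk or a database. cached_extract is a hypothetical helper, and extract stands in for whatever function actually calls the LLM.

```python
import hashlib
from typing import Callable

# Maps a hash of the page text to the extraction result for that text.
_cache: dict[str, str] = {}


def cached_extract(text: str, extract: Callable[[str], str]) -> str:
    """Call extract(text) only if this exact text has not been seen before."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = extract(text)
    return _cache[key]
```

One caveat: pages with timestamps, rotating banners, or session-specific text will hash differently on every fetch, so the hit rate depends on how stable the visible text is.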
I ended up building a small framework that tries a regex-based extraction first, then BeautifulSoup selectors, and only calls the LLM as a last resort. That reduced my API costs by 80% while keeping the scraper robust.
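To give a feel for the shape of that tiered setup, here is a stripped-down sketch. It is not the actual framework: the tier functions, their names, and the rule that the regex tier only answers when it finds exactly one distinct dollar amount are all assumptions made for illustration.

```python
import re
from typing import Callable, Iterable, Optional

# A tier takes page text and returns a price string, or None if it can't decide.
Extractor = Callable[[str], Optional[str]]


def price_by_regex(text: str) -> Optional[str]:
    """Cheapest tier: accept only an unambiguous, single dollar amount."""
    matches = set(re.findall(r"\$\d+\.\d{2}", text))
    return matches.pop() if len(matches) == 1 else None


def extract_with_fallback(text: str, tiers: Iterable[Extractor]) -> Optional[str]:
    """Try each tier in order and return the first non-empty result."""
    for tier in tiers:
        result = tier(text)
        if result:
            return result
    return None
```

In use, the tier list would look something like [price_by_regex, a selector-based tier, an LLM-backed tier], ordered from cheapest to most expensive. A page showing both a crossed-out price and a sale price makes the regex tier return None, so the decision gets passed down the chain.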
What I'd Do Differently Next Time
If I were starting over, I'd:
- Invest in good site-specific heuristics before reaching for the LLM. The AI is a crutch for lazy parsing, not a replacement for understanding the data.
- Use a tool like ai.interwestinfo.com as inspiration – it automates some of this thinking, but I'd still want control over the prompts and models.
- Add more validation – use type checking (e.g., price must match regex \$\d+\.\d{2}) to catch LLM errors.
- Explore fine-tuning a small model on my specific extraction tasks. That could reduce cost and latency significantly.
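As a rough sketch of the stricter validation mentioned in the list above, the price regex can double as a parser that yields a Decimal. parse_price is a hypothetical helper, and anchoring the pattern to the whole string with fullmatch is my assumption about how strict the check should be.

```python
import re
from decimal import Decimal
from typing import Optional

# Dollar sign, one or more digits, a dot, exactly two decimal places.
PRICE_PATTERN = re.compile(r"\$(\d+\.\d{2})")


def parse_price(value: object) -> Optional[Decimal]:
    """Return the amount as a Decimal if value looks like "$49.99", else None."""
    if not isinstance(value, str):
        return None
    match = PRICE_PATTERN.fullmatch(value.strip())
    return Decimal(match.group(1)) if match else None
```

The trade-off is that a pattern this strict also rejects legitimate variants such as "49.99 USD" or "$1,299.00", so it works best when the prompt also tells the model which format to return.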
Final Thoughts
LLMs turned my fragile scraper into something that could adapt to new sites without me touching code. That's powerful. But it's not a panacea. If you need high-throughput, low-latency scraping with perfect accuracy, stick with traditional methods. If you're like me – fighting constant CSS changes and just want the damn data – give this approach a try.
Have you experimented with AI-assisted scraping? What worked (or didn't) for you? I'd love to hear how you handle sites that fight back.