I hit a wall last month. A client wanted me to scrape product details from fifty different e-commerce sites — prices, sizes, colors, descriptions. I’ve done scraping before, so I started confidently: BeautifulSoup, CSS selectors, regex patterns for price formats.
Two hours in, I was staring at a mess. One site had prices as $19.99, another as 19,99 €. Sizes were <select> dropdowns on one page, radio buttons on another, and free-text on a third. Every site had its own quirks. By the end of the day, I had fragile regex for each domain, and the client wanted to add ten more sites next week. I knew this wouldn’t scale.
What I tried (and why it hurt)
First, I doubled down on patterns. I wrote more aggressive regex to capture £19.99 and 19.99 GBP and $19.99 – $29.99. It turned into a nest of alternations and named groups. I kept a mental list of edge cases I missed — like a price that included a discount badge inline (was $29.99, now $19.99).
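For illustration, a pattern of this kind might look like the sketch below. It is not the project's actual regex; `PRICE_RE`, its group names, and `find_prices` are hypothetical names made up for the example.

```python
import re

# Hypothetical pattern: optional symbol, amount, optional symbol or ISO code.
PRICE_RE = re.compile(
    r"""
    (?P<prefix>[$£€])?\s*                        # currency symbol before the amount
    (?P<amount>\d{1,3}(?:[.,]\d{3})*[.,]\d{2})   # 19.99, 19,99, 1,299.00
    \s*(?P<suffix>[$£€]|USD|EUR|GBP)?            # symbol or code after the amount
    """,
    re.VERBOSE,
)


def find_prices(text: str) -> list[str]:
    """Return every price-looking substring, in order of appearance."""
    return [m.group(0).strip() for m in PRICE_RE.finditer(text)]


print(find_prices("19,99 €"))
print(find_prices("was $29.99, now $19.99"))
```

On the discount-badge string the function returns both prices, and nothing in the match says which one is current. Handling that means yet another special case, which is how the alternations pile up.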
Then I tried visual parsing: headless Chrome with Playwright, waiting for the DOM to stabilize, then walking the tree looking for text nodes that looked like prices. It was slow and still brittle. A site redesign would break everything.
I almost gave up and hired a data entry person. But I wanted a solution that could handle any site, even ones I hadn’t seen yet.
The approach that finally worked: structured LLM output
I’d been playing with GPT-4 for code generation, but I hadn’t considered using it for data extraction. The idea was simple: give the LLM the raw HTML (or a cleaned text version) and a schema of what I wanted, and ask it to output JSON.
The key was function calling (or tool use). Instead of asking the model to write free-form JSON, I defined the exact fields I wanted, with types and descriptions, as a function. Then the model returns a structured object that I can validate programmatically.
Here’s how I set it up in Python:
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List, Optional
from bs4 import BeautifulSoup
import json

client = OpenAI(api_key="sk-...")


# Define the schema
class ProductInfo(BaseModel):
    name: str = Field(description="Product full name")
    price: str = Field(description="Current price as displayed, e.g. '$19.99'")
    original_price: Optional[str] = Field(None, description="Original price if discounted")
    currency: str = Field("USD", description="Currency code (USD, EUR, GBP, etc.)")
    colors: List[str] = Field(default_factory=list, description="Available colors")
    sizes: List[str] = Field(default_factory=list, description="Available sizes or dimensions")
    description: str = Field(description="Short product description")


# Convert Pydantic model to OpenAI function definition
extraction_function = {
    "name": "extract_product",
    "description": "Extract structured product info from HTML",
    "parameters": ProductInfo.model_json_schema()
}


def strip_html(html: str) -> str:
    # Stand-in helper (an assumption: the original strip_html is not shown).
    # Drops scripts and styles, then returns the visible text.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


def extract_from_html(html: str) -> ProductInfo:
    # We strip most of the markup to focus on visible text
    # In practice, you might use a text extraction library like trafilatura
    plain_text = strip_html(html)[:3000]  # Truncate to 3,000 characters to stay within token limits
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a data extraction assistant. Extract the requested product information from the provided text. If a field is not present, leave it null or empty."
            },
            {
                "role": "user",
                "content": f"Extract product info from this text:\n\n{plain_text}"
            }
        ],
        # Forcing tool_choice makes the model always answer via extract_product
        tools=[{"type": "function", "function": extraction_function}],
        tool_choice={"type": "function", "function": {"name": "extract_product"}}
    )
    tool_call = response.choices[0].message.tool_calls[0]
    arguments = json.loads(tool_call.function.arguments)
    # Raises pydantic.ValidationError if a field is missing or has the wrong type
    return ProductInfo.model_validate(arguments)
I tested it on a page from a random shoe store. The first call returned name="Running Sneaker X", price="$89.99", colors=["Black","White"]. Perfect. I tried it on three more sites — different markup, different wording — and it extracted correctly every time.
The reality check: it's not magic
This approach saved my project, but it has real trade-offs:
- Cost: Every extraction consumes tokens for the page text, the schema, and the response. For a few hundred pages, it’s cheap (maybe a few cents). Scraping millions of products? The bill adds up quickly.
- Latency: Each call takes 1–3 seconds. You can batch or parallelize, but you’re still slower than a regex-based solution that runs in microseconds.
- Hallucinations: The model might invent a price if it’s not visible on the page. I mitigated this with strict schema validation: if a required field is missing or contains garbage, I re-run with a more specific prompt.
- Prompt sensitivity: If you don’t give clear instructions, the model might include extra text like “The price is…” instead of “$19.99”. You need to iterate on your system prompt.
Despite these, for dynamic, heterogeneous content, it’s the best tool I’ve found. I ended up wrapping the logic into a small internal service — essentially the same pattern as something like ai.interwestinfo.com’s extraction API, but with my own schema definitions and retry logic.
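On the hallucination trade-off: schema validation cannot tell a real price from an invented one, since both are well-formed strings. One cheap extra guard, sketched below, is to check that each extracted value literally appears in the text that was sent to the model. The helper name `ungrounded_fields` and the whitespace and case normalization are assumptions for this example, not part of the setup described above.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def ungrounded_fields(extracted: dict, source_text: str) -> list[str]:
    """Return names of fields whose value never appears in the source text.

    Strings are checked directly, lists item by item. None and empty
    values are skipped because there is nothing to verify.
    """
    haystack = _normalize(source_text)
    suspicious = []
    for field, value in extracted.items():
        items = value if isinstance(value, list) else [value]
        for item in items:
            if not item:
                continue
            if _normalize(str(item)) not in haystack:
                suspicious.append(field)
                break  # one missing item is enough to flag the field
    return suspicious


page = "Running Sneaker X  Now $89.99  Colors: Black, White"
result = {"name": "Running Sneaker X", "price": "$79.99", "colors": ["Black", "White"]}
print(ungrounded_fields(result, page))
```

With a Pydantic model, the dict could come from `product.model_dump()`. The check only makes sense for fields that should be copied verbatim, such as prices, sizes, and colors; a summarized description or an inferred currency code will often be flagged even when it is correct.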
Lessons learned
- Don’t build regex for every site if you can use AI. It’s faster to set up and more resilient to layout changes.
- Always validate the output. Pydantic’s model_validate catches type errors and missing fields immediately.
- Clean the HTML first. Full HTML wastes tokens. Use a readability library to get the main content block.
- Keep a retry loop with feedback. If validation fails, send the error message back to the model and ask it to correct the output.
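The retry-with-feedback lesson can be sketched as a small loop. This is an illustration under assumptions rather than the service's actual code: `call_model` stands in for whatever function sends the messages to the LLM and returns the raw JSON string, `validate` stands in for something like `ProductInfo.model_validate`, and `max_attempts=3` is an arbitrary default.

```python
import json
from typing import Any, Callable, Dict, List, Optional

Message = Dict[str, str]


def extract_with_retry(
    call_model: Callable[[List[Message]], str],
    validate: Callable[[Dict[str, Any]], Any],
    messages: List[Message],
    max_attempts: int = 3,
) -> Any:
    """Call the model, validate the result, and feed any error back on failure."""
    history = list(messages)  # copy so the caller's list is not mutated
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        raw = call_model(history)
        try:
            return validate(json.loads(raw))
        except ValueError as exc:
            # json.JSONDecodeError is a ValueError subclass; validate() is
            # expected to raise ValueError (or a subclass) on bad data.
            last_error = exc
            history.append({"role": "assistant", "content": raw})
            history.append({
                "role": "user",
                "content": f"That output failed validation: {exc}. "
                           "Return corrected JSON only.",
            })
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error
```

Pydantic's `ValidationError` derives from `ValueError`, so `model_validate` should fit the `validate` slot as-is. When tool calling is used, the feedback turns need to follow the provider's tool-message format; plain assistant and user messages are used here only to keep the sketch generic.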
When NOT to do this
If you’re scraping a single source with a stable structure, just write a targeted CSS selector. If you have millions of pages and a strict budget, invest in training a smaller model or use traditional parsing. LLMs are for the messy, unstructured edge cases.
What I’d do differently next time
I’d start with a few representative pages and build a test suite. I’d also experiment with cheaper models like gpt-4o-mini or even local ones (e.g., Llama 3 with tool calling) to reduce latency and cost. And I’d move the schema definitions into a config file so non-developers can tweak them without touching code.
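Moving the schema into config could look roughly like this: a JSON document lists the fields, and a small function turns it into the function definition the API expects. The config layout (`function_name`, `fields`, and the `name`, `type`, `description`, `required` keys), the `string[]` shorthand, and the `build_tool` helper are all invented for this sketch; only the output shape mirrors the `extraction_function` dict shown earlier.

```python
import json

# Hypothetical config a non-developer could edit. In practice it would live
# in its own file and be read with json.load instead of being inlined.
CONFIG_JSON = """
{
  "function_name": "extract_product",
  "description": "Extract structured product info from HTML",
  "fields": [
    {"name": "name", "type": "string", "description": "Product full name", "required": true},
    {"name": "price", "type": "string", "description": "Current price as displayed", "required": true},
    {"name": "colors", "type": "string[]", "description": "Available colors"}
  ]
}
"""


def build_tool(config: dict) -> dict:
    """Translate the simple field list into a JSON Schema function definition."""
    properties, required = {}, []
    for field in config["fields"]:
        if field["type"].endswith("[]"):
            # "string[]" is this sketch's shorthand for an array of strings
            schema = {"type": "array", "items": {"type": field["type"][:-2]}}
        else:
            schema = {"type": field["type"]}
        schema["description"] = field.get("description", "")
        properties[field["name"]] = schema
        if field.get("required", False):
            required.append(field["name"])
    return {
        "name": config["function_name"],
        "description": config["description"],
        "parameters": {"type": "object", "properties": properties, "required": required},
    }


tool = build_tool(json.loads(CONFIG_JSON))
print(json.dumps(tool["parameters"], indent=2))
```

The resulting dict can be passed as `tools=[{"type": "function", "function": tool}]`, the same way as in the earlier call. The trade-off is that the Pydantic model no longer exists as code, so validation has to be rebuilt from the same config, for example with Pydantic's `create_model`.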
This method turned my nightmare project into a two-day task. The data is now flowing into the client’s database, and I sleep better knowing a site redesign won’t break everything.
What’s your experience with LLMs for data extraction? Have you had better luck with structured output or free-form prompts?