I’ve been building a side project that aggregates job listings from dozens of niche websites. Each site has its own HTML structure, inconsistent CSS classes, and occasionally some truly chaotic markup. For months, I tried the classic scraping toolkit: BeautifulSoup, lxml, CSS selectors, and eventually a mountain of regex patterns that looked like line noise.
It worked – until a site updated its template, and my carefully crafted selectors broke. Again. I found myself spending more time maintaining scrapers than actually using the data. I needed a different approach.
What didn’t work
My first instinct was to double down on patterns. I wrote regexes that matched “Job Title:” followed by some text, then “Location:” etc. But many sites didn’t use labels, or they nested the data inside tables, or they injected hidden spans that messed up the sequence.
Then I tried visual similarity – using OCR on screenshots. That was slow, expensive, and inaccurate for anything with multiple columns.
I considered training a small model to extract named entities, but I didn’t have a labeled dataset, and the sites kept changing. I needed something that could understand context, not just fixed positions.
The breakthrough: let the LLM read the page
LLMs are surprisingly good at extracting structured information from unstructured text – when you give them a good prompt. Instead of trying to parse HTML, I decided to render the page to plain text (using a readability library to strip navigation and ads), then feed that text to an AI model with a clear instruction: “Extract job listings from this text. For each listing, return a JSON object with title, company, location, salary, and description.”
Here’s the core technique I landed on:
import json

import requests
from bs4 import BeautifulSoup
from openai import OpenAI  # or any compatible API
from readability import Document

# Fetch the page and strip it down to its main readable content
page = requests.get("https://example.com/jobs", timeout=30)
page.raise_for_status()
doc = Document(page.text)
page_html = doc.summary()  # cleaned-up HTML of the main content

# Convert to plain text (html2text works here too)
plain_text = BeautifulSoup(page_html, "html.parser").get_text(separator="\n", strip=True)

# Truncate before building the prompt; a comment inside the f-string
# would be sent to the model as part of the text
page_excerpt = plain_text[:8000]  # limit to avoid token overflow

# Prepare the prompt with an example. JSON mode returns a single object,
# so the listings are wrapped in a top-level "jobs" array.
prompt = f"""
Extract all job listings from the following web page text.
Return a JSON object with one key, "jobs", whose value is an array of objects with these fields:
- title (string)
- company (string)
- location (string, if missing use "Remote")
- salary (string, if missing use "Not specified")
- description (string, first 200 characters)
Example output:
{{
  "jobs": [
    {{
      "title": "Senior Backend Engineer",
      "company": "Acme Corp",
      "location": "San Francisco, CA",
      "salary": "$150k - $180k",
      "description": "We are looking for a senior backend engineer to..."
    }}
  ]
}}
Now extract from this text:
{page_excerpt}
"""

client = OpenAI(api_key="your-key")  # or use https://ai.interwestinfo.com/ as endpoint
completion = client.chat.completions.create(
    model="gpt-4o-mini",  # cheap and fast
    messages=[
        {"role": "system", "content": "You extract structured data from web page text."},
        {"role": "user", "content": prompt},
    ],
    response_format={"type": "json_object"},
)
result = json.loads(completion.choices[0].message.content)
print(result["jobs"])
This code sends a prompt that includes:
- A clear schema with field definitions
- Default values for missing fields
- An example output
- The actual page text (truncated)
I used gpt-4o-mini because it’s cheap and fast – about $0.15 per million input tokens. For a typical job listing page of 3000 tokens, that’s less than a cent per page.
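The per-page figure is plain arithmetic, and keeping it in a small function makes it easy to redo when page sizes or prices change. The sketch below assumes the $0.15 per million input tokens quoted above. The function name is hypothetical, and output tokens are billed separately and not counted here.

```python
def estimate_input_cost(pages: int, tokens_per_page: int,
                        usd_per_million_tokens: float = 0.15) -> float:
    """Return the estimated input-token cost in USD for a batch of pages."""
    total_tokens = pages * tokens_per_page
    return total_tokens * usd_per_million_tokens / 1_000_000


# One typical 3,000-token page
print(f"${estimate_input_cost(1, 3000):.5f} for one page")
# A batch of 250 such pages
print(f"${estimate_input_cost(250, 3000):.4f} for 250 pages")
```

Treat the result as a lower bound: the model's JSON output and any retries add to it.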
What worked well
The LLM approach handles:
- Different HTML structures (lists, tables, divs)
- Missing or partial data (it infers defaults)
- Variations in labeling (“Location:”, “Based in”, “Where”)
- Multi-listing pages (extracts all jobs at once)
Because I’m sending clean readable text (not raw HTML), the model doesn’t get confused by markup. The readability library (python-readability, installed from PyPI as readability-lxml) does a good job of extracting the main content – I’ve used it for years, and it works for most news and listing sites.
The trade-offs and limitations
Let’s be honest: this isn’t a silver bullet. Here’s where it stumbles:
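- Cost: Input tokens are cheap (1,000 pages a day at about 3,000 tokens each is roughly $0.45 at $0.15/M), but output tokens, retries, and longer pages push the real bill higher. That adds up if you’re running a high-volume scraper.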
- Latency: Each request takes 1-3 seconds. For 100 pages processed one after another, that’s roughly two to five minutes – not great for real-time dashboards.
- Hallucinations: The model sometimes invents a salary or company when the page is vague. I always validate critical fields (e.g., check that company name exists in a known list).
- Token limits: Long pages get truncated. I split them into multiple requests or switch to a model with a larger context window (gpt-4o accepts up to 128k tokens).
- Prompt engineering: The example and instructions matter a lot. A small change in wording can cause missing fields. I iterate with 5-10 test pages before scaling.
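Two of the mitigations above, splitting long pages and checking fields that might be hallucinated, can be sketched without touching the API call. This is only a sketch under assumptions: the helper names, the chunk size and overlap, and the known-company set are hypothetical, and each listing is assumed to be a dict with the fields from the prompt.

```python
def split_text(text: str, max_chars: int = 8000, overlap: int = 200) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Consecutive chunks share `overlap` characters, so text cut at a
    boundary is repeated at the start of the next chunk.
    """
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    if not text:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks


def find_suspect_fields(listing: dict, known_companies: set[str]) -> list[str]:
    """Return the names of fields that should be reviewed by hand."""
    suspects = []
    # A company that is not on the known list may have been invented
    if listing.get("company") not in known_companies:
        suspects.append("company")
    # Crude salary check: a real figure is expected to contain a dollar sign
    salary = listing.get("salary", "Not specified")
    if salary != "Not specified" and "$" not in salary:
        suspects.append("salary")
    return suspects


print(split_text("abcdefghij", max_chars=4, overlap=1))
print(find_suspect_fields({"company": "Globex", "salary": "competitive"}, {"Acme Corp"}))
```

Each chunk goes out as its own request. Because of the overlap, a listing near a boundary can come back twice, so the merged results need de-duplicating.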
When NOT to use this approach
- If you need near-perfect accuracy (e.g., financial data), traditional extraction with XPath + validation is still better.
- If you’re scraping one or two well-defined sites, regex is simpler and free.
- If you have a few thousand pages and a tight budget, AI costs can hurt.
What I’d do differently next time
- Cache results: I now store each page’s text and the AI output in a local database. If the page hasn’t changed, I don’t re-query.
- Use a cheaper model for simple pages: For sites with straightforward layouts, I switch to gpt-4o-mini (or even a smaller local model like Llama 3.2).
- Add validation layer: After extraction, I run a simple check – e.g., “does the salary string contain a dollar sign?” – and flag low-confidence outputs for manual review.
- Consider a dedicated extraction API: Some services (like the one behind the ai.interwestinfo.com URL) are built for this exact use case and may offer better speed/accuracy for structured data.
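The caching idea from the first item can be sketched with the standard library alone. This is a minimal sketch, not the setup described above: the table layout, the helper names, and the `extract` callable (a stand-in for the LLM call) are hypothetical. Pages are keyed by a SHA-256 hash of their text, so an unchanged page never triggers a second request.

```python
import hashlib
import json
import sqlite3
from typing import Callable


def open_cache(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the cache database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extractions ("
        "text_hash TEXT PRIMARY KEY, result_json TEXT NOT NULL)"
    )
    return conn


def get_or_extract(conn: sqlite3.Connection, page_text: str,
                   extract: Callable[[str], list]) -> list:
    """Return the cached result for this exact text, calling extract() only on a miss."""
    text_hash = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT result_json FROM extractions WHERE text_hash = ?", (text_hash,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    result = extract(page_text)
    conn.execute(
        "INSERT INTO extractions (text_hash, result_json) VALUES (?, ?)",
        (text_hash, json.dumps(result)),
    )
    conn.commit()
    return result


# Demo with a stub extractor in place of the real API call
def stub_extract(text: str) -> list:
    return [{"title": "Engineer", "source_length": len(text)}]


cache = open_cache()
first = get_or_extract(cache, "some page text", stub_extract)
second = get_or_extract(cache, "some page text", stub_extract)  # served from the cache
print(first == second)
```

Hashing the readable text rather than the raw HTML means cosmetic template changes that leave the content intact still hit the cache.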
Final thoughts
Regex will always have a place in my toolbox – it’s fast, deterministic, and debuggable. But for the long tail of messy, unpredictable web pages, giving the text to an LLM with a solid prompt is a pragmatic alternative. It saved me weeks of selector maintenance.
The technique isn’t unique to job listings – you can adapt it to extract product specs, event details, or any semi-structured content. Just remember to validate the output and keep an eye on costs.
What about you? When do you reach for AI over traditional parsing?