DEV Community

zhongqiyue


How I stopped writing regex and let AI parse my messy data

I’ve been building a side project that aggregates job listings from dozens of niche websites. Each site has its own HTML structure, inconsistent CSS classes, and occasionally some truly chaotic markup. For months, I tried the classic scraping toolkit: BeautifulSoup, lxml, CSS selectors, and eventually a mountain of regex patterns that looked like line noise.

It worked – until a site updated its template, and my carefully crafted selectors broke. Again. I found myself spending more time maintaining scrapers than actually using the data. I needed a different approach.

What didn’t work

My first instinct was to double down on patterns. I wrote regexes that matched “Job Title:” followed by some text, then “Location:”, and so on. But many sites didn’t use labels, nested the data inside tables, or injected hidden spans that broke the sequence.

Then I tried a visual approach: running OCR on screenshots. That was slow, expensive, and inaccurate for anything with multiple columns.

I considered training a small model to extract named entities, but I didn’t have a labeled dataset, and the sites kept changing. I needed something that could understand context, not just fixed positions.

The breakthrough: let the LLM read the page

LLMs are surprisingly good at extracting structured information from unstructured text – when you give them a good prompt. Instead of trying to parse HTML, I decided to render the page to plain text (using a readability library to strip navigation and ads), then feed that text to an AI model with a clear instruction: “Extract job listings from this text. For each listing, return a JSON object with title, company, location, salary, and description.”

Here’s the core technique I landed on:

import json

import requests
from bs4 import BeautifulSoup
from openai import OpenAI  # or any compatible API
from readability import Document

# Fetch the page and fail early on HTTP errors
page = requests.get("https://example.com/jobs", timeout=30)
page.raise_for_status()

# Strip navigation and ads, keeping the main content as cleaned HTML
doc = Document(page.text)
main_html = doc.summary()

# Convert to plain text (html2text would also work)
plain_text = BeautifulSoup(main_html, "html.parser").get_text(separator="\n", strip=True)

# Truncate outside the f-string: a "#" comment placed inside the string
# would be sent to the model as part of the prompt.
truncated_text = plain_text[:8000]  # limit to avoid token overflow

# Prepare the prompt with an example.
# JSON mode ("json_object") returns a top-level object, so the array of
# listings is wrapped in a "jobs" key.
prompt = f"""
Extract all job listings from the following web page text.
Return a JSON object with a single key "jobs" whose value is an array of objects with these fields:
- title (string)
- company (string)
- location (string, if missing use "Remote")
- salary (string, if missing use "Not specified")
- description (string, first 200 characters)

Example output:
{{
  "jobs": [
    {{
      "title": "Senior Backend Engineer",
      "company": "Acme Corp",
      "location": "San Francisco, CA",
      "salary": "$150k - $180k",
      "description": "We are looking for a senior backend engineer to..."
    }}
  ]
}}

Now extract from this text:

{truncated_text}
"""

# or set base_url to a compatible endpoint such as https://ai.interwestinfo.com/
client = OpenAI(api_key="your-key")

response = client.chat.completions.create(
    model="gpt-4o-mini",   # cheap and fast
    messages=[
        {"role": "system", "content": "You extract structured data from web page text."},
        {"role": "user", "content": prompt}
    ],
    response_format={"type": "json_object"}
)

result = json.loads(response.choices[0].message.content)
jobs = result.get("jobs", [])  # empty list if the model omitted the key
print(jobs)

This code sends a prompt that includes:

  • A clear schema with field definitions
  • Default values for missing fields
  • An example output
  • The actual page text (truncated)

I used gpt-4o-mini because it’s cheap and fast: about $0.15 per million input tokens. For a typical job listing page of 3,000 tokens, that’s roughly $0.00045 in input cost, well under a cent per page.
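To make the arithmetic concrete, here is a rough cost estimator. It is only a sketch: the input price is the figure quoted above, while the output price and the number of output tokens per page are assumptions you should replace with your provider's current numbers.

```python
# Back-of-the-envelope cost estimate for LLM-based extraction.
INPUT_PRICE_PER_M = 0.15   # USD per million input tokens (figure from the text)
OUTPUT_PRICE_PER_M = 0.60  # USD per million output tokens (assumed, check your provider)


def estimate_cost(pages: int, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of extracting `pages` pages."""
    input_cost = pages * input_tokens * INPUT_PRICE_PER_M / 1_000_000
    output_cost = pages * output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # One 3,000-token page, ignoring output tokens
    print(f"{estimate_cost(1, 3000, 0):.5f}")
    # 1,000 pages a day, assuming 500 output tokens per page
    print(f"{estimate_cost(1000, 3000, 500):.2f}")
```

With these assumed numbers, 1,000 pages a day comes to about $0.75. Output tokens are easy to forget, and they are usually priced higher than input tokens.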

What worked well

The LLM approach handles:

  • Different HTML structures (lists, tables, divs)
  • Missing or partial data (it fills in the defaults from the prompt)
  • Variations in labeling (“Location:”, “Based in”, “Where”)
  • Multi-listing pages (extracts all jobs at once)

Because I’m sending clean readable text (not raw HTML), the model doesn’t get confused by markup. The readability library (python-readability, installed from PyPI as readability-lxml) does a good job of extracting the main content. I’ve used it for years, and it works for most news and listing sites.

The trade-offs and limitations

Let’s be honest: this isn’t a silver bullet. Here’s where it stumbles:

  • Cost: At $0.15/M input tokens and about 3,000 tokens per page, 1,000 pages a day costs roughly $0.45 in input tokens, plus output tokens, which are billed at a higher rate. That adds up if you’re running a high-volume scraper.
  • Latency: Each request takes 1-3 seconds. For 100 pages processed one at a time, that’s roughly 2 to 5 minutes, which is not great for real-time dashboards.
  • Hallucinations: The model sometimes invents a salary or company when the page is vague. I always validate critical fields (e.g., check that the company name exists in a known list).
  • Token limits: Long pages get truncated (the code above keeps only the first 8,000 characters). I split long pages into multiple requests or raise the cutoff; gpt-4o-mini accepts up to 128k tokens of context, so the 8,000-character limit is a cost choice rather than a hard ceiling.
  • Prompt engineering: The example and instructions matter a lot. A small change in wording can cause missing fields. I iterate on 5-10 test pages before scaling.
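For the token-limit point, splitting a long page into multiple requests could look like the sketch below. The helper names (split_text, merge_listings), the 500-character overlap, and the choice of title, company, and location as a deduplication key are my assumptions, not part of the original scraper. A listing that straddles a chunk boundary can still come back incomplete, so treat this as a starting point.

```python
def split_text(text: str, max_chars: int = 8000, overlap: int = 500) -> list[str]:
    """Split text into chunks of at most max_chars, each overlapping the previous one."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this chunk reached the end of the text
        start += max_chars - overlap
    return chunks


def merge_listings(batches: list[list[dict]]) -> list[dict]:
    """Combine per-chunk results, dropping duplicates caused by the overlap."""
    seen = set()
    merged = []
    for batch in batches:
        for job in batch:
            key = (job.get("title"), job.get("company"), job.get("location"))
            if key not in seen:
                seen.add(key)
                merged.append(job)
    return merged
```

Each chunk would be sent through the same prompt, and the per-chunk lists passed to merge_listings afterwards.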

When NOT to use this approach

  • If you need near-perfect accuracy (e.g., financial data), traditional extraction with XPath + validation is still better.
  • If you’re scraping one or two well-defined sites, regex is simpler and free.
  • If you have a few thousand pages and a tight budget, AI costs can hurt.

What I’d do differently next time

  1. Cache results: I now store each page’s text and the AI output in a local database. If the page hasn’t changed, I don’t re-query.
  2. Use the cheapest model that works: For sites with straightforward layouts, I stick with gpt-4o-mini or drop to an even smaller local model like Llama 3.2.
  3. Add validation layer: After extraction, I run a simple check – e.g., “does the salary string contain a dollar sign?” – and flag low-confidence outputs for manual review.
  4. Consider a dedicated extraction API: Some services (like the one behind the ai.interwestinfo.com URL) are built for this exact use case and may offer better speed/accuracy for structured data.
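The caching idea from point 1 can be sketched with the standard library alone. The table layout, the function names, and the choice of a SHA-256 hash of the page text as the cache key are assumptions for illustration; `extract` stands in for whatever function calls the model.

```python
import hashlib
import json
import sqlite3


def open_cache(path: str = "extractions.db") -> sqlite3.Connection:
    """Open (or create) a SQLite database holding page text and AI output."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extractions ("
        "text_hash TEXT PRIMARY KEY, page_text TEXT, result_json TEXT)"
    )
    return conn


def cached_extract(conn: sqlite3.Connection, page_text: str, extract):
    """Return the cached result for page_text, calling extract() only on a miss."""
    text_hash = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT result_json FROM extractions WHERE text_hash = ?", (text_hash,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # page unchanged: reuse the stored output
    result = extract(page_text)
    conn.execute(
        "INSERT INTO extractions (text_hash, page_text, result_json) VALUES (?, ?, ?)",
        (text_hash, page_text, json.dumps(result)),
    )
    conn.commit()
    return result
```

Hashing the cleaned text rather than the raw HTML means cosmetic template changes that don't alter the visible content still hit the cache.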

Final thoughts

Regex will always have a place in my toolbox – it’s fast, deterministic, and debuggable. But for the long tail of messy, unpredictable web pages, giving the text to an LLM with a solid prompt is a pragmatic alternative. It saved me weeks of selector maintenance.

The technique isn’t unique to job listings – you can adapt it to extract product specs, event details, or any semi-structured content. Just remember to validate the output and keep an eye on costs.
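As one way to validate the output, here is a minimal sketch of a checker that flags suspicious listings for manual review. The dollar-sign check comes from the idea described earlier; checking that the company name appears in the page text is my own variation on the known-list check, and the field names simply mirror the prompt schema.

```python
REQUIRED_FIELDS = ("title", "company", "location", "salary", "description")


def validate_listing(job: dict, page_text: str) -> list[str]:
    """Return a list of problems; an empty list means the listing looks plausible."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = job.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")

    # A company name that never appears in the source text may be invented.
    company = job.get("company")
    if isinstance(company, str) and company.strip():
        if company.strip().lower() not in page_text.lower():
            problems.append("company not found in page text")

    # Crude salary check; adjust for currencies other than dollars.
    salary = job.get("salary")
    if isinstance(salary, str) and salary.strip() and salary != "Not specified":
        if "$" not in salary:
            problems.append("salary has no dollar sign")
    return problems
```

Anything that returns a non-empty list goes into a review queue instead of the main dataset.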

What about you? When do you reach for AI over traditional parsing?
