DEV Community

zhongqiyue
zhongqiyue

Posted on

How I Finally Got Reliable Data from Messy HTML Tables

I spent two weeks fighting a website that refused to play nice. Every scrape came back with missing columns, merged cells, or data that looked right but was subtly wrong. I tried BeautifulSoup, regex, pandas read_html – all the usual suspects. Nothing worked reliably because the HTML was a nightmare of inline styles, rowspan attributes, and inconsistent class names.

Let me back up. I needed to pull apartment rental data from a dozen property management sites. Most were fine, but one particular site (let's call it ShadowListings) had a table that looked like this when rendered:

Unit Price Beds Available
101 $1200 1 2024-03-01
102 $1500 2 2024-04-01

But the underlying HTML was a mess:

<table>
  <tr>
    <td rowspan="2">Unit</td>
    <td class="price-col">$1200</td>
    <td>1</td>
    <td><span>2024-03-01</span></td>
  </tr>
  <tr>
    <td>$1500</td>
    <td>2</td>
    <td>2024-04-01</td>
  </tr>
</table>
Enter fullscreen mode Exit fullscreen mode

The first row had a rowspan="2" that threw off column alignment. And the class names changed between pages. My standard approach broke instantly.

What I Tried That Didn't Work

BeautifulSoup + manual column detection: I tried walking the DOM, tracking rowspan and colspan offsets. It worked for one page but broke on the next because the structure changed slightly.

Regex on the text: I thought I could just match price patterns and guess the rest. But dates were in multiple formats, and some rows had no price at all.

Pandas pd.read_html: It actually parsed the first table correctly, but then failed on sub-tables and returned a DataFrame with NaN everywhere because of misaligned cells.

Headless browser + OCR: Overkill, slow, and still needed heuristics to extract the actual values.

After a week of frustration, I started thinking: the human eye can instantly understand this table. Why? Because we don't parse HTML structure — we read the visual layout and use context. That's when I turned to AI.

What Eventually Worked: LLM-Powered Table Extraction

Instead of trying to fix the parser, I gave the raw HTML (or even a screenshot) to an LLM and asked it to extract the data. The trick was to use few-shot prompting with an example of the desired output.

Here's the approach I settled on:

  1. Fetch the HTML as text (or use a screenshot if the site uses JavaScript rendering).
  2. Strip out irrelevant tags (scripts, styles) to reduce token count.
  3. Ask the LLM to output a JSON array of objects, with explicit column names.
  4. Validate the output with a schema (e.g., price must be a string, date must match ISO format).

My first attempt used GPT-4, but even 3.5 worked fine for simple tables. The key was the prompt:

import openai

def extract_table_from_html(html_text):
    prompt = f"""
You are given raw HTML of a table. Extract the data into a JSON array of objects, where each object has keys: "unit", "price", "beds", "available_date".

Ignore any formatting issues like rowspan or colspan. Use the visual layout to infer columns.

HTML:
{html_text[:3000]}  # Trim to avoid token limits

Output ONLY valid JSON. No explanations.
"""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content
Enter fullscreen mode Exit fullscreen mode

Wait, that naive version had problems: the LLM sometimes invented data or hallucinated columns. I needed few-shot examples and a validation step. Here's the refined version:

import json
from pydantic import BaseModel

def extract_and_validate(html):
    few_shot = """
Example 1:
Input: <table>...</table>
Output: [{"unit":"A1","price":"$1000","beds":"1","available_date":"2024-01-01"}]

Now do the same for this HTML:
"""
    prompt = few_shot + html[:4000]
    # Call LLM
    raw = call_llm(prompt)
    # Parse JSON safely
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # fallback: ask again with stricter prompt
        return retry_with_stricter(html)
    # Validate each row using Pydantic
    class Row(BaseModel):
        unit: str
        price: str
        beds: str
        available_date: str
    validated = [Row(**row).dict() for row in data]
    return validated
Enter fullscreen mode Exit fullscreen mode

I ended up using the OpenAI API with a small wrapper I built. The cost was negligible - about $0.02 per page. Much cheaper than manual data entry.

Lessons Learned / Trade-offs

  • Latency: Each extraction takes 1-3 seconds. If you need to scrape 10,000 pages, that's hours and dollars. Not ideal for bulk.
  • Cost: At scale, it adds up. I used it only for the difficult 5% of pages where traditional parsing failed.
  • Accuracy: LLMs still make mistakes on ambiguous layouts. Adding few-shot examples helped, but sometimes it missed a column or merged two rows. I always run a sanity check (e.g., ensure price contains "$").
  • Feedback loop: I saved failures and retrained my prompt. Over time, extraction became more reliable.
  • Privacy: Sending HTML to a third-party API? Not always allowed. I used an internal LLM (via a service like the one at ai.interwestinfo.com) for sensitive pages.

What I'd Do Differently Next Time

  1. Try a smaller, fine-tuned model first. For a fixed table format, you could train a tiny model to output structured data directly, avoiding the overhead of a general LLM.
  2. Use vision models for rendered tables. If the HTML is too mangled, a screenshot can be passed to GPT-4 Vision, which often does better at understanding layout.
  3. Add human-in-the-loop for edge cases. I built a simple web UI that flagged uncertain extractions (e.g., confidence score < 0.8). Human review of those 10% kept quality high.
  4. Cache aggressively. I cached extracted data by page URL so I only paid once per page.

When NOT to Use This Approach

  • If your HTML is well-structured and consistent, use pandas or BeautifulSoup. They're faster, free, and deterministic.
  • If you need real-time extraction (e.g., live dashboard), LLM latency kills you.
  • If your data contains PII or trade secrets, think twice about sending it to an external API.

This technique saved my project. It turned an impossible scraping task into a weekend project. Now I treat LLMs as my "parser of last resort" — only for the messy cases that resist all logic.

What's your approach when traditional parsing fails? I'd love to hear if you've experimented with LLMs for data extraction.

Top comments (0)