circobit

Posted on Feb 11 • Edited on Feb 17

The Hidden Complexity of HTML Tables (Why Parsing Them Is Harder Than You Think)

#webdev #productivity #data #tutorial

HTML tables look simple. <table>, <tr>, <td>. What could go wrong?

After building HTML Table Exporter, a table export tool that's processed thousands of real-world tables, I can tell you: a lot. This post covers the edge cases that break naive parsers and how to handle them.

The Deceptively Simple Case

A perfect table looks like this:

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Revenue</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Acme Inc</td>
      <td>$1.2M</td>
    </tr>
  </tbody>
</table>

Parsing this is trivial:

const rows = table.querySelectorAll('tr');
const data = [...rows].map(row => 
  [...row.querySelectorAll('td, th')].map(cell => cell.textContent.trim())
);

Done, right? Not even close.

Problem 1: Merged Cells (colspan/rowspan)

Real tables have merged cells. A lot of them.

<tr>
  <td rowspan="3">Q1 2024</td>
  <td>January</td>
  <td>$100k</td>
</tr>
<tr>
  <td>February</td>
  <td>$120k</td>
</tr>
<tr>
  <td>March</td>
  <td>$90k</td>
</tr>

If you parse this naively, you get:

Row 1: ["Q1 2024", "January", "$100k"]
Row 2: ["February", "$120k"]          // Missing first column!
Row 3: ["March", "$90k"]              // Missing first column!

The Fix: Build a Position Matrix

You need to track which cells are "occupied" by rowspans from previous rows:

function parseTableWithMergedCells(table) {
  const rows = table.querySelectorAll('tr');
  const matrix = [];
  const rowspanTracker = []; // Track active rowspans per column

  rows.forEach((row, rowIndex) => {
    matrix[rowIndex] = [];
    let colIndex = 0;

    // Skip columns occupied by previous rowspans
    while (rowspanTracker[colIndex] > 0) {
      matrix[rowIndex][colIndex] = matrix[rowIndex - 1]?.[colIndex] || '';
      rowspanTracker[colIndex]--;
      colIndex++;
    }

    row.querySelectorAll('td, th').forEach(cell => {
      // Skip occupied columns
      while (rowspanTracker[colIndex] > 0) {
        matrix[rowIndex][colIndex] = matrix[rowIndex - 1]?.[colIndex] || '';
        rowspanTracker[colIndex]--;
        colIndex++;
      }

      // Always pass a radix; missing or invalid attributes fall back to 1
      const colspan = parseInt(cell.getAttribute('colspan'), 10) || 1;
      const rowspan = parseInt(cell.getAttribute('rowspan'), 10) || 1;
      const value = cell.textContent.trim();

      // Fill colspan
      for (let c = 0; c < colspan; c++) {
        matrix[rowIndex][colIndex] = value;

        // Track rowspan for future rows
        if (rowspan > 1) {
          rowspanTracker[colIndex] = rowspan - 1;
        }
        colIndex++;
      }
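    });

    // Fill trailing columns still covered by rowspans from earlier rows
    // (e.g. a rowspan cell in the last column); otherwise they are dropped
    while (rowspanTracker[colIndex] > 0) {
      matrix[rowIndex][colIndex] = matrix[rowIndex - 1]?.[colIndex] || '';
      rowspanTracker[colIndex]--;
      colIndex++;
    }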
  });

  return matrix;
}

This is simplified—the real implementation needs to handle nested rowspans within colspans, which gets ugly fast.
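One way to cope with cells that span in both directions is to stamp every cell into a shared grid and skip any slot that is already taken. The sketch below is not the exporter's actual code. It assumes the rows have already been read into plain { text, colspan, rowspan } objects so it runs without a DOM; that shape and the name expandSpans are hypothetical.

```javascript
// Expand merged cells into a rectangular grid of strings.
// rows: array of rows, each an array of { text, colspan?, rowspan? }.
function expandSpans(rows) {
  const grid = [];

  rows.forEach((cells, r) => {
    grid[r] = grid[r] || [];
    let c = 0;

    for (const cell of cells) {
      // Advance to the next slot not already claimed by an earlier span
      while (grid[r][c] !== undefined) c++;

      const colspan = cell.colspan || 1;
      // Clip rowspans that would run past the last row
      const rowspan = Math.min(cell.rowspan || 1, rows.length - r);

      // Stamp the value into every slot the cell covers.
      // If spans overlap (invalid markup), the later cell wins.
      for (let dr = 0; dr < rowspan; dr++) {
        grid[r + dr] = grid[r + dr] || [];
        for (let dc = 0; dc < colspan; dc++) {
          grid[r + dr][c + dc] = cell.text;
        }
      }
      c += colspan;
    }
  });

  // Pad ragged rows and unfilled holes with empty strings
  const width = Math.max(0, ...grid.map(row => row.length));
  return grid.map(row => Array.from({ length: width }, (_, i) => row[i] ?? ''));
}

const quarterly = expandSpans([
  [{ text: 'Q1 2024', rowspan: 3 }, { text: 'January' }, { text: '$100k' }],
  [{ text: 'February' }, { text: '$120k' }],
  [{ text: 'March' }, { text: '$90k' }],
]);
console.log(quarterly[1]); // [ 'Q1 2024', 'February', '$120k' ]
```

Because occupancy is tracked per grid slot rather than per column counter, a cell with both colspan and rowspan needs no special case.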

Problem 2: Tables That Aren't Data Tables

Not every <table> contains data. Many sites (yes, still in 2024) use tables for layout:

<table>
  <tr>
    <td><nav>Menu here</nav></td>
    <td><main>Content here</main></td>
  </tr>
</table>

Or for forms:

<table>
  <tr>
    <td><label>Email:</label></td>
    <td><input type="email"></td>
  </tr>
</table>

The Fix: Heuristics

I use several signals to detect "real" data tables:

function isDataTable(table) {
  const rows = table.querySelectorAll('tr');
  const cells = table.querySelectorAll('td, th');

  // Too few rows or cells
  if (rows.length < 2 || cells.length < 4) return false;

  // Contains form elements (probably a form layout)
  if (table.querySelector('input, select, textarea, button')) return false;

  // Mostly navigation links
  const links = table.querySelectorAll('a');
  const totalLength = table.textContent.length;
  const linkLength = [...links].reduce((sum, a) => sum + a.textContent.length, 0);
  if (totalLength > 0 && linkLength / totalLength > 0.7) return false;

  // Check column consistency (counts cells per row, so colspans can skew this)
  const colCounts = [...rows].map(row => 
    row.querySelectorAll('td, th').length
  );
  const spread = Math.max(...colCounts) - Math.min(...colCounts);
  if (spread > 3) return false; // Inconsistent columns = probably layout

  return true;
}

None of these are perfect. You'll always have edge cases.

Problem 3: Hidden Content

Cells often contain more than visible text:

<td>
  <span class="value">1,234</span>
  <span class="sort-key" style="display:none">1234</span>
</td>

Wikipedia does this a lot for sortable tables. If you just grab textContent, you get both values: "1,234" followed by the hidden "1234", separated only by whitespace.

The Fix: Extract Visible Text Only

function getVisibleText(element) {
  // Clone to avoid modifying original
  const clone = element.cloneNode(true);

  // Remove hidden elements
  clone.querySelectorAll('[style*="display: none"], [style*="display:none"], .hidden, [hidden]').forEach(el => el.remove());

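  // Elements hidden via stylesheets are not caught by the selector above.
  // Detecting them needs getComputedStyle on the original element (a detached
  // clone has no useful computed styles); more expensive, use sparingly.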

  return clone.textContent.trim();
}
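For elements hidden through a stylesheet rather than an inline style, a computed-style walk over the original (attached) element is one option. The following is a minimal sketch, not the exporter's code: visibleText is a hypothetical name, and the getStyle parameter is an assumption added so the function can be exercised outside a browser. By default it falls back to window.getComputedStyle.

```javascript
// Collect text from nodes that are actually rendered.
// getStyle is injectable for testing; in a browser the default is used.
function visibleText(node, getStyle = el => window.getComputedStyle(el)) {
  if (node.nodeType !== 1) return ''; // only element nodes are walked

  const style = getStyle(node);
  // display:none hides the whole subtree
  if (style.display === 'none') return '';

  // visibility:hidden hides this element's own text, but a descendant
  // can opt back in with visibility:visible, so keep recursing
  const selfVisible = style.visibility !== 'hidden' && style.visibility !== 'collapse';

  let out = '';
  for (const child of node.childNodes) {
    if (child.nodeType === 3) {
      // text node: include it only if its parent is visible
      if (selfVisible) out += child.nodeValue;
    } else {
      out += visibleText(child, getStyle);
    }
  }
  return out;
}

// In a browser: visibleText(document.querySelector('td')).trim()
```

This calls getComputedStyle once per element, which can be slow on large tables, so it is probably best kept as a fallback after the cheaper selector-based pass.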

Problem 4: Numbers That Aren't Numbers

"$1,234.56" is a number. So is "1.234,56" (European format). So is "(1,234)" (accounting negative). So is "1,234 M" (with suffix).

Your spreadsheet needs actual numbers to do math.

The Fix: Locale-Aware Parsing

function parseNumber(value) {
  if (!value || typeof value !== 'string') return value;

  // Remove currency symbols and whitespace
  let cleaned = value.replace(/[$€£¥₹\s]/g, '').trim();

  // Handle accounting negatives: (1,234) -> -1234
  if (cleaned.startsWith('(') && cleaned.endsWith(')')) {
    cleaned = '-' + cleaned.slice(1, -1);
  }

  // Handle suffixes: 1.5M, 2.3B, 100K
  // Anchored at both ends so the sign is kept and text like "Room 12B" is left alone
  const suffixes = { 'K': 1e3, 'M': 1e6, 'B': 1e9, 'T': 1e12 };
  const suffixMatch = cleaned.match(/^(-?[0-9.,]+)([KMBT])$/i);
  let multiplier = 1;
  if (suffixMatch) {
    cleaned = suffixMatch[1];
    multiplier = suffixes[suffixMatch[2].toUpperCase()];
  }

  // Detect European vs US format
  // European: 1.234,56 (dot for thousands, comma for decimal)
  // US: 1,234.56 (comma for thousands, dot for decimal)
  const lastComma = cleaned.lastIndexOf(',');
  const lastDot = cleaned.lastIndexOf('.');

  if (lastComma > lastDot && lastComma > cleaned.length - 4) {
    // European format
    cleaned = cleaned.replace(/\./g, '').replace(',', '.');
  } else {
    // US format
    cleaned = cleaned.replace(/,/g, '');
  }

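  // Number() rejects trailing junk that parseFloat silently ignores
  // (parseFloat('2024-01-15') is 2024), so dates and labels stay strings
  const num = cleaned === '' ? NaN : Number(cleaned);

  return isNaN(num) ? value : num * (multiplier || 1);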
}

This handles maybe 90% of cases. The other 10% will surprise you.
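Much of the remaining ambiguity comes from values like "1,234" or "1.234", which are valid in both conventions. One possible mitigation is to look at the whole column before deciding. The sketch below is an assumption-laden illustration, not part of the parser above: detectDecimalSeparator is a hypothetical helper that lets each unambiguous value vote.

```javascript
// Guess the decimal separator used in a column of strings.
// Returns ',' or '.', or null when the column gives no clear signal.
function detectDecimalSeparator(values) {
  let commaVotes = 0;
  let dotVotes = 0;

  for (const v of values) {
    const s = String(v).replace(/[^0-9.,]/g, '');
    const lastComma = s.lastIndexOf(',');
    const lastDot = s.lastIndexOf('.');
    if (lastComma === -1 && lastDot === -1) continue;

    if (lastComma !== -1 && lastDot !== -1) {
      // Both present: whichever comes last is the decimal separator
      if (lastComma > lastDot) commaVotes++; else dotVotes++;
      continue;
    }

    const sep = lastComma !== -1 ? ',' : '.';
    const occurrences = s.split(sep).length - 1;
    const digitsAfter = s.length - Math.max(lastComma, lastDot) - 1;

    if (occurrences > 1) {
      // A repeated separator can only be grouping, so the other one is decimal
      if (sep === ',') dotVotes++; else commaVotes++;
    } else if (digitsAfter !== 3) {
      // Not followed by a 3-digit group, so it must be the decimal separator
      if (sep === ',') commaVotes++; else dotVotes++;
    }
    // A single separator followed by exactly 3 digits is ambiguous: no vote
  }

  if (commaVotes === dotVotes) return null;
  return commaVotes > dotVotes ? ',' : '.';
}

console.log(detectDecimalSeparator(['1.234,56', '12,5', '1.234'])); // ','
console.log(detectDecimalSeparator(['1,234', '5,678']));            // null
```

When the result is null, falling back to a user setting or the page's lang attribute is safer than guessing.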

Problem 5: Character Encoding Hell

You'd think UTF-8 solved this. It didn't.

Real tables contain:

Non-breaking spaces (&nbsp; / \u00A0) that look like spaces but aren't
Zero-width characters that break string comparison
Windows-1252 characters that got mangled into UTF-8
Emoji that break older parsers
Right-to-left marks in mixed-language tables
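A few lines in a console make the first two items concrete. The strings here are made-up examples, not data from a real table:

```javascript
// A non-breaking space renders like a space but is a different code point
const fromPage = 'Acme\u00A0Inc';
const typed = 'Acme Inc';
console.log(fromPage === typed); // false

// A zero-width space is invisible but still counts as a character
const withZeroWidth = 'Total\u200B';
console.log(withZeroWidth === 'Total'); // false
console.log(withZeroWidth.length);      // 6

// The same accented letter can be one code point or two
const composed = 'caf\u00E9';    // é as a single code point
const decomposed = 'cafe\u0301'; // e followed by a combining acute accent
console.log(composed === decomposed);                  // false
console.log(composed === decomposed.normalize('NFC')); // true
```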

The Fix: Normalize Everything

function normalizeText(text) {
  return text
    // Normalize unicode (handles composed vs decomposed characters)
    .normalize('NFC')
    // Replace non-breaking spaces with regular spaces
    .replace(/\u00A0/g, ' ')
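    // Remove zero-width characters. This also strips ZWJ/ZWNJ, which emoji
    // sequences and scripts such as Persian rely on; narrow the range if needed.
    .replace(/[\u200B-\u200D\uFEFF]/g, '')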
    // Normalize whitespace
    .replace(/\s+/g, ' ')
    .trim();
}

And when exporting to CSV for Excel, prepend the UTF-8 BOM:

const BOM = '\uFEFF';
const csvContent = BOM + generateCSV(data);

Without the BOM, Excel may decode your UTF-8 file using the system's legacy code page (Windows-1252 on many Western installs) and mangle non-ASCII characters.
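The snippet above calls generateCSV without showing it. A minimal sketch of what such a function could look like follows, using the quoting rules from RFC 4180. The body and the escapeCsvField helper are assumptions for illustration, not the exporter's implementation.

```javascript
// Quote a field only when it contains a delimiter, a quote, or a line break;
// embedded quotes are doubled.
function escapeCsvField(value) {
  const s = value === null || value === undefined ? '' : String(value);
  return /[",\r\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
}

// rows: array of arrays; returns one CRLF-separated CSV string
function generateCSV(rows) {
  return rows.map(row => row.map(escapeCsvField).join(',')).join('\r\n');
}

const BOM = '\uFEFF';
const csvContent = BOM + generateCSV([
  ['Name', 'Revenue'],
  ['Acme, Inc', '$1.2M'],
  ['Say "hi"', 42],
]);
// csvContent (after the BOM):
// Name,Revenue
// "Acme, Inc",$1.2M
// "Say ""hi""",42
```

Skipping the quoting step is a common source of broken exports, since table cells regularly contain commas and line breaks.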

Problem 6: Nested Tables

Yes, tables inside tables. Usually for layout, but sometimes for data:

<table>
  <tr>
    <td>Product A</td>
    <td>
      <table>
        <tr><td>Size S</td><td>$10</td></tr>
        <tr><td>Size M</td><td>$12</td></tr>
      </table>
    </td>
  </tr>
</table>

The Fix: Decide Your Strategy

Options:

Flatten: Convert nested table to text ("Size S: $10, Size M: $12")
Extract separately: Treat nested tables as separate exports
Expand rows: Create multiple parent rows, one per nested row

I went with option 2 (extract separately) with option 1 as fallback for deeply nested cases. There's no perfect answer—it depends on use case.
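Options 1 and 3 are easy to compare side by side. This is a sketch under the assumption that the nested table has already been parsed into an array of string rows; flattenNested and expandRows are hypothetical names.

```javascript
// Option 1: collapse the nested rows into a single cell value
function flattenNested(nestedRows) {
  return nestedRows
    .map(([label, ...rest]) => (rest.length ? `${label}: ${rest.join(' ')}` : label))
    .join(', ');
}

// Option 3: repeat the parent cells once per nested row
function expandRows(parentCells, nestedRows) {
  if (nestedRows.length === 0) return [parentCells.slice()];
  return nestedRows.map(nested => [...parentCells, ...nested]);
}

const sizes = [['Size S', '$10'], ['Size M', '$12']];

console.log(flattenNested(sizes));
// Size S: $10, Size M: $12

console.log(expandRows(['Product A'], sizes));
// [ [ 'Product A', 'Size S', '$10' ], [ 'Product A', 'Size M', '$12' ] ]
```

Flattening keeps the row count stable but loses structure; expanding keeps the data queryable at the cost of duplicated parent values.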

The Reality

After handling all these cases, my table parser is ~800 lines of JavaScript. And it still doesn't handle everything perfectly.

Some hard truths:

No parser is perfect. Real-world HTML is messy.
Heuristics fail. You'll always need escape hatches for users.
Performance matters. Some pages have 50+ tables. Parsing needs to be fast.
Edge cases are infinite. Ship something that works for 95% of cases, then iterate.

Tools and Resources

If you're building something similar:

SheetJS (xlsx) - Solid library for generating Excel files
Papa Parse - Fast CSV parsing and generation
Chrome DevTools - $$('table') in the console lists every table on the page ($('table') returns only the first match)

Or if you just need to export tables without building anything: I made HTML Table Exporter specifically because I got tired of writing one-off scrapers. It handles all the edge cases above.

Learn more at gauchogrid.com/html-table-exporter or try it free on the Chrome Web Store.

What weird table edge cases have you encountered? I'm always looking for new test cases to break my parser.
