circobit

Posted on Mar 4

The Hidden Complexity of HTML Tables

#datascience #html #webdev #javascript

HTML tables look simple. Rows and cells. What could go wrong?

Everything, it turns out.

I've spent months building HTML Table Exporter, a table extraction tool, and I've learned that HTML tables are deceptively complex. Here's what nobody tells you when you start parsing them.

The Naive Approach (That Breaks Immediately)

Your first attempt probably looks like this:

function extractTable(table) {
  return Array.from(table.rows).map(row =>
    Array.from(row.cells).map(cell => cell.textContent.trim())
  );
}

Clean, simple, wrong.

This works for basic tables. But the real web is full of tables that will break this code in creative ways.

Problem 1: Rowspan and Colspan

HTML cells can span multiple rows or columns:

<table>
  <tr>
    <td rowspan="3">Category A</td>
    <td>Item 1</td>
    <td>$10</td>
  </tr>
  <tr>
    <td>Item 2</td>
    <td>$20</td>
  </tr>
  <tr>
    <td>Item 3</td>
    <td>$30</td>
  </tr>
</table>

The naive approach gives you:

Row 0: ["Category A", "Item 1", "$10"]
Row 1: ["Item 2", "$20"]           // Only 2 cells!
Row 2: ["Item 3", "$30"]           // Only 2 cells!

Rows 1 and 2 are missing their first column, so "Item 2" and "Item 3" slide left into the position where the category belongs. Your column alignment is now broken.

The Fix: Build a Virtual Grid

You need to track which cells are "occupied" by spans from previous rows:

function extractTableMatrix(table) {
  const rows = Array.from(table.rows);
  const grid = [];

  rows.forEach((rowEl, rowIndex) => {
    if (!grid[rowIndex]) grid[rowIndex] = [];
    let colIndex = 0;

    Array.from(rowEl.cells).forEach((cell) => {
      // Keep advancing past occupied columns
      while (grid[rowIndex][colIndex] !== undefined) {
        colIndex++;
      }

      const text = cell.textContent.trim();
      const rowSpan = cell.rowSpan || 1;
      const colSpan = cell.colSpan || 1;

      // Fill the entire span region
      for (let r = 0; r < rowSpan; r++) {
        for (let c = 0; c < colSpan; c++) {
          const targetRow = rowIndex + r;
          const targetCol = colIndex + c;

          if (!grid[targetRow]) grid[targetRow] = [];
          if (grid[targetRow][targetCol] === undefined) {
            grid[targetRow][targetCol] = text;
          }
        }
      }

      colIndex += colSpan;
    });
  });

  return grid;
}

Now you get:

Row 0: ["Category A", "Item 1", "$10"]
Row 1: ["Category A", "Item 2", "$20"]
Row 2: ["Category A", "Item 3", "$30"]

The spanning cell's value is duplicated into each position it occupies.
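
To sanity-check the grid logic outside a browser, the same algorithm can be run over plain objects. The sketch below is a hypothetical DOM-free variant: buildGrid and the { text, rowSpan, colSpan } descriptor shape are assumptions made for illustration, not part of HTML Table Exporter. It is applied to a header that combines a rowspan with a colspan.

```javascript
// Hypothetical DOM-free variant of the grid builder. Each row is an array of
// { text, rowSpan, colSpan } descriptors instead of <td> elements.
function buildGrid(rows) {
  const grid = [];

  rows.forEach((cells, rowIndex) => {
    if (!grid[rowIndex]) grid[rowIndex] = [];
    let colIndex = 0;

    cells.forEach(({ text, rowSpan = 1, colSpan = 1 }) => {
      // Skip slots already claimed by a rowspan from an earlier row
      while (grid[rowIndex][colIndex] !== undefined) colIndex++;

      // Copy the text into every slot the cell covers
      for (let r = 0; r < rowSpan; r++) {
        if (!grid[rowIndex + r]) grid[rowIndex + r] = [];
        for (let c = 0; c < colSpan; c++) {
          if (grid[rowIndex + r][colIndex + c] === undefined) {
            grid[rowIndex + r][colIndex + c] = text;
          }
        }
      }

      colIndex += colSpan;
    });
  });

  return grid;
}

const demoGrid = buildGrid([
  [{ text: "Region", rowSpan: 2 }, { text: "2024", colSpan: 2 }],
  [{ text: "Q1" }, { text: "Q2" }],
  [{ text: "North" }, { text: "$1.2M" }, { text: "$1.4M" }],
]);
// demoGrid:
// [["Region", "2024", "2024"],
//  ["Region", "Q1",   "Q2"],
//  ["North",  "$1.2M", "$1.4M"]]
```

Every row comes out with three columns, which is the property the later steps rely on.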

Problem 2: Nested Tables

Some sites use tables inside tables for layout:

<table>
  <tr>
    <td>
      <table>
        <tr><td>Nested content</td></tr>
      </table>
    </td>
    <td>Regular cell</td>
  </tr>
</table>

Using cell.textContent captures everything, including nested table content. cell.innerText doesn't solve this either: it skips hidden elements, but it still includes the nested table's visible text, and it's inconsistent across browsers.

The Fix: Clone and Remove

function extractCellText(cell) {
  if (!cell) return "";

  // Clone to avoid modifying the DOM
  const clone = cell.cloneNode(true);

  // Remove invisible and nested content
  const removeSelectors = "style, script, noscript, template, table";
  clone.querySelectorAll(removeSelectors).forEach(el => el.remove());

  // Normalize whitespace
  return (clone.textContent || "").replace(/\s+/g, " ").trim();
}

Problem 3: Headers Aren't Always Row 0

Many tables have title rows, navigation rows, or other content before the actual headers:

<table>
  <tr>
    <td colspan="3">Quarterly Sales Report 2024</td>
  </tr>
  <tr>
    <th>Region</th>
    <th>Q1</th>
    <th>Q2</th>
  </tr>
  <tr>
    <td>North</td>
    <td>$1.2M</td>
    <td>$1.4M</td>
  </tr>
</table>

If you assume row 0 is headers, you'll get "Quarterly Sales Report 2024" as your column name.

The Fix: Detect Header Row

function detectHeaderRowIndex(matrix) {
  for (let i = 0; i < Math.min(matrix.length - 1, 3); i++) {
    const row = matrix[i];
    const nextRow = matrix[i + 1];

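    // Count unique non-empty values
    const filled = row.filter(c => c && c.trim());
    const uniqueValues = new Set(filled);
    const uniqueNext = new Set(nextRow.filter(c => c && c.trim()));

    // Title row: one value repeated across several columns (a colspan after
    // grid expansion), and the next row has more distinct values. A length
    // threshold such as row[0]?.length > 30 would miss the 27-character
    // title in the example above.
    const isTitleRow =
      uniqueValues.size === 1 &&
      uniqueNext.size > 1 &&
      filled.length > 1;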

    if (isTitleRow) {
      return i + 1; // Headers are in the next row
    }
  }

  return 0;
}
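
To see why the header index matters downstream, here is a minimal sketch of turning a matrix into keyed records. toRecords is a hypothetical helper introduced for illustration, and the matrix is the grid-expanded form of the sales table above, with the header index passed in by hand.

```javascript
// Hypothetical helper: key each data row by the chosen header row.
function toRecords(matrix, headerIndex = 0) {
  const headers = matrix[headerIndex] || [];
  return matrix.slice(headerIndex + 1).map(row =>
    Object.fromEntries(headers.map((name, i) => [name, row[i] ?? ""]))
  );
}

// The sales table after grid expansion: the colspan title fills all 3 slots.
const salesMatrix = [
  ["Quarterly Sales Report 2024", "Quarterly Sales Report 2024", "Quarterly Sales Report 2024"],
  ["Region", "Q1", "Q2"],
  ["North", "$1.2M", "$1.4M"],
];

// With the real header row, each record gets three distinct keys.
const records = toRecords(salesMatrix, 1);
// records[0] is { Region: "North", Q1: "$1.2M", Q2: "$1.4M" }

// Assuming row 0 is the header collapses every record to a single key,
// because all three header cells hold the same title string.
const wrongRecords = toRecords(salesMatrix, 0);
```

With index 0, the three identical header strings overwrite each other as object keys, so two of the three columns are silently lost.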

Problem 4: Empty Cells and Inconsistent Columns

Some rows have fewer cells than others. Some cells are completely empty. Some have just whitespace or &nbsp;.

// Normalize all rows to the same length
function normalizeMatrix(grid) {
  const maxCols = Math.max(...grid.map(row => row.length));

  return grid.map(row => {
    const normalized = new Array(maxCols);
    for (let i = 0; i < maxCols; i++) {
      normalized[i] = row[i] ?? "";
    }
    return normalized;
  });
}
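
Padding fixes the column count, but a cell that holds only whitespace or a non-breaking space still looks non-empty to a plain comparison. A possible companion step, sketched here with hypothetical helpers isBlank and dropEmptyRows, treats such cells as blank and removes rows that contain nothing else.

```javascript
// Hypothetical helpers, assuming the matrix already holds strings.
// \s in JavaScript already matches U+00A0 (non-breaking space), so a
// whitespace-only test covers &nbsp; cells as well.
function isBlank(cell) {
  return cell == null || /^\s*$/.test(cell);
}

// Drop rows in which every cell is blank (spacer rows, separators).
function dropEmptyRows(matrix) {
  return matrix.filter(row => !row.every(isBlank));
}

const cleaned = dropEmptyRows([
  ["North", "$1.2M"],
  ["\u00a0", "   "], // spacer row: only a non-breaking space and spaces
  ["South", ""],
]);
// cleaned keeps the "North" and "South" rows and drops the spacer row
```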

Problem 5: Invisible Content

Tables can contain content that's visually hidden but present in the DOM:

<style> blocks for scoped CSS
<script> tags
display: none elements
Zero-width characters

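const invisibleSelectors = [
  "style",
  "script",
  "noscript",
  "template",
  "[hidden]",
  // These two only match inline styles; elements hidden by a stylesheet
  // rule are not caught by attribute selectors.
  "[style*='display: none']",
  "[style*='display:none']"
].join(", ");

// `clone` is the cloned cell from extractCellText above
clone.querySelectorAll(invisibleSelectors).forEach(el => el.remove());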
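
The selectors above remove elements, but zero-width characters live inside the text itself and survive that step. One way to handle them, as a sketch: stripZeroWidth is a hypothetical helper, and the character list is an assumption about what is safe to drop. Zero-width joiners are meaningful in some scripts and in emoji sequences, so a real pipeline may want a narrower list.

```javascript
// Zero-width space, non-joiner, joiner, word joiner, and BOM / zero-width
// no-break space. Assumed safe to drop for tabular data; narrow the list
// if the source text relies on joiners.
const ZERO_WIDTH = /[\u200b\u200c\u200d\u2060\ufeff]/g;

// Hypothetical helper: remove zero-width characters from extracted text.
function stripZeroWidth(text) {
  return text.replace(ZERO_WIDTH, "");
}

const header = stripZeroWidth("\ufeffRegion\u200b");
// header === "Region", so it now compares equal to a plain "Region" key
```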

Problem 6: Character Encoding

Tables from different sources use different characters and encodings for the same thing:

&nbsp; (non-breaking space)
— and – (dashes)
Smart quotes vs straight quotes
UTF-8 vs ISO-8859-1

Always normalize:

function normalizeText(text) {
  return text
    .replace(/\u00a0/g, " ")           // Non-breaking space
    .replace(/[\u2018\u2019]/g, "'")   // Smart single quotes
    .replace(/[\u201c\u201d]/g, '"')   // Smart double quotes
    .replace(/[\u2013\u2014]/g, "-")   // En/em dashes
    .trim();
}

The Complete Pipeline

After handling all these cases, my extraction pipeline looks like this:

function extractTable(tableElement) {
  // 1. Build virtual grid (handles rowspan/colspan)
  let matrix = extractTableMatrix(tableElement);

  // 2. Normalize to consistent column count
  matrix = normalizeMatrix(matrix);

  // 3. Find actual header row
  const headerIndex = detectHeaderRowIndex(matrix);

  // 4. Remove title rows
  if (headerIndex > 0) {
    matrix = matrix.slice(headerIndex);
  }

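  // 5. Clean all cell text. The matrix already holds strings at this point,
  //    so only normalizeText applies here. extractCellText needs a DOM cell:
  //    call it inside extractTableMatrix in place of cell.textContent.trim()
  //    to strip nested tables and hidden content.
  matrix = matrix.map(row =>
    row.map(cell => normalizeText(cell))
  );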

  return matrix;
}

Edge Cases I've Encountered

After processing thousands of tables, here are some real-world edge cases:

Source	Issue	Solution
Wikipedia	"v t e" navigation rows	Pattern detection + skip
Wikipedia	Horizontally duplicated columns	Detect repeating headers, unstack
FBRef	Grouped column headers	Merge group + sub-header
Financial sites	Numbers as `1.234.567,89`	Locale-aware normalization
Government sites	Tables inside `<form>`	Ignore form wrapper
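
Two of the rows above can be sketched without a DOM. Both helpers are hypothetical and written for illustration: parseLocaleNumber assumes the caller already knows which separators the source uses, and mergeHeaderRows assumes the group row and sub-header row have already been expanded to the same width by the grid step. The column names are made up.

```javascript
// Hypothetical helper: parse a number written with known separators.
// Defaults match the "1.234.567,89" style from the table above.
function parseLocaleNumber(text, { group = ".", decimal = "," } = {}) {
  const cleaned = text
    .split(group).join("")      // drop thousands separators
    .replace(decimal, ".")      // use a dot as the decimal mark
    .replace(/[^\d.+-]/g, "");  // drop currency symbols and spaces
  const value = Number(cleaned);
  return cleaned !== "" && Number.isFinite(value) ? value : null;
}

// Hypothetical helper: merge a group header row with its sub-header row.
function mergeHeaderRows(groupRow, subRow) {
  return subRow.map((sub, i) => {
    const group = groupRow[i] ?? "";
    // A rowspan header repeats in both rows after grid expansion; keep it once.
    return group && group !== sub ? `${group} ${sub}`.trim() : sub;
  });
}

const amount = parseLocaleNumber("1.234.567,89");
// amount === 1234567.89

const mergedHeaders = mergeHeaderRows(
  ["Player", "Passing", "Passing"],
  ["Player", "Cmp", "Att"]
);
// mergedHeaders: ["Player", "Passing Cmp", "Passing Att"]
```

Returning null for unparseable input keeps text cells such as "n/a" distinguishable from a real zero.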

Lessons Learned

Never trust the DOM structure. Build your own normalized representation.
Test with edge cases early. Wikipedia, FBRef, and government sites are great stress tests.
Normalize everything. Whitespace, encoding, column count—make it consistent.
Headers need detection. Don't assume row 0 is the header.
Spans are the enemy. The grid-building approach handles them cleanly.

If you want to skip all this complexity, HTML Table Exporter handles these edge cases automatically. One click to export any table to CSV, JSON, or Excel.

Learn more at gauchogrid.com/html-table-exporter or try it free on the Chrome Web Store.

But if you're building your own parser, I hope this saves you some debugging time. The rabbit hole is deep.

What's the weirdest table structure you've encountered? Share in the comments. I collect edge cases like trading cards.

DEV Community