DEV Community

circobit

The Hidden Complexity of HTML Tables

HTML tables look simple. Rows and cells. What could go wrong?

Everything, it turns out.

I've spent months building HTML Table Exporter, a table extraction tool, and I've learned that HTML tables are deceptively complex. Here's what nobody tells you when you start parsing them.

The Naive Approach (That Breaks Immediately)

Your first attempt probably looks like this:

function extractTable(table) {
  return Array.from(table.rows).map(row =>
    Array.from(row.cells).map(cell => cell.textContent.trim())
  );
}

Clean, simple, wrong.

This works for basic tables. But the real web is full of tables that will break this code in creative ways.

Problem 1: Rowspan and Colspan

HTML cells can span multiple rows or columns:

<table>
  <tr>
    <td rowspan="3">Category A</td>
    <td>Item 1</td>
    <td>$10</td>
  </tr>
  <tr>
    <td>Item 2</td>
    <td>$20</td>
  </tr>
  <tr>
    <td>Item 3</td>
    <td>$30</td>
  </tr>
</table>

The naive approach gives you:

Row 0: ["Category A", "Item 1", "$10"]
Row 1: ["Item 2", "$20"]           // Only 2 cells!
Row 2: ["Item 3", "$30"]           // Only 2 cells!

Rows 1 and 2 are missing a column, so "Item 2" and "Item 3" land under the category column and every value after them shifts left. Your column alignment is now broken.

The Fix: Build a Virtual Grid

You need to track which cells are "occupied" by spans from previous rows:

function extractTableMatrix(table) {
  const rows = Array.from(table.rows);
  const grid = [];

  rows.forEach((rowEl, rowIndex) => {
    if (!grid[rowIndex]) grid[rowIndex] = [];
    let colIndex = 0;

    Array.from(rowEl.cells).forEach((cell) => {
      // Keep advancing past occupied columns
      while (grid[rowIndex][colIndex] !== undefined) {
        colIndex++;
      }

      const text = cell.textContent.trim();
      const rowSpan = cell.rowSpan || 1;
      const colSpan = cell.colSpan || 1;

      // Fill the entire span region
      for (let r = 0; r < rowSpan; r++) {
        for (let c = 0; c < colSpan; c++) {
          const targetRow = rowIndex + r;
          const targetCol = colIndex + c;

          if (!grid[targetRow]) grid[targetRow] = [];
          if (grid[targetRow][targetCol] === undefined) {
            grid[targetRow][targetCol] = text;
          }
        }
      }

      colIndex += colSpan;
    });
  });

  return grid;
}

Now you get:

Row 0: ["Category A", "Item 1", "$10"]
Row 1: ["Category A", "Item 2", "$20"]
Row 2: ["Category A", "Item 3", "$30"]

The spanning cell's value is duplicated into each position it occupies.
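
To test the grid logic without a browser, the same idea can be sketched against plain objects. The input shape (`{ text, rowSpan, colSpan }` per cell) and the name `buildGrid` are assumptions made for illustration; they are not part of the extractor above.

```javascript
// Hypothetical DOM-free variant: each row is an array of
// { text, rowSpan, colSpan } objects instead of <td> elements.
function buildGrid(rows) {
  const grid = [];

  rows.forEach((cells, rowIndex) => {
    if (!grid[rowIndex]) grid[rowIndex] = [];
    let colIndex = 0;

    for (const cell of cells) {
      // Skip positions already claimed by a rowspan from an earlier row
      while (grid[rowIndex][colIndex] !== undefined) colIndex++;

      const rowSpan = cell.rowSpan || 1;
      const colSpan = cell.colSpan || 1;

      // Copy the cell text into every position the span covers
      for (let r = 0; r < rowSpan; r++) {
        for (let c = 0; c < colSpan; c++) {
          const targetRow = rowIndex + r;
          if (!grid[targetRow]) grid[targetRow] = [];
          if (grid[targetRow][colIndex + c] === undefined) {
            grid[targetRow][colIndex + c] = cell.text;
          }
        }
      }
      colIndex += colSpan;
    }
  });

  return grid;
}

const grid = buildGrid([
  [{ text: "Category A", rowSpan: 2 }, { text: "Totals", colSpan: 2 }],
  [{ text: "Item 1" }, { text: "$10" }],
]);
console.log(grid);
// [["Category A", "Totals", "Totals"], ["Category A", "Item 1", "$10"]]
```

Mixing a rowspan and a colspan in one fixture like this is a quick way to check that both loops fill the right positions.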

Problem 2: Nested Tables

Some sites use tables inside tables for layout:

<table>
  <tr>
    <td>
      <table>
        <tr><td>Nested content</td></tr>
      </table>
    </td>
    <td>Regular cell</td>
  </tr>
</table>

Using cell.textContent captures everything, including the nested table's content. Switching to cell.innerText does not fix this: it still returns the nested table's text, and its output depends on how the element is rendered.

The Fix: Clone and Remove

function extractCellText(cell) {
  if (!cell) return "";

  // Clone to avoid modifying the DOM
  const clone = cell.cloneNode(true);

  // Remove invisible and nested content
  const removeSelectors = "style, script, noscript, template, table";
  clone.querySelectorAll(removeSelectors).forEach(el => el.remove());

  // Normalize whitespace
  return (clone.textContent || "").replace(/\s+/g, " ").trim();
}
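
An alternative that avoids cloning is to walk the cell's child nodes and skip unwanted subtrees. The sketch below is my assumption of how that could look, not the exporter's actual code. It relies only on the standard `nodeType`, `nodeValue`, `tagName`, and `childNodes` properties, so plain objects with the same shape can stand in for DOM nodes in tests. The names `collectText` and `extractCellTextByWalking` are hypothetical.

```javascript
const SKIP_TAGS = new Set(["STYLE", "SCRIPT", "NOSCRIPT", "TEMPLATE", "TABLE"]);

// Recursively gather text, ignoring any subtree rooted at a skipped tag.
function collectText(node, parts) {
  if (node.nodeType === 3) {
    // Text node
    parts.push(node.nodeValue);
  } else if (node.nodeType === 1 && !SKIP_TAGS.has(node.tagName.toUpperCase())) {
    // Element node that is allowed: descend into its children
    for (const child of Array.from(node.childNodes)) {
      collectText(child, parts);
    }
  }
  return parts;
}

function extractCellTextByWalking(cell) {
  if (!cell) return "";

  // Start from the children so the cell element itself is never filtered out
  const parts = [];
  for (const child of Array.from(cell.childNodes)) {
    collectText(child, parts);
  }

  // Normalize whitespace the same way as the clone-based version
  return parts.join("").replace(/\s+/g, " ").trim();
}
```

The trade-off is that nothing is copied, which may matter on very large cells, at the cost of slightly more code than the clone-and-remove version.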

Problem 3: Headers Aren't Always Row 0

Many tables have title rows, navigation rows, or other content before the actual headers:

<table>
  <tr>
    <td colspan="3">Quarterly Sales Report 2024</td>
  </tr>
  <tr>
    <th>Region</th>
    <th>Q1</th>
    <th>Q2</th>
  </tr>
  <tr>
    <td>North</td>
    <td>$1.2M</td>
    <td>$1.4M</td>
  </tr>
</table>

If you assume row 0 is headers, you'll get "Quarterly Sales Report 2024" as your column name.

The Fix: Detect Header Row

function detectHeaderRowIndex(matrix) {
  for (let i = 0; i < Math.min(matrix.length - 1, 3); i++) {
    const row = matrix[i];
    const nextRow = matrix[i + 1];

    // Count unique values
    const uniqueValues = new Set(row.filter(c => c && c.trim()));
    const uniqueNext = new Set(nextRow.filter(c => c && c.trim()));

    // Title row: one value repeated across every column (an expanded colspan),
    // followed by a row with several distinct values
    const isTitleRow =
      uniqueValues.size === 1 &&
      uniqueNext.size > 1 &&
      row.length > 1 &&
      row.every(c => c === row[0]);

    if (isTitleRow) {
      return i + 1; // Headers are in the next row
    }
  }

  return 0;
}
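
When the markup does use <th>, as in the example above, the tags themselves are a cheaper signal than text heuristics. The helper below is a sketch of that idea under my own assumptions: the name `findThHeaderRowIndex` is hypothetical, and requiring every cell in the row to be a <th> is a deliberately conservative choice, since a <th> in the first column is often a row header.

```javascript
// Hypothetical helper: returns the index of the first row made up
// entirely of <th> cells, or -1 when the table has no such row.
function findThHeaderRowIndex(table) {
  const rows = Array.from(table.rows);
  return rows.findIndex((row) => {
    const cells = Array.from(row.cells);
    return (
      cells.length > 0 &&
      cells.every((cell) => cell.tagName.toUpperCase() === "TH")
    );
  });
}
```

A caller could try this first and fall back to the matrix-based detection when it returns -1, because plenty of real tables mark their headers with plain <td> cells.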

Problem 4: Empty Cells and Inconsistent Columns

Some rows have fewer cells than others. Some cells are completely empty. Some have just whitespace or &nbsp;.

// Normalize all rows to the same length
function normalizeMatrix(grid) {
  // Math.max() with no arguments returns -Infinity, so guard against an empty grid
  const maxCols = Math.max(0, ...grid.map(row => row.length));

  return grid.map(row => {
    const normalized = new Array(maxCols);
    for (let i = 0; i < maxCols; i++) {
      normalized[i] = row[i] ?? "";
    }
    return normalized;
  });
}
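
Padding fixes the column count, but cells that hold only whitespace or &nbsp; still look non-empty to a naive check. A minimal sketch for treating them as blank and dropping rows with no content follows; `isBlankCell` and `dropBlankRows` are illustrative names, not functions from the exporter.

```javascript
// In JavaScript regular expressions, \s already matches U+00A0,
// the character that &nbsp; decodes to.
function isBlankCell(value) {
  return value == null || /^\s*$/.test(String(value));
}

// Remove rows in which every cell is blank.
function dropBlankRows(matrix) {
  return matrix.filter((row) => !Array.from(row).every((cell) => isBlankCell(cell)));
}

console.log(dropBlankRows([
  ["North", ""],
  ["\u00a0", "   "],
  ["", "$1.4M"],
]));
// [["North", ""], ["", "$1.4M"]]
```

Whether blank rows should be dropped at all depends on the table; some sources use them as visual separators that carry no data, which is the case this sketch assumes.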

Problem 5: Invisible Content

Tables can contain content that's visually hidden but present in the DOM:

  • <style> blocks for scoped CSS
  • <script> tags
  • display: none elements
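  • Zero-width characters

The selectors below cover the first three cases. The attribute selectors only match inline styles; an element hidden by a stylesheet rule needs a getComputedStyle check on the live element instead. Zero-width characters are text rather than elements, so they have to be stripped from the extracted string.

const invisibleSelectors = [
  "style",
  "script",
  "noscript",
  "template",
  "[hidden]",
  "[style*='display: none']",
  "[style*='display:none']"
].join(", ");

// `clone` is the cloned cell from extractCellText above
clone.querySelectorAll(invisibleSelectors).forEach(el => el.remove());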
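
Zero-width characters are not elements, so selectors cannot remove them; they have to be stripped from the text itself. Here is a small sketch of one way to do that. The character set is my assumption of a reasonable default, and `stripZeroWidth` is a hypothetical name.

```javascript
// U+200B zero width space, U+200C zero width non-joiner,
// U+200D zero width joiner, U+2060 word joiner,
// U+FEFF zero width no-break space (also used as a byte order mark).
const ZERO_WIDTH = /[\u200B-\u200D\u2060\uFEFF]/g;

function stripZeroWidth(text) {
  return text.replace(ZERO_WIDTH, "");
}

console.log(stripZeroWidth("Q1\u200B Rev\uFEFFenue")); // "Q1 Revenue"
```

Removing U+200C and U+200D can change how some scripts and emoji sequences render, so a narrower character class may be the safer choice for multilingual tables.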

Problem 6: Character Encoding

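Tables from different sources differ in encoding and typography. The browser has already decoded entities like &nbsp; by the time you read the DOM, but the resulting characters still vary: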

  • &nbsp; (non-breaking space)
  • &mdash; and &ndash; (dashes)
  • Smart quotes vs straight quotes
  • UTF-8 vs ISO-8859-1

Always normalize:

function normalizeText(text) {
  return text
    .replace(/\u00a0/g, " ")           // Non-breaking space
    .replace(/[\u2018\u2019]/g, "'")   // Smart single quotes
    .replace(/[\u201c\u201d]/g, '"')   // Smart double quotes
    .replace(/[\u2013\u2014]/g, "-")   // En/em dashes
    .trim();
}

The Complete Pipeline

After handling all these cases, my extraction pipeline looks like this:

function extractTable(tableElement) {
  // 1. Build virtual grid (handles rowspan/colspan)
  let matrix = extractTableMatrix(tableElement);

  // 2. Normalize to consistent column count
  matrix = normalizeMatrix(matrix);

  // 3. Find actual header row
  const headerIndex = detectHeaderRowIndex(matrix);

  // 4. Remove title rows
  if (headerIndex > 0) {
    matrix = matrix.slice(headerIndex);
  }

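  // 5. Clean all cell text (cells are already strings at this point;
  //    to drop nested tables, call extractCellText(cell) in place of
  //    cell.textContent.trim() inside extractTableMatrix)
  matrix = matrix.map(row =>
    row.map(cell => normalizeText(cell))
  );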

  return matrix;
}
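
Once the title rows are gone, row 0 of the matrix holds the headers, which makes it straightforward to turn the result into records for JSON output. The helper below is a sketch under that assumption; `matrixToRecords` and the `column_N` fallback naming are hypothetical choices, not the exporter's behavior.

```javascript
// Hypothetical helper: treats row 0 as headers and returns one object per data row.
function matrixToRecords(matrix) {
  if (matrix.length === 0) return [];
  const [headerRow, ...dataRows] = matrix;

  // Give blank headers a positional name and make repeated names unique
  const used = new Set();
  const keys = headerRow.map((header, i) => {
    const base = header || `column_${i + 1}`;
    let key = base;
    for (let n = 2; used.has(key); n++) key = `${base}_${n}`;
    used.add(key);
    return key;
  });

  return dataRows.map((row) =>
    Object.fromEntries(keys.map((key, i) => [key, row[i] ?? ""]))
  );
}

const records = matrixToRecords([
  ["Region", "Q1", "Q2"],
  ["North", "$1.2M", "$1.4M"],
]);
console.log(records); // [{ Region: "North", Q1: "$1.2M", Q2: "$1.4M" }]
```

Making header names unique matters because duplicated column names would otherwise overwrite each other as object keys.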

Edge Cases I've Encountered

After processing thousands of tables, here are some real-world edge cases:

| Source           | Issue                           | Solution                          |
|------------------|---------------------------------|-----------------------------------|
| Wikipedia        | "v t e" navigation rows         | Pattern detection + skip          |
| Wikipedia        | Horizontally duplicated columns | Detect repeating headers, unstack |
| FBRef            | Grouped column headers          | Merge group + sub-header          |
| Financial sites  | Numbers as 1.234.567,89         | Locale-aware normalization        |
| Government sites | Tables inside <form>            | Ignore form wrapper               |
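
As an illustration of the locale-aware normalization row, here is a minimal sketch that parses a number once the separators are known. The function name `parseLocaleNumber` and its options are hypothetical, and it assumes the caller already knows which convention the source uses; a value like 1.234 cannot be disambiguated from the text alone.

```javascript
// Hypothetical helper: parses a number written with explicit separators.
// Defaults assume "." for grouping and "," for decimals, as in 1.234.567,89.
function parseLocaleNumber(text, { group = ".", decimal = "," } = {}) {
  const cleaned = text
    .trim()
    .split(group).join("")   // drop grouping separators
    .replace(decimal, ".");  // use "." as the decimal mark

  // Reject anything that is not a plain decimal number after cleanup
  if (!/^[+-]?\d+(\.\d+)?$/.test(cleaned)) return NaN;
  return Number(cleaned);
}

console.log(parseLocaleNumber("1.234.567,89")); // 1234567.89
console.log(parseLocaleNumber("1,234,567.89", { group: ",", decimal: "." })); // 1234567.89
```

Currency symbols and percent signs are rejected here rather than guessed at, so they would need to be stripped before calling it.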

Lessons Learned

  1. Never trust the DOM structure. Build your own normalized representation.

  2. Test with edge cases early. Wikipedia, FBRef, and government sites are great stress tests.

  3. Normalize everything. Whitespace, encoding, column count: make it all consistent.

  4. Headers need detection. Don't assume row 0 is the header.

  5. Spans are the enemy. The grid-building approach handles them cleanly.


If you want to skip all this complexity, HTML Table Exporter handles these edge cases automatically. One click to export any table to CSV, JSON, or Excel.

Learn more at gauchogrid.com/html-table-exporter or try it free on the Chrome Web Store.

But if you're building your own parser, I hope this saves you some debugging time. The rabbit hole is deep.


What's the weirdest table structure you've encountered? Share in the comments. I collect edge cases like trading cards.
