circobit

Posted on Jun 3

How I Handle Rowspans and Colspans in Table Extraction

#webdev #javascript #tutorial #html

Tables with merged cells are everywhere. Wikipedia uses them. Sports stats sites use them. Financial reports use them.

And they break almost every table scraper I've seen.

Here's the problem: when you iterate through <tr> and <td> elements, you get what's in the HTML, not what's visually rendered. A cell with rowspan="3" appears once in the HTML but occupies three rows in the rendered table.

The naive approach fails silently. You get misaligned columns, missing data, and exports that look nothing like the original table.

Here's how to fix it.

The Problem: HTML Doesn't Match the Grid

Consider this simple table with a merged cell:

<table>
  <tr>
    <td rowspan="2">A</td>
    <td>B</td>
  </tr>
  <tr>
    <td>C</td>
  </tr>
</table>

Visual representation:

| A | B |
|   | C |

But if you naively extract cells by iterating rows:

// Wrong approach
table.querySelectorAll('tr').forEach(row => {
  // querySelectorAll returns a NodeList, which has no .map(), so convert it first
  const cells = Array.from(row.querySelectorAll('td'));
  console.log(cells.map(c => c.textContent));
});
// Output: ["A", "B"], ["C"]
// Expected: ["A", "B"], ["A", "C"]

Row 2 is missing the "A" that visually spans into it.
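The same misalignment can be reproduced without a browser by modeling the table as plain data. This is only a sketch: the { text, rowSpan, colSpan } shape and the names naiveRows and isRagged are assumptions made for illustration, not part of any DOM API.

```javascript
// Hypothetical plain-data model of the table above: one array per <tr>,
// one object per <td>.
const tableRows = [
  [{ text: "A", rowSpan: 2, colSpan: 1 }, { text: "B", rowSpan: 1, colSpan: 1 }],
  [{ text: "C", rowSpan: 1, colSpan: 1 }],
];

// Naive extraction: take each row's own cells and ignore spans entirely.
function naiveRows(rows) {
  return rows.map(row => row.map(cell => cell.text));
}

// A quick symptom check: rows of different lengths cannot line up as columns.
function isRagged(matrix) {
  return new Set(matrix.map(row => row.length)).size > 1;
}

const naive = naiveRows(tableRows);
console.log(JSON.stringify(naive)); // [["A","B"],["C"]]
console.log(isRagged(naive));       // true
```

Ragged rows are only one symptom. A table whose spans happen to balance out can produce rows of equal length that are still shifted, so this check can flag a problem but cannot rule one out.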

The Solution: Build a Matrix

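The correct approach is to build a 2D grid and fill it according to each cell's span attributes. A spanning cell writes its text into every slot it covers, including slots in rows that haven't been processed yet, so later cells know to skip them.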

function extractTableMatrix(table) {
  const rows = Array.from(table.rows);
  const grid = [];

  rows.forEach((rowEl, rowIndex) => {
    if (!grid[rowIndex]) grid[rowIndex] = [];
    const gridRow = grid[rowIndex];

    let colIndex = 0;

    // Skip columns already occupied by rowspans from previous rows
    while (gridRow[colIndex] !== undefined) {
      colIndex++;
    }

    const cells = Array.from(rowEl.cells);

    cells.forEach((cell) => {
      // Advance past occupied cells
      while (gridRow[colIndex] !== undefined) {
        colIndex++;
      }

      const text = cell.textContent.trim();
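      // Clamp so an oversized rowspan cannot run past the last row and create phantom rows
      const rowSpan = Math.min(parseInt(cell.rowSpan, 10) || 1, rows.length - rowIndex);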
      const colSpan = parseInt(cell.colSpan, 10) || 1;

      // Fill the block this cell spans
      for (let r = 0; r < rowSpan; r++) {
        const targetRowIndex = rowIndex + r;
        if (!grid[targetRowIndex]) grid[targetRowIndex] = [];

        for (let c = 0; c < colSpan; c++) {
          const targetColIndex = colIndex + c;
          if (grid[targetRowIndex][targetColIndex] === undefined) {
            grid[targetRowIndex][targetColIndex] = text;
          }
        }
      }

      colIndex += colSpan;
    });
  });

  return grid;
}

Now the output matches the visual table:

const matrix = extractTableMatrix(table);
// [["A", "B"], ["A", "C"]]
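To exercise the fill logic outside a browser (for example in a Node unit test), the same idea can be sketched against plain objects. The input shape and the name buildGrid are assumptions for illustration. The clamp on rowSpan is an added safeguard, not something the DOM does for you.

```javascript
// Assumed input shape (not a DOM API): an array of rows, each row an array
// of { text, rowSpan, colSpan } descriptors. Missing spans default to 1.
function buildGrid(rows) {
  const grid = rows.map(() => []);

  rows.forEach((cells, rowIndex) => {
    let colIndex = 0;

    cells.forEach(cell => {
      // Skip slots already claimed by a rowspan from an earlier row.
      while (grid[rowIndex][colIndex] !== undefined) colIndex++;

      // Clamp so a rowspan cannot run past the last row.
      const rowSpan = Math.min(cell.rowSpan || 1, rows.length - rowIndex);
      const colSpan = cell.colSpan || 1;

      // Copy the text into every slot of the rowSpan x colSpan rectangle.
      for (let r = 0; r < rowSpan; r++) {
        const target = grid[rowIndex + r];
        for (let c = 0; c < colSpan; c++) {
          if (target[colIndex + c] === undefined) target[colIndex + c] = cell.text;
        }
      }
      colIndex += colSpan;
    });
  });

  return grid;
}

const grid = buildGrid([
  [{ text: "A", rowSpan: 2 }, { text: "B" }],
  [{ text: "C" }],
]);
console.log(JSON.stringify(grid)); // [["A","B"],["A","C"]]
```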

Why This Matters: Real-World Examples

Wikipedia Country Tables

Wikipedia uses merged cells extensively. A table listing countries by region might have:

<tr>
  <td rowspan="5">Europe</td>
  <td>France</td>
</tr>
<tr>
  <td>Germany</td>
</tr>
<!-- ... -->

Without proper handling, you lose the "Europe" association for rows 2-5.
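Once the region has been copied into every row it spans, each row can stand on its own as a record. The sketch below assumes a grid that a span-aware extractor has already produced. The sample data and the name gridToRecords are hypothetical.

```javascript
// Hypothetical grid after span-aware extraction: the rowspan "Europe" cell
// has already been copied into each row it covers.
const grid = [
  ["Region", "Country"],
  ["Europe", "France"],
  ["Europe", "Germany"],
  ["Europe", "Italy"],
];

// Treat the first row as headers and turn every later row into an object.
function gridToRecords(matrix) {
  const [headers, ...body] = matrix;
  return body.map(row =>
    Object.fromEntries(headers.map((header, i) => [header, row[i] ?? ""]))
  );
}

const records = gridToRecords(grid);
console.log(JSON.stringify(records[1])); // {"Region":"Europe","Country":"Germany"}
```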

Sports Statistics Sites

Sites like FBref use both rowspan and colspan. Player stats tables have:

Grouped headers with colspan (Playing Time → MP | Starts | Min)
Summary rows with rowspan spanning player details

A naive scraper exports gibberish. A matrix-based approach preserves the structure.

Financial Reports

Quarterly reports often have:

<td colspan="4">2024</td>
<!-- Then: Q1, Q2, Q3, Q4 -->

The year header needs to propagate to all four quarter columns.
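After extraction the year appears above every quarter, so the header rows can be merged into one row of column names. This is a sketch under assumptions: the sample rows, the "Metric" column, and the name flattenHeaders are invented for illustration.

```javascript
// Hypothetical two-row header after span-aware extraction: the colspan="4"
// year cell has been repeated above each of its quarter columns.
const headerRows = [
  ["Metric", "2024", "2024", "2024", "2024"],
  ["Metric", "Q1", "Q2", "Q3", "Q4"],
];

// Join the header rows column by column. Consecutive repeats are dropped so
// a cell spanning both header rows ("Metric") is not doubled.
function flattenHeaders(rows, separator = " ") {
  const width = rows.reduce((max, row) => Math.max(max, row.length), 0);
  const names = [];
  for (let col = 0; col < width; col++) {
    const parts = [];
    for (const row of rows) {
      const value = row[col] ?? "";
      if (value !== "" && parts[parts.length - 1] !== value) parts.push(value);
    }
    names.push(parts.join(separator));
  }
  return names;
}

console.log(JSON.stringify(flattenHeaders(headerRows)));
// ["Metric","2024 Q1","2024 Q2","2024 Q3","2024 Q4"]
```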

Edge Cases to Handle

Empty Cells in the Middle

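Sometimes cells are genuinely empty. Don't confuse this with "not yet filled by a span": the matrix builder stores an empty string for an empty cell and leaves a slot undefined only when nothing covers it, which is why its checks compare against undefined rather than testing truthiness. Once the grid is built, pad the rows to a common width: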

// Normalize to same column count
const maxCols = grid.reduce((max, row) => 
  Math.max(max, row ? row.length : 0), 0
);

const normalized = grid.map(row => {
  const r = row || [];
  const out = new Array(maxCols);
  for (let i = 0; i < maxCols; i++) {
    out[i] = r[i] != null ? r[i] : "";
  }
  return out;
});
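Normalization erases the difference between the two cases, so if you need to report malformed tables it helps to look for uncovered slots before padding. A minimal sketch follows. The sample grid and the name findUnfilledSlots are hypothetical.

```javascript
// Hypothetical raw grid before normalization: "" is a real empty cell,
// while a missing slot (undefined) was never covered by any cell or span.
const rawGrid = [
  ["Name", "Q1", "Q2"],
  ["Alice", "", "7"], // Q1 is genuinely empty in the HTML
  ["Bob", "3"],       // this row is one cell short
];

// List the [row, col] coordinates that no cell covered.
function findUnfilledSlots(grid) {
  const width = grid.reduce((max, row) => Math.max(max, row ? row.length : 0), 0);
  const holes = [];
  for (let r = 0; r < grid.length; r++) {
    const row = grid[r];
    for (let c = 0; c < width; c++) {
      if (!row || row[c] === undefined) holes.push([r, c]);
    }
  }
  return holes;
}

console.log(JSON.stringify(findUnfilledSlots(rawGrid))); // [[2,2]]
```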

Nested Tables

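Tables inside cells create noise. Most of the time you want, at most, the nested table's text flattened into the parent cell, not a recursive extraction. Note that table.rows returns only the table's own rows, whereas querySelectorAll('tr') also picks up the rows of nested tables. The helper below goes one step further and drops nested tables entirely; remove table from the selector to keep their text instead.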

function extractCellText(cell) {
  const clone = cell.cloneNode(true);

  // Remove nested tables, scripts, styles
  clone.querySelectorAll("table, script, style, noscript")
    .forEach(el => el.remove());

  return clone.textContent.replace(/\s+/g, " ").trim();
}

Colspan + Rowspan Combined

The algorithm handles this naturally—you're just filling a larger rectangle in the grid.

<td rowspan="2" colspan="3">Big Cell</td>

Fills a 2×3 block with the same value.

When to Use a Tool

Writing this from scratch is educational but time-consuming. For production use, consider:

HTML Table Exporter — Browser extension that handles these edge cases automatically
Pandas read_html() — Works but struggles with complex nested tables
Beautiful Soup — Requires manual span handling

The Full Algorithm

Here's the full version, combining span handling, cell text cleanup, and row normalization:

function extractCellText(cell) {
  if (!cell) return "";
  const clone = cell.cloneNode(true);

  // Remove elements that contain non-visible content
  const invisibleSelectors = "style, script, noscript, template, link";
  clone.querySelectorAll(invisibleSelectors).forEach(el => el.remove());

  const text = clone.textContent || "";
  return text.replace(/\s+/g, " ").trim();
}

function extractTableMatrix(table) {
  const rows = Array.from(table.rows);
  const grid = [];

  rows.forEach((rowEl, rowIndex) => {
    if (!grid[rowIndex]) grid[rowIndex] = [];
    const gridRow = grid[rowIndex];

    let colIndex = 0;

    while (gridRow[colIndex] !== undefined) {
      colIndex++;
    }

    const cells = Array.from(rowEl.cells);

    cells.forEach((cell) => {
      while (gridRow[colIndex] !== undefined) {
        colIndex++;
      }

      const text = extractCellText(cell);
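      // Clamp so an oversized rowspan cannot run past the last row and create phantom rows
      const rowSpan = Math.min(parseInt(cell.rowSpan, 10) || 1, rows.length - rowIndex);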
      const colSpan = parseInt(cell.colSpan, 10) || 1;

      for (let r = 0; r < rowSpan; r++) {
        const targetRowIndex = rowIndex + r;
        if (!grid[targetRowIndex]) grid[targetRowIndex] = [];
        const targetRow = grid[targetRowIndex];

        for (let c = 0; c < colSpan; c++) {
          const targetColIndex = colIndex + c;
          if (targetRow[targetColIndex] === undefined) {
            targetRow[targetColIndex] = text;
          }
        }
      }

      colIndex += colSpan;
    });
  });

  // Normalize all rows to same length
  const maxCols = grid.reduce((max, row) => 
    Math.max(max, row ? row.length : 0), 0
  );

  return grid.map(row => {
    const r = row || [];
    const out = new Array(maxCols);
    for (let i = 0; i < maxCols; i++) {
      out[i] = r[i] != null ? r[i] : "";
    }
    return out;
  });
}
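With a rectangular matrix in hand, export is mostly a serialization step. As one example, here is a minimal CSV writer. The names escapeCsvField and gridToCsv are assumptions for illustration, and the quoting follows the common RFC 4180 convention rather than any particular spreadsheet's dialect.

```javascript
// Quote a field only when it contains a comma, a double quote, or a line
// break, doubling any embedded quotes.
function escapeCsvField(value) {
  const text = String(value ?? "");
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize a rectangular grid (array of rows of strings) as CSV text.
function gridToCsv(grid) {
  return grid.map(row => row.map(escapeCsvField).join(",")).join("\r\n");
}

const csv = gridToCsv([
  ["Region", "Country"],
  ["Europe", "France"],
  ["Europe", 'Germany, "DE"'],
]);
console.log(csv);
```

The last row shows why escaping matters: a comma or quote inside a cell would otherwise shift every column after it.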

Summary

Approach	Handles Spans	Production Ready
Naive iteration	❌	❌
Matrix with spans	✅	✅
Pandas read_html	Partial	Depends
Browser extension	✅	✅

The key insight: think of tables as grids, not as nested HTML elements. Build the grid first, then export.

For more on exporting Wikipedia's particularly tricky tables, see our guide on exporting Wikipedia tables to Excel.

Need to extract tables without writing code? Learn more at gauchogrid.com/html-table-exporter or try it free on the Chrome Web Store.
