circobit

How I Handle Rowspans and Colspans in Table Extraction

Tables with merged cells are everywhere. Wikipedia uses them. Sports stats sites use them. Financial reports use them.

And they break almost every table scraper I've seen.

Here's the problem: when you iterate through <tr> and <td> elements, you get what's in the HTML, not what's visually rendered. A cell with rowspan="3" appears once in the HTML but occupies three rows in the rendered table.

The naive approach fails silently. You get misaligned columns, missing data, and exports that look nothing like the original table.

Here's how to fix it.

The Problem: HTML Doesn't Match the Grid

Consider this simple table with a merged cell:

<table>
  <tr>
    <td rowspan="2">A</td>
    <td>B</td>
  </tr>
  <tr>
    <td>C</td>
  </tr>
</table>

Visual representation:

| A | B |
|   | C |

But if you naively extract cells by iterating rows:

// Wrong approach
table.querySelectorAll('tr').forEach(row => {
  // querySelectorAll returns a NodeList, which has no map(); convert it first
  const cells = Array.from(row.querySelectorAll('td'));
  console.log(cells.map(c => c.textContent));
});
// Output: ["A", "B"], ["C"]
// Expected: ["A", "B"], ["A", "C"]

Row 2 is missing the "A" that visually spans into it.

The Solution: Build a Matrix

The correct approach is to build a 2D grid and fill it according to each cell's span attributes, so that every visual position in the table gets a value.

function extractTableMatrix(table) {
  const rows = Array.from(table.rows);
  const grid = [];

  rows.forEach((rowEl, rowIndex) => {
    if (!grid[rowIndex]) grid[rowIndex] = [];
    const gridRow = grid[rowIndex];

    let colIndex = 0;

    // Skip columns already occupied by rowspans from previous rows
    while (gridRow[colIndex] !== undefined) {
      colIndex++;
    }

    const cells = Array.from(rowEl.cells);

    cells.forEach((cell) => {
      // Advance past occupied cells
      while (gridRow[colIndex] !== undefined) {
        colIndex++;
      }

      const text = cell.textContent.trim();
      const rowSpan = parseInt(cell.rowSpan, 10) || 1;
      const colSpan = parseInt(cell.colSpan, 10) || 1;

      // Fill the block this cell spans
      for (let r = 0; r < rowSpan; r++) {
        const targetRowIndex = rowIndex + r;
        if (!grid[targetRowIndex]) grid[targetRowIndex] = [];

        for (let c = 0; c < colSpan; c++) {
          const targetColIndex = colIndex + c;
          if (grid[targetRowIndex][targetColIndex] === undefined) {
            grid[targetRowIndex][targetColIndex] = text;
          }
        }
      }

      colIndex += colSpan;
    });
  });

  return grid;
}

Now the output matches the visual table:

const matrix = extractTableMatrix(table);
// [["A", "B"], ["A", "C"]]

Why This Matters: Real-World Examples

Wikipedia Country Tables

Wikipedia uses merged cells extensively. A table listing countries by region might have:

<tr>
  <td rowspan="5">Europe</td>
  <td>France</td>
</tr>
<tr>
  <td>Germany</td>
</tr>
<!-- ... -->

Without proper handling, you lose the "Europe" association for rows 2-5.
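To watch the region label carry down without a browser, here is a minimal sketch of the same fill running on plain objects instead of DOM nodes. The buildGrid name, the { text, rowSpan, colSpan } cell shape, and the three-country sample are assumptions made for illustration.

```javascript
// Hypothetical DOM-free variant: each row is an array of
// { text, rowSpan, colSpan } objects (a shape assumed for this sketch).
function buildGrid(rows) {
  const grid = [];
  rows.forEach((cells, rowIndex) => {
    if (!grid[rowIndex]) grid[rowIndex] = [];
    let colIndex = 0;
    cells.forEach(({ text, rowSpan = 1, colSpan = 1 }) => {
      // Skip slots already claimed by a rowspan from an earlier row
      while (grid[rowIndex][colIndex] !== undefined) colIndex++;
      for (let r = 0; r < rowSpan; r++) {
        if (!grid[rowIndex + r]) grid[rowIndex + r] = [];
        const target = grid[rowIndex + r];
        for (let c = 0; c < colSpan; c++) {
          if (target[colIndex + c] === undefined) target[colIndex + c] = text;
        }
      }
      colIndex += colSpan;
    });
  });
  return grid;
}

// Made-up sample data: one region cell spanning three country rows
const regions = buildGrid([
  [{ text: "Europe", rowSpan: 3 }, { text: "France" }],
  [{ text: "Germany" }],
  [{ text: "Spain" }],
]);
console.log(regions);
// [["Europe", "France"], ["Europe", "Germany"], ["Europe", "Spain"]]
```

Every row now carries its region, so each row can be filtered or sorted without looking back at earlier rows.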

Sports Statistics Sites

Sites like FBref use both rowspan and colspan. Player stats tables have:

  • Grouped headers with colspan (Playing Time → MP | Starts | Min)
  • Summary rows with rowspan spanning player details

A naive scraper exports gibberish. A matrix-based approach preserves the structure.

Financial Reports

Quarterly reports often have:

<td colspan="4">2024</td>
<!-- Then: Q1, Q2, Q3, Q4 -->

The year header needs to propagate to all four quarter columns.
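Once the grid is span-filled, that propagation has already happened, and the header rows can be joined column by column into flat names. This is a minimal sketch, assuming a grid shaped like the output of extractTableMatrix; flattenHeaders is a hypothetical helper name, not part of any library.

```javascript
// Hypothetical helper: join the first `headerRows` rows of a span-filled
// grid into one label per column, e.g. "2024" + "Q1" -> "2024 Q1".
function flattenHeaders(grid, headerRows) {
  const width = grid[0].length;
  const labels = [];
  for (let col = 0; col < width; col++) {
    const parts = [];
    for (let row = 0; row < headerRows; row++) {
      const part = grid[row][col];
      // Skip blanks and repeats (a rowspan header repeats down the column)
      if (part && parts[parts.length - 1] !== part) parts.push(part);
    }
    labels.push(parts.join(" "));
  }
  return labels;
}

// Header rows as the matrix extraction would produce them
const grid = [
  ["2024", "2024", "2024", "2024"],
  ["Q1", "Q2", "Q3", "Q4"],
];
console.log(flattenHeaders(grid, 2));
// ["2024 Q1", "2024 Q2", "2024 Q3", "2024 Q4"]
```

Joining with a space is an arbitrary choice here; any separator that cannot appear in a header label works equally well.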

Edge Cases to Handle

Empty Cells in the Middle

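Sometimes cells are genuinely empty. Don't confuse this with "not yet filled by a span." The grid stores an empty string for a genuinely empty cell and leaves an unfilled slot undefined, which is why the occupancy checks compare against undefined instead of testing truthiness. After extraction, pad every row to the same width: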

// Normalize to same column count
const maxCols = grid.reduce((max, row) => 
  Math.max(max, row ? row.length : 0), 0
);

const normalized = grid.map(row => {
  const r = row || [];
  const out = new Array(maxCols);
  for (let i = 0; i < maxCols; i++) {
    out[i] = r[i] != null ? r[i] : "";
  }
  return out;
});
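The difference between an empty cell and an unfilled slot is easy to get wrong, so here is a small sketch of why the check has to be against undefined. The helper names are hypothetical and exist only for this comparison.

```javascript
// A row where column 0 is a genuinely empty cell ("") and column 1 has not
// been written yet (a hole in the array, which reads as undefined).
const row = [];
row[0] = "";  // an empty <td></td>
row[2] = "C"; // column 1 is still unfilled

// Hypothetical helper names, used only for this sketch
const isOccupied = (r, i) => r[i] !== undefined;
const isOccupiedByTruthiness = (r, i) => Boolean(r[i]);

console.log(isOccupied(row, 0));             // true
console.log(isOccupied(row, 1));             // false
console.log(isOccupiedByTruthiness(row, 0)); // false
```

With the truthiness check, the empty cell in column 0 looks free, so the next cell would be written into it and every later column would shift left.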

Nested Tables

Tables inside cells create noise. Most of the time, you want the text content of the nested table, not a recursive extraction.

function extractCellText(cell) {
  const clone = cell.cloneNode(true);

  // Remove non-visible elements; a nested table stays, so its text is kept
  clone.querySelectorAll("script, style, noscript")
    .forEach(el => el.remove());

  return clone.textContent.replace(/\s+/g, " ").trim();
}

Colspan + Rowspan Combined

The algorithm handles this naturally—you're just filling a larger rectangle in the grid.

<td rowspan="2" colspan="3">Big Cell</td>

This fills a 2×3 block (2 rows, 3 columns) with the same value.
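The rectangle fill can be pulled out on its own to make this concrete. A minimal sketch, where fillBlock is a hypothetical helper that mirrors the two nested loops in extractTableMatrix:

```javascript
// Hypothetical helper: write `text` into every slot of the rectangle whose
// top-left corner is (row, col), leaving already-filled slots untouched.
function fillBlock(grid, row, col, rowSpan, colSpan, text) {
  for (let r = row; r < row + rowSpan; r++) {
    if (!grid[r]) grid[r] = [];
    for (let c = col; c < col + colSpan; c++) {
      if (grid[r][c] === undefined) grid[r][c] = text;
    }
  }
}

const block = [];
fillBlock(block, 0, 0, 2, 3, "Big Cell"); // rowspan="2" colspan="3"
fillBlock(block, 0, 3, 1, 1, "D");        // the next cell in row 0 lands in column 3
console.log(block);
// [["Big Cell", "Big Cell", "Big Cell", "D"],
//  ["Big Cell", "Big Cell", "Big Cell"]]
```

The second row ends up shorter than the first, which is why the final normalization step pads every row to the same width.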

When to Use a Tool

Writing this from scratch is educational but time-consuming. For production use, consider:

  • HTML Table Exporter — Browser extension that handles these edge cases automatically
  • Pandas read_html() — Works but struggles with complex nested tables
  • Beautiful Soup — Requires manual span handling

The Full Algorithm

Here's the full version, with the edge cases above handled:

function extractCellText(cell) {
  if (!cell) return "";
  const clone = cell.cloneNode(true);

  // Remove elements that contain non-visible content
  const invisibleSelectors = "style, script, noscript, template, link";
  clone.querySelectorAll(invisibleSelectors).forEach(el => el.remove());

  const text = clone.textContent || "";
  return text.replace(/\s+/g, " ").trim();
}

function extractTableMatrix(table) {
  const rows = Array.from(table.rows);
  const grid = [];

  rows.forEach((rowEl, rowIndex) => {
    if (!grid[rowIndex]) grid[rowIndex] = [];
    const gridRow = grid[rowIndex];

    let colIndex = 0;

    while (gridRow[colIndex] !== undefined) {
      colIndex++;
    }

    const cells = Array.from(rowEl.cells);

    cells.forEach((cell) => {
      while (gridRow[colIndex] !== undefined) {
        colIndex++;
      }

      const text = extractCellText(cell);
      const rowSpan = parseInt(cell.rowSpan, 10) || 1;
      const colSpan = parseInt(cell.colSpan, 10) || 1;

      for (let r = 0; r < rowSpan; r++) {
        const targetRowIndex = rowIndex + r;
        if (!grid[targetRowIndex]) grid[targetRowIndex] = [];
        const targetRow = grid[targetRowIndex];

        for (let c = 0; c < colSpan; c++) {
          const targetColIndex = colIndex + c;
          if (targetRow[targetColIndex] === undefined) {
            targetRow[targetColIndex] = text;
          }
        }
      }

      colIndex += colSpan;
    });
  });

  // Normalize all rows to same length
  const maxCols = grid.reduce((max, row) => 
    Math.max(max, row ? row.length : 0), 0
  );

  return grid.map(row => {
    const r = row || [];
    const out = new Array(maxCols);
    for (let i = 0; i < maxCols; i++) {
      out[i] = r[i] != null ? r[i] : "";
    }
    return out;
  });
}

Summary

| Approach          | Handles Spans | Production Ready |
| Naive iteration   | No            | No               |
| Matrix with spans | Yes           | Yes              |
| Pandas read_html  | Partial       | Depends          |
| Browser extension | Yes           | Yes              |

The key insight: think of tables as grids, not as nested HTML elements. Build the grid first, then export.
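Once the grid exists, the export step is a plain serialization. Here is a minimal CSV sketch, assuming a normalized grid like the one extractTableMatrix returns; gridToCsv is a hypothetical helper name, and the quoting follows the usual RFC 4180 convention.

```javascript
// Hypothetical helper: serialize a normalized grid as CSV text.
// A field is wrapped in double quotes if it contains a comma, a quote, or a
// line break, and any quotes inside it are doubled.
function gridToCsv(grid) {
  const escapeField = (value) => {
    const s = String(value);
    return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
  };
  return grid.map((row) => row.map(escapeField).join(",")).join("\r\n");
}

// Made-up sample data, including one field that needs quoting
const csv = gridToCsv([
  ["Region", "Country"],
  ["Europe", "France"],
  ["Europe", 'Say "hi", ok'],
]);
console.log(csv);
// Region,Country
// Europe,France
// Europe,"Say ""hi"", ok"
```

Because the spans were resolved while building the grid, the serializer needs no knowledge of rowspan or colspan.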

For more on exporting Wikipedia's particularly tricky tables, see our guide on exporting Wikipedia tables to Excel.


Need to extract tables without writing code? Learn more at gauchogrid.com/html-table-exporter or try it free on the Chrome Web Store.
