Tables with merged cells are everywhere. Wikipedia uses them. Sports stats sites use them. Financial reports use them.
And they break almost every table scraper I've seen.
Here's the problem: when you iterate through <tr> and <td> elements, you get what's in the HTML, not what's visually rendered. A cell with rowspan="3" appears once in the HTML but occupies three rows in the rendered table.
The naive approach fails silently. You get misaligned columns, missing data, and exports that look nothing like the original table.
Here's how to fix it.
The Problem: HTML Doesn't Match the Grid
Consider this simple table with a merged cell:
<table>
  <tr>
    <td rowspan="2">A</td>
    <td>B</td>
  </tr>
  <tr>
    <td>C</td>
  </tr>
</table>
Visual representation:
| A | B |
| | C |
But if you naively extract cells by iterating rows:
// Wrong approach
table.querySelectorAll('tr').forEach(row => {
  // querySelectorAll returns a NodeList, which has no .map(), so convert it first
  const cells = Array.from(row.querySelectorAll('td'));
  console.log(cells.map(c => c.textContent));
});
// Output: ["A", "B"], ["C"]
// Expected: ["A", "B"], ["A", "C"]
Row 2 is missing the "A" that visually spans into it.
The Solution: Build a Matrix
The correct approach is to pre-allocate a 2D grid and fill it according to span attributes.
function extractTableMatrix(table) {
  const rows = Array.from(table.rows);
  const grid = [];
  rows.forEach((rowEl, rowIndex) => {
    if (!grid[rowIndex]) grid[rowIndex] = [];
    const gridRow = grid[rowIndex];
    let colIndex = 0;
    // Skip columns already occupied by rowspans from previous rows
    while (gridRow[colIndex] !== undefined) {
      colIndex++;
    }
    const cells = Array.from(rowEl.cells);
    cells.forEach((cell) => {
      // Advance past occupied cells
      while (gridRow[colIndex] !== undefined) {
        colIndex++;
      }
      const text = cell.textContent.trim();
      const rowSpan = parseInt(cell.rowSpan, 10) || 1;
      const colSpan = parseInt(cell.colSpan, 10) || 1;
      // Fill the block this cell spans
      for (let r = 0; r < rowSpan; r++) {
        const targetRowIndex = rowIndex + r;
        if (!grid[targetRowIndex]) grid[targetRowIndex] = [];
        for (let c = 0; c < colSpan; c++) {
          const targetColIndex = colIndex + c;
          if (grid[targetRowIndex][targetColIndex] === undefined) {
            grid[targetRowIndex][targetColIndex] = text;
          }
        }
      }
      colIndex += colSpan;
    });
  });
  return grid;
}
Now the output matches the visual table:
const matrix = extractTableMatrix(table);
// [["A", "B"], ["A", "C"]]
Why This Matters: Real-World Examples
Wikipedia Country Tables
Wikipedia uses merged cells extensively. A table listing countries by region might have:
<tr>
  <td rowspan="5">Europe</td>
  <td>France</td>
</tr>
<tr>
  <td>Germany</td>
</tr>
<!-- ... -->
Without proper handling, you lose the "Europe" association for rows 2-5.
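To see the region value carry down without a browser, here is a minimal DOM-free sketch of the same fill logic. It assumes each row is a plain array of { text, rowSpan, colSpan } descriptors rather than real <td> elements. The name fillGridFromSpecs, the three-row table, and the "Italy" entry are made up for illustration.

```javascript
// Hypothetical DOM-free variant of the matrix fill: rows are arrays of
// { text, rowSpan, colSpan } descriptors instead of <td> elements.
function fillGridFromSpecs(rowSpecs) {
  const grid = [];
  rowSpecs.forEach((cells, rowIndex) => {
    if (!grid[rowIndex]) grid[rowIndex] = [];
    let colIndex = 0;
    cells.forEach(({ text, rowSpan = 1, colSpan = 1 }) => {
      // Skip slots already claimed by a rowspan from an earlier row
      while (grid[rowIndex][colIndex] !== undefined) colIndex++;
      for (let r = 0; r < rowSpan; r++) {
        if (!grid[rowIndex + r]) grid[rowIndex + r] = [];
        for (let c = 0; c < colSpan; c++) {
          grid[rowIndex + r][colIndex + c] = text;
        }
      }
      colIndex += colSpan;
    });
  });
  return grid;
}

// Illustrative data only: a three-row version of the region table
const regionGrid = fillGridFromSpecs([
  [{ text: "Europe", rowSpan: 3 }, { text: "France" }],
  [{ text: "Germany" }],
  [{ text: "Italy" }],
]);
console.log(regionGrid);
// [["Europe", "France"], ["Europe", "Germany"], ["Europe", "Italy"]]
```

Every row now carries its region, so filtering or grouping by the first column works after export.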
Sports Statistics Sites
Sites like FBref use both rowspan and colspan. Player stats tables have:
- Grouped headers with colspan (Playing Time → MP | Starts | Min)
- Summary rows with rowspan spanning player details
A naive scraper exports gibberish. A matrix-based approach preserves the structure.
Financial Reports
Quarterly reports often have:
<td colspan="4">2024</td>
<!-- Then: Q1, Q2, Q3, Q4 -->
The year header needs to propagate to all four quarter columns.
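Once the matrix is filled, one way to get a single label per column is to join the header rows column by column. This is a rough sketch, assuming the first two rows of the matrix are headers. The helper name flattenHeaders and the "Metric" column are hypothetical.

```javascript
// Hypothetical helper: collapse the first `headerRowCount` rows of a filled
// matrix into one label per column, skipping empty or repeated parts.
function flattenHeaders(matrix, headerRowCount) {
  const rowCount = Math.min(headerRowCount, matrix.length);
  const width = matrix.length > 0 ? matrix[0].length : 0;
  const labels = [];
  for (let col = 0; col < width; col++) {
    const parts = [];
    for (let row = 0; row < rowCount; row++) {
      const value = matrix[row][col];
      // A header with rowspan repeats vertically; keep it only once
      if (value !== "" && parts[parts.length - 1] !== value) {
        parts.push(value);
      }
    }
    labels.push(parts.join(" "));
  }
  return labels;
}

// Illustrative matrix: "2024" was a colspan="4" cell, "Metric" a rowspan="2" cell
const headerMatrix = [
  ["Metric", "2024", "2024", "2024", "2024"],
  ["Metric", "Q1", "Q2", "Q3", "Q4"],
];
console.log(flattenHeaders(headerMatrix, 2));
// ["Metric", "2024 Q1", "2024 Q2", "2024 Q3", "2024 Q4"]
```

How many rows count as headers varies from table to table, so treat headerRowCount as something you choose per table rather than something this sketch detects.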
Edge Cases to Handle
Empty Cells in the Middle
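Sometimes cells are genuinely empty. Don't confuse this with "not yet filled by a span": an empty cell is stored as "", while a slot no cell has claimed stays undefined. After filling, pad the remaining holes so every row has the same number of columns: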
// Normalize to same column count
const maxCols = grid.reduce((max, row) =>
  Math.max(max, row ? row.length : 0), 0
);
const normalized = grid.map(row => {
  const r = row || [];
  const out = new Array(maxCols);
  for (let i = 0; i < maxCols; i++) {
    out[i] = r[i] != null ? r[i] : "";
  }
  return out;
});
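The fill loop relies on the difference between an empty string and undefined, and a tiny check makes that visible. This is only an illustration: sparseGrid and isUnclaimed are hypothetical names, not part of the extractor.

```javascript
// A grid as the fill loop might leave it before normalization:
// "" is a real but empty cell, a hole (undefined) is a slot no cell claimed.
const sparseGrid = [
  ["Name", "", "Score"], // middle cell exists in the HTML but has no text
  ["Ann"],               // short row: two slots were never filled
];

// Hypothetical check mirroring the `=== undefined` test in the fill loop
function isUnclaimed(grid, row, col) {
  return grid[row] === undefined || grid[row][col] === undefined;
}

console.log(isUnclaimed(sparseGrid, 0, 1)); // false
console.log(isUnclaimed(sparseGrid, 1, 1)); // true
```

If the loop tested for falsy values instead, the empty middle cell would count as free and a later cell could shift into its column.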
Nested Tables
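Tables inside cells create noise. Most of the time, you want a single flattened text value for the cell, not a recursive extraction. This variant drops nested tables entirely; remove "table" from the selector if you want to keep their text: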
function extractCellText(cell) {
  const clone = cell.cloneNode(true);
  // Remove nested tables, scripts, styles
  clone.querySelectorAll("table, script, style, noscript")
    .forEach(el => el.remove());
  return clone.textContent.replace(/\s+/g, " ").trim();
}
Colspan + Rowspan Combined
The algorithm handles this naturally—you're just filling a larger rectangle in the grid.
<td rowspan="2" colspan="3">Big Cell</td>
Fills a 2×3 block with the same value.
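The rectangle fill can be shown in isolation. This is a small sketch, assuming a plain array-of-arrays grid. The function fillBlock is a hypothetical extraction of the two inner loops, not a separate step in the algorithm.

```javascript
// Hypothetical helper: write `text` into every unclaimed slot of the
// rowSpan x colSpan rectangle whose top-left corner is (top, left).
function fillBlock(grid, top, left, rowSpan, colSpan, text) {
  for (let r = top; r < top + rowSpan; r++) {
    if (!grid[r]) grid[r] = [];
    for (let c = left; c < left + colSpan; c++) {
      // Never overwrite a slot that an earlier cell already claimed
      if (grid[r][c] === undefined) grid[r][c] = text;
    }
  }
  return grid;
}

const blockGrid = fillBlock([], 0, 0, 2, 3, "Big Cell");
console.log(blockGrid);
// [["Big Cell", "Big Cell", "Big Cell"], ["Big Cell", "Big Cell", "Big Cell"]]
```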
When to Use a Tool
Writing this from scratch is educational but time-consuming. For production use, consider:
- HTML Table Exporter — Browser extension that handles these edge cases automatically
- Pandas read_html() — Works but struggles with complex nested tables
- Beautiful Soup — Requires manual span handling
The Full Algorithm
Here's the full version, combining the matrix fill, text cleanup, and row normalization:
function extractCellText(cell) {
  if (!cell) return "";
  const clone = cell.cloneNode(true);
  // Remove elements that contain non-visible content
  const invisibleSelectors = "style, script, noscript, template, link";
  clone.querySelectorAll(invisibleSelectors).forEach(el => el.remove());
  const text = clone.textContent || "";
  return text.replace(/\s+/g, " ").trim();
}
function extractTableMatrix(table) {
  const rows = Array.from(table.rows);
  const grid = [];
  rows.forEach((rowEl, rowIndex) => {
    if (!grid[rowIndex]) grid[rowIndex] = [];
    const gridRow = grid[rowIndex];
    let colIndex = 0;
    while (gridRow[colIndex] !== undefined) {
      colIndex++;
    }
    const cells = Array.from(rowEl.cells);
    cells.forEach((cell) => {
      while (gridRow[colIndex] !== undefined) {
        colIndex++;
      }
      const text = extractCellText(cell);
      const rowSpan = parseInt(cell.rowSpan, 10) || 1;
      const colSpan = parseInt(cell.colSpan, 10) || 1;
      for (let r = 0; r < rowSpan; r++) {
        const targetRowIndex = rowIndex + r;
        if (!grid[targetRowIndex]) grid[targetRowIndex] = [];
        const targetRow = grid[targetRowIndex];
        for (let c = 0; c < colSpan; c++) {
          const targetColIndex = colIndex + c;
          if (targetRow[targetColIndex] === undefined) {
            targetRow[targetColIndex] = text;
          }
        }
      }
      colIndex += colSpan;
    });
  });
  // Normalize all rows to same length
  const maxCols = grid.reduce((max, row) =>
    Math.max(max, row ? row.length : 0), 0
  );
  return grid.map(row => {
    const r = row || [];
    const out = new Array(maxCols);
    for (let i = 0; i < maxCols; i++) {
      out[i] = r[i] != null ? r[i] : "";
    }
    return out;
  });
}
Summary
| Approach | Handles Spans | Production Ready |
|---|---|---|
| Naive iteration | ❌ | ❌ |
| Matrix with spans | ✅ | ✅ |
| Pandas read_html | Partial | Depends |
| Browser extension | ✅ | ✅ |
The key insight: think of tables as grids, not as nested HTML elements. Build the grid first, then export.
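With a rectangular matrix in hand, the export step is short. Here is a minimal CSV sketch, assuming the common convention of quoting any field that contains a comma, a double quote, or a line break, and doubling embedded quotes. The name matrixToCsv is hypothetical.

```javascript
// Hypothetical export helper: turn a rectangular matrix of strings into CSV text.
function matrixToCsv(matrix) {
  const escapeField = (value) => {
    const s = String(value);
    // Quote fields containing commas, quotes, or line breaks; double embedded quotes
    return /[",\r\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
  };
  return matrix
    .map((row) => row.map(escapeField).join(","))
    .join("\r\n");
}

console.log(matrixToCsv([["A", "B"], ["A", "C, D"]]));
// A,B
// A,"C, D"
```

Because the spans were already expanded into the grid, the exporter needs no knowledge of rowspan or colspan.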
For more on exporting Wikipedia's particularly tricky tables, see our guide on exporting Wikipedia tables to Excel.
Need to extract tables without writing code? Learn more at gauchogrid.com/html-table-exporter or try it free on the Chrome Web Store.