HTML tables look simple. Rows and cells. What could go wrong?
Everything, it turns out.
I've spent months building HTML Table Exporter, a table extraction tool, and I've learned that HTML tables are deceptively complex. Here's what nobody tells you when you start parsing them.
The Naive Approach (That Breaks Immediately)
Your first attempt probably looks like this:
function extractTable(table) {
return Array.from(table.rows).map(row =>
Array.from(row.cells).map(cell => cell.textContent.trim())
);
}
Clean, simple, wrong.
This works for basic tables. But the real web is full of tables that will break this code in creative ways.
Problem 1: Rowspan and Colspan
HTML cells can span multiple rows or columns:
<table>
<tr>
<td rowspan="3">Category A</td>
<td>Item 1</td>
<td>$10</td>
</tr>
<tr>
<td>Item 2</td>
<td>$20</td>
</tr>
<tr>
<td>Item 3</td>
<td>$30</td>
</tr>
</table>
The naive approach gives you:
Row 0: ["Category A", "Item 1", "$10"]
Row 1: ["Item 2", "$20"] // Only 2 cells!
Row 2: ["Item 3", "$30"] // Only 2 cells!
Rows 1 and 2 are missing a column. Your column alignment is now broken.
The Fix: Build a Virtual Grid
You need to track which cells are "occupied" by spans from previous rows:
function extractTableMatrix(table) {
const rows = Array.from(table.rows);
const grid = [];
rows.forEach((rowEl, rowIndex) => {
if (!grid[rowIndex]) grid[rowIndex] = [];
let colIndex = 0;
// Skip columns already occupied by rowspans
while (grid[rowIndex][colIndex] !== undefined) {
colIndex++;
}
Array.from(rowEl.cells).forEach((cell) => {
// Keep advancing past occupied columns
while (grid[rowIndex][colIndex] !== undefined) {
colIndex++;
}
const text = cell.textContent.trim();
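// Clamp so an oversized rowspan cannot add phantom rows past the end of the table
const rowSpan = Math.min(cell.rowSpan || 1, rows.length - rowIndex);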
const colSpan = cell.colSpan || 1;
// Fill the entire span region
for (let r = 0; r < rowSpan; r++) {
for (let c = 0; c < colSpan; c++) {
const targetRow = rowIndex + r;
const targetCol = colIndex + c;
if (!grid[targetRow]) grid[targetRow] = [];
if (grid[targetRow][targetCol] === undefined) {
grid[targetRow][targetCol] = text;
}
}
}
colIndex += colSpan;
});
});
return grid;
}
Now you get:
Row 0: ["Category A", "Item 1", "$10"]
Row 1: ["Category A", "Item 2", "$20"]
Row 2: ["Category A", "Item 3", "$30"]
The spanning cell's value is duplicated into each position it occupies.
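Duplicating the value is one design choice; some exports prefer to keep it only in the span's top-left position and leave the rest blank. Here is a minimal sketch of both behaviours, written against plain objects instead of DOM cells so it can run in Node or a unit test. The buildGrid name, the { text, rowSpan, colSpan } cell shape, and the repeat option are assumptions for illustration, not part of the code above.

```javascript
// Hypothetical plain-data grid builder: each row is an array of
// { text, rowSpan, colSpan } objects (an assumed shape, not a DOM API).
function buildGrid(rows, { repeat = true } = {}) {
  const grid = rows.map(() => []);
  rows.forEach((cells, r) => {
    let c = 0;
    for (const { text, rowSpan = 1, colSpan = 1 } of cells) {
      // Skip columns already claimed by a rowspan from an earlier row
      while (grid[r][c] !== undefined) c++;
      // Clamp so a span never runs past the last row
      const lastRow = Math.min(r + rowSpan, rows.length);
      for (let rr = r; rr < lastRow; rr++) {
        for (let cc = c; cc < c + colSpan; cc++) {
          const isOrigin = rr === r && cc === c;
          // repeat: copy the text everywhere; otherwise only at the origin
          grid[rr][cc] = (repeat || isOrigin) ? text : "";
        }
      }
      c += colSpan;
    }
  });
  return grid;
}

const rows = [
  [{ text: "Category A", rowSpan: 3 }, { text: "Item 1" }, { text: "$10" }],
  [{ text: "Item 2" }, { text: "$20" }],
  [{ text: "Item 3" }, { text: "$30" }],
];
console.log(buildGrid(rows));                    // "Category A" in every row
console.log(buildGrid(rows, { repeat: false })); // "" in rows 1 and 2, column 0
```

Repeating the value tends to suit CSV and JSON output, where every row should be readable on its own. Blanking is closer to how the table looks on screen.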
Problem 2: Nested Tables
Some sites use tables inside tables for layout:
<table>
<tr>
<td>
<table>
<tr><td>Nested content</td></tr>
</table>
</td>
<td>Regular cell</td>
</tr>
</table>
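Using cell.textContent captures everything, including the nested table's content. cell.innerText is no fix either: it drops hidden elements but still returns the nested table's visible text, and because it depends on rendering it is slower and behaves differently on detached or cloned nodes.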
The Fix: Clone and Remove
function extractCellText(cell) {
if (!cell) return "";
// Clone to avoid modifying the DOM
const clone = cell.cloneNode(true);
// Remove invisible and nested content
const removeSelectors = "style, script, noscript, template, table";
clone.querySelectorAll(removeSelectors).forEach(el => el.remove());
// Normalize whitespace
return (clone.textContent || "").replace(/\s+/g, " ").trim();
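}
To apply it, call extractCellText(cell) in place of cell.textContent.trim() inside extractTableMatrix, where the cell elements are still available.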
Problem 3: Headers Aren't Always Row 0
Many tables have title rows, navigation rows, or other content before the actual headers:
<table>
<tr>
<td colspan="3">Quarterly Sales Report 2024</td>
</tr>
<tr>
<th>Region</th>
<th>Q1</th>
<th>Q2</th>
</tr>
<tr>
<td>North</td>
<td>$1.2M</td>
<td>$1.4M</td>
</tr>
</table>
If you assume row 0 is headers, you'll get "Quarterly Sales Report 2024" as your column name.
The Fix: Detect Header Row
function detectHeaderRowIndex(matrix) {
for (let i = 0; i < Math.min(matrix.length - 1, 3); i++) {
const row = matrix[i];
const nextRow = matrix[i + 1];
// Count unique values
const uniqueValues = new Set(row.filter(c => c && c.trim()));
const uniqueNext = new Set(nextRow.filter(c => c && c.trim()));
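// Title row: one unique value repeated across several columns (a full-width
// colspan), while the next row has more. A length cutoff such as
// row[0]?.length > 30 would miss the 27-character title in the example above.
const isTitleRow =
row.length > 1 &&
uniqueValues.size === 1 &&
uniqueNext.size > 1;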
if (isTitleRow) {
return i + 1; // Headers are in the next row
}
}
return 0;
}
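Once the header index is known, the grid can be turned into one object per data row, which is the shape a JSON export usually wants. This is a sketch only: toRecords, the column_N placeholder, and the _2 suffix for repeated names are all assumptions, and it expects a non-empty, already normalized matrix.

```javascript
// Hypothetical helper: turn a grid plus a detected header index into records.
function toRecords(matrix, headerIndex = 0) {
  const seen = new Map();
  // Give blank headers a placeholder and disambiguate repeated names,
  // so one column does not silently overwrite another.
  const headers = matrix[headerIndex].map((name, i) => {
    const base = name.trim() || `column_${i + 1}`;
    const count = (seen.get(base) || 0) + 1;
    seen.set(base, count);
    return count === 1 ? base : `${base}_${count}`;
  });
  // Everything after the header row is data
  return matrix.slice(headerIndex + 1).map(row =>
    Object.fromEntries(headers.map((h, i) => [h, row[i] ?? ""]))
  );
}

const matrix = [
  ["Quarterly Sales Report 2024", "Quarterly Sales Report 2024", "Quarterly Sales Report 2024"],
  ["Region", "Q1", "Q2"],
  ["North", "$1.2M", "$1.4M"],
];
console.log(toRecords(matrix, 1));
// [ { Region: 'North', Q1: '$1.2M', Q2: '$1.4M' } ]
```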
Problem 4: Empty Cells and Inconsistent Columns
Some rows have fewer cells than others. Some cells are completely empty. Some have just whitespace or &nbsp;.
// Normalize all rows to the same length
function normalizeMatrix(grid) {
const maxCols = Math.max(...grid.map(row => row.length));
return grid.map(row => {
const normalized = new Array(maxCols);
for (let i = 0; i < maxCols; i++) {
normalized[i] = row[i] ?? "";
}
return normalized;
});
}
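Padding fixes the column count, but it leaves blank cells in place. If rows or columns that are blank throughout should go, something along these lines could follow the padding step. The helper names are hypothetical, and whether to drop anything at all depends on the export.

```javascript
// trim() treats U+00A0 (what &nbsp; decodes to) as whitespace,
// so a cell holding only &nbsp; counts as blank here.
const isBlank = cell => cell == null || String(cell).trim() === "";

// Hypothetical helper: drop rows where every cell is blank
function dropBlankRows(grid) {
  return grid.filter(row => !row.every(isBlank));
}

// Hypothetical helper: drop columns that are blank in every row
function dropBlankColumns(grid) {
  const width = Math.max(0, ...grid.map(row => row.length));
  const keep = [];
  for (let c = 0; c < width; c++) {
    if (grid.some(row => !isBlank(row[c]))) keep.push(c);
  }
  return grid.map(row => keep.map(c => row[c] ?? ""));
}

const grid = [
  ["Name", "\u00a0", "Price"],
  ["  ", "", ""],
  ["Tea", "", "$3"],
];
console.log(dropBlankColumns(dropBlankRows(grid)));
// [ [ 'Name', 'Price' ], [ 'Tea', '$3' ] ]
```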
Problem 5: Invisible Content
Tables can contain content that's visually hidden but present in the DOM:
- <style> blocks for scoped CSS
- <script> tags
- display: none elements
- Zero-width characters
const invisibleSelectors = [
"style",
"script",
"noscript",
"template",
"[hidden]",
"[style*='display: none']",
"[style*='display:none']"
].join(", ");
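// In extractCellText, run this on the cloned cell (in addition to removing nested tables).
// Note: the attribute selectors only catch inline styles, not display: none set by a stylesheet.
clone.querySelectorAll(invisibleSelectors).forEach(el => el.remove());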
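The selectors above remove elements, but zero-width characters live inside the text itself and need a string-level pass. Here is a small sketch, assuming the usual suspects are the ones worth stripping. The stripZeroWidth name and the exact character set are my choice here, and zero-width joiners do carry meaning in some scripts and emoji sequences, so this may be too aggressive for multilingual data.

```javascript
// Zero-width space (U+200B), non-joiner (U+200C), joiner (U+200D),
// word joiner (U+2060), and BOM / zero-width no-break space (U+FEFF)
const ZERO_WIDTH = /[\u200B-\u200D\u2060\uFEFF]/g;

// Hypothetical helper: remove invisible zero-width characters from cell text
function stripZeroWidth(text) {
  return text.replace(ZERO_WIDTH, "");
}

console.log(stripZeroWidth("Total\u200B Sales\uFEFF") === "Total Sales"); // true
```

These characters matter because trim() does not touch most of them, so "Total" and "Total\u200B" end up as two different column names.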
Problem 6: Character Encoding
Tables from different sources use different encodings:
- &nbsp; (non-breaking space)
- — and – (em and en dashes)
- Smart quotes vs straight quotes
- UTF-8 vs ISO-8859-1
Always normalize:
function normalizeText(text) {
return text
.replace(/\u00a0/g, " ") // Non-breaking space
.replace(/[\u2018\u2019]/g, "'") // Smart single quotes
.replace(/[\u201c\u201d]/g, '"') // Smart double quotes
.replace(/[\u2013\u2014]/g, "-") // En/em dashes
.trim();
}
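normalizeText works on strings that are already decoded, which is what the DOM gives you. The UTF-8 vs ISO-8859-1 problem only shows up when you fetch raw HTML bytes yourself. A rough sketch for that case, assuming a runtime with the standard TextDecoder and legacy encodings available (browsers, and Node builds with full ICU); the decodeBytes name and the windows-1252 fallback are assumptions, and a real page's declared charset should take priority over guessing.

```javascript
// Hypothetical helper: decode raw bytes, preferring UTF-8.
function decodeBytes(bytes, fallback = "windows-1252") {
  try {
    // fatal: true throws on invalid UTF-8 instead of emitting U+FFFD
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    // Not valid UTF-8: fall back to a single-byte legacy encoding
    return new TextDecoder(fallback).decode(bytes);
  }
}

const utf8 = Uint8Array.of(0x63, 0x61, 0x66, 0xc3, 0xa9); // "café" in UTF-8
const latin1 = Uint8Array.of(0x63, 0x61, 0x66, 0xe9);     // "café" in ISO-8859-1
console.log(decodeBytes(utf8), decodeBytes(latin1));
```

The fallback is windows-1252 rather than iso-8859-1 because the WHATWG Encoding Standard maps the iso-8859-1 label to windows-1252 anyway.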
The Complete Pipeline
After handling all these cases, my extraction pipeline looks like this:
function extractTable(tableElement) {
// 1. Build virtual grid (handles rowspan/colspan)
let matrix = extractTableMatrix(tableElement);
// 2. Normalize to consistent column count
matrix = normalizeMatrix(matrix);
// 3. Find actual header row
const headerIndex = detectHeaderRowIndex(matrix);
// 4. Remove title rows
if (headerIndex > 0) {
matrix = matrix.slice(headerIndex);
}
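// 5. Normalize cell text (the matrix holds strings by now, so extractCellText
// must already have run on the cell elements inside extractTableMatrix)
matrix = matrix.map(row =>
row.map(cell => normalizeText(cell))
);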
return matrix;
}
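The matrix this returns is one step away from CSV. Here is a minimal serializer following the quoting rules in RFC 4180; csvField and toCsv are names chosen for this sketch, and it does nothing about spreadsheet formula injection or byte-order marks.

```javascript
// Quote a field only when it contains a comma, quote, or line break;
// embedded quotes are doubled, as described in RFC 4180.
function csvField(value) {
  const text = String(value ?? "");
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Join cells with commas and rows with CRLF
function toCsv(matrix) {
  return matrix.map(row => row.map(csvField).join(",")).join("\r\n");
}

console.log(toCsv([
  ["Region", "Note"],
  ["North, East", 'said "ok"'],
]));
// Region,Note
// "North, East","said ""ok"""
```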
Edge Cases I've Encountered
After processing thousands of tables, here are some real-world edge cases:
| Source | Issue | Solution |
|---|---|---|
| Wikipedia | "v t e" navigation rows | Pattern detection + skip |
| Wikipedia | Horizontally duplicated columns | Detect repeating headers, unstack |
| FBRef | Grouped column headers | Merge group + sub-header |
| Financial sites | Numbers as 1.234.567,89 | Locale-aware normalization |
| Government sites | Tables inside <form> | Ignore form wrapper |
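As an illustration of the locale-aware normalization row, here is one way the number case could be handled. The parseLocaleNumber helper is hypothetical. It takes the separators as parameters rather than guessing them, because a value like 1.234 is ambiguous without knowing the source locale.

```javascript
// Hypothetical helper: parse a number given its decimal and grouping separators.
// Returns null when the text is not a plain number.
function parseLocaleNumber(text, { decimal = ",", group = "." } = {}) {
  const cleaned = text
    .replace(/[\s\p{Sc}]/gu, "") // drop whitespace and currency symbols
    .split(group).join("")       // drop grouping separators
    .replace(decimal, ".");      // first decimal separator becomes "."
  const value = Number(cleaned);
  return cleaned !== "" && Number.isFinite(value) ? value : null;
}

console.log(parseLocaleNumber("1.234.567,89"));                             // 1234567.89
console.log(parseLocaleNumber("$1,234.50", { decimal: ".", group: "," })); // 1234.5
console.log(parseLocaleNumber("n/a"));                                      // null
```

Returning null instead of NaN or 0 makes it easy to leave non-numeric cells such as "n/a" as text.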
Lessons Learned
Never trust the DOM structure. Build your own normalized representation.
Test with edge cases early. Wikipedia, FBRef, and government sites are great stress tests.
Normalize everything. Whitespace, encoding, column count—make it consistent.
Headers need detection. Don't assume row 0 is the header.
Spans are the enemy. The grid-building approach handles them cleanly.
If you want to skip all this complexity, HTML Table Exporter handles these edge cases automatically. One click to export any table to CSV, JSON, or Excel.
Learn more at gauchogrid.com/html-table-exporter or try it free on the Chrome Web Store.
But if you're building your own parser, I hope this saves you some debugging time. The rabbit hole is deep.
What's the weirdest table structure you've encountered? Share in the comments. I collect edge cases like trading cards.