HTML tables look simple. <table>, <tr>, <td>. What could go wrong?
After building HTML Table Exporter, a table export tool that's processed thousands of real-world tables, I can tell you: a lot. This post covers the edge cases that break naive parsers and how to handle them.
The Deceptively Simple Case
A perfect table looks like this:
<table>
<thead>
<tr>
<th>Name</th>
<th>Revenue</th>
</tr>
</thead>
<tbody>
<tr>
<td>Acme Inc</td>
<td>$1.2M</td>
</tr>
</tbody>
</table>
Parsing this is trivial:
const rows = table.querySelectorAll('tr');
const data = [...rows].map(row =>
[...row.querySelectorAll('td, th')].map(cell => cell.textContent.trim())
);
Done, right? Not even close.
Problem 1: Merged Cells (colspan/rowspan)
Real tables have merged cells. A lot of them.
<tr>
<td rowspan="3">Q1 2024</td>
<td>January</td>
<td>$100k</td>
</tr>
<tr>
<td>February</td>
<td>$120k</td>
</tr>
<tr>
<td>March</td>
<td>$90k</td>
</tr>
If you parse this naively, you get:
Row 1: ["Q1 2024", "January", "$100k"]
Row 2: ["February", "$120k"] // Missing first column!
Row 3: ["March", "$90k"] // Missing first column!
The Fix: Build a Position Matrix
You need to track which cells are "occupied" by rowspans from previous rows:
function parseTableWithMergedCells(table) {
const rows = table.querySelectorAll('tr');
const matrix = [];
const rowspanTracker = []; // Track active rowspans per column
rows.forEach((row, rowIndex) => {
matrix[rowIndex] = [];
let colIndex = 0;
// Skip columns occupied by previous rowspans
while (rowspanTracker[colIndex] > 0) {
matrix[rowIndex][colIndex] = matrix[rowIndex - 1]?.[colIndex] || '';
rowspanTracker[colIndex]--;
colIndex++;
}
row.querySelectorAll('td, th').forEach(cell => {
// Skip occupied columns
while (rowspanTracker[colIndex] > 0) {
matrix[rowIndex][colIndex] = matrix[rowIndex - 1]?.[colIndex] || '';
rowspanTracker[colIndex]--;
colIndex++;
}
const colspan = parseInt(cell.getAttribute('colspan')) || 1;
const rowspan = parseInt(cell.getAttribute('rowspan')) || 1;
const value = cell.textContent.trim();
// Fill colspan
for (let c = 0; c < colspan; c++) {
matrix[rowIndex][colIndex] = value;
// Track rowspan for future rows
if (rowspan > 1) {
rowspanTracker[colIndex] = rowspan - 1;
}
colIndex++;
}
});
});
return matrix;
}
This is simplified—the real implementation needs to handle nested rowspans within colspans, which gets ugly fast.
Problem 2: Tables That Aren't Data Tables
Not every <table> contains data. Many sites (yes, still in 2024) use tables for layout:
<table>
<tr>
<td><nav>Menu here</nav></td>
<td><main>Content here</main></td>
</tr>
</table>
Or for forms:
<table>
<tr>
<td><label>Email:</label></td>
<td><input type="email"></td>
</tr>
</table>
The Fix: Heuristics
I use several signals to detect "real" data tables:
function isDataTable(table) {
const rows = table.querySelectorAll('tr');
const cells = table.querySelectorAll('td, th');
// Too few rows or cells
if (rows.length < 2 || cells.length < 4) return false;
// Contains form elements (probably a form layout)
if (table.querySelector('input, select, textarea, button')) return false;
// Mostly navigation links
const links = table.querySelectorAll('a');
const textContent = table.textContent.length;
const linkText = [...links].reduce((sum, a) => sum + a.textContent.length, 0);
if (linkText / textContent > 0.7) return false;
// Check column consistency
const colCounts = [...rows].map(row =>
row.querySelectorAll('td, th').length
);
const variance = Math.max(...colCounts) - Math.min(...colCounts);
if (variance > 3) return false; // Inconsistent columns = probably layout
return true;
}
None of these are perfect. You'll always have edge cases.
Problem 3: Hidden Content
Cells often contain more than visible text:
<td>
<span class="value">1,234</span>
<span class="sort-key" style="display:none">1234</span>
</td>
Wikipedia does this a lot for sortable tables. If you just grab textContent, you get "1,234 1234".
The Fix: Extract Visible Text Only
function getVisibleText(element) {
// Clone to avoid modifying original
const clone = element.cloneNode(true);
// Remove hidden elements
clone.querySelectorAll('[style*="display: none"], [style*="display:none"], .hidden, [hidden]').forEach(el => el.remove());
// Also check computed style for dynamically hidden elements
// (more expensive, use sparingly)
return clone.textContent.trim();
}
Problem 4: Numbers That Aren't Numbers
"$1,234.56" is a number. So is "1.234,56" (European format). So is "(1,234)" (accounting negative). So is "1,234 M" (with suffix).
Your spreadsheet needs actual numbers to do math.
The Fix: Locale-Aware Parsing
function parseNumber(value) {
if (!value || typeof value !== 'string') return value;
// Remove currency symbols and whitespace
let cleaned = value.replace(/[$€£¥₹\s]/g, '').trim();
// Handle accounting negatives: (1,234) -> -1234
if (cleaned.startsWith('(') && cleaned.endsWith(')')) {
cleaned = '-' + cleaned.slice(1, -1);
}
// Handle suffixes: 1.5M, 2.3B, 100K
const suffixes = { 'K': 1e3, 'M': 1e6, 'B': 1e9, 'T': 1e12 };
const suffixMatch = cleaned.match(/([0-9.,]+)\s*([KMBT])$/i);
if (suffixMatch) {
cleaned = suffixMatch[1];
var multiplier = suffixes[suffixMatch[2].toUpperCase()];
}
// Detect European vs US format
// European: 1.234,56 (dot for thousands, comma for decimal)
// US: 1,234.56 (comma for thousands, dot for decimal)
const lastComma = cleaned.lastIndexOf(',');
const lastDot = cleaned.lastIndexOf('.');
if (lastComma > lastDot && lastComma > cleaned.length - 4) {
// European format
cleaned = cleaned.replace(/\./g, '').replace(',', '.');
} else {
// US format
cleaned = cleaned.replace(/,/g, '');
}
let num = parseFloat(cleaned);
if (multiplier) num *= multiplier;
return isNaN(num) ? value : num;
}
This handles maybe 90% of cases. The other 10% will surprise you.
Problem 5: Character Encoding Hell
You'd think UTF-8 solved this. It didn't.
Real tables contain:
- Non-breaking spaces (
/\u00A0) that look like spaces but aren't - Zero-width characters that break string comparison
- Windows-1252 characters that got mangled into UTF-8
- Emoji that break older parsers
- Right-to-left marks in mixed-language tables
The Fix: Normalize Everything
function normalizeText(text) {
return text
// Normalize unicode (handles composed vs decomposed characters)
.normalize('NFC')
// Replace non-breaking spaces with regular spaces
.replace(/\u00A0/g, ' ')
// Remove zero-width characters
.replace(/[\u200B-\u200D\uFEFF]/g, '')
// Normalize whitespace
.replace(/\s+/g, ' ')
.trim();
}
And when exporting to CSV for Excel, prepend the UTF-8 BOM:
const BOM = '\uFEFF';
const csvContent = BOM + generateCSV(data);
Without the BOM, Excel may interpret your UTF-8 file as Windows-1252 and mangle special characters.
Problem 6: Nested Tables
Yes, tables inside tables. Usually for layout, but sometimes for data:
<table>
<tr>
<td>Product A</td>
<td>
<table>
<tr><td>Size S</td><td>$10</td></tr>
<tr><td>Size M</td><td>$12</td></tr>
</table>
</td>
</tr>
</table>
The Fix: Decide Your Strategy
Options:
- Flatten: Convert nested table to text ("Size S: $10, Size M: $12")
- Extract separately: Treat nested tables as separate exports
- Expand rows: Create multiple parent rows, one per nested row
I went with option 2 (extract separately) with option 1 as fallback for deeply nested cases. There's no perfect answer—it depends on use case.
The Reality
After handling all these cases, my table parser is ~800 lines of JavaScript. And it still doesn't handle everything perfectly.
Some hard truths:
- No parser is perfect. Real-world HTML is messy.
- Heuristics fail. You'll always need escape hatches for users.
- Performance matters. Some pages have 50+ tables. Parsing needs to be fast.
- Edge cases are infinite. Ship something that works for 95% of cases, then iterate.
Tools and Resources
If you're building something similar:
- SheetJS (xlsx) - Solid library for generating Excel files
- Papa Parse - Fast CSV parsing and generation
-
Chrome DevTools -
$('table')in console to quickly inspect tables
Or if you just need to export tables without building anything: I made HTML Table Exporter specifically because I got tired of writing one-off scrapers. It handles all the edge cases above.
Learn more at gauchogrid.com/html-table-exporter or try it free on the Chrome Web Store.
What weird table edge cases have you encountered? I'm always looking for new test cases to break my parser.
Top comments (0)