The first row of an HTML table is the header row.
Except when it isn't.
Wikipedia tables often have a title row spanning all columns before the actual headers. Sports statistics sites have grouped headers where "Playing Time" spans multiple sub-columns like "MP", "Starts", "Min". Financial tables have unit rows ("in millions USD") that look like headers but aren't.
If you assume row 0 is always the header, your exports will be broken for a significant portion of real-world tables.
Here's how to detect the actual header row programmatically.
The Problem: Three Types of "First Rows"
Consider these common patterns:
Pattern 1: Title Row
<table>
  <tr>
    <th colspan="4">World Population by Country</th> <!-- Title, not header -->
  </tr>
  <tr>
    <th>Rank</th>
    <th>Country</th>
    <th>Population</th>
    <th>% of World</th>
  </tr>
  <tr>
    <td>1</td>
    <td>India</td>
    <td>1,428,627,663</td>
    <td>17.85%</td>
  </tr>
</table>
Row 0 is a title. Row 1 is the header. Row 2+ is data.
Pattern 2: Grouped Headers (Two-Level)
<table>
  <tr>
    <th></th>
    <th></th>
    <th colspan="3">Playing Time</th>
    <th colspan="2">Performance</th>
  </tr>
  <tr>
    <th>Player</th>
    <th>Nation</th>
    <th>MP</th>
    <th>Starts</th>
    <th>Min</th>
    <th>Gls</th>
    <th>Ast</th>
  </tr>
  <tr>
    <td>John Smith</td>
    <td>ENG</td>
    <td>34</td>
    <td>30</td>
    <td>2700</td>
    <td>12</td>
    <td>8</td>
  </tr>
</table>
Row 0 is group headers. Row 1 is the actual column headers. Row 2+ is data.
Pattern 3: Wikipedia Navigation Prefix
<tr>
  <th colspan="3">v t e World Heritage Sites</th>
</tr>
The "v t e" (view/talk/edit) links are Wikipedia's template navigation. They need to be stripped, and the row might still be a title rather than a header.
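The heuristics that follow work on a "normalized matrix": an array of rows, each an array of strings, in which a cell with a colspan is repeated once per column it spans. A minimal sketch of that normalization is below. The function name `expandColspans` and the `{ text, colspan }` cell shape are assumptions made for illustration (in a browser the values would typically come from each cell's `textContent` and `colSpan`), and rowspan is ignored entirely.

```javascript
// Sketch only: `expandColspans` and the { text, colspan } cell shape are
// assumptions for illustration. Rowspan is not handled.
function expandColspans(rows) {
  return rows.map(cells =>
    cells.flatMap(({ text = "", colspan = 1 }) =>
      // Repeat the cell text once per spanned column
      Array(Math.max(1, colspan)).fill(text.trim())
    )
  );
}

// Pattern 2 from above, written as parsed cells
const parsed = [
  [
    { text: "" },
    { text: "" },
    { text: "Playing Time", colspan: 3 },
    { text: "Performance", colspan: 2 },
  ],
  [
    { text: "Player" }, { text: "Nation" }, { text: "MP" }, { text: "Starts" },
    { text: "Min" }, { text: "Gls" }, { text: "Ast" },
  ],
];

const matrix = expandColspans(parsed);
console.log(matrix[0]); // 7 cells: two blanks, "Playing Time" x3, "Performance" x2
```

After expansion every row has the same width, which is what lets the later checks compare a row against `totalColumns`.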
Heuristic 1: Detecting Title Rows
A title row typically has:
- A single cell (or very few cells)
- Large colspan spanning most/all columns
- Text content that looks like a title, not column names
function isTitleRow(row, totalColumns) {
  if (!row || row.length === 0) return false;
  // Count distinct non-empty values. In a normalized matrix a colspan cell
  // is repeated across the columns it spans, so each value is counted once.
  const distinctValues = new Set(
    row.map(cell => (cell || "").trim()).filter(Boolean)
  );
  // Title rows usually have 1-2 distinct values
  if (distinctValues.size === 0 || distinctValues.size > 2) return false;
  // Check if the first cell spans most columns (indicates colspan)
  const firstValue = (row[0] || "").trim();
  // Leading blanks are not a spanning title
  if (!firstValue) return false;
  const repeatedCount = row.filter(cell => (cell || "").trim() === firstValue).length;
  // If the first value repeats across >50% of columns, it's likely a colspan title
  return repeatedCount > totalColumns * 0.5;
}
Heuristic 2: What Makes a Row "Look Like Headers"
Header rows have characteristics that distinguish them from data rows:
function rowLooksLikeHeaders(row) {
  if (!row || row.length === 0) return false;
  // If the row is dominated by pure numbers, it's not headers
  let numericCells = 0;
  let textCells = 0;
  let emptyCells = 0;
  for (const cell of row) {
    const value = (cell || "").trim();
    if (!value) {
      emptyCells++;
    } else if (/^-?\d+([.,]\d+)?%?$/.test(value)) {
      // Pure number or percentage
      numericCells++;
    } else {
      textCells++;
    }
  }
  const totalNonEmpty = numericCells + textCells;
  if (totalNonEmpty === 0) return false;
  // Headers are mostly text, not numbers
  // If >70% of non-empty cells are numeric, it's probably data
  if (numericCells / totalNonEmpty > 0.7) {
    return false;
  }
  // Headers shouldn't be mostly empty
  if (emptyCells / row.length > 0.7) {
    return false;
  }
  return true;
}
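It can help to see which cell values the numeric pattern actually accepts. The sketch below runs the pattern from `rowLooksLikeHeaders` against a few values taken from the example tables. The second pattern, `NUMERIC_GROUPED`, is a hypothetical variant and not part of the code above: it swaps `?` for `*` so that numbers with several separator groups also count as numeric.

```javascript
// The pattern used in rowLooksLikeHeaders above
const NUMERIC = /^-?\d+([.,]\d+)?%?$/;
// Hypothetical variant: `*` instead of `?` allows repeated separator groups
const NUMERIC_GROUPED = /^-?\d+([.,]\d+)*%?$/;

const samples = ["34", "-2.5", "17.85%", "1,428,627,663", "ENG", "2023-24"];

const results = samples.map(value => ({
  value,
  original: NUMERIC.test(value),
  grouped: NUMERIC_GROUPED.test(value),
}));

console.table(results);
```

With the original pattern, "1,428,627,663" is treated as text because only one separator group is allowed. The looser variant accepts it, at the cost of also accepting strings such as "1.2.3", so which one fits depends on the tables being processed.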
Heuristic 3: What Makes a Row "Look Like Data"
The inverse check helps confirm we've found the right boundary:
function rowLooksLikeData(row) {
  if (!row || row.length === 0) return false;
  let numericCells = 0;
  let dateCells = 0;
  let totalNonEmpty = 0;
  for (const cell of row) {
    const value = (cell || "").trim();
    if (!value) continue;
    totalNonEmpty++;
    // Check for numeric patterns
    if (/^-?\d+([.,]\d+)?%?$/.test(value)) {
      numericCells++;
    }
    // Check for date patterns
    if (/^\d{1,4}[-/\.]\d{1,2}[-/\.]\d{1,4}$/.test(value)) {
      dateCells++;
    }
  }
  if (totalNonEmpty === 0) return false;
  // Data rows typically have numeric or date content
  const dataLikeCells = numericCells + dateCells;
  return dataLikeCells / totalNonEmpty > 0.3;
}
Heuristic 4: Detecting Grouped Column Headers
FBREF-style tables have a group header row followed by a sub-header row. The group row has:
- Empty cells at the beginning (columns without groups)
- Repeated values from colspan expansion
- Multiple unique non-empty values (not just one like a title)
function detectGroupHeaderRow(row, nextRow) {
  if (!row || !nextRow || row.length < 4) return false;
  // Group header rows MUST have empty cells at the beginning
  // This distinguishes from horizontally duplicated tables
  const firstCellEmpty = !(row[0] || "").trim();
  if (!firstCellEmpty) return false;
  // Count unique non-empty values
  const uniqueValues = new Set(
    row.filter(v => v && v.trim()).map(v => v.trim().toLowerCase())
  );
  // A title row has exactly ONE unique value
  // A group header row must have MULTIPLE unique values
  if (uniqueValues.size <= 1) return false;
  // Count consecutive repeated values (indicates colspan expansion)
  let consecutiveRepeats = 0;
  for (let i = 1; i < row.length; i++) {
    const curr = (row[i] || "").trim();
    const prev = (row[i - 1] || "").trim();
    if (curr === prev) consecutiveRepeats++;
  }
  const repeatRatio = consecutiveRepeats / (row.length - 1);
  // High repeat ratio (>30%) suggests colspan expansion
  // Next row should have more unique values (the actual sub-headers)
  const nextUniqueValues = new Set(
    nextRow.filter(v => v && v.trim()).map(v => v.trim().toLowerCase())
  );
  return repeatRatio > 0.3 && nextUniqueValues.size > uniqueValues.size;
}
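To make the 30% threshold concrete, here is the repeat-ratio calculation isolated and applied to the two header rows from Pattern 2. The helper name `repeatRatio` is hypothetical; it mirrors the loop inside `detectGroupHeaderRow` and is only meant as a worked example.

```javascript
// Sketch: `repeatRatio` is a hypothetical helper that isolates the
// colspan-expansion measure used in detectGroupHeaderRow.
function repeatRatio(row) {
  let repeats = 0;
  for (let i = 1; i < row.length; i++) {
    // An adjacent pair counts as a repeat when both trimmed values are equal
    if ((row[i] || "").trim() === (row[i - 1] || "").trim()) repeats++;
  }
  return row.length > 1 ? repeats / (row.length - 1) : 0;
}

const groupRow = [
  "", "",
  "Playing Time", "Playing Time", "Playing Time",
  "Performance", "Performance",
];
const subHeaderRow = ["Player", "Nation", "MP", "Starts", "Min", "Gls", "Ast"];

console.log(repeatRatio(groupRow).toFixed(2));     // 0.67 (4 of 6 adjacent pairs match)
console.log(repeatRatio(subHeaderRow).toFixed(2)); // 0.00 (no adjacent pairs match)
```

Note that the two leading empty cells also count as a matching pair, so blank group columns push the ratio up as well. The group row has 2 unique non-empty values and the sub-header row has 7, which satisfies the second condition.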
Heuristic 5: Cleaning Wikipedia Navigation Prefixes
Wikipedia templates often prefix content with "v t e" (links to view/talk/edit the template):
function cleanWikipediaNavPrefix(text) {
  if (!text) return text;
  // Pattern 1: "v t e " at the beginning (space-separated)
  // Pattern 2: "v | t | e " (pipe-separated)
  // Pattern 3: "[v] [t] [e] " (bracket-separated)
  return text
    .replace(/^\s*v\s+t\s+e\s+/i, "")
    .replace(/^\s*v\s*\|\s*t\s*\|\s*e\s+/i, "")
    .replace(/^\s*\[v\]\s*\[t\]\s*\[e\]\s+/i, "")
    .trim();
}
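One plausible place to apply the cleaning is across the whole matrix before any detection runs, so that titles and headers are already free of the prefix when they are exported. The sketch below assumes that ordering. `cleanMatrix` is a hypothetical helper, `stripNavPrefix` is a reduced stand-in for `cleanWikipediaNavPrefix` (space-separated form only) so the example runs on its own, and the second row holds made-up header names.

```javascript
// Hypothetical stand-in for cleanWikipediaNavPrefix, reduced to the
// space-separated form so this sketch is self-contained.
function stripNavPrefix(text) {
  if (!text) return text;
  return text.replace(/^\s*v\s+t\s+e\s+/i, "").trim();
}

// Apply a cell cleaner to every cell and return a new matrix
function cleanMatrix(matrix, cleanCell) {
  return matrix.map(row => row.map(cell => cleanCell(cell)));
}

const raw = [
  // Title row after colspan expansion (colspan="3")
  ["v t e World Heritage Sites", "v t e World Heritage Sites", "v t e World Heritage Sites"],
  // Illustrative header names, not taken from a real table
  ["Site", "Country", "Year"],
];

const cleaned = cleanMatrix(raw, stripNavPrefix);
console.log(cleaned[0][0]); // World Heritage Sites
```

Because the match is case-insensitive and anchored only at the start, a genuine title that happens to begin with "V T E " would also be stripped.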
Putting It Together: The Detection Algorithm
function detectHeaderRowIndex(matrix) {
  if (!matrix || matrix.length < 2) return 0;
  const totalColumns = matrix[0]?.length || 0;
  for (let i = 0; i < Math.min(matrix.length - 1, 5); i++) {
    const currentRow = matrix[i];
    const nextRow = matrix[i + 1];
    // Skip title rows
    if (isTitleRow(currentRow, totalColumns)) {
      continue;
    }
    // Check for grouped headers (two-level)
    if (detectGroupHeaderRow(currentRow, nextRow)) {
      // The sub-header row (i+1) is the actual header
      return i + 1;
    }
    // Check if this row looks like headers and next looks like data
    if (rowLooksLikeHeaders(currentRow) && rowLooksLikeData(nextRow)) {
      return i;
    }
  }
  // Fallback: assume row 0 is header
  return 0;
}
Real-World Testing
These heuristics were developed by testing against:
- Wikipedia country/population tables (title rows + "v t e" prefixes)
- FBREF player statistics (grouped headers)
- Financial tables with unit rows
- Government data tables with multiple header levels
No heuristic is perfect. The goal is to handle the common patterns correctly and fail gracefully on unusual tables.
When Detection Fails
For tables that don't fit common patterns, provide a manual override:
function extractTable(matrix, options = {}) {
  const headerRowIndex = options.headerRowIndex ?? detectHeaderRowIndex(matrix);
  const headerRow = matrix[headerRowIndex];
  const dataRows = matrix.slice(headerRowIndex + 1);
  return { headerRow, dataRows };
}
Users who know their data can specify the header row explicitly.
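Once the header row index is known, whether detected or supplied as an override, the split into `headerRow` and `dataRows` can feed an export step. A minimal sketch of one such step, turning rows into objects keyed by header name, is below. `toRecords` is a hypothetical helper and the `column_N` fallback for blank headers is an assumption; duplicate header names would overwrite each other in this version.

```javascript
// Sketch: `toRecords` is a hypothetical helper, not part of the code above.
function toRecords(headerRow, dataRows) {
  return dataRows.map(row =>
    Object.fromEntries(
      headerRow.map((name, i) => [
        // Fall back to a positional name when a header cell is blank
        (name || "").trim() || `column_${i + 1}`,
        row[i] ?? "",
      ])
    )
  );
}

// Pattern 1 as a normalized matrix: title row, header row, one data row
const matrix = [
  ["World Population by Country", "World Population by Country",
   "World Population by Country", "World Population by Country"],
  ["Rank", "Country", "Population", "% of World"],
  ["1", "India", "1,428,627,663", "17.85%"],
];

// Explicit override, as a user would pass via options.headerRowIndex
const headerRowIndex = 1;
const records = toRecords(matrix[headerRowIndex], matrix.slice(headerRowIndex + 1));
console.log(records[0]);
```

With the header at index 1, the title row is left out of the data and the single data row becomes one record with the keys "Rank", "Country", "Population", and "% of World".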
Summary
| Pattern | Detection Method |
|---|---|
| Title row | Single cell with large colspan |
| Standard header | Row with mostly text, followed by row with numbers |
| Grouped headers | Empty first cells + repeated values + more unique values in next row |
| Wikipedia nav | "v t e" prefix pattern |
The key insight: headers and data have different characteristics. Headers are text-heavy with descriptive labels. Data is number-heavy with actual values. The boundary between them is usually detectable.
For more on Wikipedia's specific table challenges, see our guide on exporting Wikipedia tables to Excel.
Need automatic header detection without writing code? Learn more at gauchogrid.com/html-table-exporter or try it free on the Chrome Web Store.