circobit

Detecting Header Rows Automatically: The Heuristics Behind Table Parsing

The first row of an HTML table is the header row.

Except when it isn't.

Wikipedia tables often have a title row spanning all columns before the actual headers. Sports statistics sites have grouped headers where "Playing Time" spans multiple sub-columns like "MP", "Starts", "Min". Financial tables have unit rows ("in millions USD") that look like headers but aren't.

If you assume row 0 is always the header, your exports will be broken for a significant portion of real-world tables.

Here's how to detect the actual header row programmatically.

The Problem: Three Types of "First Rows"

Consider these common patterns:

Pattern 1: Title Row

<table>
  <tr>
    <th colspan="4">World Population by Country</th>  <!-- Title, not header -->
  </tr>
  <tr>
    <th>Rank</th>
    <th>Country</th>
    <th>Population</th>
    <th>% of World</th>
  </tr>
  <tr>
    <td>1</td>
    <td>India</td>
    <td>1,428,627,663</td>
    <td>17.85%</td>
  </tr>
</table>

Row 0 is a title. Row 1 is the header. Row 2+ is data.

Pattern 2: Grouped Headers (Two-Level)

<table>
  <tr>
    <th></th>
    <th></th>
    <th colspan="3">Playing Time</th>
    <th colspan="2">Performance</th>
  </tr>
  <tr>
    <th>Player</th>
    <th>Nation</th>
    <th>MP</th>
    <th>Starts</th>
    <th>Min</th>
    <th>Gls</th>
    <th>Ast</th>
  </tr>
  <tr>
    <td>John Smith</td>
    <td>ENG</td>
    <td>34</td>
    <td>30</td>
    <td>2700</td>
    <td>12</td>
    <td>8</td>
  </tr>
</table>

Row 0 is group headers. Row 1 is the actual column headers. Row 2+ is data.
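
The heuristics that follow work on a normalized matrix rather than on the DOM: each row is an array of strings, and a cell with a colspan is repeated once for every column it covers. As a rough sketch of that normalization, assuming each parsed cell is available as a hypothetical `{ text, colspan }` object (the helper name `expandColspans` is also an assumption, and rowspan is ignored here), Pattern 2 expands like this:

```javascript
// Hypothetical helper: expand each { text, colspan } cell into
// `colspan` copies of its text so every row has one entry per column.
// Rowspan handling is left out of this sketch.
function expandColspans(rows) {
  return rows.map(row =>
    row.flatMap(({ text, colspan = 1 }) => Array(colspan).fill(text))
  );
}

const pattern2Matrix = expandColspans([
  [
    { text: "" },
    { text: "" },
    { text: "Playing Time", colspan: 3 },
    { text: "Performance", colspan: 2 },
  ],
  ["Player", "Nation", "MP", "Starts", "Min", "Gls", "Ast"].map(text => ({ text })),
]);

console.log(pattern2Matrix[0]);
// Row 0 becomes:
// ["", "", "Playing Time", "Playing Time", "Playing Time", "Performance", "Performance"]
```

Both rows end up with seven entries, which is what lets the later checks compare rows column by column.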

Pattern 3: Wikipedia Navigation Prefix

<tr>
  <th colspan="3">v t e World Heritage Sites</th>
</tr>

The "v t e" (view/talk/edit) links are Wikipedia's template navigation. They need to be stripped, and the row might still be a title rather than a header.

Heuristic 1: Detecting Title Rows

A title row typically has:

  • A single cell (or very few cells)
  • Large colspan spanning most/all columns
  • Text content that looks like a title, not column names

function isTitleRow(row, totalColumns) {
  if (!row || row.length === 0) return false;

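  // Count distinct non-empty values. In a normalized matrix a colspan
  // repeats the same text in every spanned column, so counting raw cells
  // would reject exactly the rows we are looking for.
  const uniqueValues = new Set(
    row.filter(cell => cell && cell.trim()).map(cell => cell.trim())
  );

  // Title rows usually have 1-2 distinct values
  if (uniqueValues.size === 0 || uniqueValues.size > 2) return false;

  // Check if first cell spans most columns (indicates colspan)
  // In a normalized matrix, this shows as repeated values
  const firstValue = row[0];
  // An empty first cell cannot be a title, even if it repeats
  if (!firstValue || !firstValue.trim()) return false;
  const repeatedCount = row.filter(cell => cell === firstValue).length;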

  // If the first value repeats across >50% of columns, it's likely a colspan title
  if (repeatedCount > totalColumns * 0.5) {
    return true;
  }

  return false;
}

Heuristic 2: What Makes a Row "Look Like Headers"

Header rows have characteristics that distinguish them from data rows:

function rowLooksLikeHeaders(row) {
  if (!row || row.length === 0) return false;

  // If the row is dominated by pure numbers, it's not headers
  let numericCells = 0;
  let textCells = 0;
  let emptyCells = 0;

  for (const cell of row) {
    const value = (cell || "").trim();

    if (!value) {
      emptyCells++;
    } else if (/^-?\d+([.,]\d+)?%?$/.test(value)) {
      // Pure number or percentage
      numericCells++;
    } else {
      textCells++;
    }
  }

  const totalNonEmpty = numericCells + textCells;
  if (totalNonEmpty === 0) return false;

  // Headers are mostly text, not numbers
  // If >70% of non-empty cells are numeric, it's probably data
  if (numericCells / totalNonEmpty > 0.7) {
    return false;
  }

  // Headers shouldn't be mostly empty
  if (emptyCells / row.length > 0.7) {
    return false;
  }

  return true;
}
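
One thing to watch with the numeric pattern above: it accepts a single decimal separator, so a thousands-grouped value such as 1,428,627,663 from Pattern 1 is counted as text. A possible extension, assuming US-style grouping (commas between groups of three digits) and using a hypothetical helper name `isNumericCell`, might look like this:

```javascript
// Pattern used in the article: optional sign, digits, one optional
// decimal part, optional percent sign.
const PLAIN_NUMBER = /^-?\d+(?:[.,]\d+)?%?$/;

// Assumption: US-style thousands grouping, e.g. "1,428,627,663" or "1,234.5".
const GROUPED_NUMBER = /^-?\d{1,3}(?:,\d{3})+(?:\.\d+)?%?$/;

// Hypothetical helper: true when the cell is a plain or grouped number.
function isNumericCell(value) {
  const v = (value || "").trim();
  return PLAIN_NUMBER.test(v) || GROUPED_NUMBER.test(v);
}

console.log(isNumericCell("1,428,627,663")); // true
console.log(isNumericCell("17.85%"));        // true
console.log(isNumericCell("India"));         // false
```

Whether this is worth adopting depends on the data. For European-style grouping (1.428.627.663) the separators in `GROUPED_NUMBER` would need to be swapped.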

Heuristic 3: What Makes a Row "Look Like Data"

The inverse check helps confirm we've found the right boundary:

function rowLooksLikeData(row) {
  if (!row || row.length === 0) return false;

  let numericCells = 0;
  let dateCells = 0;
  let totalNonEmpty = 0;

  for (const cell of row) {
    const value = (cell || "").trim();
    if (!value) continue;

    totalNonEmpty++;

    // Check for numeric patterns
    if (/^-?\d+([.,]\d+)?%?$/.test(value)) {
      numericCells++;
    }

    // Check for date patterns
    if (/^\d{1,4}[-/\.]\d{1,2}[-/\.]\d{1,4}$/.test(value)) {
      dateCells++;
    }
  }

  if (totalNonEmpty === 0) return false;

  // Data rows typically have numeric or date content
  const dataLikeCells = numericCells + dateCells;
  return dataLikeCells / totalNonEmpty > 0.3;
}

Heuristic 4: Detecting Grouped Column Headers

FBREF-style tables have a group header row followed by a sub-header row. The group row has:

  • Empty cells at the beginning (columns without groups)
  • Repeated values from colspan expansion
  • Multiple unique non-empty values (not just one like a title)

function detectGroupHeaderRow(row, nextRow) {
  if (!row || !nextRow || row.length < 4) return false;

  // Group header rows MUST have empty cells at the beginning
  // This distinguishes from horizontally duplicated tables
  const firstCellEmpty = !(row[0] || "").trim();
  if (!firstCellEmpty) return false;

  // Count unique non-empty values
  const uniqueValues = new Set(
    row.filter(v => v && v.trim()).map(v => v.trim().toLowerCase())
  );

  // A title row has exactly ONE unique value
  // A group header row must have MULTIPLE unique values
  if (uniqueValues.size <= 1) return false;

  // Count consecutive repeated values (indicates colspan expansion)
  let consecutiveRepeats = 0;
  for (let i = 1; i < row.length; i++) {
    const curr = (row[i] || "").trim();
    const prev = (row[i - 1] || "").trim();
    if (curr === prev) consecutiveRepeats++;
  }

  const repeatRatio = consecutiveRepeats / (row.length - 1);

  // High repeat ratio (>30%) suggests colspan expansion
  // Next row should have more unique values (the actual sub-headers)
  const nextUniqueValues = new Set(
    nextRow.filter(v => v && v.trim()).map(v => v.trim().toLowerCase())
  );

  return repeatRatio > 0.3 && nextUniqueValues.size > uniqueValues.size;
}
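
Detecting the group row is enough to locate the real headers, but the group labels can also be kept. As an illustration only (the function name `mergeGroupedHeaders` and the " / " separator are assumptions, not part of the detection logic above), the two rows could be combined into qualified column names:

```javascript
// Hypothetical helper: prefix each sub-header with its group label.
// Columns without a group (empty cell in groupRow) keep the sub-header as-is.
function mergeGroupedHeaders(groupRow, subRow, separator = " / ") {
  return subRow.map((sub, i) => {
    const group = (groupRow[i] || "").trim();
    const name = (sub || "").trim();
    if (!group) return name;
    if (!name) return group;
    return group + separator + name;
  });
}

const merged = mergeGroupedHeaders(
  ["", "", "Playing Time", "Playing Time", "Playing Time", "Performance", "Performance"],
  ["Player", "Nation", "MP", "Starts", "Min", "Gls", "Ast"]
);

console.log(merged[2]); // "Playing Time / MP"
console.log(merged[0]); // "Player"
```

This can help when two groups reuse the same sub-header name, since the exported column names stay distinct.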

Heuristic 5: Cleaning Wikipedia Navigation Prefixes

Wikipedia templates often prefix content with "v t e" (links to view/talk/edit the template):

function cleanWikipediaNavPrefix(text) {
  if (!text) return text;

  // Pattern 1: "v t e " at the beginning (space-separated)
  // Pattern 2: "v | t | e " (pipe-separated)
  // Pattern 3: "[v] [t] [e] " (bracket-separated)

  return text
    .replace(/^\s*v\s+t\s+e\s+/i, "")
    .replace(/^\s*v\s*\|\s*t\s*\|\s*e\s+/i, "")
    .replace(/^\s*\[v\]\s*\[t\]\s*\[e\]\s+/i, "")
    .trim();
}
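
Since the prefix can appear in any cell, one option is to clean the whole matrix before running detection. The sketch below folds the three patterns into a single regular expression; `NAV_PREFIX` and `cleanMatrix` are hypothetical names introduced here, and unlike the chained version above it strips at most one prefix per cell.

```javascript
// The same three prefix styles, combined into one alternation:
// "v t e ", "v | t | e " and "[v] [t] [e] ".
const NAV_PREFIX = /^\s*(?:v\s+t\s+e|v\s*\|\s*t\s*\|\s*e|\[v\]\s*\[t\]\s*\[e\])\s+/i;

// Hypothetical helper: strip the prefix from every string cell.
// Non-string cells (null, undefined) are passed through untouched.
function cleanMatrix(matrix) {
  return matrix.map(row =>
    row.map(cell =>
      typeof cell === "string" ? cell.replace(NAV_PREFIX, "").trim() : cell
    )
  );
}

const cleaned = cleanMatrix([
  ["v t e World Heritage Sites", "v t e World Heritage Sites"],
  ["Site", "Country"],
]);

console.log(cleaned[0][0]); // "World Heritage Sites"
console.log(cleaned[1][0]); // "Site"
```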

Putting It Together: The Detection Algorithm

function detectHeaderRowIndex(matrix) {
  if (!matrix || matrix.length < 2) return 0;

  const totalColumns = matrix[0]?.length || 0;

  for (let i = 0; i < Math.min(matrix.length - 1, 5); i++) {
    const currentRow = matrix[i];
    const nextRow = matrix[i + 1];

    // Skip title rows
    if (isTitleRow(currentRow, totalColumns)) {
      continue;
    }

    // Check for grouped headers (two-level)
    if (detectGroupHeaderRow(currentRow, nextRow)) {
      // The sub-header row (i+1) is the actual header
      return i + 1;
    }

    // Check if this row looks like headers and next looks like data
    if (rowLooksLikeHeaders(currentRow) && rowLooksLikeData(nextRow)) {
      return i;
    }
  }

  // Fallback: assume row 0 is header
  return 0;
}

Real-World Testing

These heuristics were developed by testing against:

  • Wikipedia country/population tables (title rows + "v t e" prefixes)
  • FBREF player statistics (grouped headers)
  • Financial tables with unit rows
  • Government data tables with multiple header levels

No heuristic is perfect. The goal is to handle the common patterns correctly and fail gracefully on unusual tables.
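
One way to keep such heuristics honest is to store a few representative matrices together with their known header index and re-run them after every tweak. The following is a minimal sketch: `runFixtures` and the fixture shape are assumptions for illustration, and the detector is passed in as a parameter so the snippet stands alone.

```javascript
// Hypothetical regression check. `detect` is any function that takes a
// matrix and returns a header row index (e.g. detectHeaderRowIndex).
function runFixtures(detect, fixtures) {
  const failures = [];
  for (const { name, matrix, expected } of fixtures) {
    const actual = detect(matrix);
    if (actual !== expected) {
      failures.push({ name, expected, actual });
    }
  }
  return failures;
}

const fixtures = [
  {
    name: "title row above headers",
    matrix: [
      ["World Population by Country", "World Population by Country"],
      ["Country", "Population"],
      ["India", "1428627663"],
    ],
    expected: 1,
  },
  {
    name: "plain table",
    matrix: [
      ["Country", "Rank"],
      ["India", "1"],
    ],
    expected: 0,
  },
];

// A naive detector that always answers 0 fails the first fixture only.
const alwaysZero = () => 0;
const failures = runFixtures(alwaysZero, fixtures);
console.log(failures.map(f => f.name)); // ["title row above headers"]
```

An empty result means every fixture still resolves to the expected header row.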

When Detection Fails

For tables that don't fit common patterns, provide a manual override:

function extractTable(matrix, options = {}) {
  const headerRowIndex = options.headerRowIndex ?? detectHeaderRowIndex(matrix);

  const headerRow = matrix[headerRowIndex];
  const dataRows = matrix.slice(headerRowIndex + 1);

  return { headerRow, dataRows };
}

Users who know their data can specify the header row explicitly.
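
Once the header index is settled, whether detected or supplied, the two pieces returned by `extractTable` can be zipped into row objects for export. This is a sketch under assumptions: `toRecords` is a hypothetical helper, and the `column_N` fallback for blank header cells is one arbitrary choice among many.

```javascript
// Hypothetical helper: turn a header row plus data rows into objects.
// Blank header cells get a positional name; duplicate header names
// would overwrite each other, which this sketch does not handle.
function toRecords(headerRow, dataRows) {
  const keys = headerRow.map(
    (name, i) => (name || "").trim() || `column_${i + 1}`
  );
  return dataRows.map(row =>
    Object.fromEntries(keys.map((key, i) => [key, row[i] ?? ""]))
  );
}

const records = toRecords(
  ["Rank", "Country", ""],
  [["1", "India", "17.85%"]]
);

console.log(records[0]);
// { Rank: "1", Country: "India", column_3: "17.85%" }
```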

Summary

| Pattern | Detection Method |
| --- | --- |
| Title row | Single cell with large colspan |
| Standard header | Row with mostly text, followed by row with numbers |
| Grouped headers | Empty first cells + repeated values + more unique values in next row |
| Wikipedia nav | "v t e" prefix pattern |

The key insight: headers and data have different characteristics. Headers are text-heavy with descriptive labels. Data is number-heavy with actual values. The boundary between them is usually detectable.

For more on Wikipedia's specific table challenges, see our guide on exporting Wikipedia tables to Excel.


Need automatic header detection without writing code? Learn more at gauchogrid.com/html-table-exporter or try it free on the Chrome Web Store.
