circobit

Posted on Jun 17

Detecting Header Rows Automatically: The Heuristics Behind Table Parsing

#algorithms #javascript #tutorial #webdev

The first row of an HTML table is the header row.

Except when it isn't.

Wikipedia tables often have a title row spanning all columns before the actual headers. Sports statistics sites have grouped headers where "Playing Time" spans multiple sub-columns like "MP", "Starts", "Min". Financial tables have unit rows ("in millions USD") that look like headers but aren't.

If you assume row 0 is always the header, your exports will be broken for a significant portion of real-world tables.

Here's how to detect the actual header row programmatically.

The Problem: Three Types of "First Rows"

Consider these common patterns:

Pattern 1: Title Row

<table>
  <tr>
    <th colspan="4">World Population by Country</th>  <!-- Title, not header -->
  </tr>
  <tr>
    <th>Rank</th>
    <th>Country</th>
    <th>Population</th>
    <th>% of World</th>
  </tr>
  <tr>
    <td>1</td>
    <td>India</td>
    <td>1,428,627,663</td>
    <td>17.85%</td>
  </tr>
</table>

Row 0 is a title. Row 1 is the header. Row 2+ is data.

Pattern 2: Grouped Headers (Two-Level)

<table>
  <tr>
    <th></th>
    <th></th>
    <th colspan="3">Playing Time</th>
    <th colspan="2">Performance</th>
  </tr>
  <tr>
    <th>Player</th>
    <th>Nation</th>
    <th>MP</th>
    <th>Starts</th>
    <th>Min</th>
    <th>Gls</th>
    <th>Ast</th>
  </tr>
  <tr>
    <td>John Smith</td>
    <td>ENG</td>
    <td>34</td>
    <td>30</td>
    <td>2700</td>
    <td>12</td>
    <td>8</td>
  </tr>
</table>

Row 0 is group headers. Row 1 is the actual column headers. Row 2+ is data.

Pattern 3: Wikipedia Navigation Prefix

<tr>
  <th colspan="3">v t e World Heritage Sites</th>
</tr>

The "v t e" (view/talk/edit) links are Wikipedia's template navigation. They need to be stripped, and the row might still be a title rather than a header.

Heuristic 1: Detecting Title Rows

A title row typically has:

A single cell (or very few cells)
Large colspan spanning most/all columns
Text content that looks like a title, not column names

function isTitleRow(row, totalColumns) {
  if (!row || row.length === 0) return false;

  // Count non-empty cells
  const nonEmptyCells = row.filter(cell => cell && cell.trim()).length;

  // Title rows usually have 1-2 non-empty cells
  if (nonEmptyCells > 2) return false;

  // Check if first cell spans most columns (indicates colspan)
  // In a normalized matrix, this shows as repeated values
  const firstValue = row[0];
  const repeatedCount = row.filter(cell => cell === firstValue).length;

  // If the first value repeats across >50% of columns, it's likely a colspan title
  if (repeatedCount > totalColumns * 0.5) {
    return true;
  }

  return false;
}

Heuristic 2: What Makes a Row "Look Like Headers"

Header rows have characteristics that distinguish them from data rows:

function rowLooksLikeHeaders(row) {
  if (!row || row.length === 0) return false;

  const dominated by pure numbers, it's not headers
  let numericCells = 0;
  let textCells = 0;
  let emptyCells = 0;

  for (const cell of row) {
    const value = (cell || "").trim();

    if (!value) {
      emptyCells++;
    } else if (/^-?\d+([.,]\d+)?%?$/.test(value)) {
      // Pure number or percentage
      numericCells++;
    } else {
      textCells++;
    }
  }

  const totalNonEmpty = numericCells + textCells;
  if (totalNonEmpty === 0) return false;

  // Headers are mostly text, not numbers
  // If >70% of non-empty cells are numeric, it's probably data
  if (numericCells / totalNonEmpty > 0.7) {
    return false;
  }

  // Headers shouldn't be mostly empty
  if (emptyCells / row.length > 0.7) {
    return false;
  }

  return true;
}

Heuristic 3: What Makes a Row "Look Like Data"

The inverse check helps confirm we've found the right boundary:

function rowLooksLikeData(row) {
  if (!row || row.length === 0) return false;

  let numericCells = 0;
  let dateCells = 0;
  let totalNonEmpty = 0;

  for (const cell of row) {
    const value = (cell || "").trim();
    if (!value) continue;

    totalNonEmpty++;

    // Check for numeric patterns
    if (/^-?\d+([.,]\d+)?%?$/.test(value)) {
      numericCells++;
    }

    // Check for date patterns
    if (/^\d{1,4}[-/\.]\d{1,2}[-/\.]\d{1,4}$/.test(value)) {
      dateCells++;
    }
  }

  if (totalNonEmpty === 0) return false;

  // Data rows typically have numeric or date content
  const dataLikeCells = numericCells + dateCells;
  return dataLikeCells / totalNonEmpty > 0.3;
}

Heuristic 4: Detecting Grouped Column Headers

FBREF-style tables have a group header row followed by a sub-header row. The group row has:

Empty cells at the beginning (columns without groups)
Repeated values from colspan expansion
Multiple unique non-empty values (not just one like a title)

function detectGroupHeaderRow(row, nextRow) {
  if (!row || !nextRow || row.length < 4) return false;

  // Group header rows MUST have empty cells at the beginning
  // This distinguishes from horizontally duplicated tables
  const firstCellEmpty = !(row[0] || "").trim();
  if (!firstCellEmpty) return false;

  // Count unique non-empty values
  const uniqueValues = new Set(
    row.filter(v => v && v.trim()).map(v => v.trim().toLowerCase())
  );

  // A title row has exactly ONE unique value
  // A group header row must have MULTIPLE unique values
  if (uniqueValues.size <= 1) return false;

  // Count consecutive repeated values (indicates colspan expansion)
  let consecutiveRepeats = 0;
  for (let i = 1; i < row.length; i++) {
    const curr = (row[i] || "").trim();
    const prev = (row[i - 1] || "").trim();
    if (curr === prev) consecutiveRepeats++;
  }

  const repeatRatio = consecutiveRepeats / (row.length - 1);

  // High repeat ratio (>30%) suggests colspan expansion
  // Next row should have more unique values (the actual sub-headers)
  const nextUniqueValues = new Set(
    nextRow.filter(v => v && v.trim()).map(v => v.trim().toLowerCase())
  );

  return repeatRatio > 0.3 && nextUniqueValues.size > uniqueValues.size;
}

Heuristic 5: Cleaning Wikipedia Navigation Prefixes

Wikipedia templates often prefix content with "v t e" (links to view/talk/edit the template):

function cleanWikipediaNavPrefix(text) {
  if (!text) return text;

  // Pattern 1: "v t e " at the beginning (space-separated)
  // Pattern 2: "v | t | e " (pipe-separated)
  // Pattern 3: "[v] [t] [e] " (bracket-separated)

  return text
    .replace(/^\s*v\s+t\s+e\s+/i, "")
    .replace(/^\s*v\s*\|\s*t\s*\|\s*e\s+/i, "")
    .replace(/^\s*\[v\]\s*\[t\]\s*\[e\]\s+/i, "")
    .trim();
}

Putting It Together: The Detection Algorithm

function detectHeaderRowIndex(matrix) {
  if (!matrix || matrix.length < 2) return 0;

  const totalColumns = matrix[0]?.length || 0;

  for (let i = 0; i < Math.min(matrix.length - 1, 5); i++) {
    const currentRow = matrix[i];
    const nextRow = matrix[i + 1];

    // Skip title rows
    if (isTitleRow(currentRow, totalColumns)) {
      continue;
    }

    // Check for grouped headers (two-level)
    if (detectGroupHeaderRow(currentRow, nextRow)) {
      // The sub-header row (i+1) is the actual header
      return i + 1;
    }

    // Check if this row looks like headers and next looks like data
    if (rowLooksLikeHeaders(currentRow) && rowLooksLikeData(nextRow)) {
      return i;
    }
  }

  // Fallback: assume row 0 is header
  return 0;
}

Real-World Testing

These heuristics were developed by testing against:

Wikipedia country/population tables (title rows + "v t e" prefixes)
FBREF player statistics (grouped headers)
Financial tables with unit rows
Government data tables with multiple header levels

No heuristic is perfect. The goal is to handle the common patterns correctly and fail gracefully on unusual tables.

When Detection Fails

For tables that don't fit common patterns, provide a manual override:

function extractTable(matrix, options = {}) {
  const headerRowIndex = options.headerRowIndex ?? detectHeaderRowIndex(matrix);

  const headerRow = matrix[headerRowIndex];
  const dataRows = matrix.slice(headerRowIndex + 1);

  return { headerRow, dataRows };
}

Users who know their data can specify the header row explicitly.

Summary

Pattern	Detection Method
Title row	Single cell with large colspan
Standard header	Row with mostly text, followed by row with numbers
Grouped headers	Empty first cells + repeated values + more unique values in next row
Wikipedia nav	"v t e" prefix pattern

The key insight: headers and data have different characteristics. Headers are text-heavy with descriptive labels. Data is number-heavy with actual values. The boundary between them is usually detectable.

For more on Wikipedia's specific table challenges, see our guide on exporting Wikipedia tables to Excel.

Need automatic header detection without writing code? Learn more at gauchogrid.com/html-table-exporter or try it free on the Chrome Web Store.

DEV Community