circobit

Building a Type Inference System for Messy Web Data

Web tables are strings. Everything is a string. But when you export to JSON or SQL, you want:

  • "1,234.56" → 1234.56 (number)
  • "2024-03-15" → date type
  • "Yes" → true (boolean)
  • "N/A" → null

Building a reliable type inference system means handling the chaos of real-world data. Here's how I approached it for HTML Table Exporter.

The Problem: Ambiguity Everywhere

Consider these values:

| Value | Could be... |
| --- | --- |
| "1,234" | Number 1234 (US) or 1.234 (EU) |
| "01/02/03" | Jan 2, 2003 (US) or Feb 1, 2003 (EU) or 2001-02-03 |
| "1" | Integer 1 or boolean true |
| "N/A" | String or null |
| "0" | Integer 0 or boolean false |

Type inference isn't about parsing—it's about guessing intent from ambiguous evidence.
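To make the ambiguity concrete: the same string yields two different numbers depending on which convention you assume. (A throwaway sketch, not the exporter's code — `parseUS` and `parseEU` are just illustrative names.)

```javascript
// Throwaway sketch: the same string, parsed under each locale convention.
const parseUS = (s) => parseFloat(s.replace(/,/g, ""));                    // comma = thousands
const parseEU = (s) => parseFloat(s.replace(/\./g, "").replace(",", ".")); // dot = thousands, comma = decimal

parseUS("1,234"); // 1234
parseEU("1,234"); // 1.234
```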

Architecture: Column-Level Inference

Inferring types cell-by-cell is unreliable. The value "1" alone could be anything. But a column containing ["1", "2", "3", "4"] is clearly integers.

My approach:

  1. Sample values from the column (up to 100)
  2. Infer type for each non-null value
  3. Aggregate to determine the column type
  4. Apply a 90% threshold: if 90% or more of the values match a type, use it

const DATA_TYPES = {
  STRING: "string",
  INTEGER: "integer", 
  NUMBER: "number",
  BOOLEAN: "boolean",
  DATE: "date",
  NULL: "null",
};

function inferColumnType(columnValues, maxSamples = 100) {
  const nonNullValues = columnValues.filter(v => v != null && v !== "");

  if (nonNullValues.length === 0) {
    return { type: DATA_TYPES.STRING, confidence: 0 };
  }

  const sample = nonNullValues.slice(0, maxSamples);

  // Count occurrences of each type
  const typeCounts = {
    [DATA_TYPES.INTEGER]: 0,
    [DATA_TYPES.NUMBER]: 0,
    [DATA_TYPES.BOOLEAN]: 0,
    [DATA_TYPES.DATE]: 0,
    [DATA_TYPES.STRING]: 0,
  };

  for (const val of sample) {
    const type = inferValueType(val);
    if (type === DATA_TYPES.INTEGER) {
      // INTEGER is a subset of NUMBER
      typeCounts[DATA_TYPES.INTEGER]++;
      typeCounts[DATA_TYPES.NUMBER]++;
    } else if (type !== DATA_TYPES.NULL) {
      typeCounts[type]++;
    }
  }

  const total = sample.length;
  const threshold = 0.9;

  // Check types in priority order
  if (typeCounts[DATA_TYPES.BOOLEAN] >= total * threshold) {
    return { type: DATA_TYPES.BOOLEAN, confidence: typeCounts[DATA_TYPES.BOOLEAN] / total };
  }
  if (typeCounts[DATA_TYPES.DATE] >= total * threshold) {
    return { type: DATA_TYPES.DATE, confidence: typeCounts[DATA_TYPES.DATE] / total };
  }
  if (typeCounts[DATA_TYPES.INTEGER] >= total * threshold) {
    return { type: DATA_TYPES.INTEGER, confidence: typeCounts[DATA_TYPES.INTEGER] / total };
  }
  if (typeCounts[DATA_TYPES.NUMBER] >= total * threshold) {
    return { type: DATA_TYPES.NUMBER, confidence: typeCounts[DATA_TYPES.NUMBER] / total };
  }

  return { type: DATA_TYPES.STRING, confidence: 1 };
}

Why 90%? Real data has noise. A column of 100 integers might have one "N/A". Requiring 100% match is too strict.
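Here's a small runnable illustration of that tolerance. `looksLikeIntegerColumn` is a one-type reduction of `inferColumnType` above — the name and the simplified classifier are mine, for demonstration only:

```javascript
// Simplified stand-in for inferValueType: integers vs. everything else.
const isInteger = (s) => /^-?\d+$/.test(String(s).trim());

// One-type reduction of inferColumnType's 90% rule (hypothetical name).
function looksLikeIntegerColumn(values, threshold = 0.9) {
  const nonNull = values.filter((v) => v != null && v !== "");
  if (nonNull.length === 0) return false;
  const matches = nonNull.filter(isInteger).length;
  return matches >= nonNull.length * threshold;
}

// 19 integers plus a single "N/A" stay an integer column (95% >= 90%)...
looksLikeIntegerColumn(["1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
                        "11", "12", "13", "14", "15", "16", "17", "18", "19", "N/A"]); // true

// ...but a 75% match does not pass.
looksLikeIntegerColumn(["1", "2", "3", "N/A"]); // false
```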

Value-Level Type Detection

The inferValueType function handles individual values:

function inferValueType(value) {
  if (value == null || value === "") {
    return DATA_TYPES.NULL;
  }

  const str = String(value).trim();
  if (str === "") return DATA_TYPES.NULL;

  // Boolean check
  const lowerStr = str.toLowerCase();
  if (["true", "false", "yes", "no", "si"].includes(lowerStr)) {
    return DATA_TYPES.BOOLEAN;
  }

  // Date check (ISO format preferred)
  if (/^\d{4}-\d{2}-\d{2}/.test(str)) {
    return DATA_TYPES.DATE;
  }
  if (/^\d{1,2}[/\-\.]\d{1,2}[/\-\.]\d{2,4}$/.test(str)) {
    return DATA_TYPES.DATE;
  }

  // Integer check (strict)
  if (/^-?\d+$/.test(str)) {
    return DATA_TYPES.INTEGER;
  }

  // Number check (decimals and separators).
  // Note: blindly replacing "," with "." would turn "1,234.56" into the
  // invalid "1.234.56", so test the US and EU patterns explicitly instead.
  const cleanedForNumber = str.replace(/[$€£¥%\s]/g, "");

  if (
    /^-?\d+(\.\d+)?$/.test(cleanedForNumber) ||              // plain: "1234.56"
    /^-?\d+,\d+$/.test(cleanedForNumber) ||                  // EU decimal: "45,5"
    /^-?\d{1,3}(,\d{3})+(\.\d+)?$/.test(cleanedForNumber) || // US grouping: "1,234.56"
    /^-?\d{1,3}(\.\d{3})+(,\d+)?$/.test(cleanedForNumber)    // EU grouping: "1.234,56"
  ) {
    return DATA_TYPES.NUMBER;
  }

  return DATA_TYPES.STRING;
}

The Number Normalization Challenge

European vs US number formats are the hardest ambiguity:

| Value | US interpretation | EU interpretation |
| --- | --- | --- |
| "1,234" | 1234 | 1.234 |
| "1.234" | 1.234 | 1234 |
| "1,234.56" | 1234.56 | Invalid |
| "1.234,56" | Invalid | 1234.56 |

My heuristic for detecting format:

function normalizeNumber(value) {
  if (value == null) return value;

  let str = String(value).trim();

  // Remove currency symbols and whitespace
  str = str.replace(/[$€£¥\s]/g, "");

  // Handle percentage
  const isPercent = str.endsWith("%");
  if (isPercent) str = str.slice(0, -1);

  // Detect format by analyzing separators
  const commaCount = (str.match(/,/g) || []).length;
  const dotCount = (str.match(/\./g) || []).length;
  const lastComma = str.lastIndexOf(",");
  const lastDot = str.lastIndexOf(".");

  let normalized;

  if (commaCount === 0 && dotCount === 0) {
    // Plain integer: "1234"
    normalized = str;
  } else if (commaCount === 0 && dotCount === 1) {
    // Either "1.234" (EU thousand) or "1.23" (decimal)
    // Heuristic: 3 digits after dot = thousand separator
    const afterDot = str.slice(lastDot + 1);
    if (afterDot.length === 3 && /^\d+$/.test(afterDot)) {
      // Likely EU thousand separator
      normalized = str.replace(".", "");
    } else {
      // Likely decimal
      normalized = str;
    }
  } else if (commaCount === 1 && dotCount === 0) {
    // Either "1,234" (US thousand) or "1,23" (EU decimal)
    const afterComma = str.slice(lastComma + 1);
    if (afterComma.length === 3 && /^\d+$/.test(afterComma)) {
      // Likely US thousand separator
      normalized = str.replace(",", "");
    } else {
      // Likely EU decimal
      normalized = str.replace(",", ".");
    }
  } else if (commaCount > 1 && dotCount === 0) {
    // "1,234,567" - repeated commas must be US thousand separators
    normalized = str.replace(/,/g, "");
  } else if (dotCount > 1 && commaCount === 0) {
    // "1.234.567" - repeated dots must be EU thousand separators
    normalized = str.replace(/\./g, "");
  } else if (lastDot > lastComma) {
    // "1,234.56" - US format
    normalized = str.replace(/,/g, "");
  } else if (lastComma > lastDot) {
    // "1.234,56" - EU format
    normalized = str.replace(/\./g, "").replace(",", ".");
  } else {
    // Ambiguous, return as-is
    return value;
  }

  const num = parseFloat(normalized);
  if (Number.isNaN(num)) return value;

  return isPercent ? num / 100 : num;
}

Key insight: the position and count of separators resolve most of the ambiguity. The last separator is usually the decimal separator.

Boolean Detection (With Gotchas)

Obvious booleans: "true", "false", "yes", "no"

But what about "0" and "1"?

// Problematic: This converts tax rates to booleans
// Column: [0, 0.05, 0.1, 0.2]  -> [false, 0.05, 0.1, 0.2] 

My solution: Don't convert purely numeric values to booleans.

function applyBooleanNormalization(rows, booleanConfig) {
  const trueValues = new Set(booleanConfig.true.map(v => v.toLowerCase()));
  const falseValues = new Set(booleanConfig.false.map(v => v.toLowerCase()));

  return rows.map((row, rowIndex) => {
    if (rowIndex === 0) return row; // Skip header

    return row.map(cell => {
      if (cell == null) return cell;

      const cellStr = String(cell).toLowerCase().trim();

      // Skip if purely numeric (prevents 0 -> false in number columns)
      if (/^-?\d+(\.\d+)?$/.test(cellStr)) {
        return cell;
      }

      if (trueValues.has(cellStr)) return "true";
      if (falseValues.has(cellStr)) return "false";

      return cell;
    });
  });
}

Null Value Handling

Web data uses many representations for "no value":

  • "" (empty string)
  • "N/A", "n/a", "NA"
  • "-", "--"
  • "null", "NULL"
  • "none", "None"
  • "." (seen in government data)

Configurable null detection:

function applyNullValues(rows, nullPatterns) {
  const nullSet = new Set(nullPatterns.map(v => v.toLowerCase().trim()));

  return rows.map((row, rowIndex) => {
    if (rowIndex === 0) return row; // Skip header

    return row.map(cell => {
      if (cell == null) return null;

      const cellStr = String(cell).toLowerCase().trim();
      return nullSet.has(cellStr) ? null : cell;
    });
  });
}

// Usage
applyNullValues(rows, ["N/A", "n/a", "-", "--", "null", "none", "."]);

The Pipeline

Full type inference and cleaning pipeline:

function cleanTable(tableInfo, config) {
  let rows = cloneRows(tableInfo.rows);

  // 1. Trim all strings
  if (config.trimStrings) {
    rows = applyTrimStrings(rows);
  }

  // 2. Convert null patterns to actual null
  if (config.nullValues?.length) {
    rows = applyNullValues(rows, config.nullValues);
  }

  // 3. Normalize booleans (before numbers, to avoid 0->false issues)
  if (config.booleans) {
    rows = applyBooleanNormalization(rows, config.booleans);
  }

  // 4. Normalize numbers (handles EU/US formats)
  if (config.normalizeNumbers) {
    rows = applyNumberNormalization(rows);
  }

  // 5. Normalize dates (optional, format-specific)
  if (config.normalizeDates) {
    rows = applyDateNormalization(rows, config.dateFormat);
  }

  return { ...tableInfo, rows };
}
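For context, here's how calling the pipeline might look. To keep the sketch self-contained, `trimStep` and `nullStep` are simplified stand-ins for the `applyTrimStrings`/`applyNullValues` shown earlier, and `cleanTableSketch` is a hypothetical name:

```javascript
// Self-contained sketch of the cleaning pipeline. trimStep and nullStep are
// simplified stand-ins for the real applyTrimStrings / applyNullValues.
const trimStep = (rows) =>
  rows.map((row) => row.map((c) => (typeof c === "string" ? c.trim() : c)));

const nullStep = (rows, patterns) => {
  const nulls = new Set(patterns.map((p) => p.toLowerCase()));
  return rows.map((row, i) =>
    i === 0 // skip header
      ? row
      : row.map((c) => (c != null && nulls.has(String(c).toLowerCase().trim()) ? null : c))
  );
};

function cleanTableSketch(rows, config) {
  let out = rows.map((r) => [...r]);
  if (config.trimStrings) out = trimStep(out);
  if (config.nullValues?.length) out = nullStep(out, config.nullValues);
  return out;
}

const cleaned = cleanTableSketch(
  [["name", "score"], ["  Alice ", "N/A"]],
  { trimStrings: true, nullValues: ["N/A", "-"] }
);
// cleaned[1] -> ["Alice", null]
```

The order of the steps matters for the same reason as in `cleanTable`: null patterns like "N/A" must be mapped to `null` before any numeric or boolean conversion sees them.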

Testing Edge Cases

Type inference has many edge cases. I maintain a test suite:

// Number normalization tests
assertEqual(normalizeNumber("1,234.56"), 1234.56);    // US format
assertEqual(normalizeNumber("1.234,56"), 1234.56);    // EU format
assertEqual(normalizeNumber("€1.234,56"), 1234.56);   // With currency
assertEqual(normalizeNumber("45,5%"), 0.455);         // Percentage
assertEqual(normalizeNumber("1.234"), 1234);          // EU thousand
assertEqual(normalizeNumber("1.23"), 1.23);           // Decimal

// Boolean tests
assertEqual(inferValueType("Yes"), DATA_TYPES.BOOLEAN);
assertEqual(inferValueType("0"), DATA_TYPES.INTEGER);  // NOT boolean
assertEqual(inferValueType("0.5"), DATA_TYPES.NUMBER); // NOT boolean

Integration with Export

When exporting to SQL, use inferred types:

function inferSqlColumnTypes(rows) {
  const header = rows[0];
  const dataRows = rows.slice(1);

  return header.map((_, colIndex) => {
    const columnValues = dataRows.map(row => row[colIndex]);
    const { type } = inferColumnType(columnValues);

    switch (type) {
      case DATA_TYPES.INTEGER: return "INTEGER";
      case DATA_TYPES.NUMBER: return "REAL";
      case DATA_TYPES.BOOLEAN: return "BOOLEAN";
      case DATA_TYPES.DATE: return "DATE";
      default: return "TEXT";
    }
  });
}
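From there, a CREATE TABLE statement is one map-and-join away. `buildCreateTable` is a hypothetical helper (not part of the extension), sketched with SQLite-style double-quoted identifiers:

```javascript
// Hypothetical next step: turn inferred SQL types into a CREATE TABLE
// statement, using the header row for column names.
function buildCreateTable(tableName, rows, sqlTypes) {
  const cols = rows[0].map((name, i) => `"${name}" ${sqlTypes[i]}`);
  return `CREATE TABLE "${tableName}" (${cols.join(", ")});`;
}

const rows = [["id", "price"], ["1", "1,234.56"], ["2", "99.00"]];
buildCreateTable("products", rows, ["INTEGER", "REAL"]);
// CREATE TABLE "products" ("id" INTEGER, "price" REAL);
```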

Lessons Learned

  1. Column context beats cell analysis. A single "1" is ambiguous; a column of integers is not.

  2. Be conservative with conversion. It's better to leave something as a string than corrupt it with a wrong conversion.

  3. Make it configurable. Different domains have different null values, boolean representations, and number formats.

  4. Test with real data. Synthetic tests miss the chaos of actual web tables.

This system powers the data cleaning in HTML Table Exporter. For a practical example of these challenges, see how copying tables from websites to Excel often results in broken formatting that proper type inference can fix.

The PRO version lets users configure these rules per export profile. Learn more at gauchogrid.com/html-table-exporter or try it on the Chrome Web Store.


Building type inference for another domain? I'd love to hear about your edge cases.
