Web tables are strings. Everything is a string. But when you export to JSON or SQL, you want:

- `"1,234.56"` → `1234.56` (number)
- `"2024-03-15"` → date type
- `"Yes"` → `true` (boolean)
- `"N/A"` → `null`
Building a reliable type inference system means handling the chaos of real-world data. Here's how I approached it for HTML Table Exporter.
## The Problem: Ambiguity Everywhere

Consider these values:

| Value | Could be... |
|---|---|
| `"1,234"` | Number 1234 (US) or 1.234 (EU) |
| `"01/02/03"` | Jan 2, 2003 (US) or Feb 1, 2003 (EU) or 2001-02-03 |
| `"1"` | Integer 1 or boolean `true` |
| `"N/A"` | String or `null` |
| `"0"` | Integer 0 or boolean `false` |
Type inference isn't about parsing—it's about guessing intent from ambiguous evidence.
## Architecture: Column-Level Inference

Inferring types cell-by-cell is unreliable. The value `"1"` alone could be anything. But a column containing `["1", "2", "3", "4"]` is clearly integers.
My approach:
- Sample values from the column (up to 100)
- Infer type for each non-null value
- Aggregate to determine the column type
- Apply 90% threshold—if 90%+ values match a type, use it
```javascript
const DATA_TYPES = {
  STRING: "string",
  INTEGER: "integer",
  NUMBER: "number",
  BOOLEAN: "boolean",
  DATE: "date",
  NULL: "null",
};

function inferColumnType(columnValues, maxSamples = 100) {
  const nonNullValues = columnValues.filter(v => v != null && v !== "");
  if (nonNullValues.length === 0) {
    return { type: DATA_TYPES.STRING, confidence: 0 };
  }
  const sample = nonNullValues.slice(0, maxSamples);

  // Count occurrences of each type
  const typeCounts = {
    [DATA_TYPES.INTEGER]: 0,
    [DATA_TYPES.NUMBER]: 0,
    [DATA_TYPES.BOOLEAN]: 0,
    [DATA_TYPES.DATE]: 0,
    [DATA_TYPES.STRING]: 0,
  };

  for (const val of sample) {
    const type = inferValueType(val);
    if (type === DATA_TYPES.INTEGER) {
      // INTEGER is a subset of NUMBER
      typeCounts[DATA_TYPES.INTEGER]++;
      typeCounts[DATA_TYPES.NUMBER]++;
    } else if (type !== DATA_TYPES.NULL) {
      typeCounts[type]++;
    }
  }

  const total = sample.length;
  const threshold = 0.9;

  // Check types in priority order
  if (typeCounts[DATA_TYPES.BOOLEAN] >= total * threshold) {
    return { type: DATA_TYPES.BOOLEAN, confidence: typeCounts[DATA_TYPES.BOOLEAN] / total };
  }
  if (typeCounts[DATA_TYPES.DATE] >= total * threshold) {
    return { type: DATA_TYPES.DATE, confidence: typeCounts[DATA_TYPES.DATE] / total };
  }
  if (typeCounts[DATA_TYPES.INTEGER] >= total * threshold) {
    return { type: DATA_TYPES.INTEGER, confidence: typeCounts[DATA_TYPES.INTEGER] / total };
  }
  if (typeCounts[DATA_TYPES.NUMBER] >= total * threshold) {
    return { type: DATA_TYPES.NUMBER, confidence: typeCounts[DATA_TYPES.NUMBER] / total };
  }
  return { type: DATA_TYPES.STRING, confidence: 1 };
}
```
Why 90%? Real data has noise. A column of 100 integers might have one "N/A". Requiring 100% match is too strict.
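The threshold arithmetic is easy to see in isolation. Here's a toy standalone sketch (not the extension's actual code) of how one stray `"N/A"` behaves under a 90% cutoff:

```javascript
// Toy illustration of the 90% threshold: one "N/A" among ten
// values still resolves the column to integer.
const values = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "N/A"];
const isInteger = v => /^-?\d+$/.test(v);

const matches = values.filter(isInteger).length; // 9 of 10
const confidence = matches / values.length;      // 0.9
const columnType = confidence >= 0.9 ? "integer" : "string";
```

With a 100% threshold, that single `"N/A"` would flip the whole column back to strings.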
## Value-Level Type Detection

The `inferValueType` function handles individual values:
```javascript
function inferValueType(value) {
  if (value == null || value === "") {
    return DATA_TYPES.NULL;
  }
  const str = String(value).trim();
  if (str === "") return DATA_TYPES.NULL;

  // Boolean check
  const lowerStr = str.toLowerCase();
  if (["true", "false", "yes", "no", "sí", "si"].includes(lowerStr)) {
    return DATA_TYPES.BOOLEAN;
  }

  // Date check (ISO format preferred)
  if (/^\d{4}-\d{2}-\d{2}/.test(str)) {
    return DATA_TYPES.DATE;
  }
  if (/^\d{1,2}[/\-.]\d{1,2}[/\-.]\d{2,4}$/.test(str)) {
    return DATA_TYPES.DATE;
  }

  // Integer check (strict)
  if (/^-?\d+$/.test(str)) {
    return DATA_TYPES.INTEGER;
  }

  // Number check (with decimals and separators). Accept either "." or ","
  // as a grouping or decimal separator here; normalizeNumber does the
  // actual EU/US disambiguation later. (Naively replacing every comma
  // with a dot would turn "1,234.56" into "1.234.56" and reject it.)
  const cleanedForNumber = str.replace(/[$€£¥%\s]/g, "");
  if (/^-?(\d+|\d{1,3}([.,]\d{3})+)([.,]\d+)?$/.test(cleanedForNumber)) {
    return DATA_TYPES.NUMBER;
  }

  return DATA_TYPES.STRING;
}
```
## The Number Normalization Challenge

European vs US number formats are the hardest ambiguity:

| Value | US interpretation | EU interpretation |
|---|---|---|
| `"1,234"` | 1234 | 1.234 |
| `"1.234"` | 1.234 | 1234 |
| `"1,234.56"` | 1234.56 | Invalid |
| `"1.234,56"` | Invalid | 1234.56 |
My heuristic for detecting format:
```javascript
function normalizeNumber(value) {
  if (value == null) return value;
  let str = String(value).trim();

  // Remove currency symbols and whitespace
  str = str.replace(/[$€£¥\s]/g, "");

  // Handle percentage
  const isPercent = str.endsWith("%");
  if (isPercent) str = str.slice(0, -1);

  // Detect format by analyzing separators
  const commaCount = (str.match(/,/g) || []).length;
  const dotCount = (str.match(/\./g) || []).length;
  const lastComma = str.lastIndexOf(",");
  const lastDot = str.lastIndexOf(".");

  let normalized;
  if (commaCount === 0 && dotCount === 0) {
    // Plain integer: "1234"
    normalized = str;
  } else if (commaCount === 0 && dotCount === 1) {
    // Either "1.234" (EU thousands) or "1.23" (decimal)
    // Heuristic: exactly 3 digits after the dot AND a plausible
    // 1-3 digit integer part = thousands separator. The integer-part
    // check keeps "0.123" and "1234.567" as decimals.
    const beforeDot = str.slice(str.startsWith("-") ? 1 : 0, lastDot);
    const afterDot = str.slice(lastDot + 1);
    if (afterDot.length === 3 && /^\d+$/.test(afterDot) &&
        /^[1-9]\d{0,2}$/.test(beforeDot)) {
      normalized = str.replace(".", ""); // likely EU thousands
    } else {
      normalized = str;                  // likely decimal
    }
  } else if (commaCount === 1 && dotCount === 0) {
    // Either "1,234" (US thousands) or "1,23" (EU decimal)
    const beforeComma = str.slice(str.startsWith("-") ? 1 : 0, lastComma);
    const afterComma = str.slice(lastComma + 1);
    if (afterComma.length === 3 && /^\d+$/.test(afterComma) &&
        /^[1-9]\d{0,2}$/.test(beforeComma)) {
      normalized = str.replace(",", "");  // likely US thousands
    } else {
      normalized = str.replace(",", "."); // likely EU decimal
    }
  } else if (commaCount > 1 && dotCount === 0) {
    // "1,234,567" - repeated commas can only be US thousands
    normalized = str.replace(/,/g, "");
  } else if (dotCount > 1 && commaCount === 0) {
    // "1.234.567" - repeated dots can only be EU thousands
    normalized = str.replace(/\./g, "");
  } else if (lastDot > lastComma) {
    // "1,234.56" - US format
    normalized = str.replace(/,/g, "");
  } else if (lastComma > lastDot) {
    // "1.234,56" - EU format
    normalized = str.replace(/\./g, "").replace(",", ".");
  } else {
    // Ambiguous, return as-is
    return value;
  }

  const num = parseFloat(normalized);
  if (Number.isNaN(num)) return value;
  return isPercent ? num / 100 : num;
}
```
**Key insight:** the position and count of the separators resolve most of the ambiguity. The last separator is usually the decimal separator.
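The last-separator rule can be distilled into a tiny helper. This is a standalone sketch for illustration, not code from the extension, and it assumes the input contains both separators:

```javascript
// Sketch: when both "," and "." appear, the rightmost one is taken
// as the decimal separator.
function decimalSeparator(str) {
  return str.lastIndexOf(".") > str.lastIndexOf(",") ? "." : ",";
}

decimalSeparator("1,234.56"); // "." -> US format
decimalSeparator("1.234,56"); // "," -> EU format
```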
## Boolean Detection (With Gotchas)

Obvious booleans: `"true"`, `"false"`, `"yes"`, `"no"`. But what about `"0"` and `"1"`?

```javascript
// Problematic: this converts tax rates to booleans
// Column: [0, 0.05, 0.1, 0.2] -> [false, 0.05, 0.1, 0.2]
```

My solution: don't convert purely numeric values to booleans.
```javascript
function applyBooleanNormalization(rows, booleanConfig) {
  const trueValues = new Set(booleanConfig.true.map(v => v.toLowerCase()));
  const falseValues = new Set(booleanConfig.false.map(v => v.toLowerCase()));

  return rows.map((row, rowIndex) => {
    if (rowIndex === 0) return row; // Skip header
    return row.map(cell => {
      if (cell == null) return cell;
      const cellStr = String(cell).toLowerCase().trim();
      // Skip if purely numeric (prevents 0 -> false in number columns)
      if (/^-?\d+(\.\d+)?$/.test(cellStr)) {
        return cell;
      }
      if (trueValues.has(cellStr)) return "true";
      if (falseValues.has(cellStr)) return "false";
      return cell;
    });
  });
}
```
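The numeric guard is the load-bearing line, and it can be checked in isolation. The `booleanConfig` shape below is my reading of the code above; the specific value lists are illustrative, not a documented format:

```javascript
// The purely-numeric guard from applyBooleanNormalization, isolated.
const numericGuard = /^-?\d+(\.\d+)?$/;

numericGuard.test("0");    // true  -> left untouched, stays numeric
numericGuard.test("0.05"); // true  -> left untouched
numericGuard.test("yes");  // false -> eligible for boolean conversion

// A plausible booleanConfig (hypothetical values):
const booleanConfig = {
  true:  ["true", "yes", "y", "sí", "on"],
  false: ["false", "no", "n", "off"],
};
```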
## Null Value Handling

Web data uses many representations for "no value":

- `""` (empty string)
- `"N/A"`, `"n/a"`, `"NA"`
- `"-"`, `"--"`
- `"null"`, `"NULL"`
- `"none"`, `"None"`
- `"."` (seen in government data)
Configurable null detection:
```javascript
function applyNullValues(rows, nullPatterns) {
  const nullSet = new Set(nullPatterns.map(v => v.toLowerCase().trim()));

  return rows.map((row, rowIndex) => {
    if (rowIndex === 0) return row; // Skip header
    return row.map(cell => {
      if (cell == null) return null;
      const cellStr = String(cell).toLowerCase().trim();
      return nullSet.has(cellStr) ? null : cell;
    });
  });
}

// Usage
applyNullValues(rows, ["N/A", "n/a", "-", "--", "null", "none", "."]);
```
## The Pipeline
Full type inference and cleaning pipeline:
```javascript
function cleanTable(tableInfo, config) {
  // cloneRows and the remaining applyXxx helpers are defined elsewhere
  // in the extension; the steps and their ordering are the point here.
  let rows = cloneRows(tableInfo.rows);

  // 1. Trim all strings
  if (config.trimStrings) {
    rows = applyTrimStrings(rows);
  }
  // 2. Convert null patterns to actual null
  if (config.nullValues?.length) {
    rows = applyNullValues(rows, config.nullValues);
  }
  // 3. Normalize booleans (before numbers, to avoid 0 -> false issues)
  if (config.booleans) {
    rows = applyBooleanNormalization(rows, config.booleans);
  }
  // 4. Normalize numbers (handles EU/US formats)
  if (config.normalizeNumbers) {
    rows = applyNumberNormalization(rows);
  }
  // 5. Normalize dates (optional, format-specific)
  if (config.normalizeDates) {
    rows = applyDateNormalization(rows, config.dateFormat);
  }
  return { ...tableInfo, rows };
}
```
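A config object satisfying the checks in the pipeline might look like this. The field names come from the code above; the values are illustrative, not defaults shipped with the extension:

```javascript
// Hypothetical cleaning config, matching the fields cleanTable reads.
const config = {
  trimStrings: true,
  nullValues: ["N/A", "n/a", "-", "--", "null", "none", "."],
  booleans: { true: ["yes", "true"], false: ["no", "false"] },
  normalizeNumbers: true,
  normalizeDates: false, // dateFormat is only consulted when true
};
```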
## Testing Edge Cases
Type inference has many edge cases. I maintain a test suite:
```javascript
// Number normalization tests
assertEqual(normalizeNumber("1,234.56"), 1234.56); // US format
assertEqual(normalizeNumber("1.234,56"), 1234.56); // EU format
assertEqual(normalizeNumber("€1.234,56"), 1234.56); // With currency
assertEqual(normalizeNumber("45,5%"), 0.455); // Percentage
assertEqual(normalizeNumber("1.234"), 1234); // EU thousands
assertEqual(normalizeNumber("1.23"), 1.23); // Decimal

// Boolean tests
assertEqual(inferValueType("Yes"), DATA_TYPES.BOOLEAN);
assertEqual(inferValueType("0"), DATA_TYPES.INTEGER); // NOT boolean
assertEqual(inferValueType("0.5"), DATA_TYPES.NUMBER); // NOT boolean
```
## Integration with Export
When exporting to SQL, use inferred types:
```javascript
function inferSqlColumnTypes(rows) {
  const header = rows[0];
  const dataRows = rows.slice(1);

  return header.map((_, colIndex) => {
    const columnValues = dataRows.map(row => row[colIndex]);
    const { type } = inferColumnType(columnValues);
    switch (type) {
      case DATA_TYPES.INTEGER: return "INTEGER";
      case DATA_TYPES.NUMBER: return "REAL";
      case DATA_TYPES.BOOLEAN: return "BOOLEAN";
      case DATA_TYPES.DATE: return "DATE";
      default: return "TEXT";
    }
  });
}
```
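From there, generating DDL is mechanical. A sketch of how the inferred types could feed a `CREATE TABLE` statement (`buildCreateTable` is my own helper for illustration, not part of the extension):

```javascript
// Sketch: assemble a CREATE TABLE statement from header names and
// the SQL types produced by inferSqlColumnTypes. Identifiers are
// double-quoted, with embedded quotes doubled per the SQL standard.
function buildCreateTable(tableName, header, sqlTypes) {
  const cols = header
    .map((name, i) => `"${String(name).replace(/"/g, '""')}" ${sqlTypes[i]}`)
    .join(", ");
  return `CREATE TABLE "${tableName}" (${cols});`;
}

buildCreateTable("products", ["id", "price"], ["INTEGER", "REAL"]);
// -> CREATE TABLE "products" ("id" INTEGER, "price" REAL);
```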
## Lessons Learned

- **Column context beats cell analysis.** A single `"1"` is ambiguous; a column of integers is not.
- **Be conservative with conversion.** It's better to leave something as a string than corrupt it with a wrong conversion.
- **Make it configurable.** Different domains have different null values, boolean representations, and number formats.
- **Test with real data.** Synthetic tests miss the chaos of actual web tables.
This system powers the data cleaning in HTML Table Exporter. For a practical example of these challenges, see how copying tables from websites to Excel often results in broken formatting that proper type inference can fix.
The PRO version lets users configure these rules per export profile. Learn more at gauchogrid.com/html-table-exporter or try it on the Chrome Web Store.
Building type inference for another domain? I'd love to hear about your edge cases.