You export a table from a German website. The revenue column shows "1.234,56 €".
You paste it into Excel. It becomes "1.23456E" or, worse, a date.
Welcome to the international number format problem.
The Core Issue: What Does "1.234" Mean?
In the US and UK: 1.234 = one point two three four (decimal)
In Germany, Spain, and most of Latin America: 1.234 = one thousand two hundred thirty-four
Same characters. Completely different values.
When extracting data from international websites, you can't assume which format you're dealing with. And getting it wrong means your analysis is off by orders of magnitude.
The Three-Digit Ambiguity
The hardest case is exactly three digits after a separator.
| Input | US Interpretation | EU Interpretation |
|---|---|---|
| 1.234 | 1.234 (decimal) | 1,234 (thousands) |
| 1,234 | 1,234 (thousands) | 1.234 (decimal) |
| 1.234.567 | Invalid | 1,234,567 |
| 1,234,567 | 1,234,567 | Invalid |
With two digits after, it's clearly decimal: "1.23" or "1,23".
With four+ digits after, it's clearly decimal: "3.14159".
With three digits? Could be either.
The Detection Algorithm
Here's a heuristic approach that handles most real-world cases:
function normalizeNumberString(value) {
if (typeof value !== "string") return value;
const v = value.trim();
if (!v || !/[0-9]/.test(v)) return value;
// Step 1: Clean currency symbols, percentages, whitespace
let cleaned = v
.replace(/ /g, " ")
.replace(/^(USD|EUR|GBP|R\$|MXN)\s*/i, "")
.replace(/[$€£¥₹₽₩]/g, "")
.replace(/%/g, "")
.replace(/\s+/g, "");
// Step 2: Handle negative signs
const isNegative = cleaned.match(/^-/);
if (isNegative) {
cleaned = cleaned.substring(1).trim();
}
// Step 3: Pure digits? We're done
if (/^\d+$/.test(cleaned)) {
return isNegative ? `-${cleaned}` : cleaned;
}
// Step 4: Detect separators
const hasComma = cleaned.includes(",");
const hasDot = cleaned.includes(".");
// Step 5: Apply heuristic
let decimalSeparator = ".";
let thousandsSeparator = ",";
if (hasComma && hasDot) {
// Both present: last one is decimal
const lastComma = cleaned.lastIndexOf(",");
const lastDot = cleaned.lastIndexOf(".");
if (lastComma > lastDot) {
decimalSeparator = ",";
thousandsSeparator = ".";
}
} else if (hasComma && !hasDot) {
const parts = cleaned.split(",");
if (parts.length === 2 && parts[1].length <= 2) {
// "1,23" → decimal
decimalSeparator = ",";
} else {
// "1,234" or "1,234,567" → thousands
decimalSeparator = null;
thousandsSeparator = ",";
}
} else if (hasDot && !hasComma) {
const parts = cleaned.split(".");
if (parts.length === 2) {
const afterDot = parts[1];
if (afterDot.length <= 2 || afterDot.length >= 4) {
// "1.23" or "3.14159" → decimal
decimalSeparator = ".";
} else {
// "1.234" → THOUSANDS (pragmatic choice for tables)
decimalSeparator = null;
thousandsSeparator = ".";
}
} else {
// Multiple dots → thousands
decimalSeparator = null;
thousandsSeparator = ".";
}
}
// Step 6: Clean thousands separators
if (thousandsSeparator) {
cleaned = cleaned.replace(new RegExp(`\\${thousandsSeparator}`, "g"), "");
}
// Step 7: Normalize decimal to dot
if (decimalSeparator && decimalSeparator !== ".") {
cleaned = cleaned.replace(decimalSeparator, ".");
}
// Step 8: Validate
const num = Number(cleaned);
if (!Number.isFinite(num)) return value;
return isNegative ? `-${cleaned}` : cleaned;
}
Why Three Digits = Thousands
The key decision: when we see exactly three digits after a single separator (like "1.234"), we interpret it as thousands, not decimal.
Why? In HTML tables:
- Financial data with thousands is extremely common: $1.234, €1.234
- Decimals with exactly three digits are rare in practice
- Scientific data with three decimals (like "3.141") is usually written as "3.14159" or just "3.14"
This heuristic is a pragmatic trade-off that works for the majority of real-world tables.
The Full Format Matrix
| Input | Interpretation | Normalized Output |
|---|---|---|
| 1.234,56 | EU decimal | 1234.56 |
| 1,234.56 | US decimal | 1234.56 |
| 1.234.567,89 | EU thousands + decimal | 1234567.89 |
| 1,234,567.89 | US thousands + decimal | 1234567.89 |
| 1.234 | Thousands | 1234 |
| 1,234 | Thousands | 1234 |
| 3.14 | Decimal | 3.14 |
| 3,14 | Decimal | 3.14 |
| 3.14159 | Decimal (4+ digits) | 3.14159 |
| $ 1.200,50 | Currency + EU | 1200.50 |
| -5.000 | Negative thousands | -5000 |
| 12.5% | Percentage | 12.5 |
Handling Currency Codes
International data comes with currency prefixes and suffixes. Clean them first:
// Remove currency codes BEFORE symbols
.replace(/^(USD|EUR|GBP|JPY|CHF|CAD|AUD|CNY|INR|BRL|R\$|MXN|KRW)\s*/i, "")
// Then remove symbols
.replace(/[$€£¥₹₽₩₪฿₫₴₦]/g, "")
Order matters: "R$" (Brazilian Real) contains "$", so removing currency codes first prevents partial matches.
When the Heuristic Fails
No heuristic is perfect. You'll get incorrect results when:
- Scientific data with exactly 3 decimals: "3.141" becomes "3141"
- Prices under $10 with 3 decimal places: "$1.234" becomes "$1234"
- Mixed formats in one column: Some rows EU, some rows US
For case 3, you need column-level format detection:
function detectColumnFormat(values) {
let euIndicators = 0;
let usIndicators = 0;
for (const v of values) {
if (/\d\.\d{3},\d{2}$/.test(v)) euIndicators++;
if (/\d,\d{3}\.\d{2}$/.test(v)) usIndicators++;
}
if (euIndicators > usIndicators) return "eu";
if (usIndicators > euIndicators) return "us";
return "auto"; // Use per-cell heuristic
}
Export Profiles for Different Regions
When you know your source country, you can avoid the ambiguity entirely by using region-specific profiles:
- European format: Assumes comma as decimal separator, uses semicolon as CSV delimiter
- US/UK format: Assumes dot as decimal separator, uses comma as CSV delimiter
HTML Table Exporter includes preset profiles for both formats, plus specialized profiles for tools like Pandas and DuckDB.
Common Pitfalls
Don't Lose Precision
// Wrong: loses precision on large numbers
const num = parseFloat("12345678901234567890");
// 12345678901234567000 (JavaScript limit)
// Better: keep as string until final calculation
const cleaned = normalizeNumberString(value);
// Only parseFloat when you need to compute
Watch for HTML Entities
Web tables sometimes contain (non-breaking space) inside numbers:
// "1 234 567" should become "1234567"
.replace(/ /g, " ")
.replace(/\s+/g, "")
Preserve Original When Unsure
If the value doesn't look like a number, return it unchanged:
if (!/[0-9]/.test(v)) return value;
// ...
if (!Number.isFinite(num)) return value;
Better to leave "N/A" as "N/A" than try to parse it.
Summary
| Scenario | Detection | Notes |
|---|---|---|
| Both . and , | Last one = decimal | Universal |
| Only comma, ≤2 digits after | Decimal | "1,23" |
| Only comma, 3+ digits after | Thousands | "1,234" |
| Only dot, ≤2 or 4+ digits | Decimal | "1.23", "3.14159" |
| Only dot, exactly 3 digits | Thousands | "1.234" |
The three-digit case is the pragmatic trade-off. It works for most real-world data. When in doubt, use an export profile that matches your source region.
For more on copying data from websites to Excel without formatting issues, see our guide on copying tables from websites to Excel.
Need clean number exports without the guesswork? Learn more at gauchogrid.com/html-table-exporter or try it free on the Chrome Web Store.
Top comments (0)