DEV Community

Cover image for Number Normalization for International Data: A Practical Guide
circobit
circobit

Posted on

Number Normalization for International Data: A Practical Guide

You export a table from a German website. The revenue column shows "1.234,56 €".

You paste it into Excel. It becomes "1.23456E" or, worse, a date.

Welcome to the international number format problem.

The Core Issue: What Does "1.234" Mean?

In the US and UK: 1.234 = one point two three four (decimal)

In Germany, Spain, and most of Latin America: 1.234 = one thousand two hundred thirty-four

Same characters. Completely different values.

When extracting data from international websites, you can't assume which format you're dealing with. And getting it wrong means your analysis is off by orders of magnitude.

The Three-Digit Ambiguity

The hardest case is exactly three digits after a separator.

Input US Interpretation EU Interpretation
1.234 1.234 (decimal) 1,234 (thousands)
1,234 1,234 (thousands) 1.234 (decimal)
1.234.567 Invalid 1,234,567
1,234,567 1,234,567 Invalid

With two digits after, it's clearly decimal: "1.23" or "1,23".
With four+ digits after, it's clearly decimal: "3.14159".
With three digits? Could be either.

The Detection Algorithm

Here's a heuristic approach that handles most real-world cases:

function normalizeNumberString(value) {
  if (typeof value !== "string") return value;
  const v = value.trim();
  if (!v || !/[0-9]/.test(v)) return value;

  // Step 1: Clean currency symbols, percentages, whitespace
  let cleaned = v
    .replace(/ /g, " ")
    .replace(/^(USD|EUR|GBP|R\$|MXN)\s*/i, "")
    .replace(/[$€£¥₹₽₩]/g, "")
    .replace(/%/g, "")
    .replace(/\s+/g, "");

  // Step 2: Handle negative signs
  const isNegative = cleaned.match(/^-/);
  if (isNegative) {
    cleaned = cleaned.substring(1).trim();
  }

  // Step 3: Pure digits? We're done
  if (/^\d+$/.test(cleaned)) {
    return isNegative ? `-${cleaned}` : cleaned;
  }

  // Step 4: Detect separators
  const hasComma = cleaned.includes(",");
  const hasDot = cleaned.includes(".");

  // Step 5: Apply heuristic
  let decimalSeparator = ".";
  let thousandsSeparator = ",";

  if (hasComma && hasDot) {
    // Both present: last one is decimal
    const lastComma = cleaned.lastIndexOf(",");
    const lastDot = cleaned.lastIndexOf(".");

    if (lastComma > lastDot) {
      decimalSeparator = ",";
      thousandsSeparator = ".";
    }
  } else if (hasComma && !hasDot) {
    const parts = cleaned.split(",");

    if (parts.length === 2 && parts[1].length <= 2) {
      // "1,23" → decimal
      decimalSeparator = ",";
    } else {
      // "1,234" or "1,234,567" → thousands
      decimalSeparator = null;
      thousandsSeparator = ",";
    }
  } else if (hasDot && !hasComma) {
    const parts = cleaned.split(".");

    if (parts.length === 2) {
      const afterDot = parts[1];

      if (afterDot.length <= 2 || afterDot.length >= 4) {
        // "1.23" or "3.14159" → decimal
        decimalSeparator = ".";
      } else {
        // "1.234" → THOUSANDS (pragmatic choice for tables)
        decimalSeparator = null;
        thousandsSeparator = ".";
      }
    } else {
      // Multiple dots → thousands
      decimalSeparator = null;
      thousandsSeparator = ".";
    }
  }

  // Step 6: Clean thousands separators
  if (thousandsSeparator) {
    cleaned = cleaned.replace(new RegExp(`\\${thousandsSeparator}`, "g"), "");
  }

  // Step 7: Normalize decimal to dot
  if (decimalSeparator && decimalSeparator !== ".") {
    cleaned = cleaned.replace(decimalSeparator, ".");
  }

  // Step 8: Validate
  const num = Number(cleaned);
  if (!Number.isFinite(num)) return value;

  return isNegative ? `-${cleaned}` : cleaned;
}
Enter fullscreen mode Exit fullscreen mode

Why Three Digits = Thousands

The key decision: when we see exactly three digits after a single separator (like "1.234"), we interpret it as thousands, not decimal.

Why? In HTML tables:

  • Financial data with thousands is extremely common: $1.234, €1.234
  • Decimals with exactly three digits are rare in practice
  • Scientific data with three decimals (like "3.141") is usually written as "3.14159" or just "3.14"

This heuristic is a pragmatic trade-off that works for the majority of real-world tables.

The Full Format Matrix

Input Interpretation Normalized Output
1.234,56 EU decimal 1234.56
1,234.56 US decimal 1234.56
1.234.567,89 EU thousands + decimal 1234567.89
1,234,567.89 US thousands + decimal 1234567.89
1.234 Thousands 1234
1,234 Thousands 1234
3.14 Decimal 3.14
3,14 Decimal 3.14
3.14159 Decimal (4+ digits) 3.14159
$ 1.200,50 Currency + EU 1200.50
-5.000 Negative thousands -5000
12.5% Percentage 12.5

Handling Currency Codes

International data comes with currency prefixes and suffixes. Clean them first:

// Remove currency codes BEFORE symbols
.replace(/^(USD|EUR|GBP|JPY|CHF|CAD|AUD|CNY|INR|BRL|R\$|MXN|KRW)\s*/i, "")
// Then remove symbols
.replace(/[$€£¥₹₽₩₪฿₫₴₦]/g, "")
Enter fullscreen mode Exit fullscreen mode

Order matters: "R$" (Brazilian Real) contains "$", so removing currency codes first prevents partial matches.

When the Heuristic Fails

No heuristic is perfect. You'll get incorrect results when:

  1. Scientific data with exactly 3 decimals: "3.141" becomes "3141"
  2. Prices under $10 with 3 decimal places: "$1.234" becomes "$1234"
  3. Mixed formats in one column: Some rows EU, some rows US

For case 3, you need column-level format detection:

function detectColumnFormat(values) {
  let euIndicators = 0;
  let usIndicators = 0;

  for (const v of values) {
    if (/\d\.\d{3},\d{2}$/.test(v)) euIndicators++;
    if (/\d,\d{3}\.\d{2}$/.test(v)) usIndicators++;
  }

  if (euIndicators > usIndicators) return "eu";
  if (usIndicators > euIndicators) return "us";
  return "auto"; // Use per-cell heuristic
}
Enter fullscreen mode Exit fullscreen mode

Export Profiles for Different Regions

When you know your source country, you can avoid the ambiguity entirely by using region-specific profiles:

  • European format: Assumes comma as decimal separator, uses semicolon as CSV delimiter
  • US/UK format: Assumes dot as decimal separator, uses comma as CSV delimiter

HTML Table Exporter includes preset profiles for both formats, plus specialized profiles for tools like Pandas and DuckDB.

Common Pitfalls

Don't Lose Precision

// Wrong: loses precision on large numbers
const num = parseFloat("12345678901234567890");
// 12345678901234567000 (JavaScript limit)

// Better: keep as string until final calculation
const cleaned = normalizeNumberString(value);
// Only parseFloat when you need to compute
Enter fullscreen mode Exit fullscreen mode

Watch for HTML Entities

Web tables sometimes contain &nbsp; (non-breaking space) inside numbers:

// "1&nbsp;234&nbsp;567" should become "1234567"
.replace(/&nbsp;/g, " ")
.replace(/\s+/g, "")
Enter fullscreen mode Exit fullscreen mode

Preserve Original When Unsure

If the value doesn't look like a number, return it unchanged:

if (!/[0-9]/.test(v)) return value;
// ...
if (!Number.isFinite(num)) return value;
Enter fullscreen mode Exit fullscreen mode

Better to leave "N/A" as "N/A" than try to parse it.

Summary

Scenario Detection Notes
Both . and , Last one = decimal Universal
Only comma, ≤2 digits after Decimal "1,23"
Only comma, 3+ digits after Thousands "1,234"
Only dot, ≤2 or 4+ digits Decimal "1.23", "3.14159"
Only dot, exactly 3 digits Thousands "1.234"

The three-digit case is the pragmatic trade-off. It works for most real-world data. When in doubt, use an export profile that matches your source region.

For more on copying data from websites to Excel without formatting issues, see our guide on copying tables from websites to Excel.


Need clean number exports without the guesswork? Learn more at gauchogrid.com/html-table-exporter or try it free on the Chrome Web Store.

Top comments (0)