SEN LLC

Posted on Jun 13

A Scatter-Plot Explorer for World Statistics — Log Scales and Hand-Rolled Pearson Correlation

#javascript #dataviz #statistics #frontend

"Do countries with higher GDP per capita also have longer life expectancy?" I built a tool that lets you explore questions like that across 48 countries by picking any two of five metrics as scatter-plot axes. Two implementation details carry most of the weight: (1) metrics that span orders of magnitude (population runs from Singapore's 5.6M to India's 1,417M, a 250× range) must be plotted and correlated on a log scale, or nearly every point collapses into one corner, and (2) the Pearson correlation coefficient is hand-rolled and recomputed live as you change axes. It is vanilla JS with no chart library, and the computation layer is covered by 34 Node tests.

🌐 Demo: https://sen.ltd/portfolio/global-stats/
📦 GitHub: https://github.com/sen-ltd/global-stats

The data model

48 countries × 5 metrics (population, GDP per capita, life expectancy, CO2 per capita, area):

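{ name: "Japan", code: "JP", region: "アジア", // region label is stored in Japanese ("Asia")
  // population in millions, GDP per capita in USD, life expectancy in years
  population: 125.1, gdpPerCapita: 33800, lifeExpectancy: 84.5,
  co2PerCapita: 8.5, area: 378 },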

Metric definitions live in a separate table with a log flag:

export const METRICS = [
  { key: "population",     label: "...", log: true  },
  { key: "gdpPerCapita",   label: "...", log: true  },
  { key: "lifeExpectancy", label: "...", log: false },
  { key: "co2PerCapita",   label: "...", log: true  },
  { key: "area",           label: "...", log: true  },
];

Only life expectancy is log: false. The flag decides both how a metric is scaled on screen and whether it is log-transformed before correlation.
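The later snippets call getMetric() without showing it. Here is a minimal sketch of such a lookup, assuming it is a plain search over the METRICS table. The error on unknown keys and the toCorrelationSpace helper are illustrative assumptions, not code from the repository.

```javascript
// Trimmed copy of the article's METRICS table (labels omitted).
const METRICS = [
  { key: "population", log: true },
  { key: "gdpPerCapita", log: true },
  { key: "lifeExpectancy", log: false },
  { key: "co2PerCapita", log: true },
  { key: "area", log: true },
];

// Hypothetical lookup; the real getMetric() may treat unknown keys differently.
function getMetric(key) {
  const metric = METRICS.find((m) => m.key === key);
  if (!metric) throw new Error(`Unknown metric: ${key}`);
  return metric;
}

// Hypothetical helper: the same flag that picks the axis scale also picks
// the transform applied before correlating.
function toCorrelationSpace(key, value) {
  return getMetric(key).log ? Math.log10(value) : value;
}

console.log(toCorrelationSpace("gdpPerCapita", 10000));  // 4, up to floating-point error
console.log(toCorrelationSpace("lifeExpectancy", 84.5)); // 84.5, left untouched
```

Keeping the flag in the metric table means the plotting code and the statistics code cannot disagree about which metrics are logarithmic.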

Why log scale is non-negotiable

Plot "population vs GDP" on linear axes and it's a disaster. Population spans 250× (Singapore to India); GDP per capita spans 100× (Ethiopia $1,030 to Norway $106,150). On linear axes:

nearly every point collapses into the bottom-left corner
China and India alone stick to the right edge
the correlation coefficient gets dragged around by the big outliers

The fix is log transformation — equal spacing per order of magnitude, so countries of wildly different size share one viewport. Linear metrics like life expectancy (52–85 years, a mere 1.6× range) stay linear.

export function normalize(value, metric, domainMin, domainMax) {
  if (metric.log) {
    const lv = Math.log10(value);
    const lmin = Math.log10(domainMin);
    const lmax = Math.log10(domainMax);
    if (lmax === lmin) return 0.5;
    return (lv - lmin) / (lmax - lmin);
  }
  if (domainMax === domainMin) return 0.5;
  return (value - domainMin) / (domainMax - domainMin);
}

Tested by asserting the geometric midpoint maps to center:

test("log: geometric midpoint → 0.5", () => {
  const m = getMetric("gdpPerCapita");
  // domain 1000..100000, geometric mean = 10000 → 0.5
  assert.ok(Math.abs(normalize(10000, m, 1000, 100000) - 0.5) < 1e-9);
});

Linear would put 50500 at the center; log puts the geometric mean 10000 there. That difference is what "thinking in orders of magnitude" means.
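To make the collapse concrete, the sketch below places Japan's population between Singapore's and India's using the three figures quoted in this post. It restates normalize() in compact form so it runs on its own, and the metric is reduced to a bare { log } flag for illustration.

```javascript
// Compact restatement of normalize(); metric is just { log: boolean } here.
function normalize(value, metric, domainMin, domainMax) {
  const f = metric.log ? Math.log10 : (v) => v;
  const lo = f(domainMin), hi = f(domainMax);
  if (hi === lo) return 0.5; // degenerate domain: park the point in the middle
  return (f(value) - lo) / (hi - lo);
}

// Population in millions, as quoted in the article.
const SINGAPORE = 5.6, JAPAN = 125.1, INDIA = 1417;

const japanLinear = normalize(JAPAN, { log: false }, SINGAPORE, INDIA);
const japanLog = normalize(JAPAN, { log: true }, SINGAPORE, INDIA);

console.log(japanLinear.toFixed(3)); // about 0.085: squeezed against the left edge
console.log(japanLog.toFixed(3));    // about 0.561: a little right of center
```

On the linear axis a country of 125 million sits within the first tenth of the plot width. On the log axis it lands near the middle, which is where a country roughly halfway between the extremes in order of magnitude belongs.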

Hand-rolled Pearson correlation

Pick two axes, get a coefficient r. Straight from the definition:

export function pearson(xs, ys) {
  const n = xs.length;
  if (n < 2 || ys.length !== n) return null;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let num = 0, denX = 0, denY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX, dy = ys[i] - meanY;
    num += dx * dy; denX += dx * dx; denY += dy * dy;
  }
  const den = Math.sqrt(denX * denY);
  if (den === 0) return null; // zero variance → undefined
  return num / den;
}

Returning null for zero variance matters: the division would otherwise be 0/0 = NaN, and that NaN would corrupt axis labels downstream. Treat the mathematically undefined case explicitly.
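On the view side, a null result needs its own rendering branch. A minimal sketch, assuming a hypothetical formatR helper and a placeholder string of my own choosing (neither is taken from the repository):

```javascript
// Hypothetical view helper: turn pearson()'s result into label text.
function formatR(r) {
  // null is the explicit "undefined" signal; the isFinite check is a second
  // line of defense in case NaN or Infinity slips through anyway.
  if (r === null || !Number.isFinite(r)) return "r = n/a";
  return `r = ${r.toFixed(2)}`;
}

console.log(formatR(0.8437)); // "r = 0.84"
console.log(formatR(null));   // "r = n/a"
console.log(formatR(0 / 0));  // "r = n/a" rather than "r = NaN"
```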

test("perfect positive correlation = 1", () => {
  assert.ok(Math.abs(pearson([1, 2, 3], [2, 4, 6]) - 1) < 1e-9);
});
test("no correlation ≈ 0", () => {
  // y = x² sampled symmetrically around 0: a symmetric shape has zero LINEAR correlation
  assert.ok(Math.abs(pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])) < 1e-9);
});
test("zero variance → null", () => {
  assert.equal(pearson([5, 5, 5], [1, 2, 3]), null);
});

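The V-shape test is the important one: r = 0 means zero linear correlation, not "no relationship." The test data is actually y = x² sampled symmetrically around its vertex, a fully deterministic relationship whose Pearson r is still 0. The symmetry is what makes it cancel; sample only one arm of the same curve and r is strongly positive. The test documents that limitation.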
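The zero depends on the sample being symmetric around the vertex. The sketch below restates pearson() in compact form so it runs on its own, then evaluates the same curve y = x² on a symmetric and a one-sided set of x values. The sample points are chosen for illustration.

```javascript
// Compact restatement of the article's pearson() so this snippet runs on its own.
function pearson(xs, ys) {
  const n = xs.length;
  if (n < 2 || ys.length !== n) return null;
  const mean = (a) => a.reduce((s, v) => s + v, 0) / n;
  const mx = mean(xs), my = mean(ys);
  let num = 0, denX = 0, denY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx, dy = ys[i] - my;
    num += dx * dy; denX += dx * dx; denY += dy * dy;
  }
  const den = Math.sqrt(denX * denY);
  return den === 0 ? null : num / den;
}

const square = (x) => x * x;
const symmetric = [-2, -1, 0, 1, 2]; // both arms of the curve
const oneSided = [0, 1, 2, 3, 4];    // right arm only

const rSymmetric = pearson(symmetric, symmetric.map(square));
const rOneSided = pearson(oneSided, oneSided.map(square));

console.log(rSymmetric);           // 0: the two arms cancel
console.log(rOneSided.toFixed(2)); // "0.96": same curve, strongly "linear"
```

Both samples follow exactly the same rule, so r says more about how the sample sits on the curve than about whether a relationship exists.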

Correlate in log space too

The key insight: if you display on a log scale, you must correlate on log-transformed values to match. Power-law relationships (y = ax^b) become straight lines in log-log space (log y = b·log x + log a), so Pearson on the logs captures the true strength:

export function metricCorrelation(keyX, keyY, pool) {
  const mx = getMetric(keyX), my = getMetric(keyY);
  const xs = [], ys = [];
  for (const c of pool) {
    let x = c[keyX], y = c[keyY];
    if (mx.log) x = Math.log10(x); // power-law → linear
    if (my.log) y = Math.log10(y);
    xs.push(x); ys.push(y);
  }
  return pearson(xs, ys);
}

test("GDP vs life expectancy is a strong positive correlation", () => {
  assert.ok(metricCorrelation("gdpPerCapita", "lifeExpectancy", COUNTRIES) > 0.5);
});

The actual value comes out at r ≈ 0.84 — the famous Preston curve (income vs longevity) reproduced from the data. GDP is log, life expectancy is linear, so it's a semi-log correlation, which matches the economics finding that life expectancy scales with the logarithm of income.
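The straightening effect is easy to check on synthetic data. The sketch below uses a made-up power law y = 100 / x (so a = 100, b = -1) over four decades and compares Pearson on the raw values with Pearson on the logs. The data is invented for illustration and has nothing to do with the country dataset.

```javascript
// Compact restatement of the article's pearson() so this snippet runs on its own.
function pearson(xs, ys) {
  const n = xs.length;
  if (n < 2 || ys.length !== n) return null;
  const mean = (a) => a.reduce((s, v) => s + v, 0) / n;
  const mx = mean(xs), my = mean(ys);
  let num = 0, denX = 0, denY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx, dy = ys[i] - my;
    num += dx * dy; denX += dx * dx; denY += dy * dy;
  }
  const den = Math.sqrt(denX * denY);
  return den === 0 ? null : num / den;
}

// Synthetic power law y = a * x^b with a = 100, b = -1.
const xs = [1, 10, 100, 1000];
const ys = xs.map((x) => 100 / x);

const log10All = (values) => values.map((v) => Math.log10(v));

const rawR = pearson(xs, ys);
const logR = pearson(log10All(xs), log10All(ys));

console.log(rawR.toFixed(2)); // about -0.43: an exact law looks like a weak trend
console.log(logR.toFixed(2)); // "-1.00": a straight line in log-log space
```

The relationship is exact in both cases. Only the log-log version lets Pearson see it, because that is the space in which the power law is a line.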

Y-axis inversion

SVG's origin is top-left, so "bigger value = higher up" needs a Y flip:

// mx, my: metric definitions for each axis; dx, dy: { min, max } domains over the pool
return pool.map((c) => ({
  country: c,
  cx: normalize(c[keyX], mx, dx.min, dx.max),
  cy: 1 - normalize(c[keyY], my, dy.min, dy.max), // invert
}));

Guarded by a test:

test("y is inverted: highest life-expectancy country has smallest cy", () => {
  // pts: the mapped points from the snippet above, with lifeExpectancy on the Y axis
  const top = pts.reduce((a, b) => (b.country.lifeExpectancy > a.country.lifeExpectancy ? b : a));
  for (const p of pts) {
    if (p.country.code !== top.country.code) assert.ok(p.cy >= top.cy - 1e-9);
  }
});
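The normalized 0..1 coordinates still have to become pixels. A minimal sketch of that last step, assuming a hypothetical toSvgPoint helper with a uniform padding. The names and the 400 × 300 viewport are illustrative, not taken from app.js.

```javascript
// Hypothetical helper: map normalized data coordinates (0..1, y pointing up)
// to SVG pixel coordinates (origin top-left, y pointing down).
function toSvgPoint(nx, ny, width, height, pad) {
  const cx = nx;
  const cy = 1 - ny; // the flip: the largest value gets the smallest pixel y
  return {
    x: pad + cx * (width - 2 * pad),
    y: pad + cy * (height - 2 * pad),
  };
}

const lowest = toSvgPoint(0, 0, 400, 300, 20);
const highest = toSvgPoint(1, 1, 400, 300, 20);

console.log(lowest);  // { x: 20, y: 280 }: bottom-left of the plot area
console.log(highest); // { x: 380, y: 20 }: top-right of the plot area
```

Doing the flip in normalized space, as the snippet in this section does, keeps the pixel math free of sign handling and lets the inversion be tested without a DOM.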

Data integrity tests

Hardcoded public data deserves integrity checks — and for a log-scale tool, "all metrics positive" is a precondition, not a nicety (log10(0) = -∞, log10(negative) = NaN):

test("no duplicate ISO codes", () => { /* ... */ });
test("every metric field is present and positive", () => {
  for (const c of COUNTRIES)
    for (const m of METRICS)
      assert.ok(typeof c[m.key] === "number" && c[m.key] > 0);
});
test("life expectancy in a sane range (40-90)", () => { /* ... */ });
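To see why positivity is a hard requirement, the sketch below shows what a single zero does once it reaches the log transform. The isLogSafe guard is a hypothetical helper, slightly stricter than a bare `> 0` check because it also rejects Infinity. The zero in the sample array is invented for the demonstration.

```javascript
console.log(Math.log10(0));  // -Infinity
console.log(Math.log10(-1)); // NaN

// Hypothetical guard for values that are about to be log-transformed.
function isLogSafe(value) {
  return typeof value === "number" && Number.isFinite(value) && value > 0;
}

// One bad value poisons the mean, and with it every deviation from the mean.
const sample = [5.6, 0, 1417]; // the 0 is a deliberately broken entry
const logs = sample.map((v) => Math.log10(v));
const mean = logs.reduce((a, b) => a + b, 0) / logs.length;

console.log(mean);           // -Infinity
console.log(logs[0] - mean); // Infinity, so the correlation sums are no longer meaningful

const offenders = sample.filter((v) => !isLogSafe(v));
console.log(offenders);      // [ 0 ]
```

Catching this in a data test is cheaper than guarding every call site, since the dataset is hardcoded and only changes when someone edits data.js.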

Architecture

data.js  ← 48 countries × 5 metrics (World Bank / UN / OWID ~2022)
core.js  ← pearson, normalize (log-aware), scatter scaling, region aggregation (DOM-free, 34 tests)
app.js   ← SVG scatter + sortable table

Try it

Demo: https://sen.ltd/portfolio/global-stats/
GitHub: https://github.com/sen-ltd/global-stats

Set the axes to "CO2 vs GDP" for a clear positive correlation (richer = more emissions). Set "population vs life expectancy" for a near-zero one (people in big and small countries live about equally long). Colors encode region.

Takeaways

Order-of-magnitude metrics (population, GDP per capita, CO2 per capita, area) need log scales or points collapse. Linear metrics (life expectancy) stay linear. A per-metric log flag drives both the plot scale and the correlation transform.
Pearson is implementable from the definition. Return null for zero variance — don't leak NaN into the view.
Display on log → correlate on log. Power laws straighten out in log-log space.
Pearson measures linear correlation only. A V-shape test documents that.
With log scales, "all values positive" is a precondition, not a nicety. Test it.
GDP vs life expectancy gives r ≈ 0.84 — the Preston curve, straight from the data.

This is OSS portfolio #262 from SEN LLC (Tokyo). https://sen.ltd/portfolio/

DEV Community