bitanath

Posted on May 25

Principal Components in TypeScript (Part 4)

#typescript #analytics #datascience

This is part four of a series Principal Components in TypeScript and focuses on the application of PCA to actually derive named insights

TL;DR

If you need a TL;DR, just read the code or grab the package here on npm

Not a Code Blog

This is not a code blog. There’s no easy copy-paste solution here.
If that’s what you want, go straight to the source code above.

Now this post attempts to use PCA in a totally different direction compared to the vanilla dimensionality reduction or to data compression discussed in the earlier Parts 1 and 2. Part 3 used it for neural network explanation. Part 4 will use it for something else entirely ... Attributing causation to data through factor analysis.

Now remember how PCA collapses data with 100 dimensions into a single dimension, wouldn't it be cool if this dimension was interpretable. For example, let's say the 100 columns were like stress, smoking frequency, alcohol ml etc etc.. you see where I am going with this, the final dimension would be something like cardiac arrest or premature demise. On that cheery note, let's figure out how PCA can actually be used to label this reduced dimension.

Wait really? Can it?

Nope, not PCA, but the starting point remains the same, SVD. We still decompose the data to get a bunch of eigenvectors and then utilize them to get a single factor or a combination of factors.

From SVD to named factors

The standard PCA pipeline goes: center data → covariance → eigendecomposition. That gives you eigenvectors that are abstract math objects. To get something you can name, you need to go one step further — compute how each original variable correlates with each factor score.

Here's the entire pipeline in one function, building directly on SVD:

import { svd } from "./svd";
interface FactorResult {
  loadings: number[][];
  scores: number[][];
  variance: number[];
}
function standardize(data: number[][]): number[][] {
  const m = data.length;
  const n = data[0].length;
  const means = new Array(n).fill(0);
  const stds = new Array(n).fill(0);
  const standardized: number[][] = Array.from({ length: m }, () => new Array(n));
  for (let j = 0; j < n; j++) {
    let sum = 0;
    for (let i = 0; i < m; i++) sum += data[i][j];
    means[j] = sum / m;
  }
  for (let j = 0; j < n; j++) {
    let sumSq = 0;
    for (let i = 0; i < m; i++) {
      const diff = data[i][j] - means[j];
      sumSq += diff * diff;
    }
    stds[j] = Math.sqrt(sumSq / m) || 1;
  }
  for (let i = 0; i < m; i++) {
    for (let j = 0; j < n; j++) {
      standardized[i][j] = (data[i][j] - means[j]) / stds[j];
    }
  }
  return standardized;
}
function correlation(x: number[], y: number[]): number {
  const n = x.length;
  let sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
  for (let i = 0; i < n; i++) {
    sumX += x[i]; sumY += y[i];
    sumXY += x[i] * y[i]; sumX2 += x[i] * x[i]; sumY2 += y[i] * y[i];
  }
  const num = n * sumXY - sumX * sumY;
  const den = Math.sqrt((n * sumX2 - sumX * sumX) * (n * sumY2 - sumY * sumY));
  return den === 0 ? 0 : num / den;
}
export function factor(data: number[][]): FactorResult {
  const standardized = standardize(data);
  const result = svd(standardized);
  const n = data[0].length;
  const m = data.length;
  // Factor scores = standardized · V  (equiv. to U · Σ)
  const factorScores: number[][] = Array.from({ length: m }, () => new Array(n));
  for (let i = 0; i < m; i++) {
    for (let j = 0; j < n; j++) {
      let sum = 0;
      for (let k = 0; k < n; k++)
        sum += standardized[i][k] * result.V[k][j];
      factorScores[i][j] = sum;
    }
  }
  // Loadings = correlation between each variable and each factor score
  const loadings: number[][] = Array.from({ length: n }, () => new Array(n));
  for (let varIdx = 0; varIdx < n; varIdx++) {
    const variableCol = standardized.map(row => row[varIdx]);
    for (let factorIdx = 0; factorIdx < n; factorIdx++) {
      const factorCol = factorScores.map(row => row[factorIdx]);
      loadings[varIdx][factorIdx] = correlation(variableCol, factorCol);
    }
  }
  // Sign consistency
  for (let j = 0; j < n; j++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += loadings[i][j];
    const sign = Math.abs(sum) < 1e-10 ? 1 : (sum < 0 ? -1 : 1);
    for (let i = 0; i < n; i++) loadings[i][j] *= sign;
    for (let i = 0; i < m; i++) factorScores[i][j] *= sign;
  }
  // Variance per factor
  const variance = new Array(n);
  for (let j = 0; j < n; j++) {
    let sumSq = 0;
    for (let i = 0; i < n; i++) sumSq += loadings[i][j] * loadings[i][j];
    variance[j] = sumSq / n;
  }
  return { loadings, scores: factorScores, variance };
}

What's happening here

Three things, and they're all important.

First, we standardize the data — z-scores, mean 0, variance 1. This puts every variable on the same scale so the SVD isn't biased by units (millilitres vs cigarettes vs hours of sleep).

Second, we compute factor scores as standardized · V. Since the SVD gives us X = UΣVᵀ, multiplying by V is equivalent to UΣ. Each row of the result is how that observation scores on each factor. The first column is the score on the strongest latent dimension.

Third, we compute loadings as the correlation between each original variable and each factor score. A loading of 0.8 means that variable moves tightly with that factor. This is what makes interpretation possible — you read the loadings column by column and ask: what do the high-loading variables have in common?

Reading the result

Pass your health data into factor() and inspect the loadings:

const healthData = [
  [8, 20, 200, 1, 5],  // stress, smoking, alcohol, exercise, sleep
  [6, 15, 150, 3, 6],
  [9, 25, 300, 0, 4],
  [3, 5,  50,  5, 8],
  [7, 18, 180, 2, 5],
];

const result = factor(healthData);

const varNames = ["Stress", "Smoking", "Alcohol", "Exercise", "Sleep"];

result.loadings.forEach((row, i) => {
  console.log(`${varNames[i]}: [${row.map(l => l.toFixed(3)).join(", ")}]`);
});

console.log("Variance explained:", result.variance.map(v => v.toFixed(3)).join(", "));

Now let's assume this is yours truly, a poverty stricken stressed out developer of things.

Factor 1 loads high on stress, smoking, and alcohol, negative on exercise and sleep. That's my "Cardiac Risk" axis. Factor 2 might split exercise and sleep against the rest — call it "Recovery." I named your latent space without guessing. I better make some lifestyle changes soon or ☠️

Why this works better than raw eigenvectors

Raw eigenvectors are arbitrary in sign and scale. A value of 0.5 on an eigenvector doesn't mean the same thing as a correlation of 0.5 — eigenvectors are unit-scaled by construction. Computing correlations with factor scores gives you directly interpretable numbers between -1 and 1. You can look at a loading of 0.92 and say "this variable is strongly associated with this factor" with the same confidence you'd have looking at any Pearson correlation.

The sign-flip loop at the end ensures consistency — factor 1 always points in the dominant direction of its loadings, so you don't get a random sign reversal between runs.

PCA vs Factor Analysis redux

PCA decomposes total variance. This function decomposes standardized data via SVD and then derives loadings from correlations — which is closer to what factor analysis does conceptually (modelling shared variance through latent variables). The practical difference is small with clean data and vanishes entirely with rotation, but purists will note it.

For naming dimensions and moving on, this approach is better than raw PCA because the output is already in interpretable units. You don't need a second step to convert eigenvectors into something readable. The loadings are the interpretation.

Everything shown here uses the same SVD from the package pca-js, just applied differently. The SVD is the Golub-Reinsch implementation — same decomposition, different post-processing. The difference between dimensionality reduction and factor analysis isn't the math. It's what you do after the decomposition lands.

I will also be writing a chonky implementation that does this quite easily, here is my github in case I actually manage to release. If not, life is hard and then we ☠️

The end, and mostly just a donut

With this we come to the end of this 4 part series. I hope you liked it. If not, that's ok too. Bye! and have a great rest of your day!

DEV Community