Mathias Leonhardt

Posted on • Originally published at ki-mathias.de

Deepfakes Explained — From Vectors to the Decoder Swap (Interactive)

Most explanations of deepfakes start with "AI swaps faces." That's like explaining a car engine by saying "it goes vroom." I wanted to understand the math — so I built a blog post that walks through it from scratch.

The result: 13 chapters, 6 interactive demos, and a real autoencoder running in your browser — no TensorFlow.js, just pure JavaScript matrix math and 200KB of model weights.

The path from vectors to face-swapping

The post follows the arc of a tech talk I gave at my company. Each concept builds on the previous:

  1. Vectors & Matrices — what does it mean to "transform" data?
  2. Orthogonality — when are features truly independent?
  3. Blind Source Separation — separating mixed audio signals with linear algebra
  4. High-dimensional data — a 200-pixel image is a 200-dimensional vector
  5. PCA — finding the axes that explain the most variance
  6. Linear vs. nonlinear interpolation — why blending faces in pixel space gives you ghosts, not faces
  7. The kernel trick — lifting data into higher dimensions where nonlinear becomes linear
  8. Neural networks — the machine that does dimensionality reduction AND expansion simultaneously
  9. Autoencoders — compress → latent space → decompress
  10. Latent space arithmetic — smiling woman − neutral woman + neutral man = smiling man
  11. Deepfakes — swap the decoder. That's it. That's the trick.
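Step 10 is worth seeing as literal code. A minimal sketch, assuming we already have encoded latent vectors — the 4-dimensional vectors below are made-up placeholders, not outputs of a real encoder:

```javascript
// Latent-space arithmetic with hypothetical 4D latent vectors.
function vecAdd(a, b) { return a.map(function (v, i) { return v + b[i]; }); }
function vecSub(a, b) { return a.map(function (v, i) { return v - b[i]; }); }

var smilingWoman = [0.9, 0.2, 0.5, 0.1];
var neutralWoman = [0.1, 0.2, 0.5, 0.1];
var neutralMan   = [0.1, 0.8, 0.3, 0.7];

// "smile" direction = smiling woman − neutral woman
var smileDirection = vecSub(smilingWoman, neutralWoman);

// apply it to the neutral man: ≈ [0.9, 0.8, 0.3, 0.7]
var smilingMan = vecAdd(neutralMan, smileDirection);
```

Decoding `smilingMan` through a trained decoder is what turns this vector arithmetic back into an image.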

The interactive parts

Every key concept has a visualization you can play with:

PCA Playground

Draw 2D points → watch the principal component axes update live. Eigenvalues, explained variance, the whole thing.
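The core computation behind a demo like this can be sketched in a few lines (my own reconstruction, not the playground's source): center the points, build the 2×2 covariance matrix, and find its dominant eigenvector by power iteration.

```javascript
// First principal component of 2D points via power iteration on the
// covariance matrix.
function firstPrincipalComponent(points) {
  var n = points.length;
  // center the data
  var mx = 0, my = 0;
  points.forEach(function (p) { mx += p[0] / n; my += p[1] / n; });
  // 2x2 covariance matrix [[cxx, cxy], [cxy, cyy]]
  var cxx = 0, cxy = 0, cyy = 0;
  points.forEach(function (p) {
    var dx = p[0] - mx, dy = p[1] - my;
    cxx += dx * dx / n; cxy += dx * dy / n; cyy += dy * dy / n;
  });
  // power iteration: repeatedly apply the matrix and renormalize;
  // the vector converges to the eigenvector with the largest eigenvalue
  var v = [1, 1];
  for (var k = 0; k < 100; k++) {
    var wx = cxx * v[0] + cxy * v[1];
    var wy = cxy * v[0] + cyy * v[1];
    var norm = Math.hypot(wx, wy);
    v = [wx / norm, wy / norm];
  }
  return v; // unit vector along the axis of greatest variance
}
```

For points on the line y = x, this returns ≈ [0.707, 0.707] — the diagonal axis that explains all the variance.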

Matrix Transform

Drag sliders for rotation and scale → see how a matrix transforms a vector in real time.
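What the sliders drive is just this (hypothetical helper names, not the demo's source): compose a rotation with a uniform scale into one 2×2 matrix, then multiply.

```javascript
// Build a 2x2 matrix from a rotation angle and a scale factor.
function transformMatrix(angleRad, scale) {
  var c = Math.cos(angleRad), s = Math.sin(angleRad);
  // rotation followed by uniform scaling
  return [[scale * c, -scale * s],
          [scale * s,  scale * c]];
}

// Matrix-vector product for 2D.
function applyMatrix(M, v) {
  return [M[0][0] * v[0] + M[0][1] * v[1],
          M[1][0] * v[0] + M[1][1] * v[1]];
}

// rotating [1, 0] by 90° at scale 2 gives ≈ [0, 2]
var M = transformMatrix(Math.PI / 2, 2);
var v = applyMatrix(M, [1, 0]);
```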

Kernel Trick

Toggle between 2D (not linearly separable) and 3D (linearly separable!) with a button. The separating plane appears when you lift the data.
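One classic lift that produces exactly this effect (an assumed reconstruction, not the demo's code): map each 2D point (x, y) to 3D as (x, y, x² + y²). Points inside a circle stay low on the new z-axis, points outside rise high, so a horizontal plane separates what no line could.

```javascript
// Lift a 2D point into 3D: the third coordinate is the squared
// distance from the origin.
function lift(p) {
  var x = p[0], y = p[1];
  return [x, y, x * x + y * y];
}

// a point inside the unit circle vs. one outside
var inside  = lift([0.3, 0.4]);  // z = 0.25 → below the plane z = 1
var outside = lift([1.2, 0.9]);  // z = 2.25 → above the plane z = 1
```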

Rotating Tesseract

A 4D hypercube projected to 2D, with three animated rotation planes. Includes a 3D↔4D slider that collapses the tesseract into a regular cube — showing what the 4th dimension "does."
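The math behind such a projection can be sketched like this (my assumed reconstruction): a 4D rotation mixes two coordinates at a time, e.g. x and w, and the projection is two perspective divides, 4D → 3D → 2D.

```javascript
// Rotate a 4D point in the x–w plane; y and z are untouched.
function rotateXW(p, angle) {
  var c = Math.cos(angle), s = Math.sin(angle);
  return [c * p[0] - s * p[3], p[1], p[2], s * p[0] + c * p[3]];
}

// Project 4D → 2D with two perspective divides from distance `dist`.
function project4Dto2D(p, dist) {
  var w = dist / (dist - p[3]);                      // 4D → 3D
  var x3 = p[0] * w, y3 = p[1] * w, z3 = p[2] * w;
  var d = dist / (dist - z3);                        // 3D → 2D
  return [x3 * d, y3 * d];
}

// one vertex of the unit tesseract, a quarter turn in x–w: ≈ [-1, 1, 1, 1]
var vertex = rotateXW([1, 1, 1, 1], Math.PI / 2);
var screen = project4Dto2D(vertex, 3);
```

Animating three such plane rotations at once and drawing the projected edges gives the familiar tumbling tesseract.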

MNIST Autoencoder

Draw a digit → the autoencoder encodes and reconstructs it live. The model is a tiny PyTorch-trained network (784→64→16→2→16→64→784) exported as float16 binary weights. Inference runs in pure JavaScript — no TensorFlow.js, no WebAssembly, just matrix multiplications and ReLU.

Total model size: 200KB.

Latent Space Explorer

Move your mouse across a 2D latent space → the decoder generates digits in real time. Each color cluster = one digit class. You can see how the autoencoder organized the digits spatially.

The autoencoder — no TensorFlow.js needed

This was the most satisfying part to build. Instead of shipping a 3MB TensorFlow.js dependency, I:

  1. Trained a tiny autoencoder in PyTorch (12 epochs on MNIST, ~30 seconds)
  2. Exported weights as float16 binary (not JSON — 200KB instead of 2.2MB)
  3. Wrote inference in ~60 lines of vanilla JavaScript:
// Fully-connected layer: out = W·x + b, with W stored row-major
// as a flat array alongside its shape.
function linear(x, W, b) {
  var out = new Float32Array(W.shape[0]);
  for (var i = 0; i < W.shape[0]; i++) {
    var sum = b.data[i];
    for (var j = 0; j < W.shape[1]; j++)
      sum += W.data[i * W.shape[1] + j] * x[j];
    out[i] = sum;
  }
  return out;
}

// ReLU activation: clamp negatives to zero, element-wise.
function relu(x) {
  return x.map(function (v) { return Math.max(0, v); });
}

// Encoder half: 784 → 64 → 16 → 2, three linear layers with ReLU between.
function encode(x, model) {
  var h1 = relu(linear(x, model.enc_0_weight, model.enc_0_bias));
  var h2 = relu(linear(h1, model.enc_2_weight, model.enc_2_bias));
  return linear(h2, model.enc_4_weight, model.enc_4_bias);
}

That's it. No framework, no build step, no WASM. Load the binary weights, do matrix math, draw pixels on a canvas.
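The one slightly fiddly part of "load the binary weights" is the float16 decoding, since plain JavaScript has no native `Float16Array` in most engines. A sketch under the assumption that the file is raw little-endian float16 values (the bit layout is standard IEEE 754 binary16: 1 sign bit, 5 exponent bits, 10 mantissa bits):

```javascript
// Expand one 16-bit half-precision pattern to a JS number.
function float16ToFloat32(h) {
  var sign = (h & 0x8000) ? -1 : 1;
  var exp = (h >> 10) & 0x1f;
  var frac = h & 0x3ff;
  if (exp === 0)  return sign * Math.pow(2, -14) * (frac / 1024); // subnormal
  if (exp === 31) return frac ? NaN : sign * Infinity;            // inf / NaN
  return sign * Math.pow(2, exp - 15) * (1 + frac / 1024);        // normal
}

// Decode a whole weight buffer (e.g. from fetch().arrayBuffer()).
function decodeWeights(buffer) {
  var u16 = new Uint16Array(buffer);
  var out = new Float32Array(u16.length);
  for (var i = 0; i < u16.length; i++) out[i] = float16ToFloat32(u16[i]);
  return out;
}
```

Halving storage this way costs ~3 decimal digits of precision per weight, which a small MNIST autoencoder tolerates easily.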

The deepfake mechanism in one paragraph

Train two autoencoders with a shared encoder but different decoders. The encoder learns a universal face representation (latent space). Decoder A reconstructs face A, Decoder B reconstructs face B. Now swap: feed person A's image through the shared encoder, then through decoder B. Result: A's expression and pose, B's appearance. A deepfake.
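The swap itself is almost insultingly short in code. A runnable sketch with toy 2-unit "networks" — the weights below are made up purely for illustration, and a real model would use the full layer stack from the post:

```javascript
// Same linear layer as in the inference code above: out = W·x + b.
function linear(x, W, b) {
  var out = new Float32Array(W.shape[0]);
  for (var i = 0; i < W.shape[0]; i++) {
    var sum = b.data[i];
    for (var j = 0; j < W.shape[1]; j++)
      sum += W.data[i * W.shape[1] + j] * x[j];
    out[i] = sum;
  }
  return out;
}

// Toy single-layer stand-ins: one shared encoder, one decoder per person.
var sharedEncoder = { W: { shape: [2, 2], data: [1, 0, 0, 1] }, b: { data: [0, 0] } };
var decoderB      = { W: { shape: [2, 2], data: [0, 1, 1, 0] }, b: { data: [0.5, 0.5] } };

function deepfake(imageA) {
  var z = linear(imageA, sharedEncoder.W, sharedEncoder.b); // A's pose/expression as latent code
  return linear(z, decoderB.W, decoderB.b);                 // rendered through B's decoder
}
```

Swapping `decoderB` for `decoderA` would reconstruct person A instead — the encoder never changes.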

Try it

The entire blog was built with Claude Code as an AI pair programmer. The autoencoder training, Manim animations, ElevenLabs voiceover, ffmpeg video assembly, and YouTube upload were all orchestrated from the terminal.

Questions welcome — especially if you spot a mistake in the math.
