Pooling in CNNs: Shrink the Map, Keep What Matters

#ai #machinelearning #deeplearning #beginners

After a few conv layers you're drowning in feature maps that are too big and too slow to process. Pooling shrinks them while keeping the signal, and it has zero learnable parameters. With a 2×2 window: four numbers in, one out.

🪟 Max vs average pooling: https://dev48v.infy.uk/dl/day9-pooling.html

The operation

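const out = [];
for (let y = 0; y + 1 < h; y += 2) {    // stride 2: hop, don't slide
  out.push([]);                          // one output row per two input rows
  for (let x = 0; x + 1 < w; x += 2) {  // the "+ 1" guard skips a leftover odd row/column
    const win = [img[y][x], img[y][x+1], img[y+1][x], img[y+1][x+1]];
    out[y/2][x/2] = reduce(win);         // 2×2 → 1 value, halves both dims
  }
}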
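
To see the loop on real numbers, here is a minimal self-contained sketch. The function name `pool2x2` and the 4×4 sample grid are made up for illustration, and the sketch assumes a leftover odd row or column is simply dropped rather than padded. It takes the reducer as a parameter, so the same loop serves both flavours of pooling described in the next two sections.

```javascript
// Hypothetical helper: 2×2 pooling with stride 2 over a 2D array of numbers.
// `reduce` turns the four values of each window into a single value.
function pool2x2(img, reduce) {
  const h = img.length;
  const w = img[0].length;
  const out = [];
  for (let y = 0; y + 1 < h; y += 2) {
    const row = [];
    for (let x = 0; x + 1 < w; x += 2) {
      row.push(reduce([img[y][x], img[y][x + 1], img[y + 1][x], img[y + 1][x + 1]]));
    }
    out.push(row);
  }
  return out;
}

// Illustrative 4×4 feature map (values chosen by hand).
const sample = [
  [1, 3, 2, 0],
  [4, 2, 1, 1],
  [0, 0, 5, 6],
  [1, 2, 7, 8],
];

const maxPooled = pool2x2(sample, (win) => Math.max(...win));
const avgPooled = pool2x2(sample, (win) => win.reduce((a, b) => a + b, 0) / win.length);

console.log(maxPooled); // [[4, 2], [2, 8]]
console.log(avgPooled); // [[2.5, 1], [0.75, 6.5]]
```

The 4×4 input becomes 2×2 either way. Only the rule for collapsing each window differs.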

Max pool — keep the strongest

const reduce = (w) => Math.max(...w);

A feature detector asks "is this feature here?" — max pooling answers "yes, somewhere in this region it fired", keeping the feature's presence while discarding its exact location.

Average pool — smooth it

const reduce = (w) => w.reduce((a,b)=>a+b) / w.length;

Average pooling is most common at the end of a network, as global average pooling: each whole feature map is averaged down to a single number before the classifier.
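
Global average pooling is easy to sketch, since the "window" is simply the entire map. The helper name `globalAveragePool` and the three tiny 2×2 maps below are assumptions for illustration. A real network would have many more channels and larger maps.

```javascript
// Hypothetical helper: collapse each feature map (a 2D array) to its mean,
// producing one number per channel.
function globalAveragePool(featureMaps) {
  return featureMaps.map((map) => {
    const values = map.flat();
    return values.reduce((a, b) => a + b, 0) / values.length;
  });
}

// Three illustrative 2×2 feature maps, one per channel.
const maps = [
  [[1, 2], [3, 4]], // channel 0
  [[0, 0], [0, 8]], // channel 1
  [[5, 5], [5, 5]], // channel 2
];

const summary = globalAveragePool(maps);
console.log(summary); // [2.5, 2, 5]
```

The classifier then sees one value per channel, however large the maps were.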

The real point: translation tolerance

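Because pooling reports "feature present in region", shifting the input by a pixel often leaves the output unchanged, as long as the feature stays inside the same window. On its own that tolerance is small and local. Stacked across several pooling layers, and combined with the shared weights of convolution, it helps a CNN recognise a cat whether it's top-left or centre.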
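
A small experiment makes both the tolerance and its limit concrete. This is a sketch under assumptions: the helper names `maxPool2x2` and `mapWithSpike` are invented here, and the input is a toy 4×4 map of zeros with a single activation of 9.

```javascript
// 2×2 max pooling with stride 2 (illustrative helper, not a library call).
function maxPool2x2(img) {
  const out = [];
  for (let y = 0; y + 1 < img.length; y += 2) {
    const row = [];
    for (let x = 0; x + 1 < img[0].length; x += 2) {
      row.push(Math.max(img[y][x], img[y][x + 1], img[y + 1][x], img[y + 1][x + 1]));
    }
    out.push(row);
  }
  return out;
}

// Hypothetical helper: a size×size map of zeros with one activation of 9.
function mapWithSpike(size, spikeY, spikeX) {
  return Array.from({ length: size }, (_, y) =>
    Array.from({ length: size }, (_, x) => (y === spikeY && x === spikeX ? 9 : 0))
  );
}

const original = maxPool2x2(mapWithSpike(4, 0, 0));
const shiftedOne = maxPool2x2(mapWithSpike(4, 0, 1)); // still inside the same 2×2 window
const shiftedTwo = maxPool2x2(mapWithSpike(4, 0, 2)); // now in the next window over

console.log(original);   // [[9, 0], [0, 0]]
console.log(shiftedOne); // [[9, 0], [0, 0]]
console.log(shiftedTwo); // [[0, 9], [0, 0]]
```

A shift that stays within a window is invisible after pooling. A shift that crosses a window boundary moves the output by one cell, which is still a smaller change than the shift in the input.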

And it's cheap: no learnable weights, just a comparison or a sum per window. (Modern nets sometimes downsample with strided convolutions instead. Those do have learnable weights, but the intent, a smaller map, is the same.)
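
For comparison, here is a rough sketch of a strided convolution doing the same downsampling job. The function `stridedConv2x2` is hypothetical and deliberately minimal: one channel, a 2×2 kernel, stride 2, no padding, no bias. Strided layers in real networks often use other kernel sizes, so treat the 2×2 kernel as an assumption chosen to line up with the pooling window.

```javascript
// Hypothetical helper: 2×2 convolution with stride 2, one channel, no padding.
function stridedConv2x2(img, kernel) {
  const out = [];
  for (let y = 0; y + 1 < img.length; y += 2) {
    const row = [];
    for (let x = 0; x + 1 < img[0].length; x += 2) {
      row.push(
        img[y][x] * kernel[0][0] + img[y][x + 1] * kernel[0][1] +
        img[y + 1][x] * kernel[1][0] + img[y + 1][x + 1] * kernel[1][1]
      );
    }
    out.push(row);
  }
  return out;
}

// Illustrative 4×4 input (values chosen by hand).
const input = [
  [1, 3, 2, 0],
  [4, 2, 1, 1],
  [0, 0, 5, 6],
  [1, 2, 7, 8],
];

// With every weight fixed at 0.25, the layer reproduces 2×2 average pooling.
const averagingKernel = [[0.25, 0.25], [0.25, 0.25]];
const downsampled = stridedConv2x2(input, averagingKernel);
console.log(downsampled); // [[2.5, 1], [0.75, 6.5]]
```

In a trained network the four weights would be learned rather than fixed, so the layer can pick its own way of summarising each window. Average pooling is just one setting of those weights.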