After a few conv layers you're drowning in feature maps — too big and slow. Pooling shrinks them while keeping the signal, and it has zero parameters. Four numbers in, one out.
🪟 Max vs average pooling: https://dev48v.infy.uk/dl/day9-pooling.html
The operation
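// img is an h×w array of numbers; a trailing odd row or column is skipped
const out = [];
for (let y = 0; y + 1 < h; y += 2) { // stride 2: hop, don't slide
  const row = [];
  for (let x = 0; x + 1 < w; x += 2) {
    const win = [img[y][x], img[y][x+1], img[y+1][x], img[y+1][x+1]];
    row.push(reduce(win)); // 2×2 → 1 value, halves both dims
  }
  out.push(row);
}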
Max pool — keep the strongest
const reduce = (w) => Math.max(...w);
A feature detector asks "is this feature here?" — max pooling answers "yes, somewhere in this region it fired", keeping the feature's presence while discarding its exact location.
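To make that concrete, here is a minimal sketch of max pooling on a made-up 4×4 feature map. The helper name `pool2x2` and the sample values are illustrative assumptions, not part of the original post, and the helper simply drops a trailing odd row or column.

```javascript
// Hypothetical helper: 2×2 pooling with stride 2 over a 2D array of numbers.
// Assumption: a trailing odd row or column is dropped rather than padded.
function pool2x2(img, reduce) {
  const out = [];
  for (let y = 0; y + 1 < img.length; y += 2) {
    const row = [];
    for (let x = 0; x + 1 < img[y].length; x += 2) {
      row.push(reduce([img[y][x], img[y][x + 1], img[y + 1][x], img[y + 1][x + 1]]));
    }
    out.push(row);
  }
  return out;
}

const maxOf = (w) => Math.max(...w);

// Made-up 4×4 feature map: each 2×2 block has one value that stands out
const featureMap = [
  [1, 3, 0, 0],
  [2, 9, 0, 1],
  [0, 0, 5, 4],
  [0, 7, 6, 2],
];

const pooled = pool2x2(featureMap, maxOf);
console.log(pooled); // [[9, 1], [7, 6]]
```

Each output cell keeps only the strongest response in its window, so the 9 survives but the fact that it sat in the bottom-right corner of its block does not.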
Average pool — smooth it
const reduce = (w) => w.reduce((a,b)=>a+b) / w.length;
Average pooling is common at the end of a network, as global average pooling: each feature map is collapsed to a single number before the classifier.
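As a rough sketch of the global variant, the window grows to cover the whole map, so every channel ends up as one number. The function name `globalAveragePool` and the two tiny maps are assumptions for illustration only.

```javascript
// Hypothetical helper: collapse one 2D feature map to its mean value.
function globalAveragePool(map) {
  const values = map.flat();
  return values.reduce((a, b) => a + b, 0) / values.length;
}

// Two made-up 2×2 feature maps, one per channel
const channelMaps = [
  [[1, 3], [5, 7]],
  [[0, 0], [2, 2]],
];

// One summary number per channel, ready to feed a classifier
const summary = channelMaps.map(globalAveragePool);
console.log(summary); // [4, 1]
```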
The real point: translation tolerance
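Because pooling reports "feature present in region", a one-pixel shift in the input often leaves the output unchanged, as long as the strongest activation stays inside the same window. Stacked over several layers, and combined with the convolution's shared weights, that tolerance is part of why a CNN recognises a cat whether it's top-left or centre.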
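A small experiment shows where that tolerance holds and where it stops. This is a sketch under assumptions: `maxPool2x2` and `withPeakAt` are hypothetical helpers, and the map is a made-up 4×4 grid with a single strong activation.

```javascript
// Hypothetical helper: 2×2 max pooling with stride 2 (odd trailing row/column dropped).
function maxPool2x2(img) {
  const out = [];
  for (let y = 0; y + 1 < img.length; y += 2) {
    const row = [];
    for (let x = 0; x + 1 < img[y].length; x += 2) {
      row.push(Math.max(img[y][x], img[y][x + 1], img[y + 1][x], img[y + 1][x + 1]));
    }
    out.push(row);
  }
  return out;
}

// Made-up 4×4 map with one strong activation at column `col` of the top row
const withPeakAt = (col) => {
  const m = Array.from({ length: 4 }, () => [0, 0, 0, 0]);
  m[0][col] = 9;
  return m;
};

const original = maxPool2x2(withPeakAt(0)); // peak in the left window
const nudged = maxPool2x2(withPeakAt(1));   // shifted one pixel, still the same window
const crossed = maxPool2x2(withPeakAt(2));  // shifted again, now in the next window

console.log(original); // [[9, 0], [0, 0]]
console.log(nudged);   // [[9, 0], [0, 0]]
console.log(crossed);  // [[0, 9], [0, 0]]
```

The first shift is invisible after pooling; the second moves the peak across a window boundary, so the output changes. The tolerance is local, not unlimited.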
And it's free: no learnable weights. (Modern nets sometimes use strided convolutions instead. The goal, downsampling, is the same, but those do add weights to learn.)
The takeaway
Hop in 2×2s, keep the max → smaller maps, position-tolerant, zero params. Pool a grid.