A 3×3 grid of numbers, slid across an image, becomes an edge detector. That's a convolution — the operation at the heart of every CNN. Pick a kernel and watch it extract features in real time.
🖼️ Slide kernels over an image: https://dev48v.infy.uk/dl/day8-cnn.html
Why images need a different layer
A dense layer needs a weight per pixel-to-neuron pair (millions for a tiny image) and has to re-learn a cat in every position. Images have structure — nearby pixels relate, and a feature looks the same wherever it appears. CNNs exploit that.
The convolution
for (let y=1; y<h-1; y++)
for (let x=1; x<w-1; x++) {
let sum = 0;
for (let ky=-1; ky<=1; ky++)
for (let kx=-1; kx<=1; kx++)
sum += img[y+ky][x+kx] * kernel[ky+1][kx+1];
out[y][x] = sum;
}
Slide a small kernel over every patch, multiply-and-sum → a "feature map" showing WHERE that feature occurs. An edge kernel lights up edges; blur averages neighbours.
Weight sharing — the killer idea
The SAME 9 weights are reused at every position. So instead of millions of params, a 3×3 kernel has 9 — and it detects its feature ANYWHERE in the image (translation invariance).
Stack + LEARN
Early layers detect edges; deeper layers combine them into shapes, then objects. And the network LEARNS every kernel's weights by gradient descent — you don't hand-pick "edge detector". That's a CNN.
Top comments (0)