DEV Community

Devanshu Biswas
Devanshu Biswas

Posted on

CNNs From Scratch: the Convolution That Sees Images

A 3×3 grid of numbers, slid across an image, becomes an edge detector. That's a convolution — the operation at the heart of every CNN. Pick a kernel and watch it extract features in real time.

🖼️ Slide kernels over an image: https://dev48v.infy.uk/dl/day8-cnn.html

Why images need a different layer

A dense layer needs a weight per pixel-to-neuron pair (millions for a tiny image) and has to re-learn a cat in every position. Images have structure — nearby pixels relate, and a feature looks the same wherever it appears. CNNs exploit that.

The convolution

for (let y=1; y<h-1; y++)
  for (let x=1; x<w-1; x++) {
    let sum = 0;
    for (let ky=-1; ky<=1; ky++)
      for (let kx=-1; kx<=1; kx++)
        sum += img[y+ky][x+kx] * kernel[ky+1][kx+1];
    out[y][x] = sum;
  }
Enter fullscreen mode Exit fullscreen mode

Slide a small kernel over every patch, multiply-and-sum → a "feature map" showing WHERE that feature occurs. An edge kernel lights up edges; blur averages neighbours.

Weight sharing — the killer idea

The SAME 9 weights are reused at every position. So instead of millions of params, a 3×3 kernel has 9 — and it detects its feature ANYWHERE in the image (translation invariance).

Stack + LEARN

Early layers detect edges; deeper layers combine them into shapes, then objects. And the network LEARNS every kernel's weights by gradient descent — you don't hand-pick "edge detector". That's a CNN.

Try the kernels.

Top comments (0)