Meclin A Francis

PyTorch from Scratch — Part 1: Tensors, Gradients & Activations

Most people use PyTorch without really knowing what's happening underneath. This series breaks the foundations down into the simplest possible explanations, one concept at a time, with code you can run and the exact inputs and outputs for each function.

This is Part 1 of 5. By the end you'll understand the five building blocks every neural network is made of: creating tensors, doing math on them, reshaping them, computing gradients, and bending them with activation functions.

No assumed knowledge. Let's go.


1. What a tensor actually is

Everything in deep learning is built from one object: the tensor. Don't let the name scare you — a tensor is just a box of numbers.

  • 1 number → a scalar
  • a row of numbers → a vector
  • a grid → a matrix
  • stacked grids → a tensor

An image is literally a 3D tensor: height × width × colour.
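
To make that ladder concrete, here is a small sketch that builds one tensor of each kind and prints how many dimensions it has. The variable names and the 4 × 6 image size are made up for illustration.

```python
import torch

scalar = torch.tensor(5.0)               # 0 dimensions: a single number
vector = torch.tensor([1.0, 2.0, 3.0])   # 1 dimension: a row of numbers
matrix = torch.zeros(2, 3)               # 2 dimensions: a grid
image = torch.zeros(4, 6, 3)             # 3 dimensions: height x width x colour

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("image", image)]:
    # .ndim is the number of dimensions, .shape is the size along each one
    print(name, t.ndim, tuple(t.shape))
```

One thing to keep in mind for later: PyTorch's convolution layers expect the colour dimension first (colour × height × width), so real image pipelines usually reorder the axes.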

The first skill is creating them — filled with zeros, ones, or any value you want. Then .tolist() reads the tensor back as a plain Python list.

import torch

def create_tensor(method, shape, value=0.0):
    if method == "zeros":
        t = torch.zeros(shape)
    elif method == "ones":
        t = torch.ones(shape)
    else:                       # "full"
        t = torch.full(shape, value)
    return t.tolist()

What goes in and what comes out:

create_tensor("zeros", [2, 3])        -> [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
create_tensor("full", [2, 2], 7.0)    -> [[7.0, 7.0], [7.0, 7.0]]

💡 Gotcha: a function only hands back a value if you write return. Forget it and your function silently returns None — one of the most common beginner bugs.
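
Here is that bug in miniature, in plain Python. Both function names are hypothetical, picked just for this demo.

```python
def double_forgot_return(x):
    result = x * 2          # computed, then thrown away: no return statement

def double(x):
    result = x * 2
    return result           # hands the value back to the caller

print(double_forgot_return(4))  # None
print(double(4))                # 8
```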


2. Doing math on tensors

There are two kinds of math you'll use constantly, and mixing them up is the #1 beginner mistake.

  1. Element-wise — same position meets same position. [1, 2, 3] + [4, 5, 6] = [5, 7, 9].
  2. Matrix multiplication (@): rows × columns. This one mixes values together, and it's the most-used operation in deep learning. Linear layers and attention are matmuls at their core.

import torch

def tensor_op(x, y, op):
    a = torch.tensor(x, dtype=torch.float32)
    b = torch.tensor(y, dtype=torch.float32)
    if op == "add":
        result = a + b
    elif op == "multiply":
        result = a * b
    elif op == "matmul":
        result = a @ b
    elif op == "power":
        result = a ** b
    else:                       # "max"
        result = torch.maximum(a, b)
    return result.tolist()

Input → output:

tensor_op([1,2,3], [4,5,6], "add")              -> [5.0, 7.0, 9.0]
tensor_op([[1,2],[3,4]], [[5,6],[7,8]], "matmul") -> [[19.0, 22.0], [43.0, 50.0]]

💡 Two traps: * is element-wise multiply and @ is matrix multiply, two completely different operations. And for the element-wise maximum of two tensors, use torch.maximum(a, b), not Python's built-in max(): on tensors with more than one element, max(a, b) raises an error instead of comparing position by position.
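
If "rows × columns" still feels abstract, this sketch spells out both kinds of multiply with plain Python lists and no PyTorch at all. The two helper functions are illustrative stand-ins, not how PyTorch implements them internally.

```python
def elementwise_multiply(a, b):
    """Same position meets same position: out[i][j] = a[i][j] * b[i][j]."""
    return [[a[i][j] * b[i][j] for j in range(len(a[0]))]
            for i in range(len(a))]

def matmul(a, b):
    """Rows of a times columns of b: out[i][j] = sum of a[i][k] * b[k][j] over k."""
    inner = len(b)  # must equal the number of columns in a
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(elementwise_multiply(a, b))  # [[5, 12], [21, 32]]
print(matmul(a, b))                # [[19, 22], [43, 50]]
```

Same two inputs, two very different answers. The matmul result matches what tensor_op returned for "matmul".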


3. Reshaping tensors

Reshaping means: same numbers, new shape. The data never changes — only how it's arranged. This matters because data arrives in one shape and the next layer expects another. Reshaping is the quiet glue holding a network together.

  • flatten → squash a grid into a single line
  • squeeze → drop useless size-1 dimensions
  • transpose (.T) → flip rows and columns

import torch

def reshape_tensor(x, op):
    t = torch.tensor(x, dtype=torch.float32)
    if op == "flatten":
        result = torch.flatten(t)
    elif op == "squeeze":
        result = torch.squeeze(t)
    else:                       # "transpose"
        result = t.T
    return result.tolist()

Input → output:

reshape_tensor([[1,2,3],[4,5,6]], "flatten")   -> [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
reshape_tensor([[1,2],[3,4]], "transpose")     -> [[1.0, 3.0], [2.0, 4.0]]

💡 Gotcha: else is a catch-all — it never takes a condition. Writing else op == "transpose": is a syntax error. Just else:.
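
To see "same numbers, new shape" for all three operations, including squeeze, this sketch prints only the shapes before and after. The tensors grid and padded are made-up inputs for the demo.

```python
import torch

grid = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])   # shape (2, 3)
padded = torch.tensor([[[1.0], [2.0], [3.0]]])             # shape (1, 3, 1)

shapes = {
    "grid": tuple(grid.shape),
    "flatten": tuple(torch.flatten(grid).shape),     # one line of 6 values
    "transpose": tuple(grid.T.shape),                # rows and columns swapped
    "padded": tuple(padded.shape),
    "squeeze": tuple(torch.squeeze(padded).shape),   # size-1 dimensions dropped
}
for name, shape in shapes.items():
    print(name, shape)
```

The squeezed tensor still holds 1.0, 2.0, 3.0. Only the two size-1 wrappers around them are gone.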


4. Autograd — the engine that trains everything

This is the most important idea in deep learning, and PyTorch does the hard part for you.

A gradient is just a slope. It answers one question: "if I nudge this input a little, does the output go up or down, and how steeply?" That slope is what tells a network which direction to adjust its weights to reduce its error.

The trick: mark a tensor with requires_grad=True, do your math, then call .backward(). PyTorch quietly records every operation and computes all the slopes automatically — no calculus by hand.

import torch

def compute_gradient(values):
    x = torch.tensor(values, dtype=torch.float32, requires_grad=True)
    y = (x**3 + 2*x).sum()      # collapse to a single number
    y.backward()                # walk backward, fill in the slopes
    return x.grad.tolist()

Input → output:

compute_gradient([1, 2, 3])    -> [5.0, 14.0, 29.0]

Why those numbers? The slope of x³ + 2x is 3x² + 2. At x = 1 that's 3(1) + 2 = 5. At x = 2 it's 3(4) + 2 = 14. At x = 3 it's 3(9) + 2 = 29. PyTorch produced the same answer as the hand-derived formula, automatically. That's the whole point: it works even for equations far too big to differentiate by hand.
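
You can also sanity-check those slopes without any calculus or any PyTorch by literally nudging the input, which is what "slope" means. This is a rough sketch using a central difference. The helper numerical_slope and the step size h are assumptions for the demo, and this is not how autograd works internally.

```python
def f(x):
    return x**3 + 2*x

def numerical_slope(func, x, h=1e-5):
    """Nudge x a little both ways and measure how much the output moves."""
    return (func(x + h) - func(x - h)) / (2 * h)

for x in [1.0, 2.0, 3.0]:
    approx = numerical_slope(f, x)
    exact = 3 * x**2 + 2          # the hand-derived formula
    print(x, round(approx, 4), exact)
```

The nudged estimates land on 5, 14 and 29 to within rounding, which agrees with what autograd returned.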

💡 Two gotchas: requires_grad needs a floating-point tensor (integer tensors can't track gradients), and unless you pass it an explicit gradient argument, .backward() must start from a single number. That's why we call .sum() first.
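
Both gotchas are easy to trigger on purpose. This sketch catches the errors so you can read them instead of crashing. The variable names int_error and vector_error are just for the demo.

```python
import torch

# Gotcha 1: an integer tensor cannot require gradients.
try:
    torch.tensor([1, 2, 3], requires_grad=True)
    int_error = None
except RuntimeError as err:
    int_error = str(err)

# Gotcha 2: calling .backward() on a result with more than one element.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
try:
    (x**3 + 2*x).backward()       # three numbers, not one
    vector_error = None
except RuntimeError as err:
    vector_error = str(err)

# The fix for both: float input, and collapse to a single number first.
(x**3 + 2*x).sum().backward()

print(int_error)
print(vector_error)
print(x.grad.tolist())            # [5.0, 14.0, 29.0]
```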


5. Activation functions — adding the bend

Here's a surprising fact: if you stack layers that each compute weight × input + bias, the whole stack collapses into a single layer of that same form, which is just one straight line, even if it's a hundred layers deep. A straight line can't model faces, language, or anything interesting.

Activation functions fix this by adding a bend (a "nonlinearity") after each layer. That bend is what lets a network learn curves and complex patterns.
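
Here is the collapse in the smallest possible setting: two "layers" with one weight and one bias each, in plain Python. The weights and biases are arbitrary numbers chosen for the demo.

```python
def layer(w, b, x):
    return w * x + b              # weight × input + bias

def relu(x):
    return max(0, x)

w1, b1 = 2, 1
w2, b2 = -3, 4

def two_layers(x):
    return layer(w2, b2, layer(w1, b1, x))

def one_layer(x):
    # The same function folded into a single weight and a single bias.
    return (w2 * w1) * x + (w2 * b1 + b2)

def two_layers_with_relu(x):
    return layer(w2, b2, relu(layer(w1, b1, x)))

xs = [-2, -1, 0, 1, 2]
print([two_layers(x) for x in xs])            # [13, 7, 1, -5, -11]
print([one_layer(x) for x in xs])             # [13, 7, 1, -5, -11]
print([two_layers_with_relu(x) for x in xs])  # [4, 4, 1, -5, -11]
```

Without the bend, two layers and one layer give identical outputs. With ReLU in between, the output goes flat on the left and then slopes down, a shape no single straight line can produce.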

The common ones:

  • ReLU → cut off negatives: max(0, x)
  • Sigmoid → squash any number into 0…1
  • Tanh → squash into −1…1
  • LeakyReLU → like ReLU, but lets a tiny bit of negatives through so neurons don't "die"

import torch

def activation(x, method):
    t = torch.tensor(x, dtype=torch.float32)
    if method == "relu":
        result = torch.clamp(t, min=0)
    elif method == "sigmoid":
        result = 1 / (1 + torch.exp(-t))
    elif method == "tanh":
        result = torch.tanh(t)
    else:                       # "leaky_relu"
        result = torch.where(t > 0, t, 0.01 * t)
    return result.tolist()

Input → output:

activation([-2,-1,0,1,2], "relu")     -> [0.0, 0.0, 0.0, 1.0, 2.0]
activation([-1,0,1], "sigmoid")       -> [0.269, 0.5, 0.731]   (rounded to 3 decimals)
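
PyTorch also ships these activations ready-made. As a quick check, this sketch compares the hand-written formulas from the function above against torch.relu, torch.sigmoid and torch.nn.functional.leaky_relu. The 0.01 slope is the same one used in the hand-written version.

```python
import torch
import torch.nn.functional as F

t = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

checks = {
    "relu": torch.allclose(torch.clamp(t, min=0), torch.relu(t)),
    "sigmoid": torch.allclose(1 / (1 + torch.exp(-t)), torch.sigmoid(t)),
    "leaky_relu": torch.allclose(torch.where(t > 0, t, 0.01 * t),
                                 F.leaky_relu(t, negative_slope=0.01)),
}
print(checks)  # every value should be True
```

Writing them by hand is great for understanding. In real models the built-ins are the usual choice.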

💡 Gotcha: in 1 / (1 + torch.exp(-t)), the parentheses matter. Without them, Python computes (1/1) + exp(-t) because division runs before addition. When a whole expression is the denominator, wrap it in brackets.
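
You can watch the precedence bug happen with a single number and Python's math module, no tensors needed. The variable names are only for the demo.

```python
import math

x = 1.0
with_parens = 1 / (1 + math.exp(-x))     # the real sigmoid of 1
without_parens = 1 / 1 + math.exp(-x)    # parsed as (1 / 1) + exp(-1)

print(round(with_parens, 3))     # 0.731
print(round(without_parens, 3))  # 1.368
```

A sigmoid can never go above 1, so a value like 1.368 is a quick sign that the brackets are missing.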


The big picture

Put it together and you have the entire core loop of deep learning:

numbers → make a guess → measure the error → compute the slopes → adjust → repeat

That's it. Tensors hold the numbers. Matrix multiply and activations make the guess. Autograd computes the slopes. Everything else — CNNs, transformers, LLMs — is a remix of these same five ideas.
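
As a preview of that loop, here is a deliberately tiny sketch: one weight, three data points, and the pieces from this post. The data, the squared-error measure, the learning rate of 0.05 and the 50 steps are all assumptions picked to keep the example small. It is not the network the series builds later.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])            # numbers
target = torch.tensor([2.0, 4.0, 6.0])       # the right answers (2 * x)
w = torch.tensor(0.0, requires_grad=True)    # the one weight to learn

learning_rate = 0.05
for step in range(50):                       # repeat
    guess = w * x                            # make a guess
    error = ((guess - target) ** 2).mean()   # measure the error
    error.backward()                         # compute the slope
    with torch.no_grad():
        w -= learning_rate * w.grad          # adjust, without recording it
    w.grad.zero_()                           # clear the slope for the next round

print(round(w.item(), 3))                    # close to 2.0
```

Starting from 0, the weight walks toward 2, the value that turns x into target.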

Coming in Part 2: we take these pieces and assemble them into a real, working neural network from scratch.

If this was useful, follow along — I'm building the whole thing in public, one part at a time.

🔗 I post the short version of each part on X: @Meclin_A_Francis
