<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Viswa M</title>
    <description>The latest articles on DEV Community by Viswa M (@viswa_m_09).</description>
    <link>https://dev.to/viswa_m_09</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3824317%2Fcdcb3351-6bbf-4ac6-a04b-e538dc6e9d79.png</url>
      <title>DEV Community: Viswa M</title>
      <link>https://dev.to/viswa_m_09</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/viswa_m_09"/>
    <language>en</language>
    <item>
      <title>Understanding a Tiny Two‑Layer Neural Network that Learns XOR</title>
      <dc:creator>Viswa M</dc:creator>
      <pubDate>Fri, 27 Mar 2026 18:48:40 +0000</pubDate>
      <link>https://dev.to/viswa_m_09/understanding-a-tiny-two-layer-neural-network-that-learns-xor-19f9</link>
      <guid>https://dev.to/viswa_m_09/understanding-a-tiny-two-layer-neural-network-that-learns-xor-19f9</guid>
      <description>&lt;h2&gt;
  
  
  Tiny Two‑Layer Neural Network that Learns XOR
&lt;/h2&gt;





&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;The exclusive‑or (XOR) problem is a classic benchmark for neural networks. It is easy to describe, but a single linear neuron cannot solve it. In this post we walk through a compact NumPy implementation of a two‑layer (one hidden layer) network that learns the XOR truth table from scratch. You will see how the data are prepared, how the parameters are initialized, how the forward and backward passes are performed, and why the hidden layer is essential.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Program Does – In a Nutshell
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Builds a tiny feed‑forward network with one hidden layer of four sigmoid units.
&lt;/li&gt;
&lt;li&gt;Trains it on the four possible binary inputs of XOR using gradient descent.
&lt;/li&gt;
&lt;li&gt;After 10 000 epochs the network’s predictions are close to the target values (≈ 0 for &lt;strong&gt;False&lt;/strong&gt;, ≈ 1 for &lt;strong&gt;True&lt;/strong&gt;).
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The printed output after training looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[[0.02]
 [0.97]
 [0.96]
 [0.03]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These correspond to the XOR results for the inputs (0,0), (0,1), (1,0) and (1,1).&lt;/p&gt;




&lt;h2&gt;
  
  
  Data Preparation – The XOR Truth Table
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x1  x2  XOR
0   0   0
0   1   1
1   0   1
1   1   0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the code the inputs are stored in a &lt;code&gt;4 × 2&lt;/code&gt; NumPy array &lt;code&gt;X&lt;/code&gt; and the targets in a &lt;code&gt;4 × 1&lt;/code&gt; array &lt;code&gt;y&lt;/code&gt;.&lt;/p&gt;
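&lt;p&gt;A minimal sketch of that setup (the array contents follow the truth table above):&lt;/p&gt;

```python
import numpy as np

# XOR inputs: one row per example, one column per input bit
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]], dtype=float)            # shape (4, 2)

# XOR targets as a column vector
y = np.array([[0], [1], [1], [0]], dtype=float)  # shape (4, 1)

print(X.shape, y.shape)   # (4, 2) (4, 1)
```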




&lt;h2&gt;
  
  
  Parameter Initialization – Weights and Biases
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# input → hidden (2 inputs, 4 hidden units)
&lt;/span&gt;&lt;span class="n"&gt;B1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;        &lt;span class="c1"&gt;# bias for each hidden unit
&lt;/span&gt;
&lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# hidden → output (4 hidden, 1 output)
&lt;/span&gt;&lt;span class="n"&gt;B2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;        &lt;span class="c1"&gt;# bias for the output unit
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Random weights break symmetry; zero biases are a simple, common choice.&lt;/p&gt;
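&lt;p&gt;If you want runs to be repeatable, you can seed NumPy before drawing the weights (optional, not in the original code; the seed value here is arbitrary):&lt;/p&gt;

```python
import numpy as np

np.random.seed(42)          # fix the RNG so the initial weights are repeatable

W1 = np.random.randn(2, 4)  # input -> hidden (2 inputs, 4 hidden units)
B1 = np.zeros((1, 4))       # bias for each hidden unit
W2 = np.random.randn(4, 1)  # hidden -> output (4 hidden, 1 output)
B2 = np.zeros((1, 1))       # bias for the output unit

print(W1.shape, W2.shape)   # (2, 4) (4, 1)
```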




&lt;h2&gt;
  
  
  Hyper‑parameters – Epochs and Learning Rate
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;   &lt;span class="c1"&gt;# number of full passes over the training set
&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;    &lt;span class="c1"&gt;# step size for gradient descent
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More epochs give the network time to converge; the learning rate controls how large each update is.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sigmoid Activation and Its Derivative
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sigmoid_derivative&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# a = sigmoid(z); derivative = a * (1 - a)
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sigmoid maps any real number to the interval &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%25280%252C%25E2%2580%25AF1%2529" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%25280%252C%25E2%2580%25AF1%2529" alt="(0, 1)" width="42" height="13"&gt;&lt;/a&gt;. Its derivative can be expressed directly in terms of the activation, which keeps the back‑propagation code concise.&lt;/p&gt;
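&lt;p&gt;The identity σ′(z) = σ(z)(1 − σ(z)) is easy to confirm numerically with a central finite difference (a quick sanity check, not part of the original program):&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    # a is an activation, i.e. a = sigmoid(z)
    return a * (1 - a)

z = 0.7
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_derivative(sigmoid(z))
print(abs(numeric - analytic))   # tiny, on the order of 1e-11 or smaller
```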




&lt;h2&gt;
  
  
  Training Loop – Forward Pass, Loss, Back‑Propagation, and Gradient Descent
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;train_network&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;                     &lt;span class="c1"&gt;# number of training examples (4)
&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# ---- forward pass -------------------------------------------------
&lt;/span&gt;        &lt;span class="n"&gt;Z1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;B1&lt;/span&gt;
        &lt;span class="n"&gt;A1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Z1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;Z2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;B2&lt;/span&gt;
        &lt;span class="n"&gt;A2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Z2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# predictions
&lt;/span&gt;
        &lt;span class="c1"&gt;# ---- error at output (MSE loss) -----------------------------------
&lt;/span&gt;        &lt;span class="n"&gt;DZ2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;A2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;                    &lt;span class="c1"&gt;# ∂L/∂Z2
&lt;/span&gt;
        &lt;span class="c1"&gt;# ---- gradients for output layer ------------------------------------
&lt;/span&gt;        &lt;span class="n"&gt;DW2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DZ2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;DB2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DZ2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# ---- back‑propagation to hidden layer -------------------------------
&lt;/span&gt;        &lt;span class="n"&gt;DA1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DZ2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;DZ1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DA1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;sigmoid_derivative&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;DW1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DZ1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;DB1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DZ1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdims&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# ---- gradient‑descent update ----------------------------------------
&lt;/span&gt;        &lt;span class="n"&gt;W2&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DW2&lt;/span&gt;
        &lt;span class="n"&gt;B2&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DB2&lt;/span&gt;
        &lt;span class="n"&gt;W1&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DW1&lt;/span&gt;
        &lt;span class="n"&gt;B1&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;DB1&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What the Loop Does
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Forward pass&lt;/strong&gt; – computes hidden activations &lt;code&gt;A1&lt;/code&gt; and final output &lt;code&gt;A2&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loss&lt;/strong&gt; – mean‑squared error; its gradient w.r.t. the output activation &lt;code&gt;A2&lt;/code&gt; is &lt;code&gt;A2 - y&lt;/code&gt;, which the code uses directly as the error term &lt;code&gt;DZ2&lt;/code&gt; (the σ′ factor is dropped, as discussed in the derivation section).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Back‑propagation&lt;/strong&gt; – uses the chain rule to obtain gradients for every parameter.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradient descent&lt;/strong&gt; – moves each weight and bias opposite to its gradient, scaled by &lt;code&gt;lr&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All operations are vectorised, so the training runs without explicit Python loops over the four examples.&lt;/p&gt;
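&lt;p&gt;For a rough sense of scale: the mean‑squared error of the final predictions shown earlier is already well below one percent (the prediction values are copied from the output above):&lt;/p&gt;

```python
import numpy as np

y  = np.array([[0], [1], [1], [0]], dtype=float)
A2 = np.array([[0.02], [0.97], [0.96], [0.03]])  # trained predictions from the post

loss = np.mean((A2 - y) ** 2)
print(loss)   # ≈ 0.00095
```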




&lt;h2&gt;
  
  
  Mathematical Derivation of the Gradients
&lt;/h2&gt;

&lt;p&gt;Written in matrix form for the whole batch of examples, the network equations are  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cbegin%257Baligned%257D%250AZ%255E%257B%25281%2529%257D%2520%2526%253D%2520XW%255E%257B%25281%2529%257D%2520%252B%2520b%255E%257B%25281%2529%257D%2520%255C%255C%250AA%255E%257B%25281%2529%257D%2520%2526%253D%2520%255Csigma%255C%2521%255Cbigl%2528Z%255E%257B%25281%2529%257D%255Cbigr%2529%2520%255C%255C%250AZ%255E%257B%25282%2529%257D%2520%2526%253D%2520A%255E%257B%25281%2529%257DW%255E%257B%25282%2529%257D%2520%252B%2520b%255E%257B%25282%2529%257D%2520%255C%255C%250A%255Chat%257By%257D%253DA%255E%257B%25282%2529%257D%2520%2526%253D%2520%255Csigma%255C%2521%255Cbigl%2528Z%255E%257B%25282%2529%257D%255Cbigr%2529%250A%255Cend%257Baligned%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cbegin%257Baligned%257D%250AZ%255E%257B%25281%2529%257D%2520%2526%253D%2520XW%255E%257B%25281%2529%257D%2520%252B%2520b%255E%257B%25281%2529%257D%2520%255C%255C%250AA%255E%257B%25281%2529%257D%2520%2526%253D%2520%255Csigma%255C%2521%255Cbigl%2528Z%255E%257B%25281%2529%257D%255Cbigr%2529%2520%255C%255C%250AZ%255E%257B%25282%2529%257D%2520%2526%253D%2520A%255E%257B%25281%2529%257DW%255E%257B%25282%2529%257D%2520%252B%2520b%255E%257B%25282%2529%257D%2520%255C%255C%250A%255Chat%257By%257D%253DA%255E%257B%25282%2529%257D%2520%2526%253D%2520%255Csigma%255C%2521%255Cbigl%2528Z%255E%257B%25282%2529%257D%255Cbigr%2529%250A%255Cend%257Baligned%257D" alt="math" width="147" height="82"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The loss (mean‑squared‑error) is  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cmathcal%257BL%257D%253D%2520%255Cfrac%257B1%257D%257B2n%257D%255Csum_%257Bi%253D1%257D%255E%257Bn%257D%2528%255Chat%257By%257D_i-y_i%2529%255E2%2520." class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cmathcal%257BL%257D%253D%2520%255Cfrac%257B1%257D%257B2n%257D%255Csum_%257Bi%253D1%257D%255E%257Bn%257D%2528%255Chat%257By%257D_i-y_i%2529%255E2%2520." alt="math" width="120" height="36"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Derivative w.r.t. the output activation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520A%255E%257B%25282%2529%257D%257D%2520%253D%2520%255Chat%257By%257D%2520-%2520y%2520." class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520A%255E%257B%25282%2529%257D%257D%2520%253D%2520%255Chat%257By%257D%2520-%2520y%2520." alt="math" width="82" height="28"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Csigma%2527%2528z%2529%253D%255Csigma%2528z%2529%25281-%255Csigma%2528z%2529%2529" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Csigma%2527%2528z%2529%253D%255Csigma%2528z%2529%25281-%255Csigma%2528z%2529%2529" alt="\sigma'(z)=\sigma(z)(1-\sigma(z))" width="127" height="14"&gt;&lt;/a&gt;,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520Z%255E%257B%25282%2529%257D%257D%2520%253D%2520%2528%255Chat%257By%257D-y%2529%255Codot%250A%2520%2520%2520%255Csigma%255C%2521%255Cbigl%2528Z%255E%257B%25282%2529%257D%255Cbigr%2529%255Cbigl%25281-%255Csigma%255C%2521%255Cbigl%2528Z%255E%257B%25282%2529%257D%255Cbigr%2529%255Cbigr%2529%2520." class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520Z%255E%257B%25282%2529%257D%257D%2520%253D%2520%2528%255Chat%257By%257D-y%2529%255Codot%250A%2520%2520%2520%255Csigma%255C%2521%255Cbigl%2528Z%255E%257B%25282%2529%257D%255Cbigr%2529%255Cbigl%25281-%255Csigma%255C%2521%255Cbigl%2528Z%255E%257B%25282%2529%257D%255Cbigr%2529%255Cbigr%2529%2520." alt="math" width="226" height="28"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the code the factor &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Csigma%2527%2528Z%255E%257B%25282%2529%257D%2529" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Csigma%2527%2528Z%255E%257B%25282%2529%257D%2529" alt="\sigma'(Z^{(2)})" width="42" height="15"&gt;&lt;/a&gt; is omitted and &lt;code&gt;DZ2 = A2 - y&lt;/code&gt; is used directly. Strictly speaking, &lt;code&gt;A2 - y&lt;/code&gt; is the exact gradient of the binary cross‑entropy loss with respect to &lt;code&gt;Z2&lt;/code&gt;, not of the MSE, so the code is best read as following that loss. This is a common and deliberate simplification: the update still pushes each output toward its target, and it avoids the vanishing factor that the sigmoid derivative introduces when the output saturates near 0 or 1.&lt;/p&gt;

&lt;p&gt;Back‑propagating to the hidden layer:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cbegin%257Baligned%257D%250A%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520A%255E%257B%25281%2529%257D%257D%2520%2526%253D%2520%250A%2520%2520%2520%2520%2520%2520%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520Z%255E%257B%25282%2529%257D%257D%2520W%255E%257B%25282%2529%255Ctop%257D%2520%255C%255C%250A%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520Z%255E%257B%25281%2529%257D%257D%2520%2526%253D%2520%250A%2520%2520%2520%2520%2520%2520%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520A%255E%257B%25281%2529%257D%257D%2520%255Codot%250A%2520%2520%2520%2520%2520%2520%255Csigma%2527%255C%2521%255Cbigl%2528Z%255E%257B%25281%2529%257D%255Cbigr%2529%2520.%250A%255Cend%257Baligned%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cbegin%257Baligned%257D%250A%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520A%255E%257B%25281%2529%257D%257D%2520%2526%253D%2520%250A%2520%2520%2520%2520%2520%2520%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520Z%255E%257B%25282%2529%257D%257D%2520W%255E%257B%25282%2529%255Ctop%257D%2520%255C%255C%250A%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520Z%255E%257B%25281%2529%257D%257D%2520%2526%253D%2520%250A%2520%2520%2520%2520%2520%2520%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520A%255E%257B%25281%2529%257D%257D%2520%255Codot%250A%2520%2520%2520%2520%2520%2520%255Csigma%2527%255C%2521%255Cbigl%2528Z%255E%257B%25281%2529%257D%255Cbigr%2529%2520.%250A%255Cend%257Baligned%257D" alt="math" width="145" height="60"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Averaged over the batch, the gradients for the weight matrices and biases are  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cbegin%257Baligned%257D%250A%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520W%255E%257B%25282%2529%257D%257D%2520%2526%253D%2520%250A%2520%2520%2520%2520%2520%2520%255Cfrac%257B1%257D%257Bn%257D%2520A%255E%257B%25281%2529%255Ctop%257D%2520%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520Z%255E%257B%25282%2529%257D%257D%2520%255C%255C%250A%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520b%255E%257B%25282%2529%257D%257D%2520%2526%253D%2520%250A%2520%2520%2520%2520%2520%2520%255Cfrac%257B1%257D%257Bn%257D%255Csum_i%2520%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520Z%255E%257B%25282%2529%257D_i%257D%2520%255C%255C%250A%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520W%255E%257B%25281%2529%257D%257D%2520%2526%253D%2520%250A%2520%2520%2520%2520%2520%2520%255Cfrac%257B1%257D%257Bn%257D%2520X%255E%255Ctop%2520%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520Z%255E%257B%25281%2529%257D%257D%2520%255C%255C%250A%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520b%255E%257B%25281%2529%257D%257D%2520%2526%253D%2520%250A%2520%2520%2520%2520%2520%2520%255Cfrac%257B1%257D%257Bn%257D%255Csum_i%2520%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520Z%255E%257B%25281%2529%257D_i%257D%2520.%250A%255Cend%257Baligned%257D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cbegin%257Baligned%257D%250A%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520W%255E%257B%25282%2529%257D%257D%2520%2526%253D%2520%250A%2520%2520%2520%2520%2520%2520%255Cfrac%257B1%257D%257Bn%257D%2520A%255E%257B%25281%2529%255Ctop%257D%2520%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520Z%255E%257B%25282%2529%257D%257D%2520%255C%255C%250A%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520b%255E%257B%25282%2529%257D%257D%2520%2526%253D%2520%250A%2520%2520%2520%2520%2520%2520%255Cfrac%257B1%257D%257Bn%257D%255Csum_i%2520%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520Z%255E%257B%25282%2529%257D_i%257D%2520%255C%255C%250A%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520W%255E%257B%25281%2529%257D%257D%2520%2526%253D%2520%250A%2520%2520%2520%2520%2520%2520%255Cfrac%257B1%257D%257Bn%257D%2520X%255E%255Ctop%2520%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520Z%255E%257B%25281%2529%257D%257D%2520%255C%255C%250A%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520b%255E%257B%25281%2529%257D%257D%2520%2526%253D%2520%250A%2520%2520%2520%2520%2520%2520%255Cfrac%257B1%257D%257Bn%257D%255Csum_i%2520%255Cfrac%257B%255Cpartial%255Cmathcal%2520L%257D%257B%255Cpartial%2520Z%255E%257B%25281%2529%257D_i%257D%2520.%250A%255Cend%257Baligned%257D" alt="math" width="126" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These formulas correspond exactly to the NumPy statements in the training loop.&lt;/p&gt;
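&lt;p&gt;You can verify these formulas against finite differences. The sketch below checks the &lt;code&gt;W2&lt;/code&gt; gradient of the MSE loss, with the σ′ factor kept as in the derivation above (the helper names here are illustrative, not from the original code):&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])
W1, B1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, B2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
n = X.shape[0]

def loss(W2_):
    # forward pass with a candidate W2; MSE with the 1/(2n) convention
    A1 = sigmoid(X @ W1 + B1)
    A2 = sigmoid(A1 @ W2_ + B2)
    return np.sum((A2 - y) ** 2) / (2 * n)

# analytic gradient, sigma' factor included as in the derivation
A1 = sigmoid(X @ W1 + B1)
A2 = sigmoid(A1 @ W2 + B2)
DZ2 = (A2 - y) * A2 * (1 - A2)
DW2 = (1 / n) * A1.T @ DZ2

# central finite differences, one entry of W2 at a time
eps = 1e-6
numeric = np.zeros_like(W2)
for i in range(W2.shape[0]):
    Wp, Wm = W2.copy(), W2.copy()
    Wp[i, 0] += eps
    Wm[i, 0] -= eps
    numeric[i, 0] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(numeric - DW2)))   # very small: the formulas agree
```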




&lt;h2&gt;
  
  
  Training the Network
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_network&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After 10 000 passes the parameters have been adjusted so that the network produces a high confidence (~1) for the true XOR cases and a low confidence (~0) for the false cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Forward Pass – Inspecting the Predictions
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Z1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;B1&lt;/span&gt;
&lt;span class="n"&gt;A1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Z1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Z2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;B2&lt;/span&gt;
&lt;span class="n"&gt;A2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Z2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Typical output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[[0.02]
 [0.97]
 [0.96]
 [0.03]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Interpretation&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input&lt;/th&gt;
&lt;th&gt;Network output&lt;/th&gt;
&lt;th&gt;XOR truth&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(0,0)&lt;/td&gt;
&lt;td&gt;0.02 → &lt;strong&gt;False&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(0,1)&lt;/td&gt;
&lt;td&gt;0.97 → &lt;strong&gt;True&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(1,0)&lt;/td&gt;
&lt;td&gt;0.96 → &lt;strong&gt;True&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(1,1)&lt;/td&gt;
&lt;td&gt;0.03 → &lt;strong&gt;False&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The network has successfully learned the XOR mapping.&lt;/p&gt;
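&lt;p&gt;To turn these soft probabilities into hard 0/1 predictions, a common convention (an add‑on, not shown in the article’s code) is to round at the 0.5 threshold:&lt;/p&gt;

```python
import numpy as np

# Outputs copied from the "Typical output" block above
A2 = np.array([[0.02], [0.97], [0.96], [0.03]])

# Round each probability to the nearest class label (threshold at 0.5)
labels = np.round(A2).astype(int)
print(labels.ravel())  # [0 1 1 0], matching the XOR truth table
```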




&lt;h2&gt;
  
  
  Core Machine‑Learning Concepts Explained
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Neuron (linear part)&lt;/strong&gt; – computes a weighted sum of inputs plus a bias.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Activation function&lt;/strong&gt; – adds non‑linearity; sigmoid maps to &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%25280%252C%25E2%2580%25AF1%2529" alt="(0, 1)" width="42" height="13"&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loss function&lt;/strong&gt; – measures prediction error; we use mean squared error (MSE).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradient&lt;/strong&gt; – direction of steepest increase of the loss; we move opposite to it.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Back‑propagation&lt;/strong&gt; – systematic use of the chain rule to compute all gradients efficiently.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradient descent&lt;/strong&gt; – updates parameters by a small step proportional to the negative gradient.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Epoch&lt;/strong&gt; – one full sweep over the training set; multiple epochs let the model converge.&lt;/li&gt;
&lt;/ul&gt;
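&lt;p&gt;The gradient‑descent bullet above can be made concrete on a one‑parameter toy problem. The quadratic loss here is an illustrative assumption, not the network’s MSE, but the update rule is identical:&lt;/p&gt;

```python
# Minimal gradient-descent sketch on a toy loss L(w) = (w - 3)**2,
# whose minimum is at w = 3 and whose gradient is dL/dw = 2*(w - 3).
w = 0.0       # initial parameter
lr = 0.1      # learning rate

for _ in range(100):          # 100 update steps
    grad = 2 * (w - 3)        # gradient of the loss at the current w
    w -= lr * grad            # move opposite to the gradient

print(round(w, 4))  # converges to 3.0, the minimum
```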




&lt;h2&gt;
  
  
  Why a Hidden Layer Is Necessary for XOR
&lt;/h2&gt;

&lt;p&gt;A single linear neuron computes  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Chat%257By%257D%253D%2520%255Csigma%255C%2521%255Cbigl%2528w_1%2520x_1%2520%252B%2520w_2%2520x_2%2520%252B%2520b%255Cbigr%2529%2520." class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Chat%257By%257D%253D%2520%255Csigma%255C%2521%255Cbigl%2528w_1%2520x_1%2520%252B%2520w_2%2520x_2%2520%252B%2520b%255Cbigr%2529%2520." alt="math" width="138" height="16"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Its decision boundary is a straight line in the &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%2528x_1%252Cx_2%2529" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%2528x_1%252Cx_2%2529" alt="(x_1,x_2)" width="39" height="13"&gt;&lt;/a&gt; plane. XOR’s positive points &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%25280%252C1%2529" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%25280%252C1%2529" alt="(0,1)" width="26" height="13"&gt;&lt;/a&gt; and &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%25281%252C0%2529" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%25281%252C0%2529" alt="(1,0)" width="26" height="13"&gt;&lt;/a&gt; are diagonally opposite; no straight line can separate them from the negative points &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%25280%252C0%2529" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%25280%252C0%2529" alt="(0,0)" width="28" height="13"&gt;&lt;/a&gt; and &lt;a 
href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%25281%252C1%2529" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%25281%252C1%2529" alt="(1,1)" width="28" height="13"&gt;&lt;/a&gt;.  &lt;/p&gt;

&lt;p&gt;Adding a hidden layer with sigmoid units creates intermediate features such as “&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fx_1%2520%255Cneq%2520x_2" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fx_1%2520%255Cneq%2520x_2" alt="x_1 \neq x_2" width="42" height="12"&gt;&lt;/a&gt;”. After training, some hidden neurons fire only for the mixed inputs, enabling the final linear combination to separate the two classes. Thus a single hidden layer gives the network a non‑linear decision surface that can represent XOR, demonstrating the power of depth.&lt;/p&gt;
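&lt;p&gt;A quick way to see this is to plug in hand‑picked (not learned) weights in which one hidden unit behaves like OR and the other like NAND; the output unit then ANDs them, which is exactly XOR. A trained network will find different numbers, but of the same qualitative shape:&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hand-picked illustrative weights (not produced by training):
# hidden unit 1 acts like OR(x1, x2), hidden unit 2 like NAND(x1, x2),
# and the output unit like AND(h1, h2), giving XOR overall.
W1 = np.array([[ 20.0, -20.0],
               [ 20.0, -20.0]])
B1 = np.array([[-10.0,  30.0]])
W2 = np.array([[20.0],
               [20.0]])
B2 = np.array([[-30.0]])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
A1 = sigmoid(np.dot(X, W1) + B1)   # hidden features
A2 = sigmoid(np.dot(A1, W2) + B2)  # final prediction
print(np.round(A2, 3).ravel())  # [0. 1. 1. 0.]
```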




&lt;h2&gt;
  
  
  Complete Code Implementation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# ---- data -------------------------------------------------
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])                     # shape (4, 2)

y = np.array([[0], [1], [1], [0]])   # shape (4, 1)

# ---- parameter initialization -----------------------------
W1 = np.random.randn(2, 4)
B1 = np.zeros((1, 4))

W2 = np.random.randn(4, 1)
B2 = np.zeros((1, 1))

# ---- hyper‑parameters --------------------------------------
epochs = 10000
lr = 0.1

# ---- helper functions --------------------------------------
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(a):
    return a * (1 - a)

# ---- training function -------------------------------------
def train_network(X, y, W1, W2, B1, B2, epochs, lr):
    n = X.shape[0]

    for _ in range(epochs):
        # forward
        Z1 = np.dot(X, W1) + B1
        A1 = sigmoid(Z1)

        Z2 = np.dot(A1, W2) + B2
        A2 = sigmoid(Z2)

        # backward: gradients of the MSE loss via the chain rule
        dZ2 = (A2 - y) * sigmoid_derivative(A2)
        dW2 = np.dot(A1.T, dZ2) / n
        dB2 = np.sum(dZ2, axis=0, keepdims=True) / n

        dZ1 = np.dot(dZ2, W2.T) * sigmoid_derivative(A1)
        dW1 = np.dot(X.T, dZ1) / n
        dB1 = np.sum(dZ1, axis=0, keepdims=True) / n

        # gradient descent updates
        W1 -= lr * dW1
        B1 -= lr * dB1
        W2 -= lr * dW2
        B2 -= lr * dB2

    return W1, B1, W2, B2

# ---- train -------------------------------------------------
W1, B1, W2, B2 = train_network(X, y, W1, W2, B1, B2, epochs, lr)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>xor</category>
      <category>neuralnetwork</category>
      <category>twolayer</category>
      <category>numpy</category>
    </item>
    <item>
      <title>🚀 A Gentle Walk‑Through of Logistic Regression in Python</title>
      <dc:creator>Viswa M</dc:creator>
      <pubDate>Fri, 27 Mar 2026 18:43:20 +0000</pubDate>
      <link>https://dev.to/viswa_m_09/a-gentle-walk-through-of-logistic-regression-in-python-1011</link>
      <guid>https://dev.to/viswa_m_09/a-gentle-walk-through-of-logistic-regression-in-python-1011</guid>
      <description>&lt;h1&gt;
  
  
  🚀 A Gentle Walk‑Through of Logistic Regression in Python
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Meta description&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Learn logistic regression in Python from scratch using NumPy. Step‑by‑step guide to build, train, and predict without heavy libraries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
logisticregression, python, numpy, machinelearning, dataanalysis, classification, gradientdescent, crossentropy, sigmoid, tutorial&lt;/p&gt;


&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;When you think of &lt;em&gt;classification&lt;/em&gt;, imagine questions like “Is this email spam?” or “Will this customer churn?” The answer is a binary label (&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F1%253D%2520%255Ctext%257Byes%257D%252C%25200%253D%2520%255Ctext%257Bno%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F1%253D%2520%255Ctext%257Byes%257D%252C%25200%253D%2520%255Ctext%257Bno%257D" alt="1= \text{yes}, 0= \text{no}" width="84" height="11"&gt;&lt;/a&gt;). Logistic regression turns a linear model into a probability estimate, allowing us to quantify confidence in the decision. Because it relies on a simple sigmoid function, we can write the whole algorithm in a few lines while preserving intuition.&lt;/p&gt;


&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt;: features &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cmathbf%257BX%257D" alt="\mathbf{X}" width="10" height="9"&gt;, binary labels &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cmathbf%257By%257D" alt="\mathbf{y}" width="7" height="8"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameters&lt;/strong&gt;: a scalar weight &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fm" alt="m" width="11" height="6"&gt; and bias &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fb" alt="b" width="5" height="9"&gt; for one feature; a vector &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cmathbf%257BW%257D" alt="\mathbf{W}" width="15" height="9"&gt; and bias &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fb" alt="b" width="5" height="9"&gt; for many
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training&lt;/strong&gt;: 1 000 epochs of gradient descent
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prediction&lt;/strong&gt;: sigmoid applied to the linear combination of inputs
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The same equations work whether we have a single feature or several; the only difference is that the weight becomes a vector.&lt;/p&gt;
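&lt;p&gt;The equivalence is easy to verify: treating the scalar weight as a length‑one vector reproduces the same scores. A small sanity check (not part of the original code):&lt;/p&gt;

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6])
m, b = 0.5, -1.0   # arbitrary example parameters

# Scalar form: z = m*x + b
z_scalar = m * X + b

# Vector form: the same computation with a length-1 weight vector
W = np.array([m])
z_vector = np.dot(X.reshape(-1, 1), W) + b

print(np.allclose(z_scalar, z_vector))  # True
```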


&lt;h2&gt;
  
  
  Imports and Data
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt;  &lt;span class="c1"&gt;# progress bar
&lt;/span&gt;
&lt;span class="c1"&gt;# One‑dimensional toy data
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Two‑dimensional toy data
&lt;/span&gt;&lt;span class="n"&gt;X2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([[&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
               &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
               &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt;
&lt;span class="n"&gt;y2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;These tiny arrays let us step through the whole learning process without any external data files.&lt;/p&gt;


&lt;h2&gt;
  
  
  Initialisation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1‑D parameters
&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

&lt;span class="c1"&gt;# 2‑D parameters
&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;

&lt;span class="c1"&gt;# Common hyper‑parameters
&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;   &lt;span class="c1"&gt;# learning rate
&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;   &lt;span class="c1"&gt;# full passes over the dataset
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Learning rate&lt;/strong&gt; (&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Flr" alt="lr" width="9" height="9"&gt;) controls the step size in gradient descent.
Too high, and we overshoot; too low, and training stalls.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Epochs&lt;/strong&gt; is the number of times we loop over the entire dataset.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Sigmoid (Logistic) Function
&lt;/h2&gt;

&lt;p&gt;The sigmoid squashes any real number into the interval &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%25280%252C%255C%252C1%2529" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%25280%252C%255C%252C1%2529" alt="(0,\,1)" width="29" height="13"&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fz" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fz" alt="z" width="6" height="6"&gt;&lt;/a&gt; is very negative, the output is close to 0; when &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fz" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fz" alt="z" width="6" height="6"&gt;&lt;/a&gt; is very positive, it approaches 1.&lt;/p&gt;
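&lt;p&gt;A few sample values make the saturation behaviour concrete:&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Sigmoid saturates: very negative z gives ~0, very positive z gives ~1,
# and z = 0 sits exactly at 0.5.
for z in (-10, -2, 0, 2, 10):
    print(f"sigmoid({z}) = {sigmoid(z):.5f}")
```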




&lt;h2&gt;
  
  
  1‑D Logistic Regression
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;logisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# number of samples
&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;tqdm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;leave&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Forward pass
&lt;/span&gt;        &lt;span class="n"&gt;z&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
        &lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Gradients of cross‑entropy loss
&lt;/span&gt;        &lt;span class="n"&gt;dm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Gradient descent updates
&lt;/span&gt;        &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dm&lt;/span&gt;
        &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What the loop does&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Compute the linear score &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fz" alt="z" width="6" height="6"&gt;.
&lt;/li&gt;
&lt;li&gt;Convert &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fz" alt="z" width="6" height="6"&gt; into a probability &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Chat%257By%257D" alt="\hat{y}" width="6" height="12"&gt; with the sigmoid.
&lt;/li&gt;
&lt;li&gt;Calculate how much each parameter should change (&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fdm" alt="dm" width="17" height="9"&gt;, &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fdb" alt="db" width="11" height="9"&gt;).
&lt;/li&gt;
&lt;li&gt;Move the parameters a little toward the minimum.
&lt;/li&gt;
&lt;/ol&gt;
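&lt;p&gt;Working the first iteration by hand (with the toy data above and m = b = 0) shows the mechanics: every initial prediction is sigmoid(0) = 0.5, so the very first step already nudges m in the right direction.&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([1, 2, 3, 4, 5, 6])
y = np.array([0, 0, 0, 1, 1, 1])
m, b, lr = 0.0, 0.0, 0.01
n = len(X)

# First iteration, step by step
y_hat = sigmoid(m * X + b)              # all 0.5, since z = 0 everywhere
dm = (1 / n) * np.sum((y_hat - y) * X)  # (3.0 - 7.5) / 6 = -0.75
db = (1 / n) * np.sum(y_hat - y)        # the errors cancel: 0.0
m -= lr * dm                            # 0.0 - 0.01 * (-0.75) = 0.0075
b -= lr * db                            # unchanged
print(round(m, 4), round(b, 4))         # 0.0075 0.0
```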

&lt;p&gt;After the loop, &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fm" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fm" alt="m" width="11" height="6"&gt;&lt;/a&gt; and &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fb" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fb" alt="b" width="5" height="9"&gt;&lt;/a&gt; hold the trained model.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prediction (1‑D)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;logisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;new_x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;
&lt;span class="n"&gt;prob&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;new_x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Probability that x = 9 is class 1:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is a confidence score between 0 and 1, indicating how likely the point belongs to the positive class.&lt;/p&gt;
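&lt;p&gt;To visualise that confidence, we can evaluate the model over a range of inputs. The parameters below are hypothetical stand‑ins for a trained model (the real values depend on the run); they place the decision boundary at x = 3.5, between the two classes of the toy data:&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical trained parameters (actual values depend on the run);
# the toy labels switch from 0 to 1 between x = 3 and x = 4, so the
# boundary z = 0 is placed at x = 3.5.
m, b = 1.5, -5.25

for x in (1, 3, 3.5, 4, 9):
    print(x, round(float(sigmoid(m * x + b)), 3))
```

The probabilities rise smoothly from near 0 to near 1 as x crosses the boundary, with exactly 0.5 at x = 3.5.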




&lt;h2&gt;
  
  
  Multi‑Feature Logistic Regression
&lt;/h2&gt;

&lt;p&gt;The only change is that we replace the scalar weight with a vector and use matrix operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;logisticRegressionMultipleFeatures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;tqdm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;leave&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Forward pass
&lt;/span&gt;        &lt;span class="n"&gt;z&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
        &lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Gradients
&lt;/span&gt;        &lt;span class="n"&gt;dw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Updates
&lt;/span&gt;        &lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dw&lt;/span&gt;
        &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gradients &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fdw" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fdw" alt="dw" width="15" height="9"&gt;&lt;/a&gt; and &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fdb" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fdb" alt="db" width="11" height="9"&gt;&lt;/a&gt; are derived exactly as in the 1‑D case, just expressed in vector form.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prediction (Multi‑D)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;logisticRegressionMultipleFeatures&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;prob&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;W&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Probability that sample [40, 70] is class 1:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, the result is a probability that can be thresholded (e.g., 0.5) to obtain a hard class label.&lt;/p&gt;
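&lt;p&gt;Here is a minimal sketch of that thresholding step. The &lt;code&gt;W&lt;/code&gt; and &lt;code&gt;bias&lt;/code&gt; values below are made up for illustration; in practice they come out of the training loop above.&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_label(sample, W, bias, threshold=0.5):
    """Threshold the sigmoid probability into a hard 0/1 label."""
    prob = sigmoid(np.dot(sample, W) + bias)
    return int(prob >= threshold), prob

# Illustrative parameters only; real values come from training.
W, bias = np.array([0.1, 0.05]), -6.0
label, prob = predict_label(np.array([40, 70]), W, bias)
print(label, round(prob, 3))
```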




&lt;h2&gt;
  
  
  Key Concepts &amp;amp; Math
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Linear model&lt;/strong&gt;: &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fz%2520%253D%2520%255Cmathbf%257Bw%257D%255E%255Ctop%2520%255Cmathbf%257Bx%257D%2520%252B%2520b" alt="z = \mathbf{w}^\top \mathbf{x} + b" width="72" height="12"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sigmoid&lt;/strong&gt;: &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Csigma%2528z%2529%2520%253D%2520%255Cfrac%257B1%257D%257B1%2520%252B%2520e%255E%257B-z%257D%257D" alt="\sigma(z) = \frac{1}{1 + e^{-z}}" width="85" height="28"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross‑entropy loss&lt;/strong&gt;:
$$ L = -\frac{1}{n} \sum_i \Big[ y_i \log \sigma(z_i) + (1 - y_i) \log\big(1 - \sigma(z_i)\big) \Big] $$&lt;/li&gt;
&lt;/ul&gt;
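&lt;p&gt;The loss above translates directly into NumPy. Below is a minimal sketch; the clipping constant &lt;code&gt;eps&lt;/code&gt; is my addition, guarding against &lt;code&gt;log(0)&lt;/code&gt; when the model is very confident:&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, z, eps=1e-12):
    """Mean binary cross-entropy between labels y and logits z."""
    p = np.clip(sigmoid(z), eps, 1 - eps)  # keep log() finite
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([0, 1, 1, 0])
z = np.array([-2.0, 1.5, 3.0, -0.5])
print(round(cross_entropy(y, z), 4))  # mean loss over the four toy points
```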

</description>
      <category>logisticregression</category>
      <category>python</category>
      <category>numpy</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Building a Linear Regression Model from Scratch with Gradient Descent in Python</title>
      <dc:creator>Viswa M</dc:creator>
      <pubDate>Fri, 27 Mar 2026 18:35:36 +0000</pubDate>
      <link>https://dev.to/viswa_m_09/building-a-linear-regression-model-from-scratch-with-gradient-descent-in-python-1p19</link>
      <guid>https://dev.to/viswa_m_09/building-a-linear-regression-model-from-scratch-with-gradient-descent-in-python-1p19</guid>
      <description>&lt;h1&gt;
  
  
  Gradient Descent Linear Regression in Python
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Meta description:&lt;/strong&gt; Learn how to build a linear regression model from scratch using gradient descent in Python. Step‑by‑step code, math, and practical tips.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; linearregression, gradientdescent, python, machinelearning, dataanalysis, codingtutorial, algorithm, mse, supervisedlearning  &lt;/p&gt;




&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Linear regression is usually the first model you build when learning machine learning. It introduces the essential concepts of &lt;strong&gt;parameters, loss, gradients, and optimisation&lt;/strong&gt; in the simplest setting: a straight‑line fit.&lt;br&gt;&lt;br&gt;
In this post we’ll walk through a compact Python script that learns a line from five data points using &lt;strong&gt;gradient descent&lt;/strong&gt;. We’ll explain the maths, step through the code, and predict a new value. By the end you’ll understand why the parameters change and how to tweak the algorithm for your own data.  &lt;/p&gt;


&lt;h2&gt;
  
  
  2. What the program does in a nutshell
&lt;/h2&gt;

&lt;p&gt;The script trains a linear model  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Chat%257By%257D%253Dm%255C%252Cx%252Bb" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Chat%257By%257D%253Dm%255C%252Cx%252Bb" alt="math" width="67" height="12"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;to minimise mean‑squared error between predictions &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Chat%257By%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Chat%257By%257D" alt="\hat{y}" width="6" height="12"&gt;&lt;/a&gt; and the true outputs &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fy" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fy" alt="y" width="6" height="8"&gt;&lt;/a&gt;.&lt;br&gt;&lt;br&gt;
Starting from &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fm%253D0" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fm%253D0" alt="m=0" width="35" height="9"&gt;&lt;/a&gt;, &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fb%253D0" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fb%253D0" alt="b=0" width="29" height="9"&gt;&lt;/a&gt;, it repeatedly updates these two numbers until the loss stops improving, then prints the learned slope, intercept, and a prediction for a new input.  &lt;/p&gt;


&lt;h2&gt;
  
  
  3. Code Implementation
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt;

&lt;span class="c1"&gt;# Data
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Initial parameters
&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;          &lt;span class="c1"&gt;# slope
&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;          &lt;span class="c1"&gt;# intercept
&lt;/span&gt;
&lt;span class="c1"&gt;# Hyper‑parameters
&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;      &lt;span class="c1"&gt;# learning rate
&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;  &lt;span class="c1"&gt;# number of iterations
&lt;/span&gt;
&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# number of training samples
&lt;/span&gt;
&lt;span class="c1"&gt;# Gradient descent loop
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;tqdm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
    &lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

    &lt;span class="n"&gt;dm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_hat&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="c1"&gt;# gradient wrt m
&lt;/span&gt;    &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y_hat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# gradient wrt b
&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dm&lt;/span&gt;
    &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Slope:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Intercept:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Prediction for a new input
&lt;/span&gt;&lt;span class="n"&gt;input_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
&lt;span class="n"&gt;pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;input_value&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Predicted y for x=6:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Step‑by‑step walk‑through
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Imports&lt;/strong&gt; – &lt;code&gt;numpy&lt;/code&gt; for vector maths; &lt;code&gt;tqdm&lt;/code&gt; for a progress bar.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data&lt;/strong&gt; – &lt;code&gt;X&lt;/code&gt; (inputs) and &lt;code&gt;y&lt;/code&gt; (targets).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parameters&lt;/strong&gt; – start with &lt;code&gt;m = 0&lt;/code&gt;, &lt;code&gt;b = 0&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hyper‑parameters&lt;/strong&gt; – learning rate (&lt;code&gt;lr&lt;/code&gt;) controls the step size; &lt;code&gt;epochs&lt;/code&gt; limits how many updates we perform.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training loop&lt;/strong&gt;

&lt;ol&gt;
&lt;li&gt;Compute predictions: &lt;code&gt;y_hat = m * X + b&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;Compute gradients:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;dm = (-2 / n) * np.sum(X * (y - y_hat))&lt;/code&gt; (slope).
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;db = (-2 / n) * np.sum(y - y_hat)&lt;/code&gt; (intercept).
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Update parameters: move a fraction (&lt;code&gt;lr&lt;/code&gt;) of the negative gradient.
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After training&lt;/strong&gt; – print the final &lt;code&gt;m&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;, then predict &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fy" alt="y" width="6" height="8"&gt; for &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fx%2520%253D%25206" alt="x = 6" width="30" height="9"&gt;.
&lt;/li&gt;
&lt;/ul&gt;
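&lt;p&gt;The loop above can be condensed into a self‑contained sketch that also records the MSE each epoch, which is handy for checking convergence (I drop &lt;code&gt;tqdm&lt;/code&gt; here so the snippet runs anywhere):&lt;/p&gt;

```python
import numpy as np

# Same data and hyper-parameters as the script above.
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
m, b = 0.0, 0.0
lr, epochs, n = 0.01, 1000, len(X)

losses = []  # MSE before each update, for inspection

for _ in range(epochs):
    y_hat = m * X + b
    losses.append(np.mean((y - y_hat) ** 2))
    dm = (-2 / n) * np.sum(X * (y - y_hat))  # gradient wrt slope
    db = (-2 / n) * np.sum(y - y_hat)        # gradient wrt intercept
    m -= lr * dm
    b -= lr * db

print(round(m, 3), round(b, 3))                   # learned slope and intercept
print(round(losses[0], 2), round(losses[-1], 2))  # loss falls from 17.2 toward 0.48
```

Watching `losses` shrink is the quickest way to see whether `lr` and `epochs` are in a reasonable range.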


&lt;h2&gt;
  
  
  5. Key concepts (with maths)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  5.1 Linear regression model
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Chat%257By%257D%2520%253D%2520m%255C%252Cx%2520%252B%2520b" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Chat%257By%257D%2520%253D%2520m%255C%252Cx%2520%252B%2520b" alt="math" width="67" height="12"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fm" alt="m" width="11" height="6"&gt; – slope (rate of change).
&lt;/li&gt;
&lt;li&gt;
&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fb" alt="b" width="5" height="9"&gt; – intercept (value at &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fx%253D0" alt="x=0" width="31" height="9"&gt;).
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  5.2 Loss function (mean‑squared error)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Ctext%257BMSE%257D%2528m%252Cb%2529%2520%253D%2520%255Cfrac%257B1%257D%257Bn%257D%255Csum_%257Bi%253D1%257D%255E%257Bn%257D%2528y_i%2520-%2520%255Chat%257By%257D_i%2529%255E2" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Ctext%257BMSE%257D%2528m%252Cb%2529%2520%253D%2520%255Cfrac%257B1%257D%257Bn%257D%255Csum_%257Bi%253D1%257D%255E%257Bn%257D%2528y_i%2520-%2520%255Chat%257By%257D_i%2529%255E2" alt="math" width="163" height="36"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  5.3 Gradient of the MSE
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cfrac%257B%255Cpartial%2520%255Ctext%257BMSE%257D%257D%257B%255Cpartial%2520m%257D%250A%2520%2520%2520%253D%2520-%255Cfrac%257B2%257D%257Bn%257D%255Csum_%257Bi%253D1%257D%255E%257Bn%257Dx_i%2528y_i%2520-%2520%255Chat%257By%257D_i%2529" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cfrac%257B%255Cpartial%2520%255Ctext%257BMSE%257D%257D%257B%255Cpartial%2520m%257D%250A%2520%2520%2520%253D%2520-%255Cfrac%257B2%257D%257Bn%257D%255Csum_%257Bi%253D1%257D%255E%257Bn%257Dx_i%2528y_i%2520-%2520%255Chat%257By%257D_i%2529" alt="math" width="157" height="36"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cfrac%257B%255Cpartial%2520%255Ctext%257BMSE%257D%257D%257B%255Cpartial%2520b%257D%250A%2520%2520%2520%253D%2520-%255Cfrac%257B2%257D%257Bn%257D%255Csum_%257Bi%253D1%257D%255E%257Bn%257D%2528y_i%2520-%2520%255Chat%257By%257D_i%2529" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Cfrac%257B%255Cpartial%2520%255Ctext%257BMSE%257D%257D%257B%255Cpartial%2520b%257D%250A%2520%2520%2520%253D%2520-%255Cfrac%257B2%257D%257Bn%257D%255Csum_%257Bi%253D1%257D%255E%257Bn%257D%2528y_i%2520-%2520%255Chat%257By%257D_i%2529" alt="math" width="144" height="36"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These match the &lt;code&gt;dm&lt;/code&gt; and &lt;code&gt;db&lt;/code&gt; formulas in the code.  &lt;/p&gt;
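&lt;p&gt;A quick way to convince yourself the formulas are right is a central finite‑difference check. The helper names below are mine; the gradient expressions are the ones from the script:&lt;/p&gt;

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
n = len(X)

def mse(m, b):
    return np.mean((y - (m * X + b)) ** 2)

def analytic_grads(m, b):
    y_hat = m * X + b
    dm = (-2 / n) * np.sum(X * (y - y_hat))
    db = (-2 / n) * np.sum(y - y_hat)
    return dm, db

# Compare analytic gradients with central finite differences.
m0, b0, eps = 0.3, 0.1, 1e-6
dm, db = analytic_grads(m0, b0)
dm_num = (mse(m0 + eps, b0) - mse(m0 - eps, b0)) / (2 * eps)
db_num = (mse(m0, b0 + eps) - mse(m0, b0 - eps)) / (2 * eps)
print(round(dm - dm_num, 6), round(db - db_num, 6))  # both should be ~0
```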
&lt;h3&gt;
  
  
  5.4 Gradient descent update
&lt;/h3&gt;

&lt;p&gt;With learning rate &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Calpha" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%255Calpha" alt="\alpha" width="7" height="6"&gt;&lt;/a&gt;:  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fm%2520%255Cleftarrow%2520m%2520-%2520%255Calpha%255C%252C%255Cfrac%257B%255Cpartial%2520%255Ctext%257BMSE%257D%257D%257B%255Cpartial%2520m%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fm%2520%255Cleftarrow%2520m%2520-%2520%255Calpha%255C%252C%255Cfrac%257B%255Cpartial%2520%255Ctext%257BMSE%257D%257D%257B%255Cpartial%2520m%257D" alt="math" width="108" height="28"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fb%2520%255Cleftarrow%2520b%2520-%2520%255Calpha%255C%252C%255Cfrac%257B%255Cpartial%2520%255Ctext%257BMSE%257D%257D%257B%255Cpartial%2520b%257D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fb%2520%255Cleftarrow%2520b%2520-%2520%255Calpha%255C%252C%255Cfrac%257B%255Cpartial%2520%255Ctext%257BMSE%257D%257D%257B%255Cpartial%2520b%257D" alt="math" width="96" height="28"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Repeating these two updates moves the parameters steadily toward the minimum of the loss surface.  &lt;/p&gt;


&lt;h2&gt;
  
  
  6. Example section
&lt;/h2&gt;

&lt;p&gt;Here’s what the first few iterations look like (values rounded for clarity):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fm" alt="m" width="11" height="6"&gt;&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fb" alt="b" width="5" height="9"&gt;&lt;/th&gt;
&lt;th&gt;MSE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.000&lt;/td&gt;
&lt;td&gt;0.000&lt;/td&gt;
&lt;td&gt;17.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.264&lt;/td&gt;
&lt;td&gt;0.080&lt;/td&gt;
&lt;td&gt;10.49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.465&lt;/td&gt;
&lt;td&gt;0.143&lt;/td&gt;
&lt;td&gt;6.58&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After 1 000 epochs the parameters settle at approximately&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Slope: 0.6177
Intercept: 2.1361
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;close to the exact least‑squares fit (&lt;code&gt;m = 0.6&lt;/code&gt;, &lt;code&gt;b = 2.2&lt;/code&gt;); running more epochs moves them the rest of the way.&lt;/p&gt;



&lt;p&gt;Predicting for &lt;code&gt;x = 6&lt;/code&gt; gives  &lt;/p&gt;

&lt;p&gt;&lt;code&gt;ŷ ≈ 0.6177 × 6 + 2.1361 ≈ 5.84&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;so the program prints “Predicted y for x=6: 5.84…”.  &lt;/p&gt;
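&lt;p&gt;As a sanity check, NumPy can compute the closed‑form least‑squares line directly; that is the optimum gradient descent is heading toward:&lt;/p&gt;

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# np.polyfit with deg=1 returns the (slope, intercept) of the exact
# least-squares line for this data.
slope, intercept = np.polyfit(X, y, 1)
print(round(slope, 2), round(intercept, 2))  # the exact optimum
print(round(slope * 6 + intercept, 2))       # least-squares prediction at x = 6
```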




&lt;h2&gt;
  
  
  7. Quick sanity checks
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Observation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Small learning rate&lt;/strong&gt; (&lt;code&gt;lr = 0.0001&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Converges slowly; more epochs needed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Large learning rate&lt;/strong&gt; (&lt;code&gt;lr = 1&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Updates overshoot the optimum; loss may diverge.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Choosing a suitable &lt;code&gt;lr&lt;/code&gt; and &lt;code&gt;epochs&lt;/code&gt; is a standard tuning step in any gradient‑based optimisation.  &lt;/p&gt;
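&lt;p&gt;These observations are easy to reproduce. The &lt;code&gt;train&lt;/code&gt; helper below is my re‑packaging of the loop above; only the learning rate changes between runs:&lt;/p&gt;

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
n = len(X)

def train(lr, epochs):
    """Run the gradient-descent loop and report the final MSE."""
    m = b = 0.0
    for _ in range(epochs):
        y_hat = m * X + b
        m -= lr * (-2 / n) * np.sum(X * (y - y_hat))
        b -= lr * (-2 / n) * np.sum(y - y_hat)
    return np.mean((y - (m * X + b)) ** 2)

mse_small = train(0.0001, 1000)  # too small: still far from the optimum
mse_ok    = train(0.01, 1000)    # reasonable: close to the minimum MSE (~0.48)
mse_big   = train(1.0, 10)       # too large: the loss explodes within a few steps
print(round(mse_small, 3), round(mse_ok, 3), f"{mse_big:.1e}")
```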




&lt;h2&gt;
  
  
  8. Take‑away
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;This code implements ordinary least‑squares regression with gradient descent.
&lt;/li&gt;
&lt;li&gt;Gradient descent is a generic optimisation routine that underpins logistic regression, neural networks, and more.
&lt;/li&gt;
&lt;li&gt;Understanding the update equations clarifies &lt;em&gt;why&lt;/em&gt; the parameters evolve and how to troubleshoot training.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feel free to experiment:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Swap in a new dataset.
&lt;/li&gt;
&lt;li&gt;Try different learning rates or epoch counts.
&lt;/li&gt;
&lt;li&gt;Normalise your inputs or add a bias term.
&lt;/li&gt;
&lt;/ul&gt;
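&lt;p&gt;On the last point: if you swap in a dataset with large feature values, standardising the inputs first keeps the gradients well scaled, so the same learning rate still works. A sketch (remember to apply the identical transform to new inputs at prediction time):&lt;/p&gt;

```python
import numpy as np

X_raw = np.array([100, 200, 300, 400, 500], dtype=float)

# Shift to zero mean and rescale to unit variance before training;
# store mu and sigma so new inputs can be transformed the same way.
mu, sigma = X_raw.mean(), X_raw.std()
X_scaled = (X_raw - mu) / sigma
print(round(X_scaled.mean(), 10), round(X_scaled.std(), 10))  # ~0 and ~1
```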

&lt;p&gt;Happy coding and keep building!  &lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Slug:&lt;/strong&gt; linear-regression-gradient-descent-python&lt;/p&gt;

</description>
      <category>linearregression</category>
      <category>gradientdescent</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Building a Simple Logistic Regression from Scratch (Python Edition)</title>
      <dc:creator>Viswa M</dc:creator>
      <pubDate>Fri, 27 Mar 2026 18:33:10 +0000</pubDate>
      <link>https://dev.to/viswa_m_09/building-a-simple-logistic-regression-from-scratch-python-edition-4pik</link>
      <guid>https://dev.to/viswa_m_09/building-a-simple-logistic-regression-from-scratch-python-edition-4pik</guid>
      <description>&lt;h1&gt;
  
  
  Building a Simple Logistic Regression from Scratch (Python Edition)
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Meta description:&lt;/strong&gt; Learn to build a simple logistic regression model in Python with NumPy and gradient descent, no machine‑learning framework needed. Step‑by‑step guide, code snippets, predictions.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; logisticregression, python, gradientdescent, machinelearning, purepython, classification, tutorial, datamanipulation  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slug:&lt;/strong&gt; build-logistic-regression-from-scratch-in-python  &lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;In this post we’ll hand‑craft a logistic‑regression classifier in vanilla NumPy, without any machine‑learning framework.&lt;br&gt;&lt;br&gt;
We’ll:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Train a one‑feature model.&lt;/li&gt;
&lt;li&gt;Scale the same idea to two features.&lt;/li&gt;
&lt;li&gt;See how gradient descent iteratively lowers the cross‑entropy loss.&lt;/li&gt;
&lt;li&gt;Finally, predict the probability that a new sample belongs to the positive class.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything is fully transparent, so you can trace every math step and every line of code.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. What the Code Does
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create toy data&lt;/strong&gt; for a binary classification problem.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define a one‑feature logistic‑regression function&lt;/strong&gt; that trains by gradient descent.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predict&lt;/strong&gt; the probability for a new single‑feature sample.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define a multi‑feature version&lt;/strong&gt; of the same algorithm.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predict&lt;/strong&gt; the probability for a new two‑feature sample.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All of this is implemented in plain NumPy, so you can see exactly what happens during training.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Step‑by‑Step Walk‑Through
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Imports &amp;amp; Data Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt;

&lt;span class="c1"&gt;# 1‑D toy data
&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;          &lt;span class="c1"&gt;# feature values
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;          &lt;span class="c1"&gt;# binary labels
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;numpy&lt;/code&gt; handles vectorised math.
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tqdm&lt;/code&gt; shows a progress bar during the training loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.2 Hyperparameters &amp;amp; Initial Parameters
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;          &lt;span class="c1"&gt;# weight (slope)
&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;          &lt;span class="c1"&gt;# bias (intercept)
&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;      &lt;span class="c1"&gt;# learning rate
&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;  &lt;span class="c1"&gt;# number of gradient steps
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Parameters start at zero.
&lt;/li&gt;
&lt;li&gt;The learning rate determines the step size.
&lt;/li&gt;
&lt;li&gt;More epochs mean more passes over the data.&lt;/li&gt;
&lt;/ul&gt;
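Because the sigmoid maps 0 to 0.5, zero‑initialised parameters mean every training point initially gets probability 0.5. A quick check on the toy data above (not in the original post) makes that concrete:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6])
m, b = 0, 0          # zero-initialised parameters

z = m * X + b        # all zeros before any training step
y_hat = 1 / (1 + np.exp(-z))
print(y_hat)         # every prediction starts at 0.5
```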

&lt;h3&gt;
  
  
  2.3 One‑Feature Logistic Regression – Core Function
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;logisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;tqdm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="c1"&gt;# Linear part
&lt;/span&gt;        &lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;

        &lt;span class="c1"&gt;# Sigmoid activation
&lt;/span&gt;        &lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="c1"&gt;# Gradients
&lt;/span&gt;        &lt;span class="n"&gt;dm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_hat&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Gradient descent update
&lt;/span&gt;        &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;dm&lt;/span&gt;
        &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;z = m * X + b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Linear combination of feature and bias.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;σ(z) = 1/(1+e^{-z})&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Squashes any real number into the interval &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3F%25280%252C%255C%252C1%2529" alt="(0,\,1)" width="29" height="13"&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;dm&lt;/code&gt; &amp;amp; &lt;code&gt;db&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Partial derivatives of cross‑entropy loss w.r.t. &lt;code&gt;m&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Update rules&lt;/td&gt;
&lt;td&gt;Move parameters toward the minimum of the loss.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
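As a sanity check on the `dm` and `db` formulas in the table (this check is not in the original post), the analytic gradients of the mean cross‑entropy loss can be compared against central finite differences:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)

def loss(m, b):
    # Mean binary cross-entropy for the 1-D model
    z = m * X + b
    p = 1 / (1 + np.exp(-z))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

m, b = 0.3, -0.1                    # arbitrary test point
p = 1 / (1 + np.exp(-(m * X + b)))
dm = np.mean((p - y) * X)           # analytic gradient w.r.t. m
db = np.mean(p - y)                 # analytic gradient w.r.t. b

eps = 1e-6                          # central finite differences
dm_num = (loss(m + eps, b) - loss(m - eps, b)) / (2 * eps)
db_num = (loss(m, b + eps) - loss(m, b - eps)) / (2 * eps)

print(abs(dm - dm_num), abs(db - db_num))  # both tiny
```

The two pairs agree to many decimal places, confirming that the update rules in the training loop really do descend the cross‑entropy loss.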

&lt;h3&gt;
  
  
  2.4 Training &amp;amp; Prediction for 1‑D Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;logisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;epochs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Predict probability for a new input
&lt;/span&gt;&lt;span class="n"&gt;inp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;
&lt;span class="n"&gt;z&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;inp&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
&lt;span class="n"&gt;prob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;z&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Probability:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After training on the six points, the model estimates how likely &lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fx%2520%253D%25209" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flatex.codecogs.com%2Fsvg.image%3Fx%2520%253D%25209" alt="x = 9" width="30" height="9"&gt;&lt;/a&gt; belongs to the positive class.&lt;/p&gt;
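To turn that probability into a hard class label, a common choice is a 0.5 threshold. The parameter values below are illustrative stand‑ins (the decision boundary for this data sits near x = 3.5), not the exact numbers this training run produces:

```python
import numpy as np

# Illustrative trained parameters (assumed values, not from the post;
# chosen so the boundary sits near x = 3.5)
m, b = 1.1, -3.9

def predict(x, threshold=0.5):
    prob = 1 / (1 + np.exp(-(m * x + b)))
    return prob, int(prob >= threshold)

prob, label = predict(9)
print(prob, label)      # high probability, so label 1
print(predict(1))       # low probability, so label 0
```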

&lt;h3&gt;
  
  
  2.5 Multi‑Feature Logistic Regression – Scaling Up
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 2‑D toy data
&lt;/span&gt;&lt;span class="n"&gt;X2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;y2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;  &lt;span class="c1"&gt;# one weight per feature
&lt;/span&gt;&lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
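Before the multi‑feature training loop, it can help to confirm the shapes involved: each row of `X2` is one sample, and `np.dot(X2, weights)` produces one logit per sample:

```python
import numpy as np

X2 = np.array([[25, 30], [35, 60], [45, 80]])
y2 = np.array([0, 1, 1])
weights = np.zeros(X2.shape[1])    # shape (2,): one weight per feature
bias = 0

print(X2.shape)                    # (3, 2): three samples, two features
z = np.dot(X2, weights) + bias     # shape (3,): one logit per sample
print(z)                           # all zeros before training
```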



&lt;h3&gt;
  
  
  2.6 Core Function for Multiple Features
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
python
def logisticRegressionMultipleFeatures(X, y, W, b, lr, epochs):
    n = len(X)

    for _ in tqdm(range(epochs)):
        # Linear part
        z = np.dot(X, W) + b

        # Sigmoid
        y_hat = 1 / (1 + np.exp(-z))

        # Gradients
        dw = (1 / n
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
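To round out step 5 of the overview — predicting the probability for a new two‑feature sample — here is a minimal sketch. The `W` and `b` values below are assumed illustrative numbers, not output from the post:

```python
import numpy as np

# Assumed illustrative trained parameters (not from the post)
W = np.array([0.05, 0.08])
b = -4.0

new = np.array([40, 70])            # new two-feature sample
z = np.dot(new, W) + b              # 0.05*40 + 0.08*70 - 4.0 = 3.6
prob = 1 / (1 + np.exp(-z))
print("Probability:", prob)         # sigmoid(3.6) ≈ 0.973
```

The prediction step is identical to the 1‑D case; only the linear part changes from `m * x + b` to `np.dot(x, W) + b`.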

</description>
      <category>logisticregression</category>
      <category>python</category>
      <category>gradientdescent</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
