Neural Networks as Language Models
By the end of this post, you'll understand how feedforward neural networks work at the unit level (inputs, weights, activation functions), why a perceptron fails on XOR and what that failure teaches about the need for hidden layers, and how a two-layer feedforward network can implement a language model that beats n-grams on both sparsity and storage. You'll also see two wrong ways and one right way to feed text into a neural net, understand how backpropagation trains the whole system, including the embedding layer, and know conceptually what separates a language model from a large language model.
One idea ties everything together: every architecture in this series, from n-grams to feedforward nets to the transformers we'll eventually reach, is applied to the same task. Predict the next word. This post covers the first neural version of that task.
Part 1: The Components
Before building a neural language model, we need to understand the parts: a single unit, what happens when a unit hits its limits, and what you get when you stack units into layers.
A Single Neural Unit
A neural unit takes inputs, multiplies each by a weight, sums, adds a bias, and applies a non-linear function:

y = f(w · x + b)

where f is a non-linear activation function. The non-linearity matters. Without it, stacking layers just produces another linear transformation, and depth buys you nothing.
Three Activation Functions
Sigmoid: σ(z) = 1 / (1 + e^(−z)). Squashes output to [0, 1]. Differentiable, with a probabilistic interpretation. The problem: gradients saturate to near-zero for very large or very small inputs, known as the vanishing gradient problem, which makes deep sigmoid networks hard to train.
ReLU: ReLU(z) = max(0, z). Zero for negative inputs, identity for positive. Simple, fast, no vanishing gradient for positive values. The default for hidden layers.
Tanh: tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z)). Like sigmoid but centered at zero, outputting [−1, 1]. Zero-centered outputs often help with training.
Both ReLU and tanh outperform sigmoid in practice. When designing your own network, you pick the activation function per layer; different problems work better with different choices.
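The three activation functions can be written directly as a minimal NumPy sketch (the sample inputs are arbitrary, chosen only to show the squashing behavior):

```python
import numpy as np

def sigmoid(z):
    # squashes into (0, 1); saturates for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # zero for negatives, identity for positives
    return np.maximum(0.0, z)

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))   # already very close to 0 and 1 at z = ±5
print(relu(z))      # [0. 0. 5.]
print(np.tanh(z))   # squashed into (-1, 1), centered at zero
```

Note how sigmoid(±5) is already nearly flat: that flatness is exactly the vanishing-gradient saturation described above.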
The XOR Problem: Why Hidden Layers Exist
A perceptron is the simplest neural unit: binary inputs, step function, binary output. It can compute AND and OR:
- AND: w₁ = 1, w₂ = 1, bias b = −1. Only fires when both inputs are 1.
- OR: w₁ = 1, w₂ = 1, bias b = 0. Fires when either input is 1.
Both are linearly separable; you can draw a straight line that separates the outputs.
XOR (output 1 when exactly one input is 1) is not. Plot the four input combinations on a 2D grid: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0. No single line separates the 1s from the 0s.
The fix: add a hidden layer with two ReLU units. The hidden layer transforms the inputs into a new space where XOR becomes linearly separable. The input points (0,1) and (1,0) get mapped to the same hidden representation, making it easy for the output unit to separate them.
This is why neural networks have hidden layers. The hidden layer creates a representation in which the problem is solvable. The same principle applies to every architecture that follows: feedforward nets, RNNs, and transformers all learn intermediate representations that make the downstream task easier to solve.
Verify the XOR solution yourself. The network has inputs x₁, x₂, hidden units h₁, h₂ (ReLU), and output y₁ (ReLU). Weights: both inputs connect to both hidden units with weight 1. h₁ has bias 0, h₂ has bias −1. Hidden-to-output weights: h₁ connects with weight 1, h₂ connects with weight −2. Output bias is 0. Plug in all four input combinations: [0,0], [0,1], [1,0], [1,1]. Confirm that the output is 0, 1, 1, 0 respectively. The ReLU clamps negatives to zero, which is what makes the hidden representation work.
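Those exact weights can be checked in a few lines of NumPy:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Weights as described: both inputs -> both hidden units with weight 1,
# hidden biases [0, -1], hidden-to-output weights [1, -2], output bias 0.
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.0, -1.0])
u = np.array([1.0, -2.0])

def xor_net(x):
    h = relu(W @ x + b)      # hidden representation
    return relu(u @ h + 0.0) # output unit

for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", xor_net(np.array(x, dtype=float)))
# [0,0] -> 0.0, [0,1] -> 1.0, [1,0] -> 1.0, [1,1] -> 0.0
```

Notice that [0,1] and [1,0] both map to the hidden vector [1, 0], which is what makes the output unit's job linearly separable.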
Feedforward Networks: The Full Stack
Stack units into layers. Information flows in one direction: input → hidden → output. No cycles. This is a feedforward network.
For a two-layer network:

h = f(Wx + b)
z = Uh
y = softmax(z)
The shapes to keep track of:
| Variable | Shape | What it is |
|---|---|---|
| x | n₀ × 1 | Input vector |
| W | n₁ × n₀ | Hidden layer weights |
| h | n₁ × 1 | Hidden layer output |
| U | n₂ × n₁ | Output layer weights |
| y | n₂ × 1 | Output probabilities |
Softmax normalizes the output into a probability distribution, values between 0 and 1, summing to 1:

softmax(zᵢ) = e^(zᵢ) / Σⱼ e^(zⱼ)
A useful way to think about it: a neural network is logistic regression with two upgrades. First, multiple layers instead of one. Second, instead of hand-crafted features, the hidden layers learn their own representations. Even transformers usually end with a feedforward layer plus softmax.
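The two-layer forward pass can be sketched in NumPy; the layer sizes here are arbitrary toy values, and the weights are random since we only care about shapes and the softmax property:

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, n2 = 4, 3, 5   # input, hidden, output sizes (toy values)

x = rng.normal(size=(n0,))
W = rng.normal(size=(n1, n0))
b = rng.normal(size=(n1,))
U = rng.normal(size=(n2, n1))

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

h = np.maximum(0.0, W @ x + b)  # hidden layer with ReLU
y = softmax(U @ h)              # z = Uh, then normalize

print(y.shape, y.sum())         # (5,) 1.0 -- a valid probability distribution
```

Chasing the shapes through the matrix multiplications ([n₁ × n₀] · [n₀] → [n₁], and so on) is the fastest way to internalize the table above.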
Part 2: Neural Nets Meet NLP
Now we apply these components to language. The progression matters: two approaches that almost work, then the one that does.
The Near Miss: Feature-Based Sentiment (ML 1.0)
Input: "dessert was great." Extract features by hand: word count = 3, positive lexicon words = 1 ("great"), negation count = 0. Feed these three numbers into a feedforward net. Output: P(positive), P(negative), P(neutral).
This technically works. But you've done the hard part yourself (deciding which features matter) and handed the neural net a trivial job. We call this "ML 1.0." You're paying for a neural network but using it as glorified logistic regression.
Better: Pooled Embeddings (ML 2.0)
Look up embeddings for "dessert," "was," "great." Average them into a single vector. Feed that into the network.
Better. You're at least using learned representations. But averaging embeddings throws away word order. "Not great" and "great" produce nearly identical pooled vectors, even though they mean opposite things.
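The order-blindness is easy to demonstrate. With made-up toy embeddings (not trained vectors, just for illustration), averaging produces the identical pooled vector regardless of word order:

```python
import numpy as np

# Toy 2-dimensional embeddings, invented for this example.
emb = {
    "not":   np.array([0.9, -0.1]),
    "great": np.array([-0.2, 0.8]),
}

def pool(words):
    # mean-pooling: average the embeddings of all words
    return np.mean([emb[w] for w in words], axis=0)

print(pool(["not", "great"]))
print(pool(["great", "not"]))  # identical: the mean ignores order
```

Any permutation of the same words pools to the same point, so the network downstream cannot possibly recover negation scope or syntax.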
Both of these are stepping stones. The real application of neural nets in NLP is language modeling.
The Fixed-Window Neural Language Model
Same task as always: given some context words, predict the next word. Instead of counting co-occurrences, we use a neural network.
The architecture, step by step:
- Take a fixed window of context words: "the students opened their"
- Represent each as a one-hot vector (length |V|, all zeros except a single 1)
- Multiply each one-hot vector by the embedding matrix E (shape d × |V|) — this retrieves the word's embedding
- Concatenate all context embeddings into one long vector
- Feed through the hidden layer: h = f(We + b), where e is the concatenated embedding vector
- Output layer: ŷ = softmax(Uh) — probability distribution over the entire vocabulary
The predicted word is the one with the highest probability. Slide the window forward, repeat.
The equations:

e = [Ex₁; Ex₂; …; Exₙ]   (concatenated embeddings)
h = f(We + b)
z = Uh
ŷ = softmax(z)
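The whole pipeline — one-hot, embedding lookup, concatenate, hidden layer, softmax — fits in a short NumPy sketch. The vocabulary size, embedding dimension, and window size here are toy values, and the weights are random, so only the mechanics (not the predictions) are meaningful:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n, d_h = 10, 4, 3, 8   # vocab size, embedding dim, window, hidden dim (toy)

E = rng.normal(size=(d, V))  # embedding matrix: column i is word i's embedding
W = rng.normal(size=(d_h, n * d))
b = rng.normal(size=(d_h,))
U = rng.normal(size=(V, d_h))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(context_ids):
    # multiplying a one-hot vector by E is just selecting a column,
    # so we implement the lookup directly
    e = np.concatenate([E[:, i] for i in context_ids])
    h = np.maximum(0.0, W @ e + b)   # h = f(We + b)
    return softmax(U @ h)            # distribution over the vocabulary

y_hat = predict([3, 7, 1])           # three context word ids
print(int(np.argmax(y_hat)))         # id of the predicted next word
```

Sliding the window forward just means calling `predict` again with the next three ids.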
Neural LM vs. N-gram LM
Two concrete improvements:
Sparsity is gone. N-gram models require the exact word sequence to appear in training data. "Students opened their" never appeared? Probability is zero. The neural LM uses embeddings instead. "Students" and "pupils" have similar embeddings, so the model generalizes to unseen combinations.
Storage is linear, not exponential. N-gram model size is O(exp(n)): storing counts for all |V|ⁿ possible n-grams. Neural LM parameters are the weight matrices, which scale as O(n) with window size.
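A back-of-envelope comparison makes the gap vivid. The sizes below (50k vocabulary, 4-word window, 300-dim embeddings, 500 hidden units) are assumed toy numbers, not from any particular system:

```python
V, n, d, d_h = 50_000, 4, 300, 500

ngram_entries = V ** n   # every possible 4-gram could need its own count
neural_params = (d * V          # embedding matrix E
                 + d_h * (n * d) + d_h   # W and b
                 + V * d_h + V)          # U and output bias

print(f"{ngram_entries:.1e} potential n-gram entries")
print(f"{neural_params:,} neural LM parameters")
```

The n-gram table is astronomically larger, and doubling the window squares it again, while the neural LM only grows the W matrix linearly.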
N-gram models were the standard for speech recognition and OCR for decades. Neural LMs replaced them on the strength of these two improvements.
What's Still Missing
The window is fixed. Four words. Maybe you can stretch it to eight. But you can't capture "The computer which I had just put into the machine room on the fifth floor crashed" with a window of any practical size. LLMs condition on entire pages.
Getting there requires architectures that process variable-length sequences. That's RNNs (next post) and eventually transformers.
Part 3: Training the Whole System
Learning Embeddings from Scratch
You don't always plug in pre-trained Word2Vec embeddings. Sometimes the task itself should shape the embeddings.
When training a neural language model, backpropagation updates all parameters: the embedding matrix E, hidden weights W, output weights U, and biases b. The embeddings evolve to serve the task. A language model produces embeddings tuned for next-word prediction. A sentiment system produces embeddings tuned for sentiment. A translation system produces embeddings tuned for translation.
The cost: more computation. You're backpropagating through every layer, including the embedding lookup. For simple tasks like sentiment on small datasets, pre-trained embeddings might be enough. For language modeling on large corpora, learning task-specific embeddings is usually worth the cost.
The Training Loop
Conceptually simple, computationally expensive:
1. Forward pass. Feed context words through the network. Get a predicted probability distribution over the vocabulary.
2. Compute loss. Compare the prediction to the actual next word (which you know — it's in the corpus). Cross-entropy loss:

L = −log ŷ[t], where t is the index of the actual next word
The loss is just the negative log of the probability the model assigned to the correct word. High confidence in the right answer = low loss.
3. Backward pass. Compute derivatives of the loss with respect to every parameter using the chain rule. Update each parameter:

θ ← θ − η · ∂L/∂θ

η is the learning rate. Repeat for billions of words.
The training signal is self-supervised: the next word in the corpus is always the ground truth. No annotation needed. Same principle as Word2Vec, just applied to a bigger architecture.
Forward and backward passes as a computation graph. The entire computation can be drawn as a directed graph: x → multiply by W → add b → f(·) → multiply by U → softmax → loss. Forward pass: compute left to right. Backward pass: compute derivatives right to left via the chain rule. Each parameter gets a gradient, and SGD updates them all. Frameworks like PyTorch automate the backward pass; you define the forward computation, and it handles differentiation for you.
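One full training step — forward, cross-entropy loss, hand-derived backward pass, SGD update — can be sketched in NumPy for a tiny toy network (random data, arbitrary sizes, target class chosen by hand; a framework would generate the backward pass for you):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, n_out = 4, 6, 5
x = rng.normal(size=(n_in,))
t = 2                                  # index of the "correct" next word
W = rng.normal(size=(n_h, n_in)) * 0.1
b = np.full(n_h, 0.1)
U = rng.normal(size=(n_out, n_h)) * 0.1
eta = 0.5                              # learning rate

def forward():
    z1 = W @ x + b                     # multiply by W, add b
    h = np.maximum(0.0, z1)            # f(.) = ReLU
    z2 = U @ h                         # multiply by U
    y = np.exp(z2 - z2.max())          # softmax
    return z1, h, y / y.sum()

z1, h, y = forward()
loss_before = -np.log(y[t])            # cross-entropy

# Backward pass: chain rule, right to left through the graph.
dz2 = y.copy(); dz2[t] -= 1.0          # d loss / d logits for softmax + CE
dU = np.outer(dz2, h)
dz1 = (U.T @ dz2) * (z1 > 0)           # ReLU gates the gradient
dW = np.outer(dz1, x); db = dz1

# SGD update: theta <- theta - eta * gradient
U -= eta * dU; W -= eta * dW; b -= eta * db
loss_after = -np.log(forward()[2][t])
print(loss_before, "->", loss_after)   # loss drops after the step
```

The `dz2 = y − onehot(t)` line is the well-known combined gradient of softmax and cross-entropy, which is why frameworks fuse the two.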
The LM-to-LLM Boundary
When does a language model become a large language model? There's no sharp line. It's scale on three axes:
- Depth: feedforward LMs have 2-3 layers. Transformers have 30-60+.
- Data: n-gram LMs train on millions of words. LLMs train on trillions.
- Parameters: neural LMs have millions. LLMs have billions to hundreds of billions.
But every model covered so far, from bigrams to this feedforward neural LM to the transformers coming later, is doing the same thing: given context, predict the next word. That thread has been running since the first post here.
What You Now Have
Six things you didn't have before reading this:
Neural units and activation functions. Inputs × weights + bias, passed through a non-linearity. Sigmoid has vanishing gradients. ReLU is the practical default. Without non-linearity, depth is useless; multiple linear layers collapse into one.
The XOR lesson. Perceptrons handle AND and OR but not XOR, because XOR isn't linearly separable. A hidden layer solves it by transforming the input into a new representation where the problem is separable. This is why neural networks have hidden layers.
Feedforward network equations. h = f(Wx + b), z = Uh, y = softmax(z). Know the shapes of every matrix. Think of it as logistic regression with learned representations and multiple layers.
The fixed-window neural language model. One-hot → embedding lookup → concatenate → hidden layer → softmax over vocabulary. Solves n-grams' sparsity problem (embeddings generalize) and storage problem (O(n) vs O(exp(n))). Remaining weakness: The window is too small for long-range dependencies.
Task-specific embeddings. Backpropagation can update the embedding matrix alongside network weights, learning representations shaped by the task. More expensive, often more effective than plugging in generic Word2Vec.
The training loop. Forward pass predicts, cross-entropy loss measures the error against the actual next word, and backpropagation updates all parameters. The corpus is its own label. Self-supervised, same as Word2Vec, just on a bigger architecture.
Next post: recurrent neural networks and LSTMs — architectures that process sequences of arbitrary length and finally break free of the fixed window.




