Akash
Recurrent Neural Networks

Sequence Processing, POS Tagging, and the Context Problem

By the end of this post, the fixed-window language model will be a closed chapter, and the recurrent neural network will make sense to you structurally: one new weight matrix, one thread of hidden state running through time. You'll see why n-gram and feedforward language models hit a wall on context. You'll meet the three families of sequence problems (labeling, classification, sequence-to-sequence) and see how each one maps to a different RNN setup. You'll understand the two-pass training algorithm (backpropagation through time) and why it starts leaking signal once sequences get long. Then we'll apply all of it: RNN language models, autoregressive generation (what ChatGPT does under the hood), and encoder-decoder for machine translation.


This is the missing link between feedforward LMs and transformers. You already know why n-gram and feedforward LMs are limited — they see only the last n words. RNNs are the first architecture in our course that gets rid of that window. Once you understand them, transformers will feel like a natural next step: same goal (context without a fixed window), different mechanism (attention instead of recurrence).

Language Is Temporal. That Changes Everything.

Every NLP problem we've looked at so far has treated text as either a bag of tokens (TF-IDF, bag of words for classification) or a fixed window (n-grams, feedforward language models). Neither of those captures something important about how language actually works: words unfold over time, and the meaning of what came earlier can be altered by what comes later.

Let's take a look at an example. Compare these two sentences:

  1. I hope there are no more quizzes between now and the end of the semester.
  2. No, I hope there are more quizzes between now and the end of the semester.

The first one is straightforward: please stop with the quizzes. The second one starts with "No," which negates everything that follows. The meaning of the whole sentence isn't clear until you've read the whole sentence. You can't make a decision about what the speaker means based on the first three words; you have to wait for the structure to unfold.

This is the core challenge of sequence processing. To understand a sentence, you need to carry information across the entire sequence. And language has this property everywhere: conversation flow, news feeds, Twitter streams, anything where tokens arrive in a specific order and earlier tokens alter the interpretation of later ones.


Temporal isn't metaphorical here. We say "the flow of conversation" or "the thread of an argument" for a reason. These metaphors describe an actual property of language: sequences of symbols whose meaning depends on position and order.

The Sliding Window Problem

Two posts ago, we built a feedforward language model with a sliding window. Take three or four previous words, embed them, feed them through a hidden layer, softmax over the vocabulary, predict the next word. Slide forward one token. Repeat.

This works. But it has two disadvantages that become obvious once you think about them.

1. Context is capped at the window size.

If your window is three words wide, the model literally cannot see anything outside those three words. Want to resolve a pronoun reference to something mentioned ten words ago? Too bad — the model never sees it.

2. Repeated work.

As the window slides, the same subsequences get processed over and over. Looking at "Natalie sat down to write the midterm" three words at a time, the word "to" gets embedded and pushed through the network in the window that predicts "write", again in the window that predicts "the", and again in the window that predicts "midterm." Same word, different positions, three separate computations. No sharing, no memory.


Can we get rid of the fixed window? That's what recurrent neural networks are for.


Three Flavors of Sequence Problems

Before we get to RNNs, let's catalog what kinds of sequence problems exist. We split them into three families, and the split matters because each one calls for a different architectural setup.

Quick reference: the three sequence families

Sequence labeling → input and output are the same length, one label per token. Examples: POS tagging, NER.

Sequence classification → input is a sequence, output is a single class. Examples: sentiment analysis, spam detection.

Sequence-to-sequence → input and output are both sequences, possibly of different lengths, with non-aligned tokens. Examples: machine translation, summarization, and QA.

Sequence Labeling

Tag every token in the input with a label from a fixed set. Input and output are the same length, aligned one-to-one.

The two canonical examples:

Part-of-speech (POS) tagging: assign a syntactic category (noun, verb, adjective, etc.) to each word. Takes a sentence, returns a sentence with the same number of tokens, each with a tag.

Named entity recognition (NER): identify spans of text that name specific entities (people, companies, dates, monetary amounts) and classify each span.

Sequence Classification

Assign one label to the entire input sequence. Input is a sequence, output is a single class.

Examples: sentiment analysis (positive/negative for a whole review), spam detection, topic classification, sarcasm detection, and message routing.

Sequence-to-Sequence (Seq2Seq)

Map an input sequence to an output sequence, where the two don't have to be the same length or aligned token-by-token.

Examples: machine translation (English in, Spanish out, possibly with different word count and word order), text summarization (long document in, short summary out), question answering (question in, answer out).

We'll see each of these get solved with RNNs before the post is over. Let's start with sequence labeling.


Part-of-Speech Tagging

Take this sentence:

plays well with others

And tag each word with its syntactic category:

| Word | Possible tags | Correct (in context) |
| --- | --- | --- |
| plays | NNS (plural noun), VBZ (verb) | VBZ |
| well | UH (interjection), JJ (adjective), NN (noun), RB (adverb) | RB |
| with | IN (preposition) | IN |
| others | NNS (plural noun) | NNS |

The tags come from the Penn Treebank, which has 45 to 48 of them depending on how you count (36 core part-of-speech tags, plus tags for punctuation and special symbols). Some words are unambiguous (with is only ever a preposition). Most aren't. Plays can be a plural noun (Shakespeare's plays) or a verb (she plays cricket). Picking the right one requires looking at the rest of the sentence. That's why POS tagging is a sequence problem, not a per-word lookup.

Why Does POS Tagging Matter?

You'd think "what part of speech is this word?" is a pretty dry academic question. It isn't; POS tags are structural glue for downstream tasks:

  • Text-to-speech: how do you pronounce lead? It depends on the part of speech. I lead the group vs. There's lead in the water are pronounced differently.
  • Parsing: once you have POS tags, you can write grammar rules over them. "A noun phrase is an optional determiner, followed by adjectives, followed by nouns."
  • Fallback features: if you don't know a specific word, knowing its POS tag often gives you enough to keep going on higher-level tasks.

The 90% Baseline Trap

Here's a question that sounds easy but rewards thinking about.


Build the dumbest possible POS tagger in five minutes. What would you do?

Answer: for each word, look up its most frequent tag in a reference dictionary. If plays is a plural noun 67% of the time and a verb 30% of the time, tag it as a plural noun every time. If the word is unknown, tag it as a noun.
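
Here's a minimal sketch of that baseline in Python. The tag counts below are made up for illustration; a real run would read them off a tagged corpus like the Penn Treebank.

```python
from collections import Counter, defaultdict

# Toy "training" data: (word, tag) pairs you'd normally read from a tagged corpus.
tagged_corpus = [
    ("plays", "VBZ"), ("plays", "NNS"), ("plays", "VBZ"),
    ("well", "RB"), ("well", "RB"), ("well", "JJ"),
    ("with", "IN"), ("others", "NNS"),
]

# Count tags per word, then keep only the most frequent tag for each word.
tag_counts = defaultdict(Counter)
for word, tag in tagged_corpus:
    tag_counts[word][tag] += 1
most_frequent_tag = {w: c.most_common(1)[0][0] for w, c in tag_counts.items()}

def baseline_tag(sentence):
    # Unknown words fall back to NN (singular noun).
    return [(w, most_frequent_tag.get(w, "NN")) for w in sentence]

print(baseline_tag("plays well with others".split()))
# [('plays', 'VBZ'), ('well', 'RB'), ('with', 'IN'), ('others', 'NNS')]
```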

This dumb algorithm gets about 90% accuracy on standard POS datasets. Why so high?

Two reasons:

  1. A lot of words are unambiguous. Roughly 40% have only one possible tag. Those are free points for any algorithm.
  2. Frequent words dominate the test set. Words like the, a, of, punctuation — these show up constantly, they're short and unambiguous, and they make up a big fraction of the tokens you're evaluating on. Get those right, and your accuracy goes way up.

So 90% is the floor. It's not that the algorithm is clever; it's that the test distribution favors easy words.

Why Isn't 90% Good Enough?

State-of-the-art POS taggers today reach about 97% accuracy, roughly the ceiling set by human inter-annotator agreement. The gap between 90% and 97% sounds small until you do the math.

A 10% error rate means about one wrong tag per ten-word sentence: with 90% per-token accuracy, a ten-word sentence comes out with every tag correct only about 35% of the time (0.9^10 ≈ 0.35). So most sentences have a mistake somewhere. And POS tags get fed into downstream tasks: parsers, semantic analyzers, and coreference resolvers. A wrong tag propagates. Garbage in, garbage out.

Somebody once put it this way: "I love the 90% accuracy. I hate the 10% error."


Named Entity Recognition

NER is the other canonical sequence labeling task. Take a news article like this:

Bridgestone Sports Company said Friday it has set up a joint venture in Hong Kong with a local company...

A good NER system tags every token. The interesting ones:

  • Bridgestone Sports Company → COMPANY
  • Friday → DATE
  • Hong Kong → CITY/LOCATION

Everything else (said, it has set up a joint venture, etc.) gets tagged O (for "other" — not an entity).

NER is trickier than POS tagging because the task has two parts wrapped into one:

  1. Classify the entity type (is this a company, a person, a date, a city?).
  2. Get the span right (all of Bridgestone Sports Company, not just Bridgestone).

You get penalized for missing either one. Identifying Bridgestone as a company but cutting off Sports Company earns partial credit at best.


Sequence Classification, and the Near Miss

Now contrast those labeling tasks with sarcasm detection. You read a sentence:

Natalie was soooo thrilled that Usman had a famous new poem.

And you label the whole thing as sarcastic or not sarcastic. One label for the whole sequence. That's sequence classification.

Here's the near miss. Sarcasm detection looks like it demands sequence modeling. It's a sentence-level task with linguistic subtlety. You'd reach for an RNN on instinct. But you can often solve it without any temporal information at all. Throw the tokens into a bag, compute TF-IDF features, and run a classifier. You lose the order, but signals like soooo and totally still light up as statistically correlated with sarcasm.

The near miss teaches a rule: sequence classification is not automatically a sequence problem. Some of it is. Some of it isn't. The decision to reach for an RNN should come from the task needing order, not from the input being a sentence.


Temporal information isn't free. An RNN costs you training time and serial computation that a bag-of-words model doesn't. If word order barely matters for the task, TF-IDF features with a linear classifier will often match it. Save the RNN for problems where order is actually load-bearing.

That said, accuracy often climbs when you do use temporal info. "Not bad" and "bad not" mean different things. A bag of words can't tell them apart. An RNN can.

So the question to ask on every sequence classification task is the same: how much does order matter? For sentiment polarity, some. For sarcasm, more. For machine translation, order is half the problem.


Simple Recurrent Neural Networks

Now the main event. Recurrent neural networks are the first architecture we've seen that removes the fixed-window limit on context.

Before we go further, let's set the boundary around what we're about to see.


What RNNs are NOT. They are not a free way to parallelize computation. Every timestep depends on the previous one, so forward inference is inherently serial. They are also not a cure-all for long-range context: in practice, the vanishing gradient problem means signal from the distant past often can't reach the present, even though the architecture says it should. And they are not the final word on sequence modeling; transformers have since outperformed them on nearly every task. What RNNs are is the first architecture that carries a thread of memory across arbitrary sequence length, and the bridge from fixed-window models to everything that came after.

The Core Idea

In a feedforward net, the hidden layer at time t depends only on the input at time t. You see $x_t$, compute $h_t$, produce $y_t$. Done.

In a simple RNN (also called an Elman network), the hidden layer at time t depends on both the input at time t and the hidden layer from the previous timestep. The previous hidden state threads forward into the current computation, and then into the next, and into the next after that, a single ribbon of memory running the length of the sequence.

The recurrence adds one new weight matrix, U, that connects the previous hidden layer to the current one:

$$h_t = g(U h_{t-1} + W x_t)$$
$$y_t = \text{softmax}(V h_t)$$

Where:

  • $W$ is the input-to-hidden weight matrix (same as in a feedforward net)
  • $U$ is the new hidden-to-hidden weight matrix (this is what makes it recurrent)
  • $V$ is the hidden-to-output weight matrix
  • $g$ is a nonlinearity like tanh
  • $h_{t-1}$ is the hidden layer from the previous timestep

That's the whole architectural change. Everything else is the same as a feedforward net.
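
To make the shapes concrete, here's a minimal numpy sketch of those two equations. The sizes, names like rnn_step, and the random initialization are arbitrary choices for illustration, not anything a trained model would use.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 5, 4, 3                     # input, hidden, output sizes (arbitrary)

W = rng.normal(scale=0.1, size=(d_h, d_in))    # input -> hidden
U = rng.normal(scale=0.1, size=(d_h, d_h))     # hidden -> hidden (the new, recurrent matrix)
V = rng.normal(scale=0.1, size=(d_out, d_h))   # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(h_prev, x_t):
    """One timestep: h_t = g(U h_{t-1} + W x_t), y_t = softmax(V h_t)."""
    h_t = np.tanh(U @ h_prev + W @ x_t)
    y_t = softmax(V @ h_t)
    return h_t, y_t

h0 = np.zeros(d_h)                  # initial hidden state
x1 = rng.normal(size=d_in)          # stand-in for the first word's input vector
h1, y1 = rnn_step(h0, x1)
print(h1.shape, y1.sum())           # hidden state shape and a distribution summing to 1
```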

Weights Are Shared Across Time

One thing worth stopping on: the matrices W, U, and V are the same at every timestep. Whether you're processing word 1 or word 20, you use the same weights. The model size doesn't grow with the length of the sequence.

This is what makes training feasible. You don't learn a separate weight matrix per position. You learn one set of weights that gets applied at every step.



Forward Inference: One Word at a Time

Unrolling an RNN in time looks like this: you process $x_1$, then $x_2$, then $x_3$, each step using the hidden state from the step before.

At step 1, you need some initial hidden state $h_0$ (usually zeros). You compute $h_1 = g(U h_0 + W x_1)$.

At step 2, compute $h_2 = g(U h_1 + W x_2)$. Notice that $h_2$ depends on $h_1$, which depended on $x_1$. So in principle, $h_2$ carries information about both words.

At step 3, $h_3 = g(U h_2 + W x_3)$. Now $h_3$ has information from $x_1, x_2, x_3$.

Keep going. By the time you get to step 10, $h_{10}$ has, in theory, absorbed information from every input it's seen, all the way back to $x_1$.
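
In code, the unrolled computation is just a loop that threads h forward, reusing the same W and U at every step. A quick sketch (toy sizes, random vectors standing in for word embeddings) that also checks the claim: perturbing the first input changes the final hidden state, so information really does persist, at least numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h = 5, 4
W = rng.normal(scale=0.1, size=(d_h, d_in))
U = rng.normal(scale=0.1, size=(d_h, d_h))

def run(xs):
    """Unroll the recurrence over a whole sequence; same W and U at every step."""
    h = np.zeros(d_h)                      # h_0
    for x_t in xs:                         # strictly left to right
        h = np.tanh(U @ h + W @ x_t)
    return h                               # final hidden state

xs = [rng.normal(size=d_in) for _ in range(10)]
h_final = run(xs)

# Perturb only the *first* input and run again: the final state shifts,
# so h_10 really does carry information all the way back to x_1.
xs_perturbed = [xs[0] + 1.0] + xs[1:]
print(np.abs(run(xs_perturbed) - h_final).max() > 0)   # True (though the effect may be tiny)
```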


This is the whole reason RNNs exist. The hidden state is a running memory of everything seen so far. No fixed-size window. Context extends as far back as the sequence itself goes.

Iterative, Not Parallel

The downside of this design is that forward inference is sequential. You cannot compute $h_3$ until you've computed $h_2$, and you can't compute $h_2$ until you've done $h_1$. Every step depends on the previous one.

This is slower than a feedforward net (which can batch everything in parallel) and is one of the main motivations for transformers, which get rid of the sequential dependency entirely. But we'll get to that later.


Training RNNs: Backpropagation Through Time

RNNs are trained with the same core machinery as feedforward nets: a loss function, gradient descent, and backpropagation. The wrinkle is that gradients now have to flow backward, not just through layers, but through time.

Two Considerations That Make Training Different

In a feedforward net, the loss at position t depends only on inputs at position t. Gradients flow backward through a single stack of layers.

In an RNN:

  1. Computing the loss at time t requires the hidden layer from time t−1, which required the hidden layer from t−2, and so on back to the beginning.
  2. The hidden layer at time t affects both the output at time t and the hidden layer at time t+1 (and thus the output at t+1, and so on).

So when you compute how much a weight should change based on the error at time t, you have to trace backward through time, accumulating contributions from every subsequent timestep.

The Two-Pass Algorithm

Training an RNN is a two-pass procedure:

[Pass 1] Forward inference. Process the input sequence left to right, computing and saving $h_t$ and $y_t$ at each step, accumulating the loss along the way.

[Pass 2] Backward pass. Process the sequence right to left, computing error gradients as you go. Accumulate gradient contributions for the weight matrices W, U, and V across all timesteps, then apply the updates.

This is called backpropagation through time (BPTT). It's conceptually the same as regular backprop, just extended over the temporal dimension of the unrolled network.
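
Here's a minimal sketch of both passes for a simple RNN with a cross-entropy loss at every timestep. Toy sizes, one-hot inputs, no batching or optimizer; it's meant to show where each gradient contribution comes from, not to be an efficient implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
V_size, d_h = 6, 4                               # toy vocabulary and hidden sizes
W = rng.normal(scale=0.1, size=(d_h, V_size))    # input -> hidden
U = rng.normal(scale=0.1, size=(d_h, d_h))       # hidden -> hidden
V = rng.normal(scale=0.1, size=(V_size, d_h))    # hidden -> output

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bptt(inputs, targets):
    """inputs/targets: lists of token ids. Returns loss and gradients for W, U, V."""
    # Pass 1: forward, left to right, saving hidden states and output distributions.
    hs, ys, xs = {-1: np.zeros(d_h)}, {}, {}
    loss = 0.0
    for t, (tok, tgt) in enumerate(zip(inputs, targets)):
        x = np.zeros(V_size); x[tok] = 1.0         # one-hot input
        xs[t] = x
        hs[t] = np.tanh(U @ hs[t - 1] + W @ x)
        ys[t] = softmax(V @ hs[t])
        loss -= np.log(ys[t][tgt])
    # Pass 2: backward, right to left, accumulating gradients across all timesteps.
    dW, dU, dV = np.zeros_like(W), np.zeros_like(U), np.zeros_like(V)
    dh_next = np.zeros(d_h)                        # gradient flowing in from step t+1
    for t in reversed(range(len(inputs))):
        dy = ys[t].copy(); dy[targets[t]] -= 1.0   # d loss / d logits (softmax + cross-entropy)
        dV += np.outer(dy, hs[t])
        dh = V.T @ dy + dh_next                    # error at t plus error from later steps
        dz = dh * (1.0 - hs[t] ** 2)               # back through the tanh
        dW += np.outer(dz, xs[t])
        dU += np.outer(dz, hs[t - 1])
        dh_next = U.T @ dz                         # pass error back to step t-1
    return loss, dW, dU, dV

loss, dW, dU, dV = bptt([0, 3, 2], [3, 2, 5])
print(round(loss, 3))
```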

The Vanishing Gradient Problem

There's a subtle issue with BPTT that bites RNNs hard on long sequences. The thread of memory frays under training.

When you backpropagate through t timesteps, you're multiplying derivatives together t times. If those derivatives are small (less than 1 in magnitude), the product shrinks exponentially. By the time the error signal has traveled back 15 or 20 timesteps, the gradient is effectively zero.

The model stops learning from long-range dependencies. In theory, $h_{20}$ still carries information from $x_1$. In practice, the gradient-based training signal can't propagate far enough to teach the model how to use it. The architecture promises arbitrary context length. The training procedure can't deliver on the promise past a certain distance.
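
You can see the collapse with nothing more than repeated multiplication. Treating the per-step factor as a single scalar is a drastic simplification of the real Jacobian product, purely for illustration:

```python
# Each backward step multiplies the gradient by roughly U^T * diag(tanh'(.)).
# If that factor shrinks the gradient by even 20% per step, 20 steps nearly erase it.
per_step_shrinkage = 0.8
for t in [1, 5, 10, 20, 50]:
    print(t, per_step_shrinkage ** t)
# 1 -> 0.8, 5 -> 0.33, 10 -> 0.11, 20 -> 0.012, 50 -> 1.4e-05
```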



This is why "RNNs and LSTMs" are usually taught together, not just "RNNs" alone. Long Short-Term Memory networks (and Gated Recurrent Units) are engineered to fix this specific problem — they add gating mechanisms that let the model explicitly decide what to keep and what to forget across long spans. We'll get to them in the next post.

RNN Language Models

With forward inference and training understood, let's apply this to the task that kicked off the whole series: language modeling.

The setup is almost identical to a feedforward language model. At each timestep, the model takes a word, produces a probability distribution over the vocabulary, predicting the next word.

Input sequence: $x_1, x_2, \ldots, x_n$. At each position, look up the word embedding via an embedding matrix, combine with the previous hidden state, and softmax over the vocabulary:

$$e_t = E x_t$$
$$h_t = g(U h_{t-1} + W e_t)$$
$$\hat{y}_t = \text{softmax}(V h_t)$$

The $\hat{y}_t$ vector gives you the probability of each word in the vocabulary being the next word at position $t+1$.
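
A sketch of one timestep of that language model in numpy. The sizes and weights here are toy and untrained; the point is the shape of the computation: one-hot lookup into E, recurrence, softmax over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, d_emb, d_h = 10, 8, 6
E = rng.normal(scale=0.1, size=(d_emb, vocab_size))  # embedding matrix
W = rng.normal(scale=0.1, size=(d_h, d_emb))
U = rng.normal(scale=0.1, size=(d_h, d_h))
V = rng.normal(scale=0.1, size=(vocab_size, d_h))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lm_step(h_prev, token_id):
    x = np.zeros(vocab_size); x[token_id] = 1.0
    e_t = E @ x                        # embedding lookup (equivalent to E[:, token_id])
    h_t = np.tanh(U @ h_prev + W @ e_t)
    y_hat = softmax(V @ h_t)           # distribution over the next word
    return h_t, y_hat

h = np.zeros(d_h)
for token in [4, 7, 1, 2]:             # stand-ins for "the students opened their"
    h, y_hat = lm_step(h, token)
print(y_hat.argmax(), y_hat.max())     # most probable next token under this (untrained) model
```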

"The Students Opened Their ___"

Here's the prototypical example. Feed in the, students, opened, their. The model is supposed to predict what comes next.

Common candidates: books, laptops, minds.

A feedforward LM with a three-word window only sees students, opened, their. It's missing the, which is arguably helpful here but not critical. More importantly, if the sentence were twenty words long and depended on students mentioned at position 2, a fixed-window LM would have dropped that context long ago.

The RNN language model has the thread of hidden state running from "the" all the way through to "their". When it predicts the next word, it's conditioning on the full left context. Words like books and laptops get high probability because they're statistically associated with students, and the RNN still has that earlier word sitting in its hidden state, available.


Evaluating with Perplexity

Same metric we used for n-gram models: perplexity on a held-out test corpus.

$$\text{Perplexity}_\theta(w_{1:n}) = P_\theta(w_{1:n})^{-\frac{1}{n}}$$

Lower is better. It's the inverse probability of the corpus, normalized by length: a measure of how surprised the model is, on average, by the true next word.
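
In log space this is just the exponential of the average negative log-probability, which is how you'd actually compute it. A sketch with made-up per-token probabilities:

```python
import math

# Probabilities the model assigned to the *true* next word at each position.
token_probs = [0.20, 0.05, 0.40, 0.10, 0.25]

n = len(token_probs)
avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
perplexity = math.exp(avg_neg_log_prob)   # equals (p1 * p2 * ... * pn) ** (-1/n)
print(round(perplexity, 2))               # about 6.31 for these numbers
```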

Here's a comparison table from a few years ago:

| Model | Perplexity |
| --- | --- |
| 5-gram language model | 67.6 |
| 2-layer LSTM | 39.8 |

Same test corpus, massive difference. And this is with an LSTM, which is a refined RNN. The improvement comes entirely from being able to condition on long-range context, not from anything fancier. Modern transformer-based LMs drive this number much lower still.


Autoregressive Generation

Once you have a trained language model, you can generate text from it. The procedure is called autoregressive generation, and it's exactly what ChatGPT does under the hood.

  1. Feed in a starting token (a sentence marker, or a prompt).
  2. Sample the next word from the softmax distribution the model produces.
  3. Feed that sampled word back in as the next input.
  4. Repeat until you hit an end-of-sequence token or a length limit.

The word "autoregressive" means the output at time t is used as part of the input at time t+1. The model is regressing on its own previous outputs.


The key to making autoregressive generation actually useful is priming the model with good context. You don't start from scratch with just <s>. You start with a prompt, a system message, an instruction, a question, or the full preceding conversation. ChatGPT isn't doing anything more exotic than this; it's an autoregressive language model with a very well-chosen starting context.

Sequence-to-Sequence Problems

POS tagging is sequence labeling (input and output are the same length). Sentiment analysis is sequence classification (input is a sequence, output is a single class). But a lot of interesting NLP problems don't fit either mold.

Machine translation. Input: an English sentence. Output: a Spanish sentence. The two sentences can have different word counts, different word orders, and the correspondence between them isn't one-to-one.

Text summarization. Input: a long article. Output: a short summary. Again, no alignment, very different lengths.

Question answering. Input: a question. Output: an answer. Arbitrary lengths.

These are sequence-to-sequence problems, and they call for a different architecture.

Encoder-Decoder

The standard setup for seq2seq is the encoder-decoder architecture. Two RNNs working together:

  1. The encoder reads the input sequence and produces a final hidden state that summarizes the whole input (sometimes called the context vector).
  2. The decoder takes that context and autoregressively generates the output sequence, one token at a time.

For machine translation, this lets you train on parallel corpora — pairs of sentences in two languages — and learn to map arbitrary English sentences to Spanish (or any other pair). Same idea for summarization (long-short pairs) and QA (question-answer pairs).
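
A structural sketch of the two RNNs. Everything here is toy and untrained, and the decoder uses greedy argmax instead of sampling; the point is the handoff of the encoder's final hidden state into the decoder.

```python
import numpy as np

rng = np.random.default_rng(5)
src_V, tgt_V, d = 8, 9, 6                       # toy source/target vocab sizes, hidden size
W_enc = rng.normal(scale=0.1, size=(d, src_V))  # encoder weights
U_enc = rng.normal(scale=0.1, size=(d, d))
W_dec = rng.normal(scale=0.1, size=(d, tgt_V))  # decoder weights
U_dec = rng.normal(scale=0.1, size=(d, d))
V_out = rng.normal(scale=0.1, size=(tgt_V, d))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(src_ids):
    """Read the whole source; return the final hidden state (the context vector)."""
    h = np.zeros(d)
    for i in src_ids:
        x = np.zeros(src_V); x[i] = 1.0
        h = np.tanh(U_enc @ h + W_enc @ x)
    return h

def decode(context, bos=0, eos=1, max_len=12):
    """Generate target tokens autoregressively, starting from the encoder's summary."""
    h, tok, out = context, bos, []
    for _ in range(max_len):
        x = np.zeros(tgt_V); x[tok] = 1.0
        h = np.tanh(U_dec @ h + W_dec @ x)
        tok = int(np.argmax(softmax(V_out @ h)))   # greedy choice of next target word
        if tok == eos:
            break
        out.append(tok)
    return out

print(decode(encode([2, 5, 3])))                   # token ids of the "translation"
```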


This is the motivation for the encoder-decoder LLM family we saw in the last post. Models like Flan-T5 are specifically designed for seq2seq tasks, and their architecture (encoder-decoder) traces directly back to the RNN encoder-decoder models we're building up here. Transformers replaced the RNN parts, but the overall shape stayed the same.


What You Now Have

Eight things from this post:

  1. Language is temporal. The meaning of later tokens depends on earlier ones, and vice versa. Fixed-window models (n-grams, feedforward LMs) can't capture this at all.

  2. Three families of sequence problems. Sequence labeling (POS, NER — one tag per token). Sequence classification (sentiment, spam — one label per whole sequence). Sequence-to-sequence (translation, summarization, QA — arbitrary output sequence).

  3. POS tagging: Penn Treebank tags, 45 categories, one per word. 90% baseline accuracy with "most frequent tag" is high because of unambiguous words and frequent stopwords — but still not good enough for downstream tasks because errors propagate.

  4. NER involves two subtasks. Classifying the entity type (company, date, location) and getting the span right. Miss either one and you lose points.

  5. Simple RNN architecture. A recurrent connection $U$ from the previous hidden layer to the current one, giving $h_t = g(U h_{t-1} + W x_t)$. Weights W, U, and V are shared across all timesteps. Context is no longer capped at a window size.

  6. Training via backpropagation through time (BPTT). Two-pass: forward inference left-to-right saving hidden states, then backward pass right-to-left accumulating gradients. Plagued by vanishing gradients on long sequences — LSTMs fix this (next post).

  7. RNN language models. Perplexity drops substantially vs. n-gram models (67.6 → 39.8 in one benchmark) because long-range context becomes usable. Autoregressive generation = sampling one word at a time, feeding outputs back as inputs — the core mechanism behind modern chat LLMs.

  8. Seq2seq via encoder-decoder. Two RNNs: one summarizes the input, the other generates the output conditioned on that summary. This is the shape of machine translation, summarization, and QA systems, and the direct ancestor of the encoder-decoder transformer models.


Next up: LSTMs and GRUs. Simple RNNs are the right conceptual frame, but the vanishing gradient problem means they can't actually use the full context the architecture promises. Gated networks fix this by explicitly learning what to remember and what to forget. Then we'll meet bidirectional RNNs, and finally, attention — the mechanism the transformer architecture was built around. From here on, most of modern NLP is a response to problems that were first identified in simple RNNs, which means you've now got the foundation that everything after this sits on.
