Archit Verma

Posted on Jun 25

The Journey to Transformers: How RNNs, ByteNet, and ConvS2S Shaped Modern AI

#ai #deeplearning #development #llm

Before Transformers Took Over

When people talk about modern AI today, the conversation usually jumps straight to Transformers. GPT, Claude, Gemini, Llama — they all sit on top of that same idea:

let every token look at every other token directly.

But that was not always the obvious path.

Before the Transformer arrived in 2017 and went on to become the default choice, researchers were still wrestling with a very old problem in deep learning:

how do you teach a model to understand sequences without making it painfully slow or forgetful?

For a long time, the answer was Recurrent Neural Networks (RNNs) and later LSTMs. They were elegant in theory, but in practice they had a frustrating weakness: they had to read everything one step at a time.

That meant two things:

Training was slow because the model could not easily work on many tokens at once
Long-distance relationships were hard to preserve because information had to travel through many steps

So researchers started asking a very human question:

What if sequence models did not have to think like a chain?

That question led to a fascinating family of models built with Convolutional Neural Networks (CNNs) instead of recurrence.

Three of the most important ones were:

Extended Neural GPU
ByteNet
ConvS2S (Convolutional Sequence-to-Sequence)

These models did not win the final battle, but they changed the direction of the field. They showed that sequence modeling could be parallel, efficient, and still powerful enough to handle language-like tasks.

The Problem with RNNs

Think about reading this sentence:

"The cat that sat on the mat near the window in the house was sleeping."

As a human, you probably do not consciously track every word in order to understand that "sleeping" refers to "cat". Your brain keeps the important pieces in mind and connects them naturally.
An RNN, however, has to do something much more mechanical.

cat → word → word → word → word → sleeping

The meaning has to pass through every intermediate step.
That creates two big problems:

Slow training — because each step depends on the previous one
Weak long-range memory — because important information can fade as it moves through the sequence
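
Both problems can be seen in a toy example. The sketch below is a one-dimensional tanh RNN in plain Python; the weights `w_rec` and `w_in` and the helper names are made up for illustration and do not come from any particular paper or library. The loop cannot be parallelized across time steps because each state needs the previous one, and the influence of the first state on the last shrinks as the sequence gets longer.

```python
import math

def run_rnn(inputs, w_rec=0.5, w_in=1.0):
    """Scalar tanh RNN: h_t = tanh(w_rec * h_{t-1} + w_in * x_t)."""
    h = 0.0
    states = []
    for x in inputs:  # strictly sequential: step t needs the state from step t-1
        h = math.tanh(w_rec * h + w_in * x)
        states.append(h)
    return states

def influence_of_first_state(states, w_rec=0.5):
    """d h_T / d h_1 by the chain rule: product of w_rec * (1 - h_t^2) for t = 2..T."""
    grad = 1.0
    for h in states[1:]:
        grad *= w_rec * (1.0 - h * h)
    return grad

short_states = run_rnn([1.0] * 3)
long_states = run_rnn([1.0] * 30)
short_influence = influence_of_first_state(short_states)
long_influence = influence_of_first_state(long_states)
print(short_influence, long_influence)
```

With these illustrative weights, every factor in the product is at most 0.5, so the influence over 30 steps is many orders of magnitude smaller than over 3 steps. LSTMs and GRUs were designed to soften this effect, but they keep the step-by-step loop.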

This made researchers wonder whether sequence understanding really needed to be built like a chain at all.

What if the model could look at many positions together instead of one after another?

That idea opened the door to CNN-based sequence models.

1. Extended Neural GPU

The Extended Neural GPU was one of the earliest attempts to move beyond recurrence while still handling sequence-like tasks.
Instead of treating a sentence or number like a line of tokens that must be read step by step, it represents the input more like a grid and repeatedly applies convolution operations over it.

Architecture

Input Sequence
↓
Embedding Grid
↓
Convolution Layer
↓
Convolution Layer
↓
Convolution Layer
↓
Output

Why Was It Created?

The motivation was not just language. Researchers wanted a model that could learn algorithmic behavior — things like:

addition
multiplication
sorting
sequence transformations

These are tasks where the model must learn a procedure, not just memorize patterns.

The Extended Neural GPU was an attempt to say:

Maybe a neural network can learn structured computation without being forced into recurrence.

Advantages

highly parallelizable
efficient on GPUs
capable of learning algorithm-like patterns
avoids the token-by-token bottleneck of RNNs by updating all positions at once

Limitation

Its weakness was distance.
If two pieces of information were far apart:

Word A ------------------------- Word B

the model had to move information through several convolution layers before those two positions could interact.

So while it was faster than an RNN in many ways, it still struggled when relationships stretched across long spans.
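
To make the distance problem concrete, here is a small simulation, assuming plain stacked convolutions with a centered kernel and no dilation. It is not the Neural GPU itself, and the function name `layers_until_connected` is hypothetical. It tracks which positions can carry information from position 0 after each layer and counts layers until a target position is reached.

```python
def layers_until_connected(distance, kernel_width=3):
    """Count stacked conv layers until position 0 can influence position `distance`."""
    reach = kernel_width // 2  # how far a centered kernel looks to each side
    if reach == 0:
        raise ValueError("kernel_width must be at least 3 for information to spread")
    influenced = {0}  # positions that currently carry information from position 0
    layers = 0
    while distance not in influenced:
        # One more layer: every influenced position spreads to its neighbors.
        influenced = {p + o for p in influenced for o in range(-reach, reach + 1)}
        layers += 1
    return layers

print(layers_until_connected(8))                   # 8
print(layers_until_connected(8, kernel_width=5))   # 4
print(layers_until_connected(16))                  # 16
```

A wider kernel lowers the count by a constant factor, but the number of layers still grows in proportion to the distance.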

2. ByteNet

ByteNet, introduced by DeepMind, took a more practical step toward language modeling and machine translation.

Its key idea was simple but powerful: instead of only looking at nearby words, let the model expand its view in a smarter way.
That is where dilated convolutions came in.

Understanding Dilated Convolutions

A normal convolution sees only a small local neighborhood.

A B C D E F G
^ ^ ^

A dilated convolution skips positions so the model can see farther without needing many extra layers.

A B C D E F G
^   ^   ^

Examples, for a kernel of width 3:

Dilation = 1 (taps at positions 1, 2, 3)
1 2 3 4 5 6 7 8 9
^ ^ ^

Dilation = 2 (taps at positions 1, 3, 5)
1 2 3 4 5 6 7 8 9
^   ^   ^

Dilation = 4 (taps at positions 1, 5, 9)
1 2 3 4 5 6 7 8 9
^       ^       ^

As dilation increases, the model’s field of view grows quickly.
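
The operation itself is short. Below is a minimal plain-Python sketch of a one-dimensional dilated convolution, written as a "valid" cross-correlation with no padding. The function name `dilated_conv1d` and the all-ones kernel are illustrative assumptions, not ByteNet's actual layer, which also uses learned weights, masking, and residual blocks.

```python
def dilated_conv1d(xs, weights, dilation=1):
    """1D convolution whose taps are `dilation` positions apart (no padding)."""
    span = (len(weights) - 1) * dilation  # distance from the first tap to the last
    out = []
    for start in range(len(xs) - span):
        taps = [xs[start + i * dilation] for i in range(len(weights))]
        out.append(sum(w * x for w, x in zip(weights, taps)))
    return out

sequence = list(range(1, 10))  # positions 1..9
kernel = [1, 1, 1]             # sum the three tapped values

print(dilated_conv1d(sequence, kernel, dilation=1)[0])  # 1 + 2 + 3 = 6
print(dilated_conv1d(sequence, kernel, dilation=2)[0])  # 1 + 3 + 5 = 9
print(dilated_conv1d(sequence, kernel, dilation=4)[0])  # 1 + 5 + 9 = 15
```

The kernel still has only three weights in every case. Dilation changes which inputs those weights touch, not how many parameters the layer has.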

Why Is This Useful?

Consider this sentence:
"The movie that I watched yesterday was amazing."
To understand "was amazing", the model may need to connect that phrase back to "movie".
A standard CNN would need many layers to make that connection.

ByteNet made that path shorter by letting information jump across the sequence more efficiently.
In other words, it gave the model a way to see both the local details and the broader context without reading everything in a strictly linear way.

Complexity Advantage

Because the dilation rate doubles from one layer to the next, the number of steps needed to connect distant positions grows logarithmically with distance.

Distance = 2 → 1 step
Distance = 4 → 2 steps
Distance = 8 → 3 steps
Distance = 16 → 4 steps

That was a major improvement over ordinary convolutional approaches.
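
A back-of-the-envelope calculation shows where the logarithm comes from. The sketch below assumes a kernel of width 3 and a dilation that keeps doubling at every layer (1, 2, 4, ...). The real ByteNet resets the dilation after a few layers, and the exact step counts depend on the kernel width, so treat the numbers as illustrative. The function names are hypothetical.

```python
def receptive_field(num_layers, kernel_width=3, doubling=True):
    """Input positions visible to one output after `num_layers` stacked conv layers."""
    field = 1
    for layer in range(num_layers):
        dilation = 2 ** layer if doubling else 1
        field += (kernel_width - 1) * dilation  # each layer widens the view
    return field

def layers_needed(span, kernel_width=3, doubling=True):
    """Smallest number of layers whose receptive field covers `span` positions."""
    layers = 0
    while receptive_field(layers, kernel_width, doubling) < span:
        layers += 1
    return layers

print(receptive_field(4))                    # 31 positions with dilations 1, 2, 4, 8
print(receptive_field(4, doubling=False))    # 9 positions without dilation
print(layers_needed(1000))                   # 9 layers with doubling dilation
print(layers_needed(1000, doubling=False))   # 500 layers without dilation
```

Covering a span of 1000 positions takes 9 layers with doubling dilation and 500 without it, which is the gap between logarithmic and linear growth.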

3. ConvS2S (Convolutional Sequence-to-Sequence)

ConvS2S was another important step in this evolution. Its goal was ambitious:

Replace RNN-based encoder-decoder systems with CNNs.

That may sound like a small architectural change, but it was actually a big shift in thinking.

Instead of forcing the model to process a sentence one token at a time, ConvS2S used stacked convolution layers to build context, and attention to help the model focus on the right parts of the input.

Architecture

Input
↓
CNN Encoder
↓
Attention
↓
CNN Decoder
↓
Output

ByteNet stacked its decoder directly on top of its encoder and did not use attention. ConvS2S instead leaned into the attention-based encoder-decoder setup that had already become popular in translation systems.

It combined:

deep stacked CNN layers
attention mechanisms
encoder-decoder structure

Example
English:
I love programming
French:
J'aime programmer

The encoder turns the input into useful contextual features, and the decoder uses those features to generate the translated output.
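
The attention step that links decoder to encoder can be sketched in a few lines. This is a simplified dot-product attention in plain Python, not the full ConvS2S layer, which also uses learned projections and gated convolutions. The two-dimensional feature vectors and the decoder state below are invented purely for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_outputs):
    """Weight each encoder position by its dot product with the decoder state."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc)) for enc in encoder_outputs]
    weights = softmax(scores)
    dim = len(decoder_state)
    context = [sum(w * enc[j] for w, enc in zip(weights, encoder_outputs))
               for j in range(dim)]
    return weights, context

# Hypothetical encoder features for the source tokens "I", "love", "programming".
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
decoder_state = [0.0, 2.0]  # made-up decoder state at one output step

weights, context = attend(decoder_state, encoder_outputs)
print(weights)
print(context)
```

In this toy setup the largest weight falls on the second source token, because its features line up best with the decoder state. The context vector is the weighted average that the decoder would use for its next prediction.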

Benefits

faster training than RNNs
fully parallel computation across positions during training
better GPU utilization
easier optimization in practice

Limitation

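Even with attention, ConvS2S still had a structural weakness: inside the encoder and the decoder, information had to move through layers.

Each convolution layer widens the view only by a fixed amount set by the kernel width, so the number of layers between two distant tokens still grew linearly with the distance between them. Ignoring that constant factor:

Distance = 10 → on the order of 10 hops
Distance = 100 → on the order of 100 hops
Distance = 1000 → on the order of 1000 hops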

So although it was much better than a plain RNN in speed, it still did not solve the deeper problem of long-range dependency as elegantly as later models would.

Why Transformers Won

Let’s go back to the earlier sentence:
"The cat that sat on the mat near the window in the house ... was sleeping."

The word "sleeping" depends on "cat".

That relationship is easy for a human to hold in mind, but for a model it depends on how directly the two words can communicate.

ConvS2S

Information moves layer by layer.
cat → → → → → sleeping
Path Length: Linear

ByteNet

Dilated convolutions reduce the number of steps.
cat → → sleeping
Path Length: Logarithmic

Transformer

Self-attention lets the model connect them directly.
cat ---------------- sleeping
Path Length: Constant

That is the real breakthrough.

A Transformer does not force information to travel through a long chain. Any token can look at any other token immediately. That makes it much better at:

learning long-range dependencies
keeping gradients healthy during training
using parallel hardware efficiently
scaling to larger models and longer contexts

In a sense, Transformers did not just improve sequence modeling — they changed the rules of the game.

Architecture Comparison

Self-Attention
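
As a rough illustration of that direct connection, here is a minimal sketch of scaled dot-product self-attention in plain Python. It assumes queries, keys, and values are all the raw input vectors, with no learned projections, no multiple heads, and no positional encoding, so it is far simpler than a real Transformer layer. The token vectors are invented for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Every position attends to every position in a single step."""
    dim = len(vectors[0])
    outputs, attention = [], []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        attention.append(weights)
        outputs.append([sum(w * value[j] for w, value in zip(weights, vectors))
                        for j in range(dim)])
    return outputs, attention

# Made-up 2-d vectors: "cat" and "sleeping" point the same way, fillers do not.
cat, filler, sleeping = [4.0, 0.0], [0.0, 1.0], [4.0, 0.0]
sentence = [cat] + [filler] * 10 + [sleeping]

outputs, attention = self_attention(sentence)
weights_for_sleeping = attention[-1]
print(weights_for_sleeping[0])  # weight that "sleeping" places on "cat"
print(weights_for_sleeping[1])  # weight that "sleeping" places on a filler word
```

The ten filler words in between do not add any hops. "sleeping" scores "cat" directly, and in this toy setup nearly all of its attention is split between "cat" and itself.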

Evolution Timeline

RNN (1980s–1990s)
↓
LSTM (1997)
↓
GRU (2014)
↓
Extended Neural GPU (2016)
↓
ByteNet (2016)
↓
ConvS2S (2017)
↓
Transformer (2017)
↓
BERT
GPT
Llama
Claude
Gemini

Final Thoughts

Extended Neural GPU, ByteNet, and ConvS2S are often treated like footnotes in the history of deep learning, but they deserve more credit than that.

They were part of a very important transition.

At a time when RNNs still dominated sequence modeling, these architectures asked a different question: what if language and sequence understanding could be built in parallel instead of step by step?

That question mattered.

Even though Transformers eventually outperformed them, these CNN-based models helped prove that recurrence was not the only way forward. They explored speed, structure, and context in new ways, and they helped prepare the field for the self-attention revolution.

In that sense, they were not failed experiments.