Before Transformers Took Over
When people talk about modern AI today, the conversation usually jumps straight to Transformers. GPT, Claude, Gemini, Llama — they all sit on top of that same idea:
let every token look at every other token directly.
But that was not always the obvious path.
Before the Transformer arrived in 2017 and went on to become the default choice, researchers were still wrestling with a very old problem in deep learning:
how do you teach a model to understand sequences without making it painfully slow or forgetful?
For a long time, the answer was Recurrent Neural Networks (RNNs) and later LSTMs. They were elegant in theory, but in practice they had a frustrating weakness: they had to read everything one step at a time.
That meant two things:
- Training was slow because the model could not easily work on many tokens at once
- Long-distance relationships were hard to preserve because information had to travel through many steps
So researchers started asking a very human question:
What if sequence models did not have to think like a chain?
That question led to a fascinating family of models built with Convolutional Neural Networks (CNNs) instead of recurrence.
Three of the most important ones were:
- Extended Neural GPU
- ByteNet
- ConvS2S (Convolutional Sequence-to-Sequence)
These models did not win the final battle, but they changed the direction of the field. They showed that sequence modeling could be parallel, efficient, and still powerful enough to handle language-like tasks.
The Problem with RNNs
Think about reading this sentence:
"The cat that sat on the mat near the window in the house was sleeping."
As a human, you probably do not consciously track every word in order to understand that "sleeping" refers to "cat". Your brain keeps the important pieces in mind and connects them naturally.
An RNN, however, has to do something much more mechanical.
cat → word → word → word → word → sleeping
The meaning has to pass through every intermediate step.
That creates two big problems:
- Slow training — because each step depends on the previous one
- Weak long-range memory — because important information can fade as it moves through the sequence
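The chain-like dependency can be seen in a toy recurrence. The sketch below is not a real RNN layer: it uses a single scalar hidden state, and the weights `w_hidden` and `w_input` are arbitrary values chosen for illustration. It feeds one non-zero input at the first step and shows how little of it survives as the sequence gets longer.

```python
import math

def run_toy_rnn(inputs, w_hidden=0.5, w_input=1.0):
    """Scalar toy recurrence: h_t = tanh(w_hidden * h_{t-1} + w_input * x_t)."""
    h = 0.0
    for x in inputs:
        # Each step needs the previous h, so the steps cannot run in parallel.
        h = math.tanh(w_hidden * h + w_input * x)
    return h

def signal_after(length):
    """Feed a single 1.0 at the first position, then zeros; return the final state."""
    inputs = [1.0] + [0.0] * (length - 1)
    return run_toy_rnn(inputs)

for length in (2, 5, 10, 20):
    print(f"length={length:>2}  final state={signal_after(length):.6f}")
```

The loop body is the sequential bottleneck, and the shrinking final state is a rough stand-in for the fading memory described above. Real RNNs and LSTMs have learned weights and gating, so the decay is not this simple.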
This made researchers wonder whether sequence understanding really needed to be built like a chain at all.
What if the model could look at many positions together instead of one after another?
That idea opened the door to CNN-based sequence models.
1. Extended Neural GPU
The Extended Neural GPU was one of the earliest attempts to move beyond recurrence while still handling sequence-like tasks.
Instead of treating a sentence or number like a line of tokens that must be read step by step, it represents the input more like a grid and repeatedly applies convolution operations over it.
Architecture
Input Sequence
↓
Embedding Grid
↓
Convolution Layer
↓
Convolution Layer
↓
Convolution Layer
↓
Output
Why Was It Created?
The motivation was not just language. Researchers wanted a model that could learn algorithmic behavior — things like:
- addition
- multiplication
- sorting
- sequence transformations
These are tasks where the model must learn a procedure, not just memorize patterns.
The Extended Neural GPU was an attempt to say:
Maybe a neural network can learn structured computation without being forced into recurrence.
Advantages
- highly parallelizable
- efficient on GPUs
- capable of learning algorithm-like patterns
- avoids the sequential bottleneck of RNNs
Limitation
Its weakness was distance.
If two pieces of information were far apart:
Word A ------------------------- Word B
the model had to move information through several convolution layers before those two positions could interact.
So while it was faster than an RNN in many ways, it still struggled when relationships stretched across long spans.
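A small experiment makes the distance problem concrete. The sketch below is a generic stack of 1D convolutions, not the Extended Neural GPU itself: it assumes a width-3 kernel with fixed weights and zero padding, places a single non-zero value in a sequence, and counts how many layers are needed before that value influences a distant position.

```python
def conv1d_same(signal, kernel):
    """Plain 1D convolution with zero padding so the output keeps the input length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        total = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half  # input position under this kernel tap
            if 0 <= j < len(signal):
                total += weight * signal[j]
        out.append(total)
    return out

def layers_until_reached(length, src, dst, kernel=(1.0, 1.0, 1.0)):
    """Count stacked layers needed before position dst sees the value placed at src."""
    signal = [0.0] * length
    signal[src] = 1.0
    layers = 0
    while signal[dst] == 0.0:
        signal = conv1d_same(signal, kernel)
        layers += 1
    return layers

for distance in (1, 5, 10, 20):
    print(distance, layers_until_reached(length=32, src=0, dst=distance))
```

With this width-3 kernel the layer count equals the distance. A wider kernel reduces it by a constant factor, but the growth stays linear.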
2. ByteNet
ByteNet, introduced by DeepMind, took a more practical step toward language modeling and machine translation.
Its key idea was simple but powerful: instead of only looking at nearby words, let the model expand its view in a smarter way.
That is where dilated convolutions came in.
Understanding Dilated Convolutions
A normal convolution sees only a small local neighborhood.
A B C D E F G
^ ^ ^
A dilated convolution skips positions so the model can see farther without needing many extra layers.
A B C D E F G
^   ^   ^
Examples (kernel width 3):
Dilation = 1
1 2 3 4 5 6 7 8 9
^ ^ ^
Dilation = 2
1 2 3 4 5 6 7 8 9
^   ^   ^
Dilation = 4
1 2 3 4 5 6 7 8 9
^       ^       ^
As dilation increases, the model’s field of view grows quickly.
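To make the skipping pattern concrete, here is a minimal sketch of a dilated 1D convolution in plain Python. The function names `tap_positions` and `dilated_conv1d` are made up for this illustration, and the kernel starts at a given position and looks to the right, as in the diagrams above. Real implementations, ByteNet's included, differ in padding and direction.

```python
def tap_positions(start, kernel_size, dilation):
    """Positions read by one kernel placement: start, start + d, start + 2d, ..."""
    return [start + k * dilation for k in range(kernel_size)]

def dilated_conv1d(signal, kernel, dilation):
    """Unpadded dilated convolution: one output per full kernel placement."""
    span = (len(kernel) - 1) * dilation  # distance between first and last tap
    out = []
    for start in range(len(signal) - span):
        taps = tap_positions(start, len(kernel), dilation)
        out.append(sum(w * signal[j] for w, j in zip(kernel, taps)))
    return out

for dilation in (1, 2, 4):
    # 1-based positions, to match the diagrams in the text
    print(f"dilation={dilation}: reads positions {tap_positions(1, 3, dilation)}")

signal = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(dilated_conv1d(signal, kernel=[1, 1, 1], dilation=2))
```

The kernel always has three weights. Only the spacing between the positions it reads changes, which is why the wider view costs no extra parameters.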
Why Is This Useful?
Consider this sentence:
_"The movie that I watched yesterday was amazing."_
To understand "was amazing", the model may need to connect that phrase back to "movie".
A standard CNN would need many layers to make that connection.
ByteNet made that path shorter by letting information jump across the sequence more efficiently.
In other words, it gave the model a way to see both the local details and the broader context without reading everything in a strictly linear way.
Complexity Advantage
When the dilation doubles from one layer to the next, the number of steps needed to connect distant positions grows logarithmically with distance.
Distance = 2 → 1 step
Distance = 4 → 2 steps
Distance = 8 → 3 steps
Distance = 16 → 4 steps
That was a major improvement over ordinary convolutional approaches.
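The step counts listed above can be reproduced with a short calculation. The sketch assumes a causal kernel of width 3 (each output looks only at earlier positions) and a dilation that doubles at every layer (1, 2, 4, ...). These are illustrative choices, not ByteNet's exact configuration. Under those assumptions, a stack of L layers reaches back 2 * (2^L - 1) positions.

```python
def reach(num_layers, kernel_size=3, dilated=True):
    """How far back a stack of causal conv layers can see, in positions."""
    total = 0
    for layer in range(num_layers):
        dilation = 2 ** layer if dilated else 1
        total += (kernel_size - 1) * dilation
    return total

def layers_needed(distance, kernel_size=3, dilated=True):
    """Smallest number of layers whose reach covers the given distance."""
    layers = 0
    while reach(layers, kernel_size, dilated) < distance:
        layers += 1
    return layers

for distance in (2, 4, 8, 16, 1024):
    print(f"distance={distance:>4}  dilated={layers_needed(distance)}"
          f"  undilated={layers_needed(distance, dilated=False)}")
```

For a distance of 1024, the dilated stack needs 10 layers where the undilated one needs 512. That gap is the practical meaning of logarithmic versus linear growth.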
3. ConvS2S (Convolutional Sequence-to-Sequence)
ConvS2S was another important step in this evolution. Its goal was ambitious:
Replace RNN-based encoder-decoder systems with CNNs.
That may sound like a small architectural change, but it was actually a big shift in thinking.
Instead of forcing the model to process a sentence one token at a time, ConvS2S used stacked convolution layers to build context, and attention to help the model focus on the right parts of the input.
Architecture
Input
↓
CNN Encoder
↓
Attention
↓
CNN Decoder
↓
Output
Where ByteNet stacked its decoder directly on top of its encoder, ConvS2S leaned more explicitly into the attention-based encoder-decoder setup that had already become popular in translation systems.
It combined:
- deep stacked CNN layers
- attention mechanisms
- encoder-decoder structure
Example
English: _I love programming_
French: _J'aime programmer_
The encoder turns the input into useful contextual features, and the decoder uses those features to generate the translated output.
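The attention step that links decoder to encoder can be sketched in a few lines. This is a simplified dot-product attention for a single decoder state, written in plain Python. The vectors are made-up toy values, and the actual ConvS2S attention has additional details that are left out here.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(decoder_state, encoder_states):
    """Weight each encoder state by its similarity to the decoder state."""
    weights = softmax([dot(decoder_state, enc) for enc in encoder_states])
    dim = len(encoder_states[0])
    context = [sum(w * enc[d] for w, enc in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Toy encoder features for the three source tokens (values are invented).
source_tokens = ["I", "love", "programming"]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
decoder_state = [0.0, 2.0]  # hypothetical state while producing the French verb

weights, context = attend(decoder_state, encoder_states)
for token, weight in zip(source_tokens, weights):
    print(f"{token:<12} {weight:.3f}")
```

In this toy setup the largest weight falls on "love", so the context vector handed to the decoder is dominated by that token's features.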
Benefits
- faster training than RNNs
- parallel computation across all positions during training
- better GPU utilization
- easier optimization in practice
Limitation
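Even with attention, ConvS2S still had a structural weakness: inside the encoder and the decoder, information had to move through layers.
That means the path between distant tokens still grew linearly with the distance between them. With a width-3 kernel, for example, each layer extends a token's reach by only one position in each direction: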
Distance = 10 → 10 hops
Distance = 100 → 100 hops
Distance = 1000 → 1000 hops
So although it was much better than a plain RNN in speed, it still did not solve the deeper problem of long-range dependency as elegantly as later models would.
Why Transformers Won
Let’s go back to the earlier sentence:
_The cat that sat on the mat near the window in the house ... was sleeping_
The word "sleeping" depends on "cat".
That relationship is easy for a human to hold in mind, but for a model it depends on how directly the two words can communicate.
ConvS2S
Information moves layer by layer.
cat → → → → → sleeping
Path Length: Linear
ByteNet
Dilated convolutions reduce the number of steps.
cat → → sleeping
Path Length: Logarithmic
Transformer
Self-attention lets the model connect them directly.
cat ---------------- sleeping
Path Length: Constant
That is the real breakthrough.
A Transformer does not force information to travel through a long chain. Any token can look at any other token immediately. That makes it much better at:
- learning long-range dependencies
- keeping gradients healthy during training, because the path between distant tokens is short
- using parallel hardware efficiently
- scaling to larger models and longer contexts
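The direct connection can be illustrated with a bare-bones self-attention pass. The sketch below skips the learned query, key, and value projections and the multiple heads that a real Transformer uses, and the token vectors are invented for the example. The point is only that the weight matrix has an entry for every pair of positions after a single step.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_weights(vectors):
    """Scaled dot-product weights between every pair of positions."""
    scale = math.sqrt(len(vectors[0]))
    weights = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights.append(softmax(scores))
    return weights

tokens = ["cat", "sat", "mat", "window", "house", "sleeping"]
# Invented 2-d vectors; "cat" and "sleeping" are deliberately made similar.
vectors = [[2.0, 0.0], [0.0, 1.0], [0.1, 0.9], [0.0, 1.1], [0.2, 1.0], [1.8, 0.1]]

weights = self_attention_weights(vectors)
sleeping_row = weights[tokens.index("sleeping")]
for token, weight in zip(tokens, sleeping_row):
    print(f"sleeping -> {token:<9} {weight:.3f}")
```

Because "sleeping" scores against "cat" directly, the number of tokens between them does not affect how many steps the connection takes.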
In a sense, Transformers did not just improve sequence modeling — they changed the rules of the game.
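Architecture Comparison
- RNN / LSTM: recurrence, one step at a time; path length linear
- ConvS2S: stacked convolutions; path length linear
- ByteNet: dilated convolutions; path length logarithmic
- Transformer: self-attention; path length constant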
Evolution Timeline
RNN (1980s–1990s)
↓
LSTM (1997)
↓
GRU (2014)
↓
Extended Neural GPU (2016)
↓
ByteNet (2016)
↓
ConvS2S (2017)
↓
Transformer (2017)
↓
BERT
GPT
Llama
Claude
Gemini
Final Thoughts
Extended Neural GPU, ByteNet, and ConvS2S are often treated like footnotes in the history of deep learning, but they deserve more credit than that.
They were part of a very important transition.
At a time when RNNs still dominated sequence modeling, these architectures asked a different question: what if language and sequence understanding could be built in parallel instead of step by step?
That question mattered.
Even though Transformers eventually outperformed them, these CNN-based models helped prove that recurrence was not the only way forward. They explored speed, structure, and context in new ways, and they helped prepare the field for the self-attention revolution.
In that sense, they were not failed experiments.
They were the bridge.
