<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Archit Verma</title>
    <description>The latest articles on DEV Community by Archit Verma (@letarchit).</description>
    <link>https://dev.to/letarchit</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1027756%2F23d4c258-e9c7-449b-9392-7b610170c587.jpg</url>
      <title>DEV Community: Archit Verma</title>
      <link>https://dev.to/letarchit</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/letarchit"/>
    <language>en</language>
    <item>
      <title>The Journey to Transformers: How RNNs, ByteNet, and ConvS2S Shaped Modern AI</title>
      <dc:creator>Archit Verma</dc:creator>
      <pubDate>Thu, 25 Jun 2026 15:37:48 +0000</pubDate>
      <link>https://dev.to/letarchit/the-journey-to-transformers-how-rnns-bytenet-and-convs2s-shaped-modern-ai-155h</link>
      <guid>https://dev.to/letarchit/the-journey-to-transformers-how-rnns-bytenet-and-convs2s-shaped-modern-ai-155h</guid>
      <description>&lt;h2&gt;
  
  
  Before Transformers Took Over
&lt;/h2&gt;

&lt;p&gt;When people talk about modern AI today, the conversation usually jumps straight to Transformers. GPT, Claude, Gemini, Llama — they all sit on top of that same idea: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;let every token look at every other token directly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But that was not always the obvious path.&lt;/p&gt;

&lt;p&gt;Before the Transformer arrived in 2017 and went on to become the default choice, researchers were still wrestling with a very old problem in deep learning: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;how do you teach a model to understand sequences without making it painfully slow or forgetful?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For a long time, the answer was Recurrent Neural Networks (RNNs) and later LSTMs. They were elegant in theory, but in practice they had a frustrating weakness: they had to read everything one step at a time.&lt;/p&gt;

&lt;p&gt;That meant two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Training was slow because the model could not easily work on many tokens at once&lt;/li&gt;
&lt;li&gt;Long-distance relationships were hard to preserve because information had to travel through many steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So researchers started asking a very human question:&lt;/p&gt;

&lt;p&gt;What if sequence models did not have to think like a chain?&lt;/p&gt;

&lt;p&gt;That question led to a fascinating family of models built around Convolutional Neural Networks (CNNs) rather than token-by-token recurrence.&lt;/p&gt;

&lt;p&gt;Three of the most important ones were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extended Neural GPU&lt;/li&gt;
&lt;li&gt;ByteNet&lt;/li&gt;
&lt;li&gt;ConvS2S (Convolutional Sequence-to-Sequence)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These models did not win the final battle, but they changed the direction of the field. They showed that sequence modeling could be parallel, efficient, and still powerful enough to handle language-like tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem with RNNs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think about reading this sentence:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"The cat that sat on the mat near the window in the house was sleeping."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As a human, you probably do not consciously track every word in order to understand that "sleeping" refers to "cat". Your brain keeps the important pieces in mind and connects them naturally.&lt;br&gt;
An RNN, however, has to do something much more mechanical.&lt;/p&gt;

&lt;p&gt;cat → word → word → word → word → sleeping&lt;/p&gt;

&lt;p&gt;The meaning has to pass through every intermediate step.&lt;br&gt;
That creates two big problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Slow training — because each step depends on the previous one&lt;/li&gt;
&lt;li&gt;Weak long-range memory — because important information can fade as it moves through the sequence&lt;/li&gt;
&lt;/ol&gt;
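
&lt;p&gt;Both problems show up even in a toy model. The sketch below assumes a scalar RNN with made-up weights &lt;code&gt;w_in&lt;/code&gt; and &lt;code&gt;w_rec&lt;/code&gt; (hypothetical values, not taken from any real model). Each hidden state needs the previous one, so the loop cannot be spread across tokens, and a signal that enters at the first step shrinks at every later step.&lt;/p&gt;

```python
import math

def rnn_forward(inputs, w_in=0.5, w_rec=0.9):
    """Toy scalar RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h = 0.0
    states = []
    for x in inputs:  # step t cannot start until step t-1 has finished
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

# Only the first token carries a signal; the printed states show it fading.
states = rnn_forward([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print([round(s, 3) for s in states])
```

&lt;p&gt;Real RNNs use vectors and learned weight matrices, and LSTMs add gates to slow the fading, but the step-by-step loop stays the same.&lt;/p&gt;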

&lt;p&gt;This made researchers wonder whether sequence understanding really needed to be built like a chain at all.&lt;/p&gt;

&lt;p&gt;What if the model could look at many positions together instead of one after another?&lt;/p&gt;

&lt;p&gt;That idea opened the door to CNN-based sequence models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Extended Neural GPU&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Extended Neural GPU was one of the earliest attempts to move beyond token-by-token recurrence while still handling sequence-like tasks.&lt;br&gt;
Instead of treating a sentence or number as a line of tokens that must be read step by step, it represents the input more like a grid and repeatedly applies gated convolutions over it, updating every position in parallel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Input Sequence&lt;br&gt;
      ↓&lt;br&gt;
Embedding Grid&lt;br&gt;
      ↓&lt;br&gt;
Convolution Layer&lt;br&gt;
      ↓&lt;br&gt;
Convolution Layer&lt;br&gt;
      ↓&lt;br&gt;
Convolution Layer&lt;br&gt;
      ↓&lt;br&gt;
Output&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Was It Created?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The motivation was not just language. Researchers wanted a model that could learn algorithmic behavior — things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;addition&lt;/li&gt;
&lt;li&gt;multiplication&lt;/li&gt;
&lt;li&gt;sorting&lt;/li&gt;
&lt;li&gt;sequence transformations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are tasks where the model must learn a procedure, not just memorize patterns.&lt;/p&gt;

&lt;p&gt;The Extended Neural GPU was an attempt to say:&lt;/p&gt;

&lt;p&gt;Maybe a neural network can learn structured computation without being forced into recurrence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;highly parallelizable&lt;/li&gt;
&lt;li&gt;efficient on GPUs&lt;/li&gt;
&lt;li&gt;capable of learning algorithm-like patterns&lt;/li&gt;
&lt;li&gt;avoids the sequential bottleneck of RNNs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Its weakness was distance. If two pieces of information were far apart:&lt;/p&gt;

&lt;p&gt;Word A ------------------------- Word B&lt;/p&gt;

&lt;p&gt;the model had to move information through several convolution layers before those two positions could interact.&lt;/p&gt;

&lt;p&gt;So while it was faster than an RNN in many ways, it still struggled when relationships stretched across long spans.&lt;/p&gt;
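
&lt;p&gt;A rough way to see this distance problem is to count layers. The sketch below is not the Extended Neural GPU itself. It is a toy model that assumes a centered kernel of width 3, so each layer can pass information only to the immediate neighbors. The function name &lt;code&gt;layers_until_reached&lt;/code&gt; is made up for this illustration.&lt;/p&gt;

```python
def layers_until_reached(distance, kernel_width=3):
    """Count stacked, undilated convolution layers until a signal that
    starts at position 0 can influence the position `distance` away."""
    reach = kernel_width // 2   # positions gained per layer on each side
    reached = {0}               # positions the signal has influenced so far
    layers = 0
    while distance not in reached:
        reached = {p + step for p in reached
                   for step in range(-reach, reach + 1)}
        layers += 1
    return layers

for d in (2, 8, 32):
    print(d, layers_until_reached(d))
```

&lt;p&gt;Under these assumptions the layer count grows in step with the distance. A wider kernel lowers the count, but the growth stays linear.&lt;/p&gt;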

&lt;p&gt;&lt;strong&gt;2. ByteNet&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ByteNet, introduced by DeepMind, took a more practical step toward language modeling and machine translation.&lt;/p&gt;

&lt;p&gt;Its key idea was simple but powerful: instead of only looking at nearby words, let the model expand its view in a smarter way.&lt;br&gt;
That is where dilated convolutions came in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding Dilated Convolutions&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A normal convolution sees only a small local neighborhood. With a kernel of width 3, it reads three adjacent positions:&lt;br&gt;
A B C D E F G → reads A, B, C&lt;/p&gt;

&lt;p&gt;A dilated convolution skips positions so the model can see farther without needing many extra layers:&lt;br&gt;
A B C D E F G → reads A, C, E&lt;/p&gt;

&lt;p&gt;Examples, for a kernel of width 3 over positions 1 to 9:&lt;br&gt;
Dilation = 1 → reads positions 1, 2, 3&lt;br&gt;
Dilation = 2 → reads positions 1, 3, 5&lt;br&gt;
Dilation = 4 → reads positions 1, 5, 9&lt;/p&gt;

&lt;p&gt;As dilation increases, the model’s field of view grows quickly.&lt;/p&gt;
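
&lt;p&gt;The same idea in code: a minimal pure-Python sketch of a one-dimensional dilated convolution with no padding. The kernel weights are hypothetical (all ones, so each output is just the sum of the three positions it reads), and the function name &lt;code&gt;dilated_conv1d&lt;/code&gt; is made up for this example. It is not ByteNet's actual layer.&lt;/p&gt;

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """1D dilated convolution without padding, written as a
    cross-correlation the way deep learning libraries usually do."""
    span = (len(kernel) - 1) * dilation   # distance from first to last tap
    out = []
    for start in range(len(signal) - span):
        taps = [signal[start + i * dilation] for i in range(len(kernel))]
        out.append(sum(w * x for w, x in zip(kernel, taps)))
    return out

signal = [1, 2, 3, 4, 5, 6, 7, 8, 9]
kernel = [1, 1, 1]   # hypothetical weights: sum the three taps

for d in (1, 2, 4):
    print(d, dilated_conv1d(signal, kernel, dilation=d))
```

&lt;p&gt;With dilation 4, the single output already combines positions 1, 5, and 9, which an undilated kernel of the same width could not do in one layer.&lt;/p&gt;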

&lt;p&gt;&lt;strong&gt;Why Is This Useful?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider this sentence:&lt;br&gt;
&lt;em&gt;"The movie that I watched yesterday was amazing."&lt;/em&gt;&lt;br&gt;
To understand "was amazing", the model may need to connect that phrase back to "movie".&lt;br&gt;
A standard CNN would need many layers to make that connection. &lt;/p&gt;

&lt;p&gt;ByteNet made that path shorter by letting information jump across the sequence more efficiently.&lt;br&gt;
In other words, it gave the model a way to see both the local details and the broader context without reading everything in a strictly linear way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complexity Advantage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The number of steps needed to connect distant positions grows logarithmically with distance.&lt;/p&gt;

&lt;p&gt;Distance = 2   → 1 step&lt;br&gt;
Distance = 4   → 2 steps&lt;br&gt;
Distance = 8   → 3 steps&lt;br&gt;
Distance = 16  → 4 steps&lt;/p&gt;

&lt;p&gt;That was a major improvement over ordinary convolutional approaches.&lt;/p&gt;
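
&lt;p&gt;The table above follows a simple rule, sketched here under the assumption that the reach doubles with every layer (dilations 1, 2, 4, and so on). The helper name &lt;code&gt;dilated_layers_needed&lt;/code&gt; is hypothetical, and real ByteNet layer counts also depend on kernel width and how the dilation schedule repeats.&lt;/p&gt;

```python
def dilated_layers_needed(distance):
    """Layers needed to span `distance` when reach doubles per layer.

    Equals ceil(log2(distance)) for distance of at least 1, computed with
    integer arithmetic to avoid floating-point rounding.
    """
    return (distance - 1).bit_length()

for d in (2, 4, 8, 16, 1000):
    print(d, dilated_layers_needed(d))
```

&lt;p&gt;Going from a distance of 16 to a distance of 1000 adds only a handful of layers, which is the practical meaning of logarithmic growth.&lt;/p&gt;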

&lt;p&gt;&lt;strong&gt;3. ConvS2S (Convolutional Sequence-to-Sequence)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ConvS2S was another important step in this evolution. Its goal was ambitious:&lt;/p&gt;

&lt;p&gt;Replace RNN-based encoder-decoder systems with CNNs.&lt;/p&gt;

&lt;p&gt;That may sound like a small architectural change, but it was actually a big shift in thinking.&lt;/p&gt;

&lt;p&gt;Instead of forcing the model to process a sentence one token at a time, ConvS2S used stacked convolution layers to build context, and attention to help the model focus on the right parts of the input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Input&lt;br&gt;
  ↓&lt;br&gt;
CNN Encoder&lt;br&gt;
  ↓&lt;br&gt;
Attention&lt;br&gt;
  ↓&lt;br&gt;
CNN Decoder&lt;br&gt;
  ↓&lt;br&gt;
Output&lt;/p&gt;

&lt;p&gt;Like ByteNet, ConvS2S kept the encoder-decoder setup that had already become popular in translation systems, but it added attention between the decoder and the encoder.&lt;/p&gt;

&lt;p&gt;It combined:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deep stacked CNN layers&lt;/li&gt;
&lt;li&gt;attention mechanisms&lt;/li&gt;
&lt;li&gt;encoder-decoder structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;English: &lt;em&gt;I love programming&lt;/em&gt;&lt;br&gt;
French: &lt;em&gt;J'aime programmer&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The encoder turns the input into useful contextual features, and the decoder uses those features to generate the translated output.&lt;/p&gt;
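
&lt;p&gt;The attention step between encoder and decoder can be sketched as plain dot-product attention. This is a simplified illustration, not the exact ConvS2S formulation, which applies attention in every decoder layer and has further details. The two-dimensional feature vectors and the decoder state below are invented for the example.&lt;/p&gt;

```python
import math

def attend(decoder_state, encoder_states):
    """Score each encoder position against the decoder state, softmax the
    scores, and return the weights plus the weighted sum of encoder states."""
    scores = [sum(q * k for q, k in zip(decoder_state, enc))
              for enc in encoder_states]
    peak = max(scores)   # subtracted before exp for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(len(decoder_state))]
    return weights, context

# Hypothetical encoder features for "I", "love", "programming".
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.5]]
decoder_state = [0.0, 2.0]   # made-up state that lines up best with "love"

weights, context = attend(decoder_state, encoder_states)
print([round(w, 3) for w in weights])
```

&lt;p&gt;The largest weight lands on the second encoder position, so the decoder's next step is driven mostly by the features for "love".&lt;/p&gt;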

&lt;p&gt;&lt;strong&gt;Benefits&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;faster training than RNNs&lt;/li&gt;
&lt;li&gt;parallel computation across all positions during training&lt;/li&gt;
&lt;li&gt;better GPU utilization&lt;/li&gt;
&lt;li&gt;easier optimization in practice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even with attention, ConvS2S still had a structural weakness: information had to move through layers.&lt;/p&gt;

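&lt;p&gt;That means the path between distant tokens inside the encoder or decoder still grew linearly with the distance between them. As a rough illustration, if each layer carries information one position further:&lt;/p&gt;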

&lt;p&gt;Distance = 10    → 10 hops&lt;br&gt;
Distance = 100   → 100 hops&lt;br&gt;
Distance = 1000  → 1000 hops&lt;/p&gt;

&lt;p&gt;So although it was much better than a plain RNN in speed, it still did not solve the deeper problem of long-range dependency as elegantly as later models would.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Transformers Won&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s go back to the earlier sentence:&lt;br&gt;
&lt;em&gt;The cat that sat on the mat near the window in the house&lt;/em&gt;&lt;br&gt;
...&lt;br&gt;
&lt;em&gt;was sleeping&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The word "sleeping" depends on "cat".&lt;/p&gt;

&lt;p&gt;That relationship is easy for a human to hold in mind, but for a model it depends on how directly the two words can communicate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ConvS2S&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Information moves layer by layer.&lt;br&gt;
cat → → → → → sleeping&lt;br&gt;
Path Length: Linear&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ByteNet&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dilated convolutions reduce the number of steps.&lt;br&gt;
cat → → sleeping&lt;br&gt;
Path Length: Logarithmic&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transformer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Self-attention lets the model connect them directly.&lt;br&gt;
cat ---------------- sleeping&lt;br&gt;
Path Length: Constant&lt;br&gt;
That is the real breakthrough.&lt;/p&gt;

&lt;p&gt;A Transformer does not force information to travel through a long chain. Any token can look at any other token immediately. That makes it much better at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;learning long-range dependencies&lt;/li&gt;
&lt;li&gt;keeping gradient paths between distant tokens short during training&lt;/li&gt;
&lt;li&gt;using parallel hardware efficiently&lt;/li&gt;
&lt;li&gt;scaling to larger models and longer contexts&lt;/li&gt;
&lt;/ul&gt;
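
&lt;p&gt;To make the "any token can look at any other token" point concrete, here is a minimal self-attention sketch. It leaves out the learned query, key, and value projections and the multiple heads that real Transformers use, and the embeddings are hypothetical, chosen so that "cat" and "sleeping" point in the same direction.&lt;/p&gt;

```python
import math

def self_attention_weights(vectors):
    """Unprojected self-attention: every token scores every token with a
    scaled dot product, then each row of scores is softmaxed."""
    scale = math.sqrt(len(vectors[0]))
    rows = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in vectors]
        peak = max(scores)   # subtracted before exp for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

# Hypothetical embeddings: twelve filler tokens sit between the two words.
cat = [3.0, 0.0]
filler = [0.0, 1.0]
sleeping = [3.0, 0.0]
sentence = [cat] + [filler] * 12 + [sleeping]

weights = self_attention_weights(sentence)
# Weight that "sleeping" (last token) places on "cat" (first token).
print(round(weights[-1][0], 3))
```

&lt;p&gt;The last token reaches the first in a single step, no matter how many filler tokens are placed between them. The cost is that every token is compared with every other token, so the work grows with the square of the sequence length.&lt;/p&gt;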

&lt;p&gt;In a sense, Transformers did not just improve sequence modeling — they changed the rules of the game.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture Comparison&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fwe3ecqkm4swt5gcgony3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fwe3ecqkm4swt5gcgony3.png" alt=" " width="800" height="377"&gt;&lt;/a&gt;   &lt;/p&gt;

&lt;h2&gt;
  
  
  Evolution Timeline
&lt;/h2&gt;

&lt;p&gt;RNN (1980s–1990s)&lt;br&gt;
        ↓&lt;br&gt;
LSTM (1997)&lt;br&gt;
        ↓&lt;br&gt;
GRU (2014)&lt;br&gt;
        ↓&lt;br&gt;
Extended Neural GPU (2016)&lt;br&gt;
        ↓&lt;br&gt;
ByteNet (2016)&lt;br&gt;
        ↓&lt;br&gt;
ConvS2S (2017)&lt;br&gt;
        ↓&lt;br&gt;
Transformer (2017)&lt;br&gt;
        ↓&lt;br&gt;
BERT&lt;br&gt;
GPT&lt;br&gt;
Llama&lt;br&gt;
Claude&lt;br&gt;
Gemini&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Extended Neural GPU, ByteNet, and ConvS2S are often treated like footnotes in the history of deep learning, but they deserve more credit than that.&lt;/p&gt;

&lt;p&gt;They were part of a very important transition.&lt;/p&gt;

&lt;p&gt;At a time when RNNs still dominated sequence modeling, these architectures asked a different question: what if language and sequence understanding could be built in parallel instead of step by step?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That question mattered.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even though Transformers eventually outperformed them, these CNN-based models helped prove that recurrence was not the only way forward. They explored speed, structure, and context in new ways, and they helped prepare the field for the self-attention revolution.&lt;/p&gt;

&lt;p&gt;In that sense, they were not failed experiments.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;They were the bridge.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>development</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
