DEV Community

Jack Pritom Soren

What Existed Before Transformers and Why Responses Are Generated Step by Step

Large Language Models (LLMs) like ChatGPT feel almost magical—but under the hood, they are built on clear mathematical and architectural principles.
In this article, we’ll break down how LLMs actually work, why models like ChatGPT weren’t possible before Transformers, and why responses are generated token by token instead of all at once.

The explanation stays simple—but goes deep.


1️⃣ What Is an LLM?

LLM = Large Language Model

An LLM is not a “thinking human” and does not store knowledge like a database.
At its core, it is a large probability engine.

Its single job is:

Given previous words, predict the most likely next word.

Example

"Today I am ..."

Internally, the model calculates probabilities:

  • eating → 80%
  • going → 10%
  • stopped → 5%
  • sky → 0.0001%

➡️ The highest probability token is “eating”

By repeating this process again and again, the model generates full sentences.
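Greedy decoding over such a distribution can be sketched in a few lines of Python. The probabilities here are the made-up numbers from the example above, not real model output:

```python
# Toy next-token distribution for the prefix "Today I am ..."
# (illustrative numbers; a real model derives these from billions
# of learned parameters)
next_token_probs = {
    "eating": 0.80,
    "going": 0.10,
    "stopped": 0.05,
    "sky": 0.000001,
}

# Greedy decoding: always pick the highest-probability token
best = max(next_token_probs, key=next_token_probs.get)
print(best)  # eating
```

Real systems often sample from this distribution (temperature, top-p) instead of always taking the maximum, which is one reason answers can vary between runs.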


2️⃣ Why Does ChatGPT Respond Gradually Instead of All at Once?

Because an LLM does not know the full answer upfront.

It works token by token.

Example Output

"I can help you"

Internally, generation looks like this:

  1. "I"
  2. "I" → "can"
  3. "I can" → "help"
  4. "I can help" → "you"

Each step uses:

previous tokens + probability calculation → next token

That’s why:

  • You see text appearing gradually
  • But internally, the process happens extremely fast (milliseconds)
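The steps above can be sketched as a loop. Here `predict_next` is a hypothetical stub that replays a canned answer; a real model would compute a probability distribution over the whole vocabulary at every step:

```python
CANNED = ["I", "can", "help", "you"]

def predict_next(tokens):
    # Stand-in for the model: return the next canned word, or None when done.
    # A real LLM would score every vocabulary token given `tokens`.
    return CANNED[len(tokens)] if len(tokens) < len(CANNED) else None

tokens = []
while (nxt := predict_next(tokens)) is not None:
    tokens.append(nxt)         # the growing context is fed back in
    print(" ".join(tokens))    # prints "I", "I can", "I can help", ...
```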

3️⃣ What Is a Token? (Very Important)

LLMs do not understand words—they understand tokens.

A token can be:

  • A full word
  • Part of a word
  • Punctuation or symbols

Example

"Bangladesh" → ["Bang", "la", "desh"]

This is why concepts like:

  • Token limit
  • Context window

exist in LLM systems.
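A toy tokenizer makes the idea concrete. Real tokenizers (BPE, WordPiece) learn their subword vocabulary from data, so the actual split of a word like “Bangladesh” depends on the model; this sketch just chops long words into fixed 4-character chunks:

```python
import re

def toy_tokenize(text):
    tokens = []
    for word in re.findall(r"\w+|[^\w\s]", text):  # words and punctuation
        if len(word) <= 4:
            tokens.append(word)
        else:
            # chop long words into 4-character subword pieces
            tokens.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return tokens

print(toy_tokenize("Bangladesh"))  # ['Bang', 'lade', 'sh']
```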


4️⃣ Why Didn’t ChatGPT-Like Models Exist Before Transformers?

This is the most critical question.

What existed before Transformers?

  • RNN (Recurrent Neural Networks)
  • LSTM / GRU

Limitations of These Models

❌ Could not remember long contexts
❌ Forgot earlier parts of text
❌ Could not process tokens in parallel (very slow)
❌ Weak understanding of relationships across sentences

Example

I went to school.
I met my friend there.
He told me that ...

Older models often failed to understand who “he” refers to.


5️⃣ What Is a Transformer?

In 2017, Google published a paper titled:

“Attention Is All You Need”

This paper introduced the Transformer architecture, which completely changed NLP and AI.


6️⃣ The Heart of Transformers: Attention 🧠

Attention means:

Determining which words in a sentence are important for understanding another word.

Example

Rahim fixed the server because he understands debugging.

The word “he” attends to “Rahim”.

The model learns this relationship automatically.

Self-Attention

Words inside the same sentence attend to one another to understand context.
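Scaled dot-product self-attention can be written in plain Python. This is a minimal sketch with toy 2-dimensional vectors; real models use learned projection matrices to produce separate queries, keys, and values:

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(queries, keys, values):
    """Each output is a weighted mix of all value vectors, weighted by
    how strongly its query matches every key (scaled dot product)."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)    # attention weights sum to 1
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Two toy token vectors; in a real model these come from embeddings
x = [[1.0, 0.0], [0.9, 0.1]]
out = self_attention(x, x, x)        # each output row mixes both tokens
```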


7️⃣ How Can Transformers See the Whole Sentence at Once?

Transformers:

  • Take all tokens at the same time
  • Compute attention relationships
  • Use parallel computation (GPU-friendly)

This was impossible with RNN-based architectures.


8️⃣ Transformer Architecture (High-Level)

A Transformer consists of several key components:

🔹 Embedding

Converts tokens into numerical vectors.

🔹 Positional Encoding

Transformers don’t understand word order by default.
Position information is added explicitly.

Without this:

Server is down today
Today the server is down

would look identical to the model.
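The original paper solves this with sinusoidal positional encodings: a vector of sines and cosines at different frequencies is added to each token embedding, so the same word gets a different representation at each position. A minimal sketch:

```python
import math

def positional_encoding(pos, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pe = []
    for i in range(d_model):
        angle = pos / (10000 ** ((i // 2 * 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
# positional_encoding(3, 4) differs, so position 0 and position 3
# are no longer interchangeable
```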


🔹 Multi-Head Attention

The sentence is analyzed from multiple perspectives:

  • Subject
  • Time
  • Cause–effect
  • Intent

Each “head” focuses on a different relationship.


🔹 Feed Forward Network

Deep neural layers that learn patterns and abstractions.

Modern models may have 12–96+ Transformer layers.


9️⃣ How Does ChatGPT Learn?

🔸 Step 1: Pretraining

The model trains on massive text data:

  • Books
  • Articles
  • Code
  • Discussions

The core task:

Predict the next word in this sequence.

Repeated trillions of times.
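The training signal behind “predict the next word” is cross-entropy loss: the model is penalized by minus the log of the probability it assigned to the token that actually came next. A toy illustration using the distribution from the earlier example:

```python
import math

def next_token_loss(predicted_probs, true_token):
    # Cross-entropy for a single next-token prediction step
    return -math.log(predicted_probs[true_token])

probs = {"eating": 0.80, "going": 0.10, "stopped": 0.05}

# Confident and correct -> small loss; wrong or unsure -> large loss
print(round(next_token_loss(probs, "eating"), 3))   # 0.223
print(round(next_token_loss(probs, "stopped"), 3))  # 2.996
```

Training nudges the parameters so that, across trillions of such steps, the average loss goes down.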


🔸 Step 2: Fine-Tuning

The base model is further trained on curated example prompts and high-quality answers, teaching it to follow instructions rather than simply continue text.


🔸 Step 3: RLHF (Reinforcement Learning from Human Feedback)

Humans provide feedback such as:

  • Helpful
  • Incorrect
  • Unsafe

The model is adjusted accordingly.


🔟 Does ChatGPT Actually “Understand”?

❌ Not like a human
✅ But it understands patterns and relationships extremely well

That’s why:

  • It can sound confident yet be wrong
  • It can also provide surprisingly deep explanations

1️⃣1️⃣ LLM from a Backend Perspective

A simplified pipeline:

Input Text
   ↓
Tokenizer
   ↓
Embedding
   ↓
Transformer Layers
   ↓
Softmax (Probability)
   ↓
Next Token
   ↓
Repeat
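The same pipeline can be stubbed out end to end. Every function below is a toy stand-in (the tokenizer splits on spaces, the Transformer layers are an identity, and the logits are rigged to make the demo deterministic), but the data flow matches the diagram above:

```python
import math

VOCAB = ["I", "can", "help", "you", "<eos>"]

def tokenize(text):                  # Tokenizer
    return text.split()

def embed(tokens):                   # Embedding: token -> 4-dim vector
    return [[0.1, 0.2, 0.3, 0.4] for _ in tokens]

def transformer_layers(vectors):     # Transformer layers (identity stub)
    return vectors

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    return [e / sum(exps) for e in exps]

def next_token(vectors):             # Softmax over vocab -> next token
    # Toy logits peaking at an index derived from context length;
    # a real model would compute logits from `vectors`
    logits = [1.0 if i == len(vectors) % len(VOCAB) else 0.0
              for i in range(len(VOCAB))]
    probs = softmax(logits)
    return VOCAB[probs.index(max(probs))]

def generate(prompt, steps=3):       # Repeat
    tokens = tokenize(prompt)
    for _ in range(steps):
        vectors = transformer_layers(embed(tokens))
        tokens.append(next_token(vectors))
    return " ".join(tokens)

print(generate("I"))  # I can help you
```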

1️⃣2️⃣ Why Are LLMs So Powerful Today?

Because all of these exist together now:

✅ Transformer architecture
✅ Massive datasets
✅ Large-scale GPU / TPU computing
✅ Improved training techniques

Previously, this combination was not possible.


1️⃣3️⃣ One Line to Remember

ChatGPT is not a knowledge database.
It is a highly advanced next-token prediction engine
powered by Transformers and attention-based context understanding.


