Zoricic

Three Things Had to Align: The Real Story Behind the LLM Revolution

ChatGPT didn't come out of nowhere. It's the result of 60 years of dead ends, one accidental breakthrough, and three completely separate technologies all maturing at exactly the same moment.


Before the Revolution: Chatbots That Didn't Learn

AI is older than most people realize. But early AI looked nothing like what we have today.

ELIZA (1966) was the first chatbot. It simulated conversation using pattern matching — if you typed "I feel sad," it replied "Why do you feel sad?" It didn't learn anything. It followed hand-written rules. Impressive for 1966. Useless for anything complex.

RNNs emerged in the 1990s as the dominant approach for language processing. LSTMs (Long Short-Term Memory networks), invented by Hochreiter & Schmidhuber in 1997, improved on RNNs specifically to address memory limitations — but both shared the same fundamental design: reading text sequentially, word by word, like looking through a keyhole. By the time either reached the end of a long sentence, the beginning was largely forgotten.

"The dog, which had been chasing the cat down the long street, was tired."
                                                               ↑
                               RNN had forgotten "dog" here ──┘

Meanwhile, Google Was Running on Rules

While researchers struggled with language models, Google was building search using a series of named algorithms — each solving a specific problem:

| Algorithm | Year | What It Did |
|---|---|---|
| PageRank | 1998 | Ranked pages by how many other pages linked to them |
| Panda | 2011 | Penalized low-quality and duplicate content |
| Penguin | 2012 | Penalized sites that bought fake links |
| Hummingbird | 2013 | First steps toward understanding concepts, not just keywords |
| RankBrain | 2015 | First use of machine learning — handled never-seen-before queries |

RankBrain was Google's first real AI in search. It used mathematical vectors to connect unknown words to known ones. But it still couldn't write a single sentence — and it still saw words as isolated units, not as parts of connected thought.
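The idea of connecting unknown words to known ones via vectors can be sketched with cosine similarity. This is a toy example: the 3-dimensional embeddings below are made-up values, whereas real systems learn vectors with hundreds of dimensions from text.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two word vectors (1.0 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-set toy embeddings for illustration only; real systems learn these.
vectors = {
    "car":        np.array([0.90, 0.10, 0.00]),
    "automobile": np.array([0.85, 0.15, 0.05]),
    "banana":     np.array([0.00, 0.20, 0.90]),
}

# A query word the system has rarely seen gets mapped near words it
# resembles, so "automobile" can be handled like "car".
print(cosine(vectors["car"], vectors["automobile"]))  # close to 1.0
print(cosine(vectors["car"], vectors["banana"]))      # close to 0.0
```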

The forgetting problem and the keyword problem were the same problem in different clothes. Both systems failed to capture relationships between words across distance. The fix came from a single paper.


The Breakthrough: "Attention Is All You Need" (2017)

In 2017, a team of Google researchers published a paper with a confident title: Attention Is All You Need. The paper addressed sequence transduction broadly — mapping one sequence to another — and used machine translation as the benchmark task, showing the Transformer outperforming existing approaches on English-to-German and English-to-French. What they invented was the Transformer architecture, and it accidentally became the foundation for every major AI system since. Google Translate's adoption of Transformers was gradual and varied by language pair; the architecture didn't replace the previous approach across the board overnight.

The Core Idea: Self-Attention

Instead of reading text sequentially, the Transformer looks at the entire input at once and assigns a weight — an importance score — to each word relative to every other word.

When processing "tired" in our earlier example:

"The dog, which had been chasing the cat down the long street, was tired."

  Word      Attention Weight
  ───────────────────────────
  dog        HIGH  ← subject; dogs can be tired
  chasing    MED   ← action dog performed
  cat        LOW   ← object of chasing; not tired
  street     LOW   ← location; not tired

The model builds a relationship map of the entire sentence simultaneously. Nothing is read sequentially. Nothing is forgotten mid-sentence.
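A minimal numeric sketch of the mechanism in NumPy. As a simplifying assumption, the learned query/key/value projection matrices of a real Transformer are omitted here so the core computation stays visible.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention with identity projections.

    X is a (seq_len, d) matrix, one row per word vector. A real
    Transformer learns separate query/key/value projections; they are
    left out here to keep the mechanism visible.
    """
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)                   # every word vs. every word
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # softmax -> attention weights
    return weights @ X, weights                     # outputs + the weight map

# Toy 2-d "embeddings": "tired" sits near "dog", far from "street".
X = np.array([[1.0, 0.0],   # dog
              [0.9, 0.1],   # tired
              [0.0, 1.0],   # street
              [0.1, 0.9]])  # cat
out, W = self_attention(X)
# W[1] is how much "tired" attends to each word: "dog" gets the
# highest weight among the other words, "street" one of the lowest.
```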

Why This Required New Hardware

Self-attention is parallelizable — every word's relationship to every other word can be computed at the same time, not one after another.

RNN:         [word1] → [word2] → [word3] → result
             Sequential. Each step waits for the previous.

Transformer: [word1] ↔ [word2] ↔ [word3]
               ↕          ↕         ↕
             All relationships computed simultaneously.

GPUs were designed to render thousands of pixels in parallel for video games. That same parallel architecture maps directly onto self-attention calculations. The match was accidental — and decisive.
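The contrast can be sketched in a few lines. This is a toy illustration: real RNN and Transformer layers carry many more parameters, but the dependency structure is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4
X = rng.standard_normal((seq_len, d))

# RNN-style: inherently sequential. Step t cannot start until step t-1
# has produced its hidden state h.
Wh = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
for x in X:
    h = np.tanh(x + h @ Wh)          # seq_len dependent steps, one after another

# Transformer-style: all pairwise attention scores fall out of a single
# matrix product with no dependency between positions -- exactly the kind
# of operation a GPU executes in parallel.
scores = X @ X.T / np.sqrt(d)        # (seq_len, seq_len), computed at once
```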

Training GPT-3 on CPUs instead of modern GPU clusters wouldn't take months. By rough estimates, it would take hundreds of years.


Three Things Had to Align

Here's the part most explanations skip: if the Transformer paper had been published in 2005, nothing would have happened.

Modern LLMs exist because three things converged at the same moment:

┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│  THE ALGORITHM  │   │    THE DATA     │   │  THE HARDWARE   │
│                 │   │                 │   │                 │
│  Transformer    │ + │  Billions of    │ + │  GPU clusters   │
│  architecture   │   │  internet pages,│   │  powerful       │
│  (the recipe)   │   │  books, and     │   │  enough to run  │
│                 │   │  Wikipedia      │   │  the math       │
│                 │   │  (ingredients)  │   │                 │
└─────────────────┘   └─────────────────┘   └─────────────────┘
                              ↓
                  All three align ~2017
                              ↓
                  LLM revolution begins

The data requirement is the most underestimated factor. LLMs need billions of sentences to develop statistical understanding of language. The internet, Wikipedia, and digitized books only reached the necessary scale in the early 2010s. The algorithm existing in 2005 would have had nothing meaningful to train on.

Miss any one of the three — and ChatGPT doesn't exist.


Two Directions from One Architecture

Once the Transformer existed, two companies used it to solve two completely different problems.

BERT — Google (2018)

Google needed to understand language, not generate it. Search requires answering: what does this query actually mean?

BERT (Bidirectional Encoder Representations from Transformers) trained using Masked Language Modeling: researchers masked 15% of words in a sentence and trained the model to predict them from surrounding context.

"Paris is the [MASK] of France."  →  BERT learns: "capital"

The key innovation: bidirectionality. Previous models read left-to-right. BERT reads the full sentence in all directions at once.
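A sketch of the masking step. It is simplified: real BERT also randomly substitutes or keeps some of the selected words instead of always writing `[MASK]`, and it operates on subword tokens rather than whole words.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Corrupt a sentence the way BERT's pre-training does (simplified:
    real BERT sometimes substitutes a random word or keeps the original
    instead of always using [MASK])."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok              # the model must recover this word
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return masked, targets

sentence = "Paris is the capital of France".split()
# Rate raised above BERT's 15% so this six-word example reliably masks a word.
masked, targets = mask_tokens(sentence, mask_rate=0.4)
print(" ".join(masked))   # Paris is the [MASK] of France
# During training, the model is scored on predicting every entry in
# `targets` from the context on BOTH sides of the gap.
```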

Real-world impact: Google Search circa 2019. Before BERT, searching "can I pick up a prescription for someone else at the pharmacy" matched keywords. After BERT, Google understood that "for someone else" was the semantically critical phrase.

GPT Series — OpenAI (2018–2020)

OpenAI needed to generate language. The decoder half of the Transformer — predicting the next word given all previous words — became the basis for GPT.

| Model | Year | Milestone |
|---|---|---|
| GPT-1 | 2018 | Proved large-scale pre-training transfers to other tasks |
| GPT-2 | 2019 | Outputs convincing enough to trigger public misuse debates |
| GPT-3 | 2020 | 175 billion parameters — first truly massive, general-purpose model |
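The "predict the next word" objective can be illustrated with simple counts. This toy bigram model (the corpus and function names are invented for the example) only looks at the previous word, whereas GPT conditions on all previous words with a neural network.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count which word follows which -- a drastically simplified stand-in
    for GPT's next-word objective."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent continuation seen in training."""
    return counts[word].most_common(1)[0][0]

corpus = ["the dog was tired", "the dog chased a cat", "the dog slept"]
model = train_bigram(corpus)
print(predict_next(model, "the"))   # -> "dog" (follows "the" three times)
```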

BERT vs. GPT compared:

| Aspect | BERT (Google) | GPT (OpenAI) |
|---|---|---|
| Reading direction | Bidirectional | Left to right |
| Primary goal | Understanding | Generation |
| Training signal | Masked word prediction | Next word prediction |
| Used for | Search, Q&A, classification | Chatbots, writing, code |

Neither is strictly better. They're optimized for different jobs.


The Missing Piece: Why GPT-3 Wasn't ChatGPT

GPT-3 could generate fluent text. But left to its own training objective — predict the next word — it was unreliable as an assistant. Ask it a question and it might answer it, continue the question, or list five more similar questions. It had no concept of being helpful.

The gap between GPT-3 (2020) and ChatGPT (2022) is two techniques:

Instruction tuning fine-tunes a pre-trained model on examples of instructions paired with good responses. Instead of "predict the next token in internet text," the training signal becomes "given this instruction, produce a useful response." The model learns to follow directions.

RLHF (Reinforcement Learning from Human Feedback) goes further. Human raters compare pairs of model outputs and rank which is better. Those rankings train a reward model — a separate model that scores outputs. The language model is then fine-tuned using reinforcement learning to maximize that score.
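The reward-model step trains on a standard pairwise ranking loss. A minimal sketch, where the scalar rewards are hypothetical stand-ins for a network's outputs:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss used to train RLHF reward models:
    -log(sigmoid(r_chosen - r_rejected)). It shrinks when the reward
    model scores the human-preferred output higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model agrees with the human ranking -> small loss.
agree = preference_loss(2.0, 0.5)
# Reward model disagrees -> large loss, pushing its scores to flip.
disagree = preference_loss(0.5, 2.0)
```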

Pre-trained GPT-3
        │
        ▼
Instruction tuning (supervised, labeled examples)
        │
        ▼
RLHF — human raters rank outputs → reward model → RL fine-tuning
        │
        ▼
InstructGPT / ChatGPT — helpful, harmless, honest

This is why ChatGPT felt qualitatively different from what GPT-3 API users had experienced. The underlying architecture was much the same. What changed was the training objective: from "predict internet text" to "be useful to a human."

OpenAI published the InstructGPT paper in early 2022 describing this approach. ChatGPT launched in November 2022 using the same method.


Beyond Text: Multimodal Models

LLMs started as text-only systems. By 2024, the leading models — GPT-4o, Gemini 1.5, Claude 3 — process text, images, audio, and video within the same model.

The core insight: the Transformer's attention mechanism is not specific to text. An image can be divided into patches and treated as a sequence of tokens. Audio can be converted to spectrograms and similarly tokenized. The same self-attention machinery that relates words to each other can relate image patches to words to audio frames.
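A sketch of how an image becomes a token sequence, under simplifying assumptions: real vision Transformers follow the patches with a learned linear projection and add position information, both omitted here.

```python
import numpy as np

def patchify(image, patch=4):
    """Split an image into non-overlapping square patches and flatten each
    into a vector -- the usual first step by which vision Transformers
    turn pixels into a token sequence."""
    h, w, c = image.shape
    tokens = []
    for y in range(0, h - h % patch, patch):
        for x in range(0, w - w % patch, patch):
            tokens.append(image[y:y + patch, x:x + patch].reshape(-1))
    return np.stack(tokens)

img = np.zeros((16, 16, 3))   # toy 16x16 RGB "image"
tokens = patchify(img)
print(tokens.shape)           # (16, 48): a 4x4 grid of patches, 4*4*3 values each
```

Each row of `tokens` is then treated exactly like a word embedding, so the same attention machinery applies.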

┌─────────────────────────────────────────────┐
│              Multimodal Input               │
│                                             │
│   "What is in this image?"  +  [image file] │
│          ↓                        ↓         │
│     Text tokens              Image patches  │
│          └──────────┬─────────────┘         │
│                     ▼                       │
│            Shared Transformer               │
│            (attention over all tokens)      │
│                     ↓                       │
│             "A dog chasing a cat"           │
└─────────────────────────────────────────────┘

Practical implications:

  • A model can answer questions about a screenshot, diagram, or photograph
  • Audio can be transcribed, translated, and responded to in a single pass
  • Video frames become a sequence of image tokens the model can reason over

The boundary between "language model" and "AI system" has effectively dissolved. What were separate specialized models (vision model, speech model, text model) are converging into single multimodal architectures.


Hardware: The Unsung Hero

Algorithm and data get most of the coverage. Hardware rarely does — despite being the actual constraint.

Microsoft built a dedicated supercomputer with 10,000 NVIDIA V100s specifically for OpenAI's research. OpenAI never disclosed exact training figures for GPT-3, but estimates put large training runs in the thousands of GPUs running for weeks. Today, systems running Gemini and GPT use clusters of H100 cards at similar scale. This is why NVIDIA became one of the most valuable companies in the world.

Google took a different path, designing custom silicon specifically for Transformer math:

| Hardware | Used by | Strength |
|---|---|---|
| NVIDIA H100 | OpenAI, Anthropic, Meta | General-purpose, widely available |
| Google TPUs | Google (Gemini) | Purpose-built for Transformer math, more efficient per watt |

Gemini has been trained and served across multiple TPU generations (v4, v5e, v5p, and the newer Trillium/v6 for more recent models). TPUs aren't sold as hardware: outside developers can only rent them through Google Cloud, so the newest generations remain Google's in-house advantage.


Practical Recommendations

If you're building on top of LLMs today, the historical distinctions above translate into concrete choices.

Choosing between understanding and generation models:

| Task | Model type | Examples |
|---|---|---|
| Classifying text, ranking documents, Q&A over a corpus | Encoder (BERT-style) | BERT, RoBERTa, sentence-transformers |
| Generating text, summarizing, conversing, coding | Decoder (GPT-style) | GPT-4, Claude, Gemini |
| Most new product work in 2024+ | Multimodal API | GPT-4o, Gemini 1.5 Pro, Claude 3 |

The Full Timeline

1966        ELIZA — first chatbot (rules only, no learning)
1990s–1997  RNNs (1990s) + LSTMs (1997) — sequential processing, forgetting problem
1998        PageRank — Google's original ranking algorithm
2013        Hummingbird — first concept-based search
2015        RankBrain — first ML in Google Search
2017        "Attention Is All You Need" — Transformer invented
2018        BERT (Google) + GPT-1 (OpenAI) — the split path begins
2019        BERT deployed in Google Search
2020        GPT-3 — 175B parameters, first truly massive LLM
2022        InstructGPT — instruction tuning + RLHF; ChatGPT launches November 2022
2024–2025   Multimodal models — GPT-4o, Gemini 1.5, Claude 3 unify text, image, audio, video

The Bottom Line

The Transformer's attention mechanism solved the forgetting problem that had bottlenecked language AI for decades — by computing relationships across the entire input simultaneously rather than word by word. From that single architectural change, both understanding models (BERT) and generation models (GPT) were built.

But the algorithm alone wasn't the revolution. It was the algorithm converging with enough training data and hardware powerful enough to run the math — and none of that aligned until ~2017. Even then, GPT-3 in 2020 wasn't ChatGPT: the final pieces were instruction tuning and RLHF, which replaced "predict internet text" with "be useful to a human" as the training objective.

"AI" and "LLM" are not synonyms. Neither are "pre-trained model" and "assistant." The distinctions matter when evaluating what any given system can actually do.

