neha

Posted on • Originally published at Medium

Long Long Ago — The History of Generative AI

We have all seen it and are mesmerised by it. ChatGPT can write essays, generate images, create stories and art, write code, be our friend, and give advice like a mentor: it does it all. It seems like magic, but it is the result of decades of ideas, technological evolution, and small steps taken behind the scenes before Generative AI models emerged as the winners on the grand stage of AI.

How did we get here? In this post, I will trace the story of how we arrived at today's powerful LLMs, a journey that repeatedly draws on attempts to mimic how the human mind works.

Symbolic AI: In the Beginning [1950s - 1980s]
The very first experiments with AI were not smart, but they were the front benchers: the very obedient students who followed rules exactly as told.

Early AI followed statements that looked like if-else clauses: "If X happens, do Y." This approach is referred to as Symbolic AI.

For example —
We give the robot some rules to follow:

If it is yellow and curved → it is a banana.
If it is red and round → it is an apple.
If it is orange and round → it is an orange.

Now,
If the fruit is yellow and curved → "It's a banana!"
If the fruit is red and round → "It's an apple!"
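Here is a minimal Python sketch of what such a rule-based classifier might look like (the classify_fruit function is just the toy example above, written as code):

```python
def classify_fruit(colour, shape):
    # Symbolic AI: hand-written rules, no learning involved
    if colour == "yellow" and shape == "curved":
        return "It's a banana!"
    if colour == "red" and shape == "round":
        return "It's an apple!"
    if colour == "orange" and shape == "round":
        return "It's an orange!"
    # Anything outside the rules simply fails: the system cannot generalise
    return "I don't know this fruit."

print(classify_fruit("yellow", "curved"))  # It's a banana!
print(classify_fruit("green", "round"))    # I don't know this fruit.
```

Notice the brittleness: a green apple falls straight through the rules.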

Symbolic AI rests on two concepts: symbolic representation and rule-based inference. Some real-world examples of Symbolic AI include:

Expert systems for medical diagnosis
Knowledge graphs and Ontologies
Early chatbots: ELIZA (1960s), the first chatbot, was designed to mimic a therapist.

Machine Learning: Learning Takes Over the Rules [1990s–2010s]
The next major shift came from the question: "Can machines learn from examples instead of a definitive set of rules?"

A learning robot — Image generated using AI (Gemini)

This question and the experimentation around it led to Machine Learning: instead of being given rules, systems were fed data, for example emails labelled as spam or brain images labelled as containing a tumour, and learned to figure out the patterns themselves. This resulted in task-specific intelligent systems like:

  1. Spam filters
  2. Recommendation engines
  3. Image Recognition
  4. Speech Recognition

Why is this better than before? These systems are more adaptive to real-world scenarios.
But what was missing? These models were specific to one task and could navigate only their own area of expertise; they had no general intelligence.
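To make the contrast with rules concrete, here is a minimal sketch of a spam filter trained on labelled examples using scikit-learn (the four-email dataset is made up purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Labelled examples replace hand-written rules (toy data for illustration)
emails = [
    "win a free prize now",
    "claim your free reward",
    "meeting notes attached",
    "lunch tomorrow at noon",
]
labels = ["spam", "spam", "ham", "ham"]

# Turn text into word-count features, then learn spam patterns from the labels
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(features, labels)

# The model generalises to emails it has never seen
print(model.predict(vectorizer.transform(["free prize inside"])))  # ['spam']
```

Nobody wrote a "free means spam" rule; the model inferred it from the labels.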

The Big Shift: Learning From Everything [2017–2020s]
The next paradigm shift was to learn from everything. But what changed? Why weren't we training models on everything before?

As technology evolved, two ingredients of LLMs fell into place: the brain and the muscle.
The muscle: GPUs became faster, and their thousands of cores made massive parallel computation possible.
The brain: in 2017, Google researchers published "Attention Is All You Need", the paper that introduced the concept of transformers.

Enter — Transformers

I think therefore I am — Image generated using AI (Gemini)

Before 2017, models relied on RNNs (Recurrent Neural Networks), which read everything sequentially, word by word. Transformers "transformed" this approach: they could look at a whole sentence at once.

Let us look at the key concepts which made this possible —

  1. Self-attention
  2. Multi-head Attention
  3. Positional Encoding
  4. Feedforward Layers
  5. Layer Normalization

Let's get into the details of each of these. I will use the sentence below to walk through them:

“Jim Halpert loves the mountains but Dwight Schrute loves the beach”

Image generated using AI (Gemini)

Self-Attention: Understanding the importance
In the context of each word, which other words matter most? Self-attention allows the model to understand which words are important in a particular word's context: for each word, it computes an "attention" score against every other word in the sentence.

Let's see this with our example sentence. Here, the attention score between "Jim" and "mountains" is high, and the attention score between "Dwight" and "beach" is high. The model gives more weight to "mountains" in the context of "Jim" and more weight to "beach" in the context of "Dwight". At every word, the model asks: "Which other words matter the most to me right now?" With this, the model concludes "Jim Halpert → mountains" and "Dwight Schrute → beach".
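Here is a minimal NumPy sketch of the scaled dot-product attention behind this; the random matrix stands in for learned word embeddings, and real models would first project the input into separate query, key, and value matrices:

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of word vectors."""
    d = X.shape[-1]
    # Real models project X into queries, keys, and values with learned
    # matrices; we use X directly to keep the sketch minimal.
    scores = X @ X.T / np.sqrt(d)                   # similarity of every word with every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ X                              # each word becomes a weighted mix of all words

# 10 words ("Jim", "Halpert", ..., "beach"), each a 4-dimensional toy embedding
X = np.random.randn(10, 4)
print(self_attention(X).shape)  # (10, 4)
```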

Multi-head attention: Many perspectives, one problem

Multi-head attention is self-attention done multiple times, with a different aspect as the focus point for each iteration. This brings different patterns into the model. It is similar to different people interpreting the same sentence based on their expertise: one could focus on syntax, another on meaning, and so on. Let's use the same example:

  1. The first head might focus on “Jim Halpert → loves → mountains”
  2. The second head might focus on the symmetry between the two phrases: “Jim Halpert → loves → mountains” and “Dwight Schrute → loves → beach”
  3. The third head might focus on the “but Dwight Schrute loves the beach” clause and the contrast the sentence is trying to convey.

This can go on: the more heads, the more viewpoints the model has. GPT-3 used around 96 attention heads.
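A sketch of the idea, reusing the self_attention function from the previous snippet; note that real models give each head its own learned projection, while slicing the embedding is a simplification for illustration:

```python
import numpy as np

def multi_head_attention(X, num_heads):
    """Run self-attention once per head, each on its own slice of the embedding."""
    assert X.shape[-1] % num_heads == 0
    # One lower-dimensional view of every word per head; real models use
    # learned per-head projections instead of plain slicing.
    slices = np.split(X, num_heads, axis=-1)
    heads = [self_attention(s) for s in slices]   # each head attends with its own "perspective"
    return np.concatenate(heads, axis=-1)         # stitch the viewpoints back together

X = np.random.randn(10, 8)                         # 10 words, 8-dimensional toy embeddings
print(multi_head_attention(X, num_heads=4).shape)  # (10, 8)
```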

Positional Encoding: Giving a sense of order

The concepts above, by themselves, don't capture the order of the words in a sentence. On its own, the model doesn't see the difference between "Jim Halpert loves mountains" and "mountains love Jim Halpert". Positional encoding makes sure the model knows that "Jim" comes first, then "mountains", then "Dwight", and finally "beach".
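The "Attention Is All You Need" paper encoded positions with sine and cosine waves of different wavelengths; here is a minimal sketch of that scheme:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, as in 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]           # 0 for "Jim", 1 for "Halpert", ...
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / (10000 ** (dims / d_model))  # a different wavelength per dimension
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)                     # even dimensions use sine
    enc[:, 1::2] = np.cos(angles)                     # odd dimensions use cosine
    return enc

# Added to the word embeddings, so "Jim" at position 0 looks different from "Jim" at position 5
print(positional_encoding(seq_len=10, d_model=8).shape)  # (10, 8)
```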

Feedforward layers: Generation is here

After the attention layers, the feedforward layer registers what was learned as abstractions, which are later used for generation, summarization, and so on. The feedforward layer leaves out the details and captures the essence; in our example, something as simple as "two people with different tastes". The model arrives at these abstractions after learning from many examples.
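A sketch of the position-wise feedforward layer: two linear transformations with a non-linearity in between, applied to each word independently (the random weights here stand in for trained ones):

```python
import numpy as np

def feedforward(X, W1, b1, W2, b2):
    """Position-wise feedforward: expand, apply a non-linearity, project back."""
    hidden = np.maximum(0, X @ W1 + b1)  # ReLU keeps only useful feature combinations
    return hidden @ W2 + b2              # project back down to the model dimension

d_model, d_hidden = 8, 32  # the original paper used 512 and 2048
W1, b1 = np.random.randn(d_model, d_hidden), np.zeros(d_hidden)
W2, b2 = np.random.randn(d_hidden, d_model), np.zeros(d_model)
X = np.random.randn(10, d_model)
print(feedforward(X, W1, b1, W2, b2).shape)  # (10, 8)
```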

Layer normalization

Training with so many layers can be extremely unstable and unpredictable: a specific batch of data can push a layer's activations off balance, and that imbalance can percolate and grow exponentially through the layers that follow. Layer normalization normalizes the activations at each layer, and it is applied after the attention and feedforward layers.
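A minimal sketch of the normalization step (real implementations also learn a per-dimension scale and shift, omitted here):

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalize each word vector to zero mean and unit variance."""
    mean = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    return (X - mean) / (std + eps)  # keeps activations in a stable, predictable range

X = np.random.randn(10, 8) * 100 + 50       # wildly scaled activations
print(layer_norm(X).std(axis=-1).round(2))  # ~1.0 for every word
```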

Here is the final sequence in which the layers come together inside a transformer block:

Input embeddings + positional encoding → multi-head self-attention → add & norm* → feedforward layer → add & norm*, repeated block after block.

*The residual connection adds the output of each step back to its input.
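Putting the pieces together, one transformer block in sketch form, reusing the functions from the snippets above (the residual connections are the `X +` terms):

```python
def transformer_block(X, num_heads, W1, b1, W2, b2):
    """One block: attention then feedforward, each wrapped in residual + layer norm."""
    X = layer_norm(X + multi_head_attention(X, num_heads))  # residual adds the output back to the input
    X = layer_norm(X + feedforward(X, W1, b1, W2, b2))
    return X  # stack N of these blocks to build the full model
```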

Conclusion: The Journey Has Just Begun
From the obedient rule-followers of Symbolic AI, to machines that learn from data, and now to models that can generate and create: the journey of Generative AI has been a series of small, powerful steps coming together.

Transformers have changed the game. With self-attention, multi-head perspectives, and an understanding of context, these models can now write stories, answer questions, generate art, and mimic creativity.

We are just at the beginning. The systems are evolving fast, and there is a lot more coming next: Agentic AI, human-in-the-loop AI, personalized AI, and more.

The story so far has been magical, and what awaits next is yet to be imagined.
