Ran Ding

From Transformers to ChatGPT

You can read the full note here (with better formatting) https://www.dingran.me/from-transformer-to-llm/


Introduction

Large language models such as GPT-3 have not only shown impressive performance on NLP benchmark tasks but have also captured the public's imagination through application interfaces such as ChatGPT.

This note provides a high-level summary of the progress in large language models (LLMs) from 2017 (the inception of the Transformer model) to now (the end of 2022), serving as a fast-paced recap for readers who want to catch up on the field quickly. General familiarity with machine learning/deep learning is assumed.

This note will only cover a small, core set of papers: Transformer, BERT, GPT, GPT-2, GPT-3, and InstructGPT. There are undoubtedly many other notable papers published during the same period of time - I'll leave them to a literature survey/reading list.

This is a somewhat long note - it is broken down into the following sections:

  1. Overview
    1. What is NLP
    2. Past progress (pre-2017)
    3. Recent progress (2017 - 2022)
  2. Model details
    1. Transformer
    2. GPT
    3. BERT
    4. GPT-2, GPT-3
    5. From GPT-3 to ChatGPT
  3. Conclusions & reflections

1. Overview

1.1 What is NLP?

Natural Language Processing (NLP) involves a wide range of tasks that focus on the processing and understanding of human language. Some of the main tasks in NLP include Text Classification, Named Entity Recognition (NER), Sentiment Analysis, Machine Translation, Information Retrieval, Question Answering, and Text Summarization. A more complete list of typical NLP tasks and progress in each is available here.

Here are a few excellent references:

  • Linguistic basics: Emily Bender’s Linguistic Fundamentals for Natural Language Processing: 100 Essentials from Morphology and Syntax (2013)
  • Yoav Goldberg’s book Neural Network Methods for Natural Language Processing (2017)
  • Chris Manning’s lectures (CS224d) on Natural Language Processing with Deep Learning (YouTube)

1.2 Past progress (pre-2017)

Historically, Natural Language Processing (NLP) was mostly based on rule-based approaches or statistical models. Deep learning took over NLP in the mid-2010s, with Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) becoming the de facto models (there were some attempts at other model families, but nothing broadly successful).

The field made a big jump with the introduction of these models, but at the same time it felt somewhat stagnant and limited, especially compared to Computer Vision (CV). The table below compares the status of NLP and CV at the time.

|                        | NLP   | CV    |
| ---------------------- | ----- | ----- |
| Model Scaling          | Poor  | Good  |
| Dataset Size           | Small | Large |
| Model Transferability  | Poor  | Good  |

RNNs/LSTMs process text one token at a time (e.g., reading a piece of text from left to right), so training is fundamentally sequential and hard to parallelize. As a result, model scaling in NLP lagged well behind CV, where architectures such as Convolutional Neural Networks (CNNs) parallelize much more easily.
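
To make the sequential bottleneck concrete, here is a toy NumPy sketch of the recurrence at the heart of an RNN; the dimensions and weights are made up. Each hidden state depends on the previous one, so the time steps cannot be computed in parallel.

```python
import numpy as np

# Toy RNN recurrence: h_t = tanh(W_h @ h_{t-1} + W_x @ x_t).
# The loop is strictly sequential because each h_t depends on h_{t-1}.
hidden_dim, input_dim, seq_len = 8, 4, 16
rng = np.random.default_rng(0)
W_h = 0.1 * rng.normal(size=(hidden_dim, hidden_dim))
W_x = 0.1 * rng.normal(size=(hidden_dim, input_dim))
xs = rng.normal(size=(seq_len, input_dim))   # a sequence of 16 token vectors

h = np.zeros(hidden_dim)
for x_t in xs:                               # cannot be parallelized across time steps
    h = np.tanh(W_h @ h + W_x @ x_t)
print(h.shape)  # (8,)
```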

Another obstacle to scaling in NLP was the lack of large labeled datasets. In CV, we have datasets like ImageNet, with ~1M images labeled across 1,000 categories. Although in machine translation we could construct datasets with ~1M sentence pairs, the information (and supervision signal) we can extract from a sentence pair is probably one to two orders of magnitude less than from a labeled image.

On model transferability, NLP had not seen the kind of success we saw in CV, where a model pre-trained on a supervised dataset achieves strong performance on downstream tasks. This is partly due to the diversity of NLP tasks, but equally important is the lack of large labeled datasets for training a model good and large enough to transfer and generalize well.

One side effect of all this is a big gap in generation capabilities between NLP and CV. Photorealistic image and video generation (e.g., deepfakes) has been around for several years (based on VAEs, GANs, etc.), while text generation remained extremely primitive - which is also why ChatGPT's recent capabilities seem so unbelievable.

1.3 Recent progress (2017 - 2022)

The progress from 2017 to 2022 removed all of the above constraints in NLP and let it leapfrog CV; this was truly NLP's breakthrough period. We'll walk through the details in the sections below. Here is a preview of the significant changes.

[Figure: preview of the significant changes from 2017 to 2022]

Transformer

Although initially developed specifically for machine translation, the Transformer quickly became NLP's new standard model architecture, largely replacing RNNs and LSTMs. It models sequences without recurrence, so training can be parallelized across positions, which massively improved our ability to scale up models. More recently, the Transformer has gone beyond NLP: it is used to model images, videos, and multi-modal data, and is considered one of the most important core architectures in machine learning generally.
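
As a rough, self-contained illustration (not the full Transformer, just its core operation), here is a minimal NumPy sketch of scaled dot-product self-attention; the dimensions and random inputs are made up. Note that all positions in the sequence are processed in one batched matrix computation, with no step-by-step recurrence.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a whole sequence at once.

    X: (seq_len, d_model) token embeddings; W_q, W_k, W_v: (d_model, d_head).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v              # all positions computed in parallel
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (seq_len, seq_len) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V                               # (seq_len, d_head) output

rng = np.random.default_rng(0)
d_model, d_head, seq_len = 16, 8, 10
X = rng.normal(size=(seq_len, d_model))
out = self_attention(X,
                     rng.normal(size=(d_model, d_head)),
                     rng.normal(size=(d_model, d_head)),
                     rng.normal(size=(d_model, d_head)))
print(out.shape)  # (10, 8)
```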

BERT, GPT

Thanks to the foundation laid by the Transformer, BERT and the GPT-series models scaled from ~100M parameters (GPT and BERT-base) to 175B parameters (GPT-3) in a span of just two years. Researchers also found creative ways to leverage large unlabeled datasets to support this scaling.

A key innovation from GPT and BERT is the clever structuring of the pre-trained language model's input/output so that the pre-trained model can be transferred (fine-tuned) to a wide array of NLP tasks. This established the familiar pre-training + fine-tuning setup we saw in CV, with compelling model performance and transferability.
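
As a rough sketch of that setup (hypothetical and simplified: the classifier, the toy encoder, and all dimensions here are invented for illustration), fine-tuning typically means placing a small task-specific head on top of a pre-trained model and training the stack on labeled downstream data.

```python
import torch
import torch.nn as nn

class SentimentClassifier(nn.Module):
    """Pre-trained encoder + small task-specific head, trained together on labeled data."""
    def __init__(self, pretrained_encoder, hidden_dim=768, num_labels=2):
        super().__init__()
        self.encoder = pretrained_encoder               # e.g. a BERT/GPT-style model
        self.head = nn.Linear(hidden_dim, num_labels)   # new layer added for the task

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)                # (batch, seq_len, hidden_dim)
        pooled = hidden[:, 0, :]                        # e.g. take the first position
        return self.head(pooled)                        # (batch, num_labels) logits

# Placeholder encoder so the sketch runs end to end; in practice this would be
# a model loaded with real pre-trained weights.
toy_encoder = nn.Embedding(30000, 768)
model = SentimentClassifier(toy_encoder)

token_ids = torch.randint(0, 30000, (4, 16))            # a fake batch of 4 sequences
labels = torch.randint(0, 2, (4,))
loss = nn.functional.cross_entropy(model(token_ids), labels)
loss.backward()                                         # gradients flow into head and encoder
```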

GPT-2, GPT-3

In GPT-2 and GPT-3, the authors introduced a new paradigm: instead of further adjusting the model (i.e., fine-tuning), they use natural language "prompts" to tell the model to perform new tasks. The prompt can also include a few examples (a.k.a. demonstrations). Depending on how many examples are given to the model, this is called the zero-shot or few-shot setting.

This allows much-improved transferability, since we no longer need to collect a labeled dataset and fine-tune for each specific downstream task. It is a significant step towards a pre-trained model that can perform previously unseen tasks based entirely on the context/prompt the user provides, and that can engage in open-ended tasks such as classifying over a previously unknown set of labels, holding dialogs, writing code, etc. - behavior that starts to resemble reasoning and intelligence.
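
To make the few-shot setting concrete, here is an illustrative prompt (the task and format are invented for this note, not taken from the GPT-3 paper): the model sees a couple of demonstrations and is expected to continue the pattern for the new input, with no gradient updates involved.

```python
# An illustrative few-shot prompt: two demonstrations followed by a new input.
# The model performs the task purely by continuing the text; no weights are updated.
prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The plot was dull and the acting was worse.
Sentiment: Negative

Review: A warm, funny, beautifully shot film.
Sentiment: Positive

Review: I walked out halfway through.
Sentiment:"""
# A GPT-3-style model would be expected to continue with " Negative".
```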

From GPT-3 to ChatGPT

Fundamentally, GPT-3 is just a language model. When given a piece of text (i.e., a prompt), it generates, with some randomness, plausible text that best continues it. This objective is misaligned with “following the user’s instructions helpfully and safely.” InstructGPT focuses on aligning large language models such as GPT-3 with user intent by fine-tuning with human feedback, so that the output is more helpful, truthful, and harmless.
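
To ground what "just a language model" means mechanically, here is a minimal sketch of sampling the next token from a model's output distribution; the logits and tiny vocabulary are made up for illustration. Real models repeat this step over a vocabulary of tens of thousands of tokens, feeding each sampled token back in, which is where the somewhat random plausible continuations come from.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=0.8):
    """Sample one token id from next-token logits.

    Lower temperature -> more deterministic output; higher -> more random.
    """
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())   # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy logits over a 5-token vocabulary (values are made up).
logits = np.array([2.0, 1.5, 0.2, -1.0, -3.0])
print(sample_next_token(logits))  # most often token 0 or 1, but not always
```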

2. Models

2.1 Transformer

...

Read the full note here https://www.dingran.me/from-transformer-to-llm/
