Understanding Reinforcement Learning from Human Feedback Part 2: Aligning Pretrained Models

#ai #machinelearning

In the previous article, we explored pre-training and the limitations of a model that receives no further training afterwards.

In this article, we will explore how we can align a pretrained model to help overcome these limitations.

The Two Steps of Alignment

Aligning a pretrained model usually involves two stages:

Supervised Fine-Tuning (SFT)
Reinforcement Learning from Human Feedback (RLHF)

Step 1: Supervised Fine-Tuning

Supervised fine-tuning uses a dataset made up of human-written prompts and human-written responses.

For example, someone might create a prompt like:

“Suggest a coding assistant tool”

And then provide a response such as:

“Try out Cursor”

Using many examples like this, we can train the model with the same next-token prediction objective and standard backpropagation used in pre-training, so that it learns to generate helpful responses.
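As a rough illustration of what a single fine-tuning example contributes to training, the sketch below joins the prompt and response from above into one sequence and computes the average negative log-likelihood over the response tokens only. Several things here are assumptions made for illustration and do not come from the original article: splitting text on whitespace instead of using a real tokenizer, masking out the prompt tokens (a common convention, though not the only one), the helper name `sft_loss`, and the made-up probabilities.

```python
import math


def sft_loss(token_probs, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1.

    token_probs[i] is the probability the model assigned to the correct
    token at position i. This is a hypothetical helper for illustration.
    """
    selected = [p for p, m in zip(token_probs, loss_mask) if m]
    if not selected:
        raise ValueError("loss_mask selects no tokens")
    return sum(-math.log(p) for p in selected) / len(selected)


# Toy "tokenization" by whitespace (an assumption; real models use subword tokenizers).
prompt_tokens = "Suggest a coding assistant tool".split()
response_tokens = "Try out Cursor".split()

# One training sequence: the prompt followed by the human-written response.
sequence = prompt_tokens + response_tokens

# Only the response tokens contribute to the loss (assumed convention).
loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)

# Made-up probabilities the model assigns to each correct token in the sequence.
token_probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.25, 0.5, 0.125]

loss = sft_loss(token_probs, loss_mask)
print(f"SFT loss on the response tokens: {loss:.4f}")
```

In an actual training run, backpropagation would adjust the model's weights to lower this loss, which raises the probability of the human-written response given the prompt.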

What Supervised Fine-Tuning Achieves

After supervised fine-tuning, the pretrained model becomes more aligned with human communication.

The model still predicts one token at a time, but instead of simply continuing arbitrary text as it did after pre-training, it now starts to generate:

helpful responses
polite responses
responses to natural language prompts

In other words, supervised fine-tuning transforms a pretrained but unaligned model into one that has started learning how to respond like an assistant.

The Limitation of Supervised Fine-Tuning

Since supervised fine-tuning requires human effort and time, the dataset is usually much smaller than the massive dataset used during pre-training.

Because of this, supervised fine-tuning can sometimes cause the model to overfit.

This means the model may respond well to prompts that resemble its training examples, but struggle with unfamiliar prompts that were not part of the fine-tuning dataset.
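One common way to spot this kind of overfitting is to compare the model's average loss on the fine-tuning examples with its average loss on held-out prompts it never trained on. The sketch below assumes per-example losses are already available. The function name `generalization_gap` and all the numbers are hypothetical and chosen only for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def generalization_gap(train_losses, heldout_losses):
    """Held-out mean loss minus training mean loss (hypothetical helper)."""
    return mean(heldout_losses) - mean(train_losses)


# Made-up per-example losses, not measurements from any real model.
train_losses = [0.25, 0.5, 0.75]    # prompts seen during fine-tuning
heldout_losses = [1.0, 1.5, 2.0]    # unfamiliar prompts

gap = generalization_gap(train_losses, heldout_losses)
print(f"Generalization gap: {gap:.2f}")
```

A gap that keeps growing as training continues suggests the model is memorizing its fine-tuning examples instead of learning behavior that transfers to new prompts.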

Why RLHF Is Needed

One possible solution would be to create a much larger supervised fine-tuning dataset.

However, collecting and writing a huge dataset by hand would be extremely expensive and time-consuming.

Instead, we can use Reinforcement Learning from Human Feedback (RLHF) to help train the model to generate better responses, even for prompts it was not directly trained on.

We will explore this further in the next article.
