Understanding Reinforcement Learning with Human Feedback Part 3: Collecting Human Preferences

#ai #machinelearning

In the previous article, we explored the concept of aligning a pretrained model. Now we will look at the next component: collecting human preferences.

The first step in understanding RLHF is recognizing that, given a specific prompt, a model can generate different responses.

One way to generate a response is to configure the model to always select the token with the highest output score at every step. This is often called greedy decoding.

In this case, the model will generate the same response every single time for a given prompt.
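
To make this concrete, here is a minimal sketch of that selection rule in Python. The token names and scores are made-up toy values rather than output from a real model, and `greedy_pick` is a hypothetical helper name used only for illustration.

```python
# Hypothetical toy scores for four candidate next tokens (not from a real model).
logits = {"cat": 2.1, "dog": 1.3, "bird": 0.2, "fish": -0.5}


def greedy_pick(scores):
    """Return the token with the highest score."""
    return max(scores, key=scores.get)


# Repeating the selection never changes the outcome for the same scores.
picks = [greedy_pick(logits) for _ in range(5)]
print(picks)
```

Because nothing in the rule is random, every entry in `picks` is the same token.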

Generating Different Responses

Instead of always selecting the highest-scoring token, we can treat the outputs of the softmax function as probabilities and sample the next token from them.

In this approach:

the token with the highest probability is more likely to be selected
but other tokens still have a chance of being selected

As a result, the model can generate different responses for the same prompt.
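
The sampling idea can be sketched in a few lines of Python using only the standard library. As before, the scores are made-up toy values, and `softmax` and `sample_token` are hypothetical helper names. A real model would repeat this step once per generated token.

```python
import math
import random


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample_token(scores, rng):
    """Pick one token at random, weighted by its softmax probability."""
    probs = softmax(scores)
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical toy scores (not from a real model).
logits = {"cat": 2.1, "dog": 1.3, "bird": 0.2, "fish": -0.5}

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [sample_token(logits, rng) for _ in range(1000)]
counts = {tok: samples.count(tok) for tok in logits}
print(counts)
```

Over many draws, the highest-scoring token should appear most often, while the others still show up from time to time.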

Collecting Human Preferences

Since a model can generate multiple responses, we can create pairs of responses for the same prompt.

We can then ask people which response they prefer.

For example, given two possible answers, a person can simply choose the better one.

Collecting preferences like this is much faster than asking people to manually write responses for every prompt.

This preference collection process is the “Human Feedback” part of RLHF.
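
One possible way to store a single comparison is sketched below. The `PreferencePair` class, its field names, and the example text are all assumptions made for illustration. Real datasets may use a different layout.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One human judgment: two responses to the same prompt and the winner."""

    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b", as chosen by the person comparing them

    def chosen(self):
        """Return the response the person preferred."""
        return self.response_a if self.preferred == "a" else self.response_b

    def rejected(self):
        """Return the response the person did not prefer."""
        return self.response_b if self.preferred == "a" else self.response_a


# Hypothetical example record.
pair = PreferencePair(
    prompt="Explain what a token is.",
    response_a="A token is a small unit of text, such as a word or part of a word.",
    response_b="Tokens are things.",
    preferred="a",
)

print("Chosen:  ", pair.chosen())
print("Rejected:", pair.rejected())
```

Note that the person only supplies the `preferred` field. Both responses come from the model, which is why this is faster than writing answers by hand.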

Using Preference Data

Once we collect preference data, we can use it to train the model so that it assigns higher scores to preferred responses and lower scores to less preferred ones.

Over time, this helps the model generate responses that better match human preferences.
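
As a rough preview, one common way to express "higher score for the preferred response" is a pairwise logistic loss. This article does not specify the training objective, so treat the formula below as an assumption and `pairwise_loss` as a hypothetical helper name.

```python
import math


def pairwise_loss(score_chosen, score_rejected):
    """Small when the chosen response outscores the rejected one, large otherwise."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Hypothetical scores: the first case ranks the pair correctly, the second does not.
good_ranking = pairwise_loss(score_chosen=2.0, score_rejected=0.0)
bad_ranking = pairwise_loss(score_chosen=0.0, score_rejected=2.0)
print(good_ranking < bad_ranking)
```

Minimizing a loss like this pushes the score of the preferred response above the score of the other one.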

In the next article, we will explore how to train the model using this preference data.
