Understanding Reinforcement Learning from Human Feedback Part 4: Teaching Models Human Preferences

#ai #machinelearning

In the previous article, we covered how human preference data is collected. In this article, we will see how to use that data to train a model.

To train a model that gives higher scores to preferred responses, we first make a copy of the model that has already gone through supervised fine-tuning.

Modifying the Model

Next, we modify this copied model.

We remove the unembedding layer, which normally predicts the next token, and replace it with a layer that produces a single output value.

The result is a new model called a reward model.

Instead of generating text, this model learns to assign a reward score to a response.
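
To make the modification concrete, here is a minimal sketch in plain Python. It is a toy, not a real language model: the "body" is only a placeholder string, and the names HIDDEN_DIM, VOCAB_SIZE, make_matrix, and apply_head are hypothetical, chosen for illustration. It only shows the shape change: the copied model keeps the same body, but its head now maps the hidden state to one number instead of one logit per token.

```python
import copy
import random

HIDDEN_DIM = 4   # hypothetical size of the final hidden state
VOCAB_SIZE = 10  # hypothetical vocabulary size

def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def apply_head(hidden, weights):
    """Multiply a hidden-state vector by a (hidden_dim x out_dim) weight matrix."""
    out_dim = len(weights[0])
    return [
        sum(hidden[i] * weights[i][j] for i in range(len(hidden)))
        for j in range(out_dim)
    ]

rng = random.Random(0)

# Stand-in for the supervised fine-tuned model: a body plus an unembedding
# head that produces one logit per vocabulary token.
sft_model = {
    "body": "transformer layers (placeholder)",
    "head": make_matrix(HIDDEN_DIM, VOCAB_SIZE, rng),
}

# Step 1: make a copy of the fine-tuned model.
reward_model = copy.deepcopy(sft_model)

# Step 2: replace the unembedding head with a head that has a single output.
reward_model["head"] = make_matrix(HIDDEN_DIM, 1, rng)

# Made-up hidden state standing in for the model's view of one response.
hidden_state = [0.5, -1.0, 0.25, 2.0]

token_logits = apply_head(hidden_state, sft_model["head"])
reward = apply_head(hidden_state, reward_model["head"])[0]

print(len(token_logits))  # one logit per vocabulary token
print(reward)             # a single (still untrained) reward score
```

The new head starts with random weights, so the score it prints is meaningless until the reward model is trained.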

Training the Reward Model

We can now train this reward model using the human preference data we collected earlier.

For a preferred response, we train the model to produce a higher reward value.

For a less preferred response, we train the model to produce a lower reward value, which may be negative.

For example:

If humans preferred Response A over Response B, the reward model learns to give a higher score to Response A and a lower score to Response B.

Over time, the reward model learns what kinds of responses humans tend to prefer.
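
The article does not say which loss function is used, so the sketch below assumes one common choice: a pairwise loss, -log(sigmoid(r_preferred - r_rejected)), which shrinks as the preferred response scores higher than the rejected one. The reward model here is reduced to a single linear head over made-up feature vectors, and every name and number is hypothetical. In a real setup the features would come from the copied model's body.

```python
import math

def reward(weights, features):
    """Scalar reward: dot product of the head weights and a response's features."""
    return sum(w * x for w, x in zip(weights, features))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_loss(weights, preferred, rejected):
    """-log(sigmoid(margin)); small when the preferred response scores higher."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    return -math.log(sigmoid(margin))

def train_step(weights, preferred, rejected, lr=0.1):
    """One gradient-descent step on the pairwise loss for one comparison."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    # d(loss)/d(margin) = -(1 - sigmoid(margin)); d(margin)/d(w_i) = p_i - r_i
    grad_scale = -(1.0 - sigmoid(margin))
    return [
        w - lr * grad_scale * (p - r)
        for w, p, r in zip(weights, preferred, rejected)
    ]

# Made-up preference data: (features of preferred response, features of rejected one).
preference_pairs = [
    ([1.0, 0.2, 0.0], [0.1, 0.9, 0.3]),
    ([0.8, 0.1, 0.2], [0.2, 0.7, 0.5]),
]

weights = [0.0, 0.0, 0.0]  # untrained reward head
loss_before = sum(pairwise_loss(weights, p, r) for p, r in preference_pairs)

for _ in range(200):
    for preferred, rejected in preference_pairs:
        weights = train_step(weights, preferred, rejected)

loss_after = sum(pairwise_loss(weights, p, r) for p, r in preference_pairs)

print(loss_before, loss_after)
for preferred, rejected in preference_pairs:
    print(reward(weights, preferred), reward(weights, rejected))
```

Note that this loss depends only on the difference between the two scores, so what the model learns is a ranking: the preferred response ends up with the higher score, whatever the absolute values are.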

We will continue in the next article.
