Understanding Reinforcement Learning from Human Feedback Part 6: How the Reward Model Trains the Original Model

#ai #machinelearning

In the previous article, we used loss functions and trained our reward model.

In this article, we will explore how to train the original model using the reward model we just trained.

Using New Prompts

To train the original model with the reward model, we start with new prompts that were not part of the supervised fine-tuning dataset.

We give the model a prompt, and it generates a response.

However, the response may not always be helpful or aligned with what people want.

Suppose the response is not very helpful. The reward model then assigns it a negative reward.
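Here is a minimal sketch of this scoring step. It assumes two hypothetical stand-ins: `generate_response` plays the role of the original model and `reward_model` plays the role of the trained reward model. Both are toy functions invented for illustration, and the keyword rules are an assumption, not how a learned reward model actually works.

```python
# Toy stand-ins for illustration only. A real setup would call a language
# model and a learned reward model here.

def generate_response(prompt: str) -> str:
    """Hypothetical original model: returns a canned, unhelpful reply."""
    return "I don't know."


def reward_model(prompt: str, response: str) -> float:
    """Hypothetical reward model: scores a response with a single number.

    Assumption: helpful-looking words raise the score and dismissive
    phrases lower it. A trained reward model learns its scoring from
    human preference data instead of using keyword rules.
    """
    text = response.lower()
    score = 0.0
    for word in ("sure", "here", "happy to help"):
        if word in text:
            score += 1.0
    for phrase in ("i don't know", "figure it out"):
        if phrase in text:
            score -= 1.0
    return score


new_prompt = "How do I reset my password?"  # not in the fine-tuning data
response = generate_response(new_prompt)
reward = reward_model(new_prompt, response)
print(f"response={response!r} reward={reward}")
```

With these toy rules the unhelpful reply gets a score below zero, while a reply such as "Sure, here is how to reset it." would get a score above zero.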

Improving the Original Model

We can now use this reward signal to train the original model through reinforcement learning.

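Each update nudges the model's parameters so that responses earning higher rewards become more likely. Over time, the model learns which kinds of responses receive better rewards.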
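To make the idea concrete, here is a small sketch that uses REINFORCE, one of the simplest reinforcement learning algorithms. This is a swapped-in technique chosen because it fits in a few lines; production RLHF systems commonly use more elaborate algorithms such as PPO. The three canned responses, the fixed reward values, and the learning rate are all assumptions made up for this example.

```python
import math
import random

random.seed(0)  # make the toy run repeatable

# Assumption: the "model" can only choose among three canned responses.
RESPONSES = [
    "Figure it out yourself.",
    "I don't know.",
    "Sure, I'm happy to help. Here are the steps.",
]
# Assumption: fixed scores standing in for the trained reward model.
REWARDS = [-1.0, -0.5, 1.0]

logits = [0.0, 0.0, 0.0]  # the policy's trainable parameters
LEARNING_RATE = 0.1


def softmax(values):
    """Turn logits into a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


for step in range(500):
    probs = softmax(logits)
    # Sample a response from the current policy.
    choice = random.choices(range(len(RESPONSES)), weights=probs)[0]
    reward = REWARDS[choice]
    # REINFORCE: the gradient of log pi(choice) with respect to logit i
    # is (1 if i == choice else 0) - probs[i]; scale it by the reward.
    for i in range(len(logits)):
        indicator = 1.0 if i == choice else 0.0
        logits[i] += LEARNING_RATE * reward * (indicator - probs[i])

final_probs = softmax(logits)
best = RESPONSES[final_probs.index(max(final_probs))]
print(best)
```

The policy starts out choosing all three responses with equal probability. Negative rewards push probability away from the unhelpful responses and positive rewards pull it toward the helpful one, so the helpful response ends up as the most likely choice.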

After reinforcement learning, the same prompt that previously resulted in an unhelpful answer can now produce a more polite and helpful response.

Since the new response is more helpful and aligned with human preferences, it receives a larger positive reward from the reward model.

The Final Result

Once training with RLHF is complete, we end up with a trained and aligned model that better matches how people want to interact with it.

At this stage, the model is much better at generating responses that are useful, polite, and aligned with human expectations.

That wraps up this RLHF series.

In the coming articles, we will explore a new topic.
