DEV Community

Rijul Rajesh

Understanding Reinforcement Learning with Human Feedback Part 5: Training the Reward Model with Loss Functions

In the previous article, we created a reward model. In this article, we will continue exploring how this model is trained.

One important thing to note is that we do not need to define the ideal reward values in advance.

Instead, the model learns to determine appropriate rewards on its own.

The Loss Function

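To train the reward model, OpenAI used the following loss function in their 2022 paper. For a single pair of responses, it can be written as:

Loss = -log(sigmoid(Reward_better - Reward_worse))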

This loss function helps the model learn good reward values without us explicitly defining what the rewards should be.

Where:

  • Reward_better is the reward calculated for the preferred response
  • Reward_worse is the reward calculated for the less preferred response

Ideally, we want:
Reward_better - Reward_worse
to be a large positive number.

Step 1: The Sigmoid Function

The difference between the rewards is first passed through a sigmoid function.

For any input value, the sigmoid function outputs a value between 0 and 1.

In an ideal case, we want the sigmoid output to be close to 1. This happens when the preferred response receives a much higher reward than the worse response.
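To make this step concrete, here is a minimal sketch in plain Python that evaluates the sigmoid for a few reward differences. The difference values are made up for illustration and do not come from a trained reward model.

```python
import math


def sigmoid(x: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical values of Reward_better - Reward_worse, chosen for illustration.
for diff in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(f"difference = {diff:+.1f} -> sigmoid = {sigmoid(diff):.3f}")
```

A difference of 0 gives exactly 0.5, larger positive differences push the output toward 1, and negative differences (the worse response scored higher) push it toward 0.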

Step 2: The Log Function

The output of the sigmoid function is then passed through a log function.

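The sigmoid output is always between 0 and 1, so its log is never positive. In the ideal case, where the sigmoid output is close to 1, the log is close to 0, which is the highest value it can reach. If the model ranks the pair the wrong way, the sigmoid output is close to 0 and the log becomes a large negative number.

Finally, we multiply the result by -1.

This turns the equation into a loss function that optimization algorithms can minimize during training. The loss is close to 0 when the preferred response scores much higher than the worse one, and it grows as the model gets the ranking wrong.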
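The full chain (difference, sigmoid, log, multiply by -1) can be sketched as a single Python function. The name preference_loss and the reward pairs below are assumptions made for this illustration. The function uses the algebraically equivalent form log(1 + exp(-d)), which gives the same result as -log(sigmoid(d)) but avoids overflow for very large differences.

```python
import math


def preference_loss(reward_better: float, reward_worse: float) -> float:
    """Compute -log(sigmoid(reward_better - reward_worse)) for one pair."""
    diff = reward_better - reward_worse
    # -log(sigmoid(d)) equals log(1 + exp(-d)). Writing it this way keeps
    # the exponent non-positive, so math.exp never overflows.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# Hypothetical (reward_better, reward_worse) pairs, chosen for illustration.
for better, worse in [(3.0, -2.0), (0.5, 0.5), (-2.0, 3.0)]:
    loss = preference_loss(better, worse)
    print(f"better = {better:+.1f}, worse = {worse:+.1f} -> loss = {loss:.4f}")
```

The first pair ranks the responses correctly by a wide margin and gets a loss near 0. The second pair ties and gets log(2), about 0.69. The third pair ranks them the wrong way and gets a loss above 5.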

The interesting part is that we never explicitly tell the model:

  • the preferred response must have a positive reward
  • the worse response must have a negative reward

Instead, the loss depends only on the difference between the two rewards, so it naturally guides the model toward scoring preferred responses higher than worse ones, whatever the absolute values turn out to be.
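A toy experiment illustrates this. In the sketch below, two free numbers stand in for the reward model's outputs on a single pair of responses. This is a simplification: a real reward model is a neural network whose weights are updated, not two standalone numbers. Both rewards start at the same negative value, and the starting value, learning rate, and step count are arbitrary choices made for the example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def loss(r_better: float, r_worse: float) -> float:
    return -math.log(sigmoid(r_better - r_worse))


# Toy stand-ins for the reward model's two outputs (assumed starting values).
r_better, r_worse = -3.0, -3.0
learning_rate = 0.5
initial_loss = loss(r_better, r_worse)

for step in range(50):
    s = sigmoid(r_better - r_worse)
    # Gradients of -log(sigmoid(r_better - r_worse)) with respect to each reward.
    grad_better = -(1.0 - s)
    grad_worse = 1.0 - s
    # Plain gradient descent: move each reward against its gradient.
    r_better -= learning_rate * grad_better
    r_worse -= learning_rate * grad_worse

final_loss = loss(r_better, r_worse)
print(f"r_better = {r_better:.3f}, r_worse = {r_worse:.3f}")
print(f"loss went from {initial_loss:.4f} to {final_loss:.4f}")
```

In this run the gap between the two rewards widens and the loss falls, yet both rewards stay negative. Nothing in the loss asks for a particular sign or scale, only for the preferred response to score higher.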

What Happens Next?

Once the reward model is fully trained, we can use it to train the original model that only went through supervised fine-tuning.

In the next article, we will explore how the reward model is used to further train the original model.
