
Takara Taniguchi

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

The first author is Rafael Rafailov, from Stanford.

The proposed method improves on RLHF fine-tuning based on Proximal Policy Optimization (PPO).
Direct Preference Optimization (DPO) changes how the policy is updated: it optimizes the policy directly from preference data, without an explicit RL step.

RL fine-tuning is conducted as follows:

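In the RLHF setup (notation as in the DPO paper), the policy π_θ maximizes a learned reward r_φ(x, y) under a KL penalty with coefficient β that keeps it close to the reference policy π_ref:

$$
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\big]
$$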

Using the partition function Z(x), the optimal policy for this objective can be written in closed form.

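Following the paper's derivation, the optimum is a reweighting of the reference policy, normalized by Z(x):

$$
\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)
$$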

We can eliminate Z(x), which is intractable to compute.

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i40dsvgp9jyboh93n2at.png)
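To sketch why Z(x) drops out (again following the paper's derivation): solving the closed-form policy for the reward gives

$$
r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x),
$$

and the Bradley–Terry preference model depends only on the difference of the rewards of two responses to the same prompt, so the β log Z(x) terms cancel:

$$
p(y_w \succ y_l \mid x) = \sigma\!\Big(\beta \log \frac{\pi(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)
$$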

Then, we do not have to train a separate reward model and can directly optimize the loss function on preference data.
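The DPO loss is the negative log-likelihood of the preference data under that model:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]
$$

A minimal sketch of this loss in PyTorch (the function and argument names are my own; it assumes the summed per-sequence log-probabilities of each response have already been computed under both the policy and the frozen reference model):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed per-sequence log-probabilities, each of shape (batch,)."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-likelihood of preferring the chosen response (Bradley-Terry)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice the log-probabilities are sums of token log-probs over each response, and only the policy receives gradients; the reference model stays frozen.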
