
Takara Taniguchi

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

The first author is Rafael Rafailov, from Stanford.

The proposed method improves on RLHF fine-tuning based on Proximal Policy Optimization (PPO).
Direct Preference Optimization (DPO) changes how the policy is updated: it optimizes the policy directly from preference data, without an explicit RL step.

RL fine-tuning is conducted as follows:

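In the RLHF setup (notation as in the DPO paper), the policy π_θ maximizes a learned reward r_φ(x, y) under a KL penalty with coefficient β that keeps it close to the reference policy π_ref:

$$
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\big]
$$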

Using the partition function Z(x), the optimal policy for this objective can be written in closed form.

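Following the paper's derivation, the optimum is a reweighting of the reference policy, normalized by Z(x):

$$
\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)
$$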

We can eliminate Z(x), which is intractable to compute.

![Image description](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i40dsvgp9jyboh93n2at.png)
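To sketch why Z(x) drops out (again following the paper's derivation): solving the closed-form policy for the reward gives

$$
r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x),
$$

and the Bradley–Terry preference model depends only on the difference of the rewards of two responses to the same prompt, so the β log Z(x) terms cancel:

$$
p(y_w \succ y_l \mid x) = \sigma\!\Big(\beta \log \frac{\pi(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)
$$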

Then, we do not have to train a separate reward model and can directly optimize the loss function on preference data.
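The DPO loss is the negative log-likelihood of the preference data under that model:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\Big)\Big]
$$

A minimal sketch of this loss in PyTorch (the function and argument names are my own; it assumes the summed per-sequence log-probabilities of each response have already been computed under both the policy and the frozen reference model):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from summed per-sequence log-probabilities, each of shape (batch,)."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-likelihood of preferring the chosen response (Bradley-Terry)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In practice the log-probabilities are sums of token log-probs over each response, and only the policy receives gradients; the reference model stays frozen.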
