Rafael Rafailov (Stanford) is the first author.
The proposed method improves on RLHF pipelines that fine-tune with Proximal Policy Optimization (PPO).
Direct Preference Optimization (DPO) simplifies how the policy is updated: it optimizes the policy directly on preference data, without fitting an explicit reward model or running RL.
RL fine-tuning is conducted as follows: maximize E_{x~D, y~pi(.|x)}[ r(x, y) ] - beta * KL( pi(y|x) || pi_ref(y|x) ).

Using the partition function Z(x) = sum_y pi_ref(y|x) exp( r(x, y) / beta ), the optimal policy has the closed form pi*(y|x) = (1/Z(x)) * pi_ref(y|x) * exp( r(x, y) / beta ).

Rearranging gives r(x, y) = beta * log( pi*(y|x) / pi_ref(y|x) ) + beta * log Z(x). Substituting this into the Bradley-Terry preference model, the beta * log Z(x) terms cancel, so we can delete Z(x), which is difficult to calculate.
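The cancellation of Z(x) can be sketched as code. This is a minimal, illustrative single-pair DPO loss, not the authors' implementation; it assumes the summed log-probabilities of each response under the policy and the reference model are already computed, and the function and argument names are hypothetical:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Illustrative DPO loss for one preference pair.

    The implicit reward is r(x, y) = beta * log(pi(y|x)/pi_ref(y|x))
    + beta * log Z(x); plugging it into the Bradley-Terry model makes
    the beta * log Z(x) terms cancel, leaving
    -log sigmoid( beta * (ratio_chosen - ratio_rejected) ).
    """
    ratio_chosen = logp_chosen - ref_logp_chosen      # log pi/pi_ref (chosen)
    ratio_rejected = logp_rejected - ref_logp_rejected  # log pi/pi_ref (rejected)
    margin = beta * (ratio_chosen - ratio_rejected)
    # -log(sigmoid(margin)) written as log1p(exp(-margin))
    return math.log1p(math.exp(-margin))
```

When both ratios are equal the margin is zero and the loss is log 2; as the policy assigns relatively more probability to the chosen response, the loss decreases, with no evaluation of Z(x) anywhere.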
