The $12,000 Surprise
RLHF training for a 7B parameter model ran us $12,400 on AWS for three days of continuous runs. The compute wasn't the issue — it was the waste. Every iteration meant spinning up a critic model, generating completions, calculating rewards, backpropagating through both networks, and repeating. When we migrated the same preference dataset to DPO, the equivalent training run cost $3,950. Same dataset, same base model, 68% cost reduction.
But the migration wasn't a drop-in replacement. DPO doesn't use a reward model at all, which sounds like a simplification until you realize your entire loss function changes shape.
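To make "changes shape" concrete, here is a minimal sketch of the DPO loss from Rafailov et al. (2023) in PyTorch. The argument names, the beta=0.1 default, and the assumption that each input is a per-pair sum of token log-probabilities are illustrative choices for this sketch, not details from the migration described here.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is a tensor with one value per preference pair:
    the summed log-probability of the chosen or rejected completion
    under the trainable policy or the frozen reference model.
    """
    # Implicit "rewards" are log-ratios against the frozen reference model,
    # so no separate reward model is trained or queried.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # Logistic-loss objective: widen the margin between the chosen and
    # rejected log-ratios, scaled by beta.
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```

One forward pass through the policy and the frozen reference model per preference pair is the whole training step: no sampling loop, no reward-model scoring, no PPO machinery, which is largely where the cost gap above comes from.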
What Actually Changed Under the Hood
RLHF trains two models in tandem. The policy model generates text, the reward model scores it, and a policy-gradient method (usually PPO) nudges the policy toward higher rewards. The policy network's loss is built around a fairly involved expectation.
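As a sketch, the standard KL-penalized form of that objective (conventional RLHF notation, not necessarily the exact formulation the original run used) is:

$$
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \,\big]
$$

where $\pi_\theta$ is the trainable policy, $\pi_{\mathrm{ref}}$ is the frozen reference model, $r_\phi$ is the learned reward model, and $\beta$ weights the KL penalty. PPO optimizes this with clipped policy-gradient updates over freshly sampled completions, which is why every iteration needs generation, reward scoring, and gradient updates for both the policy and the value (critic) network.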