Scaling laws for reward model overoptimization


In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground-truth performance, in accordance with Goodhart's law. This effect has been observed frequently, but not measured carefully, owing to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed "gold-standard" reward model plays the role of humans, providing labels used to train a proxy reward model. We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-n sampling. We find that this relationship follows a different functional form depending on the method of optimization, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study how this relationship is affected by the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.
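As a rough illustration of the best-of-n half of this setup, the sketch below (not the paper's code; names such as `proxy_reward`, `gold_reward`, and `sample_completion` are hypothetical stand-ins) draws n completions from a policy, keeps the one a proxy reward model scores highest, and evaluates that choice with a gold reward model. The amount of optimization is measured by the KL distance of best-of-n from the unoptimized policy, for which the standard closed form KL_bon(n) = log n − (n − 1)/n is assumed.

```python
import math
import random
from typing import Callable, List, Tuple

def best_of_n(
    sample_completion: Callable[[], str],
    proxy_reward: Callable[[str], float],
    n: int,
) -> str:
    """Draw n completions from the policy and return the one the proxy
    reward model scores highest (best-of-n / rejection sampling)."""
    candidates = [sample_completion() for _ in range(n)]
    return max(candidates, key=proxy_reward)

def bon_kl(n: int) -> float:
    """Assumed closed form for the KL divergence between the best-of-n
    distribution and the unoptimized policy: log n - (n - 1) / n."""
    return math.log(n) - (n - 1) / n

def evaluate_overoptimization(
    sample_completion: Callable[[], str],
    proxy_reward: Callable[[str], float],
    gold_reward: Callable[[str], float],
    ns: List[int],
    trials: int = 256,
) -> List[Tuple[float, float, float]]:
    """For each n, report (KL from the policy, mean proxy score, mean gold
    score) of the best-of-n selections. A growing gap between the proxy and
    gold curves shows the proxy overstating ground-truth performance."""
    results = []
    for n in ns:
        picks = [best_of_n(sample_completion, proxy_reward, n) for _ in range(trials)]
        proxy = sum(proxy_reward(p) for p in picks) / trials
        gold = sum(gold_reward(p) for p in picks) / trials
        results.append((bon_kl(n), proxy, gold))
    return results

if __name__ == "__main__":
    # Toy stand-ins: random "completions" whose proxy score only imperfectly
    # tracks the gold score, mimicking reward-model error.
    random.seed(0)

    def sample_completion() -> str:
        return f"completion-{random.random():.6f}"

    def gold_reward(c: str) -> float:
        return float(c.split("-")[1])

    def proxy_reward(c: str) -> float:
        noise = random.Random(c).gauss(0.0, 0.3)  # deterministic noise per completion
        return gold_reward(c) + noise

    for kl, proxy, gold in evaluate_overoptimization(
        sample_completion, proxy_reward, gold_reward, ns=[1, 2, 4, 16, 64]
    ):
        print(f"KL={kl:5.2f}  proxy={proxy:.3f}  gold={gold:.3f}")
```

In this toy, the mean proxy score of the selected completions rises faster than the mean gold score as n grows, which is the qualitative Goodhart effect; the paper's contribution is measuring how the corresponding gold-score curves scale with reward model size, dataset size, and optimization method.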
