
Scaling laws for reward model overoptimization

In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground-truth performance, in accordance with Goodhart's law. This effect has been frequently observed but not carefully measured, owing to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed "gold-standard" reward model plays the role of humans, providing the labels used to train a proxy reward model. We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-n sampling. We find that this relationship follows a different functional form depending on the method of optimization, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study how this relationship is affected by the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.
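To make the best-of-n setup concrete, here is a minimal toy sketch (not the paper's actual models): a scalar "gold" reward stands in for human preferences, a noisy version of it plays the proxy, and best-of-n selects the candidate the proxy scores highest. All function names and the noise level are illustrative assumptions.

```python
import random

def gold_reward(x: float) -> float:
    # Hypothetical "gold-standard" reward standing in for human labels.
    return x

def noisy_proxy(x: float, rng: random.Random) -> float:
    # Imperfect proxy: the gold reward plus label noise (assumed Gaussian).
    return x + rng.gauss(0.0, 0.5)

def best_of_n_gold(n: int, rng: random.Random) -> float:
    """Sample n candidates, pick the one the proxy scores highest,
    and report that candidate's *gold* reward."""
    samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
    best = max(samples, key=lambda s: noisy_proxy(s, rng))
    return gold_reward(best)

rng = random.Random(0)
trials = 500
avg_n1 = sum(best_of_n_gold(1, rng) for _ in range(trials)) / trials
avg_n16 = sum(best_of_n_gold(16, rng) for _ in range(trials)) / trials
# With this mild proxy noise, optimizing harder (larger n) still raises the
# average gold reward; with a worse proxy, the gain would eventually
# plateau and reverse -- the overoptimization the paper measures.
```

Tracking the gap between the proxy score of the selected sample and its gold reward, as a function of n, is the toy analogue of the gold-vs-proxy curves studied in the paper.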


