Scaling laws for reward model overoptimization


In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground-truth performance, in accordance with Goodhart's law. This effect has been observed frequently, but not measured carefully, owing to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed "gold-standard" reward model plays the role of humans, providing labels used to train a proxy reward model. We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-n sampling. We find that this relationship follows a different functional form depending on the method of optimization, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study how this relationship is affected by the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.
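As a rough illustration of the best-of-n half of this setup, the sketch below (not the paper's code; names such as `proxy_reward`, `gold_reward`, and `sample_completion` are hypothetical stand-ins) draws n completions from a policy, keeps the one a proxy reward model scores highest, and evaluates that choice with a gold reward model. The amount of optimization is measured by the KL distance of best-of-n from the unoptimized policy, for which the standard closed form KL_bon(n) = log n − (n − 1)/n is assumed.

```python
import math
import random
from typing import Callable, List, Tuple

def best_of_n(
    sample_completion: Callable[[], str],
    proxy_reward: Callable[[str], float],
    n: int,
) -> str:
    """Draw n completions from the policy and return the one the proxy
    reward model scores highest (best-of-n / rejection sampling)."""
    candidates = [sample_completion() for _ in range(n)]
    return max(candidates, key=proxy_reward)

def bon_kl(n: int) -> float:
    """Assumed closed form for the KL divergence between the best-of-n
    distribution and the unoptimized policy: log n - (n - 1) / n."""
    return math.log(n) - (n - 1) / n

def evaluate_overoptimization(
    sample_completion: Callable[[], str],
    proxy_reward: Callable[[str], float],
    gold_reward: Callable[[str], float],
    ns: List[int],
    trials: int = 256,
) -> List[Tuple[float, float, float]]:
    """For each n, report (KL from the policy, mean proxy score, mean gold
    score) of the best-of-n selections. A growing gap between the proxy and
    gold curves shows the proxy overstating ground-truth performance."""
    results = []
    for n in ns:
        picks = [best_of_n(sample_completion, proxy_reward, n) for _ in range(trials)]
        proxy = sum(proxy_reward(p) for p in picks) / trials
        gold = sum(gold_reward(p) for p in picks) / trials
        results.append((bon_kl(n), proxy, gold))
    return results

if __name__ == "__main__":
    # Toy stand-ins: random "completions" whose proxy score only imperfectly
    # tracks the gold score, mimicking reward-model error.
    random.seed(0)

    def sample_completion() -> str:
        return f"completion-{random.random():.6f}"

    def gold_reward(c: str) -> float:
        return float(c.split("-")[1])

    def proxy_reward(c: str) -> float:
        noise = random.Random(c).gauss(0.0, 0.3)  # deterministic noise per completion
        return gold_reward(c) + noise

    for kl, proxy, gold in evaluate_overoptimization(
        sample_completion, proxy_reward, gold_reward, ns=[1, 2, 4, 16, 64]
    ):
        print(f"KL={kl:5.2f}  proxy={proxy:.3f}  gold={gold:.3f}")
```

In this toy, the mean proxy score of the selected completions rises faster than the mean gold score as n grows, which is the qualitative Goodhart effect; the paper's contribution is measuring how the corresponding gold-score curves scale with reward model size, dataset size, and optimization method.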
