There are still important disanalogies between our current empirical setup and the ultimate problem of aligning superhuman models. For example, it may be easier for future models to imitate weak human errors than it is for current strong models to imitate current weak model errors, which could make generalization harder in the future.
Nevertheless, we believe our setup captures some of the key difficulties of aligning future superhuman models, enabling us to start making empirical progress on this problem today. There are many promising directions for future work, including fixing the disanalogies in our setup, developing more scalable methods, and advancing our scientific understanding of when and how we should expect good weak-to-strong generalization.
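To make the setup concrete, here is a minimal sketch of a weak-to-strong experiment, using scikit-learn classifiers as stand-ins rather than the pretrained language models studied in the paper (the model choices, feature handicap, and variable names below are purely illustrative and are not the released codebase's API): a deliberately weak supervisor is trained on ground truth, a stronger student is trained only on the supervisor's labels, and we measure how much of the weak-to-strong performance gap the student recovers (PGR).

```python
# Illustrative weak-to-strong generalization sketch (not the released repo's API).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=6000, n_features=40, n_informative=10, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, train_size=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, train_size=2000, random_state=0)

# 1. Weak supervisor: a simple model, handicapped to see only a few features,
#    trained on ground-truth labels.
weak = LogisticRegression(max_iter=1000).fit(X_weak[:, :5], y_weak)
weak_labels = weak.predict(X_train[:, :5])          # noisy "weak" supervision
weak_acc = accuracy_score(y_test, weak.predict(X_test[:, :5]))

# 2. Strong student: a more capable model trained only on the weak labels.
student = GradientBoostingClassifier(random_state=0).fit(X_train, weak_labels)
student_acc = accuracy_score(y_test, student.predict(X_test))

# 3. Strong ceiling: the same model trained on ground truth (upper bound).
ceiling = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
ceiling_acc = accuracy_score(y_test, ceiling.predict(X_test))

# Performance gap recovered (PGR): the fraction of the weak-to-ceiling gap the
# student closes despite never seeing ground-truth labels.
pgr = (student_acc - weak_acc) / (ceiling_acc - weak_acc)
print(f"weak={weak_acc:.3f}  student={student_acc:.3f}  ceiling={ceiling_acc:.3f}  PGR={pgr:.2f}")
```

In our actual experiments, the weak supervisor and strong student are smaller and larger pretrained language models rather than these toy classifiers, but the structure of the measurement is the same.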
We believe this is an exciting opportunity for the ML research community to make progress on alignment. To kickstart more research in this area:
- We’re releasing open source code to make it easy to get started with weak-to-strong generalization experiments today.
- We’re launching a $10 million grants program for graduate students, academics, and other researchers to work on superhuman AI alignment broadly. We’re especially excited to support research related to weak-to-strong generalization.
Figuring out how to align future superhuman AI systems to be safe has never been more important, and it is now easier than ever to make empirical progress on this problem. We’re excited to see what breakthroughs researchers discover.
