Our approach to alignment research


There is currently no known indefinitely scalable solution to the alignment problem. As AI progress continues, we expect to encounter a number of new alignment problems that we don't yet observe in current systems. Some of these problems we anticipate now and some of them will be entirely new.

We believe that finding an indefinitely scalable solution is likely very difficult. Instead, we aim for a more pragmatic approach: building and aligning a system that can make faster and better alignment research progress than humans can.

As we make progress on this, our AI systems can take over more and more of our alignment work and ultimately conceive, implement, study, and develop better alignment techniques than we have now. They will work together with humans to ensure that their own successors are more aligned with humans.

We believe that evaluating alignment research is substantially easier than producing it, especially when provided with evaluation assistance. Therefore human researchers will focus more and more of their effort on reviewing alignment research done by AI systems instead of generating this research themselves. Our goal is to train models to be so aligned that we can off-load almost all of the cognitive labor required for alignment research.

Importantly, we only need "narrower" AI systems that have human-level capabilities in the relevant domains to do as well as humans on alignment research. We expect these AI systems are easier to align than general-purpose systems or systems much smarter than humans.

Language models are particularly well-suited for automating alignment research because they come "preloaded" with a lot of knowledge and information about human values from reading the internet. Out of the box, they aren't independent agents and thus don't pursue their own goals in the world. To do alignment research they don't need unrestricted access to the internet. Yet a lot of alignment research tasks can be phrased as natural language or coding tasks.
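
As a rough illustration of what "phrasing an alignment research task as a natural language task" might look like, the sketch below frames a small evaluation task, asking a model to critique a reward specification, as an ordinary text prompt. The prompt wording and the stand-in model are hypothetical examples, not a system described in this post.

```python
# A minimal sketch (not from this post) of phrasing an alignment research
# task as a natural language task. The prompt and the stand-in model are
# hypothetical illustrations.

def critique_reward_spec(spec: str, query_model) -> str:
    """Ask a language model to look for specification-gaming risks
    in a proposed reward specification, framed in plain language."""
    prompt = (
        "You are assisting with alignment research.\n"
        "Below is a proposed reward specification for an RL agent.\n"
        "List concrete ways an agent could score highly on this reward\n"
        "while violating the designer's intent.\n\n"
        f"Reward specification:\n{spec}\n\nCritique:"
    )
    return query_model(prompt)

if __name__ == "__main__":
    # Stand-in "model" for demonstration only; replace with a call to a
    # real language model.
    demo_model = lambda prompt: "(model critique would appear here)"
    spec = "Reward +1 for each unit of trash placed in the bin."
    print(critique_reward_spec(spec, demo_model))
```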

Future versions of WebGPT, InstructGPT, and Codex can provide a foundation as alignment research assistants, but they aren't sufficiently capable yet. While we don't know when our models will be capable enough to meaningfully contribute to alignment research, we think it's important to get started ahead of time. Once we train a model that could be useful, we plan to make it accessible to the external alignment research community.
