
Language models can explain neurons in language models

Although the vast majority of our explanations scored poorly, we believe we can now use ML techniques to further improve our ability to produce explanations. For example, we found we were able to improve scores by:

  • Iterating on explanations. We can increase scores by asking GPT-4 to come up with possible counterexamples, then revising explanations in light of their activations (a minimal sketch of this loop follows the list).
  • Using larger models to give explanations. The average score goes up as the explainer model’s capabilities increase. However, even GPT-4 gives worse explanations than humans, suggesting room for improvement.
  • Changing the architecture of the explained model. Training models with different activation functions improved explanation scores.

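The first bullet above describes an iterate-and-revise loop. Below is a minimal sketch of what such a loop might look like, assuming the openai Python client (>=1.0) and a hypothetical `get_activations` helper standing in for the released activation data; the prompts and loop structure are illustrative, not the released code.

```python
# Minimal sketch of the "iterate on explanations" idea described above.
# Assumptions (not from the original post): the openai>=1.0 Python client and a
# placeholder get_activations helper standing in for the released data/code.
from openai import OpenAI

client = OpenAI()


def get_activations(neuron_id: str, text: str) -> float:
    """Placeholder: would wrap the released neuron-activation data or code."""
    raise NotImplementedError


def revise_explanation(explanation: str, neuron_id: str, n_rounds: int = 3) -> str:
    """Ask GPT-4 for possible counterexamples, check the neuron's real
    activations on them, and revise the explanation accordingly."""
    for _ in range(n_rounds):
        # 1. Ask GPT-4 for texts that should strongly activate the neuron
        #    if the current explanation were correct (potential counterexamples).
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": f"Propose 5 short texts that should strongly activate "
                           f"a neuron described as: {explanation}",
            }],
        )
        candidate_texts = [t for t in resp.choices[0].message.content.splitlines() if t.strip()]

        # 2. Measure the subject model's real activations on those texts.
        activations = [get_activations(neuron_id, t) for t in candidate_texts]

        # 3. Ask GPT-4 to revise the explanation in light of the observed activations.
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": "Revise this neuron explanation so it matches the observed "
                           f"activations.\nExplanation: {explanation}\n"
                           f"Texts and activations: {list(zip(candidate_texts, activations))}",
            }],
        )
        explanation = resp.choices[0].message.content
    return explanation
```
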
We’re open-sourcing our datasets and visualization tools for GPT-4-written explanations of all 307,200 neurons in GPT-2, as well as code for explanation and scoring using publicly available models on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT-2 using explanations.
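
As a rough illustration of what scoring an explanation can look like, the sketch below compares activations a simulator would predict from an explanation against the neuron’s real activations using Pearson correlation; the data format and the correlation choice are assumptions for illustration, not a description of the released scoring code.

```python
# Rough sketch of simulate-and-compare explanation scoring: measure how well
# activations predicted from an explanation track the neuron's real activations.
# The data format and correlation metric are assumptions, not the released code.
import numpy as np


def score_explanation(real: list[float], simulated: list[float]) -> float:
    """Return the Pearson correlation between real and simulated activations;
    1.0 means the explanation perfectly predicts the activation pattern."""
    real_arr, sim_arr = np.asarray(real), np.asarray(simulated)
    if real_arr.std() == 0 or sim_arr.std() == 0:
        return 0.0  # degenerate case: constant activations carry no signal
    return float(np.corrcoef(real_arr, sim_arr)[0, 1])


# Example: simulated activations that track the real ones closely score near 1.
real_acts = [0.1, 2.3, 0.0, 5.1, 0.4]
simulated_acts = [0.0, 2.0, 0.1, 4.8, 0.5]
print(round(score_explanation(real_acts, simulated_acts), 3))
```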

We found over 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT-4 they account for most of the neuron’s top-activating behavior. Most of these well-explained neurons are not very interesting. However, we also found many interesting neurons that GPT-4 didn’t understand. We hope that as explanations improve, we will be able to rapidly uncover interesting qualitative understanding of model computations.
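
As a small usage illustration, the sketch below filters a hypothetical export of the released explanations for neurons scoring at least 0.8; the file name and field names are assumptions about the dataset layout, not taken from the original post.

```python
# Sketch of filtering released explanations for well-explained neurons (score >= 0.8).
# The JSONL filename and field names are assumed for illustration.
import json


def well_explained(path: str, threshold: float = 0.8):
    """Yield (neuron_id, explanation, score) records at or above the threshold."""
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if record["score"] >= threshold:
                yield record["neuron_id"], record["explanation"], record["score"]


for neuron_id, explanation, score in well_explained("gpt2_neuron_explanations.jsonl"):
    print(f"{neuron_id}\t{score:.2f}\t{explanation}")
```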
