Gemma Scope: helping the safety community shed light on the inner workings of language models



Announcing a comprehensive, open suite of sparse autoencoders for language model interpretability.

To create an artificial intelligence (AI) language model, researchers build a system that learns from vast amounts of data without human guidance. As a result, the inner workings of language models are often a mystery, even to the researchers who train them. Mechanistic interpretability is a research field focused on deciphering these inner workings. Researchers in this field use sparse autoencoders as a kind of ‘microscope’ that lets them see inside a language model and get a better sense of how it works.
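To make the ‘microscope’ analogy concrete, the sketch below shows the core idea of a sparse autoencoder: it encodes a model’s dense activation vector into a much wider, mostly zero feature vector, then reconstructs the original activation from it. This is a minimal, illustrative PyTorch sketch, not Gemma Scope’s exact architecture; the dimensions, the ReLU activation, and all names are assumptions for the example.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch (illustrative; not Gemma Scope's exact architecture)."""

    def __init__(self, d_model: int = 2304, d_features: int = 16384):
        super().__init__()
        # Encoder maps a dense activation into a much wider feature space.
        self.encoder = nn.Linear(d_model, d_features)
        # Decoder reconstructs the original activation from the sparse features.
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activation: torch.Tensor):
        # A ReLU (plus a sparsity penalty during training) keeps most features at zero.
        features = torch.relu(self.encoder(activation))
        reconstruction = self.decoder(features)
        return features, reconstruction

# Example: encode a single stand-in activation vector.
sae = SparseAutoencoder()
activation = torch.randn(1, 2304)            # placeholder for a real model activation
features, reconstruction = sae(activation)   # `features` is wide and mostly zero after training
```

After training, each dimension of `features` tends to correspond to a human-interpretable concept, which is what lets researchers read off what the model is representing at that point.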

Today, we’re announcing Gemma Scope, a new set of tools to help researchers understand the inner workings of Gemma 2, our lightweight family of open models. Gemma Scope is a collection of hundreds of freely available, open sparse autoencoders (SAEs) for Gemma 2 9B and Gemma 2 2B. We’re also open sourcing Mishax, a tool we built that enabled much of the interpretability work behind Gemma Scope.

We hope today’s release enables more ambitious interpretability research. Further research has the potential to help the field build more robust systems, develop better safeguards against model hallucinations, and protect against risks from autonomous AI agents, such as deception or manipulation.

Try our interactive Gemma Scope demo, courtesy of Neuronpedia.

Interpreting what happens inside a language model

When you ask a language model a question, it turns your text input into a series of ‘activations’. These activations map the relationships between the words you’ve entered, helping the model make connections between different words, which it uses to write an answer.

As the model processes text input, activations at different layers in the model’s neural network represent multiple, increasingly advanced concepts, known as ‘features’.

For example, a model’s early layers might learn to recall facts, such as that Michael Jordan plays basketball, while later layers may recognize more complex concepts, like the factuality of the text.
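To see these per-layer activations directly, the sketch below pulls the hidden states for every layer of a causal language model via the Hugging Face Transformers library and inspects one layer’s activation for the final token. The model name, the chosen layer index, and the use of `output_hidden_states=True` are assumptions for illustration; any causal language model exposed through the `transformers` API would behave the same way.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Model name is illustrative; substitute any causal language model you have access to.
model_name = "google/gemma-2-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

inputs = tokenizer("Michael Jordan plays basketball.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# hidden_states is a tuple: the embedding output plus one tensor per layer,
# each of shape (batch, sequence_length, hidden_size).
hidden_states = outputs.hidden_states
print(f"Number of activation tensors: {len(hidden_states)}")

# The activation of the final token at a middle layer is the kind of vector
# a sparse autoencoder would decompose into interpretable 'features'.
middle_layer_activation = hidden_states[12][0, -1]
print(middle_layer_activation.shape)
```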


