Gemma Scope 2: Helping the AI Safety Community Deepen Understanding of Complex Language Model Behavior

Announcing a new, open suite of tools for language model interpretability

Large Language Models (LLMs) are capable of incredible feats of reasoning, yet their internal decision-making processes remain largely opaque. Should a system not behave as expected, a lack of visibility into its internal workings can make it difficult to pinpoint the exact reason for its behaviour. Last year, we advanced the science of interpretability with Gemma Scope, a toolkit designed to help researchers understand the inner workings of Gemma 2, our lightweight collection of open models.

Today, we’re releasing Gemma Scope 2: a comprehensive, open suite of interpretability tools for all Gemma 3 model sizes, from 270M to 27B parameters. These tools can enable us to trace potential risks across the entire “brain” of the model.

To our knowledge, this is the largest open-source release of interpretability tools by an AI lab to date. Producing Gemma Scope 2 involved storing approximately 110 petabytes of data, as well as training over 1 trillion total parameters.

As AI continues to advance, we look forward to the AI research community using Gemma Scope 2 to debug emergent model behaviors, to better audit and debug AI agents, and ultimately to accelerate the development of practical and robust safety interventions against issues like jailbreaks, hallucinations and sycophancy.

Our interactive Gemma Scope 2 demo is available to try, courtesy of Neuronpedia.

What’s new in Gemma Scope 2

Interpretability research aims to understand the internal workings and learned algorithms of AI models. As AI becomes increasingly capable and complex, interpretability is crucial for building AI that is safe and reliable.

Like its predecessor, Gemma Scope 2 acts as a microscope for the Gemma family of language models. By combining sparse autoencoders (SAEs) and transcoders, it allows researchers to look inside models, see what they’re thinking about, and how these thoughts are formed and connect to the model’s behaviour. In turn, this enables richer study of jailbreaks and other AI behaviours relevant to safety, such as discrepancies between a model’s communicated reasoning and its internal state.
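To make the idea of a sparse autoencoder concrete, here is a minimal sketch of a generic SAE forward pass in NumPy. It is an illustration only and not the Gemma Scope 2 architecture or API: the function name `sae_forward`, the plain ReLU activation, the layer sizes, and the random weights are all assumptions chosen for brevity. A transcoder has the same shape but is trained to predict a layer's output from its input, rather than to reconstruct the same activation.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Generic SAE sketch: non-negative latents via ReLU, then a linear decoder."""
    latents = np.maximum(0.0, x @ W_enc + b_enc)  # sparse feature activations
    recon = latents @ W_dec + b_dec               # reconstruction of the input
    return latents, recon

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # hypothetical sizes; real dictionaries are far wider

# Random, untrained weights (assumption: real SAEs are trained on model activations).
W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
b_enc = -0.2 * np.ones(d_sae)  # a negative bias pushes more pre-activations below zero
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
b_dec = np.zeros(d_model)

x = rng.normal(size=(4, d_model))  # stand-in for a batch of model activations
latents, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
print(latents.shape, recon.shape)
```

In a trained SAE, each latent dimension ideally corresponds to a human-interpretable concept, and only a few latents are active for any given input.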

While the original Gemma Scope enabled research in key areas of safety, such as model hallucination, identifying secrets known by a model, and training safer models, Gemma Scope 2 supports even more ambitious research through significant upgrades:

  • Full coverage at scale: We provide a full suite of tools for the entire Gemma 3 family (up to 27B parameters), essential for studying emergent behaviors that only appear at scale, such as those previously uncovered by the 27B-size C2S Scale model that helped discover a new potential cancer therapy pathway. Although Gemma Scope 2 is not trained on this model, it is an example of the kind of emergent behavior that these tools might be able to help understand.
  • More refined tools to decipher complex internal behaviors: Gemma Scope 2 includes SAEs and transcoders trained on every layer of our Gemma 3 family of models. Skip-transcoders and cross-layer transcoders make it easier to decipher multi-step computations and algorithms spread throughout the model.
  • Advanced training techniques: We use state-of-the-art techniques, notably the Matryoshka training technique, which helps SAEs detect more useful concepts and resolves certain flaws discovered in Gemma Scope.
  • Chatbot behavior analysis tools: We also provide interpretability tools targeted at the versions of Gemma 3 tuned for chat use cases. These tools enable analysis of complex, multi-step behaviors, such as jailbreaks, refusal mechanisms, and chain-of-thought faithfulness.
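As a rough illustration of the Matryoshka idea mentioned in the list above, the sketch below computes a nested reconstruction loss: each prefix of the latent dictionary must reconstruct the input on its own, and the per-prefix errors are summed. This is an assumption-laden toy, not the training code used for Gemma Scope 2; the function name `matryoshka_loss`, the prefix sizes, and the random data are hypothetical.

```python
import numpy as np

def matryoshka_loss(x, latents, W_dec, b_dec, prefix_sizes):
    """Sum of reconstruction MSEs, one per nested prefix of the latent dictionary."""
    total = 0.0
    for k in prefix_sizes:
        # Reconstruct using only the first k latents and their decoder rows.
        recon_k = latents[:, :k] @ W_dec[:k] + b_dec
        total += np.mean((x - recon_k) ** 2)
    return total

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # hypothetical sizes
x = rng.normal(size=(4, d_model))                          # stand-in activations
latents = np.maximum(0.0, rng.normal(size=(4, d_sae)))     # stand-in SAE latents
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
b_dec = np.zeros(d_model)

full_only = matryoshka_loss(x, latents, W_dec, b_dec, [d_sae])
nested = matryoshka_loss(x, latents, W_dec, b_dec, [4, 16, d_sae])
print(full_only, nested)
```

Because early latents are penalized whenever a small prefix reconstructs poorly, this kind of objective encourages the first latents to capture broad concepts and later ones to add detail.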


