Google DeepMind has a new way to look inside an AI’s “mind”

Neuronpedia, a platform for mechanistic interpretability, partnered with DeepMind in July to build a demo of Gemma Scope that you can play around with right now. In the demo, you can test out different prompts and see how the model breaks your prompt up into tokens and which features those tokens activate. You can also tinker with the model. For example, if you turn the feature about dogs way up and then ask the model a question about US presidents, Gemma will find some way to weave in random babble about dogs, or the model may simply start barking at you.

One interesting thing about sparse autoencoders is that they are unsupervised, meaning they find features on their own. That leads to surprising discoveries about how the models break down human concepts. “My personal favorite feature is the cringe feature,” says Joseph Bloom, science lead at Neuronpedia. “It seems to appear in negative criticism of text and movies. It’s just a great example of tracking things that are so human on some level.”
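
To make the idea concrete, here is a minimal sketch of a sparse autoencoder in PyTorch. This is not the Gemma Scope implementation; the layer sizes, penalty weight, and random stand-in activations are placeholder choices. The point is the unsupervised recipe: reconstruct model activations through a wide layer of features, with a sparsity penalty that keeps most features switched off for any given input.

```python
# Minimal sparse autoencoder sketch (placeholder sizes, toy data).
# Reconstruction loss keeps it faithful to the activations; the L1 term
# pushes most feature strengths to zero, so each input lights up only a few.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 256, n_features: int = 2048):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)   # activation -> feature strengths
        self.decoder = nn.Linear(n_features, d_model)   # feature strengths -> activation

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))        # non-negative, mostly zero
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = SparseAutoencoder()
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

for step in range(1000):
    acts = torch.randn(64, 256)          # stand-in for real model activations
    features, recon = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

No one tells the autoencoder which concepts to look for; whatever directions help it reconstruct activations sparsely become the “features,” which is how something like a cringe feature can emerge unprompted.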

You can search for concepts on Neuronpedia, and it will highlight which features are activated on specific tokens, or words, and how strongly each one fires. “If you read the text and you see what’s highlighted in green, that’s when the model thinks the cringe concept is most relevant. The most active example for cringe is somebody preaching at someone else,” says Bloom.
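
The highlighting Bloom describes boils down to a per-token readout: run each token’s activation through the SAE encoder and rank the tokens by how strongly a chosen feature fires on them. Here is a toy sketch with an untrained stand-in encoder, random activations, and an invented feature index rather than real Gemma data.

```python
# Toy per-token readout: for one (made-up) feature index, rank the tokens
# in a prompt by how strongly that feature fires on each of them.
import torch
import torch.nn as nn

encoder = nn.Linear(256, 2048)               # stands in for a trained SAE encoder
tokens = ["He", "keeps", "preaching", "at", "his", "coworkers"]
token_acts = torch.randn(len(tokens), 256)   # one activation vector per token (toy data)

features = torch.relu(encoder(token_acts))   # per-token feature strengths
feature_id = 1337                            # hypothetical "cringe"-like feature index

ranked = sorted(zip(tokens, features[:, feature_id].tolist()), key=lambda pair: -pair[1])
for tok, strength in ranked:
    print(f"{tok:>12}  {strength:.3f}")
```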

Some features are proving easier to track than others. “One of the most important features that you would want to find for a model is deception,” says Johnny Lin, founder of Neuronpedia. “It’s not super easy to find: ‘Oh, there’s the feature that fires when it’s lying to us.’ From what I’ve seen, it hasn’t been the case that we can find deception and ban it.”

DeepMind’s research is similar to what another AI company, Anthropic, did back in May with Golden Gate Claude. It used sparse autoencoders to find the parts of Claude, its model, that lit up when discussing the Golden Gate Bridge in San Francisco. It then amplified the activations related to the bridge to the point where Claude identified not as Claude, an AI model, but as the physical Golden Gate Bridge itself, and would respond to prompts as the bridge.
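
A rough sketch of that amplification trick: encode an activation into SAE features, scale one feature way up, and decode back into an activation the model keeps computing with. The SAE here is untrained and the feature index is made up; this only shows the mechanics, not Anthropic’s actual setup.

```python
# Feature amplification sketch: boost one feature's strength before
# reconstructing the activation that gets handed back to the model.
import torch
import torch.nn as nn

d_model, n_features = 256, 2048
encoder = nn.Linear(d_model, n_features)   # stand-in for a trained SAE encoder
decoder = nn.Linear(n_features, d_model)   # stand-in for a trained SAE decoder

@torch.no_grad()
def amplify_feature(activation: torch.Tensor, feature_id: int, scale: float = 10.0) -> torch.Tensor:
    features = torch.relu(encoder(activation))
    features[..., feature_id] *= scale     # turn the chosen feature way up
    return decoder(features)               # steered activation fed back to the model

steered = amplify_feature(torch.randn(1, d_model), feature_id=42)  # 42 is a placeholder index
```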

Although it may seem quirky, mechanistic interpretability research could prove incredibly useful. “As a tool for understanding how the model generalizes and what level of abstraction it’s working at, these features are really helpful,” says Batson.

For example, a team led by Samuel Marks, now at Anthropic, used sparse autoencoders to find features showing that a particular model was associating certain professions with a specific gender. They then turned off these gender features to reduce bias in the model. The experiment was done on a very small model, so it’s unclear whether the approach will carry over to a much larger one.
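
The underlying operation is roughly the mirror image of the amplification sketch above: encode, suppress the chosen features, decode. Here is a hedged sketch with invented feature indices standing in for the ones the team actually identified.

```python
# Feature suppression sketch: dampen or zero a few feature directions
# before reconstructing the activation. Indices are invented for illustration.
import torch
import torch.nn as nn

d_model, n_features = 256, 2048
encoder = nn.Linear(d_model, n_features)   # stand-in for a trained SAE encoder
decoder = nn.Linear(n_features, d_model)   # stand-in for a trained SAE decoder

@torch.no_grad()
def suppress_features(activation: torch.Tensor, feature_ids: list[int], scale: float = 0.0) -> torch.Tensor:
    features = torch.relu(encoder(activation))
    features[..., feature_ids] *= scale    # 0.0 switches the features off; 0 < scale < 1 only tones them down
    return decoder(features)

gender_feature_ids = [17, 903, 1544]       # hypothetical feature indices
debiased = suppress_features(torch.randn(1, d_model), gender_feature_ids)
```

Setting the scale somewhere between 0 and 1 instead of zeroing it out is the softer version of the same knob, the kind of “tuning down” of specific features described in the example below.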

Mechanistic interpretability research can also give us insight into why AI makes errors. In the case of the claim that 9.11 is bigger than 9.8, researchers from Transluce saw that the question was triggering the parts of an AI model related to Bible verses and September 11. The researchers concluded that the AI could be interpreting the numbers as dates, asserting the later date, 9/11, as greater than 9/8. In many books, such as religious texts, section 9.11 also comes after section 9.8, which may be why the AI treats it as bigger. Once they knew why the AI made this error, the researchers tuned down the AI’s activations on Bible verses and September 11, which led the model to give the correct answer when asked again whether 9.11 is bigger than 9.8.
