AI agents help explain other AI systems


Explaining the behavior of trained neural networks remains a compelling puzzle, especially as these models grow in size and sophistication. Like other scientific challenges throughout history, reverse-engineering how artificial intelligence systems work requires a substantial amount of experimentation: making hypotheses, intervening on behavior, and even dissecting large networks to examine individual neurons. To date, most successful experiments have involved large amounts of human oversight. Explaining every computation inside models the size of GPT-4 and larger will almost certainly require more automation, perhaps even using AI models themselves.

Facilitating this timely endeavor, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a novel approach that uses AI models to conduct experiments on other systems and explain their behavior. Their method uses agents built from pretrained language models to produce intuitive explanations of computations inside trained networks.

Central to this strategy is the “automated interpretability agent” (AIA), designed to mimic a scientist’s experimental processes. Interpretability agents plan and perform tests on other computational systems, which can range in scale from individual neurons to entire models, in order to produce explanations of those systems in a variety of forms: language descriptions of what a system does and where it fails, and code that reproduces the system’s behavior. Unlike existing interpretability procedures that passively classify or summarize examples, the AIA actively participates in hypothesis formation, experimental testing, and iterative learning, thereby refining its understanding of other systems in real time.

Complementing the AIA method is the new “function interpretation and description” (FIND) benchmark, a test bed of functions resembling computations inside trained networks, and accompanying descriptions of their behavior. One key challenge in evaluating the quality of descriptions of real-world network components is that descriptions are only as good as their explanatory power: Researchers don’t have access to ground-truth labels of units or descriptions of learned computations. FIND addresses this long-standing issue in the field by providing a reliable standard for evaluating interpretability procedures: explanations of functions (e.g., produced by an AIA) can be evaluated against function descriptions in the benchmark.

For instance, FIND contains synthetic neurons designed to mimic the behavior of real neurons inside language models, some of which are selective for individual concepts such as “ground transportation.” AIAs are given black-box access to synthetic neurons and design inputs (such as “tree,” “happiness,” and “car”) to test a neuron’s response. After noticing that a synthetic neuron produces higher response values for “car” than for other inputs, an AIA might design more fine-grained tests to distinguish the neuron’s selectivity for cars from other forms of transportation, such as planes and boats. When the AIA produces a description such as “this neuron is selective for road transportation, and not air or sea travel,” this description is evaluated against the ground-truth description of the synthetic neuron (“selective for ground transportation”) in FIND. The benchmark can then be used to compare the capabilities of AIAs to other methods in the literature.
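The probing process described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy `synthetic_neuron`, its word list, and the two-round procedure are hypothetical stand-ins, not the FIND benchmark's code or the researchers' agent, which is driven by a language model rather than fixed rules.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a FIND synthetic neuron; not the benchmark's code.
_GROUND_WORDS = {"car", "bus", "train", "bicycle"}

def synthetic_neuron(word: str) -> float:
    """Toy black box: responds strongly to ground-transportation words only."""
    return 1.0 if word.lower() in _GROUND_WORDS else 0.0

def probe(neuron: Callable[[str], float], words: List[str]) -> Dict[str, float]:
    """Run one experiment: query the black box on each input word."""
    return {word: neuron(word) for word in words}

def mean(values: List[float]) -> float:
    return sum(values) / len(values)

# Round 1: broad, unrelated inputs to see what the neuron responds to at all.
round1 = probe(synthetic_neuron, ["tree", "happiness", "car"])
top_word = max(round1, key=round1.get)

# Round 2: the agent's own grouping of follow-up inputs (its hypothesis space).
follow_up = {"road": ["bus", "train"], "air or sea": ["plane", "boat"]}
group_means = {
    group: mean(list(probe(synthetic_neuron, words).values()))
    for group, words in follow_up.items()
}

# Turn the measurements into a natural-language description.
if group_means["road"] > group_means["air or sea"]:
    description = "selective for road transportation, not air or sea travel"
else:
    description = "selective for transportation in general"

print(f"strongest round-1 input: {top_word}")
print(f"description: {description}")
```

The agent only ever calls the neuron on inputs it chooses, which is what black-box access means here. The second round exists because the first round alone cannot separate "cars" from "transportation" as an explanation.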

Sarah Schwettmann PhD ’21, co-lead author of a paper on the new work and a research scientist at CSAIL, emphasizes the advantages of this approach. “The AIAs’ capacity for autonomous hypothesis generation and testing may be able to surface behaviors that would otherwise be difficult for scientists to detect. It’s remarkable that language models, when equipped with tools for probing other systems, are capable of this type of experimental design,” says Schwettmann. “Clean, simple benchmarks with ground-truth answers have been a major driver of more general capabilities in language models, and we hope that FIND can play a similar role in interpretability research.”

Automating interpretability 

Large language models are still holding their status as the in-demand celebrities of the tech world. Recent advances in LLMs have highlighted their ability to perform complex reasoning tasks across diverse domains. The team at CSAIL recognized that, given these capabilities, language models may be able to serve as backbones of generalized agents for automated interpretability. “Interpretability has historically been a very multifaceted field,” says Schwettmann. “There is no one-size-fits-all approach; most procedures are very specific to individual questions we might have about a system, and to individual modalities like vision or language. Existing approaches to labeling individual neurons inside vision models have required training specialized models on human data, where these models perform only this single task. Interpretability agents built from language models could provide a general interface for explaining other systems: synthesizing results across experiments, integrating over different modalities, even discovering new experimental techniques at a very fundamental level.”

As we enter a regime where the models doing the explaining are black boxes themselves, external evaluations of interpretability methods are becoming increasingly vital. The team’s new benchmark addresses this need with a suite of functions with known structure that are modeled after behaviors observed in the wild. The functions inside FIND span a diversity of domains, from mathematical reasoning to symbolic operations on strings to synthetic neurons built from word-level tasks. The dataset of interactive functions is procedurally constructed; real-world complexity is introduced to simple functions by adding noise, composing functions, and simulating biases. This allows for comparison of interpretability methods in a setting that translates to real-world performance.
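As a rough illustration of procedural construction, the sketch below builds a more complex function from simple ones. The helper names (`compose`, `add_noise`, `override_interval`) and the specific functions are hypothetical, not taken from FIND, and overriding behavior on a subdomain is used here only as a loose stand-in for the article's "simulating biases."

```python
import random
from typing import Callable

Fn = Callable[[float], float]

def compose(f: Fn, g: Fn) -> Fn:
    """Return the function x -> f(g(x))."""
    return lambda x: f(g(x))

def add_noise(f: Fn, sigma: float, seed: int = 0) -> Fn:
    """Return f with Gaussian noise of standard deviation sigma added."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    return lambda x: f(x) + rng.gauss(0.0, sigma)

def override_interval(f: Fn, lo: float, hi: float, replacement: Fn) -> Fn:
    """Return f with its behavior replaced on the interval [lo, hi]."""
    return lambda x: replacement(x) if lo <= x <= hi else f(x)

def double(x: float) -> float:
    return 2.0 * x

def shift(x: float) -> float:
    return x + 3.0

def constant_minus_one(x: float) -> float:
    return -1.0

clean = compose(double, shift)                       # 2 * (x + 3)
irregular = override_interval(clean, 0.0, 1.0, constant_minus_one)
noisy = add_noise(irregular, sigma=0.1)

print(clean(1.0), irregular(0.5), irregular(2.0))
```

Because each wrapper is known, the ground-truth description of the final function ("doubles x + 3, except on [0, 1] where it returns -1, plus small noise") is available by construction, which is the property the benchmark relies on.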

Along with the dataset of functions, the researchers introduced an innovative evaluation protocol to assess the effectiveness of AIAs and existing automated interpretability methods. This protocol involves two approaches. For tasks that require replicating the function in code, the evaluation directly compares the AI-generated estimations with the original, ground-truth functions. The evaluation becomes more intricate for tasks involving natural language descriptions of functions. In these cases, accurately gauging the quality of the descriptions requires an automated understanding of their semantic content. To tackle this challenge, the researchers developed a specialized “third-party” language model. This model is specifically trained to evaluate the accuracy and coherence of the natural language descriptions provided by the AI systems, and compares them to the ground-truth function behavior.
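The code-replication half of the protocol can be pictured as running the ground-truth function and the estimate on the same inputs and measuring how often they agree. The sketch below assumes a simple agreement metric with a numerical tolerance; the metric, the input distribution, and all function names are illustrative assumptions, not the scoring used in the paper.

```python
import random
from typing import Callable, List

def agreement_score(
    truth: Callable[[float], float],
    candidate: Callable[[float], float],
    inputs: List[float],
    tol: float = 1e-6,
) -> float:
    """Fraction of inputs on which the candidate matches the ground truth."""
    hits = sum(1 for x in inputs if abs(truth(x) - candidate(x)) <= tol)
    return hits / len(inputs)

def ground_truth(x: float) -> float:
    return 2.0 * x + 6.0

def good_guess(x: float) -> float:
    # Algebraically identical to the ground truth, written differently.
    return 2.0 * (x + 3.0)

def partial_guess(x: float) -> float:
    # Right for non-negative inputs, wrong elsewhere.
    return 2.0 * x + 6.0 if x >= 0 else 0.0

rng = random.Random(0)  # seeded so the sample is reproducible
test_inputs = [rng.uniform(-10.0, 10.0) for _ in range(200)]

good = agreement_score(ground_truth, good_guess, test_inputs)
partial = agreement_score(ground_truth, partial_guess, test_inputs)
print(f"good guess: {good:.2f}, partial guess: {partial:.2f}")
```

A score like this only reflects the inputs that were sampled, so a candidate that fails on a narrow subdomain can still score well if that region is rarely tested.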

Evaluation on FIND reveals that we are still far from fully automating interpretability; although AIAs outperform existing interpretability approaches, they still fail to accurately describe almost half of the functions in the benchmark. Tamar Rott Shaham, co-lead author of the study and a postdoc in CSAIL, notes that “while this generation of AIAs is effective in describing high-level functionality, they still often overlook finer-grained details, particularly in function subdomains with noise or irregular behavior. This likely stems from insufficient sampling in these areas. One issue is that the AIAs’ effectiveness may be hampered by their initial exploratory data. To counter this, we tried guiding the AIAs’ exploration by initializing their search with specific, relevant inputs, which significantly enhanced interpretation accuracy.” This approach combines new AIA methods with previous techniques that use pre-computed examples to initiate the interpretation process.

The researchers are also developing a toolkit to augment the AIAs’ ability to conduct more precise experiments on neural networks, in both black-box and white-box settings. This toolkit aims to equip AIAs with better tools for selecting inputs and refining hypothesis-testing capabilities for more nuanced and accurate neural network analysis. The team is also tackling practical challenges in AI interpretability, focusing on determining the right questions to ask when analyzing models in real-world scenarios. Their goal is to develop automated interpretability procedures that could eventually help people audit systems (e.g., for autonomous driving or face recognition) to diagnose potential failure modes, hidden biases, or surprising behaviors before deployment.

Watching the watchers

The team envisions one day developing nearly autonomous AIAs that can audit other systems, with human scientists providing oversight and guidance. Advanced AIAs could develop new kinds of experiments and questions, potentially beyond human scientists’ initial considerations. The focus is on expanding AI interpretability to include more complex behaviors, such as entire neural circuits or subnetworks, and predicting inputs that might lead to undesired behaviors. This development represents a significant step forward in AI research, aiming to make AI systems more understandable and reliable.

“A good benchmark is a power tool for tackling difficult challenges,” says Martin Wattenberg, computer science professor at Harvard University who was not involved in the study. “It’s wonderful to see this sophisticated benchmark for interpretability, one of the most important challenges in machine learning today. I’m particularly impressed with the automated interpretability agent the authors created. It’s a kind of interpretability jiu-jitsu, turning AI back on itself in order to help human understanding.”

Schwettmann, Rott Shaham, and their colleagues presented their work at NeurIPS 2023 in December. Additional MIT coauthors, all affiliates of CSAIL and the Department of Electrical Engineering and Computer Science (EECS), include graduate student Joanna Materzynska, undergraduate student Neil Chowdhury, Shuang Li PhD ’23, Assistant Professor Jacob Andreas, and Professor Antonio Torralba. Northeastern University Assistant Professor David Bau is an additional coauthor.

The work was supported, in part, by the MIT-IBM Watson AI Lab, Open Philanthropy, an Amazon Research Award, Hyundai NGV, the U.S. Army Research Laboratory, the U.S. National Science Foundation, the Zuckerman STEM Leadership Program, and a Viterbi Fellowship.

