Superposition: What Makes It Difficult to Explain Neural Networks


When there are more features than model dimensions

It would be ideal if neural networks followed a one-to-one relationship: each neuron activates on one and only one feature. In such a world, interpreting the model would be straightforward: this neuron fires for the dog-ear feature, and that neuron fires for the wheel of a car. Unfortunately, that is not the case. In reality, a model of dimension d often has to represent m features, where d < m. This is when we observe the phenomenon of superposition.

In the context of machine learning, superposition refers to the phenomenon in which one neuron in a model represents multiple overlapping features rather than a single, distinct one. For instance, InceptionV1 contains one neuron that responds to cat faces, fronts of cars, and cat legs [1]. This leads to what we can call the superposition of different features' activations within the same neuron or circuit.

The existence of superposition makes model explainability difficult, especially in deep learning models, where neurons in hidden layers represent complex combinations of patterns rather than being associated with simple, direct features.

In this blog post, we present a simple toy example of superposition, with detailed implementations in Python in this notebook.

We start this section by discussing the term “feature”.

In tabular data, there is little ambiguity in defining what a feature is. For example, when predicting the quality of wine from a tabular dataset, features could be the percentage of alcohol, the year of production, and so on.

However, defining features becomes more complex when dealing with non-tabular data, such as images or text. In these cases, there is no universally agreed-upon definition of a feature. Broadly, a feature can be considered any property of the input that is recognizable to most humans. For instance, one feature in a large language model (LLM) might be whether a word is in French.

Superposition occurs when the number of features is greater than the model dimensions. We claim that two necessary conditions must be met for superposition to occur:

  1. Non-linearity: Neural networks typically include non-linear activation functions, such as sigmoid or ReLU, at the end of each hidden layer. These activation functions allow the network to map inputs to outputs in a non-linear way, so that it can capture more complex relationships between features. Without non-linearity, the model would behave as a simple linear transformation, where features remain linearly separable, with no possibility of compressing dimensions through superposition.
  2. Feature Sparsity: Feature sparsity means that only a small subset of features is non-zero at any given time. For example, in language models, many features are not present at the same time: e.g., the same word cannot be both is_French and is_other_languages. If all features were dense, we would expect significant interference from overlapping representations, making it very difficult for the model to decode individual features.

Synthetic Dataset

Let us consider a toy example of 40 features with linearly decreasing feature importance: the first feature has an importance of 1, the last feature has an importance of 0.1, and the importance of the remaining features is evenly spaced between these two values.
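For concreteness, here is a minimal sketch of how such an importance vector could be defined (the variable names are illustrative, not taken from the original notebook):

import numpy as np

n = 40  # number of features
# Importance decreases linearly from 1.0 (first feature) to 0.1 (last feature)
feature_importance = np.linspace(1.0, 0.1, n)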

We then generate a synthetic dataset with the following code:

import numpy as np

def generate_synthetic_dataset(dim_sample, num_sample, sparsity):
    """Generate a synthetic dataset with the given dimensionality and sparsity."""
    dataset = []
    for _ in range(num_sample):
        x = np.random.uniform(0, 1, dim_sample)  # random feature values in [0, 1]
        # Each feature is zeroed out with probability `sparsity`
        mask = np.random.choice([0, 1], size=dim_sample, p=[sparsity, 1 - sparsity])
        x = x * mask  # apply sparsity
        dataset.append(x)
    return np.array(dataset)

This function creates a synthetic dataset with the given number of dimensions, which is 40 in our case. For each dimension, a random value is drawn from a uniform distribution on [0, 1]. The sparsity parameter, varying between 0 and 1, controls the percentage of active features in each sample. For example, when the sparsity is 0.8, each feature in a sample has an 80% probability of being zero. The function applies a mask to realize the sparsity setting.
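For example, a dataset of 1,000 samples with 40 features and 90% sparsity could be generated as follows (the sample count here is an illustrative choice, not one specified above):

dataset = generate_synthetic_dataset(40, 1000, 0.9)
print(dataset.shape)           # (1000, 40)
print((dataset == 0).mean())   # roughly 0.9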

Linear and ReLU Models

We would now like to explore how ReLU-based neural models lead to the formation of superposition, and how different sparsity values change their behavior.

We set up our experiment in the following way: we compress the 40-dimensional features into a 5-dimensional space, then reconstruct the vectors by reversing the process. By observing the behavior of these transformations, we expect to see how superposition forms in each case.

To achieve this, we consider two very similar models:

  1. Linear Model: A simple linear model with only 5 hidden dimensions. Recall that we want to represent 40 features, far more than the model's dimensions.
  2. ReLU Model: A model almost identical to the linear one, but with an additional ReLU activation function at the end, introducing one level of non-linearity.

Both models are built using PyTorch. For example, we construct the ReLU model with the following code:

import numpy as np
import torch
import torch.nn as nn

class ReLUModel(nn.Module):
    def __init__(self, n, m):
        super().__init__()
        # W maps the n features into an m-dimensional space (m < n)
        self.W = nn.Parameter(torch.randn(m, n) * np.sqrt(1 / n))
        self.b = nn.Parameter(torch.zeros(n))

    def forward(self, x):
        # Compress with ReLU: x (batch, n) @ W.T (n, m) -> h (batch, m)
        h = torch.relu(torch.matmul(x, self.W.T))
        # Reconstruct with ReLU: h (batch, m) @ W (m, n) + b -> (batch, n)
        x_reconstructed = torch.relu(torch.matmul(h, self.W) + self.b)
        return x_reconstructed

In this code, the n-dimensional input vector x is projected into a lower-dimensional space by multiplying it with the m×n weight matrix W. We then reconstruct the original vector by mapping it back to the original feature space through a ReLU transformation, adjusted by a bias vector. The Linear Model has the same structure, with the only difference being that the reconstruction is done using only the linear transformation instead of ReLU. We train the models by minimizing the mean squared error between the original feature samples and the reconstructed ones, weighted by the feature importance.
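For reference, here is a minimal sketch of the linear counterpart and of the importance-weighted MSE loss described above; it assumes the linear model simply drops the ReLU non-linearities, and the helper name weighted_mse is illustrative, not from the original notebook:

class LinearModel(nn.Module):
    def __init__(self, n, m):
        super().__init__()
        self.W = nn.Parameter(torch.randn(m, n) * np.sqrt(1 / n))
        self.b = nn.Parameter(torch.zeros(n))

    def forward(self, x):
        h = torch.matmul(x, self.W.T)            # compress: (batch, n) -> (batch, m)
        return torch.matmul(h, self.W) + self.b  # linear reconstruction, no ReLU

def weighted_mse(x_reconstructed, x, importance):
    # importance: tensor of shape (n,), e.g. torch.linspace(1.0, 0.1, n)
    return (importance * (x_reconstructed - x) ** 2).mean()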

We trained both models with different sparsity values: 0.1, 0.5, and 0.9, from the least sparse to the most sparse. We observed several important results.
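As an illustration, a training loop under these assumptions (the optimizer choice, learning rate, and epoch count are my own, not taken from the original experiment) could look like this:

n, m = 40, 5
importance = torch.linspace(1.0, 0.1, n)
X = torch.tensor(generate_synthetic_dataset(n, 1000, 0.9), dtype=torch.float32)

model = ReLUModel(n, m)  # or LinearModel(n, m)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(2000):
    optimizer.zero_grad()
    loss = weighted_mse(model(X), X, importance)
    loss.backward()
    optimizer.step()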

First, regardless of the sparsity level, the ReLU models "compress" features much better than the linear models: while linear models mainly capture the features with the highest importance, ReLU models can also represent less important features through the formation of superposition, where a single model dimension represents multiple features. Let us visualize this phenomenon: for the linear models, the biases are smallest for the top five features (recall that feature importance is defined as a linearly decreasing function of the feature index). In contrast, the biases for the ReLU model do not show this order and are generally reduced more.

Image by author: reconstructed bias

Another important and interesting result is that superposition is much more likely to be observed when the feature sparsity level is high. To get a sense of this phenomenon, we can visualize the matrix W^T@W, where W is the m×n weight matrix of the models. One can interpret the matrix W^T@W as a measure of how the input features are projected onto the lower-dimensional space; a short sketch for computing and plotting this matrix follows the list below.

In particular:

  1. The diagonal of W^T@W represents the "self-similarity" of each feature inside the low-dimensional transformed space.
  2. The off-diagonal elements of the matrix represent how different features correlate with one another.
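Here is a minimal sketch of such a visualization (the helper name plot_feature_interference is illustrative and assumes model is a trained ReLUModel or LinearModel):

import matplotlib.pyplot as plt

def plot_feature_interference(model):
    W = model.W.detach()          # shape (m, n)
    gram = (W.T @ W).numpy()      # shape (n, n): W^T@W
    plt.imshow(gram, cmap="coolwarm")
    plt.colorbar(label="W^T@W")
    plt.title("Feature interference in the compressed space")
    plt.show()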

We now visualize the values of W^T@W below for both the Linear and ReLU models, trained with two different sparsity levels: 0.1 and 0.9. You can see that when the sparsity is as high as 0.9, the off-diagonal elements become much larger compared to the case where sparsity is 0.1 (you actually don't see much difference between the outputs of the two models). This observation indicates that correlations between different features are more easily learned when sparsity is high.

Image by author: W^T@W matrix for sparsity 0.1
Image by author: W^T@W matrix for sparsity 0.9

In this blog post, I presented a simple experiment to introduce the formation of superposition in neural networks by comparing Linear and ReLU models that have fewer dimensions than the features they must represent. We observed that the non-linearity introduced by the ReLU activation, combined with a certain level of sparsity, can help the model form superposition.

In real-world applications, which are far more complex than my naive example, superposition is an important mechanism for representing complex relationships in neural models, especially in vision models and LLMs.

[1] Zoom In: An Introduction to Circuits. https://distill.pub/2020/circuits/zoom-in/

[2] Toy Models of Superposition. https://transformer-circuits.pub/2022/toy_model/index.html
