
How to Interpret GPT2-Small


Mechanistic interpretability on the prediction of repeated tokens

The development of large-scale language models, especially ChatGPT, has left those who have experimented with it, myself included, astonished by its remarkable linguistic prowess and its ability to perform diverse tasks. However, many researchers, including myself, while marveling at its capabilities, also find themselves perplexed. Despite knowing the model's architecture and the specific values of its weights, we still struggle to grasp why a particular sequence of inputs leads to a particular sequence of outputs.

In this blog post, I will try to demystify GPT2-small using mechanistic interpretability on a simple case: the prediction of repeated tokens.

Traditional mathematical tools for explaining machine learning models aren’t entirely suitable for language models.

Consider SHAP, a useful tool for explaining machine learning models. It is proficient at determining which feature significantly influenced the prediction of, say, wine quality. However, it is important to keep in mind that language models make predictions at the token level, while SHAP values are mostly computed at the feature level, making them potentially unfit for tokens.

Furthermore, Large Language Models (LLMs) have numerous parameters and inputs, creating a high-dimensional space. Computing SHAP values is expensive even in low-dimensional spaces, and even more so in the high-dimensional space of LLMs.

Even if we tolerate the high computational costs, the explanations provided by SHAP can be superficial. For instance, knowing that the term "potter" most affected the output prediction due to the earlier mention of "Harry" doesn't provide much insight. It leaves us uncertain about which part of the model, or which specific mechanism, is responsible for such a prediction.

Mechanistic interpretability offers a different approach. It doesn't just identify the important features or inputs for a model's predictions. Instead, it sheds light on the underlying mechanisms or reasoning processes, helping us understand how a model makes its predictions or decisions.

We will be using GPT2-small for a simple task: predicting a sequence of repeated tokens. The library we will use is TransformerLens, which is designed for mechanistic interpretability of GPT-2 style language models.
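For reference, the snippets below assume roughly the following imports and setup (the exact environment may differ slightly from my notebook):

import functools
import torch as t
import einops
from torch import Tensor
from jaxtyping import Int, Float
from tqdm import tqdm
from transformer_lens import HookedTransformer, FactoredMatrix, utils
from transformer_lens.hook_points import HookPoint

device = t.device("cuda" if t.cuda.is_available() else "cpu")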

gpt2_small: HookedTransformer = HookedTransformer.from_pretrained("gpt2-small")

We use the code above to load the GPT2-Small model and predict tokens on a sequence generated by the function below. This sequence consists of the bos_token followed by two identical random token sequences; for example, bos_token + "ABCDABCD" when seq_len is 4. For clarity, we refer to the tokens from the start up to seq_len as the first half, and the remaining tokens, excluding the bos_token, as the second half.

def generate_repeated_tokens(
    model: HookedTransformer, seq_len: int, batch: int = 1
) -> Int[Tensor, "batch full_seq_len"]:
    '''
    Generates a sequence of repeated random tokens.

    Outputs are:
        rep_tokens: [batch, 1+2*seq_len]
    '''
    # bos token for every batch
    bos_token = (t.ones(batch, 1) * model.tokenizer.bos_token_id).long()
    # random first half, repeated once to form the second half
    rep_tokens_half = t.randint(0, model.cfg.d_vocab, (batch, seq_len), dtype=t.int64)
    rep_tokens = t.cat([bos_token, rep_tokens_half, rep_tokens_half], dim=-1).to(device)
    return rep_tokens

When we let the model run on the generated tokens, we make an interesting observation: the model performs significantly better on the second half of the sequence than on the first half. This is measured by the log probabilities of the correct tokens. To be precise, the performance on the first half is -13.898, while the performance on the second half is -0.644.

Image by author: Log probs on correct tokens

We can also calculate prediction accuracy, defined as the ratio of correctly predicted tokens (those identical to the generated tokens) to the total number of tokens. The accuracy for the first half of the sequence is 0.0, which is unsurprising since we are working with random tokens that lack actual meaning. Meanwhile, the accuracy for the second half is 0.93, significantly outperforming the first half.
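Here is a minimal sketch of how the two halves can be compared. The helper names below are mine, not necessarily those of the notebook; rep_cache is reused later for the ablation experiment.

seq_len = 50
rep_tokens = generate_repeated_tokens(gpt2_small, seq_len)
rep_logits, rep_cache = gpt2_small.run_with_cache(rep_tokens, remove_batch_dim=True)

# log probability assigned to the correct next token at every position
log_probs = rep_logits.log_softmax(dim=-1)
correct_log_probs = log_probs[:, :-1].gather(dim=-1, index=rep_tokens[:, 1:, None])[..., 0]
print("first half log prob :", correct_log_probs[:, :seq_len].mean().item())
print("second half log prob:", correct_log_probs[:, seq_len:].mean().item())

# accuracy: fraction of positions whose argmax prediction equals the next token
preds = rep_logits[:, :-1].argmax(dim=-1)
print("first half accuracy :", (preds[:, :seq_len] == rep_tokens[:, 1:seq_len + 1]).float().mean().item())
print("second half accuracy:", (preds[:, seq_len:] == rep_tokens[:, seq_len + 1:]).float().mean().item())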

Finding the induction head

The observation above can be explained by the existence of an induction circuit. This is a circuit that scans the sequence for prior instances of the current token, identifies the token that followed it previously, and predicts that the same sequence will repeat. For instance, if it encounters an 'A', it scans for a previous 'A' or a token very similar to 'A' in the embedding space, identifies the token 'B' that followed it, and then predicts that the next token after the current 'A' will be 'B' or a token very similar to 'B' in the embedding space.

Image by author: Induction circuit

This prediction process can be broken down into two steps:

  1. Identify the previous occurrence of the same (or a similar) token. Every token in the second half of the sequence should "pay attention" to the token seq_len places before it. For example, the 'A' at position 4 should attend to the 'A' at position 1 if seq_len is 3. We can call the attention head performing this task the "induction head."
  2. Identify the following token 'B'. This is the process of copying information from the previous token (e.g., 'A') into the next token (e.g., 'B'). This information will be used to "reproduce" 'B' when 'A' appears again. We can call the attention head performing this task the "previous token head."

These two heads constitute a complete induction circuit. Note that sometimes the term "induction head" is also used to describe the entire "induction circuit." For a more thorough introduction to induction circuits, I highly recommend the article In-context Learning and Induction Heads, which is a masterpiece!

Now, let’s discover the eye head and former head in GPT2-small.

The following code is used to find the induction head. First, we run the model on 30 batches. Then, we calculate the mean value of the diagonal with an offset of 1 - seq_len in the attention pattern matrix. This lets us measure how much attention the current token pays to the token that appeared seq_len - 1 positions earlier, i.e., the token right after the previous occurrence of the current token.

def induction_score_hook(
    pattern: Float[Tensor, "batch head_index dest_pos source_pos"],
    hook: HookPoint,
):
    '''
    Calculates the induction score and stores it in the [layer, head] position
    of the `induction_score_store` tensor.
    '''
    # attention stripe from each destination token to the token seq_len - 1
    # positions before it (one position after the previous occurrence)
    induction_stripe = pattern.diagonal(dim1=-2, dim2=-1, offset=1-seq_len)
    induction_score = einops.reduce(induction_stripe, "batch head_index position -> head_index", "mean")
    induction_score_store[hook.layer(), :] = induction_score

seq_len = 50
batch = 30
rep_tokens_30 = generate_repeated_tokens(gpt2_small, seq_len, batch)
induction_score_store = t.zeros((gpt2_small.cfg.n_layers, gpt2_small.cfg.n_heads), device=gpt2_small.cfg.device)

# select only the attention pattern hooks and fill in the induction scores
pattern_hook_names_filter = lambda name: name.endswith("pattern")

gpt2_small.run_with_hooks(
    rep_tokens_30,
    return_type=None,
    fwd_hooks=[(
        pattern_hook_names_filter,
        induction_score_hook
    )]
)

Now, let’s examine the induction scores. We’ll notice that some heads, equivalent to the one on layer 5 and head 5, have a high induction rating of 0.91.

Image by author: Induction head scores

We can also display the attention pattern of this head. You'll notice a clear diagonal line at an offset of seq_len - 1.

Image by author: Layer 5, head 5 attention pattern
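To reproduce a plot like this, one option (a sketch using matplotlib instead of the interactive visualization in my notebook) is to cache a run on rep_tokens and pull out the pattern for layer 5, head 5:

import matplotlib.pyplot as plt

# attention pattern of layer 5, head 5 on the repeated sequence
_, pattern_cache = gpt2_small.run_with_cache(rep_tokens, remove_batch_dim=True)
pattern_5_5 = pattern_cache["pattern", 5][5]  # [dest_pos, source_pos]

plt.imshow(pattern_5_5.detach().cpu(), cmap="viridis")
plt.xlabel("source position")
plt.ylabel("destination position")
plt.title("Layer 5, head 5 attention pattern")
plt.show()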

Similarly, we can identify the previous token head; a sketch of how to score it follows below. For instance, layer 4, head 11 demonstrates a strong attention pattern toward the previous token.

Image by author: Previous token head scores
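The scoring code for the previous token head isn't shown here; a minimal sketch, mirroring induction_score_hook above but using a diagonal offset of -1 (each token's attention to the token immediately before it):

prev_token_score_store = t.zeros((gpt2_small.cfg.n_layers, gpt2_small.cfg.n_heads), device=gpt2_small.cfg.device)

def prev_token_score_hook(
    pattern: Float[Tensor, "batch head_index dest_pos source_pos"],
    hook: HookPoint,
):
    # attention from each token to the token directly before it
    prev_stripe = pattern.diagonal(dim1=-2, dim2=-1, offset=-1)
    prev_token_score_store[hook.layer(), :] = einops.reduce(
        prev_stripe, "batch head_index position -> head_index", "mean"
    )

gpt2_small.run_with_hooks(
    rep_tokens_30,
    return_type=None,
    fwd_hooks=[(pattern_hook_names_filter, prev_token_score_hook)]
)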

How do the MLP layers contribute?

Let’s consider this query: do MLP layers count? We all know that GPT2-Small incorporates each attention and MLP layers. To analyze this, I propose using an ablation technique.

Ablation, as the name implies, systematically removes certain model components and observes how performance changes as a result.

We will replace the output of the MLP layers in the second half of the sequence with the output from the first half, and observe how this affects the final loss. We will compute the difference between the loss after replacing the MLP layer outputs and the original loss on the second half of the sequence using the following code.

def patch_residual_component(
    residual_component,
    hook,
    pos,
    cache,
):
    # replace the activation at position `pos` with the cached activation
    # from the corresponding position in the first half of the sequence
    residual_component[0, pos, :] = cache[hook.name][pos - seq_len, :]
    return residual_component

ablation_scores = t.zeros((gpt2_small.cfg.n_layers, seq_len), device=gpt2_small.cfg.device)

gpt2_small.reset_hooks()
logits = gpt2_small(rep_tokens, return_type="logits")
loss_no_ablation = cross_entropy_loss(logits[:, seq_len:max_len], rep_tokens[:, seq_len:max_len])

for layer in tqdm(range(gpt2_small.cfg.n_layers)):
    for position in range(seq_len, max_len):
        # patch the MLP output of this layer at this position, then re-run the model
        hook_fn = functools.partial(patch_residual_component, pos=position, cache=rep_cache)
        ablated_logits = gpt2_small.run_with_hooks(rep_tokens, fwd_hooks=[
            (utils.get_act_name("mlp_out", layer), hook_fn)
        ])
        loss = cross_entropy_loss(ablated_logits[:, seq_len:max_len], rep_tokens[:, seq_len:max_len])
        ablation_scores[layer, position - seq_len] = loss - loss_no_ablation
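The snippet above also relies on a few names defined earlier in the notebook: rep_tokens and rep_cache (from the cached run shown earlier), max_len, and a cross_entropy_loss helper. A minimal sketch of what the latter two might look like:

max_len = rep_tokens.shape[-1]  # 1 + 2 * seq_len

def cross_entropy_loss(logits, tokens):
    # standard next-token cross entropy: logits at position i predict tokens at position i + 1
    log_probs = logits.log_softmax(dim=-1)
    pred_log_probs = log_probs[:, :-1].gather(dim=-1, index=tokens[:, 1:, None])[..., 0]
    return -pred_log_probs.mean()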

We arrive at a surprising result: apart from the first token, the ablation does not produce a significant loss difference. This suggests that the MLP layers do not contribute much in the case of repeated tokens.

Image by author: Loss difference before and after ablation of the MLP layers

Given that the MLP layers don't contribute significantly to the final prediction, we can manually construct an induction circuit using layer 5, head 5 and layer 4, head 11. Recall that these are the induction head and the previous token head. We do this with the following code:

def K_comp_full_circuit(
    model: HookedTransformer,
    prev_token_layer_index: int,
    ind_layer_index: int,
    prev_token_head_index: int,
    ind_head_index: int
) -> FactoredMatrix:
    '''
    Returns a (vocab, vocab)-size FactoredMatrix, with the first dimension being
    the query side and the second dimension being the key side
    (going through the previous token head).
    '''
    W_E = model.W_E
    W_Q = model.W_Q[ind_layer_index, ind_head_index]
    W_K = model.W_K[ind_layer_index, ind_head_index]
    W_O = model.W_O[prev_token_layer_index, prev_token_head_index]
    W_V = model.W_V[prev_token_layer_index, prev_token_head_index]

    Q = W_E @ W_Q
    K = W_E @ W_V @ W_O @ W_K
    return FactoredMatrix(Q, K.T)

Computing the top-1 accuracy of this circuit yields a value of 0.2283. This is quite good for a circuit constructed from only two heads!
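The exact implementation is in my notebook; here is a minimal sketch of one way to compute this top-1 accuracy, chunking through the vocabulary so the full (vocab, vocab) matrix is never materialized (the top_1_acc helper name is mine):

def top_1_acc(full_circuit: FactoredMatrix, chunk: int = 1000) -> float:
    '''Fraction of tokens whose highest-scoring key is the token itself.'''
    n = full_circuit.shape[0]
    hits = 0
    for idx in t.split(t.arange(n, device=full_circuit.A.device), chunk):
        rows = full_circuit.A[idx] @ full_circuit.B  # materialize a [chunk, vocab] slice
        hits += (rows.argmax(dim=-1) == idx).sum().item()
    return hits / n

ind_circuit = K_comp_full_circuit(gpt2_small, prev_token_layer_index=4, ind_layer_index=5,
                                  prev_token_head_index=11, ind_head_index=5)
print(top_1_acc(ind_circuit))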

For detailed implementation, please check my notebook.
