Mechanistic Interpretability: Peeking Inside an LLM


Intro

This article is about how to examine and manipulate an LLM's neural network. That is the subject of mechanistic interpretability research, and it can answer many exciting questions.

Remember: an LLM is a deep artificial neural network, made up of neurons and weights that determine how strongly those neurons are connected. What makes such a network arrive at its conclusions? How much of the knowledge it processes does it actually consider and analyze?

Questions like these have been investigated in an enormous number of publications, at least since deep neural networks began showing promise. To be clear, mechanistic interpretability existed before LLMs did and was already an exciting part of Explainable AI research on earlier deep neural networks. For example, identifying the salient features that lead a CNN to a given object classification or vehicle steering direction can help us understand how trustworthy and reliable the network is in safety-critical situations.

But with LLMs, the subject really took off and became much more interesting. Are the human-like cognitive abilities of LLMs real or fake? How does information travel through the neural network? Is there hidden knowledge inside an LLM?

In this post, you'll discover:

  • A refresher on LLM architecture
  • An introduction to interpretability methods
  • Use cases
  • A discussion of past research

In a follow-up article, we'll look at Python code that applies some of these techniques, visualizes the activations of the neural network, and more.

Refresher: The architecture of an LLM

For the purpose of this article, we need a basic understanding of the places in the neural network that are worth hooking into in order to extract potentially useful information. Therefore, this section is a quick reminder of the components of an LLM.

LLMs use a sequence of input tokens to predict the next token.

The inner workings of an LLM: Input tokens are embedded into a combined matrix, and transformer blocks enrich this hidden state with additional context. The residual stream can then be unembedded to determine the token predictions. (Image by author)

Tokenizer: First, sentences are segmented into tokens. The goal of the token vocabulary is to turn frequently used sub-words into single tokens. Each token has a unique ID.

However, tokens can be confusing and messy, since they provide an inaccurate representation of many things, including numbers and individual characters. Asking an LLM to calculate or to count letters is a rather unfair thing to do. (With specialized embedding schemes, performance on such tasks can improve [1].)
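
As a small illustration, here is a minimal sketch of tokenization. It assumes the Hugging Face transformers package and the GPT-2 vocabulary; any similar tokenizer would behave comparably.

```python
# Minimal tokenization sketch, assuming the Hugging Face "transformers" package
# and the GPT-2 vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["interpretability", "12345", "strawberry"]:
    tokens = tokenizer.tokenize(text)                  # sub-word strings
    ids = tokenizer.convert_tokens_to_ids(tokens)      # unique integer IDs
    print(text, "->", tokens, ids)

# Depending on the vocabulary, "12345" is split into several tokens, which is
# one reason arithmetic and letter counting are hard for LLMs.
```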

Embedding: A look-up table assigns each token ID an embedding vector of a given dimensionality. The look-up table is learned (i.e., derived during neural network training) and tends to place co-occurring tokens closer together in the embedding space. The dimensionality of the embedding vectors is an important trade-off between the capabilities of the LLM and computing effort. Because the order of the tokens would otherwise not be apparent in subsequent steps, positional encoding is added to these embeddings; rotary positional encoding, for example, uses sine and cosine functions of the token position. The embedding vectors of all input tokens form the matrix that the LLM processes: the initial hidden state. Because the LLM operates on this matrix, which moves through the layers as the residual stream (also known as the hidden state or representation space), it works in latent space.
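
To make the shapes concrete, here is a minimal PyTorch sketch of the embedding step. The dimensions and token IDs are illustrative assumptions, not taken from any particular model.

```python
# Embedding look-up sketch in PyTorch; vocab_size and d_model are illustrative values.
import torch

vocab_size, d_model = 50_257, 768                # roughly GPT-2-small-sized dimensions
embed = torch.nn.Embedding(vocab_size, d_model)  # the learned look-up table

token_ids = torch.tensor([[15496, 995, 0]])      # hypothetical IDs for a 3-token prompt
hidden = embed(token_ids)                        # shape: (batch=1, seq_len=3, d_model=768)
print(hidden.shape)

# Positional information is added on top of these vectors, e.g., learned absolute
# position embeddings, or rotary encodings applied to queries and keys inside attention.
```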

Modalities other than text: LLMs can also work with modalities other than text. In these cases, the tokenizer and embedding are adapted to accommodate the different modality, such as sound or images.

Transformer blocks: A number of transformer blocks (typically dozens) refine the residual stream, adding context and additional meaning. Each transformer layer consists of an attention component [2] and an MLP component. These components are fed the normalized hidden state, and their output is added back to the residual stream.

  • Attention: Multiple attention heads (also dozens per layer) add weighted information from source tokens to destination tokens (in the residual stream). Each attention head's "nature" is parametrized by three learned matrices WQ, WK, WV, which essentially determine what the attention head specializes in. Queries, keys and values are calculated by multiplying these matrices with the hidden states of all tokens. The attention weights are then computed, for each destination token, from the softmax of the scaled dot products of its query vector with the key vectors of the source tokens. This attention weight describes the strength of the connection between source and destination for the given specialization of the attention head. Finally, the head outputs a weighted sum of the source tokens' value vectors, and all heads' outputs are concatenated and passed through a learned output projection WO. (A minimal code sketch of these computations follows this list.)
  • MLP: A fully connected feedforward network. This linear-nonlinear-linear operation is applied independently at each position. MLP networks typically contain a large share of the parameters in an LLM.
    MLP networks store much of the model's knowledge. Later layers tend to contain more semantic and less shallow knowledge [3]. This is relevant when deciding where to probe or intervene. (With some effort, these knowledge representations can be modified in a trained LLM through weight editing [4] or residual stream intervention [5].)
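
The sketch below shows one attention head and one MLP block acting on the residual stream, following the description above. The weights are random placeholders, normalization is omitted for brevity, and the dimensions are illustrative assumptions.

```python
# Sketch of one attention head and an MLP block acting on the residual stream.
# Names (W_Q, W_K, W_V, W_O) follow the text above; all values are random placeholders.
import math
import torch

seq_len, d_model, d_head = 5, 64, 16
x = torch.randn(seq_len, d_model)                      # residual stream for one prompt

W_Q, W_K, W_V = (torch.randn(d_model, d_head) for _ in range(3))
W_O = torch.randn(d_head, d_model)

Q, K, V = x @ W_Q, x @ W_K, x @ W_V                    # queries, keys, values per token
scores = (Q @ K.T) / math.sqrt(d_head)                 # scaled dot products
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))       # causal mask: no looking ahead
attn = scores.softmax(dim=-1)                          # attention weights per destination token
head_out = (attn @ V) @ W_O                            # weighted sum of values, projected back
x = x + head_out                                       # added to the residual stream

mlp = torch.nn.Sequential(                             # linear-nonlinear-linear MLP
    torch.nn.Linear(d_model, 4 * d_model),
    torch.nn.GELU(),
    torch.nn.Linear(4 * d_model, d_model),
)
x = x + mlp(x)                                         # MLP output added to the residual stream
```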

Unembedding: The final residual stream values are normalized and linearly mapped back to the vocabulary size to produce the logits for each input token position. Typically, we only need the prediction for the token following the last input token, so we use that one. The softmax function converts the logits for the final position into a probability distribution, and one option is then chosen from this distribution (e.g., the most likely token or a sampled one) as the next predicted token.
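
A minimal sketch of this last step, with a random stand-in for the unembedding matrix and final normalization:

```python
# Unembedding sketch: final residual stream -> logits -> next-token choice.
# W_U and the LayerNorm are random stand-ins; dimensions are illustrative.
import torch

seq_len, d_model, vocab_size = 5, 64, 1000
x = torch.randn(seq_len, d_model)                 # final residual stream
ln_f = torch.nn.LayerNorm(d_model)                # final normalization
W_U = torch.randn(d_model, vocab_size)            # unembedding / output projection

logits = ln_f(x) @ W_U                            # one logit vector per input position
probs = logits[-1].softmax(dim=-1)                # we only need the last position
next_token_greedy = probs.argmax()                # the most likely token ...
next_token_sampled = torch.multinomial(probs, 1)  # ... or a sampled one
```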

If you want to learn more about how LLMs work and gain additional intuition, Stephen McAleese's explanation [6] is excellent.

Now that we have looked at the architecture, the questions to ask are: What do the intermediate states of the residual stream mean? How do they relate to the LLM's output? And how can we find out?

Introduction to interpretability methods

Let's take a look at our toolbox. Which components can help us answer our questions, and which methods can we apply to analyze them? Our options include:

  • Neurons:
    We could observe the activation of individual neurons.
  • Attention:
    We could observe the output of individual attention heads in each layer.
    We could observe the queries, keys, values and attention weights of each attention head for each position and layer.
    We could observe the concatenated outputs of all attention heads in each layer.
  • MLP:
    We could observe the MLP output in each layer.
    We could observe the neural activations in the MLP networks.
    We could observe the LayerNorm mean/variance to track scale, saturation and outliers.
  • Residual stream:
    We could observe the residual stream at each position, in each layer.
    We could unembed the residual stream at intermediate layers to observe what would happen if we stopped there; earlier layers often yield shallower predictions. (This is a useful diagnostic, but not fully reliable, since the unembedding mapping was trained for the final layer. See the sketch after this list.)
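
Here is a hedged sketch of that last idea, often called the "logit lens": unembedding the intermediate residual stream of GPT-2 to see what the model would predict if it stopped at an earlier layer. It assumes the Hugging Face transformers package; the attribute names (transformer.ln_f, lm_head) are GPT-2-specific.

```python
# "Logit lens" sketch: unembed intermediate residual streams of GPT-2.
# Assumes the Hugging Face "transformers" package; attribute names are GPT-2 specific.
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tokenizer("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states holds the residual stream after the embedding and after each block.
for layer, h in enumerate(out.hidden_states):
    resid_last = h[0, -1]                                   # residual stream at the last position
    logits = model.lm_head(model.transformer.ln_f(resid_last))
    top_token = tokenizer.decode(logits.argmax().item())    # what would be predicted here
    print(f"layer {layer:2d}: {top_token!r}")
```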

We can also derive additional information:

  • Linear probes and classifiers: We can build a classifier that sorts the recorded residual stream into one group or another, or measures some feature within it (a minimal sketch follows this list).
  • Gradient-based attributions: We can compute the gradient of a specific output with respect to some or all of the neural values. The gradient magnitude indicates how sensitive the prediction is to changes in those values.
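
A minimal linear-probe sketch, assuming you have already recorded residual-stream vectors (one per prompt, at a fixed layer and position) and a binary label per prompt. The file names are hypothetical, and scikit-learn's logistic regression stands in for the probe.

```python
# Linear-probe sketch on recorded residual-stream vectors.
# The .npy file names are hypothetical placeholders for your own recordings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.load("residual_vectors.npy")   # hypothetical file: shape (n_prompts, d_model)
y = np.load("labels.npy")             # hypothetical file: shape (n_prompts,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))

# High held-out accuracy suggests the feature is linearly represented in the
# residual stream at that layer; the probe's weight vector gives its direction.
```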

All of this can be done while a given, static LLM runs inference on a given prompt, or while we actively intervene:

  • Comparison of multiple inferences: We can swap, retrain or otherwise modify the LLM, or have it process different prompts, and record the aforementioned information for comparison.
  • Ablation: We can zero out neurons, heads, MLP blocks or vectors in the residual stream and watch how this affects behavior. For instance, this allows us to measure the contribution of a head, neuron or pathway to token prediction.
  • Steering: We can actively steer the LLM by replacing or otherwise modifying activations in the residual stream (see the sketch below).
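
A hedged activation-steering sketch for GPT-2: derive a steering vector from two contrasting prompts and add it to the residual stream around a middle layer via a forward hook. The attribute names (model.transformer.h), the chosen layer, and the scale are assumptions for illustration, not a recipe.

```python
# Activation-steering sketch for GPT-2: steering vector from two contrasting prompts,
# injected via a forward hook. Layer index and scale are arbitrary illustrative choices.
import torch
from transformers import AutoTokenizer, GPT2LMHeadModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
layer, scale = 6, 4.0

def resid_at_layer(prompt: str) -> torch.Tensor:
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    return hs[layer][0, -1]            # residual stream around the chosen layer, last position

# Steering vector = difference between the two runs' residual streams.
steer = resid_at_layer("I feel very happy.") - resid_at_layer("I feel very sad.")

def hook(module, inputs, output):
    hidden = output[0] + scale * steer  # GPT2Block returns a tuple; [0] is the hidden state
    return (hidden,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(hook)
ids = tokenizer("Today I think that", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0]))
handle.remove()                         # always remove the hook afterwards
```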

Use cases

The interpretability methods discussed above form a large arsenal that can be applied to many different use cases.

  • Model performance improvement or behavior steering through activation steering: For instance, in addition to a system prompt, a model's behavior can be steered dynamically towards a certain trait or focus, without changing the model.
  • Explainability: Methods such as steering vectors, sparse autoencoders and circuit tracing can be used to understand what the model does and why, based on its activations.
  • Safety: Detecting and discouraging undesirable features during training, or implementing run-time supervision to interrupt a model that is deviating; detecting new or dangerous capabilities.
  • Drift detection: During model development, it is crucial to know when and to what extent a newly trained model behaves differently.
  • Training improvement: Understanding how elements of the model's behavior contribute to its overall performance helps optimize model development. For example, unnecessary Chain-of-Thought steps can be discouraged during training, which results in smaller, faster, or potentially more powerful models.
  • Scientific and linguistic learnings: Using the models as objects of study to better understand AI, language acquisition and cognition.

LLM interpretability research

The field of interpretability has developed steadily over the past few years, answering exciting questions along the way. Just three years ago, it was unclear whether the insights outlined below would materialize. Here is a brief history of key findings:

  • In-context learning and pattern understanding: During LLM training, some attention heads gain the ability to collaborate as pattern identifiers ("induction heads"), greatly enhancing an LLM's in-context learning capabilities [7]. Thus, some elements of LLMs implement algorithms that enable capabilities applicable outside the space of the training data.
  • World understanding: Do LLMs memorize all of their answers, or do they understand the content in order to form an internal mental model before answering? This topic has been heavily debated, and the first convincing evidence that LLMs create an internal world model was published at the end of 2022. To demonstrate this, the researchers recovered the board state of the game Othello from the residual stream [8, 9]. Many more indications followed swiftly; space and time neurons, for example, were identified [10].
  • Memorization or generalization: Do LLMs simply regurgitate what they have seen before, or do they reason for themselves? The evidence here was somewhat unclear [11]; intuitively, smaller LLMs form smaller world models (i.e., in 2023 the evidence for generalization was less convincing than in 2025). Newer benchmarks [12, 13] aim to limit contamination with material that may be in a model's training data and focus specifically on generalization capability. LLM performance on these benchmarks is still substantial.
    LLMs develop deeper generalization abilities for some concepts during their training. To quantify this, indicators from interpretability methods have been used [14].
  • Superposition: Properly trained neural networks compress knowledge and algorithms into approximations. Because there are more features than there are dimensions to represent them, this results in so-called superposition, where polysemantic neurons may contribute to multiple features of a model [15]. See Superposition: What Makes it Difficult to Explain Neural Network (Shuyang) for an explanation of this phenomenon. Essentially, because neurons serve multiple functions, interpreting their activations can be ambiguous and difficult. This is a major reason why interpretability research focuses more on the residual stream than on the activations of individual, polysemantic neurons.
  • Representation engineering: Beyond surface facts such as board states, space and time, it is possible to identify semantically meaningful vector directions within the residual stream [16]. Once a direction is identified, it can be examined or modified. This can be used to identify or influence hidden behaviors, among other things.
  • Latent knowledge: Do LLMs possess internal knowledge that they keep to themselves? They do, and methods for discovering latent knowledge aim to extract it [17, 18]. If a model knows something that is not reflected in its prediction output, this is highly relevant to explainability and safety. Attempts have been made to audit such hidden objectives, which can be inserted into a model inadvertently or, for research purposes, on purpose [19].
  • Steering: The residual stream can be manipulated by adding an activation vector to change the model's behavior in a targeted way [20]. To determine this steering vector, one can record the residual stream during two runs (inferences) with opposite prompts and subtract one from the other. For instance, this can turn the tone of the generated output from happy to sad, or from safe to dangerous. The activation vector is usually injected into a middle layer of the neural network. Similarly, a steering vector can be used to measure how strongly a model responds in a given direction.
    Steering methods have been tried to reduce lies, hallucinations and other undesirable tendencies of LLMs. However, this does not always work reliably. Efforts have been made to develop measures of how well a model can be guided toward a given concept [21].
  • Chess: The board state of chess games, as well as the language model's estimate of the opponent's skill level, can also be recovered from the residual stream [22]. Modifying the vector representing the expected skill level has also been used to improve the model's performance in the game.
  • Refusals: It was found that refusals could be prevented or elicited using steering vectors [23]. This suggests that some safety behaviors may be linearly accessible.
  • Emotion: LLMs can derive emotional states from a given input text, and these states can be measured. The results are consistent and psychologically plausible in light of cognitive appraisal theory [24]. This is interesting because it suggests that LLMs mirror many of our human tendencies in their world models.
  • Features: As mentioned earlier, individual neurons in an LLM are not very helpful for understanding what is happening internally.
    Initially, OpenAI tried to have GPT-4 guess which features the neurons respond to, based on their activations in response to different example texts [25]. In 2023, Anthropic and others joined this major topic and applied (sparse) autoencoder networks to automate the interpretation of the residual stream [26, 27]. Their work enables the mapping of the residual stream onto monosemantic features, each describing an interpretable attribute of what is happening. However, it was later shown that not all of these features are one-dimensionally linear [28].
    The automation of feature analysis remains a topic of interest and research, with more work being done in this area [29].
    Currently, Anthropic, Google, and others are actively contributing to Neuronpedia, a mecca for researchers studying interpretability.
  • Hallucinations: LLMs often produce untrue statements, or "hallucinate." Mechanistic interventions have been used to identify the causes of hallucinations and mitigate them [30, 31].
    Features suitable for probing and influencing hallucinations have also been identified [32]. Accordingly, the model has some "self-knowledge" of when it is producing incorrect statements.
  • Circuit tracing: In LLMs, circuit analysis, i.e., the analysis of the interaction of attention heads and MLPs, allows specific behaviors to be attributed to such circuits [33, 34]. Using this method, researchers can determine not only where information sits within the residual stream but also how the given model computed it. Efforts are ongoing to do this at a larger scale.
  • Human brain comparisons and insights: Neural activity from humans has been compared to activations in OpenAI's Whisper speech-to-text model [35], and surprising similarities were found. However, this should not be overinterpreted; it may simply be a sign that LLMs have acquired effective strategies. Interpretability research is what allows such analyses to be performed in the first place.
  • Self-referential first-person view and claims of consciousness: Interestingly, suppressing features related to deception led to more claims of consciousness and deeper self-referential statements by LLMs [36]. Again, the results should not be overinterpreted, but they are interesting to consider as LLMs become more capable and challenge us more often.

This review demonstrates the power of causal interventions on internal activations. Rather than relying on correlational observations of a black-box system, the system can be dissected and analyzed.

Conclusion

Interpretability is an exciting research area that provides surprising insights into an LLM's behavior and capabilities. It can even reveal interesting parallels to human cognition. Many (mostly narrow) LLM behaviors can be explained for a given model to provide useful insights. However, the sheer number of models and the number of possible questions to ask will likely prevent us from fully deciphering any large model, let alone all of them, as the enormous time investment may simply not yield sufficient benefit. That is why the field is shifting toward automated analysis, to apply mechanistic insight systematically.

These methods are useful additions to our toolbox in both industry and research, and all users of future AI systems stand to benefit from these incremental insights. They enable improvements in reliability, explainability, and safety.

Contact

This is a complex and extensive topic, and I welcome pointers, comments and corrections. Feel free to send a message to jvm (at) taggedvision.com.

References

  • [1] McLeish, Sean, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian R. Bartoldson, Bhavya Kailkhura, et al. 2024. "Transformers Can Do Arithmetic with the Right Embeddings." Advances in Neural Information Processing Systems 37: 108012–41. doi:10.52202/079017-3430.
  • [2] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. "Attention Is All You Need." Advances in Neural Information Processing Systems 30: 5999–6009.
  • [3] Geva, Mor, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. “Transformer Feed-Forward Layers Are Key-Value Memories.” doi:10.48550/arXiv.2012.14913.
  • [4] Meng, Kevin, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. 2023. “Mass-Editing Memory in a Transformer.” doi:10.48550/arXiv.2210.07229.
  • [5] Hernandez, Evan, Belinda Z Li, and Jacob Andreas. “Inspecting and Editing Knowledge Representations in Language Models.” https://github.com/evandez/REMEDI.
  • [6] Stephen McAleese. 2025. “Understanding LLMs: Insights from Mechanistic Interpretability.” https://www.lesswrong.com/posts/XGHf7EY3CK4KorBpw/understanding-llms-insights-from-mechanistic
  • [7] Olsson, et al., “In-context Learning and Induction Heads”, Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
  • [8] Li, Kenneth, Aspen K. Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. “Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task.” https://arxiv.org/abs/2210.13382v4.
  • [9] Nanda, Neel, Andrew Lee, and Martin Wattenberg. 2023. “Emergent Linear Representations in World Models of Self-Supervised Sequence Models.” https://arxiv.org/abs/2309.00941v2
  • [10] Gurnee, Wes, and Max Tegmark. 2023. “Language Models Represent Space and Time.” https://arxiv.org/abs/2310.02207v1.
  • [11] Wu, Zhaofeng, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, and Yoon Kim. 2023. “Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks.” https://arxiv.org/abs/2307.02477v1.
  • [12] “An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems.” 2025. https://openreview.net/forum?id=Tos7ZSLujg
  • [13] White, Colin, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, et al. 2025. "LiveBench: A Challenging, Contamination-Limited LLM Benchmark." doi:10.48550/arXiv.2406.19314.
  • [14] Nanda, Neel, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. 2023. “Progress Measures for Grokking via Mechanistic Interpretability.” doi:10.48550/arXiv.2301.05217.
  • [15] Elhage, Nelson, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, et al. 2022. “Toy Models of Superposition.” https://arxiv.org/abs/2209.10652v1 (February 18, 2024).
  • [16] Zou, Andy, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, et al. 2023. "Representation Engineering: A Top-Down Approach to AI Transparency."
  • [17] Burns, Collin, Haotian Ye, Dan Klein, and Jacob Steinhardt. 2022. "Discovering Latent Knowledge in Language Models Without Supervision."
  • [18] Cywiński, Bartosz, Emil Ryd, Senthooran Rajamanoharan, and Neel Nanda. 2025. “Towards Eliciting Latent Knowledge from LLMs with Mechanistic Interpretability.” doi:10.48550/arXiv.2505.14352.
  • [19] Marks, Samuel, Johannes Treutlein, Trenton Bricken, Jack Lindsey, Jonathan Marcus, Siddharth Mishra-Sharma, Daniel Ziegler, et al. "Auditing Language Models for Hidden Objectives."
  • [20] Turner, Alexander Matt, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. 2023. “Activation Addition: Steering Language Models Without Optimization.” https://arxiv.org/abs/2308.10248v3.
  • [21] Rütte, Dimitri von, Sotiris Anagnostidis, Gregor Bachmann, and Thomas Hofmann. 2024. “A Language Model’s Guide Through Latent Space.” doi:10.48550/arXiv.2402.14433.
  • [22] Karvonen, Adam. “Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models.” https://github.com/adamkarvonen/chess.
  • [23] Arditi, Andy, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. “Refusal in Language Models Is Mediated by a Single Direction.” doi:10.48550/arXiv.2406.11717.
  • [24] Tak, Ala N., Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, and Jonathan Gratch. 2025. “Mechanistic Interpretability of Emotion Inference in Large Language Models.” doi:10.48550/arXiv.2502.05489.
  • [25] Bills, Steven, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. 2023. "Language Models Can Explain Neurons in Language Models." https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html.
  • [26] Bricken, et al. 2023. "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
  • [27] Cunningham, Hoagy, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. 2023. "Sparse Autoencoders Find Highly Interpretable Features in Language Models."
  • [28] Engels, Joshua, Eric J. Michaud, Isaac Liao, Wes Gurnee, and Max Tegmark. 2025. “Not All Language Model Features Are One-Dimensionally Linear.” doi:10.48550/arXiv.2405.14860.
  • [29] Shaham, Tamar Rott, Sarah Schwettmann, Franklin Wang, Achyuta Rajaram, Evan Hernandez, Jacob Andreas, and Antonio Torralba. 2025. “A Multimodal Automated Interpretability Agent.” doi:10.48550/arXiv.2404.14394.
  • [30] Chen, Shiqi, Miao Xiong, Junteng Liu, Zhengxuan Wu, Teng Xiao, Siyang Gao, and Junxian He. 2024. “In-Context Sharpness as Alerts: An Inner Representation Perspective for Hallucination Mitigation.” doi:10.48550/arXiv.2403.01548.
  • [31] Yu, Lei, Meng Cao, Jackie CK Cheung, and Yue Dong. 2024. "Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations." In Findings of the Association for Computational Linguistics: EMNLP 2024, eds. Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen. Miami, Florida, USA: Association for Computational Linguistics, 7943–56. doi:10.18653/v1/2024.findings-emnlp.466.
  • [32] Ferrando, Javier, Oscar Obeso, Senthooran Rajamanoharan, and Neel Nanda. 2025. "Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models."
  • [33] Lindsey, et al., On the Biology of a Large Language Model (2025), Transformer Circuits
  • [34] Wang, Kevin, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2022. "Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small." http://arxiv.org/abs/2211.00593.
  • [35] “Deciphering Language Processing within the Human Brain through LLM Representations.” https://research.google/blog/deciphering-language-processing-in-the-human-brain-through-llm-representations/
  • [36] Berg, Cameron, Diogo de Lucena, and Judd Rosenblatt. 2025. “Large Language Models Report Subjective Experience Under Self-Referential Processing.” doi:10.48550/arXiv.2510.24797.