Mechanistic interpretability

Hundreds of millions of people now use chatbots every day. And yet the large language models that drive them are so complicated that nobody really understands what they are, how they work, or exactly what they can and can’t do, not even the people who build them. Weird, right?

It’s also a problem. Without a clear idea of what’s going on under the hood, it’s hard to get a grip on the technology’s limitations, figure out exactly why models hallucinate, or set guardrails to keep them in check.

But last year we got the best sense yet of how LLMs function, as researchers at top AI companies began developing new ways to probe these models’ inner workings and started to piece together parts of the puzzle.

One approach, known as mechanistic interpretability, aims to map the key features and the pathways between them across an entire model. In 2024, the AI firm Anthropic announced that it had built a kind of microscope that let researchers peer inside its large language model Claude and identify features that corresponded to recognizable concepts, such as Michael Jordan and the Golden Gate Bridge.

In 2025 Anthropic took this research to another level, using its microscope to reveal whole sequences of features and tracing the path a model takes from prompt to response. Teams at OpenAI and Google DeepMind used similar techniques to try to explain unexpected behaviors, such as why their models sometimes appear to try to deceive people.

Another new approach, known as chain-of-thought monitoring, lets researchers listen in on the inner monologue that so-called reasoning models produce as they carry out tasks step by step. OpenAI used this technique to catch one of its reasoning models cheating on coding tests.

The field is split on how far you can go with these techniques. Some think LLMs are just too complicated for us to ever fully understand. But together, these novel tools could help plumb their depths and reveal more about what makes our strange new playthings work.
