is not a knowledge quality problem. It is not a training problem. It is not a problem you can solve with more RLHF, better filtering, or a bigger context window. It is a structural property of what these systems are optimized to do.
I have held this position for months, and the response is predictable: researchers working on retrieval augmentation, fine-tuning pipelines, and alignment techniques would prefer a more optimistic framing. I understand why.
What has been missing from this argument is geometry. Intuition about objectives and architecture is necessary but not sufficient. We need to open the model and look at what is actually happening inside when a system produces a confident wrong answer. Not at the logits. Not at the attention patterns. At the internal trajectory of the representation itself, layer by layer, from input to output. That is what the work I am presenting here did.
What the Residual Stream Knows Before the Model Lies
The setup is simple. We take a factual prompt — the kind where a transformer should retrieve a stored association — and run it in two conditions: one where the model produces the correct answer, and one where it produces a confident wrong answer (a hallucination). Then we track the trajectory of the residual stream — the internal representation vector — layer by layer through the network. The question is: do these two trajectories diverge because the model simply lacks the relevant association? Or is something more specific happening?
To see what that means, think of the model's internal state at each layer as a point in a high-dimensional space. As the model processes a prompt, that point moves. It traces a path. What the experiment measures is whether the path taken during a correct answer and the path taken during a hallucination diverge because one path is shorter — the model running out of information — or because they go in different directions while covering the same distance.
The answer is the second. The paths are the same length. They point in different directions. That is what Figure 1 shows: two trajectories leaving the same origin, traveling the same distance, arriving at different ends of the space. One toward the correct answer. One away from it.
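The distinction between "shorter path" and "different direction" can be made concrete. Assuming you can read out the residual-stream state at each layer as a vector, path length is the sum of layer-to-layer step sizes, and direction is the net displacement from input to output. The sketch below uses synthetic arrays as stand-ins for real activations, and the rigid rotation is only an illustration of the key property — equal length, different destination:

```python
import numpy as np

def path_length(states):
    """Total distance traveled through representation space:
    the sum of layer-to-layer step sizes."""
    steps = np.diff(states, axis=0)           # (L, d) displacement per layer
    return np.linalg.norm(steps, axis=1).sum()

def net_direction(states):
    """Unit vector from the first state to the last one."""
    disp = states[-1] - states[0]
    return disp / np.linalg.norm(disp)

# Synthetic stand-ins for two runs. The second is a rigid rotation of
# the first: it covers exactly the same distance but ends up elsewhere.
rng = np.random.default_rng(0)
d = 16
correct = np.cumsum(rng.normal(size=(12, d)), axis=0)
Q = np.linalg.qr(rng.normal(size=(d, d)))[0]  # random orthogonal matrix
rotated = correct @ Q

# A rotation preserves path length exactly...
print(np.isclose(path_length(correct), path_length(rotated)))  # True
# ...while the net directions diverge.
cos = net_direction(correct) @ net_direction(rotated)
print(cos)
```

The point of the toy: a purely rotational divergence is invisible to any metric that only looks at how far the representation has moved, which is why the length-versus-direction decomposition matters.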
The Commitment Ratio: Where Suppression Becomes Visible
The paper introduces a metric called the commitment ratio κ — roughly, how much of the model's probability mass is being actively directed toward or away from the correct token at each layer.
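The paper's exact definition of κ is not reproduced here, but a logit-lens-style proxy captures the idea: decode each layer's residual state directly through the unembedding matrix and ask how much probability sits on the correct token at that depth. Everything below (the unembedding `W_U`, the states, the vocabulary size) is a synthetic toy, not the paper's setup:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def commitment_curve(states, W_U, correct_id):
    """Logit-lens proxy for the commitment ratio kappa: the probability
    assigned to the correct token when each layer's residual state is
    decoded directly through the unembedding."""
    return np.array([softmax(h @ W_U)[correct_id] for h in states])

# Toy setup: 8 layers, 32-dim residual stream, 100-token vocabulary.
rng = np.random.default_rng(1)
W_U = rng.normal(size=(32, 100))
correct_id = 7

# A run that drifts steadily toward the correct token's unembedding row,
# mimicking correct retrieval (rising commitment layer by layer).
target = W_U[:, correct_id]
states = np.array([0.5 * t * target / np.linalg.norm(target)
                   for t in range(8)])
kappa = commitment_curve(states, W_U, correct_id)
print(np.all(np.diff(kappa) > 0))   # rises monotonically -> True
```

In this framing, the hallucination signature described below would show up as a curve that rises early and then collapses in the middle layers instead of continuing upward.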
In correct processing, κ rises monotonically through the network (Figure 2 — red, blue, and dark grey curves). The model builds commitment to the right answer progressively. That is what you would expect from a system retrieving a learned association.
In hallucination, something different happens. κ does not simply stay flat, which would indicate retrieval failure — the absence of the relevant statistical pattern. Instead, κ collapses (dashed curves in Figure 2). In all models tested, κ reaches a minimum significantly below its starting value before recovering slightly in the final layers. In LLaMA-2 13B and Mistral 7B, it drops to κ_min = 0.08. The p-values are below 10⁻¹⁰⁰. This is not a “subtle” effect.

What is happening? The model is not failing to find the correct answer. It is actively moving probability mass away from the correct token at the same layers where it would be moving probability mass toward it in the correct condition. The failure is an override.
The model has encoded the correct answer. That is what makes the κ collapse significant. If the model simply lacked the relevant association — if “Paris” were never statistically connected to “capital of France” in the weights — we would see a flat or noisy trajectory. Nothing to suppress. The geometry would be uninformative.
What we see instead is a trajectory that starts in the right direction (all curves in Figure 2 begin at essentially the same point) but then turns. The correct token accumulates probability in the early layers, as in the correct run, and then loses it in the middle layers, at precisely the depth where it should be rising in the correct condition (red, blue, and dark grey curves in Figure 1). Why? The honest answer is that the paper establishes the what with precision and leaves the why open. But the most plausible interpretation is competition. These models are not retrieving isolated facts. They are predicting the next token in a context, and context generates its own pressure. A sentence that has been going in a particular direction — stylistically, topically, syntactically — creates a strong prior for how it should continue. When the factually correct answer conflicts with that contextual attractor, the model does not flip a coin. The contextual signal, which is dense and continuous across the whole sequence, can outweigh the factual signal, which may be sparse in the training data.
The training signal never explicitly told the model to prefer coherence over accuracy. It told the model to predict the next token. Coherence and accuracy usually align. When they do not, what we get is the dashed gray line in Figure 2. That is the uncomfortable part.
Three Regimes
One of the cleaner empirical findings is that the seven models do not distribute continuously along any axis of hallucination behavior. They fall into three distinct clusters:
- Models at ~1B parameters show the beginnings of attention reallocation — some geometric separation — but incomplete suppression.
- Models at 1.6B–3B show intermediate suppression. The κ collapse is present but shallower: StableLM-2 1.6B reaches κ_min = 0.32 rather than 0.08.
- Then there is Gemma 2 2B, which matches the suppression depth of LLaMA-2 13B and Mistral 7B despite having a fraction of their parameters (κ_min = 0.08, p < 10⁻⁹¹).
Something real is happening architecturally, not only as a function of scale. Architectural choices — attention mechanisms, normalization, layer design — set the ceiling on suppression depth independently of parameter count. This is a phase structure.
Detecting Hallucinations
We have mapped, with geometric precision, how a particular class of system fails. The causal question — which specific circuits implement the suppression, and why — remains open. That is the next problem. What the geometry establishes is that the suppression is not accidental. It is not a calibration error you can tune away with better prompting or a different learning rate. It is an emergent property of systems optimized for next-token prediction. Contextual coherence and factual accuracy are different objectives. When they conflict, the training signal does not adjudicate between them. The override is what that conflict looks like from the inside.
The practical implication is direct. You can use this geometric signature to build hallucination detectors — probes that identify suppression events before they reach the output. They work well. But they are local. A probe trained on factual retrieval does not transfer cleanly to reasoning tasks or to different knowledge domains. The geometry shifts enough that detection degrades. This is not a flaw in the approach. It is information. It tells you that monitoring must be domain-specific, calibrated per deployment context, not installed once and forgotten.
For anyone building production systems at scale, that is the operational conclusion: one monitor per domain, trained on representative data from that domain. The alternative — a single universal detector — is not supported by the evidence.
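A minimal sketch of what one such monitor could look like: a linear probe on mid-layer activations, trained per domain to flag suppression events. The activations, the separating direction, and the shift magnitude below are all synthetic stand-ins — the real probe architecture and features are not specified in the text. Plain-numpy logistic regression keeps it self-contained:

```python
import numpy as np

def train_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe on activation vectors: predicts whether
    a forward pass exhibits the suppression signature (label 1) or not."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30, 30)       # clamp for numerical safety
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * (X.T @ (p - y)) / len(y)    # gradient of log-loss
        b -= lr * (p - y).mean()
    return w, b

def probe_predict(w, b, X):
    return (X @ w + b) > 0

# Synthetic 'activations': suppression runs are shifted along one
# direction, mimicking a geometric signature a linear probe can find.
rng = np.random.default_rng(2)
d, n = 64, 400
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
clean = rng.normal(size=(n, d))
suppressed = rng.normal(size=(n, d)) + 4.0 * direction
X = np.vstack([clean, suppressed])
y = np.concatenate([np.zeros(n), np.ones(n)])

w, b = train_probe(X, y)
acc = (probe_predict(w, b, X) == y).mean()
print(acc)
```

Under the one-monitor-per-domain conclusion above, you would train a separate `(w, b)` pair on representative traffic from each deployment domain rather than reusing a single probe everywhere.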
What the Geometry Cannot Fix
The override mechanism this work documents is not a “bug waiting to be patched”. It is a direct consequence of the objective function used to train LLMs. Next-token prediction over discrete sequences gives a model no mechanism to privilege factual accuracy over contextual coherence. The training signal cannot differentiate between them. The model learns to be fluent, which is quite remarkable. The problem is that fluency and accuracy usually coincide. When they do not, fluency wins. It is a conflict-resolution mechanism producing the wrong outcome. The geometry shows you the moment that decision happens.
To answer the causal question — which specific circuits implement the suppression, and whether they can be modified — we need activation patching at scale, circuit-level analysis, and ideally causal intervention experiments that go beyond the correlational evidence this paper provides. That is the next step. Several groups are working on it.
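The core of activation patching can be sketched without any ML framework: run the model on a clean input and cache an intermediate activation, then re-run on a corrupted input with that one activation swapped in, and measure how much of the clean behavior is restored. The toy model below is a chain of fixed tanh layers, purely illustrative — real transformers have residual connections around the patch point, so restoration there is partial rather than exact:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 'model': a chain of fixed affine+tanh layers.
layers = [(rng.normal(size=(8, 8)) * 0.5, rng.normal(size=8) * 0.1)
          for _ in range(4)]

def forward(x, patch_layer=None, patch_value=None):
    """Run the layer chain, optionally overwriting one layer's output
    with a cached activation (the causal intervention)."""
    acts = []
    h = x
    for i, (W, b) in enumerate(layers):
        h = np.tanh(W @ h + b)
        if i == patch_layer:
            h = patch_value            # swap in the cached activation
        acts.append(h)
    return h, acts

clean_x = rng.normal(size=8)
corrupt_x = rng.normal(size=8)

clean_out, clean_acts = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)

# Patch the clean activation into the corrupted run at layer 2. In this
# toy chain (no skip connections), everything downstream of the patch
# then matches the clean run exactly.
patched_out, _ = forward(corrupt_x, patch_layer=2, patch_value=clean_acts[2])
print(np.allclose(patched_out, clean_out))  # True
```

Scaling this idea up — patching individual heads and MLP outputs across many prompts — is what would localize which circuits carry the suppression signal.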
Whether the answer to that causal question would let us fix hallucination within the current architectural paradigm is a different matter. My view is that it would not — not fundamentally. We can suppress the suppression. We can add a monitoring layer that catches the κ collapse before it reaches the output. We can fine-tune on domains where the conflict is most acute. These are real improvements. But the underlying tension between contextual prediction and factual grounding does not go away until the model has representations of the world that are not derived from token co-occurrence. That requires a different architecture.
Why This Work Matters Anyway
Infrastructure that accurately characterizes the failure modes of current LLMs is a necessary step in the transition to better ones. We cannot design a successor architecture without understanding, in detail, what the predecessor is actually doing inside. This work tells us something specific:
- In autoregressive LLMs (transformer architectures), the geometry of correct and incorrect factual processing diverges rotationally, not magnitudinally;
- the divergence is active rather than passive;
- the depth of suppression is architecturally gated, not purely a function of scale;
- the geometric signature transfers across domains with systematic but bounded degradation.
The geometry doesn’t lie. What we choose to do with it is a different question.
Code, data, and related papers will be available at cert-framework.com soon.
Recommended reading
- Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. 2020. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001.
- Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread. https://transformercircuits.pub/2021/framework/index.html
- Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual.
- Leonard Bereska and Efstratios Gavves. 2024. Mechanistic interpretability for AI safety — a review.
- Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. ICLR, 2016.
