is not a data quality problem. It is not a training problem. It is not a problem you can solve with more RLHF, better filtering, or a larger context window. It is a structural property of what these systems are optimized to do.
I have held this position for months, and the reaction is predictable: researchers working on retrieval augmentation, fine-tuning pipelines, and alignment techniques would prefer a more optimistic framing. I understand why.
What has been missing from this argument is geometry. Intuition about objectives and architecture is necessary but not sufficient. We need to open the model and look at what is actually happening inside when a system produces a confident wrong answer. Not at the logits. Not at the attention patterns. At the internal trajectory of the representation itself, layer by layer, from input to output. That is what the work I am presenting here did.
What the Residual Stream Knows Before the Model Lies
The setup is very simple. We take a factual prompt — the kind where a transformer should retrieve a stored association — and we run it in two conditions: one where the model produces the correct answer, one where it produces a confident wrong answer (hallucination). Then, we track the trajectory of the residual stream — the internal representation vector — layer by layer through the network. The question is: do these two trajectories diverge because the model simply lacks the relevant association? Or is something more specific happening?
To understand what that means, think of the model’s internal state at each layer as a point in space — a high-dimensional space. As the model processes a prompt, that point moves. It traces a path. What the experiment measures is whether the path taken during a correct answer and the path taken during a hallucination diverge because one path is shorter — the model running out of information — or because they go in different directions while covering the same distance.
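The two quantities being compared — path length and net direction — can be made concrete. Below is a minimal numpy sketch of that measurement. The layer states here are toy arrays; in a real experiment they would be residual-stream vectors captured from the model (e.g. via forward hooks), which the paper's actual pipeline presumably handles with more care:

```python
import numpy as np

def trajectory_stats(states):
    """states: (n_layers + 1, d) array of residual-stream vectors,
    one per layer boundary, for a single run."""
    steps = np.diff(states, axis=0)                  # per-layer displacement
    path_length = float(np.linalg.norm(steps, axis=1).sum())
    net_displacement = states[-1] - states[0]        # where the path ends up
    return path_length, net_displacement

def angle_deg(u, v):
    """Angle in degrees between two net-displacement vectors."""
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Two toy trajectories with identical path length but different direction:
run_a = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])   # heads "east"
run_b = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]])   # heads "north"

len_a, dir_a = trajectory_stats(run_a)
len_b, dir_b = trajectory_stats(run_b)
print(len_a, len_b, angle_deg(dir_a, dir_b))  # equal lengths, 90 degrees apart
```

If the hallucination were a retrieval failure, the two runs would differ in path length; the finding described next is that they differ in angle instead.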
The answer is the second one. The paths are the same length. They point to different places. That is what Figure 1 shows: two trajectories leaving the same origin, traveling the same distance, arriving in different regions of the space. One toward the correct answer. One away from it.
Figure 1. When an LLM hallucinates, the internal representation does not go blank. It rotates. Both paths — correct and incorrect — travel the same distance through the model’s representation space. What separates them is direction, not magnitude. The geometry is telling you something the output logits cannot: the model knew where the right answer was. It went somewhere else. Image by author.
The Commitment Ratio: Where Suppression Becomes Visible
The paper introduces a metric called the commitment ratio κ — essentially, how much of the model’s probability mass is being actively directed toward or away from the correct token at each layer.
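The paper's exact formula for κ is not reproduced here, but a common way to operationalize "probability mass directed toward the correct token at each layer" is the logit lens: project each layer's residual stream through the unembedding matrix and read off the correct token's probability. A minimal numpy sketch under that assumption (the final layer norm that real models apply before unembedding is omitted for brevity):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def commitment_curve(layer_states, W_U, correct_id):
    """layer_states: list of (d,) residual-stream vectors at the answer
    position, one per layer. W_U: (d, vocab) unembedding matrix.
    Returns the correct token's probability read off at each layer."""
    return [float(softmax(h @ W_U)[correct_id]) for h in layer_states]

# Toy example: as the residual stream aligns with the correct token's
# unembedding direction, the per-layer probability rises.
W_U = np.eye(3)  # stand-in unembedding: 3-dim stream, 3-token vocabulary
states = [np.zeros(3), np.array([1.0, 0.0, 0.0]), np.array([3.0, 0.0, 0.0])]
print(commitment_curve(states, W_U, correct_id=0))
```

A rising curve is what the correct condition looks like under this reading; the collapse described below is the same curve turning downward mid-network.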
In correct processing κ rises monotonically through the network (Figure 2 — red, blue and dark grey curves). The model builds commitment to the right answer progressively. This is what you would expect from a system retrieving a learned association.
In hallucination, something different happens. κ does not simply stay flat, which would indicate retrieval failure — the absence of the relevant statistical pattern. Instead, κ collapses (dashed curves in Figure 2). In all models tested, κ reaches a minimum significantly below its starting value before recovering slightly in the final layers. In LLaMA-2 13B and Mistral 7B, it drops to κ_min = 0.08. The p-values are below 10⁻¹⁰⁰. This is not a “subtle” effect.
Figure 2: Six models with the same pattern. The dashed line in each panel is a hallucination run. Every other curve — correct processing under different prompt conditions — rises through the network. The hallucination curve falls, reaches a floor near zero, then partially recovers at the output layer. In LLaMA-2 13B and Mistral 7B that floor is κ = 0.08. In Gemma 2 2B — a model with a fraction of their parameters — it reaches the same depth. The model is not failing to retrieve the correct answer. It is actively moving probability away from it. That is not a retrieval failure. That is a decision. Image by author.
What is happening? The model is not failing to find the correct answer. It is actively moving probability mass away from the correct token at the same layers where it would be moving probability mass toward it in the correct condition. The failure is basically an override.
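"Same layers, opposite direction" can be checked directly: given per-layer commitment values for a matched correct/hallucination pair, flag the layers where one run gains probability mass while the other actively sheds it. A sketch, assuming the two arrays come from some per-layer measurement of commitment:

```python
import numpy as np

def override_layers(kappa_correct, kappa_halluc, tol=0.0):
    """Return layer indices where the correct run gains commitment
    while the matched hallucination run actively loses it."""
    dc = np.diff(kappa_correct)   # per-layer change, correct run
    dh = np.diff(kappa_halluc)    # per-layer change, hallucination run
    return [i for i in range(len(dc)) if dc[i] > tol and dh[i] < -tol]

# Toy curves: the hallucination run turns downward at layers 1 and 2.
print(override_layers([0.10, 0.30, 0.50, 0.70],
                      [0.10, 0.20, 0.10, 0.05]))  # -> [1, 2]
```

An empty result would indicate a passive failure (nothing being pushed down); a contiguous band of flagged middle layers is the override signature.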
The model has encoded the correct answer. That is what makes the κ collapse significant. If the model simply lacked the relevant association — if “Paris” was never statistically connected to “capital of France” in the weights — we would see a flat or noisy trajectory. Nothing to suppress. The geometry would be uninformative.
What we see instead is a trajectory that starts in the right direction (all curves in Figure 2 start at essentially the same point) but then turns. The correct token accumulates probability in the early layers, as the correct run does, and then loses it in the middle layers, at exactly the depth where it should be rising in the correct condition (red, blue and dark grey curves in Figure 2). Why? The honest answer is that the paper establishes the what with precision and leaves the why open. But the most plausible interpretation is competition. These models are not retrieving isolated facts. They are predicting the next token in a context, and context generates its own pressure. A sentence that has been going in a particular direction — stylistically, topically, syntactically — creates a strong prior for how it should continue. When the factually correct answer conflicts with that contextual attractor, the model does not flip a coin. The contextual signal, which is dense and continuous across the entire sequence, can outweigh the factual signal, which may be sparse in the training data.
The training signal never explicitly told the model to prefer coherence over accuracy. It told the model to predict the next token. Coherence and accuracy usually align. When they do not, what we get is the dashed gray line in Figure 2.
The model is not lying. It is doing exactly what it was optimized to do. That is the uncomfortable part.
Three Regimes
One of the cleaner empirical findings is that the seven models do not distribute continuously along any axis of hallucination behavior. They fall into three distinct clusters:
- Models around 1B parameters show attention reallocation beginning — some geometric separation — but suppression that is incomplete.
- Models at 1.6B–3B show intermediate suppression: the κ collapse is present but shallower. StableLM-2 1.6B reaches κ_min = 0.32 rather than 0.08.
- Models at 7B–13B show full suppression, with LLaMA-2 13B and Mistral 7B both bottoming out at κ_min = 0.08.

Then there is Gemma 2 2B, which matches the suppression depth of LLaMA-2 13B and Mistral 7B despite having a fraction of their parameters (κ_min = 0.08, p < 10⁻⁹¹).
Something real is going on architecturally, not just as a function of scale. Architectural choices — attention mechanisms, normalization, layer design — set the ceiling on suppression depth independently of parameter count. This is a phase structure.
Detecting Hallucinations
We have mapped, with geometric precision, how a specific class of system fails. The causal question — which specific circuits implement the suppression, and why — remains open. That is the next problem. What the geometry establishes is that the suppression is not accidental. It is not a calibration error you can tune away with better prompting or a different learning rate. It is an emergent property of systems optimized for next-token prediction. Contextual coherence and factual accuracy are different objectives. When they conflict, the training signal does not adjudicate between them. The override is what that conflict looks like from the inside.
The practical implication is direct. You can use this geometric signature to build hallucination detectors — probes that identify suppression events before they reach the output. They work well. But they are local. A probe trained on factual retrieval does not transfer cleanly to reasoning tasks or to different knowledge domains. The geometry shifts enough that detection degrades. This is not a flaw in the approach. It is information. It tells you that monitoring needs to be domain-specific, calibrated per deployment context, not installed once and forgotten.
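As a sketch of what such a probe looks like in practice: a linear classifier (here scikit-learn's LogisticRegression) trained to separate activations from faithful and hallucinated runs. The features below are synthetic stand-ins; in a real deployment they would be residual-stream vectors captured near the layer where κ bottoms out, collected per domain:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in features: 0 = faithful run, 1 = hallucination.
# The shifted cluster plays the role of the geometric suppression signature.
rng = np.random.default_rng(0)
d = 32
faithful = rng.normal(loc=0.0, size=(200, d))
halluc = rng.normal(loc=0.6, size=(200, d))
X = np.vstack([faithful, halluc])
y = np.array([0] * 200 + [1] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))  # training accuracy of the linear probe
```

The domain-specificity finding maps directly onto this setup: a probe fit on one domain's activation distribution degrades when the feature geometry shifts, which is why one probe per deployment context is the defensible configuration.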
For anyone building production systems at scale, that is the operational conclusion: one monitor per domain, trained on representative data from that domain. The alternative — a single universal detector — is not supported by the evidence.
What the Geometry Cannot Fix
The override mechanism this work documents is not a “bug waiting to be patched”. It is a direct consequence of the objective function used to train LLMs. Next-token prediction over discrete sequences gives a model no mechanism to privilege factual accuracy over contextual coherence. The training signal cannot differentiate between them. The model learns to be fluent, which is quite remarkable. The problem is that fluency and accuracy usually coincide. When they do not, fluency wins. It is a conflict-resolution mechanism producing the wrong outcome. The geometry shows you the moment that decision happens.
To answer the causal question — which specific circuits implement the suppression, and whether they can be modified — we need activation patching at scale, circuit-level analysis, and ideally causal intervention experiments that go beyond the correlational evidence this paper provides. That is the next step. Several groups are working on it.
Whether the answer to that causal question would allow us to fix hallucination within the current architectural paradigm is a different matter. My view is that it would not — not fundamentally. We can suppress the suppression. We can add a monitoring layer that catches the κ collapse before it reaches the output. We can fine-tune on domains where the conflict is most acute. These are real improvements. But the underlying tension between contextual prediction and factual grounding does not go away until the model has representations of the world that are not derived from token co-occurrence. That requires a different architecture.
Why This Work Matters Anyway
Infrastructure that accurately characterizes the failure modes of current LLMs is a necessary step in the transition to better ones. We can't design a successor architecture without understanding, in detail, what the predecessor is actually doing inside. This work tells us something specific:
- In autoregressive LLMs (transformer architectures), the geometry of correct and incorrect factual processing diverges in direction, not in magnitude;
- the divergence is active rather than passive;
- the depth of suppression is architecturally gated, not purely a function of scale;
- the geometric signature transfers across domains with systematic but bounded degradation.
The geometry does not lie. What we choose to do with it is a different question.
Code, data, and related papers will be available at cert-framework.com soon.
Recommended reading
- Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. 2020. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001.
- Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread. https://transformercircuits.pub/2021/framework/index.html
- Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual.
- Leonard Bereska and Efstratios Gavves. 2024. Mechanistic interpretability for AI safety — a review. arXiv preprint arXiv:2404.14082.
- Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. ICLR.

