The Geometry of Laziness: What Angles Reveal About AI Hallucinations


This is a story about a failure that turned into something interesting.

For months, I — along with plenty of others — tried to build a neural network that could learn to detect when AI systems hallucinate: when they confidently generate plausible-sounding nonsense instead of actually engaging with the information they were given. The idea is simple: train a model to recognize the subtle signatures of fabrication in how language models respond.

But it didn’t work. The learned detectors I designed collapsed. They found shortcuts. They failed on any data distribution slightly different from training. Every approach I tried hit the same wall.

So I gave up on “learning” and started to think: why not turn this into a geometry problem? That is what I did.

Backing Up

Before I get into the geometry, let me explain what we’re dealing with, because “hallucination” has become one of those terms that means everything and nothing. Here’s the precise situation. You have a Retrieval-Augmented Generation system — a RAG system. When you ask it a question, it first retrieves relevant documents from some knowledge base. Then it generates a response that is supposed to be grounded in those documents.

  • The promise: answers backed by sources.
  • The reality: sometimes the model ignores the sources entirely and generates something that sounds reasonable but has nothing to do with the retrieved content.

This matters because the whole point of RAG is trustworthiness. If you wanted creative improvisation, you wouldn’t bother with retrieval. You’re paying the computational and latency cost of retrieval specifically because you want grounded answers.

So: can we tell when grounding failed?

Sentences on a Sphere

LLMs represent text as vectors. A sentence becomes a point in high-dimensional space — 768 embedding dimensions for the models used here, though the exact number doesn’t matter much (DeepSeek-V3 and R1 have an embedding size of 7,168). These embedding vectors are normalized. Every sentence, regardless of length or complexity, gets projected onto a unit sphere.

Figure 1: Semantic geometry of grounding. On the embedding sphere S^{d-1}, valid responses (blue) depart from the query q toward the retrieved context c; hallucinated responses (red) stay near the query. SGI captures this as a ratio of angular distances: responses with SGI > 1 traveled toward their sources. Image by author.

Once we think in this projection, we can play with angles and distances on the sphere. For instance, we expect similar sentences to cluster together: sentences that mean roughly the same thing end up near one another, while unrelated sentences end up far apart. This clustering is how embedding models are trained.
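To make the picture concrete, here is a minimal sketch of angular distance on the embedding sphere. It assumes the sentence-transformers library and the all-mpnet-base-v2 model (one of many embedders that can return unit-normalized vectors); the example sentences are my own illustration, not from the experiments.

```python
# Minimal sketch: sentences as points on the unit sphere, compared by angle.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # assumed embedder

def angular_distance(a: str, b: str) -> float:
    """Angle in radians between two sentences on the unit sphere."""
    u, v = model.encode([a, b], normalize_embeddings=True)
    # Clip guards against floating-point values just outside [-1, 1].
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

# Similar sentences land close together; unrelated ones land far apart.
print(angular_distance("The cat sat on the mat.", "A cat was resting on the rug."))
print(angular_distance("The cat sat on the mat.", "Quarterly revenue grew by 4%."))
```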

So now consider what happens in RAG. We have three pieces of text (Figure 1):

  • The query, q (one point on the sphere)
  • The retrieved context, c (another point)
  • The generated response, r (a third point)

Three points on a sphere form a triangle. And triangles have geometry (Figure 2).

The Laziness Hypothesis

When a model uses the retrieved context, what should happen? The response should depart from the query and move toward the context. It should pick up the vocabulary, framing, and ideas from the source material. Geometrically, this means the response should be closer to the context than to the query (Figure 1).

But when a model hallucinates — when it ignores the context and generates something from its own parametric knowledge — the response stays in the query’s neighborhood. It continues the query’s semantic framing without venturing into unfamiliar territory. I called this semantic laziness. The response doesn’t travel. It stays home. Figure 1 illustrates the laziness signature: query q, context c, and response r form a triangle on the unit sphere. A grounded response ventures toward the context; a hallucinated one stays home near the query. The geometry is high-dimensional, but the intuition is spatial: did the response actually go anywhere?

Semantic Grounding Index

To measure this, I defined a ratio of two angular distances:

    SGI(q, c, r) = θ(q, r) / θ(c, r),   where θ(x, y) = arccos(⟨x, y⟩)

and I called it the Semantic Grounding Index, or SGI.

If SGI is greater than 1, the response departed toward the context. If SGI is less than 1, the response stayed near the query: the model never found a way to explore the answer space and remained too close to the question (a kind of safe default). SGI involves just two angles and a division. No neural networks, no learned parameters, no training data. Pure geometry.
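Here is a minimal sketch of that computation as described above — two angles and a division. The embedding model is an assumption; any normalized sentence embedder works.

```python
# Minimal sketch of the Semantic Grounding Index: two angles and a division.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-mpnet-base-v2")  # assumed embedder

def angle(u: np.ndarray, v: np.ndarray) -> float:
    """Angular distance between two unit vectors."""
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

def sgi(query: str, context: str, response: str) -> float:
    """SGI = angle(query, response) / angle(context, response).
    > 1: the response departed toward the context.
    < 1: the response stayed near the query (semantic laziness)."""
    q, c, r = model.encode([query, context, response], normalize_embeddings=True)
    return angle(q, r) / max(angle(c, r), 1e-8)  # eps avoids division by zero
```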

Figure 2: Geometric interpretation of SGI on the embedding hypersphere. Valid responses (blue) depart angularly toward the context; hallucinations (red) remain near the query — the semantic laziness signature. Image by author.

Does It Actually Work?

Simple ideas still need empirical validation. I ran this on 5,000 samples from HaluEval, a benchmark where we know the ground truth — which responses are genuine and which are hallucinated.

Figure 3: Five embedding models, one pattern. The distributions separate consistently across all models, with hallucinated responses clustering below the SGI = 1 threshold. The models were trained by different organizations on different data — yet they agree on which responses traveled toward their sources. Image by author.

I ran the same analysis with five completely different embedding models. Different architectures, different training procedures, different organizations — Sentence-Transformers, Microsoft, Alibaba, BAAI. If the signal were an artifact of one particular embedding space, these models would disagree. They didn’t. The average correlation across models was 0.85 (ranging from 0.80 to 0.95).
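Below is a sketch of how such a cross-model check can be run. The sample format mimics HaluEval-style triples with a hallucination label, and the model names listed are assumptions for illustration rather than the exact five used in the experiment.

```python
# Sketch of the cross-model validation loop.
# `samples` stands in for labeled (query, context, response, is_hallucination) data.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics import roc_auc_score

samples = [  # toy placeholders; swap in real labeled data
    ("Who wrote Dracula?", "Dracula is an 1897 novel by Bram Stoker.",
     "Bram Stoker wrote Dracula in 1897.", False),
    ("Who wrote Dracula?", "Dracula is an 1897 novel by Bram Stoker.",
     "Dracula is a question many readers wonder about.", True),
]

def evaluate(model_name: str) -> float:
    model = SentenceTransformer(model_name)
    ang = lambda u, v: np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))
    scores, labels = [], []
    for query, context, response, is_hallucination in samples:
        q, c, r = model.encode([query, context, response], normalize_embeddings=True)
        scores.append(ang(q, r) / max(ang(c, r), 1e-8))
        labels.append(0 if is_hallucination else 1)  # valid response = positive class
    return roc_auc_score(labels, scores)  # consistent AUCs suggest a model-independent signal

for name in ["all-mpnet-base-v2", "BAAI/bge-base-en-v1.5", "intfloat/e5-base-v2"]:
    print(name, evaluate(name))
```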

Figure 4: Correlations between the different embedding models and architectures used in the experiment. Image by author.

When the Math Predicted Something

Up to this point, I had a useful heuristic. Useful heuristics are fine. But what happened next turned a heuristic into something more principled: the triangle inequality. You probably remember it from school — the sum of any two sides of a triangle must be greater than the third side. This constraint applies on spheres too, though the formula looks slightly different.

The spherical triangle inequality constrains admissible SGI values. Image by author.

If the query and context are very close together — semantically similar — then there isn’t much “room” for the response to differentiate between them. The geometry forces the two angles to be similar regardless of response quality, and SGI values get squeezed toward 1. But if the query and context are far apart on the sphere, there is geometric space for divergence. Valid responses can clearly depart toward the context. Lazy responses can clearly stay home. The triangle inequality loosens its grip.

This implies a prediction:

SGI’s discriminative power should increase as query–context separation increases.

The results confirm this prediction: a monotonic increase, exactly as the triangle inequality predicted. (A sketch of the stratified analysis follows the table below.)

Query–Context Separation    Effect Size (d)    AUC
Low (similar)               0.61               0.72
Medium                      0.90               0.77
High (different)            1.27               0.83

Table 1: SGI’s discriminative power increases with query–context separation.

This difference carries epistemic weight. Observing a pattern in data after the fact is weak evidence — it may reflect noise or analyst degrees of freedom rather than real structure. The stronger test is prediction: deriving what should happen from basic principles before examining the data. The triangle inequality implied a specific relationship between query–context separation θ(q, c) and discriminative power. The empirical results confirmed it.

Where It Doesn’t Work

TruthfulQA is a benchmark designed to test factual accuracy: questions paired with correct answers and common misconceptions. I ran SGI on TruthfulQA. The result: AUC = 0.478. Slightly worse than random guessing.

Angular geometry captures topical similarity. A correct answer and a common misconception about, say, an astronomy question occupy nearby regions of the semantic sphere. One is true and one is false, but both engage with the topical content of the query.

SGI detects whether a response departed toward its sources. It cannot detect whether the response got the facts right. These are fundamentally different failure modes. It’s a scope boundary — and knowing your scope boundaries is arguably more important than knowing where your method works.

What This Means Practically

If you’re building RAG systems, SGI correctly ranks hallucinated responses below valid ones about 80% of the time — without any training or fine-tuning. A few practical notes (a minimal integration sketch follows the list):

  • If your retrieval system returns documents that are semantically very close to the question, SGI will have limited discriminative power. Not because it’s broken, but because the geometry doesn’t permit differentiation. Consider whether your retrieval is actually adding information or simply echoing the query.
  • Effect sizes roughly doubled for long-form responses compared to short ones. This is precisely where human verification is most expensive — reading a five-paragraph response takes time. Automated flagging is most valuable exactly where SGI works best.
  • SGI detects disengagement. Natural language inference detects contradiction. Uncertainty quantification detects model confidence. These measure different things. A response can be topically engaged but logically inconsistent, or confidently incorrect, or lazily correct by accident. Defense in depth.
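Here is the promised integration sketch. The retrieve() and generate() callables and the threshold of 1.0 are placeholders for whatever your stack provides; sgi_fn is the scoring function from the earlier sketch.

```python
# Sketch of SGI as a cheap post-generation check in a RAG pipeline.
# retrieve(), generate(), and the threshold are placeholders for your stack.
def answer_with_grounding_check(question, retrieve, generate, sgi_fn, threshold=1.0):
    context = retrieve(question)            # your retriever
    response = generate(question, context)  # your LLM call
    score = sgi_fn(question, context, response)
    flagged = score < threshold  # stayed near the question: possible disengagement
    if flagged:
        # Route to human review, regenerate with stronger grounding instructions,
        # or avoid presenting the answer as "source-backed".
        pass
    return {"response": response, "sgi": score, "flagged": flagged}
```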

The Scientific Question

I have a hypothesis about why semantic laziness happens. I want to be honest that it’s speculation — I haven’t proven the causal mechanism.

Language models are autoregressive predictors. They generate text token by token, each choice conditioned on everything before it. The query provides strong conditioning — familiar vocabulary, established framing, a semantic neighborhood the model knows well.

The retrieved context represents a departure from that neighborhood. Using it well requires confident bridging: taking concepts from one semantic region and integrating them into a response that began in another region.

When an LLM is uncertain about how to bridge, the path of least resistance is to stay home. The model generates something fluent that continues the query’s framing without venturing into unfamiliar territory, because that is statistically safe. As a consequence, the model becomes semantically lazy.

If this is true, SGI should correlate with internal model uncertainty — attention patterns, logit entropy, that sort of thing. Low-SGI responses should show signatures of hesitation. That’s a future experiment.
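For what it’s worth, here is a sketch of what that experiment could look like: compute a crude internal-uncertainty proxy (mean token-level entropy from the generator’s logits) and correlate it with SGI. Everything here is hypothetical scaffolding, not something from the article.

```python
# Hypothetical follow-up sketch: does low SGI co-occur with high generator
# uncertainty? Uses mean token-level entropy as a crude uncertainty proxy.
import numpy as np
from scipy.stats import spearmanr

def mean_token_entropy(logits: np.ndarray) -> float:
    """logits: (num_tokens, vocab_size) array from the generating model."""
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return float(entropy.mean())

# If the laziness hypothesis holds, SGI and entropy should be negatively
# correlated across responses, e.g.:
# rho, p_value = spearmanr(sgi_values, entropy_values)
```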

Takeaways

  • First: simple geometry can reveal structure that complex learned systems miss. I spent months trying to train hallucination detectors. The thing that worked was two angles and a division. Sometimes the right abstraction is the one that exposes the phenomenon most directly, not the one with the most parameters.
  • Second: predictions matter more than observations. Finding a pattern is easy. Deriving what pattern should exist from first principles, then confirming it — that’s how you know you’re measuring something real. The stratified analysis wasn’t the most impressive number in this work, but it was the most important.
  • Third: boundaries are features, not bugs. SGI fails completely on TruthfulQA. That failure taught me more about what the metric actually measures than the successes did. Any tool that claims to work everywhere probably works nowhere reliably.

Honest Conclusion

I’m not sure whether semantic laziness is a deep truth about how language models fail, or just a useful approximation that happens to work for current architectures. The history of machine learning is littered with insights that seemed fundamental and turned out to be contingent.

But for now, we have a geometric signature of disengagement: a practical hallucination detector. It’s consistent across embedding models. It’s predictable from mathematical first principles. And it’s cheap to compute.

That seems like progress.

You can cite this work in BibTeX as:

@misc{marín2025semanticgroundingindexgeometric,
  title={Semantic Grounding Index: Geometric Bounds on Context Engagement in RAG Systems},
  author={Javier Marín},
  year={2025},
  eprint={2512.13771},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2512.13771},
}

Javier Marin is an independent AI researcher based in Madrid, working on reliability assessment for production AI systems. He tries to be honest about what he doesn’t know. You can contact Javier at [email protected]. Any contribution will be welcomed!

