This is a story about a failure that turned into something interesting.
For months, I (along with hundreds of others) tried to build a neural network that could learn to detect when AI systems hallucinate, confidently generating plausible-sounding nonsense instead of actually engaging with the information they were given. The idea is straightforward: train a model to recognize the subtle signatures of fabrication in how language models respond.
But it didn't work. The learned detectors I designed collapsed. They found shortcuts. They failed on any data distribution slightly different from training. Every approach I tried hit the same wall.
So I gave up on "learning" and started to ask: why not turn this into a geometry problem? That is what I did.
Backing Up
Before I get into the geometry, let me explain what we're dealing with, because "hallucination" has become one of those terms that means everything and nothing. Here's the specific situation. You have a Retrieval-Augmented Generation (RAG) system. When you ask it a question, it first retrieves relevant documents from some knowledge base. Then it generates a response that's supposed to be grounded in those documents.
- The promise: answers backed by sources.
- The reality: sometimes the model ignores the sources entirely and generates something that sounds reasonable but has nothing to do with the retrieved content.
This matters because the whole point of RAG is trustworthiness. If you wanted creative improvisation, you wouldn't bother with retrieval. You're paying the computational and latency cost of retrieval specifically because you want grounded answers.
So: can we tell when grounding failed?
Sentences on a Sphere
LLMs represent text as vectors. A sentence becomes a point in high-dimensional space: 768 embedding dimensions for the first models, though the specific number doesn't matter much (DeepSeek-V3 and R1 have an embedding size of 7,168). These embedding vectors are normalized, so every sentence, regardless of length or complexity, gets projected onto a unit sphere.
Figure 1: Semantic geometry of grounding. On the embedding sphere S^{d-1}, valid responses r (blue) depart from the question q toward the retrieved context c; hallucinated responses (red) stay close to the question. SGI captures this as a ratio of angular distances: responses with SGI > 1 traveled toward their sources. Image by author.
Once we think in this projection, we can work with angles and distances on the sphere. For example, we expect similar sentences to cluster together: "The cat sat on the mat" and "A feline rested on the rug" end up near each other, while unrelated sentences end up far apart. This clustering is how embedding models are trained.
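On a unit sphere, the distance between two normalized embeddings is just the arccosine of their dot product. A minimal numpy sketch; the four-dimensional vectors here are toy stand-ins for real sentence embeddings, not the output of any actual model:

```python
import numpy as np

def normalize(v):
    """Project an arbitrary embedding onto the unit sphere."""
    return v / np.linalg.norm(v)

def angular_distance(x, y):
    """Angle in radians between two unit vectors on the sphere.
    The clip guards against tiny floating-point excursions outside [-1, 1]."""
    return np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))

# Toy 4-d "embeddings": two similar sentences and one unrelated.
cat_mat = normalize(np.array([0.9, 0.1, 0.0, 0.1]))      # "The cat sat on the mat"
feline_rug = normalize(np.array([0.8, 0.2, 0.1, 0.1]))   # "A feline rested on the rug"
stock_market = normalize(np.array([0.0, 0.1, 0.9, 0.2])) # something unrelated

print(angular_distance(cat_mat, feline_rug))   # small angle: near each other
print(angular_distance(cat_mat, stock_market)) # large angle: far apart
```

With real embedding models the same two calls apply unchanged; only the vectors come from an encoder instead of being typed by hand.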
So now consider what happens in RAG. We have three pieces of text (Figure 1):
- The question, q (one point on the sphere)
- The retrieved context, c (another point)
- The generated response, r (a third point)
Three points on a sphere form a triangle. And triangles have geometry (Figure 2).
The Laziness Hypothesis
When a model uses the retrieved context, what should happen? The response should depart from the question and move toward the context, picking up the vocabulary, framing, and concepts of the source material. Geometrically, this means the response should be closer to the context than to the question (Figure 1).
But when a model hallucinates, ignoring the context and generating something from its own parametric knowledge, the response stays in the question's neighborhood. It continues the question's semantic framing without venturing into unfamiliar territory. I called this semantic laziness. The response doesn't travel. It stays home. Figure 1 illustrates the laziness signature: question q, context c, and response r form a triangle on the unit sphere. A grounded response ventures toward the context; a hallucinated one stays home near the question. The geometry is high-dimensional, but the intuition is spatial: did the response actually go anywhere?
Semantic Grounding Index
To measure this, I defined a ratio of angular distances:

SGI(q, c, r) = θ(q, r) / θ(c, r)

where θ(x, y) = arccos(⟨x, y⟩) is the angle between two unit embeddings. I called it the Semantic Grounding Index, or SGI.
If SGI is greater than 1, the response departed toward the context. If SGI is less than 1, the response stayed close to the question: the model wasn't able to explore the answer space and remained in the question's neighborhood (a kind of safety state). SGI is just two angles and a division. No neural networks, no learned parameters, no training data. Pure geometry.
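In code, the whole metric is a few lines. A sketch in numpy, assuming SGI is the ratio of the response-question angle to the response-context angle (so SGI > 1 means the response moved toward the context, matching the description above); the vectors are synthetic stand-ins for real embeddings:

```python
import numpy as np

def sgi(q, c, r):
    """Semantic Grounding Index: ratio of the response-question angle
    to the response-context angle. Values above 1 mean the response
    departed from the question toward the context."""
    theta = lambda x, y: np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))
    return theta(q, r) / theta(c, r)

unit = lambda v: v / np.linalg.norm(v)
rng = np.random.default_rng(0)

# Random high-dimensional unit vectors standing in for question and context.
q = unit(rng.normal(size=768))
c = unit(rng.normal(size=768))

grounded = unit(0.2 * q + 0.8 * c)  # a response that moved toward the context
lazy = unit(0.9 * q + 0.1 * c)      # a response that stayed near the question

print(sgi(q, c, grounded))  # > 1: traveled toward its source
print(sgi(q, c, lazy))      # < 1: stayed home
```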
Figure 2: Geometric interpretation of SGI on the embedding hypersphere. Valid responses (blue) depart angularly toward the context; hallucinations (red) remain near the question: the semantic laziness signature. Image by author.
Does It Actually Work?
Simple ideas need empirical validation. I ran this on 5,000 samples from HaluEval, a benchmark where we know the ground truth: which responses are genuine and which are hallucinated.
Figure 3: Five embedding models, one pattern. Solid curves show valid responses; dashed curves show hallucinations. The distributions separate consistently across all models, with hallucinated responses clustering below SGI = 1 (the "stayed home" threshold). The models were trained by different organizations on different data, yet they agree on which responses traveled toward their sources. Image by author.
I ran the same analysis with five completely different embedding models: different architectures, different training procedures, different organizations (Sentence-Transformers, Microsoft, Alibaba, BAAI). If the signal were an artifact of one particular embedding space, these models would disagree. They didn't. The average correlation across models was r = 0.85 (ranging from 0.80 to 0.95).
Figure 4: Correlation between the different models and architectures used in the experiment. Image by author.
When the Math Predicted Something
Up to this point, I had a useful heuristic. Useful heuristics are fine. But what happened next turned a heuristic into something more principled: the triangle inequality. You probably remember it from school: the sum of any two sides of a triangle must be greater than the third side. This constraint applies on spheres too, though the formula looks slightly different.
The spherical triangle inequality, |θ(q, r) − θ(c, r)| ≤ θ(q, c), constrains the admissible SGI values. Image by author.
If the question and context are very close together (semantically similar), there isn't much "room" for the response to differentiate between them. The geometry forces the two angles to be similar regardless of response quality, and SGI values get squeezed toward 1. But when the question and context are far apart on the sphere? Now there's geometric space for divergence. Valid responses can clearly depart toward the context. Lazy responses can clearly stay home. The triangle inequality loosens its grip.
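The squeeze is easy to check numerically. On the sphere, the triangle inequality gives |θ(q, r) − θ(c, r)| ≤ θ(q, c), so when θ(q, c) is small the two angles in the SGI ratio are forced close together and SGI is pinned near 1. A sketch with synthetic vectors (an assumed construction for illustration, not the paper's exact experiment):

```python
import numpy as np

unit = lambda v: v / np.linalg.norm(v)
theta = lambda x, y: np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))

def sgi_spread(mix, n=500, d=768, seed=1):
    """Place the context c at a controllable angle from the question q,
    then measure how widely SGI can vary over random responses."""
    rng = np.random.default_rng(seed)
    q = unit(rng.normal(size=d))
    other = unit(rng.normal(size=d))
    c = unit(mix * q + (1 - mix) * other)  # mix near 1 => c close to q
    sgis = [theta(q, r) / theta(c, r)
            for r in (unit(rng.normal(size=d)) for _ in range(n))]
    return theta(q, c), max(sgis) - min(sgis)

for mix in (0.99, 0.7, 0.2):
    sep, spread = sgi_spread(mix)
    print(f"theta(q,c) = {sep:.2f}  ->  SGI spread = {spread:.3f}")
```

As the question-context angle grows, the range of SGI values the geometry admits widens, which is exactly the room that discrimination needs.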
This implies a prediction:
SGI's discriminative power should increase as question-context separation increases.
The results confirm this prediction: a monotonic increase, exactly as the triangle inequality predicted.
| Question-Context Separation | Effect Size (d) | AUC |
| --- | --- | --- |
| Low (similar) | 0.61 | 0.72 |
| Medium | 0.90 | 0.77 |
| High (different) | 1.27 | 0.83 |

Table 1: SGI discriminative power increases with question-context separation.
This difference carries epistemic weight. Observing behaviour in data after the fact offers weak evidence; such behaviour may reflect noise or analyst degrees of freedom rather than genuine structure. The stronger test is prediction: deriving what should happen from basic principles before examining the data. The triangle inequality implied a specific relationship between θ(q, c) and discriminative power. The empirical results confirmed it.
Where It Doesn't Work
TruthfulQA is a benchmark designed to test factual accuracy, with questions like "What causes the seasons?" paired with correct answers ("Earth's axial tilt") and common misconceptions ("Distance from the Sun"). I ran SGI on TruthfulQA. The result: AUC = 0.478. Slightly worse than random guessing.
Angular geometry captures topical similarity. "The seasons are caused by axial tilt" and "The seasons are caused by solar distance" are about the same topic. They occupy nearby regions on the semantic sphere. One is true and one is false, but both are responses that engage with the astronomical content of the question.
SGI detects whether a response departed toward its sources. It cannot detect whether the response got the facts right. These are fundamentally different failure modes. It's a scope boundary, and knowing your scope boundaries is arguably more important than knowing where your method works.
What This Means Practically
If you're building RAG systems, SGI correctly ranks hallucinated responses below valid ones about 80% of the time, without any training or fine-tuning. A few practical notes:
- If your retrieval system returns documents that are semantically very close to the questions, SGI will have limited discriminative power. Not because it's broken, but because the geometry doesn't permit differentiation. Consider whether your retrieval is actually adding information or just echoing the query.
- Effect sizes roughly doubled for long-form responses compared to short ones. This is precisely where human verification is most expensive; reading a five-paragraph response takes time. Automated flagging is most valuable exactly where SGI works best.
- SGI detects disengagement. Natural language inference detects contradiction. Uncertainty quantification detects model confidence. These measure different things. A response can be topically engaged but logically inconsistent, or confidently wrong, or lazily correct by accident. Defense in depth.
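Put together, a minimal flagging pass for a RAG pipeline might look like the sketch below. The embedding step is stubbed out with synthetic vectors; in practice you would obtain `q`, `c`, and `r` from your embedding model, and the function and threshold names here are illustrative, not part of any real library:

```python
import numpy as np

def angular(x, y):
    """Angle between two unit vectors, clipped for numerical safety."""
    return np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))

def flag_lazy_responses(triples, threshold=1.0):
    """triples: list of (question, context, response) unit embeddings.
    Returns indices of responses whose SGI falls below the threshold,
    i.e., responses that stayed near the question instead of traveling
    toward the retrieved context."""
    flagged = []
    for i, (q, c, r) in enumerate(triples):
        if angular(q, r) / angular(c, r) < threshold:
            flagged.append(i)
    return flagged

# Demo with synthetic unit vectors standing in for real embeddings.
unit = lambda v: v / np.linalg.norm(v)
rng = np.random.default_rng(7)
q, c = unit(rng.normal(size=384)), unit(rng.normal(size=384))
grounded = unit(0.1 * q + 0.9 * c)  # traveled toward the context
lazy = unit(0.9 * q + 0.1 * c)      # stayed home near the question

print(flag_lazy_responses([(q, c, grounded), (q, c, lazy)]))  # [1]
```

Flagged responses would then go to whichever downstream check you already run (NLI, human review), in keeping with the defense-in-depth point above.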
The Scientific Question
I have a hypothesis about why semantic laziness happens. I want to be honest that it's speculation; I haven't proven the causal mechanism.
Language models are autoregressive predictors. They generate text token by token, each choice conditioned on everything before. The question provides strong conditioning: familiar vocabulary, established framing, a semantic neighborhood the model knows well.
The retrieved context represents a departure from that neighborhood. Using it well requires confident bridging: taking concepts from one semantic region and integrating them into a response that started in another region.
When an LLM is uncertain about how to bridge, the path of least resistance is to stay home. The model generates something fluent that continues the question's framing without venturing into unfamiliar territory, because that is statistically safe. As a consequence, the model becomes semantically lazy.
If this is right, SGI should correlate with internal model uncertainty: attention patterns, logit entropy, that sort of thing. Low-SGI responses should show signatures of hesitation. That's a future experiment.
Takeaways
- First: simple geometry can reveal structure that complex learned systems miss. I spent months trying to train hallucination detectors. The thing that worked was two angles and a division. Sometimes the right abstraction is the one that exposes the phenomenon most directly, not the one with the most parameters.
- Second: predictions matter more than observations. Finding a pattern is easy. Deriving what pattern should exist from first principles, then confirming it: that's how you know you're measuring something real. The stratified analysis wasn't the most impressive number in this work, but it was the most important.
- Third: boundaries are features, not bugs. SGI fails completely on TruthfulQA. That failure taught me more about what the metric actually measures than the successes did. Any tool that claims to work everywhere probably works nowhere reliably.
Honest Conclusion
I'm not sure whether semantic laziness is a deep truth about how language models fail, or just a useful approximation that happens to work for current architectures. The history of machine learning is littered with insights that seemed fundamental and turned out to be contingent.
But for now, we have a geometric signature of disengagement: a practical hallucination detector. It's consistent across embedding models. It's predictable from mathematical first principles. And it's cheap to compute.
That feels like progress.
Note: The scientific paper with complete methodology, statistical analyses, and reproducibility details is available at https://arxiv.org/abs/2512.13771.
You can cite this work in BibTeX as:
@misc{marín2025semanticgroundingindexgeometric,
title={Semantic Grounding Index: Geometric Bounds on Context Engagement in RAG Systems},
author={Javier Marín},
year={2025},
eprint={2512.13771},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2512.13771},
}
Javier Marín is an independent AI researcher based in Madrid, working on reliability assessment for production AI systems. He tries to be honest about what he doesn't know. You can contact Javier at [email protected]. Any contribution is welcome!

