Pedro Fernández de Córdoba, Carlos A. Reyes Pérez and Enrique A. Sánchez Pérez. Mathematical features of semantic projections and word embeddings for automatic linguistic analysis. DOI: 10.3934/math.2025185
Abstract:
Embeddings in normed spaces are a widely used tool in automatic linguistic analysis, as they help model semantic structures. They map words, phrases, or even entire sentences into vectors within a high-dimensional space, where the geometric proximity of vectors corresponds to the semantic similarity between the corresponding terms. This allows systems to perform various tasks like word analogy, similarity comparison, and clustering. However, the proximity of two points in such embeddings merely reflects metric similarity, which could fail to capture specific features relevant to a particular comparison, such as the price when comparing two cars or the size of different dog breeds. These specific features are typically modeled as linear functionals acting on the vectors of the normed space representing the terms, sometimes referred to as semantic projections. These functionals project the high-dimensional vectors onto lower-dimensional spaces that highlight particular attributes, such as the price, age, or brand. However, this approach may not always be ideal, as the assumption of linearity imposes a significant constraint. Many real-world relationships are nonlinear, and imposing linearity could overlook important non-linear interactions between features. This limitation has motivated research into non-linear embeddings and alternative models that can better capture the complex and multifaceted nature of semantic relationships, offering a more flexible and accurate representation of meaning in natural language processing.
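The idea of a semantic projection as a linear functional can be illustrated with a minimal sketch. It assumes one common construction, in which the functional is the inner product with the unit vector pointing from a "low" anchor word to a "high" anchor word; the abstract does not specify this construction. All vectors, words, and function names below are hypothetical toy values, not data from the paper.

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def make_projection(low_anchor, high_anchor):
    """Return the linear functional v -> <v, d> / ||d||, where d = high - low.

    Assumption: the attribute axis is defined by two anchor words.
    """
    direction = [h - l for h, l in zip(high_anchor, low_anchor)]
    norm = math.sqrt(dot(direction, direction))
    return lambda vec: dot(vec, direction) / norm

# Hypothetical 3-dimensional embeddings (real embeddings have hundreds of dimensions).
embeddings = {
    "chihuahua": [0.9, 0.1, 0.3],
    "beagle": [0.8, 0.4, 0.2],
    "great_dane": [0.85, 0.9, 0.25],
}
small_anchor = [0.1, 0.0, 0.5]  # hypothetical vector for "small"
large_anchor = [0.1, 1.0, 0.5]  # hypothetical vector for "large"

size = make_projection(small_anchor, large_anchor)
size_scores = {word: size(vec) for word, vec in embeddings.items()}
ranking = sorted(size_scores, key=size_scores.get)
```

In this toy setup the three breeds lie close together in the embedding, yet the functional separates them along the size axis. Because the map is linear, it cannot express attributes that depend on interactions between directions, which is the constraint the abstract points to.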
Application:
This article provides a direct methodological advance for the quantitative foundation of the project: it formalizes semantic projections on embeddings not only as linear functionals but also as Lipschitz functions on metric spaces, which makes it possible to model contextual and non-linear semantic relationships while remaining consistent with the distance structure of the embedding. It also proposes a mixed decomposition that combines a linear component (where appropriate) with a metric component, and defines explicit quality-control criteria (Lipschitz constant, invariance, and adequacy index) for auditing, comparing, and selecting sufficiently stable projections before they are used as indicators in the prospective analysis of scenarios for the Huerta. Overall, the work strengthens the traceability and robustness of Prometeo's AI/NLP tools by providing a mathematical framework that guides the construction of semantic indicators that are interpretable and comparable across heterogeneous documentary sources.
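A Lipschitz function f on a metric space satisfies |f(x) - f(y)| <= L * d(x, y) for all x and y. As a rough sketch of how the Lipschitz constant could serve as a quality check, the code below estimates the smallest such L on a finite labelled sample and then extends f to new points with the classical McShane formula F(x) = min_i (f(x_i) + L * d(x, x_i)). Whether the article uses this exact extension is an assumption here; the points, values, and function names are hypothetical, and the invariance and adequacy-index criteria are not sketched because this note does not define them.

```python
import math
from itertools import combinations

def euclidean(u, v):
    """Euclidean distance; any other metric on the embedding could be passed instead."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def lipschitz_constant(points, values, dist=euclidean):
    """Smallest L with |f(x_i) - f(x_j)| <= L * d(x_i, x_j) on the sample.

    Assumes the sample points are pairwise distinct (no zero distances).
    """
    return max(
        abs(values[i] - values[j]) / dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    )

def mcshane_extension(x, points, values, lip, dist=euclidean):
    """McShane formula: F(x) = min_i (f(x_i) + lip * d(x, x_i))."""
    return min(v + lip * dist(x, p) for p, v in zip(points, values))

# Hypothetical 2-dimensional embedded terms with a known attribute value each.
sample_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
sample_values = [1.0, 3.0, 2.0]

lip = lipschitz_constant(sample_points, sample_values)
new_term = [0.5, 0.5]  # hypothetical unlabelled term
predicted = mcshane_extension(new_term, sample_points, sample_values, lip)
```

A small constant means the attribute varies slowly relative to embedding distance, so nearby terms receive similar scores; a very large constant signals that the attribute is poorly aligned with the metric and the projection may be unstable as an indicator.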
We would like to acknowledge funding from the Generalitat Valenciana (Spain) through the PROMETEO 2024 CIPROM/2023/32 grant.
