# Natural Alpha Embeddings

## Abstract

Learning an embedding for a large collection of items is a popular approach to overcome the computational limitations associated with one-hot encodings. The aim of item embedding is to learn a low-dimensional space for the representations, whose geometry captures relevant features or relationships for the data at hand. This can be achieved, for example, by exploiting adjacencies among items in large sets of unlabelled data. In this paper we interpret in an Information Geometric framework the item embeddings obtained from conditional models. By exploiting the $\alpha$-geometry of the exponential family, first introduced by Amari, we introduce a family of natural $\alpha$-embeddings represented by vectors in the tangent space of the probability simplex, which includes as a special case standard approaches available in the literature. A typical example is given by word embeddings, commonly used in natural language processing, such as Word2Vec and GloVe. In our analysis, we show how the $\alpha$-deformation parameter can impact standard evaluation tasks.


## 1 Introduction

Item embedding is a collective name for a set of techniques extracting meaningful representations from a huge amount of unlabelled data, by exploiting the complex network of relationships among the set of items in a dictionary. Studying the geometry of the learned embedding space is fundamental to understand the kind of information which has been extracted and how it has been organized ?; ?; ?. Popular applications of interest range from recommendation systems ?; ?; ?; ? to approximate similarity based search and information retrieval for a wide variety of tasks ?; ?. In this paper we will study item embeddings with a particular focus on natural language processing, which allows us to provide intuitive examples based on common understanding.

Natural Language Processing (NLP) is a branch of machine learning which deals with the design of algorithms to effectively process natural language corpora. The one-hot encoding is a common representation for the words in a dictionary: each word is assigned to a different direction in the space, whose total dimension $n$ corresponds to the size of the dictionary. Such a sparse representation has the characteristic that all words are equidistant from each other, and no metric is implicitly defined a priori on the space. However, a practical issue in NLP arises in the presence of a very large dictionary: for instance, training a neural network that takes as input a sequence of $n$-dimensional vectors quickly becomes prohibitive, even for relatively limited language domains. The aim of item embedding, or word embedding in this specific case, is to project the one-hot encoding vectors onto a lower-dimensional space, mapping $\mathbb{R}^n$ to $\mathbb{R}^d$, with $d \ll n$. Several probabilistic models can be designed and subsequently trained to learn these compact representations. Notice that the training of the model at this stage is completely unsupervised, which allows the representations to leverage the huge amount of unlabelled text available.
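As a minimal sketch of this projection (the sizes, the random matrix `U` and the helper names are illustrative assumptions, not values from the paper), multiplying a one-hot vector by an $n \times d$ matrix simply selects one row of that matrix, so the embedding can be stored and looked up as a table:

```python
import numpy as np

def one_hot(index: int, n: int) -> np.ndarray:
    """Return the one-hot vector of length n with a 1 at position index."""
    e = np.zeros(n)
    e[index] = 1.0
    return e

def embed(index: int, U: np.ndarray) -> np.ndarray:
    """Project the one-hot encoding of a word onto the embedding space."""
    n = U.shape[0]
    return one_hot(index, n) @ U

# Toy sizes, chosen only for illustration.
n, d = 6, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(n, d))  # hypothetical embedding matrix, one row per word

vec = embed(2, U)  # identical to the row lookup U[2]
```

In practice the matrix product is never formed explicitly; the row lookup gives the same vector at a fraction of the cost.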

Neural network language models ?; ? introduce the idea of using the internal representation of a neural network to construct the word embedding. In particular Bengio et al. ? were among the first to propose the use of a neural network to predict the probability of the next word given the previous ones ($N$-gram model). A matrix is used to project the one-hot encoding of the previous words onto a lower-dimensional linear space. Then a neural network is used to generate the probabilities of the next word, given the concatenation of the projections of the previous ones. The word representations and the weights of the neural network are learned simultaneously, in order to obtain a matrix able to project similar words close to each other. Mikolov et al. ? were the first to introduce a recurrent language model, using a Recurrent Neural Network (RNN) to model the vector representations in the embedded space. Such an approach has the advantage of using fewer weights compared to an $N$-gram approach, while still potentially being able to learn quality word embeddings depending on the previous words (given the limitations of learning long-term dependencies with vanilla RNNs, cf. Bengio et al. (1994)).

It is a well known, and at first surprising fact, that syntactic and semantic analogies between words (e.g., ) translate into vectorial relationships between the respective word vectors learned by the embedding, e.g., () ?; ?; Lee (2015); ?; ?; ?. Even relatively simple models have been shown to work well at capturing syntactic and semantic analogies, in particular we can mention the Skip-Gram (SG) ?; ? and the Continuous Bag Of Words model (CBOW) ?. In such models the training task is to predict a word from its context or vice versa. SG tries to predict the words in the context from the central word, while CBOW makes a sum of the representations of all the vectors of the context and tries to predict the central word. A slightly different approach based on global statistics of the corpus is given by GloVe (Global Vectors), introduced by Pennington et al. ?. The novelty of this approach is that it is learning directly from the counts of the co-occurrences, thus it does not need to iterate over the corpus during training as SG. This is particularly advantageous for big corpora in which the dimension of the corpus is much bigger than the dimension of the global matrix of co-occurrences.

There have been numerous works investigating the origin of the linear structure of word analogies in the embedding space. Pennington et al. give a clear intuitive explanation in their paper on GloVe ?. More recently Arora et al. ? formalized this further by introducing a Hidden Markov Model for text generation and assuming that word embedding vectors are isotropically distributed in space after learning. Nevertheless, how well information is actually encoded in the space of word embeddings does not seem to be completely understood yet. The embedding of a particular word is based on the co-occurrences within the context, i.e., words with similar contexts tend to be projected nearby in the embedded space. This makes it unclear whether opposite words (e.g., hot and cold) will be near each other in the embedded space, and in general how well words will be spaced. An additional motivation for studying the word embedding distances is that, in evaluating accuracies on analogies $a : b = c : d$, it is common practice to remove the three query words $a$, $b$, $c$ from the set of possible returned results, since otherwise the answer is often one of these three query words. A further problem related to the expressivity of the embedding space is word polysemy ?, which is still an object of investigation. More recent studies aim to deploy different models, based on transformers (suitable for capturing long-term dependencies), to solve some of these issues and increase the overall performance of the learned embeddings ?; ?; ?; ?; ?. This opens up a plethora of language models and their associated pretraining strategies for the respective word embeddings. It is clear that a deep understanding of the meaning of distances and directions in the word embedding space in the different models is of key importance for numerous NLP tasks using word embeddings as a building block.

We want to stress that all the different strategies listed so far have in common the parametrization of one or more discrete probability distributions over the dictionary, i.e., the identification of a point in a probability simplex. Our aim is to build a bridge between the geometrical view of Information Geometry and the probabilistic models applied in the literature. In this study, we focus our analysis on standard word embedding models based on the skip-gram conditional probability model. Using notions of Information Geometry we interpret the embedding as a vector in the tangent space of the manifold. Next, by exploiting the notion of $\alpha$-connection for dually flat statistical models, first introduced by Amari (1980, 1982a, 1982b), Nagaoka and Amari (1982) and Amari (1985), see also Lauritzen (1987), we define a family of $\alpha$-embeddings, which depends on the choice of the connection.

The use of Riemannian methods has been explored previously in the NLP literature. Lebanon (2006) was one of the first authors to propose learning a distance metric over the input space using a framework based on Information Geometry, with applications to text document classification. More recent applications of Riemannian optimization algorithms can be found in the work of Fonarev et al. (2017), who proposed the use of Riemannian methods to optimize the Skip-Gram Negative Sampling objective function over the manifold of required low-rank matrices, and Nickel and Kiela (2017), who introduced an approach to learn hierarchical representations of symbolic data by embedding them into hyperbolic space, with applications to word embedding. More recently Jawanpuria et al. (2019) proposed a geometric framework to learn bilingual mappings given monolingual embeddings and a bilingual dictionary, where the mapping problem is framed as a classification problem on smooth Riemannian manifolds. The notion of $\alpha$-divergence Amari (1985, 1987); Amari and Nagaoka (2000); Amari (2016) has also been exploited in several applications, for example in the social sciences Ichimori (2011); Wada (2012).

## 2 Conditional Models and the Embeddings Structure

Let $D$ be a dictionary of cardinality $n$, represented by a one-hot encoding, i.e., each word is identified with a vector of the canonical basis of $\mathbb{R}^n$. Such an encoding does not impose any structure on the space of the representations, since all words are equidistant, but it is also not a practical representation for large dictionaries, and dimensionality reduction techniques are usually required. A word embedding is a mapping from $D$ to a lower-dimensional vector space $\mathbb{R}^d$, with $d \ll n$. One of the simplest, but still effective, models in the literature is the Skip-Gram conditional model ?; ?, modelling the conditional probability distribution of the words in the context given the central word. Incidentally, the idea behind this model can also be found in a famous quote by Firth, “you shall know a word by the company it keeps” ?. For each word $w$, let us fix a window around $w$, and let the context of $w$ be the set of words $\chi$ appearing in that window. The skip-gram model ?; ? associates to each word $w$ a conditional probability distribution $p(\chi \mid w)$ with $\chi \in D$, expressed by

$$ p(\chi \mid w) = p_w(\chi) = \frac{\exp(u_w^{\top} v_\chi)}{Z_w}, \qquad Z_w = \sum_{\chi' \in D} \exp(u_w^{\top} v_{\chi'}). \tag{1} $$

Notice that such a model assigns to each word $w$ two vectors $u_w, v_w \in \mathbb{R}^d$, which are used when the word in question is the central word or a word of the context, respectively. The vectors for $w \in D$ form two projection matrices $U, V \in \mathbb{R}^{n \times d}$, with rows given by $u_w$ and $v_w$. The matrices $U$ and $V$ can be learned from the data by maximum likelihood estimation, with different objective losses: the likelihood of the word pairs observed in the corpus (word2vec ?; ?), or using a stochastic matrix factorization approach, as in GloVe ?.
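The conditional model can be sketched numerically as follows. This is an illustration under the assumption that the conditional distribution of a context word given the central word is a softmax of the inner products between the central vector and all context vectors; the toy matrices and the function name are hypothetical.

```python
import numpy as np

def skipgram_conditional(U: np.ndarray, V: np.ndarray, w: int) -> np.ndarray:
    """Distribution over context words given the central word w."""
    logits = V @ U[w]                # inner products with every context vector
    logits = logits - logits.max()   # shift for numerical stability; the softmax is unchanged
    weights = np.exp(logits)
    return weights / weights.sum()   # division by the partition function

rng = np.random.default_rng(0)
n, d = 5, 2
U = rng.normal(size=(n, d))  # central-word vectors, one per row (toy values)
V = rng.normal(size=(n, d))  # context-word vectors, one per row (toy values)

p_w = skipgram_conditional(U, V, w=0)
```

Each central word yields its own point of the simplex, while the matrix of context vectors is shared by all of them.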

It has been demonstrated in the literature ? that word2vec Skip-Gram with negative sampling ? is equivalent to a matrix factorization à la GloVe. Without loss of generality we will therefore mainly focus on GloVe as a training methodology, since it is computationally more convenient for large corpora. The matrices $U$ and $V$ can be learned from the data by optimizing the weighted least-squares objective

$$ \min_{U, V, b, \tilde b} \; \sum_{i,j} f(X_{ij}) \left( u_i^{\top} v_j + b_i + \tilde b_j - \log X_{ij} \right)^2 , $$

where $X$ is the matrix counting all the co-occurrences of pairs of words in a corpus. The function $f$ weights the error in the matrix factorization, depending on the frequency of the pair of words in question. A typical choice is

$$ f(x) = \begin{cases} (x / x_{\max})^{3/4} & \text{if } x < x_{\max}, \\ 1 & \text{otherwise}, \end{cases} \tag{2} $$

where $x_{\max}$ is a cutoff, usually fixed to 100, cf. ?. The conditional model in Eq. (1) corresponds to the exponential family

where are the parameters of the model. The family is not written in canonical form, since the vector of natural parameters corresponds to , with sufficient statistics , , see for instance Section 3.4 of Casella (2001). In the following we adopt a different perspective. We consider the matrix $V$ fixed after the inference process, so that the exponential family can be written in canonical form with natural parameters and sufficient statistics , for , cf. Rudolph et al. (2016). The vector defines a family of conditional probability distributions and each , for corresponds to a different distribution with parameters in the exponential family identified by a fixed $V$. By conditioning over different , we obtain different probability distributions which all belong to the same exponential family. Let us fix some notation at this point. We will refer to the rows of the matrix as or , and to its columns as . In this way the following expressions define the same element . We will use one or the other notation throughout the paper, according to convenience.

Once the embeddings $U$ and $V$ have been learned, a typical task of interest consists in evaluating similarities between words. We refer to the relevant literature for the different measures proposed ?; ?; ?; ?. Another task of interest is the evaluation of analogies. Starting from an analogy of the form $a : b = c : d$, Mikolov et al. ? showed how it can be efficiently solved for one of its arguments, for instance $d$, by

$$ d = \arg\min_{d'} \left\| u_a - u_b - u_c + u_{d'} \right\|^2 . \tag{3} $$
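A minimal sketch of this procedure, assuming that the analogy is solved by a nearest-neighbour search in Euclidean norm and that the three query words are excluded from the candidates (the common practice recalled in the Introduction); the function name and the toy vectors are illustrative:

```python
import numpy as np

def solve_analogy(U, a, b, c):
    """Return the index d minimizing ||u_b - u_a + u_c - u_d||^2, excluding a, b and c."""
    target = U[b] - U[a] + U[c]
    dist = np.sum((U - target) ** 2, axis=1)
    dist[[a, b, c]] = np.inf   # the query words are not valid answers
    return int(np.argmin(dist))

# Toy embedding in which word 3 completes the analogy 0 : 1 = 2 : 3 exactly.
U = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [5.0, 5.0]])

d = solve_analogy(U, a=0, b=1, c=2)
```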

There have been several attempts to interpret such linear behavior, see for example ?; ? and ?. In the following we provide an intuitive explanation starting from the argument of Pennington et al. ?, according to which, for the words satisfying an analogy $a : b = c : d$, the relationship between the contexts of the word $a$ and the word $b$ is the same as the relationship which holds between the contexts of the words $c$ and $d$. Solving an analogy then corresponds to finding $d$ such that

(4) |

i.e., by minimizing the average, over all possible words of the context, of the difference between the ratios of probabilities. We observe that under two hypotheses, namely the isotropy of the covariance matrix associated to the row vectors of $V$, and the “stability” of $Z_w$ in Eq. (1) with respect to $w$ (i.e., $Z_w \approx Z$ for any $w$), Eqs. (4) and (3) are equivalent. Indeed, using the isotropy of the row vectors of $V$ we can write

(5) |

which, by using the stability of the $Z_w$, reduces to Eq. (4).

The hypotheses of isotropy and stability of the $Z_w$ have been discussed in ?, and in particular the stability of $Z_w$ has been experimentally verified for 4 different word embedding objectives, namely Squared Norm, GloVe, CBOW and SG, see also ?. Eq. (5) is of particular interest for this paper, since this formula will be generalized in Section 5. For the moment let us just notice that, by considering the columns of $V$ as centered sufficient statistics (i.e., in case the rows of $V$ have zero mean), $V^{\top} V$ is proportional to the Fisher information matrix in the tangent space of the uniform distribution. This statement will be made more precise in Section 5.

## 3 Over-parametrization of the Simplex and Mapping to the Sphere

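The exponential family of Eq. (1) represents in reality a submodel of the simplex. The full-dimensional simplex, embedded in $\mathbb{R}^n$, is the set $\Delta = \{ p \in \mathbb{R}^n : p_i \geq 0, \ \sum_i p_i = 1 \}$. As usually happens in the machine learning community, an over-parametrization of the simplex is commonly used, through a function called softmax. This consists in taking a vector $\theta \in \mathbb{R}^n$ and mapping it to the point $p \in \Delta$ such that

$$ p_i = \frac{\exp(\theta_i)}{\sum_{j=1}^{n} \exp(\theta_j)} . \tag{6} $$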

The simplex can also be mapped to the sphere with the canonical identification $p \mapsto 2\sqrt{p}$ (see for example Guy (2015)).
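The two maps can be illustrated with a short sketch. The identification of the simplex with the sphere is assumed here to be $p \mapsto 2\sqrt{p}$, which sends probability vectors to points of norm 2; the input vector and the function names are arbitrary choices made for the example.

```python
import numpy as np

def softmax(theta):
    """Map an unconstrained vector to a point of the simplex."""
    z = np.exp(theta - np.max(theta))  # shifting by a constant does not change the result
    return z / z.sum()

def to_sphere(p):
    """Map a point of the simplex to the sphere of radius 2 (assumed identification)."""
    return 2.0 * np.sqrt(p)

theta = np.array([0.5, -1.0, 2.0, 0.0])
p = softmax(theta)
p_shifted = softmax(theta + 3.0)  # same distribution: the parametrization is redundant
s = to_sphere(p)
```

The equality of `p` and `p_shifted` is the over-parametrization mentioned above: adding a constant to all components moves along a direction that leaves the distribution unchanged.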

In the case of interest for the present paper (1) we are considering a submodel of the space , given by the Span of the columns of

(7) |

The submanifold of the simplex identified by such model is and the corresponding submanifold of the sphere is . As can be easily verified the points of this set are also points of and thus . The tangent space of the submanifold of the sphere is then calculated by means of the pushforward of the composite mapping ,

(8) |

given the obvious identification . A tangent vector on the subsphere can thus be written as

(9) |

where is the vector with all ones in and the of a vector defines a diagonal matrix whose diagonal corresponds to the vector itself. Notice that the mapping in Eq. 9 is not full rank. In particular the pushforward has rank (null eigenvalue in the direction of ), corresponding to the fact that each increase of in the direction () does not affect the resulting probability distribution.

## 4 $\alpha$-representation

We have seen in Section 3 how the submodel of the simplex, defined by Eq. (1), can be mapped to the sphere. In this section we briefly recap how this reasoning can be extended to a family of diffeomorphisms parametrized by a parameter $\alpha$ Amari (1985); Amari and Nagaoka (2000); Amari and Cichocki (2010).

Given a finite sample space of cardinality , the dimensional simplex embedded in is the set of probability distributions over . We denote with its interior, i.e. . In Information Geometry Amari (1985); Amari and Nagaoka (2000); Amari (2016), the interior of the simplex is commonly represented as a statistical manifold endowed with the Fisher-Rao metric. A tangent vector is defined such that , for some , which gives the common characterization for tangent vectors of the simplex. The Riemannian metric of is

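$$ \langle \xi, \eta \rangle_p = \sum_{i=1}^{n} \frac{\xi_i \, \eta_i}{p_i}, \tag{10} $$

for two tangent vectors $\xi$ and $\eta$ at the point $p$.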

Through the metric, it is possible to compute the normal vector in to with respect to , which is given by itself, i.e. for all . Let us denote as the ambient space of the simplex with the metric . Let us now consider the family of mappings called -representation of , given by Amari and Nagaoka (2000); Amari and Cichocki (2010)

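$$ \ell^{\alpha}(p) = \begin{cases} \dfrac{2}{1-\alpha} \, p^{\frac{1-\alpha}{2}} & \text{if } \alpha \neq 1, \\[2mm] \log p & \text{if } \alpha = 1, \end{cases} \tag{11} $$

with derivative $\partial \ell^{\alpha} / \partial p = p^{-\frac{1+\alpha}{2}}$, and inverse

$$ p = \begin{cases} \left( \dfrac{1-\alpha}{2} \, \ell^{\alpha} \right)^{\frac{2}{1-\alpha}} & \text{if } \alpha \neq 1, \\[2mm] \exp(\ell^{\alpha}) & \text{if } \alpha = 1. \end{cases} \tag{12} $$

For $\alpha = 0$, the simplex is mapped on the sphere with the canonical identification $p \mapsto 2\sqrt{p}$, while for $\alpha = -1$ the mapping becomes the identity.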
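The following sketch implements the $\alpha$-representation and its inverse componentwise, assuming Amari's usual form $\ell^{\alpha}(p) = \frac{2}{1-\alpha} p^{(1-\alpha)/2}$ for $\alpha \neq 1$ and $\log p$ for $\alpha = 1$. The function names are illustrative, and the code is only meant to check the two special cases mentioned in the text.

```python
import numpy as np

def alpha_rep(p, alpha):
    """Componentwise alpha-representation of a positive vector p."""
    p = np.asarray(p, dtype=float)
    if np.isclose(alpha, 1.0):
        return np.log(p)
    return 2.0 / (1.0 - alpha) * p ** ((1.0 - alpha) / 2.0)

def alpha_rep_inv(h, alpha):
    """Inverse of alpha_rep, mapping alpha-coordinates back to positive vectors."""
    h = np.asarray(h, dtype=float)
    if np.isclose(alpha, 1.0):
        return np.exp(h)
    return ((1.0 - alpha) / 2.0 * h) ** (2.0 / (1.0 - alpha))

p = np.array([0.1, 0.2, 0.3, 0.4])

identity_case = alpha_rep(p, -1.0)  # alpha = -1: the identity
sphere_case = alpha_rep(p, 0.0)     # alpha = 0: 2 * sqrt(p), a point of norm 2
```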

Let us call . Let be a tangent vector represented in the basis of the ambient space , the pushforward is a linear operator defined by

(13) |

In the following we will express all tangent vectors of in the basis of the ambient space . Let be two tangent vectors in , for the isometry condition we have

(14) |

The $\alpha$-representation defines a smooth isometry between and which is the ambient space with the metric induced by the transformation , i.e. from Eq. (14) it follows

(15) |

In other words, the of , the ambient space of , is defined in such a way that is an isometry. In the following, to favor a lighter notation, we will use to replace . This mapping also induces an isometry for the $\alpha$-family of Riemannian manifolds , which implies that geodesics can be mapped between manifolds through and its inverse. This has a direct computational implication: for $\alpha = 0$, the image of is the sphere endowed with the ambient metric of , for which metric geodesics correspond to arcs of great circles.

Following a standard construction in Information Geometry due to Amari (1987), we introduce the $\alpha$-family of connections, which are flat in the $\alpha$-representation $\ell^{\alpha}$, i.e., the Christoffel symbols in these coordinates vanish. This family of connections allows the definition of $\alpha$-geodesics between two distributions $p$ and $q$, by

$$ \gamma(t) = (\ell^{\alpha})^{-1} \big( (1-t) \, \ell^{\alpha}(p) + t \, \ell^{\alpha}(q) \big), \qquad t \in [0, 1]. \tag{16} $$

Notice that, unless $\alpha = -1$, the curve $\gamma(t)$ does not in general belong to the simplex, and a normalization is required, cf. Amari (2016). Since the $\alpha$-geodesic is not metric, $\gamma$ does not represent the shortest path between $p$ and $q$; instead it corresponds to the curve along which vectors remain parallel with respect to the $\alpha$-connection. The $\alpha$-connection also allows us to define an $\alpha$-logarithmic map as the inverse of the $\alpha$-exponential map, which in reads

(17) |
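The $\alpha$-geodesic of Eq. (16) can be sketched as a straight line in the $\alpha$-representation, followed by a renormalization. The sketch assumes Amari's usual form of the $\alpha$-representation and a plain division by the total mass as normalization; function names and the two toy distributions are illustrative.

```python
import numpy as np

def alpha_rep(p, alpha):
    """Componentwise alpha-representation (assumed form, see lead-in)."""
    if np.isclose(alpha, 1.0):
        return np.log(p)
    return 2.0 / (1.0 - alpha) * p ** ((1.0 - alpha) / 2.0)

def alpha_rep_inv(h, alpha):
    """Inverse of alpha_rep."""
    if np.isclose(alpha, 1.0):
        return np.exp(h)
    return ((1.0 - alpha) / 2.0 * h) ** (2.0 / (1.0 - alpha))

def alpha_geodesic(p, q, t, alpha):
    """Point at time t on the alpha-geodesic from p to q, renormalized to sum to 1."""
    h = (1.0 - t) * alpha_rep(p, alpha) + t * alpha_rep(q, alpha)
    point = alpha_rep_inv(h, alpha)
    return point / point.sum()

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

mid_mixture = alpha_geodesic(p, q, 0.5, alpha=-1.0)     # arithmetic mean of p and q
mid_exponential = alpha_geodesic(p, q, 0.5, alpha=1.0)  # normalized geometric mean
```

For $\alpha = -1$ no normalization is actually needed, since the straight line between two points of the simplex stays in the simplex; for the other values the division by the total mass is what brings the curve back.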

Using the typical notation for word embeddings of Section 2, with respect to the exponential family in Eq. (1), we have that the sample space coincides with the dictionary , are the natural parameters, and are the sufficient statistics for a given . In a slightly different notation, we can rewrite the $d$-dimensional exponential family of Eq. (1) as

(18) |

where is the vector of natural parameters, is the vector of sufficient statistics, and

(19) |

is the normalizing constant. The tangent space equals , where is a column of the matrix . The Fisher matrix reads

(20) |
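For an exponential family, the Fisher information matrix in the natural parameters equals the covariance of the sufficient statistics. A minimal sketch, assuming the sufficient statistics are the columns of an $n \times d$ matrix `V` and that `p` is a distribution over the $n$ words (both are toy values here, and the function name is hypothetical):

```python
import numpy as np

def fisher_matrix(V, p):
    """Covariance of the sufficient statistics (columns of V) under the distribution p."""
    mean = p @ V                  # expected value of each sufficient statistic
    centered = V - mean           # center the statistics with respect to p
    return centered.T @ (p[:, None] * centered)

rng = np.random.default_rng(0)
n, d = 8, 3
V = rng.normal(size=(n, d))
V = V - V.mean(axis=0)            # make the statistics centered under the uniform distribution
uniform = np.full(n, 1.0 / n)

I_uniform = fisher_matrix(V, uniform)
```

With centered statistics and the uniform distribution the result is `V.T @ V / n`, in agreement with the remark at the end of Section 2.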

The mapping has a Jacobian given by

(21) |

where is the matrix of the centered sufficient statistics. By applying the mapping to we can map the exponential family in Eq. (18) to a submanifold . By combining the Jacobian of with the pushforward in Eq. (13), we obtain a characterization of the tangent space of as a linear subspace , by

(22) |

where is a tangent vector in the parameter space . See Fig. 1 for a graphical representation.

Through the characterization of the basis, we can calculate the projection onto the tangent space of a vector . Let us define the matrix whose columns correspond to the basis vectors of expressed through the basis of the ambient space . The coordinates of the projection of a vector on the basis of are

(23) |

Let us now take two vectors in the ambient tangent space , the inner product in of their projections reads

(24) |

where we use the Fisher metric in the inner product, since is projecting on the basis of the sufficient statistics . We prove the following theorem.

###### Theorem

Let , then Eq. (24) reduces to .

Let us consider , given the distribution of Eq. (18) the Log map for a word becomes

(25) |

where is the vector of all ones. The vector of ones is orthogonal to in the simplex, that is, it is proportional to the pushforward of the normal vector to in for . The projection of such vector on the tangent space project away the ‘one’ component

(26) |

where are indices and we used the Einstein summation convention whenever possible (paired indices). We just proved that for , the projection (23) reduces to . It follows that Eq. (24) reduces to .

###### Corollary

Let , in the uniform distribution , Eq. (24) reduces to .

## 5 A Geometric Framework for Word Embedding

In this section we apply the geometric framework defined so far with the purpose of defining a family of $\alpha$-measures for word similarities and word analogies based on Information Geometry.

Given , and a reference measure , we introduce a family of geometric measures of $\alpha$-similarity for two words , by generalizing to the Riemannian case the computation of the cosine product between two tangent vectors. The intuition behind this definition is to provide a similarity measure based on the cosine product between two directions in the tangent space of pointing towards and , respectively. The inner products are computed with respect to the Fisher metric and the logarithmic maps with respect to the $\alpha$-connection. The computation of this quantity can be done by first obtaining the $\alpha$-representations of and . The second step consists in computing the $\alpha$-logarithmic map centered in of these two points using Eq. (17), followed by a projection on the tangent space

(27) |

see also Eq. (23). Finally, the last step consists in the evaluation of the Riemannian cosine product with respect to the Fisher matrix.

###### Definition 1

The $\alpha$-cosine similarity between two words and with respect to a reference distribution reads

(28) |

###### Proposition 1

The $\alpha$-cosine similarity between two words and with respect to a reference distribution for simplifies as

(29) |

It is a common approach in the literature of word embeddings to measure the similarity between two words $a$ and $b$ using the cosine product between the embedding vectors $u_a$ and $u_b$,

$$ \cos(u_a, u_b) = \frac{u_a^{\top} u_b}{\| u_a \| \, \| u_b \|}, \tag{30} $$

see for example ?; ?; ?. In the light of the previous proposition, the cosine product between the vectors in Eq. (30) is a special case of Eq. (28) for and when is isotropic. The following proposition provides a sufficient condition which guarantees the isotropy of the Fisher information matrix.

###### Proposition

Let be the uniform distribution. If the sufficient statistics are centered, i.e., , and the matrix is isotropic, then the Fisher information matrix is isotropic too.

Notice that, equivalently, the standard cosine product in Eq. (30) corresponds to the case when the computations are done in the natural parameters of the exponential family and the Fisher-Rao metric is replaced by the standard Euclidean metric.
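The relation between the two cosine products can be checked numerically. The sketch below assumes a generic symmetric positive definite matrix `G` playing the role of the metric (for instance a Fisher information matrix); the vectors and matrices are toy values and the function names are illustrative.

```python
import numpy as np

def riemannian_cosine(x, y, G):
    """Cosine between x and y with respect to the inner product x^T G y."""
    return float(x @ G @ y / np.sqrt((x @ G @ x) * (y @ G @ y)))

def euclidean_cosine(x, y):
    """Standard cosine product between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 2.0, 0.5])
y = np.array([0.3, -1.0, 2.0])

G_iso = 0.25 * np.eye(3)             # isotropic metric: a multiple of the identity
G_aniso = np.diag([1.0, 4.0, 0.1])   # anisotropic metric, for contrast

cos_iso = riemannian_cosine(x, y, G_iso)
cos_aniso = riemannian_cosine(x, y, G_aniso)
```

When the metric is isotropic the scale factor cancels between numerator and denominator, so the standard cosine is recovered; with an anisotropic metric the two quantities differ in general.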

The resolution of analogies can also be generalized in a geometric way. Given an analogy of the form $a : b = c : d$, we compute the $\alpha$-logarithmic maps and and parallel transport them to the same reference point with the $\alpha$-connection. Since the connection is flat in , this simply corresponds to a translation of the vectors. Once in the vectors can be projected onto , and the norm of the difference can be computed with respect to the metric. This gives a novel measure of word analogy which depends on $\alpha$ and can be written as

(31) |

Given a reference measure and a value of $\alpha$, this quantity can be used to solve an analogy, for instance by minimizing it over , i.e.,

(32) |

Let us notice that for , and for the exponential family of Eq. (18), Eq. (31) reduces to the norm of a vector in the tangent space of

(33) |

Furthermore, under the conditions in Proposition 5 which guarantees the isotropy of the Fisher information matrix, we recover the standard formulation of word analogy in Eq. (3).

## 6 Alpha Embeddings

As an alternative view, we propose to use as embeddings the coordinates of the projected Logarithmic map

(34) |

with , and where the coefficients

(35) |

change with $\alpha$ and represent the weights of a linear combination of the rows of the matrix of the centered sufficient statistics . While the alpha embeddings of Eq. (34) can be computed in any point and for any value of $\alpha$, they conveniently reduce to in the uniform distribution and for , due to Theorem 4.

Having fixed $\alpha$ and a reference point , we can thus rewrite the similarity and analogy measures in terms of the alpha embeddings as

(36) |

and

(37) |

## 7 Limit Embeddings

Let us notice that: when , the weight factor goes to zero for all words such that and grows on the others. Vice versa for , the weight factor grows for all words such that and goes to zero for the others. The limit of for a word then becomes interesting, since the factors will tend to a delta on the word . To achieve a numerically stable version of these embeddings we propose the following formula

(38) |

we will call this the limit embedding (LE), which depends only on the row of the matrix of sufficient statistics . This leads to very simple geometrical evaluation tasks of similarities and analogies in the limit.

## 8 Computational Stability

For negative values of $\alpha$, Eq. (34) quickly becomes numerically unstable. Since the interest is usually in the directions (similarities use the cosine product, and vectors are often normalized before evaluating analogies), we propose to compute a numerically stable version of as

(39) |

which is the rescaled version of Eq. (34), up to a normalization factor independent of . We will use this normalization trick to obtain numerically stable alpha embedding vectors in the rest of the paper.

## 9 Change of reference measure

In the literature of word embedding it is common practice to consider $u + v$ as embedding vectors for calculating similarities and analogies ?; ?; ?; ?; ?. The embedding vectors given by the sum $u + v$ have been experimentally shown to provide better results ? compared to using simply $u$. With regard to Eq. (1), summing the $u$ and $v$ vectors corresponds to a shift of the natural parameters of the exponential family. For each word this shift is different, and it can be interpreted as a change of reference measure for the conditional distribution of that particular word. The reweighted probabilities are

(40) |

where is the partition function for the vectors, and is the reference measure used for the word . The reference measure is based on the scalar product between the outer vectors (which are interpreted as the sufficient statistics, see Secs. 2 and 4). is higher for those words which behave more similarly to the word itself when in the context (similar direction for the outer vectors). Using Eq. (40) in place of Eq. (1) as the starting point to calculate the alpha embeddings, we obtain the U+V alpha embeddings.
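The interpretation of the sum of the two vectors as a change of reference measure can be verified on a toy example. The sketch assumes the softmax conditional model of Eq. (1) and takes, as an assumption, the reference measure of a word to be proportional to the exponential of the scalar products between its context vector and all the other context vectors; matrices and names are illustrative.

```python
import numpy as np

def softmax(z):
    z = np.exp(z - np.max(z))
    return z / z.sum()

rng = np.random.default_rng(0)
n, d = 6, 3
U = rng.normal(size=(n, d))  # central-word vectors (toy values)
V = rng.normal(size=(n, d))  # context-word vectors (toy values)
w = 2

p_w = softmax(V @ U[w])               # original conditional model for the word w
q_w = softmax(V @ V[w])               # assumed reference measure for the word w
shifted = softmax(V @ (U[w] + V[w]))  # model with shifted natural parameters u_w + v_w

# Reweighting the original model by the reference measure and renormalizing
reweighted = p_w * q_w
reweighted = reweighted / reweighted.sum()
```

The two constructions coincide because the exponential of a sum factorizes, so the shift of the natural parameters is absorbed into a word-dependent reference measure.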

## 10 Model Training

We performed experiments using the English Wikipedia dump from October 2017 (enwiki). We used the wikiextractor Python script to parse the XML file of the Wikipedia dump. We use a simple preprocessing to have a standard baseline: we lowercase all letters and remove stop-words and punctuation. To obtain the dictionary of words for the enwiki corpus we use a minimum frequency cut-off (m0) of 1000. The words occurring fewer than m0 times in the corpus are merged into a single unknown token (`<unk>`). In this way we obtained a dictionary of 67,336 words. In accordance with ? we choose a window size of 10 around each word (10 words preceding and 10 following), with a weighting that decays with the distance from the center for the calculation of co-occurrences. We trained our models with GloVe ? with vector sizes of 100, 200 and 300, for a maximum of 1000 epochs (each epoch means iterating over all the entries of the co-occurrence matrix). To make sure that the trained models are effectively comparable with the models in the literature, and to verify the correct convergence of the training, we evaluate accuracies on the word analogy tasks of ?; ?; ?, using the code available from the GloVe paper ?. We keep the vectors obtained after 1000 epochs, since the training has converged. We also analyzed the similarity performance of the model during training, as discussed in more detail in the Results section.
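The construction of the co-occurrence counts can be sketched as follows. The weighting of a pair at distance $k$ by $1/k$ is an assumption borrowed from the GloVe paper, and the symmetric window, the function name and the toy sentence are illustrative; the actual preprocessing pipeline is not reproduced here.

```python
from collections import defaultdict

def cooccurrences(tokens, window=10):
    """Symmetric, distance-weighted co-occurrence counts for a list of tokens."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for k in range(1, window + 1):
            j = i + k
            if j >= len(tokens):
                break
            weight = 1.0 / k                     # assumed decay: pairs at distance k count 1/k
            counts[(word, tokens[j])] += weight  # word as center, tokens[j] in its context
            counts[(tokens[j], word)] += weight  # and vice versa
    return dict(counts)

tokens = "the cat sat on the mat".split()
X = cooccurrences(tokens, window=2)
```

Only the non-zero entries are stored, which is what makes an epoch over the co-occurrence matrix much cheaper than an epoch over the corpus.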

| corpus | vec size | iter | Sem. | Syn. | Tot. |
|---|---|---|---|---|---|
| enwiki 1.48B | 100 | 200 | 67.40 | 55.11 | 60.39 |
| | | 400 | 69.13 | 55.40 | 61.30 |
| | | 600 | 69.38 | 55.51 | 61.47 |
| | | 800 | 69.72 | 55.51 | 61.62 |
| | | 1000 | 69.85 | 55.47 | 61.65 |
| | 200 | 200 | 77.38 | 62.14 | 68.69 |
| | | 400 | 78.22 | 62.65 | 69.35 |
| | | 600 | 78.56 | 62.52 | 69.42 |
| | | 800 | 78.83 | 62.62 | 69.59 |
| | | 1000 | 78.99 | 62.74 | 69.72 |
| | 300 | 200 | 80.79 | 63.83 | 71.12 |
| | | 400 | 82.21 | 64.32 | 72.01 |
| | | 600 | 82.46 | 64.53 | 72.24 |
| | | 800 | 82.54 | 64.60 | 72.31 |
| | | 1000 | 82.54 | 64.66 | 72.34 |
| enwiki 1.6B (GloVe paper ?) | 100 | 50 | 67.5 | 54.3 | 60.3 |
| | 300 | 100 | 80.8 | 61.5 | 70.3 |

## 11 Results

As discussed in Section 6 the alpha embeddings (denoted by E in tables and figures) can be calculated in any point on the manifold. In this section we consider three points of interest: the uniform distribution (0), the unigram distribution from the model (u) obtained calculating the marginals of the joint model deriving from Eq. 1 for fixed and , and the unigram distribution as obtained from the cooccurrences in the data (ud). The vectors are calculated for U or for U+V, with the change of reference measure explained in Section 9.

| vecsize | method | wordsim353 | mc | rg | scws | rw |
|---|---|---|---|---|---|---|
| 100 | E-0-NI-PI | 66.53 (-5.0) | 71.86 (-2.8) | 72.98 (-2.2) | 60.07 (-4.6) | 44.58 (-4.4) |
| | E-0-NF-PI | 67.45 (-5.0) | 71.57 (-2.6) | 72.38 (-1.6) | 59.87 (0.0) | 46.08 (1.8) |
| | E-0-NF-PF | 63.09 (-5.0) | 68.94 (-3.2) | 69.41 (-2.8) | 59.28 (-5.0) | 41.76 (-5.0) |
| | E-u-NI-PI | 67.10 (-5.0) | 72.50 (-2.0) | 74.60 (-4.4) | 60.72 (-4.4) | 47.72 (-5.0) |
| | E-u-NF-PI | 68.18 (-4.6) | 72.08 (-5.0) | 74.12 (-3.4) | 60.19 (-5.0) | 47.88 (-5.0) |
| | E-u-NF-PF | 64.00 (-5.0) | 74.82 (-2.0) | 73.08 (-4.0) | 60.18 (-5.0) | 44.91 (-5.0) |
| | U | 60.64 | 64.36 | 67.68 | 57.05 | 43.23 |
| | U+V-n | 62.50 | 68.88 | 70.65 | 57.68 | 42.80 |
| 200 | E-0-NI-PI | 68.25 (-5.0) | 79.09 (-3.6) | 77.67 (-2.6) | 62.22 (-3.2) | 52.21 (-4.8) |
| | E-0-NF-PI | 68.58 (-5.0) | 79.78 (-5.0) | 77.94 (-4.8) | 61.76 (-5.0) | 52.54 (-5.0) |
| | E-0-NF-PF | 66.23 (-5.0) | 75.58 (-5.0) | 74.80 (-5.0) | 61.47 (-4.4) | 50.73 (-5.0) |
| | E-u-NI-PI | 69.12 (-5.0) | 80.23 (-5.0) | 79.89 (-1.8) | 62.75 (-2.0) | 53.86 (-4.8) |
| | E-u-NF-PI | 69.38 (-2.6) | 82.21 (-3.6) | 79.63 (-3.6) | 62.47 (-5.0) | 54.19 (-5.0) |
| | E-u-NF-PF | 67.36 (-5.0) | 78.00 (-3.2) | 76.82 (-4.8) | 62.56 (-4.2) | 52.59 (-5.0) |
| | U | 60.24 | 68.86 | 67.07 | 57.72 | 45.37 |
| | U+V-n | 63.81 | 73.93 | 72.64 | 58.41 | 44.78 |
| 300 | E-0-NI-PI | 71.01 (-5.0) | 81.50 (-5.0) | 81.32 (-1.6) | 63.55 (-4.4) | 53.81 (-5.0) |
| | E-0-NF-PI | 71.12 (-5.0) | 82.90 (-1.8) | 82.56 (-1.8) | 63.27 (-5.0) | 53.97 (-5.0) |
| | E-0-NF-PF | 68.88 (-5.0) | 77.36 (-3.4) | 77.10 (-3.6) | 63.00 (-5.0) | 52.95 (-5.0) |
| | E-u-NI-PI | 71.96 (-4.8) | 84.66 (-1.6) | 83.95 (-1.4) | 63.72 (-4.8) | 55.67 (-4.8) |
| | E-u-NF-PI | 72.18 (-2.2) | 82.90 (-2.4) | 83.43 (-5.0) | 63.40 (-5.0) | 55.84 (-4.8) |
| | E-u-NF-PF | 70.50 (-5.0) | 80.96 (-2.0) | 80.68 (-4.8) | 63.74 (-4.6) | 55.11 (-5.0) |
| | U | 60.33 | 69.28 | 69.78 | 58.32 | 47.33 |
| | U+V-n | 64.42 | 74.49 | 75.28 | 58.98 | 46.04 |
| 300 | WG5-U+V | 65.08 | 73.82 | 77.85 | 62.18 | 51.54 |
| - | p_data-cn | 57.83 | 70.50 | 78.30 | 62.73 | 44.94 |

In Figure 2 we show how the space of word embeddings is deformed with alpha (see also the videos in the Supporting Information). This deformation can affect the evaluation of standard tasks, for example word similarity evaluated for varying alpha. Similarity is evaluated by means of the Spearman correlation between the human scores of the dataset and the similarity scores of the method ?; ?. We compare our selected similarity measures with methods from the literature using the cosine product on the vectors (denoted as U) and also the cosine product on the vectors as in ? (denoted as U+V). The authors of ? report normalizing the columns of the two matrices before the similarity evaluation, and indeed we notice that this process tends to increase the correlations for their methods. This normalization is reminiscent of a Caron factor of 0, cf. the analysis of Bullinaria et al. ?, and it is thus linked to the weighting of PCA components in the tangent space, which has been explicitly explored by several authors ?; ?; ?; ?. Among the baseline methods, we selected a simple one, a variant of what is also reported in ?: the cosine product of the rows of (a row is for fixed ), after centering and normalizing its columns (p_data-cn). Notice that this method does not require training, since the matrix can be simply estimated from the cooccurrences ?.
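
The evaluation protocol described above can be sketched as follows: compute a cosine similarity for each word pair of a dataset and correlate these with the human scores via Spearman's rank correlation. This is only an illustrative sketch in plain Python; the toy vectors, the toy scores and all function names are hypothetical, and skipping out-of-dictionary pairs is an assumption rather than something stated in the paper.

```python
import math


def cosine(a, b):
    """Standard Euclidean cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate_similarity(vectors, scored_pairs):
    """Correlate model similarities with human scores.

    Pairs with a word missing from `vectors` are skipped (an assumption).
    """
    model, human = [], []
    for w1, w2, score in scored_pairs:
        if w1 in vectors and w2 in vectors:
            model.append(cosine(vectors[w1], vectors[w2]))
            human.append(score)
    return spearman(model, human)


# Hypothetical 2-dimensional embeddings and human similarity scores.
toy_vectors = {
    "cat": [1.0, 0.1],
    "dog": [0.9, 0.3],
    "car": [0.1, 1.0],
    "bus": [0.2, 0.8],
}
toy_pairs = [
    ("cat", "dog", 8.5),
    ("car", "bus", 9.0),
    ("cat", "car", 2.0),
    ("dog", "bus", 3.0),
]
rho = evaluate_similarity(toy_vectors, toy_pairs)
print(rho)
```

Since Spearman correlation depends only on the ordering of the scores, any monotone rescaling of the similarity measure leaves the result unchanged.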

We test different cosine products in the tangent space, in which vectors are normalized (N) either with the Fisher matrix (NF) or with the identity (NI), and subsequently the scalar product (P) is performed either with the Fisher matrix (PF) or with the identity (PI). We have shown in Section 5 how, in , in the point 0 and in case the Fisher is isotropic, Eq. (28) reduces to Eq. (30) commonly used in the literature, see Propositions 1 and 5. This corresponds to the method E-0-NI-PI which for reduces to the standard scalar product in the Euclidean space U (as can also be observed in the figures reporting similarities for varying alphas). Analogously, this also holds for the U+V alpha methods, whose dependence on alpha is akin to that of the U alpha methods. U+V alpha methods are not reported in the plots to avoid overcrowding them, but they are further discussed later in the present section. In Figures 3 and 4 we show how the similarities of NI-PI and of NF-PF vary during training, for different vector sizes . Notice in the figures how the curves get progressively flatter during training, and simultaneously the similarities tend to improve for very negative alphas. The impact of vector size on the similarity correlations is further detailed in Table 2. In this table we also report for comparison the WG5-U+V similarities, for which we took the word embeddings available online trained on the WikiGiga5 corpus ?. Note that the WG5 vectors are trained on a much bigger corpus, of about 6B tokens. For the sake of a fair comparison, the correlations of WG5 are computed on the similarities between words belonging to the smaller enwiki dictionary. In theory this constitutes a direct advantage for WG5, since we are restricting to a smaller dictionary made of more frequent (thus supposedly easier) words.
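
The N/P naming scheme above can be illustrated with a small sketch, under the assumption that each vector is first divided by its norm in the chosen normalization metric and that the scalar product of the normalized vectors is then taken in the chosen product metric. The matrix `toy_fisher` is a hypothetical positive definite stand-in, not a Fisher matrix computed from a trained model, and the function names are illustrative.

```python
import math


def bilinear(a, metric, b):
    """Return a^T M b for a square matrix M given as a list of rows."""
    return sum(
        a[i] * metric[i][j] * b[j]
        for i in range(len(a))
        for j in range(len(b))
    )


def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def metric_similarity(a, b, norm_metric, product_metric):
    """Normalize a and b with norm_metric, then multiply with product_metric."""
    scale_a = math.sqrt(bilinear(a, norm_metric, a))
    scale_b = math.sqrt(bilinear(b, norm_metric, b))
    a_hat = [x / scale_a for x in a]
    b_hat = [y / scale_b for y in b]
    return bilinear(a_hat, product_metric, b_hat)


# Hypothetical tangent vectors and a hypothetical anisotropic metric.
a = [1.0, 0.0]
b = [1.0, 1.0]
toy_fisher = [[4.0, 0.0], [0.0, 1.0]]
eye = identity(2)

ni_pi = metric_similarity(a, b, eye, eye)                # NI-PI
nf_pi = metric_similarity(a, b, toy_fisher, eye)         # NF-PI
nf_pf = metric_similarity(a, b, toy_fisher, toy_fisher)  # NF-PF
print(ni_pi, nf_pi, nf_pf)
```

In this sketch NI-PI is the ordinary Euclidean cosine, and NF-PF coincides with it whenever the metric is a positive multiple of the identity, which mirrors the isotropic case mentioned in the text.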

In Figure 5 we plot the alpha methods against the baseline methods from the literature. The alpha methods reported in this figure perform better than the baselines for negative alphas. Methods in ud (not plotted) reach performance analogous to that of their counterparts in 0 and u, but they decay abruptly for very negative alphas. Remarkably, all methods using the limit embeddings of Eq. (38) perform well at all points considered: 0, u and ud. LE methods are not reported with horizontal lines in Fig. 5 so as not to impair the readability of the figure, but their values can be found in Table 3, and they will be analyzed when comparing with the literature.

The question arises as to the origin of the good performance of limit embeddings, or more generally of alpha embeddings for large negative alphas. Let us notice (Eq. (38)) that limit embeddings perform a clustering in space, in which the same limit embedding vector can be associated with one or more words. We hypothesize that this clustering learned by the system during training (corresponding to the learned sufficient statistics ) is good at extracting the relevant information for similarities. For demonstration purposes we trained an extra GloVe model with only 2 components. Since there are only 2 components, they can be plotted directly, and we can see how, in the limit case described in Section 6, the limit embeddings LE indeed correspond to a form of clustering in space (Figure 6). The embeddings have not been normalized in this figure; in this simple case the rows of the matrix turned out to have similar norms after training.

method | WSsim | WSrel | MEN | MTurk | RW | SimLex |
---|---|---|---|---|---|---|
LE-U-0-F | 75.92 | 67.49 | 74.56 | 68.49 | 51.17 | 35.86 |
LE-U-0-I | 76.60 | 67.71 | 74.50 | 66.00 | 51.04 | 37.56 |
LE-U-u-F | 70.04 | 59.89 | 71.11 | 67.98 | 48.01 | 32.00 |
LE-U-u-I | 72.35 | 62.74 | 72.61 | 68.66 | 49.50 | 32.95 |
LE-U-ud-F | 77.27 | 69.30 | 75.21 | 60.40 | 52.26 | 37.94 |
LE-U-ud-I | 72.81 | 56.79 | 70.93 | 50.73 | 50.89 | 37.12 |
LE-U+V-0-F | 75.72 | 66.59 | 74.73 | 68.71 | 54.16 | 37.73 |
LE-U+V-0-I | 76.46 | 67.12 | 74.76 | 65.94 | 54.82 | 40.05 |
LE-U+V-u-F | 69.62 | 58.20 | 70.95 | 68.31 | 49.55 | 32.87 |
LE-U+V-u-I | 71.96 | 61.27 | 72.59 | 68.98 | 51.37 | 34.00 |
LE-U+V-ud-F | 77.78 | 69.21 | 75.57 | 60.13 | 55.56 | 41.57 |
LE-U+V-ud-I | 73.62 | 57.21 | 71.28 | 50.31 | 53.83 | 41.57 |
Levy et al. 2015 | 74.6 | 64.3 | 75.4 | 61.6 | 26.6 | 37.5 |

Even though tuning alpha to obtain the best-performing geometry on each task is beyond the scope of the current paper, we would still like to compare some of our results with the literature. Let us consider the limit embeddings (LE), a simple method which does not require any cross-validation to tune alpha. As a reference we take Levy et al. (2015), who report the best method in cross-validation with a fixed window size of 10 (varying other hyperparameters). In Table 3 we report comparisons on different similarity datasets. The limit embeddings obtain better or comparable performance on all tasks. The limit methods with the Fisher metric in ud seem to perform better overall, even though they seem to fall short on the MTurk dataset. The methods using , the change of reference measure described in Section 9, seem to provide an improvement especially on the rare words (RW) and SimLex datasets.

## 12 Conclusions

We defined an Information Geometric framework for word embeddings. We introduced a novel family of measures for word similarities and analogies, depending on a deformation parameter , extending common approaches in the literature to their Riemannian counterparts. We evaluated our proposed measures on standard word similarity tasks and showed how our method can outperform previous approaches for a range of values of , recovering existing approaches for . For the enwiki corpus, we obtained a large improvement compared with the baselines. The analysis done so far is orthogonal to the training. It would be of great interest to develop methodologies that take advantage of the -representation during learning, and possibly to learn different geometries during training. The limit embeddings seem to provide a very simple and effective method, without the need to tune the alpha parameter. The experimental evaluation of word analogies and of the performance on different downstream tasks will be the object of future studies.

###### Acknowledgements.

The authors are supported by the DeepRiemann project, co-funded by the European Regional Development Fund and the Romanian Government through the Competitiveness Operational Programme 2014-2020, Action 1.1.4, project ID P_37_714, contract no. 136/27.09.2016.

### Footnotes

- email: {volpi,malago}@rist.ro

### References

- Information geometry of divergence functions. Bulletin of the Polish Academy of Sciences: Technical Sciences 58 (1), pp. 183–195 (en). External Links: ISSN 0239-7528, Link, Document Cited by: §4, §4.
- Methods of information geometry. American Mathematical Society, Providence, RI. Note: Translated from the 1993 Japanese original by Daishi Harada External Links: MathReview Entry Cited by: §1, §4, §4.
- Theory of information spaces: a differential geometrical foundation of statistics. Post RAAG Reports. Cited by: §1.
- Differential geometry of curved exponential families-curvatures and information loss. The Annals of Statistics, pp. 357–385. Cited by: §1.
- Geometrical theory of asymptotic ancillarity and conditional inference. Biometrika 69 (1), pp. 1–17. Cited by: §1.
- Differential-geometrical methods in statistics. Lecture Notes in Statistics, Vol. 28, Springer-Verlag, New York. External Links: ISBN 3-540-96056-2, MathReview (C. R. Rao) Cited by: §1, §1, §4, §4.
- Dual connections on the Hilbert bundles of statistical models. In Geometrization of statistical theory (Lancaster, 1987), Lancaster, pp. 123–151. Cited by: §1, §4.
- Information Geometry and Its Applications. Applied Mathematical Sciences, Vol. 194, Springer Japan, Tokyo (en). External Links: ISBN 978-4-431-55977-1 978-4-431-55978-8, Link, Document Cited by: §1, §4, §4.
- How we blessed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pp. 1–10. Cited by: Figure 2.
- Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5 (2), pp. 157–166. Cited by: §1.
- Riemannian optimization for skip-gram negative sampling. arXiv preprint arXiv:1704.08059. Cited by: §1.
- Statistical inference. 2nd edition, Duxbury Press. External Links: ISBN 0534243126, 9780534243128, Link Cited by: §2.
- Riemannian geometry and statistical machine learning. LAP LAMBERT Academic Publishing. Cited by: §3.
- On rounding off quotas to the nearest integers in the problem of apportionment. JSIAM Letters 3, pp. 21–24. Cited by: §1.
- Learning multilingual word embeddings in latent metric space: a geometric approach. Transactions of the Association for Computational Linguistics 7, pp. 107–120. Cited by: §1.
- Statistical manifolds. Differential geometry in statistical inference 10, pp. 163–216. Cited by: §1.
- Metric learning for text documents. IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (4), pp. 497–508. Cited by: §1.
- On the linear algebraic structure of distributed word representations. arXiv preprint arXiv:1511.06961. Cited by: §1.
- Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3, pp. 211–225. Cited by: Table 3, §11.
- Differential geometry of smooth families of probability distributions. Technical report Technical Report METR 82-7, Univ. Tokyo. Cited by: §1.
- Poincaré embeddings for learning hierarchical representations. In Advances in neural information processing systems, pp. 6338–6347. Cited by: §1.
- Exponential family embeddings. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon and R. Garnett (Eds.), pp. 478–486. External Links: Link Cited by: §2.
- A divisor apportionment method based on the kolm–atkinson social welfare function and generalized entropy. Mathematical Social Sciences 63 (3), pp. 243–247. Cited by: §1.
- WikiExtractor. Note: https://github.com/attardi/wikiextractor. Accessed: 2017-10. Cited by: §10.