Improving Optimization in Models With Continuous Symmetry Breaking
Abstract
Many loss functions in representation learning are invariant under a continuous symmetry transformation. As an example, consider word embeddings (Mikolov et al., 2013b), where the loss remains unchanged if we simultaneously rotate all word and context embedding vectors. We show that representation learning models with a continuous symmetry and a quadratic Markovian time series prior possess so-called Goldstone modes. These are low-cost deviations from the optimum which slow down convergence of gradient descent. We use tools from gauge theory in physics to design an optimization algorithm that solves the slow convergence problem. Our algorithm leads to a fast decay of Goldstone modes, to orders of magnitude faster convergence, and to more interpretable representations, as we show for dynamic extensions of matrix factorization and word embedding models. We present an example application: translating modern words into historic language using a shared representation space.
1 Introduction
Symmetries are coordinate transformations that leave a certain quantity invariant, such as the loss function of a machine learning model. Continuous symmetries, as opposed to discrete symmetries like mirror symmetries, are parameterized by real-valued numbers, e.g., rotation angles. In theoretical physics, continuous symmetries are often the starting point to formulate effective theories (Arnol'd, 2013). In particular, gauge theories describe the propagation of continuous symmetries across space, and explain many fundamental forces in nature (Peskin, 1995). This paper uses methods from gauge theory in the context of machine learning.
Symmetries may be spontaneously broken by interactions, i.e., weak couplings of system parameters. Many phases of matter, e.g., solids, magnets, or superfluids, emerge when such symmetry breaking occurs. The Goldstone theorem then guarantees the existence of low-energy excitations, called Goldstone modes (Altland & Simons, 2010), which signal shallow directions of the energy landscape.
In this paper, we show that Goldstone modes also appear in representation learning. Their low excitation energy translates to a small contribution of Goldstone modes to the loss function. This leads to an ill-conditioned optimization problem and to slow convergence of gradient descent. We present an algorithm that solves the problem of slow convergence by separating the small symmetry-breaking contributions to the loss function from the symmetry-obeying terms. The algorithm uses artificial gauge fields as a concise parameterization of the symmetry-breaking terms, and minimizes over them efficiently using natural gradients.
The particular model class we consider are dynamic matrix factorizations and dynamic embedding models (Lu et al., 2009; Koren, 2010; Charlin et al., 2015; Bamler & Mandt, 2017; Rudolph & Blei, 2017). These are time series models that exhibit multiple copies of a representation learning problem coupled by a quadratic regularizer. The coupling penalizes sudden changes of model parameters along the time dimension, thus allowing the model to share statistical strength across time steps. Specifically, the coupling is a sum over squared differences between the model parameters of adjacent time steps. In a Bayesian setup, such a coupling arises from a Gaussian Markovian time series prior.
Since these models typically do not assign any predefined meanings to specific directions in the representation space, the loss function is often invariant under a collective rotation of the embedding vectors. For example, a simultaneous rotation of all word and context embedding vectors in word2vec (Mikolov et al., 2013b) does not change the loss. A similar rotational symmetry exists in matrix factorization models. The quadratic coupling between adjacent time steps breaks the symmetry. We show that, even if the coupling is strong, the symmetry breaking contributions to the loss function can be small, which leads to a small contribution to the gradient, and to slow convergence of gradient descent.
Our contributions are as follows:

We identify a broad class of models which suffer from slow convergence due to Goldstone modes. We explain the effect of Goldstone modes on the speed of convergence both mathematically and pictorially.

Using ideas from gauge theories, we propose Goldstone Gradient Descent (GoldstoneGD), an algorithm that speeds up convergence by separating the optimization in the subspace of symmetry transformations from the remaining coordinate directions.

We evaluate the GoldstoneGD algorithm experimentally with dynamic matrix factorizations and Dynamic Word Embeddings. We find that GoldstoneGD converges orders of magnitude faster and finds more interpretable embedding vectors than standard gradient descent (GD) or GD with diagonal preconditioning.

For Dynamic Word Embeddings (Bamler & Mandt, 2017), GoldstoneGD allows us to find historical synonyms of modern English words, such as "wagon" for "car". Without our advanced optimization algorithm, we could not obtain this result.
Our paper is structured as follows. In Section 2, we specify the model class under consideration, provide concrete example models, and introduce the slow convergence problem. Section 3 describes related work. In Section 4, we propose the GoldstoneGD algorithm that solves the slow convergence problem. We report experimental results in Section 5 and provide concluding remarks in Section 6.
2 Problem Setting
In this section, we discuss the problem of slow convergence in representation learning with a continuous symmetry and a time series prior. We first provide a geometric visualization of Goldstone modes (Section 2.1), describe the relevant class of models (Section 2.2), and list concrete examples (Section 2.3). We finally show that Goldstone modes lead to slow convergence of gradient descent (GD) (Section 2.4).
2.1 Geometric Picture of Goldstone Modes
We give an intuitive picture of Goldstone modes in representation learning, deferring a more formal one to Section 2.4.
We consider a representation learning model whose loss is invariant under rotations of all embedding vectors in the representation space. For example, consider factorizing a large matrix X into the product UᵀV of two smaller matrices U and V by minimizing the loss ℓ(U, V) = ‖X − UᵀV‖²_F. We refer to the columns of U and V as embedding vectors. Rotating all embedding vectors by the same orthogonal matrix R, such that U → RU and V → RV, does not change the loss since (RU)ᵀ(RV) = UᵀRᵀRV = UᵀV.
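The rotational invariance described above is easy to verify numerically. The following is a minimal sketch in NumPy; all variable names and matrix sizes are our own illustration, not taken from the paper.

```python
import numpy as np

# Check that the factorization loss ||X - U^T V||_F^2 is unchanged when
# all embedding vectors (the columns of U and V) are rotated by the same
# orthogonal matrix R.
rng = np.random.default_rng(0)
d, n1, n2 = 3, 10, 8                 # embedding dim, numbers of embedding vectors
U = rng.normal(size=(d, n1))
V = rng.normal(size=(d, n2))
X = rng.normal(size=(n1, n2))        # "data" matrix to factorize

def loss(U, V):
    return np.sum((X - U.T @ V) ** 2)

# A random rotation: orthogonalize a random matrix via QR decomposition.
R, _ = np.linalg.qr(rng.normal(size=(d, d)))
assert np.allclose(R.T @ R, np.eye(d))             # R is orthogonal
assert np.isclose(loss(U, V), loss(R @ U, R @ V))  # loss is invariant
```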
For the sake of illustration, we consider a toy model with a two-dimensional representation space and a single embedding vector z. The red surface in Figure 1a) shows the loss function ℓ(z). The loss is rotationally symmetric, which leads to a degenerate minimum. The purple sphere depicts the single embedding vector and sits at one exemplary minimum. We can force the model to prefer one minimum over all others by adding a small term to the loss that does not obey the rotational symmetry. Even if this symmetry-breaking term is tiny, it can change the position of the selected minimum by a large distance in the representation space.
Such a small symmetry-breaking term arises, e.g., in time series models like dynamic matrix factorizations (Lu et al., 2009; Koren, 2010; Sun et al., 2012; Charlin et al., 2015). These models combine several instances of a rotationally symmetric model, coupling them with a quadratic regularizer that penalizes differences between model parameters of adjacent instances. Figure 1b) illustrates a time series model with T time steps. Each purple sphere depicts the embedding vector for one time step. The embedding vectors are connected by a quadratic coupling, which we can think of as springs between neighboring spheres (not drawn). The fact that the chain is not yet contracted to a single point reflects a small deviation from the minimum of the total loss, called a Goldstone mode.
Goldstone modes appear in practice: the plots in Figure 2 show snapshots of the embedding space in a small-scale Gaussian dynamic matrix factorization (details in Section 5.1). Points with the same color show the evolution of a given embedding vector along the time dimension of the model. In this toy example, the local loss is identical for each time step. Thus, in the optimum, the chains of points should, again, contract to a single point. The upper row of plots shows that Goldstone modes decay only slowly under gradient descent (the contraction of the chains happens slowly). In contrast, our proposed GoldstoneGD algorithm eliminates Goldstone modes quickly (bottom row). As the upper panel of Figure 2 shows, Goldstone modes contribute only little to the loss. However, the small difference in the loss can manifest itself in a large difference of the fitted embedding vectors, as we discuss in Section 2.4.
2.2 Model Class Specification
More formally, the slow convergence problem arises in the following class of models. We consider data X_t that are associated with additional metadata t ∈ {1, …, T}, such as a time stamp. For each t, the task is to learn a low-dimensional representation Z_t by minimizing a local loss function ℓ_t(Z_t). We add a quadratic regularizer L_prior that couples the representations along the t dimension. We refer to L_prior as the prior, adopting the language of a Bayesian setup. Thus, the overall loss function is
L(Z_{1:T}) = Σ_{t=1}^{T} ℓ_t(Z_t) + L_prior(Z_{1:T}).   (1)
For each task t, the representation Z_t is a d × n matrix whose n columns are d-dimensional embedding vectors. We assume that ℓ_t is invariant under a collective rotation of Z_t: let R be an arbitrary orthogonal rotation matrix of the same dimension d × d as the embedding dimension, then
ℓ_t(R Z_t) = ℓ_t(Z_t)  for all t.   (2)
Finally, we consider a special form of prior which is quadratic in Z_{1:T}, and which is defined in terms of a sparse symmetric coupling matrix Λ ∈ R^{T×T}:
L_prior(Z_{1:T}) = ½ Σ_{t,t'=1}^{T} Λ_{tt'} tr(Z_t Z_{t'}ᵀ).   (3)
Here, the matrix-vector multiplications with Λ are carried out in t-space, and the trace runs over the remaining dimensions. As we show in Section 2.3, Gaussian Markovian time series priors fall into this class, where Λ is a tridiagonal matrix. More generally, Λ = D − A is the Laplacian matrix of a sparse weighted graph (Poignard et al., 2018). Here, A is the adjacency matrix, whose entries A_{tt'} are the coupling strengths between tasks t and t', and the degree matrix D is diagonal and defined so that the entries of each row of Λ sum up to zero.
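For concreteness, the Laplacian of a Markovian chain of tasks can be sketched as follows. This is our own NumPy illustration; the sizes and the coupling strength are placeholders.

```python
import numpy as np

# Coupling matrix for a chain of T tasks: the adjacency matrix A couples
# neighboring time steps with strength lam, and the Laplacian is
# Lambda = D - A with the diagonal degree matrix D.
T, lam = 5, 0.1
A = lam * (np.eye(T, k=1) + np.eye(T, k=-1))  # chain-graph adjacency
D = np.diag(A.sum(axis=1))                    # degree matrix
Lambda = D - A                                # graph Laplacian (tridiagonal)

assert np.allclose(Lambda.sum(axis=1), 0.0)   # each row sums to zero
assert np.allclose(Lambda, Lambda.T)          # symmetric
```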
2.3 Exemplary Models
In this section, we introduce three particular instances of the model class presented in Section 2.2. These are also the models that we investigate in our experiments in Section 5.
Dense dynamic matrix factorization.
Consider the task of factorizing a large matrix X_t into a product U_tᵀV_t of two smaller matrices. The latent representation Z_t is the concatenation of the two embedding matrices,

Z_t = (U_t, V_t).   (4)
In a Gaussian matrix factorization, the local loss function is

ℓ_t(Z_t) = ½ ‖X_t − U_tᵀ V_t‖²_F.   (5)
In dynamic matrix factorization models, the data X_t are observed sequentially at discrete time steps t ∈ {1, …, T}, and the representations Z_{1:T} capture the temporal evolution of latent embedding vectors. We use a Markovian Gaussian time series prior with a coupling strength λ,

L_prior(Z_{1:T}) = (λ/2) Σ_{t=1}^{T−1} Σ_{i=1}^{n} ‖z_{t+1,i} − z_{t,i}‖².   (6)
Here, the vector z_{t,i} is the i-th column in the matrix Z_t, i.e., the i-th embedding vector, and n is the number of columns. The prior allows the model to share statistical strength across time. By multiplying out the square, we find that L_prior has the form of Eq. 3, and its Laplacian matrix Λ is tridiagonal.
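The equivalence between the sum of squared differences (Eq. 6) and the Laplacian quadratic form (Eq. 3) can be checked numerically. The sketch below uses our own illustrative sizes and coupling strength.

```python
import numpy as np

# Verify that the Markovian prior (sum of squared differences between
# adjacent time steps) equals the Laplacian quadratic form.
rng = np.random.default_rng(1)
T, d, n, lam = 6, 3, 4, 0.5
Z = rng.normal(size=(T, d, n))       # one d x n representation per task

# Sum-of-squared-differences form (Eq. 6).
prior_diff = 0.5 * lam * sum(np.sum((Z[t + 1] - Z[t]) ** 2)
                             for t in range(T - 1))

# Laplacian quadratic form (Eq. 3) with the tridiagonal chain Laplacian.
A = lam * (np.eye(T, k=1) + np.eye(T, k=-1))
Lambda = np.diag(A.sum(axis=1)) - A
prior_quad = 0.5 * sum(Lambda[t, s] * np.trace(Z[t] @ Z[s].T)
                       for t in range(T) for s in range(T))

assert np.isclose(prior_diff, prior_quad)
```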
Sparse dynamic matrix factorization.
In a sparse matrix factorization, the local loss involves only few components of the matrix X_t, where the latent representation is again Z_t = (U_t, V_t). We consider a model for movie ratings where each user rates only few movies. When user i rates movie j in time step t, we model the likelihood to obtain the binary rating x_{ij,t} ∈ {±1} with a logistic regression,

p(x_{ij,t} | Z_t) = σ(x_{ij,t} u_{t,i}ᵀ v_{t,j}),   (7)
with the sigmoid function σ(a) = 1/(1 + e^{−a}). The full likelihood p(X_t | Z_t) for time step t is the product of the likelihoods of all observed ratings at time step t. The local loss function is

ℓ_t(Z_t) = −log p(X_t | Z_t) + (λ′/2) ‖Z_t‖²_F.   (8)
Here, ‖·‖_F is the Frobenius norm, and we add a quadratic regularizer with strength λ′ since data for some users or movies may be scarce. We distinguish this local regularizer from the time series prior L_prior, given again in Eq. 6, as the local regularizer does not break the rotational symmetry.
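A hypothetical sketch of this local loss, including a check that the local regularizer indeed preserves the rotational symmetry. The function name, sizes, and example ratings are our own illustration, not from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Local loss for one time step: negative log-likelihood of the observed
# binary ratings under the logistic model, plus a quadratic regularizer.
def local_loss(U, V, ratings, lam_local=0.1):
    # ratings: iterable of (user i, movie j, binary rating x in {+1, -1})
    nll = -sum(np.log(sigmoid(x * U[:, i] @ V[:, j])) for i, j, x in ratings)
    reg = 0.5 * lam_local * (np.sum(U ** 2) + np.sum(V ** 2))
    return nll + reg

rng = np.random.default_rng(2)
U, V = rng.normal(size=(3, 5)), rng.normal(size=(3, 7))
ratings = [(0, 1, +1), (2, 3, -1), (4, 6, +1)]

# Rotating all embedding vectors by the same orthogonal matrix R leaves
# both the dot products and the regularizer unchanged.
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
assert np.isclose(local_loss(U, V, ratings),
                  local_loss(R @ U, R @ V, ratings))
```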
Dynamic Word Embeddings.
Word embeddings map words from a large vocabulary to a low-dimensional representation space such that neighboring words are semantically similar, and differences between word embedding vectors capture syntactic and semantic relations. We consider the Dynamic Word Embeddings model (Bamler & Mandt, 2017), which uses a probabilistic interpretation of the Skip-Gram model with negative sampling, also known as word2vec (Mikolov et al., 2013b; Barkan, 2017), and combines it with a time series prior. The model is trained on text sources with time stamps t ∈ {1, …, T}, and it assigns two time-dependent embedding vectors u_{t,i} and v_{t,i} to each word i from a fixed vocabulary. The embedding vectors are obtained by simultaneously factorizing two matrices, which contain so-called positive and negative counts of word-context pairs. Therefore, the local loss for each time step is invariant under orthogonal transformations. The prior is a discretized Ornstein-Uhlenbeck (OU) process, i.e., it combines a random diffusion process with a quadratic regularizer. Analogously to the movie recommendations model, we absorb the quadratic regularizer in the per-task loss ℓ_t.
2.4 The Slow Convergence Problem
In this section, we identify the slow convergence problem in time series models with continuous symmetries as an ill-conditioning of the Hessian of the loss at its minimum.
We consider a model of the form of Eqs. 1-3. Due to the continuous rotational symmetry, each local loss function ℓ_t has a manifold of degenerate minima, and the Hessian of ℓ_t vanishes within this manifold. Consider, e.g., the two-dimensional rotationally symmetric loss in Figure 1a). It is given by ℓ(z) = (‖z‖² − r₀²)², where r₀ > 0 is a constant. Setting the gradient ∇ℓ(z) = 4(‖z‖² − r₀²)z to zero, we find that the manifold of degenerate minima is the circle with radius r₀ around the origin. The Hessian at any point z on this circle is 8zzᵀ. It has a zero eigenvalue for the direction perpendicular to z, i.e., the direction of a small rotation (blue arrows in Figure 1a)).
Thus, within the subspace of small symmetry transformations, only the Hessian of the prior remains, which is

∂²L_prior / (∂z_{t,i,α} ∂z_{t',i',α'}) = Λ_{tt'} δ_{ii'} δ_{αα'}.   (9)
Here, δ denotes the Kronecker delta, α and α' run over the rows, and i and i' run over the columns of the matrices Z_t and Z_{t'}. The Hessian of L_prior therefore has the same eigenvalues as the Laplacian matrix Λ of the coupling graph, each with a multiplicity of nd.
Every Laplacian matrix has a zero eigenvalue because its rows add up to zero. The corresponding eigenvector describes a global rotation of the representations for all tasks by the same amount, which does not concern us here. A global rotation leaves the total loss invariant, implying that the minimum of is degenerate. Convergence within the valley of degenerate minima is not required since any minimum is a valid solution.
The second smallest eigenvalue of a Laplacian matrix is called the algebraic connectivity (de Abreu, 2007), and it is small in sparse graphs. In the Markovian time series prior in Eq. 6, the coupling graph is a one-dimensional chain, with algebraic connectivity 2λ(1 − cos(π/T)) (de Abreu, 2007), which vanishes as O(1/T²) for large T. Thus, even if the coupling strength λ is strong, the lowest nonzero eigenvalue of the Hessian can be small for large T. In our experiments in Section 5, T ranges from tens to a few hundred. The small eigenvalue of the Hessian leads to an ill-conditioned optimization problem with slow convergence. We present our proposed solution to speed up convergence in Section 4.
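The 1/T² scaling of the algebraic connectivity is easy to verify numerically; the closed-form eigenvalues of the path-graph Laplacian are standard. A sketch with an illustrative chain length:

```python
import numpy as np

# Eigenvalues of the path-graph Laplacian are known in closed form:
# mu_k = 2 * (1 - cos(pi * k / T)) for k = 0, ..., T-1. The second
# smallest one (the algebraic connectivity) scales like pi^2 / T^2.
T = 50
A = np.eye(T, k=1) + np.eye(T, k=-1)
L = np.diag(A.sum(axis=1)) - A
eigvals = np.sort(np.linalg.eigvalsh(L))

assert np.isclose(eigvals[0], 0.0, atol=1e-10)  # global rotations cost nothing
alg_conn = eigvals[1]
assert np.isclose(alg_conn, 2 * (1 - np.cos(np.pi / T)))
assert alg_conn < (np.pi / T) ** 2 * 1.01       # ~ pi^2 / T^2 for large T
```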
3 Related Work
Symmetries in the input space of machine learning models have been exploited to reduce the number of independent model parameters. Convolutional neural networks (CNNs) (LeCun et al., 1998) use discrete translational symmetry to tie weights in a neural network layer. This idea was generalized to arbitrary discrete symmetries (Gens & Domingos, 2014) and to the (discrete) permutation symmetry of sets (Zaheer et al., 2017). Discrete symmetries are also exploited in inference with graphical models to reduce the size of the effective latent space (Bui et al., 2012; Noessner et al., 2013). Discrete symmetries do not lead to a small gradient in gradient descent because they do not give rise to configurations with arbitrarily small deviations from the optimum.
In this work, we consider models with continuous symmetries. These arise in the latent representation space, e.g., in deep neural networks (Badrinarayanan et al., 2015), matrix factorization (Mnih & Salakhutdinov, 2008; Gopalan et al., 2015), linear factor models (Murphy, 2012), and word embeddings (Mikolov et al., 2013a, b; Pennington et al., 2014; Barkan, 2017). Dynamic matrix factorizations (Lu et al., 2009; Koren, 2010; Sun et al., 2012; Charlin et al., 2015) and dynamic word embeddings (Bamler & Mandt, 2017; Rudolph & Blei, 2017) combine such models with a time series prior, which does not obey the symmetry. These are the models whose optimization we address in this paper.
The slow convergence in these models is caused by shallow directions of the loss function. Popular methods to escape a shallow valley of the loss in deep learning models (Duchi et al., 2011; Zeiler, 2012; Kingma & Ba, 2014) rely on diagonal preconditioning. As confirmed by our experiments, diagonal preconditioning does not speed up convergence in the models addressed in this paper since the shallow directions correspond to collective rotations of many model parameters, which are not aligned with the coordinate axes.
Natural gradients (Amari, 1998; Martens, 2014) are a more general form of preconditioning that has been applied to deep learning (Pascanu & Bengio, 2013) and to variational inference (Hoffman et al., 2013). Natural gradients take the information geometry of the parameter space into account. They use a Riemannian metric to map the gradient, which lives in the cotangent space, to an update step in the tangent space (Ollivier, 2015a, b). In general, natural gradients are expensive to obtain. We show that, using an appropriate parameterization of the symmetry transformations, natural gradients in the symmetry subspace are cheap.
4 Goldstone Gradient Descent
In this section, we present our solution to the slow convergence problem that we identified in Section 2.4. Algorithm 1 summarizes the proposed Goldstone Gradient Descent (GoldstoneGD) algorithm. We lay out details in Section 4.1, and discuss hyperparameters in Section 4.2.
The algorithm minimizes a loss function of the form of Eqs. 1-3. It alternates between standard gradient steps in the full parameter space, and natural gradient steps in the subspace of small symmetry transformations ('symmetry subspace' for short). Switching between the two spaces involves an overhead due to coordinate transformations. We therefore always execute several consecutive gradient steps before switching between spaces; the number of consecutive steps in each space is an integer hyperparameter. For simplicity, the update step in the full parameter space is formulated in Algorithm 1 with a single constant learning rate. In our experiments, we also use adaptive learning rates and minibatch sampling, see Section 5.
[Algorithm 1: Goldstone Gradient Descent (GoldstoneGD); pseudocode not reproduced here.]
4.1 Optimization in the Symmetry Subspace
We now describe the optimization in the symmetry subspace, and the coordinate transformations to and from this subspace (see Algorithm 1).
For given initial model parameters Z_{1:T}, we minimize the loss by only applying symmetry transformations. Let R_{1:T} ≡ (R_1, …, R_T) denote orthogonal matrices. The task is to minimize the following auxiliary loss function over R_{1:T},

L_aux(R_{1:T}) = L({R_t Z_t}_{t=1:T}) − Σ_{t=1}^{T} ℓ_t(Z_t)   (10)

with the nonlinear constraint R_tᵀ R_t = I for all t. If R*_{1:T} minimizes L_aux, then replacing Z_t ← R*_t Z_t decreases the loss by eliminating all Goldstone modes. The second term on the right-hand side of Eq. 10 does not influence the minimization as it is independent of R_{1:T}. Subtracting this term makes L_aux independent of the local loss functions ℓ_t: by using Eqs. 1-2, we can express L_aux in terms of only the prior,

L_aux(R_{1:T}) = L_prior({R_t Z_t}_{t=1:T}).   (11)
Artificial gauge fields.
We turn the constrained minimization into an unconstrained one using a result from the theory of Lie groups (Hall, 2015). First, note that only special orthogonal transformations with det(R_t) = 1 contribute to the slow convergence problem. Mirror transformations with det(R_t) = −1 do not lead to Goldstone modes. Every special orthogonal matrix R is the matrix exponential of a skew-symmetric matrix Γ. Here, skew symmetry means that Γᵀ = −Γ, and the matrix exponential function is defined by its series expansion,

R = e^Γ ≡ Σ_{n=0}^{∞} (1/n!) Γⁿ = I + Γ + ½Γ² + …,   (12)
which is not to be confused with the componentwise exponential of Γ (the term Γ² in Eq. 12 is the matrix product of Γ with itself, not the componentwise square). Eq. 12 follows from the Lie group-Lie algebra correspondence for the Lie group SO(d) (Hall, 2015). Note that setting all components of Γ to zero yields the identity, e⁰ = I. For small Γ, e^Γ is a small rotation close to the identity. We enforce skew symmetry of Γ_t by parameterizing it via the skew-symmetric part of an unconstrained matrix γ_t, i.e.,

Γ_t = ½ (γ_t − γ_tᵀ).   (13)
We call the components of γ_{1:T} the gauge fields, invoking an analogy to gauge theory in physics.
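The parameterization above can be sketched with NumPy and SciPy's matrix exponential; the dimension and random seed are illustrative.

```python
import numpy as np
from scipy.linalg import expm

# An unconstrained gauge field gamma is mapped to a skew-symmetric Gamma,
# whose matrix exponential is a special orthogonal (rotation) matrix.
rng = np.random.default_rng(3)
d = 4
gamma = rng.normal(size=(d, d))      # unconstrained parameters
Gamma = 0.5 * (gamma - gamma.T)      # skew-symmetric part
R = expm(Gamma)                      # matrix exponential

assert np.allclose(Gamma, -Gamma.T)          # skew symmetry
assert np.allclose(R.T @ R, np.eye(d))       # R is orthogonal
assert np.isclose(np.linalg.det(R), 1.0)     # det = +1: no mirror transform
```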
Taylor expansion in the gauge fields.
Eqs. 12-13 turn the constrained minimization of L_aux over R_{1:T} into an unconstrained minimization over γ_{1:T}. However, the matrix exponential function in Eq. 12 is numerically expensive, and its derivative is complicated because the group SO(d) is nonabelian. We simplify the problem by introducing an approximation. As the model parameters approach the minimum of L, the optimal rotations that minimize L_aux converge to the identity, and thus the gauge fields converge to zero. In this limit, the approximation becomes exact.
We approximate the auxiliary loss function L_aux by a second-order Taylor expansion L̃_aux. In detail, we truncate Eq. 12 after the term quadratic in Γ_t and insert the truncated series into Eq. 11. We multiply out the quadratic form in the prior L_prior, Eq. 3, and neglect again all terms of higher than quadratic order in Γ_{1:T}. Using the skew symmetry of Γ_t and the symmetry of the Laplacian matrix Λ, we find

L̃_aux(Γ_{1:T}) = const − ½ Σ_{t,t'=1}^{T} A_{tt'} tr[(Γ_t − Γ_{t'}) M_{tt'} + ½ Γ_t² M_{tt'} + ½ M_{tt'} Γ_{t'}² − Γ_t M_{tt'} Γ_{t'}],   (14)

where the trace runs over the embedding space, the constant collects all terms independent of Γ_{1:T}, and for each pair of tasks t and t', we define the matrix M_{tt'},

M_{tt'} = Z_t Z_{t'}ᵀ.   (15)
We evaluate the matrices M_{tt'} when we switch from the optimization in the full parameter space to the optimization in the symmetry subspace. Note that the adjacency matrix A is sparse, and that we only need to obtain those matrices M_{tt'} for which A_{tt'} is nonzero.
Once we obtain gauge fields γ_{1:T} that minimize L̃_aux, the optimal update step for the model parameters would be Z_t ← e^{Γ_t} Z_t. We avoid again a full evaluation of the expensive matrix exponential function and approximate it for small gauge fields by truncating Eq. 12 after the linear term, resulting in the update step Z_t ← (I + Γ_t) Z_t in Algorithm 1.
Natural gradients.
Algorithm 1 minimizes L̃_aux over the gauge fields γ_{1:T} using GD. We speed up convergence using the fact that L̃_aux depends only on the prior and not on the local losses ℓ_t. Since we know the Hessian of L_prior, we can use natural gradients (Amari, 1998), resulting in the update step

γ_t ← γ_t − ρ* Σ_{t'=1}^{T} Λ⁺_{tt'} ∂L̃_aux/∂γ_{t'}   (16)
with a constant learning rate ρ*, discussed below. Here, we precondition with the pseudoinverse Λ⁺ of the Laplacian matrix. We obtain Λ⁺ by taking the eigendecomposition of Λ and inverting the eigenvalues, except for the single zero eigenvalue, which we leave at zero. This has to be done only once before entering the training loop. The zero eigenvalue corresponds to a global rotation of all Z_t by the same orthogonal matrix, which does not reduce the loss.
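The preconditioner can be sketched as follows; the chain length and the numerical threshold for the zero eigenvalue are our own illustrative choices.

```python
import numpy as np

# Pseudoinverse of the Laplacian: invert all eigenvalues except the
# single (numerically) zero eigenvalue, which is left at zero.
T = 6
A = np.eye(T, k=1) + np.eye(T, k=-1)
Lambda = np.diag(A.sum(axis=1)) - A
mu, Q = np.linalg.eigh(Lambda)                 # eigendecomposition (done once)
mu_inv = np.array([0.0 if abs(m) < 1e-10 else 1.0 / m for m in mu])
Lambda_pinv = Q @ np.diag(mu_inv) @ Q.T

# Lambda_pinv @ Lambda is a projection that removes only the null space
# of Lambda (the direction of a global rotation of all tasks).
P = Lambda_pinv @ Lambda
ones = np.ones(T) / np.sqrt(T)                 # normalized null-space direction
assert np.allclose(P @ ones, 0.0, atol=1e-8)
assert np.allclose(P, np.eye(T) - np.outer(ones, ones), atol=1e-8)
assert np.allclose(Lambda_pinv, np.linalg.pinv(Lambda))
```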
We do not reset the gauge fields γ_{1:T} to zero when switching back to the full parameter space. Thus, when we return to the minimization of L̃_aux after interleaving standard gradient steps, we preinitialize γ_{1:T} with the result from the previous minimization. This turned out to speed up convergence in our experiments. We explain the speedup with the observation that, by remembering the previous update step, the gauge fields act like a momentum in the symmetry subspace.
Learning rate.
We find that we can automatically set the learning rate ρ* in Eq. 16 to a value that leads to fast convergence,

ρ* = 1 / max_t ‖Z_t‖²_F.   (17)
We arrive at this choice of learning rate due to the following considerations. First, consider the easier task of minimizing L_prior over the full parameter space. Here, the same preconditioning as in Eq. 16 leads to the update

Z_{1:T} ← Z_{1:T} − ρ P Z_{1:T},   (18)

where P = Λ⁺Λ is a projection that only removes the (irrelevant) null space of Λ. The minimization would thus find the exact minimum of L_prior with a single update step with ρ = 1.
Of course, the objective for the minimization in the full parameter space is not L_prior but the total loss L. In the symmetry subspace, however, the objective is indeed (a reparameterization of) L_prior, see Eq. 11. The reparameterization in terms of γ_{1:T} leads to the matrices M_{tt'} in Eq. 14, which are quadratic in the components of Z_{1:T} and linear in Λ. This suggests a learning rate ρ ∝ 1/‖Z_t‖²_F. We find empirically for large models that the residual dependency on Z_{1:T} leads to a small mismatch between Λ and the Hessian of L̃_aux. The more conservative choice of learning rate in Eq. 17 leads to fast convergence of the gauge fields in all our experiments.
4.2 Hyperparameters
Operation (Algorithm 1) | Complexity
gradient step in full param. space | model dependent
transformation to symmetry space | O(E n d²)
nat. grad. step in symmetry space | O(E d³ + T² d²)
transformation to full param. space | O(T n d²)

Table 1: Computational complexity of the operations in Algorithm 1; E denotes the number of nonzero entries of the adjacency matrix A.
GoldstoneGD has two integer hyperparameters, which control how many consecutive update steps are executed in the full parameter space and in the symmetry subspace, respectively. Table 1 lists the computational complexity of each operation, assuming that the adjacency matrix A has E nonzero entries. Note that representation learning usually involves a dimensionality reduction, i.e., the embedding dimension d is often orders of magnitude smaller than the number of embedding vectors n. Therefore, update steps in the symmetry subspace are cheap. In our experiments, we always set the two hyperparameters such that the runtime increases only marginally compared to standard GD with the same number of update steps in the full parameter space.
5 Experiments
We evaluate the proposed GoldstoneGD optimization algorithm on the three example models introduced in Section 2.3. We compare GoldstoneGD to standard GD, to AdaGrad (Duchi et al., 2011), and to Adam (Kingma & Ba, 2014). GoldstoneGD converges orders of magnitude faster and fits more interpretable word embeddings.
5.1 Visualizing Goldstone Modes With Artificial Data
Model and data preparation.
We fit the dynamic Gaussian matrix factorization model defined in Eqs. 4-6 in Section 2.3 to small-scale artificial data. In order to visualize Goldstone modes in the embedding space, we choose a small embedding dimension and, for this experiment only, we fit the model to time-independent data. This allows us to monitor convergence since we know that the matrices U_t and V_t that minimize the loss are also time-independent. We generate artificial data for the matrix X by drawing the components of two small matrices from a standard normal distribution, forming their product, and adding uncorrelated Gaussian noise. We use T time steps and a constant coupling strength λ.
Hyperparameters.
We train the model with standard GD (baseline) and with GoldstoneGD. We find fastest convergence for the baseline method if we clip the gradients to an interval and use a decreasing learning rate, despite the noise-free gradient. We optimize the hyperparameters (clipping interval and learning rate schedule) for fastest convergence in the baseline.
Results.
Figure 2 compares convergence of the two algorithms. We discussed the figure at the end of Section 2.1. In summary, GoldstoneGD converges an order of magnitude faster even in this small-scale setup, in which each skew-symmetric gauge field Γ_t has only d(d − 1)/2 = 3 independent parameters, i.e., there are only three types of Goldstone modes. Once the minimization finds minima of the local losses ℓ_t, differences in the total loss between the two algorithms are small since Goldstone modes contribute only little to the loss (this is why they decay slowly in GD).
5.2 MovieLens Recommendations
Model and data set.
We fit the sparse dynamic Bernoulli factorization model defined in Eqs. 6-8 in Section 2.3 to the Movielens 20M data set (https://grouplens.org/datasets/movielens/20m/; Harper & Konstan, 2016), using a fixed embedding dimension d, coupling strength λ, and local regularizer strength λ′. The data set consists of 20 million ratings of movies by users with time stamps from 1995 to 2015. We binarize the ratings by splitting at the median, discarding ratings at the median, and we slice the remaining data points into time bins of equal duration. We split the data randomly across all bins into training, validation, and test sets.
Baseline and hyperparameters.
We compare the proposed GoldstoneGD algorithm to GD with AdaGrad (Duchi et al., 2011), with a learning-rate prefactor obtained from cross-validation. Similar to GoldstoneGD, AdaGrad is designed to escape shallow valleys of the loss, but it uses only diagonal preconditioning. GoldstoneGD uses the same AdaGrad optimizer for its update steps in the full parameter space.
Query | GoldstoneGD | Baseline
car | boat, saddle, canoe, wagon, box | shell, roof, ceiling, choir, central
computer | perspective, telescope, needle, mathematical, camera | organism, disturbing, sexual, rendering, bad
electricity | vapor, virus, friction, fluid, molecular | exercising, inherent, seeks, takes, protect
DNA | potassium, chemical, sodium, molecules, displacement | operates, differs, sharing, takes, keeps
tuberculosis | chronic, paralysis, irritation, disease, vomiting | trained, uniformly, extinguished, emerged, widely

Table 2: Historic synonyms of modern query words found with Dynamic Word Embeddings (see Section 5.3).
Results.
The additional operations in GoldstoneGD lead to a small increase in runtime per update step. The upper panel in Figure 3 shows training curves for the total loss L using the baseline (purple) and GoldstoneGD (green). The loss drops faster in GoldstoneGD, but differences in terms of the full loss are small because the local loss functions ℓ_t are much larger than the prior in this experiment. We therefore show only the prior L_prior in the lower panel of Figure 3. Both algorithms converge to the same value of L_prior, but GoldstoneGD converges at least an order of magnitude faster. The difference in absolute values is small because Goldstone modes contribute little to the loss. They can, however, have a large influence on the parameter values, as we show next in experiments with Dynamic Word Embeddings.
5.3 Dynamic Word Embeddings
Model and data set.
We perform variational inference (Ranganath et al., 2014) in Dynamic Word Embeddings (DWE), see Section 2.3. We fit the model to digitized books in the Google Books corpus (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html; Michel et al., 2011). We follow Bamler & Mandt (2017) for data preparation, resulting in a fixed vocabulary, a training set spanning most time steps, and a test set of held-out time steps. The paper proposes two inference algorithms: filtering and smoothing. We use the smoothing algorithm, which has better predictive performance than filtering but suffers from Goldstone modes. We use a reduced embedding dimension due to hardware constraints, and train using an Adam optimizer (Kingma & Ba, 2014) with a prefactor of the adaptive learning rate that decays with the training step. We find that this leads to better convergence than a constant prefactor. All other hyperparameters are the same as in Bamler & Mandt (2017). We compare the baseline to GoldstoneGD using the same learning-rate schedule; the additional operations lead to a small increase in runtime.
Results.
By eliminating Goldstone modes, GoldstoneGD makes word embeddings comparable across the time dimension of the model. We demonstrate this in Table 2, which shows the result of 'aging' modern words, i.e., translating them from modern English to historic English. For each query word, we report the five words whose embedding vectors at the first time step have largest overlap with the embedding vector of the query word at the last time step. Overlap is measured in cosine distance (normalized scalar product), and we use the means of u_{t,i} and v_{t,i} under the variational distribution.
GoldstoneGD finds words that are plausible for the year while still being related to the query (e.g., means of transportation in a query for ‘car’). By contrast, the baseline method fails to find plausible results. Figure 4 provides more insight into the failure of the baseline method. It shows histograms of the cosine similarity between word embeddings and for the same word from the first to the last time step. In GoldstoneGD (green), most embeddings have a large overlap, reflecting that the usage of most words does not change drastically over time. In the baseline (purple), no embeddings overlap by more than between and , and some embeddings even change their orientation (negative overlap). We attribute this counterintuitive observation to the presence of Goldstone modes: the entire embedding spaces are rotated against each other.
For a quantitative comparison, we evaluate the predictive log-likelihood of the test set under the posterior mean, and find slightly better predictive performance with GoldstoneGD ( vs. per test point). The improvement is small because the training set is so large that the influence of the prior in all but the symmetry directions is dwarfed by the likelihood. The main advantage of GoldstoneGD is the more interpretable embeddings, as demonstrated in Table 2.
6 Conclusions
We identified a slow convergence problem in representation learning models with a continuous symmetry and a Markovian time series prior, and we solved the problem with a new optimization algorithm, GoldstoneGD. The algorithm separates the minimization in the symmetry subspace from the remaining coordinate directions. Our experiments showed that GoldstoneGD converges orders of magnitude faster and fits more interpretable embedding vectors, which can be compared across the time dimension of a model. We believe that continuous symmetries are common in representation learning and can guide model and algorithm design.
Acknowledgements
We thank Ari Pakman for valuable and detailed feedback that greatly improved the manuscript.
References
 Altland & Simons (2010) Altland, Alexander and Simons, Ben D. Condensed matter field theory. Cambridge University Press, 2010.
 Amari (1998) Amari, Shun-Ichi. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
 Arnol’d (2013) Arnol’d, Vladimir Igorevich. Mathematical methods of classical mechanics, volume 60. Springer Science & Business Media, 2013.
 Badrinarayanan et al. (2015) Badrinarayanan, Vijay, Mishra, Bamdev, and Cipolla, Roberto. Understanding symmetries in deep networks. arXiv preprint arXiv:1511.01029, 2015.
 Bamler & Mandt (2017) Bamler, Robert and Mandt, Stephan. Dynamic word embeddings. In Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 380–389, 2017.
 Barkan (2017) Barkan, Oren. Bayesian Neural Word Embedding. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 Bui et al. (2012) Bui, Hung Hai, Huynh, Tuyen N, and Riedel, Sebastian. Automorphism groups of graphical models and lifted variational inference. arXiv preprint arXiv:1207.4814, 2012.
 Charlin et al. (2015) Charlin, Laurent, Ranganath, Rajesh, McInerney, James, and Blei, David M. Dynamic Poisson factorization. In Proceedings of the 9th ACM Conference on Recommender Systems, pp. 155–162, 2015.
 de Abreu (2007) de Abreu, Nair Maria Maia. Old and new results on algebraic connectivity of graphs. Linear Algebra and its Applications, 423(1):53–73, 2007.
 Duchi et al. (2011) Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
 Gens & Domingos (2014) Gens, Robert and Domingos, Pedro M. Deep symmetry networks. In Advances in Neural Information Processing Systems 27, pp. 2537–2545. 2014.
 Gopalan et al. (2015) Gopalan, Prem, Hofman, Jake M, and Blei, David M. Scalable recommendation with hierarchical Poisson factorization. In UAI, pp. 326–335, 2015.
 Hall (2015) Hall, Brian. Lie groups, Lie algebras, and representations: an elementary introduction, volume 222. Springer, 2015.
 Harper & Konstan (2016) Harper, F Maxwell and Konstan, Joseph A. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.
 Hoffman et al. (2013) Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John William. Stochastic Variational Inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 Kingma & Ba (2014) Kingma, Diederik and Ba, Jimmy. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference for Learning Representations (ICLR), 2014.
 Koren (2010) Koren, Yehuda. Collaborative filtering with temporal dynamics. Communications of the ACM, 53(4):89–97, 2010.
 LeCun et al. (1998) LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lu et al. (2009) Lu, Zhengdong, Agarwal, Deepak, and Dhillon, Inderjit S. A spatio–temporal approach to collaborative filtering. In ACM Conference on Recommender Systems (RecSys), 2009.
 Martens (2014) Martens, James. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
 Michel et al. (2011) Michel, Jean-Baptiste, Shen, Yuan Kui, Aiden, Aviva Presser, Veres, Adrian, Gray, Matthew K, Pickett, Joseph P, Hoiberg, Dale, Clancy, Dan, Norvig, Peter, Orwant, Jon, et al. Quantitative Analysis of Culture Using Millions of Digitized Books. Science, 331(6014):176–182, 2011.
 Mikolov et al. (2013a) Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013a.
 Mikolov et al. (2013b) Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119. 2013b.
 Mnih & Salakhutdinov (2008) Mnih, Andriy and Salakhutdinov, Ruslan R. Probabilistic matrix factorization. In Advances in neural information processing systems, pp. 1257–1264, 2008.
 Murphy (2012) Murphy, Kevin P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
 Noessner et al. (2013) Noessner, Jan, Niepert, Mathias, and Stuckenschmidt, Heiner. RockIt: Exploiting parallelism and symmetry for MAP inference in statistical relational models. In AAAI Workshop: Statistical Relational Artificial Intelligence, 2013.
 Ollivier (2015a) Ollivier, Yann. Riemannian metrics for neural networks I: feedforward networks. Information and Inference: A Journal of the IMA, 4(2):108–153, 2015a.
 Ollivier (2015b) Ollivier, Yann. Riemannian metrics for neural networks II: recurrent networks and learning symbolic data sequences. Information and Inference: A Journal of the IMA, 4(2):154–193, 2015b.
 Pascanu & Bengio (2013) Pascanu, Razvan and Bengio, Yoshua. Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584, 2013.
 Pennington et al. (2014) Pennington, Jeffrey, Socher, Richard, and Manning, Christopher. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.
 Peskin (1995) Peskin, Michael Edward. An introduction to quantum field theory. Westview press, 1995.
 Poignard et al. (2018) Poignard, Camille, Pereira, Tiago, and Pade, Jan Philipp. Spectra of Laplacian matrices of weighted graphs: structural genericity properties. SIAM Journal on Applied Mathematics, 78(1):372–394, 2018.
 Ranganath et al. (2014) Ranganath, Rajesh, Gerrish, Sean, and Blei, David. Black box variational inference. In Artificial Intelligence and Statistics, pp. 814–822, 2014.
 Rudolph & Blei (2017) Rudolph, Maja and Blei, David. Dynamic Bernoulli embeddings for language evolution. arXiv preprint arXiv:1703.08052, 2017.
 Sun et al. (2012) Sun, John Z, Varshney, Kush R, and Subbian, Karthik. Dynamic matrix factorization: A state space approach. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1897–1900, 2012.
 Zaheer et al. (2017) Zaheer, Manzil, Kottur, Satwik, Ravanbakhsh, Siamak, Poczos, Barnabas, Salakhutdinov, Ruslan R, and Smola, Alexander J. Deep sets. In Advances in Neural Information Processing Systems, pp. 3394–3404, 2017.
 Zeiler (2012) Zeiler, Matthew D. ADADELTA: an Adaptive Learning Rate Method. arXiv preprint arXiv:1212.5701, 2012.