Fast stochastic optimization on Riemannian manifolds
Abstract
We study optimization of finite sums of geodesically smooth functions on Riemannian manifolds. Although variance reduction techniques for optimizing finitesum problems have witnessed a huge surge of interest in recent years, all existing work is limited to vector space problems. We introduce Riemannian SVRG, a new variance reduced Riemannian optimization method. We analyze this method for both geodesically smooth convex and nonconvex functions. Our analysis reveals that Riemannian SVRG comes with advantages of the usual SVRG method, but with factors depending on manifold curvature that influence its convergence. To the best of our knowledge, ours is the first fast stochastic Riemannian method. Moreover, our work offers the first nonasymptotic complexity analysis for nonconvex Riemannian optimization (even for the batch setting). Our results have several implications; for instance, they offer a Riemannian perspective on variance reduced PCA, which promises a short, transparent convergence analysis.
Fast stochastic optimization on Riemannian manifolds
Hongyi Zhang hongyiz@mit.edu Massachusetts Institute of Technology Sashank J. Reddi sjakkamr@cs.cmu.edu Carnegie Mellon University Suvrit Sra suvrit@mit.edu Massachusetts Institute of Technology
noticebox[b]29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\end@float
1 Introduction
We study the following rich class of (possibly nonconvex) finitesum optimization problems:
(1) 
where is a Riemannian manifold with the Riemannian metric , and is a geodesically convex set. We further assume that each is geodesically smooth (see §2). Problem (1) is fundamental to machine learning, where it typically arises in the context of empirical risk minimization, albeit usually in its vector space incarnation. It also captures numerous widely used problems such as principal component analysis (PCA), independent component analysis (ICA), dictionary learning, mixture modeling, among others (please see the related work section).
The linear space version of (1) where and is the Euclidean norm has been the subject of intense algorithmic development in machine learning and optimization, starting with the classical work of Robbins and Monro [26] to the recent spate of work on variance reduction methods [10, 28, 18, 20, 25]. However, when is a nonlinear Riemannian manifold, much less attention has been paid [7, 38].
When solving problems with manifold constraints, one common approach is to alternate between optimizing in the ambient Euclidean space and “projecting” onto the manifold. For example, two wellknown methods to compute the leading eigenvector of symmetric matrices, power iteration and Oja’s algorithm [23], are in essence projected gradient and projected stochastic gradient algorithms. For certain manifolds (e.g., positive definite matrices), projections can be too expensive to compute.
An effective alternative is to use Riemannian optimization^{1}^{1}1Riemannian optimization is optimization on a known manifold structure. Note the distinction from manifold learning, which attempts to learn a manifold structure from data. We briefly review some Riemannian optimization applications in the related work., which directly operates on the manifold in question. This allows Riemannian optimization to turn the constrained optimization problem (1) into an unconstrained one defined on the manifold, and thus, to be “projectionfree.” More importantly is its conceptual value: viewing a problem through the Riemannian lens, one can discover new insights into the geometry of a problem, which can even lead to better optimization algorithms.
Although the Riemannian approach is very appealing, our knowledge of it is fairly limited. In particular, there is little analysis about its global complexity (a.k.a. nonasymptotic convergence rate), in part due to the difficulty posed by the nonlinear metric. It is only recently that Zhang and Sra [38] developed the first global complexity analysis of full and stochastic gradient methods for geodesically convex functions. However, the batch and stochastic gradient methods in [38] suffer from problems similar to their vector space counterparts. For solving finite sum problems with components, the fullgradient method requires derivatives at each step; the stochastic method requires only one derivative but at the expense of vastly slower convergence to an accurate solution.
These issues have driven much of the recent progress on faster stochastic optimization in vector spaces by using variance reduction [28, 18, 10]. However, all of these works critically rely on properties of vector spaces; thus, using them in the context of Riemannian manifolds poses major challenges. Given the potentially vast scope of Riemannian optimization and its growing number of applications, developing fast stochastic optimization methods for it is very important: it will help us apply Riemannian optimization to largescale problems, while offering a new set of algorithmic tools for the practitioner’s repertoire.
Contributions.
In light of the above motivation, let us summarize our key contributions below.

We introduce Riemannian SVRG (Rsvrg), a variance reduced Riemannian stochastic gradient method based on SVRG [18]. We analyze Rsvrg for geodesically strongly convex functions through a novel theoretical analysis that accounts for the nonlinear (curved) geometry of the manifold to yield linear convergence rates.

Inspired by the exciting advances in variance reduction for nonconvex optimization [25, 3], we generalize the convergence analysis of Rsvrg to (geodesically) nonconvex functions and also to gradient dominated functions (see §2 for the definition). Our analysis provides the first stochastic Riemannian method that is provably superior to both batch and stochastic (Riemannian) gradient methods for nonconvex finitesum problems.

Using a Riemannian formulation and applying our result for (geodesically) gradientdominated functions, we provide new insights, and a short transparent analysis explaining fast convergence of variance reduced PCA for computing the leading eigenvector of a symmetric matrix.
To our knowledge, this paper provides the first stochastic gradient method with global linear convergence rates for geodesically strongly convex functions, as well as first nonasymptotic convergence rates for geodesically nonconvex optimization (even in the batch case). Our analysis reveals how manifold geometry, in particular its curvature impacts convergence rates. We illustrate the benefits of Rsvrg by showing an application to computing leading eigenvectors of a symmetric matrix, as well as for accelerating the computation of the Riemannian centroid of covariance matrices, a problem that has received great attention in the literature [5, 16, 38].
Related Work.
Variance reduction techniques, such as control variates, are widely used in Monte Carlo simulations [27]. In linear spaces, variance reduced methods for solving finitesum problems have recently witnessed a huge surge of interest [e.g. 28, 18, 10, 4, 20, 36, 14]. They have been shown to accelerate stochastic optimization for strongly convex objectives, convex objectives, nonconvex (), and even when both and () are nonconvex [25, 3]. Reddi et al. [25] further proved global linear convergence for gradient dominated nonconvex problems. Our analysis is inspired by [18, 25], but applies to the substantially more general Riemannian optimization setting.
References of Riemannian optimization can be found in [33, 1], where analysis is limited to asymptotic convergence (except [33, Theorem 4.2] which proved linear rate convergence for firstorder line search method with bounded and positive definite hessian). Stochastic Riemannian optimization has been previously considered in [7, 21], though with only asymptotic convergence analysis, and without any rates. Many applications of Riemannian optimization are known, including matrix factorization on fixedrank manifold [34, 32], dictionary learning [8, 31], optimization under orthogonality constraints [11, 22], covariance estimation [35], learning elliptical distributions [39, 30], and Gaussian mixture models [15]. Notably, some nonconvex Euclidean problems are geodesically convex, for which Riemannian optimization can provide similar guarantees to convex optimization. Zhang and Sra [38] provide the first global complexity analysis for firstorder Riemannian algorithms, but their analysis is restricted to geodesically convex problems with full or stochastic gradients. In contrast, we propose Rsvrg, a variance reduced Riemannian stochastic gradient algorithm, and analyze its global complexity for both geodesically convex and nonconvex problems.
In parallel with our work, [19] also proposed and analyzed Rsvrg specifically for the Grassmann manifold, where the complexity analysis is restricted to local convergence to strict local minimums, which essentially corresponds to our analysis of (locally) geodesically strongly convex functions.
2 Preliminaries
Before formally discussing Riemannian optimization, let us recall some foundational concepts of Riemannian geometry. For a thorough review one can refer to any classic text, e.g.,[24].
A Riemannian manifold is a real smooth manifold equipped with a Riemannain metric . The metric induces an inner product structure in each tangent space associated with every . We denote the inner product of as ; and the norm of is defined as . The angle between is defined as . A geodesic is a constant speed curve that is locally distance minimizing. An exponential map maps in to on , such that there is a geodesic with and . If between any two points in there is a unique geodesic, the exponential map has an inverse and the geodesic is the unique shortest path with the geodesic distance between .
Parallel transport maps a vector to , while preserving norm, and roughly speaking, “direction,” analogous to translation in . A tangent vector of a geodesic remains tangent if parallel transported along . Parallel transport preserves inner products.
The geometry of a Riemannian manifold is determined by its Riemannian metric tensor through various characterization of curvatures. Let be linearly independent, so that they span a two dimensional subspace of . Under the exponential map, this subspace is mapped to a two dimensional submanifold of . The sectional curvature is defined as the Gauss curvature of at . As we will mainly analyze manifold trigonometry, for worstcase analysis, it is sufficient to consider sectional curvature.
Function Classes.
We now define some key terms. A set is called geodesically convex if for any , there is a geodesic with and for . Throughout the paper, we assume that the function in (1) is defined on a geodesically convex set on a Riemannian manifold .
We call a function geodesically convex (gconvex) if for any and any geodesic such that , and for , it holds that
It can be shown that if the inverse exponential map is welldefined, an equivalent definition is that for any , , where is a subgradient of at (or the gradient if is differentiable). A function is called geodesically strongly convex (strongly gconvex) if for any and subgradient , it holds that
We call a vector field geodesically Lipschitz (gLipschitz) if for any ,
where is the parallel transport from to . We call a differentiable function geodesically smooth (gsmooth) if its gradient is gLipschitz, in which case we have
We say is gradient dominated if is a global minimizer of and for every
(2) 
We recall the following trigonometric distance bound that is essential for our analysis:
3 Riemannian SVRG
In this section we introduce Rsvrg formally. We make the following standing assumptions: (a) attains its optimum at ; (b) is compact, and the diameter of is bounded by , that is, ; (c) the sectional curvature in is upper bounded by , and within the exponential map is invertible; and (d) the sectional curvature in is lower bounded by . We define the following key geometric constant that capture the impact of manifold curvature:
(4) 
We note that most (if not all) practical manifold optimization problems can satisfy these assumptions.
Our proposed Rsvrg algorithm is shown in Algorithm 1. Compared with the Euclidean SVRG, it differs in two key aspects: the variance reduction step uses parallel transport to combine gradients from different tangent spaces; and the exponential map is used (instead of the update ).
3.1 Convergence analysis for strongly gconvex functions
In this section, we analyze global complexity of Rsvrg for solving (1), where each () is gsmooth and is strongly gconvex. In this case, we show that Rsvrg has linear convergence rate. This is in contrast with the rate of Riemannian stochastic gradient algorithm for strongly gconvex functions [38].
Theorem 1.
The proof of Theorem 1 is in the appendix, and takes a different route compared with the original SVRG proof [18]. Specifically, due to the nonlinear Riemannian metric, we are not able to bound the squared norm of the variance reduced gradient by . Instead, we bound this quantity by the squared distances to the minimizer, and show linear convergence of the iterates. A bound on is then implied by gsmoothness, albeit with a stronger dependency on the condition number. Theorem 1 leads to the following more digestible corollary on the global complexity of the algorithm:
Corollary 1.
With assumptions as in Theorem 1 and properly chosen parameters, after IFO calls, the output satisfies
We give a proof with specific parameter choices in the appendix. Observe the dependence on in our result: for , we have , which implies that negative space curvature adversarially affects convergence rate; while for , we have , which implies that for nonnegatively curved manifolds, the impact of curvature is not explicit. In the rest of our analysis we will see a similar effect of sectional curvature; this phenomenon seems innate to manifold optimization (also see [38]).
In the analysis we do not assume each to be gconvex, which resulted in a worse dependence on the condition number. We note that a similar result was obtained in linear space [12]. However, we will see in the next section that by generalizing the analysis for gradient dominated functions in [25], we are able to greatly improve this dependence.
3.2 Convergence analysis for geodesically nonconvex functions
In this section, we analyze global complexity of Rsvrg for solving (1), where each is only required to be gsmooth, and neither nor need be gconvex. We measure convergence to a stationary point using following [13]. Note, however, that here and is defined via the inner product in . We first note that RiemannianSGD on nonconvex gsmooth problems attains convergence as SGD [13] holds; we relegate the details to the appendix.
Recently, two groups independently proved that variance reduction also benefits stochastic gradient methods for nonconvex smooth finitesum optimization problems, with different analysis [3, 25]. Our analysis for nonconvex Rsvrg is inspired by [25]. Our main result for this section is Theorem 2.
Theorem 2.
The key challenge in proving Theorem 2 in the Riemannian setting is to incorporate the impact of using a nonlinear metric. Similar to the gconvex case, the nonlienar metric impacts the convergence, notably through the constant that depends on a lowerbound on sectional curvature.
Reddi et al. [25] suggested setting , in which case we obtain the following corollary.
Corollary 2.
With assumptions and parameters in Theorem 2, choosing , the IFO complexity for achieving an accurate solution is:
Corollary 3.
With assumptions in Theorem 2 and , the IFO complexity for achieving an accurate solution is .
The same reasoning allows us to also capture the class of gradient dominated functions (2), for which Reddi et al. [25] proved that SVRG converges linearly to a global optimum. We have the following corresponding theorem for Rsvrg:
Theorem 3.
We summarize the implication of Theorem 3 as follows (note the dependency on curvature):
Corollary 4.
A typical example of gradient dominated function is a strongly gconvex function (see appendix). Specifically, we have the following corollary, which prove linear convergence rate of Rsvrg with the same assumptions as in Theorem 1, improving the dependence on the condition number.
4 Applications
4.1 Computing the leading eigenvector
In this section, we apply our analysis of Rsvrg for gradient dominated functions (Theorem 3) to fast eigenvector computation, a fundamental problem that is still being actively researched in the bigdata setting [29, 12, 17]. For the problem of computing the leading eigenvector, i.e.,
(5) 
existing analyses for stateoftheart algorithms typically result in dependency on the eigengap of , as opposed to the conjectured dependency [29], as well as the dependency of power iteration. Here we give new support for the conjecture. Note that Problem (5) seen as one in is nonconvex, with negative semidefinite Hessian everywhere, and has nonlinear constraints. However, we show that on the hypersphere Problem (5) is unconstrained, and has gradient dominated objective. In particular we have the following result:
Theorem 4.
Suppose has eigenvalues and . With probability , the random initialization falls in a Riemannian ball of a global optimum of the objective function, within which the objective function is gradient dominated.
We provide the proof of Theorem 4 in appendix. Theorem 4 gives new insights for why the conjecture might be true – once it is shown that with a constant stepsize and with high probability (both independent of ) the iterates remain in such a Riemannian ball, applying Corollary 4 one can immediately prove the dependency conjecture. We leave this analysis as future work.
Next we show that variance reduced PCA (VRPCA) [29] is closely related to Rsvrg. We implement Riemannian SVRG for PCA, and use the code for VRPCA in [29]. Analytic forms for exponential map and parallel transport on hypersphere can be found in [1, Example 5.4.1; Example 8.1.1]. We conduct wellcontrolled experiments comparing the performance of two algorithms. Specifically, to investigate the dependency of convergence on , for each where , we generate a matrix where using the method where are orthonormal matrices and is a diagonal matrix, as described in [29]. Note that has the same eigenvalues as . All the data matrices share the same and only differ in (thus also in ). We also fix the same random initialization and random seed. We run both algorithms on each matrix for epochs. For every five epochs, we estimate the number of epochs required to double its accuracy ^{2}^{2}2Accuracy is measured by , i.e. the relative error between the objective value and the optimum. We measure how much the error has been reduced after each five epochs, which is a multiplicative factor on the error at the start of each five epochs. Then we use as the estimate, assuming stays constant.. This number can serve as an indicator of the global complexity of the algorithm. We plot this number for different epochs against , shown in Figure 2. Note that the performance of RSVRG and VRPCA with the same stepsize is very similar, which implies a close connection of the two. Indeed, the update used in [29] and others is a wellknown approximation to the exponential map with small stepsize (a.k.a. retraction). Also note the complexity of both algorithms seems to have an asymptotically linear dependency on .
4.2 Computing the Riemannian centroid
In this subsection we validate that Rsvrg converges linearly for averaging PSD matrices under the Riemannian metric. The problem for finding the Riemannian centroid of a set of PSD matrices is where is also a PSD matrix. This is a geodesically strongly convex problem, yet nonconvex in Euclidean space. It has been studied both in matrix computation and in various applications [5, 16]. We use the same experiment setting as described in [38] ^{3}^{3}3We generate random PSD matrices using the Matrix Mean Toolbox [6] with normalization so that the norm of each matrix equals ., and compare Rsvrg against Riemannian full gradient (RGD) and stochastic gradient (RSGD) algorithms (Figure 3). Other methods for this problem include the relaxed Richardson iteration algorithm [6], the approximated joint diagonalization algorithm [9], and Riemannian Newton and quasiNewton type methods, notably the limitedmemory Riemannian BFGS [37]. However, none of these methods were shown to greatly outperform RGD, especially in data science applications where is large and extremely small optimization error is not required.
Note that the objective is sum of squared Riemannian distances in a nonpositively curved space, thus is strongly gconvex and gsmooth. According to Theorem 1 the optimal stepsize for Rsvrg is . For all the experiments, we initialize all the algorithms using the arithmetic mean of the matrices. We set , and choose in Algorithm 1 for Rsvrg, and use suggested parameters from [38] for other algorithms. The results suggest Rsvrg has clear advantage in the large scale setting.
5 Discussion
We introduce Riemannian SVRG, the first variance reduced stochastic gradient algorithm for Riemannian optimization. In addition, we analyze its global complexity for optimizing geodesically strongly convex, convex, and nonconvex functions, explicitly showing their dependence on sectional curvature. Our experiments validate our analysis that Riemannian SVRG is much faster than full gradient and stochastic gradient methods for solving finitesum optimization problems on Riemannian manifolds.
Our analysis of computing the leading eigenvector as a Riemannian optimization problem is also worth noting: a nonconvex problem with nonpositive Hessian and nonlinear constraints in the ambient space turns out to be gradient dominated on the manifold. We believe this shows the promise of theoretical study of Riemannian optimization, and geometric optimization in general, and we hope it encourages other researchers in the community to join this endeavor.
Our work also has limitations – most practical Riemannian optimization algorithms use retraction and vector transport to efficiently approximate the exponential map and parallel transport, which we do not analyze in this work. A systematic study of retraction and vector transport is an important topic for future research. For other applications of Riemannian optimization such as lowrank matrix completion [34], covariance matrix estimation [35] and subspace tracking [11], we believe it would also be promising to apply fast incremental gradient algorithms in the large scale setting.
References
 Absil et al. [2009] P.A. Absil, R. Mahony, and R. Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
 Agarwal and Bottou [2015] A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pages 78–86, 2015.
 AllenZhu and Hazan [2016] Z. AllenZhu and E. Hazan. Variance reduction for faster nonconvex optimization. arXiv:1603.05643, 2016.
 Bach and Moulines [2013] F. Bach and E. Moulines. Nonstronglyconvex smooth stochastic approximation with convergence rate o (1/n). In Advances in Neural Information Processing Systems, pages 773–781, 2013.
 Bhatia [2007] R. Bhatia. Positive Definite Matrices. Princeton University Press, 2007.
 Bini and Iannazzo [2013] D. A. Bini and B. Iannazzo. Computing the karcher mean of symmetric positive definite matrices. Linear Algebra and its Applications, 438(4):1700–1710, 2013.
 Bonnabel [2013] S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. Automatic Control, IEEE Transactions on, 58(9):2217–2229, 2013.
 Cherian and Sra [2015] A. Cherian and S. Sra. Riemannian dictionary learning and sparse coding for positive definite matrices. arXiv:1507.02772, 2015.
 Congedo et al. [2015] M. Congedo, B. Afsari, A. Barachant, and M. Moakher. Approximate joint diagonalization and geometric mean of symmetric positive definite matrices. PloS one, 10(4):e0121423, 2015.
 Defazio et al. [2014] A. Defazio, F. Bach, and S. LacosteJulien. Saga: A fast incremental gradient method with support for nonstrongly convex composite objectives. In NIPS, pages 1646–1654, 2014.
 Edelman et al. [1998] A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
 Garber and Hazan [2015] D. Garber and E. Hazan. Fast and simple pca via convex optimization. arXiv preprint arXiv:1509.05647, 2015.
 Ghadimi and Lan [2013] S. Ghadimi and G. Lan. Stochastic firstand zerothorder methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
 Gong and Ye [2014] P. Gong and J. Ye. Linear convergence of variancereduced stochastic gradient without strong convexity. arXiv preprint arXiv:1406.1102, 2014.
 Hosseini and Sra [2015] R. Hosseini and S. Sra. Matrix manifold optimization for Gaussian mixtures. In NIPS, 2015.
 Jeuris et al. [2012] B. Jeuris, R. Vandebril, and B. Vandereycken. A survey and comparison of contemporary algorithms for computing the matrix geometric mean. Electronic Transactions on Numerical Analysis, 39:379–402, 2012.
 Jin et al. [2015] C. Jin, S. M. Kakade, C. Musco, P. Netrapalli, and A. Sidford. Robust shiftandinvert preconditioning: Faster and more sample efficient algorithms for eigenvector computation. arXiv:1510.08896, 2015.
 Johnson and Zhang [2013] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
 Kasai et al. [2016] H. Kasai, H. Sato, and B. Mishra. Riemannian stochastic variance reduced gradient on grassmann manifold. arXiv preprint arXiv:1605.07367, 2016.
 Konečnỳ and Richtárik [2013] J. Konečnỳ and P. Richtárik. Semistochastic gradient descent methods. arXiv:1312.1666, 2013.
 Liu et al. [2004] X. Liu, A. Srivastava, and K. Gallivan. Optimal linear representations of images for object recognition. IEEE TPAMI, 26(5):662–666, 2004.
 Moakher [2002] M. Moakher. Means and averaging in the group of rotations. SIAM journal on matrix analysis and applications, 24(1):1–16, 2002.
 Oja [1992] E. Oja. Principal components, minor components, and linear neural networks. Neural Networks, 5(6):927–935, 1992.
 Petersen [2006] P. Petersen. Riemannian geometry, volume 171. Springer Science & Business Media, 2006.
 Reddi et al. [2016] S. J. Reddi, A. Hefny, S. Sra, B. Póczós, and A. Smola. Stochastic variance reduction for nonconvex optimization. arXiv:1603.06160, 2016.
 Robbins and Monro [1951] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
 Rubinstein and Kroese [2011] R. Y. Rubinstein and D. P. Kroese. Simulation and the Monte Carlo method, volume 707. John Wiley & Sons, 2011.
 Schmidt et al. [2013] M. Schmidt, N. L. Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv:1309.2388, 2013.
 Shamir [2015] O. Shamir. A Stochastic PCA and SVD Algorithm with an Exponential Convergence Rate. In International Conference on Machine Learning (ICML15), pages 144–152, 2015.
 Sra and Hosseini [2013] S. Sra and R. Hosseini. Geometric optimisation on positive definite matrices for elliptically contoured distributions. In Advances in Neural Information Processing Systems, pages 2562–2570, 2013.
 Sun et al. [2015] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere ii: Recovery by riemannian trustregion method. arXiv:1511.04777, 2015.
 Tan et al. [2014] M. Tan, I. W. Tsang, L. Wang, B. Vandereycken, and S. J. Pan. Riemannian pursuit for big matrix recovery. In International Conference on Machine Learning (ICML14), pages 1539–1547, 2014.
 Udriste [1994] C. Udriste. Convex functions and optimization methods on Riemannian manifolds, volume 297. Springer Science & Business Media, 1994.
 Vandereycken [2013] B. Vandereycken. Lowrank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236, 2013.
 Wiesel [2012] A. Wiesel. Geodesic convexity and covariance estimation. IEEE Transactions on Signal Processing, 60(12):6182–6189, 2012.
 Xiao and Zhang [2014] L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
 Yuan et al. [2016] X. Yuan, W. Huang, P.A. Absil, and K. Gallivan. A riemannian limitedmemory bfgs algorithm for computing the matrix geometric mean. Procedia Computer Science, 80:2147–2157, 2016.
 Zhang and Sra [2016] H. Zhang and S. Sra. Firstorder methods for geodesically convex optimization. arXiv:1602.06053, 2016.
 Zhang et al. [2013] T. Zhang, A. Wiesel, and M. S. Greco. Multivariate generalized Gaussian distribution: Convexity and graphical models. Signal Processing, IEEE Transactions on, 61(16):4141–4148, 2013.
Appendix: Fast Stochastic Optimization on Riemannian Manifolds
Appendix A Proofs for Section 3.1
Theorem 1.
Proof.
We start by bounding the squared norm of the variance reduced gradient. Since , conditioned on and taking expectation with respect to , we obtain:
We use twice, in the first and fourth inequalities. The second equality is due to . The second inequality is due to the gsmoothness assumption. The third inequality is due to triangle inequality.
Notice that and , we thus have
The first inequality uses the trigonometric distance lemma, the second one uses previously obtained bound for , the third and fourth use the strong gconvexity of .
We now denote . Hence by taking expectation with all the history, and noting , we have , i.e. . Therefore, , hence we get
where . It follows directly from the algorithm that after outer loops, . ∎
Corollary 1.
With assumptions as in Theorem 1 and properly chosen parameters, after IFO calls, the output satisfies
Proof.
Assume we choose and , it follows that and therefore
where the second inequality is due to for . Applying Theorem 1 with , we have . Note that by using the gsmooth assumption, we also get . It thus suffices to run outer loops to guarantee .
For the th outer loop, we need IFO calls to evaluate the full gradient at , and IFO calls when calculating each variance reduced gradient. Hence the total number of IFO calls to reach accuracy is . ∎
Appendix B Proofs for Section 3.2
Theorem 5.
Assuming the inverse exponential map is welldefined on , is a geodesically smooth function, stochastic firstorder oracle satisfies , then the SGD algorithm with satisfies
Proof.
After rearrangement, we obtain
Summing up the above equation from to and using where
we obtain
∎
Lemma 2.
Proof.
Since is smooth we have
(6) 
Consider now the Lyapunov function