# Riemannian stochastic variance reduced gradient

Hiroyuki Sato Department of Information and Computer Technology, Tokyo University of Science, Tokyo, Japan (hsato@rs.tus.ac.jp).    Hiroyuki Kasai Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo, Japan (kasai@is.uec.ac.jp).    Bamdev Mishra Core Machine Learning Team, Amazon.com, Bangalore, India (bamdevm@amazon.com.)
July 5, 2019
###### Abstract

Stochastic variance reduction algorithms have recently become popular for minimizing the average of a large but finite number of loss functions. In this paper, we propose a novel Riemannian extension of the Euclidean stochastic variance reduced gradient algorithm (R-SVRG) to a manifold search space. The key challenges of averaging, adding, and subtracting multiple gradients are addressed with retraction and vector transport. We present a global convergence analysis of the proposed algorithm with a decay step size and a local convergence rate analysis under a fixed step size under some natural assumptions. The proposed algorithm is applied to problems on the Grassmann manifold, such as principal component analysis, low-rank matrix completion, and computation of the Karcher mean of subspaces, and outperforms the standard Riemannian stochastic gradient descent algorithm in each case111This paper extends the earlier work  to include more general results..

Keywords: Riemannian optimization, stochastic variance reduced gradient, retraction, vector transport, matrix completion

## 1 Introduction

A general loss minimization problem is defined as , where , is the model variable, is the number of samples, and is the loss incurred on the -th sample. The full gradient descent (GD) algorithm requires evaluating derivatives, i.e., , per iteration, which is computationally expensive when is very large. A popular alternative uses only one derivative per iteration for the -th sample, and forms the basis of the stochastic gradient descent (SGD) algorithm. When a relatively large step size is used in SGD, train loss first decreases rapidly, but results in large fluctuations around the solution. Conversely, when a small step size is used, a large number of iterations are required for SGD to converge. To circumvent this issue, SGD starts with a relatively large step size and gradually decreases it.

Recently, variance reduction techniques have been proposed to accelerate SGD convergence [14, 19, 23, 25, 26, 7, 30]. The stochastic variance reduced gradient (SVRG) is a popular technique with superior convergence properties . For smooth and strongly convex functions, SVRG has convergence rates similar to those of stochastic dual coordinate ascent  and stochastic average gradient (SAG) algorithms . Garber and Hazan  analyzed the convergence rate for SVRG when is a convex function that is the sum of nonconvex (but smooth) terms and applied this result to principal component analysis (PCA) problem. Shalev-Shwartz  also found similar results. Allen-Zhu and Yuan  further studied the same case with better convergence rates. Shamir  specifically studied the convergence properties of the variance reduction PCA algorithm. Very recently, Allen-Zhu and Hazan  proposed a variance reduction method for faster nonconvex optimization. However, it should be noted that all these cases assume a Euclidean search space.

In this paper, we handle problems in which the variables have a manifold structure:

 minw∈Mf(w):=1NN∑n=1fn(w), (1)

where is a Riemannian manifold and are real-valued functions on . These problems include, for example, low-rank matrix completion problem , Karcher mean computation problem, and PCA problem. In all these problems, optimization on Riemannian manifolds has shown state-of-the-art performance. The Riemannian framework exploits the geometry of the search space which are characterized by constraints of the optimization problem. A lot of efficient optimization algorithms have been developed . Specifically, the problem , where is a Riemannian manifold, is solved as an unconstrained optimization problem defined over the Riemannian manifold search space. Bonnabel  proposed a Riemannian stochastic gradient descent algorithm (R-SGD) that extends SGD from Euclidean space to Riemannian manifolds.

It should be mentioned that recent work , which appeared simultaneously with our technical report , also proposes R-SVRG on manifolds. The main difference between our work and  is that we provide convergence analyses for the algorithm with retraction and vector transport, whereas  deals with a special case in which exponential mapping and parallel translation are used as retraction and vector transport, respectively. There are further differences; our convergence analysis handles global and local convergence analyses separately, as in typical analyses of batch algorithms on Riemannian manifolds . Another difference is that our assumptions for the local rate of convergence analysis are imposed only in a local neighborhood around a minimum, which is milder and more natural than the assumptions in , which assumes Lipschitz smoothness in the entire space. Consequently, our analysis should be applicable to a wider variety of manifolds than that of .

Building upon the work of Bonnabel  and , we propose an extension of the stochastic variance reduction gradient algorithm to Riemannian manifold search space (R-SVRG) and novel analyses. This extension is nontrivial and requires particular consideration in handling the averaging, addition, and subtraction of multiple gradients at different points on the manifold . To this end, this paper specifically leverages the notions of retraction and vector transport. The algorithm and convergence analysis in this paper are challengingly generalized in the retraction and vector transport case, as well as in the exponential mapping and parallel translation case, allowing extremely efficient implementation and making distinct contributions compared with existing works  that rely only on the exponential mapping and parallel translation case.

The paper is organized as follows. Section 2 discusses Riemannian optimization theory, including background on Riemannian geometry and some geometric tools used in optimization on Riemannian manifolds. The detailed description of R-SVRG is given in Section 3. Sections 4 and 5 present the global convergence analysis and the local convergence rate analysis of the proposed R-SVRG, respectively. In Section 6, numerical comparisons with R-SGD on three problems suggest the superior performance of R-SVRG.

Our proposed R-SVRG is implemented in the Matlab toolbox Manopt . The Matlab codes for the proposed algorithms are available at https://bamdevmishra.com/codes/RSVRG/.

## 2 Riemannian optimization

Optimization on manifolds, or Riemannian optimization, seeks a global or local optimum of a given real-valued function, called objective or cost function, defined on a smooth Riemannian manifold . One of the advantages of using Riemannian geometry tools is that the intrinsic properties of the manifold allow constrained optimization problems to be handled as unconstrained optimization problems. This section introduces optimization on manifolds by summarizing . We refer to the many references therein, and in [22, 20], for further detail.

Let be a smooth real-valued function on manifold . In optimization, we compute a minimum of ; usual methods for solving this minimization problem are iterative algorithms on manifold . In an iterative algorithm, with a given starting point , we generate a sequence on that converges to whenever is in a neighborhood of . In an iterative optimization algorithm, we compute a search direction and then move in the search direction. More concretely, an iteration on manifold is performed by following geodesics emanating from and tangent to at . Notion of Geodesics on Riemannian manifolds is a generalized concept of straight lines in the Euclidean space. For any tangent vector in at , there exists an interval about and a unique geodesic , such that and . The exponential mapping at is defined by geodesics emanating from as for . If is a complete manifold, exponential mapping is defined for all vectors . We can thus obtain an update formula using the exponential mapping:

 wt+1 = Expwt(stξwt),

where the search direction is in the tangent space of at , the scalar is the step size, and is the exponential mapping [1, Section 5.4] that induces a line-search algorithm along geodesics. Additionally, given two points and on , the logarithm mapping, or simply log mapping, which is the inverse of the exponential mapping, maps to a vector on the tangent space at . The log mapping satisfies , where is the shortest distance between two points on .

The steepest descent or full gradient descent method for minimizing on is an iterative algorithm obtained when is used as the search direction . is the Riemannian gradient of at , which is computed according to the chosen metric at . Collecting each metric over , the family is called a Riemannian metric on . is an inner product elements and of the tangent space at . Herein, we use the notation instead of for simplicity. The gradient is defined as the unique element of that satisfies

where is the Fréchet derivative of in the direction .

In searching a next point along a geodesic, we need to compute tangent vectors obtained by the exponential mapping, which are expensive to compute in general. There are even some Riemannian manifolds for which a closed form for exponential mapping is not available. Alternatively, we can use curves other than geodesics as long as a starting point and its tangent vector at the initial time is the same as those of geodesics. A more general update formula is then written as

 wt+1 = Rwt(stξwt),

where is a retraction, which is any map that locally approximates the exponential mapping, up to first-order, on the manifold [1, Definition 4.1.1]. Retractions include the exponential mapping as a special case. An advantage of using retractions is that the computational cost can be reduced compared with using the exponential mapping. It is worth noting that convergence properties for the exponential mapping usually holds for retractions as well.

In the R-SVRG proposed in Section 3, we need to add tangent vectors that are in different tangent spaces, say and on . A mathematically natural way to do so is to use the parallel translation operator. Parallel translation transports a vector field on the geodesic curve that satisfies and [1, Section 5.4], where is the parallel translation operator sending to . However, parallel translation is sometimes computationally expensive and no explicit formula is available for some manifolds, such as the Stiefel manifold. A vector transport on , which is a map [1, Definition 8.1.1], is used as an alternative. A vector transport has an associated retraction . For and , is a tangent vector at , which can be regarded as a transported vector along . Parallel translation is an example of vector transport. In the following, we use the notations , , and interchangeably, where is a curve connecting and on defined by retraction as with .

## 3 Riemannian stochastic variance reduced gradient

After a brief explanation of the variance reduced gradient variants in Euclidean space, we describe the proposed Riemannian stochastic variance reduced gradient on Riemannian manifolds.

### 3.1 Variance reduced gradient variants in Euclidean space

The SGD update in Euclidean space is , where is a randomly selected vector called the stochastic gradient and is the step size. SGD assumes an unbiased estimator of the full gradient, as . Many recent variants of the variance reduced gradient of SGD attempt to reduce its variance as increases to achieve better convergence [14, 19, 23, 25, 26, 7, 30]. SVRG, proposed in , introduces an explicit variance reduction strategy with double loops, where the -th outer loop, called the -th epoch, has inner iterations. SVRG first keeps or for randomly chosen at the end of the -th epoch, and also sets the initial value of the -th epoch to . It then computes a full gradient . Subsequently, denoting the selected random index by , SVRG randomly picks the -th sample for each at and computes the modified stochastic gradient as

 vst = ∇fist(wst−1)−∇fist(~ws−1)+∇f(~ws−1). (2)

It should be noted that SVRG can be regarded as one special case of S2GD (semi-stochastic gradient descent), which differs in the number of inner loop iterations chosen .

### 3.2 Proposed Riemannian extension of SVRG (R-SVRG)

We propose a Riemannian extension of SVRG on a Riemannian manifold , called R-SVRG. Here, we denote the Riemannian stochastic gradient for the -th sample as and the modified Riemannian stochastic gradient as instead of , to show the differences from the Euclidean case.

R-SVRG reduces variance analogously to the SVRG algorithm in the Euclidean case. More specifically, R-SVRG keeps after stochastic update steps of the -th epoch, and computes the full Riemannian gradient only for this stored . The algorithm also computes the Riemannian stochastic gradient that corresponds to this -th sample. Then, picking the -th sample for each -th inner iteration of the -th epoch at , we calculate in the same way as in (2), i.e., by modifying the stochastic gradient using both and . Translating the right-hand side of (2) to manifold involves the sum of , , and , which belong to two separate tangent spaces and . This operation requires particular attention on a manifold, and a vector transport enables us to handle multiple elements on two separated tangent spaces flexibly. More concretely, and are first transported to at the current point, ; then, they can be added to on . Consequently, the modified Riemannian stochastic gradient at the -th inner iteration of the -th epoch is set to

where represents a vector transport operator from to on . Specifically, we need to calculate the tangent vector from to , which is given by the inverse of the retraction, i.e., . Consequently, the final update rule of R-SVRG is defined as , where is a step size at . As shown in our local convergence analysis (Section 5.2), can be fixed once the iterate approaches sufficiently close to a solution. It should be noted that the modified direction is also a Riemannian stochastic gradient of at .

Let be the expectation with respect to the random choice of , conditioned on all randomness introduced up to the -th iteration of the inner loop of the -th epoch. Conditioned on , we take the expectation with respect to and obtain

The theoretical convergence analysis of the Euclidean SVRG algorithm assumes that the initial vector of the -th epoch is set to the average (or a random) vector of the -th epoch [14, Figure 1]. However, the set of the last vectors in the -th epoch, i.e., shows superior performance in the Euclidean SVRG algorithm. Therefore, for our local convergence rate analysis in Theorem 5.1, this study also uses, as option I, the mean value of as , where is the Karcher mean on the manifold. Alternatively, we can also simply choose for at random. In addition, as option II, we can use the last vector in the -th epoch, i.e., . In the global convergence analysis in Section 4, we use option II. The overall algorithm, with a fixed step size, is summarized in Algorithm 1.

Additionally, variants of the variance reduced SGD initially require a full gradient calculation at every epoch. This initially results in more overhead than the ordinal SGD algorithm and eventually causes a cold-start property in these variants. To avoid this, , in Euclidean space, proposes using the standard SGD update only for the first epoch. We also adopt this simple modification of R-SVRG, denoted as R-SVRG+; we do not analyze this extension, but leave it as an open problem.

As mentioned earlier, each iteration of R-SVRG has double loops, to reduce the variance of the modified stochastic gradient . The -th epoch, i.e., the outer loop, requires gradient evaluations, where is for the full gradient at the beginning of each -th epoch and is for the inner iterations, since each inner step needs two gradient evaluations, i.e., and . However, if for each sample is stored at the beginning of the -th epoch, as in SAG, the evaluations over all inner loops result in . Finally, the -th epoch requires evaluations. It is natural to choose an of the same order as but slightly larger (e.g., for nonconvex problems is suggested in ).

## 4 Global convergence analysis

In this section, we provide a global convergence analysis of Algorithm 1 for Problem (1) after introducing some assumptions. Throughout this section, we let and denote a retraction and vector transport used in Algorithm 1, respectively, and suppose the following assumptions.

###### Assumption 4.1.

The vector transport is isometric on , i.e., for any and , .

Note that we can construct an isometric vector transport as in  so that Assumption 4.1 holds.

###### Assumption 4.2.

The objective function and its components are twice continuously differentiable.

As discussed in , for a positive constant , a -totally retractive neighborhood of is a neighborhood such that for all , , and is a diffeomorphism on , which is the ball in with center and radius , where is the zero vector in . The concept of a totally retractive neighborhood is analogous to that of a totally normal neighborhood for exponential mapping.

###### Assumption 4.3.

For a sequence generated by Algorithm 1, there exists a compact and connected set such that for all . Additionally, for each , there exists a totally retractive neighborhood of such that stays in for any . Furthermore, suppose that there exists such that .

In the global convergence analysis, we also assume that the sequence of step sizes satisfies the usual condition in stochastic approximation as follows:

###### Assumption 4.4.

The sequence of step sizes satisfies

 ∑(αst)2 < ∞   and   ∑αst = ∞. (4)

This condition is satisfied, for example, if with positive constants and , where is the total iteration number depending on and . We also note the following proposition introduced in :

###### Proposition 4.1 ().

Let be a nonnegative stochastic process with bounded positive variations, i.e., such that , where denotes the quantity for a random variable and is the increasing sequence of -algebras generated by the variables just before time . Then, the process is a quasi-martingale, i.e.,

 ∞∑n=0|E[Xn+1−Xn|Fn]|<∞   a.s., and Xn converges a.s.

Now, we give the a.s. convergence of the proposed algorithm under the assumption that the generated sequenc is in a compact set.

###### Theorem 4.1.

Suppose Assumptions 4.14.3 and consider Algorithm 1 with option II and step sizes satisfying Assumption 4.4 on a Riemannian manifold . If , then converges a.s. and a.s.

###### Proof.

The claim is proved similarly to the proof of the standard Riemannian SGD (see ). Since is compact, all continuous functions on can be bounded. Therefore, there exists such that for all and , we have and . We use in the following. Moreover, as , there exists such that for we have . Suppose now that . Let satisfy , i.e., . It follows from the triangle inequality that

where we have used the assumption that is an isometry. Therefore, we have

 ∥R−1wst(wst+1)∥wst=∥−αstξst+1∥wst≤αstC′

Hence, it follows from the assumption that there exists a curve linking and . Thus, the Taylor formula implies that

where is an upper bound of the largest eigenvalues of the Euclidean Hessian of in the compact ball . Let be the increasing sequence of -algebras defined by

 Fst={i11,…,i1m1,…,is−11,…,is−1ms−1,is1,…,ist−1}.

Since is computed from , it is measurable in . As is independent of , we have

which yields that

as .

We reindex the sequence as
. We also similarly reindex . As , (5) proves that is a nonnegative supermartingale. Hence, it converges a.s. Returning to the original indexing, this implies that converges a.s. Moreover, summing the inequalities, we have

Here, we prove that the right term is bounded. By doing so, we can conclude the convergence of the left term.

Summing (5) over and , it is clear that satisfies the assumption of Proposition 4.1; thus, is a quasi-martingale, implying that converges a.s. because of inequality (6), where , by Proposition 4.1. If is proved to converge a.s., it can only converge to 0 a.s. because of condition (4).

Now consider the nonnegative process . Bounding the second derivative of by , along the curve defined by the retraction linking and , a Taylor expansion yields

Let be a constant such that is a lower bound of the Hessian of on the compact set. Then, we have

We have proved the fact that the sum of the right term is finite, implying that is a quasi-martingale, thus further implying the a.s. convergence of to , as claimed. ∎

Theorem 4.1 contains a global convergence of the algorithm with exponential mapping and parallel translation as a special case. In this case, a sufficient condition for the assumptions can be easily written by notions of injectivity radius.

###### Corollary 4.1.

Suppose Assumption 4.1 and consider Algorithm 1 with option II and step sizes satisfying Assumption 4.4 on a Riemannian manifold , where the exponential mapping and parallel translation are used as retraction and vector transport, respectively. Assume that is connected and has injectivity radius uniformly bounded from below by . Assume also that there exists a compact set such that for all . If , then converges a.s. and a.s.

## 5 Local convergence rate analysis

In this section, we show the local convergence rate analysis of the R-SVRG algorithm; we analyze the convergence of any sequences generated by Algorithm 1 that are contained in a sufficiently small neighborhood of a local minimum point of the objective function. Hence, we can assume that the objective function is convex in such a neighborhood. We first give formal expressions of these assumptions and then analyze the local convergence rate of the algorithm.

### 5.1 Assumptions and existing lemmas

We again suppose Assumptions 4.1 and 4.2. That is, the objective function is and the vector transport is isometric.

Let be a critical point. As discussed in , we note that there exists a positive constant and -totally retractive neighborhood of . In the local convergence analysis, we assume the following.

###### Assumption 5.1.

The sequence generated by Algorithm 1 continuously remains in totally retractive neighborhood of critical point , i.e., for all and for all . In addition, the radius of is sufficiently small.

###### Assumption 5.2.

There exists a constant such that vector transport satisfies the following conditions:

 ∥Tη−TRη∥≤c0∥η∥,∥T−1η−T−1Rη∥≤c0∥η∥,

where denotes the differentiated retraction, i.e.,

 TRηw(ξw)=DR(ηw)[ξw],ηw,ξw∈TwM,w∈M.

Note that it follows from Taylor’s theorem and the fact that that Assumption 5.2 holds if and are .

Also, we assume the following.

###### Assumption 5.3.

is strongly retraction-convex with respect to in ; i.e., there exist two constants such that for all , with , and for all satisfying for all . Furthermore, is strongly convex in , i.e., the above claim is true even if is replaced with the exponential mapping .

Letting be smaller if necessary, we can guarantee the last assumption from other assumptions using Lemma 3.1 in .

In the rest of this section, we introduce some existing lemmas to evaluate the differences of using retraction and vector transport instead of exponential mapping and parallel translation, and the effects of the curvature of the manifold in question.

###### Lemma 5.1 (In the proof of Lemma 3.9 in ).

Under Assumptions 4.2 and 5.1, there exists a constant such that

where and are in and is a curve for an arbitrary defined by retraction on . is a parallel translation operator along curve from to .

Note that curve in this lemma is not necessarily the geodesic on . Relation (7) is a generalization of the Lipschitz continuity condition.

###### Lemma 5.2 (Lemma 3.5 in ).

Let be a vector transport associated with the same retraction as that of the parallel transport . Under Assumption 5.2, for any , there exists a constant and neighborhood of such that for all ,

 ∥Tη(ξ)−Pη(ξ)∥z≤θ∥ξ∥w∥η∥w,

where and .

We can derive the following lemma from the Taylor expansion as in the proof of Lemma 3.2 in .

###### Lemma 5.3 (In the proof of Lemma 3.2 in ).

Under Assumptions 5.15.3, there exists a positive real number such that