Riemannian stochastic variance reduced gradient
Stochastic variance reduction algorithms have recently become popular for minimizing the average of a large but finite number of loss functions. In this paper, we propose a novel Riemannian extension of the Euclidean stochastic variance reduced gradient algorithm (R-SVRG) to a manifold search space. The key challenges of averaging, adding, and subtracting multiple gradients are addressed with retraction and vector transport. We present a global convergence analysis of the proposed algorithm with a decay step size and a local convergence rate analysis under a fixed step size under some natural assumptions. The proposed algorithm is applied to problems on the Grassmann manifold, such as principal component analysis, low-rank matrix completion, and computation of the Karcher mean of subspaces, and outperforms the standard Riemannian stochastic gradient descent algorithm in each case
Riemannian optimization, stochastic variance reduced gradient, retraction, vector transport, matrix completion
A general loss minimization problem is defined as , where , is the model variable, is the number of samples, and is the loss incurred on the -th sample. The full gradient descent (GD) algorithm requires evaluating derivatives, i.e., , per iteration, which is computationally expensive when is very large. A popular alternative uses only one derivative per iteration for the -th sample, and forms the basis of the stochastic gradient descent (SGD) algorithm. When a relatively large step size is used in SGD, train loss first decreases rapidly, but results in large fluctuations around the solution. Conversely, when a small step size is used, a large number of iterations are required for SGD to converge. To circumvent this issue, SGD starts with a relatively large step size and gradually decreases it.
Recently,variance reduction techniques have been proposed to accelerate SGD convergence . The stochastic variance reduced gradient (SVRG) is a popular technique with superior convergence properties . For smooth and strongly convex functions, SVRG has convergence rates similar to those of stochastic dual coordinate ascent  and stochastic average gradient (SAG) algorithms . Garber and Hazan  analyzed the convergence rate for SVRG when is a convex function that is the sum of nonconvex (but smooth) terms and applied this result to principal component analysis (PCA) problem. Shalev-Shwartz  also found similar results. Allen-Zhu and Yuan  further studied the same case with better convergence rates. Shamir  specifically studied the convergence properties of the variance reduction PCA algorithm. Very recently, Allen-Zhu and Hazan  proposed a variance reduction method for faster nonconvex optimization. However, it should be noted that all these cases assume a Euclidean search space.
In this paper, we handle problems in which the variables have a manifold structure:
where is a Riemannian manifold and are real-valued functions on . These problems include, for example, low-rank matrix completion problem , Karcher mean computation problem, and PCA problem. In all these problems, optimization on Riemannian manifolds has shown state-of-the-art performance. The Riemannian framework exploits the geometry of the search space which are characterized by constraints of the optimization problem. A lot of efficient optimization algorithms have been developed . Specifically, the problem , where is a Riemannian manifold, is solved as an unconstrained optimization problem defined over the Riemannian manifold search space. Bonnabel  proposed a Riemannian stochastic gradient descent algorithm (R-SGD) that extends SGD from Euclidean space to Riemannian manifolds.
It should be mentioned that recent work , which appeared simultaneously with our technical report , also proposes R-SVRG on manifolds. The main difference between our work and  is that we provide convergence analyses for the algorithm with retraction and vector transport, whereas  deals with a special case in which exponential mapping and parallel translation are used as retraction and vector transport, respectively. There are further differences; our convergence analysis handles global and local convergence analyses separately, as in typical analyses of batch algorithms on Riemannian manifolds . Another difference is that our assumptions for the local rate of convergence analysis are imposed only in a local neighborhood around a minimum, which is milder and more natural than the assumptions in , which assumes Lipschitz smoothness in the entire space. Consequently, our analysis should be applicable to a wider variety of manifolds than that of .
Building upon the work of Bonnabel  and , we propose an extension of the stochastic variance reduction gradient algorithm to Riemannian manifold search space (R-SVRG) and novel analyses. This extension is nontrivial and requires particular consideration in handling the averaging, addition, and subtraction of multiple gradients at different points on the manifold . To this end, this paper specifically leverages the notions of retraction and vector transport. The algorithm and convergence analysis in this paper are challengingly generalized in the retraction and vector transport case, as well as in the exponential mapping and parallel translation case, allowing extremely efficient implementation and making distinct contributions compared with existing works  that rely only on the exponential mapping and parallel translation case.
The paper is organized as follows. Section 2 discusses Riemannian optimization theory, including background on Riemannian geometry and some geometric tools used in optimization on Riemannian manifolds. The detailed description of R-SVRG is given in Section 3. Sections Section 4 and Section 5 present the global convergence analysis and the local convergence rate analysis of the proposed R-SVRG, respectively. In Section 6, numerical comparisons with R-SGD on three problems suggest the superior performance of R-SVRG.
Optimization on manifolds, or Riemannian optimization, seeks a global or local optimum of a given real-valued function, called objective or cost function, defined on a smooth Riemannian manifold . One of the advantages of using Riemannian geometry tools is that the intrinsic properties of the manifold allow constrained optimization problems to be handled as unconstrained optimization problems. This section introduces optimization on manifolds by summarizing . We refer to the many references therein, and in , for further detail.
Let be a smooth real-valued function on manifold . In optimization, we compute a minimum of ; usual methods for solving this minimization problem areiterative algorithms on manifold . In an iterative algorithm, with a given starting point , we generate a sequence on that converges to whenever is in a neighborhood of . In an iterative optimization algorithm, we compute a search direction and then move in the search direction. More concretely, an iteration on manifold is performed by following geodesics emanating from and tangent to at . Notion of Geodesics on Riemannian manifolds is a generalized concept of straight lines in the Euclidean space. For any tangent vector in at , there exists an interval about and a unique geodesic , such that and . Theexponential mapping at is defined by geodesics emanating from as for . If is a complete manifold, exponential mapping is defined for all vectors . We can thus obtain an update formula using the exponential mapping:
where the search direction is in the tangent space of at , the scalar is the step size, and is the exponential mapping  that induces a line-search algorithm along geodesics. Additionally, given two points and on , the logarithm mapping, or simply log mapping, which is the inverse of the exponential mapping, maps to a vector on the tangent space at . The log mapping satisfies , where is the shortest distance between two points on .
Thesteepest descent orfull gradient descent method for minimizing on is an iterative algorithm obtained when is used as the search direction . is the Riemannian gradient of at , which is computed according to the chosen metric at . Collecting each metric over , the family is called a Riemannian metric on . is an inner product elements and of the tangent space at . Herein, we use the notation instead of for simplicity. The gradient is defined as the unique element of that satisfies
where is the Fréchet derivative of in the direction .
In searching a next point along a geodesic, we need to compute tangent vectors obtained by the exponential mapping, which are expensive to compute in general. There are even some Riemannian manifolds for which a closed form for exponential mapping is not available. Alternatively, we can use curves other than geodesics as long as a starting point and its tangent vector at the initial time is the same as those of geodesics. A more general update formula is then written as
where is a retraction, which is any map that locally approximates the exponential mapping, up to first-order, on the manifold . Retractions include the exponential mapping as a special case. An advantage of using retractions is that the computational cost can be reduced compared with using the exponential mapping. It is worth noting that convergence properties for the exponential mapping usually holds for retractions as well.
In the R-SVRG proposed in Section 3, we need to add tangent vectors that are in different tangent spaces, say and on . A mathematically natural way to do so is to use the parallel translation operator. Parallel translation transports a vector field on the geodesic curve that satisfies and , where is the parallel translation operator sending to . However, parallel translation is sometimes computationally expensive and no explicit formula is available for some manifolds, such as the Stiefel manifold. A vector transport on , which is a map , is used as an alternative. A vector transport has an associated retraction . For and , is a tangent vector at , which can be regarded as a transported vector along . Parallel translation is an example of vector transport.
In the following, we use the notations , , and interchangeably, where is a curve connecting and on defined by retraction as with .
3Riemannian stochastic variance reduced gradient
After a brief explanation of the variance reduced gradient variants in Euclidean space, we describe the proposed Riemannian stochastic variance reduced gradient on Riemannian manifolds.
3.1Variance reduced gradient variants in Euclidean space
The SGD update in Euclidean space is , where is a randomly selected vector called the stochastic gradient and is the step size. SGD assumes an unbiased estimator of the full gradient, as . Many recent variants of the variance reduced gradient of SGD attempt to reduce its variance as increases to achieve better convergence . SVRG, proposed in , introduces an explicit variance reduction strategy with double loops, where the -th outer loop, called the -th epoch, has inner iterations. SVRG first keeps or for randomly chosen at the end of the -th epoch, and also sets the initial value of the -th epoch to . It then computes a full gradient . Subsequently, denoting the selected random index by , SVRG randomly picks the -th sample for each at and computes the modified stochastic gradient as
It should be noted that SVRG can be regarded as one special case of S2GD (semi-stochastic gradient descent), which differs in the number of inner loop iterations chosen .
3.2Proposed Riemannian extension of SVRG (R-SVRG)
We propose a Riemannian extension of SVRG on a Riemannian manifold , called R-SVRG. Here, we denote the Riemannian stochastic gradient for the -th sample as and the modified Riemannian stochastic gradient as instead of , to show the differences from the Euclidean case.
R-SVRG reduces variance analogously to the SVRG algorithm in the Euclidean case. More specifically, R-SVRG keeps after stochastic update steps of the -th epoch, and computes the full Riemannian gradient only for this stored . The algorithm also computes the Riemannian stochastic gradient that corresponds to this -th sample. Then, picking the -th sample for each -th inner iteration of the -th epoch at , we calculate in the same way as in (Equation 2), i.e., by modifying the stochastic gradient using both and . Translating the right-hand side of (Equation 2) to manifold involves the sum of , , and , which belong to two separate tangent spaces and . This operation requires particular attention on a manifold, and a vector transport enables us to handle multiple elements on two separated tangent spaces flexibly. More concretely, and are first transported to at the current point, ; then, they can be added to on . Consequently, the modified Riemannian stochastic gradient at the -th inner iteration of the -th epoch is set to
where represents a vector transport operator from to on . Specifically, we need to calculate the tangent vector from to , which is given by the inverse of the retraction, i.e., . Consequently, the final update rule of R-SVRG is defined as , where is a step size at . As shown in our local convergence analysis (Section 5.2), can be fixed once the iterate approaches sufficiently close to a solution. It should be noted that the modified direction is also a Riemannian stochastic gradient of at .
Let be the expectation with respect to the random choice of , conditioned on all randomness introduced up to the -th iteration of the inner loop of the -th epoch.
Conditioned on , we take the expectation with respect to and obtain
The theoretical convergence analysis of the Euclidean SVRG algorithm assumes that the initial vector of the -th epoch is set to the average (or a random) vector of the -th epoch . However, the set of the last vectors in the -th epoch, i.e., shows superior performance in the Euclidean SVRG algorithm. Therefore, for our local convergence rate analysis in Theorem ?, this study also uses, as option I, the mean value of as , where is the Karcher mean on the manifold. Alternatively, we can also simply choose for at random. In addition, as option II, we can use the last vector in the -th epoch, i.e., . In the global convergence analysis in Section 4, we useoption II. The overall algorithm, with a fixed step size, is summarized in Algorithm ?.
Additionally, variants of the variance reduced SGD initially require a full gradient calculation at every epoch. This initially results in more overhead than the ordinal SGD algorithm and eventually causes a cold-start property in these variants. To avoid this, , in Euclidean space, proposes using the standard SGD update only for the first epoch. We also adopt this simple modification of R-SVRG, denoted as R-SVRG+; we do not analyze this extension, but leave it as an open problem.
As mentioned earlier, each iteration of R-SVRG has double loops, to reduce the variance of the modified stochastic gradient . The -th epoch, i.e., the outer loop, requires gradient evaluations, where is for the full gradient at the beginning of each -th epoch and is for the inner iterations, since each inner step needs two gradient evaluations, i.e., and . However, if for each sample is stored at the beginning of the -th epoch, as in SAG, the evaluations over all inner loops result in . Finally, the -th epoch requires evaluations. It is natural to choose an of the same order as but slightly larger (e.g., for nonconvex problems is suggested in ).
4Global convergence analysis
In this section, we provide a global convergence analysis of Algorithm ? for Problem after introducing some assumptions. Throughout this section, we let and denote a retraction and vector transport used in Algorithm ?, respectively, and suppose the following assumptions.
Note that we can construct an isometric vector transport as in  so that Assumption ? holds.
As discussed in , for a positive constant , a -totally retractive neighborhood of is a neighborhood such that for all , , and is a diffeomorphism on ,which is the ball in with center and radius , where is the zero vector in . The concept of a totally retractive neighborhood is analogous to that of a totally normal neighborhood for exponential mapping.
In the global convergence analysis, we also assume that the sequence of step sizes satisfies the usual condition in stochastic approximation as follows:
This condition is satisfied, for example, if with positive constants and , where is the total iteration number depending on and . We also note the following proposition introduced in :
Now, we give the a.s. convergence of the proposed algorithm under the assumption that the generated sequenc is in a compact set.
The claim is proved similarly to the proof of the standard Riemannian SGD (see ). Since is compact, all continuous functions on can be bounded. Therefore, there exists such that for all and , we have and . We use in the following. Moreover, as , there exists such that for we have . Suppose now that .
Let satisfy , i.e., .
It follows from the triangle inequality that
where we have used the assumption that is an isometry. Therefore, we have
Hence, it follows from the assumption that
there exists a curve linking and . Thus, the Taylor formula implies that where is an upper bound of the largest eigenvalues of the Euclidean Hessian of in the compact ball . Let be the increasing sequence of -algebras defined by
Since is computed from , it is measurable in . As is independent of , we have
which yields that
We reindex the sequence as
. We also similarly reindex . As , proves that is a nonnegative supermartingale. Hence, it converges a.s. Returning to the original indexing, this implies that converges a.s. Moreover, summing the inequalities, we have
Here, we prove that the right term is bounded. By doing so, we can conclude the convergence of the left term.
Summing (Equation 4) over and , it is clear that satisfies the assumption of Proposition ?; thus, is a quasi-martingale, implying that converges a.s. because of inequality (Equation 5), where , by Proposition ?. If is proved to converge a.s., it can only converge to 0 a.s. because of condition ( ?).
Now consider the nonnegative process . Bounding the second derivative of by , along the curve defined by the retraction linking and , a Taylor expansion yields
Let be a constant such that is a lower bound of the Hessian of on the compact set. Then, we have
We have proved the fact that the sum of the right term is finite, implying that is a quasi-martingale, thus further implying the a.s. convergence of to , as claimed.
Theorem ? contains a global convergence of the algorithm with exponential mapping and parallel translation as a special case. In this case, a sufficient condition for the assumptions can be easily written by notions of injectivity radius.
5Local convergence rate analysis
In this section, we show the local convergence rate analysis of the R-SVRG algorithm; we analyze the convergence of any sequences generated by Algorithm ? that are contained in a sufficiently small neighborhood of a local minimum point of the objective function. Hence, we can assume that the objective function is convex in such a neighborhood. We first give formal expressions of these assumptions and then analyze the local convergence rate of the algorithm.
5.1Assumptions and existing lemmas
We again suppose Assumptions ? and ?. That is, the objective function is and the vector transport is isometric.
Let be a critical point. As discussed in , we note that there exists a positive constant and -totally retractive neighborhood of . In the local convergence analysis, we assume the following.
Note that it follows from Taylor’s theorem and the fact that that Assumption ? holds if and are .
Also, we assume the following.
Letting be smaller if necessary, we can guarantee the last assumption from other assumptions using Lemma 3.1 in .
In the rest of this section, we introduce some existing lemmas to evaluate the differences of using retraction and vector transport instead of exponential mapping and parallel translation, and the effects of the curvature of the manifold in question.
Note that curve in this lemma is not necessarily the geodesic on . Relation is a generalization of the Lipschitz continuity condition.
We can derive the following lemma from the Taylor expansion as in the proof of Lemma 3.2 in .
From the assumptions, there exists such that for all , with , and for all such that for all . From Taylor’s theorem, we can conclude that this satisfies the claim.
If we replace in Lemma ? with retraction on , we can obtain a similar result for , which is shown in Lemma 3.2 in  where the constant corresponds to in our Lemma ?. However, Lemma ? for is sufficient in the following discussion.
From Lemma 3 in , we have the following lemma:
Since we have , Eq. is equivalent to , which gives a relation between the exponential mapping and a general retraction.
Now, we introduce Lemma 6 in  to evaluate the distance between and using the smoothness of our objective function. In the following, an Alexandrov space is defined as a length space whose curvature is bounded.
5.2Local convergence rate analysis with retraction and vector transport
We now demonstrate the local convergence properties of the R-SVRG algorithm (i.e., local convergence to local minimizers) and its convergence rate.
We first describe an assumption that generated points are sufficiently close to a critical point.
We show a property of the Karcher mean on a general Riemannian manifold.
From the triangle inequality and for real numbers and , we have for
Since is the Karcher mean of , it holds that
It then follows that
This completes the proof.
Then, we prove that the norms of modified stochastic gradients in R-SVRG are sufficiently small under Assumption ?.
We first show that for any is upper-bounded in if is sufficiently small. Since is sufficiently small, is contained in a set diffeomorphic to an open set by a chart . Consider a sufficiently small closed ball in centered at such that . Then, is in . Note that is compact and is a diffeomorphism. Hence, is also compact. Replacing with this if necessary, we can assume that is compact. Therefore, is upper bounded in . In this proof, let denote an upper bound, i.e., for all and .
Let . The definition of , Eq. , and the triangle inequality yield
where and is the parallel translation operator along curve . The three terms on the right-hand side above are bounded as
using Lemmas ?, ?, and ?, respectively, where and are in the lemmas and is a curve defined by . Therefore, we have
if we choose a sufficiently small such that .
We now derive the upper bound of the variance of as follows.
Let and satisfy and , respectively. Let be a parallel translation along the curve , where .
By Lemma ? and ?,
the upper bound of the variance of