Stochastic Subspace Descent††thanks: Submitted to the editors DATE.
Funding: AD acknowledges funding by the US Department of Energyâs Office of Science Advanced Scientific Computing Research, Award DE-SC0006402 and National Science Foundation Grant CMMI-145460.
We present two stochastic descent algorithms that apply to unconstrained optimization and are particularly efficient when the objective function is slow to evaluate and gradients are not easily obtained, as in some PDE-constrained optimization and machine learning problems. The basic algorithm projects the gradient onto a random subspace at each iteration, similar to coordinate descent but without restricting directional derivatives to be along the axes. This algorithm is previously known but we provide new analysis. We also extend the popular SVRG method to this framework but without requiring that the objective function be written as a finite sum. We provide proofs of convergence for our methods under various convexity assumptions and show favorable results when compared to gradient descent and BFGS on non-convex problems from the machine learning and shape optimization literature. We also note that our analysis gives a proof that the iterates of SVRG and several other popular first-order stochastic methods, in their original formulation, converge almost surely to the optimum; to our knowledge, prior to this work the iterates of SVRG had only been known to converge in expectation.
Key words. Randomized methods, optimization, derivative-free, Gaussian processes, shape optimization, stochastic gradients
AMS subject classifications. 90C06, 93B40, 65K10
We consider optimization problems of the form
where has -Lipschitz derivative. Throughout the document we will consider additional restrictions such as convexity or -strong convexity of , restrictions that will be made clear as they are required. In the most basic form discussed in Section LABEL:sect:_SSD, we project the gradient onto a random -dimensional subspace and descend along that subspace as in gradient descent. This is accomplished using a random matrix , for some , with the properties
in the following iteration scheme
This algorithm has convergence properties consistent with gradient descent (see Theorem LABEL:thm:convergence) and allows us to leverage techniques from the substantial literature surrounding stochastic gradient descent (see Theorem LABEL:thm:VarianceReducedRandomGradient). Note that (LABEL:eq:_iterations) with a diagonal reduces to randomized block-coordinate descent .
A particular case of (LABEL:eq:_minimize_f) is Empirical Risk Minimization (ERM) commonly used in machine learning
where is typically very large, and hence the problem is amenable to iterative stochastic methods that approximate using randomly sampled observations at each iteration with where . While the methods discussed in this paper do not require this type of finite-sum structure, they can be employed to tackle such problems. In fact, the canonical parameter estimation problem in machine learning is one application we envision for our work.
There are several important classes of functions that do not fit into the ERM framework and therefore do not benefit from the recent stochastic algorithms tailored to ERM. Partial Differential Equation (PDE) constrained optimization is one such example, and except in special circumstances (such as ), a stochastic approach leveraging the ERM structure (such as stochastic gradient descent and its variants) does not provide any benefits. This is due to the fact that in PDE-constrained optimization the cost of evaluating each is often identical to the cost of evaluating . A primary goal of this paper is to use variations on the prototypical iterative scheme (LABEL:eq:_iterations) to adapt recent developments from the machine learning optimization literature so that stochastic methods can be used effectively in the PDE-constrained optimization setting. Along the way we present results that we think will be useful to the machine learning community. An example that comes directly from the machine learning literature is Gaussian processes, specifically the sparse Gaussian process framework of [85, 92]. The placement of so-called inducing points is rolled into the continuous optimization setting causing the dimension of the problem to increase linearly with the desired number of points. This leads to high-dimensional problems for which traditional stochastic optimization algorithms provide no benefit.
1.1 PDE-constrained optimization
Partial differential equations are well-studied tools in applied mathematics, ubiquitous for their use in modeling physical phenomena. Successful application of PDEs to modeling is contingent upon appropriate discretization and parameter estimation. Parameter estimation in this setting is referred to as PDE-constrained optimization, and arises in optimal control, or whenever the parameters of the PDE are unknown, as in inverse problems. Algorithmic and hardware advances for PDE-constrained optimization have allowed for previously impossible modeling capabilities. Examples include fluid dynamics models with millions of parameters for tracking atmospheric contaminants , modeling the flow of the Antarctic ice sheet [48, 74], parameter estimation in seismic inversion [1, 14], groundwater hydrology , experimental design [47, 40], and atmospheric remote sensing .
In Section LABEL:sect:_EmpiricalResults we take a discretize-then-optimize approach and assume that the infinite-dimensional PDE has already been discretized by some appropriate mechanism such as finite differences or finite elements. With this in mind, the PDE-constrained optimization problem can be written as follows
That is, the goal is to minimize a tradeoff between (a) the loss which depends on the solution to the PDE and the observed data , and (b) a regularization functional that serves to keep the problem well-posed. The PDE is parameterized by , the dimension of which may be subject to the choice of discretization, and may be in the millions. The regularization parameter controls the balance between the data misfit and the regularity of the solution. Typically is unknown but there are a number of techniques for estimating it, see [91, §3.3], [64, §5.4], [44, 84]; we do not treat this issue here.
1.2 Gaussian processes
Gaussian processes are a very important class of stochastic processes used for modeling processes that are correlated in input space. Here we use them to model an unknown function in the context of regression. Elegant results from functional analysis, developed over the past 100 years, allow for the flexibility of working in function space using only machinery from linear algebra but the applications of Gaussian processes are somewhat hamstrung in many modern settings because their time complexity scales as and their storage as . One recourse is to approximate the Gaussian process, allowing time complexity to be reduced to with storage requirements of where is the number of points used in lieu of the full data set. Methods have been developed [85, 92] which place these inducing points, also called landmark points, along the domain at points different from the original inputs; optimal placement of these points is a continuous optimization problem with dimension at least the number of inducing points to be placed. Such a framework places heavy burden on the optimization procedure as improperly placed points may result in a poor approximation.
1.3 A historical perspective of iterative optimization
Despite being among the easiest to understand and oldest variants of gradient descent, subspace methods (by far the most common of which is coordinate descent) have been conspicuously absent from the optimization literature. The reasons for this vary but such under-representation may be partially attributed to the fact that until recently the performance of these types of algorithms did not justify the difficult theoretical work.
Coordinate descent schemes
The simplest variant of subspace descent is a deterministic method that cycles over the coordinates. Convergence results for coordinate descent requires challenging analysis and the class of functions for which it will converge is restricted; indeed, [98, 77] provide simple examples for which the method fails to converge while simpler-to-analyze methods such as gradient descent will succeed. It was shown in  that relaxing the cyclic nature of the updates so that the schedule for updating along each coordinate is “almost cyclic” (that is, the function is updated along each coordinate at least once every iterations for some ) allows for convergence guarantees under conditions only moderately more restrictive than those of gradient descent. Also, it was shown in  that using the deterministic Gauss-Southwell rule for selecting the descent direction results in convergence at a linear rate, a result corroborated by theoretical and experimental evidence in , and38 expanded upon in the paper  which shows that the rate of convergence is competitive with the rates of the randomized method provided by .
The analysis in  for the case where the coordinates are chosen randomly provides convergence results on par with gradient descent without additional restrictions on the class of functions that can be optimized. This was generalized in  with slightly tighter analysis and an extension to composite functions. An overview of the recent progress until 2015, along with simplifications of the proofs for several results can be found in . The algorithm described in Section LABEL:sect:_SSD is a generalization of results that can be found in [101, 67].
Since 2015, much emphasis has been placed on accelerating coordinate descent methods [2, 46] but those improvements require knowledge of the Lipschitz constants of the partial derivatives of the functions and/or special structure in the function to make updates inexpensive; in Section LABEL:sect:_Conclusions we discuss how recent analysis in  allows for acceleration of our methods without such an unrealistic assumption.
Optimization without derivatives
Our methods use directions where P is with , which is computationlly identical to taking directional derivatives of , and can therefore be estimated by forward- or centered-differences in or function evaluations, respectively. We assume the error due to the finite difference approximation is negligible, which is realistic for low-precision optimization. Because our methods only evaluate and not directly, they fall into the class of derivative-free optimization (DFO) and “zeroth-order” optimization. Recent analysis has shown that many first-order methods (e.g., gradient descent, SVRG) maintain asymptotic convergence properties  when the gradients are estimated via finite differences. Additionally, non-asymptotic bounds on the rate of convergence for zeroth-order methods have been shown to be similar, albeit slightly worse, to their first-order counterparts .
To be clear, when is readily available, zeroth-order optimization methods are not competitive with first-order methods and second-order methods (e.g., Newton’s method). For example, if and is a matrix small enough to store in the memory of a computer, then evaluating and evaluating have nearly the same computational cost, namely . In fact, such a statement is always true regardless of the structure of : by using reverse-mode automatic differentiation (AD), one can theoretically evaluate in about four-times the cost of evaluating , regardless of the dimension . In the context of PDE-constrained optimization, the popular adjoint-state method, which is a form of AD applied to either the continuous or discretized PDE, also evaluates in time independent of the dimension.
However, there are many situations when AD and the adjoint-state method are inefficient or not applicable. Finding the adjoint equation requires a careful derivation (which depends on the PDE as well as on initial and boundary conditions), and then a numerical method must be implemented to solve it, which takes considerable development time. For this reason, complicated codes that are often updated with new features, such as weather models, rarely have the capability to compute a full gradient. There are software packages that solve for the adjoint automatically, or run AD, but these require a programming environment that restricts the user, and may not be efficient in parallel high-performance computing environments. Even when the gradient is calculated automatically via software, the downside of these reverse-mode methods is a potential explosion of memory when creating temporary intermediate variables. For example, in unsteady fluid flow, the naive adjoint method requires storing the entire time-dependent PDE solution  and hybrid check-pointing schemes  are the subject of active research. Furthermore, in black-box simulations which only return function evaluations given inputs (see [20, §1.2] for examples) the derivative may not be available by any means.
Many DFO algorithms have been developed for the cases that do not admit obtaining a full gradient. Classic methods include grid search, Nelder-Mead, (quasi-) Monte-Carlo sampling, simulated annealing and MCMC methods . More powerful modern algorithms include randomized methods, Evolution Strategies (ES) such as CMA-ES , Hit-and-Run  and random cutting planes , as well as textbook modern DFO methods ([20, Algo. 10.3], [71, Algo. 9.1]) based on interpolation and trust-regions. A limitation of all these methods is that they do not scale well to high-dimensions (typical dimensionality is , and extreme cases would be dimension ), thus limiting their applicability. For this reason, and motivated by the success of stochastic gradient methods in machine learning, we turn to cheaper methods that scale well to high dimensions.
Our basic stochastic subspace descent (SSD) method (LABEL:eq:_iterations) has been previously explored under the name “random gradient,” “random pursuit,” “directional search,” and “random search” (not to be confused with the hit-and-run style “random search” algorithm ). The algorithm dates back to the 1970s, with some analysis (cf. [29, Ch. 6] and [33, 86]), but it never achieved prominence since zeroth-order methods are not competitive with first-order methods when the gradient is available. Most analysis has focused on the specific case [66, 69, 89, 52].
Recently, the random gradient method has seen renewed interest.  analyzes the case when is quadratic, and  provides an analysis (assuming a line search oracle). A variant of the method, using and P comprised of IID Gaussian entries with mean zero and variance such that , was popularized by Nesterov [66, 69], and various proximal, acceleration and noise-tolerant extensions and analysis appeared in [28, 27, 35, 17]. The novelty of our work, and distinguishing feature compared to prior work, is that the methods in the literature have not focused on high-dimensional projections or practical demonstrations. Another variant of random gradient has recently been proposed in the reinforcement learning community. The Google Brain Robotics team used orthogonal sampling to train artificial intelligence systems , but treated it as a heuristic to approximate Gaussian smoothing. Previous use of orthogonal sampling by the OpenAI team  used variance reduction, but only in the sense of antithetic pairs.
We note that the methods we propose may appear similar to stochastic gradient descent (SGD) methods popular in machine learning, but the analysis is completely different. Indeed, even under strong convexity and with a finite sum structure, SGD methods require diminishing stepsizes and hence converge slowly or else converge only to a radius of the solution .
The most relevant recent work to the SSD scheme is the recent SEGA algorithm  which takes similar random measurements but then uses them in a completely different way, following their “sketch-and-project” methodology. Given a sketch and an estimate of the gradient at the previous step , the simplest version of SEGA finds a new estimate
followed by a debiasing step.
As a baseline, one could use function evaluations to estimate using finite differences, which is clearly costly when is large and evaluating is expensive. Once is computed, one can run gradient descent, accelerated variants , non-linear conjugate gradient methods , or quasi-Newton methods like BFGS and its limited-memory variant . We will compare with these baselines in the numerical results section, particularly with plain gradient descent since it is less sensitive to step-size and errors compared to accelerated variants, and to BFGS since it is widely used. We do not consider Newton, nor inexact Newton (Newton-Krylov) solvers such as [8, 9, 10], because these require estimating the Hessian as well. Some authors have recently investigated estimating the Hessian stochastically (or adapting BFGS to work naturally in the stochastic sampling case), such as [15, 61, 82, 63, 4, 5, 36, 97], but none apply directly to our setup as they require an ERM structure.
1.4 Structure of this document
The remainder of the paper is organized as follows: In Section LABEL:sect:_SSD we investigate the convergence behavior of the stochastic subspace descent method under various convexity assumptions, and provide rates of convergence. In Section LABEL:sect:_SVRG we recall the Stochastic Variance Reduced Gradient (SVRG) method of , and show how control variates can be used to reduce the variance in stochastic subspace descent to improve the rate of convergence. Along the way, we provide a simple lemma that relates convergence in expectation at a linear rate with almost sure convergence of our method. It can also be used to show that the convergence of SVRG and other popular first order methods can be trivially extended to almost sure convergence. In Section LABEL:subsect:synthetic-data we provide empirical results on a simulated function that Nesterov has dubbed “the worst function in the world” . In Section LABEL:subsect:_experiments-GP we find the optimal placement of inducing points for sparse Gaussian processes in the framework of . As a final empirical demonstration, in Section LABEL:subsect:_experiments-plate we test our algorithms in the PDE-constrained optimization setting on a shape optimization problem where the goal is to identify the shape of a hole of fixed area which minimizes the maximum stress subject to boundary forces. We make concluding remarks and discuss future extensions in Section LABEL:sect:_Conclusions. For the sake of readability, proofs are relegated to the Appendix.
In this document, uppercase-boldfaced letters represent matrices, lowercase-boldfaced letters are vectors. The vector norm is assumed to be the 2-norm, and the matrix norm is the operator norm.
2 Main results
This section contains our main theoretical results. We present two algorithms in the SSD framework and provide convergence results under various assumptions. We additionally provide proof that the popular SVRG algorithm , along with several other commonly used stochastic descent methods, provides almost sure convergence of the iterates to the optimal solution.
In the following theorem and corollary we provide conditions under which function evaluations of stochastic subspace descent converge to a function evaluation at the optimum . In the case of a unique optimum we also provide conditions for the iterates to converge to the optimum . Stochastic subspace descent, so-called because at each iteration the method descends in a random low-dimensional subspace of a vector space containing the gradient, is a gradient-free method as it only requires computation of directional derivatives at each iteration without requiring direct access to the gradient. Scaled Haar-distributed random matrices define randomized directions (in the subspace) along which to descend at each iteration. Neither Theorem LABEL:thm:convergence, nor the subsequent theorems require these Haar-distributed matrices in particular but whatever matrices are used must satisfy Assumption (A0) below. See Appendix LABEL:App:_Misc for an algorithm for generating draws from a Haar-distributed matrix, and a brief discussion of the advantages of using Haar over random coordinate descent type schemes.
For the remainder of this section we make use of the following assumptions on the matrices and the function to be optimized.
, , are independent random matrices such that and with .
is continuously-differentiable with a -Lipschitz first derivative.
The function attains its minimum .
For some and all , the function satisfies the Polyak-Lojasiewicz (PL) inequality of Lemma LABEL:PLLemma:
Also, where is the Lipschitz constant in (A1).
is -strongly-convex for some and all . Also, where is the Lipschitz constant in (A1).
is convex and attains its minimum on a set , and there is an such that for the parameter initialization ,
Coercivity of implies the existence of the constant in (A3”), hence Assumption (A3”) is mathematically equivalent to coercivity of . For the results below, particularly the rate, in Theorem LABEL:thm:_convergence-convex, we require knowledge of the value of . Also note that (A3’) implies (A3) (Lemma LABEL:PLLemma).
Algorithm LABEL:Alg:_SSD provides basic pseudocode to implement stochastic subspace descent using the recursion (LABEL:eq:_iterations).
Theorem 2.2 (Convergence of SSD).
Assume (A0), (A1), (A2), (A3) and let be an arbitrary initialization. Then recursion (LABEL:eq:_iterations) with results in and .
Formal definitions of the different modes of convergence are provided in the appendix. Since non-convex functions may satisfy the PL-inequality (LABEL:eq:PL), Theorem LABEL:thm:convergence provides a convergence result for well-behaved non-convex functions. More practically, in Corollary LABEL:corr:strong-convexity we show that Theorem LABEL:thm:convergence allows linear convergence for relevant classes of convex functions such as least squares, even when the functions are not strongly convex [42, §2.3].
Corollary 2.3 (Convergence under strong-convexity and rate of convergence).
Assume (A0), (A1), (A2), (A3’) and let be an arbitrary initialization. Then recursion (LABEL:eq:_iterations) with results in where is the unique minimizer of .
Assume (A0), (A1), (A2), and (A3). Then with the step-size , the recursion defined in (LABEL:eq:_iterations) attains the following expected rate of convergence
We note that if we recover a textbook rate of convergence for gradient descent [13, §9.3]. Under the more restrictive assumption of strong-convexity the result of Corollary LABEL:corr:strong-convexity is much stronger than Theorem LABEL:thm:convergence; we get almost sure convergence of the function evaluations and of the iterates to the optimal solution at a linear rate. In inverse problems it is typically the convergence of not of that is sought. Furthermore, if either assumption (A3) or (A3’) is satisfied, SSD has a linear rate of convergence. This result may be somewhat counter-intuitive for those more familiar with stochastic gradient descent, which is well known to provide a sub-linear rate of convergence. By applying random sampling to dimensions in input space rather than to observations the step-size need not decay, allowing for a linear rate. In addition to the linear rate, we provide guidance for selecting an optimal step-size, which in theory alleviates the need for the user to tune any hyperparameters. The optimal step-size is easily interpreted as the quotient of the sample rate and the global Lipschitz constant. This is consistent with our intuition: more directional derivatives make for a better approximation of the gradient. In practice the Lipschitz and strong-convexity constants are rarely known and a line search can be used.
Thus far, we have provided results for cases when the function satisfies the PL-inequality or is strongly-convex. The proof in the convex case is different, but substantively similar to a proof of coordinate descent on convex functions found in .
Theorem 2.4 (Convergence under convexity).
Assume (A0), (A1), (A2), (A3”). Then recursion (LABEL:eq:_iterations) with gives
2.2 Variance Reduced Stochastic Subspace Descent (VRSSD)
In its most basic formulation SGD suffers from a sub-linear rate of convergence. This is due to a requirement for the step-size to gradually decay to assure convergence of the iterates. Several methods have been designed to address this short-coming, one of which is SVRG  which controls the variance of the gradient estimate using a control variate, and admits convergence with a constant step-size, resulting in a linear rate of convergence. In this section we recall some results regarding SVRG from  before proving almost sure convergence of the iterates and function evaluations, and ultimately adapting it to the framework of SSD. SVRG minimizes the functional via the iterative scheme
where is chosen uniformly at random, is a past iterate updated every iterations, and the full gradient is also calculated every iterations:
In  a proof is presented for the case where is updated after iterations by setting , where is chosen uniformly at random from . The authors call this updating scheme “Option 2” and recommend an alternative “Option 1”, without proof, which involves setting instead. A proof of convergence for Option 1 is provided in . The SVRG method is presented in Algorithm LABEL:alg:SVRG, and the convergence result of , is presented in Theorem LABEL:SVRG.
Theorem 2.5 (Convergence of SVRG ).
Assume that each is convex with -Lipschitz derivative and that is -strongly-convex. Assume that the memory parameter is sufficiently large so that
Then, for the sequences (LABEL:eq:_SVRG1) and (LABEL:eq:_SVRG2), using Option 2 to update , we have the following expected rate of convergence,
The proof, omitted here, can be found in . Since is a constant and , Theorem LABEL:SVRG implies that . We further show that the convergence is also in other modes. This is a particular result of the following simple lemma:
Lemma 2.6 (a.s. convergence from a linear rate of convergence).
Consider the general problem of minimizing an objective function and set . If a minimization algorithm converges in expectation at a linear rate; that is, if there exists and a constant such that
then . If in addition, is strongly-convex, then .
The distinction between modes of convergence is not merely academic, see for instance [31, p.4] for a simple example of a sequence that converges in for any but does not converge almost surely. As discussed in , the class of algorithms that satisfy the conditions required to invoke Lemma LABEL:thm:SVRG-AS is substantial. This class includes many of the most popular gradient-based optimization methods such as block-coordinate descent, SAG , SAGA , SDCA , SVRG , Finito/MISO [24, 59], and their proximal variants. Though the conditions of Lemma LABEL:thm:SVRG-AS are sufficient for almost sure convergence, they are not necessary: there are optimization algorithms, such as stochastic gradient descent, that converge almost surely for which the rate of convergence is sub-linear. Of the aforementioned algorithms only SVRG is relevant to our present discussion:
Remark 1 (a.s. convergence of SVRG).
For the SVRG algorithm discussed in Theorem LABEL:SVRG, and .
The problems for which the theory of SVRG can be applied are strongly-convex. Also, by construction , and . Thus the result follows from Lemma LABEL:thm:SVRG-AS.
As stated, SVRG is stochastic in that it randomly samples observations from the dataset. Our goal is different in that we are interested in random sampling along the dimensions of input space, motivating a new algorithm we call Variance Reduced Stochastic Subspace Descent (VRSSD). For subspace selection we generate a random matrix using Algorithm LABEL:Haar, though again any matrices satisfying assumption (A0) suffice. The main algorithm is described in Algorithm LABEL:alg:varprod. Notice that the similarity of SVRG and VRSSD is in their use of a control variate; in VRSSD we additionally incorporate a parameter which allows to weight the influence of the control variate.
Theorem LABEL:thm:VarianceReducedRandomGradient allows for, but (unlike SVRG) does not require, to have a finite-sum representation and as a consequence it does not require Lipschitz-continuity of the individual samples, , like Theorem LABEL:SVRG does.
Theorem 2.7 (a.s. convergence of VRSSD).
Assume that is a -strongly convex continuously-differentiable function with -Lipschitz gradient for some . Define . Let be the minimizer of .
Fix a memory parameter , a sampling rate , and a step-size such that
Then, Algorithm LABEL:alg:varprod with option 2 and has the following expected rate of convergence:
Fix , , and such that
Then, with the control variate parameter:
Algorithm LABEL:alg:varprod using Option 2 has the following expected rate of convergence:
For appropriately chosen parameters and , the variance reduction introduced in Theorem LABEL:thm:VarianceReducedRandomGradient leads to an improved convergence rate compared to the simpler method presented in Theorem LABEL:thm:convergence. The benefits can be seen in practice as we show in Section LABEL:subsect:_experiments-plate. In part () of Theorem LABEL:thm:VarianceReducedRandomGradient we show that it is possible to further improve the rate by selecting a control variate parameter to optimally weight the impact of the full gradient that is held in memory. In theory we see the following effects on the rate for VRSSD with the optimal control variate, all of which are consistent with our intuition:
A large sampling rate improves per-iteration convergence.
A large improves per-iteration convergence, but each iteration is only counted after inner loops have been completed, so the effect of cancels (in theory).
Poor conditioning () has a deleterious effect on the convergence rate commensurate with its effect on the convergence rate of gradient descent.
A large Lipschitz constant must be offset by a small step-size.
It is possible to find the optimal step for a known local Lipschitz constant and strong-convexity constant, and a fixed .
In many of the problems for which SSD is useful it is undesirable to compute the term to find the optimal control variate as this would amount to additional function evaluations per iteration. For this reason, in practice, we recommend approximating it with the current estimate of the gradient instead,
Though this approximation is not supported by the theory, empirical performance in Section LABEL:sect:_EmpiricalResults suggests the validity of such an approximation. We chose the approximation in (LABEL:eqn:_VRSSD-approximation) because it is an unbiased estimate of the optimal control variate parameter at each iteration and has the advantage that no additional function evaluations are required for each iteration. It also has the effect that some of the intuitions outlined above may not hold; this in itself is not necessarily a problem as the Lipschitz and strong convexity constants are rarely known and a line search can be used to determine the step-size; in our experiments we use an Armijo backtracking line search as discussed in .
In order to understand the role of more clearly, we consider three possible scenarios: corresponds to standard SSD as outlined in Theorem LABEL:thm:convergence, corresponds to part () of Theorem LABEL:thm:VarianceReducedRandomGradient, and finally the optimal is given by (LABEL:eqn:_optimal-eta) and used in part () of Theorem LABEL:thm:VarianceReducedRandomGradient. With the understanding that the true gradient is the optimal search direction with respect to one-step improvement, we would like our search direction to vary around it as little as possible. To this end, we ascertain the conditional mean-square error defined as the expectation of
given information at the current iteration (i.e., conditional on algebras defined in the appendix). We denote this conditional error as CMSE.
Case (Theorem LABEL:thm:convergence)
Case (Theorem LABEL:thm:VarianceReducedRandomGradient part (i))
Case (Theorem LABEL:thm:VarianceReducedRandomGradient part (ii))
These identities follow from the proofs in the appendix. Equation (LABEL:eqn:_eta-analysis) shows that the optimal never increases the variance compared to SSD and can in fact drop it to zero, though an increase in the variance is possible if (e.g., in the event is negative, which may occur if is too large and the iterates traverse substantial portions of parameter space between full gradient calculations). These results are in agreement with Theorem LABEL:thm:VarianceReducedRandomGradient which suggests that the per-iteration rate of convergence using an optimal is strictly less than the rate when .
3 Experimental results
In this section we provide results for a synthetic problem (Section LABEL:subsect:synthetic-data), a problem from the machine learning literature (Section LABEL:subsect:_experiments-GP), and a shape-optimization problem (Section LABEL:subsect:_experiments-plate). For the sake of fair comparisons, we compare to zeroth-order variants of gradient descent and BFGS where the gradient is calculated via forward-differences. In the synthetic problem we also compare to a randomized block-coordinate descent. For VRSSD we use the approximation (LABEL:eqn:_VRSSD-approximation) of the optimal , and we use Option 1 for updating .
3.1 Synthetic data
We explore the performance of the algorithms on Nesterov’s “worst function in the world” . Fix a Lipschitz constant and let
where represents the coordinate of and is a constant integer that defines the intrinsic dimension of the problem. This function is convex with global minimum
Nesterov has shown that a broad class of iterative first-order solvers perform poorly when minimizing with . We compare our results to randomized block-coordinate descent and gradient descent for two illustrative examples from the family of functions defined by (LABEL:eq:_NesterovWorst). We do not compare to BFGS as we do not run enough iterations for it to differ substantially from gradient descent. For the randomized methods we use ; in Section LABEL:subsect:_experiments-GP we explore the impact that different choices of have on convergence. We expect coordinate descent to perform poorly as there is only a single coordinate of this function that provides a descent direction from the initialization; for high-dimensional problems it becomes exceedingly unlikely that the “correct” coordinate is chosen. Also, since it is a block coordinate descent with , the impact of the other two coordinates chosen outweighs the benefit, and the step-size gets pushed to zero; hence, in both examples shown in Figure LABEL:fig:_NesterovExample we see no improvement from coordinate descent. Had we used instead for coordinate descent we could only expect improvement to be made one in every iterations, rather than at every iteration as in the other methods.
If the function is intrinsically low-dimensional but it is embedded into a high-dimensional ambient space, then SSD and VRSSD perform exceptionally well (see left plot of Figure LABEL:fig:_NesterovExample). This holds true for the Gaussian process example of Section LABEL:subsect:_experiments-GP as well. On the other hand, if the space is truly high-dimensional then the stochastic methods do not appear to provide any benefit.
There are hyperparameters to set in our methods. We reserve discussion of for Section LABEL:subsect:_experiments-GP. In Figure LABEL:fig:_VRSSD-performanceprofile we use performance profiles  to examine the impact of varying in VRSSD. A performance profile is conducted by running each parameterization with 300 randomized restarts with termination only once some pre-specified tolerance for accuracy has been reached. We count the proportion of realizations from each parameterization that achieves the specified tolerance within function evaluations where is the fewest function evaluations required in any of the trials, is twice as many function evaluations, etc. We use a function from the family defined in (LABEL:eq:_NesterovWorst) and fix . The objective value threshold we set is within of the optimal objective.
Figure LABEL:fig:_VRSSD-performanceprofile shows that the Lipschitz constant has little bearing on the relative performance of the different parameterizations, but the intrinsic dimension has a large role to play. This is so significant that the performance profiles are inverted from a high- to low- intrinsic dimension case. The top plots in Figure LABEL:fig:_VRSSD-performanceprofile paint a bleak picture for these stochastic methods: the more similar to gradient descent () the better their performance. Conversely, the bottom plots show that randomized methods provide a distinct advantage when the data is intrinsically low-dimensional. Recent work  takes a rigorous look at the manifold hypothesis which suggests that most real, high-dimensional data tend to lie near a low-dimensional manifold. This is corroborated in  where the authors provide theoretical rationale for the observed low-rank structure of large matrices. These two theoretical papers, along with ample empirical evidence [95, 3, 53, 93], suggest that the lower plots in Figure LABEL:fig:_VRSSD-performanceprofile, and the left plot in Figure LABEL:fig:_NesterovExample are more representative of problems that arise in practice. We have presented two extremes in an effort to provide a heuristic for choosing if the intrinsic dimension can be approximated. The paper  provides a method for estimating the intrinsic dimension via maximum likelihood. More recently,  provides a method for estimating the intrinsic dimension of an arbitrary data set by exploiting concentration inequalities of norms and angles for random samples in high dimensional space. In  the authors also conduct a thorough review of the literature and compare to several other methods for estimating intrinsic dimension.
3.2 Sparse Gaussian process parameter estimation
We apply the methods developed in Section LABEL:sect:_MainResults to hyperparameter estimation for Gaussian processes used in regression, where the goal is inference on a function based on noisy observations at points . The observations are modeled as
where the function is modeled as a zero-mean Gaussian process with covariance function
and is a symmetric positive-definite kernel that depends on a vector of parameters . The parameters can be used to control various characteristics of the process, among them the variance (kernel amplitude) and the scale of the correlation between inputs (length-scale). The process is assumed to be independent of the noise vector with unknown variance . The covariance matrix of the vector will be written as
Maximum likelihood estimates of the parameters are obtained by maximizing the log-marginal likelihood of of observations with density :
Difficulties arise in finding the optimal parameterization of Gaussian processes when the number of observations is large: due to the inversion and determinant calculation in equation (LABEL:eqn:_MLE), the cost of this maximization is . Various approximations exist, perhaps most popular among them in the machine learning community is that of . The basic idea is as follows: Let , , be points in different from the original (these are called “inducing points”), and let . Let be a multivariate density function on , and define the density on , where is the conditional density of f given . The Gaussian density that minimizes the Kullback-Leibler divergence between the density and the posterior density of given leads to the following lower bound for the loglikelihood :
Here is the loglikelihood of the multivariate Gaussian , where
is the the Nyström approximation of introduced in . Note that
That is, the Nyström approximation is controlled by the mean-squared error of the prediction of f based on . The trace term in (LABEL:eq:llbound) controls the size of this mean-square error. We note that the Nyström approximation is a low-rank non-negative update of and that optimal approximations of this form were studied in [88, 87] using Rao’s geodesic distance between SPD matrices as a measure of optimality. We do not pursue this approach here but recognize it as a potential alternative to the method of  that we use. Derivations of the conditional means and variances of a multivariate Gaussian can be found in, for example, .
Gradient-based methods are used to find an optimal placement of the inducing points by maximizing the lower bound in (LABEL:eq:llbound), which we re-state as a function of to be consistent with notation in previous sections:
Practically speaking, the optimization problem is -dimensional: represents the location of inducing points in dimensions, is the number of hyperparameters of the kernel, and we must estimate the noise variance. The benefit of moving to this high-dimensional optimization problem is that the time complexity is reduced to and the storage costs to . Consider that this objective function fits directly into the framework described by equation (LABEL:eq:_PDE): If we multiply (LABEL:eqn:_VariationalUpperBound) by -1 so that it becomes a minimization problem then the first term on the right-hand-side can be regarded as the loss functional, and the second term can be viewed as a regularization functional penalizing a large mean-square error. Figure LABEL:fig:_GPExample is a test problem using the same data as was used in  with fifteen inducing points. No recommendation is made in  regarding which optimization routine to use. In the left pane of Figure LABEL:fig:_GPExample we show the optimal result for the placement of the inducing points and the setting of the hyperparameters where the optimum is found using a BFGS solver on equation (LABEL:eqn:_VariationalUpperBound). The Gaussian process that arises from the initial settings is shown in the right pane of Figure LABEL:fig:_GPExample.
We use the variations of Algorithm LABEL:alg:varprod introduced in this paper to determine the placement of the inducing points and the settings of the hyperparameters for the same data set used in the original paper on variational learning of inducing points  and by its intellectual predecessor . The results are compared with gradient descent and BFGS solvers. The problem is non-convex and does not satisfy the PL-inequality so the theory introduced in this paper does not provide insight into what to expect. As is common with non-convex optimization problems we perform multiple random restarts to improve the chance of converging to the global minimum.
The data set contains 200 training points. We use a squared-exponential kernel defined by . The parameters to be learned are the amplitude , the lengthscale ,the noise variance , and the locations of inducing points in so that our optimization problem is -dimensional. For testing purposes we can control the dimension of the optimization problem by increasing the number of inducing points. We acknowledge that for such a simple problem the use of a zeroth order method is not required as tools such as autograd  for Python are able to find the gradient of (LABEL:eqn:_MLE) with respect to each of the parameters by automatic differentiation; in general though, the gradient may not be available, even through automatic differentiation packages as not all functions have been incorporated into such packages. It is also possible to take individual derivatives by hand, but this is not feasible in more complex problems in high-dimensions. Once again, we use performance profiles  to determine the effect of varying the parameters for different problem sizes. We run each parameterization 300 times. Results for SSD for are shown in Figure LABEL:fig:_SSD-PerformanceProfile for 30 and 60 inducing points.
The cut-off threshold is of the distance between the objective function at the parameter initialization and at the optima, as found by BFGS. It is apparent that for this problem, is not a good option. Similarly, can be ruled out due to the fact that it underperforms and approximately 90% (resp. 99%) of the time in the 30- (resp. 60-) dimensional problem. The case claims the best single performance: in the fastest trial it is roughly 100 (resp. 800) times faster than BFGS for the 30- (resp. 60-) dimensional problem, but the variance of the performance for is high, and about 1% of the time it performs at least 10 times slower than BFGS (not pictured). On the other hand, beats BFGS by a similar factor and seems to be insulated from the high variance that was observed for . A few trials of and found their way to a local minima, resulting in the methods not achieving the target threshold.
Figure LABEL:fig:_GP-performance shows the performance for single trials on several problems of different dimension. For high-dimensional problems the performance of BFGS degrades due to the fact that at each iteration the Hessian update is only rank two. This acts doubly against BFGS because high-dimensional problems have more expensive function evaluations, and more iterations are required before the Hessian is reasonably approximated. In contrast, the performance of SSD does not appear to degrade noticeably as the dimension increases. The performance of VRSSD is impacted by the occasional expensive full gradient evaluations but not to the same extent as BFGS. The value for the memory parameter in VRSSD is labeled to signify that 100 iterations of SSD are taken before any variance reduction is used, after which the memory parameter becomes 10. This is in recognition of the fact that SSD descends very quickly until it is near the solution, at which point it becomes beneficial to make use of full gradient information.
3.3 Shape optimization
As our third numerical example, we consider a shape optimization problem involving a linear, elastic structure. Consider a square plate of size with a hole subject to uniform boundary traction =1, as illustrated in Fig. LABEL:fig:_HoleGeometry.
We adopt a discretize-then-optimize approach to solving the PDE-constrained optimization problem. The discretization and optimization steps do not generally commute and an optimize-then-discretize approach may be preferable for some types of problems [38, §2.9], but we do not pursue this question here.
Our goal is to identify a shape of the hole that minimizes the maximum stress along the direction over a quarter of the plate corresponding to . To this end, we parameterize the radius of the hole for a given (see Figure LABEL:fig:_HoleGeometry) via
where is a user-defined parameter controlling the potential deviation from an n-gon of radius 1. For large values of it is possible to have a negative radius. The parameters that dictate the shape are and . Subscripts indicate the index of the vector. For our experiments we set so that the minimum possible radius of any particular control point is 0.2 at the initialization. We initialize the entries of and uniformly at random between 0 and 1. For each instance of and – equivalently – we generate a conforming triangular finite element mesh of the plate which we subsequently use within the FEniCS package  to solve for the maximum stress . A mesh refinement study is performed to ensure the spatial discretizatin errors are negligible. As we only model a quarter of the plate, we apply symmetry boundary conditions so that and displacements along and are zero. The Young’s modulus and Poisson’s ratio of the plate material are set to and , respectively.
The parametric radius defined by (LABEL:eqn:_parametric-radius) allows to scale the complexity of the problem arbitrarily by increasing the dimension . In effect, if is large then the problem becomes ill-conditioned since and each make at most additive contribution to the radius. Such ill-posedness suggests that gradient descent ought to perform poorly as it does not account for the curvature of the objective function. We expect BFGS to outperform gradient descent once it has iterated sufficiently to provide a good approximation of the Hessian. Based upon the intrinsic dimensionality discussion presented in Section LABEL:subsect:synthetic-data we anticipate SSD and VRSSD to also outperform gradient descent despite the fact that they also do not explicitly account for the curvature. Note that each function evaluation requires a PDE solve meaning that gradient descent and BFGS require PDE solves per iteration. Though a conforming finite element mesh is used to reduce the computational burden, the cost of so many PDE solves makes this problem intractable in high-dimensions unless the resolution of the mesh is very low. On the other hand, SSD and VRSSD require far fewer PDE solves per iteration provided we choose (Figure LABEL:fig:_SSD-PerformanceProfile suggests that may be reasonable). As mentioned above, the objective value is the maximum stress in the y-direction, , over the plate. In Figure LABEL:fig:_SSD-Plate we minimize the objective for a hole with shape governed by (LABEL:eqn:_parametric-radius) for problems with and using gradient descent and BFGS, as well as SSD with . For VRSSD we use the same hybrid method as described in Section LABEL:subsect:_GP, with identical rationale.
As expected, BFGS outperforms gradient descent, but it is slow to provide a reasonable estimate of the Hessian and hence, relatively slow to converge. On low- or medium-dimensional problems this is not an issue, but since BFGS makes a rank-2 update to the Hessian at every iteration, and -dimensional problems of the form explained in this section with radius specified by (LABEL:eqn:_parametric-radius) have a rank- Hessian. Thus, high-dimensional problems may require many iterations before the Hessian is well-approximated.
In the 100-dimensional problem in the left plot of Figure LABEL:fig:_SSD-Plate we notice convergence of SSD and VRSSD after few function evaluations; so few, in fact, that the difference in performance between SSD and VRSSD likely has more to do with random chance than variance reduction. In multiple restarts (not shown), similar performance was observed wherein the methods had converged before any variance reduction had occurred. In all cases, convergence occurs by the time gradient descent of BFGS are able to take more than a couple of steps.
In the 200-dimensional problem in the right plot of Figure LABEL:fig:_SSD-Plate the progress of gradient descent, SSD, and BFGS stall prematurely, apparently settling into local optima at different locations despite being initialized identically. The figure is somewhat deceptive as each method is indeed still descending along shallow valleys of the objective function as shown in the inset. VRSSD does not appear to be as susceptible to getting caught in such valleys for reasons we do not yet fully understand. We hypothesize that there are two phenomena at play. First, as the -dimensional subspace along which SSD and VRSSD descend changes with each iteration, they explore parameter space more erratically than deterministic methods, making them less likely to get funnelled into long, shallow basins; this is intuitively similar to the recent line of research which suggests that noisy perturbation of iterative algorithms helps avoid saddle points . Second, VRSSD uses past information from a full-gradient evaluation that push it in directions that may not be optimal from a single-step perspective but that may help it traverse these valleys in a way that is heuristically similar to the heavy ball method of Polyak ; this phenomena has recently been explored in the context of non-convex minimization, e.g., . Since SSD still appears to get caught in these low-gradient valleys, it is apparently the combination of these two reasons, the second of them, or some as-yet-undescribed property of VRSSD that allows it to avoid them. It could also be the case that this particular problem is amenable to VRSSD; however, further investigation on different problems is required to verify the exact reasons that VRSSD appears to perform so well.
We have presented a generalization of coordinate descent and gradient descent algorithms and have shown asymptotic convergence at a linear rate under various convexity assumptions. We have demonstrated empirical improvements compared to the status quo for several practical problems that do not necessarily adhere to the assumptions required by the theory. We have provided analysis that puts the ideas of the SVRG algorithm  in a stochastic subspace framework, making it a feasible alternative to deterministic gradient-based methods for problems that cannot be represented with a finite-sum structure. This analysis opens the doors to adapting other stochastic iterative methods designed for an ERM framework into a more general form. We also showed that any stochastic algorithm that converges in expectation at a linear rate also converges almost surely.
In the future we plan to adapt other first-order stochastic optimization techniques such as stochastic quasi-Newton  to the stochastic subspace descent framework. The Haar-distributed random matrices may not be the optimal choice for subspace projections, and some clever adaptive scheme could be beneficial for determining the decent directions; these ideas have been discussed at length in the coordinate descent literature [78, 67], and similar techniques may be available within the SSD framework. We would like to explore the feasibility of implementing our methods in parallel as has been discussed in the coordinate descent case in . A computationally straightforward extension may allow sketching methods (see e.g. ) to improve our results with minimal programmatic overhead, but analysis must be conducted to confirm the theoretical properties of such modifications. The analysis we have performed thus far does not explain some phenomena we have observed in our experiments; in particular, it appears as though the intrinsic dimension rather than the ambient dimension dictates the convergence rate of our methods, and that VRSSD is resistant to getting trapped in the shallow basins common to ill-conditioned non-convex problems.
Finally, recent work on a universal “catalyst” scheme  also applies to our method, allowing for Nesterov-style acceleration without requiring additional knowledge of the Lipschitz constants along any particular direction.
A Preliminary results
We summarize results from optimization and probability used in the proofs of convergence.
Lemma A.1 (A property of Lipschitz-continuous first derivatives).
Let be continuously-differentiable with -Lipschitz first derivative. Then for any ,
The proof can be found in [68, pg. 25-26].
Lemma A.2 (Strong-convexity implies PL-inequality).
Let be a -strongly convex, continuously-differentiable function. Denote the unique minimizer of by . Then:
Inequality (LABEL:PLineq) is the PL-inequality. The proof is simple and can be found in the Appendix of .
Lemma A.3 (A characterization of convex functions).
A function is convex with -Lipschitz derivative if and only if for all
Theorem A.4 (Strongly-convex functions).
If a function is -strongly convex with -Lipschitz derivative then for all
We now recall some basic definitions and results regarding convergence of random variables. A sequence of random vectors is said to converge a.s. to a random vector if and only if . It is said to converge in probability if and only if for all , as . It converges in for some integer if and only if . We will write , , and , respectively.
It is readily apparent that if and only if for all
as . It follows from this characterization that if . By subadditivity,
Finally, by Markov’s inequality, if , then .
B Proofs of main results
Proof of Theorem LABEL:thm:convergence
Define the filtration , and . Because is continuously-differentiable with a -Lipschitz derivative, Lemma LABEL:Lipschitz yields
Let be the error for a particular . Then, equations (LABEL:eq:_iterations) and (LABEL:Taylor) yield:
where we have used the fact that . We can choose such that ; that is, . With this choice the right-hand side is non-positive and the errors are non-increasing. Since the error is bounded below by zero the sequence converges almost surely but the actual limit remains to be determined. Furthermore, since the sequence is bounded above by , Lebesgue’s dominated convergence implies convergence of the sequence in . To find the limit we take conditional expectations of both sides to get
Since is -measurable, the gradients can be pulled out of the conditional expectation, the expectation can be passed through due to linearity. Noting that , this leaves us with
The first term on the right-hand side in equation (LABEL:intermediate) is negative so the PL-inequality leads to:
Recursive application yields
Taking an expectation of both sides yields
Any step-size forces the right-hand side to converge to zero. Thus, since for some and , we have both and .
Proof of Corollary LABEL:corr:strong-convexity part (i)
By Lemma LABEL:PLLemma, the strong-convexity assumed by this corollary implies the assumptions of Theorem LABEL:thm:convergence, so that . Now by strong-convexity,
Since the left-hand side converges almost surely to zero and , we have .
Proof of Corollary LABEL:corr:strong-convexity part (ii)
We begin by providing an upper bound for . Rearranging the terms in equation (LABEL:intermediate) we have
Combing (LABEL:UpperBound) with (LABEL:PLineq) we get
Repeated recursion then yields