Stochastic Subspace Descent
Funding: AD acknowledges funding by the US Department of Energy's Office of Science Advanced Scientific Computing Research, Award DE-SC0006402, and National Science Foundation Grant CMMI-145460.
Abstract
We present two stochastic descent algorithms that apply to unconstrained optimization and are particularly efficient when the objective function is slow to evaluate and gradients are not easily obtained, as in some PDE-constrained optimization and machine learning problems. The basic algorithm projects the gradient onto a random subspace at each iteration, similar to coordinate descent but without restricting directional derivatives to be along the axes. This algorithm is previously known but we provide new analysis. We also extend the popular SVRG method to this framework but without requiring that the objective function be written as a finite sum. We provide proofs of convergence for our methods under various convexity assumptions and show favorable results when compared to gradient descent and BFGS on nonconvex problems from the machine learning and shape optimization literature. We also note that our analysis gives a proof that the iterates of SVRG and several other popular first-order stochastic methods, in their original formulation, converge almost surely to the optimum; to our knowledge, prior to this work the iterates of SVRG had only been known to converge in expectation.
Key words. Randomized methods, optimization, derivative-free, Gaussian processes, shape optimization, stochastic gradients
AMS subject classifications. 90C06, 93B40, 65K10
1 Introduction
We consider optimization problems of the form
(1.1) $\min_{\mathbf{x} \in \mathbb{R}^d} f(\mathbf{x})$
or sometimes
(1.2) 
where $f : \mathbb{R}^d \to \mathbb{R}$ has a Lipschitz-continuous derivative. Throughout the document we will consider additional restrictions such as convexity or strong convexity of $f$, restrictions that will be made clear as they are required. In the most basic form discussed in Section LABEL:sect:_SSD, we project the gradient onto a random $\ell$-dimensional subspace and descend along that subspace as in gradient descent. This is accomplished using a random matrix $\mathbf{P}_k \in \mathbb{R}^{d \times \ell}$, for some $\ell \le d$, with the properties
$$\mathbb{E}\left[\mathbf{P}_k \mathbf{P}_k^\top\right] = \mathbf{I}_d, \qquad \mathbf{P}_k^\top \mathbf{P}_k = \frac{d}{\ell}\,\mathbf{I}_\ell,$$
in the following iteration scheme
(1.3) $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\, \mathbf{P}_k \mathbf{P}_k^\top \nabla f(\mathbf{x}_k).$
This algorithm has convergence properties consistent with gradient descent (see Theorem LABEL:thm:convergence) and allows us to leverage techniques from the substantial literature surrounding stochastic gradient descent (see Theorem LABEL:thm:VarianceReducedRandomGradient). Note that (1.3) with $\mathbf{P}_k \mathbf{P}_k^\top$ diagonal (i.e., $\mathbf{P}_k$ built from $\ell$ scaled columns of the identity) reduces to randomized block-coordinate descent [6].
A particular case of (1.1) is Empirical Risk Minimization (ERM), commonly used in machine learning,
$$\min_{\mathbf{x} \in \mathbb{R}^d} \; f(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} f_i(\mathbf{x}),$$
where $N$ is typically very large, and hence the problem is amenable to iterative stochastic methods that approximate $\nabla f$ at each iteration using a randomly sampled subset $S \subset \{1, \dots, N\}$ of observations, with $|S| \ll N$. While the methods discussed in this paper do not require this type of finite-sum structure, they can be employed to tackle such problems. In fact, the canonical parameter estimation problem in machine learning is one application we envision for our work.
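As a concrete illustration of this style of sampling (function names here are hypothetical, not from the paper), a mini-batch estimator of $\nabla f$ for the finite-sum case can be sketched as:

```python
import numpy as np

def minibatch_gradient(grad_fi, N, x, batch_size, rng):
    """Unbiased estimate of grad f = (1/N) sum_i grad f_i(x),
    built from a random subset S of the N observations, |S| << N."""
    S = rng.choice(N, size=batch_size, replace=False)
    return np.mean([grad_fi(i, x) for i in S], axis=0)
```

With `batch_size == N` this recovers the full gradient exactly; smaller batches trade variance for cost.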
There are several important classes of functions that do not fit into the ERM framework and therefore do not benefit from the recent stochastic algorithms tailored to ERM. Partial Differential Equation (PDE) constrained optimization is one such example, and except in special circumstances (such as [39]), a stochastic approach leveraging the ERM structure (such as stochastic gradient descent and its variants) does not provide any benefits. This is due to the fact that in PDE-constrained optimization the cost of evaluating each $f_i$ is often identical to the cost of evaluating $f$. A primary goal of this paper is to use variations on the prototypical iterative scheme (1.3) to adapt recent developments from the machine learning optimization literature so that stochastic methods can be used effectively in the PDE-constrained optimization setting. Along the way we present results that we think will be useful to the machine learning community. An example that comes directly from the machine learning literature is Gaussian processes, specifically the sparse Gaussian process framework of [85, 92]. The placement of so-called inducing points is rolled into the continuous optimization setting, causing the dimension of the problem to increase linearly with the desired number of points. This leads to high-dimensional problems for which traditional stochastic optimization algorithms provide no benefit.
1.1 PDE-constrained optimization
Partial differential equations are well-studied tools in applied mathematics, ubiquitous for their use in modeling physical phenomena. Successful application of PDEs to modeling is contingent upon appropriate discretization and parameter estimation. Parameter estimation in this setting is referred to as PDE-constrained optimization, and arises in optimal control, or whenever the parameters of the PDE are unknown, as in inverse problems. Algorithmic and hardware advances for PDE-constrained optimization have allowed for previously impossible modeling capabilities. Examples include fluid dynamics models with millions of parameters for tracking atmospheric contaminants [32], modeling the flow of the Antarctic ice sheet [48, 74], parameter estimation in seismic inversion [1, 14], groundwater hydrology [11], experimental design [47, 40], and atmospheric remote sensing [21].
In Section LABEL:sect:_EmpiricalResults we take a discretize-then-optimize approach and assume that the infinite-dimensional PDE has already been discretized by some appropriate mechanism such as finite differences or finite elements. With this in mind, the PDE-constrained optimization problem can be written as follows:
(1.4) $\min_{\mathbf{p}} \; \mathcal{L}\big(\mathbf{u}(\mathbf{p}), \mathbf{d}\big) + \lambda\, \mathcal{R}(\mathbf{p})$
That is, the goal is to minimize a trade-off between (a) the loss $\mathcal{L}$, which depends on the solution $\mathbf{u}$ to the PDE and the observed data $\mathbf{d}$, and (b) a regularization functional $\mathcal{R}$ that serves to keep the problem well-posed. The PDE is parameterized by $\mathbf{p}$, the dimension of which may be subject to the choice of discretization, and may be in the millions. The regularization parameter $\lambda$ controls the balance between the data misfit and the regularity of the solution. Typically $\lambda$ is unknown but there are a number of techniques for estimating it, see [91, §3.3], [64, §5.4], [44, 84]; we do not treat this issue here.
1.2 Gaussian processes
Gaussian processes are a very important class of stochastic processes used for modeling processes that are correlated in input space. Here we use them to model an unknown function in the context of regression. Elegant results from functional analysis, developed over the past 100 years, allow for the flexibility of working in function space using only machinery from linear algebra, but the applications of Gaussian processes are somewhat hamstrung in many modern settings because, for $n$ observations, their time complexity scales as $\mathcal{O}(n^3)$ and their storage as $\mathcal{O}(n^2)$. One recourse is to approximate the Gaussian process, allowing time complexity to be reduced to $\mathcal{O}(nm^2)$ with storage requirements of $\mathcal{O}(nm)$, where $m \ll n$ is the number of points used in lieu of the full data set. Methods have been developed [85, 92] which place these inducing points, also called landmark points, along the domain at locations different from the original inputs; optimal placement of these points is a continuous optimization problem with dimension at least the number of inducing points to be placed. Such a framework places a heavy burden on the optimization procedure, as improperly placed points may result in a poor approximation.
1.3 A historical perspective of iterative optimization
Despite being among the easiest to understand and oldest variants of gradient descent, subspace methods (by far the most common of which is coordinate descent) have been conspicuously absent from the optimization literature. The reasons for this vary but such underrepresentation may be partially attributed to the fact that until recently the performance of these types of algorithms did not justify the difficult theoretical work.
Coordinate descent schemes
The simplest variant of subspace descent is a deterministic method that cycles over the coordinates. Convergence results for cyclic coordinate descent require challenging analysis, and the class of functions for which it will converge is restricted; indeed, [98, 77] provide simple examples for which the method fails to converge while simpler-to-analyze methods such as gradient descent succeed. It was shown in [57] that relaxing the cyclic nature of the updates so that the schedule for updating along each coordinate is "almost cyclic" (that is, the function is updated along each coordinate at least once every $B$ iterations for some fixed $B$) allows for convergence guarantees under conditions only moderately more restrictive than those of gradient descent. Also, it was shown in [57] that using the deterministic Gauss-Southwell rule for selecting the descent direction results in convergence at a linear rate, a result corroborated by theoretical and experimental evidence in [19], and expanded upon in [72], which shows that the rate of convergence is competitive with the rates of the randomized method provided by [67].
The analysis in [67] for the case where the coordinates are chosen randomly provides convergence results on par with gradient descent without additional restrictions on the class of functions that can be optimized. This was generalized in [78] with slightly tighter analysis and an extension to composite functions. An overview of the recent progress until 2015, along with simplifications of the proofs for several results can be found in [101]. The algorithm described in Section LABEL:sect:_SSD is a generalization of results that can be found in [101, 67].
Since 2015, much emphasis has been placed on accelerating coordinate descent methods [2, 46] but those improvements require knowledge of the Lipschitz constants of the partial derivatives of the functions and/or special structure in the function to make updates inexpensive; in Section LABEL:sect:_Conclusions we discuss how recent analysis in [54] allows for acceleration of our methods without such an unrealistic assumption.
Optimization without derivatives
Our methods use directions $\mathbf{P}_k^\top \nabla f(\mathbf{x}_k)$, where $\mathbf{P}_k$ is $d \times \ell$ with $\ell \ll d$, which is computationally identical to taking $\ell$ directional derivatives of $f$, and can therefore be estimated by forward- or centered-differences in $\ell + 1$ or $2\ell$ function evaluations, respectively. We assume the error due to the finite difference approximation is negligible, which is realistic for low-precision optimization. Because our methods only evaluate $f$ and not $\nabla f$ directly, they fall into the class of derivative-free optimization (DFO) and "zeroth-order" optimization. Recent analysis has shown that many first-order methods (e.g., gradient descent, SVRG) maintain asymptotic convergence properties [55] when the gradients are estimated via finite differences. Additionally, non-asymptotic bounds on the rate of convergence for zeroth-order methods have been shown to be similar, albeit slightly worse, to their first-order counterparts [26].
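For instance, the directional derivatives $\mathbf{P}^\top \nabla f(\mathbf{x})$ can be estimated by forward differences at a cost of $\ell + 1$ evaluations of $f$; a minimal sketch (the step `h` is chosen naively here, not tuned as a careful implementation would):

```python
import numpy as np

def directional_derivatives(f, x, P, h=1e-6):
    """Estimate P.T @ grad f(x) by forward differences.
    P is d x ell, so this costs ell + 1 evaluations of f."""
    f0 = f(x)  # shared base evaluation
    return np.array([(f(x + h * P[:, j]) - f0) / h
                     for j in range(P.shape[1])])
```

Centered differences would replace `f0` with `f(x - h * P[:, j])`, doubling the cost to $2\ell$ evaluations in exchange for second-order accuracy.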
To be clear, when $\nabla f$ is readily available, zeroth-order optimization methods are not competitive with first-order methods and second-order methods (e.g., Newton's method). For example, if $f(\mathbf{x}) = \frac{1}{2}\|\mathbf{A}\mathbf{x} - \mathbf{b}\|^2$ and $\mathbf{A}$ is a matrix small enough to store in the memory of a computer, then evaluating $f$ and evaluating $\nabla f(\mathbf{x}) = \mathbf{A}^\top(\mathbf{A}\mathbf{x} - \mathbf{b})$ have nearly the same computational cost, namely that of a few matrix-vector products. In fact, such a statement is always true regardless of the structure of $f$: by using reverse-mode automatic differentiation (AD), one can theoretically evaluate $\nabla f$ in about four times the cost of evaluating $f$, regardless of the dimension [37]. In the context of PDE-constrained optimization, the popular adjoint-state method, which is a form of AD applied to either the continuous or discretized PDE, also evaluates $\nabla f$ in time independent of the dimension.
However, there are many situations when AD and the adjoint-state method are inefficient or not applicable. Finding the adjoint equation requires a careful derivation (which depends on the PDE as well as on initial and boundary conditions), and then a numerical method must be implemented to solve it, which takes considerable development time. For this reason, complicated codes that are often updated with new features, such as weather models, rarely have the capability to compute a full gradient. There are software packages that solve for the adjoint automatically, or run AD, but these require a programming environment that restricts the user, and may not be efficient in parallel high-performance computing environments. Even when the gradient is calculated automatically via software, the downside of these reverse-mode methods is a potential explosion of memory when creating temporary intermediate variables. For example, in unsteady fluid flow, the naive adjoint method requires storing the entire time-dependent PDE solution [70], and hybrid checkpointing schemes [96] are the subject of active research. Furthermore, in black-box simulations which only return function evaluations given inputs (see [20, §1.2] for examples) the derivative may not be available by any means.
Many DFO algorithms have been developed for the cases that do not admit obtaining a full gradient. Classic methods include grid search, Nelder-Mead, (quasi-) Monte Carlo sampling, simulated annealing, and MCMC methods [51]. More powerful modern algorithms include randomized methods, Evolution Strategies (ES) such as CMA-ES [43], Hit-and-Run [7] and random cutting planes [22], as well as textbook modern DFO methods ([20, Algo. 10.3], [71, Algo. 9.1]) based on interpolation and trust-regions. A limitation of all these methods is that they do not scale well to high dimensions, thus limiting their applicability. For this reason, and motivated by the success of stochastic gradient methods in machine learning, we turn to cheaper methods that scale well to high dimensions.
Stochastic DFO
Our basic stochastic subspace descent (SSD) method (1.3) has been previously explored under the names "random gradient," "random pursuit," "directional search," and "random search" (not to be confused with the hit-and-run style "random search" algorithm [79]). The algorithm dates back to the 1970s, with some analysis (cf. [29, Ch. 6] and [33, 86]), but it never achieved prominence since zeroth-order methods are not competitive with first-order methods when the gradient is available. Most analysis has focused on the specific case $\ell = 1$ [66, 69, 89, 52].
Recently, the random gradient method has seen renewed interest. [52] analyzes the case when $f$ is quadratic, and [89] provides an analysis (assuming a line search oracle). A variant of the method, using $\ell = 1$ and $\mathbf{P}$ comprised of IID Gaussian entries with mean zero and variance chosen such that $\mathbb{E}[\mathbf{P}\mathbf{P}^\top] = \mathbf{I}_d$, was popularized by Nesterov [66, 69], and various proximal, acceleration and noise-tolerant extensions and analysis appeared in [28, 27, 35, 17]. The novelty of our work, and distinguishing feature compared to prior work, is that the methods in the literature have not focused on high-dimensional projections or practical demonstrations. Another variant of random gradient has recently been proposed in the reinforcement learning community. The Google Brain Robotics team used orthogonal sampling to train artificial intelligence systems [18], but treated it as a heuristic to approximate Gaussian smoothing. Previous use of orthogonal sampling by the OpenAI team [81] used variance reduction, but only in the sense of antithetic pairs.
We note that the methods we propose may appear similar to stochastic gradient descent (SGD) methods popular in machine learning, but the analysis is completely different. Indeed, even under strong convexity and with a finite-sum structure, SGD methods require diminishing stepsizes and hence converge slowly, or else converge only to within a radius of the solution [12].
The most relevant recent work to the SSD scheme is the SEGA algorithm [45], which takes similar random measurements but then uses them in a completely different way, following their "sketch-and-project" methodology. Given a sketch $\mathbf{P}_k^\top \nabla f(\mathbf{x}_k)$ and an estimate $\mathbf{h}_k$ of the gradient at the previous step, the simplest version of SEGA finds a new estimate $\mathbf{h}_{k+1}$ by projecting $\mathbf{h}_k$ onto the set of vectors consistent with the sketch, followed by a debiasing step.
Alternatives
As a baseline, one could use $d + 1$ function evaluations to estimate $\nabla f$ using finite differences, which is clearly costly when $d$ is large and evaluating $f$ is expensive. Once $\nabla f$ is computed, one can run gradient descent, accelerated variants [65], nonlinear conjugate gradient methods [41], or quasi-Newton methods like BFGS and its limited-memory variant [71]. We will compare with these baselines in the numerical results section, particularly with plain gradient descent, since it is less sensitive to stepsize and errors compared to accelerated variants, and with BFGS, since it is widely used. We do not consider Newton, nor inexact Newton (Newton-Krylov) solvers such as [8, 9, 10], because these require estimating the Hessian as well. Some authors have recently investigated estimating the Hessian stochastically (or adapting BFGS to work naturally in the stochastic sampling case), such as [15, 61, 82, 63, 4, 5, 36, 97], but none apply directly to our setup as they require an ERM structure.
1.4 Structure of this document
The remainder of the paper is organized as follows: In Section LABEL:sect:_SSD we investigate the convergence behavior of the stochastic subspace descent method under various convexity assumptions, and provide rates of convergence. In Section LABEL:sect:_SVRG we recall the Stochastic Variance Reduced Gradient (SVRG) method of [50], and show how control variates can be used to reduce the variance in stochastic subspace descent to improve the rate of convergence. Along the way, we provide a simple lemma that relates convergence in expectation at a linear rate to almost sure convergence of our method. It can also be used to show that the convergence of SVRG and other popular first-order methods can be trivially extended to almost sure convergence. In Section LABEL:subsect:syntheticdata we provide empirical results on a simulated function that Nesterov has dubbed "the worst function in the world" [68]. In Section LABEL:subsect:_experimentsGP we find the optimal placement of inducing points for sparse Gaussian processes in the framework of [92]. As a final empirical demonstration, in Section LABEL:subsect:_experimentsplate we test our algorithms in the PDE-constrained optimization setting on a shape optimization problem where the goal is to identify the shape of a hole of fixed area which minimizes the maximum stress subject to boundary forces. We make concluding remarks and discuss future extensions in Section LABEL:sect:_Conclusions. For the sake of readability, proofs are relegated to the Appendix.
In this document, uppercase boldfaced letters represent matrices and lowercase boldfaced letters are vectors. The vector norm is assumed to be the 2-norm, and the matrix norm is the operator norm.
2 Main results
This section contains our main theoretical results. We present two algorithms in the SSD framework and provide convergence results under various assumptions. We additionally prove that the popular SVRG algorithm [50], along with several other commonly used stochastic descent methods, enjoys almost sure convergence of the iterates to the optimal solution.
2.1 SSD
In the following theorem and corollary we provide conditions under which the function values of stochastic subspace descent converge to the value at the optimum, $f(\mathbf{x}^*)$. In the case of a unique optimum we also provide conditions for the iterates $\mathbf{x}_k$ to converge to the optimum $\mathbf{x}^*$. Stochastic subspace descent, so-called because at each iteration the method descends in a random low-dimensional subspace of a vector space containing the gradient, is a gradient-free method, as it only requires computation of $\ell$ directional derivatives at each iteration without requiring direct access to the gradient. Scaled Haar-distributed random matrices define randomized directions (in the subspace) along which to descend at each iteration. Neither Theorem LABEL:thm:convergence nor the subsequent theorems require these Haar-distributed matrices in particular, but whatever matrices are used must satisfy Assumption (A0) below. See Appendix LABEL:App:_Misc for an algorithm for generating draws from a Haar-distributed matrix, and a brief discussion of the advantages of using Haar over random coordinate descent type schemes.
Assumptions 2.1.
For the remainder of this section we make use of the following assumptions on the matrices $\mathbf{P}_k$ and the function $f$ to be optimized.

(A0) $\mathbf{P}_k$, $k = 1, 2, \dots$, are independent random matrices such that $\mathbb{E}\left[\mathbf{P}_k \mathbf{P}_k^\top\right] = \mathbf{I}_d$ and $\mathbf{P}_k^\top \mathbf{P}_k = \frac{d}{\ell}\,\mathbf{I}_\ell$ with $\ell \le d$.

(A1) $f$ is continuously-differentiable with a Lipschitz first derivative (Lipschitz constant $L$).

(A2) The function $f$ attains its minimum $f^* > -\infty$.

(A3) For some $\gamma > 0$ and all $\mathbf{x}$, the function $f$ satisfies the Polyak-Lojasiewicz (PL) inequality of Lemma LABEL:PLLemma:
(2.1) $\|\nabla f(\mathbf{x})\|^2 \ge 2\gamma\big(f(\mathbf{x}) - f^*\big).$ Also, $\gamma \le L$ where $L$ is the Lipschitz constant in (A1).

(A3') $f$ is $\gamma$-strongly-convex for some $\gamma > 0$ and all $\mathbf{x}$. Also, $\gamma \le L$ where $L$ is the Lipschitz constant in (A1).

(A3'') $f$ is convex and attains its minimum on a set $\mathcal{X}^*$, and there is an $R < \infty$ such that for the parameter initialization $\mathbf{x}_0$,
$$\max_{\mathbf{x} \,:\, f(\mathbf{x}) \le f(\mathbf{x}_0)} \operatorname{dist}(\mathbf{x}, \mathcal{X}^*) \le R.$$
Coercivity of $f$ implies the existence of the constant $R$ in (A3''), hence Assumption (A3'') is mathematically equivalent to coercivity of $f$. For the results below, particularly the rate in Theorem LABEL:thm:_convergenceconvex, we require knowledge of the value of $R$. Also note that (A3') implies (A3) (Lemma LABEL:PLLemma).
Algorithm LABEL:Alg:_SSD provides basic pseudocode to implement stochastic subspace descent using the recursion (LABEL:eq:_iterations).
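A minimal sketch of recursion (1.3) is below; for concreteness we assume exact directional derivatives (in practice they would be finite-difference estimates), and draw the scaled Haar matrix via QR factorization of a Gaussian matrix, the standard construction:

```python
import numpy as np

def haar_projection(d, ell, rng):
    """Scaled Haar-distributed P (d x ell) with P.T @ P = (d/ell) * I
    and E[P @ P.T] = I."""
    Q, R = np.linalg.qr(rng.standard_normal((d, ell)))
    Q = Q * np.sign(np.diag(R))      # fix the QR sign ambiguity
    return np.sqrt(d / ell) * Q

def ssd(grad, x0, step, ell, n_iters, rng):
    """Stochastic subspace descent: x <- x - step * P P^T grad f(x)."""
    x = x0.copy()
    d = x.size
    for _ in range(n_iters):
        P = haar_projection(d, ell, rng)
        x -= step * (P @ (P.T @ grad(x)))
    return x
```

On the quadratic $f(\mathbf{x}) = \frac{1}{2}\|\mathbf{x}\|^2$ (so $L = \gamma = 1$), the stepsize $\ell/(dL)$ drives the iterates to the origin at a linear rate.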
Theorem 2.2 (Convergence of SSD).
Assume (A0), (A1), (A2), (A3) and let $\mathbf{x}_0$ be an arbitrary initialization. Then recursion (1.3) with stepsize $\alpha = \ell/(dL)$ results in $f(\mathbf{x}_k) \xrightarrow{a.s.} f^*$ and $\mathbb{E}\big[f(\mathbf{x}_k)\big] \to f^*$.
Formal definitions of the different modes of convergence are provided in the appendix. Since nonconvex functions may satisfy the PL inequality (2.1), Theorem LABEL:thm:convergence provides a convergence result for well-behaved nonconvex functions. More practically, in Corollary LABEL:corr:strongconvexity we show that Theorem LABEL:thm:convergence allows linear convergence for relevant classes of convex functions such as least squares, even when the functions are not strongly convex [42, §2.3].
Corollary 2.3 (Convergence under strongconvexity and rate of convergence).

Assume (A0), (A1), (A2), (A3') and let $\mathbf{x}_0$ be an arbitrary initialization. Then recursion (1.3) with stepsize $\alpha = \ell/(dL)$ results in $\mathbf{x}_k \xrightarrow{a.s.} \mathbf{x}^*$, where $\mathbf{x}^*$ is the unique minimizer of $f$.

Assume (A0), (A1), (A2), and (A3). Then with the stepsize $\alpha = \ell/(dL)$, the recursion defined in (1.3) attains the following expected rate of convergence:
(2.2) $\mathbb{E}\big[f(\mathbf{x}_k)\big] - f^* \le \theta^k \big(f(\mathbf{x}_0) - f^*\big),$ where $\theta = 1 - \frac{\gamma \ell}{L d} < 1$.
We note that if $\ell = d$ we recover a textbook rate of convergence for gradient descent [13, §9.3]. Under the more restrictive assumption of strong convexity, the result of Corollary LABEL:corr:strongconvexity is much stronger than Theorem LABEL:thm:convergence; we get almost sure convergence of the function values and of the iterates to the optimal solution at a linear rate. In inverse problems it is typically the convergence of $\mathbf{x}_k$, not of $f(\mathbf{x}_k)$, that is sought. Furthermore, if either assumption (A3) or (A3') is satisfied, SSD has a linear rate of convergence. This result may be somewhat counterintuitive for those more familiar with stochastic gradient descent, which is well known to provide a sublinear rate of convergence. By applying random sampling to dimensions in input space rather than to observations, the stepsize need not decay, allowing for a linear rate. In addition to the linear rate, we provide guidance for selecting an optimal stepsize, which in theory alleviates the need for the user to tune any hyperparameters. The optimal stepsize is easily interpreted as the quotient of the sampling rate $\ell/d$ and the global Lipschitz constant $L$. This is consistent with our intuition: more directional derivatives make for a better approximation of the gradient. In practice the Lipschitz and strong-convexity constants are rarely known, and a line search can be used.
Thus far, we have provided results for cases when the function satisfies the PL inequality or is strongly-convex. The proof in the convex case is different, but substantively similar to a proof of coordinate descent on convex functions found in [101].
Theorem 2.4 (Convergence under convexity).
Assume (A0), (A1), (A2), (A3''). Then recursion (1.3) with stepsize $\alpha = \ell/(dL)$ gives
$$\mathbb{E}\big[f(\mathbf{x}_k)\big] - f^* = \mathcal{O}\!\left(\frac{d L R^2}{\ell\, k}\right).$$
2.2 Variance Reduced Stochastic Subspace Descent (VRSSD)
In its most basic formulation, SGD suffers from a sublinear rate of convergence. This is due to a requirement for the stepsize to gradually decay to assure convergence of the iterates. Several methods have been designed to address this shortcoming, one of which is SVRG [50], which controls the variance of the gradient estimate using a control variate, and admits convergence with a constant stepsize, resulting in a linear rate of convergence. In this section we recall some results regarding SVRG from [50] before proving almost sure convergence of the iterates and function values, and ultimately adapting it to the framework of SSD. SVRG minimizes the functional $f(\mathbf{x}) = \frac{1}{N}\sum_{i=1}^{N} f_i(\mathbf{x})$ via the iterative scheme
(2.3) $\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\left(\nabla f_{i_k}(\mathbf{x}_k) - \nabla f_{i_k}(\tilde{\mathbf{x}}) + \nabla f(\tilde{\mathbf{x}})\right),$
where $i_k \in \{1, \dots, N\}$ is chosen uniformly at random, $\tilde{\mathbf{x}}$ is a past iterate updated every $m$ iterations, and the full gradient is also calculated every $m$ iterations:
(2.4) $\nabla f(\tilde{\mathbf{x}}) = \frac{1}{N}\sum_{i=1}^{N} \nabla f_i(\tilde{\mathbf{x}}).$
In [50] a proof is presented for the case where $\tilde{\mathbf{x}}$ is updated after $m$ iterations by setting $\tilde{\mathbf{x}} \leftarrow \mathbf{x}_t$, where $t$ is chosen uniformly at random from the most recent $m$ iterates. The authors call this updating scheme "Option 2" and recommend an alternative "Option 1", without proof, which involves setting $\tilde{\mathbf{x}} \leftarrow \mathbf{x}_m$ (the last iterate) instead. A proof of convergence for Option 1 is provided in [90]. The SVRG method is presented in Algorithm LABEL:alg:SVRG, and the convergence result of [50] is presented in Theorem LABEL:SVRG.
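The scheme above can be sketched compactly as follows (a minimal illustration, not the paper's Algorithm; `grads` is a hypothetical list of per-sample gradient functions $\nabla f_i$):

```python
import numpy as np

def svrg(grads, x0, step, m, n_epochs, rng, option=1):
    """SVRG sketch for f = (1/N) sum_i f_i, with both snapshot options."""
    N = len(grads)
    x_tilde = x0.copy()
    for _ in range(n_epochs):
        # Full gradient at the snapshot, refreshed every m iterations.
        full_grad = np.mean([g(x_tilde) for g in grads], axis=0)
        x = x_tilde.copy()
        iterates = []
        for _ in range(m):
            i = rng.integers(N)
            # Control-variate gradient estimate: unbiased, low variance
            # when x is close to x_tilde.
            x = x - step * (grads[i](x) - grads[i](x_tilde) + full_grad)
            iterates.append(x)
        # Option 1: last iterate; Option 2: uniformly random iterate.
        x_tilde = iterates[-1] if option == 1 else iterates[rng.integers(m)]
    return x_tilde
```

Note that when each $f_i$ is quadratic with the same curvature, the control variate cancels the sampling noise exactly and the iteration reduces to deterministic gradient descent.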
Theorem 2.5 (Convergence of SVRG [50]).
Assume that each $f_i$ is convex with Lipschitz derivative (constant $L$) and that $f$ is $\gamma$-strongly-convex. Assume that the memory parameter $m$ is sufficiently large so that
$$\rho = \frac{1}{\gamma \alpha (1 - 2L\alpha) m} + \frac{2L\alpha}{1 - 2L\alpha} < 1.$$
Then, for the sequences (2.3) and (2.4), using Option 2 to update $\tilde{\mathbf{x}}$, we have the following expected rate of convergence:
(2.5) $\mathbb{E}\big[f(\tilde{\mathbf{x}}_s)\big] - f^* \le \rho^s \big(f(\tilde{\mathbf{x}}_0) - f^*\big).$
The proof, omitted here, can be found in [50]. Since $\rho$ is a constant and $\rho < 1$, Theorem LABEL:SVRG implies that $\mathbb{E}\big[f(\tilde{\mathbf{x}}_s)\big] \to f^*$. We further show that the convergence also holds in other modes. This is a particular result of the following simple lemma:
Lemma 2.6 (a.s. convergence from a linear rate of convergence).
Consider the general problem of minimizing an objective function $f$ and set $f^* = \min_{\mathbf{x}} f(\mathbf{x})$. If a minimization algorithm converges in expectation at a linear rate; that is, if there exist $c \in (0, 1)$ and a constant $C < \infty$ such that
$$\mathbb{E}\big[f(\mathbf{x}_k)\big] - f^* \le C c^k,$$
then $f(\mathbf{x}_k) \xrightarrow{a.s.} f^*$. If, in addition, $f$ is strongly-convex, then $\mathbf{x}_k \xrightarrow{a.s.} \mathbf{x}^*$.
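The mechanism behind such a lemma is Markov's inequality combined with the Borel-Cantelli lemma; writing the assumed linear rate as $\mathbb{E}[f(\mathbf{x}_k)] - f^* \le C c^k$ with $c \in (0,1)$ and $C < \infty$, a one-line sketch is:

```latex
% For any fixed eps > 0, Markov's inequality gives
\Pr\big(f(\mathbf{x}_k) - f^* > \varepsilon\big)
  \;\le\; \frac{\mathbb{E}\big[f(\mathbf{x}_k)\big] - f^*}{\varepsilon}
  \;\le\; \frac{C\,c^k}{\varepsilon},
% which is summable in k.  By Borel--Cantelli, the events
% \{ f(\mathbf{x}_k) - f^* > \varepsilon \} occur only finitely often
% almost surely; since \varepsilon > 0 was arbitrary,
% f(\mathbf{x}_k) \to f^* almost surely.
```

Strong convexity then transfers the conclusion from function values to iterates, since $f(\mathbf{x}_k) - f^* \ge \frac{\gamma}{2}\|\mathbf{x}_k - \mathbf{x}^*\|^2$.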
The distinction between modes of convergence is not merely academic; see for instance [31, p.4] for a simple example of a sequence that converges in $L^p$ for any $p \ge 1$ but does not converge almost surely. As discussed in [54], the class of algorithms that satisfy the conditions required to invoke Lemma LABEL:thm:SVRGAS is substantial. This class includes many of the most popular gradient-based optimization methods such as block-coordinate descent, SAG [80], SAGA [23], SDCA [83], SVRG [50], Finito/MISO [24, 59], and their proximal variants. Though the conditions of Lemma LABEL:thm:SVRGAS are sufficient for almost sure convergence, they are not necessary: there are optimization algorithms, such as stochastic gradient descent, that converge almost surely for which the rate of convergence is sublinear. Of the aforementioned algorithms only SVRG is relevant to our present discussion:
Remark 1 (a.s. convergence of SVRG).
For the SVRG algorithm discussed in Theorem LABEL:SVRG, $f(\tilde{\mathbf{x}}_s) \xrightarrow{a.s.} f^*$ and $\tilde{\mathbf{x}}_s \xrightarrow{a.s.} \mathbf{x}^*$.
Proof.
The problems for which the theory of SVRG can be applied are strongly-convex. Also, by construction, $\mathbb{E}\big[f(\tilde{\mathbf{x}}_s)\big] - f^* \le \rho^s \big(f(\tilde{\mathbf{x}}_0) - f^*\big)$ with $\rho < 1$. Thus the result follows from Lemma LABEL:thm:SVRGAS.
As stated, SVRG is stochastic in that it randomly samples observations from the dataset. Our goal is different in that we are interested in random sampling along the dimensions of input space, motivating a new algorithm we call Variance Reduced Stochastic Subspace Descent (VRSSD). For subspace selection we generate a random matrix $\mathbf{P}_k$ using Algorithm LABEL:Haar, though again any matrices satisfying Assumption (A0) suffice. The main algorithm is described in Algorithm LABEL:alg:varprod. Notice that the similarity of SVRG and VRSSD is in their use of a control variate; in VRSSD we additionally incorporate a parameter $\eta$ which allows us to weight the influence of the control variate.
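Our reading of the VRSSD update can be sketched as follows (variable names are hypothetical and the snapshot rule mirrors SVRG's Option 1; the authoritative version is Algorithm LABEL:alg:varprod). The subspace gradient is combined with a control variate built from a periodically stored full gradient:

```python
import numpy as np

def vrssd(grad, x0, step, ell, m, n_epochs, eta, rng):
    """Variance-reduced SSD sketch: eta weights a control variate
    formed from a full gradient refreshed every m iterations."""
    d = x0.size
    x_tilde = x0.copy()
    for _ in range(n_epochs):
        full_grad = grad(x_tilde)          # stored full gradient
        x = x_tilde.copy()
        for _ in range(m):
            # Scaled Haar draw via QR of a Gaussian matrix.
            Q, R = np.linalg.qr(rng.standard_normal((d, ell)))
            P = np.sqrt(d / ell) * (Q * np.sign(np.diag(R)))
            g = P @ (P.T @ grad(x)) - eta * (P @ (P.T @ full_grad) - full_grad)
            x = x - step * g
        x_tilde = x                        # Option-1-style snapshot update
    return x_tilde
```

Setting `eta = 0` recovers plain SSD; `eta = 1` uses the full control variate, which is unbiased since $\mathbb{E}[\mathbf{P}\mathbf{P}^\top] = \mathbf{I}_d$.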
Theorem LABEL:thm:VarianceReducedRandomGradient allows for, but (unlike SVRG) does not require, $f$ to have a finite-sum representation, and as a consequence it does not require Lipschitz continuity of the individual samples $f_i$, like Theorem LABEL:SVRG does.
Theorem 2.7 (a.s. convergence of VRSSD).
Assume that $f$ is a $\gamma$-strongly convex, continuously-differentiable function with $L$-Lipschitz gradient, for some $0 < \gamma \le L$. Define the condition number $\kappa = L/\gamma$. Let $\mathbf{x}^*$ be the minimizer of $f$.

(i) Fix a memory parameter $m$, a sampling rate $\ell$, and a stepsize $\alpha$ such that
Then, Algorithm LABEL:alg:varprod with Option 2 and $\eta = 1$ has the following expected rate of convergence:
(2.6) and $\mathbf{x}_K \xrightarrow{a.s.} \mathbf{x}^*$.

(ii) Fix $m$, $\ell$, and $\alpha$ such that
(2.7) Then, with the control variate parameter:
(2.8) Algorithm LABEL:alg:varprod using Option 2 has the following expected rate of convergence:
(2.9) and $\mathbf{x}_K \xrightarrow{a.s.} \mathbf{x}^*$.
For appropriately chosen parameters $\alpha$ and $\eta$, the variance reduction introduced in Theorem LABEL:thm:VarianceReducedRandomGradient leads to an improved convergence rate compared to the simpler method presented in Theorem LABEL:thm:convergence. The benefits can be seen in practice, as we show in Section LABEL:subsect:_experimentsplate. In part (ii) of Theorem LABEL:thm:VarianceReducedRandomGradient we show that it is possible to further improve the rate by selecting a control variate parameter $\eta$ to optimally weight the impact of the full gradient that is held in memory. In theory we see the following effects on the rate for VRSSD with the optimal control variate, all of which are consistent with our intuition:

A large sampling rate $\ell$ improves per-iteration convergence.

A large memory parameter $m$ improves per-iteration convergence, but each iteration is only counted after $m$ inner loops have been completed, so the effect of $m$ cancels (in theory).

Poor conditioning (large $\kappa = L/\gamma$) has a deleterious effect on the convergence rate, commensurate with its effect on the convergence rate of gradient descent.

A large Lipschitz constant $L$ must be offset by a small stepsize $\alpha$.

It is possible to find the optimal stepsize for a known local Lipschitz constant and strong-convexity constant, and a fixed $\ell$.
In many of the problems for which SSD is useful, it is undesirable to compute the quantity appearing in the optimal control variate parameter, as this would amount to additional function evaluations per iteration. For this reason, in practice, we recommend approximating it with the current estimate of the gradient instead,
(2.10) 
Though this approximation is not supported by the theory, empirical performance in Section LABEL:sect:_EmpiricalResults suggests the validity of such an approximation. We chose the approximation in (2.10) because it is an unbiased estimate of the optimal control variate parameter at each iteration, and it has the advantage that no additional function evaluations are required per iteration. It also has the effect that some of the intuitions outlined above may not hold; this in itself is not necessarily a problem, as the Lipschitz and strong-convexity constants are rarely known and a line search can be used to determine the stepsize; in our experiments we use an Armijo backtracking line search as discussed in [71].
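The Armijo backtracking line search referred to above can be sketched as follows (the constants are conventional textbook defaults, not values prescribed by the paper):

```python
import numpy as np

def armijo_step(f, x, direction, deriv, step0=1.0, c=1e-4,
                shrink=0.5, max_iter=50):
    """Backtracking line search: shrink the step until the Armijo
    (sufficient decrease) condition holds.  `deriv` is the directional
    derivative of f at x along `direction` (negative for descent)."""
    step, f0 = step0, f(x)
    for _ in range(max_iter):
        if f(x + step * direction) <= f0 + c * step * deriv:
            return step
        step *= shrink
    return step
```

In the SSD setting, `direction` would be the sampled search direction $-\mathbf{P}\mathbf{P}^\top \nabla f(\mathbf{x})$ and `deriv` its (finite-difference) directional derivative, so the line search costs only extra function evaluations, never gradients.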
In order to understand the role of $\eta$ more clearly, we consider three possible scenarios: $\eta = 0$ corresponds to standard SSD as outlined in Theorem LABEL:thm:convergence, $\eta = 1$ corresponds to part (i) of Theorem LABEL:thm:VarianceReducedRandomGradient, and finally the optimal $\eta^*$ is given by (LABEL:eqn:_optimaleta) and used in part (ii) of Theorem LABEL:thm:VarianceReducedRandomGradient. With the understanding that the true gradient is the optimal search direction with respect to one-step improvement, we would like our search direction to vary around it as little as possible. To this end, we ascertain the conditional mean-square error, defined as the expectation of the squared deviation of the search direction from the true gradient, given information at the current iteration (i.e., conditional on the $\sigma$-algebras defined in the appendix). We denote this conditional error as CMSE.

Case 1 (Theorem LABEL:thm:convergence)

Case 2 (Theorem LABEL:thm:VarianceReducedRandomGradient, part (i))

Case 3 (Theorem LABEL:thm:VarianceReducedRandomGradient, part (ii))
CMSE (2.11)
These identities follow from the proofs in the appendix. Equation (LABEL:eqn:_etaanalysis) shows that the optimal parameter never increases the variance compared to SSD and can in fact drop it to zero, though an increase in the variance is possible for suboptimal choices (e.g., in the event the parameter is negative, which may occur if the iterates traverse substantial portions of parameter space between full gradient calculations). These results are in agreement with Theorem LABEL:thm:VarianceReducedRandomGradient, which suggests that the per-iteration rate of convergence using an optimal parameter is strictly less than the rate without variance reduction.
3 Experimental results
In this section we provide results for a synthetic problem (Section LABEL:subsect:syntheticdata), a problem from the machine learning literature (Section LABEL:subsect:_experimentsGP), and a shape-optimization problem (Section LABEL:subsect:_experimentsplate). For the sake of fair comparisons, we compare to zeroth-order variants of gradient descent and BFGS where the gradient is calculated via forward differences. In the synthetic problem we also compare to a randomized block-coordinate descent. For VRSSD we use the approximation (LABEL:eqn:_VRSSDapproximation) of the optimal control variate parameter, and we use Option 1 for updating the reference point.
3.1 Synthetic data
We explore the performance of the algorithms on Nesterov’s “worst function in the world” [68]. Fix a Lipschitz constant and let
(3.1) 
where the subscripts index the coordinates of the argument, and a constant integer defines the intrinsic dimension of the problem. This function is convex with global minimum
Nesterov has shown that a broad class of iterative first-order solvers perform poorly when minimizing this function. We compare our results to randomized block-coordinate descent and gradient descent for two illustrative examples from the family of functions defined by (LABEL:eq:_NesterovWorst). We do not compare to BFGS, as we do not run enough iterations for it to differ substantially from gradient descent. For the randomized methods we fix the subspace dimension; in Section LABEL:subsect:_experimentsGP we explore the impact that different choices of the subspace dimension have on convergence. We expect coordinate descent to perform poorly, as there is only a single coordinate of this function that provides a descent direction from the initialization; for high-dimensional problems it becomes exceedingly unlikely that the “correct” coordinate is chosen. Also, since it is a block coordinate descent, the impact of the other coordinates chosen outweighs the benefit, and the stepsize gets pushed to zero; hence, in both examples shown in Figure LABEL:fig:_NesterovExample we see no improvement from coordinate descent. Had we used a single coordinate per iteration instead, improvement could only be expected on the rare iterations in which the “correct” coordinate is drawn, rather than at every iteration as in the other methods.
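For reference, one common form of this family can be coded directly; the precise constant factors vary across presentations of [68], so treat the scaling below as an assumption. Only the first r coordinates enter the function, which is what makes r the intrinsic dimension.

```python
import numpy as np

def nesterov_worst(x, L=1.0, r=10):
    """A common form of Nesterov's 'worst function in the world' (sketch).

    Only the first r coordinates of x enter the function, so r is the
    intrinsic dimension regardless of the ambient dimension of x.
    """
    z = x[:r]
    # Tridiagonal quadratic coupling of consecutive coordinates,
    # minus a linear term in the first coordinate.
    quad = z[0] ** 2 + np.sum((z[:-1] - z[1:]) ** 2) + z[-1] ** 2
    return (L / 4.0) * (0.5 * quad - z[0])
```

Under this convention the minimizer decays linearly in the coordinate index and is zero beyond the r-th coordinate, matching the intrinsically low-dimensional structure exploited in the left plot of the experiments.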
If the function is intrinsically low-dimensional but it is embedded into a high-dimensional ambient space, then SSD and VRSSD perform exceptionally well (see the left plot of Figure LABEL:fig:_NesterovExample). This holds true for the Gaussian process example of Section LABEL:subsect:_experimentsGP as well. On the other hand, if the space is truly high-dimensional then the stochastic methods do not appear to provide any benefit.
There are hyperparameters to set in our methods. We reserve discussion of the subspace dimension for Section LABEL:subsect:_experimentsGP. In Figure LABEL:fig:_VRSSDperformanceprofile we use performance profiles [25] to examine the impact of varying the VRSSD parameters. A performance profile is conducted by running each parameterization with 300 randomized restarts, terminating only once some prespecified accuracy tolerance has been reached. We count the proportion of realizations from each parameterization that achieve the specified tolerance within a given budget of function evaluations, where the smallest budget is the fewest function evaluations required in any of the trials, the next budget is twice as many function evaluations, etc. We use a function from the family defined in (LABEL:eq:_NesterovWorst). The objective value threshold we set is within a small tolerance of the optimal objective.
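The counting procedure just described can be sketched as follows. Here evals is a hypothetical mapping from each parameterization to the function-evaluation counts of its trials (np.inf for trials that never reach the tolerance), and taus are the budget multipliers.

```python
import numpy as np

def performance_profile(evals, taus):
    """Fraction of trials solved within tau times the best evaluation
    count, for each tau (a sketch of the profile construction in [25]).

    evals: dict mapping method/parameterization name to an array of
    function-evaluation counts needed to reach the accuracy threshold
    (np.inf for trials that never reach it).
    """
    # The reference budget is the fewest evaluations over all trials.
    best = min(np.min(v) for v in evals.values())
    return {name: [float(np.mean(np.asarray(v) <= t * best)) for t in taus]
            for name, v in evals.items()}
```

Note that classical Dolan-More profiles normalize per problem instance; the per-trial normalization above follows the description in the text.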
Figure LABEL:fig:_VRSSDperformanceprofile shows that the Lipschitz constant has little bearing on the relative performance of the different parameterizations, but the intrinsic dimension plays a large role, so much so that the performance profiles are inverted between the high and low intrinsic-dimension cases. The top plots in Figure LABEL:fig:_VRSSDperformanceprofile paint a bleak picture for these stochastic methods: the more similar to gradient descent, the better their performance. Conversely, the bottom plots show that randomized methods provide a distinct advantage when the data is intrinsically low-dimensional. Recent work [30] takes a rigorous look at the manifold hypothesis, which suggests that most real, high-dimensional data tend to lie near a low-dimensional manifold. This is corroborated in [94], where the authors provide theoretical rationale for the observed low-rank structure of large matrices. These two theoretical papers, along with ample empirical evidence [95, 3, 53, 93], suggest that the lower plots in Figure LABEL:fig:_VRSSDperformanceprofile and the left plot in Figure LABEL:fig:_NesterovExample are more representative of problems that arise in practice. We have presented two extremes in an effort to provide a heuristic for choosing the subspace dimension when the intrinsic dimension can be approximated. The paper [53] provides a method for estimating the intrinsic dimension via maximum likelihood. More recently, [16] provides a method for estimating the intrinsic dimension of an arbitrary data set by exploiting concentration inequalities of norms and angles for random samples in high-dimensional space. The authors of [16] also conduct a thorough review of the literature and compare several other methods for estimating intrinsic dimension.
3.2 Sparse Gaussian process parameter estimation
We apply the methods developed in Section LABEL:sect:_MainResults to hyperparameter estimation for Gaussian processes used in regression, where the goal is inference on a function based on noisy observations at a set of input points. The observations are modeled as
where the function is modeled as a zero-mean Gaussian process with covariance function
and the covariance function is a symmetric positive-definite kernel that depends on a vector of parameters. The parameters can be used to control various characteristics of the process, among them the variance (kernel amplitude) and the scale of the correlation between inputs (lengthscale). The process is assumed to be independent of the noise vector, whose variance is unknown. The covariance matrix of the observation vector will be written as
(3.2) 
Maximum likelihood estimates of the parameters are obtained by maximizing the log-marginal likelihood of the observations with density [99]:
(3.3) 
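As a point of reference, the Gaussian log-marginal likelihood in (LABEL:eqn:_MLE) takes the standard form and can be evaluated stably with a Cholesky factorization; the function and argument names below are our own illustrative choices.

```python
import numpy as np

def log_marginal_likelihood(K, y, noise_var):
    """Gaussian-process log-marginal likelihood (standard textbook form).

    K: n-by-n kernel matrix of the latent process at the training inputs,
    y: observation vector, noise_var: observation noise variance.
    A Cholesky factorization supplies both the solve and the determinant.
    """
    n = y.size
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    # alpha = (K + noise_var I)^{-1} y via two triangular solves.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # log|K + noise_var I| = 2 * sum(log(diag(L))).
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))
            - 0.5 * n * np.log(2 * np.pi))
```

The Cholesky route avoids forming an explicit inverse and makes the cubic cost discussed next explicit: the factorization dominates at O(n^3).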
Difficulties arise in finding the optimal parameterization of Gaussian processes when the number of observations is large: due to the inversion and determinant calculation in equation (LABEL:eqn:_MLE), the cost of this maximization is cubic in the number of observations. Various approximations exist; perhaps the most popular among them in the machine learning community is that of [92]. The basic idea is as follows: choose a set of “inducing points,” distinct from the original inputs, and place a multivariate density on the corresponding latent values; this defines a density on the latent function values at the data through the conditional density of f given the values at the inducing points. The Gaussian density that minimizes the Kullback-Leibler divergence to the posterior leads to the following lower bound for the log-likelihood [92]:
(3.4) 
Here the first term is the log-likelihood of a multivariate Gaussian, where
the covariance is built from the Nyström approximation of the full kernel matrix introduced in [100]. Note that
That is, the Nyström approximation is controlled by the mean-squared error of the prediction of f based on the values at the inducing points. The trace term in (LABEL:eq:llbound) controls the size of this mean-square error. We note that the Nyström approximation is a low-rank nonnegative update, and that optimal approximations of this form were studied in [88, 87] using Rao's geodesic distance between SPD matrices as a measure of optimality. We do not pursue this approach here but recognize it as a potential alternative to the method of [92] that we use. Derivations of the conditional means and variances of a multivariate Gaussian can be found in, for example, [91].
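The Nyström matrix itself is simple to form. The following is a hedged sketch; the small jitter added for numerical stability is our own choice and may or may not appear in the cited references.

```python
import numpy as np

def nystrom(K_nm, K_mm):
    """Nystrom approximation Q = K_nm K_mm^{-1} K_nm^T (sketch).

    K_nm: n-by-m cross-kernel between the data and the inducing points,
    K_mm: m-by-m kernel among the inducing points.
    A small jitter keeps the solve numerically stable.
    """
    m = K_mm.shape[0]
    jitter = 1e-10 * np.eye(m)
    return K_nm @ np.linalg.solve(K_mm + jitter, K_nm.T)
```

When the inducing points coincide with the data the approximation is exact, which is a convenient sanity check; in practice the m inducing points are far fewer than the n data points, giving a rank-m approximation.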
Gradient-based methods are used to find an optimal placement of the inducing points by maximizing the lower bound in (LABEL:eq:llbound), which we restate as a function of the optimization variables to be consistent with notation in previous sections:
(3.5) 
Practically speaking, the optimization problem has one dimension per coordinate of each inducing point, plus the kernel hyperparameters and the noise variance that must also be estimated. The benefit of moving to this high-dimensional optimization problem is that the time and storage complexity become linear in the number of observations (and polynomial only in the much smaller number of inducing points). Consider that this objective function fits directly into the framework described by equation (LABEL:eq:_PDE): if we multiply (LABEL:eqn:_VariationalUpperBound) by -1 so that it becomes a minimization problem, then the first term on the right-hand side can be regarded as the loss functional, and the second term can be viewed as a regularization functional penalizing a large mean-square error. Figure LABEL:fig:_GPExample is a test problem using the same data as was used in [92] with fifteen inducing points. No recommendation is made in [92] regarding which optimization routine to use. In the left pane of Figure LABEL:fig:_GPExample we show the optimal result for the placement of the inducing points and the setting of the hyperparameters, where the optimum is found using a BFGS solver on equation (LABEL:eqn:_VariationalUpperBound). The Gaussian process that arises from the initial settings is shown in the right pane of Figure LABEL:fig:_GPExample.
We use the variations of Algorithm LABEL:alg:varprod introduced in this paper to determine the placement of the inducing points and the settings of the hyperparameters for the same data set used in the original paper on variational learning of inducing points [92] and by its intellectual predecessor [85]. The results are compared with gradient descent and BFGS solvers. The problem is nonconvex and does not satisfy the PL inequality, so the theory introduced in this paper does not provide insight into what to expect. As is common with nonconvex optimization problems, we perform multiple random restarts to improve the chance of converging to the global minimum.
The data set contains 200 training points. We use a squared-exponential kernel. The parameters to be learned are the amplitude, the lengthscale, the noise variance, and the locations of the inducing points, so that the dimension of the optimization problem grows linearly with the number of inducing points. For testing purposes we can control the dimension of the optimization problem by increasing the number of inducing points. We acknowledge that for such a simple problem the use of a zeroth-order method is not required, as tools such as autograd [58] for Python are able to find the gradient of (LABEL:eqn:_MLE) with respect to each of the parameters by automatic differentiation; in general, though, the gradient may not be available, even through automatic differentiation packages, as not all functions have been incorporated into such packages. It is also possible to take individual derivatives by hand, but this is not feasible in more complex problems in high dimensions. Once again, we use performance profiles [25] to determine the effect of varying the parameters for different problem sizes. We run each parameterization 300 times. Results for SSD with several subspace dimensions are shown in Figure LABEL:fig:_SSDPerformanceProfile for 30 and 60 inducing points.
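One common parameterization of the squared-exponential kernel is shown below; the paper's exact convention for the amplitude and lengthscale may differ, so treat this as an illustrative form.

```python
import numpy as np

def sq_exp_kernel(X1, X2, amplitude=1.0, lengthscale=1.0):
    """Squared-exponential kernel matrix (one common convention).

    X1: n-by-d and X2: m-by-d arrays of input locations; returns the
    n-by-m matrix of pairwise kernel values.
    """
    # Pairwise squared Euclidean distances via broadcasting.
    sq = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return amplitude**2 * np.exp(-0.5 * sq / lengthscale**2)
```

With this convention the diagonal of the kernel matrix equals the squared amplitude, and the lengthscale controls how quickly correlation decays with input distance.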
The cutoff threshold is a small fraction of the distance between the objective function at the parameter initialization and at the optimum, as found by BFGS. It is apparent that for this problem the smallest subspace dimension is not a good option. Similarly, another choice can be ruled out because it underperforms the remaining two approximately 90% (resp. 99%) of the time in the 30- (resp. 60-) dimensional problem. One choice claims the best single performance: in the fastest trial it is roughly 100 (resp. 800) times faster than BFGS for the 30- (resp. 60-) dimensional problem, but the variance of its performance is high, and about 1% of the time it performs at least 10 times slower than BFGS (not pictured). On the other hand, a neighboring choice beats BFGS by a similar factor and seems to be insulated from the high variance observed for the former. A few trials of two of the parameterizations found their way to a local minimum, resulting in the methods not achieving the target threshold.
Figure LABEL:fig:_GPperformance shows the performance of single trials on several problems of different dimension. For high-dimensional problems the performance of BFGS degrades because each iteration's Hessian update is only rank two. This acts doubly against BFGS: high-dimensional problems have more expensive function evaluations, and more iterations are required before the Hessian is reasonably approximated. In contrast, the performance of SSD does not appear to degrade noticeably as the dimension increases. The performance of VRSSD is impacted by the occasional expensive full-gradient evaluations, but not to the same extent as BFGS. The value reported for the memory parameter in VRSSD signifies that 100 iterations of SSD are taken before any variance reduction is used, after which the memory parameter becomes 10. This is in recognition of the fact that SSD descends very quickly until it is near the solution, at which point it becomes beneficial to make use of full gradient information.
3.3 Shape optimization
As our third numerical example, we consider a shape optimization problem involving a linear, elastic structure. Consider a square plate with a hole, subject to uniform boundary traction, as illustrated in Fig. LABEL:fig:_HoleGeometry.
We adopt a discretize-then-optimize approach to solving the PDE-constrained optimization problem. The discretization and optimization steps do not generally commute, and an optimize-then-discretize approach may be preferable for some types of problems [38, §2.9], but we do not pursue this question here.
Our goal is to identify a shape of the hole that minimizes the maximum stress along the y-direction over a quarter of the plate. To this end, we parameterize the radius of the hole as a function of the polar angle (see Figure LABEL:fig:_HoleGeometry) via
(3.6) 
where a user-defined parameter controls the potential deviation from an n-gon of radius 1; for large values of this parameter it is possible to have a negative radius. The parameters that dictate the shape are the two coefficient vectors, with subscripts indicating the index within each vector. For our experiments we set the deviation parameter so that the minimum possible radius of any particular control point is 0.2 at the initialization. We initialize the entries of both coefficient vectors uniformly at random between 0 and 1. For each instance of the coefficients, and hence each hole shape, we generate a conforming triangular finite element mesh of the plate, which we subsequently use within the FEniCS package [56] to solve for the maximum stress. A mesh refinement study is performed to ensure the spatial discretization errors are negligible. As we only model a quarter of the plate, we apply symmetry boundary conditions so that the normal displacements along the two symmetry edges are zero. The Young's modulus and Poisson's ratio of the plate material are held fixed.
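As an illustration only, a trigonometric parameterization consistent with the description (an alpha-scaled perturbation of the unit circle driven by two coefficient vectors a and b) might look like the following; the exact form of (LABEL:eqn:_parametricradius) may differ, and the function and argument names are hypothetical.

```python
import numpy as np

def hole_radius(theta, a, b, alpha):
    """Hypothetical parametric hole radius: unit circle plus an
    alpha-scaled trigonometric perturbation with coefficient
    vectors a and b (illustrative, not the paper's exact formula).
    """
    theta = np.atleast_1d(theta)
    j = np.arange(1, a.size + 1)
    # Sum over Fourier modes j for each angle in theta.
    return 1.0 + alpha * np.sum(
        a * np.cos(np.outer(theta, j)) + b * np.sin(np.outer(theta, j)),
        axis=-1)
```

With coefficients in [0, 1], choosing alpha small enough bounds the total perturbation away from 1, which is the role of the minimum-radius condition stated above.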
The parametric radius defined by (LABEL:eqn:_parametricradius) allows one to scale the complexity of the problem arbitrarily by increasing the dimension. In effect, if the dimension is large then the problem becomes ill-conditioned, since each coefficient makes at most a small additive contribution to the radius. Such ill-conditioning suggests that gradient descent ought to perform poorly, as it does not account for the curvature of the objective function. We expect BFGS to outperform gradient descent once it has iterated sufficiently to provide a good approximation of the Hessian. Based upon the intrinsic-dimensionality discussion presented in Section LABEL:subsect:syntheticdata, we anticipate that SSD and VRSSD will also outperform gradient descent despite the fact that they, too, do not explicitly account for the curvature. Note that each function evaluation requires a PDE solve, meaning that forward-difference gradient descent and BFGS require one PDE solve per dimension, plus one, per iteration. Though a conforming finite element mesh is used to reduce the computational burden, the cost of so many PDE solves makes this problem intractable in high dimensions unless the resolution of the mesh is very low. On the other hand, SSD and VRSSD require far fewer PDE solves per iteration provided we choose a small subspace dimension (Figure LABEL:fig:_SSDPerformanceProfile suggests that a modest value may be reasonable). As mentioned above, the objective value is the maximum stress in the y-direction over the plate. In Figure LABEL:fig:_SSDPlate we minimize the objective for a hole with shape governed by (LABEL:eqn:_parametricradius) for problems with 100 and 200 dimensions using gradient descent and BFGS, as well as SSD. For VRSSD we use the same hybrid memory scheme as described in Section LABEL:subsect:_GP, with identical rationale.
As expected, BFGS outperforms gradient descent, but it is slow to build a reasonable estimate of the Hessian and hence relatively slow to converge. On low- or medium-dimensional problems this is not an issue, but BFGS makes only a rank-2 update to its Hessian approximation at every iteration, while problems of the form described in this section, with radius specified by (LABEL:eqn:_parametricradius), have a high-rank Hessian. Thus, high-dimensional problems may require many iterations before the Hessian is well-approximated.
In the 100-dimensional problem in the left plot of Figure LABEL:fig:_SSDPlate we notice convergence of SSD and VRSSD after so few function evaluations that the difference in performance between SSD and VRSSD likely has more to do with random chance than variance reduction. In multiple restarts (not shown), similar performance was observed, wherein the methods had converged before any variance reduction had occurred. In all cases, convergence occurs by the time gradient descent or BFGS are able to take more than a couple of steps.
In the 200-dimensional problem in the right plot of Figure LABEL:fig:_SSDPlate the progress of gradient descent, SSD, and BFGS stalls prematurely, with the methods apparently settling into local optima at different locations despite being initialized identically. The figure is somewhat deceptive, as each method is indeed still descending along shallow valleys of the objective function, as shown in the inset. VRSSD does not appear to be as susceptible to getting caught in such valleys, for reasons we do not yet fully understand. We hypothesize that there are two phenomena at play. First, as the random subspace along which SSD and VRSSD descend changes with each iteration, they explore parameter space more erratically than deterministic methods, making them less likely to get funnelled into long, shallow basins; this is intuitively similar to the recent line of research which suggests that noisy perturbation of iterative algorithms helps avoid saddle points [34]. Second, VRSSD uses past information from a full-gradient evaluation that pushes it in directions that may not be optimal from a single-step perspective but that may help it traverse these valleys, in a way that is heuristically similar to the heavy ball method of Polyak [76]; this phenomenon has recently been explored in the context of nonconvex minimization, e.g., [49]. Since SSD still appears to get caught in these low-gradient valleys, it is apparently the combination of these two mechanisms, the second of them alone, or some as-yet-undescribed property of VRSSD that allows it to avoid them. It could also be the case that this particular problem happens to be amenable to VRSSD; further investigation on different problems is required to verify the exact reasons that VRSSD appears to perform so well.
4 Conclusions
We have presented a generalization of coordinate descent and gradient descent algorithms and have shown asymptotic convergence at a linear rate under various convexity assumptions. We have demonstrated empirical improvements compared to the status quo for several practical problems that do not necessarily adhere to the assumptions required by the theory. We have provided analysis that puts the ideas of the SVRG algorithm [50] in a stochastic subspace framework, making it a feasible alternative to deterministic gradient-based methods for problems that cannot be represented with a finite-sum structure. This analysis opens the door to adapting other stochastic iterative methods designed for an ERM framework into a more general form. We also showed that any stochastic algorithm that converges in expectation at a linear rate also converges almost surely.
In the future we plan to adapt other first-order stochastic optimization techniques such as stochastic quasi-Newton [62] to the stochastic subspace descent framework. The Haar-distributed random matrices may not be the optimal choice for subspace projections, and some clever adaptive scheme could be beneficial for determining the descent directions; these ideas have been discussed at length in the coordinate descent literature [78, 67], and similar techniques may be available within the SSD framework. We would like to explore the feasibility of implementing our methods in parallel, as has been discussed in the coordinate descent case in [73]. A computationally straightforward extension may allow sketching methods (see, e.g., [75]) to improve our results with minimal programmatic overhead, but analysis must be conducted to confirm the theoretical properties of such modifications. The analysis we have performed thus far does not explain some phenomena we have observed in our experiments; in particular, it appears as though the intrinsic dimension rather than the ambient dimension dictates the convergence rate of our methods, and that VRSSD is resistant to getting trapped in the shallow basins common to ill-conditioned nonconvex problems.
Finally, recent work on a universal “catalyst” scheme [54] also applies to our method, allowing for Nesterov-style acceleration without requiring additional knowledge of the Lipschitz constants along any particular direction.
A Preliminary results
We summarize results from optimization and probability used in the proofs of convergence.
Lemma A.1 (A property of Lipschitz-continuous first derivatives).
Let a continuously differentiable function with Lipschitz first derivative be given. Then for any two points,
The proof can be found in [68, pg. 2526].
Lemma A.2 (Strong convexity implies the PL inequality).
Let a strongly convex, continuously differentiable function be given, with unique minimizer. Then:
(A.1) 
Inequality (LABEL:PLineq) is the PL inequality. The proof is simple and can be found in the appendix of [42].
Lemma A.3 (A characterization of convex functions).
A function is convex with Lipschitz derivative if and only if for all
The proof can be found in [68, pg. 56]. If the function is strongly convex then the bound is sharpened according to the following theorem, the proof of which can be found in [68, pg. 62].
Theorem A.4 (Strongly convex functions).
If a function is strongly convex with Lipschitz derivative then for all
We now recall some basic definitions and results regarding convergence of random variables. A sequence of random vectors is said to converge almost surely (a.s.) to a random vector if the set of outcomes on which convergence fails has probability zero. It is said to converge in probability if, for every positive tolerance, the probability of a deviation larger than that tolerance vanishes asymptotically. It converges in Lp, for a positive integer p, if the p-th moment of the deviation converges to zero. We denote these modes of convergence in the standard way.
It is readily apparent that a sequence converges almost surely if and only if, for every positive tolerance, the probability that any subsequent iterate deviates from the limit by more than that tolerance vanishes asymptotically [31]. It follows from this characterization that almost-sure convergence implies convergence in probability. By subadditivity,
(A.2)
Finally, by Markov's inequality, convergence in Lp implies convergence in probability.
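The Markov step can be written explicitly: for a tolerance $\epsilon > 0$ and integer $p \geq 1$,

```latex
\[
  \mathbb{P}\bigl(\lVert X_n - X\rVert > \epsilon\bigr)
  \;\le\; \frac{\mathbb{E}\,\lVert X_n - X\rVert^{p}}{\epsilon^{p}}
  \;\longrightarrow\; 0
  \quad \text{as } n \to \infty,
\]
```

so convergence in $L^p$ drives the right-hand side, and hence the left-hand side, to zero, giving convergence in probability.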
B Proofs of main results
Proof of Theorem LABEL:thm:convergence
Define the filtration. Because the objective is continuously differentiable with a Lipschitz derivative, Lemma LABEL:Lipschitz yields
(B.1) 
Let be the error for a particular . Then, equations (LABEL:eq:_iterations) and (LABEL:Taylor) yield:
where we have used the defining property of the random projection matrix. We can choose the stepsize such that the right-hand side is nonpositive, so the errors are nonincreasing. Since the error is bounded below by zero, the sequence converges almost surely, but the actual limit remains to be determined. Furthermore, since the sequence is bounded above, Lebesgue's dominated convergence theorem implies convergence of the sequence in expectation. To find the limit we take conditional expectations of both sides to get
Since the iterate is measurable with respect to the filtration, the gradients can be pulled out of the conditional expectation, and the expectation can be passed through by linearity. This leaves us with
(B.2) 
The first term on the right-hand side in equation (LABEL:intermediate) is negative, so the PL inequality leads to:
Recursive application yields
Taking an expectation of both sides yields
Any admissible stepsize forces the right-hand side to converge to zero. Thus, since the expected error is bounded by a linearly convergent sequence, we have both convergence in expectation and almost-sure convergence.
Proof of Corollary LABEL:corr:strongconvexity part (i)
By Lemma LABEL:PLLemma, the strong convexity assumed by this corollary implies the assumptions of Theorem LABEL:thm:convergence, so that the function values converge. Now, by strong convexity,
Since the left-hand side converges almost surely to zero, we have almost-sure convergence of the iterates to the minimizer.
Proof of Corollary LABEL:corr:strongconvexity part (ii)
We begin by providing an upper bound for . Rearranging the terms in equation (LABEL:intermediate) we have
(B.3) 
Combining (LABEL:UpperBound) with (LABEL:PLineq) we get
That is,
(B.4) 
Repeated application of this recursion then yields