Gradientless Descent: HighDimensional ZerothOrder Optimization
Abstract
Zerothorder optimization is the process of minimizing an objective , given oracle access to evaluations at adaptively chosen inputs . In this paper, we present two simple yet powerful GradientLess Descent (GLD) algorithms that do not rely on an underlying gradient estimate and are numerically stable. We analyze our algorithm from a novel geometric perspective and present a novel analysis that shows convergence within an ball of the optimum in evaluations, for any monotone transform of a smooth and strongly convex objective with latent dimension , where the input dimension is , is the diameter of the input space and is the condition number. Our rates are the first of its kind to be both 1) polylogarithmically dependent on dimensionality and 2) invariant under monotone transformations. We further leverage our geometric perspective to show that our analysis is optimal. Both monotone invariance and its ability to utilize a low latent dimensionality are key to the empirical success of our algorithms, as demonstrated on BBOB and MuJoCo benchmarks.
1 Introduction
We consider the problem of zerothorder optimization (also known as gradientfree optimization, or bandit optimization), where our goal is to minimize an objective function with as few evaluations of as possible. For many practical and interesting objective functions, gradients are difficult to compute and there is still a need for zerothorder optimization in applications such as reinforcement learning [mania2018simple, salimans2017evolution, choromanski2018structured], attacking neural networks [chen2017zoo, papernot2017practical], hyperparameter tuning of deep networks [snoek2012practical], and network control [liu2017zeroth].
The standard approach to zerothorder optimization is, ironically, to estimate the gradients from function values and apply a firstorder optimization algorithm [flaxman2005online]. [nesterov2011random] analyze this class of algorithms as gradient descent on a Gaussian smoothing of the objective and gives an accelerated iteration complexity for an Lipschitz convex function with condition number and and . They propose a twopoint evaluation scheme that constructs gradient estimates from the difference between function values at two points that are close to each other. This scheme was extended by [duchi2015optimal] for stochastic settings, by [ghadimi2013stochastic] for nonconvex settings, and by [shamir2017optimal] for nonsmooth and nonEuclidean norm settings. Since then, firstorder techniques such as variance reduction [liu2018zeroth], conditional gradients [balasubramanian2018zeroth], and diagonal preconditioning [mania2018simple] have been successfully adopted in this setting. This class of algorithms are also known as stochastic search, random search, or (natural) evolutionary strategies and have been augmented with a variety of heuristics, such as the popular CMAES [auger2005restart].
These algorithms, however, suffer from high variance due to nonrobust local minima or highly nonsmooth objectives, which are common in the fields of deep learning and reinforcement learning. [mania2018simple] notes that gradient variance increases as training progresses due to higher variance in the objective functions, since often parameters must be tuned precisely to achieve reasonable models. Therefore, some attention has shifted into direct search algorithms that usually finds a descent direction and moves to , where the step size is not scaled by the function difference.
The first approaches for direct search were based on deterministic approaches with a positive spanning set and date back to the 1950s [brooks1958discussion]. Only recently have theoretical bounds surfaced, with [gratton2015direct] giving an iteration complexity that is a large polynomial of and [dodangeh2016worst] giving an improved . Stochastic approaches tend to have better complexities: [stich2013optimization] uses line search to give a iteration complexity for convex functions with condition number and most recently, [gorbunov2019stochastic] uses importance sampling to give a complexity for convex functions with average condition number , assuming access to sampling probabilities. [stich2013optimization] notes that direct search algorithms are invariant under monotone transforms of the objective, a property that might explain their robustness in highvariance settings.
In general, zeroth order optimization suffers an at least linear dependence on input dimension and recent works have tried to address this limitation when is large but admits a lowdimensional structure. Some papers assume that depends only on coordinates and [wang2017stochastic] applies Lasso to find the important set of coordinates, whereas [balasubramanian2018zeroth] simply change the step size to achieve an iteration complexity. Other papers assume more generally that only depends on a dimensional subspace given by the range of and [djolonga2013high] apply lowrank approximation to find the lowdimensional subspace while [wang2013bayesian] use random embeddings. [hazan2017hyperparameter] assume that is a sparse collection of degree monomials on the Boolean hypercube and apply sparse recovery to achieve a runtime bound. We will show that under the case that , our algorithm will inherently pick up any lowdimensional structure in and achieve a convergence rate that depends on . This initial convergence rate survives, even if we perturb , so long as is sufficiently small.
We will not cover the whole variety of blackbox optimization methods, such as Bayesian optimization or genetic algorithms. In general, these methods attempt to solve a broader problem (e.g. multiple optima), have weaker theoretical guarantees and may require substantial computation at each step: e.g. Bayesian optimization generally has theoretical iteration complexities that grow exponentially in dimension,
and CMAES lacks provable complexity bounds beyond convex quadratic functions. In addition to the slow runtime and weaker guarantees, Bayesian optimization assumes the success of an inner optimization loop of the acquisition function. This inner optimization is often implemented with many iterations of a simpler zerothorder methods, justifying the need to understand gradientless descent algorithms within its own context.
1.1 Our contributions
In this paper, we present GradientLess Descent (GLD), a class of truly gradientfree algorithms (also known as direct search algorithms) that are parameter free and provably fast. Our algorithms are based on a simple intuition: for wellconditioned functions, if we start from a point and take a small step in a randomly chosen direction, there is a significant probability that we will reduce the objective function value. We present a novel analysis that relies on facts in high dimensional geometry and can thus be viewed as a geometric analysis of gradientfree algorithms, recovering the standard convergence rates and step sizes. Specifically, we show that if the step size is on the order of , we can guarantee an expected decrease of in the optimality gap, based on geometric properties of the sublevel sets of a smooth and strongly convex function.
Our results are invariant under monotone transformations of the objective function, thus our convergence results also hold for a large class of nonconvex functions that are a subclass of quasiconvex functions.
Specifically, note that monotone transformations of convex functions are not necessarily
convex. However, a monotone transformation of a convex function is always quasiconvex. The maximization of quasiconcave utility functions, which is equivalent to the minimization of quasiconvex functions, is an important topic of study in economics (e.g. [arrow1961quasi]).
Intuition suggests that the stepsize dependence on dimensionality can be improved when admits a lowdimensional structure. With a careful choice of sampling distribution we can show that if , where is a rank matrix, then our step size can be on the order of as our optimization behavior is preserved under projections. We call this property affineinvariance and show that the number of function evaluations needed for convergence depends logarithmically on . Unlike most previous algorithms in the highdimensional setting, no expensive sparse recovery or subspace finding methods are needed. Furthermore, by novel perturbation arguments, we show that our fast convergence rates are robust and holds even under the more realistic assumption when with being sufficiently small.
Theorem 1 (Convergence of GLD: Informal Restatement of Theorem 7 and Theorem 14).
Let be any monotone transform of a convex function with condition number and . Let be a sample from an appropriate distribution centered at . Then, with constant probability,
Therefore, we can find such that after function evaluations. Furthermore, for functions with rank matrix and sufficiently small , we only require evaluations.
Another advantage of our nonstandard geometric analysis is that it allows us to deduce that our rates are optimal with a matching lower bound (up to logarithmic factors), presenting theoretical evidence that gradientfree inherently requires function evaluations to converge. While gradientestimation algorithms can achieve a better theoretical iteration complexity of , they lack the monotone and affine invariance properties. Empirically, we see that invariance properties are important to successful optimization, as validated by experiments on synthetic BBOB and MuJoCo benchmarks that show the competitiveness of GLD against standard optimization procedures.
Algorithm  Iteration Complexity  Monotone  kSparse  kAffine 

[nesterov2011random]  No  No  No  
[balasubramanian2018zeroth]  No  Yes  No  
[stich2013optimization]  Yes  No  No  
[gorbunov2019stochastic]  Yes  No  No  
This paper (GLD)  Yes  Yes  Yes 
2 Preliminaries
We first define a few notations for the rest of the paper. Let be a compact subset of and let denote the Euclidean norm. The diameter of , denoted , is the maximum distance between elements in . Let be a realvalued function which attains its minimum at . We use to denote the image of on a subset of , and to denote the ball of radius centered at .
Definition 2.
The level set of at point is . The sublevel set of at point is . When the function is clear from the context, we omit it.
Definition 3.
We say that is strongly convex for if for all and smooth for if for all .
Definition 4.
We say that is a monotone transformation of if is a monotonically (and strictly) increasing function.
Monotone transformations preserve the level sets of a function in the sense that . Because our algorithms depend only on the level set properties, our results generalize to any monotone transformation of a strongly convex and strongly smooth function. This leads to our extended notion of condition number.
Definition 5.
A function has condition number if it is the minimum ratio over all functions such that is a monotone transformation of and is strongly convex and smooth.
When we work with low rank extensions of , we only care about the condition number of within a rank subspace. Indeed, if only varies along a rank subspace, then it has a strong convexity value of , making its condition number undefined. If is strongly convex and smooth, then its Hessian matrix always has eigenvalues bounded between and . Therefore, we need a notion of a projected condition number. Let be some orthonormal matrix and let be the projection matrix onto the column space of .
Definition 6.
For some orthonormal with , a function has condition number restricted to , , if it is the minimum ratio over all functions such that is a monotone transformation of and is strongly convex and smooth.
3 Analysis of Descent Steps
The GLD template can be summarized as follows: given a sampling distribution , we start at and in iteration , we choose a scalar radii and we sample from a distribution centered around , where provides the scaling of . Then, if , we update ; otherwise, we set . The analysis of GLD follows from the main observation that the sublevel set of a monotone transformation of a strongly convex and strongly smooth function contains a ball of sufficiently large radius tangent to the level set (Lemma 15). In this section, we show that this property, combined with facts of highdimensional geometry, implies that moving in a random direction from any point has a good chance of significantly improving the objective.
As we mentioned before, the key to fast convergence is the careful choice of step sizes, which we describe in Theorem 7. The intuition here is that we would like to take as large steps as possible while keeping the probability
of improving the objective function reasonably high, so by insights in highdimensional geometry, we choose a step size of . Also, we show that if admits a latent rank structure, then this step size can be increased to and is therefore only dependent on the latent dimensionality of , allowing for fast highdimensional optimization. Lastly, our geometric understanding allows us to show that our convergence rates are optimal with a matching lower bound. Without loss of generality, this section assumes that is strongly convex and smooth with condition number .
3.1 Step Size
Theorem 7.
For any such that , we can find integers such that if or , then a random sample from uniform distribution over satisfies
with probability at least .
Proving the above theorem requires the following lemma about the intersection of balls in high dimensions and it is proved in the appendix.
Lemma 8.
Let and be two balls in of radii and respectively. Let be the distance between the centers. If and , then
where is a dimensiondependent constant that is lower bounded by at .
3.2 Gaussian Sampling and Low Rank Structure
A direct application of Lemma 8 seems to imply that uniform sampling of a highdimensional ball is necessary. Upon further inspection, this can be easily replaced with a much simpler Gaussian sampling procedure that concentrates the mass close to the surface to the ball. This procedure lends itself to better analysis when admits a latent lowdimensional structure since any affine projection of a Gaussian is still Gaussian.
Lemma 9.
Let and be two balls in of radii and respectively. Let be the distance between the centers. If and and are independent Gaussians with mean centered at the center of and variance , then
where is a dimensionindependent constant.
Assume that there exists some rank projection matrix such that , where is much smaller than . Because Gaussians projected on a dimensional subspace are still Gaussians, we show that our algorithm has a dimension dependence on . We let be the condition number of restricted to the subspace that drives the dominant changes in .
Theorem 10.
Let for some unknown rank matrix with and suppose for some numbers . Then, there exist integers such that if or , then a random sample from a Gaussian distribution satisfies
with constant probability.
Note that the speedup in progress is due to the fact that we can now tolerate the larger sampling radius of , while maintaining a high probability of making progress. If is unknown, we can simply use binary search to find the correct radius with an extra factor of in our runtime.
The lowrank assumption is too restrictive to be realistic; however, our fast rates still hold, at least for the early stages of the optimization, even if we assume that and is a fullrank function that is bounded by . In this setting, we can show that convergence remains fast, at least until the optimality gap approaches .
Theorem 11.
Let for some unknown rank matrix with where are convex and . Suppose for some numbers where minimizes . Then, there exist integers such that if or , then a random sample from a Gaussian distribution satisfies
with constant probability whenever .
3.3 Lower Bounds
We show that our upper bounds given in the previous section are tight up to logarithmic factors for any symmetric sampling distribution . These lower bounds are easily derived from our geometric perspective as we show that a sampling distribution with a large radius gives an extremely low probability of intersection with the desired sublevel set. Therefore, while gradientapproximation algorithms can be accelerated to achieve a runtime that depends on the squareroot of the condition number , gradientless methods that rely on random sampling are likely unable to be accelerated according to our lower bound. However, we emphasize that monotone invariance allows these results to apply to a broader class of objective functions, beyond smooth and convex, so the results can be useful in practice despite the seemingly worse theoretical bounds.
Theorem 12.
Let , where is a random sample from for some radius and is standard Gaussian or any rotationally symmetric distribution. Then, there exist a region with positive measure such that for any ,
with probability at least .
4 Gradientless Algorithms
In this section, we present two algorithms that follow the same Gradientless Descent (GLD) template: GLDSearch and GLDFast, with the latter being an optimized version of the former when an upper bound on the condition number of a function is known. For both algorithms, since they are monotoneinvariant, we appeal to the previous section to derive fast convergence rates for any monotone transform of convex with good condition number. We show the efficacy of both algorithms experimentally in the Experiments section.
4.1 Gradientless Descent with Binary Search
Although the sampling distribution is fixed, we have a choice of radii for each iteration of the algorithm. We can apply a binary search procedure to ensure progress. The most straightforward version of our algorithm is thus with a naive binary sweep across an interval in that is unchanged throughout the algorithm. This allows us to give convergence guarantees without previous knowledge of the condition number at a cost of an extra factor of .
Theorem 13.
Let be any starting point and a blackbox function with condition number . Running Algorithm 1 with and as a standard Gaussian returns a point such that after function evaluations with high probability.
Furthermore, if admits a lowrank structure with a rank matrix, then we only require function evaluations to guarantee . This holds analogously even if is almost lowrank where and .
4.2 Gradientless Descent with Fast Binary Search
GLDSearch (Algorithm 1) uses a naive lower and upper bound for the search radius , which incurs an extra factor of in the runtime bound. In GLDFast, we remove this extra factor dependence on by drastically reducing the range of the binary search. This is done by exploiting the assumption that has a good condition number upper bound and by slowly halfing the diameter of the search space every few iterations since we expect as .
Theorem 14.
Let be any starting point and a blackbox function with condition number upper bounded by . Running Algorithm 2 with suitable parameters returns a point such that after function evaluations with high probability.
Furthermore, if admits a lowrank structure with a rank matrix, then we only require function evaluations to guarantee . This holds analogously even if is almost lowrank where and .
5 Experiments
We tested GLD algorithms on a simple class of objective functions and compare it to Accelerated Random Search (ARS) by [nesterov2011random], which has linear convergence guarantees on strongly convex and strongly smooth functions. To our knowledge, ARS makes the weakest assumption among the zerothorder algorithms that have linear convergence guarantees and perform only a constant order of operations per iteration. Our main conclusion is that GLDFast is comparable to ARS and tends to achieve a reasonably low error much faster than ARS in high dimensions (). In low dimensions, GLDSearch is competitive with GLDFast and ARS though it requires no information about the function.
We let be a diagonal matrix with its th diagonal equal to . In simple words, its diagonal elements form an evenly space sequence of numbers from to . Our objective function is then as , which is strongly convex and strongly smooth. We always use the same starting point , which requires for our algorithms. We plot the optimality gap against the number of function evaluations, where is the best point observed so far after evaluations. Although all tested algorithms are stochastic, they have a low variance on the objective functions that we use; hence we average the results over 10 runs and omit the error bars in the plots.
We ran experiments on with imperfect curvature information and (see Figure 3 in appendix). GLDSearch is independent of the condition number. GLDFast takes only one parameter, which is the upper bound on the condition number; if approximation factor is , then we pass as the upper bound.
ARS requires both strong convexity and smoothness parameters. We test three different distributions of the approximation error; when the approximation factor is , then ARSalpha gets , ARSbeta gets (), and ARSeven gets as input. GLDFast is more robust and faster than ARS when the condition number is overapproximated. When the condition number is underestimated, GLDFast still steadily converges.
5.1 Monotone Transformations
In Figure 1, we ran experiments on for different settings of dimensionality , and its monotone transformation with . For this experiment, we assume a perfect oracle for the strong convexity and smoothness parameters of . The convergence of GLD is totally unaffected by the monotone transformation. For the lowdimension cases of a transformed function (bottom half of the figure), we note that there are inflection points in the convergence curve of ARS. This means that ARS initially struggles to gain momentum and then struggles to stop the momentum when it gets close to the optimum. Another observation is that unlike ARS that needs to build up momentum, GLDFast starts from a large radius and therefore achieves a reasonably low error much faster than ARS, especially in higher dimensions.
5.2 BBOB Benchmarks
To show that practicality of GLD on practical and nonconvex settings, we also test GLD algorithms on a variety of BlackBox Optimization Benchmarking (BBOB) functions [hansen2009bbob]. For each function, the optima is known and we use the log optimality gap as a measure of competance. Because each function can exhibit varying forms of nonsmoothness and convexity, all algorithms are ran with a smoothness constant of 10 and a strong convexity constant of 0.1. All other setup details are same as before, such as using a fixed starting point.
The plots, given in Appendix C, underscore the superior performance of GLD algorithms on various BBOB functions, demonstrating that GLD can successfully optimize a diverse set of functions even without explicit knowledge of condition number. We note that BBOB functions are far from convex and smooth, many exhibiting high conditioning, multimodal valleys, and weak global structure. Due to our radius search produce, our algorithm appears more robust to nonideal settings with nonconvexity and ill conditioning. As expected, we note that GLDFast tend to outperform GLDSearch, especially as the dimension increases, matching our theoretical understanding of GLD.
5.3 Mujoco Control Benchmarks and Affine Transformations
We also ran experiments on the Mujoco benchmarks with varying architectures, both linear and nonlinear. This demonstrates the viability of our approach even in the nonconvex, high dimensional setting. We note that however, unlike e.g. ES which uses all queries to form a gradient direction, our algorithm removes queries which produce less reward than using the current argmax, which can be an information handicap. Nevertheless, we see that our algorithm still achieves competitive performance on the maximum reward. We used a horizon of for all experiments.
Env.  Arch  Rew. at (, , Max) Queries  Rew. Thresh. 
HalfCheetahv1  L  3799, 3903, 4064  3430 
HalfCheetahv1, Proj200  L  1594 , 3509 , 4342  3430 
HalfCheetahv1  H41  2741, 3074, 3392  3430 
Hopperv1  L  1017, 3359, 3375  3120 
Hopperv1  H41  2708, 3370, 3566  3120 
Reacherv1  L  70, 5, 4  10 
Reacherv1  H41  231, 17, 15  10 
Swimmerv1  L  365, 369 , 369  325 
Swimmerv1  H41  353, 369, 369  325 
Walker2dv1  L  1027 , 2201, 2201  4390 
Walker2dv1  H41  1630, 1963, 2146  4390 

We further tested the affine invariance of GLD on the policy parameters from using Gaussian ball sampling, under the HalfCheetah benchmark by projecting the state of the MDP with linear policy to a higher dimensional state , using a matrix multiplication with an orthonormal . Specifically, in this setting, for a linear policy parametrized by matrix , the objective function is thus where . Note that when projecting into a high dimension, there is a slowdown factor of where are the new high dimension and previous base dimension, respectively, due to the binary search in our algorithm on a higher dimensional space. For our HalfCheetah case, we projected the 17 base dimension to a 200length dimension, which suggests that the slowdown factor is a factor . This can be shown in our plots in the appendix (Figure 15).
6 Conclusion
We introduced GLD, a robust zerothorder optimization algorithm that is simple, efficient, and we show strong theoretical convergence bounds via our novel geometric analysis. As demonstrated by our experiments on BBOB and MuJoCo benchmarks, GLD performs very robustly even in the nonconvex setting and its monotone and affine invariance properties give theoretical insight on its practical efficiency.
GLD is very flexible and allows easy modifications. For example, it could use momentum terms to keep moving in the same direction that improved the objective, or sample from adaptively chosen ellipsoids similarly to adaptive gradient methods. [duchi2011adaptive, mcmahanS10]. Just as one may decay or adaptively vary learning rates for gradient descent, one might use a similar change the distribution from which the ballsampling radii are chosen, perhaps shrinking the minimum radius as the algorithm progresses, or concentrating more probability mass on smaller radii.
Likewise, GLD could be combined with random restarts or other restart
policies developed for gradient descent. Analogously to adaptive
per–coordinate learning rates [duchi2011adaptive, mcmahanS10], one could
adaptively change the shape of the balls being sampled into ellipsoids with
various lengthscale factors. Arbitrary combinations of the above variants are also possible.
References
Appendix A Proofs of Section 3
Lemma 15.
If has condition number , then for all , there is a ball of radius that is tangent at and inside the sublevel set .
Proof.
Write such that is strongly convex and smooth for some and is monotonically increasing. From the smoothness assumption, we have for any ,
Consider the ball . For any , the above inequality implies . Hence, when we apply on both sides, we still have for all . Therefore, .
By strong convexity, . It follows that the radius of is at least . ∎
Proof of Lemma 8.
Without loss of generality, consider the unit distance case where . Furthermore, it suffices to prove for the smallest possible radius .
Since , the intersection is composed of two hyperspherical caps glued end to end. We lower bound by the volume of the cap of that is contained in the intersection. Consider the triangle with sides and . From classic geometry, the height of is
(1) 
The volume of a spherical cap is [li2011concise],
where is the regularized incomplete beta function defined as
where and . Note that for any fixed and , is increasing in . Hence, in order to obtain a lower bound on , we want to lower bound or equivalently, upper bound
Write for some . From Eq. (1),
Hence,
Since is convex in , It follows that So, To complete the proof, note that is increasing in , and . As goes to infinity, this value converges to as . ∎
Proof of Lemma 7.
Let . Let . Let be a ball that has on its surface, lies inside , and has radius . Lemma 15 guarantees its existence.
Suppose that
(2) 
and that a random sample from belongs to , which happens with probability at least . Then, our guarantee follows by
where the first line follows from Lemma 15 and second line from convexity of .
Therefore, it now suffices to prove Eq. 2. To do so, we will apply Lemma 8 after showing that the radius of and are in the proper ranges. Let and note that
(3)  
(4)  
Since is outside of , we also have
(5) 
It follows that
In the space, our choice of is equivalent to starting from and sweeping through the range at the interval of size 1. This is guaranteed to find a point between and , which is also an interval of size 1. Therefore, there exists a satisfying the theorem statement, and similarly, we can prove the existence of .
Finally, it remains to show that
From Eq. (3), it suffices to show that or equivalently . From Eq. (4),
For any , . So,
(6) 
and the proof is complete. ∎
Proof of Lemma 9.
Without loss of generality, let and is centered at the origin with radius and is centered at . Then, we simply want to show that
By Markov’s inequality, we see that with probability at most . And since is independent and , it suffices to show that
Since has standard deviation at least , we see that the probability of deviating at least a few standard deviation below is at least a constant. ∎
Proof of Theorem 10.
Proof of Theorem 11.
By the boundedness of , since , we see that . By Lemma 9, we see that if we sample from a Gaussian distribution , then if is the minimum of restricted to the column space of , then
with constant probability. By boundedness on , we know that . Furthermore, this also implies that . Therefore, we know that the decrease is at least
Since , we conclude that
and our proof is complete. ∎
Proof of Theorem 12.
Our main proof strategy is to show that progress can only be made with a radius size of ; larger radii cannot find descent directions with high probability. Consider a simple ellipsoid function , where is a diagonal matrix and , where WLOG we let and for . The optima is with .
Consider the region . Then, if we let be a standard Gaussian vector, then for some radius , we see that the probability of finding a descent direction is:
By standard concentration bounds for subexponential variables, we have
Therefore, with exponentially high probability, . Also, since , Chernoff bounds give:
Therefore, with probability at least , .
If , then we have
We conclude that the probability of descent is upper bounded by . This probability is exactly , where is the cumulative density of a standard normal and . By a naive upper bound, we see that
Since , we conclude that Therefore, we conclude that with probability at least , we have .
Otherwise, we are in the case that . Arguing simiarly as before, with high probability, our objective function and each coordinate can change by at most .
Next, we extend our proof to any symmetric distribution . Since is rotationally symmetric, if we parametrize is polarcoordinates, then the p.d.f. of any scaling of must take the form , where induces the uniform distribution over the unit sphere. Therefore, if is a random variable that follows , then we may write , where is a random scalar with p.d.f and is a standard Gaussian vector and are independent.
As previously argued, with exponentially high probability. Therefore, if