Zeroth-Order Stochastic Variance Reduction for Nonconvex Optimization
Abstract
As application demands for zeroth-order (gradient-free) optimization accelerate, the need for variance reduced and faster converging approaches is also intensifying. This paper addresses these challenges by presenting: a) a comprehensive theoretical analysis of variance reduced zeroth-order (ZO) optimization, b) a novel variance reduced ZO algorithm, called ZO-SVRG, and c) an experimental evaluation of our approach in the context of two compelling applications, black-box chemical material classification and generation of adversarial examples from black-box deep neural network models. Our theoretical analysis uncovers an essential difficulty in the analysis of ZO-SVRG: the unbiased assumption on gradient estimates no longer holds. We prove that compared to its first-order counterpart, ZO-SVRG with a two-point random gradient estimator suffers an additional error of order $O(1/b)$, where $b$ is the mini-batch size. To mitigate this error, we propose two accelerated versions of ZO-SVRG utilizing variance reduced gradient estimators, which achieve the best rate known for ZO stochastic optimization (in terms of iterations). Our extensive experimental results show that our approaches outperform other state-of-the-art ZO algorithms, and strike a balance between the convergence rate and the function query complexity.
Sijia Liu (IBM Research, AI; sijia.liu@ibm.com), Bhavya Kailkhura (Lawrence Livermore National Laboratory; kailkhura1@llnl.gov), Pin-Yu Chen (IBM Research, AI; pinyu.chen@ibm.com), Paishun Ting (University of Michigan; paishun@umich.edu), Shiyu Chang (IBM Research, AI; shiyu.chang@ibm.com), Lisa Amini (IBM Research, AI; lisa.amini@us.ibm.com)
Preprint.
1 Introduction
Zeroth-order (gradient-free) optimization is increasingly embraced for solving machine learning problems where explicit expressions of the gradients are difficult or infeasible to obtain. Recent examples have shown zeroth-order (ZO) based generation of prediction-evasive, black-box adversarial attacks on deep neural networks (DNNs) to be as effective as state-of-the-art white-box attacks, despite leveraging only the inputs and outputs of the targeted DNN [1, 2, 3]. Additional classes of applications include network control and management with time-varying constraints and limited computation capacity [4, 5], and parameter inference of black-box systems [6, 7]. ZO algorithms achieve gradient-free optimization by approximating the full gradient via gradient estimators based only on function values [8, 9].
Although many ZO algorithms have recently been developed and analyzed [5, 10, 11, 12, 13, 14, 15, 16, 17, 18], they often suffer from the high variance of ZO gradient estimates and, in turn, from hampered convergence rates. In addition, these algorithms are mainly designed for convex settings, which limits their applicability in a wide range of (nonconvex) machine learning problems.
In this paper, we study the design and analysis of variance reduced and faster converging nonconvex ZO optimization methods. To reduce the variance of ZO gradient estimates, one can draw motivation from similar ideas in the first-order regime. The stochastic variance reduced gradient (SVRG) is a commonly used, effective first-order approach to reduce the variance [19, 20, 21, 22, 23]. Due to the variance reduction, it improves the convergence rate of stochastic gradient descent (SGD) from $O(1/\sqrt{T})$ to $O(1/T)$, where $T$ is the total number of iterations. (Footnote 1: In big-$O$ notation, the constant numbers are ignored, and the dominant factors are kept.)
Although SVRG has shown great promise, applying similar ideas to ZO optimization is not a trivial task. The main challenge is that SVRG relies upon the assumption that a stochastic gradient is an unbiased estimate of the true batch/full gradient, which unfortunately does not hold in the ZO case. Therefore, it is an open question whether the ZO stochastic variance reduced gradient could enable faster convergence of ZO algorithms. In this paper, we attempt to fill the gap between ZO optimization and SVRG.
Contributions. We propose and evaluate a novel ZO algorithm for nonconvex stochastic optimization, ZO-SVRG, which integrates SVRG with ZO gradient estimators. We show that compared to SVRG, ZO-SVRG achieves a similar convergence rate that decays linearly with $O(1/T)$ but up to an additional error correction term of order $O(1/b)$, where $b$ is the mini-batch size. Without a careful treatment, this correction term (e.g., when $b$ is small) could be a critical factor affecting the optimization performance. To mitigate this error term, we propose two accelerated ZO-SVRG variants, utilizing reduced variance gradient estimators. These yield a faster convergence rate towards $O(d/T)$, the best known iteration complexity bound for ZO stochastic optimization (e.g., the ZO gradient descent method).
Our work offers a comprehensive study on how ZO gradient estimators affect SVRG on both iteration complexity (i.e., convergence rate) and function query complexity. Compared to the existing ZO algorithms, our methods can strike a balance between iteration complexity and function query complexity. To demonstrate the flexibility of our approach in managing this tradeoff, we conduct an empirical evaluation of our proposed algorithms and other state-of-the-art algorithms on two diverse applications: black-box chemical material classification and generation of universal adversarial perturbations from black-box deep neural network models. Extensive experimental results and theoretical analysis validate the effectiveness of our approaches.
2 Related work
In ZO algorithms, a full gradient is typically approximated using either a one-point or a two-point gradient estimator, where the former acquires a gradient estimate by querying $f(\cdot)$ at a single random location close to $\mathbf{x}$ [10, 11], and the latter computes a finite difference using two random function queries [12, 13]. In this paper, we focus on the two-point gradient estimator since it has a lower variance and thus improves the complexity bounds of ZO algorithms.
Despite the meteoric rise of two-point based ZO algorithms, most of the work is restricted to convex problems [5, 14, 15, 16, 17, 18]. For example, a ZO mirror descent algorithm proposed by [14] has an exact rate $O(\sqrt{d}/\sqrt{T})$, where $d$ is the number of optimization variables. The same rate is obtained by bandit convex optimization [15] and the ZO online alternating direction method of multipliers [5]. Current studies suggest that ZO algorithms typically agree with the iteration complexity of first-order algorithms up to a small-degree polynomial of the problem size $d$.
In contrast to the convex setting, nonconvex ZO algorithms are comparatively understudied except for a few recent attempts [7, 13, 24, 25, 26]. Different from convex optimization, the stationary condition is used to measure the convergence of nonconvex methods. In [13], the ZO gradient descent (ZO-GD) algorithm was proposed for deterministic nonconvex programming, which yields an $O(d/T)$ convergence rate. A stochastic version of ZO-GD (namely, ZO-SGD) studied in [24] achieves the rate of $O(\sqrt{d}/\sqrt{T})$. In [25], a ZO distributed algorithm was developed for multi-agent optimization, leading to an $O(1/T + 1/q)$ convergence rate. Here $q$ is the number of random directions used to construct a gradient estimate. In [7], an asynchronous ZO stochastic coordinate descent (ZO-SCD) was derived for parallel optimization and achieved the rate of $O(\sqrt{d}/\sqrt{T})$. In [26], a variant of ZO-SCD, known as ZO stochastic variance reduced coordinate (ZO-SVRC) descent, improved the convergence rate from $O(\sqrt{d}/\sqrt{T})$ to $O(d/T)$ under the same parameter setting for the gradient estimation. Although the authors in [26] considered the stochastic variance reduced technique, only a coordinate descent algorithm using a coordinate-wise (deterministic) gradient estimator was studied. This motivates our study on a more general framework, ZO-SVRG, under different gradient estimators.
3 Preliminaries
Consider a nonconvex finite-sum problem of the form
(1) $\underset{\mathbf{x} \in \mathbb{R}^d}{\text{minimize}} \quad f(\mathbf{x}) := \frac{1}{n} \sum_{i=1}^{n} f_i(\mathbf{x}),$
where $\{f_i(\mathbf{x})\}_{i=1}^{n}$ are $n$ individual nonconvex cost functions. The generic form (1) encompasses many machine learning problems, ranging from generalized linear models to neural networks. We next elaborate on assumptions of problem (1), and provide a background on ZO gradient estimators.
3.1 Assumptions
A1: Functions $\{f_i\}$ have $L$-Lipschitz continuous gradients ($L$-smooth), i.e., for any $\mathbf{x}$ and $\mathbf{y}$, $\|\nabla f_i(\mathbf{x}) - \nabla f_i(\mathbf{y})\|_2 \le L \|\mathbf{x} - \mathbf{y}\|_2$, $i \in [n]$, and some $L < \infty$. Here $\|\cdot\|_2$ denotes the Euclidean norm, and for ease of notation $[n]$ represents the integer set $\{1, 2, \ldots, n\}$.
A2: The variance of stochastic gradients is bounded as $\frac{1}{n}\sum_{i=1}^{n} \|\nabla f_i(\mathbf{x}) - \nabla f(\mathbf{x})\|_2^2 \le \sigma^2$. Here $\nabla f_i(\mathbf{x})$ can be viewed as a stochastic gradient of $\nabla f(\mathbf{x})$ by randomly picking an index $i \in [n]$.
Both A1 and A2 are the standard assumptions used in the nonconvex optimization literature [7, 13, 23, 24, 25, 26]. Note that A2 is milder than the assumption of bounded gradients [5, 25]. For example, if $\|\nabla f_i(\mathbf{x})\|_2 \le \tilde{\sigma}$, then A2 is satisfied with $\sigma = 2\tilde{\sigma}$.
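The last claim can be checked in two lines. The following worked step is a sketch, where $\tilde{\sigma}$ denotes the assumed uniform bound on $\|\nabla f_i(\mathbf{x})\|_2$ and $\nabla f = \frac{1}{n}\sum_{j} \nabla f_j$:

$$
\|\nabla f_i(\mathbf{x}) - \nabla f(\mathbf{x})\|_2 \;\le\; \|\nabla f_i(\mathbf{x})\|_2 + \frac{1}{n}\sum_{j=1}^{n} \|\nabla f_j(\mathbf{x})\|_2 \;\le\; 2\tilde{\sigma},
\qquad\text{so}\qquad
\frac{1}{n}\sum_{i=1}^{n} \|\nabla f_i(\mathbf{x}) - \nabla f(\mathbf{x})\|_2^2 \;\le\; (2\tilde{\sigma})^2 .
$$

The converse does not hold in general: A2 only constrains how far the component gradients spread around their mean, not how large they are.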
3.2 ZO gradient estimation
Given an individual cost function $f_i$ (or an arbitrary function under A1 and A2), a two-point random gradient estimator $\hat{\nabla} f_i(\mathbf{x})$ is defined by [13, 16]
(RandGradEst) $\hat{\nabla} f_i(\mathbf{x}) = \frac{d}{\mu} \left[ f_i(\mathbf{x} + \mu \mathbf{u}_i) - f_i(\mathbf{x}) \right] \mathbf{u}_i, \quad i \in [n],$
where recall that $d$ is the number of optimization variables, $\mu > 0$ is a smoothing parameter (Footnote 2: The parameter $\mu$ can be generalized to $\mu_i$ for $i \in [n]$. Here we assume $\mu_i = \mu$ for ease of representation.), and $\{\mathbf{u}_i\}$ are i.i.d. random directions drawn from a uniform distribution over a unit sphere [10, 15, 16]. In general, RandGradEst is a biased approximation to the true gradient $\nabla f_i(\mathbf{x})$, and its bias reduces as $\mu$ approaches zero. However, in a practical system, if $\mu$ is too small, then the function difference could be dominated by the system noise and fail to represent the function differential [7].
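As an illustration of the two-point estimator just described, the following is a minimal sketch in plain Python. The helper names (`unit_sphere_direction`, `rand_grad_est`) and the test function are hypothetical and chosen only for this example; the direction is drawn by normalizing a Gaussian vector, which is one standard way to sample uniformly from the unit sphere.

```python
import math
import random

def unit_sphere_direction(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d (normalized Gaussian)."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        if norm > 0.0:
            return [v / norm for v in g]

def rand_grad_est(f, x, mu, rng):
    """Two-point random gradient estimate (d / mu) * [f(x + mu*u) - f(x)] * u."""
    d = len(x)
    u = unit_sphere_direction(d, rng)
    x_shift = [xi + mu * ui for xi, ui in zip(x, u)]
    scale = (d / mu) * (f(x_shift) - f(x))
    return [scale * ui for ui in u]

# Illustration (assumed setup): for a linear function the smoothing bias
# vanishes, so the sample mean of many estimates approaches the gradient a.
a = [1.0, 2.0, 3.0]
f_lin = lambda x: sum(ai * xi for ai, xi in zip(a, x))
rng = random.Random(0)
num_samples = 20000
mean_est = [0.0, 0.0, 0.0]
for _ in range(num_samples):
    g = rand_grad_est(f_lin, [0.5, -1.0, 2.0], 1e-3, rng)
    mean_est = [m + gi / num_samples for m, gi in zip(mean_est, g)]
```

A single estimate always points along the sampled direction, so it is very noisy; only the average over many directions recovers the gradient, which is the variance issue the rest of the paper addresses.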
Remark 1
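Instead of using a single sample $\mathbf{u}_i$ in RandGradEst, the average of $q$ i.i.d. samples $\{\mathbf{u}_{i,j}\}_{j=1}^{q}$ can also be used for gradient estimation [5, 14, 25],
(AvgRandGradEst) $\hat{\nabla} f_i(\mathbf{x}) = \frac{d}{\mu q} \sum_{j=1}^{q} \left[ f_i(\mathbf{x} + \mu \mathbf{u}_{i,j}) - f_i(\mathbf{x}) \right] \mathbf{u}_{i,j}, \quad i \in [n],$
which we call an average random gradient estimator.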
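A minimal sketch of the averaged estimator follows, assuming the same normalized-Gaussian sampling of directions as a stand-in for the uniform distribution on the unit sphere. The function name `avg_rand_grad_est` and the example function are hypothetical.

```python
import math
import random

def avg_rand_grad_est(f, x, mu, q, rng):
    """Average of q two-point random gradient estimates at x."""
    d = len(x)
    fx = f(x)  # f(x) is shared by all q directions, so q + 1 queries in total
    est = [0.0] * d
    for _ in range(q):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g)) or 1.0  # guard against a zero draw
        u = [v / norm for v in g]
        x_shift = [xi + mu * ui for xi, ui in zip(x, u)]
        scale = (d / (mu * q)) * (f(x_shift) - fx)
        est = [e + scale * ui for e, ui in zip(est, u)]
    return est

# Illustration (assumed setup): a linear function with gradient a.
a = [1.0, -2.0]
f_lin = lambda x: sum(ai * xi for ai, xi in zip(a, x))
est_large_q = avg_rand_grad_est(f_lin, [0.0, 0.0], 1e-3, 20000, random.Random(1))
```

Increasing `q` trades more function queries per estimate for a lower-variance estimate, which is the tradeoff analyzed later for the ZO-SVRG variant built on this estimator.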
In addition to RandGradEst and AvgRandGradEst, the works [7, 26, 27] considered a coordinate-wise gradient estimator. Here every partial derivative is estimated via the two-point querying scheme under fixed direction vectors,
(CoordGradEst) $\hat{\nabla} f_i(\mathbf{x}) = \sum_{\ell=1}^{d} \frac{1}{2\mu_\ell} \left[ f_i(\mathbf{x} + \mu_\ell \mathbf{e}_\ell) - f_i(\mathbf{x} - \mu_\ell \mathbf{e}_\ell) \right] \mathbf{e}_\ell, \quad i \in [n],$
where $\mu_\ell > 0$ is a coordinate-wise smoothing parameter, and $\mathbf{e}_\ell \in \mathbb{R}^d$ is a standard basis vector with $1$ at its $\ell$th coordinate, and $0$s elsewhere. Compared to RandGradEst, CoordGradEst is deterministic and requires $d$ times more function queries. However, as will be evident later, it yields an improved iteration complexity (i.e., convergence rate). More details on ZO gradient estimation can be found in Appendix A.1.
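The coordinate-wise scheme can be sketched as below. This is an illustrative implementation only: the name `coord_grad_est` and the quadratic test function are assumptions made for the example, and the per-coordinate smoothing parameters are passed as a list.

```python
def coord_grad_est(f, x, mu):
    """Coordinate-wise two-point estimate: one central difference per coordinate.

    mu is a list of per-coordinate smoothing parameters; 2 * len(x) queries are used.
    """
    d = len(x)
    grad = [0.0] * d
    for l in range(d):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[l] += mu[l]
        x_minus[l] -= mu[l]
        grad[l] = (f(x_plus) - f(x_minus)) / (2.0 * mu[l])
    return grad

# Illustration (assumed setup): central differences are exact for quadratics,
# so the estimate matches the true gradient (2*x0, 3 + x2, x1) up to rounding.
f_quad = lambda x: x[0] ** 2 + 3.0 * x[1] + x[1] * x[2]
grad = coord_grad_est(f_quad, [1.0, 2.0, 3.0], [1e-3, 1e-3, 1e-3])
```

Unlike the random estimators, no sampling is involved, so repeated calls return the same value; the price is a query count that grows linearly with the dimension.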
4 ZO stochastic variance reduced gradient (ZO-SVRG)
4.1 SVRG: from first-order to zeroth-order
It has been shown in [19, 20] that the first-order SVRG achieves the convergence rate $O(1/T)$, requiring fewer iterations than the ordinary SGD for solving finite-sum problems. The key step of SVRG (Algorithm 1) is to generate an auxiliary sequence $\tilde{\mathbf{x}}$ at which the full gradient is used as a reference in building a modified stochastic gradient estimate
(2) $\hat{\mathbf{g}} = \nabla f_{\mathcal{I}}(\mathbf{x}) - \nabla f_{\mathcal{I}}(\tilde{\mathbf{x}}) + \nabla f(\tilde{\mathbf{x}}),$
where $\hat{\mathbf{g}}$ denotes the gradient estimate at $\mathbf{x}$, $\mathcal{I} \subseteq [n]$ is a mini-batch of size $b$ (chosen uniformly randomly with replacement), and $\nabla f_{\mathcal{I}}(\mathbf{x}) = \frac{1}{b} \sum_{i \in \mathcal{I}} \nabla f_i(\mathbf{x})$. The gradient blending (2) is also motivated by a variance reduction technique known as control variate [28, 29, 30]. The link between SVRG and control variate is discussed in Appendix A.2. (Footnote 3: Different from the standard SVRG [19], we consider its mini-batch variant in [20].)
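To make the role of unbiasedness concrete, the short derivation below checks that the first-order blending is an unbiased estimate of the full gradient. It is a sketch under the assumption that the mini-batch $\mathcal{I}$ is drawn uniformly from $[n]$, so that $\mathbb{E}_{\mathcal{I}}[\nabla f_{\mathcal{I}}(\mathbf{z})] = \nabla f(\mathbf{z})$ for any fixed point $\mathbf{z}$:

$$
\mathbb{E}_{\mathcal{I}}\!\left[ \nabla f_{\mathcal{I}}(\mathbf{x}) - \nabla f_{\mathcal{I}}(\tilde{\mathbf{x}}) + \nabla f(\tilde{\mathbf{x}}) \right]
= \nabla f(\mathbf{x}) - \nabla f(\tilde{\mathbf{x}}) + \nabla f(\tilde{\mathbf{x}})
= \nabla f(\mathbf{x}).
$$

The middle step uses only the unbiasedness of mini-batch gradients. When each gradient is replaced by a ZO estimate that is itself biased, this cancellation no longer yields $\nabla f(\mathbf{x})$ exactly, which is the source of the analytical difficulty discussed in this section.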
In the ZO setting, the gradient blending (2) is approximated using only function values,
(3) $\hat{\mathbf{g}} = \hat{\nabla} f_{\mathcal{I}}(\mathbf{x}) - \hat{\nabla} f_{\mathcal{I}}(\tilde{\mathbf{x}}) + \hat{\nabla} f(\tilde{\mathbf{x}}),$
where $\hat{\nabla} f_{\mathcal{I}}(\mathbf{x}) = \frac{1}{b} \sum_{i \in \mathcal{I}} \hat{\nabla} f_i(\mathbf{x})$, and $\hat{\nabla} f_i$ is a ZO gradient estimate specified by RandGradEst, AvgRandGradEst or CoordGradEst. Replacing (2) with (3) in SVRG (Algorithm 1) leads to a new ZO algorithm, which we call ZO-SVRG (Algorithm 2). We highlight that although ZO-SVRG differs from SVRG only in its use of ZO gradient estimators to estimate batch, mini-batch, as well as blended gradients, this seemingly minor difference yields an essential difficulty in the analysis of ZO-SVRG. That is, the unbiased assumption on gradient estimates used in SVRG no longer holds. Thus, a careful analysis of ZO-SVRG is much needed.
4.2 ZO-SVRG and convergence analysis
In what follows, we focus on the analysis of ZO-SVRG using RandGradEst. Later, we will study ZO-SVRG with AvgRandGradEst and CoordGradEst. We start by investigating the second-order moment of the blended ZO gradient estimate in the form of (3); see Proposition 1.
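Algorithm 2: ZO-SVRG
1: Input: In addition to parameters in SVRG, set smoothing parameter $\mu > 0$.
2: for $s = 1, 2, \ldots, S$ do
3:   compute ZO estimate $\hat{\mathbf{g}}_s = \hat{\nabla} f(\tilde{\mathbf{x}}_{s-1})$,
4:   set $\mathbf{x}_0^s = \tilde{\mathbf{x}}_{s-1}$,
5:   for $k = 0, 1, \ldots, m-1$ do
7:     update $\mathbf{x}_{k+1}^s = \mathbf{x}_k^s - \eta_k \hat{\mathbf{v}}_k^s$, where $\hat{\mathbf{v}}_k^s$ is the blended ZO gradient estimate (3) formed at $\mathbf{x}_k^s$ with reference point $\mathbf{x}_0^s$,
8:   end for
9:   set $\tilde{\mathbf{x}}_s = \mathbf{x}_m^s$,
10: end for
11: return $\bar{\mathbf{x}}$ chosen uniformly at random from $\{\{\mathbf{x}_k^s\}_{k=0}^{m-1}\}_{s=1}^{S}$.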
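The epoch structure of Algorithm 2 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`zo_svrg`, `batch_est`, `coord_grad_est`), the constant step size `eta`, and the toy objective are all hypothetical. The estimator is passed in as a callable; the coordinate-wise estimator is used in the example only because it makes the run deterministic enough to check by hand.

```python
import random

def coord_grad_est(f, x, mu=1e-3):
    """Coordinate-wise central-difference estimate (stand-in ZO estimator)."""
    grad = []
    for l in range(len(x)):
        xp, xm = list(x), list(x)
        xp[l] += mu
        xm[l] -= mu
        grad.append((f(xp) - f(xm)) / (2.0 * mu))
    return grad

def batch_est(fs, idx, x, est):
    """Average the ZO estimates of the selected component functions at x."""
    total = [0.0] * len(x)
    for i in idx:
        g = est(fs[i], x)
        total = [t + gi / len(idx) for t, gi in zip(total, g)]
    return total

def zo_svrg(fs, x0, est, num_epochs, epoch_len, batch_size, eta, rng):
    n = len(fs)
    x_tilde = list(x0)
    iterates = []
    for _ in range(num_epochs):
        g_ref = batch_est(fs, range(n), x_tilde, est)  # ZO estimate of the full gradient
        x_ref = list(x_tilde)                          # reference point of this epoch
        x = list(x_ref)
        for _ in range(epoch_len):
            iterates.append(list(x))
            batch = [rng.randrange(n) for _ in range(batch_size)]  # with replacement
            g_x = batch_est(fs, batch, x, est)
            g_0 = batch_est(fs, batch, x_ref, est)
            v = [a - b + c for a, b, c in zip(g_x, g_0, g_ref)]    # ZO gradient blending
            x = [xi - eta * vi for xi, vi in zip(x, v)]
        x_tilde = x
    x_bar = rng.choice(iterates)  # output rule: a uniformly random iterate
    return x_bar, x_tilde

# Toy problem (assumed): f_i(x) = 0.5 * ||x - c_i||^2, minimized at the mean of c_i.
centers = [[1.0, 2.0], [3.0, 0.0], [-1.0, 4.0], [1.0, -2.0]]
fs = [lambda x, c=c: 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
x_bar, x_last = zo_svrg(fs, [5.0, -5.0], coord_grad_est, 4, 5, 2, 0.5, random.Random(0))
```

On this toy problem the blended estimate equals the full gradient regardless of the mini-batch, so the last snapshot contracts toward the minimizer (1, 1) by a factor of two per step. The sketch returns both the random iterate prescribed by the output rule and the last snapshot, since the latter is what one would typically inspect in practice.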

Proposition 1
Suppose A2 holds and RandGradEst is used in Algorithm 2. The blended ZO gradient estimate in Step 7 of Algorithm 2 satisfies
(4) 
Proof: See Appendix A.3.
Compared to SVRG and its variants [20, 23], the error bound (4) involves a new error term $O(d\sigma^2/b)$, which is induced by the second-order moment of RandGradEst (Appendix A.1). With the aid of Proposition 1, Theorem 1 provides the convergence rate of ZO-SVRG in terms of an upper bound on $\mathbb{E}[\|\nabla f(\bar{\mathbf{x}})\|_2^2]$ at the solution $\bar{\mathbf{x}}$.
Theorem 1
Suppose A1 and A2 hold, and the random gradient estimator (RandGradEst) is used. The output of Algorithm 2 satisfies
(5) 
where , , , , and
(6)  
(7) 
In (6)(7), is a positive parameter ensuring , and the coefficients are given by
(8) 
Proof: See Appendix A.4.
Compared to the convergence rate of SVRG as given in [20, Theorem 2], Theorem 1 exhibits two additional error terms in (5) due to the use of ZO gradient estimates. Roughly speaking, if we choose the smoothing parameter $\mu$ reasonably small, then the first of these errors would reduce, leading to a non-dominant effect on the convergence rate of ZO-SVRG. The second error term is more involved, relying on the epoch length $m$, the step size $\eta_k$, the smoothing parameter $\mu$, the mini-batch size $b$, and the number of optimization variables $d$. In order to acquire explicit dependence on these parameters and to explore deeper insights of convergence, we simplify (5) for a specific parameter setting, as formalized below.
Corollary 1
Suppose we set
(9) 
, and , where is a universal constant that is independent of , , , and . Then Theorem 1 implies , , and , which yields
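(10) $\mathbb{E}\left[\|\nabla f(\bar{\mathbf{x}})\|_2^2\right] \le O\left(\frac{d}{T} + \frac{1}{b}\right).$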
Proof: See Appendix A.5.
It is worth mentioning that the condition on the value of the smoothing parameter $\mu$ in Corollary 1 is less restrictive than that of several ZO algorithms (one exception is ZO-SCD [7] and its variant ZO-SVRC [26]). For example, ZO-SGD in [24], ZO-ADMM [5] and ZO mirror descent [14] all impose tighter conditions on the smoothing parameter. Moreover, similar to [5], we set the step size to scale linearly with $1/d$. Compared to the aforementioned ZO algorithms [5, 14, 24], the convergence performance of ZO-SVRG in (10) has an improved (linear rather than sublinear) dependence on $1/T$. However, it suffers an additional error of order $O(1/b)$ inherited from the second of the two additional error terms in (5), which is also a consequence of the last error term in (4). A recent work [25, Theorem 1] also identified this side effect of RandGradEst in the context of ZO nonconvex multi-agent optimization using a method of multipliers. Therefore, a naive combination of RandGradEst and SVRG could make the algorithm converge to a neighborhood of a stationary point, where the size of the neighborhood is controlled by the mini-batch size $b$. Our work and reference [5] show that a large mini-batch indeed reduces the variance of RandGradEst and improves the convergence of ZO optimization methods. Although the tightness of the error bound (10) is not proven, we conjecture that the dependence on $T$ and $b$ could be optimal, since the form is consistent with SVRG, and the latter does not rely on the selected parameters in (9).
5 Acceleration of ZO-SVRG
In this section, we improve the iteration complexity of ZO-SVRG (Algorithm 2) by using AvgRandGradEst and CoordGradEst, respectively. We start by comparing the squared errors of different gradient estimates to the true gradient $\nabla f(\mathbf{x})$, as formalized in Proposition 2.
Proposition 2
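Consider a gradient estimator $\hat{\nabla} f(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \hat{\nabla} f_i(\mathbf{x})$, then the squared error $\mathbb{E}\left[\|\hat{\nabla} f(\mathbf{x}) - \nabla f(\mathbf{x})\|_2^2\right]$ satisfies
(11) $\mathbb{E}\left[\|\hat{\nabla} f(\mathbf{x}) - \nabla f(\mathbf{x})\|_2^2\right] \le \begin{cases} O(d)\,\|\nabla f(\mathbf{x})\|_2^2 + O(\mu^2 L^2 d^2) & \text{for RandGradEst}, \\ O\left(\frac{q+d}{q}\right) \|\nabla f(\mathbf{x})\|_2^2 + O(\mu^2 L^2 d^2) & \text{for AvgRandGradEst}, \\ O\left(L^2 d \sum_{\ell=1}^{d} \mu_\ell^2\right) & \text{for CoordGradEst}. \end{cases}$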
Proof: See Appendix A.6.
Proposition 2 shows that compared to CoordGradEst, both RandGradEst and AvgRandGradEst involve an additional error within a factor $O(d)$ and $O((q+d)/q)$ of $\|\nabla f(\mathbf{x})\|_2^2$, respectively. Such an error is introduced by the second-order moment of gradient estimators using random direction samples [13, 14], and it decreases as the number of direction samples $q$ increases. On the other hand, all gradient estimators have a common error bounded by $O(\mu^2 L^2 d^2)$, where we let $\mu_\ell = \mu$ for $\ell \in [d]$ in CoordGradEst. If $\mu$ is specified as in (9), then we obtain the error term $O(d/T)$, consistent with the convergence rate of ZO-SVRG in Corollary 1.
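As a worked illustration of the last step, suppose (as an assumption made here for concreteness) that the smoothing parameter is chosen as $\mu = 1/\sqrt{dT}$. Substituting into a common error of order $\mu^2 L^2 d^2$ gives

$$
\mu^2 L^2 d^2 = \frac{1}{dT} \cdot L^2 d^2 = \frac{L^2 d}{T} = O\!\left(\frac{d}{T}\right),
$$

treating $L$ as a constant. A smoothing parameter that shrinks like $1/\sqrt{dT}$ therefore keeps the smoothing-induced error at the same order as the leading term of the convergence rate.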
In Theorem 2, we show the effect of AvgRandGradEst on the convergence rate of ZO-SVRG.
Theorem 2
Suppose A1 and A2 hold, and AvgRandGradEst is used in Algorithm 2. Then is bounded same as given in (5), where the parameters , and for are modified by , , with . Given the setting in Corollary 1 and , the convergence rate simplifies to
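(12) $\mathbb{E}\left[\|\nabla f(\bar{\mathbf{x}})\|_2^2\right] \le O\left(\frac{d}{T} + \frac{1}{b \min\{d, q\}}\right).$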
Proof: See Appendix A.7
In contrast with Corollary 1, it can be seen from (12) that the use of AvgRandGradEst reduces the error $O(1/b)$ in (10) through multiple ($q$) direction samples, and that the convergence rate ceases to improve significantly once $q \ge d$. Our empirical results show that a moderate choice of $q$ can significantly speed up the convergence of ZO-SVRG.
We next study the effect of the coordinate-wise gradient estimator (CoordGradEst) on the convergence rate of ZO-SVRG, as formalized in Theorem 3.
Theorem 3
Suppose A1 and A2 hold, and CoordGradEst is used in Algorithm 2. Then
(13) 
where , , and have been defined in (5), the parameters , and for are given by , , with , and is a positive parameter ensuring . Given the specific setting in Corollary 1 and , the convergence rate simplifies to
(14) $\mathbb{E}\left[\|\nabla f(\bar{\mathbf{x}})\|_2^2\right] \le O\left(\frac{d}{T}\right).$
Proof: See Appendix A.8.
Theorem 3 shows that the use of CoordGradEst improves the iteration complexity, where the error of order $O(1/b)$ in Corollary 1 or $O(1/(b \min\{d, q\}))$ in Theorem 2 has been eliminated in (14). This improvement stems from the low variance of CoordGradEst shown by Proposition 2. We can also see this benefit by comparing the error coefficient in Theorem 3 with (7): the former no longer contains the term responsible for that error. The disadvantage of CoordGradEst is the need for $O(d)$ times more function queries than RandGradEst in gradient estimation.
Recall that RandGradEst, AvgRandGradEst and CoordGradEst require $O(1)$, $O(q)$ and $O(d)$ function queries, respectively. In ZO-SVRG (Algorithm 2), the total number of gradient evaluations is given by $nS + bT$, where $T = Sm$. Therefore, by fixing the number of iterations $T$, the function query complexity of ZO-SVRG using the studied estimators is then given by $O(nS + bT)$, $O(q(nS + bT))$ and $O(d(nS + bT))$, respectively. In Table 1, we summarize the convergence rates and the function query complexities of ZO-SVRG and its two variants, which we call ZO-SVRG-Ave and ZO-SVRG-Coord, respectively. For comparison, we also present the results of ZO-SGD [24] and ZO-SVRC [26], where the latter updates $b$ coordinates per iteration within an epoch. Table 1 shows that ZO-SGD has the lowest query complexity but has the worst convergence rate. ZO-SVRG-Coord yields the best convergence rate at the cost of high query complexity. By contrast, ZO-SVRG (with an appropriate mini-batch size) and ZO-SVRG-Ave could achieve better tradeoffs between the convergence rate and the query complexity.
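Table 1: Convergence rates and function query complexities of ZO-SVRG, its two variants, ZO-SGD and ZO-SVRC.
Method | Grad. estimator | Stepsize | Convergence rate | Query complexity
ZO-SVRG | RandGradEst | $O(1/d)$ | $O(d/T + 1/b)$ | $O(nS + bT)$
ZO-SVRG-Ave | AvgRandGradEst | $O(1/d)$ | $O(d/T + 1/(b \min\{d, q\}))$ | $O(qnS + qbT)$
ZO-SVRG-Coord | CoordGradEst | $O(1/d)$ | $O(d/T)$ | $O(dnS + dbT)$
ZO-SGD [24] | RandGradEst | | $O(\sqrt{d}/\sqrt{T})$ |
ZO-SVRC [26] | CoordGradEst | | $O(d/T)$ |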
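The query counts discussed above follow from simple bookkeeping, sketched below. The function name and the example sizes are hypothetical, and constants are dropped: each gradient estimate is charged an order-level cost of 1, q, or d queries for RandGradEst, AvgRandGradEst, and CoordGradEst, respectively.

```python
def zo_svrg_query_order(n, num_epochs, epoch_len, batch_size, queries_per_estimate):
    """Order-level function query count of ZO-SVRG (constant factors dropped)."""
    total_iters = num_epochs * epoch_len                     # T = S * m
    grad_evals = n * num_epochs + batch_size * total_iters   # n*S + b*T
    return queries_per_estimate * grad_evals

# Example sizes (assumed): n = 100 functions, S = 5 epochs of length m = 20,
# mini-batch b = 10, q = 10 random directions, d = 50 variables.
n, S, m, b, q, d = 100, 5, 20, 10, 10, 50
queries_rand = zo_svrg_query_order(n, S, m, b, 1)    # RandGradEst
queries_avg = zo_svrg_query_order(n, S, m, b, q)     # AvgRandGradEst
queries_coord = zo_svrg_query_order(n, S, m, b, d)   # CoordGradEst
```

With these sizes the three counts are 1500, 15000 and 75000, which makes the cost of the coordinate-wise estimator visible even for a modest dimension.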
6 Applications and experiments
We evaluate the performance of our proposed algorithms on two applications: black-box classification and generating adversarial examples from black-box DNNs. The first application is motivated by a real-world material science problem, where a material is classified as either a conductor or an insulator by a density functional theory (DFT) based black-box simulator [31]. The second application arises in testing the robustness of a deployed DNN via iterative model queries [1, 3].
Black-box binary classification
We consider a nonlinear least squares problem [32, Sec. 3.2], i.e., problem (1) with $f_i(\mathbf{x}) = (y_i - \phi(\mathbf{x}; \mathbf{a}_i))^2$ for $i \in [n]$. Here $(\mathbf{a}_i, y_i)$ is the $i$th data sample containing feature vector $\mathbf{a}_i$ and label $y_i$, and $\phi(\mathbf{x}; \mathbf{a}_i)$ is a black-box function that only returns the function value given an input. The dataset consists of crystalline materials/compounds extracted from the Open Quantum Materials Database [33]. Each compound is described by chemical features, and its label (conductor or insulator) is determined by a DFT simulator [34]. Due to the black-box nature of DFT, the true $\phi$ is unknown. (Footnote 5: One can mimic the DFT simulator using a logistic function once the parameter $\mathbf{x}$ is learned from ZO algorithms.) We split the dataset into two equal parts, one for training and one for testing. We refer readers to Appendix A.10 for more details on our dataset and the setting of experiments.
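To make the objective concrete, here is a minimal sketch of a nonlinear least-squares loss in which a logistic function stands in for the black-box simulator, as the footnote suggests is possible once parameters are learned. Everything here is an assumption for illustration: the names `logistic_phi`, `sample_loss`, `training_loss`, the 0/1 label encoding, and the tiny dataset are not taken from the experiments.

```python
import math

def logistic_phi(x, a):
    """Stand-in for the black-box model: logistic response to features a."""
    z = sum(xi * ai for xi, ai in zip(x, a))
    return 1.0 / (1.0 + math.exp(-z))

def sample_loss(x, a, y):
    """Squared residual between the label y and the model response."""
    return (y - logistic_phi(x, a)) ** 2

def training_loss(x, data):
    """Finite-sum objective: average of the per-sample losses."""
    return sum(sample_loss(x, a, y) for a, y in data) / len(data)

# Toy data (assumed): two samples with two features each and labels in {0, 1}.
data = [([1.0, 2.0], 1.0), ([3.0, 4.0], 0.0)]
loss_at_zero = training_loss([0.0, 0.0], data)
```

A ZO method only ever calls `training_loss` (or `sample_loss`) and never differentiates through it, which is what allows the same procedure to run against a simulator whose internals are inaccessible.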
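Figure 1: (a) Training loss versus iterations; (b) training loss versus function queries.
Table 2: Number of epochs and testing error of each method for the experiment in Fig. 1(b).
Method | ZO-SGD [24] | ZO-SVRC [26] | ZO-SVRG | ZO-SVRG-Coord | ZO-SVRG-Ave
# of epochs | | | | |
Error | | | | |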
In Fig. 1, we present the training loss against the number of epochs (i.e., iterations divided by the epoch length $m$) and function queries. We compare our proposed algorithms ZO-SVRG, ZO-SVRG-Coord and ZO-SVRG-Ave with ZO-SGD [24] and ZO-SVRC [26]. Fig. 1(a) presents the convergence trajectories of ZO algorithms as functions of the number of epochs, where ZO-SVRG is evaluated under different mini-batch sizes $b$. We observe that the convergence error of ZO-SVRG decreases as $b$ increases, and for a small mini-batch size, ZO-SVRG likely converges to a neighborhood of a critical point as shown by Corollary 1. We also note that our proposed algorithms ZO-SVRG (with a suitably large mini-batch), ZO-SVRG-Coord and ZO-SVRG-Ave converge faster (i.e., have lower iteration complexity) than the existing algorithms ZO-SGD and ZO-SVRC. In particular, the use of multiple random direction samples in AvgRandGradEst significantly accelerates ZO-SVRG since the error of order $O(1/b)$ is reduced to $O(1/(bq))$ (see Table 1), leading to a non-dominant factor versus $O(d/T)$ in the convergence rate of ZO-SVRG-Ave. Fig. 1(b) presents the training loss against the number of function queries. For the same experiment, Table 2 shows the number of iterations and the testing error of the algorithms studied in Fig. 1(b) under a fixed budget of function queries. We observe that the performance of CoordGradEst based algorithms (i.e., ZO-SVRC and ZO-SVRG-Coord) degrades due to the large number of function queries needed to construct coordinate-wise gradient estimates. By contrast, algorithms based on random gradient estimators (i.e., ZO-SGD, ZO-SVRG and ZO-SVRG-Ave) yield better training and testing results, while ZO-SGD consumes an extremely large number of iterations. As a result, ZO-SVRG (with a suitably large mini-batch) and ZO-SVRG-Ave achieve better tradeoffs between the iteration and the function query complexity.
Generation of adversarial examples from black-box DNNs
In image classification, adversarial examples refer to carefully crafted perturbations that, when added to natural images, are visually imperceptible but lead the target model to misclassify. In the setting of 'zeroth order' attacks [2, 3, 35], the model parameters are hidden and acquiring its gradient is inadmissible. Only the model evaluations are accessible. We can then regard the task of generating a universal adversarial perturbation (to natural images) as a ZO optimization problem of the form (1). We elaborate on the problem formulation for generating adversarial examples in Appendix A.11.
We use a well-trained DNN (https://github.com/carlini/nn_robust_attacks) on the MNIST handwritten digit classification task as the target black-box model, which achieves 99.4% test accuracy on natural examples. Two ZO optimization methods, ZO-SGD and ZO-SVRG-Ave, are performed in our experiment. Note that ZO-SVRG-Ave reduces to ZO-SVRG when $q = 1$. We choose images from the same class, and set the same parameters and the same constant step size for both ZO methods, with the step size specified in terms of the image dimension $d$. For ZO-SVRG-Ave, we vary the number of random direction samples $q$. In Fig. 2, we show the black-box attack loss (against the number of epochs) as well as the least distortion of the successful (universal) adversarial perturbations. To reach the same attack loss, ZO-SVRG-Ave requires more function evaluations than ZO-SGD, with the overhead depending on $q$. The sharp drop of attack loss in each method could be caused by the hinge-like loss as part of the total loss function, which turns to $0$ only if the attack becomes successful. Compared to ZO-SGD, ZO-SVRG-Ave offers a faster convergence to a more accurate solution, and its convergence trajectory is more stable as $q$ becomes larger (due to the reduced variance of AvgRandGradEst). In addition, ZO-SVRG-Ave improves the distortion of adversarial examples compared to ZO-SGD. We present the corresponding adversarial examples in Appendix A.11.
7 Conclusion
In this paper, we studied ZO-SVRG, a new ZO nonconvex optimization method. We presented new convergence results beyond the existing work on ZO nonconvex optimization. We show that ZO-SVRG improves the convergence rate of ZO-SGD from $O(\sqrt{d}/\sqrt{T})$ to $O(d/T)$ but suffers a new correction term of order $O(1/b)$. This is the side effect of combining a two-point random gradient estimator with SVRG. We then propose two accelerated variants of ZO-SVRG based on improved gradient estimators of reduced variance. We show an illuminating tradeoff between the iteration and the function query complexity. Experimental results and theoretical analysis validate the effectiveness of our approaches compared to other state-of-the-art algorithms.
References
 (1) N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, “Practical black-box attacks against machine learning,” in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, 2017, pp. 506–519.
 (2) A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.
 (3) P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh, “Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 2017, pp. 15–26.
 (4) T. Chen and G. B. Giannakis, “Bandit convex optimization for scalable and dynamic iot management,” arXiv preprint arXiv:1707.09060, 2017.
 (5) S. Liu, J. Chen, P.-Y. Chen, and A. O. Hero, “Zeroth-order online admm: Convergence analysis and applications,” in Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, April 2018, vol. 84, pp. 288–297.
 (6) M. C. Fu, “Optimization for simulation: Theory vs. practice,” INFORMS Journal on Computing, vol. 14, no. 3, pp. 192–215, 2002.
 (7) X. Lian, H. Zhang, C.-J. Hsieh, Y. Huang, and J. Liu, “A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order,” in Advances in Neural Information Processing Systems, 2016, pp. 3054–3062.
 (8) R. P. Brent, Algorithms for minimization without derivatives, Courier Corporation, 2013.
 (9) J. C. Spall, Introduction to stochastic search and optimization: estimation, simulation, and control, vol. 65, John Wiley & Sons, 2005.
 (10) A. D. Flaxman, A. T. Kalai, and H. B. McMahan, “Online convex optimization in the bandit setting: gradient descent without a gradient,” in Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, 2005, pp. 385–394.
 (11) O. Shamir, “On the complexity of bandit and derivative-free stochastic convex optimization,” in Conference on Learning Theory, 2013, pp. 3–24.
 (12) A. Agarwal, O. Dekel, and L. Xiao, “Optimal algorithms for online convex optimization with multi-point bandit feedback,” in COLT, 2010, pp. 28–40.
 (13) Y. Nesterov and V. Spokoiny, “Random gradient-free minimization of convex functions,” Foundations of Computational Mathematics, vol. 17, no. 2, pp. 527–566, 2015.
 (14) J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono, “Optimal rates for zero-order convex optimization: The power of two function evaluations,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2788–2806, 2015.
 (15) O. Shamir, “An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,” Journal of Machine Learning Research, vol. 18, no. 52, pp. 1–11, 2017.
 (16) X. Gao, B. Jiang, and S. Zhang, “On the information-adaptive variants of the admm: an iteration complexity perspective,” Optimization Online, vol. 12, 2014.
 (17) P. Dvurechensky, A. Gasnikov, and E. Gorbunov, “An accelerated method for derivative-free smooth stochastic convex optimization,” arXiv preprint arXiv:1802.09022, 2018.
 (18) Y. Wang, S. Du, S. Balakrishnan, and A. Singh, “Stochastic zeroth-order optimization in high dimensions,” in Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics. April 2018, vol. 84, pp. 1356–1365, PMLR.
 (19) R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Advances in Neural Information Processing Systems, 2013, pp. 315–323.
 (20) S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola, “Stochastic variance reduction for nonconvex optimization,” in International Conference on Machine Learning, 2016, pp. 314–323.
 (21) A. Nitanda, “Accelerated stochastic gradient descent for minimizing finite sums,” in Artificial Intelligence and Statistics, 2016, pp. 195–203.
 (22) Z. Allen-Zhu and Y. Yuan, “Improved SVRG for non-strongly-convex or sum-of-non-convex objectives,” in International Conference on Machine Learning, 2016, pp. 1080–1089.
 (23) L. Lei, C. Ju, J. Chen, and M. I. Jordan, “Nonconvex finite-sum optimization via SCSG methods,” in Advances in Neural Information Processing Systems, 2017, pp. 2345–2355.
 (24) S. Ghadimi and G. Lan, “Stochastic first- and zeroth-order methods for nonconvex stochastic programming,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.
 (25) D. Hajinezhad, M. Hong, and A. Garcia, “Zeroth order nonconvex multi-agent optimization over networks,” arXiv preprint arXiv:1710.09997, 2017.
 (26) B. Gu, Z. Huo, and H. Huang, “Zeroth-order asynchronous doubly stochastic algorithm with variance reduction,” arXiv preprint arXiv:1612.01425, 2016.
 (27) K. M. Choromanski and V. Sindhwani, “On blackbox backpropagation and Jacobian sensing,” in Advances in Neural Information Processing Systems, 2017, pp. 6524–6532.
 (28) G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, and J. Sohl-Dickstein, “REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models,” in Advances in Neural Information Processing Systems, 2017, pp. 2624–2633.
 (29) W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud, “Backpropagation through the void: Optimizing control variates for black-box gradient estimation,” arXiv preprint arXiv:1711.00123, 2017.
 (30) N. S. Chatterji, N. Flammarion, Y.-A. Ma, P. L. Bartlett, and M. I. Jordan, “On the theory of variance reduction for stochastic gradient Monte Carlo,” arXiv preprint arXiv:1802.05431, 2018.
 (31) W. Yang and P. W. Ayers, “Density-functional theory,” in Computational Medicinal Chemistry for Drug Discovery, pp. 103–132. CRC Press, 2003.
 (32) P. Xu, F. Roosta-Khorasani, and M. W. Mahoney, “Second-order optimization for nonconvex machine learning: An empirical study,” arXiv preprint arXiv:1708.07827, 2017.
 (33) S. Kirklin, J. E. Saal, B. Meredig, A. Thompson, J. W. Doak, M. Aykol, S. Rühl, and C. Wolverton, “The open quantum materials database (OQMD): assessing the accuracy of DFT formation energies,” npj Computational Materials, vol. 1, pp. 15010, 2015.
 (34) G. Kresse and J. Furthmüller, “Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set,” Computational Materials Science, vol. 6, no. 1, pp. 15–50, 1996.
 (35) N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in IEEE Symposium on Security and Privacy, 2017, pp. 39–57.
 (36) S. Shalev-Shwartz, “Online learning and online convex optimization,” Foundations and Trends® in Machine Learning, vol. 4, no. 2, pp. 107–194, 2012.
 (37) L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton, “A general-purpose machine learning framework for predicting properties of inorganic materials,” npj Computational Materials, vol. 2, pp. 16028, 2016.
Appendix A Supplementary material
A.1 Zeroth-order (ZO) gradient estimators
With a slight abuse of notation, in this section let $f$ be an arbitrary function under assumptions A1 and A2. Lemma 1 shows the second-order statistics of RandGradEst.
Lemma 1
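Suppose that Assumption A1 holds, and define $f_\mu(x) = \mathbb{E}_{u \sim U_b}[f(x + \mu u)]$, where $U_b$ is the uniform distribution over the unit Euclidean ball. Then RandGradEst yields:
1) $f_\mu$ is $L$-smooth, and
(15) $\nabla f_\mu(x) = \mathbb{E}_u[\hat{\nabla} f(x)]$,
where $u$ is drawn from the uniform distribution over the unit Euclidean sphere, and $\hat{\nabla} f(x)$ is given by RandGradEst.
2) For any $x \in \mathbb{R}^d$,
(16) $|f_\mu(x) - f(x)| \le \frac{L\mu^2}{2}$,
(17) $\|\nabla f_\mu(x) - \nabla f(x)\|_2^2 \le \frac{\mu^2 L^2 d^2}{4}$,
(18) $\frac{1}{2}\|\nabla f(x)\|_2^2 - \frac{\mu^2 L^2 d^2}{4} \le \|\nabla f_\mu(x)\|_2^2 \le 2\|\nabla f(x)\|_2^2 + \frac{\mu^2 L^2 d^2}{2}$.
3) For any $x \in \mathbb{R}^d$,
(19) $\mathbb{E}_u[\|\hat{\nabla} f(x) - \nabla f_\mu(x)\|_2^2] \le \mathbb{E}_u[\|\hat{\nabla} f(x)\|_2^2] \le 2d\|\nabla f(x)\|_2^2 + \frac{\mu^2 L^2 d^2}{2}$.
Proof: First, by using (gao2014information, Lemma 4.1.a) (also see (shalev2012online) and (nesterov2015random)), we immediately obtain that $f_\mu$ is $L_\mu$-smooth with $L_\mu \le L$, and
(20) $\nabla f_\mu(x) = \mathbb{E}_u\left[\frac{d}{\mu} f(x + \mu u)\, u\right]$.
Since $\mathbb{E}_u\left[\frac{d}{\mu} f(x)\, u\right] = 0$, we obtain (15).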
Next, we obtain (16)-(18) based on (gao2014information, Lemma 4.1.b). Moreover, we have
(21) 
where the first inequality holds due to Lemma 5.
The first inequality of (19) holds due to (15) and the fact that $\mathbb{E}[\|a - \mathbb{E}[a]\|_2^2] \le \mathbb{E}[\|a\|_2^2]$ for a random variable $a$. The second inequality of (19) holds due to (gao2014information, Lemma 4.1.b). The proof is now complete.
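The statements of Lemma 1 can be checked numerically. The following is a minimal sketch, assuming RandGradEst takes the common two-point form $(d/\mu)[f(x + \mu u) - f(x)]u$ with $u$ uniform on the unit sphere (its definition is given in the main text and is not repeated in this appendix). The quadratic test function, the seed, and the sample count are illustrative choices, not taken from the paper. For a quadratic, the ball-smoothed function differs from $f$ only by a constant, so the Monte Carlo mean of the estimates should approach $\nabla f(x)$ itself.

```python
import numpy as np

def rand_grad_est(f, x, mu, rng):
    """Two-point random gradient estimate (assumed form, see the lead-in)."""
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)                 # uniform direction on the unit sphere
    return (d / mu) * (f(x + mu * u) - f(x)) * u

# Hypothetical test problem: f(x) = 0.5 * x^T A x, so grad f(x) = A x.
A = np.diag([1.0, 2.0, 3.0, 4.0])
f = lambda x: 0.5 * x @ A @ x
x0 = np.ones(4)
grad_true = A @ x0

rng = np.random.default_rng(0)
mu = 1e-3                                  # smoothing parameter (illustrative)
samples = np.array([rand_grad_est(f, x0, mu, rng) for _ in range(20000)])

mean_est = samples.mean(axis=0)                        # approximates E_u[estimate]
second_moment = np.mean(np.sum(samples**2, axis=1))    # approximates E_u[||estimate||^2]

print("true gradient          :", grad_true)
print("Monte Carlo mean       :", np.round(mean_est, 2))
print("E||estimate||^2 (approx):", round(float(second_moment), 1))
print("d * ||grad f||^2        :", x0.size * float(grad_true @ grad_true))
```

A single estimate is very noisy: its second moment grows with the dimension $d$, which is why averaging over several directions or using coordinate-wise differences is attractive.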
In Lemma 2, we show the properties of AvgRandGradEst.
Proof: Since are i.i.d. random vectors, we have
(24) 
where .
In (23), the first inequality holds due to (22) and the fact that $\mathbb{E}[\|a - \mathbb{E}[a]\|_2^2] \le \mathbb{E}[\|a\|_2^2]$ for a random variable $a$. Next, we bound the second moment of
(25) 
where the expectation is taken with respect to i.i.d. random vectors , and we have used the fact that for any . Substituting (18) and (19) into (25), we obtain (23).
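The effect of averaging can be illustrated with a short experiment. This is a sketch under assumptions: AvgRandGradEst is taken to be the average of several i.i.d. two-point estimates of the form $(d/\mu)[f(x + \mu u_i) - f(x)]u_i$, the number of directions is called `q` here, and the test function, seed, and trial count are hypothetical. Because the directions are i.i.d., the total variance of the average should shrink roughly in proportion to `1/q`.

```python
import numpy as np

def avg_rand_grad_est(f, x, mu, q, rng):
    """Average of q i.i.d. two-point random gradient estimates (assumed form)."""
    d = x.size
    U = rng.standard_normal((q, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)    # q directions on the unit sphere
    fx = f(x)                                        # one query shared by all directions
    diffs = np.array([f(x + mu * u) - fx for u in U])
    return (d / mu) * (diffs[:, None] * U).mean(axis=0)

def f(x):
    # Smooth nonconvex test function (illustrative only).
    return np.sum(np.log1p(x**2))

x0 = np.array([0.5, 1.0, 1.5, 2.0])
rng = np.random.default_rng(1)
mu, trials = 1e-3, 2000

def total_variance(q):
    """Sum of per-coordinate sample variances over repeated estimates."""
    samples = np.array([avg_rand_grad_est(f, x0, mu, q, rng) for _ in range(trials)])
    return float(np.sum(samples.var(axis=0)))

var_by_q = {q: total_variance(q) for q in (1, 5, 25)}
for q, v in var_by_q.items():
    print(f"q = {q:2d}: total variance ~ {v:.3f}")
```

The price of the lower variance is query cost: each averaged estimate uses `q + 1` function evaluations instead of 2.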
In Lemma 3, we demonstrate the properties of CoordGradEst.
Lemma 3
Let Assumption A1 hold and define , where denotes the uniform distribution on the interval . We then have:
1) is L-smooth, and
(26) 
where is defined by CoordGradEst, and denotes the partial derivative with respect to the $\ell$th coordinate.
2) For ,
(27)  
(28) 
3) For ,
(29) 
Proof: For the $\ell$th coordinate, it is known from (lian2016comprehensive, Lemma 6) that is smooth and
(30) 
Based on (30) and the definition of CoordGradEst, we then obtain (26).
The inequalities (27) and (28) are proved in (lian2016comprehensive, Lemma 6).
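A coordinate-wise estimator can be sketched in a few lines. The code below assumes CoordGradEst is the central difference $[f(x + \mu e_i) - f(x - \mu e_i)]/(2\mu)$ applied along each coordinate direction $e_i$ (the definition is in the main text). The cubic test function and the query counter are hypothetical helpers. Under this assumed form, each entry equals the partial derivative of $f$ averaged uniformly over an interval of half-width $\mu$ along that coordinate, which for $f(x) = \sum_i x_i^3$ is $3x_i^2 + \mu^2$.

```python
import numpy as np

def coord_grad_est(f, x, mu):
    """Coordinate-wise central-difference gradient estimate (assumed form)."""
    d = x.size
    grad = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = 1.0                                   # i-th standard basis vector
        grad[i] = (f(x + mu * e) - f(x - mu * e)) / (2.0 * mu)
    return grad

stats = {"queries": 0}                               # counts function evaluations

def f(x):
    stats["queries"] += 1
    return np.sum(x**3)

x0 = np.array([1.0, -2.0, 0.5])
mu = 1e-3
g_hat = coord_grad_est(f, x0, mu)

g_true = 3.0 * x0**2                 # exact gradient of f
g_smoothed = 3.0 * x0**2 + mu**2     # gradient of the interval-averaged function

print("estimate            :", g_hat)
print("error vs. grad f    :", np.max(np.abs(g_hat - g_true)))
print("error vs. smoothed  :", np.max(np.abs(g_hat - g_smoothed)))
print("function queries    :", stats["queries"])
```

The estimate is deterministic and its error is controlled by $\mu$ alone, but it costs $2d$ function queries per gradient, compared with 2 for a single random-direction estimate.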
A.2 Control variates
The gradient blending in Step 6 of SVRG (Algorithm 1) can be interpreted as a control variate method (tucker2017rebar; grathwohl2017backpropagation; chatterji2018theory). If we view as the raw gradient estimate at , and as a control variate satisfying , then the gradient blending (2) becomes a gradient estimate modified by a control variate, . Here has the same expectation as , i.e., ; however, has a lower variance when is positively correlated with (see the detailed analysis below).
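The control variate idea described above can be illustrated with a scalar toy model. This is a sketch only: the names `g` (raw estimate), `c` (control variate), the known mean `c_mean`, and the noise levels are hypothetical and not taken from the paper. The raw estimate and the control variate share a noise component, so they are positively correlated, and subtracting the centered control variate removes that shared noise without changing the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

shared = rng.standard_normal(n)            # noise common to g and c
private = 0.5 * rng.standard_normal(n)     # noise only in g

g = 1.0 + shared + private                 # raw estimate, E[g] = 1
c = 2.0 + shared                           # control variate with known mean
c_mean = 2.0                               # E[c], assumed to be available exactly

g_cv = g - (c - c_mean)                    # estimate modified by the control variate

corr = np.corrcoef(g, c)[0, 1]
print("correlation(g, c)    :", round(float(corr), 3))
print("mean of g, g_cv      :", round(float(g.mean()), 3), round(float(g_cv.mean()), 3))
print("variance of g, g_cv  :", round(float(g.var()), 3), round(float(g_cv.var()), 3))
```

In SVRG the role of `c_mean` is played by a full gradient computed at a snapshot point, which is what makes the correction unbiased.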
Consider the following gradient estimator,
(31) 
where is a given (raw) gradient estimate, is an unknown coefficient, and is a control variate. It is clear that has the same expectation as . We then study the effect of on the variance of ,