Zeroth-Order Stochastic Variance Reduction for Nonconvex Optimization


Sijia Liu (IBM Research, AI) sijia.liu@ibm.com
Bhavya Kailkhura (Lawrence Livermore National Laboratory) kailkhura1@llnl.gov
Pin-Yu Chen (IBM Research, AI) pin-yu.chen@ibm.com
Paishun Ting (University of Michigan) paishun@umich.edu
Shiyu Chang (IBM Research, AI) shiyu.chang@ibm.com
Lisa Amini (IBM Research, AI) lisa.amini@us.ibm.com
Abstract

As application demands for zeroth-order (gradient-free) optimization accelerate, the need for variance-reduced and faster-converging approaches is also intensifying. This paper addresses these challenges by presenting: a) a comprehensive theoretical analysis of variance-reduced zeroth-order (ZO) optimization, b) a novel variance-reduced ZO algorithm, called ZO-SVRG, and c) an experimental evaluation of our approach in the context of two compelling applications, black-box chemical material classification and generation of adversarial examples from black-box deep neural network models. Our theoretical analysis uncovers an essential difficulty in the analysis of ZO-SVRG: the unbiasedness assumption on gradient estimates no longer holds. We prove that, compared to its first-order counterpart, ZO-SVRG with a two-point random gradient estimator suffers an additional error of order $O(1/b)$, where $b$ is the mini-batch size. To mitigate this error, we propose two accelerated versions of ZO-SVRG utilizing variance-reduced gradient estimators, which achieve the best known rate for ZO stochastic optimization (in terms of iterations). Our extensive experimental results show that our approaches outperform other state-of-the-art ZO algorithms and strike a balance between the convergence rate and the function query complexity.

 

Preprint.

1 Introduction

Zeroth-order (gradient-free) optimization is increasingly embraced for solving machine learning problems where explicit expressions of the gradients are difficult or infeasible to obtain. Recent examples have shown zeroth-order (ZO) based generation of prediction-evasive, black-box adversarial attacks on deep neural networks (DNNs) as effective as state-of-the-art white-box attacks, despite leveraging only the inputs and outputs of the targeted DNN (papernot2016practical; madry17; chen2017zoo). Additional classes of applications include network control and management with time-varying constraints and limited computation capacity (chen2017bandit; liu2017zeroth), and parameter inference of black-box systems (fu2002optimization; lian2016comprehensive). ZO algorithms achieve gradient-free optimization by approximating the full gradient via gradient estimators based only on function values (brent2013algorithms; spall2005introduction).

Although many ZO algorithms have recently been developed and analyzed (liu2017zeroth; flaxman2005online; shamir2013complexity; agarwal2010optimal; nesterov2015random; duchi2015optimal; shamir2017optimal; gao2014information; dvurechensky2018accelerated; wangdu18), they often suffer from the high variance of ZO gradient estimates and, in turn, hampered convergence rates. In addition, these algorithms are mainly designed for convex settings, which limits their applicability to a wide range of (nonconvex) machine learning problems.

In this paper, we study the design and analysis of variance-reduced and faster-converging nonconvex ZO optimization methods. To reduce the variance of ZO gradient estimates, one can draw motivation from similar ideas in the first-order regime. The stochastic variance reduced gradient (SVRG) method is a commonly used, effective first-order approach to reduce variance (johnson2013accelerating; reddi2016stochastic; nitanda2016accelerated; allen2016improved; lei2017non). Owing to this variance reduction, it improves the convergence rate of stochastic gradient descent (SGD) from $O(1/\sqrt{T})$ to $O(1/T)$ (in the big-$O$ notation, constants are ignored and only the dominant factors are kept), where $T$ is the total number of iterations.

Although SVRG has shown great promise, applying similar ideas to ZO optimization is not a trivial task. The main challenge arises from the fact that SVRG relies on the assumption that a stochastic gradient is an unbiased estimate of the true batch/full gradient, which unfortunately does not hold in the ZO case. Therefore, it is an open question whether a ZO stochastic variance reduced gradient could enable faster convergence of ZO algorithms. In this paper, we attempt to fill the gap between ZO optimization and SVRG.

Contributions. We propose and evaluate a novel ZO algorithm for nonconvex stochastic optimization, ZO-SVRG, which integrates SVRG with ZO gradient estimators. We show that, compared to SVRG, ZO-SVRG achieves a similar convergence rate that decays linearly with $T$, but up to an additional error correction term of order $O(1/b)$, where $b$ is the mini-batch size. Without careful treatment, this correction term (e.g., when $b$ is small) could be a critical factor affecting the optimization performance. To mitigate this error term, we propose two accelerated ZO-SVRG variants utilizing reduced-variance gradient estimators. These yield a faster convergence rate towards $O(d/T)$, the best known iteration complexity bound for ZO stochastic optimization (e.g., the ZO gradient descent method).

Our work offers a comprehensive study of how ZO gradient estimators affect SVRG in terms of both iteration complexity (i.e., convergence rate) and function query complexity. Compared to existing ZO algorithms, our methods can strike a balance between iteration complexity and function query complexity. To demonstrate the flexibility of our approach in managing this trade-off, we conduct an empirical evaluation of our proposed algorithms and other state-of-the-art algorithms on two diverse applications: black-box chemical material classification and generation of universal adversarial perturbations from black-box deep neural network models. Extensive experimental results and theoretical analysis validate the effectiveness of our approaches.

2 Related work

In ZO algorithms, a full gradient is typically approximated using either a one-point or a two-point gradient estimator, where the former acquires a gradient estimate by querying the function at a single random location close to $\mathbf{x}$ (flaxman2005online; shamir2013complexity), and the latter computes a finite difference using two random function queries (agarwal2010optimal; nesterov2015random). In this paper, we focus on the two-point gradient estimator since it has lower variance and thus improves the complexity bounds of ZO algorithms.

Despite the meteoric rise of two-point-based ZO algorithms, most of the work is restricted to convex problems (liu2017zeroth; duchi2015optimal; shamir2017optimal; gao2014information; dvurechensky2018accelerated; wangdu18). For example, a ZO mirror descent algorithm proposed by (duchi2015optimal) has the exact rate of $O(\sqrt{d/T})$, where $d$ is the number of optimization variables. The same rate is obtained by bandit convex optimization (shamir2017optimal) and the ZO online alternating direction method of multipliers (liu2017zeroth). Current studies suggest that ZO algorithms typically match the iteration complexity of first-order algorithms up to a small-degree polynomial of the problem size $d$.

In contrast to the convex setting, nonconvex ZO algorithms are comparatively under-studied, with a few recent exceptions (lian2016comprehensive; nesterov2015random; ghadimi2013stochastic; hajinezhad2017zeroth; gu2016zeroth). Different from convex optimization, a stationarity condition is used to measure the convergence of nonconvex methods. In (nesterov2015random), the ZO gradient descent (ZO-GD) algorithm was proposed for deterministic nonconvex programming, yielding an $O(d/T)$ convergence rate. A stochastic version of ZO-GD (namely, ZO-SGD) studied in (ghadimi2013stochastic) achieves the rate of $O(\sqrt{d/T})$. In (hajinezhad2017zeroth), a ZO distributed algorithm was developed for multi-agent optimization, leading to an $O(1/T + 1/q)$ convergence rate, where $q$ is the number of random directions used to construct a gradient estimate. In (lian2016comprehensive), an asynchronous ZO stochastic coordinate descent (ZO-SCD) method was derived for parallel optimization and achieved the rate of $O(\sqrt{d/T})$. In (gu2016zeroth), a variant of ZO-SCD, known as ZO stochastic variance reduced coordinate (ZO-SVRC) descent, improved the convergence rate from $O(1/\sqrt{T})$ to $O(1/T)$ under the same parameter setting for gradient estimation. Although the authors of (gu2016zeroth) considered the stochastic variance reduction technique, only a coordinate descent algorithm using a coordinate-wise (deterministic) gradient estimator was studied. This motivates our study of the more general ZO-SVRG framework under different gradient estimators.

3 Preliminaries

Consider a nonconvex finite-sum problem of the form

$$\min_{\mathbf{x} \in \mathbb{R}^d} \; f(\mathbf{x}) := \frac{1}{n}\sum_{i=1}^{n} f_i(\mathbf{x}), \qquad (1)$$

where $\{f_i(\mathbf{x})\}_{i=1}^{n}$ are individual nonconvex cost functions. The generic form (1) encompasses many machine learning problems, ranging from generalized linear models to neural networks. We next elaborate on the assumptions of problem (1), and provide background on ZO gradient estimators.

3.1 Assumptions

A1: Functions $\{f_i\}$ have $L$-Lipschitz continuous gradients ($L$-smooth), i.e., for any $\mathbf{x}$ and $\mathbf{y}$, $\|\nabla f_i(\mathbf{x}) - \nabla f_i(\mathbf{y})\|_2 \le L\|\mathbf{x} - \mathbf{y}\|_2$ for $i \in [n]$ and some $L < \infty$. Here $\|\cdot\|_2$ denotes the Euclidean norm, and for ease of notation $[n]$ represents the integer set $\{1, 2, \ldots, n\}$.

A2: The variance of stochastic gradients is bounded as $\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(\mathbf{x}) - \nabla f(\mathbf{x})\|_2^2 \le \sigma^2$. Here $\nabla f_i(\mathbf{x})$ can be viewed as a stochastic gradient of $\nabla f(\mathbf{x})$ obtained by randomly picking an index $i \in [n]$.

Both A1 and A2 are standard assumptions in the nonconvex optimization literature (lian2016comprehensive; nesterov2015random; lei2017non; ghadimi2013stochastic; hajinezhad2017zeroth; gu2016zeroth). Note that A2 is milder than the assumption of bounded gradients (liu2017zeroth; hajinezhad2017zeroth). For example, if $\|\nabla f_i(\mathbf{x})\|_2 \le \sigma$ for all $i \in [n]$, then A2 is satisfied with the same constant $\sigma^2$.
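To make the last claim concrete, a one-line check (ours, not from the paper): if $\|\nabla f_i(\mathbf{x})\|_2 \le \sigma$ for all $i \in [n]$, then
$$\frac{1}{n}\sum_{i=1}^{n}\big\|\nabla f_i(\mathbf{x}) - \nabla f(\mathbf{x})\big\|_2^{2} \;=\; \frac{1}{n}\sum_{i=1}^{n}\big\|\nabla f_i(\mathbf{x})\big\|_2^{2} - \big\|\nabla f(\mathbf{x})\big\|_2^{2} \;\le\; \frac{1}{n}\sum_{i=1}^{n}\big\|\nabla f_i(\mathbf{x})\big\|_2^{2} \;\le\; \sigma^{2},$$
so A2 indeed holds with the same constant.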

3.2 ZO gradient estimation

Given an individual cost function $f_i$ (or an arbitrary function $f$ under A1 and A2), a two-point random gradient estimator is defined by (nesterov2015random; gao2014information)

$$\hat{\nabla} f_i(\mathbf{x}) = \frac{d}{\mu}\big[ f_i(\mathbf{x} + \mu \mathbf{u}_i) - f_i(\mathbf{x}) \big]\,\mathbf{u}_i, \qquad \text{(RandGradEst)}$$

where recall that $d$ is the number of optimization variables, $\mu > 0$ is a smoothing parameter (the parameter $\mu$ can be generalized to $\mu_i$ for $i \in [n]$; here we assume $\mu_i = \mu$ for ease of representation), and $\{\mathbf{u}_i\}$ are i.i.d. random directions drawn from a uniform distribution over the unit sphere (flaxman2005online; shamir2017optimal; gao2014information). In general, RandGradEst is a biased approximation to the true gradient $\nabla f_i(\mathbf{x})$, and its bias reduces as $\mu$ approaches zero. However, in a practical system, if $\mu$ is too small, the function difference could be dominated by system noise and fail to represent the function differential (lian2016comprehensive).
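As an illustration, a minimal NumPy sketch of RandGradEst (ours, not the paper's code); `fi` stands for any black-box component function that returns a scalar:

```python
import numpy as np

def rand_grad_est(fi, x, mu=1e-3, rng=None):
    """Two-point random gradient estimator (RandGradEst):
    (d / mu) * [fi(x + mu*u) - fi(x)] * u, with u uniform on the unit sphere."""
    rng = rng or np.random.default_rng()
    d = x.size
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)          # normalize a Gaussian draw -> uniform direction
    return (d / mu) * (fi(x + mu * u) - fi(x)) * u
```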

Remark 1

Instead of using a single direction sample in RandGradEst, the average over $q$ i.i.d. direction samples $\{\mathbf{u}_{i,j}\}_{j=1}^{q}$ can also be used for gradient estimation (liu2017zeroth; duchi2015optimal; hajinezhad2017zeroth),

$$\hat{\nabla} f_i(\mathbf{x}) = \frac{d}{\mu q}\sum_{j=1}^{q}\big[ f_i(\mathbf{x} + \mu \mathbf{u}_{i,j}) - f_i(\mathbf{x}) \big]\,\mathbf{u}_{i,j}, \qquad \text{(Avg-RandGradEst)}$$

which we call an average random gradient estimator.

In addition to RandGradEst and Avg-RandGradEst, the works (lian2016comprehensive; gu2016zeroth; choromanski2017blackbox) considered a coordinate-wise gradient estimator, in which every partial derivative is estimated via a two-point query along a fixed (coordinate) direction vector,

$$\hat{\nabla} f_i(\mathbf{x}) = \sum_{j=1}^{d} \frac{ f_i(\mathbf{x} + \mu_j \mathbf{e}_j) - f_i(\mathbf{x} - \mu_j \mathbf{e}_j)}{2\mu_j}\,\mathbf{e}_j, \qquad \text{(CoordGradEst)}$$

where $\mu_j$ is a coordinate-wise smoothing parameter, and $\mathbf{e}_j$ is the standard basis vector with $1$ at its $j$th coordinate and $0$s elsewhere. Compared to RandGradEst, CoordGradEst is deterministic and requires $d$ times more function queries. However, as will become evident later, it yields an improved iteration complexity (i.e., convergence rate). More details on ZO gradient estimation can be found in Appendix A.1.
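For concreteness, analogous NumPy sketches of the two estimators above (again ours, with `fi`, `q`, and a single smoothing parameter `mu` as simplifying placeholders):

```python
import numpy as np

def avg_rand_grad_est(fi, x, q=10, mu=1e-3, rng=None):
    """Avg-RandGradEst: average of q i.i.d. two-point random gradient estimates."""
    rng = rng or np.random.default_rng()
    d = x.size
    g = np.zeros(d)
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g += (d / mu) * (fi(x + mu * u) - fi(x)) * u
    return g / q

def coord_grad_est(fi, x, mu=1e-3):
    """CoordGradEst: deterministic coordinate-wise estimator (2*d function queries)."""
    d = x.size
    g = np.zeros(d)
    for j in range(d):
        e = np.zeros(d)
        e[j] = 1.0
        g[j] = (fi(x + mu * e) - fi(x - mu * e)) / (2.0 * mu)
    return g
```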

4 ZO stochastic variance reduced gradient (ZO-SVRG)

4.1 SVRG: from first-order to zeroth-order

It has been shown in (johnson2013accelerating; reddi2016stochastic) that first-order SVRG achieves an $O(1/T)$ convergence rate, requiring fewer iterations than ordinary SGD for solving finite-sum problems. The key step of SVRG (Algorithm 1; different from the standard SVRG of (johnson2013accelerating), we consider its mini-batch variant (reddi2016stochastic)) is to generate an auxiliary (snapshot) sequence at which the full gradient is used as a reference in building a modified stochastic gradient estimate

$$\mathbf{v}_k = \frac{1}{b}\sum_{i \in \mathcal{I}_k}\big[\nabla f_i(\mathbf{x}_k) - \nabla f_i(\tilde{\mathbf{x}})\big] + \nabla f(\tilde{\mathbf{x}}), \qquad (2)$$

where $\mathbf{v}_k$ denotes the gradient estimate at the inner iterate $\mathbf{x}_k$, $\mathcal{I}_k$ is a mini-batch of size $b$ (chosen uniformly at random with replacement), $\tilde{\mathbf{x}}$ is the current snapshot point, and $\nabla f(\tilde{\mathbf{x}}) = \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(\tilde{\mathbf{x}})$. The gradient blending (2) is also motivated by a variance reduction technique known as control variates (tucker2017rebar; grathwohl2017backpropagation; chatterji2018theory). The link between SVRG and control variates is discussed in Appendix A.2.
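To make the control-variate structure of (2) explicit, a small sketch (ours; `grad_fi(i, x)` is an assumed per-sample gradient oracle):

```python
import numpy as np

def svrg_blend(grad_fi, x, x_snap, full_grad_snap, batch_idx):
    """First-order SVRG blending (2): the mini-batch gradient at the snapshot acts
    as a control variate whose exact mean, full_grad_snap, is known.
    Taking expectations over the mini-batch gives E[v] = grad f(x) (unbiased)."""
    v = np.zeros_like(x)
    for i in batch_idx:
        v += grad_fi(i, x) - grad_fi(i, x_snap)
    return v / len(batch_idx) + full_grad_snap
```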

In the ZO setting, the gradient blending (2) is approximated using only function values,

$$\hat{\mathbf{v}}_k = \frac{1}{b}\sum_{i \in \mathcal{I}_k}\big[\hat{\nabla} f_i(\mathbf{x}_k) - \hat{\nabla} f_i(\tilde{\mathbf{x}})\big] + \hat{\mathbf{g}}, \qquad (3)$$

where $\hat{\mathbf{g}} = \frac{1}{n}\sum_{i=1}^{n}\hat{\nabla} f_i(\tilde{\mathbf{x}})$, and $\hat{\nabla} f_i$ is a ZO gradient estimate specified by RandGradEst, Avg-RandGradEst or CoordGradEst. Replacing (2) with (3) in SVRG (Algorithm 1) leads to a new ZO algorithm, which we call ZO-SVRG (Algorithm 2). We highlight that although ZO-SVRG differs from SVRG only in its use of ZO gradient estimators for the batch, mini-batch, and blended gradients, this seemingly minor difference creates an essential difficulty in the analysis of ZO-SVRG: the unbiasedness assumption on gradient estimates used in SVRG no longer holds. Thus, a careful analysis of ZO-SVRG is needed.

4.2 ZO-SVRG and convergence analysis

In what follows, we focus on the analysis of ZO-SVRG using RandGradEst. Later, we will study ZO-SVRG with Avg-RandGradEst and CoordGradEst. We start by investigating the second-order moment of the blended ZO gradient estimate in the form of (3); see Proposition 1.

1: Input: total number of iterations $T$, epoch length $m$, number of epochs $S = \lceil T/m \rceil$, step sizes $\{\eta_k\}$, mini-batch size $b$, and initial point $\tilde{\mathbf{x}}^{(0)}$.
2: for $s = 1, 2, \ldots, S$ do
3:     set $\mathbf{x}_0^{(s)} = \tilde{\mathbf{x}}^{(s-1)}$ and $\mathbf{g}^{(s)} = \nabla f(\tilde{\mathbf{x}}^{(s-1)})$,
4:     for $k = 0, 1, \ldots, m-1$ do
5:         compute gradient blending via (2): $\mathbf{v}_k^{(s)} = \frac{1}{b}\sum_{i \in \mathcal{I}_k}\big[\nabla f_i(\mathbf{x}_k^{(s)}) - \nabla f_i(\tilde{\mathbf{x}}^{(s-1)})\big] + \mathbf{g}^{(s)}$,
6:         update $\mathbf{x}_{k+1}^{(s)} = \mathbf{x}_k^{(s)} - \eta_k \mathbf{v}_k^{(s)}$,
7:     end for
8:     set $\tilde{\mathbf{x}}^{(s)} = \mathbf{x}_m^{(s)}$,
9: end for
10: return $\bar{\mathbf{x}}$ chosen uniformly at random from $\{\mathbf{x}_k^{(s)}\}$.
Algorithm 1: SVRG (mini-batch variant)
1: Input: in addition to the parameters in SVRG, set the smoothing parameter $\mu > 0$.
2: for $s = 1, 2, \ldots, S$ do
3:     compute the ZO estimate $\hat{\mathbf{g}}^{(s)} = \frac{1}{n}\sum_{i=1}^{n}\hat{\nabla} f_i(\tilde{\mathbf{x}}^{(s-1)})$,
4:     set $\mathbf{x}_0^{(s)} = \tilde{\mathbf{x}}^{(s-1)}$,
5:     for $k = 0, 1, \ldots, m-1$ do
6:         compute the ZO gradient blending (3): $\hat{\mathbf{v}}_k^{(s)} = \frac{1}{b}\sum_{i \in \mathcal{I}_k}\big[\hat{\nabla} f_i(\mathbf{x}_k^{(s)}) - \hat{\nabla} f_i(\tilde{\mathbf{x}}^{(s-1)})\big] + \hat{\mathbf{g}}^{(s)}$,
7:         update $\mathbf{x}_{k+1}^{(s)} = \mathbf{x}_k^{(s)} - \eta_k \hat{\mathbf{v}}_k^{(s)}$,
8:     end for
9:     set $\tilde{\mathbf{x}}^{(s)} = \mathbf{x}_m^{(s)}$,
10: end for
11: return $\bar{\mathbf{x}}$ chosen uniformly at random from $\{\mathbf{x}_k^{(s)}\}$.
Algorithm 2: ZO-SVRG
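A minimal Python sketch of Algorithm 2 with RandGradEst (ours, reusing `rand_grad_est` from the Section 3.2 sketch; parameter values are illustrative, and whether the two estimates in each blended term share a random direction is left open here, they do not):

```python
import numpy as np

def zo_svrg(fis, x0, T=1000, m=50, b=5, eta=0.01, mu=1e-3, rng=None):
    """Sketch of ZO-SVRG (Algorithm 2) using RandGradEst.
    fis: list of n black-box component functions f_i (function values only).
    Returns all inner iterates; the algorithm's output is one of them chosen
    uniformly at random."""
    rng = rng or np.random.default_rng(0)
    n, d = len(fis), x0.size
    x_snap = x0.copy()                              # snapshot point
    iterates = []
    for _ in range(int(np.ceil(T / m))):            # S = ceil(T/m) epochs
        # ZO estimate of the full gradient at the snapshot (Step 3)
        g_hat = np.mean([rand_grad_est(fis[i], x_snap, mu, rng) for i in range(n)], axis=0)
        x = x_snap.copy()
        for _ in range(m):
            batch = rng.integers(0, n, size=b)      # mini-batch, sampled with replacement
            v = np.zeros(d)
            for i in batch:                         # ZO gradient blending, Eq. (3)
                v += rand_grad_est(fis[i], x, mu, rng) - rand_grad_est(fis[i], x_snap, mu, rng)
            x = x - eta * (v / b + g_hat)           # inner update
            iterates.append(x.copy())
        x_snap = x.copy()                           # refresh the snapshot
    return iterates
```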
Proposition 1

Suppose A2 holds and RandGradEst is used in Algorithm 2. The blended ZO gradient estimate $\hat{\mathbf{v}}_k$ in (3) then satisfies

(4)

Proof: See Appendix A.3.

Compared to SVRG and its variants (reddi2016stochastic; lei2017non), the error bound (4) involves an additional error term, which is induced by the second-order moment of RandGradEst (Appendix A.1). With the aid of Proposition 1, Theorem 1 provides the convergence rate of ZO-SVRG in terms of an upper bound on $\mathbb{E}\big[\|\nabla f(\bar{\mathbf{x}})\|_2^2\big]$ at the solution $\bar{\mathbf{x}}$.

Theorem 1

Suppose A1 and A2 hold, and the random gradient estimator (RandGradEst) is used. The output $\bar{\mathbf{x}}$ of Algorithm 2 satisfies

(5)

where , , , , and

(6)
(7)

In (6)-(7), is a positive parameter ensuring , and the coefficients are given by

(8)

Proof: See Appendix A.4.

Compared to the convergence rate of SVRG as given in (reddi2016stochastic, Theorem 2), Theorem 1 exhibits two additional error terms due to the use of ZO gradient estimates. Roughly speaking, if we choose the smoothing parameter $\mu$ reasonably small, the first error term shrinks and has a non-dominant effect on the convergence rate of ZO-SVRG. The second term is more involved, relying on the epoch length $m$, the step size $\eta_k$, the smoothing parameter $\mu$, the mini-batch size $b$, and the number of optimization variables $d$. To acquire explicit dependence on these parameters and to gain deeper insight into convergence, we simplify (5) for a specific parameter setting, as formalized below.

Corollary 1

Suppose we set

(9)

, and , where is a universal constant that is independent of , , , and . Then Theorem 1 implies , , and , which yields

$$\mathbb{E}\big[\|\nabla f(\bar{\mathbf{x}})\|_2^2\big] \le O\!\left(\frac{d}{T} + \frac{1}{b}\right). \qquad (10)$$

Proof: See Appendix A.5.

It is worth mentioning that the condition on the smoothing parameter $\mu$ in Corollary 1 is less restrictive than that of several existing ZO algorithms, e.g., ZO-SGD (ghadimi2013stochastic), ZO-ADMM (liu2017zeroth) and ZO mirror descent (duchi2015optimal); one exception is ZO-SCD (lian2016comprehensive) and its variant ZO-SVRC (gu2016zeroth). Moreover, similar to (liu2017zeroth), we set the step size to scale linearly with $1/d$. Compared to the aforementioned ZO algorithms (liu2017zeroth; duchi2015optimal; ghadimi2013stochastic), the convergence performance of ZO-SVRG in (10) has an improved (linear rather than sub-linear) dependence on $T$. However, it suffers an additional error of order $O(1/b)$ inherited from (5), which is also a consequence of the last error term in (4). A recent work (hajinezhad2017zeroth, Theorem 1) identified the same side effect of RandGradEst in the context of ZO nonconvex multi-agent optimization using a method of multipliers. Therefore, a naive combination of RandGradEst and SVRG could make the algorithm converge only to a neighborhood of a stationary point, where the size of the neighborhood is controlled by the mini-batch size $b$. Our work and reference (liu2017zeroth) show that a large mini-batch indeed reduces the variance of RandGradEst and improves the convergence of ZO optimization methods. Although the tightness of the error bound (10) is not proven, we conjecture that the dependence on $d$ and $b$ could be optimal, since the $d/T$ term is consistent with SVRG-type rates and the $1/b$ term does not rely on the specific parameters selected in (9).

5 Acceleration of ZO-SVRG

In this section, we improve the iteration complexity of ZO-SVRG (Algorithm 2) by using Avg-RandGradEst and CoordGradEst, respectively. We start by comparing the squared errors of the different gradient estimates relative to the true gradient $\nabla f_i(\mathbf{x})$, as formalized in Proposition 2.

Proposition 2

Consider a gradient estimator $\hat{\nabla} f_i(\mathbf{x})$ given by RandGradEst, Avg-RandGradEst or CoordGradEst; the squared error $\mathbb{E}\big[\|\hat{\nabla} f_i(\mathbf{x}) - \nabla f_i(\mathbf{x})\|_2^2\big]$ then satisfies

(11)

Proof: See Appendix A.6.

Proposition 2 shows that, compared to CoordGradEst, both RandGradEst and Avg-RandGradEst involve an additional error term proportional to $\|\nabla f_i(\mathbf{x})\|_2^2$; for RandGradEst the factor grows with the dimension $d$, and for Avg-RandGradEst it is reduced by roughly a factor of $q$. This error is introduced by the second-order moment of gradient estimators that use random direction samples (nesterov2015random; duchi2015optimal), and it decreases as the number of direction samples $q$ increases. On the other hand, all gradient estimators share a common error term controlled by the smoothing parameter (taking $\mu_j = \mu$ for $j \in [d]$ in CoordGradEst). If $\mu$ is specified as in (9), we obtain an error term consistent with the convergence rate of ZO-SVRG in Corollary 1.
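As a quick numerical check of Proposition 2's qualitative message (our toy example, reusing the estimator sketches from Section 3.2; it is not the paper's experiment), the coordinate-wise estimator is essentially exact on a smooth toy objective, while the random estimators carry a dimension-dependent error that the averaging shrinks by roughly a factor of $q$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, mu = 100, 1e-5
H = rng.standard_normal((d, d)) / np.sqrt(d)
f = lambda x: 0.5 * x @ (H.T @ H) @ x            # smooth toy objective
true_grad = lambda x: (H.T @ H) @ x              # used only to measure the error
x = rng.standard_normal(d)

def mean_sq_err(estimate, trials=200):
    return np.mean([np.linalg.norm(estimate() - true_grad(x)) ** 2 for _ in range(trials)])

print("RandGradEst     :", mean_sq_err(lambda: rand_grad_est(f, x, mu, rng)))
print("Avg-RandGradEst :", mean_sq_err(lambda: avg_rand_grad_est(f, x, q=10, mu=mu, rng=rng)))
print("CoordGradEst    :", mean_sq_err(lambda: coord_grad_est(f, x, mu)))
```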

In Theorem 2, we show the effect of Avg-RandGradEst on the convergence rate of ZO-SVRG.

Theorem 2

Suppose A1 and A2 hold, and Avg-RandGradEst is used in Algorithm 2. Then $\mathbb{E}\big[\|\nabla f(\bar{\mathbf{x}})\|_2^2\big]$ is bounded in the same form as (5), with the parameters therein modified to account for the $q$ random direction samples (see Appendix A.7). Given the setting in Corollary 1, the convergence rate simplifies to

$$\mathbb{E}\big[\|\nabla f(\bar{\mathbf{x}})\|_2^2\big] \le O\!\left(\frac{d}{T} + \frac{1}{b\min\{d,q\}}\right). \qquad (12)$$

Proof: See Appendix A.7

In contrast with Corollary 1, it can be seen from (12) that the use of Avg-RandGradEst reduces the $O(1/b)$ error in (10) through multiple ($q$) direction samples, and the convergence rate ceases to improve significantly once $q$ exceeds $d$. Our empirical results show that a moderate choice of $q$ can significantly speed up the convergence of ZO-SVRG.

We next study the effect of the coordinate-wise gradient estimator (CoordGradEst) on the convergence rate of ZO-SVRG, as formalized in Theorem 3.

Theorem 3

Suppose A1 and A2 hold, and CoordGradEst is used in Algorithm 2. Then

(13)

where the quantities are analogous to those defined in (5), with the parameters modified for CoordGradEst and a positive parameter chosen accordingly (see Appendix A.8). Given the specific setting in Corollary 1, the convergence rate simplifies to

$$\mathbb{E}\big[\|\nabla f(\bar{\mathbf{x}})\|_2^2\big] \le O\!\left(\frac{d}{T}\right). \qquad (14)$$

Proof: See Appendix A.8.

Theorem 3 shows that the use of CoordGradEst improves the iteration complexity: the error of order $O(1/b)$ in Corollary 1, or $O(1/(b\min\{d,q\}))$ in Theorem 2, is eliminated in (14). This improvement benefits from the low variance of CoordGradEst shown in Proposition 2; it can also be seen by comparing the corresponding quantity in Theorem 3 with (7), as the former avoids the additional second-moment term. The disadvantage of CoordGradEst is that it needs $d$ times more function queries than RandGradEst for gradient estimation.

Recall that RandGradEst, Avg-RandGradEst and CoordGradEst require $O(1)$, $O(q)$ and $O(d)$ function queries per gradient estimate, respectively. In ZO-SVRG (Algorithm 2), the total number of gradient evaluations is $O(Sn + Tb)$, where $S = \lceil T/m \rceil$ is the number of epochs. Therefore, for a fixed number of iterations $T$, the function query complexity of ZO-SVRG under the studied estimators is $O(Sn + Tb)$, $O(q(Sn + Tb))$ and $O(d(Sn + Tb))$, respectively. In Table 1, we summarize the convergence rates and the function query complexities of ZO-SVRG and its two variants, which we call ZO-SVRG-Ave and ZO-SVRG-Coord, respectively. For comparison, we also present the results of ZO-SGD (ghadimi2013stochastic) and ZO-SVRC (gu2016zeroth), where the latter updates a block of coordinates per iteration within an epoch. Table 1 shows that ZO-SGD has the lowest query complexity but the worst convergence rate. ZO-SVRG-Coord yields the best convergence rate at the cost of high query complexity. By contrast, ZO-SVRG (with an appropriate mini-batch size) and ZO-SVRG-Ave can achieve better trade-offs between the convergence rate and the query complexity.

Method (gradient estimator) | Convergence rate | Function query complexity
ZO-SVRG (RandGradEst) | $O(d/T + 1/b)$ | $O(Sn + Tb)$
ZO-SVRG-Ave (Avg-RandGradEst) | $O\big(d/T + \tfrac{1}{b\min\{d,q\}}\big)$ | $O\big(q(Sn + Tb)\big)$
ZO-SVRG-Coord (CoordGradEst) | $O(d/T)$ | $O\big(d(Sn + Tb)\big)$
ZO-SGD (ghadimi2013stochastic) (RandGradEst) | $O(\sqrt{d/T})$ | $O(Tb)$
ZO-SVRC (gu2016zeroth) (CoordGradEst) | $O(1/T)$ | –
Here $S = \lceil T/m \rceil$ denotes the number of epochs.

Table 1: Summary of convergence rates and function query complexities of our proposals given $T$ iterations.
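The per-iteration accounting above can be spelled out with a tiny helper (ours; it assumes 2 queries per RandGradEst, 2q per Avg-RandGradEst, and 2d per CoordGradEst, and counts one full ZO gradient per epoch plus two mini-batch estimates per inner iteration; constants are dropped in the big-$O$ entries of Table 1):

```python
from math import ceil

def zo_svrg_query_count(T, m, n, b, d, estimator="rand", q=10):
    """Rough function-query count of ZO-SVRG over T inner iterations with epoch
    length m, n component functions, and mini-batch size b (an accounting
    assumption, not an exact constant from the paper)."""
    queries_per_estimate = {"rand": 2, "avg": 2 * q, "coord": 2 * d}[estimator]
    num_estimates = ceil(T / m) * n + 2 * T * b   # snapshot full gradients + blended pairs
    return num_estimates * queries_per_estimate

# Example: compare the three estimators under the same iteration budget.
for est in ("rand", "avg", "coord"):
    print(est, zo_svrg_query_count(T=2000, m=50, n=1000, b=10, d=100, estimator=est))
```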

6 Applications and experiments

We evaluate the performance of our proposed algorithms on two applications: black-box classification and generating adversarial examples from black-box DNNs. The first application is motivated by a real-world material science problem, where a material is classified as either a conductor or an insulator using a density functional theory (DFT) based black-box simulator (dft). The second application arises in testing the robustness of a deployed DNN via iterative model queries (papernot2016practical; chen2017zoo).

Black-box binary classification

We consider a non-linear least squares problem (xu2017second, Sec. 3.2), i.e., problem (1) with $f_i(\mathbf{x}) = \big(y_i - \phi(\mathbf{x}; \mathbf{a}_i)\big)^2$ for $i \in [n]$. Here $(\mathbf{a}_i, y_i)$ is the $i$th data sample containing feature vector $\mathbf{a}_i$ and label $y_i$, and $\phi$ is a black-box function that only returns the function value given an input. The dataset consists of crystalline materials/compounds extracted from the Open Quantum Materials Database (oqmd). Each compound is described by chemical features, and its label ($1$ for conductor, $0$ for insulator) is determined by a DFT simulator (vasp). Due to the black-box nature of DFT, the true $\phi$ is unknown (one can mimic the DFT simulator using a logistic function once the parameter $\mathbf{x}$ is learned by ZO algorithms). We split the dataset into two equal parts, yielding training and testing sets of equal size. We refer readers to Appendix A.10 for more details on the dataset and the experimental setting.
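A hedged sketch of this objective (ours): since the DFT simulator is not available here, we mimic the black-box function with a logistic model, as the parenthetical note above suggests, and use placeholder data shapes purely to make the pipeline runnable:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10                                    # placeholder sample count / feature count
A = rng.standard_normal((n, d))                   # chemical feature vectors (placeholder)
y = (rng.random(n) > 0.5).astype(float)           # 1 = conductor, 0 = insulator (placeholder)

def phi(x, a):
    """Stand-in for the black-box simulator: only its output value is observable."""
    return 1.0 / (1.0 + np.exp(-a @ x))

def make_fi(i):
    # component loss f_i(x) = (y_i - phi(x; a_i))^2 in problem (1)
    return lambda x: (y[i] - phi(x, A[i])) ** 2

fis = [make_fi(i) for i in range(n)]
# e.g., feed into the ZO-SVRG sketch from Section 4: zo_svrg(fis, x0=np.zeros(d), b=10)
```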

Figure 1: Comparison of different ZO algorithms for the task of chemical material classification. (a) Training loss versus iterations; (b) training loss versus function queries.

Table 2: Testing error for chemical material classification under a fixed budget of function queries, reporting the number of epochs and the testing error (%) for ZO-SGD (ghadimi2013stochastic), ZO-SVRC (gu2016zeroth), ZO-SVRG, ZO-SVRG-Coord, and ZO-SVRG-Ave.

In Fig. 1, we present the training loss against the number of epochs (i.e., iterations divided by the epoch length $m$) and against the number of function queries. We compare our proposed algorithms ZO-SVRG, ZO-SVRG-Coord and ZO-SVRG-Ave with ZO-SGD (ghadimi2013stochastic) and ZO-SVRC (gu2016zeroth). Fig. 1-(a) presents the convergence trajectories of the ZO algorithms as functions of the number of epochs, where ZO-SVRG is evaluated under different mini-batch sizes $b$. We observe that the convergence error of ZO-SVRG decreases as $b$ increases, and for a small mini-batch size, ZO-SVRG converges only to a neighborhood of a critical point, as predicted by Corollary 1. We also note that our proposed algorithms ZO-SVRG (with a larger mini-batch), ZO-SVRG-Coord and ZO-SVRG-Ave converge faster (i.e., have lower iteration complexity) than the existing algorithms ZO-SGD and ZO-SVRC. In particular, the use of multiple random direction samples in Avg-RandGradEst significantly accelerates ZO-SVRG, since the error of order $O(1/b)$ is reduced to $O(1/(b\min\{d,q\}))$ (see Table 1), making it a non-dominant factor in the convergence rate of ZO-SVRG-Ave. Fig. 1-(b) presents the training loss against the number of function queries. For the same experiment, Table 2 shows the number of iterations and the testing error of the algorithms studied in Fig. 1-(b) under a fixed budget of function queries. We observe that the performance of the CoordGradEst-based algorithms (i.e., ZO-SVRC and ZO-SVRG-Coord) degrades due to the large number of function queries needed to construct coordinate-wise gradient estimates. By contrast, algorithms based on random gradient estimators (i.e., ZO-SGD, ZO-SVRG and ZO-SVRG-Ave) yield better training and testing results, although ZO-SGD requires an extremely large number of epochs. As a result, ZO-SVRG (with an appropriate mini-batch size) and ZO-SVRG-Ave achieve better trade-offs between the iteration and the function query complexity.

Generation of adversarial examples from black-box DNNs

In image classification, adversarial examples refer to carefully crafted perturbations that, when added to natural images, are visually imperceptible but lead the target model to misclassify. In the setting of 'zeroth-order' attacks (madry17; chen2017zoo; carlini2017towards), the model parameters are hidden and acquiring their gradients is inadmissible; only model evaluations are accessible. We can then cast the task of generating a universal adversarial perturbation (applied to a set of natural images) as a ZO optimization problem of the form (1). We elaborate on the problem formulation for generating adversarial examples in Appendix A.11.
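As a rough illustration only (the exact formulation is in Appendix A.11), one common way to write a per-image component loss $f_i$ for a universal perturbation $\boldsymbol{\delta}$ combines a hinge on the black-box model's class scores with a distortion penalty; `model_query`, `lam`, and `kappa` below are placeholders:

```python
import numpy as np

def attack_loss_i(delta, image, true_label, model_query, lam=0.1, kappa=0.0):
    """Schematic per-image loss f_i(delta) for a universal adversarial perturbation.
    model_query(img) -> vector of class scores from the black-box DNN (placeholder).
    The hinge term hits its floor once the image is misclassified; lam weights the
    l2 distortion. This mirrors the hinge-like loss mentioned in Sec. 6, not the
    exact objective of Appendix A.11."""
    scores = model_query(np.clip(image + delta, 0.0, 1.0))
    others = np.delete(scores, true_label)
    hinge = max(scores[true_label] - others.max(), -kappa)
    return hinge + lam * float(delta @ delta)
```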

We use a well-trained DNN (https://github.com/carlini/nn_robust_attacks) on the MNIST handwritten digit classification task as the target black-box model, which achieves 99.4% test accuracy on natural examples. Two ZO optimization methods, ZO-SGD and ZO-SVRG-Ave, are evaluated in this experiment; note that ZO-SVRG-Ave reduces to ZO-SVRG when $q = 1$. We choose images from the same class, and set the same parameters and a constant step size (chosen relative to the image dimension $d$) for both ZO methods. For ZO-SVRG-Ave, we vary the number of random direction samples $q$. In Fig. 2, we show the black-box attack loss (against the number of epochs) as well as the least $\ell_2$ distortion of the successful (universal) adversarial perturbations. To reach the same attack loss, ZO-SVRG-Ave requires more function evaluations than ZO-SGD, with the overhead growing with $q$. The sharp drop of the attack loss in each method could be caused by the hinge-like loss that is part of the total loss function, which becomes $0$ only once the attack is successful. Compared to ZO-SGD, ZO-SVRG-Ave offers faster convergence to a more accurate solution, and its convergence trajectory becomes more stable as $q$ grows (due to the reduced variance of Avg-RandGradEst). In addition, ZO-SVRG-Ave improves the $\ell_2$ distortion of the adversarial examples compared to ZO-SGD. We present the corresponding adversarial examples in Appendix A.11.

Figure 2: Comparison of ZO-SGD and ZO-SVRG-Ave (for several values of $q$) for generation of universal adversarial perturbations from a black-box DNN. Left: attack loss versus iterations. Right: $\ell_2$ distortion and improvement with respect to ZO-SGD.

7 Conclusion

In this paper, we studied ZO-SVRG, a new ZO nonconvex optimization method. We presented new convergence results beyond the existing work on ZO nonconvex optimization. We show that ZO-SVRG improves the convergence rate of ZO-SGD from $O(\sqrt{d/T})$ to $O(d/T)$ but suffers a new correction term of order $O(1/b)$, which is a side effect of combining a two-point random gradient estimator with SVRG. We then propose two accelerated variants of ZO-SVRG based on improved gradient estimators of reduced variance, and we show an illuminating trade-off between the iteration and the function query complexity. Experimental results and theoretical analysis validate the effectiveness of our approaches compared to other state-of-the-art algorithms.

References

  • (1) N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, “Practical black-box attacks against machine learning,” in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, 2017, pp. 506–519.
  • (2) A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.
  • (3) P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh, “Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 2017, pp. 15–26.
  • (4) T. Chen and G. B. Giannakis, “Bandit convex optimization for scalable and dynamic iot management,” arXiv preprint arXiv:1707.09060, 2017.
  • (5) S. Liu, J. Chen, P.-Y. Chen, and A. O. Hero, “Zeroth-order online admm: Convergence analysis and applications,” in Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, April 2018, vol. 84, pp. 288–297.
  • (6) M. C. Fu, “Optimization for simulation: Theory vs. practice,” INFORMS Journal on Computing, vol. 14, no. 3, pp. 192–215, 2002.
  • (7) X. Lian, H. Zhang, C.-J. Hsieh, Y. Huang, and J. Liu, “A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order,” in Advances in Neural Information Processing Systems, 2016, pp. 3054–3062.
  • (8) R. P. Brent, Algorithms for minimization without derivatives, Courier Corporation, 2013.
  • (9) J. C. Spall, Introduction to stochastic search and optimization: estimation, simulation, and control, vol. 65, John Wiley & Sons, 2005.
  • (10) A. D. Flaxman, A. T. Kalai, and H. B. McMahan, “Online convex optimization in the bandit setting: gradient descent without a gradient,” in Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, 2005, pp. 385–394.
  • (11) O. Shamir, “On the complexity of bandit and derivative-free stochastic convex optimization,” in Conference on Learning Theory, 2013, pp. 3–24.
  • (12) A. Agarwal, O. Dekel, and L. Xiao, “Optimal algorithms for online convex optimization with multi-point bandit feedback,” in COLT, 2010, pp. 28–40.
  • (13) Y. Nesterov and V. Spokoiny, “Random gradient-free minimization of convex functions,” Foundations of Computational Mathematics, vol. 2, no. 17, pp. 527–566, 2015.
  • (14) J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono, “Optimal rates for zero-order convex optimization: The power of two function evaluations,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2788–2806, 2015.
  • (15) O. Shamir, “An optimal algorithm for bandit and zero-order convex optimization with two-point feedback,” Journal of Machine Learning Research, vol. 18, no. 52, pp. 1–11, 2017.
  • (16) X. Gao, B. Jiang, and S. Zhang, “On the information-adaptive variants of the admm: an iteration complexity perspective,” Optimization Online, vol. 12, 2014.
  • (17) P. Dvurechensky, A. Gasnikov, and E. Gorbunov, “An accelerated method for derivative-free smooth stochastic convex optimization,” arXiv preprint arXiv:1802.09022, 2018.
  • (18) Y. Wang, S. Du, S. Balakrishnan, and A. Singh, “Stochastic zeroth-order optimization in high dimensions,” in Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics. April 2018, vol. 84, pp. 1356–1365, PMLR.
  • (19) R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in Advances in neural information processing systems, 2013, pp. 315–323.
  • (20) S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola, “Stochastic variance reduction for nonconvex optimization,” in International conference on machine learning, 2016, pp. 314–323.
  • (21) A. Nitanda, “Accelerated stochastic gradient descent for minimizing finite sums,” in Artificial Intelligence and Statistics, 2016, pp. 195–203.
  • (22) Z. Allen-Zhu and Y. Yuan, “Improved svrg for non-strongly-convex or sum-of-non-convex objectives,” in International conference on machine learning, 2016, pp. 1080–1089.
  • (23) L. Lei, C. Ju, J. Chen, and M. I. Jordan, “Non-convex finite-sum optimization via scsg methods,” in Advances in Neural Information Processing Systems, 2017, pp. 2345–2355.
  • (24) S. Ghadimi and G. Lan, “Stochastic first-and zeroth-order methods for nonconvex stochastic programming,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.
  • (25) D. Hajinezhad, M. Hong, and A. Garcia, “Zeroth order nonconvex multi-agent optimization over networks,” arXiv preprint arXiv:1710.09997, 2017.
  • (26) B. Gu, Z. Huo, and H. Huang, “Zeroth-order asynchronous doubly stochastic algorithm with variance reduction,” arXiv preprint arXiv:1612.01425, 2016.
  • (27) K. M. Choromanski and V. Sindhwani, “On blackbox backpropagation and jacobian sensing,” in Advances in Neural Information Processing Systems, 2017, pp. 6524–6532.
  • (28) G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, and J. Sohl-Dickstein, “Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models,” in Advances in Neural Information Processing Systems, 2017, pp. 2624–2633.
  • (29) W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud, “Backpropagation through the void: Optimizing control variates for black-box gradient estimation,” arXiv preprint arXiv:1711.00123, 2017.
  • (30) N. S. Chatterji, N. Flammarion, Y.-A. Ma, P. L. Bartlett, and M. I. Jordan, “On the theory of variance reduction for stochastic gradient monte carlo,” arXiv preprint arXiv:1802.05431, 2018.
  • (31) W. Yang and P. W. Ayers, “Density-functional theory,” in Computational Medicinal Chemistry for Drug Discovery, pp. 103–132. CRC Press, 2003.
  • (32) P. Xu, F. Roosta-Khorasan, and M. W. Mahoney, “Second-order optimization for non-convex machine learning: An empirical study,” arXiv preprint arXiv:1708.07827, 2017.
  • (33) S. Kirklin, J. E. Saal, B. Meredig, A. Thompson, J. W. Doak, M. Aykol, S. Rühl, and C. Wolverton, “The open quantum materials database (oqmd): assessing the accuracy of dft formation energies,” npj Computational Materials, vol. 1, pp. 15010, 2015.
  • (34) G. Kresse and J. Furthmüller, “Efficiency of ab-initio total energy calculations for metals and semiconductors using a plane-wave basis set,” Computational materials science, vol. 6, no. 1, pp. 15–50, 1996.
  • (35) N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in IEEE Symposium on Security and Privacy, 2017, pp. 39–57.
  • (36) S. Shalev-Shwartz, “Online learning and online convex optimization,” Foundations and Trends® in Machine Learning, vol. 4, no. 2, pp. 107–194, 2012.
  • (37) L. Ward, A. Agrawal, A. Choudhary, and C. Wolverton, “A general-purpose machine learning framework for predicting properties of inorganic materials,” npj Computational Materials, vol. 2, pp. 16028, 2016.

Appendix A Supplementary material

A.1 Zeroth-order (ZO) gradient estimators

With an abuse of notation, in this section let $f$ be an arbitrary function satisfying Assumptions A1 and A2. Lemma 1 shows the second-order statistics of RandGradEst.

Lemma 1

Suppose that Assumption A1 holds, and define the smoothed function $f_\mu(\mathbf{x}) = \mathbb{E}_{\mathbf{v}}[f(\mathbf{x} + \mu\mathbf{v})]$, where $\mathbf{v}$ follows the uniform distribution over the unit Euclidean ball. Then RandGradEst yields:

1) $f_\mu$ is $L$-smooth, and

$$\nabla f_\mu(\mathbf{x}) = \mathbb{E}_{\mathbf{u}}\big[\hat{\nabla} f(\mathbf{x})\big], \qquad (15)$$

where $\mathbf{u}$ is drawn from the uniform distribution over the unit Euclidean sphere, and $\hat{\nabla} f(\mathbf{x})$ is given by RandGradEst.

2) For any $\mathbf{x} \in \mathbb{R}^d$,

(16)
(17)
(18)

3) For any $\mathbf{x} \in \mathbb{R}^d$,

(19)

Proof: First, by using (gao2014information, Lemma 4.1.a) (also see (shalev2012online) and (nesterov2015random)), we immediately obtain that $f_\mu$ is smooth (with a smoothness constant at most $L$) and

(20)

Since , we obtain (15).

Next, we obtain (16)-(18) based on (gao2014information, Lemma 4.1.b). Moreover, we have

(21)

where the first inequality holds due to Lemma 5.

The first inequality of (19) holds due to (15) and the fact that $\|\mathbb{E}[\mathbf{z}]\|^2 \le \mathbb{E}\big[\|\mathbf{z}\|^2\big]$ for a random vector $\mathbf{z}$, and the second inequality of (19) holds due to (gao2014information, Lemma 4.1.b). The proof is now complete.

In Lemma 2, we show the properties of Avg-RandGradEst.

Lemma 2

Under the conditions of Lemma 1, Avg-RandGradEst yields:

1) For any $\mathbf{x} \in \mathbb{R}^d$,

(22)

where $\hat{\nabla} f(\mathbf{x})$ is given by Avg-RandGradEst.

2) For any $\mathbf{x} \in \mathbb{R}^d$,

(23)

Proof: Since $\{\mathbf{u}_j\}_{j=1}^{q}$ are i.i.d. random vectors, we have

(24)

where .

In (23), the first inequality holds due to (22) and the fact that $\|\mathbb{E}[\mathbf{z}]\|^2 \le \mathbb{E}\big[\|\mathbf{z}\|^2\big]$ for a random vector $\mathbf{z}$. Next, we bound the second moment of $\hat{\nabla} f(\mathbf{x})$,

(25)

where the expectation is taken with respect to the i.i.d. random vectors $\{\mathbf{u}_j\}_{j=1}^{q}$, and we have used their mutual independence for distinct indices. Substituting (18) and (19) into (25), we obtain (23).

In Lemma 3, we demonstrate the properties of CoordGradEst.

Lemma 3

Let Assumption A1 hold and define a coordinate-wise smoothed function $f_\mu$, where the smoothing in the $j$th coordinate is taken with respect to the uniform distribution over the interval $[-\mu_j, \mu_j]$. We then have:

1) $f_\mu$ is $L$-smooth, and

(26)

where $\hat{\nabla} f(\mathbf{x})$ is defined by CoordGradEst, and $\partial / \partial x_j$ denotes the partial derivative with respect to the $j$th coordinate.

2) For any $\mathbf{x} \in \mathbb{R}^d$,

(27)
(28)

3) For any $\mathbf{x} \in \mathbb{R}^d$,

(29)

Proof: For the $j$th coordinate, it is known from (lian2016comprehensive, Lemma 6) that $f_\mu$ is $L$-smooth and

(30)

Based on (30) and the definition of CoordGradEst, we then obtain (26).

The inequalities (27) and (28) have been proved in (lian2016comprehensive, Lemma 6).

Based on (26) and (28), we have

where the first inequality holds due to Lemma 5 in Sec. A.9. The proof is now complete.

A.2 Control variates

The gradient blending step in SVRG (Algorithm 1) can be interpreted via control variates (tucker2017rebar; grathwohl2017backpropagation; chatterji2018theory). If we view $\frac{1}{b}\sum_{i\in\mathcal{I}_k}\nabla f_i(\mathbf{x}_k)$ as the raw gradient estimate at $\mathbf{x}_k$, and $\frac{1}{b}\sum_{i\in\mathcal{I}_k}\nabla f_i(\tilde{\mathbf{x}})$ as a control variate with known expectation $\nabla f(\tilde{\mathbf{x}})$, then the gradient blending (2) becomes a gradient estimate modified by a control variate. The modified estimate has the same expectation as the raw estimate, i.e., $\mathbb{E}[\mathbf{v}_k] = \nabla f(\mathbf{x}_k)$, but a lower variance when the control variate is positively correlated with the raw estimate (see the detailed analysis below).
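A tiny numerical illustration of this variance-reduction effect (ours, scalar case for simplicity): subtracting a correlated control variate with known mean keeps the estimate unbiased while shrinking its variance:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(100_000)                 # control variate with known mean E[h] = 0
g = 2.0 * h + rng.standard_normal(100_000)       # raw estimate, positively correlated with h
c = np.cov(g, h)[0, 1] / np.var(h)               # (near-)optimal coefficient c = Cov(g,h)/Var(h)
g_cv = g - c * (h - 0.0)                         # control-variate-corrected estimate
print(np.mean(g), np.mean(g_cv))                 # same mean up to sampling noise
print(np.var(g), np.var(g_cv))                   # variance drops from ~5 to ~1
```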

Consider the following gradient estimator,

$$\hat{\mathbf{g}}_{\mathrm{cv}} = \mathbf{g} - c\,\big(\mathbf{h} - \mathbb{E}[\mathbf{h}]\big), \qquad (31)$$

where $\mathbf{g}$ is a given (raw) gradient estimate, $c$ is an unknown coefficient, and $\mathbf{h}$ is a control variate with known mean $\mathbb{E}[\mathbf{h}]$. It is clear that $\hat{\mathbf{g}}_{\mathrm{cv}}$ has the same expectation as $\mathbf{g}$. We then study the effect of $c$ on the variance of $\hat{\mathbf{g}}_{\mathrm{cv}}$,