Stochastic Zeroth-order Optimization via a Variance Reduction Method
Abstract
Derivative-free optimization has become an important technique in machine learning for optimizing black-box models. To perform updates without explicitly computing gradients, most current approaches iteratively sample a random search direction from a Gaussian distribution and estimate the gradient along that direction. However, due to the variance in the search direction, the convergence rates and query complexities of existing methods suffer from a factor of $\sqrt{d}$, where $d$ is the problem dimension. In this paper, we introduce a novel Stochastic Zeroth-order method with Variance Reduction under Gaussian smoothing (SZVR-G) and establish its complexity for optimizing nonconvex problems. With variance reduction on both the sample space and the search space, the complexity of our algorithm is sublinear in $d$ and is strictly better than that of current approaches, in both the smooth and the nonsmooth case. Moreover, we extend the proposed method to the mini-batch setting. Our experimental results demonstrate the superior performance of the proposed method over existing derivative-free optimization techniques. Furthermore, we successfully apply our method to conduct a universal black-box attack against deep neural networks and present some interesting results.
1 Introduction
Derivative-free optimization methods have a long history in optimization [1]. They use only function value information rather than explicit gradient calculation to optimize a function, as in the black-box setting or when computing the partial derivatives is too expensive. Recently, derivative-free methods have received substantial attention in machine learning and deep learning [2], for example in online problems in the bandit setting [3, 4, 5], certain graphical-model and structured-prediction problems [6], and black-box attacks on deep neural networks (DNNs) [7, 8, 9]. However, the convergence rate of current approaches incurs a factor of $\sqrt{d}$, where $d$ is the problem dimension. This prevents the application of derivative-free optimization to high-dimensional problems.
This paper focuses on the theoretical development of derivative-free (zeroth-order) methods for nonconvex optimization. More specifically, we consider the following optimization problem:
$$\min_{x \in \mathbb{R}^d} \; f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1.1)$$
where $f$ and the $f_i$ are differentiable, nonconvex functions, $f_i(x) = F(x; \xi_i)$, and $\xi_i$ is a random variable. In particular, when $n=1$, the objective function is $f(x) = F(x; \xi_1)$ with a fixed $\xi_1$, which becomes the problem solved in [10]. To solve (1.1), most approaches [10] consider the use of a stochastic zeroth-order oracle ($\mathcal{SZO}$). At each iteration, for a given $x$ and $\xi_i$, the $\mathcal{SZO}$ outputs a stochastic gradient defined by
$$G_\nu(x; u, \xi_i) = \frac{F(x + \nu u; \xi_i) - F(x; \xi_i)}{\nu}\, u, \qquad u \sim \mathcal{N}(0, I_d), \qquad (1.2)$$
which approximates the derivative along the direction of $u$, where $\nu > 0$ is the smoothing parameter. Each $\mathcal{SZO}$ call requires two function value evaluations (or one if $F(x; \xi_i)$ has already been queried). It is thus natural to analyze the convergence rate of an algorithm in terms of the number of $\mathcal{SZO}$ calls required to achieve $\mathbb{E}\|\nabla f(x)\|^2 \le \epsilon$ for a small $\epsilon$.
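To make the oracle concrete, the following sketch (our own illustration, not the authors' code; the quadratic test function and all parameter values are assumptions) implements the two-point estimator (1.2) and checks that averaging many single-direction estimates recovers the true gradient of a simple quadratic, for which the Gaussian-smoothed gradient coincides with the exact one:

```python
import numpy as np

def szo(F, x, u, nu):
    # Two-point stochastic zeroth-order oracle: the gradient of F is never
    # touched; two function values estimate the derivative along direction u.
    return (F(x + nu * u) - F(x)) / nu * u

rng = np.random.default_rng(0)
d = 5
x = rng.standard_normal(d)
F = lambda z: 0.5 * np.dot(z, z)          # f(x) = ||x||^2 / 2, so grad f(x) = x

# Average many single-direction estimates (two queries each); for this
# quadratic the Gaussian-smoothed gradient equals the true gradient x.
est = np.mean([szo(F, x, rng.standard_normal(d), 1e-4)
               for _ in range(200_000)], axis=0)
err = np.max(np.abs(est - x))
print(err)
```

Note how many estimates are needed for a tight approximation: the variance of a single estimate grows with $d$, which is exactly the dependency the paper sets out to reduce.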
An important recent work by Nesterov and Spokoiny [10] proposed the random gradient-free method (RGF) and proved tight bounds for approximating the gradient through function value information with Gaussian smoothing techniques. They established a complexity bound for nonconvex smooth functions in the case of $n=1$ in problem (1.1). Subsequently, Ghadimi and Lan [11] introduced a randomized stochastic gradient (RSG) method for solving the stochastic programming problem and proved the corresponding complexity bound. However, when the dimension $d$ is large, especially in deep learning, these derivative-free methods suffer from slow convergence.
The dependency on $d$ is mainly due to the variance in sampling the query direction $u$. Recently, a family of variance reduction methods has been proposed for first-order optimization, including SVRG [12], SCSG [13], and Natasha [14]. They developed ways to reduce the variance of the stochastic samples $\xi_i$. It is thus natural to ask the following questions: can the variance reduction technique also be used in derivative-free optimization to reduce the complexity caused by the problem dimension? And how should one choose the size of the Gaussian random vector set used in each epoch to estimate the gradient in zeroth-order optimization?
In this paper, we develop a novel stochastic zeroth-order method with variance reduction under Gaussian smoothing (SZVR-G). The main contributions are summarized below.

We propose a novel algorithm based on variance reduction. Different from RSG and RGF, which generate a Gaussian random vector for each iteration, we independently generate a Gaussian vector set (in practice, we preserve the corresponding random seeds) to compute the average of directional derivatives at the beginning of each epoch, as defined in (3.1). In the inner iterations of each epoch, we randomly select one seed, or a block of seeds, preserved in the outer epoch to compute the corresponding gradient estimate, as defined in (3.2).

We give a theoretical proof for the proposed algorithm and show that our results are better than those of RGF and RSG for both smooth and nonsmooth functions, and in both cases of problem (1.1). Furthermore, we explicitly present the parameter settings and the corresponding derivations, which helps in understanding the convergence analysis.

We extend the stochastic zeroth-order optimization to the mini-batch setting. Although the complexity increases, we show that the increase is sublinear in the batch size. In comparison, previous algorithms, including RGF and RSG, have complexity growing linearly with the batch size. Furthermore, the total number of iterations of our algorithm decreases when using a larger mini-batch, which implies better parallelizability.

We show that our algorithm is more efficient than both RGF and RSG on a canonical logistic regression problem. Furthermore, we successfully apply our algorithm to a real black-box adversarial attack problem that involves high-dimensional zeroth-order optimization.
1.1 Our results
Method   Complexity   Mini-Batch   Nonsmooth
RGF [10]
RSG [11]
SZVR-G
Our proposed algorithm can achieve the following complexity:
where . We identify an interesting dichotomy with respect to : in particular, if , the bound becomes , and otherwise it becomes . The complexities of the different methods are presented in Table 1.
Comparing our method with RGF [10] in the case of $n=1$, we can see that our result improves upon that of RGF by a factor of . For $n>1$, the complexity of our method is also better than that of RSG [11], as clearly shown in Figure 1.
Mini-Batch. Our result generalizes to the mini-batch stochastic setting, where in the inner iterations of each epoch the estimated gradient defined in (3.2) is computed a mini-batch number of times and averaged. The resulting complexity, and its comparison with other methods, is shown in Table 1.
Nonsmooth. We also give a convergence analysis for the nonsmooth case and present the corresponding complexity, which is better than that of RGF [10].
1.2 Other related work
Derivative-free optimization dates back to the early days of the development of optimization theory [1]. The advantage of a derivative-free method is manifest when computing the function value is much simpler than computing the gradient, or in the black-box setting when the optimizer does not have full information about the function.
The most common method for derivative-free optimization is the random optimization approach [1], which samples a random vector uniformly distributed over the unit sphere, computes the directional derivative of the function, and then moves to the next point if the update leads to a decrease of the function value. However, no particular convergence rate was established. Nesterov and Spokoiny [10] presented several random derivative-free methods and provided the corresponding complexity bounds for both convex and nonconvex problems. Moreover, an important kind of smoothing, Gaussian smoothing, and its properties were established. Ghadimi and Lan [11] incorporated the Gaussian smoothing technique into the randomized stochastic gradient (RSG) method. Duchi et al. [5, 15] analyzed the finite-sample convergence rate of zeroth-order optimization for convex problems. Wang et al. [16] considered zeroth-order optimization in high dimensions, but also only for convex functions. For coordinate smoothness (where the sampled direction is along the natural basis), Lian et al. [17] presented a zeroth-order asynchronous stochastic parallel optimization method for nonconvex problems. Subsequently, Gu et al. [18] applied zeroth-order variance reduction to an asynchronous doubly stochastic algorithm, however without a specific analysis of the complexity related to the dimension $d$. Furthermore, it is not practical to perform full gradient computation in the parallel setting for large-scale data.
Stochastic first-order methods, including SGD [19] and SVRG [20], have been studied extensively. However, these two algorithms suffer from either high iteration complexity or a complexity that depends on the number of samples $n$. Lei et al. [13] recently proposed the stochastically controlled stochastic gradient (SCSG) method to obtain an improved complexity, which builds on [21] and [22] for the convex case.
The rest of the paper is organized as follows. We first introduce some notation, definitions, and assumptions in Section 2. In Section 3, we present our algorithm based on the variance reduction technique and analyze its complexity for both smooth and nonsmooth functions, together with the corresponding mini-batch versions. Experimental results are shown in Section 4. Section 5 concludes the paper.
2 Preliminaries
Throughout this paper, $\|\cdot\|$ denotes the Euclidean norm. We write $u \sim \mathcal{N}(0, I_d)$ to denote that $u$ is generated from the standard Gaussian distribution. We denote by $B$ and $D$ the sample set and the Gaussian vector set, and by $|B|$ and $|D|$ their cardinalities. We use $\mathbb{1}\{\cdot\}$ to denote the indicator function of a probabilistic event. Below, we give definitions of the smoothness of a function, the directional derivative, and the smoothed approximation of a function together with its properties.
Definition 2.1.
For a function $f: \mathbb{R}^d \to \mathbb{R}$ and all $x, y \in \mathbb{R}^d$:

if $|f(x) - f(y)| \le L_0 \|x - y\|$, then $f \in C^{0,0}(L_0)$, i.e., $f$ is Lipschitz continuous with constant $L_0$;

if $\|\nabla f(x) - \nabla f(y)\| \le L_1 \|x - y\|$, then $f \in C^{1,1}(L_1)$, i.e., $f$ is $L_1$-smooth, and $|f(y) - f(x) - \langle \nabla f(x), y - x \rangle| \le \frac{L_1}{2} \|x - y\|^2$.

Note that if each $f_i \in C^{1,1}(L_1)$, then $f \in C^{1,1}(L_1)$, due to the fact that $\|\nabla f(x) - \nabla f(y)\| \le \frac{1}{n} \sum_{i=1}^{n} \|\nabla f_i(x) - \nabla f_i(y)\| \le L_1 \|x - y\|$.
Definition 2.2.
The Gaussian smooth approximation of $f$ is defined as
$$f_\nu(x) = \mathbb{E}_{u \sim \mathcal{N}(0, I_d)}\big[f(x + \nu u)\big]. \qquad (2.1)$$
Its corresponding gradient is $\nabla f_\nu(x) = \mathbb{E}_{u}\big[\frac{f(x + \nu u) - f(x)}{\nu} u\big]$, and $\nabla f_\nu(x) = \mathbb{E}_{u, \xi}\big[G_\nu(x; u, \xi)\big]$, where $G_\nu$ is defined in (1.2). The details of the gradient derivation can be found in [10].
Assumption 2.1.
We assume that $\sigma^2$ is an upper bound on the variance of the stochastic function $F$, that is, $\mathbb{E}_{\xi}\big[\|\nabla F(x; \xi) - \nabla f(x)\|^2\big] \le \sigma^2$.
3 Stochastic Zeroth-order Optimization via Variance Reduction with Gaussian Smoothing
We introduce our SZVR-G method in Algorithm 1. At each outer iteration, we perform two kinds of sampling: the first forms the sample set $B$, randomly selected from the $n$ samples; the second independently generates a Gaussian vector set $D$. Furthermore, we store the corresponding seeds of the Gaussian vectors, which will be reused in the inner iterations. The main difference between the sets $B$ and $D$ is the independence property of their elements, which is the key element in analyzing their sizes. Based on these two sets, we compute the random gradient at a snapshot point $\tilde{x}$, which is maintained for each epoch,
$$\tilde{g} = \frac{1}{|B|\,|D|} \sum_{i \in B} \sum_{u \in D} G_\nu(\tilde{x}; u, \xi_i), \qquad (3.1)$$
where $G_\nu$ is defined in (1.2).
At each inner iteration, we randomly select $u$ and $\xi_i$ from $D$ and $B$, respectively, and compute the estimated random gradient,
$$v_t = G_\nu(x_t; u, \xi_i) - G_\nu(\tilde{x}; u, \xi_i) + \tilde{g}, \qquad (3.2)$$
where $D$ and $B$ are the Gaussian vector set and the sample set, respectively. Taking the expectation of $v_t$ with respect to $u$, $\xi_i$, and the generation of the two sets, we have
$$\mathbb{E}[v_t] = \nabla f_\nu(x_t), \qquad (3.3)$$
where $f_\nu$ and $\nabla f_\nu$ are defined in Definition 2.2.
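A minimal sketch of this epoch structure follows (our own illustrative reconstruction of Algorithm 1, not the authors' implementation; the step size, set sizes, epoch counts, and the toy finite-sum objective are all assumptions). The stored rows of `D` play the role of the preserved seeds, and the inner estimator (3.2) corrects the stale snapshot gradient (3.1):

```python
import numpy as np

def g_nu(F, x, u, nu):
    # Two-point directional estimator, as in (1.2).
    return (F(x + nu * u) - F(x)) / nu * u

def szvr_g(f_list, x0, eta=0.05, nu=1e-5, epochs=5, m=50, b=10, D_size=10, seed=0):
    # Epoch structure of Algorithm 1 (sketch): variance reduction over both
    # the sample index i (set B) and the Gaussian direction u (set D).
    rng = np.random.default_rng(seed)
    n, x = len(f_list), x0.copy()
    for _ in range(epochs):
        snap = x.copy()                                     # snapshot point
        B = rng.choice(n, size=min(b, n), replace=False)    # sample set B
        D = rng.standard_normal((D_size, x.size))           # stored directions
        g_snap = np.mean([g_nu(f_list[i], snap, u, nu)
                          for i in B for u in D], axis=0)   # Eq. (3.1)
        for _ in range(m):
            i, u = rng.choice(B), D[rng.integers(D_size)]   # reuse a stored u
            v = g_nu(f_list[i], x, u, nu) \
                - g_nu(f_list[i], snap, u, nu) + g_snap     # Eq. (3.2)
            x = x - eta * v
    return x

# Toy finite-sum quadratic: f_i(x) = ||x - a_i||^2 / 2.
data = np.random.default_rng(1).standard_normal((20, 5))
f_list = [lambda z, a=a: 0.5 * np.dot(z - a, z - a) for a in data]
f = lambda z: float(np.mean([fi(z) for fi in f_list]))
x0 = np.full(5, 3.0)
x_out = szvr_g(f_list, x0)
print(f(x0), f(x_out))    # the averaged loss drops sharply
```

The design point the section emphasizes is visible here: near the snapshot, the two oracle calls in (3.2) nearly cancel, so the estimator's variance shrinks as the inner iterate approaches `snap`.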
3.1 Convergence analysis
We present the convergence and complexity results for our algorithm. Theorem 3.1 is based on the variance reduction technique for the nonconvex problem. The detailed proof can be found in Appendix B. In order to ensure convergence, we present the parameter settings, such as the step size, the smoothing parameter, the set sizes, and the number of inner iterations, in Remarks B.1 and B.2.
Theorem 3.1.
The complexity result is presented in Theorem 3.2, which is based on the best choice of the step size. For different sizes of $B$ and $D$, we obtain different results, an interesting phenomenon caused by the two types of samples.
Theorem 3.2.
In Algorithm 1, for , let the size of the sample set be , = , the step size , , the number of inner iterations , and the size of the Gaussian vector set . In order to obtain
the total number of $\mathcal{SZO}$ calls is at most , with total number of iterations .
3.1.1 Variance reduction for Gaussian random directions
If we only consider the Gaussian random directions, that is, $n=1$, Algorithm 1 is similar to SVRG, but the variance reduction acts on random directions instead of random samples. In the outer iteration, we independently generate the Gaussian random vectors and compute the smoothed gradient estimator in (3.1) (here, $\xi_1$ indicates the only sample). Then, in the inner iteration, we randomly select a Gaussian vector and compute the estimated gradient as in (3.2). Since this is the same problem solved by Nesterov and Spokoiny [10], we compare the complexity of our method with theirs based on different step-size choices:

For , we set , and the complexity of our proposed method is , which is better than that of RGF [10]. The corresponding Gaussian random vector set satisfies , . This is due to the fact that the objective does not have a finite-sum structure, and the term , which is bounded by . More details can be found in Lemma B.1. This is the key difference from the SVRG method [20]. Based on the lower bound, we can derive the corresponding best complexity and best step size, as shown in Theorem 3.2.

For , the complexity will be larger than . This can be seen directly from the total number of $\mathcal{SZO}$ calls. In this case, the size of the Gaussian vector set becomes 1, and the proposed algorithm reduces to the original RGF [10] method, where the step size is . This explains why the variance reduction method is better than RGF: our proposed method can use a larger step size to obtain a better complexity.
3.1.2 Variance reduction for finite-sum functions
For the finite-sum function in (1.1), Algorithm 1 applies the variance reduction technique simultaneously to both the Gaussian vectors and the random variables $\xi_i$. Our algorithm has two kinds of random procedures: in the outer iteration, we compute the gradient using both the $B$ samples and the $D$ Gaussian random vectors; in the inner iteration, we randomly select a sample and a Gaussian random vector to estimate the gradient. Here, we compare our result with that of RSG [11], which also uses both random samples and Gaussian random vectors. Based on the result in Theorem 3.2, we discuss the complexity under different regimes:

For , the complexity of our proposed method is . This result is similar to that of SCSG [13] if the dimension $d$ is not too large. Furthermore, in our algorithm, we set $B$ to a fixed value rather than a value produced by a probability distribution. If $B=d$, the complexity result looks the same as that of RSG [11]. But the difference lies in that $B$ is no more than , so that our result is better than that of RSG [11]. Figure 1 clearly shows the difference.

For , the complexity becomes , which is also better than that of RSG.
3.2 Mini-batch SZVR-G
We extend SZVR-G to the mini-batch version in Algorithm 2, which is similar to Algorithm 1. The difference is that in the inner iterations of each epoch we compute the estimated gradient a mini-batch number of times and average the results. Theorem 3.3 gives the corresponding complexity and step size.
Theorem 3.3.
From the above theorem, we can see that the complexity is increased by a factor that is smaller than the size of the mini-batch. However, the corresponding complexities of RGF and RSG are increased by a factor equal to the mini-batch size (see Table 1), so our algorithm has a better dependency on the batch size. Furthermore, our total number of iterations decreases by a corresponding factor.
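To illustrate why averaging inside the inner iteration helps, the following check (our own toy example; the quadratic objective, dimension, and batch size are assumptions) compares the empirical mean-squared error of a single two-point estimate against the average of a mini-batch of estimates with independent directions; the error shrinks roughly by the batch-size factor:

```python
import numpy as np

rng = np.random.default_rng(0)
d, nu, b, trials = 20, 1e-4, 10, 5000
x = rng.standard_normal(d)
F = lambda z: 0.5 * np.dot(z, z)          # grad F(x) = x

def estimate(batch):
    # Mini-batch oracle: average of `batch` independent two-point estimates.
    U = rng.standard_normal((batch, d))
    return np.mean([(F(x + nu * u) - F(x)) / nu * u for u in U], axis=0)

# Empirical mean-squared error of the estimator, single vs. mini-batch.
mse1 = np.mean([np.sum((estimate(1) - x) ** 2) for _ in range(trials)])
mseb = np.mean([np.sum((estimate(b) - x) ** 2) for _ in range(trials)])
print(mse1 / mseb)    # close to b: averaging cuts the variance by ~1/b
```

Each averaged estimate costs `b` times as many queries, which is why a mini-batch trades query complexity for fewer (and parallelizable) iterations, as discussed above.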
3.3 SZVR-G for nonsmooth functions
For nonsmooth functions, we also provide a theoretical analysis and give the corresponding complexity. Similar to Theorem 3.2, we analyze convergence based on the norm of the gradient. The difference lies in that convergence is measured by the gradient norm of the smoothed function $f_\nu$ rather than that of $f$. As stated in [10], letting the smoothing parameter $\nu$ go to zero, the convergence of $\|\nabla f_\nu\|$ ensures convergence to a stationary point of the initial function.
Theorem 3.4.
In Algorithm 1, for , let the step size be , , the number of inner iterations , and the Gaussian vector set size . In order to obtain
the total number of $\mathcal{SZO}$ calls is , and the number of inner iterations is .
4 Experimental results
4.1 Logistic regression with the stochastic zeroth-order method
In order to verify our theory, we apply our algorithm to logistic regression. Given $N$ training examples $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ and $y_i$ are the feature vector and the label of the $i$-th example, the objective function is
We use the MNIST [23] dataset to conduct two kinds of experiments in order to verify that our variance reduction technique improves over current approaches. The dimension of the problem is determined by the image size ($28 \times 28$) and the number of classes (10). We choose the parameters according to the settings in Theorem 3.2 to give the best performance. First, to verify that our variance reduction technique for Gaussian random directions is useful, we compare our algorithm with RGF [10] for solving a deterministic function, namely logistic regression on MNIST samples. Row 1 of Figure 2 shows that our method SZVR-G outperforms RGF [10] in both the objective function value and the norm of the gradient. This verifies that even for solving a deterministic function, our algorithm outperforms RGF in both theory and practice, thanks to the variance reduction on the Gaussian search directions.
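For reference, a minimal version of this kind of finite-sum objective can be set up as follows (a binary logistic loss with l2 regularization on synthetic data; the exact multi-class form and regularization constant used in the paper are not shown above, so this variant and all its values are our assumptions). Each $f_i$ is exposed purely through function values, so it can be paired directly with the zeroth-order estimator (1.2):

```python
import numpy as np

def make_fi(xi, yi, lam=1e-3):
    # Per-example regularized logistic loss f_i(w), exposed only through its
    # value, matching the black-box access pattern of the SZO oracle.
    def fi(w):
        return np.log1p(np.exp(-yi * np.dot(xi, w))) + lam * np.dot(w, w)
    return fi

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
w_true = rng.standard_normal(10)
y = np.sign(X @ w_true)                    # labels in {-1, +1}
f_list = [make_fi(xi, yi) for xi, yi in zip(X, y)]
f = lambda w: float(np.mean([fi(w) for fi in f_list]))

loss_zero, loss_true = f(np.zeros(10)), f(w_true)
print(loss_zero, loss_true)   # the generating weights give a lower loss
```

A list of such closures is exactly the `f_list` input a finite-sum zeroth-order optimizer needs: one sample index plus one direction yields one two-point query pair.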
In the second experiment, we compare with RSG [11] on stochastic optimization, which involves two kinds of stochastic processes: randomly selecting one example (or a block of examples) and a Gaussian vector to estimate the gradient. We use a fixed dataset with randomly selected examples. Row 2 of Figure 2 shows that our method is better than RSG, since we conduct variance reduction on both the examples and the Gaussian vectors.
4.2 Universal adversarial examples in the black-box setting
In the second set of experiments, we apply zeroth-order optimization methods to solve a real problem: adversarial black-box attacks on machine learning models. It has been observed recently that convolutional neural networks are vulnerable to adversarial examples [24, 25]. [8] applied zeroth-order optimization techniques in the black-box setting, where one can only acquire input-output correspondences of the targeted model. Also, [26] found that there exist universal perturbations that can fool the classifier on almost all sampled data points. Therefore, we apply our SZVR-G algorithm to the nonsmooth problem of finding universal adversarial perturbations in the black-box setting, to demonstrate its efficiency in an interesting application. For classification models in neural networks, it is usually assumed that the prediction is determined by the final-layer (logit) output $Z$, where $Z(x)_k$ is the prediction score for the $k$-th class. Formally, we want to find a universal perturbation that fools all $N$ images in the sample set, that is,
where $c$ is a constant balancing the distortion and the attack success rate, and $\kappa$ is a confidence parameter that guarantees a constant gap between the true-class score and the largest other score. In these experiments, we use two standard datasets: MNIST [23] and CIFAR-10 [27]. We construct two convolutional neural networks following [28]. In detail, both MNIST and CIFAR-10 use the same network structure with four convolution layers, two max-pooling layers, and two fully-connected layers. Using the parameters provided by [28], we achieve 99.5% accuracy on MNIST and 82.5% accuracy on CIFAR-10. All models are trained using PyTorch (https://github.com/pytorch/pytorch). The dimension of the perturbation is $28 \times 28$ for MNIST and $32 \times 32 \times 3$ for CIFAR-10. We tune the parameters to give the best performance. Figure 3 shows the performance of the different methods. We can see that our algorithm SZVR-G outperforms RGF and RSG in both the objective value and the norm of the gradient.
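The attack objective can be sketched as follows (a hedged reconstruction in the C&W hinge style the text describes; the parameter names `c` and `kappa`, the toy linear "model", and all values are our assumptions). The function `logits_fn` stands in for the black-box query to the targeted model, so this nonsmooth objective is exactly the kind of function SZVR-G optimizes:

```python
import numpy as np

def universal_attack_loss(delta, images, labels, logits_fn, c=0.1, kappa=0.0):
    # Universal-perturbation objective: hinge on the logit gap, averaged over
    # all N images, plus an l2 distortion penalty weighted by c.
    hinge = 0.0
    for x, y in zip(images, labels):
        z = logits_fn(x + delta)               # single black-box query
        other = np.max(np.delete(z, y))        # best score among wrong classes
        hinge += max(z[y] - other, -kappa)     # clipped once the attack wins
    return hinge / len(images) + c * float(np.dot(delta, delta))

# Toy two-class linear "model", queried only through its logits.
W = np.eye(2)
logits_fn = lambda x: W @ x
images = [np.array([1.0, 0.0]), np.array([0.8, 0.1])]   # both predicted class 0
labels = [0, 0]

no_attack = universal_attack_loss(np.zeros(2), images, labels, logits_fn)
attacked = universal_attack_loss(np.array([-0.6, 0.6]), images, labels, logits_fn)
print(no_attack, attacked)    # the perturbation lowers the attack objective
```

Because of the `max` clipping, this objective is nonsmooth, which is why the nonsmooth analysis of Section 3.3 is the relevant one here.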
5 Conclusion
In this paper, we presented stochastic zeroth-order optimization via variance reduction for both smooth and nonsmooth nonconvex problems. The stochastic process involves two aspects: randomly selecting the sample and the derivative direction, respectively. We gave a theoretical analysis of the complexity, which is better than that of RGF and RSG. Furthermore, we extended our algorithm to the mini-batch setting, in which the complexity is multiplied by a factor smaller than the mini-batch size. Our experimental results confirm our theory.
References
 [1] J Matyas. Random optimization. Automation and Remote control, 26(2):246–253, 1965.
 [2] Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to derivative-free optimization, volume 8. SIAM, 2009.
 [3] Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394. Society for Industrial and Applied Mathematics, 2005.
 [4] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28–40. Citeseer, 2010.
 [5] Andre Wibisono, Martin J Wainwright, Michael I Jordan, and John C Duchi. Finite sample convergence rates of zero-order stochastic optimization methods. In Advances in Neural Information Processing Systems, pages 1439–1447, 2012.
 [6] James C Spall. Introduction to stochastic search and optimization: estimation, simulation, and control, volume 65. John Wiley & Sons, 2005.
 [7] Arjun Nitin Bhagoji, Warren He, Bo Li, and Dawn Song. Exploring the space of black-box attacks on deep neural networks. arXiv preprint arXiv:1712.09491, 2017.
 [8] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.
 [9] Nina Narodytska and Shiva Kasiviswanathan. Simple black-box adversarial attacks on deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1310–1318. IEEE, 2017.
 [10] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
 [11] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
 [12] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
 [13] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2345–2355, 2017.
 [14] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. arXiv preprint arXiv:1708.08694, 2017.
 [15] John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.
 [16] Yining Wang, Simon Du, Sivaraman Balakrishnan, and Aarti Singh. Stochastic zeroth-order optimization in high dimensions. arXiv preprint arXiv:1710.10551, 2017.
 [17] Xiangru Lian, Huan Zhang, Cho-Jui Hsieh, Yijun Huang, and Ji Liu. A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In Advances in Neural Information Processing Systems, pages 3054–3062, 2016.
 [18] Bin Gu, Zhouyuan Huo, and Heng Huang. Zeroth-order asynchronous doubly stochastic algorithm with variance reduction. arXiv preprint arXiv:1612.01425, 2016.
 [19] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
 [20] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pages 314–323, 2016.
 [21] Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečný, and Scott Sallinen. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pages 2251–2259, 2015.
 [22] Lihua Lei and Michael Jordan. Less than a single pass: Stochastically controlled stochastic gradient. In Artificial Intelligence and Statistics, pages 148–156, 2017.
 [23] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [24] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 [25] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
 [26] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [27] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
 [28] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57. IEEE, 2017.
Appendix A Technical Lemma
Lemma A.1.
For the sequences that satisfy , where , , and , we obtain the geometric progression
then the sequence can be represented as a decreasing sequence,
Furthermore, if , we have
Lemma A.2.
[10] For and , , where .
Lemma A.3.
Lemma A.4.
If $a_1, \dots, a_n$ satisfy $\frac{1}{n}\sum_{i=1}^{n} a_i = 0$, and $B$ is a nonempty, uniform random subset of $[n]$, then
$$\mathbb{E}\Big\|\frac{1}{|B|}\sum_{i \in B} a_i\Big\|^2 = \frac{n - |B|}{|B|(n-1)} \cdot \frac{1}{n}\sum_{i=1}^{n} \|a_i\|^2.$$
Furthermore, if the elements in the set are zero-mean and independent, then
$$\mathbb{E}\Big\|\frac{1}{|D|}\sum_{u \in D} a_u\Big\|^2 = \frac{1}{|D|^2}\sum_{u \in D} \mathbb{E}\|a_u\|^2.$$
Proof.
Based on the definition of the expectation and combinatorial counting, we have

For the case that $B$ is a nonempty, uniform random subset of $[n]$, we have
(A.2) (A.3) 
For the case that the elements in $D$ are independent, we have
(A.4)
∎
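The subset-sampling identity can be checked exactly by enumeration for a small case (our own sanity check; we assume the standard form of the lemma, with the vectors $a_i$ summing to zero):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, b, d = 6, 3, 4
a = rng.standard_normal((n, d))
a -= a.mean(axis=0)              # enforce (1/n) * sum_i a_i = 0

# Exact expectation over all uniform size-b subsets B of [n].
lhs = np.mean([np.sum(a[list(B)].mean(axis=0) ** 2)
               for B in combinations(range(n), b)])

# Closed form: (n - b) / (b (n - 1)) * (1/n) * sum_i ||a_i||^2.
rhs = (n - b) / (b * (n - 1)) * np.mean(np.sum(a ** 2, axis=1))
print(lhs, rhs)                  # agree up to floating-point error
```

Note the factor $(n-b)/(n-1)$ from sampling without replacement: it vanishes at $b=n$ (the full-batch average is exact), which is what distinguishes the set $B$ from the independently generated set $D$.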
Lemma A.5.
Consider a nonempty, uniform random subset $B$ of $[n]$ with , and the set $D$ with , in which each element is independent, and ; then
(A.5) 
Proof.
$D$ is a nonempty set in which each element is independent. Considering the inner average as a single element, and based on the result in Lemma A.4, we have
Taking the expectation with respect to $B$ and $D$ for the last two terms, we have
Thus, taking the expectation with respect to both $B$ and $D$,
∎
A.1 The model of the convergence analysis
Before giving the formal proof, we present a simple model of the convergence sequence, which is easy to follow. First, consider two sequences,
Define , we can see that
if the parameters satisfy ,

and is a decreasing sequence;
