Stochastic Zeroth-order Optimization via Variance Reduction method
Derivative-free optimization has become an important technique in machine learning for optimizing black-box models. To conduct updates without explicitly computing the gradient, most current approaches iteratively sample a random search direction from a Gaussian distribution and compute the estimated gradient along that direction. However, due to the variance in the search direction, the convergence rates and query complexities of existing methods suffer from a factor of $d$, where $d$ is the problem dimension. In this paper, we introduce a novel Stochastic Zeroth-order method with Variance Reduction under Gaussian smoothing (SZVR-G) and establish its complexity for optimizing non-convex problems. With variance reduction on both the sample space and the search space, the complexity of our algorithm is sublinear in $d$ and is strictly better than that of current approaches, in both the smooth and non-smooth cases. Moreover, we extend the proposed method to a mini-batch version. Our experimental results demonstrate the superior performance of the proposed method over existing derivative-free optimization techniques. Furthermore, we successfully apply our method to conduct a universal black-box attack on deep neural networks and present some interesting results.
Derivative-free optimization methods have a long history in optimization [1]. They use only function-value information, rather than explicit gradient computation, to optimize a function, as in the black-box setting or when computing partial derivatives is too expensive. Recently, derivative-free methods have received substantial attention in machine learning and deep learning [2], for example in online problems in the bandit setting [3, 4, 5], certain graphical model and structure-prediction problems [6], and black-box attacks on deep neural networks (DNNs) [7, 8, 9]. However, the convergence rate of current approaches carries a factor of $d$, where $d$ is the problem dimension. This prevents the application of derivative-free optimization to high-dimensional problems.
This paper focuses on the theoretical development of derivative-free (zeroth-order) methods for non-convex optimization. More specifically, we consider the following optimization problem:
where and are differentiable, non-convex functions, and , is a random variable. In particular, when =1, the objective function is = with a fixed , which becomes the problem solved in [10]. To solve (1.1), most approaches [11] consider the use of a stochastic zeroth-order oracle (SZO). At each iteration, for a given and , the SZO outputs a stochastic gradient defined by
which approximates the derivative along the direction of . Each SZO call requires only two function value evaluations (or one if has already been queried). It is thus natural to analyze the convergence rate of an algorithm in terms of the number of SZO calls required to achieve with a small .
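As a rough illustration of the oracle described above, the sketch below assumes the standard forward-difference Gaussian estimator $(F(x + \mu u) - F(x))/\mu \cdot u$ of Nesterov and Spokoiny, since the defining equation is not reproduced in this text. The names `szo_gradient`, `gaussian_direction`, and `quadratic`, the test function, and the value of `mu` are hypothetical choices made for the example.

```python
import random


def gaussian_direction(dim, rng):
    """Draw a standard Gaussian search direction u."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def szo_gradient(func, x, u, mu):
    """Assumed oracle form: (F(x + mu*u) - F(x)) / mu * u (two evaluations)."""
    shifted = [xk + mu * uk for xk, uk in zip(x, u)]
    scale = (func(shifted) - func(x)) / mu
    return [scale * uk for uk in u]


def quadratic(x):
    """Toy objective 0.5 * ||x||^2, whose true gradient is x itself."""
    return 0.5 * sum(v * v for v in x)


rng = random.Random(0)
x = [1.0, -2.0, 0.5]
n_samples = 20000

# Average many single-direction estimates to see that they centre on the gradient.
avg = [0.0] * len(x)
for _ in range(n_samples):
    g = szo_gradient(quadratic, x, gaussian_direction(len(x), rng), mu=1e-3)
    avg = [a + gk / n_samples for a, gk in zip(avg, g)]
```

Each call evaluates the function twice. A single estimate is very noisy, and on this toy quadratic it is only the average over many directions that approaches the true gradient `x`, which is the variance the paper sets out to reduce.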
A recent important work by Nesterov and Spokoiny [10] proposed the random gradient-free method (RGF) and proved some tight bounds for approximating the gradient through function-value information with Gaussian smoothing techniques. They established an complexity for non-convex smooth functions in the case of =1 in problem (1.1). Subsequently, Ghadimi and Lan [11] introduced a randomized stochastic gradient (RSG) method for solving the stochastic programming problem (1.2) and proved the complexity of . However, when the dimension is large, especially in deep learning, these derivative-free methods suffer from slow convergence.
The dependency on $d$ is mainly due to the variance in sampling the query direction . Recently, a family of variance reduction methods has been proposed for first-order optimization, including SVRG [12], SCSG [13] and Natasha [14]. They developed ways to reduce the variance of stochastic samples (). It is thus natural to ask the following questions: can the variance reduction technique also be used in derivative-free optimization to reduce the complexity caused by the problem dimension? And how should one choose the best size of the Gaussian random vector set for each epoch to estimate the gradient in zeroth-order optimization?
In this paper, we develop a novel stochastic zeroth-order method with variance reduction under Gaussian smoothing (SZVR-G). The main contributions are summarized below.
We propose a novel algorithm based on variance reduction. Unlike RSG and RGF, which generate a Gaussian random vector at each iteration, we independently generate a set of Gaussian vectors (in practice, we preserve the corresponding seeds) to compute the average of the directional derivatives at the beginning of each epoch, as defined in (3.1). In the inner iterations of an epoch, we randomly select one seed or a block of seeds preserved in the outer epoch to compute the corresponding gradient, as defined in (3.2).
We give a theoretical proof for the proposed algorithm and show that our results are better than those of RGF and RSG for both smooth and non-smooth functions, and in the case of both and of problem (1.1). Furthermore, we explicitly present the parameter settings and the corresponding derivation, which makes the convergence analysis easier to follow.
We extend stochastic zeroth-order optimization to the mini-batch setting. Although the complexity increases, we show that the rate of increase is sublinear in the batch size. In comparison, previous algorithms, including RGF and RSG, have complexity that grows linearly with the batch size. Furthermore, the total number of iterations of our algorithm decreases when a larger mini-batch is used, which implies better parallelizability.
We show that our algorithm is more efficient than both RGF and RSG on a canonical logistic regression problem. Furthermore, we successfully apply our algorithm to a real black-box adversarial attack problem that involves high-dimensional zeroth-order optimization.
1.1 Our results
Our proposed algorithm can achieve the following complexity:
where . We identify an interesting dichotomy with respect to . In particular, if , becomes , otherwise becomes . Different complexities of methods are presented in Table 1.
Comparing our method with RGF [10] in the case of =1 (that is ), we can see that our result improves on that of RGF by a factor of . For , the complexity of our method is also better than that of RSG [11], as clearly shown in Figure 1.
Mini-Batch Our result generalizes to the mini-batch stochastic setting, where in the inner iteration of each epoch, the estimated gradient defined in (3.2) is computed with mini-batch of times. The complexity will become . The comparison of mini-batch complexity is also shown in Table 1.
Non-smooth We also give the convergence analysis for the non-smooth case and present the complexity, which is better than that of RGF [10].
1.2 Other Related work
Derivative-free optimization dates back to the early days of the development of optimization theory. The advantage of a derivative-free method is manifested when computing the function value is much simpler than computing the gradient, or in the black-box setting when the optimizer does not have full information about the function.
The most common method for derivative-free optimization is the random optimization approach [1], which samples a random vector uniformly distributed over the unit sphere, computes the directional derivative of the function, and then moves to the next point if the update decreases the function value. However, no particular convergence rate was established. Nesterov and Spokoiny [10] presented several random derivative-free methods and provided the corresponding complexity bounds for both convex and non-convex problems. Moreover, they established an important kind of smoothness, Gaussian smoothing, and its properties. Ghadimi and Lan [11] incorporated the Gaussian smoothing technique into the randomized stochastic gradient (RSG) method. Duchi et al. [5, 15] analyzed the finite-sample convergence rate of zeroth-order optimization for convex problems. Wang et al. [16] considered zeroth-order optimization in high dimensions, but only for convex functions. For coordinate smoothness (where the sampled direction is along the natural basis), Lian et al. [17] presented a zeroth-order method under asynchronous stochastic parallel optimization for non-convex problems. Subsequently, Gu et al. [18] applied zeroth-order variance reduction to an asynchronous doubly stochastic algorithm, but without a specific analysis of the complexity in terms of the dimension $d$. Furthermore, it is not practical to perform full gradient computation in the parallel setting for large-scale data.
Stochastic first-order methods including SGD [19] and SVRG [20] have been studied extensively. However, these two algorithms suffer from either high iteration complexity or a complexity that depends on the number of samples. Lei et al. [13] recently proposed the stochastically controlled stochastic gradient (SCSG) method to obtain a complexity that is based on , which is derived from [21] and [22] for the convex case.
The rest of the paper is organized as follows. We first introduce some notation, definitions and assumptions in Section 2. In Section 3, we present our algorithm based on the variance reduction technique and analyze the complexity for both smooth and non-smooth functions, together with their corresponding mini-batch versions. Experimental results are shown in Section 4. Section 5 concludes the paper.
Throughout this paper, we use the Euclidean norm, denoted by . We use to denote that is generated from . We denote by and the set, and and the cardinality of the sets. We use and to denote the variable set, where belong to , , and belong to , . We use to denote the indicator function of a probabilistic event. Here are some definitions on the smoothness of a function, the directional derivative, and the smooth approximation function and its properties.
For a function : , ,
, then .
, then and .
Note that if , then due to the fact that
The smooth approximation of is defined as
We assume that is the upper bound on the variance of function , that is
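The smooth approximation mentioned in this section can be made concrete with a small numerical check. The defining formula is missing from this text, so the sketch assumes the usual Gaussian smoothing $f_\mu(x) = \mathbb{E}_u[f(x + \mu u)]$ with $u$ a standard Gaussian vector. The function names and the choice of the non-smooth test function $|x|$ are hypothetical; for that function, $f_\mu(0) = \mu\sqrt{2/\pi}$ follows from the mean of a half-normal variable.

```python
import math
import random


def smoothed_value(func, x, mu, n_samples, rng):
    """Monte Carlo estimate of f_mu(x) = E_u[f(x + mu*u)], u ~ N(0, I)."""
    total = 0.0
    for _ in range(n_samples):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        total += func([xk + mu * uk for xk, uk in zip(x, u)])
    return total / n_samples


def abs_sum(x):
    """Non-smooth test function: sum of absolute values."""
    return sum(abs(v) for v in x)


rng = random.Random(1)
mu = 0.5
estimate = smoothed_value(abs_sum, [0.0], mu, 50000, rng)
closed_form = mu * math.sqrt(2.0 / math.pi)  # E|mu * u| for scalar u ~ N(0, 1)
```

At the kink `x = 0` the original function is not differentiable, while the smoothed value is strictly positive and varies smoothly with `x`; this is the property the non-smooth analysis relies on.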
3 Stochastic Zeroth-order via Variance reduction with Gaussian smooth
We introduce our SZVR-G method in Algorithm 1. At each outer iteration, we have two kinds of sampling: the first one is to form with the size of , which are randomly selected from ; the second one is to independently generate a Gaussian vector set with times. Furthermore, we store the corresponding seeds of Gaussian vectors, which will be used for the inner iterations. The main difference between set and is the property of independence, which will be the key element in analyzing the size of their sets. Based on these two sets, we compute the random gradient at a snapshot point , which is maintained for each epoch,
where the definition of is in (1.2).
At each inner iteration, we select and from and randomly, and compute the estimated random gradient,
where and are the Gaussian vector set and sample set. Taking expectation of with respect to , and , we have
where and are defined in Definition 2.2.
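The epoch structure described in this section can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 1: equations (3.1) and (3.2) are not reproduced in this text, so the sketch assumes that the snapshot estimate averages forward-difference estimates over every (sample, direction) pair and that the inner step uses an SVRG-style correction. The least-squares test problem and all names and parameter values (`szvr_g`, `zo_grad`, `direction_from_seed`, `step`, `mu`, and so on) are hypothetical.

```python
import random


def direction_from_seed(seed, dim):
    """Regenerate a Gaussian direction from its stored seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def zo_grad(loss, x, u, mu):
    """Assumed forward-difference estimate (f(x + mu*u) - f(x)) / mu * u."""
    shifted = [xk + mu * uk for xk, uk in zip(x, u)]
    scale = (loss(shifted) - loss(x)) / mu
    return [scale * uk for uk in u]


def szvr_g(losses, x0, epochs, inner_steps, batch_size, n_dirs, step, mu, rng):
    dim = len(x0)
    x = list(x0)
    for _ in range(epochs):
        snapshot = list(x)
        batch = rng.sample(range(len(losses)), batch_size)          # sample set
        seeds = [rng.randrange(2 ** 31) for _ in range(n_dirs)]     # direction set
        # Snapshot estimate: average over every (sample, direction) pair.
        g_snap = [0.0] * dim
        for i in batch:
            for s in seeds:
                g = zo_grad(losses[i], snapshot, direction_from_seed(s, dim), mu)
                g_snap = [a + b / (batch_size * n_dirs) for a, b in zip(g_snap, g)]
        for _ in range(inner_steps):
            i = rng.choice(batch)
            u = direction_from_seed(rng.choice(seeds), dim)
            g_x = zo_grad(losses[i], x, u, mu)
            g_s = zo_grad(losses[i], snapshot, u, mu)
            # SVRG-style control variate (assumed form of the inner update).
            x = [xk - step * (a - b + c) for xk, a, b, c in zip(x, g_x, g_s, g_snap)]
    return x


def make_least_squares(n, dim, rng):
    """Toy finite sum f_i(x) = 0.5 * (a_i . x - b_i)^2 with a shared minimizer."""
    target = [1.0] * dim
    rows = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

    def make_loss(a):
        b = sum(ak * tk for ak, tk in zip(a, target))
        return lambda x: 0.5 * (sum(ak * xk for ak, xk in zip(a, x)) - b) ** 2

    return [make_loss(a) for a in rows]


def objective(losses, x):
    return sum(loss(x) for loss in losses) / len(losses)


rng = random.Random(0)
losses = make_least_squares(n=20, dim=5, rng=rng)
x0 = [0.0] * 5
x_final = szvr_g(losses, x0, epochs=30, inner_steps=50, batch_size=10,
                 n_dirs=10, step=0.005, mu=1e-4, rng=rng)
initial_value = objective(losses, x0)
final_value = objective(losses, x_final)
```

Storing seeds instead of the vectors themselves keeps the memory cost of the direction set independent of the dimension, at the price of regenerating each direction when it is reused.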
3.1 Convergence analysis
We present the convergence and complexity results for our algorithm. Theorem 3.1 is based on the variance reduction technique for the non-convex problem. The detailed proof can be found in Appendix B. To ensure convergence, we present the parameter settings, such as , , , and , in Remarks B.1 and B.2.
The complexity is presented in Theorem 3.2, which is based on the best choice of step size . For the different sizes of and , we give different results, which is an interesting phenomenon caused by two types of samples.
In Algorithm 1, for , let the size of sample set , =, the step , , and the number of inner iteration , Gaussian vectors set . In order to obtain
the total number of is at most , with the number of total iterations .
3.1.1 Variance reduction for Gaussian random direction
If we only consider the directions of the Gaussian random vectors, that is , Algorithm 1 is similar to SVRG, but the variance reduction is on random directions instead of random samples. In the outer iteration, we independently produce Gaussian random vectors and compute the smoothed gradient estimator in (3.1) (here, we use to indicate the only sample). Then, in the inner iteration, we randomly select a Gaussian vector and compute the estimated gradient as in (3.2). Since this is the same problem solved by Nesterov and Spokoiny [10], we compare the complexity of our method with theirs based on different step-size choices:
For , we set , the complexity of our proposed method is = which is better than that of RGF [10], . The corresponding Gaussian random vector set , . This is due to the fact that does not have a finite-sum structure and the term , which is bounded by . See Lemma B.1 for more details. This is the key difference from the SVRG method [12]. Based on the lower bound, we can derive the corresponding best complexity and best step size, as shown in Theorem 3.2.
For , the complexity will be larger than . This can be seen directly from the total number of . In this case becomes 1, and the proposed algorithm reduces to the original RGF [10] method, where the step is . This explains why the variance reduction method is better than RGF: our proposed method can use a larger step size to obtain a better complexity.
3.1.2 Variance reduction for finite-sum function
For the finite-sum function in (1.1), Algorithm 1 applies the variance-reduction technique simultaneously to both the Gaussian vector and the random variable . Our algorithm has two kinds of random procedures. In the outer iteration, we compute the gradient using both the $B$ samples and the $D$ Gaussian random vectors. In the inner iteration, we randomly select a sample and a Gaussian random vector to estimate the gradient. Here, we compare our result with RSG [11], which also uses both random samples and Gaussian random vectors. Based on the result in Theorem 3.2, we discuss the complexity under different ,
For , the complexity of our proposed method is =. This result is similar to SCSG [13] if the dimension $d$ is not large enough. Furthermore, in our algorithm, we set $B$ to a fixed value rather than a value generated at random. If $B = d$, the complexity result looks the same as that of RSG [11]. The difference is that $B$ is no more than , so our result is better than that of RSG [11]. Figure 1 clearly shows the difference.
For , the complexity becomes . The complexity is also better than that of RSG.
3.2 Mini-batch SZVR-G
We extend SZVR-G to a mini-batch version in Algorithm 2, which is similar to Algorithm 1. The difference is that, in the inner iteration, we compute the estimated gradient times and then average the results. Theorem 3.3 gives the corresponding complexity and step size.
From the above theorem, we can see that the complexity increases by a factor of or , which is smaller than the size of the mini-batch. In contrast, the corresponding complexity of RGF and RSG increases by a factor of (see Table 1), so our algorithm has a better dependency on the batch size. Furthermore, our total number of iterations decreases by a factor of or .
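The averaging step of the mini-batch variant can be illustrated in isolation. The sketch below is not Algorithm 2 itself: it assumes the same forward-difference estimator and SVRG-style correction as a plain variance-reduced step, and simply averages `batch` corrected estimates drawn from a stored pool of directions. The names (`minibatch_vr_estimate`, `total_variance`, `shifted_quadratic`), the toy losses, and the placeholder snapshot estimate `g_snap` are hypothetical.

```python
import random


def zo_grad(loss, x, u, mu):
    """Assumed forward-difference estimate (f(x + mu*u) - f(x)) / mu * u."""
    shifted = [xk + mu * uk for xk, uk in zip(x, u)]
    scale = (loss(shifted) - loss(x)) / mu
    return [scale * uk for uk in u]


def minibatch_vr_estimate(losses, x, snapshot, g_snap, directions, batch, mu, rng):
    """Average `batch` corrected estimates, then add the snapshot estimate."""
    avg_diff = [0.0] * len(x)
    for _ in range(batch):
        loss = rng.choice(losses)
        u = rng.choice(directions)
        g_x = zo_grad(loss, x, u, mu)
        g_s = zo_grad(loss, snapshot, u, mu)
        avg_diff = [a + (p - q) / batch for a, p, q in zip(avg_diff, g_x, g_s)]
    return [a + g for a, g in zip(avg_diff, g_snap)]


def total_variance(samples):
    """Sum over coordinates of the empirical variance of the samples."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[k] for s in samples) / n for k in range(dim)]
    return sum((s[k] - mean[k]) ** 2 for s in samples for k in range(dim)) / n


def shifted_quadratic(center):
    return lambda x: 0.5 * sum((xk - center) ** 2 for xk in x)


rng = random.Random(7)
losses = [shifted_quadratic(c) for c in (-1.0, 0.0, 1.0, 2.0)]
directions = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(100)]
x = [1.0, -2.0, 0.5]
snapshot = [0.8, -1.5, 0.0]
g_snap = [0.0, 0.0, 0.0]  # placeholder; a real run would compute it at the snapshot


def spread(batch, trials=3000):
    samples = [minibatch_vr_estimate(losses, x, snapshot, g_snap, directions,
                                     batch, 1e-3, rng) for _ in range(trials)]
    return total_variance(samples)


var_single = spread(1)
var_batch8 = spread(8)
```

On this toy problem the variance of the averaged estimate shrinks roughly in proportion to the batch size, which is the mechanism that lets a mini-batch step tolerate a larger step size. When the iterate coincides with the snapshot, the correction terms cancel and the estimate equals `g_snap`.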
3.3 SZVR-G for non-smooth function
For non-smooth functions, we also provide a theoretical analysis and give the corresponding complexity. Similar to Theorem 3.2, we analyze the convergence based on the norm of the gradient. The difference is that the convergence of the gradient norm is rather than . As stated in , allowing and , the convergence of ensures the convergence to a stationary point of the initial function.
In Algorithm 1, for , the step , , and the number of inner iteration , Gaussian vectors set . In order to obtain
the total number of is , number of inner iterations .
4 Experimental results
4.1 Logistic regression with stochastic zeroth-order method
In order to verify our theory, we apply our algorithm to logistic regression. Given training examples , where and , are the feature vector and the label of th example. The objective function is
where . We use the MNIST [23] dataset to run two kinds of experiments in order to verify that our variance reduction technique is better than current approaches. The dimension of is , where the size of the image is , and the number of classes is 10. We choose the parameters according to the settings in Theorem 3.2 to give the best performance. First, to verify that our variance reduction technique for Gaussian random directions is useful, we compare our algorithm with RGF [10] for solving a deterministic function , which is the logistic regression with MNIST samples. Row 1 in Figure 2 shows that our method SZVR-G is better than RGF [10] in both the objective function value and the norm of the gradient. This verifies that even for solving a deterministic function, our algorithm outperforms RGF in both theory and practice, due to the variance reduction for Gaussian search directions.
In the second experiment, we compare with RSG [11] on stochastic optimization, which involves two kinds of stochastic processes: randomly selecting one example or a block of examples, and a Gaussian vector, to estimate the gradient. We use a fixed dataset with randomly selected examples. Row 2 of Figure 2 shows that our method is better than RSG, since we conduct variance reduction on both examples and Gaussian vectors.
4.2 Universal adversarial examples with black-box setting
In the second set of experiments, we apply zeroth-order optimization methods to solve a real problem in adversarial black-box attacks on machine learning models. It has been observed recently that convolutional neural networks are vulnerable to adversarial examples [24, 25]. Chen et al. [8] apply zeroth-order optimization techniques in the black-box setting, where one can only acquire input-output correspondences of the targeted model. Also, Moosavi-Dezfooli et al. [26] find that there exist universal perturbations that can fool the classifier on almost all sampled datapoints. Therefore, we apply our SZVR-G algorithm to the non-smooth problem of finding universal adversarial perturbations in the black-box setting, to show its efficiency in an interesting application. For classification models in neural networks, given the classification model , it is usually assumed that , where is the final layer output, and is the prediction score for the -th class. Formally, we want to find a universal perturbation that could fool all N images in samples set , that is,
where is a constant to balance the distortion and attack success rate and is a confidence parameter that guarantees a constant gap between and . In these experiments, we use two standard datasets: MNIST [23] and CIFAR-10 [27]. We construct two convolutional neural networks following [28]. In detail, both MNIST and CIFAR-10 use the same network structure with four convolution layers, two max-pooling layers and two fully-connected layers. Using the parameters provided by [28], we achieve 99.5% accuracy on MNIST and 82.5% accuracy on CIFAR-10. All models are trained using PyTorch (https://github.com/pytorch/pytorch). The dimension of is for MNIST and for CIFAR-10. We tune the parameters to give the best performance. Figure 3 shows the performance of the different methods. We can see that our algorithm SZVR-G is better than RGF and RSG in both the objective value and the norm of the gradient.
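To make the shape of such an objective concrete, the sketch below shows one plausible form of a universal-perturbation loss. The exact formula is missing from this text, so this is an assumption modelled on the hinge loss of Carlini and Wagner [28]: a per-image term that is clipped at the negative confidence margin, averaged over the images and weighted by a constant, plus the squared norm of the perturbation. The function name `attack_loss`, the argument names `c` and `kappa`, the weighting, and the toy two-class "model" are all hypothetical.

```python
def attack_loss(logits_fn, images, labels, w, c, kappa):
    """Assumed untargeted hinge loss for a shared perturbation w.

    Only outputs of `logits_fn` are used, so the loss can be queried in a
    black-box fashion by a zeroth-order method.
    """
    total = 0.0
    for x, y in zip(images, labels):
        z = logits_fn([xk + wk for xk, wk in zip(x, w)])
        best_other = max(zj for j, zj in enumerate(z) if j != y)
        # Stop rewarding the attack once the true class trails by kappa.
        total += max(z[y] - best_other, -kappa)
    distortion = sum(wk * wk for wk in w)
    return c * total / len(images) + distortion


def identity_logits(x):
    """Toy stand-in for a network: the two inputs act as the two class scores."""
    return list(x)


images = [[2.0, 0.0], [3.0, 1.0]]
labels = [0, 0]
loss_clean = attack_loss(identity_logits, images, labels, [0.0, 0.0], c=1.0, kappa=0.5)
loss_fooled = attack_loss(identity_logits, images, labels, [-3.0, 3.0], c=1.0, kappa=0.5)
```

The `max` makes the objective non-smooth, which is why the non-smooth analysis is the relevant one for this application. In the toy example the second perturbation fools both images but pays a large distortion penalty, illustrating the trade-off that the constant `c` controls.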
In this paper, we present stochastic zeroth-order optimization via variance reduction for both smooth and non-smooth non-convex problems. The stochastic process has two aspects: random selection of the sample and random selection of the directional derivative. We give a theoretical analysis of the complexity, which is better than that of RGF and RSG. Furthermore, we extend our algorithm to the mini-batch setting, in which the complexity grows by a factor smaller than the mini-batch size. Our experimental results also confirm our theory.
- [1] J Matyas. Random optimization. Automation and Remote Control, 26(2):246–253, 1965.
- [2] Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to derivative-free optimization, volume 8. SIAM, 2009.
- [3] Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394. Society for Industrial and Applied Mathematics, 2005.
- [4] Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28–40. Citeseer, 2010.
- [5] Andre Wibisono, Martin J Wainwright, Michael I Jordan, and John C Duchi. Finite sample convergence rates of zero-order stochastic optimization methods. In Advances in Neural Information Processing Systems, pages 1439–1447, 2012.
- [6] James C Spall. Introduction to stochastic search and optimization: estimation, simulation, and control, volume 65. John Wiley & Sons, 2005.
- [7] Arjun Nitin Bhagoji, Warren He, Bo Li, and Dawn Song. Exploring the space of black-box attacks on deep neural networks. arXiv preprint arXiv:1712.09491, 2017.
- [8] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.
- [9] Nina Narodytska and Shiva Kasiviswanathan. Simple black-box adversarial attacks on deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1310–1318. IEEE, 2017.
- [10] Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
- [11] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
- [12] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.
- [13] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2345–2355, 2017.
- [14] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. arXiv preprint arXiv:1708.08694, 2017.
- [15] John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.
- [16] Yining Wang, Simon Du, Sivaraman Balakrishnan, and Aarti Singh. Stochastic zeroth-order optimization in high dimensions. arXiv preprint arXiv:1710.10551, 2017.
- [17] Xiangru Lian, Huan Zhang, Cho-Jui Hsieh, Yijun Huang, and Ji Liu. A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In Advances in Neural Information Processing Systems, pages 3054–3062, 2016.
- [18] Bin Gu, Zhouyuan Huo, and Heng Huang. Zeroth-order asynchronous doubly stochastic algorithm with variance reduction. arXiv preprint arXiv:1612.01425, 2016.
- [19] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
- [20] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International Conference on Machine Learning, pages 314–323, 2016.
- [21] Reza Harikandeh, Mohamed Osama Ahmed, Alim Virani, Mark Schmidt, Jakub Konečný, and Scott Sallinen. Stop wasting my gradients: Practical SVRG. In Advances in Neural Information Processing Systems, pages 2251–2259, 2015.
- [22] Lihua Lei and Michael Jordan. Less than a single pass: Stochastically controlled stochastic gradient. In Artificial Intelligence and Statistics, pages 148–156, 2017.
- [23] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- [24] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- [25] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- [26] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations.
- [27] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
- [28] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pages 39–57. IEEE, 2017.
Appendix A Technical Lemma
For the sequences that satisfy , where , , and , we can get the geometric progression
then can be represented as a decreasing sequence,
Furthermore, if , we have
 For and , , where .
 If is differentiable at then,
If satisfy , and is a non-empty, uniform random subset of , then
Furthermore, if the elements in are independent, then
Based on the , and permutation and combination, we have
For the case that is a non-empty, uniform random subset of , we have
For the case that the elements in are independent, we have
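The contrast between a uniform random subset and independent draws can be checked exactly on a small example. The formulas of Lemma A.4 are missing from this text, so the check assumes the standard identities for vectors $a_1, \dots, a_n$ with $\sum_i a_i = 0$ and a subset of size $B$: sampling without replacement gives $\mathbb{E}\|\frac{1}{B}\sum_{i} a_i\|^2 = \frac{n-B}{(n-1)B}\cdot\frac{1}{n}\sum_i\|a_i\|^2$, while independent draws give $\frac{1}{B}\cdot\frac{1}{n}\sum_i\|a_i\|^2$. The vectors and names below are hypothetical.

```python
from itertools import combinations, product


def mean_sq_norm(vectors, index_sets):
    """Average of ||mean of the selected vectors||^2 over all index sets."""
    dim = len(vectors[0])
    total, count = 0.0, 0
    for idx in index_sets:
        mean = [sum(vectors[i][k] for i in idx) / len(idx) for k in range(dim)]
        total += sum(m * m for m in mean)
        count += 1
    return total / count


# Four vectors in the plane that sum to zero.
vectors = [[1.0, 0.0], [-2.0, 1.0], [0.5, -3.0], [0.5, 2.0]]
n = len(vectors)
size = 2
avg_sq = sum(c * c for v in vectors for c in v) / n  # (1/n) * sum ||a_i||^2

# Exhaustive enumeration replaces the expectation.
without_repl = mean_sq_norm(vectors, combinations(range(n), size))
with_repl = mean_sq_norm(vectors, product(range(n), repeat=size))

formula_without = (n - size) / ((n - 1) * size) * avg_sq
formula_with = avg_sq / size
```

The subset without replacement has the smaller second moment, and it vanishes when the subset is the whole set. Independent draws never gain this correction, which is consistent with the remark in Section 3 that independence is the key difference between the sample set and the Gaussian vector set.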
Consider that is a non-empty, uniform random subset of [n] with , and the set with , if is a non-empty set, in which each element in is independent, and , then
is a non-empty set, in which each element in is independent. Consider the as an element, and based on the result in Lemma A.4, we have
Take the expectation with respect to and for the last two terms, we have
For the first term,
where (A.6) is based on the fact that .
Thus, we have the expectation with respect to and ,
A.1 The model of convergence analysis
Before giving the formal proof, we present a simple model of the convergent sequence, which is easy to follow. First, given two sequences,
Define , we can see that
if parameters satisfy, ,
and is a decreasing sequence;