Learning to Learn by Zeroth-Order Oracle
Abstract
In the learning to learn (L2L) framework, we cast the design of optimization algorithms as a machine learning problem and use deep neural networks to learn the update rules. In this paper, we extend the L2L framework to the zeroth-order (ZO) optimization setting, where no explicit gradient information is available. Our learned optimizer, modeled as a recurrent neural network (RNN), first approximates the gradient with a ZO gradient estimator and then produces a parameter update utilizing the knowledge of previous iterations. To reduce the high variance of the ZO gradient estimator, we further introduce another RNN to learn the Gaussian sampling rule and dynamically guide the query direction sampling. Our learned optimizer outperforms hand-designed algorithms in terms of convergence rate and final solution on both synthetic and practical ZO optimization tasks (in particular, the black-box adversarial attack task, which is one of the most widely used tasks of ZO optimization). We finally conduct extensive analytical experiments to demonstrate the effectiveness of our proposed optimizer.
1 Introduction
Learning to learn (L2L) is a recently proposed meta-learning framework in which deep neural networks are used to learn optimization algorithms automatically. The most common choice for the learned optimizer is a recurrent neural network (RNN), since it can capture long-term dependencies and propose parameter updates based on knowledge of previous iterations. By training RNN optimizers on predefined optimization tasks, the optimizers learn to explore the loss landscape and adaptively choose descent directions and step sizes (Lv et al., 2017). Recent works (Andrychowicz et al., 2016; Wichrowska et al., 2017; Lv et al., 2017) have shown that these learned optimizers can often outperform widely used hand-designed algorithms such as SGD, RMSProp, and ADAM. Despite great prospects in this field, almost all previous learned optimizers are gradient-based and therefore cannot be applied to optimization problems where explicit gradients are difficult or infeasible to obtain.
Such problems are called zeroth-order (ZO) optimization problems, where the optimizer is only provided with function values (zeroth-order information) rather than explicit gradients (first-order information). They are attracting increasing attention for solving ML problems in the black-box setting or when computing gradients is too expensive (Liu et al., 2018a). Recently, one of the most important applications of ZO optimization has been the black-box adversarial attack on well-trained deep neural networks, since in practice only the input-output correspondence of targeted models, rather than internal model information, is accessible (Papernot et al., 2017; Chen et al., 2017a).
Although ZO optimization is popular for solving ML problems, the performance of existing algorithms is barely satisfactory. The basic idea behind these algorithms is to approximate gradients via a ZO oracle (Nesterov and Spokoiny, 2017; Ghadimi and Lan, 2013). Given the loss function $f$ with its parameter $\theta \in \mathbb{R}^d$ to be optimized (called the optimizee), we can obtain its ZO gradient estimator by:
$$\hat{\nabla} f(\theta) = \frac{1}{\mu q} \sum_{i=1}^{q} \left[ f(\theta + \mu u_i) - f(\theta) \right] u_i \tag{1}$$
where $\mu$ is the smoothing parameter, $\{u_i\}_{i=1}^{q}$ are random query directions drawn from the standard Gaussian distribution (Nesterov and Spokoiny, 2017), and $q$ is the number of sampled query directions. However, the high variance of the ZO gradient estimator, which results from both random query directions and random samples (in the stochastic optimization setting), hampers the convergence rate of current ZO algorithms. Typically, as the problem dimension $d$ increases, these ZO algorithms suffer from an iteration complexity that grows by a small polynomial of $d$ in order to explore the higher-dimensional query space.
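The estimator in equation 1 can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function name `zo_gradient`, the quadratic test function, and the values of the smoothing parameter and the number of query directions are hypothetical choices, not settings taken from the paper.

```python
import numpy as np

def zo_gradient(f, theta, mu=1e-3, q=20, rng=None):
    """Random-direction ZO gradient estimate using forward differences.

    f     : black-box function returning a scalar
    theta : current point, shape (d,)
    mu    : smoothing parameter (hypothetical default)
    q     : number of Gaussian query directions (hypothetical default)
    """
    rng = np.random.default_rng() if rng is None else rng
    d = theta.shape[0]
    f0 = f(theta)                      # one query at the current point
    grad = np.zeros(d)
    for _ in range(q):
        u = rng.standard_normal(d)     # u_i ~ N(0, I_d)
        grad += (f(theta + mu * u) - f0) * u
    return grad / (mu * q)

# Toy check on f(x) = 0.5 * ||x||^2, whose true gradient is x.
f = lambda x: 0.5 * float(x @ x)
theta = np.ones(10)
g_hat = zo_gradient(f, theta, q=2000, rng=np.random.default_rng(0))
cos = float(g_hat @ theta / (np.linalg.norm(g_hat) * np.linalg.norm(theta)))
```

With a large number of query directions the estimate aligns closely with the true gradient; with a small budget the same code produces a much noisier direction, which is the variance problem discussed above.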
In this paper, we propose to learn a zeroth-order optimizer. Instead of designing variance-reduced and faster-converging algorithms by hand as in Liu et al. (2018a, b), we replace the parameter update rule as well as the guided sampling rule for query directions with learned recurrent neural networks (RNNs). The main contributions of this paper are summarized as follows:

We extend the L2L framework to the ZO optimization setting and propose to use an RNN to learn ZO update rules automatically. Our learned optimizer achieves faster convergence and lower final loss than hand-designed ZO algorithms.

Instead of using standard Gaussian sampling for random query directions as in traditional ZO algorithms, we propose to learn the Gaussian sampling rule and adaptively modify the search distribution. We use another RNN to adapt the variance of random Gaussian sampling. This new technique helps the optimizer automatically sample from a more important search space and thus yields a more accurate gradient estimator at each iteration.

Our learned optimizer leads to significant improvement on some ZO optimization tasks (especially the black-box adversarial attack task). We also conduct extensive experiments to analyze the effectiveness of our learned optimizer.
2 Related Work
Learning to learn (L2L). In the L2L framework, the design of optimization algorithms is cast as a learning problem and a deep neural network is used to learn the update rule automatically. In Cotter and Conwell (1990), early attempts were made to model adaptive learning algorithms as a recurrent neural network (RNN); these were further developed in Younger et al. (2001), where an RNN was trained to optimize simple convex functions. Recently, Andrychowicz et al. (2016) proposed a coordinate-wise LSTM optimizer model to learn the parameter update rule tailored to a particular class of optimization problems and showed that the learned optimizer could be applied to train deep neural networks. In Wichrowska et al. (2017) and Lv et al. (2017), several elaborate designs were proposed to improve the generalization and scalability of learned optimizers. Li and Malik (2016) and Li and Malik (2017) took a reinforcement learning (RL) perspective and used policy search to learn the optimization algorithms (viewed as RL policies). However, most previous learned optimizers rely on first-order information and use explicit gradients to produce parameter updates, which is not applicable when explicit gradients are not available.
In this paper, we aim to learn an optimizer for ZO optimization problems. The work most relevant to ours is Chen et al. (2017b), in which the authors proposed to learn a global black-box (zeroth-order) optimizer that takes as input the current query point and function value and outputs the next query point. Although the learned optimizer achieves performance comparable to traditional Bayesian optimization algorithms on some black-box optimization tasks, it has several crucial drawbacks. As pointed out in their paper, the learned optimizer scales poorly with long training steps and is specialized to a fixed problem dimension. Furthermore, it is not suitable for solving high-dimensional black-box optimization problems.
Zeroth-order (ZO) optimization. The most common method of ZO optimization is to approximate the gradient by a ZO gradient estimator. Existing ZO optimization algorithms include ZO-SGD (Ghadimi and Lan, 2013), ZO-SCD (Lian et al., 2016), ZO-signSGD (Liu et al., 2019), ZO-ADAM (Chen et al., 2017a), etc. These algorithms suffer from the high variance of the ZO gradient estimator and typically increase the iteration complexity of their first-order counterparts by a small-degree polynomial of the problem dimension $d$. To tackle this problem, several variance-reduced and faster-converging algorithms (Liu et al., 2018a, b; Gu et al., 2016) were proposed, but they reduce the variance at the cost of higher query complexity. In this paper, we avoid laborious hand design of these algorithms and aim to learn ZO optimization algorithms automatically.
3 Method
3.1 Model Architecture
Our proposed RNN optimizer consists of three main parts: UpdateRNN, Guided ZO Oracle, and QueryRNN, as shown in Figure 1.
UpdateRNN
The function of the UpdateRNN is to learn the parameter update rule of ZO optimization. Following the idea in Andrychowicz et al. (2016), we use a coordinate-wise LSTM to model the UpdateRNN. Each coordinate of the optimizee shares the same network but maintains its own separate hidden state, which means that different parameters are optimized using the same update rule based on their own knowledge of previous iterations. Unlike previous designs in the first-order setting, the UpdateRNN takes as input the ZO gradient estimator in equation 1 rather than the exact gradient and outputs the parameter update for each coordinate. Thus the parameter update rule is:
$$\theta_t = \theta_{t-1} + \mathrm{UpdateRNN}\left(\hat{\nabla} f(\theta_{t-1})\right) \tag{2}$$
where $\theta_t$ is the optimizee parameter at iteration $t$. Besides learning to adaptively compute parameter updates by exploring the loss landscape, the UpdateRNN can also reduce the negative effects caused by the high variance of the ZO gradient estimator, because it captures long-term dependencies across iterations.
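To make the coordinate-wise design concrete, the following is a minimal NumPy sketch of a shared LSTM cell applied independently to every coordinate. It is not the authors' implementation: the class name `CoordinatewiseLSTM`, the random untrained weights, and the toy gradient estimate are assumptions made for illustration; only the hidden size of 10 and the linear output projection follow the experimental description.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class CoordinatewiseLSTM:
    """One LSTM cell whose weights are shared by all coordinates.

    Each coordinate keeps its own hidden and cell state (one row each),
    so the same network works for any optimizee dimension d.
    Weights are random and untrained (illustration only).
    """
    def __init__(self, hidden=10, seed=0):
        rng = np.random.default_rng(seed)
        self.hidden = hidden
        self.W = rng.normal(0.0, 0.1, size=(1 + hidden, 4 * hidden))
        self.b = np.zeros(4 * hidden)
        self.w_out = rng.normal(0.0, 0.1, size=hidden)  # linear output layer

    def init_state(self, d):
        return np.zeros((d, self.hidden)), np.zeros((d, self.hidden))

    def step(self, grad_est, state):
        h, c = state
        x = np.concatenate([grad_est[:, None], h], axis=1)   # (d, 1 + hidden)
        i, f, o, g = np.split(x @ self.W + self.b, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h @ self.w_out, (h, c)                        # update, new state

rnn = CoordinatewiseLSTM()
theta = np.zeros(5)
state = rnn.init_state(theta.shape[0])
grad_est = np.array([1.0, 1.0, -2.0, 0.5, 1.0])  # stand-in for a ZO estimate
delta, state = rnn.step(grad_est, state)
theta = theta + delta                            # additive parameter update
```

Because the weights are shared, coordinates that have seen the same input history receive the same update, while the per-coordinate state lets their behavior diverge as their histories differ.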
Guided ZO Oracle
In current ZO optimization approaches, the ZO gradient estimator is computed by finite differences along query directions that are randomly sampled from a multivariate standard Gaussian distribution. But this estimate suffers from high variance and leads to a poor convergence rate when applied to high-dimensional problems (Duchi et al., 2015). To tackle this problem, we propose to use prior knowledge learned from previous iterates during optimization to guide the random query direction search and adaptively modify the search distribution. Specifically, at iteration $t$, we use $\mathcal{N}(0, \Sigma_t)$ to sample query directions ($\Sigma_t$ is produced by the QueryRNN, which is introduced later) and then obtain the ZO gradient estimator along the sampled directions via the ZO Oracle (equation 1). The learned adaptive sampling strategy automatically identifies the important sampling space, which leads to a more accurate gradient estimator under a fixed query budget and thus further increases the convergence rate in ZO optimization tasks. For example, in the black-box adversarial attack task, there is usually a clear important subspace for a successful attack, and sampling directions from that subspace leads to much faster convergence. This idea is similar to that of search distribution augmentation techniques for evolutionary strategies (ES) such as CMA-ES (Hansen, 2016) and Guided ES (Maheswaranathan et al., 2018). However, these methods explicitly define the sampling rule, whereas we propose to learn the Gaussian sampling rule (i.e., the adaptation of the covariance matrix $\Sigma_t$) automatically.
QueryRNN
We propose to use another LSTM network called QueryRNN to learn the Gaussian sampling rule and dynamically predict the covariance matrix $\Sigma_t$. We assume $\Sigma_t$ is diagonal so that it can be predicted in a coordinate-wise manner as in the UpdateRNN, which makes the learned QueryRNN invariant to the dimension of the optimizee parameter. It takes as input the ZO gradient estimator and the parameter update at the last iterate (which can be viewed as surrogate gradient information) and outputs the sampling variance coordinate-wise, which can be written as:
$$\Sigma_t = \mathrm{QueryRNN}\left(\left[\hat{\nabla} f(\theta_{t-1}),\ \Delta\theta_{t-1}\right]\right) \tag{3}$$
The intuition is that if the estimated gradients or parameter updates of previous iterates are biased toward a certain direction, then we can probably increase the sampling probability toward that direction.
Using the predicted covariance $\Sigma_t$ to sample query directions increases the bias of the estimated gradient and reduces its variance, which leads to a tradeoff between bias and variance. The reduction of variance contributes to faster convergence, but too much bias tends to make the learned optimizer get stuck at bad local optima. To balance bias and variance, at test time we randomly use either the covariance of the standard Gaussian distribution $I_d$ or the predicted covariance $\Sigma_t$ as the input of the Guided ZO Oracle: $\Sigma_t' = X \Sigma_t + (1 - X) I_d$, where $X \sim \mathrm{Ber}(p)$ is a Bernoulli random variable that trades off between bias and variance. Note that the norm of the sampling covariance $\|\Sigma_t'\|$ may not equal that of the standard Gaussian sampling covariance $\|I_d\|$, which changes the expected norm of the sampled query direction. To keep the norm of the query direction invariant, we then normalize the norm of $\Sigma_t'$ to the norm of $I_d$.
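The random switch between the predicted covariance and the standard Gaussian covariance, followed by the norm normalization, could look roughly as follows. This sketch stores the diagonal covariance as a vector and assumes the Frobenius norm for the normalization (so the target norm is the square root of the dimension); the text does not specify the norm, and the function name and example values are hypothetical.

```python
import numpy as np

def mixed_covariance(sigma_pred, p, rng):
    """Pick the predicted diagonal covariance with probability p,
    otherwise the identity, then rescale to the norm of I_d.

    sigma_pred : predicted per-coordinate variances, shape (d,)
    p          : Bernoulli parameter (hypothetical value chosen by caller)
    """
    d = sigma_pred.shape[0]
    use_pred = rng.random() < p                 # Bernoulli draw
    sigma = sigma_pred if use_pred else np.ones(d)
    # Assumption: Frobenius norm, so ||I_d|| = sqrt(d).
    return sigma * (np.sqrt(d) / np.linalg.norm(sigma))

rng = np.random.default_rng(0)
sigma_pred = np.array([4.0, 1.0, 0.25, 0.25])   # made-up QueryRNN output
sigma_guided = mixed_covariance(sigma_pred, p=1.0, rng=rng)  # always predicted
sigma_plain = mixed_covariance(sigma_pred, p=0.0, rng=rng)   # always identity
```

The rescaling keeps the relative emphasis between coordinates while preventing the overall scale of the sampled directions from drifting away from that of standard Gaussian sampling.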
3.2 Objective Function
The objective function for training our proposed optimizer can be written as follows:
$$\mathcal{L}(\phi) = \sum_{t=1}^{T} \left( \omega_t f(\theta_t(\phi)) + \lambda \left\| \Sigma_t(\phi) - I_d \right\|^2 \right) \tag{4}$$
where $\phi$ is the parameter of the optimizer, $T$ is the horizon of the optimization trajectory, and $\{\omega_t\}$ are predefined weights associated with each time step $t$. The objective function consists of two terms. The first is the weighted sum of the optimizee loss values at each time step. We use linearly increasing weights (i.e., $\omega_t = t$) to force the learned optimizer to attach greater importance to the final loss rather than focus on the initial optimization stage. The second is the regularization term on the predicted Gaussian sampling covariance $\Sigma_t$ with regularization parameter $\lambda$. This term prevents the QueryRNN from predicting variance values that are too large or too small.
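As a small worked example of the training objective, the sketch below evaluates a weighted loss sum plus a covariance regularizer on a made-up trajectory. The function name `meta_loss`, the use of weights equal to the step index, the placement of the regularizer inside the sum over time steps, and the toy numbers are assumptions for illustration; the regularization value 0.005 is the one reported in the experiments.

```python
import numpy as np

def meta_loss(losses, sigmas, lam=0.005):
    """Weighted sum of optimizee losses plus a penalty that keeps each
    predicted diagonal covariance close to the identity.

    losses : optimizee loss at steps t = 1..T
    sigmas : list of predicted variance vectors, one per step
    """
    T = len(losses)
    weights = np.arange(1, T + 1)                  # assumed: weight t at step t
    weighted = float(weights @ np.asarray(losses, dtype=float))
    reg = sum(float(np.sum((s - 1.0) ** 2)) for s in sigmas)
    return weighted + lam * reg

losses = [3.0, 2.0, 1.0]
no_penalty = meta_loss(losses, [np.ones(4)] * 3)            # covariance = I_d
with_penalty = meta_loss(losses, [2.0 * np.ones(4)] * 3)    # covariance = 2 I_d
```

In the first call the regularizer vanishes and the value is 1*3 + 2*2 + 3*1 = 10; in the second, each step adds 4 squared deviations of 1, giving 10 + 0.005 * 12 = 10.06.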
3.3 Training the Learned Optimizer
In experiments, we do not train the UpdateRNN and the QueryRNN jointly for the sake of stability. Instead, we first train the UpdateRNN using standard Gaussian random vectors as query directions. Then we freeze the parameters of the UpdateRNN and train the QueryRNN separately. Both modules are trained by truncated Backpropagation Through Time (BPTT) using the same objective function in equation 4.
As the process of random Gaussian sampling is not differentiable, we cannot backpropagate the error signal through this component directly when training the QueryRNN. Therefore, we use the "reparameterization trick" (Kingma and Welling, 2013) to generate random query directions. Specifically, to generate a query direction $u \sim \mathcal{N}(0, \Sigma_t)$, we first sample a standard Gaussian vector $z \sim \mathcal{N}(0, I_d)$ and then apply the reparameterization $u = \Sigma_t^{1/2} z$. In this way, we can backpropagate through the random Gaussian sampling module (i.e., the ZO Oracle) to train the QueryRNN.
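The reparameterization step can be illustrated with a diagonal covariance stored as a vector of variances. The sketch below only shows the sampling side (NumPy has no automatic differentiation); the function name and the example variances are hypothetical.

```python
import numpy as np

def sample_directions(sigma_diag, q, rng):
    """Draw q query directions from N(0, diag(sigma_diag)) by scaling
    standard Gaussian noise, so the randomness is isolated in z and the
    sample is a deterministic, differentiable function of sigma_diag."""
    z = rng.standard_normal((q, sigma_diag.shape[0]))  # z ~ N(0, I_d)
    return np.sqrt(sigma_diag) * z                     # u = Sigma^{1/2} z

rng = np.random.default_rng(0)
sigma_t = np.array([0.5, 1.0, 2.0])       # made-up predicted variances
u = sample_directions(sigma_t, q=20000, rng=rng)
emp_var = u.var(axis=0)                   # should be close to sigma_t
```

For a fixed noise vector, each coordinate of the sample has derivative z / (2 * sqrt(sigma)) with respect to its variance, which is what an autodiff framework would propagate back to the QueryRNN.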
To train the optimizer, we need to take derivatives of the objective function with respect to the optimizer parameter. However, the objective function in equation 4 contains the loss function of the optimizee, whose gradient information is not available. To obtain the derivatives, we can follow the assumption in Chen et al. (2017b) that the gradient information of the optimizee loss function is available at training time and is not needed at test time. However, this assumption does not hold when the gradient of the optimizee loss function is not available at training time either. In that case, we propose to approximate the gradient of the optimizee loss function with respect to its parameter using the coordinate-wise ZO gradient estimator (Lian et al., 2016; Liu et al., 2018b):
$$\hat{\nabla} f(\theta) = \sum_{i=1}^{d} \frac{1}{2\mu_i} \left[ f(\theta + \mu_i e_i) - f(\theta - \mu_i e_i) \right] e_i \tag{5}$$
where $d$ is the dimension of the optimizee loss function, $\mu_i$ is the smoothing parameter for the $i$-th coordinate, and $e_i \in \mathbb{R}^d$ is the standard basis vector whose $i$-th coordinate is 1 and whose other coordinates are 0. This estimator is deterministic and achieves an accurate estimate when the $\{\mu_i\}$ are sufficiently small. It is used only to backpropagate the error signal from the optimizee loss function to its parameter in order to train the optimizer, which is different from the estimator in equation 1 that is used by the optimizer to propose parameter updates. Note that the number of function queries this estimator requires scales with $d$, which slows down training, especially when the optimizee is high-dimensional. However, we can compute the gradient estimator of each coordinate in parallel to reduce the computation overhead. Another advantage is that when training the optimizer, there is no need to backpropagate through the optimizee model, which saves a considerable amount of memory.
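The coordinate-wise estimator is a central finite difference along each standard basis vector. A minimal sketch, assuming a single shared smoothing parameter for all coordinates and a toy test function (both hypothetical), also counts function queries to show how the cost scales with the dimension.

```python
import numpy as np

def coordinatewise_zo_gradient(f, theta, mu=1e-4):
    """Deterministic central-difference gradient estimate.

    Uses two function queries per coordinate, i.e. 2 * d queries in total.
    mu is shared by all coordinates here (a simplifying assumption).
    """
    d = theta.shape[0]
    grad = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = 1.0                                   # standard basis vector e_i
        grad[i] = (f(theta + mu * e) - f(theta - mu * e)) / (2.0 * mu)
    return grad

calls = [0]
def f(x):
    calls[0] += 1                                    # count black-box queries
    return float(np.sum(np.sin(x)))

theta = np.array([0.0, 0.5, 1.0])
g = coordinatewise_zo_gradient(f, theta)             # true gradient is cos(theta)
```

Each loop iteration is independent of the others, which is why the per-coordinate queries can be evaluated in parallel.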
4 Experimental Results
In this section, we empirically demonstrate the superiority of our proposed ZO optimizer on both a practical application (black-box adversarial attack on the MNIST and CIFAR-10 datasets) and a synthetic problem (binary classification in the stochastic zeroth-order setting). We compare our learned optimizer (called ZO-LSTM below) with existing hand-designed ZO optimization algorithms, including ZO-SGD (Ghadimi and Lan, 2013), ZO-signSGD (Liu et al., 2019), and ZO-ADAM (Chen et al., 2017a). For each task, we tune the hyperparameters of the baseline algorithms to report the best performance. Specifically, we set the learning rate of baseline algorithms to . We first coarsely tune the constant on a logarithmic range and then fine-tune it on a linear range. For ZO-ADAM, we tune values over and values over . To ensure a fair comparison, all optimizers use the same number of query directions to obtain the ZO gradient estimator at each iteration.
In all experiments, we use a 1-layer LSTM with 10 hidden units for both the UpdateRNN and the QueryRNN. For each RNN, another linear layer is applied to project the hidden state to the output (a 1-dim parameter update for the UpdateRNN and a 1-dim predicted variance for the QueryRNN). The regularization parameter $\lambda$ in the training objective function (equation 4) is set to 0.005. We use ADAM to train our proposed optimizer with truncated BPTT; each optimization is run for 200 steps and unrolled for 20 steps unless specified otherwise. At test time, we set the Bernoulli random variable (see Section 3.1) .
4.1 Adversarial Attack to Black-box Models
We first consider an important application of our learned ZO optimizer: generating adversarial examples to attack black-box models. In this problem, given the targeted model $F(\cdot)$ and an original example $x_0$, the goal is to find an adversarial example $x$ with a small perturbation that minimizes a loss function reflecting attack successfulness. The black-box attack loss function can be formulated as $f(x) = c\,\|x - x_0\| + \mathrm{Loss}(F(x))$, where $c$ balances the perturbation norm and attack successfulness (Carlini and Wagner, 2017; Tu et al., 2019). Due to the black-box setting, one can only compute the function value of the above objective, which leads to ZO optimization problems (Chen et al., 2017a). Note that attacking each sample in the dataset corresponds to a particular ZO optimization problem, which motivates us to train a ZO optimizer (or "attacker") offline on a small subset and apply it online to attack other samples with faster convergence (which means lower query complexity) and lower final loss (which means less distortion).
Here we experiment with black-box attacks on deep neural network image classifiers; see the detailed problem formulation in Appendix A.1. We follow the same neural network architectures used in Cheng et al. (2018) for the MNIST and CIFAR-10 datasets, which achieve 99.2% and 82.7% test accuracy respectively. We randomly select 100 images that are correctly classified by the targeted model in each test set to train the optimizer and select another 100 images to test the learned optimizer. The dimension of the optimizee function is $d = 28 \times 28$ for MNIST and $d = 32 \times 32 \times 3$ for CIFAR-10. The number of sampled query directions is set to for MNIST and for CIFAR-10 respectively. All optimizers use the same initial points for finding adversarial examples.
Figure 2 shows the black-box attack loss versus iterations using different optimizers. We plot the loss curves of two selected test images (see Appendix A.3 for more plots on other test images) as well as the average loss curve over all 100 test images for each dataset. Our learned optimizer (ZO-LSTM) clearly leads to much faster convergence and lower final loss than the baseline optimizers on both the MNIST and CIFAR-10 attack tasks. The visualization of generated adversarial examples versus iterations can be found in Appendix A.2.
4.2 Stochastic Zeroth-order Binary Classification
Next we apply our learned optimizer in the stochastic ZO optimization setting. We consider a synthetic binary classification problem (Liu et al., 2019) with a non-convex least squared loss function: $f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - 1/(1 + e^{-\theta^\top x_i}) \right)^2$. To generate one dataset for the binary classification task, we first randomly sample a $d$-dimensional vector $\theta \in \mathbb{R}^d$ from $\mathcal{N}(0, I_d)$ as the ground truth. Then we draw samples $\{x_i\}$ from $\mathcal{N}(0, I_d)$ and obtain the label $y_i = 1$ if $\theta^\top x_i > 0$ and $y_i = 0$ otherwise. The size of the training set is 2000 for each dataset. Note that each dataset corresponds to a different optimizee function in the class of binary classification tasks. We generate 100 different datasets in total, and use 90 of them (i.e., 90 binary classification objective functions) to train the optimizer and the other 10 to test the performance of the learned optimizer. Unless specified otherwise, the problem dimension is ; the batch size and the number of query directions are set to and respectively. At each iteration of training, the optimizer is allowed to run for 500 steps and is unrolled for 20 steps.
In Figure 2(a), we compare various ZO optimizers and observe that our learned optimizer outperforms all other hand-designed ZO optimization algorithms. Figure 2(b)-2(c) compares the performance of ZO-SGD and ZO-LSTM with different query direction numbers $q$ and batch sizes $b$. ZO-LSTM consistently outperforms ZO-SGD in different optimization settings. In Figure 2(d), we generate binary classification problems with different dimensions $d$ and test the performance of ZO-LSTM. Our learned optimizer generalizes well and still achieves better performance than ZO-SGD.
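The data-generating procedure described above can be sketched as follows. The loss form (squared error between the label and a sigmoid of the linear score), the standard Gaussian sampling distributions, and the dimension used here are assumptions made for this illustration; only the training set size of 2000 is taken from the text.

```python
import numpy as np

def make_dataset(d, n, rng):
    """Sample a ground-truth vector and n labeled points: label 1 when the
    point lies on the positive side of the ground-truth hyperplane."""
    theta_true = rng.standard_normal(d)
    X = rng.standard_normal((n, d))
    y = (X @ theta_true > 0).astype(float)
    return X, y, theta_true

def least_squares_loss(theta, X, y):
    """Non-convex least squared loss with a sigmoid prediction (assumed form)."""
    pred = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return float(np.mean((y - pred) ** 2))

rng = np.random.default_rng(0)
X, y, theta_true = make_dataset(d=100, n=2000, rng=rng)   # d is hypothetical
loss_at_zero = least_squares_loss(np.zeros(100), X, y)
loss_at_truth = least_squares_loss(5.0 * theta_true, X, y)
```

At the zero vector every prediction is 0.5, so the loss is exactly 0.25 regardless of the labels; moving toward a scaled copy of the ground truth drives the loss toward zero. A stochastic ZO optimizer would query this loss on mini-batches of rows of `X`.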
4.3 Generalization to Different Tasks
The current L2L framework (first-order) aims to train an optimizer on a small subset of problems and make the learned optimizer generalize to a wide range of different problems. However, in practice, it is difficult to train a general optimizer that achieves good performance on problems with different structures and loss landscapes. In experiments, we find that the learned optimizer does not easily generalize to problems with different relative scales between the parameter update and the estimated gradient (similar to the definition of the learning rate). Therefore, when generalizing the learned optimizer to a totally different task, we scale the parameter update produced by the UpdateRNN by a factor and tune this hyperparameter on that task (similar to SGD/Adam).
We first train the optimizer on the MNIST attack task and then fine-tune it on the CIFAR-10 attack task, as shown in Figure 1(d)-1(f). (Footnote 1: Although black-box attack tasks on the MNIST and CIFAR-10 datasets seem intuitively similar, the ZO optimization problems on these two datasets are not that similar. Because the targeted models have very different architectures and the image features also vary a lot, the loss landscape and gradient scale are rather different.) We see that the fine-tuned optimizer (ZO-LSTM-finetune) achieves performance comparable to ZO-LSTM trained from scratch on a CIFAR-10 subset. We also apply the learned optimizer trained on the MNIST attack task to the totally different binary classification task (Figure 2(a)) and, surprisingly, find that it achieves almost identical performance to ZO-LSTM trained directly on this target task. These results demonstrate that our optimizer has learned a rather general ZO optimization algorithm that generalizes well to different tasks.
4.4 Analysis
In this section, we conduct experiments to analyze the effectiveness of each module and understand the function mechanism of our proposed optimizer (especially the QueryRNN).
Ablation study
To assess the effectiveness of each module, we conduct an ablation study on each task, as shown in Figure 3(a)-3(c). We compare the performance of ZO-SGD, ZO-LSTM (our model), ZO-LSTM-no-query (our model without the QueryRNN, i.e., using standard Gaussian sampling), and ZO-LSTM-no-update (our model without the UpdateRNN, i.e., ZO-SGD with the QueryRNN). We observe that both the QueryRNN and the UpdateRNN improve the performance of the learned optimizer in terms of convergence rate and/or final solution. Notably, the improvement induced by the QueryRNN is less significant on the binary classification task than on the black-box attack task. We conjecture that the reason is that the gradient directions are more random in the binary classification task, so it is much more difficult for the QueryRNN to identify the important sampling space. To further demonstrate the effectiveness of the QueryRNN, we also compare ZO-LSTM and ZO-LSTM-no-query with ZO-LSTM-GuidedES, whose parameter update is produced by the UpdateRNN but whose covariance matrix for random Gaussian sampling is adapted by the guided evolutionary strategy (GuidedES). For a fair comparison, we use the ZO gradient estimator and the parameter update at the last iterate (the same as the input of our QueryRNN) as surrogate gradients for GuidedES (see Appendix B for details). We find that using GuidedES to guide the query direction search also improves the convergence speed on the MNIST attack task, but the improvement is much smaller than that of the QueryRNN. In addition, GuidedES has negligible effects on the other two tasks.
Estimated gradient evaluation
In this experiment, we evaluate the estimated gradient produced by the Guided ZO Oracle with and without the QueryRNN. Specifically, we test our learned optimizer on the MNIST attack task and compute the average cosine similarity between the ground-truth gradient and the ZO gradient estimator over the optimization steps before convergence. In Figure 3(d), we plot the average cosine similarity of ZO-LSTM and ZO-LSTM-no-query against different query direction numbers $q$. We observe that the cosine similarity becomes higher with the QueryRNN, which means that the direction of the ZO gradient estimator is closer to that of the ground-truth gradient. As the query direction number $q$ increases, the improvement in cosine similarity becomes more significant. These results explain the effectiveness of the QueryRNN in terms of obtaining more accurate ZO gradient estimators. In Appendix C.1, we evaluate the iteration complexity with and without the QueryRNN to further verify its improved convergence rate and scalability with problem dimension.
Optimization trajectory analysis
To obtain a more in-depth understanding of what our proposed optimizer learns, we conduct another analysis on the MNIST attack task. We use the learned optimizer (or "attacker") to attack one test image in the MNIST dataset. Then we select one pixel in the image (corresponding to one coordinate to be optimized) and trace the predicted variance, the gradient estimator, and the parameter update of that coordinate at each iteration, as shown in Figure 3(e). We observe that although the ZO gradient estimator is noisy due to the high variance of random Gaussian sampling, the parameter update produced by the UpdateRNN is less noisy, which makes the optimization process less stochastic. The smoothing effect of the UpdateRNN is similar to that of ZO-ADAM, but it is learned automatically rather than designed by hand. The predicted variance produced by the QueryRNN is even smoother. With a larger value of the ZO gradient estimator or the parameter update, the QueryRNN produces a larger predicted variance to increase the sampling bias toward that coordinate. We observe that the overall trend of the predicted variance is more similar to that of the parameter update, which probably means that the parameter update plays a more important role in the prediction of the Gaussian sampling variance. Finally, in Appendix C.2, we also visualize the variance predicted by the QueryRNN and compare it with the final perturbation added to the image (i.e., the final solution of the attack task).
5 Conclusion
In this paper, we study the learning to learn framework for zeroth-order optimization problems. We propose a novel RNN-based optimizer that learns both the update rule and the Gaussian sampling rule. Our learned optimizer leads to significant improvement in terms of convergence speed and final loss. Experimental results on both synthetic and practical problems validate the superiority of our learned optimizer over other hand-designed algorithms. We also conduct extensive analytical experiments to show the effectiveness of each module and to understand our learned optimizer.
References
Andrychowicz et al. (2016). Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989.
Carlini and Wagner (2017). Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57.
Chen et al. (2017a). ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26.
Chen et al. (2017b). Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 748–756.
Cheng et al. (2018). Query-efficient hard-label black-box attack: an optimization-based approach. arXiv preprint arXiv:1807.04457.
Cotter and Conwell (1990). Fixed-weight networks can learn. In 1990 IJCNN International Joint Conference on Neural Networks, pp. 553–559.
Duchi et al. (2015). Optimal rates for zero-order convex optimization: the power of two function evaluations. IEEE Transactions on Information Theory 61 (5), pp. 2788–2806.
Ghadimi and Lan (2013). Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23 (4), pp. 2341–2368.
Gu et al. (2016). Zeroth-order asynchronous doubly stochastic algorithm with variance reduction. arXiv preprint arXiv:1612.01425.
Hansen (2016). The CMA evolution strategy: a tutorial. arXiv preprint arXiv:1604.00772.
Kingma and Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Li and Malik (2016). Learning to optimize. arXiv preprint arXiv:1606.01885.
Li and Malik (2017). Learning to optimize neural nets. arXiv preprint arXiv:1703.00441.
Lian et al. (2016). A comprehensive linear speedup analysis for asynchronous stochastic parallel optimization from zeroth-order to first-order. In Advances in Neural Information Processing Systems, pp. 3054–3062.
Liu et al. (2018a). Stochastic zeroth-order optimization via variance reduction method. arXiv preprint arXiv:1805.11811.
Liu et al. (2019). signSGD via zeroth-order oracle. In International Conference on Learning Representations.
Liu et al. (2018b). Zeroth-order stochastic variance reduction for nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 3727–3737.
Lv et al. (2017). Learning gradient descent: better generalization and longer horizons. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 2247–2255.
Maheswaranathan et al. (2018). Guided evolutionary strategies: augmenting random search with surrogate gradients. arXiv preprint arXiv:1806.10230.
Nesterov and Spokoiny (2017). Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17 (2), pp. 527–566.
Papernot et al. (2017). Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519.
Tu et al. (2019). AutoZOOM: autoencoder-based zeroth order optimization method for attacking black-box neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 742–749.
Wichrowska et al. (2017). Learned optimizers that scale and generalize. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 3751–3760.
Younger et al. (2001). Meta-learning with backpropagation. In IJCNN'01. International Joint Conference on Neural Networks. Proceedings (Cat. No. 01CH37222), Vol. 3.
Appendix A Application: Adversarial Attack to Black-box Models
A.1 Problem formulation for black-box attack
We consider generating adversarial examples to attack a black-box DNN image classifier and formulate it as a zeroth-order optimization problem. The targeted DNN image classifier $F(x) = [F_1, F_2, \ldots, F_K]$ takes as input an image $x \in [0, 1]^d$ and outputs the prediction scores (i.e., log probabilities) of $K$ classes. Given an image $x_0 \in [0, 1]^d$ and its corresponding true label $t_0 \in \{1, 2, \ldots, K\}$, an adversarial example $x$ is visually similar to the original image $x_0$ but leads the targeted model $F$ to make a wrong prediction other than $t_0$ (i.e., untargeted attack). The black-box attack loss is defined as:
$$\min_{x} \ \max\left\{ F_{t_0}(x) - \max_{j \neq t_0} F_j(x),\ 0 \right\} + c\,\|x - x_0\|_p \tag{6}$$
The first term is the attack loss, which measures how successful the adversarial attack is and penalizes correct prediction by the targeted model. The second term is the distortion loss ($p$-norm of the added perturbation), which enforces a small perturbation, and $c$ is the regularization coefficient. In our experiment, we use norm (i.e., ), and set for MNIST attack task and for CIFAR-10 attack task. To ensure the perturbed image still lies within the valid image space, we can apply a simple transformation such that $x \in [0, 1]^d$. Note that in practice we can only access the inputs and outputs of the targeted model; thus we cannot obtain explicit gradients of the above loss function, which makes it a ZO optimization problem.
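A toy version of this attack objective is sketched below: a hinge on the margin between the true-class log probability and the best other class, plus a norm penalty on the perturbation. The classifier (a softmax over the raw input), the coefficient value, the choice of norm, and the tanh-based mapping into the valid image range (in the style of Carlini and Wagner (2017)) are all assumptions for illustration, not the settings used in the experiments.

```python
import numpy as np

def log_softmax(z):
    z = z - np.max(z)                          # numerical stability
    return z - np.log(np.sum(np.exp(z)))

def attack_loss(log_probs_fn, x, x0, t0, c=0.1, p=1):
    """Untargeted attack loss: margin hinge plus c times the p-norm distortion.
    c and p are hypothetical values."""
    scores = log_probs_fn(x)
    best_other = np.max(np.delete(scores, t0))
    attack = max(float(scores[t0] - best_other), 0.0)
    distortion = float(np.linalg.norm((x - x0).ravel(), ord=p))
    return attack + c * distortion

def to_image(w):
    """Map an unconstrained variable into [0, 1] (assumed tanh transform)."""
    return 0.5 * np.tanh(w) + 0.5

toy_classifier = lambda x: log_softmax(x)      # stand-in black-box model
x0 = np.array([0.9, 0.1, 0.0])                 # classified as class 0
x_adv = np.array([0.1, 0.9, 0.0])              # classified as class 1
loss_clean = attack_loss(toy_classifier, x0, x0, t0=0)
loss_adv = attack_loss(toy_classifier, x_adv, x0, t0=0)
img = to_image(np.array([-50.0, 0.0, 50.0]))
```

At the clean image the distortion is zero and the loss equals the margin 0.9 - 0.1 = 0.8; once the prediction flips, the hinge is zero and only the distortion term 0.1 * 1.6 = 0.16 remains.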
A.2 Visualization of generated adversarial examples versus iterations
A.3 Additional plots of blackbox attack loss versus iterations
A.4 Additional plots for ablation study
Appendix B Implementation Details for Guided Evolutionary Strategy
Guided evolutionary strategy (GuidedES) in Maheswaranathan et al. (2018) incorporates surrogate gradient information (which is correlated with the true gradient) into random search. It keeps track of a low dimensional guided subspace defined by surrogate gradients, which is combined with the full space for query direction sampling. Denoting by $U \in \mathbb{R}^{d \times n}$ the orthonormal basis of the guided subspace (i.e., $U^\top U = I_n$), GuidedES samples query directions from the distribution $\mathcal{N}(0, \Sigma)$, where the covariance matrix $\Sigma$ is modified as:
$$\Sigma = \alpha I_d + (1 - \alpha)\, U U^\top \qquad (7)$$
where $\alpha$ trades off between the full space and the guided space, and we tune the hyperparameter $\alpha$ for the best performance in our experiments. As discussed in Section 3.1, we normalize the sampled query direction so that its norm stays invariant. In our experiments, GuidedES takes as input the ZO gradient estimator and the parameter update at the last iterate (the same input as our QueryRNN) for a fair comparison with our proposed QueryRNN.
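The sampling rule above can be illustrated with a short sketch. It assumes the covariance is a mixture of the identity (weight alpha) and the projector onto the guided subspace (weight 1 - alpha), and that directions are rescaled to unit norm; both the exact scaling and the target norm are assumptions here, and `sample_guided_direction` is a hypothetical name. The sketch avoids forming the covariance matrix by adding an isotropic Gaussian draw to a Gaussian draw inside the subspace.

```python
import math
import random

def sample_guided_direction(U, alpha, rng=random):
    """Draw u ~ N(0, alpha * I_d + (1 - alpha) * U U^T), then rescale to unit norm.

    U:     list of n orthonormal basis vectors of the guided subspace, each of length d
    alpha: trade-off in [0, 1] between the full space and the guided subspace
    rng:   the random module or a random.Random instance
    """
    d = len(U[0])
    # Full-space component with covariance alpha * I_d.
    u = [math.sqrt(alpha) * rng.gauss(0.0, 1.0) for _ in range(d)]
    # Subspace component: a standard normal coefficient per basis vector
    # gives covariance (1 - alpha) * U U^T because the basis is orthonormal.
    for basis_vec in U:
        coeff = math.sqrt(1.0 - alpha) * rng.gauss(0.0, 1.0)
        for i in range(d):
            u[i] += coeff * basis_vec[i]
    norm = math.sqrt(sum(v * v for v in u))
    if norm == 0.0:
        return u  # degenerate draw; nothing to rescale
    return [v / norm for v in u]
```

Setting `alpha=1.0` recovers plain isotropic sampling, while `alpha=0.0` restricts every query direction to the guided subspace.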
Appendix C Additional Analytical Study
C.1 Iteration complexity versus problem dimension
In this experiment, we evaluate the iteration complexity with and without the QueryRNN. Specifically, we test the performance of ZOSGD and ZOLSTMnoupdate (i.e., ZOSGD with the QueryRNN) on the MNIST attack task and compare the number of iterations required to achieve initial attack success. In Figure 8, we plot iteration complexity against problem dimension $d$. We generate MNIST attack problems with different dimensions by rescaling the added perturbation using bilinear interpolation. Figure 8 shows that as the problem dimension increases, ZOSGD scales poorly and requires many more iterations (i.e., function queries) to attain initial attack success. With the QueryRNN, ZOLSTMnoupdate consistently requires fewer iterations, and the improvement is larger on higher-dimensional problems. These results show the effectiveness of the QueryRNN in terms of convergence rate and scalability with problem dimension.
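One plausible reading of the rescaling step is that the optimizer works on a low-resolution perturbation grid, which is then upsampled to the image resolution before being added to the image. The sketch below shows such a bilinear upsampling in plain Python; the function name `bilinear_resize` and the align-corners coordinate convention are assumptions made for illustration, not details taken from the paper.

```python
def bilinear_resize(grid, out_h, out_w):
    """Resize a 2D grid (list of rows) to out_h x out_w by bilinear interpolation.

    Assumption: align-corners convention, i.e. the corner values of the
    input grid map exactly to the corners of the output grid.
    """
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for r in range(out_h):
        # Continuous source row for output row r.
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for col in range(out_w):
            # Continuous source column for output column col.
            x = col * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            # Interpolate along columns first, then along rows.
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out
```

For example, a 7 x 7 perturbation (problem dimension 49) can be expanded with `bilinear_resize(delta, 28, 28)` to match the 28 x 28 MNIST image size, so the problem dimension is controlled by the size of the low-resolution grid.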
C.2 Visualization of added perturbation and predicted variance
To further verify the effectiveness of the QueryRNN, we select one image from the MNIST dataset and visualize the final perturbation added to the image (i.e., the final solution of the MNIST attack task) as well as the sampling variance predicted by the QueryRNN, as illustrated in Figure 9. We first compare the final perturbations produced by ZOLSTM (Figure 9(b)) and ZOLSTMnoquery (Figure 9(c)). The perturbations produced by these two optimizers are generally similar, but the one produced by ZOLSTM is less spread out because of the sampling bias induced by the QueryRNN. We then compare these with the variance predicted by the QueryRNN of ZOLSTM, averaged over the iterations before convergence (Figure 9(d)). The average predicted variance and the final perturbation generated by ZOLSTM share some similar patterns. This is expected, since ZOLSTM uses the variance predicted by the QueryRNN to sample query directions, which guides the optimization trajectory and influences the final solution. Surprisingly, the average predicted variance of the QueryRNN of ZOLSTM is also similar to the final perturbation produced by ZOLSTMnoquery (which does not use the QueryRNN). These results suggest that the QueryRNN can recognize useful features early in the optimization and steer the sampling space toward the final solution.