# AutoZOOM: Autoencoder-based Zeroth Order Optimization Method for Attacking Black-box Neural Networks

###### Abstract

Recent studies have shown that adversarial examples in state-of-the-art image classifiers trained by deep neural networks (DNN) can be easily generated when the target model is transparent to an attacker, known as the white-box setting. However, when attacking a deployed machine learning service, one can only acquire the input-output correspondences of the target model; this is the so-called black-box attack setting. The major drawback of existing black-box attacks is the need for excessive model queries, which may lead to a false sense of model robustness due to inefficient query designs. To bridge this gap, we propose a generic framework for query-efficient black-box attacks. Our framework, AutoZOOM, which is short for Autoencoder-based Zeroth Order Optimization Method, has two novel building blocks towards efficient black-box attacks: (i) an adaptive random gradient estimation strategy to balance query counts and distortion, and (ii) an autoencoder trained offline with unlabeled data towards attack acceleration. Experimental results suggest that, by applying AutoZOOM to a state-of-the-art black-box attack (ZOO), a significant reduction in model queries can be achieved without sacrificing the attack success rate and the visual quality of the resulting adversarial examples. In particular, when compared to the standard ZOO method, AutoZOOM can consistently reduce the mean query counts in finding successful adversarial examples by at least 93% on MNIST, CIFAR-10 and ImageNet datasets. AutoZOOM’s post-success fine-tuning can further reduce attack distortion.


Chun-Chen Tu^{†}, Paishun Ting, Pin-Yu Chen, Sijia Liu,
Huan Zhang, Jinfeng Yi, Cho-Jui Hsieh, Shin-Ming Cheng
University of Michigan, Ann Arbor, USA
MIT-IBM Watson AI Lab, IBM Research
University of California, Davis, USA
JD AI Research, China
National Taiwan University of Science and Technology, Taiwan
^{†}Equal contribution.

Preprint. Work in progress.

## 1 Introduction

In recent years, “machine learning as a service” has offered the world effortless access to powerful machine learning tools for off-the-shelf data analysis and applications. For example, commercially available services such as Google Cloud Vision API and Clarifai.com provide well-trained image classifiers to the public. One is able to upload and obtain the class prediction results for images at hand at a low price. However, the existing and emerging machine learning platforms and their low model-access costs raise ever-increasing security concerns, as they also offer an ideal environment for testing malicious attempts. Even worse, the risks can be amplified when these services are used to build derived products such that the inherent security vulnerability could be leveraged by attackers.

In many computer vision tasks, DNN models achieve state-of-the-art prediction accuracy and hence are widely deployed in modern machine learning services. Nonetheless, recent studies have highlighted DNNs’ vulnerability to adversarial perturbations. In the white-box setting, in which the target model is entirely transparent to an attacker, visually imperceptible adversarial images can be easily crafted to fool a target DNN model towards misclassification by leveraging the gradient information [1, 2]. However, in the black-box setting, in which the parameters of the deployed model are hidden and one can only observe the input-output correspondences of a queried example, crafting adversarial examples requires a gradient-free (zeroth order) optimization approach to gather the necessary attack information. Figure 1 displays a prediction-evasive adversarial example crafted via iterative model queries from a black-box DNN trained on ImageNet.

Albeit achieving remarkable attack effectiveness, current black-box attack methods based on gradient estimation, such as [3, 4], are not query-efficient since they exploit coordinate-wise gradient estimation and value update, which inevitably incurs an excessive number of model queries and may lead to a false sense of model robustness due to inefficient query designs. In this paper, we propose to tackle the preceding problem by using AutoZOOM, an Autoencoder-based Zeroth Order Optimization Method. AutoZOOM has two novel building blocks: (i) an adaptive random gradient estimation strategy to balance the query count and distortion when crafting adversarial examples, and (ii) an autoencoder trained offline on other unlabeled data to accelerate black-box attacks. As illustrated in Figure 2, AutoZOOM utilizes the decoder to craft a high-dimensional adversarial perturbation from the learned low-dimensional latent-space representation, and its query efficiency can be well explained by the dimension-dependent convergence rate in gradient-free optimization.

Contributions. We summarize our main contributions as follows:

1. We propose AutoZOOM, a novel query-efficient black-box attack framework for generating adversarial examples. AutoZOOM features an adaptive random gradient estimation strategy and dimension reduction via an offline-trained autoencoder to reduce attack query counts while maintaining attack effectiveness and visual similarity. To the best of our knowledge, AutoZOOM is the first black-box attack using random full gradient estimation and data-driven acceleration.

2. We use the convergence rate of gradient-free optimization methods to motivate the query efficiency of AutoZOOM, and provide an error analysis of the averaged random gradient estimator in AutoZOOM relative to the true gradient for understanding the trade-off between estimation error and query counts.

3. When applied to a state-of-the-art black-box attack proposed in [3], AutoZOOM attains a similar attack success rate while achieving a significant reduction (at least 93%) in the mean query counts required to attack the DNN image classifiers for MNIST, CIFAR-10 and ImageNet. It can also fine-tune the distortion in the post-success stage.

## 2 AutoZOOM: Background and Methods

### 2.1 Black-box Attack Formulation and Zeroth Order Optimization

In the black-box attack setting, it suffices to denote the target DNN as a classification function $F$ that takes a $d$-dimensional scaled image $x \in [0,1]^d$ as its input and yields a vector $F(x)$ of prediction scores over all image classes, such as the prediction probabilities of each class. We further consider applying an entry-wise monotonic transformation $M(\cdot)$ to the output of $F$ for black-box attacks, since a monotonic transformation preserves the ranking of the class predictions and can alleviate the problem of large score variation in $F(x)$ (e.g., probability to log probability).

Here we formulate black-box targeted attacks; the formulation can be easily adapted to untargeted attacks. Let $x_0$ denote a natural image with ground-truth class label $t_0$, and let $x$ denote the adversarial example of $x_0$ with target attack class label $t \neq t_0$. The problem of finding an adversarial example can be formulated as an optimization problem taking the generic form of

$$\min_{x \in [0,1]^d} \; \mathrm{Dist}(x, x_0) + \lambda \cdot \mathrm{Loss}(x, t) \tag{1}$$

where $\mathrm{Dist}(x, x_0)$ measures the distortion between $x$ and $x_0$, $\mathrm{Loss}(x, t)$ is an attack objective reflecting the likelihood of predicting $t$, $\lambda$ is a regularization coefficient, and the constraint $x \in [0,1]^d$ confines the adversarial image to the valid image space. The distortion measure is often evaluated by the $L_p$ norm, defined as $\|\delta\|_p = (\sum_{i=1}^{d} |\delta_i|^p)^{1/p}$ for $p \geq 1$, where $\delta = x - x_0$ is the adversarial perturbation to $x_0$. The attack objective $\mathrm{Loss}$ can be the training loss of DNNs [2] or some designed loss based on model predictions [5].

In the white-box setting, an adversarial example is generated by using downstream optimizers such as ADAM [6] to solve (1); this requires the gradient $\nabla f(x)$ of the objective function $f$ in (1) with respect to the input $x$, computed via back-propagation through the DNN. However, in the black-box setting, acquiring $\nabla f(x)$ is implausible, and one can only obtain function evaluations $f(x)$, which renders solving (1) a zeroth order optimization problem. Recently, zeroth order optimization approaches [7, 8, 9] have circumvented this challenge by approximating the true gradient via function evaluations. Specifically, in black-box attacks, the gradient estimate replaces the gradient computation in downstream optimization solvers for solving (1).

### 2.2 Random Gradient Estimation

As a first attempt to enable gradient-free black-box attacks on DNNs, the authors in [3] use the symmetric difference quotient method [10] to estimate the $i$-th component of the gradient by

$$g_i := \frac{\partial f(x)}{\partial x_i} \approx \frac{f(x + h e_i) - f(x - h e_i)}{2h} \tag{2}$$

using a small $h$. Here $e_i$ denotes the $i$-th elementary basis vector. A similar technique is also used in the follow-up works [11, 4]. Albeit contributing to powerful black-box attacks applicable to large networks such as those trained on ImageNet, the coordinate-wise gradient estimation step in (2) must incur an enormous number of model queries and is hence not query-efficient. For example, ImageNet inputs have dimension $d \approx 270{,}000$, rendering coordinate-wise zeroth order optimization based on gradient estimation query-inefficient.
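As a concrete illustration, the coordinate-wise estimator in (2) can be sketched in a few lines of Python. This is a minimal sketch under our own naming (it is not the ZOO implementation); it makes the query cost of a full coordinate-wise gradient explicit.

```python
import numpy as np

def coordinate_grad(f, x, i, h=1e-4):
    """Symmetric difference quotient for the i-th partial derivative of a
    black-box function f; each coordinate costs 2 queries."""
    e_i = np.zeros_like(x)
    e_i[i] = 1.0
    return (f(x + h * e_i) - f(x - h * e_i)) / (2.0 * h)

def full_coordinate_grad(f, x, h=1e-4):
    # A full gradient estimate over d coordinates costs 2 * d model queries.
    return np.array([coordinate_grad(f, x, i, h) for i in range(x.size)])
```

For a 270,000-dimensional ImageNet input, one such full gradient already costs roughly 540,000 queries, which motivates the random estimators below.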

To improve query efficiency, we propose a scaled random full gradient estimator of $\nabla f(x)$, defined as

$$g = b \cdot \frac{f(x + \beta u) - f(x)}{\beta} \cdot u \tag{3}$$

where $\beta > 0$ is a smoothing parameter, $u$ is a unit-length vector drawn uniformly at random from the unit Euclidean sphere, and $b$ is a tunable scaling parameter that balances the bias and variance trade-off of the gradient estimation error. Note that with $b = d$, the gradient estimator in (3) becomes the one used in [12]; with $b = 1$, it becomes the one adopted in [13]. We will provide an optimal value of $b$ based on our gradient estimation error analysis.

Averaged random gradient estimation. In this paper we consider a more general gradient estimator, in which the gradient estimate is averaged over $q$ random directions $\{u_j\}_{j=1}^{q}$. That is,

$$\bar{g} = \frac{1}{q} \sum_{j=1}^{q} g_j \tag{4}$$

where $g_j$ is a gradient estimate defined in (3) with $u = u_j$. The use of multiple random directions can reduce the variance of $\bar{g}$ in (4) for convex loss functions [12, 9].
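The estimators (3) and (4) can be sketched as follows. This is an illustrative sketch under our notation (the function name is ours, and in practice `f` is a model query); note that $f(x)$ is evaluated once and shared across the $q$ directions, so one averaged estimate costs $q + 1$ queries.

```python
import numpy as np

def random_grad(f, x, q=1, b=None, beta=1e-2, rng=None):
    """Averaged scaled random gradient estimator of Eqs. (3)-(4):
    g_j = b * (f(x + beta * u_j) - f(x)) / beta * u_j, averaged over q
    unit directions u_j drawn uniformly from the unit sphere.
    Costs q + 1 function queries (f(x) is shared across directions)."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    b = float(q) if b is None else float(b)  # AutoZOOM's default choice b = q
    fx = f(x)
    g = np.zeros_like(x, dtype=float)
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)  # uniform direction on the unit sphere
        g += b * (f(x + beta * u) - fx) / beta * u
    return g / q
```

For a sanity check, with $b = d$ and a large $q$ the estimate concentrates around the true gradient, consistent with the bias-variance discussion below.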

Below we establish an error analysis of the averaged random gradient estimator in (4) to study the influence of the parameters $b$ and $q$ on the estimation error and query efficiency.

###### Theorem 1.

Assume $f$ is differentiable and its gradient $\nabla f$ is $L$-Lipschitz (a function $h$ is $L$-Lipschitz if $\|h(x) - h(y)\|_2 \leq L \|x - y\|_2$ for any $x, y$). Then the mean squared estimation error of $\bar{g}$ in (4) with respect to $\nabla f(x)$ is upper bounded by

$$\mathbb{E}\,\|\bar{g} - \nabla f(x)\|_2^2 \leq \underbrace{\left(\frac{b^2}{dq} + \Big(\frac{b}{d} - 1\Big)^2\right)}_{\eta}\|\nabla f(x)\|_2^2 + C(b, d, q)\,\beta^2 L^2 \tag{5}$$

where $C(b, d, q)$ is a constant depending only on $b$, $d$ and $q$.

###### Proof.

The proof is given in the supplementary file. ∎

Here we highlight the important implications of Theorem 1: (i) the error analysis holds even when $f$ is non-convex; (ii) in DNNs, the true gradient $\nabla f(x)$ can be viewed as the numerical gradient obtained via back-propagation; (iii) for any fixed $q$, selecting a small $\beta$ (as we do in AutoZOOM) effectively reduces the last error term in (5), so we focus on optimizing the first error term; (iv) the first error term in (5) exhibits the influence of $b$ and $q$ on the estimation error and is independent of $\beta$. We elaborate on (iv) as follows. Fixing $q$ and letting $\eta$ denote the coefficient of the first error term in (5), the optimal $b$ that minimizes $\eta$ is $b^* = dq/(d + q)$. For query efficiency, one would like to keep $q$ small, which implies $b^* \approx q$ and $\eta \approx 1$ when the dimension $d$ is large. On the other hand, when $q \to \infty$, $b^* \to d$ and $\eta \to 0$, which yields a smaller error upper bound but is query-inefficient. We also note that by setting $b = q$, the coefficient satisfies $\eta \leq 1$, so the first error term is bounded independently of the dimension $d$ and the parameter $\beta$.

Adaptive random gradient estimation. Based on Theorem 1 and the error analysis above, in AutoZOOM we set $b = q$ in (3) and propose an adaptive strategy for selecting $q$. AutoZOOM uses $q = 1$ (i.e., the fewest possible model evaluations) to first obtain rough gradient estimates for solving (1) until a successful adversarial image is found. After the initial attack success, it switches to more accurate gradient estimates with $q > 1$ to fine-tune the image quality. The trade-off between $q$ (which is proportional to the query count) and distortion reduction will be investigated in Section 3.
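The adaptive schedule can be sketched as a simple loop. This is a simplified sketch with hypothetical helper callables `grad_est`, `step`, and `is_success` standing in for the gradient estimator, the ADAM update, and the attack-success check; the actual loop in Algorithm 1 additionally manages the coefficient $\lambda$ and query budgets.

```python
def adaptive_attack_sketch(grad_est, step, is_success, x0, iters=1000, q_post=4):
    """Sketch of AutoZOOM's adaptive strategy: q = 1 until the first
    successful adversarial example, then q = q_post for fine-tuning."""
    x, q, succeeded = x0, 1, False
    for _ in range(iters):
        g = grad_est(x, q)  # costs q + 1 model queries per iteration
        x = step(x, g)      # e.g., an ADAM update using the estimate
        if not succeeded and is_success(x):
            succeeded, q = True, q_post  # switch to finer gradient estimates
    return x
```

The value of `q_post` controls the post-success trade-off between query count and distortion studied in Section 3.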

### 2.3 Attack Dimension Reduction via Autoencoder

Dimension-dependent convergence rate using gradient estimation. Different from first order convergence results, the convergence rate of zeroth order gradient descent methods carries an additional multiplicative dimension-dependent factor. In the convex loss setting the rate is $O(\sqrt{d}/\sqrt{T})$, where $T$ is the number of iterations [8, 9, 13, 14]. The same convergence rate has also been found in the nonconvex setting [7]. This dimension-dependent factor suggests that vanilla black-box attacks using gradient estimation can be query-inefficient when the (vectorized) image dimension $d$ is large, due to the curse of dimensionality in convergence. This also motivates us to use an autoencoder to reduce the attack dimension and improve query efficiency in black-box attacks.

In AutoZOOM, we propose to perform random gradient estimation from a reduced dimension to improve query efficiency. Specifically, as illustrated in Figure 2, we train an autoencoder (AE) using unlabeled data that are different from the training data to learn a reduced-dimension representation for data reconstruction. The encoder in an AE compresses the data to a low-dimensional latent space and the decoder reconstructs an example from its latent representation. The weights of an AE are learned to minimize the average reconstruction error. Note that training such an AE for black-box adversarial attacks is one-time and is entirely offline (i.e., no model queries needed).

Recall that the adversarial perturbation to a natural example $x_0$ is defined as $\delta = x - x_0$, where $x$ is the corresponding adversarial example. AutoZOOM uses the decoder $D$ to craft the adversarial perturbation from the latent space such that $\delta = D(z)$, where $z$ is the latent variable being optimized. In other words, the decoder provides distributional guidance learned from other data when mapping a perturbation crafted in the latent space to the original image space. Furthermore, with the decoder, the random gradient estimator benefits from dimension reduction by operating over the latent (smaller) space of dimension $d' < d$, and hence faster convergence in black-box attacks can be expected. We also note that for any reduced dimension $d'$, the choice of $b$ guided by Theorem 1 remains optimal in terms of minimizing the corresponding estimation error, despite the fact that the gradient estimation errors of different reduced dimensions cannot be directly compared. In Section 3 we report the superior query efficiency in black-box attacks achieved with the autoencoder and discuss the benefit of attack dimension reduction.
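The dimension-reduction mechanism can be illustrated with a stand-in decoder. Here we use a simple block-wise upsampler purely as a structural sketch (AutoZOOM instead learns the decoder offline with an autoencoder, and ZOO's bilinear scaling is a similar fixed map); the function names and sizes are our own illustrative choices.

```python
import numpy as np

def upsample_decoder(z_flat, latent_hw=(8, 8), image_hw=(32, 32)):
    """Map a d' = 64 latent vector to a 32x32 perturbation, so gradient
    estimation runs over 64 dimensions instead of 1024."""
    z = z_flat.reshape(latent_hw)
    ry = image_hw[0] // latent_hw[0]
    rx = image_hw[1] // latent_hw[1]
    return np.kron(z, np.ones((ry, rx)))  # block-wise (nearest) upsampling

def perturb(x0, z_flat):
    # x = clip(x0 + D(z)) keeps the adversarial image in the valid range.
    return np.clip(x0 + upsample_decoder(z_flat), 0.0, 1.0)
```

Here the attack dimension shrinks from 1024 to 64, a 16x reduction; a learned decoder plays the same role while also encoding the data distribution.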

### 2.4 AutoZOOM Algorithm

Algorithm 1 summarizes the AutoZOOM framework towards query-efficient black-box attacks on DNNs. We also note that AutoZOOM is a general acceleration tool that is compatible with any black-box adversarial attack obeying the attack formulation in (1). The details on adjusting the regularization coefficient and the query parameter based on run-time model evaluation results will be discussed in Section 3.
Our source code is publicly available at https://github.com/chunchentu/AutoZOOM.

## 3 Performance Evaluation

This section presents the experiments for assessing the performance of AutoZOOM in accelerating black-box attacks on DNNs in terms of the number of queries required for an initial attack success and for a specific distortion level.

### 3.1 Distortion Measure and Attack Objective

As described in Section 2, AutoZOOM is a query-efficient gradient-free optimization framework for solving the black-box attack formulation in (1). In the following experiments, we demonstrate the utility of AutoZOOM by using the same attack formulation proposed in ZOO [3], which uses the squared $L_2$ norm as the distortion measure and adopts the attack objective

$$\mathrm{Loss}(x, t) = \max\Big\{ \max_{j \neq t} \log [F(x)]_j - \log [F(x)]_t,\; 0 \Big\} \tag{6}$$

where this hinge function is designed for targeted black-box attacks on the DNN model $F$, and the monotonic transformation $M(\cdot) = \log(\cdot)$ is applied to the model output.
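Assuming the model returns class probabilities, the hinge objective above can be written as a short function. This is a sketch with our own naming (the ZOO implementation additionally supports a tunable confidence margin in place of the hinge at 0).

```python
import numpy as np

def targeted_hinge_loss(scores, t, eps=1e-30):
    """ZOO-style targeted attack objective: the log-score of the best
    non-target class minus that of the target class, hinged at 0."""
    logp = np.log(np.asarray(scores, dtype=float) + eps)  # M = log
    other = np.max(np.delete(logp, t))                    # best non-target class
    return max(other - logp[t], 0.0)
```

The loss reaches 0 exactly when the target class $t$ attains the top score, i.e., when the targeted attack succeeds.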

### 3.2 Comparative Black-box Attack Methods

We compare AutoZOOM with three different configurations: (i) the standard ZOO implementation (https://github.com/huanzhang12/ZOO-Attack) with bilinear perturbation scaling for dimension reduction; (ii) ZOO+AE, which is ZOO with our autoencoder; and (iii) ZOO+RV, which is ZOO using our random full gradient estimator in (3). Note that all attacks generate adversarial perturbations from the same reduced attack dimension.

### 3.3 Experiment Setup, Evaluation, Datasets and AutoZOOM Implementation

We assess the performance of different attack methods on several representative benchmark datasets, including MNIST [15], CIFAR-10 [16] and ImageNet [17]. For MNIST and CIFAR-10, we use the same DNN image classification models as in [5] (https://github.com/carlini/nn_robust_attacks). For ImageNet, we use the Inception-v3 model [18].
All experiments were conducted using the TensorFlow machine-learning library [19] on machines equipped with an Intel Xeon E5-2690v3 CPU and an NVIDIA Tesla K80 GPU.

All attacks used ADAM [6] to solve (1) with their estimated gradients and the same initial learning rate. On MNIST and CIFAR-10, all methods adopt 1,000 ADAM iterations. On ImageNet, ZOO and ZOO+AE adopt 20,000 iterations, whereas ZOO+RV and AutoZOOM adopt 100,000 iterations. Note that due to the different gradient estimation methods, the query count (i.e., the number of model evaluations) per iteration of a black-box attack may vary. ZOO and ZOO+AE use the parallel gradient update of (2) with a batch of 128 pixels, leading to 256 queries per iteration. ZOO+RV uses the random full gradient estimator in (3), yielding 2 queries per iteration. AutoZOOM uses the averaged random full gradient estimator in (4), resulting in $q + 1$ queries per iteration. For a fair comparison, query counts rather than iteration counts are used for performance assessment.

Query reduction ratio. In what follows, we use the mean query count of ZOO with the smallest initial regularization coefficient as the baseline for computing the query reduction ratio of the other methods and configurations.

TPR and initial success. We report the true positive rate (TPR), which measures the percentage of successful attacks satisfying a pre-defined threshold on per-pixel distortion, together with their query counts at first success. We also report the per-pixel distortion at initial success, where an initial success refers to the first query that yields a successful adversarial example.

Post-success fine-tuning. When implementing AutoZOOM in Algorithm 1, on MNIST and CIFAR-10 we find that AutoZOOM without post-success fine-tuning (i.e., keeping $q = 1$) already yields distortion similar to ZOO; the latter can be viewed as coordinate-wise fine-tuning and is thus query-inefficient. On ImageNet, we investigate the effect of post-success fine-tuning on reducing distortion.

Autoencoder. In AutoZOOM, we use convolutional autoencoders for attack dimension reduction, which are trained on unlabeled datasets that are different from the training dataset and the attacked natural examples. The implementation details are given in the supplementary material.

Dynamic switching on $\lambda$. To adjust the regularization coefficient $\lambda$ in (1), all methods start from an initial value $\lambda_{\text{ini}}$, set separately for MNIST/CIFAR-10 and for ImageNet. Furthermore, to balance the distortion Dist and the attack objective Loss in (1), we use a dynamic switching strategy to update $\lambda$ during the optimization process: every $S$ iterations, $\lambda$ is multiplied by 10 if the attack has never succeeded, and divided by 2 otherwise, where the period $S$ is also set per dataset. At the instance of initial success, we also reset $\lambda$ to $\lambda_{\text{ini}}$ and the ADAM parameters to their default values, as doing so empirically reduces the distortion for all attack methods.
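The switching rule above can be sketched as a one-line update (the function name is ours; the concrete values of $\lambda_{\text{ini}}$ and the period $S$ are dataset-dependent as described in the text):

```python
def update_lambda(lam, ever_succeeded):
    """Dynamic switching of the regularization coefficient, applied once
    every S iterations: grow lambda 10x while the attack keeps failing,
    shrink it 2x once a success has been observed."""
    return lam * 10.0 if not ever_succeeded else lam / 2.0
```

Growing $\lambda$ emphasizes the attack objective until a success is found; shrinking it afterwards shifts the emphasis back to distortion reduction.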

### 3.4 Black-box Attacks on MNIST and CIFAR-10

For both MNIST and CIFAR-10, we randomly select 50 correctly classified images from their test sets and perform targeted attacks on these images. Since both datasets have 10 classes, each selected image is attacked 9 times, targeting every class but its true one. For all attacks, the ratio of the reduced attack-space dimension to the original one (i.e., $d'/d$) is 25% for MNIST and 6.25% for CIFAR-10.

Table 1 shows the performance evaluation on MNIST for various values of $\lambda_{\text{ini}}$, the initial value of the regularization coefficient in (1), using the performance of ZOO as a baseline for comparison. For example, depending on $\lambda_{\text{ini}}$, the mean query count required by AutoZOOM to attain an initial success is reduced by between 93.21% and 98.57% relative to ZOO. One can also observe that allowing a larger $\lambda_{\text{ini}}$ generally leads to fewer mean query counts at the price of slightly increased distortion in the initial attack. The noticeably huge difference in the required query counts between ZOO+RV/AutoZOOM and ZOO/ZOO+AE validates the effectiveness of our proposed random full gradient estimator in (3), which dispenses with the coordinate-wise gradient estimation in ZOO while maintaining comparable true positive rates, thereby greatly improving query efficiency.

For CIFAR-10, we observe similar query-efficiency improvements, as displayed in Table 2. In particular, comparing the two query-efficient black-box attack methods (ZOO+RV and AutoZOOM), we find that using an autoencoder for attack dimension reduction (AutoZOOM) is more query-efficient than bilinear scaling (ZOO+RV). AutoZOOM achieves the highest attack success rates (ASRs) and mean query reduction ratios for all tested values of $\lambda_{\text{ini}}$. In addition, the true positive rates (TPRs) of the two methods are similar, but AutoZOOM usually takes fewer queries to reach the same distortion. We note that in one setting of $\lambda_{\text{ini}}$, AutoZOOM has a higher TPR but needs slightly more mean queries than ZOO+RV to reach the same distortion. This suggests that some adversarial examples whose post-success distortions are difficult for ZOO+RV to reduce can be handled by AutoZOOM.

### 3.5 Black-box Attacks on ImageNet

We selected 50 correctly classified images from the ImageNet test set to perform random targeted attacks, with the attack dimension reduction ratio set to 1.15%. The attack results are summarized in Table 3. Note that compared to ZOO, AutoZOOM significantly reduces the query count required to achieve an initial success by 99.3%, a remarkable improvement amounting to more than 2.2 million fewer model queries, given that the input dimension of ImageNet ($d \approx 270$K) is much larger than that of MNIST and CIFAR-10.

Post-success distortion refinement. As described in Algorithm 1, adaptive random gradient estimation is integrated into AutoZOOM, offering a quick initial attack success followed by a fine-tuning process to effectively reduce the distortion. This is achieved by adjusting the gradient averaging parameter $q$ in (4) in the post-success stage. In general, averaging over more random directions (i.e., setting a larger $q$) tends to better reduce the variance of the gradient estimation error, but at the cost of more model queries. Figure 3 (a) shows the mean distortion against query counts for various choices of $q$ in the post-success stage. The results suggest that a small $q > 1$ can further decrease the distortion in the converged phase compared with the case of $q = 1$. Moreover, the refinement effect on distortion empirically saturates at a moderate $q$, implying a marginal gain beyond that value. These findings demonstrate that AutoZOOM indeed strikes a balance between distortion and query efficiency in black-box attacks.

### 3.6 Remarks on Attack Dimension Reduction and Query Efficiency

In addition to the motivation from the convergence rate in zeroth order optimization (Section 2.3), as a sanity check, we corroborate the benefit of attack dimension reduction to query efficiency by comparing AutoZOOM with an alternative operated on the original (non-reduced) dimension; in this setting, the alternative is equivalent to ZOO+RV since the AE is disabled. Tested on all three datasets under the aforementioned settings, Figure 3 (b) shows the corresponding mean query count to initial success and the mean query reduction ratio. Compared to attacking in the original dimension, attack dimension reduction through AutoZOOM cuts roughly 35-40% of the query counts on MNIST and CIFAR-10 and at least 95% on ImageNet. This result highlights the importance of dimension reduction for query-efficient black-box attacks. For example, without dimension reduction, the attack on the original ImageNet dimension cannot even succeed within the query budget.

## 4 Related Work

Gradient-based adversarial attacks on DNNs fall within the white-box setting, since acquiring the gradient with respect to the input requires knowing the weights of the target DNN. As a first attempt towards black-box attacks, the authors in [20] proposed to train a substitute model using iterative model queries, perform white-box attacks on the substitute model, and transfer these attacks to the target model [21, 22]. However, the attack performance can be severely degraded due to poor attack transferability. Although ZOO achieves an attack success rate and visual quality comparable to many white-box attack methods [3], its coordinate-wise gradient estimation requires excessive target-model evaluations and is hence not query-efficient. The same gradient estimation technique is also used in the follow-up work [4]. Beyond optimization-based approaches, the authors in [11] proposed to use a natural evolution strategy to enhance query efficiency. The authors in [23] proposed an attack under a restricted setting where only the decision (top-1 predicted class) is known to an attacker. Such a black-box attack dispenses with class prediction scores and hence requires additional model queries. Due to space limitations, we provide more background and a table comparing existing black-box attacks in the supplementary material.

## 5 Conclusion

AutoZOOM is a generic attack acceleration framework that is compatible with any black-box attack having the general formulation in (1). It adopts an adaptive random full gradient estimation strategy to strike a balance between query counts and estimation errors, and features an autoencoder for attack dimension reduction and algorithmic convergence acceleration. Compared to a state-of-the-art attack (ZOO), AutoZOOM consistently reduces the mean query counts when attacking black-box DNN image classifiers for MNIST, CIFAR-10 and ImageNet, attaining at least 93% query reduction in finding initial successful adversarial examples while maintaining a similar attack success rate. It can also efficiently fine-tune the image distortion to maintain high visual similarity to the original image. Consequently, the query-efficient black-box attacks enabled by AutoZOOM provide novel means for assessing the robustness of deployed machine learning models.

## References

- [1] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” ICLR, arXiv preprint arXiv:1312.6199, 2014.
- [2] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” ICLR, arXiv preprint arXiv:1412.6572, 2015.
- [3] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh, “ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” in ACM Workshop on Artificial Intelligence and Security, 2017, pp. 15–26.
- [4] A. N. Bhagoji, W. He, B. Li, and D. Song, “Exploring the space of black-box attacks on deep neural networks,” arXiv preprint arXiv:1712.09491, 2017.
- [5] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in IEEE Symposium on Security and Privacy (SP), 2017, pp. 39–57.
- [6] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” ICLR, arXiv preprint arXiv:1412.6980, 2015.
- [7] S. Ghadimi and G. Lan, “Stochastic first-and zeroth-order methods for nonconvex stochastic programming,” SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013.
- [8] Y. Nesterov and V. Spokoiny, “Random gradient-free minimization of convex functions,” Foundations of Computational Mathematics, vol. 17, no. 2, pp. 527–566, 2017.
- [9] S. Liu, J. Chen, P.-Y. Chen, and A. O. Hero, “Zeroth-order online alternating direction method of multipliers: Convergence analysis and applications,” AISTATS, arXiv preprint arXiv:1710.07804, 2018.
- [10] P. D. Lax and M. S. Terrell, Calculus with applications. Springer, 2014.
- [11] A. Ilyas, L. Engstrom, A. Athalye, and J. Lin, “Black-box adversarial attacks with limited queries and information,” ICML, arXiv preprint arXiv:1804.08598, 2018.
- [12] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono, “Optimal rates for zero-order convex optimization: The power of two function evaluations,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2788–2806, 2015.
- [13] X. Gao, B. Jiang, and S. Zhang, “On the information-adaptive variants of the admm: an iteration complexity perspective,” Optimization Online, vol. 12, 2014.
- [14] Y. Wang, S. Du, S. Balakrishnan, and A. Singh, “Stochastic zeroth-order optimization in high dimensions,” arXiv preprint arXiv:1710.10551, 2017.
- [15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- [16] A. Krizhevsky, “Learning multiple layers of features from tiny images,” 2009.
- [17] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
- [18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826.
- [19] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016, pp. 265–283.
- [20] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, “Practical black-box attacks against machine learning,” in ACM Asia Conference on Computer and Communications Security, 2017, pp. 506–519.
- [21] N. Papernot, P. McDaniel, and I. Goodfellow, “Transferability in machine learning: from phenomena to black-box attacks using adversarial samples,” arXiv preprint arXiv:1605.07277, 2016.
- [22] Y. Liu, X. Chen, C. Liu, and D. Song, “Delving into transferable adversarial examples and black-box attacks,” ICLR, arXiv preprint arXiv:1611.02770, 2017.
- [23] W. Brendel, J. Rauber, and M. Bethge, “Decision-based adversarial attacks: Reliable attacks against black-box machine learning models,” ICLR, arXiv preprint arXiv:1712.04248, 2018.
- [24] D. Lowd and C. Meek, “Adversarial learning,” in ACM SIGKDD international conference on Knowledge discovery in data mining, 2005, pp. 641–647.
- [25] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli, “Evasion attacks against machine learning at test time,” in Joint European conference on machine learning and knowledge discovery in databases, 2013, pp. 387–402.
- [26] B. Biggio and F. Roli, “Wild patterns: Ten years after the rise of adversarial machine learning,” arXiv preprint arXiv:1712.03141, 2017.
- [27] A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial machine learning at scale,” ICLR, arXiv preprint arXiv:1611.01236, 2017.
- [28] P.-Y. Chen, Y. Sharma, H. Zhang, J. Yi, and C.-J. Hsieh, “EAD: elastic-net attacks to deep neural networks via adversarial examples,” AAAI, arXiv preprint arXiv:1709.04114, 2018.
- [29] S. Baluja and I. Fischer, “Adversarial transformation networks: Learning to generate adversarial examples,” arXiv preprint arXiv:1703.09387, 2017.
- [30] N. Carlini and D. Wagner, “Adversarial examples are not easily detected: Bypassing ten detection methods,” in ACM Workshop on Artificial Intelligence and Security, 2017, pp. 3–14.
- [31] A. Athalye, N. Carlini, and D. Wagner, “Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples,” ICML, arXiv preprint arXiv:1802.00420, 2018.
- [32] F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel, “Ensemble adversarial training: Attacks and defenses,” ICLR, arXiv preprint arXiv:1705.07204, 2018.
- [33] A. Ilyas, “Circumventing the ensemble adversarial training defense,” https://github.com/andrewilyas/ens-adv-train-attack, 2018.
- [34] N. Narodytska and S. P. Kasiviswanathan, “Simple black-box adversarial perturbations for deep networks,” arXiv preprint arXiv:1612.06299, 2016.
- [35] F. Suya, Y. Tian, D. Evans, and P. Papotti, “Query-limited black-box attacks to classifiers,” NIPS Workshop, arXiv preprint arXiv:1712.08713, 2017.

## Supplementary Material

## Appendix A More Background on Adversarial Attacks and Defenses

The research in generating adversarial examples to deceive machine-learning models, known as adversarial attacks, tends to evolve with the advance of machine-learning techniques and newly available public datasets. In [24], the authors studied adversarial attacks on linear classifiers with continuous or Boolean features. In [25], the authors proposed a gradient-based adversarial attack on kernel support vector machines (SVMs). More recently, gradient-based approaches have also been used in adversarial attacks on image classifiers trained by DNNs [1, 2]. Due to space limitations, we focus on related work in adversarial attacks on DNNs. Interested readers may refer to the survey paper [26] for more details.

Gradient-based adversarial attacks on DNNs fall within the white-box setting, since acquiring the gradient with respect to the input requires knowing the weights of the target DNN. In principle, adversarial attacks can be formulated as an optimization problem of minimizing the adversarial perturbation while ensuring attack objectives. In image classification, given a natural image, an untargeted attack aims to find a visually similar adversarial image resulting in a different class prediction, while a targeted attack aims to find an adversarial image leading to a specific class prediction. The visual similarity between a pair of adversarial and natural images is often measured by the $L_p$ norm of their difference, where $p \geq 1$. Existing powerful white-box adversarial attacks using the $L_\infty$, $L_2$, or $L_1$ norms include iterative fast gradient sign methods [27], Carlini and Wagner’s (C&W) attack [5], and elastic-net attacks to DNNs (EAD) [28].
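As a concrete illustration of these distortion measures, the toy sketch below computes the $L_1$, $L_2$, and $L_\infty$ norms of a perturbation. Random arrays stand in for the natural and adversarial images; all names are illustrative.

```python
import numpy as np

# Toy sketch: measure the distortion between a natural image and its
# adversarial counterpart under the L1, L2, and L-infinity norms.
rng = np.random.default_rng(0)
natural = rng.random((28, 28))                 # stand-in for a natural image
delta = 0.01 * rng.standard_normal((28, 28))   # stand-in perturbation
adversarial = np.clip(natural + delta, 0.0, 1.0)

diff = (adversarial - natural).ravel()
l1 = np.linalg.norm(diff, ord=1)         # total absolute change
l2 = np.linalg.norm(diff, ord=2)         # Euclidean distortion
linf = np.linalg.norm(diff, ord=np.inf)  # largest per-pixel change
print(l1, l2, linf)
```

For any vector, $L_1 \geq L_2 \geq L_\infty$, which is why attacks constrained under different norms trade off sparsity against per-pixel visibility of the perturbation.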

Black-box adversarial attacks are practical threats to deployed machine-learning services. Attackers can observe the input-output correspondences of any queried input, but the target model parameters are completely hidden. Therefore, gradient-based adversarial attacks are inapplicable in the black-box setting. As a first attempt, the authors in [20] proposed to train a substitute model using iterative model queries, perform white-box attacks on the substitute model, and leverage the transferability of adversarial examples [21, 22] to attack the target model. However, training a representative surrogate for a DNN is challenging due to the complicated and nonlinear classification rules of DNNs and the high dimensionality of the underlying dataset. The performance of black-box attacks can be severely degraded if the adversarial examples for the substitute model transfer poorly to the target model. To bridge this gap, the authors in [3] proposed a black-box attack called ZOO that directly estimates the gradient of the attack objective by iteratively querying the target model. Although ZOO achieves an attack success rate and visual quality comparable to many white-box attack methods, it exploits the symmetric difference quotient method [10] for coordinate-wise gradient estimation and value update, which requires excessive target model evaluations and is hence not query-efficient. The same gradient estimation technique is also used in the follow-up work in [4]. Although acceleration techniques such as importance sampling, bilinear scaling and random feature grouping have been used in [3, 4], the coordinate-wise gradient estimation approach still forms a bottleneck for query efficiency.
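The query bottleneck of coordinate-wise estimation can be seen in a few lines. The sketch below (not ZOO's actual implementation; a toy quadratic stands in for the black-box attack objective) applies the symmetric difference quotient to each coordinate, which costs $2d$ queries per full-gradient estimate and scales poorly to image dimensions.

```python
import numpy as np

# Toy sketch of coordinate-wise gradient estimation via the symmetric
# difference quotient: each full-gradient estimate costs 2*d queries
# to the black-box function f.
def coordinate_grad(f, x, h=1e-4):
    d = x.size
    grad = np.zeros(d)
    queries = 0
    for i in range(d):
        e = np.zeros(d); e[i] = 1.0
        grad[i] = (f(x + h * e) - f(x - h * e)) / (2 * h)
        queries += 2
    return grad, queries

# Stand-in "black-box" objective: a simple quadratic with gradient A @ x.
A = np.diag(np.arange(1.0, 6.0))
f = lambda x: 0.5 * x @ A @ x
x0 = np.ones(5)

g, q = coordinate_grad(f, x0)
print(q)                                  # 2 * d = 10 queries for d = 5
print(np.allclose(g, A @ x0, atol=1e-4))  # matches the true gradient
```

For a $299 \times 299 \times 3$ ImageNet input, $d \approx 2.7 \times 10^5$, so even one full-gradient pass of this kind would require over half a million model queries, which motivates the random-vector estimators studied in this paper.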

Beyond optimization-based approaches for black-box attacks, the authors in [11] proposed to use a natural evolution strategy to enhance query efficiency. The authors in [23] proposed an attack under a restricted setting, where only the decision (top-1 prediction class) is known to an attacker. Such a black-box attack lacks class prediction scores and hence requires additional model queries. Here we focus on improving the query efficiency of gradient-estimation-based black-box attacks and consider the case when the class prediction scores are known to an attacker. For the reader’s reference, we compare existing black-box attacks on DNNs with AutoZOOM in Table S1. One unique feature of AutoZOOM is an unlabeled data-driven technique (autoencoder) to accelerate black-box attacks. While white-box attacks such as [29] have utilized autoencoders trained on the training data and the transparent logit representations of DNNs, we propose in this work to use autoencoders trained on unlabeled natural data to improve query efficiency for black-box attacks.

Many methods have been proposed for defending DNNs against adversarial attacks. However, new defenses are continuously weakened by follow-up attacks [30, 31]. For instance, model ensembles [32] were shown to be effective against some black-box attacks, but they have recently been circumvented by advanced attack techniques [33]. In this paper, we focus on improving query efficiency in attacking black-box undefended DNNs.

## Appendix B Proof of Theorem 1

Recall that the data dimension is $d$ and we assume $f$ to be differentiable with an $L$-Lipschitz gradient. Fixing $\beta > 0$, consider a smoothed version of $f$:

$$ f_\beta(\mathbf{x}) = \mathbb{E}_{\mathbf{v} \sim U_B}\left[ f(\mathbf{x} + \beta \mathbf{v}) \right], \tag{S1} $$

where $U_B$ denotes the uniform distribution over the unit Euclidean ball. Based on [13, Lemma 4.1-a], we have the relation

$$ \nabla f_\beta(\mathbf{x}) = \frac{d}{\beta}\, \mathbb{E}_{\mathbf{u} \sim U_S}\left[ f(\mathbf{x} + \beta \mathbf{u})\, \mathbf{u} \right], \tag{S2} $$

where $U_S$ denotes the uniform distribution over the unit sphere, which then yields

$$ \mathbb{E}[\mathbf{g}] = \frac{b}{d}\, \nabla f_\beta(\mathbf{x}), \tag{S3} $$

where we recall that $\mathbf{g}$ has been defined in (3). Moreover, based on [13, Lemma 4.1-b], we have

$$ \left\| \nabla f_\beta(\mathbf{x}) - \nabla f(\mathbf{x}) \right\|_2 \leq \beta L. \tag{S4} $$

Substituting (S3) into (S4), we obtain

$$ \left\| \mathbb{E}[\mathbf{g}] - \frac{b}{d}\, \nabla f(\mathbf{x}) \right\|_2 \leq \frac{b \beta L}{d}. $$

This then implies that

$$ \mathbb{E}[\mathbf{g}] = \frac{b}{d}\, \nabla f(\mathbf{x}) + \boldsymbol{\epsilon}, \tag{S5} $$

where

$$ \left\| \boldsymbol{\epsilon} \right\|_2 \leq \frac{b \beta L}{d}. $$

Once again, by applying [13, Lemma 4.1-b], we can easily obtain that

$$ \mathbb{E}\left[ \left\| \mathbf{g} \right\|_2^2 \right] \leq \frac{2 b^2}{d} \left\| \nabla f(\mathbf{x}) \right\|_2^2 + \frac{b^2 \beta^2 L^2}{2}. \tag{S6} $$

Now, let us consider the averaged random gradient estimator in (4),

$$ \overline{\mathbf{g}} = \frac{1}{q} \sum_{j=1}^{q} \mathbf{g}_j. $$

Due to the properties of i.i.d. samples and (S5), we have

$$ \mathbb{E}[\overline{\mathbf{g}}] = \mathbb{E}[\mathbf{g}] = \frac{b}{d}\, \nabla f(\mathbf{x}) + \boldsymbol{\epsilon}. \tag{S7} $$

Moreover, we have

$$ \mathbb{E}\left[ \left\| \overline{\mathbf{g}} - \nabla f(\mathbf{x}) \right\|_2^2 \right] = \left\| \mathbb{E}[\overline{\mathbf{g}}] - \nabla f(\mathbf{x}) \right\|_2^2 + \mathbb{E}\left[ \left\| \overline{\mathbf{g}} - \mathbb{E}[\overline{\mathbf{g}}] \right\|_2^2 \right] \tag{S8} $$

$$ = \left\| \mathbb{E}[\mathbf{g}] - \nabla f(\mathbf{x}) \right\|_2^2 + \frac{1}{q}\, \mathbb{E}\left[ \left\| \mathbf{g} - \mathbb{E}[\mathbf{g}] \right\|_2^2 \right] \tag{S9} $$

$$ = \left\| \mathbb{E}[\mathbf{g}] - \nabla f(\mathbf{x}) \right\|_2^2 + \frac{1}{q} \left( \mathbb{E}\left[ \left\| \mathbf{g} \right\|_2^2 \right] - \left\| \mathbb{E}[\mathbf{g}] \right\|_2^2 \right), \tag{S10} $$

where we have used the fact that $\mathbb{E}\left[ (\mathbf{g}_i - \mathbb{E}[\mathbf{g}])^\top (\mathbf{g}_j - \mathbb{E}[\mathbf{g}]) \right] = 0$ for $i \neq j$. The definition of $\boldsymbol{\epsilon}$ in (S7) yields

$$ \left\| \mathbb{E}[\mathbf{g}] - \nabla f(\mathbf{x}) \right\|_2^2 \leq \frac{2 (b-d)^2}{d^2} \left\| \nabla f(\mathbf{x}) \right\|_2^2 + \frac{2 b^2 \beta^2 L^2}{d^2}. \tag{S11} $$

From (S6), we also obtain that for any $\mathbf{x} \in \mathbb{R}^d$,

$$ \mathbb{E}\left[ \left\| \mathbf{g} \right\|_2^2 \right] - \left\| \mathbb{E}[\mathbf{g}] \right\|_2^2 \leq \frac{2 b^2}{d} \left\| \nabla f(\mathbf{x}) \right\|_2^2 + \frac{b^2 \beta^2 L^2}{2}. \tag{S12} $$

Substituting (S11) and (S12) into (S10), we obtain

$$ \mathbb{E}\left[ \left\| \overline{\mathbf{g}} - \nabla f(\mathbf{x}) \right\|_2^2 \right] \leq \frac{2 (b-d)^2}{d^2} \left\| \nabla f(\mathbf{x}) \right\|_2^2 + \frac{2 b^2 \beta^2 L^2}{d^2} + \frac{1}{q} \left( \frac{2 b^2}{d} \left\| \nabla f(\mathbf{x}) \right\|_2^2 + \frac{b^2 \beta^2 L^2}{2} \right) \tag{S13} $$

$$ = \left( \frac{2 b^2}{d q} + \frac{2 (b-d)^2}{d^2} \right) \left\| \nabla f(\mathbf{x}) \right\|_2^2 + \left( \frac{2}{d^2} + \frac{1}{2 q} \right) b^2 \beta^2 L^2. \tag{S14} $$

Finally, we bound the mean squared estimation error as

$$ \mathbb{E}\left[ \left\| \overline{\mathbf{g}} - \nabla f(\mathbf{x}) \right\|_2^2 \right] = O\!\left( \frac{b^2}{d q} + \frac{(b-d)^2}{d^2} \right) \left\| \nabla f(\mathbf{x}) \right\|_2^2 + O\!\left( b^2 \beta^2 L^2 \right), \tag{S15} $$

which completes the proof.
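The behavior of the averaged random gradient estimator can be checked numerically. The sketch below (a toy quadratic objective standing in for the black-box attack loss; all variable names are illustrative) draws $q$ directions uniformly from the unit sphere, forms the estimator with $b = d$, and compares it against the true gradient for increasing $q$.

```python
import numpy as np

# Numerical sketch of the averaged random gradient estimator:
# g_j = b * (f(x + beta*u_j) - f(x)) / beta * u_j, with u_j uniform on
# the unit sphere, averaged over q i.i.d. samples (here b = d).
rng = np.random.default_rng(1)
d, beta, b = 20, 1e-3, 20

A = np.diag(np.linspace(1.0, 2.0, d))   # toy quadratic objective
f = lambda x: 0.5 * x @ A @ x
x = np.ones(d)
true_grad = A @ x

def avg_random_grad(q):
    u = rng.standard_normal((q, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)  # uniform on unit sphere
    fd = np.array([(f(x + beta * ui) - f(x)) / beta for ui in u])
    return b * (fd[:, None] * u).mean(axis=0)

errs = [np.linalg.norm(avg_random_grad(q) - true_grad) ** 2
        for q in (1, 10, 1000)]
print(errs)  # squared error typically shrinks as q grows
```

The decay of the error with $q$ matches the $O(b^2/(dq))$ variance term in (S15), while the residual floor corresponds to the bias terms controlled by $\beta$ and $b - d$.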

## Appendix C Architectures of Convolutional Autoencoders in AutoZOOM

On MNIST, the convolutional autoencoder (CAE) is trained on 50,000 randomly selected hand-written digits from the MNIST8M dataset (http://leon.bottou.org/projects/infimnist). On CIFAR-10, the CAE is trained on 9,900 images selected from its test dataset. The remaining images are used in black-box attacks. On ImageNet, all the attacked natural images are from 10 randomly selected image labels, and these labels are also used as the candidate attack targets. The CAE is trained on about 9,000 images from these classes.

Table S2 shows the architectures of all the autoencoders used in this work. Note that the autoencoders designed for ImageNet use bilinear scaling to downsize the input images to the autoencoder’s working resolution and to upsize the reconstructed output back to the original input size. This allows easy processing and handling by the autoencoder’s internal convolutional layers.
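The rescaling step can be sketched as follows. For illustration we assume the 299×299×3 Inception-v3 input size and a hypothetical 128×128 working resolution (the exact sizes are given in Table S2), and `bilinear_resize` is our own pure-NumPy stand-in for a library bilinear-interpolation routine:

```python
import numpy as np

# Illustrative bilinear rescaling around the autoencoder: the input is
# resized down to the working resolution, and the autoencoder's output
# is resized back up to the original input size.
def bilinear_resize(img, out_h, out_w):
    in_h, in_w, _ = img.shape
    ys = np.linspace(0, in_h - 1, out_h)   # sample rows in input coords
    xs = np.linspace(0, in_w - 1, out_w)   # sample columns in input coords
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None, None]          # vertical interpolation weights
    wx = (xs - x0)[None, :, None]          # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

img = np.random.default_rng(0).random((299, 299, 3))
small = bilinear_resize(img, 128, 128)       # downscale for the autoencoder
restored = bilinear_resize(small, 299, 299)  # upscale the autoencoder output
print(small.shape, restored.shape)
```

Because bilinear interpolation is differentiable (piecewise linear) in the pixel values, wrapping the autoencoder this way does not interfere with gradient estimation in the reduced attack space.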

## Appendix D More Adversarial Examples of Attacking Inception-v3 in the Black-box Setting

Figure S1 shows additional adversarial examples of attacking Inception-v3 in the black-box setting.