# Adversarially Robust Generalization Just Requires More Unlabeled Data

Runtian Zhai, Tianle Cai, Di He
Peking University
{zhairuntian, caitianle1998, di_he}@pku.edu.cn

Chen Dan
Carnegie Mellon University
cdan@cs.cmu.edu

Kun He
Huazhong University of Science and Technology
brooklet160@hust.edu.cn

John Hopcroft
Cornell University
jeh17@cornell.edu

Liwei Wang
Peking University
wanglw@cis.pku.edu.cn

Equal contribution.

###### Abstract

Neural network robustness has recently been highlighted by the existence of adversarial examples. Many previous works show that the learned networks do not perform well on perturbed test data, and significantly more labeled data is required to achieve adversarially robust generalization. In this paper, we theoretically and empirically show that with just more unlabeled data, we can learn a model with better adversarially robust generalization. The key insight of our results is based on a risk decomposition theorem, in which the expected robust risk is separated into two parts: the stability part which measures the prediction stability in the presence of perturbations, and the accuracy part which evaluates the standard classification accuracy. As the stability part does not depend on any label information, we can optimize this part using unlabeled data. We further prove that for a specific Gaussian mixture problem illustrated by schmidt2018adversarially (), adversarially robust generalization can be almost as easy as the standard generalization in supervised learning if a sufficiently large amount of unlabeled data is provided. Inspired by the theoretical findings, we propose a new algorithm called PASS by leveraging unlabeled data during adversarial training. We show that in the transductive and semi-supervised settings, PASS achieves higher robust accuracy and defense success rate on the Cifar-10 task.

## 1 Introduction

Deep learning lecun2015deep (), especially deep Convolutional Neural Networks (CNNs) lecun1998gradient (), has led to state-of-the-art results spanning many machine learning fields, such as image classification simonyan2014very (); he2016deep (); huang2017densely (); hu2017squeeze (), object detection ren2015faster (); redmon2016you (); lin2018focal (), semantic segmentation long2015fully (); zhao2017pyramid (); chen2018encoder () and action recognition tran2015learning (); wang2016temporal (); wang2018non ().

Despite the great success in numerous applications, recent studies show that deep CNNs are vulnerable to well-designed input samples known as adversarial examples DBLP:journals/corr/SzegedyZSBEGF13 (); biggio2013evasion (). Taking image classification as an example, for almost every commonly used, well-performing CNN, attackers are able to construct a small perturbation on an input image that is almost imperceptible to humans but causes the model to make a wrong prediction. The problem is serious because some adversarial examples transfer across different kinds of CNN architectures DBLP:journals/corr/PapernotMG16 (), which makes black-box attacks possible: an attacker who has no access to the model parameters or even the architecture can still easily fool a machine learning system.

There is a rapidly growing body of work on how to obtain a robust neural network model. Most of the successful methods are based on adversarial training DBLP:journals/corr/SzegedyZSBEGF13 (); madry2017towards (); goodfellow2014 (); huang2015learning (). The high-level idea of these works is that during training, we find the strongest perturbation of each sample against the current model and use the perturbed sample together with the correct label for gradient descent optimization. However, the learned model tends to overfit the training data and fails to remain robust on unseen test data. For example, using the state-of-the-art adversarially robust training method madry2017towards (), the defense success rate of the learned model on the test data is below 60% while that on the training data is almost 100%, which indicates that the robustness fails to generalize. Some theoretical results further show that it is challenging to achieve adversarially robust generalization. fawzi2018adversarial () proves that adversarial examples exist for any classifier and can be transferred across different models, making it impossible to design network architectures free from adversarial attacks. schmidt2018adversarially () shows that adversarially robust generalization requires much more labeled data than standard generalization in certain cases. tsipras2018robustness () presents an inherent trade-off between accuracy and robust accuracy and argues that the phenomenon comes from the fact that robust classifiers learn different features. Therefore, it is hard to reach high robustness with standard training methods.

Given the challenge of the task and previous findings, in this paper, we provide several theoretical and empirical results towards better adversarially robust generalization. In particular, we show that we can learn an adversarially robust model which generalizes well if we have plenty of unlabeled data, and that the labeled sample complexity for adversarially robust generalization in schmidt2018adversarially () can be largely reduced if unlabeled data is used. Intuitively, imagine we hold a model (i.e. a classifier) and a sample. We want to know whether the model’s prediction on the sample is correct and whether it is robust. Clearly, the correctness of the prediction can be checked only if we know the ground truth label. However, to evaluate the robustness, we can add perturbations to the sample and check whether the prediction changes. Since this evaluation does not require any label, we can measure and improve the robustness of the model by leveraging unlabeled data.

First, we formalize the intuition above using the language of generalization theory. The core technique is to decompose the upper bound of the expected robust risk into two terms: a stability term which measures whether the model can output consistent predictions under perturbations, and an accuracy term which evaluates whether the model can make correct predictions on natural samples. Given that the stability term does not rely on ground truth labels, unlabeled data can be used to minimize this term and thus improve the generalization ability. Second, we prove that for the Gaussian mixture problem defined in schmidt2018adversarially (), if unlabeled data can be used, adversarially robust generalization will be almost as easy as standard generalization in supervised learning (i.e. using the same number of labeled samples under similar conditions). Inspired by the theoretical findings, we argue that using labeled and unlabeled data together during training is a natural way to learn a model with better adversarially robust generalization. To achieve this, we design a PGD-based Adversarial training algorithm in Semi-Supervised setting (PASS for short). On the Cifar-10 task, we show that the PASS algorithm achieves better performance on adversarially robust generalization.

Our contributions are threefold.

• In Section 3.2.1, we provide a theorem showing that unlabeled data can be naturally used to reduce the expected robust risk in a general setting, and thus that leveraging unlabeled data is a way to improve adversarially robust generalization.

• In Section 3.2.2, we discuss a specific Gaussian mixture problem introduced in schmidt2018adversarially (). In schmidt2018adversarially (), the authors proved that in this case, the labeled sample complexity for robust generalization is significantly larger than that for standard generalization. As an extension to this work, we prove that in this case, the labeled sample complexity for robust generalization can be the same as that for standard generalization if we have enough unlabeled data.

• According to our theoretical results, we design an adversarial robust training algorithm using both labeled and unlabeled data. We name our algorithm PASS. Our experimental results show that in the transductive and semi-supervised settings, PASS achieves better performance compared to the baseline algorithms.

## 2 Related Works

### 2.1 Adversarial Attacks and Defense

Most previous works study how to attack a neural network model using small perturbations under certain norm constraints, such as the $\ell_\infty$ norm or the $\ell_2$ norm. For the $\ell_\infty$ constraint, the Fast Gradient Sign Method (FGSM) goodfellow2014 () finds the direction in which the perturbation increases the classification loss at an input point to the greatest extent; Projected Gradient Descent (PGD) madry2017towards () extends FGSM by updating the direction of the attack iteratively and clipping the modifications to the $\ell_\infty$ norm range after each iteration. For the $\ell_2$ constraint, DeepFool Moosavi-Dezfooli_2016_CVPR () iteratively computes the minimal-norm adversarial perturbation by linearizing around the input in each iteration. The C&W attack DBLP:journals/corr/CarliniW16a () is a comprehensive approach that works under both norm constraints. In this work, we focus on learning a robust model to defend against white-box attacks, i.e. we assume the worst case in which the attacker knows the model parameters and thus can use the algorithms above to attack the model.
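To make the difference between the single-step and the iterative attack concrete, the following is a minimal sketch for a binary logistic model with labels in {-1, +1}. The model, the function names (`logistic_loss_grad_x`, `fgsm`, `pgd`) and the numeric values are illustrative assumptions, not the setup used in the cited works.

```python
import math


def logistic_loss_grad_x(w, b, x, y):
    """Gradient of log(1 + exp(-y * (w.x + b))) with respect to the input x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # d/dz log(1 + exp(-y z)) = -y * sigmoid(-y z) = -y / (1 + exp(y z))
    coeff = -y / (1.0 + math.exp(y * z))
    return [coeff * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """One signed-gradient step of size eps (single-step attack, l_inf constraint)."""
    g = logistic_loss_grad_x(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]


def pgd(w, b, x, y, eps, step, n_steps):
    """Iterated signed-gradient steps, clipped back into the l_inf ball around x."""
    x_adv = list(x)
    for _ in range(n_steps):
        g = logistic_loss_grad_x(w, b, x_adv, y)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        # Projection onto {x' : ||x' - x||_inf <= eps}.
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


# Toy usage with hypothetical numbers.
x_adv = pgd([1.0, -2.0], 0.0, [0.5, 0.5], 1, eps=0.1, step=0.05, n_steps=7)
```

For a linear model the two attacks end at the same corner of the ball, since the sign of the input gradient never changes; the iterative version only matters once the model is nonlinear.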

There are a large number of papers about defending against adversarial attacks, but the results are still far from satisfactory. Remarkably, DBLP:journals/corr/abs-1802-00420 () shows that most defense methods take advantage of so-called “gradient masking” and provides an attack method called BPDA to correct the gradients. A recent paper li2019nattack () proposes a powerful black-box attack called NAttack, which fools most previous defenses with a high success rate. So far, adversarial training madry2017towards () has been the most successful white-box defense algorithm. By modeling the learning problem as a mini-max game between the attacker and the defender, the robust model can be trained using iterative optimization methods.

### 2.2 Semi-supervised/Transductive Learning

Using unlabeled data to help the learning process has proven promising in different applications NIPS2015_5947 (); 10.1093/bioinformatics/btr502 (); Elworthy:1994:BRH:974358.974371 (). Many approaches use regularizers called “soft constraints” to make the model “behave” well on unlabeled data. For example, transductive SVM Joachims:1999:TIT:645528.657646 () uses prediction confidence as a soft constraint, and graph-based SSL belkin2006manifold (); talukdar2009new () requires the model to have similar outputs at the endpoints of an edge. The most related work to ours is consistency-based SSL. It uses consistency as a soft constraint, which encourages the model to make consistent predictions on unlabeled data when a small perturbation is added. The consistency metric can be computed either from the model’s own predictions, as in the $\Pi$ model DBLP:journals/corr/SajjadiJT16a (), Temporal Ensembling DBLP:journals/corr/LaineA16 () and Virtual Adversarial Training 2017arXiv170403976M (), or from the predictions of a teacher model, as in the mean teacher model NIPS2017_6719 ().

Although our method looks similar to those algorithms, it has a totally different starting point and focuses on a different problem. The goal of the consistency-based approach is to improve standard generalization by designing auxiliary regularization on unlabeled data. Most of these works do not have any theoretical guarantees. In contrast, in our work, we show from a theoretical perspective that unlabeled data can be naturally used to improve generalization for robust machine learning problems. We derive theoretical bounds, show better sample complexity and design practical algorithms to demonstrate its strength.

## 3 Main Results

In this section, we first illustrate the benefits of using unlabeled data for robust generalization from a theoretical perspective. Then we develop a practical algorithm based on the theoretical findings.

### 3.1 Notations and Definitions

We consider a standard classification task with an underlying data distribution $P_{XY}$ over pairs of examples $x \in \mathbb{R}^d$ and corresponding labels $y \in \{1, 2, \dots, K\}$. Usually $P_{XY}$ is unknown and we only have access to $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$, in which each $(x_i, y_i)$ is drawn independently and identically from $P_{XY}$, $i = 1, \dots, n$. For ease of reference, we denote the empirical distribution as $\hat{P}_{XY}$ (i.e. the uniform distribution over the i.i.d. sampled data). We also assume that we are given a suitable loss function $l(f(x), y)$, where $f \in \mathcal{F}$ is parameterized by $\theta$. The standard loss function is the zero-one loss, i.e. $l^{0/1}(f(x), y) = \mathbb{I}[f(x) \neq y]$. Due to its discontinuous and non-differentiable nature, surrogate loss functions such as cross-entropy or mean square loss are commonly used during optimization.

Our goal is to find an $f \in \mathcal{F}$ that minimizes the expected classification risk. Without loss of generality, our theory is mainly based on the binary classification problem, i.e. $y \in \{-1, 1\}$. All theorems below can be easily extended to the multi-class classification problem. For a binary classification problem, the expected classification risk is defined as below.

###### Definition 1.

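(Expected classification risk). Let $P_{XY}$ be a probability distribution over $\mathbb{R}^d \times \{-1, 1\}$. The expected classification risk of a classifier $f \in \mathcal{F}$ under distribution $P_{XY}$ and loss function $l$ is defined as $\mathbb{E}_{(x, y) \sim P_{XY}} \left[ l(f(x), y) \right]$.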

We use $\mathcal{R}(f)$ to denote the classification risk under the underlying distribution and use $\hat{\mathcal{R}}(f)$ to denote the classification risk under the empirical distribution. We use $\mathcal{R}^{0/1}(f)$ to denote the risk with the zero-one loss function. The classification risk characterizes whether the model $f$ is accurate. However, we also care about whether $f$ is robust. For example, when the input $x$ is an image, we hope a small change (perturbation) to $x$ will not change the prediction of $f$. To this end, schmidt2018adversarially () defines the expected robust classification risk as below.

###### Definition 2.

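(Expected robust classification risk). Let $P_{XY}$ be a probability distribution over $\mathbb{R}^d \times \{-1, 1\}$ and $\mathcal{B}: \mathbb{R}^d \to 2^{\mathbb{R}^d}$ be a perturbation set. Then the $\mathcal{B}$-robust classification risk of a classifier $f \in \mathcal{F}$ under distribution $P_{XY}$ and loss function $l$ is defined as $\mathbb{E}_{(x, y) \sim P_{XY}} \left[ \sup_{x' \in \mathcal{B}(x)} l(f(x'), y) \right]$.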

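Again, we use $\mathcal{R}_{\mathcal{B}\text{-robust}}(f)$ to denote the expected robust classification risk under the underlying distribution and use $\hat{\mathcal{R}}_{\mathcal{B}\text{-robust}}(f)$ to denote the expected robust classification risk under the empirical distribution. We use $\mathcal{R}^{0/1}_{\mathcal{B}\text{-robust}}(f)$ to denote the robust risk with the zero-one loss function. In practice, the most commonly used setting is the perturbation under the $\epsilon$-bounded $\ell_\infty$ norm constraint $\mathcal{B}^\epsilon_\infty(x) = \{x' : \|x' - x\|_\infty \le \epsilon\}$. For simplicity, we refer to the robustness defined by this perturbation set as $\ell_\infty^\epsilon$-robustness.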

Before presenting our main theoretical results, we briefly introduce the motivation of the work. As we can see from Definition 2, the robust classification risk concerns whether $f$ can correctly predict the label for all $x'$ around $x$. We notice that testing whether $f$ is robustly accurate can be achieved by answering two questions separately: whether $f$ provides a correct prediction on $x$, and whether $f$ keeps its prediction unchanged on every $x'$ around $x$. It is easy to see that $f$ is robustly accurate if and only if the answers to the two questions are both yes. Based on this, we decompose the problem into two parts. One part concerns whether $f$ is accurate and the other part concerns whether $f$ is robust. Since the second question (i.e. whether $f$ keeps its prediction unchanged on every $x'$ around $x$) does not require any label information, we expect that robustness can be improved with more unlabeled data.
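The two questions can be checked separately in code. The sketch below does so for a linear classifier under an $\ell_\infty$ ball, where the worst-case perturbation has a closed form; the classifier, the toy data and all names (`can_flip`, `unstable`, `robust_error`) are assumptions made for illustration. Note that the stability check never reads a label.

```python
import random


def predict(w, b, x):
    """Linear classifier with outputs in {-1, +1}."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else -1


def can_flip(w, b, x, eps, target):
    """True if some x' with ||x' - x||_inf <= eps is classified as `target`."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    slack = eps * sum(abs(wi) for wi in w)  # largest change of w.x' + b inside the ball
    return z + slack >= 0 if target == 1 else z - slack < 0


def unstable(w, b, x, eps):
    """Stability question: needs only the model and the input, no label."""
    return can_flip(w, b, x, eps, -predict(w, b, x))


def robust_error(w, b, x, y, eps):
    """Robust zero-one loss: some point of the ball is classified as -y."""
    return can_flip(w, b, x, eps, -y)


rng = random.Random(0)
w, b, eps = [1.0, -0.5], 0.2, 0.3
data = []
for _ in range(1000):
    x = [rng.uniform(-2, 2), rng.uniform(-2, 2)]
    y = 1 if x[0] + 0.3 * rng.gauss(0, 1) >= 0 else -1  # noisy toy labels
    data.append((x, y))

stability = sum(unstable(w, b, x, eps) for x, _ in data) / len(data)  # labels unused
accuracy = sum(predict(w, b, x) != y for x, y in data) / len(data)
robust = sum(robust_error(w, b, x, y, eps) for x, y in data) / len(data)
```

On any such sample, `robust` lies between `accuracy` and `stability + accuracy`, which is the empirical counterpart of splitting robust accuracy into an accuracy part and a stability part.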

### 3.2 Robust Generalization Analysis

Our first result (Section 3.2.1) shows that unlabeled data can be used to improve adversarially robust generalization in a general setting. Our second result (Section 3.2.2) shows that for a specific learning problem defined on a Gaussian mixture model, compared to previous work schmidt2018adversarially (), the sample complexity of robust generalization can be significantly reduced by using unlabeled data. Both results suggest that using unlabeled data is a natural way to improve adversarially robust generalization. Due to space limitations, we put all detailed proofs of the theorems and lemmas in the appendix.

#### 3.2.1 General Results

In this subsection, we show that the expected robust classification risk can be bounded by the sum of two terms. The first term only depends on the hypothesis space and the unlabeled data, and the second term is a standard PAC bound.

###### Theorem 1.

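Let $\mathcal{F}$ be the hypothesis space. Let $S = \{(x_i, y_i)\}_{i=1}^{n}$ be the set of i.i.d. samples drawn from the underlying distribution $P_{XY}$. For any function $f \in \mathcal{F}$, with probability at least $1 - \delta$ over the random draw of $S$, we have

$$\mathcal{R}^{0/1}_{\mathcal{B}\text{-robust}}(f) \le \underbrace{\mathbb{E}_{x \sim P_X} \left[ \sup_{x' \in \mathcal{B}(x)} \mathbb{I}\left(f(x') \neq f(x)\right) \right]}_{(1)} + \underbrace{\hat{\mathcal{R}}^{0/1}(f) + \mathrm{Rad}_S(\mathcal{F}) + 3\sqrt{\frac{\log \frac{2}{\delta}}{2n}}}_{(2)}, \tag{1}$$

where (1) is a term that can be optimized with only unlabeled data and (2) is the standard PAC generalization bound. $P_X$ is the marginal distribution for $X$ and $\mathrm{Rad}_S(\mathcal{F})$ is the empirical Rademacher complexity of hypothesis space $\mathcal{F}$.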

From Theorem 1, we can see that the expected robust classification risk is bounded by the sum of two terms: the first term only involves the marginal distribution $P_X$ and the second term is the standard PAC generalization error bound. This shows that expected robust risk minimization can be achieved by jointly optimizing the two terms: we can optimize the first term using unlabeled data sampled from $P_X$ and optimize the second term using labeled data sampled from $P_{XY}$, which is the same as in standard supervised learning.
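The split into a label-free term and a standard accuracy term rests on a pointwise inequality between indicators. The following derivation is a sketch of that step only; the symbols $\mathcal{B}(x)$ for the perturbation set and $\mathbb{I}$ for the indicator are notational assumptions, and the passage from the expected accuracy term to its empirical version (the generalization bound) is not reproduced here.

```latex
\begin{align*}
\mathbb{I}[f(x') \neq y]
  &\le \mathbb{I}[f(x') \neq f(x)] + \mathbb{I}[f(x) \neq y]
  && \text{for every } x' \in \mathcal{B}(x), \\
\sup_{x' \in \mathcal{B}(x)} \mathbb{I}[f(x') \neq y]
  &\le \sup_{x' \in \mathcal{B}(x)} \mathbb{I}[f(x') \neq f(x)] + \mathbb{I}[f(x) \neq y], \\
\mathbb{E}_{(x,y)} \Big[ \sup_{x' \in \mathcal{B}(x)} \mathbb{I}[f(x') \neq y] \Big]
  &\le \underbrace{\mathbb{E}_{x} \Big[ \sup_{x' \in \mathcal{B}(x)} \mathbb{I}[f(x') \neq f(x)] \Big]}_{\text{stability, label-free}}
   + \underbrace{\mathbb{E}_{(x,y)} \, \mathbb{I}[f(x) \neq y]}_{\text{accuracy}} .
\end{align*}
```

The first line holds because $f(x') \neq y$ forces at least one of $f(x') \neq f(x)$ or $f(x) \neq y$; the second takes the supremum over $x'$, and the third takes expectations, after which the stability term depends on the distribution of $x$ alone.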

While cullina2018pac () suggests that in the standard PAC learning scenario (only labeled data is considered), the generalization gap of the robust risk can sometimes be uncontrollable by the capacity of the hypothesis space $\mathcal{F}$, our results show that we can mitigate this problem by introducing unlabeled data. In fact, our following result shows that with enough unlabeled data, learning a robust model can be almost as easy as learning a standard model.

#### 3.2.2 Learning from Gaussian Mixture Model

The learning problem defined on Gaussian mixture model is illustrated in schmidt2018adversarially () as an example to show adversarially robust generalization needs much more labeled data compared to standard generalization. In this subsection, we show that for this specific problem, just using more unlabeled data is enough to achieve adversarially robust generalization. For completeness, we first list the results in schmidt2018adversarially () and then show our theoretical findings.

###### Definition 3.

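(Gaussian mixture model schmidt2018adversarially ()). Let $\theta^\star \in \mathbb{R}^d$ be the per-class mean vector and let $\sigma > 0$ be the variance parameter. Then the $(\theta^\star, \sigma)$-Gaussian mixture model is defined by the following distribution over $(x, y) \in \mathbb{R}^d \times \{\pm 1\}$: First, draw a label $y \in \{\pm 1\}$ uniformly at random. Then sample the data point $x \in \mathbb{R}^d$ from $\mathcal{N}(y \cdot \theta^\star, \sigma^2 I)$.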

Given the samples from the distribution defined above, the learning problem is to find a linear classifier $f_w(x) = \mathrm{sign}(\langle w, x \rangle)$ to predict the label $y$ from $x$. schmidt2018adversarially () proved the following sample complexity bound for standard generalization.

###### Theorem 2.

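(Theorem 4 in schmidt2018adversarially ()). Let $(x, y)$ be drawn from the $(\theta^\star, \sigma)$-Gaussian mixture model with $\|\theta^\star\|_2 = \sqrt{d}$ and $\sigma \le c \cdot d^{1/4}$ where $c$ is a universal constant. Let $\hat{w} \in \mathbb{R}^d$ be the vector $\hat{w} = y \cdot x$. Then with high probability, the expected classification risk of the linear classifier $f_{\hat{w}}$ using 0-1 loss is at most 1%.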

Theorem 2 suggests that we can learn a linear classifier with low classification risk (e.g., 1%) even if there is only one labeled sample. However, the following theorem shows that for adversarially robust generalization under $\ell_\infty^\epsilon$ perturbation, significantly more labeled data is required.
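The single-sample claim can be illustrated numerically. The sketch below draws one labeled point from a Gaussian mixture, builds the classifier from it, and evaluates its exact standard zero-one risk through the Gaussian tail. The dimension, the mean vector and the constant in front of $d^{1/4}$ are arbitrary illustrative choices, not values from the cited theorem.

```python
import math
import random


def sample(theta, sigma, rng):
    """Draw (x, y): y uniform in {-1, +1}, then x ~ N(y * theta, sigma^2 I)."""
    y = rng.choice([-1, 1])
    return [y * t + sigma * rng.gauss(0.0, 1.0) for t in theta], y


def standard_error(w, theta, sigma):
    """Exact zero-one risk of x -> sign(w.x): Phi(-<w, theta> / (sigma * ||w||))."""
    dot = sum(wi * ti for wi, ti in zip(w, theta))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 0.5 * math.erfc(dot / (sigma * norm * math.sqrt(2.0)))


rng = random.Random(0)
d = 400
theta = [1.0] * d              # ||theta||_2 = sqrt(d)
sigma = 0.5 * d ** 0.25        # the factor 0.5 is an assumed illustrative constant
x, y = sample(theta, sigma, rng)
w_hat = [y * xi for xi in x]   # classifier built from a single labeled sample
err = standard_error(w_hat, theta, sigma)
```

The risk formula follows because $y \cdot \langle w, x \rangle$ is Gaussian with mean $\langle w, \theta \rangle$ and standard deviation $\sigma \|w\|_2$ under the mixture.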

###### Theorem 3.

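(Theorem 6 in schmidt2018adversarially ()). Let $g_n$ be any learning algorithm, i.e. a function from $n$ samples to a binary classifier $f_n$. Moreover, let $\sigma = c_1 \cdot d^{1/4}$, let $\epsilon \ge 0$, and let $\theta \in \mathbb{R}^d$ be drawn from $\mathcal{N}(0, I)$. We also draw $n$ samples from the $(\theta, \sigma)$-Gaussian mixture model. Then the expected $\ell_\infty^\epsilon$-robust classification risk of $f_n$ using 0-1 loss is at least $\frac{1}{2}\left(1 - \frac{1}{d}\right)$ if the number of labeled data $n \le c_2 \frac{\epsilon^2 \sqrt{d}}{\log d}$.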

As we can see from the above theorem, the sample complexity of robust generalization is larger than that of standard generalization by $\sqrt{d}$. This shows that for high-dimensional problems, adversarial robustness can provably require a significantly larger number of samples. We provide a new result which shows that the learned model can be robust with only one labeled sample and sufficiently many unlabeled samples. Our theorem is stated as follows:

###### Theorem 4.

Let be a labeled data drawn from -Gaussian mixture model with and . Let be unlabeled data drawn from . Let such that . Let . Then there exists a constant such that for any , with high probability, the expected -robust classification risk of using 0-1 loss is at most when the number of unlabeled data and .

From Theorem 4, we can see that when the number of unlabeled data is sufficiently large, we can learn a highly accurate and robust model using only one labeled sample. The learning process can be intuitively described as the following three steps: in the first step, we use unlabeled data to estimate the direction of $\theta^\star$, although we do not know the label that $\theta^\star$ (or $-\theta^\star$) corresponds to. In the second step, we use the given labeled sample to determine the “sign” of $\theta^\star$ with high probability. Finally, we obtain a good estimate of $\theta^\star$ by combining the two steps above and learn a robust classifier. The three key lemmas corresponding to the three steps are listed below ( are constants for ).
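The three steps can be sketched as follows. This is a toy illustration under stated assumptions: the direction is estimated by power iteration on the uncentered second-moment matrix of the unlabeled points (which nearly coincides with the sample covariance here, since the mixture has zero mean), and the dimension, noise level and sample size are hypothetical values chosen so the example runs quickly. The function names are not from the paper.

```python
import math
import random


def top_direction(xs, n_iter=20, seed=0):
    """Power iteration for the leading eigenvector of (1/n) * sum_i x_i x_i^T."""
    rng = random.Random(seed)
    d = len(xs[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(n_iter):
        new = [0.0] * d
        for x in xs:
            proj = sum(xi * vi for xi, vi in zip(x, v))
            for j in range(d):
                new[j] += proj * x[j]
        norm = math.sqrt(sum(c * c for c in new))
        v = [c / norm for c in new]
    return v


def fit_one_label(unlabeled, x0, y0):
    """Step 1: direction from unlabeled data. Step 2: sign from one labeled pair."""
    v = top_direction(unlabeled)
    s = 1.0 if y0 * sum(vi * xi for vi, xi in zip(v, x0)) >= 0 else -1.0
    return [s * vi for vi in v]  # Step 3: classify with x -> sign(w.x)


rng = random.Random(1)
d, n, sigma = 50, 500, 1.0      # toy values, assumed for illustration
theta = [1.0] * d


def draw():
    y = rng.choice([-1, 1])
    return [y * t + sigma * rng.gauss(0.0, 1.0) for t in theta], y


unlabeled = [draw()[0] for _ in range(n)]   # labels are discarded
x0, y0 = draw()                             # the single labeled sample
w = fit_one_label(unlabeled, x0, y0)
cosine = sum(wi * ti for wi, ti in zip(w, theta)) / math.sqrt(d)
```

The unlabeled points alone determine the direction only up to sign, because $x$ and $-x$ contribute the same second moment; the single labeled pair resolves that ambiguity.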

###### Lemma 1.

Under the same setting as Theorem 4, suppose that and . Then, with probability at least , there is a unique unit maximal eigenvector of the sample covariance matrix such that .

###### Lemma 2.

Under the same setting as Theorem 4, suppose is a unit vector such that for some constant . Then with probability at least , we have .

###### Lemma 3.

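(Lemma 20 in schmidt2018adversarially ()). Under the same setting as Theorem 4, for any $p \ge 1$ and $\epsilon \ge 0$, and for any unit vector $\hat{w}$ such that $\langle \hat{w}, \theta^\star \rangle \ge \epsilon \|\hat{w}\|_p^*$ where $\|\cdot\|_p^*$ is the dual norm of $\|\cdot\|_p$, the linear classifier $f_{\hat{w}}$ has $\ell_p^\epsilon$-robust classification risk at most $\exp\left(-\frac{\left(\langle \hat{w}, \theta^\star \rangle - \epsilon \|\hat{w}\|_p^*\right)^2}{2\sigma^2}\right)$.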

Our theoretical findings suggest that we can improve the adversarially robust generalization using unlabeled data, and in the next subsection, we will present a practical algorithm for real applications.

### 3.3 Practical Algorithm

Let $S_L$ be a set of $n$ labeled data and $S_U$ be a set of $m$ unlabeled data. Motivated by the theory above, to achieve better adversarially robust generalization, we can optimize the classifier to be accurate on $S_L$ and robust on $S_L \cup S_U$. This is equivalent to learning a classifier that is accurate and robust on $S_L$ and robust on $S_U$. Therefore, we design two loss terms on $S_L$ and $S_U$ separately.

For the labeled dataset $S_L$, we use the standard $\ell_\infty^\epsilon$-robust adversarial training objective function, i.e.,

$$L_1(f, S_L) = \frac{1}{n} \sum_{i=1}^{n} \max_{x_i' \in \mathcal{B}^\epsilon_\infty(x_i)} l_{CE}(f(x_i'), y_i).$$

Following the most common setting, during training, the classifier outputs a probability distribution over $K$ categories and is evaluated by the cross-entropy loss defined as $l_{CE}(f(x), y) = -\log f_y(x)$, where $f_k(x)$ is the output probability for category $k$.

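For unlabeled data $S_U$, we use an objective function which measures robustness without ground truth labels:

$$L_2(f, S_U) = \frac{1}{m} \sum_{i=1}^{m} \max_{x_i' \in \mathcal{B}^\epsilon_\infty(x_i)} l_{CE}(f(x_i'), \hat{y}_i), \quad \text{where } \hat{y}_i = \arg\max_k \{f_k(x_i)\}.$$

Putting the two objective functions together, our training loss is defined as a combination of $L_1$ and $L_2$ as follows:

$$L_{SSL}(f, S_L, S_U) = L_1(f, S_L) + \lambda L_2(f, S_U). \tag{2}$$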

Here $\lambda$ is a coefficient to trade off the two loss terms. In practice, we use iterative optimization methods to learn the function $f$. In the inner loop, we fix the model and use the Projected Gradient Descent (PGD) algorithm to learn the attack $x_i'$ for any $x_i \in S_L \cup S_U$. In the outer loop, we use stochastic gradient descent to optimize $L_{SSL}$ on the perturbed $x_i'$s. We call our algorithm PGD-based Adversarial training in Semi-Supervised setting (PASS). The general training process is shown in Algorithm 1.
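The inner and outer loops can be sketched on a small scale as follows. This is not the authors' implementation: it assumes a binary logistic model with labels in {-1, +1} instead of a CNN, full-batch gradient descent instead of mini-batch SGD, and hypothetical toy data and hyperparameters. The function names (`pgd_attack`, `pass_train`) are illustrative.

```python
import math
import random


def margin(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def pgd_attack(w, b, x, y, eps, step, n_steps):
    """Inner loop: maximize the logistic loss over the l_inf ball around x."""
    x_adv = list(x)
    for _ in range(n_steps):
        # The input gradient of log(1 + exp(-y z)) is a negative multiple of y * w,
        # so its sign in coordinate j is -y * sign(w_j).
        x_adv = [xa - step * y * ((wj > 0) - (wj < 0)) for xa, wj in zip(x_adv, w)]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


def loss_grad(w, b, x, y):
    """Gradient of log(1 + exp(-y z)) with respect to (w, b) at input x."""
    c = -y / (1.0 + math.exp(y * margin(w, b, x)))
    return [c * xi for xi in x], c


def pass_train(labeled, unlabeled, lam, eps, step, n_steps, lr, epochs):
    d = len(labeled[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        # L1: adversarial loss on labeled data with the true labels.
        for x, y in labeled:
            g, c = loss_grad(w, b, pgd_attack(w, b, x, y, eps, step, n_steps), y)
            gw = [a + gi / len(labeled) for a, gi in zip(gw, g)]
            gb += c / len(labeled)
        # L2: adversarial loss on unlabeled data, using the model's own prediction
        # on the clean input as a fixed pseudo-label.
        for x in unlabeled:
            y_hat = 1 if margin(w, b, x) >= 0 else -1
            g, c = loss_grad(w, b, pgd_attack(w, b, x, y_hat, eps, step, n_steps), y_hat)
            gw = [a + lam * gi / len(unlabeled) for a, gi in zip(gw, g)]
            gb += lam * c / len(unlabeled)
        # Outer loop: one descent step on L1 + lam * L2.
        w = [wi - lr * gi for wi, gi in zip(w, gw)]
        b -= lr * gb
    return w, b


rng = random.Random(0)


def point(y):
    return [2.0 * y + 0.5 * rng.gauss(0, 1), 2.0 * y + 0.5 * rng.gauss(0, 1)]


labeled = [(point(y), y) for y in [-1, 1] * 10]
unlabeled = [point(y) for y in [-1, 1] * 50]
w, b = pass_train(labeled, unlabeled, lam=0.5, eps=0.3, step=0.1, n_steps=7, lr=0.1, epochs=50)
```

The pseudo-label is computed on the clean input and held fixed while the perturbed input is optimized, so the unlabeled term penalizes only a change of prediction inside the ball.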

## 4 Experiments

We use the Cifar-10 task to verify our proposed algorithm. In particular, given a set of labeled and unlabeled data, we study two settings: the transductive setting, in which the test data is the given unlabeled data, and the semi-supervised setting, in which the test data is a set of data unseen during training. All code and models are available at https://github.com/RuntianZ/adversarial-robustness-unlabeled.

### 4.1 Experimental Setting

Following madry2017towards (), we use the ResNet model and modify the network by widening its layers by a factor of 10. This results in a network with five residual units with (16, 160, 320, 640) filters each. During training, we apply data augmentation including random crops and flips, as well as per-image standardization. The initial learning rate is 0.1 and is decayed by a factor of 10 twice during training. In the inner loop, we run a 7-step PGD with step size for each mini-batch. The perturbation is constrained to be under $\ell_\infty$ norm.

##### Transductive learning setting.

In the transductive setting, the algorithm has access to all labeled training data and all unlabeled test data. In the Cifar-10 task, we use the labeled training images as $S_L$ and the test images as $S_U$ and set . Each mini-batch contains 100 sampled labeled images and 20 sampled unlabeled images. Learning rate is decayed at the and the epoch. We compare our proposed method with several baselines which use labeled training data only, including the original PGD-based adversarial training madry2017towards (), thermometer encoding buckman2018thermometer (), cascade learning na2018cascade () and ADV-BNN liu2018advbnn ().

##### Semi-supervised learning setting.

In the semi-supervised learning setting, the unlabeled data no longer come from the test set, hence it is a better way to measure whether more unlabeled data can help adversarially robust generalization. Following many previous works DBLP:journals/corr/LaineA16 (); NIPS2017_6719 (); 2017arXiv170403976M (); athiwaratkun2018there (), we sample 5k/10k labeled data from the training set and use them as labeled data. We mask out the labels of the remaining images in the training set and use them as unlabeled data. By doing this, we conduct two semi-supervised learning tasks and call them the 5k/10k experiments. In a mini-batch, we sample 25/50 labeled images and 225/200 unlabeled images for the 5k/10k experiment respectively. In both experiments, we use several different values of $\lambda$ as an ablation study for this hyperparameter by setting , , . Learning rate is decayed at the and the epoch. We use the original PGD-based adversarial training madry2017towards () on the sampled 5k/10k labeled data as the baseline algorithm for comparison (referred to as PGD-adv).

### 4.2 Experimental Results

##### Transductive learning setting

In Table 1, we report the robust test accuracy of different models using different attack methods in the transductive setting. The attack methods include FGSM goodfellow2014 (), 7-step PGD (referred to as PGD-7), 40-step PGD (referred to as PGD-40), BPDA DBLP:journals/corr/abs-1802-00420 () and NAttack li2019nattack (). All attacks are limited to in terms of $\ell_\infty$ norm. We also report the test accuracy on the original test data (referred to as natural accuracy).

From the table, we can clearly see that PASS in the transductive setting is significantly better than all other baselines, by more than 30 points under different attacks. Furthermore, the defense success rate of PASS (which is computed as the ratio of robust accuracy to natural accuracy) is more than 99% under the PGD-40 attack, which is even stronger than the attack (PGD-7) used during training. This indicates that the model learned by PASS is very robust whenever it produces a correct prediction. This experimental result is to be expected, since the algorithm explicitly imposes regularization on the robustness of the test data.
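As a small worked example of this metric, the helper below assumes the defense success rate is the ratio of robust accuracy to natural accuracy, which matches its reading as robustness given a correct prediction. The function name and the numbers in the usage line are hypothetical and are not taken from the reported tables.

```python
def defense_success_rate(robust_accuracy, natural_accuracy):
    """Share of naturally correct predictions that remain correct under attack."""
    if natural_accuracy <= 0:
        raise ValueError("natural accuracy must be positive")
    return robust_accuracy / natural_accuracy


# Hypothetical values: 45% robust accuracy at 90% natural accuracy.
rate = defense_success_rate(0.45, 0.90)
```

A rate close to 1 means almost every input that is classified correctly stays correctly classified under the attack, regardless of how high the natural accuracy itself is.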

##### Semi-supervised learning setting.

We list all results of the 5k/10k experiments in Table 2. We use five criteria to evaluate the performance of the model: the natural training/test accuracy, the robust training/test accuracy using the PGD-7 attack, and the defense success rate.

First, we can see that in both experiments, the robust test accuracy is improved when we use unlabeled data. For example, the robust test accuracy of the models trained by PASS with for the 5k/10k experiments increases by 3.0/5.0 percentage points compared to the PGD-adv baselines. We also check the defense success rate, which evaluates whether the model is robust given that the prediction is correct. As we can see from the last column in Table 2, the defense success rate of models trained using our proposed method is much higher than that of the baselines. In particular, the defense success rate of the model trained with in the experiment is competitive with the model trained using PGD-adv on the whole dataset. This clearly shows the advantage of our proposed algorithm.

Second, we can also see the influence of the value of $\lambda$. The model trained with a larger $\lambda$ has higher robust accuracy. For example, in the experiment, the robust test accuracy of the model trained with is more than better than that with . However, we observe that training will become hard to converge if .

VAT miyato2015distributional () uses an adversarial regularization over the unlabeled data, but its goal is to improve natural test accuracy. In our experiment, as can be seen from the table, PASS improves robust test accuracy more than natural test accuracy. One of the differences between PASS and VAT is that VAT uses a one-step gradient method to attack during training, which is known to be a very weak attack and cannot lead to robust models madry2017towards () (more experimental results are provided in the appendix). In our work, we focus on improving network robustness and use a much stronger attack (7-step PGD) during training. For unlabeled data, when defending against a weak attack during training, the model cannot be learned to be robust but might tend to generalize toward “accuracy”; when defending against a strong attack during training, however, the model is learned to minimize the expected robust risk. Therefore, the model tends to generalize toward “robustness”, which is consistent with our theory.

## 5 Conclusion

In this paper, we theoretically and empirically show that with just more unlabeled data, we can learn a model with better adversarially robust generalization. We first give an expected robust risk decomposition theorem and then show that for a specific learning problem on a Gaussian mixture model, adversarially robust generalization can be almost as easy as standard generalization. Based on these theoretical results, we propose a new algorithm called PASS which leverages unlabeled data during training and empirically show its advantage. As future work, we will study the sample complexity of unlabeled data for broader function classes and solve more challenging real tasks.

## References

• (1) Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. CoRR, abs/1802.00420, 2018.
• (2) Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. There are many consistent explanations of unlabeled data: Why you should average. In International Conference on Learning Representations, 2019.
• (3) Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.
• (4) Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 387–402. Springer, 2013.
• (5) Jacob Buckman, Aurko Roy, Colin Raffel, and Ian Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In International Conference on Learning Representations, 2018.
• (6) Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. CoRR, abs/1608.04644, 2016.
• (7) Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611, 2018.
• (8) Daniel Cullina, Arjun Nitin Bhagoji, and Prateek Mittal. PAC-learning in the presence of evasion adversaries. arXiv preprint arXiv:1806.01471, 2018.
• (9) David Elworthy. Does Baum-Welch re-estimation help taggers? In Proceedings of the Fourth Conference on Applied Natural Language Processing, ANLC ’94, pages 53–58, Stroudsburg, PA, USA, 1994. Association for Computational Linguistics.
• (10) Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. Adversarial vulnerability for any classifier. arXiv preprint arXiv:1802.08686, 2018.
• (11) Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
• (12) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
• (13) Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
• (14) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
• (15) Ruitong Huang, Bing Xu, Dale Schuurmans, and Csaba Szepesvári. Learning with a strong adversary. arXiv preprint arXiv:1511.03034, 2015.
• (16) Thorsten Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML ’99, pages 200–209, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
• (17) Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. CoRR, abs/1610.02242, 2016.
• (18) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
• (19) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
• (20) Yandong Li, Lijun Li, Liqiang Wang, Tong Zhang, and Boqing Gong. Nattack: Learning the distributions of adversarial examples for an improved black-box attack on deep neural networks. arXiv preprint arXiv:1905.00441, 2019.
• (21) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
• (22) Xuanqing Liu, Yao Li, Chongruo Wu, and Cho-Jui Hsieh. Adv-BNN: Improved adversarial defense through robust Bayesian neural network. In International Conference on Learning Representations, 2019.
• (23) Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
• (24) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
• (25) Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 2018.
• (26) Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677, 2015.
• (27) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
• (28) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
• (29) Taesik Na, Jong Hwan Ko, and Saibal Mukhopadhyay. Cascade adversarial machine learning regularized with a unified embedding. In International Conference on Learning Representations, 2018.
• (30) Nicolas Papernot, Patrick D. McDaniel, and Ian J. Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR, abs/1605.07277, 2016.
• (31) Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3546–3554. Curran Associates, Inc., 2015.
• (32) Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
• (33) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
• (34) Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. CoRR, abs/1606.04586, 2016.
• (35) Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pages 5019–5031, 2018.
• (36) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
• (37) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
• (38) Partha Pratim Talukdar and Koby Crammer. New regularized algorithms for transductive learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 442–457. Springer, 2009.
• (39) Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1195–1204. Curran Associates, Inc., 2017.
• (40) Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
• (41) Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019.
• (42) Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
• (43) Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
• (44) Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
• (45) Bing Zhang and Mingguang Shi. Semi-supervised learning improves gene expression-based prediction of cancer recurrence. Bioinformatics, 27(21):3017–3023, 09 2011.
• (46) Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017.

## Appendix A Background on Generalization and Rademacher Complexity

The Rademacher complexity is a commonly used capacity measure for a hypothesis space.

###### Definition 4.

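Given a set $S=\{x_1,\dots,x_n\}$ of samples, the empirical Rademacher complexity of function class $\mathcal{F}$ (mapping from $\mathcal{X}$ to $\mathbb{R}$) is defined as:

$$\mathrm{Rad}_S(\mathcal{F})=\frac{1}{n}\,\mathbb{E}_{\sigma}\left[\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}\sigma_i f(x_i)\right],$$

where $\sigma=(\sigma_1,\dots,\sigma_n)$ contains i.i.d. random variables drawn from the Rademacher distribution $\mathrm{unif}(\{1,-1\})$.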
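As a rough illustration of this capacity measure, the sketch below estimates the empirical Rademacher complexity of a finite function class by Monte Carlo: draw random sign vectors, take the best correlation achieved by any function in the class, and average. The helper name `empirical_rademacher`, the toy class of 1-D threshold classifiers, and all sizes are assumptions made for this example only.

```python
import numpy as np

def empirical_rademacher(outputs: np.ndarray, n_draws: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of (1/n) E_sigma[ sup_f sum_i sigma_i f(x_i) ].

    outputs has shape (num_functions, n): row j holds f_j(x_1), ..., f_j(x_n).
    """
    rng = np.random.default_rng(seed)
    _, n = outputs.shape
    # Each row of sigma is one draw of n i.i.d. Rademacher signs.
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # correlations[k, j] = sum_i sigma[k, i] * f_j(x_i)
    correlations = sigma @ outputs.T
    # Supremum over the finite class, then average over the sign draws.
    return float(correlations.max(axis=1).mean() / n)

# Toy class (hypothetical): threshold classifiers f_t(x) = sign(x - t) on 50 points.
rng = np.random.default_rng(1)
x = rng.normal(size=50)
thresholds = np.linspace(-2.0, 2.0, 41)
threshold_outputs = np.where(x[None, :] > thresholds[:, None], 1.0, -1.0)

rad_single = empirical_rademacher(threshold_outputs[:1])  # a single function
rad_class = empirical_rademacher(threshold_outputs)       # the whole class
print(rad_single, rad_class)
```

A single fixed function cannot correlate with random signs on average, so its estimate is close to zero, while the richer class obtains a strictly larger value.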

By using the Rademacher complexity, we can directly provide an upper bound on the generalization error.

###### Theorem 5.

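(Theorem 3.5 in mohri2018foundations ()). Suppose $l(\cdot,\cdot)\in\{0,1\}$ is the 0-1 loss, and let $S=\{(x_1,y_1),\dots,(x_n,y_n)\}$ be the set of i.i.d. samples drawn from the underlying distribution $P_{XY}$. Let $\mathcal{F}$ be the hypothesis space, then with probability at least $1-\delta$ over $S$, for any $f\in\mathcal{F}$:

$$\mathbb{E}_{(x,y)\sim P_{XY}}\,l(f(x),y)\le\frac{1}{n}\sum_{i=1}^{n}l(f(x_i),y_i)+\mathrm{Rad}_S(\mathcal{F})+3\sqrt{\frac{\log(2/\delta)}{2n}}.$$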

## Appendix B Proof of Theorem 1

###### Proof.

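For the indicator function $\mathbb{I}(\cdot)$, we have for any $x$, $x'$ and $y$,

$$\mathbb{I}(f(x')\neq y)\le\mathbb{I}(f(x)\neq y)+\mathbb{I}(f(x)\neq f(x')).\tag{3}$$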

According to Definition 2, we have

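$$\begin{aligned}
R_{\mathcal{B}\text{-robust}}(f)&=\mathbb{E}_{(x,y)\sim P_{XY}}\sup_{x'\in\mathcal{B}(x)}l(f(x'),y)=\mathbb{E}_{(x,y)\sim P_{XY}}\sup_{x'\in\mathcal{B}(x)}\mathbb{I}(f(x')\neq y)\\
&\le\mathbb{E}_{(x,y)\sim P_{XY}}\sup_{x'\in\mathcal{B}(x)}\bigl(\mathbb{I}(f(x)\neq y)+\mathbb{I}(f(x)\neq f(x'))\bigr)\\
&=\mathbb{E}_{x\sim P_X}\sup_{x'\in\mathcal{B}(x)}\mathbb{I}(f(x')\neq f(x))+\mathbb{E}_{(x,y)\sim P_{XY}}\,l(f(x),y)\\
&=\mathbb{E}_{x\sim P_X}\sup_{x'\in\mathcal{B}(x)}\mathbb{I}(f(x')\neq f(x))+R(f),
\end{aligned}\tag{4}$$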

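where (4) is derived from (3). We further use Theorem 5 to bound $R(f)$. It is easy to verify that with probability at least $1-\delta$, for any $f\in\mathcal{F}$:

$$R_{\mathcal{B}\text{-robust}}(f)\le\frac{1}{n}\sum_{i=1}^{n}l(f(x_i),y_i)+\mathrm{Rad}_S(\mathcal{F})+3\sqrt{\frac{\log(2/\delta)}{2n}}+\mathbb{E}_{x\sim P_X}\sup_{x'\in\mathcal{B}(x)}\mathbb{I}(f(x')\neq f(x)),$$

which completes the proof. ∎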
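The decomposition in this proof can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: a 1-D two-class Gaussian mixture, a hypothetical threshold classifier `f`, and an interval perturbation set approximated by a finite grid. It computes the robust 0-1 risk, the accuracy part, and the stability part, and the stability part never touches the labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 2000, 0.3

# Toy data (assumed): labels in {-1, +1}, features drawn from N(y, 1).
y = rng.choice([-1, 1], size=n)
x = y * 1.0 + rng.normal(scale=1.0, size=n)

def f(inputs):
    """Hypothetical classifier: predict +1 above the threshold 0.2."""
    return np.where(inputs > 0.2, 1, -1)

# Perturbation set B(x) = [x - eps, x + eps], approximated by a grid.
deltas = np.linspace(-eps, eps, 61)
pred_clean = f(x)                              # shape (n,)
pred_pert = f(x[:, None] + deltas[None, :])    # shape (n, 61)

# Robust risk: some grid point in B(x) is misclassified.
robust_risk = np.mean((pred_pert != y[:, None]).any(axis=1))
# Accuracy part: standard 0-1 risk on clean inputs.
standard_risk = np.mean(pred_clean != y)
# Stability part: some grid point in B(x) flips the prediction (labels unused).
stability = np.mean((pred_pert != pred_clean[:, None]).any(axis=1))

print(robust_risk, standard_risk, stability)
```

Because inequality (3) holds for every sample, the empirical robust risk is bounded by the sum of the other two quantities regardless of the random seed.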

## Appendix C Proof of Theorem 4

For convenience, in this section, we use $c_i$ or $c_i'$ to denote some universal constants, where $i=0,1,2,3$.

In the proof of Theorem 4, we will use the concentration bound for covariance estimation in wainwright_2019 (). We first introduce the definition of spiked covariance ensemble.

###### Definition 5.

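(Spiked covariance ensemble). A sample $x_i\in\mathbb{R}^d$ from the spiked covariance ensemble takes the form

$$x_i=\sqrt{\nu}\,\xi_i\theta_0+w_i,$$

where $\xi_i\in\mathbb{R}$ is a zero-mean random variable with unit variance, $\nu>0$ is a fixed scalar, $\theta_0\in\mathbb{S}^{d-1}$ is a fixed vector and $w_i\in\mathbb{R}^d$ is a random vector independent of $\xi_i$, with zero mean and covariance matrix $I_d$.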

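To see why the spiked covariance ensemble model is useful, we note that the Gaussian mixture model is a special case of it. Specifically, let the $x_i$'s be the unlabeled data in Theorem 4. Then $x_i$ follows the Gaussian mixture distribution $\frac{1}{2}\mathcal{N}(\theta^*,\sigma^2I_d)+\frac{1}{2}\mathcal{N}(-\theta^*,\sigma^2I_d)$, and $x_i/\sigma$ is a spiked covariance ensemble with parameter $\nu=d/\sigma^2$, $\xi_i$ uniformly distributed on $\{\pm1\}$, and $\theta_0=\theta^*/\sqrt{d}$.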

The following theorem from wainwright_2019 () characterizes the concentration property of the spiked covariance ensemble, which we will further use to bound the robust classification error. Intuitively, the theorem says that we can approximately recover $\theta_0$ in the spiked covariance ensemble model using the top eigenvector $\hat{\theta}$ of the sample covariance matrix $\hat{\Sigma}$.

###### Theorem 6.

(Concentration of covariance estimation, see Corollary 8.7 in wainwright_2019 ()). Given i.i.d. samples $x_1,\dots,x_n$ from the spiked covariance ensemble with sub-Gaussian tails (which means both $\xi_i$ and $w_i$ are sub-Gaussian with parameter at most one), suppose that and . Then, with probability at least , there is a unique maximal eigenvector $\hat{\theta}$ of the sample covariance matrix $\hat{\Sigma}$ such that

$$\left\|\hat{\theta}-\theta_0\right\|_2\le c_0\sqrt{\frac{\nu+1}{\nu^2}}\sqrt{\frac{d}{n}}+c_3.$$

Using the theorem above, we can show that for the Gaussian mixture model, one of the top unit eigenvectors of the sample covariance matrix is approximately $\theta^*/\sqrt{d}$. In other words, we can approximately recover the parameter $\theta^*$ up to a sign difference: the principal component analysis of the sample covariance matrix gives either $v$ or $-v$, while $v$ is close to $\theta^*/\sqrt{d}$.
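The recovery-up-to-sign claim can be illustrated with a small simulation. This is a minimal sketch under assumed values: the choice $\theta^*=(1,\dots,1)$, the dimensions, the noise level, and the use of the uncentered sample covariance are all choices made for this example rather than the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 50, 5000, 2.0

# Assumed ground truth with ||theta_star||_2 = sqrt(d).
theta_star = np.ones(d)
unit_theta = theta_star / np.sqrt(d)

# Unlabeled data from 0.5 N(theta*, sigma^2 I) + 0.5 N(-theta*, sigma^2 I).
hidden_labels = rng.choice([-1.0, 1.0], size=n)
x = hidden_labels[:, None] * theta_star + sigma * rng.normal(size=(n, d))

# Uncentered sample covariance and its top eigenvector.
sample_cov = x.T @ x / n
eigvals, eigvecs = np.linalg.eigh(sample_cov)  # eigenvalues in ascending order
v = eigvecs[:, -1]

# PCA cannot distinguish theta* from -theta*, so compare against both signs.
err_plus = np.linalg.norm(v - unit_theta)
err_minus = np.linalg.norm(v + unit_theta)
recovery_error = min(err_plus, err_minus)
print(recovery_error)
```

Which of the two signs the eigensolver returns is arbitrary, which is why a labeled sample is still needed to fix the orientation.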

###### Lemma 4.

Under the same setting as Theorem 4, suppose that and . Then, with probability at least , there is a unique maximal eigenvector $v$ of the sample covariance matrix with unit norm such that

$$\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|_2\le\tau_0=\min\left\{c_0\sigma\sqrt{\frac{\sigma^2+1}{n}}+c_3,\ \sqrt{2}\right\}$$
###### Proof.

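As discussed above, $x_i/\sigma$ is a spiked covariance ensemble. By Theorem 6 we have with probability at least , there is a unique maximal eigenvector $\tilde{v}$ of the sample covariance matrix such that

$$\left\|\tilde{v}-\frac{\theta^*}{\sqrt{d}}\right\|_2\le c_0'\sigma\sqrt{\frac{\sigma^2+1}{n}}+c_3'.$$

Let $\tau=c_0'\sigma\sqrt{\frac{\sigma^2+1}{n}}+c_3'$, so that $\left\|\tilde{v}-\frac{\theta^*}{\sqrt{d}}\right\|_2\le\tau$. Below we need to consider two cases, $\tau<1$ and $\tau\ge1$.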

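Case 1: $\tau<1$. Let $v=\tilde{v}/\|\tilde{v}\|$. Since both $v$ and $\theta^*/\sqrt{d}$ are unit vectors, we have

$$\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|^2=2-2\left\langle v,\frac{\theta^*}{\sqrt{d}}\right\rangle.\tag{5}$$

Recall that $\left\|\tilde{v}-\frac{\theta^*}{\sqrt{d}}\right\|\le\tau$, which is equivalent to

$$\begin{aligned}
\tau^2&\ge\|\tilde{v}\|^2+\left\|\frac{\theta^*}{\sqrt{d}}\right\|^2-2\left\langle\tilde{v},\frac{\theta^*}{\sqrt{d}}\right\rangle\\
&=\|\tilde{v}\|^2+1-2\|\tilde{v}\|\left\langle v,\frac{\theta^*}{\sqrt{d}}\right\rangle.
\end{aligned}$$

Rearranging the terms and using the AM-GM inequality gives

$$2\left\langle v,\frac{\theta^*}{\sqrt{d}}\right\rangle\ge\|\tilde{v}\|+\frac{1-\tau^2}{\|\tilde{v}\|}\ge2\sqrt{1-\tau^2}.$$

Therefore, by (5),

$$\begin{aligned}
\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|&=\sqrt{2-2\left\langle v,\frac{\theta^*}{\sqrt{d}}\right\rangle}\\
&\le\sqrt{2-2\sqrt{1-\tau^2}}\\
&=\sqrt{\frac{2\tau^2}{1+\sqrt{1-\tau^2}}}\\
&\le\sqrt{2}\,\tau\\
&=\sqrt{2}\left(c_0'\sigma\sqrt{\frac{\sigma^2+1}{n}}+c_3'\right).
\end{aligned}$$

By substituting $c_0=\sqrt{2}c_0'$ and $c_3=\sqrt{2}c_3'$, we complete the proof.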

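Case 2: $\tau\ge1$. Let $v$ be one of $\pm\tilde{v}/\|\tilde{v}\|$ such that the inner product $\left\langle v,\frac{\theta^*}{\sqrt{d}}\right\rangle$ is nonnegative. Since both $v$ and $\theta^*/\sqrt{d}$ are unit vectors, we have

$$\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|^2=2-2\left\langle v,\frac{\theta^*}{\sqrt{d}}\right\rangle\le2.$$

Therefore, $\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|\le\sqrt{2}$. ∎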
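The rearrangement and AM-GM step used in Case 1 can be spelled out as below. The shorthand $a=\|\tilde{v}\|>0$ and $b=1-\tau^2$ is introduced here only for readability and is not the paper's notation; the step needs $b\ge0$, which is exactly the Case 1 assumption.

```latex
\begin{aligned}
\tau^2 \ge a^2 + 1 - 2a\left\langle v, \tfrac{\theta^*}{\sqrt{d}}\right\rangle
  &\iff 2\left\langle v, \tfrac{\theta^*}{\sqrt{d}}\right\rangle \ge a + \frac{b}{a}, \\
a + \frac{b}{a} &\ge 2\sqrt{a \cdot \frac{b}{a}} = 2\sqrt{b} = 2\sqrt{1-\tau^2}, \\
2 - 2\sqrt{1-\tau^2} &= \frac{2\tau^2}{1+\sqrt{1-\tau^2}} \le 2\tau^2 .
\end{aligned}
```

The last line is the identity $(1-s)(1+s)=1-s^2$ with $s=\sqrt{1-\tau^2}$, followed by dropping the nonnegative term $\sqrt{1-\tau^2}$ from the denominator.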

Now we have proved that by using the top eigenvector of the sample covariance matrix, we can recover $\theta^*$ up to a sign difference. Next, we will show that it is possible to determine the sign using the labeled data.

###### Lemma 5.

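Under the same setting as Theorem 4, suppose $v$ is a unit vector such that $\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|_2\le\tau_0$ where $\tau_0\le\sqrt{2}$. Then with probability at least $1-\exp\left(-\frac{d\left(1-\tau_0^2/2\right)^2}{2\sigma^2}\right)$, we have $\mathrm{sign}(y_L\cdot v^\top x_L)\,v^\top\theta^*>0$.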

###### Proof.

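Since $\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|_2\le\tau_0\le\sqrt{2}$, and both $v$ and $\theta^*/\sqrt{d}$ are unit vectors, we have $v^\top\theta^*\ge0$. So the event $\mathrm{sign}(y_L\cdot v^\top x_L)\,v^\top\theta^*\le0$ is equivalent to the event $y_L\cdot v^\top x_L\le0$, i.e.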

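$$\mathbb{P}\left[\mathrm{sign}(y_L\cdot v^\top x_L)\,v^\top\theta^*\le0\right]=\mathbb{P}\left[y_L\cdot v^\top x_L\le0\right]\tag{6}$$

Recall that $x_L$ is sampled from the Gaussian distribution $\mathcal{N}(y_L\theta^*,\sigma^2I_d)$, where $y_L$ is sampled uniformly at random from $\{\pm1\}$, so $y_Lx_L$ follows the Gaussian distribution $\mathcal{N}(\theta^*,\sigma^2I_d)$. Hence,

$$\mathbb{P}\left[y_L\cdot v^\top x_L\le0\right]=\mathbb{P}_{(y_Lx_L)\sim\mathcal{N}(\theta^*,\sigma^2I_d)}\left[v^\top(y_Lx_L)\le0\right]=\mathbb{P}_{g\sim\mathcal{N}(0,1)}\left[g\le-\frac{\theta^*\cdot v}{\sigma}\right]\tag{7}$$

Moreover, from $\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|_2\le\tau_0$ we can get

$$\langle\theta^*,v\rangle\ge\sqrt{d}\left(1-\frac{\tau_0^2}{2}\right)\tag{8}$$

So, using the Gaussian tail bound $\mathbb{P}_{g\sim\mathcal{N}(0,1)}[g\le-t]\le\exp(-t^2/2)$ for all $t\ge0$, and combining with (6), (7), (8), we have

$$\mathbb{P}\left[\mathrm{sign}(y_L\cdot v^\top x_L)\,v^\top\theta^*\le0\right]\le\exp\left(-\frac{d\left(1-\frac{\tau_0^2}{2}\right)^2}{2\sigma^2}\right),$$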

as stated in the lemma. ∎
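A quick simulation makes the sign-recovery argument concrete. The sketch below is illustrative only: it assumes $\theta^*=(1,\dots,1)$, builds a unit vector `v` at a controlled distance from $\theta^*/\sqrt{d}$ by rotating through a hypothetical angle `phi`, and compares the empirical failure rate of the sign test on a single labeled sample with the exponential bound from the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, trials = 16, 4.0, 20000

theta_star = np.ones(d)                  # assumed, with ||theta*||_2 = sqrt(d)
unit_theta = theta_star / np.sqrt(d)

# Unit direction orthogonal to theta*, used to tilt v away from theta*/sqrt(d).
orth = np.zeros(d)
orth[0], orth[1] = 1.0, -1.0
orth /= np.linalg.norm(orth)

phi = 0.5                                # hypothetical rotation angle in radians
v = np.cos(phi) * unit_theta + np.sin(phi) * orth
tau0 = np.linalg.norm(v - unit_theta)    # distance playing the role of tau_0

# One labeled sample per trial; the sign is wrong when y_L * v^T x_L <= 0.
y_l = rng.choice([-1.0, 1.0], size=trials)
x_l = y_l[:, None] * theta_star + sigma * rng.normal(size=(trials, d))
failure_rate = np.mean(y_l * (x_l @ v) <= 0)

# Bound from the Gaussian tail argument.
bound = np.exp(-d * (1 - tau0**2 / 2) ** 2 / (2 * sigma**2))
print(failure_rate, bound)
```

The bound is loose at these small sizes; it becomes sharp only when $\sqrt{d}/\sigma$ is large, which is the regime the analysis targets.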

Armed with Lemma 4 and Lemma 5, we now have a precise estimate of $\theta^*$ in the Gaussian mixture model. Next, we will show that the high precision of the estimate translates to low robust risk. To achieve this, we need a lemma from schmidt2018adversarially (), which upper bounds the robust classification risk of a linear classifier in terms of its inner product with $\theta^*$.

###### Lemma 6.

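(Lemma 20 in schmidt2018adversarially ()). Under the same setting as in Theorem 4, for any $p\ge1$ and $\epsilon\ge0$, and for any unit vector $\hat{w}$ such that $\langle\hat{w},\theta^*\rangle\ge\epsilon\|\hat{w}\|_p^*$ where $\|\cdot\|_p^*$ is the dual norm of $\|\cdot\|_p$, the linear classifier $f_{\hat{w}}$ has $\ell_p^\epsilon$-robust classification risk at most $\exp\left(-\frac{\left(\langle\hat{w},\theta^*\rangle-\epsilon\|\hat{w}\|_p^*\right)^2}{2\sigma^2}\right)$.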

Lemma 6 guarantees that if we can estimate $\theta^*$ precisely, we can achieve small robust classification risk. Combined with Lemma 4 and Lemma 5, which provide such an estimate, we are now ready to prove the robust classification risk bound stated in Theorem 4. We can actually prove a slightly more general theorem below with some extra parameters, and obtain Theorem 4 as a corollary.

###### Theorem 7.

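Let $(x_L,y_L)$ be a labeled data point drawn from the $(\theta^*,\sigma)$-Gaussian mixture model with $\|\theta^*\|_2=\sqrt{d}$. Let $x_1,\dots,x_n$ be unlabeled data drawn from $\frac{1}{2}\mathcal{N}(\theta^*,\sigma^2I_d)+\frac{1}{2}\mathcal{N}(-\theta^*,\sigma^2I_d)$. Let $\tau_0$ be as stated in Lemma 4, and $v$ be the normalized eigenvector (i.e. $\|v\|_2=1$) with respect to the maximal eigenvalue of the sample covariance matrix such that $\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|_2\le\tau_0$ with probability at least . Let $\hat{w}=\mathrm{sign}(y_L\cdot v^\top x_L)\,v$. Then with probability at least , the linear classifier $f_{\hat{w}}$ has $\ell_\infty^\epsilon$-robust classification risk at most $\beta$ when

$$\epsilon\le1-\frac{\tau_0^2}{2}-\frac{\sigma\sqrt{2\log\frac{1}{\beta}}}{\sqrt{d}}.\tag{9}$$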
###### Proof.

By the choice of $v$, (8) holds, i.e.

$$\langle\theta^*,v\rangle\ge\sqrt{d}\left(1-\frac{\tau_0^2}{2}\right),\tag{10}$$

with probability at least .

Applying Lemma 5 to $v$ yields

$$\mathrm{sign}(y_L\cdot v^\top x_L)\,v^\top\theta^*>0,\tag{11}$$

with probability at least .

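Notice that $\hat{w}=\mathrm{sign}(y_L\cdot v^\top x_L)\,v$. So by a union bound on events (10) and (11), we have

$$\langle\theta^*,\hat{w}\rangle=\mathrm{sign}(y_L\cdot v^\top x_L)\,\langle\theta^*,v\rangle\ge\sqrt{d}\left(1-\frac{\tau_0^2}{2}\right),\tag{12}$$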

with probability at least .

Since $\|\hat{w}\|_2=1$, we have

$$\|\hat{w}\|_\infty^*=\|\hat{w}\|_1\le\sqrt{d}.\tag{13}$$

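By Lemma 6, the $\ell_\infty^\epsilon$-robust error is upper bounded by

$$\exp\left(-\frac{\left(\langle\hat{w},\theta^*\rangle-\epsilon\|\hat{w}\|_\infty^*\right)^2}{2\sigma^2}\right).$$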

Combining this with (12), (13) and the assumption (9), we have

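$$\langle\hat{w},\theta^*\rangle-\epsilon\|\hat{w}\|_\infty^*\ge\sqrt{d}\left(1-\frac{\tau_0^2}{2}\right)-\sqrt{d}\left(1-\frac{\tau_0^2}{2}-\frac{\sigma\sqrt{2\log\frac{1}{\beta}}}{\sqrt{d}}\right)=\sigma\sqrt{2\log\frac{1}{\beta}}.$$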

Hence,

$$R_{\mathcal{B}\text{-robust}}(f_{\hat{w}})\le\exp\left(-\frac{\left(\sigma\sqrt{2\log\frac{1}{\beta}}\right)^2}{2\sigma^2}\right)=\beta,$$

with probability at least , as stated in the theorem. ∎
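The estimator analyzed in this proof (top eigenvector of the unlabeled sample covariance, with its sign fixed by one labeled sample) can be sketched end to end. Everything concrete below is an assumption for illustration: $\theta^*=(1,\dots,1)$, the sizes, and the helper `robust_risk_linear`, which evaluates the $\ell_\infty$ robust 0-1 risk of a unit-norm linear classifier in closed form as $\Phi\left(-(\langle w,\theta^*\rangle-\epsilon\|w\|_1)/\sigma\right)$ under the Gaussian mixture. This is not the PASS algorithm.

```python
import math
import numpy as np

def robust_risk_linear(w, theta_star, sigma, eps):
    """Exact l_inf eps-robust 0-1 risk of x -> sign(w^T x) for unit-norm w.

    Under the mixture, y * w^T x ~ N(w^T theta*, sigma^2), and the worst-case
    perturbation lowers the margin by eps * ||w||_1.
    """
    margin = float(w @ theta_star) - eps * float(np.abs(w).sum())
    return 0.5 * math.erfc(margin / (sigma * math.sqrt(2.0))), margin

rng = np.random.default_rng(0)
d, n, sigma, eps = 100, 4000, 2.0, 0.2
theta_star = np.ones(d)                   # assumed, with ||theta*||_2 = sqrt(d)

# Unlabeled samples from the mixture and a single labeled sample.
hidden = rng.choice([-1.0, 1.0], size=n)
x_unlabeled = hidden[:, None] * theta_star + sigma * rng.normal(size=(n, d))
y_l = rng.choice([-1.0, 1.0])
x_l = y_l * theta_star + sigma * rng.normal(size=d)

# Step 1: top eigenvector of the uncentered sample covariance.
_, eigvecs = np.linalg.eigh(x_unlabeled.T @ x_unlabeled / n)
v = eigvecs[:, -1]
# Step 2: fix the sign with the labeled sample.
w_hat = np.sign(y_l * (v @ x_l)) * v

risk, margin = robust_risk_linear(w_hat, theta_star, sigma, eps)
lemma6_bound = math.exp(-margin**2 / (2 * sigma**2))
print(risk, lemma6_bound)
```

Only one label is consumed; all of the directional information comes from the unlabeled data, which mirrors the role the two data sources play in the proof.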

Now we are ready to prove Theorem 4.

Proof of Theorem 4: Let $c$ be a constant such that for sufficiently large $d$. Notice that the in Theorem 4 is the same as the in Theorem 7 since the maximal eigenvector of also maximizes over the unit sphere . Theorem 7 guarantees that with probability at least , the $\ell_\infty^\epsilon$-robust classification risk is less than $\beta$ for

$$\epsilon\le1-\frac{\tau_0^2}{2}-\frac{\sigma\sqrt{2\log\frac{1}{\beta}}}{\sqrt{d}}=1-\frac{\tau_0^2}{2}-\frac{c\sqrt{2\log\frac{1}{\beta}}}{d^{1/4}}.$$

Choose to be . Since ,