Adversarially Robust Generalization Just Requires More Unlabeled Data
Abstract
Neural network robustness has recently been highlighted by the existence of adversarial examples. Many previous works show that the learned networks do not perform well on perturbed test data, and that significantly more labeled data is required to achieve adversarially robust generalization. In this paper, we show both theoretically and empirically that with just more unlabeled data, we can learn a model with better adversarially robust generalization. The key insight behind our results is a risk decomposition theorem, in which the expected robust risk is separated into two parts: a stability part, which measures the prediction stability in the presence of perturbations, and an accuracy part, which evaluates the standard classification accuracy. As the stability part does not depend on any label information, we can optimize it using unlabeled data. We further prove that for the specific Gaussian mixture problem illustrated by Schmidt et al. [schmidt2018adversarially], adversarially robust generalization can be almost as easy as standard generalization in supervised learning if a sufficiently large amount of unlabeled data is provided. Inspired by these theoretical findings, we propose a new algorithm, PASS, that leverages unlabeled data during adversarial training. We show that in the transductive and semi-supervised settings, PASS achieves higher robust accuracy and defense success rate on the CIFAR-10 task.
1 Introduction
Deep learning [lecun2015deep], especially the deep Convolutional Neural Network (CNN) [lecun1998gradient], has led to state-of-the-art results spanning many machine learning fields, such as image classification [simonyan2014very, he2016deep, huang2017densely, hu2017squeeze], object detection [ren2015faster, redmon2016you, lin2018focal], semantic segmentation [long2015fully, zhao2017pyramid, chen2018encoder] and action recognition [tran2015learning, wang2016temporal, wang2018non].
Despite their great success in numerous applications, recent studies show that deep CNNs are vulnerable to well-designed input samples known as adversarial examples [DBLP:journals/corr/SzegedyZSBEGF13, biggio2013evasion]. Take image classification as an example: for almost every commonly used, well-performing CNN, attackers are able to construct a small perturbation of an input image. The perturbation is almost imperceptible to humans but causes the model to make a wrong prediction. The problem is serious because some adversarial examples can be transferred among different kinds of CNN architectures [DBLP:journals/corr/PapernotMG16], which makes it possible to perform black-box attacks: an attacker with no access to the model parameters, or even its architecture, can still easily fool a machine learning system.
There is a rapidly growing body of work on how to obtain a robust neural network model. Most of the successful methods are based on adversarial training [DBLP:journals/corr/SzegedyZSBEGF13, madry2017towards, goodfellow2014, huang2015learning]. The high-level idea of these works is that during training, we compute the strongest perturbation of each sample against the current model and use the perturbed sample, together with the correct label, for gradient descent optimization. However, the learned model tends to overfit the training data and fails to remain robust on unseen test data. For example, using the state-of-the-art adversarial training method [madry2017towards], the defense success rate of the learned model on the test data is below 60% while that on the training data is almost 100%, which indicates that the robustness fails to generalize. Some theoretical results further show that achieving adversarially robust generalization is challenging. Fawzi et al. [fawzi2018adversarial] prove that adversarial examples exist for any classifier and can be transferred across different models, making it impossible to design network architectures free from adversarial attacks. Schmidt et al. [schmidt2018adversarially] show that adversarially robust generalization requires much more labeled data than standard generalization in certain cases. Tsipras et al. [tsipras2018robustness] present an inherent trade-off between accuracy and robust accuracy and argue that the phenomenon comes from the fact that robust classifiers learn different features. Standard training methods therefore struggle to reach high robustness.
Given the challenge of the task and the previous findings, in this paper we provide several theoretical and empirical results towards better adversarially robust generalization. In particular, we show that we can learn an adversarially robust model which generalizes well if we have plenty of unlabeled data, and that the labeled sample complexity for adversarially robust generalization in [schmidt2018adversarially] can be largely reduced if unlabeled data is used. Intuitively, imagine we hold a model (i.e. a classifier) and a sample. We want to know whether the model's prediction on the sample is both correct and robust. Clearly, the correctness of the prediction can be checked only if we know the ground truth label. To evaluate the robustness, however, we can add perturbations to the sample and check whether the prediction changes. Since such an evaluation does not require any label, we can measure and improve the robustness of the model by leveraging unlabeled data.
First, we formalize the intuition above using the language of generalization theory. The core technique is to decompose the upper bound of the expected robust risk into two terms: a stability term, which measures whether the model outputs consistent predictions under perturbations, and an accuracy term, which evaluates whether the model makes correct predictions on natural samples. Since the stability term does not rely on ground truth labels, unlabeled data can be used to minimize this term and thus improve the generalization ability. Second, we prove that for the Gaussian mixture problem defined in [schmidt2018adversarially], if unlabeled data can be used, adversarially robust generalization becomes almost as easy as standard generalization in supervised learning (i.e. it uses the same number of labeled samples under similar conditions). Inspired by these theoretical findings, we consider using labeled and unlabeled data together during training as a natural way to learn a model with better adversarially robust generalization. To achieve this, we design a PGD-based Adversarial training algorithm in the Semi-Supervised setting (PASS for short). On the CIFAR-10 task, we show that the PASS algorithm achieves better adversarially robust generalization.
Our contributions are threefold.

In Section 3.2.1, we provide a theorem showing that unlabeled data can naturally be used to reduce the expected robust risk in the general setting, and thus that leveraging unlabeled data is a way to improve adversarially robust generalization.

In Section 3.2.2, we discuss a specific Gaussian mixture problem introduced in [schmidt2018adversarially], where the authors proved that the labeled sample complexity for robust generalization is significantly larger than that for standard generalization. As an extension of this work, we prove that in this case, the labeled sample complexity for robust generalization can be the same as that for standard generalization if we have enough unlabeled data.

Based on our theoretical results, we design an adversarially robust training algorithm, named PASS, that uses both labeled and unlabeled data. Our experimental results show that in the transductive and semi-supervised settings, PASS achieves better performance than the baseline algorithms.
2 Related Work
2.1 Adversarial Attacks and Defense
Most previous works study how to attack a neural network model using small perturbations under certain norm constraints, such as the $\ell_\infty$ norm or the $\ell_2$ norm. For the $\ell_\infty$ constraint, the Fast Gradient Sign Method (FGSM) [goodfellow2014] finds a direction in which the perturbation increases the classification loss at an input point to the greatest extent; Projected Gradient Descent (PGD) [madry2017towards] extends FGSM by updating the direction of the attack in an iterative manner and clipping the modifications back into the $\ell_\infty$ ball after each iteration. For the $\ell_2$ constraint, DeepFool [MoosaviDezfooli_2016_CVPR] iteratively computes a minimal-norm adversarial perturbation by linearizing the classifier around the input in each iteration. The C&W attack [DBLP:journals/corr/CarliniW16a] is a comprehensive approach that works under both norm constraints. In this work, we focus on learning a robust model to defend against white-box attacks, i.e. we assume the worst case in which the attacker knows the model parameters and can thus use the algorithms above to attack the model.
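To make the iterative $\ell_\infty$ attack concrete, here is a minimal NumPy sketch of PGD against a toy logistic-regression classifier. The model, step size, budget and labels in {0, 1} are illustrative assumptions; this is not the network or the hyperparameters used in this paper.

```python
import numpy as np

def pgd_attack(x, y, w, b, eps=0.3, step=0.1, n_steps=7):
    """Toy ell_inf PGD against a logistic-regression classifier
    f(x) = sigmoid(w.x + b), with label y in {0, 1}."""
    x_adv = x.copy()
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))
        grad = (p - y) * w                       # d(cross-entropy)/dx for a linear model
        x_adv = x_adv + step * np.sign(grad)     # signed-gradient (FGSM-style) step
        x_adv = np.clip(x_adv, x - eps, x + eps) # project back into the eps-ball around x
    return x_adv
```

For instance, a point at `x = [0.2, 0.0]` with label 1 is classified correctly by `w = [1.0, 0.0]`, but after the attack its score is pushed negative while the perturbation stays within the $\ell_\infty$ budget.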
There are a large number of papers on defending against adversarial attacks, but the results are far from satisfactory. Notably, [DBLP:journals/corr/abs180200420] shows that most defense methods rely on so-called gradient masking and provides an attack called BPDA to correct the gradients. A recent paper [li2019nattack] proposes a powerful black-box attack called NAttack, which fools most previous defenses with a high success rate. So far, adversarial training [madry2017towards] has been the most successful white-box defense algorithm. By modeling the learning problem as a minimax game between the attacker and the defender, a robust model can be trained using iterative optimization methods.
2.2 Semi-supervised/Transductive Learning
Using unlabeled data to help the learning process has proven promising in different applications [NIPS2015_5947, 10.1093/bioinformatics/btr502, Elworthy:1994:BRH:974358.974371]. Many approaches use regularizers called "soft constraints" to make the model "behave" well on unlabeled data. For example, the transductive SVM [Joachims:1999:TIT:645528.657646] uses prediction confidence as a soft constraint, and graph-based SSL [belkin2006manifold, talukdar2009new] requires the model to produce similar outputs at the endpoints of an edge. The work most related to ours is consistency-based SSL. It uses consistency as a soft constraint, encouraging the model to make consistent predictions on unlabeled data when a small perturbation is added. The consistency target can be computed either from the model's own predictions, as in the Π model [DBLP:journals/corr/SajjadiJT16a], Temporal Ensembling [DBLP:journals/corr/LaineA16] and Virtual Adversarial Training [2017arXiv170403976M], or from the predictions of a teacher model, as in the mean teacher model [NIPS2017_6719].
Although our method looks similar to those algorithms, it has a different starting point and focuses on a different problem. The goal of the consistency-based approach is to improve standard generalization by designing auxiliary regularization on unlabeled data, and most of those works do not have theoretical guarantees. In contrast, we show from a theoretical perspective that unlabeled data can naturally be used to improve generalization for robust machine learning problems. We derive theoretical bounds, show better sample complexity and design practical algorithms to demonstrate the strength of this approach.
3 Main Results
In this section, we first illustrate the benefits of using unlabeled data for robust generalization from a theoretical perspective. Then we develop a practical algorithm based on the theoretical findings.
3.1 Notations and Definitions
We consider a standard classification task with an underlying data distribution $\mathcal{D}$ over pairs of examples $x \in \mathcal{X}$ and corresponding labels $y \in \mathcal{Y}$. Usually $\mathcal{D}$ is unknown and we only have access to a sample set $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$ in which each $(x_i, y_i)$ is independently and identically drawn from $\mathcal{D}$. For ease of reference, we denote the corresponding empirical distribution by $\hat{\mathcal{D}}$ (i.e. the uniform distribution over the $n$ i.i.d. sampled points). We also assume that we are given a suitable loss function $\ell(f_\theta(x), y)$, where the classifier $f_\theta$ is parameterized by $\theta$. The standard loss function is the zero-one loss, i.e. $\ell(f_\theta(x), y) = \mathbf{1}\{f_\theta(x) \neq y\}$. Due to its discontinuous and non-differentiable nature, surrogate loss functions such as the cross-entropy loss or the mean squared loss are commonly used during optimization.
Our goal is to find an $f_\theta$ that minimizes the expected classification risk. Without loss of generality, our theory is mainly stated for the binary classification problem, i.e. $\mathcal{Y} = \{-1, +1\}$; all theorems below can be easily extended to the multi-class classification problem. For a binary classification problem, the expected classification risk is defined as below.
Definition 1.
(Expected classification risk). Let $\mathcal{D}$ be a probability distribution over $\mathcal{X} \times \mathcal{Y}$. The expected classification risk of a classifier $f_\theta$ under distribution $\mathcal{D}$ and loss function $\ell$ is defined as $R_{\mathcal{D}}(f_\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}}[\ell(f_\theta(x), y)]$.
We use $R_{\mathcal{D}}(f_\theta)$ to denote the classification risk under the underlying distribution and $R_{\hat{\mathcal{D}}}(f_\theta)$ to denote the classification risk under the empirical distribution, and we write $R^{0/1}$ for the risk with the zero-one loss function. The classification risk characterizes whether the model is accurate. However, we also care about whether $f_\theta$ is robust. For example, when the input $x$ is an image, we hope a small change (perturbation) of $x$ will not change the prediction of $f_\theta$. To this end, Schmidt et al. [schmidt2018adversarially] define the expected robust classification risk as below.
Definition 2.
(Expected robust classification risk). Let $\mathcal{D}$ be a probability distribution over $\mathcal{X} \times \mathcal{Y}$ and let $\mathcal{B}: \mathcal{X} \to 2^{\mathcal{X}}$ be a perturbation set. Then the $\mathcal{B}$-robust classification risk of a classifier $f_\theta$ under distribution $\mathcal{D}$ and loss function $\ell$ is defined as $R_{\mathcal{D}, \mathcal{B}}(f_\theta) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\max_{x' \in \mathcal{B}(x)} \ell(f_\theta(x'), y)\big]$.
Again, we use $R_{\mathcal{D}, \mathcal{B}}(f_\theta)$ to denote the expected robust classification risk under the underlying distribution and $R_{\hat{\mathcal{D}}, \mathcal{B}}(f_\theta)$ to denote it under the empirical distribution, and we write $R^{0/1}$ for the robust risk with the zero-one loss function. In practice, the most commonly used perturbation set is the bounded $\ell_\infty$ ball $\mathcal{B}_\infty^\epsilon(x) = \{x' : \|x' - x\|_\infty \le \epsilon\}$. For simplicity, we refer to the robustness defined by this perturbation set as $\ell_\infty^\epsilon$ robustness.
Before presenting our main theoretical results, we briefly introduce the motivation of this work. As we can see from Definition 2, the robust classification risk concerns whether $f_\theta$ correctly predicts the label $y$ on every $x'$ around $x$. We notice that testing whether $f_\theta$ is robustly accurate can be done by answering two questions separately: does $f_\theta$ provide a correct prediction on $x$, and does $f_\theta$ change its prediction on any $x'$ around $x$? It is easy to see that $f_\theta$ is robustly accurate at $x$ if and only if the answer to the first question is "yes" and the answer to the second is "no". This effectively decomposes the problem into two parts: one part concerns whether $f_\theta$ is accurate and the other concerns whether $f_\theta$ is stable. Since the second question (whether $f_\theta$ changes its prediction on any $x'$ around $x$) does not require any label information, the stability of the model can be improved with more unlabeled data.
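The "two questions" above amount to a pointwise union bound: the robust-error indicator is at most the instability indicator plus the standard-error indicator. A tiny NumPy check with a toy 1-D threshold classifier (every quantity here is an illustrative assumption, not part of the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):
    return (z > 0.0).astype(int)   # toy 1-D threshold classifier

eps = 0.2                          # ell_inf perturbation radius

x = rng.normal(size=1000)
y = (x > -0.1).astype(int)         # ground truth, slightly shifted so f errs sometimes

# for a monotone threshold classifier, the extremes of the ball suffice
robust_err = (f(x - eps) != y) | (f(x + eps) != y)  # some x' in the ball is misclassified
unstable = f(x - eps) != f(x + eps)                 # prediction changes inside the ball
std_err = f(x) != y                                 # clean prediction is wrong

# pointwise union bound behind the risk decomposition
assert np.all(robust_err <= (unstable | std_err))
print(robust_err.mean(), unstable.mean() + std_err.mean())
```

Averaging the bound over the data gives exactly the shape of the decomposition used below: robust risk ≤ instability (label-free) + standard error.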
3.2 Robust Generalization Analysis
Our first result (Section 3.2.1) shows that unlabeled data can be used to improve adversarially robust generalization in the general setting. Our second result (Section 3.2.2) shows that for a specific learning problem defined on the Gaussian mixture model, the sample complexity of robust generalization established in previous work [schmidt2018adversarially] can be significantly reduced by using unlabeled data. Both results suggest that using unlabeled data is a natural way to improve adversarially robust generalization. Due to space limitations, all detailed proofs of the theorems and lemmas are deferred to the appendix.
3.2.1 General Results
In this subsection, we show that the expected robust classification risk can be bounded by the sum of two terms. The first term only depends on the hypothesis space and the unlabeled data, and the second term is a standard PAC bound.
Theorem 1.
Let $\mathcal{F}$ be the hypothesis space and let $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$ be a set of i.i.d. samples drawn from the underlying distribution $\mathcal{D}$. For any function $f \in \mathcal{F}$, with probability at least $1 - \delta$ over the random draw of $S$, we have

$$R^{0/1}_{\mathcal{D}, \mathcal{B}}(f) \;\le\; \underbrace{\mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\Big[\max_{x' \in \mathcal{B}(x)} \mathbf{1}\{f(x') \neq f(x)\}\Big]}_{(1)} \;+\; \underbrace{R^{0/1}_{\hat{\mathcal{D}}}(f) + 2\mathfrak{R}_S(\mathcal{F}) + 3\sqrt{\tfrac{\log(2/\delta)}{2n}}}_{(2)} \qquad (1)$$

where term (1) can be optimized with only unlabeled data and term (2) is the standard PAC generalization bound. Here $\mathcal{D}_{\mathcal{X}}$ is the marginal distribution of $x$ and $\mathfrak{R}_S(\mathcal{F})$ is the empirical Rademacher complexity of the hypothesis space $\mathcal{F}$.
From Theorem 1, we can see that the expected robust classification risk is bounded by the sum of two terms: the first term involves only the marginal distribution $\mathcal{D}_{\mathcal{X}}$, and the second term is the standard PAC generalization error bound. This shows that expected robust risk minimization can be achieved by optimizing the two terms jointly: we can optimize the first term using unlabeled data sampled from $\mathcal{D}_{\mathcal{X}}$ and optimize the second term using labeled data sampled from $\mathcal{D}$, exactly as in standard supervised learning.
While Cullina et al. [cullina2018pac] suggest that in the standard PAC learning scenario (where only labeled data is considered), the generalization gap of the robust risk can sometimes fail to be controlled by the capacity of the hypothesis space $\mathcal{F}$, our result shows that we can mitigate this problem by introducing unlabeled data. In fact, our next result shows that with enough unlabeled data, learning a robust model can be almost as easy as learning a standard model.
3.2.2 Learning from Gaussian Mixture Model
The learning problem defined on the Gaussian mixture model was used in [schmidt2018adversarially] as an example showing that adversarially robust generalization needs much more labeled data than standard generalization. In this subsection, we show that for this specific problem, more unlabeled data is enough to achieve adversarially robust generalization. For completeness, we first restate the results of [schmidt2018adversarially] and then present our theoretical findings.
Definition 3.
(Gaussian mixture model [schmidt2018adversarially]). Let $\theta^\star \in \mathbb{R}^d$ be the per-class mean vector and let $\sigma > 0$ be the variance parameter. Then the $(\theta^\star, \sigma)$-Gaussian mixture model is defined by the following distribution over $(x, y) \in \mathbb{R}^d \times \{\pm 1\}$: first, draw a label $y \in \{\pm 1\}$ uniformly at random; then sample the data point $x \sim \mathcal{N}(y \cdot \theta^\star, \sigma^2 I)$.
Given samples from the distribution defined above, the learning problem is to find a linear classifier that predicts the label $y$ from $x$. Schmidt et al. [schmidt2018adversarially] proved the following sample complexity bound for standard generalization.
Theorem 2.
(Theorem 4 in [schmidt2018adversarially]). Let $(x, y)$ be drawn from the $(\theta^\star, \sigma)$-Gaussian mixture model with $\|\theta^\star\|_2 = \sqrt{d}$ and $\sigma \le c \cdot d^{1/4}$, where $c$ is a universal constant. Let $\hat{w}$ be the vector $y \cdot x$. Then with high probability, the expected classification risk of the linear classifier $f_{\hat{w}}(x) = \operatorname{sign}(\langle \hat{w}, x \rangle)$ under the 0-1 loss is at most 1%.
Theorem 2 says that we can learn a linear classifier with low classification risk (e.g., 1%) even if there is only one labeled sample. However, the following theorem shows that for adversarially robust generalization under $\ell_\infty$ perturbations, significantly more labeled data is required.
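A quick NumPy simulation of Theorem 2's single-sample setting. The dimension, variance and test-set size below are arbitrary choices for illustration, not values from the theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, n_test = 2000, 1.0, 500
theta = rng.normal(size=d)
theta *= np.sqrt(d) / np.linalg.norm(theta)   # per-class mean with ||theta||_2 = sqrt(d)

# a single labeled sample (x, y) and the classifier w_hat = y * x
y = rng.choice([-1, 1])
x = y * theta + sigma * rng.normal(size=d)
w_hat = y * x

# standard classification error of sign(<w_hat, x>) on fresh test data
y_test = rng.choice([-1, 1], size=n_test)
x_test = y_test[:, None] * theta + sigma * rng.normal(size=(n_test, d))
err = np.mean(np.sign(x_test @ w_hat) != y_test)
print(err)   # essentially zero for large d
```

The signal $\|\theta^\star\|_2^2 = d$ dominates the noise terms, which scale like $\sigma\sqrt{d}$, so one sample already yields a near-perfect standard classifier in high dimensions.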
Theorem 3.
(Theorem 6 in [schmidt2018adversarially]). Let $g_n$ be any learning algorithm, i.e. a function from $n$ samples to a binary classifier $f_n$. Moreover, let $\sigma = c_1 \cdot d^{1/4}$, let $\epsilon \ge 0$, and let $\theta^\star$ be drawn from $\mathcal{N}(0, I)$. We also draw $n$ samples from the $(\theta^\star, \sigma)$-Gaussian mixture model. Then the expected $\ell_\infty^\epsilon$-robust classification risk of $f_n$ under the 0-1 loss is at least $\tfrac{1}{2}(1 - 1/d)$ if the number of labeled samples satisfies $n \le c_2\, \epsilon^2 \sqrt{d} / \log d$.
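To get a sense of the scale of the gap in Theorem 3: for CIFAR-10-sized inputs the dimension is $d = 3 \cdot 32 \cdot 32 = 3072$, so the $\sqrt{d}$ factor alone (ignoring the $\epsilon$ and $\log d$ terms) is about 55, i.e. robust generalization can require tens of times more labeled samples in this regime. The arithmetic:

```python
import math

d = 3 * 32 * 32           # dimension of a CIFAR-10 image
gap = math.sqrt(d)        # the sqrt(d) factor in Theorem 3's lower bound
print(d, round(gap, 1))   # 3072 55.4
```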
As we can see from the above theorem, the sample complexity of robust generalization is larger than that of standard generalization by a factor of roughly $\sqrt{d}$ (up to the $\epsilon$ and logarithmic factors). This shows that for high-dimensional problems, adversarial robustness can provably require a significantly larger number of labeled samples. We provide a new result showing that the learned model can be robust if there is only one labeled sample and sufficiently many unlabeled samples. Our theorem is stated as follows:
Theorem 4.
Let $(x_0, y_0)$ be a labeled sample drawn from the $(\theta^\star, \sigma)$-Gaussian mixture model with $\|\theta^\star\|_2 = \sqrt{d}$ and $\sigma = c_1 \cdot d^{1/4}$, and let $x_1, \dots, x_n$ be unlabeled samples drawn from the same model with the labels removed. Let $\hat{v}$ be a unit maximal eigenvector of the sample covariance matrix of the unlabeled data, and set $\hat{w} = \operatorname{sign}(y_0 \langle x_0, \hat{v} \rangle) \cdot \hat{v}$. Then there exists a constant $c_2$ such that for any admissible $\epsilon$, with high probability, the expected $\ell_\infty^\epsilon$-robust classification risk of the linear classifier $f_{\hat{w}}$ under the 0-1 loss is at most 1% when the number of unlabeled samples $n$ is sufficiently large.
From Theorem 4, we can see that when the number of unlabeled samples is sufficiently large, we can learn a highly accurate and robust model using only one labeled sample. The learning process can be intuitively described in three steps. In the first step, we use the unlabeled data to estimate the direction of $\theta^\star$, even though we do not know which label $\theta^\star$ (or $-\theta^\star$) corresponds to. In the second step, we use the given labeled sample to determine the correct sign with high probability. Finally, we obtain a good estimate of $\theta^\star$ by combining the two steps above and thus learn a robust classifier. The three key lemmas corresponding to the three steps are listed below (the $c_i$ denote universal constants).
Lemma 1.
Under the same setting as Theorem 4, suppose the number of unlabeled samples $n$ is sufficiently large. Then, with high probability, there is a unique unit maximal eigenvector $\hat{v}$ of the sample covariance matrix of the unlabeled data that is well aligned with $\theta^\star / \|\theta^\star\|_2$ up to sign, i.e. $|\langle \hat{v}, \theta^\star \rangle| \ge c_3 \|\theta^\star\|_2$.
Lemma 2.
Under the same setting as Theorem 4, suppose $v$ is a unit vector such that $|\langle v, \theta^\star \rangle| \ge c_4 \|\theta^\star\|_2$ for some constant $c_4$. Then with high probability over the labeled sample $(x_0, y_0)$, we have $\operatorname{sign}(y_0 \langle x_0, v \rangle) \cdot \langle v, \theta^\star \rangle > 0$, i.e. the single labeled sample determines the correct sign.
Lemma 3.
(Lemma 20 in [schmidt2018adversarially]). Under the same setting as Theorem 4, for any unit vector $w$ such that $\langle w, \theta^\star \rangle$ is sufficiently large compared to $\epsilon \|w\|_1$ (where $\|\cdot\|_1$ is the dual norm of $\|\cdot\|_\infty$), the linear classifier $f_w$ has small $\ell_\infty^\epsilon$-robust classification risk.
Our theoretical findings suggest that we can improve the adversarially robust generalization using unlabeled data, and in the next subsection, we will present a practical algorithm for real applications.
3.3 Practical Algorithm
Let $S_l$ be a set of labeled data and $S_u$ be a set of unlabeled data. Motivated by the theory above, to achieve better adversarially robust generalization we can optimize the classifier to be accurate on $S_l$ and robust on $S_l \cup S_u$. This is equivalent to learning a classifier that is both accurate and robust on $S_l$ and robust on $S_u$. Therefore, we design two loss terms, one on $S_l$ and one on $S_u$.
For the labeled dataset $S_l$, we use the standard adversarial training objective, i.e.,

$$L_l(\theta) = \frac{1}{|S_l|} \sum_{(x, y) \in S_l} \max_{x' \in \mathcal{B}(x)} \ell(f_\theta(x'), y).$$
Following the most common setting, during training the classifier outputs a probability distribution over categories and is evaluated by the cross-entropy loss $\ell(f_\theta(x), y) = -\log p_y(x; \theta)$, where $p_y(x; \theta)$ is the output probability for category $y$.
For the unlabeled data $S_u$, we use an objective function that measures robustness without ground truth labels, by treating the model's own prediction on the clean point as the target:

$$L_u(\theta) = \frac{1}{|S_u|} \sum_{x \in S_u} \max_{x' \in \mathcal{B}(x)} \ell\big(f_\theta(x'), \hat{y}(x)\big), \qquad \hat{y}(x) = \arg\max_y p_y(x; \theta).$$
Putting the two objective functions together, our training loss is defined as a combination of $L_l$ and $L_u$:

$$L(\theta) = L_l(\theta) + \lambda L_u(\theta). \qquad (2)$$
Here $\lambda$ is a coefficient that trades off the two loss terms. In practice, we use iterative optimization to learn $f_\theta$. In the inner loop, we fix the model and use the Projected Gradient Descent (PGD) algorithm to find the attack $x'$ for each $x$. In the outer loop, we use stochastic gradient descent to optimize $\theta$ on the perturbed samples. We call our algorithm PGD-based Adversarial training in the Semi-Supervised setting (PASS). The general training process is shown in Algorithm 1.
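A compact sketch of the combined objective in Eq. (2), with a toy binary logistic-regression model (labels in {0, 1}) standing in for the network, PGD as the inner maximizer, and the model's own prediction serving as the target on unlabeled points. The model, step size and all constants are hypothetical; the paper's Algorithm 1 uses a deep network trained by SGD:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(p, t):
    return -np.mean(t * np.log(p + 1e-12) + (1 - t) * np.log(1 - p + 1e-12))

def pgd(x, target, w, eps, step=0.025, n_steps=7):
    """Inner maximization: ell_inf PGD pushing the score away from `target`
    (true labels for labeled data, pseudo-labels for unlabeled data)."""
    x_adv = x.copy()
    for _ in range(n_steps):
        grad = (sigmoid(x_adv @ w) - target)[:, None] * w   # d(cross-entropy)/dx
        x_adv = np.clip(x_adv + step * np.sign(grad), x - eps, x + eps)
    return x_adv

def pass_loss(w, x_l, y_l, x_u, eps=0.1, lam=0.2):
    # L_l: cross-entropy on adversarially perturbed labeled data
    l_sup = cross_entropy(sigmoid(pgd(x_l, y_l, w, eps) @ w), y_l)
    # L_u: stability term; the model's own clean prediction plays the role of the label
    pseudo = (x_u @ w > 0).astype(float)
    l_unsup = cross_entropy(sigmoid(pgd(x_u, pseudo, w, eps) @ w), pseudo)
    return l_sup + lam * l_unsup                            # Eq. (2): L = L_l + lambda * L_u
```

Each outer optimization step would evaluate `pass_loss` on a minibatch mixing labeled and unlabeled samples and take a gradient step on `w`.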
4 Experiments
We use the CIFAR-10 task to verify our proposed algorithm. In particular, given a set of labeled and unlabeled data, we study two settings: the transductive setting, in which the testing data is the given unlabeled data, and the semi-supervised setting, in which the testing data is unseen during training. All code and models are available at https://github.com/RuntianZ/adversarialrobustnessunlabeled.
4.1 Experimental Setting
Following [madry2017towards], we use a ResNet model and widen the layers by a factor of 10, which results in a network with five residual units with (16, 160, 320, 640) filters each. During training, we apply data augmentation including random crops and flips, as well as per-image standardization. The initial learning rate is 0.1 and is decayed by a factor of 10 twice during training. In the inner loop, we run 7-step PGD for each minibatch. The perturbation is constrained under the $\ell_\infty$ norm.
Transductive learning setting.
In the transductive setting, the algorithm has access to all labeled training data and all unlabeled test data. On CIFAR-10, we use the 50,000 labeled training images as $S_l$ and the 10,000 test images as $S_u$. Each minibatch contains 100 sampled labeled images and 20 sampled unlabeled images, and the learning rate is decayed twice during training. We compare our proposed method with several baselines which use labeled training data only, including the original PGD-based adversarial training [madry2017towards], thermometer encoding [buckman2018thermometer], cascade learning [na2018cascade] and ADV-BNN [liu2018advbnn].
Semisupervised learning setting.
In the semi-supervised learning setting, the unlabeled data no longer come from the test set, so this is a better way to measure whether more unlabeled data helps adversarially robust generalization. Following many previous works [DBLP:journals/corr/LaineA16, NIPS2017_6719, 2017arXiv170403976M, athiwaratkun2018there], we sample 5,000/10,000 examples from the training set and use them as the labeled data. We mask out the labels of the remaining images in the training set and use them as unlabeled data. We thereby obtain two semi-supervised learning tasks, which we call the 5k/10k experiments. In a minibatch, we sample 25/50 labeled images and 225/200 unlabeled images for the 5k/10k experiment respectively. In both experiments, we run an ablation study on the hyperparameter $\lambda$ by setting $\lambda = 0.1, 0.2, 0.3$. The learning rate is decayed twice during training. We use the original PGD-based adversarial training [madry2017towards] on the sampled 5,000/10,000 labeled examples as the baseline for comparison (referred to as PGD-adv).
Table 1: Natural and robust test accuracy (%) of different defenses under different attacks in the transductive setting.

| Defenses \ Attacks | Natural | FGSM | PGD7 | PGD40 | BPDA [DBLP:journals/corr/abs180200420] | NAttack [li2019nattack] |
|---|---|---|---|---|---|---|
| PGD7 adv. training | 85.40 | 59.05 | 49.99 | 47.54 | 47 | 45.48 |
| Therm. encoding [buckman2018thermometer] | 89.88 | 80.96 | 79.16 | – | 0 | 7.79 |
| Cascade learning [na2018cascade] | 91.5 | 69.1 | 42.5 | – | 15 | 1.74 |
| ADV-BNN [liu2018advbnn] | 80.09 | 64.94 | 57.59 | 48.00 | 41.20 | 19.69 |
| PASS (transductive) | 86.20 | 86.19 | 86.02 | 85.50 | 85.50 | 86.19 |
4.2 Experimental Results
Transductive learning setting
In Table 1, we report the robust test accuracy of different models under different attack methods in the transductive setting. The attacks include FGSM [goodfellow2014], 7-step PGD (referred to as PGD7), 40-step PGD (referred to as PGD40), BPDA [DBLP:journals/corr/abs180200420] and NAttack [li2019nattack]. All attacks are limited to the same perturbation budget under the $\ell_\infty$ norm. We also report the test accuracy on the original test data (referred to as natural accuracy).
From the table, we can clearly see that PASS in the transductive setting beats all other baselines by more than 30 points under the strongest attacks. Furthermore, the defense success rate of PASS (computed as the robust accuracy divided by the natural accuracy) is more than 99% under the PGD40 attack, which is even stronger than the attack (PGD7) used during training. This indicates that the model learned by PASS is very robust whenever it produces a correct prediction. This result is expected, since the algorithm explicitly imposes a robustness regularizer on the test data.
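The defense success rate quoted above, computed as robust accuracy over natural accuracy, follows directly from the Table 1 numbers:

```python
natural_acc = 86.20   # PASS (transductive), clean test accuracy (Table 1)
robust_acc = 85.50    # PASS (transductive) under the PGD40 attack (Table 1)
defense_success_rate = robust_acc / natural_acc
print(round(100 * defense_success_rate, 2))   # 99.19, i.e. above 99%
```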
Semi-supervised learning setting.
Table 2: Results of the 5k/10k semi-supervised experiments (accuracy in %).

| Labeled data | Method | Natural train acc. | Natural test acc. | Robust train acc. | Robust test acc. | Defense success rate |
|---|---|---|---|---|---|---|
| 5k | PGD-adv on 5k | 61.18 | 60.57 | 32.40 | 30.54 | 50.42 |
| 5k | PASS ($\lambda$=0.1) | 63.24 | 60.44 | 32.97 | 30.90 | 51.13 |
| 5k | PASS ($\lambda$=0.2) | 61.73 | 60.71 | 35.20 | 32.96 | 54.29 |
| 5k | PASS ($\lambda$=0.3) | 61.88 | 60.46 | 35.07 | 33.54 | 55.47 |
| 10k | PGD-adv on 10k | 78.80 | 73.79 | 45.60 | 37.48 | 50.79 |
| 10k | PASS ($\lambda$=0.1) | 78.24 | 72.92 | 47.96 | 38.86 | 53.29 |
| 10k | PASS ($\lambda$=0.2) | 78.74 | 73.16 | 51.20 | 41.18 | 56.29 |
| 10k | PASS ($\lambda$=0.3) | 78.95 | 73.35 | 52.24 | 42.48 | 57.91 |
| 50k (full) | PGD-adv on 50k | 99.91 | 85.40 | 96.71 | 49.99 | 58.54 |
We list all results of the 5k/10k experiments in Table 2. We use five criteria to evaluate each model: the natural training/test accuracy, the robust training/test accuracy under the PGD7 attack, and the defense success rate.
First, we can see that in both experiments, the robust test accuracy improves when unlabeled data is used. For example, the robust test accuracy of the models trained by PASS with $\lambda = 0.3$ in the 5k/10k experiments increases by 3.0/5.0 percentage points over the PGD-adv baselines. We also examine the defense success rate, which evaluates whether the model is robust given that its prediction is correct. As we can see from the last column of Table 2, the defense success rate of the models trained with our method is much higher than that of the baselines. In particular, the defense success rate of the model trained with $\lambda = 0.3$ in the 10k experiment is competitive with the model trained using PGD-adv on the whole dataset. This clearly shows the advantage of our proposed algorithm.
Second, we can also see the influence of the value of $\lambda$: models trained with a larger $\lambda$ have higher robust accuracy. For example, in the 10k experiment, the robust test accuracy of the model trained with $\lambda = 0.3$ is more than 3.5 points better than that with $\lambda = 0.1$. However, we observe that training becomes hard to converge if $\lambda$ is set much larger.
VAT [miyato2015distributional] uses an adversarial regularizer over the unlabeled data, but its goal is to improve natural test accuracy. In our experiments, as can be seen from the table, PASS improves robust test accuracy more than natural test accuracy. One difference between PASS and VAT is that VAT uses a one-step gradient method as the attack during training, which is known to be a very weak attack that cannot lead to robust models [madry2017towards] (more experimental results are provided in the appendix). In our work, we focus on improving network robustness and use a much stronger attack (7-step PGD) during training. When defending against a weak attack on the unlabeled data during training, the model cannot be made robust and tends to generalize toward "accuracy"; when defending against a strong attack, the model is trained to minimize the expected robust risk and therefore tends to generalize toward "robustness", which is consistent with our theory.
5 Conclusion
In this paper, we show both theoretically and empirically that with just more unlabeled data, we can learn a model with better adversarially robust generalization. We first give an expected robust risk decomposition theorem and then show that for a specific learning problem on the Gaussian mixture model, adversarially robust generalization can be almost as easy as standard generalization. Based on these theoretical results, we propose a new algorithm, PASS, which leverages unlabeled data during training, and we empirically demonstrate its advantage. As future work, we will study the sample complexity of unlabeled data for broader function classes and tackle more challenging real tasks.
References
 (1) Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. CoRR, abs/1802.00420, 2018.
 (2) Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. There are many consistent explanations of unlabeled data: Why you should average. In International Conference on Learning Representations, 2019.
 (3) Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research, 7(Nov):2399–2434, 2006.
 (4) Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pages 387–402. Springer, 2013.
 (5) Jacob Buckman, Aurko Roy, Colin Raffel, and Ian Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In International Conference on Learning Representations, 2018.
 (6) Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. CoRR, abs/1608.04644, 2016.
 (7) Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv preprint arXiv:1802.02611, 2018.
 (8) Daniel Cullina, Arjun Nitin Bhagoji, and Prateek Mittal. PAC-learning in the presence of evasion adversaries. arXiv preprint arXiv:1806.01471, 2018.
 (9) David Elworthy. Does Baum-Welch re-estimation help taggers? In Proceedings of the Fourth Conference on Applied Natural Language Processing, ANLC '94, pages 53–58, Stroudsburg, PA, USA, 1994. Association for Computational Linguistics.
 (10) Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. Adversarial vulnerability for any classifier. arXiv preprint arXiv:1802.08686, 2018.
 (11) Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
 (12) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 (13) Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 7, 2017.
 (14) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, volume 1(2), page 3, 2017.
 (15) Ruitong Huang, Bing Xu, Dale Schuurmans, and Csaba Szepesvári. Learning with a strong adversary. arXiv preprint arXiv:1511.03034, 2015.
 (16) Thorsten Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML ’99, pages 200–209, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
 (17) Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. CoRR, abs/1610.02242, 2016.
 (18) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 (19) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 (20) Yandong Li, Lijun Li, Liqiang Wang, Tong Zhang, and Boqing Gong. NATTACK: Learning the distributions of adversarial examples for an improved black-box attack on deep neural networks. arXiv preprint arXiv:1905.00441, 2019.
 (21) Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence, 2018.
 (22) Xuanqing Liu, Yao Li, Chongruo Wu, and Cho-Jui Hsieh. Adv-BNN: Improved adversarial defense through robust Bayesian neural network. In International Conference on Learning Representations, 2019.
 (23) Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
 (24) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
 (25) Takeru Miyato, Shin-ichi Maeda, Shin Ishii, and Masanori Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 2018.
 (26) Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. arXiv preprint arXiv:1507.00677, 2015.
 (27) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
 (28) Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 (29) Taesik Na, Jong Hwan Ko, and Saibal Mukhopadhyay. Cascade adversarial machine learning regularized with a unified embedding. In International Conference on Learning Representations, 2018.
 (30) Nicolas Papernot, Patrick D. McDaniel, and Ian J. Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR, abs/1605.07277, 2016.
 (31) Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3546–3554. Curran Associates, Inc., 2015.
 (32) Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
 (33) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
 (34) Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. CoRR, abs/1606.04586, 2016.
 (35) Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pages 5019–5031, 2018.
 (36) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 (37) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
 (38) Partha Pratim Talukdar and Koby Crammer. New regularized algorithms for transductive learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 442–457. Springer, 2009.
 (39) Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1195–1204. Curran Associates, Inc., 2017.
 (40) Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
 (41) Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019.
 (42) Martin J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
 (43) Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
 (44) Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1(3), page 4, 2018.
 (45) Bing Zhang and Mingguang Shi. Semi-supervised learning improves gene expression-based prediction of cancer recurrence. Bioinformatics, 27(21):3017–3023, 09 2011.
 (46) Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017.
Appendix A Background on Generalization and Rademacher Complexity
The Rademacher complexity is a commonly used capacity measure for a hypothesis space.
Definition 4.
Given a set of samples $S = \{x_1, \ldots, x_n\}$, the empirical Rademacher complexity of a function class $\mathcal{F}$ (mapping from $\mathcal{X}$ to $\mathbb{R}$) is defined as:
$$\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}} \left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right],$$
where $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_n)$ contains i.i.d. random variables drawn from the Rademacher distribution $\mathrm{unif}(\{-1, +1\})$.
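As a concrete illustration of this definition, the expectation over the Rademacher signs can be approximated by Monte Carlo sampling. The sketch below uses a hypothetical setup (a small finite function class represented by its value vectors on the sample; none of the names or sizes come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n sample points and a finite function class F;
# each f is represented by its value vector (f(x_1), ..., f(x_n)).
n = 50
F = [0.5 * rng.choice([-1.0, 1.0], size=n) for _ in range(8)]

def empirical_rademacher(F, n, trials=5000, rng=rng):
    # Monte Carlo estimate of E_sigma[ sup_{f in F} (1/n) sum_i sigma_i f(x_i) ].
    total = 0.0
    for _ in range(trials):
        sigma = rng.choice([-1.0, 1.0], size=n)        # i.i.d. Rademacher signs
        total += max(float(sigma @ f) / n for f in F)  # sup over the finite class
    return total / trials

est = empirical_rademacher(F, n)
# Massart's lemma upper-bounds this quantity by 0.5*sqrt(2*ln(8)/50) ~ 0.14.
print(est)
```

For a finite class the supremum is an exact maximum, which is what makes the Monte Carlo approximation straightforward here; for infinite classes one typically works with the analytic bounds instead.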
By using the Rademacher complexity, we can directly provide an upper bound on the generalization error.
Theorem 5.
(Theorem 3.5 in mohri2018foundations ()). Suppose the loss $\ell$ takes values in $[0, 1]$, and let $S = \{z_1, \ldots, z_n\}$ be the set of i.i.d. samples drawn from the underlying distribution $\mathcal{D}$. Let $\mathcal{H}$ be the hypothesis space. Then with probability at least $1 - \delta$ over $S$, for any $h \in \mathcal{H}$:
$$\mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)] \le \frac{1}{n} \sum_{i=1}^{n} \ell(h, z_i) + 2\hat{\mathfrak{R}}_S(\ell \circ \mathcal{H}) + 3\sqrt{\frac{\log(2/\delta)}{2n}}.$$
Appendix B Proof of Theorem 1
Appendix C Proof of Theorem 4
For convenience, in this section we use $c$ or $c_i$ to denote universal constants, where $i = 1, 2, \ldots$.
In the proof of Theorem 4, we will use the concentration bound for covariance estimation in wainwright_2019 (). We first introduce the definition of spiked covariance ensemble.
Definition 5.
(Spiked covariance ensemble). A sample $x$ from the spiked covariance ensemble takes the form
$$x = \sqrt{\nu}\, \xi \theta + w,$$
where $\xi$ is a zero-mean random variable with unit variance, $\nu$ is a fixed scalar, $\theta$ is a fixed unit vector, and $w$ is a random vector independent of $\xi$, with zero mean and covariance matrix $I_d$.
To see why the spiked covariance ensemble model is useful, note that the Gaussian mixture model is a special case of it. Specifically, let $x_1, \ldots, x_m$ be the unlabeled data in Theorem 4. Then each $x_i$ follows the Gaussian mixture distribution $\mathcal{N}(y_i \theta^*, \sigma^2 I)$ with hidden label $y_i$, and $x_i / \sigma$ is a sample from the spiked covariance ensemble with parameter $\nu = \|\theta^*\|_2^2 / \sigma^2$, $\xi = y_i$ uniformly distributed on $\{-1, +1\}$, and $\theta = \theta^* / \|\theta^*\|_2$.
The following theorem from wainwright_2019 () characterizes the concentration property of the spiked covariance ensemble, which we will further use to bound the robust classification error. Intuitively, the theorem says that we can approximately recover $\theta$ in the spiked covariance ensemble model using the top eigenvector of the sample covariance matrix $\hat{\Sigma} = \frac{1}{m} \sum_{i=1}^{m} x_i x_i^\top$.
Theorem 6.
(Concentration of covariance estimation, see Corollary 8.7 in wainwright_2019 ()). Given $m$ i.i.d. samples from the spiked covariance ensemble with sub-Gaussian tails (which means both $\xi$ and $w$ are sub-Gaussian with parameter at most one), suppose that and . Then, with probability at least , there is a unique maximal eigenvector $\hat{\theta}$ of the sample covariance matrix $\hat{\Sigma}$ such that
Using the theorem above, we can show that for the Gaussian mixture model, the top unit eigenvector of the sample covariance matrix is approximately $\pm \theta^* / \|\theta^*\|_2$. In other words, we can approximately recover the parameter $\theta^*$ up to a sign difference: principal component analysis of the unlabeled data gives a unit vector close to either $\theta^* / \|\theta^*\|_2$ or $-\theta^* / \|\theta^*\|_2$.
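This recovery-up-to-sign phenomenon is easy to check numerically. The sketch below uses illustrative dimensions and noise level (not the constants of Theorem 4): it draws unlabeled Gaussian-mixture samples and verifies that the top eigenvector of the sample covariance is close to one of $\pm \theta^* / \|\theta^*\|_2$.

```python
import numpy as np

rng = np.random.default_rng(1)

d, m = 20, 5000                           # illustrative, not Theorem 4's constants
theta_star = rng.standard_normal(d)
theta_star *= np.sqrt(d) / np.linalg.norm(theta_star)  # ||theta*||_2 = sqrt(d)
sigma = d ** 0.25

# Unlabeled data only: x = y*theta* + sigma*w, with the labels y hidden.
y = rng.choice([-1.0, 1.0], size=m)
X = y[:, None] * theta_star + sigma * rng.standard_normal((m, d))

# Top unit eigenvector of the sample covariance matrix.
cov = X.T @ X / m
theta_hat = np.linalg.eigh(cov)[1][:, -1]  # eigh sorts eigenvalues ascending

unit = theta_star / np.linalg.norm(theta_star)
err = min(np.linalg.norm(theta_hat - unit), np.linalg.norm(theta_hat + unit))
print(err)  # small: theta_hat matches +/- theta*/||theta*|| up to estimation noise
```

Note that `eigh` returns an eigenvector of arbitrary sign, which is exactly the sign ambiguity the labeled data must resolve.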
Lemma 4.
Under the same setting as Theorem 4, suppose that and . Then, with probability at least , there is a unique maximal eigenvector $\hat{\theta}$ of the sample covariance matrix $\hat{\Sigma}$ with unit norm such that
Proof.
As discussed above, is a spiked covariance ensemble. By Theorem 6 we have with probability at least , there is a unique maximal eigenvector of the sample covariance matrix such that
Letting , we have . Below we consider two cases.
Case 1: . Let , since both and are unit vectors, we have
(5) 
Recall that , which is equivalent to
Rearranging the terms and using the AM-GM inequality gives
Therefore, by (5),
By substituting , and , we complete the proof.
Case 2: . Let be one of such that the inner product is nonnegative. Since both and are unit vectors, we have
Therefore, . ∎
Now we have proved that by using the top eigenvector of the sample covariance matrix, we can recover $\theta^*$ up to a sign difference. Next, we will show that the sign can be determined using the labeled data.
Lemma 5.
Under the same setting as Theorem 4, suppose is a unit vector such that where . Then with probability at least , we have .
Proof.
Since , and both and are unit vectors, we have . So the event is equivalent to the event , i.e.
(6) 
Recall that $x$ is sampled from the Gaussian distribution $\mathcal{N}(y\theta^*, \sigma^2 I)$, where $y$ is sampled uniformly at random from $\{-1, +1\}$; therefore $\langle yx, \hat{\theta} \rangle$ follows the Gaussian distribution $\mathcal{N}(\langle \theta^*, \hat{\theta} \rangle, \sigma^2)$. Hence,
(7) 
Moreover, from we can get
(8) 
as stated in the lemma. ∎
Armed with Lemma 4 and Lemma 5, we now have a precise estimate of $\theta^*$ in the Gaussian mixture model. Next, we show that the high precision of this estimate translates into low robust risk. To achieve this, we need a lemma from schmidt2018adversarially (), which upper bounds the robust classification risk of a linear classifier in terms of its inner product with $\theta^*$.
Lemma 6.
(Lemma 20 in schmidt2018adversarially ()). Under the same setting as in Theorem 4, for any $p \ge 1$ and $\epsilon \ge 0$, and for any unit vector $\hat{w}$ such that $\langle \hat{w}, \theta^* \rangle \ge \epsilon \|\hat{w}\|_q$, where $\|\cdot\|_q$ is the dual norm of $\|\cdot\|_p$, the linear classifier $x \mapsto \mathrm{sign}(\langle \hat{w}, x \rangle)$ has $\ell_p^\epsilon$-robust classification risk at most
$$\exp\left( -\frac{\left( \langle \hat{w}, \theta^* \rangle - \epsilon \|\hat{w}\|_q \right)^2}{2\sigma^2} \right).$$
Lemma 6 guarantees that if we can estimate $\theta^*$ precisely, we can achieve small robust classification risk. Combined with Lemma 4 and Lemma 5, which provide such an estimate, we are now ready to prove the robust classification risk bound stated in Theorem 4. We actually prove a slightly more general theorem below with some extra parameters, and obtain Theorem 4 as a corollary.
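The three-step argument (PCA on unlabeled data, sign correction from labeled data, small robust risk of the resulting linear classifier) can be sketched end to end. All parameters below are illustrative, not those of Theorem 4; in particular, using a small labeled set `n_lab` for a sign vote is an illustrative simplification of the sign-determination step. For $\ell_\infty$ perturbations of size $\epsilon$, the worst-case attack on a linear classifier $\mathrm{sign}(\langle w, x \rangle)$ shrinks each correct margin by $\epsilon \|w\|_1$ (the dual norm), which is how the robust error is evaluated here.

```python
import numpy as np

rng = np.random.default_rng(2)

d, m, eps, n_lab = 20, 5000, 0.05, 5     # illustrative, not Theorem 4's constants
theta_star = np.full(d, 1.0)             # ||theta*||_2 = sqrt(d)
sigma = d ** 0.25

def sample(k):
    y = rng.choice([-1.0, 1.0], size=k)
    return y, y[:, None] * theta_star + sigma * rng.standard_normal((k, d))

# Step 1: PCA on unlabeled data recovers the direction up to sign (Lemma 4).
_, X_u = sample(m)
theta_hat = np.linalg.eigh(X_u.T @ X_u / m)[1][:, -1]

# Step 2: a few labeled examples fix the sign (the idea behind Lemma 5).
y_l, X_l = sample(n_lab)
if np.sum(y_l * (X_l @ theta_hat)) < 0:
    theta_hat = -theta_hat

# Step 3: worst-case l_inf attack of size eps reduces each correct margin
# of sign(<theta_hat, x>) by eps * ||theta_hat||_1 (the dual norm of l_inf).
y_t, X_t = sample(2000)
margins = y_t * (X_t @ theta_hat) - eps * np.sum(np.abs(theta_hat))
robust_err = float(np.mean(margins <= 0))
print(robust_err)  # small: most test points survive the worst-case perturbation
```

The empirical robust error here is close to the Gaussian tail bound of Lemma 6 evaluated at the estimated direction, which is the mechanism the proof of Theorem 7 formalizes.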
Theorem 7.
Let be a labeled example drawn from the Gaussian mixture model with . Let be unlabeled data drawn from . Let be as stated in Lemma 4, and let be the normalized eigenvector (i.e., ) corresponding to the maximal eigenvalue of , such that with probability at least . Let . Then with probability at least , the linear classifier has robust classification risk at most when
(9) 
Proof.
Since , we have
(13) 
Now we are ready to prove Theorem 4.
Proof of Theorem 4: Let be a constant such that for sufficiently large . Notice that the in Theorem 4 is the same as the in Theorem 7, since the maximal eigenvector of also maximizes over the unit sphere . Theorem 7 guarantees that with probability at least , the robust classification risk is less than for
Choose to be . Since ,