
# Adversarially Robust Generalization Just Requires More Unlabeled Data

Runtian Zhai, Tianle Cai, Di He, Chen Dan,
Kun He, John E. Hopcroft & Liwei Wang
Peking University   Carnegie Mellon University  Cornell University
Huazhong University of Science and Technology
{zhairuntian,caitianle1998,di_he,wanglw}@pku.edu.cn
cdan@cs.cmu.edu,brooklet60@hust.edu.cn,jeh17@cornell.edu
Equal contribution
###### Abstract

Neural network robustness has recently been highlighted by the existence of adversarial examples. Many previous works show that the learned networks do not perform well on perturbed test data, and significantly more labeled data is required to achieve adversarially robust generalization. In this paper, we theoretically and empirically show that with just more unlabeled data, we can learn a model with better adversarially robust generalization. The key insight of our results is based on a risk decomposition theorem, in which the expected robust risk is separated into two parts: the stability part, which measures the prediction stability in the presence of perturbations, and the accuracy part, which evaluates the standard classification accuracy. As the stability part does not depend on any label information, we can optimize this part using unlabeled data. We further prove that for the specific Gaussian mixture problem illustrated by Schmidt et al. (2018), adversarially robust generalization can be almost as easy as standard generalization in supervised learning if a sufficiently large amount of unlabeled data is provided. Inspired by the theoretical findings, we further show that a practical adversarial training algorithm that leverages unlabeled data can improve adversarially robust generalization on MNIST and Cifar-10.

## 1 Introduction

Deep learning (LeCun et al., 2015), especially deep Convolutional Neural Network (CNN) (LeCun et al., 1998), has led to state-of-the-art results spanning many machine learning fields, such as image classification (Simonyan & Zisserman, 2014; He et al., 2016; Huang et al., 2017; Hu et al., 2017), object detection (Ren et al., 2015; Redmon et al., 2016; Lin et al., 2018), semantic segmentation (Long et al., 2015; Zhao et al., 2017; Chen et al., 2018) and action recognition (Tran et al., 2015; Wang et al., 2016, 2018).

Despite the great success in numerous applications, recent studies show that deep CNNs are vulnerable to some well-designed input samples, named Adversarial Examples (Szegedy et al., 2013; Biggio et al., 2013). Take image classification as an example: for almost every commonly used, well-performing CNN, attackers are able to construct a small perturbation of an input image. The perturbation is almost imperceptible to humans but can fool the model into making a wrong prediction. The problem is serious because some designed adversarial examples can be transferred among different kinds of CNN architectures (Papernot et al., 2016), which makes it possible to perform a black-box attack: an attacker has no access to the model parameters or even the architecture, but can still easily fool a machine learning system.

There is a rapidly growing body of work on how to obtain a robust neural network model. Most of the successful methods are based on adversarial training (Szegedy et al., 2013; Madry et al., 2017; Goodfellow et al., 2015; Huang et al., 2015). The high-level idea of these works is that during training, we compute the strongest perturbation of each sample against the current model and use the perturbed sample together with the correct label for gradient descent optimization. However, the learned model tends to overfit the training data and fails to remain robust on unseen test data. For example, using the state-of-the-art adversarially robust training method (Madry et al., 2017), the defense success rate of the learned model on the test data is below 60% while that on the training data is almost 100%, which indicates that the robustness fails to generalize. Some theoretical results further show that it is challenging to achieve adversarially robust generalization. Fawzi et al. (2018) proves that adversarial examples exist for any classifier and can be transferred across different models, making it impossible to design network architectures free from adversarial attacks. Schmidt et al. (2018) shows that adversarially robust generalization requires much more labeled data than standard generalization in certain cases. Tsipras et al. (2019) presents an inherent trade-off between accuracy and robust accuracy and argues that the phenomenon comes from the fact that robust classifiers learn different features. Therefore it is hard to reach high robustness with standard training methods.

Given the challenge of the task and previous findings, in this paper, we provide several theoretical and empirical results towards better adversarially robust generalization. In particular, we show that we can learn an adversarially robust model which generalizes well if we have plenty of unlabeled data, and the labeled sample complexity for adversarially robust generalization in Schmidt et al. (2018) can be largely reduced if unlabeled data is used. First, we show that the expected robust risk can be upper bounded by the sum of two terms: a stability term which measures whether the model can output consistent predictions under perturbations, and an accuracy term which evaluates whether the model can make correct predictions on natural samples. Given the stability term does not rely on ground truth labels, unlabeled data can be used to minimize this term and thus improve the generalization ability. Second, we prove that for the Gaussian mixture problem defined in Schmidt et al. (2018), if unlabeled data can be used, adversarially robust generalization will be almost as easy as the standard generalization in supervised learning (i.e. using the same number of labeled samples under similar conditions). Inspired by the theoretical findings, we provide a practical algorithm that can learn from both labeled and unlabeled data for better adversarially robust generalization. Our experiments on MNIST and Cifar-10 show that the method achieves better performance, which verifies our theoretical findings.

Our contributions are threefold.

• In Section 3.2.1, we provide a theorem showing that unlabeled data can be naturally used to reduce the expected robust risk in the general setting, and thus leveraging unlabeled data is a way to improve adversarially robust generalization.

• In Section 3.2.2, we discuss a specific Gaussian mixture problem introduced in Schmidt et al. (2018). In Schmidt et al. (2018), the authors proved that in this case, the labeled sample complexity for robust generalization is significantly larger than that for standard generalization. As an extension of this work, we prove that in this case, the labeled sample complexity for robust generalization can be the same as that for standard generalization if we have enough unlabeled data.

• Inspired by our theoretical findings, we provide an adversarially robust training algorithm that uses both labeled and unlabeled data. Our experimental results show that the algorithm achieves better performance than baseline algorithms on MNIST and Cifar-10, which empirically confirms that unlabeled data can help improve adversarially robust generalization.

## 2 Related works

Most previous works study how to attack a neural network model using small perturbations under certain norm constraints, such as the $\ell_\infty$ norm or the $\ell_2$ norm. For the $\ell_\infty$ constraint, the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) finds a direction along which the perturbation increases the classification loss at an input point to the greatest extent; Projected Gradient Descent (PGD) (Madry et al., 2017) extends FGSM by updating the direction of the attack in an iterative manner and clipping the modifications back into the $\epsilon$-ball after each iteration. For the $\ell_2$ constraint, DeepFool (Moosavi-Dezfooli et al., 2016) iteratively computes a minimal-norm adversarial perturbation by linearizing the classifier around the input in each iteration. The C&W attack (Carlini & Wagner, 2016) is a comprehensive approach that works under both norm constraints. In this work, we focus on learning a robust model to defend against white-box attacks, i.e. the attacker knows the model parameters and thus can use the algorithms above to attack the model.
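To make the update rules concrete, here is a minimal NumPy sketch of FGSM and $\ell_\infty$ PGD for a generic differentiable loss; `grad_fn`, the toy linear loss at the end, and all constants are our own illustrative choices rather than anything from the cited papers.

```python
import numpy as np

def fgsm(x, grad_fn, eps):
    """One-step FGSM: move eps along the sign of the loss gradient."""
    return x + eps * np.sign(grad_fn(x))

def pgd(x, grad_fn, eps, step, iters):
    """Iterated FGSM, projecting back onto the l_inf ball of radius eps."""
    x_adv = x.copy()
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # clip into the eps-ball
    return x_adv

# toy check: for a linear loss l(x) = w.x, one FGSM step raises the loss
# by exactly eps * ||w||_1
w = np.array([1.0, -2.0, 0.5])
x = np.zeros(3)
x_adv = fgsm(x, lambda z: w, eps=0.1)
```

For this linear toy loss, PGD with enough steps reaches the same corner of the ball as FGSM; for non-linear losses the iterative version is the stronger attack.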

There are a large number of papers on defending against adversarial attacks, but the results are far from satisfactory. Remarkably, Athalye et al. (2018) shows that most defense methods take advantage of so-called “gradient masking” and provides an attacking method called BPDA to correct the gradients. So far, adversarial training (Madry et al., 2017) has been the most successful white-box defense algorithm. By modeling the learning problem as a mini-max game between the attacker and the defender, the robust model can be trained using iterative optimization methods. Some recent papers (Wang et al., 2019; Gao et al., 2019) theoretically prove the convergence of adversarial training. Moreover, Shafahi et al. (2019); Zhang et al. (2019a) propose ways to accelerate adversarial training. Adversarial logit pairing (Kannan et al., 2018) and TRADES (Zhang et al., 2019b) further improve adversarial training by decomposing the prediction error into the sum of classification error and boundary error, and Wang et al. (2019) proposes to improve adversarial training by evaluating the quality of adversarial examples using the FOSC metric.

##### Semi-supervised learning

Using unlabeled data to help the learning process has proved promising in different applications (Rasmus et al., 2015; Zhang & Shi, 2011; Elworthy, 1994). Many approaches use regularizers called “soft constraints” to make the model “behave” well on unlabeled data. For example, transductive SVM (Joachims, 1999) uses prediction confidence as a soft constraint, and graph-based SSL (Belkin et al., 2006; Talukdar & Crammer, 2009) requires the model to have similar outputs at the endpoints of an edge. The work most related to ours is consistency-based SSL. It uses consistency as a soft constraint, which encourages the model to make consistent predictions on unlabeled data when a small perturbation is added. The consistency metric can be computed either from the model’s own predictions, as in the Π model (Sajjadi et al., 2016), Temporal Ensembling (Laine & Aila, 2016) and Virtual Adversarial Training (Miyato et al., 2018), or from the predictions of a teacher model, as in the mean teacher model (Tarvainen & Valpola, 2017).

##### Semi-supervised learning for adversarially robust generalization

There are three other concurrent and independent works (Carmon et al., 2019; Uesato et al., 2019; Najafi et al., 2019) which also explore how to use unlabeled data to help adversarially robust generalization. We describe the three works below and compare them with ours. See also Carmon et al. (2019) and Uesato et al. (2019) for a comparison of all four works from their perspective.

Najafi et al. (2019) investigate robust semi-supervised learning from the distributionally robust optimization perspective. They assign soft labels to the unlabeled data according to an adversarial loss and train on these images together with the labeled ones. Results on a wide range of tasks show that the proposed algorithm improves adversarially robust generalization. Both Najafi et al. (2019) and we conduct semi-supervised experiments by removing labels from the training data.

Uesato et al. (2019) study the Gaussian mixture model of Schmidt et al. (2018) and theoretically show that a self-training algorithm can successfully leverage unlabeled data to improve adversarial robustness. They extend the self-training algorithm to the real image dataset Cifar-10, augment it with the unlabeled Tiny Images dataset and improve the state-of-the-art adversarial robustness. They show strong improvements in low-label regimes by removing most labels from Cifar-10 and SVHN. In our work, we also study the Gaussian mixture model and show that a slightly different algorithm can improve adversarially robust generalization as well. We observe similar improvements using our algorithm on Cifar-10 and MNIST.

Carmon et al. (2019) obtain similar theoretical and empirical results as in Uesato et al. (2019), and offer a more comprehensive analysis of other aspects. They show that by using unlabeled data and robust self-training, the learned models can obtain better certified robustness against all possible attacks. Moreover, they study the impact of different training components on the final model performance, such as the size of unlabeled data. We also study the influence of different factors in our experiments and have similar observations.

## 3 Main results

In this section, we illustrate the benefits of using unlabeled data for robust generalization from a theoretical perspective.

### 3.1 Notations and definitions

We consider a standard classification task with an underlying data distribution $P_{XY}$ over pairs of examples $x\in\mathcal{X}$ and corresponding labels $y\in\mathcal{Y}$. Usually $P_{XY}$ is unknown and we can only access samples $(x_1,y_1),\dots,(x_n,y_n)$ in which each $(x_i,y_i)$ is independent and identically drawn from $P_{XY}$, $i=1,\dots,n$. For ease of reference, we denote this empirical distribution as $\hat{P}_{XY}$ (i.e. the uniform distribution over the $n$ i.i.d. sampled data). We also assume that we are given a suitable loss function $l(f(x),y)$, where the classifier $f$ is parameterized by $\theta$. The standard loss function is the zero-one loss, i.e. $l(f(x),y)=I(f(x)\neq y)$. Due to its discontinuous and non-differentiable nature, surrogate loss functions such as cross-entropy or mean square loss are commonly used during optimization.

Our goal is to find an $f$ that minimizes the expected classification risk. Without loss of generality, our theory is mainly based on the binary classification problem, i.e. $\mathcal{Y}=\{-1,+1\}$. All theorems below can be easily extended to the multi-class classification problem. For a binary classification problem, the expected classification risk is defined as below.

###### Definition 1.

(Expected classification risk). Let $P$ be a probability distribution over $\mathcal{X}\times\mathcal{Y}$. The expected classification risk of a classifier $f$ under distribution $P$ and loss function $l$ is defined as $R_P(f)=\mathbb{E}_{(x,y)\sim P}\,l(f(x),y)$.

We use $R(f)$ to denote the classification risk under the underlying distribution $P_{XY}$ and use $\hat{R}(f)$ to denote the classification risk under the empirical distribution $\hat{P}_{XY}$. We use $R^{0\text{-}1}(f)$ to denote the risk with the zero-one loss function. The classification risk characterizes whether the model is accurate. However, we also care about whether $f$ is robust. For example, when the input $x$ is an image, we hope a small change (perturbation) to $x$ will not change the prediction of $f$. To this end, Schmidt et al. (2018) defines the expected robust classification risk as follows.

###### Definition 2.

(Expected robust classification risk). Let $P$ be a probability distribution over $\mathcal{X}\times\mathcal{Y}$ and let $B$ map each $x\in\mathcal{X}$ to a perturbation set $B(x)\subseteq\mathcal{X}$. Then the $B$-robust classification risk of a classifier $f$ under distribution $P$ and loss function $l$ is defined as $R_{P,B\text{-robust}}(f)=\mathbb{E}_{(x,y)\sim P}\sup_{x'\in B(x)}l(f(x'),y)$.

Again, we use $R_{B\text{-robust}}(f)$ to denote the expected robust classification risk under the underlying distribution and use $\hat{R}_{B\text{-robust}}(f)$ to denote the expected robust classification risk under the empirical distribution. We use $R^{0\text{-}1}_{B\text{-robust}}(f)$ to denote the robust risk with the zero-one loss function. In real practice, the most commonly used setting is the perturbation set under the $\epsilon$-bounded $\ell_\infty$ norm constraint, $B^\infty_\epsilon(x)=\{x':\|x'-x\|_\infty\le\epsilon\}$. For simplicity, we refer to the robustness defined by this perturbation set as $\ell_\infty^\epsilon$-robustness.
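For a linear classifier $f_w(x)=\mathrm{sign}(w^\top x)$, the inner supremum over $B^\infty_\epsilon(x)$ has a closed form, since the worst perturbation shifts the margin $y\cdot w^\top x$ down by exactly $\epsilon\|w\|_1$. The sketch below (our own illustration; the function names are hypothetical) checks this against brute force over the corners of the ball:

```python
import numpy as np
from itertools import product

def robust01_closed_form(w, x, y, eps):
    # worst-case l_inf perturbation shifts the margin y*w.x down by eps*||w||_1
    return float(y * (w @ x) - eps * np.abs(w).sum() <= 0)

def robust01_brute_force(w, x, y, eps):
    # for a linear classifier the sup is attained at a corner of the ball,
    # so enumerating the 2^d corners x + eps*s, s in {-1,+1}^d, is exhaustive
    return float(any(np.sign(w @ (x + eps * np.array(s))) != y
                     for s in product([-1.0, 1.0], repeat=len(x))))

rng = np.random.default_rng(0)
for _ in range(100):
    w, x = rng.normal(size=3), rng.normal(size=3)
    y, eps = rng.choice([-1.0, 1.0]), 0.3
    assert robust01_closed_form(w, x, y, eps) == robust01_brute_force(w, x, y, eps)
```

This closed form is exactly the mechanism used later in Lemma 3 and in our experiments with PGD-bounded perturbations.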

### 3.2 Robust generalization analysis

Our first result (Section 3.2.1) shows that unlabeled data can be used to improve adversarially robust generalization in the general setting. Our second result (Section 3.2.2) shows that for a specific learning problem defined on the Gaussian mixture model, compared to previous work (Schmidt et al., 2018), the sample complexity for robust generalization can be significantly reduced by using unlabeled data. Both results suggest that using unlabeled data is a natural way to improve adversarially robust generalization. All detailed proofs of the theorems and lemmas in this section can be found in the appendix.

#### 3.2.1 General results

In this subsection, we show that the expected robust classification risk can be bounded by the sum of two terms. The first term only depends on the hypothesis space and the unlabeled data, and the second term is a standard PAC bound.

###### Theorem 1.

Let $\mathcal{F}$ be the hypothesis space and let $S=\{(x_1,y_1),\dots,(x_n,y_n)\}$ be the set of i.i.d. samples drawn from the underlying distribution $P_{XY}$. For any function $f\in\mathcal{F}$, with probability at least $1-\delta$ over the random draw of $S$, we have

$$R^{0\text{-}1}_{B\text{-robust}}(f)\le\underbrace{\mathbb{E}_{x\sim P_X}\sup_{x'\in B(x)}I\bigl(f(x')\ne f(x)\bigr)}_{(1)}+\underbrace{\hat{R}^{0\text{-}1}(f)+2\Re_S(l\circ\mathcal{F})+3\sqrt{\frac{\log(2/\delta)}{2n}}}_{(2)},$$

where (1) is a term that can be optimized with only unlabeled data and (2) is the standard PAC generalization bound. $P_X$ is the marginal distribution of $x$ and $\Re_S(l\circ\mathcal{F})$ is the empirical Rademacher complexity of the hypothesis space composed with the loss.

From Theorem 1, we can see that the expected robust classification risk is bounded by the sum of two terms: the first term only involves the marginal distribution $P_X$ and the second term is the standard PAC generalization error bound. This shows that expected robust risk minimization can be achieved by jointly optimizing the two terms: we can optimize the first term using unlabeled data sampled from $P_X$ and optimize the second term using labeled data sampled from $P_{XY}$, which is the same as standard supervised learning.
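The pointwise inequality behind this decomposition, robust error at a sample is at most stability plus natural error, can be sanity-checked by Monte Carlo. The sketch below (our own illustration, with arbitrary constants) samples from the Gaussian mixture of Section 3.2.2, where a linear classifier admits closed forms for all three quantities:

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, eps, n = 20, 2.0, 0.05, 20000
theta = np.ones(d)                            # ||theta||_2 = sqrt(d)
w = rng.normal(size=d)
w /= np.linalg.norm(w)                        # an arbitrary unit linear classifier

y = rng.choice([-1.0, 1.0], size=n)
x = y[:, None] * theta + sigma * rng.normal(size=(n, d))

margin = x @ w
shift = eps * np.abs(w).sum()                 # worst-case l_inf margin shift
robust = np.mean(y * margin - shift <= 0)     # robust 0-1 risk
stability = np.mean(np.abs(margin) <= shift)  # prediction can be flipped
natural = np.mean(y * margin <= 0)            # standard 0-1 risk
```

Sample by sample, a robust error implies either a natural error or a flippable prediction, so `robust <= stability + natural` holds exactly; this is the first step in the proof of Theorem 1.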

While Cullina et al. (2018) suggests that in the standard PAC learning scenario (only labeled data is considered), the generalization gap of the robust risk can sometimes not be controlled by the capacity of the hypothesis space $\mathcal{F}$, our results show that we can mitigate this problem by introducing unlabeled data. In fact, our following result shows that with enough unlabeled data, learning a robust model can be almost as easy as learning a standard model.

#### 3.2.2 Learning from Gaussian mixture model

The learning problem defined on the Gaussian mixture model is illustrated in Schmidt et al. (2018) as an example showing that adversarially robust generalization needs much more labeled data than standard generalization. In this subsection, we show that for this specific problem, just using more unlabeled data is enough to achieve adversarially robust generalization. For completeness, we first list the results in Schmidt et al. (2018) and then show our theoretical findings.

###### Definition 3.

(Gaussian mixture model (Schmidt et al., 2018)). Let $\theta^*\in\mathbb{R}^d$ be the per-class mean vector and let $\sigma>0$ be the variance parameter. Then the $(\theta^*,\sigma)$-Gaussian mixture model is defined by the following distribution over $(x,y)\in\mathbb{R}^d\times\{-1,+1\}$: First, draw a label $y\in\{-1,+1\}$ uniformly at random. Then sample the data point $x\sim\mathcal{N}(y\cdot\theta^*,\sigma^2 I_d)$.

Given samples from the distribution defined above, the learning problem is to find a linear classifier $f_w(x)=\mathrm{sign}(w^\top x)$ to predict the label $y$ from $x$. Schmidt et al. (2018) proved the following sample complexity bound for standard generalization.

###### Theorem 2.

(Theorem 4 in Schmidt et al. (2018)). Let $(x,y)$ be drawn from the $(\theta^*,\sigma)$-Gaussian mixture model with $\|\theta^*\|_2=\sqrt{d}$ and $\sigma\le c\cdot d^{1/4}$, where $c$ is a universal constant. Let $\hat{w}$ be the vector $y\cdot x$. Then with high probability, the expected classification risk of the linear classifier $f_{\hat{w}}$ using 0-1 loss is at most 1%.

Theorem 2 suggests that we can learn a linear classifier with low classification risk (e.g., 1%) even if there is only one labeled sample. However, the following theorem shows that for adversarially robust generalization under $\ell_\infty$ perturbations, significantly more labeled data is required.
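Theorem 2 is easy to reproduce numerically. The sketch below (our own illustration; the constant in front of $d^{1/4}$ is chosen as $1/2$ for concreteness) builds the classifier $\hat{w}=y\cdot x$ from a single labeled sample and measures its standard accuracy on fresh data:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 400
sigma = 0.5 * d ** 0.25           # sigma = c * d^(1/4) with c = 1/2
theta = np.ones(d)                # per-class mean with ||theta||_2 = sqrt(d)

# a single labeled sample defines the classifier w_hat = y * x
y0 = rng.choice([-1.0, 1.0])
x0 = y0 * theta + sigma * rng.normal(size=d)
w_hat = y0 * x0

# standard (non-robust) 0-1 accuracy on fresh test data
y = rng.choice([-1.0, 1.0], size=10000)
x = y[:, None] * theta + sigma * rng.normal(size=(10000, d))
acc = np.mean(np.sign(x @ w_hat) == y)
```

With these settings the single-sample classifier is already highly accurate, matching the theorem's message that standard generalization here is nearly free.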

###### Theorem 3.

(Theorem 6 in Schmidt et al. (2018)). Let $g_n$ be any learning algorithm, i.e. a function from $n$ samples to a binary classifier $f_n$. Moreover, let $\sigma=c_1\cdot d^{1/4}$, let $\epsilon\ge 0$, and let $\theta^*$ be drawn from $\mathcal{N}(0,I_d)$. We also draw $n$ samples from the $(\theta^*,\sigma)$-Gaussian mixture model. Then the expected $\ell_\infty^\epsilon$-robust classification risk of $f_n$ using 0-1 loss is at least $(1-1/d)\cdot\frac12$ if the number of labeled data $n\le c_2\,\frac{\epsilon^2\sqrt{d}}{\log d}$.

As we can see from the theorem above, the sample complexity for robust generalization is larger than that of standard generalization by a factor of $\sqrt{d}$ (up to logarithmic factors). This shows that for high-dimensional problems, adversarial robustness can provably require a significantly larger number of samples. We provide a new result which shows that the learned model can be robust if there is only one labeled sample and sufficiently many unlabeled samples. Our theorem is stated as follows:

###### Theorem 4.

Let $(x_L,y_L)$ be a labeled point drawn from the $(\theta^*,\sigma)$-Gaussian mixture model with $\|\theta^*\|_2=\sqrt{d}$ and $\sigma=c_1 d^{1/4}$. Let $x_1,\dots,x_n$ be $n$ unlabeled points drawn from the same model with labels removed. Let $v$ be a unit maximal eigenvector of the sample covariance matrix $\hat\Sigma=\frac1n\sum_{i=1}^n x_ix_i^\top$ such that $\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|_2\le\tau_0$ ($\tau_0$ as in Lemma 1). Let $\hat{w}=\mathrm{sign}(y_L\cdot v^\top x_L)\cdot v$. Then there exists a constant $c$ such that for any $\epsilon\le c$, with high probability, the expected $\ell_\infty^\epsilon$-robust classification risk of $f_{\hat{w}}$ using 0-1 loss is at most 1% when the number of unlabeled points $n$ and the dimension $d$ are sufficiently large.

From Theorem 4, we can see that when the number of unlabeled points is sufficiently large, we can learn a highly accurate and robust model using only one labeled point.

##### Proof sketch

The learning process can be intuitively described as the following three steps: in the first step, we use unlabeled data to estimate the direction of $\theta^*$, although we do not know which label each of $\pm\theta^*$ corresponds to. Specifically, we choose the direction $v$ which maximizes the quantity $\frac1n\sum_{i=1}^n(v^\top x_i)^2$, which can be viewed as a measure of the confidence at the data points. In the second step, we use the given labeled point to determine the “sign” of $v$; we note that when the direction is correctly estimated in the first step, the single labeled point is sufficient to give the correct sign with high probability. Finally, we obtain a good estimation of the direction of $\theta^*$ by combining the two steps above and learn a robust classifier. The three key lemmas corresponding to the three steps are listed below ($c_i$ are constants for $i=0,1,2,\dots$).

###### Lemma 1.

Under the same setting as Theorem 4, suppose that $d$ and $n$ are sufficiently large. Then, with probability at least $1-c_4e^{-c_5d}$, there is a unique unit maximal eigenvector $v$ of the sample covariance matrix $\hat\Sigma=\frac1n\sum_{i=1}^n x_ix_i^\top$ such that $\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|_2\le\tau_0=\min\left\{c_0\sigma\sqrt{\frac{\sigma^2+d}{nd}}+c_3,\ \sqrt{2}\right\}$.

###### Lemma 2.

Under the same setting as Theorem 4, suppose $v$ is a unit vector such that $\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|_2\le\tau_0$ for some constant $\tau_0<\sqrt{2}$. Then with probability at least $1-\exp\left(-\frac{d(1-\tau_0^2/2)^2}{2\sigma^2}\right)$, we have $\mathrm{sign}(y_L\cdot v^\top x_L)\cdot v^\top\theta^*>0$.

###### Lemma 3.

(Lemma 20 in Schmidt et al. (2018)). Under the same setting as Theorem 4, for any $\epsilon\ge0$ and for any unit vector $\hat{w}$ such that $\langle\hat{w},\theta^*\rangle-\epsilon\|\hat{w}\|_\infty^*\ge0$, where $\|\cdot\|_\infty^*=\|\cdot\|_1$ is the dual norm of $\|\cdot\|_\infty$, the linear classifier $f_{\hat{w}}$ has $\ell_\infty^\epsilon$-robust classification risk at most $\exp\left(-\frac{(\langle\hat{w},\theta^*\rangle-\epsilon\|\hat{w}\|_\infty^*)^2}{2\sigma^2}\right)$.
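The three lemmas translate directly into a short estimator: PCA on unlabeled data, a sign fix from one labeled point, and a closed-form robust evaluation. A NumPy sketch of this pipeline (our own illustration, with arbitrary small constants) follows:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
sigma = 0.5 * d ** 0.25
eps = 0.05
theta = np.ones(d)                     # ||theta||_2 = sqrt(d)

def sample(n):
    y = rng.choice([-1.0, 1.0], size=n)
    return y[:, None] * theta + sigma * rng.normal(size=(n, d)), y

# Step 1 (Lemma 1): estimate the direction from unlabeled data via the
# top eigenvector of the sample covariance matrix
xu, _ = sample(5000)                   # labels are discarded
cov = xu.T @ xu / len(xu)
v = np.linalg.eigh(cov)[1][:, -1]      # unit top eigenvector, sign ambiguous

# Step 2 (Lemma 2): resolve the sign ambiguity with one labeled point
xl, yl = sample(1)
w_hat = np.sign(yl[0] * (v @ xl[0])) * v

# Step 3 (Lemma 3): for a linear classifier, an l_inf robust error occurs
# iff y * w.x <= eps * ||w||_1
xt, yt = sample(10000)
robust_acc = np.mean(yt * (xt @ w_hat) - eps * np.abs(w_hat).sum() > 0)
```

With thousands of unlabeled points and a single labeled one, the estimated direction is close to $\theta^*/\sqrt{d}$ and the resulting classifier is robust, as Theorem 4 predicts.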

Our theoretical findings suggest that we can improve the adversarially robust generalization using unlabeled data. In the next section, we will present a practical algorithm for real applications, which further verifies our main results.

## 4 Algorithm and experiments

### 4.1 Practical algorithm

Let $S_L=\{(x_1,y_1),\dots,(x_n,y_n)\}$ be a set of labeled data and $S_U=\{x_1,\dots,x_m\}$ be a set of unlabeled data. Motivated by the theory in the previous section, to achieve better adversarially robust generalization, we can optimize the classifier to be accurate on $S_L$ and robust on $S_L\cup S_U$. This is equivalent to making the classifier accurate and robust on $S_L$, and robust on $S_U$. Therefore, we design two loss terms on $S_L$ and $S_U$ separately.

For the labeled dataset $S_L$, we use the standard $\ell_\infty^\epsilon$-robust adversarial training objective function:

$$L_1(f,S_L)=\frac1n\sum_{i=1}^n\max_{x'_i\in B^\infty_\epsilon(x_i)} l_{CE}(f(x'_i),y_i) \tag{2}$$

Following the most common setting, during training the classifier $f$ outputs a probability distribution over categories and is evaluated by the cross-entropy loss defined as $l_{CE}(f(x),y)=-\log f_y(x)$, where $f_k(x)$ is the output probability for category $k$.

For the unlabeled data $S_U$, we use an objective function which measures robustness without labels:

$$L_2(f,S_U)=\frac1m\sum_{i=1}^m\max_{x'_i\in B^\infty_\epsilon(x_i)} l_{CE}(f(x'_i),\hat{y}_i),\quad\text{where }\hat{y}_i=\operatorname*{arg\,max}_k f_k(x_i). \tag{3}$$

Putting the two objective functions together, our training loss is defined as a combination of $L_1$ and $L_2$ as follows:

$$L_{SSL}(f,S_L,S_U)=L_1(f,S_L)+\lambda L_2(f,S_U). \tag{4}$$

Here $\lambda$ is a coefficient to trade off the two loss terms. In real practice, we use iterative optimization methods to learn the function $f$. In the inner loop, we fix the model $f$ and use Projected Gradient Descent (PGD) to compute the attack $x'_i$ for each $x_i$. In the outer loop, we use stochastic gradient descent to optimize $f$ on the perturbed $x'_i$'s. The general training process is shown in Algorithm 1.
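As a minimal sketch of the combined objective (not Algorithm 1 itself), the following NumPy code runs PGD-based adversarial training with pseudo-labeled unlabeled data for a binary logistic model; the model class, the warm start, and all hyperparameters are our own illustrative choices, not the paper's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_attack(w, x, y, eps, iters=7):
    """Inner maximization: multi-step PGD on the logistic loss, l_inf ball."""
    step = 2.5 * eps / iters
    x_adv = x.copy()
    for _ in range(iters):
        # gradient of log(1 + exp(-y * w.x)) with respect to x
        g = -(y * sigmoid(-y * (x_adv @ w)))[:, None] * w
        x_adv = np.clip(x_adv + step * np.sign(g), x - eps, x + eps)
    return x_adv

def ssl_loss_grad(w, xl, yl, xu, eps, lam):
    """L_SSL = L1(labeled, true labels) + lam * L2(unlabeled, pseudo-labels)."""
    yu = np.where(xu @ w >= 0, 1.0, -1.0)          # pseudo-labels
    loss, grad = 0.0, np.zeros_like(w)
    for x, y, scale in ((xl, yl, 1.0), (xu, yu, lam)):
        xa = pgd_attack(w, x, y, eps)
        m = -y * (xa @ w)
        loss += scale * np.mean(np.logaddexp(0.0, m))
        grad += scale * (-(y * sigmoid(m))[:, None] * xa).mean(axis=0)
    return loss, grad

# toy semi-supervised adversarial training on a 2-d Gaussian mixture
rng = np.random.default_rng(0)
theta = np.array([2.0, 2.0])
yl = rng.choice([-1.0, 1.0], size=20)
xl = yl[:, None] * theta + rng.normal(size=(20, 2))
xu = rng.choice([-1.0, 1.0], size=200)[:, None] * theta + rng.normal(size=(200, 2))

w = (yl[:, None] * xl).mean(axis=0)                # warm start from labeled data
for _ in range(200):
    loss, g = ssl_loss_grad(w, xl, yl, xu, eps=0.25, lam=1.0)
    w -= 0.5 * g

yt = rng.choice([-1.0, 1.0], size=2000)
xt = yt[:, None] * theta + rng.normal(size=(2000, 2))
test_acc = np.mean(np.where(xt @ w >= 0, 1.0, -1.0) == yt)
```

The unlabeled term uses the model's own predictions as targets, so it penalizes prediction changes under perturbation without needing labels, mirroring the stability term of Theorem 1.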

##### Remark

We notice that Algorithm 1 is a generalized version of Virtual Adversarial Training (VAT) (Miyato et al., 2018). When the perturbation size $\epsilon$ is set to a very small value, the algorithm is almost equivalent to the original VAT algorithm, which is particularly useful for improving standard generalization. However, according to our experimental results below, such a setting does not help improve adversarially robust generalization; the improvement from unlabeled data appears only with a relatively larger $\epsilon$.

### 4.2 Experimental setting

We verify Algorithm 1 on MNIST and Cifar-10. Following Madry et al. (2017), we use the ResNet model and widen the network by a factor of 10, which results in a network with five residual units with (16, 160, 320, 640) filters each. During training, we apply data augmentation including random crops and flips, as well as per-image standardization. The initial learning rate is 0.1 and is decayed by a factor of 0.1 twice during training. In the inner loop, we run 7-step PGD for each mini-batch. The perturbation is constrained under the $\ell_\infty$ norm.

Following many previous works (Laine & Aila, 2016; Tarvainen & Valpola, 2017; Miyato et al., 2018; Athiwaratkun et al., 2019), we sample two differently sized subsets of the training set and use them as labeled data. We mask out the labels of the remaining images in the training set and use them as unlabeled data. By doing this, we conduct two semi-supervised learning tasks, referred to below as the small-label and large-label experiments. In a mini-batch, we sample 25/50 labeled images and 225/200 unlabeled images for the small-label/large-label experiment respectively. In both experiments, we use several different values of $\lambda$ as an ablation study for this hyperparameter. The learning rate is decayed twice during training. We use the original PGD-based adversarial training (Madry et al., 2017) on the sampled labeled data as the baseline algorithm for comparison (referred to as PGD-adv). Our algorithm is referred to as Ours.

### 4.3 Experimental results

We list all results of the two experiments in Tables 1 and 2. We use five criteria to evaluate the performance of the model: the natural training/test accuracy (NA$_{\text{train}}$ and NA$_{\text{test}}$), the robust training/test accuracy under the PGD-7 attack (RA$_{\text{train}}$ and RA$_{\text{test}}$), and the defense success rate (DSR).

First, we can see that in both experiments, the robust test accuracy is improved when we use unlabeled data. For example, on Cifar-10 the robust test accuracies of the models trained under SSL increase by 3.0/5.0 percentage points in the two experiments compared to the PGD-adv baselines. We also check the defense success rate, which evaluates whether the model is robust given that its prediction is correct. As we can see from the last column in Tables 1 and 2, the defense success rate of models trained using our proposed method is much higher than that of the baselines. In particular, the defense success rate of the best model is competitive with that of a model trained using PGD-adv on the whole dataset. This clearly shows the advantage of our proposed algorithm.

Second, we can also see the influence of the value of $\lambda$. The model trained with a larger $\lambda$ has higher robust accuracy; the robust test accuracy with the largest $\lambda$ is substantially better than that with the smallest. However, we observe that training becomes hard to converge if $\lambda$ is too large.

Third, using a larger perturbation $\epsilon$ during training produces more robust models. As we can see from the table, relatively higher natural training/test accuracy can be achieved by setting $\epsilon$ to a very small value (the vanilla VAT algorithm). However, the robust training/testing accuracies are then significantly worse and are near zero. This clearly shows that using a stronger attack on both labeled and unlabeled data leads to better adversarially robust generalization, which is also consistent with our theory.

## 5 Conclusion

In this paper, we theoretically and empirically show that with just more unlabeled data, we can learn models with better adversarially robust generalization. We first give an expected robust risk decomposition theorem and then show that for a specific learning problem on the Gaussian mixture model, the adversarially robust generalization can be almost as easy as standard generalization. Based on these theoretical results, we develop an algorithm which leverages unlabeled data during training and empirically show its advantage. As future work, we will study the sample complexity of unlabeled data for broader function classes and solve more challenging real tasks.

## Appendix A Background on generalization and Rademacher complexity

The Rademacher complexity is a commonly used capacity measure for a hypothesis space.

###### Definition 4.

Given a set $S=\{z_1,\dots,z_n\}$ of samples, the empirical Rademacher complexity of a function class $\mathcal{F}$ (mapping from $\mathcal{Z}$ to $\mathbb{R}$) is defined as:

$$\Re_S(\mathcal{F})=\mathbb{E}_{\boldsymbol\sigma}\left[\sup_{f\in\mathcal{F}}\frac1n\sum_{i=1}^n\sigma_i f(z_i)\right],$$

where $\boldsymbol\sigma=(\sigma_1,\dots,\sigma_n)$ contains i.i.d. random variables drawn from the Rademacher distribution unif({1, -1}).

By using the Rademacher complexity, we can directly provide an upper bound on the generalization error.

###### Theorem 5.

(Theorem 3.5 in Mohri et al. (2012)). Suppose $l$ is the 0-1 loss, and let $S=\{(x_1,y_1),\dots,(x_n,y_n)\}$ be the set of i.i.d. samples drawn from the underlying distribution $P_{XY}$. Let $\mathcal{F}$ be the hypothesis space; then with probability at least $1-\delta$ over $S$, for any $f\in\mathcal{F}$:

$$R(f)\le\hat{R}(f)+2\Re_S(l\circ\mathcal{F})+3\sqrt{\frac{\log(2/\delta)}{2n}}.$$

## Appendix B Proof of Theorem 1

###### Proof.

For the indicator function $I(\cdot)$, we have for any $x$, $x'$ and $y$,

$$I(f(x')\ne y)\le I(f(x)\ne y)+I(f(x)\ne f(x')). \tag{7}$$

According to Definition 2, we have

$$
\begin{aligned}
R_{B\text{-robust}}(f)&=\mathbb{E}_{(x,y)\sim P_{XY}}\sup_{x'\in B(x)}l(f(x'),y)=\mathbb{E}_{(x,y)\sim P_{XY}}\sup_{x'\in B(x)}I(f(x')\ne y)\\
&\le\mathbb{E}_{(x,y)\sim P_{XY}}\sup_{x'\in B(x)}\left(I(f(x)\ne y)+I(f(x)\ne f(x'))\right)\qquad(8)\\
&=\mathbb{E}_{x\sim P_X}\sup_{x'\in B(x)}I(f(x')\ne f(x))+\mathbb{E}_{(x,y)\sim P_{XY}}l(f(x),y)\\
&=\mathbb{E}_{x\sim P_X}\sup_{x'\in B(x)}I(f(x')\ne f(x))+R(f),
\end{aligned}
$$

where (8) is derived from (7). We further use Theorem 5 to bound $R(f)$. It is easy to verify that with probability at least $1-\delta$, for any $f\in\mathcal{F}$:

$$R(f)\le\hat{R}(f)+2\Re_S(l\circ\mathcal{F})+3\sqrt{\frac{\log(2/\delta)}{2n}},$$

which completes the proof. ∎

## Appendix C Proof of Theorem 4

For convenience, in this section, we use $c_i$ or $c'_i$ to denote universal constants, where $i=0,1,2,\dots$.

In the proof of Theorem 4, we will use the concentration bound for covariance estimation in Wainwright (2019). We first introduce the definition of the spiked covariance ensemble.

###### Definition 5.

(Spiked covariance ensemble). A sample from the spiked covariance ensemble takes the form

$$x_i=\sqrt{\nu}\,\xi_i\theta_0+w_i, \tag{9}$$

where $\xi_i$ is a zero-mean random variable with unit variance, $\nu>0$ is a fixed scalar, $\theta_0$ is a fixed unit vector, and $w_i$ is a random vector independent of $\xi_i$, with zero mean and covariance matrix $I_d$.

To see why the spiked covariance ensemble model is useful, we note that the Gaussian mixture model is a special case of it. Specifically, let $x_1,\dots,x_n$ be the unlabeled data in Theorem 4. Then each $x_i$ follows the Gaussian mixture distribution $\frac12\mathcal{N}(\theta^*,\sigma^2I_d)+\frac12\mathcal{N}(-\theta^*,\sigma^2I_d)$, and $x_i/\sigma$ is a spiked covariance ensemble with parameter $\nu=d/\sigma^2$, $\xi_i$ uniformly distributed on $\{-1,+1\}$, $\theta_0=\theta^*/\sqrt{d}$, and $w_i\sim\mathcal{N}(0,I_d)$.
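Concretely, writing an unlabeled sample as $x_i=y_i\theta^*+\sigma w_i$ with $w_i\sim\mathcal{N}(0,I_d)$ and dividing by $\sigma$ makes the correspondence explicit:

$$\frac{x_i}{\sigma}=\frac{\sqrt{d}}{\sigma}\cdot y_i\cdot\frac{\theta^*}{\sqrt{d}}+w_i=\sqrt{\nu}\,\xi_i\,\theta_0+w_i,\qquad\nu=\frac{d}{\sigma^2},\quad\xi_i=y_i,\quad\theta_0=\frac{\theta^*}{\sqrt{d}},$$

so that $\mathbb{E}\left[\frac{x_i}{\sigma}\left(\frac{x_i}{\sigma}\right)^\top\right]=\nu\,\theta_0\theta_0^\top+I_d$, whose top eigenvector is exactly $\theta_0$. This is why principal component analysis of the unlabeled data recovers the direction of $\theta^*$.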

The following theorem from Wainwright (2019) characterizes the concentration property of the spiked covariance ensemble, which we will further use to bound the robust classification error. Intuitively, the theorem says that we can approximately recover $\theta_0$ in the spiked covariance ensemble model using the top eigenvector of the sample covariance matrix.

###### Theorem 6.

(Concentration of covariance estimation, see Corollary 8.7 in Wainwright (2019)). Given $n$ i.i.d. samples $x_i$ from the spiked covariance ensemble with sub-Gaussian tails (which means both $\xi_i$ and the entries of $w_i$ are sub-Gaussian with parameter at most one), suppose that $d$ and $n$ are sufficiently large. Then, with probability at least $1-c_1e^{-c_2d}$, there is a unique maximal eigenvector $\hat\theta$ of the sample covariance matrix $\hat\Sigma=\frac1n\sum_{i=1}^n x_ix_i^\top$ such that

$$\left\|\hat\theta-\theta_0\right\|_2\le c_0\sqrt{\frac{\nu+1}{\nu^2}}\sqrt{\frac{d}{n}}+c_3. \tag{10}$$

Using the theorem above, we can show that for the Gaussian mixture model, one of the top unit eigenvectors of the sample covariance matrix is approximately $\theta^*/\sqrt{d}$. In other words, we can approximately recover the parameter $\theta^*$ up to a sign difference: the principal component analysis of the unlabeled data gives either $v$ or $-v$, while one of the two is close to $\theta^*/\sqrt{d}$.

###### Lemma 4.

Under the same setting as Theorem 4, suppose that $d$ and $n$ are sufficiently large. Then, with probability at least $1-c_4e^{-c_5d}$, there is a unique maximal eigenvector $v$ of the sample covariance matrix $\hat\Sigma=\frac1n\sum_{i=1}^n x_ix_i^\top$ with unit norm such that

$$\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|_2\le\tau_0=\min\left\{c_0\sigma\sqrt{\frac{\sigma^2+d}{nd}}+c_3,\ \sqrt{2}\right\}. \tag{11}$$
###### Proof.

As discussed above, $x_i/\sigma$ is a spiked covariance ensemble. By Theorem 6, with probability at least $1-c'_1e^{-c'_2d}$, there is a unique maximal eigenvector $\tilde{v}$ of the sample covariance matrix such that

$$\left\|\tilde{v}-\frac{\theta^*}{\sqrt{d}}\right\|_2\le c'_0\sigma\sqrt{\frac{\sigma^2+d}{nd}}+c'_3. \tag{12}$$

Let $\tau=c'_0\sigma\sqrt{\frac{\sigma^2+d}{nd}}+c'_3$; we have $\left\|\tilde{v}-\frac{\theta^*}{\sqrt{d}}\right\|_2\le\tau$. Below we need to consider two cases, $\tau\le1$ and $\tau>1$.

Case 1: $\tau\le1$. Let $v=\tilde{v}/\|\tilde{v}\|$; since both $v$ and $\theta^*/\sqrt{d}$ are unit vectors, we have

$$\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|^2=2-2\left\langle v,\frac{\theta^*}{\sqrt{d}}\right\rangle. \tag{13}$$

Recall that $\left\|\tilde{v}-\frac{\theta^*}{\sqrt{d}}\right\|\le\tau$, which is equivalent to

$$\tau^2\ge\|\tilde{v}\|^2+\left\|\frac{\theta^*}{\sqrt{d}}\right\|^2-2\left\langle\tilde{v},\frac{\theta^*}{\sqrt{d}}\right\rangle=\|\tilde{v}\|^2+1-2\|\tilde{v}\|\left\langle v,\frac{\theta^*}{\sqrt{d}}\right\rangle.$$

Rearranging the terms and using the AM-GM inequality gives

$$2\left\langle v,\frac{\theta^*}{\sqrt{d}}\right\rangle\ge\|\tilde{v}\|+\frac{1-\tau^2}{\|\tilde{v}\|}\ge2\sqrt{1-\tau^2}. \tag{14}$$

Therefore, by equation 13,

$$\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|=\sqrt{2-2\left\langle v,\frac{\theta^*}{\sqrt{d}}\right\rangle}\le\sqrt{2-2\sqrt{1-\tau^2}}=\sqrt{\frac{2\tau^2}{1+\sqrt{1-\tau^2}}}\le\sqrt{2}\,\tau=\sqrt{2}\left(c'_0\sigma\sqrt{\frac{\sigma^2+d}{nd}}+c'_3\right).$$

Setting $c_0=\sqrt{2}c'_0$ and $c_3=\sqrt{2}c'_3$ completes the proof of this case.

Case 2: $\tau>1$. Let $v$ be one of $\pm\tilde{v}/\|\tilde{v}\|$ such that the inner product $\left\langle v,\frac{\theta^*}{\sqrt{d}}\right\rangle$ is nonnegative. Since both $v$ and $\theta^*/\sqrt{d}$ are unit vectors, we have

$$\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|^2=2-2\left\langle v,\frac{\theta^*}{\sqrt{d}}\right\rangle\le2. \tag{15}$$

Therefore, $\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|\le\sqrt{2}$. ∎

Now we have proved that by using the top eigenvector of the sample covariance matrix, we can recover the direction of $\theta^*$ up to a sign difference. Next, we will show that it is possible to determine the sign using the labeled data.

###### Lemma 5.

Under the same setting as Theorem 4, suppose $v$ is a unit vector such that $\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|_2\le\tau_0$ where $\tau_0<\sqrt{2}$. Then with probability at least $1-\exp\left(-\frac{d(1-\tau_0^2/2)^2}{2\sigma^2}\right)$, we have $\mathrm{sign}(y_L\cdot v^\top x_L)\,v^\top\theta^*>0$.

###### Proof.

Since $\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|_2\le\tau_0<\sqrt{2}$, and both $v$ and $\theta^*/\sqrt{d}$ are unit vectors, we have $v^\top\theta^*>0$. So the event $\mathrm{sign}(y_L\cdot v^\top x_L)\,v^\top\theta^*\le0$ is equivalent to the event $y_L\cdot v^\top x_L\le0$, i.e.

$$P\left[\mathrm{sign}(y_L\cdot v^\top x_L)\,v^\top\theta^*\le0\right]=P\left[y_L\cdot v^\top x_L\le0\right]. \tag{16}$$

Recall that $x_L$ is sampled from the Gaussian distribution $\mathcal{N}(y_L\cdot\theta^*,\sigma^2I_d)$, where $y_L$ is sampled uniformly at random from $\{-1,+1\}$. Therefore $y_Lx_L$ follows the Gaussian distribution $\mathcal{N}(\theta^*,\sigma^2I_d)$. Hence,

$$P\left[y_L\cdot v^\top x_L\le0\right]=P_{(y_Lx_L)\sim\mathcal{N}(\theta^*,\sigma^2I_d)}\left[v^\top(y_Lx_L)\le0\right]=P_{g\sim\mathcal{N}(0,1)}\left[g\le-\frac{\theta^{*\top}v}{\sigma}\right]. \tag{17}$$

Moreover, from $\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|_2\le\tau_0$ we can get

$$\langle\theta^*,v\rangle\ge\sqrt{d}\left(1-\frac{\tau_0^2}{2}\right). \tag{18}$$

So, using the Gaussian tail bound $P_{g\sim\mathcal{N}(0,1)}[g\le-t]\le e^{-t^2/2}$ for all $t\ge0$, and combining with equation 16, equation 17 and equation 18, we have

$$P\left[\mathrm{sign}(y_L\cdot v^\top x_L)\,v^\top\theta^*\le0\right]\le\exp\left(-\frac{d\left(1-\frac{\tau_0^2}{2}\right)^2}{2\sigma^2}\right), \tag{19}$$

as stated in the lemma. ∎

Armed with Lemma 4 and Lemma 5, we now have a precise estimation of the direction of $\theta^*$ in the Gaussian mixture model. Then, we will show that this high-precision estimation can be translated into low robust risk. To achieve this, we need a lemma from Schmidt et al. (2018), which upper bounds the robust classification risk of a linear classifier in terms of its inner product with $\theta^*$.

###### Lemma 6.

(Lemma 20 in Schmidt et al. (2018)). Under the same setting as in Theorem 4, for any $\epsilon\ge0$ and for any unit vector $\hat{w}$ such that $\langle\hat{w},\theta^*\rangle-\epsilon\|\hat{w}\|_\infty^*\ge0$, where $\|\cdot\|_\infty^*=\|\cdot\|_1$ is the dual norm of $\|\cdot\|_\infty$, the linear classifier $f_{\hat{w}}$ has $\ell_\infty^\epsilon$-robust classification risk at most $\exp\left(-\frac{(\langle\hat{w},\theta^*\rangle-\epsilon\|\hat{w}\|_\infty^*)^2}{2\sigma^2}\right)$.

Lemma 6 guarantees that if we can estimate $\theta^*$ precisely, we can achieve small robust classification risk. Combined with Lemma 4 and Lemma 5, which provide such an estimation, we are now ready to prove the robust classification risk bound stated in Theorem 4. We actually prove a slightly more general theorem below with some extra parameters, and obtain Theorem 4 as a corollary.

###### Theorem 7.

Let $(x_L,y_L)$ be a labeled data point drawn from the $(\theta^*,\sigma)$-Gaussian mixture model with $\|\theta^*\|_2=\sqrt{d}$. Let $x_1,\dots,x_n$ be unlabeled data drawn from the same model with labels removed. Let $\tau_0$ be as stated in Lemma 4, and let $v$ be the normalized eigenvector (i.e. $\|v\|_2=1$) with respect to the maximal eigenvalue of the sample covariance matrix such that $\left\|v-\frac{\theta^*}{\sqrt{d}}\right\|_2\le\tau_0$ with probability at least $1-c_4e^{-c_5d}$. Let $\hat{w}=\mathrm{sign}(y_L\cdot v^\top x_L)\cdot v$. Then with probability at least $1-c_4e^{-c_5d}-\exp\left(-\frac{d(1-\tau_0^2/2)^2}{2\sigma^2}\right)$, the linear classifier $f_{\hat{w}}$ has $\ell_\infty^\epsilon$-robust classification risk at most $\beta$ when

$$\epsilon\le1-\frac{\tau_0^2}{2}-\frac{\sigma\sqrt{2\log\frac1\beta}}{\sqrt{d}}. \tag{20}$$
###### Proof.

By the choice of $v$, equation 18 holds, i.e.

$$\langle\theta^*,v\rangle\ge\sqrt{d}\left(1-\frac{\tau_0^2}{2}\right), \tag{21}$$

with probability at least $1-c_4e^{-c_5d}$.

Applying Lemma 5 to $v$ yields

$$\mathrm{sign}(y_L\cdot v^\top x_L)\,v^\top\theta^*>0, \tag{22}$$

with probability at least $1-\exp\left(-\frac{d(1-\tau_0^2/2)^2}{2\sigma^2}\right)$.

Notice that $\hat{w}=\mathrm{sign}(y_L\cdot v^\top x_L)\cdot v$. So by a union bound over the events in equation 21 and equation 22, we have

$$\langle\theta^*,\hat{w}\rangle=\mathrm{sign}(y_L\cdot v^\top x_L)\,\langle\theta^*,v\rangle\ge\sqrt{d}\left(1-\frac{\tau_0^2}{2}\right), \tag{23}$$

with probability at least $1-c_4e^{-c_5d}-\exp\left(-\frac{d(1-\tau_0^2/2)^2}{2\sigma^2}\right)$.

Since $\|\hat{w}\|_2=1$, we have

$$\|\hat{w}\|_\infty^*=\|\hat{w}\|_1\le\sqrt{d}. \tag{24}$$

By Lemma 6, the $\ell_\infty^\epsilon$-robust error is upper bounded by

$$\exp\left(-\frac{\left(\langle\hat{w},\theta^*\rangle-\epsilon\|\hat{w}\|_\infty^*\right)^2}{2\sigma^2}\right). \tag{25}$$

Combining this with equation 23, equation 24 and the assumption equation 20, we have

$$\langle\hat{w},\theta^*\rangle-\epsilon\|\hat{w}\|_\infty^*\ge\sqrt{d}\left(1-\frac{\tau_0^2}{2}\right)-\sqrt{d}\left(1-\frac{\tau_0^2}{2}-\frac{\sigma\sqrt{2\log\frac1\beta}}{\sqrt{d}}\right)=\sigma\sqrt{2\log\frac1\beta}. \tag{26}$$

Hence,

$$R_{B\text{-robust}}(f_{\hat{w}})\le\exp\left(-\frac{\left(\sigma\sqrt{2\log\frac1\beta}\right)^2}{2\sigma^2}\right)=\beta, \tag{27}$$

with probability at least $1-c_4e^{-c_5d}-\exp\left(-\frac{d(1-\tau_0^2/2)^2}{2\sigma^2}\right)$, as stated in the theorem. ∎