Robustness to Adversarial Perturbations
in Learning from Incomplete Data
Abstract
What is the role of unlabeled data in an inference problem, when the presumed underlying distribution is adversarially perturbed? To provide a concrete answer to this question, this paper unifies two major learning frameworks: Semi-Supervised Learning (SSL) and Distributionally Robust Learning (DRL). We develop a generalization theory for our framework based on a number of novel complexity measures, such as an adversarial extension of Rademacher complexity and its semi-supervised analogue. Moreover, our analysis is able to quantify the role of unlabeled data in the generalization under a more general condition compared to the existing theoretical works in SSL. Based on our framework, we also present a hybrid of DRL and EM algorithms that has a guaranteed convergence rate. When implemented with deep neural networks, our method shows a comparable performance to those of the state-of-the-art on a number of real-world benchmark datasets.
Emails: najafy@ce.sharif.edu, {ichi,masomatics,miyato}@preferred.jp
* Computer Engineering Department
Sharif University of Technology, Tehran, Iran
[2mm] Preferred Networks Inc., Tokyo, Japan
1 Introduction
Robustness to adversarial perturbations has become an essential feature in the design of modern classifiers, in particular of deep neural networks. This line of work originates from several empirical observations, such as [1] and [2], which show that deep networks are vulnerable to adversarial attacks in the input space. So far, plenty of novel methodologies have been introduced to compensate for this shortcoming: Adversarial Training (AT) [3], Virtual AT [4] and Distillation [5] are examples of promising methods in this area. The majority of these approaches seek an effective defense against a pointwise adversary, who shifts input data points toward adversarial directions independently of one another. However, as shown by [6], a distributional adversary who shifts the data distribution instead of the individual data points is provably more detrimental to learning. This suggests that one can greatly improve the robustness of a classifier by defending against a distributional adversary rather than a pointwise one. This motivation has led to the development of Distributionally Robust Learning (DRL) [7], which has attracted intensive research interest over the last few years [8, 9, 10, 11].
Despite all the advancements in supervised and unsupervised DRL, research tackling this problem from a semi-supervised angle is scarce [12]. Motivated by this fact, we set out to propose a distributionally robust method that can handle Semi-Supervised Learning (SSL) scenarios. Our proposed method is an extension of self-learning [13, 14, 15], and is compatible with all existing learning frameworks, such as neural networks. Intuitively, we first try to infer soft-labels for the unlabeled data, and then search for classification rules that demonstrate low sensitivity to perturbations around these soft-label distributions.
Parts of this paper can be considered a semi-supervised extension of the general supervised DRL developed in [9]. The computational complexity of our method, for a moderate label-set size, is only slightly above that of its fully-supervised rivals. To optimize our model, we design a Stochastic Gradient Descent (SGD)-based algorithm with a theoretically guaranteed convergence rate. In order to address the generalization of our framework, we introduce a set of novel complexity measures, such as the Adversarial Rademacher Complexity and the Minimal Supervision Ratio (MSR), each of which is defined w.r.t. the hypothesis set and the probability distribution that underlies the input data points. As long as the ratio of labeled samples in a dataset (supervision ratio) exceeds the MSR, the true adversarial risk can be bounded. Also, one can arbitrarily decrease the MSR by tuning the model parameters at the cost of loosening the generalization bound; this means our theoretical guarantees hold for all semi-supervised scenarios. We summarize the theoretical contributions of our work in Table LABEL:tab:summary.
We have also investigated the applicability of our method, denoted by SSDRL, via extensive computer experiments on datasets such as MNIST [16], SVHN [17], and CIFAR-10 [18]. When implemented with deep neural networks, SSDRL outperforms rivals such as Pseudo-Labeling (PL) [19] and the supervised DRL of [9] (simply denoted as DRL) on all the above-mentioned datasets. In addition, SSDRL demonstrates a performance comparable to that of Virtual Adversarial Training (VAT) [4] on MNIST and CIFAR-10, while outperforming VAT on SVHN.
The rest of the paper is organized as follows: Section 1.1 specifies the notations, and Section 1.2 reviews the related works. The basic idea behind the proposed method is outlined in Section 2.1, parameter optimization is described in Section 2.2 and generalization is analyzed in Section 2.3. Section 3 is devoted to experimental results. Finally, Section 4 concludes the paper.
1.1 Notations
We extend the notations used in [9]. Assume $\mathcal{X}$ to be an input space, $\Theta$ to be a parameter set, and $\ell:\mathcal{Z}\times\Theta\rightarrow\mathbb{R}$ a corresponding parametric loss function. The observation space $\mathcal{Z}$ can either be the feature space $\mathcal{X}$ in unsupervised scenarios, or the space of feature-label pairs, i.e., $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$, where $\mathcal{Y}$ denotes the set of labels. For simplicity, we only consider finite label-sets. By $\mathcal{P}(\mathcal{Z})$, we mean the set of all probability measures supported on $\mathcal{Z}$. Assume $c:\mathcal{Z}\times\mathcal{Z}\rightarrow\mathbb{R}_{\geq 0}$ to be a nonnegative and lower semi-continuous function, where $c(z,z)=0$ for all $z\in\mathcal{Z}$. We occasionally refer to $c$ as the transportation cost. The following definition formulates the Wasserstein distance between two distributions $P,Q\in\mathcal{P}(\mathcal{Z})$ w.r.t. $c$ [8]:
Definition 1 (Wasserstein distance).
The Wasserstein distance between two distributions $P$ and $Q$ in $\mathcal{P}(\mathcal{Z})$, with respect to the cost $c$, is defined as:
$$W_c(P,Q)\triangleq\inf_{\pi}\ \mathbb{E}_{(z,z')\sim\pi}\left[c(z,z')\right]\quad\text{s.t.}\quad \pi_1=P,\ \pi_2=Q, \tag{1}$$
where $\pi$ ranges over the set of all couplings between any two random variables supported on $\mathcal{Z}$. Also, $\pi_1$ and $\pi_2$ denote the marginals of $\pi$ taken w.r.t. the first and second variables, respectively.
$W_c(P,Q)$ measures the minimal cost of moving $P$ to $Q$, where the cost of moving one unit of mass from $z$ to $z'$ is given by $c(z,z')$. Also, for $\epsilon\geq 0$ and an arbitrary distribution $P\in\mathcal{P}(\mathcal{Z})$, we define an ambiguity set (or a Wasserstein ball) as
$$B_{\epsilon}(P)\triangleq\left\{Q\in\mathcal{P}(\mathcal{Z})\ :\ W_c(Q,P)\leq\epsilon\right\}. \tag{2}$$
The training dataset is shown by $D=\{z_i\}_{i=1}^{n}$, with samples drawn i.i.d. from a fixed (and unknown) distribution $P\in\mathcal{P}(\mathcal{Z})$, where $n$ is the dataset size. For a dataset $D$, let $\hat{P}_D$ be the following empirical measure:
$$\hat{P}_D\triangleq\frac{1}{n}\sum_{i=1}^{n}\delta_{z_i}, \tag{3}$$
where $\delta_z$ denotes the Dirac delta function at point $z$. Accordingly, $\mathbb{E}_{P}[\cdot]$ and $\mathbb{E}_{\hat{P}_D}[\cdot]$ represent the statistical and empirical expectation operators, respectively. For a distribution $P\in\mathcal{P}(\mathcal{X}\times\mathcal{Y})$, $P_X$ denotes the marginal distribution over $\mathcal{X}$, and $P_{Y|X}$ is the conditional distribution over labels given a feature vector $x\in\mathcal{X}$. For the sake of notational simplicity, for $z=(x,y)$ and a function $f$, the notations $f(z)$ and $f(x,y)$ are used interchangeably.
1.2 Background and Related Works
DRL attempts to minimize a worst-case risk against an adversary. The adversary has a limited budget $\epsilon$ to alter the data distribution $P$, in order to inflict the maximum possible damage. Here, $P$ can either be the true measure or the empirical one $\hat{P}_D$. The mentioned learning scenario can be modeled by a game between a learner and an adversary whose stationary point is the solution of a minimax problem [10]. Mathematically speaking, DRL can be formulated as [8, 11]:
$$\min_{\theta\in\Theta}\ \sup_{Q\in B_{\epsilon}(P)}\ \mathbb{E}_{Q}\left[\ell(z;\theta)\right]. \tag{4}$$
The Wasserstein metric has been widely used to quantify the strength of adversarial attacks [8, 9, 11, 12], thanks to (i) its fundamental relations to adversarial robustness [20] and (ii) its mathematically well-studied dual-form properties [11]. In [8], the authors reformulate DRL into a convex program for the particular case of logistic regression. Convergence and generalization analyses of DRL have been addressed in [9] in a general context, while the choice of a proper ambiguity-set size $\epsilon$ has been tackled in [21]. An interesting analysis of DRL methods with divergences is given in [10]. The sample complexity of DRL has been reviewed by [22] and [23]. We conjecture that there might be close relations between our complexity analysis in Section 2.3 and some of the results in the latter studies; however, a careful investigation of this issue goes beyond the scope of this paper.
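To make the dual-form viewpoint concrete, the following sketch illustrates the Lagrangian-relaxed inner maximization that appears throughout the Wasserstein DRL literature cited above: gradient ascent on $\ell(x';\theta)-\lambda\,c(x',x)$. The function names and the toy quadratic loss are hypothetical stand-ins; with a squared-distance cost, the penalized objective is strongly concave once $\lambda$ exceeds the curvature of the loss, so plain ascent finds the unique maximizer.

```python
import numpy as np

def adversarial_perturb(x, grad_loss, lam, steps=200, lr=0.05):
    """Gradient ascent on the penalized objective  loss(x') - lam * c(x', x),
    with c(x', x) = 0.5 * ||x' - x||^2.  For lam above the curvature of the
    loss, the objective is strongly concave and ascent finds its maximizer."""
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_loss(x_adv) - lam * (x_adv - x)  # ascend loss, pay transport
        x_adv = x_adv + lr * g
    return x_adv

# Toy loss 0.5 * ||x' - t||^2 is minimized at t, so the adversary pushes the
# point away from t while the transport penalty keeps it near the clean x.
t = np.array([3.0, -1.0])
grad_loss = lambda x: x - t
```

For this toy loss the maximizer has the closed form $(\lambda x - t)/(\lambda-1)$, which the ascent loop recovers to high accuracy, e.g. with `lam=3.0`.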
On the other hand, the recent abundance of unlabeled data has made SSL methods widely popular [4, 24]. See [14] for a comprehensive review of classical SSL approaches. Many robust SSL algorithms have been proposed so far [25, 26]; however, their notion of robustness is mostly different from the one considered in this paper. The author of [27] proposes a pessimistic SSL approach which is guaranteed to have a better, or at least equal, performance when it takes unlabeled data into account. We show that a special case of our method reduces to an adversarial extension of [27]. From a theoretical perspective, guarantees on the generalization of SSL can only be provided under certain assumptions on the choice of hypothesis set and the true data distribution [14, 15, 28]. For example, in [15] a compatibility function is introduced to restrict the relation between a model set and an input data distribution. Also, the author of [29] theoretically analyzes SSL under the so-called cluster assumption, in order to establish an improvement guarantee for situations where unlabeled data have been experimentally shown to be helpful. The fundamental reason behind such assumptions is that, in the absence of any prior knowledge about the information-theoretic relations between a feature vector and its corresponding label, unlabeled data are simply useless for classification. Moreover, improper assumptions about the relation of feature-label pairs, for example through unsuitable hypothesis sets, can actually degrade the classification accuracy in semi-supervised scenarios. In Section 2.3, we propose a novel compatibility function that works under a general setting and enables us to theoretically establish a generalization bound for our method.
Finally, the only work prior to this paper that also falls in the intersection of DRL and SSL is [12]. However, the method in [12] severely restricts the support of adversarially altered distributions, so that the adversary is left to choose from a set of delta spikes over only the labeled and augmented unlabeled samples. Thus, one cannot expect a considerable improvement in distributional robustness in this case, because it does not let the adversary freely perturb training data points toward arbitrary directions.
2 Proposed Framework
From now on, let us assume $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$. In a semi-supervised configuration, the dataset $D$ consists of two non-overlapping parts: $D_{\mathrm{L}}$ (labeled) and $D_{\mathrm{U}}$ (unlabeled). It should be noted that the learner can only observe partial information about each sample in $D_{\mathrm{U}}$, namely, its feature vector. Let us denote $\mathcal{I}_{\mathrm{L}}$ and $\mathcal{I}_{\mathrm{U}}$ as the index sets corresponding to the labeled and unlabeled data points, respectively. Thus, we have $D=D_{\mathrm{L}}\cup D_{\mathrm{U}}$. The unknown labels of the samples in $D_{\mathrm{U}}$ can be thought of as a set of corresponding random variables supported on $\mathcal{Y}$. DRL in (4) cannot readily extend to this partially labeled setting, since it needs complete access to all the feature-label pairs in $D$. In order to bypass this barrier, we need to somehow address the additional stochasticity that originates from incorporating unlabeled data in the learning procedure. The following definition is helpful for this aim:
Definition 2.
The consistent set of probability distributions with respect to a partially-labeled dataset $D$ is defined as
$$\Pi_D\triangleq\left\{\frac{1}{n}\left(\sum_{i\in\mathcal{I}_{\mathrm{L}}}\delta_{(x_i,y_i)}+\sum_{i\in\mathcal{I}_{\mathrm{U}}}\delta_{x_i}\otimes Q_i\right)\ :\ Q_i\in\mathcal{P}(\mathcal{Y})\right\},$$
where $n_l$ and $n_u$ are the sizes of $D_{\mathrm{L}}$ and $D_{\mathrm{U}}$, respectively. Also, $\mathcal{P}(\mathcal{Y})$ denotes the set of all conditional distributions over $\mathcal{Y}$, given values in $\mathcal{X}$.
Distributions in $\Pi_D$ have delta spikes over the supervised samples in $D_{\mathrm{L}}$. However, for samples in $D_{\mathrm{U}}$, singularity holds only over the feature vectors, while the conditional distributions over their corresponding labels are free to be anything, i.e., samples can even have soft-labels. Note that the empirical measure which corresponds to the true, complete dataset is also somewhere inside $\Pi_D$. Our aim is to choose a suitable measure from $\Pi_D$, and then use it in (4).
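As a data-structure illustration, a member of the consistent set can be represented by a row-stochastic label matrix: one-hot rows (delta spikes) for labeled samples and arbitrary soft-label rows for unlabeled ones. A minimal sketch, with hypothetical names:

```python
import numpy as np

def consistent_member(y_lab, n_unl, n_classes, soft_labels=None):
    """Return an (n_lab + n_unl, n_classes) row-stochastic matrix encoding
    one distribution from the consistent set: labeled rows are delta spikes
    (one-hot), unlabeled rows carry arbitrary soft-labels (uniform default)."""
    hard = np.eye(n_classes)[np.asarray(y_lab)]          # delta spikes
    if soft_labels is None:
        soft_labels = np.full((n_unl, n_classes), 1.0 / n_classes)
    return np.vstack([hard, np.asarray(soft_labels)])
```

Choosing a member of the consistent set then amounts to choosing the soft-label rows, which is exactly the degree of freedom the knowledge-transfer step exploits.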
2.1 SelfLearning: Optimism vs. Pessimism
We focus on a well-known family of SSL approaches, called self-learning [13, 30], and combine it with the framework of DRL. Methods built upon self-learning, such as the Expectation-Maximization (EM) algorithm [31], aim to transfer the knowledge from labeled samples to unlabeled ones through what is called pseudo-labeling. More precisely, a learner is (repeatedly) trained on the supervised portion of a dataset, and then employs its learned rules to assign pseudo-labels to the remaining unlabeled part. This procedure can assign either hard or soft labels to unlabeled features. This way, all the artificially labeled unsupervised samples can also join in the training of the learner in the final stages of learning. However, such methods are prone to overfitting if the information flow from $D_{\mathrm{L}}$ to $D_{\mathrm{U}}$ is not properly controlled. One way to overcome this issue is to use soft-labeling, which maintains a minimum level of uncertainty within the unlabeled data points. By combining the above arguments with the core idea of DRL in (4), we propose the following learning scheme:
$$\min_{\theta\in\Theta}\ \min_{\hat{P}\in\Pi_D}\ \sup_{Q\in B_{\epsilon}(\hat{P})}\ \mathbb{E}_{Q}\left[\ell(z;\theta)\right]\quad\text{s.t.}\quad\sum_{i\in\mathcal{I}_{\mathrm{U}}}H\!\left(Q_i\right)\geq\rho, \tag{5}$$
where $\rho\geq 0$ is a user-defined parameter, $\alpha\triangleq n_l/n$ is called the supervision ratio, and $H(\cdot)$ denotes the Shannon entropy.
Minimization over $\hat{P}\in\Pi_D$ acts as a knowledge-transfer module and finds the optimal empirical distribution in $\Pi_D$ for the model-selection module, i.e., the minimization over $\theta$. Again, note that distributions in $\Pi_D$ differ from each other only in the way they assign labels, or soft-labels, to the unlabeled data. According to (5), the learner has obviously chosen to be optimistic w.r.t. the hypothesis set and its corresponding loss function $\ell$. In other words, for any $\theta$, the learner is instructed to pick the labels that are more likely to reduce the loss function for the unlabeled data. This strategy forms the core idea of self-learning. Note that a pessimistic strategy suggests the opposite, i.e., to pick the less likely labels (those with large loss values) and not to trust the loss function. At the end of this section, we explain more about the pessimistic learner.
The entropy constraint in (5) prevents hard decisions for labels and promotes soft-labeling by bounding the Shannon entropy of the label-conditionals from below; the lower bound controls how soft the labels are. In the extreme case, one ends up with an adversarial version of the self-training in [14]. It should be noted that (5) considers all the data in $D$ for the purpose of distributional robustness. In fact, the adversary has access to all the feature vectors in $D$, which in turn forces the learner to experience adversarial attacks near all these points. This way, the learner is instructed to show less sensitivity near all training data, just as one may expect from a semi-supervised DRL.
We show that (5) can be efficiently solved given that some smoothness conditions hold for $\ell$ and $c$. Before that, Theorem 1 shows that the optimization corresponding to the knowledge-transfer module has an analytic solution, which implies the computational cost of (5) is only slightly higher than that of its fully-supervised counterparts, such as [9].
Definition 3.
For $\gamma\in\mathbb{R}$ and $\boldsymbol{v}=(v_1,\ldots,v_K)\in\mathbb{R}^K$, the soft-minimum of $\boldsymbol{v}$ with respect to $\gamma$ is defined as
$$\mathrm{softmin}^{\gamma}(\boldsymbol{v})\triangleq\frac{1}{\gamma}\log\left(\frac{1}{K}\sum_{k=1}^{K}e^{\gamma v_k}\right). \tag{6}$$
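A numerically stable implementation of this soft-minimum (under the normalization assumed in (6); a sketch, not the paper's code) makes its limiting regimes easy to verify:

```python
import numpy as np

def softmin(v, gamma):
    """Soft-minimum of the values in v with sharpness gamma, computed via a
    log-sum-exp trick for numerical stability.

    gamma -> -inf : min(v)   (optimistic label choice)
    gamma ->  0   : mean(v)
    gamma -> +inf : max(v)   (pessimistic label choice)
    """
    v = np.asarray(v, dtype=float)
    if gamma == 0.0:
        return v.mean()          # limit of the expression below as gamma -> 0
    a = gamma * v
    m = a.max()                  # subtract the max before exponentiating
    return (m + np.log(np.mean(np.exp(a - m)))) / gamma
```

For example, `softmin([1.0, 2.0, 4.0], -200.0)` is close to the minimum 1, while the same call with `gamma=200.0` approaches the maximum 4.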
Theorem 1 (Lagrangian Relaxation).
Assume a continuous loss $\ell$ and a continuous transportation cost $c$, parameters $\lambda\geq 0$ and $\gamma\in\mathbb{R}$, and a partially-labeled dataset $D$ with size $n$. For $\theta\in\Theta$, let us define the empirical Semi-Supervised Adversarial Risk (SSAR), denoted by $\hat{\mathcal{R}}_n(\theta;\lambda,\gamma)$, as
$$\hat{\mathcal{R}}_n(\theta;\lambda,\gamma)\triangleq\frac{1}{n}\left[\sum_{i\in\mathcal{I}_{\mathrm{L}}}\phi_{\lambda}(\theta;x_i,y_i)+\sum_{i\in\mathcal{I}_{\mathrm{U}}}\mathrm{softmin}^{\gamma}_{y\in\mathcal{Y}}\ \phi_{\lambda}(\theta;x_i,y)\right], \tag{7}$$
where $\phi_{\lambda}$, called the adversarial loss, is defined as
$$\phi_{\lambda}(\theta;x,y)\triangleq\sup_{x'\in\mathcal{X}}\ \ell\left((x',y);\theta\right)-\lambda\, c\left((x',y),(x,y)\right). \tag{8}$$
Let $\theta^{*}$ be a minimizer of (5) for a given set of parameters $\epsilon$ and $\rho$. Then, there exist $\lambda\geq 0$ and $\gamma\leq 0$ such that $\theta^{*}$ is also a minimizer of (7) with the same corresponding parameters.
The proof of Theorem 1 is given in Appendix C. Note that $\mathrm{softmin}^{\gamma}$ reduces to: (i) the $\min$ operator as $\gamma\rightarrow-\infty$, (ii) the average as $\gamma\rightarrow 0$, and (iii) the $\max$ operator as $\gamma\rightarrow+\infty$. Also, $\epsilon$ and $\lambda$ are nonnegative coupled parameters, and fixing either of them uniquely determines the other. Due to this one-to-one relation, one can adjust $\lambda$ (for example via cross-validation) instead of $\epsilon$. See [9] for a similar discussion of this issue.
A more subtle look at (7) shows that, in the dual context of the proposed optimization problem, one is free to also consider positive values for $\gamma$. Choosing $\gamma>0$ promotes those labels that produce larger adversarial loss values for the unlabeled data. In other words, the sign of $\gamma$ indicates optimism ($\gamma<0$) or pessimism ($\gamma>0$) during the label assignment. The choice between optimism and pessimism depends on the compatibility of the model set with the true distribution $P$. In Section 2.3, we show that enabling $\gamma$ to take values in all of $\mathbb{R}$, rather than only nonpositive values, is crucial for establishing a generalization bound for (7). In other words, for hypothesis sets that are very bad w.r.t. a particular input distribution, one must be pessimistic in order to generalize well. For some situations where pessimism in Semi-Supervised Learning can help, the reader may refer to [27].
2.2 Numerical Optimization
We propose a numerical optimization scheme for solving (7), or equivalently (5), which has a convergence guarantee. A hurdle in applying SGD to (7) is the fact that the value of $\phi_{\lambda}$ is itself the output of a maximization problem. Also, the loss function is not necessarily convex w.r.t. $\theta$, e.g., for neural networks, and hence achieving the global minimum of (7) is not feasible in general. The former problem has already been solved in supervised DRL, as long as we focus on a sufficiently small $\epsilon$ [9].
Lemma 1.
Consider the setting described in Theorem 1. Assume $\ell$ is differentiable w.r.t. $x$, and its gradient $\nabla_x\ell$ is $L$-Lipschitz all over $\mathcal{X}$, for some $L\geq 0$. Also, assume the transportation cost $c$ is strongly convex in its first argument. Then, if $\lambda>L$, the program
$$\max_{x'\in\mathcal{X}}\ \ell\left((x',y);\theta\right)-\lambda\, c\left((x',y),(x,y)\right) \tag{9}$$
becomes strongly concave for all $(x,y)\in\mathcal{Z}$ and $\theta\in\Theta$.
The proof of Lemma 1 is based on a Taylor expansion. By using a modified version of Danskin's theorem for minimax problems [32], followed by additional smoothness conditions on $\ell$, the gradient of $\hat{\mathcal{R}}_n$ in (7) w.r.t. $\theta$ can be computed efficiently as follows:
Lemma 2.
The proof of Lemma 2 is included in that of Theorem 2 and can be found in Appendix C. Given the formulation in Lemma 2, one can simply apply minibatch SGD to solve (5) via Algorithm 1 (the semi-supervised extension of [9]). The constants, such as the maximum number of iterations $T$, the inner-maximization tolerance, and the minibatch size, are all user-defined. Due to the strong concavity of (10) under the conditions of Lemma 1, the tolerance can be chosen arbitrarily small. Other parameters, such as $\lambda$ and $\gamma$, should be adjusted via cross-validation. The computational complexity of Algorithm 1 is no more than $|\mathcal{Y}|$ times that of [9], where the latter can only handle supervised data. (In scenarios where $|\mathcal{Y}|$ is very large, one can employ heuristic methods to reduce the set of possible labels for an unlabeled data sample and gain more efficiency at the expense of some degradation in performance.) Note that Algorithm 1 reduces to [9] in fully-supervised scenarios, and coincides with Pseudo-Labeling and the EM algorithm in the limits $\gamma\rightarrow-\infty$ and $\gamma\rightarrow 0$, respectively. The following theorem guarantees the convergence of Algorithm 1 to a local minimizer of (7).
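To make the structure of the objective concrete, the following self-contained sketch evaluates an SSAR-style risk on a toy prototype classifier: adversarial losses for labeled points, plus a soft-minimum over candidate labels for unlabeled points. The model, the quadratic loss, and all names are hypothetical stand-ins; only the labeled-plus-softmin structure mirrors (7).

```python
import numpy as np

def softmin(v, gamma):
    """Soft-minimum with sharpness gamma (log-sum-exp stabilized)."""
    v = np.asarray(v, dtype=float)
    if gamma == 0.0:
        return v.mean()
    a = gamma * v
    m = a.max()
    return (m + np.log(np.mean(np.exp(a - m)))) / gamma

def adv_loss(x, y, protos, lam, steps=200, lr=0.05):
    """Inner maximization of loss(x') - lam * c(x', x) via gradient ascent.
    Toy loss 0.5 * ||x' - protos[y]||^2 (distance to a class prototype),
    cost c = 0.5 * ||x' - x||^2; strongly concave whenever lam > 1."""
    x_adv = x.copy()
    for _ in range(steps):
        g = (x_adv - protos[y]) - lam * (x_adv - x)
        x_adv = x_adv + lr * g
    return (0.5 * np.sum((x_adv - protos[y]) ** 2)
            - 0.5 * lam * np.sum((x_adv - x) ** 2))

def ssar(protos, X_lab, y_lab, X_unl, lam, gamma):
    """SSAR-style empirical risk: adversarial losses on labeled points,
    soft-min over candidate labels on unlabeled points."""
    n = len(X_lab) + len(X_unl)
    total = sum(adv_loss(x, y, protos, lam) for x, y in zip(X_lab, y_lab))
    for x in X_unl:
        cands = [adv_loss(x, y, protos, lam) for y in range(len(protos))]
        total += softmin(cands, gamma)
    return total / n
```

By construction, the optimistic risk (large negative `gamma`) never exceeds the pessimistic one (large positive `gamma`), since per sample the soft-min lower-bounds the soft-max over labels.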
Theorem 2.
Assume the loss function $\ell$, the transportation cost $c$, and $\lambda$ satisfy the conditions of Lemma 2. Also, assume $\ell$ is differentiable w.r.t. both $x$ and $\theta$, with Lipschitz gradients, and that $\ell$ is bounded all over $\mathcal{Z}\times\Theta$. Let $\theta_0$ be an initial hypothesis, and denote by $\theta^{*}$ a local minimizer of (5) or (7). Assume the partially-labeled dataset $D$ includes $n$ i.i.d. samples drawn from $P$. Then, for the fixed step size
(12) 
the outputs of Algorithm 1, say $\theta_T$, after $T$ iterations satisfy the following inequality:
(13) 
where the constants only depend on the bound of $\ell$ and the Lipschitz constants of its gradients. The expectation in (13) is taken w.r.t. the dataset $D$ and the randomness of Algorithm 1.
The proof of Theorem 2, with explicit formulations of the constants, is given in Appendix C. Theorem 2 guarantees a convergence rate of $O(1/\sqrt{T})$ for Algorithm 1, if one neglects the inner-maximization tolerance. The presence of this tolerance term is necessary, since one cannot find the exact maximizer of (10) in finitely many steps; however, due to Lemma 1, it can be chosen infinitesimally small. According to Theorem 2, choosing a very large $|\gamma|$ reduces the convergence rate, since the derivative of $\mathrm{softmin}^{\gamma}$ starts to diverge. In fact, the limiting cases $\gamma\rightarrow\pm\infty$ represent two hard-label strategies with combinatoric structures. Convergence rates for such methods are not well-studied in the literature [14]. However, convergence guarantees (even without an explicit rate) can still be useful in these limiting cases. Theorem C.1 (Appendix) guarantees the convergence of Algorithm 1 in the hard-decision regimes, i.e., $\gamma\rightarrow\pm\infty$.
Another interesting question is: can we guarantee the convexity of (5) or (7), given that the loss function is twice differentiable and strictly convex w.r.t. $\theta$? Although the convexity of $\ell$ does not hold in many cases of interest, e.g., neural networks, a careful convexity analysis of our method is still important. Theorem C.2 (Appendix) provides a sufficient condition for the convexity of (7) when $\ell$ is strictly convex; the condition requires $\gamma$ to lie above a threshold given by a negative function of the remaining parameters. Note that as $\gamma\rightarrow-\infty$, the r.h.s. of (7) equals the minimum of a finite number of convex functions, which is not necessarily convex. On the other hand, a nonnegative value of $\gamma$ is always safe for this purpose, because the $\mathrm{softmin}^{\gamma}$ operator in this case preserves convexity.
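The dichotomy above can be checked numerically: the pointwise minimum of two convex quadratics violates the midpoint-convexity inequality, while their pointwise maximum never does. A tiny sketch:

```python
import numpy as np

f1 = lambda x: (x - 1.0) ** 2          # convex
f2 = lambda x: (x + 1.0) ** 2          # convex
g_min = lambda x: min(f1(x), f2(x))    # pointwise min: not convex in general
g_max = lambda x: max(f1(x), f2(x))    # pointwise max: always convex

def midpoint_convex(f, a, b, tol=1e-12):
    """Check the midpoint-convexity inequality f((a+b)/2) <= (f(a)+f(b))/2."""
    return f(0.5 * (a + b)) <= 0.5 * (f(a) + f(b)) + tol
```

Here `g_min(0) = 1` exceeds the average `(g_min(-1) + g_min(1)) / 2 = 0`, exhibiting the non-convexity, whereas `g_max` passes the midpoint test everywhere on a grid.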
2.3 Generalization Guarantees
This section addresses the statistical generalization of our method. More precisely, we intend to bound the true adversarial risk of the optimizer of the empirical risk in (7). To this aim, two major concerns need to be addressed: (i) we are training our model against an adversary, and (ii) our training dataset is only partially labeled. For (i), we introduce a novel complexity measure w.r.t. the hypothesis set and the data distribution $P$, which extends the existing generalization analyses to an adversarial setting. For (ii), we establish a novel compatibility condition relating the hypothesis set and the data distribution, which deals with the semi-supervised aspect of our work.
2.3.1 Adversarial Complexity Measures
The conventional Rademacher complexity, denoted by $\mathfrak{R}_n(\cdot)$, is a tool to measure the richness of a function set in classical learning theory [33]. In fact, this measure tells us how well a function set is able to learn noise, and thus how exposed it is to overfitting on small datasets. We give a novel adversarial extension of the Rademacher complexity, which also appears in our generalization bound at the end of this section. Moreover, we show that our complexity measure converges to zero as $n\rightarrow\infty$ for all function sets with a finite VC-dimension, regardless of the strength of the adversary. Before that, let us define the set of Monge maps as the following function set:
$$\mathcal{T}_{\epsilon}\triangleq\left\{T:\mathcal{Z}\rightarrow\mathcal{Z}\ \big|\ \mathbb{E}_{P}\left[c\left(T(z),z\right)\right]\leq\epsilon\right\}. \tag{14}$$
Then, the Semi-Supervised Monge (SSM) Rademacher complexity can be defined as follows:
Definition 4 (SSM Rademacher Complexity).
For $n\geq 1$, assume a function set $\mathcal{F}$ and a distribution $P$. Then, for a transportation cost $c$ and $\epsilon\geq 0$, let us define
where $\mathcal{T}_{\epsilon}$ represents the set of Monge maps, and $\boldsymbol{\sigma}$ indicates a vector of independent Rademacher random variables. Then, for a supervision ratio $\alpha$, the SSM Rademacher complexity of $\mathcal{F}$ is defined as
By setting $\epsilon=0$ and $\alpha=1$, the above definition simply reduces to the classical Rademacher complexity $\mathfrak{R}_n(\mathcal{F})$. We define a function set to be learnable if $\mathfrak{R}_n(\mathcal{F})$ decreases to zero as one increases $n$. Similarly, a function class is said to be adversarially learnable w.r.t. parameters $(\epsilon,\alpha)$ if
(15) 
The above definition is necessary when $\epsilon>0$, since learnability of a function class w.r.t. some distribution does not necessarily guarantee its adversarial learnability. In fact, an adversary can shift the data points and force the learner to experience regions of $\mathcal{Z}$ that cannot be accessed through $P$ alone. However, one may be concerned about how to numerically compute this measure in practice. The main difference between the classical and SSM Rademacher complexities is that in the latter, the input samples (or distribution) are altered by an adversary. Fortunately, several distribution-free bounds have already been established on the classical Rademacher complexity [33], which work for a variety of function classes of practical interest, e.g., classifiers with a bounded VC-dimension (including neural networks), polynomial regression tools with a bounded degree, etc.
We show that, given a distribution-free bound on the classical Rademacher complexity of $\mathcal{F}$, the SSM Rademacher complexity can be bounded as well. Mathematically speaking, assume there exists an asymptotically decreasing, distribution-free upper bound on the classical Rademacher complexity of $\mathcal{F}$. Then, the following holds for the SSM Rademacher complexity (Lemma D.1):
(16) 
where the r.h.s. of the above equation always converges to zero as $n\rightarrow\infty$. This covers the vast majority of classifier families used in real-world applications, e.g., neural networks, support vector machines, random forests, etc. As an example, consider the 0-1 loss for a family of classifiers with VC-dimension $d$. Then, due to Dudley's entropy bound and Haussler's upper bound [33], there exists a constant $C>0$ such that
$$\mathfrak{R}_n(\mathcal{F})\leq C\sqrt{\frac{d}{n}}, \tag{17}$$
regardless of the data distribution (again, see Lemma D.1). An interesting implication of this result is that, by assuming a bounded VC-dimension for a function set, one can guarantee its learnability even in an adversarial setting where the adversary can apply arbitrarily powerful attacks. However, if one is interested in a function set whose classical complexity measures cannot be bounded independently of the data distribution, not much can be said about adversarial learnability without directly computing the quantities in Definition 4.
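For intuition, the classical (non-adversarial) empirical Rademacher complexity of a small finite function class can be estimated by Monte Carlo over the sign vectors. The threshold-classifier class below is a hypothetical stand-in, used only to exhibit the decay with the sample size that the distribution-free bounds above predict:

```python
import numpy as np

def empirical_rademacher(F_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of a
    finite class.  F_values: (num_functions, n) array holding each function
    evaluated on the n sample points, with values in [-1, 1]."""
    rng = np.random.default_rng(seed)
    _, n = F_values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs
        total += np.max(F_values @ sigma) / n     # sup over the class
    return total / n_draws

def threshold_class(n):
    """All one-sided threshold classifiers on n sorted points."""
    F = np.ones((n + 1, n))
    for k in range(n + 1):
        F[k, :k] = -1.0
    return F
```

Running this on 20 versus 200 points shows the estimate shrinking roughly like the $\sqrt{d/n}$ rate above, since the class has an effectively fixed VC-dimension.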
2.3.2 Minimum Supervision Ratio
As discussed earlier in Section 1.2, generalization guarantees for SSL frameworks generally require a compatibility assumption on the hypothesis set and the data distribution $P$. In Appendix B (in particular, Definition B.4), a new compatibility function, called the Minimum Supervision Ratio (MSR), is introduced, which has the following functional form:
Intuitively, the MSR quantifies the strength of the information-theoretic relation between the marginal measure $P_X$ and the conditional $P_{Y|X}$. It also measures the suitability of the function set to learn such relations. As will be shown in Theorem 3, in order to bound the true risk when unlabeled data are involved, one needs the supervision ratio $\alpha$ to exceed the MSR. The first argument of the MSR denotes the pessimism of the learner, and the second specifies a safety margin for small-size datasets; the MSR decreases as the learner becomes more pessimistic, while it increases with the safety margin.
For negative values of $\gamma$ (optimistic learning), the MSR remains small as long as there exists a strong dependency between the distribution of feature vectors and the label conditionals. Such a dependency can be obtained, for example, via the cluster assumption. However, the MSR does not require such explicit assumptions, and is thus able to impose a compatibility condition on the model-distribution pair in a more fundamental way than existing works in SSL theory.
Additionally, some loss functions in the hypothesis set need to be capable of capturing such dependencies; e.g., at least one loss function should resemble the true negative log-likelihood. Conversely, the absence of any dependency between $P_X$ and $P_{Y|X}$, or the lack of sufficiently "good" loss functions, increases the MSR, which forces the learner to choose a large $\gamma$ (in the extreme case $\gamma\rightarrow\infty$) to be able to use the generalization bound of Theorem 3. Note that a large $\gamma$ increases the empirical loss, which then loosens the bound. This fact, however, should not be surprising, since improper usage of unlabeled data is known to harm generalization rather than improve it. Based on the previous discussions, Theorem 3 gives a generalization bound for our proposed framework in (7):
Theorem 3 (Generalization).
For a feature-label space $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$ and a parameter space $\Theta$, assume a set of continuous and bounded loss functions $\{\ell(\cdot;\theta):\theta\in\Theta\}$. For $\lambda\geq 0$ and $\gamma\in\mathbb{R}$, let
(18) 
where $c$ is a transportation cost. For a supervision ratio $\alpha$, assume a partially labeled dataset $D$ including $n$ i.i.d. samples drawn from $P$, where each label is observed with probability $\alpha$, independently. For $\delta\in(0,1)$, assume $\alpha$ satisfies the following condition:
(19) 
Then, with probability at least $1-\delta$, the following bound holds for all $\theta\in\Theta$:
(20) 
where $\hat{\theta}$ is the minimizer of the empirical risk in (7).
The proof of Theorem 3 is given in Appendix C. The condition in (19) can always be satisfied, based on Lemma B.2, as long as $n$ and $\gamma$ are sufficiently large and the function set is adversarially learnable. A strongly compatible pair of hypothesis set and data distribution encourages optimism, where the learner can choose small (generally negative) values of $\gamma$. However, in some situations increasing $\gamma$ might be necessary for (19) to hold; in fact, for a weakly compatible pair, $\gamma$ must be positive or even $+\infty$ (the latter always satisfies (19) regardless of the data distribution). Note that choosing a larger $\gamma$ increases the empirical risk, which then loosens our bound in (20). Interestingly, $\gamma\rightarrow+\infty$ coincides with the setting of [27], which makes the latter a special case of our analysis.
For a fixed dataset, the remaining parameters should be tuned to minimize the upper bound for a better generalization. On the other hand, for every $\alpha$ there exist parameter choices for which (20) becomes asymptotically tight. More importantly, the limiting cases of Theorem 3, i.e., $\epsilon=0$ and $\alpha=1$, provide us with a new generalization bound for non-robust SSL, and an already-established bound for supervised DRL in [9], respectively.
3 Experimental Results
In this section, we present our experimental results on a number of real-world datasets and compare our method with some state-of-the-art rival methodologies. We have chosen our loss function set as a particular family of Deep Neural Networks (DNNs). The architecture and other specifications of our DNNs are explained in detail in Appendix A. Throughout this section, our method is denoted by SSDRL, and the rival frameworks are Virtual Adversarial Training (VAT) [4], Pseudo-Labeling (PL) [19], and the fully-supervised DRL of [9], simply denoted as DRL. We have also implemented a fast version of SSDRL, called FSSDRL, where for each unlabeled training sample only a limited number of more favorable labels are considered in Algorithm 1. By more favorable labels, we refer to those labels that correspond to smaller non-robust loss values. As a result, FSSDRL runs much faster than SSDRL without much degradation in performance. Surprisingly, we found that FSSDRL often yields better performance in practice than SSDRL (see Appendix A for more details).
Figure 1 shows the misclassification rate on adversarial test examples attained by (9) (the same attack strategy as in [9]). Recall that $\lambda$ is the dual counterpart of the Wasserstein radius $\epsilon$ in (5); thus, $\lambda$ indirectly quantifies the strength of the adversarial attacks, as suggested by [9]. Results are depicted for the MNIST, SVHN and CIFAR-10 datasets. Figure 2 demonstrates the same procedure for adversarial examples generated by the Projected-Gradient Method (PGM) [34]; in this case, the error rate is depicted vs. PGM's strength of attack. For VAT and SSDRL, curves are shown for the choices of hyper-parameters that correspond to the lowest error rates on: clean examples, adversarial examples by [9], and adversarial examples by PGM, respectively. The hyper-parameter values, the choices of the transportation cost $c$ and the supervision ratio $\alpha$, and more details on the experiments can be found in Appendix A.
According to Figures 1 and 2, the proposed method is consistently superior to DRL and PL. Also, SSDRL outperforms VAT on the SVHN dataset regardless of the attack type, while it attains a comparable error rate on MNIST and CIFAR-10 according to Figures 1(a) and 2(c), respectively. The superiority over DRL highlights the fact that the exploitation of unlabeled data has improved the performance. However, SSDRL underperforms VAT on the MNIST and CIFAR-10 datasets if the order of attacks is reversed. According to Figure 2(a), the accuracy of PL degrades quite slowly as PGM's attack strength increases, although the loss values increase in Figure 5(a). This phenomenon is due to the fact that the adversarial directions for increasing the loss and the error rate are not correlated in this particular case.
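For reference, the PGM attack used above follows the standard projected-gradient recipe [34]; the sketch below is a generic $L_\infty$ version with a toy loss, not the exact configuration of our experiments (the step size and iteration count are illustrative assumptions).

```python
import numpy as np

def pgm_attack(x, grad_loss, eps, steps=40, lr=None):
    """Projected-Gradient Method: repeatedly ascend the loss via the sign of
    its gradient, projecting back onto the L-inf ball of radius eps around
    the clean input x."""
    lr = 2.5 * eps / steps if lr is None else lr
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + lr * np.sign(grad_loss(x_adv))  # signed ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)        # project onto the ball
    return x_adv
```

On a toy quadratic loss centered at a target point, the attack saturates at the corner of the $\epsilon$-ball pointing away from the target, as expected.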
Figure 3 depicts the error rate of DRL, SSDRL and FSSDRL as a function of , on adversarial examples in the MNIST dataset which are generated via the maximization problem (as described in [9]). Unlike Figures 1 and 2, we show the results for a range of values of and , in order to experimentally measure the sensitivity of our method to these hyperparameters. We also perform the same procedure for DRL for the sake of comparison. In particular, Figure 2(a) shows the comparison between DRL and SSDRL (with set to for SSDRL) for different values of . As is evident for the majority of cases (), SSDRL performs much better than DRL. This result indicates that employing the unlabeled data samples improves the generalization, which is highly favorable. Figure 2(b) depicts the comparison between FSSDRL and the original SSDRL (again is set to 1 for SSDRL). Figure 2(c) shows the effect of varying (with fixed to ). Surprisingly, the error rate experiences a drastic jump when one changes the sign of , which indicates a trade-off between optimism and pessimism. This result might be related to the fact that, for the MNIST dataset, neural networks learned on the labeled part of the dataset are sufficiently reliable, which encourages the user to employ an optimistic approach (i.e., setting a negative ) in order to improve the performance. However, as long as the sign of is fixed, the error rate does not show much sensitivity to the magnitude of , which can be noted as a strength of SSDRL.
Table LABEL:tab:semisup_wodataaug shows the test error rates on clean examples for FSSDRL, VAT, PL and DRL on the MNIST, SVHN and CIFAR-10 datasets. In fact, Table LABEL:tab:semisup_wodataaug characterizes the non-adversarial generalization that can be attained via distributional robustness. Again, FSSDRL outperforms both PL and DRL in almost all experimental settings. It also surpasses VAT on the SVHN dataset. FSSDRL underperforms VAT on MNIST and CIFAR-10; however, the difference in error rates remains small and the two methods perform comparably.
So far, the performance of SSDRL has been demonstrated w.r.t. its misclassification rate. We have also provided extensive experimental results on the value of the adversarial loss , which is crucial for the computation of our generalization bound in Section 2.3. Figure 4 shows the average adversarial loss, i.e. , for the different methods and datasets; is set to for SSDRL. Again, it should be noted that the adversarial examples used in Figures A.1 and 4 are generated via the procedure described in [9]. Figure 5 is the counterpart of Figure 4, where the attack strategy is replaced with the Projected Gradient Method (PGM); accordingly, adversarial loss values are depicted as a function of PGM's strength of attack, i.e. . As can be seen, SSDRL (or its fast version FSSDRL) is always among the few methods that attain the smallest adversarial loss values, regardless of the strength of attacks. This means that the proposed method can establish a reliable certificate of robustness for test samples via Theorem 3. Note that VAT, another method that performs well in practice in terms of error rate, does not come with any theoretical guarantees.
4 Conclusions
This paper investigates the application of distributionally robust learning to partially labeled datasets. The core idea is to take a well-known semi-supervised technique, known as self-learning, and make it robust to adversarial attacks. A novel framework, called SSDRL, has been proposed which builds upon an existing general scheme in supervised DRL. SSDRL encompasses many existing methods, such as Pseudo-Labeling (PL) and the EM algorithm, as special cases. The computational complexity of our method is shown to be only slightly higher than that of its supervised counterpart. We have also derived convergence and generalization guarantees for SSDRL, where for the latter a number of novel complexity measures have been introduced. We have proposed an adversarial extension of the Rademacher complexity from classical learning theory, and have shown that it can be bounded for a broad range of learning frameworks, including neural networks, that have a finite VC-dimension. Moreover, our theoretical analysis reveals a more fundamental way to quantify the role of unlabeled data in generalization through a new complexity measure called the Minimum Supervision Ratio (MSR). This is in contrast to many existing works that require more restrictive conditions, such as the cluster assumption, to be applicable. Extensive computer simulations on real-world benchmark datasets demonstrate a comparable-to-superior performance for our method relative to the state-of-the-art. In the future, one may attempt to improve the generalization bounds, for example by finding empirical estimates of the MSR function. Fitting a broader range of SSL methods into the core idea of Section 2.1 could be another promising research direction.
References
 [1] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in International Conference on Learning Representations, 2014.
 [2] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 427–436.
 [3] I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in International Conference on Learning Representations, 2015.
 [4] T. Miyato, S. Maeda, S. Ishii, and M. Koyama, "Virtual adversarial training: A regularization method for supervised and semi-supervised learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 [5] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, "Distillation as a defense to adversarial perturbations against deep neural networks," in 2016 IEEE Symposium on Security and Privacy (SP). IEEE, 2016, pp. 582–597.
 [6] M. Staib and S. Jegelka, “Distributionally robust deep learning as a generalization of adversarial training,” in NIPS workshop on Machine Learning and Computer Security, 2017.
 [7] A. Ben-Tal, D. Den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen, "Robust solutions of optimization problems affected by uncertain probabilities," Management Science, vol. 59, no. 2, pp. 341–357, 2013.
 [8] S. Shafieezadeh-Abadeh, P. M. Esfahani, and D. Kuhn, "Distributionally robust logistic regression," in Advances in Neural Information Processing Systems, 2015, pp. 1576–1584.
 [9] A. Sinha, H. Namkoong, and J. Duchi, “Certifiable distributional robustness with principled adversarial training,” in International Conference on Learning Representations, 2018.
 [10] W. Hu, G. Niu, I. Sato, and M. Sugiyama, “Does distributionally robust supervised learning give robust classifiers?” in International Conference on Machine Learning, 2018, pp. 2034–2042.
 [11] P. M. Esfahani and D. Kuhn, "Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations," Mathematical Programming, pp. 1–52, 2017.
 [12] J. Blanchet and Y. Kang, "Semi-supervised learning based on distributionally robust optimization," arXiv preprint arXiv:1702.08848, 2017.
 [13] Y. Grandvalet and Y. Bengio, "Semi-supervised learning by entropy minimization," in Advances in Neural Information Processing Systems, 2005, pp. 529–536.
 [14] X. Zhu, "Semi-supervised learning literature survey," Computer Sciences Department, University of Wisconsin-Madison, Tech. Rep., 2006.
 [15] O. Chapelle, B. Schölkopf, and A. Zien, Eds., Semi-Supervised Learning. MIT Press, 2006.
 [16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [17] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, vol. 2011, 2011, p. 5.
 [18] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
 [19] D.-H. Lee, "Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks," in ICML Workshop on Challenges in Representation Learning, vol. 2, 2013.
 [20] Z. Cranko, A. K. Menon, R. Nock, C. S. Ong, Z. Shi, and C. Walder, "Monge beats Bayes: Hardness results for adversarial training," arXiv preprint arXiv:1806.02977, 2018.
 [21] J. Duchi, P. Glynn, and H. Namkoong, “Statistics of robust optimization: A generalized empirical likelihood approach,” arXiv preprint arXiv:1610.03425, 2016.
 [22] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry, “Adversarially robust generalization requires more data,” in Advances in Neural Information Processing Systems, 2018, pp. 5014–5026.
 [23] D. Cullina, A. N. Bhagoji, and P. Mittal, “PAClearning in the presence of evasion adversaries,” arXiv preprint arXiv:1806.01471, 2018.
 [24] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. R. Salakhutdinov, "Good semi-supervised learning that requires a bad GAN," in Advances in Neural Information Processing Systems, 2017, pp. 6510–6520.
 [25] A. Balsubramani and Y. Freund, "Scalable semi-supervised aggregation of classifiers," in Advances in Neural Information Processing Systems, 2015, pp. 1351–1359.
 [26] Y. Yan, Z. Xu, I. W. Tsang, G. Long, and Y. Yang, "Robust semi-supervised learning through label aggregation," in AAAI Conference on Artificial Intelligence, 2016, pp. 2244–2250.
 [27] M. Loog, "Contrastive pessimistic likelihood estimation for semi-supervised classification," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 462–475, 2016.
 [28] A. Singh, R. Nowak, and X. Zhu, “Unlabeled data: Now it helps, now it doesn’t,” in Advances in Neural Information Processing Systems, 2009, pp. 1513–1520.
 [29] P. Rigollet, "Generalization error bounds in semi-supervised classification under the cluster assumption," Journal of Machine Learning Research, vol. 8, no. Jul, pp. 1369–1392, 2007.
 [30] M.-R. Amini and P. Gallinari, "Semi-supervised logistic regression," in European Conference on Artificial Intelligence, 2002, pp. 390–394.
 [31] S. Basu, A. Banerjee, and R. Mooney, "Semi-supervised clustering by seeding," in International Conference on Machine Learning, 2002, pp. 27–34.
 [32] J. F. Bonnans and A. Shapiro, Perturbation analysis of optimization problems. Springer Science & Business Media, 2013.
 [33] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of machine learning. MIT press, 2012.
 [34] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in International Conference on Learning Representations, 2018.
 [35] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," in International Conference on Learning Representations, 2016.
 [36] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
 [37] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015, pp. 448–456.
 [38] J. Blanchet and K. Murthy, “Quantifying distributional model risk via optimal transport,” Mathematics of Operations Research, 2019.
 [39] S. Ghadimi and G. Lan, “Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework,” SIAM Journal on Optimization, vol. 22, no. 4, pp. 1469–1492, 2012.
 [40] M. Dresher, "Games of strategy: Theory and applications," RAND Corporation, Santa Monica, CA, Tech. Rep., 1961.
 [41] C. Wu, C. Yang, H. Zhao, and J. Zhu, "On the convergence of the EM algorithm: A data-adaptive analysis," arXiv preprint arXiv:1611.00519, 2016.
 [42] K. S. Miller and B. Ross, An introduction to the fractional calculus and fractional differential equations. WileyInterscience, 1993.
Appendix A Additional Simulations and Experimental Settings
This section presents a number of additional experiments on the proposed method and further comparisons with rival methodologies. We also give an extensive description of the experimental settings used in our computer simulations.
A.1 Additional Simulations
Figure A.1 is a complete version of Figure 1 from Section 3, where the performances of SSDRL, fully-supervised DRL, PL and VAT are extensively investigated on three benchmark datasets, i.e. MNIST, SVHN and CIFAR-10. SSDRL and VAT have been tested with a variety of their corresponding hyperparameters and . Figure A.2 is the counterpart of Figure A.1, where the attack strategy is replaced with the Projected Gradient Method (PGM). Again, error rates are depicted as a function of PGM's attack strength, i.e. . Even though we considered wider variation in the hyperparameters, we did not observe any significant sensitivity caused by slight changes of the parameter values. As a result, one can say that DRL, SSDRL and VAT are all stable algorithms w.r.t. their parameter values, at least up to a certain level.
Figures A.3 and A.4 show the performance (again in terms of error rate) over clean examples from the different datasets, for SSDRL and VAT, respectively. In Figure A.3, different values of have been used for training and the test error rate is depicted as a function of ; also, is set to for SSDRL. Apparently, SSDRL (or FSSDRL) overfits during the training stage on MNIST for a particular range of parameters, and as a result its performance is degraded compared to that of DRL. However, SSDRL outperforms DRL (its fully-supervised counterpart) on the SVHN and CIFAR-10 datasets. Also, SSDRL and VAT have comparable performances on clean examples, specifically on the SVHN and CIFAR-10 datasets. This observation is in agreement with Table LABEL:tab:semisup_wodataaug.
A.2 Experimental Settings
In this part, we present a detailed description of the experimental settings used in Section 3. It should be noted that the majority of the settings used for the SVHN and CIFAR-10 datasets follow the same procedure as described in [4].
A.2.1 Real-world Datasets
Three main datasets have been used during the experiments: MNIST, SVHN and CIFAR-10.

The MNIST dataset consists of 28×28-pixel grayscale images of handwritten digits together with their corresponding labels. Each label is an integer from 0 to 9. The numbers of training and test examples in the dataset are 60,000 and 10,000, respectively.

The SVHN dataset consists of 32×32-pixel RGB images of street-view house numbers with their corresponding labels. Again, labels are integers ranging from 0 to 9. The numbers of training and test samples in the dataset are 73,257 and 26,032, respectively.

The CIFAR-10 dataset consists of 32×32-pixel RGB images from ten object categories, such as cars, trucks, planes, and animals. The numbers of training and test examples in the dataset are 50,000 and 10,000, respectively. For the CIFAR-10 dataset, we conducted Zero-phase Component Analysis (ZCA) as a preprocessing stage prior to the experiments.
A.2.2 Supervision Ratio and Training Datapoints
In order to create a dataset (training + testing) for the semi-supervised learning task in the paper, we selected a subset of size as the labeled dataset from MNIST and SVHN, while the size goes up to for CIFAR-10. The rest of the samples in the training partition are treated as unlabeled data. We repeated the experiment three times with different choices of labeled and unlabeled datapoints on each of the three datasets. For MNIST, a minibatch of size is used for both the labeled and unlabeled terms; for SVHN and CIFAR-10, a minibatch of size is used for the calculation of the labeled term, while a minibatch of size is employed for the unlabeled term during the implementation of each method. We trained each model with updates for MNIST and updates for SVHN and CIFAR-10, using the Adam optimizer. The initial learning rate of Adam is set to and then linearly decayed over the last updates for MNIST, and the last updates for SVHN and CIFAR-10.
As for the transportation cost function , we follow the work presented in [9] and employ the following cost function throughout all our experiments:
(A.1) 
where is an indicator function which returns one if its input condition holds and zero otherwise. It should be noted that this choice is solely for the sake of simplicity; as described before, every valid lower semi-continuous function is a legitimate choice for .
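Since the symbols of (A.1) did not survive extraction, the following is only a sketch of a cost of this general type, in the spirit of [9]: a feature-distance term plus an infinite indicator penalty that forbids the adversary from changing labels. The squared Euclidean norm and the function name are assumptions, not the paper's verbatim choice:

```python
import numpy as np

def transport_cost(x, y, x_adv, y_adv):
    """Sketch of a transportation cost c((x, y), (x', y')):
    squared Euclidean distance between feature vectors, plus an
    infinite penalty (the indicator term) when the labels differ,
    so mass can move in feature space but never across labels."""
    feature_cost = np.sum((np.asarray(x, dtype=float)
                           - np.asarray(x_adv, dtype=float)) ** 2)
    label_cost = np.inf if y != y_adv else 0.0
    return feature_cost + label_cost
```

For example, perturbing each of two pixels by 1 with the label kept fixed costs 2.0, while any label flip costs infinity, which effectively restricts the adversary to feature-space perturbations.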
Also, the pessimism/optimism trade-off parameter is always set to , except when stated otherwise. This choice yields a certain degree of optimism during the learning stage, motivated by the fact that Deep Neural Networks (DNN) have already proven to work well on all three of the above-mentioned datasets. Thus, trusting the learner to assign soft pseudo-labels to the unlabeled data is encouraged, which in turn indicates a negative value for .
A.2.3 Creating Adversarial Examples
To solve the inner maximization problem in (8) and (10) for each pair of , we simply apply gradient ascent with the following update rule:
(A.2) 
where the initial value is set to , and the ascent rate is defined as , where is a hyperparameter. We set to 1.0 for MNIST and CIFAR-10, and to for SVHN. During training, we repeat the update in (A.2) times for both the DRL and SSDRL methods; however, we repeat it times during evaluation.
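The gradient-ascent loop just described can be sketched as follows. Since the symbols of (A.2) were lost in extraction, `grad_phi` is assumed to return the gradient of the relaxed inner objective (the loss minus the scaled transportation cost) at the current iterate, and `alpha` plays the role of the ascent rate; both names are assumptions:

```python
import numpy as np

def inner_maximize(grad_phi, x0, steps, alpha):
    """Plain gradient ascent for the inner maximization, initialized at
    the clean sample x0 (a sketch of the update rule (A.2)).
    `grad_phi(x)` is a hypothetical callable giving the gradient of the
    Lagrangian-relaxed objective at x."""
    x = np.asarray(x0, dtype=float).copy()   # start from the clean input
    for _ in range(steps):
        x = x + alpha * grad_phi(x)          # ascend the relaxed objective
    return x
```

On a toy concave objective phi(x) = -(x - 1)^2 (gradient -2(x - 1)), the iterate converges to the maximizer x = 1, mimicking how the adversarial point climbs the relaxed objective during training.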
To generate adversarial examples via the Projected Gradient Method (PGM), we applied the following update rule, which has also been used in previous works in this area [9, 34]:
(A.3) 
where represents the projection operator onto an ball (w.r.t. the norm) centered at . Also, for an arbitrary vector , denotes its normalized version, which is mathematically defined as under the norm constraint. We define the length parameter as , where denotes the number of iterations of the update (A.3). Accordingly, we set .
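A minimal sketch of this PGM loop follows. The L2 ball, the step length eps/steps, and the function names are assumptions filling in the symbols lost from (A.3); the structure (normalized gradient step, then projection onto the ball around the clean input) follows the text:

```python
import numpy as np

def pgm_attack(grad_loss, x0, eps, steps):
    """Projected Gradient Method sketch (cf. update (A.3)):
    repeatedly take a step along the normalized loss gradient and
    project the iterate back onto the eps-ball (L2 here, an assumption)
    centered at the clean sample x0."""
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    step = eps / steps                       # assumed step-length schedule
    for _ in range(steps):
        g = grad_loss(x)
        g = g / (np.linalg.norm(g) + 1e-12)  # normalized ascent direction
        x = x + step * g
        delta = x - x0
        norm = np.linalg.norm(delta)
        if norm > eps:                       # projection onto the eps-ball
            x = x0 + delta * (eps / norm)
    return x
```

With a constant gradient along the first axis, the attack walks to the boundary of the eps-ball and stays there, which is the intended behavior of the projection step.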
A.2.4 Architecture of Deep Neural Networks
A class of Convolutional Neural Networks (CNN) has been used for the loss function set . Table LABEL:tab:cnn_models shows the CNN models used in our experiments. We use ELU [35] as the activation function for MNIST, and leaky ReLU (lReLU) [36] for SVHN and CIFAR-10. In the CNNs used for SVHN and CIFAR-10, all the convolutional layers as well as the fully connected (dense) layers are followed by batch normalization [37], except for the fully connected layer on CIFAR-10. The slopes of all lReLU units in the network are set to .
Appendix B Minimum Supervision Ratio: Definition and Implications
In this section, we present some complementary discussions with respect to our generalization bound in Section 2.3. In particular, the mathematical definition of, and the intuition behind, one of our proposed complexity measures, the Minimum Supervision Ratio (MSR), are explained in detail.
In order to better understand the intuition behind the proposed optimization programs in (5) and (7), it is necessary to investigate them in the asymptotic regime of . In this regard, this section provides a rigorous mathematical framework for studying semi-supervised learning in general (and its distributionally robust extension in particular), under the specific problem setting of this paper. We then provide conditions on the hypothesis set and the data-generating distribution under which unlabeled data can help the overall learning procedure. Final bounds on the performance improvement obtained by incorporating unlabeled samples (mostly from the generalization aspect) are given with full mathematical detail in Theorem 3 and its proof. Toward this goal, let us first make the following definition:
Definition B.1.
For a feature space and a finite label set , the conditional composition of a distribution with a conditional distribution through a supervision ratio of , denoted by , is defined as
(B.1) 
It can be easily verified that the following properties hold for the conditional composition of any two such distributions:
(B.2) 
where the first relation means that the marginal of the composition distribution w.r.t. (which is a measure supported on ) is the same as that of , while the second states that the conditional distribution over (given ) is a weighted mixture of the conditional distributions and .
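Since the display in (B.1) did not survive extraction, the following is one reconstruction consistent with the two properties just stated; it is an assumption about the paper's notation, not its verbatim formula. Writing $P^{Q}_{\gamma}$ for the conditional composition of $P$ with $Q$ under supervision ratio $\gamma$:

```latex
P^{Q}_{\gamma}(x,y) \;=\; \gamma\, P(x,y) \;+\; (1-\gamma)\, P_X(x)\, Q(y \mid x)
```

Its $X$-marginal is then $P_X$, and its label-conditional is the mixture $\gamma P(y \mid x) + (1-\gamma) Q(y \mid x)$, matching the two properties in (B.2).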
An interesting asymptotic property of a consistent distribution set (see Definition 2) is that, given that both the fully- and partially-observed samples in are i.i.d. samples generated from a single arbitrary distribution , the following relation holds almost surely w.r.t. :
(B.3) 
where the asymptotic equality in the above relation corresponds to a member-wise convergence between the two sets. Consequently, rewriting (7) in the asymptotic regime of gives the following equalities:
(B.4)  
The first term on the r.h.s. of (B.4) is proportional to the true risk, which we intend to bound. The second term models the asymptotic effect of unlabeled data for a fixed supervision ratio . The main question that we try to answer in this section can be stated intuitively as: under what conditions does the second term become approximately proportional to the true risk as well?
Before investigating this question in more theoretical detail, a closer look at the semi-supervised adversarial risk reveals that
(B.5) 
This fact implies that by decreasing , one can also decrease (at least in the majority of non-trivial scenarios). This point was previously mentioned in Section 2: optimism always results in lower empirical risks. But how does this strategy affect the true expected loss, i.e. ? On the other hand, moving toward guarantees that the learner is minimizing a legitimate upper-bound of the true risk, i.e. extreme pessimism; however, this also increases the empirical risk. Again, one could ask: is it really necessary to be so pessimistic?
In order to answer these questions, we introduce a new compatibility measure for a function set and distribution , called the Minimum Supervision Ratio (MSR), denoted by . We then show that as long as a particular inequality holds among parameters such as , and according to , one can guarantee minimizing a valid upper-bound for the true risk, while avoiding the extreme pessimism of [27] (and hence doing less harm to the empirical risk minimization). To do so, let us first introduce a number of useful additional tools:
Definition B.2.
Assume a function class and a distribution for a finite label set . For ease of notation, let for . Then, for and , is defined as
(B.6) 
As becomes evident in the subsequent arguments of this section, the functional introduced in Definition B.2, i.e. , plays an important role in relating the expected (or asymptotic) semi-supervised risk to the true (supervised) one. Mathematically speaking, enforcing for to remain non-negative guarantees that for any . This allows us to upper-bound the true risk with the value of computed for that particular . Notably, this condition can always be satisfied by choosing (extreme pessimism). This configuration, in the special non-robust case, coincides with the framework presented in [27].
Lemma B.1.
For any function set and distribution , we have for all .
Proof.
is a distribution over , and can thus be considered as a vector in a simplex, i.e. all of its components are non-negative and sum to one. The lemma's claim then follows from the fact that
(B.7) 
where denotes the inner product. More precisely, one can write:
The last inequality is a direct result of the fact that the quantity inside the expectation operator is non-negative. This completes the proof. ∎
However, we are more interested in those cases where can be bounded, or even negative, while remains non-negative in some regions of . The main difficulty is that the minimizer of (7) (the semi-supervised empirical risk) must fall in those regions as well; otherwise one cannot upper-bound the true risk by minimizing (7). Mathematically speaking, assume is as described in (7). Then, we are interested in whether there exists a non-empty subset of , say , such that: