Robustness to Adversarial Perturbations in Learning from Incomplete Data

Amir Najafi    Shin-ichi Maeda    Masanori Koyama    Takeru Miyato
Abstract

What is the role of unlabeled data in an inference problem, when the presumed underlying distribution is adversarially perturbed? To provide a concrete answer to this question, this paper unifies two major learning frameworks: Semi-Supervised Learning (SSL) and Distributionally Robust Learning (DRL). We develop a generalization theory for our framework based on a number of novel complexity measures, such as an adversarial extension of Rademacher complexity and its semi-supervised analogue. Moreover, our analysis is able to quantify the role of unlabeled data in the generalization under a more general condition compared to the existing theoretical works in SSL. Based on our framework, we also present a hybrid of DRL and EM algorithms that has a guaranteed convergence rate. When implemented with deep neural networks, our method shows performance comparable to the state of the art on a number of real-world benchmark datasets.

E-mails: najafy@ce.sharif.edu, {ichi,masomatics,miyato}@preferred.jp

* Computer Engineering Department

Sharif University of Technology, Tehran, Iran

Preferred Networks Inc., Tokyo, Japan

1 Introduction

Robustness to adversarial perturbations has become an essential feature in the design of modern classifiers —in particular, of deep neural networks. This phenomenon originates from several empirical observations, such as [1] and [2], which show deep networks are vulnerable to adversarial attacks in the input space. So far, plenty of novel methodologies have been introduced to compensate for this shortcoming. Adversarial Training (AT) [3], Virtual AT [4] or Distillation [5] are just examples of some promising methods in this area. The majority of these approaches seek an effective defense against a point-wise adversary, who shifts input data-points toward adversarial directions, in a separate manner. However, as shown by [6], a distributional adversary who can shift the data distribution instead of the input data-points is provably more detrimental to learning. This suggests that one can greatly improve the robustness of a classifier by improving its defense against a distributional adversary rather than a point-wise one. This motivation has led to the development of Distributionally Robust Learning (DRL) [7], which has attracted intensive research interest over the last few years [8, 9, 10, 11].

Despite all the advances in supervised and unsupervised DRL, research tackling this problem from a semi-supervised angle is slim to none [12]. Motivated by this fact, we set out to propose a distributionally robust method that can handle Semi-Supervised Learning (SSL) scenarios. Our proposed method is an extension of self-learning [13, 14, 15], and can cope with all existing learning frameworks, such as neural networks. Intuitively, we first try to infer soft-labels for the unlabeled data, and then search for suitable classification rules that demonstrate low sensitivity to perturbation around these soft-label distributions.

Parts of this paper can be considered as a semi-supervised extension of the general supervised DRL developed in [9]. The computational complexity of our method, for a moderate label-set size, is only slightly above that of its fully-supervised rivals. To optimize our model, we design a Stochastic Gradient Descent (SGD)-based algorithm with a theoretically-guaranteed convergence rate. In order to address the generalization of our framework, we introduce a set of novel complexity measures such as Adversarial Rademacher Complexity and Minimal Supervision Ratio (MSR), each of which is defined w.r.t. the hypothesis set and the probability distribution that underlies input data-points. As long as the ratio of the labeled samples in a dataset (supervision ratio) exceeds MSR, true adversarial risk can be bounded. Also, one can arbitrarily decrease MSR by tuning the model parameters at the cost of increasing the generalization bound; this means our theoretical guarantees hold for all semi-supervised scenarios. We summarize the theoretical contribution of our work in the summary table.

We have also investigated the applicability of our method, denoted by SSDRL, via extensive computer experiments on datasets such as MNIST [16], SVHN [17], and CIFAR-10 [18]. When implemented with deep neural networks, SSDRL outperforms rivals such as Pseudo-Labeling (PL) [19] and the supervised DRL in [9] (simply denoted as DRL) on all the above-mentioned datasets. In addition, SSDRL demonstrates a comparable performance to that of Virtual Adversarial Training (VAT) [4] on MNIST and CIFAR-10, while outperforming VAT on SVHN.

The rest of the paper is organized as follows: Section 1.1 specifies the notations, and Section 1.2 reviews the related works. The basic idea behind the proposed method is outlined in Section 2.1, parameter optimization is described in Section 2.2 and generalization is analyzed in Section 2.3. Section 3 is devoted to experimental results. Finally, Section 4 concludes the paper.


1.1 Notations

We extend the notations used in [9]. Assume $\mathcal{Z}$ to be an input space, $\Theta$ to be a parameter set, and $\ell:\mathcal{Z}\times\Theta\to\mathbb{R}$ a corresponding parametric loss function. Observation space $\mathcal{Z}$ can either be the feature space $\mathcal{X}$ in unsupervised scenarios, or the space of feature-label pairs, i.e., $\mathcal{Z}\triangleq\mathcal{X}\times\mathcal{Y}$, where $\mathcal{Y}$ denotes the set of labels. For simplicity, we only consider finite label-sets. By $M(\mathcal{Z})$, we mean the set of all probability measures supported on $\mathcal{Z}$. Assume $c(\cdot,\cdot)$, defined on $\mathcal{Z}\times\mathcal{Z}$, to be a non-negative and lower semi-continuous function, where $c(z,z)=0$ for all $z\in\mathcal{Z}$. We occasionally refer to $c$ as transportation cost. The following definition formulates the Wasserstein distance between two distributions $P,Q\in M(\mathcal{Z})$, w.r.t. $c$ [8]:

Definition 1 (Wasserstein distance).

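The Wasserstein distance between two distributions $P$ and $Q$ in $M(\mathcal{Z})$, with respect to cost $c$, is defined as:

$$W_c(P,Q)\triangleq\inf_{\mu\in M(\mathcal{Z}^2)}\int c(z,z')\,\mathrm{d}\mu(z,z')\quad\text{subject to}\quad\mu(\cdot,\mathcal{Z})=P,\ \ \mu(\mathcal{Z},\cdot)=Q,\tag{1}$$

where $M(\mathcal{Z}^2)$ represents the set of all couplings between any two random variables supported on $\mathcal{Z}$. Also, $\mu(\cdot,\mathcal{Z})$ and $\mu(\mathcal{Z},\cdot)$ denote the marginals of $\mu$ taken w.r.t. the first and second variables, respectively.

$W_c(P,Q)$ measures the minimal cost of moving $P$ to $Q$, where the cost of moving one unit of mass from $z$ to $z'$ is given by $c(z,z')$. Also, for $\epsilon\geq 0$ and an arbitrary distribution $Q\in M(\mathcal{Z})$, we define an $\epsilon$-ambiguity set (or a Wasserstein $\epsilon$-ball) as

$$B_\epsilon(Q)\triangleq\left\{P\in M(\mathcal{Z})\ \middle|\ W_c(P,Q)\leq\epsilon\right\}.\tag{2}$$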

Training dataset is shown by $D=\{Z_i\}_{i=1}^{n}$, with samples being drawn i.i.d. from a fixed (and unknown) distribution $P_0\in M(\mathcal{Z})$, where $n$ is the dataset size. For a dataset $D$, let $\hat{P}_D$ be the following empirical measure:

$$\hat{P}_D\triangleq\frac{1}{n}\sum_{i=1}^{n}\delta_{Z_i},\tag{3}$$

where $\delta_z$ denotes the Dirac delta function at point $z\in\mathcal{Z}$. Accordingly, $\mathbb{E}_{P}$ and $\hat{\mathbb{E}}_{D}$ represent the statistical and empirical expectation operators, respectively. For a distribution $P\in M(\mathcal{Z})$, $P_X$ denotes the marginal distribution over $\mathcal{X}$, and $P_{|X}$ is the conditional distribution over labels given feature vector $X$. For the sake of simplicity in notations, for $z=(X,y)$ and a function $f$, the notations $f(z)$ and $f(X,y)$ have been used interchangeably.
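As a concrete instance of Definition 1 and the empirical measure in (3), the following minimal sketch computes the transport cost between two empirical measures on the real line that have the same number of atoms. It assumes the cost $c(z,z')=|z-z'|^p$ with $p\geq 1$, for which matching the sorted samples is a known optimal coupling. The function names and the restriction to one dimension are illustrative assumptions and are not taken from the paper.

```python
def empirical_wasserstein_1d(samples_p, samples_q, p=1):
    """Transport cost W_c between two equal-size empirical measures on the real line.

    Assumes c(z, z') = |z - z'|**p with p >= 1 (illustrative choice).
    """
    if len(samples_p) != len(samples_q):
        raise ValueError("this sketch assumes the same number of atoms in both measures")
    if not samples_p:
        raise ValueError("at least one sample is required")
    if p < 1:
        raise ValueError("sorted matching is only optimal for convex costs (p >= 1)")
    n = len(samples_p)
    # For a convex cost in one dimension, pairing the order statistics is optimal.
    matched = zip(sorted(samples_p), sorted(samples_q))
    return sum(abs(a - b) ** p for a, b in matched) / n


def in_wasserstein_ball(samples_p, samples_q, eps, p=1):
    """Check whether the empirical measure of samples_p lies in B_eps around samples_q."""
    return empirical_wasserstein_1d(samples_p, samples_q, p) <= eps
```

For example, shifting every atom of a two-point measure by one unit costs exactly one unit of transport under $p=1$, so the shifted measure lies inside the $\epsilon$-ball only if $\epsilon\geq 1$.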

1.2 Background and Related Works

DRL attempts to minimize a worst-case risk against an adversary. The adversary has a limited budget to alter the data distribution $Q$, in order to inflict the maximum possible damage. Here, $Q$ can either be the true measure $P_0$ or the empirical one $\hat{P}_D$. This learning scenario can be modeled by a game between a learner and an adversary whose stationary point is the solution of a minimax problem [10]. Mathematically speaking, DRL can be formulated as [8, 11]:

$$\inf_{\theta\in\Theta}\ \sup_{P\in B_\epsilon(Q)}\ \mathbb{E}_{P}\left\{\ell(Z;\theta)\right\}.\tag{4}$$

The Wasserstein metric has been widely used to quantify the strength of adversarial attacks [8, 9, 11, 12], thanks to (i) its fundamental relations to adversarial robustness [20] and (ii) its mathematically well-studied dual-form properties [11]. In [8], the authors have reformulated DRL into a convex program for the particular case of logistic regression. Convergence and generalization analysis of DRL have been addressed in [9] in a general context, while the problem of finding a proper ambiguity set size, i.e. $\epsilon$, has been tackled in [21]. An interesting analysis of DRL methods with $f$-divergences is given in [10]. Sample complexity of DRL has been reviewed by [22] and [23]. We conjecture that there might be close relations between our complexity analysis in Section 2.3 and some of the results in the latter studies. However, a careful investigation regarding this issue goes beyond the scope of this paper.

On the other hand, the recent abundance of unlabeled data has made SSL methods widely popular [4, 24]. See [14] for a comprehensive review on classical SSL approaches. Many robust SSL algorithms have been proposed so far [25, 26]; however, their notion of robustness is mostly different from the one considered in this paper. In [27], the author has proposed a pessimistic SSL approach which is guaranteed to have a better, or at least equal, performance when it takes unlabeled data into account. We show that a special case of our method reduces to an adversarial extension of [27]. From a theoretical perspective, guarantees on the generalization of SSL can only be provided under certain assumptions on the choice of hypothesis set and the true data distribution [14, 15, 28]. For example, in [15] a compatibility function is introduced to restrict the relation between a model set and an input data distribution. Also, the author of [29] has theoretically analyzed SSL under the so-called cluster assumption, in order to establish an improvement guarantee for a situation where unlabeled data had been experimentally shown to be helpful. The fundamental reason behind such assumptions is that the lack of any prior knowledge about the information-theoretic relations between a feature vector and its corresponding label simply makes unlabeled data useless for classification. Moreover, improper assumptions about the relation of feature-label pairs, for example by employing unsuitable hypothesis sets, could actually degrade the classification accuracy in semi-supervised scenarios. In Section 2.3, we propose a novel compatibility function that works under a general setting and enables us to theoretically establish a generalization bound for our method.

Finally, the only work prior to this paper that also falls in the intersection of DRL and SSL is [12]. However, the method in [12] severely restricts the support of adversarially-altered distributions, so that the adversary is left to choose from a set of delta-spikes over only labeled and augmented unlabeled samples. Thus, one cannot expect a considerable improvement in the distributional robustness in this case, because it does not let the adversary freely perturb training data-points toward arbitrary directions.

2 Proposed Framework

From now on, let us assume $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$. In a semi-supervised configuration, dataset $D$ consists of two non-overlapping parts: $D_l$ (labeled) and $D_{ul}$ (unlabeled). It should be noted that the learner can only observe partial information about each sample in $D_{ul}$, namely, its feature vector. Let us denote $I_l$ and $I_{ul}$ as the index sets corresponding to the labeled and unlabeled data points, respectively. Thus, we have $I_l\cap I_{ul}=\emptyset$ and $I_l\cup I_{ul}=\{1,\ldots,n\}$. The unknown labels of the samples in $D_{ul}$ can be thought of as a set of corresponding random variables supported on $\mathcal{Y}$. DRL in (4) cannot readily extend to this partially labeled setting, since it needs complete access to all the feature-label pairs in $D$. In order to bypass this barrier, we need to somehow address the additional stochasticity that originates from incorporating unlabeled data in the learning procedure. The following definition can be helpful for this aim:

Definition 2.

The consistent set of probability distributions with respect to a partially-labeled dataset is defined as

where and are the sizes of and , respectively. Also, denotes the set of all conditional distributions over , given values in .

Distributions in $\hat{\mathcal{P}}(D)$ have delta-spikes over supervised samples in $D_l$. However, for samples in $D_{ul}$, singularity is only over the feature vectors while conditional distributions over their corresponding labels are free to be anything, i.e., samples can even have soft-labels. Note that the empirical measure which corresponds to the true complete dataset is also somewhere inside $\hat{\mathcal{P}}(D)$. Our aim is to choose a suitable measure from $\hat{\mathcal{P}}(D)$, and then use it for (4).
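The structure described in the paragraph above can be illustrated with a minimal sketch that builds one member of the consistent set from a partially-labeled dataset. Every sample carries mass $1/n$; labeled samples get a one-hot label distribution, and unlabeled samples get an arbitrary distribution over labels. The tuple representation, the function name, and the uniform default for soft labels are illustrative assumptions rather than the paper's notation.

```python
import math


def consistent_measure(labeled, unlabeled_features, num_labels, soft_labels=None):
    """Return atoms (feature, label_distribution, mass) of one consistent measure.

    labeled: list of (feature, label_index) pairs.
    unlabeled_features: list of features whose labels are unobserved.
    soft_labels: optional list of label distributions, one per unlabeled feature.
    """
    n = len(labeled) + len(unlabeled_features)
    if n == 0:
        raise ValueError("the dataset must contain at least one sample")
    atoms = []
    for x, y in labeled:
        if not 0 <= y < num_labels:
            raise ValueError("label index out of range")
        # Labeled sample: delta-spike on the observed feature-label pair.
        one_hot = [1.0 if k == y else 0.0 for k in range(num_labels)]
        atoms.append((x, one_hot, 1.0 / n))
    for i, x in enumerate(unlabeled_features):
        # Unlabeled sample: the feature is pinned, the label conditional is free.
        if soft_labels is None:
            cond = [1.0 / num_labels] * num_labels
        else:
            cond = list(soft_labels[i])
        if len(cond) != num_labels or min(cond) < 0 or not math.isclose(sum(cond), 1.0):
            raise ValueError("each soft label must be a distribution over the labels")
        atoms.append((x, cond, 1.0 / n))
    return atoms
```

Two calls with the same data but different `soft_labels` give two different members of the consistent set, which differ only in how they label the unlabeled features.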

2.1 Self-Learning: Optimism vs. Pessimism

We focus on a well-known family of SSL approaches, called self-learning [13, 30], and then combine it with the framework of DRL. Methods that are built upon self-learning, such as the Expectation-Maximization (EM) algorithm [31], aim to transfer the knowledge from labeled samples to unlabeled ones through what is called pseudo-labeling. More precisely, a learner is (repeatedly) trained on the supervised portion of a dataset, and then employs its learned rules to assign pseudo-labels to the remaining unlabeled part. This procedure can assign either hard or soft labels to unlabeled features. This way, all these artificially-labeled unsupervised samples can also join in for the training of the learner in the final stages of learning. However, such methods are prone to over-fitting if the information flow from $D_l$ to $D_{ul}$ is not properly controlled. One way to overcome this issue is to use soft-labeling, which maintains a minimum level of uncertainty within the unlabeled data points. By combining the above arguments with the core idea of DRL in (4), we propose the following learning scheme:

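$$\inf_{\theta\in\Theta}\ \inf_{S\in\hat{\mathcal{P}}(D)}\left\{\sup_{P\in B_\epsilon(S)}\mathbb{E}_{P}\left\{\ell(X,y;\theta)\right\}+\left(\frac{1-\eta}{\lambda}\right)\hat{\mathbb{E}}_{D_{ul}}\left\{H\left(S_{|X}\right)\right\}\right\},\tag{5}$$

where $\lambda$ is a user-defined parameter, $\eta$ is called the supervision ratio (the fraction of labeled samples in $D$), and $H(\cdot)$ denotes the Shannon entropy. For now, let us assume $\lambda<0$.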

Minimization over $S$ acts as a knowledge transfer module and finds the optimal empirical distribution in $\hat{\mathcal{P}}(D)$ for the model selection module, i.e. the minimization over $\theta$. Again, note that distributions in $\hat{\mathcal{P}}(D)$ differ from each other only in the way they assign labels, or soft-labels, to the unlabeled data. According to (5), the learner has chosen to be optimistic w.r.t. the hypothesis set $\Theta$ and its corresponding loss function $\ell$. In other words, for any $\theta\in\Theta$, the learner is instructed to pick the labels that are more likely to reduce the loss function for the unlabeled data. This strategy forms the core idea of self-learning. Note that a pessimistic strategy suggests the opposite, i.e. to pick the less likely labels (those with large loss values) and not to trust the loss function. At the end of this section, we explain more about the pessimistic learner.

The negative regularization term in (5) prevents hard decisions for labels and promotes soft-labeling by bounding the Shannon entropy of label-conditionals from below. A smaller $|\lambda|$ gives softer labels. In the extreme case, choosing $\lambda=-\infty$ ends up in an adversarial version of the self-training in [14]. It should be noted that (5) considers all the data in $D$ for the purpose of distributional robustness. In fact, the adversary has access to all the feature vectors in $D$, which in turn forces the learner to experience adversarial attacks near all these points. This way, the learner is instructed to show less sensitivity near all training data, just as one may expect from a semi-supervised DRL.

We show that (5) can be efficiently solved given that some smoothness conditions hold for $\ell$ and $c$. Before that, Theorem 1 shows that the optimization corresponding to the knowledge transfer module has an analytic solution, which implies the computational cost of (5) is only slightly higher than that of its fully-supervised counterparts, such as [9].

Definition 3.

For and , soft-minimum of with respect to is defined as

 (6)
Theorem 1 (Lagrangian-Relaxation).

Assume a continuous loss and continuous , parameters and , and a partially-labeled dataset with size . For , let us define the empirical Semi-Supervised Adversarial Risk (SSAR), denoted by , as

where , called adversarial loss, is defined as

$$\phi_\gamma(X,y;\theta)\triangleq\sup_{z'\in\mathcal{Z}}\ \ell(z';\theta)-\gamma c\left(z',(X,y)\right).\tag{8}$$

Let be a minimizer of (5) for a given set of parameters and . Then, there exists such that is also a minimizer of (7) with the same corresponding parameters and .

Proof of Theorem 1 is given in Appendix C. Note that the soft-minimum reduces to: (i) the $\min$ operator for $\lambda\to-\infty$, (ii) the average for $\lambda=0$, and (iii) the $\max$ operator for $\lambda\to+\infty$. Also, $\gamma$ and $\epsilon$ are non-negative dual parameters and fixing either of them uniquely determines the other one. Due to this one-to-one relation, one can adjust $\gamma$ (for example via cross-validation), instead of $\epsilon$. See [9] for a similar discussion about this issue.
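The three limiting regimes of the soft-minimum can be checked numerically with the minimal sketch below. It assumes the log-mean-exp form $\frac{1}{\lambda}\log\big(\frac{1}{|\mathcal{Y}|}\sum_{y}e^{\lambda q(y)}\big)$, chosen because it matches the limiting cases of the soft-minimum ($\min$, average, $\max$); this form is an assumption and may differ in detail from (6). The function name `soft_min` is hypothetical.

```python
import math


def soft_min(values, lam):
    """Assumed soft-minimum: (1/lam) * log(mean(exp(lam * q))) over the label values q."""
    values = list(values)
    if not values:
        raise ValueError("at least one value is required")
    if lam == float("-inf"):
        return min(values)            # hard, optimistic choice
    if lam == float("inf"):
        return max(values)            # hard, pessimistic choice
    if lam == 0:
        return sum(values) / len(values)  # limit of the formula as lam -> 0
    # Numerically stable log-mean-exp: factor out the largest exponent.
    shift = max(lam * v for v in values)
    total = sum(math.exp(lam * v - shift) for v in values)
    return (shift + math.log(total / len(values))) / lam
```

Under this assumed form the value moves monotonically from the minimum to the maximum of the inputs as $\lambda$ grows, so negative $\lambda$ favors labels with small loss and positive $\lambda$ favors labels with large loss.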

A closer look at (7) shows that in the dual context of the proposed optimization problem, one is free to also consider positive values for $\lambda$. Choosing $\lambda>0$ promotes those labels that produce larger adversarial loss values for unlabeled data. In other words, the sign of $\lambda$ indicates optimism ($\lambda<0$), or pessimism ($\lambda>0$) during the label assignment. The choice between optimism and pessimism depends on the compatibility of the model set $\Theta$ with the true distribution $P_0$. In Section 2.3, we show that allowing $\lambda$ to take positive as well as negative values is crucial for establishing a generalization bound for (7). In other words, for very bad hypothesis sets w.r.t. a particular input distribution, one must choose to be pessimistic to be able to generalize well. For some situations where pessimism in Semi-Supervised Learning can help, the reader may refer to [27].

2.2 Numerical Optimization

We propose a numerical optimization scheme for solving (7) or equivalently (5), which has a convergence guarantee. A hurdle in applying SGD to (7) is the fact that the value of $\phi_\gamma$ is itself the output of a maximization problem. Also, the loss function $\ell$ is not necessarily convex w.r.t. $\theta$, e.g. for neural networks, and hence achieving the global minimum of (7) is not feasible in general. The former problem has already been solved in supervised DRL, as long as we focus on a sufficiently small $\epsilon$ [9].

Lemma 1.

Consider the setting described in Theorem 1. Assume is differentiable w.r.t. , and is -Lipschitz all over , for some . Also, assume transportation cost is -strongly convex in its first argument. Then, if , the program

$$\sup_{z'\in\mathcal{Z}}\ \ell(z';\theta)-\gamma c\left(z',(X,y)\right)\tag{9}$$

becomes -strongly concave for all .
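Lemma 1 is what makes the inner supremum in (9) tractable: once the objective is strongly concave in $z'$, plain gradient ascent converges to its unique maximizer. The sketch below illustrates this for a scalar $z'$ with the illustrative choices $\ell(z')=\sin(z')$ and $c(z',z)=(z'-z)^2$, so that for $\gamma=2$ the second derivative of the objective is $-\sin(z')-2\gamma\leq 1-4<0$. The function name, step size, and iteration count are assumptions made for this example only.

```python
import math


def adversarial_loss(loss, loss_grad, z, gamma, step=0.1, iters=200):
    """Approximate sup_{z'} loss(z') - gamma * (z' - z)**2 by gradient ascent.

    Assumes a scalar z and the squared-distance cost (illustrative choices).
    The step size must be small relative to the curvature of the objective.
    """
    z_adv = z  # start from the clean point, where the transport penalty is zero
    for _ in range(iters):
        grad = loss_grad(z_adv) - 2.0 * gamma * (z_adv - z)
        z_adv += step * grad
    value = loss(z_adv) - gamma * (z_adv - z) ** 2
    return value, z_adv


# Usage: robust surrogate of sin at z = 0.5 with gamma = 2.
phi, z_star = adversarial_loss(math.sin, math.cos, z=0.5, gamma=2.0)
```

Because the clean point $z'=z$ is always feasible with zero penalty, the returned value is never below the non-robust loss at $z$.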

The proof of Lemma 1 is based on a Taylor series expansion. By using a modified version of Danskin's theorem for minimax problems [32], together with additional smoothness conditions on $\ell$, the gradient of the SSAR in (7) w.r.t. $\theta$ can be computed efficiently as follows:

Lemma 2.

Assume loss function , and , such that conditions in Lemma 1 hold all over . Assume is differentiable w.r.t. , and let . For a fixed and , define as the maximizer of (9) for . Similarly, let to represent the maximizer of

$$J_i(y;\theta)\triangleq\sup_{z'\in\mathcal{Z}}\ \ell(z';\theta)-\gamma c\left(z',(X_i,y)\right),\qquad y\in\mathcal{Y},\ i\in I_{ul}.\tag{10}$$

Then, the gradient of (7) w.r.t. can be attained as

where .
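To illustrate how per-label adversarial losses $J_i(y;\theta)$ from (10) can be turned into soft labels for an unlabeled sample, the sketch below uses the weights obtained by differentiating a log-mean-exp soft-minimum, namely a softmax of $\lambda J_i(y;\theta)$ over $y$. Both the soft-minimum form and these weights are assumptions made for illustration; the exact expression of Lemma 2 may differ. The function names are hypothetical.

```python
import math


def soft_label_weights(per_label_losses, lam):
    """Softmax of lam * J(y): derivative of an assumed log-mean-exp soft-minimum in J(y)."""
    scaled = [lam * j for j in per_label_losses]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def unlabeled_gradient(per_label_losses, per_label_grads, lam):
    """Combine per-label parameter gradients of one unlabeled sample with the soft labels."""
    weights = soft_label_weights(per_label_losses, lam)
    dim = len(per_label_grads[0])
    return [sum(w * g[k] for w, g in zip(weights, per_label_grads)) for k in range(dim)]
```

With a negative $\lambda$ the label with the smallest adversarial loss receives the largest weight, and as $\lambda\to-\infty$ the weights collapse onto that label, which corresponds to hard pseudo-labeling. With $\lambda=0$ all labels are weighted equally.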

Proof of Lemma 2 is included in that of Theorem 2 and can be found in Appendix C. Given the formulation in Lemma 2, one can simply apply mini-batch SGD to solve (5) via Algorithm 1 (the semi-supervised extension of [9]). The constants of the algorithm, such as the maximum iteration number, the tolerance of the inner maximization, and the mini-batch size, are all user-defined. Due to the strong concavity of (10) under the conditions of Lemma 1, this tolerance can be chosen arbitrarily small. Other parameters such as $\gamma$ and $\lambda$ should be adjusted via cross-validation. The computational complexity of Algorithm 1 is no more than $|\mathcal{Y}|$ times that of [9], where the latter can only handle supervised data. (In scenarios where $|\mathcal{Y}|$ is very large, one can employ heuristic methods to reduce the set of possible labels for an unlabeled data sample and gain more efficiency at the expense of degradation in performance.) Note that Algorithm 1 reduces to [9] in fully-supervised scenarios, and coincides with Pseudo-Labeling and the EM algorithm when $\lambda\to-\infty$ and $\lambda=-1$, respectively. The following theorem guarantees the convergence of Algorithm 1 to a local minimizer of (7).

Theorem 2.

Assume the loss function , transportation cost , and to satisfy the conditions of Lemma 2. Also, assume is differentiable w.r.t. both parameters and , with Lipschitz gradients. Also, assume , for some all over . Let to be an initial hypothesis, and denote as a local minimizer of (5) or (7). Assume the partially-labeled dataset to include i.i.d. samples drawn from . Also, let . Then, for the fixed step size

 α∗≜1σ2  ⎷Δ^RT(Bσ2+(1−η)|λ||Y|), (12)

the outputs of Algorithm 1 with parameter set , , after iterations, say , satisfy the following inequality:

where constants and only depend on and Lipschitz constants of . Also, , and the expectation in (13) is w.r.t. dataset and the randomness of Algorithm 1.

The proof of Theorem 2, with explicit formulations of the constants, is given in Appendix C. Theorem 2 guarantees a convergence rate of $O(1/\sqrt{T})$ for Algorithm 1, if one neglects the error of the inner maximization. Note that the presence of this error term is necessary since one cannot find the exact maximizer of (10) in finite steps. However, due to Lemma 1, it can be chosen infinitesimally small. According to Theorem 2, choosing a very large $|\lambda|$ reduces the convergence rate, since the derivative of the soft-minimum starts to diverge. In fact, the limiting cases of $\lambda\to\pm\infty$ represent two hard-label strategies with combinatorial structures. Convergence rates for such methods are not well-studied in the literature [14]. However, convergence guarantees (even without establishing a convergence rate) can still be useful for these limiting cases. Theorem C.1 (Appendix) guarantees the convergence of Algorithm 1 in hard-decision regimes, i.e. $\lambda\to\pm\infty$.

Another interesting question is: can we guarantee the convexity of (5) or (7), given that the loss function is twice differentiable and strictly convex w.r.t. $\theta$? Although the convexity of $\ell$ does not hold in many cases of interest, e.g. neural networks, a careful convexity analysis of our method is still important. Theorem C.2 (Appendix) provides a sufficient condition for convexity of (7), when $\ell$ is strictly convex. The condition requires with being a negative function. Note that when $\lambda=-\infty$, the r.h.s. of (7) equals the minimum of a finite number of convex functions, which is not necessarily convex. On the other hand, a non-negative value of $\lambda$ is always safe for this purpose, because the soft-minimum operator in this case preserves convexity.

2.3 Generalization Guarantees

This section addresses the statistical generalization of our method. More precisely, we intend to bound the true adversarial risk, i.e. $\sup_{P\in B_\epsilon(P_0)}\mathbb{E}_{P}\{\ell(Z;\theta^*)\}$, where $\theta^*$ denotes the optimizer of the empirical risk in (7). To this aim, two major concerns need to be addressed: (i) we are training our model against an adversary, and (ii) our training dataset is only partially labeled. For (i), we introduce a novel complexity measure w.r.t. the hypothesis set and data distribution $P_0$, which extends the existing generalization analyses into an adversarial setting. For (ii), we establish a novel compatibility condition among the hypothesis set, $\lambda$ and $P_0$ that deals with the semi-supervised aspect of our work.

Conventional Rademacher complexity, denoted by $R_n(\mathcal{F})$, is a tool to measure the richness of a function set $\mathcal{F}$ in classical learning theory [33]. In fact, this measure tells us how much a function set is able to learn noise, and thus how exposed it is to overfitting on small datasets. We give a novel adversarial extension of Rademacher complexity which also appears in our generalization bound at the end of this section. Moreover, we show that our complexity measure converges to zero when $n\to\infty$, for all function sets with a finite VC-dimension, regardless of the strength of the adversary. Before that, let us define the set of $\epsilon$-Monge maps as the following function set:

$$A_\epsilon\triangleq\left\{a:\mathcal{Z}\to\mathcal{Z}\ \middle|\ c\left(z,a(z)\right)\leq\epsilon,\ \forall z\in\mathcal{Z}\right\}.\tag{14}$$

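Then, the Semi-Supervised Monge (SSM) Rademacher complexity can be defined as follows:

Definition 4.

For $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$, assume a set $\mathcal{F}$ of real-valued functions on $\mathcal{Z}$ and a distribution $P_0\in M(\mathcal{Z})$. Then, for $\epsilon\geq 0$, a transportation cost $c$ and $n\in\mathbb{N}$, let us define

$$g_l(n)\triangleq\mathbb{E}_{Z_{1:n},\sigma}\left\{\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_i\left[\sup_{a\in A_\epsilon}f\left(a(Z_i)\right)\right]\right\}\quad\text{and}\quad g_{ul}(n)\triangleq\sum_{y\in\mathcal{Y}}\mathbb{E}_{X_{1:n},\sigma}\left\{\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_i\left[\sup_{a\in A_\epsilon}f\left(a(X_i,y)\right)\right]\right\},$$

where $Z_{1:n}$ and $X_{1:n}$ are i.i.d. samples drawn from $P_0$ and from its marginal over $\mathcal{X}$, respectively. $A_\epsilon$ represents the set of $\epsilon$-Monge maps. Also, $\sigma$ indicates a vector of $n$ independent Rademacher random variables. Then, for a supervision ratio $\eta\in[0,1]$, the SSM Rademacher complexity of $\mathcal{F}$ is defined as

$$R^{(\mathrm{SSM})}_{n,(\epsilon,\eta)}(\mathcal{F})\triangleq\eta\,g_l\left(\lceil n\eta\rceil\right)+(1-\eta)\,g_{ul}\left(\lceil n(1-\eta)\rceil\right).$$

By setting $\epsilon=0$ and $\eta=1$, the above definition simply reduces to the classical Rademacher complexity $R_n(\mathcal{F})$. We define a function set $\mathcal{F}$ to be learnable, if $R_n(\mathcal{F})$ decreases to zero as one increases $n$. Similarly, a function class is said to be adversarially learnable w.r.t. parameters $(\epsilon,\eta)$, if

$$\lim_{n\to\infty}R^{(\mathrm{SSM})}_{n,(\epsilon,\eta)}(\mathcal{F})=0.\tag{15}$$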

The above definition is necessary when $\epsilon>0$, since learnability of a function class w.r.t. some distribution does not necessarily guarantee its adversarial learnability. In fact, an adversary can shift the data points and force the learner to experience regions in $\mathcal{Z}$ that cannot be accessed by $P_0$ alone. A natural concern is how to numerically compute this measure in practice. The main difference between $R_n(\mathcal{F})$ and the SSM Rademacher complexity is that in the latter the input samples (or distribution) are altered by an adversary. Fortunately, several distribution-free bounds have been established on $R_n(\mathcal{F})$ so far [33], which work for a variety of function classes of practical interest, e.g. classifiers with a bounded VC-dimension (including neural networks), polynomial regression tools with a bounded degree, etc.

We show that in case of having a distribution-free bound on the Rademacher complexity of $\mathcal{F}$, the SSM Rademacher complexity can be bounded as well. Mathematically speaking, assume there exists an asymptotically decreasing upper-bound $\Delta(n)$ such that $R_n(\mathcal{F})\leq\Delta(n)$. Then for all $\epsilon\geq 0$ and $\eta\in[0,1]$ the following holds (Lemma D.1):

$$R^{(\mathrm{SSM})}_{n,(\epsilon,\eta)}(\mathcal{F})\leq\eta\,\Delta\left(\lceil n\eta\rceil\right)+(1-\eta)\,|\mathcal{Y}|\,\Delta\left(\lceil n(1-\eta)\rceil\right),\tag{16}$$

where the r.h.s. of the above inequality always converges to zero as $n\to\infty$. This includes the vast majority of classifier families that are being used in real-world applications, e.g. neural networks, support vector machines, random forests, etc. Just as an example, consider the $0$-$1$ loss for a family of classifiers with a VC-dimension of $\dim(\Theta)$. Then, due to Dudley's entropy bound and Haussler's upper-bound [33], there exists a constant $C$ such that

$$\Delta(n)\leq C\sqrt{\frac{\dim(\Theta)}{n}},\quad\text{and so}\quad R^{(\mathrm{SSM})}_{n,(\epsilon,\eta)}(\mathcal{F})\leq C\sqrt{\frac{\dim(\Theta)}{n}}\left(\sqrt{\eta}+\sqrt{1-\eta}\,|\mathcal{Y}|\right),\tag{17}$$

regardless of $\epsilon$ or the distribution $P_0$ (again, check Lemma D.1). An interesting implication of this result is that by assuming a bounded VC-dimension for a function set, one can guarantee its learnability even in an adversarial setting where an adversary can apply arbitrarily powerful attacks. However, as long as one is interested in a function set whose classical complexity measures cannot be bounded regardless of the data distribution, not much can be said on adversarial learnability without directly computing $g_l$ and $g_{ul}$ in Definition 4.
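The step from (16) to (17) can be checked numerically with a short sketch. It evaluates the right-hand side of (16) for the VC-type bound $\Delta(n)=C\sqrt{d/n}$ and compares it with the closed form in (17). The constant `const` (standing in for $C$), the VC-dimension `vc_dim`, and all function names are placeholders chosen for illustration; the paper does not give a numerical value for $C$.

```python
import math


def vc_delta(n, vc_dim, const=1.0):
    """Distribution-free bound Delta(n) = C * sqrt(d / n) on the Rademacher complexity."""
    return const * math.sqrt(vc_dim / n)


def ssm_bound(n, eta, num_labels, vc_dim, const=1.0):
    """Right-hand side of (16) with Delta(n) = C * sqrt(d / n)."""
    total = 0.0
    if eta > 0:  # the labeled term vanishes when there are no labeled samples
        total += eta * vc_delta(math.ceil(n * eta), vc_dim, const)
    if eta < 1:  # the unlabeled term vanishes under full supervision
        total += (1 - eta) * num_labels * vc_delta(math.ceil(n * (1 - eta)), vc_dim, const)
    return total


def ssm_bound_closed_form(n, eta, num_labels, vc_dim, const=1.0):
    """Right-hand side of (17)."""
    return const * math.sqrt(vc_dim / n) * (math.sqrt(eta) + math.sqrt(1 - eta) * num_labels)
```

Since $\lceil n\eta\rceil\geq n\eta$ and $\Delta$ is decreasing, the value from `ssm_bound` never exceeds the one from `ssm_bound_closed_form`, and both shrink at the rate $1/\sqrt{n}$.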

2.3.2 Minimum Supervision Ratio

As discussed earlier in Section 1.2, generalization guarantees for SSL frameworks generally require a compatibility assumption on the hypothesis set $\mathcal{F}$ and data distribution $P_0$. In Appendix B (and in particular, Definition B.4), a new compatibility function, denoted by Minimum Supervision Ratio (MSR), is introduced which has the following functional form:

$$\mathrm{MSR}_{(\mathcal{F},P_0)}(\lambda,\mathrm{margin}):\ \left(\mathbb{R}\cup\{\pm\infty\}\right)\times\mathbb{R}_{\geq 0}\to[0,1].$$

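Intuitively, MSR quantifies the strength of the information-theoretic relation between the marginal measure $(P_0)_X$ and the conditional $(P_0)_{|X}$. It also measures the suitability of function set $\mathcal{F}$ to learn such relations. As will be shown in Theorem 3, in order to bound the true risk when unlabeled data are involved, one needs $\eta\geq\mathrm{MSR}_{(\mathcal{F},P_0)}(\lambda,\mathrm{margin})$, for some $\lambda$ and $\mathrm{margin}\geq 0$. The first argument, $\lambda$, denotes the pessimism of the learner and the second one, $\mathrm{margin}$, specifies a safety margin for small-size datasets. MSR is an increasing function w.r.t. $\mathrm{margin}$, while it decreases with $\lambda$. In particular, $\mathrm{MSR}_{(\mathcal{F},P_0)}(+\infty,\mathrm{margin})=0$, for all $\mathrm{margin}\geq 0$.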

For negative values of $\lambda$ (optimistic learning), MSR remains small as long as there exists a strong dependency between the distribution of feature vectors $(P_0)_X$ and label conditionals $(P_0)_{|X}$. This dependency can be obtained, for example, by the cluster assumption. However, MSR does not require such explicit assumptions and thus is able to impose a compatibility condition on the pair $(\mathcal{F},P_0)$ in a more fundamental way compared to existing works in SSL theory. Additionally, some loss functions in $\mathcal{F}$ need to be capable of capturing such dependencies, e.g. at least one loss function in $\mathcal{F}$ should resemble the true negative log-likelihood $-\log P_0(y|X)$. Conversely, absence of any dependency between $(P_0)_X$ and $(P_0)_{|X}$, or the lack of sufficiently "good" loss functions in $\mathcal{F}$, increases the MSR toward $1$, which forces the learner to choose a large $\lambda$ (in the extreme case $\lambda=+\infty$) to be able to use the generalization bound of Theorem 3. Moreover, a large $\lambda$ increases the empirical loss, which then loosens the bound. This fact, however, should not be surprising since improper usage of unlabeled data is known to be harmful to the generalization instead of improving it. Based on previous discussions, Theorem 3 gives a generalization bound for our proposed framework in (7):

Theorem 3 (Generalization).

For a feature-label space and a parameter space , assume the set of continuous functions , with and for some . For , and let

$$\phi_\gamma(z;\theta)\triangleq\sup_{z'\in\mathcal{Z}}\ \ell(z';\theta)-\gamma c(z',z),\tag{18}$$

and , where is a transportation cost. For a supervision ratio , assume a partially labeled dataset including i.i.d. samples drawn from , where labels can be observed with probability of , independently. For and , assume satisfies the following condition:

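$$\eta\geq\mathrm{MSR}_{(\Phi,P_0)}\left(\lambda,\ 4B\sqrt{\frac{\log(1/\delta)}{2n}}+4R^{(\mathrm{SSM})}_{n,(\epsilon,\eta)}(\mathcal{L})\right).\tag{19}$$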

Then, with probability at least , the following bound holds for all :

 supP∈Bϵ(P0)EP{ℓ(Z;θ∗)} (20)

where is the minimizer of .

Proof of Theorem 3 is given in Appendix C. Condition (19) can always be satisfied based on Lemma B.2, as long as $n$ and $\lambda$ are sufficiently large and $\mathcal{L}$ is adversarially learnable. A strongly-compatible pair of hypothesis set and data distribution should encourage optimism, where the learner can choose small (generally negative) values of $\lambda$. However, in some situations increasing $\lambda$ might be necessary for (19) to hold; in fact, for a weakly-compatible pair, $\lambda$ must be positive or even $+\infty$ (the latter always satisfies (19) regardless of $\eta$ or $n$). Note that choosing a larger $\lambda$ increases the empirical risk, which then increases our bound in (20). Interestingly, $\lambda=+\infty$ coincides with the setting of [27], which makes it a special case of our analysis.

For a fixed $\epsilon$, $\gamma$ should be tuned to minimize the upper-bound for a better generalization. On the other hand, for every $\epsilon$ there exists $\gamma$ where (20) becomes asymptotically tight. More importantly, the limiting cases of Theorem 3, i.e. $\epsilon=0$ and $\eta=1$, provide us with a new generalization bound for non-robust SSL, and an already-established bound for supervised DRL in [9], respectively.

3 Experimental Results

In this section, we demonstrate our experimental results on a number of real-world datasets and also compare our method with some state-of-the-art rival methodologies. We have chosen our loss function set $\mathcal{L}$ as a particular family of Deep Neural Networks (DNN). Architecture and other specifications of our DNNs are explained in detail in Appendix A. Throughout this section, our method is denoted by SSDRL and the rival frameworks are Virtual Adversarial Training (VAT) [4], Pseudo-Labeling (PL) [19], and the fully-supervised DRL of [9], simply denoted as DRL. We have also implemented a fast version of SSDRL, called F-SSDRL, where for each unlabeled training sample only a limited number of more favorable labels are considered for Algorithm 1. Here, by more favorable labels, we refer to those labels that correspond to smaller non-robust loss values of $\ell$. As a result, F-SSDRL runs much faster than SSDRL without much degradation in performance. Surprisingly, we found that F-SSDRL often yields better performance in practice compared to SSDRL (also see Appendix A for more details).

Figure 1 shows the misclassification rate vs. $\gamma^{-1}$ on adversarial test examples attained by (9) (same attack strategy as [9]). Recall $\gamma$ as the dual counterpart of the Wasserstein radius $\epsilon$ in (5). Thus, $\gamma^{-1}$ somehow quantifies the strength of adversarial attacks, as suggested by [9]. Results have been depicted for MNIST, SVHN and CIFAR-10 datasets. Figure 2 demonstrates the same procedure for adversarial examples generated by Projected-Gradient Method (PGM) [34]; in this case, the error-rate is depicted vs. PGM's strength of attack, i.e. $\epsilon$. For VAT and SSDRL, curves have been shown for different choices of hyper-parameters, which correspond to the lowest error rates on: (i) clean examples, (ii) adversarial examples by [9], and (iii) adversarial examples by PGM, respectively. The hyper-parameter values, the choice of transportation cost $c$, and the supervision ratio $\eta$, with more details on the experiments, can be found in Appendix A.

According to Figures 1 and 2, the proposed method is always superior to DRL and PL. Also, SSDRL outperforms VAT on the SVHN dataset regardless of the attack type, while it has a comparable error-rate on MNIST and CIFAR-10 based on Figures 1(a) and 2(c), respectively. The superiority over DRL highlights the fact that exploiting unlabeled data improves the performance. However, SSDRL under-performs VAT on the MNIST and CIFAR-10 datasets if the order of attacks is reversed. According to Figure 2(a), the accuracy of PL degrades quite slowly as PGM’s increases, although the loss values increase in Figure 5(a). This phenomenon is due to the fact that the adversarial directions for increasing the loss and the error-rate are not correlated in this particular case.


Figure 3 depicts the error-rate corresponding to DRL, SSDRL and F-SSDRL as a function of , on adversarial examples in the MNIST dataset which are generated via the maximization problem (as described in [9]). Unlike Figures 1 and 2, we have shown the results for a range of values of and , in order to experimentally measure the sensitivity of our method to these hyper-parameters. Also, we have performed the same procedure for DRL for the sake of comparison. In particular, Figure 3(a) shows the comparison between DRL and SSDRL (with $\lambda$ set to $-1$ for SSDRL) for different values of . As is evident, for the majority of cases (), SSDRL performs much better than DRL. This result indicates that employing the unlabeled data samples improves the generalization, which is highly favorable. Figure 3(b) depicts the comparison between F-SSDRL and the original SSDRL (again, $\lambda$ is set to $-1$ for SSDRL). Figure 3(c) shows the effect of varying $\lambda$ (with fixed to ). Surprisingly, the error-rate experiences a drastic jump when one changes the sign of $\lambda$, which indicates a trade-off between optimism and pessimism. This result might be related to the fact that, for the MNIST dataset, neural networks learned on the labeled part of the dataset are sufficiently reliable, and thus encourage the user to employ an optimistic approach (i.e., setting a negative $\lambda$) in order to improve the performance. However, as long as the sign of $\lambda$ is fixed, the error-rate shows little sensitivity to the magnitude of $\lambda$, which can be noted as a point of strength for SSDRL.

Table LABEL:tab:semisup_wodataaug shows the test error-rates on clean examples for F-SSDRL, VAT, PL and DRL on the MNIST, SVHN and CIFAR-10 datasets. In fact, Table LABEL:tab:semisup_wodataaug characterizes the non-adversarial generalization that can be attained via distributional robustness. Again, F-SSDRL outperforms both PL and DRL in almost all experimental settings. It also surpasses VAT on the SVHN dataset. F-SSDRL under-performs VAT on MNIST and CIFAR-10; however, the difference in error-rates remains small and the two methods perform similarly.

So far, the performance of SSDRL has been demonstrated w.r.t. its misclassification rate. We have also provided extensive experimental results on the value of adversarial loss , which are crucial for the computation of our generalization bound in Section 2.3. Figure 4 shows the average adversarial loss, i.e. , for different methods and on different datasets. is set to for SSDRL. Again, it should be noted that the adversarial examples used in Figures A.1 and 4 are generated via the procedure described in [9]. Figure 5 is the counterpart of Figure 4, where the attack strategy is replaced with the Projected-Gradient Method (PGM). As a result, adversarial loss values have been depicted as a function of PGM’s strength of attack, i.e. . As can be seen, SSDRL (or its fast version F-SSDRL) is always among the few methods that generate the smallest adversarial loss values, regardless of the strength of attacks. This means that the proposed method can establish a reliable certificate of robustness for test samples via Theorem 3. Note that VAT, another method that performs well in practice in terms of error-rate, does not have any theoretical guarantees.

4 Conclusions

This paper investigates the applications of distributionally robust learning in partially labeled datasets. The core idea is to focus on a well-known semi-supervised technique, known as self-learning, and make it robust to adversarial attacks. A novel framework, called SSDRL, has been proposed which builds upon an existing general scheme in supervised DRL. SSDRL encompasses many existing methods, such as Pseudo-Labeling (PL) and the EM algorithm, as its special cases. The computational complexity of our method is shown to be only slightly higher than those of its supervised counterparts. We have also derived convergence and generalization guarantees for SSDRL, where for the latter, a number of novel complexity measures have been introduced. We have proposed an adversarial extension of the Rademacher complexity in classical learning theory, and showed that it can be bounded for a broad range of learning frameworks, including neural networks, that have a finite VC-dimension. Moreover, our theoretical analysis reveals a more fundamental way to quantify the role of unlabeled data in the generalization through a new complexity measure called Minimum Supervision Ratio (MSR). This is in contrast to many existing works that need more restrictive conditions, such as the cluster assumption, to be applicable. Extensive computer simulations on real-world benchmark datasets demonstrate a comparable-to-superior performance for our method compared with those of the state-of-the-art. In the future, one may attempt to improve the generalization bounds, for example, by finding empirical estimations for the MSR function. Fitting a broader range of SSL methods into the core idea of Section 2.1 could be another good research direction.

References

• [1] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in International Conference on Learning Representations, 2014.
• [2] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 427–436.
• [3] I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in International Conference on Learning Representations, 2015.
• [4] T. Miyato, S. Maeda, S. Ishii, and M. Koyama, “Virtual adversarial training: a regularization method for supervised and semi-supervised learning,” IEEE transactions on pattern analysis and machine intelligence, 2018.
• [5] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” in Security and Privacy (SP), 2016 IEEE Symposium on.   IEEE, 2016, pp. 582–597.
• [6] M. Staib and S. Jegelka, “Distributionally robust deep learning as a generalization of adversarial training,” in NIPS workshop on Machine Learning and Computer Security, 2017.
• [7] A. Ben-Tal, D. Den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen, “Robust solutions of optimization problems affected by uncertain probabilities,” Management Science, vol. 59, no. 2, pp. 341–357, 2013.
• [8] S. Shafieezadeh-Abadeh, P. M. Esfahani, and D. Kuhn, “Distributionally robust logistic regression,” in Advances in Neural Information Processing Systems, 2015, pp. 1576–1584.
• [9] A. Sinha, H. Namkoong, and J. Duchi, “Certifiable distributional robustness with principled adversarial training,” in International Conference on Learning Representations, 2018.
• [10] W. Hu, G. Niu, I. Sato, and M. Sugiyama, “Does distributionally robust supervised learning give robust classifiers?” in International Conference on Machine Learning, 2018, pp. 2034–2042.
• [11] P. M. Esfahani and D. Kuhn, “Data-driven distributionally robust optimization using the wasserstein metric: Performance guarantees and tractable reformulations,” Mathematical Programming, pp. 1–52, 2017.
• [12] J. Blanchet and Y. Kang, “Semi-supervised learning based on distributionally robust optimization,” arXiv preprint arXiv:1702.08848, 2017.
• [13] Y. Grandvalet and Y. Bengio, “Semi-supervised learning by entropy minimization,” in Advances in Neural Information Processing Systems, 2005, pp. 529–536.
• [14] X. Zhu, “Semi-supervised learning literature survey,” Computer Science, University of Wisconsin-Madison, vol. 2, no. 3, p. 4, 2006.
• [15] O. Chapelle, B. Scholkopf, and A. Zien, “Semi-supervised learning (chapelle, o. et al., eds.; 2006)[book reviews],” IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 542–542, 2009.
• [16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
• [17] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in NIPS Workshop on Deep Learning and Unsupervised Feature Learning, vol. 2011, 2011, p. 5.
• [18] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
• [19] D.-H. Lee, “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in ICML Workshop on Challenges in Representation Learning, vol. 2, 2013.
• [20] Z. Cranko, A. K. Menon, R. Nock, C.-S. Ong, Z. Shi, and C. Walder, “Monge beats bayes: Hardness results for adversarial training,” arXiv preprint arXiv:1806.02977, 2018.
• [21] J. Duchi, P. Glynn, and H. Namkoong, “Statistics of robust optimization: A generalized empirical likelihood approach,” arXiv preprint arXiv:1610.03425, 2016.
• [22] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry, “Adversarially robust generalization requires more data,” in Advances in Neural Information Processing Systems, 2018, pp. 5014–5026.
• [23] D. Cullina, A. N. Bhagoji, and P. Mittal, “PAC-learning in the presence of evasion adversaries,” arXiv preprint arXiv:1806.01471, 2018.
• [24] Z. Dai, Z. Yang, F. Yang, W. W. Cohen, and R. R. Salakhutdinov, “Good semi-supervised learning that requires a bad gan,” in Advances in Neural Information Processing Systems, 2017, pp. 6510–6520.
• [25] A. Balsubramani and Y. Freund, “Scalable semi-supervised aggregation of classifiers,” in Advances in Neural Information Processing Systems, 2015, pp. 1351–1359.
• [26] Y. Yan, Z. Xu, I. W. Tsang, G. Long, and Y. Yang, “Robust semi-supervised learning through label aggregation.” in Association for the Advancement of Artificial Intelligence, 2016, pp. 2244–2250.
• [27] M. Loog, “Contrastive pessimistic likelihood estimation for semi-supervised classification,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 3, pp. 462–475, 2016.
• [28] A. Singh, R. Nowak, and X. Zhu, “Unlabeled data: Now it helps, now it doesn’t,” in Advances in Neural Information Processing Systems, 2009, pp. 1513–1520.
• [29] P. Rigollet, “Generalization error bounds in semi-supervised classification under the cluster assumption,” Journal of Machine Learning Research, vol. 8, no. Jul, pp. 1369–1392, 2007.
• [30] M.-R. Amini and P. Gallinari, “Semi-supervised logistic regression,” in European Conference on Artificial Intelligence, 2002, pp. 390–394.
• [31] S. Basu, A. Banerjee, and R. Mooney, “Semi-supervised clustering by seeding,” in International Conference on Machine Learning, 2002, pp. 27–34.
• [32] J. F. Bonnans and A. Shapiro, Perturbation analysis of optimization problems.   Springer Science & Business Media, 2013.
• [33] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of machine learning.   MIT press, 2012.
• [34] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in International Conference on Learning Representations, 2018.
• [35] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” in International Conference on Learning Representations, 2016.
• [36] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
• [37] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015, pp. 448–456.
• [38] J. Blanchet and K. Murthy, “Quantifying distributional model risk via optimal transport,” Mathematics of Operations Research, 2019.
• [39] S. Ghadimi and G. Lan, “Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework,” SIAM Journal on Optimization, vol. 22, no. 4, pp. 1469–1492, 2012.
• [40] M. Dresher, “Games of strategy: theory and applications,” RAND CORP SANTA MONICA CA, Tech. Rep., 1961.
• [41] C. Wu, C. Yang, H. Zhao, and J. Zhu, “On the convergence of the em algorithm: A data-adaptive analysis,” arXiv preprint arXiv:1611.00519, 2016.
• [42] K. S. Miller and B. Ross, An introduction to the fractional calculus and fractional differential equations.   Wiley-Interscience, 1993.

Appendix A Additional Simulations and Experimental Settings

This section presents a number of additional experiments w.r.t. the proposed method and shows more comparison with rival methodologies. We also give an extensive description of the experimental setting that we have used for our computer simulations.

Figure A.1 is a complete version of Figure 1 from Section 3, where the performances of SSDRL, fully-supervised DRL, PL and VAT are extensively investigated on three benchmark datasets, i.e. MNIST, SVHN and CIFAR-10. SSDRL and VAT have been tested with a variety of their corresponding hyper-parameters and . Figure A.2 is the counterpart of Figure A.1, where the attack strategy is replaced with the Projected-Gradient Method (PGM). Again, error-rates have been depicted as a function of PGM’s attack strength, i.e. . Even though more variation in hyper-parameters has been considered, we have not observed any significant sensitivity caused by a slight change of parameter values. As a result, one can say that DRL, SSDRL and VAT are all stable algorithms w.r.t. their parameter values, at least up to a certain level.

Figures A.3 and A.4 represent the performance (again in terms of error-rate) over clean examples from different datasets, and for SSDRL and VAT, respectively. In Figure A.3, different values of have been used for training and the test error-rate is depicted as a function of . Also, is set to for SSDRL. Apparently, SSDRL (or F-SSDRL), for a particular range of parameters, overfits during the training stage on MNIST and as a result its performance is degraded when compared to that of DRL. However, SSDRL outperforms DRL (its fully-supervised counterpart) on SVHN and CIFAR-10 datasets. Also, SSDRL and VAT have comparable performances on clean examples, specifically on SVHN and CIFAR-10 datasets. This observation is in agreement with Table LABEL:tab:semisup_wodataaug.

A.2 Experimental Settings

In this part, we present a detailed description of the experimental settings which have been used for Section 3. It should be noted that the majority of the settings used for SVHN and CIFAR-10 datasets follow the same procedure as described in [4].

A.2.1 Real-world Datasets

Three main datasets have been used during the experiments: MNIST, SVHN and CIFAR-10.

• The MNIST dataset consists of 28×28 pixel, gray-scale images of handwritten digits together with their corresponding labels. Each label is a natural number from 0 to 9. The numbers of training and test examples in the dataset are 60,000 and 10,000, respectively.

• The SVHN dataset consists of 32×32 pixel RGB images of street view house numbers with their corresponding labels. Again, labels are natural numbers ranging from 0 to 9. The numbers of training and test samples in the dataset are 73,257 and 26,032, respectively.

• The CIFAR-10 dataset consists of 32×32 pixel RGB images of categorized objects, e.g., cars, trucks, planes, ships and animals. The numbers of training and test examples in the dataset are 50,000 and 10,000, respectively. For the CIFAR-10 dataset, we conducted Zero-phase Component Analysis (ZCA) as a pre-processing stage prior to the experiments.

A.2.2 Supervision Ratio and Training Data-points

In order to create a dataset (training+testing) for the semi-supervised learning task in the paper, we selected a subset of size as the labeled dataset from MNIST and SVHN, while the size goes up to for CIFAR-10. The rest of the samples in the training partition are treated as unlabeled data. We repeated the experiment three times with different choices of labeled and unlabeled data-points on all three datasets. For MNIST, a mini-batch of size is used for both the labeled and unlabeled terms, and for SVHN and CIFAR-10, a mini-batch of size is used for the calculation of the labeled term, while a mini-batch of size is employed for the unlabeled term during the implementation of each method. We trained each model with updates for MNIST and updates for SVHN and CIFAR-10. We have used the ADAM optimizer in the training stage. In this regard, the initial learning rate of ADAM is set to and then linearly decayed over the last updates for MNIST, and the last updates for SVHN and CIFAR-10.
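The learning-rate schedule described above (constant at first, then linearly decayed over the final updates) can be sketched as follows. This is only an illustration: the function name `linear_decay_lr` and its arguments are hypothetical, and it is an assumption of this sketch that the rate decays all the way to zero at the last update.

```python
def linear_decay_lr(step, total_steps, decay_steps, base_lr):
    """Learning rate at a given update index (0-based).

    The rate stays at `base_lr` until the last `decay_steps` updates and is
    then decayed linearly. Decaying to exactly zero is an assumption here.
    """
    decay_start = total_steps - decay_steps
    if step < decay_start:
        return base_lr
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / decay_steps
```

The value returned at each update would be passed to the optimizer before the corresponding gradient step.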

As for the transportation cost function $c$, we follow the work presented in [9] and thus employ the following cost function throughout all our experiments:

 $c(z,z') = \|z - z'\|_2^2 + \infty \cdot \mathbb{1}\{y \neq y'\}$, (A.1)

where $\mathbb{1}\{\cdot\}$ is an indicator function which returns one if its input condition holds, and zero otherwise. It should be noted that this choice is solely for the sake of simplicity and, as described before, every valid lower semi-continuous function is a legitimate choice for $c$.
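For concreteness, the cost in (A.1) can be sketched in a few lines of Python. The function name `transport_cost` and the NumPy representation of the inputs are assumptions of this sketch. Since the cost is infinite whenever the labels differ, the squared-norm term only ever needs to be evaluated on the feature parts of the two points.

```python
import numpy as np

def transport_cost(x, y, x_prime, y_prime):
    """Cost of moving (x, y) to (x_prime, y_prime) as in (A.1).

    Hypothetical helper: changing the label costs infinity, so an adversary
    with a finite budget can only perturb the features.
    """
    if y != y_prime:
        return np.inf
    diff = np.asarray(x, dtype=float) - np.asarray(x_prime, dtype=float)
    return float(np.dot(diff, diff))  # squared Euclidean distance
```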

Also, the pessimism/optimism trade-off parameter $\lambda$ is always set to , except when stated otherwise. This option yields a certain degree of optimism during the learning stage, which is motivated by the fact that Deep Neural Networks (DNN) have already proven to work well on all three above-mentioned datasets. Thus, trusting the learner to assign soft pseudo-labels to the unlabeled data is encouraged, which in turn suggests a negative value for $\lambda$.

To solve the inner maximization problem in (8) and (10) for each pair of , we simply apply Gradient Ascent with the following update rule:

 $X_{t+1} = X_t + r_t \nabla_{X_t}\left[\ell\left((X_t,y);\theta\right) - \gamma\, c\left((X_t,y),(X,y)\right)\right]$, (A.2)

where the initial value is set to , and the ascent rate is defined as , where is a hyper-parameter. We set to 1.0 for MNIST and CIFAR-10, and for SVHN. During the training, we repeat the update in (A.2) times for both the DRL and SSDRL method. However, we repeat it times during the evaluation.
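The update in (A.2) can be sketched as below for a toy problem. Everything specific here is an assumption made for illustration: the logistic loss stands in for the network loss, the cost is the squared Euclidean distance from (A.1) with the label held fixed, the iterate starts at the clean input, and the ascent rate decays as `rate / sqrt(t)`.

```python
import numpy as np

def logistic_loss(x, y, theta):
    """Toy loss: log(1 + exp(-y * <theta, x>)) with y in {-1, +1}."""
    return float(np.log1p(np.exp(-y * np.dot(theta, x))))

def logistic_loss_grad_x(x, y, theta):
    """Gradient of the toy loss with respect to the input x."""
    margin = y * np.dot(theta, x)
    return -y * theta / (1.0 + np.exp(margin))

def inner_maximize(x0, y, theta, gamma, rate, n_steps):
    """Gradient ascent on loss(x) - gamma * ||x - x0||^2, starting from x0."""
    x = x0.copy()
    for t in range(1, n_steps + 1):
        # gradient of the penalized objective inside the brackets of (A.2)
        grad = logistic_loss_grad_x(x, y, theta) - 2.0 * gamma * (x - x0)
        x = x + (rate / np.sqrt(t)) * grad  # assumed decaying ascent rate
    return x
```

A larger `gamma` keeps the perturbed point closer to the clean input, which matches the role of the penalty term as the dual counterpart of the transportation budget.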

While generating the adversarial examples via the Projected-Gradient Method (PGM), we applied the following update rule which is also used in some previous works in this area [9, 34]:

 $X_{t+1} = \mathrm{Proj}_{X,\epsilon}\left(X_t + \xi\, \overline{\nabla_{X_t}\ell\left((X_t,y);\theta\right)}\right)$, (A.3)

where $\mathrm{Proj}_{X,\epsilon}$ represents the projection operator onto an $\epsilon$-ball (w.r.t. norm) centered on $X$. Also, for an arbitrary vector , the overlined version denotes its normalized version, which is mathematically defined as under the -norm constraint. We have defined the length parameter $\xi$ as , where denotes the number of iterations of the update (A.3). Accordingly, we set .
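A minimal sketch of the projected update (A.3) is given below. Because the norm is not recoverable from the text above, the Euclidean norm is assumed for both the normalization and the projection; the names `project_l2_ball`, `pgm_attack` and `grad_fn` are likewise hypothetical.

```python
import numpy as np

def project_l2_ball(x, center, eps):
    """Project x onto the Euclidean ball of radius eps around center."""
    delta = x - center
    norm = np.linalg.norm(delta)
    if norm > eps:
        delta = delta * (eps / norm)
    return center + delta

def pgm_attack(x0, grad_fn, eps, step, n_iters):
    """Iterate (A.3): normalized gradient step followed by projection.

    `grad_fn(x)` is assumed to return the gradient of the loss w.r.t. x.
    """
    x = x0.copy()
    for _ in range(n_iters):
        g = grad_fn(x)
        g_norm = np.linalg.norm(g)
        if g_norm == 0.0:
            break  # no ascent direction available
        x = project_l2_ball(x + step * g / g_norm, x0, eps)
    return x
```

The projection guarantees that the final adversarial example never leaves the ball of radius `eps` around the clean input, no matter how many iterations are run.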

A.2.4 Architecture of Deep Neural Networks

A class of Convolutional Neural Networks (CNN) has been used for the loss function set . Table LABEL:tab:cnn_models shows the CNN models used in our experiments. We use ELU [35] for the activation function in MNIST, and leakyReLU (lReLU) [36] for SVHN and CIFAR-10. In the CNNs used for SVHN and CIFAR-10, all the convolutional layers as well as the fully connected (or equivalently dense) layers are followed by batch normalization [37], except for the fully connected layer on CIFAR-10. The slopes of all lReLU in the network are set to .


Appendix B Minimum Supervision Ratio: Definition and Implications

In this section, we present some complementary discussions with respect to our generalization bound in Section 2.3. In particular, the mathematical definition of one of our proposed complexity measures, i.e. the Minimum Supervision Ratio, and the intuition behind it are explained in detail.

In order to better understand the intuition behind the proposed optimization programs in (5) or (7), it is necessary to investigate them under the asymptotic regime of $n \to \infty$. In this regard, this section provides a rigorous mathematical framework to study semi-supervised learning in general (and its distributionally robust extension in particular), under the specific problem setting of this paper. We then provide conditions on the hypothesis set and the data-generating distribution under which unlabeled data can help the overall learning procedure. Final bounds on the performance improvement obtained by incorporating unlabeled samples (mostly from the generalization aspect) are given with mathematical details in Theorem 3 and its proof. In order to achieve the above-mentioned goal, let us first make the following definition:

Definition B.1.

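For a feature space $\mathcal{X}$ and a finite label set $\mathcal{Y}$, the conditional composition of a distribution $P$ with a conditional distribution $\Omega$ through a supervision ratio of $\eta$, denoted by $\mathrm{comp}(P,\Omega,\eta)$, is defined as

 $\mathrm{comp}(P,\Omega,\eta)(X,y) \triangleq \eta P(X,y) + (1-\eta)\,\Omega(y|X)\left(\sum_{y'\in\mathcal{Y}} P(X,y')\right)$. (B.1)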

It can be easily verified that the following properties hold for the conditional composition distribution of any two corresponding distributions:

 $\mathrm{comp}(P,\Omega,\eta)_X = P_X, \qquad \mathrm{comp}(P,\Omega,\eta)_{|X} = \eta P_{|X} + (1-\eta)\,\Omega_{|X}$, (B.2)

where the first relation means that the marginal of the composition distribution w.r.t. $X$ (which is a measure supported on $\mathcal{X}$) is the same as that of $P$, while the second property states that the conditional distribution over $y$ (given $X$) is a weighted mixture of the conditional distributions $P_{|X}$ and $\Omega_{|X}$.
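For finite feature and label sets, the composition in (B.1) and the two properties in (B.2) can be checked numerically. The sketch below is only an illustration: the function name `conditional_composition` and the array layout (rows index feature values, columns index labels) are assumptions.

```python
import numpy as np

def conditional_composition(P, Omega, eta):
    """Discrete version of (B.1).

    P     : array (n_x, n_y), joint distribution over (X, y)
    Omega : array (n_x, n_y), each row is a conditional distribution over y
    eta   : supervision ratio in [0, 1]
    """
    P = np.asarray(P, dtype=float)
    Omega = np.asarray(Omega, dtype=float)
    marginal_x = P.sum(axis=1, keepdims=True)  # sum of P(X, y') over y'
    return eta * P + (1.0 - eta) * Omega * marginal_x
```

Summing the result over labels recovers the marginal of `P`, and dividing each row by that marginal gives the `eta`-weighted mixture of the two conditionals, in line with (B.2).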

An interesting asymptotic property of a consistent distribution set (see Definition 2) is that, provided that both the fully and partially-observed samples in $D$ are i.i.d. samples generated from a single arbitrary distribution $P_0$, the following relation holds almost surely w.r.t. $P_0$:

 $\lim_{n\to\infty} \hat{P}(D) \overset{\text{a.s.}}{=} \left\{ \mathrm{comp}\left(P_0, \Omega, \eta = \lim_{n\to\infty} \frac{n_l}{n}\right) \,\middle|\, \Omega \in M_X(\mathcal{Y}) \right\}$, (B.3)

where the asymptotic equality in the above relation corresponds to a member-wise convergence between the two sets. Consequently, rewriting (7) in the asymptotic regime of $n \to \infty$ would give us the following equalities:

The first term in the r.h.s. of (B.4) is proportional to the true risk which we intend to bound. However, the second term models the asymptotic effect of unlabeled data for a fixed supervision ratio $\eta$. The main question that we try to answer in this section can be intuitively stated as: under what conditions does the second term become approximately proportional to the true risk as well?

Before investigating the above question in more theoretical details, a closer look at the semi-supervised adversarial risk reveals that

This fact implies that by decreasing $\lambda$, one can also decrease (at least in the majority of non-trivial scenarios). This issue has been previously mentioned in Section 2, which indicates that optimism always results in lower empirical risks. But how does this strategy affect the true expected loss, i.e. ? On the other hand, moving toward guarantees that the learner is minimizing a legitimate upper-bound of the true risk, i.e. extreme pessimism; however, this also increases the empirical risk. Again, one could ask: is it really necessary to be so pessimistic?

In order to answer the above questions, we introduce a new compatibility measure function for a function set and distribution , denoted by minimal supervision ratio or . We then show that as long as a particular inequality holds among parameters such as , and according to , one can guarantee minimizing a valid upper-bound for the true risk, while avoiding the extreme pessimism of [27] (less harm to the empirical risk minimization). In order to do so, first let us introduce a number of useful additional tools:

Definition B.2.

Assume function class and distribution for a finite label-set . For the ease of notation, let for . Then, for and is defined as

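 $\rho_\lambda(\phi) \triangleq \mathbb{E}_{P_{0X}}\left\{ \overset{(\lambda)}{\underset{y\in\mathcal{Y}}{\mathrm{softmin}}} \{\phi_X\} \right\} - \mathbb{E}_{P_0}\{\phi\}$. (B.6)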

As will become evident in the subsequent arguments of this section, the functional introduced in Definition B.2, i.e. $\rho_\lambda$, plays an important role in determining the relation of the expected (or asymptotic) semi-supervised risk with the true (supervised) one. Mathematically speaking, enforcing for to remain non-negative guarantees that for any . This allows us to upper-bound the true risk with the value of computed for that particular . Surprisingly, this condition can always be satisfied by choosing $\lambda = \infty$ (extreme pessimism). This configuration, in the special non-robust case, coincides with the framework presented in [27].

Lemma B.1.

For any function set $\Phi$ and distribution $P_0$, we have $\rho_\infty(\phi) \ge 0$ for all $\phi \in \Phi$.

Proof.

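$P_{0|X}$ is a distribution over $\mathcal{Y}$, and thus can be considered as a vector in a simplex, i.e. all of its components are non-negative and sum up to one. Then, the lemma’s argument can be justified by the fact that

 $\left\langle \phi_X \mid P_{0|X} \right\rangle \le \max_{y\in\mathcal{Y}} \phi_X, \quad \text{while} \quad \overset{(\infty)}{\underset{y\in\mathcal{Y}}{\mathrm{softmin}}} \{\phi_X\} = \max_{y\in\mathcal{Y}} \phi_X$, (B.7)

where $\langle \cdot \mid \cdot \rangle$ denotes the inner product. More precisely, one can write: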

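 $\rho_\infty(\phi) = \mathbb{E}_{P_{0X}}\left\{ \max_{y\in\mathcal{Y}} \phi_X \right\} - \mathbb{E}_{P_0}\{\phi\} = \mathbb{E}_{P_{0X}}\left\{ \max_{y\in\mathcal{Y}} \phi_X - \left\langle \phi_X \mid P_{0|X} \right\rangle \right\} \ge 0.$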

The last inequality is a direct result of the fact that inside of the expectation operator is non-negative. This completes the proof. ∎
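Lemma B.1 can be checked numerically on a small discrete example. The sketch below evaluates the functional in (B.6) for finite feature and label sets. The soft-min operator is defined earlier in the paper and is not restated in this appendix, so the form used here, $\frac{1}{\lambda}\log\big(\frac{1}{|\mathcal{Y}|}\sum_{y} e^{\lambda q_y}\big)$, is an assumption of this sketch; it is chosen so that $\lambda \to \infty$ recovers the maximum, consistent with (B.7). The function names `soft_min` and `rho` are hypothetical.

```python
import numpy as np

def soft_min(values, lam):
    """Assumed soft-min: (1/lam) * log(mean(exp(lam * values))) over the last axis."""
    values = np.asarray(values, dtype=float)
    if np.isposinf(lam):
        return values.max(axis=-1)   # limit lam -> +infinity, as in (B.7)
    if np.isneginf(lam):
        return values.min(axis=-1)   # limit lam -> -infinity
    if lam == 0:
        return values.mean(axis=-1)  # limit lam -> 0
    scaled = lam * values
    shift = scaled.max(axis=-1, keepdims=True)  # stabilizes the exponentials
    log_mean_exp = shift[..., 0] + np.log(np.exp(scaled - shift).mean(axis=-1))
    return log_mean_exp / lam

def rho(phi, P0, lam):
    """Discrete version of (B.6); phi and P0 are arrays of shape (n_x, n_y)."""
    phi = np.asarray(phi, dtype=float)
    P0 = np.asarray(P0, dtype=float)
    marginal_x = P0.sum(axis=1)
    return float(np.dot(marginal_x, soft_min(phi, lam)) - np.sum(P0 * phi))
```

With `lam = np.inf` the first term becomes the expected row-wise maximum of `phi`, which can never be smaller than its expectation under `P0`; this is exactly the argument used in the proof above.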

However, we are more interested in those cases where can be bounded, or even negative, while is still non-negative in some regions of . The main problem is that the minimizer of (7) (semi-supervised empirical risk) must fall in those regions, as well. Otherwise one cannot upper-bound the true risk by minimizing (7). Mathematically speaking, assume as described in (7). Then, we are interested to see if there exists a non-empty subset of , say , such that:

 ∃ψ⊆Φ∣∣∣ argminϕ∈Φ ^R