Incorporating Unlabeled Data into Distributionally-Robust Learning
Abstract
We study a robust alternative to empirical risk minimization called distributionally robust learning (DRL), in which one learns to perform against an adversary who can choose the data distribution from a specified set of distributions. We illustrate a problem with current DRL formulations, which rely on an overly broad definition of allowed distributions for the adversary, leading to learned classifiers that are unable to predict with any confidence. We propose a solution that incorporates unlabeled data into the DRL problem to further constrain the adversary. We show that this new formulation is tractable for stochastic gradient-based optimization and yields a computable guarantee on the future performance of the learned classifier, analogous to, but tighter than, guarantees from conventional DRL. We examine the performance of this new formulation on real datasets and find that it often yields effective classifiers with nontrivial performance guarantees in situations where conventional DRL produces neither. Inspired by these results, we extend our DRL formulation to active learning with a novel, distributionally-robust version of the standard model-change heuristic. Our active learning algorithm often achieves superior learning performance to the original heuristic on real datasets.
Frogner, Claici, Chien, and Solomon
Kevin Murphy and Bernhard Schölkopf
Keywords: Distributionally robust optimization, Wasserstein distance, optimal transport, supervised learning, active learning
1 Introduction
Human learning is robust in ways that statistical learning struggles to replicate. Small changes to image pixel values and audio waveforms, for example, can dramatically alter the outputs of classifiers trained by conventional empirical risk minimization, while remaining imperceptible to human observers (Szegedy et al., 2013; Carlini and Wagner, 2018). Robustness to artificial and natural variations, however, is critical when learning systems are deployed “in the wild,” such as in self-driving vehicles (Huval et al., 2015; Bojarski et al., 2016) and speech recognition systems (Junqua and Haton, 2012; Hannun et al., 2014). Hence, the design of robust learning techniques is a key focus of recent machine learning research (Eykholt et al., 2017; Madry et al., 2017; Raghunathan et al., 2018; Singh et al., 2018; Sinha et al., 2018; Cohen et al., 2019; Yuan et al., 2019).
Distributionally robust learning (DRL) (Delage and Ye, 2010; Abadeh et al., 2015; Chen and Paschalidis, 2018) offers an alternative to empirical risk minimization in which one learns to perform against an adversary who chooses the data distribution from a specified set of distributions. This approach offers several benefits, including robust performance with respect to perturbations of the data distribution and computable guarantees on the generalization of the learned model—provided the adversary’s decision set includes the true data distribution.
The robustness guarantees offered by DRL rely on selection of the adversary’s decision set; if the set does not include the true data distribution, the guarantees do not necessarily hold. Most previous work has chosen the decision set to be a norm ball around the empirical distribution of the training data (Abadeh et al., 2015; Chen and Paschalidis, 2018; Esfahani and Kuhn, 2018; Sinha et al., 2018). As we show in Section 5.2, however, in many cases this ball must be extremely large to contain the true data distribution. As a result, the distributionally-robust learner attempts to be robust to an overly broad set of data distributions, which prevents it from making a prediction with any confidence: it can do no better than assigning equal probability to all of the classes.
In this paper, we address the problem of overwhelmingly large decision sets by using unlabeled data to further constrain the adversary. In essence, we can remove from the decision set distributions that are “unrealistic” in the sense that their marginals in feature space do not resemble the unlabeled data. With a smaller decision set, the distributionally-robust learner can provide a tighter bound on the generalization performance, yielding nontrivial predictors with nonvacuous performance guarantees in situations where conventional DRL offers neither.
Our mechanism for optimizing against an adversary constrained by unlabeled data is general-purpose and applicable beyond supervised learning. We use this same mechanism to formulate a novel distributionally-robust method for active learning; this method frequently outperforms both uniform random sampling and standard methods for active learning.
2 Background
2.1 Notation
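For any Polish space $\mathcal{X}$, we use $\mathcal{B}(\mathcal{X})$ to denote the associated Borel $\sigma$-algebra and $\mathcal{M}(\mathcal{X})$ to denote the set of Radon measures on $\mathcal{X}$. $\mathcal{M}_+(\mathcal{X})$ is the set of nonnegative Radon measures on $\mathcal{X}$, and $\mathcal{P}(\mathcal{X})$ is the set of probability measures: $\mathcal{P}(\mathcal{X}) = \{\mu \in \mathcal{M}_+(\mathcal{X}) : \mu(\mathcal{X}) = 1\}$. $C_b(\mathcal{X})$ is the set of continuous, bounded functions from $\mathcal{X}$ into $\mathbb{R}$.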
2.2 Statistical learning
Let $\mathcal{X}$ be an input space and $\mathcal{Y}$ a label space, and let $\mu$ be the true data distribution, a probability measure over $\mathcal{X} \times \mathcal{Y}$. We focus on a classification setting, in which $\mathcal{Y}$ is a finite collection of discrete labels, while $\mathcal{X}$ can be any compact Polish space. The learning problem chooses a hypothesis $h_\theta$, parameterized by $\theta$, that minimizes the expected risk, $\mathbb{E}_{(x,y) \sim \mu}[\ell(h_\theta(x), y)]$, where $\ell$ is a loss function measuring deviation of the prediction $h_\theta(x)$ from the true label $y$.
We cannot directly evaluate the expected risk, however, since $\mu$ is unknown. We instead have a labeled sample $\{(x_i, y_i)\}_{i=1}^N$ consisting of $N$ i.i.d. samples from $\mu$. If $\hat{\mu} = \frac{1}{N} \sum_{i=1}^N \delta_{(x_i, y_i)}$ is the empirical distribution of the labeled data, traditional empirical risk minimization substitutes $\hat{\mu}$ for $\mu$ in the statistical learning problem, solving
(1)  $\min_{\theta} \; \mathbb{E}_{(x,y) \sim \hat{\mu}}[\ell(h_\theta(x), y)] = \min_{\theta} \; \frac{1}{N} \sum_{i=1}^N \ell(h_\theta(x_i), y_i).$
To reduce the variance of this approximation and promote generalization, a regularization term (e.g., penalizing model complexity) is often added to the loss.
2.3 Distributional robustness
Distributionally-robust learning (DRL) (Delage and Ye, 2010; Abadeh et al., 2015; Chen and Paschalidis, 2018) is an alternative to empirical risk minimization that attempts to learn a predictor with minimal worst-case expected risk, against an adversary who chooses the distribution of the data from a specified decision set $\mathcal{U}$:
(2)  $\min_{\theta} \; \sup_{\mu' \in \mathcal{U}} \; \mathbb{E}_{(x,y) \sim \mu'}[\ell(h_\theta(x), y)].$
$\mathcal{U}$ is typically a norm ball centered at the empirical distribution of the labeled data $\hat{\mu}$. If $\mathcal{U}$ is chosen such that it contains the true data distribution $\mu$, the objective in (2) upper-bounds the expected risk of the hypothesis.
In this paper, we focus on Wasserstein distributional robustness (Abadeh et al., 2015; Chen and Paschalidis, 2018), in which the adversary’s decision set is a norm ball with respect to the Wasserstein distance:
Definition (Wasserstein distance). Let $\mathcal{Z}$ be a Polish space and $c : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}_+$ a lower-semicontinuous cost function. For any probability measures $\mu_1, \mu_2$ on $\mathcal{Z}$, the Wasserstein distance between $\mu_1$ and $\mu_2$ is
(3)  $W_c(\mu_1, \mu_2) = \inf_{\pi \in \Pi(\mu_1, \mu_2)} \int_{\mathcal{Z} \times \mathcal{Z}} c(z, z') \, d\pi(z, z'),$
with $\Pi(\mu_1, \mu_2)$ the set of all joint distributions on $\mathcal{Z} \times \mathcal{Z}$ having marginals $\mu_1$ and $\mu_2$. $\pi$ is sometimes also called a “transportation plan” for moving the mass in $\mu_1$ to match $\mu_2$. The Wasserstein distance differs from other common divergences on probability measures, such as the KL divergence, in that it takes into account the geometry of the domain $\mathcal{Z}$, via the transport cost $c$. For this reason, it can compare measures with disjoint support, for example.
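The last point, that a transport-based distance stays informative for measures with disjoint support, can be seen in a minimal sketch. It assumes a deliberately simple case that is not the general setting of the paper: two uniformly weighted empirical measures of equal size on the real line, with cost $|z - z'|$, for which the optimal plan matches sorted samples. The function names are hypothetical.

```python
import math

def wasserstein_1d(xs, ys):
    """W_1 between two equal-size, uniformly weighted empirical measures on the line."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    # With cost |z - z'| in one dimension, the optimal plan pairs sorted samples.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as dicts {point: mass}."""
    total = 0.0
    for z, pz in p.items():
        if pz == 0:
            continue
        qz = q.get(z, 0.0)
        if qz == 0:
            return math.inf  # p puts mass where q has none
        total += pz * math.log(pz / qz)
    return total

# Two pairs of measures with disjoint supports: one pair close, one pair far apart.
w_near = wasserstein_1d([0.0, 1.0], [0.1, 1.1])
w_far = wasserstein_1d([0.0, 1.0], [5.0, 6.0])
kl_near = kl_divergence({0.0: 0.5, 1.0: 0.5}, {0.1: 0.5, 1.1: 0.5})
kl_far = kl_divergence({0.0: 0.5, 1.0: 0.5}, {5.0: 0.5, 6.0: 0.5})
```

In this toy case the KL divergence is infinite for both pairs, while the transport distance separates the nearby pair from the distant one.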
2.4 Related work
Distributionally robust optimization (Calafiore and El Ghaoui, 2006) has been explored extensively beyond the learning setting, for a broad variety of objective functions and decision sets. Often decision sets are defined by moment or support conditions (Delage and Ye, 2010; Goh and Sim, 2010; Wiesemann et al., 2014) or divergences on probability measures such as the Prokhorov metric (Erdoğan and Iyengar, 2006) or $f$-divergences (Ben-Tal et al., 2013; Duchi et al., 2016; Namkoong and Duchi, 2016; Bertsimas et al., 2018; Miyato et al., 2015). Directional deviation conditions have also been explored (Chen et al., 2007). Kuhn et al. (2019) give a recent review of applications in machine learning.
Distributionally robust learning over a Wasserstein ball was proposed for logistic regression (Abadeh et al., 2015), regularized linear regression (Chen and Paschalidis, 2018), and more general losses (Blanchet et al., 2016; Gao and Kleywegt, 2016; Sinha et al., 2018; Dziugaite and Roy, 2017; Esfahani and Kuhn, 2018). An equivalence to regularization, under various assumptions on the loss, was shown by Gao et al. (2017).
In Section 6, we discuss an application of the proposed method to active learning, which is a well-studied topic that has inspired a wide variety of algorithms (Yang and Loog, 2018). We focus on a class of heuristics that seek to maximize the change in the learned model resulting from obtaining a labeled example (Settles et al., 2008; Freytag et al., 2014; Cai et al., 2017).
3 Distributionally-robust learning with unlabeled data
3.1 A problem with the existing approach
In the “medium-data” regime, where the labeled sample may be far from the true data distribution with respect to Wasserstein distance, Wasserstein distributionally-robust learning suffers from imprecision of the decision set, which is a Wasserstein ball centered at the empirical distribution of the labeled sample. The volume of this ball grows rapidly in its radius, requiring the learner to be robust to an enormous variety of data distributions. This problem manifests as low confidence of the distributionally robust learner, even when the radius is chosen to be much smaller than the true distance to the data distribution, a choice that forgoes the performance guarantee implied by (2).
Figure 1 shows an example; additional illustrations are in Section 5.2. We train a Wasserstein distributionally robust logistic regression model using labeled samples from the Wisconsin breast cancer dataset (Dua and Graff, 2019).
We plot both the test set likelihood and the maximum confidence
3.2 Constraining the adversary using unlabeled data
We propose to deal with the overwhelming size of the decision set by constraining it further, pruning unrealistic potential data distributions while still allowing the set to contain the true data distribution. Specifically, we intersect two additional constraints with the Wasserstein ball.
The first constraint uses unlabeled data to constrain the marginal in the feature space $\mathcal{X}$ of the data distribution. As is common in many learning settings, we assume that unlabeled data are acquired much more readily than labeled data, giving the learner access to a large set of unlabeled examples. Let $\mu_{\mathcal{X}}$ be the $\mathcal{X}$-marginal of the true data distribution $\mu$, defined by $\mu_{\mathcal{X}}(A) = \mu(A \times \mathcal{Y})$ for all Borel subsets $A \subseteq \mathcal{X}$. Then our unlabeled data is a set drawn i.i.d. from $\mu_{\mathcal{X}}$.
The second constraint restricts the marginal in the label space $\mathcal{Y}$ of the data distribution, by defining intervals on the individual label probabilities. Let $\mu_{\mathcal{Y}}$ be the $\mathcal{Y}$-marginal of $\mu$, which in a classification setting is discrete: $\mu_{\mathcal{Y}} = \sum_{k=1}^K p_k \delta_{y_k}$ for $\{y_k\}_{k=1}^K$ the set of labels and $p_k$ the corresponding label probabilities. The interval for each label $y_k$ is $[\underline{p}_k, \overline{p}_k]$. These interval constraints might come from prior knowledge, another dataset as in the ecological inference setting (King, 2013; Frogner and Poggio, 2019), or directly from the training data.
3.3 Problem formulation and duality
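$\mathcal{X}$ : Feature space
$\mathcal{Y}$ : Label space
$h_\theta$ : Hypothesis function, parameterized by $\theta$
$\ell$ : Loss function
$\mu$ : True data distribution (over $\mathcal{X} \times \mathcal{Y}$)
$\mu_{\mathcal{X}}$ : $\mathcal{X}$-marginal of $\mu$
$\mu_{\mathcal{Y}}$ : $\mathcal{Y}$-marginal of $\mu$
$\hat{\mu}$ : Labeled data distribution (over $\mathcal{X} \times \mathcal{Y}$)
$B_\epsilon(\hat{\mu})$ : Wasserstein ball of radius $\epsilon$ about $\hat{\mu}$
$\overline{p}_k$ ($\underline{p}_k$) : Upper (lower) bound on marginal probability of $y_k$
$\mathcal{P}_{\mu_{\mathcal{X}}, \underline{p}, \overline{p}}$ : Set of probability measures whose $\mathcal{X}$-marginal is $\mu_{\mathcal{X}}$ and whose $\mathcal{Y}$-marginal assigns each label $y_k$ a probability in $[\underline{p}_k, \overline{p}_k]$, $k = 1, \ldots, K$.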
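If we restrict the decision set as described in Section 3.2, we need to establish that the distributionally robust learning problem is still tractable, particularly since one of the constraints we have added is infinite-dimensional. Recall that $\mu_{\mathcal{X}}$ is the marginal of the unlabeled data on the feature space $\mathcal{X}$, and that $\underline{p}_k$ and $\overline{p}_k$ are the lower and upper bounds on the marginal on the label. We can define a set of possible joint distributions on $\mathcal{X} \times \mathcal{Y}$ that are consistent with this data,
(4)  $\mathcal{P}_{\mu_{\mathcal{X}}, \underline{p}, \overline{p}} = \{ \mu' \in \mathcal{P}(\mathcal{X} \times \mathcal{Y}) : \mu'_{\mathcal{X}} = \mu_{\mathcal{X}}, \;\; \underline{p}_k \le \mu'_{\mathcal{Y}}(\{y_k\}) \le \overline{p}_k, \; k = 1, \ldots, K \},$
with $\underline{p} = (\underline{p}_k)_{k=1}^K$ and $\overline{p} = (\overline{p}_k)_{k=1}^K$.
Suppose in addition we observe labeled data $\{(x_i, y_i)\}_{i=1}^N$, with $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, that define the empirical distribution $\hat{\mu} = \frac{1}{N} \sum_{i=1}^N \delta_{(x_i, y_i)}$. We define the adversary’s decision set $\mathcal{U}$ to be the intersection of the set of distributions $\mathcal{P}_{\mu_{\mathcal{X}}, \underline{p}, \overline{p}}$ with a Wasserstein ball of radius $\epsilon$ around the empirical distribution $\hat{\mu}$:
(5)  $\mathcal{U} = \mathcal{P}_{\mu_{\mathcal{X}}, \underline{p}, \overline{p}} \cap B_\epsilon(\hat{\mu}),$
where $B_\epsilon(\hat{\mu}) = \{ \mu' \in \mathcal{P}(\mathcal{X} \times \mathcal{Y}) : W_c(\mu', \hat{\mu}) \le \epsilon \}$. Thus, our feasible set contains all distributions that have the correct data and label marginals (and are thus in $\mathcal{P}_{\mu_{\mathcal{X}}, \underline{p}, \overline{p}}$), but which are also close to the known labeled distribution (and thus contained in $B_\epsilon(\hat{\mu})$).
The resulting distributionally-robust problem is defined identically to (2), using this decision set $\mathcal{U}$. The inner problem with fixed $\theta$ is that of evaluating a worst-case expected loss,
(6)  $\sup_{\mu' \in \mathcal{U}} \; \mathbb{E}_{(x,y) \sim \mu'}[\ell(h_\theta(x), y)].$
For marginals $\mu_{\mathcal{X}}$ with infinite support this is an infinite-dimensional linear program, as the marginal on $\mathcal{X}$ of the solution must contain the support of $\mu_{\mathcal{X}}$.
We can rewrite (6) by casting it as an optimal transportation problem over the space $(\mathcal{X} \times \mathcal{Y})^2$ between our unknown distribution and the given data distribution $\hat{\mu}$ such that the transport plan satisfies the marginal constraints on $\mathcal{X}$ and $\mathcal{Y}$:
(7)  $\sup_{\pi \in \mathcal{M}_+((\mathcal{X} \times \mathcal{Y})^2)} \int \ell(h_\theta(x), y) \, d\pi((x,y),(x',y'))$ subject to $\int c((x,y),(x',y')) \, d\pi \le \epsilon$; the second marginal of $\pi$ equals $\hat{\mu}$; $\pi(A \times \mathcal{Y} \times \mathcal{X} \times \mathcal{Y}) = \mu_{\mathcal{X}}(A)$ for all Borel $A \subseteq \mathcal{X}$; and $\underline{p}_k \le \pi(\mathcal{X} \times \{y_k\} \times \mathcal{X} \times \mathcal{Y}) \le \overline{p}_k$ for $k = 1, \ldots, K$.
Here the variable $(x, y)$ indexes the support of the worst-case measure while $(x', y')$ indexes the support of $\hat{\mu}$. $\pi$ is a transport plan that joins these two measures. Observe that only the constraint on the $\mathcal{X}$-marginal is infinite dimensional. We will show that this constraint corresponds in the dual problem to an expectation under $\mu_{\mathcal{X}}$ of a finite dimensional cost.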
While the program (6) is infinite dimensional, its dual is a problem in finite dimensions:
(8) 
Here $\alpha \in \mathbb{R}_+$, $\beta \in \mathbb{R}^N$, and $\underline{\lambda}, \overline{\lambda} \in \mathbb{R}_+^K$. Each of these dual variables corresponds to a primal constraint: $\alpha$ corresponds to the constraint on the transport cost, $\beta$ to the constraint that the second marginal be the labeled data distribution, $\underline{\lambda}$ to the lower bound on the worst-case label probabilities, and $\overline{\lambda}$ to the upper bound. The infinite-dimensional constraint that the first marginal of the primal transport plan have the unlabeled-data distribution as its marginal on the feature space corresponds here to the expectation in the objective. This correspondence is established in much more detail in the proof of Theorem 3.3 (Appendix §A), which shows that the two problems (6) and (8) are in fact equivalent.
We state our main theoretical result, whose proof is deferred to the appendix (§A): {theorem}[Strong duality] Let be a compact Polish space and any finite set. Let be a probability measure over and an empirical probability measure over , and define intervals , . Let the transportation cost be nonnegative and upper semicontinuous with . Assume is upper semicontinuous. Define as in (6) and as in (8). If , then
(9) 
Furthermore, if , then there exists a minimizer attaining the infimum in (8). The takeaway message of Theorem 3.3 is that distributionallyrobust learning under the model proposed here amounts to minimizing with respect to , a finite dimensional problem that can be tackled with stochastic gradient approaches.
4 Algorithm and analysis
4.1 Optimization by SGD
Problem (8) is a convex, finite dimensional optimization problem in that is the sum of a linear term and an expectation under . To apply stochastic gradient descent, we first need to compute derivatives under the variables .
We first compute derivatives of the term under the expectation. Define as the function
The dual objective can be expressed as a function of as
For a given choice of and , there is a set of points where the maximum is achieved at . We can define a collection of subsets of
(10) 
The sets partition , up to boundary points where the sets meet one another. We can decompose the expectation as a finite sum of integrals over domains , i.e.
(11) 
Note that changes depending on the parameters . To evaluate a derivative with respect to one of these parameters, then, we need to differentiate under the integral sign. Applying Reynolds’ Transport Theorem, we obtain that
(12) 
and the same holds for the other parameters .
The exact forms for the derivatives of the dual objective are given in the appendix (§B). To simplify notation further, we define
(13) 
We can optimize for the optimal dual parameters by sampling from and computing gradients of with respect to the dual variables while maintaining constraints. This approach is summarized in Algorithm 1.
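The structure described in this section, a linear term plus an expectation of a pointwise maximum over finitely many pieces, can be illustrated on a toy objective. The sketch below is not the dual (8): the score function, the weights, and all names are hypothetical, and only one block of variables is present. It relies on the standard fact that, wherever the maximizing piece is unique, the gradient of the maximum is the gradient of that piece, so each sample contributes through the piece whose region it falls in.

```python
def value_and_subgradient(samples, score, weights, beta):
    """
    Toy objective (an assumption, not the paper's dual):
        F(beta) = sum_j weights[j] * beta[j]
                  + mean over samples x of max_j (score(x, j) - beta[j])
    Returns F(beta) and one subgradient with respect to beta.
    """
    n = len(samples)
    value = sum(w * b for w, b in zip(weights, beta))
    grad = list(weights)  # gradient of the linear term
    for x in samples:
        pieces = [score(x, j) - beta[j] for j in range(len(beta))]
        j_star = max(range(len(beta)), key=pieces.__getitem__)  # region containing x
        value += pieces[j_star] / n
        grad[j_star] -= 1.0 / n  # derivative of -beta[j_star], averaged over samples
    return value, grad

# Hypothetical pieces: negative squared distance to two fixed centers.
centers = [0.0, 1.0]

def score(x, j):
    return -(x - centers[j]) ** 2

samples = [-1.0, 2.0, 3.0]
weights = [0.5, 0.5]
value0, grad0 = value_and_subgradient(samples, score, weights, [0.0, 0.0])

# Plain subgradient descent; a stochastic variant would pass a random minibatch
# of samples at each step instead of the full list.
beta = [0.0, 0.0]
for _ in range(200):
    _, g = value_and_subgradient(samples, score, weights, beta)
    beta = [b - 0.1 * gj for b, gj in zip(beta, g)]
```

Each coordinate of the subgradient is the weight of a piece minus the fraction of samples assigned to it, so the update only needs the region membership of each sampled point.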
4.2 A computable performance guarantee
An attractive feature of traditional Wasserstein DRL is that the optimal value of the objective upper-bounds the true expected risk, provided that the adversary’s decision set contains the true data distribution.
The proposed formulation using unlabeled data provides a similar guarantee. Specifically, for all , , we have
(14) 
with high probability, where is an empirical sample on points from the distribution on unlabeled data . In other words, the objective value for the dual problem provides a computable guarantee on the generalization error of the learned predictor .
This result follows from weak duality as the value of (14) upper bounds if is feasible, and an application of the Berry–Esseen theorem (Berry, 1941; Esseen, 1942) implies
with . This result gives a guarantee on the generalization error of , provided that . In §5.2, we compute this bound for a number of datasets and compare it to the equivalent bound from traditional DRL (Figure 5). Since the bound relies on the true distribution being included in the adversary’s decision set, the choice of the radius of the Wasserstein ball around the labeled data distribution is very important. We comment on the impact of in §5.1.
5 Empirical results
In this section, we investigate the empirical performance of our proposed formulation of distributionally robust learning in the particular case of logistic regression. First, we demonstrate an important limitation of the previously proposed distributionally robust logistic regression (Abadeh et al., 2015): there is often no choice of the radius of the adversary’s decision set that yields a classifier that is both robust and nontrivial. We then demonstrate that the formulation proposed here, which uses unlabeled data to restrict the adversary, can yield nontrivial classifiers with nonvacuous bounds on the generalization error.
5.1 How important is the choice of $\epsilon$?
In practice, we do not know the radius $\epsilon$ necessary to include the true data distribution in the Wasserstein ball around the labeled data. Standard practice for DRL is to choose $\epsilon$ by cross-validation, attempting to maximize a proxy for the out-of-sample performance. Implicitly, however, doing so relies on a regularization effect of traditional DRL, documented by Gao et al. (2017), which generates an inverted U-shaped out-of-sample performance curve with respect to $\epsilon$. Maximizing cross-validation performance does not necessarily yield a robust classifier in the DRL sense: as we demonstrate in Section 5.2, for some datasets there is no choice of $\epsilon$ that both includes the true data distribution in the adversary’s decision set and yields a nontrivial classifier using traditional DRL. In other words, the $\epsilon$ that maximizes generalization performance is much smaller than the distance between the labeled data and the true data distribution.
Here, we verify that the choice of $\epsilon$ matters critically for robustness in the sense of traditional DRL, meaning that a learned classifier that is robust to distributions within an $\epsilon$-ball is not robust to distributions even slightly outside the ball. In this sense, choosing $\epsilon$ by cross-validation in traditional DRL can yield a classifier that is not robust to perturbations on the order of the distance between the labeled data and the true data distribution.
We make two empirical observations:

Wasserstein distributional robustness out to radius $\epsilon$ does not confer robustness to distributions even slightly outside the $\epsilon$-ball, at a distance $\epsilon' > \epsilon$, in the sense that there exists a data distribution in the ball of radius $\epsilon'$ that yields poor performance for the traditional Wasserstein DRL predictor trained with radius of robustness $\epsilon$. This is shown for several datasets in Figure 3 and further in Appendix D.3.
Choosing $\epsilon$ in the proposed method
The choice of $\epsilon$ for the proposed method is constrained by the fact that the feasible set is empty for $\epsilon$ below a threshold, as there might be no distribution in the ball having the desired marginals on the feature and label spaces. This situation is easily detected in practice, as the value of the dual becomes unbounded below.
Empirically, with the proposed method, we find no evidence of a bias-variance tradeoff as the radius $\epsilon$ is varied, unlike traditional Wasserstein DRL. Figure 4 shows out-of-sample performance as we vary the difference between the radius $\epsilon$ and the minimal such radius for which the feasible set is nonempty. The performance is flat out to a radius beyond which the confidence of the learner decreases quickly. Appendix D.4 contains further examples.
This last observation suggests a criterion for choosing $\epsilon$ under the proposed DRL model: one chooses the maximum $\epsilon$ such that the confidence of the learned classifier is above a threshold. This is the as-robust-as-possible selection, as opposed to the maximum-cross-validation-performance selection often used in traditional DRL. There are multiple possible implementations of this criterion. In our experiments, for example, thresholding the median confidence on the unlabeled set often suffices to ensure that $\epsilon$ is large enough for the decision set to contain the true data distribution (Figure 6).
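One possible implementation of this selection rule is sketched below, under the assumption that a classifier has already been trained at each candidate radius and its median confidence on the unlabeled set recorded. The function name, the candidate radii, the confidence values, and the threshold are all hypothetical illustration values, not numbers from the experiments.

```python
def choose_radius(radii, median_confidences, threshold):
    """Return the largest radius whose median confidence meets the threshold."""
    admissible = [
        eps for eps, conf in zip(radii, median_confidences) if conf >= threshold
    ]
    # None signals that no candidate radius yields a confident enough classifier.
    return max(admissible) if admissible else None

# Hypothetical sweep: confidence stays flat, then drops as the radius grows.
radii = [0.1, 0.2, 0.4, 0.8]
median_confidences = [0.95, 0.93, 0.70, 0.52]
selected = choose_radius(radii, median_confidences, threshold=0.9)
```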
5.2 Empirical performance of learning with unlabeled data
In this section, we demonstrate the impact of the proposed method for constraining the adversary’s decision set using unlabeled data. We evaluate the performance guarantee offered by the previously proposed distributionally robust logistic regression model (Abadeh et al., 2015) on several binary classification datasets.
For each dataset, we sample a small number of labeled examples and compute the radius $\epsilon$ that is required to include the true (empirical) data distribution in the Wasserstein ball around the labeled data. This is the smallest $\epsilon$ for which the performance guarantee from DRL holds. We use the labeled examples to compute the distributionally robust logistic regression under the traditional model (Abadeh et al., 2015) and additionally use the set of unlabeled examples to compute the same regression under the proposed model. We compare the performance guarantee (i.e. the dual objective value) computed under each DRL model. Identical values of $\epsilon$ are used for both methods, but a different value of $\epsilon$ is computed for each sampled set of labeled examples.
We examine two settings for the proposed method. The first assumes a strong prior that specifies the exact (true) label probabilities, such that the lower and upper bounds on each label probability coincide with its true value. In practice such a strong prior might come from auxiliary data, such as in ecological inference, or from domain knowledge. The second setting assumes a weak prior that specifies only confidence intervals for the label probabilities, estimated directly from the labeled data (Clopper and Pearson, 1934).
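For the weak prior, an exact binomial (Clopper-Pearson) interval for one label probability can be computed from label counts alone. The following is a minimal sketch using only the standard library; the confidence level `alpha=0.05`, the bisection depth, and the function names are assumptions, since the level used in the experiments is not stated here.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _root_increasing(g, iters=60):
    """Bisection for the root of an increasing function g on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided interval for a label probability, given k of n labels."""
    # Lower end: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _root_increasing(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2
    )
    # Upper end: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _root_increasing(
        lambda p: alpha / 2 - binom_cdf(k, n, p)
    )
    return lower, upper

# Example: 5 of 10 labeled examples carry the positive label.
lower, upper = clopper_pearson(5, 10)
```

With few labeled examples the resulting interval is wide, which is what makes this prior weak compared with knowing the exact label probabilities.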
We vary the number of labeled examples and examine the computed performance guarantee, shown in Figure 5, as well as the median confidence of the learned predictor, shown in Figure 6. The former is the worst-case guarantee (6) and not the actual generalization performance of the learned classifier. We make three observations:

For all but one of the datasets, the performance bound computed by traditional DRL is vacuous (guaranteeing only likelihood greater than or equal to ), while the learned classifier is trivial (assigning equal probability to both classes), for all tested numbers of labeled examples (maximum ).

For all datasets, the proposed DRL using unlabeled data and either a strong prior or a weak prior on the label probabilities yields a nonvacuous performance bound and a nontrivial classifier, for numbers of labeled examples at which traditional DRL is vacuous.

The strong prior on label probabilities can yield highly nontrivial performance bounds, for smaller numbers of labeled examples than the weak prior.
We have chosen $\epsilon$ as small as possible while ensuring the computed performance guarantee holds, and the performance bound computed by either algorithm gets monotonically worse as $\epsilon$ increases.
5.3 Discussion
The overwhelming size of an adversary’s decision set is a weakness of Wasserstein DRL that prevents a reasonable tradeoff between robustness and confidence of the learned predictor. To circumvent this problem, we use unlabeled data to further constrain the decision set. Empirically, the proposed DRL problem produces nontrivial predictors having nonvacuous performance guarantees in cases where traditional Wasserstein DRL fails.
One topic we have not addressed is computational complexity. The proposed DRL is computed via stochastic gradient descent. Each gradient computation scales linearly in the number of labeled examples and this scaling might prohibit application to large labeled datasets. The key bottleneck is computing membership in the sets in (10), which relies on a maximization over labeled examples. This computation might be a fruitful target for performance improvement, possibly via parallelization or by leveraging the fact that the cost function is a power of a metric.
6 Application: Distributionally-robust active learning
Key to the learning algorithm of Section 4.1 is a mechanism for optimizing an objective over the intersection of a Wasserstein ball with the set of distributions that have prescribed marginals on the feature and label spaces. Learning a classifier is just one possible application of this mechanism, however. In this section, we demonstrate another application, to active learning.
6.1 Model change heuristics
Given a set of labeled data and a set of unlabeled data, an active learner attempts to choose the most beneficial example from the unlabeled set for which to acquire a label. The goal of the active learner is to reduce the out-of-sample error of the predictor trained on the labeled set as rapidly as possible. Many active learning methods assign a score to each unlabeled example, indicating its predicted impact on the learned classifier if we choose to acquire its label. This score might represent various properties, such as model uncertainty, expected error reduction, or expected model change. In the current work we focus on model change criteria (Settles et al., 2008; Freytag et al., 2014; Cai et al., 2017), which are popular and often effective in practice (Yang and Loog, 2018).
In model change criteria, we define an impact function $I(x, y)$, which is large when acquiring the label $y$ for point $x$ leads to a large change in the model parameters. Most often this is a norm of the parameter gradient (Yang and Loog, 2018),
$I(x, y) = \| \nabla_\theta \ell(h_\theta(x), y) \|$
for a norm $\| \cdot \|$ and $h_\theta$ the hypothesis trained on the labeled data. The active learning heuristic selects the unlabeled point $x$ that maximizes an estimate of the anticipated impact across possible labels at point $x$. This might be the conditional expectation according to the model distribution at $x$, the maximum over labels, or the minimum over labels:

Expected impact: $\mathbb{E}_{y \sim h_\theta(x)}[I(x, y)]$, with $h_\theta$ trained on the labeled data.

Optimistic: $\max_{y} I(x, y)$.

Conservative: $\min_{y} I(x, y)$.
A potential problem with the expected model change criterion, which it shares with a number of other standard heuristics (Yang and Loog, 2018), is that it relies on the current hypothesis when predicting the impact of choosing a new point to label. Specifically, the label distribution predicted by the hypothesis at a point is used in place of the true conditional distribution over labels at that point. This is prone to error when the hypothesis is far from the true conditional distribution, incorrectly weighting the impact of obtaining labels at the points where the hypothesis is in error.
The “optimistic” and “conservative” estimates above are simple attempts to eliminate the hypothesis from the estimated impact. Notably, these ignore the labeled data entirely.
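The three scores can be made concrete for binary logistic regression, where the gradient of the log loss with respect to the parameters is $(\sigma(\theta^\top x) - y)\,x$, so the Euclidean gradient norm has a closed form. The sketch below assumes labels in {0, 1}, the Euclidean norm, and an unregularized loss; the function names are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def predicted_positive_probability(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def impact(theta, x, y):
    """Euclidean norm of the log-loss gradient at (x, y): |sigmoid(theta.x) - y| * ||x||."""
    p = predicted_positive_probability(theta, x)
    return abs(p - y) * math.sqrt(sum(xi * xi for xi in x))

def model_change_scores(theta, x):
    """Expected, optimistic, and conservative impact estimates at an unlabeled point x."""
    p = predicted_positive_probability(theta, x)
    i0, i1 = impact(theta, x, 0), impact(theta, x, 1)
    return {
        "expected": (1 - p) * i0 + p * i1,  # labels weighted by the model's own prediction
        "optimistic": max(i0, i1),
        "conservative": min(i0, i1),
    }

# The model predicts the positive label with probability 0.75 at this point.
scores = model_change_scores([math.log(3.0), 0.0], [1.0, 0.0])
```

Only the expected score depends on the model's predicted label probabilities as weights; the other two use the model solely through the gradient norms.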
6.2 A distributionally-robust approach
The machinery presented in Section 3 provides an alternative way to eliminate the hypothesis from our estimate of the impact of labeling a given point. We can formulate a distributionally-robust estimate of the impact, which computes a lower bound on the expected impact with respect to an entire set of plausible data distributions, rather than just the model distribution. This lower bound can be closer to the true expected impact (under the true data distribution) than the naïve conservative estimate, as our set of plausible distributions need not include those that are unreasonably far from the training set.
More precisely, we can formulate the problem of choosing the next sample to label as
(15) 
with as in Section 3, the intersection of a Wasserstein ball centered at the labeled data and the set of distributions having the prescribed marginals. Note that for all , meaning that the objective in (15) is in fact linear in . In practice, given the unlabeled data , we can approximate this term by density estimation. We will use the notation for this approximation.
The inner problem in (15) estimates the impact of labeling the point . Just as in the DRL problem formulated in Section 3, this is the optimization of an objective with respect to a probability measure constrained to the feasible set .
(16) 
with
(17) 
and the impact function from Section 6.1.
6.3 Empirical Results
We evaluate active learning performance on the set of binary classification datasets used in Section 3. Given a set of labeled examples, a linear classifier is trained by regularized logistic regression, with the weight on the regularizer fixed a priori. Given this classifier and a set of unlabeled examples, the active learning algorithm selects the next example for which to acquire a label. The process is iterated, beginning with examples chosen uniformly at random (the same initial examples for all active learning methods, but different initial examples between trials), and terminating after labeled examples have been acquired.
After training the classifier at each step, we evaluate the error (the likelihood) on the combination of labeled and unlabeled data, to provide a score that is comparable between steps. This score has previously been proposed as a proxy for out-of-sample error in both the semi-supervised (Grandvalet and Bengio, 2005) and active (Guo and Schuurmans, 2008) learning settings.
We compare the proposed distributionally-robust active learning method to both random sampling and the existing model-change heuristics (described in Section 6.1). Specifically, we test five methods:

Random: We choose the next example uniformly at random.

EMC: We choose the example that maximizes the expected norm of the parameter gradient, under the hypothesis distribution (Settles et al., 2008).
