# Revisiting Distributionally Robust Supervised Learning in Classification

Weihua Hu    Gang Niu    Issei Sato    Masashi Sugiyama
###### Abstract

Distributionally Robust Supervised Learning (DRSL) is necessary for building reliable machine learning systems. When machine learning is deployed in the real world, its performance can be significantly degraded because test data may follow a different distribution from training data. Previous DRSL minimizes the loss for the worst-case test distribution. However, our theoretical analyses show that the previous DRSL essentially reduces to ordinary empirical risk minimization in a classification scenario. This implies that the previous DRSL ends up learning classifiers exactly for the given training data even though it is designed to be robust to distribution shift from the training dataset. In order to learn practically useful robust classifiers, our theoretical analyses motivate us to structurally constrain the distribution shift considered by DRSL. To this end, we propose a novel DRSL that can incorporate structural assumptions on distribution shift and learn useful robust decision boundaries based on those assumptions. We derive efficient gradient-based optimization algorithms and establish the convergence rate of the model parameter as well as the order of the estimation error for our DRSL. The effectiveness of our DRSL is demonstrated through experiments.

unsupervised learning, representation learning

## 1 Introduction

Supervised learning has been successful in many application fields. The vast majority of supervised learning research falls into the Empirical Risk Minimization (ERM) framework (Vapnik, 1998), which assumes that the test distribution is the same as the training distribution. However, this assumption is easily violated in real-world applications due to sample selection bias or non-stationarity of the environment (Quiñonero-Candela et al., 2009). Once distribution shift occurs, the performance of traditional machine learning techniques can be significantly degraded. This makes the traditional techniques unreliable for practitioners to use in the real world.

Distributionally Robust Supervised Learning (DRSL) (Bagnell, 2005; Ben-Tal et al., 2009; Wen et al., 2014; Duchi et al., 2016; Namkoong and Duchi, 2016; Globerson and Roweis, 2006; Liu and Ziebart, 2014; Chen et al., 2016) obtains prediction functions that are explicitly robust to distribution shift. More specifically, DRSL considers a minimax game between a learner and an adversary: the adversary first shifts the test distribution from the training distribution within a specified range so as to maximize the expected loss w.r.t. the test distribution. The learner then minimizes the adversarial expected loss.

DRSL with $f$-divergences (Bagnell, 2005; Ben-Tal et al., 2009; Duchi et al., 2016; Namkoong and Duchi, 2016) is particularly well-studied and takes into account all possible test distributions within a specified range from a training distribution measured by an $f$-divergence. However, the behavior of the previous DRSL is not well-understood in a classification scenario.

The goal of this paper is to provide a better theoretical understanding of the previous DRSL (Bagnell, 2005; Ben-Tal et al., 2009; Duchi et al., 2016; Namkoong and Duchi, 2016) in the classification scenario by relating DRSL to ERM. We then propose a novel DRSL based on our theoretical insight. Our contributions are fourfold.

1. We show a series of theoretical analyses suggesting that the previous DRSL is essentially equivalent to ordinary ERM (Theorems 1, 2 and 3). This implies that the previous DRSL results in learning classification boundaries for the training distribution even though it is designed to be robust to change from the training distribution.

2. To learn practically useful robust classifiers, our theoretical analyses motivate us to structurally constrain the distribution shift considered by DRSL. To this end, we propose a novel DRSL that allows its users to easily incorporate their structural assumptions and that learns robust classifiers based on those assumptions.

4. We establish the convergence rate of the model parameter as well as the order of the estimation error for our DRSL.

### Related Work:

Most existing research on covariate shift adaptation involves estimating the density ratio by using both training and test data (Sugiyama and Kawanabe, 2012). The estimated density ratio is then used to reweight the training data to provide an unbiased estimate of the risk (Shimodaira, 2000; Quiñonero-Candela et al., 2009). On the other hand, DRSL assumes that test data are not provided at training time and tries to be robust to potential distribution shift.

Besides the DRSL with $f$-divergences, DRSL by Globerson and Roweis (2006) learns a classifier robust to deletion of a subset of features, while DRSL by Liu and Ziebart (2014) learns a classifier robust to unknown properties of the conditional label distribution, which was then extended by Chen et al. (2016) into a regression setting.

## 2 Review of ERM and DRSL

In this section, we first review the ordinary ERM framework. Then, we explain a general formulation of distributionally robust supervised learning (DRSL) and review the previous DRSL with $f$-divergences (Bagnell, 2005; Ben-Tal et al., 2009; Duchi et al., 2016; Namkoong and Duchi, 2016).

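Suppose training samples, $\{(x_i, y_i)\}_{i=1}^N$, are drawn i.i.d. from an unknown training distribution over $\mathcal{X} \times \mathcal{Y}$ with density $p(x, y)$, where $\mathcal{X} \subset \mathbb{R}^d$ and $\mathcal{Y}$ is an output domain. Let $g_\theta$ be a prediction function with parameter $\theta$, and let $\ell(\hat{y}, y)$ be a loss between $y$ and its prediction $\hat{y}$. Assume is of class w.r.t. .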

### ERM:

The objective of the risk minimization (RM) is

$$\min_\theta \mathbb{E}_{p(x,y)}[\ell(g_\theta(x), y)]. \tag{1}$$

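In ERM, we approximate Eq. (1) by training data $\{(x_i, y_i)\}_{i=1}^N$:

$$\min_\theta \frac{1}{N}\sum_{i=1}^{N} \ell(g_\theta(x_i), y_i) + \lambda\Omega(\theta), \tag{2}$$

where we add the regularization term $\Omega(\theta)$, and $\lambda$ is a trade-off hyper-parameter. In the following, we will denote the risk in Eq. (1) as $\mathcal{R}(\theta)$ and the empirical risk in Eq. (2) as $\hat{\mathcal{R}}(\theta)$.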

### General Formulation of DRSL:

In ordinary ERM, the test distribution is assumed to be the same as the training distribution, which often does not hold in practice. DRSL explicitly considers the distribution shift scenario, where test density $q(x, y)$ is different from training density $p(x, y)$. Let $\mathcal{Q}_p$ be an uncertainty set for test distributions.

In DRSL, the learning objective is to minimize the risk w.r.t. the most adversarial test distribution in uncertainty set $\mathcal{Q}_p$:

$$\min_\theta \sup_{q \in \mathcal{Q}_p} \mathbb{E}_{q(x,y)}[\ell(g_\theta(x), y)]. \tag{3}$$

### DRSL with f-divergences:

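Bagnell (2005), Ben-Tal et al. (2009), Duchi et al. (2016), and Namkoong and Duchi (2016) consider the uncertainty set as $\mathcal{Q}_p = \{q \mid D_f[q \,\|\, p] \le \delta\}$, where $D_f[\cdot \,\|\, \cdot]$ is an $f$-divergence defined as $D_f[q \,\|\, p] \equiv \int p(x,y)\, f\!\left(q(x,y)/p(x,y)\right) dx\, dy$ and $f$ is a convex function with $f(1) = 0$. The scalar, $\delta$, controls the degree of the distribution shift. Let $r(x, y) \equiv q(x, y)/p(x, y)$. Then, the objective of DRSL with $f$-divergences can be rewritten as

$$\min_\theta \sup_{r \in \mathcal{U}_f} \mathbb{E}_{p(x,y)}[r(x,y)\,\ell(g_\theta(x), y)], \tag{4}$$

$$\mathcal{U}_f \equiv \left\{ r(x,y) \;\middle|\; \mathbb{E}_{p(x,y)}[f(r(x,y))] \le \delta,\ \mathbb{E}_{p(x,y)}[r(x,y)] = 1,\ r(x,y) \ge 0,\ \forall (x,y) \in \mathcal{X} \times \mathcal{Y} \right\}. \tag{5}$$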

In Eq. (4), the density ratio, $r(x, y)$, can be considered as the weight put by the adversary on the loss of labeled data $(x, y)$. Then, Eq. (4) can be regarded as a minimax game between the learner and the adversary. In the game, the adversary first reweights the losses so as to maximize the total loss. The learner then minimizes the cost-sensitive risk. In the rest of the paper, we will call the minimization objective of Eq. (4) the adversarial risk and denote it by $\mathcal{R}_{\rm adv}(\theta)$. Correspondingly, we will call Eq. (4) the adversarial risk minimization (ARM).

For notational convenience, let us denote $\ell(g_\theta(x_i), y_i)$ by $\ell_i(\theta)$. Also, let $\boldsymbol{r} \equiv (r_1, \ldots, r_N)$ be a vector of density ratios evaluated at training data points, i.e., $r_i \equiv r(x_i, y_i)$ for $1 \le i \le N$. Bagnell (2005) empirically approximated the minimization objective in Eq. (4) and the uncertainty set in Eq. (5) as

$$\min_\theta \sup_{\boldsymbol{r} \in \hat{\mathcal{U}}_f} \frac{1}{N}\sum_{i=1}^{N} r_i \ell_i(\theta) + \lambda\Omega(\theta), \tag{6}$$

$$\hat{\mathcal{U}}_f = \left\{ \boldsymbol{r} \;\middle|\; \frac{1}{N}\sum_{i=1}^{N} f(r_i) \le \delta,\ \frac{1}{N}\sum_{i=1}^{N} r_i = 1,\ \boldsymbol{r} \ge 0 \right\}, \tag{7}$$

where the inequality constraint for a vector is applied in an element-wise fashion. Equation (6) can again be regarded as a minimax game between the learner and the adversary. We will call the minimization objective of Eq. (6) without the regularization term the adversarial empirical risk and denote it by $\hat{\mathcal{R}}_{\rm adv}(\theta)$. Correspondingly, we will call Eq. (6) the adversarial empirical risk minimization (AERM).

## 3 Analysis of DRSL with f-divergences in classification

We now show a series of results suggesting the equivalence between ARM (resp. AERM) and ordinary RM (resp. ERM) in a classification scenario.

### Setting:

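In binary classification, the prediction function is $g_\theta: \mathcal{X} \to \mathbb{R}$. In $K$-way multi-class classification, the prediction function is $g_\theta: \mathcal{X} \to \mathbb{R}^K$. The goal of classification is to learn the prediction function that minimizes the mis-classification rate w.r.t. the test distribution. The mis-classification rate corresponds to the use of the 0-1 loss, i.e., $\ell(\hat{y}, y) \equiv \mathbb{1}\{\mathrm{sign}(\hat{y}) \neq y\}$ for binary classification, and $\ell(\hat{y}, y) \equiv \mathbb{1}\{\arg\max_k \hat{y}_k \neq y\}$ for multi-class classification, where $\mathbb{1}\{\cdot\}$ is the indicator function.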

### The 0-1 loss case:

Theorem 1 establishes the non-trivial relationship between the adversarial risk and the ordinary risk when the 0-1 loss is used.

###### Theorem 1.

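Let $\ell(\hat{y}, y)$ be the 0-1 loss. Then, there is a monotonic relationship between the adversarial risk, $\mathcal{R}_{\rm adv}(\theta)$, and the ordinary risk, $\mathcal{R}(\theta)$, in the sense that for any pair of parameters $\theta_1$ and $\theta_2$, the following hold.

If $\mathcal{R}_{\rm adv}(\theta_1) < 1$, then

$$\mathcal{R}(\theta_1) < \mathcal{R}(\theta_2) \Longrightarrow \mathcal{R}_{\rm adv}(\theta_1) < \mathcal{R}_{\rm adv}(\theta_2). \tag{8}$$

If $\mathcal{R}_{\rm adv}(\theta_1) = 1$, then

$$\mathcal{R}(\theta_1) \le \mathcal{R}(\theta_2) \Longrightarrow \mathcal{R}_{\rm adv}(\theta_2) = 1. \tag{9}$$

The same monotonic relationship also holds between their empirical approximations: $\hat{\mathcal{R}}_{\rm adv}(\theta)$ and $\hat{\mathcal{R}}(\theta)$.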

The monotonic relationship is illustrated in Figure 1 (i). The key in proving Theorem 1 is that the adversary of ARM can reweight the losses via $r(x, y)$ in an $(x, y)$-wise manner, meaning that the values of $r(x, y)$ are not tied together for different $(x, y)$ as long as $r \in \mathcal{U}_f$. Consequently, the adversary simply assigns a larger $r(x, y)$ to $(x, y)$ with a larger loss $\ell(g_\theta(x), y)$. This fact, combined with the fact that the loss only takes either 0 or 1, leads to Theorem 1. See Appendix A for the detailed proof.

Theorem 1 shows a rather surprising result: when the 0-1 loss is used, the risk and the adversarial risk are essentially equivalent as minimization objectives, in that minimizing one also minimizes the other. In practice, the 0-1 loss is hard to optimize directly. Nonetheless, we can use the 0-1 loss in the validation stage for choosing hyper-parameters such as $\lambda$. Theorem 1 indicates that if we select the hyper-parameter according to the adversarial risk with the 0-1 loss, we will end up choosing the hyper-parameter with the minimum mis-classification rate w.r.t. the training distribution.
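The monotonic relationship can be checked numerically in a special case. The sketch below assumes the PE divergence with $f(r) = (r-1)^2$ and uses the fact that, for the 0-1 loss, an optimal adversary can put one common weight on all misclassified points and another on all correctly classified points (this follows from the convexity of $f$). Under these assumptions the adversarial risk has the closed form $\min\{1, e + \sqrt{\delta e (1-e)}\}$ for ordinary risk $e$. The function name and the value of `delta` are illustrative choices, not taken from the paper.

```python
import numpy as np

def adversarial_risk_01_pe(err, delta):
    """Adversarial 0-1 risk under the PE divergence f(r) = (r - 1)^2.

    `err` is the ordinary 0-1 risk. The adversary is assumed to use one
    weight for misclassified points and one for correct points.
    """
    err = np.asarray(err, dtype=float)
    # Unclipped optimum: err + sqrt(delta * err * (1 - err)). The cap at 1
    # is reached once the weight on correct points has been driven to zero.
    return np.minimum(1.0, err + np.sqrt(delta * err * (1.0 - err)))

errs = np.linspace(0.0, 1.0, 101)          # grid of ordinary risks
adv = adversarial_risk_01_pe(errs, delta=0.5)
```

On this grid the adversarial risk is strictly increasing in the ordinary risk until it saturates at 1, so ranking parameters by either quantity gives the same ordering below saturation.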

### The surrogate loss case:

We now turn our focus to the training stage and consider the use of a surrogate loss. In particular, we consider classification-calibrated surrogate losses (Bartlett et al., 2006; Tewari and Bartlett, 2007), which are widely used in practice and include the logistic loss for binary classification and the cross-entropy loss for multi-class classification. It is well-known that when we use a classification-calibrated surrogate loss, ordinary RM is able to learn the Bayes optimal classifier if the Bayes optimal classifier is contained in the hypothesis class (Bartlett et al., 2006; Tewari and Bartlett, 2007). Theorem 2 shows the surprising fact that the same property also holds for ARM.

###### Theorem 2.

Let $\ell$ be the classification-calibrated loss. Assume that the hypothesis class contains the Bayes optimal classifier for training density $p(x, y)$. Then, ARM learns the Bayes optimal classifier for $p(x, y)$.

See Appendix B for the proof. The proof is again built on the fact that the adversary of ARM can reweight the losses in an $(x, y)$-wise manner. Theorem 2 indicates that ARM, similarly to RM, ends up learning the optimal decision boundary for the training distribution, if the model is correctly specified. Even though the assumptions made are strong (an infinite number of training data and no model mis-specification), Theorem 2 establishes the non-trivial theoretical connection between the ordinary RM and the ARM, highlighting the asymptotic behavior of the AERM.

Of course, in practice, we only have a finite amount of training data and the model may be mis-specified. Theorem 3 focuses on binary classification and provides an analysis for the practical scenario.

###### Theorem 3.

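Consider binary classification and let $\ell(\hat{y}, y)$ be convex in $\hat{y}$. Also, let $g_\theta(x)$ be linear in $\theta$ and $\Omega(\theta)$ be strongly convex in $\theta$. Then, regularized AERM w.r.t. $\ell$ (Eq. (6)) is strongly convex in $\theta$. Also, there exists a steeper loss, $\ell_{\rm DRSL}$, such that the unique optimum solution of the regularized AERM coincides with that of the regularized ERM using $\ell_{\rm DRSL}$:

$$\min_\theta \frac{1}{N}\sum_{i=1}^{N} \ell_{\rm DRSL}(g_\theta(x_i), y_i) + \lambda\Omega(\theta), \tag{10}$$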

where $\ell_{\rm DRSL}$ is steeper than $\ell$ in the sense that $\partial \ell_{\rm DRSL} / \partial \ell$ is a non-decreasing function of $\ell$.

The proof outline is as follows. We first construct a new loss function, $\ell_{\rm DRSL}$, whose derivative at each training point equals $r_i^*$ times that of $\ell$ for $1 \le i \le N$, where the $r_i^*$'s are the adversarial weights evaluated at the solution of AERM. By using Danskin's theorem (Danskin, 1966), we can show that the solution of ERM using $\ell_{\rm DRSL}$ coincides with the solution of AERM using $\ell$. Furthermore, as shown in Eq. (6), the adversary of AERM reweights the losses using $r_i$ in a sample-wise manner. As a result, the adversary simply assigns larger weights to data points with larger losses. Because of this, the newly constructed loss function, $\ell_{\rm DRSL}$, is steeper than the original loss function, $\ell$. See Appendix C for the formal proof.

We note that $\ell_{\rm DRSL}$ in the proof is constructed after AERM is exactly solved; hence, the exact form of $\ell_{\rm DRSL}$ is in general unknown beforehand. Nonetheless, Theorem 3 highlights the following theoretical properties of AERM. First, Theorem 3 indicates that AERM using a convex surrogate loss reduces to ordinary ERM using a steeper convex surrogate loss. The steeper loss is also classification-calibrated if the original loss is classification-calibrated. Second, Theorem 3 shows that AERM is more sensitive to outliers because it uses the steeper loss function.

### Illustration of Theorems 2 and 3:

The claims of Theorems 2 and 3 are illustrated in Figure 1 (ii-b) using a simple toy dataset. We note that the class-conditionals were Gaussian in the toy dataset; hence, the linear model was correctly-specified. We see from Figure 1 (ii-b) that the previous DRSL learned a similar decision boundary as ordinary ERM did, which verifies the claim of Theorem 2. Comparing the dotted lines in Figures 1 (ii-a) and (ii-b), we also see that the previous DRSL is more sensitive to the inserted outlier, which coincides with the claim of Theorem 3.

In summary, Theorems 1, 2 and 3 together imply that ARM (resp. AERM), similarly to ordinary RM (resp. ERM), still ends up learning decision boundaries for the training distribution even though it is designed to be robust to distribution shift from training distribution.

## 4 DRSL with Latent Prior Probability Change

### Theoretical motivation:

Why do ARM and AERM (the previous DRSL) reduce to ordinary RM and ERM in Theorems 1, 2 and 3? Our proofs of the theorems are crucially built on the fact that the adversary of ARM (resp. AERM) can reweight the losses through $r(x, y)$ in the $(x, y)$-wise (resp. sample-wise) manner. Without any structural constraints to tie the values of $r(x, y)$ for different $(x, y)$, Theorems 1, 2 and 3 always hold and we cannot learn practically useful robust classifiers. To overcome this problem, our theoretical analyses motivate us to structurally constrain $r(x, y)$, or equivalently, to impose structural assumptions on the distribution shift. To this end, in this section, we propose a novel DRSL that can incorporate the structural assumptions and learn robust classifiers based on the assumptions.

### Practical considerations:

There may be various ways to incorporate structural assumptions on potential distribution shift considered by DRSL. The question is: what is the desirable class of structural assumptions to adopt in DRSL? In practice, a class of structural assumptions with the following properties is preferred.

1. Within the class, users of DRSL can easily and intuitively model their distribution shift assumptions.

2. Efficient learning algorithms can be derived.

To this end, we adopt the latent prior probability change assumption (Storkey and Sugiyama, 2007), which satisfies both of the properties. We introduce a latent variable $z \in \mathcal{Z}$, which we call a latent category. The latent prior probability change assumes

$$p(x, y \mid z) = q(x, y \mid z), \qquad q(z) \neq p(z), \tag{11}$$

where $p$ and $q$ are the training and test distributions, respectively.

The intuition behind the assumption is as follows. We assume a two-level hierarchical data-generation process: we first sample latent category $z$ from the prior and then sample actual data $(x, y)$ from the conditional. We then assume that the distribution changes only at the latent category level, leaving the conditionals intact. Letting $z = y$, we have the class-prior change assumption (Saerens et al., 2002) as a special case. Similarly, we can also let $z$ be more refined categories such as sub-categories (Ristin et al., 2015). This reduces to the assumption that only the sub-category prior changes but the sub-category conditionals remain the same. Later in this section, we will see in more detail how users of our DRSL may specify the latent categories.

### Objective function of our DRSL:

With the latent prior probability change in Eq. (11), the uncertainty set for a test distribution in our DRSL becomes

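$$\mathcal{Q}_p = \left\{ q(x,y,z) \;\middle|\; D_f[q(x,y,z) \,\|\, p(x,y,z)] \le \delta,\ q(x,y \mid z) = p(x,y \mid z) \right\}. \tag{12}$$

Then, corresponding to Eq. (3), the objective of our DRSL can be written as

$$\min_\theta \sup_{w \in \mathcal{W}_f} \mathbb{E}_{p(x,y,z)}[w(z)\,\ell(g_\theta(x), y)], \tag{13}$$

$$\mathcal{W}_f \equiv \left\{ w(z) \;\middle|\; \sum_{z \in \mathcal{Z}} p(z) f(w(z)) \le \delta,\ \sum_{z \in \mathcal{Z}} p(z) w(z) = 1,\ w(z) \ge 0,\ \forall z \in \mathcal{Z} \right\}, \tag{14}$$

where $w(z) \equiv q(z)/p(z) = q(x,y,z)/p(x,y,z)$ because of $q(x,y \mid z) = p(x,y \mid z)$. We will call the minimization objective of Eq. (13) the structural adversarial risk and denote it by $\mathcal{R}_{\rm s\text{-}adv}(\theta)$. Correspondingly, we will call our DRSL in Eq. (13) the structural adversarial risk minimization (structural ARM) in the following.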

### Decomposition of structural adversarial risk:

To better understand the property of the structural adversarial risk, we consider the condition where the Pearson (PE) divergence is used and $\delta$ is not so large. Under this condition, we can decompose the structural adversarial risk as (refer to Appendix E for the derivation)

$$\mathcal{R}_{\rm s\text{-}adv}(\theta) = \mathcal{R}(\theta) + \sqrt{\delta \cdot \sum_{z \in \mathcal{Z}} p(z)\left(\mathcal{R}_z(\theta) - \mathcal{R}(\theta)\right)^2}, \tag{15}$$

where $\mathcal{R}(\theta)$ is the ordinary risk and $\mathcal{R}_z(\theta) \equiv \mathbb{E}_{p(x,y \mid z)}[\ell(g_\theta(x), y)]$, $z \in \mathcal{Z}$, is the risk of latent category $z$. The second term of the right-hand side of Eq. (15) is determined by the variance among the risks of the latent categories, which corresponds to the sensitivity of the classifier to the distribution shift that occurs at the latent category level. The smaller the variance is, the less sensitive the learned function is to the distribution shift. Hence, minimizing the structural adversarial risk amounts to simultaneously minimizing two different objectives: (1) the ordinary risk and (2) the sensitivity to the specified distribution shift. Analogous arguments also hold when other $f$-divergences are used.
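A short sketch of why the decomposition in Eq. (15) takes this form, assuming $f(x) = (x-1)^2$ for the PE divergence and assuming $\delta$ is small enough that the non-negativity constraint on $w(z)$ is inactive. Here $\mathcal{R}(\theta)$ denotes the ordinary risk, $\mathcal{R}_z(\theta)$ the risk of latent category $z$, and $u(z) \equiv w(z) - 1$ is notation introduced only for this sketch; the formal derivation is the one in Appendix E.

```latex
\begin{aligned}
\sup_{w \in \mathcal{W}_f} \sum_{z \in \mathcal{Z}} p(z)\, w(z)\, \mathcal{R}_z(\theta)
&= \mathcal{R}(\theta)
 + \sup_{u} \Big\{ \sum_{z} p(z)\, u(z)\, \big(\mathcal{R}_z(\theta) - \mathcal{R}(\theta)\big)
   \;:\; \sum_{z} p(z)\, u(z) = 0,\ \sum_{z} p(z)\, u(z)^2 \le \delta \Big\} \\
&= \mathcal{R}(\theta)
 + \sqrt{\delta \cdot \sum_{z} p(z)\, \big(\mathcal{R}_z(\theta) - \mathcal{R}(\theta)\big)^2 } .
\end{aligned}
```

The first line uses $\sum_z p(z)\mathcal{R}_z(\theta) = \mathcal{R}(\theta)$ and the zero-mean constraint on $u$. The second line is the Cauchy-Schwarz inequality, attained at $u(z) \propto \mathcal{R}_z(\theta) - \mathcal{R}(\theta)$, which automatically has zero mean under $p(z)$.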

### Empirical approximation of the objective:

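Suppose we are given training data $\{(x_i, y_i, z_i)\}_{i=1}^N$ drawn from $p(x, y, z)$. For $1 \le s \le S$, define $\mathcal{G}_s \equiv \{i \mid z_i = s\}$, which is a set of data points belonging to latent category $s$. In our DRSL, users are responsible for specifying the groupings of data points, $\{\mathcal{G}_s\}_{s=1}^S$. Intuitively, data points in the same group are assumed to be shifted together in the future distribution shift. Later in this section, we describe how users of our DRSL can specify the groupings in practice.

For notational convenience, let $w_s \equiv w(s)$, and define $\boldsymbol{w} \equiv (w_1, \ldots, w_S)$. Using $\boldsymbol{w}$, Eqs. (13) and (14) can be empirically approximated as follows.

$$\min_\theta \sup_{\boldsymbol{w} \in \hat{\mathcal{W}}_f} \hat{\mathcal{R}}(\boldsymbol{w}, \theta) + \lambda\Omega(\theta), \qquad \hat{\mathcal{R}}(\boldsymbol{w}, \theta) \equiv \frac{1}{N}\sum_{s=1}^{S} n_s w_s \bar{\ell}_s(\theta), \tag{16}$$

$$\hat{\mathcal{W}}_f = \left\{ \boldsymbol{w} \;\middle|\; \frac{1}{N}\sum_{s=1}^{S} n_s f(w_s) \le \delta,\ \frac{1}{N}\sum_{s=1}^{S} n_s w_s = 1,\ \boldsymbol{w} \ge 0 \right\}, \tag{17}$$

where $n_s$ is the cardinality of $\mathcal{G}_s$, and $\bar{\ell}_s(\theta)$ is the average loss of all data points in $\mathcal{G}_s$.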
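The group sizes and group-average losses that enter Eq. (16) can be computed in a single pass over the data. The following is a minimal sketch, assuming the groupings are given as integer ids in $\{0, \ldots, S-1\}$ (zero-based, unlike the text) and that per-sample losses have already been evaluated; the function names are hypothetical.

```python
import numpy as np

def group_statistics(losses, groups, num_groups):
    """Return (group sizes, group-average losses) for integer group ids."""
    losses = np.asarray(losses, dtype=float)
    groups = np.asarray(groups, dtype=int)
    n = np.bincount(groups, minlength=num_groups)
    loss_sums = np.bincount(groups, weights=losses, minlength=num_groups)
    if np.any(n == 0):
        raise ValueError("every group needs at least one data point")
    return n, loss_sums / n

def structural_empirical_risk(w, n, mean_losses):
    """(1/N) * sum_s n_s * w_s * mean_loss_s for given group weights w."""
    return float(np.sum(n * w * mean_losses) / np.sum(n))
```

With all weights equal to one, `structural_empirical_risk` reduces to the ordinary empirical risk, which is a quick sanity check on any grouping.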

We will call the minimization objective of Eq. (16) without the regularization the structural adversarial empirical risk and denote it by $\hat{\mathcal{R}}_{\rm s\text{-}adv}(\theta)$. Correspondingly, we will call Eq. (16) the structural adversarial empirical risk minimization (structural AERM).

It is worth noting that in Eq. (16), the same weight $w_s$ is shared within the same group $\mathcal{G}_s$. Hence, the adversary in structural AERM is only allowed to reweight the losses in a group-wise manner rather than in the sample-wise manner as done by AERM in Eq. (6). Because of this constraint, Theorem 3 no longer holds and we can learn more meaningful robust classifiers for the specified distribution shift.

### Incorporating structural assumptions through groupings:

In Eq. (16), we see that data points in the same group share the same weight and hence are assumed to be shifted together in the future distribution shift. With this intuition, users of our DRSL can make each group correspond to, for example,

• A class label. This is equivalent to class-prior shift assumption (Saerens et al., 2002).

• A sub-category label (Ristin et al., 2015), which is a refined category of a class label. For example, a ‘flu’ category contains three refined sub-categories: types A, B and C flu.

Users of our DRSL may also utilize meta-information of data to specify the groupings of data. The meta-information includes conditions in which data are collected (e.g., time and places) and the identity of agents that collected data (e.g., robots). The intuition is that data samples collected in similar situations are likely to be shifted together in the future distribution shift; hence, users may put these samples into the same group.

What are the effective groupings of data for structural AERM? Intuitively, the coarser the groupings are, the stronger the assumption on the future distribution shift is and the less powerful the adversary becomes. Consequently, the learner is expected to learn more meaningful solutions than just pessimistic ones. Hence, in structural AERM, it is important for users of our DRSL to specify the coarsest groupings they can make.

### Illustration:

Our DRSL (structural AERM) is illustrated in Figure 1 (ii-c) using the toy dataset. We first see from Figures 1 (ii-a) and (ii-b) that the classifiers learned by ERM and AERM performed well on the left-hand class (98% accuracy) but considerably worse on the right-hand class (80% accuracy). Consequently, the classification accuracy will significantly deteriorate if more data points come from the right-hand class at the test stage. Our DRSL in (ii-c), on the other hand, prevented such deterioration by shifting the decision boundary to the left. As a result, the learned classifier performed well on both classes (94% accuracy), being robust to distribution shift at the class level.

By comparing the dotted lines in Figures 1 (ii-b) and (ii-c), we also observe that our DRSL is less sensitive to the outlier than the previous DRSL is. This result can be intuitively understood as follows: the adversary in our DRSL (structural AERM) can reweight the losses only in the group-wise manner and cannot concentrate large weights on a few data points with large losses, i.e., outliers. Consequently, structural AERM is less sensitive to outliers compared to the original AERM.

## 5 Efficient Learning Algorithms

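In this section, we derive efficient learning algorithms for structural AERM. Its objective, $\hat{\mathcal{R}}_{\rm s\text{-}adv}(\theta)$, is convex in $\theta$ if $\ell(g_\theta(x), y)$ is convex in $\theta$, because we are taking the supremum over a set of convex functions (Boyd and Vandenberghe, 2004). Hence, we can always find the global optimum by a gradient-based optimization algorithm.

Thanks to Danskin's theorem (Danskin, 1966), we can compute the gradient as

$$\nabla_\theta \hat{\mathcal{R}}_{\rm s\text{-}adv}(\theta) = \nabla_\theta \hat{\mathcal{R}}(\boldsymbol{w}^*, \theta) = \frac{1}{N}\sum_{s=1}^{S} n_s w_s^* \nabla_\theta \bar{\ell}_s(\theta), \tag{18}$$

where

$$\boldsymbol{w}^* = \operatorname*{arg\,sup}_{\boldsymbol{w} \in \hat{\mathcal{W}}_f} \hat{\mathcal{R}}(\boldsymbol{w}, \theta). \tag{19}$$

In the following, we show that Eq. (19) can be solved very efficiently for two well-known instances of $f$-divergences.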

### Kullback-Leibler (KL) Divergence:

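For the KL divergence, $f(x) = x \log x$, and the solution of Eq. (19) is obtained as

$$w_s^* = \frac{N}{Z(\gamma)} \cdot \exp\left(\frac{\bar{\ell}_s(\theta)}{\gamma}\right), \quad 1 \le s \le S, \tag{20}$$

where $\gamma$ is a scalar such that the first constraint of $\hat{\mathcal{W}}_f$ holds with equality, and $Z(\gamma)$ is a normalizing constant in order to satisfy the second constraint of $\hat{\mathcal{W}}_f$. To compute $\gamma$, we can perform a binary search on $\gamma$ to satisfy the first constraint of $\hat{\mathcal{W}}_f$.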
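The binary search for the KL case can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: it takes $Z(\gamma) = \sum_s n_s \exp(\bar{\ell}_s(\theta)/\gamma)$ so that the weights average to one, searches over $\log \gamma$ in a hypothetical bracket $[-20, 20]$, and relies on the fact that the empirical KL term decreases as $\gamma$ grows. All function names are hypothetical.

```python
import numpy as np

def kl_weights(mean_losses, n, gamma):
    """Eq. (20): w_s = N * exp(mean_loss_s / gamma) / Z(gamma)."""
    z = mean_losses / gamma
    e = np.exp(z - z.max())              # shift for numerical stability
    return n.sum() * e / np.sum(n * e)

def kl_value(w, n):
    """(1/N) * sum_s n_s * w_s * log(w_s), treating 0 * log(0) as 0."""
    safe = np.where(w > 0, w, 1.0)
    return float(np.sum(n * w * np.log(safe)) / n.sum())

def kl_adversary(mean_losses, n, delta, iters=60):
    """Group weights maximizing the weighted loss under the KL constraint."""
    mean_losses = np.asarray(mean_losses, dtype=float)
    n = np.asarray(n, dtype=float)
    if mean_losses.max() == mean_losses.min():
        return np.ones_like(mean_losses)  # all groups tie: uniform weights
    lo, hi = -20.0, 20.0                  # bracket on log(gamma), an assumption
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        w = kl_weights(mean_losses, n, np.exp(mid))
        if kl_value(w, n) > delta:
            lo = mid                      # constraint violated: enlarge gamma
        else:
            hi = mid                      # feasible: try a smaller gamma
    return kl_weights(mean_losses, n, np.exp(hi))
```

The search keeps the feasible end of the bracket, so the returned weights always satisfy the divergence constraint. If $\delta$ exceeds the largest achievable divergence, the result simply concentrates on the group with the largest average loss.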

### PE Divergence:

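For the PE divergence, $f(x) = (x-1)^2$. Empirically, we found that the inequality constraint $\boldsymbol{w} \ge 0$ of $\hat{\mathcal{W}}_f$ is satisfied for small $\delta$. Dropping the inequality, the solution of Eq. (19) is simple:

$$\boldsymbol{w}^* = \sqrt{\frac{N\delta}{\sum_{s=1}^{S} n_s v_s^2}}\, \boldsymbol{v} + \boldsymbol{1}_S, \tag{21}$$

where $\boldsymbol{1}_S$ is an $S$-dimensional vector with all the elements equal to 1, and $\boldsymbol{v}$ is an $S$-dimensional vector such that $v_s = \bar{\ell}_s(\theta) - \hat{\mathcal{R}}(\theta)$ for $1 \le s \le S$.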
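The closed form in Eq. (21) is easy to check numerically. The sketch below assumes $f(x) = (x-1)^2$ and takes $v_s$ to be the group-average loss centered by the empirical risk, which is what the stationarity conditions of Eq. (19) give once the non-negativity constraint is dropped. The function name and the tie tolerance are hypothetical choices.

```python
import numpy as np

def pe_adversary(mean_losses, n, delta):
    """Eq. (21) with v_s = mean_loss_s - empirical risk; ignores w >= 0."""
    mean_losses = np.asarray(mean_losses, dtype=float)
    n = np.asarray(n, dtype=float)
    N = n.sum()
    v = mean_losses - np.sum(n * mean_losses) / N   # centered group losses
    denom = np.sum(n * v ** 2)
    if denom < 1e-12:
        return np.ones_like(v)   # all groups (numerically) tie: no reweighting
    return np.sqrt(N * delta / denom) * v + 1.0
```

Because the non-negativity constraint is ignored, callers may want to verify `np.all(w >= 0)` on the result. When it holds, the weights average to one, the divergence constraint is met with equality, and the weighted loss equals the empirical risk plus $\sqrt{\delta}$ times the weighted standard deviation of the group losses.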

### Computational Complexity:

The time complexity for solving Eq. (19) is $O(bS)$ for the KL divergence and $O(S)$ for the PE divergence, where $b$ is the number of the binary search iterations to compute $\gamma$ in Eq. (20). Solving Eq. (19) therefore adds negligible computational overhead to computing $\bar{\ell}_s(\theta)$ and $\nabla_\theta \bar{\ell}_s(\theta)$ for $1 \le s \le S$, which for example requires $O(Nd)$ time for a $d$-dimensional linear-in-parameter model. Hence, a batch update for structural AERM in Eq. (18) enjoys nearly the same order of efficiency as that for ordinary ERM.
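To make the cost of one batch update concrete, here is a minimal sketch of a single gradient step. It assumes a linear model with the logistic loss and labels in $\{-1, +1\}$, the regularizer $\Omega(\theta) = \frac{1}{2}\|\theta\|^2$, the PE divergence with centered group losses in the closed-form adversary, and zero-based group ids. The function name, the clipping safeguard, and all hyper-parameter values are illustrative assumptions.

```python
import numpy as np

def structural_aerm_step(theta, X, y, groups, num_groups, delta, lam, lr):
    """One gradient step on the regularized structural AERM objective."""
    N = X.shape[0]
    margins = y * (X @ theta)
    losses = np.logaddexp(0.0, -margins)                  # logistic loss
    n = np.bincount(groups, minlength=num_groups).astype(float)
    mean_losses = np.bincount(groups, weights=losses, minlength=num_groups) / n

    # Inner maximization: PE closed form, assuming w >= 0 stays inactive.
    v = mean_losses - np.sum(n * mean_losses) / N
    denom = np.sum(n * v ** 2)
    w = np.ones(num_groups) if denom < 1e-12 else np.sqrt(N * delta / denom) * v + 1.0
    w = np.clip(w, 0.0, None)   # crude safeguard, not an exact projection

    # Danskin: differentiate with the adversarial weights held fixed.
    dloss = -y * np.exp(-np.logaddexp(0.0, margins))      # d loss / d score
    grad = X.T @ (w[groups] * dloss) / N + lam * theta
    objective = np.sum(n * w * mean_losses) / N + 0.5 * lam * theta @ theta
    return theta - lr * grad, objective

# Toy usage: two classes, one group per class label.
rng = np.random.default_rng(0)
y = np.where(rng.random(200) < 0.5, 1.0, -1.0)
X = rng.normal(size=(200, 2)) + y[:, None]
groups = (y > 0).astype(int)
theta, history = np.zeros(2), []
for _ in range(100):
    theta, obj = structural_aerm_step(theta, X, y, groups, 2,
                                      delta=0.1, lam=1e-3, lr=0.5)
    history.append(obj)
```

Apart from the two `bincount` calls and the $O(S)$ closed form, the step costs the same as an ordinary ERM gradient step; the adversary only rescales per-sample gradients through `w[groups]`.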

Wen et al. (2014) also proposed DRSL with structural assumptions on the distribution shift (they assumed smoothness of the density ratio with respect to the input), though without theoretical motivation. They also solved the optimization problem by gradient descent. However, at each iteration, the adversary needs to solve a linear program (LP) of size with constraints. The computational complexity for this is (Boyd and Vandenberghe, 2004) which adds a huge computational overhead for large data size $N$.

## 6 Convergence Rate and Estimation Error

We establish the convergence rate of the model parameter and the order of the estimation error for structural AERM in terms of the number of training data points $N$. Due to the limited space, we only present an informal statement here. The formal statement can be found in Appendix H and its proof can be found in Appendix I.

###### Theorem 4 (Convergence rate and estimation error, informal statement).

Let $\theta^*$ be the solution of structural ARM, and let $\hat{\theta}_N$ be the solution of structural AERM given training data of size $N$. Assume $g_\theta(x)$ is linear in $\theta$. Under mild conditions (which are satisfied by the KL and PE divergences and the softmax cross-entropy loss), as $N \to \infty$, we have and consequently, .

Notice that the convergence rate of $\hat{\theta}_N$ to $\theta^*$ is not the optimal parametric rate $O_p(N^{-1/2})$. This is because the maximization of $\hat{\mathcal{R}}(\boldsymbol{w}, \theta)$ w.r.t. $\boldsymbol{w}$ in Eq. (16) converges at a slower rate, which slows down the entire convergence rate. Theorem 4 applies to any $f$-divergence where $f(x)$ is nonlinear in $x$, while knowing which $f$-divergence is used may provide much more information and the convergence rate may be improved to the optimal parametric rate.

To the best of our knowledge, Theorem 4 and its proof technique are novel. For minimax problems, the distribution-free regret is usually considered (Namkoong and Duchi, 2016) instead of the estimation error that heavily depends on . Nevertheless, the optimally worst depends on and is not distribution-free. Hence, structural AERM belongs to statistical learning instead of online learning. Therefore, the estimation error rather than the regret is considered in our theoretical analysis.

## 7 Experiments

In this section, we experimentally analyze our DRSL (structural AERM) by comparing it with ordinary ERM and the previous DRSL with $f$-divergences (AERM). We empirically demonstrate (i) the undesirable property of the previous DRSL in classification and (ii) the robustness of our DRSL against specified distribution shift.

### Datasets:

We obtained fourteen classification datasets from the UCI repository (Blake and Merz, 1998), three of which are for multi-class classification. We also obtained MNIST (LeCun et al., 1998) and 20newsgroups (Lang, 1995). Refer to Appendix J for the details of the datasets.

We note that some of the datasets were collected in the environment where test data are very likely to follow a different distribution from training data. For example, the satimage is a UCI dataset for classifying satellite images into one of six categories, e.g. red soil, cotton crop, grey soil and vegetation. The labeling was performed by one human visiting the actual site; therefore, the gathered training data can be strongly biased. In this case, it is desirable to learn a distributionally robust classifier rather than assuming that test data follow the same distribution as training data.

### Evaluation metrics:

We evaluate the three methods (ordinary ERM, the previous DRSL and our DRSL) with three kinds of metrics: (i) the ordinary classification risk, (ii) the adversarial risk and (iii) the structural adversarial risk, where the 0-1 losses are used for all the metrics. The ordinary risk measures the classification performance when no distribution shift occurs, while the structural adversarial risk is calculated w.r.t. the worst-case distribution shift that occurs at the specified latent category level. Recall from Theorem 1 that the adversarial risk has the monotonic relationship with the ordinary risk. Hence, we do not explicitly report it in the experiments. We used held-out test data for estimating these values.

### Experimental protocols:

For our DRSL, we learn classifiers robust against potential distribution shift at (a) the class level and (b) the sub-category level (Ristin et al., 2015). This corresponds to making each grouping correspond to (a) a class label and (b) a sub-category label, respectively. In the benchmark datasets, the sub-category labels are not available. Hence, we manually create such labels as follows. First, we converted the original multi-class classification problems into classification problems with fewer classes by integrating some classes together. Then, the original class labels are regarded as the sub-categories. In this way, we converted the shuttle, satimage, and letter datasets (all three from the UCI repository) and MNIST into binary classification problems, and 20newsgroups into a 7-class classification problem. Appendix K details how we grouped the class labels.

For all the methods, we used linear models with softmax output for the prediction function . The cross-entropy loss with regularization was adopted. The regularization hyper-parameter was selected from via 5-fold cross validation. For ordinary ERM, the cross validation is straightforward. In Appendix G, we show how to perform the cross validation for DRSL to estimate the structural adversarial risk.

The uncertainty set used in DRSL should be chosen based on users' prior belief on potential distribution shift (Wen et al., 2014). In the experiments, we used the two $f$-divergences (the KL and PE divergences) and set the same $\delta$ for our DRSL and the previous DRSL. The same $\delta$ and $f$-divergence were used for estimating the structural adversarial risk. At the end of this section, we discuss how we can choose $\delta$ in practice.

### Results:

In Table 1, we report experimental results when the KL divergence is used. Due to the space constraint, refer to Appendix L for the results when the PE divergence is used. We see from the left half of Table 1 that ordinary ERM achieved lower estimated ordinary risk, as expected. On the other hand, the previous DRSL, which does not incorporate any structural assumptions on distribution shift, performed poorly in terms of all three evaluation metrics. This may be because the learner became excessively sensitive to outliers, as implied by Theorem 3. We see from the right half of Table 1 that our DRSL achieved significantly lower estimated structural adversarial risk. This suggests that our DRSL is indeed robust to the specified distribution shift.

### Discussion:

We provide an insight for users to determine $\delta$ in our DRSL (for the choice of $f$-divergences in our DRSL, refer to Appendix F of the supplementary material). We see from Eq. (15) that the structural adversarial risk can be decomposed into the sum of the ordinary risk and the robustness term, where $\delta$ acts as a trade-off hyper-parameter between the two terms. In practice, users of our DRSL may want to have a good balance between the two terms so that the learned classifier shows high accuracy w.r.t. the training distribution while being robust to the specified distribution shift. Since both terms in Eq. (15) can be estimated by cross validation, the users can adjust $\delta$ of our DRSL (structural AERM in Eqs. (16) and (17)) to best trade off the two terms for their purposes.

## 8 Conclusion

In this paper, we theoretically analyzed the previous DRSL in the classification scenario and established its equivalence to ordinary ERM. To overcome this, we presented novel DRSL which learns robust decision boundaries based on structural assumptions on distribution shift. We derived efficient optimization algorithms and established the convergence rate of the model parameter for our DRSL. Its effectiveness was demonstrated through experiments.

## Acknowledgements

GN and MS were supported by JST CREST JPMJCR1403.

## References

• Vapnik (1998) Vladimir Naumovich Vapnik. Statistical Learning Theory. Wiley New York, 1998.
• Quionero-Candela et al. (2009) Joaquin Quionero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. The MIT Press, 2009.
• Bagnell (2005) J Andrew Bagnell. Robust supervised learning. In Proceedings of Association for the Advancement of Artificial Intelligence, volume 20, pages 714–719, 2005.
• Ben-Tal et al. (2009) Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust Optimization. Princeton University Press, 2009.
• Wen et al. (2014) Junfeng Wen, Chun-Nam Yu, and Russell Greiner. Robust learning under uncertain test distributions: Relating covariate shift to model misspecification. In Proceedings of International Conference on Machine Learning, pages 631–639, 2014.
• Duchi et al. (2016) John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv preprint arXiv:1610.03425, 2016.
• Namkoong and Duchi (2016) Hongseok Namkoong and John C Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems, pages 2208–2216, 2016.
• Globerson and Roweis (2006) Amir Globerson and Sam Roweis. Nightmare at test time: robust learning by feature deletion. In Proceedings of International Conference on Machine learning, pages 353–360. ACM, 2006.
• Liu and Ziebart (2014) Anqi Liu and Brian Ziebart. Robust classification under sample selection bias. In Advances in Neural Information Processing Systems, pages 37–45, 2014.
• Chen et al. (2016) Xiangli Chen, Mathew Monfort, Anqi Liu, and Brian D Ziebart. Robust covariate shift regression. In Proceedings of International Conference on Artificial Intelligence and Statistics, pages 1270–1279, 2016.
• Sugiyama and Kawanabe (2012) Masashi Sugiyama and Motoaki Kawanabe. Machine learning in non-stationary environments: Introduction to covariate shift adaptation. MIT press, 2012.
• Shimodaira (2000) Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
• Bartlett et al. (2006) Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
• Tewari and Bartlett (2007) Ambuj Tewari and Peter L Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(May):1007–1025, 2007.
• Storkey and Sugiyama (2007) Amos J Storkey and Masashi Sugiyama. Mixture regression for covariate shift. In Advances in Neural Information Processing Systems, pages 1337–1344, 2007.
• Boyd and Vandenberghe (2004) Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
• Saerens et al. (2002) Marco Saerens, Patrice Latinne, and Christine Decaestecker. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural Computation, 14(1):21–41, 2002.
• Ristin et al. (2015) Marko Ristin, Juergen Gall, Matthieu Guillaumin, and Luc Van Gool. From categories to subcategories: large-scale image classification with partial class label refinement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 231–239, 2015.
• Danskin (1966) John M Danskin. The theory of max-min, with applications. SIAM Journal on Applied Mathematics, 14(4):641–664, 1966.
• Shapiro (1992) Alexander Shapiro. Perturbation analysis of optimization problems in Banach spaces. Numerical Functional Analysis and Optimization, 13:97–116, 1992.
• Bonnans and Cominetti (1996) J. Frederic Bonnans and Roberto Cominetti. Perturbed optimization in Banach spaces I: A general theory based on a weak directional constraint qualification; II: A theory based on a strong directional qualification condition; III: Semiinfinite optimization. SIAM Journal on Control and Optimization, 34:1151–1171, 1172–1189, and 1555–1567, 1996.
• Bonnans and Shapiro (1998) J. Frederic Bonnans and Alexander Shapiro. Optimization problems with perturbations, a guided tour. SIAM Review, 40(2):228–264, 1998.
• Blake and Merz (1998) Catherine Blake and Christopher J. Merz. UCI repository of machine learning databases.
• LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
• Lang (1995) Ken Lang. Newsweeder: Learning to filter netnews. In Proceedings of International Conference on Machine Learning, pages 331–339, 1995.
• Chung (1968) Kai Lai Chung. A Course in Probability Theory. Academic Press, 1968.
• Robinson (1977) Stephen M Robinson. A characterization of stability in linear programming. Operations Research, 25:435–447, 1977.

## Appendix A Proof of Theorem 1

If $\delta = 0$, the adversarial risk reduces exactly to the ordinary risk and Theorem 1 readily follows. In the following, we assume $\delta > 0$. Recall the objective of ARM:

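$$\min_{\theta} \sup_{r \in \mathcal{U}_f} \mathbb{E}_{p(x,y)}\left[r(x,y)\,\ell(g_\theta(x), y)\right], \tag{22}$$

where

$$\mathcal{U}_f \equiv \left\{ r(x,y) \;\middle|\; \mathbb{E}_{p(x,y)}[f(r(x,y))] \le \delta,\ \mathbb{E}_{p(x,y)}[r(x,y)] = 1,\ r(x,y) \ge 0,\ \forall (x,y) \in \mathcal{X} \times \mathcal{Y} \right\}. \tag{23}$$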

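Let $\Omega_\theta^{(i)} \equiv \{(x,y) \mid \ell(g_\theta(x), y) = i\}$ for $i = 0, 1$. Since $\ell$ is assumed to be the 0-1 loss, we have $\Omega_\theta^{(0)} \cup \Omega_\theta^{(1)} = \mathcal{X} \times \mathcal{Y}$. Let $r^*(\cdot,\cdot)$ be the optimal solution of the inner maximization of Eq. (22). It is easy to show that, for any $\theta$, $r^*(\cdot,\cdot)$ takes the same value within $\Omega_\theta^{(0)}$ and within $\Omega_\theta^{(1)}$, i.e.,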

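$$r^*(x,y) = r_0^* \ \ \text{for}\ \ \forall (x,y) \in \Omega_\theta^{(0)}, \qquad r^*(x,y) = r_1^* \ \ \text{for}\ \ \forall (x,y) \in \Omega_\theta^{(1)}, \tag{24}$$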

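where $r_0^*$ and $r_1^*$ are some constant values. Let $p_{\Omega_\theta^{(i)}}$ denote the probability mass of $\Omega_\theta^{(i)}$ under $p(x,y)$ for $i = 0, 1$. We note that $p_{\Omega_\theta^{(0)}} + p_{\Omega_\theta^{(1)}} = 1$. Also, we see that $p_{\Omega_\theta^{(1)}}$ is by definition equal to the ordinary risk, i.e., $p_{\Omega_\theta^{(1)}} = R(\theta)$. By using Eq. (24), we can further rewrite the inner maximization of Eq. (22) as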

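$$\sup_{(r_0, r_1) \in \mathcal{U}'_f} p_{\Omega_\theta^{(1)}}\, r_1, \tag{25}$$

where

$$\mathcal{U}'_f \equiv \left\{ (r_0, r_1) \;\middle|\; p_{\Omega_\theta^{(0)}} f(r_0) + p_{\Omega_\theta^{(1)}} f(r_1) \le \delta,\ p_{\Omega_\theta^{(0)}} r_0 + p_{\Omega_\theta^{(1)}} r_1 = 1,\ r_0 \ge 0,\ r_1 \ge 0 \right\}. \tag{26}$$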
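The two-variable problem in Eqs. (25) and (26) is small enough to solve numerically, which gives a quick sanity check of the monotonicity claimed in this appendix. The sketch below is an illustration under stated assumptions rather than part of the proof: it writes `p1` for the probability mass of $\Omega_\theta^{(1)}$, takes $f(r) = r \log r$ as the KL generator (an assumed convention), eliminates $r_0$ through the equality constraint, and finds the largest feasible $r_1$ by bisection. The function names are hypothetical.

```python
import math

def f_kl(r):
    """Assumed KL generator f(r) = r log r, with f(0) = 0 by continuity."""
    return 0.0 if r == 0.0 else r * math.log(r)

def adversarial_risk_01(p1, delta, f=f_kl, iters=100):
    """Value of Eq. (25): sup of p1 * r1 over the feasible set in Eq. (26)."""
    if p1 <= 0.0:
        return 0.0                      # the objective p1 * r1 vanishes
    if p1 >= 1.0:
        return 1.0                      # the equality constraint forces r1 = 1
    p0 = 1.0 - p1

    def divergence(r1):
        # r0 is determined by the equality constraint p0 * r0 + p1 * r1 = 1.
        r0 = max((1.0 - p1 * r1) / p0, 0.0)
        return p0 * f(r0) + p1 * f(r1)

    lo, hi = 1.0, 1.0 / p1              # r1 = 1 is feasible; r1 = 1/p1 gives r0 = 0
    if divergence(hi) <= delta:         # the constraint r0 >= 0 is the binding one
        return 1.0
    for _ in range(iters):              # divergence is nondecreasing on [1, 1/p1]
        mid = 0.5 * (lo + hi)
        if divergence(mid) <= delta:
            lo = mid
        else:
            hi = mid
    return p1 * lo

# Adversarial risk as a function of the ordinary risk p1, for a fixed delta.
grid = [0.05 * k for k in range(1, 20)]
values = [adversarial_risk_01(p, 0.5) for p in grid]
```

On this grid the computed values never decrease as `p1` grows, and with `delta = 0` the function returns `p1` itself up to numerical tolerance, in line with the reduction to the ordinary risk noted at the start of this appendix.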

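In the following, we show that Eq. (25) has a monotonic relationship with $p_{\Omega_\theta^{(1)}}$. Setting $p_{\Omega_\theta^{(1)}}$ to some value, we can obtain the optimal $(r_0, r_1)$ by solving Eq. (25). Let $(r_0^*(p_1), r_1^*(p_1))$ be the solution of Eq. (25) when $p_{\Omega_\theta^{(1)}}$ is set to $p_1$.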

First, we note that the first inequality constraint in Eq. (26) defines a convex set that includes the point $(r_0, r_1) = (1, 1)$ in its interior. Also, the equality constraint in Eq. (26) includes the point $(1, 1)$. Hence, for $p_1 \in (0, 1)$, there are always exactly two different points that satisfy both $p_0 f(r_0) + p_1 f(r_1) = \delta$ and $p_0 r_0 + p_1 r_1 = 1$. We further see that the optimal solution of $r_1$ is always greater than 1 for such $p_1$ because the objective in Eq. (25) is an increasing function of $r_1$. Taking these facts into account, we can see that the optimal solution, $(r_0^*(p_1), r_1^*(p_1))$, satisfies either of the following two cases depending on the active inequality constraints in Eq. (26).

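$$\textbf{Case 1:}\quad p_0 \cdot f(r_0^*(p_1)) + p_1 \cdot f(r_1^*(p_1)) = \delta,\quad p_0 \cdot r_0^*(p_1) + p_1 \cdot r_1^*(p_1) = 1,\quad 0 < r_0^*(p_1) < 1 < r_1^*(p_1), \tag{27}$$

$$\textbf{Case 2:}\quad p_0 \cdot f(r_0^*(p_1)) + p_1 \cdot f(r_1^*(p_1)) \le \delta,\quad p_0 \cdot r_0^*(p_1) + p_1 \cdot r_1^*(p_1) = 1,\quad r_0^*(p_1) = 0, \tag{28}$$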

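where $p_0 = 1 - p_1$. Pick any $p_1'$ such that $p_1 < p_1'$, and let $(r_0^*(p_1'), r_1^*(p_1'))$ be the solution of Eq. (25) when $p_{\Omega_\theta^{(1)}}$ is set to $p_1'$. Regarding the second case in Eq. (28), for $p_1'$, the active inequality constraint is always $r_0 \ge 0$ in Eq. (26). Hence, we can show that $p_1' \cdot r_1^*(p_1') = 1$. Therefore, in this case, the adversarial risk and the ordinary risk both stay 1 for $p_1'$.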

Regarding the first case in Eq. (27), we note that both the ordinary risk and the adversarial risk are strictly less than 1. Our goal is to show

$$p_1 \cdot r_1^*(p_1) < p_1' \cdot r_1^*(p_1'). \tag{29}$$

We further consider the following two sub-cases of Eq. (27):

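$$\textbf{Case 1-a:}\quad p_1' < p_1 \cdot r_1^*(p_1), \tag{30}$$

$$\textbf{Case 1-b:}\quad p_1' \ge p_1 \cdot r_1^*(p_1). \tag{31}$$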

In Case 1-b, we can show Eq. (29) as follows.

$$p_1 \cdot r_1^*(p_1) \le p_1' < p_1' \cdot r_1^*(p_1'), \tag{32}$$

where the last inequality follows from $r_1^*(p_1') > 1$.

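Now, assume Cases 1 and 1-a in Eqs. (27) and (30). Our goal is to show that $r_1^*(p_1')$ satisfies $p_1 \cdot r_1^*(p_1) < p_1' \cdot r_1^*(p_1')$. To this end, we show that $(r_0', r_1')$ is contained in the interior of the feasible set in Eq. (26) with $p_{\Omega_\theta^{(1)}}$ set to $p_1'$. Then, because our objective in Eq. (25) is linear in $r_1$, $p_1' r_1' < p_1' \cdot r_1^*(p_1')$ holds in our setting. Then, we arrive at $p_1 \cdot r_1^*(p_1) = p_1' r_1' < p_1' \cdot r_1^*(p_1')$. Formally, our goal is to show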

$$p_0' \cdot f(r_0') + p_1' \cdot f(r_1') < \delta,\quad p_0' r_0' + p_1' r_1' = 1,\quad r_0' > 0,\quad r_1' > 0, \tag{33}$$

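where $p_0' = 1 - p_1'$ and $r_1' \equiv \frac{p_1}{p_1'} \cdot r_1^*(p_1)$. By the second equality of Eq. (33) and $p_1' r_1' = p_1 \cdot r_1^*(p_1)$, we have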

$$r_0' = \frac{1 - p_1' r_1'}{p_0'} = \frac{1 - p_1 \cdot r_1^*(p_1)}{p_0'} = \frac{p_0}{p_0'} \cdot r_0^*(p_1). \tag{34}$$

The latter two inequalities of Eq. (33), i.e., $r_0' > 0$ and $r_1' > 0$, follow straightforwardly from the assumptions. Combining the assumption in Eq. (30) and the last inequality in Eq. (27), we have the following inequality.

$$0 < r_0^*(p_1) < r_0' < 1 < r_1' < r_1^*(p_1). \tag{35}$$

Thus, we can write $r_0'$ (resp. $r_1'$) as a linear interpolation of $r_0^*(p_1)$ and 1 (resp. 1 and $r_1^*(p_1)$) as follows.

$$r_0' = \alpha \cdot r_0^*(p_1) + (1 - \alpha) \cdot 1, \qquad r_1' = \beta \cdot r_1^*(p_1) + (1 - \beta) \cdot 1, \tag{36}$$

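where $0 < \alpha, \beta < 1$. Substituting $r_0' = \frac{p_0}{p_0'} \cdot r_0^*(p_1)$ and $r_1' = \frac{p_1}{p_1'} \cdot r_1^*(p_1)$, we have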

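$$\begin{align}
\alpha &= \frac{1}{1 - r_0^*(p_1)} \cdot \frac{p_0' - p_0 \cdot r_0^*(p_1)}{p_0'}, \tag{37} \\
\beta &= \frac{1}{r_1^*(p_1) - 1} \cdot \frac{p_1 \cdot r_1^*(p_1) - p_1'}{p_1'} = \frac{1}{r_1^*(p_1) - 1} \cdot \frac{p_0' - p_0 \cdot r_0^*(p_1)}{p_1'}. \tag{38}
\end{align}$$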

Then, we have

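$$\begin{align}
p_0' f(r_0') + p_1' f(r_1')
&= p_0' f\big(\alpha \cdot r_0^*(p_1) + (1 - \alpha) \cdot 1\big) + p_1' f\big(\beta \cdot r_1^*(p_1) + (1 - \beta) \cdot 1\big) \tag{39} \\
&\le p_0' \cdot \big\{\alpha \cdot f(r_0^*(p_1)) + (1 - \alpha) \cdot f(1)\big\} + p_1' \cdot \big\{\beta \cdot f(r_1^*(p_1)) + (1 - \beta) \cdot f(1)\big\} \quad (\because \text{convexity of } f(\cdot)) \tag{40} \\
&= p_0' \alpha \cdot f(r_0^*(p_1)) + p_1' \beta \cdot f(r_1^*(p_1)) \quad (\because f(1) = 0) \tag{41} \\
&= \big(p_0' - p_0 \cdot r_0^*(p_1)\big) \left( \frac{1}{1 - r_0^*(p_1)} \cdot f(r_0^*(p_1)) + \frac{1}{r_1^*(p_1) - 1} \cdot f(r_1^*(p_1)) \right) \tag{42} \\
&= \big(p_1 \cdot r_1^*(p_1) - p_1'\big) \left( \frac{1}{1 - r_0^*(p_1)} \cdot f(r_0^*(p_1)) + \frac{1}{r_1^*(p_1) - 1} \cdot f(r_1^*(p_1)) \right) \tag{43} \\
&< \big(p_1 \cdot r_1^*(p_1) - p_1\big) \left( \frac{1}{1 - r_0^*(p_1)} \cdot f(r_0^*(p_1)) + \frac{1}{r_1^*(p_1) - 1} \cdot f(r_1^*(p_1)) \right) \quad (\because p_1' > p_1) \tag{44} \\
&= p_1 \cdot f(r_1^*(p_1)) + p_0 \cdot f(r_0^*(p_1)) \tag{45} \\
&= \delta. \quad (\because \text{the first equation of Eq. (27)}) \tag{46}
\end{align}$$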
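The chain of Eqs. (39) to (46) can be checked on a concrete instance. The following sketch is a numerical illustration only, under the same assumptions as any small experiment of this kind: $f(r) = r \log r$ is taken as the KL generator, the optimum of Eq. (25) is located by bisection, and the helper names `solve_inner` and `interpolated_point` are hypothetical. It builds the point $(r_0', r_1')$ with $p_1' r_1' = p_1 \cdot r_1^*(p_1)$ and $r_0'$ as in Eq. (34), and evaluates the left-hand side of the first inequality in Eq. (33).

```python
import math

def f_kl(r):
    """Assumed KL generator f(r) = r log r, with f(0) = 0 by continuity."""
    return 0.0 if r == 0.0 else r * math.log(r)

def solve_inner(p1, delta, f=f_kl, iters=200):
    """Return (r0*, r1*) for Eq. (25), assuming 0 < p1 < 1 and that Case 1 holds."""
    p0 = 1.0 - p1
    lo, hi = 1.0, 1.0 / p1
    for _ in range(iters):                 # bisection on the largest feasible r1
        mid = 0.5 * (lo + hi)
        r0 = max((1.0 - p1 * mid) / p0, 0.0)
        if p0 * f(r0) + p1 * f(mid) <= delta:
            lo = mid
        else:
            hi = mid
    return (1.0 - p1 * lo) / p0, lo

def interpolated_point(p1, p1_new, delta, f=f_kl):
    """Build (r0', r1') for p1' = p1_new and return it with its divergence value."""
    r0_star, r1_star = solve_inner(p1, delta, f)
    p0, p0_new = 1.0 - p1, 1.0 - p1_new
    r1_new = p1 * r1_star / p1_new         # keeps p1' * r1' = p1 * r1*(p1)
    r0_new = p0 * r0_star / p0_new         # Eq. (34)
    div = p0_new * f(r0_new) + p1_new * f(r1_new)
    return r0_star, r1_star, r0_new, r1_new, div

# Example instance in Case 1-a: p1 = 0.2, p1' = 0.22, delta = 0.1.
r0_star, r1_star, r0_new, r1_new, div = interpolated_point(0.2, 0.22, 0.1)
```

For this instance, `r0_new` and `r1_new` lie strictly between the old optimum and 1 on each side, and `div` comes out strictly below `delta`, so the constructed point is strictly feasible for $p_1'$, which is what the argument needs.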

This concludes our proof that Eq. (25) has a monotonic relationship with $p_{\Omega_\theta^{(1)}}$. Recall that $p_{\Omega_\theta^{(1)}}$ is by definition equal to the ordinary risk, $R(\theta)$. Therefore, for any pair of parameters $\theta_1$ and $\theta_2$, we have

 R(θ1)