# Target contrastive pessimistic risk for robust domain adaptation

## Abstract

In domain adaptation, classifiers with information from a source domain adapt to generalize to a target domain. However, an adaptive classifier can perform worse than a non-adaptive classifier due to invalid assumptions, increased sensitivity to estimation errors or model misspecification. Our goal is to develop a domain-adaptive classifier that is robust in the sense that it does not rely on restrictive assumptions on how the source and target domains relate to each other and that it does not perform worse than the non-adaptive classifier. We formulate a conservative parameter estimator that only deviates from the source classifier when a lower risk is guaranteed for all possible labellings of the given target samples. We derive the classical least-squares and discriminant analysis cases and show that these perform on par with state-of-the-art domain adaptive classifiers in sample selection bias settings, while outperforming them in more general domain adaptation settings.

## 1 Introduction

Generalization in supervised learning relies on the assumption that future samples originate from the same underlying distribution as the ones used for training. However, this is not the case in settings where data is collected from different locations, different measurement instruments are used or there is only access to biased data. In these situations the labeled data does not represent the distribution of interest. This problem setting is referred to as a *domain adaptation* setting, where the distribution of the labeled data is called the *source domain* and the distribution that one is actually interested in is called the *target domain*. Most often, data in the target domain is not labeled and adapting a source domain classifier, i.e. changing its predictions to be more suited to the target domain, is the only means by which one can make predictions for the target domain. Unfortunately, depending on the domain dissimilarity, adaptive classifiers can perform *worse* than non-adaptive ones. In this work, we formulate a conservative adaptive classifier that always performs at least as well as the non-adaptive one.

Biased samplings tend to occur when one samples locally from a much larger population [?]. For instance, in computer-assisted diagnosis, biometrics collected from two different hospitals will be different due to differences between the patient populations: one population's diet might not be the same as the other's. Nonetheless, both patient populations are subsamples of the human population as a whole. Adaptation in this example corresponds to accounting for the differences between patient populations, training a classifier on the corrected labeled data from one hospital, and applying the adapted classifier to the other hospital. Additionally, different measurement instruments cause different biased samplings: photos of the same object taken with different cameras lead to different distributions over images [?]. Lastly, biases arise when one only has access to particular subsets, such as data from individual humans in an activity recognition task [?].

In the general setting, domains can be arbitrarily different and contain almost no mutual information, which means generalization will be extremely difficult. However, there are cases where the problem setting is more structured: in the *covariate shift* setting, the marginal data distributions differ but the class-posterior distributions are equal [?]. This means that the underlying true classification function is the same in both domains, implying that a correctly specified adaptive classifier converges to the same solution as the target classifier. Adaptation occurs by weighing each source sample by how important it is under the target distribution and training on the importance-weighed labeled source data. A model that relies on equal class-posterior distributions can perform very well when its assumption is true, but it can deviate in detrimental ways when its assumption is false.

Considering their potential, a number of papers have looked at conditions and assumptions that allow for successful adaptation. A particularly robust one specifies the existence of a common latent embedding, represented by a set of *transfer components* [?]. After mapping data onto these components, one can train and test standard classifiers again. Other possible assumptions include low-data-divergence [?], low-error joint prediction [?], the existence of a domain manifold [?], restrictions to subspace transformations [?], conditional independence of class and target given source data [?] and unconfoundedness [?]. The more restrictive an assumption is, the worse the classifier tends to perform when it is invalid. One of the strengths of the estimator that we develop in this paper is that it does not require making any assumptions on the relationship between the domains.

The domain adaptation and covariate shift settings are very similar to the sample selection bias setting in the statistics and econometrics communities [?]. There, the bias is explicitly modeled as a variable that denotes how likely it is for a particular sample to be selected for the training set. One hopes to generalize to an unbiased sample, i.e., the case where each sample is equally likely to be selected. As such, this setting can also be viewed as a case of domain adaptation, with the biased sample set as the source domain and the unbiased sample set as the target domain. In this case, there is even additional information: the support of the source domain will be contained in the support of the target domain. This information can be exploited, as some methods rely on a non-zero target probability for every source sample [?]. Lastly, the causal inference community has also considered causes for differing training and testing distributions, including how to estimate and control for these differences [?].

Although not often discussed, a variety of papers have reported adaptive classifiers that, at times, perform worse than the non-adaptive source classifier [?]. On closer inspection, this tends to happen when a classifier with a particular assumption is deployed in a problem setting for which this assumption is not valid. For example, if the assumption of a common latent representation does not hold or when the domains are too dissimilar to recover the transfer components, then mapping both source and target data onto the found transfer components will result in mixing of the class-conditional distributions [?]. Additionally, one of the most popular covariate shift approaches, kernel mean matching (kmm), assumes that the support of the target distribution is contained in the support of the source distribution [?]. When this is not the case, the resulting estimated weights can become very bimodal: a few samples are given very large weights and all other samples are given near-zero weights. This greatly reduces the effective sample size for the subsequent classifier [?].

Since the validity of the aforementioned assumptions is difficult, if not impossible, to check, it is of interest to design an adaptive classifier that is at least guaranteed to perform as well as the non-adaptive one. Such a property is often framed as a minimax optimization problem in statistics, econometrics and game theory [?]. Wen et al. constructed a minimax estimator for the covariate shift setting: Robust Covariate Shift Adjustment (rcsa) [?] accounts for estimation errors in the importance weights by considering their worst-case configuration. However, this can sometimes be too conservative, as the worst-case weights can be very disruptive to the subsequent classifier optimization. Another minimax strategy, dubbed the Robust Bias-Aware (rba) classifier [?], plays a game between a risk minimizing target classifier and a risk maximizing target class-posterior distribution, where the adversary is constrained to pick posteriors that match the moments of the source distribution statistics. This constraint is important, as the adversary would otherwise be able to design posterior probabilities that result in degenerate classifiers (e.g. assign a class-posterior probability of 1 to one class and 0 to the other). However, it also means that their approach loses predictive power in areas of feature space where the source distribution has limited support, and thus is not well suited to problems where the domains are very different.

The main contribution of our paper is that we provide an empirical risk minimization framework to train a classifier that will always perform at least as well as the naive source classifier. Furthermore, we show that a discriminant analysis model derived from our framework will *always be likelier* than the naive source model. To the best of our knowledge, strict improvements have not been shown before.

The paper continues as follows: Section 2 presents the motivation and general formulation of our method, with the specific case of a least-squares classifier in Section 3 and the specific case of a discriminant analysis classifier in Section 4. Sections 5.2 and 5.3 show experiments on sample selection bias problems and general domain adaptation problems, respectively, and we conclude by discussing some limitations and implications in Section 6.

## 2 Target Contrastive Pessimistic Risk

This section starts with the problem definition, followed by our risk formulation.

### 2.1 Problem definition

Given a sample space, a *domain* refers to a particular probability measure over this sample space. One has access to labeled data from one domain, denoted the *source* domain, and aims to generalize to another domain, denoted the *target* domain, where no labels are available. The labels are assumed to follow a random variable taking values in a finite set of classes. The source domain provides $n$ labeled samples and the target domain provides $m$ samples, each set drawn from the distribution of its own domain. Both the source and target domain are measured on the same features, in a vector space of the same dimensionality. The target labels are unknown at training time and the goal is to predict them, using only the given unlabeled target samples and the given labeled source samples.

### 2.2 Target Risk

The risk minimization framework formalizes *risk* as the expected loss incurred by a classification function, which maps data to classes, with respect to a particular joint distribution over labeled data. By minimizing empirical risk, i.e. the approximation of the expectation by the sample average over labeled samples, with respect to classifiers from a space of hypothetical classification functions $\mathcal{H}$, one hopes to find the function that generalizes best to novel samples. Additionally, a regularization term that punishes classifier complexity is often incorporated to avoid finding classifiers that are too specific to the given labeled data. For a given data distribution, the choice of loss function, the hypothesis space and the amount of regularization largely determine the behavior of the resulting classifier.

The empirical risk in the source domain is the average loss over the labeled source samples, and the *source classifier* is the classifier found by minimizing this risk over the hypothesis space.

Since the source classifier does not incorporate any target data, it is essentially entirely naive of the target domain. But, if we assume that the domains are related in some way, then it makes sense to apply the source classifier to the target data. To evaluate it in the target domain, one measures the empirical *target risk*, i.e. the average loss of the classifier with respect to the target samples and their labels.

Training on the source domain and testing on the target domain is our baseline, non-adaptive approach. Although the source classifier does not incorporate information from the target domain nor any knowledge on the relation between the domains, it is often *not the worst* classifier. In cases where approaches rely heavily on assumptions, the adaptive classifiers can deviate from the source classifier in ways that lead to even larger target risks.
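For concreteness, the three quantities described in this section can be written out as below. The notation is an assumption made for this sketch, since the symbols are not fixed in the text above: $(x_i, y_i)$, $i = 1, \dots, n$, denote the labeled source samples, $(z_j, u_j)$, $j = 1, \dots, m$, the target samples with their (unknown) labels, $\ell$ the loss function, $\mathcal{H}$ the hypothesis space and $\hat{h}^{\mathcal{S}}$ the source classifier.

$$
\hat{R}(h \mid x, y) = \frac{1}{n} \sum_{i=1}^{n} \ell(h \mid x_i, y_i), \qquad
\hat{h}^{\mathcal{S}} = \arg\min_{h \in \mathcal{H}} \hat{R}(h \mid x, y), \qquad
\hat{R}(\hat{h}^{\mathcal{S}} \mid z, u) = \frac{1}{m} \sum_{j=1}^{m} \ell(\hat{h}^{\mathcal{S}} \mid z_j, u_j)
$$

The first expression is the empirical source risk, the second the source classifier, and the third the empirical target risk of that classifier.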

### 2.3 Contrast

We are interested in finding a classifier that is never worse than the source classifier in terms of the empirical target risk. We formalize this desire by subtracting the source classifier's target risk (Equation 2) from the target risk of a different classifier, which yields a *contrast* between the two.

If such a contrast is used as a risk minimization objective, then the target risk of the resulting classifier is bounded above by that of the source classifier: the minimized contrast is at most 0, a value that is attained when the source classifier itself is found. Classifiers that lead to larger target risks are not valid solutions to the minimization problem, which implies that certain parts of the hypothesis space will never be reached. As such, the contrast implicitly constrains the hypothesis space in a similar way as projection estimators [?].

### 2.4 Pessimism

However, (Equation 3) still incorporates the target labels, which are unknown. Taking a conservative approach, we use a worst-case labeling instead, achieved by *maximizing* risk with respect to a hypothetical labeling. For any classifier, the risk with respect to this worst-case labeling is always at least as large as the risk with respect to the true target labeling.

Unfortunately, maximizing over a set of discrete labels is a combinatorial problem and is computationally very expensive. To avoid this expense, we represent the hypothetical labeling probabilistically: each target sample is assigned a vector of class probabilities. Such a representation is sometimes also referred to as a *soft* label [?]. Additionally, it means that the labeling of each sample is constrained to be an element of a probability simplex over the classes, so that $m$ target samples give rise to $m$ simplices. Note that known labels can also be represented probabilistically, by placing all probability mass on the observed class, and are then sometimes referred to as *crisp* labels.
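The inner maximization for a fixed classifier can be illustrated with a few lines of NumPy. This is a minimal sketch under assumptions: the function names and the randomly generated loss matrix are hypothetical, and the full method maximizes jointly with a minimization over classifiers rather than for one fixed classifier. It shows that the expected loss is linear in the soft labels, so the worst case over the simplices puts all mass on the class with the largest loss.

```python
import numpy as np

def worst_case_soft_labels(loss_per_class):
    """Return the soft labeling that maximizes the expected loss.

    loss_per_class: (m, K) array; entry [j, k] is the loss incurred on
    target sample j if its label were class k.
    """
    m, K = loss_per_class.shape
    q = np.zeros((m, K))
    # A linear function on a simplex attains its maximum at a vertex,
    # so each row puts all mass on the class with the largest loss.
    q[np.arange(m), np.argmax(loss_per_class, axis=1)] = 1.0
    return q

def expected_risk(loss_per_class, q):
    """Average loss under a soft labeling q (each row lies on the simplex)."""
    return np.mean(np.sum(loss_per_class * q, axis=1))

rng = np.random.default_rng(0)
losses = rng.random((5, 3))                    # hypothetical losses: 5 samples, 3 classes
q_worst = worst_case_soft_labels(losses)
q_random = rng.dirichlet(np.ones(3), size=5)   # an arbitrary soft labeling
print(expected_risk(losses, q_worst) >= expected_risk(losses, q_random))
```

Because the maximizer is a crisp labeling, relaxing the discrete labels to soft labels does not change the worst-case value for a fixed classifier; it only makes the problem amenable to gradient-based optimization.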

### 2.5 Target Contrastive Pessimistic Risk

Joining the contrastive target risk from (Equation 3) with the pessimistic labeling from (Equation 4) forms a risk function that averages, over the target samples, the difference in loss between a classifier and the source classifier, weighted by the soft labels.

We refer to this risk (Equation 5) as the Target Contrastive Pessimistic risk (tcp). Minimizing it with respect to the classifier while maximizing it with respect to the hypothetical labeling leads to the new tcp target classifier.

Note that the tcp risk expresses only the performance on the target domain. It is different from the ones used in [?] and [?], because those incorporate the classifier's performance on the source domain as well. Our formulation contains no evaluation on the source domain, and focuses solely on the performance gain we can achieve with respect to the source classifier.
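Written out, the tcp risk and the resulting classifier take roughly the following form. The symbols are assumptions made for this sketch rather than notation fixed by the text above: $z_j$ is the $j$-th of $m$ target samples, $K$ the number of classes, $q_{jk}$ the probability that the hypothetical labeling assigns class $k$ to sample $j$, $\ell$ the loss, $\mathcal{H}$ the hypothesis space, $\hat{h}^{\mathcal{S}}$ the source classifier and $\Delta_{K-1}$ the probability simplex over the classes.

$$
\hat{R}^{\text{TCP}}(h \mid \hat{h}^{\mathcal{S}}, z, q) = \frac{1}{m} \sum_{j=1}^{m} \sum_{k=1}^{K} \left[ \ell(h \mid z_j, k) - \ell(\hat{h}^{\mathcal{S}} \mid z_j, k) \right] q_{jk},
\qquad
\hat{h}^{\mathcal{T}} = \arg\min_{h \in \mathcal{H}} \; \max_{q \in \Delta_{K-1}^{m}} \hat{R}^{\text{TCP}}(h \mid \hat{h}^{\mathcal{S}}, z, q)
$$

Choosing $h = \hat{h}^{\mathcal{S}}$ makes every term in the sum vanish for any $q$, so the minimax value can never exceed 0.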

### 2.6 Optimization

If the loss function is restricted to be globally convex and the hypothesis space $\mathcal{H}$ is a convex set, then the tcp risk will be globally convex in the classifier and there will be a unique optimum with respect to the classifier. The tcp risk is linear in the labeling and bounded due to the simplex constraint, which means that the optimum with respect to the labeling need not be unique. Nonetheless, the combination is globally convex-linear and the existence of a saddle point, i.e. an optimum with respect to both the classifier and the labeling, is guaranteed for the minimax objective [?].

Finding the saddle point can be done by first performing a gradient descent step on the classifier, followed by a gradient ascent step on the labeling. However, this last step can cause the updated labeling to leave the simplex. To enforce the constraint, the labeling is projected back onto the simplex after the gradient step. This projection maps a point outside the simplex to the point on the simplex that is closest in terms of Euclidean distance [?]. Unfortunately, the projection step complicates the computation of the step size, which we replace by a learning rate that decreases over iterations. The overall update therefore consists of a gradient ascent step scaled by the learning rate, followed by the projection. Lastly, a gradient descent - gradient ascent procedure for globally convex-linear objectives is guaranteed to converge to the saddle point (cf. Proposition 4.4 and Corollary 4.5 of [?]).
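The projected ascent step on the labeling can be sketched as follows. This is an illustration under assumptions, not the implementation used for the experiments: the projection uses the standard sort-based algorithm for Euclidean projection onto the probability simplex, and the `alpha0 / t` schedule, like all function and parameter names, is hypothetical (the text only requires a learning rate that decreases over iterations).

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of each row of v onto the probability simplex."""
    v = np.atleast_2d(v).astype(float)
    m, K = v.shape
    u = np.sort(v, axis=1)[:, ::-1]           # rows sorted in descending order
    css = np.cumsum(u, axis=1) - 1.0          # cumulative sums minus the target total of 1
    ks = np.arange(1, K + 1)
    cond = u - css / ks > 0                   # holds exactly for the first rho sorted entries
    rho = cond.sum(axis=1)                    # number of entries that stay positive
    tau = css[np.arange(m), rho - 1] / rho    # per-row shift
    return np.maximum(v - tau[:, None], 0.0)

def ascent_step(q, grad_q, t, alpha0=1.0):
    """One projected gradient-ascent update of the soft labels at iteration t >= 1."""
    alpha_t = alpha0 / t                      # assumed decaying learning rate
    return project_simplex(q + alpha_t * grad_q)

q = np.full((4, 3), 1.0 / 3.0)                # uniform soft labels for 4 samples, 3 classes
grad = np.array([[0.2, -0.1, 0.0]] * 4)       # hypothetical gradient
q_new = ascent_step(q, grad, t=1)
print(q_new.sum(axis=1))                      # each row still sums to 1
```

Projecting row by row is possible because the constraint set is a product of independent simplices, one per target sample.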

## 3 Least-squares

Discriminative classification models make no assumptions on the data distributions and directly optimize predictions. We incorporate a discriminative model through the least-squares classifier, which is defined by a quadratic loss function [?]. For multi-class classification, we employ a one-hot label encoding, also known as a one-vs-all scheme [?].

Furthermore, we chose a linear hypothesis space, in which a prediction is the inner product between the data row vector, implicitly augmented with a constant 1, and the classifier parameter vector. In the following, we will refer to the classifier optimization step, i.e. minimization over the hypothesis space, as a parameter estimation step, i.e. minimization over the parameter space. In summary, the least-squares loss of a sample is the squared difference between this inner product and the encoded label.

Plugging the least-squares loss (Equation 7) into the tcp risk (Equation 5) defines the tcp-ls risk, and the tcp-ls estimate is obtained by minimizing this risk over the parameters while maximizing it over the labeling.

For a fixed labeling, the minimization over the parameters has a closed-form solution for each class's parameter vector.

Keeping the parameters fixed, the tcp-ls risk is linear in the labeling, which makes the gradient with respect to the labeling straightforward to compute.

Algorithm ? gives pseudo-code for tcp-ls.
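The alternating procedure described in this section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm listing: the loss of a sample with hypothetical class $k$ is taken to be the squared distance between its prediction vector and the one-hot vector of class $k$, labels are integers from 0 to `K - 1`, no regularization is applied, and the names `tcp_ls`, `alpha0` and `n_iter` as well as the `alpha0 / t` schedule are hypothetical. Under this loss, the closed-form minimizer for a fixed labeling is an ordinary least-squares fit of the target data to the soft labels.

```python
import numpy as np

def project_simplex(v):
    """Row-wise Euclidean projection onto the probability simplex."""
    m, K = v.shape
    u = np.sort(v, axis=1)[:, ::-1]
    css = np.cumsum(u, axis=1) - 1.0
    rho = (u - css / np.arange(1, K + 1) > 0).sum(axis=1)
    tau = css[np.arange(m), rho - 1] / rho
    return np.maximum(v - tau[:, None], 0.0)

def augment(X):
    """Append the constant 1 to every data row vector."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

def class_losses(Z1, theta):
    """(m, K) squared-error loss of each sample against each one-hot label."""
    pred = Z1 @ theta                                    # (m, K) predictions
    K = theta.shape[1]
    # ||pred_j - e_k||^2 for every sample j and class k
    return ((pred[:, None, :] - np.eye(K)[None, :, :]) ** 2).sum(axis=2)

def tcp_ls(X, y, Z, K, n_iter=100, alpha0=1.0):
    X1, Z1 = augment(X), augment(Z)
    Y = np.eye(K)[y]                                     # one-hot source labels
    theta_src = np.linalg.lstsq(X1, Y, rcond=None)[0]    # source classifier
    loss_src = class_losses(Z1, theta_src)
    m = Z.shape[0]
    q = np.full((m, K), 1.0 / K)                         # uniform initial soft labels
    for t in range(1, n_iter + 1):
        # Minimization: least-squares fit of the target data to the soft labels
        theta = np.linalg.lstsq(Z1, q, rcond=None)[0]
        # Maximization: projected gradient ascent (the gradient does not depend on q)
        grad = (class_losses(Z1, theta) - loss_src) / m
        q = project_simplex(q + (alpha0 / t) * grad)
    theta = np.linalg.lstsq(Z1, q, rcond=None)[0]        # re-fit to the final labeling
    contrast = np.mean(np.sum((class_losses(Z1, theta) - loss_src) * q, axis=1))
    return theta, theta_src, q, contrast
```

Since the parameters returned are the exact minimizer for the final labeling, the contrast evaluated at that labeling is never positive: the source parameters are a feasible choice with contrast 0. Whether a fixed number of iterations is close enough to the saddle point depends on the data and the learning rate, so `n_iter` should be treated as a tuning choice.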

## 4 Discriminant Analysis

As a generative classification model, we chose the classical discriminant analysis model (da). It fits a Gaussian distribution to each class and classifies new samples according to the Gaussian with the largest probability. Again, we will refer to the classifier optimization step as a parameter estimation step. For da models, the parameter space consists of priors, means and covariance matrices for the Gaussian distributions. The model is incorporated in the empirical risk minimization framework by setting the loss function to the negative log-likelihood. The probabilistic labeling is incorporated by weighing the likelihood of each sample under each class's Gaussian distribution by the corresponding label probability.

### 4.1 Quadratic Discriminant Analysis

If one fits one Gaussian distribution per class, the resulting classifier is a quadratic function of the difference in means and covariances, and is hence referred to as quadratic discriminant analysis (qda). Its loss (Equation 9) is the negative log-likelihood of a sample under its class's Gaussian distribution, which involves the determinant of the class covariance matrix and the irrational constant $\pi$.

Plugging the loss from (Equation 9) into (Equation 5) gives the tcp-qda risk, and the tcp-qda estimate is again obtained by minimizing this risk over the parameters while maximizing it over the labeling.

Minimization with respect to the parameters also has a closed-form solution for discriminant analysis models: for each class, the prior, mean and covariance matrix are estimated from the target samples weighted by their soft labels.

One of the properties of a discriminant analysis model is that it requires the estimated covariance matrix to be non-singular. It is possible for the maximizer over the labeling in tcp-qda to assign fewer samples than dimensions to one of the classes, causing the covariance matrix for that class to be singular. To prevent this, we regularize its estimation by first bounding its eigenvalues from below and then adding a scalar multiple of the identity matrix. Essentially, the estimated covariance matrix is constrained to a minimum size in each direction.

Keeping the parameters fixed, the tcp-qda risk is linear in the labeling, which again makes the gradient with respect to the labeling straightforward to compute.

Algorithm ? lists pseudo-code for tcp-qda.
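The parameter estimation step for a fixed labeling can be sketched as below. This is a hedged illustration rather than the authors' listing: it uses the standard soft-label-weighted maximum-likelihood estimates for a Gaussian model, and the regularization (eigenvalues clipped at zero, then `lam` times the identity added) is an assumed reading of the description above, with `lam` and all function names hypothetical.

```python
import numpy as np

def weighted_gaussian_estimates(Z, q, lam=1e-3):
    """Soft-label-weighted prior, mean and covariance matrix for every class.

    Z: (m, D) target samples; q: (m, K) soft labels with rows on the simplex.
    lam: hypothetical regularization strength added to the diagonal.
    """
    m, D = Z.shape
    K = q.shape[1]
    priors = q.sum(axis=0) / m
    means = np.zeros((K, D))
    covs = np.zeros((K, D, D))
    for k in range(K):
        w = q[:, k]
        wk = max(w.sum(), 1e-12)                     # guard against an empty class
        means[k] = w @ Z / wk
        C = Z - means[k]
        S = (C * w[:, None]).T @ C / wk
        # Regularize: bound the eigenvalues below by zero, then add lam * I
        vals, vecs = np.linalg.eigh(S)
        S = (vecs * np.maximum(vals, 0.0)) @ vecs.T
        covs[k] = S + lam * np.eye(D)
    return priors, means, covs

def qda_class_losses(Z, priors, means, covs):
    """(m, K) negative log-likelihood of each sample under each class."""
    m, D = Z.shape
    K = len(priors)
    nll = np.zeros((m, K))
    for k in range(K):
        C = Z - means[k]
        _, logdet = np.linalg.slogdet(covs[k])
        maha = np.sum(C @ np.linalg.inv(covs[k]) * C, axis=1)
        log_prior = np.log(np.maximum(priors[k], 1e-12))
        nll[:, k] = 0.5 * (D * np.log(2 * np.pi) + logdet + maha) - log_prior
    return nll
```

Given these per-class losses for the current parameters and for the source parameters, the gradient with respect to the soft labels is their difference divided by the number of target samples, after which the labels are projected back onto the simplex.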

### 4.2 Linear Discriminant Analysis

If one constrains the model to share a single covariance matrix across all classes, the resulting classifier is a linear function of the difference in means and is hence termed linear discriminant analysis (lda). This constraint is enforced through the weighted sum over class covariance matrices.

### 4.3 Performance Guarantee

The discriminant analysis model has a very surprising property: it obtains a *strictly* smaller risk than the source classifier. To our knowledge, this is the first time that such a performance guarantee can be given in the context of domain adaptation, without using any assumptions on the relation between the two domains.

The proof follows similar steps as a robust guarantee for discriminant analysis in semi-supervised learning [?]. It should be noted that the risks of tcp-lda and tcp-qda are always strictly smaller with respect to the given target samples, but not necessarily strictly smaller with respect to new target samples. However, when the given target samples are a good representation of the target distribution, one does expect the adapted model to generalize well to new target samples. Additionally, as long as the same amount of regularization is added to both the source and the tcp classifier, the guarantee also holds for a regularized model.

## 5 Experiments

Our experiments compare the risks of the tcp classifiers with that of the source classifier and the corresponding oracle target classifier, as well as their performance with respect to various state-of-the-art domain adaptive classifiers through their areas under the ROC-curve. In all experiments, all target samples are given, unlabeled, to the adaptive classifiers. They make predictions for those given target samples and their performance is evaluated with respect to those target samples’ true labels. Cross-validation for regularization parameters was done by holding out source data, as that is the only data for which labels are available at training time. The range of values we tested was .

### 5.1 Compared methods

We implemented transfer component analysis (tca) [?], kernel mean matching (kmm) [?], robust covariate shift adjustment (rcsa) [?] and the robust bias-aware (rba) classifier [?] for the comparison (see cited papers for more information). tca and kmm are chosen because they are popular classifiers with clear assumptions. rcsa and rba are chosen because they also employ minimax formulations but from different perspectives; rcsa as a worst-case and rba as a moment-matching importance weighing. Their implementation details are discussed briefly below.

**Transfer Component Analysis** tca assumes that there exists a common latent representation for both domains and aims to find this representation by means of a cross-domain nonlinear component analysis [?]. In our implementation, we employ a radial basis function kernel with a bandwidth of and set the trade-off parameter to . After mapping the data onto their transfer components, we train a logistic regressor on the mapped source data and apply it to the mapped target data.

**Kernel Mean Matching** kmm assumes that the class-posterior distributions are equal in both domains and that the support of the target distribution is contained within the support of the source distribution [?]. When the first assumption fails, kmm will have deviated from the source classifier in a manner that will not lead to better results on the target domain. When the second assumption fails, the variance of the importance-weights increases to the point where a few samples receive large weights and all other samples receive very small weights, reducing the effective training sample size and leading to pathological classifiers. We use a radial basis function kernel with a bandwidth of , kernel regularization of to favor estimates with lower variation over weights and upper bound the weights by . After estimating importance weights, we train a weighed least-squares classifier on the source samples.
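The loss of effective training sample size under bimodal weights can be made concrete with a small computation. The sketch below uses Kish's effective sample size, which is an assumption on our part: the text does not state which definition is meant, and the weight vectors are made-up examples rather than weights estimated by kmm.

```python
import numpy as np

def effective_sample_size(w):
    """Kish's effective sample size of a vector of non-negative weights."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

uniform = np.ones(100)                                  # every sample counts equally
skewed = np.r_[np.full(3, 50.0), np.full(97, 1e-3)]     # three dominant samples

print(effective_sample_size(uniform))   # equals the number of samples
print(effective_sample_size(skewed))    # close to the number of dominant samples
```

With uniform weights all 100 samples contribute, whereas in the skewed case the subsequent classifier is effectively trained on about three samples.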

**Robust Covariate Shift Adjustment** rcsa also assumes equal class-posterior distributions and containment of the support of the target distribution within the support of the source distribution, but additionally incorporates a worst-case labeling [?]. To be precise, it maximizes risk with respect to the importance weights. We used the authors’ publicly available code with 5-fold cross-validation for its hyperparameters. Interestingly, the authors also discuss a relation between covariate shift and model misspecification, as described by [?]. They argue for a two-step (estimate weights - train classifier) approach in a game-theoretical form [?], which is done by all importance-weighted classifiers in this paper.

**Robust Bias-Aware** rba assumes that the moments of the feature statistics are approximately equal up to a particular order [?]. In their formulation, the adversary plays a classifier whose class-posterior probabilities are used as a labeling of the target samples, but who is also constrained to match the moments with the source domain’s statistics. The player then proposes an importance-weighted classifier that aims to perform well on both domains. Note that the constraints on the adversary are, among others, necessary to avoid the players switching strategies constantly. We implement rba using first-order feature statistics for the moment-matching constraints, which was also done by the authors in their paper. Furthermore, we use a ratio of normal distributions for the weights and bound them above by .

### 5.2 Experiments in a sample selection bias setting

Sample selection bias settings occur when data is collected locally from a larger population. For regression problems, these settings are usually created through a parametric sampling of the feature space [?]. We created something similar but for classification problems: samples are concentrated around a certain subset of the feature space, but with equal class priors as the whole set. For each class:

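1. Find the sample closest to the origin, which serves as the seed point.
2. Compute the distance from the seed point to all other samples of the same class.
3. Draw samples without replacement, with selection probabilities determined by these distances.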

Here, the number of samples drawn from each class follows from the total number of samples to draw and the class-prior distribution of the whole set. Drawing samples from each class in this way leads to approximately the same class-prior distributions in the source domain as in the target domain. We chose the squared Mahalanobis distance, with the covariance matrix estimated on all data, since that takes scale differences between features into account. Figure 1 presents an example, showing the first two principal components of the pima diabetes dataset. Red/blue squares denote the selected source samples, black circles denote all samples and the green stars denote the seed points (one for each class).
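The selection procedure can be sketched as follows. Several details are assumptions made for illustration, since they are not recoverable from the text above: the seed is chosen by Euclidean distance to the origin, the selection probability is taken proportional to `exp(-d)` for squared Mahalanobis distance `d`, and the per-class sample count is the total count times the class prior. The function name and arguments are hypothetical.

```python
import numpy as np

def select_biased_sample(X, y, n_total, rng):
    """Indices of a locally concentrated subsample with preserved class priors."""
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))    # covariance estimated on all data
    selected = []
    for k in np.unique(y):
        idx = np.flatnonzero(y == k)
        Xk = X[idx]
        seed = Xk[np.argmin(np.sum(Xk ** 2, axis=1))]    # sample closest to the origin
        diff = Xk - seed
        d = np.sum(diff @ cov_inv * diff, axis=1)        # squared Mahalanobis distance
        p = np.exp(-d)                                   # assumed weighting: nearby samples favored
        p /= p.sum()
        # Per-class count follows the class prior of the whole set
        n_k = min(len(idx), int(round(n_total * len(idx) / len(y))))
        # Requires at least n_k samples with non-zero probability
        selected.append(rng.choice(idx, size=n_k, replace=False, p=p))
    return np.concatenate(selected)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                            # synthetic stand-in for a dataset
y = (rng.random(200) < 0.3).astype(int)
source_idx = select_biased_sample(X, y, 50, rng)
```

The selected indices form the source domain, while the full set serves as the target domain, mirroring the setup described above.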

#### Data

We collected the following datasets from the UCI machine learning repository: cylinder bands printing (bands), car evaluation (car), credit approval (credit), ionosphere (iono), mammographic masses (mamm), pima diabetes (pima) and tic-tac-toe endgame (t3). Table 2 lists their characteristics. All missing values have been imputed to . For each dataset, we draw samples as the source domain while treating all samples as the target domain.

| Dataset | #Samples | #Features | #Missing | Class (-1 / +1) |
|---|---|---|---|---|
| bands | 539 | 39 | 569 | 312 / 227 |
| car | 1728 | 6 | 0 | 1210 / 518 |
| credit | 690 | 15 | 67 | 307 / 383 |
| iono | 351 | 34 | 0 | 126 / 225 |
| mamm | 961 | 5 | 162 | 516 / 445 |
| pima | 768 | 8 | 0 | 500 / 268 |
| t3 | 958 | 9 | 0 | 332 / 626 |

#### Results

The risks (average negative log-likelihoods for the discriminant analysis models and mean squared errors for the least-squares classifiers) in Table 6 belong to the source classifiers, the tcp classifiers and the oracle target classifiers. The oracles represent the best possible result, as they comprise the risk of a classifier trained on all target samples with their true labels. The results show varying degrees of improvement for the tcp classifiers. tcp-lda approaches t-lda more closely than the other two versions, with tcp-ls being the most conservative one. For the ionosphere and tic-tac-toe datasets, the improvement is quite dramatic, indicating that the source classifier is a poor model for the target domain. Note also that some overfitting might be occurring as tcp-qda does not always have a lower risk than tcp-lda, even though t-qda does always have a lower risk than t-lda.

| Dataset | s-lda | tcp-lda | t-lda | s-qda | tcp-qda | t-qda | s-ls | tcp-ls | t-ls |
|---|---|---|---|---|---|---|---|---|---|
| bands | -216.3 | -218.4 | -218.8 | -215.3 | -217.8 | -219.1 | 1.170 | 1.109 | 0.827 |
| car | 17.16 | 2.850 | 2.148 | 57.39 | 18.77 | 2.049 | 1.968 | 1.205 | 0.672 |
| credit | -80.04 | -83.64 | -83.65 | -78.99 | -83.73 | -84.61 | 2.430 | 0.973 | 0.757 |
| iono | 199.5 | -8.480 | -8.782 | 26.30 | -9.325 | -18.78 | 17.06 | 0.815 | 0.350 |
| mamm | 8.133 | -10.40 | -11.22 | 31.66 | -10.08 | -11.68 | 0.818 | 0.668 | 0.580 |
| pima | -15.92 | -23.44 | -24.15 | -7.486 | -23.09 | -24.30 | 1.083 | 1.012 | 0.633 |
| t3 | 18.77 | 6.136 | 4.734 | 117.3 | 39.13 | 4.611 | 1.401 | 1.401 | 0.849 |

Table 8 compares the performances of the adaptive classifiers on all datasets through their areas under the ROC-curve (AUC). Although there is quite a variety between datasets, the variation between classifiers within a dataset is relatively small; all approaches perform similarly well. However, with our selection bias procedure, the moments of the target statistics do not match the source statistics (e.g. the target’s variance is by construction always larger), which affects rba’s performance negatively. Interestingly, the tcp discriminant analysis models are quite competitive in cases where their improvement over the source classifier was larger. Unfortunately, like rba, the more conservative tcp-ls never outperforms all other methods simultaneously on any of the datasets. Still, on average it reaches competitive performance. In summary, the tcp classifiers perform on par with the other adaptive classifiers.

| Dataset | tca | kmm | rcsa | rba | tcp-ls | tcp-lda | tcp-qda |
|---|---|---|---|---|---|---|---|
| bands | .578 | .620 | .562 | .504 | .588 | .548 | .589 |
| car | .736 | .776 | .742 | .684 | .734 | .758 | .699 |
| credit | .716 | .694 | .655 | .702 | .662 | .646 | .663 |
| iono | .741 | .817 | .835 | .687 | .731 | .894 | .826 |
| mamm | .656 | .804 | .749 | .762 | .836 | .824 | .847 |
| pima | .691 | .630 | .760 | .271 | .692 | .684 | .637 |
| t3 | .608 | .532 | .439 | .446 | .520 | .529 | .606 |
| mean | .675 | .696 | .677 | .579 | .680 | .698 | .695 |

### 5.3 Experiments in a domain adaptation setting

We performed a set of experiments on a dataset that is naturally split into multiple domains: predicting heart disease in patients from hospitals in 4 different locations. It is a much more realistic setting because problem variables such as prior shift, class imbalance and proportion of imputed features are not controlled. As such, it is a harder problem than the sample selection bias setting. In this setting, the target domains often only have limited overlap with the source domain and can be very dissimilar. As the results will show, many of the assumptions that the state-of-the-art domain adaptive classifiers rely upon do not hold, and their performance degrades drastically.

#### Data

The hospitals are the Hungarian Institute of Cardiology in Budapest (data collected by Andras Janosi), the University Hospital Zurich (collected by William Steinbrunn), the University Hospital Basel (courtesy of Matthias Pfisterer), the Veterans Affairs Medical Center in Long Beach, California, USA, and the Cleveland Clinic Foundation in Cleveland, Ohio, USA (both courtesy of Robert Detrano). They will be referred to as Hungary, Switzerland, California and Ohio hereafter. The data from these hospitals can be considered domains, as the patients are all measured on the same biometrics but show different distributions. For example, patients in Hungary are on average younger than patients from Switzerland (48 versus 55 years). Each patient is described by 13 features: age, sex, chest pain type, resting blood pressure, cholesterol level, high fasting blood sugar, resting electrocardiography, maximum heart rate, exercise-induced angina, exercise-induced ST depression, slope of peak exercise ST, number of major vessels in fluoroscopy, and normal/defective/reversible heart rate.

Table 10 describes, for all pairwise combinations of designating one domain as the source and another as the target, the number of source and target samples (n and m), the total number of missing measurements that have been imputed in each domain, the class balance in each domain, and the empirical Maximum Mean Discrepancy (MMD). First of all, the sample size imbalance is not really a problem, as the largest difference occurs in the Ohio - Switzerland combination with 303 and 123 samples respectively. However, the fact that the classes are severely imbalanced in different proportions, for example going from 54% : 46% to 7% : 93% in Ohio - Switzerland, creates a very difficult setting. A shift in the prior distributions can be disastrous for some classifiers, such as rba, which relies on matching the source and target feature statistics. Furthermore, a sudden increase in the number of missing values (unmeasured patient biometrics), such as in Ohio - California, means that a classifier relying on a certain feature for discrimination degrades when this feature is missing in the target domain.

Additionally, the empirical Maximum Mean Discrepancy measures how far apart two sets of samples are [?]. For its kernel, we used a radial-basis function with a bandwidth of 1. An MMD of 0 means that the two sets are identical, while larger values indicate larger discrepancies between the two sets. The combinations Ohio - Switzerland and Switzerland - Hungary have the largest MMD, up to two orders of magnitude larger than that of other combinations. Overall, looking at all three sets of descriptive statistics, the combinations Ohio - Switzerland and Switzerland - Hungary should pose the most difficulty for the adaptive classifiers.
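To make the discrepancy measure concrete, the following is a minimal sketch of an empirical MMD with a radial-basis function kernel of bandwidth 1. It computes the biased estimate of the squared MMD and assumes the kernel is parameterized as exp(-||x - z||^2 / (2 * bandwidth^2)); whether the values in Table 10 use exactly this estimator and parameterization is an assumption. The function names and the synthetic data are illustrative only.

```python
import numpy as np


def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def empirical_mmd(X, Z, bandwidth=1.0):
    """Biased estimate of the squared MMD between sample sets X and Z."""
    k_xx = rbf_kernel(X, X, bandwidth).mean()
    k_zz = rbf_kernel(Z, Z, bandwidth).mean()
    k_xz = rbf_kernel(X, Z, bandwidth).mean()
    return k_xx + k_zz - 2.0 * k_xz


# Synthetic stand-ins for a source set and two target sets (not the real data)
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
Z_near = rng.normal(0.1, 1.0, size=(150, 2))
Z_far = rng.normal(2.0, 1.0, size=(150, 2))

mmd_same = empirical_mmd(X, X)        # identical sets
mmd_near = empirical_mmd(X, Z_near)   # slightly shifted target
mmd_far = empirical_mmd(X, Z_far)     # strongly shifted target
```

In this sketch the discrepancy is zero for identical sets and grows as the target set moves away from the source set, which matches the way the statistic is read in the text.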

Lastly, to further illustrate how the domains differ, we plotted histograms of the age and resting blood pressure of all patients, split by domain (see Figure 2). Not only are they different on average, they tend to differ in variance and skewness as well.

Source | Target | n | m | Missing (source) | Missing (target) | Class balance (source) | Class balance (target) | MMD |
---|---|---|---|---|---|---|---|---|
O | H | 303 | 294 | 6 | 782 | 164:139 | 188:106 | 0.0012 |
O | S | 303 | 123 | 6 | 273 | 164:139 | 8:115 | 0.1602 |
O | C | 303 | 200 | 6 | 698 | 164:139 | 51:149 | 0.0227 |
H | S | 294 | 123 | 782 | 273 | 188:106 | 8:115 | 0.1384 |
H | C | 294 | 200 | 782 | 698 | 188:106 | 51:149 | 0.0151 |
S | C | 123 | 200 | 273 | 698 | 8:115 | 51:149 | 0.0804 |
H | O | 294 | 303 | 782 | 6 | 188:106 | 164:139 | 0.0012 |
S | O | 123 | 303 | 273 | 6 | 8:115 | 164:139 | 0.1602 |
C | O | 200 | 303 | 698 | 6 | 51:149 | 164:139 | 0.0227 |
S | H | 123 | 294 | 273 | 782 | 8:115 | 188:106 | 0.1384 |
C | H | 200 | 294 | 698 | 782 | 51:149 | 188:106 | 0.0151 |
C | S | 200 | 123 | 698 | 273 | 51:149 | 8:115 | 0.0804 |

#### Results

Table 14 lists the target risks (average negative log-likelihoods for the discriminant analysis models and mean squared errors for the least-squares classifiers), evaluated with the true labels of the given target samples, for all source, tcp and oracle target classifiers. Note that the tcp risks range between the source and the oracle target risks. For some combinations tcp is extremely conservative, e.g. Switzerland - Ohio and Switzerland - Hungary for the least-squares case, and for others it is much more liberal, e.g. Switzerland - California, Switzerland - Ohio and Switzerland - Hungary for the discriminant analysis models. In general, the discriminative model (tcp-ls) deviates much less from the source classifier and is much more conservative than the generative models (tcp-lda and tcp-qda). Note that the order of magnitude of the improvement with tcp-da in the Switzerland - California, Switzerland - Ohio and Switzerland - Hungary settings is due to the fact that the two domains lie far apart: the target samples lie very far in the tails of the source models’ Gaussian distributions and evaluate to very small likelihoods, which become very large negative log-likelihoods.
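The effect of samples lying in the tails can be illustrated with a small sketch. It fits a single Gaussian to a tight synthetic "source" cloud and evaluates the average negative log-likelihood on a nearby and a distant "target" cloud. This is a simplification of the discriminant analysis models (one Gaussian, no class priors), and the data and the function name `gaussian_avg_nll` are hypothetical.

```python
import numpy as np


def gaussian_avg_nll(Z, mean, cov):
    """Average negative log-likelihood of the rows of Z under N(mean, cov)."""
    d = Z.shape[1]
    diff = Z - mean
    # Mahalanobis distances, solving a linear system instead of inverting cov
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    log_lik = -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)
    return -np.mean(log_lik)


rng = np.random.default_rng(1)
source = rng.normal(0.0, 0.5, size=(100, 2))   # tight source cloud
near = rng.normal(0.2, 0.5, size=(100, 2))     # target close to the source
far = rng.normal(10.0, 0.5, size=(100, 2))     # target far in the tails

# Gaussian fitted on the source samples only
mu = source.mean(axis=0)
sigma = np.cov(source, rowvar=False)

nll_near = gaussian_avg_nll(near, mu, sigma)
nll_far = gaussian_avg_nll(far, mu, sigma)
```

The distant target set yields a risk that is orders of magnitude larger than the nearby one, simply because the squared Mahalanobis distance enters the negative log-likelihood directly.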

Source | Target | s-lda | tcp-lda | t-lda | s-qda | tcp-qda | t-qda | s-ls | tcp-ls | t-ls |
---|---|---|---|---|---|---|---|---|---|---|
O | H | -53.55 | -57.18 | -57.35 | -53.55 | -57.20 | -57.62 | 0.580 | 0.579 | 0.444 |
O | S | -8.293 | -16.76 | -17.54 | -8.293 | -16.76 | -17.54 | 1.449 | 1.449 | 0.213 |
O | C | -37.84 | -53.88 | -54.69 | -37.83 | -53.73 | -54.89 | 1.441 | 1.441 | 0.613 |
H | S | -12.50 | -16.08 | -17.54 | -12.80 | -16.44 | -17.54 | 1.068 | 1.068 | 0.213 |
H | C | -41.70 | -53.91 | -54.69 | -40.08 | -54.45 | -54.89 | 1.120 | 1.104 | 0.613 |
S | C | 494.9 | -54.49 | -54.69 | 498.9 | -54.44 | -54.89 | 0.904 | 0.904 | 0.671 |
H | O | -48.91 | -55.08 | -55.23 | -49.20 | -54.84 | -55.53 | 0.642 | 0.638 | 0.463 |
S | O | 709.9 | -54.07 | -55.23 | 709.9 | -54.10 | -55.53 | 1.700 | 1.700 | 0.696 |
C | O | -49.21 | -55.00 | -55.23 | -49.17 | -55.05 | -55.53 | 1.833 | 1.833 | 0.470 |
S | H | 649.9 | -56.09 | -57.35 | 650.3 | -56.19 | -57.62 | 2.102 | 2.102 | 0.740 |
C | H | -53.05 | -57.19 | -57.35 | -53.15 | -57.17 | -57.62 | 0.582 | 0.582 | 0.444 |
C | S | -15.45 | -17.43 | -17.54 | -15.47 | -17.44 | -17.54 | 0.415 | 0.415 | 0.236 |

Looking at the areas under the ROC-curves in Table 16, one observes a different pattern in the classifier performances. tca, kmm, rcsa and rba perform much worse, often below chance level. It can be seen that, in some cases, the assumption of equal class-posterior distributions still holds approximately, as kmm and rcsa sometimes perform quite well, e.g. in Hungary - Ohio. tca’s performance varies around chance level, indicating that it is difficult to recover a common latent representation in these settings. That makes sense, as the domains lie further apart this time. rba’s performance drops most in cases where the differences in priors and proportions of missing values are largest, e.g. Hungary - California, which also makes sense, as it expects similar feature statistics in both domains. tcp-ls performs very well in almost all cases; the conservative strategy is paying off. tcp-lda also performs very well, even outperforming tcp-qda in most cases. The added flexibility of a covariance matrix per class is not beneficial because it is much more difficult to fit correctly. Note that the domain combinations are asymmetrical; for example, rcsa’s performance is quite strong when Switzerland is the source domain and Ohio the target domain, but its performance is much weaker when Ohio is the source domain and Switzerland the target domain. In some combinations, assumptions on how two domains are related to each other might be valid that are not valid in their reverse combinations. Overall, in this more general domain adaptation setting, our more conservative approach works best, as shown by the mean performances.

Source | Target | tca | kmm | rcsa | rba | tcp-ls | tcp-lda | tcp-qda |
---|---|---|---|---|---|---|---|---|
O | H | .699 | .710 | .372 | .481 | .881 | .882 | .817 |
O | S | .590 | .551 | .634 | .670 | .714 | .671 | .671 |
O | C | .496 | .476 | .560 | .450 | .671 | .668 | .476 |
H | S | .455 | .501 | .646 | .602 | .668 | .665 | .666 |
H | C | .528 | .533 | .585 | .434 | .727 | .709 | .662 |
S | C | .475 | .573 | .464 | .603 | .605 | .546 | .480 |
H | O | .616 | .742 | .751 | .510 | .864 | .876 | .863 |
S | O | .582 | .353 | .750 | .449 | .753 | .589 | .426 |
C | O | .484 | .337 | .551 | .557 | .671 | .831 | .828 |
S | H | .407 | .370 | .629 | .484 | .697 | .724 | .604 |
C | H | .472 | .427 | .538 | .616 | .805 | .878 | .824 |
C | S | .511 | .593 | .462 | .474 | .709 | .503 | .535 |
Mean | | .526 | .514 | .578 | .528 | .730 | .712 | .654 |
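For readers less familiar with the measure, the sketch below computes the area under the ROC curve as the fraction of positive-negative pairs that are ranked correctly (the Mann-Whitney formulation, with ties counted as half). It illustrates why values below 0.5 mean that a classifier ranks samples worse than chance. The function name `auc` and the toy scores are illustrative and are not taken from the experiments.

```python
import numpy as np


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))


labels = np.array([0, 0, 1, 1])
auc_good = auc([0.1, 0.4, 0.35, 0.8], labels)      # mostly correct ranking
auc_inverted = auc([0.9, 0.6, 0.65, 0.2], labels)  # same scores, flipped
auc_constant = auc([0.5, 0.5, 0.5, 0.5], labels)   # uninformative scores
```

Flipping the scores turns an AUC of a into 1 - a, so a classifier that scores consistently below chance level has learned a ranking that is systematically inverted on the target domain.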

#### Visualization of the worst-case labeling

The adversary in tcp’s minimax formulation maximizes the objective with respect to the probability that a sample belongs to a given class. Note, however, that the worst-case labeling corresponds to the labeling that maximizes the contrast: it looks for the labeling for which the difference in risk between the current parameters and the source parameters is largest. It would be interesting to visualize this labeling at the saddle point. Figure 3 shows the samples from Hungary projected onto their first two principal components, together with each sample’s probability of belonging to one of the classes. The top left figure shows the true labeling, the top right the probabilities for tcp-ls, the bottom left for tcp-lda and the bottom right for tcp-qda. In all three tcp cases the labeling is quite smooth and does not vary too much between neighboring points. One would expect a rough labeling, but note that labellings that are bad for the source classifier will most likely also be bad for the tcp classifier, so that the resulting contrast will be small instead of maximal. The probabilities for tcp-ls lie closer to 0 and 1 than those for tcp-lda and tcp-qda.

## 6Discussion

Although the tcp classifiers are never worse than the source classifier by construction, they will not automatically lead to improvements in the error rate. This is due to the difference between optimizing a surrogate loss and evaluating the 0/1-loss [?]. There is no one-to-one mapping between the minimizer of the surrogate loss and the minimizer of the 0/1-loss.
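A toy example shows how a lower surrogate loss can coincide with a higher error rate. The two sets of predictions below are hypothetical and are not produced by any of the classifiers in this paper; the squared loss plays the role of the surrogate.

```python
import numpy as np

y = np.array([1.0, 1.0, 1.0, -1.0])        # true labels in {-1, +1}
f_a = np.array([0.1, 0.1, 0.1, -0.1])      # timid, but always the right sign
f_b = np.array([1.0, 1.0, 1.0, 0.2])       # confident, wrong sign on the last sample


def squared_loss(f, y):
    """Mean squared error between predictions and labels (surrogate loss)."""
    return np.mean((f - y) ** 2)


def zero_one_loss(f, y):
    """Fraction of samples whose predicted sign differs from the label."""
    return np.mean(np.sign(f) != y)


surrogate_a, surrogate_b = squared_loss(f_a, y), squared_loss(f_b, y)
error_a, error_b = zero_one_loss(f_a, y), zero_one_loss(f_b, y)
```

Here `f_b` has the smaller squared loss (0.36 versus 0.81) but the larger error rate (0.25 versus 0), so a guarantee on the surrogate risk does not carry over to the error rate.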

One peculiar advantage of our tcp model is that we do not explicitly require source samples at training time. They are not incorporated in the risk formulation, which means that they do not have to be retained in memory. It is sufficient to receive the parameters of a trained classifier that can serve as a baseline. Our approach is therefore more efficient than, for example, importance-weighting techniques, which require source samples for importance-weight estimation and subsequent training. Additionally, it would be interesting to construct a contrast with multiple source domains. The union of source classifiers might serve as a very good starting point for the tcp model.

For each adaptive classifier in this paper, regularization parameters are estimated through cross-validation on held-out source samples. However, this procedure is known to be biased, as it does not account for domain dissimilarity [?]. What is optimal with respect to held-out source samples need not be optimal with respect to target samples. The performance of many adaptive models might be improved with appropriate adaptive validation techniques.

## 7Conclusion

We have designed a risk minimization formulation for a domain-adaptive classifier whose performance, in terms of risk, is always at least as good as that of the non-adaptive source classifier. Furthermore, for the discriminant analysis case, its risk is strictly smaller almost surely. Our target contrastive pessimistic model performs on par with state-of-the-art domain adaptive classifiers in sample selection bias settings and outperforms them in more realistic domain adaptation problem settings.

## Acknowledgments

This work was supported by the Netherlands Organization for Scientific Research (NWO; grant 612.001.301).

## Appendix A

Let the source and target sample sets, of sizes n and m, be drawn from continuous source and target distributions, respectively. Consider a discriminant analysis model parameterized by class priors, class means and either one covariance matrix per class (qda) or a single covariance matrix shared by all classes (lda). The empirical risk is measured with the Gaussian average negative log-likelihood, weighted by a set of soft labels. The sample covariance matrix is required to be non-singular, which is guaranteed when there are more unique samples than features for every class. In the lda case, it is sufficient that there are more unique samples than features in total. Let the source parameters be the parameters estimated on labeled source data.

Firstly, for any fixed soft labeling, the minimized contrast between the target risk of any parameter and the target risk of the source parameters is non-positive, because both parameter sets are elements of the same parameter space.

Parameters that result in a larger target risk than that of the source parameters are not minimizers of the contrast. The maximum value that the minimized contrast can attain is 0, which occurs when exactly the same parameters are found, i.e. when the minimizer equals the source parameters. Considering that the contrast is non-positive for any labeling, it is also non-positive with respect to the worst-case labeling.
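This first step can be checked numerically. The sketch below uses a one-dimensional, two-class Gaussian model with a separate variance per class, draws random soft labelings, and confirms that the contrast at the soft-label-weighted maximum likelihood estimate is never positive. The form of the risk, the closed-form minimizer, the function names and the synthetic data are assumptions made for this illustration and are not taken from the paper's equations.

```python
import numpy as np


def weighted_nll(Z, q, prior, mean, var):
    """Average negative log-likelihood of 1-D samples Z under soft labels q.

    q has shape (m, K); prior, mean and var have shape (K,).
    """
    log_gauss = -0.5 * (np.log(2.0 * np.pi * var) + (Z[:, None] - mean) ** 2 / var)
    return -np.mean(np.sum(q * (np.log(prior) + log_gauss), axis=1))


def weighted_mle(Z, q):
    """Closed-form minimizer of weighted_nll for a fixed soft labeling q."""
    mass = q.sum(axis=0)
    prior = mass / len(Z)
    mean = (q * Z[:, None]).sum(axis=0) / mass
    var = (q * (Z[:, None] - mean) ** 2).sum(axis=0) / mass
    return prior, mean, var


rng = np.random.default_rng(2)
Z = rng.normal(1.0, 2.0, size=50)            # synthetic target samples

# Hypothetical source parameters: priors, class means, class variances
source_params = (np.array([0.5, 0.5]),
                 np.array([-1.0, 1.0]),
                 np.array([1.0, 1.0]))

contrasts = []
for _ in range(100):
    q = rng.dirichlet(np.ones(2), size=len(Z))   # random soft labeling
    target_params = weighted_mle(Z, q)
    contrast = (weighted_nll(Z, q, *target_params)
                - weighted_nll(Z, q, *source_params))
    contrasts.append(contrast)
```

Because the source parameters are themselves a feasible choice, the minimizer can never do worse than them under the same labeling, which is why every entry of `contrasts` is at most zero.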

Secondly, given that the empirical risk with respect to the true labeling is always less than or equal to the empirical risk with the worst-case labeling, the target contrastive risk (Equation 3) with the true labeling is always less than or equal to the target contrastive pessimistic risk (Equation 5).

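Let the estimated parameters and the estimated worst-case labeling be the minimaximizer of the target contrastive pessimistic risk on the right-hand side of Equation 12. Plugging these estimates into Equation 12 bounds the target contrastive risk of the tcp estimate under the true labeling by the target contrastive pessimistic risk at the saddle point (Equation 13).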

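Combining the inequalities in Equation 11 and Equation 13 shows that the target risk of the tcp estimate under the true labeling is less than or equal to the target risk of the source parameters (Equation 14).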
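The chain of inequalities in this first part of the argument can be summarized compactly. The notation is assumed for this sketch and may differ from the paper's own equations: $\hat{R}(\theta \mid Z, q)$ is the empirical target risk of parameters $\theta$ on the target samples $Z$ under soft labels $q$, $\hat{\theta}^{S}$ are the source parameters, $u$ is the true target labeling, and $(\hat{\theta}, \hat{q})$ is the minimaximizer.

```latex
\begin{align}
\min_{\theta}\ \hat{R}(\theta \mid Z, q) - \hat{R}(\hat{\theta}^{S} \mid Z, q)
  &\leq 0 \quad \text{for every } q, \\
\hat{R}(\hat{\theta} \mid Z, \hat{q}) - \hat{R}(\hat{\theta}^{S} \mid Z, \hat{q})
  &\leq 0, \\
\hat{R}(\hat{\theta} \mid Z, u) - \hat{R}(\hat{\theta}^{S} \mid Z, u)
  &\leq \hat{R}(\hat{\theta} \mid Z, \hat{q}) - \hat{R}(\hat{\theta}^{S} \mid Z, \hat{q}), \\
\hat{R}(\hat{\theta} \mid Z, u)
  &\leq \hat{R}(\hat{\theta}^{S} \mid Z, u).
\end{align}
```

The first two lines correspond to the non-positivity of the contrast, the third to the true labeling being one of the labelings the adversary can choose, and the last line follows by chaining the second and third.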

However, equality of the two risks in Equation 14 occurs with probability 0, which we will show in the following.

The total mean for the source classifier is the combination of the class means weighted by the class priors, which results in the overall source sample average.

The total mean for the tcp-da estimator is similarly defined, which results in the overall target sample average.
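The reduction of the total mean to the sample average can be written out as follows. The symbols are assumptions for this sketch: $z_j$ are the $m$ target samples, $q_{jk}$ is the soft label of sample $j$ for class $k$, and the estimated priors and class means are taken to be the soft-label-weighted estimates $\hat{\pi}_k = \frac{1}{m} \sum_j q_{jk}$ and $\hat{\mu}_k = \sum_j q_{jk} z_j / \sum_j q_{jk}$. The last step uses that the soft labels of each sample sum to 1 over the classes.

```latex
\begin{align}
\sum_{k=1}^{K} \hat{\pi}_k \hat{\mu}_k
  &= \sum_{k=1}^{K} \frac{\sum_{j=1}^{m} q_{jk}}{m} \cdot
     \frac{\sum_{j=1}^{m} q_{jk} z_j}{\sum_{j=1}^{m} q_{jk}} \\
  &= \frac{1}{m} \sum_{j=1}^{m} \Big( \sum_{k=1}^{K} q_{jk} \Big) z_j
   = \frac{1}{m} \sum_{j=1}^{m} z_j .
\end{align}
```

The same calculation with hard source labels in place of $q_{jk}$ gives the source sample average.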

Note that since the worst-case labeling consists of probabilities, the sum over classes in Equation 15 is 1 for every sample. Equal risks for these parameter sets imply equality of the total means. By the two expressions for the total means above, equal total means imply equal sample averages. Drawing two sets of samples with *exactly equal* sample averages, denoted by bars over the samples, constitutes the union of two single events.

By definition, single events under continuous distributions have probability 0. Therefore, a strictly smaller risk occurs almost surely.