Algorithms and Theory for Multiple-Source Adaptation

Judy Hoffman
CS Department, Stanford University
Stanford, CA 94305
jhoffman@cs.stanford.edu
Mehryar Mohri
Courant Institute and Google
New York, NY 10012
mohri@cims.nyu.edu
Ningshan Zhang
New York University
New York, NY 10012
nzhang@stern.nyu.edu
Abstract

This work includes a number of novel contributions for the multiple-source adaptation problem. We present new normalized solutions with strong theoretical guarantees for the cross-entropy loss and other similar losses. We also provide new guarantees that hold in the case where the conditional probabilities for the source domains are distinct. Moreover, we give new algorithms for determining the distribution-weighted combination solution for the cross-entropy loss and other losses. We report the results of a series of experiments with real-world datasets. We find that our algorithm outperforms competing approaches by producing a single robust model that performs well on any target mixture distribution. Altogether, our theory, algorithms, and empirical results provide a full solution for the multiple-source adaptation problem with very practical benefits.

 


1 Introduction

In many modern applications, the learner often has access to information about several source domains, including accurate predictors possibly trained and made available by others, but no direct information about a target domain for which good performance is sought. The target domain can typically be viewed as a combination of the source domains, that is, a mixture of their joint distributions, or it may be close to such a mixture. In addition, the learner often cannot access all source data simultaneously, for legitimate reasons such as privacy or storage limitations. Thus, the learner cannot simply pool all source data together to learn a predictor.

Such problems arise commonly in speech recognition where different groups of speakers (domains) yield different acoustic models and the problem is to derive an accurate acoustic model for a broader population that may be viewed as a mixture of the source groups (Liao, 2013). In object recognition, multiple image databases exist, each with its own bias and labeled categories (Torralba and Efros, 2011), but the target application may contain images which most closely resemble only a subset of the available training data. Finally, in sentiment analysis, accurate predictors may be available for sub-domains such as TVs, laptops and CD players, each previously trained on labeled data, but no labeled data or predictor may be at the learner’s disposal for the more general category of electronics, which can be modeled as a mixture of the sub-domains (Blitzer et al., 2007; Dredze et al., 2008).

The problem of transfer from a single source to a known target domain, either through unsupervised adaptation techniques (Gong et al., 2012; Long et al., 2015; Ganin and Lempitsky, 2015; Tzeng et al., 2015), or via lightly supervised ones using some amount of labeled data from the target domain (Saenko et al., 2010; Yang et al., 2007; Hoffman et al., 2013; Girshick et al., 2014), has been extensively investigated in the past. Here, we focus on the problem of multiple-source domain adaptation and ask: how can the learner combine relatively accurate predictors available for each source domain to derive an accurate predictor for any new mixture target domain? This is known as the multiple-source adaptation (MSA) problem, first formalized and analyzed theoretically by Mansour et al. (2008, 2009) and later studied for various applications such as object recognition (Hoffman et al., 2012; Gong et al., 2013a, b). Recently, Zhang et al. (2015) studied a causal formulation of this problem and analyzed the same combination rules of Mansour et al. (2008, 2009) for the classification scenario. A closely related problem is that of domain generalization (Pan and Yang, 2010; Muandet et al., 2013; Xu et al., 2014), where knowledge from an arbitrary number of related domains is combined to perform well on a previously unseen domain.

Mansour et al. (2008, 2009) gave strong theoretical guarantees for a distribution-weighted combination for the MSA problem, but they did not provide any algorithmic solution. Furthermore, the solution they proposed could not be used for loss functions such as cross-entropy, which require a normalized predictor. Their work also assumed a deterministic scenario (non-stochastic) with the same labeling function for all source domains.

This work makes a number of novel contributions to the MSA problem. We give new normalized solutions with strong theoretical guarantees for the cross-entropy loss and other similar losses. Our guarantees hold even when the conditional probabilities for the source domains are distinct. A by-product of our analysis is the extension of the theoretical results of Mansour et al. (2008, 2009) to the stochastic scenario, where there is a joint distribution over the input and output space.

Moreover, we give new algorithms for determining the distribution-weighted combination solution for the cross-entropy loss and other losses. We prove that the problem of determining that solution can be cast as a DC-programming problem (difference-of-convex programming) and give explicit DC-decompositions for the cross-entropy loss and other losses. We also report a series of experimental results on several datasets demonstrating that our distribution-weighted combination solution is remarkably robust. Our algorithm outperforms competing approaches and performs well on any target mixture distribution.

Altogether, our theory, algorithms, and empirical results provide a full solution for the MSA problem with very practical benefits.

2 Problem setup

Let $\mathcal{X}$ denote the input space and $\mathcal{Y}$ the output space. We consider a multiple-source domain adaptation (MSA) problem in the general stochastic scenario where there is a distribution over the joint input-output space $\mathcal{X} \times \mathcal{Y}$. This is a more general setup than the deterministic scenario in (Mansour et al., 2008, 2009), where a target function mapping from $\mathcal{X}$ to $\mathcal{Y}$ is assumed. This extension is needed for the analysis of the most common and realistic learning setups in practice. We will assume that $\mathcal{X}$ and $\mathcal{Y}$ are discrete, but the predictors we consider can take real values. Our theory can be straightforwardly extended to the continuous case with summations replaced by integrals in the proofs. We will identify each domain $k$ with a distribution $\mathcal{D}_k$ over $\mathcal{X} \times \mathcal{Y}$ and consider the scenario where the learner has access to a predictor $h_k$ for each domain $k$, $k = 1, \ldots, p$.

We consider two types of predictor functions $h$, and their associated loss functions $\ell$, under the regression model (R) and the probability model (P) respectively:

(R) $h \colon \mathcal{X} \to \mathbb{R}$, with loss $\ell(h(x), y)$ at a point $(x, y)$,
(P) $h \colon \mathcal{X} \times \mathcal{Y} \to [0, 1]$, with loss $\ell(h(x, y))$ at a point $(x, y)$.

We abuse the notation and write $\mathcal{L}(h, x, y)$ to denote the loss of a predictor $h$ at a point $(x, y)$, that is $\ell(h(x), y)$ in the regression model and $\ell(h(x, y))$ in the probability model. We will denote by $\mathcal{L}(\mathcal{D}, h) = \mathbb{E}_{(x, y) \sim \mathcal{D}}[\mathcal{L}(h, x, y)]$ the expected loss of a predictor $h$ with respect to the distribution $\mathcal{D}$.

Much of our theory only assumes that $\ell$ is convex and continuous. But we will be particularly interested in the case where, in the regression model, $\ell$ is the squared loss, and where, in the probability model, $\ell$ is the cross-entropy loss (log-loss).
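As a concrete illustration (ours, not part of the paper), the two per-point losses used throughout can be written as follows in Python; the function names are our own.

    import math

    def squared_loss(prediction, y):
        # Regression model (R): the predictor maps x to a real value,
        # and the loss compares that value to the label y.
        return (prediction - y) ** 2

    def cross_entropy_loss(probability):
        # Probability model (P): the predictor assigns a probability to the
        # pair (x, y); the loss is the negative log of that probability.
        return -math.log(probability)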

We will assume that each $h_k$ is a relatively accurate predictor for the distribution $\mathcal{D}_k$: there exists $\epsilon > 0$ such that $\mathcal{L}(\mathcal{D}_k, h_k) \le \epsilon$ for all $k \in [p]$. We will also assume that the loss of the source hypotheses is bounded, that is, $\mathcal{L}(h_k, x, y) \le M$ for all $k \in [p]$ and all $(x, y) \in \mathcal{X} \times \mathcal{Y}$.

In the MSA problem, the learner's objective is to combine these predictors to design a predictor with small expected loss on a target domain that could be an arbitrary and unknown mixture of the source domains, the case we are particularly interested in, or even some other arbitrary distribution. It is worth emphasizing that the learner has no knowledge of the target domain.

How do we combine the $h_k$s? Can we use a convex combination rule, $\sum_{k=1}^p z_k h_k$, for some weight vector $z$ in the simplex $\Delta$? In Appendix A (Lemmas 7 and 8), we show that no convex combination rule performs well even in very simple MSA problems. These results generalize a previous lower bound of Mansour et al. (2008). Next, we show that the distribution-weighted combination rule is the right solution.

Extending the definition given by Mansour et al. (2008), we define the distribution-weighted combination of the functions $h_1, \ldots, h_p$ as follows. For any $z \in \Delta$, $\eta > 0$, and $(x, y) \in \mathcal{X} \times \mathcal{Y}$,

(R) $h_z^\eta(x) = \sum_{k=1}^p \frac{z_k \mathcal{D}_k(x) + \eta\, \mathcal{U}(x)/p}{\sum_{j=1}^p z_j \mathcal{D}_j(x) + \eta\, \mathcal{U}(x)}\, h_k(x)$,   (1)
(P) $h_z^\eta(x, y) = \sum_{k=1}^p \frac{z_k \mathcal{D}_k(x, y) + \eta\, \mathcal{U}(x, y)/p}{\sum_{j=1}^p z_j \mathcal{D}_j(x, y) + \eta\, \mathcal{U}(x, y)}\, h_k(x, y)$,   (2)

where we denote by $\mathcal{D}_k(x) = \sum_{y \in \mathcal{Y}} \mathcal{D}_k(x, y)$ the marginal distribution over $\mathcal{X}$, and by $\mathcal{U}$ the uniform distribution (over $\mathcal{X}$ in (R) and over $\mathcal{X} \times \mathcal{Y}$ in (P)). This extension may seem technically straightforward in hindsight, but the form of the predictor was not immediately clear in the stochastic case.
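To make the regression-model rule (1) concrete, here is a minimal Python sketch; it assumes callables densities[k] estimating the marginals, base predictors predictors[k], a weight vector z on the simplex, a smoothing parameter eta, and a uniform density uniform(x). All names are ours, and the eta-smoothing follows the form of Mansour et al. (2008).

    def distribution_weighted_prediction(x, z, densities, predictors, eta, uniform):
        # Regression-model combination (1): each base prediction h_k(x) is weighted
        # in proportion to z_k * D_k(x), smoothed by a small uniform term.
        p = len(predictors)
        numerators = [z[k] * densities[k](x) + eta * uniform(x) / p for k in range(p)]
        denominator = sum(numerators)  # equals sum_j z_j D_j(x) + eta * U(x)
        return sum(w / denominator * predictors[k](x) for k, w in enumerate(numerators))

The probability-model rule (2) has the same structure, with joint quantities and $h_k(x, y)$ in place of the marginals and $h_k(x)$.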

3 Theoretical analysis

In this section, we present theoretical analyses of the general multiple-source adaptation setting. We first introduce our main result for the general stochastic scenario. Next, for the probability model with cross-entropy loss, we introduce a normalized distribution weighted combination and prove that it benefits from strong theoretical guarantees.

Our theoretical results rely on a measure of divergence between distributions. The one that naturally arises in our analysis is the Rényi divergence (Rényi, 1961). We will denote by $\mathsf{d}_\alpha(\mathcal{D} \,\|\, \mathcal{D}')$ the exponential of the $\alpha$-Rényi divergence of two distributions $\mathcal{D}$ and $\mathcal{D}'$. More details on the Rényi divergence are given in Appendix F.
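For discrete distributions, the $\alpha$-Rényi divergence is $\mathsf{D}_\alpha(\mathcal{D} \,\|\, \mathcal{D}') = \frac{1}{\alpha - 1} \log \sum_x \mathcal{D}(x)^\alpha\, \mathcal{D}'(x)^{1 - \alpha}$, and the quantity $\mathsf{d}_\alpha$ used in our bounds is its exponential. The following small helper (our own, for intuition only) computes it for distributions given as probability vectors on a common support.

    def renyi_divergence_exp(P, Q, alpha):
        # Exponential of the alpha-Renyi divergence between two discrete
        # distributions P and Q (lists of probabilities over the same support).
        # Requires alpha > 0, alpha != 1; as alpha -> 1 it tends to exp(KL(P || Q)).
        s = sum(p ** alpha * q ** (1.0 - alpha) for p, q in zip(P, Q) if p > 0.0)
        return s ** (1.0 / (alpha - 1.0))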

3.1 Stochastic scenario

Let $\mathcal{D}_T$ be an unknown target distribution. We will denote by $\mathcal{D}_T(\cdot \mid x)$ and $\mathcal{D}_k(\cdot \mid x)$ the conditional probability distributions over the labels for the target and the $k$-th source domain respectively. Given the same input $x$, these conditional distributions are not necessarily the same. This is a novel extension that was not discussed in (Mansour et al., 2009), where, in the deterministic scenario, exactly the same labeling function is assumed for all source domains.

For some choice of , define by

When the average divergence is small, can be chosen to be very large and is close to .

Let $\mathcal{D}_\lambda$ be a mixture of the source distributions, such that $\mathcal{D}_\lambda(x) = \sum_{k=1}^p \lambda_k \mathcal{D}_k(x)$ in the regression model (R), or $\mathcal{D}_\lambda(x, y) = \sum_{k=1}^p \lambda_k \mathcal{D}_k(x, y)$ in the probability model (P).

Theorem 1.

For any $\delta > 0$, there exist $\eta > 0$ and $z \in \Delta$ such that the following inequalities hold for any $\lambda \in \Delta$:

The proof is given in Appendix B. The learning guarantees for the regression and the probability model are slightly different, since the definitions of the distribution-weighted combinations differ between the two models. Theorem 1 shows the existence of $\eta > 0$ and a mixture weight $z$ with a remarkable property: in the regression model (R), for any target distribution whose conditional probabilities are on average not too far from those of the sources, the loss of $h_z^\eta$ on that target is small. It is even more remarkable that, in the probability model (P), the corresponding bound on the loss of $h_z^\eta$ holds for any target mixture $\mathcal{D}_\lambda$. Therefore, $h_z^\eta$ is a robust hypothesis with a favorable property for any such target distribution.

In many learning tasks, it is reasonable to assume that the conditional probability of the output labels is the same for all source domains. For example, a dog picture represents a dog regardless of whether it appears in an individual's personal set of pictures or in a broader database of pictures from multiple individuals. This is a straightforward extension of the assumption adopted by Mansour et al. (2008) in the deterministic scenario, where exactly the same labeling function is assumed for all source domains. Under this assumption, the target and source conditionals coincide and the divergence term vanishes; taking the target to be a mixture of the source domains, we recover the main result of Mansour et al. (2008).

Corollary 2.

Assume that the conditional probability does not depend on the source domain. Let $\mathcal{D}_\lambda$ be an arbitrary mixture of the source domains, $\mathcal{D}_\lambda = \sum_{k=1}^p \lambda_k \mathcal{D}_k$. For any $\delta > 0$, there exist $\eta > 0$ and $z \in \Delta$, such that $\mathcal{L}(\mathcal{D}_\lambda, h_z^\eta) \le \epsilon + \delta$.

Corollary 2 shows the existence of a mixture weight $z$ and $\eta > 0$ with a remarkable property: for any $\delta > 0$, regardless of which mixture weight $\lambda$ defines the target distribution, the loss of $h_z^\eta$ is at most $\epsilon + \delta$, that is, arbitrarily close to $\epsilon$. $h_z^\eta$ is therefore a robust hypothesis with a favorable property for any mixture target distribution.

To cover the realistic cases arising in applications, we further extend this result to the case where the distributions $\mathcal{D}_k$ are not directly available to the learner and only estimates $\widehat{\mathcal{D}}_k$ derived from data are on hand, and further to the case where the target distribution is not a mixture of the source distributions (Corollary 11 in Appendix B). We will denote by $\widehat{h}_z^\eta$ the distribution-weighted combination rule based on the estimates $\widehat{\mathcal{D}}_k$. Our learning guarantee for $\widehat{h}_z^\eta$ depends on the Rényi divergence between $\widehat{\mathcal{D}}_k$ and $\mathcal{D}_k$, as well as the Rényi divergence between the target distribution and the family of source mixtures.

3.2 Probability model with the cross-entropy loss

Next, we discuss the special case where $\ell$ coincides with the cross-entropy loss in the probability model, and present an analysis of a normalized distribution-weighted combination solution. This analysis complements Theorem 1, which only applies to the unnormalized hypothesis $h_z^\eta$.

The cross-entropy loss assumes normalized hypotheses. Thus, the source functions $h_k$ are normalized for every $x$: $\sum_{y \in \mathcal{Y}} h_k(x, y) = 1$. For any $z \in \Delta$ and $\eta > 0$, we define the normalized weighted combination $\overline{h}_z^\eta$, obtained from $h_z^\eta$ in (2) by renormalizing over the output labels.
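Under the natural reading that the normalization is over the output labels, a minimal sketch (the helper combination_score below stands for the probability-model rule of (2); both names are ours):

    def normalized_combination(x, labels, combination_score):
        # combination_score(x, y): unnormalized score h_z(x, y) from the rule in (2).
        # The scores are renormalized over the output labels so that they sum to one,
        # as required by the cross-entropy loss.
        scores = {y: combination_score(x, y) for y in labels}
        total = sum(scores.values())
        return {y: s / total for y, s in scores.items()}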

We will first assume that the conditional probability of the labels does not depend on the source domain.

Theorem 3.

Assume that there exists $\mu > 0$ such that $h_k(x, y) \ge \mu$ for all $k \in [p]$ and $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Then, for any $\delta > 0$, there exist $\eta > 0$ and $z \in \Delta$, such that $\mathcal{L}(\mathcal{D}_\lambda, \overline{h}_z^\eta) \le \epsilon + \delta$ for any mixture parameter $\lambda \in \Delta$.

The result of Theorem 3 admits the same favorable property as that of Corollary 2. It can also be extended to the case of arbitrary target distributions and estimated densities. When the conditional probabilities are different across the source domains, we propose a marginal distribution-weighted combination rule, which is already normalized. We can directly apply Theorem 1 to it and achieve favorable guarantees. More details are given in Appendix C.

These results are non-trivial and important, as they provide a guarantee for an accurate and robust predictor for a commonly used loss function, the cross-entropy loss.

4 Algorithms

We have shown that, for both the regression and the probability model, there exists a vector $z$ defining a distribution-weighted combination hypothesis that admits very favorable guarantees. But how do we find such a $z$? This is a key question in the MSA problem that was not addressed by Mansour et al. (2008, 2009): no algorithm was previously reported for determining the mixture parameter, even in the deterministic scenario. Here, we give an algorithm for determining that vector $z$.

In this section, we give practical and efficient algorithms for finding the vector $z$ in the important cases of the squared loss in the regression model and the cross-entropy loss in the probability model, by leveraging the differentiability of these loss functions. We first show that $z$ is the solution of a general optimization problem. Next, we give a DC-decomposition (difference-of-convex decomposition) of the objective for both models, thereby proving an explicit DC-programming formulation of the problem. This leads to an efficient DC algorithm that is guaranteed to converge to a stationary point. Additionally, we show that it is straightforward to test whether the solution obtained is the global optimum. While we do not prove that the stationary point found by our algorithm is the global optimum, empirically we observe that this is indeed the case.

4.1 Optimization problem

Theorem 1 shows that the hypothesis $h_z^\eta$ based on the mixture parameter $z$ benefits from a strong generalization guarantee. A key step in proving Theorem 1 is to show the following lemma.

Lemma 4.

For any $\eta > 0$, there exists $z \in \Delta$, with $z_k \neq 0$ for all $k \in [p]$, such that the following holds for the distribution-weighted combining rule $h_z^\eta$:

$\mathcal{L}(\mathcal{D}_k, h_z^\eta) = \mathcal{L}(\mathcal{D}_z, h_z^\eta)$ for all $k \in [p]$, where $\mathcal{D}_z = \sum_{j=1}^p z_j \mathcal{D}_j$ denotes the $z$-mixture of the sources.   (3)

Lemma 4 indicates that, for the solution $z$, $h_z^\eta$ has essentially the same loss on all source domains. Thus, our problem consists of finding a parameter $z$ verifying this property. This, in turn, can be formulated as a min-max problem over $z$, which can be equivalently written as the following optimization problem:

(4)
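For intuition, one natural epigraph reading of this problem, written in our own notation and under the assumption (consistent with Lemma 4 and the discussion in Section 4.3) that the quantity driven to zero is the gap between each per-source loss and the loss on the $z$-mixture $\mathcal{D}_z = \sum_{k=1}^p z_k \mathcal{D}_k$, is

    $$\min_{z \in \Delta,\ \gamma \in \mathbb{R}} \ \gamma \qquad \text{s.t.} \qquad \mathcal{L}(\mathcal{D}_k, h_z^\eta) - \mathcal{L}(\mathcal{D}_z, h_z^\eta) \;\le\; \gamma, \qquad k = 1, \ldots, p.$$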

4.2 DC-decomposition

We provide explicit DC decompositions of the objective of Problem (4) for the regression model with the squared loss and the probability model with the cross-entropy loss. The full derivations are given in Appendix D.

We first rewrite $h_z^\eta$, for the regression model (R) and the probability model (P) respectively, as the ratio of two functions that are affine in $z$: by (1) and (2), the numerator is the $z$-weighted sum of the base predictions and the denominator is the corresponding normalizing term.
Proposition 5 (Regression model, squared loss).

Let $\ell$ be the squared loss. Then, for all $z \in \Delta$, the objective of Problem (4) can be written as the difference $u(z) - v(z)$, where $u$ and $v$ are convex functions of $z$ whose explicit expressions are derived in Appendix D.

Proposition 6 (Probability model, cross-entropy loss).

Let $\ell$ be the cross-entropy loss. Then, for all $z \in \Delta$, the objective of Problem (4) can again be written as the difference $u(z) - v(z)$ of two convex functions of $z$, with explicit expressions derived in Appendix D.

4.3 DC algorithm

Our DC decompositions prove that the optimization problem (4) can be cast as the following variational form of a DC-programming problem (Tao and An, 1997, 1998; Sriperumbudur and Lanckriet, 2012):

(5)

The DC-programming algorithm works as follows. Let $(z_t)_{t \ge 1}$ be the sequence defined by repeatedly solving the following convex optimization problem:

(6)
s.t.

where $z_0$ is an arbitrary starting value. Then, the sequence $(z_t)_{t \ge 1}$ is guaranteed to converge to a local minimum of Problem (4) (Yuille and Rangarajan, 2003; Sriperumbudur and Lanckriet, 2012). Note that Problem (6) is a relatively simple optimization problem: its objective is a weighted sum of negative logarithms of affine functions of $z$, plus a weighted sum of rational functions of $z$ (for the squared loss), and all other terms appearing in the constraints are affine functions of $z$.
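A schematic of the resulting iteration, in Python pseudocode (ours): solve_convex_subproblem stands for an external convex solver applied to Problem (6) with the concave part linearized at the current iterate; it is assumed to return the new iterate and the current objective value.

    def dc_algorithm(solve_convex_subproblem, z0, max_iter=100, tol=1e-6):
        # DC (concave-convex) procedure: repeatedly linearize the concave part of
        # the objective at the current iterate z_t and solve the resulting convex
        # problem (6) to obtain z_{t+1}.
        z, objective = list(z0), float("inf")
        for _ in range(max_iter):
            z_new, objective = solve_convex_subproblem(z)
            if sum(abs(a - b) for a, b in zip(z, z_new)) < tol:
                z = z_new
                break
            z = z_new
        # Per Section 4.3, an objective value close to zero at the stationary point
        # suggests that the solution is in fact globally optimal.
        return z, objective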

Problem (4) seeks a parameter $z$ verifying $\mathcal{L}(\mathcal{D}_k, h_z^\eta) - \mathcal{L}(\mathcal{D}_z, h_z^\eta) \le \gamma$ for all $k \in [p]$, for an arbitrarily small value of $\gamma$. Since $\mathcal{L}(\mathcal{D}_z, h_z^\eta)$ is a weighted average of the expected losses $\mathcal{L}(\mathcal{D}_k, h_z^\eta)$, $k \in [p]$, the optimal value cannot be negative. Furthermore, by Lemma 4, a parameter $z$ verifying that inequality exists for any $\gamma > 0$. Thus, the global solution of Problem (4) must be close to zero. This provides us with a simple criterion for testing the global optimality of the solution we obtain using a DC-programming algorithm with a starting parameter $z_0$.

5 Experiments

This section reports the results of our experiments with our DC-programming algorithm for finding a robust domain generalization solution when using the squared loss and the cross-entropy loss. We first evaluate our algorithm on an artificial dataset with known densities, where we can compare our result to the global solution; we found that our global objective indeed approached the known optimum of zero (see Appendix E for more details). Next, we evaluate our DC-programming solution on real-world datasets: a sentiment analysis dataset (Blitzer et al., 2007) for the squared loss, and, for the cross-entropy loss, the visual domain adaptation benchmark dataset Office (Saenko et al., 2010) as well as a generalization of the digit recognition task.

For all real-world datasets, the probability distributions $\mathcal{D}_k$ are not readily available to the learner. However, Corollary 11 extends the learning guarantees of our solution to the case where an estimate $\widehat{\mathcal{D}}_k$ is used in lieu of the ideal distribution $\mathcal{D}_k$. Thus, we used standard density estimation methods to derive an estimate $\widehat{\mathcal{D}}_k$ for each domain $k$. While density estimation can be a difficult task in general, straightforward techniques suffice for our predictor to achieve high performance, since the approximate densities only serve to indicate the relative importance of each source domain. We give full details of our density estimation procedure in Appendix E.
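As an illustration of such straightforward techniques, one simple per-domain estimator (our choice for this sketch, not necessarily the exact procedure of Appendix E) is a Gaussian kernel density estimate fit on feature vectors from each source domain:

    import numpy as np
    from sklearn.neighbors import KernelDensity

    def fit_domain_density(features, bandwidth=1.0):
        # features: array of shape (n_samples, n_features) drawn from one source domain.
        # Returns a callable estimate of the marginal density of that domain.
        kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(features)
        return lambda x: float(np.exp(kde.score_samples(np.atleast_2d(x))[0]))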

Figure 1: MSE for sentiment analysis under a mixture of two domains: (left) dvd and electronics, (right) kitchen and books.

5.1 Sentiment analysis task for squared loss

We use the sentiment analysis dataset proposed by Blitzer et al. (2007) and used for multiple-source adaptation by Mansour et al. (2008, 2009). This dataset consists of product review text and rating labels from four domains: books (B), dvd (D), electronics (E), and kitchen (K), with samples for each domain. We defined a vocabulary of words that occur at least twice at the intersection of the four domains. These words were used to define feature vectors, where every sample is encoded by the number of occurrences of each word. We trained our base hypotheses using support vector regression (SVR) with the same hyper-parameters as in (Mansour et al., 2008, 2009).

Test Data
         K          D          B          E          KD         BE         DBE        KBE        KDB        KDB        KDBE
K        1.46±0.08  2.20±0.14  2.29±0.13  1.69±0.12  1.83±0.08  1.99±0.10  2.06±0.07  1.81±0.07  1.78±0.07  1.98±0.06  1.91±0.06
D        2.12±0.08  1.78±0.08  2.12±0.08  2.10±0.07  1.95±0.07  2.11±0.07  2.00±0.06  2.11±0.06  2.00±0.06  2.01±0.06  2.03±0.06
B        2.18±0.11  2.01±0.09  1.73±0.12  2.24±0.07  2.10±0.09  1.99±0.08  1.99±0.05  2.05±0.06  2.14±0.06  1.98±0.06  2.04±0.05
E        1.69±0.09  2.31±0.12  2.40±0.11  1.50±0.06  2.00±0.09  1.95±0.07  2.07±0.06  1.86±0.04  1.84±0.06  2.14±0.06  1.98±0.05
unif     1.62±0.05  1.84±0.09  1.86±0.09  1.62±0.07  1.73±0.06  1.74±0.07  1.77±0.05  1.70±0.05  1.69±0.04  1.77±0.04  1.74±0.04
KMM      1.63±0.15  2.07±0.12  1.93±0.17  1.69±0.12  1.83±0.07  1.82±0.07  1.89±0.07  1.75±0.07  1.78±0.06  1.86±0.09  1.82±0.06
DW(ours) 1.45±0.08  1.78±0.08  1.72±0.12  1.49±0.06  1.62±0.07  1.61±0.08  1.66±0.05  1.56±0.04  1.58±0.05  1.65±0.04  1.61±0.04
Table 1: MSE on the sentiment analysis dataset for the source-only baselines trained on each domain (K, D, B, E), the uniformly weighted predictor unif, KMM, and the distribution-weighted method DW based on the learned $z$. DW outperforms all competing baselines.

We compare our method (DW) against each source hypothesis $h_k$. We also compute a privileged baseline using the oracle mixing parameter $\lambda$, denoted $\lambda$-comb: $h_\lambda = \sum_{k=1}^p \lambda_k h_k$. $\lambda$-comb is of course not accessible in practice since the target mixture is not known to the user. We also compare against a previously proposed domain adaptation algorithm (Huang et al., 2006) known as KMM. It is important to note that the KMM model requires access to the unlabeled target data during adaptation and learns a new predictor for every target domain, while DW does not use any target data. Thus, KMM operates in a learning setting that is favorable compared to our solution.

We first considered the same test scenario as in (Mansour et al., 2008), where the target is a mixture of two source domains. The plots in Figure 1 report the results of our experiments. They show that our distribution-weighted predictor DW outperforms all baseline predictors, despite the privileged learning scenarios of $\lambda$-comb and KMM. We did not compare to the "weighted" predictor from the empirical studies of Mansour et al. (2008) because it is not a genuine solution: it takes the unknown target mixture $\lambda$ as the parameter $z$ used to compute the combination.

Next, we compared the performance of DW with accessible baseline predictors on various target mixtures. Since $\lambda$ is not accessible in practice, we replace $\lambda$-comb with the uniform combination of all hypotheses (unif), $h_{\mathrm{unif}} = \frac{1}{p}\sum_{k=1}^p h_k$. Table 1 reports the means and standard deviations of the MSE over repetitions. Each column corresponds to a different target test data source. Our distribution-weighted method DW outperforms all baseline predictors across all test domains. Observe that, even when the target is a single source domain, our method outperforms the predictor trained and tested on that same domain. Results on more target mixtures are available in Appendix E.

Table 2: Digit dataset statistics.
SVHN MNIST USPS
# train images 73,257 60,000 7,291
# test images 26,032 10,000 2,007
image size 32x32 28x28 16x16
color rgb gray gray
Table 3: Digit dataset accuracy. We report accuracy across seven possible test domains.
Test Data
svhn mnist usps mu su sm smu mean
CNN-s 92.3 66.9 65.6 66.7 90.4 85.2 84.2 78.8
CNN-m 15.7 99.2 79.7 96.0 20.3 38.9 41.0 55.8
CNN-u 16.7 62.3 96.6 68.1 22.5 29.4 32.9 46.9
CNN-unif 75.7 91.3 92.2 91.4 76.9 80.0 80.7 84.0
DW (ours) 91.4 98.8 95.6 98.3 91.7 93.5 93.6 94.7
CNN-joint 90.9 99.1 96.0 98.6 91.3 93.2 93.3 94.6
Test Data

amazon webcam dslr aw ad wd awd mean
CNN-a     75.7 ± 0.3  53.8 ± 0.7  53.4 ± 1.3  71.4 ± 0.3  73.5 ± 0.2  53.6 ± 0.8  69.9 ± 0.3  64.5 ± 0.6
CNN-w     45.3 ± 0.5  91.1 ± 0.8  91.7 ± 1.2  54.4 ± 0.5  50.0 ± 0.5  91.3 ± 0.8  57.5 ± 0.4  68.8 ± 0.7
CNN-d     50.4 ± 0.4  89.6 ± 0.9  90.9 ± 0.8  58.3 ± 0.4  54.6 ± 0.4  90.0 ± 0.7  61.0 ± 0.4  70.7 ± 0.6
CNN-unif  69.7 ± 0.3  93.1 ± 0.6  93.2 ± 0.9  74.4 ± 0.4  72.1 ± 0.3  93.1 ± 0.5  75.9 ± 0.3  81.6 ± 0.5
DW (ours) 75.2 ± 0.4  93.7 ± 0.6  94.0 ± 1.0  78.9 ± 0.4  77.2 ± 0.4  93.8 ± 0.6  80.2 ± 0.3  84.7 ± 0.5
CNN-joint 72.1 ± 0.3  93.7 ± 0.5  93.7 ± 0.5  76.4 ± 0.4  76.4 ± 0.4  93.7 ± 0.5  79.3 ± 0.4  83.6 ± 0.4
Table 4: Office dataset accuracy. We report accuracy across seven possible test domains. We show the performance of all baselines: the per-domain models CNN-a, CNN-w, CNN-d, the uniform combination CNN-unif, DW based on the learned $z$, and the jointly trained model CNN-joint. DW outperforms all competing models.

5.2 Recognition tasks for cross-entropy loss

We consider two real-world domain adaptation tasks: a generalization of the digit recognition task, and the standard visual adaptation Office dataset.

For each individual domain, we train a convolutional neural network (CNN) and use the output of the softmax score layer as our base predictor $h_k$. We compute the uniformly weighted combination of the source predictors, denoted CNN-unif. As a privileged baseline, we also train a model on all source data combined, denoted CNN-joint. Note that this approach is often not feasible if independent entities contribute classifiers and densities but not full training datasets; it is thus not consistent with our scenario, and it operates in a much more favorable learning setting than our solution. Finally, our distribution-weighted predictor DW is computed from the $h_k$s, the density estimates, and our learned weighting $z$, as sketched below. The models we compare thus consist of the per-domain classifiers, CNN-unif, CNN-joint, and DW.
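A sketch of how the per-domain softmax outputs and density estimates can be combined at test time (our own illustration; it uses the marginal-weighted form mentioned in Section 3.2, which is already normalized when each softmax vector sums to one, and omits the eta-smoothing term for brevity):

    import numpy as np

    def dw_classify(x, z, densities, softmax_models):
        # softmax_models[k](x): class-probability vector from the k-th source CNN.
        # densities[k](x): estimated (marginal) density of x under source domain k.
        p = len(softmax_models)
        weights = np.array([z[k] * densities[k](x) for k in range(p)])
        weights = weights / weights.sum()
        scores = sum(weights[k] * softmax_models[k](x) for k in range(p))
        return int(np.argmax(scores))  # predicted class label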

We begin our study with a generalization of the digit recognition task, which involves three digit recognition datasets: Google Street View House Numbers (SVHN), MNIST, and USPS. Dataset statistics can be found in Table 2. We train a ConvNet (CNN) architecture following Taigman et al. (2017) as our source models and joint model. We use the output of the second fully-connected layer as our features for density estimation, and the output of the softmax score layer as our predictors. We use the full training set of each domain to learn the source models and densities. Note that these steps are completely isolated from one another and may be performed by separate entities in parallel. Finally, for our DC-programming algorithm, we use a small subset of 200 real image-label pairs from each domain to learn the parameter $z$.

Our next experiment uses the standard visual adaptation Office dataset, which has three domains: amazon, webcam, and dslr. The dataset contains 31 recognition categories of objects commonly found in an office environment. There are 4,110 images in total, with 2,817 from amazon, 795 from webcam, and 498 from dslr. We follow the standard protocol from Saenko et al. (2010), whereby 20 labeled examples are available for training from the amazon domain and 8 labeled examples are available from each of the webcam and dslr domains; the remaining examples from each domain are used for testing. We use the AlexNet ConvNet (CNN) architecture (Krizhevsky et al., 2012), pre-trained on ImageNet: the output of its softmax score layer serves as our base predictors, and its fc7 activations serve as our features for density estimation (Donahue et al., 2014).

We report the performance of our method and of the baselines on the digit recognition dataset in Table 3, and on the Office dataset in Table 4. On both datasets, we evaluate various test distributions: each individual domain, the combination of each pair of domains, and the fully combined set. When the test distribution equals one of the source distributions, our distribution-weighted classifier outperforms (webcam, dslr) or maintains the performance of the classifier trained and tested on that same domain. For the more realistic scenario where the target domain is a mixture of any two or all three source domains, the performance of our method is comparable or marginally superior to that of the jointly trained network, despite the fact that we do not retrain any network parameters and only use a small number of per-domain examples to learn the distribution weights (an optimization that can be solved on a single CPU in a matter of seconds for this problem). This again demonstrates the robustness of our distribution-weighted combined classifier to a varying target domain.

6 Conclusion

We presented practically applicable multiple-source domain adaptation algorithms for the cross-entropy loss and other similar losses. These algorithms benefit from very favorable theoretical guarantees, which we extended to the stochastic setting. Our empirical results further demonstrate their effectiveness and their importance in adaptation problems.

References

  • Arndt (2004) C. Arndt. Information Measures: Information and its Description in Science and Engineering. Signals and Communication Technology. Springer Verlag, 2004.
  • Blitzer et al. (2007) J. Blitzer, M. Dredze, and F. Pereira. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Association for Computational Linguistics (ACL), 2007.
  • Cover and Thomas (2006) T. M. Cover and J. M. Thomas. Elements of Information Theory. Wiley-Interscience, 2006.
  • Donahue et al. (2014) J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference in Machine Learning (ICML), 2014.
  • Dredze et al. (2008) M. Dredze, K. Crammer, and F. Pereira. Confidence-weighted linear classification. In International Conference on Machine Learning (ICML), 2008.
  • Ganin and Lempitsky (2015) Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference in Machine Learning (ICML), 2015.
  • Girshick et al. (2014) R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In In Proc. CVPR, 2014.
  • Gong et al. (2012) B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Proc. CVPR, 2012.
  • Gong et al. (2013a) B. Gong, K. Grauman, and F. Sha. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In ICCV, 2013a.
  • Gong et al. (2013b) B. Gong, K. Grauman, and F. Sha. Reshaping visual datasets for domain adaptation. In NIPS, 2013b.
  • Hoffman et al. (2012) J. Hoffman, B. Kulis, T. Darrell, and K. Saenko. Discovering latent domains for multisource domain adaptation. In European Conference on Computer Vision (ECCV), 2012.
  • Hoffman et al. (2013) J. Hoffman, E. Rodner, J. Donahue, K. Saenko, and T. Darrell. Efficient learning of domain-invariant image representations. In International Conference on Learning Representations, 2013.
  • Huang et al. (2006) J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems (NIPS), volume 19, pages 601–608, 2006.
  • Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
  • Liao (2013) H. Liao. Speaker adaptation of context dependent deep neural networks. In ICASSP, 2013.
  • Long et al. (2015) M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In International Conference in Machine Learning (ICML), 2015.
  • Mansour et al. (2008) Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation with multiple sources. In NIPS, 2008.
  • Mansour et al. (2009) Y. Mansour, M. Mohri, and A. Rostamizadeh. Multiple source adaptation and the Rényi divergence. In UAI, pages 367–374, 2009.
  • Muandet et al. (2013) K. Muandet, D. Balduzzi, and B. Schölkopf. Domain generalization via invariant feature representation. In Proceedings of ICML, pages 10–18, 2013.
  • Pan and Yang (2010) S. J. Pan and Q. Yang. A survey on transfer learning. In IEEE Transactions on Knowledge and Data Engineering, 2010.
  • Rényi (1961) A. Rényi. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 547–561, 1961.
  • Roark et al. (2012) B. Roark, R. Sproat, C. Allauzen, M. Riley, J. Sorensen, and T. Tai. The opengrm open-source finite-state grammar software libraries. In Proceedings of the ACL 2012 System Demonstrations, pages 61–66. Association for Computational Linguistics, 2012.
  • Saenko et al. (2010) K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting visual category models to new domains. In Proc. ECCV, 2010.
  • Sriperumbudur and Lanckriet (2012) B. K. Sriperumbudur and G. R. G. Lanckriet. A proof of convergence of the concave-convex procedure using Zangwill’s theory. Neural Computation, 24(6):1391–1407, 2012.
  • Taigman et al. (2017) Y. Taigman, A. Polyak, and L. Wolf. Unsupervised cross-domain image generation. In International Conference on Learning Representations (ICLR), 2017.
  • Tao and An (1997) P. D. Tao and L. T. H. An. Convex analysis approach to DC programming: theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1):289–355, 1997.
  • Tao and An (1998) P. D. Tao and L. T. H. An. A DC optimization algorithm for solving the trust-region subproblem. SIAM Journal on Optimization, 8(2):476–505, 1998.
  • Torralba and Efros (2011) A. Torralba and A. Efros. Unbiased look at dataset bias. In CVPR, 2011.
  • Tzeng et al. (2015) E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous deep transfer across domains and tasks. In International Conference in Computer Vision (ICCV), 2015.
  • Xu et al. (2014) Z. Xu, W. Li, L. Niu, and D. Xu. Exploiting low-rank structure from latent domains for domain generalization. In European Conference in Computer Vision (ECCV), 2014.
  • Yang et al. (2007) J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive svms. ACM Multimedia, 2007.
  • Yuille and Rangarajan (2003) A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15(4):915–936, 2003.
  • Zhang et al. (2015) K. Zhang, M. Gong, and B. Schoelkopf. Multi-source domain adaptation: A causal view. In AAAI Conference on Artificial Intelligence, 2015.

Appendix A Lower bounds for convex combination rules

In this section, we give lower bounds for convex combination rules, for both the squared loss and the cross-entropy loss. For any $z \in \Delta$, define the convex combination rule for the regression and the probability model as follows:

(R) $h_z(x) = \sum_{k=1}^p z_k\, h_k(x)$,   (7)
(P) $h_z(x, y) = \sum_{k=1}^p z_k\, h_k(x, y)$.   (8)
Lemma 7 (Regression model, squared loss).

There is a mixture adaptation problem for which the expected squared loss of every convex combination rule $h_z$ is bounded below by a constant independent of $z$.

Proof.

Let , and . Consider , and ; , and . Consider the target distribution . Then, for any convex combination rule ,

Note that the hypotheses $h_1$ and $h_2$ have zero error on their own domains. However, no convex combination rule performs well on the target distribution.

Lemma 8 (Probability Model, cross-entropy loss).

There is a mixture adaptation problem for which the expected cross-entropy loss of every convex combination rule $h_z$ is bounded below by a constant independent of $z$.

Proof.

Let , and . Consider , and . Consider the largest cross-entropy loss of on any target mixture :

Choosing to minimize that adversarial loss gives

Therefore any convex combination rule incurs at least a loss of . ∎

Again, the base hypotheses $h_k$ have zero error on their own domains, yet no convex combination rule is robust against every target mixture.

Appendix B Theoretical analysis for the stochastic scenario

In this section, we give a series of theoretical results for the general stochastic scenario with their full proofs. We will separate the proofs for the regression model (Appendix B.1) and the probability model (Appendix B.2), since the definitions of the distribution weighted combination are different in the two models.

B.1 Regression model

The proofs for the regression model (R) are presented in the following order: we first assume that the conditional probabilities are the same across source domains and prove Lemma 4; using that result, we prove Corollary 2 and Corollary 11. Finally, we relax the assumption of identical conditionals and prove Theorem 12, which is a stronger version of Theorem 1.

Our proofs make use of the following Fixed-Point Theorem of Brouwer.

Theorem 9.

For any non-empty, compact, and convex set $C \subseteq \mathbb{R}^n$ and any continuous function $f \colon C \to C$, there is a point $x \in C$ such that $f(x) = x$.

Lemma 4.

For any $\eta > 0$, there exists $z \in \Delta$, with $z_k \neq 0$ for all $k \in [p]$, such that the following holds for the distribution-weighted combining rule $h_z^\eta$:

$\mathcal{L}(\mathcal{D}_k, h_z^\eta) = \mathcal{L}(\mathcal{D}_z, h_z^\eta)$ for all $k \in [p]$, where $\mathcal{D}_z = \sum_{j=1}^p z_j \mathcal{D}_j$ denotes the $z$-mixture of the sources.   (9)
Proof.

Consider the mapping defined for all by

is continuous since is a continuous function of and since the denominator is positive (). Thus, by Brouwer’s Fixed Point Theorem, there exists such that . For that , we can write

for all . Since is positive, we must have for all . Dividing both sides by gives , which completes the proof. ∎

Corollary 2.

Assume that the conditional probability does not depend on the source domain. Let $\mathcal{D}_\lambda$ be an arbitrary mixture of the source domains, $\mathcal{D}_\lambda = \sum_{k=1}^p \lambda_k \mathcal{D}_k$. For any $\delta > 0$, there exist $\eta > 0$ and $z \in \Delta$, such that $\mathcal{L}(\mathcal{D}_\lambda, h_z^\eta) \le \epsilon + \delta$.

Proof.

We first upper bound, for an arbitrary , the expected loss of with respect to the mixture distribution defined using the same , that is . By definition of and , we can write

By convexity of , this implies that

Next, observe that for any since by assumption does not depend on . Thus,

Now, choose as in the statement of Lemma 4. Then, the following holds for any mixture distribution :

Setting and concludes the proof. ∎

Next, we introduce a useful Corollary and give its proof.

Corollary 10.

Let be an arbitrary target distribution. For any , there exists and , such that the following inequality holds for any :

Proof.

For any hypothesis and any distribution , by Hölder’s inequality, the following holds:

Thus, by definition of , for any such that for all , we can write

Now, by Corollary 2, there exists and such that for any mixture distribution . Thus, in view of the previous inequality, we can write,for any ,

Taking the infimum of the right-hand side over all completes the proof. ∎

Corollary 11.

Let be an arbitrary target distribution. Then, for any , there exists and , such that the following inequality holds for any :

where , and .

Proof.

By the first part of the proof of Corollary 10, for any and , the following inequality holds: