Ensemble Multi-Source Domain Adaptation with Pseudo-Labels
Abstract
Given multiple source datasets with labels, how can we train a target model with no labeled data? Multi-source domain adaptation (MSDA) aims to train a model using multiple source datasets that differ from a target dataset, in the absence of target data labels. MSDA is a crucial problem applicable to many practical cases where labels for the target data are unavailable due to privacy issues. Existing MSDA frameworks are limited since they align data without considering the conditional distributions of each domain. They also miss a great deal of target label information by not considering the target labels at all and by relying on only one feature extractor. In this paper, we propose Ensemble Multi-Source Domain Adaptation with Pseudo-Labels (EnMDAP), a novel method for multi-source domain adaptation. EnMDAP exploits label-wise moment matching to align conditional distributions, using pseudo-labels for the unavailable target labels, and introduces an ensemble learning scheme by using multiple feature extractors for accurate domain adaptation. Extensive experiments show that EnMDAP provides state-of-the-art performance for multi-source domain adaptation tasks on both image and text domains.
1 Introduction
Given multiple source datasets with labels, how can we train a target model with no labeled data? Large amounts of training data are essential for training deep neural networks. Collecting abundant data is unfortunately an obstacle in practice; even if enough data are obtained, manually labeling them is prohibitively expensive. Using other available or much cheaper datasets would be a solution to these limitations; however, indiscriminate usage of other datasets often brings severe generalization error due to the presence of dataset shifts (TorralbaE11). Unsupervised domain adaptation (UDA) tackles these problems, where no labeled data from the target domain are available but labeled data from other source domains are provided. Finding domain-invariant features has been the focus of UDA since it allows knowledge transfer from the labeled source dataset to the unlabeled target dataset. There have been many efforts to transfer knowledge from a single source domain to a target one. Most recent frameworks minimize the distance between two domains using deep neural networks and distance-based techniques such as discrepancy regularizers (LongCWJ15; LongZWJ16; LongZWJ17), adversarial networks (GaninUAGLLML16; TzengHSD17), and generative networks (LiuBK17; ZhuPIE17; HoffmanTPZISED18).
While the above-mentioned approaches consider a single source, we address multi-source domain adaptation (MSDA), which is more practical in real-world applications but also more challenging. MSDA can bring significant performance enhancement by virtue of its access to multiple datasets, as long as the multiple domain shift problems are resolved. Previous works have extensively presented both theoretical analyses (BenDavidBCKPV10; MansourMR08; CrammerKW08; HoffmanMZ18; ZhaoZWMCG18; ZellingerMS20) and models (ZhaoZWMCG18; XuCZYL18; PengBXHSW19) for MSDA. MDAN (ZhaoZWMCG18) and DCTN (XuCZYL18) build adversarial networks for each source domain to generate features domain-invariant enough to confound domain classifiers. However, these approaches do not encompass the shifts among source domains, considering only the shifts between each source and the target domain. M^{3}SDA (PengBXHSW19) adopts a moment matching strategy but makes the unrealistic assumption that matching the marginal distributions would guarantee the alignment of the conditional distributions. Most of these methods also do not fully exploit the knowledge of the target domain, owing to the inaccessibility of its labels. Furthermore, all these methods leverage one single feature extractor, which possibly misses important information regarding label classification.
In this paper, we propose EnMDAP, a novel MSDA framework which mitigates the limitations of these methods: not explicitly considering conditional distributions, and relying on only one feature extractor. The model architecture is illustrated in Figure 1. EnMDAP aligns the conditional distributions by utilizing label-wise moment matching. We employ pseudo-labels for the inaccessible target labels to maximize the usage of the target data. Moreover, integrating the features from multiple feature extractors enriches the extracted features with abundant label information. Extensive experiments show the superiority of our proposed method.
Our contributions are summarized as follows:
Method. We propose EnMDAP, a novel approach for MSDA that effectively obtains domain-invariant features from multiple domains by matching conditional distributions rather than marginal ones, utilizing pseudo-labels for the inaccessible target labels to fully exploit the target data, and using multiple feature extractors. This allows domain-invariant features to be extracted while capturing the intrinsic differences between labels.
Analysis. We theoretically prove that minimizing the label-wise moment matching loss tightens a bound on the target error.
Experiments. We conduct extensive experiments on image and text datasets. We show that 1) EnMDAP provides state-of-the-art accuracy, and 2) each of our main ideas significantly contributes to its superior performance.
2 Related Work
Single-source Domain Adaptation. Given a labeled source dataset and an unlabeled target dataset, single-source domain adaptation aims to train a model that performs well on the target domain. The challenge is to reduce the discrepancy between the two domains and to obtain appropriate domain-invariant features. Various discrepancy measures such as Maximum Mean Discrepancy (MMD) (TzengHZSD14; LongCWJ15; LongZWJ16; LongZWJ17; GhifaryKZBL16) and KL divergence (ZhuangCLPH15) have been used as regularizers. Inspired by the insight that domain-invariant features should exclude clues about their domain, constructing adversarial networks against domain classifiers has shown superior performance. LiuBK17 and HoffmanTPZISED18 deploy GANs to transform data across the source and target domains, while GaninUAGLLML16 and TzengHSD17 leverage adversarial networks to extract common features of the two domains. Unlike these works, we focus on multiple source domains.
Multi-source Domain Adaptation. Single-source domain adaptation cannot be naively employed for multiple source domains due to the shifts between the source domains themselves. Many previous works have tackled MSDA problems theoretically. MansourMR08 establish the distribution-weighted combining rule, showing that a weighted combination of source hypotheses is a good approximation of the target hypothesis. The rule is further extended to a stochastic setting with a joint distribution over the input and output spaces in HoffmanMZ18. CrammerKW08 propose a general theory of how to sift appropriate samples out of multi-source data using expected loss. Efforts to find transferable knowledge from multiple sources from a causal viewpoint are made in ZhangGS15. There have been salient studies on learning bounds for MSDA. BenDavidBCKPV10 derive generalization bounds based on divergence, which are further tightened by ZhaoZWMCG18. Frameworks for MSDA have been presented as well. ZhaoZWMCG18 propose learning algorithms based on the generalization bounds for MSDA. DCTN (XuCZYL18) resolves domain and category shifts between source and target domains via adversarial networks. M^{3}SDA (PengBXHSW19) associates all the domains with a common distribution by aligning the moments of the feature distributions of multiple domains. However, none of these methods considers the multi-mode structure (PeiCLW18) in which differently labeled data follow distinct distributions even when drawn from the same domain. Also, the domain-invariant features in these methods contain the label information for only one label classifier, which leads these methods to miss a large amount of label information. Different from these methods, our framework fully accounts for the multi-mode structure by handling the data distributions in a label-wise manner, and minimizes the loss of label information by considering multiple label classifiers.
Moment Matching. Domain adaptation has deployed the moment matching strategy to minimize the discrepancy between source and target domains. The MMD regularizer (TzengHZSD14; LongCWJ15; LongZWJ16; LongZWJ17; GhifaryKZBL16) can be interpreted as matching first-order moments, while SunFS16 address second-order moments of the source and target distributions. ZellingerGLNS17 investigate the effect of higher-order moment matching. M^{3}SDA (PengBXHSW19) demonstrates that moment matching also yields remarkable performance with multiple sources. While previous works have focused on matching the moments of marginal distributions for single-source adaptation, we handle conditional distributions in multi-source scenarios.
3 Proposed Method
In this section, we describe our proposed method, EnMDAP. We first formulate the problem definition in Section 3.1 and describe our main ideas in Section 3.2. Section 3.3 elaborates on how to match label-wise moments with pseudo-labels, and Section 3.4 extends the approach with ensemble learning. Figure 1 shows the overview of EnMDAP.
3.1 Problem Definition
Given a set of labeled datasets from multiple source domains and an unlabeled dataset from a target domain, we aim to construct a model that minimizes the test error on the target domain. We formulate each source domain as a tuple of a data distribution on the data space and a labeling function; the corresponding source dataset is drawn from that distribution. The target domain and the target dataset are defined likewise. We narrow our focus down to homogeneous settings in classification tasks: all domains share the same data space and label set.
3.2 Overview
We propose EnMDAP based on the following observations: 1) existing methods focus on aligning the marginal distributions rather than the conditional ones, 2) the knowledge of the target data is not fully employed since no target labels are given, and 3) a large amount of label information is lost since domain-invariant features are extracted for only one single label classifier. Thus, we design EnMDAP to address these limitations. Designing such a method entails the following challenges:

Matching conditional distributions. How can we align the conditional distributions of multiple domains rather than the marginal ones?

Exploitation of the target data. How can we fully exploit the knowledge of the target data despite the absence of the target labels?

Maximally utilizing feature information. How can we maximally utilize the information that the domain-invariant features contain?
We propose the following main ideas to address the challenges:

Label-wise moment matching (Section 3.3). We match the label-wise moments of the domain-invariant features so that features with the same label have similar distributions regardless of their original domains.

Pseudo-labels (Section 3.3). We use pseudo-labels as alternatives to the unavailable target labels.

Ensemble of feature representations (Section 3.4). We learn an ensemble of features from multiple feature extractors, each of which yields distinct domain-invariant features for its own label classifier.
3.3 Label-wise Moment Matching with Pseudo-labels
We describe how EnMDAP matches the conditional distributions of the features from multiple distinct domains. In EnMDAP, a feature extractor and a label classifier together lead the features to be domain-invariant and label-informative at the same time. The feature extractor extracts features from the data, and the label classifier receives the features and predicts the labels of the data. We train the two components according to the losses for label-wise moment matching and label classification, which make the features domain-invariant and label-informative, respectively.
Label-wise Moment Matching. To achieve the alignment of domain-invariant features, we define a label-wise moment matching loss as follows:
(1) 
where the maximum order of the moments considered by the loss is a hyperparameter, the loss is computed over pairs of distinct domains, and each term is normalized by the number of data points with the corresponding label in each domain. We introduce pseudo-labels for the target data, determined by the outputs of the model currently being trained, to manage the absence of ground truths for the target data. In other words, we leverage the current model to assign a pseudo-label to each target data point. Drawing the pseudo-labels from the incomplete model, however, brings a mislabeling issue which impedes further training. To alleviate this problem, we set a threshold and assign pseudo-labels to the target data only when the prediction confidence is greater than the threshold. Target examples with low confidence are not pseudo-labeled and are not counted in label-wise moment matching.
By minimizing this loss, the feature extractor aligns data from multiple domains by bringing consistency to the distributions of features with the same label. Data with distinct labels are aligned independently, taking into account the multi-mode structure in which differently labeled data follow different distributions.
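As a concrete illustration, label-wise moment matching with confidence-thresholded pseudo-labels can be sketched as follows. This is a minimal NumPy sketch under our own naming: `pseudolabel`, `labelwise_moment_loss`, the elementwise raw moments, and the L2 distance between moments are our assumptions, not the paper's exact formulation.

```python
import numpy as np

def pseudolabel(probs, threshold=0.9):
    """Assign argmax pseudo-labels only where the prediction confidence
    exceeds the threshold; unconfident examples get -1 and are excluded
    from label-wise moment matching."""
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    labels[conf < threshold] = -1
    return labels

def labelwise_moment_loss(feats_a, labels_a, feats_b, labels_b, k=2):
    """For every label shared by two domains, accumulate the distance
    between the first k elementwise raw moments of the label-conditioned
    feature sets."""
    shared = set(labels_a.tolist()) & set(labels_b.tolist())
    shared.discard(-1)  # drop unconfident (non-pseudo-labeled) examples
    loss = 0.0
    for c in shared:
        fa = feats_a[labels_a == c]
        fb = feats_b[labels_b == c]
        for j in range(1, k + 1):
            ma = (fa ** j).mean(axis=0)  # j-th raw moment, per dimension
            mb = (fb ** j).mean(axis=0)
            loss += np.linalg.norm(ma - mb)
    return loss
```

Summing this quantity over all pairs of domains (the sources and the pseudo-labeled target) gives a loss in the spirit of Eq. (1).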
Label Classification. The label classifier takes the features produced by the feature extractor as inputs and makes the label predictions. The label classification loss is defined as follows:
(2) 
where the per-example loss is the softmax cross-entropy. Minimizing this loss separates the features with different labels so that they become label-distinguishable.
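For completeness, the softmax cross-entropy used here can be sketched as follows; this is a minimal NumPy version, and the function name and batch-averaging convention are our assumptions.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Softmax cross-entropy loss averaged over a batch, computed via a
    numerically stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()
```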
3.4 Ensemble of Feature Representations
In this section, we introduce ensemble learning for further enhancement. Features extracted with the strategies elaborated in the previous section contain the label information for a single label classifier. However, each label classifier leverages only limited label characteristics, and thus the conventional scheme of adopting only one pair of feature extractor and label classifier captures only a small part of the label information. Our idea is to leverage an ensemble of multiple pairs of feature extractor and label classifier in order to make the features more label-informative.
We train multiple pairs of feature extractor and label classifier in parallel, following the label-wise moment matching approach explained in Section 3.3. Each feature extractor is paired with its own label classifier and produces its own features. After obtaining the different feature mappings, we concatenate the features into one vector. The final label classifier takes the concatenated feature as input and predicts its label.
Naively exploiting multiple feature extractors, however, does not guarantee the diversity of the features since it resorts to randomness. Thus, we introduce a new model component, the extractor classifier, which separates the features from different extractors. The extractor classifier takes the features generated by a feature extractor as inputs and predicts which feature extractor has generated them. For example, with two feature extractors, the extractor classifier attempts to predict which of the two extracted the input feature. By training the extractor classifier and the multiple feature extractors at once, we explicitly diversify the features obtained from different extractors. We train the extractor classifier utilizing the feature diversifying loss:
(3) 
where the loss is averaged over the feature extractors.
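One reading of the feature diversifying loss is a standard cross-entropy in which the extractor classifier must recover, for each feature vector, the index of the extractor that produced it. The sketch below assumes this reading; `feature_diversifying_loss` and the `extractor_logits_fn` callback are our own names, not the paper's.

```python
import numpy as np

def feature_diversifying_loss(features_per_extractor, extractor_logits_fn):
    """Cross-entropy of the extractor classifier predicting, for each
    feature vector, the index of the extractor that produced it.

    features_per_extractor: list of (batch, dim) arrays, one per extractor.
    extractor_logits_fn: maps features to (batch, n_extractors) logits.
    """
    total, count = 0.0, 0
    for idx, feats in enumerate(features_per_extractor):
        logits = extractor_logits_fn(feats)
        # numerically stable log-softmax
        shifted = logits - logits.max(axis=1, keepdims=True)
        logp = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        total += -logp[:, idx].sum()  # true class = producing extractor
        count += len(feats)
    return total / count
```

A low value of this loss means the extractor classifier can easily tell the extractors apart, i.e., the feature representations are diverse.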
3.5 EnMDAP: Ensemble Multi-Source Domain Adaptation with Pseudo-labels
Our final model EnMDAP consists of multiple pairs of feature extractor and label classifier, one extractor classifier, and one final label classifier. We first train the entire model except the final label classifier with the following loss:
(4) 
where the total sums the label classification loss of each classifier and the label-wise moment matching loss of each feature extractor, together with the feature diversifying loss, each weighted by a hyperparameter. Then, the final label classifier is trained with respect to the label classification loss using the concatenated features from the multiple feature extractors.
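The first-stage objective of Eq. (4) can then be sketched as a weighted sum; `lam` and `mu` stand in for the paper's unnamed weighting hyperparameters, and the exact grouping of terms is our assumption.

```python
def stage_one_loss(cls_losses, mm_losses, div_loss, lam=1.0, mu=1.0):
    """Eq. (4) sketch: combine the per-pair label classification losses,
    the per-extractor label-wise moment matching losses, and the feature
    diversifying loss, weighted by the hyperparameters lam and mu."""
    return sum(cls_losses) + lam * sum(mm_losses) + mu * div_loss
```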
4 Analysis
We present a theoretical insight regarding the validity of the label-wise moment matching loss. For simplicity, we tackle only binary classification tasks. The error rate of a hypothesis on a domain is its expected disagreement with the labeling function of that domain. We first introduce the label-wise moment divergence of a given order.
Definition 1.
Let two domains be defined over a common input space whose dimension is the number of input coordinates, and let them share the same set of labels. For each label, consider the data distribution of each domain conditioned on that label. Then, the label-wise moment divergence of a given order between the two domains is defined as
(5) 
where the moments range over the tuples of nonnegative integers that add up to each order, the weights are the probabilities that arbitrary data from the respective domains are labeled with the corresponding class, and each data point is expressed coordinate-wise.∎
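The displayed divergence (Equation 5) did not survive extraction. A plausible reconstruction, consistent with the where-clause above, is sketched below; all symbols ($\mathcal{C}$ for the label set, $q_c$ and $q'_c$ for the class-prior probabilities, $\mathcal{D}_c$ and $\mathcal{D}'_c$ for the label-conditioned data distributions, and the multi-index $\alpha$) are our own naming, not necessarily the paper's.

```latex
d_k(\mathcal{D}, \mathcal{D}')
  = \sum_{c \in \mathcal{C}} \sum_{j=1}^{k}
    \sum_{\substack{\alpha \in \mathbb{Z}_{\ge 0}^{n} \\ \alpha_1 + \dots + \alpha_n = j}}
    \Big|\, q_c \, \mathbb{E}_{x \sim \mathcal{D}_c}\big[x^{\alpha}\big]
          - q'_c \, \mathbb{E}_{x \sim \mathcal{D}'_c}\big[x^{\alpha}\big] \,\Big|,
\qquad x^{\alpha} := \prod_{i=1}^{n} x_i^{\alpha_i}.
```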
The ultimate goal of MSDA is to find a hypothesis with the minimum target error. We nevertheless train the model with respect to the source data since ground truths for the target are unavailable. Let datasets be drawn from the labeled source domains, respectively. The empirical error of a hypothesis in each source domain is estimated with the corresponding dataset. Given a weight vector whose entries sum to one, the weighted empirical source error is the weighted sum of the per-source empirical errors. We extend the theorems in BenDavidBCKPV10 and PengBXHSW19 and derive a bound on the target error of a hypothesis trained with source data, in terms of the label-wise moment divergence.
Theorem 1.
Let the hypothesis space have finite VC dimension, and let samples be drawn from each source domain with a fixed weight vector summing to one. Define the hypothesis that minimizes the weighted empirical source error and the hypothesis that minimizes the true target error. Then, for any confidence level and any tolerance, there exist integers and constants such that
(6) 
holds with probability at least the chosen confidence level.∎
Proof.
See the appendix. ∎
Assuming that all datasets are balanced with respect to the annotations, the divergence term is expressed as the sum of empirical estimates of the label-wise moments. The theorem thus provides the insight that label-wise moment matching allows a model trained with source data to achieve performance comparable to the optimal one on the target domain.
5 Experiments
We conduct experiments to answer the following questions about EnMDAP.
Accuracy (Section 5.2). How well does EnMDAP perform in classification tasks?
Ablation Study (Section 5.3). How much does each component of EnMDAP contribute to performance improvement?
Effects of Degree of Ensemble (Section 5.4). How does the performance change as the number of (feature extractor, label classifier) pairs increases?
5.1 Experimental Settings
Datasets.
We use three kinds of datasets: Digits-Five, Office-Caltech10, and Amazon Reviews.
Competitors. We use three state-of-the-art MSDA algorithms as baselines: DCTN (XuCZYL18) and two variants of M^{3}SDA (PengBXHSW19). All the frameworks share the same architectures for the feature extractor, the domain classifier, and the label classifier for consistency. For Digits-Five, we use convolutional neural networks based on LeNet-5 (LeCunBBH98). For Office-Caltech10, ResNet-50 (HeZRS16) pretrained on ImageNet is used as the backbone architecture. For Amazon Reviews, the feature extractor is composed of three fully-connected layers with 1000, 500, and 100 output units, and a single fully-connected layer with 100 input units and 2 output units is adopted for both the extractor classifier and the label classifiers. With Digits-Five, LeNet-5 (LeCunBBH98) and ResNet-14 (HeZRS16) without any adaptation are additionally investigated in two different manners: Source Combined and Single Best. In Source Combined, the multiple source datasets are simply combined and fed into a model. In Single Best, we train the model with each source dataset independently and report the result of the best-performing one. Likewise, ResNet-50 and an MLP consisting of 4 fully-connected layers with 1000, 500, 100, and 2 units are investigated without adaptation for Office-Caltech10 and Amazon Reviews, respectively.
Training Details. We train our models for Digits-Five with the Adam optimizer (KingmaB14) for 100 epochs. All images are rescaled, and the mini-batch size is fixed. For the experiments with Office-Caltech10, all the modules comprising our model are trained with SGD, except that the optimizers for the feature extractors use a smaller learning rate. We rescale all the images and keep all the hyperparameters the same as in the experiments with Digits-Five. For Amazon Reviews, we train the models with the Adam optimizer. For every experiment, the confidence threshold for pseudo-labeling is fixed.



5.2 Performance Evaluation
We evaluate the performance of EnMDAP against the competitors. We repeat the experiments for each setting five times and report the mean and the standard deviation. The results are summarized in Table 1. Note that EnMDAP provides the best accuracy on all the datasets, showing consistent superiority on both the image datasets (Digits-Five, Office-Caltech10) and the text dataset (Amazon Reviews). The enhancement is especially remarkable when MNIST-M is the target domain in Digits-Five, where EnMDAP improves the accuracy over the state-of-the-art methods.
5.3 Ablation Study
We perform an ablation study on Digits-Five to identify what exactly enhances the performance of EnMDAP. We compare EnMDAP with three of its variants: MDAP-L, MDAP, and EnMDAP-R. MDAP-L follows the same strategy as M^{3}SDA, aligning moments regardless of the labels of the data. MDAP trains the model without the ensemble learning scheme. EnMDAP-R exploits the ensemble learning strategy but relies on randomness, without the extractor classifier and the feature diversifying loss.
The results are shown in Table 2. By comparing MDAP-L and MDAP, we observe that considering labels in moment matching plays a significant role in extracting domain-invariant features. The remarkable performance gap between MDAP and EnMDAP verifies the effectiveness of ensemble learning. On the other hand, the performances of EnMDAP-R and EnMDAP differ little. This indicates that two feature extractors trained independently are unlikely to be correlated even without any explicit diversifying technique, despite resorting to randomness.
5.4 Effects of Ensemble
We vary the number of pairs of feature extractor and label classifier and repeat the performance evaluation on Digits-Five. The results are summarized in Table 2. While an ensemble of two pairs gives much better performance than the model with one single pair, using more than two pairs rarely brings further improvement. This result demonstrates that two pairs of feature extractor and label classifier are able to cover most of the important label information in Digits-Five. It is notable that increasing the number of pairs sometimes brings a small performance degradation. As more feature extractors are adopted to obtain the final features, the complexity of the final features increases, and it is harder for the final label classifier to manage highly complex features than simple ones. This deteriorates the performance when we exploit more than two feature extractors.
Method  M+S+D+U/T  T+S+D+U/M  T+M+D+U/S  T+M+S+U/D  T+M+S+D/U  Average 

MDAP-L  98.75±0.05  67.77±0.71  81.75±0.61  88.51±0.29  97.17±0.22  86.79±0.38 
MDAP  99.14±0.06  79.32±0.73  84.77±0.39  91.91±0.05  98.49±0.16  90.73±0.28 
EnMDAP-R (n=2)  99.34±0.05  83.24±0.81  86.96±0.34  92.88±0.15  98.56±0.17  92.20±0.30 
EnMDAP (n=2)  99.31±0.04  83.95±0.90  86.93±0.39  93.15±0.17  98.49±0.08  92.37±0.31 
EnMDAP (n=3)  99.31±0.05  82.78±0.67  87.10±0.29  92.85±0.24  98.48±0.09  92.10±0.27 
EnMDAP (n=4)  99.30±0.07  82.74±0.55  86.65±0.41  92.86±0.15  98.50±0.08  92.01±0.25 
6 Conclusion
We propose EnMDAP, a novel framework for the multi-source domain adaptation problem. EnMDAP overcomes the problems of existing methods: not directly addressing the conditional distributions of the data, not fully exploiting the knowledge of the target data, and missing a large amount of label information. EnMDAP aligns data from multiple source domains and the target domain considering the data labels, and exploits pseudo-labels for the unlabeled target data. EnMDAP further enhances the performance by introducing multiple feature extractors. Our framework exhibits superior performance on both image and text classification tasks. The ablation study shows that considering labels in moment matching and adding the ensemble learning scheme bring remarkable performance enhancements. Future work includes extending our approach to other tasks such as regression, which may require modification of the pseudo-labeling method.
References
Appendix A Appendix
A.1 Proof for Theorem 1
We prove Theorem 1 in the paper by extending the proofs in the existing studies (BenDavidBCKPV10; PengBXHSW19). We first define the label-wise moment divergence of a given order and the disagreement ratio of two hypotheses on a domain.
Definition 1.
Let two domains be defined over a common input space whose dimension is the number of input coordinates, and let them share the same set of labels. For each label, consider the data distribution of each domain conditioned on that label. Then, the label-wise moment divergence of a given order between the two domains is defined as
(7) 
where the moments range over the tuples of nonnegative integers that add up to each order, the weights are the probabilities that arbitrary data from the respective domains are labeled with the corresponding class, and each data point is expressed coordinate-wise.∎
Definition 2.
Let a domain be defined over an input space with a data distribution. Then, we define the disagreement ratio of two hypotheses on the domain as
(8) 
∎
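The displayed formula (Equation 8) did not survive extraction. A plausible reconstruction of the disagreement ratio, with our own symbols ($\mathcal{D}$ for the domain's data distribution, $h_1, h_2$ for the hypotheses), is:

```latex
\epsilon_{\mathcal{D}}(h_1, h_2)
  = \mathbb{E}_{x \sim \mathcal{D}}\big[\, \mathbf{1}\{\, h_1(x) \neq h_2(x) \,\} \,\big]
  = \Pr_{x \sim \mathcal{D}}\big[\, h_1(x) \neq h_2(x) \,\big].
```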
Theorem 2.
(Stone-Weierstrass Theorem (stone37)) Let $K$ be a compact subset of $\mathbb{R}^n$ and let $f : K \to \mathbb{R}$ be a continuous function. Then, for every $\epsilon > 0$, there exists a polynomial $P$ such that
(9) $\sup_{x \in K} |f(x) - P(x)| < \epsilon$
∎
Theorem 2 indicates that continuous functions on a compact subset of Euclidean space can be approximated by polynomials. We next formulate the discrepancy of the two domains using the disagreement ratio and bound it with the label-wise moment divergence.
Lemma 1.
Let two domains be defined over a common input space whose dimension is n. Then, for any hypotheses and any positive tolerance, there exist an order and a constant such that
(10) 
∎
Proof.
Let the two domains have their respective data distributions over an input space that is a compact subset of n-dimensional Euclidean space. Then,
(11) 
For any hypotheses, the indicator function of their disagreement is Lebesgue integrable on the input space, i.e., it is an L^1 function. Since the set of continuous functions is dense in L^1, for every positive tolerance there exists a continuous function defined on the input space such that
(12) 
for every input, with the hypotheses fixed to those that drive Equation 5 to the supremum. Accordingly,
(13) 
By integrating every term in the inequality over the input space, the inequality
(14) 
follows. Likewise, the same inequality on the domain with instead of holds. By subtracting the two inequalities and reformulating it, the inequality,
(15) 
is induced. By substituting the inequality in Equation 9 into Equation 5,
(16) 
By Theorem 2, there exists a polynomial such that
(17) 
and the polynomial is expressed as
(18) 
where the polynomial has a finite order, the exponents range over tuples of nonnegative integers that add up to each order, and each term has its own coefficient. By applying Equation 11 to Equation 10 and substituting the expression in Equation 12,
(19) 
where the weights are the probabilities that an arbitrary data point is labeled with each class in the respective domains, and the conditional distributions are those of the data given each class in the respective domains. Hence,
(20) 
for an appropriate choice of the constant. ∎
Let datasets be drawn from the labeled source domains, respectively. The empirical error of a hypothesis in each source domain is estimated with the corresponding dataset. Given a positive weight vector whose entries sum to one, the weighted empirical source error is the weighted sum of the per-source empirical errors.
Lemma 2.
For the source domains, let samples be drawn from each domain, and consider a weight vector summing to one. Let the weighted true source error be the correspondingly weighted sum of the per-domain true errors. Then,
(21) 
Proof.
It has been proven in BenDavidBCKPV10. ∎
We now turn our focus back to Theorem 1 in the paper and complete the proof.
Theorem 1.
Let the hypothesis space have finite VC dimension, and let samples be drawn from each source domain with a fixed weight vector summing to one. Define the hypothesis that minimizes the weighted empirical source error and the hypothesis that minimizes the true target error. Then, for any confidence level and any tolerance, there exist integers and constants such that
(22) 
holds with probability at least the chosen confidence level.∎
Proof.
(23) 
We note that the 1-triangle inequality (CrammerKW08) holds for binary classification tasks; that is, the disagreement between two hypotheses on a domain is at most the sum of their disagreements with any third hypothesis. Then,
(24) 
for the ground-truth labeling function on the domain and any two hypotheses. Applying the definition and the inequality to Equation 17,
(25) 
Additionally, according to Lemma 1, for any positive tolerance, there exist an integer and a constant such that
(26) 
By applying these relations,
(27) 
By Lemma 2 and the standard uniform convergence bound for hypothesis classes of finite VC dimension (BenDavidBCKPV10),