Close Yet Discriminative Domain Adaptation
Abstract
Domain adaptation is a form of transfer learning that aims to generalize a learning model across training and testing data with different distributions. Most previous research tackles this problem by seeking a shared feature representation between the source and target domains while reducing the mismatch of their data distributions. In this paper, we propose a close yet discriminative domain adaptation method, namely CDDA, which generates a latent feature representation with two interesting properties. First, the discrepancy between the source and target domains, measured in terms of both marginal and conditional probability distributions via Maximum Mean Discrepancy, is minimized so as to attract the two domains close to each other. More importantly, we also design a repulsive force term, which maximizes the distances between each label-dependent subdomain and all the others so as to drag different class-dependent subdomains far away from each other and thereby increase the discriminative power of the adapted domain. Moreover, given that the underlying data manifold could have a complex geometric structure, we further propose the constraints of label smoothness and geometric structure consistency for label propagation. Extensive experiments are conducted on 36 cross-domain image classification tasks over four public datasets. The comprehensive results show that the proposed method consistently outperforms the state-of-the-art methods by significant margins.
1 Introduction
Thanks to deep networks, recent years have witnessed impressive progress in an increasing number of machine learning and computer vision tasks, e.g., image classification [17, 9], object detection [4, 6], and semantic segmentation [3, 4, 21]. However, this progress has been possible only when massive amounts of labeled training data are available, and such a requirement hampers the adoption of these methods in a number of real-life applications where labeled training data do not exist or are insufficient in quantity. On the other hand, manual annotation of large training data could be extremely tedious and prohibitive for a given application. An interesting solution to this problem is transfer learning through domain adaptation [16], which aims to leverage abundant existing labeled data from a different but related domain (source domain) and generalize a predictive model learned from the source domain to unlabeled target data (target domain) despite the discrepancy between the source and target data distributions.
The core idea of most proposed methods for domain adaptation is to reduce the discrepancy between domains and learn a domain-invariant predictive model from data. The state of the art has so far featured two mainstream approaches to reducing data distribution discrepancy: (1) feature representation transfer, which aims to find "good" feature representations that minimize domain differences and the error of classification or regression models; and (2) instance transfer, which attempts to reweight some "good" data from the source domain that may be useful for the target domain. The latter minimizes the distribution differences by reweighting the source domain data and then trains a predictive model on the reweighted source data.
In this paper, we are interested in feature representation transfer, which seeks a domain-invariant latent space while preserving important structure of the original data, e.g., data variance or geometry. Early methods, e.g., [1], propose structural correspondence learning (SCL), which first defines a set of pivot features and then identifies correspondences among features from different domains by modeling their correlations with the pivot features. Later, transfer learning problems were approached via dimensionality reduction. [15] learns a novel feature representation across domains in a Reproducing Kernel Hilbert Space with the Maximum Mean Discrepancy (MMD) measure [2], through the so-called transfer component analysis (TCA). TCA [15] is an extension of [14], with the purpose of reducing the computational burden. [12] goes one step further and remarks that both marginal and conditional distributions could be different between the source and target domains. As a result, Joint Distribution Adaptation (JDA) is proposed to jointly minimize the mismatches of marginal and conditional probability distributions. Previous research has thus so far only focused on matching marginal and/or conditional distributions for transfer learning while ignoring the discriminative properties to be reinforced between different classes in the adapted domain.
In this paper, we propose to extract a latent shared feature space underlying the domains where the discrepancy between domains is reduced but, more importantly, the original discriminative information between classes is simultaneously reinforced. Specifically, we not only seek a shared feature space that minimizes the discrepancy of both marginal and conditional probability distributions as in JDA [12], but also introduce a discriminative model, subsequently called repulsive force, in light of Fisher's linear discriminant analysis (FLDA) [5]. This repulsive force drags the subdomains with different labels far away from each other by maximizing their distances measured in terms of Maximum Mean Discrepancy (MMD), thereby making data from different subdomains more discriminative. This is in clear contrast to the previous approaches, as illustrated in Fig. 1. Most previous works, e.g., JDA, only seek to align marginal or conditional distributions between the source and target domains, and the resultant latent subspace therefore falls short in terms of discriminative power, as illustrated in the lower part of the green ellipse of Fig. 1(a), where samples of different labels are all mixed up. In contrast, as can be seen in the lower part of the purple ellipse of Fig. 1(b), the proposed method unifies the decrease of data distribution discrepancy and the increase of the discriminative property between classes in a single framework and finds a novel latent subspace where samples with the same label are put close to each other while samples with different labels are well separated. Moreover, given that the manifold of both source and target data in the shared latent feature space could have a complex geometric structure, we further propose label propagation subject to two constraints, namely label smoothness consistency (LSC) and geometric structure consistency (GSC), for the prediction of target data labels. That is, a good label propagation should preserve the label information well (constraint LSC) and not deviate too much from the shared data manifold (constraint GSC).
To sum up, the contributions in this paper are threefold:

A novel repulsive force is proposed to increase the discriminative power of the shared latent subspace, in addition to reducing the mismatch of both the marginal and conditional distributions between the source and target domains.

Unlike a number of domain adaptation methods, e.g., JDA [12], which use Nearest Neighbor (NN) with Euclidean distance to predict labels in the target domain, the prediction in the proposed model is deduced via label propagation that respects the geometric structure of the underlying data manifold.

Extensive experiments are conducted on comprehensive datasets and verify the effectiveness of the proposed method, which outperforms state-of-the-art domain adaptation algorithms by a significant margin.
The rest of the paper is organized as follows. In Section 2, we discuss previous works related to ours and highlight their differences. In Section 3, we first describe the problem and preliminaries of domain adaptation and then present our proposed method. Experimental results and discussions are presented in Section 4, and finally we draw the conclusion in Section 5.
2 Related Work
In this section, we discuss previous works which are related to our method and analyze their differences.
In machine learning, domain adaptation is a form of transfer learning which aims to learn an effective predictive model for a target domain without labeled data by leveraging abundant existing labeled data of a different but related source domain. Because the collection of large labeled data as needed in traditional machine learning is often prohibitive for many real-life applications, there is an increasing interest in this young yet hot topic [16][19]. According to the taxonomy made in recent surveys [16][19][12], the proposed method falls into the feature representation category.
Recent popular methods embrace dimensionality reduction to seek a latent shared feature space between the source and the target domain. The core idea is to project the original data into a low-dimensional latent space while preserving important structure of the original data. However, [14] points out that direct application of Principal Component Analysis (PCA) cannot guarantee the preservation of discriminative data structures. Their proposed remedy is to maximize the variance of the embedded data. Another interesting idea in [14] is the use of a nonparametric criterion, namely Maximum Mean Discrepancy (MMD), based on a Reproducing Kernel Hilbert Space (RKHS) [2], to estimate the distance between two distributions. Later, [15] further improves [14] in terms of computational efficiency. With JDA, [12] goes one step further and proposes to minimize not only the mismatch of the cross-domain marginal probability distributions but also that of their conditional probability distributions, based on the framework of [14, 15]. The framework proposed in this paper can be considered as an extension of JDA with two major differences. First, we seek a latent subspace which not only minimizes the mismatch of both the marginal and conditional probability distributions across domains, but also reinforces the discriminative structure of subdomains in the original data. We achieve this goal by introducing a novel term which acts as a repulsive force to drag apart different subdomains in both the source and target domains. Second, instead of the NN classifier used in JDA, target labels are inferred through label propagation under label smoothness and geometric structure consistency constraints.
Note that we do not discuss the line of work on transfer learning embedded into deep convolutional neural networks, as the features used in this work are not deep features. Nevertheless, we have noticed their impressive performance, thanks to the combination of the latest advances in transfer learning discussed above with the cutting-edge understanding of the transferability [7] of state-of-the-art deep neural networks, e.g., Deep Adaptation Network (DAN) [11]. Seamlessly combining our proposed transfer knowledge model with state-of-the-art deep networks will be the subject of our upcoming investigation.
3 Close Yet Discriminative Domain Adaptation
In this section, we present in detail the proposed Close yet Discriminative Domain Adaptation (CDDA) method.
3.1 Problem Statement
We begin with the definitions of notations and concepts most of which we borrow directly from [12].
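A domain $\mathcal{D}$ is defined as an $m$-dimensional feature space $\mathcal{X}$ and a marginal probability distribution $P(x)$, i.e., $\mathcal{D} = \{\mathcal{X}, P(x)\}$ with $x \in \mathcal{X}$.
Given a specific domain $\mathcal{D}$, a task $\mathcal{T}$ is composed of a $C$-cardinality label set $\mathcal{Y}$ and a classifier $f(x)$, i.e., $\mathcal{T} = \{\mathcal{Y}, f(x)\}$, where $f(x) = Q(y|x)$, which can be interpreted as the class conditional probability distribution for each input sample $x$.
In unsupervised domain adaptation, we are given a source domain $\mathcal{D}_S = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ with $n_s$ labeled samples, and an unlabeled target domain $\mathcal{D}_T = \{x_j^t\}_{j=1}^{n_t}$ with $n_t$ unlabeled samples, with the assumption that the source domain $\mathcal{D}_S$ and the target domain $\mathcal{D}_T$ share the same feature and label spaces but differ in distribution, i.e., $\mathcal{X}_S = \mathcal{X}_T$, $\mathcal{Y}_S = \mathcal{Y}_T$, $P(\mathcal{X}_S) \neq P(\mathcal{X}_T)$, $Q(\mathcal{Y}_S | \mathcal{X}_S) \neq Q(\mathcal{Y}_T | \mathcal{X}_T)$. We also define the notion of subdomain, denoted as $\mathcal{D}_S^{(c)}$, representing the set of samples in $\mathcal{D}_S$ with label $c$. Similarly, a subdomain $\mathcal{D}_T^{(c)}$ can be defined for the target domain as the set of samples in $\mathcal{D}_T$ with label $c$. However, as $\mathcal{D}_T$ is the target domain with unlabeled samples, a basic classifier, e.g., NN, is needed to attribute pseudo labels to the samples in $\mathcal{D}_T$.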
The aim of Close yet Discriminative Domain Adaptation (CDDA) is to learn a latent feature space with the following properties: 1) the distances between both the marginal and conditional probability distributions of the source and target domains are reduced; 2) the distances between each subdomain and the others are increased in order to push them far away from each other; 3) label prediction is subject to two constraints, i.e., label consistency and the geometric structure of the label space.
3.2 Latent Feature Space with Dimensionality Reduction
Finding a latent feature space through dimensionality reduction has proven useful for domain adaptation in several previous works, e.g., [14, 15, 12]. One of its important properties is that the original data is projected to a lower-dimensional space which captures the principal structure of the data. In the proposed method, we also apply Principal Component Analysis (PCA). Mathematically, given an input data matrix $X = [\mathcal{D}_S, \mathcal{D}_T]$, $X \in \mathbb{R}^{m \times (n_s + n_t)}$, the centering matrix is defined as $H = I - \frac{1}{n_s + n_t}\mathbf{1}$, where $\mathbf{1}$ is the $(n_s + n_t) \times (n_s + n_t)$ matrix of ones. PCA seeks the projection $A$ which maximizes the variance of the embedded data:
$$\max_{A^\top A = I} \operatorname{tr}\left(A^\top X H X^\top A\right) \tag{1}$$
where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix, $X H X^\top$ is the data covariance matrix, and $A \in \mathbb{R}^{m \times k}$ with $m$ the feature dimension and $k$ the dimension of the projected subspace. The optimal solution is calculated by solving an eigendecomposition problem: $X H X^\top A = A \Phi$, where $\Phi = \operatorname{diag}(\phi_1, \dots, \phi_k)$ contains the $k$ largest eigenvalues. Finally, the original data $X$ is projected into the optimal $k$-dimensional subspace using $Z = A^\top X$.
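The PCA step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `pca_projection`, the toy data and the choice of `numpy.linalg.eigh` are assumptions made here, and samples are stored as columns of the data matrix as in the text.

```python
import numpy as np

def pca_projection(X, k):
    """Return (A, Z): top-k principal directions and the projected data.

    X has shape (m, n) with one column per sample (hypothetical helper).
    """
    m, n = X.shape
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    cov = X @ H @ X.T                          # unnormalized covariance, m x m
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    A = eigvecs[:, order]                      # m x k, orthonormal columns
    Z = A.T @ X                                # k x n embedding
    return A, Z

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 40))                   # toy data: m = 5, n = 40
A, Z = pca_projection(X, k=2)
```

Because `eigh` returns orthonormal eigenvectors, the constraint that the projection has orthonormal columns holds by construction.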
3.3 Closer: Marginal and Conditional Distribution Domain Adaptation
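However, the feature space obtained via PCA is not good enough for domain adaptation, because PCA only seeks to maximize the variance of the projected data from the two domains and does not explicitly reduce their distribution mismatch [12, 11]. Since the distance between data distributions across domains can also be measured empirically, we explicitly leverage the nonparametric distance measure MMD in RKHS [2] to compute the distance between the expectations of the source domain $\mathcal{D}_S$ and the target domain $\mathcal{D}_T$, once the original data is projected into a low-dimensional feature space via $A$. Formally, the empirical distance of the two domains is defined as:
$$Dist^{marginal}(\mathcal{D}_S, \mathcal{D}_T) = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s} A^\top x_i - \frac{1}{n_t}\sum_{j=n_s+1}^{n_s+n_t} A^\top x_j \right\|^2 = \operatorname{tr}\left(A^\top X M_0 X^\top A\right) \tag{2}$$
where $M_0$ represents the marginal distribution between $\mathcal{D}_S$ and $\mathcal{D}_T$ and is computed as:
$$(M_0)_{ij} = \begin{cases} \dfrac{1}{n_s n_s}, & x_i, x_j \in \mathcal{D}_S \\[4pt] \dfrac{1}{n_t n_t}, & x_i, x_j \in \mathcal{D}_T \\[4pt] \dfrac{-1}{n_s n_t}, & \text{otherwise} \end{cases} \tag{3}$$
where $x_i, x_j \in (\mathcal{D}_S \cup \mathcal{D}_T)$. The difference between the marginal distributions $P(\mathcal{X}_S)$ and $P(\mathcal{X}_T)$ is reduced by minimizing $Dist^{marginal}(\mathcal{D}_S, \mathcal{D}_T)$.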
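To make the trace form of the marginal MMD concrete, the following sketch builds the coefficient matrix for source columns followed by target columns and checks numerically that the trace expression equals the squared distance between the projected domain means. The helper name `marginal_mmd_matrix`, the toy data and the random projection are assumptions for illustration only.

```python
import numpy as np

def marginal_mmd_matrix(n_s, n_t):
    """Build the marginal MMD matrix for n_s source then n_t target columns."""
    # e has 1/n_s on source positions and -1/n_t on target positions,
    # so e e^T has entries 1/n_s^2, 1/n_t^2 and -1/(n_s n_t).
    e = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
    return np.outer(e, e)

rng = np.random.default_rng(0)
m, n_s, n_t, k = 6, 30, 20, 3
Xs = rng.normal(size=(m, n_s))
Xt = rng.normal(loc=0.5, size=(m, n_t))
X = np.hstack([Xs, Xt])
A = rng.normal(size=(m, k))                    # arbitrary projection for the check

M0 = marginal_mmd_matrix(n_s, n_t)
trace_form = np.trace(A.T @ X @ M0 @ X.T @ A)
mean_gap = A.T @ Xs.mean(axis=1) - A.T @ Xt.mean(axis=1)
direct_form = float(mean_gap @ mean_gap)
```

Writing the matrix as an outer product makes it clear why the trace is a squared norm and therefore never negative.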
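Similarly, the distance between conditional probability distributions is defined as the sum, over the class labels, of the empirical distances between the subdomains of the same label in the source and target domains:
$$Dist^{conditional} = \sum_{c=1}^{C} \left\| \frac{1}{n_s^{(c)}}\sum_{x_i \in \mathcal{D}_S^{(c)}} A^\top x_i - \frac{1}{n_t^{(c)}}\sum_{x_j \in \mathcal{D}_T^{(c)}} A^\top x_j \right\|^2 = \sum_{c=1}^{C} \operatorname{tr}\left(A^\top X M_c X^\top A\right) \tag{4}$$
where $C$ is the number of classes, $\mathcal{D}_S^{(c)}$ represents the $c$-th subdomain in the source domain, and $n_s^{(c)}$ is the number of samples in the $c$-th source subdomain. $\mathcal{D}_T^{(c)}$ and $n_t^{(c)}$ are defined similarly for the target domain. Finally, $M_c$ represents the conditional distribution between subdomains in $\mathcal{D}_S$ and $\mathcal{D}_T$ and is defined as:
$$(M_c)_{ij} = \begin{cases} \dfrac{1}{n_s^{(c)} n_s^{(c)}}, & x_i, x_j \in \mathcal{D}_S^{(c)} \\[4pt] \dfrac{1}{n_t^{(c)} n_t^{(c)}}, & x_i, x_j \in \mathcal{D}_T^{(c)} \\[4pt] \dfrac{-1}{n_s^{(c)} n_t^{(c)}}, & x_i \in \mathcal{D}_S^{(c)}, x_j \in \mathcal{D}_T^{(c)} \text{ or } x_i \in \mathcal{D}_T^{(c)}, x_j \in \mathcal{D}_S^{(c)} \\[4pt] 0, & \text{otherwise} \end{cases} \tag{5}$$
By minimizing $Dist^{conditional}$, the mismatch of conditional distributions between $\mathcal{D}_S^{(c)}$ and $\mathcal{D}_T^{(c)}$ is reduced.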
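The class-wise matrices can be sketched in the same way as the marginal one, this time using source labels and target pseudo labels. The function name `conditional_mmd_matrix` is hypothetical, and returning a zero matrix when a class is missing from one domain is a choice made for this sketch, not something stated in the text.

```python
import numpy as np

def conditional_mmd_matrix(ys, yt_pseudo, c):
    """Build the class-c MMD matrix from source labels and target pseudo labels."""
    n_s, n_t = len(ys), len(yt_pseudo)
    src = np.flatnonzero(ys == c)                  # source columns of class c
    tgt = n_s + np.flatnonzero(yt_pseudo == c)     # target columns of class c
    if len(src) == 0 or len(tgt) == 0:
        # Assumption of this sketch: skip a class absent from one domain.
        return np.zeros((n_s + n_t, n_s + n_t))
    e = np.zeros(n_s + n_t)
    e[src] = 1.0 / len(src)
    e[tgt] = -1.0 / len(tgt)
    return np.outer(e, e)

ys = np.array([0, 0, 1, 1, 1])                     # toy source labels
yt_pseudo = np.array([0, 1, 1, 0])                 # toy target pseudo labels
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 9))                        # 5 source + 4 target columns
A = rng.normal(size=(4, 2))

c = 1
Mc = conditional_mmd_matrix(ys, yt_pseudo, c)
Z = A.T @ X
Zs, Zt = Z[:, :5], Z[:, 5:]
gap = Zs[:, ys == c].mean(axis=1) - Zt[:, yt_pseudo == c].mean(axis=1)
trace_form = np.trace(A.T @ X @ Mc @ X.T @ A)
direct_form = float(gap @ gap)
```

Summing such matrices over all classes gives the coefficient matrix of the full conditional distance.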
3.4 More Discriminative: Repulsive Force Domain Adaptation
The latent feature subspace obtained by joint marginal and conditional domain adaptation, as in JDA, reduces the differences between the source and target domains. As such, the two data spaces are attracted close to each other. However, this model ignores an important property for building an effective predictor, i.e., the preservation or reinforcement of discriminative information related to subdomains. In this paper, we introduce a novel repulsive force domain adaptation, which aims to increase the distances between subdomains with different labels, so as to improve the discriminative power of the latent shared features and thereby make a better predictive model possible for the target domain. To sum up, we aim to generate a latent feature space where the discrepancy between domains is reduced while the distances between subdomains of different labels are simultaneously increased, reinforcing the discriminative power of the underlying latent feature space.
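Specifically, the repulsive force domain adaptation is defined as $Dist^{repulsive} = Dist^{repulsive}_{S \to T} + Dist^{repulsive}_{T \to S}$, where $S \to T$ and $T \to S$ index the distances computed from $\mathcal{D}_S$ to $\mathcal{D}_T$ and from $\mathcal{D}_T$ to $\mathcal{D}_S$, respectively. $Dist^{repulsive}_{S \to T}$ represents the sum of the distances between each source subdomain $\mathcal{D}_S^{(c)}$ and all the target subdomains $\mathcal{D}_T^{(r)}$, $r \in \{1, \dots, C\} \setminus \{c\}$, except the one with the label $c$. The sum of these distances is explicitly defined as:
$$Dist^{repulsive}_{S \to T} = \sum_{c=1}^{C} \left\| \frac{1}{n_s^{(c)}}\sum_{x_i \in \mathcal{D}_S^{(c)}} A^\top x_i - \frac{1}{\bar{n}_t^{(c)}}\sum_{r \neq c}\ \sum_{x_j \in \mathcal{D}_T^{(r)}} A^\top x_j \right\|^2 = \sum_{c=1}^{C} \operatorname{tr}\left(A^\top X M^{(c)}_{S \to T} X^\top A\right) \tag{6}$$
where $\bar{n}_t^{(c)} = \sum_{r \neq c} n_t^{(r)}$ and $M^{(c)}_{S \to T}$ is defined as
$$(M^{(c)}_{S \to T})_{ij} = \begin{cases} \dfrac{1}{n_s^{(c)} n_s^{(c)}}, & x_i, x_j \in \mathcal{D}_S^{(c)} \\[4pt] \dfrac{1}{\bar{n}_t^{(c)} \bar{n}_t^{(c)}}, & x_i, x_j \in \bigcup_{r \neq c} \mathcal{D}_T^{(r)} \\[4pt] \dfrac{-1}{n_s^{(c)} \bar{n}_t^{(c)}}, & x_i \in \mathcal{D}_S^{(c)}, x_j \in \bigcup_{r \neq c} \mathcal{D}_T^{(r)} \text{ or vice versa} \\[4pt] 0, & \text{otherwise} \end{cases} \tag{7}$$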
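Symmetrically, $Dist^{repulsive}_{T \to S}$ represents the sum of the distances from each target subdomain $\mathcal{D}_T^{(c)}$ to all the source subdomains $\mathcal{D}_S^{(r)}$, $r \in \{1, \dots, C\} \setminus \{c\}$, except the source subdomain with the label $c$. Similarly, the sum of these distances is explicitly defined as:
$$Dist^{repulsive}_{T \to S} = \sum_{c=1}^{C} \left\| \frac{1}{n_t^{(c)}}\sum_{x_i \in \mathcal{D}_T^{(c)}} A^\top x_i - \frac{1}{\bar{n}_s^{(c)}}\sum_{r \neq c}\ \sum_{x_j \in \mathcal{D}_S^{(r)}} A^\top x_j \right\|^2 = \sum_{c=1}^{C} \operatorname{tr}\left(A^\top X M^{(c)}_{T \to S} X^\top A\right) \tag{8}$$
where $\bar{n}_s^{(c)} = \sum_{r \neq c} n_s^{(r)}$ and $M^{(c)}_{T \to S}$ is defined as
$$(M^{(c)}_{T \to S})_{ij} = \begin{cases} \dfrac{1}{n_t^{(c)} n_t^{(c)}}, & x_i, x_j \in \mathcal{D}_T^{(c)} \\[4pt] \dfrac{1}{\bar{n}_s^{(c)} \bar{n}_s^{(c)}}, & x_i, x_j \in \bigcup_{r \neq c} \mathcal{D}_S^{(r)} \\[4pt] \dfrac{-1}{n_t^{(c)} \bar{n}_s^{(c)}}, & x_i \in \mathcal{D}_T^{(c)}, x_j \in \bigcup_{r \neq c} \mathcal{D}_S^{(r)} \text{ or vice versa} \\[4pt] 0, & \text{otherwise} \end{cases} \tag{9}$$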
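Finally, we obtain
$$Dist^{repulsive} = \sum_{c=1}^{C} \operatorname{tr}\left(A^\top X \left(M^{(c)}_{S \to T} + M^{(c)}_{T \to S}\right) X^\top A\right) \tag{10}$$
We define $M_{\hat{c}} = M^{(c)}_{S \to T} + M^{(c)}_{T \to S}$ as the repulsive force constraint matrix. While the minimization of Eq. (2) and Eq. (4) brings both the marginal and conditional distributions of the source and target domains closer, the maximization of Eq. (10) increases the distances between source and target subdomains with different labels, thereby improving the discriminative power of the underlying latent feature space.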
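A minimal sketch of one reading of the repulsive force term follows. It assumes that each subdomain is pushed away from the pooled mean of all samples in the other domain that carry a different label; this pooling, the function name `repulsive_matrix` and the toy data are assumptions of the sketch, and the original equations may differ in detail.

```python
import numpy as np

def repulsive_matrix(ys, yt_pseudo, classes):
    """Sum of the S->T and T->S repulsive coefficient matrices over classes."""
    n_s, n_t = len(ys), len(yt_pseudo)
    n = n_s + n_t
    M = np.zeros((n, n))
    for c in classes:
        pairs = [
            # S -> T: source class c versus target samples with another label
            (np.flatnonzero(ys == c), n_s + np.flatnonzero(yt_pseudo != c)),
            # T -> S: target class c versus source samples with another label
            (n_s + np.flatnonzero(yt_pseudo == c), np.flatnonzero(ys != c)),
        ]
        for own, others in pairs:
            if len(own) == 0 or len(others) == 0:
                continue                      # nothing to repel from
            e = np.zeros(n)
            e[own] = 1.0 / len(own)
            e[others] = -1.0 / len(others)
            M += np.outer(e, e)
    return M

ys = np.array([0, 0, 1, 1, 1])
yt_pseudo = np.array([0, 1, 1, 0])
rng = np.random.default_rng(2)
X = rng.normal(size=(4, 9))
A = rng.normal(size=(4, 2))

M_rep = repulsive_matrix(ys, yt_pseudo, classes=(0, 1))
trace_form = np.trace(A.T @ X @ M_rep @ X.T @ A)

# Direct evaluation of the same quantity from projected means.
Z = A.T @ X
Zs, Zt = Z[:, :5], Z[:, 5:]
direct_form = 0.0
for c in (0, 1):
    g1 = Zs[:, ys == c].mean(axis=1) - Zt[:, yt_pseudo != c].mean(axis=1)
    g2 = Zt[:, yt_pseudo == c].mean(axis=1) - Zs[:, ys != c].mean(axis=1)
    direct_form += float(g1 @ g1 + g2 @ g2)
```

Since this quantity is to be maximized while the marginal and conditional distances are minimized, it would enter a joint minimization with the opposite sign.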
3.5 Label Deduction
In a number of domain adaptation methods, e.g., [14, 15, 12, 18], the simple Nearest Neighbor (NN) classifier is applied for label deduction. In JDA, NN-based label deduction is applied twice at each iteration. NN is first applied to the target domain in order to generate the pseudo labels of the target data and enable the computation of the conditional probability distance as defined in Section 3.3. Once the optimized latent subspace is identified, NN is applied once again at the end of the iteration for the label prediction of the target domain. However, NN may not be a good classifier here, given that it is usually based on an $L_1$ or $L_2$ distance. It could fall short in measuring the similarity of source and target domain data, which may be embedded in a manifold with complex structure. Furthermore, the cross-domain discrepancy still exists, even within a reduced latent feature space.
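For reference, the NN-based pseudo-labeling step discussed above can be sketched as a plain 1-nearest-neighbor rule with Euclidean distance in the embedded space. The function name `nn_pseudo_labels` and the toy points are hypothetical, and samples are stored as columns as elsewhere in this section.

```python
import numpy as np

def nn_pseudo_labels(Zs, ys, Zt):
    """Assign each target column of Zt the label of its nearest source column."""
    # Squared Euclidean distances between targets and sources, shape (n_t, n_s).
    d2 = ((Zt.T[:, None, :] - Zs.T[None, :, :]) ** 2).sum(axis=2)
    return ys[np.argmin(d2, axis=1)]

# Four labeled source points (columns) forming two well-separated groups.
Zs = np.array([[0.0, 0.0, 5.0, 5.0],
               [0.0, 1.0, 5.0, 6.0]])
ys = np.array([0, 0, 1, 1])
# Two unlabeled target points, one near each group.
Zt = np.array([[0.2, 4.8],
               [0.1, 5.5]])

pseudo = nn_pseudo_labels(Zs, ys, Zt)
```

The rule only looks at pairwise distances, which is exactly why it can miss structure that lies along a curved data manifold.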
To respect the underlying data manifold structure and better bridge the mismatch between the source and target domain distributions, we further propose in this paper two consistency constraints, namely label smoothness consistency and geometric structure consistency for both the pseudo and final label prediction.
Label Smoothness Consistency (LSC) is defined as:
(11) 
where , is the probability of data belonging to class after iteration. is the initial prediction, and is defined as:
(12) 
Geometric Structure Consistency (GSC) is defined as:
(13) 
3.6 Learning Algorithm
Our proposed domain adaptation integrates the marginal and conditional distribution and repulsive force, as well as the final label prediction using both label smoothness and geometric structure consistencies. Our model is defined as:
(14) 
It can be rewritten mathematically as:
(15) 
Direct solution to this problem is nontrivial. We divide it into two subproblems: (1) , where and (2) . These two subproblems are then iteratively optimized.
The first subproblem, as explained in JDA, amounts to solving the generalized eigendecomposition problem,i.e., . Then, we obtain the adaptation matrix and the underlying embedding space .
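A generalized eigendecomposition of this kind can be solved with NumPy alone by reducing it to a standard symmetric problem through a Cholesky factorization. The sketch below is generic: `P` and `Q` are placeholder names, and the toy `P` uses only the marginal MMD matrix plus a ridge term with a hypothetical weight `lam`, in the style of JDA. Which matrices enter `P` in CDDA should be taken from the model definition, not from this example.

```python
import numpy as np

def smallest_generalized_eigvecs(P, Q, k, ridge=1e-8):
    """Solve P a = phi Q a and return the k eigenpairs with the smallest phi.

    P must be symmetric and Q symmetric positive semi-definite; the small
    ridge keeps the Cholesky factorization of Q well defined.
    """
    L = np.linalg.cholesky(Q + ridge * np.eye(Q.shape[0]))  # Q ~= L L^T
    Linv = np.linalg.inv(L)
    C = Linv @ P @ Linv.T                     # equivalent standard problem
    C = (C + C.T) / 2                         # remove round-off asymmetry
    phi, U = np.linalg.eigh(C)                # ascending eigenvalues
    A = Linv.T @ U[:, :k]                     # map back: a = L^{-T} u
    return phi[:k], A

rng = np.random.default_rng(0)
m, n_s, n_t, k, lam = 5, 25, 15, 2, 0.1       # lam is a hypothetical value
n = n_s + n_t
X = rng.normal(size=(m, n))
H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
e = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
P = X @ np.outer(e, e) @ X.T + lam * np.eye(m)
Q = X @ H @ X.T

phi, A = smallest_generalized_eigvecs(P, Q, k)
Z = A.T @ X                                   # embedded data, k x n
```

With this construction the returned projection satisfies the variance constraint (up to the ridge term) by design, so no separate normalization is needed.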
The second subproblem is nontrivial. Inspired by the solution proposed in [22] [10] [20], the minimum is approached where the derivative of the function is zero. An approximate solution can be provided by:
(16) 
where is the probability of prediction of the target domain corresponding to different class labels.
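As an illustration of graph-based label propagation with a closed-form solution, the following sketch follows the local and global consistency formulation of [22]: a Gaussian affinity graph is built over all embedded samples, symmetrically normalized, and the initial labels are diffused through it. This is not necessarily identical to the solution used in CDDA; the function name `propagate_labels`, the kernel width `sigma` and the toy data are assumptions.

```python
import numpy as np

def propagate_labels(Z, Y0, alpha=0.99, sigma=1.0):
    """Closed-form label propagation in the style of Zhou et al. [22].

    Z:  (k, n) embedded samples as columns (source followed by target).
    Y0: (n, C) initial label matrix; unlabeled rows are all zeros.
    """
    n = Z.shape[1]
    d2 = ((Z.T[:, None, :] - Z.T[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2.0 * sigma ** 2))      # Gaussian affinities
    np.fill_diagonal(W, 0.0)                  # no self-loops
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # D^{-1/2} W D^{-1/2}
    return np.linalg.solve(np.eye(n) - alpha * S, Y0)   # (I - alpha S)^{-1} Y0

# Two clusters: four labeled source points, then two unlabeled target points.
Z = np.array([[0.0, 0.1, 3.0, 3.1, 0.05, 3.05],
              [0.0, 0.0, 3.0, 3.0, 0.10, 2.90]])
Y0 = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0]], dtype=float)

F = propagate_labels(Z, Y0)
target_labels = F[4:].argmax(axis=1)          # predicted classes of the targets
```

A value of `alpha` close to 1 lets the graph structure dominate, while a small value keeps the result close to the initial labels.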
The complete learning algorithm is summarized in Algorithm 1.
4 Experiments
In this section, we validate the effectiveness of our proposed domain adaptation model, i.e., CDDA, on several datasets for cross-domain image classification tasks.
4.1 Benchmarks
In domain adaptation, USPS+MNIST, COIL20, PIE and Office+Caltech are standard benchmarks for evaluation and comparison with the state of the art. In this paper, we follow the data preparation of most previous works. We construct 36 datasets for different image classification tasks. They are: (1) the USPS and MNIST datasets of digits, which follow different probability distributions. We build the cross-domain tasks USPS vs MNIST and MNIST vs USPS; (2) the COIL20 dataset with 20 classes, split into COIL1 vs COIL2 and COIL2 vs COIL1; (3) the PIE face database with different face poses, of which five subsets are selected, denoted as PIE1, PIE2, etc., resulting in $5 \times 4 = 20$ domain adaptation tasks, i.e., PIE1 vs PIE2, ..., PIE5 vs PIE4; (4) Office and Caltech256. Office contains three real-world datasets: Amazon (images downloaded from online merchants), Webcam (low-resolution images) and DSLR (high-resolution images taken with a digital SLR camera). Caltech256 is a standard dataset for object recognition, which contains 30,607 images in 256 categories. We denote the datasets Amazon, Webcam, DSLR, and Caltech256 as A, W, D, and C, respectively. $4 \times 3 = 12$ domain adaptation tasks can then be constructed, namely A → W, ..., C → D.
4.2 Baseline Methods
The proposed CDDA method is compared with six methods from the literature, excluding only CNN-based works, given that we are not using deep features. They are: (1) 1-Nearest Neighbor Classifier (NN); (2) Principal Component Analysis (PCA) + NN; (3) Geodesic Flow Kernel (GFK) [8] + NN; (4) Transfer Component Analysis (TCA) [15] + NN; (5) Transfer Subspace Learning (TSL) [18] + NN; (6) Joint Distribution Adaptation (JDA) [12] + NN. Note that TCA and TSL can be viewed as special cases of JDA with $C = 0$, and JDA as a special case of the proposed CDDA method when the repulsive force domain adaptation is ignored and the label generation is simply based on NN instead of label propagation with label smoothness and geometric structure consistency constraints.
All the reported performance scores of the six methods from the literature are directly collected from the authors' publications. They are assumed to be their best performance.
4.3 Experimental Setup
For the problem of domain adaptation, it is not possible to tune a set of optimal hyperparameters, given the fact that the target domain has no labeled data. Following the setting of JDA, we also evaluate the proposed CDDA by empirically searching the parameter space for the optimal settings. Specifically, the proposed CDDA method has three hyperparameters, i.e., the subspace dimension , regularization parameters and . In our experiments, we set and 1) , and for USPS, MNIST and COIL20 , 2) , for PIE, 3) , for Office and Caltech256.
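In our experiments, accuracy on the test dataset is the evaluation measure. It is widely used in the literature, e.g., [14, 12, 11].
$$Accuracy = \frac{\left|\{x : x \in \mathcal{D}_T \wedge \hat{y}(x) = y(x)\}\right|}{\left|\{x : x \in \mathcal{D}_T\}\right|} \tag{17}$$
where $\mathcal{D}_T$ is the target domain treated as test data, $\hat{y}(x)$ is the predicted label and $y(x)$ is the ground truth label for a test sample $x$.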
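The evaluation measure amounts to the fraction of target samples whose predicted label matches the ground truth. A minimal sketch, with a hypothetical function name and toy labels, reported in percent as in the results table:

```python
import numpy as np

def accuracy(y_pred, y_true):
    """Percentage of target samples whose predicted label is correct."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    return float(np.mean(y_pred == y_true)) * 100.0

score = accuracy([0, 1, 1, 2], [0, 1, 2, 2])   # three of four labels match
```

Here `score` evaluates to 75.0, since three of the four predictions are correct.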
4.4 Experimental Results and Discussion
The classification accuracies of the proposed CDDA method and the six baseline methods are shown in Table 1 and illustrated in Fig. 2 for clarity of comparison.
Datasets  NN  PCA  GFK  TCA  TSL  JDA  CDDA(a)  CDDA(b) 

USPS vs MNIST  44.70  44.95  46.45  51.05  53.75  59.65  62.05  70.75 
MNIST vs USPS  65.94  66.22  67.22  56.28  66.06  67.28  76.22  82.33 
COIL1 vs COIL2  83.61  84.72  72.50  88.47  88.06  89.31  91.53  99.58 
COIL2 vs COIL1  82.78  84.03  74.17  85.83  87.92  88.47  93.89  99.72 
PIE1 vs PIE2  26.09  24.80  26.15  40.76  44.08  58.81  60.22  65.32 
PIE1 vs PIE3  26.59  25.18  27.27  41.79  47.49  54.23  58.70  62.81 
PIE1 vs PIE4  30.67  29.26  31.15  59.63  62.78  84.50  83.48  83.54 
PIE1 vs PIE5  16.67  16.30  17.59  29.35  36.15  49.75  54.17  56.07 
PIE2 vs PIE1  24.49  24.22  25.24  41.81  46.28  57.62  62.33  63.69 
PIE2 vs PIE3  46.63  45.53  47.37  51.47  57.60  62.93  64.64  61.27 
PIE2 vs PIE4  54.07  53.35  54.25  64.73  71.43  75.82  79.90  82.37 
PIE2 vs PIE5  26.53  25.43  27.08  33.70  35.66  39.89  44.00  46.63 
PIE3 vs PIE1  21.37  20.95  21.82  34.69  36.94  50.96  58.46  56.72 
PIE3 vs PIE2  41.01  40.45  43.16  47.70  47.02  57.95  59.73  58.26 
PIE3 vs PIE4  46.53  46.14  46.41  56.23  59.45  68.45  77.20  77.83 
PIE3 vs PIE5  26.23  25.31  26.78  33.15  36.34  39.95  47.24  41.24 
PIE4 vs PIE1  32.95  31.96  34.24  55.64  63.66  80.58  83.10  81.84 
PIE4 vs PIE2  62.68  60.96  62.92  67.83  72.68  82.63  82.26  85.27 
PIE4 vs PIE3  73.22  72.18  73.35  75.86  83.52  87.25  86.64  86.95 
PIE4 vs PIE5  37.19  35.11  37.38  40.26  44.79  54.66  58.33  53.80 
PIE5 vs PIE1  18.49  18.85  20.35  26.98  33.28  46.46  48.02  57.44 
PIE5 vs PIE2  24.19  23.39  24.62  29.90  34.13  42.05  45.61  53.84 
PIE5 vs PIE3  28.31  27.21  28.49  29.90  36.58  53.31  52.02  55.27 
PIE5 vs PIE4  31.24  30.34  31.33  33.64  38.75  57.01  55.99  61.82 
C → A  23.70  36.95  41.02  38.20  44.47  44.78  48.33  52.09 
C → W  25.76  32.54  40.68  38.64  34.24  41.69  44.75  47.12 
C → D  25.48  38.22  38.85  41.40  43.31  45.22  48.41  45.86 
A → C  26.00  34.73  40.25  37.76  37.58  39.36  42.12  41.32 
A → W  29.83  35.59  38.98  37.63  33.90  37.97  41.69  38.31 
A → D  25.48  27.39  36.31  33.12  26.11  39.49  37.58  38.22 
W → C  19.86  26.36  30.72  29.30  29.83  31.17  31.97  33.30 
W → A  22.96  31.00  29.75  30.06  30.27  32.78  37.27  41.75 
W → D  59.24  77.07  80.89  87.26  87.26  89.17  87.90  89.81 
D → C  26.27  29.65  30.28  31.70  28.50  31.52  34.64  33.66 
D → A  28.50  32.05  32.05  32.15  27.56  33.09  33.51  33.61 
D → W  63.39  75.93  75.59  86.10  85.42  89.49  90.51  93.22 
Average (USPS)  55.32  55.59  56.84  53.67  59.90  63.47  69.14  76.54 
Average (COIL)  83.20  84.38  73.34  87.15  87.99  88.89  92.71  99.65 
Average (PIE)  34.76  33.85  35.35  44.75  49.43  60.24  63.10  64.60 
Average (Amazon)  31.37  39.79  42.95  43.61  42.37  46.31  48.22  49.02 
Overall Average  37.46  39.84  41.19  47.22  49.80  57.37  60.12  62.02 
In Table 1, the highest accuracy for each cross-domain adaptation task is highlighted in bold. For a better understanding of the proposed CDDA, we evaluate it using two settings: (1) CDDA(a), where simple NN is used as the label predictor instead of the proposed label propagation; and (2) CDDA(b), where the proposed label propagation is activated for the prediction of target data labels. As CDDA reduces to JDA when repulsive force domain adaptation and label propagation are not integrated, the setting CDDA(a) makes it possible to quantify the contribution of adding the repulsive force domain adaptation w.r.t. JDA, whereas the setting CDDA(b) makes it possible to evidence the contribution of the proposed label propagation in comparison with CDDA(a) and to highlight the overall behavior of the proposed method.
As can be seen in Table 1, the proposed CDDA achieves an overall average accuracy of 60.12% and 62.02%, respectively, under the above two settings. Both outperform the six baseline algorithms by a large margin. With the repulsive force integrated and NN as label predictor, CDDA(a) outperforms JDA on 30 cross-domain tasks out of 36 and improves JDA's overall average accuracy by roughly 3 points, thereby demonstrating the effectiveness of the proposed repulsive force domain adaptation. By adopting the proposed label propagation under the constraints of both label smoothness and geometric structure consistency, CDDA(b) further improves CDDA(a) by roughly 2 points in terms of overall average accuracy and outperforms JDA by more than 4 points. Compared with the baseline methods, the proposed CDDA method consistently shows its superiority and achieves the best average accuracy on all four datasets (USPS+MNIST, COIL20, PIE, Amazon). As can be seen in Fig. 2, CDDA(b), represented by the red curve, is on top of the other curves along the axis of the 36 cross-domain image classification tasks. It is worth noting that the proposed CDDA achieves 99.65% accuracy on COIL20. This is an unexpectedly impressive score given the unsupervised nature of the domain adaptation for the target domain.
Using the COIL2 vs COIL1 and C → W datasets, we also empirically check the convergence and the sensitivity of the proposed CDDA with respect to the hyperparameters. Similar trends can be observed on all the other datasets.
The accuracy w.r.t. iterations is shown in Fig.3 (a). As can be seen there, the performance of the proposed CDDA along with JDA becomes stable after about 10 iterations.
In the experiments, CDDA has two settings: two parameters (the subspace dimension $k$ and the regularization parameter $\lambda$) in CDDA(a) and three ($k$, $\lambda$ and the label propagation parameter $\alpha$) in CDDA(b). The accuracy variation w.r.t. the regularization parameter $\alpha$ is shown in Fig.3 (b), which indicates that CDDA(b) achieves the best performance when $\alpha$ is close to 0.99 on COIL20 and that the performance is more or less stable when $\alpha$ is less than 0.99. Given a novel dataset, we tune the parameter $\alpha$ in the range [0.001, 1]. For instance, on the PIE database, we set the optimal $\alpha$ to 0.2. The other parameters, i.e., $k$ and $\lambda$, also converge. Their behavior is not shown here due to space limitations.
5 Conclusion and Future Work
In this paper, we have proposed a Close yet Discriminative Domain Adaptation (CDDA) method based on feature representation. Comprehensive experiments on 36 cross-domain datasets highlight the value of reinforcing the discriminative properties of the data within the model and of label propagation that respects the geometric structure of the underlying data manifold, and verify the effectiveness of the proposed method compared with six baseline methods from the literature.
Our future work will concentrate on embedding the proposed method in deep networks and on studying other vision tasks, e.g., object detection, within the setting of transfer learning.
References
 [1] Blitzer, J., McDonald, R., and Pereira, F. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 conference on empirical methods in natural language processing (2006), Association for Computational Linguistics, pp. 120–128.
 [2] Borgwardt, K. M., Gretton, A., Rasch, M. J., Kriegel, H.-P., Schölkopf, B., and Smola, A. J. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics 22, 14 (2006), e49–e57.
 [3] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
 [4] Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision 111, 1 (Jan. 2015), 98–136.
 [5] Fisher, R. A. The use of multiple measurements in taxonomic problems. Annals of eugenics 7, 2 (1936), 179–188.
 [6] Girshick, R. Fast R-CNN. In International Conference on Computer Vision (ICCV) (2015).
 [7] Glorot, X., Bordes, A., and Bengio, Y. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th international conference on machine learning (ICML-11) (2011), pp. 513–520.
 [8] Gong, B., Shi, Y., Sha, F., and Grauman, K. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (2012), IEEE, pp. 2066–2073.
 [9] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015).
 [10] Kim, T. H., Lee, K. M., and Lee, S. U. Learning full pairwise affinities for spectral segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 7 (July 2013), 1690–1703.
 [11] Long, M., Cao, Y., Wang, J., and Jordan, M. I. Learning transferable features with deep adaptation networks. In ICML (2015), pp. 97–105.
 [12] Long, M., Wang, J., Ding, G., Sun, J., and Yu, P. S. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE International Conference on Computer Vision (2013), pp. 2200–2207.
 [13] Ng, A. Y., Jordan, M. I., and Weiss, Y. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds. MIT Press, 2002, pp. 849–856.
 [14] Pan, S. J., Kwok, J. T., and Yang, Q. Transfer learning via dimensionality reduction. In AAAI (2008), vol. 8, pp. 677–682.
 [15] Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22, 2 (2011), 199–210.
 [16] Pan, S. J., and Yang, Q. A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22, 10 (2010), 1345–1359.
 [17] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252.
 [18] Si, S., Tao, D., and Geng, B. Bregman divergence-based regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering 22, 7 (July 2010), 929–942.
 [19] Weiss, K., Khoshgoftaar, T. M., and Wang, D. A survey of transfer learning. Journal of Big Data 3, 1 (2016), 1–40.
 [20] Yang, C., Zhang, L., Lu, H., Ruan, X., and Yang, M. H. Saliency detection via graph-based manifold ranking. In 2013 IEEE Conference on Computer Vision and Pattern Recognition (June 2013), pp. 3166–3173.
 [21] Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. Pyramid scene parsing network. CoRR abs/1612.01105 (2016).
 [22] Zhou, D., Bousquet, O., Lal, T. N., Weston, J., and Schölkopf, B. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16 (2004), MIT Press, pp. 321–328.