Beyond Sharing Weights for Deep Domain Adaptation
Abstract
The performance of a classifier trained on data coming from a specific domain typically degrades when applied to a related but different one. While annotating many samples from the new domain would address this issue, it is often too expensive or impractical. Domain Adaptation has therefore emerged as a solution to this problem; it leverages annotated data from a source domain, in which it is abundant, to train a classifier to operate in a target domain, in which it is either scarce or even lacking altogether. In this context, the recent trend consists of learning deep architectures whose weights are shared for both domains, which essentially amounts to learning domain-invariant features.
Here, we show that it is more effective to explicitly model the shift from one domain to the other. To this end, we introduce a two-stream architecture, where one stream operates in the source domain and the other in the target domain. In contrast to other approaches, the weights in corresponding layers are related but not shared. We demonstrate that this both yields higher accuracy than state-of-the-art methods on several object recognition and detection tasks and consistently outperforms networks with shared weights in both supervised and unsupervised settings.
1 Introduction
A classifier trained using samples from a specific domain usually needs to be retrained to perform well in a related but different one. Since this may require much manual annotation to create enough training data, it is often impractical. With the advent of Deep Networks [23, 32], this problem has become particularly acute due to their requirements for massive amounts of training data.
Domain Adaptation [27] and Transfer Learning [39] have long been used to overcome this difficulty by making it possible to exploit what has been learned in one source domain, for which enough training data is available, to effectively train classifiers in a target domain, where only very small amounts of additional annotations, or even none, can be acquired. Recently, Domain Adaptation has been investigated in the context of Deep Learning with promising results [17, 37, 44, 34, 14, 43]. These methods, however, use the same deep architecture with the same weights for both source and target domains. In other words, they attempt to learn features that are invariant to the domain shift.
In this paper, we show that imposing feature invariance is detrimental to discriminative power. To this end, we introduce the two-stream architecture depicted by Fig. 1. One stream operates on the source domain and the other on the target one. This makes it possible not to share the weights in some of the layers. Instead, we introduce a loss function that is lowest when the corresponding weights are linear transformations of each other. Furthermore, we introduce a criterion to automatically determine which layers should share their weights and which ones should not. In short, our approach explicitly models the domain shift by learning features that are adapted to each domain, but not fully independent, to account for the fact that both domains depict the same underlying problem.
We demonstrate that our approach is more effective than state-of-the-art weight-sharing schemes on standard Domain Adaptation benchmarks for image recognition. We also show that it is well suited to leveraging synthetic data to increase the performance of a classifier on real images. Given that this is one of the easiest ways to provide the large amounts of training data that Deep Networks require, this scenario has become popular. Here, we treat the synthetic images as forming the source domain and the real images the target one. We then make use of our two-stream architecture to learn an effective model for the real data even though we have only a few annotations for it. We demonstrate the effectiveness of our approach at leveraging synthetic data for both detection of Unmanned Aerial Vehicles (UAVs) and facial pose estimation. The first application involves classification and the second regression, and they both benefit from using synthetic data. We outperform the state-of-the-art methods in all these cases, and our experiments support our contention that specializing the network weights outperforms sharing them.
2 Related Work
In many practical applications, classifiers and regressors may have to operate on various kinds of related but visually different image data. The differences are often large enough for an algorithm that has been trained on one kind of images to perform poorly on another. Therefore, new training data has to be acquired and annotated to retrain it. Since this is typically expensive and time-consuming, there has long been a push to develop Domain Adaptation techniques that allow retraining with a minimal amount of new data, or even none. Here, we briefly review some recent trends, with a focus on Deep Learning based methods, which are the most related to our work.
A natural approach to Domain Adaptation is to modify a classifier trained on the source data using the available labeled target data. This was done, for example, using SVM [11, 4], Boosted Decision Trees [3] and other classifiers [9]. In the context of Deep Learning, fine-tuning [17, 37] essentially follows this pattern. In practice, however, when only a small amount of labeled target data is available, this often results in overfitting.
Another approach is to learn a metric between the source and target data, which can also be interpreted as a linear cross-domain transformation [41] or a nonlinear one [30]. Instead of working on the samples directly, several methods represent each domain as a separate subspace [20, 19, 12, 6]. A transformation can then be learned to align them [12]. Alternatively, one can interpolate between the source and target subspaces [20, 19, 6]. In [7], this interpolation idea was extended to Deep Learning by training multiple unsupervised networks with increasing amounts of target data. The final representation of a sample was obtained by concatenating all intermediate ones. It is unclear, however, why this concatenation should be meaningful to classify a target sample.
Another way to handle the domain shift is to explicitly try making the source and target data distributions similar. While many metrics have been proposed to quantify the similarity between two distributions, the most widely used in the Domain Adaptation context is the Maximum Mean Discrepancy (MMD) [21]. The MMD has been used to reweight [24, 22] or select [18] source samples such that the resulting distribution becomes as similar as possible to the target one. An alternative is to learn a transformation of the data, typically both source and target, such that the resulting distributions are as similar as possible in MMD terms [38, 36, 1]. In [15], the MMD was used within a shallow neural network architecture. However, this method relied on SURF features [2] as initial image representation and thus only achieved limited accuracy.
Recently, using Deep Networks to learn features has proven effective at increasing the accuracy of Domain Adaptation methods. In [10], it was shown that using DeCAF features instead of hand-crafted ones mitigates the domain shift effects even without performing any kind of adaptation. However, performing adaptation within a Deep Learning framework was shown to boost accuracy even further [8, 44, 34, 14, 43, 16, 5]. For example, in [8], a Siamese architecture was introduced to minimize the distance between pairs of source and target samples, which requires training labels to be available in the target domain, thus making the method unsuitable for unsupervised Domain Adaptation. The MMD has also been used to relate the source and target data representations learned by Deep Networks [44, 34], thus making it possible to avoid working on individual samples. [14, 43] introduced a loss term that encodes an additional classifier predicting from which domain each sample comes. This was motivated by the fact that, if the learned features are domain-invariant, such a classifier should exhibit very poor performance.
All these Deep Learning approaches rely on the same architecture with the same weights for both the source and target domains. In essence, they attempt to reduce the impact of the domain shift by learning domain-invariant features. In practice, however, domain invariance might very well be detrimental to discriminative power. As discussed in the introduction, this is the hypothesis we set out to test in this work by introducing an approach that explicitly models the domain shift instead of attempting to enforce invariance to it. We show in the results section that this yields a significant accuracy boost over networks with shared weights.
3 Our Approach
The core idea of our method is that, for a Deep Network to adapt to different domains, the weights should be related, yet different for each of the two domains. As we show empirically, this constitutes a major advantage of our method over the competing ones discussed in Section 2. To implement this idea, we therefore introduce a two-stream architecture, such as the one depicted by Fig. 1. The first stream operates on the source data, the second on the target one, and they are trained jointly. While we allow the weights of the corresponding layers to differ between the two streams, we prevent them from being too far from each other. Additionally, we use the MMD between the learned source and target representations. This combination lets us encode the fact that, while different, the two domains are related.
More formally, let X^s = {x_i^s}_{i=1..N} and X^t = {x_i^t}_{i=1..M} be the sets of training images from the source and target domains, respectively, with Y^s and Y^t being the corresponding labels. To handle unsupervised target data as well, we assume, without loss of generality, that the target samples are ordered, such that only the first N' ≤ M have valid labels, where N' = 0 in the unsupervised scenario. Furthermore, let θ_j^s and θ_j^t denote the parameters, that is, the weights and biases, of the j-th layer of the source and target streams, respectively. We train the network by minimizing a loss function of the form
L(θ^s, θ^t | X^s, X^t, Y^s, Y^t) = L_s + L_t + L_w + L_MMD ,  (1)
with
L_s = (1/N) Σ_{i=1..N} c(θ^s | x_i^s, y_i^s) ,  (2)
L_t = (1/N') Σ_{i=1..N'} c(θ^t | x_i^t, y_i^t) ,  (3)
L_w = λ_w Σ_{j∈Ω} r_w(θ_j^s, θ_j^t) ,  (4)
L_MMD = λ_u r_u(X^s, X^t | θ^s, θ^t) ,  (5)
where c(·) is a standard classification loss, such as the logistic loss or the hinge loss. r_w and r_u are the weight and unsupervised regularizers discussed below. The first one represents the loss between corresponding layers of the two streams. The second encodes the MMD measure and favors similar distributions of the source and target data representations. These regularizers are weighted by the coefficients λ_w and λ_u, respectively. In practice, we found our approach to be robust to the specific values of these coefficients, and we used the same fixed values in all our experiments. Ω denotes the set of indices of the layers whose parameters are not shared. This set is problem-dependent and, in practice, can be obtained by comparing the MMD values for different configurations, as demonstrated in our experiments.
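As an illustration, the overall objective of Eqs. 1-5 can be sketched in NumPy. This is a minimal sketch under our own naming: softmax cross-entropy stands in for the generic classification loss c, and the regularizer values r_w and r_u are assumed to be precomputed as described in Sections 3.1 and 3.2.

```python
import numpy as np

def cross_entropy(logits, labels):
    # Mean softmax cross-entropy; stands in for the classification loss c.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def total_loss(src_logits, src_labels, tgt_logits, tgt_labels,
               r_w, r_u, lambda_w=1.0, lambda_u=1.0):
    # Eq. 1: classification losses on both streams plus the two regularizers.
    L_s = cross_entropy(src_logits, src_labels)
    # In the unsupervised scenario (N' = 0) no target labels exist and L_t vanishes.
    L_t = cross_entropy(tgt_logits, tgt_labels) if len(tgt_labels) else 0.0
    return L_s + L_t + lambda_w * r_w + lambda_u * r_u
```

Note that in the unsupervised scenario the target classification term simply drops out, while the two regularizers still couple the streams.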
3.1 Weight Regularizer
While our goal is to go beyond sharing the layer weights, we still believe that corresponding weights in the two streams should be related. This models the fact that the source and target domains are related, and prevents overfitting in the target stream when only very few labeled samples are available. Our weight regularizer therefore represents the distance between the source and target weights in a particular layer. In principle, we could take it to act directly on the difference of those weights. This, however, would not truly attempt to model the domain shift, for instance to account for different means and ranges of values in the two types of data. To better model the shift and introduce more flexibility in our model, we therefore propose not to penalize linear transformations between the source and target weights, but only deviations from such transformations. We then write our regularizer either by relying on the L2 norm as
r_w(θ_j^s, θ_j^t) = || a_j θ_j^s + b_j − θ_j^t ||² ,  (6)
or in an exponential form as
r_w(θ_j^s, θ_j^t) = exp( || a_j θ_j^s + b_j − θ_j^t ||² ) − 1 .  (7)
In both cases, a_j and b_j are scalar parameters that are different for each layer and learned at training time along with all the other network parameters. While simple, this parameterization can account, for example, for global illumination changes in the first layer of the network. As shown in the results section, we found empirically that the exponential version gives better results.
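Both variants are straightforward to compute. The sketch below uses our own function name; the exponential form is written so that it vanishes when the linear transformation fits exactly, which is our reading of Eq. 7.

```python
import numpy as np

def weight_regularizer(theta_s, theta_t, a, b, exponential=True):
    # Distance between the target weights and a scalar linear map of the
    # source weights, a * theta_s + b (Eq. 6); exponential variant as in Eq. 7.
    d2 = np.sum((a * theta_s + b - theta_t) ** 2)
    return np.exp(d2) - 1.0 if exponential else d2
```

For an exact fit both forms are zero; for any residual, the exponential form penalizes large deviations much more aggressively than the L2 form.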
3.2 Unsupervised Regularizer
In addition to regularizing the weights of corresponding layers in the two streams, we also aim at learning a final representation (the features just before the classifier layer) that is domain invariant. To this end, we introduce a regularizer designed to minimize the distance between the distributions of the source and target representations. Following the popular trend in Domain Adaptation [35, 44], we rely on the Maximum Mean Discrepancy (MMD) [21] to encode this distance.
As the name suggests, given two sets of data, the MMD measures the distance between the means of the two sets after mapping each sample to a Reproducing Kernel Hilbert Space (RKHS). In our context, let f_i^s be the feature representation of x_i^s at the last layer of the source stream, and f_i^t that of x_i^t at the last layer of the target stream. The MMD between the source and target domains can be expressed as
MMD²(X^s, X^t) = || (1/N) Σ_{i=1..N} φ(f_i^s) − (1/M) Σ_{j=1..M} φ(f_j^t) ||² ,  (8)
where φ(·) denotes the mapping to the RKHS. In practice, this mapping is typically unknown. Expanding Eq. 8 and using the kernel trick to replace inner products by kernel values lets us rewrite the squared MMD, and thus our regularizer, as
r_u = (1/N²) Σ_{i,i'} k(f_i^s, f_{i'}^s) − (2/(NM)) Σ_{i,j} k(f_i^s, f_j^t) + (1/M²) Σ_{j,j'} k(f_j^t, f_{j'}^t) ,  (9)
where the dependency on the network parameters comes via the f's, and where k(·,·) is a kernel function. In practice, we make use of the standard RBF kernel k(u, v) = exp(−||u − v||² / σ), with bandwidth σ. In all our experiments, we found our approach to be insensitive to the choice of σ, and we therefore kept it fixed.
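A minimal NumPy sketch of the empirical squared MMD of Eq. 9, with the RBF kernel described above (function names are ours):

```python
import numpy as np

def rbf_kernel(u, v, sigma=1.0):
    # k(u, v) = exp(-||u - v||^2 / sigma), evaluated for all pairs of rows.
    d2 = np.sum((u[:, None, :] - v[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / sigma)

def mmd2(f_s, f_t, sigma=1.0):
    # Empirical squared MMD between source and target feature sets (Eq. 9).
    n, m = len(f_s), len(f_t)
    return (rbf_kernel(f_s, f_s, sigma).sum() / n**2
            + rbf_kernel(f_t, f_t, sigma).sum() / m**2
            - 2.0 * rbf_kernel(f_s, f_t, sigma).sum() / (n * m))
```

The value is zero when the two feature sets coincide and grows as their distributions drift apart, which is exactly the behavior the regularizer penalizes.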
3.3 Training
To learn the model parameters, we first pretrain the source stream using the source data only. We then simultaneously optimize the weights of both streams according to the loss of Eqs. 2-5 using both source and target data, with the target stream weights initialized from the pretrained source weights. Note that this also requires initializing the linear transformation parameters of each layer, a_j and b_j for all j ∈ Ω. We initialize these values to a_j = 1 and b_j = 0, thus encoding the identity transformation. All parameters are then learned jointly using backpropagation with the AdaDelta algorithm [45]. Note that we rely on mini-batches, and thus in practice compute all the terms of our loss over these mini-batches rather than over the entire source and target datasets.
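The effect of this initialization can be checked numerically: with the target weights copied from the pretrained source stream and (a_j, b_j) = (1, 0), the weight regularizer of Eq. 6 starts at exactly zero, so no penalty is incurred before the streams diverge. A sketch with hypothetical layer shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretrained source weights for three hypothetical unshared layers.
source_weights = [rng.standard_normal((8, 4)) for _ in range(3)]

# Target stream initialized from the source; a_j = 1, b_j = 0 encode identity.
target_weights = [w.copy() for w in source_weights]
a = [1.0] * 3
b = [0.0] * 3

# At this starting point the weight regularizer of Eq. 6 is exactly zero.
r_w = sum(np.sum((a_j * w_s + b_j - w_t) ** 2)
          for a_j, b_j, w_s, w_t in zip(a, b, source_weights, target_weights))
```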
Depending on the task, we use different network architectures, to provide a fair comparison with the baselines. For example, for the Office benchmark, we adopt the AlexNet [29] architecture, as was done in [44], and for digit classification we rely on the standard network structure of [31] for each stream.
4 Experimental Results
Figure 2: Sample synthetic and real images used for training, together with real test data.
In this section, we demonstrate the potential of our approach in both the supervised and unsupervised scenarios using different network architectures. We first thoroughly evaluate our method for the drone detection task. We then demonstrate that it generalizes well to other classification problems by testing it on the Office and MNIST+USPS datasets. Finally, to show that our approach also generalizes to regression problems, we apply it to estimating the position of facial landmarks.
4.1 Leveraging Synthetic Data for Drone Detection
Due to the lack of large publicly available datasets, UAV detection is a perfect example of a problem where training videos are scarce and do not cover a wide enough range of possible shapes, poses, lighting conditions, and backgrounds against which drones can be seen. However, it is relatively easy to generate large amounts of synthetic examples, which can be used to supplement a small number of real images and increase detection accuracy [40]. We show here that our approach allows us to exploit these synthetic images more effectively than other state-of-the-art Domain Adaptation techniques.
4.1.1 Dataset and Evaluation Setup
We used the approach of [40] to create a large set of synthetic examples. Fig. 2 depicts sample images from the real and synthetic dataset that we used for training and testing. In our experiments, we treat the synthetic images as source samples and the real images as target ones.
We report results using two versions of this dataset, which we refer to as UAV-200 (small) and UAV-200 (full). Their sizes are given in Table 1. They only differ in the number of synthetic and negative samples used for training and testing. The ratio of positive to negative samples in the first dataset is better balanced than in the second one. For UAV-200 (small), we therefore express our results in terms of accuracy, which is commonly used in Domain Adaptation and can be computed as
Accuracy = (number of correctly classified test samples) / (total number of test samples) .  (10)
Using this standard metric facilitates the comparison against the baseline methods whose publicly available implementations only output classification accuracy.
In real detection tasks, however, training datasets are typically quite unbalanced, since one usually encounters many negative windows for each positive example. UAV-200 (full) reflects this more realistic scenario, in which the accuracy metric is poorly suited. For this dataset, we therefore compare the various approaches in terms of precision-recall. Precision corresponds to the number of true positives detected by the algorithm divided by the total number of detections. Recall is the number of true positives divided by the number of test examples labeled as positive. Additionally, we report the Average Precision (AP), computed as AP = ∫₀¹ p(r) dr, where p and r denote precision and recall, respectively.
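The AP computation described above can be sketched as follows: precision is accumulated over the recall increments induced by ranking detections by score (a minimal sketch; the function name is ours).

```python
import numpy as np

def average_precision(scores, labels):
    # AP as a Riemann sum of p(r) dr over the recall steps, with detections
    # sorted by decreasing confidence score.
    order = np.argsort(-scores)
    labels = labels[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / labels.sum()
    dr = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(precision * dr))
```

When every positive is ranked above every negative, the AP is 1; false positives ranked early pull it down.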
For this experiment, we follow the supervised Domain Adaptation scenario. In other words, training data is available with labels for both source and target domains.
Table 1: Number of training and testing samples in the two versions of our UAV dataset.

Dataset | Training Pos (Real) | Training Pos (Synthetic) | Training Neg (Real) | Testing Pos (Real) | Testing Neg (Real)
UAV-200 (full) | 200 | 32800 | 190000 | 3100 | 135000
UAV-200 (small) | 200 | 1640 | 9500 | 3100 | 6750
4.1.2 Network Design
Our network consists of two streams, one for the source data and one for the target data, as illustrated by Fig. 1. Each stream is a CNN that comprises three convolutional and max-pooling layers, followed by two fully-connected ones. The classification layer encodes a hinge loss, which was shown to outperform the logistic loss in practice for some tasks [28, 26].
As discussed above, some pairs of layers in our two-stream architecture may share their weights while others do not, and we must decide upon an optimal arrangement. To this end, we trained one model for every possible combination. In every case, we implemented our regularizer using either the L2 loss of Eq. 6 or the exponential loss of Eq. 7. After training, we then computed the MMD value between the outputs of the two streams for each configuration. We plot the results in Fig. 3 (top), where the markers indicate whether the weights of each layer are stream-specific or shared. Since we use a common classification layer, the MMD value ought to be small when our architecture accounts well for the domain shift [44]. It therefore makes sense to choose the configuration that yields the smallest MMD value. In this case, this happens when using the exponential loss to connect the first three layers and sharing the weights of the others. Our intuition is that, even though the synthetic and real images feature the same objects, they differ in appearance, which is mostly encoded by the first network layers. Thus, allowing the weights to differ in these layers yields good adaptive behavior, as will be demonstrated in Section 4.1.3.
As a sanity check, we used validation data (positive and negative examples) to confirm that this MMD-based criterion reflects the best architecture choice. In Fig. 3 (bottom), we plot the real detection accuracy as a function of the chosen configuration. The best possible accuracies are 0.916 and 0.757 on the validation and test data, respectively, whereas the ones corresponding to our MMD-based choice are 0.902 and 0.732, which corresponds to the second-best architecture. Note that the MMD of the best solution is also very low. Altogether, we believe that this evidences that our MMD-based criterion provides an effective alternative for selecting the right architecture in the absence of validation data.
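The selection procedure amounts to a search over binary share/not-share patterns. A sketch follows, where `mmd_of` is a hypothetical callable that would, in practice, train the corresponding configuration and evaluate the MMD between the two streams' outputs:

```python
from itertools import product

def select_architecture(mmd_of, num_layers=5):
    # Enumerate every shared/unshared pattern over the network's layers
    # (True = layer weights are stream-specific) and keep the pattern with
    # the smallest MMD between the two streams' final representations.
    patterns = list(product([False, True], repeat=num_layers))
    return min(patterns, key=mmd_of)
```

This exhaustive search is feasible for the small number of layers considered here (2^5 = 32 configurations); for deeper networks one would restrict the candidate set.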
4.1.3 Evaluation
We first compare our approach to other Domain Adaptation methods on UAV-200 (small). As can be seen in Table 2, it significantly outperforms many state-of-the-art baselines in terms of accuracy. In particular, we believe that outperforming DDC [44] goes a long way towards validating our hypothesis that modeling the domain shift is more effective than trying to be invariant to it. This is because, as discussed in Section 2, DDC relies on minimizing the MMD loss between the learned source and target representations, much as we do, but uses a single stream for both source and target data. In other words, except for the non-shared weights, it is the method closest to ours. Note, however, that the original DDC paper used a slightly different network architecture than ours. To avoid any bias, we therefore modified this architecture so that it matches ours.
Table 2: Accuracy on UAV-200 (small).

Method | Accuracy
ITML [41] | 0.60
ARC-t asymmetric [30] | 0.55
ARC-t symmetric [30] | 0.60
HFA [33] | 0.75
DDC [44] | 0.89
Ours | 0.92
We then turn to the complete dataset UAV-200 (full). In this case, the baselines whose implementations only output accuracy values become less relevant, because accuracy is not a good metric for unbalanced data. We therefore compare our approach against DDC [44], which we found to be our strongest competitor in the previous experiment, and against the Deep Learning approach of [40], which also tackles the drone detection problem. We also turn on and off some of our loss terms to quantify their influence on the final performance. We give the results in Table 3. In short, all loss terms contribute to improving the AP of our approach, which itself outperforms all the baselines by large margins. More specifically, we get a 10% boost over DDC and a 20% boost over using real data only. By contrast, simply using real and synthetic examples together, as was done in [40], does not yield significant improvements. Note that dropping the terms linking the weights in corresponding layers while still minimizing the MMD loss (i.e., training without L_w) performs worse than using our full loss function. We attribute this to overfitting of the target stream.
Table 3: Average Precision (AP) on UAV-200 (full).

Method | AP
CNN (trained on Synthetic only (S)) | 0.314
CNN (trained on Real only (R)) | 0.575
CNN (pretrained on S, fine-tuned on R), Loss: L_t | 0.612
CNN (pretrained on S, fine-tuned on R), Loss: L_t + L_MMD (with fixed source CNN) | 0.655
CNN (pretrained on S, fine-tuned on R and S), Loss: L_s + L_t [40] | 0.569
DDC [44] (pretrained on S, fine-tuned on R and S) | 0.664
Ours (pretrained on S, fine-tuned on R and S), Loss: L_s + L_t | 0.673
Ours (pretrained on S, fine-tuned on R and S), Loss: L_s + L_t + L_MMD | 0.711
Ours (pretrained on S, fine-tuned on R and S), Loss: L_s + L_t + L_w + L_MMD | 0.732
4.1.4 Influence of the Number of Samples
Using synthetic data in the UAV detection scenario is motivated by the fact that it is hard and time-consuming to collect large amounts of real data. We therefore evaluate the influence of the ratio of synthetic to real data. To this end, we first fix the number of synthetic samples to 32800, as in UAV-200 (full), and vary the number of real positive samples from 200 to 5000. The results of this experiment are reported in Fig. 4 (left), where we again compare our approach to DDC [44] and to the same CNN model trained on the real samples only. Our model always outperforms the one trained on real data only. This suggests that it remains capable of leveraging the synthetic data even when more real data is available, which is not the case for DDC. More importantly, looking at the leftmost point on our curve shows that, with only 200 real samples, our approach performs similarly to, and even slightly better than, a single-stream model trained using 2500 real samples. In other words, one only needs to collect 5-10% of the labeled training data to obtain good results with our approach, which, we believe, can have a significant impact in practical applications.
Fig. 4 (right) depicts the results of an experiment where we fixed the number of real samples to 200 and increased the number of synthetic ones from 0 to 32800. Note that the AP of our approach steadily increases as more synthetic data is used. DDC also improves, but we systematically outperform it except when we use no synthetic samples, in which case both approaches reduce to a single-stream CNN trained on real data only.
4.2 Unsupervised Domain Adaptation on Office
To demonstrate that our approach extends to the unsupervised case, we further evaluate it on the Office dataset, which is a standard domain adaptation benchmark for image classification. Following standard practice, we express our results in terms of accuracy, as defined in Eq. 10.
The Office dataset [41] comprises three different sets of images (Amazon, DSLR, Webcam) featuring 31 classes of objects. Fig. 5 depicts some images from the three different domains. For our experiments, we used the “fullytransductive” evaluation protocol proposed in [41], which means using all the available information on the source domain and having no labels at all for the target domain.
In addition to the results obtained using our MMD regularizer of Eq. 5, and for a fair comparison with [14], which achieves stateoftheart results on this dataset, we also report results obtained by replacing the MMD loss with one based on the domain confusion classifier advocated in [14]. We used the same architecture as in [14] for this classifier.
Fig. 6(a) illustrates the network architecture we used for this experiment. Each stream corresponds to the standard AlexNet CNN [29]. As in [44, 14], we start with the model pretrained on ImageNet and fine-tune it. However, instead of forcing the weights of both streams to be shared, we allow them to deviate as discussed in Section 3. To identify which layers should share their weights and which ones should not, we used the MMD-based criterion introduced in Section 4.1.2. In Fig. 6(b), we plot the MMD value as a function of the configuration on the Amazon → Webcam scenario, as we did for the drones in Fig. 3. In this case, not sharing the last two fully-connected layers achieves the lowest MMD value, and this is the configuration we use for our experiments on this dataset.

In Table 4, we compare our approach against other Domain Adaptation techniques on the three commonlyreported source/target pairs. It outperforms them on all the pairs. More importantly, the comparison against GRL [14] confirms that allowing the weights not to be shared increases accuracy.
Table 4: Accuracy on the Office benchmark.

Method | A → W | D → W | W → D | Average
GFK [19] | 0.214 | 0.691 | 0.650 | 0.518
DLID [7] | 0.519 | 0.782 | 0.899 | 0.733
DDC [44] | 0.605 | 0.948 | 0.985 | 0.846
DAN [34] | 0.645 | 0.952 | 0.986 | 0.861
DRCN [16] | 0.687 | 0.964 | 0.990 | 0.880
GRL [14] | 0.730 | 0.964 | 0.992 | 0.895
Ours (+ DDC) | 0.630 | 0.961 | 0.992 | 0.861
Ours (+ GRL) | 0.760 | 0.967 | 0.996 | 0.908
4.3 Domain Adaptation on MNIST-USPS
The MNIST [31] and USPS [25] datasets for digit classification both feature 10 different classes of images corresponding to the 10 digits. They have recently been employed for the task of Domain Adaptation [13].
For this experiment, we used the evaluation protocol of [13], which involves randomly selecting a subset of images from MNIST and a subset from USPS and using them interchangeably as source and target domains. As in [13], we work in the unsupervised setting, and thus ignore the target domain labels at training time. Following [35], since the image patches in the USPS dataset are smaller than those of MNIST, we rescaled the MNIST images to the same size and normalized the pixel intensities. For this experiment, we relied on the standard CNN architecture of [31] and employed our MMD-based criterion to determine which layers should not share their weights. We found that allowing all layers of the network not to share their weights yielded the best performance.
In Table 5, we compare our approach with DDC [44] and with methods that do not rely on deep networks [38, 19, 12, 13]. Our method yields superior performance in all cases, which we believe to be due to its ability to adapt the feature representation to each domain, while still keeping these representations close to each other.
Table 5: Accuracy on MNIST-USPS.

Method | M → U | U → M | Avg.
PCA | 0.451 | 0.334 | 0.392
SA [12] | 0.486 | 0.222 | 0.354
GFK [19] | 0.346 | 0.226 | 0.286
TCA [38] | 0.408 | 0.274 | 0.341
SSTCA [38] | 0.406 | 0.222 | 0.314
TSL [42] | 0.435 | 0.341 | 0.388
JCSL [13] | 0.467 | 0.355 | 0.411
DDC [44] | 0.478 | 0.631 | 0.554
Ours | 0.607 | 0.673 | 0.640
4.4 Supervised Facial Pose Estimation
To demonstrate that our method can be used not only for classification or detection tasks but also for regression ones, we further evaluate it for pose estimation purposes. More specifically, the task we address consists of predicting the location of facial landmarks given image patches, such as those of Fig. 7. To this end, we train a regressor to predict a 10-D vector containing two floating-point coordinates for each of the five landmarks. As we did for drones, we use synthetic images, such as the ones shown in the top portion of Fig. 7, as our source domain and real ones, such as those shown at the bottom, as our target domain. Both datasets contain annotated images. We use all the synthetic samples but only a fraction of the real ones for training, and the remainder for testing. For more detail on these two datasets, we refer the interested reader to the supplementary material, where we also describe the architecture of the regressor we use.
Figure 7: Synthetic (source domain) and real (target domain) face images.
In Table 6, we compare our Domain Adaptation results to those of DDC [44] in terms of the percentage of correctly estimated landmarks (PCP score). A landmark is considered to be correctly estimated if it is found within a fixed pixel radius of the ground-truth location. Note that, again, by not sharing the weights, our approach outperforms DDC.
Table 6: PCP scores (%) for facial landmark estimation.

Landmark | Synthetic | DDC [44] | Ours
Right eye | 64.2 | 68.0 | 71.8
Left eye | 39.3 | 56.2 | 60.3
Nose | 56.3 | 64.1 | 64.5
Right mouth corner | 47.8 | 57.6 | 59.8
Left mouth corner | 42.3 | 55.5 | 57.7
Average | 50.0 | 60.3 | 62.8
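The PCP score reported above can be sketched as follows (the function name and the array layout are our own assumptions):

```python
import numpy as np

def pcp_score(pred, gt, radius):
    # Fraction of landmarks predicted within `radius` pixels of ground truth.
    # pred, gt: (num_faces, num_landmarks, 2) arrays of (x, y) coordinates.
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float((dist <= radius).mean())
```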
4.5 Discussion
In all the experiments reported above, allowing the weights not to be shared in some fraction of the layers of our twostream architecture boosts performance. This validates our initial hypothesis that explicitly modeling the domain shift is generally beneficial.
However, the optimal choice of which layers should or should not share their weights is application dependent. In the UAV case, allowing the weights in the first two layers to be different yields top performance, which we understand to mean that the domain shift is caused by lowlevel changes that are best handled in the early layers. By contrast, for the Office dataset, it is best to only allow the weights in the last two layers to differ. This network configuration was determined using Amazon and Webcam images, such as those shown in Fig. 5. Close examination of these images reveals that the differences between them are not simply due to lowlevel phenomena, such as illumination changes, but to more complex variations. It therefore seems reasonable that the higher layers of the network, which encode higherlevel information, should be domainspecific.
Fortunately, we have shown that the MMD provides us with an effective criterion to choose the right configuration. This makes our two-stream approach practical, even when no validation data is available.
5 Conclusion
In this paper, we have postulated that Deep Learning approaches to Domain Adaptation should not focus on learning features that are invariant to the domain shift, which makes them less discriminative. Instead, we should explicitly model the domain shift. To prove this, we have introduced a two-stream CNN architecture, where the weights of the streams may or may not be shared. To nonetheless encode the fact that both streams should be related, we encourage the non-shared weights to remain close to being linear transformations of each other by introducing an additional loss term.
Our experiments on very diverse datasets have clearly validated our hypothesis. Our approach consistently yields higher accuracy than networks that share all weights for the source and target data, both for classification and regression. In the future, we intend to study whether more complex weight transformations could help us further improve our results, with a particular focus on designing effective constraints for the parameters of these transformations.
References
 [1] M. Baktashmotlagh, M. Harandi, B. Lovell, and M. Salzmann. Unsupervised Domain Adaptation by Domain Invariant Projection. In International Conference on Computer Vision, 2013.
 [2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. SURF: Speeded Up Robust Features. Computer Vision and Image Understanding, 110(3):346–359, 2008.
 [3] C. Becker, M. Christoudias, and P. Fua. Non-Linear Domain Adaptation with Boosting. In Advances in Neural Information Processing Systems, 2013.
 [4] A. Bergamo and L. Torresani. Exploiting Weakly-Labeled Web Images to Improve Object Classification: A Domain Adaptation Approach. In Advances in Neural Information Processing Systems, 2010.
 [5] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan. Domain Separation Networks. arXiv Preprint, 2016.
 [6] R. Caseiro, J. Henriques, P. Martins, and J. Batista. Beyond the Shortest Path: Unsupervised Domain Adaptation by Sampling Subspaces Along the Spline Flow. In Conference on Computer Vision and Pattern Recognition, 2015.
 [7] S. Chopra, S. Balakrishnan, and R. Gopalan. DLID: Deep Learning for Domain Adaptation by Interpolating Between Domains. In International Conference on Machine Learning, 2013.
 [8] S. Chopra, R. Hadsell, and Y. LeCun. Learning a Similarity Metric Discriminatively, with Application to Face Verification. In Conference on Computer Vision and Pattern Recognition, 2005.
 [9] H. Daumé III and D. Marcu. Domain Adaptation for Statistical Classifiers. Journal of Artificial Intelligence Research, 26(1):101–126, 2006.
 [10] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In International Conference on Machine Learning, 2014.
 [11] L. Duan, I. Tsang, D. Xu, and S. Maybank. Domain Transfer SVM for Video Concept Detection. In Conference on Computer Vision and Pattern Recognition, pages 1375–1381, 2009.
 [12] B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised Visual Domain Adaptation Using Subspace Alignment. In International Conference on Computer Vision, 2013.
 [13] B. Fernando, T. Tommasi, and T. Tuytelaars. Joint CrossDomain Classification and Subspace Learning for Unsupervised Adaptation. Pattern Recognition Letters, 65:60–66, 2015.
 [14] Y. Ganin and V. Lempitsky. Unsupervised Domain Adaptation by Backpropagation. In International Conference on Machine Learning, 2015.
 [15] M. Ghifary, W. B. Kleijn, and M. Zhang. Domain Adaptive Neural Networks for Object Recognition. arXiv Preprint, 2014.
 [16] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li. Deep Reconstruction-Classification Networks for Unsupervised Domain Adaptation. arXiv Preprint, 2016.
 [17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. arXiv Preprint, 2013.
 [18] B. Gong, K. Grauman, and F. Sha. Connecting the Dots with Landmarks: Discriminatively Learning Domain-Invariant Features for Unsupervised Domain Adaptation. In International Conference on Machine Learning, 2013.
 [19] B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic Flow Kernel for Unsupervised Domain Adaptation. In Conference on Computer Vision and Pattern Recognition, 2012.
 [20] R. Gopalan, R. Li, and R. Chellappa. Domain Adaptation for Object Recognition: An Unsupervised Approach. In International Conference on Computer Vision, 2011.
 [21] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. Smola. A Kernel Method for the Two-Sample Problem. arXiv Preprint, 2008.
 [22] A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf. Covariate Shift by Kernel Mean Matching. Journal of the Royal Statistical Society, 3(4):5–13, 2009.
 [23] G. Hinton, S. Osindero, and Y. Teh. A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, 18(7):1527–1554, 2006.
 [24] J. Huang, A. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf. Correcting Sample Selection Bias by Unlabeled Data. In Advances in Neural Information Processing Systems, 2006.
 [25] J. Hull. A Database for Handwritten Text Recognition Research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16:550–554, 1994.
 [26] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial Transformer Networks. In Advances in Neural Information Processing Systems, 2015.
 [27] J. Jiang. A Literature Survey on Domain Adaptation of Statistical Classifiers. Technical report, University of Illinois at Urbana-Champaign, 2008.
 [28] J. Jin, K. Fu, and C. Zhang. Traffic Sign Recognition with Hinge Loss Trained Convolutional Neural Networks. IEEE Transactions on Intelligent Transportation Systems, 15:1991–2000, 2014.
 [29] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, 2012.
 [30] B. Kulis, K. Saenko, and T. Darrell. What You Saw is Not What You Get: Domain Adaptation Using Asymmetric Kernel Transforms. In Conference on Computer Vision and Pattern Recognition, 2011.
 [31] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [32] Y. LeCun, L. Bottou, G. Orr, and K. Müller. Neural Networks: Tricks of the Trade, chapter Efficient Backprop. Springer, 1998.
 [33] W. Li, L. Duan, D. Xu, and I. W. Tsang. Learning with Augmented Features for Supervised and Semi-Supervised Heterogeneous Domain Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(6):1134–1148, 2014.
 [34] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning Transferable Features with Deep Adaptation Networks. In International Conference on Machine Learning, 2015.
 [35] M. Long, J. Wang, G. Ding, J. Sun, and P. Yu. Transfer Feature Learning with Joint Distribution Adaptation. In International Conference on Computer Vision, pages 2200–2207, 2013.
 [36] K. Muandet, D. Balduzzi, and B. Schölkopf. Domain Generalization via Invariant Feature Representation. In International Conference on Machine Learning, 2013.
 [37] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and Transferring MidLevel Image Representations Using Convolutional Neural Networks. In Conference on Computer Vision and Pattern Recognition, 2014.
 [38] S. Pan, I. Tsang, J. Kwok, and Q. Yang. Domain Adaptation via Transfer Component Analysis. In International Joint Conference on Artificial Intelligence, pages 1187–1192, 2009.
 [39] S. Pan and Q. Yang. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
 [40] A. Rozantsev, V. Lepetit, and P. Fua. On Rendering Synthetic Images for Training an Object Detector. Computer Vision and Image Understanding, 137:24–37, 2015.
 [41] K. Saenko, B. Kulis, M. Fritz, and T. Darrell. Adapting Visual Category Models to New Domains. In European Conference on Computer Vision, pages 213–226, 2010.
 [42] S. Si, D. Tao, and B. Geng. Bregman Divergence-Based Regularization for Transfer Subspace Learning. IEEE Transactions on Knowledge and Data Engineering, 22(7):929–942, 2010.
 [43] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko. Simultaneous Deep Transfer Across Domains and Tasks. In International Conference on Computer Vision, 2015.
 [44] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell. Deep Domain Confusion: Maximizing for Domain Invariance. arXiv Preprint, 2014.
 [45] M. D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv Preprint, 2012.