DTLET: Deep Transfer Learning by Exploring where to Transfer
Abstract
Previous transfer learning methods based on deep networks assume that knowledge should be transferred between the same hidden layers of the source domain and the target domain. This assumption does not always hold, especially when the data from the two domains are heterogeneous with different resolutions. In such cases, the most suitable numbers of layers for the source domain data and the target domain data would differ, so high-level knowledge from the source domain would be transferred to the wrong layer of the target domain. Based on this observation, the "where to transfer" problem proposed in this paper constitutes a novel research frontier. We propose a new mathematical model named DTLET to solve this heterogeneous transfer learning problem. To select the best matching of layers for transferring knowledge, we define a specific loss function that estimates the correspondence between high-level features of data in the source domain and the target domain. To verify the proposed cross-layer model, experiments on two cross-domain recognition/classification tasks are conducted, and the superior results demonstrate the necessity of searching for layer correspondence.
Introduction
Transfer learning, or domain adaptation, aims at mining potential information in an auxiliary source domain to assist the learning task in a target domain where only scarce labeled data with prior knowledge exist [Pan and Yang2010]. Without the help of related source domain data, learning tasks such as image classification or recognition would fail for lack of sufficient pre-existing labeled data. For most big data problems, labeled data are in high demand yet rarely sufficient, as the labeling process is tedious and laborious. Therefore, making better use of auxiliary source domain data through transfer learning methods has attracted researchers' attention.
It should be noted that directly applying labeled source domain data to a new scene in the target domain results in poor performance due to the semantic gap between the two domains, even when they represent the same objects [Y. et al.2011][Duan, Xu, and Tsang2012]. The semantic gap can result from different acquisition conditions (illumination or view angle) and the use of different cameras or sensors. Transfer learning methods are proposed to overcome this distribution divergence or feature bias [Dai et al.2009][Liu, Yang, and Tao2017][Wang et al.2017]. Traditionally, these methods adopt a linear or non-linear transformation with a kernel function to learn a common subspace on which the gap is bridged [Yan et al.2017]. Recent work has shown that the features learned on such a common subspace are ineffective; therefore, deep-learning-based models have been introduced for their power in high-level feature representation.
Current deep-learning-based transfer learning comprises two research branches: what knowledge to transfer and how to transfer knowledge [Li et al.2015]. For what knowledge to transfer, researchers mainly concentrate on instance-based transfer learning and parameter transfer approaches. Instance-based transfer learning methods assume that only certain parts of the source data can be reused for learning in the target domain, via re-weighting [Gong et al.2016]. In parameter transfer approaches, researchers mainly try to identify the pivotal parameters of the deep network whose transfer accelerates the process. For how to transfer knowledge, different deep networks are introduced to complete the transfer learning process. However, in both research areas, the correct correspondence of layers is ignored.
For the what-knowledge-to-transfer problem, the transferred content might even be negative or wrong. A fundamental issue in current transfer learning work is negative transfer [Tan et al.2017]. If knowledge from the source domain is transferred to the wrong layers of the target domain, the transferred knowledge is error-prone, and with this wrong prior information added, performance on the target domain data can degrade. For the how-to-transfer-knowledge problem, since the two deep networks for the source domain data and the target domain data must have the same number of layers, the two models cannot both be optimal at the same time. This situation is especially important for cross-resolution heterogeneous transfer: data with higher resolution may need more max-pooling layers than data with lower resolution, and thus more neural network layers. Based on the above observations, we propose a novel research topic: where to transfer. In this work, the numbers of layers in the two domains need not be the same, and the optimal matching of layers is found by the newly proposed objective function. With the best parameters from the source domain transferred to the right layer of the target domain, the performance of the target domain learning task can be improved.
The proposed work is named Deep Transfer Learning by Exploring where to Transfer (DTLET) and is based on stacked auto-encoders [Zhuang et al.2018]. A detailed flowchart is shown in Fig. 1. The main contributions are summarized as follows.

This paper for the first time introduces the where-to-transfer problem. The deep networks of the source domain and the target domain no longer need to share the same parameter settings, and cross-layer transfer learning is proposed.

We propose a new principle for finding the correspondence between the neural networks of the source domain and the target domain by defining a new unified objective loss function. By optimizing this objective function, both the best settings of the two deep networks and the correspondence relationship can be determined.
Related Work
Deep learning intends to learn non-linear representations of raw data to reveal hidden features [Long et al.2016]. However, a large amount of labeled data is required to avoid over-fitting during the feature learning process. To this end, transfer learning has been introduced to augment the data with prior knowledge: by aligning data from different domains in a high-level correlation space, information can be shared across domains. To find this correlation space, many deep transfer learning frameworks have been proposed in recent years, the main motivation being to bridge the semantic gap between the deep neural networks of the source domain and the target domain. However, due to the complexity of transfer learning, some transfer mechanisms still lack a satisfying interpretation. Based on this consideration, quite a few interesting ideas have been generated. To determine which domain should serve as source and which as target, Carlucci et al. [Carlucci, Porzi, and Caput2017] propose to automatically align the source and target domains. To boost transfer efficiency and find extra profit during the transfer process, deep mutual learning [Zhang et al.2018] has been proposed to transfer knowledge bidirectionally. The function of each layer in transfer learning is explored in [Collier, DiBiano, and Mukhopadhyay2018]. Transfer learning with unequal classes and with unequal data volumes is examined in [Redko et al.2018] and [Bernico, Li, and Dingchao2018] respectively. However, all the above works still only address the what-knowledge-to-transfer and how-to-transfer-knowledge problems; they ignore the matching mechanism between layers of the deep networks of the source domain and the target domain. We name this problem, and our solution to it, DTLET: Deep Transfer Learning by Exploring Where to Transfer. In this work, we adopt the stacked denoising auto-encoder (SDA) as the baseline deep network for transfer learning.
Glorot et al. for the first time employed stacked denoising auto-encoders to learn homogeneous features in a joint space for sentiment classification [Glorot, Bordes, and Bengio2011]. The computational complexity was further reduced by Chen et al. with the marginalized stacked denoising auto-encoder (mSDA) [Yu et al.2018], in which features of the word vectors are corrupted to zero and marginalized out in expectation to optimize the representation. By matching the marginal as well as the conditional distributions, Zhang et al. and Zhuang et al. also developed SDA-based homogeneous transfer learning frameworks [Zhuang et al.2015][Zhang et al.2015]. For the heterogeneous case, Zhou et al. [Zhou, Tsang, and Yan2014] proposed an extension of mSDA that bridges the semantic gap by finding cross-domain corresponding instances in advance. The Google Brain team recently introduced generative adversarial networks into SDA and proposed Wasserstein Auto-Encoders [Tolstikhin et al.2018] to generate samples of better quality in the target domain. SDA thus shows high potential, and our work also chooses SDA as the basic neural network for the where-to-transfer problem.
Deep Mapping Mechanism
The general framework of the deep mapping mechanism can be summarized in three steps: network setting up, correlation maximization, and layer matching. We first introduce the deep mapping mechanism by defining the variables.
The samples in the source domain are denoted as $X_S$, among which the labeled data are further denoted as $X_S^l$; they are used to supervise the classification process. In the target domain, the samples are denoted as $X_T$. The co-occurrence data [Yang et al.2016] (data in the source domain and the target domain belonging to the same classes but carrying no prior label information) are denoted as $X_S^c$ in the source domain and $X_T^c$ in the target domain. They are further jointly represented by $X^c = \{X_S^c, X_T^c\}$, which is used to supervise the transfer learning process. The parameters of the deep network are denoted by $\theta_S$ in the source domain and by $\theta_T$ in the target domain.
The matching of layers is denoted by $M = \{(p_1, q_1), \ldots, (p_m, q_m)\}$, in which $a$ represents the total number of layers for the source domain data, $b$ represents the total number of layers for the target domain data, and $m$ is the number of matched layer pairs. Since the first layer holds the original data and is not used for transfer, $m$ is compared with $a-1$ and $b-1$ instead of $a$ and $b$: if $m = \min(a-1, b-1)$, we define the transfer process as full-rank transfer learning; otherwise, if $m < \min(a-1, b-1)$, we define it as non-full-rank transfer learning.
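Assuming full rank means matching every transferable layer pair, i.e. $m = \min(a-1, b-1)$, the distinction can be sketched as follows (the helper name and the pair encoding are illustrative, not part of the paper):

```python
def is_full_rank(matching, a, b):
    """True if the matching uses every transferable layer pair.

    `matching` is a list of (source_layer, target_layer) pairs; `a` and
    `b` are the total layer counts of the two networks. Layer 0 holds
    the raw input and is never transferred, so at most min(a - 1, b - 1)
    pairs can be matched.
    """
    return len(matching) == min(a - 1, b - 1)

# A 4-layer source network and a 3-layer target network allow at most
# min(3, 2) = 2 matched pairs.
print(is_full_rank([(2, 1), (3, 2)], 4, 3))  # full-rank transfer
print(is_full_rank([(3, 2)], 4, 3))          # non-full-rank transfer
```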
The common subspace is represented by $\mathcal{C}$ and the final classifier by $f$. The labeled data $X_S^l$ from the source domain are used to train $f$, which is then applied to predict the labels of $X_T$.
Network setting up
The stacked auto-encoder (SAE) is first employed in the source domain and the target domain to obtain the hidden feature representations $h_S^{(k)}$ and $h_T^{(k)}$ of the original data, as shown in eq. (1) and eq. (2):

(1)  $h_S^{(k)} = f\big(W_S^{(k)} h_S^{(k-1)} + b_S^{(k)}\big)$

(2)  $h_T^{(k)} = f\big(W_T^{(k)} h_T^{(k-1)} + b_T^{(k)}\big)$

Here $W_S^{(k)}$ and $b_S^{(k)}$ are parameters of the source-domain network $\theta_S$, and $W_T^{(k)}$ and $b_T^{(k)}$ are parameters of the target-domain network $\theta_T$; $f$ is the non-linear activation, $h^{(0)}$ is the original input, and $h_S^{(k)}$ and $h_T^{(k)}$ denote the $k$-th hidden layers in the source domain and the target domain respectively. The two neural networks are first initialized by the above functions.
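The layer-wise encoding of eqs. (1) and (2) can be sketched as below; the sigmoid activation, weight scale, and random initialization are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode_stack(x, weights, biases):
    """Hidden representations h^(1)..h^(K) of a stacked auto-encoder:
    h^(k) = f(W^(k) h^(k-1) + b^(k)), with h^(0) = x (eqs. (1)-(2))."""
    hiddens, h = [], x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)
        hiddens.append(h)
    return hiddens

# Illustrative source-domain stack with 240 -> 170 -> 100 -> 30 neurons.
rng = np.random.default_rng(0)
sizes = [240, 170, 100, 30]
Ws = [0.01 * rng.standard_normal((m, n)) for n, m in zip(sizes, sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
hs = encode_stack(rng.standard_normal(240), Ws, bs)
print([h.size for h in hs])  # [170, 100, 30]
```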
Correlation maximization
To set up the initial relationship between the two neural networks, we resort to Canonical Correlation Analysis (CCA), which can maximize the correlation between two domains [Hardoon, Szedmak, and ShaweTaylor2004]. A multi-layer correlation model based on the above deep networks is then constructed. Both $h_S$ and $h_T$ are projected by CCA to a common subspace on which a unified representation is generated; the projection matrices obtained by CCA are denoted as $U$ and $V$. To find the optimal neural networks in the source domain and the target domain, we have two general objectives: to minimize the reconstruction errors of the two networks, and to maximize the correlation between them. To achieve the second objective, we need, on the one hand, to find the best layer matching and, on the other hand, to maximize the correlation between corresponding layers. To this end, we minimize the final objective function
(3)  $L_M = \dfrac{L_S + L_T}{\Gamma_M}$

In this function, the objective is defined as $L_M$, whose value depends on the particular matching $M$; we generate the best matching by finding the minimum $L_M$. In $L_M$, $L_S$ and $L_T$ represent the reconstruction errors of the data in the source domain and the target domain, defined as follows:

(4)  $L_S = \sum_i \big\| x_{S,i} - \hat{x}_{S,i} \big\|^2$

(5)  $L_T = \sum_i \big\| x_{T,i} - \hat{x}_{T,i} \big\|^2$

where $\hat{x}$ denotes the reconstruction produced by the corresponding auto-encoder. The third term, $\Gamma_M$, represents the cross-domain correlation after projection by CCA, which we want to maximize. This term is defined in eq. (6):

(6)  $\Gamma_M = \sum_{(p,q) \in M} \operatorname{corr}\big(U_{(p,q)}^{\top} h_S^{(p)},\; V_{(p,q)}^{\top} h_T^{(q)}\big)$

where $U_{(p,q)}$ and $V_{(p,q)}$ are the CCA projections for the matched pair $(p, q)$ and $\operatorname{corr}(\cdot,\cdot)$ is the empirical correlation computed over the co-occurrence data $X^c$. By minimizing eq. (3), we can collectively train the two neural networks $\theta_S$ and $\theta_T$.
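The per-pair correlation summed in eq. (6) can be computed empirically as follows; the function name and the toy check are illustrative:

```python
import numpy as np

def corr_term(HS, HT, u, v):
    """Empirical correlation corr(u^T h_S, v^T h_T) over co-occurrence
    samples (one per row) -- the quantity summed over matched layer
    pairs in the correlation term of eq. (6)."""
    a = (HS - HS.mean(axis=0)) @ u
    b = (HT - HT.mean(axis=0)) @ v
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
H = rng.standard_normal((100, 6))
w = rng.standard_normal(6)
print(round(corr_term(H, H, w, w), 6))   # identical projections:  1.0
print(round(corr_term(H, -H, w, w), 6))  # anti-correlated ones:  -1.0
```

A larger value of this term shrinks the loss of eq. (3), which is how the objective rewards well-correlated matched layers.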
Layer matching
After constructing the multi-layer networks by eq. (3), we further need to find the best matching of layers. As different layer matchings generate different loss values in eq. (3), we define the objective function for layer matching as

(7)  $M^{*} = \operatorname*{arg\,min}_{M} L_M$

As the exact solution of this search is NP-hard in general, we solve the problem exhaustively. It is also found that, for regular data, stacked auto-encoders deeper than 5 layers do not generate better results, so we restrict the number of layers in both domains to at most 5.
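The exhaustive search space can be enumerated as below, under the assumption that matched pairs preserve layer order and that layer 0 (the raw input) is never matched; all names are illustrative:

```python
from itertools import combinations

def candidate_matchings(a, b):
    """All order-preserving matchings between hidden layers 1..a-1 of
    the source network and 1..b-1 of the target network. Each matching
    is a tuple of (source_layer, target_layer) pairs; since both index
    tuples are increasing, matched pairs never cross."""
    out = []
    for m in range(1, min(a, b)):
        for s in combinations(range(1, a), m):
            for t in combinations(range(1, b), m):
                out.append(tuple(zip(s, t)))
    return out

# 4-layer source / 3-layer target: 3*2 one-pair + 3*1 two-pair matchings.
cands = candidate_matchings(4, 3)
print(len(cands))  # 9
```

Each candidate is then scored with the loss of eq. (3) and the minimizer is kept; capping both networks at 5 layers keeps this enumeration small.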
Model Training
Here we first optimize eq. (3). As the equation is not jointly convex in all the parameters $W$, $b$, $U$, and $V$, and as the projections $U$ and $V$ are not coupled with $W$ and $b$ once the hidden representations are fixed, we introduce a two-step iterative optimization.
Step 1: Updating $U$ and $V$ with $\theta_S$ and $\theta_T$ fixed
In eq. (3), the optimization of $U$ and $V$ is only related to the denominator term $\Gamma_M$. The optimization for each matched pair (suppose layer $p$ in the source domain corresponds to layer $q$ in the target domain) can be formulated as

(8)  $\max_{u, v}\; \operatorname{corr}\big(u^{\top} h_S^{(p)},\; v^{\top} h_T^{(q)}\big)$

As the correlation is invariant to the scaling of $u$ and $v$ [Hardoon, Szedmak, and ShaweTaylor2004], we can rewrite eq. (8) as

(9)  $\max_{u, v}\; u^{\top} C_{st} v \quad \text{s.t.}\quad u^{\top} C_{ss} u = 1,\; v^{\top} C_{tt} v = 1$

where $C_{ss}$, $C_{tt}$, and $C_{st}$ are the (cross-)covariance matrices of $h_S^{(p)}$ and $h_T^{(q)}$. This is a typical constrained problem, which can be reformulated as a series of unconstrained minimization problems and readily solved with Lagrangian multipliers.
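The constrained problem of eq. (9) admits the standard CCA closed-form solution: the Lagrangian stationarity conditions reduce to an SVD of the whitened cross-covariance. A numpy sketch, where the regularization and names are illustrative assumptions:

```python
import numpy as np

def cca_pair(HS, HT, reg=1e-6):
    """Top canonical pair (u, v): maximize u^T C_st v subject to
    u^T C_ss u = v^T C_tt v = 1, via SVD of C_ss^{-1/2} C_st C_tt^{-1/2}."""
    HS = HS - HS.mean(axis=0)
    HT = HT - HT.mean(axis=0)
    n = len(HS)
    Css = HS.T @ HS / n + reg * np.eye(HS.shape[1])
    Ctt = HT.T @ HT / n + reg * np.eye(HT.shape[1])
    Cst = HS.T @ HT / n

    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Wl, Wr = inv_sqrt(Css), inv_sqrt(Ctt)
    U, s, Vt = np.linalg.svd(Wl @ Cst @ Wr)
    return Wl @ U[:, 0], Wr @ Vt[0], float(s[0])

# Linearly related views should give a top canonical correlation near 1.
rng = np.random.default_rng(0)
HS = rng.standard_normal((300, 5))
HT = HS @ rng.standard_normal((5, 4))
u, v, r = cca_pair(HS, HT)
print(r > 0.99)  # True
```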
Step 2: Updating $\theta_S$ and $\theta_T$ with $U$ and $V$ fixed
As $\theta_S$ and $\theta_T$ are mutually independent and take the same form, we only demonstrate the solution for $W_S$ in the source domain (the solution for $W_T$ can be derived similarly). Since dividing the reconstruction error by the correlation term plays the same role in the objective as subtracting the correlation term, we reformulate the objective function as

(10)  $J = L_S - \Gamma_M$
Here we apply the gradient descent method to adjust the parameters as

(11)  $W_S^{(k)} \leftarrow W_S^{(k)} - \eta \dfrac{\partial J}{\partial W_S^{(k)}}$

(12)  $b_S^{(k)} \leftarrow b_S^{(k)} - \eta \dfrac{\partial J}{\partial b_S^{(k)}}$

in which

(13)  $\dfrac{\partial J}{\partial W_S^{(k)}} = \Big(\delta^{(k)} \odot f'\big(z_S^{(k)}\big)\Big)\big(h_S^{(k-1)}\big)^{\top}$

(14)  $\delta^{(k)} = \dfrac{\partial L_S}{\partial h_S^{(k)}} - \dfrac{\partial \Gamma_M}{\partial h_S^{(k)}}$

(15)  $z_S^{(k)} = W_S^{(k)} h_S^{(k-1)} + b_S^{(k)}$

The operator $\odot$ here stands for the element-wise (dot) product, and $\eta$ is the learning rate. The same optimization process works for $W_T$ in the target domain.
After these two optimization steps for each layer, the two whole networks (the source-domain network and the target-domain network) are further fine-tuned by back-propagation. The forward and backward propagations iterate until convergence.
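The gradient step of eqs. (11) and (12) can be sanity-checked on the reconstruction term alone: one step along the negative gradient must decrease $L_S$. The single-layer auto-encoder, linear decoder, learning rate, and the numerical gradient (standing in for the analytic gradients of eqs. (13)-(15)) are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def recon_loss(X, W, b, W2, b2):
    """Reconstruction error of one encoder layer with a linear decoder."""
    H = sigmoid(X @ W.T + b)
    return 0.5 * np.mean(np.sum((X - (H @ W2.T + b2)) ** 2, axis=1))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
W = 0.1 * rng.standard_normal((4, 8)); b = np.zeros(4)
W2 = 0.1 * rng.standard_normal((8, 4)); b2 = np.zeros(8)

# Central-difference gradient of the loss with respect to W.
eps, lr = 1e-5, 0.1
G = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps; Wm[i, j] -= eps
        G[i, j] = (recon_loss(X, Wp, b, W2, b2)
                   - recon_loss(X, Wm, b, W2, b2)) / (2 * eps)

before = recon_loss(X, W, b, W2, b2)
after = recon_loss(X, W - lr * G, b, W2, b2)
print(after < before)
```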
Optimization of the layer matching $M$
We finally obtain the minimized $L_M$ by the above procedures. In a matching $M$, layer $p$ in the source domain corresponds to layer $q$ in the target domain. As we restrict both networks to no more than 5 layers (including the original data layer, which is not used for transfer), the theoretical number of candidate matchings is bounded (some layers can be left vacant with no matching). However, in our experiments we heuristically find that the number of matched layers should be in direct proportion to the resolution of the images. This observation saves a lot of training time.
The training process is finally summarized in Alg. (1).
Classification on common semantic subspace
The final classification is performed on the common subspace $\mathcal{C}$. The target domain data $X_T$ and the labeled source data $X_S^l$ are both projected onto the common subspace by the correlation coefficients $U$ and $V$. The standard SVM algorithm is applied on $\mathcal{C}$: the classifier is trained on the projected $X_S^l$ and then applied to the projected $X_T$.
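A sketch of this final stage: both domains are projected with the learned coefficients, and a classifier trained on the projected labeled source data labels the target data. A nearest-centroid rule stands in for the paper's SVM here, and all names are illustrative:

```python
import numpy as np

def classify_on_subspace(HS_lab, y_lab, HT_test, U, V):
    """Project labeled source features with U and target features with V
    onto the shared subspace, then classify the target samples. A
    nearest-centroid rule stands in for the SVM used in the paper."""
    ZS = HS_lab @ U          # labeled source data on the common subspace
    ZT = HT_test @ V         # target data on the common subspace
    classes = np.unique(y_lab)
    centroids = np.stack([ZS[y_lab == c].mean(axis=0) for c in classes])
    d = np.linalg.norm(ZT[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

# Well-separated toy data: two classes around -2 and +2 in each feature.
rng = np.random.default_rng(3)
y = np.repeat([0, 1], 40)
HS_lab = np.vstack([rng.normal(-2, 0.3, (40, 3)), rng.normal(2, 0.3, (40, 3))])
HT_test = np.vstack([rng.normal(-2, 0.3, (40, 3)), rng.normal(2, 0.3, (40, 3))])
U = V = np.eye(3)[:, :2]     # toy projections onto a 2-D subspace
pred = classify_on_subspace(HS_lab, y, HT_test, U, V)
print((pred == y).mean())    # 1.0 on this well-separated toy data
```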
Experiments
We evaluate our DTLET framework on two cross-domain recognition tasks: handwritten digit recognition and text-to-image classification.
Experimental dataset descriptions
Handwritten digit recognition: For this task, we conduct the experiment on the Multi Features Dataset from the UCI machine learning repository. This dataset consists of features of handwritten numerals (0-9, 10 classes in total) extracted from a collection of Dutch utility maps. Six feature sets exist for each numeral, and we choose the two most popular, the 216-D profile correlations and the 240-D pixel averages in 2×3 windows, to complete the transfer-learning-based recognition task.
Text-to-image classification: For this task, we make use of the NUS-WIDE dataset. In our experiment, the images are represented with 500-D visual features and annotated with 1000-D text tags from Flickr. Ten categories of instances are included in this classification task: birds, building, cars, cat, dog, fish, flowers, horses, mountain, and plane.
Comparative methods and evaluation
As the proposed DTLET framework has four main components (deep learning, CCA, layer matching, and an SVM classifier), we first select three baseline methods: CCA-SVM [Hardoon, Szedmak, and ShaweTaylor2004], kernelized CCA-SVM (KCCA-SVM) [Mehrkanoon and Suykens2018], and deep CCA-SVM (DCCA-SVM) [Yu et al.2018]. We also conduct an experiment without layer matching (the numbers of layers are the same in the source and the target domains) while all other parameters are the same as in the proposed DTLET; we name this variant None-DTLET. The final comparison method is the representative dufttDTNs method [Tang et al.2016], which is, to date, the heterogeneous transfer learning method with the best performance.
For the deep-network-based methods, DCCA-SVM, dufttDTNs, and None-DTLET all use 4 layers for both the source domain and the target domain data, as we find that more or fewer layers generate worse performance.
Finally, as the evaluation metric, we report the classification accuracies on the target domain data over the two pairs of datasets.
Task 1: Handwritten digit recognition
In the first experiment, we conduct our study on handwritten digit recognition. The source domain data are the 240-D pixel averages in 2×3 windows, while the target domain data are the 216-D profile correlations. As there are 10 classes in total, we complete 45 ($C_{10}^{2}$) binary classification tasks; for each category, the accuracy is the average over its 9 binary classification tasks. We use 60% of the data as co-occurrence data to complete the transfer learning process and find the common subspace, 20% of the labeled samples in the source domain as training samples, and the remaining samples in the target domain as testing samples to complete the classification process. The experiments are repeated 100 times with 100 sets of randomly chosen training and testing data to avoid data bias [Tommasi et al.2012], and the final accuracy is the average over the 100 repetitions. This data setting applies to all methods under comparison.
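The data partition used throughout the experiments (60% co-occurrence, 20% labeled source training, the rest held out for target testing, redrawn 100 times) can be sketched as follows; the function name and single-pool indexing are illustrative simplifications:

```python
import numpy as np

def split_indices(n, rng):
    """One random split following the paper's protocol: 60% co-occurrence
    data (drives the transfer), 20% labeled source training data, and
    the remaining samples as target-domain test data."""
    idx = rng.permutation(n)
    n_co, n_tr = int(0.6 * n), int(0.2 * n)
    return idx[:n_co], idx[n_co:n_co + n_tr], idx[n_co + n_tr:]

rng = np.random.default_rng(0)
co, train, test = split_indices(200, rng)
print(len(co), len(train), len(test))  # 120 40 40
```

Repeating this split with 100 different seeds and averaging the resulting accuracies reproduces the bias-avoidance protocol described above.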
For the deep networks, the numbers of neurons of the 4-layer networks are 240-170-100-30 for the source domain data and 216-154-92-30 for the target domain data; this setting is used by all comparison methods. For the proposed DTLET, the two layer matchings with the lowest loss after 20 iterations are denoted $M_1$ and $M_2$. The numbers of neurons for $M_1$ are 240-170-100-30 for the source domain data and 216-123-30 for the target domain data; the numbers of neurons for $M_2$ are 240-185-130-75-30 for the source domain data and 216-154-92-30 for the target domain data. The average objective function losses over the 45 binary classification tasks for these two matchings are 0.856 and 0.832 respectively. One-against-one SVM is applied for the final classification. The average classification accuracies of the 10 categories are shown in Tab. 1, and the matching correspondence is detailed in Fig. 2.
numeral  CCA-SVM  KCCA-SVM  DCCA-SVM  dufttDTNs  None-DTLET  DTLET ($M_1$)  DTLET ($M_2$)
0  0.750  0.804  0.961  0.972  0.983  0.989  0.984
1  0.740  0.767  0.943  0.956  0.964  0.976  0.982
2  0.780  0.812  0.955  0.972  0.979  0.980  0.989
3  0.748  0.790  0.945  0.956  0.966  0.976  0.975
4  0.752  0.799  0.956  0.969  0.980  0.987  0.983
5  0.728  0.762  0.938  0.949  0.958  0.971  0.977
6  0.755  0.770  0.958  0.966  0.978  0.988  0.986
7  0.775  0.797  0.962  0.968  0.978  0.975  0.985
8  0.764  0.793  0.948  0.954  0.965  0.968  0.975
9  0.754  0.781  0.944  0.958  0.970  0.976  0.961
As can be found in Tab. 1, the best performances (highlighted) all come from the DTLET framework. However, the best performances for different categories do not come from the same layer matching. Overall, $M_1$ and $M_2$ are the best two layer matchings compared with other settings. Based on these results, we heuristically conclude that the best layer-matching ratio (5/4 or 4/3) is roughly proportional to the dimension ratio of the original data (240/216). However, more matched layers do not guarantee better performance: for numerals "1", "2", "5", "7", and "8", the DTLET variant with 2 matched layers outperforms the variant with 3 matched layers.
Task 2: Text-to-image classification
In the second experiment, we conduct our study on text-to-image classification. The source domain data are the 1000-D text features, while the target domain data are the 500-D image features. As there are 10 classes in total, we complete 45 ($C_{10}^{2}$) binary classification tasks. We still use 60% of the data as co-occurrence data [Yang et al.2016], 20% of the labeled samples in the source domain as training samples, and the remaining samples in the target domain as testing samples. The same data setting as in Task 1 applies to all methods under comparison.
For the deep networks, the numbers of neurons of the 4-layer networks are 1000-750-500-200 for the source domain data and 500-400-300-200 for the target domain data; this setting is used by all comparison methods. For the proposed DTLET, the three layer matchings with the lowest loss after 20 iterations are denoted $M_1$, $M_2$, and $M_3$ (non-full rank). The average objective function losses over the 45 binary classification tasks for these matchings are 3.231, 3.443, and 3.368. The numbers of neurons for $M_1$ are 1000-800-600-400-200 for the source domain data and 500-350-200 for the target domain data; the numbers of neurons for both $M_2$ and $M_3$ are 1000-750-500-200 for the source domain data and 500-400-300-200 for the target domain data. As the matching principle also influences transfer performance, we present $M_1$ under two different matching principles, as shown in Fig. 3 (the average objective function losses for the two principles are 3.231 and 3.455), where all the detailed layer matchings are described.
categories  CCA-SVM  KCCA-SVM  DCCA-SVM  dufttDTNs  None-DTLET  DTLET(1)  DTLET(2)  DTLET(3)  DTLET(4)
birds  0.690  0.723  0.784  0.770  0.796  0.825  0.830  0.848  0.825
building  0.706  0.741  0.810  0.783  0.816  0.881  0.838  0.881  0.891
cars  0.702  0.731  0.803  0.773  0.812  0.832  0.827  0.867  0.853
cat  0.692  0.731  0.797  0.766  0.806  0.868  0.873  0.859  0.873
dog  0.687  0.726  0.798  0.765  0.805  0.847  0.847  0.863  0.823
fish  0.674  0.713  0.773  0.752  0.781  0.848  0.834  0.852  0.839
flowers  0.698  0.733  0.799  0.783  0.805  0.863  0.844  0.844  0.875
horses  0.700  0.736  0.802  0.775  0.808  0.841  0.812  0.841  0.831
mountain  0.717  0.748  0.816  0.786  0.827  0.825  0.813  0.821  0.831
plane  0.716  0.747  0.824  0.787  0.828  0.810  0.832  0.832  0.825
average  0.698  0.733  0.801  0.774  0.808  0.844  0.833  0.851  0.847
For this task, as the overall accuracies are generally lower than in Task 1, we compare more settings for this cross-layer matching task. We first verify the effectiveness of the DTLET framework: the DTLET variants generally reach around 85% accuracy, while the comparison methods generally stay below 80%. This observation supports the conclusion that finding the appropriate layer matching is essential. The second comparison is between the full-rank and non-full-rank frameworks. As can be found in the table, a non-full-rank matching actually achieves the highest overall accuracy, although the other non-full-rank DTLET variants do not perform as well. This gives us a hint that full-rank transfer is not always best, since negative transfer can degrade the performance; nevertheless, full-rank transfer is generally good, although not optimal. The third comparison is between the same matching under different matching principles. We present two different matching principles and find that the performances vary, with case 1 performing better than case 2. This result tells us that continuous transfer might be better than discrete transfer: in case 1 the transfer uses the last two layers of both domains, while in case 2 the transfer is conducted at layer 3 and layer 5 of the source domain.
Comparing specific categories, we find that categories with large semantic differences from the other categories achieve higher accuracy. For categories that are hard to classify, such as "birds" and "plane", the accuracies remain low even when DTLET is introduced. This supports the conclusion that DTLET can only improve the transfer process, which in turn helps the subsequent classification; the classification accuracy itself still depends on the semantic differences between the categories.
We also point out that the relationship between the average objective function loss and the classification accuracy is not strictly positively correlated: the matching with the highest overall classification accuracy does not have the lowest average objective function loss. Based on this observation, the lowest average objective function loss only guarantees the best transfer learning result, i.e., an optimal common subspace. On this common subspace, the data projected from the target domain are classified, and these classification results are also influenced by the classifier as well as by the randomly chosen training samples projected from the source domain. We therefore conclude that an optimal transfer learning result only guarantees a good classification performance, while the exact classification accuracy is also influenced by the classification settings.
Parameter sensitivity
In this section, we study the effect of different parameters in our networks. We point out that, whatever the layer matching is, the last layers of the two neural networks from the source domain and the target domain must be correlated to construct the common subspace. The number of neurons in this last layer also affects the final classification result. Taking the experiments on the Multi Features Dataset as an example, the result is shown in Tab. 3.
From this table, it can be noted that when the number of neurons is 30, the performance is best; therefore, 30 neurons are used in our former experiments. It can also be concluded that more neurons are not always better. Based on this observation, the number of neurons in the last layer is set to 30 in Task 1 and 200 in Task 2.
layer matching  10  20  30  40  50
$M_1$  0.9082  0.9543  0.9786  0.9771  0.9653
$M_2$  0.8853  0.9677  0.9797  0.9713  0.9522
Conclusion
In this paper, we propose a novel framework, referred to as Deep Transfer Learning by Exploring where to Transfer (DTLET), for handwritten digit recognition and text-to-image classification. In the proposed model, we find the layer matching with the lowest loss value; after the transfer, the classifier is applied on the resulting correlated common subspace. Experimental results support the effectiveness of the proposed framework.
As the current framework is only suitable for binary classification, extending it to multi-class classification is our future work. We will also develop more robust models for the "where to transfer" problem.
References
 [Bernico, Li, and Dingchao2018] Bernico, M.; Li, Y.; and Dingchao, Z. 2018. Investigating the impact of data volume and domain similarity on transfer learning applications. In Proc. CVPR.
 [Carlucci, Porzi, and Caput2017] Carlucci, F. M.; Porzi, L.; and Caput, B. 2017. Autodial: Automatic domain alignment layers. arXiv:1704.08082.
 [Collier, DiBiano, and Mukhopadhyay2018] Collier, E.; DiBiano, R.; and Mukhopadhyay, S. 2018. Cactusnets: Layer applicability as a metric for transfer learning. arXiv:1711.01558.
 [Dai et al.2009] Dai, W.; Chen, Y.; Xue, G.; Yang, Q.; and Yu, Y. 2009. Translated learning: Transfer learning across different feature spaces. In Proc. Advances in Neural Information Processing Systems.
 [Duan, Xu, and Tsang2012] Duan, L.; Xu, D.; and Tsang, W. 2012. Learning with augmented features for heterogeneous domain adaptation. In Proc. International Conference on Machine Learning.
 [Glorot, Bordes, and Bengio2011] Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Domain adaptation for largescale sentiment classification: A deep learning approach. In Proc. ICML.
 [Gong et al.2016] Gong, M.; Zhang, K.; Liu, T.; Tao, D.; Glymour, C.; and Scholkopf, B. 2016. Domain adaptation with conditional transferable components. In Proc. International Conference on Machine Learning.
 [Hardoon, Szedmak, and ShaweTaylor2004] Hardoon, D. R.; Szedmak, S.; and ShaweTaylor, J. 2004. Canonical correlation analysis: An overview with application to learning methods. Neural computation.
 [Li et al.2015] Li, J.; Zhang, H.; Huang, Y.; and Zhang, L. 2015. Visual domain adaptation: a survey of recent advances. IEEE Signal Processing Magazine 33(3):53–69.
 [Liu, Yang, and Tao2017] Liu, T.; Yang, Q.; and Tao, D. 2017. Understanding how feature structure transfers in transfer learning. In Proc. International Joint Conference on Artificial Intelligence.
 [Long et al.2016] Long, M.; Wang, J.; Cao, Y.; Sun, J.; and Yu, P. 2016. Deep learning of transferable representation for scalable domain adaptation. IEEE Transactions on Knowledge and Data Engineering 28(8):2027–2040.
 [Mehrkanoon and Suykens2018] Mehrkanoon, S., and Suykens, J. 2018. Regularized semipaired kernel cca for domain adaptation. IEEE Transactions on Neural Networks and Learning Systems 29(7):3199–3213.
 [Pan and Yang2010] Pan, S. J., and Yang, Q. 2010. A survey on transfer learning. IEEE Transaction on Knowledge and Data Engineering 22(10):1345–1359.
 [Redko et al.2018] Redko, I.; Courty, N.; Flamary, R.; and Tuia, D. 2018. Optimal transport for multisource domain adaptation under target shift. arXiv:1803.04899.
 [Tan et al.2017] Tan, B.; Zhang, Y.; Pan, S. J.; and Yang, Q. 2017. Distant domain transfer learning. In Proc. AAAI.
 [Tang et al.2016] Tang, J.; Shu, X.; Li, Z.; Qi, G. J.; and Wang, J. 2016. Generalized deep transfer networks for knowledge propagation in heterogeneous domains. ACM Transactions on Multimedia Computing, Communications, and Applications 12(4):68:1–68:22.
 [Tolstikhin et al.2018] Tolstikhin, I.; Bousquet, O.; Gelly, S.; and Schoelkopf, B. 2018. Wasserstein autoencoders. arXiv:1711.01558.
 [Tommasi et al.2012] Tommasi, T.; Quadrianto, N.; Caputo, B.; and Lampert, C. 2012. Beyond dataset bias: Multitask unaligned shared knowledge transfer. In Proc. Asian Conference on Computer Vision.
 [Wang et al.2017] Wang, J.; Chen, Y.; Hao, S.; Feng, W.; and Shen, Z. 2017. Balanced distribution adaptation for transfer learning. In Proc. International Conference on Data Mining.
 [Y. et al.2011] Zhu, Y.; Chen, Y.; Lu, Z.; Pan, S. J.; Xue, G. R.; Yu, Y.; and Yang, Q. 2011. Heterogeneous transfer learning for image classification. In Proc. AAAI.
 [Yan et al.2017] Yan, Y.; Li, W.; Ng, M.; Tan, M.; Wu, H.; Min, H.; and Wu, Q. 2017. Translated learning: Transfer learning across different feature spaces. In Proc. International Joint Conference on Artificial Intelligence.
 [Yang et al.2016] Yang, L.; Jing, L.; Yu, J.; and Ng, M. K. 2016. Learning transferred weights from cooccurrence data for heterogeneous transfer learning. IEEE Transactions on Neural Networks and Learning Systems 27(11):2187–2200.
 [Yu et al.2018] Yu, Y.; Tang, S.; Aizawa, K.; and Aizawa, A. 2018. Categorybased deep cca for finegrained venue discovery from multimodal data. IEEE Transactions on Neural Networks and Learning Systems 1–9.
 [Zhang et al.2015] Zhang, X.; Yu, F. X.; Chang, S. F.; and Wang, S. 2015. Supervised representation learning: Transfer learning with deep autoencoders. Computer Science.
 [Zhang et al.2018] Zhang, Y.; Xiang, T.; Hospedales, T. M.; and Lu, H. 2018. Deep mutual learning. In Proc. CVPR.
 [Zhou, Tsang, and Yan2014] Zhou, J. T.; Pan, S. J.; Tsang, I. W.; and Yan, Y. 2014. Hybrid heterogeneous transfer learning through deep learning. In Proc. AAAI.
 [Zhuang et al.2015] Zhuang, F.; Cheng, X.; Luo, P.; Pan, S. J.; and He, Q. 2015. Supervised representation learning: Transfer learning with deep autoencoders. In Proc. IJCAI.
 [Zhuang et al.2018] Zhuang, F.; Cheng, X.; Luo, P.; Pan, S. J.; and He, Q. 2018. Supervised representation learning with double encoding-layer autoencoder for transfer learning. ACM Transactions on Intelligent Systems and Technology 1–16.