Transferrable Prototypical Networks for Unsupervised Domain Adaptation
Abstract
In this paper, we introduce a new idea for unsupervised domain adaptation via a remold of Prototypical Networks, which learn an embedding space and perform classification via the distances to the prototype of each class. Specifically, we present Transferrable Prototypical Networks (TPN) for adaptation such that the prototypes of each class in the source and target domains are close in the embedding space and the score distributions predicted by the prototypes separately on source and target data are similar. Technically, TPN initially matches each target example to the nearest prototype in the source domain and assigns the example a “pseudo” label. The prototype of each class can then be computed on source-only, target-only and source-target data, respectively. TPN is trained end to end by jointly minimizing the distances across the prototypes on the three types of data and the KL-divergence of the score distributions output by each pair of prototypes. Extensive experiments are conducted on the transfers across MNIST, USPS and SVHN datasets, and superior results are reported when compared to state-of-the-art approaches. More remarkably, we obtain a single-model accuracy of 80.4% on the VisDA 2017 dataset.
1 Introduction
The recent advances in deep neural networks have convincingly demonstrated high capability in learning vision models on large datasets. For instance, an ensemble of residual nets [7] achieves 3.57% top-5 error on the ImageNet test set, which is even lower than the reported human-level performance of 5.1%. These achievements, however, rely heavily on large quantities of annotated data for deep model learning, and performing intensive manual labeling on a new dataset is expensive and time-consuming. A valid question is why not recycle off-the-shelf learnt knowledge/models in a source domain for new domain(s). The difficulty originates from the domain gap [33], which may adversely affect performance especially when the source and target data distributions are very different. An appealing way to address this challenge is unsupervised domain adaptation, which aims to utilize labeled examples or learnt models in the source domain and the large number of unlabeled examples in the target domain to generalize a target model.
A common practice in unsupervised domain adaptation is to align data distributions between source and target domains, or to build invariance across domains, by minimizing domain shift through measures such as correlation distances [27, 34] or maximum mean discrepancy [31]. In this paper, we explore general-purpose and task-specific domain adaptation under the framework of Prototypical Networks [26]. The design of prototypical networks assumes the existence of an embedding space in which the projections of samples in each class cluster around a single prototype (or centroid). Classification is then performed by computing the distances to the prototype representation of each class in the embedding space. In this view, general-purpose adaptation represents each class distribution by a prototype and matches the prototypes of each class in the embedding space learnt on data from different domains. The inspiration for task-specific adaptation comes from the rationale that the target data should be classified correctly by the task-specific model when the source and target distributions are well aligned. In the context of prototypical networks, task-specific adaptation is equivalent to adapting the score distributions produced by prototypes in different domains.
By consolidating the ideas of general-purpose and task-specific adaptation into unsupervised domain adaptation, we present a novel Transferrable Prototypical Networks (TPN) architecture. Ideally, TPN learns a non-linear mapping (a neural network) of the input examples into an embedding space in which the representations are invariant across domains. Specifically, TPN takes a batch of labeled source and unlabeled target examples, compares each target example to each of the prototypes computed on source data, and assigns the label of the nearest prototype as a “pseudo” label to the target example. General-purpose adaptation is then formulated as minimizing the distances between the prototypes measured on source data, on target data with pseudo labels, and on source plus target data; this alleviates domain discrepancy at the class level. In task-specific adaptation, we utilize a softmax over the distances of the embedding of each example to the prototypes as the classifier. The KL-divergence is exploited to model the mismatch of the score distributions output by the classifiers built on prototypes computed in each domain or their combination; in this case, domain discrepancy is amended at the sample level. The whole TPN is trained end to end by minimizing the classification loss on labeled source data plus the two adaptation terms, switching the learning from batch to batch. At the inference stage, each prototype is computed a priori; a test target example is projected into the embedding space, compared to each prototype, and the outputs of the softmax are taken as predictions.
2 Related Work
Inspired by the recent advances in image representation using deep convolutional neural networks (DCNNs), a few deep architecture based methods have been proposed for unsupervised domain adaptation. In particular, one common deep solution is to guide the feature learning in DCNNs by minimizing the domain discrepancy with Maximum Mean Discrepancy (MMD) [6], an effective non-parametric metric for comparing the distributions of source and target domains. [31] is one of the early works that incorporates MMD into DCNNs with a regular supervised classification loss on the source domain to learn representation that is both semantically meaningful and domain invariant. Later in [15], Long et al. simultaneously exploit the transferability of features from multiple layers via the multiple-kernel variant of MMD. The work is further extended by adapting classifiers through a residual transfer module in [17]. Most recently, [16] explores domain shift reduction in the joint distributions of the network activations of multiple task-specific layers.
Another branch of unsupervised domain adaptation in DCNNs exploits domain confusion by learning a domain discriminator [4, 14, 29, 30, 35]. Here the domain discriminator is designed to predict the domain (source/target) of each input sample and is trained in an adversarial fashion, similar to GANs [5], for learning domain-invariant representation. For example, [29] devises a domain confusion loss measured in the domain discriminator to enforce the learnt representation to be domain invariant. Similar in spirit, Ganin et al. treat domain confusion as a binary classification task and optimize the domain discriminator via a gradient reversal algorithm in [4]. Coupled GANs [13] directly applies GANs to the domain adaptation problem to explicitly reduce the domain shift by learning a joint distribution of multi-domain images. Recently, [30] combines adversarial learning with discriminative feature learning for unsupervised domain adaptation. Most recently, [32] extends the domain discriminator by learning a domain-invariant feature extractor and performing feature augmentation.
In summary, our approach belongs to the domain discrepancy based methods. Similar to previous approaches [16, 31], our TPN leverages additional unlabeled target data for learning task-specific classifiers. The novelty is the exploitation of multi-granular domain discrepancy in Prototypical Networks, at class level and sample level, which has not been fully explored in the literature. Class-level domain discrepancy is reduced by learning similar prototypes of each class in different domains, while sample-level discrepancy is reduced by enforcing similar score distributions across the prototypes of different domains.
3 Unsupervised Domain Adaptation
Our Transferrable Prototypical Networks (TPN) remould Prototypical Networks for the scenario of unsupervised domain adaptation by jointly bridging the domain gap via minimizing multi-granular domain discrepancies, and constructing classifiers with unlabeled target data and labeled source data. The classifiers in Prototypical Networks are realized by measuring the distances between an example and the prototype of each class, and can be flexibly adapted across domains by only updating the prototypes in a specific domain. To learn transferrable representations, TPN first utilizes the classifier learnt on source-only data to predict pseudo labels for the unlabeled target data, and thus produces another two kinds of prototype-based classifiers constructed on target-only and source-target data. The training of TPN is then performed by simultaneously classifying each source sample into the correct class and reducing multi-granular domain discrepancy at class level and sample level. Class-level domain discrepancy is reduced by matching the prototypes of each class across domains, and sample-level domain discrepancy is minimized by enforcing the score distributions over classes of each sample to be synchronized across domains. We alternate the two steps in each training iteration and optimize the whole TPN in an end-to-end fashion.
3.1 Preliminary—Prototypical Networks
Prototypical Networks were originally proposed in [26] to construct an embedding space in which points cluster around a single prototype representation of each class. In particular, we are given a set $\mathcal{S}=\{(x_i, y_i)\}_{i=1}^{N}$ with $N$ labeled samples belonging to $K$ categories, where $y_i \in \{1, \dots, K\}$ is the class label of sample $x_i$. The objective is to learn an embedding function $f_\theta(x)$ for transforming each input sample into a $D$-dimensional embedding space through a deep architecture of Prototypical Networks, where $\theta$ represents the learnable parameters. To convey the high-level description of the class as metadata, the prototype of each class is defined by taking the average of all embedded samples belonging to that class:

$$c_k = \frac{1}{|\mathcal{S}_k|} \sum_{(x_i, y_i) \in \mathcal{S}_k} f_\theta(x_i) \quad (1)$$
where $\mathcal{S}_k$ denotes the set of samples from class $k$. Given a query sample $x$, Prototypical Networks directly produce its score distribution over the $K$ classes via a softmax over the distances to the prototypes, whose $k$-th element is the probability of $x$ belonging to class $k$:

$$P(y = k \mid x) = \frac{\exp(-d(f_\theta(x), c_k))}{\sum_{k'=1}^{K} \exp(-d(f_\theta(x), c_{k'}))} \quad (2)$$
where $d(\cdot, \cdot)$ is the distance function (e.g., Euclidean distance as in [26]) between the query sample and a prototype. The training of Prototypical Networks is performed by minimizing the negative log-likelihood of assigning the correct class label $y$ to each sample $x$:

$$\mathcal{L}_{cls}(x, y) = -\log P(y \mid x) \quad (3)$$
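To make the mechanics concrete, the prototype computation of Eq.(1), the distance-based softmax of Eq.(2) and the loss of Eq.(3) can be sketched in a few lines of NumPy. This is a minimal illustrative sketch, not the paper's implementation; the function names are ours:

```python
import numpy as np

def prototypes(emb, labels, num_classes):
    # Eq.(1): each prototype is the mean embedding of its class
    return np.stack([emb[labels == k].mean(axis=0) for k in range(num_classes)])

def predict_proba(query, protos):
    # Eq.(2): softmax over negative squared Euclidean distances
    d = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = -d
    logits -= logits.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def nll_loss(proba, labels):
    # Eq.(3): negative log-likelihood of the correct class
    return -np.log(proba[np.arange(len(labels)), labels]).mean()
```

A query lying near a class prototype thus receives a probability close to 1 for that class.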
3.2 Problem Formulation
In unsupervised domain adaptation, we are given $N_s$ labeled samples $\mathcal{S}=\{(x^s_i, y^s_i)\}_{i=1}^{N_s}$ in the source domain and $N_t$ unlabeled samples $\mathcal{T}=\{x^t_j\}_{j=1}^{N_t}$ in the target domain. Based on the widely adopted assumption of a shared feature space for source and target domains [16, 20, 29], the ultimate goal is to design an embedding function $f_\theta$ which reduces domain shift in the shared feature space and enables learning of both transferrable representations and classifiers depending on $\mathcal{S}$ and $\mathcal{T}$. Different from existing transfer techniques [16, 17], which are typically composed of two cascaded networks for learning domain-invariant features and target-discriminative classifiers respectively, we consider unsupervised domain adaptation in the framework of Prototypical Networks. This framework naturally unifies the learning of features and classifiers into one network by constructing classifiers purely on the prototype of each class, a simple inductive bias that is beneficial in the domain adaptation regime. Specifically, to make Prototypical Networks transferrable across domains, two adaptation mechanisms are devised to align the distributions of source and target domains by reducing multi-granular (i.e., class-level and sample-level) domain discrepancies: general-purpose adaptation matches the prototypes of each class, and task-specific adaptation enforces similar score distributions over classes for each sample, across different domains, as shown in Figure 1.
3.3 General-purpose Domain Adaptation
Most existing works resolve unsupervised domain adaptation by minimizing the domain discrepancy between source and target data distributions with MMD [31], or by maximizing the domain confusion via a domain discriminator [29]. Both the domain discrepancy and domain confusion terms are measured over the entire source and target data, irrespective of the specific class of each sample. Moreover, the domain discrepancy has seldom been exploited across domains for each class, possibly because measuring such class-level domain discrepancy requires the labels of both source and target samples, while in the typical unsupervised setting no label is provided for target samples.
Inspired by self-labeling [11, 24] for domain adaptation, we directly utilize the prototype-based classifier learnt on labeled source data to match each target sample to the nearest prototype in the source domain, and then assign the target sample a “pseudo” label. As such, all target samples are assigned pseudo labels. After obtaining the real/pseudo labels of source/target data, three kinds of classifiers (i.e., prototypes $c^s_k$, $c^t_k$ and $c^{st}_k$) can be calculated on source-only data ($\mathcal{S}$), target-only data ($\mathcal{T}$) and source-target data ($\mathcal{S} \cup \mathcal{T}$), respectively:

$$c^s_k = \frac{1}{|\mathcal{S}_k|} \sum_{x^s_i \in \mathcal{S}_k} f_\theta(x^s_i), \quad c^t_k = \frac{1}{|\mathcal{T}_k|} \sum_{x^t_j \in \mathcal{T}_k} f_\theta(x^t_j), \quad c^{st}_k = \frac{1}{|\mathcal{S}_k| + |\mathcal{T}_k|} \sum_{x \in \mathcal{S}_k \cup \mathcal{T}_k} f_\theta(x) \quad (4)$$

where $\mathcal{S}_k$ and $\mathcal{T}_k$ denote the sets of source/target samples from the same class $k$.
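The pseudo-labeling step and the three prototype sets of Eq.(4) can be sketched as follows. This is an illustrative NumPy sketch with names of our choosing; a real implementation would additionally guard against classes to which no target sample is assigned:

```python
import numpy as np

def pseudo_label(target_emb, source_protos):
    # assign each target example the label of its nearest source prototype
    d = ((target_emb[:, None, :] - source_protos[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def three_prototype_sets(src_emb, src_y, tgt_emb, tgt_y, num_classes):
    # Eq.(4): prototypes on source-only, target-only and source-target data
    c_s = np.stack([src_emb[src_y == k].mean(axis=0) for k in range(num_classes)])
    c_t = np.stack([tgt_emb[tgt_y == k].mean(axis=0) for k in range(num_classes)])
    all_emb = np.concatenate([src_emb, tgt_emb])
    all_y = np.concatenate([src_y, tgt_y])
    c_st = np.stack([all_emb[all_y == k].mean(axis=0) for k in range(num_classes)])
    return c_s, c_t, c_st
```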
To measure the class-level domain discrepancy, we take inspiration from MMD-based transfer techniques [16, 17] and compute the pairwise reproducing kernel Hilbert space (RKHS) distance between the prototypes of the same class from different domains. The basic idea is that if the data distributions of the source and target domains are identical, the prototypes of the same class computed on different domains should coincide. Formally, we define the class-level discrepancy loss as

$$\mathcal{L}_{general} = \sum_{k=1}^{K} \left( \|\mu^s_k - \mu^t_k\|^2_{\mathcal{H}} + \|\mu^s_k - \mu^{st}_k\|^2_{\mathcal{H}} + \|\mu^t_k - \mu^{st}_k\|^2_{\mathcal{H}} \right) \quad (5)$$

where $\mu^s_k$, $\mu^t_k$ and $\mu^{st}_k$ denote the corresponding prototypes in the reproducing kernel Hilbert space $\mathcal{H}$. By minimizing this term, the prototypes of each class computed in each domain are enforced to be in close proximity in the embedding space, leading to a representation distribution that is invariant across domains in general.
Connections with MMD. MMD [6] is a kernel two-sample test which measures the distribution difference between source and target data by mapping them into a reproducing kernel Hilbert space. The empirical estimate of MMD is computed as

$$\mathcal{L}_{MMD} = \left\| \frac{1}{N_s} \sum_{i=1}^{N_s} \phi(x^s_i) - \frac{1}{N_t} \sum_{j=1}^{N_t} \phi(x^t_j) \right\|^2_{\mathcal{H}} \quad (6)$$

where $\phi(\cdot)$ is the mapping to the RKHS $\mathcal{H}$. Taking a close look at the objective of MMD and our class-level discrepancy loss in Eq.(5), we can observe some interesting connections. Concretely, the means of the source and target data measured in MMD can be interpreted as the holistic prototype of each domain in the RKHS, and MMD is then the RKHS distance between the holistic prototypes across domains. Our class-level domain discrepancy, in contrast, is computed as the RKHS distance between the prototypes of each class from different domains. In other words, a fine-grained alignment of the source and target data distributions is performed at the class level, instead of simply minimizing the distance between holistic prototypes across domains.
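The squared RKHS distance between two prototypes never needs to be computed explicitly: with the kernel trick it expands into averages of kernel evaluations over the raw samples. A minimal NumPy sketch with a Gaussian kernel follows; the function names and the fixed bandwidth are illustrative assumptions, not choices from the paper:

```python
import numpy as np

def rbf_gram(a, b, gamma=1.0):
    # Gaussian kernel matrix: k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def rkhs_prototype_distance(xs, xt, gamma=1.0):
    # ||mu_s - mu_t||_H^2 = E[k(s, s')] - 2 E[k(s, t)] + E[k(t, t')]
    return (rbf_gram(xs, xs, gamma).mean()
            - 2.0 * rbf_gram(xs, xt, gamma).mean()
            + rbf_gram(xt, xt, gamma).mean())
```

The class-level loss then sums this quantity over the classes and the three domain pairs; identical sample sets yield a distance of zero.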
3.4 Task-specific Domain Adaptation
General-purpose domain adaptation only enforces similarity in feature distributions, leaving the relations between samples and the task-specific classifiers (i.e., prototypes) unexploited. We therefore devise a second adaptation mechanism, task-specific adaptation, to reduce sample-level domain discrepancy by aligning the score distributions of the different classifiers (i.e., prototypes) across domains for each sample. The rationale is that each source/target sample should be classified correctly by the task-specific classifiers when the source and target distributions are well aligned, leading to consistent decisions of the classifiers across domains.
In particular, given each source/target sample $x$, three score distributions ($P^s(x)$, $P^t(x)$ and $P^{st}(x)$) are obtained via the three kinds of classifiers (i.e., prototypes $c^s_k$, $c^t_k$ and $c^{st}_k$) learnt on source-only, target-only and source-target data, respectively. To measure the sample-level domain discrepancy, we utilize the KL-divergence to evaluate the pairwise distance between the score distributions from different domains. The sample-level discrepancy loss over the source and target samples is defined as

$$\mathcal{L}_{task} = \frac{1}{N_s + N_t} \sum_{x \in \mathcal{S} \cup \mathcal{T}} \left( \widetilde{KL}(P^s(x) \,\|\, P^t(x)) + \widetilde{KL}(P^s(x) \,\|\, P^{st}(x)) + \widetilde{KL}(P^t(x) \,\|\, P^{st}(x)) \right) \quad (7)$$

where $KL(P \| Q) = \sum_{k} P_k \log \frac{P_k}{Q_k}$ is the KL-divergence and $\widetilde{KL}(P \| Q) = \frac{1}{2}\left(KL(P \| Q) + KL(Q \| P)\right)$ is the symmetric pairwise KL-divergence.
Please note that, different from general-purpose domain adaptation, which independently matches the prototypes of each class across different domains, task-specific adaptation simultaneously adapts the prototypes of all classes, pursuing similar score distributions over classes for each sample.
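The symmetric pairwise KL-divergence and the resulting sample-level loss of Eq.(7) can be sketched as follows (an illustrative NumPy sketch; `ps`, `pt` and `pst` hold the per-sample score distributions from the source-only, target-only and source-target classifiers, and the names are ours):

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    # symmetric pairwise KL-divergence between rows of p and q
    p, q = p + eps, q + eps
    return 0.5 * ((p * np.log(p / q)).sum(axis=-1)
                  + (q * np.log(q / p)).sum(axis=-1))

def task_specific_loss(ps, pt, pst):
    # Eq.(7): sum the three pairwise terms, then average over samples
    return (sym_kl(ps, pt) + sym_kl(ps, pst) + sym_kl(pt, pst)).mean()
```

When the three classifiers agree on every sample, the loss vanishes, which is exactly the aligned regime the adaptation pursues.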
3.5 Optimization
The overall training objective of our TPN integrates the supervised classification loss of Eq.(3) and the multi-granular discrepancy losses (i.e., the class-level discrepancy loss of Eq.(5) and the sample-level discrepancy loss of Eq.(7)). Hence we obtain the following optimization problem:

$$\min_{\theta} \; \mathcal{L} = \mathcal{L}_{cls} + \alpha \mathcal{L}_{general} + \beta \mathcal{L}_{task} \quad (8)$$

where $\alpha$ and $\beta$ are trade-off parameters. With this overall objective, the crucial goal of the optimization is to learn the deep embedding function $f_\theta$ whose output representations are invariant across domains.
Training Procedure. To address the optimization problem in Eq.(8), we split the training process into two steps: 1) calculate the classifier (i.e., prototypes $c^s_k$) on the source domain and apply it to assign pseudo labels to target samples; 2) calculate the classifiers (i.e., prototypes $c^t_k$ and $c^{st}_k$) on target-only and source-target data, and update $\theta$ by gradient descent on the overall objective. We alternate the two steps in each training iteration and stop once a convergence criterion is met. Note that, to remedy the error of self-labeling, we only assign pseudo labels to the target examples whose maximum scores exceed 0.6, and we resample the target examples for labeling in each training iteration to avoid overfitting to the pseudo labels. Furthermore, the training process of TPN is resistant to the noise of pseudo labels since we iteratively utilize both labeled source examples and pseudo-labeled target examples for learning the embedding function. This procedure not only ensures the accuracy in the source domain, but also effectively minimizes the class-level and sample-level discrepancies, and the cycle gradually improves the accuracy in the target domain.
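The confidence gate on self-labeling described above can be sketched as follows (illustrative NumPy; the 0.6 threshold is the one stated in the text, everything else is our naming):

```python
import numpy as np

def confident_pseudo_labels(proba, threshold=0.6):
    # keep only target examples whose maximum score exceeds the threshold
    labels = proba.argmax(axis=1)
    keep = proba.max(axis=1) > threshold
    return labels[keep], keep
```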
Inference. After training TPN, we obtain the deep embedding function $f_\theta$. With this, all three sets of prototypes ($c^s_k$, $c^t_k$ and $c^{st}_k$) are calculated over the whole training set in advance and stored in memory. Any one of the three prototype sets can be utilized as the final classifier for classifying a test target sample. We empirically verified that the performance is not sensitive to the selection of prototypes (the accuracy fluctuates within 0.002 when using different sets of prototypes for the four domain shifts in our experiments), which implicitly reveals the domain-invariant characteristic of the learnt feature representation. Hence, given a test target sample, we compute its embedding representation via $f_\theta$ and compare the distances to the prototype of each class to output the final prediction scores.
3.6 Theoretical Analysis
We formalize the error bound of TPN by an extension of the theory in [1]. As TPN performs training on a mixture of labeled source examples and target samples with pseudo labels, the classification error is naturally considered as a linearly weighted sum of the errors in the source and target domains. Denote $y_s$ and $\hat{y}_t$ as the ground truth labels of source examples and the pseudo labels of target samples, respectively, and $h$ as a hypothesis. The error is then formally written as

$$\epsilon_\alpha(h) = \alpha\,\epsilon_T(h, \hat{y}_t) + (1 - \alpha)\,\epsilon_S(h, y_s) \quad (9)$$

where $\alpha \in [0, 1]$ is the trade-off parameter. The terms $\epsilon_T(h, \hat{y}_t)$ and $\epsilon_S(h, y_s)$ represent the expected error over the sample distribution of the target domain with respect to the pseudo labels and of the source domain with respect to the ground truth labels, respectively.
Next, a valid question is how close the error $\epsilon_\alpha(h)$ is to the oracle error $\epsilon_T(h, y_t)$ that evaluates the classifier against the ground truth labels $y_t$ of the target examples. The closer the two errors are, the better the domain adaptation performs. The following lemma proves that the difference between them is bounded for our TPN.
Lemma 1.
Let $h$ be a hypothesis in class $\mathcal{H}$. Then

$$|\epsilon_\alpha(h) - \epsilon_T(h, y_t)| \le (1 - \alpha)\left(\tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T) + \lambda\right) + \alpha \rho \quad (10)$$

where $d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_S, \mathcal{D}_T)$ measures the domain discrepancy in the hypothesis space $\mathcal{H}$, $\rho$ denotes the ratio of target examples with false pseudo labels, and $\lambda$ is the combined error in the two domains of the joint ideal hypothesis $h^*$, which is the optimal hypothesis minimizing the combined error:

$$h^* = \arg\min_{h \in \mathcal{H}} \; \epsilon_S(h, y_s) + \epsilon_T(h, y_t), \qquad \lambda = \epsilon_S(h^*, y_s) + \epsilon_T(h^*, y_t) \quad (11)$$
Lemma 1 decomposes the bound into three terms: the domain discrepancy measured by the disagreement of hypotheses in the space $\mathcal{H}$, the error $\lambda$ of the ideal joint hypothesis, and the ratio $\rho$ of noise in the pseudo labels. In TPN, the first term is reduced by quantifying the class-level discrepancy of prototypes and the sample-level discrepancy over score distributions across different domains. As stated in [1], when the combined error of the joint ideal hypothesis is large, no classifier performs well on both domains; in the cases most relevant to domain adaptation, however, $\lambda$ is usually considered negligibly small and the second term can be disregarded. Furthermore, in each iteration TPN searches for the optimal hypothesis and improves the accuracy of pseudo-label prediction on target examples, and the increase of correct pseudo labels in turn benefits the reduction of domain discrepancy. We will empirically verify in Section 4.3 that the third term, the noise in the pseudo labels, decreases over iterations. As such, TPN constantly tightens the bound in Eq.(10).
4 Experiments
We conduct extensive evaluations of TPN for unsupervised domain adaptation on four domain shifts: three digit-image transfers across the MNIST [10], USPS [3] and SVHN [19] datasets, and one synthetic-to-real image transfer on the VisDA 2017 dataset [21].
4.1 Datasets and Experimental Settings
Datasets. The MNIST (M) and USPS (U) image datasets are both handwritten digit datasets containing 10 classes of digits. The MNIST dataset consists of 70K images and the USPS dataset includes 9.3K images. Unlike these two, the SVHN (S) dataset is a real-world digit dataset of house numbers in Google Street View images and contains 100K cropped digit images. The VisDA 2017 dataset is the largest synthetic-to-real object classification dataset to date, with over 280K images in the training, validation and testing splits (domains). All three domains share the same 12 object categories. The training domain consists of 152K synthetic images generated by rendering 3D models of the same object categories from different angles and under different lighting conditions. The validation domain includes 55K images obtained by cropping objects in real images from COCO [12]. The testing domain contains 72K images cropped from video frames in YTBB [22].
Digits Image Transfer. Following [30], we consider three directions, M→U, U→M and S→M, for unsupervised domain adaptation among the digit datasets. For the transfer between MNIST and USPS, we sample 2K images from MNIST and 1.8K images from USPS as in [30]. For S→M, the two training sets are fully utilized. In addition, the CNN architecture for the three digit transfer tasks is a simple modified version of [10] (a 2-conv-layer LeNet), which is also exploited in [30].
Synthetic-to-Real Image Transfer. The second experiment was conducted on the most challenging synthetic-to-real image transfer task in VisDA 2017. As the annotations of the testing data in VisDA are not publicly available, we take the training data (i.e., synthetic images) as the source domain and the validation data (i.e., cropped COCO images) as the target domain. Moreover, we adopt a 50-layer ResNet [7] pre-trained on ImageNet [23] as our basic CNN structure.
Implementation Details. The two trade-off parameters $\alpha$ and $\beta$ in Eq.(8) are simply set to 1. A common difficulty in unsupervised domain adaptation is the lack of annotations in the target domain, which makes such parameters hard to estimate; we therefore fix the trade-off parameters in all experiments. We strictly follow [2, 30] and set the embedding size to 10/512 for the digit/synthetic-to-real transfers. We implement TPN mainly in Caffe [8]. Specifically, the network weights are trained by ADAM [9] with 0.0005 weight decay and 0.9/0.999 momentum. The learning rate and mini-batch size are set to 0.0002/0.00001 and 128/60 for the digit/synthetic-to-real transfers, respectively. The maximum number of training iterations is set to 70K for all experiments. Moreover, following [30], we pre-train TPN on the labeled source data. For the digit transfer tasks, we adopt the classification accuracy on the target domain as the evaluation metric. For the synthetic-to-real transfer, we measure the per-category classification accuracy on the target domain, and the final metric is the average accuracy over all categories.
Compared Methods. To empirically verify the merit of our TPN, we compare the following approaches: (1) Source-only directly exploits the classification model trained on the source domain to classify target samples. (2) RevGrad [4] treats domain confusion as a binary classification task and trains the domain discriminator via gradient reversal. (3) DC [29] explores a Domain Confusion loss measured in the domain discriminator for unsupervised domain adaptation. (4) DAN [15] utilizes the multiple-kernel variant of MMD to align feature representations from multiple layers. (5) RTN [17] extends DAN by adapting classifiers through a residual transfer module. (6) ADDA [30] designs a unified unsupervised domain adaptation model based on adversarial learning objectives. (7) JAN [16] learns a transfer model by aligning the joint distributions of the network activations of multiple layers across domains. (8) MCD [25] aligns the distributions of source and target domains by utilizing the task-specific decision boundaries. (9) SEn [2] explores the mean teacher variant of temporal ensembling [28] for unsupervised domain adaptation. (10) TPN is the proposal in this paper; two ablated settings, denoted here TPN-G and TPN-T, are trained with only general-purpose and only task-specific adaptation, respectively. (11) Train-on-target is an oracle run that trains the classifier on all labeled target samples.


4.2 Performance Comparison
Digits Image Transfer. Table 2(a) shows the performance comparisons on the three transfer directions among the digit datasets. Overall, the results across the three adaptations consistently indicate that our proposed TPN achieves superior performance against other state-of-the-art techniques, including MMD based models (DAN, RTN, JAN) and domain discriminator based approaches (RevGrad, DC, ADDA, MCD). In particular, TPN achieves accuracies of 92.1% and 94.1% on the M→U and U→M adaptations, an absolute improvement over the best competitor ADDA of 2.7% and 4.0%, respectively, which is generally considered significant progress on the adaptation between MNIST and USPS. It is noteworthy that, compared to JAN, TPN also evidently promotes the classification accuracy on the harder transfer S→M, where the source and target domains are substantially different. The results in general highlight the key importance of exploring both class-level and sample-level domain discrepancy via general-purpose and task-specific adaptation, leading to more domain-invariant feature representations.
The performance of Source-only, which trains the classifier only on labeled source data, can be regarded as a lower bound without domain adaptation. By additionally incorporating a domain adaptation term (MMD/domain discriminator), RevGrad, DC, DAN, RTN, ADDA, JAN and MCD yield a large performance boost over Source-only, which basically indicates the advantage of measuring domain discrepancy/confusion over the source and target data. Furthermore, their performance on the harder transfer S→M is much lower than that of our TPN-G and TPN-T (the variants with only general-purpose and only task-specific adaptation), which exploit the class-level and sample-level domain discrepancy in Prototypical Networks by matching the prototypes of each class across domains and the score distributions of the different classifiers (i.e., prototypes) for each sample, respectively. This confirms the effectiveness of leveraging class-level and sample-level domain discrepancy, especially between more distinct domains. For the two easier transfers between MNIST and USPS, TPN-T is inferior to ADDA, MCD and TPN-G, which indicates that solely matching the score distributions of each sample may inject noise more easily than a domain discriminator or class-level domain discrepancy on transfers across similar domains. In addition, by simultaneously utilizing both general-purpose and task-specific adaptation, the full TPN consistently boosts the performance on all three digit transfer tasks, demonstrating the advantage of jointly leveraging multi-granular domain discrepancy at class level and sample level. Note that we exclude the published results of SEn in this comparison, as SEn is originally built with deeper CNNs (i.e., 9 conv layers) on the digit datasets while our TPN is based on a 2-conv-layer LeNet. When equipped with the same CNNs as SEn, the accuracy of our TPN is boosted to 98.6% on M→U, which is higher than the 98.3% of SEn.
Synthetic-to-Real Image Transfer. The performance comparisons for the synthetic-to-real image transfer task on the VisDA 2017 dataset are summarized in Table 2(b). Here the results of SEn are all reported in the setting with multiple data augmentations (DA). Our TPN consistently performs better than the other runs without any DA involved. In particular, the mean accuracy across all 12 categories reaches 80.4%, an absolute improvement over JAN of 13.9%. Similar to the observations on the hard digit transfer S→M, the single-adaptation variants TPN-G and TPN-T already exhibit better performance than JAN by taking class-level and sample-level domain discrepancy into account, and a larger degree of improvement is attained when exploiting both general-purpose and task-specific adaptation in the full TPN. Please note that the highest accuracy of SEn, 82.8%, is obtained with test-time augmentation (Test-aug), i.e., averaged predictions over 16 different augmentations of each image, while the 80.4% of our TPN is from a single model without any DA. When relying on one kind of DA (Mini-aug), SEn only achieves 74.2%, which is still lower than ours.
4.3 Experimental Analysis
Feature Visualization. Figures 2(a)(b) depict the t-SNE [18] visualizations of the features learnt by Source-only and by our TPN on the VisDA 2017 dataset (10K samples in each domain). We can see that the distribution of the target samples is far from that of the source samples for the Source-only run without domain adaptation. Through domain adaptation by TPN, the two distributions are brought closer, making the target distribution indistinguishable from the source one.
Confusion Matrix Visualization. Figures 2(c)(f) show the confusion matrices for the classifiers learnt by Source-only, JAN, our TPN and Train-on-target on VisDA. Examining the confusion matrix of Source-only reveals that the domain shift is relatively large and that the majority of the confusion occurs between objects with similar 3D structures, e.g., knife & skateboard (sktbrd) and truck & car. Through domain adaptation by JAN and TPN, the confusion is reduced for most classes. In particular, among all 12 categories, TPN achieves higher accuracy than JAN for 10 categories, demonstrating that the features learnt by our TPN are more discriminative on the target domain.
Convergence Analysis. To illustrate the convergence of our TPN, we visualize with t-SNE the evolution of the embedded representations of a subset of the VisDA 2017 dataset (10 samples per domain) during training. Figures 3(a)–(h) illustrate that the target classes become increasingly well discriminated by the source classifier of TPN. Figure 3(i) further depicts that the accuracy steadily increases (i.e., the noise of the pseudo labels decreases) and the two adaptation losses decrease as training proceeds. Specifically, at the initial step, the ratio of target examples with false pseudo labels is 44.7%, i.e., only 55.3% of the target samples are assigned the correct labels. As the training of our TPN proceeds, this pseudo-label noise gradually decreases, and the final accuracy is boosted up to 80.4% after model convergence. This again verifies that minimizing class-level and sample-level domain discrepancy leads to better adaptation.
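The pseudo-label noise tracked above is the fraction of target examples whose nearest source prototype belongs to the wrong class. The sketch below illustrates this nearest-prototype matching on toy 2-D data; the features, prototypes and labels are made up for illustration, whereas in TPN they come from the learnt embedding.

```python
import numpy as np

def assign_pseudo_labels(target_feats, prototypes):
    """Assign each target example the label of its nearest source prototype.

    `prototypes` has shape (num_classes, dim); returns one integer label
    per row of `target_feats`, chosen by squared Euclidean distance.
    """
    dists = ((target_feats[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

# Two well-separated class prototypes and four toy target examples.
prototypes = np.array([[0.0, 0.0], [10.0, 10.0]])
target = np.array([[0.5, -0.2], [9.0, 11.0], [1.0, 1.0], [10.5, 9.5]])
pseudo = assign_pseudo_labels(target, prototypes)  # → [0, 1, 0, 1]

# Pseudo-label "noise": the fraction disagreeing with the true labels.
true = np.array([0, 1, 1, 1])
noise = float((pseudo != true).mean())  # → 0.25
```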
5 Conclusions
We have presented Transferrable Prototypical Networks (TPN), which explore domain adaptation in an unsupervised manner. In particular, we study the problem from the viewpoint of both general-purpose and task-specific adaptation. To verify our claim, we have devised a measure of each type of adaptation in the framework of prototypical networks. The general-purpose adaptation pushes the prototype of each class computed in each domain to be close in the embedding space, yielding representation distributions that are invariant across domains in general. The task-specific adaptation further takes the decisions of the classifiers into account when aligning feature distributions, which ideally leads to domain-invariant representations. Experiments conducted on the transfers across the MNIST, USPS and SVHN datasets validate our proposal and analysis. More remarkably, we achieve new state-of-the-art single-model performance on synthetic-to-real image transfer in the VisDA 2017 challenge.
References
 [1] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 2010.
 [2] Geoffrey French, Michal Mackiewicz, and Mark Fisher. Self-ensembling for domain adaptation. In ICLR, 2018.
 [3] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Springer Series in Statistics, New York, 2001.
 [4] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
 [5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
 [6] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 2012.
 [7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [8] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
 [9] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [10] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
 [11] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, 2013.
 [12] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
 [13] Ming-Yu Liu and Oncel Tuzel. Coupled generative adversarial networks. In NIPS, 2016.
 [14] Fuchen Long, Ting Yao, Qi Dai, Xinmei Tian, Jiebo Luo, and Tao Mei. Deep domain adaptation hashing with adversarial learning. In SIGIR, 2018.
 [15] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In ICML, 2015.
 [16] Mingsheng Long, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017.
 [17] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In NIPS, 2016.
 [18] Laurens Van Der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. JMLR, 2008.
 [19] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In Workshop on Deep Learning and Unsupervised Feature Learning, NIPS, 2011.
 [20] Sinno Jialin Pan, James T Kwok, and Qiang Yang. Transfer learning via dimensionality reduction. In AAAI, 2008.
 [21] Xingchao Peng, Ben Usman, Neela Kaushik, Judy Hoffman, Dequan Wang, and Kate Saenko. VisDA: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924, 2017.
 [22] Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. YouTube-BoundingBoxes: A large high-precision human-annotated data set for object detection in video. In CVPR, 2017.
 [23] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
 [24] Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. In ICML, 2017.
 [25] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, 2018.
 [26] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
 [27] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
 [28] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS, 2017.
 [29] Eric Tzeng, Judy Hoffman, Trevor Darrell, and Kate Saenko. Simultaneous deep transfer across domains and tasks. In ICCV, 2015.
 [30] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017.
 [31] Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
 [32] Riccardo Volpi, Pietro Morerio, Silvio Savarese, and Vittorio Murino. Adversarial feature augmentation for unsupervised domain adaptation. In CVPR, 2018.
 [33] Ting Yao, Chong-Wah Ngo, and Shiai Zhu. Predicting domain adaptivity: Redo or recycle? In ACM MM, 2012.
 [34] Ting Yao, Yingwei Pan, Chong-Wah Ngo, Houqiang Li, and Tao Mei. Semi-supervised domain adaptation with subspace learning for visual recognition. In CVPR, 2015.
 [35] Yiheng Zhang, Zhaofan Qiu, Ting Yao, Dong Liu, and Tao Mei. Fully convolutional adaptation networks for semantic segmentation. In CVPR, 2018.