Unsupervised Domain Adaptation via Discriminative Manifold Embedding and Alignment
Abstract
Unsupervised domain adaptation is effective in transferring the rich information of the source domain to the unlabeled target domain. Though deep learning and adversarial strategies have made important breakthroughs in feature adaptability, two issues remain to be further explored. First, hard-assigned pseudo labels on the target domain are risky to the intrinsic data structure. Second, the batch-wise training manner in deep learning limits the description of the global structure. In this paper, a Riemannian manifold learning framework is proposed to achieve transferability and discriminability consistently. For the first problem, the method establishes a probabilistic discriminant criterion on the target domain via soft labels. This criterion is further extended to a global approximation scheme for the second issue; the approximation is also memory-saving. Manifold metric alignment is exploited to be compatible with the embedding space, and a theoretical error bound is derived to facilitate the alignment. Extensive experiments have been conducted to investigate the proposal, and the results of the comparison study manifest the superiority of the consistent manifold learning framework.
Introduction
In machine learning, large-scale datasets with annotations play a crucial role in the learning process. Convolutional Neural Networks (CNNs) achieve significant advances in various tasks via a huge number of well-labeled samples [LeCun, Bengio, and Hinton 2015]. Unfortunately, such data are prohibitively expensive in many real-world scenarios. Applying a learned model in a new environment, i.e., the cross-domain scheme, will cause a significant degradation of recognition performance [Ren, Xu, and Yan 2018, Kim et al. 2019].
Unsupervised Domain Adaptation (UDA) is designed to deal with the shortage of labels by transferring the rich labels and strong supervision of the source domain to the target domain, where the target domain has no access to annotations. In fact, datasets composed of specific exploratory factors and variants, such as background, style, illumination, camera views or resolution, often lead to shifting distributions (i.e., the domain shift) [Shimodaira 2000, Moreno-Torres et al. 2012]. According to the transfer theory established by Ben-David et al. [Ben-David et al. 2007, Ben-David et al. 2010], the primary task of cross-domain adaptation is to learn discriminative feature representations while narrowing the discrepancy between domains.
Recent literature indicates that CNNs learn abstract representations with nonlinear transformations [Bengio, Courville, and Vincent 2013], which suppress the negative effects caused by variant explanatory factors in domain shift [Long et al. 2015]. Pioneering works [Long et al. 2015, Ganin et al. 2016, Long et al. 2017, Sankaranarayanan et al. 2018] attempt to transfer a source classifier with sufficient supervision to the target domain by minimizing the discrepancy between the source and target domains. Though early adversarial confusion methods [Ganin et al. 2016, Sankaranarayanan et al. 2018, Pinheiro 2018], which are inspired by Generative Adversarial Nets (GANs) [Goodfellow et al. 2014], promise that the generated features are domain-indistinguishable and form well-aligned marginal distributions, the conditional distributions are still not guaranteed [Long et al. 2018, Saito et al. 2018, Chen et al. 2019b].
Some of the latest methods achieve remarkable improvements in accuracy by employing the uncertainty information on the target domain, e.g., pseudo labels and soft labels [Long et al. 2018, Saito et al. 2018, Pinheiro 2018, Chen et al. 2019b]. Though such information transduced from the source domain strengthens the discriminative ability on the target domain, two points remain to be further explored. First, the direct utilization of uncertainty information is risky and should be treated cautiously [Long et al. 2018], as hard-assigned pseudo labels may change the intrinsic structure of the data space [Ding and Fu 2019]. Second, the batch-wise training in deep learning limits the capture of global information; thus models may be misled by some extreme local distributions.
In this paper, we develop a novel Riemannian manifold embedding and alignment framework. As transferability and discriminability are both valuable [Chen et al. 2019b], the proposal reaches a consistent rule for these two properties. The main idea is to describe the domains by a sequence of abstract manifolds. Inspired by the successful application of soft labels for conditional coding and the multi-layer embedding in [Long et al. 2017, Long et al. 2018], a probabilistic discriminant criterion is proposed. Further, we extend this criterion to a global approximation scheme, which overcomes the dilemma of discriminant learning in batch-wise training. Following previous attempts on manifold learning [Gong et al. 2012, Huang et al. 2017], we employ a manifold metric to measure the domain discrepancy. The contributions are summarized as follows.

To optimize the structure of the target domain and reduce the risk of uncertainty information simultaneously, a probabilistic discriminant criterion is developed. Specifically, an inter-class penalty supervised by ground-truth labels is built on the source domain; this penalty aims to construct a separable structure for the classes. Then a probabilistic and truncated intra-class agreement is proposed on the target domain, which treats the classes of the source domain as anchors and acquires the inter-class separability transductively.

Based on the above criterion, a global approximation scheme is developed as an extension. To capture the global structure, it combines the global information from the last epoch with the data in the current batch. Since the approximation only requires access to the class-wise centers, it is memory-saving.

The manifold alignment is developed to be compatible with the embedding discriminant space. It establishes a series of abstract descriptors (i.e., the basis) for the original data and aligns the domains by minimizing the discrepancy between these abstract descriptors, while most of the noise is filtered out. Further, a theoretical error bound is derived to facilitate the selection of components.
Related Work
Traditional UDA models usually focus on learning domain-invariant and discriminative features [Pan et al. 2010, Long et al. 2013]. Based on the manifold assumption, plentiful metrics have been developed to measure the distance between instances from the source and target [Gong et al. 2012, Fernando et al. 2013]. Deep learning methods enhance transferability by exploring representations that disentangle the exploratory factors of variation hidden behind the data [Bengio, Courville, and Vincent 2013, Yosinski et al. 2014]. Distribution alignment methods minimize the discrepancy of domains based on common statistics directly, e.g., the first-order statistic based on maximum mean discrepancy (MMD) [Sejdinovic et al. 2013, Long et al. 2015, Ren et al. 2019] and the second-order statistic based on covariance matrices [Sun, Feng, and Saenko 2016, Chen et al. 2019a]. Inspired by GANs [Goodfellow et al. 2014], many adversarial approaches with different purposes have been developed. The most common usage of adversarial networks is to generate representations that fool the domain discriminator, so that the distributions of the domains become more similar [Ganin et al. 2016, Long et al. 2017, Pinheiro 2018]. Domain-specific and task-specific methods aim to tackle the issue of compact representations in high-level layers [Long et al. 2017, Saito et al. 2018, Kim et al. 2019, Lee et al. 2019, Ding and Fu 2019].
Though adversarial alignment yields well-aligned marginal distributions, the conditional distributions still need to be explored. Recent research suggests that discriminability plays a crucial role in the formation of the class distributions (i.e., the conditional distributions) [Long et al. 2018, Ding and Fu 2019, Chen et al. 2019b]. Conditional Domain Adversarial Network (CDAN) [Long et al. 2018] encodes the target predictions into the deep features and then models the joint distributions of features and labels. Batch Spectral Penalization (BSP) [Chen et al. 2019b] revisits the relation between transferability and discriminability via the largest singular values of the batch features.
Multi-Layer Riemannian Manifold Embedding and Alignment
In this section, we propose the Discriminative Riemannian Manifold Embedding and Alignment (DRMEA) framework.
Backgrounds and Motivations
In the classical manifold learning paradigm, a low-dimensional manifold is usually extracted from the originally high-dimensional data space to construct a compact and discriminative embedding space. Specifically, a Riemannian manifold is usually built from certain objects such as linear subspaces, affine/convex hulls, or symmetric positive definite (SPD) matrices [Huang et al. 2017].
From the perspective of discriminative embedding, the graph-based criterion [Yan et al. 2007] is widely adopted in the areas of manifold learning and domain adaptation. Basically, those methods establish an instance-based connection graph or similarity graph to construct a separable space. Besides, as the primary assumption of domain adaptation is based on statistical distributions, alignment based on covariance matrices, which lie on a Riemannian manifold, equips the domains with both manifold and statistical properties. Motivated by this, our work aims to embed the graph-based discriminant criterion into the target domain, which is represented as manifolds (i.e., the covariance matrices).
Given features $X \in \mathbb{R}^{d \times n}$ and their mean vector $\bar{x} \in \mathbb{R}^{d}$, where $d$ denotes the dimension of features and $n$ represents the sample size. Denote by $\mathcal{X}$ the input space (e.g., Euclidean space, Hilbert space or manifold space); manifold learning aims to learn a specific nonlinear mapping
$$f: \mathcal{X} \rightarrow \mathcal{M},$$
where $\mathcal{M}$ is the low-dimensional embedding manifold. Under the SPD representation setting, the image of a given covariance matrix $C = \frac{1}{n-1}(X - \bar{x}\mathbf{1}^{\top})(X - \bar{x}\mathbf{1}^{\top})^{\top}$ is a low-dimensional SPD matrix, where $\mathbf{1}$ is the $n$-dimensional vector with all one elements and $^{\top}$ is the transpose operation. Intuitively, learning the mapping $f$ can be reduced to finding a nonlinear transformation
$$\phi: \mathbb{R}^{d} \rightarrow \mathbb{R}^{m},$$
and the image of the mapping can be approximated by the inner product of the transformed features, i.e., $f(C) \approx \phi(X)\phi(X)^{\top}$.
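To make this inner-product view concrete, here is a minimal numpy sketch under the notation above; the shapes and the linear map `W` (standing in for the learned nonlinear transformation) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Hypothetical shapes: features X in R^{d x n}; a linear map W in R^{d x m}
# stands in for the learned non-linear transformation phi.
rng = np.random.default_rng(0)
d, n, m = 64, 200, 8

X = rng.standard_normal((d, n))
W = rng.standard_normal((d, m))

# Centre the features and form the d x d covariance matrix (an SPD object) ...
Xc = X - X.mean(axis=1, keepdims=True)
C = Xc @ Xc.T / (n - 1)

# ... and push it through the map: W^T C W is the low-dimensional SPD image,
# equivalently the covariance of the projected features.
C_low = W.T @ C @ W

Z = W.T @ Xc
C_check = Z @ Z.T / (n - 1)
assert np.allclose(C_low, C_check)   # the inner-product view of the image
```

Under a linear map the two views coincide exactly; with a nonlinear `phi` the inner product is only an approximation of the image, as stated above.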
For domain adaptation, the source and target domains can be taken as two Euclidean spaces, where the discriminative information is relatively inadequate. Thus the ideal manifolds are expected to be discriminative, representative and compact. Besides, the features distribution of domains, which is represented by manifolds, should be aligned with manifold metric for the better transfer of discriminative structure.
LowDimensional Manifold Layers
As previously stated, we aim to learn a nonlinear transformation for the input features directly. In this paper, CNNs are used to obtain such a projection. To explore the latent Riemannian representations of the Euclidean features (i.e., the deep features in stage 1), the output features of the CNN backbone are fed into progressive low-dimensional manifold layers in the second stage. Since there is a natural geometric difference between Euclidean space and Riemannian space, a multi-layer scheme is adopted to reduce the dimension of the features progressively.
Figure 1 shows the network architecture of the proposed method. Let $\Theta$ denote the parameters of the networks. The progressive Riemannian manifold layers are represented as a sequence of functions implemented on fully connected layers. In fact, the CNNs and the Riemannian manifold layers are general and shared by both domains, which means that common projections are explored to map the two domains into a general low-dimensional space. Therefore, the manifold layers should be equipped with the following properties:

Discriminative Structure: To strengthen the discriminative power of the manifold space, the intra-class samples are required to be compact, while the inter-class samples are required to be separable.

Consistent Structure: The source and target domains are aligned under a manifold metric to match the manifold assumption. As a result, the domain discrepancy is represented as the distance between two submanifolds of the embedding manifold, and then minimized based on the defined manifold metric (e.g., the Grassmannian representation metric, the Log-Euclidean metric and the manifold principal angle similarity).
To reach the above goals, we propose to model the two properties by the losses $\mathcal{L}_{dis}$ and $\mathcal{L}_{align}$, which will be detailed later. The objective is then formulated as follows:
$$\mathcal{L} = \mathcal{L}_{ce} + \lambda_{1}\,\mathcal{L}_{dis} + \lambda_{2}\,\mathcal{L}_{align},$$
where $\mathcal{L}_{ce}$ is the cross-entropy loss of the classifier on the source domain and $\{\lambda_{1}, \lambda_{2}\}$ are the penalty parameters.
Discriminative Structure Loss
In this section, we describe how to embed the discriminative structure into the manifold layers. The main idea is shown in Figure 2. Since there exists a distribution discrepancy between the domains (e.g., (a) in Figure 2), the conventional discriminant criterion is too strong to be satisfied in this case. To relax the constraint, our method only focuses on the inter-class separability of the source domain and the intra-class compactness of the target domain.
Without loss of generality, we only introduce the formulation of the loss terms in the $l$-th Riemannian manifold layer. Let $Z^{s}$ and $Z^{t}$ be the feature matrices of the source and target domains in that layer. Since the class centers of the source domain are used in both loss terms, the source mean vector $\bar{z}^{s}$ and the source class-wise mean matrix $M^{s} = [\mu_{1}^{s}, \dots, \mu_{C}^{s}]$ are computed, where $C$ is the number of classes.
Source InterClass Similarity
Though the traditional inter-class discriminant criterion is applicable on the source domain, a nice geometric structure of the class distribution is actually not guaranteed under the distance metric. To this end, a similarity measurement is utilized here, as shown in Figure 2 (b). Rather than computing the similarities between the class-wise centers and the total center directly, we process the class-wise centers as follows:
$$\bar{M}^{s} = M^{s} - \bar{z}^{s}\mathbf{1}_{C}^{\top}.$$
We call $\bar{M}^{s}$ the centralized class-wise means hereinafter. Further, if the columns of $\bar{M}^{s}$ are normalized with the $\ell_{2}$ norm, the cosine similarity matrix is derived as $S = \bar{M}^{s\top}\bar{M}^{s}$. Because $S_{ij}$ indicates the similarity between the $i$-th class and the $j$-th class, the diagonal elements are meaningless. Then the separable structure is reached by maximizing the dissimilarities between the centralized class-wise mean vectors. Equivalently, it can be achieved by minimizing the following inter-class loss:
$$\mathcal{L}_{inter}^{(l)} = \frac{1}{C(C-1)} \sum_{i \neq j} S_{ij}. \qquad (1)$$
Let us take Figure 2 (b) as an example. There is a 2-dimensional space with 3 classes. Let {1, 2, 3} denote the labels of “Ball”, “Pyramid” and “Cube”, respectively. According to the goal of Eq. (1) and ignoring the constraints, the optimal solution occurs when the three centralized class-wise means are mutually separated by $120^{\circ}$, and the minimal $\mathcal{L}_{inter}^{(l)}$ equals $-\frac{1}{2}$ (which can also be seen as the lower bound of constrained scenarios).
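The inter-class term and the three-class example can be sketched in a few lines of numpy; this is a hedged illustration of Eq. (1) under our reconstruction (function name and shapes are our own), not the released implementation:

```python
import numpy as np

def inter_class_loss(class_means):
    """class_means: (d, C) matrix of source class-wise mean vectors."""
    M = class_means - class_means.mean(axis=1, keepdims=True)  # centralised means
    M = M / np.linalg.norm(M, axis=0, keepdims=True)           # unit-length columns
    S = M.T @ M                                                # cosine similarity matrix
    C = S.shape[0]
    off_diag = S - np.eye(C)            # diagonal entries are meaningless
    return off_diag.sum() / (C * (C - 1))

# Figure 2(b) example: three classes in 2-D. The loss is minimised when the
# centralised means are mutually 120 degrees apart (pairwise cosine -1/2).
theta = np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
means = np.vstack([np.cos(theta), np.sin(theta)])  # columns already centred
print(round(inter_class_loss(means), 4))  # -0.5
```

Minimising the averaged off-diagonal cosine similarity pushes the class centers apart on the unit sphere, which is the separable structure the text describes.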
Target IntraClass Similarity
On the other hand, since there are no labels on the target domain, the discriminant learning is facilitated by the soft labels (i.e., the output of the softmax layer). Let $P \in \mathbb{R}^{C \times n_{t}}$ be the softmax predictions of the classifier layer. Since $P_{ci}$ can be regarded as the confidence or probability of classifying the $i$-th target sample into the $c$-th class, the predictions are used to weight the importance of the supervised information provided by the soft labels. Similarly, assume the columns of $\bar{M}^{s}$ and $Z^{t}$ have unit length. The similarities under all classification cases can be written as $G = \bar{M}^{s\top}Z^{t}$, which means that the source class-wise centers are utilized instead of the target ones. The main reasons are as follows: the inter-class structure learned from the source domain can be transduced to the target domain, and the source class-wise centers computed from ground-truth labels are more reliable. Because there is much uncertainty when pseudo labels are used straightforwardly on the target domain, we establish a probabilistic discriminative criterion to make the most of the information provided by soft labels. Intuitively, $P$ is a natural choice for the probabilistic weighting model. Then the probabilistic intra-class loss is formalized as
$$\mathcal{L}_{intra}^{(l)} = -\frac{1}{n_{t}} \sum_{i=1}^{n_{t}} \sum_{c=1}^{C} P_{ci}\, G_{ci}. \qquad (2)$$
However, there is much noise in $P$, whose values are very small. Especially when the softmax classifier approaches convergence, the columns of $P$ tend to be one-hot vectors. As truncation is an efficient way of denoising, we develop a Top-$k$ preserving scheme for the truncated intra-class loss. Let $\mathcal{I}_{i}$ be the index set of the $k$ largest elements in the $i$-th column of $P$, $i = 1, \dots, n_{t}$. Then a characteristic-function-like matrix $T$ is defined as
$$T_{ci} = \begin{cases} 1, & c \in \mathcal{I}_{i}, \\ 0, & \text{otherwise}. \end{cases}$$
Then, the intra-class loss is modified by the truncated matrix and written as
$$\mathcal{L}_{intra}^{(l)} = -\frac{1}{n_{t}} \sum_{i=1}^{n_{t}} \sum_{c=1}^{C} T_{ci}\, P_{ci}\, G_{ci}. \qquad (3)$$
A simple illustration is also shown in Figure 2 (a). Suppose the softmax output of the “Ball” sample in the figure assigns the highest confidence to “Ball”, less to “Pyramid” and the least to “Cube”. According to Eq. (2), all three similarities are taken into consideration, while the “Pyramid” and “Cube” terms are noise. If we adopt the Top-$2$ strategy in Eq. (3), the perturbation from the “Cube” can be excluded.
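The probabilistic weighting and Top-$k$ truncation can be sketched as follows; this is a hedged numpy illustration of Eqs. (2)-(3) under our reconstruction (names, shapes and the toy data are assumptions for illustration):

```python
import numpy as np

def intra_class_loss(src_means, tgt_feats, probs, k=2):
    """src_means: (d, C) anchors; tgt_feats: (d, n); probs: (C, n) softmax."""
    M = src_means / np.linalg.norm(src_means, axis=0, keepdims=True)
    Z = tgt_feats / np.linalg.norm(tgt_feats, axis=0, keepdims=True)
    G = M.T @ Z                              # (C, n) cosine similarities
    T = np.zeros_like(probs)
    idx = np.argsort(-probs, axis=0)[:k]     # per sample, k most confident classes
    np.put_along_axis(T, idx, 1.0, axis=0)   # Top-k truncation mask
    # Maximise confidence-weighted similarity => minimise its negative.
    return -(T * probs * G).sum() / probs.shape[1]

rng = np.random.default_rng(1)
anchors = rng.standard_normal((5, 3))
feats = anchors[:, [0]] + 0.1 * rng.standard_normal((5, 4))   # samples near class 1
probs = np.tile(np.array([[0.7], [0.2], [0.1]]), (1, 4))      # "Ball"-like softmax
loss_top2 = intra_class_loss(anchors, feats, probs, k=2)      # least-confident term masked
```

With `k=2`, the lowest-confidence class (the "Cube" analogue) contributes nothing; with `k=3` its small-probability term re-enters as noise.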
In conclusion, the proposed two loss terms build a probabilistic discriminant criterion on the target domain. The ground-truth labels on the source domain directly provide a reliable separable structure, where the intra-class structure is unnecessary. Then the target samples are attached to the corresponding source class-wise centers via soft labels. As shown in Figure 2 (c), the intra-class relationship on the source domain does not change much, while the discriminative property of the target domain is satisfied. Finally, the discriminative structure loss is denoted by $\mathcal{L}_{dis}^{(l)} = \mathcal{L}_{inter}^{(l)} + \mathcal{L}_{intra}^{(l)}$.
Global Structure Learning
For the batch scheme in deep models, it is hard to obtain the complete relation graph between instances. The direct application of classical graph embedding may be misled by some extreme local distributions, which results in a suboptimal solution.
Supposing that the geometry of the manifolds does not change drastically after several updates, we can build some anchors in the whole data space to acquire the global information. In this work, we propose to fix the anchors in each batch iteration and update them after every epoch. Specifically, the anchors, i.e., the source means in the inter-class loss Eq. (1) and the source class-wise centers in the intra-class loss Eq. (3), are computed from the last epoch. Note that the anchors are treated as constants in the optimization, while the remaining quantities in Eq. (1) and Eq. (3) are obtained from the batch data. The inter-class loss, strongly supervised by source labels, is imposed from the beginning, while the intra-class loss, facilitated by soft labels, is equipped after a certain number of iterations/epochs.
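A hedged sketch of the anchor scheme (our own minimal numpy illustration, not the released implementation): class-wise centres are computed once per epoch over all features, stored as a small (d, C) matrix, and treated as constant anchors during the next epoch's batch iterations:

```python
import numpy as np

def classwise_centres(feats, labels, num_classes):
    """feats: (d, n); labels: (n,) -> (d, C) matrix of class-wise means."""
    centres = np.zeros((feats.shape[0], num_classes))
    for c in range(num_classes):
        centres[:, c] = feats[:, labels == c].mean(axis=1)
    return centres

# Per-epoch loop on pseudo-data: anchors from the previous epoch are frozen
# while the batches of the current epoch are processed, then refreshed.
rng = np.random.default_rng(2)
feats = rng.standard_normal((16, 120))
labels = np.repeat(np.arange(4), 30)        # 4 classes, 30 samples each
anchors = classwise_centres(feats, labels, num_classes=4)  # fixed for next epoch
```

Since only the (d, C) centre matrix is stored between epochs, rather than the full instance-level relation graph, the scheme is memory-saving in the sense described above.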
Manifold Metric Alignment Loss
VisDA-2017  Plane  bcycl  bus  car  horse  knife  mcyle  person  plant  sktbrd  train  truck  Mean
ResNet-101 [He et al. 2016]  55.1  53.3  61.9  59.1  80.6  17.9  79.7  31.2  81.0  26.5  73.5  8.5  52.4
DAN [Long et al. 2015]  87.1  63.0  76.5  42.0  90.3  42.9  85.9  53.1  49.7  36.3  85.8  20.7  61.1
DANN [Ganin et al. 2016]  81.9  77.7  82.8  44.3  81.2  29.5  65.1  28.6  51.9  54.6  82.8  7.8  57.4
MCD [Saito et al. 2018]  87.0  60.9  83.7  64.0  88.9  79.6  84.7  76.9  88.6  40.3  83.0  25.8  71.9
SimNet [Pinheiro 2018]  94.3  82.3  73.5  47.2  87.9  49.2  75.1  79.7  85.3  68.5  81.1  50.3  72.9
GTA [Sankaranarayanan et al. 2018]  -  -  -  -  -  -  -  -  -  -  -  -  77.1
CDAN [Long et al. 2018]  85.2  66.9  83.0  50.8  84.2  74.9  88.1  74.5  83.4  76.0  81.9  38.0  73.7
GPDA [Kim et al. 2019]  83.0  74.3  80.4  66.0  87.6  75.3  83.8  73.1  90.1  57.3  80.2  37.9  73.3
BSP+DANN [Chen et al. 2019b]  92.2  72.5  83.8  47.5  87.0  54.0  86.8  72.4  80.6  66.9  84.5  37.1  72.1
BSP+CDAN [Chen et al. 2019b]  92.4  61.0  81.0  57.5  89.0  80.6  90.1  77.0  84.2  77.9  82.1  38.4  75.9
DRMEA (No AL)  92.8  15.3  86.7  86.3  93.8  70.7  95.2  68.9  95.8  40.4  85.1  5.6  69.7
DRMEA (No DS)  90.2  66.5  70.2  65.8  79.8  81.8  84.7  70.1  82.0  46.5  88.1  27.7  71.1
DRMEA  92.1  75.0  78.9  75.5  91.2  81.9  89.0  77.2  93.3  77.4  84.8  35.1  79.3
Office-Home  Ar→Cl  Ar→Pr  Ar→Rw  Cl→Ar  Cl→Pr  Cl→Rw  Pr→Ar  Pr→Cl  Pr→Rw  Rw→Ar  Rw→Cl  Rw→Pr  Mean
ResNet-50 [He et al. 2016]  34.9  50.0  58.0  37.4  41.9  46.2  38.5  31.2  60.4  53.9  41.2  59.9  46.1
DAN [Long et al. 2015]  43.6  57.0  67.9  45.8  56.5  60.4  44.0  43.6  67.7  63.1  51.5  74.3  56.3
DANN [Ganin et al. 2016]  45.6  59.3  70.1  47.0  58.5  60.9  46.1  43.7  68.5  63.2  51.8  76.8  57.6
JAN [Long et al. 2017]  45.9  61.2  68.9  50.4  59.7  61.0  45.8  43.4  70.3  63.9  52.4  76.8  58.3
CDAN [Long et al. 2018]  49.0  69.3  74.5  54.4  66.0  68.4  55.6  48.3  75.9  68.4  55.4  80.5  63.8
CDAN+E [Long et al. 2018]  50.7  70.6  76.0  57.6  70.0  70.0  57.4  50.9  77.3  70.9  56.7  81.6  65.8
BSP+DANN [Chen et al. 2019b]  51.4  68.3  75.9  56.0  67.8  68.8  57.0  49.6  75.8  70.4  57.1  80.6  64.9
BSP+CDAN [Chen et al. 2019b]  52.0  68.6  76.1  58.0  70.3  70.2  58.6  50.2  77.6  72.2  59.3  81.9  66.3
DRMEA (No AL)  51.9  72.8  77.1  63.0  72.0  71.3  60.5  49.5  78.4  71.5  54.4  82.8  67.1
DRMEA (No DS)  51.2  72.4  77.7  63.0  71.4  71.4  58.6  44.6  79.1  71.1  53.4  81.5  66.3
DRMEA  52.3±0.4  73.0±0.6  77.3±0.3  64.3±0.3  72.0±0.7  71.8±0.5  63.6±0.6  52.7±0.7  78.5±0.2  72.0±0.1  57.7±0.6  81.6±0.2  68.1±0.2
To satisfy the second property, i.e., the consistent structure, a manifold metric alignment method is developed. As mentioned before, the covariance matrix is an important tool for representing a manifold. Therefore, alignment based on covariance not only meets the requirements of a manifold metric, but also attains some nice statistical properties, such as matching the distribution assumption.
Grassmannian Metric
Let $C_{s}$ and $C_{t}$ be the covariance matrices of the source and target domains computed from the batch-wise features, respectively. Assume $\mathcal{M}_{s}$ and $\mathcal{M}_{t}$ are two submanifolds represented by their corresponding covariance matrices. Before the alignment process, these two submanifolds are only partially overlapped, and our goal is to minimize the discrepancy between them under a manifold metric. In general, the manifold metric alignment loss of the $l$-th layer is expressed as
$$\mathcal{L}_{align}^{(l)} = d^{2}\!\left(\mathcal{M}_{s}^{(l)}, \mathcal{M}_{t}^{(l)}\right), \qquad (4)$$
where $d(\cdot, \cdot)$ is the manifold metric to be determined.
The Grassmannian manifold is a well-known type of Riemannian manifold: the set of $k$-dimensional linear subspaces of the originally $m$-dimensional space, $k \le m$. Thus the two submanifolds $\mathcal{M}_{s}$ and $\mathcal{M}_{t}$ lying on the Grassmannian manifold are represented as two individual points, and the distance between these two points is measured by the discrepancy between their orthogonal bases $U_{s}$ and $U_{t}$. Specifically, the orthogonal basis of such a $k$-dimensional subspace consists of the $k$ dominant singular vectors of its representation matrix. Thus $U_{s}$ and $U_{t}$ are two column-orthogonal matrices, which can be obtained from the Singular Value Decomposition (SVD) of the covariance matrices $C_{s}$ and $C_{t}$, respectively. Finally, the Grassmannian distance is measured by
$$d_{g}\!\left(\mathcal{M}_{s}, \mathcal{M}_{t}\right) = \frac{1}{\sqrt{2}} \left\| U_{s} U_{s}^{\top} - U_{t} U_{t}^{\top} \right\|_{F}, \qquad (5)$$
where $\|\cdot\|_{F}$ is the Frobenius norm. Thus the manifold metric alignment loss can be written as $\mathcal{L}_{align}^{(l)} = d_{g}^{2}\!\left(\mathcal{M}_{s}^{(l)}, \mathcal{M}_{t}^{(l)}\right)$.
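The Grassmannian (projection-metric) distance can be sketched directly from batch covariances; this is a hedged numpy illustration of Eq. (5) under our reconstruction, with illustrative shapes and data:

```python
import numpy as np

def grassmann_distance(cov_s, cov_t, k):
    """Projection-metric distance between the top-k subspaces of two covariances."""
    Us = np.linalg.svd(cov_s)[0][:, :k]   # k dominant singular vectors (source)
    Ut = np.linalg.svd(cov_t)[0][:, :k]   # k dominant singular vectors (target)
    return np.linalg.norm(Us @ Us.T - Ut @ Ut.T) / np.sqrt(2)

rng = np.random.default_rng(3)
Xs = rng.standard_normal((32, 50))        # batch features, rows = dimensions
Xt = rng.standard_normal((32, 50)) + 0.5
cov = lambda X: np.cov(X)
d_st = grassmann_distance(cov(Xs), cov(Xt), k=10)
```

Working with projectors `U U^T` rather than `U` itself makes the distance invariant to the sign/rotation ambiguity of the SVD basis, which is why identical subspaces yield exactly zero.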
Error Bound of Grassmannian Metric
As the dimension $k$ is needed in the Grassmannian distance, we establish a theoretical error bound for it. Following the previous works [Zwald and Blanchard 2006, Fernando et al. 2013], we denote the covariance of a given distribution by $C$, and the covariance estimated from $n$ samples drawn i.i.d. from that distribution by $C_{n}$. Zwald and Blanchard [Zwald and Blanchard 2006] give the following theorem.
Theorem 1.
[Zwald and Blanchard 2006] Let $B$ be s.t. for any vector $x$, $\|x\| \le B$; let $P_{C}^{k}$ and $P_{C_{n}}^{k}$ be the orthogonal projectors onto the subspaces spanned by the first $k$ eigenvectors of $C$ and $C_{n}$, respectively. Let $\lambda_{1} > \lambda_{2} > \cdots > \lambda_{k} > \lambda_{k+1} \ge 0$ be the first $k+1$ eigenvalues of $C$. Then for any $n \ge \left(\frac{4B}{\lambda_{k} - \lambda_{k+1}}\right)^{2}$, with probability at least $1 - \delta$ we have:
$$\left\| P_{C}^{k} - P_{C_{n}}^{k} \right\| \le \frac{4B}{\sqrt{n}\,(\lambda_{k} - \lambda_{k+1})} \left( 1 + \sqrt{\frac{\ln(1/\delta)}{2}} \right). \qquad (6)$$
The above theorem shows the relation between the projection error and the eigengap $\lambda_{k} - \lambda_{k+1}$. Define the right side of Eq. (6) as the error bound $\epsilon(k)$. To extend the inequality to the Grassmannian distance, we derive the following lemma.
Lemma 2.
Based on Lemma 2, the following theorem gives the error of the Grassmannian distance with respect to its sample approximation.
Theorem 3.
Assume the conditions in Theorem 1 are specified for both domains. Specifically, let $\lambda_{k}^{s}$ and $\lambda_{k}^{t}$ denote the $k$-th largest eigenvalues of the domain-specific covariance matrices $C_{s}$ and $C_{t}$, respectively, and denote the corresponding error index by $\epsilon_{k}$. Then, with probability at least $1 - 2\delta$, the deviation of the Grassmannian distance from its sample approximation is bounded by a quantity proportional to $\epsilon_{k}$.
Theorem 3 suggests that the upper bound of the error is proportional to the error index $\epsilon_{k}$. This means we should search for the maximal gap between consecutive eigenvalues while taking the inflation factor into consideration. Recall that in the batch learning setting, the batch size is usually smaller than the feature dimension, so $k$ only needs to be searched within the range limited by the batch size. The proofs are given in the Supplementary.
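The dimension-selection rule implied by the bound can be sketched as an eigengap search; this is a hedged numpy illustration of the idea (function name, search range and the synthetic spectrum are our own assumptions, and the inflation factor is omitted for simplicity):

```python
import numpy as np

def select_k(covariance, max_k):
    """Pick the subspace dimension at the largest consecutive eigengap."""
    eigvals = np.sort(np.linalg.eigvalsh(covariance))[::-1]   # descending
    gaps = eigvals[:max_k] - eigvals[1:max_k + 1]             # lambda_k - lambda_{k+1}
    return int(np.argmax(gaps)) + 1                           # 1-indexed k

# Synthetic covariance with a planted gap: 10 large eigenvalues, 30 tiny ones.
rng = np.random.default_rng(4)
eigs = np.concatenate([np.linspace(10.0, 5.0, 10), 0.05 * np.ones(30)])
Q, _ = np.linalg.qr(rng.standard_normal((40, 40)))
covariance = Q @ np.diag(eigs) @ Q.T
k = select_k(covariance, max_k=20)   # the gap at k = 10 dominates
```

Since the error bound grows as the inverse eigengap, choosing $k$ at the largest gap within the batch-limited range keeps the bound small, matching the behaviour described above.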
Experiments and Comparative Analysis
In this section, three popular domain adaptation datasets are selected, and the standard evaluation protocols are adopted.
Office-Home [Venkateswara et al. 2017] contains 4 domains, i.e., Art (Ar), Clipart (Cl), Product (Pr) and Real-World (Rw).
ImageCLEF-DA consists of 12 common categories shared by three public datasets, which form the domains: Caltech-256 (C), ImageNet ILSVRC 2012 (I) and Pascal VOC 2012 (P).
VisDA-2017 [Peng et al. 2017] is a large-scale visual domain adaptation challenge dataset. The synthetic-to-real image track is evaluated here.
Setup
A two-layer Riemannian manifold learning scheme is carried out in all experiments, where the first layer (1024-d) is activated by Leaky ReLU and the second layer (512-d) by Tanh. The Adam optimizer with a batch size of 50 is utilized on the Office-Home and ImageCLEF-DA datasets; the modified mini-batch SGD [Ganin et al. 2016] with momentum 0.9 and a batch size of 32 is employed on the VisDA-2017 challenge. The learning rate of the CNN backbone layers is set smaller than that of the new layers. The hyper-parameters, including the penalty parameters $\lambda_{1}$ and $\lambda_{2}$, are determined by trial and error. The Top-$k$ scheme is adopted for the target intra-class loss in Eq. (3). For the ablation study, the models without the discriminative structure loss and without the manifold metric alignment loss are abbreviated as DRMEA (No DS) and DRMEA (No AL), respectively.
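The two progressive manifold layers from this setup can be sketched in plain numpy; a hedged illustration only, where the weight initialisation and the 2048-d backbone output are our assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(5)
W1 = rng.standard_normal((2048, 1024)) * 0.01   # first manifold layer weights
W2 = rng.standard_normal((1024, 512)) * 0.01    # second manifold layer weights

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def manifold_layers(feats):
    """feats: (n, 2048) backbone features -> (n, 512) manifold features."""
    h1 = leaky_relu(feats @ W1)   # first progressive layer (1024-d, Leaky ReLU)
    return np.tanh(h1 @ W2)       # second progressive layer (512-d, Tanh)

z = manifold_layers(rng.standard_normal((4, 2048)))
```

The progressive 2048 → 1024 → 512 reduction mirrors the multi-layer scheme described earlier, with the bounded Tanh output giving the final embedding a compact range.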
Results Analysis
Error Bound of Grassmannian Distance
ImageCLEF-DA  I→P  P→I  I→C  C→I  C→P  P→C  Mean
ResNet-50 [He et al. 2016]  74.8±0.3  83.9±0.1  91.5±0.3  78.0±0.2  65.5±0.3  91.2±0.3  80.7
DAN [Long et al. 2015]  74.5±0.4  82.2±0.2  92.8±0.2  86.3±0.4  69.2±0.4  89.8±0.4  82.5
DANN [Ganin et al. 2016]  75.0±0.3  86.0±0.3  96.2±0.4  87.0±0.5  74.3±0.5  91.5±0.6  85.0
JAN [Long et al. 2017]  76.8±0.4  88.0±0.2  94.7±0.2  89.5±0.3  74.2±0.3  91.7±0.3  85.8
CDAN [Long et al. 2018]  76.7±0.3  90.6±0.3  97.0±0.4  90.5±0.4  74.5±0.3  93.5±0.4  87.1
CDAN+E [Long et al. 2018]  77.7±0.3  90.7±0.2  97.7±0.3  91.3±0.3  74.2±0.2  94.3±0.3  87.7
DRMEA (No AL)  78.0±0.1  91.1±0.1  95.6±0.2  88.7±0.3  74.8±0.1  94.8±0.2  87.3
DRMEA (No DS)  78.9±0.1  90.5±0.2  94.0±0.1  87.8±0.1  76.7±0.2  93.0±0.1  86.8
DRMEA  80.7±0.1  92.5±0.1  97.2±0.1  90.5±0.1  77.7±0.2  96.2±0.2  89.1
The numerical simulation is conducted on the ImageCLEF-DA dataset to explore the minimal error index. It is known that eigenvalues always decrease rapidly at the beginning and then enter a flat state, so the error bounds for dimensionalities located in the flat area are too high to assess. As shown in Figure 3, the trend of the eigenvalues is consistent with this description. Though the dramatic decrease in the beginning stage results in a lower error, the information in that area is unconvincing and insufficient to support the measurement of the manifolds. Since there is a natural gap between two consecutive dominant eigenvalues, the corresponding error index is smaller than most of the others. We highlight this error index by the blue dashed line in Figure 3, and observe that only a few errors are lower than it. Empirically, the dimensionality of the Grassmannian manifold is set accordingly hereinafter.
Convergence
The convergence curves on the Office-31 A→W adaptation task are displayed in Figure 3. At the beginning, the objective loss decreases quickly, and the recognition rate tends to enter a stable region around epochs 10-15. The intra-class constraint is then imposed after 15 epochs, which further activates the learning of the discriminative structure. Thus a second ascent of accuracy on the target domain occurs after 15 epochs, leading to a continuous improvement of the recognition rate and alleviating the overfitting on the source domain.
Comparison
Several state-of-the-art UDA approaches are selected and shown in Tables 1-2. The experimental results on the VisDA-2017 dataset are shown at the top of Table 1, from which we observe that DRMEA outperforms the others by a large margin in average accuracy. Performance on the Office-Home dataset is provided at the bottom of Table 1; the proposed method improves the average accuracy to 68.1% and obtains the highest accuracy in most of the adaptation tasks. Results on the ImageCLEF-DA dataset are provided in Table 2. As the discrepancy between the source and target domains on ImageCLEF-DA is relatively smaller than on the other datasets, a more discriminative model is essential for improving the recognition accuracy. DRMEA encodes the discriminant criterion and the alignment constraint simultaneously, and thus outperforms the other methods by at least 1.4%.
The ablation results also show that the Riemannian manifold learning framework takes full effect only when both loss terms are equipped. As the discriminative structure loss provides a separable structure and the manifold metric alignment loss bridges the distribution discrepancy between the source and target domains based on the Grassmannian distance, both losses are important.
Visualization
Figure 4 shows the 2-D representation spaces obtained from the t-SNE [Maaten and Hinton 2008] algorithm on the VisDA-2017 dataset. CDAN+E shortens the distance between source and target by adversarial alignment, but some classes are dragged away from the center, e.g., plant, car, horse, aeroplane and bicycle. In the third column, our method further optimizes the structure of the representation space: the categories are aligned better than with ResNet-101 and CDAN+E, which leads to a more compact target space.
Conclusion
In this paper, we develop a Riemannian manifold embedding and alignment framework for UDA, where transferability and discriminability are reached consistently. To optimize the structure of the target domain, the soft labels are encoded into the discriminant criterion probabilistically and transductively. Then a globally discriminative structure is approximated in a memory-saving manner. A theoretical error bound is derived, which guarantees finding an appropriate dimension for the manifolds during alignment. Numerical simulation and extensive comparisons demonstrate the effectiveness of the derived theorem and the proposed method. Further reducing the dependence of our proposal on intermediate target predictions is left for future work.
Acknowledgments
This work is supported by the National Natural Science Foundation of China under Grants 61976229, 61976104, 61906046, 61572536, 11631015 and U1611265.
References
 Ben-David, S.; Blitzer, J.; Crammer, K.; and Pereira, F. 2007. Analysis of representations for domain adaptation. In Advances in Neural Information Processing Systems, 137–144.
 BenDavid, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; and Vaughan, J. W. 2010. A theory of learning from different domains. Machine Learning 79(12):151–175.
 Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8):1798–1828.
 Chen, C.; Chen, Z.; Jiang, B.; and Jin, X. 2019a. Joint domain alignment and discriminative feature learning for unsupervised deep domain adaptation. In AAAI Conference on Artificial Intelligence.
 Chen, X.; Wang, S.; Long, M.; and Wang, J. 2019b. Transferability vs. discriminability: Batch spectral penalization for adversarial domain adaptation. In International Conference on Machine Learning, 1081–1090.
 Ding, Z., and Fu, Y. 2019. Deep transfer low-rank coding for cross-domain learning. IEEE Transactions on Neural Networks and Learning Systems 30(6):1768–1779.
 Fernando, B.; Habrard, A.; Sebban, M.; and Tuytelaars, T. 2013. Unsupervised visual domain adaptation using subspace alignment. In IEEE International Conference on Computer Vision, 2960–2967.
 Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research 17(1):2096–2030.
 Gong, B.; Shi, Y.; Sha, F.; and Grauman, K. 2012. Geodesic flow kernel for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 2066–2073.
 Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.
 He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
 Huang, Z.; Wang, R.; Shan, S.; Van Gool, L.; and Chen, X. 2017. Cross Euclidean-to-Riemannian metric learning with application to face recognition from video. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12):2827–2840.
 Kim, M.; Sahu, P.; Gholami, B.; and Pavlovic, V. 2019. Unsupervised visual domain adaptation: A deep max-margin Gaussian process approach. In IEEE Conference on Computer Vision and Pattern Recognition, 4380–4390.
 LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436–444.
 Lee, C.-Y.; Batra, T.; Baig, M. H.; and Ulbricht, D. 2019. Sliced Wasserstein discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 10285–10295.
 Long, M.; Wang, J.; Ding, G.; Sun, J.; and Yu, P. S. 2013. Transfer feature learning with joint distribution adaptation. In IEEE International Conference on Computer Vision, 2200–2207.
 Long, M.; Cao, Y.; Wang, J.; and Jordan, M. 2015. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning, 97–105.
 Long, M.; Zhu, H.; Wang, J.; and Jordan, M. I. 2017. Deep transfer learning with joint adaptation networks. In International Conference on Machine Learning, volume 70, 2208–2217.
 Long, M.; Cao, Z.; Wang, J.; and Jordan, M. I. 2018. Conditional adversarial domain adaptation. In Advances in Neural Information Processing Systems, 1640–1650.
 Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605.
 Moreno-Torres, J. G.; Raeder, T.; Alaiz-Rodríguez, R.; Chawla, N. V.; and Herrera, F. 2012. A unifying view on dataset shift in classification. Pattern Recognition 45(1):521–530.
 Pan, S. J.; Tsang, I. W.; Kwok, J. T.; and Yang, Q. 2010. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22(2):199–210.
 Peng, X.; Usman, B.; Kaushik, N.; Hoffman, J.; Wang, D.; and Saenko, K. 2017. Visda: The visual domain adaptation challenge. arXiv preprint arXiv:1710.06924.
 Pinheiro, P. O. 2018. Unsupervised domain adaptation with similarity learning. In IEEE Conference on Computer Vision and Pattern Recognition, 8004–8013.
 Ren, C.; Ge, P.; Dai, D.; and Yan, H. 2019. Learning kernel for conditional moment-matching discrepancy-based image classification. IEEE Transactions on Cybernetics.
 Ren, C.; Xu, X.; and Yan, H. 2018. Generalized conditional domain adaptation: A causal perspective with low-rank translators. IEEE Transactions on Cybernetics.
 Saito, K.; Watanabe, K.; Ushiku, Y.; and Harada, T. 2018. Maximum classifier discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 3723–3732.
 Sankaranarayanan, S.; Balaji, Y.; Castillo, C. D.; and Chellappa, R. 2018. Generate to adapt: Aligning domains using generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, 8503–8512.
 Sejdinovic, D.; Sriperumbudur, B.; Gretton, A.; Fukumizu, K.; et al. 2013. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics 41(5):2263–2291.
 Shimodaira, H. 2000. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90(2):227–244.
 Sun, B.; Feng, J.; and Saenko, K. 2016. Return of frustratingly easy domain adaptation. In AAAI Conference on Artificial Intelligence.
 Venkateswara, H.; Eusebio, J.; Chakraborty, S.; and Panchanathan, S. 2017. Deep hashing network for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition, 5018–5027.
 Yan, S.; Xu, D.; Zhang, B.; Zhang, H.J.; Yang, Q.; and Lin, S. 2007. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence (1):40–51.
 Yosinski, J.; Clune, J.; Bengio, Y.; and Lipson, H. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, 3320–3328.
 Zwald, L., and Blanchard, G. 2006. On the convergence of eigenspaces in kernel principal component analysis. In Advances in Neural Information Processing Systems, 1649–1656.