Learning an Invariant Hilbert Space for Domain Adaptation
Abstract
This paper introduces a learning scheme to construct a Hilbert space (i.e., a vector space along with its inner product) to address both unsupervised and semi-supervised domain adaptation problems. This is achieved by learning projections from each domain to a latent space, along with the Mahalanobis metric of that latent space, so as to simultaneously minimize a notion of domain variance while maximizing a measure of discriminatory power. In particular, we make use of Riemannian optimization techniques to match statistical properties (e.g., first- and second-order statistics) between samples projected into the latent space from different domains. Upon availability of class labels, we further deem samples sharing the same label to form more compact clusters while pulling away samples coming from different classes. We extensively evaluate and contrast our proposal against state-of-the-art methods for the task of visual domain adaptation using both hand-crafted and deep-net features. Our experiments show that even with a simple nearest neighbor classifier, the proposed method can outperform several state-of-the-art methods that benefit from more involved classification schemes.
Mehrtash Harandi (mehrtash.harandi@data61.csiro.au)
College of Engineering and Computer Science, Australian National University & Data61, CSIRO, Canberra, Australia

Fatih Porikli (fatih.porikli@anu.edu.au)
College of Engineering and Computer Science, Australian National University, Canberra, Australia
1 Introduction
This paper presents a learning algorithm to address both unsupervised (Gong et al., 2012; Fernando et al., 2013; Sun et al., 2016) and semi-supervised (Hoffman et al., 2014; Duan et al., 2012; Hubert Tsai et al., 2016) domain adaptation problems. Our goal here is to learn a latent space in which domain disparities are minimized. We show that such a space can be learned by first matching the statistical properties of the projected domains (e.g., covariance matrices), and then adapting the Mahalanobis metric of the latent space to the labeled data, i.e., minimizing the distances between pairs sharing the same class label while pulling away samples with different class labels. We develop a geometrical solution to jointly learn the projections onto the latent space and the Mahalanobis metric by making use of concepts from Riemannian geometry.
Thanks to deep learning, we are witnessing rapid growth in the classification accuracy of imaging techniques when a substantial amount of labeled data is provided (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014; He et al., 2016; Herath et al., 2017). However, harnessing the attained knowledge in a new application with limited labeled data (or even without labels) is far from straightforward (Koniusz et al., 2017; Long et al., 2016; Ganin and Lempitsky, 2015; Tzeng et al., 2014; Chen et al., 2015). To make things even more complicated, due to the inherent bias of datasets (Torralba and Efros, 2011; Shimodaira, 2000), straightforward use of large amounts of auxiliary data does not necessarily assure improved performance. For example, the ImageNet (Russakovsky et al., 2015) data is hardly useful for an application designed to classify images captured by a mobile phone camera. Domain adaptation (DA) is the science of reducing such undesired effects when transferring knowledge from available auxiliary resources to a new problem.
The most natural solution to the DA problem is to identify the structure of a common space that minimizes a notion of domain mismatch. Once such a space is obtained, one can design a classifier in it, hoping that the classifier will perform equally well across the domains since the domain mismatch is minimized. (In DA terminology, the target domain refers to the data directly related to the task; source domain data is used as auxiliary data for knowledge transfer.) Towards this end, several studies assume that either 1. a subspace of the target domain is the right space to perform DA and learn how the source domain should be mapped onto it (Saenko et al., 2010; Hubert Tsai et al., 2016), or 2. subspaces obtained from both source and target domains are equally important for classification, hence trying to learn either their evolution (Gopalan et al., 2011; Gong et al., 2012) or a similarity measure between them (Shi et al., 2010; Wang and Mahadevan, 2011; Duan et al., 2012).
Objectively speaking, a common practice in many solutions, including the aforementioned methods, is to simplify the learning problem by separating its two elements. That is, the algorithm starts by fixing a space (e.g., the source subspace in Fernando et al. (2013); Hubert Tsai et al. (2016)), and then learns how to transfer knowledge between the domains accordingly. A curious mind may ask why we should resort to a predefined and fixed space in the first place.
In this paper, we propose a learning scheme that avoids such a separation. That is, we do not assume that a space or a transformation is known a priori and fixed for DA. In essence, we propose to jointly learn the structure of a Hilbert space (i.e., its metric) along with the transformations required to map the domains onto it.
This is achieved through the following contributions,

We propose to learn the structure of a latent space, along with its associated mappings from the source and target domains, to address both the unsupervised and semi-supervised DA problems.

Towards this end, we propose to maximize a notion of discriminatory power in the latent space. At the same time, we seek the latent space to minimize a notion of statistical mismatch between the source and target domains (see Fig. 1 for a conceptual diagram).

Given the complexity of the resulting problem, we provide a rigorous mathematical modeling of it. In particular, we make use of Riemannian geometry and optimization techniques on matrix manifolds to solve our learning problem (our implementation is available at https://sherath@bitbucket.org/sherath/ils.git).

We extensively evaluate and contrast our solution against several baseline and state-of-the-art methods in addressing both unsupervised and semi-supervised DA problems.
2 Proposed Method
In this work, we are interested in learning an Invariant Latent Space (ILS) to reduce the discrepancy between domains. We first define our notation. Bold capital letters denote matrices (e.g., {\boldsymbol{X}}) and bold lowercase letters denote column vectors (e.g., {\boldsymbol{x}}). \mathbf{I}_{n} is the n\times n identity matrix. \mathcal{S}_{++}^{n} and \mathrm{St}({n},{p}) denote the SPD and Stiefel manifolds, respectively, and will be formally defined later. We denote the source and target domains by \mathcal{X}_{s}\subset\mathbb{R}^{s} and \mathcal{X}_{t}\subset\mathbb{R}^{t}. The training samples from the source and target domains are denoted by \{{\boldsymbol{x}}^{s}_{i},y^{s}_{i}\}_{i=1}^{n_{s}} and \{{\boldsymbol{x}}^{t}_{i}\}_{i=1}^{n_{t}}, respectively. For now, we assume only the source data is labeled. Later, we discuss how the proposed learning framework can benefit from labeled target data.
Our idea in learning an ILS is to determine the transformations \mathbb{R}^{s\times p}\ni{\boldsymbol{W}}_{s}:\mathcal{X}_{s}\to\mathcal{H} and \mathbb{R}^{t\times p}\ni{\boldsymbol{W}}_{t}:\mathcal{X}_{t}\to\mathcal{H} from the source and target domains to a latent p-dimensional space \mathcal{H}\subset\mathbb{R}^{p}. We furthermore want to equip the latent space with a Mahalanobis metric, {\boldsymbol{M}}\in\mathcal{S}_{++}^{p}, to reduce the discrepancy between the projected source and target samples (see Fig. 1 for a conceptual diagram).
To learn {\boldsymbol{W}}_{s}, {\boldsymbol{W}}_{t} and {\boldsymbol{M}} we propose to minimize a cost function in the form
\mathcal{L}=\mathcal{L}_{d}+\lambda\mathcal{L}_{u}\;.  (1) 
In Eq. 1, \mathcal{L}_{d} is a measure of dissimilarity between labeled samples. The term \mathcal{L}_{u} quantifies a notion of statistical difference between the source and target samples in the latent space. As such, minimizing \mathcal{L} leads to learning a latent space where not only is the dissimilarity between labeled samples reduced, but the domains are also matched from a statistical point of view. The combination weight \lambda balances the two terms. The subscripts “d” and “u” in Eq. 1 stand for “Discriminative” and “Unsupervised”; the reason behind such naming will become clear shortly. Below we detail the form and properties of \mathcal{L}_{d} and \mathcal{L}_{u}.
2.1 Discriminative Loss
The purpose of having \mathcal{L}_{d} in Eq. 1 is to equip the latent space \mathcal{H} with a metric that 1. minimizes the dissimilarities between samples coming from the same class and 2. maximizes the dissimilarities between samples from different classes.
Let {\boldsymbol{Z}}=\{{\boldsymbol{z}}_{j}\}_{j=1}^{n} be the set of labeled samples in \mathcal{H}. In unsupervised domain adaptation, {\boldsymbol{z}}_{j}={\boldsymbol{W}}_{s}^{T}{\boldsymbol{x}}^{s}_{j} and n=n_{s}. In the case of semi-supervised domain adaptation,
{\boldsymbol{Z}}=\Big\{{\boldsymbol{W}}_{s}^{T}{\boldsymbol{x}}^{s}_{j}\Big\}_{j=1}^{n_{s}}\bigcup\Big\{{\boldsymbol{W}}_{t}^{T}{\boldsymbol{x}}^{t}_{j}\Big\}_{j=1}^{n_{tl}},
where we assume n_{tl} labeled target samples are provided (out of the available n_{t} samples). From the labeled samples in \mathcal{H}, we create N_{p} pairs of the form ({\boldsymbol{z}}_{1,k},{\boldsymbol{z}}_{2,k}),~{}k=1,2,\cdots,{N_{p}}, along with their associated labels y_{k}\in\{1,-1\}. Here, y_{k}=1 iff the label of {\boldsymbol{z}}_{1,k} matches that of {\boldsymbol{z}}_{2,k}, and y_{k}=-1 otherwise. That is, the pair ({\boldsymbol{z}}_{1,k},{\boldsymbol{z}}_{2,k}) is similar if y_{k}=1 and dissimilar otherwise.
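To make the pair-construction step concrete, the following is a minimal NumPy sketch (the helper name make_pairs and its interface are ours, not part of the proposed method); it enumerates similar and dissimilar pairs from labeled samples and balances the two sets by sub-sampling the dissimilar pairs, mirroring the strategy later described in § 5.1.

```python
import numpy as np
from itertools import combinations

def make_pairs(y, rng=np.random.default_rng(0)):
    """Build (index_1, index_2, y_k) triplets from labels.

    y : (n,) array of class labels of the latent samples.
    y_k = +1 for a same-class pair, -1 otherwise; dissimilar pairs are
    sub-sampled so both sets have the same size (assumes there are at
    least as many dissimilar as similar pairs).
    """
    sim, dis = [], []
    for i, j in combinations(range(len(y)), 2):
        (sim if y[i] == y[j] else dis).append((i, j))
    keep = rng.choice(len(dis), size=len(sim), replace=False)
    dis = [dis[k] for k in keep]
    return [(i, j, +1) for i, j in sim] + [(i, j, -1) for i, j in dis]
```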
To learn the metric {\boldsymbol{M}}, we deem the distances between the similar pairs to be small while simultaneously making the distances between the dissimilar pairs large. In particular, we define \mathcal{L}_{d} as,
\mathcal{L}_{d}=\frac{1}{N_{p}}\sum\limits_{k=1}^{N_{p}}\ell_{\beta}\big({\boldsymbol{M}},y_{k},{\boldsymbol{z}}_{1,k}-{\boldsymbol{z}}_{2,k},1\big)+r({\boldsymbol{M}}),  (2)
with
\ell_{\beta}\big({\boldsymbol{M}},y,{\boldsymbol{x}},u\big)=\frac{1}{\beta}\log\Big(1+\exp\big(\beta y({\boldsymbol{x}}^{T}{\boldsymbol{M}}{\boldsymbol{x}}-u)\big)\Big).  (3)
Here, \ell_{\beta} is the generalized logistic function endowed with a large-margin structure (see Fig. 2), having a margin of u (for now we keep the margin at u=1; later we will use it to explain the soft-margin extension). First note that the quadratic term in Eq. 3 (i.e., {\boldsymbol{x}}^{T}{\boldsymbol{M}}{\boldsymbol{x}}) measures the Mahalanobis distance between {\boldsymbol{z}}_{1,k} and {\boldsymbol{z}}_{2,k} when used according to Eq. 2. Also note that \ell_{\beta}\big(\cdot,\cdot,{\boldsymbol{x}},\cdot\big)=\ell_{\beta}\big(\cdot,\cdot,-{\boldsymbol{x}},\cdot\big); hence, the order of the samples within a pair is not important.
To better understand the behavior of the function \ell_{\beta}, assume the function is fed with a similar pair, i.e., y_{k}=1. For the sake of discussion, also assume \beta=1. In this case, \ell_{\beta} is decreased if the distance between {\boldsymbol{z}}_{1,k} and {\boldsymbol{z}}_{2,k} is reduced. For a dissimilar pair (i.e., y_{k}=-1), the opposite should happen to obtain a smaller objective. That is, the Mahalanobis distance between the samples of the pair should be increased.
The function \ell_{\beta}\big(\cdot,\cdot,{\boldsymbol{x}},\cdot\big) can be understood as a smooth and differentiable form of the hinge loss. In fact, \ell_{\beta}\big(\cdot,\cdot,{\boldsymbol{x}},\cdot\big) asymptotically approaches the hinge loss as \beta\rightarrow\infty. The smooth behavior of \ell_{\beta}\big(\cdot,\cdot,{\boldsymbol{x}},\cdot\big) is not only welcome in the optimization scheme but also prevents samples in the latent space from collapsing onto a single point.
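For illustration, a minimal NumPy sketch of Eq. 3 follows (the function name ell_beta and the example values are ours); it shows how a similar and a dissimilar pair are scored, and how large \beta makes the loss approach the hinge loss.

```python
import numpy as np

def ell_beta(M, y, x, u=1.0, beta=1.0):
    """Generalized logistic loss of Eq. 3.

    M : (p, p) SPD metric, y : +1 / -1 pair label,
    x : (p,) difference vector z_1 - z_2, u : margin.
    """
    d = x @ M @ x                      # Mahalanobis distance of the pair
    return np.log1p(np.exp(beta * y * (d - u))) / beta

# A similar pair (y = +1) is penalized when its distance exceeds the margin,
# a dissimilar pair (y = -1) when its distance falls below it; as beta grows
# the loss approaches the hinge loss max(0, y * (d - u)).
M = np.eye(3)
x = np.array([0.5, -0.2, 0.1])
print(ell_beta(M, +1, x), ell_beta(M, -1, x))
```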
Following the general practice in metric learning, we regularize the metric {\boldsymbol{M}} by r({\boldsymbol{M}}). Divergences derived from the \log\det(\cdot) function are familiar faces for regularizing Mahalanobis metrics in the literature (Davis et al., 2007; Saenko et al., 2010).
Among possible choices, we make use of the Stein divergence (Cherian et al., 2013) in this work. Hence,
r({\boldsymbol{M}})=\frac{1}{p}\delta_{s}({\boldsymbol{M}},\mathbf{I}_{p}).  (4)
where
\delta_{s}({\boldsymbol{P}},{\boldsymbol{Q}})=\log\det\Big(\frac{{\boldsymbol{P}}+{\boldsymbol{Q}}}{2}\Big)-\frac{1}{2}\log\det\big({\boldsymbol{P}}{\boldsymbol{Q}}\big),  (5)
for {\boldsymbol{P}},{\boldsymbol{Q}}\in\mathcal{S}_{++}.
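For reference, a small NumPy sketch of the Stein divergence of Eq. 5 and the regularizer of Eq. 4 is given below (helper names are ours; slogdet is used for numerical stability, which is an implementation choice rather than part of the formulation).

```python
import numpy as np

def stein_divergence(P, Q):
    """Stein (Jensen-Bregman log-det) divergence between SPD matrices, Eq. 5."""
    _, ld_mean = np.linalg.slogdet((P + Q) / 2.0)
    _, ld_p = np.linalg.slogdet(P)
    _, ld_q = np.linalg.slogdet(Q)
    return ld_mean - 0.5 * (ld_p + ld_q)   # log det(PQ) = log det(P) + log det(Q)

def metric_regularizer(M):
    """Regularizer of Eq. 4: (1/p) * Stein divergence between M and the identity."""
    p = M.shape[0]
    return stein_divergence(M, np.eye(p)) / p
```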
Soft Margin Extension
For large values of \beta, the cost in Eq. 2 requires the distances of similar pairs to be less than 1 while simultaneously requiring the distances of dissimilar pairs to exceed 1. This hard margin in the design of \ell_{\beta}\big(\cdot,\cdot,{\boldsymbol{x}},\cdot\big) is not desirable. For example, with a large number of pairs, outliers are often present, and forcing outliers to fit into the hard margins can result in overfitting. As such, we propose a soft-margin extension of Eq. 3. The soft margins are implemented by associating a non-negative slack variable \epsilon_{k} with each pair according to
\mathcal{L}_{d}=\frac{1}{N_{p}}\sum\limits_{k=1}^{N_{p}}\ell_{\beta}\big({\boldsymbol{M}},y_{k},{\boldsymbol{z}}_{1,k}-{\boldsymbol{z}}_{2,k},1+y_{k}\epsilon_{k}\big)+r({\boldsymbol{M}})+\frac{1}{N_{p}}\sqrt{\sum\epsilon_{k}^{2}},  (6)
where a regularizer on the slack variables is also envisaged.
Matching Statistical Properties
A form of incompatibility between domains is due to their statistical discrepancies. Matching the first-order statistics of two domains for the purpose of adaptation is studied in Pan et al. (2011); Baktashmotlagh et al. (2016); Hubert Tsai et al. (2016). (The use of Maximum Mean Discrepancy (MMD) (Borgwardt et al., 2006) for domain adaptation is a well-practiced idea in the literature; see for example Pan et al. (2011); Baktashmotlagh et al. (2016); Hubert Tsai et al. (2016). Empirically, determining the MMD boils down to computing the distance between domain averages when domain samples are lifted to a reproducing kernel Hilbert space. Some studies claim that matching the first-order statistics is a weaker form of domain adaptation through MMD. We do not support this claim and hence do not see our solution as a domain adaptation method that minimizes the MMD.) In our framework, matching domain averages can be achieved readily. In particular, let \bar{{\boldsymbol{x}}}^{s}_{i}={\boldsymbol{x}}^{s}_{i}-{\boldsymbol{m}}_{s} and \bar{{\boldsymbol{x}}}^{t}_{j}={\boldsymbol{x}}^{t}_{j}-{\boldsymbol{m}}_{t} be the centered source and target samples, with {\boldsymbol{m}}_{s} and {\boldsymbol{m}}_{t} being the means of the corresponding domains. It follows easily that the domain means in the latent space are zero (note that \sum{\boldsymbol{W}}_{s}^{T}\bar{{\boldsymbol{x}}}^{s}_{i}={\boldsymbol{W}}_{s}^{T}\sum\bar{{\boldsymbol{x}}}^{s}_{i}={\boldsymbol{0}}; the same holds for the target domain), and hence matching is achieved.
To go beyond first-order statistics, we propose to match the second-order statistics (i.e., covariance matrices) as well. The covariance of a domain reflects the relationships between its features; hence, matching the covariances of the source and target domains in effect aligns their cross-feature relationships. We capture the mismatch between the source and target covariances in the latent space using the \mathcal{L}_{u} loss in Eq. 1. Given the fact that covariance matrices are points on the SPD manifold, we make use of the Stein divergence to measure their differences. This leads us to define \mathcal{L}_{u} as
\mathcal{L}_{u}=\frac{1}{p}\delta_{s}({\boldsymbol{W}}_{s}^{T}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s},{\boldsymbol{W}}_{t}^{T}{\boldsymbol{\Sigma}}_{t}{\boldsymbol{W}}_{t}),  (7)
with {\boldsymbol{\Sigma}}_{s}\in\mathcal{S}_{++}^{s} and {\boldsymbol{\Sigma}}_{t}\in\mathcal{S}_{++}^{t} being the covariance matrices of the source and target domains, respectively. We emphasize that matching the statistical properties as discussed above is an unsupervised technique, enabling us to address unsupervised DA.
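Assuming centered data as discussed above, the statistical loss of Eq. 7 can be evaluated directly from the projected covariances; a minimal NumPy sketch (helper names ours, restating the Stein divergence helper so the snippet is self-contained) reads:

```python
import numpy as np

def stein_divergence(P, Q):
    """Stein divergence between SPD matrices (Eq. 5)."""
    _, ld_mean = np.linalg.slogdet((P + Q) / 2.0)
    _, ld_p = np.linalg.slogdet(P)
    _, ld_q = np.linalg.slogdet(Q)
    return ld_mean - 0.5 * (ld_p + ld_q)

def unsupervised_loss(Ws, Wt, Xs, Xt):
    """Eq. 7: Stein divergence between projected source/target covariances.

    Ws : (s, p) and Wt : (t, p) projections; Xs : (n_s, s), Xt : (n_t, t) samples.
    """
    Xs_c = Xs - Xs.mean(axis=0)        # centering also matches the domain means at zero
    Xt_c = Xt - Xt.mean(axis=0)
    Sigma_s = np.cov(Xs_c, rowvar=False)
    Sigma_t = np.cov(Xt_c, rowvar=False)
    p = Ws.shape[1]
    return stein_divergence(Ws.T @ Sigma_s @ Ws, Wt.T @ Sigma_t @ Wt) / p
```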
2.2 Classification Protocol
Upon learning {\boldsymbol{W}}_{s}, {\boldsymbol{W}}_{t}, {\boldsymbol{M}}, training samples from the source and target (if available) domains are mapped to the latent space using {\boldsymbol{W}}_{s}{\boldsymbol{M}}^{\frac{1}{2}} and {\boldsymbol{W}}_{t}{\boldsymbol{M}}^{\frac{1}{2}}, respectively. For a query from the target domain {\boldsymbol{x}}_{q}^{t}, {\boldsymbol{M}}^{\frac{1}{2}}{\boldsymbol{W}}_{t}^{T}{\boldsymbol{x}}_{q}^{t} is its latent space representation which is subsequently classified by a nearest neighbor classifier.
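A sketch of this protocol in NumPy could look as follows (helper names ours; the matrix square root is computed via an eigendecomposition, and for brevity only source training samples are mapped; labeled target samples, if available, would be mapped with {\boldsymbol{W}}_{t} in the same way):

```python
import numpy as np

def sqrtm_spd(M):
    """Matrix square root of an SPD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

def classify_1nn(Ws, Wt, M, Xs_train, ys_train, Xt_query):
    """Map source training data and target queries to the latent space, then 1-NN."""
    M_half = sqrtm_spd(M)
    train = Xs_train @ Ws @ M_half     # rows equal M^{1/2} W_s^T x for each source sample
    query = Xt_query @ Wt @ M_half     # rows equal M^{1/2} W_t^T x for each target query
    d2 = ((query[:, None, :] - train[None, :, :]) ** 2).sum(-1)
    return ys_train[np.argmin(d2, axis=1)]
```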
3 Optimization
The objective of our algorithm is to learn the transformation parameters ({\boldsymbol{W}}_{s} and {\boldsymbol{W}}_{t}), the metric {\boldsymbol{M}} and the slack variables \epsilon_{1},\epsilon_{2},...,\epsilon_{N_{p}} (see Eq. 6 and Eq. 7). In line with the general practice of dimensionality reduction, we impose orthogonality constraints on {\boldsymbol{W}}_{s} and {\boldsymbol{W}}_{t}, i.e., {\boldsymbol{W}}_{s}^{T}{\boldsymbol{W}}_{s}={\boldsymbol{W}}_{t}^{T}{\boldsymbol{W}}_{t}=\mathbf{I}_{p}. We elaborate later, in our experiments, on how the orthogonality constraint improves the discriminatory power of the proposed framework.
The problem depicted in Eq. 1 is indeed a non-convex and constrained optimization problem. One may resort to the method of Projected Gradient Descent (PGD) (Boyd and Vandenberghe, 2004) to minimize Eq. 1. In PGD, optimization proceeds by projecting the gradient-descent updates onto the constraint set. For example, in our case, we can first update {\boldsymbol{W}}_{s} by ignoring the orthogonality constraint and then project the result onto the set of orthogonal matrices using an eigendecomposition. As such, optimization can be performed by alternately updating {\boldsymbol{W}}_{s}, {\boldsymbol{W}}_{t}, the metric {\boldsymbol{M}} and the slack variables using PGD.
In PGD, the projection step requires the constraint set to be closed, though in practice one can also work with open sets. For example, the set of SPD matrices is open, yet one can project a symmetric matrix onto it using an eigendecomposition.
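For instance, projecting a symmetric update back onto the SPD set can be done by flooring its eigenvalues; a short sketch (the floor value eps is our choice) is:

```python
import numpy as np

def project_to_spd(A, eps=1e-8):
    """Project a (nearly) symmetric matrix onto the SPD set by flooring its eigenvalues."""
    A_sym = (A + A.T) / 2.0
    w, V = np.linalg.eigh(A_sym)
    return (V * np.maximum(w, eps)) @ V.T
```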
Empirically, PGD showed an erratic and numerically unstable behavior on our problem. This can be attributed to the nonlinear nature of Eq. 1, the existence of open-set constraints in the problem, or perhaps the combination of both. To alleviate this difficulty, we propose a more principled approach to minimizing Eq. 1 by making use of Riemannian optimization techniques. We take a short detour and briefly describe Riemannian optimization methods below.
Optimization on Riemannian manifolds.
Consider a non-convex constrained problem of the form
\mathrm{minimize}~{}f({\boldsymbol{x}})\quad\mathrm{s.t.}~{}~{}{\boldsymbol{x}}\in\mathcal{M}\;,  (8)
where \mathcal{M} is a Riemannian manifold, i.e., informally, a smooth surface that locally resembles a Euclidean space. Optimization techniques on Riemannian manifolds (e.g., Conjugate Gradient) start with an initial solution {\boldsymbol{x}}^{(0)}\in\mathcal{M}, and iteratively improve the solution by following the geodesic identified by the gradient. For example, in the case of Riemannian Gradient Descent Method (RGDM), the updating rule reads
{\boldsymbol{x}}^{(t+1)}=\tau_{{\boldsymbol{x}}^{(t)}}\big(-\alpha~{}\mathrm{grad}~{}f({\boldsymbol{x}}^{(t)})\big)\;,  (9)
with \alpha>0 being the algorithm’s step size. Here, \tau_{{\boldsymbol{x}}}(\cdot):T_{\boldsymbol{x}}\mathcal{M}\to\mathcal{M} is called the retraction; it moves the solution along the descent direction while ensuring that the new solution lies on the manifold \mathcal{M}, i.e., within the constraint set. (Strictly speaking, and in contrast with the exponential map, a retraction only guarantees to pull a tangent vector onto the geodesic locally, i.e., close to the origin of the tangent space. Retractions, however, are typically easier to compute than the exponential map and have proven effective in Riemannian optimization (Absil et al., 2009).) T_{\boldsymbol{x}}\mathcal{M} is the tangent space of \mathcal{M} at {\boldsymbol{x}} and can be thought of as a vector space whose elements are the gradients of all functions defined on \mathcal{M}.
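In code, one pass of Eq. 9 only needs the Riemannian gradient and the retraction of the manifold at hand; the generic sketch below (all callables are placeholders supplied by the user) makes this explicit.

```python
def riemannian_gradient_descent(x0, riem_grad, retract, alpha=1e-2, n_iters=100):
    """Generic Riemannian gradient descent following Eq. 9.

    riem_grad(x) returns the Riemannian gradient at x (a tangent vector),
    retract(x, xi) pulls the tangent vector xi back onto the manifold.
    """
    x = x0
    for _ in range(n_iters):
        x = retract(x, -alpha * riem_grad(x))
    return x
```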
We defer more details on Riemannian optimization techniques to Appendix A. For now, it suffices to say that to perform optimization on a Riemannian manifold, one requires the form of the Riemannian gradient, the retraction, and the gradient of the objective with respect to its parameters (denoted by \nabla). The constraints in Eq. 1 are orthogonality (for the transformations {\boldsymbol{W}}_{s} and {\boldsymbol{W}}_{t}) and positive definiteness (for the metric {\boldsymbol{M}}). The geometry of these constraints is captured by the Stiefel (James, 1976; Harandi and Fernando, 2016) and SPD (Harandi et al., 2016; Cherian and Sra, 2016) manifolds, formally defined as follows.
Definition 1 (The Stiefel Manifold)
The set of (n\times p)-dimensional matrices, p\leq n, with orthonormal columns, endowed with the Frobenius inner product (note that the literature is divided between this choice and another form of Riemannian metric; see Edelman et al. (1998) for details), forms a compact Riemannian manifold called the Stiefel manifold \mathrm{St}({p},{n}) (Absil et al., 2009).
\mathrm{St}({p},{n})\triangleq\{{\boldsymbol{W}}\in\mathbb{R}^{n\times p}:{\boldsymbol{W}}^{T}{\boldsymbol{W}}=\mathbf{I}_{p}\}\;.  (10)
Definition 2 (The SPD Manifold)
The set of (p\times p) dimensional real, SPD matrices endowed with the Affine Invariant Riemannian Metric (AIRM) (Pennec et al., 2006) forms the SPD manifold \mathcal{S}_{++}^{p}.
\mathcal{S}_{++}^{p}\triangleq\{{\boldsymbol{M}}\in\mathbb{R}^{p\times p}:{\boldsymbol{v}}^{T}{\boldsymbol{M}}{\boldsymbol{v}}>0,~{}\forall{\boldsymbol{v}}\in\mathbb{R}^{p}\setminus\{{\boldsymbol{0}}_{p}\}\}.  (11)
Updating {\boldsymbol{W}}_{s}, {\boldsymbol{W}}_{t}, {\boldsymbol{M}} and the slacks can be done alternately using Riemannian optimization. As mentioned above, the ingredients for doing so are 1. the Riemannian tools for the Stiefel and SPD manifolds along with 2. the gradients of the objective with respect to its parameters. To do complete justice, in Table 1 we provide the Riemannian metric, the form of the Riemannian gradient and the retraction for the Stiefel and SPD manifolds. In Table 2, the gradients of Eq. 1 with respect to {\boldsymbol{W}}_{s}, {\boldsymbol{W}}_{t}, {\boldsymbol{M}} and the slacks are provided. The details of the derivations can be found in Appendix B. A small note about the slacks is worth mentioning: to preserve the non-negativity constraint on \epsilon_{k}, we define \epsilon_{k}=e^{v_{k}} and optimize over v_{k} instead. This in turn makes the optimization over the slacks an unconstrained problem.
Remark 3
From a geometrical point of view, we can make use of the product topology of the parameter space to avoid alternating optimization. More specifically, the set
\mathcal{M}_{prod.}=\mathrm{St}({p},{s})\times\mathrm{St}({p},{t})\times\mathcal{S}_{++}^{p}\times\mathbb{R}^{N_{p}},  (12)
can be given the structure of a Riemannian manifold using the concept of product topology (Absil et al., 2009).
Remark 4
In Fig. 3, we compare the convergence behavior of PGD, alternating Riemannian optimization and optimization using the product geometry. While optimization on \mathcal{M}_{prod.} converges faster, the alternating method results in a lower loss. This behavior resembles the difference between stochastic gradient descent and its batch counterpart.
Remark 5
The complexity of the optimization depends on the number of labeled pairs. One can always resort to a stochastic solution (Oh Song et al., 2016; Sa et al., 2015; Bonnabel, 2013) by sampling from the set of similar/dissimilar pairs when addressing a very large-scale problem. In our experiments, we did not face any difficulty optimizing on an i7 desktop machine with 32GB of memory.
Table 1: Riemannian tools for the Stiefel and SPD manifolds.

\mathrm{St}({p},{n}):
Matrix representation: {\boldsymbol{W}}\in\mathbb{R}^{n\times p}
Riemannian metric: g_{\nu}(\xi,\varsigma)=\mathop{\rm Tr}\nolimits(\xi^{T}\varsigma)
Riemannian gradient: \nabla_{\boldsymbol{W}}(f)-{\boldsymbol{W}}\,\mathrm{sym}\big({\boldsymbol{W}}^{T}\nabla_{\boldsymbol{W}}(f)\big)
Retraction: \mathrm{uf}({\boldsymbol{W}}+\xi)

\mathcal{S}_{++}^{p}:
Matrix representation: {\boldsymbol{M}}\in\mathbb{R}^{p\times p}
Riemannian metric: g_{\mathcal{S}}(\xi,\varsigma)=\mathop{\rm Tr}\nolimits\left({\boldsymbol{M}}^{-1}\xi{\boldsymbol{M}}^{-1}\varsigma\right)
Riemannian gradient: {\boldsymbol{M}}\,\mathrm{sym}\big(\nabla_{\boldsymbol{M}}(f)\big){\boldsymbol{M}}
Retraction: {\boldsymbol{M}}^{\frac{1}{2}}\mathop{\rm expm}\nolimits\big({\boldsymbol{M}}^{-\frac{1}{2}}\xi{\boldsymbol{M}}^{-\frac{1}{2}}\big){\boldsymbol{M}}^{\frac{1}{2}}
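The entries of Table 1 translate almost directly into code; the NumPy/SciPy sketch below (helper names ours) implements the Riemannian gradients and retractions, with \mathrm{uf}(\cdot) realized as the orthogonal factor of the polar decomposition.

```python
import numpy as np
from scipy.linalg import expm, sqrtm

def sym(A):
    return (A + A.T) / 2.0

# --- Stiefel manifold St(p, n) ---------------------------------------------
def stiefel_riem_grad(W, egrad):
    """Riemannian gradient from the Euclidean gradient (Table 1, Stiefel row)."""
    return egrad - W @ sym(W.T @ egrad)

def stiefel_retract(W, xi):
    """Retraction uf(W + xi): orthogonal factor of the polar decomposition."""
    U, _, Vt = np.linalg.svd(W + xi, full_matrices=False)
    return U @ Vt

# --- SPD manifold S++(p) ----------------------------------------------------
def spd_riem_grad(M, egrad):
    """Riemannian gradient under the AIRM metric (Table 1, SPD row)."""
    return M @ sym(egrad) @ M

def spd_retract(M, xi):
    """Retraction M^{1/2} expm(M^{-1/2} xi M^{-1/2}) M^{1/2}."""
    M_half = np.real(sqrtm(M))
    M_half_inv = np.linalg.inv(M_half)
    return M_half @ expm(M_half_inv @ xi @ M_half_inv) @ M_half
```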
Table 2: Gradients of the objective with respect to {\boldsymbol{W}}_{s}, {\boldsymbol{W}}_{t}, {\boldsymbol{M}} and the slacks (the \ell_{\beta} terms are written for a pair formed by {\boldsymbol{x}}^{s}_{i} and {\boldsymbol{x}}^{t}_{j}; r is defined as in Appendix B).

\nabla_{{\boldsymbol{W}}_{s}}\ell_{\beta}=\frac{2y_{k}}{N_{p}}(1+r^{-1})^{-1}{\boldsymbol{x}}^{s}_{i}\big({{\boldsymbol{x}}^{s}_{i}}^{T}{\boldsymbol{W}}_{s}-{{\boldsymbol{x}}^{t}_{j}}^{T}{\boldsymbol{W}}_{t}\big){\boldsymbol{M}}

\nabla_{{\boldsymbol{W}}_{t}}\ell_{\beta}=\frac{2y_{k}}{N_{p}}(1+r^{-1})^{-1}{\boldsymbol{x}}^{t}_{j}\big({{\boldsymbol{x}}^{t}_{j}}^{T}{\boldsymbol{W}}_{t}-{{\boldsymbol{x}}^{s}_{i}}^{T}{\boldsymbol{W}}_{s}\big){\boldsymbol{M}}

\nabla_{{\boldsymbol{M}}}\ell_{\beta}=\frac{y_{k}}{N_{p}}(1+r^{-1})^{-1}\big({\boldsymbol{W}}_{s}^{T}{{\boldsymbol{x}}^{s}_{i}}-{\boldsymbol{W}}_{t}^{T}{\boldsymbol{x}}^{t}_{j}\big)\big({{\boldsymbol{x}}^{s}_{i}}^{T}{\boldsymbol{W}}_{s}-{{\boldsymbol{x}}^{t}_{j}}^{T}{\boldsymbol{W}}_{t}\big)

\nabla_{v_{k}}\ell_{\beta}=-\frac{1}{N_{p}}e^{v_{k}}(1+r^{-1})^{-1}

\nabla_{{\boldsymbol{W}}_{s}}\mathcal{L}_{u}=\frac{1}{p}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s}\Big(2\big({\boldsymbol{W}}_{s}^{T}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s}+{\boldsymbol{W}}_{t}^{T}{\boldsymbol{\Sigma}}_{t}{\boldsymbol{W}}_{t}\big)^{-1}-\big({\boldsymbol{W}}_{s}^{T}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s}\big)^{-1}\Big)
4 Related Work
The literature on domain adaptation spans a very broad range (see Patel et al. (2015) for a recent survey). Our solution falls under the category of domain adaptation by subspace learning (DASL). As such, we confine our review only to methods under the umbrella of DASL.
One notable example of constructing a latent space is the work of Daumé III et al. (2010). In particular, the authors propose to use two fixed and predefined transformations to project source and target data to a common and higherdimensional space. As a requirement, the method only accepts domains with the same dimensionality and hence cannot be directly used to adapt heterogeneous domains.
Gopalan et al. (2011) observed that the geodesic connecting the source and target subspaces conveys useful information for DA and proposed the Sampling Geodesic Flow (SGF) method (Gopalan et al., 2011). The Geodesic Flow Kernel (GFK) is an improvement over the SGF technique where, instead of sampling a few points on the geodesic, the whole curve is used for domain adaptation (Gong et al., 2012). In both methods, the domain subspaces are fixed and obtained by Principal Component Analysis (PCA) or Partial Least Squares (PLS) regression (Krishnan et al., 2011). In contrast to our solution, in SGF and GFK learning the domain subspaces is disjoint from the knowledge transfer algorithm. In our experiments, we will see that the subspaces determined by our method can even boost the performance of GFK, showing the importance of jointly learning the domain subspaces and the knowledge transfer scheme. In Ni et al. (2013), dictionary learning is used to interpolate intermediate subspaces.
Domain adaptation by fixing the subspace/representation of one of the domains is a popular theme in many recent works, as it simplifies the learning scheme. Examples are the max-margin adaptation of Hoffman et al. (2014); Duan et al. (2012), the metric/similarity learning of Saenko et al. (2010) and its kernel extension (Kulis et al., 2011), the landmark approach of Hubert Tsai et al. (2016), the alignment technique of Fernando et al. (2013, 2015), the correlation matching of Sun et al. (2016) and methods that use the maximum mean discrepancy (MMD) (Borgwardt et al., 2006) for domain adaptation (Pan et al., 2011; Baktashmotlagh et al., 2016).
In contrast to the above methods, some studies opt to learn the domain representations jointly with the knowledge transfer method. Two representative works are HeMap (Shi et al., 2010) and manifold alignment (Wang and Mahadevan, 2011). HeMap learns two projections to minimize the instance discrepancies (Shi et al., 2010). The problem is, however, formulated such that an equal number of source and target instances is required to perform the training. The manifold alignment algorithm of Wang and Mahadevan (2011) attempts to preserve the label structure in the latent space. However, it is essential for that algorithm to have access to labeled data in both the source and target domains.
Our solution learns all the transformations to the latent space. We do not resort to subspace representations learned disjointly from the DA framework. With this use of the latent space, our algorithm is not limited to applications where the source and target data have the same dimensionality or structure.
5 Experimental Evaluations
We run extensive experiments in both semi-supervised and unsupervised settings, spanning from hand-crafted features (SURF) to current state-of-the-art deep-net features (VGGNet). For comparisons, we use the implementations made available by the original authors. Our method is denoted by ILS.
5.1 Implementation Details
Since the number of dissimilar pairs is naturally larger than the number of similar pairs, we randomly sample from the dissimilar pairs to keep the sizes of these two sets equal. We initialize the projection matrices {\boldsymbol{W_{s}}}, {\boldsymbol{W_{t}}} with PCA, following the transductive protocol (Gong et al., 2012; Fernando et al., 2013; Hoffman et al., 2014; Hubert Tsai et al., 2016). For the semi-supervised setting, we initialize {\boldsymbol{M}} with the Mahalanobis metric learned on the similar-pair covariances (Köstinger et al., 2012), and for the unsupervised setting, we initialize it with the identity matrix. For all our experiments we have \lambda=1. We include an experiment showing our solution’s robustness to \lambda in the supplementary material. We use the toolbox provided by Boumal et al. (2014) for our implementations.
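For concreteness, a PCA initialization of a projection matrix can be obtained from the top-p right singular vectors of the centered data; a minimal sketch (helper name ours) is:

```python
import numpy as np

def pca_init(X, p):
    """Top-p principal directions of X (rows are samples), returned as a (d, p) Stiefel point."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:p].T        # columns are orthonormal principal directions
```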
Remark 6
To have a simple way of determining \beta in Eq. 3, we use a heuristic that proved effective in our experiments: we set \beta to the reciprocal of the standard deviation of the similar-pair distances.
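In code, this heuristic is a one-liner; the sketch below (helper name ours) computes the Mahalanobis distances of the similar pairs and returns the reciprocal of their standard deviation.

```python
import numpy as np

def beta_heuristic(M, diffs_similar):
    """beta = 1 / std of the similar-pair Mahalanobis distances (Remark 6).

    diffs_similar : (n_sim, p) array whose rows are z_1 - z_2 for similar pairs.
    """
    d = np.einsum('ij,jk,ik->i', diffs_similar, M, diffs_similar)
    return 1.0 / np.std(d)
```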
5.2 Semi-supervised Setting
In our semi-supervised experiments, we follow the standard setup on the Office+Caltech10 dataset with the train/test splits provided by Hoffman et al. (2013). The Office+Caltech10 dataset contains images collected from 4 different sources (see Fig. 4) and 10 object classes. The corresponding domains are Amazon, Webcam, DSLR, and Caltech. We use a subspace of dimension 20 for the DASL algorithms. We employ SURF (Bay et al., 2006) for the hand-crafted feature experiments and extract VGGNet features with the network model of Simonyan and Zisserman (2014) for the deep-net feature experiments (the same SURF and VGGNet features are used for the unsupervised experiments as well). We compare our performance with the following benchmarks:
1NNt and SVMt : Basic Nearest Neighbor (1NN) and linear SVM classifiers trained only on the target domain.
HFA (Duan et al., 2012): This method employs latent space learning based on the max-margin framework. As in its original implementation, we use an RBF kernel SVM for its evaluation.
MMDT (Hoffman et al., 2014): This method jointly learns a transformation between the source and target domains along with a linear SVM for classification.
CDLS (Hubert Tsai et al., 2016): This is the cross-domain landmark search algorithm. We use the parameter setting (\delta=0.5 in the notation of Hubert Tsai et al. (2016)) recommended by the authors.
Table 3, Table 4 and Table 5 report the performances using the hand-crafted SURF, VGG-FC6 and VGG-FC7 features, respectively. For the SURF features, our solution achieves the best performance in 7 out of 12 cases. For the VGG-FC6 and VGG-FC7 features, our solution performs best in 9 and 7 of the cases, respectively. It seems that, in comparison to VGG-FC6, the VGG-FC7 features are less discriminative for all the DA algorithms. We notice that the 1NNt baseline performs the worst for both the SURF and the VGGNet features; hence, the features themselves do not favor the nearest neighbor classifier. We observe that the Caltech and Amazon domains contain the largest numbers of test instances. Although the performances of all tested methods decrease on these domains, particularly on Caltech, our method achieves the top rank in almost all domain transformations.
Table 3: Semi-supervised DA on Office+Caltech10 with SURF features (accuracy in %).
Method  A\rightarrowW  A\rightarrowD  A\rightarrowC  W\rightarrowA  W\rightarrowD  W\rightarrowC  D\rightarrowA  D\rightarrowW  D\rightarrowC  C\rightarrowA  C\rightarrowW  C\rightarrowD

1NNt  34.5  33.6  19.7  29.5  35.9  18.9  27.1  33.4  18.6  29.2  33.5  34.1 
SVMt  63.7  57.2  32.2  46.0  56.5  29.7  45.3  62.1  32.0  45.1  60.2  56.3 
HFA  57.4  55.1  31.0  56.5  56.5  29.0  42.9  60.5  30.9  43.8  58.1  55.6 
MMDT  64.6  56.7  36.4  47.7  67.0  32.2  46.9  74.1  34.1  49.4  63.8  56.5 
CDLS  68.7  60.4  35.3  51.8  60.7  33.5  50.7  68.5  34.9  50.9  66.3  59.8 
ILS (1NN)  59.7  49.8  43.6  54.3  70.8  38.6  55.0  80.1  41.0  55.1  62.9  56.2 
Table 4: Semi-supervised DA on Office+Caltech10 with VGG-FC6 features (accuracy in %).
Method  A\rightarrowW  A\rightarrowD  A\rightarrowC  W\rightarrowA  W\rightarrowD  W\rightarrowC  D\rightarrowA  D\rightarrowW  D\rightarrowC  C\rightarrowA  C\rightarrowW  C\rightarrowD

1NNt  81.0  79.1  67.8  76.1  77.9  65.2  77.1  81.7  65.6  78.3  80.2  77.7 
SVMt  89.1  88.2  77.3  86.5  87.7  76.3  87.3  88.3  76.3  87.5  87.8  84.9 
HFA  87.9  87.1  75.5  85.1  87.3  74.4  85.9  86.9  74.8  86.2  86.0  87.0 
MMDT  82.5  77.1  78.7  84.7  85.1  73.6  83.6  86.1  71.8  85.9  82.8  77.9 
CDLS  91.2  86.9  78.1  87.4  88.5  78.2  88.1  90.7  77.9  88.0  89.7  86.3 
ILS (1NN)  90.7  87.7  83.3  88.8  94.5  82.8  88.7  95.5  81.4  89.7  91.4  86.9 
Table 5: Semi-supervised DA on Office+Caltech10 with VGG-FC7 features (accuracy in %).
Method  A\rightarrowW  A\rightarrowD  A\rightarrowC  W\rightarrowA  W\rightarrowD  W\rightarrowC  D\rightarrowA  D\rightarrowW  D\rightarrowC  C\rightarrowA  C\rightarrowW  C\rightarrowD

1NNt  81.8  78.2  68.3  77.8  77.6  67.4  78.1  81.5  66.9  79.0  80.6  77.4 
SVMt  87.5  85.4  76.8  86.2  85.6  75.8  87.0  87.1  76.0  87.1  86.4  84.4 
HFA  86.6  85.3  75.2  84.9  85.5  74.8  85.8  86.5  75.1  86.0  85.3  84.8 
MMDT  76.9  73.3  78.1  83.6  79.5  72.2  82.3  83.8  71.7  85.3  77.8  72.6 
CDLS  90.0  85.0  78.5  87.2  86.5  79.0  87.7  89.5  78.8  87.8  89.7  84.6 
ILS (1NN)  89.3  84.0  81.9  88.4  91.0  80.8  86.9  94.4  78.8  88.9  88.7  83.3 
5.3 Unsupervised Setting
In the unsupervised domain adaptation problem, only labeled data from the source domain is available (Fernando et al., 2013; Gong et al., 2012). We perform two sets of experiments for this setting. (1) We evaluate the object recognition performance on the Office+Caltech10 dataset. As in the semi-supervised setting, we use the SURF and VGGNet features. Our results demonstrate that the transformations learned by our method are superior domain representations. (2) We analyze our performance when the domain discrepancy is gradually increased. This experiment is performed on the PIE Multiview face dataset. We compare our method with the following benchmarks:
1NNs and SVMs : Basic 1NN and linear SVM classifiers trained only on the source domain.
GFKPLS (Gong et al., 2012): The geodesic flow kernel algorithm, where a partial least squares (PLS) implementation is used to initialize the source subspace. Results are evaluated using a kernel NN classifier.
SA (Fernando et al., 2013) : This is the subspace alignment algorithm. Results are evaluated using 1NN.
CORAL (Sun et al., 2016) : The correlation alignment algorithm that uses a linear SVM on the similarity matrix formed by correlation matching.
5.3.1 Office+Caltech10 (Unsupervised)
We follow the original protocol provided by Gong et al. (2012) on the Office+Caltech10 dataset. Note that several baselines determine the best dimensionality per domain to achieve their maximum accuracies on SURF features. We observed that a dimensionality in the range [20,120] provides consistent results for our solution using SURF features. For VGG features, we empirically found that a dimensionality of 20 suits the compared DASL algorithms best.
Table 6, Table 7 and Table 8 present the unsupervised setting results using the SURF, VGG-FC6 and VGG-FC7 features, respectively. For all feature types, our solution yields the best performance in 8 domain transformations out of 12. Similar to the semi-supervised experiments, we notice that VGG-FC7 is less favorable for the DA algorithms.
Table 6: Unsupervised DA on Office+Caltech10 with SURF features (accuracy in %).
Method  A\rightarrowW  A\rightarrowD  A\rightarrowC  W\rightarrowA  W\rightarrowD  W\rightarrowC  D\rightarrowA  D\rightarrowW  D\rightarrowC  C\rightarrowA  C\rightarrowW  C\rightarrowD

1NNs  23.1  22.3  20.0  14.7  31.3  12.0  23.0  51.7  19.9  21.0  19.0  23.6 
SVMs  25.6  33.4  35.9  30.4  67.7  23.4  34.6  70.2  31.2  43.8  30.5  40.3 
GFKPLS  35.7  35.1  37.9  35.5  71.2  29.3  36.2  79.1  32.7  40.4  35.8  41.1 
SA  38.6  37.6  35.3  37.4  80.3  32.3  38.0  83.6  32.4  39.0  36.8  39.6 
CORAL  38.7  38.3  40.3  37.8  84.9  34.6  38.1  85.9  34.2  47.2  39.2  40.7 
ILS (1NN)  40.6  41.0  37.1  38.6  72.4  32.6  38.9  79.1  36.9  48.6  42.0  44.1 
Table 7: Unsupervised DA on Office+Caltech10 with VGG-FC6 features (accuracy in %).
Method  A\rightarrowW  A\rightarrowD  A\rightarrowC  W\rightarrowA  W\rightarrowD  W\rightarrowC  D\rightarrowA  D\rightarrowW  D\rightarrowC  C\rightarrowA  C\rightarrowW  C\rightarrowD

1NNs  60.9  52.3  70.1  62.4  83.9  57.5  57.0  86.7  48.0  81.9  65.9  55.6 
SVMs  63.1  51.7  74.2  69.8  89.4  64.7  58.7  91.8  55.5  86.7  74.8  61.5 
GFKPLS  74.1  63.5  77.7  77.9  92.9  71.3  69.9  92.4  64.0  86.2  76.5  66.5 
SA  76.0  64.9  77.1  76.6  90.4  70.7  69.0  90.5  62.3  83.9  76.0  66.2 
CORAL  74.8  67.1  79.0  81.2  92.6  75.2  75.8  94.6  64.7  89.4  77.6  67.6 
ILS (1NN)  82.4  72.5  78.9  85.9  87.4  77.0  79.2  94.2  66.5  87.6  84.4  73.0 
Table 8: Unsupervised DA on Office+Caltech10 with VGG-FC7 features (accuracy in %).
Method  A\rightarrowW  A\rightarrowD  A\rightarrowC  W\rightarrowA  W\rightarrowD  W\rightarrowC  D\rightarrowA  D\rightarrowW  D\rightarrowC  C\rightarrowA  C\rightarrowW  C\rightarrowD

1NNs  64.0  50.8  72.6  64.5  83.1  60.2  61.2  88.2  52.8  82.6  65.3  54.9 
SVMs  68.0  51.8  76.2  70.1  87.4  65.5  58.7  91.2  56.0  86.7  74.8  61.3 
GFKPLS  74.0  57.6  76.6  75.0  89.6  62.1  67.5  91.9  62.9  84.1  73.6  63.4 
SA  75.0  60.7  76.2  74.6  88.8  67.5  66.0  89.5  59.4  82.6  73.6  63.2 
CORAL  71.8  61.3  78.6  81.4  90.1  73.6  71.2  93.5  63.0  88.6  76.0  63.8 
ILS (1NN)  80.9  71.3  78.4  85.7  84.8  75.1  76.5  91.8  66.2  87.1  80.1  67.1 
Learned Transformations as Subspace Representations: We consider both GFK (Gong et al., 2012) and SA (Fernando et al., 2013) as DASL algorithms. Both methods make use of PCA subspaces to adapt the domains. To the best of our knowledge, there exist no thorough studies claiming that PCA is the method of choice for GFK and SA. As a matter of fact, Gong et al. show that the performance of GFK can be improved if the PLS algorithm is employed to define the source subspace (Gong et al., 2012) (despite using labeled source data, this variant still falls under the unsupervised setting since it does not use the labeled target data). Whether PCA or PLS is used to define the subspaces, identification of the subspaces is disjoint from the domain adaptation technique in GFK and SA. In contrast, the transformations in the ILS algorithm are linked to the adaptation process. This makes a curious mind wonder whether the learned transformations in the ILS algorithm capture better structures for adaptation. We empirically show that this is indeed the case by using the learned {\boldsymbol{W}}_{s} as the source subspace in GFK and SA.
Figure 5 compares the accuracy gains over PCA subspaces obtained by using the PLS and our {\boldsymbol{W}}_{s} initializations. It is clear that the highest classification accuracy gain is obtained by our {\boldsymbol{W}}_{s} initialization. This shows that {\boldsymbol{W}}_{s} provides a more favorable subspace representation for domain adaptation.
5.3.2 PIE Multiview Faces
The PIE Multiview dataset includes face images of 67 individuals captured under different views, illumination conditions, and expressions. In this experiment, we use the view C27 (looking forward) as the source domain, and the view C09 (looking down) and the views C05, C37, C02, C25 (looking towards the left at increasing angles, see Fig. 6) as target domains. We expect the face inclination angle to reflect the complexity of the transfer learning task. We normalize the images to 32\times 32 pixels and use the vectorized grayscale images as features. Empirically, we observed that GFK (Gong et al., 2012) and SA (Fernando et al., 2013) reach better performances if the features are normalized to have unit \ell_{2} norm. We therefore use \ell_{2}-normalized features in our evaluations. The dimensionality of the subspaces for all the subspace-based methods (i.e., Gong et al. (2012); Fernando et al. (2013)) including ours is 100.
Table 9 lists the classification accuracies with increasing angle of inclination. Our solution attains the best scores for 4 views and the second best for C09. With increasing camera angle, the feature structure changes considerably; in other words, the features become heterogeneous. However, our algorithm boosts the accuracies even under such challenging conditions.
Table 9: Unsupervised DA on the PIE Multiview face dataset with C27 as the source domain (accuracy in %).
Method  camera pose\rightarrow  C09  C05  C37  C25  C02

1NNs  92.5  55.7  28.5  14.8  11.0 
SVMs  87.8  65.0  35.8  15.7  16.7 
GFKPLS  92.5  74.0  32.1  14.1  12.3 
SA  97.9  85.9  47.9  16.6  13.9 
CORAL  91.4  74.8  35.3  13.4  13.2 
ILS (1NN)  96.6  88.3  72.9  28.4  34.8 
6 Parameter Sensitivity and Orthogonality
In all the above experiments, we keep \lambda=1 (see Eq. 1). To analyze the sensitivity of our method to changes in the parameter \lambda, we performed an experiment using the unsupervised protocol, because the statistical loss plays a significant role in establishing the correspondence between the source and the target in unsupervised DA. We consider two random splits of the Office+Caltech10 dataset along with VGG-FC6 features here.
Our results are shown in Fig. 7. When \lambda=0, no statistical loss term is considered. It is clear that for this case the performance drops considerably. For other values of \lambda, the performance is superior and there is little variation in performance. In other words, our method remains robust.
We further investigate the benefit of the orthogonality constraint on {\boldsymbol{W}}_{s} and {\boldsymbol{W}}_{t} against free-form, unconstrained transformations. Using the orthogonality constraint provides a considerable performance gain, as shown in Fig. 7. While orthogonality makes the optimization more involved, it seems to guide the learning towards better uncovering the form of adaptation.
Conclusion
In this paper, we proposed a solution for both semi-supervised and unsupervised Domain Adaptation (DA) problems. Our solution learns a latent space in which domain discrepancies are minimized. We showed that such a latent space can be obtained by 1. minimizing a notion of discriminative loss over the available labeled data while simultaneously 2. matching statistical properties across the domains. To determine the latent space, we modeled the learning problem as a minimization problem on Riemannian manifolds and solved it using optimization techniques on matrix manifolds.
Empirically, we showed that the proposed method outperforms state-of-the-art DA solutions in semi-supervised and unsupervised settings. With the proposed framework, we see possibilities of extending our solution to large-scale datasets via stochastic optimization techniques, to multiple-source DA, and to domain generalization (Ghifary et al., 2016; Gan et al., 2016). In terms of algorithmic extensions, we look forward to using dictionary learning (Koniusz and Cherian, 2016) and higher-order statistics matching.
References
 Absil et al. (2009) P.-A. Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
 Baktashmotlagh et al. (2016) Mahsa Baktashmotlagh, Mehrtash Harandi, and Mathieu Salzmann. Distribution-matching embedding for visual domain adaptation. Journal of Machine Learning Research, 17(108):1–30, 2016.
 Bay et al. (2006) Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In European conference on computer vision, pages 404–417. Springer, 2006.
 Bonnabel (2013) S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
 Borgwardt et al. (2006) Karsten Borgwardt, Arthur Gretton, Malte J. Rasch, Hans-Peter Kriegel, Bernhard Schoelkopf, and Alexander Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22:49–57, 2006.
 Boumal et al. (2014) N. Boumal, B. Mishra, P.A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15:1455–1459, 2014. URL http://www.manopt.org.
 Boyd and Vandenberghe (2004) Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
 Brookes (2005) Mike Brookes. The matrix reference manual. Imperial College London, 2005.
 Chen et al. (2015) Qiang Chen, Junshi Huang, Rogerio Feris, Lisa M Brown, Jian Dong, and Shuicheng Yan. Deep domain adaptation for describing people based on fine-grained clothing attributes. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5315–5324, 2015.
 Cherian et al. (2013) A. Cherian, S. Sra, A. Banerjee, and N. Papanikolopoulos. Jensen-Bregman log-det divergence with application to efficient similarity search for covariance matrices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2161–2174, Sept 2013.
 Cherian and Sra (2016) Anoop Cherian and Suvrit Sra. Positive definite matrices: data representation and applications to computer vision. Algorithmic Advances in Riemannian Geometry and Applications: For Machine Learning, Computer Vision, Statistics, and Optimization, page 93, 2016.
 Cherian et al. (2016) Anoop Cherian, Vassilios Morellas, and Nikolaos Papanikolopoulos. Bayesian nonparametric clustering for positive definite matrices. IEEE transactions on pattern analysis and machine intelligence, 38(5):862–874, 2016.
 Daumé III et al. (2010) Hal Daumé III, Abhishek Kumar, and Avishek Saha. Frustratingly easy semi-supervised domain adaptation. In Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing, pages 53–59, 2010.
 Davis et al. (2007) Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Informationtheoretic metric learning. In ICML, pages 209–216, Corvalis, Oregon, USA, June 2007.
 Duan et al. (2012) Lixin Duan, Dong Xu, and Ivor W. Tsang. Learning with augmented features for heterogeneous domain adaptation. In Proc. Int. Conference on Machine Learning (ICML), pages 711–718, June 2012.
 Edelman et al. (1998) Alan Edelman, Tomás A Arias, and Steven T Smith. The geometry of algorithms with orthogonality constraints. SIAM journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
 Fernando et al. (2013) B. Fernando, A. Habrard, M. Sebban, and T. Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In Proc. Int. Conference on Computer Vision (ICCV), pages 2960–2967, 2013.
 Fernando et al. (2015) Basura Fernando, Tatiana Tommasi, and Tinne Tuytelaars. Joint crossdomain classification and subspace learning for unsupervised adaptation. Pattern Recognition Letters, 65:60 – 66, 2015.
 Gan et al. (2016) Chuang Gan, Tianbao Yang, and Boqing Gong. Learning attributes equals multisource domain generalization. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 87–97, 2016.
 Ganin and Lempitsky (2015) Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In Proc. Int. Conference on Machine Learning (ICML), pages 1180–1189, 2015.
 Ghifary et al. (2016) M. Ghifary, D. Balduzzi, W. B. Kleijn, and M. Zhang. Scatter component analysis: A unified framework for domain adaptation and domain generalization. IEEE Trans. Pattern Analysis and Machine Intelligence, PP(99):1–1, 2016.
 Gong et al. (2012) B. Gong, Y. Shi, F. Sha, and K. Grauman. Geodesic flow kernel for unsupervised domain adaptation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2066–2073, 2012.
 Gopalan et al. (2011) R. Gopalan, Ruonan Li, and R. Chellappa. Domain adaptation for object recognition: An unsupervised approach. In Proc. Int. Conference on Computer Vision (ICCV), pages 999–1006, 2011.
 Guillemin and Pollack (2010) Victor Guillemin and Alan Pollack. Differential topology, volume 370. American Mathematical Soc., 2010.
 Harandi and Fernando (2016) Mehrtash Harandi and Basura Fernando. Generalized backpropagation, étude de cas: Orthogonality. CoRR, abs/1611.05927, 2016.
 Harandi et al. (2016) Mehrtash Tafazzoli Harandi, Mathieu Salzmann, and Richard I. Hartley. Dimensionality reduction on SPD manifolds: The emergence of geometryaware methods. CoRR, abs/1605.06182, 2016.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.
 Herath et al. (2017) Samitha Herath, Mehrtash Harandi, and Fatih Porikli. Going deeper into action recognition: A survey. Image and Vision Computing, 60:4 – 21, 2017. Regularization Techniques for HighDimensional Data Analysis.
 Hoffman et al. (2013) Judy Hoffman, Erik Rodner, Jeff Donahue, Kate Saenko, and Trevor Darrell. Efficient learning of domain-invariant image representations. In International Conference on Learning Representations, 2013.
 Hoffman et al. (2014) Judy Hoffman, Erik Rodner, Jeff Donahue, Brian Kulis, and Kate Saenko. Asymmetric and category invariant feature transformations for domain adaptation. Int. Journal of Computer Vision, 109(1):28–41, 2014.
 Hubert Tsai et al. (2016) Yao-Hung Hubert Tsai, Yi-Ren Yeh, and Yu-Chiang Frank Wang. Learning cross-domain landmarks for heterogeneous domain adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5081–5090, June 2016.
 James (1976) Ioan Mackenzie James. The topology of Stiefel manifolds, volume 24. Cambridge University Press, 1976.
 Koniusz and Cherian (2016) Piotr Koniusz and Anoop Cherian. Sparse coding for third-order super-symmetric tensor descriptors with application to texture recognition. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), page 5395, 2016.
 Koniusz et al. (2017) Piotr Koniusz, Yusuf Tas, and Fatih Porikli. Domain adaptation by mixture of alignments of second- or higher-order scatter tensors. Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 Köstinger et al. (2012) Martin Köstinger, Martin Hirzer, Paul Wohlhart, Peter M Roth, and Horst Bischof. Large scale metric learning from equivalence constraints. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2288–2295, 2012.
 Krishnan et al. (2011) Anjali Krishnan, Lynne J Williams, Anthony Randal McIntosh, and Hervé Abdi. Partial least squares (PLS) methods for neuroimaging: a tutorial and review. Neuroimage, 56(2):455–475, 2011.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 1097–1105, 2012.
 Kulis et al. (2011) B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1785–1792, June 2011.
 Lee (2003) John M Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–29. Springer, 2003.
 Long et al. (2016) M. Long, J. Wang, Y. Cao, J. Sun, and P. S. Yu. Deep learning of transferable representation for scalable domain adaptation. IEEE Transactions on Knowledge and Data Engineering, 28(8):2027–2040, Aug 2016.
 Ni et al. (2013) Jie Ni, Qiang Qiu, and Rama Chellappa. Subspace interpolation via dictionary learning for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 692–699, 2013.
 Oh Song et al. (2016) Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 Pan et al. (2011) S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, 2011.
 Patel et al. (2015) V. M. Patel, R. Gopalan, R. Li, and R. Chellappa. Visual domain adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 32(3):53–69, 2015.
 Pennec et al. (2006) Xavier Pennec, Pierre Fillard, and Nicholas Ayache. A riemannian framework for tensor computing. Int. Journal of Computer Vision, 66(1):41–66, 2006.
 Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li FeiFei. ImageNet Large Scale Visual Recognition Challenge. Int. Journal of Computer Vision, 115(3):211–252, 2015.
 Sa et al. (2015) Christopher D Sa, Christopher Re, and Kunle Olukotun. Global convergence of stochastic gradient descent for some nonconvex matrix problems. In Proc. Int. Conference on Machine Learning (ICML), pages 2332–2341, 2015.
 Saenko et al. (2010) Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In Proc. European Conference on Computer Vision (ECCV), pages 213–226, 2010.
 Shi et al. (2010) Xiaoxiao Shi, Qi Liu, Wei Fan, S Yu Philip, and Ruixin Zhu. Transfer learning on heterogenous feature spaces via spectral transformation. In 2010 IEEE international conference on data mining, pages 1049–1054, 2010.
 Shimodaira (2000) Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
 Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Sun et al. (2016) Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 Torralba and Efros (2011) A. Torralba and A. A. Efros. Unbiased look at dataset bias. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1521–1528, 2011.
 Tzeng et al. (2014) Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
 Wang and Mahadevan (2011) Chang Wang and Sridhar Mahadevan. Heterogeneous domain adaptation using manifold alignment. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Two, IJCAI’11, pages 1541–1546, 2011.
Appendix
In this appendix, we provide more details on the product geometry of the problem discussed in § 3 and also the form of the gradients required to perform Riemannian optimization.
A. Product Topology
As the constraints of the optimization problem depicted in Eq. 1 are indeed Riemannian manifolds, the whole set of constraints can be given a Riemannian structure through the concept of product topology. In particular, the constraints can be modeled as
\mathcal{M}_{prod.}=\mathrm{St}(p,s)\times\mathrm{St}(p,t)\times\mathcal{S}_{++}^{p}\times\mathbb{R}^{N_{p}}.  (13)
The tangent space of such a product topology (Lee, 2003; Guillemin and Pollack, 2010) can be written as
\mathcal{T}_{({\boldsymbol{W}}_{s},{\boldsymbol{W}}_{t},{\boldsymbol{M}},{\boldsymbol{\epsilon}})}\mathcal{M}_{prod.}=T_{{\boldsymbol{W}}_{s}}\mathrm{St}(p,s)\times T_{{\boldsymbol{W}}_{t}}\mathrm{St}(p,t)\times T_{{\boldsymbol{M}}}\mathcal{S}_{++}^{p}\times T_{{\boldsymbol{\epsilon}}}\mathbb{R}^{N_{p}}.  (14)
In Table 10, the Riemannian metric, the form of the Riemannian gradient, and the retraction for \mathcal{M}_{prod.} are provided.
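For concreteness, the per-component operations of Table 10 can be realized in a few lines of NumPy. The sketch below is illustrative rather than our full solver: the helper names are ours, \mathrm{uf}(\cdot) is taken to be the orthogonal factor of a polar decomposition, and the Euclidean component of the product manifold is omitted.

```python
# A minimal NumPy sketch of the per-component operations in Table 10.
# Helper names are ours; uf(.) is assumed to be the polar (orthogonal) factor.
import numpy as np

def sym(A):
    """Symmetric part of a square matrix."""
    return 0.5 * (A + A.T)

def uf(A):
    """Orthogonal factor of the polar decomposition of a tall matrix A."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

def spd_pow(M, p):
    """M^p for a symmetric positive definite matrix M (via eigendecomposition)."""
    w, V = np.linalg.eigh(M)
    return (V * w**p) @ V.T

def stiefel_rgrad(W, G):
    """Riemannian gradient on the Stiefel manifold from the Euclidean gradient G."""
    return G - W @ sym(W.T @ G)

def spd_rgrad(M, G):
    """Riemannian gradient on S++ under the affine-invariant metric."""
    return M @ sym(G) @ M

def stiefel_retract(W, xi):
    """Retraction uf(W + xi) on the Stiefel manifold."""
    return uf(W + xi)

def spd_retract(M, xi):
    """Retraction M^{1/2} expm(M^{-1/2} xi M^{-1/2}) M^{1/2} on S++."""
    Mh, Mih = spd_pow(M, 0.5), spd_pow(M, -0.5)
    S = sym(Mih @ xi @ Mih)            # symmetric, so expm can use eigh
    w, V = np.linalg.eigh(S)
    return Mh @ (V * np.exp(w)) @ V.T @ Mh
```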
B. Derivations
We recall that the cost function in Eq. 1 consists of two parts, namely \mathcal{L}_{d} and \mathcal{L}_{u}. Here, \mathcal{L}_{d} is a measure of dissimilarity between labeled samples, while \mathcal{L}_{u} quantifies a notion of statistical difference between the source and target samples in the latent space. Below, we provide the gradients of Eq. 1 with respect to its parameters; as discussed in § 3, these are required for Riemannian optimization.
Derivative of the soft-margin loss \ell_{\beta}
We recall that \mathcal{L}_{d} has the following form,
\mathcal{L}_{d}=\frac{1}{N_{p}}\sum\limits_{k=1}^{N_{p}}\ell_{\beta}\big({\boldsymbol{M}},y_{k},{\boldsymbol{z}}_{1,k}-{\boldsymbol{z}}_{2,k},1+y_{k}\epsilon_{k}\big)+r({\boldsymbol{M}})+\frac{1}{N_{p}}\sqrt{\sum_{k}\epsilon_{k}^{2}}\;,  (15)
with
\ell_{\beta}\big({\boldsymbol{M}},y,{\boldsymbol{x}},u\big)=\frac{1}{\beta}\log\Big(1+\exp\big(\beta y({\boldsymbol{x}}^{T}{\boldsymbol{M}}{\boldsymbol{x}}-u)\big)\Big).  (16)
In Eq. 15, y_{k} denotes whether the k-th pair is similar or dissimilar (i.e., y_{k}=+1 if {\boldsymbol{z}}_{1,k} and {\boldsymbol{z}}_{2,k} are from the same class and y_{k}=-1 otherwise).
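To make the role of the margin u=1+y_{k}\epsilon_{k} concrete, the short sketch below evaluates Eq. 16 for a similar and a dissimilar pair; the value of \beta, the dimensions, and the slack value are illustrative choices only.

```python
# A minimal sketch of evaluating the soft-margin loss of Eq. 16 for one pair.
import numpy as np

def soft_margin_loss(M, y, x, u, beta=10.0):
    """l_beta(M, y, x, u) = (1/beta) log(1 + exp(beta * y * (x^T M x - u)))."""
    return np.log1p(np.exp(beta * y * ((x.T @ M @ x).item() - u))) / beta

rng = np.random.default_rng(0)
p = 5
A = rng.standard_normal((p, p))
M = A @ A.T / p + np.eye(p)                  # an SPD Mahalanobis metric
x = 0.2 * rng.standard_normal((p, 1))        # x = z_{1,k} - z_{2,k}
eps = 0.1                                    # slack value for this pair
print(soft_margin_loss(M, +1, x, 1 + eps))   # similar pair, small x^T M x: small loss
print(soft_margin_loss(M, -1, x, 1 - eps))   # dissimilar pair, small x^T M x: larger loss
```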
For the sake of discussion, assume {\boldsymbol{z}}_{1,k} and {\boldsymbol{z}}_{2,k} are embedded from the source and target domains, respectively. That is {\boldsymbol{z}}_{1,k}={\boldsymbol{W}}_{s}^{T}{\boldsymbol{x}}^{s}_{i} and {\boldsymbol{z}}_{2,k}={\boldsymbol{W}}_{t}^{T}{\boldsymbol{x}}^{t}_{j}. By expanding \ell_{\beta} for such a pair, we get
\ell_{\beta}({\boldsymbol{M}},y_{k},{\boldsymbol{z}}_{1,k}-{\boldsymbol{z}}_{2,k},1+y_{k}\epsilon_{k})=\frac{1}{\beta}\log\Big(1+\exp\big(\beta y_{k}\big(({\boldsymbol{z}}_{1,k}-{\boldsymbol{z}}_{2,k})^{T}{\boldsymbol{M}}({\boldsymbol{z}}_{1,k}-{\boldsymbol{z}}_{2,k})-1-y_{k}\epsilon_{k}\big)\big)\Big).  (17)
To simplify the presentation, we define d({\boldsymbol{M}},{\boldsymbol{W}}_{s},{\boldsymbol{W}}_{t})=({\boldsymbol{z}}_{1,k}-{\boldsymbol{z}}_{2,k})^{T}{\boldsymbol{M}}({\boldsymbol{z}}_{1,k}-{\boldsymbol{z}}_{2,k}) and r=\exp\big(\beta y_{k}(({\boldsymbol{z}}_{1,k}-{\boldsymbol{z}}_{2,k})^{T}{\boldsymbol{M}}({\boldsymbol{z}}_{1,k}-{\boldsymbol{z}}_{2,k})-y_{k}\epsilon_{k}-1)\big). We provide the gradients of Eq. 17 with respect to {\boldsymbol{M}}, {\boldsymbol{W}}_{t}, {\boldsymbol{W}}_{s}, and the slack \epsilon_{k} below.
Derivative w.r.t. {\boldsymbol{M}}
\nabla_{{\boldsymbol{M}}}\ell_{\beta}=\frac{y_{k}r}{(1+r)}\nabla_{{\boldsymbol{M}}}d({\boldsymbol{M}})
=\frac{y_{k}r}{(1+r)}({\boldsymbol{W}}_{s}^{T}{\boldsymbol{x}}^{s}_{i}-{\boldsymbol{W}}_{t}^{T}{\boldsymbol{x}}^{t}_{j})({{\boldsymbol{x}}^{s}_{i}}^{T}{\boldsymbol{W}}_{s}-{{\boldsymbol{x}}^{t}_{j}}^{T}{\boldsymbol{W}}_{t})
=y_{k}(1+r^{-1})^{-1}({\boldsymbol{W}}_{s}^{T}{\boldsymbol{x}}^{s}_{i}-{\boldsymbol{W}}_{t}^{T}{\boldsymbol{x}}^{t}_{j})({{\boldsymbol{x}}^{s}_{i}}^{T}{\boldsymbol{W}}_{s}-{{\boldsymbol{x}}^{t}_{j}}^{T}{\boldsymbol{W}}_{t}).  (18)
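The expression in Eq. 18 can be verified numerically; the following sketch compares one entry of the gradient against a central finite difference (all values are synthetic and for illustration only).

```python
# Sketch: the gradient of Eq. 18 checked against a central finite difference.
import numpy as np

def soft_margin_loss(M, y, x, u, beta):
    return np.log1p(np.exp(beta * y * ((x.T @ M @ x).item() - u))) / beta

def grad_M(M, y, x, u, beta):
    """Eq. 18: y (1 + r^{-1})^{-1} x x^T with r = exp(beta y (x^T M x - u))."""
    r = np.exp(beta * y * ((x.T @ M @ x).item() - u))
    return y / (1.0 + 1.0 / r) * (x @ x.T)

rng = np.random.default_rng(1)
p, y, beta = 4, -1, 5.0
A = rng.standard_normal((p, p))
M = A @ A.T + np.eye(p)                      # SPD metric
x = rng.standard_normal((p, 1))              # latent-space difference z_1 - z_2
u = (x.T @ M @ x).item() - 0.2               # margin chosen so the logit stays moderate
G = grad_M(M, y, x, u, beta)
h, (i, j) = 1e-6, (1, 2)
E = np.zeros((p, p)); E[i, j] = h
fd = (soft_margin_loss(M + E, y, x, u, beta)
      - soft_margin_loss(M - E, y, x, u, beta)) / (2 * h)
print(G[i, j], fd)                           # the two numbers should agree closely
```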
Table 10: Matrix representation, Riemannian metric, Riemannian gradient, and retraction for \mathcal{M}_{prod.}.
Matrix representation: ({\boldsymbol{W}}_{s},{\boldsymbol{W}}_{t},{\boldsymbol{M}},{\boldsymbol{\epsilon}})
Riemannian metric: g_{\nu_{s}}(\varsigma_{s},\xi_{s})+g_{\nu_{t}}(\varsigma_{t},\xi_{t})+g_{\mathcal{S}}(\varsigma_{M},\xi_{M})+g_{E}(\varsigma_{E},\xi_{E})
Riemannian gradient: \big(\nabla_{{\boldsymbol{W}}_{s}}(f)-{\boldsymbol{W}}_{s}\,\mathrm{sym}\big({\boldsymbol{W}}_{s}^{T}\nabla_{{\boldsymbol{W}}_{s}}(f)\big),\;\nabla_{{\boldsymbol{W}}_{t}}(f)-{\boldsymbol{W}}_{t}\,\mathrm{sym}\big({\boldsymbol{W}}_{t}^{T}\nabla_{{\boldsymbol{W}}_{t}}(f)\big),\;{\boldsymbol{M}}\,\mathrm{sym}\big(\nabla_{\boldsymbol{M}}(f)\big){\boldsymbol{M}},\;\nabla_{{\boldsymbol{\epsilon}}}(f)\big)
Retraction: \big(\mathrm{uf}({\boldsymbol{W}}_{s}+\xi_{s}),\;\mathrm{uf}({\boldsymbol{W}}_{t}+\xi_{t}),\;{\boldsymbol{M}}^{\frac{1}{2}}\mathop{\rm expm}\nolimits({\boldsymbol{M}}^{-\frac{1}{2}}\xi_{M}{\boldsymbol{M}}^{-\frac{1}{2}}){\boldsymbol{M}}^{\frac{1}{2}},\;\mathbf{I}_{p}\big)
Derivative w.r.t. {\boldsymbol{W}}_{s} (or w.r.t. {\boldsymbol{W}}_{t})
\nabla_{{\boldsymbol{W}}_{s}}\ell_{\beta}=\frac{y_{k}r}{(1+r)}\nabla_{{\boldsymbol{W}}_{s}}d({\boldsymbol{W}}_{s})  (19)
=2\frac{y_{k}r}{(1+r)}{\boldsymbol{x}}_{i}^{s}({{\boldsymbol{x}}^{s}_{i}}^{T}{\boldsymbol{W}}_{s}-{{\boldsymbol{x}}^{t}_{j}}^{T}{\boldsymbol{W}}_{t}){\boldsymbol{M}}
=2y_{k}(1+r^{-1})^{-1}{\boldsymbol{x}}_{i}^{s}({{\boldsymbol{x}}^{s}_{i}}^{T}{\boldsymbol{W}}_{s}-{{\boldsymbol{x}}^{t}_{j}}^{T}{\boldsymbol{W}}_{t}){\boldsymbol{M}}.  (20)
For the case where both pair instances come from the same domain (i.e., {\boldsymbol{z}}_{1,k}={\boldsymbol{W}}_{s}^{T}{\boldsymbol{x}}^{s}_{i} and {\boldsymbol{z}}_{2,k}={\boldsymbol{W}}_{s}^{T}{\boldsymbol{x}}^{s}_{j}), it can be shown that
\nabla_{{\boldsymbol{W}}_{s}}d({\boldsymbol{W}}_{s})=2({\boldsymbol{x}}_{i}^{s}-{\boldsymbol{x}}_{j}^{s})({{\boldsymbol{x}}^{s}_{i}}^{T}-{{\boldsymbol{x}}^{s}_{j}}^{T}){\boldsymbol{W}}_{s}{\boldsymbol{M}},  (21)
\nabla_{{\boldsymbol{W}}_{s}}\ell_{\beta}=2\frac{y_{k}r}{(1+r)}({\boldsymbol{x}}_{i}^{s}-{\boldsymbol{x}}_{j}^{s})({{\boldsymbol{x}}^{s}_{i}}^{T}-{{\boldsymbol{x}}^{s}_{j}}^{T}){\boldsymbol{W}}_{s}{\boldsymbol{M}}
=2y_{k}(1+r^{-1})^{-1}({\boldsymbol{x}}_{i}^{s}-{\boldsymbol{x}}_{j}^{s})({{\boldsymbol{x}}^{s}_{i}}^{T}-{{\boldsymbol{x}}^{s}_{j}}^{T}){\boldsymbol{W}}_{s}{\boldsymbol{M}}.  (22)
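A similar numerical check applies to Eq. 20; the sketch below compares one entry of \nabla_{{\boldsymbol{W}}_{s}}\ell_{\beta} for a cross-domain pair against a central finite difference, again with synthetic dimensions and illustrative constants.

```python
# Sketch: the Euclidean gradient of Eq. 20 (cross-domain pair), verified against
# a central finite difference. Dimensions, beta, and the margin are illustrative.
import numpy as np

def loss(Ws, Wt, M, xs, xt, y, u, beta):
    v = Ws.T @ xs - Wt.T @ xt                # latent-space difference of the pair
    return np.log1p(np.exp(beta * y * ((v.T @ M @ v).item() - u))) / beta

def grad_Ws(Ws, Wt, M, xs, xt, y, u, beta):
    """Eq. 20: 2 y (1 + r^{-1})^{-1} x_i^s (x_i^{sT} W_s - x_j^{tT} W_t) M."""
    v = Ws.T @ xs - Wt.T @ xt
    r = np.exp(beta * y * ((v.T @ M @ v).item() - u))
    return 2 * y / (1 + 1 / r) * xs @ (xs.T @ Ws - xt.T @ Wt) @ M

rng = np.random.default_rng(2)
s, t, p, y, beta = 6, 7, 3, +1, 4.0
Ws, Wt = rng.standard_normal((s, p)), rng.standard_normal((t, p))
A = rng.standard_normal((p, p)); M = A @ A.T + np.eye(p)
xs, xt = rng.standard_normal((s, 1)), rng.standard_normal((t, 1))
v0 = Ws.T @ xs - Wt.T @ xt
u = (v0.T @ M @ v0).item() + 0.5             # keep the logit in a moderate range
G = grad_Ws(Ws, Wt, M, xs, xt, y, u, beta)
h, (i, j) = 1e-6, (0, 1)
E = np.zeros((s, p)); E[i, j] = h
fd = (loss(Ws + E, Wt, M, xs, xt, y, u, beta)
      - loss(Ws - E, Wt, M, xs, xt, y, u, beta)) / (2 * h)
print(G[i, j], fd)                           # should agree to several decimals
```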
Derivative w.r.t. a slack variable \epsilon_{k}
The slack variables are nonnegative by construction. To avoid imposing an explicit nonnegativity constraint, we substitute \epsilon_{k}=e^{v_{k}} into Eq. 17:
\therefore\;\ell_{\beta}=\frac{1}{\beta}\log\Big(1+\exp\big(\beta y_{k}\big(d({\boldsymbol{M}},{\boldsymbol{W}}_{s},{\boldsymbol{W}}_{t})-1-y_{k}e^{v_{k}}\big)\big)\Big).  (23)
The derivative of Eq. 23 w.r.t. v_{k} is then
\nabla_{{v}_{k}}\ell_{\beta}=-\frac{e^{v_{k}}\exp\big(\beta y_{k}(d({\boldsymbol{M}},{\boldsymbol{W}}_{s},{\boldsymbol{W}}_{t})-1-y_{k}e^{v_{k}})\big)}{1+\exp\big(\beta y_{k}(d({\boldsymbol{M}},{\boldsymbol{W}}_{s},{\boldsymbol{W}}_{t})-1-y_{k}e^{v_{k}})\big)}=-\frac{e^{v_{k}}r}{(1+r)}=-e^{v_{k}}(1+r^{-1})^{-1}.  (24)
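The chain rule behind Eq. 24 can be checked in one dimension; the sketch below holds d fixed, evaluates the analytic gradient, and compares it with a finite difference (all constants are illustrative).

```python
# Sketch: the reparameterization eps_k = exp(v_k) and the gradient of Eq. 24,
# checked against a central finite difference (d is held fixed for this check).
import numpy as np

def loss_v(v, d, y, beta):
    """Eq. 23 viewed as a function of v_k only (d kept constant)."""
    return np.log1p(np.exp(beta * y * (d - 1 - y * np.exp(v)))) / beta

def grad_v(v, d, y, beta):
    """Eq. 24: -exp(v_k) (1 + r^{-1})^{-1}."""
    r = np.exp(beta * y * (d - 1 - y * np.exp(v)))
    return -np.exp(v) / (1 + 1 / r)

d, y, beta, v, h = 1.7, +1, 5.0, -0.3, 1e-6
print(grad_v(v, d, y, beta),
      (loss_v(v + h, d, y, beta) - loss_v(v - h, d, y, beta)) / (2 * h))
```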
Table 11: Summary of the gradients of \ell_{\beta} and \mathcal{L}_{u}.
\nabla_{{\boldsymbol{W}}_{s}}\ell_{\beta}\;({\boldsymbol{x}}_{i}^{s}\in\mathbb{R}^{s},\,{\boldsymbol{x}}_{j}^{t}\in\mathbb{R}^{t}): 2y_{k}(1+r^{-1})^{-1}{\boldsymbol{x}}^{s}_{i}({{\boldsymbol{x}}^{s}_{i}}^{T}{\boldsymbol{W}}_{s}-{{\boldsymbol{x}}^{t}_{j}}^{T}{\boldsymbol{W}}_{t}){\boldsymbol{M}}
\nabla_{{\boldsymbol{W}}_{s}}\ell_{\beta}\;({\boldsymbol{x}}_{i}^{s}\in\mathbb{R}^{s},\,{\boldsymbol{x}}_{j}^{s}\in\mathbb{R}^{s}): 2y_{k}(1+r^{-1})^{-1}({\boldsymbol{x}}_{i}^{s}-{\boldsymbol{x}}_{j}^{s})({{\boldsymbol{x}}^{s}_{i}}^{T}-{{\boldsymbol{x}}^{s}_{j}}^{T}){\boldsymbol{W}}_{s}{\boldsymbol{M}}
\nabla_{{\boldsymbol{W}}_{t}}\ell_{\beta}\;({\boldsymbol{x}}_{i}^{s}\in\mathbb{R}^{s},\,{\boldsymbol{x}}_{j}^{t}\in\mathbb{R}^{t}): 2y_{k}(1+r^{-1})^{-1}{\boldsymbol{x}}^{t}_{j}({{\boldsymbol{x}}^{t}_{j}}^{T}{\boldsymbol{W}}_{t}-{{\boldsymbol{x}}^{s}_{i}}^{T}{\boldsymbol{W}}_{s}){\boldsymbol{M}}
\nabla_{{\boldsymbol{W}}_{t}}\ell_{\beta}\;({\boldsymbol{x}}_{i}^{t}\in\mathbb{R}^{t},\,{\boldsymbol{x}}_{j}^{t}\in\mathbb{R}^{t}): 2y_{k}(1+r^{-1})^{-1}({\boldsymbol{x}}_{i}^{t}-{\boldsymbol{x}}_{j}^{t})({{\boldsymbol{x}}^{t}_{i}}^{T}-{{\boldsymbol{x}}^{t}_{j}}^{T}){\boldsymbol{W}}_{t}{\boldsymbol{M}}
\nabla_{{\boldsymbol{M}}}\ell_{\beta}: y_{k}(1+r^{-1})^{-1}({\boldsymbol{W}}_{s}^{T}{\boldsymbol{x}}^{s}_{i}-{\boldsymbol{W}}_{t}^{T}{\boldsymbol{x}}^{t}_{j})({{\boldsymbol{x}}^{s}_{i}}^{T}{\boldsymbol{W}}_{s}-{{\boldsymbol{x}}^{t}_{j}}^{T}{\boldsymbol{W}}_{t})
\nabla_{v_{k}}\ell_{\beta}: -e^{v_{k}}(1+r^{-1})^{-1}
\nabla_{{\boldsymbol{W}}_{s}}\mathcal{L}_{u}: \frac{1}{p}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s}\Big(2\big({\boldsymbol{W}}_{s}^{T}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s}+{\boldsymbol{W}}_{t}^{T}{\boldsymbol{\Sigma}}_{t}{\boldsymbol{W}}_{t}\big)^{-1}-\big({\boldsymbol{W}}_{s}^{T}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s}\big)^{-1}\Big)
Derivative of the statistical loss \mathcal{L}_{u}
The statistical loss (i.e., the unsupervised loss) in Eq. 1 is defined using the Stein divergence \delta_{s}. We have
\mathcal{L}_{u}=\frac{1}{p}\delta_{s}\big({\boldsymbol{W}}_{s}^{T}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s},{\boldsymbol{W}}_{t}^{T}{\boldsymbol{\Sigma}}_{t}{\boldsymbol{W}}_{t}\big)
=\frac{1}{p}\bigg\{\log\det\bigg(\frac{{\boldsymbol{W}}_{s}^{T}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s}+{\boldsymbol{W}}_{t}^{T}{\boldsymbol{\Sigma}}_{t}{\boldsymbol{W}}_{t}}{2}\bigg)-\frac{1}{2}\log\det\big({\boldsymbol{W}}_{s}^{T}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s}\,{\boldsymbol{W}}_{t}^{T}{\boldsymbol{\Sigma}}_{t}{\boldsymbol{W}}_{t}\big)\bigg\}
=\frac{1}{p}\bigg\{\log\det\bigg(\frac{{\boldsymbol{W}}_{s}^{T}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s}+{\boldsymbol{W}}_{t}^{T}{\boldsymbol{\Sigma}}_{t}{\boldsymbol{W}}_{t}}{2}\bigg)-\frac{1}{2}\log\det({\boldsymbol{W}}_{s}^{T}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s})-\frac{1}{2}\log\det({\boldsymbol{W}}_{t}^{T}{\boldsymbol{\Sigma}}_{t}{\boldsymbol{W}}_{t})\bigg\},  (25)
where {\boldsymbol{\Sigma}}_{s} and {\boldsymbol{\Sigma}}_{t} are the source and target domain covariance matrices, respectively. Making use of 2.11 of Brookes (2005), we obtain
\nabla_{{\boldsymbol{W}}_{s}}\log\det({\boldsymbol{W}}_{s}^{T}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s})=2{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s}({\boldsymbol{W}}_{s}^{T}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s})^{-1}.  (26)
The derivative of Eq. 25 with respect to {\boldsymbol{W}}_{s} (and, by the symmetry of the Stein divergence in its two arguments, similarly for {\boldsymbol{W}}_{t}) can then be obtained as
\nabla_{{\boldsymbol{W}}_{s}}\mathcal{L}_{u}=\frac{1}{p}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s}\bigg\{\bigg(\frac{{\boldsymbol{W}}_{s}^{T}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s}+{\boldsymbol{W}}_{t}^{T}{\boldsymbol{\Sigma}}_{t}{\boldsymbol{W}}_{t}}{2}\bigg)^{-1}-({\boldsymbol{W}}_{s}^{T}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s})^{-1}\bigg\}
=\frac{1}{p}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s}\bigg\{2\Big({\boldsymbol{W}}_{s}^{T}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s}+{\boldsymbol{W}}_{t}^{T}{\boldsymbol{\Sigma}}_{t}{\boldsymbol{W}}_{t}\Big)^{-1}-({\boldsymbol{W}}_{s}^{T}{\boldsymbol{\Sigma}}_{s}{\boldsymbol{W}}_{s})^{-1}\bigg\}.  (27)
The derivatives are summarized in Table 11.
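For completeness, Eqs. 25–27 can also be verified numerically; the following sketch evaluates the projected Stein divergence and the Euclidean gradient of Eq. 27 on synthetic covariances and compares one gradient entry against a central finite difference.

```python
# Sketch: the projected Stein divergence of Eq. 25 and the gradient of Eq. 27,
# checked against a central finite difference. Data and dimensions are synthetic.
import numpy as np

def stat_loss(Ws, Wt, Sig_s, Sig_t):
    """L_u = (1/p) delta_s(Ws^T Sig_s Ws, Wt^T Sig_t Wt), Eq. 25."""
    A, B = Ws.T @ Sig_s @ Ws, Wt.T @ Sig_t @ Wt
    p = A.shape[0]
    ld = lambda X: np.linalg.slogdet(X)[1]
    return (ld((A + B) / 2) - 0.5 * ld(A) - 0.5 * ld(B)) / p

def grad_Ws_stat(Ws, Wt, Sig_s, Sig_t):
    """Eq. 27: (1/p) Sig_s Ws (2 (A + B)^{-1} - A^{-1})."""
    A, B = Ws.T @ Sig_s @ Ws, Wt.T @ Sig_t @ Wt
    p = A.shape[0]
    return Sig_s @ Ws @ (2 * np.linalg.inv(A + B) - np.linalg.inv(A)) / p

rng = np.random.default_rng(3)
s, t, p = 8, 9, 3
Ws = np.linalg.qr(rng.standard_normal((s, p)))[0]   # orthonormal source projection
Wt = np.linalg.qr(rng.standard_normal((t, p)))[0]   # orthonormal target projection
Sig_s = np.cov(rng.standard_normal((s, 50)))        # source covariance (synthetic)
Sig_t = np.cov(rng.standard_normal((t, 60)))        # target covariance (synthetic)
G = grad_Ws_stat(Ws, Wt, Sig_s, Sig_t)
h, (i, j) = 1e-6, (2, 1)
E = np.zeros((s, p)); E[i, j] = h
fd = (stat_loss(Ws + E, Wt, Sig_s, Sig_t)
      - stat_loss(Ws - E, Wt, Sig_s, Sig_t)) / (2 * h)
print(G[i, j], fd)                                  # should agree closely
```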