A Unified Framework for Domain Adaptation using Metric Learning on Manifolds
We present a novel framework for domain adaptation, whereby both geometric and statistical differences between a labeled source domain and unlabeled target domain can be reconciled using a unified mathematical framework that exploits the curved Riemannian geometry of statistical manifolds. It is assumed that the feature distributions of the source and target domains are sufficiently dissimilar that a classifier trained on the source domain would not perform well on the target domain (e.g., ranking Amazon book reviews given labeled movie reviews). Various approaches to domain adaptation have been studied in the literature, ranging from geometric approaches to statistical approaches. Our approach is based on formulating transfer from source to target as a problem of geometric mean metric learning on manifolds. Specifically, we exploit the curved Riemannian manifold geometry of symmetric positive definite (SPD) covariance matrices. We exploit a simple but important observation: because the space of covariance matrices is both a Riemannian space and a homogeneous space, the shortest-path geodesic between two covariances on the manifold can be computed analytically. Statistics on the SPD matrix manifold, such as the geometric mean of two SPD matrices, can be reduced to solving the well-known Riccati equation. We show how the Riccati solution can be constrained to reduce not only the statistical differences between the source and target domains, such as aligning second-order covariances and minimizing the maximum mean discrepancy, but also the differences in the underlying geometry of the source and target domains, using diffusions on the underlying source and target manifolds.
A key strength of our proposed approach is that it enables integrating multiple sources of variation between source and target in a unified way, by reducing the combined objective function to a nested set of Riccati equations where the solution can be represented by a cascaded series of geometric mean computations. In addition to showing the theoretical optimality of our solution, we present detailed experiments using standard transfer learning testbeds from computer vision comparing our proposed algorithms to past work in domain adaptation, showing improved results over a large variety of previous methods.
When we apply machine learning [18] to real-world problems, e.g., in image recognition [17] or speech recognition [12], a significant challenge is the need for large amounts of (labeled) training data, which may not always be available. Consequently, there has been longstanding interest in developing machine learning techniques that can transfer knowledge across domains, thereby alleviating to some extent the need for training data as well as the time required to train the machine learning system. A detailed survey of transfer learning is given in [21].
Traditional machine learning assumes that the distribution of test examples follows that of the training examples [8], whereas in transfer learning, this assumption is usually violated. Domain adaptation (DA) is a well-studied formulation of transfer learning that is based on developing methods that deal with the change of distribution in test instances as compared with training instances [5, 10]. In this paper, we propose a new framework for domain adaptation, based on formulating transfer from source to target as a problem of geometric mean metric learning on manifolds. Our proposed approach enables integrating multiple sources of variation between source and target in a unified framework with a theoretically optimal solution. We also present detailed experiments using standard transfer learning testbeds from computer vision, showing how our proposed algorithms give improved results compared to existing methods.
Background: One standard approach to domain adaptation is based on modeling the covariate shift [1]. Unlike traditional machine learning, in DA, the training and test examples are assumed to have different distributions. It is usual in DA to categorize the problem into different types: (i) semi-supervised domain adaptation, (ii) unsupervised domain adaptation, (iii) multi-source domain adaptation, and (iv) heterogeneous domain adaptation.
Another popular approach to domain adaptation is based on aligning the distributions between source and target domains. A common strategy is based on the maximum mean discrepancy (MMD) metric [7], which is a nonparametric technique for measuring the dissimilarity of empirical distributions between source and target domains. Domain-invariant projection is one method that seeks to minimize the MMD measure using optimization on the Grassmannian manifold of fixed-dimensional subspaces of $n$-dimensional Euclidean space [2].
Linear approaches to domain adaptation involve the use of alignment of lower-dimensional subspaces or covariances from a data source domain $D_s = \{x_i^s\}$ with labels $L_s = \{y_i\}$ to a target data domain $D_t = \{x_i^t\}$. We assume both $x_i^s$ and $x_i^t$ are $n$-dimensional Euclidean vectors, representing the values of $n$ features of each training example. One popular approach to domain adaptation relies on first projecting the data from the source and target domains onto a low-dimensional subspace, and then finding correspondences between the source and target subspaces. Of these approaches, the most widely used one is Canonical Correlation Analysis (CCA) [16], a standard statistical technique used in many applications of machine learning and bioinformatics. Several nonlinear versions [14] and deep learning variants [25] of CCA have been proposed. These methods often require explicit correspondences between the source and target domains to learn a common subspace. Because CCA finds a linear subspace, a family of manifold alignment methods has been developed that extends CCA [23, 9] to exploit the nonlinear structure present in many datasets.
In contrast to using a single shared subspace across source and target domains, subspace alignment finds a linear mapping that transforms the source data subspace into the target data subspace [13]. To explain the basic algorithm, let $P_s, P_t$ denote the two sets of basis vectors that span the subspaces for the "source" and "target" domains. Subspace alignment attempts to find a linear mapping $M$ that minimizes
$$F(M) = \left\| P_s M - P_t \right\|_F^2.$$
It can be shown that the solution to the above optimization problem is simply the dot product between $P_s$ and $P_t$, i.e.,
$$M^* = \operatorname*{argmin}_M F(M) = P_s^\top P_t.$$
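The closed form for subspace alignment can be checked numerically. The following is a minimal NumPy sketch, assuming the bases $P_s$ and $P_t$ have orthonormal columns obtained from PCA; the helper names `pca_basis` and `subspace_alignment` and the synthetic data are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def pca_basis(X, d):
    """Top-d principal directions of X (rows are samples), returned as columns."""
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the centered data are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T  # shape (n_features, d), orthonormal columns

def subspace_alignment(Ps, Pt):
    """Minimizer of ||Ps M - Pt||_F^2 when Ps has orthonormal columns."""
    return Ps.T @ Pt

# Synthetic source and target data (assumption: 6 features, 3-dim subspaces).
rng = np.random.default_rng(0)
Xs = rng.normal(size=(200, 6))
Xt = rng.normal(size=(150, 6)) @ rng.normal(size=(6, 6))

Ps, Pt = pca_basis(Xs, 3), pca_basis(Xt, 3)
M = subspace_alignment(Ps, Pt)

# Cross-check against a generic least-squares solve of the same problem.
M_lstsq, *_ = np.linalg.lstsq(Ps, Pt, rcond=None)
```

Because $P_s^\top P_s = I$, the normal equations reduce to $M = P_s^\top P_t$, which is why the generic least-squares solution agrees with the dot product.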
Another approach exploits the property that the set of $k$-dimensional subspaces in $n$-dimensional Euclidean space forms a curved manifold called the Grassmannian [11], a type of matrix manifold. The domain adaptation method called geodesic flow kernels (GFK) [15] is based on constructing a distance function between source and target subspaces that is based on the geodesic or shortest path between these two elements on the Grassmannian.
Rather than aligning subspaces, a popular technique called CORAL [22] aligns correlations between source and target domains. Let $\mu_s, \mu_t$ and $A_s, A_t$ represent the means and covariances of the source and target domains, respectively. CORAL finds a linear transformation $A$ that minimizes the distance between the second-order statistics of the source and target features (which can be assumed to be normalized to zero mean). Using the Frobenius (Euclidean) norm as the matrix distance metric, CORAL is based on solving the following optimization problem:
$$\min_{A} \left\| A^\top A_s A - A_t \right\|_F^2, \tag{1}$$
where $A_s, A_t$ are of size $n \times n$. Using the singular value decomposition of $A_s$ and $A_t$, CORAL [22] computes a particular closed-form solution (see Footnote 1) to find the desired linear transformation $A$.
Footnote 1: The solution characterization in [22] is non-unique. [22, Theorem 1] shows that the optimal $A$, for full-rank $A_s$ and $A_t$, is characterized as $A = U_s \Sigma_s^{-1/2} U_s^\top U_t \Sigma_t^{1/2} U_t^\top$, where $U_s \Sigma_s U_s^\top$ and $U_t \Sigma_t U_t^\top$ are the eigenvalue decompositions of $A_s$ and $A_t$, respectively. However, it can be readily checked that there exists a continuous set of optimal solutions characterized as $A = U_s \Sigma_s^{-1/2} U_s^\top O\, U_t \Sigma_t^{1/2} U_t^\top$, where $O$ is any orthogonal matrix, i.e., $O O^\top = O^\top O = I$, of size $n \times n$. A similar construction for non-uniqueness of the CORAL solution also holds for rank-deficient $A_s$ and $A_t$.
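The non-uniqueness of the CORAL solution is easy to verify numerically in the full-rank case. The sketch below is an illustration under stated assumptions: the covariances are random SPD matrices, and `spd_power` and `random_spd` are hypothetical helpers introduced here. It checks that both the whitening-recoloring solution $A_s^{-1/2} A_t^{1/2}$ and a variant with an arbitrary orthogonal matrix inserted in the middle satisfy $A^\top A_s A = A_t$.

```python
import numpy as np

def spd_power(S, p):
    """S^p for a symmetric positive definite matrix S, via eigendecomposition."""
    w, U = np.linalg.eigh((S + S.T) / 2)  # symmetrize against round-off
    return (U * w**p) @ U.T

def random_spd(n, rng):
    """A random, well-conditioned n x n SPD matrix (stand-in for a covariance)."""
    B = rng.normal(size=(n, n))
    return B @ B.T + n * np.eye(n)

rng = np.random.default_rng(0)
n = 4
As, At = random_spd(n, rng), random_spd(n, rng)

# Particular solution: whiten with As^{-1/2}, then recolor with At^{1/2}.
A = spd_power(As, -0.5) @ spd_power(At, 0.5)

# Any orthogonal O between the two factors gives another exact solution.
O, _ = np.linalg.qr(rng.normal(size=(n, n)))
A_alt = spd_power(As, -0.5) @ O @ spd_power(At, 0.5)

residual = np.linalg.norm(A.T @ As @ A - At)
residual_alt = np.linalg.norm(A_alt.T @ As @ A_alt - At)
```

Both residuals vanish up to round-off while `A` and `A_alt` differ, which is the continuous family of minimizers described in Footnote 1.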
Novelty of our Approach: Our proposed solution differs from the above previous approaches in several fundamental ways. First, we explicitly model the space of covariance matrices as a curved Riemannian manifold of symmetric positive definite (SPD) matrices. Note that the difference of two SPD matrices need not be an SPD matrix, and hence SPD matrices do not form a vector space. Second, our approach can be shown to be both unique and globally optimal, unlike some of the above approaches. Uniqueness and optimality derive from the fact that we reduce all domain adaptation computations to nested equations involving solving the well-known Riccati equation [6].
The organization of the paper is as follows. In Section 2, we show the connection between the domain adaptation problem and the metric learning problem. In particular, we discuss the Riccati point of view for the domain adaptation problem. Section 3 briefly discusses the Riemannian geometry of the space of SPD matrices. Sections 4 and 5 discuss additional domain adaptation formulations. Our proposed algorithms are presented in Sections 6 and 7. Finally, in Section 8 we show the experimental results on the standard Office and the extended Office-Caltech10 datasets, where our algorithms show clear improvements over CORAL.
2 Domain Adaptation using Metric Learning
In this section, we will describe the central idea of this paper: modeling the problem of domain adaptation as a geometric mean metric learning problem. Before explaining the specific approach, it will be useful to introduce some background. The metric learning problem [4] involves taking input data in $\mathbb{R}^n$ and constructing a (non)linear mapping $\Phi : \mathbb{R}^n \to \mathbb{R}^m$, so that the distance between two points $x$ and $y$ in $\mathbb{R}^n$ can be measured using the distance $\|\Phi(x) - \Phi(y)\|$. A simple approach is to learn a squared Mahalanobis distance: $d_A(x, y) = (x - y)^\top A (x - y)$, where $x, y \in \mathbb{R}^n$ and $A$ is an $n \times n$ symmetric positive definite (SPD) matrix. If we represent $A = W^\top W$, for some linear transformation matrix $W$, then it is easy to see that $d_A(x, y) = \|Wx - Wy\|_F^2$, thereby showing that the Mahalanobis distance is tantamount to projecting the data into a potentially lower-dimensional space, and measuring distances using the Euclidean (Frobenius) norm. Typically, the matrix $A$ is learned using some weak supervision, given two sets of training examples of the form:
$$S = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ are in the same class}\}, \qquad D = \{(x_i, x_j) \mid x_i \text{ and } x_j \text{ are in different classes}\}.$$
A large variety of metric learning methods can be designed by formulating different optimization objectives based on functions over the $S$ and $D$ sets to extract information about the distance matrix $A$.
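The identity between the squared Mahalanobis distance and the squared Euclidean distance after projection can be confirmed with a few lines of NumPy. This is a small sketch on random data; the dimensions and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3                      # input dimension and projected dimension

W = rng.normal(size=(m, n))      # linear transformation (assumed, random)
A = W.T @ W                      # induced metric; only semidefinite when m < n

x, y = rng.normal(size=n), rng.normal(size=n)

d_mahalanobis = (x - y) @ A @ (x - y)          # (x - y)^T A (x - y)
d_projected = np.sum((W @ x - W @ y) ** 2)     # ||Wx - Wy||^2
```

Note that when $W$ maps to a strictly lower-dimensional space, $A = W^\top W$ is only positive semidefinite; a strictly SPD $A$ corresponds to a $W$ with full column rank.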
For our purposes, the method that will provide the closest inspiration to our goal of designing a domain adaptation method based on metric learning is the recently proposed geometric mean metric learning (GMML) algorithm [26]. GMML models the distance between points in the set $S$ by the Mahalanobis distance $d_A$ by exploiting the geometry of the SPD matrices, and crucially, also models the distance between points in the disagreement set $D$ by the inverse metric $d_{A^{-1}}$. GMML is based on solving the following objective function over all SPD matrices $A$:
$$\min_{A \succ 0} \sum_{(x_i, x_j) \in S} d_A(x_i, x_j) + \sum_{(x_i, x_j) \in D} d_{A^{-1}}(x_i, x_j),$$
where $A \succ 0$ refers to the set of all SPD matrices.
Several researchers have previously explored the connection between domain adaptation and metric learning. One recent approach is based on constructing a transformation matrix that not only minimizes the difference between the source and target distributions based on the previously noted MMD metric, but also captures the manifold geometry of source and target domains, and attempts to preserve the discriminative power of the label information in the source domain [24]. Our approach builds on these ideas, with some significant differences. First, we use an objective function that is based on finding a solution that lies on the geodesic between source and target (estimated) covariance matrices (which are modeled as symmetric positive definite matrices). Second, we use a cascaded series of geometric mean computations to balance multiple factors. We describe these ideas in more detail in this and the next section.
We now describe how the problem of domain adaptation can be considered as a type of metric learning problem, called geometric mean metric learning (GMML) [26]. Recall that in domain adaptation, we are given a source dataset $D_s$ (usually with a set of training labels) and a target dataset $D_t$ (unlabeled). The aim of domain adaptation, as reviewed above, is to construct an intermediate representation that combines some of the features of both the source and target domains, with the rationale being that the distribution of target features differs from that of the source. Relying purely on either the source or the target features is therefore suboptimal, and the challenge is to determine what intermediate representation will provide optimal transfer between the domains.
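To connect metric learning to domain adaptation, note that we can define the two sets $S$ and $D$ in the metric learning problem as associated with the source and target domains respectively, whereby (see Footnote 2)
$$S \subseteq D_s \times D_s = \{(x_i, x_j) \mid x_i \in D_s,\ x_j \in D_s\}, \qquad D \subseteq D_t \times D_t = \{(x_i, x_j) \mid x_i \in D_t,\ x_j \in D_t\}.$$
Footnote 2: We note that while there are alternative ways to define the $S$ and $D$ sets, the essence of our approach remains similar.
Our approach seeks to exploit the nonlinear geometry of covariance matrices to find a Mahalanobis distance matrix $A$, such that we can represent distances in the source domain using $A$, but crucially we measure distances in the target domain using the inverse $A^{-1}$:
$$\min_{A \succ 0} \sum_{(x_i, x_j) \in S} (x_i - x_j)^\top A\, (x_i - x_j) + \sum_{(x_i, x_j) \in D} (x_i - x_j)^\top A^{-1} (x_i - x_j).$$
To provide some intuition here, we observe that as we vary $A$ to reduce the source-domain distance $\sum_{(x_i, x_j) \in S} d_A(x_i, x_j)$, the target-domain distance $\sum_{(x_i, x_j) \in D} d_{A^{-1}}(x_i, x_j)$ increases, and vice versa. Consequently, by appropriately choosing $A$, we can seek to minimize the above sum. We can now use the matrix trace to reformulate the Mahalanobis distances:
$$\min_{A \succ 0} \sum_{(x_i, x_j) \in S} \mathrm{tr}\!\left(A (x_i - x_j)(x_i - x_j)^\top\right) + \sum_{(x_i, x_j) \in D} \mathrm{tr}\!\left(A^{-1} (x_i - x_j)(x_i - x_j)^\top\right).$$
Denoting the source and target covariance matrices $A_s$ and $A_t$ as:
$$A_s := \sum_{(x_i, x_j) \in S} (x_i - x_j)(x_i - x_j)^\top, \qquad A_t := \sum_{(x_i, x_j) \in D} (x_i - x_j)(x_i - x_j)^\top,$$
we can finally write a new formulation of the domain adaptation problem as minimizing the following objective function to find the SPD matrix $A$ such that:
$$\min_{A \succ 0} \omega(A) := \mathrm{tr}(A A_s) + \mathrm{tr}(A^{-1} A_t). \tag{4}$$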
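The trace reformulation can be sanity-checked on toy data: summing pairwise Mahalanobis distances directly should agree with $\mathrm{tr}(A A_s) + \mathrm{tr}(A^{-1} A_t)$. The sketch below assumes, purely for illustration, that $S$ and $D$ contain all unordered pairs within the source and target samples; `pair_scatter` and `omega` are hypothetical helper names.

```python
import numpy as np
from itertools import combinations

def pair_scatter(X):
    """Sum of (x_i - x_j)(x_i - x_j)^T over all unordered pairs of rows of X."""
    S = np.zeros((X.shape[1], X.shape[1]))
    for i, j in combinations(range(len(X)), 2):
        d = X[i] - X[j]
        S += np.outer(d, d)
    return S

def omega(A, As, At):
    """Objective tr(A As) + tr(A^{-1} At)."""
    return np.trace(A @ As) + np.trace(np.linalg.solve(A, At))

rng = np.random.default_rng(0)
Xs = rng.normal(size=(12, 3))          # toy source samples (rows)
Xt = rng.normal(size=(10, 3)) + 1.0    # toy target samples (rows)
As, At = pair_scatter(Xs), pair_scatter(Xt)

# An arbitrary SPD candidate metric.
B = rng.normal(size=(3, 3))
A = B @ B.T + np.eye(3)
A_inv = np.linalg.inv(A)

# Direct evaluation as sums of pairwise Mahalanobis distances.
direct = sum((Xs[i] - Xs[j]) @ A @ (Xs[i] - Xs[j])
             for i, j in combinations(range(len(Xs)), 2))
direct += sum((Xt[i] - Xt[j]) @ A_inv @ (Xt[i] - Xt[j])
              for i, j in combinations(range(len(Xt)), 2))
```

The two evaluations coincide because $u^\top A u = \mathrm{tr}(A\, u u^\top)$ and the trace is linear.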
3 Riemannian Geometry of SPD Matrices
In this section, we outline some other formulations of domain adaptation that will be useful to discuss for presenting our overall approach.
Since our proposed approach to domain adaptation builds on the nonlinear geometry of the space of SPD (or covariance) matrices, as Figure 1 shows, we review some of this material first [6]. Consider a simple example of a $2 \times 2$ SPD matrix $M$:
$$M = \begin{bmatrix} a & b \\ b & c \end{bmatrix},$$
where $a > 0$, and the SPD requirement implies the positivity of the determinant, $ac - b^2 > 0$. Thus, the set of all SPD matrices of size $2 \times 2$ forms the interior of a cone in $\mathbb{R}^3$. More generally, the space of all $n \times n$ SPD matrices forms a manifold of non-positive curvature in $\mathbb{R}^{n^2}$. In the CORAL objective function in Equation (1), the goal is to find a transformation $A$ that makes the source covariance resemble the target as closely as possible. Our approach simplifies Equation (1) by restricting the transformation matrix to be an SPD matrix, i.e., $A \succ 0$, and furthermore, we solve the resulting nonlinear equation exactly on the manifold of SPD matrices. More formally, we solve the Riccati equation [6]:
$$A A_s A = A_t,$$
where $A_s$ and $A_t$ are the source and target covariance SPD matrices, respectively. Note that in comparison with the CORAL approach in Equation (1), the matrix $A$ is symmetric (and positive definite), so $A$ and $A^\top$ are the same. The solution to the above Riccati equation is the well-known geometric mean, or sharp mean, of the two SPD matrices $A_s^{-1}$ and $A_t$:
$$A = A_s^{-1} \#_{1/2} A_t = A_s^{-1/2} \left( A_s^{1/2} A_t A_s^{1/2} \right)^{1/2} A_s^{-1/2},$$
where $\#$ denotes the geometric mean of SPD matrices [6]. The sharp mean has an intuitive geometric interpretation: it is the midpoint of the geodesic connecting the source domain and target domain matrices, where length is measured on the Riemannian manifold of SPD matrices.
In a manifold, the shortest path between two elements, if it exists, can be represented by a geodesic. For the SPD manifold, it can be shown that the geodesic $\gamma(t)$ for a scalar $0 \le t \le 1$ between $A_s^{-1}$ and $A_t$ is given by [6]:
$$\gamma(t) = A_s^{-1/2} \left( A_s^{1/2} A_t A_s^{1/2} \right)^{t} A_s^{-1/2}.$$
It is common to denote $\gamma(t)$ as the so-called "weighted" sharp mean $A_s^{-1} \#_t A_t$ [6]. It is easy to see that for $t = 0$, $\gamma(0) = A_s^{-1}$, and for $t = 1$, we have $\gamma(1) = A_t$. For the distinguished value of $t = 1/2$, it turns out that $\gamma(1/2)$ is the geometric mean of $A_s^{-1}$ and $A_t$, and satisfies all the properties of a geometric mean [6].
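The geodesic and its midpoint can be computed with two eigendecompositions. The following is a minimal NumPy sketch, assuming random SPD stand-ins for the covariances; `spd_power`, `sharp_mean`, and `random_spd` are hypothetical helper names. It checks that the midpoint solves the Riccati equation $A A_s A = A_t$ and that the endpoints of the geodesic are $A_s^{-1}$ and $A_t$.

```python
import numpy as np

def spd_power(S, p):
    """S^p for a symmetric positive definite matrix S, via eigendecomposition."""
    w, U = np.linalg.eigh((S + S.T) / 2)  # symmetrize against round-off
    return (U * w**p) @ U.T

def sharp_mean(P, Q, t=0.5):
    """Weighted geometric mean P #_t Q = P^{1/2} (P^{-1/2} Q P^{-1/2})^t P^{1/2}."""
    P_half, P_inv_half = spd_power(P, 0.5), spd_power(P, -0.5)
    return P_half @ spd_power(P_inv_half @ Q @ P_inv_half, t) @ P_half

def random_spd(n, rng):
    """A random, well-conditioned n x n SPD matrix (stand-in for a covariance)."""
    B = rng.normal(size=(n, n))
    return B @ B.T + n * np.eye(n)

rng = np.random.default_rng(0)
As, At = random_spd(4, rng), random_spd(4, rng)
As_inv = np.linalg.inv(As)

A = sharp_mean(As_inv, At)                      # midpoint, t = 1/2
riccati_residual = np.linalg.norm(A @ As @ A - At)

start = sharp_mean(As_inv, At, t=0.0)           # should equal As^{-1}
end = sharp_mean(As_inv, At, t=1.0)             # should equal At
```

With $P = A_s^{-1}$ the general formula reduces to $A_s^{-1/2} (A_s^{1/2} A_t A_s^{1/2})^{t} A_s^{-1/2}$, the geodesic between $A_s^{-1}$ and $A_t$.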
The following theorem summarizes some of the properties of the objective function given by Equation (4).
4 Statistical Alignment Across Domains
A key strength of our approach is that it can exploit both geometric and statistical information, and multiple sources of alignment are integrated by solving nested sets of Riccati equations. To illustrate this point, in this section we explicitly introduce a secondary criterion of aligning the source and target domains so that the underlying (marginal) distributions are similar. As our results show later, we obtain a significant improvement over CORAL on a standard computer vision dataset (Office/Caltech/Amazon problem). Our approach outperforms CORAL for two reasons: we solve the Riccati equation uniquely, whereas the proposed CORAL solution is only one particular solution among a non-unique set, and we can exploit multiple sources of information.
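A common way to incorporate the statistical alignment constraint is based on minimizing the maximum mean discrepancy (MMD) metric [7], a nonparametric measure of the difference between two distributions. With $A = W^\top W$ as in Section 2, and $n_s$ and $n_t$ denoting the number of source and target examples, the MMD term is
$$\left\| \frac{1}{n_s} \sum_{i=1}^{n_s} W x_i^s - \frac{1}{n_t} \sum_{i=1}^{n_t} W x_i^t \right\|^2 = \mathrm{tr}(A X L X^\top), \tag{6}$$
where $X = [x_1^s, \ldots, x_{n_s}^s, x_1^t, \ldots, x_{n_t}^t]$ and $L \in \mathbb{R}^{(n_s + n_t) \times (n_s + n_t)}$, where $L(i, j) = 1/n_s^2$ if $x_i, x_j \in D_s$, $L(i, j) = 1/n_t^2$ if $x_i, x_j \in D_t$, and $L(i, j) = -1/(n_s n_t)$ otherwise. It is straightforward to show that $L \succeq 0$, a symmetric positive-semidefinite matrix [24]. We can now combine the MMD objective in Equation (6) with the previous geometric mean objective in Equation (4) to give rise to the following modified objective function:
$$\min_{A \succ 0} \xi(A) := \mathrm{tr}(A A_s) + \mathrm{tr}(A^{-1} A_t) + \mathrm{tr}(A X L X^\top). \tag{7}$$
We can once again find a closed-form solution to the modified objective in Equation (7) by taking gradients:
$$\nabla \xi(A) = A_s + X L X^\top - A^{-1} A_t A^{-1} = 0,$$
whose solution is now given by $A = A_m^{-1} \#_{1/2} A_t$, where $A_m = A_s + X L X^\top$.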
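To make the MMD-augmented solution concrete, the sketch below builds the matrix $L$ as a rank-one outer product, forms the revised source matrix $A_m = A_s + X L X^\top$, and checks the stationarity condition $A A_m A = A_t$. It is an illustration under stated assumptions: the data are synthetic, samples are stored as columns of $X$, sample covariances stand in for $A_s$ and $A_t$, and the helper names are hypothetical.

```python
import numpy as np

def spd_power(S, p):
    """S^p for a symmetric positive definite matrix S, via eigendecomposition."""
    w, U = np.linalg.eigh((S + S.T) / 2)
    return (U * w**p) @ U.T

def sharp_mean(P, Q, t=0.5):
    """Weighted geometric mean P #_t Q of two SPD matrices."""
    P_half, P_inv_half = spd_power(P, 0.5), spd_power(P, -0.5)
    return P_half @ spd_power(P_inv_half @ Q @ P_inv_half, t) @ P_half

def mmd_matrix(ns, nt):
    """L = e e^T with e = [1/ns, ..., 1/ns, -1/nt, ..., -1/nt]."""
    e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
    return np.outer(e, e)

rng = np.random.default_rng(0)
d, ns, nt = 3, 40, 30
Xs = rng.normal(size=(d, ns))           # columns are source samples
Xt = rng.normal(size=(d, nt)) + 0.5     # columns are target samples
X = np.hstack([Xs, Xt])
L = mmd_matrix(ns, nt)

As, At = np.cov(Xs), np.cov(Xt)         # stand-ins for the source/target matrices
Am = As + X @ L @ X.T                   # revised source matrix
A = sharp_mean(np.linalg.inv(Am), At)   # closed-form minimizer of the MMD objective

# tr(A X L X^T) equals the squared distance between projected domain means.
mean_gap = Xs.mean(axis=1) - Xt.mean(axis=1)
mmd_trace = np.trace(A @ X @ L @ X.T)
mmd_direct = mean_gap @ A @ mean_gap
```

Writing $L = e e^\top$ makes the positive semidefiniteness of $L$ immediate and shows that $X L X^\top$ is the outer product of the difference of the domain means.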
5 Geometrical Diffusion on Manifolds
Thus far we have shown how the solution to the domain adaptation problem can be shown to involve finding the geometric mean of two terms, one involving the source covariance information and the Maximum Mean Discrepancy (MMD) of source and target training instances, and the second involving the target covariance matrix. In this section, we impose additional geometrical constraints on the solution that involve modeling the nonlinear manifold geometry of the source and target domains.
The usual approach is to model the source and target domains as nonlinear manifolds and set up a diffusion on a discrete graph approximation of the continuous manifold [19], using a random walk on a nearest neighbor graph connecting nearby points. Standard results have been established showing asymptotic convergence of the graph Laplacian to the underlying manifold Laplacian [3]. We can use the above algorithm to find two graph kernels $K_s$ and $K_t$ that are based on the eigenvectors of the random walk on the source and target domain manifolds, respectively.
Here, $v_i^s$ and $v_i^t$ refer to the eigenvectors of the random walk diffusion matrix on the source and target manifolds, respectively, and $\lambda_i^s$ and $\lambda_i^t$ refer to the corresponding eigenvalues.
We can now introduce a new objective function that incorporates the source and target domain manifold geometry:
$$\min_{A \succ 0} \eta(A) := \mathrm{tr}(A A_s) + \mathrm{tr}(A^{-1} A_t) + \mathrm{tr}\!\left(A X (K + \mu L) X^\top\right), \tag{8}$$
where $K \succeq 0$ and $K = \begin{bmatrix} K_s^{-1} & 0 \\ 0 & K_t^{-1} \end{bmatrix}$, and $\mu$ is a weighting term that combines the geometric and statistical constraints over $A$.
Once again, we can exploit the SPD nature of the matrices involved: the closed-form solution to Equation (8) is $A = A_{gm}^{-1} \#_{1/2} A_t$, where $A_{gm} = A_s + X (K + \mu L) X^\top$.
6 Cascaded Weighted Geometric Mean
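One additional refinement that we use is the notion of a weighted geometric mean. To explain this idea, we introduce the following Riemannian distance metric on the nonlinear manifold of SPD matrices, defined for two SPD matrices $P$ and $Q$:
$$\delta_R^2(P, Q) = \left\| \log\!\left( Q^{-1/2} P\, Q^{-1/2} \right) \right\|_F^2.$$
For the first objective function in Equation (4), we get:
$$\min_{A \succ 0} \omega_t(A) := (1 - t)\, \delta_R^2(A, A_s^{-1}) + t\, \delta_R^2(A, A_t), \tag{9}$$
where $0 \le t \le 1$ is the weight parameter. The unique solution to (9) is given by the weighted geometric mean $A = A_s^{-1} \#_t A_t$. Note that the weighted metric mean objective (9) is no longer strictly convex (in the Euclidean sense), but remains geodesically strictly convex [26], [6, Chapter 6].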
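The role of the weight $t$ can be illustrated numerically: the weighted geometric mean $P \#_t Q$ sits at a fraction $t$ of the Riemannian distance from $P$ to $Q$, and it attains a lower value of the weighted objective than either endpoint. The sketch below is an illustration under stated assumptions: random SPD matrices play the roles of $A_s^{-1}$ and $A_t$, and all helper names are hypothetical.

```python
import numpy as np

def spd_power(S, p):
    """S^p for a symmetric positive definite matrix S, via eigendecomposition."""
    w, U = np.linalg.eigh((S + S.T) / 2)
    return (U * w**p) @ U.T

def sharp_mean(P, Q, t=0.5):
    """Weighted geometric mean P #_t Q of two SPD matrices."""
    P_half, P_inv_half = spd_power(P, 0.5), spd_power(P, -0.5)
    return P_half @ spd_power(P_inv_half @ Q @ P_inv_half, t) @ P_half

def riemannian_dist(P, Q):
    """delta_R(P, Q) = ||log(Q^{-1/2} P Q^{-1/2})||_F."""
    Q_inv_half = spd_power(Q, -0.5)
    M = Q_inv_half @ P @ Q_inv_half
    w = np.linalg.eigvalsh((M + M.T) / 2)
    return np.sqrt(np.sum(np.log(w) ** 2))

def weighted_objective(A, P, Q, t):
    """(1 - t) * delta_R^2(A, P) + t * delta_R^2(A, Q)."""
    return (1 - t) * riemannian_dist(A, P) ** 2 + t * riemannian_dist(A, Q) ** 2

rng = np.random.default_rng(0)
B1, B2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
P = B1 @ B1.T + 4 * np.eye(4)   # plays the role of As^{-1}
Q = B2 @ B2.T + 4 * np.eye(4)   # plays the role of At

t = 0.3
G = sharp_mean(P, Q, t)
D = riemannian_dist(P, Q)
```

At the weighted mean the objective equals $t(1 - t)\,\delta_R^2(P, Q)$, which is strictly below its value $t\,\delta_R^2(P, Q)$ at $P$ and $(1 - t)\,\delta_R^2(P, Q)$ at $Q$ whenever $0 < t < 1$.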
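Similarly, we introduce the weighted variant of the objective function given by Equation (7):
$$\min_{A \succ 0} \xi_t(A) := (1 - t)\, \delta_R^2\!\left(A, (A_s + X L X^\top)^{-1}\right) + t\, \delta_R^2(A, A_t),$$
whose unique solution is given by $A = A_m^{-1} \#_t A_t$, where $A_m = A_s + X L X^\top$ as before. A cascaded variant is obtained when we further exploit the SPD structure of $A_s$ and $X L X^\top$, i.e., $A_m = A_s \#_\gamma (X L X^\top)$ (the weighted geometric mean of $A_s$ and $X L X^\top$) instead of $A_m = A_s + X L X^\top$ (which is akin to the Euclidean mean of $A_s$ and $X L X^\top$). Here, $0 \le \gamma \le 1$ is the weight parameter.
Finally, we obtain the weighted variant of the third objective function in Equation (8):
$$\min_{A \succ 0} \eta_t(A) := (1 - t)\, \delta_R^2\!\left(A, (A_s + X (K + \mu L) X^\top)^{-1}\right) + t\, \delta_R^2(A, A_t),$$
whose unique solution is given by $A = A_{gm}^{-1} \#_t A_t$, where $A_{gm} = A_s + X (K + \mu L) X^\top$ as previously noted. Additionally, the cascaded variant is obtained when $A_{gm}$ is built from weighted geometric means of its terms instead of the sum $A_{gm} = A_s + X (K + \mu L) X^\top$.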
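The weighted solutions above share one computational pattern: build a revised source matrix, invert it, and take a weighted geometric mean with the target matrix. The sketch below illustrates this pattern for the first two objectives (including the cascaded form of the second) and for the non-cascaded third objective. It rests on several assumptions that are not part of the original text: the data are synthetic, the graph-kernel term `K` is a random positive semidefinite placeholder rather than an actual diffusion kernel, and a small ridge `eps` is added to the rank-deficient matrix $X L X^\top$ so that its geometric mean with $A_s$ remains invertible. All function names are hypothetical.

```python
import numpy as np

def spd_power(S, p):
    """S^p for a symmetric positive definite matrix S, via eigendecomposition."""
    w, U = np.linalg.eigh((S + S.T) / 2)
    return (U * w**p) @ U.T

def sharp_mean(P, Q, t=0.5):
    """Weighted geometric mean P #_t Q of two SPD matrices."""
    P_half, P_inv_half = spd_power(P, 0.5), spd_power(P, -0.5)
    return P_half @ spd_power(P_inv_half @ Q @ P_inv_half, t) @ P_half

def is_spd(S):
    return np.allclose(S, S.T) and np.all(np.linalg.eigvalsh((S + S.T) / 2) > 0)

def metric_covariance_only(As, At, t=0.5):
    """Weighted mean of As^{-1} and At (first objective)."""
    return sharp_mean(np.linalg.inv(As), At, t)

def metric_with_mmd(As, At, XLXt, t=0.5, gamma=None, eps=1e-6):
    """Second objective; gamma=None gives the additive form, else the cascaded form."""
    if gamma is None:
        Am = As + XLXt
    else:
        # Assumption: ridge eps keeps the rank-one MMD term positive definite.
        Am = sharp_mean(As, XLXt + eps * np.eye(len(As)), gamma)
    return sharp_mean(np.linalg.inv(Am), At, t)

def metric_with_geometry(As, At, X, K, L, mu=1.0, t=0.5):
    """Third objective, additive (non-cascaded) form."""
    Agm = As + X @ (K + mu * L) @ X.T
    return sharp_mean(np.linalg.inv(Agm), At, t)

rng = np.random.default_rng(0)
d, ns, nt = 3, 40, 30
Xs, Xt = rng.normal(size=(d, ns)), rng.normal(size=(d, nt)) + 0.5
X = np.hstack([Xs, Xt])
e = np.concatenate([np.full(ns, 1.0 / ns), np.full(nt, -1.0 / nt)])
L = np.outer(e, e)
B = rng.normal(size=(ns + nt, ns + nt))
K = B @ B.T / (ns + nt) ** 2            # placeholder PSD matrix, not a real graph kernel
As, At = np.cov(Xs), np.cov(Xt)
XLXt = X @ L @ X.T

A1 = metric_covariance_only(As, At, t=0.5)
A2 = metric_with_mmd(As, At, XLXt, t=0.5)
A2_cascaded = metric_with_mmd(As, At, XLXt, t=0.5, gamma=0.5)
A3 = metric_with_geometry(As, At, X, K, L, mu=1.0, t=0.5)
```

The additive and cascaded forms generally give different metrics, since the first combines $A_s$ and $X L X^\top$ along a straight line and the second along the manifold geodesic.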
7 Domain Adaptation Algorithms
We now describe the proposed domain adaptation algorithms, based on the above development of approaches reflecting geometric and statistical constraints on the inferred solution. All the proposed algorithms are summarized in Algorithm 1. The algorithms are based on finding a Mahalanobis distance matrix interpolating source and target covariances (GCA1), incorporating an additional MMD metric (GCA2) and finally, incorporating the source and target manifold geometry (GCA3). It is noteworthy that all the variants rely on computation of the sharp mean, a unifying motif that ties together the various proposed methods. Modeling the Riemannian manifold underlying SPD matrices ensures the optimality and uniqueness of our proposed methods.
8 Experimental Results
We present experimental results using the standard computer vision testbed used in prior work: the Office and extended Office-Caltech10 benchmark datasets. The Office-Caltech10 dataset contains 10 object categories from an office environment (e.g., keyboard, laptop, and so on) in four image domains: Amazon (A), Caltech256 (C), DSLR (D), and Webcam (W). The Office dataset has 31 categories (the previous 10 categories and 21 additional ones).
An exhaustive comparison of the three proposed methods with a variety of previous methods is summarized in Table 1. The previous methods compared in the table refer to the unsupervised domain adaptation approach where a support vector machine (SVM) classifier is used. The experiments follow the standard protocol established by previous works in domain adaptation using this dataset. The features used (SURF) are encoded with 800-bin bag-of-words histograms and normalized to have zero mean and unit standard deviation in each dimension. As there are four domains, there are 12 ensuing transfer learning problems, denoted in Table 1 as A→D (for Amazon to DSLR), etc. For each of the 12 transfer learning tasks, the best performing method is indicated in boldface. We used randomized trials for each experiment, randomly sampling the same number of labeled images in the source domain as the training set and using all the unlabeled data in the target domain as the test set. All experiments used a support vector machine (SVM) method to measure classifier accuracy, using a standard libsvm package. The methods compared against in Table 1 include the following alternatives:
Baseline-S: This approach uses the projection defined by using PCA in the source domain to project both source and target data.
Baseline-T: Here, PCA is used in the target domain to extract a low-dimensional subspace.
NA: No adaptation is used and the original subspace is used for both source and target domains.
GFK: This approach refers to the geodesic flow kernel [15], which computes the geodesic on the Grassmannian between the PCA-derived source and target subspaces computed from the source and target domains.
TCA: This approach refers to the transfer component analysis method [20].
SA: This approach refers to the subspace alignment method [13].
CORAL: This approach refers to the correlational alignment method [22].
GCA1: This is a new proposed method, based on finding the weighted geometric mean of the inverse of the source matrix $A_s$ and the target matrix $A_t$.
GCA2: This is a new proposed method, based on finding the (non-cascaded) weighted geometric mean of the inverse of the revised source matrix $A_m$ and the target matrix $A_t$.
GCA3: This is a new proposed method, based on finding the (non-cascaded) weighted geometric mean of the inverse of the revised source matrix $A_{gm}$ and the target matrix $A_t$.
Cascaded-GCA2: This is a new proposed method, based on finding the cascaded geometric mean of the inverse revised source matrix $A_m$ and target matrix $A_t$.
Cascaded-GCA3: This is a new proposed method, based on finding the cascaded geometric mean of the inverse revised source matrix $A_{gm}$ and target matrix $A_t$.
One question that arises in the proposed algorithms is how to choose the value of $t$ in computing the weighted sharp mean. Figure 2 illustrates the variation in performance of the cascaded GCA3 method over CORAL over a range of $t$, with the remaining hyperparameters held fixed for simplicity. Repeating such experiments over all 12 transfer learning tasks, Figure 3 shows the percentage improvement of the cascaded GCA3 method over correlational alignment (CORAL), using the best value of all three hyperparameters found by cross-validation. Figure 4 compares the performance of the proposed GCA1, GCA2, and GCA3 methods where just the hyperparameter $t$ was varied between 0.1 and 0.9 for the Amazon to DSLR domain adaptation task. Note that the variation in performance with $t$ occurs at different points for the three methods, and while their performance is superior overall to CORAL, their relative performances at the maximum values are not very different from each other. Figure 5 repeats the same comparison for the Caltech10 to Webcam domain adaptation task. As these plots clearly reveal, the values of the hyperparameters have a crucial influence on the performance of all the proposed GCA methods. The plot compares the performance of GCA1 to the fixed performance of the CORAL method.
9 Summary and Future Work
In this paper, we introduced a novel formulation of the classic domain adaptation problem in machine learning, based on computing the cascaded geometric mean of second-order statistics from source and target domains to align them. Our approach builds on the nonlinear Riemannian geometry of the open cone of symmetric positive definite (SPD) matrices, on which the geometric mean lies along the shortest-path geodesic that connects source and target covariances. Our approach has three key advantages over previous work: (a) Simplicity: The Riccati equation is a mathematically elegant solution to the domain adaptation problem, enabling integrating geometric and statistical information. (b) Theory: Our approach exploits the Riemannian geometry of SPD matrices. (c) Extensibility: As our algorithm development indicates, it is possible to easily extend our approach to capture more types of constraints, from geometrical to statistical.
There are many directions for extending our work. Since we have shown how to reduce domain adaptation to a problem involving metric learning, we can use other metric learning methods to design new domain adaptation methods. This process can generate a wealth of new methods, some of which may outperform the method proposed in our paper. Also, while we did not explore nonlinear variants of our approach, it is possible to extend our approach to develop a deep learning version where the gradient of the three objective functions is used to tune the weights of a multi-layer neural network. As in the case of correlational alignment (CORAL), we anticipate that the deep learning variants may perform better due to the construction of improved features of the training data. The experimental results show that the performance improvement tends to be more significant in some cases than in others. A theoretical analysis of the reasons for this variance in performance would be valuable; such an analysis is lacking even for previous methods such as CORAL.
Portions of this research were completed when the first author was at SRI International, Menlo Park, CA and when the second author was at Amazon.com, Bangalore, 560055, India. The first author currently holds a Visiting Professor appointment at Stanford, and is on academic leave from University of Massachusetts, Amherst.
- [1] Adel, T., Zhao, H., Wong, A.: Unsupervised domain adaptation with a relaxed covariate shift assumption. In: AAAI. pp. 1691–1697 (2017)
- [2] Baktashmotlagh, M., Harandi, M.T., Lovell, B.C., Salzmann, M.: Unsupervised domain adaptation by domain invariant projection. In: ICCV. pp. 769–776 (2013)
- [3] Belkin, M., Niyogi, P.: Convergence of Laplacian eigenmaps. In: NIPS. pp. 129–136 (2006)
- [4] Bellet, A., Habrard, A., Sebban, M.: Metric Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, Morgan and Claypool Publishers (2015)
- [5] Ben-David, S., Blitzer, J., Crammer, K., Pereira, F.: Analysis of representations for domain adaptation. In: NIPS (2006)
- [6] Bhatia, R.: Positive Definite Matrices. Princeton Series in Applied Mathematics, Princeton University Press, Princeton, NJ, USA (2007)
- [7] Borgwardt, K.M., Gretton, A., Rasch, M.J., Kriegel, H.P., Schölkopf, B., Smola, A.J.: Integrating structured biological data by kernel maximum mean discrepancy. In: ISMB (Supplement of Bioinformatics). pp. 49–57 (2006)
- [8] Cesa-Bianchi, N.: Learning the distribution in the extended PAC model. In: ALT. pp. 236–246 (1990)
- [9] Cui, Z., Chang, H., Shan, S., Chen, X.: Generalized unsupervised manifold alignment. In: NIPS. pp. 2429–2437 (2014)
- [10] Daume, H.: Frustratingly easy domain adaptation. In: ACL (2007)
- [11] Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20(2), 303–353 (1998)
- [12] Fayek, H.M., Lech, M., Cavedon, L.: Evaluating deep learning architectures for speech emotion recognition. Neural Networks 92, 60–68 (2017)
- [13] Fernando, B., Habrard, A., Sebban, M., Tuytelaars, T.: Subspace alignment for domain adaptation. Tech. rep., arXiv preprint arXiv:1409.5241 (2014)
- [14] Fukumizu, K., Bach, F.R., Gretton, A.: Statistical convergence of kernel CCA. In: NIPS. pp. 387–394 (2005)
- [15] Gong, B., Shi, Y., Sha, F., Grauman, K.: Geodesic flow kernel for unsupervised domain adaptation. In: CVPR. pp. 2066–2073 (2012)
- [16] Hotelling, H.: Relations between two sets of variates. Biometrika 28, 321–377 (1936)
- [17] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)
- [18] Murphy, K.P.: Machine learning: a probabilistic perspective. MIT Press, Cambridge, Mass. (2013)
- [19] Nadler, B., Lafon, S., Coifman, R.R., Kevrekidis, I.G.: Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In: NIPS. pp. 955–962 (2005)
- [20] Pan, S.J., Tsang, I.W., Kwok, J.T., Yang, Q.: Domain adaptation via transfer component analysis. IEEE Trans Neural Netw 22(2), 199–210 (Feb 2011). https://doi.org/10.1109/TNN.2010.2091281
- [21] Pan, S., Yang, Q.: A Survey on Transfer Learning. IEEE Trans Knowl Data Eng 22(10), 1345–1359 (2010)
- [22] Sun, B., Feng, J., Saenko, K.: Return of frustratingly easy domain adaptation. In: AAAI. pp. 2058–2065 (2016)
- [23] Wang, C., Mahadevan, S.: Manifold alignment without correspondence. In: IJCAI. pp. 1273–1278 (2009)
- [24] Wang, H., Wang, W., Zhang, C., Xu, F.: Cross-domain metric learning based on information theory. In: AAAI. pp. 2099–2105 (2014)
- [25] Wang, W., Arora, R., Livescu, K., Srebro, N.: Stochastic optimization for deep CCA via nonlinear orthogonal iterations. In: ALLERTON. pp. 688–695 (2015)
- [26] Zadeh, P., Hosseini, R., Sra, S.: Geometric mean metric learning. In: ICML. pp. 2464–2471 (2016)