Kernel Alignment for Unsupervised Transfer Learning
Abstract
The ability of a human being to extrapolate previously gained knowledge to other domains inspired a new family of methods in machine learning called transfer learning. Transfer learning is often based on the assumption that objects in both target and source domains share some common feature and/or data space. In this paper, we propose a simple and intuitive approach that iteratively minimizes the distance between source and target task distributions by optimizing the kernel target alignment (KTA). We show that this procedure is suitable for transfer learning by relating it to Hilbert-Schmidt Independence Criterion (HSIC) and Quadratic Mutual Information (QMI) maximization. We run our method on benchmark computer vision data sets and show that it can outperform some state-of-the-art methods.
1 Introduction
Most research in machine learning concentrates on the setting where a classifier is trained and tested on data drawn from the same distribution. This scenario has been well investigated, and in some tasks supervised approaches have almost no room for improvement. However, building human-like intelligent systems requires them to generalize discovered patterns to previously unseen domains. This gives rise to a learning paradigm called transfer learning, which uses a set of source tasks to influence learning and improve performance on a target task, without assuming that source and target samples follow the same distribution. This difference from the standard machine learning setting attracts growing attention, as exploring it helps to better understand under which assumptions and in what way knowledge can be generalized in the human brain. Intuitively, it is usually assumed that source and target domains should be aligned by learning a new representation of the data that maximizes the mutual dependence between them. On the other hand, maximizing the dependence explicitly may lead to a complete loss of the auxiliary knowledge that can complement the target task. It therefore becomes crucial to understand at what point one should stop reducing the discrepancy between the distributions, so as to preserve the auxiliary knowledge contained in the source task while keeping it aligned with the target task.
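1.1 Background and related works
Transfer learning is a widely known technique inspired by the ability of a human being to detect previously gained knowledge in one area and use it for efficient learning in another. A general definition of transfer learning was given in [1]: given a source domain with its learning task and a target domain with its learning task, transfer learning aims to improve the learning of the target predictive function using the knowledge from the source domain and task, where the domains or the tasks differ.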
There are three types of transfer learning: (1) supervised or inductive transfer learning (labeled samples are available in the target domain, but there may be no labeled instances in the source one); (2) semi-supervised or transductive transfer learning (labeled samples are available only for the source learning task); (3) unsupervised transfer learning (no labeled data in either the source or the target learning task).
According to the survey given in [1], the number of methods dealing with the first two settings of transfer learning drastically exceeds the number of articles dedicated to the last one. Indeed, to the best of our knowledge, only a few algorithms have been proposed to solve this problem: self-taught clustering (STC) [2], transfer spectral clustering (TSC) [3], and the multitask Bregman clustering of [4].
The main assumption of STC is that the two tasks share a latent feature space that can be used as a “bridge” for transfer learning. The authors perform co-clustering on source and target data simultaneously, with the two co-clusterings sharing the same feature set. TSC is quite similar to STC: it works in the setting of spectral clustering, where the low-dimensional shared embedding for the two tasks is obtained using the objective of bipartite graph co-clustering.
Another approach that can be related to unsupervised transfer learning is [4]. The proposed method, however, is an instance of multi-task clustering rather than self-taught clustering. Its optimization procedure simultaneously minimizes two terms: the first is the sum of Bregman divergences between clusters and data of each task; the second is a regularization term defined as the Bregman divergence between all pairs of partitions. The motivation for this cost function is two-fold: while the first term seeks a good clustering for each task separately, the second term takes into account the relationships between clusters of different tasks.
The small amount of research in this field can be explained by the fact that unsupervised transfer learning is an extreme case of the transfer learning paradigm. It nevertheless occurs in numerous real-world applications, which makes it a topic of ongoing interest for further research.
1.2 Our contributions
In this paper, we propose a new unsupervised transfer learning algorithm based on kernel target alignment maximization, with application to computer vision problems. To the best of our knowledge, kernel target alignment has never been applied in this context, and thus the proposed method presents a novel contribution.
The rest of this paper is organized as follows: in section 2 we briefly introduce basic notations and describe the approaches used later; in section 3 we introduce our unsupervised transfer learning algorithm. We present the theoretical analysis of our approach in section 4 and evaluate it in section 5. Finally, we point out some ideas for future extensions of our method in section 6.
2 Preliminary knowledge
In this section we describe some basic notations and techniques that are used later. We start by introducing the simplest form of non-negative matrix factorization (NMF) proposed by [5] and its variations, Convex NMF (C-NMF) [6] and Kernel NMF (K-NMF) [7]. Each of these methods can be used to obtain a partition of data in an unsupervised manner.
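2.1 Standard NMF
Given a non-negative matrix $X \in \mathbb{R}^{d \times n}$ whose columns are data objects, standard NMF seeks the following decomposition:
$$X \approx F G^{T}, \quad F \geq 0,\ G \geq 0,$$
where
columns of $F \in \mathbb{R}^{d \times k}$ can be considered as basis vectors;
columns of $G \in \mathbb{R}^{n \times k}$ are considered as cluster assignments for each data object;
$k$ is the desired number of clusters.
Standard NMF can be represented as the following optimization problem:
$$\min_{F \geq 0,\ G \geq 0} D(X \,\|\, F G^{T}),$$
where $D$ is an arbitrary measure of divergence.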
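As an illustration of the optimization problem above, the following is a minimal sketch that takes the squared Frobenius norm as the divergence and uses the multiplicative updates of [5]. The function names `nmf` and `cluster_labels`, the random initialization and the iteration count are assumptions made for this example and are not part of the original text.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Sketch: factorize non-negative X (d x n) as F @ G.T, F (d x k), G (n x k)."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    # Strictly positive random initialization (an arbitrary choice for this sketch).
    F = rng.random((d, k)) + eps
    G = rng.random((n, k)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the squared Frobenius loss ||X - F G^T||^2.
        F *= (X @ G) / (F @ (G.T @ G) + eps)
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)
    return F, G

def cluster_labels(G):
    """Assign each data object to the cluster with the largest weight in its row of G."""
    return np.argmax(G, axis=1)
```

Reading a hard partition from the factorization by taking the largest entry in each row of $G$ is one common convention; other read-outs are possible.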
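2.2 Convex NMF and Kernel NMF
To develop C-NMF, we consider a factorization of the following form:
$$X \approx X W G^{T},$$
where the column vectors of $F$ lie within the column space of $X$, i.e., $F = XW$ with $W \geq 0$.
The natural generalization of C-NMF is Kernel NMF (K-NMF). To “kernelize” C-NMF we consider a mapping $\phi$ which maps each vector $x_i$ to a higher dimensional feature space, such that:
$$\phi : X \rightarrow \phi(X) = (\phi(x_1), \ldots, \phi(x_n)).$$
We obtain a factorization of the following form:
$$\phi(X) \approx \phi(X) W G^{T}.$$
Each kernel can be described by its Gram matrix. We call a Gram matrix of a given kernel a symmetric positive semi-definite matrix $K \in \mathbb{R}^{n \times n}$. The kernel itself is an inner-product function defined as $k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, so that $K = \phi(X)^{T} \phi(X)$. Multiplying both sides of the factorization by $\phi(X)^{T}$, we obtain
$$\phi(X)^{T} \phi(X) \approx \phi(X)^{T} \phi(X) W G^{T}.$$
Finally, K-NMF is of the form:
$$K \approx K W G^{T}.$$
The great advantage of K-NMF is that it can deal not only with attribute-value data but also with relational data.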
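To make the kernelized factorization more concrete, here is a hedged sketch of K-NMF that works only with the Gram matrix, using multiplicative updates in the style of the convex NMF rules of [6]. The names `kernel_nmf` and `knmf_objective`, the initialization and the stopping rule (a fixed number of iterations) are assumptions of this example, not the authors' implementation.

```python
import numpy as np

def kernel_nmf(K, k, n_iter=200, eps=1e-9, seed=0):
    """Sketch: find non-negative W, G (both n x k) with phi(X) ~ phi(X) W G^T."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    # Split K into its positive and negative parts so all update terms stay non-negative.
    Kp = (np.abs(K) + K) / 2.0
    Km = (np.abs(K) - K) / 2.0
    W = rng.random((n, k)) + eps
    G = rng.random((n, k)) + eps
    for _ in range(n_iter):
        GWt = G @ W.T
        G *= np.sqrt((Kp @ W + GWt @ (Km @ W)) / (Km @ W + GWt @ (Kp @ W) + eps))
        GtG = G.T @ G
        W *= np.sqrt((Kp @ G + Km @ W @ GtG) / (Km @ G + Kp @ W @ GtG + eps))
    return W, G

def knmf_objective(K, W, G):
    """||phi(X) - phi(X) W G^T||^2 written with the Gram matrix only."""
    return (np.trace(K) - 2.0 * np.trace(G.T @ K @ W)
            + np.trace(W.T @ K @ W @ G.T @ G))
```

Because only $K$ enters the updates, the same routine applies to relational data for which no explicit feature vectors are available.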
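2.3 Kernel Alignment
Kernel target alignment (KTA) is a measure of similarity between two Gram matrices, proposed in [8] and defined as follows:
$$\hat{A}(K_1, K_2) = \frac{\langle K_1, K_2 \rangle_F}{\sqrt{\langle K_1, K_1 \rangle_F \langle K_2, K_2 \rangle_F}}.$$
The Frobenius inner product is defined as:
$$\langle K_1, K_2 \rangle_F = \sum_{i,j=1}^{n} k_1(x_i, x_j)\, k_2(x_i, x_j),$$
where $k_1$ and $k_2$ are two kernels, and $K_1$ and $K_2$ are the two corresponding Gram matrices.
As we can see, it essentially measures the cosine of the angle between two kernel matrices.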
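The cosine interpretation can be checked numerically with a short sketch. The function names `frobenius_inner` and `kernel_alignment` are chosen for this example only.

```python
import numpy as np

def frobenius_inner(K1, K2):
    """Frobenius inner product: sum of element-wise products of two Gram matrices."""
    return float(np.sum(K1 * K2))

def kernel_alignment(K1, K2):
    """Kernel alignment: cosine of the angle between K1 and K2 viewed as vectors."""
    denom = np.sqrt(frobenius_inner(K1, K1) * frobenius_inner(K2, K2))
    return frobenius_inner(K1, K2) / denom

# Usage: a Gram matrix is perfectly aligned with any positive multiple of itself.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))
K = A @ A.T  # linear-kernel Gram matrix of 8 points
aligned = kernel_alignment(K, 2.0 * K)
```

The alignment is invariant to rescaling either matrix, which is why dropping the normalization changes the value of the criterion but not the direction it favors for a fixed norm.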
2.4 Clustering evaluation criteria
There are two classes of clustering evaluation metrics: internal and external clustering evaluation indexes. In unsupervised clustering we can only use internal metrics, because they rely solely on information intrinsic to the data. Among them, the most referenced in the literature are the Bayesian information criterion, the Calinski-Harabasz index, the Davies-Bouldin index (DBI), the Silhouette index, the Dunn index and the NIVA index. To estimate the effectiveness of clustering we use one of the most effective clustering indexes according to [9], the Davies-Bouldin index. This internal evaluation scheme is calculated as follows:
$$DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{d_i + d_j}{d_{i,j}},$$
where $k$ denotes the number of clusters, $i$ and $j$ are cluster labels, $d_i$ and $d_j$ are distances to cluster centroids within clusters $i$ and $j$, and $d_{i,j}$ is a measure of separation between clusters $i$ and $j$. This index aims to identify sets of clusters that are compact and well separated. A smaller value of DBI indicates a “better” clustering solution.
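The index can be sketched in a few lines. This example assumes Euclidean distances, takes the within-cluster distance as the mean distance to the centroid and the separation as the distance between centroids; these choices, and the name `davies_bouldin`, are assumptions of the sketch rather than details given in the text.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Sketch of the Davies-Bouldin index for rows of X (n x d); needs >= 2 clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # d_i: mean distance of the members of cluster i to its centroid.
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
        for i, c in enumerate(ids)
    ])
    k = len(ids)
    total = 0.0
    for i in range(k):
        # Worst-case ratio of within-cluster scatter to between-centroid separation.
        total += max(
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        )
    return total / k

# Usage: two tight, distant pairs of points score better (lower) than a mixed split.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = davies_bouldin(points, [0, 0, 1, 1])
bad = davies_bouldin(points, [0, 1, 0, 1])
```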
3 Our approach
3.1 Motivation
In this section we describe our method for unsupervised transfer learning. The central idea that we use to overcome the difference between weakly related tasks is inspired by a very popular approach in neuroscience called Representational Similarity Analysis (RSA) [10]. This method suggests that different activity patterns in the human brain can be encoded and then compared using dissimilarity matrices. For a given brain region, the authors interpret the activity pattern associated with each experimental condition as a representation. Then, they obtain a representational dissimilarity matrix by comparing activity patterns with respect to all pairs of observations. This approach makes it possible to relate activity patterns between different modalities of brain-activity measurement (e.g., fMRI and invasive or scalp electrophysiology), and between subjects and species. We follow this approach by replacing the dissimilarity matrices of brain activity patterns of different modalities with kernels defined on source and target task samples. Then, we reduce the distance between the two distributions by learning a new representation of the data for the target task in a Reproducing Kernel Hilbert Space (RKHS). This new representation is further factorized using K-NMF in order to find weights of similarities in the transformed instance space. Finally, we use these weights as a “bridge” for transfer learning on the target task.
3.2 Kernel target alignment optimization
Let us consider a source task and a target task whose data samples are given by matrices $X_s$ and $X_t$. For the sake of convenience, we assume that $X_s$ and $X_t$ contain the same number of instances. When this is not the case, the restriction can be overcome in two ways: by sub-sampling the bigger data set or by using any kind of bootstrap to increase the size of the smaller one.
We start by calculating the Gram matrices $K_s$ and $K_t$ for the source and target tasks, for example by using a Gaussian kernel function. Calculating the alignment $\hat{A}(K_s, K_t)$ gives us an idea of how correlated the initial kernels are. A small value of $\hat{A}(K_s, K_t)$ means that transfer learning will most likely fail, as the source and target task distributions are too different. In order to find an intermediate kernel that plays the role of an embedding for both tasks, we apply kernel alignment optimization to the calculated kernels: we maximize the unnormalized kernel alignment $\langle K_\mu, K_s \rangle_F$ over the weights $\mu$ of a combined kernel $K_\mu$.
The normalization is omitted from the cost function, compared to the original definition of kernel alignment in section 2, for computational convenience as suggested in [11]. The matrix $K_\mu = \sum_i \mu_i K_i$ is a linear combination of kernel matrices $K_i$ (any set of kernel functions can be used) calculated on $X_t$. Several methods can be used to solve this optimization problem. In our work we use the one described in [8]; others can be found in [12] and [13]. The optimization problem can be rewritten as a quadratic programming (QP) problem and solved using any off-the-shelf QP solver.
For each kernel $K_\mu$ obtained in the process of alignment optimization, we look for a weight matrix $W$ which arises from the K-NMF of $K_\mu$:
$$K_\mu \approx K_\mu W G^{T}.$$
This matrix is of particular interest as it represents the weights of similarities that lead to a good reconstruction of $K_\mu$ in a nonlinear RKHS. Due to the alignment optimization procedure, it naturally consists of adapted weights of an embedding between the two tasks. The information contained in $W$ can be used further with C-NMF on the target task in order to find more efficient basis vectors that are weighted based on a “good” nonlinear reconstruction of the transformed instances. The criterion that we use to evaluate whether the obtained reconstruction is “good” is the Davies-Bouldin index. We recall that this index shows whether the clusters are dense and well separated.
More formally, we look for a matrix $W^{*}$ that minimizes the Davies-Bouldin index defined in section 2 with respect to the target kernel $K_t$.
We call this matrix the “bridge matrix”. Given that $K_\mu$ was calculated as a linear combination of kernels on $X_t$ and was brought closer, in the sense of alignment, to $K_s$, $W^{*}$ naturally incorporates information about the geometrical structure of $X_s$ that can help to find better basis vectors in $X_t$.
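The alignment step can be illustrated with a deliberately simplified stand-in for the QP mentioned above. The sketch below assumes, purely for illustration, that the combination weights are non-negative and have unit Euclidean norm; under that assumption the unnormalized alignment is linear in the weights and its maximizer has a closed form. The constraint, the bandwidth values and the names `gaussian_gram`, `alignment_weights` and `combine_kernels` are hypothetical and are not taken from the method of [8].

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gaussian Gram matrix for the rows of X (n x d); sigma is the bandwidth."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def alignment_weights(base_kernels, K_s):
    """Maximize sum_i mu_i <K_i, K_s>_F subject to mu >= 0, ||mu||_2 = 1 (assumed)."""
    a = np.array([np.sum(K_i * K_s) for K_i in base_kernels])
    a = np.maximum(a, 0.0)
    norm = np.linalg.norm(a)
    if norm == 0.0:
        # No base kernel is positively aligned: fall back to uniform weights.
        return np.full(len(base_kernels), 1.0 / np.sqrt(len(base_kernels)))
    return a / norm

def combine_kernels(base_kernels, mu):
    """Linear combination K_mu = sum_i mu_i K_i."""
    return sum(m * K_i for m, K_i in zip(mu, base_kernels))

# Usage on synthetic data with the same number of instances in both tasks.
rng = np.random.default_rng(0)
X_s = rng.normal(size=(30, 5))
X_t = rng.normal(size=(30, 5)) + 0.5
K_s = X_s @ X_s.T                                         # linear source kernel
base = [gaussian_gram(X_t, s) for s in (0.5, 1.0, 2.0)]   # hypothetical bandwidths
mu = alignment_weights(base, K_s)
K_mu = combine_kernels(base, mu)
```

In the procedure described above, each intermediate $K_\mu$ would then be factorized with K-NMF and the weight matrix with the smallest Davies-Bouldin index retained as the bridge matrix; that selection loop is omitted here.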
3.3 Transfer process using the “bridge matrix”
The next step is to perform C-NMF of $X_t$ with the matrix of weights fixed to $W^{*}$. We use C-NMF as it allows us to reinforce the impact of $W^{*}$ on the partition matrix $G$:
$$X_t \approx X_t W^{*} G^{T}.$$
We call this factorization the Bridge Convex NMF (BC-NMF).
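A hedged sketch of this step follows: the weight matrix is held fixed and only the partition matrix is updated, using the multiplicative rule for $G$ from convex NMF [6]. The function name `bridge_convex_nmf`, the initialization and the fixed iteration count are assumptions of this example and should not be read as the authors' Algorithm 1.

```python
import numpy as np

def bridge_convex_nmf(X_t, W_star, n_iter=200, eps=1e-9, seed=0):
    """Sketch: X_t (d x n, columns are objects) ~ X_t W_star G^T with W_star fixed."""
    Y = X_t.T @ X_t
    # Positive and negative parts of the target Gram matrix.
    Yp = (np.abs(Y) + Y) / 2.0
    Ym = (np.abs(Y) - Y) / 2.0
    n, k = W_star.shape
    rng = np.random.default_rng(seed)
    G = rng.random((n, k)) + eps
    # Terms that depend only on the fixed bridge matrix are computed once.
    YpW, YmW = Yp @ W_star, Ym @ W_star
    WtYpW, WtYmW = W_star.T @ YpW, W_star.T @ YmW
    for _ in range(n_iter):
        G *= np.sqrt((YpW + G @ WtYmW) / (YmW + G @ WtYpW + eps))
    return G

def reconstruction_error(X_t, W_star, G):
    """Squared Frobenius error of the BC-NMF reconstruction."""
    return np.linalg.norm(X_t - X_t @ W_star @ G.T) ** 2
```

Since $W^{*}$ never changes, each iteration costs only a few products with $k \times k$ and $n \times k$ matrices once the terms involving $W^{*}$ have been precomputed.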
Finally, our approach is summarized in Algorithm 1.
3.4 Complexity
At each iteration of our algorithm we perform a K-NMF, which makes the algorithm quite time consuming when the number of instances is large. On the other hand, its cost does not depend on the number of features, which makes it attractive for tasks in high-dimensional spaces. For a Gram matrix $K \in \mathbb{R}^{n \times n}$, the complexity of K-NMF depends on the number of instances $n$, the number of iterations needed for K-NMF to converge and the desired number of clusters. This cost is then multiplied by the number of iterations needed to optimize the alignment between the two kernels.
It should be noted that in real-life tasks the quantity of data in the source domain is often greater than in the target one. In order to decrease the computational effort of BC-NMF, we propose to process the data in parallel. We split the data into several parts and obtain an optimal result for each of them. After that, we use a consensus approach (for example, Consensus NMF described in [19]) to calculate the final result, which is close to all the partitions obtained.
4 Theoretical analysis
In this section, we present the relationships between KTA and two quantities commonly used in transfer learning and domain adaptation problems, namely the Hilbert-Schmidt Independence Criterion (HSIC) [14] and Quadratic Mutual Information (QMI).
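4.1 Hilbert-Schmidt independence criterion
We start with a definition of a mean map and its empirical estimate. For a kernel $k$ with RKHS $\mathcal{H}$, a distribution $P_x$ and a sample $X = \{x_1, \ldots, x_n\}$ drawn from it:
$$\mu[P_x] = \mathbb{E}_{x}[k(x, \cdot)], \qquad \mu[X] = \frac{1}{n} \sum_{i=1}^{n} k(x_i, \cdot).$$
If $\mathbb{E}_{x}[k(x, x)] < \infty$ then $\mu[P_x]$ is an element of the RKHS $\mathcal{H}$. According to the Moore-Aronszajn theorem, the reproducing property of $\mathcal{H}$ allows us to rewrite the expectation of every function $f \in \mathcal{H}$ in the following form: $\langle \mu[P_x], f \rangle = \mathbb{E}_{x}[f(x)]$. We now give the definition of HSIC: for RKHSs $\mathcal{F}$ and $\mathcal{G}$ with kernels $k$ and $l$, it is the squared Hilbert-Schmidt norm of the cross-covariance operator $C_{xy}$,
$$HSIC(P_{xy}, \mathcal{F}, \mathcal{G}) = \| C_{xy} \|_{HS}^{2}.$$
Its biased estimate can be calculated from a finite sample using the following equation:
$$HSIC_b = \frac{1}{(n-1)^{2}} \mathrm{tr}(K H L H),$$
where $K_{ij} = k(x_i, x_j)$, $L_{ij} = l(y_i, y_j)$, and $H = I - \frac{1}{n} \mathbf{1}\mathbf{1}^{T}$ is a centering matrix projecting data onto the space orthogonal to the vector $\mathbf{1}$.
From this we can see that KTA coincides, up to normalization, with the biased estimate of HSIC when centered kernels are used, since $\mathrm{tr}(K H L H) = \langle HKH, HLH \rangle_F$. This shows that KTA is a suitable choice for transfer learning algorithms, as its maximization iteratively increases the dependence between source and target distributions. Furthermore, the cross-covariance operator has already proved to be efficient when applied to domain adaptation for target and conditional shift correction [15].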
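The relationship between the biased HSIC estimate and the alignment of centered kernels can be verified numerically. The sketch below is only a check of the trace identity; the function names are chosen for this example.

```python
import numpy as np

def centering_matrix(n):
    """H = I - (1/n) 1 1^T, which removes the mean in feature space."""
    return np.eye(n) - np.ones((n, n)) / n

def hsic_biased(K, L):
    """Biased HSIC estimate (n - 1)^{-2} tr(K H L H) from two Gram matrices."""
    n = K.shape[0]
    H = centering_matrix(n)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def centered_unnormalized_alignment(K, L):
    """Frobenius inner product of the centered Gram matrices HKH and HLH."""
    H = centering_matrix(K.shape[0])
    return float(np.sum((H @ K @ H) * (H @ L @ H)))

# Usage: the two quantities differ only by the factor (n - 1)^2.
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))
B = rng.normal(size=(10, 4))
K, L = A @ A.T, B @ B.T
lhs = hsic_biased(K, L) * (10 - 1) ** 2
rhs = centered_unnormalized_alignment(K, L)
```

The identity relies on $H$ being symmetric and idempotent, so that $\mathrm{tr}(HKH \cdot HLH) = \mathrm{tr}(KHLH)$.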
4.2 Quadratic mutual information
Another important point is the equivalence between KTA and Information-Theoretic Learning (ITL) estimators [16]. We define the inner product between two pdfs as a bivariate function on the set of square-integrable probability density functions:
$$V(f, g) = \int f(x)\, g(x)\, dx.$$
It is easy to show that $V$ is symmetric and non-negative definite and thus, according to the Moore-Aronszajn theorem, there exists a unique RKHS associated with $V$. We further define Quadratic Mutual Information (QMI):
$$QMI(x, y) = \iint \big( f_{xy}(x, y) - f_x(x) f_y(y) \big)^{2}\, dx\, dy.$$
In order to establish a connection between KTA and QMI, we can use the relationship established in [16] through Parzen window estimation [17]. The Parzen window estimators of the probability density functions $f_{xy}$, $f_x$ and $f_y$ are defined as follows:
$$\hat{f}_{xy}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \kappa(x - x_i)\, \kappa(y - y_i), \quad \hat{f}_x(x) = \frac{1}{n} \sum_{i=1}^{n} \kappa(x - x_i), \quad \hat{f}_y(y) = \frac{1}{n} \sum_{i=1}^{n} \kappa(y - y_i).$$
This leads to the following result:
$$\widehat{QMI} = \frac{1}{n^{2}} \mathrm{tr}(K H L H),$$
where the kernel matrices $K$ and $L$ are calculated with respect to the Parzen window kernels used for estimation. Once again, we see that KTA with centered kernels is equal, up to normalization, to the QMI estimate when the Gram matrices $K$ and $L$ are defined as inner products of Parzen window kernels.
We also note that STC [2] is based on mutual information maximization. The latter was used to perform co-clustering of target and auxiliary data with respect to a shared set of features. Another example where mutual information was used for domain adaptation is [18]. The established relationships therefore suggest that KTA can be effective when used for transfer learning.
5 Experimental results
In this section we evaluate our approach and analyze its behavior on some popular computer vision data sets.
5.1 Baselines and setting
We choose the following baselines to evaluate the performance of our approach:
C-NMF on target data only;
K-NMF using each kernel from the set of base kernels used for KTA maximization (“Kernel alone”);
Transfer Spectral Clustering (TSC);
Bridge Convex-NMF (BC-NMF).
Using C-NMF we can directly factorize the target data matrix $X_t$ as:
$$X_t \approx X_t W G^{T},$$
and consider the matrix $G$ as an initial partition obtained without taking into account the knowledge from the source task. The accuracy obtained on this partition gives us the “No transfer” value. This choice of baseline is explained by the fact that our approach is, basically, C-NMF with a weight matrix learned using kernel alignment optimization. Thus, if we are able to increase the clustering accuracy compared to this baseline, the gain can be attributed to the transfer performed by our approach.
On the other hand, we also give the maximum accuracy achieved over the set of kernels that we use in the optimization of KTA. We chose the following kernel functions: (1) Gaussian kernels with the bandwidth varied over a range with a multiplicative step size; (2) homogeneous polynomial kernels with the degree varied over a range. We call this the “kernel alone” value, as it is the result of applying K-NMF to a given kernel without taking into account the auxiliary knowledge. The source task kernel was calculated using a linear kernel.
Finally, we compare our method to TSC which, according to the experimental results presented in [3], outperforms both STC and Bregman multitask clustering (BMC). To define the number of nearest neighbors needed to construct the source and target graphs, we perform cross-validation and report the best accuracy achieved. The remaining parameters, including the step length, are set as suggested in the original paper.
We use accuracy to evaluate the performance of the chosen algorithms. It is defined as:
$$Acc = \frac{|\{x : x \in D \wedge \hat{y}(x) = y(x)\}|}{|D|},$$
where $D$ is a data set, $y(x)$ is the true label of $x$ and $\hat{y}(x)$ is the predicted label of $x$.
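A minimal sketch of this measure is given below. It assumes that the predicted cluster indices have already been matched to class labels; how that matching is performed is not specified in the text, so it is left outside the function. The name `accuracy` is chosen for this example.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of objects whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if len(true_labels) == 0:
        raise ValueError("the data set must not be empty")
    hits = sum(1 for y, y_hat in zip(true_labels, predicted_labels) if y == y_hat)
    return hits / len(true_labels)

# Usage: three of four objects are labeled correctly.
score = accuracy([0, 0, 1, 1], [0, 1, 1, 1])
```

The function returns a fraction in [0, 1]; multiplying by 100 gives a percentage.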
5.2 Data sets
We evaluate the performance of our approach on the Office [20]/Caltech [21] data set which consists of four classification tasks:
Amazon (A) - images from online merchants (958 images with 800 features from 10 classes);
Webcam (W) - low-quality images taken by a web camera (295 images with 800 features from 10 classes);
DSLR (D) - high-quality images taken by a digital SLR camera (157 images with 800 features from 10 classes);
Caltech (C) - famous data set for object recognition (1123 images with 800 features from 10 classes).
Sample images from keyboard and backpack categories of each domain are presented in Figure 1.
This set of domains leads to 12 transfer learning scenarios, e.g., C → A, C → D, C → W, ..., D → W.
5.3 Results
Table 1 reports the results of our approach for transfer between each pair of domains; bold and underlined numbers stand for the best and second best results respectively.
Domain pair | C-NMF | Kernel alone | TSC | BC-NMF |
C → A | 33.24 | 40.34 | 64.88 | |
C → W | 46.78 | 52.54 | 60.69 | |
C → D | 46.5 | 47.33 | 81.33 | |
A → C | 24.89 | 35.33 | 59.29 | |
A → W | 46.78 | 53.22 | 60.69 | |
A → D | 46.5 | 47.33 | 76.0 | |
W → C | 24.89 | 35.33 | 62.71 | |
W → A | 33.24 | 40.34 | 77.93 | |
W → D | 46.5 | 47.33 | 76.0 | |
D → C | 24.89 | 35.33 | 54.14 | |
D → A | 33.24 | 40.34 | 78.0 | |
D → W | 46.78 | 55.59 | 70.0 | |
From the results, we can see that our algorithm BC-NMF significantly outperforms TSC in 10 of the 12 transfer learning scenarios. Furthermore, in some cases TSC achieves lower accuracy values than the “kernel alone” setting. This can be explained by the fact that the clusters of the corresponding tasks are not well separable in the initial feature space, and thus a nonlinear projection of features to a new RKHS can be beneficial. We also note that using a single kernel from the set of base kernels does not lead to good performance when compared to BC-NMF, while the learned combination of base kernels improves the overall accuracy considerably. Finally, comparing the obtained results with C-NMF applied to target data only shows that the improved performance is due to the transfer, as the only difference between BC-NMF and C-NMF lies in the learned weight matrix.
We now analyze the two cases where TSC achieves better clustering results than BC-NMF. We remark that in these two cases Caltech10 plays the role of the target domain. We further notice that the overall performance of both C-NMF and “kernel alone” approaches on Caltech10 is rather weak compared to their performance on the Amazon, DSLR and Webcam tasks. We recall that both C-NMF and K-NMF assume that the basis vectors lie in the column space of their instance space, which is not necessarily true. However, if the source task data set is large enough, our approach is still able to improve the performance using the auxiliary knowledge (e.g., A → C); when this is not the case (i.e., W → C, D → C), BC-NMF may need a larger variety of base kernels or more instances from the source data set to learn a good weight matrix.
Figure 2 presents the learning curves of BC-NMF on transfer from the DSLR and Caltech domains (results for other domains are presented in the Supplementary material). We plotted the red bar to indicate where the optimal weight matrix was obtained. It can be noticed that the proposed strategy for choosing the weight matrix does not always lead to the best possible results but still performs reasonably well.
6 Conclusions and future work
In this paper we presented a new method for unsupervised transfer learning. We use kernel alignment optimization in order to minimize the distance between the distributions of source and target tasks. We apply K-NMF to the intermediate kernels obtained during this procedure and look for a weight matrix that reconstructs well the similarity-based representation of the data. Once this matrix is found, we use it in C-NMF on the target task to obtain the final partition. Our approach was evaluated on benchmark computer vision data sets and demonstrated a significant improvement when compared to some state-of-the-art methods. We also showed how KTA maximization can be related to HSIC and QMI optimization. The established relationships allow us to conclude that the use of KTA for transfer learning is justified from both theoretical and practical points of view. One of the drawbacks of our approach is that it is quite time consuming. Nevertheless, this issue can be overcome as discussed in section 3.
In the future, we will extend our work in multiple directions. First of all, we will create a multi-task version of our method. This can be done in the same fashion with one difference: we will first search for an optimal Gram matrix for each pair of tasks, and then use simultaneous non-negative matrix factorization [22] to find the common “bridge matrix” that captures the knowledge from all tasks. A multi-task version of our algorithm can be very important because it could show the contribution of each task to the overall improvement. Secondly, it would be useful to derive bounds for the classification error. This problem, however, is complicated, as there is no statistical theory that can be used in the unsupervised setting in the same way as for supervised and semi-supervised learning.
References
[1] Pan, S.J. and Yang, Q. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, pp. 1345-1359, 2010.
[2] Dai, W., Yang, Q., Xue, G.-R. and Yu, Y. Self-taught clustering. Proceedings of ICML, pp. 200-207, 2008.
[3] Jiang, W. and Chung, F.-L. Transfer Spectral Clustering. Proceedings of ECML/PKDD, pp. 789-803, 2012.
[4] Zhang, J. and Zhang, C. Multitask Bregman clustering. Neurocomputing, pp. 1720-1734, 2011.
[5] Lee, D.D. and Seung, H.S. Learning the parts of objects by non-negative matrix factorization. Nature, vol. 401, pp. 788-791, 1999.
[6] Ding, C.H.Q., Li, T. and Jordan, M.I. Convex and Semi-Nonnegative Matrix Factorizations. IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, pp. 45-55, 2010.
[7] Zhang, D., Zhou, Z.H. and Chen, S. Non-negative matrix factorization on kernels. Proceedings of the 9th Pacific Rim International Conference on Artificial Intelligence, pp. 404-412, 2006.
[8] Cristianini, N., Shawe-Taylor, J., Elisseeff, A. and Kandola, J. On kernel-target alignment. NIPS, pp. 367-373, 2002.
[9] Rendon, E., Abundez, I., Arizmendi, A. and Quiroz, E.M. Internal versus external cluster validation indexes. International Journal of Computers and Communications, vol. 5, no. 1, 2011.
[10] Kriegeskorte, N., Mur, M. and Bandettini, P. Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, pp. 1-28, 2008.
[11] Neumann, J., Schnörr, C. and Steidl, G. Combined SVM-based Feature Selection and Classification. Machine Learning, vol. 61, pp. 129-150, 2005.
[12] Ramona, M., Richard, G. and David, B. Multiclass Feature Selection with Kernel Gram-matrix-based criteria. IEEE Trans. Neural Netw. Learning Syst., 2012.
[13] Pothin, J.-B. and Richard, C. A greedy algorithm for optimizing the kernel alignment and the performance of kernel machines. In Proc. EUSIPCO ’06, pp. 4-8, 2006.
[14] Gretton, A., Bousquet, O., Smola, A. and Schölkopf, B. Measuring Statistical Dependence with Hilbert-Schmidt Norms. Proceedings of ALT, pp. 63-77, 2005.
[15] Zhang, K., Schölkopf, B., Muandet, K. and Wang, Z. Domain Adaptation under Target and Conditional Shift. Proceedings of ICML, pp. 819-827, 2013.
[16] Principe, J.C. Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives. Springer, 2010.
[17] Parzen, E. On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, vol. 33, pp. 1065-1076, 1962.
[18] Gong, B., Grauman, K. and Sha, F. Connecting the Dots with Landmarks: Discriminatively Learning Domain-Invariant Features for Unsupervised Domain Adaptation. Proceedings of ICML, pp. 222-230, 2013.
[19] Li, T., Ding, C.H.Q. and Jordan, M.I. Solving Consensus and Semi-supervised Clustering Problems Using Nonnegative Matrix Factorization. Proceedings of ICDM, pp. 577-582, 2007.
[20] Saenko, K., Kulis, B., Fritz, M. and Darrell, T. Adapting Visual Category Models to New Domains. Proceedings of ECCV, pp. 213-226, 2010.
[21] Gopalan, R., Li, R. and Chellappa, R. Domain Adaptation for Object Recognition: An Unsupervised Approach. Proceedings of ICCV, pp. 999-1006, 2011.
[22] Badea, L. Extracting Gene Expression Profiles Common to Colon and Pancreatic Adenocarcinoma Using Simultaneous Nonnegative Matrix Factorization. World Scientific, pp. 267-278, 2008.