Tensor Canonical Correlation Analysis for Multi-view Dimension Reduction


Yong Luo          Dacheng Tao          Yonggang Wen
Kotagiri Ramamohanarao          Chao Xu
yluo180@gmail.com, dacheng.tao@uts.edu.au, ygwen@ntu.edu.sg
rao@csse.unimelb.edu.au, xuchao@cis.pku.edu.cn
Key Laboratory of Machine Perception (Ministry of Education), School of Electronics Engineering and Computer Science, Peking University, Beijing, China.
Division of Networks and Distributed Systems, School of Computer Engineering, Nanyang Technological University, Singapore.
Centre for Quantum Computation & Intelligent Systems and the Faculty of Engineering & Information Technology, University of Technology, Sydney, Australia.
Department of Computer Science and Software Engineering, The University of Melbourne, Australia.
08 February 2015
Abstract

Canonical correlation analysis (CCA) has proven an effective tool for two-view dimension reduction due to its profound theoretical foundation and success in practical applications. With respect to multi-view learning, however, it is limited to handling data represented by two-view features, whereas in many real-world applications the number of views is frequently much larger. Although the ad hoc strategy of exploring all possible pairs of features can numerically deal with multi-view data, it ignores the high-order statistics (correlation information) that can only be discovered by simultaneously exploring all features.

Therefore, in this work, we develop tensor CCA (TCCA), which straightforwardly yet naturally generalizes CCA to handle data with an arbitrary number of views by analyzing the covariance tensor of the different views. TCCA aims to directly maximize the canonical correlation of multiple (more than two) views. Crucially, we prove that the multi-view canonical correlation maximization problem is equivalent to finding the best rank-1 approximation of the data covariance tensor, which can be solved efficiently using the well-known alternating least squares (ALS) algorithm. As a consequence, the high-order correlation information contained in the different views is explored, and thus a more reliable common subspace shared by all features can be obtained. In addition, a non-linear extension of TCCA is presented. Experiments on various challenging tasks, including large-scale biometric structure prediction, internet advertisement classification and web image annotation, demonstrate the effectiveness of the proposed method.

1 Introduction

The features utilized in many real-world data mining tasks are frequently high dimensional and extracted from multiple views (or sources). For example, both the page content and the hyperlinks, each represented by bag-of-words (BOW) features, are usually used in web page classification Blum and Mitchell (1998); Foster et al (2008), and it is common to combine global (such as GIST Oliva and Torralba (2001)) and local (such as SIFT Lowe (2004)) descriptors in image annotation Chua et al (2009); Guillaumin et al (2009). In these applications, the features can have dimensions of up to several hundred or even several thousand.

Multi-view dimension reduction Foster et al (2008) seeks a low-dimensional common subspace to compactly represent heterogeneous data, in which each data example is associated with multiple high-dimensional features. It often benefits the subsequent learning process significantly in that the curse of dimensionality is alleviated and the computational efficiency is improved Hou et al (2010); Han et al (2012). Canonical correlation analysis (CCA), which is designed to inspect the linear relationship between two sets of variables Hardoon et al (2004); Bach and Jordan (2005), was formally introduced as a multi-view dimension reduction method in Foster et al (2008), where the authors prove that the labeled instance complexity can be effectively reduced under certain weak assumptions. In addition, CCA has been widely used for multi-view classification Farquhar et al (2005), regression Kakade and Foster (2007), clustering Blaschko and Lampert (2008); Chaudhuri et al (2009), etc. Theoretically, Bach and Jordan Bach and Jordan (2005) interpreted CCA probabilistically as a latent variable model, so that it can be incorporated into a larger probabilistic model.

In spite of the profound theoretical foundation and practical success of CCA in multi-view learning, it can only handle data that is represented by two-view features. The features utilized in many real-world applications, however, are usually extracted from more than two views. For example, different kinds of color, texture and shape features are popularly used in visual analysis tasks such as image annotation and video retrieval. A typical approach for generalizing CCA to several views is to maximize the sum of pairwise correlations between different views Vía et al (2007). The main drawback of this strategy is that only the statistics (correlation information) between pairs of features are explored, while the high-order statistics that can only be obtained by simultaneously examining all features are ignored.

Figure 1: The tensor CCA motivation. Only the pairwise correlation is explored in the traditional extensions of CCA, while much more information (i.e., the high order correlation) that can only be obtained by simultaneously examining all views is explored in the proposed TCCA.

To tackle this problem, we develop tensor CCA (TCCA) to generalize CCA to handle an arbitrary number of views in a straightforward yet natural way. In particular, TCCA aims to directly maximize the correlation between the canonical variables of all views, and this is achieved by analyzing the high-order covariance tensor over the data from all views. We prove that maximizing this correlation is equivalent to approximating the covariance tensor with a rank-1 tensor in the optimal least-squares sense. This approximation has been investigated in the literature, and an efficient alternating least squares (ALS) algorithm can be adopted for optimization Kroonenberg and De Leeuw (1980); De Lathauwer et al (2000b); Comon et al (2009). In traditional pairwise correlation maximization, the explored statistics (correlation information) can be measured using covariance matrices whose size is determined by the number of views and the average feature dimension, whereas in the proposed TCCA, the statistics are captured by the full covariance tensor over all views. Fig. 1 is an illustrative example. Much more correlation information is thus encoded in the common subspace shared by all features in multi-view dimension reduction, and better performance can hopefully be achieved. Furthermore, we extend the proposed TCCA to the non-linear case, which is useful when the feature dimensions are very high and limited instances are available. We perform extensive experiments on a variety of challenging tasks, including large-scale biometric structure prediction, internet advertisement classification and web image annotation. We compare the proposed method with the traditional CCA Foster et al (2008) and its multi-view extension Vía et al (2007), as well as two representative unsupervised multi-view dimension reduction approaches Long et al (2008); Han et al (2012). The results confirm the effectiveness of the proposed TCCA.

The article is organized as follows. We summarize closely related works in Section 2. A brief introduction of CCA and its traditional multi-view extension is presented in Section 3. Section 4 includes the description, formulation, and analysis of the proposed TCCA, as well as its non-linear extension kernel TCCA (KTCCA) for multi-view dimension reduction. Extensive experiments are presented in Section 5 and the paper is concluded in Section 6.

2 Related Work

2.1 Multi-view Dimension Reduction

Dimension reduction is a key technique in machine learning. The goal of dimension reduction is to find a low-dimensional representation for high-dimensional data Xia et al (2010). Feature selection and feature transformation are the two main approaches to dimension reduction. The former aims to select a subset of variables from the original set, while the latter transforms the data to a new space of fewer dimensions. Dimension reduction can be performed in an unsupervised (e.g., principal component analysis (PCA) and Laplacian eigenmaps (LE) Belkin and Niyogi (2001)), semi-supervised Benabdeslem and Hindawi (2014), or supervised (e.g., linear discriminant analysis (LDA)) setting, which differ in the amount of label information utilized.

In another line of research, multi-view learning has recently attracted much attention. The term “multi-view” here refers to multiple feature representations of an object, not the spatial viewpoints considered in some other vision and graphics applications Su et al (2009). We generally classify multi-view learning algorithms into three families: weighted view combination Lanckriet et al (2004); McFee and Lanckriet (2011), multi-view dimension reduction Hardoon et al (2004); White et al (2012), and view agreement exploration Blum and Mitchell (1998); Kumar et al (2011). Multi-view dimension reduction focuses on removing irrelevant or redundant information Benabdeslem and Hindawi (2014) and reducing the feature dimension of data that consists of multiple views by leveraging the dependencies, coherence, and complementarity of those views. The different views are often assumed to be conditionally independent, so a latent representation shared by all views can be obtained by exploiting the conditional independence structure of the multi-view data Foster et al (2008); Long et al (2008); White et al (2012); Han et al (2012); Chen et al (2012). For example, canonical correlation analysis (CCA) is employed for multi-view dimension reduction in Foster et al (2008) to exploit the underlying conditional independence and redundancy assumptions in multi-view learning. A general unsupervised learning method is presented in Long et al (2008) for multi-view data, where a consensus representation is learned by first applying a dimension reduction technique (such as spectral embedding Belkin and Niyogi (2001)) on each view and then combining the results via matrix factorization. In Han et al (2012), structured sparsity Jenatton et al (2011) is enforced among the different views when learning the low-dimensional consensus representation, to allow information to be shared adaptively across subsets of features. In contrast to unsupervised multi-view dimension reduction, similarity/dissimilarity pairwise constraints are utilized in Hou et al (2010) for semi-supervised multi-view dimension reduction. In Chen et al (2012), supervision information is also incorporated into the learned latent shared subspace by the use of a large-margin latent Markov network. These methods usually obtain only a locally optimal subspace. Therefore, White et al. White et al (2012) proposed a convex formulation for learning a shared subspace of multiple sources, in which conditional independence constraints are enforced.

2.2 Canonical Correlation Analysis and Its Extensions

Canonical correlation analysis (CCA), originally proposed by Hotelling (1936), finds bases for two random variables (or sets of variables) such that the coordinates of the variable pairs projected onto these bases are maximally correlated Hardoon et al (2004). Much success has been achieved by applying CCA to pattern recognition and data mining. For example, SVM-2K was proposed in Farquhar et al (2005) for two-view classification. It combines kernel CCA and the support vector machine (SVM) in a single optimization problem, and the authors prove that the Rademacher complexity of SVM-2K is significantly lower than that of the individual SVMs. Kakade and Foster Kakade and Foster (2007) presented a multi-view regression algorithm regularized with a norm that is derived by applying CCA on unlabeled data. The authors show that the intrinsic dimension of the regression problem with the induced norm can be characterized by the correlation coefficients obtained in CCA. Under the conditionally uncorrelated assumption, a simple and efficient subspace learning algorithm based on CCA was proposed in Chaudhuri et al (2009) for multi-view clustering. The algorithm was shown to work well under much weaker separation conditions than previous clustering methods.

In addition to these applications, there have been dozens of extensions of CCA, most of which concentrate on inspecting the relationship between two sets of tensors rather than vectors. For example, classical CCA was extended in Lee and Choi (2007) to 2D-CCA, which directly analyzes 2D images without reshaping them into vectors. Some of its extensions are local 2D-CCA Wang (2010), sparse 2D-CCA Yan et al (2012), and multilinear CCA (MCCA) Lu (2013). Considering that the two high-order tensors to be studied may share multiple modes (e.g., in video volume data), Kim and Cipolla Kim and Cipolla (2009) presented two architectures for tensor correlation maximization by applying canonical transformations on the non-shared modes. In this way, features that have a good balance between flexibility and descriptive power may be obtained. This method is also termed “tensor CCA” (TCCA), but is quite different from the approach proposed in this paper. The main difference lies in that the former focuses on analyzing two high-order tensor data sets, while our objective is to analyze the high-order statistics among multiple vector data sets (views).

To the best of our knowledge, the works most closely related to our method are maximum variance CCA (CCA-MAXVAR) Kettenring (1971) and an adaptive CCA algorithm termed CCA-LS Vía et al (2007), which is based on least squares (LS) regression. The CCA-MAXVAR algorithm forms a weighted combination of the canonical variables (projected vectors) of all views to approximate a latent common representation. This approach requires a costly singular value decomposition (SVD) for optimization and cannot be trained in an adaptive fashion. To avoid these drawbacks, Via et al. Vía et al (2007) reformulated CCA-MAXVAR as a set of coupled LS regression problems, which seek to minimize the distance between each pair of canonical variables. The reformulation is proved to be equivalent to the original CCA-MAXVAR formulation, but is much more efficient and can be learned adaptively. Nevertheless, both CCA-LS and CCA-MAXVAR still have the disadvantage that only the pairwise correlations are exploited, while the high-order correlations between all views are ignored. We develop the following tensor CCA framework to rectify this shortcoming.

3 Canonical Correlation Analysis (CCA) and Its Multi-view Generalization

This section briefly introduces standard canonical correlation analysis (CCA) and its traditional generalizations to several data sets Kettenring (1971); Vía et al (2007). Given two sets of column vectors $\{x_i\}_{i=1}^{n}$ and $\{y_i\}_{i=1}^{n}$, the objective of CCA is to find a pair of projections (usually called canonical vectors) $w_x$ and $w_y$ such that the correlation between the two vectors of canonical variables $z_x = X w_x$ and $z_y = Y w_y$ is maximized. The optimization problem is thus given by

$$\rho = \max_{w_x, w_y} \frac{w_x^\top C_{xy} w_y}{\sqrt{w_x^\top C_{xx} w_x}\,\sqrt{w_y^\top C_{yy} w_y}}, \qquad (3.1)$$

where $C_{xx} = \frac{1}{n} X^\top X$ and $C_{yy} = \frac{1}{n} Y^\top Y$ are the data variance matrices, and $C_{xy} = \frac{1}{n} X^\top Y$ is the covariance matrix. Here, $X = [x_1, \ldots, x_n]^\top$ and $Y = [y_1, \ldots, y_n]^\top$ are the stacked data matrices. The optimization of problem (3.1) leads to the main solution of CCA, and the remaining solutions are given by maximizing the same correlation under the constraint of being orthogonal to the previous solutions.
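For concreteness, the following NumPy sketch solves (3.1) in the standard way: each view is whitened with the Cholesky factor of its (slightly regularized) covariance, and the SVD of the whitened cross-covariance yields the canonical vectors and correlations. The small ridge term `reg` is an implementation convenience assumed for this sketch, not part of the classical formulation.

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """Two-view CCA via an SVD of the whitened cross-covariance.

    X, Y : (n, d1) and (n, d2) data matrices with one instance per row.
    reg  : small ridge added to the view covariances for numerical
           stability (an assumption of this sketch).
    Returns the canonical vectors (columns of Wx, Wy) and the canonical
    correlations.
    """
    X = X - X.mean(axis=0)                      # center each view
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Lx = np.linalg.cholesky(Cxx)                # Cxx = Lx Lx^T
    Ly = np.linalg.cholesky(Cyy)
    Lx_inv, Ly_inv = np.linalg.inv(Lx), np.linalg.inv(Ly)
    M = Lx_inv @ Cxy @ Ly_inv.T                 # whitened cross-covariance
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    Wx = Lx_inv.T @ U                           # map back to the original spaces
    Wy = Ly_inv.T @ Vt.T
    return Wx, Wy, s                            # s holds the canonical correlations
```

The first columns of Wx and Wy give the main solution of (3.1); the remaining columns correspond to the subsequent solutions described above.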

CCA-MAXVAR Kettenring (1971) generalizes CCA to $m$ views. Suppose the data matrix for the $p$'th view is $X_p$; the optimization problem of CCA-MAXVAR for finding the canonical vectors is then

(3.2)

where is the vector of canonical variables, is the best possible one-dimensional PCA representation, and is the vector of combination weights. To avoid a trivial solution, an additional constraint such as is enforced. The solutions of (3.2) can be obtained using the SVD of . To develop an efficient and adaptive algorithm, Via et al. Vía et al (2007) reformulated (3.2) as

(3.3)

The orthogonal constraint is imposed on the different solutions, which can be obtained by using an iterative algorithm based on LS regression Vía et al (2007). Here, and is a vector of canonical variables projected using the ’th canonical vector in the ’th view.

Figure 2: System diagram of the multi-view dimension reduction method using the proposed TCCA. First, different kinds of features are extracted to represent the available instances in different views. Then a covariance tensor is calculated on the obtained representations to discover the correlation information between all views. By approximating the covariance tensor with a set of rank-1 tensors, we obtain a transformation matrix for each view. Each transformation matrix maps the original features to low-dimensional features in the common subspace, and the final representation is the concatenation of the projected features of all views.

4 Tensor Canonical Correlation Analysis (TCCA)

In contrast to CCA-MAXVAR Kettenring (1971) and CCA-LS Vía et al (2007), where only the pairwise correlations are considered, we propose tensor CCA (TCCA) for multi-view dimension reduction by exploiting the high-order tensor correlation between all views. The diagram of the multi-view dimension reduction method using the proposed TCCA is shown in Fig. 2. Different kinds of features, such as LAB color histogram (LAB), wavelet texture (WT), and the local SIFT features (SIFT), are first extracted to represent the instances in different views. This leads to multiple feature matrices . Here, is set at for intuitive illustration without loss of generality. The different sets of features are then used to calculate the data covariance tensor , which is subsequently decomposed as a weighted sum of rank-1 tensors, i.e., , where is the reduced dimension and is the tensor (outer) product. The vectors are stacked as a transformation matrix , which is used to map the original high dimensional features into the low dimensional common subspace. The projected features are concatenated as the final representation of the instances. The details of this technique are given below, but first we briefly introduce several useful notations and concepts of multilinear algebra.

4.1 Notations

Let be an -order tensor of size , and be a matrix. The -mode product of and is then denoted as , which is an tensor with the element

(4.1)

The product of and a sequence of matrices is a tensor denoted by

(4.2)

The mode- matricization of is denoted as an matrix , which is obtained by mapping the fibers associated with the 'th dimension of as the rows of , and aligning the corresponding fibers of all the other dimensions as the columns. Here, the columns can be ordered in any way. The -mode multiplication can be manipulated as matrix multiplication by storing the tensors in matricized form, i.e., . Specifically, the series of -mode products in (4.2) can be expressed in terms of Kronecker products and is given by

(4.3)

where is a forward cyclic ordering for the indices of the tensor dimensions that map to the column of the matrix. Finally, the Frobenius norm of the tensor is given by

(4.4)
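To make the notation concrete, a short NumPy sketch of mode-n matricization and the n-mode product computed through it is given below; the column ordering produced by reshape is just one of the admissible orderings mentioned above.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n matricization: the fibers of the given mode become the rows."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of unfold for a target tensor shape."""
    rest = [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape([shape[mode]] + rest), 0, mode)

def mode_n_product(T, A, mode):
    """n-mode product T x_n A: multiply the mode-n unfolding by A and refold."""
    new_shape = list(T.shape)
    new_shape[mode] = A.shape[0]
    return fold(A @ unfold(T, mode), mode, new_shape)
```

For instance, contracting a tensor with a row vector (a 1 x d matrix) along every mode reduces it to a scalar, which is exactly how the multilinear objective of the next subsection is evaluated.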

4.2 Problem Formulation

Given $m$ views of $n$ instances, let $X_p = [x_1^{(p)}, \ldots, x_n^{(p)}]^\top$ (of size $n \times d_p$) denote the data matrix of the $p$'th view, where each view is assumed to have been centered (i.e., to have zero mean). The variance matrices are then

$$C_{pp} = \frac{1}{n} X_p^\top X_p, \quad p = 1, \ldots, m,$$

and the covariance tensor among all views is calculated as

$$\mathcal{C} = \frac{1}{n} \sum_{i=1}^{n} x_i^{(1)} \circ x_i^{(2)} \circ \cdots \circ x_i^{(m)},$$

where $\circ$ denotes the tensor (outer) product and $\mathcal{C}$ is a tensor of dimension $d_1 \times d_2 \times \cdots \times d_m$. Following the objective of the traditional two-view CCA Hardoon et al (2004), the proposed tensor CCA seeks to maximize the correlation between the canonical variables $z_p = X_p w_p$, $p = 1, \ldots, m$, where the $w_p$ are usually called the canonical vectors. Therefore, the optimization problem is

$$\rho = \max_{w_1, \ldots, w_m} \; (z_1 \odot z_2 \odot \cdots \odot z_m)^\top \mathbf{1}, \quad \text{s.t. } w_p^\top C_{pp} w_p = 1, \; p = 1, \ldots, m. \qquad (4.5)$$

Here $\rho$ is the canonical correlation, $\odot$ is the element-wise (Hadamard) product, and $\mathbf{1}$ is an all-ones vector. We can prove that this objective is equivalent to $\mathcal{C} \times_1 w_1^\top \times_2 w_2^\top \cdots \times_m w_m^\top$, where $\times_p$ is the $p$-mode tensor-matrix product.

Theorem 1.

The high order canonical correlation is given by

$$\rho = \mathcal{C} \times_1 w_1^\top \times_2 w_2^\top \cdots \times_m w_m^\top. \qquad (4.6)$$

The proof is presented in the Appendix. By further considering the constraints $w_p^\top C_{pp} w_p = 1$, the problem (4.5) becomes

$$\max_{w_1, \ldots, w_m} \; \mathcal{C} \times_1 w_1^\top \times_2 w_2^\top \cdots \times_m w_m^\top, \quad \text{s.t. } w_p^\top C_{pp} w_p = 1, \; p = 1, \ldots, m. \qquad (4.7)$$

We further add a regularization term to the constraints to control the model complexity, and thus the constraints of problem (4.7) become

$$w_p^\top (C_{pp} + \varepsilon I)\, w_p = 1, \quad p = 1, \ldots, m, \qquad (4.8)$$

where $I$ is an identity matrix and $\varepsilon$ is a nonnegative trade-off parameter. Letting $u_p = (C_{pp} + \varepsilon I)^{1/2} w_p$ for each $p$, we can reformulate (4.7) as

$$\max_{u_1, \ldots, u_m} \; \mathcal{M} \times_1 u_1^\top \times_2 u_2^\top \cdots \times_m u_m^\top, \quad \text{s.t. } u_p^\top u_p = 1, \; p = 1, \ldots, m, \qquad (4.9)$$

where $\mathcal{M} = \mathcal{C} \times_1 (C_{11} + \varepsilon I)^{-1/2} \times_2 (C_{22} + \varepsilon I)^{-1/2} \cdots \times_m (C_{mm} + \varepsilon I)^{-1/2}$. The equivalence of problems (4.7) and (4.9) is ensured by the following theorem.

Theorem 2.

The problems (4.7) and (4.9) are equivalent.

Proof.

It is straightforward to see that the constraints of problems (4.7) and (4.9) are equivalent; we now prove that the objectives of the two problems are the same, as follows,

where the matricization property of the tensor-matrix product presented in (4.3) and some basic properties of the Kronecker product are applied. ∎
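To make the construction concrete, the following sketch (i) builds the covariance tensor of centered, row-wise data matrices by averaging per-instance outer products over the views, and (ii) applies the whitening that turns (4.7) into (4.9), assuming the regularized constraint matrices take the form $C_{pp} + \varepsilon I$ as written above.

```python
import numpy as np

def covariance_tensor(views):
    """Cross-view covariance tensor of centered, row-wise data matrices.

    views : list of (n, d_p) matrices sharing the same n instances.
    Returns a (d_1, ..., d_m) tensor: the average over instances of the
    outer products of the per-view feature vectors.
    """
    views = [V - V.mean(axis=0) for V in views]      # center each view
    n = views[0].shape[0]
    C = np.zeros([V.shape[1] for V in views])
    for i in range(n):
        outer = views[0][i]
        for V in views[1:]:
            outer = np.multiply.outer(outer, V[i])   # running outer product
        C += outer
    return C / n

def mode_n_product(T, A, mode):
    """n-mode product T x_mode A, computed with tensordot."""
    return np.moveaxis(np.tensordot(A, T, axes=(1, mode)), 0, mode)

def inv_sqrt(C, eps=1e-3):
    """(C + eps * I)^(-1/2) for a symmetric PSD matrix via eigendecomposition;
    eps plays the role of the nonnegative trade-off parameter above."""
    d, V = np.linalg.eigh(C + eps * np.eye(C.shape[0]))
    return V @ np.diag(1.0 / np.sqrt(d)) @ V.T

def whitened_tensor(views, eps=1e-3):
    """Form the tensor M of problem (4.9): the covariance tensor multiplied
    by (C_pp + eps*I)^(-1/2) along each mode."""
    M = covariance_tensor(views)
    for mode, V in enumerate(views):
        Vc = V - V.mean(axis=0)
        Cpp = Vc.T @ Vc / Vc.shape[0]
        M = mode_n_product(M, inv_sqrt(Cpp, eps), mode)
    return M
```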

4.3 Solutions

It has been shown in De Lathauwer et al (2000b) that problem (4.9) is equivalent to finding the best rank-1 approximation of the tensor $\mathcal{M}$ in the least-squares sense, i.e., if we define the rank-1 tensor $\widehat{\mathcal{M}} = \rho \, u_1 \circ u_2 \circ \cdots \circ u_m$, then the optimization problem becomes

$$\min_{\rho, \, u_1, \ldots, u_m} \; \big\| \mathcal{M} - \widehat{\mathcal{M}} \big\|_F^2, \quad \text{s.t. } u_p^\top u_p = 1, \; p = 1, \ldots, m. \qquad (4.10)$$

The solution can be obtained using the alternating least squares (ALS) algorithm Kroonenberg and De Leeuw (1980); Comon et al (2009). Some other algorithms, such as the higher-order power method (HOPM) De Lathauwer et al (2000b) and the tensor power method Allen (2012), can also be applied here for optimization, but our empirical findings indicate that the ALS algorithm performs best in our experiments.

As in two-view CCA, we perform a recursive maximization of the correlation between linear combinations of the views to obtain $r$ factors, where $r$ is at most the rank of $\mathcal{M}$ (the determination of the rank value is still an open problem for high-order tensors De Lathauwer et al (2000a)). However, we cannot expect the different linear combinations to be uncorrelated with each other. That is, orthogonality constraints cannot be imposed on the factors $u_p$, since a sum-of-rank-1 decomposition and an orthogonal decomposition of a high-order tensor cannot be satisfied simultaneously De Lathauwer et al (2000a).
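The core numerical step, the best rank-1 approximation of a dense tensor by ALS (equivalently, a higher-order power iteration), can be sketched as follows. In TCCA this would be applied to the whitened tensor $\mathcal{M}$, the resulting factors mapped back through $(C_{pp} + \varepsilon I)^{-1/2}$, and the procedure extended to the desired number of factors; the sketch below is illustrative rather than the exact implementation used in the experiments.

```python
import numpy as np

def rank1_als(T, n_iter=200, tol=1e-8, seed=0):
    """Best rank-1 approximation of a dense tensor by alternating updates.

    Each factor is updated in turn by contracting T with all the other
    factors and renormalizing; the weight lam is the multilinear singular
    value reached from this initialization.
    Returns lam and the unit-norm factor vectors (one per mode).
    """
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal(s) for s in T.shape]
    factors = [f / np.linalg.norm(f) for f in factors]
    lam_old = 0.0
    for _ in range(n_iter):
        for mode in range(T.ndim):
            v = T
            # contract every mode except the one being updated,
            # going from the last axis down so axis indices stay valid
            for m in range(T.ndim - 1, -1, -1):
                if m != mode:
                    v = np.tensordot(v, factors[m], axes=([m], [0]))
            lam = np.linalg.norm(v)
            factors[mode] = v / lam
        if abs(lam - lam_old) < tol:
            break
        lam_old = lam
    return lam, factors
```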

Based on the solutions $\{u_p^j\}_{j=1}^{r}$ of (4.10), we obtain the canonical vectors $w_p^j = (C_{pp} + \varepsilon I)^{-1/2} u_p^j$ and the corresponding canonical variables. Letting $W_p = [w_p^1, \ldots, w_p^r]$ collect the canonical vectors of the $p$'th view, we obtain the projected data for the $p$'th view:

$$Z_p = X_p W_p. \qquad (4.11)$$

Following Foster et al (2008), which suggests the dimension to which the data should be reduced in the standard CCA, we concatenate the different $Z_p$ as the final representation for subsequent learning, such as classification Farquhar et al (2005); Fisch et al (2014), clustering Yang et al (2014); Wu et al (2015), regression Kakade and Foster (2007), search ranking Xu et al (2015); Zhu et al (2015), collaborative filtering Liu et al (2014), and so on.
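The final multi-view representation is then simply the concatenation of the per-view projections, e.g.:

```python
import numpy as np

def project_and_concatenate(views, transforms):
    """Map each view through its transformation matrix and concatenate.

    views      : list of (n, d_p) data matrices.
    transforms : list of (d_p, r) matrices holding the canonical vectors
                 of each view, stacked column-wise.
    """
    return np.hstack([V @ W for V, W in zip(views, transforms)])
```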

4.4 Non-linear Extension

The projections in TCCA are linear and thus may not be appropriate for instances that lie in a highly non-linear feature space. To address this, we develop kernel tensor CCA (KTCCA), which extends the proposed TCCA to the non-linear case. KTCCA aims to find non-linear projections by first mapping the data into a higher-dimensional space induced by a feature mapping:

where the mapped dimension may be infinite. Then the variance matrices

the covariance matrix

and the canonical variables . It follows from the Representer Theorem Scholkopf and Smola (2002) that can be rewritten as a linear combination of the given instances, i.e.,

(4.12)

where is a vector of the combination coefficients. The problem (4.7) then becomes

(4.13)

where is the kernel matrix of the ’th view. The derivation is similar to Theorem 2. Here and can be calculated according to the following theorem.

Theorem 3.

The following equality holds:

in which , i.e., the ’th column of the kernel matrix , .

We give the proof in the Appendix. To avoid trivial learning, we follow Hardoon et al (2004) and introduce a partial least squares (PLS) term to penalize the norms of the weight vectors. That is, the constraints of problem (4.13) become

(4.14)

Because the matrix is positive definite, it has a unique Cholesky decomposition, and we can denote its decomposition as . Let and , we can reformulate (4.13) as

(4.15)

Similar to TCCA, this problem is equivalent to finding the best rank- approximation of , and the solution can be found using the ALS algorithm. By recursively maximizing the correlation, we obtain . Let , and the canonical variables and the projected data for the ’th view are then

(4.16)

The concatenated is the final representation of the instances.
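As a sketch of the kernel-space construction, and reading Theorem 3 as stating that the feature-space covariance tensor contracted with the representer coefficients reduces to a tensor built from columns of the (centered) kernel matrices, the analogue of the covariance tensor can be formed as below; the subsequent whitening and ALS steps then mirror Section 4.3, with the regularized constraint matrices of (4.14) in place of the regularized view covariances. This is an illustrative reading, not a verbatim reproduction of the derivation.

```python
import numpy as np

def center_kernel(K):
    """Double-center a kernel matrix so the implicit feature map has zero mean."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernel_covariance_tensor(kernels):
    """Kernel-space analogue of the covariance tensor (sketch).

    kernels : list of (n, n) kernel matrices, one per view, over the same
              instances. The tensor averages outer products of the i-th
              kernel columns across views.
    """
    kernels = [center_kernel(K) for K in kernels]
    n = kernels[0].shape[0]
    C = np.zeros([n] * len(kernels))
    for i in range(n):
        outer = kernels[0][:, i]
        for K in kernels[1:]:
            outer = np.multiply.outer(outer, K[:, i])
        C += outer
    return C / n
```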

4.5 Complexity Analysis

The time and space complexities of the proposed TCCA model are both closely related to the size of the covariance tensor. Straightforwardly, the space complexity is determined by the number of tensor entries. Because the tensor can be calculated offline, the time complexity is dominated by the rank-$r$ decomposition using the ALS algorithm. We can estimate the time complexity of ALS according to Comon et al (2009), where the time cost of the ALS algorithm for a three-mode tensor is presented; the cost also grows with the number of iterations in ALS.

According to the above analysis, we can see that the complexity of TCCA is independent of the number of instances, and thus our method scales to problems with very large sample sizes. Similarly, the complexities of KTCCA are determined by the corresponding kernel tensor, whose size depends on the number of instances rather than the feature dimensions. This means that KTCCA scales well to problems that have very high feature dimensions and a small number of instances.

5 Experiments

In this section, we empirically validate the effectiveness of the proposed TCCA on a biometric structure prediction and an advertisement classification problem following Foster et al (2008), as well as on a challenging web image annotation task Chua et al (2009). In all of the following experiments, five random choices of the labeled instances are used. Twenty percent of the test data (or unlabeled data in the transductive setting) are used for validation, which means that the parameters (if not specified) corresponding to the best performance on the validation set are used for testing. The evaluation criterion is the classification accuracy.

5.1 Evaluation of the Linear Formulation

In the first two sets of experiments (biometric structure prediction and advertisement classification), we use regularized least squares (RLS) as the base learner following Foster et al (2008). Given labeled instances, the optimization problem for RLS is given by , where the positive trade-off parameter is set according to Foster et al (2008). A constant feature is appended to each instance to include a bias term. In web image annotation, the -nearest-neighbor (NN) classifier is utilized, where the candidate set for is . Specifically, we compare the following methods:

  • BSF: using the single view feature that achieves the best performance in RLS/NN-based classification.

  • CAT: concatenating the normalized features of all the views into a long vector, and then performing RLS/NN-based classification.

  • CCA Foster et al (2008): using the CCA formulation presented in Foster et al (2008) to find a common representation of two different views. In this formulation, a regularization term is added to control the model complexity, and we set the parameter as in biometric structure prediction and advertisement classification according to Foster et al (2008). The parameter is tuned over the set in web image annotation. The implementation details can be found in Foster et al (2008). For different views, there are subsets of two views. The subset that achieves the best performance is termed CCA (BST). To combine the results of all subsets, we average their predicted scores in RLS-based classification and adopt the majority voting strategy in NN. This combination approach is termed CCA (AVG).

  • CCA-LS Vía et al (2007): a generalization of CCA to multiple views based on least squares (LS) regression.

  • DSE Long et al (2008): a general and popular unsupervised multi-view dimension reduction method based on spectral embedding.

  • SSMVD Han et al (2012): a recently proposed unsupervised multi-view dimension reduction method based on the structured sparsity-inducing norm Jenatton et al (2011).

  • TCCA: the proposed tensor CCA. The regularization parameter is optimized in the same way as in CCA.

In the first step of DSE and SSMVD, PCA is used as the dimension reduction method for each view, and the resulting dimension (of each view) is set empirically.
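For reference, the RLS base learner described above admits a simple closed-form solution; the sketch below appends the constant bias feature as described, and lam is only a placeholder for the trade-off value, which in the paper follows Foster et al (2008).

```python
import numpy as np

def rls_fit(X, y, lam=1.0):
    """Regularized least squares: min_w ||Xb w - y||^2 + lam * ||w||^2,
    where Xb is X with a constant column appended so a bias is learned."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    d = Xb.shape[1]
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(d), Xb.T @ y)

def rls_predict(X, w):
    """Real-valued scores; thresholding (or arg-max over one-vs-rest scores)
    gives the predicted labels."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w
```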

5.1.1 Biometric Structure Prediction

The dataset used in this set of experiments is SecStr (http://www.kyb.tuebingen.mpg.de/ssl-book), a benchmark dataset for evaluating semi-supervised systems Chapelle et al (2006). The task associated with this dataset is “to predict the secondary structure of a given amino acid in a protein based on a sequence window centered around that amino acid” Chapelle et al (2006). The SecStr dataset is large-scale and contains instances. We randomly select instances as labeled samples. There are also unlabeled instances, which we use to observe the performance of the three CCA-based methods (CCA, CCA-LS and TCCA) with respect to different amounts of unlabeled data. Following Foster et al (2008), all the provided data are used (as unlabeled instances) to find the common subspace in the CCA-based methods. The performance is evaluated in a transductive setting on the unlabeled samples (except those for validation) of the instances. Both DSE and SSMVD are naturally transductive, since they learn the low-dimensional representation of the given data directly, and no projection matrix is learned for new data. Therefore, these two methods cannot handle very large datasets, and their experiments are conducted only on the instances. In particular, DSE needs to solve an eigen-decomposition problem whose size grows with the number of samples; the time or memory cost is intolerable when this number is large, and thus a subset of samples is utilized.

The features provided are categorical attributes, each of which is generated at a position in the sequence window around the given amino acid and is represented by a sparse binary vector. We divide the features into three views:

  • View-1: attributes based on the left context (positions in );

  • View-2: attributes based on the current position and middle context (positions in );

  • View-3: attributes based on the right context (positions in ).

The dimension of each view is .

Figure 3: Prediction accuracy vs. dimension of the common subspace on the SecStr dataset. (Top: labeled instances and unlabeled instances; Bottom: labeled instances and the entire unlabeled set (about instances).)
Methods        #unlabeled =        #unlabeled =
BSF            57.48±1.90
CAT            57.77±2.03
CCA (BST)      58.78±2.97          59.97±2.46
CCA (AVG)      60.75±1.92          61.15±1.73
CCA-LS         60.23±1.70          61.32±1.65
DSE            60.15±0.81          No Attempt
SSMVD          61.08±1.58
TCCA           62.36±1.27          64.42±1.70
Table 1: Prediction accuracies () of the different methods at their best dimensions on the SecStr dataset ( labeled instances).

The performance of the compared methods in relation to the dimension of the common subspace is shown in Fig. 3. Accuracy is averaged over runs for each dimension. The performance of the different methods at their best dimensions is summarized in Table 1. From the results, we observe that: 1) the concatenation strategy (CAT) is comparable to, and slightly better than, the strategy of only using the best single-view features (BSF); 2) by learning the common subspace, all the compared multi-view dimension reduction methods are significantly better than the BSF and CAT baselines, if the dimensionalities are properly set according to the accuracy on the validation set. In particular, CCA (BST) is superior to CAT, although only a subset of two views is utilized in the former; 3) the accuracy of all three CCA-based methods increases with an increasing amount of unlabeled data. By combining the results of the different subsets, CCA (AVG) is better than CCA (BST); 4) CCA-LS is superior to CCA (BST), but their performance at their best dimensions is comparable. When the amount of unlabeled data is small, DSE and SSMVD are comparable to CCA (BST) and CCA-LS, respectively; 5) the performance of TCCA does not decrease significantly at high dimensions, as that of CCA-LS and CCA does. The main reason is that the ALS algorithm used in TCCA seeks to maximize the canonical correlations for all the factors simultaneously, rather than greedily finding orthogonal decomposition components Allen (2012). That is, the main variance tends to be explained uniformly by all factors, not only by the first several factors. This is also the reason for the oscillations in the TCCA curves; 6) the proposed TCCA significantly outperforms all the other methods at most dimensionalities. This demonstrates that the high-order correlation information between all features is well discovered, and that exploring this kind of information is much better than only exploring the correlation information between pairs of features, as in CCA-LS.

5.1.2 Advertisement Classification

This set of experiments is conducted on the Ads (internet advertisements)222http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements dataset from the well-known UCI Machine Learning Repository. The task is to predict whether or not a given hyperlink (associated with an image) is an advertisement. There are instances in this dataset. We randomly choose instances as labeled training samples, and all the instances except those for validation are utilized as unlabeled samples to find the common subspace. The performance is evaluated in a transductive setting on the unlabeled samples.

We use the features described in Kushmerick (1999), and omit the attributes that have missing values, such as the height (and width) of the image. The remaining attributes are represented by binary features which indicate the presence/absence of the corresponding terms. For CCA-LS and TCCA, we divide all these features into three views as follows:

  • View-1: features based on the terms in the image's URL, caption, and alt text. dimensions;

  • View-2: features based on the terms in the URL of the current site. dimensions;

  • View-3: features based on the terms in the anchor URL. dimensions.

Figure 4: Classification accuracy vs. dimension of the common subspace on the Ads dataset. labeled training samples are utilized.
Methods        #labeled = 100
BSF            91.10±1.65
CAT            91.08±1.74
CCA (BST)      92.88±1.11
CCA (AVG)      93.84±0.85
CCA-LS         93.17±1.10
DSE            93.01±0.96
SSMVD          92.99±0.91
TCCA           94.59±0.27
Table 2: Classification accuracies () of the different methods at their best dimensions on the Ads dataset.

Fig. 4 shows the classification accuracy of the compared methods (in relation to the dimension of the common subspace), and the accuracies at their best dimensions are summarized in Table 2. In contrast to the observations of the last set of experiments, we can see that: 1) the accuracies of the concatenation strategy (CAT) and the best single view (BSF) are almost the same. The performance of CAT is relatively worse because the feature dimension in this set of experiments is high, and over-fitting occurs given the limited number of labeled samples; 2) the performance of DSE and SSMVD first increases and then decreases sharply with increasing dimension, while the CCA-based methods are much steadier; 3) the improvement of TCCA over the other CCA-based methods is not as great as in the last set of experiments. This is because more samples are needed to approximate the true underlying high-order correlation than the traditional pairwise correlation, since there are more variables to be estimated in the high-order statistics. Much fewer unlabeled instances are utilized in this set of experiments, and thus the high-order correlation information is not well explored. CCA-LS is only comparable to CCA for the same reason.

5.1.3 Web Image Annotation

We further verify the effectiveness of the proposed algorithm on the natural image dataset NUS-WIDE Chua et al (2009). This dataset contains images, and our experiments are conducted on a subset that consists of images belonging to mammal concepts: bear, cat, cow, dog, elk, fox, horse, tiger, whale, and zebra. We randomly split the images into a training set of images and a test set of images. Distinguishing between these concepts is very challenging, since many of them are similar to each other, e.g., cat and tiger. We randomly choose labeled instances for each concept in the training set, and all the training instances are utilized as unlabeled samples to find the common subspace.

In this dataset, we choose three types of visual feature, namely -D bag of visual words based on SIFT Lowe (2004) descriptors, -D color auto-correlogram, and -D wavelet texture, to represent each image Chua et al (2009).

The annotation performance of the compared methods is shown in Fig. 5 and Table 3. It can be seen from the results that: 1) in general, performance improves with an increased number of labeled instances; 2) CCA-LS is comparable to CCA (BST) and CCA (AVG), while the best performance (peak of the curve) of CCA-LS is usually higher; 3) the performance of DSE is poor when the dimension is large, while SSMVD is much steadier and can sometimes be superior to CCA (AVG) and CCA-LS; 4) the accuracies of CCA (AVG) and CCA-LS first increase and then decrease with increasing dimension, while the results of the proposed TCCA remain satisfactory even when the dimension is large; 5) the accuracy of TCCA is significantly better than that of all the other methods at most dimensionalities.

Figure 5: Annotation accuracy vs. dimension of the common subspace on the NUS-WIDE mammal subset. (Left: 4 labeled instances for each mammal concept; Middle: 6 labeled instances; Right: 8 labeled instances.)
Methods        #labeled =         #labeled =         #labeled =
BSF            17.42±1.37         18.37±1.21         19.96±1.19
CAT            19.01±1.86         19.07±2.23         20.70±1.44
CCA (BST)      20.77±1.52         21.51±2.38         22.61±1.76
CCA (AVG)      21.21±1.47         21.57±2.04         22.61±1.21
CCA-LS         20.90±1.84         22.31±2.53         23.50±2.48
DSE            20.02±1.23         21.59±1.13         22.67±0.74
SSMVD          21.34±2.08         23.32±1.08         23.79±1.30
TCCA           22.40±1.96         23.86±1.41         24.11±0.32
Table 3: Annotation accuracies () of the different methods at their best dimensions on the NUS-WIDE mammal dataset.

5.2 Evaluation of the Non-linear Extension

We evaluate the non-linear extension of the proposed TCCA in the web image annotation task. As discussed in Section 4.5, the non-linear extension is able to handle the small sample size problem, where the feature dimensions can be very high and possibly infinite. We thus randomly choose a small set of samples from the mammal subset. To perform non-linear classification, we construct a kernel for each kind of feature. The kernel is defined by

where denotes the distance between and , and . We choose the distance for the visual word histogram. For other features, the distance is utilized. Specifically, we compare the following methods:

  • BSK: using the single view kernel that achieves the best performance in the NN-based classification.

  • AVG: averaging the normalized kernels of all the views, and then performing NN-based classification.

  • KCCA Hardoon et al (2004): using the KCCA formulation presented in Hardoon et al (2004) to find a common representation of two different views. The regularization parameter is optimized over the set . The setups of KCCA (BST) and KCCA (AVG) are similar to those of CCA (BST) and CCA (AVG) in the experiments on the linear version.

  • KTCCA: the non-linear extension of the proposed tensor CCA. The regularization parameter is optimized in the same way as in KCCA.
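The per-view kernels described at the beginning of this subsection can be constructed as in the following sketch, assuming the chi-square distance for the visual-word histograms (the distances used for the other features are elided in the text) and a mean-distance bandwidth heuristic, which is an assumption of this sketch rather than the paper's setting.

```python
import numpy as np

def chi2_distance(A, B, eps=1e-10):
    """Pairwise chi-square distances between the rows of A and of B
    (nonnegative histogram features); eps avoids division by zero."""
    diff = A[:, None, :] - B[None, :, :]
    summ = A[:, None, :] + B[None, :, :] + eps
    return 0.5 * np.sum(diff ** 2 / summ, axis=2)

def exp_kernel(D, sigma=None):
    """Kernel of the form exp(-d(x, y) / sigma). If sigma is not given,
    the mean pairwise distance is used as the bandwidth."""
    if sigma is None:
        sigma = D.mean()
    return np.exp(-D / sigma)
```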

The experimental results are shown in Fig. 6 and Table 4. Compared with the results in Fig. 5, we can see that: 1) although a small number of unlabeled samples is utilized, the performance is better since the separability is improved by the non-linear projection, which is implemented via the kernel trick Shawe-Taylor and Cristianini (2004); 2) the simple AVG view combination strategy outperforms the best single view kernel (BSK) significantly, and is comparable to KCCA (BST); 3) KCCA (AVG) is slightly better than KCCA (BST), and the proposed KTCCA achieves the best performance under most dimensionalities.

Figure 6: Annotation accuracy (of the non-linear methods) vs. dimension of the common subspace on the NUS-WIDE mammal subset, where a small set of samples is utilized. (Left: 4 labeled instances for each mammal concept; Middle: 6 labeled instances; Right: 8 labeled instances.)
Methods        #labeled =         #labeled =         #labeled =
BSK            17.96±1.29         19.17±2.01         20.04±1.66
AVG            20.49±1.65         21.73±2.74         22.86±1.87
KCCA (BST)     21.51±2.44         22.58±1.91         23.78±1.57
KCCA (AVG)     21.85±1.38         23.13±1.77         24.28±1.04
KTCCA          24.51±0.78         25.18±0.58         25.74±0.90
Table 4: Annotation accuracies () of the different non-linear methods at their best dimensions on the NUS-WIDE mammal dataset.

5.3 Empirical analysis of the computational complexity

In this subsection, we empirically analyze the computational complexity of the different methods. The experiments are conducted in Matlab R2012b on a GHz Intel Xeon ( cores) computer with GB of MHz ECC DDR3 RAM. The results (time cost and memory cost) on the different datasets are shown in Figs. 7-10. From the results, we observe that: 1) the costs of the proposed TCCA are in general higher than those of the other CCA-based methods. This is because the decomposition is performed on a large covariance tensor, instead of one or multiple covariance matrices. The tensor decomposition method we adopt in this paper is the ALS algorithm Kroonenberg and De Leeuw (1980); Comon et al (2009), which yields satisfactory accuracy but is not efficient; 2) TCCA is much more efficient than DSE or SSMVD when the feature dimensions are not very high and the number of instances is large (see Fig. 7 for example). This demonstrates the superiority of TCCA over existing unsupervised multi-view dimension reduction methods on large sample size problems.

6 Conclusion

Standard CCA cannot deal with multi-view data, and its typical multi-view extensions ignore the high order statistics (correlation information) among all feature views. To resolve this problem, we have presented tensor CCA (TCCA) to discover such statistics by analyzing the covariance tensor of all views.

From the experimental validation on a variety of application tasks, we conclude that: 1) finding a common subspace for all views using the CCA-based strategy is often better than simply concatenating all the features, especially when the feature dimension is high; 2) examining more statistics, which may require more unlabeled data to be utilized, often leads to better performance; 3) by exploring the high order statistics, the proposed TCCA outperforms the other methods, especially when the dimension of the common subspace is high.

Compared with CCA and its traditional multi-view extensions, the main disadvantage of the proposed TCCA is its high computational cost. Most of the cost of TCCA lies in the tensor decomposition, which is not the focus of this paper. In the future, we will investigate efficient tensor decomposition methods that could speed up TCCA, or introduce parallel computing techniques, such as GPU acceleration of the ALS tensor decomposition.

Figure 7: Computational complexity vs. dimension of the common subspace on the SecStr dataset. (Top: time cost in seconds; Bottom: memory cost in Megabits.)
Figure 8: Computational complexity vs. dimension of the common subspace on the Ads dataset. labeled training samples are utilized. (Top: time cost in seconds; Bottom: memory cost in Megabits.)
Figure 9: Computational complexity vs. dimension of the common subspace on the NUS-WIDE mammal subset. labeled samples for each mammal concept are utilized. (Top: time cost in seconds; Bottom: memory cost in Megabits.)
Figure 10: Computational complexity (of the non-linear methods) vs. dimension of the common subspace on the NUS-WIDE mammal subset, where a small set of instances and labeled samples for each mammal concept are utilized. (Top: time cost in seconds; Bottom: memory cost in Megabits.)

Appendix A Proof of Theorem 1

Proof.

According to the definition of the element-wise product, we have

(A.1)

where denotes the ’th entry of the vector , and the same notation is used for and . Additionally,

(A.2)

According to the definition of the -mode product of a tensor and matrix, we have

(A.3)

Therefore,

(A.4)

This completes the proof. ∎

Appendix B Proof of Theorem 3

Proof.

Let and , then according to the definition of the outer product, the ’th entry of is

(B.1)

where is the ’th element of the vector . Additionally, the ’th entry of is

(B.2)

where is the ’th element of the vector . According to the definition of the tensor-matrix product, we have

Then the ’th entry of is

(B.3)

By comparing (B.1) and (B.3), we complete the proof. ∎

References

  • Allen (2012) Allen GI (2012) Sparse higher-order principal components analysis. In: International Conference on Artificial Intelligence and Statistics, pp 27–36
  • Bach and Jordan (2005) Bach FR, Jordan MI (2005) A probabilistic interpretation of canonical correlation analysis. Tech. Rep. 688, Department of Statistics, University of California, Berkeley
  • Belkin and Niyogi (2001) Belkin M, Niyogi P (2001) Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Advances in Neural Information Processing Systems, pp 585–591
  • Benabdeslem and Hindawi (2014) Benabdeslem K, Hindawi M (2014) Efficient semi-supervised feature selection: Constraint, relevance and redundancy. IEEE Transactions on Knowledge and Data Engineering 26(5):1131–1143
  • Blaschko and Lampert (2008) Blaschko MB, Lampert CH (2008) Correlational spectral clustering. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1–8
  • Blum and Mitchell (1998) Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Annual conference on Computational Learning Theory, pp 92–100
  • Chapelle et al (2006) Chapelle O, Schölkopf B, Zien A (2006) Semi-supervised learning. MIT Press, Cambridge, MA
  • Chaudhuri et al (2009) Chaudhuri K, Kakade SM, Livescu K, Sridharan K (2009) Multi-view clustering via canonical correlation analysis. In: International Conference on Machine Learning, pp 129–136
  • Chen et al (2012) Chen N, Zhu J, Sun F, Xing EP (2012) Large-margin predictive latent subspace learning for multiview data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(12):2365–2378
  • Chua et al (2009) Chua TS, Tang J, Hong R, Li H, Luo Z, Zheng Y (2009) NUS-WIDE: a real-world web image database from national university of singapore. In: International Conference on Image and Video Retrieval, pp 48:1–48:9
  • Comon et al (2009) Comon P, Luciani X, De Almeida AL (2009) Tensor decompositions, alternating least squares and other tales. Journal of Chemometrics 23(7-8):393–405
  • De Lathauwer et al (2000a) De Lathauwer L, De Moor B, Vandewalle J (2000a) A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications 21(4):1253–1278
  • De Lathauwer et al (2000b) De Lathauwer L, De Moor B, Vandewalle J (2000b) On the best rank-1 and rank-(R1, R2, …, RN) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications 21(4):1324–1342
  • Farquhar et al (2005) Farquhar JDR, Hardoon D, Meng H, Shawe-taylor JS, Szedmak S (2005) Two view learning: SVM-2K, theory and practice. In: Advances in Neural Information Processing Systems, pp 355–362
  • Fisch et al (2014) Fisch D, Kalkowski E, Sick B (2014) Knowledge fusion for probabilistic generative classifiers with data mining applications. IEEE Transactions on Knowledge and Data Engineering 26(3):652–666
  • Foster et al (2008) Foster DP, Johnson R, Zhang T (2008) Multi-view dimensionality reduction via canonical correlation analysis. Tech. Rep. TR-2009-5, TTI-Chicago
  • Guillaumin et al (2009) Guillaumin M, Mensink T, Verbeek J, Schmid C (2009) Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation. In: International Conference on Computer Vision, pp 309–316
  • Han et al (2012) Han Y, Wu F, Tao D, Shao J, Zhuang Y, Jiang J (2012) Sparse unsupervised dimensionality reduction for multiple view data. IEEE Transactions on Circuits and Systems for Video Technology 22(10):1485–1496
  • Hardoon et al (2004) Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: An overview with application to learning methods. Neural Computation 16(12):2639–2664
  • Hou et al (2010) Hou C, Zhang C, Wu Y, Nie F (2010) Multiple view semi-supervised dimensionality reduction. Pattern Recognition 43(3):720–730
  • Jenatton et al (2011) Jenatton R, Audibert JY, Bach F (2011) Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research 12:2777–2824
  • Kakade and Foster (2007) Kakade SM, Foster DP (2007) Multi-view regression via canonical correlation analysis. In: Annual conference on Computational Learning Theory, pp 82–96
  • Kettenring (1971) Kettenring JR (1971) Canonical analysis of several sets of variables. Biometrika 58(3):433–451
  • Kim and Cipolla (2009) Kim TK, Cipolla R (2009) Canonical correlation analysis of video volume tensors for action categorization and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(8):1415–1428
  • Kroonenberg and De Leeuw (1980) Kroonenberg PM, De Leeuw J (1980) Principal component analysis of three-mode data by means of alternating least squares algorithms. Psychometrika 45(1):69–97
  • Kumar et al (2011) Kumar A, Rai P, Daumé III H (2011) Co-regularized multi-view spectral clustering. In: Advances in Neural Information Processing Systems, pp 1413–1421
  • Kushmerick (1999) Kushmerick N (1999) Learning to remove internet advertisements. In: Proceedings of the third annual conference on Autonomous Agents, pp 175–181
  • Lanckriet et al (2004) Lanckriet G, Cristianini N, Bartlett P, Ghaoui L, Jordan M (2004) Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research 5:27–72
  • Lee and Choi (2007) Lee SH, Choi S (2007) Two-dimensional canonical correlation analysis. IEEE Signal Processing Letters 14(10):735–738
  • Liu et al (2014) Liu Q, Chen E, Xiong H, Ge Y, Li Z, Wu X (2014) A cocktail approach for travel package recommendation. IEEE Transactions on Knowledge and Data Engineering 26(2):278–293
  • Long et al (2008) Long B, Philip SY, Zhang ZM (2008) A general model for multiple view unsupervised learning. In: SDM, pp 822–833
  • Lowe (2004) Lowe DG (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2):91–110
  • Lu (2013) Lu H (2013) Learning canonical correlations of paired tensor sets via tensor-to-vector projection. In: International Joint Conference on Artificial Intelligence, pp 1516–1522
  • McFee and Lanckriet (2011) McFee B, Lanckriet G (2011) Learning multi-modal similarity. Journal of Machine Learning Research 12:491–523
  • Oliva and Torralba (2001) Oliva A, Torralba A (2001) Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision 42(3):145–175
  • Scholkopf and Smola (2002) Scholkopf B, Smola A (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. the MIT Press
  • Shawe-Taylor and Cristianini (2004) Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge university press
  • Su et al (2009) Su H, Sun M, Fei-Fei L, Savarese S (2009) Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories. In: International Conference on Computer Vision, pp 213–220
  • Vía et al (2007) Vía J, Santamaría I, Pérez J (2007) A learning algorithm for adaptive canonical correlation analysis of several data sets. Neural Networks 20(1):139–152
  • Wang (2010) Wang H (2010) Local two-dimensional canonical correlation analysis. IEEE Signal Processing Letters 17(11):921–924
  • White et al (2012) White M, Zhang X, Schuurmans D, Yu Yl (2012) Convex multi-view subspace learning. In: Advances in Neural Information Processing Systems, pp 1682–1690
  • Wu et al (2015) Wu J, Liu H, Xiong H, Cao J, Chen J (2015) K-means-based consensus clustering: A unified view. IEEE Transactions on Knowledge and Data Engineering 27(1):155–169
  • Xia et al (2010) Xia T, Tao D, Mei T, Zhang Y (2010) Multiview spectral embedding. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 40(6):1438–1446
  • Xu et al (2015) Xu B, Bu J, Chen C, Wang C, Cai D, He X (2015) Emr: A scalable graph-based ranking model for content-based image retrieval. IEEE Transactions on Knowledge and Data Engineering 27(1):102–114
  • Yan et al (2012) Yan J, Zheng W, Zhou X, Zhao Z (2012) Sparse 2-d canonical correlation analysis via low rank matrix approximation for feature extraction. IEEE Signal Processing Letters 19(1):51–54
  • Yang et al (2014) Yang S, Yi Z, Ye M, He X (2014) Convergence analysis of graph regularized non-negative matrix factorization. IEEE Transactions on Knowledge and Data Engineering 26(9):2151–2165
  • Zhu et al (2015) Zhu H, Xiong H, Ge Y, Chen E (2015) Discovery of ranking fraud for mobile apps. IEEE Transactions on Knowledge and Data Engineering 27(1):74–87