Tensor Canonical Correlation Analysis for Multi-view Dimension Reduction
Abstract
Canonical correlation analysis (CCA) has proven an effective tool for two-view dimension reduction due to its profound theoretical foundation and success in practical applications. In respect of multi-view learning, however, it is limited by its capability of only handling data represented by two-view features, while in many real-world applications the number of views is frequently larger. Although the ad hoc way of simultaneously exploring all possible pairs of features can numerically deal with multi-view data, it ignores the high-order statistics (correlation information) that can only be discovered by simultaneously exploring all features.
Therefore, in this work we develop tensor CCA (TCCA), which straightforwardly yet naturally generalizes CCA to handle data with an arbitrary number of views by analyzing the covariance tensor of the different views. TCCA aims to directly maximize the canonical correlation of multiple (more than two) views. Crucially, we prove that the multi-view canonical correlation maximization problem is equivalent to finding the best rank-1 approximation of the data covariance tensor, which can be solved efficiently using the well-known alternating least squares (ALS) algorithm. As a consequence, the high-order correlation information contained in the different views is explored, and thus a more reliable common subspace shared by all features can be obtained. In addition, a nonlinear extension of TCCA is presented. Experiments on various challenging tasks, including large-scale biometric structure prediction, internet advertisement classification, and web image annotation, demonstrate the effectiveness of the proposed method.
1 Introduction
The features utilized in many real-world data mining tasks are frequently of high dimension and extracted from multiple views (or sources). For example, both the page content and the hyperlinks, represented by bag-of-words (BOW) features, are usually used in web page classification Blum and Mitchell (1998); Foster et al (2008), and it is common to combine global (such as GIST Oliva and Torralba (2001)) and local (such as SIFT Lowe (2004)) descriptors in image annotation Chua et al (2009); Guillaumin et al (2009). In these applications, the features can have dimensions of up to several hundred or several thousand.
Multi-view dimension reduction Foster et al (2008) seeks a low-dimensional common subspace to compactly represent heterogeneous data in which each data example is associated with multiple high-dimensional features. It often benefits the subsequent learning process significantly, in that the curse of dimensionality is alleviated and the computational efficiency is improved Hou et al (2010); Han et al (2012). Canonical correlation analysis (CCA), which is designed to inspect the linear relationship between two sets of variables Hardoon et al (2004); Bach and Jordan (2005), was formally introduced as a multi-view dimension reduction method in Foster et al (2008), where the authors prove that the labeled-instance complexity can be effectively reduced under certain weak assumptions. In addition, CCA has been widely used for multi-view classification Farquhar et al (2005), regression Kakade and Foster (2007), clustering Blaschko and Lampert (2008); Chaudhuri et al (2009), etc. Theoretically, Bach and Jordan Bach and Jordan (2005) interpreted CCA probabilistically as a latent variable model, so that it can be incorporated into larger probabilistic models.
In spite of the profound theoretical foundation and practical success of CCA in multi-view learning, it can only handle data that are represented by two-view features. The features utilized in many real-world applications, however, are usually extracted from more than two views. For example, different kinds of color, texture, and shape features are popularly used in visual analysis tasks such as image annotation and video retrieval. A typical approach for generalizing CCA to several views is to maximize the sum of the pairwise correlations between different views Vía et al (2007). The main drawback of this strategy is that only the statistics (correlation information) between pairs of features are explored, while the high-order statistics that can only be obtained by simultaneously examining all features are ignored.
To tackle this problem, we develop tensor CCA (TCCA) to generalize CCA to handle an arbitrary number of views in a straightforward yet natural way. In particular, TCCA aims to directly maximize the correlation between the canonical variables of all views, and this is achieved by analyzing the high-order covariance tensor over the data from all views. We prove that maximizing the correlation is equivalent to approximating the covariance tensor with a rank-1 tensor in an optimal least-squares sense. This approximation has been investigated in the literature, and an efficient alternating least squares (ALS) algorithm can be adopted for optimization Kroonenberg and De Leeuw (1980); De Lathauwer et al (2000b); Comon et al (2009). In the traditional pairwise correlation maximization, the statistics (correlation information) explored can be measured using covariance matrices of size $\bar{d} \times \bar{d}$, where $\bar{d}$ represents the average feature dimension over the $m$ views, whereas in the proposed TCCA, the size of the covariance tensor is $\bar{d} \times \bar{d} \times \cdots \times \bar{d}$ ($m$ modes). Fig. 1 gives an illustrative example with $m = 3$. Much more correlation information is thus encoded in the common subspace shared by all features in multi-view dimension reduction, and hopefully better performance can be achieved. Furthermore, we extend the proposed TCCA to the nonlinear case, which is useful when the feature dimensions are very high and limited instances are available. We perform extensive experiments on a variety of challenging tasks, including large-scale biometric structure prediction, internet advertisement classification, and web image annotation. We compare the proposed method with the traditional CCA Foster et al (2008) and its multi-view extension Vía et al (2007), as well as two representative unsupervised multi-view dimension reduction approaches Long et al (2008); Han et al (2012). The results confirm the effectiveness of the proposed TCCA.
The article is organized as follows. We summarize closely related work in Section 2. A brief introduction to CCA and its traditional multi-view extension is presented in Section 3. Section 4 includes the description, formulation, and analysis of the proposed TCCA, as well as its nonlinear extension, kernel TCCA (KTCCA), for multi-view dimension reduction. Extensive experiments are presented in Section 5, and the paper is concluded in Section 6.
2 Related Work
2.1 Multi-view Dimension Reduction
Dimension reduction is a key technique in machine learning. The goal of dimension reduction is to find a low-dimensional representation of high-dimensional data Xia et al (2010). Feature selection and feature transformation are the two main approaches to dimension reduction. The former aims to select a subset of variables from the original set, while the latter transforms the data into a new space of fewer dimensions. Dimension reduction can be performed in an unsupervised (e.g., principal component analysis (PCA) and Laplacian eigenmaps (LE) Belkin and Niyogi (2001)), semi-supervised Benabdeslem and Hindawi (2014), or supervised (e.g., linear discriminant analysis (LDA)) setting; these settings differ in the amount of label information utilized.
In another research line, multi-view learning has attracted much attention recently. The term multi-view here refers to multiple feature representations of an object, not the spatial viewpoints considered in some other vision and graphics applications Su et al (2009). We generally classify multi-view learning algorithms into three families: weighted view combination Lanckriet et al (2004); McFee and Lanckriet (2011), multi-view dimension reduction Hardoon et al (2004); White et al (2012), and view agreement exploration Blum and Mitchell (1998); Kumar et al (2011). Multi-view dimension reduction focuses on removing irrelevant or redundant information Benabdeslem and Hindawi (2014) and reducing the feature dimension of data that consist of multiple views by leveraging the dependencies, coherence, and complementarity of those views. The different views are often assumed to be conditionally independent, so a latent representation shared by all views can be obtained by exploiting the conditional independence structure of the multi-view data Foster et al (2008); Long et al (2008); White et al (2012); Han et al (2012); Chen et al (2012). For example, canonical correlation analysis (CCA) is employed for multi-view dimension reduction in Foster et al (2008) to exploit the underlying conditional independence and redundancy assumptions in multi-view learning. A general unsupervised learning method is presented in Long et al (2008) for multi-view data, where a consensus representation is learned by first applying a dimension reduction technique (such as spectral embedding Belkin and Niyogi (2001)) to each view and then combining the results via matrix factorization. In Han et al (2012), structured sparsity Jenatton et al (2011) is enforced among the different views when learning the low-dimensional consensus representation, to allow information to be shared adaptively across subsets of features. In contrast to unsupervised multi-view dimension reduction, similarity/dissimilarity pairwise constraints are utilized in Hou et al (2010) for semi-supervised multi-view dimension reduction. In Chen et al (2012), supervising information is also incorporated into the learned latent shared subspace through a large-margin latent Markov network. These methods usually obtain only a locally optimal subspace. Therefore, White et al. White et al (2012) proposed a convex formulation for learning a shared subspace of multiple sources, in which conditional independence constraints are enforced.
2.2 Canonical Correlation Analysis and Its Extensions
Canonical correlation analysis (CCA), originally proposed by Hotelling (1936), finds bases for two random variables (or sets of variables) such that the coordinates of the variable pairs projected onto these bases are maximally correlated Hardoon et al (2004). Much success has been achieved by applying CCA to pattern recognition and data mining. For example, SVM-2K was proposed in Farquhar et al (2005) for two-view classification. It combines kernel CCA and the support vector machine (SVM) in a single optimization problem, and the authors prove that the Rademacher complexity of SVM-2K is significantly lower than that of the individual SVMs. Kakade and Foster Kakade and Foster (2007) presented a multi-view regression algorithm regularized with a norm derived by applying CCA to unlabeled data. The authors show that the intrinsic dimension of the regression problem with the induced norm can be characterized by the correlation coefficients obtained in CCA. Under a conditionally uncorrelated assumption, a simple and efficient subspace learning algorithm based on CCA was proposed in Chaudhuri et al (2009) for multi-view clustering. The algorithm was shown to work well under much weaker separation conditions than previous clustering methods.
In addition to these applications, there have been dozens of developments of CCA, most of which concentrate on inspecting the relationship between two sets of tensors rather than vectors. For example, the classical CCA was extended in Lee and Choi (2007) to 2D-CCA, which directly analyzes 2D images without reshaping them into vectors. Some of its extensions are local 2D-CCA Wang (2010), sparse 2D-CCA Yan et al (2012), and multilinear CCA (MCCA) Lu (2013). Considering that the two high-order tensors to be studied may share multiple modes (e.g., video volume data), Kim and Cipolla Kim and Cipolla (2009) presented two architectures for tensor correlation maximization by applying canonical transformations on the non-shared modes. In this way, features with a good balance between flexibility and descriptive power may be obtained. This method is also termed "tensor CCA" (TCCA), but is quite different from the approach proposed in this paper: their method focuses on analyzing two high-order tensor data sets, while our objective is to analyze the high-order statistics among multiple vector data sets (views).
The most closely related works to our method, as far as we are concerned, are the maximum variance CCA (CCA-MAXVAR) Kettenring (1971) and an adaptive CCA algorithm termed CCA-LS Vía et al (2007), which is based on least-squares (LS) regression. The CCA-MAXVAR algorithm forms a weighted combination of the canonical variables (projected vectors) of all views to approximate a latent common representation. This approach requires a costly singular value decomposition (SVD) for optimization and cannot be trained in an adaptive fashion. To avoid these drawbacks, Via et al. Vía et al (2007) reformulated CCA-MAXVAR as a set of coupled LS regression problems, which seek to minimize the distance between each pair of canonical variables. The reformulation is proved to be equivalent to the original CCA-MAXVAR formulation, but is much more efficient and can be learned adaptively. Nevertheless, both CCA-LS and CCA-MAXVAR share a disadvantage: only the pairwise correlations are exploited, while the high-order correlations between all views are ignored. We develop the following tensor CCA framework to rectify this shortcoming.
3 Canonical Correlation Analysis (CCA) and Its Multi-view Generalization
This section briefly introduces standard canonical correlation analysis (CCA) and its traditional generalization to several data sets Kettenring (1971); Vía et al (2007). Given two sets of column vectors $\{x_i \in \mathbb{R}^{d_x}\}_{i=1}^{n}$ and $\{y_i \in \mathbb{R}^{d_y}\}_{i=1}^{n}$, the objective of CCA is to find a pair of projections (usually called canonical vectors) $w_x \in \mathbb{R}^{d_x}$ and $w_y \in \mathbb{R}^{d_y}$, such that the correlation between the two vectors of canonical variables $z_x = X^\top w_x$ and $z_y = Y^\top w_y$ is maximized. The optimization problem is thus given by
$$\max_{w_x, w_y}\ \rho = \frac{w_x^\top C_{xy} w_y}{\sqrt{\left( w_x^\top C_{xx} w_x \right) \left( w_y^\top C_{yy} w_y \right)}}, \qquad (3.1)$$
where $C_{xx} = X X^\top$ and $C_{yy} = Y Y^\top$ are the data variance matrices, and $C_{xy} = X Y^\top$ is the covariance matrix. Here, $X = [x_1, \ldots, x_n]$ and $Y = [y_1, \ldots, y_n]$ are the stacked data matrices. The optimization of problem (3.1) leads to the main solution of CCA, and the remaining solutions are given by maximizing the same correlation under the constraint of being orthogonal to the previous solutions.
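For concreteness, the main solution of (3.1) can be computed by solving a generalized eigenvalue problem. The following is a minimal NumPy/SciPy sketch of this two-view case; the function name, the small ridge term `eps` (added for numerical stability), and the eigensolver route are our illustrative choices rather than prescriptions from the formulation above.

```python
import numpy as np
from scipy.linalg import eigh

def cca_top_pair(X, Y, eps=1e-6):
    """Leading canonical pair for data X (d_x x n) and Y (d_y x n)."""
    X = X - X.mean(axis=1, keepdims=True)   # center each view
    Y = Y - Y.mean(axis=1, keepdims=True)
    Cxx = X @ X.T + eps * np.eye(X.shape[0])
    Cyy = Y @ Y.T + eps * np.eye(Y.shape[0])
    Cxy = X @ Y.T
    # Stationarity of (3.1) gives Cxy Cyy^{-1} Cyx w_x = rho^2 Cxx w_x.
    M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
    vals, vecs = eigh(M, Cxx)               # generalized symmetric eigenproblem
    wx = vecs[:, -1]                        # eigenvector of the largest eigenvalue
    wy = np.linalg.solve(Cyy, Cxy.T @ wx)
    wy /= np.sqrt(wy @ Cyy @ wy)            # enforce w_y' C_yy w_y = 1
    rho = np.sqrt(max(vals[-1], 0.0))
    return wx, wy, rho
```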
CCA-MAXVAR Kettenring (1971) generalizes CCA to $m \geq 2$ views. Suppose the data matrix for the $p$'th view is $X_p \in \mathbb{R}^{d_p \times n}$; then the optimization problem of CCA-MAXVAR for finding the canonical vectors $\{w_p\}_{p=1}^{m}$ is
$$\min_{g, \{w_p, \alpha_p\}} \sum_{p=1}^{m} \left\| g - \alpha_p X_p^\top w_p \right\|^2, \qquad (3.2)$$
where $z_p = X_p^\top w_p$ is the vector of canonical variables of the $p$'th view, $g$ is the best possible one-dimensional PCA representation, and $\alpha = [\alpha_1, \ldots, \alpha_m]^\top$ is the vector of combination weights. To avoid a trivial solution, an additional constraint such as $g^\top g = 1$ is enforced. The solutions of (3.2) can be obtained using an SVD. To develop an efficient and adaptive algorithm, Via et al. Vía et al (2007) reformulated (3.2) as
$$\min_{\{w_p\}} \sum_{p=1}^{m} \sum_{q \neq p} \left\| X_p^\top w_p - X_q^\top w_q \right\|^2. \qquad (3.3)$$
The orthogonality constraint is imposed on the different solutions, which can be obtained by using an iterative algorithm based on LS regression Vía et al (2007). Here, $z_p^{(k)} = X_p^\top w_p^{(k)}$ is the vector of canonical variables projected using the $k$'th canonical vector in the $p$'th view.
4 Tensor Canonical Correlation Analysis (TCCA)
In contrast to CCA-MAXVAR Kettenring (1971) and CCA-LS Vía et al (2007), where only the pairwise correlations are considered, we propose tensor CCA (TCCA) for multi-view dimension reduction by exploiting the high-order tensor correlation between all views. The diagram of the multi-view dimension reduction method using the proposed TCCA is shown in Fig. 2. Different kinds of features, such as the LAB color histogram (LAB), wavelet texture (WT), and local SIFT features (SIFT), are first extracted to represent the instances in different views. This leads to multiple feature matrices $\{X_p\}_{p=1}^{m}$. Here, $m$ is set at $3$ for intuitive illustration, without loss of generality. The different sets of features are then used to calculate the data covariance tensor $\mathcal{C}$, which is subsequently decomposed as a weighted sum of rank-1 tensors, i.e., $\mathcal{C} \approx \sum_{k=1}^{r} \lambda_k\, u_1^{(k)} \circ u_2^{(k)} \circ \cdots \circ u_m^{(k)}$, where $r$ is the reduced dimension and $\circ$ is the tensor (outer) product. The vectors $u_p^{(1)}, \ldots, u_p^{(r)}$ are stacked as a transformation matrix $U_p$, which is used to map the original high-dimensional features into the low-dimensional common subspace. The projected features are concatenated as the final representation of the instances. The details of this technique are given below, but first we briefly introduce several useful notations and concepts of multilinear algebra.
4.1 Notations
Let $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_m}$ be an order-$m$ tensor, and let $B \in \mathbb{R}^{J \times I_k}$ be a matrix. The mode-$k$ product of $\mathcal{A}$ and $B$ is then denoted as $\mathcal{A} \times_k B$, which is an $I_1 \times \cdots \times I_{k-1} \times J \times I_{k+1} \times \cdots \times I_m$ tensor with the elements
$$\left( \mathcal{A} \times_k B \right)_{i_1 \cdots i_{k-1}\, j\, i_{k+1} \cdots i_m} = \sum_{i_k = 1}^{I_k} a_{i_1 i_2 \cdots i_m}\, b_{j i_k}. \qquad (4.1)$$
The product of $\mathcal{A}$ and a sequence of matrices $\{B_k \in \mathbb{R}^{J_k \times I_k}\}_{k=1}^{m}$ is a $J_1 \times J_2 \times \cdots \times J_m$ tensor denoted by
$$\mathcal{A} \times_1 B_1 \times_2 B_2 \cdots \times_m B_m. \qquad (4.2)$$
The mode-$k$ matricization of $\mathcal{A}$ is denoted as an $I_k \times (I_1 \cdots I_{k-1} I_{k+1} \cdots I_m)$ matrix $A_{(k)}$, which is obtained by mapping the fibers associated with the $k$'th dimension of $\mathcal{A}$ to the rows of $A_{(k)}$, and aligning the corresponding fibers of all the other dimensions as the columns. Here, the columns can be ordered in any way. The mode-$k$ multiplication can be manipulated as matrix multiplication by storing the tensors in matricized form, i.e., $(\mathcal{A} \times_k B)_{(k)} = B A_{(k)}$. Specifically, the series of mode products in (4.2) can be expressed through a series of Kronecker products and is given by
$$\left( \mathcal{A} \times_1 B_1 \times_2 B_2 \cdots \times_m B_m \right)_{(k)} = B_k A_{(k)} \left( B_{k+1} \otimes \cdots \otimes B_m \otimes B_1 \otimes \cdots \otimes B_{k-1} \right)^\top, \qquad (4.3)$$
where a forward cyclic ordering is used for the indices of the tensor dimensions that map to the columns of the matricization. Finally, the Frobenius norm of the tensor $\mathcal{A}$ is given by
$$\left\| \mathcal{A} \right\|_F = \sqrt{\sum_{i_1 = 1}^{I_1} \cdots \sum_{i_m = 1}^{I_m} a_{i_1 i_2 \cdots i_m}^2}. \qquad (4.4)$$
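To make the notation concrete, the NumPy sketch below implements the mode-$k$ product and one valid mode-$k$ matricization (row-major column ordering; as stated above, any fixed ordering works) and verifies the identity $(\mathcal{A} \times_k B)_{(k)} = B A_{(k)}$. The helper names are ours.

```python
import numpy as np

def unfold(A, k):
    """Mode-k matricization: the mode-k fibers become rows (one fixed column order)."""
    return np.moveaxis(A, k, 0).reshape(A.shape[k], -1)

def mode_product(A, B, k):
    """Mode-k product A x_k B: contract mode k of A with the columns of B."""
    out = np.tensordot(B, A, axes=(1, k))   # contracted result has the new mode first
    return np.moveaxis(out, 0, k)

# Sanity check of (A x_k B)_(k) = B A_(k) on a random third-order tensor.
A = np.random.randn(3, 4, 5)
B = np.random.randn(6, 4)
lhs = unfold(mode_product(A, B, 1), 1)
rhs = B @ unfold(A, 1)
assert np.allclose(lhs, rhs)
```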
4.2 Problem Formulation
Given $m$ views of $n$ instances $\{X_p = [x_1^{(p)}, x_2^{(p)}, \ldots, x_n^{(p)}] \in \mathbb{R}^{d_p \times n}\}_{p=1}^{m}$, each view is assumed to have been centered (i.e., to have zero mean). The variance matrices are then $C_{pp} = X_p X_p^\top$, $p = 1, \ldots, m$,
and the covariance tensor among all views is calculated as $\mathcal{C} = \sum_{i=1}^{n} x_i^{(1)} \circ x_i^{(2)} \circ \cdots \circ x_i^{(m)},$
where $\mathcal{C}$ is a tensor of dimension $d_1 \times d_2 \times \cdots \times d_m$. Following the objective of the traditional two-view CCA Hardoon et al (2004), the proposed tensor CCA seeks to maximize the correlation between the canonical variables $z_p = X_p^\top h_p$, $p = 1, \ldots, m$, where the vectors $h_p \in \mathbb{R}^{d_p}$ are usually called the canonical vectors. Therefore, the optimization problem is
$$\max_{h_1, \ldots, h_m}\ \rho = \mathbf{1}^\top \left( z_1 \odot z_2 \odot \cdots \odot z_m \right), \quad \text{s.t. } z_p^\top z_p = 1,\ p = 1, \ldots, m. \qquad (4.5)$$
Here $\rho$ is the high-order canonical correlation, $\odot$ is the element-wise (Hadamard) product, and $\mathbf{1}$ is an all-ones vector of length $n$. We can prove that this objective is equivalent to $\mathcal{C} \times_1 h_1^\top \times_2 h_2^\top \cdots \times_m h_m^\top$, where $\times_k$ is the mode-$k$ tensor-matrix product.
Theorem 1.
The high-order canonical correlation $\rho$ is given by
$$\rho = \mathcal{C} \times_1 h_1^\top \times_2 h_2^\top \cdots \times_m h_m^\top. \qquad (4.6)$$
The proof is presented in the Appendix. By further considering that $z_p^\top z_p = h_p^\top C_{pp} h_p$, problem (4.5) becomes
$$\max_{h_1, \ldots, h_m}\ \mathcal{C} \times_1 h_1^\top \times_2 h_2^\top \cdots \times_m h_m^\top, \quad \text{s.t. } h_p^\top C_{pp} h_p = 1,\ p = 1, \ldots, m. \qquad (4.7)$$
We further add a regularization term to the constraints to control the model complexity, and thus the constraints of problem (4.7) become
$$h_p^\top \left( C_{pp} + \varepsilon I \right) h_p = 1,\ p = 1, \ldots, m, \qquad (4.8)$$
where $I$ is an identity matrix and $\varepsilon \geq 0$ is a non-negative trade-off parameter. Let $\tilde{C}_{pp} = C_{pp} + \varepsilon I$ and $u_p = \tilde{C}_{pp}^{1/2} h_p$ for each $p$; we can then reformulate (4.7) as
$$\max_{u_1, \ldots, u_m}\ \mathcal{M} \times_1 u_1^\top \times_2 u_2^\top \cdots \times_m u_m^\top, \quad \text{s.t. } u_p^\top u_p = 1,\ p = 1, \ldots, m, \qquad (4.9)$$
where $\mathcal{M} = \mathcal{C} \times_1 \tilde{C}_{11}^{-1/2} \times_2 \tilde{C}_{22}^{-1/2} \cdots \times_m \tilde{C}_{mm}^{-1/2}$. The equivalence of problems (4.7) and (4.9) is ensured by the following theorem.
Theorem 2.
Problems (4.7) and (4.9) are equivalent.
Proof.
It is straightforward to see that the constraints of problems (4.7) and (4.9) are equivalent; we now prove that the objectives of the two problems are the same as follows:
$$\mathcal{M} \times_1 u_1^\top \cdots \times_m u_m^\top = \mathcal{C} \times_1 \left( u_1^\top \tilde{C}_{11}^{-1/2} \right) \cdots \times_m \left( u_m^\top \tilde{C}_{mm}^{-1/2} \right) = \mathcal{C} \times_1 h_1^\top \cdots \times_m h_m^\top,$$
where the matricization property of the tensor-matrix product presented in (4.3) and some basic properties of the Kronecker product are applied. ∎
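Before turning to the solutions, the statement of Theorem 1 is easy to check numerically. The snippet below builds the covariance tensor of three small random views and compares the two expressions for $\rho$; the sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dims = 50, (4, 5, 6)                      # n instances, three views (m = 3)
Xs = [rng.standard_normal((d, n)) for d in dims]
hs = [rng.standard_normal(d) for d in dims]

# rho via the element-wise product of the canonical variables, as in (4.5)
zs = [X.T @ h for X, h in zip(Xs, hs)]
rho1 = np.sum(zs[0] * zs[1] * zs[2])

# rho via the covariance tensor and mode products, as in (4.6)
C = np.einsum('in,jn,kn->ijk', *Xs)          # covariance tensor of Section 4.2
rho2 = np.einsum('ijk,i,j,k->', C, *hs)      # C x_1 h1' x_2 h2' x_3 h3'
assert np.isclose(rho1, rho2)
```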
4.3 Solutions
It has been shown in De Lathauwer et al (2000b) that problem (4.9) is equivalent to finding the best rank-1 approximation of the tensor $\mathcal{M}$, i.e., if we define $\widehat{\mathcal{M}} = \lambda\, u_1 \circ u_2 \circ \cdots \circ u_m$, then the optimization problem becomes
$$\min_{\lambda,\, u_1, \ldots, u_m} \left\| \mathcal{M} - \widehat{\mathcal{M}} \right\|_F^2, \quad \text{s.t. } u_p^\top u_p = 1,\ p = 1, \ldots, m. \qquad (4.10)$$
The solution can be obtained using the alternating least squares (ALS) algorithm Kroonenberg and De Leeuw (1980); Comon et al (2009). Some other algorithms, such as the higher-order power method (HOPM) De Lathauwer et al (2000b) and the tensor power method Allen (2012), can also be applied here for optimization, but our empirical findings indicate that the ALS algorithm performs best in our experiments.
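For the rank-1 problem (4.10), the ALS update takes a particularly simple form: with all factors but one fixed, the optimal remaining factor is a normalized tensor-times-vectors product, so in this special case ALS coincides with the higher-order power method mentioned above. A minimal sketch for a three-mode tensor follows; the function name and the convergence test on $\lambda$ are our own choices.

```python
import numpy as np

def best_rank1(M, iters=100, tol=1e-10, seed=0):
    """Best rank-1 approximation lam * u1 o u2 o u3 of a 3-mode tensor M."""
    rng = np.random.default_rng(seed)
    u = [rng.standard_normal(s) for s in M.shape]
    u = [v / np.linalg.norm(v) for v in u]
    lam = 0.0
    for _ in range(iters):
        lam_old = lam
        # Alternately update each factor with the other two held fixed.
        u[0] = np.einsum('ijk,j,k->i', M, u[1], u[2])
        u[0] /= np.linalg.norm(u[0])
        u[1] = np.einsum('ijk,i,k->j', M, u[0], u[2])
        u[1] /= np.linalg.norm(u[1])
        u[2] = np.einsum('ijk,i,j->k', M, u[0], u[1])
        lam = np.linalg.norm(u[2])           # the weight of the rank-1 term
        u[2] /= lam
        if abs(lam - lam_old) < tol:
            break
    return u, lam
```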
As in the two-view CCA, we perform a recursive maximization of the correlation to obtain $r$ groups of canonical vectors. However, we cannot expect the different linear combinations to be uncorrelated with each other, where $r$ is bounded by the rank of $\mathcal{M}$ (determining the rank is itself still an open problem for high-order tensors De Lathauwer et al (2000a)). That is, orthogonality constraints cannot be imposed on $u_p^{(1)}, \ldots, u_p^{(r)}$, since a sum-of-rank-1 decomposition and an orthogonal decomposition of a high-order tensor cannot in general be achieved simultaneously De Lathauwer et al (2000a).
Based on the solutions $\{u_p^{(k)}\}_{k=1}^{r}$, we obtain the canonical variables $z_p^{(k)} = X_p^\top \tilde{C}_{pp}^{-1/2} u_p^{(k)}$. Let $U_p = [u_p^{(1)}, \ldots, u_p^{(r)}]$ be the matrix with the $u_p^{(k)}$ as column vectors; we obtain the projected data for the $p$'th view:
$$Z_p = X_p^\top \tilde{C}_{pp}^{-1/2} U_p \in \mathbb{R}^{n \times r}. \qquad (4.11)$$
Following Foster et al (2008), where it is suggested that the dimension be reduced to the number of retained canonical pairs in the standard CCA, we concatenate the different $Z_p$ as the final representation for the subsequent learning, such as classification Farquhar et al (2005); Fisch et al (2014), clustering Yang et al (2014); Wu et al (2015), regression Kakade and Foster (2007), search ranking Xu et al (2015); Zhu et al (2015), collaborative filtering Liu et al (2014), and so on.
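Putting the pieces together, the sketch below assembles the linear TCCA pipeline of Sections 4.2-4.3 for three views: regularized whitening by $\tilde{C}_{pp}^{-1/2}$, forming $\mathcal{M}$, extracting $r$ factor groups with a rank-$r$ CP-ALS that updates all factors jointly (rather than greedily, as discussed above), and projecting as in (4.11). This is a dense, unoptimized illustration under our own naming; `eps` plays the role of the regularization parameter in (4.8), and the factor columns are left unnormalized for simplicity.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return (vecs / np.sqrt(vals)) @ vecs.T

def cp_als(M, r, iters=200, seed=0):
    """Rank-r CP factors of a 3-mode tensor via alternating least squares."""
    rng = np.random.default_rng(seed)
    U = [rng.standard_normal((s, r)) for s in M.shape]
    for _ in range(iters):
        for k in range(3):
            a, b = [U[i] for i in range(3) if i != k]
            kr = np.einsum('ir,jr->ijr', a, b).reshape(-1, r)   # Khatri-Rao product
            Mk = np.moveaxis(M, k, 0).reshape(M.shape[k], -1)   # mode-k unfolding
            U[k] = Mk @ kr @ np.linalg.pinv(kr.T @ kr)          # LS update of factor k
    return U

def tcca(Xs, r, eps=1e-4):
    """Minimal TCCA sketch for three centered views Xs[p] of shape (d_p, n)."""
    W = [inv_sqrt(X @ X.T + eps * np.eye(X.shape[0])) for X in Xs]  # C_pp^{-1/2}
    C = np.einsum('in,jn,kn->ijk', *Xs)              # covariance tensor
    M = np.einsum('ijk,ai,bj,ck->abc', C, *W)        # whitened tensor of (4.9)
    U = cp_als(M, r)
    # Project each view as in (4.11) and concatenate the results.
    Zs = [X.T @ Wp @ Up for X, Wp, Up in zip(Xs, W, U)]
    return np.hstack(Zs)                             # n x (3 r) final representation
```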
4.4 Nonlinear Extension
The projections are linear in TCCA and thus may not be appropriate for instances that lie in a highly nonlinear feature space. To this end, we develop kernel tensor CCA (KTCCA), which extends the proposed TCCA to the nonlinear case. KTCCA aims to find nonlinear projections by first mapping the data into a higher-dimensional space induced by the feature mapping $\phi_p$: $X_p \mapsto \Phi_p = [\phi_p(x_1^{(p)}), \ldots, \phi_p(x_n^{(p)})]$,
where the mapped dimension may be infinite. Then the variance matrices are $C_{pp} = \Phi_p \Phi_p^\top$,
the covariance tensor is $\mathcal{C} = \sum_{i=1}^{n} \phi_1(x_i^{(1)}) \circ \phi_2(x_i^{(2)}) \circ \cdots \circ \phi_m(x_i^{(m)}),$
and the canonical variables are $z_p = \Phi_p^\top w_p$. It follows from the Representer Theorem Scholkopf and Smola (2002) that $w_p$ can be rewritten as a linear combination of the mapped instances, i.e.,
$$w_p = \Phi_p \alpha_p, \qquad (4.12)$$
where $\alpha_p \in \mathbb{R}^n$ is a vector of the combination coefficients. Problem (4.7) then becomes
$$\max_{\alpha_1, \ldots, \alpha_m}\ \mathcal{K} \times_1 \alpha_1^\top \times_2 \alpha_2^\top \cdots \times_m \alpha_m^\top, \quad \text{s.t. } \alpha_p^\top K_p K_p \alpha_p = 1,\ p = 1, \ldots, m, \qquad (4.13)$$
where $K_p = \Phi_p^\top \Phi_p$ is the kernel matrix of the $p$'th view. The derivation is similar to that of Theorem 2. Here, the tensor $\mathcal{K}$ can be calculated according to the following theorem.
Theorem 3.
The following equality holds:
$$\mathcal{C} \times_1 w_1^\top \times_2 w_2^\top \cdots \times_m w_m^\top = \mathcal{K} \times_1 \alpha_1^\top \times_2 \alpha_2^\top \cdots \times_m \alpha_m^\top, \quad \text{with } \mathcal{K} = \sum_{i=1}^{n} k_i^{(1)} \circ k_i^{(2)} \circ \cdots \circ k_i^{(m)},$$
in which $k_i^{(p)} = \Phi_p^\top \phi_p(x_i^{(p)})$, i.e., the $i$'th column of the kernel matrix $K_p$, $p = 1, \ldots, m$.
We give the proof in the Appendix. To avoid trivial learning, we follow Hardoon et al (2004) and introduce a partial least squares (PLS) term to penalize the norms of the weight vectors $w_p$. That is, the constraints of problem (4.13) become
$$\alpha_p^\top \left( K_p K_p + \varepsilon K_p \right) \alpha_p = 1,\ p = 1, \ldots, m. \qquad (4.14)$$
Because the matrix $R_p = K_p K_p + \varepsilon K_p$ is positive definite, it has a unique Cholesky decomposition, which we denote as $R_p = L_p L_p^\top$. Let $v_p = L_p^\top \alpha_p$ and $\mathcal{M} = \mathcal{K} \times_1 L_1^{-1} \times_2 L_2^{-1} \cdots \times_m L_m^{-1}$; we can reformulate (4.13) as
$$\max_{v_1, \ldots, v_m}\ \mathcal{M} \times_1 v_1^\top \times_2 v_2^\top \cdots \times_m v_m^\top, \quad \text{s.t. } v_p^\top v_p = 1,\ p = 1, \ldots, m. \qquad (4.15)$$
Similar to TCCA, this problem is equivalent to finding the best rank-1 approximation of $\mathcal{M}$, and the solution can be found using the ALS algorithm. By recursively maximizing the correlation, we obtain $\{v_p^{(k)}\}_{k=1}^{r}$. Let $V_p = [v_p^{(1)}, \ldots, v_p^{(r)}]$; the canonical variables and the projected data for the $p$'th view are then
$$Z_p = K_p L_p^{-\top} V_p \in \mathbb{R}^{n \times r}. \qquad (4.16)$$
The concatenated $[Z_1, Z_2, \ldots, Z_m]$ is the final representation of the instances.
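For completeness, a sketch of the kernel variant under the same caveats: it assumes precomputed (and ideally centered) Gram matrices, reuses `cp_als` from the TCCA sketch above, and adds a small jitter so that the Cholesky factorization of (4.14) stays well posed. All names are ours.

```python
import numpy as np

def ktcca(Ks, r, eps=1e-3):
    """Minimal KTCCA sketch for three n x n kernel matrices Ks[p].

    Assumes cp_als from the TCCA sketch above is in scope.
    """
    n = Ks[0].shape[0]
    # Tensor over the kernel columns (Theorem 3).
    C = np.einsum('in,jn,kn->ijk', *Ks)
    # Regularized constraint matrices R_p = K_p K_p + eps K_p = L_p L_p'.
    Ls = [np.linalg.cholesky(K @ K + eps * K + 1e-10 * np.eye(n)) for K in Ks]
    Lt = [np.linalg.inv(L).T for L in Ls]           # L_p^{-T}, maps v_p -> alpha_p
    M = np.einsum('ijk,ia,jb,kc->abc', C, *Lt)      # whitened tensor of (4.15)
    V = cp_als(M, r)
    alphas = [T @ Vp for T, Vp in zip(Lt, V)]       # alpha_p = L_p^{-T} v_p
    # Canonical variables / projected data (4.16) for each view, concatenated.
    Zs = [K @ a for K, a in zip(Ks, alphas)]
    return np.hstack(Zs)
```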
4.5 Complexity Analysis
The time and space complexities of the proposed TCCA model are both closely related to the size of the tensor $\mathcal{M}$. Straightforwardly, the space complexity is $O(\bar{d}^{\,m})$, where $\bar{d}$ is the average feature dimension over the $m$ views. Because the tensor can be calculated offline, the time complexity is dominated by the rank-$r$ decomposition using the ALS algorithm. Considering that it is common that $m = 3$, we can estimate the time complexity of ALS to be $O(K r \bar{d}^{\,3})$ according to Comon et al (2009), where the time cost of the ALS algorithm for three-mode tensors is presented. Here, $K$ is the number of iterations in ALS.
According to the above analysis, we can see that the complexity of TCCA is independent of the number of instances, and thus our method scales to problems with a very large sample size. Similarly, the complexities of KTCCA are determined by the tensor $\mathcal{M}$ in (4.15), the size of which is $n \times n \times \cdots \times n$ ($m$ modes). The space and time complexities are $O(n^m)$ and $O(K r n^3)$ (for $m = 3$), respectively. This means that KTCCA is capable of scaling to problems that have very high feature dimensions but a small number of instances.
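These complexities are easy to appreciate with a back-of-the-envelope computation. The snippet below reports the memory footprint of a dense order-3 covariance tensor in double precision; the dimensions are illustrative.

```python
def tensor_gib(*dims):
    """Memory of a dense float64 tensor with the given mode sizes, in GiB."""
    size_bytes = 8                       # bytes per float64 entry
    for d in dims:
        size_bytes *= d
    return size_bytes / 2**30

print(tensor_gib(300, 300, 300))     # three views of dimension 300: ~0.2 GiB
print(tensor_gib(2000, 2000, 2000))  # dimension 2000: ~60 GiB, already impractical
```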
5 Experiments
In this section, we empirically validate the effectiveness of the proposed TCCA on a biometric structure prediction and an advertisement classification problem following Foster et al (2008), as well as on a challenging web image annotation task Chua et al (2009). In all of the following experiments, five random choices of the labeled instances are used. Twenty percent of the test data (or unlabeled data in the transductive setting) are used for validation, which means that the parameters (if not specified) corresponding to the best performance on the validation set are used for testing. The evaluation criterion is the classification accuracy.
5.1 Evaluation of the Linear Formulation
In the first two sets of experiments (biometric structure prediction and advertisement classification), we use regularized least squares (RLS) as the base learner, following Foster et al (2008). Given $l$ labeled instances $\{(x_i, y_i)\}_{i=1}^{l}$, the optimization problem for RLS is given by $\min_w \frac{1}{l} \sum_{i=1}^{l} (w^\top x_i - y_i)^2 + \gamma \|w\|^2$, where the positive trade-off parameter $\gamma$ is set according to Foster et al (2008). A constant feature of $1$ is appended to each instance to include a bias term in $w$. (A minimal sketch of this base learner is given after the method list below.) In web image annotation, the nearest-neighbor (NN) classifier is utilized, where the number of neighbors is chosen from a candidate set using the validation data. Specifically, we compare the following methods:

BSF: using the single-view feature that achieves the best performance in RLS/NN-based classification.

CAT: concatenating the normalized features of all the views into a long vector, and then performing RLS/NN-based classification.

CCA Foster et al (2008): using the CCA formulation presented in Foster et al (2008) to find a common representation of two different views. In this formulation, a regularization term is added to control the model complexity; we set the regularization parameter according to Foster et al (2008) in biometric structure prediction and advertisement classification, and tune it on the validation set in web image annotation. The implementation details can be found in Foster et al (2008). For $m$ views, there are $\binom{m}{2}$ subsets of two views. The subset that achieves the best performance is termed CCA (BST). To combine the results of all subsets, we average their predicted scores in RLS-based classification and adopt the majority voting strategy in NN. This combination approach is termed CCA (AVG).

CCA-LS Vía et al (2007): a generalization of CCA to multiple views based on least-squares (LS) regression.

DSE Long et al (2008): a general and popular unsupervised multi-view dimension reduction method based on spectral embedding.

SSMVD Han et al (2012): a representative sparse unsupervised multi-view dimension reduction method in which structured sparsity is enforced among the different views.

TCCA: the proposed tensor CCA. The regularization parameter is optimized in the same way as in CCA.
In the first step of DSE and SSMVD, PCA is taken as the dimension reduction method for each view, and the resulting dimension (of each view) is set empirically.
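As referenced above the method list, the RLS base learner admits a closed-form solution. A minimal sketch, assuming $\pm 1$ labels; the appended constant feature supplies the bias (note that this simple version also regularizes the bias weight):

```python
import numpy as np

def rls_fit(X, y, gamma):
    """Closed-form RLS: X is l x d (rows are instances), y holds +-1 labels."""
    l, d = X.shape
    Xb = np.hstack([X, np.ones((l, 1))])      # constant feature for the bias term
    A = Xb.T @ Xb / l + gamma * np.eye(d + 1)
    return np.linalg.solve(A, Xb.T @ y / l)

def rls_predict(X, w):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.sign(Xb @ w)
```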
5.1.1 Biometric Structure Prediction
The dataset used in this set of experiments is SecStr (http://www.kyb.tuebingen.mpg.de/sslbook), a benchmark dataset for evaluating semi-supervised systems Chapelle et al (2006). The task associated with this dataset is "to predict the secondary structure of a given amino acid in a protein based on a sequence window centered around that amino acid" Chapelle et al (2006). The SecStr dataset is large-scale, and we randomly select a small number of instances as labeled samples. We also vary the number of unlabeled instances to observe the performance of the three CCA-based methods (CCA, CCA-LS, and TCCA) with respect to different amounts of unlabeled data. Following Foster et al (2008), all the provided data are used (as unlabeled instances) to find the common subspace in the CCA-based methods. The performance is evaluated in a transductive setting on the unlabeled samples (except those used for validation). Both DSE and SSMVD are naturally transductive, since they learn the low-dimensional representation of the given data directly, and no projection matrix is learned for new data. Therefore, these two methods cannot handle very large datasets, and their experiments are conducted only in the smaller unlabeled setting. In particular, DSE needs to solve an eigen-decomposition problem whose size grows with the number of samples; the time or memory cost is intolerable for the full set, and thus a subset of samples is utilized.
The features provided are categorical attributes, each generated at a position in the sequence window centered on the amino acid and represented by a sparse binary vector. We divide the features into three views:

View 1: attributes based on the left context of the window;

View 2: attributes based on the current position and the middle context;

View 3: attributes based on the right context of the window.

The dimension of each view is determined by the number of window positions it covers.
Table 1. Classification accuracy (%) on the SecStr dataset at the best dimension of each method (mean ± std over five runs).

Methods  |  #unlabeled (smaller setting)  |  #unlabeled (larger setting)
BSF  |  57.48 ± 1.90  |  –
CAT  |  57.77 ± 2.03  |  –
CCA (BST)  |  58.78 ± 2.97  |  59.97 ± 2.46
CCA (AVG)  |  60.75 ± 1.92  |  61.15 ± 1.73
CCA-LS  |  60.23 ± 1.70  |  61.32 ± 1.65
DSE  |  60.15 ± 0.81  |  No Attempt
SSMVD  |  61.08 ± 1.58  |  –
TCCA  |  62.36 ± 1.27  |  64.42 ± 1.70
The performance of the compared methods in relation to the dimension of the common subspace is shown in Fig. 3. Accuracy is averaged over the five random runs at each dimension. The performance of the different methods at their best dimensions is summarized in Table 1. From the results, we observe that: 1) the concatenation strategy (CAT) is comparable to, and slightly better than, the strategy of only using the best single-view features (BSF); 2) by learning the common subspace, all the compared multi-view dimension reduction methods are significantly better than the BSF and CAT baselines if the dimensionalities are properly set according to the accuracy on the validation set. In particular, CCA (BST) is superior to CAT, although only a subset of two views is utilized in the former; 3) the accuracy of all three CCA-based methods increases with an increasing number of unlabeled data. By combining the results of different subsets, CCA (AVG) is better than CCA (BST); 4) CCA-LS is superior to CCA (BST), but their performance at their best dimensions is comparable. In the smaller unlabeled setting, DSE and SSMVD are comparable to CCA (BST) and CCA-LS, respectively; 5) the performance of TCCA does not decrease significantly at high dimensions, as that of CCA-LS and CCA does. The main reason is that the ALS algorithm used in TCCA seeks to maximize the canonical correlations for all the factors simultaneously, rather than greedily finding orthogonal decomposition components Allen (2012). That is, the main variance tends to be explained uniformly by all factors, not only by the first several factors. This is also the reason why there are some oscillations in the TCCA curves; 6) the proposed TCCA significantly outperforms all the other methods at most dimensionalities. This demonstrates that the high-order correlation information between all features is well discovered, and that exploring this kind of information is much better than only exploring the correlation information between pairs of features, as in CCA-LS.
5.1.2 Advertisement Classification
This set of experiments is conducted on the Ads (internet advertisements) dataset (http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements) from the well-known UCI Machine Learning Repository. The task is to predict whether or not a given hyperlink (associated with an image) is an advertisement. We randomly choose 100 instances as labeled training samples (cf. Table 2), and all the remaining instances except those used for validation are utilized as unlabeled samples to find the common subspace. The performance is evaluated in a transductive setting on the unlabeled samples.
We use the features described in Kushmerick (1999), and omit the attributes that have missing values, such as the height (and width) of the image. The remaining attributes are represented by binary features which indicate the presence/absence of the corresponding terms. For CCA-LS and TCCA, we divide all these features into three views as follows:

View 1: features based on the terms in the image's URL, caption, and alt text;

View 2: features based on the terms in the URL of the current site;

View 3: features based on the terms in the anchor URL.
Table 2. Classification accuracy (%) on the Ads dataset at the best dimension of each method (mean ± std over five runs).

Methods  |  #labeled = 100
BSF  |  91.10 ± 1.65
CAT  |  91.08 ± 1.74
CCA (BST)  |  92.88 ± 1.11
CCA (AVG)  |  93.84 ± 0.85
CCA-LS  |  93.17 ± 1.10
DSE  |  93.01 ± 0.96
SSMVD  |  92.99 ± 0.91
TCCA  |  94.59 ± 0.27
Fig. 4 shows the classification accuracy of the compared methods in relation to the dimension of the common subspace, and the accuracies at the best dimensions are summarized in Table 2. In contrast to the observations in the last set of experiments, we can see that: 1) the accuracies of the concatenation strategy (CAT) and the best single view (BSF) are almost the same. The performance of CAT is relatively worse because the feature dimension in this set of experiments is high, and overfitting occurs given the limited number of labeled samples; 2) the performance of DSE and SSMVD first increases and then decreases sharply with an increasing dimension, while the CCA-based methods are much steadier; 3) the improvement of TCCA over the other CCA-based methods is not as great as in the last set of experiments. This is because more samples are needed to approximate the true underlying high-order correlation than the traditional pairwise correlation, since there are more variables to be estimated in the high-order statistics. Far fewer unlabeled instances are utilized in this set of experiments, and thus the high-order correlation information is not as well explored. CCA-LS is only comparable to CCA for the same reason.
5.1.3 Web Image Annotation
We further verify the effectiveness of the proposed algorithm on the natural image dataset NUS-WIDE Chua et al (2009). Our experiments are conducted on a subset consisting of the images belonging to ten mammal concepts: bear, cat, cow, dog, elk, fox, horse, tiger, whale, and zebra. We randomly split these images into a training set and a test set. Distinguishing between these concepts is very challenging, since many of them are similar to each other, e.g., cat and tiger. We randomly choose a number of labeled instances for each concept in the training set, and all the training instances are utilized as unlabeled samples to find the common subspace.
In this dataset, we choose three types of visual features, namely a bag of visual words based on SIFT Lowe (2004) descriptors, a color auto-correlogram, and a wavelet texture, to represent each image Chua et al (2009).
The annotation performance of the compared methods is shown in Fig. 5 and Table 3. It can be seen from the results that: 1) in general, performance improves with an increased number of labeled instances; 2) CCA-LS is comparable to CCA (BST) and CCA (AVG), while the best performance (peak of the curve) of CCA-LS is usually higher; 3) the performance of DSE is poor when the dimension is large, while SSMVD is much steadier and can sometimes be superior to CCA (AVG) and CCA-LS; 4) the accuracies of CCA (AVG) and CCA-LS first increase and then decrease with an increasing dimension, while the results of the proposed TCCA remain satisfactory even when the dimension is large; 5) the accuracy of TCCA is significantly better than that of all the other methods under most dimensionalities.
Table 3. Annotation accuracy (%) on the NUS-WIDE mammal subset for three increasing numbers of labeled instances, at the best dimension of each method (mean ± std over five runs).

Methods  |  #labeled (smallest)  |  #labeled (middle)  |  #labeled (largest)
BSF  |  17.42 ± 1.37  |  18.37 ± 1.21  |  19.96 ± 1.19
CAT  |  19.01 ± 1.86  |  19.07 ± 2.23  |  20.70 ± 1.44
CCA (BST)  |  20.77 ± 1.52  |  21.51 ± 2.38  |  22.61 ± 1.76
CCA (AVG)  |  21.21 ± 1.47  |  21.57 ± 2.04  |  22.61 ± 1.21
CCA-LS  |  20.90 ± 1.84  |  22.31 ± 2.53  |  23.50 ± 2.48
DSE  |  20.02 ± 1.23  |  21.59 ± 1.13  |  22.67 ± 0.74
SSMVD  |  21.34 ± 2.08  |  23.32 ± 1.08  |  23.79 ± 1.30
TCCA  |  22.40 ± 1.96  |  23.86 ± 1.41  |  24.11 ± 0.32
5.2 Evaluation of the Nonlinear Extension
We evaluate the nonlinear extension of the proposed TCCA on the web image annotation task. As discussed in Section 4.5, the nonlinear extension is able to handle the small sample size problem, where the feature dimensions can be very high and possibly infinite. We thus randomly choose a small set of samples from the animal subset. To perform the nonlinear classification, we construct a kernel for each kind of feature. The kernel is defined by $k(x, y) = \exp\left( -\mathrm{dist}(x, y) / \sigma \right),$
where $\mathrm{dist}(x, y)$ denotes the distance between $x$ and $y$, and $\sigma$ is a bandwidth parameter. We choose a distance suited to histograms for the visual word feature, and a different distance for the other features. (A sketch of this kernel construction is given after the method list below.) Specifically, we compare the following methods:

BSK: using the single-view kernel that achieves the best performance in NN-based classification.

AVG: averaging the normalized kernels of all the views, and then performing NN-based classification.

KCCA Hardoon et al (2004): using the KCCA formulation presented in Hardoon et al (2004) to find a common representation of two different views. The regularization parameter is optimized over a candidate set on the validation data. The setups of KCCA (BST) and KCCA (AVG) are similar to those of CCA (BST) and CCA (AVG) in the experiments on the linear version.

KTCCA: the nonlinear extension of the proposed tensor CCA. The regularization parameter is optimized in the same way as in KCCA.
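As a concrete instance of the kernel construction described before the method list, the sketch below assumes, purely for illustration, a $\chi^2$ distance for histogram features and sets the bandwidth $\sigma$ to the mean pairwise distance, a common heuristic; neither specific choice is confirmed by the text above.

```python
import numpy as np

def chi2_dist(X, Y):
    """Pairwise chi-square distances between rows of X (a x d) and Y (b x d).

    Assumes non-negative features, e.g., visual word histograms.
    """
    num = (X[:, None, :] - Y[None, :, :]) ** 2
    den = X[:, None, :] + Y[None, :, :] + 1e-12
    return 0.5 * (num / den).sum(axis=2)

def exp_kernel(D, sigma=None):
    """k(x, y) = exp(-dist(x, y) / sigma), with a mean-distance bandwidth default."""
    if sigma is None:
        sigma = D.mean()
    return np.exp(-D / sigma)

# Example: Gram matrix for a toy histogram feature (rows are instances).
H = np.abs(np.random.randn(10, 500))
H /= H.sum(axis=1, keepdims=True)
K = exp_kernel(chi2_dist(H, H))
```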
The experimental results are shown in Fig. 6 and Table 4. Compared with the results in Fig. 5, we can see that: 1) although a small number of unlabeled samples is utilized, the performance is better, since the separability is improved by the nonlinear projection, which is implemented via the kernel trick Shawe-Taylor and Cristianini (2004); 2) the simple AVG view combination strategy outperforms the best single-view kernel (BSK) significantly and is comparable to KCCA (BST); 3) KCCA (AVG) is slightly better than KCCA (BST), and the proposed KTCCA achieves the best performance at most dimensionalities.
Table 4. Annotation accuracy (%) of the kernel-based methods on the NUS-WIDE mammal subset for three increasing numbers of labeled instances (mean ± std over five runs).

Methods  |  #labeled (smallest)  |  #labeled (middle)  |  #labeled (largest)
BSK  |  17.96 ± 1.29  |  19.17 ± 2.01  |  20.04 ± 1.66
AVG  |  20.49 ± 1.65  |  21.73 ± 2.74  |  22.86 ± 1.87
KCCA (BST)  |  21.51 ± 2.44  |  22.58 ± 1.91  |  23.78 ± 1.57
KCCA (AVG)  |  21.85 ± 1.38  |  23.13 ± 1.77  |  24.28 ± 1.04
KTCCA  |  24.51 ± 0.78  |  25.18 ± 0.58  |  25.74 ± 0.90
5.3 Empirical analysis of the computational complexity
In this subsection, we empirically analyze the computational complexity of the different methods. The experiments are conducted in Matlab R2012b on a multi-core Intel Xeon computer with ECC DDR3 RAM. The results (time cost and memory cost) on the different datasets are shown in Figs. 7-10. From the results, we observe that: 1) the costs of the proposed TCCA are in general higher than those of the other CCA-based methods. This is because the decomposition is performed on a large covariance tensor instead of one or multiple covariance matrices. The tensor decomposition method we adopt in this paper is the ALS algorithm Kroonenberg and De Leeuw (1980); Comon et al (2009), which yields satisfactory accuracy but is not efficient; 2) TCCA is much more efficient than DSE or SSMVD when the feature dimensions are not very high and the number of instances is large (see Fig. 7 for example). This demonstrates the superiority of TCCA over the existing unsupervised multi-view dimension reduction methods on large sample size problems.
6 Conclusion
Standard CCA cannot deal with multi-view data of more than two views, and its typical multi-view extensions ignore the high-order statistics (correlation information) among all feature views. To resolve this problem, we have presented tensor CCA (TCCA), which discovers such statistics by analyzing the covariance tensor of all views.
From the experimental validation on a variety of application tasks, we conclude that: 1) finding a common subspace for all views using the CCA-based strategy is often better than simply concatenating all the features, especially when the feature dimension is high; 2) examining more statistics, which may require more unlabeled data, often leads to better performance; 3) by exploring the high-order statistics, the proposed TCCA outperforms the other methods, especially when the dimension of the common subspace is high.
Compared with CCA and its traditional multi-view extensions, the main disadvantage of the proposed TCCA is its high computational cost. Most of the cost lies in the tensor decomposition, which is not the focus of this paper. In the future, we will develop efficient tensor decomposition methods to speed up TCCA, or introduce parallel computing techniques (e.g., GPU acceleration of the ALS tensor decomposition).
Appendix A Proof of Theorem 1
Proof.
According to the definition of the element-wise product, we have
$$\rho = \mathbf{1}^\top \left( z_1 \odot z_2 \odot \cdots \odot z_m \right) = \sum_{i=1}^{n} z_{1,i}\, z_{2,i} \cdots z_{m,i}, \qquad (A.1)$$
where $z_{p,i}$ denotes the $i$'th entry of the vector $z_p$, $p = 1, \ldots, m$. Additionally,
$$z_{p,i} = h_p^\top x_i^{(p)} = \sum_{j_p = 1}^{d_p} h_{p, j_p}\, x_{i, j_p}^{(p)}. \qquad (A.2)$$
According to the definition of the mode product of a tensor and a matrix, we have
$$\mathcal{C} \times_1 h_1^\top \times_2 h_2^\top \cdots \times_m h_m^\top = \sum_{j_1 = 1}^{d_1} \cdots \sum_{j_m = 1}^{d_m} c_{j_1 j_2 \cdots j_m}\, h_{1, j_1} h_{2, j_2} \cdots h_{m, j_m}, \qquad (A.3)$$
where $c_{j_1 j_2 \cdots j_m} = \sum_{i=1}^{n} x_{i, j_1}^{(1)} x_{i, j_2}^{(2)} \cdots x_{i, j_m}^{(m)}$ by the definition of the covariance tensor. Therefore, substituting (A.2) into (A.1) and exchanging the order of summation,
$$\rho = \sum_{j_1 = 1}^{d_1} \cdots \sum_{j_m = 1}^{d_m} \left( \sum_{i=1}^{n} x_{i, j_1}^{(1)} \cdots x_{i, j_m}^{(m)} \right) h_{1, j_1} \cdots h_{m, j_m} = \mathcal{C} \times_1 h_1^\top \cdots \times_m h_m^\top. \qquad (A.4)$$
This completes the proof. ∎
Appendix B Proof of Theorem 3
Proof.
Let $\mathcal{C} = \sum_{i=1}^{n} \phi_1(x_i^{(1)}) \circ \phi_2(x_i^{(2)}) \circ \cdots \circ \phi_m(x_i^{(m)})$ and $w_p = \Phi_p \alpha_p$. According to the definition of the outer product, the mode products of $\mathcal{C}$ with the vectors $w_p^\top$ give
$$\mathcal{C} \times_1 w_1^\top \times_2 w_2^\top \cdots \times_m w_m^\top = \sum_{i=1}^{n} \prod_{p=1}^{m} w_p^\top \phi_p(x_i^{(p)}). \qquad (B.1)$$
Additionally, by the representation $w_p = \Phi_p \alpha_p$,
$$w_p^\top \phi_p(x_i^{(p)}) = \alpha_p^\top \Phi_p^\top \phi_p(x_i^{(p)}) = \alpha_p^\top k_i^{(p)}, \qquad (B.2)$$
where $k_i^{(p)}$ is the $i$'th column of the kernel matrix $K_p = \Phi_p^\top \Phi_p$. Substituting (B.2) into (B.1) and applying the definition of the tensor-matrix product, we have
$$\mathcal{C} \times_1 w_1^\top \cdots \times_m w_m^\top = \sum_{i=1}^{n} \prod_{p=1}^{m} \alpha_p^\top k_i^{(p)} = \mathcal{K} \times_1 \alpha_1^\top \times_2 \alpha_2^\top \cdots \times_m \alpha_m^\top, \qquad (B.3)$$
with $\mathcal{K} = \sum_{i=1}^{n} k_i^{(1)} \circ k_i^{(2)} \circ \cdots \circ k_i^{(m)}$, which completes the proof. ∎
References
 Allen (2012) Allen GI (2012) Sparse higher-order principal components analysis. In: International Conference on Artificial Intelligence and Statistics, pp 27–36
 Bach and Jordan (2005) Bach FR, Jordan MI (2005) A probabilistic interpretation of canonical correlation analysis. Tech. Rep. 688, Department of Statistics, University of California, Berkeley
 Belkin and Niyogi (2001) Belkin M, Niyogi P (2001) Laplacian eigenmaps and spectral techniques for embedding and clustering. In: Advances in Neural Information Processing Systems, pp 585–591
 Benabdeslem and Hindawi (2014) Benabdeslem K, Hindawi M (2014) Efficient semi-supervised feature selection: Constraint, relevance and redundancy. IEEE Transactions on Knowledge and Data Engineering 26(5):1131–1143
 Blaschko and Lampert (2008) Blaschko MB, Lampert CH (2008) Correlational spectral clustering. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 1–8
 Blum and Mitchell (1998) Blum A, Mitchell T (1998) Combining labeled and unlabeled data with co-training. In: Annual Conference on Computational Learning Theory, pp 92–100
 Chapelle et al (2006) Chapelle O, Schölkopf B, Zien A (2006) Semi-supervised learning. MIT Press, Cambridge, MA
 Chaudhuri et al (2009) Chaudhuri K, Kakade SM, Livescu K, Sridharan K (2009) Multi-view clustering via canonical correlation analysis. In: International Conference on Machine Learning, pp 129–136
 Chen et al (2012) Chen N, Zhu J, Sun F, Xing EP (2012) Largemargin predictive latent subspace learning for multiview data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(12):2365–2378
 Chua et al (2009) Chua TS, Tang J, Hong R, Li H, Luo Z, Zheng Y (2009) NUS-WIDE: a real-world web image database from National University of Singapore. In: International Conference on Image and Video Retrieval, pp 48:1–48:9
 Comon et al (2009) Comon P, Luciani X, De Almeida AL (2009) Tensor decompositions, alternating least squares and other tales. Journal of Chemometrics 23(7-8):393–405
 De Lathauwer et al (2000a) De Lathauwer L, De Moor B, Vandewalle J (2000a) A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications 21(4):1253–1278
 De Lathauwer et al (2000b) De Lathauwer L, De Moor B, Vandewalle J (2000b) On the best rank-1 and rank-(R1, R2, …, RN) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications 21(4):1324–1342
 Farquhar et al (2005) Farquhar JDR, Hardoon D, Meng H, Shawe-Taylor JS, Szedmak S (2005) Two view learning: SVM-2K, theory and practice. In: Advances in Neural Information Processing Systems, pp 355–362
 Fisch et al (2014) Fisch D, Kalkowski E, Sick B (2014) Knowledge fusion for probabilistic generative classifiers with data mining applications. IEEE Transactions on Knowledge and Data Engineering 26(3):652–666
 Foster et al (2008) Foster DP, Johnson R, Zhang T (2008) Multi-view dimensionality reduction via canonical correlation analysis. Tech. Rep. TR-2009-5, TTI-Chicago
 Guillaumin et al (2009) Guillaumin M, Mensink T, Verbeek J, Schmid C (2009) Tagprop: Discriminative metric learning in nearest neighbor models for image autoannotation. In: International Conference on Computer Vision, pp 309–316
 Han et al (2012) Han Y, Wu F, Tao D, Shao J, Zhuang Y, Jiang J (2012) Sparse unsupervised dimensionality reduction for multiple view data. IEEE Transactions on Circuits and Systems for Video Technology 22(10):1485–1496
 Hardoon et al (2004) Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: An overview with application to learning methods. Neural Computation 16(12):2639–2664
 Hou et al (2010) Hou C, Zhang C, Wu Y, Nie F (2010) Multiple view semi-supervised dimensionality reduction. Pattern Recognition 43(3):720–730
 Jenatton et al (2011) Jenatton R, Audibert JY, Bach F (2011) Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research 12:2777–2824
 Kakade and Foster (2007) Kakade SM, Foster DP (2007) Multi-view regression via canonical correlation analysis. In: Annual Conference on Computational Learning Theory, pp 82–96
 Kettenring (1971) Kettenring JR (1971) Canonical analysis of several sets of variables. Biometrika 58(3):433–451
 Kim and Cipolla (2009) Kim TK, Cipolla R (2009) Canonical correlation analysis of video volume tensors for action categorization and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(8):1415–1428
 Kroonenberg and De Leeuw (1980) Kroonenberg PM, De Leeuw J (1980) Principal component analysis of three-mode data by means of alternating least squares algorithms. Psychometrika 45(1):69–97
 Kumar et al (2011) Kumar A, Rai P, Daumé III H (2011) Co-regularized multi-view spectral clustering. In: Advances in Neural Information Processing Systems, pp 1413–1421
 Kushmerick (1999) Kushmerick N (1999) Learning to remove internet advertisements. In: Proceedings of the third annual conference on Autonomous Agents, pp 175–181
 Lanckriet et al (2004) Lanckriet G, Cristianini N, Bartlett P, Ghaoui L, Jordan M (2004) Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research 5:27–72
 Lee and Choi (2007) Lee SH, Choi S (2007) Two-dimensional canonical correlation analysis. IEEE Signal Processing Letters 14(10):735–738
 Liu et al (2014) Liu Q, Chen E, Xiong H, Ge Y, Li Z, Wu X (2014) A cocktail approach for travel package recommendation. IEEE Transactions on Knowledge and Data Engineering 26(2):278–293
 Long et al (2008) Long B, Philip SY, Zhang ZM (2008) A general model for multiple view unsupervised learning. In: SDM, pp 822–833
 Lowe (2004) Lowe DG (2004) Distinctive image features from scaleinvariant keypoints. International Journal of Computer Vision 60(2):91–110
 Lu (2013) Lu H (2013) Learning canonical correlations of paired tensor sets via tensor-to-vector projection. In: International Joint Conference on Artificial Intelligence, pp 1516–1522
 McFee and Lanckriet (2011) McFee B, Lanckriet G (2011) Learning multimodal similarity. Journal of Machine Learning Research 12:491–523
 Oliva and Torralba (2001) Oliva A, Torralba A (2001) Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision 42(3):145–175
 Scholkopf and Smola (2002) Scholkopf B, Smola A (2002) Learning with kernels: support vector machines, regularization, optimization, and beyond. The MIT Press
 Shawe-Taylor and Cristianini (2004) Shawe-Taylor J, Cristianini N (2004) Kernel methods for pattern analysis. Cambridge University Press
 Su et al (2009) Su H, Sun M, Fei-Fei L, Savarese S (2009) Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories. In: International Conference on Computer Vision, pp 213–220
 Vía et al (2007) Vía J, Santamaría I, Pérez J (2007) A learning algorithm for adaptive canonical correlation analysis of several data sets. Neural Networks 20(1):139–152
 Wang (2010) Wang H (2010) Local two-dimensional canonical correlation analysis. IEEE Signal Processing Letters 17(11):921–924
 White et al (2012) White M, Zhang X, Schuurmans D, Yu YL (2012) Convex multi-view subspace learning. In: Advances in Neural Information Processing Systems, pp 1682–1690
 Wu et al (2015) Wu J, Liu H, Xiong H, Cao J, Chen J (2015) K-means-based consensus clustering: A unified view. IEEE Transactions on Knowledge and Data Engineering 27(1):155–169
 Xia et al (2010) Xia T, Tao D, Mei T, Zhang Y (2010) Multiview spectral embedding. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 40(6):1438–1446
 Xu et al (2015) Xu B, Bu J, Chen C, Wang C, Cai D, He X (2015) EMR: A scalable graph-based ranking model for content-based image retrieval. IEEE Transactions on Knowledge and Data Engineering 27(1):102–114
 Yan et al (2012) Yan J, Zheng W, Zhou X, Zhao Z (2012) Sparse 2-D canonical correlation analysis via low rank matrix approximation for feature extraction. IEEE Signal Processing Letters 19(1):51–54
 Yang et al (2014) Yang S, Yi Z, Ye M, He X (2014) Convergence analysis of graph regularized nonnegative matrix factorization. IEEE Transactions on Knowledge and Data Engineering 26(9):2151–2165
 Zhu et al (2015) Zhu H, Xiong H, Ge Y, Chen E (2015) Discovery of ranking fraud for mobile apps. IEEE Transactions on Knowledge and Data Engineering 27(1):74–87