Robust Matrix Elastic Net based Canonical Correlation Analysis: An Effective Algorithm for Multi-View Unsupervised Learning

Peng-Bo Zhang and  Zhi-Xin Yang,  Peng-Bo Zhang is with the Department of Industrial Engineering and Logistics Management, School of Engineering, Hong Kong University of Science and Technology, Kowloon, Hong Kong 999077, China. E-mail: (Corresponding author) Zhi-Xin Yang is with the Department of Electromechanical Engineering, Faculty of Science and Technology, University of Macau, Macau 999078, China. E-mail:

This paper presents a robust matrix elastic net based canonical correlation analysis (RMEN-CCA) for multi-view unsupervised learning problems, which emphasizes the combination of CCA with the robust matrix elastic net (RMEN) used for coupled feature selection. The RMEN-CCA leverages the strength of the RMEN to distill naturally meaningful features without any prior assumption and to effectively measure correlations between different 'views'. We can further directly employ the kernel trick to extend the RMEN-CCA to the kernel scenario with theoretical guarantees, enabling highly complicated nonlinear feature learning. Rather than simply incorporating existing regularization terms into CCA, this paper provides a new learning paradigm for CCA and is the first to derive a coupled feature selection based CCA algorithm with guaranteed convergence. More significantly, the newly-derived RMEN-CCA bridges the gap between the measurement of relevance and coupled feature selection for CCA. Moreover, owing to its sophisticated model architecture, it is nontrivial to tackle the RMEN-CCA directly with previous optimization approaches. Therefore, this paper further offers a bridge between the new optimization problem and an existing efficient iterative approach. As a consequence, the RMEN-CCA can overcome the limitations of CCA and address large-scale and streaming data problems. Experimental results on four popular competing datasets illustrate that the RMEN-CCA performs more effectively and efficiently than state-of-the-art approaches.

RMEN-CCA, Robust Matrix Elastic Net, Canonical Correlation Analysis, Coupled Feature Selection, Non-convex Optimization, Multi-View Unsupervised Learning

I Introduction

With the rapid development of the Internet, the amount of multi-view data available is increasing dramatically in various domains. Since the multiple representations of multi-view data are captured from various sources or different feature spaces, the statistical properties of these representations are, in general, completely different. Therefore, learning and identifying a consensus pattern from these multifarious representations, in terms of both effectiveness and efficiency, is a persistent challenge. In order to address the above problem, researchers have widely investigated multi-view learning methods [1, 2, 3, 4]. These existing multi-view learning approaches can be classified into three categories [5]: co-training, multiple kernel learning and subspace learning. Our work focuses on the third one.

Canonical Correlation Analysis (CCA) is a classical and powerful unsupervised learning approach for the multi-view learning problem. Its variants, such as Deep CCA (DCCA) [6], randomized nonlinear CCA (FKCCA and NKCCA) [7], Sparse CCA [8, 9] and Scalable CCA [10], have been thoroughly investigated. The basic idea of CCA is to find coupled linear projection matrices and then model the potential connections between the two different 'views'. For this reason, CCA variants have been widely applied to multi-view learning problems.

The two major challenges of CCA are: revealing the correlations across different sources effectively [11, 7, 12] and addressing the non-convex optimization problem of CCA efficiently [10, 13, 14]. In this paper, we focus on the former. Over the past two decades, many studies have thoroughly investigated various kernel-based CCA approaches [7, 12]. Although the performance of these kernel-based CCA approaches has improved remarkably, they are not powerful enough to explore shared knowledge across different 'views'. In other words, these works ignore one critical property of cross-modal matching approaches, namely coupled feature selection. Additionally, it is quite difficult to select a 'suitable' Mercer kernel in kernel-based approaches, which is a key factor for their success. Generally speaking, the commonly-used Gaussian kernel does not lead to optimal performance. Therefore, these previous kernel-based CCA methods must consider the distribution of inputs and the corresponding application scenarios as prior knowledge.

In this paper, we present a robust matrix elastic net based canonical correlation analysis (RMEN-CCA) that requires no prior knowledge. To the best of our knowledge, the RMEN-CCA is the first to incorporate the RMEN into CCA, thus emphasizing the combination of CCA with the coupled feature selection technique. In the RMEN, the $\ell_{2,1}$ norm allows the RMEN-CCA to capture joint sparse structure and distill relevant attributes from the data-embedding space; simultaneously, the nuclear norm models the correlation between the projected samples via a low-rank solution. More significantly, this paper provides a novel paradigm for CCA, which leverages the strength of coupled feature selection. Furthermore, in order for the RMEN-CCA to handle highly sophisticated nonlinear relationships, the algorithm can be combined directly with the kernel trick; we refer to this kernel scenario as the KRMEN-CCA.

Moreover, we note that CCA is a typical non-convex optimization problem because of its constraints, and hence it is nontrivial to solve it directly by naive gradient descent. AppGrad [10] is an efficient iterative algorithm for CCA. The crucial idea of AppGrad is to guarantee that the iterates of CCA remain in a convex region at each iteration. AppGrad has two major advantages over traditional eigenvector computation. First, it can significantly decrease computational and storage complexity. Second, its online property makes it efficient when handling huge datasets, whereas eigenvector computation is prohibitive in this situation. Unfortunately, it is nontrivial to apply AppGrad directly to the RMEN-CCA because of the imposed RMEN. Therefore, based on the AppGrad baseline [10], we derive a novel accelerated iterative method with proven convergence to address this non-convex optimization problem.

Furthermore, because of its highly flexible model architecture, the RMEN-CCA can be used as an intermediate structure in convolutional neural networks (CNNs) [15] for fine-tuning. This algorithm can also be applied to transfer learning [16] in order to address the problem of insufficient labeled data. However, these are beyond the scope of this paper.

The contributions of this paper are summarized as follows.

  • We present a novel robust matrix elastic net based canonical correlation analysis (RMEN-CCA) with theoretical guarantees and empirical effectiveness. To the best of our knowledge, the RMEN-CCA is the first to incorporate coupled feature selection into CCA, which improves generalization performance by automatically distilling relevant and useful features without any prior knowledge.

  • In the RMEN, the $\ell_{2,1}$ norm enforces the removal of redundant and meaningless features, while the nuclear norm yields a low-rank solution that better encodes the correlation. As a result, the RMEN-CCA takes advantage of this sparse plus low-rank structure, which brings benefits in terms of both effectiveness and efficiency.

  • The RMEN-CCA can directly leverage the powerful kernel trick to yield the kernel-based algorithm called KRMEN-CCA, which can effectively construct nonlinear approximations of the manifolds.

  • It is nontrivial to address the proposed RMEN-CCA by existing approaches. Therefore, this paper bridges the gap between a novel non-convex optimization problem and a previous efficient iterative approach, which enables the RMEN-CCA to be applied to large-scale and streaming data tasks.

The remainder of the paper is organized as follows. Section II reviews related work on feature selection techniques. Section III introduces the RMEN-CCA and its kernel version, and then derives an accelerated iterative optimization algorithm for the RMEN-CCA. Section IV theoretically analyzes the convergence of the RMEN-CCA. Section V evaluates the RMEN-CCA on four popular datasets. Finally, concluding remarks are provided in Section VI.

II Related work on coupled feature selection

In recent years, feature selection techniques have played an important role in the machine learning community. These techniques aim to extract useful features and eliminate redundancies in order to construct a simple model architecture. To overcome the limitations of conventional feature selection approaches such as the Lasso and ridge regression, a combined feature selection approach, termed the matrix elastic net (MEN), has been successfully used in different learning algorithms [17, 18, 19].

Our motivation is inspired by the LCFS algorithm [20], which incorporates coupled feature selection into a linear regression method. However, in contrast to the LCFS, the RMEN-CCA confronts two greater challenges: learning in an unsupervised fashion and addressing a non-convex optimization problem. Due to the high cost of labeling data manually, few labeled data may be available, even in this big data era. For this reason, deriving efficient and effective approaches without supervised information is a promising research direction in the machine learning community [21, 22, 23]. Since there is no desired target, the learning process of the RMEN-CCA is more difficult than that of the supervised LCFS. Therefore, how to distill useful features and information plays an important role in the RMEN-CCA. On the other hand, in contrast to the LCFS, which has the architecture of a linear system, the formulation of the RMEN-CCA is more complicated because of its non-convex constraints. For this reason, the algorithm often fails to converge under naive gradient descent. Therefore, we must derive an effective optimization method to solve the novel RMEN-CCA.

The proposed RMEN-CCA uses the $\ell_{2,1}$ norm for joint feature selection in the RMEN instead of the Frobenius norm. Argyriou et al. [24] adopted the $\ell_{2,1}$ norm as the regularizer for multi-task feature learning. Gu et al. [25] derived an $\ell_{2,1}$-norm-based framework for joint feature selection and subspace learning. Du et al. [26] presented a robust k-means approach based on the $\ell_{2,1}$ norm. We note that the $\ell_{2,1}$ norm is well suited to our task, with two major advantages over the Frobenius norm [27]. First, the $\ell_{2,1}$ norm is much more robust to outliers. More importantly, the $\ell_{2,1}$ norm exploits joint sparse structure to choose relevant attributes across all samples, rather than relying on the importance of individual features. In other words, the $\ell_{2,1}$ norm not only captures locally useful information and potentially relevant features, but also takes into account the manifold structure of the feature space. Section V will illustrate that the RMEN-CCA not only outperforms CCA with the basic MEN in multi-task learning problems, but also applies better to single-task learning problems.
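The robustness claim can be seen numerically: a single outlier row inflates the squared Frobenius norm quadratically, whereas the $\ell_{2,1}$ norm only grows linearly. A small illustrative sketch on toy data (not taken from the paper's experiments):

```python
import numpy as np

# Clean data: four unit rows. The outlier copy scales the last row by 100.
Z = np.ones((4, 3))
Z_out = Z.copy()
Z_out[-1] *= 100.0

# l2,1 norm: sum of the row-wise Euclidean norms.
l21 = lambda M: np.sum(np.linalg.norm(M, axis=1))

# Relative growth caused by the single outlier row.
fro_growth = np.linalg.norm(Z_out) ** 2 / np.linalg.norm(Z) ** 2  # quadratic blow-up
l21_growth = l21(Z_out) / l21(Z)                                  # only linear growth
```

Here `fro_growth` is three orders of magnitude larger than `l21_growth`, which is why a regularizer built on the $\ell_{2,1}$ norm is far less dominated by outlying rows.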

In addition to the $\ell_{2,1}$ norm in the RMEN, the nuclear norm (or trace norm) is a popular low-rank learning approach with widespread applications in cross-modal matching tasks [28, 29, 30]. The nuclear norm yields a low-rank solution, thus significantly simplifying the model architecture. It is a common observation that only a few elements of the instances contribute to a task. Moreover, different from previous work on the nuclear norm [19], the nuclear norm in the RMEN is imposed over all projected instances rather than only over the adjustable parameters. As a consequence, the nuclear norm can enforce the relevance of projected samples with connections.

III The robust matrix elastic net based canonical correlation analysis

In this section, we present the formulation of the RMEN-CCA in Section III-A, and then extend the RMEN-CCA to the kernel version (KRMEN-CCA) in order to handle nonlinear input-output relationships in Section III-B. Finally, we derive a new accelerated iterative algorithm to solve the RMEN-CCA in Section III-C.

III-A The formulation of the RMEN-CCA

First of all, we briefly introduce the formulation of the classical CCA method and the robust matrix elastic net.

Given two views $X \in \mathbb{R}^{n \times d_x}$ and $Y \in \mathbb{R}^{n \times d_y}$, the linear algebraic formulation of the classical CCA method [31] is shown as follows.

$$\min_{\Phi_x, \Phi_y} \ \|X\Phi_x - Y\Phi_y\|_F^2 \quad \mathrm{s.t.} \quad \Phi_x^\top X^\top X \Phi_x = I_k, \ \Phi_y^\top Y^\top Y \Phi_y = I_k,$$

where $\Phi_x \in \mathbb{R}^{d_x \times k}$ and $\Phi_y \in \mathbb{R}^{d_y \times k}$ are the true canonical variables, $k$ is the dimension of the top canonical subspace, $I_k$ is an identity matrix, and $\|\cdot\|_F$ is the Frobenius norm.
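As a reference point, classical linear CCA admits a closed-form solution by whitening both views and taking an SVD of the whitened cross-covariance. The sketch below is the standard textbook construction (with an assumed small ridge `reg` for numerical stability), not the paper's iterative solver:

```python
import numpy as np

def linear_cca(X, Y, k, reg=1e-6):
    """Classical linear CCA via whitening + SVD.
    X: (n, dx), Y: (n, dy) paired data; k: size of the top canonical subspace."""
    n = X.shape[0]
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # Symmetric inverse square root via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    # Singular values of the whitened cross-covariance are the canonical correlations.
    T = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(T)
    Wx = inv_sqrt(Cxx) @ U[:, :k]
    Wy = inv_sqrt(Cyy) @ Vt[:k].T
    return Wx, Wy, s[:k]
```

This batch construction requires forming and factorizing full covariance matrices, which is exactly the cost that AppGrad-style iterative schemes (Section III-C) avoid.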

Given a matrix $Z \in \mathbb{R}^{m \times n}$, the $\ell_{2,1}$ norm is $\|Z\|_{2,1} = \sum_{i=1}^{m} \|z^i\|_2$ and the nuclear norm is $\|Z\|_* = \sum_{i} \sigma_i(Z)$, where $z^i$ is the $i$-th row of $Z$ and $\sigma_i(Z)$ denotes the $i$-th singular value of the matrix $Z$.
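Both norms can be evaluated directly from their definitions; a minimal numpy sketch:

```python
import numpy as np

def l21_norm(Z):
    # ||Z||_{2,1}: sum of the Euclidean norms of the rows of Z.
    return np.sum(np.linalg.norm(Z, axis=1))

def nuclear_norm(Z):
    # ||Z||_*: sum of the singular values of Z.
    return np.sum(np.linalg.svd(Z, compute_uv=False))
```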

Now we incorporate both the $\ell_{2,1}$ norm and the rank function into the classical CCA as follows.

$$\min_{\Phi_x, \Phi_y} \ \|X\Phi_x - Y\Phi_y\|_F^2 + \lambda_1 \left( \|\Phi_x\|_{2,1} + \|\Phi_y\|_{2,1} \right) + \lambda_2 \, \mathrm{rank}(W) \quad \mathrm{s.t.} \quad \Phi_x^\top X^\top X \Phi_x = I_k, \ \Phi_y^\top Y^\top Y \Phi_y = I_k,$$

where $W = [X\Phi_x, Y\Phi_y]$ is a concatenated matrix, and $\lambda_1$ and $\lambda_2$ are trading-off parameters. The former controls the $\ell_{2,1}$ norm for joint feature selection on the two feature spaces simultaneously. The rank of the concatenated matrix is handled by the latter; that is, the larger $\lambda_2$ is, the lower the rank is.

It is clear that the rank function is discontinuous, non-differentiable and non-convex. Hence, we replace the rank function with the nuclear norm, which has been proven to be the tightest convex relaxation of the rank function [32]. As a result, (III-A) is reformulated as

$$\min_{\Phi_x, \Phi_y} \ \|X\Phi_x - Y\Phi_y\|_F^2 + \lambda_1 \left( \|\Phi_x\|_{2,1} + \|\Phi_y\|_{2,1} \right) + \lambda_2 \|W\|_* \quad \mathrm{s.t.} \quad \Phi_x^\top X^\top X \Phi_x = I_k, \ \Phi_y^\top Y^\top Y \Phi_y = I_k,$$

where $\|\cdot\|_*$ is the nuclear norm. We herein define the combination of the $\ell_{2,1}$ norm with the nuclear norm as the robust matrix elastic net (RMEN), and the resulting problem (III-A) is defined as the RMEN-CCA.

Although AppGrad [10] is an efficient iterative approach for CCA, it is nontrivial to solve (III-A) directly with AppGrad. To tackle such a complicated non-convex problem, we derive, based on AppGrad, an accelerated iterative approach with proven convergence. The details are given in Section III-C.

III-B Kernel extension

In this section, we take advantage of the kernel trick to extend the proposed RMEN-CCA to a nonlinear version capable of handling highly complicated nonlinear relationships.

There exists a feature mapping $\phi$ such that, for any two points $x_i$ and $x_j$, we have $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$, where $K$ is a Mercer kernel and $\langle \cdot, \cdot \rangle$ is an inner product. We deal with the view $Y$ in the same fashion. Let $\phi(X)$ and $\phi(Y)$ be the new feature spaces. Following the Representer Theorem [33], the optimal solutions to (III-A) can be spanned by $\phi(X)$ and $\phi(Y)$. Therefore, we have $\Phi_x = \phi(X)^\top A_x$ and $\Phi_y = \phi(Y)^\top A_y$, where $A_x$ and $A_y$ are the true canonical variables for the KRMEN-CCA. Consequently, with the Gram matrices $K_x = \phi(X)\phi(X)^\top$ and $K_y = \phi(Y)\phi(Y)^\top$, the formula of the KRMEN-CCA is shown as

$$\min_{A_x, A_y} \ \|K_x A_x - K_y A_y\|_F^2 + \lambda_1 \left( \|A_x\|_{2,1} + \|A_y\|_{2,1} \right) + \lambda_2 \left\| [K_x A_x, K_y A_y] \right\|_* \quad \mathrm{s.t.} \quad A_x^\top K_x^\top K_x A_x = I_k, \ A_y^\top K_y^\top K_y A_y = I_k.$$
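The only kernel-specific ingredient needed in practice is the Gram matrix. A sketch of the commonly-used Gaussian kernel (the kernel also used in Section V; the vectorized squared-distance trick is a standard implementation detail, not from the paper):

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2).
    The Gaussian kernel is a Mercer kernel, so K(x, x') = <phi(x), phi(x')>
    for an implicit feature map phi that never has to be computed explicitly."""
    sq = (np.sum(X1 ** 2, axis=1)[:, None]
          + np.sum(X2 ** 2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clamp tiny negative round-off
```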
III-C An accelerated iterative algorithm for RMEN-CCA

It is nontrivial to tackle directly (III-A) by the AppGrad [10] because of the imposed RMEN. Therefore, based on the AppGrad baseline, we derive an accelerated iterative algorithm to address this non-convex optimization problem.

Firstly, we simplify the $\ell_{2,1}$ norm. The $\ell_{2,1}$ norm behaves unpredictably close to the origin, where it is non-differentiable [34]. To overcome this limitation, we need to define a function $\varphi$ which satisfies all of the following conditions.

Proposition 1.

[34] Let $\varphi(x)$ be a function satisfying all of the following conditions:

  • $x \mapsto \varphi(x)$ is convex on $\mathbb{R}$,

  • $x \mapsto \varphi(\sqrt{x})$ is concave on $\mathbb{R}_+$,

  • $\varphi(x) = \varphi(-x)$, $\forall x \in \mathbb{R}$,

  • $\varphi(x)$ is $C^1$ on $\mathbb{R}$,

  • $\varphi''(0^+) > 0$, $\lim_{x \to \infty} \varphi(x)/x^2 = 0$.

In this paper, we choose the function $\varphi(x) = \sqrt{x^2 + \epsilon}$, where $\epsilon$ is a small perturbation that makes the norm smooth and differentiable. It is clear that this function fulfills all conditions in Proposition 1.
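A minimal sketch of this smoothing and of the half-quadratic auxiliary variable it induces. The explicit conjugate $\psi(p) = \epsilon p + 1/(4p)$ below is our own derivation, included only to verify the minimizer numerically; the paper itself never needs $\psi$ in closed form:

```python
import numpy as np

EPS = 1e-4  # the small smoothing perturbation (epsilon in the text)

def phi(x):
    # Smoothed |x|: sqrt(x^2 + eps) is convex, even, C^1, and avoids the
    # non-differentiability of the l2 norm at the origin.
    return np.sqrt(x ** 2 + EPS)

def hq_minimizer(x):
    # Half-quadratic auxiliary variable: the p attaining
    # phi(x) = min_p { p*x^2 + psi(p) } is p = 1 / (2*sqrt(x^2 + eps)).
    return 1.0 / (2.0 * np.sqrt(x ** 2 + EPS))
```

With $\psi(p) = \epsilon p + 1/(4p)$, one can check that $p x^2 + \psi(p)$ evaluated at the minimizer equals $\varphi(x)$ exactly, which is the identity Lemma 1 relies on.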

After defining the function $\varphi$, the following Lemma helps solve this function in a half-quadratic way [35].

Lemma 1.

[34] Given $\varphi(x)$, there is a conjugate function $\psi$ such that

$$\varphi(x) = \min_{p} \left\{ \, p x^2 + \psi(p) \, \right\},$$

where $p$ is determined by the minimizer function w.r.t. $\varphi(x)$.

Following the above Lemma, we can rewrite the objective function in (III-A) in terms of traces as follows.

$$\min_{\Phi_x, \Phi_y} \ \|X\Phi_x - Y\Phi_y\|_F^2 + \lambda_1 \left( \mathrm{tr}(\Phi_x^\top D_x \Phi_x) + \mathrm{tr}(\Phi_y^\top D_y \Phi_y) \right) + \lambda_2 \|W\|_*,$$

where $D_x$ and $D_y$ are diagonal matrices. Their diagonal elements can be calculated as

$$(D_x)_{ii} = \frac{1}{2\sqrt{\|\phi_x^i\|_2^2 + \epsilon}}, \qquad (D_y)_{ii} = \frac{1}{2\sqrt{\|\phi_y^i\|_2^2 + \epsilon}},$$

where $\phi_x^i$ and $\phi_y^i$ are the $i$-th row of the matrix $\Phi_x$ and the $i$-th row of the matrix $\Phi_y$, respectively, and $\epsilon$ is a small smoothing term [36].

Subsequently, for the nuclear norm, the following Lemma presents a well-known variational formula.

Lemma 2.

[29, 28] Let $W \in \mathbb{R}^{m \times n}$. The nuclear norm of $W$ is equivalent to

$$\|W\|_* = \frac{1}{2} \inf_{S \succ 0} \left\{ \mathrm{tr}\left( W^\top S^{-1} W \right) + \mathrm{tr}(S) \right\},$$

which is a convex function of $W$ (strictly convex when $W W^\top$ is invertible). If $W W^\top$ is invertible, the infimum is attained at $S = (W W^\top)^{1/2}$. Otherwise, we can take $S$ from inside the cone of positive definite matrices to get arbitrarily close to the infimum.
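The lemma can be sanity-checked numerically: evaluating the variational bound at $S = (W W^\top)^{1/2}$ should reproduce the sum of singular values (a verification sketch, with a small eigenvalue floor as a numerical guard):

```python
import numpy as np

def nuclear_norm(W):
    return np.sum(np.linalg.svd(W, compute_uv=False))

def variational_value(W):
    """Evaluate (1/2)(tr(W^T S^{-1} W) + tr(S)) at S = (W W^T)^{1/2}."""
    M = W @ W.T
    w, V = np.linalg.eigh(M)
    w = np.maximum(w, 1e-15)                       # guard tiny negative round-off
    S = V @ np.diag(np.sqrt(w)) @ V.T              # matrix square root of W W^T
    S_inv = V @ np.diag(1.0 / np.sqrt(w)) @ V.T    # its inverse
    return 0.5 * (np.trace(W.T @ S_inv @ W) + np.trace(S))
```

The check works because $\mathrm{tr}(W^\top S^{-1} W) = \mathrm{tr}(M^{1/2})$ and $\mathrm{tr}(S) = \mathrm{tr}(M^{1/2})$, each equal to $\|W\|_*$.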

Following the above Lemma and substituting (8) into (III-C), we have

$$\min_{\Phi_x, \Phi_y, S} \ \|X\Phi_x - Y\Phi_y\|_F^2 + \lambda_1 \left( \mathrm{tr}(\Phi_x^\top D_x \Phi_x) + \mathrm{tr}(\Phi_y^\top D_y \Phi_y) \right) + \frac{\lambda_2}{2} \left( \mathrm{tr}\left( W^\top S^{-1} W \right) + \mathrm{tr}(S) \right).$$

Using the property of the nuclear norm and the block structure of $W = [X\Phi_x, Y\Phi_y]$, we can further simplify (III-C) as

$$\min_{\Phi_x, \Phi_y, S} \ \|X\Phi_x - Y\Phi_y\|_F^2 + \lambda_1 \left( \mathrm{tr}(\Phi_x^\top D_x \Phi_x) + \mathrm{tr}(\Phi_y^\top D_y \Phi_y) \right) + \frac{\lambda_2}{2} \left( \mathrm{tr}(\Phi_x^\top X^\top S^{-1} X \Phi_x) + \mathrm{tr}(\Phi_y^\top Y^\top S^{-1} Y \Phi_y) + \mathrm{tr}(S) \right).$$
As with the $\ell_{2,1}$ norm [36], we also impose a small additional term to guarantee convergence [29]. As a result, the infimum over $S$ is achieved at

$$S = \left( W W^\top + \mu I \right)^{1/2},$$

where $\mu$ is a small perturbation.
Finally, for given $D_x$, $D_y$ and $S$, we have the final optimization formula over $\Phi_x$ and $\Phi_y$:

$$\min_{\Phi_x, \Phi_y} \ \|X\Phi_x - Y\Phi_y\|_F^2 + \lambda_1 \left( \mathrm{tr}(\Phi_x^\top D_x \Phi_x) + \mathrm{tr}(\Phi_y^\top D_y \Phi_y) \right) + \frac{\lambda_2}{2} \left( \mathrm{tr}(\Phi_x^\top X^\top S^{-1} X \Phi_x) + \mathrm{tr}(\Phi_y^\top Y^\top S^{-1} Y \Phi_y) \right).$$
To find the solutions to (III-C), we derive an accelerated iterative approach based on AppGrad [10]. The new iterative approach shares the same underlying idea as AppGrad: we introduce an unnormalized pair, which is updated via gradient descent with momentum at each iteration, and the true canonical pair is then updated from the resulting unnormalized pair. However, unlike AppGrad, the new iterative approach needs to calculate the intermediate variables $D_x$, $D_y$ and $S$ from the true canonical pair at each iteration. These intermediate variables are then fed into the following updating step to calculate the unnormalized pair (see the loop in Algorithm 1).

At each iteration, we take the partial derivative of the objective function in (III-C) with respect to the unnormalized variable $\tilde{\Phi}_x$, and then update $\tilde{\Phi}_x$ from the past time step to the current one using momentum:

$$V_x^{(t+1)} = \gamma V_x^{(t)} - \eta \, \nabla_{\tilde{\Phi}_x} J\big(\tilde{\Phi}_x^{(t)}\big), \qquad \tilde{\Phi}_x^{(t+1)} = \tilde{\Phi}_x^{(t)} + V_x^{(t+1)},$$

where $\gamma$ is a momentum coefficient that controls the momentum, $\eta$ is a learning rate, $V_x$ is the momentum term, and $J$ denotes the objective in (III-C).
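The update is the classical momentum scheme; a generic sketch (the paper applies the same pattern to the matrix-valued unnormalized pair):

```python
import numpy as np

def momentum_step(W, velocity, grad, lr=0.01, mu=0.9):
    # Classical momentum: accumulate a velocity term, then move W along it.
    velocity = mu * velocity - lr * grad
    return W + velocity, velocity
```

For example, repeatedly applying the step to the gradient of a simple quadratic drives the iterate to its minimizer:

```python
w, v = np.ones(3), np.zeros(3)
for _ in range(200):
    w, v = momentum_step(w, v, grad=w)   # gradient of 0.5*||w||^2 is w
```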

After updating $\tilde{\Phi}_x$, we then calculate the true canonical variable $\Phi_x$ such that $\Phi_x^\top X^\top X \Phi_x = I_k$:

$$\Phi_x = \tilde{\Phi}_x \left( \tilde{\Phi}_x^\top X^\top X \tilde{\Phi}_x \right)^{-1/2} = \tilde{\Phi}_x U \Sigma^{-1/2} U^\top,$$

where $U$ and $\Sigma$ are a unitary matrix and a rectangular diagonal matrix, respectively, obtained from the singular value decomposition $\tilde{\Phi}_x^\top X^\top X \tilde{\Phi}_x = U \Sigma U^\top$.
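The normalization only ever factorizes the small $k \times k$ matrix, which is what keeps the per-iteration cost low. A sketch of this AppGrad-style step (using a symmetric eigendecomposition, equivalent here to the SVD of a symmetric positive definite matrix):

```python
import numpy as np

def normalize(W_tilde, X):
    """Map an unnormalized variable to the true canonical variable so that
    Phi^T (X^T X) Phi = I, via a factorization of the small k x k matrix."""
    M = W_tilde.T @ (X.T @ X) @ W_tilde          # k x k, symmetric PSD
    w, U = np.linalg.eigh(M)
    M_inv_sqrt = U @ np.diag(np.maximum(w, 1e-12) ** -0.5) @ U.T
    return W_tilde @ M_inv_sqrt
```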

Analogously to (III-C) and (III-C), we can calculate the other true canonical variable $\Phi_y$ in the same fashion.


To sum up, the pseudo-code of the RMEN-CCA is summarized in Algorithm 1.

Input: Training data ; the learning rate ; the trading-off parameters ; the momentum coefficient .
Output: The true canonical pair

1:Initialize from the standard Gaussian distribution, as well as set and the momentum as zero matrices;
2:while not converged do
3:     Calculate by singular value decomposition (SVD): ;
4:     Calculate ;
5:     Calculate each diagonal element of and using (III-C);
6:     Update and in (III-C) and (III-C), respectively;
7:     Update and in (III-C) and (III-C), respectively;
Algorithm 1 An Iterative Algorithm for RMEN-CCA

Moreover, we can directly address the KRMEN-CCA with Algorithm 1 by using the Gram matrices $K_x$ and $K_y$ to replace $X$ and $Y$, respectively.

In order to further boost the generalization performance, we also present a stochastic iterative algorithm for the RMEN-CCA, as shown in Algorithm 2.

Input: Training data ; the learning rate ; the trading-off parameters ; the momentum coefficient .
Output: The true canonical pair

1:Initialize from the standard Gaussian distribution, as well as set and the momentum as zero matrices;
2:while not converged do
3:     Randomly select a subset of training samples, where we have and ;
4:     Calculate by singular value decomposition (SVD): ;
5:     Calculate ;
6:     Calculate each diagonal element of and using (III-C);
7:     Update and in (III-C) and (III-C), respectively;
8:     Update and in (III-C) and (III-C), respectively;
Algorithm 2 A Stochastic Iterative Algorithm for RMEN-CCA
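The only new step in Algorithm 2 is the mini-batch draw; crucially, both views must be subsampled with the same indices so the pairing is preserved. A minimal sketch (function name is illustrative, not from the paper):

```python
import numpy as np

def sample_minibatch(X, Y, batch_size, rng):
    # Draw the same random row subset from both views so that the paired
    # structure (x_i, y_i) is preserved within the batch.
    idx = rng.choice(X.shape[0], size=batch_size, replace=False)
    return X[idx], Y[idx]
```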

IV Convergence Analysis of the RMEN-CCA

Consistently fast convergence illustrates the practicality and efficiency of an algorithm. For this reason, it is worthwhile to discuss the convergence of the RMEN-CCA. In this section, we theoretically analyze the convergence of the RMEN-CCA in Theorem 1. Additionally, the empirical study in Section V further illustrates the quick convergence of the RMEN-CCA on some real-world tasks. In order to prove Theorem 1, we first present the following Lemma. Lemma 3 not only reveals the relationship between the optimal true canonical pair and its unnormalized counterpart, but also interprets the novel iterative approach as an approximate gradient scheme for addressing the RMEN-CCA.

Lemma 3.

Let $(\Phi_x^*, \Phi_y^*)$ and $(\tilde{\Phi}_x^*, \tilde{\Phi}_y^*)$ be the optimal true canonical pair and the optimal unnormalized pair, respectively. Then, we have $(\tilde{\Phi}_x^*, \tilde{\Phi}_y^*) = (\Phi_x^* \Lambda, \Phi_y^* \Lambda)$, where $\Lambda$ is the canonical correlation diagonal matrix.




We only prove that the result holds for one variable; a similar result then holds for the other variable.

Following the optimality condition, we have


Using Lemmas 1.1 and 2.2 in [10], we can directly obtain the desired result as follows.


A similar argument gives the counterpart result for the other variable.

We complete the proof. ∎

Following the above Lemma, Theorem 1 illustrates the main convergence result of the RMEN-CCA.

Theorem 1.

The sequence of leading canonical pairs generated by the RMEN-CCA converges to the optimal canonical pair.


Lemma 3 shows the relationship between the two canonical pairs. Therefore, we now only need to consider the unnormalized canonical variables. The newly-derived iterative approach is an alternating optimization; that is, at each iteration, we fix one canonical variable and update the other. Therefore, we can reformulate the optimization form of the RMEN-CCA at the $t$-th iteration as


Without loss of generality, we only consider the first subproblem; the second admits a similar argument.

According to [27], we show that the second term monotonically decreases at each iteration. At the $t$-th iteration, we have


The Eq. 23 indicates that


The (24) can be written as


Following the elementary inequality $\sqrt{a} - \frac{a}{2\sqrt{b}} \leq \sqrt{b} - \frac{b}{2\sqrt{b}}$ for any $a, b > 0$, for each $i$, we have


Summing over all above inequalities, we have


Combining (25) and (27), we have


That is to say,


which is the desired result.

Moreover, Lemma 1 in [28] indicates that the third term converges to its optimum, since the nuclear norm is the infimum over $S$.

Now we revisit the first subproblem. Its objective function consists of three norms; thus the objective is convex and its value is non-negative. Under gradient descent, the objective keeps moving in the negative direction of the gradient. Therefore, the objective values are monotonically decreasing as $t \to \infty$, and the unnormalized canonical variables converge to the optimal unnormalized canonical variables.

Furthermore, we revisit the original optimization problem of the RMEN-CCA. Following Lemma 3, when the unnormalized pair converges to its optimum as $t \to \infty$, the true canonical pair also converges to the optimal canonical pair. We obtain the desired result and complete the proof. ∎

V Experimental results and discussion

In this section, we conduct experiments on four popular datasets to evaluate the RMEN-CCA: MNIST [37], Wiki [38], Pascal VOC 2007 [39] and URL Reputation [40]. We randomly select a portion of the training data as the validation set for each dataset, and the hyper-parameters are tuned on the validation set. Each dataset is described below, and the statistics of the four datasets are summarized in Table I.

  • The MNIST is a handwritten digit recognition dataset, in which each image is separated into the left and right halves.

  • The Wiki is a popular image-text matching dataset, which was used in [20, 38, 41]. The Wiki dataset is assembled from Wikipedia articles and consists of 2866 image-text pairs (2173 training pairs and 693 test pairs). The features of images and texts are extracted by 128-dim SIFT and a 10-dim latent Dirichlet allocation algorithm, respectively. Some examples are shown in Figure 1.

  • The Pascal VOC 2007 is a challenging and realistic image-tag matching testbed. In order to carry out more experiments, three image features and three tag features are extracted as in [42]: Gist, HSV color histograms and bag-of-visual-words (BoW) histograms for images, as well as word frequency (Wc), relative tag rank (Rel) and absolute tag rank (Abs) for tags. Some examples are shown in Figure 2.

  • The URL Reputation is a large-scale dataset for online learning algorithms. Due to limitations of computational resources, we only use a subset of the samples and attributes.

Fig. 1: Three examples on Wiki: image (left) and its corresponding article (right).
Fig. 2: Some examples on Pascal VOC 2007: image and its multiple tags.
Problems Description Training set Test set Dim X Dim Y
MNIST Left and Right Halves of Images 60,000 10,000 392 392
Wiki Image-Text Pairs 2,173 693 128 10
Pascal VOC 2007 Image and Its Multi-Labels 5,011 4,952 512 (Gist) 399 (Wc)
200 (Bow) 399 (Rel)
64 (Hsv) 399 (Abs)
URL Reputation Host and Lexical based Features 1,000,000 20,000 50 50
TABLE I: The statistics of the four datasets

Moreover, we measure the test results by the commonly-used Pearson product-moment correlation coefficient (PCC). The PCC is defined as $\rho = \mathrm{cov}(u, v) / (\sigma_u \sigma_v)$, where $u$ and $v$ are two projected test samples, $\mathrm{cov}$ is the covariance, and $\sigma_u$ and $\sigma_v$ are the standard deviations of $u$ and $v$, respectively. The reported PCC ranges from 0 to 100, where 100 denotes complete correlation and 0 denotes no correlation. Ten trials are conducted for each algorithm, and we report the average PCC results. In the experiments, we found that the dimension of the canonical subspace has almost no effect on the performance. Therefore, except for MNIST, we calculate the top 5-dimensional canonical subspace for the other three datasets. All simulations are carried out in a Matlab 2015b environment on a PC with an Intel(R) Core(TM) i7-6700HQ 2.60 GHz CPU, an NVIDIA GTX 960M GPU and 8 GB of RAM.
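A minimal sketch of the evaluation metric; we take the absolute value to match the stated 0-to-100 reporting range (that scaling convention is our reading of the text, not a formula given in the paper):

```python
import numpy as np

def pcc_percent(u, v):
    # Pearson correlation |cov(u, v)| / (std(u) * std(v)), scaled to [0, 100].
    u = u - u.mean()
    v = v - v.mean()
    return 100.0 * abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```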

V-a Hyper-parameters selection

In the RMEN-CCA, we initialize the true canonical pair by drawing i.i.d. samples from the standard Gaussian distribution, and set the unnormalized pair and the momentum to zero matrices. We empirically found that the RMEN-CCA is insensitive to the learning rate and the two trade-off coefficients. When the value of the first trade-off coefficient is 10 times that of the second, the RMEN-CCA achieves the best performance; hence, we fix these values accordingly on all four datasets. According to [43], the momentum coefficient is set to 0.9. For the stochastic iterative algorithm, a small part of the training data is held out. We use the stochastic iterative approach for the RMEN-CCA in all experiments.

We choose the commonly-used Gaussian kernel for all kernel-based methods, and the kernel parameter is chosen empirically on the validation set. In addition, the numbers of random projections of FKCCA and NKCCA [7] and of approximate KCCA (KNOI) [12] are chosen empirically as well. Other user-defined parameters are determined empirically, and we pick the configuration with the best performance. Due to space limitations, we omit the procedure for selecting hyper-parameters.

V-B Convergence

In this section, we use the accuracy curve on MNIST and Wiki to illustrate the convergence of the RMEN-CCA instead of the objective convergence curve, which is more visually informative for the convergence analysis. As the number of iterations increases, the accuracy of the RMEN-CCA improves; when the accuracy curve becomes almost flat, we consider that the RMEN-CCA has almost converged. In Figure 3(a), we note a jump in the curve when the number of iterations is about 30, after which the accuracy improves. When the iteration count reaches 150, the RMEN-CCA has almost converged. Therefore, we set the number of iterations to 150 on MNIST. Moreover, in Figure 3(b), we only show an interval of the accuracy curve of the RMEN-CCA on Wiki, from iteration 150 to 600, because of the different dimensions of the two 'views'. We note that the RMEN-CCA almost converges when the iteration count reaches 500, after which the standard deviation decreases significantly. Therefore, balancing effectiveness and efficiency, we set the number of iterations to 500. Furthermore, we will investigate the convergence of the RMEN-CCA on the large-scale URL Reputation dataset in comparison with Scalable CCA. Due to space limitations, we omit the convergence analysis of the RMEN-CCA on Pascal VOC 2007.

Fig. 3: The convergence analysis of the RMEN-CCA on MNIST and Wiki.

V-C Performance

V-C1 Results on MNIST

In this paper, we only consider unsupervised learning approaches for comparison. The compared methods include linear CCA, partial least squares (PLS) [44], bilinear models (BLM) [45, 46], FKCCA and NKCCA [7], approximate KCCA (KNOI) [12], and Scalable CCA [10]. To better illustrate the effectiveness of the RMEN-CCA, we also derive CCA with the basic MEN, termed MEN-CCA. The KNOI is run on the GPU, while the other algorithms are run on the CPU. The experimental results in terms of PCC and time are shown in Table II.

Algorithms PCC(%) time(sec)
RMEN-CCA 10.04
MEN-CCA 90.82 8.19
KNOI 87.26 258.52 (GPU)
FKCCA 82.61 232.87
NKCCA 84.92 257.69
Scalable CCA 56.87 6.39
Linear CCA 58.43 3.08
PLS 58.16 3.51
BLM 58.68 2.89
TABLE II: The comparison with state-of-the-art approaches on MNIST in terms of PCC (%) and time (sec).

As seen from Table II, the RMEN-CCA achieves the best performance among all comparisons. The RMEN-CCA remarkably outperforms Scalable CCA in terms of accuracy, since it benefits from the strength of coupled feature selection. The RMEN-CCA takes slightly more time than Scalable CCA, because it needs to compute an SVD at each iteration, whose computational complexity grows with the number of input data. However, the time cost of the RMEN-CCA is much lower than that of all kernel-based methods, even though KNOI is run on the GPU. We note that the above timing results may even understate the advantage of the RMEN-CCA: all kernel-based algorithms are sensitive to the user-specified kernel width, so many experiments on the validation set were carried out for the kernel-based algorithms in order to achieve the reported results, and these additional time costs are not included in Table II.

V-C2 Results on Wiki

Due to limitations of computational resources, we only evaluate the KRMEN-CCA on this dataset. In addition, the KCCA [11] is also used as a comparison on this dataset. The experimental results are shown in Table III.

From Table III, apart from the KRMEN-CCA and the KCCA, the RMEN-CCA outperforms all other comparisons. The KCCA slightly outperforms the RMEN-CCA, but requires about 1,120 times the training time of the RMEN-CCA. Notably, the KRMEN-CCA achieves very high accuracy while requiring only about one-third of the time of the KCCA.

Algorithms PCC(%) time(sec)
RMEN-CCA 51.28 1.27
MEN-CCA 50.33 1.09
KRMEN-CCA 465.71
KCCA 58.54 1427.67
KNOI 49.61 213.45
FKCCA 50.12 143.61
NKCCA 50.64 159.28
Scalable CCA 46.19 0.43
Linear CCA 46.36 0.07
PLS 46.32 0.08
BLM 46.92 0.07
TABLE III: The comparison with state-of-the-art approaches on Wiki in terms of PCC (%) and time (sec).

V-C3 Results on Pascal VOC 2007

To better illustrate the effectiveness and efficiency of the RMEN-CCA, we not only evaluate it against state-of-the-art algorithms on the image-to-tag features of Pascal VOC 2007, but also conduct an additional group of experiments on its image-to-image features.

As shown in Tables IV and V, the RMEN-CCA outperforms all comparisons in terms of accuracy at a much faster learning speed.

Algorithms Gist-Wc Gist-Rel Gist-Abs
PCC(%) time(sec) PCC(%) time(sec) PCC(%) time(sec)
RMEN-CCA 0.89 0.96 0.95
MEN-CCA 60.29 0.71 63.18 0.74 57.10 0.82
KNOI 54.51 522.88 57.63 516.79 56.21 523.24
FKCCA 55.69 157.18 55.41 153.84 56.63 154.68
NKCCA 55.63 163.24 55.77 143.64 57.29 171.46
Scalable CCA 51.53 0.88 51.36 0.84 53.56 0.79
Linear CCA 51.77 0.87 52.40 0.81 53.63 0.81
PLS 51.75 0.82 51.65 0.83 53.27 0.80
BLM 52.08 0.92 51.94 0.85 53.91 0.89
TABLE IV: The comparison with state-of-the-art approaches on image-to-tags features based Pascal VOC 2007 in terms of PCC (%) and time (sec).
Algorithms Gist-Bow Gist-Hsv Hsv-Bow
PCC(%) time(sec) PCC(%) time(sec) PCC(%) time(sec)
RMEN-CCA 0.84 0.81 0.81
MEN-CCA 61.41 0.74 56.82 0.69 50.83 0.68
KNOI 61.54 333.22 54.27 392.74 47.66 375.42
FKCCA 61.60 156.62 50.83 148.60 50.76 164.53
NKCCA 61.83 147.27 53.19 166.24 50.69 187.27
Scalable CCA 47.99 0.82 49.76 0.71 41.91 0.67
Linear CCA 49.63 0.72 50.18 0.68 43.48 0.56
PLS 48.23 0.79 50.05 0.69 43.17 0.62
BLM 48.99 0.72 51.33 0.67 44.08 0.58
TABLE V: The comparison with state-of-the-art approaches on image-to-image features based Pascal VOC 2007 in terms of PCC (%) and time (sec).

V-C4 Results on URL Reputation

We compare only with Scalable CCA on the large-scale URL Reputation dataset. The previous work [10] has shown that Scalable CCA achieves excellent performance on this dataset, so we omit other methods to avoid duplication of effort. Classical CCA methods based on eigenvector computation fail on a typical PC, because these approaches are trained over the entire training set in a batch fashion, which makes them prohibitive for huge datasets. Like Scalable CCA, the RMEN-CCA is an online learning algorithm, a commonly-used strategy for huge datasets.
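The batch eigenvector-based computation mentioned above can be sketched as follows. This is a minimal generic CCA solver via the whitened cross-covariance matrix and SVD, not the Scalable CCA or RMEN-CCA algorithm; the `reg` term is an assumption added for numerical stability. Its covariance matrices require holding and passing over the entire training set, which is exactly what makes batch methods prohibitive for huge datasets.

```python
import numpy as np

def batch_cca(X, Y, reg=1e-6):
    """Classical batch CCA: the canonical correlations are the singular
    values of the whitened cross-covariance matrix."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Full covariance matrices require the entire training set at once.
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # Symmetric inverse square root via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # Singular values of the whitened cross-covariance = canonical correlations.
    return np.linalg.svd(inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy), compute_uv=False)

rng = np.random.default_rng(0)
z = rng.normal(size=1000)                       # shared latent signal
X = np.column_stack([z + 0.1 * rng.normal(size=1000), rng.normal(size=1000)])
Y = np.column_stack([z + 0.1 * rng.normal(size=1000), rng.normal(size=1000)])
corrs = batch_cca(X, Y)                          # leading value close to 1
```

The leading canonical correlation recovers the strongly shared dimension, while the second (pure noise) stays near zero; online methods approximate the same quantities without ever materializing the full covariances.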

Algorithms iteration PCC(%) time(sec)
RMEN-CCA 200 22.04
Scalable CCA 200 35.814 21.93
1000 41.144 31.44
TABLE VI: The comparison with Scalable CCA on the large-scale URL Reputation dataset in terms of PCC (%) and time (sec).
Fig. 4: The convergence analysis of the RMEN-CCA and Scalable CCA on URL Reputation.

From Table VI, we see that the RMEN-CCA achieves better performance than the Scalable CCA. As shown in Figure 4, the more iterations the Scalable CCA runs, the more correlation it captures. However, we find that beyond 1,000 iterations the Scalable CCA is no longer able to improve its accuracy; its PCC plateaus. In contrast, the RMEN-CCA significantly outperforms the Scalable CCA regardless of the latter's number of iterations. These experimental results on a large-scale dataset illustrate that the RMEN-CCA not only converges faster than the Scalable CCA but is also more stable.
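The plateau behavior observed here can be made operational with a simple stopping rule. This is a hedged sketch only: the per-iteration PCC values in `curve` are illustrative numbers, not the measured URL Reputation curve, and the window and tolerance parameters are assumptions.

```python
def plateau_iteration(history, window=5, tol=1e-3):
    """Return the first iteration at which the metric has moved by less
    than `tol` over the last `window` iterations, or None if it never plateaus."""
    for t in range(window, len(history)):
        recent = history[t - window:t + 1]
        if max(recent) - min(recent) < tol:
            return t
    return None

# Illustrative PCC trajectory: rapid early gains, then a flat tail.
curve = [0.10, 0.25, 0.33, 0.38, 0.40, 0.4085,
         0.4090, 0.4092, 0.4093, 0.4094, 0.4094]
stop = plateau_iteration(curve)          # flags the flat tail
early = plateau_iteration(curve[:6])     # no plateau yet in the early phase
```

Such a rule lets an online method stop once extra iterations no longer buy accuracy, which is the regime where Scalable CCA stalls in Figure 4.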

VI Conclusion

In this paper, we derive a novel robust matrix elastic net based canonical correlation analysis (RMEN-CCA) with theoretically guaranteed convergence and empirical proficiency. To the best of our knowledge, the RMEN-CCA is the first to incorporate coupled feature selection into CCA. As a consequence, the RMEN-CCA not only measures correlations between different 'views', but also distills numerous relevant and useful features, which significantly improves its performance. Additionally, to model highly sophisticated nonlinear relationships, the RMEN-CCA can be extended straightforwardly to the kernel scenario. Furthermore, due to its complicated model architecture, it is nontrivial to solve the RMEN-CCA by existing optimization approaches; we therefore bridge the gap between this new optimization problem and an existing efficient iterative algorithm. Finally, competitive experimental results on four popular datasets confirm the effectiveness and efficiency of the RMEN-CCA in multi-view learning problems.


The authors would like to thank Prof. Daniel Palomar from the Hong Kong University of Science and Technology for his great inspiration and valuable discussions of this work. The authors would also like to thank Prof. Ajay Joneja from the Hong Kong University of Science and Technology for providing a PC and for his constructive suggestions on this research.


  • [1] J. Yu, Y. Rui, Y. Y. Tang, and D. Tao, “High-order distance-based multiview stochastic learning in image classification,” IEEE transactions on cybernetics, vol. 44, no. 12, pp. 2431–2442, 2014.
  • [2] X. Zhu, X. Li, and S. Zhang, “Block-row sparse multiview multilabel learning for image classification,” IEEE transactions on cybernetics, vol. 46, no. 2, pp. 450–461, 2016.
  • [3] P. Dhillon, D. P. Foster, and L. H. Ungar, “Multi-view learning of word embeddings via cca,” in Advances in Neural Information Processing Systems, 2011, pp. 199–207.
  • [4] K. Wang, Q. Yin, W. Wang, S. Wu, and L. Wang, “A comprehensive survey on cross-modal retrieval,” arXiv preprint arXiv:1607.06215, 2016.
  • [5] C. Xu, D. Tao, and C. Xu, “A survey on multi-view learning,” arXiv preprint arXiv:1304.5634, 2013.
  • [6] G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu, “Deep canonical correlation analysis.” in ICML (3), 2013, pp. 1247–1255.
  • [7] D. Lopez-Paz, S. Sra, A. J. Smola, Z. Ghahramani, and B. Schölkopf, “Randomized nonlinear component analysis.” in ICML, 2014, pp. 1359–1367.
  • [8] D. Chu, L.-Z. Liao, M. K. Ng, and X. Zhang, “Sparse canonical correlation analysis: new formulation and algorithm,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 12, pp. 3050–3065, 2013.
  • [9] D. R. Hardoon and J. Shawe-Taylor, “Sparse canonical correlation analysis,” Machine Learning, vol. 83, no. 3, pp. 331–353, 2011.
  • [10] Z. Ma, Y. Lu, and D. Foster, “Finding linear structure in large datasets with scalable canonical correlation analysis,” in Proc. of the 32nd Int. Conf. Machine Learning (ICML 2015), 2015, pp. 169–178.
  • [11] F. R. Bach and M. I. Jordan, “Kernel independent component analysis,” Journal of machine learning research, vol. 3, no. Jul, pp. 1–48, 2002.
  • [12] W. Wang and K. Livescu, “Large-scale approximate kernel canonical correlation analysis,” arXiv preprint arXiv:1511.04773, 2015.
  • [13] B. Xie, Y. Liang, and L. Song, “Scale up nonlinear component analysis with doubly stochastic gradients,” in Advances in Neural Information Processing Systems, 2015, pp. 2341–2349.
  • [14] W. Wang, J. Wang, D. Garber, and N. Srebro, “Efficient globally convergent stochastic optimization for canonical correlation analysis,” in Advances in Neural Information Processing Systems, 2016, pp. 766–774.
  • [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [16] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
  • [17] S. Gaiffas and G. Lecué, “Sharp oracle inequalities for high-dimensional matrix prediction,” IEEE Transactions on Information Theory, vol. 57, no. 10, pp. 6942–6957, 2011.
  • [18] H. Li, N. Chen, and L. Li, “Error analysis for matrix elastic-net regularization algorithms,” IEEE transactions on neural networks and learning systems, vol. 23, no. 5, pp. 737–748, 2012.
  • [19] X. Zhen, M. Yu, X. He, and S. Li, “Multi-target regression via robust low-rank learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [20] K. Wang, R. He, W. Wang, L. Wang, and T. Tan, “Learning coupled feature spaces for cross-modal matching,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2088–2095.
  • [21] Z. Luo, B. Peng, D.-A. Huang, A. Alahi, and L. Fei-Fei, “Unsupervised learning of long-term motion dynamics for videos,” arXiv preprint arXiv:1701.01821, 2017.
  • [22] S. Singh, A. Okun, and A. Jackson, “Artificial intelligence: Learning to play go from scratch,” Nature, vol. 550, no. 7676, pp. 336–337, 2017.
  • [23] M. Mirza, A. Courville, and Y. Bengio, “Generalizable features from unsupervised learning,” arXiv preprint arXiv:1612.03809, 2016.
  • [24] A. Argyriou, T. Evgeniou, and M. Pontil, “Multi-task feature learning,” in Advances in neural information processing systems, 2007, pp. 41–48.
  • [25] Q. Gu, Z. Li, and J. Han, “Joint feature selection and subspace learning,” in IJCAI Proceedings-International Joint Conference on Artificial Intelligence, vol. 22, no. 1.   Citeseer, 2011, p. 1294.
  • [26] L. Du, P. Zhou, L. Shi, H. Wang, M. Fan, W. Wang, and Y.-D. Shen, “Robust multiple kernel k-means using l21-norm,” in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
  • [27] F. Nie, H. Huang, X. Cai, and C. H. Ding, “Efficient and robust feature selection via joint ℓ2, 1-norms minimization,” in Advances in neural information processing systems, 2010, pp. 1813–1821.
  • [28] C.-J. Hsieh and P. A. Olsen, “Nuclear norm minimization via active subspace selection.” in ICML, 2014, pp. 575–583.
  • [29] E. Grave, G. R. Obozinski, and F. R. Bach, “Trace lasso: a trace norm regularization for correlated designs,” in Advances in Neural Information Processing Systems, 2011, pp. 2187–2195.
  • [30] S. Gu, L. Zhang, W. Zuo, and X. Feng, “Weighted nuclear norm minimization with application to image denoising,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2862–2869.
  • [31] G. H. Golub and H. Zha, “The canonical correlations of matrix pairs and their numerical computation,” in Linear algebra for signal processing.   Springer, 1995, pp. 27–49.
  • [32] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization,” SIAM review, vol. 52, no. 3, pp. 471–501, 2010.
  • [33] F. Dinuzzo and B. Schölkopf, “The representer theorem for hilbert spaces: a necessary and sufficient condition,” in Advances in neural information processing systems, 2012, pp. 189–196.
  • [34] R. He, T. Tan, L. Wang, and W.-S. Zheng, “ℓ2,1 regularized correntropy for robust feature selection,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.   IEEE, 2012, pp. 2504–2511.
  • [35] M. Nikolova and M. K. Ng, “Analysis of half-quadratic minimization methods for signal and image recovery,” SIAM Journal on Scientific computing, vol. 27, no. 3, pp. 937–966, 2005.
  • [36] I. F. Gorodnitsky and B. D. Rao, “Sparse signal reconstruction from limited data using focuss: A re-weighted minimum norm algorithm,” IEEE Transactions on signal processing, vol. 45, no. 3, pp. 600–616, 1997.
  • [37] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [38] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to cross-modal multimedia retrieval,” in Proceedings of the 18th ACM international conference on Multimedia.   ACM, 2010, pp. 251–260.
  • [39] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results,”
  • [40] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Identifying suspicious urls: an application of large-scale online learning,” in Proceedings of the 26th annual international conference on machine learning.   ACM, 2009, pp. 681–688.
  • [41] J. C. Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. R. Lanckriet, R. Levy, and N. Vasconcelos, “On the role of correlation and abstraction in cross-modal multimedia retrieval,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 3, pp. 521–535, 2014.
  • [42] S. J. Hwang and K. Grauman, “Accounting for the relative importance of objects in image retrieval.” in BMVC, vol. 1, no. 2, 2010, p. 5.
  • [43] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural networks, vol. 12, no. 1, pp. 145–151, 1999.
  • [44] R. Rosipal and N. Krämer, “Overview and recent advances in partial least squares,” Lecture notes in computer science, vol. 3940, p. 34, 2006.
  • [45] A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.   IEEE, 2012, pp. 2160–2167.
  • [46] J. B. Tenenbaum and W. T. Freeman, “Separating style and content,” in Advances in neural information processing systems, 1997, pp. 662–668.