Robust Matrix Elastic Net based Canonical Correlation Analysis: An Effective Algorithm for Multi-View Unsupervised Learning
Abstract
This paper presents a robust matrix elastic net based canonical correlation analysis (RMEN-CCA) for multi-view unsupervised learning problems, which emphasizes the combination of CCA and the robust matrix elastic net (RMEN) used as coupled feature selection. The RMEN-CCA leverages the strength of the RMEN to distill naturally meaningful features without any prior assumption and to effectively measure correlations between different 'views'. We can further employ the kernel trick directly to extend the RMEN-CCA to the kernel scenario with theoretical guarantees, taking advantage of the kernel trick for highly complicated nonlinear feature learning. Rather than simply incorporating existing regularization terms into CCA, this paper provides a new learning paradigm for CCA and is the first to derive a coupled feature selection based CCA algorithm with guaranteed convergence. More significantly, for CCA, the newly derived RMEN-CCA bridges the gap between measurement of relevance and coupled feature selection. Moreover, it is nontrivial to tackle the RMEN-CCA directly by previous optimization approaches owing to its sophisticated model architecture. Therefore, this paper further offers a bridge between the new optimization problem and an existing efficient iterative approach. As a consequence, the RMEN-CCA can overcome the limitations of CCA and address large-scale and streaming data problems. Experimental results on four popular competing datasets illustrate that the RMEN-CCA performs more effectively and efficiently than state-of-the-art approaches.
I Introduction
With the rapid development of the Internet, the amount of available multi-view data is increasing dramatically in various domains. Since the multiple representations of multi-view data are captured from various sources or different feature spaces, the statistical properties of these representations are, in general, completely different. Therefore, learning and identifying a consensus pattern effectively and efficiently from these multifarious representations is a persistent challenge. To address this problem, researchers have widely investigated multi-view learning methods [1, 2, 3, 4]. Existing multi-view learning approaches can be classified into three categories [5]: co-training, multiple kernel learning and subspace learning. Our work focuses on the third one.
Canonical Correlation Analysis (CCA) is a classical and powerful unsupervised learning approach for the multi-view learning problem. Its variants, such as Deep CCA (DCCA) [6], randomized nonlinear CCA (FKCCA and NKCCA) [7], Sparse CCA [8, 9] and Scalable CCA [10], have been thoroughly investigated. The basic idea of CCA is to find coupled linear projection matrices and then model the potential connections between two different 'views'. For this reason, CCA and its variants have been widely applied to multi-view learning problems.
The two major challenges of CCA are revealing the correlations across different sources effectively [11, 7, 12] and addressing the nonconvex optimization problem of CCA efficiently [10, 13, 14]. In this paper, we focus on the former. Over the past two decades, many studies have thoroughly investigated various kernel-based CCA approaches [7, 12]. Although the performance of these kernel-based CCA approaches has improved remarkably, they are not powerful enough to explore shared knowledge across different 'views'. In other words, these works ignore one critical property of cross-modal matching approaches, namely coupled feature selection. Additionally, it is quite difficult to select a 'suitable' Mercer kernel, which is a key factor for the success of kernel-based approaches. Generally speaking, the commonly used Gaussian kernel does not lead to optimal performance. Therefore, previous kernel-based CCA methods must consider the distribution of the inputs and the corresponding application scenarios as prior knowledge.
In this paper, we present a robust matrix elastic net based canonical correlation analysis (RMEN-CCA) that requires no prior knowledge. To the best of our knowledge, the RMEN-CCA is the first to incorporate the RMEN into CCA, thus emphasizing the combination of CCA with the coupled feature selection technique. In the RMEN, the $\ell_{2,1}$-norm allows the RMEN-CCA to capture a joint sparse structure so as to distill relevant attributes from the data-embedding space, and simultaneously, the nuclear norm models the correlation between the projected samples via a low-rank solution. More significantly, this paper provides a novel paradigm for CCA, which leverages the strength of coupled feature selection. Furthermore, in order for the RMEN-CCA to handle highly sophisticated nonlinear relationships, the algorithm can be performed directly in conjunction with the kernel trick; we refer to this kernel scenario as the KRMEN-CCA.
Moreover, we note that CCA is a typical nonconvex optimization problem because of its constraints, and hence it is nontrivial to solve it directly by naive gradient descent. The AppGrad [10] is an efficient iterative algorithm for CCA. The crucial idea of the AppGrad is to guarantee that the iterates of CCA remain in a convex region at each iteration. The AppGrad has two major advantages over traditional eigenvector computation. First, the AppGrad can significantly decrease computational and storage complexity. Second, the online property of the AppGrad makes it efficient at handling huge datasets, whereas eigenvector computation is prohibitive in this situation. Unfortunately, it is nontrivial to use the AppGrad directly to solve the RMEN-CCA because of the imposed RMEN. Therefore, based on the AppGrad baseline [10], we derive a novel accelerated iterative method with proved convergence to address this nonconvex optimization problem.
Furthermore, because of its highly flexible model architecture, the RMEN-CCA can be used as an intermediate structure in convolutional neural networks (CNNs) [15] for fine-tuning. The algorithm can also be applied to transfer learning [16] in order to address the problem of insufficient labeled data. However, these directions are beyond the scope of this paper.
The contributions of this paper are summarized as follows.

We present a novel robust matrix elastic net based canonical correlation analysis (RMEN-CCA) with theoretical guarantees and empirical proficiency. To the best of our knowledge, the RMEN-CCA is the first to incorporate coupled feature selection into CCA, which improves generalization performance by automatically distilling relevant and useful features without any prior knowledge.

In the RMEN, the $\ell_{2,1}$-norm enforces the reduction of redundant and meaningless features, and meanwhile the nuclear norm yields a low-rank solution that better encodes the correlation. As a result, the RMEN-CCA takes advantage of this sparse-plus-low-rank structure, which brings benefits in terms of effectiveness and efficiency.

The RMEN-CCA can directly leverage the powerful kernel trick to yield a kernel-based algorithm called the KRMEN-CCA, which can effectively construct nonlinear approximations of the manifolds.

It is nontrivial to address the proposed RMEN-CCA by existing approaches. Therefore, this paper bridges the gap between a novel nonconvex optimization problem and a previous efficient iterative approach, which enables the RMEN-CCA to be applied to large-scale and streaming data tasks.
The remainder of the paper is organized as follows. Section II reviews related work on feature selection techniques. Section III introduces the RMEN-CCA and its kernel version, and then derives an accelerated iterative optimization algorithm for the RMEN-CCA. Section IV theoretically analyzes the convergence of the RMEN-CCA. Section V evaluates the RMEN-CCA on four popular datasets. Finally, concluding remarks are provided in Section VI.
II Related work on coupled feature selection
In recent years, feature selection has played an important role in the machine learning community. This technique is designed to extract numerous useful features and to eliminate redundancies, in order to construct a simple model architecture. To overcome the limitations of conventional feature selection approaches such as the Lasso and Ridge regression, a combined feature selection approach, termed the matrix elastic net (MEN), has been successfully used in different learning algorithms [17, 18, 19].
Our motivation is inspired by the LCFS algorithm [20], which incorporates coupled feature selection into a linear regression method. However, in contrast to the LCFS, the RMEN-CCA confronts two greater challenges: learning in an unsupervised fashion and addressing a nonconvex optimization problem. Due to the high cost of labeling data manually, few labeled data may be available, even in this big data era. For this reason, deriving an efficient and effective approach without supervised information is a hot topic and a promising research direction in the machine learning community [21, 22, 23]. Since there is no desired target, the learning process of the RMEN-CCA is more difficult than that of the supervised learning algorithm LCFS. Therefore, how to distill useful features and information plays an important role in the RMEN-CCA. On the other hand, in contrast to the LCFS, which has the architecture of a linear system, the formulation of the RMEN-CCA is more complicated because of its nonconvex constraints. For this reason, the algorithm always fails to converge under naive gradient descent. Therefore, we must derive an effective optimization method to solve the novel RMEN-CCA.
The proposed RMEN-CCA uses the $\ell_{2,1}$-norm for joint feature selection in the RMEN instead of the Frobenius norm. Argyriou et al. [24] adopted the $\ell_{2,1}$-norm as the regularizer for multi-task feature learning. Gu et al. [25] derived a framework based on the $\ell_{2,1}$-norm for joint feature selection and subspace learning. Du et al. [26] presented a robust k-means approach based on the $\ell_{2,1}$-norm. We note that the $\ell_{2,1}$-norm is well suited to our task. The $\ell_{2,1}$-norm has two major advantages over the Frobenius norm [27]. First, the $\ell_{2,1}$-norm is much more robust to outliers. More importantly, the $\ell_{2,1}$-norm exploits a joint sparse structure to choose relevant attributes across all samples, rather than relying on the importance of individual features. In other words, the $\ell_{2,1}$-norm not only captures locally useful information and potentially relevant features, but also takes into account the manifold structure of the feature space. Section V will illustrate that the RMEN-CCA not only outperforms the CCA with the MEN in multi-task learning problems, but also applies better to single-task learning problems.
In addition to the $\ell_{2,1}$-norm in the RMEN, the nuclear norm (or trace norm) is a popular low-rank learning approach with widespread applications in cross-modal matching tasks [28, 29, 30]. The nuclear norm can yield a low-rank solution, thus significantly simplifying the model architecture. It is a common perspective that only a few elements of the instances contribute to a task. Moreover, different from previous work on the nuclear norm [19], the nuclear norm in the RMEN is imposed on all projected instances rather than only on the adjustable parameters. As a consequence, the nuclear norm can enforce the relevance of the projected samples with connections between them.
III The robust matrix elastic net based canonical correlation analysis
In this section, we formulate the RMEN-CCA in Section III-A, and then extend the RMEN-CCA to the kernel version (KRMEN-CCA) in order to handle nonlinear input-output relationships in Section III-B. Finally, we derive a new accelerated iterative algorithm to solve the RMEN-CCA in Section III-C.
III-A The formulation of the RMEN-CCA
First of all, we briefly introduce the formulation of the classical CCA method and the robust matrix elastic net.
Given two data matrices $X \in \mathbb{R}^{n \times d_x}$ and $Y \in \mathbb{R}^{n \times d_y}$ from the two views, the linear algebraic formulation of the classical CCA method [31] is shown as follows.
$$\min_{U,V}\ \frac{1}{2}\|XU-YV\|_F^2 \quad \text{s.t.}\quad U^\top X^\top XU=I_k,\ V^\top Y^\top YV=I_k, \tag{1}$$

where $U \in \mathbb{R}^{d_x \times k}$ and $V \in \mathbb{R}^{d_y \times k}$ are the true canonical variables, $k$ is the dimension of the top canonical subspace, $I_k$ is an identity matrix, and $\|\cdot\|_F$ is the Frobenius norm.
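For reference, the classical CCA problem (1) admits a closed-form solution via whitening and an SVD, which is the baseline that iterative solvers such as the AppGrad approximate. The sketch below is a minimal NumPy illustration; the function name and the small regularizer added for numerical stability are our choices, not part of the paper.

```python
import numpy as np

def linear_cca(X, Y, k, reg=1e-8):
    """Closed-form CCA: whiten each view, then SVD the whitened
    cross-covariance. X: (n, dx), Y: (n, dy); returns top-k projections."""
    n = X.shape[0]
    X = X - X.mean(axis=0)                      # center each view
    Y = Y - Y.mean(axis=0)
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx)).T   # Cxx^{-1/2} (up to rotation)
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    P, s, Qt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    U = Wx @ P[:, :k]                           # satisfies U^T Cxx U = I_k
    V = Wy @ Qt.T[:, :k]
    return U, V, s[:k]                          # s: canonical correlations
```

On two views sharing a common latent signal, the leading value in `s` approaches 1, while the whitening constraint of (1) holds exactly at the returned solution.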
Given a matrix $Z \in \mathbb{R}^{d \times k}$, the $\ell_{2,1}$-norm $\|Z\|_{2,1}=\sum_{i=1}^{d}\|z^i\|_2$ and the nuclear norm $\|Z\|_*=\sum_i \sigma_i(Z)$, where $z^i$ is the $i$th row of $Z$ and $\sigma_i(Z)$ denotes the $i$th singular value of the matrix $Z$.
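Both norms can be computed directly; the small NumPy sketch below (function names ours) makes the contrast concrete: the $\ell_{2,1}$-norm sums row-wise Euclidean norms, while the nuclear norm sums singular values.

```python
import numpy as np

def l21_norm(Z):
    """l2,1-norm: sum of the Euclidean norms of the rows of Z.
    Driving it down zeroes out whole rows, i.e. joint feature selection."""
    return np.sqrt((Z ** 2).sum(axis=1)).sum()

def nuclear_norm(Z):
    """Nuclear (trace) norm: sum of singular values of Z,
    the tightest convex relaxation of rank(Z)."""
    return np.linalg.svd(Z, compute_uv=False).sum()
```

For example, for $Z=[[3,4],[0,0]]$ both norms equal 5: one nonzero row of length 5, and a single nonzero singular value 5.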
Now we incorporate both the $\ell_{2,1}$-norm and the rank function into the classical CCA as follows.
$$\min_{U,V}\ \frac{1}{2}\|XU-YV\|_F^2+\lambda_1\|W\|_{2,1}+\lambda_2\,\mathrm{rank}(W) \quad \text{s.t.}\quad U^\top X^\top XU=I_k,\ V^\top Y^\top YV=I_k, \tag{2}$$

where $W=[U^\top,V^\top]^\top \in \mathbb{R}^{(d_x+d_y)\times k}$ is a concatenated matrix, and $\lambda_1,\lambda_2 \ge 0$ are trade-off parameters. The former controls the $\ell_{2,1}$-norm for joint feature selection on the two feature spaces simultaneously. The rank of the concatenated matrix is handled by the latter; that is, the larger $\lambda_2$ is, the lower the rank is.
It is clear that the rank function is non-continuous, non-differentiable and nonconvex. Hence, we use the nuclear norm instead of the rank function, which has been proven to be the tightest convex relaxation of the rank function [32]. As a result, (2) is reformulated as
$$\min_{U,V}\ \frac{1}{2}\|XU-YV\|_F^2+\lambda_1\|W\|_{2,1}+\lambda_2\|W\|_* \quad \text{s.t.}\quad U^\top X^\top XU=I_k,\ V^\top Y^\top YV=I_k, \tag{3}$$
where $\|\cdot\|_*$ is the nuclear norm. We herein define the combination of the $\ell_{2,1}$-norm with the nuclear norm as a robust matrix elastic net (RMEN), and (3) is defined as the RMEN-CCA.
Although the AppGrad [10] is an efficient iterative approach for CCA, it is nontrivial to solve (3) directly by the AppGrad. To tackle such a complicated nonconvex problem, we derive, based on the AppGrad, an accelerated iterative approach with proved convergence. The details will be illustrated in Section III-C.
III-B Kernel extension
In this section, we take advantage of the kernel trick to extend the proposed RMEN-CCA to a nonlinear version capable of handling highly complicated relationships.
There exists a feature mapping $\phi_x:\mathbb{R}^{d_x}\to\mathcal{H}_x$ satisfying the following condition: for any two points $x_i,x_j$, we have $\kappa_x(x_i,x_j)=\langle\phi_x(x_i),\phi_x(x_j)\rangle$, where $\kappa_x$ is a Mercer kernel and $\langle\cdot,\cdot\rangle$ is an inner product. We deal with $Y$ in the same fashion. Let $\Phi_x$ and $\Phi_y$ be the new feature spaces, with kernel matrices $K_x=\Phi_x\Phi_x^\top$ and $K_y=\Phi_y\Phi_y^\top$. Following the Representer Theorem [33], the optimal solutions to (3) can be spanned by $\Phi_x$ and $\Phi_y$. Therefore, we have $U=\Phi_x^\top A$ and $V=\Phi_y^\top B$, where $A,B\in\mathbb{R}^{n\times k}$ are the true canonical variables for the KRMEN-CCA. Consequently, the formula of the KRMEN-CCA is shown as
$$\min_{A,B}\ \frac{1}{2}\|K_xA-K_yB\|_F^2+\lambda_1\|C\|_{2,1}+\lambda_2\|C\|_* \quad \text{s.t.}\quad A^\top K_x^2A=I_k,\ B^\top K_y^2B=I_k, \tag{4}$$

where $C=[A^\top,B^\top]^\top$.
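In practice the kernel matrices $K_x$ and $K_y$ are formed once from the training samples. For the Gaussian kernel used in the experiments, a minimal vectorized sketch (function name ours) is:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix of the Gaussian (RBF) Mercer kernel:
    K[i, j] = exp(-||a_i - b_j||^2 / (2 * sigma^2))."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

# K_x = gaussian_kernel(X, X); K_y = gaussian_kernel(Y, Y)
# then K_x, K_y play the roles of X, Y in the linear updates.
```

The `np.maximum(sq, 0.0)` guard clips tiny negative squared distances caused by floating-point cancellation.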
III-C An accelerated iterative algorithm for the RMEN-CCA
It is nontrivial to tackle (3) directly by the AppGrad [10] because of the imposed RMEN. Therefore, based on the AppGrad baseline, we derive an accelerated iterative algorithm to address this nonconvex optimization problem.
Firstly, we simplify the $\ell_{2,1}$-norm. The gradient of the $\ell_{2,1}$-norm behaves unpredictably close to the origin [34]. To overcome this limitation, we need to define a function $\phi$ which satisfies all of the following conditions.
Proposition 1.
[34] Let $\phi$ be a function satisfying all of the following conditions:

$x \mapsto \phi(x)$ is convex on $\mathbb{R}$,

$x \mapsto \phi(\sqrt{x})$ is concave on $\mathbb{R}_+$,

$\phi(x)=\phi(-x)$, $\forall x \in \mathbb{R}$,

$\phi$ is $C^1$ on $\mathbb{R}$,

$\phi''(0^+)>0$, $\lim_{x\to\infty}\phi(x)/x^2=0$.
In this paper, we determine the function as $\phi(x)=\sqrt{x^2+\varepsilon}$, where $\varepsilon$ is a small perturbation granting the norm smoothness and differentiability. It is clear that the defined function fulfills all conditions in Proposition 1.
After defining the function $\phi$, the following Lemma helps minimize this function in a half-quadratic way [35].
Lemma 1.
[34] Given $\phi(x)$, there exists a conjugate function $\varphi$, such that

$$\phi(x)=\min_{p>0}\left(p\,x^{2}+\varphi(p)\right), \tag{5}$$

where $p$ is determined by the minimizer function $\delta(x)=\frac{1}{2\sqrt{x^{2}+\varepsilon}}$ w.r.t. $x$.
Following the above Lemma, we can rewrite the objective function in (3) in terms of traces as follows.
$$\min_{U,V}\ \frac{1}{2}\|XU-YV\|_F^2+\lambda_1\left[\operatorname{tr}(U^\top D_uU)+\operatorname{tr}(V^\top D_vV)\right]+\lambda_2\|W\|_* \quad \text{s.t.}\quad U^\top X^\top XU=I_k,\ V^\top Y^\top YV=I_k, \tag{6}$$

where $D_u$ and $D_v$ are diagonal matrices. $D_u$ and $D_v$ can be calculated as

$$(D_u)_{ii}=\frac{1}{2\sqrt{\|u^i\|_2^2+\varepsilon}},\qquad (D_v)_{ii}=\frac{1}{2\sqrt{\|v^i\|_2^2+\varepsilon}}, \tag{7}$$

where $u^i$ and $v^i$ are the $i$th row of matrix $U$ and the $i$th row of matrix $V$, respectively, and $\varepsilon$ is a small smoothing term [36].
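The reweighting in (7) is a one-liner in NumPy; the sketch below (names ours) also illustrates the identity that motivates it: with these weights, $\operatorname{tr}(U^\top D_u U)\approx\frac{1}{2}\|U\|_{2,1}$ up to the smoothing term.

```python
import numpy as np

def l21_reweights(M, eps=1e-12):
    """Diagonal of the half-quadratic reweighting matrix for the smoothed
    l2,1-norm: d_i = 1 / (2 * sqrt(||m^i||_2^2 + eps)) for each row m^i."""
    return 1.0 / (2.0 * np.sqrt((M ** 2).sum(axis=1) + eps))
```

For $U$ with rows $[3,4]$ and $[0,0]$, the weights are roughly $[0.1, \text{large}]$, and $\sum_i d_i\|u^i\|_2^2 \approx 2.5$, half of $\|U\|_{2,1}=5$.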
Subsequently, for the nuclear norm, the following Lemma presents a well-known variational formula.
Lemma 2.
For any matrix $W$, the nuclear norm admits the variational form $\|W\|_*=\frac{1}{2}\inf_{S\succ 0}\left[\operatorname{tr}(W^\top S^{-1}W)+\operatorname{tr}(S)\right]$.
Using the property of the nuclear norm, we can further simplify (6) as

$$\min_{U,V}\ \frac{1}{2}\|XU-YV\|_F^2+\lambda_1\left[\operatorname{tr}(U^\top D_uU)+\operatorname{tr}(V^\top D_vV)\right]+\frac{\lambda_2}{2}\left[\operatorname{tr}(W^\top S^{-1}W)+\operatorname{tr}(S)\right] \quad \text{s.t.}\quad U^\top X^\top XU=I_k,\ V^\top Y^\top YV=I_k. \tag{10}$$
Likewise to the $\ell_{2,1}$-norm [36], we also impose an additional smoothing term $\delta$ to guarantee convergence [29]. As a result, the infimum over $S$ is achieved when

$$S=\left(WW^\top+\delta I\right)^{\frac{1}{2}}. \tag{11}$$
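Eq. (11) can be evaluated through an eigendecomposition of the symmetric matrix $WW^\top+\delta I$. The sketch below (names ours) also verifies numerically that plugging this $S$ into the variational form of Lemma 2 recovers the nuclear norm.

```python
import numpy as np

def nuclear_aux(W, delta=1e-12):
    """S = (W W^T + delta * I)^{1/2}: the minimizer of the variational
    form of the nuclear norm, smoothed by delta for invertibility."""
    G = W @ W.T + delta * np.eye(W.shape[0])
    evals, P = np.linalg.eigh(G)                    # G is symmetric PSD
    return (P * np.sqrt(np.maximum(evals, 0.0))) @ P.T
```

For a diagonal $W=\mathrm{diag}(3,4)$, $S\approx\mathrm{diag}(3,4)$ and $\frac{1}{2}[\operatorname{tr}(W^\top S^{-1}W)+\operatorname{tr}(S)]\approx 7=\|W\|_*$.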
Finally, we have the final optimization formula over $U$ and $V$ for given $D_u$, $D_v$ and $S$ as

$$\min_{U,V}\ \frac{1}{2}\|XU-YV\|_F^2+\lambda_1\left[\operatorname{tr}(U^\top D_uU)+\operatorname{tr}(V^\top D_vV)\right]+\frac{\lambda_2}{2}\operatorname{tr}(W^\top S^{-1}W) \quad \text{s.t.}\quad U^\top X^\top XU=I_k,\ V^\top Y^\top YV=I_k. \tag{12}$$
To find the solutions to (12), according to the AppGrad [10], we derive an accelerated iterative approach that is similar in spirit to the AppGrad. We also introduce an unnormalized pair $(\tilde{U},\tilde{V})$, which is updated via gradient descent with momentum at each iteration. Subsequently, at the same iteration, the true canonical pair is updated from the resulting unnormalized pair. However, different from the AppGrad, the novel iterative approach needs to calculate the intermediate variables $D_u$, $D_v$ and $S$ from the true canonical pair at each iteration. These intermediate variables are then fed into the following updating step to calculate the unnormalized pair (see the loop in Algorithm 1).
At each iteration, we take the partial derivative of the objective function with respect to $\tilde{U}$, and then update $\tilde{U}$ from the past time step to the current one using the momentum.
(13) 
where $\gamma$ is a momentum coefficient that controls the momentum and $\eta$ is a learning rate.
After updating $\tilde{U}$, we then calculate the true canonical variable $U$ such that $U^\top X^\top XU=I_k$.
(14) 
where $P$ and $\Sigma$ are a unitary matrix and a rectangular diagonal matrix, respectively.
Likewise to (13) and (14), we can calculate the other true canonical variable $V$ in the same fashion.
(15) 
(16) 
To sum up, the pseudocode of the RMEN-CCA is summarized in Algorithm 1.
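Algorithm 1 is not reproduced here, but the loop described above can be sketched in a few lines of NumPy. This is our paraphrase of the prose, not the authors' code: the exact gradient form, the point at which the gradient is evaluated, and all names are assumptions, and the normalization follows the SVD route of (14).

```python
import numpy as np

def rmen_cca_step(X, Y, U, V, Mu, Mv, lam1=1e-2, lam2=1e-3,
                  eta=1e-3, gamma=0.9, eps=1e-8, delta=1e-8):
    """One sketched iteration: intermediate variables D_u, D_v, S from the
    current canonical pair, a momentum gradient step to the unnormalized
    pair, then SVD normalization back onto the constraint set."""
    dx, k = U.shape
    W = np.vstack([U, V])
    # Intermediate variables (l2,1 reweighting and nuclear-norm auxiliary)
    du = 1.0 / (2.0 * np.sqrt((U ** 2).sum(1) + eps))
    dv = 1.0 / (2.0 * np.sqrt((V ** 2).sum(1) + eps))
    ev, P = np.linalg.eigh(W @ W.T + delta * np.eye(W.shape[0]))
    S_inv = (P / np.sqrt(np.maximum(ev, delta))) @ P.T   # (W W^T + d I)^{-1/2}
    # Gradients of the smoothed objective, evaluated at the normalized pair
    R = X @ U - Y @ V
    gu = X.T @ R + 2.0 * lam1 * du[:, None] * U + lam2 * (S_inv @ W)[:dx]
    gv = -(Y.T @ R) + 2.0 * lam1 * dv[:, None] * V + lam2 * (S_inv @ W)[dx:]
    Mu = gamma * Mu - eta * gu                           # momentum updates
    Mv = gamma * Mv - eta * gv
    Ut, Vt = U + Mu, V + Mv                              # unnormalized pair
    # Normalize so that U^T X^T X U = I_k (thin SVD of the projected data)
    _, s, Qt = np.linalg.svd(X @ Ut, full_matrices=False)
    U = Ut @ Qt.T / s
    _, s, Qt = np.linalg.svd(Y @ Vt, full_matrices=False)
    V = Vt @ Qt.T / s
    return U, V, Mu, Mv
```

Each call returns iterates that satisfy the whitening constraints exactly, mirroring the AppGrad-style alternation between a cheap gradient step and a cheap normalization.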
Moreover, we can address the KRMEN-CCA directly by Algorithm 1, using the two kernel matrices to replace $X$ and $Y$, respectively.
In order to further boost the generalization performance, we also present a stochastic iterative algorithm for the RMEN-CCA, as shown in Algorithm 2.
IV Convergence Analysis of the RMEN-CCA
Consistently fast convergence illustrates the practicality and efficiency of an algorithm. For this reason, it is worthwhile to discuss the convergence of the RMEN-CCA. In this section, we theoretically analyze the convergence of the RMEN-CCA in Theorem 1. Additionally, the empirical study in Section V further illustrates the quick convergence of the RMEN-CCA in some real-world tasks. In order to prove Theorem 1, we first show the following Lemma, which is helpful in proving the theorem. Lemma 3 not only reveals the relationship between the optimal true canonical pair and its unnormalized counterpart, but also interprets the novel iterative approach as an approximate gradient scheme for addressing the RMEN-CCA.
Lemma 3.
Let $(U^*,V^*)$ and $(\tilde{U}^*,\tilde{V}^*)$ be the optimal true canonical pair and the optimal unnormalized pair, respectively. Then, we have $(\tilde{U}^*,\tilde{V}^*)=(U^*\Lambda,V^*\Lambda)$, where $\Lambda$ is the canonical correlation diagonal matrix.
Proof.
Let
(17)  
(18) 
We only prove that the result holds for one variable; a similar result also holds for the other variable.
Following the optimality condition, we have
(19) 
Using Lemmas 1.1 and 2.2 in [10], we can directly obtain the desired result as follows.
(20) 
A similar argument gives the corresponding result for the other variable.
We complete the proof. ∎
Following the above Lemma, Theorem 1 illustrates the main convergence result of the RMENCCA.
Theorem 1.
The sequence of leading canonical pairs generated by the RMEN-CCA converges to the optimal canonical pair $(U^*,V^*)$.
Proof.
Lemma 3 shows the relationship between the two canonical pairs. Therefore, we now only consider the unnormalized canonical variables. The newly derived iterative approach is an alternating optimization; that is, at each iteration, we update one canonical variable while the other is fixed. Therefore, we can reformulate the optimization form of the RMEN-CCA at the $t$th iteration as
(21)  
(22) 
According to [27], we show that the second term monotonically decreases at each iteration. At the $t$th iteration, we have
(23) 
Eq. (23) indicates that
(24) 
(24) can be rewritten as
(25) 
Following the obvious inequality $\sqrt{a}-\frac{a}{2\sqrt{b}}\le\sqrt{b}-\frac{b}{2\sqrt{b}}$ for any $a,b>0$, for each $i$, we have
(26) 
Summing over all the above inequalities, we have
(27) 
That is to say,
(29) 
which is the desired result.
Moreover, Lemma 1 in [28] indicates that the third term converges to the optimum, since the nuclear norm is the infimum over $S$.
Now we revisit the objective function of (3), which consists of three norms; thus, the objective function is convex and its value is not less than 0. When using gradient descent, the objective function still moves in the negative direction of the gradient. Therefore, the objective values at the unnormalized canonical variables monotonically decrease as $t\to\infty$, and the variables converge to the optimal unnormalized canonical variables $(\tilde{U}^*,\tilde{V}^*)$.
Furthermore, we revisit the original optimization problem of the RMEN-CCA. Following Lemma 3, when the unnormalized pair converges to $(\tilde{U}^*,\tilde{V}^*)$ as $t\to\infty$, the true canonical pair also converges to $(U^*,V^*)$. We obtain the desired result and complete the proof. ∎
V Experimental results and discussion
In this section, we conduct several experiments on four popular datasets to evaluate the RMEN-CCA: MNIST [37], Wiki [38], Pascal VOC 2007 [39] and URL Reputation [40]. We randomly select a portion of the training data as the validation data for each dataset, and the hyperparameters are tuned on the validation set. The description of each dataset follows, and the statistics of these four datasets are summarized in Table I.

The MNIST is a handwritten digit recognition dataset, in which each image is separated into the left and right halves.

The Wiki is a popular image-text matching dataset, which was used in [20, 38, 41]. The Wiki dataset is assembled from Wikipedia's articles and consists of 2866 image-text pairs (2173 training data and 693 test data). The features of images and texts are extracted by a 128-dim SIFT descriptor and a 10-dim latent Dirichlet allocation model, respectively. Some examples are shown in Figure 1.

The Pascal VOC 2007 is a challenging and realistic image-tag matching testbed. In order to carry out more experiments, three image features and three tag features are extracted as in [42]: Gist, HSV color histograms and bag-of-visual-words (BoW) histograms, as well as word frequency (Wc), relative tag rank (Rel) and absolute tag rank (Abs). Some examples are shown in Figure 2.

The URL Reputation is a large-scale dataset for online learning algorithms. Due to limitations of computational resources, we only choose a subset of the samples and attributes.
Problems  Description  Training set  Test set  Dim X  Dim Y 
MNIST  Left and Right Halves of Images  60,000  10,000  392  392 
Wiki  ImageText Pairs  2,173  693  128  10 
Pascal VOC 2007  Image and Its MultiLabels  5,011  4,952  512 (Gist)  399 (Wc) 
200 (BoW)  399 (Rel)
64 (HSV)  399 (Abs)
URL Reputation  Host and Lexical based Features  1,000,000  20,000  50  50 
Moreover, we measure the test results by the commonly used Pearson product-moment correlation coefficient (PCC). The PCC is defined as $\rho(a,b)=\frac{\operatorname{cov}(a,b)}{\sigma_a\sigma_b}$, where $a$ and $b$ are two projected test samples, $\operatorname{cov}(a,b)$ is the covariance, and $\sigma_a$ and $\sigma_b$ are the standard deviations of $a$ and $b$, respectively. The range of the reported PCC is from 100 to 0, in which 100 denotes complete correlation and 0 denotes no correlation. Ten trials are conducted for each algorithm, and we report the average PCC results. In the experiments, we found that the dimension of the canonical subspace has almost no effect on the performance. Therefore, except for MNIST (whose subspace dimension is set separately), we calculate the top-5-dimensional canonical subspace for the other three datasets. All simulations are carried out in a Matlab 2015b environment running on a PC with an Intel(R) Core(TM) i7-6700HD 2.60 GHz CPU, an NVIDIA GTX 960M GPU and 8 GB of RAM.
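The PCC used above is straightforward to compute; a minimal NumPy sketch (function name ours) for a pair of projected vectors:

```python
import numpy as np

def pcc_percent(a, b):
    """Pearson product-moment correlation of two projected test samples,
    scaled to a percentage (100 = complete positive correlation)."""
    a = a - a.mean()
    b = b - b.mean()
    return 100.0 * (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
```

Note that a raw Pearson coefficient lies in $[-1,1]$; the reported range of 100 to 0 corresponds to the nonnegative correlations observed between matched views.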
V-A Hyperparameter selection
In the RMEN-CCA, we initialize the true canonical pair by drawing i.i.d. samples from the standard Gaussian distribution, and set the unnormalized pair and the momentum to zero matrices. We empirically found that the RMEN-CCA is insensitive to the learning rate and the trade-off coefficients. When the value of the first trade-off coefficient is 10 times that of the second, the RMEN-CCA can achieve the best performance; hence, we fix these values on all four datasets. According to [43], the momentum coefficient is set to 0.9. For the stochastic iterative algorithm, a small part of the training data is held out. We use the stochastic iterative approach for the RMEN-CCA in all experiments.
We choose the commonly used Gaussian kernel for all kernel-based methods, and the kernel parameter is chosen empirically on the validation set. In addition, the numbers of random projections of both FKCCA and NKCCA [7], as well as that of approximate KCCA (KNOI) [12], are chosen from candidate sets on the validation set. Other user-defined parameters are determined empirically, and we pick the configuration having the best performance. Due to space limitations, we omit the procedure of selecting hyperparameters in this paper.
V-B Convergence
In this section, we use the accuracy curve, rather than the convergence curve, to illustrate the convergence of the RMEN-CCA on MNIST and Wiki, which is more visually informative for the convergence analysis. As the number of iterations increases, the accuracy of the RMEN-CCA improves; when the accuracy curve is almost flat, we consider that the RMEN-CCA has almost converged. In Figure (a), we note that there is a jump in the curve when the number of iterations is about 30, and the accuracy of the RMEN-CCA improves afterwards. When the iteration count reaches 150, the RMEN-CCA has almost converged. Therefore, we set the number of iterations to 150 on MNIST. Moreover, in Figure (b), we only illustrate an interval of the accuracy curve of the RMEN-CCA on Wiki, from iteration 150 to 600, because of the different dimensions of the two 'views'. We note that the RMEN-CCA almost converges when the iteration count reaches 500, and thereafter the standard deviation decreases significantly. Therefore, balancing effectiveness and efficiency, we set the number of iterations to 500. Furthermore, we will investigate the convergence of the RMEN-CCA on the large-scale URL Reputation dataset in comparison with the Scalable CCA. Due to space limitations, we omit the convergence analysis of the RMEN-CCA on Pascal VOC 2007.
V-C Performance
V-C1 Results on MNIST
In this paper, we only consider unsupervised learning approaches as comparisons. Hence, the compared methods include linear CCA, partial least squares (PLS) [44], bilinear models (BLM) [45, 46], FKCCA and NKCCA [7], approximate KCCA (KNOI) [12], and Scalable CCA [10]. To better illustrate the effectiveness of the RMEN-CCA, we also derive CCA with the basic MEN, termed MEN-CCA. The KNOI is carried out on the GPU, while the other algorithms are run on the CPU. The experimental results in terms of PCC and time are shown in Table II.
Algorithms  PCC(%)  time(sec) 

RMEN-CCA  10.04
MEN-CCA  90.82  8.19
KNOI  87.26  258.52 (GPU) 
FKCCA  82.61  232.87 
NKCCA  84.92  257.69 
Scalable CCA  56.87  6.39 
Linear CCA  58.43  3.08 
PLS  58.16  3.51 
BLM  58.68  2.89 
As seen from Table II, the RMEN-CCA has the best performance among all comparisons. The RMEN-CCA remarkably outperforms the Scalable CCA in terms of accuracy, since the RMEN-CCA benefits from the strength of coupled feature selection. The RMEN-CCA takes a little more time than the Scalable CCA, since the RMEN-CCA needs to calculate $S$ using the SVD, whose computational complexity is cubic in the number of input data. But the time cost of the RMEN-CCA is much less than that of all kernel-based methods, even though the KNOI is carried out on the GPU. We note that the above experimental results in terms of time may be unfair to the RMEN-CCA, since all kernel-based algorithms are sensitive to the user-specified kernel width. As a result, many experiments on the validation set had to be carried out for all kernel-based algorithms in order to achieve the reported results, and these additional time costs are not included in Table II.
V-C2 Results on Wiki
Due to the limitations of computational resources, we only evaluate the KRMEN-CCA on this dataset. In addition, the KCCA [11] is also used as a comparison on this dataset. The experimental results are illustrated in Table III.
From Table III, except for the KRMEN-CCA and KCCA, the RMEN-CCA outperforms all other comparisons. The KCCA slightly outperforms the RMEN-CCA, but the KCCA requires about 1,120 times the training time of the RMEN-CCA. Remarkably, the KRMEN-CCA achieves very high accuracy, while its time cost is only about one-third of that of the KCCA.
Algorithms  PCC(%)  time(sec) 
RMEN-CCA  51.28  1.27
MEN-CCA  50.33  1.09
KRMEN-CCA  465.71
KCCA  58.54  1427.67 
KNOI  49.61  213.45 
FKCCA  50.12  143.61 
NKCCA  50.64  159.28 
Scalable CCA  46.19  0.43 
Linear CCA  46.36  0.07 
PLS  46.32  0.08 
BLM  46.92  0.07 
V-C3 Results on Pascal VOC 2007
In order to better illustrate the effectiveness and efficiency of the RMEN-CCA, we not only evaluate the RMEN-CCA against some state-of-the-art algorithms on the image-to-tag features of Pascal VOC 2007, but also conduct an additional group of experiments on its image-to-image features.
As shown by both Tables IV and V, the RMEN-CCA outperforms all comparisons in terms of accuracy at a much faster learning speed.
Algorithms  Gist-Wc  Gist-Rel  Gist-Abs
PCC(%)  time(sec)  PCC(%)  time(sec)  PCC(%)  time(sec)  
RMEN-CCA  0.89  0.96  0.95
MEN-CCA  60.29  0.71  63.18  0.74  57.10  0.82
KNOI  54.51  522.88  57.63  516.79  56.21  523.24 
FKCCA  55.69  157.18  55.41  153.84  56.63  154.68 
NKCCA  55.63  163.24  55.77  143.64  57.29  171.46 
Scalable CCA  51.53  0.88  51.36  0.84  53.56  0.79 
Linear CCA  51.77  0.87  52.40  0.81  53.63  0.81 
PLS  51.75  0.82  51.65  0.83  53.27  0.80 
BLM  52.08  0.92  51.94  0.85  53.91  0.89 
Algorithms  Gist-BoW  Gist-HSV  HSV-BoW
PCC(%)  time(sec)  PCC(%)  time(sec)  PCC(%)  time(sec)  
RMEN-CCA  0.84  0.81  0.81
MEN-CCA  61.41  0.74  56.82  0.69  50.83  0.68
KNOI  61.54  333.22  54.27  392.74  47.66  375.42 
FKCCA  61.60  156.62  50.83  148.60  50.76  164.53 
NKCCA  61.83  147.27  53.19  166.24  50.69  187.27 
Scalable CCA  47.99  0.82  49.76  0.71  41.91  0.67
Linear CCA  49.63  0.72  50.18  0.68  43.48  0.56 
PLS  48.23  0.79  50.05  0.69  43.17  0.62 
BLM  48.99  0.72  51.33  0.67  44.08  0.58 
V-C4 Results on URL Reputation
We only compare with the Scalable CCA on the large-scale URL Reputation dataset. The previous work [10] has shown that the Scalable CCA achieves excellent performance on this dataset, and thus we omit other methods to avoid duplication of work. Classical CCA methods based on eigenvector computation fail on a typical PC, since these approaches are trained over the entire training set in a batch learning fashion; consequently, they are prohibitive for huge datasets. However, like the Scalable CCA, the RMEN-CCA is an online learning algorithm, which is a commonly used approach for huge datasets.
Algorithms  iteration  PCC(%)  time(sec) 

RMEN-CCA  200  22.04
Scalable CCA  200  35.814  21.93 
1000  41.144  31.44 
From Table VI, we see that the RMEN-CCA achieves better performance than the Scalable CCA. As shown by Figure 4, as the number of iterations increases, the Scalable CCA captures more correlations. However, we find that when the number of iterations exceeds 1,000, the Scalable CCA is not powerful enough to achieve better accuracy; that is, its PCC plateaus. Meanwhile, the RMEN-CCA significantly outperforms the Scalable CCA regardless of the latter's number of iterations. The experimental results on this large-scale dataset illustrate that the RMEN-CCA not only converges faster than the Scalable CCA, but is also more stable.
VI Conclusion
In this paper, we derive a novel robust matrix elastic net based canonical correlation analysis (RMEN-CCA) with theoretically guaranteed convergence and empirical proficiency. To the best of our knowledge, the RMEN-CCA is the first to impose coupled feature selection on CCA. As a consequence, the RMEN-CCA not only measures correlations between different 'views', but also distills relevant and useful features, which significantly improves its performance. Additionally, for the sake of modeling highly sophisticated nonlinear relationships, the RMEN-CCA can be extended straightforwardly to the kernel scenario. Furthermore, due to its complicated model architecture, it is nontrivial to solve the RMEN-CCA by existing optimization approaches. Therefore, we bridge the gap between the new optimization problem and a previous efficient iterative algorithm. Finally, competitive experimental results on four popular datasets confirm the effectiveness and efficiency of the RMEN-CCA in multi-view learning problems.
Acknowledgment
The authors would like to thank Prof. Daniel Palomar from Hong Kong University of Science and Technology for his great inspiration and valuable discussions on this work. The authors would also like to thank Prof. Ajay Joneja from Hong Kong University of Science and Technology for providing a PC and for constructive suggestions on this research.
References
 [1] J. Yu, Y. Rui, Y. Y. Tang, and D. Tao, “Highorder distancebased multiview stochastic learning in image classification,” IEEE transactions on cybernetics, vol. 44, no. 12, pp. 2431–2442, 2014.
 [2] X. Zhu, X. Li, and S. Zhang, “Blockrow sparse multiview multilabel learning for image classification,” IEEE transactions on cybernetics, vol. 46, no. 2, pp. 450–461, 2016.
 [3] P. Dhillon, D. P. Foster, and L. H. Ungar, “Multiview learning of word embeddings via CCA,” in Advances in Neural Information Processing Systems, 2011, pp. 199–207.
 [4] K. Wang, Q. Yin, W. Wang, S. Wu, and L. Wang, “A comprehensive survey on crossmodal retrieval,” arXiv preprint arXiv:1607.06215, 2016.
 [5] C. Xu, D. Tao, and C. Xu, “A survey on multiview learning,” arXiv preprint arXiv:1304.5634, 2013.
 [6] G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu, “Deep canonical correlation analysis.” in ICML (3), 2013, pp. 1247–1255.
 [7] D. LopezPaz, S. Sra, A. J. Smola, Z. Ghahramani, and B. Schölkopf, “Randomized nonlinear component analysis.” in ICML, 2014, pp. 1359–1367.
 [8] D. Chu, L.Z. Liao, M. K. Ng, and X. Zhang, “Sparse canonical correlation analysis: new formulation and algorithm,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 12, pp. 3050–3065, 2013.
 [9] D. R. Hardoon and J. ShaweTaylor, “Sparse canonical correlation analysis,” Machine Learning, vol. 83, no. 3, pp. 331–353, 2011.
 [10] Z. Ma, Y. Lu, and D. Foster, “Finding linear structure in large datasets with scalable canonical correlation analysis,” in Proc. of the 32nd Int. Conf. Machine Learning (ICML 2015), 2015, pp. 169–178.
 [11] F. R. Bach and M. I. Jordan, “Kernel independent component analysis,” Journal of machine learning research, vol. 3, no. Jul, pp. 1–48, 2002.
 [12] W. Wang and K. Livescu, “Largescale approximate kernel canonical correlation analysis,” arXiv preprint arXiv:1511.04773, 2015.
 [13] B. Xie, Y. Liang, and L. Song, “Scale up nonlinear component analysis with doubly stochastic gradients,” in Advances in Neural Information Processing Systems, 2015, pp. 2341–2349.
 [14] W. Wang, J. Wang, D. Garber, and N. Srebro, “Efficient globally convergent stochastic optimization for canonical correlation analysis,” in Advances in Neural Information Processing Systems, 2016, pp. 766–774.
 [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [16] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
 [17] S. Gaiffas and G. Lecué, “Sharp oracle inequalities for highdimensional matrix prediction,” IEEE Transactions on Information Theory, vol. 57, no. 10, pp. 6942–6957, 2011.
 [18] H. Li, N. Chen, and L. Li, “Error analysis for matrix elasticnet regularization algorithms,” IEEE transactions on neural networks and learning systems, vol. 23, no. 5, pp. 737–748, 2012.
 [19] X. Zhen, M. Yu, X. He, and S. Li, “Multitarget regression via robust lowrank learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
 [20] K. Wang, R. He, W. Wang, L. Wang, and T. Tan, “Learning coupled feature spaces for crossmodal matching,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2088–2095.
 [21] Z. Luo, B. Peng, D.A. Huang, A. Alahi, and L. FeiFei, “Unsupervised learning of longterm motion dynamics for videos,” arXiv preprint arXiv:1701.01821, 2017.
 [22] S. Singh, A. Okun, and A. Jackson, “Artificial intelligence: Learning to play Go from scratch,” Nature, vol. 550, no. 7676, pp. 336–337, 2017.
 [23] M. Mirza, A. Courville, and Y. Bengio, “Generalizable features from unsupervised learning,” arXiv preprint arXiv:1612.03809, 2016.
 [24] A. Argyriou, T. Evgeniou, and M. Pontil, “Multitask feature learning,” in Advances in neural information processing systems, 2007, pp. 41–48.
 [25] Q. Gu, Z. Li, and J. Han, “Joint feature selection and subspace learning,” in IJCAI ProceedingsInternational Joint Conference on Artificial Intelligence, vol. 22, no. 1. Citeseer, 2011, p. 1294.
 [26] L. Du, P. Zhou, L. Shi, H. Wang, M. Fan, W. Wang, and Y.-D. Shen, “Robust multiple kernel k-means using l21-norm,” in Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
 [27] F. Nie, H. Huang, X. Cai, and C. H. Ding, “Efficient and robust feature selection via joint ℓ2,1-norms minimization,” in Advances in neural information processing systems, 2010, pp. 1813–1821.
 [28] C.J. Hsieh and P. A. Olsen, “Nuclear norm minimization via active subspace selection.” in ICML, 2014, pp. 575–583.
 [29] E. Grave, G. R. Obozinski, and F. R. Bach, “Trace lasso: a trace norm regularization for correlated designs,” in Advances in Neural Information Processing Systems, 2011, pp. 2187–2195.
 [30] S. Gu, L. Zhang, W. Zuo, and X. Feng, “Weighted nuclear norm minimization with application to image denoising,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2862–2869.
 [31] G. H. Golub and H. Zha, “The canonical correlations of matrix pairs and their numerical computation,” in Linear algebra for signal processing. Springer, 1995, pp. 27–49.
 [32] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed minimumrank solutions of linear matrix equations via nuclear norm minimization,” SIAM review, vol. 52, no. 3, pp. 471–501, 2010.
 [33] F. Dinuzzo and B. Schölkopf, “The representer theorem for hilbert spaces: a necessary and sufficient condition,” in Advances in neural information processing systems, 2012, pp. 189–196.
 [34] R. He, T. Tan, L. Wang, and W.-S. Zheng, “ℓ2,1 regularized correntropy for robust feature selection,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2504–2511.
 [35] M. Nikolova and M. K. Ng, “Analysis of halfquadratic minimization methods for signal and image recovery,” SIAM Journal on Scientific computing, vol. 27, no. 3, pp. 937–966, 2005.
 [36] I. F. Gorodnitsky and B. D. Rao, “Sparse signal reconstruction from limited data using focuss: A reweighted minimum norm algorithm,” IEEE Transactions on signal processing, vol. 45, no. 3, pp. 600–616, 1997.
 [37] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradientbased learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [38] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos, “A new approach to crossmodal multimedia retrieval,” in Proceedings of the 18th ACM international conference on Multimedia. ACM, 2010, pp. 251–260.
 [39] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results,” http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
 [40] J. Ma, L. K. Saul, S. Savage, and G. M. Voelker, “Identifying suspicious urls: an application of largescale online learning,” in Proceedings of the 26th annual international conference on machine learning. ACM, 2009, pp. 681–688.
 [41] J. C. Pereira, E. Coviello, G. Doyle, N. Rasiwasia, G. R. Lanckriet, R. Levy, and N. Vasconcelos, “On the role of correlation and abstraction in crossmodal multimedia retrieval,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 3, pp. 521–535, 2014.
 [42] S. J. Hwang and K. Grauman, “Accounting for the relative importance of objects in image retrieval.” in BMVC, vol. 1, no. 2, 2010, p. 5.
 [43] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural networks, vol. 12, no. 1, pp. 145–151, 1999.
 [44] R. Rosipal and N. Krämer, “Overview and recent advances in partial least squares,” Lecture notes in computer science, vol. 3940, p. 34, 2006.
 [45] A. Sharma, A. Kumar, H. Daume, and D. W. Jacobs, “Generalized multiview analysis: A discriminative latent space,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2160–2167.
 [46] J. B. Tenenbaum and W. T. Freeman, “Separating style and content,” in Advances in neural information processing systems, 1997, pp. 662–668.