# Graph Multiview Canonical Correlation Analysis

###### Abstract

Multiview canonical correlation analysis (MCCA) seeks latent low-dimensional representations encountered with multiview data of shared entities (a.k.a. common sources). However, existing MCCA approaches do not exploit the geometry of the common sources, which may be available a priori, or can be constructed using certain domain knowledge. This prior information about the common sources can be encoded by a graph, and be invoked as a regularizer to enrich the maximum variance MCCA framework. In this context, the present paper’s novel graph-regularized (G) MCCA approach minimizes the distance between the sought canonical variables and the common low-dimensional representations, while accounting for graph-induced knowledge of the common sources. Relying on a function capturing the extent to which the low-dimensional representations of the multiple views are similar, a generalization bound of GMCCA is established based on Rademacher complexity. Tailored for setups where the number of data pairs is smaller than the data vector dimensions, a graph-regularized dual MCCA approach is also developed. To further deal with nonlinearities present in the data, graph-regularized kernel MCCA variants are put forward too. Interestingly, solutions of the graph-regularized linear, dual, and kernel MCCA are all provided in terms of generalized eigenvalue decomposition. Several corroborating numerical tests using real datasets are provided to showcase the merits of the graph-regularized MCCA variants relative to several competing alternatives including MCCA, Laplacian-regularized MCCA, and (graph-regularized) PCA.

## I Introduction

In several applications, such as multi-sensor surveillance systems, multiple datasets are collected offering distinct views of the common information sources. With advances in data acquisition, it becomes easier to access heterogeneous data representing samples from multiple views in various scientific fields, including genetics, computer vision, data mining, and pattern recognition, to name a few. In genomics for instance, a patient’s lymphoma data set consists of gene expression, SNP, and array CGH measurements [34]. In a journal’s dataset, the title, keywords, and citations can be considered as three different views of a given paper [30]. Learning with heterogeneous data of different types is commonly referred to as multiview learning, and in different communities as information fusion or data integration from multiple feature sets. Multiview learning is an emerging field in data science with well-appreciated analytical tools and matching application domains [29].

Canonical correlation analysis (CCA) is a classical tool for multiview learning [14]. Formally, CCA looks for latent low-dimensional representations from a paired dataset comprising two views of several common entities. Multiview (M) CCA generalizes two-view CCA and also principal component analysis (PCA) [17], to handle jointly datasets from multiple views [19]. In contrast to PCA that operates on vectors formed by multi-view sub-vectors, MCCA is more robust to outliers per view, because it ignores the principal components per view that are irrelevant to the latent common sources. Popular MCCA formulations include the sum of correlations (SUMCOR), maximum variance (MAXVAR) [13], sum of squared correlations, the minimum variance, and generalized variance methods [19]. With the increasing capacity of data acquisition and the growing demand for multiview data analytics, the research on MCCA has been re-gaining attention recently.

To capture nonlinear relationships in the data, linear MCCA has been also generalized using (multi-)kernels or deep neural networks; see e.g., [35, 1, 32], that have well-documented merits for (nonlinear) dimensionality reduction of multiview data, as well as for multiview feature extraction. Recent research efforts have also focused on addressing the scalability issues in (kernel) MCCA, using random Fourier features [21], or leveraging alternating optimization advances [16] to account for sparsity [33, 31, 8, 16] or other types of structure-promoting regularizers such as nonnegativity and smoothness [10, 22].

Lately, graph-aware regularizers have demonstrated promising performance in a gamut of machine learning applications, such as dimensionality reduction, data reconstruction, clustering, and classification [15, 27, 24, 25, 11, 9]. CCA with structural information induced by a common source graph has been reported in [9], but it is limited to analyzing two views of data, and its performance has been tested only experimentally. Further, multigraph-encoded information provided by the underlying physics, or inferred from alternative views of the information sources, has not been investigated.

Building on but considerably going beyond our precursor work in [9], this paper introduces a novel graph-regularized (G) MCCA approach, and develops a bound on its generalization error performance. Our GMCCA is established by minimizing the difference between the low-dimensional representation of each view and the common representation, while also leveraging the statistical dependencies due to the common sources hidden in the views. These dependencies are encoded by a graph, which can be available from the given data, or can be deduced from correlations. A finite-sample statistical analysis of GMCCA is provided based on a regression formulation offering a meaningful error bound for unseen data samples using Rademacher’s complexity.

GMCCA is operational when there are sufficient data samples (larger than the number of features per view). For cases where the data are insufficient, we develop a graph-regularized dual (GD) MCCA scheme that avoids this limitation at lower computational complexity. To cope with nonlinearities present in real data, we further put forward a graph-regularized kernel (GK) MCCA scheme. Interestingly, the linear, dual, and kernel versions of our proposed GMCCA admit simple analytical-form solutions, each of which can be obtained by performing a single generalized eigenvalue decomposition.

Different from [4, 36], where MCCA is regularized using multiple graph Laplacians separately per view, GMCCA here jointly leverages a single graph effected on the common sources. This is of major practical importance, e.g., in electric power networks, where besides the power, voltage, and current quantities observed, the system operator has also access to the network topology [18] that captures the connectivity between substations through power lines.

Finally, our proposed GMCCA approaches are numerically tested using several real datasets on different machine learning tasks, including e.g., dimensionality reduction, recommendation, clustering, and classification. Corroborating tests showcase the merits of the GMCCA schemes relative to their competing alternatives such as MCCA, PCA, graph PCA, and the k-nearest neighbors (KNN) method.

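Notation: Bold uppercase (lowercase) letters denote matrices (column vectors). Operators $\mathrm{Tr}(\cdot)$, $(\cdot)^{-1}$, $\mathrm{vec}(\cdot)$, and $(\cdot)^{\top}$ stand for matrix trace, inverse, vectorization, and transpose, respectively; $\|\cdot\|_2$ denotes the $\ell_2$-norm of vectors; $\|\cdot\|_F$ the Frobenius norm of matrices; $\mathrm{diag}(\{a_i\}_{i=1}^{N})$ is an $N\times N$ diagonal matrix holding entries of $\{a_i\}_{i=1}^{N}$ on its main diagonal; $\langle\mathbf{a},\mathbf{b}\rangle$ denotes the inner product of same-size vectors $\mathbf{a}$ and $\mathbf{b}$; vector $\mathbf{0}$ has all zero entries whose dimension is clear from the context; and $\mathbf{I}$ is the identity matrix of suitable size.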

## II Preliminaries

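Consider $M$ datasets $\{\mathbf{X}_m\in\mathbb{R}^{D_m\times N}\}_{m=1}^{M}$ collected from $M\ge 2$ views of $N$ common source vectors $\{\check{\mathbf{s}}_i\in\mathbb{R}^{\rho}\}_{i=1}^{N}$ stacked as columns of $\check{\mathbf{S}}\in\mathbb{R}^{\rho\times N}$, where $D_m$ is the dimension of the $m$-th view data vectors, with possibly $\rho\ll\min_m D_m$. Vector $\mathbf{x}_{m,i}$ denotes the $i$-th column of $\mathbf{X}_m$, meaning the $i$-th datum of the $m$-th view, for all $i=1,\ldots,N$ and $m=1,\ldots,M$. Suppose without loss of generality that all per-view data vectors $\{\mathbf{x}_{m,i}\}_{i=1}^{N}$ have been centered. Two-view CCA works with datasets $\{\mathbf{x}_{1,i}\}_{i=1}^{N}$ and $\{\mathbf{x}_{2,i}\}_{i=1}^{N}$ from $M=2$ views. It looks for low-dimensional subspaces $\mathbf{U}_1\in\mathbb{R}^{D_1\times d}$ and $\mathbf{U}_2\in\mathbb{R}^{D_2\times d}$ with $d\le\rho$, such that the Euclidean distance between linear projections $\mathbf{U}_1^{\top}\mathbf{X}_1$ and $\mathbf{U}_2^{\top}\mathbf{X}_2$ is minimized. Concretely, classical CCA solves the following problem [12]

$$\min_{\mathbf{U}_1,\mathbf{U}_2}\;\big\|\mathbf{U}_1^{\top}\mathbf{X}_1-\mathbf{U}_2^{\top}\mathbf{X}_2\big\|_F^2 \tag{1a}$$

$$\text{s. to}\;\;\mathbf{U}_m^{\top}\big(\mathbf{X}_m\mathbf{X}_m^{\top}\big)\mathbf{U}_m=\mathbf{I},\quad m=1,2 \tag{1b}$$

where columns of $\mathbf{U}_m$ are called loading vectors of the data (view) $\mathbf{X}_m$; while projections $\{\mathbf{U}_m^{\top}\mathbf{X}_m\}_{m=1}^{2}$ are termed canonical variables; they satisfy (1b) to prevent the trivial solution; and, they can be viewed as low ($d$)-dimensional approximations of $\check{\mathbf{S}}$. Moreover, the solution of (1) is provided by a generalized eigenvalue decomposition [14].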
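The generalized eigenvalue route to (1) can be illustrated with a minimal NumPy sketch. It assumes the standard reformulation in which, under the whitening constraint, minimizing the distance between the two projections is the same as maximizing $\mathrm{Tr}(\mathbf{U}_1^{\top}\mathbf{X}_1\mathbf{X}_2^{\top}\mathbf{U}_2)$, which leads to a symmetric pencil built from the auto- and cross-covariance blocks. The function name `cca_two_view` and the Cholesky-based reduction are illustrative choices, not taken from the paper.

```python
import numpy as np

def cca_two_view(X1, X2, d):
    """Two-view CCA sketch via the generalized eigenproblem A v = lam * B v.

    X1 (D1 x N) and X2 (D2 x N) are assumed centered with full row rank,
    and d is assumed not to exceed min(D1, D2).
    """
    D1 = X1.shape[0]
    C11, C22, C12 = X1 @ X1.T, X2 @ X2.T, X1 @ X2.T
    A = np.block([[np.zeros_like(C11), C12], [C12.T, np.zeros_like(C22)]])
    B = np.block([[C11, np.zeros_like(C12)], [np.zeros_like(C12.T), C22]])
    # Reduce to a standard symmetric eigenproblem using B = L L^T.
    L = np.linalg.cholesky(B)
    Linv = np.linalg.inv(L)
    lam, Y = np.linalg.eigh(Linv @ A @ Linv.T)
    order = np.argsort(lam)[::-1][:d]          # d largest eigenvalues
    # Each stacked eigenvector splits its unit B-norm evenly between the
    # two views, so sqrt(2) rescaling yields U_m^T X_m X_m^T U_m = I.
    V = (Linv.T @ Y[:, order]) * np.sqrt(2.0)
    return V[:D1], V[D1:], lam[order]
```

The returned eigenvalues are the canonical correlations, and the two blocks of each eigenvector are the loading vectors of the two views.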

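When analyzing multiple ($M\ge 3$) datasets, (1) can be generalized to a pairwise matching criterion [6]; that is

$$\min_{\{\mathbf{U}_m\}_{m=1}^{M}}\;\sum_{m=1}^{M-1}\sum_{m'>m}^{M}\big\|\mathbf{U}_m^{\top}\mathbf{X}_m-\mathbf{U}_{m'}^{\top}\mathbf{X}_{m'}\big\|_F^2 \tag{2a}$$

$$\text{s. to}\;\;\mathbf{U}_m^{\top}\big(\mathbf{X}_m\mathbf{X}_m^{\top}\big)\mathbf{U}_m=\mathbf{I},\quad m=1,\ldots,M \tag{2b}$$

where (2b) ensures a unique nontrivial solution. The formulation in (2) is referred to as the sum-of-correlations (SUMCOR) MCCA, which is known to be NP-hard in general [23].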

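Instead of minimizing the distance between paired low-dimensional approximations, one can look for a shared low-dimensional representation of different views, namely $\mathbf{S}\in\mathbb{R}^{d\times N}$, by solving [19]

$$\min_{\{\mathbf{U}_m\}_{m=1}^{M},\,\mathbf{S}}\;\sum_{m=1}^{M}\big\|\mathbf{U}_m^{\top}\mathbf{X}_m-\mathbf{S}\big\|_F^2 \tag{3a}$$

$$\text{s. to}\;\;\mathbf{S}\mathbf{S}^{\top}=\mathbf{I} \tag{3b}$$

yielding the so-called maximum-variance (MAXVAR) MCCA formulation. Similarly, the constraint (3b) is imposed to avoid a trivial solution. If all per-view sample covariance matrices $\{\mathbf{X}_m\mathbf{X}_m^{\top}\}_{m=1}^{M}$ have full rank, then for a fixed $\mathbf{S}$, the $\{\mathbf{U}_m\}$-minimizers are given by $\hat{\mathbf{U}}_m=(\mathbf{X}_m\mathbf{X}_m^{\top})^{-1}\mathbf{X}_m\mathbf{S}^{\top}$. Substituting $\{\hat{\mathbf{U}}_m\}$ into (3), the $\mathbf{S}$-minimizer can be obtained by solving the following eigenvalue decomposition problem

$$\max_{\mathbf{S}}\;\mathrm{Tr}\Big(\mathbf{S}\Big(\sum_{m=1}^{M}\mathbf{X}_m^{\top}\big(\mathbf{X}_m\mathbf{X}_m^{\top}\big)^{-1}\mathbf{X}_m\Big)\mathbf{S}^{\top}\Big) \tag{4a}$$

$$\text{s. to}\;\;\mathbf{S}\mathbf{S}^{\top}=\mathbf{I}. \tag{4b}$$

The columns of $\hat{\mathbf{S}}^{\top}$ are given by the first $d$ principal eigenvectors of matrix $\sum_{m=1}^{M}\mathbf{X}_m^{\top}(\mathbf{X}_m\mathbf{X}_m^{\top})^{-1}\mathbf{X}_m$. In turn, we deduce that $\hat{\mathbf{U}}_m=(\mathbf{X}_m\mathbf{X}_m^{\top})^{-1}\mathbf{X}_m\hat{\mathbf{S}}^{\top}$.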
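As a rough illustration of the MAXVAR solution, the sketch below forms the sum of the per-view projectors $\mathbf{X}_m^{\top}(\mathbf{X}_m\mathbf{X}_m^{\top})^{-1}\mathbf{X}_m$, takes its $d$ leading eigenvectors as the rows of the common representation, and recovers the loading matrices by least squares. The function name `maxvar_mcca` and the list-of-arrays input convention are assumptions made for this example.

```python
import numpy as np

def maxvar_mcca(views, d):
    """MAXVAR MCCA sketch; `views` is a list of centered (D_m x N) arrays
    whose sample covariances X_m X_m^T are assumed to have full rank."""
    N = views[0].shape[1]
    C = np.zeros((N, N))
    for X in views:
        # X^T (X X^T)^{-1} X is the projector onto the row space of X.
        C += X.T @ np.linalg.solve(X @ X.T, X)
    C = 0.5 * (C + C.T)                         # remove rounding asymmetry
    eigvals, eigvecs = np.linalg.eigh(C)        # ascending eigenvalues
    S = eigvecs[:, ::-1][:, :d].T               # d x N, orthonormal rows
    U = [np.linalg.solve(X @ X.T, X @ S.T) for X in views]
    return U, S
```

Because only an $N\times N$ symmetric eigendecomposition is involved, the number of views affects the cost of building the matrix but not the size of the eigenproblem.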

###### Remark 1.

Solutions of the SUMCOR MCCA in (2) and the MAXVAR MCCA in (3) are generally different. Specifically, for $M=2$, both admit analytical solutions that can be expressed in terms of distinct eigenvalue decompositions; but for $M>2$, the SUMCOR MCCA cannot be solved analytically, while the MAXVAR MCCA still admits an analytical solution, though at the price of higher computational complexity because it involves the extra matrix variable $\mathbf{S}$.

## III Graph-regularized MCCA

In many applications, the common source vectors $\{\check{\mathbf{s}}_i\}_{i=1}^{N}$ may reside on, or their dependencies form a graph of $N$ nodes. This structural prior information can be leveraged along with multiview datasets to improve MCCA performance. Specifically, we will capture this extra knowledge here using a graph, and effect it in the low-dimensional common source estimates through a graph regularization term.

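Consider representing the graph of the $N$ common sources using the tuple $\mathcal{G}:=\{\mathcal{N},\mathcal{W}\}$, where $\mathcal{N}:=\{1,\ldots,N\}$ is the vertex set, and $\mathcal{W}:=\{w_{ij}\}_{(i,j)\in\mathcal{N}\times\mathcal{N}}$ collects all edge weights $\{w_{ij}\}$ over all vertex pairs $(i,j)$. The so-termed weighted adjacency matrix $\mathbf{W}\in\mathbb{R}^{N\times N}$ is formed with $w_{ij}$ being its $(i,j)$-th entry. Without loss of generality, undirected graphs for which $\mathbf{W}=\mathbf{W}^{\top}$ holds are considered in this work. Upon defining $d_i:=\sum_{j=1}^{N}w_{ij}$ and $\mathbf{D}:=\mathrm{diag}(\{d_i\}_{i=1}^{N})$, the Laplacian matrix of graph $\mathcal{G}$ is defined as

$$\mathbf{L}_{\mathcal{G}}:=\mathbf{D}-\mathbf{W}. \tag{5}$$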

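Next, a neat link between canonical correlations and graph regularization will be elaborated. To start, let us assume that sources $\{\check{\mathbf{s}}_i\}_{i=1}^{N}$ are smooth over $\mathcal{G}$. This means that two sources $(\check{\mathbf{s}}_i,\check{\mathbf{s}}_j)$ residing on two connected nodes $i,j\in\mathcal{N}$ are also close to each other in Euclidean distance. As explained before, vectors $\mathbf{s}_i$ and $\mathbf{s}_j$ (the $i$-th and $j$-th columns of $\mathbf{S}$) are accordingly the $d$-dimensional approximations of $\check{\mathbf{s}}_i$ and $\check{\mathbf{s}}_j$. Accounting for this fact, a meaningful regularizer is the weighted sum of distances between any pair of common source estimates $\mathbf{s}_i$ and $\mathbf{s}_j$ over $\mathcal{G}$, which can be written through the graph Laplacian $\mathbf{L}_{\mathcal{G}}$ in (5) as

$$\mathrm{Tr}\big(\mathbf{S}\mathbf{L}_{\mathcal{G}}\mathbf{S}^{\top}\big)=\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}w_{ij}\,\|\mathbf{s}_i-\mathbf{s}_j\|_2^2. \tag{6}$$

Clearly, source vectors $\mathbf{s}_i$ and $\mathbf{s}_j$ residing on adjacent nodes $i,j\in\mathcal{N}$ having large weights $w_{ij}$ will be forced to be similar to each other. To leverage such additional graph information of the common sources, the quadratic term (6) is invoked as a regularizer in the standard MAXVAR MCCA, yielding our novel graph-regularized (G) MCCA formulation

$$\min_{\{\mathbf{U}_m\}_{m=1}^{M},\,\mathbf{S}}\;\sum_{m=1}^{M}\big\|\mathbf{U}_m^{\top}\mathbf{X}_m-\mathbf{S}\big\|_F^2+\gamma\,\mathrm{Tr}\big(\mathbf{S}\mathbf{L}_{\mathcal{G}}\mathbf{S}^{\top}\big) \tag{7a}$$

$$\text{s. to}\;\;\mathbf{S}\mathbf{S}^{\top}=\mathbf{I} \tag{7b}$$

where the coefficient $\gamma\ge 0$ trades off minimizing the distance between the canonical variables and their corresponding common source estimates with promoting smoothness of common source estimates over the graph $\mathcal{G}$. Specifically, when $\gamma=0$, GMCCA reduces to the classical MCCA in (3); and, as $\gamma$ increases, GMCCA relies more heavily on this extra graph knowledge when finding the canonical variables.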
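The equivalence between the trace form of the regularizer and the weighted sum of pairwise squared distances in (6) can be checked numerically. The sketch below assumes a symmetric weighted adjacency matrix; the helper names `graph_laplacian`, `smoothness`, and `pairwise_form` are introduced here only for illustration.

```python
import numpy as np

def graph_laplacian(W):
    """Laplacian D - W of a symmetric weighted adjacency matrix W."""
    return np.diag(W.sum(axis=1)) - W

def smoothness(S, W):
    """Regularizer Tr(S L S^T), with the columns of S (d x N) living on the nodes."""
    return np.trace(S @ graph_laplacian(W) @ S.T)

def pairwise_form(S, W):
    """Same quantity written as 0.5 * sum_ij w_ij * ||s_i - s_j||^2."""
    diff = S[:, :, None] - S[:, None, :]         # shape d x N x N
    return 0.5 * np.sum(W * np.sum(diff ** 2, axis=0))
```

For any symmetric `W`, the two functions agree up to rounding, and every row of the Laplacian sums to zero, so constant signals over the graph incur no penalty.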

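If all per-view sample covariance matrices $\{\mathbf{X}_m\mathbf{X}_m^{\top}\}_{m=1}^{M}$ have full rank, equating to zero the partial derivative of the cost in (7a) with respect to each $\mathbf{U}_m$ yields the optimizer $\hat{\mathbf{U}}_m=(\mathbf{X}_m\mathbf{X}_m^{\top})^{-1}\mathbf{X}_m\mathbf{S}^{\top}$. Substituting next $\mathbf{U}_m$ by $\hat{\mathbf{U}}_m$ and ignoring the constant term in (7a) give rise to the following eigenvalue problem (cf. (4))

$$\max_{\mathbf{S}}\;\mathrm{Tr}\Big(\mathbf{S}\Big(\sum_{m=1}^{M}\mathbf{X}_m^{\top}\big(\mathbf{X}_m\mathbf{X}_m^{\top}\big)^{-1}\mathbf{X}_m-\gamma\mathbf{L}_{\mathcal{G}}\Big)\mathbf{S}^{\top}\Big) \tag{8a}$$

$$\text{s. to}\;\;\mathbf{S}\mathbf{S}^{\top}=\mathbf{I}. \tag{8b}$$

Similar to standard MCCA, the optimal solution $\hat{\mathbf{S}}$ of (8) can be obtained by the $d$ leading eigenvectors of the matrix

$$\mathbf{C}:=\sum_{m=1}^{M}\mathbf{X}_m^{\top}\big(\mathbf{X}_m\mathbf{X}_m^{\top}\big)^{-1}\mathbf{X}_m-\gamma\mathbf{L}_{\mathcal{G}}. \tag{9}$$

At the optimum, it is easy to verify that the following holds

$$\sum_{m=1}^{M}\big\|\hat{\mathbf{U}}_m^{\top}\mathbf{X}_m-\hat{\mathbf{S}}\big\|_F^2+\gamma\,\mathrm{Tr}\big(\hat{\mathbf{S}}\mathbf{L}_{\mathcal{G}}\hat{\mathbf{S}}^{\top}\big)=Md-\sum_{i=1}^{d}\lambda_i$$

where $\lambda_i$ denotes the $i$-th largest eigenvalue of $\mathbf{C}$ in (9).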

A step-by-step description of our proposed GMCCA scheme is summarized in Alg. 1.
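The steps of the linear GMCCA scheme can be sketched as follows, assuming the closed form in which the common representation is spanned by the $d$ leading eigenvectors of the sum of per-view projectors minus $\gamma$ times the graph Laplacian. This is an illustrative sketch rather than a transcription of Alg. 1; the names `gmcca` and `gmcca_cost` and their argument order are hypothetical.

```python
import numpy as np

def gmcca(views, W, gamma, d):
    """Linear GMCCA sketch.

    views : list of centered (D_m x N) arrays with full-rank X_m X_m^T
    W     : symmetric (N x N) weighted adjacency matrix of the source graph
    gamma : nonnegative regularization weight
    d     : number of canonical variables
    """
    L = np.diag(W.sum(axis=1)) - W               # graph Laplacian
    C = -gamma * L
    for X in views:
        C = C + X.T @ np.linalg.solve(X @ X.T, X)
    C = 0.5 * (C + C.T)                          # remove rounding asymmetry
    lam, V = np.linalg.eigh(C)
    top = np.argsort(lam)[::-1][:d]              # d largest eigenvalues
    S = V[:, top].T                              # d x N, orthonormal rows
    U = [np.linalg.solve(X @ X.T, X @ S.T) for X in views]
    return U, S, lam[top]

def gmcca_cost(views, U, S, W, gamma):
    """Objective value: per-view fitting errors plus graph smoothness."""
    L = np.diag(W.sum(axis=1)) - W
    fit = sum(np.linalg.norm(Um.T @ X - S, "fro") ** 2
              for Um, X in zip(U, views))
    return fit + gamma * np.trace(S @ L @ S.T)
```

Setting `gamma=0` recovers the unregularized MAXVAR solution, and the attained cost can be compared against $Md$ minus the sum of the $d$ returned eigenvalues as a sanity check.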

At this point, a few remarks are in order.

###### Remark 2.

We introduced a two-view graph CCA scheme in [9] using the SUMCOR MCCA formulation. However, to obtain an analytical solution, the original cost was surrogated in [9] by its lower bound, which cannot be readily generalized for multiview datasets with $M\ge 3$. In contrast, our GMCCA in (7) can afford an analytical solution for any $M\ge 2$.

###### Remark 3.

Different from our single graph regularizer in (7), the proposals in [4] and [36] rely on $M$ different regularizers to exploit the extra graph knowledge, for view-specific graphs on data $\{\mathbf{X}_m\}_{m=1}^{M}$. However, the formulation in [36] does not admit an analytical solution, and convergence of the iterative solvers for the resulting nonconvex problem can be guaranteed only to a stationary point. The approach in [4] focuses on semi-supervised learning tasks, in which cross-covariances of pair-wise datasets are not fully available. In contrast, the single graph Laplacian regularizer in (7) is effected on the common sources, to exploit the pair-wise similarities of the $N$ common sources. This is of practical importance when one has prior knowledge about the common sources besides the $M$ datasets. For example, in ResearchIndex networks, besides keywords, titles, abstracts, and introductions of collected articles, one has also access to the citation network capturing the connectivities among those papers. More generally, the graph of inter-dependent sources can be dictated by underlying physics, or it can be a prior provided by an ‘expert,’ or, it can be learned from extra (e.g., historical) views of the data. Furthermore, our proposed GMCCA approach comes with simple analytical solutions.

###### Remark 4.

With regard to selecting $\gamma$, two ways are feasible: i) cross-validation for supervised learning tasks, where labeled training data are given, and $\gamma$ is fixed to the one that yields optimal empirical performance on the training data; and, ii) using a spectral clustering method that automatically chooses the best $\gamma$ values from a given set of candidates; see e.g., [7].

###### Remark 5.

Our GMCCA scheme entails eigendecomposition of an $N\times N$ matrix, which incurs computational complexity $\mathcal{O}(N^3)$, and thus is not scalable to large datasets. Possible remedies include parallelization and efficient decentralized algorithms capable of handling structured MCCA; e.g., along the lines of [16]. These go beyond the scope of the present paper, but constitute interesting future research directions.

## IV Generalization Bound of GMCCA

In this section, we will analyze the finite-sample performance of GMCCA based on a regression formulation [26, Ch. 6.5], which is further related to the alternating conditional expectations method in [5]. Our analysis will establish an error bound for unseen source vectors (a.k.a. generalization bound) using the notion of Rademacher complexity.

Recall that the goal of MCCA is to find common low-dimensional representations of the $M$-view data. To measure how close the estimated $M$ low-dimensional representations are to each other, we introduce the following error function

(10) |

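where the underlying source vector $\check{\mathbf{s}}\in\mathbb{R}^{\rho}$ is assumed to follow some fixed yet unknown distribution, and a per-view linear function maps a source vector from the space $\mathbb{R}^{\rho}$ to the $m$-th view in $\mathbb{R}^{D_m}$, for $m=1,\ldots,M$.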

To derive the generalization bound, we start by evaluating the empirical average of over say, a number of given training samples, as follows

For the quadratic terms, it can be readily verified that

(11) | ||||

(12) |

Define two vectors

where the two vectors and are defined as

for and .

Plugging (11) and (12) into (10), one can check that function can be rewritten as

(13) |

with the norm of given by

Starting from (13), we will establish next an upper bound on the expectation of the error function, which is important because the expectation involves not only the training source samples, but also unseen samples.

###### Theorem 1.

Assume that i) the common source vectors are drawn i.i.d. from some distribution ; ii) the transformations of vectors are bounded; and, iii) subspaces satisfy () and are the optimizers of (7). If we obtain low-dimensional representations of specified by subspaces , it holds with probability at least that

(14) |

where for , and , while the constant is given by

###### Proof.

Equation (13) suggests that belongs to the function class

Consider the function class

where the function is defined as

It can be checked that is a Lipschitz function with Lipschitz constant , and that the range of functions in is . Appealing to [26, Th. 4.9], one deduces that with probability at least , the following holds

(15) |

where denotes the expected value of on a new common source ; and the Rademacher complexity of along with its empirical version is defined as

where collects independent random variables drawn from the Rademacher distribution, meaning . Further, and denote the expectation with respect to and , respectively.

Since is a Lipschitz function with Lipschitz constant satisfying , the result in [2, Th. 12] asserts that

(16) |

Applying [26, Th. 4.12] leads to

(17) |

where the -th entry of is , for . One can also confirm that

(18) |

Substituting (17) and (18) to (16) yields

Multiplying (15) by along with the last equation gives rise to (14). ∎

Theorem 1 confirms that the empirical expectation of the error function stays close to its ensemble counterpart, provided that the norms of the loading vectors can be controlled. For this reason, it is prudent to trade off maximization of correlations among the $M$ datasets with the norms of the resultant loading vectors.

## V Graph-regularized Dual MCCA

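In practical scenarios involving high-dimensional data vectors with dimensions satisfying $\min_m D_m>N$, the matrices $\{\mathbf{X}_m\mathbf{X}_m^{\top}\}_{m=1}^{M}$ become singular, a case where GMCCA in (7) does not apply. For such cases, consider rewriting the $D_m\times d$ loading matrices $\mathbf{U}_m$ in terms of the data matrices $\mathbf{X}_m$ as $\mathbf{U}_m=\mathbf{X}_m\mathbf{A}_m$, where $\mathbf{A}_m\in\mathbb{R}^{N\times d}$ will be henceforth termed the dual of $\mathbf{U}_m$. Replacing $\mathbf{U}_m$ with $\mathbf{X}_m\mathbf{A}_m$ in the linear GMCCA formulation (7) leads to its dual formulation

$$\min_{\{\mathbf{A}_m\}_{m=1}^{M},\,\mathbf{S}}\;\sum_{m=1}^{M}\big\|\mathbf{A}_m^{\top}\mathbf{X}_m^{\top}\mathbf{X}_m-\mathbf{S}\big\|_F^2+\gamma\,\mathrm{Tr}\big(\mathbf{S}\mathbf{L}_{\mathcal{G}}\mathbf{S}^{\top}\big) \tag{19a}$$

$$\text{s. to}\;\;\mathbf{S}\mathbf{S}^{\top}=\mathbf{I}. \tag{19b}$$

If the $N\times N$ matrices $\{\mathbf{X}_m^{\top}\mathbf{X}_m\}_{m=1}^{M}$ are nonsingular, it can be readily confirmed that the columns of the optimizer $\hat{\mathbf{S}}^{\top}$ of (19) are the $d$ principal eigenvectors of $M\mathbf{I}-\gamma\mathbf{L}_{\mathcal{G}}$, while the dual matrices can be estimated in closed form as $\hat{\mathbf{A}}_m=(\mathbf{X}_m^{\top}\mathbf{X}_m)^{-1}\hat{\mathbf{S}}^{\top}$. Clearly, such an $\hat{\mathbf{S}}$ does not depend on the data $\{\mathbf{X}_m\}_{m=1}^{M}$, and this estimate goes against our goal of extracting $\hat{\mathbf{S}}$ as the latent low-dimensional structure commonly present in $\{\mathbf{X}_m\}_{m=1}^{M}$. To address this issue, we mimic the dual CCA trick (see e.g., [12]), and introduce a Tikhonov regularization term on the loading vectors through the norms of $\{\mathbf{U}_m=\mathbf{X}_m\mathbf{A}_m\}_{m=1}^{M}$. This indeed agrees with the observation we made following Theorem 1 that controlling $\|\mathbf{U}_m\|_F$ improves the generalization. In a nutshell, our graph-regularized dual (GD) MCCA is given as

$$\min_{\{\mathbf{A}_m\}_{m=1}^{M},\,\mathbf{S}}\;\sum_{m=1}^{M}\big\|\mathbf{A}_m^{\top}\mathbf{X}_m^{\top}\mathbf{X}_m-\mathbf{S}\big\|_F^2+\gamma\,\mathrm{Tr}\big(\mathbf{S}\mathbf{L}_{\mathcal{G}}\mathbf{S}^{\top}\big)+\sum_{m=1}^{M}\epsilon_m\,\mathrm{Tr}\big(\mathbf{A}_m^{\top}\mathbf{X}_m^{\top}\mathbf{X}_m\mathbf{A}_m\big) \tag{20a}$$

$$\text{s. to}\;\;\mathbf{S}\mathbf{S}^{\top}=\mathbf{I} \tag{20b}$$

where $\{\epsilon_m>0\}_{m=1}^{M}$ denote pre-selected weight coefficients.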

## VI Graph-regularized Kernel MCCA

The GMCCA and GDMCCA approaches are limited to analyzing linear data dependencies. Nonetheless, complex nonlinear data dependencies are not rare in practice. To account for nonlinear dependencies, a graph-regularized kernel (GK) MCCA formulation is pursued in this section to capture the nonlinear relationships in the datasets through kernel-based methods. Specifically, the idea of GKMCCA involves first mapping the data vectors to higher (possibly infinite) dimensional feature vectors by means of nonlinear functions, on which features we will apply GMCCA to find the shared low-dimensional canonical variables.

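Let $\boldsymbol{\phi}_m$ be a mapping from $\mathbb{R}^{D_m}$ to $\mathbb{R}^{L_m}$ for all $m$, where the dimension $L_m$ can be as high as infinity. Clearly, the data enter the GDMCCA problem (20) only via the similarity matrix $\mathbf{X}_m^{\top}\mathbf{X}_m$. Upon mapping all data vectors $\{\mathbf{x}_{m,i}\}_{i=1}^{N}$ into $\{\boldsymbol{\phi}_m(\mathbf{x}_{m,i})\}_{i=1}^{N}$, the linear similarities $\{\langle\mathbf{x}_{m,i},\mathbf{x}_{m,j}\rangle\}$ can be replaced with the mapped nonlinear similarities $\{\langle\boldsymbol{\phi}_m(\mathbf{x}_{m,i}),\boldsymbol{\phi}_m(\mathbf{x}_{m,j})\rangle\}$. After selecting some kernel function $\kappa^m$ such that $\kappa^m(\mathbf{x}_{m,i},\mathbf{x}_{m,j}):=\langle\boldsymbol{\phi}_m(\mathbf{x}_{m,i}),\boldsymbol{\phi}_m(\mathbf{x}_{m,j})\rangle$, the $(i,j)$-th entry of the kernel matrix $\mathbf{K}_m\in\mathbb{R}^{N\times N}$ is given by $\kappa^m(\mathbf{x}_{m,i},\mathbf{x}_{m,j})$, for all $i$, $j$, and $m$. In the sequel, centering $\{\boldsymbol{\phi}_m(\mathbf{x}_{m,i})\}_{i=1}^{N}$ is realized by centering the kernel matrix for data $\mathbf{X}_m$ as

$$\bar{\mathbf{K}}_m(i,j):=\mathbf{K}_m(i,j)-\frac{1}{N}\sum_{k=1}^{N}\mathbf{K}_m(k,j)-\frac{1}{N}\sum_{k=1}^{N}\mathbf{K}_m(i,k)+\frac{1}{N^2}\sum_{k=1}^{N}\sum_{l=1}^{N}\mathbf{K}_m(k,l) \tag{21}$$

for $m=1,\ldots,M$.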
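Kernel centering of this kind is commonly implemented in matrix form, by multiplying the kernel matrix on both sides with the centering matrix $\mathbf{H}=\mathbf{I}-\tfrac{1}{N}\mathbf{1}\mathbf{1}^{\top}$, which reproduces the entrywise subtraction of row means, column means, and the grand mean. The sketch below assumes a square symmetric kernel matrix; the function name `center_kernel` is illustrative.

```python
import numpy as np

def center_kernel(K):
    """Center an (N x N) kernel matrix, i.e. return H K H with H = I - 11^T / N.

    This equals the kernel matrix of the feature vectors after their
    sample mean has been removed in feature space.
    """
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    return H @ K @ H
```

For the linear kernel $\mathbf{K}=\mathbf{X}^{\top}\mathbf{X}$, the output coincides with the Gram matrix of the mean-removed data, which gives a quick way to validate the routine.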

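Replacing $\{\mathbf{X}_m^{\top}\mathbf{X}_m\}_{m=1}^{M}$ in the GDMCCA formulation (20) with centered kernel matrices $\{\bar{\mathbf{K}}_m\}_{m=1}^{M}$ yields our GKMCCA

$$\min_{\{\mathbf{A}_m\}_{m=1}^{M},\,\mathbf{S}}\;\sum_{m=1}^{M}\big\|\mathbf{A}_m^{\top}\bar{\mathbf{K}}_m-\mathbf{S}\big\|_F^2+\gamma\,\mathrm{Tr}\big(\mathbf{S}\mathbf{L}_{\mathcal{G}}\mathbf{S}^{\top}\big)+\sum_{m=1}^{M}\epsilon_m\,\mathrm{Tr}\big(\mathbf{A}_m^{\top}\bar{\mathbf{K}}_m\mathbf{A}_m\big) \tag{22a}$$

$$\text{s. to}\;\;\mathbf{S}\mathbf{S}^{\top}=\mathbf{I}. \tag{22b}$$

Selecting invertible matrices $\{\bar{\mathbf{K}}_m+\epsilon_m\mathbf{I}\}_{m=1}^{M}$, and following the logic used to solve (20), we can likewise tackle (22). Consequently, the columns of the optimizer $\hat{\mathbf{S}}^{\top}$ are the first $d$ principal eigenvectors of $\sum_{m=1}^{M}(\bar{\mathbf{K}}_m+\epsilon_m\mathbf{I})^{-1}\bar{\mathbf{K}}_m-\gamma\mathbf{L}_{\mathcal{G}}$, and the optimal $\hat{\mathbf{A}}_m$ sought can be obtained as $\hat{\mathbf{A}}_m=(\bar{\mathbf{K}}_m+\epsilon_m\mathbf{I})^{-1}\hat{\mathbf{S}}^{\top}$. For implementation, GKMCCA is presented in step-by-step form as Algorithm 3.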
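A compact sketch of the kernel variant is given next. It assumes the closed form obtained by repeating the steps used for the linear case: with positive Tikhonov weights $\epsilon_m$, the common representation is spanned by the $d$ leading eigenvectors of $\sum_m(\bar{\mathbf{K}}_m+\epsilon_m\mathbf{I})^{-1}\bar{\mathbf{K}}_m-\gamma\mathbf{L}_{\mathcal{G}}$, and each dual matrix solves $(\bar{\mathbf{K}}_m+\epsilon_m\mathbf{I})\mathbf{A}_m=\mathbf{S}^{\top}$. The Gaussian kernel, its `bandwidth` parameter, and the function names are assumptions made for this example and are not prescribed by the text.

```python
import numpy as np

def gaussian_kernel(X, bandwidth):
    """Gaussian kernel matrix between the columns of X (D x N)."""
    sq = np.sum(X ** 2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    return np.exp(-np.maximum(dist2, 0.0) / (2.0 * bandwidth ** 2))

def gkmcca(kernels, W, gamma, eps, d):
    """Graph-regularized kernel MCCA sketch.

    kernels : list of (N x N) uncentered kernel matrices, one per view
    W       : symmetric (N x N) weighted adjacency matrix of the source graph
    gamma   : nonnegative graph regularization weight
    eps     : list of positive Tikhonov weights, one per view
    d       : number of canonical variables
    """
    N = kernels[0].shape[0]
    I = np.eye(N)
    H = I - np.ones((N, N)) / N                  # centering matrix
    L = np.diag(W.sum(axis=1)) - W               # graph Laplacian
    Kbar = [H @ K @ H for K in kernels]
    C = -gamma * L
    for K, e in zip(Kbar, eps):
        C = C + np.linalg.solve(K + e * I, K)
    C = 0.5 * (C + C.T)                          # remove rounding asymmetry
    lam, V = np.linalg.eigh(C)
    S = V[:, np.argsort(lam)[::-1][:d]].T        # d x N, orthonormal rows
    A = [np.linalg.solve(K + e * I, S.T) for K, e in zip(Kbar, eps)]
    return A, S
```

Passing linear kernels $\mathbf{X}_m^{\top}\mathbf{X}_m$ of centered data in place of the Gaussian ones turns the same routine into a sketch of the dual (GDMCCA) formulation.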

In terms of computational complexity, recall that GMCCA, GDMCCA, GKMCCA, MCCA, DMCCA, and KMCCA all require finding the eigenvectors of matrices with different dimensionalities. Defining , it can be checked that they incur correspondingly complexities , , , , , and . Interestingly, introducing graph-regularization to e.g., MCCA, DMCCA, as well as KMCCA does not result in an increase of computational complexity. When , GMCCA in its present form is not feasible, or suboptimal even though pseudo-inverse can be utilized at the cost of . In contrast, GDMCCA is computationally preferable as its cost grows only linearly with . When , the complexity of GKMCCA is dominated by the computation burden of requiring complexity in the order of . On the other hand, implementing GKMCCA when incurs complexity of order , required to evaluate the kernel matrices.

###### Remark 6.

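When the (non)linear maps $\{\boldsymbol{\phi}_m\}_{m=1}^{M}$ needed to form the kernel matrices $\{\bar{\mathbf{K}}_m\}_{m=1}^{M}$ in (22) are not given a priori, the multi-kernel methods are well motivated (see e.g., [37, 28]). Concretely, one presumes that each $\mathbf{K}_m$ is a linear combination of $P$ kernel matrices, namely $\mathbf{K}_m=\sum_{p=1}^{P}\beta_m^{p}\mathbf{K}_m^{p}$, where $\{\mathbf{K}_m^{p}\}_{p=1}^{P}$ represent preselected view-specific kernel matrices for data $\mathbf{X}_m$. The unknown coefficients $\{\beta_m^{p}\ge 0\}_{m,p}$ are then jointly optimized with $\{\mathbf{A}_m\}_{m=1}^{M}$ and $\mathbf{S}$ in (22).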

###### Remark 7.

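When more than one type of connectivity information on the common sources is available, our single graph-regularized MCCA schemes can be generalized to accommodate multiple or multi-layer graphs. Specifically, the single graph-based regularization term $\gamma\,\mathrm{Tr}(\mathbf{S}\mathbf{L}_{\mathcal{G}}\mathbf{S}^{\top})$ in (7), (20), and (22) can be replaced with $\sum_{i=1}^{I}\gamma_i\,\mathrm{Tr}(\mathbf{S}\mathbf{L}_{\mathcal{G}_i}\mathbf{S}^{\top})$ with possibly unknown yet learnable coefficients $\{\gamma_i\}_{i=1}^{I}$, where $\mathbf{L}_{\mathcal{G}_i}$ denotes the graph Laplacian matrix of the $i$-th graph, for $i=1,\ldots,I$.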

## VII Numerical Tests

In this section, numerical tests using real datasets are provided to showcase the merits of our proposed MCCA approaches in several machine learning applications, including user engagement prediction, friend recommendation, clustering, and classification.

### VII-A User engagement prediction

Given multi-view data of Twitter users, the goal of the so-called user engagement prediction is to determine which topics a Twitter user is likely to tweet about, by using hashtags as a proxy. The first experiment entails six datasets of Twitter users, which include EgoTweets, MentionTweets, FriendTweets, FollowersTweets, FriendNetwork, and FollowerNetwork data (downloaded from http://www.dredze.com/datasets/multiviewembeddings/), where the users’ data are randomly chosen from the database. Details in generating those multiview data can be found in [3]. Based on data from the first three views, three adjacency matrices are constructed, whose $(i,j)$-th entries are

(23) |

where is a Gaussian kernel matrix of