Maximally Correlated Principal Component Analysis
In the era of big data, reducing data dimensionality is critical in many areas of science. Widely used Principal Component Analysis (PCA) addresses this problem by computing a low dimensional data embedding that maximally explains the variance of the data. However, PCA has two major weaknesses. Firstly, it only considers linear correlations among variables (features), and secondly, it is not suitable for categorical data. We resolve these issues by proposing Maximally Correlated Principal Component Analysis (MCPCA). MCPCA computes transformations of variables whose covariance matrix has the largest Ky Fan norm. Variable transformations are unknown, can be nonlinear and are computed in an optimization. MCPCA can also be viewed as a multivariate extension of Maximal Correlation. For jointly Gaussian variables we show that the covariance matrix corresponding to the identity (or the negative of the identity) transformations majorizes covariance matrices of non-identity functions. Using this result we characterize global MCPCA optimizers for nonlinear functions of jointly Gaussian variables for every rank constraint. For categorical variables we characterize global MCPCA optimizers for the rank one constraint based on the leading eigenvector of a matrix computed using pairwise joint distributions. For a general rank constraint we propose a block coordinate descent algorithm and show its convergence to stationary points of the MCPCA optimization. We compare MCPCA with PCA and other state-of-the-art dimensionality reduction methods including Isomap, LLE, multilayer autoencoders (neural networks), kernel PCA, probabilistic PCA and diffusion maps on several synthetic and real datasets. We show that MCPCA consistently provides improved performance compared to other methods.
Let X and Y be two mean zero and unit variance random variables. Pearson’s correlation between X and Y, defined as
is a basic statistical parameter and plays a central role in many statistical and machine learning methods such as linear regression, principal component analysis, and support vector machines, partially owing to its simplicity and computational efficiency. Pearson’s correlation however has two main weaknesses: firstly, it only captures linear dependency between variables, and secondly, for discrete (categorical) variables its value depends somewhat arbitrarily on the labels. To overcome these weaknesses, Maximal Correlation (MC) has been proposed and studied by Hirschfeld, Gebelein, Sarmanov and Rényi, and is defined as
Transformation functions are assumed to be Borel measurable with ranges in the real line. MC has also been studied by Witsenhausen, Ahlswede and Gács, and Lancaster. MC tackles the two main drawbacks of Pearson’s correlation: it models a family of nonlinear relationships between the two variables, and for discrete variables the MC value depends only on the joint distribution rather than on labels. Moreover, the MC value between the two variables is zero iff they are independent.
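For finite discrete variables, MC can be computed in closed form from the joint distribution: it equals the second largest singular value of a normalized joint-distribution matrix (the Q-matrix studied in Section 4). A minimal numpy sketch, with a hypothetical 2×2 joint distribution chosen purely for illustration:

```python
import numpy as np

# Hypothetical joint distribution of two binary variables X and Y
# (rows index x, columns index y); entries sum to 1.
P = np.array([[0.3, 0.1],
              [0.1, 0.5]])
px, py = P.sum(axis=1), P.sum(axis=0)

# Normalized matrix with entries P(x, y) / sqrt(P(x) P(y)).
Q = P / np.sqrt(np.outer(px, py))

s = np.linalg.svd(Q, compute_uv=False)  # singular values, decreasing
# The top singular value is always one; the second largest singular
# value equals the maximal correlation between X and Y.
mc = s[1]
```

For binary variables this value coincides with the absolute Pearson correlation of the 0/1 labels; here both equal 0.14/0.24 ≈ 0.583.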
For the multivariate case with variables X_1, ..., X_n where n ≥ 2, Pearson’s correlation can be extended naturally to the covariance matrix (assuming each variable has zero mean and unit variance). Similarly to the bivariate case, the covariance matrix analysis suffers from the two weaknesses of only capturing linear dependencies among variables and being label dependent when variables are discrete (categorical). One way to extend the idea of MC to the multivariate case is to consider the set of covariance matrices of transformed variables. Let the vector of transformed variables have zero means and unit variances, and let its covariance matrix collect the pairwise correlations of the transformed variables. The set of covariance matrices of transformed variables is defined as follows:
Similarly to the bivariate case, the transformation functions are assumed to be Borel measurable with ranges in the real line. If variables are continuous, the functions are assumed to be continuous. This set includes infinitely many covariance matrices corresponding to different transformations of variables. In order to have an operational extension of MC to the multivariate case, we need to select one (or finitely many) of its members through an optimization.
Here we propose the following optimization over this set, which aims to select a covariance matrix with the maximum q-Ky Fan norm (i.e., with the maximum sum of the top q eigenvalues):
Since the trace of every matrix in this set is equal to the number of variables, maximizing the Ky Fan norm over it results in a low rank or an approximately low rank covariance matrix. We refer to this optimization as Maximally Correlated Principal Component Analysis with parameter q or, for simplicity, the MCPCA optimization; when no confusion arises we drop the parameter from the notation for its optimal value.
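Concretely, the q-Ky Fan norm of a symmetric matrix is the sum of its top q eigenvalues; since every covariance matrix of unit-variance variables has a fixed trace, pushing this norm up for small q concentrates the spectrum in a few eigenvalues. A short sketch (the matrix below is a made-up example):

```python
import numpy as np

def ky_fan(K, q):
    """q-Ky Fan norm of a symmetric PSD matrix: sum of its top-q eigenvalues."""
    w = np.linalg.eigvalsh(K)       # eigenvalues in ascending order
    return float(w[::-1][:q].sum())

# Toy covariance matrix of three unit-variance variables (trace = 3).
K = np.array([[1.0, 0.8, 0.2],
              [0.8, 1.0, 0.4],
              [0.2, 0.4, 1.0]])
```

The n-Ky Fan norm of any such matrix is exactly its trace, so only q < n gives a meaningful objective.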
Principal Component Analysis (PCA) aims to find the eigenvectors corresponding to the top eigenvalues of the covariance matrix; these are called Principal Components (PCs). In contrast, we show that the MCPCA optimization aims to find possibly nonlinear transformations of variables that can be approximated optimally by orthonormal vectors. Thus, MCPCA can be viewed as a generalization of PCA over possibly nonlinear transformations of variables with zero means and unit variances.
We summarize our main contributions below:
We introduce MCPCA as a multivariate extension of MC and a generalization of PCA.
For jointly Gaussian variables we show that the covariance matrix corresponding to the identity (or the negative of the identity) transformations majorizes covariance matrices of non-identity functions. Using this result we characterize global MCPCA optimizers for nonlinear functions of jointly Gaussian variables for every rank parameter q.
For finite discrete variables,
we compute a globally optimal MCPCA solution when q = 1, based on the leading eigenvector of a matrix computed using pairwise joint distributions.
for an arbitrary q we propose a block coordinate descent algorithm and show its convergence to stationary points of the MCPCA optimization.
We study the consistency of sample MCPCA (an MCPCA optimization computed using empirical distributions) for both finite discrete and continuous variables.
We compare MCPCA with PCA and other state-of-the-art nonlinear dimensionality reduction methods including Isomap, LLE, multilayer autoencoders (neural networks), kernel PCA, probabilistic PCA and diffusion maps on several synthetic and real datasets. Our real dataset experiments include breast cancer, Parkinson’s disease, diabetic retinopathy, dermatology, gene splicing and adult income datasets. We show that MCPCA consistently provides improved performance compared to other methods.
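As an illustration of the block coordinate descent idea for discrete data, the following is a simplified sketch, not the authors' exact algorithm: it alternates between (a) taking the top-q eigenvectors of the current covariance matrix and (b) updating each feature's transformation to the standardized conditional expectation that maximizes the trace objective coordinate-wise.

```python
import numpy as np

def cond_exp(target, labels):
    """E[target | label] evaluated samplewise (group means)."""
    out = np.empty_like(target, dtype=float)
    for v in np.unique(labels):
        mask = labels == v
        out[mask] = target[mask].mean()
    return out

def mcpca_bcd(X, q, iters=30, seed=0):
    """Sketch of block coordinate descent for sample MCPCA on an (m, n)
    matrix X of discrete features. Returns transformed data whose columns
    have zero mean and unit variance."""
    m, n = X.shape
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((m, n))
    Phi = (Phi - Phi.mean(0)) / Phi.std(0)
    for _ in range(iters):
        K = Phi.T @ Phi / m
        _, V = np.linalg.eigh(K)
        W = V[:, -q:]                  # top-q eigenvectors of current K
        for j in range(n):
            # coordinate update: maximize trace(W' K W) over the j-th
            # transformation with the other transformations held fixed
            c = W[j, :]
            r = Phi @ W @ c - (c @ c) * Phi[:, j]
            t = cond_exp(r, X[:, j])
            if t.std() > 1e-12:
                Phi[:, j] = (t - t.mean()) / t.std()
    return Phi
```

On two categorical features that are relabelings of each other, the learned transformations become perfectly correlated, so the top eigenvalue of the transformed covariance approaches 2, which PCA on raw labels generally misses.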
MCPCA can be viewed as a dimensionality reduction method whose goal is to find possibly nonlinear transformations of variables with a low rank covariance matrix. Other nonlinear dimensionality reduction methods include manifold learning methods such as Isomap, Locally Linear Embedding (LLE), kernel PCA, maximum variance unfolding, diffusion maps, Laplacian eigenmaps, Hessian LLE, local tangent space analysis, Sammon mapping, and multilayer autoencoders, among others; for a comprehensive review of these methods, see the references. Although these techniques show an advantage compared to PCA on artificial datasets, their successful applications to real datasets have been less convincing. The key challenge is to strike an appropriate balance among the generality of the model, the computational complexity of the method and the statistical significance of inferences.
MCPCA is more general than PCA since it considers both linear and nonlinear feature transformations. In kernel PCA methods, transformations of variables are fixed in advance; this is in contrast to MCPCA, which optimizes over transformations, resulting in an optimal low rank approximation of the data. Manifold learning methods such as Isomap and LLE aim to find a low dimensional representation of the data such that sample distances in the low dimensional space are the same, up to a scaling, as sample geodesic distances (i.e., distances over the manifold), assuming such a manifold underlies the data. These methods can be viewed as extensions of PCA that fit a nonlinear model to the data. Performance of these methods has been shown to be sensitive to noise and model parameters. Through experiments on several synthetic and real datasets we show that the performance of MCPCA is robust against these factors. Note that MCPCA allows features to be transformed only individually, thus avoiding a combinatorial optimization and resulting in statistically significant inferences. However, because of this, MCPCA cannot capture low dimensional structures such as the Swiss roll example, since the underlying transformations there depend on pairs of variables.
Unlike existing dimensionality reduction methods that are only suitable for data with continuous features, MCPCA is suitable for both categorical and continuous data. The reason is that even if the data is categorical, the transformed values computed by MCPCA are real. Moreover, we compare the computational and memory complexity of MCPCA and manifold learning methods (Isomap and LLE) in Remark ?. Unlike Isomap and LLE, whose computational and memory complexity scales quadratically or cubically with the number of samples, the computational and memory complexity of the MCPCA algorithm scales linearly with the number of samples, making it more suitable for datasets with a large number of samples.
MCPCA can be viewed as a multivariate extension of MC. Other extensions of MC to the multivariate case have been studied in the literature. For example, one reference introduces an optimization that aims to maximize the sum of arbitrarily chosen elements of the covariance matrix of transformed variables, shows that this optimization can be useful in nonlinear regression and graphical model inference, and provides an algorithm to find local optima of the proposed optimization. Another reference introduces an optimization that aims to select a covariance matrix whose minimum eigenvalue is maximized, and briefly discusses computational and operational aspects of the proposed optimization.
For matrices we use bold-faced upper case letters, for vectors we use bold-faced lower case letters, and for scalars we use regular lower case letters. For random variables we use regular upper case letters. For example, represents a matrix, represents a vector, represents a scalar number, and represents a random variable. and are the identity and all one matrices of size , respectively. When no confusion arises, we drop the subscripts. is the indicator function which is equal to one if , otherwise it is zero. and represent the trace and the transpose of the matrix , respectively. is a diagonal matrix whose diagonal elements are equal to , while is a vector of the diagonal elements of the matrix . is the second norm of the vector . When no confusion arises, we drop the subscript. is the operator norm of the matrix . is the inner product between vectors and . indicates that vectors and are orthogonal. The matrix inner product is defined as .
The eigen decomposition of a symmetric matrix is written so that its i-th largest eigenvalue corresponds to its i-th eigenvector; eigenvalues are sorted in decreasing order and eigenvectors have unit norms. Similarly, the singular value decomposition of a matrix is written so that its i-th largest singular value corresponds to its i-th left and right singular vectors; singular values are sorted in decreasing order and singular vectors have unit norms.
2 MCPCA: Basic Properties and Relationship with Matrix Majorization
2.1 Basic Properties of MCPCA
Rényi shows that the MC between two variables is zero iff they are independent, while the MC is one iff the two variables are strictly dependent (i.e., there exist zero mean, unit variance transformations of the variables that are equal). Here we study some of these properties for the multivariate case of MCPCA:
To prove part (i), for any , we have because has zero mean and unit variance for . Moreover, since , we have . Thus, , for . This completes the proof of part (i).
To prove part (ii), suppose . Thus, for every , we have . However since the sum of all eigenvalues is equal to , we have for every and . Therefore, for every . This means , for , which indicates that and are independent. To prove the other direction of part (ii), if and are independent, then for every zero mean and unit variance functions and , we have . Thus, for every , we have . This completes the proof of part (ii).
To prove part (iii), let . Thus, . This means that there exist transformation functions with zero means and unit variances such that for all , . Hence for , where has zero mean and unit variance. The proof of the converse direction is straightforward. This completes the proof of part (iii).
To prove part (iv), we note that if are one-to-one transformations, . Thus, . This completes the proof of part (iv).
In the following proposition, we show that the growth ratio of the optimal MCPCA value as the rank parameter increases from q to q + 1 is bounded above by (q + 1)/q, which decreases as q increases.
Let be an optimal MCPCA solution for . Since is an optimal MCPCA value with parameter , we have
By summing over all , we have . This completes the proof.
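The spectral fact behind this bound is easy to check numerically: for a positive semidefinite matrix, the (q+1)-Ky Fan norm is at most (q+1)/q times the q-Ky Fan norm, since the (q+1)-th eigenvalue cannot exceed the average of the top q. A quick sketch on a random PSD matrix (reading the bound as (q+1)/q):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
K = A @ A.T                              # random PSD test matrix
w = np.sort(np.linalg.eigvalsh(K))[::-1]
kf = np.cumsum(w)                        # kf[q-1] = q-Ky Fan norm

# consecutive Ky Fan norms grow by a factor of at most (q + 1) / q
ratios = kf[1:] / kf[:-1]
bounds = np.arange(2, 7) / np.arange(1, 6)
```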
2.2 Relationship between MCPCA and Matrix Majorization
A vector a weakly majorizes a vector b (in symbols, a ≻w b) if, for all k, the sum of the k largest elements of a is at least the sum of the k largest elements of b. If a ≻w b and the elements of a and b have equal total sums, then we say a majorizes b and denote it by a ≻ b.
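These definitions translate directly into code; a small sketch:

```python
import numpy as np

def weakly_majorizes(a, b, tol=1e-12):
    """a weakly majorizes b: every partial sum of the decreasingly
    sorted a dominates the corresponding partial sum of sorted b."""
    pa = np.cumsum(np.sort(a)[::-1])
    pb = np.cumsum(np.sort(b)[::-1])
    return bool(np.all(pa >= pb - tol))

def majorizes(a, b, tol=1e-12):
    """Weak majorization plus equality of the total sums."""
    return weakly_majorizes(a, b, tol) and abs(np.sum(a) - np.sum(b)) < tol
```

For example, the vector (3, 1, 0) majorizes (2, 1, 1): both sum to 4, and the partial sums 3, 4, 4 dominate 2, 3, 4.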
Let A and B be two Hermitian matrices of the same size. We say A majorizes B if the vector of eigenvalues of A majorizes the vector of eigenvalues of B. We have the following equivalent formulation for matrix majorization that we will use in later parts of the paper.
See Theorem 7.1 in .
The following proposition makes a connection between an optimal MCPCA solution and the majorization of covariance matrices in .
Since majorizes all , , for all . Thus is an optimal solution of optimization , for .
2.3 MCPCA as an Optimization over Unit Variance Functions
The feasible set of the MCPCA optimization includes functions of variables with zero means and unit variances. In the following we consider an alternative optimization whose feasible set includes functions of variables with unit variances, and show the relationship between its optimal solutions and those of the MCPCA optimization. This formulation becomes useful in simplifying the MCPCA optimization for finite discrete variables (Section 4).
First we have the following lemma:
The proof follows from the fact that the Ky Fan norm of a matrix is the solution of the following optimization :
Consider the trace formulation of optimizations and according to Lemma ?:
Let and be an optimal solution of . The set of functions and is feasible for optimization . Thus, . Moreover, let and be an optimal solution of optimization . Let . The set of functions and is feasible for optimization . Thus, we have . Therefore, we have that . This completes the proof.
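The variational fact invoked here — the q-Ky Fan norm of a symmetric matrix equals the maximum of trace(UᵀKU) over matrices U with q orthonormal columns, attained at the top-q eigenvectors — can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
K = A @ A.T                       # symmetric PSD test matrix
q = 2

w, V = np.linalg.eigh(K)          # ascending eigenvalues
ky_fan = w[-q:].sum()             # q-Ky Fan norm

U = V[:, -q:]                     # top-q eigenvectors (orthonormal)
attained = np.trace(U.T @ K @ U)  # equals the q-Ky Fan norm

# any other matrix with orthonormal columns gives a smaller trace
R, _ = np.linalg.qr(rng.standard_normal((5, q)))
other = np.trace(R.T @ K @ R)
```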
3 MCPCA for Jointly Gaussian Random Variables
Let X_1, ..., X_n be zero mean, unit variance jointly Gaussian random variables whose covariance matrix has as its (i, j) element the correlation coefficient between X_i and X_j. A sign vector is a vector whose elements are each either +1 or −1.
Let be the -th Hermite-Chebyshev polynomial for . These polynomials form an orthonormal basis with respect to the Gaussian distribution :
Moreover, because Hermite-Chebyshev polynomials have zero means over a Gaussian distribution we have
Using a basis expansion approach similar to  we have
where is the vector of projection coefficients. The constraint translates to while the constraint is simplified to for . We also have
Thus the MCPCA optimization can be re-written as follows:
Since for , as . Thus we can approximate optimization with the following optimization
for sufficiently large .
The proof follows from the fact that the Ky Fan norm of a matrix is a continuous function of its elements and also as .
For the bivariate case (n = 2), the MCPCA optimization simplifies to the maximum correlation optimization. For jointly Gaussian variables the maximum correlation optimization has global optimizers given by the identity transformations, with sign variables chosen so that the correlation between the two transformed variables is positive. This can be seen immediately from the formulation as well: maximizing the off-diagonal entry of the covariance matrix maximizes its top eigenvalue. For the bivariate case the global optimizer places all the weight of the Hermite expansion on the first-order polynomial, which is the identity function, recovering the identity transformations.
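The mechanism behind the Gaussian result can be checked by Monte Carlo: for jointly Gaussian variables with correlation ρ, matched Hermite transformations of order k have correlation ρ^k, so weight on higher-order polynomials only shrinks the achievable correlation and the order-one (identity) transformation is optimal. A sketch with ρ = 0.6, using He₂(x) = x² − 1, the second probabilists' Hermite polynomial:

```python
import numpy as np

rng = np.random.default_rng(1)
rho, m = 0.6, 400_000
X, Y = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=m).T

def corr(a, b):
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return float((a * b).mean())

c1 = corr(X, Y)                   # identity transforms: about rho
c2 = corr(X**2 - 1, Y**2 - 1)     # second Hermite transforms: about rho**2
```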
Let be the set of covariance matrices of variables where for . In the bivariate case we have
Note that covariance matrices in this set have the same eigenvalues. Moreover in the bivariate case every covariance matrix can be written as a convex combination of covariance matrices in this set. Thus, it is majorized by covariance matrices in this set (Lemma ?). However in the multivariate case we may have covariance matrices that are not in the convex hull of this set. To illustrate this, let and consider
One can show that the covariance matrix
is not included in the convex hull of covariance matrices in . This covariance matrix results from having for . Thus techniques used to characterize global optimizers of the bivariate case may not extend to the multivariate case.
3.2 Global MCPCA Optimizers
Here we characterize global optimizers of optimization . Our main result is as follows:
This theorem along with Lemma ? results in the following corollary.
Below we present the proof of Theorem ?.
First we prove the following lemma:
We prove this lemma for . The case of can be shown by a successive application of the proof technique. Since is a positive semidefinite matrix we can write . Since diagonal elements of are one we have where is the -th column of . Then we have
Moreover we have
Next we prove the following result on matrix majorization:
Using Lemma ? we can write
where and . Then using Theorem 1 of  completes the proof.
Let be the covariance matrix of transformed variables . Using and for sufficiently large we have
Since we have . Using Lemma ? completes the proof.
4 MCPCA for Finite Discrete Random Variables
Let be a discrete random variable with distribution over the alphabet . Without loss of generality, we assume all alphabet elements have positive probabilities, as otherwise they can be neglected, i.e., for . Let be a function of the random variable with zero mean and unit variance. Using a basis expansion approach similar to , we have
Note that form an orthonormal basis with respect to the distribution of because
Moreover we have
Let be the joint distribution of discrete variables and . Define a matrix whose element is
This matrix is called the Q-matrix of the distribution. Note that
For , let
Consider in the feasible region of MCPCA optimization . We have
where , and for all . Using , we can represent these functions in terms of the basis functions:
Using , the constraint would be translated into for . Moreover using , the constraint is simplified to for . We also have
This shows that every feasible point of optimization corresponds to a feasible point of optimization . The converse argument is similar. This completes the proof.
Recall that is the -th largest singular value of the matrix corresponding to left and right singular vectors and , respectively.
First we show that the maximum singular value of the matrix is less than or equal to one. To show that, it is sufficient to show that for every vectors and such that and , we have . To show this, we define random variables and such that
Using the Cauchy–Schwarz inequality, we have
Therefore, the maximum singular value of is at most one.
Moreover and are right and left singular vectors of the matrix corresponding to the singular value one because and .
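The singular vector argument above is easy to verify numerically: taking u and v to be the entrywise square roots of the marginals gives Qv = u and Qᵀu = v, so one is a singular value of Q, and by the Cauchy–Schwarz bound it is the largest. A sketch with a hypothetical 2×3 joint distribution:

```python
import numpy as np

P = np.array([[0.20, 0.15, 0.05],
              [0.10, 0.05, 0.45]])     # hypothetical joint distribution
px, py = P.sum(axis=1), P.sum(axis=0)  # marginals (all positive)

Q = P / np.sqrt(np.outer(px, py))      # the Q-matrix of the distribution

# sqrt-marginal vectors are singular vectors with singular value one
u, v = np.sqrt(px), np.sqrt(py)
```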
In the following we use similar techniques to the ones employed in  to formulate an alternative and equivalent optimization without orthogonality constraints, which proves useful in characterizing a globally optimal MCPCA solution when q = 1.
Consider the matrix . This matrix is positive semidefinite and the only vectors in its null space are and . This is because for any vector we have
where the Cauchy–Schwarz inequality and are used. The inequality becomes an equality if and only if or . Moreover we have for because
where the last equality follows from the fact that is orthogonal to .
Define as follows:
We consider the unit variance formulation of the MCPCA optimization. We have
Moreover we have
Therefore optimization can be written as
We can write (since is positive semidefinite) where
Define . Thus, can be written as . The vector is the eigenvector corresponding to eigenvalue zero of the matrix (). All other eigenvalues of the matrix are equal to one. Since the matrix is not invertible, there are many choices for as a function of .
where can be an arbitrary scalar (note that ). However since the desired solution of the optimization is orthogonal to the vector , we choose (i.e., according to Lemma ?, in order to obtain a mean zero solution of the MCPCA optimization, we subtract the mean from the optimal solution of the unit variance optimization). Therefore we have