Subspace Perspective on Canonical Correlation Analysis: Dimension Reduction and Minimax Rates

# Subspace Perspective on Canonical Correlation Analysis: Dimension Reduction and Minimax Rates

Zhuang Ma   and   Xiaodong Li
###### Abstract

Canonical correlation analysis (CCA) is a fundamental statistical tool for exploring the correlation structure between two sets of random variables. In this paper, motivated by the recent success of applying CCA to learn low dimensional representations of high dimensional objects, we propose two losses based on the principal angles between the model spaces spanned by the sample canonical variates and their population correspondents, respectively. We further characterize the non-asymptotic error bounds for the estimation risks under the proposed error metrics, which reveal how the performance of sample CCA depends adaptively on key quantities including the dimensions, the sample size, the condition number of the covariance matrices and particularly the population canonical correlation coefficients. The optimality of our uniform upper bounds is also justified by lower-bound analysis based on stringent and localized parameter spaces. To the best of our knowledge, for the first time our paper separates and for the first order term in the upper bounds without assuming the residual correlations are zeros. More significantly, our paper derives for the first time in the non-asymptotic CCA estimation convergence rates, which is essential to understand the behavior of CCA when the leading canonical correlation coefficients are close to .

## 1 Introduction

Canonical correlation analysis (CCA), first introduced by Hotelling (1936), is a fundamental statistical tool to characterize the relationship between two groups of random variables and finds a wide range of applications across many different fields. For example, in genome-wide association study (GWAS), CCA is used to discover the genetic associations between the genotype data of single nucleotide polymorphisms (SNPs) and the phenotype data of gene expression levels (Witten et al., 2009; Chen et al., 2012). In information retrieval, CCA is used to embed both the search space (e.g. images) and the query space (e.g. text) into a shared low dimensional latent space such that the similarity between the queries and the candidates can be quantified (Rasiwasia et al., 2010; Gong et al., 2014). In natural language processing, CCA is applied to the word co-occurrence matrix and learns vector representations of the words which capture the semantics (Dhillon et al., 2011; Faruqui and Dyer, 2014). Other applications, to name a few, include fMRI data analysis (Friman et al., 2003), computer vision (Kim et al., 2007) and speech recognition (Arora and Livescu, 2013; Wang et al., 2015).

The enormous empirical success motivates us to revisit the estimation problem of canonical correlation analysis. Two theoretical questions are naturally posed: What are proper error metrics to quantify the discrepancy between population CCA and its sample estimates? And under such metrics, what are the quantities that characterize the fundamental statistical limits?

The justification of loss functions, in the context of CCA, has seldom appeared in the literature. From first principles that the proper metric to quantify the estimation loss should depend on the specific purpose of using CCA, we find that the applications discussed above mainly fall into two categories: identifying variables of interest and dimension reduction.

The first category, mostly in genomic research (Witten et al., 2009; Chen et al., 2012), treats one group of variables as responses and the other group of variables as covariates. The goal is to discover the specific subset of the covariates that are most correlated with the responses. Such applications are featured by low signal-to-noise ratio and the interpretability of the results is the major concern.

In contrast, the second category is investigated extensively in statistical machine learning and engineering community where CCA is used to learn low dimensional latent representations of complex objects such as images (Rasiwasia et al., 2010), text (Dhillon et al., 2011) and speeches (Arora and Livescu, 2013). These scenarios are usually accompanied with relatively high signal-to-noise ratio and the prediction accuracy, using the learned low dimensional embeddings as the new set of predictors, is of primary interest. In recent years, there has been a series of publications establishing fundamental theoretical guarantees for CCA to achieve sufficient dimension reduction (Kakade and Foster (2007); Foster et al. (2008); Sridharan and Kakade (2008); Fukumizu et al. (2009); Chaudhuri et al. (2009) and many others).

In this paper, we aim to address the problems raised above by treating CCA as a tool for dimension reduction.

### 1.1 Population and Sample CCA

Suppose and are two sets of variates with the joint covariance matrix

 Cov([xy])=Σ:=[ΣxΣxyΣ⊤xyΣy]. (1.1)

For simplicity, we assume

 E(Xi)=0, i=1,…,p1,E(Yj)=0, j=1,…,p2.

On the population level, CCA is designed to extract the most correlated linear combinations between two sets of random variables sequentially: The th pair of canonical variables and maximizes

 λi=Corr(Ui,Vi)

such that and have unit variances and they are uncorrelated to all previous pairs of canonical variables. Here is called the th pair of canonical loadings and is the th canonical correlation.

It is well known in multivariate statistical analysis that the canonical loadings can be found recursively by the following criterion:

 (ϕi,ψi)= argmax ϕ⊤Σxyψ (1.2) subject to ϕ⊤Σxϕ=1, ψ⊤Σyψ=1; ϕ⊤Σxϕj=0, ψ⊤Σyψj=0, ∀ 1≤j≤i−1.

Although this criterion is a nonconvex optimization, it can be obtained easily by spectral methods: Define , and . Then are singular values of , and are actually left and right singular vectors of , respectively.

For any given estimates of the leading canonical loadings, denoted by , the corresponding estimates for the canonical variables can be represented by

 ˆUi=ˆϕ⊤ix,ˆVi=ˆψ⊤iy,i=1…,p1∧p2.

To quantify the estimation loss, generally speaking, we can either focus on measuring the difference between the canonical loadings and or measuring the difference between the canonical variables and . Here in the definition of and are independent of the samples based on which are constructed. Therefore, for the discrepancy between the canonical variables, there is an extra layer of randomness.

As discussed above, in modern machine learning applications such as natural language processing and information retrieval, the leading sample canonical loadings are used for dimension reduction, i.e., for a new observation , ideally we hope to use the corresponding values of the canonical variables and to represent the observation in a low dimension space. Empirically, the actual low dimensional representations are and . Therefore, the discrepancy between the ideal dimension reduction and actual dimension reduction should be explained by how well approximate . Consequently, we choose to quantify the difference between the sample and population canonical variables instead of the canonical loadings.

### 1.3 Linear span

However, there are still many options to quantify how well the sample canonical variables approximate their population correspondents. To choose suitable losses, it is convenient to come back to specific applications to get some inspiration.

Motivated by applications in natural language processing and information retrieval, the model of multi-view sufficient dimension reduction has been studied in Foster et al. (2008). Roughly speaking, a statistical model was proposed by Foster et al. (2008) to study how to predict using two sets of predictors denoted by and , where the joint covariance of is

 Cov⎛⎜⎝⎡⎢⎣xyZ⎤⎥⎦⎞⎟⎠=⎡⎢ ⎢⎣ΣxΣxyσxzΣ⊤xyΣyσyzσ⊤xzσ⊤yzσ2z⎤⎥ ⎥⎦.

It was proven in Foster et al. (2008) that under certain assumptions, the leading canonical variables are sufficient dimension reduction for the linear prediction of ; That is, the best linear predictor of based on is the same as the best linear predictor based on . (Similarly, the best linear predictor of based on is the same as the best linear predictor based on .)

Notice that the best linear predictor is actually determined by the set of all linear combinations of (referred to as the “model space” in the literature of linear regression for prediction), which we denote as . Inspired by Foster et al. (2008), we propose to quantify the discrepancy between and by the discrepancy between the corresponding subspaces and (and similarly measure the difference between and by the distance between and ).

### 1.4 Hilbert spaces and principal angles

In this section, we define the discrepancy between and by introducing a Hilbert space. Noting that for any given sample , both and are composed by linear combinations of . Denote the set of all possible linear combinations as

 H=span(X1,…,Xp1). (1.3)

Moreover, for any , we define a bilinear function . It is easy to show that is an inner product and is a -dimensional Hilbert space, which is isomorphic to .

With the natural covariance-based inner product, we know both and are subspaces of , so it is natural to define their discrepancy based on their principal angles . In the literature of statistics and linear algebra, two loss functions are usually used

 Lmax(span(ˆU1,…,ˆUk), span(U1,…,Uk))=sin2(θ1)

and

 Lave(span(ˆU1,…,ˆUk), span(U1,…,Uk))=1k(sin2(θ1)+…+sin2(θk))

In spite of a somewhat abstract definition, we have the following clean formula for these two losses:

###### Theorem 1.1.

Suppose for any matrix , represents the orthogonal projector onto the column span of . Assume the observed sample is fixed. Then

 Lave(span(ˆU1,…,ˆUk),span(U1,…,Uk)) =12k∥∥PΣ1/2xˆΦ1:k−PΣ1/2xΦ1:k∥∥2F =1k∥∥(Ip1−PΣ1/2xΦ1:k)PΣ1/2xˆΦ1:k∥∥2F (1.4) =1kminQ∈Rk×kE[∥u⊤−ˆu⊤Q∥22] :=Lave(Φ1:k,ˆΦ1:k)

and

 Lmax(span(ˆU1,…,ˆUk),span(U1,…,Uk)) =∥∥PΣ1/2xˆΦ1:k−PΣ1/2xΦ1:k∥∥2 (1.5) =maxg∈RkminQ∈Rk×kE[((u⊤−ˆu⊤Q)g)2] :=Lmax(Φ1:k,ˆΦ1:k).

Here is a matrix consisting of the leading population canonical loadings for , and is its estimate based on a given sample. Moreover and .

### 1.5 Uniform upper bounds and minimax rates

The most important contribution of this paper is to establish sharp upper bounds for the estimation/prediction of CCA based on the proposed subspace losses and . It is noteworthy that both upper bounds hold uniformly for all invertible provided for some numerical constant . Furthermore, in order to justify the sharpness of these bounds, we also establish minimax lower bounds under a family of stringent and localized parameter spaces. These results will be detailed in \prettyrefsec:theory.

### 1.6 Notations and the Organization

Throughout the paper, we use lower-case and upper-case non-bolded letters to represent fixed and random variables, respectively. We also use lower-case and upper-case bold letters to represent vectors (which could be either deterministic or random) and matrices, respectively. For any matrix and vector , denotes operator (spectral) norm and Frobenius norm respectively, denotes the vector norm, denotes the submatrix consisting of the first columns of , and stands for the projection matrix onto the column space of . Moreover, we use and to represent the largest and smallest singular value of respectively, and to denote the condition number of the matrix. We use for the identity matrix of dimension and for the submatrix composed of the first columns of . Further, (and simply when ) stands for the set of matrices with orthonormal columns and denotes the set of strictly positive definite matrices. For a random vector , denotes the subspace of all the linear combinations of . Other notations will be specified within the corresponding context.

In the following, we will introduce our main upper and lower bound results in \prettyrefsec:theory. To highlight our contributions in the new loss functions and theoretical results, we will compare our results to existing work in the literature in \prettyrefsec:contributions. All proofs are deferred to \prettyrefsec:proofs.

## 2 Theory

In this section, we introduce our main results on non-asymptotic upper and lower bounds for estimating CCA under the proposed loss functions. It is worth recalling that are singular values of .

It is natural to estimate population CCA by its sample counterparts. Similar to equation (1.2), the sample canonical loadings are defined recursively by

 (ˆϕi,ˆψi)= argmax ϕ⊤ˆΣxyψ (2.1) subject to ϕ⊤ˆΣxϕ=1, ψ⊤ˆΣyψ=1; ϕ⊤ˆΣxϕj=0, ψ⊤ˆΣyψj=0, ∀ 1≤j≤i−1.

where are the sample covariance matrices. The sample canonical variables are defined as the following linear combinations by the sample canonical loadings:

 ˆUi=ˆϕ⊤ix,ˆVi=ˆψ⊤iy,i=1…,p1∧p2.

We prove the following upper bound for the estimate based on sample CCA.

###### Theorem 2.1.

(Upper bound) Suppose where is defined as in \prettyrefeq:covariance. Assume and are invertible. Moreover, assume for some predetermined . Then there exist universal positive constants such that if , the top- sample canonical coefficients matrix satisfies

 E[Lmax(Φ1:k,ˆΦ1:k)]≤C0[(1−λ2k)(1−λ2k+1)(λk−λk+1)2p1n+(p1+p2)2n2(λk−λk+1)4+e−γ(p1∧p2)] E[Lave(Φ1:k,ˆΦ1:k)]≤C0[(1−λ2k)(1−λ2k+1)(λk−λk+1)2p1−kn+(p1+p2)2n2(λk−λk+1)4+e−γ(p1∧p2)]

The upper bounds for can be obtained by switching and .

Since we pursue a nonasymptotic theoretical framework for CCA estimates, and the loss functions we propose are nonstandard in the literature, the standard minimax lower bound results in parametric maximum likelihood estimates do not apply straightforwardly. Instead, we turn to the nonparametric minimax lower bound frameworks, particularly those in PCA and CCA; See, e.g., Vu et al. (2013); Cai et al. (2013); Gao et al. (2015). Compared to these existing works, the technical novelties of our results and proofs are summarized in Sections 3.3 and 6.

We define the parameter space as the collection of joint covariance matrices satisfying

1. and ;

2. .

We deliberately set to demonstrate that the lower bound is independent of the condition number. For the rest of the paper, we will use the shorthand to represent this parameter space for simplicity.

###### Theorem 2.2.

(Lower bound) There exists a universal constant independent of and such that

The lower bounds for can be obtained by replacing with .

###### Corollary 2.3.

When and

 n≥C(p1+p2)(1+p2/p1)(λk−λk+1)2(1−λ2k)(1−λ2k+1) (2.2)

for some universal positive constant , the minimax rates can be characterized by

## 3 Related Work and Our Contributions

Recently, the non-asymptotic rate of convergence of CCA has been studied by Gao et al. (2015, 2017) under a sparse setup and by Cai and Zhang (2017) under the usual non-sparse setup. Cai and Zhang (2017) appeared on arXiv almost at the same time as the first version of our paper was posted. In this section, we state our contributions by detailed comparison with these works.

### 3.1 Novel loss funcitons

We proposed new loss functions based on the principal angles between the subspace spanned by the population canonical variates and the subspace spanned by the estimated canonical variates. In contrast, Gao et al. (2017) proposed and studied the loss ; Cai and Zhang (2017) proposed and studied both and , where

 ¯¯¯¯Lave(Φ1:k,ˆΦ1:k) ¯¯¯¯Lmax(Φ1:k,ˆΦ1:k)

and resemble our loss functions and respectively. By \prettyrefthm:formula, we also have

 Lave(Φ1:k,ˆΦ1:k) =2minQ∈Rk×kE[∥x⊤Φ1:k−x⊤ˆΦ1:kQ∥22 ∣∣ ˆΦ1:k] Lmax(Φ1:k,ˆΦ1:k) =maxg∈Rk,|g|=1minQ∈Rk×kE[((x⊤Φ1:k−x⊤ˆΦ1:kQ)g)2 ∣∣ ˆΦ1:k]

By these two expressions, we can easily obtain

 Lave(Φ1:k,ˆΦ1:k) ≤2¯¯¯¯Lave(Φ1:k,ˆΦ1:k) (3.1) Lmax(Φ1:k,ˆΦ1:k) ≤¯¯¯¯Lmax(Φ1:k,ˆΦ1:k)

However, and are not equivalent up to a constant. Neither are and . In fact, we can prove that as long as , if , then

 Lave(Φ1:k,ˆΦ1:k)=Lmax(Φ1:k,ˆΦ1:k)=0,

while almost surely and .

To illustrate this comparison, we can consider the following very simple simulation: Suppose , and and and . In this setup, we know the population canonical correlation coefficients are and , and the leading canonical loadings are and . In our simulation, we generated the following data matrices

 X=⎡⎢⎣0.07361.54961.5390−0.04150.9331−0.4776⎤⎥⎦

and

 Y=⎡⎢⎣0.07362.89821.5390−1.22140.93312.5931⎤⎥⎦.

Furthermore, we can obtain the sample canonical correlations and , as well as the leading sample canonical loadings and . Then while .

This numerical example clearly shows that the sample CCA can exactly identify that among all linear combinations of and and all linear combinations of and , and are mostly correlated. Our loss functions and do characterize this exact identification, whereas and do not.

Moreover, the following joint loss was studied in Gao et al. (2015):

 ¯¯¯¯Ljoint((Φ1:k,Ψ1:k),(ˆΦ1:k,ˆΨ1:k))=E[∥∥ˆΦ1:kˆΨ⊤1:k−Φ1:kΨ⊤1:k∥∥2F].

Similarly, almost surely under the special case .

### 3.2 Sharper upper bounds

Regardless of loss functions, we explain in the following why \prettyrefthm:upper implies sharper upper bounds than the existing rates in Gao et al. (2015), Gao et al. (2017) and Cai and Zhang (2017) under the nonsparse case. Our discussion is focused on in the following discussion while the discussion for is similar.

Notice that if we only apply Wedin’s sin-theta law, i.e., replacing the fine bound \prettyreflem:operator_bound with the rough bound \prettyreflem:estimation (also see Gao et al. (2015) for similar ideas), we can obtain the following rough bound:

 E[Lave(Φ1:k,ˆΦ1:k)]≤C0[p1+p2n(λk−λk+1)2]. (3.2)

In order to decouple the estimation error bound of from , both Gao et al. (2017) and Cai and Zhang (2017) assume the residual canonical correlations are zero, i.e.,

 λk+1=…=λp1∧p2=0.

This assumption is essential for proofs in both Gao et al. (2017) and Cai and Zhang (2017) under certain sample size conditions. We got rid of this assumption by developing new proof techniques and these techniques actually work for as well. A detailed comparison between our result and that in Cai and Zhang (2017) is summarized in Table \prettyreftable: comparison (The results of Gao et al. (2017) in the non-sparse regime can be implied by Cai and Zhang (2017) under milder sample size conditions).

Perhaps the most striking contribution of our upper bound is that we first derive the factors and in the literature of nonasymptotic CCA estimate. We now explain why these factors are essential when leading canonical correlation coefficients are close to .

#### Example 1: λk=1 and λk+1=0

Consider the example that , , and . Then our bound rates actually imply that

 ELave(ϕ1,ˆϕ1)≤Cp2n2,

while the rates in Gao et al. (2017) and Cai and Zhang (2017) imply that

 ELave(ϕ1,ˆϕ1)≤2E¯¯¯¯Lave(ϕ1,ˆϕ1)≤Cpn.

This shows that even under the condition , under our loss , our result could imply sharper convergence rates than that in Gao et al. (2017) and Cai and Zhang (2017) if .

Notice that as aforementioned, when , we can actually prove through a separate argument. How to improve \prettyrefthm:upper to imply this result is an open problem for future research.

#### Example 2: Both λk and λk+1 are close to 1

Consider the example that , , and . Then our bound rates actually imply that

 ELave(ϕ1,ˆϕ1)≤Cpn,

while the rough rates \prettyrefeq:rough_bound by Wedin’s sin-theta law implies

 ELave(ϕ1,ˆϕ1)≤C√pn.

This shows that our upper bound rates could be much sharper than the rough rates \prettyrefeq:rough_bound when both and are close to .

#### New proof techniques and connection to asymptotic theory

To the best of our knowledge, none of the analysis in Gao et al. (2015), Gao et al. (2017), Cai and Zhang (2017) can be used to obtain the multiplicative factor in the first order term of the upper bound, even under the strong condition that .

Following a different path, we do careful non-asymptotic entry-wise perturbation analysis of the estimating equations of CCA to avoid the loss of precision caused by applying matrix inequalities in the early stage of the proof. The main challenge is to analyze the properties of matrix hardmard products, especially to derive tight operator norm bounds for certain hardmard products. We are particularly luckily to find a divide-and-conquer approach ( and in the proof of \prettyreflem:operator_bound) to decompose the target matrices into simple-structure matrices where we can apply the tools developed in Lemma 5.6.

The asymptotic distribution of the canonical loadings has been studied by Anderson (1999) under the assumption that all the canonical correlations are distinct and . Since we focus on subspaces, we only require for the given . Both Anderson (1999) and our work are based on analyzing the estimating equations (\prettyrefeq:estimate) of CCA. Our analysis is more involved because completely novel techniques are required to obtain the factor in the nonasymptotic framework.

### 3.3 Sharper lower bounds under parameter spaces with fixed λk and λk+1

The minimax lower bounds for the estimation rates of CCA were first established by Gao et al. (2015, 2017) under the losses and . However, the parameter space discussed in Gao et al. (2017) requires . Moreover, the parameter space in Gao et al. (2015) is parameterized by satisfying , but is not specified. In fact, they also constructed the hypothesis class with and the resulting minimax lower bound is proportional to .

However, this minimax lower bound is not sharp when and are close. Suppose , , and . Our minimax lower bound in \prettyrefthm:lower leads to

 infˆΦ1:ksupΣ∈FE[Lave(Φ1:k,ˆΦ1:k)]≥O(1).

In contrast, to capture the fundamental limit of CCA estimates in this scenario under the framework of Gao et al. (2015), one needs to choose to capture both and , i.e., and hence . Then the resulting minimax lower bound rate will be , which is much looser than .

Technically speaking, we follow the analytical framework of Gao et al. (2015) and Gao et al. (2017), but the hypothesis classes construction requires any given instead of , and this brings in new technical challenges. More detailed technical discussions are deferred to \prettyrefsec:lower-proof.

## 4 Proof of \prettyrefthm:formula

Suppose the observed sample of is fixed and consider the correlation between the two subspaces of (defined in \prettyrefeq:hilbert): and . Let be the first, second, …, and th pair of canonical variates between and . Then , and , for any and , for .

By the definition of principal angles, we know is actually the th principal angle between and , i.e., . This implies that

 Lave(Φ1:k,ˆΦ1:k):=k∑i=1sin2θi=k∑i=1(1−∣∣⟨Wi,ˆWi⟩∣∣2).

Since are linear combinations of , we can denote

 w⊤:=(W1,…,Wk)=x⊤Σ−1/2xB,~{}and~{}^w⊤:=(ˆW1,…,ˆWk)=x⊤Σ