Sparse CCA: Adaptive Estimation and Computational Barriers thanks: The research of C. Gao and H. H. Zhou is supported in part by NSF Career Award DMS-0645676 and NSF FRG Grant DMS-0854975. The research of Z. Ma is supported in part by NSF Career Award DMS-1352060.

Chao Gao Yale University Zongming Ma University of Pennsylvania Harrison H. Zhou Yale University
Abstract

Canonical correlation analysis is a classical technique for exploring the relationship between two sets of variables. It has important applications in analyzing high-dimensional datasets originating from genomics, imaging and other fields. This paper considers adaptive minimax and computationally tractable estimation of leading sparse canonical coefficient vectors in high dimensions. First, we establish separate minimax estimation rates for canonical coefficient vectors of each set of random variables under no structural assumption on marginal covariance matrices. Second, we propose a computationally feasible estimator that attains the optimal rates adaptively under an additional sample size condition. Finally, we show that a sample size condition of this kind is needed for any randomized polynomial-time estimator to be consistent, assuming hardness of certain instances of the Planted Clique detection problem. This computational lower bound is faithful to the Gaussian models used in the paper. As a byproduct, we obtain the first computational lower bounds for sparse PCA under the Gaussian single spiked covariance model.

Keywords. Convex programming, group-Lasso, minimax rates, computational complexity, Planted Clique, sparse CCA (SCCA), sparse PCA (SPCA)

1 Introduction

Canonical correlation analysis (CCA) [23] is a classical and important tool in multivariate statistics [1, 31]. For two random vectors $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^q$, at the population level, CCA finds successive vectors $u_j \in \mathbb{R}^p$ and $v_j \in \mathbb{R}^q$, $j = 1, 2, \ldots$, (called canonical coefficient vectors) that solve

(1) $\max_{u \in \mathbb{R}^p,\, v \in \mathbb{R}^q}\ u^\top \Sigma_{xy} v$
subject to $u^\top \Sigma_x u = v^\top \Sigma_y v = 1$, $u^\top \Sigma_x u_k = v^\top \Sigma_y v_k = 0$ for all $1 \le k < j$,

where $\Sigma_x = \mathrm{Cov}(X)$, $\Sigma_y = \mathrm{Cov}(Y)$, and $\Sigma_{xy} = \mathrm{Cov}(X, Y)$. Since our primary interest lies in the covariance structure among $X$ and $Y$, we assume that their means are zeros from here on. Then the linear combinations $(u_j^\top X, v_j^\top Y)$ are the $j$-th pair of canonical variates. This technique has been widely used in various scientific fields to explore the relationship between two sets of variables. In practice, one does not have knowledge about the population covariance, and $\Sigma_x$, $\Sigma_y$, and $\Sigma_{xy}$ are replaced by their sample versions $\hat\Sigma_x$, $\hat\Sigma_y$, and $\hat\Sigma_{xy}$ in (1).
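For concreteness, the following NumPy sketch computes classical sample CCA by whitening the sample cross-covariance and taking an SVD. The function names are illustrative, and the sketch assumes $n$ exceeds both dimensions so that the sample covariance matrices are invertible; it is exactly this inversion that breaks down in high dimensions.

```python
import numpy as np

def psd_sqrt(A):
    """Principal square root of a symmetric positive semi-definite matrix."""
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(np.sqrt(np.clip(w, 0, None))) @ Q.T

def sample_cca(X, Y, r=1):
    """Classical sample CCA (no sparsity). Rows of X (n x p) and Y (n x q) are observations.

    Returns the top-r sample canonical correlations and coefficient vectors.
    Assumes n > max(p, q) so that the sample covariance matrices are invertible.
    """
    n = X.shape[0]
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Sx, Sy, Sxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n
    Sx_isqrt = np.linalg.inv(psd_sqrt(Sx))
    Sy_isqrt = np.linalg.inv(psd_sqrt(Sy))
    # Singular values of Sx^{-1/2} Sxy Sy^{-1/2} are the sample canonical correlations.
    Phi, lam, Psi_t = np.linalg.svd(Sx_isqrt @ Sxy @ Sy_isqrt)
    U_hat = Sx_isqrt @ Phi[:, :r]     # satisfies U_hat' Sx U_hat = I_r
    V_hat = Sy_isqrt @ Psi_t.T[:, :r]
    return lam[:r], U_hat, V_hat
```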

Recently, there has been growing interest in applying CCA to the analysis of high-dimensional datasets, where the dimensions $p$ and $q$ can be much larger than the sample size $n$. It has been well understood that classical CCA breaks down in this regime [25, 4, 18]. Motivated by genomics, neuroimaging and other applications, researchers have become interested in seeking sparse leading canonical coefficient vectors. Various estimation procedures imposing sparsity on canonical coefficient vectors have been developed in the literature and are usually termed sparse CCA. See, for example, [42, 43, 33, 22, 27, 38, 3].

The theoretical aspect of sparse CCA has also been investigated in the literature. A useful model for studying sparse CCA is the canonical pair model proposed in [14]. In particular, suppose there are $r$ pairs of canonical coefficient vectors (and canonical variates) among the two sets of variables, then the model reparameterizes the cross-covariance matrix as

(2) $\Sigma_{xy} = \Sigma_x U \Lambda V^\top \Sigma_y.$

Here $U = [u_1, \ldots, u_r]$ and $V = [v_1, \ldots, v_r]$ collect the canonical coefficient vectors and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_r)$ with $1 > \lambda_1 \ge \cdots \ge \lambda_r > 0$ are the ordered canonical correlations. Let $S_u$ and $S_v$ be the indices of nonzero rows of $U$ and $V$. One way to impose sparsity on the canonical coefficient vectors is to require the sizes of $S_u$ and $S_v$ to be small, namely $|S_u| \le s_u$ and $|S_v| \le s_v$ for some $s_u \le p$ and $s_v \le q$. Under this model, Gao et al. [18] showed that the minimax rate for estimating $U$ and $V$ under the joint loss function is

(3)

However, to achieve this rate, Gao et al. [18] used a computationally infeasible and nonadaptive procedure, which requires an exhaustive search over all possible subsets of the given cardinalities as well as knowledge of $s_u$ and $s_v$. Moreover, it is unclear from (3) whether the estimation error of $U$ depends on the sparsity and the ambient dimension of $V$, and vice versa.
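For intuition, here is a small simulation sketch of the canonical pair model (2) in the rank-one case with identity marginal covariances. The dimensions, sparsity levels and canonical correlation below are arbitrary illustrative choices, not values used elsewhere in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, n = 200, 150, 300
s_u, s_v, lam = 5, 8, 0.9          # row sparsities and the canonical correlation

# Sparse canonical coefficient vectors; with identity marginal covariances the
# normalization u' Sigma_x u = v' Sigma_y v = 1 is just unit Euclidean norm.
u = np.zeros(p); u[:s_u] = rng.standard_normal(s_u); u /= np.linalg.norm(u)
v = np.zeros(q); v[:s_v] = rng.standard_normal(s_v); v /= np.linalg.norm(v)

Sx, Sy = np.eye(p), np.eye(q)
Sxy = lam * Sx @ np.outer(u, v) @ Sy        # cross-covariance of the form (2)
Sigma = np.block([[Sx, Sxy], [Sxy.T, Sy]])

Z = rng.multivariate_normal(np.zeros(p + q), Sigma, size=n)
X, Y = Z[:, :p], Z[:, p:]                   # n i.i.d. pairs from the model
```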

The goal of the present paper is to study three fundamental questions in sparse CCA: (1) What are the minimax rates for estimating the canonical coefficient vectors on the two sets of variables separately? (2) Is there a computationally efficient and sparsity-adaptive method that achieves the optimal rates? (3) What is the price one has to pay to achieve the optimal rates in a computationally efficient way?

1.1 Main contributions

We now introduce the main contributions of the present paper from three different viewpoints as suggested by the three questions we have raised.

Separate minimax rates

The joint loss studied by [18] characterizes the joint estimation error of both $U$ and $V$. In this paper, we provide a finer analysis by studying the individual estimation errors of $U$ and $V$ under a natural loss function that can be interpreted as the prediction error of canonical variates. The exact definition of the loss functions is given in Section 2. Separate minimax rates are obtained for $U$ and $V$. In particular, we show that the minimax rate of convergence for estimating $U$ depends only on $s_u$ and $p$, but not on $s_v$ or $q$. Consequently, if $U$ is sparser than $V$, then the convergence rate for estimating $U$ can be faster than that for estimating $V$. Such a difference is not reflected by the joint loss, since its minimax rate (3) is determined by the slower of the rates of estimating $U$ and $V$.

Adaptive estimation

As pointed out in [14] and [18], sparse CCA is a more difficult problem than the well-studied sparse PCA. A naive application of sparse PCA algorithms to sparse CCA can lead to inconsistent results [14]. The additional difficulty in sparse CCA mainly comes from the presence of the nuisance parameters $\Sigma_x$ and $\Sigma_y$, which cannot be estimated consistently in a high-dimensional regime in general. Therefore, our goal is to design an estimator that is adaptive to both the nuisance parameters and the sparsity levels. Under the canonical pair model, we propose a computationally efficient algorithm. The algorithm has two stages. In the first stage, we propose a convex program for sparse CCA based on a tight convex relaxation of a combinatorial program in [18], obtained by considering the smallest convex set containing all matrices of the form $UV^\top$ with both $\Sigma_x^{1/2}U$ and $\Sigma_y^{1/2}V$ being rank-$r$ orthogonal matrices. The convex program can be efficiently solved by the Alternating Direction Method of Multipliers (ADMM) [16, 11]. Based on the output of the first stage, we formulate a sparse linear regression problem in the second stage to improve estimation accuracy, and the final estimators $\hat U$ and $\hat V$ can be obtained via a group-Lasso algorithm [45]. Under the sample size condition that

(4)

for some sufficiently large constant, we show that the resulting estimators $\hat U$ and $\hat V$ recover the true canonical coefficient matrices $U$ and $V$ within the optimal error rates adaptively with high probability.

Computational lower bound

We require the sample size condition (4) for the adaptive procedure to achieve optimal rates of convergence. Assuming hardness of certain instances of the Planted Clique detection problem, we provide a computational lower bound to show that a condition of this kind is unavoidable for any computationally feasible estimation procedure to achieve consistency. Up to an asymptotically equivalent discretization which is necessary for computational complexity to be well-defined, our computational lower bound is established directly for the Gaussian canonical pair model used throughout the paper.

An analogous sample size condition has been imposed in the sparse PCA literature [24, 29, 13, 37], namely $n \gtrsim s^2 \log p / \lambda^2$, where $s$ is the sparsity of the leading eigenvector and $\lambda$ is the gap between the leading eigenvalue and the rest of the spectrum. Berthet and Rigollet [7] showed that if there existed a polynomial-time algorithm for a generalized sparse PCA detection problem while such a condition is violated, the algorithm could be turned (in randomized polynomial time) into a detection method for the Planted Clique problem in a regime where it is believed to be computationally intractable. However, both the null and the alternative hypotheses in the sparse PCA detection problem were generalized in [7] to include all multivariate distributions whose quadratic forms satisfy certain uniform tail probability bounds, so the distributions need not be Gaussian or have a spiked covariance structure [24]. The same remark also applies to the subsequent work on sparse PCA estimation [39]. Hence, the computational lower bound for sparse PCA was only established for such enlarged parameter spaces. As a byproduct of our analysis, we establish the desired computational lower bound for sparse PCA in the Gaussian single spiked covariance model.

1.2 Organization

After an introduction to notation, the rest of the paper is organized as follows. In Section 2, we formulate the sparse CCA problem by defining its parameter space and loss function. Section 3 presents separate minimax rates for estimating $U$ and $V$. Section 4 proposes a two-stage adaptive estimator that is shown to be minimax rate optimal under an additional sample size condition. Section 5 shows that a condition of this kind is necessary for any randomized polynomial-time estimator to achieve consistency by establishing new computational lower bounds for sparse PCA and sparse CCA. Section 6 presents proofs of the theoretical results in Section 4. Implementation details of the adaptive procedure, numerical studies, additional proofs and technical details are deferred to the supplement [19].

1.3 Notation

For any positive integer $d$, $[d]$ denotes the set $\{1, 2, \ldots, d\}$. For any set $S$, $|S|$ denotes its cardinality and $S^c$ its complement. For any event $E$, $\mathbf{1}_{E}$ denotes its indicator function. For any $a, b \in \mathbb{R}$, $\lceil a \rceil$ denotes the smallest integer no smaller than $a$, $\lfloor a \rfloor$ the largest integer no larger than $a$, $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. For a vector $u = (u_1, \ldots, u_d)^\top$, its Euclidean norm is $\|u\| = (\sum_{i=1}^d u_i^2)^{1/2}$. For any matrix $A = (a_{ij}) \in \mathbb{R}^{d_1 \times d_2}$, $A_{i\cdot}$ denotes its $i$-th row and $\mathrm{supp}(A) = \{i : A_{i\cdot} \neq 0\}$, the index set of nonzero rows, is called its support. For any subset $B \subseteq [d_1] \times [d_2]$, $A_B$ is obtained by keeping all entries in $B$ and replacing all entries in $B^c$ with zeros. We write $A_{B_1 \cdot}$ for $A_{B_1 \times [d_2]}$ and $A_{\cdot B_2}$ for $A_{[d_1] \times B_2}$. Note that $A_{B_1 B_2} \in \mathbb{R}^{d_1 \times d_2}$ while $A_{[B_1, B_2]}$ stands for the corresponding nonzero submatrix of size $|B_1| \times |B_2|$. In addition, $P_A$ stands for the projection matrix onto the column space of $A$, $O(d, r)$ denotes the set of all $d \times r$ matrices with orthonormal columns and $O(r) = O(r, r)$. Furthermore, $\sigma_i(A)$ stands for the $i$-th largest singular value of $A$, and $\sigma_{\max}(A) = \sigma_1(A)$, $\sigma_{\min}(A) = \sigma_{d_1 \wedge d_2}(A)$. The Frobenius norm and the operator norm of $A$ are $\|A\|_{\mathrm{F}} = (\sum_{ij} a_{ij}^2)^{1/2}$ and $\|A\|_{\mathrm{op}} = \sigma_1(A)$, respectively. The entrywise $\ell_1$ norm and the nuclear norm are $\|A\|_1 = \sum_{ij} |a_{ij}|$ and $\|A\|_* = \sum_i \sigma_i(A)$, respectively. If $A$ is a square matrix, its trace is $\mathrm{Tr}(A) = \sum_i a_{ii}$. For two square matrices $A$ and $B$, we write $A \le B$ if $B - A$ is positive semidefinite. For any positive semi-definite matrix $A$, $A^{1/2}$ denotes its principal square root, which is positive semi-definite and satisfies $A^{1/2} A^{1/2} = A$. The trace inner product of two matrices $A$ and $B$ of the same dimensions is $\langle A, B \rangle = \mathrm{Tr}(A^\top B)$. Given a random element $X$, $\mathcal{L}(X)$ denotes its probability distribution. The symbol $C$ and its variants $C_1$, $C'$, etc. are generic constants and may vary from line to line, unless otherwise specified. The symbols $\mathbb{P}$ and $\mathbb{E}$ stand for generic probability and expectation operators when the distribution is clear from the context.
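For concreteness, a few of the matrix quantities above written out in NumPy (the helper names are ours):

```python
import numpy as np

def row_support(A, tol=0.0):
    """supp(A): index set of nonzero rows of A."""
    return np.flatnonzero(np.linalg.norm(A, axis=1) > tol)

def proj_colspace(A):
    """Projection matrix onto the column space of A (A assumed full column rank)."""
    Q, _ = np.linalg.qr(A)
    return Q @ Q.T

def psd_sqrt(A):
    """Principal square root of a positive semi-definite matrix."""
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(np.sqrt(np.clip(w, 0, None))) @ Q.T

def nuclear_norm(A):
    """Sum of the singular values of A."""
    return np.linalg.svd(A, compute_uv=False).sum()
```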

2 Problem Formulation

2.1 Parameter space

Consider a canonical pair model where the observed pairs of measurement vectors $(X_i, Y_i)$, $i = 1, \ldots, n$, are i.i.d. from a multivariate Gaussian distribution $N(0, \Sigma)$ where

$\Sigma = \begin{pmatrix} \Sigma_x & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_y \end{pmatrix}$,

with the cross-covariance matrix $\Sigma_{xy}$ satisfying (2). We are interested in the situation where the leading canonical coefficient vectors are sparse. One way to quantify the level of sparsity is to bound how many nonzero rows there are in the matrices $U$ and $V$. This notion of sparsity has been used previously in both sparse PCA [13, 37] and sparse CCA [18] problems when one seeks multiple sparse vectors simultaneously.

Recall that for any matrix $A$, $\mathrm{supp}(A)$ collects the indices of the nonzero rows of $A$. Adopting the above notion of sparsity, we define the parameter space $\mathcal{F}$ to be the collection of all covariance matrices $\Sigma$ with the structure (2) satisfying

(5)

The probability space we consider is

(6)

where $n$ is the sample size. We shall allow $(p, q, s_u, s_v, \lambda)$ to vary with $n$, while $r$ is restricted to be an absolute constant.

2.2 Prediction loss

From now on, the presentation of definitions and results will focus only on $U$, since those for $V$ can be obtained via symmetry. Given an estimator $\hat U$ of the leading canonical coefficient vectors $U$ for $X$, a natural way of assessing its quality is to see how well it predicts the values of the canonical variables $U^\top X^*$ for a new observation $X^*$ which is independent of, and identically distributed as, the training sample used to obtain $\hat U$. This leads us to consider the following loss function

(7) $L(\hat U, U) = \min_{W \in O(r)} \mathbb{E}_{X^*} \big\| W^\top \hat U^\top X^* - U^\top X^* \big\|^2,$

where $\mathbb{E}_{X^*}$ means taking expectation only over $X^*$ and so $L(\hat U, U)$ is still a random quantity due to the randomness of $\hat U$. Since $L(\hat U, U)$ is the expected squared error for predicting the canonical variables $U^\top X^*$ via $\hat U^\top X^*$, we refer to it as the prediction loss from now on. It is worth noting that the introduction of an orthogonal matrix $W$ is unavoidable. To see this, we can simply consider the case where $\Lambda = \lambda I_r$ in (2); then we can replace the pair $(U, V)$ in (2) by $(UW, VW)$ for any $W \in O(r)$. In other words, the canonical coefficient vectors are only determined up to a joint orthogonal transform. If we work out the $\mathbb{E}_{X^*}$ part in the definition (7), then the loss function can be equivalently defined as

(8) $L(\hat U, U) = \min_{W \in O(r)} \big\| \Sigma_x^{1/2} (\hat U W - U) \big\|_{\mathrm{F}}^2.$

By symmetry, we can define $L(\hat V, V)$ by simply replacing $\hat U$, $U$, $X^*$ and $\Sigma_x$ in (7) and (8) with $\hat V$, $V$, $Y^*$ and $\Sigma_y$.
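The inner minimization over $W$ in (8) is an orthogonal Procrustes problem and can be computed in closed form. The following sketch evaluates the prediction loss for a given pair $(\hat U, U)$ and known $\Sigma_x$ (the function names are ours):

```python
import numpy as np

def psd_sqrt(A):
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(np.sqrt(np.clip(w, 0, None))) @ Q.T

def prediction_loss(U_hat, U, Sx):
    """L(U_hat, U) = min over orthogonal W of || Sx^{1/2} (U_hat W - U) ||_F^2.

    With A = Sx^{1/2} U_hat and B = Sx^{1/2} U, the minimizer is the orthogonal
    Procrustes solution W = P Q', where A'B = P D Q' is a singular value decomposition.
    """
    R = psd_sqrt(Sx)
    A, B = R @ U_hat, R @ U
    P, _, Qt = np.linalg.svd(A.T @ B)
    W = P @ Qt
    return np.linalg.norm(A @ W - B, 'fro') ** 2
```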

A related loss function is measuring the difference between two subspaces. By Proposition C.2 in the supplementary material [19], the prediction loss is a stronger loss function. That is, for some constant only depending on . Actually, is strictly stronger. To see this, let , and . Then, , while . In this paper, we will focus on the stronger loss , and provide brief remarks on results for .

3 Minimax Rates

We first provide a minimax upper bound using a combinatorial optimization procedure, and then show that the resulting rate is optimal by further providing a matching minimax lower bound.

Let $(X_i, Y_i)$, $i = 1, \ldots, n$, be i.i.d. observations following $N(0, \Sigma)$ for some $\Sigma$ in the parameter space. For notational convenience, we assume the sample size is divisible by three, i.e., $n = 3m$ for some integer $m$.

Procedure

To obtain the minimax upper bound, we propose a two-stage combinatorial optimization procedure. We split the data into three equal-size batches $D_1$, $D_2$ and $D_3$, and denote the sample covariance matrices computed on batch $D_i$ by $\hat\Sigma_x^{(i)}$, $\hat\Sigma_y^{(i)}$ and $\hat\Sigma_{xy}^{(i)}$ for $i = 1, 2, 3$.
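A sketch of the three-way sample split and the batch-wise plug-in covariance matrices used below (the function name is ours; the means are taken to be zero, as assumed in Section 1):

```python
import numpy as np

def split_and_covariances(X, Y, rng):
    """Randomly split the n pairs into three equal-size batches and return,
    for each batch, the plug-in covariance estimates (Sx, Sy, Sxy)."""
    n = X.shape[0]
    batches = np.array_split(rng.permutation(n), 3)
    out = []
    for idx in batches:
        Xb, Yb, m = X[idx], Y[idx], len(idx)
        out.append((Xb.T @ Xb / m, Yb.T @ Yb / m, Xb.T @ Yb / m))
    return out   # [(Sx1, Sy1, Sxy1), (Sx2, Sy2, Sxy2), (Sx3, Sy3, Sxy3)]
```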

In the first stage, we find $(\hat A, \hat B)$ which solves the following program:

(9) $\max_{A \in \mathbb{R}^{p \times r},\, B \in \mathbb{R}^{q \times r}}\ \mathrm{Tr}\big(A^\top \hat\Sigma_{xy}^{(1)} B\big)$
subject to $A^\top \hat\Sigma_x^{(1)} A = B^\top \hat\Sigma_y^{(1)} B = I_r$, $|\mathrm{supp}(A)| \le s_u$, $|\mathrm{supp}(B)| \le s_v$.

In the second stage, we further refine the estimator for $U$ by finding $\check A$ solving

(10) $\min_{A \in \mathbb{R}^{p \times r}}\ \frac{1}{m} \sum_{i \in D_2} \big\| \hat B^\top Y_i - A^\top X_i \big\|^2$
subject to $|\mathrm{supp}(A)| \le s_u$.

The final estimator is a normalized version of $\check A$, defined as

(11) $\hat U = \check A \big( \check A^\top \hat\Sigma_x^{(3)} \check A \big)^{-1/2}.$

The purpose of sample splitting employed in the above procedure is to facilitate the proof.

Theory and discussion

The program (9) was first proposed in [18] as a sparsity-constrained version of the classical CCA formulation. However, the resulting estimator of $U$ has a convergence rate that involves the sparsity level and the ambient dimension of the matrix $V$ [18, Theorem 1], which is sub-optimal. The second stage of the procedure is thus proposed to further pursue the optimal estimation rates. First, if we were given the knowledge of $V$, then the least squares solution of regressing $V^\top Y$ on $X$ is

(12) $U \Lambda = \operatorname*{argmin}_{A \in \mathbb{R}^{p \times r}} \mathbb{E} \big\| V^\top Y - A^\top X \big\|^2 = \operatorname*{argmin}_{A \in \mathbb{R}^{p \times r}} \big\{ \mathrm{Tr}(A^\top \Sigma_x A) - 2\, \mathrm{Tr}(A^\top \Sigma_{xy} V) + \mathbb{E} \| V^\top Y \|^2 \big\} = \operatorname*{argmin}_{A \in \mathbb{R}^{p \times r}} \big\{ \mathrm{Tr}(A^\top \Sigma_x A) - 2\, \mathrm{Tr}(A^\top \Sigma_{xy} V) \big\},$

where the expectation is with respect to the distribution $N(0, \Sigma)$. The second equality results from taking expectation over each of the three terms in the expansion of the squared Euclidean norm, and the last equality holds since $\mathbb{E}\|V^\top Y\|^2$ does not involve the argument to be optimized over. The first equality can be verified by noting that the minimizer of the last display is $\Sigma_x^{-1}\Sigma_{xy}V = U\Lambda V^\top \Sigma_y V = U\Lambda$, by (2) and the normalization $V^\top \Sigma_y V = I_r$. In fact, from the canonical pair model one can easily derive a regression interpretation of CCA, $V^\top Y = \Lambda U^\top X + Z$, where $Z$ is independent of $X$ with $\mathrm{Cov}(Z) = I_r - \Lambda^2$. Then, (10) is a least squares formulation of this regression interpretation. However, CCA is different from regression because the response $V^\top Y$ depends on the unknown $V$. Comparing (10) with (12), it is clear that (10) is a sparsity-constrained version of (12) in which the knowledge of $V$ and the population covariance matrices are replaced by the initial estimator $\hat B$ and sample covariance matrices from an independent sample. Therefore, $\check A$ can be viewed as an estimator of $U\Lambda$. Hence, a final normalization step is taken in (11) to transform it into an estimator of $U$.
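A minimal numerical sketch of the regression-plus-normalization idea: regressing the (estimated) canonical variates of $Y$ on $X$ targets $U\Lambda$, and the normalization rescales the result so that it is orthonormal with respect to an independent sample covariance of $X$. The sparsity constraint in (10) is dropped here for brevity, so this illustrates the idea rather than the actual procedure.

```python
import numpy as np

def psd_inv_sqrt(M):
    """Inverse principal square root of a symmetric positive definite matrix."""
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

def refine_and_normalize(X2, Y2, X3, B_hat):
    """Unconstrained analogue of (10) followed by a normalization step.

    X2, Y2: batch used for the regression; X3: independent batch used for
    normalization; B_hat: initial estimate of V (q x r).
    Regressing B_hat' Y on X estimates U * Lambda; the final step rescales it
    so that the output A satisfies A' Sx A = I_r (hence it targets U itself).
    """
    targets = Y2 @ B_hat                                      # rows are (B_hat' Y_i)'
    A_check, *_ = np.linalg.lstsq(X2, targets, rcond=None)    # p x r, estimates U Lambda
    Sx3 = X3.T @ X3 / X3.shape[0]
    return A_check @ psd_inv_sqrt(A_check.T @ Sx3 @ A_check)
```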

We now state a bound for the final estimator (11).

Theorem 3.1.

Assume

(13)

for some sufficiently small constant . Then there exist constants only depending on such that

(14)

with -probability at least uniformly over .

Remark 3.1.

The paper assumes that is a constant. However, it is worth noting that the minimax upper bound of Theorem 3.1 does not depend on even if is allowed to grow with . To be specific, assume the eigenvalues of are bounded in the interval . The convergence rate of would still be , because the dependence on has been implicitly built into the prediction loss. On the other hand, a convergence rate for the loss would be , with an extra factor of the condition number of .

Under assumption (13), Theorem 3.1 achieves a convergence rate for the prediction loss in that does not depend on any parameter related to . Note that the probability tail still involves and . However, it can be shown that , and so the corresponding term in the tail probability goes to as long as . The optimality of this upper bound can be justified by the following minimax lower bound.

Theorem 3.2.

Assume that . Then there exists some constant only depending on and an absolute constant , such that

where .

By Theorem 3.1 and Theorem 3.2, the rate in (14), whenever it is upper bounded by a constant, is the minimax rate of the problem.

4 Adaptive and Computationally Efficient Estimation

Section 3 determines the minimax rates for estimating $U$ and $V$ under the prediction loss. However, the procedure (9) – (11) has two drawbacks. One is that it requires knowledge of the sparsity levels $s_u$ and $s_v$; it is thus not adaptive. The other is that both stages require an exhaustive search over all subsets of given sizes in the optimization problems (9) and (10), and hence the computational cost is formidable.

In this section, we overcome both drawbacks by proposing a two-stage convex programming approach to sparse CCA. The procedure is named CoLaR, standing for Convex program with group-Lasso Refinement. It is not only computationally feasible but also achieves the minimax estimation error rates adaptively over a large collection of parameter spaces under an additional sample size condition. Issues related to this additional sample size condition will be discussed in more detail in Section 5.

4.1 Estimation scheme

The basic principle underlying the computationally feasible estimation scheme is to seek tight convex relaxations of the combinatorial programs (9) – (10). In what follows, we introduce the convex relaxations for the two stages in order. As in Section 3, we assume that the data are split into three batches $D_1$, $D_2$ and $D_3$ of equal size and, for $i = 1, 2, 3$, let $\hat\Sigma_x^{(i)}$, $\hat\Sigma_y^{(i)}$ and $\hat\Sigma_{xy}^{(i)}$ be defined as before.

First stage

By the definition of the trace inner product, the objective function in (9) can be rewritten as $\mathrm{Tr}(A^\top \hat\Sigma_{xy}^{(1)} B) = \langle \hat\Sigma_{xy}^{(1)}, A B^\top \rangle$. Since it is linear in $AB^\top$, this suggests treating $AB^\top$ as a single argument rather than optimizing over $A$ and $B$ separately. Next, the support size constraints imply that the vector $\ell_0$ norm $\|AB^\top\|_0 \le s_u s_v$. Applying the convex relaxation of the $\ell_0$ norm by the $\ell_1$ norm and including it as a Lagrangian term, we are led to consider a new objective function

(15) $\langle \hat\Sigma_{xy}^{(1)}, F \rangle - \rho_1 \| F \|_1,$

where $F$ serves as a surrogate for $AB^\top$, $\|\cdot\|_1$ denotes the vector $\ell_1$ norm of the matrix argument, and $\rho_1$ is a penalty parameter controlling sparsity. Note that (15) is the maximization problem of a concave function, which becomes a convex program if the constraint set is convex. Under the identity $F = AB^\top$, the normalization constraint in (9) reduces to

(16) $F \in \mathcal{A} = \big\{ A B^\top : A^\top \hat\Sigma_x^{(1)} A = B^\top \hat\Sigma_y^{(1)} B = I_r \big\}.$

Naturally, we relax it to $F \in \mathcal{C}$, where

(17) $\mathcal{C} = \big\{ F : \big\| (\hat\Sigma_x^{(1)})^{1/2} F (\hat\Sigma_y^{(1)})^{1/2} \big\|_* \le r,\ \big\| (\hat\Sigma_x^{(1)})^{1/2} F (\hat\Sigma_y^{(1)})^{1/2} \big\|_{\mathrm{op}} \le 1 \big\}$

is the smallest convex set containing $\mathcal{A}$. The relation (17) is stated in the proof of Theorem 3 of [40]. Combining (15) – (17), we use the following convex program for the first stage in our adaptive estimation scheme:

(18) $\hat F = \operatorname*{argmax}_{F \in \mathbb{R}^{p \times q}}\ \langle \hat\Sigma_{xy}^{(1)}, F \rangle - \rho_1 \| F \|_1$
subject to $F \in \mathcal{C}$.

Implementation of (18) is discussed in Section D in the supplement [19].
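The sketch below is only a simplified illustration of an ADMM splitting for (18) in the special case of identity marginal covariances, where the constraint set (17) reduces to $\{F : \|F\|_* \le r, \|F\|_{\mathrm{op}} \le 1\}$ and both updates have closed forms (entrywise soft-thresholding and a projection of the singular values). It is not the implementation of the supplement, which handles general marginal covariances.

```python
import numpy as np

def soft(M, t):
    """Entrywise soft-thresholding."""
    return np.sign(M) * np.maximum(np.abs(M) - t, 0.0)

def project_spectral(M, r):
    """Project M onto {F : ||F||_op <= 1, ||F||_* <= r} by projecting its
    singular values onto the capped simplex {0 <= x <= 1, sum(x) <= r}."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    x = np.clip(s, 0.0, 1.0)
    if x.sum() > r:
        lo, hi = 0.0, s.max()            # bisection on the dual variable
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            if np.clip(s - mid, 0.0, 1.0).sum() > r:
                lo = mid
            else:
                hi = mid
        x = np.clip(s - hi, 0.0, 1.0)
    return (U * x) @ Vt

def admm_first_stage(Sxy, r, rho, t=1.0, n_iter=300):
    """ADMM for  max_F <Sxy, F> - rho * ||F||_1  subject to the spectral
    constraints above (identity marginal covariances assumed)."""
    F = np.zeros_like(Sxy)
    H = np.zeros_like(Sxy)               # copy of F living in the constraint set
    W = np.zeros_like(Sxy)               # scaled dual variable
    for _ in range(n_iter):
        F = soft(H - W + Sxy / t, rho / t)   # prox step for the l1-penalized linear term
        H = project_spectral(F + W, r)       # projection step
        W = W + F - H                        # dual update
    return H
```

With general marginal covariances the constraint couples $F$ through $(\hat\Sigma_x^{(1)})^{1/2}$ and $(\hat\Sigma_y^{(1)})^{1/2}$, so the projection step no longer acts directly on the singular values of $F$; handling this is precisely the purpose of the implementation described in the supplement.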

Remark 4.1.

A related but different convex relaxation was proposed in [37] for the sparse PCA problem, where the set of all rank-$r$ projection matrices (which are symmetric) is relaxed to its convex hull, the Fantope. Such an idea is not directly applicable in the current setting due to the asymmetric nature of the matrices included in the set in (16).

Remark 4.2.

The risk of the solution to (18) for estimating $UV^\top$ is sub-optimal compared to the optimal rates determined in [18]. See Theorem 4.1 below. Nonetheless, it leads to a reasonable estimator for the subspaces spanned by the first $r$ left and right canonical coefficient vectors under a sample size condition, which is sufficient for achieving the optimal estimation rates for $U$ and $V$ in a further refinement stage to be introduced below. Although it is possible that some other penalty function rather than the $\ell_1$ penalty in (18) could also achieve this goal, the $\ell_1$ penalty is appealing due to its simplicity.

Second stage

Now we turn to the convex relaxation of (10) in the second stage. By the discussion following Theorem 3.1, if we view the rows of $A$ as groups, then (10) becomes a least squares problem with a constrained number of active groups. A well-known convex relaxation for such problems is the group-Lasso [45], where the constraint on the number of active groups is relaxed by bounding the sum of the $\ell_2$ norms of the coefficient vectors of the groups. Let $\hat F$ be the solution to (18) and let $\hat U^{(1)}$ (resp. $\hat V^{(1)}$) be the matrix consisting of its first $r$ left (resp. right) singular vectors. Thus, in the second stage of the adaptive estimation scheme, we propose to solve the following group-Lasso problem:

(19) $\tilde A = \operatorname*{argmin}_{A \in \mathbb{R}^{p \times r}}\ \frac{1}{2m} \sum_{i \in D_2} \big\| (\hat V^{(1)})^\top Y_i - A^\top X_i \big\|^2 + \rho_2 \| A \|_{2,1},$

where $\|A\|_{2,1}$ is the group sparsity penalty, defined as the sum of the $\ell_2$ norms of all the row vectors of $A$, and $\rho_2$ is a penalty parameter controlling sparsity. Note that the group sparsity penalty is crucial, since if one uses an $\ell_1$ penalty instead, only a sub-optimal rate can be achieved. With $\tilde A$ the solution to (19), our final estimator in the adaptive estimation scheme is its normalized version

(20) $\hat U = \tilde A \big( \tilde A^\top \hat\Sigma_x^{(3)} \tilde A \big)^{-1/2}.$

As before, sample splitting is only used for technical arguments in the proof. Simulation results in Section E in the supplement [19] show that using the whole dataset repeatedly in (18) – (20) yields satisfactory performance and the improvement by the second stage is considerable.
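The row-wise group-Lasso in (19) is the same optimization problem solved by scikit-learn's MultiTaskLasso: each column of the response matrix is one task and the rows of $A$ are the groups. A sketch of the second stage, with the penalty value alpha as an arbitrary placeholder, is given below; in practice alpha would be tuned, e.g., by cross-validation.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

def colar_second_stage(X2, Y2, X3, V1, alpha=0.1):
    """Group-Lasso refinement in the spirit of (19), then normalization as in (20).

    X2 (n2 x p), Y2 (n2 x q): second batch; X3: third batch; V1 (q x r): matrix of
    the first r right singular vectors of the first-stage solution.
    MultiTaskLasso penalizes the sum of l2 norms of the rows of the coefficient
    matrix, which is exactly the group sparsity penalty on A in (19).
    """
    targets = Y2 @ V1                                         # n2 x r response matrix
    fit = MultiTaskLasso(alpha=alpha, fit_intercept=False).fit(X2, targets)
    A_tilde = fit.coef_.T                                     # p x r, row sparse
    Sx3 = X3.T @ X3 / X3.shape[0]
    M = A_tilde.T @ Sx3 @ A_tilde
    w, Q = np.linalg.eigh(M)
    inv_sqrt = Q @ np.diag(1.0 / np.sqrt(np.maximum(w, 1e-12))) @ Q.T
    return A_tilde @ inv_sqrt
```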

4.2 Theoretical guarantees

We first state the upper bound for the solution to the convex program (18).

Theorem 4.1.

Assume that

(21)

for some sufficiently large constant . Then there exist positive constants and only depending on and , such that when for ,

with -probability at least for any .

Note that the error bound in Theorem 4.1 can be much larger than the optimal rate for jointly estimating $U$ and $V$ established in [18]. Nonetheless, under the sample size condition (21), it still ensures that $\hat F$ is close to $UV^\top$ in Frobenius norm distance. This fact, together with the proposed refinement scheme (19) – (20), guarantees the optimal rates of convergence for the estimator (20), as stated in the following theorem.

Theorem 4.2.

Assume (21) holds for some sufficiently large . Then there exist constants and only depending on and such that if we set and for any and for some absolute constant , there exists a constant only depending on and , such that

with -probability at least uniformly over .

Remark 4.3.

The result of Theorem 4.2 assumes a constant . Explicit dependence on the eigenvalues of the marginal covariance can be tracked even when is diverging. Assuming the eigenvalues of all lie in the interval , then the convergence rate of would be and a convergence rate of would be . Compared with Remark 3.1, there is an extra factor , which is also present for the Lasso error bounds [8, 32]. Evidence has been given in the literature that such an extra factor can be intrinsic to all polynomial-time algorithms [46].

Although both Theorem 4.1 and Theorem 4.2 assume Gaussian distributions, a scrutiny of the proofs shows that the same results hold if the Gaussian assumption is weakened to a subgaussian one. By Theorem 3.2, the rate in Theorem 4.2 is optimal. By Theorem 4.1 and Theorem 4.2, the choices of the penalty parameters $\rho_1$ and $\rho_2$ in (18) and (19) do not depend on $s_u$ or $s_v$. Therefore, the proposed estimation scheme (18) – (20) achieves the optimal rate adaptively over sparsity levels. A full treatment of adaptation to is beyond the scope of the current paper, though it seems possible in view of the recent proposals in [6, 35, 12]. A careful examination of the proofs shows that the dependence of $\rho_1$ and $\rho_2$ on is through and , respectively. When and are bounded from above by a constant multiple of , we can upper bound the operator norms by their sample counterparts to remove the dependence of these penalty parameters on . We conclude this section with two more remarks.

Remark 4.4.

The group sparsity penalty used in the second stage (19) plays an important role in achieving the optimal rate . Except for the extra term, this convergence rate is a typical one for the group Lasso [28]. If we simply used an $\ell_1$ penalty, then we would obtain the rate , which is clearly sub-optimal.

Remark 4.5.

Comparing Theorem 3.1 with Theorem 4.2, the adaptive estimation scheme achieves the optimal rates of convergence over a smaller collection of parameter spaces of interest due to the more restrictive sample size condition (21). We examine the necessity of this condition in more detail in Section 5 below.

5 Computational Lower Bounds

In this section, we provide evidence that the sample size condition (21) imposed on the adaptive estimation scheme in Theorems 4.1 and 4.2 is probably unavoidable for any computationally feasible estimator to be consistent. To be specific, we show that for a sequence of parameter spaces in (5) – (6), if the condition is violated, then any computationally efficient consistent estimator of sparse canonical coefficients leads to a computationally efficient and statistically powerful test for the Planted Clique detection problem in a regime where it is believed to be computationally intractable.

Planted Clique

Let $N$ be a positive integer and $\kappa \in [N]$. We denote by $\mathcal{G}(N, \tfrac{1}{2})$ the Erdős–Rényi random graph on $N$ vertices where each edge is drawn independently with probability $\tfrac{1}{2}$, and by $\mathcal{G}(N, \tfrac{1}{2}, \kappa)$ the random graph generated by first sampling from $\mathcal{G}(N, \tfrac{1}{2})$ and then selecting $\kappa$ vertices uniformly at random and forming a clique of size $\kappa$ on these vertices. For the adjacency matrix of an instance drawn from either $\mathcal{G}(N, \tfrac{1}{2})$ or $\mathcal{G}(N, \tfrac{1}{2}, \kappa)$, the Planted Clique detection problem with parameters $(N, \kappa)$ refers to testing the following hypotheses

(22) $H_0: \text{the graph is drawn from } \mathcal{G}(N, \tfrac{1}{2}) \quad \text{versus} \quad H_1: \text{the graph is drawn from } \mathcal{G}(N, \tfrac{1}{2}, \kappa).$

It is widely believed that when $\kappa = O(N^{1/2 - \delta})$ for some $\delta > 0$, the problem (22) cannot be solved by any randomized polynomial-time algorithm. In the rest of the paper, we formalize the conjectured hardness of the Planted Clique problem into the following hypothesis.
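A sketch of sampling adjacency matrices under the two hypotheses in (22):

```python
import numpy as np

def erdos_renyi(N, rng):
    """Adjacency matrix of G(N, 1/2): each edge present independently with prob. 1/2."""
    upper = np.triu(rng.random((N, N)) < 0.5, k=1)
    return (upper | upper.T).astype(int)

def planted_clique(N, kappa, rng):
    """Adjacency matrix of G(N, 1/2, kappa): plant a clique on kappa uniformly
    chosen vertices on top of G(N, 1/2)."""
    A = erdos_renyi(N, rng)
    S = rng.choice(N, size=kappa, replace=False)
    A[np.ix_(S, S)] = 1
    np.fill_diagonal(A, 0)        # no self-loops
    return A
```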

Hypothesis A.

For any sequence such that and any randomized polynomial-time test ,

Evidence supporting this hypothesis has been provided in [34, 17]. Computational lower bounds in several statistical problems have been established by assuming the above hypothesis and its close variants, including sparse PCA detection [7] and estimation [39] in classes defined by a restricted covariance concentration condition, submatrix detection [30] and community detection [21].

Necessity of the sample size condition (21)

Under Hypothesis A, the necessity of condition (21) is supported by the following theorem.

Theorem 5.1.

Suppose that Hypothesis A holds and that as , satisfying for some constant , , for some sufficiently small , and . If for some ,

(23)

then for any randomized polynomial-time estimator ,

(24)

Comparing (21) with (23), we see that subject to a sub-polynomial factor, the condition (21) is necessary to achieve consistent sparse CCA estimation within polynomial time complexity.

Remark 5.1.

The statement of Theorem 5.1 is rigorous only if we assume that the computational complexities of basic arithmetic operations on real numbers and of sampling from univariate continuous distributions with analytic density functions are all $O(1)$ [10]. To be rigorous under the probabilistic Turing machine model [2], we need to introduce an appropriate discretization of the problem and be more careful with the complexity of random number generation. To convey the key ideas in our computational lower bound construction, we focus on the continuous case throughout this section and defer the formal discretization arguments to Section B in the supplement [19].

In what follows, we divide the reduction argument leading to Theorem 5.1 into two parts. In the first part, we show Hypothesis A implies the computational hardness of the sparse PCA problem under the Gaussian spiked covariance model. In the second part, we show computational hardness of sparse PCA implies that of sparse CCA as stated in Theorem 5.1.

5.1 Hardness of sparse PCA under Gaussian spiked covariance model

The Gaussian single spiked model [24] refers to the distribution $N(0, I_p + \lambda v v^\top)$ where $\|v\| = 1$. Here, $v$ is the leading eigenvector of unit length and $1 + \lambda$ is the leading eigenvalue. Define the following Gaussian single spiked model parameter space for sparse PCA

(25) $\big\{ N(0, I_p + \lambda v v^\top) : \|v\| = 1,\ |\mathrm{supp}(v)| \le s \big\}.$

The minimax estimation rate for $v$ under the squared error loss (up to a sign) is of order $\frac{(1 + \lambda)\, s \log(ep/s)}{n \lambda^2}$. See, for instance, [13]. However, to achieve the above minimax rate via computationally efficient methods such as those proposed in [9, 29, 13], researchers have required the sample size to satisfy $n \ge C s^2 \log p / \lambda^2$ for some sufficiently large constant $C$. Moreover, no computationally efficient estimator is known to achieve consistency when this sample size condition is violated. As a first step toward the establishment of Theorem 5.1, we show that Hypothesis A implies hardness of sparse PCA under the Gaussian spiked covariance model (25) when the above sample size condition is violated by a polynomial factor.
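A sketch of drawing a sample from the single spiked model with an $s$-sparse unit spike (the specific dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, s, lam = 500, 200, 10, 2.0

v = np.zeros(p)
v[rng.choice(p, size=s, replace=False)] = rng.standard_normal(s)
v /= np.linalg.norm(v)                     # unit-length, s-sparse spike

# X_i ~ N(0, I_p + lam * v v') can be generated as sqrt(lam) * g_i * v + z_i
g = rng.standard_normal(n)                 # spike scores
Z = rng.standard_normal((n, p))            # isotropic noise
X = np.sqrt(lam) * np.outer(g, v) + Z      # n x p data matrix
```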

We note that the previous computational lower bounds for sparse PCA in [7, 39] cannot be used here directly because they are only valid for parameter spaces defined via the restricted covariance concentration (RCC) condition. As pointed out in [39], such parameter spaces include (but are not limited to) all subgaussian distributions with sparse leading eigenvectors, and the covariance matrices need not be of the spiked form $I_p + \lambda v v^\top$. Therefore, the Gaussian single spiked model parameter space defined in (25) only constitutes a small subset of such RCC parameter spaces. The goal of the present subsection is to establish the computational lower bound for the Gaussian single spiked model directly.

Suppose we have an estimator of the leading sparse eigenvector. We propose the following reduction scheme to transform it into a test for (22). To this end, we first introduce some additional notation. Consider integers and . Define

(26)

For any , let denote the density function of the distribution, and let

(27)

denote the density function of the Gaussian mixture . Next, let be the restriction of the distribution on the interval . For any , define two probability distributions and with densities