Sparse Principal Component Analysis with missing observations
Abstract
In this paper, we study the problem of sparse Principal Component Analysis (PCA) in the highdimensional setting with missing observations. Our goal is to estimate the first principal component when we only have access to partial observations. Existing estimation techniques are usually derived for fully observed data sets and require a prior knowledge of the sparsity of the first principal component in order to achieve good statistical guarantees. Our contributions is threefold. First, we establish the first informationtheoretic lower bound for the sparse PCA problem with missing observations. Second, we propose a simple procedure that does not require any prior knowledge on the sparsity of the unknown first principal component or any imputation of the missing observations, adapts to the unknown sparsity of the first principal component and achieves the optimal rate of estimation up to a logarithmic factor. Third, if the covariance matrix of interest admits a sparse first principal component and is in addition approximately lowrank, then we can derive a completely datadriven procedure computationally tractable in highdimension, adaptive to the unknown sparsity of the first principal component and statistically optimal (up to a logarithmic factor).
math.PR/0000000 \startlocaldefs \endlocaldefs
Sparse PCA with missing observations
m1Supported in part by NSF Grant DMS1106644 and Simons foundation Grant 209842
class=AMS] \kwd[Primary ]62H12
Lowrank covariance matrix \kwdSparse Principal Component Analysis \kwdMissing observations \kwdInformationtheoretic lower bounds\kwdOracle inequalities
1 Introduction
Let be i.i.d. zero mean vectors with unknown covariance matrix of the form
(1.1) 
where , (the unit sphere in ) and is a symmetric positive semidefinite matrix with spectral norm and such that . The eigenvector is called the first principal component of . Our objective is to estimate the first principal component when the vectors are partially observed. More precisely, we consider the following framework. Denote by the th component of the vector . We assume that each component is observed independently of the others with probability . Note that can be easily estimated by the proportion of observed entries. Therefore, we will assume in this paper that is known. Note also that the case corresponds to the standard case of fully observed vectors. Let be a sequence of i.i.d. Bernoulli random variables with parameter and independent from . We observe i.i.d. random vectors whose components satisfy
(1.2) 
We can think of the as masked variables. If , then we cannot observe the th component of and the default value is assigned to . Our goal is then to estimate given the partial observations .
Principal Component Analysis (PCA) is a popular technique to reduce the dimension of a data set that has been used for many years in a variety of different fields including image processing, engineering, genetics, meteorology, chemistry and many others. In most of these fields, data are now highdimensional, that is the number of parameters is much larger than the sample size , and contain missing observations. This is especially true in genomics with gene expression microarray data where PCA is used to detect the genes responsible for a given biological process. Indeed, despite the recent improvments in gene expression techniques, microarray data can contain up to missing observations affecting up to of the genes. Unfortunately, it is a known fact that PCA is very sensitive even to small perturbations of the data including in particular missing observations. Therefore, several strategies have been developped to deal with missing values. The simple strategy that consists in eliminating from the PCA study any gene with at least one missing observation is not acceptable in this context since up to of the genes can be eliminated from the study. An alternative strategy consists in infering the missing values prior to the PCA using complex imputation schemes [5, 2]. These schemes usually assume that the genes interactions follow some specified model and involve intensive computational preprocessing to imput the missing observations. We propose in this paper a different strategy. Instead of building an imputation technique based on assumptions describing the genome structure (about which we usually have no prior information), we propose a technique based on the analysis of the perturbations process. In other words, if we understand the process generating the missing observations, then we can efficiently correct the data prior to the PCA analysis. This strategy was first introduced in [7] to estimate the spectrum of lowrank covariance matrices. One of our goal is to show that this approach can be successfully applied to perform fast and accurate PCA with missing observations.
Standard PCA in the full observation framework () consists in extracting the first principal components of (that is the eigenvector associated to the largest eigenvalue) based on the i.i.d. observations :
(1.3) 
where . The standard PCA presents two majors drawbacks. First, it is not consistent in highdimension [3, 10, 11]. Second, the solution is usually a dense vector whereas sparse solutions are prefered in most applications in order to obtain simple interpretable structures. For instance, in microarray data, we typically observe that only a few among the thousands of screened genes are involved in a given biological process. In order to improve interpretability, several approaches have been proposed to perform sparse PCA, that is to enforce sparsity of the PCA outcome. See for instance [12, 17, 13] for SVD based iterative thresholding approaches. [18] reformulated the sparse PCA problem as a sparse regression problem and then used the LASSO estimator. See also [9] for greedy methods. We consider now the approach by [4] which consists in computing a solution of (1.3) under the additional norm constraint for some fixed integer in order to enforce sparsity of the solution. The same approach with the norm constraint replaced by the norm gives the following procedure
(1.4) 
where denotes the number of nonzero components of . In a recent paper, [16] established the following oracle inequality
for some absolute constant . Note that this procedure requires the knowledge of an upper bound . In practice, we generally do not have access to any prior information on the sparsity of . Consequently, if the parameter we use in the procedure is too small, then the above upper bound does not hold, and if is too large, then the above upper bound (even though valid) is suboptimal. In other words, the procedure (1.4) with can be seen as an oracle and our goal is to propose a procedure that performs as well as this oracle without any prior information on .
In order to circumvent the fact that is unknown, we consider the following procedure proposed by [1]
(1.5) 
where is a regularization parameter to be tuned properly. [1, 6] studied the computational aspect. In particular, [6] proposed a computationally tractable procedure to solve the above constrained maximization problem even in highdimension. However none of these references investigated the statistical performances of this procedure or the question of the optimal tuning of . We propose to carry out this analysis in this paper. We establish the optimality of this procedure in the minimax sense and explicit the optimal theorethical choice for the regularization parameter .
When the data contains incomplete observations (), we do not have access to the empirical covariance matrix . Given the observations , we can build the following empirical covariance matrix
As noted in [7], is not an unbiased estimator of , Consequently, we need to consider the following correction in order to get sharp estimation results:
(1.6) 
Indeed, we can check by elementary algebra that is an unbiased estimator of in the missing observation framework . Therefore, we consider the following estimator in the missing observation framework
(1.7) 
where is a regularization parameter to be tuned properly and is a mild constraint on . More precisely, can be chosen as large as when no prior information on is available. We will prove in particular that the procedure (1.7) adapts to the unknown sparsity of provided that . We also investigate the case where is in addition approximately lowrank. In that case, we can remove the restriction (taking ) in the procedure (1.7) and propose a datadriven choice of the regularization parameter . We will show that this datadriven procedure also achieves the optimal rate of estimation (up to a logarithmic factor) in the missing observation framework without any prior knowledge on . Finally, we establish information theoretic lower bounds for the sparse PCA problem in the missing observation framework with the sharp dependence on , thus expliciting completely the effect of missing observations on the sparse PCA estimation rate. Note that our results are nonasymptotic in nature and hold for any setting of including in particular the highdimensional setting .
2 Tools and definitions
In this section, we introduce various notations and definitions and we recall some known results that we will use to establish our results.
The norms of a vector is given by
The support of a vector is defined as follows
We denote the number of nonzero components of by . Note that . Set . For any , we define . For any integer , we define . Note that .
For any symmetric matrix with eigenvalues , we define the Schatten norm of by
Define the usual matrix scalar product for any . Note that for any . Recall the trace duality property
We recall now some basic facts about nets (See for instance Section 5.2.2 in [15]).
Definition 1.
Let be a metric space and let . A subset of is called an net of if for every point , there exists a point so that .
We recall now an approximation result of the spectral norm on an net.
Lemma 1.
Let be a symmetric matrix for some . For any , there exists an net ( the unit sphere in ) such that
and
See for instance Lemma 5.2 and Lemma 5.3 in [15] for a proof.
We recall now the definition and some basic properties of subexponential random vectors.
Definition 2.
The norms of a realvalued random variable are defined by
We say that a random variable with values in is subexponential if for some . If , we say that is subgaussian.
We recall some wellknown properties of subexponential random variables:

For any realvalued random variable such that for some , we have
(2.1) where is the Gamma function.

If a realvalued random variable is subgaussian, then is subexponential. Indeed, we have
(2.2)
Definition 3.
A random vector is subexponential if are subexponential random variables for all . The norms of a random vector are defined by
We recall a version of Bernstein’s inequality for unbounded realvalued random variables.
Proposition 1.
Let be independent realvalued random variables with zero mean. Let there exist constants , and such that for any
(2.3) 
Then for every , we have with probability at least
Remark 1.
In the usual formulation of Bernstein’s inequality, only the second moment condition in (2.3) is imposed and the conclusion holds valid with replaced by . The refinement we propose here is necessary in the missing observation framework in order to get the sharp dependence of our bounds on . An investigation of the proof of Bernstein’s inequality shows that this refinement follows immediately from Chernoff’s bound used to prove the standard Bernstein’s inequality (See for instance Proposition 2.9 in [8]). Indeed, using Chernoff’s approach, we need a control on the following expectation
where we have used the second moment condition in (2.3) and Fubini’s theorem to justify the inversion of the sum and the expectation. The rest of the proof is left unchanged.
3 Main results for sparse PCA with missing observations
In this section, we state our main statistical results concerning the procedure (1.7). We will establish these results under the following condition on the distribution of .
Assumption 1 (Subgaussian observations).
The random vector is subgaussian, that is . In addition, there exist a numerical constant such that
(3.1) 
3.1 Oracle inequalities for sparse PCA
We first establish a preliminary result on the stochastic deviation of the following empirical process
To this end, we introduce the following quantity
Proposition 2.
We can now state our main result
Theorem 1.

We observe that the estimation bound increases as the difference decreases. The problem of estimation of the first principal component is statistically more difficult when the largest and second largest eigenvalues are close. We also observe that the optimal choice of the regularization parameter (3.7) depends on the eigenvalues of . Unfortunately, these quantities are typically unknown in practice. In order to circumvent this difficulty, we propose in Section 3.3 a datadriven choice of with optimal statistical performances (up to a logarithmic factor) provided that is approximately lowrank.

Let now consider the full observation framework (). In that case, if , we obtain the following upper bound with probability at least
We can compare this result with that obtained for the procedure (1.4) by [16]
We see that in order to achieve the rate with the procedure (1.4), we need to know the sparsity of in advance, whereas our procedure adapts to the unknown sparsity of and achieves the minimax optimal rate up to a logarithmic factor provided that (see Section 3.4 for the lower bounds). This logarithmic factor is the price we pay for adaptation to the sparsity of . Note also that we can formulate a version of (1.4) when observations are missing () by replacing with . In that case, our techniques of proof will give with probability at least

We discuss now the choice of . In practice, when no prior information on the sparsity of is available, we propose to choose . Then, the procedure (1.7) adapts to the unknown sparsity of provided that the condition is satisfied, which is actually a natural condition on in order to obtain a non trivial estimation result. Indeed, if , then the upper bound in Theorem 1 for the estimator becomes larger than whereas the bound for the null estimator is .

In the case where observations are missing (), Theorem 1 guarantees that recovery of the first principal component is still possible using the procedure (1.7). We observe the additional factor . Consequently, the estimation accuracy of the procedure (1.7) will decrease as the proportion of observed entries decreases. We show in Section 3.4 below that the dependence of our bounds on is sharp. In other words, there exists no statistical procedure that achieves an upper bound without the factor . Thus, we can conclude that the factor is the statistical price to pay to deal with missing observations in the principal component estimation problem. If we consider for instance microarray datasets where typically about of the observations are missing (that is ), then the optimal bound achieved for the first principal component estimation increases by a factor as compared to the full observation framework ().
3.2 Study of approximately lowrank covariance matrices
We now assume that defined in is also approximately lowrank and study the different implications of this additional condition. We recall that the effective rank of is defined by where is the trace of . We say that is approximately lowrank when . Note also that the effective rank of a covariance matrix can be estimated efficiently by where is an acceptable estimator of . See [7] for more details. Thus, the approximately lowrank assumption can easily be checked in practice.
First, we can propose a different control of the stochastic quantities . Note indeed that for any . We apply now Proposition 3 in [7] and get the following control on . Under the assumptions of Proposition 2, we have with probability at least that
(3.4) 
where is an absolute constant. We concentrate now on the highdimensional setting . Assume that
(3.5) 
for some sufficiently large numerical constant . Taking , we get from the two above displays, with probability at least that
where can depends only on . Combining the previous display with Proposition 2 and a union bound argument, we immediately obtain the following control on .
Proposition 3.
The motivation behind this new bound is the following. When (3.5) is satisfied, we can remove the restriction in the procedure (1.7). We consider now
(3.6) 
Then, a solution of this problem can be computed efficiently even in the highdimensional setting using a generalized power method (see [6] for more details on the computational aspect), whereas it is not clear whether the same holds true for the procedure (1.7) with the constraint .
We now consider the statistical performance of the procedure (3.6). Following the proof of Theorem 1, we establish this result for .
Theorem 2.
Note that this result holds without any condition on the sparsity of . Of course, as we already commented for Theorem 1, the result is of statistical interest only when is sparse: . The interest of this result is to guarantee that the computationally tractable estimator (3.6) is also statistically optimal.
3.3 Datadriven choice of
As we see in Theorem 2, the optimal choice of the regularization parameter depends on the largest and second largest eigenvalues of . These quantities are typically unknown in practice. To circumvent this difficulty, we propose the following datadriven choice for the regularization parameter
(3.8) 
where is a numerical constant and and are the two largest eigenvalues of . If (3.5) is satisfied, then as a consequence of Proposition 3 in [7], and are good estimators of and even in the missing observation framework. In order to guarantee that is a suitable choice, we will need a more restrictive condition on the number of measurements than (3.5). This new condition involves in addition the ”variance” :
(3.9) 
where is a sufficiently large numerical constant. As compared to (3.5), we observe the additional factor in the above condition. We already noted that matrices for which the difference is small are statistically more difficult to estimate. We observe that the number of measurements needed to construct a suitable datadriven estimator also increases as the difference decreases to .
We have the following result.
Lemma 1.
3.4 Information theoretic lower bounds
We derive now minimax lower bounds for the estimation of the first principal component in the missing observation framework.
Let . We denote by the class of covariance matrices satisfying (1.1) with , with and is a symmetric positive semidefinite matrix with spectral norm and such that . We prove now that the dependence of our estimation bounds on in Theorems 1 and 2 is sharp in the minimax sense. Set .
Theorem 3.
Fix and . Let the integers satisfy
(3.10) 
Let be i.i.d. random vectors in with covariance matrix . We observe i.i.d. random vectors such that
where is an i.i.d. sequence of Bernoulli random variables independent of .
Then, there exist absolute constants and such that
(3.11) 
where denotes the infimum over all possible estimators of based on .
Remark 2.
For , we can prove a similar lower bound with the factor
replaced by . This is actually the right
dependence on for sparse vectors. We can indeed derive
an upper bound of the same order for the selector where are the canonical vectors of .
For , we can prove a lower bound of the form
(3.11) without the logarithmic factor by comparing for
instance the hypothesis
and . Getting a
lower bound for with the logarithmic factor remains an open
question.
4 Proofs
4.1 Proof of Proposition 2
proof. For any , we have
(4.1) 
where
with and .
Before we proceed with the study of the empirical processes and , we need to introduce first some additional notations. Define
where are i.i.d. Bernoulli random variables with parameter and independent from . Denote by and the expectations w.r.t. and respectively. We also note that for any and any , we have . This comes from Fubini’s theorem and the fact that is subgaussian.
We now proceed with the study of . For any and any fixed , we have
Set
and
We note that are i.i.d. We now study the moments and for any and .
Note first that
where . (We have indeed that .
Thus we get, for any and , that
(4.2) 
and
(4.3) 
For any and , we set and we note the following simple fact that we will use in the next display: with probability . We also set . Note that for any and for .
For any and , we have
(4.4) 
Next, we have for any and that
Taking the expectation, we get for any and that
(4.5) 
where we have used (2.1), (2.2) and Assumption 1 in the last two lines.
Similarly, we have for any and that