Sparse Principal Component Analysis with missing observations

# Sparse Principal Component Analysis with missing observations

\fnmsKarim \snmLounici\thanksrefm1label=e2]klounici@math.gatech.edu [ Georgia Institute of Technology\thanksmarkm1 School of Mathematics
Georgia Institute of Technology
Atlanta, GA 30332-0160
###### Abstract

In this paper, we study the problem of sparse Principal Component Analysis (PCA) in the high-dimensional setting with missing observations. Our goal is to estimate the first principal component when we only have access to partial observations. Existing estimation techniques are usually derived for fully observed data sets and require a prior knowledge of the sparsity of the first principal component in order to achieve good statistical guarantees. Our contributions is threefold. First, we establish the first information-theoretic lower bound for the sparse PCA problem with missing observations. Second, we propose a simple procedure that does not require any prior knowledge on the sparsity of the unknown first principal component or any imputation of the missing observations, adapts to the unknown sparsity of the first principal component and achieves the optimal rate of estimation up to a logarithmic factor. Third, if the covariance matrix of interest admits a sparse first principal component and is in addition approximately low-rank, then we can derive a completely data-driven procedure computationally tractable in high-dimension, adaptive to the unknown sparsity of the first principal component and statistically optimal (up to a logarithmic factor).

[
\kwd
\arxiv

math.PR/0000000 \startlocaldefs \endlocaldefs

\runtitle

Sparse PCA with missing observations

{aug}\thankstext

m1Supported in part by NSF Grant DMS-11-06644 and Simons foundation Grant 209842

class=AMS] \kwd[Primary ]62H12

Low-rank covariance matrix \kwdSparse Principal Component Analysis \kwdMissing observations \kwdInformation-theoretic lower bounds\kwdOracle inequalities

## 1 Introduction

Let be i.i.d. zero mean vectors with unknown covariance matrix of the form

 Σ=σ1θ1θ⊤1+σ2Υ, (1.1)

where , (the unit sphere in ) and is a symmetric positive semi-definite matrix with spectral norm and such that . The eigenvector is called the first principal component of . Our objective is to estimate the first principal component when the vectors are partially observed. More precisely, we consider the following framework. Denote by the -th component of the vector . We assume that each component is observed independently of the others with probability . Note that can be easily estimated by the proportion of observed entries. Therefore, we will assume in this paper that is known. Note also that the case corresponds to the standard case of fully observed vectors. Let be a sequence of i.i.d. Bernoulli random variables with parameter and independent from . We observe i.i.d. random vectors whose components satisfy

 Y(j)i=δi,jX(j)i,1≤i≤n,1≤j≤p. (1.2)

We can think of the as masked variables. If , then we cannot observe the -th component of and the default value is assigned to . Our goal is then to estimate given the partial observations .

Principal Component Analysis (PCA) is a popular technique to reduce the dimension of a data set that has been used for many years in a variety of different fields including image processing, engineering, genetics, meteorology, chemistry and many others. In most of these fields, data are now high-dimensional, that is the number of parameters is much larger than the sample size , and contain missing observations. This is especially true in genomics with gene expression microarray data where PCA is used to detect the genes responsible for a given biological process. Indeed, despite the recent improvments in gene expression techniques, microarray data can contain up to missing observations affecting up to of the genes. Unfortunately, it is a known fact that PCA is very sensitive even to small perturbations of the data including in particular missing observations. Therefore, several strategies have been developped to deal with missing values. The simple strategy that consists in eliminating from the PCA study any gene with at least one missing observation is not acceptable in this context since up to of the genes can be eliminated from the study. An alternative strategy consists in infering the missing values prior to the PCA using complex imputation schemes [5, 2]. These schemes usually assume that the genes interactions follow some specified model and involve intensive computational preprocessing to imput the missing observations. We propose in this paper a different strategy. Instead of building an imputation technique based on assumptions describing the genome structure (about which we usually have no prior information), we propose a technique based on the analysis of the perturbations process. In other words, if we understand the process generating the missing observations, then we can efficiently correct the data prior to the PCA analysis. This strategy was first introduced in [7] to estimate the spectrum of low-rank covariance matrices. One of our goal is to show that this approach can be successfully applied to perform fast and accurate PCA with missing observations.

Standard PCA in the full observation framework () consists in extracting the first principal components of (that is the eigenvector associated to the largest eigenvalue) based on the i.i.d. observations :

 ^θ=argmaxθ⊤θ=1θ⊤Σnθ, (1.3)

where . The standard PCA presents two majors drawbacks. First, it is not consistent in high-dimension [3, 10, 11]. Second, the solution is usually a dense vector whereas sparse solutions are prefered in most applications in order to obtain simple interpretable structures. For instance, in microarray data, we typically observe that only a few among the thousands of screened genes are involved in a given biological process. In order to improve interpretability, several approaches have been proposed to perform sparse PCA, that is to enforce sparsity of the PCA outcome. See for instance [12, 17, 13] for SVD based iterative thresholding approaches. [18] reformulated the sparse PCA problem as a sparse regression problem and then used the LASSO estimator. See also [9] for greedy methods. We consider now the approach by [4] which consists in computing a solution of (1.3) under the additional -norm constraint for some fixed integer in order to enforce sparsity of the solution. The same approach with the -norm constraint replaced by the -norm gives the following procedure

 ^θo=argmaxθ∈Sp:|θ|0≤¯s(θ⊤Σnθ), (1.4)

where denotes the number of nonzero components of . In a recent paper, [16] established the following oracle inequality

 (E∥^θo^θ⊤o−θ1θ⊤1∥2)2≤C(σ1σ1−σ2)2¯slog(p/¯s)n,

for some absolute constant . Note that this procedure requires the knowledge of an upper bound . In practice, we generally do not have access to any prior information on the sparsity of . Consequently, if the parameter we use in the procedure is too small, then the above upper bound does not hold, and if is too large, then the above upper bound (even though valid) is sub-optimal. In other words, the procedure (1.4) with can be seen as an oracle and our goal is to propose a procedure that performs as well as this oracle without any prior information on .

In order to circumvent the fact that is unknown, we consider the following procedure proposed by [1]

 ^θ1=argmaxθ∈Sp(θ⊤Σnθ−λ|θ|0), (1.5)

where is a regularization parameter to be tuned properly. [1, 6] studied the computational aspect. In particular, [6] proposed a computationally tractable procedure to solve the above constrained maximization problem even in high-dimension. However none of these references investigated the statistical performances of this procedure or the question of the optimal tuning of . We propose to carry out this analysis in this paper. We establish the optimality of this procedure in the minimax sense and explicit the optimal theorethical choice for the regularization parameter .

When the data contains incomplete observations (), we do not have access to the empirical covariance matrix . Given the observations , we can build the following empirical covariance matrix

 Σ(δ)n=1nn∑i=1YiY⊤i.

As noted in [7], is not an unbiased estimator of , Consequently, we need to consider the following correction in order to get sharp estimation results:

 ~Σn=(δ−1−δ−2)diag(Σ(δ)n)+δ−2Σ(δ)n. (1.6)

Indeed, we can check by elementary algebra that is an unbiased estimator of in the missing observation framework . Therefore, we consider the following estimator in the missing observation framework

 ^θ1=argmaxθ∈Sp:|θ|0≤¯s(θ⊤~Σnθ−λ|θ|0), (1.7)

where is a regularization parameter to be tuned properly and is a mild constraint on . More precisely, can be chosen as large as when no prior information on is available. We will prove in particular that the procedure (1.7) adapts to the unknown sparsity of provided that . We also investigate the case where is in addition approximately low-rank. In that case, we can remove the restriction (taking ) in the procedure (1.7) and propose a data-driven choice of the regularization parameter . We will show that this data-driven procedure also achieves the optimal rate of estimation (up to a logarithmic factor) in the missing observation framework without any prior knowledge on . Finally, we establish information theoretic lower bounds for the sparse PCA problem in the missing observation framework with the sharp dependence on , thus expliciting completely the effect of missing observations on the sparse PCA estimation rate. Note that our results are nonasymptotic in nature and hold for any setting of including in particular the high-dimensional setting .

The rest of the paper is organized as follows. In Section 2, we recall some tools and definitions that will be useful for our statistical analysis. Section 3 contains our main theoretical results. Finally, Section 4 contains the proofs of our results.

## 2 Tools and definitions

In this section, we introduce various notations and definitions and we recall some known results that we will use to establish our results.

The -norms of a vector is given by

 |x|q=(p∑j=1|x(j)|q)1/q,for1≤q<∞,and|x|∞=max1≤j≤p|x(j)|.

The support of a vector is defined as follows

 J(x)={j:x(j)≠0}.

We denote the number of nonzero components of by . Note that . Set . For any , we define . For any integer , we define . Note that .

For any symmetric matrix with eigenvalues , we define the Schatten -norm of by

 ∥A∥q=(p∑j=1|σj(A)|q)1/q,∀1≤q<∞,and∥A∥∞=max1≤j≤p{|σj(A)|}.

Define the usual matrix scalar product for any . Note that for any . Recall the trace duality property

 |⟨A,B⟩|≤∥A∥∞∥B∥1,∀A,B∈Rp×p.

We recall now some basic facts about -nets (See for instance Section 5.2.2 in [15]).

###### Definition 1.

Let be a metric space and let . A subset of is called an -net of if for every point , there exists a point so that .

We recall now an approximation result of the spectral norm on an -net.

###### Lemma 1.

Let be a symmetric matrix for some . For any , there exists an -net ( the unit sphere in ) such that

 |Nϵ|≤(1+2ϵ)k,

and

 supθ∈Sk|⟨Ax,x⟩|≤11−2ϵsupθ∈Nϵ|⟨Ax,x⟩|.

See for instance Lemma 5.2 and Lemma 5.3 in [15] for a proof.

We recall now the definition and some basic properties of sub-exponential random vectors.

###### Definition 2.

The -norms of a real-valued random variable are defined by

 ∥V∥ψα=inf{u>0:Eexp(|V|α/uα)≤2},α≥1.

We say that a random variable with values in is sub-exponential if for some . If , we say that is sub-gaussian.

We recall some well-known properties of sub-exponential random variables:

1. For any real-valued random variable such that for some , we have

 E|V|m≤2mαΓ(mα)∥V∥mψα,∀m≥1 (2.1)

where is the Gamma function.

2. If a real-valued random variable is sub-gaussian, then is sub-exponential. Indeed, we have

 ∥V2∥ψ1≤2∥V∥2ψ2. (2.2)
###### Definition 3.

A random vector is sub-exponential if are sub-exponential random variables for all . The -norms of a random vector are defined by

 ∥X∥ψα=supx∈Sp∥⟨X,x⟩∥ψα,α≥1.

We recall a version of Bernstein’s inequality for unbounded real-valued random variables.

###### Proposition 1.

Let be independent real-valued random variables with zero mean. Let there exist constants , and such that for any

 1nn∑i=1|E[Ymi]|≤m!2Km−2σ2,and1nn∑i=1E[|Yi|m]≤m!2Km−2(σ′)2. (2.3)

Then for every , we have with probability at least

 ∣∣ ∣∣1nn∑i=1Yi∣∣ ∣∣≤σ√2tn+Ktn.
###### Remark 1.

In the usual formulation of Bernstein’s inequality, only the second moment condition in (2.3) is imposed and the conclusion holds valid with replaced by . The refinement we propose here is necessary in the missing observation framework in order to get the sharp dependence of our bounds on . An investigation of the proof of Bernstein’s inequality shows that this refinement follows immediately from Chernoff’s bound used to prove the standard Bernstein’s inequality (See for instance Proposition 2.9 in [8]). Indeed, using Chernoff’s approach, we need a control on the following expectation

 E[Φ(tYi)]

where we have used the second moment condition in (2.3) and Fubini’s theorem to justify the inversion of the sum and the expectation. The rest of the proof is left unchanged.

## 3 Main results for sparse PCA with missing observations

In this section, we state our main statistical results concerning the procedure (1.7). We will establish these results under the following condition on the distribution of .

###### Assumption 1 (Sub-gaussian observations).

The random vector is sub-gaussian, that is . In addition, there exist a numerical constant such that

 E(⟨X,u⟩)2≥c1∥⟨X,u⟩∥2ψ2,∀u∈Rp. (3.1)

### 3.1 Oracle inequalities for sparse PCA

We first establish a preliminary result on the stochastic deviation of the following empirical process

 Zn(s) =maxθ∈Sps{∣∣θ⊤(~Σn−Σ)θ∣∣},∀1≤s≤p.

To this end, we introduce the following quantity

 ζn(s,p,t,δ) :=max⎧⎨⎩√t+slog(ep/s)δ2n,t+slog(ep/s)δ2n⎫⎬⎭.
###### Proposition 2.

Let Assumption 1 be satisfied. Let be defined in (1.2) with . Then, we have

 P(p⋂s=1{Zn(s)≤cσmax(s)c1∧1ζn(s,p,t,δ)})≥1−e−t (3.2)

where is an absolute constant and .

We can now state our main result

###### Theorem 1.

Let Assumption 1 be satisfied. Let be defined in (1.2) with . Consider the estimator (1.7) with parameters satisfying and

 λ=Cσ21σ1−σ2log(ep)δ2n, (3.3)

where is a large enough numerical constant.
If , then we have, with probability at least , that

 ∥^θ1^θ⊤1−θ1θ⊤1∥22≤C′|θ1|0~σ2log(ep)δ2n.

where and is a numerical constant that can depend only on .

1. We observe that the estimation bound increases as the difference decreases. The problem of estimation of the first principal component is statistically more difficult when the largest and second largest eigenvalues are close. We also observe that the optimal choice of the regularization parameter (3.7) depends on the eigenvalues of . Unfortunately, these quantities are typically unknown in practice. In order to circumvent this difficulty, we propose in Section 3.3 a data-driven choice of with optimal statistical performances (up to a logarithmic factor) provided that is approximately low-rank.

2. Let now consider the full observation framework (). In that case, if , we obtain the following upper bound with probability at least

 ∥^θ1^θ⊤1−θ1θ⊤1∥22≤C′|θ1|0~σ2log(ep)n.

We can compare this result with that obtained for the procedure (1.4) by [16]

 (E∥^θo^θ⊤o−θ1θ⊤1∥2)2≤C′¯s~σ2log(ep/¯s)n.

We see that in order to achieve the rate with the procedure (1.4), we need to know the sparsity of in advance, whereas our procedure adapts to the unknown sparsity of and achieves the minimax optimal rate up to a logarithmic factor provided that (see Section 3.4 for the lower bounds). This logarithmic factor is the price we pay for adaptation to the sparsity of . Note also that we can formulate a version of (1.4) when observations are missing () by replacing with . In that case, our techniques of proof will give with probability at least

 ∥^θo^θ⊤o−θ1θ⊤1∥22≤C′¯s~σ2log(ep/¯s)δ2n.
3. We discuss now the choice of . In practice, when no prior information on the sparsity of is available, we propose to choose . Then, the procedure (1.7) adapts to the unknown sparsity of provided that the condition is satisfied, which is actually a natural condition on in order to obtain a non trivial estimation result. Indeed, if , then the upper bound in Theorem 1 for the estimator becomes larger than whereas the bound for the null estimator is .

4. In the case where observations are missing (), Theorem 1 guarantees that recovery of the first principal component is still possible using the procedure (1.7). We observe the additional factor . Consequently, the estimation accuracy of the procedure (1.7) will decrease as the proportion of observed entries decreases. We show in Section 3.4 below that the dependence of our bounds on is sharp. In other words, there exists no statistical procedure that achieves an upper bound without the factor . Thus, we can conclude that the factor is the statistical price to pay to deal with missing observations in the principal component estimation problem. If we consider for instance microarray datasets where typically about of the observations are missing (that is ), then the optimal bound achieved for the first principal component estimation increases by a factor as compared to the full observation framework ().

### 3.2 Study of approximately low-rank covariance matrices

We now assume that defined in is also approximately low-rank and study the different implications of this additional condition. We recall that the effective rank of is defined by where is the trace of . We say that is approximately low-rank when . Note also that the effective rank of a covariance matrix can be estimated efficiently by where is an acceptable estimator of . See [7] for more details. Thus, the approximately low-rank assumption can easily be checked in practice.

First, we can propose a different control of the stochastic quantities . Note indeed that for any . We apply now Proposition 3 in [7] and get the following control on . Under the assumptions of Proposition 2, we have with probability at least that

 (3.4)

where is an absolute constant. We concentrate now on the high-dimensional setting . Assume that

 n≥cr(Σ)log2(ep)δ2, (3.5)

for some sufficiently large numerical constant . Taking , we get from the two above displays, with probability at least that

 ∥~Σn−Σ∥∞≤c′σ1√r(Σ)log(ep)δ2n,

where can depends only on . Combining the previous display with Proposition 2 and a union bound argument, we immediately obtain the following control on .

###### Proposition 3.

Let the conditions of Proposition 2 be satisfied. In addition, let (3.5) be satisfied. Then we have

 P⎛⎝p⋂s=1⎧⎨⎩Zn(s)≤cσ1√min{s,r(Σ)}√log(ep)δ2n⎫⎬⎭⎞⎠≥1−1p,

where is numerical constant that can depend only on .

The motivation behind this new bound is the following. When (3.5) is satisfied, we can remove the restriction in the procedure (1.7). We consider now

 ~θ1=argmaxθ∈Sp(θ⊤~Σnθ−λ|θ|0). (3.6)

Then, a solution of this problem can be computed efficiently even in the high-dimensional setting using a generalized power method (see [6] for more details on the computational aspect), whereas it is not clear whether the same holds true for the procedure (1.7) with the constraint .

We now consider the statistical performance of the procedure (3.6). Following the proof of Theorem 1, we establish this result for .

###### Theorem 2.

Let Assumption 1 be satisfied. Let be defined in (1.2) with . In addition, let (3.5) be satisfied. Take

 λ=Cσ21σ1−σ2log(ep)δ2n, (3.7)

where is a large enough numerical constant. Then we have, with probability at least , that

 ∥~θ1~θ⊤1−θ1θ⊤1∥22≤C′|θ1|0~σ2log(ep)δ2n.

where and is a numerical constant that can depend only on .

Note that this result holds without any condition on the sparsity of . Of course, as we already commented for Theorem 1, the result is of statistical interest only when is sparse: . The interest of this result is to guarantee that the computationally tractable estimator (3.6) is also statistically optimal.

### 3.3 Data-driven choice of λ

As we see in Theorem 2, the optimal choice of the regularization parameter depends on the largest and second largest eigenvalues of . These quantities are typically unknown in practice. To circumvent this difficulty, we propose the following data-driven choice for the regularization parameter

 λD=C^σ21^σ1−^σ2log(ep)δ2n, (3.8)

where is a numerical constant and and are the two largest eigenvalues of . If (3.5) is satisfied, then as a consequence of Proposition 3 in [7], and are good estimators of and even in the missing observation framework. In order to guarantee that is a suitable choice, we will need a more restrictive condition on the number of measurements than (3.5). This new condition involves in addition the ”variance” :

 n≥c~σ2δ2r(Σ)log2(ep), (3.9)

where is a sufficiently large numerical constant. As compared to (3.5), we observe the additional factor in the above condition. We already noted that matrices for which the difference is small are statistically more difficult to estimate. We observe that the number of measurements needed to construct a suitable data-driven estimator also increases as the difference decreases to .

We have the following result.

###### Lemma 1.

Let the conditions of Proposition 2 be satisfied, Assume in addition that (3.9) is satisfied. Let be defined in (3.8). Then, we have with probability at least that

 Zn(s)≤λDs,∀1≤s≤p,

and

 λD≤C′σ21σ1−σ2log(ep)δ2n,

for some numerical constant .

Consequently, the conclusion of Theorem 2 holds true for the estimator (3.6) with provided that (3.9) is satisfied.

### 3.4 Information theoretic lower bounds

We derive now minimax lower bounds for the estimation of the first principal component in the missing observation framework.

Let . We denote by the class of covariance matrices satisfying (1.1) with , with and is a symmetric positive semi-definite matrix with spectral norm and such that . We prove now that the dependence of our estimation bounds on in Theorems 1 and 2 is sharp in the minimax sense. Set .

###### Theorem 3.

Fix and . Let the integers satisfy

 2¯σ2s1log(ep/s1)≤δ2n. (3.10)

Let be i.i.d. random vectors in with covariance matrix . We observe i.i.d. random vectors such that

 Yji=δi,jX(j)i,1≤i≤n,1≤j≤p,

where is an i.i.d. sequence of Bernoulli random variables independent of .

Then, there exist absolute constants and such that

 inf^θ1supΣ∈CPΣ(∥^θ1^θ⊤1−θ1θ⊤1∥22>c¯σ2s1δ2nlog(eps1)) ≥ β, (3.11)

where denotes the infimum over all possible estimators of based on .

###### Remark 2.

For , we can prove a similar lower bound with the factor replaced by . This is actually the right dependence on for -sparse vectors. We can indeed derive an upper bound of the same order for the selector where are the canonical vectors of .
For , we can prove a lower bound of the form (3.11) without the logarithmic factor by comparing for instance the hypothesis and . Getting a lower bound for with the logarithmic factor remains an open question.

## 4 Proofs

### 4.1 Proof of Proposition 2

proof.  For any , we have

 Zn(s) ≤δ−1Z(1)n(s)+δ−2Z(2)n(s) (4.1)

where

with and .

Before we proceed with the study of the empirical processes and , we need to introduce first some additional notations. Define

 Y=(δ1X(1),…,δpX(p))⊤,

where are i.i.d. Bernoulli random variables with parameter and independent from . Denote by and the expectations w.r.t. and respectively. We also note that for any and any , we have . This comes from Fubini’s theorem and the fact that is sub-gaussian.

We now proceed with the study of . For any and any fixed , we have

 θ⊤(A(δ)n−A(δ))θ =1nn∑i=1[θ⊤(YiY⊤i−diag(YiY⊤i))θ−δ2θ⊤(Σ−diag(Σ))θ].

Set

 Zi=[(YiY⊤i−diag(YiY⊤i))−δ2(Σ−diag(Σ))],

and

 Z=[(YY⊤−diag(YY⊤))−δ2(Σ−diag(Σ))].

We note that are i.i.d. We now study the moments and for any and .

Note first that

 |θ⊤(Σ−diag(Σ))θ|≤max(θ⊤Σθ,θ⊤diag(Σ)θ)≤σmax(s),∀θ∈Sps,∀s≥1,

where . (We have indeed that .

Thus we get, for any and , that

 E[(θ⊤Zθ)m] ≤2m−1E[(θ⊤(YY⊤−diag(YY⊤))θ)m]+2m−1δ2mσmmax(s) (4.2)

and

 E[∣∣θ⊤Zθ∣∣m] ≤2m−1E[∣∣θ⊤(YY⊤−diag(YY⊤))θ∣∣m]+2m−1δ2mσmmax(s) (4.3)

For any and , we set and we note the following simple fact that we will use in the next display: with probability . We also set . Note that for any and for .

For any and , we have

 E[(θ⊤(YY⊤−diag(YY⊤))θ)m] =E[(θ⊤δ(XX⊤−diag(XX⊤))θδ)m] ≤2m−1E[(θ⊤δWθδ)m] +2m−1δmE[(θ⊤δdiag(XX⊤)θδ)m] (4.4)

Next, we have for any and that

 (θ⊤δWθδ)m=p∑j1,k1=1⋯p∑jm,km=1m∏t=1θ(jt)δθ(kt)δwjt,kt.

Taking the expectation, we get for any and that

 E(θ⊤δWθδ)m =EXEδ⎡⎣p∑j1,k1=1⋯p∑jm,km=1m∏t=1θ(jt)δθ(kt)δwjt,kt⎤⎦ =EX⎡⎣δmp∑j1,k1=1⋯p∑jm,km=1m∏t=1θ(jt)θ(kt)X(jt)X(kt)⎤⎦ =δmEX[(θ⊤X)2m] ≤2δmm!(2∥θ⊤X∥2ψ2)m ≤2δmm!(2c1σmax(s))m, (4.5)

where we have used (2.1), (2.2) and Assumption 1 in the last two lines.

Similarly, we have for any and that

 (θ⊤δdiag(XX⊤)θδ)