Adaptive Thresholding for Sparse Covariance Matrix Estimation

Adaptive Thresholding for Sparse Covariance Matrix Estimation

Tony Cai and Weidong Liu
Abstract

In this paper we consider estimation of sparse covariance matrices and propose a thresholding procedure which is adaptive to the variability of individual entries. The estimators are fully data driven and enjoy excellent performance both theoretically and numerically. It is shown that the estimators adaptively achieve the optimal rate of convergence over a large class of sparse covariance matrices under the spectral norm. In contrast, the commonly used universal thresholding estimators are shown to be sub-optimal over the same parameter spaces. Support recovery is also discussed. The adaptive thresholding estimators are easy to implement. Numerical performance of the estimators is studied using both simulated and real data. Simulation results show that the adaptive thresholding estimators uniformly outperform the universal thresholding estimators. The method is also illustrated in an analysis on a dataset from a small round blue-cell tumors microarray experiment. A supplement to this paper which contains additional technical proofs is available online.

11footnotetext: Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, PA
19104, tcai@wharton.upenn.edu. The research was supported in part by NSF FRG Grant DMS-0854973.
22footnotetext: Department of Mathematics and Institute of Natural Sciences, Shanghai Jiao Tong University, China.

Keywords: Adaptive thresholding, Frobenius norm, optimal rate of convergence, sparse covariance matrix, spectral norm, support recovery, universal thresholding.

1 Introduction

Let be a -variate random vector with covariance matrix . Given an independent and identically distributed (i.i.d.) random sample from the distribution of X, we wish to estimate the covariance matrix under the spectral norm. This covariance matrix estimation problem is of fundamental importance in multivariate analysis with a wide range of applications. The high dimensional setting, where the dimension can be much larger than the sample size , is of particular current interest. In such a setting, conventional methods and results based on fixed and large are no longer applicable and new methods and theories are thus needed. In particular, the sample covariance matrix

 Σn=(^σij)p×p:=1n−1n∑k=1(Xk−¯X)(Xk−¯X)T, (1)

where , performs poorly in this setting and structural assumptions are required in order to estimate the covariance matrix consistently.

In this paper we focus on estimating sparse covariance matrices. This problem has been considered in the literature. El Karoui (2008) and Bickel and Levina (2008) proposed thresholding of the sample covariance matrix and obtained rates of convergence for the thresholding estimators. Rothman, Levina and Zhu (2009) considered thresholding of the sample covariance matrix with more general thresholding functions. Cai and Zhou (2009 and 2010) established the minimax rates of convergence under the matrix norm and the spectral norm. Wang and Zou (2010) considered estimation of volatility matrices based on high-frequency financial data.

A common feature of the thresholding methods for sparse covariance matrix estimation proposed in the literature so far is that they all belong to the class of “universal thresholding rules”. That is, a single threshold level is used to threshold all the entries of the sample covariance matrix. Universal thresholding rules were originally introduced by Donoho and Johnstone (1994 and 1998) for estimating sparse normal mean vectors in the context of wavelet function estimation. See also Antoniadis and Fan (2001). An important feature of problems considered there is that noise is homoscedastic. In such a setting, universal thresholding has demonstrated considerable success in nonparametric function estimation in terms of asymptotic optimality and computational simplicity.

In contrast to the standard homoscedastic nonparametric regression problems, sparse covariance matrix estimation is intrinsically a heteroscedastic problem in the sense that the entries of the sample covariance matrix could have a wide range of variability. Although some universal thresholding rules have been shown to enjoy desirable asymptotic properties, this is mainly due to the fact that the parameter space considered in the literature is relatively restrictive which forces the covariance matrix estimation problem to be an essentially homoscedastic problem.

To illustrate the point, it is helpful to consider an idealized model where one observes

 yi=μi+γizi,zi\lx@stackreliid∼N(0,1)1≤i≤p (2)

and wishes to estimate the mean vector which is assumed to be sparse. If the noise levels are bounded, say by , then the universal thresholding rule performs well asymptotically over a standard ball defined by

 Θq(s0)={μ∈IRp:p∑j=1|μj|q≤s0}. (3)

In particular, is a set of sparse vectors with at most nonzero elements. Here the assumption that are bounded by is crucial. The universal thresholding rule simply treats the heteroscedastic problem (2) as a homoscedastic one with all noise levels . It is intuitively clear that this method does not perform well when the range of is large and it fails completely without the uniform boundedness assumption on the ’s.

For sparse covariance matrix estimation, the following uniformity class of sparse matrices was considered in Bickel and Levina (2008) and Rothman, Levina and Zhu (2009):

 Uq:=Uq(s0(p))={Σ:Σ≻0, maxiσii≤K, maxip∑j=1|σij|q≤s0(p)}

for some , where means that is positive definite. Here each column of a covariance matrix in is assumed to be in the ball . Define

 θij:=Var((Xi−μi)(Xj−μj)), (4)

where . It is easy to see that in the Gaussian case, . The condition for all ensures the variances of the entries of the sample covariance matrix to be uniformly bounded. Bickel and Levina (2008) proposed a universal thresholding estimator , where

 ^σuij=^σijI{^σij≥λn}, (5)

and showed that with a proper choice of the threshold the estimator achieves a desirable rate of convergence under the spectral norm. Rothman, Levina and Zhu (2009) considered a class of universal thresholding rules with more general thresholding functions than hard thresholding. Similar to the idealized model (2) discussed earlier, here a key assumption is that the variances are uniformly bounded by which is crucial to make the universal thresholding rules well behaved. A universal thresholding rule in this case essentially treats the problem as if all when selects the threshold .

For heteroscedastic problems such as sparse covariance matrix estimation, it is arguable more desirable to use thresholds that capture the variability of individual variables instead of using a universal upper bound. This is particularly true when the variances vary over a wide range or no obvious upper bound on the variances is known. A more natural and effective approach is to use thresholding rules with entry-dependent thresholds which automatically adapt to the variability of the individual entries of the sample covariance matrix. The main goal of the present paper is to develop such an adaptive thresholding estimator and study its properties.

In this paper we introduce an adaptive thresholding estimator with

 ^σ⋆ij=sλij(^σij), (6)

where is a general thresholding function similar to those used in Rothman, Levina and Zhu (2009) and will be specified later. The individual thresholds are fully data-driven and adapt to the variability of individual entries of the sample covariance matrix . It is shown that the adaptive thresholding estimator enjoys excellent properties both asymptotically and numerically. In particular, we consider the performance of the estimator over a large class of sparse covariance matrices defined by

 U⋆q:=U⋆q(s0(p))={Σ:Σ≻0, maxip∑j=1(σiiσjj)(1−q)/2|σij|q≤s0(p)} (7)

for . In comparison to , the columns of a covariance matrix in are required to be in a weighted ball, instead of a standard ball, with the weight determined by the variance of the entries of the sample covariance matrix. A particular feature of is that it no longer requires the variances to be uniformly bounded and allows . Note that , so the parameter space contains the uniformity class as a subset. The parameter space can also be viewed as a weighted ball of correlation coefficients. See Section 3.1 for more discussions.

It will be shown in Section 3 that achieves the optimal rate of convergence

 s0(p)(logpn)(1−q)/2

over the parameter space . In comparison, it is also shown that the best universal thresholding estimator can only attain the rate over , which is clearly sub-optimal when since .

The choice of the regularization parameters is important in any regularized estimation problem. The thresholds used in (6) are based on an estimator of the variance of the entries of the sample covariance matrix. More specifically, are of the form

 λij=δ√^θijlogpn (8)

where are estimates of defined in (4) and is a tuning parameter. The value of can be taken as fixed at , or it can be empirically chosen through cross validation. Theoretical properties of the resulting covariance matrix estimators using both methods are investigated. It is shown that the estimators attain the optimal rate of convergence under the spectral norm in both cases. In addition, support recovery of a sparse covariance matrix is also considered.

The adaptive thresholding estimators are easy to implement. Numerical performance of the estimators is investigated using both simulated and real data. Simulation results show that the adaptive thresholding estimators perform favorably in comparison to existing methods. In particular, they uniformly outperform the universal thresholding estimators in the simulation studies. The procedure is also applied to analyze a dataset from a small round blue-cell tumors microarray experiment (Khan et al., 2001).

The paper is organized as follows. Section 2 introduces the adaptive thresholding procedure for sparse covariance matrix estimation. Asymptotic properties are studied in Section 3. It is shown that the adaptive thresholding estimator is rate-optimal over , while the best universal thresholding estimator is proved to be suboptimal. Section 4 discusses data-driven selection of the thresholds using cross validation and establish asymptotic optimality of the resulting estimator. Numerical performance of the adaptive thresholding estimators is investigated by simulations and by an application to a dataset from a small round blue-cell tumors microarray experiment in Section 5. Section 6 discusses methods based on the sample correlation matrix. The proofs are given in Section 7.

2 Adaptive thresholding for sparse covariance matrix

In this section we introduce the adaptive thresholding method for estimating sparse covariance matrices. To motivate our estimator, consider again the sparse normal mean estimation problem (2). If the noise levels ’s are known or can be well estimated, a good estimator of the mean vector is the hard thresholding estimator or some generalized thresholding estimator with the same thresholds .

Similarly, for sparse covariance matrix estimation, a more effective thresholding rule than universal thresholding is the one which adapts to the variability of the individual entries of the sample covariance matrix. Define as in (4). Then, roughly speaking, estimation of a sparse covariance matrix is similar to the mean vector estimation problem based on the observations

 1nn∑k=1(Xki−μi)(Xkj−μj)=σij+√θijnzij,1≤i,j≤p (9)

with being asymptotically standard normal. This analogy provides a good motivation for our adaptive thresholding procedure. If the were known, a natural thresholding estimator would be with

 ^σoij=sλoij(^σij)with~{% }λoij=2√θijlogpn, (10)

where is a thresholding function. Comparing to the universal thresholding rule in Bickel and Levina (2008), the variance factors in the thresholds make the thresholding rule entry-dependent and leads to a more flexible estimator. In practice, are typically unknown, but can be well estimated. We propose the following estimator of :

 ^θij=1nn∑k=1[(Xki−¯Xi)(Xkj−¯Xj)−^σij]2,¯Xi=n−1n∑k=1Xki.

This leads to our adaptive thresholding estimator of the covariance matrix ,

 ^Σ⋆(δ)=(^σ⋆ij)p×pwith^σ⋆ij=sλij(^σij), (11)

where

 λij:=λij(δ)=δ√^θijlogpn. (12)

Here is a regularization parameter. It can be fixed at or can be chosen through cross validation. Good choices of will not affect the rate of convergence, but will affect the numerical performance of the resulting estimators. Selection of is thus of practical importance and we will discuss it further later.

The analogy between the sparse covariance estimation problem and the idealized mean estimation problem (9) gives a good motivation for the adaptive thresholding estimator defined in (11) and (12), but of course the matrix estimation problem is not exactly equivalent to the mean estimation problem (9) as noise is not exactly normal or iid and the loss is the spectral norm, not a vector norm or the Frobenius norm. We shall make our technical analysis precise in Sections 3 and 7.

In the present paper, we consider simultaneously a class of thresholding functions that satisfy the following conditions:

• for all satisfy and some ;

• for ;

• , for all .

These three conditions are satisfied, for example, by the soft thresholding rule and the adaptive lasso rule with , as called in Rothman, Levina and Zhu (2009). We shall present a unified analysis of the adaptive thresholding estimators with the thresholding function satisfying the above three conditions. It should be noted that Condition (i) excludes the hard thresholding rule. However, all of the theoretical results in this paper hold for the hard thresholding estimator under similar conditions. Here Condition (i) is in place only to make the technical analysis in Section 7 work in a unified way for the class of thresholding rules. The results for the hard thresholding rule require slightly different proofs.

Rothman, Levina and Zhu (2009) proposed generalized universal thresholding estimators

 ^Σg=(^σgij)p×p,~{}~{}where~{}~{}^σgij=¯sλn(^σij)

and satisfies (ii), (iii) and , which is slightly weaker than (i). Similar general universal thresholding rules were introduced and studied by Antoniadis and Fan (2001) in the context of wavelet function estimation. We should note that the generalized universal thresholding estimators suffer the same shortcomings as those of , and like they are sub-optimal over the class .

3 Theoretical properties of adaptive thresholding

We now consider the asymptotic properties of the adaptive thresholding estimator defined in (11) and (12). It is shown that the estimator adaptively attains the optimal rate of convergence over the collection of parameter spaces .

We begin with some notation. Define the standardized variables

 Yi=(Xi−μi)/(Var(Xi))1/2,

where , and let Throughout the paper, denote for the usual Euclidean norm of a vector . For a matrix , define the spectral norm , the matrix norm , and the Frobenius norm . For two sequences of real numbers and , write if there exists a constant such that holds for all sufficiently large , and write if .

3.1 Rate of convergence

It is conventional in the covariance matrix estimation literature to divide the technical analysis into two cases according the the moment conditions on X.

(C1). (Exponential-type tails) Suppose that and there exists some such that

 Eexp(tY2i)≤K1<∞   for all |t|≤η and i. (13)

Furthermore, we assume that for some ,

 minijVar(YiYj)≥τ0. (14)

(C2). (Polynomial-type tails) Suppose that for some , , and for some

 E|Yi|4γ+4+ϵ≤K1   % for all i. (15)

Furthermore, we assume that (14) holds.

Remark 1

Note that (C1) holds with in the Gaussian case where . To this end, let be the correlation coefficient of and . We can then write , where is independent of . So we have . Hence (14) holds with .

The follow theorem gives the rate of convergence over the parameter space under the spectral norm for the thresholding estimator .

Theorem 1

Let and .

• Under (C1), we have, for some constant depending only on , , and ,

 infΣ0∈U⋆qP(∥^Σ⋆(δ)−Σ0∥2≤CK1,δ,c,qs0(p)(logpn)1−q2)≥1−O((logp)−12p−δ+2). (16)
• Under (C2), (16) holds with probability greater than .

Although is larger than the uniformity class , the rates of convergence of over the two classes are of the same order .

Theorem 1 states the rate of convergence in terms of probability. The same rate of convergence holds in expectation with some additional mild assumptions. By (16) and some long but elementary calculations (see also the proof of Lemma 4), we have the following result on the mean squared spectral norm.

Proposition 1

Under (C1) and for some , we have for , , and some constant ,

 supΣ0∈U⋆qE∥^Σ⋆(δ)−Σ0∥22≤Cs20(p)(logpn)1−q. (17)
Remark 2

Cai and Zhou (2010) established the minimax rates of convergence under the spectral norm for sparse covariance matrix estimation over . It was shown that the optimal rate over is . Since , this implies immediately that the convergence rate attained by the adaptive thresholding estimator over in Theorem 1 and (17) is optimal.

Remark 3

The estimator yields immediately an estimate of the correlation matrix which is the object of direct interest in some statistical applications. Denote the corresponding estimator of by with . A parameter space for the correlation matrices is the following ball:

 R⋆q:=R⋆q(s0(p))={R:R≻0, maxip∑j=1|rij|q≤s0(p)}. (18)

Then Theorem 1 holds for estimating the correlation matrix by replacing , and with , and , respectively.

Note that the covariance matrix can be written as , where . The covariance matrix can thus be viewed as a weighted version of the correlation matrix with weights . Correspondingly, the parameter space in (7) can be viewed as the weighted version of given in (18) with the same weights,

 U⋆q:={Σ:Σ≻0, maxip∑j=1(σiiσjj)1/2|rij|q≤s0(p)}.

3.2 Support recovery

A closely related problem to estimating a sparse covariance matrix under spectral norm is the recovery of the support of the covariance matrix. This problem has been considered, for example, in Rothman, Levina and Zhu (2009). For support recovery, it is natural to consider the parameter space

 ¯U0:=¯U0(s0(p))={Σ:maxip∑j=1I{σij≠0}≤s0(p)},

which assumes that the covariance matrix has at most nonzero entries on each row.

Define the support of by . The following theorem shows that the adaptive thresholding estimator recovers the support exactly with high probability when the magnitudes of nonzero entries are above certain threshold.

Theorem 2

Suppose that . Let and

 |σ0ij|>(2+δ+γ)√θijlogpnfor all (i,j)∈Ψ and some γ>0. (19)

If either (C1) or (C2) holds, then we have

 infΣ0∈¯U0P(supp(^Σ⋆(δ))=supp(Σ0))→1.

Similar support recovery result was established for the generalized universal thresholding estimator in Rothman, Levina and Zhu (2009) under the condition and a lower bound condition similar to (19). Note that in Theorem 2, we do not require .

Following Rothman, Levina and Zhu (2009), the ability to recover the support can be evaluated via the true positive rate (TPR) in combination with the false positive rate (FPR), defined respectively as

 TPR=#{(i,j):^σ⋆ij≠0~{}and~{}σij≠0}#{(i,j):σij≠0} \rm and FPR=#{(i,j):^σ⋆ij≠0~{}and~{}σij=0}#{(i,j):σij=0}.

It follows from Theorem 2 directly that and under the conditions of the theorem.

The next result shows that is the optimal choice for support recovery in the sense that a thresholding estimator with any smaller choice of would fail to recover the support of exactly with probability going to one. We assume X satisfies the following condition which is weaker than the Gaussian assumption.

(C3) Suppose that

 E[(Xi−μi)2(Xj−μj)(Xk−μk)]=0,E% [(Xi−μi)(Xj−μj)(Xk−μk)(Xl−μl)]=0

if for all .

Theorem 3

Let with . Suppose that (C1) or (C2) holds. Under (C3) and , if with some and , then

 infΣ0∈¯U0P(supp(^Σ⋆(τ))≠supp(Σ0))→1.
Remark 4

The condition is used in the proof to make sure the covariances of the samples can be well approximated by normal vectors. It can be replaced by if X is a multivariate normal population.

3.3 Comparison with universal thresholding

It is interesting to compare the asymptotic results for adaptive thresholding estimator with the known results for universal thresholding estimators. We begin by comparing the rate of convergence of with that of the universal thresholding estimator introduced in Bickel and Levina (2008) in the case of polynomial-type tails. Suppose that (C2) holds. Bickel and Levina (2008) showed that

 ∥^Σu−Σ0∥2=OP(s0(p)(p1/(1+γ+ϵ/2)n1/2)1−q) (20)

for . Clearly, the convergence rate given in Theorem 1 for the adaptive thresholding estimator is significantly faster than that in (20).

We next compare the rates over the class , . For brevity, we shall focus on the Gaussian case . The following theorem gives the lower bound of the universal thresholding estimator.

Theorem 4

Assume that and . We have, as ,

 infλnsupΣ0∈U⋆qP(∥^Σg−Σ0∥2>364s2−q0(p)(logpn)(1−q)/2)→1 (21)

and hence for large ,

 infλnsupΣ0∈U⋆qE∥^Σg−Σ0∥22≥1512s4−2q0(p)(logpn)1−q. (22)

The rate in (21) is slower than the optimal rate given in (16) when as . Therefore no universal thresholding estimators can be minimax-rate optimal under the spectral norm over if .

If we assume the mean of X is zero and ignore the term in , then the universal thresholding estimators given in Bickel and Levina (2008) and Rothman, Levina and Zhu (2009) use the sample mean of the samples to identify zero entries in the covariance matrix. The support of these estimators depends on the quantities . In the high dimensional setting, the sample mean is usually unstable for non-Gaussian distributions with heavier tails. Non-Gaussian data can often arise from many practical applications such as in finance and genomics. For our estimator, instead of the sample mean, we use the Student statistic to distinguish zero and nonzero entries. Our support recovery depends on the quantities which are more stable than , since statistic is much more stable than the sample mean; see Shao (1999) for the theoretical justification.

4 Data-driven choice of δ

Section 3 analyzes the properties of the adaptive thresholding estimator with a fixed value of . Alternatively, can be selected empirically through cross validation (CV). In Bickel and Levina (2008) the value of the universal thresholding level is not fully specified and the CV method was used to select empirically. They obtained the convergence rate under the Frobenius norm for an estimator that is based only on partial samples. Theoretical analysis on the rate of convergence under the spectral norm is still lacking. In this section, we first briefly describe the CV method for choosing and then derive the theoretical properties of the resulting estimator under the spectral norm.

Divide the sample into two subsamples at random. Let and be the two sample sizes for the random split satisfying , and let , be the two sample covariance matrices from the th split, for , where is a fixed integer. Let and be defined as in (11) from the th split and

 ^R(δ)=1HH∑v=1∥^Σ⋆v1(δ)−^Σv2∥2F.

Let , be points in and take

where is a fixed integer. If there are several attain the minimum value, is chosen to be the smallest one. The final estimator of the covariance matrix is given by .

Theorem 5

Suppose with and for some . Let for some and for some . We have

 infΣ0∈U0P(∥^Σ⋆(^δ)−Σ0∥2≤Cs0(p)(logpn)1/2)→1.
Remark 5

The assumption that is fixed is not a stringent condition since we only consider belonging to the fixed interval . Moreover, we will only focus on the matrices in due to the complexity of the proof. Extending to the case with certain rate and more general is possible. However, it requires far more complicated proof and will not be discussed in the present paper.

Remark 6

The condition used in the theorem is purely for technical reasons and we believe that it is not essentially needed and can be weakened. This condition is not stringent when and it becomes restrictive if .

Similar to the fixed case, we also consider support recovery with the estimator .

Proposition 2

Suppose the conditions in Theorem 5 hold. For , we have

 FPR=OP(s0(p)/p)→0.

Moreover, since , we have with probability tending to one if the lower bound in (19) holds with being replaced by .

5 Numerical Results

The adaptive thresholding procedure presented in Section 2 is easy to implement. In this section, the numerical performance of the proposed adaptive thresholding estimator is studied using Monte Carlo simulations. Both methods for choosing the regularization parameter are considered and their performance are compared with that of universal thresholding estimators. The adaptive thresholding estimator is illustrated in an analysis on a dataset from a small round blue-cell tumors microarray experiment.

5.1 Simulation

The following two types of sparse covariance matrices are considered in the simulations to investigate the numerical properties of the adaptive thresholding estimator .

• Model 1 (banded matrix with ordering). , where , , . is a two-block diagonal matrix. is a banded and sparse covariance matrix. is a diagonal matrix with along the diagonal.

• Model 2 (sparse matrix without ordering). , where , , with independent . Here is a random variable taking value uniformly in ; is a Bernoulli random variable which takes value 1 with probability 0.2 and 0 with probability 0.8; and to ensure that is positive definite.

Under each model, independent and identically distributed -variate random vectors are generated from the normal distribution with mean 0 and covariance matrix , for . In each setting, 100 replications are used. We compare the numerical performance between the adaptive thresholding estimators and and with the universal thresholding estimator of Rothman, Levina and Zhu (2009). Here is selected by five fold cross-validation in Section 4, is the adaptive thresholding estimator with fixed . The thresholding level in is selected by five fold cross-validation method used in Bickel and Levina (2008). For each procedure, we consider two types of thresholding functions, the hard thresholding and the adaptive lasso thresholding with . The losses are measured by three matrix norms: the spectral norm, the matrix norm and the Frobenius norm. We report in Tables 1 and 2 the means and standard errors of these losses. We also carried out simulations with the SCAD thresholding function for both universal thresholding and adaptive thresholding. The phenomenon is very similar. The SCAD adaptive thresholding also outperforms the SCAD universal thresholding. For reasons of space, the results are not reported here.

Under Model 1 and Model 2, both adaptive thresholding estimators and uniformly outperform the universal thresholding rule significantly, regardless which thresholding function or which loss function is used. Between and , performs better than in general. Between the two thresholding functions, the hard thresholding rule outperforms the adaptive lasso thresholding rule for , while the difference is not significant for . For both models, the behaviors of hard and adaptive lasso universal thresholding rules are very similar. They both tend to “over-threshold” and remove many nonzero off-diagonal entries of the covariance matrices.

For support recovery, again both and outperform . The values of TPR and FPR based on the off-diagonal entries are reported in Tables 3 and 4. For Model 1, tends to estimate many nonzero off-diagonal entries by zero when is large. To better illustrate the recovery performance elementwise for the two models, the heat maps of the nonzeros identified out of 100 replications when are pictured in Figures 1 and 2. The heat maps suggest that the sparsity patterns recovered by and have significantly better resemblance to the true model than .