Ultra High-dimensional multivariate posterior contraction rate under shrinkage priors

Ruoyang Zhang and Malay Ghosh
Department of Statistics, University of Florida, Gainesville, FL 32611, United States
Abstract

In recent years, shrinkage priors have received much attention in high-dimensional data analysis from a Bayesian perspective. Compared with widely used spike-and-slab priors, shrinkage priors offer better computational efficiency, but their theoretical properties, especially posterior contraction rates, which are important for uncertainty quantification, have not been established in many cases. In this paper, we apply global-local shrinkage priors to high-dimensional multivariate linear regression with an unknown covariance matrix. We show that when the prior is highly concentrated near zero and has heavy tails, the posterior contraction rates for both the coefficient matrix and the covariance matrix are nearly optimal. Our results hold when the number of features p grows much faster than the sample size n, a regime of great interest in modern data analysis. We also show that a class of readily implementable scale mixtures of normal priors satisfies the conditions of the main theorem.

keywords:
multivariate regression, unknown covariance matrix, Gaussian scale mixture

1 Introduction

Parameter estimation, variable selection, and prediction in high-dimensional regression models have received significant attention in recent years, particularly when the number of regressors is much larger than the number of observations. Examples abound: brain imaging, microarray experiments, and satellite data analysis, to name a few. In many of these examples, one key issue is to address the sparsity of effective regression parameters in the midst of a multitude of inactive ones. For example, only a few significant genes are associated with Type I diabetes, along with millions of others with no direct impact on the disease.

In a frequentist framework, the most common approach for inducing sparsity is to impose a regularization penalty on the parameters of interest. The most popular ones are the ℓ1 (lasso) and ℓ2 (ridge) penalties, or a combination of the two (elastic net). These penalties extend naturally to the multivariate case where sparsity in the coefficient matrix is desired. Rothman et al. (2010) used ℓ1 penalties on each entry of the coefficient matrix as well as on each off-diagonal element of the covariance matrix. Wilms and Croux (2018) considered a model that put a group penalty on the rows of the coefficient matrix to shrink entire rows to zero, and an ℓ1 penalty on the off-diagonal elements of the inverse error covariance matrix. Li et al. (2015) proposed a multivariate sparse group lasso, imposing a group penalty on the rows of the regression matrix and, in addition, an ℓ1 penalty on individual coefficients of the regression matrix to perform sparse estimation and variable selection both at the between-group and within-group levels.

In a Bayesian setting, spike-and-slab priors, originally introduced by Mitchell and Beauchamp (1988), have become very popular for introducing sparsity. Spike-and-slab priors are mixture densities with positive mass at zero to force some parameters to be exactly zero, and a continuous density to model the nonzero coefficients. These priors have been used in a variety of contexts. For example, for the Bayesian group lasso, Xu and Ghosh (2015) used these priors for both variable selection and estimation. This work was extended by Liquet et al. (2017) to the multivariate case. More recently, Ročková and George (2018) introduced the spike-and-slab lasso for variable selection and estimation. Deshpande et al. (2019) extended it to the multivariate case by putting a spike-and-slab prior on each entry of the coefficient matrix as well as on each off-diagonal element of the precision matrix.

Spike-and-slab priors face severe computational challenges when p, the number of regressors, is very large, since one must search over 2^p possible models. Bai and Ghosh (2018) provided an alternative to these priors by introducing global-local shrinkage priors. These priors approximate spike-and-slab priors well and are usually much easier to implement because they are continuous. Like spike-and-slab priors, global-local shrinkage priors also put significant probability around zero, but retain heavy enough tails that true signals are very unlikely to be missed.
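This "concentration near zero plus heavy tails" behavior can be seen empirically. The sketch below draws from a horseshoe-type global-local prior, θ_j = τ λ_j z_j with half-Cauchy local scales λ_j; the global scale τ = 0.1 and the two thresholds are illustrative choices of ours, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def horseshoe_draws(n_draws, tau=0.1, rng=rng):
    # Local scales: half-Cauchy; global scale tau shrinks everything toward 0.
    lam = np.abs(rng.standard_cauchy(n_draws))
    z = rng.standard_normal(n_draws)
    return tau * lam * z

theta = horseshoe_draws(100_000)

# Heavy concentration near zero (mimicking the spike) ...
frac_near_zero = np.mean(np.abs(theta) < 0.05)
# ... yet heavy enough tails that large signals still appear occasionally.
frac_large = np.mean(np.abs(theta) > 3.0)
print(frac_near_zero, frac_large)
```

A substantial fraction of draws fall in a tiny neighborhood of zero, while a small but nonzero fraction are very large, which is precisely the approximation to a spike-and-slab that the continuous prior provides.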

Bai and Ghosh (2018) considered the case where the number of regressors can grow at a sub-exponential rate relative to the sample size. They established posterior consistency for their prior and showed that the insignificant regression coefficients converge to zero at an exponential rate. Song and Liang (2017) provided general posterior contraction rates in the context of variable selection and estimation in univariate regression models with unknown variance.

Our paper is a follow-up to the work of Bai and Ghosh (2018) and Song and Liang (2017). In particular, unlike the former, we do not need to assume a known covariance matrix in the regression model to establish exponential convergence rates of tail probabilities. We propose a set of general conditions on continuous priors for achieving a nearly optimal posterior contraction rate for both the coefficient matrix and the covariance matrix, extending the work of Song and Liang (2017) to the multivariate case. We also demonstrate that these regularity conditions are satisfied by a general class of global-local shrinkage priors. Our technical results borrow tools developed by Song and Liang (2017), but handling multivariate data presented some new challenges in proving the results.

Ning and Ghosal (2018) also addressed variable selection with an unknown covariance matrix and established posterior consistency results similar to ours, but their results are based on spike-and-slab priors instead of global-local shrinkage priors and use different techniques from ours.

This paper is organized as follows. In Section 2, we establish general conditions on priors for achieving a nearly optimal posterior contraction rate for both the coefficient matrix and the covariance matrix. In Section 3, a class of global-local shrinkage priors that satisfies these general conditions is proposed. In Section 4, finite-sample performance of the proposed model is evaluated through numerical experiments. Some final remarks are made in Section 5. Most of the technical theorems and lemmas are relegated to the Appendix.

2 Posterior Contraction Rate of B and Σ

2.1 Problem Setting

We consider the following multivariate linear regression model

 Y_i = X_i B_n + ε_i Σ_n^{1/2},   i = 1, ⋯, n    (1)

where Y_i is a 1 × q_n response vector whose within-observation correlation is captured by the covariance matrix Σ_n, B_n is a p_n × q_n coefficient matrix, X_i is a 1 × p_n regressor vector, and ε_i is a 1 × q_n noise vector. Throughout this paper, the ε_i's are assumed to be i.i.d. multivariate normal, N_{q_n}(0, I_{q_n}). The subscript n denotes that the quantity can vary with n. In matrix form, Model (1) can be written as

 Y_n = X_n B_n + ε_n Σ_n^{1/2}    (2)

where Y_n = (Y_1^T, ⋯, Y_n^T)^T, X_n = (X_1^T, ⋯, X_n^T)^T and ε_n = (ε_1^T, ⋯, ε_n^T)^T.
Throughout the paper, for notational simplicity, the subscript n on Y_n, X_n, B_n, ε_n and Σ_n may be dropped when there is no ambiguity.

For estimation of B and Σ, we consider the following Bayesian multivariate linear regression model. This model puts an independent prior on each row vector of B conditional on Σ, and an Inverse-Wishart prior on Σ. General conditions on the prior π for establishing a satisfactory posterior contraction rate for B and Σ are given in Theorem 1.

 Y_i | X_i, B, Σ ~ind N_{q_n}(X_i B, Σ),   i = 1, ⋯, n
 B_j | Σ ~ind π(B_j | Σ),   j = 1, ⋯, p_n
 Σ ~ IW_{q_n}(υ, Φ)    (3)

where B_j is the j-th row of B, and IW_{q_n}(υ, Φ) denotes the q_n-dimensional Inverse-Wishart distribution with υ degrees of freedom and positive definite scale matrix Φ.
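As an illustration of the hierarchy in (3), the following sketch forward-simulates one data set. The dimensions, the Inverse-Wishart hyperparameters (ν = q + 2, Φ = I), and the simple row prior B_j | Σ ~ N_q(0, Σ) are placeholder choices of ours; the theory allows a general prior π(B_j | Σ).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 50, 5, 3
nu = q + 2  # Inverse-Wishart degrees of freedom; scale Phi = I_q (our choice)

# Sigma ~ IW_q(nu, I): invert a Wishart draw built from nu i.i.d. normal rows.
G = rng.standard_normal((nu, q))
Sigma = np.linalg.inv(G.T @ G)
L = np.linalg.cholesky(Sigma)

B = rng.standard_normal((p, q)) @ L.T        # rows B_j | Sigma ~ N_q(0, Sigma)
X = rng.uniform(-1.0, 1.0, size=(n, p))      # uniformly bounded design entries
E = rng.standard_normal((n, q))
Y = X @ B + E @ L.T                          # Y_i | X_i, B, Sigma ~ N_q(X_i B, Sigma)
print(Y.shape)
```

Multiplying standard normal rows by the Cholesky factor L of Σ is a standard way to obtain rows with covariance Σ, matching the matrix form Y = XB + εΣ^{1/2} in (2).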

2.2 Notations

First, a few notations used throughout the paper are defined. We write a ∨ b for max(a, b), where a and b are real numbers. Letters with subscripts denote generic positive constants that do not depend on n. For two sequences of positive real numbers a_n and b_n, a_n ≲ b_n is equivalent to a_n = O(b_n), i.e. there exists a constant c such that a_n ≤ c b_n for all large n. a_n ≺ b_n means a_n = o(b_n), that is, a_n / b_n → 0 as n → ∞. a_n ≍ b_n denotes that there exist constants c_1, c_2 > 0 such that c_1 b_n ≤ a_n ≤ c_2 b_n.

For a vector v, ∥v∥ denotes its ℓ2 norm. For a real matrix A with entries a_ij, ∥A∥_F denotes the Frobenius norm of A; we also use A's maximum row length (the largest ℓ2 norm among its rows) and the sum of its row lengths. For a symmetric real matrix A, λ_min(A) denotes the smallest eigenvalue of A, and ∥A∥ denotes the spectral norm of A, which for a symmetric positive semi-definite matrix is also its maximum eigenvalue.
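For concreteness, these matrix quantities can be computed as follows; the example matrix is our own.

```python
import numpy as np

A = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 2.0]])

frob = np.linalg.norm(A, 'fro')                # Frobenius norm: sqrt(sum of squares)
row_lens = np.linalg.norm(A, axis=1)           # l2 length of each row
max_row_len = row_lens.max()                   # maximum row length
sum_row_len = row_lens.sum()                   # sum of row lengths

S = A.T @ A                                    # symmetric positive semi-definite
lam_min = np.linalg.eigvalsh(S).min()          # smallest eigenvalue
spec = np.linalg.norm(S, 2)                    # spectral norm; equals max eigenvalue for PSD S
print(frob, max_row_len, sum_row_len)
```

Here frob = √30, the row lengths are (5, 0, √5), and the spectral norm of the symmetric PSD matrix S coincides with its largest eigenvalue, as stated in the text.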

2.3 Conditions for Posterior Contraction Rate

Suppose the data are generated by (1) with true regression parameter B_0 and true dispersion matrix Σ_0. To obtain the posterior contraction rate, we first state some assumptions on the sparsity of B_0, the eigen-structure of the design matrix X, and the eigenvalues of Σ_0.

Assumption 1.

Sparsity of :
where s_0 is the size of the true model, i.e., the number of nonzero rows in B_0.

Assumption 2.

Eigen-structure of the design matrix X:
: Entries in the design matrix X are uniformly bounded. For simplicity, assume they are bounded by 1.
: as n → ∞.
: There exist an integer p̃ (depending on n and p_n) and a fixed constant λ_0 > 0 such that λ_min(X_S^T X_S) ≥ n λ_0 for any subset model S with |S| ≤ p̃ + s_0.

Assumption 3.

Dimension and eigenvalues of :
: .
: .
:

Remark.

Assumption 2 and Assumption 1 are the same as in Song and Liang (2017). Note that these assumptions do not restrict the rate at which p_n goes to infinity: p_n can grow sub-exponentially fast with n when q_n is finite, e.g., p_n = O(e^{n^a}) for some 0 < a < 1, which is the ultrahigh-dimensional setting in Bai and Ghosh (2018).

Remark.

These conditions are the same as in Ning and Ghosal (2018). Different from many previous settings where the dimension of the response is a fixed constant (Bai and Ghosh (2018); Liquet et al. (2017)), here we allow q_n to grow with n. However, the growth of q_n is limited by the constraints in Assumption 3. When q_n is a fixed constant, these constraints are trivially satisfied.

Theorem 1.

For the multivariate Bayesian model given in (3), suppose the design matrix X satisfies Assumption 2 and the true parameters satisfy Assumptions 1 and 3. Let the prior density of B be:

 π(B | Σ) = ∏_{j=1}^{p_n} { |Σ|^{-1/2} g_τ(B_j Σ^{-1/2}) }

If g_τ satisfies

 ∫_{∥x∥_2 ≥ a_n} g_τ(x) dx ≤ p_n^{-(1+u)} for some u > 0,    (4)
 log( inf_{∥x∥_2 ≤ M_0} g_τ(x) ) ≳ −log p_n    (5)

where a_n and M_0 are as specified in the Appendix,

then the following posterior contraction results hold:

 Π_n( ∥(B − B_0) Σ_0^{-1/2}∥_F ≥ M ε_n | Y ) → 0 in P_{(B_0, Σ_0)}-probability,
 Π_n( ∥Σ − Σ_0∥ ≥ M ∥Σ_0∥ ε_n | Y ) → 0 in P_{(B_0, Σ_0)}-probability,

where ε_n is the contraction rate and M is a sufficiently large constant.

Remark.

Conditions (4) and (5) on g_τ have intuitive interpretations. (4) means that the prior has to be highly concentrated in a small neighborhood of zero, which corresponds to the sparsity structure of the model. Taking M_0 as the strength of the true signal, (5) means that the prior needs to put enough mass around the true signal, which is often referred to as a heavy-tail condition in Polson and Scott (2010) and Armagan et al. (2013b, a).

Remark.

When q_n is a fixed constant, the contraction rate matches the optimal posterior contraction rates for regression coefficients in the univariate case in Castillo et al. (2015) and Ročková et al. (2018), where spike-and-slab priors are used. In addition, this rate is comparable to the minimax rate of the lasso and Dantzig selector (Raskutti et al. (2011); Ye and Zhang (2010)). The two additional terms that may slow the convergence can be viewed as the price of allowing q_n → ∞.

Remark.

Combining the two results above, if ∥Σ_0∥ is bounded by a constant, the posterior contraction rate of B under the Frobenius norm is also of order ε_n.

The complete proof of Theorem 1 is provided in the Appendix. Here we briefly summarize the ideas and key steps. We apply the tools developed in Song and Liang (2017). To extend the univariate contraction results to the multivariate case, the spectral norm is used for measuring matrix distances; through its relation to the Frobenius norm, we are able to make straightforward interpretations.

To show the posterior contraction results, auxiliary sets A_n, B_n and C_n are constructed as follows.

 A_n = {at least p̃ of the row norms ∥B_j Σ^{-1/2}∥ are larger than a_n}
     ∪ {∥Σ − Σ_0∥ ≥ M (∥Σ∥ ∨ ∥Σ_0∥) ε_n}
     ∪ {∥(B − B_0) Σ_0^{-1/2}∥_F ≥ M ε_n}

The sets B_n and C_n are defined from A_n in the Appendix. By Lemma A.4 in Song and Liang (2017), the proof is composed of three parts:
(1) Construction of a test ϕ_n satisfying the required type I and type II error bounds.
(2) Showing that the event B_n has very small probability under the specified prior.
(3) Demonstrating that the marginal probability of the data is highly likely to be bounded away from 0 if the data are generated with the true parameters. Probability bounds for the Inverse-Wishart distribution (Ning and Ghosal (2018)) are applied in this part.

3 Extended MBSP Model with Unknown Covariance Matrix

In the previous section, we established general conditions on the prior for obtaining good posterior contraction. In this section, we propose a class of global-local shrinkage priors that satisfies conditions (4) and (5).

The class of priors we propose consists of scale mixtures of Gaussians, closely related to the Multivariate Bayesian model with Shrinkage Priors (MBSP) introduced by Bai and Ghosh (2018). In MBSP, Σ is assumed to be fixed and known. Here we put an Inverse-Wishart prior on Σ, extending the model to the unknown-Σ case and obtaining the following Extended MBSP model:

 B_j | ξ_j, Σ ~ind N_{q_n}(0, τ_n ξ_j Σ),   j = 1, ⋯, p_n
 ξ_j ~ind π(ξ_j)
 Σ ~ IW_{q_n}(υ, Φ)    (6)

In the univariate case (q_n = 1), many priors can be expressed in the form of a scale mixture of Gaussians (Tang et al. (2018)). Table 1 lists such priors and the corresponding mixing densities π(ξ).

We now show that when the mixing component π follows a certain polynomial-tailed distribution, posterior contraction is obtained with a proper global shrinkage parameter τ_n.

Theorem 2.

 B_j | ξ_j, Σ ~ind N_{q_n}(0, τ_n ξ_j Σ),   j = 1, ⋯, p_n
 ξ_j ~ind π(ξ_j)    (7)

where π(ξ) is a polynomial-tailed distribution. If π satisfies either of the two following conditions for all sufficiently large ξ:
;
,
then (4) and (5) hold with a suitably chosen τ_n.

Remark.

It is easy to see that such a τ_n must exist.

Remark.

Many commonly used shrinkage priors satisfy one of the two conditions. As shown in Table 2, the mixing components of Student's t, TPBN (horseshoe and NEG are special cases of TPBN) and HIB satisfy the first condition; horseshoe+ satisfies the second. Proofs of these bounds are provided in the Appendix.

Remark.

For applications, we recommend the TPBN prior. It has been shown in Bai and Ghosh (2018) that the TPBN prior is easy to implement using Gibbs sampling, and the accompanying MBSP package is readily available. They also compared their simulation results with other high-dimensional multivariate models.
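A draw from the scale-mixture prior (7) with a horseshoe-type mixing density (a TPBN special case, ξ_j = λ_j² with λ_j half-Cauchy) can be sketched as follows. The dimensions, the identity Σ, and the particular decay of the global parameter τ are illustrative assumptions of ours, not the paper's prescribed values.

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 1000, 3
tau = 1.0 / (p * np.sqrt(np.log(p)))   # small global parameter, decreasing in p (our choice)

Sigma = np.eye(q)                      # identity covariance, for illustration only
L = np.linalg.cholesky(Sigma)

# Horseshoe-type local scales: xi_j = lambda_j^2 with lambda_j ~ half-Cauchy.
xi = np.abs(rng.standard_cauchy(p)) ** 2

# Rows B_j | xi_j, Sigma ~ N_q(0, tau * xi_j * Sigma).
B = (np.sqrt(tau * xi)[:, None] * rng.standard_normal((p, q))) @ L.T

row_norms = np.linalg.norm(B, axis=1)
print(np.median(row_norms))            # most rows are shrunk very close to zero
```

With a small global τ, the typical row is negligible while the polynomial tail of ξ_j still permits occasional large rows, which is exactly the behavior conditions (4) and (5) require.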

4 Numerical Experiments

Through numerical experiments, we examine the uncertainty assessment for the covariance matrix estimate using the scale mixtures of Gaussians proposed in Section 3. We explore how the difference between the estimate and the truth varies as n and p grow. More simulations evaluating performance on coefficient matrix reconstruction, prediction, and variable selection under various settings are presented in Bai and Ghosh (2018).

In our simulation, the horseshoe mixing density is used. We focus on performance in an ultra high-dimensional and ultra-sparse setting, where p greatly exceeds n and the proportion of nonzero coefficients is very small. The six experiment settings are listed below.

Experiment 1: .
Experiment 2: .
Experiment 3: .
Experiment 4: .
Experiment 5: .
Experiment 6: .

In all six experiments, data are generated according to the multivariate linear regression model (1). Each row of X is generated independently from a multivariate normal distribution. The true coefficient matrix B_0 is generated by uniformly selecting s_0 nonzero rows, with the other rows set to zero. For nonzero rows, each entry is independently sampled from a common distribution. The true covariance matrix Σ_0 is fixed.
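A hedged sketch of this simulation design is below; the dimensions, sparsity level, signal distribution, and Σ_0 are placeholders of ours, standing in for the settings specified in the experiment list above.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q, s0 = 100, 4000, 3, 20                     # ultra high-dimensional, ultra-sparse

X = rng.standard_normal((n, p))                    # rows of X drawn independently
B0 = np.zeros((p, q))
active = rng.choice(p, size=s0, replace=False)     # uniformly selected nonzero rows
B0[active] = rng.uniform(0.5, 2.0, size=(s0, q))   # placeholder nonzero-entry distribution
Sigma0 = np.eye(q)                                 # placeholder true covariance

E = rng.standard_normal((n, q))
Y = X @ B0 + E @ np.linalg.cholesky(Sigma0).T      # model (1)
print(Y.shape)
```

This produces one replicate of (X, Y, B0, Σ0); in the experiments each such replicate is then fed to the Gibbs sampler described below.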

By Theorem 2, when Assumptions 1-3 hold, the global shrinkage parameter τ_n for the nearly minimax posterior contraction rate should satisfy the condition stated there. In applications, however, this value is very small; for instance, the required τ_n in Experiment 3 would be extremely close to zero. Too small a τ_n causes numerical problems in Gibbs sampling (Van Der Pas et al. (2014)). Currently, inference for the global hyperparameter is still an open problem (Piironen and Vehtari (2017)). Here, we use a choice of τ_n that achieves posterior consistency, although a theoretical posterior contraction rate is not available for it (Bai and Ghosh (2018)). We use the Gibbs sampler in the MBSP package, whose major computational complexity is linear in p_n (Bhattacharya et al. (2016)). Each experiment is repeated 100 times. In all experiments, the Gibbs sampler is run for 15000 iterations, with the first 5000 iterations discarded as burn-in.

The posterior mean is taken as the point estimator of Σ. The differences between the posterior estimate and the truth are measured in two different norms. Figure 1 illustrates how these errors decrease as n and p increase. Although this trend is a finite-sample behavior, it matches the posterior consistency results established in the previous section.

5 Conclusion and Future Work

This paper makes several contributions. First, we propose a set of general conditions on continuous priors in sparse multivariate Bayesian estimation that achieve a nearly optimal posterior contraction rate. While previous Bayesian multivariate models usually assume Σ to be fixed and known, our work establishes posterior contraction for both the coefficient matrix B and the covariance matrix Σ. Moreover, we allow p to grow nearly exponentially with n and the response dimension q to go to infinity. To the best of our knowledge, ours is the first paper showing a nearly optimal contraction rate for continuous shrinkage priors in this setting. The tools developed in our proofs can also be utilized in other multivariate Bayesian models. On the applied side, we show that a large family of heavy-tailed priors, including the Student's t prior, the horseshoe and horseshoe+ priors, and the generalized double Pareto prior, satisfies the conditions and hence enjoys good posterior contraction.

Although we have established an informative reconstruction rate, many important issues remain unexplored. One is a variable selection criterion and the corresponding selection consistency property. Continuous global-local shrinkage priors put zero probability at the point zero, so a selection rule is needed for variable selection, and selection consistency under such a rule is needed for theoretical justification.

Modeling the sparsity of Σ or its inverse is also important. In our paper, no structure on Σ is assumed; although the dimension of the response is allowed to grow, it has to be much smaller than the sample size in order to keep Σ consistently estimable (Ning and Ghosal (2018)). Recently, to encourage sparsity of the precision matrix, Li et al. (2019) proposed a model putting a horseshoe prior on the regression coefficients and a graphical horseshoe prior on the precision matrix.

Another interesting question is whether to adopt the joint scale-invariant prior framework. We use a scale-invariant prior in this paper, but this may result in underestimating the model error (Moran et al. (2019)). Moran et al. (2019) recommend a priori independent priors on the regression coefficients and the error variance to prevent distortion of the global-local shrinkage mechanism and to obtain better estimates of the error variance.

Acknowledgments

The authors are grateful to Ray Bai and Qian Qin for helpful comments and suggestions.

References

• Armagan et al. (2011) Armagan, A., Clyde, M., Dunson, D.B., 2011. Generalized beta mixtures of gaussians. Advances in neural information processing systems , 523–531.
• Armagan et al. (2013a) Armagan, A., Dunson, D.B., Lee, J., 2013a. Generalized double pareto shrinkage. Statistica Sinica 23, 119.
• Armagan et al. (2013b) Armagan, A., Dunson, D.B., Lee, J., Bajwa, W.U., Strawn, N., 2013b. Posterior consistency in linear models under shrinkage priors. Biometrika 100, 1011–1018.
• Bai and Ghosh (2018) Bai, R., Ghosh, M., 2018. High-dimensional multivariate posterior consistency under global–local shrinkage priors. Journal of Multivariate Analysis 167, 157–170.
• Bhadra et al. (2017) Bhadra, A., Datta, J., Polson, N.G., Willard, B., et al., 2017. The horseshoe+ estimator of ultra-sparse signals. Bayesian Analysis 12, 1105–1131.
• Bhattacharya et al. (2016) Bhattacharya, A., Chakraborty, A., Mallick, B.K., 2016. Fast sampling with gaussian scale mixture priors in high-dimensional regression. Biometrika , asw042.
• Carvalho et al. (2010) Carvalho, C.M., Polson, N.G., Scott, J.G., 2010. The horseshoe estimator for sparse signals. Biometrika 97, 465–480.
• Castillo et al. (2015) Castillo, I., Schmidt-Hieber, J., Van der Vaart, A., et al., 2015. Bayesian linear regression with sparse priors. The Annals of Statistics 43, 1986–2018.
• Deshpande et al. (2019) Deshpande, S.K., Ročková, V., George, E.I., 2019. Simultaneous variable and covariance selection with the multivariate spike-and-slab lasso. Journal of Computational and Graphical Statistics (to appear).
• Griffin et al. (2010) Griffin, J.E., Brown, P.J., et al., 2010. Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis 5, 171–188.
• Li et al. (2019) Li, Y., Datta, J., Craig, B.A., Bhadra, A., 2019. Joint mean-covariance estimation via the horseshoe with an application in genomic data analysis. arXiv preprint arXiv:1903.06768 .
• Li et al. (2015) Li, Y., Nan, B., Zhu, J., 2015. Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics 71, 354–363.
• Liquet et al. (2017) Liquet, B., Mengersen, K., Pettitt, A., Sutton, M., et al., 2017. Bayesian variable selection regression of multivariate responses for group data. Bayesian Analysis 12, 1039–1067.
• Mitchell and Beauchamp (1988) Mitchell, T.J., Beauchamp, J.J., 1988. Bayesian variable selection in linear regression. Journal of the American Statistical Association 83, 1023–1032.
• Moran et al. (2019) Moran, G.E., Rocková, V., George, E.I., 2019. On variance estimation for bayesian variable selection. Bayesian Analysis (to appear).
• Ning and Ghosal (2018) Ning, B., Ghosal, S., 2018. Bayesian linear regression for multivariate responses under group sparsity. arXiv preprint arXiv:1807.03439 .
• Piironen and Vehtari (2017) Piironen, J., Vehtari, A., 2017. On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics 54, 905–913.
• Polson and Scott (2010) Polson, N.G., Scott, J.G., 2010. Shrink globally, act locally: Sparse bayesian regularization and prediction. Bayesian statistics 9, 501–538.
• Polson et al. (2012) Polson, N.G., Scott, J.G., et al., 2012. On the half-cauchy prior for a global scale parameter. Bayesian Analysis 7, 887–902.
• Raskutti et al. (2011) Raskutti, G., Wainwright, M.J., Yu, B., 2011. Minimax rates of estimation for high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory 57, 6976–6994.
• Ročková and George (2018) Ročková, V., George, E.I., 2018. The spike-and-slab lasso. Journal of the American Statistical Association 113, 431–444.
• Ročková et al. (2018) Ročková, V., et al., 2018. Bayesian estimation of sparse signals with a continuous spike-and-slab prior. The Annals of Statistics 46, 401–437.
• Rothman et al. (2010) Rothman, A.J., Levina, E., Zhu, J., 2010. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics 19, 947–962.
• Scott (2015) Scott, D.W., 2015. Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons.
• Song and Liang (2017) Song, Q., Liang, F., 2017. Nearly optimal bayesian shrinkage for high dimensional regression. arXiv preprint arXiv:1712.08964 .
• Tang et al. (2018) Tang, X., Xu, X., Ghosh, M., Ghosh, P., 2018. Bayesian variable selection and estimation based on global-local shrinkage priors. Sankhya A 80, 215–246.
• Van Der Pas et al. (2014) Van Der Pas, S., Kleijn, B., Van Der Vaart, A., et al., 2014. The horseshoe estimator: Posterior concentration around nearly black vectors. Electronic Journal of Statistics 8, 2585–2618.
• Vershynin (2012) Vershynin, R., 2012. Introduction to the non-asymptotic analysis of random matrices, 2011. arXiv preprint arXiv:1011.3027 .
• Wilms and Croux (2018) Wilms, I., Croux, C., 2018. An algorithm for the multivariate group lasso with covariance estimation. Journal of Applied Statistics 45, 668–681.
• Xu and Ghosh (2015) Xu, X., Ghosh, M., 2015. Bayesian variable selection and estimation for group lasso. Bayesian Analysis 10, 909–936.
• Ye and Zhang (2010) Ye, F., Zhang, C.H., 2010. Rate minimaxity of the lasso and Dantzig selector for the ℓq loss in ℓr balls. Journal of Machine Learning Research 11, 3519–3540.

Appendix

Proof.

(Theorem 1) Define auxiliary sets A_n, B_n and C_n as follows.

 A_n = {at least p̃ of the row norms ∥B_j Σ^{-1/2}∥ are larger than a_n}
     ∪ {∥Σ − Σ_0∥ ≥ M (∥Σ∥ ∨ ∥Σ_0∥) ε_n}
     ∪ {∥(B − B_0) Σ_0^{-1/2}∥_F ≥ M ε_n},

with B_n and C_n defined from A_n as in Song and Liang (2017). By Lemma A.4 in Song and Liang (2017), it suffices to show the following three parts:

 π(B_n) ≤ e^{-k_1 n ε_n^2}    (8)

There exists a test function ϕ_n such that

 E_{θ_0} ϕ_n ≤ e^{-k_2 n ε_n^2},    (9)
 sup_{θ ∈ C_n} E_θ(1 − ϕ_n) ≤ e^{-k_3 n ε_n^2}    (10)

And for sufficiently large n,

 P_{θ_0}( m(D_n) / f_{θ_0}(D_n) ≥ e^{-k_4 n ε_n^2} ) ≥ 1 − e^{-k_5 n ε_n^2}    (11)

for some constants k_4, k_5 > 0, where m is the marginal density of the data D_n.

So the proof is composed of three parts: (I) construction of a test ϕ_n satisfying (9) and (10), (II) showing that the event B_n has very small probability under the specified prior, (III) showing that the marginal probability of the data is highly likely to be bounded away from 0 if the data are generated with the true parameters.

Part I: First, we show (9) and (10) by constructing the test function in the following way.
For a given subset S, consider the following test functions ϕ^(1)_{n,S} and ϕ^(2)_{n,S}:

 ϕ^(1)_{n,S} = 1{ ∥ (n − |S|)^{-1} Y^T (I_n − H_S) Y − Σ_0 ∥ ≥ M ∥Σ_0∥ ε_n / 2 }
 ϕ^(2)_{n,S} = 1{ ∥ ((X_S^T X_S)^{-1} X_S^T Y − B_{0,S}) Σ_0^{-1/2} ∥_F ≥ M ε_n / 2 }

where H_S = X_S (X_S^T X_S)^{-1} X_S^T, X_S is the submatrix of X composed of the columns indexed by S, and B_S and B_{0,S} are the submatrices of B and B_0 composed of the rows indexed by S, respectively.

We have the following two inequalities for ϕ^(1)_{n,S} and ϕ^(2)_{n,S}:

 E_{(B_0,Σ_0)} ϕ^(1)_{n,S} = P_{(B_0,Σ_0)}( ∥ (n − |S|)^{-1} Y^T (I_n − H_S) Y − Σ_0 ∥ ≥ M ∥Σ_0∥ ε_n / 2 ) ≤ e^{-c M^2 n ε_n^2 / (4 K^2)}   by Lemma 1
 E_{(B_0,Σ_0)} ϕ^(2)_{n,S} = P( ∥ (X_S^T X_S)^{-1} X_S^T ε ∥_F ≥ M ε_n / 2 )
 ≤ P( λ_max((X_S^T X_S)^{-1}) tr(ε^T H_S ε) ≥ M^2 ε_n^2 / 4 )
 ≤ P( χ^2_{q_n |S|} ≥ M^2 λ_0 n ε_n^2 / 4 )
 ≤ e^{-λ_0 M^2 n ε_n^2 / 16}

The last inequality holds because q_n |S| ≲ n ε_n^2, and P(χ^2_d ≥ x) ≤ e^{-x/4} when x ≥ 8d (Armagan et al. (2013b)).

Let ϕ_n = max(ϕ^(1)_n, ϕ^(2)_n), where

 ϕ^(i)_n = max_{S ⊇ S_0, |S| ≤ p̃ + s_0} ϕ^(i)_{n,S},   i = 1, 2.
 E_{(B_0,Σ_0)} ϕ_n ≤ E_{(B_0,Σ_0)} ∑_{S ⊇ S_0, |S| ≤ p̃ + s_0} (ϕ^(1)_{n,S} + ϕ^(2)_{n,S})
 ≤ ∑_{i=0}^{p̃} binom(p_n − s_0, i) · 2 e^{-k_{23} M n ε_n^2},   where k_{23} = min( c M / (4 K^2), λ_0 M / 16 )
 ≤ 2 (p̃ + 1) (p_n − s_0)^{p̃} e^{-k_{23} M n ε_n^2}

Taking logs on both sides,

 log( E_{(B_0,Σ_0)} ϕ_n ) ≤ p̃ log(p_n − s_0) + log(2(p̃ + 1)) − k_{23} M n ε_n^2
 ≤ (p̃ + 1) log p_n − k_{23} M n ε_n^2
 ≤ −k_{23} M n ε_n^2 / 2,   where p̃ = ⌊ k_{23} M n ε_n^2 / (2 log p_n) ⌋ − 1

Note that a sufficiently large M ensures that such a p̃ exists.

Now we want to show (10). Consider the following two sets C_{n,1} and C_{n,2}:

 C_{n,1} = { ∥Σ − Σ_0∥ ≥ M (∥Σ∥ ∨ ∥Σ_0∥) ε_n } ∩ { at most p̃ of the row norms ∥B_j Σ^{-1/2}∥ are larger than a_n },
 C_{n,2} = { ∥(B − B_0) Σ_0^{-1/2}∥_F ≥ M ε_n, ∥Σ − Σ_0∥ ≤ M (∥Σ∥ ∨ ∥Σ_0∥) ε_n } ∩ { at most p̃ of the row norms ∥B_j Σ^{-1/2}∥ are larger than a_n }.

It is easy to verify that C_n ⊆ C_{n,1} ∪ C_{n,2}, so we have

 sup_{(B,Σ) ∈ C_n} E_{(B,Σ)}(1 − ϕ_n) = sup_{(B,Σ) ∈ C_n} E_{(B,Σ)} min(1 − ϕ^(1)_n, 1 − ϕ^(2)_n)
 ≤ max{ sup_{(B,Σ) ∈ C_{n,1}} E_{(B,Σ)}(1 − ϕ^(1)_n), sup_{(B,Σ) ∈ C_{n,2}} E_{(B,Σ)}(1 − ϕ^(2)_n) }

By the definition of ϕ^(1)_n, for any S with S ⊇ S_0 and |S| ≤ p̃ + s_0, we have

 sup_{(B,Σ) ∈ C_{n,1}} E_{(B,Σ)}(1 − ϕ^(1)_n) ≤ sup_{(B,Σ) ∈ C_{n,1}} E_{(B,Σ)}(1 − ϕ^(1)_{n,S}).

Taking S ⊇ S_0 to also contain the at most p̃ indices j with ∥B_j Σ^{-1/2}∥ > a_n, then

 sup_{(B,Σ) ∈ C_{n,1}} E_{(B,Σ)}(1 − ϕ^(1)_{n,S}) = sup_{(B,Σ) ∈ C_{n,1}} P( ∥ (n − |S|)^{-1} Y^T (I_n − H_S) Y − Σ_0 ∥ ≤ M ∥Σ_0∥ ε_n / 2 )
 ≤ sup_{(B,Σ) ∈ C_{n,1}} P( ∥ (n − |S|)^{-1} Y^T (I_n − H_S) Y − Σ ∥ ≥ ∥Σ − Σ_0∥ − M ∥Σ_0∥ ε_n / 2 )
   (note that ∥Σ − Σ_0∥ − M ∥Σ_0∥ ε_n / 2 ≥ M ∥Σ∥ ε_n / 2 when (B, Σ) ∈ C_{n,1})
 ≤ sup_{(B,Σ) ∈ C_{n,1}} P( ∥ (n − |S|)^{-1} Y^T (I_n − H_S) Y − Σ ∥ ≥ M ∥Σ∥ ε_n / 2 )
 ≤ e^{-c M^2 n ε_n^2 / (16 K^2)}

The last inequality follows from Lemma 1, because for (B, Σ) ∈ C_{n,1},

 ∥(I_n − H_S) X B Σ^{-1/2}∥_F^2 ≤ ∥X_{S^c} B_{S^c} Σ^{-1/2}∥_F^2 ≤ ∥X_{S^c}∥_F^2 ∥B_{S^c} Σ^{-1/2}∥_F^2 ≤ (n p_n)(p_n a_n^2) ≲ n ε_n^2.

Now we look into ϕ^(2)_n. For (B, Σ) ∈ C_{n,2}, the defining inequalities of C_{n,2} together with the assumptions give the needed bound. So we have

because ∥Σ − Σ_0∥ ≤ M (∥Σ∥ ∨ ∥Σ_0∥) ε_n and ∥Σ_0∥ is bounded. In addition, we have

 ∥(X_S^T X_S)^{-1} X_S^T X_{S^c} B_{S^c} Σ_0^{-1/2}∥_F ≤ √(λ_max((X_S^T X_S)^{-1})) ∥X_{S^c} B_{S^c} Σ_0^{-1/2}∥_F ≤ √(1/(n λ_0)) ∥Σ Σ_0^{-1…