MAP MODEL SELECTION IN GAUSSIAN REGRESSION

# Map Model Selection in Gaussian Regression

Felix Abramovich
Department of Statistics Operations Research
Tel Aviv University
Tel Aviv 69978, Israel

Department of Mathematics
The Open University of Israel
Raanana 43107, Israel
###### Abstract

We consider a Bayesian approach to model selection in Gaussian linear regression, where the number of predictors might be much larger than the number of observations. From a frequentist view, the proposed procedure results in the penalized least squares estimation with a complexity penalty associated with a prior on the model size. We investigate the optimality properties of the resulting model selector. We establish the oracle inequality and specify conditions on the prior that imply its asymptotic minimaxity within a wide range of sparse and dense settings for “nearly-orthogonal” and “multicollinear” designs.

## 1 Introduction

Consider the standard Gaussian linear regression model

 y=X\boldmathβ+\boldmathϵ, (1)

where is a vector of the observed response variable , is the design matrix of the explanatory variables (predictors) , is a vector of unknown regression coefficients, and the noise variance is assumed to be known.

A variety of statistical applications of regression models involves a vast number of potential explanatory variables that might be even large relatively to the amount of available data. It raises a severe “curse of dimensionality” problem. Reducing dimensionality of the model becomes therefore crucial in the analysis of such large data sets. The goal of model (or variable) selection is to select the “best”, parsimonious subset of predictors. The corresponding coefficients are then usually estimated by least squares. The meaning of the “best” subset however depends on the particular aim at hand. One should distinguish, for example, between estimation of regression coefficients , estimation of the mean vector , model identification and predicting future observations. Different aims may lead to different optimal model selection procedures especially when the number of potential predictors might be much larger than the sample size . In this paper we focus on estimating the mean vector and the goodness of a model (subset of predictors) is measured by the quadratic risk , where is the least squares estimate of and is the projection of on the span of . The first (bias) term of the risk decomposition represents the approximation error of the projection, while the second (variance) term is the price for estimating the projection coefficients by and is proportional to the model size. The “best” model then is the one with the minimal quadratic risk. Note that the true underlying model in (1) is not necessarily the best in this sense since sometimes it is possible to reduce its risk by excluding predictors with small (but still nonzero!) coefficients.

Such a criterion for model selection is obviously impossible to implement since it depends on the unknown . Instead, the corresponding ideal minimal risk can be used as a benchmark for any available model selection procedure. The model selection criteria are typically based on the empirical quadratic risk , which is essentially the least squares. However, direct minimization of the empirical risk evidently leads to a trivial (unsatisfactory!) choice of the saturated model. A typical remedy is then to add a complexity penalty that increases with the model size, and to consider penalized least squares criterion of the form

 ||y−X^\boldmathβM||2+Pen(|M|)→minM (2)

The properties of the resulting estimator depends on the proper choice of the complexity penalty function in (2). There exists a plethora of works in literature on this problem. The standard, most commonly used choice is a linear type penalty of the form for some fixed . The most known examples motivated by different ideas include AIC for (Akaike, 1974), BIC for (Schwarz, 1978) and RIC for (Foster & George, 1994). A series of recent works suggested the so-called -type nonlinear penalties of the form , where and is some “negligible” term (see, e.g., Birgé & Massart, 2001, 2007; Johnstone, 2002; Abramovich et al., 2006; Bunea, Tsybakov & Wegkamp, 2007).

In this paper we present a Bayesian formalism to the model selection problem in Gaussian linear regression (1) that leads to a general penalized model selection rule (2). The proposed Bayesian approach can be used, in fact, as a natural tool for obtaining a variety of penalized least squares estimators with different complexity penalties that accommodate many of the known model selection procedures as particular cases corresponding to specific choices of the prior. Within Bayesian framework, the penalty term in (2) is interpreted as proportional to the logarithm of a prior distribution. Complexity penalties imply placing a prior on the model size (the number of nonzero entries of ). Minimization of (2) corresponds to the maximum a posteriori (MAP) rule yielding the resulting MAP model selector to be the posterior mode.

Although there exists a large amount of literature on Bayesian model selection (see George & McCulloch, 1993, 1997; Chipman, George & McCullogh, 2001; Liang et al., 2008 for surveys), it mainly focuses on “purely Bayesian” issues (e.g., prior specification, posterior calculations, etc.) and does not investigate the optimality of the resulting Bayesian procedures from a frequentist view. In this paper we study the optimality properties of the proposed MAP model selectors for estimating the mean vector in (1). First, under mild conditions on the prior we establish the oracle inequality and show that, up to a constant multiplier, they achieve the minimal possible risk among all estimators. We then investigate their asymptotic minimaxity. For “nearly-orthogonal” design they are proved to be simultaneously rate-optimal (in the minimax sense) over a wide range of sparse and dense settings and outperform various existing model selection procedures, e.g. AIC, BIC, RIC, Lasso (Tibshirani, 1996) and Dantzig selector (Candés & Tao, 2007). In a way, these results extend those of Abramovich, Grinshtein & Pensky (2007) and Abramovich et al. (2010) for the normal means problem corresponding to the particular case .

The analysis of “multicollinear” design, which is especially relevant for “ much larger than ” setup, is more delicate. We demonstrate that the lower bounds for the minimax rates for estimating the mean vector in this case are smaller than those for “nearly-orthogonal” design by the factor depending on the design properties. Such “blessing of multicollinearity” can be explained by a possibility of exploiting correlations between predictors to reduce the size of a model (hence, to decrease the variance) without paying much extra price in the bias term. We show that under some additional assumptions on the design and the coefficients vector in (1), the proposed Bayesian model selectors are still asymptotically rate-optimal.

The paper is organized as follows. The Bayesian model selection procedure that leads to a penalized least squares estimator (2) is introduced in Section 2. In Section 3 we derive an upper bound for the quadratic risk of the resulting MAP model selector, compare it with that of an oracle and find the conditions on the prior where, up to a constant multiplier, it achieves the minimal possible risk among all estimators. In Section 4 we obtain the upper and lower risk bounds of the MAP model selector in a sparse setup that allows us to investigate its asymptotic minimaxity for nearly-orthogonal and multicollinear designs in Section 5. The computational aspects are discussed in Section 6, and the main take-away messages of the paper are summarized in Section 7. All the proofs are given in the Appendix.

## 2 MAP model selection procedure

Consider the Gaussian linear regression model (1), where the number of possible predictors might be even larger then the number of observations . Let and assume that any columns of are linearly independent. For the “standard” linear regression setup, where all predictors are linearly independent and there are at least linearly independent design points, .

Any model is uniquely defined by the diagonal indicator matrix , where and, therefore, . The corresponding least square estimate , where “+” denotes the generalized inverse matrix.

Assume some prior on the model size , where ( corresponds to a null model with a single intercept) and for since otherwise, there necessarily exists another vector with at most nonzero entries also satisfying (1), that is, .

For any there are different models of a given size . Assume all of them to be equally likely, that is, conditionally on ,

 P(M∣∣|M|=k)=(pk)−1

One should be a little bit more careful for . Although there are different sets of predictors of size , all of them evidently result in the same estimator for the mean vector and, in this sense, are essentially undistinguishable and associated with a single (saturated) model. Hence, in this case, we set

 P(M∣∣|M|=r)=1 (3)

Finally, assume the normal prior on the unknown vector of coefficients of the model : . This is a well-known conventional -prior of Zellner (1986).

For the proposed hierarchical prior, straightforward calculus yields the posterior probability of a model of size  :

 P(M|\boldmathy)∝π(|M|)(p|M|)−1(1+γ)−|M|2exp{γγ+1y′XDM(DMX′XDM)+DMX′y2σ2} (4)

Finding the most likely model leads therefore to the following maximum a posteriori (MAP) model selection criterion:

 y′XDM(DMX′XDM)+DMX′y+2σ2(1+1/γ)ln{(p|M|)−1π(|M|)(1+γ)−|M|2}→maxM

or, equivalently,

 ||y−X^\boldmathβM||2+2σ2(1+1/γ)ln{(p|M|)π(|M|)−1(1+γ)|M|2}→minM, (5)

which is of the general type (2) with the complexity penalty

 Pen(k)=2σ2(1+1/γ)ln{(pk)π(k)−1(1+γ)k2},k=0,...,r−1 (6)

Similarly, for from (3) one has

 Pen(r)=2σ2(1+1/γ)ln{π(r)−1(1+γ)r2} (7)

A specific form of the penalty (6)-(7) depends on the choice of a prior . In particular, the (truncated if ) binomial prior corresponds to the prior assumption that the indicators are independent. The binomial prior yields the linear penalty , where for sufficiently large variance ratio . The AIC criterion corresponds then to , while leads to the RIC criterion. These relations indicate that RIC should be appropriate for sparse cases, where the size of the true (unknown) model is believed to be much less than the number of possible predictors, while AIC is suitable for dense cases, where they are of the same order. In fact, any binomial prior or, equivalently, any linear penalty cannot “kill two birds with one stone”. On the other hand, the (truncated) geometric prior for some , implies which is of the -type introduced above. For large it behaves similar to RIC for and to AIC for and is, therefore, adaptive to both sparse and dense cases. We will discuss these issues more rigorously in Section 5 below.

## 3 Oracle inequality

In this section we derive an upper bound for the quadratic risk of the proposed MAP model selector and compare it with the ideal minimal quadratic risk often called in literature as an oracle risk.

###### Assumption (P).

Assume that

 π(k)≤(pk)e−c(γ)k,k=0,...,r−1,andπ(r)≤e−c(γ)r

where .

Assumption (P) is not restrictive. Indeed, the obvious inequality implies that for it automatically holds forany prior for all . Assumption (P) is used to establish an upper bound for the quadratic risk of the MAP model selector.

###### Theorem 1.

Let the model be the solution of (2) with the complexity penalty given in (6)-(7) and be the corresponding least squares estimate. Then, under Assumption (P)

 E||X^\boldmathβ^M−X\boldmathβ||2≤c0(γ)infM{||X\boldmathβM−X\boldmathβ||2+Pen(|M|)}+c1(γ)σ2 (8)

for some and depending only on .

To assess the quality of the upper bound in (8), we compare it with the oracle risk . Note that the oracle risk is exactly zero when and, evidently, no estimator can achieve it in this case. Hence, an additional, typically negligible term , which is, essentially, an error of estimating a single extra parameter, is usually added to the oracle risk for a proper comparison. It is known that no estimator can attain a risk smaller than within factor from that of an oracle (e.g., Foster & George, 1994; Donoho & Johnstone, 1995; Candès, 2006). The following theorem shows that under certain additional conditions on the prior , the resulting MAP model selector achieves this minimal possible risk among all estimators up to a constant multiplier depending on :

###### Theorem 2 (oracle inequality).

Let satisfy Assumption (P) and, in addition, for some constant . Then, the resulting MAP model selector satisfies

 E||X^\boldmathβ^M−X\boldmathβ||2≤c2(γ)lnp(infME||X^\boldmathβM−X% \boldmathβ||2+σ2)

for some .

In particular, it can be easily shown that Theorem 2 holds for the (truncated) binomial with (RIC criterion) and geometric priors (see Section 2). More generally, all priors such that corresponding to the -type penalties satisfy the conditions of Theorem 2.

## 4 Risk bounds for sparse settings

In the previous section we considered the global behavior of the MAP estimator without any restrictions on the model size. However, in the analysis of large data sets, it is typically reasonable to assume that the true model in (1) is sparse in the sense that only part of coefficients in are different from zero. We now show that under such extra sparsity assumption, more can be said on the optimality of the MAP model selection.

For a given , define the sets of models that have at most predictors, that is, . Obviously, if a true model in (1) belongs to , the quasi-norm of the corresponding coefficients vector , where is the number of its nonzero entries. In this section we find the upper and lower bounds for the maximal risk of the proposed MAP model selector over .

###### Theorem 3.

Let , satisfy Assumption (P) and, in addition, if or if for some constant . Then, there exists a constant depending only on such that

 sup\boldmathβ:||\boldmathβ||0≤p0E||X^\boldmathβ^M−X\boldmathβ||2≤C1(γ)σ2min(p0(ln(p/p0)+1),r) (9)

The general upper bound (9) for the maximal risk of the MAP selector over in Theorem 3 holds for any design matrix . To assess its accuracy we establish the lower bound for the minimax risk of estimating the mean vector in (1).

For any given , let and be the -sparse minimal and maximal eigenvalues of the design defined as

 ϕmin[k]=min\boldmathβ:1≤||\boldmathβ||0≤k||X\boldmathβ||2||\boldmathβ||2,
 ϕmax[k]=max\boldmathβ:1≤||\boldmathβ||0≤k||X\boldmathβ||2||\boldmathβ||2

(see Meinshausen & Yu, 2009; Bickel, Ritov & Tsybakov, 2009). In fact, and are respectively the minimal and maximal eigenvalues of all submatrices of the matrix generated by any columns of . Let and set for all . By the definition, is a non-increasing function of . Obviously, and for the orthogonal design the equality holds for all .

###### Theorem 4.

Consider the model (1) and let . There exists a universal constant such that

 (10)

where the infimum is taken over all estimates of the mean vector .

Theorem 4 shows that the minimax lower bound (10) depends on a specific design matrix only through the sparse eigenvalues ratios. A computationally simpler but less accurate minimax lower bound can be obtained by replacing and in (10) by , that for the case is just the ratio of the minimal and maximal eigenvalues of .

For the orthogonal design, where , and analogous results were obtained in Birgé & Massart (2001). For a general design and similar minimax lower bounds were independently obtained in Raskutti, Wainwright & Yu (2009) for a design matrix of a full rank and in Rigollet & Tsybakov (2010) for a general case within a related aggregation context.

The established upper and lower bounds (9), (10) for the risk of the MAP model selector allow us in the following section to investigate its asymptotic minimaxity as both and increase.

### 5.1 Nearly-orthogonal design

In this section we consider the asymptotic properties of the MAP model selector as the sample size increases. We allow to increase with as well and look for a projection of the unknown mean vector on an expanding span of predictors. In particular, the most challenging cases intensively studied nowadays in literature are those, where or even . In such asymptotic settings one should essentially consider a sequence of design matrices where . For simplicity of exposition, in what follows we omit the index and denote by emphasizing the dependence on the number of predictors and let tend to infinity. Similarly, we consider now sequences of coefficients vectors and priors . In these notations, the original model (1) is transformed into a sequence of models

 y=Xp\boldmathβp+\boldmathϵ, (11)

where and any columns of are linearly independent (hence, ), and the noise variance does not depend on and . One can also view a sequence of models (11) in a triangular array setup (Greenshtein & Ritov, 2004).

###### Definition 1.

Consider the sequence of design matrices . The design is called nearly-orthogonal if the corresponding sequence of sparse eigenvalues ratios is bounded away from zero by some constant . Otherwise, the design is called multicollinear.

Nearly-orthogonality condition essentially means that there is no multicollinearity in the design in the sense that there are no “too strong” linear relationships within any set of columns of . Intuitively, it is clear that in this case cannot be “too large” relative to and, therefore, to . Indeed, apply the upper and lower bounds (9), (10) for to get that implies the following remark:

###### Remark 1.

For nearly-orthogonal design, necessarily and, therefore, .

The following corollary is an immediate consequence of Theorems 3 and 4:

###### Corollary 1.

Let the design be nearly-orthogonal.

1. As increases, the asymptotic minimax risk of estimating the mean vector over is of the order , that is, there exist two constants such that for all sufficiently large ,

 C1σ2min(p0(ln(p/p0)+1),r) ≤ inf^ysup\boldmathβp:||\boldmath% βp||0≤p0E||^y−Xp\boldmathβp||2 ≤ C2σ2min(p0(ln(p/p0)+1),r)

for all .

2. Assume Assumption (P) and, in addition, that and for some constants . Then, the corresponding MAP model selector attains the minimax convergence rates simultaneously over all .

One can easily verify that the conditions on the prior of Corollary 1 are satisfied, for example, for the truncated geometric prior (see Section 2) for all . The resulting MAP model selector attains, therefore, the minimax rates simultaneously for all . As we have mentioned, the corresponding penalty in (6) is of the -type. On the other hand, no truncated binomial prior can satisfy these conditions on the entire range . It is easy to verify that they hold for small if but for large if . In fact, these arguments go along the lines with the similar results of Foster & George (1994) and Birgé & Massart (2001, 2007). Recall that binomial prior corresponds to linear penalties of the type in (2) (see Section 2). Foster & George (1994) and Birgé & Massart (2001, Section 5.2) showed that the best possible risk of such estimators over is only of order achieved for corresponding to the RIC criterion. It is of the same order as the optimal risk for (sparse case) but larger for dense case (). On the other hand, the risk of the AIC estimator () is of the order , which is optimal for dense but much larger for sparse case.

Furthermore, under somewhat similar nearly-orthogonality conditions, Bickel, Ritov & Tsybakov (2009) showed that the well-known Lasso (Tibshirani, 1996) and Dantzig (Candés & Tao, 2007) estimators achieve only the same sub-optimal rate as RIC. These results are, in fact, not so surprising since both Lasso and Dantzig estimators are essentially based on convex relaxations of the -norm of regression coefficients in the linear complexity penalty in order to replace the original combinatorial problem (2) by a convex program. Thus, Lasso approximates the -norm by the the corresponding -norm . In particular, for the orthogonal design, linear complexity penalties and Lasso yield respectively hard and soft thresholding of components of with a fixed threshold. RIC estimator and Lasso with the optimally chosen tuning parameter (e.g., Bickel, Ritov & Tsybakov, 2009) result in this case in the well-known hard and soft universal thresholding of Donoho & Johnstone (1994) with a fixed threshold which is rate-optimal for various sparse but not dense settings. On the other hand, the nonlinear MAP penalty corresponds to hard thresholding with a data-driven threshold that under conditions on in Corollary 1 is simultaneously minimax for both sparse and dense cases (Abramovich, Grinshstein & Pensky, 2007; Abramovich et al., 2010).

Finally, note that for the nearly-orthogonal design, , where “” means that their ratio is bounded from below and above. Therefore, all the results of Corollary 1 for estimating the mean vector in (11) can be straightforwardly applied for estimating the regression coefficients . This equivalence, however, does not hold for the multicollinear design considered below.

### 5.2 Multicollinear design

Nearly-orthogonality assumption may be reasonable in the “classical” setup, where is not too large relatively to but might be questionable for the analysis of high-dimensional data, where , due to the multicollinearity phenomenon (see also Remark 1). When this assumption does not hold, the sparse eigenvalues ratios in (10) may tend to zero as increases and, thus, decrease the minimax lower bound rate relatively to the nearly-orthogonal design. In this case there is a gap between the rates in the lower and upper bounds (10) and (9). Intuitively, one can think of exploiting correlations between predictors to reduce the size of a model (hence, to decrease the variance) without paying much extra price in the bias term, and, therefore, to reduce the risk. We show that under certain additional assumptions on the design and the coefficients vector in (11), the upper risk bound (9) can be indeed reduced to the minimax lower bound rate (10).

For simplicity of exposition we consider the sparse case although the corresponding conditions for the dense case can be obtained in a similar way with necessary changes.

We introduce now several definitions that will be used in the sequel (including the proofs in the Appendix). For a given index set and define a matrix which columns are the elements of the standard basis in . Thus, for any matrix with columns, selects the columns of indexed by . Similarly, for any symmetric matrix , generates a (symmetric) submatrix of of the corresponding columns and rows.

For all , define . Let be an index set of predictors included in a model of size . For any submodel of size let be the minimal eigenvalue of the matrix . In fact, is the covariance matrix of the components of the least squares estimate vector corresponding to a subset of predictors in .

Finally, define

 ~ϕp[k]=minM:|M|=kmaxM′⊂M:|M′|=k′~ϕM,M′

As we show later (see the proof of Theorem 5 in the Appendix), measures an error of approximating mean vectors , where , by their projections on lower dimensional subspans of predictors. The stronger is multicollinearity, the better is the approximation and the larger is .

###### Theorem 5.

Let as (multicollinear design). Assume the following additional assumptions on the design matrix and the (unknown) vector of coefficients in (11):

(D)

for all there exist such that

(B)

, where

for some positive constants and .

Then, under the above additional restrictions, if the prior satisfies Assumption (P) and for all , for some positive , where , the corresponding MAP model selector is asymptotically simultaneously minimax (up to a constant multiplier) over all .

Note that by simple algebra one can verify that and, therefore, the constant in Assumption (D.3) is not larger than one.

We have argued that multicollinearity typically arises when . One can easily verify that for , Assumption (D.2) always follows from Assumption (D.1) and, therefore, can be omitted in this case.

As we show in the proof, Assumptions (D.1, D.2) and Assumption (B) allow one to reduce the upper bound (9) for the risk of the MAP model selector by the factor , while Assumption (D.3) is required to guarantee that the additional constraint on in Assumption (B) does not affect the lower bound (10).

To obtain asymptotic minimaxity of the MAP selector within the entire range similar to Corollary 1 for the nearly-orthogonal case, Assumptions (D) on the design matrix are required to be satisfied for all that might be quite restrictive. However, the results of Theorem 5 are more general and show the tradeoff between relaxation of Assumptions (D) to a smaller range of and the corresponding constriction of the adaptivity range for .

## 6 Computational aspects

In practice, minimizing (2) (and (5) in particular) requires generally an NP-hard combinatorial search over all possible models. During the last decade there have been substantial efforts to develop various approximated algorithms for solving (2) that are computationally feasible for high-dimensional data (see, e.g. Tropp & Wright, 2010 for a survey and references therein). The common remedies involve either greedy algorithms (e.g., forward selection, matching pursuit) approximating the global solution by a stepwise sequence of local ones, or convex relaxation methods replacing the original combinatorial problem by a related convex program (e.g., Lasso and Dantzig selector for linear penalties). The proposed Bayesian formalism allows one instead to use a stochastic search variable selection (SSVS) techniques originated in George & McCulloch (1993, 1997) for solving (5) by generating a sequence of models from the posterior distribution in (4). The key point is that the relevant models with the highest posterior probabilities will appear most frequently and can be identified even for a generated sample of a relatively small size avoiding computations of the entire posterior distribution.

The SSVS algorithm for the problem at hand can be basically described as follows. As we have mentioned in Section 2, every model is uniquely defined by the corresponding indicator vector and the joint posterior distribution of is given by (4) (up to a normalizing constant). SSVS uses the Gibbs sampler to generate a sequence of indicator vectors componentwise by sampling consecutively from the conditional distributions , where . The components can be trivially obtained as simulations of Bernoulli draws, where from (4) the corresponding posterior odds ratio

and is the increment in the residual sum of squares (RSS) after dropping the -th predictor from the model . The resulting Gibbs sampler is computationally efficient and, as increases, the empirical distribution of the generated sample converges to the actual posterior distribution of . After the sequence has reached approximate stationarity, one can identify the most frequently appeared vector(s) as potential candidate(s) to solve (5).

## 7 Concluding remarks

In this paper we considered a Bayesian approach to model selection in Gaussian linear regression. From a frequentist view, the resulting MAP model selector is a penalized least squares estimator with a complexity penalty associated with a prior on the model size. Although the proposed estimator was originated within Bayesian framework, the latter was used as a natural tool to obtain a wide class of penalized least squares estimators with various complexity penalties. Thus, we believe that the main take-away messages of the paper summarized below are of a more general interest.

The first main take-away message is that neither linear complexity penalties (e.g., AIC, BIC and RIC) corresponding to binomial priors , nor closely related Lasso and Dantzig estimators can be simultaneously minimax for both sparse and dense cases. We specify the class of priors and associated nonlinear penalties that do yield such a wide adaptivity range. In particular, it includes -type penalties.

Another important take-away message is about the effect of multicollinearity of design. Unlike model identification or coefficients estimation, where multicollinearity is a “curse”, it may become a “blessing” for estimating the mean vector allowing one to exploit correlations between predictors to reduce the size of a model (hence, to decrease the variance) without paying much extra price in the bias term. Interestingly, a similar phenomenon occurs in a testing setup (e.g., Hall & Jin, 2010).

## Acknowledgments

The authors would like to thank Alexander Samarov and Yaácov Ritov for valuable remarks. The authors are especially grateful to an anonymous referee for an excellent constructive review of the first version of the paper.

## References

• [1] Abramovich, F., Benjamini, Y., Donoho, D.L. and Johnstone, I.M. (2006). Adapting to unknown sparsity by controlling the false discovery rate. Ann. Statist. 34, 584–653.
• [2] Abramovich, F., Grinshtein, V. and Pensky, M. (2007). On optimality of Bayesian testimation in the normal means problem. Ann. Statist. 35, 2261–2286.
• [3] Abramovich, F., Grinshtein, V., Petsa, A. and Sapatinas, T. (2010). On Bayesian testimation and its application to wavelet thresholding. Biometrika 97, 181–198.
• [4] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. in Second International Symposium on Information Theory. (eds. B.N. Petrov and F. Czáki). Akademiai Kiadó, Budapest, 267-281.
• [5] Bickel, P., Ritov, Y. and Tsybakov, A. (2009). Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. 35, 1705–1732.
• [6] Birgé, L. and Massart, P. (2001). Gaussian model selection. J. Eur. Math. Soc. 3, 203–268.
• [7] Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probab. Theory Relat. Fields 138, 33–73.
• [8] Bunea, F., Tsybakov, A. and Wegkamp, M.H. (2007). Aggregation for Gaussian regression. Ann. Statist. 35, 1674–1697.
• [9] Candés, E.J. (2006). Modern statistical estimation via oracle inequalities. Acta Numerica, 1–69.
• [10] Candés, E.J. and Tao, T. (2007). The Dantzig selector: statistical estimation when is much larger than . Ann. Statist. 35, 2313–2351.
• [11] Chipman, H., George, E.I. and McCullogh, R.E. (2001). The Practical Implementation of Bayesian Model Selection. IMS Lecture Notes – Monograph Series 38.
• [12] Donoho, D.L. and Johnstone, I.M. (1994). Ideal spatial adaptation via wavelet shrinkage. Biometrika 81, 425–455.
• [13] Donoho, D.L. and Johnstone, I.M. (1995). Empirical atomic decomposition, unpublished manuscript.
• [14] Foster, D.P. and George, E.I. (1994). The risk inflation criterion for multiple regression. Ann. Statist. 22, 1947–1975.
• [15] George, E.I. and McCullogh, R.E. (1993). Variable selection via Gibbs sampling. J. Am. Statist. Assoc. 88, 881–889.
• [16] George, E.I. and McCullogh, R.E. (1997). Approaches to Bayesian variable selection. Statistica Sinica 7, 339–373.
• [17] Greenshstein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10, 971–988.
• [18] Hall, P. and Jin, J. (2010). Innovated higher criticism for detecting sparse signals in correlated noise. Ann. Statist. 38, 1681–1732.
• [19] Johnstone, I.M. (2002). Function Estimation and Gaussian Sequence Models, unpublished manuscript.
• [20] Liang, F., Paulo, R., Molina, G., Clyde, M. and Berger, J.O. (2008). Mixtures of priors for Bayesian variable selection. J. Am. Statist. Assoc. 103, 410–423.
• [21] Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist. 37, 246–270.
• [22] Raskutti, G., Wainwright, M.J. and Yu, B. (2009). Minimax rates of estimations for high-dimensional regression over balls. Technical Report, UC Berkeley
• [23] Rigollet, P. and Tsybakov, A. (2010). Exponential screening and optimal rates of sparse estimation. http://arxiv.org/pdf/1003.2654.
• [24] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461–464.
• [25] Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. J. Roy. Statist. Soc. Ser. B 58, 267–288.
• [26] Tropp, J.A. and Wright, S.J. (2010). Computational methods for sparse solution of linear inverse problems. Proc. IEEE, special issue “Applications of sparse representation and comprehensive sensing”.
• [27] Tsybakov, A. (2009). Introduction to Nonparametric Estimation. Springer.
• [28] Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with -prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finietti (eds. Goel, P.K. and Zellner, A.), North-Holland, Amsterdam, 233–243.

## Appendix A Appendix

Throughout the proofs we use to denote a generic positive constant, not necessarily the same each time it is used, even within a single equation.

### a.1 Proof of Theorem 1

Define

 Lk=(1/k)ln((pk)π−1(k))≥c(γ),k=1,...,r−1 (12)

and

 Lr=(1/r)lnπ−1(r)≥c(γ) (13)

In terms of the complexity penalty (6)-(7) is . Following the arguments of the proof of Theorem 1 of Abramovich et al. (2007), under the Assumption (P) one has

 r−1∑k=1(pk)e−kLk+e−rLr=r∑k=1π(k)=1−π(0)<1

and

 (1+1/γ)(2Lk+ln(1+γ))≥C(1+√2Lk)2,k=1,...,r

The proof of Theorem 1 then follows directly from Theorem 2 of Birgé & Massart (2001).

### a.2 Proof of Theorem 2

Let , where were defined in (12)-(13) and . Simple calculus shows that the conditions on in Theorem 2 imply .

Consider first the case . From Theorem 1 we have

 E||X^\boldmathβ^M−X\boldmathβ||2 ≤ c0(γ)infM{||X\boldmathβM−X\boldmathβ||2+σ2(1+1/γ)|M|(2L|M|+ln(1+γ))}+c1(γ)σ2 ≤ c0(γ)(1+1/γ)(2L∗+ln(1+γ))infM{||X\boldmathβM−X\boldmathβ||2+|M|σ2}+c1(γ)σ2 ≤ c2(γ)(2L∗+ln(1+γ)){infME||X^%\boldmath$β$M−X\boldmathβ||2+σ2}

For the degenerative case (), Theorem 1 implies

 E||X^\boldmathβ^M−X\boldmathβ||2 ≤ c0(γ)(||X\boldmathβ||2+σ2(1+1/γ)L0)+c1(γ)σ2 (15) ≤ c2(γ)L∗(||X\boldmathβ||2+σ2)

Combining (LABEL:eq:Lk1) and (15) completes the proof.

### a.3 Proof of Theorem 3

For all , Theorem 1 and (7) under the assumption imply

 sup\boldmathβ:||\boldmathβ||0≤p0E||X^\boldmathβ^M−X\boldmathβ||2 ≤ sup\boldmathβ:||\boldmathβ||0≤rE||X^\boldmathβ^M−X\boldmathβ||2≤Pen(r)+c1(γ)σ2 (16) ≤ C1(γ)σ2r

On the other hand, applying the general upper bound for the risk of MAP model selector established in Theorem 1 for models of size we have

 sup\boldmathβ:||\boldmathβ||0≤p0||X^\boldmathβ^M−X\boldmathβ||2≤c0(γ)2σ2(1+1/γ)(ln{(pp0)π−1(p0)}+p02ln(1+γ))+c1(γ)σ2 (17)

Abramovich et al. (2010, Lemma 1) showed that . Hence, under the conditions on in Theorem 3, for , (17) yields

 sup\boldmathβ:||\boldmathβ||0≤p0E||X^\boldmathβ^M−X\boldmathβ||2 ≤ c0(γ)2σ2(1+1/γ){(c+1)p0ln(pe/p0)+p02ln(1+γ)}+c1(γ)σ2 ≤ C1(γ)σ2p0(ln(p/p0)+1)

Finally, note that for , as we have already established in (16),

 sup\boldmathβ:||\boldmathβ||0≤rE||X^\boldmathβ^M−X\boldmathβ||2≤C1(γ)σ2r≤C1(γ)σ2r(ln(p/r)+1)

### a.4 Proof of Theorem 4

The core of the proof is to find a subset of vectors , where , and the corresponding subset of mean vectors such that for any , and the Kullback-Leibler divergence . Lemma A.1 of Bunea, Tsybakov & Wegkamp (2007) will imply then that is the minimax lower bound over .

To construct the desired subsets and we consider three possible cases.

Case 1.
Define the subset of all vectors that have entries equal to defined later, while the remaining entries are zeros: . For from Lemma 8.3 of Rigollet & Tsybakov (2010), there exists a subset such that for some constant , , and for any pair , the Hamming distance