
# A framework for posterior consistency in model selection

## Abstract.

We develop a theoretical framework for the frequentist assessment of Bayesian model selection, specifically its ability to select the (Kullback-Leibler) optimal model and to portray the corresponding uncertainty. The contribution is not proving consistency for a specific prior, but giving a general strategy for such proofs. Its basis applies to any model, prior, sample size, parameter dimensionality and (although only briefly exploited here) under model misspecification. As an immediate consequence the framework also characterizes a strong form of convergence for $\ell_0$ penalties and associated pseudo-posterior probabilities of potential interest for uncertainty quantification. The main advantage of the framework is that, instead of studying complex high-dimensional stochastic sums, it suffices to bound certain Bayes factor tails and use standard tools to determine the convergence of deterministic series. As a second contribution we deploy the framework to canonical linear regression. These findings give a high-level description of when one can achieve consistency and at what rate for a wide class of priors as a function of the data-generating truth, sample size and dimensionality. They also indicate when it is possible to use less sparse priors to improve inherent sparsity vs. power trade-offs that are not adequately captured by studying asymptotically optimal rates. Our empirical illustrations align with these findings, underlining the importance of considering the problem at hand's characteristics to judge the quality of model selection procedures, rather than relying purely on asymptotics.

Keywords: model selection, Bayes factors, high-dimensional inference, consistency, uncertainty quantification, penalty

## 1. Introduction

Two fundamental tasks in Statistics are selecting a probability model amongst a set of models under consideration and quantifying the associated uncertainty. In principle the Bayesian framework provides a generic strategy for both: given the models and corresponding priors one obtains posterior model probabilities that guide model choice and portray uncertainty. Two key issues however are understanding when one can hope to recover the right model (in a sense defined below) and whether posterior probabilities characterize uncertainty adequately. Although there are general finite-dimensional results (e.g. see Dawid (1992), Walker (2004)), Bayesian model selection in high-dimensional problems where the number of parameters or models grows with the sample size is not sufficiently understood. Current results were mostly obtained on a case-by-case basis and are technically involved, making it hard to understand basic principles such as the effect of the sample size $n$, parameter dimensionality, data-generating truth or the prior on balancing sparsity versus power. Regarding uncertainty quantification, as a minimal requirement one hopes that the posterior probability assigned to the optimal model converges to 1 as $n \to \infty$. We refer to this property as posterior consistency (PC, a precise definition is given below). Unfortunately PC is not guaranteed in general, e.g. in linear regression where the number of variables $p$ grows with $n$ the posterior probability of the data-generating model can converge to 0 even though pairwise Bayes factors may remain consistent (Johnson and Rossell, 2012; Moreno et al., 2015).

Our main contribution is developing a framework for the theoretical study of high-dimensional Bayesian model selection. The basis applies to any set of (potentially misspecified) models, priors, data-generating truth, sample size and dimensionality. The main idea is to avoid treating high-dimensional stochastic sums and to instead handle deterministic sums and bound tail probabilities for pairwise Bayes factors, which is typically much simpler. The framework provides convergence rates for posterior model probabilities (hence for uncertainty quantification) and, as an immediate consequence, for (frequentist) correct model selection probabilities. We emphasize upfront that its main value lies in simplicity, rather than mathematical sophistication. Our second main contribution is deploying the framework to variable selection in linear regression. Studying this canonical problem helps summarize and extend current results on the effects of various prior formulations, specifically on rates for consistency, sparsity and power to detect signal, and also illustrates how to extend results to simple forms of model misspecification. We also show how, as an immediate corollary, one can obtain a strong form of model selection consistency and pseudo-Bayesian uncertainty quantification for $\ell_0$ penalties.

We introduce notation for our upcoming discussion. Let $M_1,\ldots,M_K$ be a set of candidate probability models for an observed outcome $y \in \mathbb{R}^n$, with corresponding densities $p(y \mid \theta_k, \phi, M_k)$ for $k=1,\ldots,K$, where $\theta_k \in \Theta_k$ is a parameter of interest and $\phi \in \Phi$ (potentially) a nuisance parameter with prior density $p(\theta_k, \phi \mid M_k)$. All densities are in the Radon-Nikodym sense with respect to a sigma-finite measure, in particular allowing discrete and continuous outcomes/parameters. Although $\Theta_k$ could be infinite-dimensional, for simplicity we focus on finite $p_k = \mathrm{dim}(\Theta_k)$. Without loss of generality let $\Theta_k \subseteq \Theta$ for all $k$, i.e. models are nested within a larger model with parameter space $\Theta$, and let $\bar{p} = \max_k p_k$ be the maximum model size. For instance, in regression one might set $\bar{p} < n$, as models with $p_k \geq n$ result in data interpolation. Although not denoted explicitly, in high-dimensional problems $K$ and $\bar{p}$ may grow with $n$, and this is precisely our focus of interest. The posterior probability assigned to an arbitrary $M_k$ is

 (1) $p(M_k \mid y) = \left(1 + \sum_{l \neq k} \frac{p(y \mid M_l)}{p(y \mid M_k)} \frac{p(M_l)}{p(M_k)}\right)^{-1} = \left(1 + \sum_{l \neq k} B_{lk} \frac{p(M_l)}{p(M_k)}\right)^{-1}$

where $p(y \mid M_k) = \int p(y \mid \theta_k, \phi, M_k)\, p(\theta_k, \phi \mid M_k)\, d\theta_k\, d\phi$ is the integrated likelihood under $M_k$, $p(M_k)$ the model prior probability and $B_{lk} = p(y \mid M_l)/p(y \mid M_k)$ the Bayes factor between $M_l$ and $M_k$.
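To make (1) concrete, the following is a minimal Python sketch (function name ours) that evaluates posterior model probabilities from log integrated likelihoods and log prior probabilities, using the log-sum-exp trick for numerical stability:

```python
import numpy as np

def posterior_model_probs(log_marginal, log_prior):
    """Posterior model probabilities p(M_k | y) computed from log
    integrated likelihoods log p(y | M_k) and log prior probabilities
    log p(M_k), stabilized by subtracting the maximum before exponentiating."""
    log_joint = np.asarray(log_marginal, dtype=float) + np.asarray(log_prior, dtype=float)
    log_joint -= log_joint.max()          # log-sum-exp stabilization
    w = np.exp(log_joint)
    return w / w.sum()

# Three hypothetical models under a uniform model prior.
probs = posterior_model_probs([-10.2, -12.5, -30.1], np.log([1/3, 1/3, 1/3]))
```

By construction the probabilities sum to 1 and the model with the largest integrated likelihood receives the most mass.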

Our goal is to study PC, i.e. as $n$ grows when does $p(M_t \mid y)$ converge to 1 and at what speed, assuming that $y$ is sampled from some data-generating density $f^*$ and $M_t$ is defined as the smallest model minimizing Kullback-Leibler (KL) divergence to $f^*$.

###### Definition 1.

Let $(\theta^*, \phi^*) \in \arg\min_{(\theta,\phi) \in \Theta \times \Phi} \mathrm{KL}(f^*, p(y \mid \theta, \phi))$. Define $M_t = \arg\min_{k \in M^*} p_k$, where

 $M^* = \{k : \exists (\theta_k^*, \phi_k^*) \in \Theta_k \times \Phi : \mathrm{KL}(f^*, p(y \mid \theta_k^*, \phi_k^*)) = \mathrm{KL}(f^*, p(y \mid \theta^*, \phi^*))\}$

is the set of all models minimizing KL-divergence to $f^*$.

In principle $f^*$ can be outside the assumed class and there may be multiple KL-optimal models, hence the need to define $M_t$ as the smallest. For instance, if data are truly generated by one of the posed models, i.e. $f^*(y) = p(y \mid \theta_t^*, \phi_t^*, M_t)$ for some $(\theta_t^*, \phi_t^*)$, then the KL-divergence is also minimized (at zero) by any model containing $M_t$. For simplicity we assume $M_t$ to be unique, but our results remain valid more generally by defining $M_t$ to be the union of all such KL-optimal models. To clarify ideas consider variable selection. There each model $M_k$ chooses a subset of $p_k$ amongst $p$ potential variables. In Gaussian linear regression $\phi$ is the residual variance and $M_t$ is the smallest model minimizing mean squared error under $f^*$. The number of models is $K = 2^p$, and in particular $K$ grows with $n$ whenever $p$ does.

We remark that maximizing (1) is equivalent to finding $\hat{k} = \arg\max_k h(y, M_k)$ where $h(y, M_k) = \log p(y \mid M_k) + \log p(M_k)$. Although our focus is on Bayesian model selection, one can also consider an $\ell_0$-penalized likelihood $h(y, M_k) = \log p(y \mid \hat{\theta}_k, \hat{\phi}_k, M_k) - \eta_{n, p_k}$, where $(\hat{\theta}_k, \hat{\phi}_k)$ is the maximum likelihood estimate under $M_k$ and $\eta_{n, p_k}$ is a penalty that can depend on $(n, p_k)$, say $\eta_{n, p_k} = (p_k/2)\log n$ as in the Bayesian Information Criterion (BIC). Proving PC for any such $h$ guarantees a strong form of consistency. To see why, model selection consistency requires $h(y, M_t) > \max_{k \neq t} h(y, M_k)$, that is with probability tending to 1. PC instead requires that $e^{h(y, M_t)}$ is infinitely larger than the sum $\sum_{k \neq t} e^{h(y, M_k)}$.
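The pseudo-posterior probabilities induced by such a penalized criterion can be sketched as follows (a minimal illustration using the BIC penalty; the function name and inputs are ours):

```python
import numpy as np

def bic_pseudo_posterior(loglik, dims, n):
    """Pseudo-posterior model probabilities from an L0-type criterion
    h(y, M_k) = loglik_k - eta(n, p_k), here with the BIC penalty
    eta(n, p_k) = (p_k / 2) * log(n). Any other penalty could be
    substituted; probabilities are exp(h_k) normalized over models."""
    h = np.asarray(loglik, dtype=float) - 0.5 * np.asarray(dims) * np.log(n)
    h -= h.max()                       # stabilize before exponentiating
    w = np.exp(h)
    return w / w.sum()

# Two models with equal fits: the penalty alone favors the smaller one.
probs = bic_pseudo_posterior([-50.0, -50.0], dims=[1, 3], n=100)
```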

We lay out further notation. Let $a_n$ and $b_n$ be two deterministic sequences; then $a_n \ll b_n$ and $a_n \preceq b_n$ denote $\lim_{n \to \infty} a_n/b_n = 0$ and $a_n \leq c\, b_n$ for some constant $c$ (respectively). Let $S_l$ be all size-$l$ super-models of $M_t$ (or spurious models), $S_l^c$ the corresponding size-$l$ non-spurious models, and $S = \cup_l S_l$, $S^c = \cup_l S_l^c$. Denote by $|S_l|$ (respectively $|S_l^c|$) the number of models in $S_l$ ($S_l^c$). We state a minimal condition that can be easily relaxed but simplifies the exposition.

1. $p(M_k)$ depends on $M_k$ only through $p_k$. For arbitrary $(p_l, p_k)$ denote $r_{p_l, p_k} = p(M_l)/p(M_k)$.

We review selected literature on PC for high-dimensional Bayesian model selection. Johnson and Rossell (2012) proved that PC is attained in linear regression with $p$ growing with $n$ by using certain non-local priors $p(\theta_k \mid M_k)$, even under a (non-sparse) uniform $p(M_k)$. Narisetty and He (2014) showed that if $p \gg n$ then letting $p(\theta_k \mid M_k)$ become diffuse at an appropriate rate as $n$ grows also achieves PC. The strategy of using priors that grow increasingly diffuse with $n$ was advocated as early as Foster and George (1994). Shin et al. (2017) extended Johnson and Rossell (2012) to $p \gg n$ under certain increasingly diffuse non-local priors. These frameworks focus on specific models, priors or data-generating truths, and are asymptotic in nature (i.e. for large $n$). Castillo et al. (2015) showed that, by using so-called complexity priors $p(M_k)$, posterior probabilities discard regression models that are supersets of $M_t$ or overshoot its dimension by more than a certain factor. Gao et al. (2015) extended these results to general structured linear models under potentially misspecified sub-Gaussian errors, Chae et al. (2016) to symmetric non-parametric errors, and Rockova and van der Pas (2017) to regression trees. These valuable contributions focus on showing that overly complex models are discarded a posteriori, a necessary but not sufficient condition for PC, whereas we study the ability to recover $M_t$ exactly, as is the goal in model selection. To summarize, there are three main strategies to attain PC in high dimensions (these are not mutually exclusive and can be used in combination): setting $p(\theta_k \mid M_k)$ to non-local priors, letting $p(\theta_k \mid M_k)$ be increasingly diffuse, or penalizing model complexity via $p(M_k)$. As made precise here, the latter two strategies essentially favor small models a priori regardless of the data, and hence they may incur a loss of power when the truth is non-sparse or the signal-to-noise ratio is small.

Our focus is on traditional Bayesian model selection where one entertains a collection of models. An alternative is to use shrinkage priors, i.e. consider a single model and set a continuous prior on $\theta$ concentrating most mass on certain subsets of $\Theta$. While interesting, the technical developments for shrinkage priors are fundamentally different to ours, as by construction any zero Lebesgue measure subset of $\Theta$ has zero posterior probability; hence in our view they are more suitable for parameter estimation than for model selection. For results on shrinkage priors see Bhattacharya et al. (2012), Ghosh and Chakrabarti (2014), van der Pas et al. (2017), Ročková (2018) and references therein.

The remainder of the paper is organized as follows. Section 2 presents our framework to study PC. First we show that it suffices to integrate tail probabilities for pairwise Bayes factors relative to the KL-optimal $M_t$. Subsequently we show how such integrals can be trivially bounded given generic exponential or polynomial tails on suitably transformed Bayes factors. The latter are typically available, leading to a much simpler treatment than studying (1) directly. Section 3 illustrates the framework by characterizing the posterior probability of individual models in linear regression under several popular priors. We also discuss the effect of the prior formulation on pairwise model comparison consistency, in particular when unduly inducing sparsity, which is of interest in its own right. Section 4 outlines how the results from Section 3 extend trivially to $\ell_0$ penalties. Section 5 combines the model-specific rates from Sections 3-4 to obtain overall variable selection rates. These help clarify the interplay between the prior formulation (both $p(\theta_k \mid M_k)$ and $p(M_k)$), $n$, $p$, and the signal strength as measured by the entries in $\theta^*$ and the Gram matrix $X'X$. Section 6 contains empirical illustrations, and Section 7 offers concluding remarks. Throughout we offer moderately detailed derivations to clarify the main steps to deploy our framework. To ease discussion, a number of auxiliary lemmas, technical results and all proofs are in the supplementary material.

## 2. Approach

### 2.1. Posterior consistency

From (1) posterior consistency means $\sum_{k \neq t} p(M_k \mid y) \stackrel{P}{\longrightarrow} 0$. The difficulty in high dimensions is that the number of terms grows with $n$, hence the sum can only vanish if each term converges to 0 quickly enough. Although this intuition is clear, obtaining probabilistic bounds for this stochastic sum is non-trivial. The main issue is that the $B_{lk}$'s may exhibit complex dependencies that require a delicate case-by-case treatment. We follow a conceptually simpler route: instead of characterizing high-dimensional stochastic sums, we work with deterministic expectations of bounded variables. Specifically, since $\sum_{k \neq t} p(M_k \mid y)$ is bounded, we find conditions for its expectation to vanish, that is by definition of convergence,

 (2) $\lim_{n \to \infty} E_{f^*}\left(\sum_{k \neq t} p(M_k \mid y)\right) = \lim_{n \to \infty} \sum_{k \neq t} E_{f^*}\left(p(M_k \mid y)\right) = 0,$

where $E_{f^*}$ denotes the expectation under $f^*$. Some trivial but relevant remarks are in order. First, convergence in (2) implies convergence in probability. Specifically, let $a_n$ be a sequence such that $a_n \to 0$; if $E_{f^*}(\sum_{k \neq t} p(M_k \mid y)) \preceq a_n$ then, by Markov's inequality, $\sum_{k \neq t} p(M_k \mid y) = O_p(a_n)$. Naturally (2) may require more stringent conditions than convergence in probability, but in our examples we obtain essentially tight rates and the gains in clarity are substantial. Second, one can evaluate the sum on the right hand side of (2) for fixed $n$, $p$ and $f^*$, i.e. the expression can be used in non-asymptotic regimes. Finally, note that $p(M_t \mid y) > 1/2$ implies $\hat{k} = t$ when $\hat{k}$ is the highest probability model (or the median probability model in Barbieri and Berger (2004)), hence

 $P_{f^*}(\hat{k} = t) \geq P_{f^*}\left(p(M_t \mid y) > 1/2\right) \geq 1 - 2 E_{f^*}\left(\sum_{k \neq t} p(M_k \mid y)\right),$

where the right-hand side follows from Markov's inequality. That is, if one can prove rates for $E_{f^*}(\sum_{k \neq t} p(M_k \mid y))$ then those rates also apply to (frequentist) correct model selection probabilities, ensuring some minimal coherency when using posterior probabilities for uncertainty quantification.
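Spelling out the Markov-inequality step behind the display above:

```latex
P_{f^*}(\hat{k} \neq t)
  \le P_{f^*}\big(p(M_t \mid y) \le 1/2\big)
  = P_{f^*}\Big(\sum_{k \neq t} p(M_k \mid y) \ge 1/2\Big)
  \le 2\, E_{f^*}\Big(\sum_{k \neq t} p(M_k \mid y)\Big),
```

where the first inequality holds because $p(M_t \mid y) > 1/2$ forces the highest probability model to be $M_t$, the equality uses $1 - p(M_t \mid y) = \sum_{k \neq t} p(M_k \mid y)$, and the last step applies Markov's inequality to this non-negative sum.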

Our strategy to characterize (2) is straightforward. We first bound individual $E_{f^*}(p(M_k \mid y))$ via tail probabilities for pairwise $B_{kt}$ (Lemma 1), then rates for (2) are given by simple deterministic sums. To characterize $E_{f^*}(p(M_k \mid y))$ one could in principle use uniform integrability, but Lemma 1 uses an easier (and sufficiently precise) strategy by taking the leading term in the denominator of (1) and bounding $p(M_k \mid y) \leq (1 + B_{tk}\, p(M_t)/p(M_k))^{-1}$. The latter bound neglects the contribution of the remaining models to the denominator of (1), but by definition PC implies that the term featuring $M_t$ dominates, hence the loss in precision from using this upper bound is negligible.

###### Lemma 1.

Let $0 \leq \underline{u} \leq \bar{u} \leq 1$. Then

 $E_{f^*}\left(p(M_k \mid y)\right) \leq \underline{u} + (1 - \bar{u}) + \int_{\underline{u}}^{\bar{u}} P_{f^*}\left(B_{kt} > \frac{r_{p_t, p_k}}{1/u - 1}\right) du.$

Lemma 1 is a trivial exploitation of expressing $E_{f^*}(p(M_k \mid y))$ as the integral of its survival function, $\int_0^1 P_{f^*}(p(M_k \mid y) > u)\, du$. The reason for introducing $(\underline{u}, \bar{u})$ is that it is often simpler to replace $P_{f^*}(p(M_k \mid y) > u)$ by an upper bound via tail inequalities, which may only hold for $u \in [\underline{u}, \bar{u}]$. To facilitate applying our framework, in Section 2.2 we specialize Lemma 1 to the common cases where a suitably re-scaled $B_{kt}$ has either exponential or polynomial tails. Clearly, plugging the bounds obtained from Lemma 1 into (2) gives a bound on rates. Proposition 1 below helps further simplify this summation. For many model classes one can obtain a common asymptotic bound that holds for all same-size models in $S_l$ or $S_l^c$; then one can easily bound the total posterior probability assigned to such models.
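The identity underlying Lemma 1, that the expectation of a $[0,1]$-valued variable equals the integral of its survival function, is easy to verify numerically. Below is a small Monte Carlo check with a Beta variable standing in for $p(M_k \mid y)$ (an illustration of the identity only, not of the paper's setting):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.beta(2.0, 5.0, size=100_000)     # bounded stand-in for p(M_k | y); E[X] = 2/7
grid = np.linspace(0.0, 1.0, 1001)
surv = np.array([(x > u).mean() for u in grid])   # empirical P(X > u) on a grid
tail_integral = surv.mean()              # uniform-grid average ≈ ∫_0^1 P(X > u) du
```

The tail integral matches the sample mean of `x` up to Monte Carlo and discretization error.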

###### Proposition 1.

Let $a_n^{(l)}$ be a sequence such that, as $n \to \infty$, $E_{f^*}(p(M_k \mid y)) \preceq a_n^{(l)}$ for all $M_k \in S_l$, and let $\tilde{a}_n^{(l)}$ be such that $E_{f^*}(p(M_k \mid y)) \preceq \tilde{a}_n^{(l)}$ for all $M_k \in S_l^c$. Then

 $E_{f^*}\left(P(S \mid y)\right) \preceq \sum_{l = p_t + 1}^{\bar{p}} |S_l|\, a_n^{(l)}, \qquad E_{f^*}\left(P(S^c \mid y)\right) \preceq \sum_{l = 0}^{p_t} |S_l^c|\, \tilde{a}_n^{(l)} + \sum_{l = p_t + 1}^{\bar{p}} |S_l^c|\, \tilde{a}_n^{(l)}.$

Note the role in Proposition 1 of the rates $a_n^{(l)}$ and $\tilde{a}_n^{(l)}$ associated to Bayes factors, of the model counts $|S_l|$ and $|S_l^c|$, and of $p_t$ (sparsity of $M_t$). $P(S \mid y)$ is the evidence in favor of spurious models and converges to 0 whenever the $a_n^{(l)}$ decrease sufficiently fast relative to $|S_l|$ and $\bar{p}$. As discussed, this can be achieved via diffuse priors $p(\theta_k \mid M_k)$ or setting $p(M_k)$ that penalize complexity (see Section 3). $P(S^c \mid y)$ measures the power to detect true signals and has often been regarded as less challenging than $P(S \mid y)$. The reason is that Bayes factors discard non-spurious models at a faster (exponential, rather than polynomial) rate than spurious ones (Dawid, 1999; Johnson and Rossell, 2010), hence typically $\tilde{a}_n^{(l)} \ll a_n^{(l)}$. However one should not neglect the non-spurious terms, particularly under non-sparse $M_t$ where $p_t$ is large or grows with $n$. A natural concern is that sparsity-inducing priors favor models with small $p_k$, and even though this effect is often asymptotically negligible, it can have a significant impact for finite $n$. An alternative advocated by Rossell and Telesca (2017) and Shin et al. (2017) is to enforce sparsity in a data-dependent manner that mainly affects spurious models, e.g. via non-local priors on $\theta_k$. See Section 6 for some examples.
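Once rates $a_n^{(l)}$ are available, the spurious-model bound in Proposition 1 is a plain deterministic sum. A sketch of the bookkeeping (the geometric rate below is a made-up example, not a derived rate):

```python
from math import comb

def spurious_bound(p, pt, pbar, a):
    """Deterministic bound from Proposition 1 on the posterior mass of
    spurious models: sum over sizes l of |S_l| * a_n(l), where
    |S_l| = C(p - pt, l - pt) counts the supersets of M_t of size l."""
    return sum(comb(p - pt, l - pt) * a(l) for l in range(pt + 1, pbar + 1))

# Illustration: a rate decaying geometrically in the number of spurious
# covariates, purely to show how the model counts and rates interact.
b = spurious_bound(p=100, pt=5, pbar=20, a=lambda l: 0.001 ** (l - 5))
```

With this choice the sum is dominated by the smallest spurious models, mirroring the discussion above: the bound vanishes only if the rates decay fast enough to offset the combinatorial growth of $|S_l|$.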

### 2.2. Tail integral bounds

A generic strategy to apply Lemma 1 is to upper-bound $B_{kt}$, or a suitable transformation thereof; specifically, our later examples naturally lead to bounding $B_{kt}$ by a monotone function of a random variable $Z$ for which one can characterize the tails. Then trivially $P_{f^*}(B_{kt} > r_{p_t, p_k}/(1/u - 1))$, and hence $E_{f^*}(p(M_k \mid y))$, can be bounded by integrating the tails of $Z$. In principle one can obtain this integral on a case-by-case basis, i.e. for a given model or prior, but to facilitate applying our framework Lemmas 2-3 give simple non-asymptotic bounds for the common cases where $Z$ has exponential or polynomial tails, respectively.

###### Lemma 2.

Let such that , and . Let be a random variable satisfying for and some , , .

If , then

If , then

If , then

###### Lemma 3.

Let such that and let , . Let be a random variable satisfying for all and some , .

If then

and otherwise

In our examples, as $n \to \infty$ the integral in Lemma 2, and hence $E_{f^*}(p(M_k \mid y))$, is bounded by the leading term times a term of smaller order. Lemma 3 is an analogous result for polynomial tails. Lemmas 4-5 are adaptations to chi-square and F-distributed random variables, useful for our linear regression examples. For simplicity Lemmas 4-5 state asymptotic bounds as $n \to \infty$, and one should think of the constants $\alpha$ and $d$ therein as being arbitrarily close to 1 and 2 (respectively), but the proofs also provide non-asymptotic bounds.

###### Lemma 4.

Let $W \sim \chi^2_{\nu}$ and $g > 1$, where $\nu$ may depend on $n$ and, as $n \to \infty$, $g \to \infty$.

(i) Let $d$ and $\alpha \in (0, 1)$ be fixed constants. Then, as $n \to \infty$,

 $\int_0^1 P\left(W > d \log(g^{1/u} - 1)\right) du \preceq \frac{1}{g}\, (4 e \nu)^{\nu/2} \left[\log\left(\frac{g e \nu}{4}\right)\right]^{\frac{\nu}{2} + 1} \ll \frac{1}{g^{\alpha}}.$

(ii) Let $d > 1$ and $\alpha \in (0, 1)$ be fixed constants. Then, as $n \to \infty$,

 ∫10P(W>dlog(g1/u−1))du⪯(1g)d−1[4e3/4νlog(geν/4)]ν2≪1gα(d−1).
###### Lemma 5.

Let $F$ follow an F distribution with $(\nu_1, \nu_2)$ degrees of freedom, $d$ be a fixed constant and $g$ be a function of $n$.

(i) Assume that $\log(g)/\nu_2 \to 0$ as $n \to \infty$. Then, for any fixed $\alpha \in (0, 1)$,

 $\int_0^1 P\left(\nu_1 F > d \log(g^{1/u} - 1)\right) du \preceq e^{\nu_1 d (2 - \sqrt{3})}\, \frac{1}{g} + \frac{e^{-\nu_1}}{g^{\,d - 1 - d \sqrt{(4/\nu_2) \log(g e \nu_1 / 2)}}} \ll \frac{1}{g^{\alpha}}.$

(ii) Assume that $(\nu_1 + \nu_2)/\log^{\alpha}(g) \to 0$ as $n \to \infty$ for all fixed $\alpha \in (0, 1)$. Then, for any such fixed $\alpha$,

 $\int_0^1 P\left(\nu_1 F > d \log(g^{1/u} - 1)\right) du \ll \exp\left\{-\frac{\nu_1 + \nu_2 - 5}{2} \log\left(\frac{\log^{\alpha}(g)}{\nu_1 + \nu_2 - 6}\right)\right\}.$

## 3. Model-specific rates for linear regression

Let $y \in \mathbb{R}^n$ and $X$ be an $n \times p$ matrix. Consider standard linear regression where $y = X\theta + \epsilon$, $\epsilon \sim N(0, \phi I)$. $X_k$ denotes the $n \times p_k$ matrix and $\theta_k$ the regression coefficients for the columns in $X$ selected by $M_k$. For simplicity $X_k' X_k$ is assumed invertible for all considered $M_k$ throughout. In this section we show how to bound $E_{f^*}(p(M_k \mid y))$ for a single $M_k$. The generic strategy applies to any prior but the required algebra varies, hence for illustration we focus on several popular choices. First we consider Zellner's prior

 $p(\theta_k \mid M_k, \phi) = N(\theta_k; 0, \tau \phi (X_k' X_k)^{-1}),$

where $\tau > 0$ is a known prior dispersion parameter, as it leads to concise expressions that highlight the main ideas. We then extend results to Normal priors

 $p(\theta_k \mid M_k, \phi) = N(\theta_k; 0, \tau \phi V_k)$

with general positive-definite covariance $V_k$ and to the pMOM prior of Johnson and Rossell (2012)

 $p(\theta_k \mid \phi, M_k) = N(\theta_k; 0, \tau \phi V_k) \prod_{j \in M_k} \frac{\theta_j^2\, x_j' x_j}{\tau \phi},$

where $x_j$ denotes the $j$th column of $X$. In our parameterization, constant $\tau$ leads to essentially constant prior variance in the three considered priors, whereas $\tau$ growing with $n$ leads to increasingly diffuse priors that encourage sparse models. Default choices include the unit information prior $\tau = n$ (Schwarz, 1978; Foster and George, 1994; Fernández et al., 2001; Narisetty and He, 2014), whereas $\tau \gg n$ is rarely considered. Whenever $\phi$ is treated as unknown we set a prior $p(\phi)$ with fixed hyper-parameters. These examples show how to apply the framework under various prior dependence structures and less standard (non-local) priors that vanish at the origin. We then extend results to a simple form of model misspecification, namely truly heteroskedastic and correlated errors. In Section 4 we extend these rates to $\ell_0$ penalties.
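Under Zellner's prior with known error variance the integrated likelihood has a closed form, since marginally $y \mid M_k$ is Normal with mean zero and covariance $\phi(I + \tau P_k)$, where $P_k$ projects onto the column span of $X_k$. A minimal sketch (function name ours):

```python
import numpy as np

def log_marginal_zellner(y, Xk, tau, phi):
    """Integrated likelihood log p(y | M_k) under Zellner's prior
    N(theta_k; 0, tau * phi * (Xk'Xk)^{-1}) with known error variance phi.
    Uses the closed form: the marginal covariance phi*(I + tau*P_k) has
    log-determinant n*log(phi) + p_k*log(1 + tau), and the quadratic form
    simplifies via y'P_k y."""
    n, pk = Xk.shape
    beta = np.linalg.solve(Xk.T @ Xk, Xk.T @ y)   # least-squares coefficients
    fit = y @ (Xk @ beta)                          # y' P_k y
    quad = (y @ y - tau / (1.0 + tau) * fit) / phi
    return (-0.5 * n * np.log(2.0 * np.pi * phi)
            - 0.5 * pk * np.log(1.0 + tau)
            - 0.5 * quad)
```

Differences of such log marginals across models give the log Bayes factors used throughout this section.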

Recall that the strategy is to bound tail probabilities for pairwise Bayes factors $B_{kt}$. Here the latter are one-to-one functions of standard chi-square/F-test statistics or Bayesian counterparts. Assuming that the data-generating $f^*(y) = N(y; X\theta^*, \phi^* I)$ for some $(\theta^*, \phi^*)$, their tails are given by chi-square and F distributions, and the integral bounds in Section 2.2 characterize $E_{f^*}(p(M_k \mid y))$. Considering model misspecification reduces to characterizing these tails when $f^*$ is not in the assumed model class. We assume the following technical conditions.

1. The maximum model size and the prior dispersion .

2. For any let where . Assume that for constant .

It is possible to relax (B1) to allow for larger $\bar{p}$: e.g. under Zellner's prior $B_{kt}$ depends on squared residual sums, these are 0 for $p_k \geq n$ and one obtains trivial Bayes factors. However (B1) seems a natural restriction, as $p_k \geq n$ results in data interpolation; (B1) then simply imposes a mild condition on $\tau$, satisfied by all default choices discussed above. Regarding (B2), $\lambda_k$ grows with $n$ at a rate given by $\theta^*$ and the smallest non-zero eigenvalue of the Gram matrix. Hence (B2) states that one should not consider models such that $p_k$ is too large relative to the signal-to-noise ratio and the correlations between the variables in $M_k$ and those in $M_t$. A detailed study of eigenvalues is beyond our scope, but common assumptions are that eigenvalues are bounded away from zero and infinity (Johnson and Rossell, 2012) or bounded by a constant multiple of $n$ (Narisetty and He, 2014). If $X$ has orthogonal columns with zero mean and unit variance, then clearly $X'X = nI$.

We summarize our results. For Zellner's and Normal priors, for spurious $M_m$ and any fixed $\alpha \in (0,1)$ one obtains $E_{f^*}(p(M_m \mid y)) \preceq r_{p_m, p_t}^{\alpha} (1+\tau)^{-\alpha (p_m - p_t)/2}$ under minimal conditions. That is, setting a large $\tau$ plays a similar role to a sparse $p(M_k)$ in reinforcing small models. For simplicity we use this common expression for all considered priors, but one can often obtain slightly tighter rates; e.g. for Zellner's prior the rate holds with $\alpha = 1$ up to a logarithmic factor. The pMOM prior attains faster rates for spurious models. For non-spurious $M_k$, $E_{f^*}(p(M_k \mid y))$ decreases exponentially with $n$ for fixed $p$ and $\theta^*$. For simplicity here we consider priors with Normal tails, but rates for thicker-tailed priors cannot be faster (up to lower-order terms); we defer this discussion to Section 7. Importantly, to obtain these rates one must set $\tau$ and $p(M_k)$ in a manner that induces sparsity yet preserves power to detect truly active variables, else the posterior probability of $M_t$ may not even converge to 1 as $n \to \infty$. Section 3.1 makes this notion precise and describes sufficient conditions on $(\tau, p(M_k))$.

### 3.1. Model priors and conditions for pairwise consistency

A necessary condition to attain global posterior consistency is pairwise consistency when comparing $M_t$ with an arbitrary $M_k$, i.e. $B_{kt} \stackrel{P}{\longrightarrow} 0$. Pairwise consistency essentially depends on the signal strength as measured by $\lambda_k$, the prior model probabilities $p(M_k)$ and the prior dispersion $\tau$. We list two technical conditions that, as shown in subsequent sections, guarantee pairwise consistency for all priors considered here.

1. Let . As , and for some fixed .

2. Let . As , for all fixed .

Recall that fixed $\tau$ corresponds to roughly constant prior variance. Intuitively, (C1)-(C2) ensure that $\tau$ and $p(M_k)$ do not favor $M_k$ over $M_t$ too strongly a priori. In fact, for spurious $M_k$ under Zellner's and Normal priors, if (C1) fails then $B_{kt} p(M_k)/p(M_t)$ does not converge to 0 in probability, i.e. (C1) is essentially necessary. This follows from the expressions and sampling distributions of $B_{kt}$ in (3) and (6) below. Similarly, for non-spurious $M_k$, if (C2) fails then model selection is inconsistent. More generally, for a wide class of priors, if $M_t \subset M_k$ then $B_{kt}$ is of a polynomial order driven by the overall prior dispersion, whereas if $M_k$ does not contain $M_t$ then $B_{kt}$ decreases exponentially in $n$ (Dawid, 1992). Hence (C1)-(C2) still ensure consistency. An exception are non-local priors, which are defined as vanishing as any $\theta_j \to 0$ for $j \in M_k$; these attain faster rates for spurious models, and (C1) could be relaxed slightly (e.g. for the pMOM prior in Section 3.5).

Conditions (C1)-(C2) depend on $\tau$ and $p(M_k)$. Lemma 6 makes (C1)-(C2) explicit for three popular model space priors of the form $p(M_k) = h(p_k)/\binom{p}{p_k}$, where $h(p_k)$ is the prior probability of selecting $p_k$ variables. These are the so-called uniform prior $p(M_k) = 2^{-p}$, the Beta-Binomial(1,1) prior $h(p_k) = 1/(p+1)$ (Scott and Berger, 2010) and the complexity prior $h(p_k) \propto p^{-c p_k}$ for a constant $c > 0$ (Castillo et al., 2015).

###### Lemma 6.

Let $\lambda_k$ be as in (C2) and let $M_k$ be a spurious model with $p_k$ variables. Then

1. Uniform prior. If then (C1)-(C2) hold.

2. Beta-Binomial(1,1) prior. If then (C1)-(C2) hold.

3. Complexity prior. If then (C1)-(C2) hold.

Let $M_k$ be a non-spurious model with $p_k$ variables.

1. Uniform prior. (C2) holds if and only if .

2. Beta-Binomial(1,1) prior. If then (C2) holds.

3. Complexity prior. If then (C2) holds.

That is, when $\lambda_k$ is large one may take any of the three priors, but for smaller $\lambda_k$ one must ensure that the prior sparsity induced by $\tau$ and $p(M_k)$ does not smother the signal in the data, particularly for the Beta-Binomial and Complexity priors, for which (C2) imposes a stronger growth condition on $\lambda_k$. The specific rate of $\lambda_k$ as a function of $n$ depends on the eigenvalues of $X'X$, but even under an ideal orthogonal design (C2) may fail for these two priors when $p_t$ grows with $n$ ($M_t$ is not sparse) or the smallest non-zero coefficients decrease with $n$ (small signals). These conditions are similar to those in Castillo et al. (2015) for Laplace spike-and-slab priors and in van der Pas et al. (2017) for the horseshoe prior; see also Bühlmann and van de Geer (2011, Chapter 2) for a summary of related conditions for the LASSO and other likelihood penalties.
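The three model-space priors just discussed can be evaluated as follows (a sketch; the complexity-prior normalizing constant is omitted since it only shifts all log-probabilities equally, and the size-penalty parameter `a` is illustrative):

```python
from math import comb, exp, log

def log_model_prior(pk, p, kind="betabinomial", a=1.0):
    """log p(M_k) for model-space priors of the form
    p(M_k) = h(p_k) / C(p, p_k), where h(p_k) is the prior probability
    of selecting p_k out of p variables."""
    if kind == "uniform":          # every model equally likely: p(M_k) = 2^{-p}
        return -p * log(2.0)
    if kind == "betabinomial":     # h(p_k) = 1 / (p + 1)
        return -log(p + 1.0) - log(comb(p, pk))
    if kind == "complexity":       # h(p_k) proportional to p^{-a * p_k}
        return -a * pk * log(p) - log(comb(p, pk))
    raise ValueError(kind)

# Sanity check: Beta-Binomial(1,1) probabilities sum to 1 over all 2^p models.
p = 6
total_bb = sum(comb(p, k) * exp(log_model_prior(k, p)) for k in range(p + 1))
```

Comparing `log_model_prior` across `kind` values makes the sparsity ordering concrete: for large models the complexity prior assigns far less mass than the Beta-Binomial, which in turn penalizes size more than the uniform prior.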

### 3.2. Zellner’s prior with known variance

Under Zellner's prior and known $\phi$, simple algebra gives

 (3) $B_{tm} = \exp\left\{-\frac{\tau}{2 \phi^* (1+\tau)} W_{mt}\right\} (1+\tau)^{\frac{p_m - p_t}{2}}$

and hence in Lemma 1

 (4) $P_{f^*}\left(B_{mt} > \frac{r_{p_t, p_m}}{1/u - 1}\right) = P_{f^*}\left(\frac{W_{mt}}{\phi^*} > \frac{1+\tau}{\tau}\, 2 \log\left[(1+\tau)^{\frac{p_m - p_t}{2}} \frac{r_{p_t, p_m}}{1/u - 1}\right]\right),$

where $W_{mt}$ is the difference between the residual sums of squares under $M_t$ and $M_m$, and $\hat{\theta}_k = (X_k' X_k)^{-1} X_k' y$ is the least-squares estimate under $M_k$.
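As a numerical sanity check, the Bayes factor in (3) computed from the residual-sum-of-squares difference $W_{mt}$ agrees with the difference of Zellner log integrated likelihoods (a simulated example; all names and the data-generating choices are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
n, tau, phi = 50, 9.0, 1.0
X = rng.standard_normal((n, 3))
y = 0.5 * X[:, 0] + np.sqrt(phi) * rng.standard_normal(n)

def rss(Xk):
    """Residual sum of squares of the least-squares fit under M_k."""
    beta = np.linalg.solve(Xk.T @ Xk, Xk.T @ y)
    r = y - Xk @ beta
    return r @ r

Xt, Xm = X[:, :1], X                 # M_t (1 variable) nested in spurious M_m (3)
W = rss(Xt) - rss(Xm)                # W_mt: drop in residual sum of squares
# log Bayes factor in favor of M_t, following the structure of (3) with known phi:
log_Btm = -tau * W / (2 * phi * (1 + tau)) + 0.5 * (3 - 1) * np.log(1 + tau)
```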

Consider first a spurious model $M_m \supset M_t$. Under $f^*$ it is well-known that $W_{mt}/\phi^* \sim \chi^2_{p_m - p_t}$ (see Lemma S7), and by Lemma 1

 $E_{f^*}\left(p(M_m \mid y)\right) < \int_0^1 P_{f^*}\left(\frac{W_{mt}}{\phi^*} > \frac{1+\tau}{\tau}\, 2 \log\left[(1+\tau)^{\frac{p_m - p_t}{2}} \frac{r_{p_t, p_m}}{1/u - 1}\right]\right) du.$

To bound this integral we use Lemma 4. Specifically, set $g = (1+\tau)^{\frac{p_m - p_t}{2}} r_{p_t, p_m}$ and $d = 2(1+\tau)/\tau$, and note that $g \to \infty$ as $n \to \infty$ under Conditions (B1) and (C1). Then by Lemma 4

 Ef∗(p(Mm∣y))⪯[log(g)](pm−pt)/2g≪rαpm,pt(1+