Valid uncertainty quantification about the model in a linear regression setting

# Valid uncertainty quantification about the model in a linear regression setting

Ryan Martin,111Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, email: rgmartin@uic.edu   Huiping Xu,222Department of Biostatistics, The Richard M. Fairbanks School of Public Health and School of Medicine, Indiana University, email: huipxu@iu.edu   Zuoyi Zhang,333Regenstrief Institute, Inc., email: zyizhang@indiana.edu   and Chuanhai Liu444Department of Statistics, Purdue University, email: chuanhai@purdue.edu
July 5, 2019
###### Abstract

In scientific applications, there often are several competing models that could be fit to the observed data, so quantification of the model uncertainty is of fundamental importance. In this paper, we develop an inferential model (IM) approach for simultaneously valid probabilistic inference over a collection of assertions of interest without requiring any prior input. Our construction guarantees that the approach is optimal in the sense that it is the most efficient among those which are valid. Connections between the IM’s simultaneous validity and post-selection inference are also made. We apply the general results to obtain valid uncertainty quantification about the set of predictor variables to be included in a linear regression model.

Keywords and phrases: Inferential model; post-selection inference; optimality; predictive random set; variable selection.

## 1 Introduction

### 1.1 Background

Linear regression is one of the most widely used statistical tools in scientific applications. Standard practice is to consider many predictor variables in hopes that only a few will identify themselves as being useful in explaining variation in the response variable. As a result, there is substantial uncertainty about the underlying model. The classical frequentist approach to deal with this problem is to use the data to help select a particular model, and there are a plethora of tools available, such as the Akaike information criterion (AIC, Akaike 1973), the Bayesian information criterion (BIC, Schwarz 1978), lasso (Tibshirani 1996, 2011) and its variants, including adaptive lasso (Zou 2006) and elastic net (Zou and Hastie 2005). Significance tests, such as those in Bühlmann (2013) and Lockhart et al. (2014), based on these or other methods have also been considered recently. However, these frequentist methods provide no quantification of the uncertainty about the true model, i.e., they provide no way to conclude that one model being the true model is more plausible than another model being the true model.

Bayesian methods, on the other hand, are able to produce a summary of the uncertainty among the candidate models; see, for example, Clyde and George (2004) and O’Hara and Sillanpää (2009). Starting from a prior distribution on the set of possible models and a set of conditional priors on the model-specific parameters, a posterior distribution on the model space can be obtained via Markov chain Monte Carlo. Despite the conceptual simplicity of this approach, there are practical difficulties to overcome, including prior specification; in particular, real prior information is rarely available and improper default priors cannot be used. Without genuine prior information, the model uncertainty assessments coming from the corresponding posterior distribution are not guaranteed to be inferentially meaningful; see Section 1.2. Perhaps as a result of this lack of meaningfulness, the recent trend in the Bayesian literature on model selection is to de-emphasize uncertainty quantification, opting instead to report only a selected model. Therefore, modern Bayesian methods are effectively just frequentist selection procedures and they provide no summaries of model uncertainty.

Researchers have recently started to look beyond the classical frequentist and Bayesian schools for new solutions to challenging problems. See, for example, recent work on generalized fiducial inference (Hannig et al. 2015; Lai et al. 2015), confidence distributions (Xie and Singh 2013), and objective Bayes with default, reference, or data-dependent priors (Berger et al. 2009; Fraser 2011; Fraser et al. 2010; Martin et al. 2015). Our focus in this paper is the new perspective from the inferential model (IM) framework of Martin and Liu (2013). The key feature of this approach is that, for any assertion/hypothesis concerning the unknown parameter, it produces a probabilistic summary of the evidence in data for/against the truthfulness of that assertion. In addition to this internal or subjective interpretation, these inferential summaries are valid, i.e., suitably calibrated, facilitating an external or objective interpretation. The technical feature that distinguishes IMs from other approaches is the use of predictive random sets to produce a valid probabilistic summary of uncertainty about the parameter without a prior.

### 1.2 Lack of validity: an illustration

To further motivate our developments here, it will be helpful to elaborate on the claim above that the Bayesian approach does not, in general, provide an inferentially meaningful assessment of model uncertainty. We consider a simple illustrative example. Let be independent, with , . There are four models in this case, one for each zero and non-zero combination for ; here we will consider two “model assertions,” namely, and . A reasonable Bayesian model for this problem assigns a uniform prior over the four candidate models, and a prior for the model-specific parameters, where the to-be-specified variance, , controls the degree of prior uncertainty. The normal prior on the model-specific parameters is just a special case of the familiar g-prior in regression (e.g., Zellner 1986).

To quantify the uncertainty in the model given the observed data , we report the posterior probabilities for the model assertions mentioned above. To be consistent with the notation and terminology in the rest of the paper, we will write these posterior probabilities as , where is one of the model assertions, and interpret this as the “plausibility” that model assertion is true based on the observed data . These posterior probabilities are proportional to sums of marginal likelihoods, i.e.,

 plx(θ1=0,θ2=0) ∝N(x1∣0,1)N(x2∣0,1) plx(θ1=0) ∝N(x1∣0,1)N(x2∣0,1+v)+N(x1∣0,1)N(x2∣0,1),

where the normalizing constant is the sum of all four marginal likelihoods. These are easy to compute, no sophisticated Monte Carlo methods are needed. The question is if the corresponding uncertainty quantification is meaningful. For example, the plausibility assigned to the true model should be large, but how large is large?

Figure 1 shows the distribution functions (gray) of these plausibility values for the two assertions of interest, for different true parameter values, and for three different values of . It is natural to interpret probabilities on the scale—e.g., 0.9 is a “large” value because it would be exceeded in only 10% of cases—so we take this as a reference. There are several key observations. First, there is a strong dependence on the (arbitrary) choice of prior variance; in fact, depending on , even if the model assertion is true, as in Panels (a) and (c), it can happen that the posterior probability can never be “large.” Second, there is no meaningful way that these distributions can be related to the reference typically used for interpretation. Third, the way that shape of these distributions depends on the assertion and its truth/falsity is haphazard: the distributions in Panels (a) and (b) look indistinguishable even though the model assertions are true and false, respectively, and the distributions in Panel (c), based on a different but true assertion, have similar shapes; in Panel (d), the assertion is false, and the distributions look entirely different from the others. The take-away message is that the Bayesian posterior model probabilities need not have a meaningful interpretation in terms of quantifying uncertainty about the underlying model. The example here is relatively simple, so we can expect that similar issues will occur more generally.

As a preview, Figure 1 also shows the distribution function of the model plausibilities obtained from the optimal IM described in Section 4. The key observation is that the IM-based plausibility is valid in the sense that the comparisons with the reference are consistent with our intuition in each case: when the asserted model is true (resp. false), the plausibility is stochastically no smaller (resp. no larger) than . This important difference compared to the Bayesian output is easiest to see in Panel (c). To our knowledge, only the IM approach proposed here provides valid model uncertainty quantification, so this is an important contribution.

### 1.3 Main contributions

Early work on IMs for regression is presented in Zhang et al. (2011). What is missing there is a rigorous treatment of model uncertainty quantification, and the present paper aims to fill this important gap. In particular, the main contributions here are two-fold.

First, we present new and general results on the construction of optimal predictive random sets for inference problems that involve multiple simultaneous assertions. Previous work on optimality (e.g., Martin and Liu 2013) focused on one assertion at a time but many important examples, such as the regression problem considered here, involve simultaneous consideration of multiple assertions. Our strategy is to decompose the set of assertions under consideration into their basic building blocks, identify the optimal predictive random sets for these building block assertions, and then combine these individually optimal predictive random sets in a particular way to obtain the simultaneously optimal predictive random set for the collection of assertions.

Second, we apply these new results to construct an optimal IM for the important problem of model uncertainty quantification in regression. There we find that a hyper-cube predictive random set is optimal. The particular construction ensures that the IM’s output is provably valid in the sense illustrated in Figure 1. To our knowledge, this is the first result on valid uncertainty quantification about the model in a regression context. As part of our development, we make an important connection between the IM’s focus on simultaneous validity and modern attempts at post-selection inference, especially, that in Berk et al. (2013).

### 1.4 Outline of the paper

In Section 2 we review the basics of the IM approach in the context of the regression problem. Section 3 introduces the concept of efficiency and develops some general theory concerning the shape of the optimal predictive random sets for successively more general collections of assertions, concluding with a result (Theorem 4) that characterizes the IM for valid and optimal simultaneous inference over a class of so-called complex assertions. This result is specialized for model uncertainty quantification in regression in Section 4; there we draw connections between notion of simultaneous validity and post-selection inference, and show how the IM’s validity property can be used to develop a variable selection procedure with provable control on certain frequentist error rates. The paper concludes with a short discussion in Section 5; proofs are deferred to the Appendix and some additional technical details are presented in the Supplementary Material.

## 2 IMs for regression

### 2.1 Baseline association

To set notation, consider the following specification of the linear regression model:

 Y=Xβ+σZ, (1)

where is a -vector of response variables, is an matrix of predictor variables, is a -vector of regression coefficients, is the residual variance, and is a -vector of standard Gaussian errors, i.e., , a -dimensional Gaussian distribution with mean zero and identity covariance matrix. We assume that and that is fixed of full rank. Without loss of generality, we will assume that and the columns of have been centered, which effectively marginalizes out the nuisance intercept parameter.

We refer to (1) as a baseline association between observable data (and ), unknown parameters , and unobservable auxiliary variables . The importance of such an association can be seen as follows: if could be observed, then we could exactly solve (1) for , leading to the “best possible inference.” Since cannot be observed, a natural idea is to accurately predict the unobservable ; see Section 2.4. It is clear that accurately predicting a high-dimensional quantity is more difficult than predicting a lower-dimensional quantity, so it is advantageous to reduce the dimension of the auxiliary variable as much as possible before carrying out the prediction step.

### 2.2 Auxiliary variable dimension reduction

Martin and Liu (2015a, b) discuss two distinct approaches for reducing the dimension of the auxiliary variable. The first is an approach based on conditioning, which is particularly useful in cases where the auxiliary variable is of higher dimension than the parameter. For example, in our regression problem, is -dimensional while is -dimensional, and . Since the regression problem admits a -dimensional minimal sufficient statistic, the IM dimension reduction is straightforward. Let be the least-squares estimator of , and the corresponding estimator of based on the residual sum of squares. Then is a minimal sufficient statistic, and we can identify a lower-dimensional (conditional) association:

 ^β=β+σV1and^σ=σV2 (2)

where satisfies , , independent, and . The key point is that we have replaced the -dimensional auxiliary variable with a -dimensional auxiliary variable .

Here, as is typical in regression applications, is a nuisance parameter. For such cases, it is possible to further reduce the dimension of the auxiliary variable via marginalization. Note that the association (2) can be rewritten as

 ^β=β+^σ(V1/V2)and^σ=σV2.

The general theory in Martin and Liu (2015b) says that the second equation above—the one that is free of —can be ignored. This leaves a marginal association involving a -dimensional auxiliary variable . That is,

 ^β=β+^σW,W∼tp(0,M;n−p−1), (3)

where the auxiliary variable distribution is a -dimensional Student-t, with degrees of freedom, centered at the origin, with scale matrix . Again, the important point is that the -dimensional auxiliary variable in (2) has been replaced by a -dimensional auxiliary variable . No further dimension reduction is possible.

It will be convenient to rewrite association (3) once more. Let be a diagonal matrix with the same diagonal as . Then consider the association

 ^θ=θ+^σU,U∼PU=tp(0,L,n−p−1), (4)

where , , , and . Note, in particular, that if and only if , so the question about which variables are included in the model has not been changed; also, has all ones on the diagonal.

### 2.3 Predictive random sets

The interpretation of the association (4) from the standpoint of inference is that there is a particular value, say , for which that equation holds for the true . How we describe our uncertainty about this value determines our uncertainty assessments about , given the observed data. What distinguishes the IM approach from Fisher’s fiducial approach and its variants is the use of a random set, , to summarize uncertainty about . We refer to as a predictive random set. The key to the IM approach’s success is an appropriate choice of the predictive random set.

There is a rich mathematical theory of random sets, summarized nicely in Molchanov (2005). Here, for conceptual understanding, it will suffice to consider a discrete setting. Let be a finite collection of subsets of a space , in our case, . This collection of sets will serve as the support for the random set ; the individual sets in the collection are called focal elements. Now, define a random set , supported on , simply by assigning probabilities to each of the focal elements; this characterizes the distribution, , of . Of course, defining a random set in a non-discrete case requires more care and, in particular, a notion of measurability of set-valued functions. The Supplementary Material provides some relevant technical background and explains how the random sets appearing in our IM development satisfy the required conditions automatically.

One is free to describe their uncertainty about the unobserved value with any kind of predictive random set. However, the corresponding uncertainty assessments/inference about would be meaningful only if the two distributions in question, namely, and , have a suitable link. The following definition provides such a link.

###### Definition 1.

A predictive random set , with distribution , is called admissible if the following two conditions hold:

• the support , which is assumed to contain and , is nested, that is, for any two focal elements and in , either or , and

• the distribution of satisfies

 PS{S⊂K}=supS∈S:S⊂KPU(S),K⊆U. (5)

Martin and Liu (2013) showed that, by choosing a predictive random set that is admissible in the sense above, the corresponding IM output is valid, i.e., suitably calibrated for scientific inference; see Definition 2 below. Existence of a distribution that satisfies (5) is discussed in the Supplementary Material.

### 2.4 IM construction

First, the model provides a link between the observable data, the unknown parameters, and the unobservable auxiliary variables. Next, uncertainty about the unobserved value of the auxiliary variable, given the observed data, is encoded in a predictive random set. Finally, this uncertainty about the auxiliary variable gets passed through the association, yielding a corresponding uncertainty assessment about the parameter. For the dimension-reduced model in (4), the formal three-step IM construction is as follows.

###### A-step.

Associate data with unknown parameters and auxiliary variable as in (3). This defines a set—in this case, a singleton set—of parameter values indexed by particular values of , namely,

 Θy(u)={θ:^θ=θ+^σu},u∈Rp. (6)
###### P-step.

Predict the unobservable auxiliary variable with an admissible predictive random set with distribution .

###### C-step.

Combine the results of the A- and P-steps to get

 Θy(S)=⋃u∈SΘy(u).

Since is a random set, so is , and probabilities associated with this random set are used to summarize uncertainty about the unknown parameter. If is such that with -probability 1 for all —this is the only case we need to consider here, but see Ermini Leaf and Liu (2012)—then, for a given assertion/hypothesis about the unknown parameter , we calculate the belief function

 bely(A;S)=PS{Θy(S)⊂A}, (7)

the -probability that completely agrees with the assertion . The other important quantity is the plausibility function

 ply(A;S)=1−bely(Ac;S)=PS{Θy(S)⊄Ac},

the -probability that at least partially agrees with . For given , the output of the IM construction is the pair of functions .

### 2.5 IM validity

The output of the IM construction is a belief function, but in what sense is this belief function meaningful for inference? One important property the belief function should have is that it does not tend to assign high beliefs to assertions about which are false. The validity property makes this precise.

###### Definition 2.

An IM is called valid if, for any assertion about , the corresponding belief function satisfies

 sup(θ,σ):θ∉APY|(θ,σ){belY(A;S)≥1−α}≤α,∀α∈(0,1). (8)

In other words, the IM is valid if, for any assertion , is stochastically no larger than as a function of when is false.

Martin and Liu (2013) showed that admissibility of is sufficient for validity of the corresponding IM. The only downside is that there are many predictive random sets that give valid IMs. In that case, it is natural to look for a “best” one; see Definition 3.

## 3 General optimality results

### 3.1 Setup

Consider a general setup with association , where is the observable data, is the unknown parameter of interest, and is an unobservable auxiliary variable, with distribution defined on a space ; precise measure-theoretical statement of our assumptions is given in the Supplementary Material. The corresponding sampling model is denoted by . We assume that the predictive random set is admissible, so that the IM’s belief function is valid in the sense of Definition 2. Then, for a given assertion about , an optimal predictive random set , if it exists, makes stochastically as large as possible as a function of , for all (Martin and Liu 2013). The following definition makes this formal.

###### Definition 3.

For a given association and assertion , if and are two valid predictive random sets, then we say that is at least as efficient as (with respect to ) if

 belX(A;S)≥stbelX(A;S′), as a function of X∼PX|θ, θ∈A. (9)

A predictive random set is called optimal (with respect to ) if (9) holds for all valid predictive random sets .

Given from the A-step, define the a-event as

 UA(x)={u∈U:Θx(u)⊆A}. (10)

This is just the set of all that support the truthfulness of the assertion “” for a given . We require that be appropriately measurable, and the precise conditions are given in the Supplementary Material. The a-events will be crucial in the construction of the focal elements of the optimal predictive random set.

As discussed in Section 1, towards developing an optimal predictive random set for a collection of assertions, simultaneously, we consider breaking these assertions down into their basic building blocks. Starting from the most basic building block, a simple assertion, we characterize the optimal predictive random sets and show how these can be combined to create optimal predictive random sets for more general kinds of assertions. Since the distribution of an admissible predictive random set is determined by (5), our focus throughout is on the construction of the focal elements.

### 3.2 Simple assertions

An assertion is simple if the collection of a-events is nested, i.e., for any pair , we have either or ; otherwise, the assertion is called complex. The following results shows that, if is simple, then the optimal predictive random set (relative to ), as in Definition 3, readily obtains by taking the support and the distribution to satisfy (5).

###### Theorem 1.

For a simple assertion , the optimal predictive random set with respect to is supported on the nested collection with distribution satisfying (5).

###### Proof.

See Appendix A.1. ∎

As an example, consider , with association , . The assertion is simple. To see this, write the a-event:

 UA(x)={u:x=θ+u for some θ>0}=(−∞,x).

Then, clearly, if , then ; so, the a-events are nested. The belief function based on the optimal predictive random set in Theorem 1 is

 belx(A;S)=PS{S⊂Ux(A)}=PU{UA(x)}=Φ(x),

where is the standard normal distribution function. Given the optimality result, it is not a coincidence that is the p-value for the uniformly most powerful test of versus ; see Martin and Liu (2014).

Unfortunately, simple assertions are insufficient for practical applications. For example, in the normal mean example above, we might be interested in the assertion or, more generally, we might be interested in several assertions simultaneously, each of which might be simple or not. The regression application features all of these cases. (More generally, even if there is only one assertion , IM efficiency considerations require that one think about and simultaneously, so an understanding of how to handle multiple assertions is fundamental.) The next two sections discuss how to extend the basic optimality results to these important more general cases.

### 3.3 Complex assertions

To motivate our developments, reconsider the normal mean example above. This time, suppose we are interested in the assertion . This is a union of two disjoint simple assertions, namely, and . The optimal predictive random set with respect to is efficient for but very inefficient for ; likewise, for the optimal predictive random set with respect to . Since we are interested in we require a predictive random set that is as efficient as possible for both and . For the general case of a complex assertion written as a union of two simple assertions, an intuitively reasonable strategy is to consider a predictive random set whose focal elements are intersections of the focal elements of the optimal predictive random sets with respect to the two individual simple assertions. The following result says that the optimal predictive random set for the complex assertion must be of this form. (The results below extend to more than two simple component assertions, but two-component assertions are general enough for our purposes here.)

###### Theorem 2.

Let be a complex assertion, where and are simple assertions. Let and be the optimal predictive random sets for and , respectively, as in Theorem 1. For any predictive random set , there exists an , whose focal elements are intersections of the focal elements of and , such that is at least as efficient as with respect to .

###### Proof.

See Appendix A.2. ∎

This result simplifies the search for optimal predictive random sets with respect to a complex assertion. It does not completely resolve the problem, however, since there are many choices of predictive random sets with intersection focal elements. In the normal mean problem, for example, the focal elements for the optimal predictive random sets with respect to and are one-sided intervals. Therefore, the focal elements of the optimal predictive random set for must be nested intervals, but should the intervals be symmetric or asymmetric? See Section 3.5.

### 3.4 Multiple complex assertions

Suppose there are multiple assertions, , to be considered simultaneously, where , , decomposes as a union of two disjoint simple assertions. Each simple component has an optimal predictive random set according to Theorem 1, and the corresponding optimal predictive random set, , for the union has intersection focal elements according to Theorem 2. As before, intuition suggests that the optimal predictive random set for would be supported on intersections of the focal elements for the individually optimal , . To justify this intuition, we need a way to measure the efficiency of a predictive random set in this multiple-assertion context.

###### Definition 4.

An assertion is said to be generated by if can be written as a union of some or all of the s. Then a predictive random set is at least as efficient as with respect to if (9) holds for all generated by .

The following result, a generalization of Theorem 2, shows that a restriction to predictive random sets supported on intersection focal elements is without loss of efficiency.

###### Theorem 3.

Let be a collection of assertions, where partitions as a union of disjoint simple assertions. Let be the optimal predictive random set for , . Then, for any predictive random set , there exists an , whose focal elements are intersections of the focal elements of the s, such that is at least as efficient as with respect to .

###### Proof.

See Appendix A.3. ∎

As in the previous section, this result simplifies the search for an optimal predictive random set but does not completely resolve the problem, since it does not determine the particular shape of the focal elements. Therefore, we need to push further.

### 3.5 Balance and optimality

To resolve the ambiguity about the shape of the optimal focal elements, here we will introduce some special structure to the problem. Suppose that there exists transformations of that do not fundamentally change the inference problem. As a simple example, if and the association of interest is , then it is clear that changing the sign of should not have an impact on how much support there is for . In other words, this normal mean problem is invariant to sign changes. More generally, let be a group of bijections from to itself; as is customary, we shall write for the image of under , rather than . Practically, the transformations in represent symmetries of the inference problem, i.e., nothing fundamental about the problem changes if is observed instead of . The key technical assumption here is that each commutes with the association mapping , i.e.,

 ga(θ,u)=a(gθ,gu),∀(θ,u),∀g∈G. (11)

Here we are implicitly assuming that also acts upon the parameter space and the auxiliary variable space . This can be relaxed by introducing groups acting on and , respectively, different from (but homomorphic to) , but we will not need this extra notational complexity for our application here. The above condition is different from that which defines the usual group transformation models in that, in the right-hand side, also acts on . In fact, our variable selection application will not fit the usual group transformation structure, unless the design is orthogonal.

We shall also require that the assertions respect these symmetries. Let be an assertion generated the collection of complex assertions. Consider the subgroup of to which is invariant, and write . The intuition is that the inference problem for , at least as it concerns the assertion , is unchanged if the problem is transformed by .

Before proceeding, it may help to see a simple example. Consider, again, the normal mean problem, with , , and assertion . Then changing the sign of will not affect the problem. So, we can take to consist of the identity mapping and , and clearly (11) holds; also, .

Moving on, recall the a-event . In this case, an equivariance property follows immediately from the commutativity property (11):

 UA(gx)=gUA(x),∀g∈GA.

That is, given , transforming , for , and then solving for is equivalent to solving for with the given and then transforming .

We now have the necessary structure to help specify an optimal predictive random set. It suffices to focus on predictive random sets which are admissible. So, let be admissible and, for simplicity, suppose we can write the collection of closed nested focal elements as , where corresponds to the set’s -probability; in particular, . Then we have the following useful representation.

.

###### Proof.

See Appendix A.4. ∎

From Lemma 1 and the equivariance property above, if , then

 PX|θ{belgX(A;S)>1−r}≡PX|θ{UA(gX)⊃Sr}=PX|θ{UA(X)⊃g−1Sr}.

Since the understanding is that the inference problem, at least as it concerns the assertion , is unchanged by transformations for , it is reasonable to require that the distribution of be invariant to . The previous display reveals that the way to achieve belief function invariance is to require the focal elements of the predictive random set to be invariant to . This leads to the following notion of balance.

###### Definition 5.

The predictive random set is said to be balanced with respect to if each focal element satisfies for all . Moreover, is said to be balanced if the aforementioned invariance holds for all .

Balance itself is a reasonable property, given that the transformations are, by definition, irrelevant to the inference problem. It is also interesting and practically beneficial that balance can be checked without doing any probability calculations; however, the focal elements and the transformations depend on the model.

Beyond the intuitive appeal of balance, we claim that balance leads to a particular form of optimality. Recall that, for IM efficiency, the general goal is to choose the predictive random set to make the belief function stochastically large when the assertion is true. With this in mind, we propose the following notion of maximin optimality.

###### Definition 6.

A predictive random set , with focal elements , is maximin optimal if it maximizes

 ming∈GAPX|θ{belgX(A;S)>1−r}≡ming∈GAPX|θ{UA(X)⊃g−1Sr}

over all admissible with focal elements , uniformly for all generated by , for all , and for all .

###### Theorem 4.

If a predictive random set is balanced in the sense of Definition 5, then it is maximin optimal in the sense of Definition 6.

The main idea in the proof, presented in Appendix A.5, is the notion of the “core” of a given focal element . In particular, the core is defined as , and the proof relies on the fact that it is both balanced and contained in each .

## 4 Model uncertainty in regression

### 4.1 Optimal IM for assertions about the model

Recall the dimension-reduced association in (4), where is a -vector distributed as , and is a matrix with ones on the diagonal. The intercept has been marginalized out, so each of can be considered zero or non-zero; the goal is to summarize the uncertainty about these possible models.

Consider the collection of complex assertions , . Consider first a particular . This can be written as , where and . We claim that these sub-assertions are both simple in the sense of Section 3.2. Take , for example. Then the corresponding a-event is

 UAj1(y)={u:^θ−^σu∈Aj1}={u:uj>Tj},

where . This a-event is nested because it shrinks and expands monotonically as a function of . The same is true for and for all the other . Therefore, by Theorem 1, the optimal predictive random sets for the individual sub-assertions and are each supported on collections of half hyper-planes. Next, it follows from Theorem 2 that the optimal predictive random set for the complex assertion is supported on intersections of half hyper-planes, i.e., cylinders . If we are considering simultaneously, then it follows from Theorem 3 that the optimal predictive random set is supported on boxes—intersections of marginal cylinders—in . The remaining question is what shape should the boxes be.

Towards optimality, we need to consider what transformations leave the model uncertainty problem invariant in the sense of Section 3.5. As in the simple normal mean example discussed previously, sign changes to coordinates of are irrelevant. In addition, the labeling of the variables is arbitrary, so permutations of the variable labels are also irrelevant. This suggests we consider the group of signed permutations; that is, each acts on a -vector by matrix multiplication (on the left), where the matrix factors as a product of a diagonal matrix with on the diagonal and a permutation matrix. It is clear that the association commutes with in the sense of (11). With the group specified, it is also clear what shape of boxes the optimal predictive random set should be supported on. According to Definition 5, a balanced predictive random set should have focal elements—in this case, shaped like boxes in —that are invariant to the transformations in . The only such boxes are hyper-cubes centered at the origin.

###### Corollary 1.

The admissible hyper-cube predictive random set , given by

 S={u∈Rp:∥u∥∞≤∥U∥∞},U∼PU=tp(0,L;n−p−1), (12)

is balanced in the sense of Definition 5 and maximin optimal for in the sense of Definition 6.

### 4.2 IM output and its interpretation

Now that we have specified the optimal predictive random set for the P-step, the C-step proceeds just as before. It is helpful, however, to describe the IM output and its interpretation as it pertains to model uncertainty quantification. To start, we need a bit more notation. Let denote the collection of indices corresponding to the truly non-zero coefficients, i.e., for all . Also, for a generic subset , let denote the corresponding sub-vector of .

Consider the following assertion about the model:

 CJ={J⊆J}={θ:θJc=0}.

Note that , so this new assertion is generated by the model assertions from before. Using the optimal predictive random set in (12), we have

 plY(CJ;S)=PU{∥U∥∞>∥TJc∥∞}=1−F(∥TJc∥∞), (13)

where is the distribution function of when . The distribution function can be evaluated via Monte Carlo or the method of Genz and Bretz (2002), implemented in the R package pmvt.

###### Toy Example.

For the illustrative example in Section 1.2, where , the error variance was known, and the design matrix was identity, the distribution function has a simpler form, namely, , . Then the plausibility functions computed in that example, and summarized in Figure 1, are based on the expression (13) with this distribution function and as the pair of observations .

How to interpret the plausibility (13) above? On one hand, it is clear both from intuition and from the formula that is non-decreasing in and that the full model has plausibility 1. On the other hand, since the optimal hyper-cube predictive random set has positive volume, the belief in any proper sub-model is zero. So, to borrow the terminology from Dempster (2008), larger models have more “don’t know” probability, which is consistent with idea that larger models have more parameters to estimate and are more difficult to interpret. Consequently, only the relatively small plausibilities, which correspond to relatively small models, can be interpreted. To help understand what “relatively small” means, we have the following calibration result.

###### Theorem 5.

If is true, then is stochastically no smaller than .

###### Proof.

It is clear that, is true, i.e., if , then is stochastically no smaller than . Then the result follows from the fact that . ∎

The result in Theorem 5 explains why the IM plausibility’s distribution function falls on or below the diagonal line in Panels (a) and (c) in Figure 1. Moreover, optimality of the predictive random set suggests that the IM should also be efficient, i.e., if is false, then should be stochastically smaller than , which is the conclusion in Panels (b) and (d) of the same figure.

For further illustration, consider the prostate cancer data analyzed by Tibshirani (1996) and others. This study examined the association between the prostate specific antigen (PSA) level and some clinical measures among men who were about to receive a radical prostatectomy. There were men with predictors, four of which, including cancer volume (lcavol), prostate weight (lweight), capsular penetration (lcp), and benign prostatic hyperplasia amount (lbph), were log transformed. The other four predictors were age, seminal vestical invasion (svi), Gleason score (gleason), and percentage Gleason scores 4 or 5 (pgg45). The response variable is the log PSA level.

We can compute the plausibilities (13) for each , and the results are summarized in Figure 2, where the plausibilities are arranged by the corresponding model size. That is, for each , there are plausibility values displayed vertically. As mentioned previously, only the relatively small plausibilities in this figure can be interpreted and, since we are interested in models which are both small and sufficiently plausible, we focus on maximum plausibility for each . The maximum plausibilities for and are both rather small; the maximum value (0.09) in the column corresponds to the model that contains variables lcavol and svi only. If we move to variables, then the largest plausibility value jumps to 0.44, corresponding to the model that adds the variable lweight to the previous two-variable model. In fact, this latter three-variable model is the one that is selected by lasso. The plot reveals that there is a lot more uncertainty, i.e., “don’t know,” for model assertions that allow for the possibility that lweight is included in the true model. We conclude that data only provides evidence to support the claim that the true model is a subset of ; see Section 4.4.

### 4.3 On post-selection inference

Using the balanced hyper-cube predictive random set in Corollary 1, we can construct a plausibility function for singletons :

 plY(θ;S):=plY({θ};S)=1−F(∥^σ−1(^θ−θ)∥∞).

This plausibility function satisfies and is decreasing in away from with hyper-cube shaped contours. Consequently, by thresholding the plausibility function at level , we obtain the plausibility region

 {θ:plY(θ;S)≥α}={θ:F(∥^σ−1(^θ−θ)∥∞)≤1−α}.

Since the predictive random set is admissible, it follows from the general IM theory that the plausibility region above has nominal frequentist coverage . Moreover, the shape of these plausibility regions is a hyper-cube, with side lengths characterized by quantiles of the -norm of multivariate Student-t random vectors.

An important consequence of our focus on validity simultaneously across a collection of assertions is that a naive projection of these plausibility regions to any sub-model gives a new hyper-cube plausibility region that also has nominal frequentist coverage. Since this conclusion holds uniformly in , it also holds if is chosen based on data. Such considerations are relevant in the context of post-selection inference. As Berk et al. (2013) argue, when data is used to select a model, then one cannot use the model-specific distribution theory for valid inference. This leads to the fundamental question of how to achieve valid inference when the model is chosen based on data. They propose a procedure they call “POSI” for valid post-selection inference, and it turns out that their procedure is identical to that obtained by projection of the above plausibility region to a sub-model selected by data. The take-away message here is that, by insisting on simultaneous validity over all relevant assertions, the IM approach can automatically handle the challenging post-selection inference problem. This is a consequence of our focus on the main goal of valid uncertainty quantification about the model.

### 4.4 IM-driven variable selection

Section 4.1 showed how to optimally quantify model uncertainty, and the previous section explained how this implies valid post-selection inference. It is also possible to develop an IM-driven variable selection procedure having desirable properties. We want to be clear that valid uncertainty quantification about the model is our primary goal, and the result of a variable selection procedure is only one kind of summary of the IM output.

Intuitively, a model or collection of variables is supported by data if is not too small. In other words, good models are those which are “sufficiently plausible” given data. Following this idea, and the remarks in Section 4.2, an IM-driven approach for variable selection would be to fix and then pick the smallest collection such that . That is, define

 ^Jα(Y)=smallest set J such that plY(CJ;S)>α. (14)

We claim that the IM-driven procedure (14) satisfies a selection validity property:

 PY|J{^Jα(Y)⊆J}≥1−α,∀J⊆{1,2,…,p}. (15)
###### Theorem 6.

The procedure (14) has the selection validity property (15).

###### Proof.

implies , so apply Theorem 5. ∎

To implement the above procedure, it is not necessary to evaluate the plausibility function at for each . In fact, the plausibility depends on the value of , so we only need to look at different models, based on a sorting of the t-statistics by their magnitude. Let be a permutation that ranks the values in descending order according to their magnitudes, i.e., , and then compute

 mplY(s)=maxJ:|J|=splY(CJ;S)=1−F(|Tπ(s+1)|),s=0,1,…,p−1, (16)

and . This is a naive “marginal” plausibility function for the size of the model (Shafer 1987, Sec. G). If is the smallest such that , then

 ^Jα(Y)=π−1({1,…,s⋆α}). (17)

For example, for the prostate cancer data discussed in Section 4.2, based on the summary in Figure 2, if set the cutoff at , then the IM-based rule (14) selects the two-variable model that includes lcavol and svi.

A simulation study is performed to assess the performance of this variable selection procedure. We generate 1000 data sets consisting of observations , where the rows of the predictor variable matrix are draws from with an autoregressive correlation structure, i.e., for all . This model was used by Tibshirani (1996) in his simulation study. We take and and consider two different scenarios:

Scenario 1.

, ;

Scenario 2.

, .

The relative comparisons based on simulations under some different settings were the same as here so these are not presented here. For each scenario, we vary the sample size from 50 to 5000. For each simulated data set, the IM-driven variable selection procedure is applied. Performance of this approach is compared to the lasso, where the tuning parameter is chosen using the 10-fold cross validation, the adaptive lasso, AIC, and BIC. AIC and BIC variable selection is performed using exhaustive search of all possible models. Results of these approaches are summarized in Figures 3 and 4 for the two scenarios. Specifically, we plot the percentage of selecting the true model, where all important variables are selected (Panel A), a subset of the true model, where some but not all important variables and no unimportant variables are selected (Panel B), a superset of the true model, where all important variables and at least one unimportant variable are selected (Panel D), and the other models, where at least one important variable is not selected and at least one unimportant variable is selected (Panel E). In order to show the selection validity of the IM-driven approach, we also plot the percentage of selecting the true model or a subset of the true model, where some or all important variables are selected while no unimportant variables are selected (Panel C).

In both scenarios, the IM-driven approach performs well in terms of how frequently it selects the true model—similar to BIC but better than lasso and AIC. When it misses the true model, the IM-driven approach tends to select a subset of the important variables in both scenarios. Other approaches, on the other hand, are more likely to select unimportant variables. This is especially the case in Scenario 2 with a larger number of variables, where these approaches miss important variables and incorrectly select unimportant variables (Panel E). Overall, the IM-based approach maintains the 95% level for selecting the important variables without picking up unimportant variables (Panel C), a consequence of Theorem 6. The other approaches, except for BIC and adaptive lasso, are generally far off the mark. That BIC and adaptive lasso can reach above 95% for selecting true model or subset of true model is due to their asymptotic variable selection consistency property, not because they are properly calibrated at any fixed .

The take-away message is that the variable selection procedure (14) derived from our IM is surely competitive with some of the standard methods used in the literature. Other ad hoc selection procedures can be used which may beat our IM-driven variable selection procedure with respect to some criterion, but these procedures will always fall short of providing uncertainty quantification about the model.

## 5 Discussion

This paper introduces the concept of multiple simultaneous assertions in the IM framework, and develops a theory of optimal predictive random sets. These general principles are then applied, in a regression setting, to provide a valid quantification of the uncertainty in the model. To our knowledge, the IM approach is the only method known to satisfy this important property. The IM’s simultaneous validity property sheds light on the post-selection inference problem, and also leads to a variable selection procedure with desirable frequentist error rate control. Our simulation study demonstrates that, overall, the proposed IM-based procedure performs as good or better than several standard methods, which suggests that the IM’s emphasis on valid probabilistic uncertainty quantification does not come at the cost of decreased efficiency. Extending the developments here for application in generalized linear model settings is a focus of ongoing research.

The IM approach has already been applied to other problems that involve multiplicity, such as large-scale multinomial inference in genome wide association studies and multiple testing problems (Liu and Xie 2014a, b). We expect that the optimality considerations here, applied to those problems, would lead to some overall improvements. Perhaps the most important question is how to extend the developments in this paper to the high-dimensional regression context. The basic principles of optimal predictive random sets developed here for simultaneous complex assertions are not specific to the case; however, the initial dimension reduction steps do not carry over directly to the case. We expect that once the A-step can be completed for , ideas similar to those presented here can be applied.

## Acknowledgements

The authors thank the Editor, Associate Editor, and referees for their valuable comments on an earlier draft of this manuscript. This work was partially supported by the U.S. National Science Foundation: DMS–1007678, DMS–1208833, and DMS–1208841. The authors also thank Professor Jan Hannig for suggesting a possible connection between our work and that of Berk et al. (2013).

## Appendix A Proofs

### a.1 Proof of Theorem 1

Proposition 1 in Martin and Liu (2013) shows that , for any admissible and any . By assumption, the collection is nested, so it can be taken as the support of an admissible predictive random set. In this case, since satisfies (5), we have that for all . Therefore, the belief function attains its upper bound for each , hence, it is optimal.

### a.2 Proof of Theorem 2

Since and are simple assertions, the respective optimal predictive random sets and have focal elements given by and , as ranges over . Without loss of generality, assume that the predictive random set for is admissible. That is, assume has a nested support and that satisfies (5). Define

 Sj(T)=closure{⋂y:UAj(y)⊃TUAj(y)}j=1,2,T∈T.

Collect intersections of these sets, . Since is nested and the function is monotone, the collection is also nested. Define a new predictive random set , supported on , with the natural measure as in (5). This predictive random set satisfies the conditions stated in the theorem, i.e., admissible and has focal elements as intersections of the respective optimal predictive random set focal elements. We need to show that if and only if , where . One direction is easy, since it is clear that . For the other direction, we want to show that implies . Since and are disjoint, and the union is , implies that either or . By definition of and , it follows that

In either case, is contained in , which completes the argument that and are equivalent. Finally, since , we get for all and, therefore,

 PT{T⊂UA(x)}=supT:T⊂UA(x)PU(T)≤supT:S(T)⊂UA(x)PU(S(T))=PS{S⊂UA(x)}.

The left-hand side is and the right-hand side is , and the inequality holds for all , completing the proof.

### a.3 Proof of Theorem 3

The proof here is similar to that of Theorem 2. Consider collection of complex assertions, where each can be written as a union of disjoint simple assertions, i.e., , where and the a-events and are nested. By Theorem 2, we know that the optimal predictive random sets for , , have focal elements , indexed by a set , which are intersections of a-events. That is, is (the closure of) for some and .

Without loss of generality, assume that the candidate predictive random set for the assertion generated by is admissible in the sense of Definition 1. Given a focal element of , define

 ~Sj(T)=⋂v:Sj(v)⊃TSj(v),j∈J.

Next, set and define . Now take to have support and natural measure as in (5); this satisfies the conditions of the theorem. It remains to show that is more efficient than .

As in the proof of Theorem 2, we need to show that if and only if for each and . By construction, , so one direction is handled. For the other direction, assume that . Then for some and, since splits as a disjoint union of simple assertions, we get that or . Then and, consequently, the no-bigger must be a subset of . Since if and only if , the claimed superiority of to follows just like in the last part of the proof of Theorem 2.

### a.4 Proof of Lemma 1

Recall that is defined as . Since has the natural measure (5), we can write, for any ,

 belx(A;S)>b ⟺PS{S⊂UA(x)}>b ⟺supr:Sr⊂UA(x)PU(Sr)>b ⟺supr:Sr⊂UA(x)(1−r)>b ⟺UA(x)⊃S1−b.

Therefore, we have that

 PX|θ{belX(A;S)>b}=PX|θ{UA(X)⊃S1−b},∀b∈[0,1],

which proves the claim, with .

### a.5 Proof of Theorem 4

Take any predictive random set as in Theorem 3, and let be a generic focal element. Take any assertion generated by . Define the core of as ; note that is balanced and satisfies for all . Then,

 PX|θ{UA(X)⊃S∘}≥PX|θ{UA(X)⊃gS},∀g∈GA,∀θ∈A,

with strict inequality in general. Maximin optimality requires that we choose the focal elements to maximize the minimum (over ) of the right-hand side of the above display. However, we can clearly attain the upper bound above by taking the focal element equal to its core, i.e., balanced. Therefore, balance implies maximin optimality.

## References

• Akaike (1973) Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971), pages 267–281. Akadémiai Kiadó, Budapest.
• Berger et al. (2009) Berger, J. O., Bernardo, J. M., and Sun, D. (2009). The formal definition of reference priors. Ann. Statist., 37(2):905–938.
• Berk et al. (2013) Berk, R., Brown, L., Buja, A., Zhang, K., and Zhao, L. (2013). Valid post-selection inference. Ann. Statist., 41(2):802–837.
• Bühlmann (2013) Bühlmann, P. (2013). Statistical significance in high-dimensional linear models. Bernoulli, 19(4):1212–1242.
• Clyde and George (2004) Clyde, M. and George, E. I. (2004). Model uncertainty. Statist. Sci., 19(1):81–94.
• Dempster (2008) Dempster, A. P. (2008). The Dempster–Shafer calculus for statisticians. Internat. J. Approx. Reason., 48(2):365–377.
• Ermini Leaf and Liu (2012) Ermini Leaf, D. and Liu, C. (2012). Inference about constrained parameters using the elastic belief method. Internat. J. Approx. Reason., 53(5):709–727.
• Fraser (2011) Fraser, D. A. S. (2011). Is Bayes posterior just quick and dirty confidence? Statist. Sci., 26(3):299–316.
• Fraser et al. (2010) Fraser, D. A. S., Reid, N., Marras, E., and Yi, G. Y. (2010). Default priors for Bayesian and frequentist inference. J. R. Stat. Soc. Ser. B Stat. Methodol., 72(5):631–654.
• Genz and Bretz (2002) Genz, A. and Bretz, F. (2002). Comparison of methods for the computation of multivariate probabilities. J. Comput. Graph. Statist., 11(4):950–971.
• Hannig et al. (2015) Hannig, J., Iyer, H., Lai, R. C. S., and Lee, T. C. M. (2015). Generalized fiducial inference: A review. J. Amer. Statist. Assoc., to appear.
• Lai et al. (2015) Lai, R. C. S., Hannig, J., and Lee, T. C. M. (2015). Generalized fiducial inference for ultrahigh dimensional regression. J. Amer. Statist. Assoc., 110:760–772.
• Liu and Xie (2014a) Liu, C. and Xie, J. (2014a). Large scale two sample multinomial inferences and its applications in genome-wide association studies. Internat. J. Approx. Reason., 55(1, part 3):330–340.
• Liu and Xie (2014b) Liu, C. and Xie, J. (2014b). Probabilistic inference for multiple testing. Internat. J. Approx. Reason., 55(2):654–665.
• Lockhart et al. (2014) Lockhart, R., Taylor, J., Tibshirani, R. J., and Tibshirani, R. (2014). A significance test for the lasso. Ann. Statist., 42(2):413–468.
• Martin and Liu (2013) Martin, R. and Liu, C. (2013). Inferential models: A framework for prior-free posterior probabilistic inference. J. Amer. Statist. Assoc., 108(501):301–313.
• Martin and Liu (2014) Martin, R. and Liu, C. (2014). A note on -values interpreted as plausibilities. Statist. Sinica, 24:1703–1716.
• Martin and Liu (2015a) Martin, R. and Liu, C. (2015a). Conditional inferential models: Combining information for prior-free probabilistic inference. J. R. Stat. Soc. Ser. B, 77(1):195–217.
• Martin and Liu (2015b) Martin, R. and Liu, C. (2015b). Marginal inferential models: Prior-free probabilistic inference on interest parameters. J. Amer. Statist. Assoc., 110:1621–1631.
• Martin et al. (2015) Martin, R., Mess, R., and Walker, S. G. (2015). Empirical Bayes posterior concentration in sparse high-dimensional linear models. Bernoulli, to appear; arXiv:1406.7718.
• Molchanov (2005) Molchanov, I. (2005). Theory of Random Sets. Probability and Its Applications (New York). Springer-Verlag London Ltd., London.
• O’Hara and Sillanpää (2009) O’Hara, R. B. and Sillanpää, M. J. (2009). A review of Bayesian variable selection methods: what, how and which. Bayesian Anal., 4(1):85–117.
• Schwarz (1978) Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., 6(2):461–464.
• Shafer (1987) Shafer, G. (1987). Belief functions and possibility measures. In Bezdek, J. C., editor, The Analysis of Fuzzy Information, Vol. 1: Mathematics and Logic, pages 51–84. CRC.
• Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288.
• Tibshirani (2011) Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: a retrospective. J. R. Stat. Soc. Ser. B Stat. Methodol., 73(3):273–282.
• Xie and Singh (2013) Xie, M. and Singh, K. (2013). Confidence distribution, the frequentist distribution of a parameter – a review. Int. Statist. Rev., 81(1):3–39.
• Zellner (1986) Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with -prior distributions. In Bayesian Inference and Decision Techniques, volume 6 of Stud. Bayesian Econometrics Statist., pages 233–243. North-Holland, Amsterdam.
• Zhang et al. (2011) Zhang, Z., Xu, H., Martin, R., and Liu, C. (2011). Inferential models for linear regression. Pak. J. Statist. Oper. Res., 7:413–432.
• Zou (2006) Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc., 101(476):1418–1429.
• Zou and Hastie (2005) Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol., 67(2):301–320.

Supplementary Material

## S  Measure-theoretic details

### S.1  Random sets and the admissibility condition

Molchanov (2005) gives a comprehensive treatment of the theory of random sets. Our goal here is present the minimal amount of technical details necessary to understand our analysis involving predictive random sets. To start, the standard theory of random sets focuses on the case of closed random sets, i.e., random sets whose values are closed sets (with probability 1). Write for the support of our random set , a collection of subsets of the space ; we assume that contains both and . Let be a separable metric space and take to be a -algebra of subsets that contains all the closed subsets of . Assume that each set, or focal element, is closed, relative to the topology on .

To define the random set , consider a probability space , and let be mapping from to which is measurable in the sense that

 {ω:S(ω)∩K≠∅}∈A,% for all compact K⊆U.

We define the distribution