# Selective Sequential Model Selection

William Fithian, Jonathan Taylor, Robert Tibshirani, and Ryan J. Tibshirani
###### Abstract

Many model selection algorithms produce a path of fits specifying a sequence of increasingly complex models. Given such a sequence and the data used to produce them, we consider the problem of choosing the least complex model that is not falsified by the data. Extending the selected-model tests of fithian2014optimal, we construct $p$-values for each step in the path which account for the adaptive selection of the model path using the data. In the case of linear regression, we propose two specific tests, the max-$t$ test for forward stepwise regression (generalizing a proposal of buja2014), and the next-entry test for the lasso. These tests improve on the power of the saturated-model test of tibshirani2014exact, sometimes dramatically. In addition, our framework extends beyond linear regression to a much more general class of parametric and nonparametric model selection problems.

To select a model, we can feed our single-step $p$-values as inputs into sequential stopping rules such as those proposed by gsell2013sequential and li2015accumulation, achieving control of the familywise error rate or false discovery rate (FDR) as desired. The FDR-controlling rules require the null $p$-values to be independent of each other and of the non-null $p$-values, a condition not satisfied by the saturated-model $p$-values of tibshirani2014exact. We derive intuitive and general sufficient conditions for independence, and show that our proposed constructions yield independent $p$-values.

## 1 Introduction

Many model selection procedures produce a sequence of increasingly complex models, leaving the data analyst to choose among them. Given such a path, we consider the problem of choosing the simplest model in the path that is not falsified by the available data. Examples of path algorithms include forward stepwise linear regression, least angle regression (LAR), and the ever-active path in lasso ($\ell_1$-regularized) regression. tibshirani2014exact study methods for generating exact $p$-values at each step of these path algorithms, and their methods provide a starting point for our proposals. Other related works include loftus2014significance, who describe $p$-values for path algorithms that add groups of variables (instead of individual variables) at each step, and choi2014selecting, who describe $p$-values for steps of principal components analysis (each step marking the estimation of a principal component direction).

We consider the following workflow: to select a model, we compute a sequential $p$-value at each step of the model path, and then feed these $p$-values into a stopping rule that is guaranteed to control the false discovery rate (FDR) (benjamini1995controlling), familywise error rate (FWER), or a similar quantity. Recently gsell2013sequential and li2015accumulation proposed sequential stopping rules of this kind. Both sets of rules require that once we have reached the first correct model in our path, the $p$-values in subsequent steps are uniform and independent. While this is not true of the $p$-values constructed in tibshirani2014exact, in this paper we develop a theoretical framework for constructing sequential $p$-values satisfying these properties, and give explicit constructions.

Our approach and analysis are quite general, but we begin by introducing a specific example: the selective max-$t$ test for forward stepwise regression. This is a selective sequential version of the max-$t$ test of buja2014, generalized using the theory in fithian2014optimal.

### 1.1 The max-t Test in Forward Stepwise Regression

Forward stepwise regression is a greedy algorithm for building a sequence of nested linear regression models. At each iteration, it augments the current model by including the variable that will minimize the residual sum of squares (RSS) of the next fitted model. Equivalently, it selects the variable with the largest absolute $t$-statistic, adjusting for the variables in the current model.

For a design matrix $X$, with columns $X_1, \dots, X_p$, and a response vector $Y \in \mathbb{R}^n$, let $E$ denote the current active set, the set of predictor variables already selected, and let $\mathrm{RSS}(E)$ denote the residual sum of squares for the corresponding regression model. For $j \notin E$, let $T_j$ denote the $t$-statistic of variable $j$, adjusting for the active variables. Using the standard result that

$$T_j^2 = (n - |E| - 1)\,\frac{\mathrm{RSS}(E) - \mathrm{RSS}(E \cup \{j\})}{\mathrm{RSS}(E \cup \{j\})},$$

we see that the next variable selected is

$$j^* = \arg\max_{j \notin E} |T_j|,$$

with corresponding $t$-statistic $T_{j^*}$. The selective max-$t$ test rejects for large values of $|T_{j^*}|$, compared to an appropriate conditional null distribution that accounts for the adaptive selection of the model path.
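To make the greedy step concrete, here is a minimal sketch in code (our own illustrative implementation, not from the original paper; the function names and the use of `numpy.linalg.lstsq` are our choices). It selects the variable whose inclusion most reduces the RSS and reports the corresponding absolute $t$-statistics via the standard partial $t$-statistic formula $T_j^2 = (n - |E| - 1)\,(\mathrm{RSS}(E) - \mathrm{RSS}(E \cup \{j\}))/\mathrm{RSS}(E \cup \{j\})$.

```python
import numpy as np

def rss(X, y, cols):
    # Residual sum of squares after regressing y on the columns X[:, cols].
    if not cols:
        return float(y @ y)
    beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    resid = y - X[:, cols] @ beta
    return float(resid @ resid)

def forward_step(X, y, active):
    # One forward stepwise step: for each inactive j, compute the partial
    # |t|-statistic; return the maximizer (equivalently, the minimizer of
    # RSS(E u {j})) along with all the |t|-statistics.
    n = X.shape[0]
    rss_old = rss(X, y, active)
    t_abs = {}
    for j in range(X.shape[1]):
        if j in active:
            continue
        rss_new = rss(X, y, active + [j])
        df = n - len(active) - 1  # residual df after adding variable j
        t_abs[j] = np.sqrt(max(df * (rss_old - rss_new) / rss_new, 0.0))
    j_star = max(t_abs, key=t_abs.get)
    return j_star, t_abs
```

Running `forward_step` repeatedly, appending `j_star` to `active` each time, traces out the forward stepwise path.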

Table 1 illustrates the max-$t$ test and two others applied to the diabetes data from LARS, consisting of observations of $n = 442$ patients. The response of interest is a quantitative measure of disease progression one year after baseline, and there are ten measured predictors (age, sex, body-mass index, average blood pressure, and six blood serum measurements) plus quadratic terms, giving a total of 64 features. We apply forward stepwise regression to generate the model path (beginning with an intercept term, which we represent as a predictor $X_0$), and then use each of the three methods to generate $p$-values at each step along the path. Finally, we use the ForwardStop rule (gsell2013sequential) at a fixed FDR level $q$ to select a model based on the sequence of $p$-values. The bolded entry in each column indicates the last variable selected by ForwardStop.

The nominal (unadjusted) $p$-values in column 1 of Table 1 are computed by comparing $T_{j_k}$ to the $t$-distribution with the appropriate residual degrees of freedom, which would be the correct distribution if the sequence of models were selected before observing the data. Because the model sequence is in fact selected adaptively to maximize the absolute value of $T_{j_k}$, this method is highly anti-conservative.

The max-$t$ test $p$-values in column 3 are computed using the same test statistic $T_{j_k}$, but compared with a more appropriate null distribution. We can simulate the distribution of $T_{j_k}$ under the null model, i.e., the linear regression model specified by active set $E$:

$$M(E):\quad Y \sim N(X_E\beta_E,\ \sigma^2 I_n), \qquad \beta_E \in \mathbb{R}^{|E|},\ \sigma^2 > 0,$$

where $X_E$ is the matrix with columns $X_j$, $j \in E$. As $U(E) = (X_E'Y, \|Y\|^2)$ is the complete sufficient statistic for $M(E)$, we can sample from the conditional null distribution of $T_{j_k}$ given $U(E_{k-1})$ and the current active set $E_{k-1}$, using the selected-model test framework of fithian2014optimal. In step one, because $E_0$ is fixed and the test statistic is independent of $U(E_0)$, the conditional test reduces to the max-$t$ test proposed in buja2014. In later steps, $T_{j_k}$ and $U(E_{k-1})$ are not conditionally independent given $E_{k-1}$, but we can numerically approximate the null conditional distribution through Monte Carlo sampling.
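For the first step, the conditional sampling is especially simple. The sketch below (our own illustrative code, not the authors' software) exploits the fact that the $t$-statistics are invariant to the location and scale of $Y$, which are exactly the sufficient statistics of the intercept-only model, so under the null the max-$|t|$ statistic can be simulated by drawing standard normal responses.

```python
import numpy as np

def max_abs_t(X, y):
    # Largest absolute t-statistic over simple regressions of y on each
    # column of X, each fit with an intercept.
    n = X.shape[0]
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    sxx = (Xc ** 2).sum(axis=0)
    bhat = (Xc * yc[:, None]).sum(axis=0) / sxx      # per-predictor slopes
    rss = np.maximum((yc ** 2).sum() - bhat ** 2 * sxx, 0.0)
    se = np.sqrt(rss / (n - 2) / sxx)
    return float(np.max(np.abs(bhat / se)))

def max_t_pvalue_step1(X, y, B=2000, rng=None):
    # Monte Carlo p-value for the step-one max-|t| test under the
    # intercept-only null model.
    rng = np.random.default_rng(rng)
    obs = max_abs_t(X, y)
    sims = np.array([max_abs_t(X, rng.standard_normal(len(y)))
                     for _ in range(B)])
    return (1 + np.sum(sims >= obs)) / (B + 1)
```

In later steps the statistic is no longer independent of the sufficient statistics, so one must sample from the conditional distribution directly, e.g., by accept-reject or MCMC sampling as discussed in Section 6.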

The saturated-model $p$-values in column 2 also account for selection, but they rely on somewhat different assumptions, condition on more information, and test a slightly different null hypothesis. We discuss these distinctions in detail in Section 2.3.

For most of the early steps in Table 1, the (invalid) nominal test gives the smallest $p$-values, followed by the max-$t$ test, then the saturated-model test. Both the max-$t$ and saturated-model $p$-values are exactly uniform under the null, but the max-$t$ $p$-values appear to be more powerful under the alternative. As we will discuss in Section 5, selected-model tests such as the max-$t$ can be much more powerful than saturated-model tests in early steps of the path, when multiple strong variables are competing with each other to enter the model first. In the diabetes example, using the max-$t$ $p$-values, ForwardStop selects a model of size 8, compared to size 3 when using the saturated-model $p$-values.

The null max-$t$ $p$-values are independent, while the saturated-model $p$-values are not; hence ForwardStop is guaranteed to control FDR when using the former but not the latter. In Section 4 we discuss intuitive sufficient conditions for independence and show that the max-$t$ and many other selected-model tests satisfy these conditions.

### 1.2 Outline

For the remainder of the article we will discuss the problem of selective sequential model selection in some generality, returning periodically to the selective max-$t$ test in forward stepwise regression and its lasso regression counterpart, the next-entry test, as intuitive and practically important running examples.

In comparison to their saturated-model counterparts, we will see that the selected-model tests proposed here have three main advantages: power, independence (in many cases), and generalizability beyond linear regression. These advantages do not come entirely for free, as we will see: when $\sigma^2$ is known, the selected-model methods test a more restrictive null hypothesis than the saturated-model methods. In addition, the selected-model tests require accept-reject or Markov chain Monte Carlo (MCMC) sampling, while the saturated-model tests can be carried out in closed form.

Sections 2 and 3 introduce the general problem setting and review relevant literature on testing ordered hypotheses using a sequence of $p$-values. Many of these methods require the $p$-values to be uniform and independent under the null, and we derive general conditions to ensure this property in Section 4. In Section 5 we contrast selected-model and saturated-model tests in the linear regression setting, and explain why selected-model tests are often much more powerful in early steps of the path. In Section 6 we discuss strategies for sampling from the null distribution for our tests, and prove that in forward stepwise regression and lasso ($\ell_1$-regularized) regression, the number of constraints in the sampling problem is never more than twice the number of variables. We provide a simulation study in Section 7, and Section 8 discusses applications of our framework beyond linear regression to decision trees and nonparametric changepoint detection. The paper ends with a discussion.

## 2 Selective Sequential Model Selection

In the general setting, we observe data $Y$ with unknown sampling distribution $F$, and then apply some algorithm to generate an adaptive sequence of nested statistical models contained in some upper model $M_\infty$:

$$M_0(Y) \subseteq M_1(Y) \subseteq \cdots \subseteq M_d(Y) \subseteq M_\infty.$$

By model, we mean a family of candidate probability distributions for $Y$. For example, in linear regression, a model specifies the set of predictors allowed to have nonzero coefficients (but not the values of their coefficients). Note the "assumption" of an upper model involves no loss of generality: we can always take $M_\infty$ as the union of all models under consideration. We will use the notation $M_{0:k}$ to denote the sequence $(M_0(Y), \dots, M_k(Y))$, and call $M_{0:d}$ the model path. We assume throughout that for each candidate model $M$ there is a minimal sufficient statistic $U(Y; M)$, and write

$$U_k(Y) = U(Y;\, M_k(Y)).$$

Given data $Y$ and model path $M_{0:d}$, we set ourselves the formal goal of selecting the simplest correct model: the model $M_k$ with the smallest $k$ for which $F \in M_k$. Of course, in most real data problems all of the models in the path, as well as all other models under consideration, are incorrect. In more informal terms, then, we seek the simplest model in the path that is not refuted by the available data.

Define the completion index $k_0$ by $k_0 = \min\{k : F \in M_{k-1}(Y)\}$, so that $M_{k_0-1}$ is the first correct model in the path and $H_{k_0}$ is the first true null hypothesis. By construction, $H_k$ is true for all $k \ge k_0$. A stopping rule $\hat k(Y)$ is any estimator of $k_0 - 1$, with $H_k$ "rejected" if $k \le \hat k$, and "accepted" otherwise. Because $\hat k$ is the number of models we do reject, and $k_0 - 1$ is the number we should reject, the number of type I errors is $(\hat k - k_0 + 1)_+$, while the number of type II errors is $(k_0 - 1 - \hat k)_+$. Depending on the scientific context, we might want to control the FWER: $P_F(\hat k \ge k_0)$, the FDR: $E_F\big[(\hat k - k_0 + 1)_+ / (\hat k \vee 1)\big]$, or another error rate such as a modified FDR, defined by the expectation of some loss function $g$:

$$E_F(\hat k(\cdot), g) = E_F\left[g\big(\hat k(Y),\ k_0(Y, F)\big)\right]. \qquad (1)$$
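To fix ideas, here is a small sketch (our own code and naming) of the resulting error counts and Monte Carlo error-rate estimates, under the convention that the hypotheses $k \ge k_0$ are the true nulls and a stopping rule rejects $H_1, \dots, H_{\hat k}$:

```python
import numpy as np

def error_counts(k_hat, k_0):
    # Type I / type II error counts when hypotheses k >= k_0 are the true
    # nulls and the stopping rule rejects hypotheses 1, ..., k_hat.
    v = max(k_hat - k_0 + 1, 0)     # true nulls rejected (type I errors)
    t2 = max((k_0 - 1) - k_hat, 0)  # false nulls accepted (type II errors)
    return v, t2

def fwer_fdr(k_hats, k_0s):
    # Monte Carlo estimates of FWER = P(V >= 1) and FDR = E[V / (k_hat v 1)]
    # from paired samples of (k_hat, k_0).
    vs = np.array([error_counts(kh, k0)[0] for kh, k0 in zip(k_hats, k_0s)])
    k_hats = np.asarray(k_hats)
    return float(np.mean(vs >= 1)), float(np.mean(vs / np.maximum(k_hats, 1)))
```

Choosing $g$ in (1) as the indicator of at least one type I error recovers the FWER, and choosing $g$ as the false discovery proportion recovers the FDR.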

### 2.1 Single-Step p-Values

We will restrict our attention to stopping rules like those recently proposed by gsell2013sequential and li2015accumulation, which operate on a sequence of $p$-values. At each step $k$, we will construct a $p$-value $p_k$ for testing

$$H_k:\ F \in M_{k-1}(Y)$$

against the alternative that $F \notin M_{k-1}(Y)$, accounting for the fact that the null hypothesis $H_k$ is chosen adaptively.

Following fithian2014optimal, we say that for a fixed candidate null model $M$, the random variable $p_{k,M}(Y)$ is a valid selective $p$-value for $M$ at step $k$ if it is stochastically larger than uniform (henceforth super-uniform) under sampling from any $F \in M$, given that $M_{k-1}(Y) = M$. That is,

$$P_F\big(p_{k,M}(Y) \le \alpha \mid M_{k-1}(Y) = M\big) \le \alpha, \qquad \forall F \in M,\ \alpha \in [0,1].$$

Once we have constructed selective $p$-values $p_{k,M}$ for each fixed $M$, we can use them as building blocks to construct a combined $p$-value for the random null $H_k$. Define

$$p_k(y) = p_{k,\, M_{k-1}(y)}(y),$$

which is super-uniform on the event $\{F \in M_{k-1}\}$:

$$P_F\big(p_k \le \alpha \mid F \in M_{k-1}\big) \le \alpha, \qquad \forall \alpha \in [0,1]. \qquad (2)$$

One useful strategy for constructing exact selective tests is to condition on the sufficient statistics of the null model $M_{k-1}$. By sufficiency, the law

$$\mathcal{L}_F\big(Y \mid U_{k-1},\ M_{k-1} = M\big)$$

is the same for every $F \in M$. Thus, we can construct selective tests and $p$-values by comparing the realized value of any test statistic to its known conditional distribution under the null. It remains only to choose the test statistic and compute its conditional null distribution, which can be challenging. See fithian2014optimal for a general treatment. Section 6 discusses computational strategies for the tests we propose in this article.

### 2.2 Sparse Parametric Models

Many familiar path algorithms, including forward stepwise regression, least angle regression (LAR), and lasso regression (thought of as producing a path of models over its regularization parameter $\lambda$), are methods for adaptively selecting a set of predictors in linear regression models, where we observe a random response $Y \in \mathbb{R}^n$ as well as a fixed design matrix $X \in \mathbb{R}^{n \times p}$, whose columns correspond to candidate predictors. For each possible active set $E \subseteq \{1, \dots, p\}$, there is a corresponding candidate model

$$M(E):\quad Y \sim N(X_E\beta_E,\ \sigma^2 I_n),$$

which is a subset of the full model

$$M_\infty:\quad Y \sim N(X\beta,\ \sigma^2 I_n).$$

If the error variance $\sigma^2$ is known, the complete sufficient statistic for $M(E)$ is $X_E'Y$; otherwise it is $(X_E'Y, \|Y\|^2)$.

For the most part, we can discuss our theory and proposals in a parametric setting generalizing the linear regression problem above. Let $M_\infty$ be a model parameterized by $\theta \in \Theta \subseteq \mathbb{R}^J$:

$$M_\infty = \{F_\theta : \theta \in \Theta\}.$$

For any subset $E \subseteq \{1, \dots, J\}$, define the sparse submodel with active set $E$ as follows:

$$\Theta(E) = \{\theta : \theta_j = 0,\ \forall j \notin E\}, \qquad M(E) = \{F_\theta : \theta \in \Theta(E)\}.$$

We can consider path algorithms that return a sequence of nested active sets

$$E_0(Y) \subseteq E_1(Y) \subseteq \cdots \subseteq E_d(Y) \subseteq \{1, \dots, J\},$$

inducing a model path with $M_k(Y) = M(E_k(Y))$ for $k = 0, \dots, d$. We will be especially interested in two generic path algorithms for the sparse parametric setting: forward stepwise paths and ever-active regularization paths. As we will see in Section 4.3, both methods generically result in independent null $p$-values. For a nonparametric example see Section 8.2.

#### 2.2.1 Forward Stepwise Paths and Greedy Likelihood Ratio Tests

Let $\ell(\theta; Y)$ denote the log-likelihood for the model $M_\infty$. A generic forward stepwise algorithm proceeds as follows: we begin with some fixed set $E_0$ (such as an intercept-only model), then at step $k$, we set

$$j_k = \arg\max_j\, \sup\{\ell(\theta; Y) : \theta \in \Theta(E_{k-1} \cup \{j\})\}, \quad \text{and} \quad E_k = E_{k-1} \cup \{j_k\}. \qquad (3)$$

That is, at each step we select the next variable to maximize the likelihood of the next fitted model (in the case of ties, we could either choose randomly or select both variables).

A natural choice of test statistic is the greedy likelihood ratio statistic

$$G_k(Y) = \sup_{\theta \in \Theta(E_k)} \ell(\theta; Y) - \sup_{\theta \in \Theta(E_{k-1})} \ell(\theta; Y), \qquad (4)$$

which is the generalized likelihood ratio statistic for testing $M(E_{k-1})$ against the "greedy" alternative with one more active parameter, $M(E_k)$. The selective greedy likelihood ratio test rejects for large $G_k$, based on the law

$$\mathcal{L}\big(G_k \mid M_{k-1},\ U_{k-1}\big).$$

Because the likelihood in linear regression is a monotone decreasing function of the residual sum of squares, the max-$t$ test is equivalent to the greedy likelihood ratio test. The counterpart of the max-$t$ test in linear regression with known $\sigma^2$ is the max-$z$ test, which differs only in replacing the $t$-statistics with their corresponding $z$-statistics. The max-$z$ test is also equivalent to the selective greedy likelihood ratio test.

For simplicity, we have implicitly made two assumptions: that only one variable is added at each step, and that the set of candidate variables we choose from is the same in each step. It is relatively straightforward to relax either assumption, but we do not pursue such generalizations here.

#### 2.2.2 Ever-Active Regularization Paths and Next-Entry Tests

Another important class of model selection procedures is given by the sequence of ever-active sets along a regularized likelihood path, under a sparsity-inducing regularizer such as a scaled $\ell_1$ norm. The notion of ever-active sets is needed since these solution paths can drop (as well as add) predictors along the way. For some ordered set $\Lambda \subseteq \mathbb{R}$, let $\lambda \in \Lambda$ parameterize the regularization penalty $P_\lambda(\theta)$. As an example, for the lasso penalty, this is $P_\lambda(\theta) = \lambda\|\theta\|_1$ with $\Lambda = (0, \infty)$.

Assume for simplicity that there is a unique solution to each penalized problem of the form

$$\hat\theta_\lambda(Y) = \arg\min_{\theta \in \Theta}\ -\ell(\theta; Y) + P_\lambda(\theta). \qquad (5)$$

It may be impossible or inconvenient to compute $\hat\theta_\lambda$ for every $\lambda \in \Lambda$. If so, we can instead take $\Lambda$ to be a grid of finitely many values.
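For the lasso case, a grid of solutions can be computed with a few lines of proximal gradient descent (ISTA). The following is an illustrative sketch under our own naming, not the authors' software; a production analysis would use a dedicated solver such as coordinate descent with warm starts.

```python
import numpy as np

def soft(x, t):
    # Soft-thresholding: the proximal operator of t * ||.||_1.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_path(X, y, lambdas, n_iter=500):
    # theta_hat_lambda on a grid, minimizing
    #   0.5 * ||y - X theta||^2 + lam * ||theta||_1
    # by ISTA, warm-starting each solution from the previous (larger) lambda.
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2  # Lipschitz constant of the smooth part
    theta = np.zeros(p)
    path = []
    for lam in lambdas:  # lambdas assumed decreasing
        for _ in range(n_iter):
            grad = X.T @ (X @ theta - y)
            theta = soft(theta - grad / L, lam / L)
        path.append(theta.copy())
    return np.array(path)
```

For an orthonormal design the solution has the closed form $\hat\theta_{\lambda,j} = \mathrm{soft}(X_j'y, \lambda)$, which provides a convenient check of the iteration.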

We define the ever-active set for $\lambda$ as

$$\tilde E_\lambda(Y) = \big\{j :\ \hat\theta_{\gamma,j}(Y) \ne 0 \ \text{ for some } \gamma \in \Lambda,\ \gamma \ge \lambda\big\}. \qquad (6)$$

Note that the ever-active sets are nested by construction. In addition, we will assume that only finitely many distinct ever-active sets arise along the path. Our model path will correspond to the sequence of distinct ever-active sets. Formally, let

$$\lambda_0 = \sup\Lambda, \quad \text{and} \quad E_0 = \bigcap_{\lambda \in \Lambda} \tilde E_\lambda,$$

and for $k \ge 1$, let $\lambda_k$ denote the (random) value of $\lambda$ at which the ever-active set changes for the $k$th time:

$$\Lambda_k = \{\lambda \in \Lambda : \tilde E_\lambda \supsetneq \tilde E_{\lambda_{k-1}}\}, \qquad \lambda_k = \sup\Lambda_k, \quad \text{and} \quad E_k = \bigcap_{\lambda \in \Lambda_k} \tilde E_\lambda.$$

In this setting, $\lambda_k$ is a natural test statistic for the model $M(E_{k-1})$, with larger values suggesting a poorer fit. The selective next-entry test rejects for large $\lambda_k$, based on the law

$$\mathcal{L}\big(\lambda_k \mid U_{k-1},\ M_{k-1}\big).$$
##### Remark.

In its usual formulation, the lasso for linear regression minimizes a penalized RSS criterion rather than a penalized negative log-likelihood. If we replace $-\ell(\theta; Y)$ in (5) with any strictly decreasing function of the log-likelihood, such as the RSS, all of the results in this article hold without modification.
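The ever-active sets and knots $\lambda_k$ can be extracted from any solution path computed on a grid. Below is a small illustrative sketch (our own code, not the authors' implementation): given a decreasing grid and the corresponding coefficient vectors, it returns the distinct ever-active sets and the grid values at which they grow, which approximate the next-entry statistics.

```python
import numpy as np

def ever_active_path(lambdas, coefs, tol=1e-12):
    # Distinct ever-active sets E_0, E_1, ... and the knots lambda_k
    # at which they grow.
    #   lambdas : decreasing sequence of penalty values
    #   coefs   : coefs[i, j] = j-th coefficient of the solution at lambdas[i]
    ever = set()
    sets, knots = [], []
    for lam, theta in zip(lambdas, np.asarray(coefs)):
        new = set(int(j) for j in np.flatnonzero(np.abs(theta) > tol)) - ever
        if new or not sets:
            ever |= new
            sets.append(sorted(ever))
            knots.append(lam)
    return sets, knots
```

Because the grid is decreasing, the first grid point at which a variable appears is the largest penalty value at which it is ever active, matching the supremum in the definition of $\lambda_k$ up to the grid resolution.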

### 2.3 Which Null Hypothesis?

Note that in our formulation of the problem, the type I error is defined in a "model-centric" fashion: at step $k$ in linear regression, we are testing whether a particular linear model adequately describes the data. Even if the next selected variable $j_k$ has a zero coefficient in the full model, it is not a mistake to reject $H_k$ provided there are some signal variables that have not yet been included.

Depending on the scientific context, we might want to define a type I error at step $k$ differently, by choosing a different null hypothesis to test. Let $\mu = E_F[Y]$ and let $\theta^E$ denote the least-squares coefficients for active set $E$, the coefficients of the best linear predictor based on the design matrix $X_E$:

$$\theta^E = X_E^{+}\mu = \arg\min_{\theta \in \mathbb{R}^{|E|}} \|\mu - X_E\theta\|_2^2,$$

where $X_E^{+}$ is the Moore-Penrose pseudoinverse of the matrix $X_E$.

gsell2013sequential describe three different null hypotheses that we could consider testing at step $k$ in the case of linear regression:

Complete Null:

$$H_k:\ \mu = X_{E_{k-1}}\theta^{E_{k-1}}.$$

Incremental Null:

$M(E_{k-1})$ may be incorrect, but adding variable $j_k$ is no improvement. That is,

$$H_k^{\mathrm{inc}}:\ \theta^{E_k}_{j_k} = 0.$$

Full-Model Null:

The coefficient of variable $j_k$ is zero in the "full" model with all $p$ predictors. That is,

$$H_k^{\mathrm{full}}:\ \theta^{\{1,\dots,p\}}_{j_k} = 0.$$

While the complete null is the strongest null hypothesis of the three, the incremental null is neither weaker nor stronger than the full-model null. Defining

$$V^{\mathrm{inc}} = \#\{k \le \hat k : H_k^{\mathrm{inc}} \text{ is true}\}, \quad \text{and} \quad V^{\mathrm{full}} = \#\{k \le \hat k : H_k^{\mathrm{full}} \text{ is true}\},$$

we can define an analogous FWER and FDR with respect to each of these alternative choices, and attempt to control these error rates. For example, we could define

$$\mathrm{FDR}^{\mathrm{full}} = E\big[V^{\mathrm{full}} / (\hat k \vee 1)\big],$$

as the false discovery rate with respect to the full-model null. barber2014controlling present a framework for controlling $\mathrm{FDR}^{\mathrm{full}}$.

The full-model null is the most different conceptually from the other two, taking a "variable-centric" instead of "model-centric" viewpoint, with the focus on discovering variables that have nonzero coefficients after adjusting for all other variables under consideration. To elucidate this distinction, consider a bivariate regression example in which the two predictors $X_1$ and $X_2$ are both highly correlated with $Y$, but are also nearly collinear with each other, making it impossible to distinguish which variable has the "true" effect. Any procedure that controls $\mathrm{FDR}^{\mathrm{full}}$ could not select either variable, and would return the empty set of predictors. By contrast, most of the methods presented in this article would select the first variable to enter (rejecting the global null model), and then stop at the univariate model.

Similarly, consider a genomics model with quantitative phenotype $Y$ and predictors $X_1, \dots, X_p$, representing minor allele counts for each of $p$ single-nucleotide polymorphisms (SNPs). If correlation between neighboring SNPs (neighboring columns of $X$) is high, it may be very difficult to identify SNPs that are highly correlated with $Y$, adjusting for all other SNPs; however, we might nevertheless be glad to select a model with a single SNP from each important gene, even if we cannot guarantee it is truly the "best" SNP from that gene.

As the above examples illustrate, methods that control full-model error rates are best viewed not as model-selection procedures, since all inferences are made with respect to the full model, but instead as variable-selection procedures that test multiple hypotheses with respect to a single model, which is specified ahead of time. The "model" returned by such procedures is not selected or validated in any meaningful sense. Indeed, in the bivariate example above, of the four models under consideration ($\emptyset$, $\{X_1\}$, $\{X_2\}$, and $\{X_1, X_2\}$), only the global null model is clearly inconsistent with the data; and yet, a full-model procedure is bound not to return any predictors.

Because the truth or falsehood of full-model hypotheses can depend sensitively on the set of predictors, rejecting $H_k^{\mathrm{full}}$ has no meaning without reference to the list of all variables that we controlled for. As a result, rejections may be difficult to interpret when $p$ is large. Thus, error rates like $\mathrm{FDR}^{\mathrm{full}}$ are best motivated when the full model has some special scientific status. For example, the scientist may believe, due to theoretical considerations, that the linear model in $X_1, \dots, X_p$ is fairly credible, and that a nonzero coefficient for $X_j$, controlling for all of the other variables, would constitute evidence for a causal effect of $X_j$ on the response.

In this article we will concern ourselves primarily with testing the complete null, reflecting our stated aim of choosing the least complex model that is consistent with the data. As we discuss further in Section 5, the saturated-model tests of tibshirani2014exact are valid selective tests of this null. The advantage of these tests is that they are highly computationally efficient (they do not require sampling). But, unfortunately, they also carry a number of drawbacks: they require us to assume $\sigma^2$ is known, can result in a large reduction in power, create dependence between $p$-values at different steps, and are difficult to generalize beyond the case of linear regression.

### 2.4 Related Work

The problem of model selection is an old one, with quite an extensive literature. However, with the exception of the works above, very few methods offer finite-sample guarantees on the model that is selected except in the orthogonal-design case. One exception is the knockoff filter of barber2014controlling, a variant of which provably controls the full-model FDR. We compare our proposal to the knockoff method in Section 7.

Methods like AIC (akaike1974new) and BIC (schwarz1978estimating) are designed for the non-adaptive case, where the sequence of models is determined in advance of observing the data. Cross-validation, another general-purpose algorithm for tuning-parameter selection, targets out-of-sample prediction error and tends to select many noise variables when the signal is sparse (e.g., LY2015). benjamini2009simple extend the AIC by using an adaptive penalty to select a model, based on generalizing the Benjamini-Hochberg procedure (benjamini1995controlling), but do not prove finite-sample control. The stability selection approach of meinshausen2010stability is based on splitting the data many times and offers asymptotic control of the FDR, but no finite-sample guarantees are available.

If the full model is sparse, and the predictors are not too highly correlated, it may be possible asymptotically to recover the support of the full-model coefficients with high probability: the property of sparsistency. Much recent model-selection literature focuses on characterizing the regime in which sparsistency is possible; see, e.g., bickel2009simultaneous, meinshausen2006high, negahban2009unified, van2009conditions, wainwright2009sharp, zhao2006model. Under this regime, there is no need to distinguish between the "model-centric" and "variable-centric" viewpoints. However, the required conditions for sparsistency can be difficult to verify, and in many applied settings they fail to hold. By contrast, the methods presented here require no assumptions about sparsity or about the design matrix $X$, and offer finite-sample guarantees.

## 3 Stopping Rules and Ordered Testing

An ordered hypothesis testing procedure takes in a sequence of $p$-values $p_1, \dots, p_d$ for null hypotheses $H_1, \dots, H_d$, and outputs a decision to reject the initial block $H_1, \dots, H_{\hat k}$ and accept the remaining hypotheses. Note that in our setup the hypotheses are nested, with $H_k$ implying $H_{k+1}$; as a result, all of the false hypotheses precede all of the true ones.

We first review several proposals for ordered-testing procedures, several of which require independence of the null $p$-values conditional on the non-null ones. These procedures also assume the sequence of hypotheses is fixed, whereas in our setting the truth or falsehood of $H_k$ is random, depending on which model is selected at step $k-1$. In Section 3.2 we show that the error guarantees for these stopping rules do transfer to the random-hypothesis setting, provided we have the same independence property conditional on the completion index $k_0$ (recall that $k_0$ is the index of the first true null hypothesis).

### 3.1 Proposals for Ordered Testing of Fixed Hypotheses

We now review several proposals for ordered hypothesis testing along with their error guarantees in the traditional setting, where the sequence of null hypotheses is fixed. In the next section we will extend the analysis to random null hypotheses.

The simplest procedure is to keep rejecting until the first time that $p_k > \alpha$:

$$\hat k_B(Y) = \min\{k : p_k > \alpha\} - 1.$$

We will call this procedure BasicStop. It is discussed in marcus1976. Since $k_0$ is the index of the first true null hypothesis, we have

$$\mathrm{FWER} \le P(p_{k_0} \le \alpha) \le \alpha. \qquad (7)$$

To control the FDR, gsell2013sequential propose the ForwardStop rule:

$$\hat k_F(Y) = \max\left\{k : -\frac{1}{k}\sum_{i=1}^{k}\log(1 - p_i) \le \alpha\right\}.$$

If $p_i$ is uniform, then $-\log(1 - p_i)$ is an $\mathrm{Exp}(1)$ random variable with expectation 1; thus, the sum can be seen as an estimate of the false discovery proportion (FDP):

$$\widehat{\mathrm{FDP}}_k = -\frac{1}{k}\sum_{i=1}^{k}\log(1 - p_i),$$

and we choose the largest model with $\widehat{\mathrm{FDP}}_k \le \alpha$. gsell2013sequential show that ForwardStop controls the FDR if the null $p$-values are independent of each other and of the non-nulls.

li2015accumulation generalize ForwardStop, introducing the family of accumulation tests, which replace $-\log(1 - p)$ with a generic accumulation function $h : [0,1] \to [0,\infty]$ satisfying $\int_0^1 h(p)\,dp = 1$. li2015accumulation show that accumulation tests control a modified FDR criterion provided that the null $p$-values are uniform, and are independent of each other and of the non-nulls.
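These stopping rules are simple to implement. The following sketch (our own code, with our own function names) implements BasicStop and the accumulation-test family, with ForwardStop as the special case $h(p) = -\log(1 - p)$.

```python
import numpy as np

def basic_stop(pvals, alpha):
    # Reject until the first p-value exceeding alpha.
    for k, p in enumerate(pvals, start=1):
        if p > alpha:
            return k - 1
    return len(pvals)

def accumulation_stop(pvals, alpha, h=lambda p: -np.log1p(-p)):
    # Accumulation test: largest k with (1/k) * sum_{i<=k} h(p_i) <= alpha.
    # The default h(p) = -log(1 - p) gives the ForwardStop rule.
    acc = np.cumsum([h(p) for p in pvals]) / np.arange(1, len(pvals) + 1)
    below = np.flatnonzero(acc <= alpha)
    return int(below[-1]) + 1 if below.size else 0
```

Unlike BasicStop, an accumulation test averages the evidence along the path, so a single moderately large $p$-value early on need not halt rejection if later $p$-values are small.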

### 3.2 Ordered Testing of Random Hypotheses

In our problem setting, the sequence of selected models is random; thus, the truth or falsehood of each $H_k$ is not fixed, as assumed in the previous section. Trivially, we can recover the guarantees from the fixed setting if we construct our $p$-values conditional on the entire path $M_{0:d}$; however, this option is rather unappealing for both computational and statistical reasons. In this section we discuss what sort of conditional control the single-step $p$-values must satisfy to recover each of the guarantees in Section 3.1.

We note that conditioning on the current null model $M_{k-1}$ does guarantee that $p_k$ is super-uniform conditional on the event $\{F \in M_{k-1}\}$, the event that the $k$th null is true. Unfortunately, however, it does not guarantee that the stopping rules of Section 3.1 actually control FDR or FWER. Consider the following counterexample.

###### Proposition 1.

Consider linear regression with identity design and known $\sigma^2$, and suppose that we construct the model path data-adaptively: depending on which of two selection events $Y$ falls in, the variables are added to the active set in one of two different orders. At each step we construct selective max-$z$ $p$-values and finally choose $\hat k$ using BasicStop.

For suitable choices of the selection events and of the mean vector, the FWER of this procedure exceeds $\alpha$.

A proof of Proposition 1 is given in the appendix. The problem is that $k_0$ is no longer a fixed index. Consequently, even though $p_k$ is super-uniform for each fixed $k$, the $p$-value corresponding to the first true null hypothesis is stochastically smaller than uniform. In this counterexample, the ordering of the path is informative about $Y$, so that conditional on the selected path $p_{k_0}$ is stochastically smaller than uniform, leaving no room for error in the bound (7).

If, however, we could guarantee that for each $k$,

$$P\big(p_k \le \alpha \mid H_k \text{ true},\ k_0 = k\big) \le \alpha,$$

then BasicStop again would control FWER, by (7). Note that $k_0$ is unknown, so we cannot directly condition on its value when constructing $p$-values. However, because

$$\{k_0 = k\} = \{F \in M_{k-1},\ F \notin M_{k-2}\},$$

it is enough to condition on the subpath $M_{0:k-1}$. As we will see in Section 4, conditioning on the subpath is equivalent to conditioning on $(M_{k-1}, U_{k-1})$ in most cases of interest, including forward stepwise likelihood paths and ever-active regularization paths (but not the path in Proposition 1).

Similarly, the error control guarantees of gsell2013sequential and li2015accumulation do not directly apply to the case where the null hypotheses are random. However, we recover these guarantees if we have conditional independence and uniformity of the null $p$-values given $k_0$ and the non-null $p$-values: that is, if for all $\alpha_{k_0}, \dots, \alpha_d \in [0,1]$, we have

$$P_F\left(p_k \le \alpha_k,\ \forall k \ge k_0 \,\middle|\, k_0,\ p_{1:k_0-1}\right) \le \prod_{k=k_0}^{d} \alpha_k. \qquad (8)$$

For the sake of brevity, we will say that $p$-value sequences satisfying (8) are independent on nulls.

The following proposition shows that independence on nulls allows us to transfer the error-control guarantees of ForwardStop and accumulation tests to the random-hypothesis case.

###### Proposition 2.

Let $\hat k$ be a stopping rule operating on $p$-values $p_1, \dots, p_d$ for nested hypotheses $H_1, \dots, H_d$. For some loss function $g$, let

$$E_g = E\left[g\big(\hat k(p_{1:d}(Y)),\ k_0(Y, F)\big)\right].$$

Suppose that $\hat k$ controls $E_g$ at level $\alpha$ if $H_1, \dots, H_d$ are fixed and the null $p$-values are uniform and independent of each other and the non-nulls. Then, $\hat k$ controls $E_g$ at level $\alpha$ whenever $p_1, \dots, p_d$ satisfy (8).

###### Proof.

For nested hypotheses, $k_0$ completely determines the truth or falsehood of $H_k$ for every $k$. If $\hat k$ controls $E_g$ in the fixed-hypothesis case, it must in particular control $E_g$ conditional on any fixed values of the non-null $p$-values, since we could set

$$p_{1:k_0-1} \sim \prod_{k=1}^{k_0-1} \delta_{a_k}$$

for any sequence $a_1, \dots, a_{k_0-1}$, where $\delta_a$ is a point mass at $a$.

Thus, (8) implies

$$E\left[g\big(\hat k(p_1, \dots, p_d),\ k_0(Y, F)\big) \,\middle|\, k_0(Y, F),\ p_{1:k_0-1}(Y)\right] \overset{\mathrm{a.s.}}{\le} \alpha.$$

Marginalizing over $k_0$ and $p_{1:k_0-1}$ gives $E_g \le \alpha$. ∎

Note that (8) implies in particular that each null $p$-value is uniform given $k_0$; thus, independence on nulls is enough to guarantee that BasicStop and ForwardStop control FWER and FDR, respectively. The next section discusses conditions on the model sequence and the $p$-value sequence under which (8) is satisfied.

## 4 Conditions for Independent p-values

We now develop sufficient conditions for constructing $p$-values with the independent-on-nulls property (8). To begin, we motivate the general theory by discussing a specific case, the max-$t$ test for forward stepwise regression.

### 4.1 Independence of max-t p-Values

Recall that at step $k$, the max-$t$ test rejects for large $|T_{j_k}|$, comparing it to the conditional law

$$\mathcal{L}\big(T_{j_k} \mid E_{k-1},\ X_{E_{k-1}}'Y,\ \|Y\|^2\big). \qquad (9)$$

This conditional distribution is the same for all $F \in M(E_{k-1})$, because $(X_{E_{k-1}}'Y, \|Y\|^2)$ is the complete sufficient statistic for $M(E_{k-1})$.

If $F \in M(E_{k-1})$, then $p_k$ is uniform and independent of the previous $p$-values since:

1. $p_k$ is uniform conditional on $E_{k-1}$ and $U_{k-1} = (X_{E_{k-1}}'Y, \|Y\|^2)$ by construction, and

2. $p_1, \dots, p_{k-1}$ are functions of $E_{k-1}$ and $U_{k-1}$, as we will see shortly.

Informally, the pair $(E_{k-1}, U_{k-1})$ forms a "wall of separation" between $p_{1:k-1}$ and $p_k$, guaranteeing that $p_k$ is independent of $p_{1:k-1}$ whenever $H_k$ is true.

Next we will see why are functions of . Observe that knowing tells us that the variables in are selected first, and knowing is enough information to compute the -statistics for all and . As a result, we can reconstruct the order in which those variables were added. In other words, is a function of for all .

Furthermore, for $s < k$, $p_s$ is computed by comparing $T_s$, which is a function of $(E_s, X_{E_s}'Y, \|Y\|^2)$, to the reference null distribution (9) at step $s$, which is a function of $(E_{s-1}, X_{E_{s-1}}'Y, \|Y\|^2)$. Having verified that under $H_k$, $p_k$ is uniform and independent of $p_{1:k-1}$, we can apply this conclusion iteratively to see that all remaining null $p$-values are also uniform and independent.

By contrast, the saturated-model $p$-values are computed using a reference distribution different from (9), one that depends on information not contained in $(E_{k-1}, X_{E_{k-1}}'Y, \|Y\|^2)$. As a result, saturated-model $p$-values are generally not independent on nulls. We discuss the regression setting in more detail in Section 5.
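For intuition, the reference law (9) can be sampled directly in the Gaussian linear model: conditional on $E_{k-1}$, $X_{E_{k-1}}'Y$ and $\|Y\|^2$, the fitted vector and the residual norm are fixed, and the residual direction is uniform on the sphere orthogonal to the selected columns. The following Monte Carlo sketch (our own illustration, not the authors' software; the design and sample sizes are arbitrary) computes a max-$t$ $p$-value this way.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares from regressing y on the given columns."""
    if not cols:
        return float(y @ y)
    Xs = X[:, cols]
    return float(y @ y - y @ (Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]))

def max_t(X, y, E):
    """Largest |t|-statistic among variables j outside E, each added to E."""
    n = X.shape[0]
    r0, best_t2 = rss(X, y, E), 0.0
    for j in range(X.shape[1]):
        if j in E:
            continue
        r1 = rss(X, y, E + [j])
        best_t2 = max(best_t2, (r0 - r1) / (r1 / (n - len(E) - 1)))
    return np.sqrt(best_t2)

def max_t_pvalue(X, y, E, n_sim=400, seed=1):
    """Monte Carlo p-value against the null law (9): hold X_E'Y and ||Y||^2
    fixed, redraw the residual direction uniformly orthogonal to col(X_E)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    XE = X[:, E]
    fit = XE @ np.linalg.lstsq(XE, y, rcond=None)[0]
    resid = y - fit
    t_obs = max_t(X, y, E)
    hits = 0
    for _ in range(n_sim):
        u = rng.standard_normal(n)
        u = u - XE @ np.linalg.lstsq(XE, u, rcond=None)[0]  # orthogonalize
        ystar = fit + np.linalg.norm(resid) * u / np.linalg.norm(u)
        hits += max_t(X, ystar, E) >= t_obs
    return (1 + hits) / (1 + n_sim)

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 6))
y = X[:, 0] + rng.standard_normal(40)     # only variable 0 carries signal
E1 = [int(np.argmax(np.abs(X.T @ y)))]    # variable entering at step 1
p2 = max_t_pvalue(X, y, E1)               # step-2 p-value (null here)
```

Since $H_2$ holds in this simulation whenever the true variable entered first, repeating the experiment over fresh data would yield approximately uniform values of `p2`.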

### 4.2 General Case

We now present sufficient conditions for independence on nulls generalizing the above ideas, culminating in Theorem 4 at the end of this section.

Define the sufficient filtration for the path as the filtration $(\mathcal{F}_k)_{k=0}^d$ with

 \mathcal{F}_k = \mathcal{F}(M_{0:k}, U_k),

where $\mathcal{F}(Z)$ denotes the $\sigma$-algebra generated by random variable $Z$. By our assumption of minimal sufficiency, $\mathcal{F}_k \subseteq \mathcal{F}_{k+1}$ for $k = 0, \ldots, d-1$, so $(\mathcal{F}_k)$ is indeed a filtration.

For most path algorithms of interest, including forward stepwise and regularized likelihood paths, observing $(M_k, U_k)$ is equivalent to observing $(M_{0:k}, U_k)$, because knowing $(M_k, U_k)$ is enough to reconstruct the subpath $M_{0:k}$. We say that the path satisfies the subpath sufficiency principle (henceforth SSP) if

 \mathcal{F}(M_{0:k}, U_k) = \mathcal{F}(M_k, U_k), \quad k = 0, \ldots, d. (10)

A valid selective $p$-value $p_k$ for testing $H_k$ satisfies, for any $\alpha \in [0, 1]$,

 \mathbb{P}_F\bigl(p_k \leq \alpha \mid M_{k-1}\bigr) \overset{a.s.}{\leq} \alpha \quad \text{on } \{F \in M_{k-1}\}.

We say that a filtration $(\mathcal{F}_k)$ separates the $p$-values $(p_k)$ if

1. $p_k$ is super-uniform given $\mathcal{F}_{k-1}$ on the event $\{F \in M_{k-1}\}$, and

2. $p_k$ and $M_{0:k}$ are measurable with respect to $\mathcal{F}_k$.

If we think of $\mathcal{F}_k$ as representing information available at step $k$, then the first condition means that $\mathcal{F}_{k-1}$ “excludes” whatever evidence may have accrued against the null $H_k$ by step $k-1$, and the second means that any information revealed after step $k$ is likewise irrelevant to determining $p_k$. Separated $p$-values are independent on nulls, as we see next.

###### Proposition 3 (Independence of Separated p-Values).

Let $p_1, \ldots, p_d$ be selective $p$-values for $H_1, \ldots, H_d$, separated by the filtration $(\mathcal{F}_k)_{k=0}^d$.

If the $p$-values are exact then $p_{k+1}, \ldots, p_d$ are independent and uniform given $\mathcal{F}_k$ on the event $\{k_0 = k\}$. If they are super-uniform, then for all $\alpha_{k+1}, \ldots, \alpha_d \in [0,1]$,

 \mathbb{P}_F\bigl(p_{k+1} \leq \alpha_{k+1}, \ldots, p_d \leq \alpha_d \,\big|\, \mathcal{F}_k\bigr) \overset{a.s.}{\leq} \left(\prod_{i=k+1}^{d} \alpha_i\right) \quad \text{on } \{k_0 = k\}. (11)
###### Proof.

Noting that $\{k_0 \leq k\}$ is $\mathcal{F}_k$-measurable, it is enough to show that

 \mathbb{P}_F\bigl(p_{k+1} \leq \alpha_{k+1}, \ldots, p_d \leq \alpha_d \,\big|\, \mathcal{F}_k\bigr)\, 1\{k_0 \leq k\} \overset{a.s.}{\leq} \left(\prod_{i=k+1}^{d} \alpha_i\right) 1\{k_0 \leq k\}. (12)

We now prove (12) by induction. Define $B_k = \{p_k \leq \alpha_k\}$. The base case is

 \mathbb{P}_F\bigl(B_d \mid \mathcal{F}_{d-1}\bigr)\, 1\{k_0 \leq d-1\} \overset{a.s.}{\leq} \alpha_d\, 1\{k_0 \leq d-1\},

which is true by construction of $p_d$ (the first separation condition, since $\{k_0 \leq d-1\} \subseteq \{F \in M_{d-1}\}$). For the inductive case, note that

 \mathbb{P}_F\bigl(B_{k+1}, \ldots, B_d \mid \mathcal{F}_k\bigr)\, 1\{k_0 \leq k\}
  \overset{a.s.}{=} \mathbb{E}_F\bigl[1_{B_{k+1}}\, \mathbb{P}_F\bigl(B_{k+2}, \ldots, B_d \mid \mathcal{F}_{k+1}\bigr)\, 1\{k_0 \leq k+1\} \,\big|\, \mathcal{F}_k\bigr]\, 1\{k_0 \leq k\}
  \overset{a.s.}{\leq} \mathbb{P}_F\bigl[B_{k+1} \mid \mathcal{F}_k\bigr]\, 1\{k_0 \leq k\} \prod_{i=k+2}^{d} \alpha_i
  \overset{a.s.}{\leq} \left(\prod_{i=k+1}^{d} \alpha_i\right) 1\{k_0 \leq k\}.

Lastly, if the $p_k$ are exact then the inequalities above become equalities, implying uniformity and mutual independence. ∎
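A toy simulation illustrates the conclusion of Proposition 3: for exact separated $p$-values the nulls are jointly independent uniforms, so the bound in (11) holds with equality (the dimensions and cutoffs below are arbitrary choices of ours).

```python
import numpy as np

rng = np.random.default_rng(0)
d, k0, n_rep = 5, 2, 40000
alphas = np.array([0.3, 0.4, 0.5])   # cutoffs for the null steps k0+1,...,d

# Exact separated p-values: conditional on the past, each null p-value is
# Uniform(0,1), so the nulls are jointly i.i.d. uniform (Proposition 3).
p_null = rng.uniform(size=(n_rep, d - k0))
joint = np.mean(np.all(p_null <= alphas, axis=1))

# (11) holds with equality here: P(all below cutoffs) = prod(alphas) = 0.06.
assert abs(joint - np.prod(alphas)) < 0.01
```

Super-uniform (conservative) null $p$-values would instead push `joint` below the product, which is the inequality direction in (11).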

The sufficient filtration separates $(p_k)$ if and only if (1) $p_k$ is super-uniform given $(M_{0:k-1}, U_{k-1})$ on $\{F \in M_{k-1}\}$, and (2) $p_k$ is a function of $(M_{0:k}, U_k)$.

To be a valid selective $p$-value for a single step, $p_k$ only needs to be super-uniform conditional on $M_{k-1}$. It may appear more stringent to additionally require that $p_k$ must condition on the entire subpath $M_{0:k-1}$ as well as $U_{k-1}$, but in practice there is often no difference: if the path satisfies the SSP and each $U_k$ is a complete sufficient statistic, then every exact selective $p$-value is also uniform conditional on $\mathcal{F}_{k-1}$.

The requirement that $p_k$ must be $\mathcal{F}_k$-measurable has more bite. For example, we will see in Section 5 that it excludes saturated-model tests in linear regression.

Collecting together the threads of this section, we arrive at our main result: a sufficient condition for $p_{1:d}$ to be independent on nulls.

###### Theorem 4 (Sufficient Condition for Independence on Nulls).

Assume that $U_k$ is a complete sufficient statistic for each candidate model $M_k$, that each $p_k$ is an exact selective $p$-value for $H_k$, and that the path algorithm satisfies the SSP.

If $p_k$ is $\mathcal{F}_k$-measurable for each $k$, then the $p$-value sequence $p_{1:d}$ is independent on nulls.

###### Proof.

We apply the definition of completeness to the function

 h(U(Y; M)) = \alpha - \mathbb{P}\bigl(p_k(Y) \leq \alpha \mid U(Y; M),\, M_{k-1}(Y) = M\bigr).

If $p_k$ is exact given $M_{k-1}$, then

 \mathbb{E}_F\bigl[h(U_{k-1}) \mid M_{k-1} = M\bigr] = 0

for every $F \in M$. Therefore, we must have $h \overset{a.s.}{=} 0$, so $p_k$ is uniform and independent of $U_{k-1}$, conditional on $M_{k-1}$. Because the path satisfies the SSP, $p_k$ is then uniform conditional on $\mathcal{F}_{k-1} = \mathcal{F}(M_{k-1}, U_{k-1})$.

If $p_k$ is also $\mathcal{F}_k$-measurable, then the sufficient filtration separates $p_{1:d}$, and Proposition 3 implies that the sequence is independent on nulls. ∎

##### Remark

If our path algorithm does not satisfy the SSP, we can repair the situation by constructing $p$-values that are uniform conditional on $\mathcal{F}_{k-1} = \mathcal{F}(M_{0:k-1}, U_{k-1})$, rather than conditional on $(M_{k-1}, U_{k-1})$ alone.

### 4.3 Independence for Greedy Likelihood and Next-Entry p-values

In this section, we apply Theorem 4 to establish that in the generic sparse parametric model of Section 2.2, the forward stepwise path and all ever-active regularization paths satisfy the SSP. Moreover, the greedy likelihood ratio statistic and the next-entry statistic are measurable with respect to the sufficient filtrations of the forward stepwise and ever-active regularization paths, respectively. As a result, the $p$-value sequences in each setting are independent on nulls, per Theorem 4.

We begin by proving that both paths satisfy the SSP:

###### Proposition 5.

Forward stepwise paths and ever-active regularization paths both satisfy the subpath sufficiency principle.

###### Proof.

First define the restricted log-likelihood

 \ell_E(\theta; Y) = \begin{cases} \ell(\theta; Y) & \theta \in \Theta(E) \\ -\infty & \text{otherwise.} \end{cases}

For some fixed step $k$ and active set $E$, denote the event $A = \{E_k = E\}$, and let $U$ be the sufficient statistic for the model with active set $E$. As a function of $\theta$, the restricted likelihood depends on the data only through $U$: the log-likelihood decomposes as

 \ell_E(\theta; Y) = \ell^U_E(\theta; U) + \ell^{Y \mid U}_E(Y \mid U).

The second term, the log-likelihood of $Y$ given $U$, does not depend on $\theta$ because $U$ is sufficient.

Recall the forward stepwise path is defined by

 j_k = \arg\max_j \sup\{\ell(\theta; Y) : \theta \in \Theta(E_{k-1} \cup \{j\})\}, \quad \text{and} \quad E_k = E_{k-1} \cup \{j_k\}. (13)

On $A$, we have $E_{s-1} \cup \{j_s\} \subseteq E$ for all $s \leq k$, meaning that the maximum in (13) is attained by some $j \in E$ at every step. So, for $s \leq k$, we have

 j_s = \arg\max_{j \in E} \sup\{\ell_E(\theta; Y) : \theta \in \Theta(E_{s-1} \cup \{j\})\} = \arg\max_{j \in E} \sup\{\ell^U_E(\theta; U(Y)) : \theta \in \Theta(E_{s-1} \cup \{j\})\}.

The above shows that $j_1, \ldots, j_k$ depend on $Y$ only through $U(Y)$, which equals $U_k$ on $A$. As a result, it also follows that the entire subpath $E_{0:k}$ depends only on $(E_k, U_k)$.

As for ever-active regularization paths, if we denote

 \hat\theta(E, \lambda) = \arg\min_{\theta \in \Theta(E)} \; -\ell^U_E(\theta; U(Y)) + P_\lambda(\theta),

then

 \hat\theta(E, \lambda) \overset{a.s.}{=} \hat\theta_\lambda \quad \text{on } A \cap \{\lambda \geq \lambda_k\}.

But $\hat\theta(E, \lambda)$ depends on $Y$ only through $U(Y)$. Therefore, on $A$, we can reconstruct the entire path of solutions $\hat\theta_\lambda$ for $\lambda \geq \lambda_k$, once we know $(E_k, U_k)$. ∎
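The regularization-path argument can be checked numerically for the lasso. In the sketch below (our own illustration, using a naive coordinate-descent solver rather than any published software), the restricted fit $\hat\theta(E,\lambda)$ depends on $Y$ only through $X_E'Y$, so replacing $Y$ by its projection onto the selected columns leaves the active-set solution unchanged.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=2000):
    """Plain cyclic coordinate descent for 0.5*||y - Xb||^2 + lam*||b||_1."""
    n, d = X.shape
    b = np.zeros(d)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            r = y - X @ b + X[:, j] * b[j]       # partial residual
            z = X[:, j] @ r
            b[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_ss[j]
    return b

rng = np.random.default_rng(0)
n, d = 60, 8
X = rng.standard_normal((n, d))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.standard_normal(n)

lam = 0.5 * np.max(np.abs(X.T @ y))   # a lambda early in the path
b_full = lasso_cd(X, y, lam)
E = list(np.nonzero(b_full != 0)[0])  # active set at this lambda

# The restricted fit theta_hat(E, lam) uses Y only through X_E'Y, so it is
# unchanged when y is replaced by its projection onto col(X_E):
XE = X[:, E]
y_proj = XE @ np.linalg.lstsq(XE, y, rcond=None)[0]
b_restr = lasso_cd(XE, y_proj, lam)
assert np.allclose(b_full[E], b_restr, atol=1e-6)
```

The full-path solution restricted to $E$ satisfies the restricted KKT conditions, which is why the two fits coincide on the active set.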

As a direct consequence of Proposition 5 (combined with Theorem 4), greedy likelihood $p$-values for the forward stepwise path, and next-entry $p$-values for ever-active regularization paths, are independent on nulls.

###### Corollary 6.

If $U_k$ is complete and sufficient for model $M_k$ for each $k$, then the selective greedy likelihood $p$-values computed for the forward stepwise path are independent on nulls.

###### Proof.

Per Theorem 4, it is enough to show that $p_k$ is measurable with respect to the sufficient filtration $(\mathcal{F}_k)$ for the forward stepwise path. First, the test statistic $G_k$ is $\mathcal{F}_k$-measurable because

 G_k(Y) = \sup_{\theta \in \Theta(E_k)} \ell(\theta; Y) - \sup_{\theta \in \Theta(E_{k-1})} \ell(\theta; Y) = \sup_{\theta \in \Theta(E_k)} \ell^U_{E_k}(\theta; U_k) - \sup_{\theta \in \Theta(E_{k-1})} \ell^U_{E_k}(\theta; U_k).

Second, the null distribution against which we compare $G_k$ is also $\mathcal{F}_k$-measurable because it depends only on $E_{0:k-1}$ and $U_{k-1}$, both of which are $\mathcal{F}_k$-measurable. ∎
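In the Gaussian linear model with known variance $\sigma^2 = 1$, the greedy likelihood ratio statistic reduces to half the drop in residual sum of squares, which is a function of $(X_{E_k}'Y, \|Y\|^2)$ and the fixed design. The sketch below (our own illustration, with hypothetical active sets) computes $G_k$ both from the raw response and from the sufficient statistic alone.

```python
import numpy as np

def rss_from_suffstat(XtX_E, Xty_E, y_norm2):
    """RSS of regressing Y on X_E, using only ||Y||^2 and X_E'Y."""
    beta = np.linalg.solve(XtX_E, Xty_E)
    return y_norm2 - Xty_E @ beta

rng = np.random.default_rng(0)
n = 40
X = rng.standard_normal((n, 5))
y = X[:, 0] + rng.standard_normal(n)

E_prev, E_cur = [0], [0, 1]            # hypothetical E_{k-1} and E_k
U = (X[:, E_cur].T @ y, y @ y)         # sufficient statistic U_k for step k

# G_k computed directly from Y (sigma^2 = 1, so log-lik = -RSS/2 + const):
def rss(cols):
    Xs = X[:, cols]
    return y @ y - y @ (Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0])
G_direct = (rss(E_prev) - rss(E_cur)) / 2

# The same number from U_k alone (plus the fixed design X):
Xty, y2 = U
G_suff = (rss_from_suffstat(X[:, E_prev].T @ X[:, E_prev], Xty[:1], y2)
          - rss_from_suffstat(X[:, E_cur].T @ X[:, E_cur], Xty, y2)) / 2
assert np.isclose(G_direct, G_suff)
```

The $\ell^{Y\mid U}$ terms cancel in the difference of suprema, which is exactly why the statistic survives on the sufficient statistic alone.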

###### Corollary 7.

If $U_k$ is complete and sufficient for model $M_k$ for each $k$, then the selective next-entry $p$-values computed for any ever-active regularized likelihood path are independent on nulls.

###### Proof.

Per Theorem 4, it is enough to show that $\lambda_k$ is measurable with respect to the sufficient filtration $(\mathcal{F}_k)$ for the regularized-likelihood path. Recall that

 \Lambda_k = \{\lambda \in \Lambda : \tilde{E}_\lambda \supsetneq \tilde{E}_{\lambda_{k-1}}\}, \quad \lambda_k = \sup \Lambda_k, \quad \text{and} \quad E_k = \bigcap_{\lambda \in \Lambda_k} \tilde{E}_\lambda,

and that the $\tilde{E}_\lambda$ are nested by construction and finite by assumption. As a result, there is some $\lambda \in \Lambda_k$ for which $\tilde{E}_\lambda = E_k$.

Furthermore, for all $\lambda \geq \lambda_k$, we have $\tilde{E}_\lambda \subseteq E_k$. As a result, $\hat\theta_\lambda = \hat\theta(E_k, \lambda)$ for such $\lambda$, so $\lambda_k$ depends on $Y$ only through $U_k$, and therefore can be computed from $(E_k, U_k)$. Second, the null distribution against which we compare $\lambda_k$ is also $\mathcal{F}_k$-measurable because it depends only on $E_{0:k-1}$ and $U_{k-1}$, both of which are $\mathcal{F}_k$-measurable. ∎
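As a minimal illustration for the lasso (squared-error loss with penalty $\lambda\|\theta\|_1$), the first knot is $\lambda_1 = \max_j |X_j'Y|$, attained at the entering variable $j_1$, and it can be recomputed from $E_1 = \{j_1\}$ and $X_{E_1}'Y$ alone (simulated design; our own sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
y = 2 * X[:, 2] + rng.standard_normal(50)

# First lasso knot: the value of lambda at which the first variable enters.
scores = np.abs(X.T @ y)
j1 = int(np.argmax(scores))
lam1 = scores[j1]

# lam1 is measurable w.r.t. the sufficient filtration at step 1: it can be
# recomputed from E_1 = {j1} and the sufficient statistic X_{E_1}'Y.
U1 = X[:, [j1]].T @ y
assert np.isclose(lam1, abs(U1[0]))
```

Later knots admit the same reduction, but require reconstructing the subpath of solutions from $(E_k, U_k)$ as in Proposition 5.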

As we have shown, the selective max-$t$ test is equivalent to the selective greedy likelihood test in the case of linear regression. In the next section, we compare and contrast the max-$t$, next-entry, and other selected-model tests with the saturated-model tests proposed by tibshirani2014exact and others.

## 5 Selective p-Values in Regression

In linear regression, fithian2014optimal draw a distinction between two main types of selective test that we might perform at step $k$: tests in the selected model, and tests in the saturated model. For simplicity, we will assume throughout this section that our path algorithm adds exactly one variable at each step. We also assume the path algorithm satisfies the SSP, so that we need not worry about the distinction between conditioning on