1 Introduction

Linear regression model selection using -values

when the model dimension grows

By Piotr Pokarowski 111Institute of Applied Mathematics and Mechanics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw. E-mail: pokar@mimuw.edu.pl. , Jan Mielniczuk 222Institute of Computer Science, Polish Academy of Sciences and Faculty of Mathematics and Information Sciences, Warsaw University of Technology. E-mail: miel@mini.pw.edu.pl. and PawełTeisseyre 333Institute of Computer Science, Polish Academy of Sciences. E-mail: teisseyrep@ipipan.waw.pl.

Abstract. We consider a new criterion-based approach to model selection in linear regression. Properties of selection criteria based on -values of a likelihood ratio statistic are studied for families of linear regression models. We prove that such procedures are consistent i.e. the minimal true model is chosen with probability tending to 1 even when the number of models under consideration slowly increases with a sample size. The simulation study indicates that introduced methods perform promisingly when compared with Akaike and Bayesian Information Criteria.
Keywords: model selection criterion; random or deterministic design linear model; -value based methods; Akaike Information Criterion; Bayesian Information Criterion.

## 1 Introduction

We reconsider a problem of model choice for a linear regression

 Y=Xβ+ε, (1)

where is an vector of observations which variability we would like to explain, is a design matrix consisting of vectors of potential regressors collected from objects and is an unknown vector of errors, assumed to have distribution. Vector is an unknown vector of parameters. In the paper we will consider the cases corresponding to experimental and observational data when rows of are either deterministic or random. Suppose that some covariates are unrelated to the prediction of , so that the corresponding coefficients are zero. It is assumed that the true model is a submodel of (1). As it is not a priori known which variables are significant in order to make the last assumption realistic it is natural to let the horizon to grow with and allow in this way potentially large models.
Model selection is a core issue of statistical modeling. In a framework of linear regression the problem has been intensively studied under various conditions imposed on design matrix and growth of . The aim of such procedures is to choose the most parsimonious model describing adequately a given data set. For the review of these advances we refer to Pötscher and Leeb (2008). The main problem here is a modeler’s dillema that underfitting leads to omission of important variables in the model whereas overfitting involves unnecessary parameter estimation for redundant coefficients which lessens the precision of the model fit.
In the article we contribute to a line of research in which the chosen model is the maximiser of a chosen criterion function. In a seminal paper which is typical for this approach Akaike (1970), starting with the idea of maximising the expectation of predictive likelihood, has shown that the usual likelihood has to be modified to obtain an unbiased estimator of the expectation. The likelihood modified in such a way is known as Akaike Information Criterion (AIC). Variety of other modifications of the likelihood followed, with Bayes Information Criterion (BIC) being the most frequently used competitor. Recently, Pokarowski and Mielniczuk (2010) introduced model selection criteria mPVC and MPVC based on -values of a likelihood ratio statistic for families of linear models with deterministic covariates and constant dimension. The idea in the case of minimal -value criterion mPVC is to consider the model selection problem from a point of view of testing a certain null hypothesis against several hypotheses and to choose the hypothesis (the model) for which the null hypothesis is most strongly rejected in its favour. The decision in the case of mPVC is based on a new criterion which is the minimal -value of the underlying test statistics. We stress that the discussed selection method is based on a completely different paradigm than the existing approaches: instead of penalizing the likelihood ratio statistic directly by subtracting a complexity penalty its appropriate function is chosen as a selection criterion.
We study conditions under which such a rule is consistent i.e. it choses the minimal true model with probability tending to 1 when the sample size increases. Our main theoretical result stated in Theorem 1 asserts that this property holds for the minimal -value criterion mPVC provided increases at a slower rate than where are weights appearing in the scaling of -values. Similar result is proved for maximal -value criterion MPVC. Both results apply also to the case when is constant provided the full model (1) is correctly specified. We also introduce and investigate less computationally demanding greedy versions of the discussed methods.
In the last section we present the results of limited simulation study which shows that the introduced methods perform on average better than AIC and BIC criteria. In particular, their performance measured by probability of correct subset detection and prediction error is much more stable when the length of list of models increases i.e. regression model becomes sparse.
In the paper we focus mainly on explanation i.e. finding the model which adequately describes the data. Besides the immediate application of model selection methods to to the second main task of prediction let us mention their use in construction of data-adaptive smooth tests (see e.g. Ledwina (1994)).
Problem of linear model selection when the number of possible predictors increases with the sample size has been studied from different angle by Shao (1997) who defined the optimal submodel to be submodel minimizing the averaged squared prediction error and investigated conditions under which the selected model converges in probability to this model. Moreno et al. (2010) considered Bayesian approach to this problem and proposed using Bayes factors for intrinsic priors as selection criteria.
The main contribution of the present paper is establishing consistency of the criteria based on p-values when the linear model dimension grows. The result is proved for the random design as well as for the fixed design scenario, the former being treated in detail. Intrumental in the proofs are Lemmas 3, 4, 5 which can be also useful for different purposes.
The paper is organized as follows. In Section 2 we introduce considered selection criteria. In Section 3 we discuss the imposed assumptions and consistency results for the family of models consisting of all subsets of predictors as well as hierarchic family. We also introduce greedy modifications of the considered criteria. Section 4 contains proofs of the main results and Section 5 discussion of the results of numerical experiments. Proofs of some auxiliary lemmas are relegated to the Appendix.

## 2 Model Selection criteria for linear regression models based on p-values

We start by explicitly stating the basic assumption we impose on random-design regression model. Assume that the rows of a matrix are iid, , . Throughout we consider the situation that the minimal true model is fixed i.e. it does not change with . Vectors constitute rows in an array of iid sequences of -dimensional random variables. We impose the condition that is nondecreasing and that the law of the first coordinates of coincides with that of i.e. the distribution of attributes considered for a certain sample size remains the same for larger sample sizes. We also assume throughout that the second moments of coordinates of are finite for any . As any submodel of (1) containing variables can be described by set of indexes in order to make notation simpler it will be referred to as model . The minimal true model will be denoted by and will be the number of nonzero coefficients in equation (1). The empty model will be denoted briefly by and the full model (1) by . Note that . Let be a maximum likelihood (ML) estimator of calculated for the considered model . We denote , ML estimator in the full model, briefly by . Let be a certain family of subsets of a set and be a vector of variables which pertain to the minimal true model . Througout this paper with exception of Section 3.2 we will impose the following assumption:
(A0) is positive definite matrix.
The main objective of model selection is to identify the minimal true model using data . Let be the conditional density of given . Consider two models and where the first model is nested within the second model. Denote by likelihood ratio test (LRT) statistic, based on conditional densities given , for testing model is adequate against hypothesis model is adequate whereas is not, equal to

 Dnjk=2logf^βk,^σk2(Y|X)f^βj,^σj2(Y|X), (2)

where and is a sum of squared residuals from the ML fit of the model . We recall that ML estimator coincides with Least Squares estimator of . When and are linear models it turns out that LRT statistic is given explicitly by

 Dnjk=−nlog[RSS(k)RSS(j)]=−nlog(1−Rnjk),

where

 Rnjk=RSS(j)−RSS(k)RSS(j) (3)

is coefficient of partial determination of variables belonging to given that variables in set are included in the model. Under the null hypothesis it follows from Cochran’s theorem (cf. e.g. Section 5.5 in Rencher and Schaalje (2008)) that given and provided is of full column rank.
Let and be univariate cumulative distribution functions and be a test statistic which has distribution function not necessarily equal to . Let . By -value of a test statistic given distribution (null distribution) we will mean . We will consider -values of statistic given Beta distribution with shape parameters and . In order to make notation simpler will be denoted as . We define the following model selection criteria based on -values of statistic when one of the indices is held fixed and the other ranges over all potential models.
Minimal -value Criterion (mPVC)

 Mnm=argminj∈Mepjanp(Rn0j|pj,0),

where and is a sequence of nonnegative numbers. When a minimizer is not unique, the set with the smallest number of elements is chosen. In the case of ties, arbitrary minimizer is selected. Observe that when then from among the pairs we choose a pair for which we are most inclined to reject and we select the model corresponding to the most convincing alternative hypothesis. For positive the scaling factor is interpreted as additional penalization for the complexity of a model.
Moreover, Maximal -value Criterion is defined as
Maximal -value Criterion (MPVC)

 MnM=argmaxj∈Me−pjanp(Rnjf|Mn,pj),

where and . Thus from among the pairs we choose a pair for which we are most reluctant to reject in favour of the full model hypothesis. We stress that the additional assumption needed for consistency of MPVC is not required to prove consistency of mPVC. This point is discussed further in Section 3. Note that in the definition of both criteria the existence of encompassing model, either from below or from above, is vital for the construction. The idea of encompassing has been used in Bayesian model selection (see e.g. Casella et al. (2009)).
Observe that for a fixed number of variables -value is a strictly decreasing function of . Thus the set is actually chosen from among subsets for which is maximal for the stratum . The same observation also holds for MPVC as well as for BIC and AIC. Observe also that if these criteria choose subsets of the same cardinality, these subsets necessarily coincide.

## 3 Results

### 3.1 Random-design regression

The main result of this section is consistency of the introduced selectors. Depending on the context we will use some of the following additional conditions on the horizons , norming constants and matrix .

1. as .

2. as .

3. The minimal eigenvalue of is bounded away from zero, i.e. for some and .

4. For some , and

 supnsup||d||=1E|d′z(n)|4⌈2/η⌉<∞, (4)

where is the standardised vector i.e. and is the smallest integer greater than or equal to .

5. as .

Assumptions (A1.1’) and (A1.1”) are two variants of the condition on a rate of divergence of . As is nondecreasing, the limit in (A1.2) exists and is either finite or equal to infinity. Condition (A1.2) is a natural condition stating that ultimately the list will contain the true model. The assumptions (A1.3) and the second part of (A1.4), used in Zheng and Loh (1997), imply in particular that with probability tending to one exists and therefore is unique. Similar conditions are used by Mammen (1993) to study the asymptotic behaviour of bootstrap estimators of contrasts in linear models of increasing dimension.
We will consider in detail the case when and are optimised over all subsets of i.e. and comment on the situation when the nested list of models is considered: . The first result concerns consistency of the minimal -value criterion.

###### Theorem 1

Let . Then under conditions (A0), (A1.1’), (A1.2), (A1.3), (A1.4), (A1.5)
, as .

As it follows from the proof an Lemma 4 condition (A1.1’) may be weakened in Theorem 1 to . We state now analogous result for MPVC criterion.

###### Theorem 2

Let . Then under conditions of Theorem 1 with (A1.1’) replaced by (A1.1”)
, as .

In order to compare assumptions of the above results note that when grows more slowly than we can take in the case of criterion . However, in the case of the assumption (A1.1”) is obviously not satisfied for .
It follows from the proof that the condition (A1.1”) may be weakened in Theorem 1 to .
Proofs of Theorems 1 and 2 are given in Section 4.
Consider now the case when the criteria are optimised over nested list of models and define as the largest index of nonzero coefficient in the true model. In this case our goal is not to identify consistently the minimal true model but rather , which is equivalent to consistent selection of a set . It turns out that this property holds under weaker conditions than in Theorem 1 and 2. Namely, the conditions (A1.3) and (A1.4) can be omitted. In this case the condition (A0) will be slightly modified. Let be a vector of variables which pertain to the model . Instead of (A0) we assume (B0): is positive definite matrix. Then under conditions (B0), (A1.1’), (A1.2)and (A1.5) and analogous result holds for provided (A1.1’) is replaced by (A1.1”). This is proved along the lines of the proofs of Theorems 1 and 2.

In order to lessen computational burden of all subset search we propose two-step model selection with the first step consisting in initial ordering of variables according to -values of coefficient of partial determination (3). This method is analogous to the procedure proposed in Zheng and Loh (1997) in which variables are ordered according to absolute values of -statistics corresponding to respective attributes. Then in the second step an arbitrary criterion Crit is optimised over nested family of models. Specifically, the greedy procedure consists of the following steps. Let

 PVi=p(Rn(f−{i})f|Mn,Mn−1),i=1,…,Mn (5)

be the -value of statistic for testing model against model . Then

1. Order the -values in nondecreasing order .

2. Consider the nested family and optimise criterion Crit over this family.

It can be shown that under (A1.2)-(A1.4)

 limn→∞P(maxi∈tPVi

The proof of the above assertion is a simple consequence of Theorem 2 in Zheng and Loh (1997). This, together with Theorems 1 and 1 for the case of the nested list of models, when minimal or maximal -value criterion is considered as Crit, leads to the following corollary.

###### Corollary 1

Under conditions of Theorems 1 and 2 respectively the greedy versions of mPVC and MPVC procedures are consistent.

Observe that since parameters of beta distribution used to calculate -values in (5) do not change with , the ordering in the first step is equivalent to ordering wrt values of , or to the ordering wrt to absolute values of -statistics when the full model is fitted.

### 3.2 Deterministic-design regression

In this section we will briefly discuss the case when the design matrix is nonrandom. We allow that the values of attributes of observation may depend on . Recall that is a vector of variables which pertain to the minimal true model . In the case of all subset search we replace condition (A0) by the following assumption

1. , as , where is a positive definite matrix.

In the case of random covariates the above convergence in probability follows from The Law of Large Numbers. We also replace conditions (A1.3) and (A1.4) by the following assumption

1. The minimum eigenvalue of is bounded away from zero, i.e. for some and .

Recall that is the least squares estimator based on the full model . Let be the corresponding t-statistic. It can be easily shown that , for . Thus by assumption (C1) as , for some . This implies the conclusion of Lemma 5 in Section 4, namely that for with probability tending to one is bounded away from . As (A1.3) and (A1.4) are used in the random-design case only to prove Lemma 5 it follows that the analogous results to Theorem 1 and Theorem 2 hold for the deterministic-design case.

###### Corollary 2

Under conditions (C0), (A1.1’), (A1.2), (C1), (A1.5)
, as .

###### Corollary 3

Under conditions of Corollary 2 with (A1.1’) replaced by (A1.1”)
, as .

Consider the case of nested family search. Recall that is a vector of variables which pertain to the model . If condition (B0) if replaced by the following assumption

1. , as , where is a positive definite matrix.

then results discussed at the end of Section 3.1 hold for deterministic design.

## 4 Proofs

We first state auxiliary lemmas which will be used in the proof of Theorem 1. The first one proved in Pokarowski and Mielniczuk (2010) gives an approximation of tail probability function of beta distribution. Let be a random variable having beta distribution with shape parameters and and denote beta function. Define an auxiliary function

 L(a,b,x)=(a−1)(1−x)1−a+(a+b)x,

for such that .

###### Lemma 1

Assume . Then for

 (1−x)bxa−1B(a,b)b≤P[Ba,b>x]≤(1−x)bxa−1B(a,b)b(1+L(a,b,x)) (6)

and for

 (1−x)bxa−1B(a,b)b(1+L(a,b,x))≤P[Ba,b>x]≤(1−x)bxa−1B(a,b)b. (7)

The following Lemma states simple but useful inequalities for gamma function.

###### Lemma 2

Let and , for some . Then

 Γ(b)ba≤Γ(a+b)≤2√πΓ(b)(a+b)a.

The above Lemma implies an inequality for beta function

 ba−1Γ(a)≤1bB(a,b)≤2√π(a+b)abΓ(a), (8)

for , and .

###### Remark 1

Lemma 2 easily implies inequality for , which will be frequently used throughout.

The following Lemma states that for a proper submodel of the true model variance estimator is asymptotically biased. denotes a proper inclusion of in .

###### Lemma 3

(i) For , as . Moreover, for , if (A0) is satisfied then as , where .
(ii) Let , and assume (B0). Then as , where .

###### Lemma 4

Let be a sequence of real numbers such that as . Assume also that and matrix is invertible with probability tending to 1. Then

 P{nlog[RSS(t)RSS(f)]>Rn}→0

as .

###### Remark 2

Observe that as , the imposed condition on is implied by . Thus in particular Lemma 4 implies that

 RSS(t)RSS(f)=OP[exp(Rnn)],

for any such that . Observe moreover that Lemma 4 holds true also in the case when the condition on reduces to only and thus . This can be seen directly from Lemma 3 and the fact that as it follows from them that and thus .

###### Lemma 5

Assume conditions (A1.3) and (A1.4). Then there exists such that

 P{mini∈tlog[RSS(f−{i})RSS(f)]>a}→1

as .

Thus Lemma 5 implies that with probability tending to for is bounded away from 0.

### 4.1 Proof of Theorem 1

We will consider separately two cases: the first when the true model contains nontrivial regressors () and the second, when it equals the null model.
Case 1 (). We will treat the case in detail, the case is similar but simpler and relies on (7) instead of (6) to treat .
(i) Let be such that i.e. is a proper subset of . We will prove that as . Using (8) with and we obtain the following inequalities for sufficiently large

 1B(pt2,n−pt2)(n−pt2)≤2(n2)pt2√π(n−pt2)Γ(pt2)≤2(n2)pt2√π(n4)Γ(pt2)=4(n2)pt2−1√πΓ(pt2). (9)

Moreover for and sufficiently large

 1B(pj2,n−pj2)(n−pj2)≥(n−pj2)pj2−1Γ(pj2)≥(n−Mn2)pj2−1Mpj2n≥(n−Mn2)pt+12−1Mpt+12n≥(n2)pt+12−1(12)pt+12−1Mpt+12n. (10)

Note that

 P⎛⎜⎝infj⊃tRn0j≥supj⊃tpj2−1n2⎞⎟⎠≤P(Rn0t≥(Mn−2)/n)→1,

which follows from Lemma 3 and the fact that . Thus the assumption of Lemma 1 is satisfied for , , and all . Using (6) we have

 P[eptanp(Rn0t|pt,0)>infj⊃teptanp(Rn0j|pj,0)]≤ (11) (12) P⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩(1−Rn0t)n−pt2[1+L(pt2,n−pt2,Rn0t)]eptanB(pt2,n−pt2)(n−pt2)>infj⊃t(1−Rn0f)n−pt2(Rn0t)Mn2−1e(pt+1)anB(pj2,n−pj2)(n−pj2)⎫⎪ ⎪ ⎪⎬⎪ ⎪ ⎪⎭. (13)

Taking logarithms and using inequalities (9), (10) we obtain

 P[logp(Rn0t|pt,0)+ptan>infj⊃tlogp(Rn0j|pj,0)+(pt+1)an]≤P{[n−pt2]log[RSS(t)RSS(f)]>~Wn},

where

 ~Wn=an+12log(n2)−log[1+L(pt2,n−pt2,Rn0t)]+(Mn2−1)log(Rn0t)+

Assumption , Lemma 3 and the fact that imply that there exists a sequence of real numbers such that and . Now the required convergence follows from

 P{[n−pt2]log[RSS(t)RSS(f)]>Wn}→0

which in its turn is implied by Lemma 4.
(ii) Consider now the case and let be such that . We will prove that as . Define , for . Assume first that . Using (6) and (8) we have

 epjanp(Rn0j|pj,0)≥e2anp(M(n,i)|pj,0)≥e2an[1−M(n,i)]n−pj2M(n,i)pj2−1B(pj2,n−pj2)(n−pj2)≥ (14) (15) e2an[1−M(n,i)]n2Mpj2−1nMpj2n=e2an[1−M(n,i)]n2M−1n. (16)

From (6) and (9)

 eptanp(Rn0t|pt,0)≤eptan(1−Rn0t)n−pt24(n2)pt2−1[1+L(pt2,n−pt2,Rn0t)]√πΓ(pt2) (17)

Using (14) and (17) we have for and

 P[eptanlogp(Rn0t|pt,0)>infj⊉tepjanlogp(Rn0j|pj,0)]≤P{infi∈tn2log[(1−M(n,i))RSS(0)RSS(t)]<~Sn},

where

 ~Sn=an(pt−2)+(pt2−1)log(n2)−pt2log(RSS(t)RSS(0))+log(4√π)+ log[1+L(pt2,n−pt2,Rn0t)]+logΓ−1(pt2)+log(Mn).

In view of definition of the last probability can be bounded from above by

 P{infi∈tn2log[RSS(f−{i})RSS(t)]<~Sn}+P⎧⎪⎨⎪⎩n2log⎡⎢⎣(1−2Mnn−Mn)RSS(0)RSS(t)⎤⎥⎦<~Sn⎫⎪⎬⎪⎭.

The second probability above converges to zero in view of Lemma 3. Consider the first probability. Since the number of elements of is finite it suffices show that for any . Namely, it is bounded from above by

 P{n2log[RSS(f−{i})RSS(f)]+n2log[RSS(f)RSS(t)]<~Sn}≤ (18) P{n2log[RSS(f−{i})RSS(f)]<2~Sn}+P{n2log[RSS(f)RSS(t)]<−~Sn}≤ (19) P{nlog[RSS(f−{i})RSS(f)]<~Sn}+P{n2log[RSS(t)RSS(f)]≥~Sn}. (20)

From assumptions (A1.5) and (A1.1’) and , respectively. Thus the convergence to zero of the above two probabilities in (18) follows from Lemma 5 and 4, respectively. The case is treated analogously.
Consider now the case . From (17) we have

 P[logp(Rn0t|pt,0)+ptan>logp(Rn00|0,0)]=P[logp(Rn0t|pt,0)>an−12log(n)−ptan]≤ (21) P{(n−pt2)log[RSS(0)RSS(t)]

where

 Gn=(pt−1)an+12log(n)+(pt2−1)log(n2)+log(4√π)+logΓ−1(pt2)+ log[1+L(pt2,n−pt2,Rn0t)].

The convergence to zero of the probability in (21) follows from Lemma 3 amd assumption (A1.5).
Case 2 i.e. the true model is null model. We treat in detail the case . Define . Note that the assumption of Lemma 1 is satisfied for , , and . Using (6) and (8) we have

 epjanp(Rn0j|pj,0)≥e2anp(¯M(n)|pj,0)≥e2an[1−¯M(n)]n−pj2¯M(n)pj2−1B(pj2,n−pj2)(n−pj2)