The classical model for linear regression is with i.i.d. standard Gaussian errors. Much of the resulting statistical inference is based on Fisher’s -distribution. In this paper we give two approaches to least squares regression which are model free. The results hold for all data . The derived probabilities are not only exact, they agree with those using the -distribution based on the classical model. This is achieved by replacing questions about the size of , for example , by questions about the degree to which the covariate is better than Gaussian white noise or, alternatively, a random orthogonal rotation of . The idea can be extended to choice of covariates, post selection inference PoSI, step-wise choice of covariates, the determination of dependency graphs and to robust regression and non-linear regression. In the latter two cases the probabilities are no longer exact but are based on the chi-squared distribution. The step-wise choice of covariates is of particular interest: it is a very simple, very fast, very powerful, it controls the number of false positives and does not over fit even in the case where the number of covariates far exceeds the sample size.
A Model-free Approach to Linear Least Squares Regression with Exact Probabilities and Applications to Covariate Selection
Laurie Davies111 Faculty of Mathematics, University of Duisburg-Essen, 45117 Essen, Federal Republic of Germany. e-mail:email@example.com and Lutz Dümbgen222Institute for Mathematical Statistics and Insurance, University of Bern, 3012 Bern, Switzerland. e-mail:firstname.lastname@example.org
Subject classification: 62J05
Key words: stepwise regression; high dimensions.
We consider data with dependent variable and covariates with each . To ease the notation the explicit dependency on will be dropped by writing instead of etc. Suppose we analyse the data based on the classical linear regression model with i.i.d. Gaussian errors
where is a subset of and contains only those covariates with . Problems to be considered include the choice of and the calculation of confidence sets for the , once has been chosen. There are many methods addressed to the first problem but here we restrict attention to lasso [Tibshirani, 1996] and knockoff [Candès et al., 2018]. Problems of post selective inference are considered in [Pötscher and Leeb, 2003] and [Berk et al., 2013]. As a simple example of the use of the standard model the relevance of the covariate can be tested by formulating the null hypothesis and the alternative hypothesis and calculating a P-value based on Fisher’s F-distribution.
We will address such problems using a completely different approach to linear regression. The idea is very simple: in the case of a single covariate we compare with a covariate consisting of i.i.d. random variables. Denote the sum of squared residuals without by and with by . Now replace by the Gaussian covariate leading to a random sum of squared residuals . As will be shown below the probability that is better than is
where denotes the beta-distribution with parameters and and the F-distribution with degrees of freedom and . The first term is the P-value of in our sense. The last term is the P-value for testing against . The first P-value is independent of the data, that is, it is model free, the second requires the standard linear model with i.i.d. Gaussian white noise errors.
In Section 2 both methods are described and the exact probabilities given.
2 Gaussian covariates and random orthogonal rotations
2.1 Gaussian covariates
Consider a subset of of size and consider the regression based on the covariates with with sum of squared residuals . Consider a second subset of size and denote by the sum of squared residuals of the regression based on the covariates . Now replace the covariates , by standard Gaussian covariates and denote by the random sum of squared residuals of the linear regression based on the covariates and the .
The following theorem holds:
The proof is given in the appendix. The expression in (iii) is the P-value in the classical linear model for testing the null hypothesis that the , are all zero.
2.2 Random orthogonal rotations
3 Selecting linear approximations: all subsets
In a first step we select subsets by requiring that each covariate in the selected subset has a P-value less than , for example . More precisely, consider a subset with sum of squared residuals . The p-value of is given by (ii) of Theorem 2.1, namely
where is the sum of squared residuals without the covariate .
The P-values are correct whatever the subset. This is in contrast to methods based on a correct linear model with Gaussian errors where the P-values are correct only for the correct model, for every other subset they are incorrect because the estimate of on which the P-values depend is biased: see the discussion on page 7 of [Berk et al., 2013].
In a second step all selected subsets which are a strict subset of some other selected subset are eliminated. The remaining subsets are in some sense maximal because it is not possible to include other covariates without causing some P-value to rise above the selected level of .
Finally the remaining subsets can be ordered if desired, for example using the sum of squares of the residuals.
Other selection procedures are possible, for example specifying that a particular covariate must always be included or by using the P-values for the whole subset and not for the individual covariates.
4 Stepwise selection of linear approximations
4.1 Single stepwise procedure
Suppose a subset , of the covariates has been selected. Denote the sum of the squared residuals based on this subset by . There remain covariates and for each such covariate denote the sum of squared residuals based on this covariate and the by . Replace all the remaining covariates by i.i.d. standard dimensional Gaussian random variables . Alternatively the remaining covariates can be replaced by independent orthogonal rotations of themselves. Regress the dependent variable on the covariates defined by and each in turn and denote the sum of squared residuals using by . From (ii) of Theorem 2.1 we have
As the are independent and identically distributed we have
which is the probability that the covariate is better than the best of the random Gaussian covariates . We define the P-value of by
We now specify a cut-off P-value , for example . If none of the has a P-value less than the procedure terminates and defined the selected subset. Otherwise that with the smallest P-value is selected and the procedure continues until all the remaining p-values exceed .
4.2 Consistency of stepwise Gaussian covariates
Let the required P-value for stopping be and define by
meaning that the procedure terminates as soon as
The proof is given in the Appendix 7.2.
whereas the asymptotic approximation is
which is not too bad given the relatively small value of . For the two values are 0.1978585 and 0.22315 respectively.
We consider the case of a dependent variable and covariates all of size with generated by
with and where the are standard Gaussian white noise.
Given a subset of let denote the projection onto the subspace spanned by the .
and make the following four assumptions:
for some ,
for some .
If the ASSUMPTIONS 1-4 hold the Gaussian stepwise procedure is asymptotically consistent.
The proof is given in the Appendix 7.3.
The assumptions simplify if the correlation matrix of the first covariates has a dominant diagonal
with for some fixed and if
Here denotes the correlation of and .
Under (9) the eigenvalues of the correlation matrix of any subset of the covariates lie between and .
ASSUMPTIONS 1-4 become
ASSUMPTIONS’ 1 and 2 are guaranteed if
which can be compared with the assumption
of Theorem 1 of ([Lockhart et al., 2014]) in the case of orthogonal .
The proof is given in the Appendix 7.4.
4.3 Controlling the number of false positives
The cut-off P-value can be interpreted as the probability of one false positive being included in the final selected subset. In some situations this may be too strict. Examples are given below in Sections 6.1.1 and 6.1.2.
To make the method more flexible we proposing comparing the best of the covariates with the th best of the Gaussian covariates. A larger value of will in general increase the number of false positives but it may be judged to be of value if there is a corresponding decrease in the number of false negatives. This may be done as follows.
In the notation of Subsection 4.1 it follows from (i) of Theorem 2.1 that that the random variables are independent and uniformly distributed on the interval . The th order statistic of independent uniform random variables is distributed as . The P-value of (2) is replaced by
which agrees with (2) if .
To estimate the number of false positives for any given we propose the following. Simulate using only Gaussian covariates and any non-zero and select the covariates using the Gaussian step-wise procedure with agiven . All selected covariates are false positives. As only rough estimates are required 100 simulations are usually, but not always, sufficient. Table 1 gives the results of such a simulation using the values and . These are the values used in the simulations of Section 6.1.1
4.4 Repeated step-wise regression
The single step-wise method as described in Section 4.1 produces at most one linear approximation. There well may be others. The repeated step-wise method simply eliminates the covariates in the first approximation and then repeats the single step-wise method. This is continued until there are no covariates with a P-value smaller than the given .
We briefly consider extension to robust (-)regression, non-linear regression and minimization of the Kullback-Leibler discrepancy instead of the sum of squared residuals. For further details see[Davies, 2017].
Let by a symmetric, positive and twice differentiable convex function with . The default function will be the Huber’s -function with a tuning constant ([Huber and Ronchetti, 2009], page 69) defined by
The default value of will be .
For a given subset of size the sum of squared residuals is replaced by
which can be calculated using the algorithm described in 7.8.2 of [Huber and Ronchetti, 2009]. The minimizing will be denoted by .
For some put
Replace all the covariates not in by standard Gaussian white noise, include the th such random covariate denoted by and put
A Taylor expansion gives
with . This leads to the asymptotic -value for
corresponding to the exact -value (2) for linear regression. Here
It remains to specify the choice of scale . The initial value of is the median absolute deviation of multiplied by the Fisher consistency factor 1.4826. After the next covariate has been included the new scale is taken to be
where the are the residuals based on the covariates and is the Fisher consistency factor given by
where is (see [Huber and Ronchetti, 2009]). Other choices are also possible.
4.6 Non-linear approximation
For a given subset of covariates the dependent variable is now approximated by where is a smooth function. for some smooth function . Consider a subset , write
and denote the minimizing by . Now include one additional covariate with to give , denote the mean sum of squared residuals by . As before all covariates not in are replaced by standard Gaussian white noise. Include the th random covariate denoted by and put
Arguing as above for robust regression results in
The asymptotic -value for the covariate corresponding to the asymptotic -value (22) for -regression is
4.7 Kullback-Leibler and logistic regression
For integer data least squares can be replaced by miminizing the Kullback-Leibler discrepancy. We consider the case of 0-1 data and logistic regression:
Denoting the minimum for the subset by and the minimum for by the arguments of the previous two sections lead to the asymptotic -value
for the covariate . The are the values of giving the minimum .
5 Analysing the linear approximations: equivalence regions
If statistical inference is carried out based on a single specified classical model the results depend sensitively on the estimated value of , the ‘true’ variance of the noise. If the model is selected out of many possible models only one of which can be ‘true’ this causes problems, see [Pötscher and Leeb, 2003] and [Berk et al., 2013]. As the Gaussian covariate procedure yields approximations and makes no use of parameters it avoids such problems. The result remains valid no matter how the covariates were selected.
Let be the least squares coefficients based on the covariates with sum of squared residuals . Let be some other coefficients. We have
The are not significantly worse
than the if the sum of the
squared residuals is not significantly larger
. This is measured by regressing on Gaussian covariates to give sum of squared
As before we have
Specifying a lower value for this probability gives
The corresponding confidence region for the standard regression model is
i.e. the same.
It is possible to define an equivalence region for any given and not just the least squares solution. We define as being not significantly worse than if firstly
holds. Now regress on Gaussian covariates to give sum of squared residuals . As before we have
which on specifying leads to
Similarly is not significantly better than if
The same idea can be used to give an equivalence region for a subset of the s keeping the others fixed. We give here only an example. For the first covariate of the red wine data the 0.95% equivalence interval for keeping the least squares values of the remaining s fixed is The standard 95% confidence interval is .
6 Simulations and real data
6.1.1 Tutorials 1 and 2
The tutorials in question are the Tutorials 1 and 2 of
In both cases and 60 of the regression coefficients are non-zero. The dependent variable is real valued in Tutorial 1 but binary in Tutorial 2. For the latter we use the binomial family option for lasso and knockoff. The Gaussian covariate method uses least squares. The effect of different values of for the Gaussian covariate method was discussed in Section 4.3 . It can be seen that the results of Table 2 are consistent with Table 1.
In Tutorial 1 lasso selects on average 140 covariates which include virtually all the 60 covariates with non-zero coefficients. It does so at a cost of 80 false positives and can be seen to grossly overfit. Knockoff performs much better selecting on average 50 of the 60 relevant covariates at a cost of 5.6 false positives. The Gaussian covariate method with performs essentially the same as knockoff with the exception of computing time. Knockoff requires on average just over a minute for each simulation whereas the Gaussian covariate method requires on average three seconds.
|Tutorial 1||Tutorial 2|
6.1.2 Random graphs
This is based on [Meinshausen and Bühlmann, 2006] with where is the dimension of each covariate and the number of covariates. On the last line of page 13 the expression with the density of the standard normal distribution and the Euclidean distance is clearly false. It has been replaced by which gives about 1800 nodes compared with the 1747 of [Meinshausen and Bühlmann, 2006].
One simulation of the Gaussian covariate method with and resulted in 1809 edges of which 1693 were selected with 3 false positives and 119 false negatives. The time required was about 11 seconds. When reconstructing the graph the given value of is by default divided by the number of covariates .
One application of lasso to the same graph resulted in 599 false positives and 13 false negatives. The time required was 47 minutes.
Putting and performing 10000 Simulations as in Table 1 with give a mean number of false positives of 7. The large number of simulations was because of the small value of . The time required was 10 minutes. For this value of and the same random graph as above the Gaussian procedure selects 1816 edges, that is an additional 123 of which of the order of 7 can be expected to be false positives. The actual number was 13 with 6 false negatives.
6.2 Real data
6.2.1 Red wine data
The size of the red wine data is with the 12th coordinate being the dependent variable giving subjective evaluations of the quality of the wine. The data are available from
UCI Machine Learning Repository: Wine Quality Data Set
Considering all possible subsets as described in Section with results in 20 linear approximations. The best in terms of the smallest sum of squared residuals is based on the six covariates 2, 5, 7 and 9-11 which are volatile.acidity, chlorides, total.sulfur.dioxide, pH, sulphates and alcohol. The stepwise procedure with gives the same subset.
6.2.2 Boston housing data
The size of the Boston housing data is with the 14th coordinate being the dependent variable giving median value of owner-occupied homes in multiples of 1000$. The data are available from the R package MASS
Considering all possible subsets as described in Section with results in 34 linear approximations. The best in terms of the smallest sum of squared residuals is based on the all the coordinates apart from 2 and 7.
We now consider all 77520 interactions of order at most 7. Five applications of lasso resulted in between 39 and 53 selected covariates with in all 61. The applications took on average 80 seconds each.
The Gaussian stepwise procedure with and including the intercept results in six interactions being selected. The first one is and the second . Regressing on these two alone plus intercept gives a sum of squared residuals of 9688 which is smaller than 11079 obtained by regressing on all 13 original covariates. The time required was 2.5 seconds.
All interactions of degree at most 8 is too large for lasso. The Gaussian method gives a similar result as before in 6 seconds.
6.2.3 Leukemia data
The dimensions of the leukemia data ([Golub et al., 1999]) are . The data are available from
For more information about the data see [Dettling and Bühlmann, 2002].
Five applications of lasso resulted in 14, 12, 14, 16 and 14 covariates with in all 17 covariates. each application took about 0.5 seconds.One application of knockoff resulted in 14 covariates. The time required was six hours.
The repeated Gaussian covariate procedure with gives 115 linear approximations and involving 281 covariates. These included all the lasso and knockoff covariates. The time required was 0.7 seconds. The first two linear approximations are
[,1] [,2] [,3] [,4] [1,] 1 1182 0.000000e+00 4.256962 [2,] 1 1219 8.577131e-04 2.884064 [3,] 1 2888 3.580552e-03 2.023725 [4,] 2 1652 0.000000e+00 4.384850 [5,] 2 979 9.358898e-05 2.790473
The first column gives the number of the linear approximation, the second the covariates included in this approximation, the third the p-values of the covariates and the fourth the sum of squared residuals as each successive covariate is included.
If the cut-off p-value is set to the Gaussian covariate method results in 21 linear approximations each based on a single covariate. Only nine of these are included in the lasso and knockoff lists.
6.2.4 Osteoarthritis data
The osteoarthritis data with dimensions was analysed in [Cox and Battey, 2017]. The authors selected 17 covariates. Five applications of lasso resulted in 0, 6, 9, 8 and 10 covariates and 10 in all. the time for each application was about 10 seconds. This data set is much too large for knockoff.
The repeated Gaussian covariate method with gave 165 covariates forming 63 linear approximations. These included all the lasso covariates and 6 of the 17 chosen in [Cox and Battey, 2017]. The time required was 14 seconds.
Finally a dependency graph for the covariates was calculated with . It consisted of 38986 edges. The computing time was about 55 minutes. It was estimated that lasso would require over 5 days.
7.1 Proof of Theorem 2.1
We consider firstly the case , put and write
where is the linear space spanned by the covariates .
Let be an orthonormal basis of such that
where is the projection onto the subspace .
We now replace by a Gaussian covariate consisting of i.i.d. random variables. By the rotational symmetry of the standard Gaussian distribution on , defines stochastically independent standard Gaussian random variables . The orthogonal projection of onto is given by
it follows that
In the general case with the above argument may be applied inductively to show that