# Sharp Convergence Rates for Forward Regression in High-Dimensional Sparse Linear Models

Damian Kozbur (damian.kozbur@econ.uzh.ch)
University of Zürich
Department of Economics
Schönberggasse 1, 8006 Zürich
###### Abstract

Forward regression is a statistical model selection and estimation procedure which inductively selects covariates that add predictive power into a working statistical regression model. Once a model is selected, unknown regression parameters are estimated by least squares. This paper analyzes forward regression in high-dimensional sparse linear models. Probabilistic bounds for prediction error norm and number of selected covariates are proved. The analysis in this paper gives sharp rates and does not require $\beta$-min or irrepresentability conditions.

This version is of July 12, 2019. An earlier version of this paper, Testing-Based Forward Model Selection [24], is being split into two papers. The current paper presents fundamental results needed for analysis of forward regression in general settings, while the other paper focuses on using hypothesis tests rather than a simple threshold to decide which covariates enter the selected model. I gratefully acknowledge helpful discussions with Christian Hansen, Tim Conley, attendees at the ETH Zürich Seminar für Statistik Research Seminar, and attendees at the Center for Law and Economics Internal Seminar, as well as financial support of the ETH Fellowship program.

MSC subject classifications: 62J05, 62J07, 62L12

Keywords: forward regression, high-dimensional models, sparsity, model selection

## 1 Introduction

Forward regression is a statistical model selection and estimation technique that inductively selects covariates which substantially increase predictive accuracy into a working statistical model until a stopping criterion is met. Once a model is selected, unknown regression parameters are estimated by least squares. This paper studies statistical properties and proves convergence rates for forward regression in high-dimensional settings.

Dealing with a high-dimensional dataset necessarily involves dimension reduction or regularization. A principal goal of research in high-dimensional statistics and econometrics is to generate predictive power that guards against false discovery and overfitting, does not erroneously equate in-sample fit to out-of-sample predictive ability, and accurately accounts for using the same data to examine many different hypotheses or models. Without dimension reduction or regularization, however, any statistical model will overfit a high dimensional dataset. Forward regression is a method for doing such regularization which is simple to implement, computationally efficient, and easy to understand mechanically.

There are several earlier analyses of forward selection. [39] gives bounds on the performance and number of selected covariates under a $\beta$-min condition which restricts the minimum magnitude of nonzero regression coefficients. [42] and [35] prove performance bounds for greedy algorithms under a strong irrepresentability condition, which restricts the empirical covariance matrix of the predictors. [14] prove bounds on the relative performance in population $R$-squared of forward regression (relative to infeasible $R$-squared) when the number of variables allowed for selection is fixed.

A key difference between the analysis in this paper and previous analyses of forward regression is that all bounds are stated in terms of the sparse eigenvalues of the empirical Gram matrix of the covariates. No $\beta$-min or irrepresentability conditions are required. Under these general conditions, this paper proves probabilistic bounds on the predictive performance which rely on a bound on the number of selected covariates. In addition, the rates derived here are sharp.

A principal idea in the proof is to track average correlation among selected covariates. The only way for many covariates to be falsely selected into the model is that they be correlated to the outcome variable. Then, by merit of being correlated to the outcome, subsets of the selected covariates must also exhibit correlation amongst each other. On the other hand, sparse eigenvalue conditions on the empirical Gram matrix put upper limits on average correlations between covariates. These two observations together imply a bound on the number of covariates which can be selected. Finally, the convergence rates for forward regression follow.

A related method is forward-backward regression, which proceeds similarly to forward regression, but allows previously selected covariates to be discarded from the working model at certain steps. The convergence rates proven in this paper match those in the analysis of a forward-backward regression in [43]. Despite the similarity between the two procedures, it is still desirable to have a good understanding of forward selection. An advantage of forward selection relative to forward-backward is computational simplicity. In addition, understanding the properties of forward selection may lead to better understanding of general early stopping procedures in statistics (see [40], [44]) as well as other greedy algorithms (see [9], [17]). The analysis of forward regression requires quite different techniques, since there is no chance to correct ‘model selection mistakes.’

There are still many other sensible approaches to high-dimensional estimation and regularization. An important and common approach to generic high-dimensional estimation problems is the Lasso. The Lasso minimizes a least squares criterion augmented with a penalty proportional to the $\ell_1$ norm of the coefficient vector. For theoretical and simulation results about the performance of Lasso, see [16], [34], [20], [13], [1], [2], [7], [11], [10], [12], [21], [22], [23], [25], [26], [28], [30], [36], [38], [41], [4], [8], among many more. In addition, [15] have shown that under restrictive conditions, Lasso and forward regression yield approximately the same solutions (see also [27]). This paper derives statistical performance bounds for forward selection which match those given by Lasso in more general circumstances.

Finally, an important potential application for forward regression is as an input for post-model-selection analysis. One example is the selection of a conditioning set, to properly control for omitted variables bias when there are many potential control variables (see [6], [37], [5]). Another example is the selection of instrumental variables for later use in a first stage regression (see [3]). Both applications require a model selection procedure with the hybrid property of both producing a good fit and returning a sparse set of covariates. The results derived in this paper are relevant for both objectives, deriving bounds for both prediction error norm as well as the size of the selected set for forward regression.

## 2 Framework

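The observed data is given by $\{(x_i, y_i)\}_{i=1}^n$. The data consists of a set of covariates $x_i \in \mathbb{R}^p$ as well as outcome variables $y_i \in \mathbb{R}$ for each observation $i = 1, \dots, n$. The data satisfy

$$y_i = x_i'\theta_0 + \varepsilon_i$$

for some unknown parameter of interest $\theta_0 \in \mathbb{R}^p$ and unobserved disturbance terms $\varepsilon_i \in \mathbb{R}$. The covariates are normalized so that $E_n[x_{ij}] = 0$ and $E_n[x_{ij}^2] = 1$ for every $j = 1, \dots, p$, where $E_n[\,\cdot\,] = \frac{1}{n}\sum_{i=1}^n (\cdot)$ denotes empirical expectation. Finally, the parameter $\theta_0$ is sparse in the sense that the set of non-zero components of $\theta_0$, denoted $S_0$, has cardinality $s_0$. The interest in this paper is to study how well forward regression can estimate $x_i'\theta_0$ for $i = 1, \dots, n$.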

Define a loss function

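$$\ell(\theta) = E_n[(y_i - x_i'\theta)^2].$$

Note that $\ell(\theta)$ depends on the observed data, but this dependence is suppressed from the notation. Define also

$$\ell(S) = \min_{\theta:\, \mathrm{supp}(\theta) \subseteq S} \ell(\theta).$$

The estimation strategy proceeds by first searching for a sparse subset $\hat S \subseteq \{1, \dots, p\}$, with cardinality $\hat s$, that assumes a small value of $\ell(S)$, followed by estimating $\theta_0$ with least squares via

$$\hat\theta \in \arg\min_{\theta:\, \mathrm{supp}(\theta) \subseteq \hat S} \ell(\theta).$$

This gives the construction of the estimates $x_i'\hat\theta$ for $i = 1, \dots, n$. The paper provides bounds for the prediction error norm defined by

$$E_n[(x_i'\theta_0 - x_i'\hat\theta)^2]^{1/2}.$$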

The set $\hat S$ is selected by forward regression. For any $S \subseteq \{1, \dots, p\}$ define the incremental loss from the $j$th covariate by

$$\Delta_j \ell(S) = \ell(S \cup \{j\}) - \ell(S).$$
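For intuition, the incremental loss has a standard closed form from least squares algebra. The block below is a sketch in notation introduced here for illustration only: $x_j$ denotes the $n$-vector of observations on covariate $j$, $y$ the $n$-vector of outcomes, and $Q_S$ the projection onto the orthogonal complement of the span of the covariates in $S$.

```latex
% x_j, y and Q_S are illustrative notation (assumptions of this sketch).
\ell(S) = \tfrac{1}{n}\, y' Q_S y,
\qquad
-\Delta_j \ell(S) = \frac{(x_j' Q_S y)^2}{n\, x_j' Q_S x_j}
\quad \text{whenever } x_j' Q_S x_j > 0 .
```

In words, the decrease in loss from adding covariate $j$ is the squared empirical covariance between the outcome and the part of $x_j$ not explained by the covariates already in $S$, scaled by the variance of that part.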

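Consider the greedy algorithm which inductively selects the $j$th covariate to enter a working model if $-\Delta_j \ell(S)$ exceeds a threshold $t$:

$$-\Delta_j \ell(S) > t$$

and $-\Delta_j \ell(S) \geq -\Delta_k \ell(S)$ for each $k \neq j$. The threshold $t$ is chosen by the user; it is the only tuning parameter required. This defines forward regression. It is summarized formally here:

Algorithm 1: Forward Regression

Initialize. Set $\hat S = \emptyset$.

For $1 \leq k \leq p$: If $-\Delta_j \ell(\hat S) > t$ for some $j \in \{1, \dots, p\} \setminus \hat S$, then select $\hat\jmath \in \arg\max_j \{-\Delta_j \ell(\hat S)\}$ and update $\hat S = \hat S \cup \{\hat\jmath\}$. Else: break.

Set: $\hat\theta \in \arg\min_{\theta:\, \mathrm{supp}(\theta) \subseteq \hat S} \ell(\theta)$.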
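The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the author's implementation: the function names `subset_loss` and `forward_regression` are hypothetical, the loss is recomputed by a fresh least squares fit at every step (simple but not efficient), and ties in the maximal decrease are broken by whichever index `max` returns first.

```python
import numpy as np


def subset_loss(X, y, S):
    """Empirical loss l(S): mean squared residual from least squares of y on columns S."""
    if len(S) == 0:
        return float(np.mean(y ** 2))
    coef, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
    resid = y - X[:, S] @ coef
    return float(np.mean(resid ** 2))


def forward_regression(X, y, t):
    """Greedy selection with threshold t, then least squares on the selected set."""
    n, p = X.shape
    selected = []
    for _ in range(p):
        current = subset_loss(X, y, selected)
        candidates = [j for j in range(p) if j not in selected]
        # Decrease in loss, -Delta_j l(S), for every covariate not yet selected.
        gains = {j: current - subset_loss(X, y, selected + [j]) for j in candidates}
        best = max(gains, key=gains.get)
        if gains[best] <= t:
            break  # no remaining covariate clears the threshold
        selected.append(best)
    theta_hat = np.zeros(p)
    if selected:
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        theta_hat[selected] = coef
    return selected, theta_hat
```

A call such as `forward_regression(X, y, t=0.1)` returns the selected indices in order of entry together with the post-selection least squares estimate; the value `0.1` is only a placeholder, since the choice of threshold is the subject of the conditions discussed later.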

## 3 Analysis of Forward Regression

In order to state the main theorem, a few more definitions are convenient. Define the empirical Gram matrix by $G_x = E_n[x_i x_i']$. Let $\varphi_{\min}(s)(G_x)$ denote the minimum $s$-sparse eigenvalues given by

$$\varphi_{\min}(s)(G_x) = \min_{S \subseteq \{1, \dots, p\}:\, |S| \leq s} \lambda_{\min}([G_x]_{S,S})$$

where $[G_x]_{S,S}$ is the principal submatrix of $G_x$ corresponding to the component set $S$. Let

$$C_1 = \sqrt{\hat s + s_0}\, \varphi_{\min}(\hat s + s_0)(G_x)^{-1} \left[ 2 \|E_n[\varepsilon_i x_i']\|_\infty + t^{1/2} \right].$$

For each positive integer $m$, let

$$C_2(m) = 1 + 72 \times 1.783^2 \times \varphi_{\min}(m + s_0)(G_x)^{-5}.$$

The above quantities are useful for displaying results in Theorem 1. Slightly tighter but messier usable quantities than $C_1$ and $C_2(m)$ are derived in the proof. Note also that $C_1$ depends on $\hat s$.
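To make these quantities concrete, the following sketch computes the minimum sparse eigenvalue by brute force and plugs it into the two constants, taking the numerical factors exactly as printed above. The function names are hypothetical, `noise_sup` stands for $\|E_n[\varepsilon_i x_i']\|_\infty$ (which is not observable in practice), and the enumeration over subsets is exponential in $s$, so this is only usable on toy problems.

```python
from itertools import combinations

import numpy as np


def min_sparse_eigenvalue(G, s):
    """Brute-force phi_min(s)(G): smallest eigenvalue over principal submatrices, |S| <= s."""
    p = G.shape[0]
    best = np.inf
    # By eigenvalue interlacing, the minimum over |S| <= s is attained at |S| = min(s, p).
    for S in combinations(range(p), min(s, p)):
        idx = np.array(S)
        best = min(best, np.linalg.eigvalsh(G[np.ix_(idx, idx)])[0])
    return float(best)


def c1(G, s_hat, s0, noise_sup, t):
    """C_1 with noise_sup standing in for the sup-norm of E_n[eps_i x_i']."""
    phi = min_sparse_eigenvalue(G, s_hat + s0)
    return float(np.sqrt(s_hat + s0) / phi * (2.0 * noise_sup + np.sqrt(t)))


def c2(G, m, s0, k_g=1.783):
    """C_2(m) with the numerical factors taken as printed in the text."""
    phi = min_sparse_eigenvalue(G, m + s0)
    return float(1.0 + 72.0 * k_g ** 2 * phi ** (-5))
```

For an orthonormal design, where the Gram matrix is the identity, every sparse eigenvalue equals one and both constants reduce to their leading numerical factors.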

###### Theorem 1.

Consider data $\{(x_i, y_i)\}_{i=1}^n$ with parameter $\theta_0$. Then under Algorithm 1 with threshold $t$,

$$E_n[(x_i'\theta_0 - x_i'\hat\theta)^2]^{1/2} \leq C_1.$$

For every integer such that and , it holds that

$$m \leq C_2(m)\, s_0.$$

The above theorem calculates explicit constants bounding the prediction error norm. It is also helpful to consider the convergence rates implied by Theorem 1 under more concrete conditions on the data. Next, consider the following conditions on a sequence of datasets indexed by the sample size $n$. In what follows, the parameters $\theta_0$, the thresholds $t$, and distribution of the data can all depend on $n$.

Condition 1 [Model and Sparsity]. .

Condition 2 [Sparse Eigenvalues]. There is a sequence such that . In addition, with probability .

Condition 3 [Threshold and Disturbance Terms]. The threshold satisfies . In addition, with probability .

###### Theorem 2.

For a sequence of datasets with parameters $\theta_0$ and thresholds $t$ satisfying Conditions 1-3, the bounds

$$E_n[(x_i'\theta_0 - x_i'\hat\theta)^2]^{1/2} = O\left(\sqrt{s_0 \log p / n}\right), \qquad \hat s \leq O(1)\, s_0$$

hold with probability .

The theorem shows that forward regression exhibits asymptotically the same convergence rates in prediction error norm as other high-dimensional estimators like Lasso, provided an appropriate threshold is used. In addition, forward regression selects a set with cardinality commensurate with $s_0$.

Condition 1 bounds the size of and requires that the sparsity level is small relative to the sample size. Condition 2 is a sparse eigenvalue condition useful for proving results about high-dimensional techniques like Lasso. In standard regression analysis where the number of covariates is small relative to the sample size, a conventional assumption used in establishing desirable properties of conventional estimators of $\theta_0$ is that $G_x$ has full rank. In the high-dimensional setting, $G_x$ will be singular if $p > n$ and may have an ill-behaved inverse even when $p \leq n$. However, good performance of many high-dimensional estimators only requires good behavior of certain moduli of continuity of $G_x$. There are multiple formalizations and moduli of continuity that can be considered here; see [7]. This analysis focuses on a simple eigenvalue condition which was used in [3]. Condition 2 could be shown to hold under more primitive conditions by adapting arguments found in [4] which build upon results in [41] and [32]; see also [31]. Condition 2 is notably weaker than previously used irrepresentability conditions. Irrepresentability conditions require that for certain sets and , letting be the subvector of with components , that is bounded, or even strictly less than 1.

Condition 3 is a regularization condition similar to regularization conditions common in the analysis of Lasso. The condition requires the threshold to dominate a multiple of $\|E_n[\varepsilon_i x_i']\|_\infty$. This condition is stronger than that typically encountered with Lasso, because the multiple relies on the sparse eigenvalues of $G_x$. To illustrate why such a condition is useful, let denote residualized away from previously selected regressors and renormalized. Then even if , can exceed resulting in more selections into the model. Nevertheless, using the multiple which stays bounded with $n$ is sufficient to ensure that $\hat s$ does not grow faster than $s_0$. From a practical standpoint, this condition also requires the user to know more about the design of the data in choosing an appropriate $t$. Choosing feasible thresholds which satisfy a similar condition to Condition 3 is considered in [24].

###### Theorem 3.

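For a sequence of datasets with parameters $\theta_0$ and thresholds $t$ satisfying Conditions 1-3, the bounds

$$\|\theta_0 - \hat\theta\|_2 = O\left(\sqrt{s_0 \log p / n}\right) \quad \text{and} \quad \|\theta_0 - \hat\theta\|_1 = O\left(\sqrt{s_0^2 \log p / n}\right)$$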

hold with probability

Finally, two direct consequences of Theorem 2 are bounds on the deviations $\|\theta_0 - \hat\theta\|_2$ and $\|\theta_0 - \hat\theta\|_1$ of $\hat\theta$ from the underlying unknown parameter $\theta_0$. Theorem 3 above shows that deviations of $\hat\theta$ from $\theta_0$ also achieve rates typically encountered in high-dimensional estimators like Lasso.
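The step from prediction error to parameter error can be sketched as follows. This is an outline under stated assumptions, not a reproduction of the paper's proof: $G_x$ is taken to be the empirical Gram matrix $E_n[x_i x_i']$, $\delta = \hat\theta - \theta_0$ is notation introduced here, and the relevant sparse eigenvalue is assumed to stay bounded away from zero.

```latex
% delta = \hat\theta - \theta_0 is supported on \hat S \cup S_0,
% a set of at most \hat s + s_0 indices (illustrative notation).
E_n[(x_i'\delta)^2] = \delta' G_x \delta
  \geq \varphi_{\min}(\hat s + s_0)(G_x)\, \|\delta\|_2^2
\;\Longrightarrow\;
\|\delta\|_2 \leq \varphi_{\min}(\hat s + s_0)(G_x)^{-1/2}\, E_n[(x_i'\delta)^2]^{1/2},
\qquad
\|\delta\|_1 \leq \sqrt{\hat s + s_0}\; \|\delta\|_2 .
```

Combined with a prediction error rate of order $\sqrt{s_0 \log p / n}$ and $\hat s \leq O(1) s_0$, the first inequality gives the $\ell_2$ rate and the Cauchy-Schwarz step contributes the extra factor of order $\sqrt{s_0}$ in the $\ell_1$ rate.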

## 4 Proof of Theorem 1

### Section 1

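This first section of the proof provides a bound on the prediction error norm given $\hat s$. First note that $\ell(\hat\theta) = \ell(\hat S)$. Note that $\ell(\hat S \cup S_0) \leq \ell(\theta_0)$ and that $-\Delta_j \ell(\hat S) \leq t$ for every $j \notin \hat S$ once the algorithm has stopped. In addition, by Lemma 3.3 of [14],

$$\ell(\hat S) - \ell(\hat S \cup S_0) \leq \varphi_{\min}(\hat s + s_0)(G_x)^{-1} \sum_{j \in S_0 \setminus \hat S} \left(-\Delta_j \ell(\hat S)\right) \leq s_0 t\, \varphi_{\min}(\hat s + s_0)(G_x)^{-1}.$$

This gives

$$\ell(\hat\theta) \leq \ell(\theta_0) + s_0 t\, \varphi_{\min}(\hat s + s_0)(G_x)^{-1}.$$

Expanding the above two quadratics in $\ell(\cdot)$ gives

$$E_n[(x_i'\theta_0 - x_i'\hat\theta)^2] \leq \left|2 E_n[\varepsilon_i x_i'(\hat\theta - \theta_0)]\right| + s_0 t\, \varphi_{\min}(\hat s + s_0)(G_x)^{-1} \leq 2 \|E_n[\varepsilon_i x_i']\|_\infty \|\theta_0 - \hat\theta\|_1 + s_0 t\, \varphi_{\min}(\hat s + s_0)(G_x)^{-1}.$$

To bound $\|\theta_0 - \hat\theta\|_1$:

$$\|\theta_0 - \hat\theta\|_1 \leq \sqrt{\hat s + s_0}\, \|\theta_0 - \hat\theta\|_2 \leq \sqrt{\hat s + s_0}\, \varphi_{\min}(\hat s + s_0)(G_x)^{-1} E_n[(x_i'\theta_0 - x_i'\hat\theta)^2]^{1/2}.$$

Combining the above bounds and dividing by $E_n[(x_i'\theta_0 - x_i'\hat\theta)^2]^{1/2}$ gives

$$E_n[(x_i'\theta_0 - x_i'\hat\theta)^2]^{1/2} \leq 2 \|E_n[\varepsilon_i x_i']\|_\infty \sqrt{\hat s + s_0}\, \varphi_{\min}(\hat s + s_0)(G_x)^{-1} + \frac{s_0 t\, \varphi_{\min}(\hat s + s_0)(G_x)^{-1}}{E_n[(x_i'\theta_0 - x_i'\hat\theta)^2]^{1/2}}.$$

Finally, either $E_n[(x_i'\theta_0 - x_i'\hat\theta)^2]^{1/2} \leq \sqrt{s_0 t\, \varphi_{\min}(\hat s + s_0)(G_x)^{-1}}$, which is $\leq C_1$, in which case the first statement of Theorem 1 holds, or $E_n[(x_i'\theta_0 - x_i'\hat\theta)^2]^{1/2} > \sqrt{s_0 t\, \varphi_{\min}(\hat s + s_0)(G_x)^{-1}}$, in which case

$$E_n[(x_i'\theta_0 - x_i'\hat\theta)^2]^{1/2} \leq 2 \|E_n[\varepsilon_i x_i']\|_\infty \sqrt{\hat s + s_0}\, \varphi_{\min}(\hat s + s_0)(G_x)^{-1} + \sqrt{s_0 t\, \varphi_{\min}(\hat s + s_0)(G_x)^{-1}} \leq C_1$$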

and the first statement of Theorem 1 also holds.
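The expansion step used in this section can be written out explicitly. The following is a short worked identity under the model $y_i = x_i'\theta_0 + \varepsilon_i$, included as a reading aid rather than as part of the original argument.

```latex
% Uses y_i - x_i'\hat\theta = \varepsilon_i - x_i'(\hat\theta - \theta_0)
% and \ell(\theta_0) = E_n[\varepsilon_i^2].
\ell(\hat\theta) - \ell(\theta_0)
  = E_n[(x_i'\theta_0 - x_i'\hat\theta)^2]
    - 2\, E_n[\varepsilon_i x_i'(\hat\theta - \theta_0)],
\qquad
\left| E_n[\varepsilon_i x_i'(\hat\theta - \theta_0)] \right|
  \leq \|E_n[\varepsilon_i x_i']\|_\infty \, \|\hat\theta - \theta_0\|_1 .
```

The first identity turns the bound on the difference in losses into a bound on the squared prediction error norm, and the second is Hölder's inequality, which is where the sup-norm of the score enters $C_1$.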

### Section 2

This section of the proof defines true and false covariates, introduces a convenient orthogonalization of all selected covariates, and associates to each false selected covariate a parameter on which the analysis is based.

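Let $x_j$ be the vector in $\mathbb{R}^n$ with components $x_{ij}$ stacked vertically. Similarly, define $\varepsilon$ and $y$. Let $v_k$, $k = 1, \dots, s_0$, denote true covariates which are defined as the vectors $x_j$ for $j \in S_0$. Define false covariates simply as those which do not belong to $S_0$.

Consider any point in time in the forward regression algorithm when there are $m$ false covariates selected into the model. These falsely selected covariates are denoted $w_1, \dots, w_m$, each in $\mathbb{R}^n$, ordered according to the order they were selected.

The true covariates are also ordered according to the order they are selected into the model. Any true covariates unselected after the $m$th false covariate selection are temporarily ordered arbitrarily at the end of the list. Let $M_k$ be projection in $\mathbb{R}^n$ onto the space orthogonal to $\mathrm{span}(\{v_1, \dots, v_k\})$. Let

$$\tilde v_k = \frac{M_{k-1} v_k}{(v_k' M_{k-1} v_k)^{1/2}} \quad \text{for } k = 1, \dots, s_0,$$

$$\tilde\varepsilon = \frac{M_{s_0} \varepsilon}{(\varepsilon' M_{s_0} \varepsilon)^{1/2}}.$$

Let $\tilde V_{\mathrm{temp}} = [\tilde v_1, \dots, \tilde v_{s_0}]$, ordered according to the temporary order. Note that there is $\tilde\theta_{\mathrm{temp}} \in \mathbb{R}^{s_0}$ and $\tilde\theta_{\tilde\varepsilon} \in \mathbb{R}$ such that

$$\tilde V_{\mathrm{temp}} \tilde\theta_{\mathrm{temp}} + \tilde\theta_{\tilde\varepsilon}\, \tilde\varepsilon = y.$$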
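The orthogonalization above is a sequential Gram-Schmidt step, which the following sketch makes concrete. The function name `orthogonalize` is hypothetical and the columns of the input are assumed to be linearly independent; each projection is applied twice purely for numerical stability.

```python
import numpy as np


def orthogonalize(V):
    """Sequentially residualize and normalize the columns of V (Gram-Schmidt)."""
    n, k = V.shape
    V_tilde = np.zeros((n, k))
    for j in range(k):
        v = V[:, j].copy()
        # Apply M_{j-1}: remove the part of v_j lying in the span of earlier columns.
        # The earlier columns of V_tilde are orthonormal and span the same space.
        for _ in range(2):
            v -= V_tilde[:, :j] @ (V_tilde[:, :j].T @ v)
        # (v_j' M_{j-1} v_j)^{1/2} equals the norm of M_{j-1} v_j.
        V_tilde[:, j] = v / np.sqrt(v @ v)
    return V_tilde
```

With the orthonormal columns in hand, the coefficients in the decomposition of $y$ can be read off as inner products: the coefficient on each orthogonalized true covariate is its inner product with $y$, and the coefficient on the normalized residual disturbance is $(\varepsilon' M_{s_0} \varepsilon)^{1/2}$.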

At this time, reorder the true covariates. Let denote the index of the final true covariate selected into the model when the -th false covariate is selected. The variables maintain their original order. The unselected true covariates are reordered in such a way that under the new ordering, whenever . Also define consistent with the new ordering. Redefine by so that it is also consistent with the new ordering. Note that no new orthogonalization needs to be done.

For any set $S$, let $Q_S$ be projection onto the space orthogonal to $\mathrm{span}(S)$. For each selected covariate, $w_j$, set $S_{\text{pre-}w_j}$ to be the set of (both true and false) covariates selected prior to $w_j$. Define

$$\tilde w_j = c_j\, Q_{S_{\text{pre-}w_j}} w_j$$

where the normalization constants $c_j$ are defined in the next paragraph.

Each can be decomposed into components with and . The normalizations introduced above are then chosen so that .

Associate to each false covariate $w_j$ a vector $\tilde\gamma_j$, defined as the solution in $\mathbb{R}^{s_0}$ to the following equation

$$\tilde V \tilde\gamma_j = \tilde r_j.$$

Set . Assume without loss of generality that each component of is positive (since otherwise, the true covariates can just be multiplied by .) Also assume without loss of generality that .

### Section 3

This section provides upper bounds on quantities related to the defined above. The idea guiding the argument in the next sections is that if too many covariates are selected, then on average they must be correlated with each other since they must be correlated to . For a discussion of partial transitivity of correlation, see [33]. If the covariates are highly correlated amongst themselves, then must be very high. As a result, the sparse eigenvalues of can be used to upper bound the number of selections. Average correlations between covariates are tracked with the aid of the quantities .

Divide the set of false covariates into two sets $A_1$ and $A_2$ where

 A1={j:|~γj~ε|⩽t1/2n1/2(2ε′Ms0ε)1/2}, A2={j:|~γj~ε|>t1/2n1/2(2ε′Ms0ε)1/2}.

Sections 3 - 5 of the proof bound the number of elements in $A_1$. Section 6 of the proof bounds the number of elements in $A_2$.

Suppose the set contains total false selections. Collect these false selections into . Set . Decompose . Then . Since , it follows that the average inner product between the , given by :

$$\bar\rho = \frac{1}{m_1(m_1 - 1)} \sum_{j \neq l \in A_1} \tilde u_j' \tilde u_l,$$

must be bounded below by

$$\bar\rho \geq -\frac{1}{m_1 - 1}$$

due to the positive definiteness of . This implies an upper bound on the average off-diagonal term in since is a diagonal matrix. Since are orthonormal, the sum of all the elements of is given by . Since and since is a diagonal matrix, it must be the case that

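$$\frac{1}{m_1(m_1 - 1)} \sum_{j \neq l \in A_1} \tilde\gamma_j' \tilde\gamma_l = -\bar\rho.$$

Therefore,

$$-\bar\rho = \frac{1}{m_1(m_1 - 1)} \left( \Big\| \sum_{j \in A_1} \tilde\gamma_j \Big\|_2^2 - \sum_{j \in A_1} \|\tilde\gamma_j\|_2^2 \right) \leq \frac{1}{m_1 - 1}.$$

This implies that

$$\Big\| \sum_{j \in A_1} \tilde\gamma_j \Big\|_2^2 \leq m_1 + \sum_{j \in A_1} \|\tilde\gamma_j\|_2^2.$$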

Next, bound Note since is orthonormal. Note that is lower bounded by . This follows from the fact that you can associate to an element of a the inverse covariance matrix for and previously selected covariates. Therefore, . It follows that

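$$\max_{j \in A_1} \|\tilde\gamma_j\|_2^2 \leq \varphi_{\min}(m + s_0)(G_x)^{-1} - 1.$$

This then implies that

$$\Big\| \sum_{j \in A_1} \tilde\gamma_j \Big\|_2^2 \leq m_1\, \varphi_{\min}(m + s_0)(G_x)^{-1}.$$

The same argument as above can also be used to show that for any choice of signs, it is always the case that

$$\Big\| \sum_{j \in A_1} e_j \tilde\gamma_j \Big\|_2^2 \leq m_1\, \varphi_{\min}(m + s_0)(G_x)^{-1}.$$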
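The lower bound on the average inner product used in this section is a general fact about unit vectors: the Gram matrix is positive semidefinite, so the sum of all its entries is non-negative, which caps how negative the off-diagonal average can be. A small numerical check, assuming unit-norm columns and using a hypothetical helper name, might look like this:

```python
import numpy as np


def average_offdiagonal_inner_product(U):
    """Average of u_j' u_l over ordered pairs j != l, for the columns u_j of U."""
    m = U.shape[1]
    gram = U.T @ U
    # Sum of all Gram entries minus the diagonal leaves the off-diagonal sum.
    return float((gram.sum() - np.trace(gram)) / (m * (m - 1)))
```

For $m$ unit vectors the returned value is never below $-1/(m-1)$, and the bound is attained when the vectors sum to zero with equal pairwise angles, as for the centered and normalized standard basis vectors.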

### Section 4

Next search for a particular choice of signs which give a lower bound proportional to on the above term. Note that this will imply an upper bound on . The argument relies on Grothendieck’s inequality which is a theorem of functional analysis ([18], see for a review, [29]). For each , let be the set which contains those such that is selected before , but not before any other true covariate. Note that the sets are set empty if . Also, empty sums are set to zero. Define the following two matrices:

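$$
\Gamma = \begin{bmatrix}
\sum_{j \in A_{11}} \tilde\gamma_{j1} & \sum_{j \in A_{11}} \tilde\gamma_{j2} & \cdots & \sum_{j \in A_{11}} \tilde\gamma_{j s_0} \\
0 & \sum_{j \in A_{12}} \tilde\gamma_{j2} & \cdots & \sum_{j \in A_{12}} \tilde\gamma_{j s_0} \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sum_{j \in A_{1 s_0}} \tilde\gamma_{j s_0}
\end{bmatrix},
\qquad
B = \begin{bmatrix}
\tilde\theta_1 / \tilde\theta_1 & \tilde\theta_2 / \tilde\theta_1 & \cdots & \tilde\theta_{s_0} / \tilde\theta_1 \\
\tilde\theta_2 / \tilde\theta_1 & \tilde\theta_2 / \tilde\theta_2 & \cdots & \tilde\theta_{s_0} / \tilde\theta_2 \\
\vdots & \vdots & \ddots & \vdots \\
\tilde\theta_{s_0} / \tilde\theta_1 & \tilde\theta_{s_0} / \tilde\theta_2 & \cdots & \tilde\theta_{s_0} / \tilde\theta_{s_0}
\end{bmatrix}
$$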

Note that the th row of is equal to since the orthogonalization process had enforced for each . Therefore, the diagonal elements of the product satisfy the equality

$$[\Gamma B]_{k,k} = \sum_{j \in A_{1k}} \tilde\gamma_j' \tilde\theta / \tilde\theta_k.$$

Let be constants such that

 ~γ′j~θ/~θk⩾C1

for , and

 ~θk/~θl⩾C2

for These key constants are calculated explicitly in Section 5 of the proof. They imply that

$$[\Gamma B]_{k,k} \geq C_1 |A_{1k}| \quad \text{and} \quad \mathrm{tr}(\Gamma B) \geq C_1 m_1.$$

Further observe that whenever for each , assuming without loss of generality that , that is positive semidefinite. This can checked by constructing auxiliary random variables who have covariance matrix : inductively build a covariance matrix where the th random variable has covariance with the th random variable. Then has a positive definite symmetric matrix square root so let . Therefore, Note that the rows (and columns) of each have norm and therefore decomposes into a product where the rows of have norms bounded by . Therefore, let .

Consider the set

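$$G_{s_0} = \left\{ Z \in \mathbb{R}^{s_0 \times s_0} : Z_{ij} = X_i' Y_j \text{ for some } X_i, Y_j \in \mathbb{R}^{s},\ \|X_i\|_2, \|Y_j\|_2 \leq 1 \right\}$$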

and observe that Then this observation allows the use of Grothendieck’s inequality (using the exact form described in [19]) which gives

$$\max_{Z \in G_{s_0}} \mathrm{tr}(\Gamma Z) \leq K^R_G\, \|\Gamma'\|_{\infty \to 1}.$$

Here, is an absolute constant which is known to be less than 1.783. It does not depend on . Therefore, , which implies

$$(K^R_G)^{-1} C_3^{-1} C_1 m_1 \leq \|\Gamma'\|_{\infty \to 1}.$$

Therefore, there is such that . For this particular choice of , it follows that

Then by definition of , . In Section 3, it was noted that for any choice of signs . It follows that

$$s_0^{-1} (K^R_G)^{-2} C_3^{-2} C_1^2 m_1^2 \leq m_1\, \varphi_{\min}(m + s_0)(G_x)^{-1}$$

which yields the conclusion

$$m_1 \leq \varphi_{\min}(m + s_0)(G_x)^{-1} C_1^{-2} C_3^{2} (K^R_G)^2 s_0.$$
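The operator norm appearing in Grothendieck's inequality has a simple combinatorial description: the $\infty \to 1$ norm of a matrix is the largest $\ell_1$ norm of its image over sign vectors. The sketch below computes it by exhaustive search, which is exponential in the number of columns and therefore only meant as an illustration on tiny matrices; the function name is hypothetical.

```python
from itertools import product

import numpy as np


def norm_inf_to_one(A):
    """Brute-force ||A||_{inf->1}: the maximum of ||A e||_1 over sign vectors e."""
    n_cols = A.shape[1]
    best = 0.0
    for signs in product((-1.0, 1.0), repeat=n_cols):
        # The maximum of a convex function over the cube is attained at a vertex.
        best = max(best, float(np.abs(A @ np.array(signs)).sum()))
    return best
```

On small random instances one can compare `norm_inf_to_one(Gamma.T)` with the trace of `Gamma @ Z` for matrices `Z` built from inner products of unit vectors; Grothendieck's inequality says the ratio never exceeds the absolute constant, which is the step that lets a well-chosen sign vector stand in for the matrix $B$.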

### Section 5

It is left to calculate which lower bound for and for . A simple derivation can be made to show that the incremental decrease in empirical loss from the th false selection is

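$$-\Delta_j \ell(S_{\text{pre-}w_j}) = \frac{1}{n}\, y' \tilde w_j (\tilde w_j' \tilde w_j)^{-1} \tilde w_j' y = \frac{1}{n} \frac{1}{\tilde w_j' \tilde w_j} \left( \tilde\theta' \tilde\gamma_j + \tilde\theta_{\tilde\varepsilon}\, \tilde\gamma_{j \tilde\varepsilon} \right)^2$$

Note the slight abuse of notation in signifying change in loss under inclusion of $w_j$ rather than $j$. Next,

$$\left( \tilde\theta' \tilde\gamma_j + \tilde\theta_{\tilde\varepsilon}\, \tilde\gamma_{j \tilde\varepsilon} \right)^2 \leq 2 (\tilde\theta' \tilde\gamma_j)^2 + 2 (\tilde\theta_{\tilde\varepsilon}\, \tilde\gamma_{j \tilde\varepsilon})^2$$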

Since , , and it follows that

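$$\frac{1}{n} \frac{1}{\tilde w_j' \tilde w_j} (\tilde\theta_{\tilde\varepsilon}\, \tilde\gamma_{j \tilde\varepsilon})^2 \leq \frac{1}{n} \frac{1}{\tilde w_j' \tilde w_j}\, \tilde\theta_{\tilde\varepsilon}^2 \left( \frac{t^{1/2} n^{1/2}}{2 (\varepsilon' M_{s_0} \varepsilon)^{1/2}} \right)^2 \leq \frac{t}{4}.$$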

This implies

 12(−Δjℓ(Spre-wj))⩽1n1~w′j