Factor models and variable selection in high-dimensional regression analysis

Factor models and variable selection in high-dimensional regression analysis

[ [    [ [ Universität Bonn and Université Paul Sabatier Statistische Abteilung
Department of Economics
and Hausdorff
Center for Mathematics
Universität Bonn
53113 Bonn
Germany
Institut de Mathématiques
Laboratoire de Statistique
et Probabilités
Université Paul Sabatier
UMR 5219
118, Route de Narbonne
31062 Toulouse Cedex
France
\smonth3 \syear2011
\smonth3 \syear2011
Abstract

The paper considers linear regression problems where the number of predictor variables is possibly larger than the sample size. The basic motivation of the study is to combine the points of view of model selection and functional regression by using a factor approach: it is assumed that the predictor vector can be decomposed into a sum of two uncorrelated random components reflecting common factors and specific variabilities of the explanatory variables. It is shown that the traditional assumption of a sparse vector of parameters is restrictive in this context. Common factors may possess a significant influence on the response variable which cannot be captured by the specific effects of a small number of individual variables. We therefore propose to include principal components as additional explanatory variables in an augmented regression model. We give finite sample inequalities for estimates of these components. It is then shown that model selection procedures can be used to estimate the parameters of the augmented model, and we derive theoretical properties of the estimators. Finite sample performance is illustrated by a simulation study.

[
\kwd
\doi

10.1214/11-AOS905 \volume39 \issue5 2011 \firstpage2410 \lastpage2447 \newproclaimremRemark

\runtitle

Factor models and variable selection

{aug}

A]\fnmsAlois \snmKneip\correflabel=e1]akneip@uni-bonn.de and B]\fnmsPascal \snmSardalabel=e2]Pascal.Sarda@math.ups-tlse.fr

class=AMS] \kwd[Primary ]62J05 \kwd[; secondary ]62H25 \kwd62F12. Linear regression \kwdmodel selection \kwdfunctional regression \kwdfactor models.

1 Introduction

The starting point of our analysis is a high-dimensional linear regression model of the form

 Yi=\boldsβTXi+εi,i=1,…,n, (1)

where , , are i.i.d. random pairs with and . We will assume without loss of generality that for all . Furthermore, is a vector of parameters in and are centered i.i.d. real random variables independent with with . The dimension of the vector of parameters is assumed to be typically larger than the sample size .

Roughly speaking, model (1) comprises two main situations which have been considered independently in two separate branches of statistical literature. On one side, there is the situation where represents a (high-dimensional) vector of different predictor variables. Another situation arises when the regressors are discretizations (e.g., at different observations times) of a same curve. In this case model (1) represents a discrete version of an underlying continuous functional linear model. In the two setups, very different strategies for estimating have been adopted, and underlying structural assumptions seem to be largely incompatible. In this paper we will study similarities and differences of these methodologies, and we will show that a combination of ideas developed in the two settings leads to new estimation procedures which may be useful in a number of important applications.

The first situation is studied in a large literature on model selection in high-dimensional regression. The basic structural assumptions can be described as follows:

• There is only a relatively small number of predictor variables with which have a significant influence on the outcome . In other words, the set of nonzero coefficients is sparse, .

• The correlations between different explanatory variables and , , are “sufficiently” weak.

The most popular procedures to identify and estimate nonzero coefficients are Lasso and the Dantzig selector. Some important references are Tibshirani (1996), Meinshausen and Bühlmann (2006), Zhao and Yu (2006), van de Geer (2008), Bickel, Ritov and Tsybakov (2009), Candes and Tao (2007) and Koltchinskii (2009). Much work in this domain is based on the assumption that the columns , , of the design matrix are almost orthogonal. For example, Candes and Tao (2007) require that “every set of columns with cardinality less than approximately behaves like an orthonormal system.” More general conditions have been introduced by Bickel, Ritov and Tsybakov (2009) or Zhou, van de Geer and Bülhmann (2009). The theoretical framework developed in these papers also allows one to study model selection for regressors with substantial amount of correlation, and it provides a basis for the approach presented in our paper.

In sharp contrast, the setup considered in the literature on functional regression rests upon a very different type of structural assumptions. We will consider the simplest case that for random functions observed at an equidistant grid . Structural assumptions on coefficients and correlations between variables can then be subsumed as follows:

• , where is a continuous slope function, and as , .

• There are very high correlations between explanatory variables and , . As , for any fixed .

Some important applications as well as theoretical results on functional linear regression are, for example, presented in Ramsay and Dalzell (1991), Cardot, Ferraty and Sarda (1999), Cuevas, Febrero and Fraiman (2002), Yao, Müller and Wang (2005), Cai and Hall (2006), Hall and Horowitz (2007), Cardot, Mas and Sarda (2007) and Crambes, Kneip and Sarda (2009). Obviously, in this setup no variable corresponding to a specific observation at grid point will possess a particulary high influence on , and there will exist a large number of small, but nonzero coefficients of size proportional to . One may argue that dimensionality reduction and therefore some underlying concept of “sparseness” is always necessary when dealing with high-dimensional problems. However, in functional regression sparseness is usually not assumed with respect to the coefficients , but the model is rewritten using a “sparse” expansion of the predictor functions .

The basic idea relies on the so-called Karhunen–Loève decomposition which provides a decomposition of random functions in terms of functional principal components of the covariance operator of . In the discretized case analyzed in this paper this amounts to consider an approximation of by the principal components of the covariance matrix . In practice, often a small number of principal components will suffice to achieve a small -error. An important points is now that even if the eigenvectors corresponding to the leading eigenvalues of can be well estimated by the eigenvectors (estimated principal components) of the empirical covariance matrix . This is due to the fact that if the predictors represent discretized values of a continuous functional variable, then for sufficiently small the eigenvalues will necessarily be of an order larger than and will thus exceed the magnitude of purely random components. From a more general point of view the underlying theory will be explained in detail in Section 4.

Based on this insight, the most frequently used approach in functional regression is to approximate in terms of the first estimated principal components , and to rely on the approximate model . Here, serves as smoothing parameter. The new coefficients are estimated by least squares, and . Resulting rates of convergence are given in Hall and Horowitz (2007).

The above arguments show that a suitable regression analysis will have to take into account the underlying structure of the explanatory variables . The basic motivation of this paper now is to combine the points of view of the above branches of literature in order to develop a new approach for model adjustment and variable selection in the practically important situation of strongly correlated regressors. More precisely, we will concentrate on factor models by assuming that the can be decomposed in the form

 Xi=Wi+Zi,i=1,…,n, (2)

where and are two uncorrelated random vectors in . The random vector is intended to describe high correlations of the while the components , , of are uncorrelated. This implies that the covariance matrix of adopts the decomposition

 \boldsΣ=\boldsΓ+\boldsΨ, (3)

where , while is a diagonal matrix with diagonal entries , .

Note that factor models can be found in any textbook on multivariate analysis and must be seen as one of the major tools in order to analyze samples of high-dimensional vectors. Also recall that a standard factor model is additionally based on the assumption that a finite number of factors suffices to approximate precisely. This means that the matrix only possesses nonzero eigenvalues. In the following we will more generally assume that a small number of eigenvectors of suffices to approximate with high accuracy.

We want to emphasize that the typical structural assumptions to be found in the literature on high-dimensional regression are special cases of (2). If and thus , we are in the situation of uncorrelated regressors which has been widely studied in the context of model selection. On the other hand, and thus reflect the structural assumption of functional regression.

In this paper we assume that as well as represent nonnegligible parts of the variance of . We believe that this approach may well describe the situation encountered in many relevant applications. Although standard factor models are usually considered in the case , (2) for large values of may be of particular interest in time series or spatial analysis. Indeed, factor models for large with a finite number of nonzero eigenvalues of play an important role in the econometric study of multiple time series and panel data. Some references are Forni and Lippi (1997), Forni et al. (2000), Stock and Watson (2002), Bernanke and Boivin (2003) and Bai (2003, 2009).

Our objective now is to study linear regression (1) with respect to explanatory variables which adopt decomposition (2). Each single variable , , then possesses a specific variability induced by and may thus explain some part of the outcome . One will, of course, assume that only few variables have a significant influence on which enforces the use of model selection procedures.

On the other hand, the term represents a common variability. Corresponding principal components quantify a simultaneous variation of many individual regressors. As a consequence, such principal components may possess some additional power for predicting which may go beyond the effects of individual variables. A rigorous discussion will be given in Section 3. We want to note that the concept of “latent variables,” embracing the common influence of a large group of individual variables, plays a prominent role in applied, parametric multivariate analysis.

These arguments motivate the main results of this paper. We propose to use an “augmented” regression model which includes principal components as additional explanatory variables. Established model selection procedures like the Dantzig selector or the Lasso can then be applied to estimate the nonzero coefficients of the augmented model. We then derive theoretical results providing bounds for the accuracy of the resulting estimators.

The paper is organized as follows: in Section 2 we formalize our setup. We show in Section 3 that the traditional sparseness assumption is restrictive and that a valid model may have to include principal components. The augmented model is thus introduced with an estimation procedure. Section 4 deals with the problem how accurately true principal components can be estimated from the sample . Finite sample inequalities are given, and we show that it is possible to obtain sensible estimates of those components which explain a considerable percentage of the total variance of all , . Section 5 focuses on theoretical properties of the augmented model, while in Section 6 we present simulation results illustrating the finite sample performance of our estimators.

2 The setup

We study regression of a response variable on a set of i.i.d. predictors , , which adopt decomposition (2) with , , , for all , . Throughout the following sections we additionally assume that there exist constants and such that with the following assumption (A.1) is satisfied for all : {longlist}[(A.1)]

for all . Recall that is the covariance matrix of with , where and is a diagonal matrix with diagonal entries , . We denote as the empirical covariance matrix based on the sample , .

Eigenvalues and eigenvectors of the standardized matrices and will play a central role. We will use and to denote the eigenvalues of and , respectively, while and denote corresponding orthonormal eigenvectors. Note that all eigenvectors of and (or and ) are identical, while eigenvalues differ by the factor . Standardization is important to establish convergence results for large , since the largest eigenvalues of tend to infinity as .

From a conceptional point of view we will concentrate on the case that is large compared to . Another crucial, qualitative assumption characterizing our approach is the dimensionality reduction of using a small number of eigenvectors (principal components) of such that (in a good approximation) . We also assume that . Then all leading principal components of corresponding to the largest eigenvalues explain a considerable percentage of the total variance of and .

Indeed, if , we necessarily have and . Then , and the first principal component of explains a considerable proportion of the total variance of .

We want to emphasize that this situation is very different from the setup which is usually considered in the literature on the analysis of high-dimensional covariance matrices; see, for example, Bickel and Levina (2008). It is then assumed that the variables of interest are only weakly correlated and that the largest eigenvalue of the corresponding scaled covariance matrix is of order . This means that for large the first principal component only explains a negligible percentage of the total variance of , . It is well known that in this case no consistent estimates of eigenvalues and principal components can be obtained from an eigen-decomposition of .

However, we will show in Section 4 that principal components which are able to explain a considerable proportion of total variance can be estimated consistently. These components will be an intrinsic part of the augmented model presented in Section 3.

We will need a further assumption which ensures that all covariances between the different variables are well approximated by their empirical counterparts: {longlist}[(A.2)]

There exists a such that

 sup1≤j,l≤p∣∣∣1nn∑i=1WijWil−cov(Wij,Wil)∣∣∣ ≤ C0√logpn, (4) sup1≤j,l≤p∣∣∣1nn∑i=1ZijZil−cov(Zij,Zil)∣∣∣ ≤ C0√logpn, (5) sup1≤j,l≤p∣∣∣1nn∑i=1ZijWil∣∣∣ ≤ C0√logpn, (6) sup1≤j,l≤p∣∣∣1nn∑i=1XijXil−cov(Xij,Xil)∣∣∣ ≤ C0√logpn (7)

hold simultaneously with probability , where as , .

The following proposition provides a general sufficient condition on random vectors for which (A.2) is satisfied provided that the rate of convergence of to 0 is sufficiently fast.

Proposition 1

Consider independent and identically distributed random vectors , , such that for , and

 E(ea|Vij|)≤C1 (8)

for positive constants and with moreover . Then, for any positive constant such that and

 P(sup1≤j,l≤p∣∣∣1nn∑i=1VijVil−cov(Vij,Vil)∣∣∣≤C0√logpn) (9) ≥1−p2−C20/(8(C1+C3/20/3))+2p2nC1e−(a/√2)(C0n/logp)1/4.

Note that as the right-hand side of (1) converges to 1 provided that is chosen sufficiently large and that for some . Therefore, assumption (A.2) is satisfied if the components of the random variables possess some exponential moments. For the specific case of centered normally distributed random variables, a more precise bound in (1) may be obtained using Lemma 2.5 in Zhou, van de Geer and Bülhmann (2009) and large deviations inequalities obtained by Zhou, Lafferty and Wassermn (2008). In this case it may also be shown that for sufficiently large events (4)–(7) hold with probability tending to 1 as without any restriction on the quotient . Of course, the rate in (4)–(7) depends on the tails of the distributions: it would be possible to replace this rate with a slower one in case of heavier tails than in Proposition 1. Our theoretical results could be modified accordingly.

3 The augmented model

Let us now consider the structural model (2) more closely. It implies that the vector of predictors can be decomposed into two uncorrelated random vectors and . Each of these two components separately may possess a significant influence on the response variable . Indeed, if and were known, a possibly substantial improvement of model (1) would consist in a regression of on the variables and

 Yi=p∑j=1β∗jWij+p∑j=1βjZij+εi,i=1,…,n, (10)

with different sets of parameters and , , for each contributor. We here again assume that , , are centered i.i.d. random variables with which are independent of and .

By definition, and possess substantially different interpretations. describes the part of which is uncorrelated with all other variables. A nonzero coefficient then means that the variation of has a specific effect on . We will of course assume that such nonzero coefficients are sparse, for some . The true variables are unknown, but with model (10) can obviously be rewritten in the form

 Yi=p∑j=1β∗∗jWij+p∑j=1βjXij+εi,i=1,…,n. (11)

The variables are heavily correlated. It therefore does not make any sense to assume that for some any particular variable possesses a specific influence on the predictor variable. However, the term may represent an important, common effect of all predictor variables. The vectors can obviously be rewritten in terms of principal components. Let us recall that denote the eigenvalues of the standardized covariance matrix of , and corresponding orthonormal eigenvectors. We have

 Wi=p∑r=1(\boldsψTrWi)\boldsψrandp∑j=1β∗∗jWij=p∑r=1α∗r(\boldsψTrWi),

where . As outlined in the previous sections we now assume that the use of principal components allows for a considerable reduction of dimensionality, and that a small number of leading principal components will suffice to describe the effects of the variable . This may be seen as an analogue of the sparseness assumption made for the . More precisely, subsequent analysis will be based on the assumption that the following augmented model holds for some suitable :

 Yi=k∑r=1αrξir+p∑j=1βjXij+εi, (12)

where and . The use of instead of is motivated by the fact that , . Therefore the are standardized variables with . Fitting an augmented model requires us to select an appropriate as well as to determine sensible estimates of . Furthermore, model selection procedures like Lasso or Dantzig have to be applied in order to retrieve the nonzero coefficients , , and , . These issues will be addressed in subsequent sections.

Obviously, the augmented model may be considered as a synthesis of the standard type of models proposed in the literature on functional regression and model selection. It generalizes the classical multivariate linear regression model (1). If a -factor model holds exactly, that is, , then the only substantial restriction of (10)–(12) consists in the assumption that depends linearly on and .

We want to emphasize, however, that our analysis does not require the validity of a -factor model. It is only assumed that there exists “some” and satisfying our assumptions which lead to (12) for a sparse set of coefficients .

3.1 Identifiability

Let and . Since , , are eigenvectors of we have for all , . By assumption the random vectors and are uncorrelated, and hence for all . Furthermore, for all . If the augmented model (12) holds, some straightforward computations then show that under (A.1) for any alternative set of coefficients , ,

 E([k∑r=1(αr−α∗r)ξir+p∑j=1(βj−β∗j)Xij]2) ≥k∑r=1(αr−α∗r+√pλr\boldsψr(\boldsβ−\boldsβ∗))2 (13) +D1∥\boldsβ−\boldsβ∗∥22.

We can conclude that the coefficients , and , , in (12) are uniquely determined.

Of course, an inherent difficulty of (12) consists of the fact that it contains the unobserved, “latent” variables . To study this problem, first recall that our setup imposes the decomposition (3) of the covariance matrix of . If a factor model with factors holds exactly, then the possesses rank . It then follows from well-established results in multivariate analysis that if the matrices and are uniquely identified. If , then also are uniquely determined (up to sign) from the structure of .

However, for large , identification is possible under even more general conditions. It is not necessary that a -factor model holds exactly. We only need an additional assumption on the magnitude of the eigenvalues of defining the principal components of to be considered.

{longlist}

[(A.3)]

The eigenvalues of are such that

 minj,l≤k,j≠l|λj−λl|≥v(k),minj≤kλj≥v(k)

for some with .

In the following we will qualitatively assume that as well as . More specific assumptions will be made in the sequel. Note that eigenvectors are only unique up to sign changes. In the following we will always assume that the right “versions” are used. This will go without saying.

Theorem 1

Let and . Under assumptions (A.1) and (A.2) we have for all , and all satisfying (A.3):

 |μr−λr| ≤ D2p, (14) ∥\boldsψr−\boldsδr∥2 ≤ 2D2pv(k), (15) δ2rj ≤ D0pv(k), E([ξir−ξ∗ir]2) ≤ D2pμr+(8λ1+1)D22p2v(k)2μr, (16) E([ξir−\boldsψTrXi√pλr]2) ≤ D2pλr, E([k∑r=1αrξir−k∑r=1αrξ∗ir]2) ≤ kk∑r=1α2r(D2pμr+(8λ1+1)D22p2v(k)2μr). (17)

For small , standard factor analysis uses special algorithms in order to identify . The theorem tells us that for large this is unnecessary since then the eigenvectors of provide a good approximation. The predictor of possesses an error of order . The error decreases as increases, and thus yields a good approximation of if is large. Indeed, if [for fixed , , and ] then by (16) we have . Furthermore, by (17) the error in predicting by converges to zero as .

A crucial prerequisite for a reasonable analysis of the model is sparseness of the coefficients . Note that if is large compared to , then by (17) the error in replacing by is negligible compared to the estimation error induced by the existence of the error terms . If and , then the true coefficients and provide a sparse solution of the regression problem.

Established theoretical results [see Bickel, Ritov and Tsybakov (2009)] show that under some regularity conditions (validity of the “restricted eigenvalue conditions”) model selection procedures allow to identify such sparse solutions even if there are multiple vectors of coefficients satisfying the normal equations. The latter is of course always the case if . Indeed, we will show in the following sections that factors can be consistently estimated from the data, and that a suitable application of Lasso or the Dantzig-selector leads to consistent estimators , satisfying , , as .

When replacing by , there are alternative sets of coefficients leading to the same prediction error as in (17). This is due to the fact that . However, all these alternative solutions are nonsparse and cannot be identified by Lasso or other procedures. In particular, it is easily seen that

 k∑r=1αrξ∗ir+p∑j=1βjXij=p∑j=1βLRjXij (19) \eqntextwith βLRj:=βj+k∑r=1αrδrj√pμr.

By (15) all values are of order . Since , this implies that many are nonzero. Therefore, if for some , then contains a large number of small, nonzero coefficients and is not at all sparse. If is large compared to no known estimation procedure will be able to provide consistent estimates of these coefficients.

Summarizing the above discussion we can conclude:

{longlist}

[(2)]

If the variables are heavily correlated and follow an approximate factor model, then one may reasonably expect substantial effects of the common, joint variation of all variables and, consequently, nonzero coefficients and in (10) and (12). But then a “bet on sparsity” is unjustifiable when dealing with the standard regression model (1). It follows from (19) that for large model (1) holds approximately for a nonsparse set of coefficients , since many small, nonzero coefficients are necessary in order to capture the effects of the common joint variation.

The augmented model offers a remedy to this problem by pooling possible effects of the joint variation using a small number of additional variables. Together with the familiar assumption of a small number of variables possessing a specific influence, this leads to a sparse model with at most nonzero coefficients which can be recovered from model selection procedures like Lasso or the Dantzig-selector.

In practice, even if (12) only holds approximately, since a too-small value of has been selected, it may be able to quantify at least some important part of the effects discussed above. Compared to an analysis based on a standard model (1), this may lead to a substantial improvement of model fit as well as to more reliable interpretations of significant variables.

3.2 Estimation

For a pre-specified we now define a procedure for estimating the components of the corresponding augmented model (12) from given data. This obviously specifies suitable procedures for approximating the unknown values as well as to apply subsequent model selection procedures in order to retrieve nonzero coefficients and , , . A discussion of the choice of can be found in the next section.

Recall from Theorem 1 that for large the eigenvectors of are well approximated by the eigenvectors of the standardized covariance matrix . This motivates us to use the empirical principal components of in order to determine estimates of and . Theoretical support will be given in the next section. Define as the eigenvalues of the standardized empirical covariance matrix , while are associated orthonormal eigenvectors. We then estimate by

 ˆξir=ˆ\boldsψTrXi/√pˆλr,r=1,…,k,i=1,…,n.

When replacing by in (12), a direct application of model selection procedures does not seem to be adequate, since and the predictor variables are heavily correlated. We therefore rely on a projected model. Consider the projection matrix on the orthogonal space of the space spanned by the eigenvectors corresponding to the largest eigenvalues of

 ˆPk=Ip−k∑r=1ˆ\boldsψrˆ\boldsψTr.

Then model (12) can be rewritten for ,

 Yi=k∑r=1˜αrˆξir+∑pj=1˜βj(ˆPkXi)j((1/n)∑ni=1(ˆPkXi)2j)1/2+˜εi+εi, (20)

where , and . It will be shown in the next section that for large and the additional error term can be assumed to be reasonably small.

In the following we will use to denote the vectors with entries . Furthermore, consider the -dimensional vector of predictors , . The Gram matrix in model (20) is a block matrix defined as

where is the identity matrix of size . Note that the normalization of the predictors in (20) implies that the diagonal elements of the Gram matrix above are equal to 1.

Arguing now that the vector of parameters in model (20) is -sparse, we may use a selection procedure to recover/estimate the nonnull parameters. In the following we will concentrate on the Lasso estimator introduced in Tibshirani (1996). For a pre-specified parameter , an estimator is then obtained as

 ˆ\boldsθ=argmin˜\boldsθ∈Rk+p1n∥Y−\boldsΦ˜\boldsθ∥2+2ρ∥˜\boldsθ∥1, (21)

being the -dimensional matrix with rows . We can alternatively use the Dantzig selector introduced in Candes and Tao (2007).

Finally, from , we define corresponding estimators for , , and , , in the unprojected model (12).

 ˆβj=ˆ˜βj((1/n)∑ni=1(ˆPkXi)2j)1/2,j=1,…,p,

and

 ˆαr=ˆ˜αr−√pˆλrp∑j=1ˆψrjˆβj,r=1,…,k.

4 High-dimensional factor analysis: Theoretical results

The following theorem shows that principal components which are able to explain a considerable proportion of total variance can be estimated consistently.

For simplicity, we will concentrate on the case that as well as are large enough such that {longlist}[(A.4)]

and .

Theorem 2

Under assumptions (A.1)–(A.4) and under events (4)–(7) we have for all and all ,

 |λr−ˆλr| ≤ D2p+C0(logp/n)1/2 (22) ∥\boldsψr−ˆ\boldsψr∥2 ≤ 2D2/p+C0(logp/n)1/2v(k), (23) ψ2rj ≤ D0−D1pλr≤D0−D1pv(k), (24) ˆψ2rj ≤ D0+C0(logp/n)1/2pˆλr ≤ 65D