Jianqing Fan11footnotemark: 1,  Bai Jiang,   and  Qiang Sun Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544; E-mail: jqfan@princeton.edu, baij@princeton.edu.Department of Statistical Sciences, University of Toronto, Toronto, ON M5S 3G3; E-mail: qsun@utstat.toronto.edu.
January 31, 2019
###### Abstract

This paper investigates the high-dimensional linear regression with highly correlated covariates. In this setup, the traditional sparsity assumption on the regression coefficients often fails to hold, and consequently many model selection procedures do not work. To address this challenge, we model the variations of covariates by a factor structure. Specifically, strong correlations among covariates are explained by common factors and the remaining variations are interpreted as idiosyncratic components of each covariate. This leads to a factor-adjusted regression model with both common factors and idiosyncratic components as covariates. We generalize the traditional sparsity assumption accordingly and assume that all common factors but only a small number of idiosyncratic components contribute to the response. A Bayesian procedure with a spike-and-slab prior is then proposed for parameter estimation and model selection. Simulation studies show that our Bayesian method outperforms its lasso analogue, manifests insensitivity to the overestimates of the number of common factors, pays a negligible price in the no correlation case, and scales up well with increasing sample size, dimensionality and sparsity. Numerical results on a real dataset of U.S. bond risk premia and macroeconomic indicators lend strong support to our methodology.

keywords: factor model, Bayesian sparse regression, posterior convergence rate, model selection.

## 1 Introduction

High-dimensional linear models are useful for a wide arrays of economic problems (Fan et al., 2011; Belloni et al., 2012). These models typically assume the sparsity of regression coefficients, that is, only a small number of covariates have significant effects on the response. However, the explanatory variables in the panel of an economic dataset are often highly correlated due to the influence of latent common factors, rendering the sparsity assumption unreasonable and restrictive. To address this issue, this paper proposes a general regression model with a factor-adjusted sparsity assumption, and develops a Bayesian method for this model.

To motivate the factor-adjusted model and its corresponding methodology, we start with the standard linear regression model

 Yn×1=Xn×pβp×1+σεn×1, (1)

where is an response vector, is a design matrix of observations and covariates, is a -dimensional vector of regression coefficients, is an unknown standard deviation, and is an -dimensional standard Gaussian random vector, independent of . Without loss of generality, we assume and include no intercept term in the model. Of interest is the high-dimensional regime in which the dimensionality is much larger than the sample size .

This model has attracted intensive interests in the frequentist community (Tibshirani, 1996; Fan and Li, 2001; Candes and Tao, 2007; Fan and Lv, 2008; Zhang and Huang, 2008; Su and Candes, 2016, among others). All of these methods hinge on at least two basic assumptions. The first one assumes that the correlations between explanatory variables are sufficiently weak. Examples of this assumption are the mutual coherence condition (Donoho and Huo, 2001; Donoho and Elad, 2003; Donoho et al., 2006; Bunea et al., 2007), the irrepresentable condition (Zhao and Yu, 2006), the restricted eigenvalue condition (Bickel et al., 2009; Fan et al., 2018) and the uniform compatibility condition (Bühlmann and van de Geer, 2011, page 157). The second one, referred to as the sparsity assumption, assumes that only a small number of covariates contribute to the response. Formally, the sparsity, defined as , is much smaller than the dimensionality .

Nevertheless, the weak correlation conditions do not necessarily hold in many applications, especially those in economic and financial studies. In an economic or financial dataset, the explanatory variables, e.g., stock returns or macroeconomic indicators over a period of time, are often influenced by similar economic fundamentals and are thus heavily correlated due to the existence of co-movement patterns (Forbes and Rigobon, 2002; Stock and Watson, 2002). In the presence of such strong correlations introduced by common factors, one naturally expects strong effects of common factors on the response. If this is true, many covariates would have non-negligible effects on the response, rendering the traditional sparsity assumption in the standard regression model (1) ideologically unreasonable.

The above argument shows the necessity to take the correlation structure of explanatory variables into account and adjust the sparsity assumption accordingly. For this purpose, we consider using factor models (Stock and Watson, 2002; Bai, 2003; Bai and Ng, 2006; Fan et al., 2013) and assume that each datum (row) of the data matrix exhibits a decomposition of form

 xi=Bfi+ui, (2)

where is a unknown matrix of factor loading coefficients, is a -dimensional random vector of common factors, and is a -dimensional random vector of weakly-correlated idiosyncratic components, uncorrelated with . Without loss of generality, we assume , , and . Both common factors and idiosyncratic components are latent, but they are often estimated by using principal component analysis (PCA) (Bai, 2003; Fan et al., 2013; Wang and Fan, 2017). Model (2) embraces the well-known CAPM model (Sharpe, 1964; Lintner, 1975) and Fama-French model (Fama and French, 1993) as its special cases, with observable common factors. Let be the matrix of common factors, and be the matrix of idiosyncratic components. Then a more compact matrix form reads as

 X=FBT+U. (3)

Each covariate (column) in can be decomposed as a sum of two components and , reflecting the influence of common factors and idiosyncratic variations respectively.

Utilizing this factor structure (3), we generalize the standard sparse regression model (1) to a factor-adjusted sparse regression model of the form

 Yn×1=Fn×kαk×1+Un×pβp×1+σεn×1, (4)

where and are regression coefficient vectors of and , respectively. We assume that is dense (as it is usually low-dimensional) but is sparse. That is, all common factors but only a small number of idiosyncratic components of the original explanatory variables contribute to the response. A non-zero indicates that the covariate , excluding the strong correlation with other covariates, has a specific effect on the response. Compared to the traditional sparsity assumption, this factor-adjusted sparsity assumption is more tenable as the idiosyncratic components are weakly-correlated.

We remark that our generalized factor-adjusted regression model (4) covers the standard regression model (1) as a special case by restricting the side constraint that . Under this constraint, the factor-adjusted sparsity assumption imposed on regression coefficients of idiosyncratic components in model (4) coincides with the traditional sparsity assumption in model (1). Thus any statistical method for estimating model (4) would estimate model (1). Of course, when such a constraint is not enforced, model (4) provides more flexibility in the regression analysis than model (1).

Model (4) is similar but different from the factor-augmented regression or the augmented principal component regression of Stock and Watson (2002); Bai and Ng (2006). In the factor-augmented models, the factors are usually extracted from a large panel of data via PCA and used as a part of covariates, yet the other variables are introduced from outside of the panel. These models are typically low-dimensional. In contrast, model (4) takes idiosyncratic components as covariates, which are created internally from the panel of the data. This allows to explore additional explanatory power of the data. Our analyses of model (4) in the high-dimensional fashion are applicable to the low-dimensional factor-augmented regression models in the literature, as model (4) can easily incorporate external variables in the part of and/or . For simplicity of presentation, we omit the details.

Kneip and Sarda (2011) gave an insightful discussion on the limitation of the traditional sparse assumption in model (1) with factor-structured covariates and proposed a factor-augmented regression model. Nevertheless, they still need the weak correlation condition on the original covariates, which is unlikely to hold for factor-structured covariates. See equation (5.5) of Kneip and Sarda (2011). Fan et al. (2016) pointed out the failure of classical frequentist methods dealing with model (1) with factor-structured covariates, and proposed a frequentist method for estimating model (4). Specially, they estimated the latent common factors and idiosyncratic components, and then run frequentist sparse regression methods (e.g., lasso) on estimated common factors and idiosyncratic components. Similar to ours, they impose the weak correlation condition on idiosyncratic components, instead of the original covariates. See Example 3.2 of Fan et al. (2016).

This paper focuses on Bayesian solutions to model (4). As shown in Section 2, the fully Bayesian procedure cannot work easily due to the involvement of latent common factors and idiosyncratic components in the posterior computation. Inspired by Fan et al. (2016), we consider estimating these latent variables by PCA and running a Bayesian sparse regression method on their estimates. The arsenal of Bayesian sparse regression methods, including those exploiting shrinkage priors (e.g. Park and Casella, 2008; Polson and Scott, 2012; Armagan et al., 2013; Bhattacharya et al., 2015; Song and Liang, 2017) and those exploiting spike-and-slab priors (among others, Ishwaran and Rao, 2005; Narisetty and He, 2014; Castillo et al., 2015; Ročková and George, 2018), has been developed in parallel to the frequentist methods. However, it is unclear whether these methods would work on estimated common factors and idiosyncratic components in model (4). When it does work, it remains unknown whether the factor model estimation would incur any loss to the convergence rate or model selection consistency of the Bayesian sparse regression method. Given theoretical results in the frequentist setting, these questions are still challenging, because the definitions of estimation errors and technical conditions of frequentist and Bayesian methods are significantly different (Castillo et al., 2015). Even if a Bayesian sparse regression method is theoretically sound, it is unclear whether it performs better or worse than the frequentist methods on finite sample data. We would like answer these questions in the current paper.

Specifically, our Bayesian method imposes a slab prior on the regression coefficients of estimated common factors, and a spike-and-slab prior on the regression coefficients of estimated idiosyncratic components. This procedure results in a pseudo-posterior distribution, which differs from the exact posterior distribution obtained by a Bayesian regression on exact common factors and idiosyncratic components. Interestingly, the pseudo-posterior distribution achieves the contraction rate of the regression coefficients, which matches that of the exact posterior distribution. Byproducts of our analyses include the adaptivity to the unknown sparsity and the unknown standard deviation . We only need a type of sparse eigenvalue condition on the idiosyncratic components to overcome the non-identifiability issue of the parameters. This is easy to hold due to the weak correlation among idiosyncratic components. Moreover, by assuming a beta-min condition that is frequently used in the high-dimensional regression literature, we prove that our method consistently selects the support of the true sparse regression coefficients.

The rest of this paper proceeds as follows. In Section 2, we propose the Bayesian methodology for the factor-adjusted regression model (4). Section 3 establishes the contraction rates and model selection consistency of the pseudo-posterior distribution. These theoretical results rely on a high-level condition concerning the estimation of factor models, which is examined by Section 4. Section 5 presents experimental results on simulation datasets. Section 6 applies our method to a real dataset of U.S. bond risk premia and macroeconomic indicators. Section 7 is devoted to discussions. All technical proofs and algorithmic implementation are detailed in the appendices.

Notation. We write for a diagonal matrix of elements . For a symmetric matrix , we write its largest eigenvalue as and its smallest eigenvalue as . For a matrix , we write to denote its -th column, and lowercase to denote its -th row. For a index set , is the sub-matrix of assembling the columns indexed by . Let be the element-wise maximum norm of , let be its Frobenius norm. For a vector , let denote its sub-vector assembling components indexed by , and let denote its norm. For two sequences and , or means .

## 2 Model and Methodology

Our goal is to study the factor-adjusted regression model (4), in which both common and idiosyncratic components are unobserved, but are observed through (3). Each datum (row) in admits the factor structure (2) with therein identically distributed as . Note that are not necessarily independently distributed. The dimension of is fixed, but the dimension of may grow as increases. By decomposition, and are uncorrelated. Without loss of generality, we assume that , and . The regression coefficient vector of is sparse in the sense that is small. We allow to grow as increases, but require so that the desired contraction rate as . The Gaussian errors are independent from and .

An inherent difficulty for estimating model (4) is that both common factors and idiosyncratic components are unobserved. Therefore the first step is to estimate these unobserved variables. We follow Bai (2003); Fan et al. (2013); Wang and Fan (2017) and use PCA for this task. Let be the eigenvalues of . A natural estimator of is the concatenation of the square-root--scaled eigenvectors corresponding to the top eigenvalues of , denote by . That is,

where Then we estimate by

 ˆU=X−ˆFˆBT=(I−ˆFˆFT/n)X.

If is unknown, we may estimate by

 ˆk=argmaxk≤kmaxk-th % eigenvalue of XTX/n%$(k+1)$−theigenvalueof$XTX/n$, (5)

where is any prescribed upper bound for (Luo et al., 2009; Lam and Yao, 2012; Ahn and Horenstein, 2013).

After estimating unobserved variables, we propose a Bayesian sparse regression method for tasks of parameter estimation and model selection. Suppose we are given data generated from true parameter . Let be its running parameter. Let and be the support of and , respectively. We consider a hierarchical prior with a slab prior on the coefficients of common factors and a spike-and-slab prior on the coefficients of idiosyncratic components as follows:

 σ2∼g(σ2),α|σ2∼k∏j=11σh(αjσ),1{j∈ξ}∼Bernoulli(s0/p),βξ|σ2∼∏j∈ξ1τjσh(βjτjσ),   βξc|σ2=0, (6)

where is a positive continuous density function on , e.g., the inverse-gamma density; is a “slab” density function on in the sense that as , e.g., the Gaussian density and the Laplace density ; hyperparameters control the scales of running coefficients ; and, hyperparameter controls the sparsity of running models . For the scaling hyperparameters, we set so that the effects of possibly heterogeneous scales of ’s are appropriately adjusted. For the sparsity hyperparameter, we simply set in the simulation experiments. When dealing with a real dataset, one could choose an informative according to expertise knowledges in the specific area, or tune by sophisticated cross-validation or empirical Bayes procedures

The Bayesian sparse regression on response and regressors , with prior (6) obtains a pseudo-posterior distribution

 ˆπ(σ2,α,β|X,Y)=ˆπ(σ2,α,β|ˆF,ˆU,Y)∝π(σ2,α,β)N(Y|ˆFα+ˆUβ,σ2I), (7)

where denotes the -dimensional normal distribution with mean and covariance . We call it a “pseudo-posterior” distribution and put a hat over to emphasize that it differs from the exact posterior distributions , obtained by a Bayesian regression on observed , and , obtained by a fully Bayesian procedure.

It is worth noting that, even in the simplest setting in which are i.i.d. and are jointly independent, the exact posterior distribution given by a fully Bayesian procedure

 ∝π(σ2,α,β)∫N(Y|Fα+(X−FBT)β,σ2I)n∏i=1Pf(fi)Pu(xi−Bfi)dfi,

is computationally intractable due to the involvement of latent variables in the complicated integral. Thus a fully Bayesian procedure does not solve model (4) easily.

## 3 Theory

In this section, we show under commonly-seen assumptions for Bayesian sparse regression methods that the pseudo-posterior distribution (7) achieves the convergence rate of the estimation error for the coefficient vectors . This rate is so far the best rate Bayesian methods can achieve with observed (Song and Liang, 2017). We see that the factor adjustment added by our approach to the Bayesian sparse regression method incurs no loss in terms of estimation error rate. Byproducts of our analysis are the adaptivities of the pseudo-posterior distribution to the unknown sparsity and unknown standard deviation . Finally, when the beta-min condition holds, we establish the model selection consistency of the pseudo-posterior distribution (7).

### 3.1 Assumptions

In the high-dimensional regime , a common assumption is that is sparse of size . Following the sparse regression literature, we assume that such that the desired error rate as . To recover the sparse coefficient vector at rate , we need the following assumptions.

###### Assumption 1.

There exists a large integer and a constant such that

 minξ:|ξ|≤¯pλmin(UTξUξ/n)≥κ0

holds with probability approaching .

This assumption is commonly referred to as the sparse eigenvalue condition in the frequentist literature (Fan et al., 2018). In a recent study of Bayesian sparse regression with shrinkage priors, Song and Liang (2017) imposed the same assumption on original covariates . Here our assumption is imposed on their idiosyncratic components .

Our next assumption upper bounds the maximum eigenvalue of , which is the Gram matrix corresponding to the true model . Assumptions 1-2 together ensure that is well conditioned.

###### Assumption 2.

There exists a constant such that

holds with probability approaching .

Raskutti et al. (2010); Dobriban and Fan (2016) gave sufficient conditions for correlated covariates to satisfy Assumptions 1-2. These theories typically allow in Assumption 1. If consists of i.i.d. entries with zero mean, unit variance and only finite fourth moment, Assumption 2 holds by Bai-Yin theorem in the random matrix theory (Bai and Yin, 1988; Yin et al., 1988).

Since we feed a Bayesian sparse regression method with the estimated variables rather than the latent variables , it is necessary to control the error of . This goal is achieved by assumptions on the estimation errors of latent variables and the magnitudes of the true coefficient vectors. For the estimation error of the factor model, we impose a generic high-level condition as follows.

###### Assumption 3.

The latent common factors and idiosyncratic components can be estimated by and as follows.

 max1≤j≤k∥(ˆFH)j−Fj∥ =Op(√logp), max1≤j≤p∥ˆUj−Uj∥ =Op(√logp),

for some nearly orthogonal matrix such that and .

Since represents the eigenspace of the top eigenvalues of and mimics the column space of , there is a nearly-orthogonal transformation, represented by , between and . Next section will verify this error rate in factor models under standard assumptions.

Our last assumption requires constant orders of the true parameters .

###### Assumption 4.

is fixed, , and .

This condition is not restrictive. It holds if and only if the response variable has finite variance, under Assumptions 1-2. To see this point, note that the variance of a single response variable is

 Var(y)=∥α⋆∥2+(β⋆ξ⋆)TCov(uξ⋆)β⋆ξ⋆+σ⋆2,

where is the sub-vector of corresponding to the true model , and have all eigenvalues bounded away from and , due to Assumptions 1-2. Although our theoretical analyses need bounded magnitude of regression coefficients to avoid the amplification of estimation errors of latent variables, we remark here that, when the underlying true factors, ’s and ’s, and/or more accurate estimates are available, we can allow larger magnitudes of regression coefficients.

### 3.2 Definition of Posterior Contraction Rate

The definition of convergence rate in the Bayesian setting differs from that in the frequentist setting. We formally define it by following the classical Bayesian literature (Ghosal et al., 2000; Shen and Wasserman, 2001).

###### Definition 1 (Posterior contraction).

Consider a parametric model indexed by . Let be a sequence of data generations according to some true parameter . Let be a function of . Let be a loss function between the estimate and the parameter . A sequence of posterior distributions (random measures) is said to achieve convergence rate of estimation error if

 π(ℓ(γ(θ),γ⋆)≥Mϵn|Dn)→0

in -probability as for some constant .

Specifically in the factor-adjusted regression model (4) with covariates hidden in (3), we consider

 Dn=(X,Y),   θ=(B,σ,α,β),   γ(θ)=(αβ),   γ⋆=(Hα⋆β⋆),

where is introduced by Assumption 3, and want to show that achieves the contraction rate of estimation error

 ℓ(γ(θ),γ⋆)=∥γ(θ)−γ⋆∥=∥∥∥(αβ)−(Hα⋆β⋆)∥∥∥.

As noted on Assumption 3, approximates in the sense that they have almost the same column space and element-wisely for some nearly orthogonal transformation matrix . Thus the pseudo-posterior distribution would concentrate around such that .

### 3.3 Results

This subsection presents the main results of the paper. Recall that . Let

where are constants, and are supports of and , respectively, and is the cardinality of the set difference of and .

###### Theorem 1.

Let denote the probability measure under the true parameters. Under Assumptions 1-4, the following statements hold.

1. (estimation error rate) There exist constants and such that

 P⋆(ˆπ(Ac(σ⋆,Hα⋆,β⋆,M0,M1,M2,ϵn)|X,Y)≥e−C1slogp)→0

as .

2. (prediction error rate) There exist constants and such that

 P⋆(ˆπ(∥(ˆFα+ˆUβ)−(Fα⋆+Uβ⋆)∥≥σ⋆M3√nϵn|X,Y)≥e−C2slogp)→0

as .

3. (model selection consistency) If then there exist constants and such that

 P⋆(ˆπ(Ac(σ⋆,Hα⋆,β⋆,M0,M1,M2,ϵn)∪{ξ⊉ξ⋆}|X,Y)≥e−C3slogp)→0

as . It follows that

 P⋆(ˆπ({|ξ∖ξ⋆|≤M0s,ξ⊇ξ⋆}c|X,Y)≥e−C3slogp) →0 P⋆(ˆπ({j:|βj|≥σ√|ξ|logp/n}≠ξ⋆∣∣X,Y)≥e−C3slogp) →0

as .

Part (a) establishes the convergence rate of the -estimation error of (up to a nearly orthogonal transformation ) and , the adaptivity to the unknown sparsity , and the adaptivity to the unknown standard deviation .

Part (b) shows that predicts the conditional mean with mean squared error for each single datum instance on average.

The first implication in Part (c) asserts that the pseudo-posterior distribution will select all variables in and at most other variables, with high probability. In simulation experiments, we observe that the pseudo-posterior distribution overestimates the true support size by less than . The second implication asserts that

 ˆπ({j:|βj|≥σ√|ξ|logp/n}=ξ⋆∣∣X,Y)→1

in probability as , and therefore provides a variable selection rule. Simply speaking, we can consistently select the true model by thresholding the running coefficients at . In simulation experiments, the majority of pseudo-posterior samples of parameters hit the true model correctly even if the thresholding rule is not used.

The additional condition that in part (c) is called “beta-min condition” in the literature on Bayesian sparse regression (Castillo et al., 2015; Song and Liang, 2017). Narisetty and He (2014) use another identifiability condition to achieve the model selection consistency. Their condition can be shown slightly stronger than the beta-min condition in presence of the minimum sparse eigenvalue condition. To see this point, one can compare their Condition 4.4 to our equation (10) in the proof of Lemma A2, part(d).

## 4 Factor Model Estimation

This section verifies Assumption 3, which concerns the estimation errors of factor models under standard assumptions. Following Bickel and Levina (2008), we define a uniformity class of positive semi-definite matrices as follows

 S+q ={Σ≥0:max1≤j≤p∑i|Σij|q
###### Assumption 5.

are identically (not necessarily independently) distributed as . , ; , with for some , and .

###### Assumption 6.

All entries in the loading matrix are uniformly bounded, i.e., , and all the eigenvalues of is strictly bounded away from and .

###### Assumption 7.

The sample covariance matrices of and converge to the true covariance matrices at rate in the element-wise maximum norm.

 ∥FTF/n−I∥max=Op(√logp/n),∥UTU/n−Σ∥max=Op(√logp/n),∥FTU/n∥max=Op(√logp/n).

In Assumption 5, is made to avoid the non-identifiability issue of and . If rows of are i.i.d. copies of some -dimensional distribution then converges almost surely to as and Assumption 6 holds when has eigenvalues bounded away from 0 and . Assumptions 5-6 together characterize the “low-rank plus sparse” structure of the covariance matrix of . That is,

 Cov(x)=BBT+Σ,

where the first part is of low rank , and the second part is sparse in the sense that the quantity for some is . This decomposition has a “spike plus non-spike” structural interpretation as well: the smallest non-zero eigenvalue of is of order , while the largest eigenvalue of is of order . This eigen-gap plays the key role in estimating and .

Assumption 7 requires that the sample covariance , and converge to their ideal counterparts at an appropriate rate. Kneip and Sarda (2011) provided sufficient conditions for it to hold in case that are i.i.d.. Fan et al. (2013) established the same rate for stationary and weakly-correlated time-series. Our recent work on the concentration inequalities for general Markov chains (Jiang et al., 2018) can verify this assumption in case that are functionals of ergodic Markov chains.

Next theorem summarizes the theoretical results on factor model estimation under Assumptions 5-7. Part (b) of this theorem bounds the difference between column spaces of and in terms of principal angles, which is novel from the previous theory in the literature (Fan et al., 2013) and may be of independent interest. Parts (c) and (d), which are immediate corollaries of part (b), derive Assumption 3.

###### Definition 2.

The principal angles between two linear spaces spanned by orthonormal column vectors of and are defined as

 ∠(ˆΨ,˜Ψ)=(arccos(d1),…,arccos(dk))T,

where are the singular values of or .

###### Theorem 2.

Let consist of -scaled left singular vectors of , which are orthonormal vectors spanning the column space of . Under Assumptions 5-7, the following statements hold.

1. Eigenvalue recovery:

 ∥ˆΛ−Λ∥max/p=Op(√logp/n),   maxk+1≤k≤n|ˆλj|/p=Op(√logp/n).
2. Eigenspace recovery:

 ∥sin∠(ˆF/√n,˜F/√n)∥=Op(√logp/n).
3. Common factor recovery:

 ∥ˆFH−F∥F=Op(√logp),

for some nearly orthogonal matrix with and .

4. Idiosyncratic component recovery:

 max1≤j≤p∥ˆUj−Uj∥=O% p(√logp).

## 5 Simulation Experiments

This section reports simulation results. As a basic case, we set , and generate , and . We set true parameters , , , and .

For prior (6), we choose the inverse-gamma density with shape and scale , the Gaussian density and set hyperparameters and . Starting from , we iterate a Gibbs sampler times and drop the first iterations as the burn-in period. The implementation details of the Gibbs sampler is given in the appendix.

The pseudo-posterior distribution are evaluated in terms of five measures. The posterior mean of is compared to in terms of estimation error. The model selection rate and the sure screening rate are also computed. The former is the portion of the posterior samples that select the true model, i.e., , and the latter is the portion of the posterior samples that select all sparse coefficients, i.e., . To evaluate the adaptivity to unknown sparsity , we report the average model size . To evaluate the adaptivity to unknown standard deviation , the posterior mean of is compared to in terms of relative estimation error. These measures are evaluated over 100 replicates of the datasets, and their averages are reported.

For the comparison purpose, the factor-adjusted lasso method is implemented by using R package glmnet (Friedman et al., 2010). The -penalty hyperparameters of the lasso methods are tuned by 10-fold cross-validation. Since the generic Bayesian / lasso with as covariates can be seen as the factor-adjusted Bayesian / lasso with the underestimate of , we also include them in the comparison.

### 5.1 Comparison of four methods, and insensitivity to misestimates of k

Table 1 summarizes the five measures of four methods in the basic case. Results show that the factor-adjusted Bayesian method outperforms the factor-adjusted lasso method in the tasks of estimation and model selection. The poor performance of the factor-adjusted lasso method may partly result from the less satisfactory hyperparameter tuning procedure implemented in the R package glmnet.

We feed the factor-adjusted methods with the various estimates , and observe that their performances are insensitive to the overestimate of (Table 1). In case that there is no correlation among , i.e., , the factor-adjusted Bayesian method performs slightly worse than the generic Bayesian method (Table 2).

We emphasize that the meaning of the model selection rate for the Bayesian methods are slightly different from that for the frequentist methods. For example, 50% model selection rate given by a frequentist method means that it select the true sparse model in 50 out of 100 replicates of the dataset. In contrast, 90% model selection rate given by a Bayesian method means that every 9 of 10 posterior samples of parameters hit the true sparse model in a single replicate of the dataset on average. In the simulation experiments reported by Tables 1-2, at least every 7 of 10 pseudo-posterior samples obtained by our method hit the true sparse model in each of 100 replicates of the dataset.

### 5.2 Scalability as n,p,s increase

We vary the sample size , the dimensionality and the sparsity in the basic case, and test the scalability of the proposed methodology.

In Figure 1(a), we fix all parameters in the basic case but vary , , , , , . In Figure 1(b), we fix all parameters in the basic case but vary , , , , , . In Figure 1(c), we fix all parameters in the basic case but vary , , , , , , , . For factor-adjusted methods, are used.

We observe that our method outperforms the other three methods in terms of estimation error and model selection rate under each combination of , and achieves comparable relative error of to the factor-adjusted lasso method.

### 5.3 Estimating the standard regression model

Recall that, when admits a factor structure (3), the standard regression model (1) is a special case of the factor-adjusted regression model (4) with the parameter constraint . We expect that the factor-adjusted Bayesian method solves model (1) as well. To verify this expectation, we set (or equivalently ), and test four methods on simulation datasets.

We see that the factor-adjusted Bayesian method does solve model (1). Interestingly, while the factor adjustment added to the lasso method significantly increases the model selection rate from 13% to roughly 50%, the generic Bayesian method works comparably well or even better than the factor-adjusted Bayesian method. We will discuss this phenomenon in the discussion section.

## 6 Predicting U.S. Bond Risk Premia

This section applies our method to predict U.S. bond risk premia with a large panel of macroeconomic variables. The response variables are monthly U.S. bond risk premia with maturity of years spanning the period from January, 1964 to December, 2003 (Ludvigson and Ng, 2009). The -year bond risk premium at period is defined as the (log) holding return from buying an -year bond at period and selling it as an -year boud at period , excessing the (log) return on one-year bond bought at period . The covariates are macroeconomic variables collected in the FRED-MD database (McCracken and Ng, 2016) during the same period.

The covariates over 480 months are strongly correlated. The scree plot of PCA of these covariates (Figure 2) shows that the first principal component accounts for 55.9% of the total variance, and that the first 5 principal components account for 89.7% of the total variance.

We consider the rolling window regression and next value prediction. Specifically, we regress a U.S. bond risk premium on the macroeconomic variables in the last month. For each time window of size ahead of month , we fit

 yi=f(xi−1)+σεi,   i=t−n,…,t−1,

and do out-of-sample prediction

 ˆyt=ˆf(xt−1).

The regression function is fitted as by one of the generic lasso method, the factor-adjusted lasso method, the generic Bayesian method, the factor-adjusted Bayesian method and the principal component regression (PCR) method (Ludvigson and Ng, 2009). For the factor-adjusted methods, the number of common factors is estimated by (5). For the Bayesian methods, we set in prior (6). For PCR, the top eight principal components are included in the regression model in a similar vein to (Ludvigson and Ng, 2009). The R package pls (Wehrens and Mevik, 2007) is used for implementation of PCR.

The prediction performance is evaluated by the out-of-sample , which is computed as follows.

 R2=1−∑480t=n+2(ˆyt−yt)2∑480t=n+2(¯yt−yt)2,

where is one of two-year, three-year, four-year and five-year U.S. bond risk premia, is the prediction of by one of five methods in comparison, and is the average of .

Table 4 summarizes the out-of-sample five methods achieve on this task. Table 5 reports the average size of the sparse models they select. We observe that the factor-adjusted Bayesian method together with the factor-adjusted Bayesian method achieve higher out-of-sample than other methods. But the factor-adjusted Bayesian method select much sparser models than the factor-adjusted lasso method.

## 7 Discussion

We propose a factor-adjusted regression model to handle the linear relationship between the response variable and possibly highly correlated covariates. We decompose the predictors into common factors and idiosyncratic components, where the common factors explain most of the variations, and assume all common factors but a small number of idiosyncratic components contribute to the response. The corresponding Bayesian methodology is then developed for estimating such a model. Theoretical results suggest that the proposed methodology can consistently estimate the factor-adjusted model and thus obtain consistent predictions, under an easily-to-hold sparse eigenvalue condition on the idiosyncratic components instead of the original covariates.

Our factor-adjusted model covers the standard linear model as a sub-model with the side constraint. Thus, our proposed methodology can easily handle the case when the standard linear regression model is assumed to be the underlying model. In simulation studies on the sub-model, we find that the factor adjustment greatly improves the performance of lasso, while the generic Bayesian sparse regression is comparable to the factor-adjusted Bayesian sparse regression in terms of estimation error and model selection rate (Table 3). This suggests a fundamental difference between the frequentist sparse regression method and the Bayesian sparse regression method. Indeed, one can prove under Assumptions 1-2, 5-7 that

 minξ:|ξ∖ξ⋆|≤M0sλmin(XTξXξ/n) ≥κ0−Op(s2logp/n) λmax(XTξ⋆Xξ⋆/n) =Op(s).

If , these two terms are of constant order, and then a similar argument to the proof of Theorem 1 would establish the convergence and model selection consistency of the generic Bayesian regression on standard regression model (1).

Nonetheless, we recommend the factor-adjusted Bayesian regression on model (4) over the generic Bayesian regression on model (1) for three reasons. First, the theoretical analyses of the former allow to grow with , in contrast the latter requires fixed . Second, model (4) provides more flexibility than its sub-model (1) in the regression analyses and would potentially explore more explanatory power from the data. On the real dataset of U.S. bond risk premia, the factor adjusted Bayesian regression achieves 1.0%-3.0% more out-of-sample with one or two less variables (Tables 4-5). Third, in the no correlation case (although it is unlike the case in practice), the factor-adjusted Bayesian regression pays a negligible price for model misspecification (Table 2).

## Acknowledgement

We would like to thank Yun Yang for helpful discussions.

## References

• Ahn and Horenstein (2013) Ahn, S. C. and Horenstein, A. R. (2013). Eigenvalue ratio test for the number of factors. Econometrica 81 1203–1227.
• Armagan et al. (2013) Armagan, A., Dunson, D. B. and Lee, J. (2013). Generalized double pareto shrinkage. Statistica Sinica 23 119.
• Bai (2003) Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica 71 135–171.
• Bai and Ng (2006) Bai, J. and Ng, S. (2006). Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica 74 1133–1150.
• Bai and Yin (1988) Bai, Z.-D. and Yin, Y.-Q. (1988). Necessary and sufficient conditions for almost sure convergence of the largest eigenvalue of a wigner matrix. Annals of Probability 1729–1741.
• Barron (1998) Barron, A. R. (1998). Information-theoretic characterization of bayes performance and the choice of priors in parametric and nonparametric problems. Bayesian Statistics 6 27–52.
• Belloni et al. (2012) Belloni, A., Chen, D., Chernozhukov, V. and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80 2369–2429.
• Bhattacharya et al. (2015) Bhattacharya, A., Pati, D., Pillai, N. S. and Dunson, D. B. (2015). Dirichlet-Laplace priors for optimal shrinkage. Journal of the American Statistical Association 110 1479–1490.
• Bickel and Levina (2008) Bickel, P. J. and Levina, E. (2008). Covariance regularization by thresholding. Annals of Statistics 36 2577–2604.
• Bickel et al. (2009) Bickel, P. J., Ritov, Y., Tsybakov, A. B. et al. (2009). Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics 37 1705–1732.
• Bühlmann and van de Geer (2011) Bühlmann, P. and van de Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications. Springer, New York.
• Bunea et al. (2007) Bunea, F., Tsybakov, A., Wegkamp, M. et al. (2007). Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics 1 169–194.
• Candes and Tao (2007) Candes, E. and Tao, T. (2007). The dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics 35 2313–2351.
• Castillo et al. (2015) Castillo, I., Schmidt-Hieber, J. and van der Vaart, A. (2015). Bayesian linear regression with sparse priors. Annals of Statistics 43 1986–2018.
• Davis and Kahan (1970) Davis, C. and Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis 7 1–46.
• Dobriban and Fan (2016) Dobriban, E. and Fan, J. (2016). Regularity properties for sparse regression. Communications in mathematics and statistics 4 1–19.
• Donoho and Elad (2003) Donoho, D. L. and Elad, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via minimization. Proceedings of the National Academy of Sciences 100 2197–2202.
• Donoho et al. (2006) Donoho, D. L., Elad, M. and Temlyakov, V. N. (2006). Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on information theory 52 6–18.
• Donoho and Huo (2001) Donoho, D. L. and Huo, X. (2001). Uncertainty principles and ideal atomic decomposition. IEEE transactions on information theory 47 2845–2862.
• Fama and French (1993) Fama, E. F. and French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics 33 3–56.
• Fan et al. (2016) Fan, J., Ke, Y. and Wang, K. (2016). Decorrelation of covariates for high dimensional sparse regression. arXiv preprint arXiv:1612.08490 .
• Fan and Li (2001) Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association 96 1348–1360.
• Fan et al. (2013) Fan, J., Liao, Y. and Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75 603–680.
• Fan et al. (2018) Fan, J., Liu, H., Sun, Q. and Zhang, T. (2018). I-LAMM for sparse learning: Simultaneous control of algorithmic complexity and statistical error. Annals of statistics 46 814.
• Fan and Lv (2008) Fan, J. and Lv, J. (2008). Sure independence screening for ultra-high dimensional feature space (with discussion). Journal of Royal Statistical Society B 70 849–911.
• Fan et al. (2011) Fan, J., Lv, J. and Qi, L. (2011). Sparse high-dimensional models in economics. Annual Review of Economics 3 291–317.
• Forbes and Rigobon (2002) Forbes, K. J. and Rigobon, R. (2002). No contagion, only interdependence: measuring stock market comovements. Journal of Finance 57 2223–2261.
• Friedman et al. (2010) Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33 1–22.
• Ghosal et al. (2000) Ghosal, S., Ghosh, J. K. and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Annals of Statistics 28 500–531.
• Ishwaran and Rao (2005) Ishwaran, H. and Rao, J. S. (2005). Spike and slab gene selection for multigroup microarray data. Journal of the American Statistical Association 100 764–780.
• Jiang et al. (2018) Jiang, B., Sun, Q. and Fan, J. (2018). Bernstein’s inequality for general markov chains. arXiv preprint arXiv:1805.10721 .
• Kneip and Sarda (2011) Kneip, A. and Sarda, P. (2011). Factor models and variable selection in high-dimensional regression analysis. Annals of Statistics 39 2410–2447.
• Lam and Yao (2012) Lam, C. and Yao, Q. (2012). Factor modeling for high-dimensional time series: inference for the number of factors. Annals of Statistics 40 694–726.
• Laurent and Massart (2000) Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics 1302–1338.
• Lintner (1975) Lintner, J. (1975). The valuation of risk assets and the selection of risky investments in stock portfolios and capital budgets. In Stochastic Optimization Models in Finance. Elsevier, 131–155.
• Liu (2005) Liu, J. (2005). Eigenvalue and singular value inequalities of schur complements. In The Schur Complement and its Applications, chap. 2. Springer, 47–82.
• Ludvigson and Ng (2009) Ludvigson, S. C. and Ng, S. (2009). Macro factors in bond risk premia. The Review of Financial Studies 22 5027–5067.
• Luo et al. (2009) Luo, R., Wang, H. and Tsai, C.-L. (2009). Contour projected dimension reduction. Annals of Statistics 37 3743–3778.
• McCracken and Ng (2016) McCracken, M. W. and Ng, S. (2016). FRED-MD: a monthly database for macroeconomic research. Journal of Business & Economic Statistics 34 574–589.
• Narisetty and He (2014) Narisetty, N. N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. Annals of Statistics 42 789–817.
• Park and Casella (2008) Park, T. and Casella, G. (2008). The bayesian lasso. Journal of the American Statistical Association 103 681–686.
• Pelekis (2016) Pelekis, C. (2016). A lower bound on binomial tails: an approach via tail conditional expectations. arXiv preprint arXiv:1609.06651 .
• Polson and Scott (2012) Polson, N. G. and Scott, J. G. (2012). Local shrinkage rules, lévy processes and regularized regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74 287–311.
• Raskutti et al. (2010) Raskutti, G., Wainwright, M. J. and Yu, B. (2010). Restricted eigenvalue properties for correlated gaussian designs. Journal of Machine Learning Research 11 2241–2259.
• Ročková and George (2018) Ročková, V. and George, E. I. (2018). The spike-and-slab lasso. Journal of the American Statistical Association 113 431–444.
• Sharpe (1964) Sharpe, W. F. (1964). Capital asset prices: a theory of market equilibrium under conditions of risk. Journal of Finance 19 425–442.
• Shen and Wasserman (2001) Shen, X. and Wasserman, L. (2001). Rates of convergence of posterior distributions. Annals of Statistics 29 687–714.
• Song and Liang (2017) Song, Q. and Liang, F. (2017). Nearly optimal bayesian shrinkage for high dimensional regression. arXiv preprint arXiv:1712.08964 .
• Stock and Watson (2002) Stock, J. H. and Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association 97 1167–1179.
• Su and Candes (2016) Su, W. and Candes, E. (2016). Slope is adaptive to unknown sparsity and asymptotically minimax. Annals of Statistics 44 1038–1068.
• Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58 267–288.
• Wang and Fan (2017) Wang, W. and Fan, J. (2017). Asymptotics of empirical eigen-structure for ultra-high dimensional spiked covariance model. Annals of Statistics 45 1342–1374.
• Wehrens and Mevik (2007) Wehrens, R. and Mevik, B.-H. (2007). The pls package: Principal component and partial least squares regression in r. Journal of Statistical Software 18 1–24.
• Yin et al. (1988) Yin, Y.-Q., Bai, Z.-D. and Krishnaiah, P. R. (1988). On the limit of the largest eigenvalue of the large dimensional sample covariance matrix. Probability theory and related fields 78 509–521.
• Yu et al. (2014) Yu, Y., Wang, T. and Samworth, R. J. (2014). A useful variant of the davis–kahan theorem for statisticians. Biometrika 102 315–323.
• Zhang and Huang (2008) Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. Annals of Statistics 36 1567–1594.
• Zhao and Yu (2006) Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research 7 2541–2563.
\appendixpage

Notation. We summarize all notation used in the appendices. Some of them may have been defined in the main body of the paper.

Let be the identity matrix of dimension , the column vector of zeros, the matrix of zeros. If the dimension of an identity or zero matrix is clear in the context, we omit the subscripts. We write for a diagonal matrix of elements . For a symmetric matrix , we write its trace as