Bayesian Factor-adjusted Sparse Regression
Abstract
This paper investigates high-dimensional linear regression with highly correlated covariates. In this setup, the traditional sparsity assumption on the regression coefficients often fails to hold, and consequently many model selection procedures do not work. To address this challenge, we model the variations of the covariates by a factor structure. Specifically, strong correlations among covariates are explained by common factors and the remaining variations are interpreted as idiosyncratic components of each covariate. This leads to a factor-adjusted regression model with both common factors and idiosyncratic components as covariates. We generalize the traditional sparsity assumption accordingly and assume that all common factors but only a small number of idiosyncratic components contribute to the response. A Bayesian procedure with a spike-and-slab prior is then proposed for parameter estimation and model selection. Simulation studies show that our Bayesian method outperforms its lasso analogue, is insensitive to overestimates of the number of common factors, pays a negligible price in the no-correlation case, and scales up well with increasing sample size, dimensionality and sparsity. Numerical results on a real dataset of U.S. bond risk premia and macroeconomic indicators lend strong support to our methodology.
keywords: factor model, Bayesian sparse regression, posterior convergence rate, model selection.
1 Introduction
High-dimensional linear models are useful for a wide array of economic problems (Fan et al., 2011; Belloni et al., 2012). These models typically assume the sparsity of regression coefficients, that is, that only a small number of covariates have significant effects on the response. However, the explanatory variables in the panel of an economic dataset are often highly correlated due to the influence of latent common factors, rendering the sparsity assumption unreasonable and restrictive. To address this issue, this paper proposes a general regression model with a factor-adjusted sparsity assumption, and develops a Bayesian method for this model.
To motivate the factor-adjusted model and its corresponding methodology, we start with the standard linear regression model
(1) y = X β* + σ* ε,
where y is an n-dimensional response vector, X is an n × p design matrix of n observations and p covariates, β* is a p-dimensional vector of regression coefficients, σ* is an unknown standard deviation, and ε is an n-dimensional standard Gaussian random vector, independent of X. Without loss of generality, we assume the data are centered and include no intercept term in the model. Of interest is the high-dimensional regime in which the dimensionality p is much larger than the sample size n.
This model has attracted intense interest in the frequentist community (Tibshirani, 1996; Fan and Li, 2001; Candes and Tao, 2007; Fan and Lv, 2008; Zhang and Huang, 2008; Su and Candes, 2016, among others). All of these methods hinge on at least two basic assumptions. The first one assumes that the correlations between explanatory variables are sufficiently weak. Examples of this assumption are the mutual coherence condition (Donoho and Huo, 2001; Donoho and Elad, 2003; Donoho et al., 2006; Bunea et al., 2007), the irrepresentable condition (Zhao and Yu, 2006), the restricted eigenvalue condition (Bickel et al., 2009; Fan et al., 2018) and the uniform compatibility condition (Bühlmann and van de Geer, 2011, page 157). The second one, referred to as the sparsity assumption, assumes that only a small number of covariates contribute to the response. Formally, the sparsity s, defined as the number ‖β*‖₀ of nonzero coefficients, is much smaller than the dimensionality p.
Nevertheless, the weak correlation conditions do not necessarily hold in many applications, especially those in economic and financial studies. In an economic or financial dataset, the explanatory variables, e.g., stock returns or macroeconomic indicators over a period of time, are often influenced by similar economic fundamentals and are thus heavily correlated due to the existence of comovement patterns (Forbes and Rigobon, 2002; Stock and Watson, 2002). In the presence of such strong correlations introduced by common factors, one naturally expects strong effects of the common factors on the response. If this is true, many covariates would have non-negligible effects on the response, rendering the traditional sparsity assumption in the standard regression model (1) conceptually unreasonable.
The above argument shows the necessity of taking the correlation structure of the explanatory variables into account and adjusting the sparsity assumption accordingly. For this purpose, we consider using factor models (Stock and Watson, 2002; Bai, 2003; Bai and Ng, 2006; Fan et al., 2013) and assume that each datum (row) xᵢ of the data matrix X exhibits a decomposition of the form
(2) xᵢ = B fᵢ + uᵢ,
where B is an unknown p × K matrix of factor loading coefficients, fᵢ is a K-dimensional random vector of common factors, and uᵢ is a p-dimensional random vector of weakly correlated idiosyncratic components, uncorrelated with fᵢ. Without loss of generality, we assume E fᵢ = 0, cov(fᵢ) = I_K, and E uᵢ = 0. Both common factors and idiosyncratic components are latent, but they are often estimated by using principal component analysis (PCA) (Bai, 2003; Fan et al., 2013; Wang and Fan, 2017). Model (2) embraces the well-known CAPM (Sharpe, 1964; Lintner, 1975) and the Fama-French model (Fama and French, 1993) as its special cases, with observable common factors. Let F = (f₁, …, f_n)ᵀ be the n × K matrix of common factors, and U = (u₁, …, u_n)ᵀ be the n × p matrix of idiosyncratic components. Then a more compact matrix form reads as
(3) X = F Bᵀ + U.
Each covariate (column) X_j in X can be decomposed as a sum of two components, F b_j and U_j, where b_j is the j-th row of B, reflecting the influence of common factors and idiosyncratic variations respectively.
Utilizing this factor structure (3), we generalize the standard sparse regression model (1) to a factor-adjusted sparse regression model of the form
(4) y = F α* + U β* + σ* ε,
where α* and β* are the regression coefficient vectors of F and U, respectively. We assume that α* is dense (as it is usually low-dimensional) but β* is sparse. That is, all common factors but only a small number of idiosyncratic components of the original explanatory variables contribute to the response. A nonzero β*_j indicates that the covariate X_j, excluding its strong correlation with the other covariates, has a specific effect on the response. Compared to the traditional sparsity assumption, this factor-adjusted sparsity assumption is more tenable as the idiosyncratic components are weakly correlated.
We remark that our generalized factor-adjusted regression model (4) covers the standard regression model (1) as a special case via the side constraint α* = Bᵀβ*. Under this constraint, the factor-adjusted sparsity assumption imposed on the regression coefficients of the idiosyncratic components in model (4) coincides with the traditional sparsity assumption in model (1). Thus any statistical method for estimating model (4) can also estimate model (1). Of course, when such a constraint is not enforced, model (4) provides more flexibility in the regression analysis than model (1).
Model (4) is similar to but different from the factor-augmented regression, or the augmented principal component regression, of Stock and Watson (2002) and Bai and Ng (2006). In the factor-augmented models, the factors are usually extracted from a large panel of data via PCA and used as a part of the covariates, yet the other covariates are introduced from outside the panel. These models are typically low-dimensional. In contrast, model (4) takes the idiosyncratic components as covariates, which are created internally from the panel of data. This allows one to explore additional explanatory power of the data. Our analyses of model (4) in the high-dimensional regime are applicable to the low-dimensional factor-augmented regression models in the literature, as model (4) can easily incorporate external variables alongside the common factors and/or idiosyncratic components. For simplicity of presentation, we omit the details.
Kneip and Sarda (2011) gave an insightful discussion of the limitation of the traditional sparsity assumption in model (1) with factor-structured covariates and proposed a factor-augmented regression model. Nevertheless, they still need the weak correlation condition on the original covariates, which is unlikely to hold for factor-structured covariates; see equation (5.5) of Kneip and Sarda (2011). Fan et al. (2016) pointed out the failure of classical frequentist methods in dealing with model (1) with factor-structured covariates, and proposed a frequentist method for estimating model (4). Specifically, they estimated the latent common factors and idiosyncratic components, and then ran frequentist sparse regression methods (e.g., lasso) on the estimated common factors and idiosyncratic components. Similarly to ours, their method imposes the weak correlation condition on the idiosyncratic components instead of the original covariates; see Example 3.2 of Fan et al. (2016).
This paper focuses on Bayesian solutions to model (4). As shown in Section 2, a fully Bayesian procedure cannot work easily due to the involvement of the latent common factors and idiosyncratic components in the posterior computation. Inspired by Fan et al. (2016), we consider estimating these latent variables by PCA and running a Bayesian sparse regression method on their estimates. The arsenal of Bayesian sparse regression methods, including those exploiting shrinkage priors (e.g. Park and Casella, 2008; Polson and Scott, 2012; Armagan et al., 2013; Bhattacharya et al., 2015; Song and Liang, 2017) and those exploiting spike-and-slab priors (among others, Ishwaran and Rao, 2005; Narisetty and He, 2014; Castillo et al., 2015; Ročková and George, 2018), has been developed in parallel to the frequentist methods. However, it is unclear whether these methods would work on estimated common factors and idiosyncratic components in model (4). When they do work, it remains unknown whether the factor model estimation would incur any loss to the convergence rate or model selection consistency of the Bayesian sparse regression method. Even given theoretical results in the frequentist setting, these questions are still challenging, because the definitions of estimation errors and the technical conditions of frequentist and Bayesian methods are significantly different (Castillo et al., 2015). Even if a Bayesian sparse regression method is theoretically sound, it is unclear whether it performs better or worse than the frequentist methods on finite-sample data. We would like to answer these questions in the current paper.
Specifically, our Bayesian method imposes a slab prior on the regression coefficients of the estimated common factors, and a spike-and-slab prior on the regression coefficients of the estimated idiosyncratic components. This procedure results in a pseudo-posterior distribution, which differs from the exact posterior distribution obtained by a Bayesian regression on the exact common factors and idiosyncratic components. Interestingly, the pseudo-posterior distribution achieves the same contraction rate for the regression coefficients as the exact posterior distribution. Byproducts of our analyses include the adaptivity to the unknown sparsity s and the unknown standard deviation σ*. We only need a type of sparse eigenvalue condition on the idiosyncratic components to overcome the non-identifiability issue of the parameters. This condition is easy to satisfy due to the weak correlation among the idiosyncratic components. Moreover, by assuming a beta-min condition that is frequently used in the high-dimensional regression literature, we prove that our method consistently selects the support of the true sparse regression coefficients.
The rest of this paper proceeds as follows. In Section 2, we propose the Bayesian methodology for the factor-adjusted regression model (4). Section 3 establishes the contraction rates and model selection consistency of the pseudo-posterior distribution. These theoretical results rely on a high-level condition concerning the estimation of factor models, which is examined in Section 4. Section 5 presents experimental results on simulation datasets. Section 6 applies our method to a real dataset of U.S. bond risk premia and macroeconomic indicators. Section 7 is devoted to discussions. All technical proofs and algorithmic implementation are detailed in the appendices.
Notation. We write diag(a₁, …, a_p) for the diagonal matrix with elements a₁, …, a_p. For a symmetric matrix A, we write its largest eigenvalue as λ_max(A) and its smallest eigenvalue as λ_min(A). For a matrix A, we write A_j to denote its j-th column, and lowercase a_i to denote its i-th row. For an index set S, A_S is the submatrix of A assembling the columns indexed by S. Let ‖A‖_max be the elementwise maximum norm of A, and let ‖A‖_F be its Frobenius norm. For a vector v, let v_S denote its subvector assembling the components indexed by S, and let ‖v‖_q denote its ℓ_q norm. For two sequences a_n and b_n, a_n ≲ b_n or b_n ≳ a_n means a_n ≤ C b_n for some constant C > 0.
2 Model and Methodology
Our goal is to study the factor-adjusted regression model (4), in which both common and idiosyncratic components are unobserved, but are observed through (3). Each datum (row) xᵢ in X admits the factor structure (2) with (fᵢ, uᵢ) therein identically distributed as (f, u). Note that they are not necessarily independently distributed. The dimension K of f is fixed, but the dimension p of u may grow as n increases. By the decomposition, f and u are uncorrelated. Without loss of generality, we assume that E f = 0, cov(f) = I_K, and E u = 0. The regression coefficient vector β* of U is sparse in the sense that s = ‖β*‖₀ is small. We allow s to grow as n increases, but require that the desired contraction rate vanishes as n → ∞. The Gaussian errors ε are independent of F and U.
An inherent difficulty in estimating model (4) is that both common factors and idiosyncratic components are unobserved. Therefore the first step is to estimate these unobserved variables. We follow Bai (2003); Fan et al. (2013); Wang and Fan (2017) and use PCA for this task. Let λ̂₁ ≥ λ̂₂ ≥ … ≥ λ̂_n be the eigenvalues of XXᵀ. A natural estimator of F is the concatenation of the square-root-scaled eigenvectors corresponding to the top K eigenvalues of XXᵀ, denoted by F̂. That is, F̂ = √n (v̂₁, …, v̂_K), where v̂_k is the unit eigenvector of XXᵀ corresponding to λ̂_k. Then we estimate U by Û = X − F̂B̂ᵀ, where B̂ = XᵀF̂/n. If K is unknown, we may estimate it by
(5) K̂ = argmax_{1 ≤ k ≤ K_max} λ̂_k / λ̂_{k+1},
where K_max is any prescribed upper bound for K (Luo et al., 2009; Lam and Yao, 2012; Ahn and Horenstein, 2013).
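To make the estimation step concrete, the PCA procedure above can be sketched in a few lines of NumPy. This is our own minimal sketch under the paper's conventions, not the authors' implementation; the function names `estimate_factors` and `eigenvalue_ratio_K` are ours.

```python
import numpy as np

def estimate_factors(X, K):
    """PCA estimation of the factor model X ~ F B^T + U.

    F_hat collects the sqrt(n)-scaled top-K eigenvectors of X X^T,
    so that F_hat^T F_hat / n = I_K; B_hat is the least-squares loading
    matrix and U_hat the residual (idiosyncratic) matrix.
    """
    n = X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)   # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:K]          # indices of the top-K eigenvalues
    F_hat = np.sqrt(n) * eigvecs[:, top]         # sqrt(n)-scaled eigenvectors
    B_hat = X.T @ F_hat / n                      # loadings via least squares
    U_hat = X - F_hat @ B_hat.T                  # idiosyncratic components
    return F_hat, B_hat, U_hat

def eigenvalue_ratio_K(X, K_max):
    """Eigenvalue-ratio estimate of the number of factors, as in (5)."""
    eigvals = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]
    ratios = eigvals[:K_max] / eigvals[1:K_max + 1]
    return int(np.argmax(ratios)) + 1
```

With strong (pervasive) factors, the eigengap between the K-th and (K+1)-th eigenvalues makes the ratio in (5) peak at the true K.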
After estimating the unobserved variables, we propose a Bayesian sparse regression method for the tasks of parameter estimation and model selection. Suppose we are given data (y, X) generated from the true parameter (α*, β*, σ*). Let (α, β, σ) be its running counterpart. Let S* and S be the supports of β* and β, respectively. We consider a hierarchical prior with a slab prior on the coefficients of the common factors and a spike-and-slab prior on the coefficients of the idiosyncratic components as follows:
(6) σ² ~ g(σ²); α_k ~ λ_k⁻¹ ψ(α_k/λ_k) independently for k = 1, …, K; β_j | γ_j ~ γ_j λ'_j⁻¹ ψ(β_j/λ'_j) + (1 − γ_j) δ₀, with γ_j ~ Bernoulli(η) independently for j = 1, …, p,
where g is a positive continuous density function on (0, ∞), e.g., the inverse-gamma density; ψ is a "slab" density function on ℝ with sufficiently heavy tails, e.g., the Gaussian density and the Laplace density; the hyperparameters λ_k and λ'_j control the scales of the running coefficients α_k and β_j; and the hyperparameter η controls the sparsity of the running models S. For the scaling hyperparameters, we set λ'_j in proportion to the scale of Û_j so that the effects of possibly heterogeneous scales of the Û_j's are appropriately adjusted. For the sparsity hyperparameter, we simply use a fixed value of η in the simulation experiments. When dealing with a real dataset, one could choose an informative η according to expert knowledge in the specific area, or tune η by sophisticated cross-validation or empirical Bayes procedures.
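A draw from a prior of this form can be sketched as follows, with an inverse-gamma choice of g and a Gaussian slab ψ; the function name and the hyperparameter defaults are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def sample_prior(K, p, eta, lam_alpha, lam_beta, a0=1.0, b0=1.0, rng=None):
    """One draw from a spike-and-slab prior of the form described in the text.

    sigma^2 follows an inverse-gamma density (the choice of g here),
    the K factor coefficients follow a dense Gaussian slab, and each of the
    p idiosyncratic coefficients is nonzero with probability eta.
    Hyperparameter defaults (a0, b0) are illustrative.
    """
    rng = rng or np.random.default_rng()
    sigma2 = 1.0 / rng.gamma(a0, 1.0 / b0)        # inverse-gamma(a0, b0) draw
    alpha = lam_alpha * rng.standard_normal(K)    # dense slab on factor coefficients
    gamma = rng.random(p) < eta                   # model inclusion indicators
    beta = np.where(gamma, lam_beta * rng.standard_normal(p), 0.0)
    return sigma2, alpha, gamma, beta
```

Small η yields sparse draws of β, mirroring the factor-adjusted sparsity assumption: α is always dense while most β_j sit exactly at zero.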
The Bayesian sparse regression on the response y and the regressors (F̂, Û), with prior (6), obtains a pseudo-posterior distribution
(7) π̂(α, β, σ | y) ∝ N_n(y; F̂α + Ûβ, σ²I_n) × π(α, β, σ),
where N_n(·; μ, Σ) denotes the n-dimensional normal density with mean μ and covariance Σ. We call it a "pseudo-posterior" distribution and put a hat over π to emphasize that it differs from the exact posterior distributions: π(α, β, σ | y, F, U), obtained by a Bayesian regression on the observed (F, U), and π(α, β, σ | y, X), obtained by a fully Bayesian procedure.
It is worth noting that, even in the simplest setting in which the data are i.i.d. and the latent variables are jointly independent, the exact posterior distribution π(α, β, σ | y, X) given by a fully Bayesian procedure is computationally intractable due to the involvement of the latent variables (F, U) in a complicated integral. Thus a fully Bayesian procedure does not solve model (4) easily.
3 Theory
In this section, we show, under commonly-seen assumptions for Bayesian sparse regression methods, that the pseudo-posterior distribution (7) achieves the same convergence rate of the estimation error for the coefficient vectors (α*, β*) as in the case of observed covariates. This rate is so far the best rate Bayesian methods can achieve with observed (F, U) (Song and Liang, 2017). We see that the factor adjustment added by our approach to the Bayesian sparse regression method incurs no loss in terms of the estimation error rate. Byproducts of our analysis are the adaptivities of the pseudo-posterior distribution to the unknown sparsity s and the unknown standard deviation σ*. Finally, when the beta-min condition holds, we establish the model selection consistency of the pseudo-posterior distribution (7).
3.1 Assumptions
In the high-dimensional regime p ≫ n, a common assumption is that β* is sparse with support size s = ‖β*‖₀. Following the sparse regression literature, we assume that s grows slowly enough that the desired error rate vanishes as n → ∞. To recover the sparse coefficient vector β* at this rate, we need the following assumptions.
Assumption 1.
There exist a large integer m and a constant κ_min > 0 such that
min_{S : |S| ≤ m} λ_min(U_Sᵀ U_S / n) ≥ κ_min
holds with probability approaching 1.
This assumption is commonly referred to as the sparse eigenvalue condition in the frequentist literature (Fan et al., 2018). In a recent study of Bayesian sparse regression with shrinkage priors, Song and Liang (2017) imposed the same assumption on the original covariates X. Here our assumption is imposed on their idiosyncratic components U.
Our next assumption upper bounds the maximum eigenvalue of U_{S*}ᵀ U_{S*} / n, which is the Gram matrix corresponding to the true model S*. Assumptions 1-2 together ensure that U_{S*}ᵀ U_{S*} / n is well conditioned.
Assumption 2.
There exists a constant κ_max < ∞ such that
λ_max(U_{S*}ᵀ U_{S*} / n) ≤ κ_max
holds with probability approaching 1.
Raskutti et al. (2010) and Dobriban and Fan (2016) gave sufficient conditions for correlated covariates to satisfy Assumptions 1-2. These theories typically allow a large integer m in Assumption 1. If U consists of i.i.d. entries with zero mean, unit variance and only a finite fourth moment, Assumption 2 holds by the Bai-Yin theorem in random matrix theory (Bai and Yin, 1988; Yin et al., 1988).
Since we feed a Bayesian sparse regression method with the estimated variables (F̂, Û) rather than the latent variables (F, U), it is necessary to control the estimation error of (F̂, Û). This goal is achieved by assumptions on the estimation errors of the latent variables and on the magnitudes of the true coefficient vectors. For the estimation error of the factor model, we impose a generic high-level condition as follows.
Assumption 3.
The latent common factors and idiosyncratic components can be estimated by F̂ and Û with appropriately vanishing errors,
for some nearly orthogonal matrix H such that ‖HᵀH − I_K‖ and ‖HHᵀ − I_K‖ are asymptotically negligible.
Since F̂ represents the eigenspace of the top K eigenvalues of XXᵀ and mimics the column space of F, there is a nearly orthogonal transformation, represented by H, between F̂ and F. The next section will verify this error rate in factor models under standard assumptions.
Our last assumption requires constant orders of the true parameters (α*, β*, σ*).
Assumption 4.
σ* is fixed, ‖α*‖₂ = O(1), and ‖β*‖₂ = O(1).
This condition is not restrictive. Under Assumptions 1-2, it holds if and only if the response variable has finite variance. To see this point, note that the variance of a single response variable yᵢ is
var(yᵢ) = ‖α*‖₂² + β*_{S*}ᵀ cov(u_{S*}) β*_{S*} + σ*²,
where β*_{S*} is the subvector of β* corresponding to the true model S*, and the relevant covariance matrices have all eigenvalues bounded away from 0 and ∞, due to Assumptions 1-2. Although our theoretical analyses need bounded magnitudes of the regression coefficients to avoid the amplification of estimation errors of the latent variables, we remark here that, when the underlying true factors fᵢ's and uᵢ's, and/or more accurate estimates are available, we can allow larger magnitudes of the regression coefficients.
3.2 Definition of Posterior Contraction Rate
The definition of convergence rate in the Bayesian setting differs from that in the frequentist setting. We formally define it following the classical Bayesian literature (Ghosal et al., 2000; Shen and Wasserman, 2001).
Definition 1 (Posterior contraction).
Consider a parametric model indexed by θ. Let D_n be a sequence of datasets generated according to some true parameter θ*. Let ℓ(θ, θ*) be a loss function between the running parameter θ and the true parameter θ*. A sequence of posterior distributions (random measures) π_n(· | D_n) is said to achieve convergence rate ε_n of the estimation error ℓ if
π_n(θ : ℓ(θ, θ*) ≥ M ε_n | D_n) → 0
in probability as n → ∞ for some constant M > 0.
Specifically, in the factor-adjusted regression model (4) with covariates hidden in (3), we consider the loss
ℓ((α, β), (α*, β*)) = ‖α − Hᵀα*‖₂ + ‖β − β*‖₂,
where H is introduced by Assumption 3, and want to show that the pseudo-posterior distribution (7) achieves the contraction rate of this estimation error. As noted on Assumption 3, F̂ approximates FH in the sense that they have almost the same column space and F̂ ≈ FH elementwise for some nearly orthogonal transformation matrix H. Thus the pseudo-posterior distribution would concentrate around (α, β) such that α ≈ Hᵀα* and β ≈ β*.
3.3 Results
This subsection presents the main results of the paper. Recall that s = ‖β*‖₀. Let
where c₁ and c₂ are constants, S* and S are the supports of β* and β, respectively, and |S \ S*| is the cardinality of the set difference of S and S*.
Theorem 1.
Let P* denote the probability measure under the true parameters. Under Assumptions 1-4, the following statements hold.

(estimation error rate) There exist constants such that
as n → ∞.

(prediction error rate) There exist constants such that
as n → ∞.

(model selection consistency) If the beta-min condition holds, then there exist constants such that
as n → ∞. It follows that
as n → ∞.
Part (a) establishes the convergence rate of the estimation error of α* (up to a nearly orthogonal transformation H) and β*, the adaptivity to the unknown sparsity s, and the adaptivity to the unknown standard deviation σ*.
Part (b) shows that F̂α + Ûβ predicts the conditional mean Fα* + Uβ* with vanishing mean squared error for each single datum instance on average.
The first implication in Part (c) asserts that the pseudo-posterior distribution will select all variables in S* and at most a few other variables, with high probability. In simulation experiments, we observe that the pseudo-posterior distribution overestimates the true support size by only a negligible margin. The second implication asserts that
in probability as n → ∞, and therefore provides a variable selection rule. Simply speaking, we can consistently select the true model by thresholding the running coefficients at an appropriate level. In simulation experiments, the majority of pseudo-posterior samples of parameters hit the true model correctly even when the thresholding rule is not used.
The additional condition in part (c) is called the "beta-min condition" in the literature on Bayesian sparse regression (Castillo et al., 2015; Song and Liang, 2017). Narisetty and He (2014) used another identifiability condition to achieve model selection consistency. Their condition can be shown to be slightly stronger than the beta-min condition in the presence of the minimum sparse eigenvalue condition. To see this point, one can compare their Condition 4.4 to our equation (10) in the proof of Lemma A2, part (d).
4 Factor Model Estimation
This section verifies Assumption 3, which concerns the estimation errors of factor models, under standard assumptions. Following Bickel and Levina (2008), we define a uniformity class of sparse positive semidefinite matrices as follows.
Assumption 5.
The xᵢ's are identically (not necessarily independently) distributed as x = Bf + u, where E f = 0, cov(f) = I_K, E u = 0, and cov(u) belongs to the uniformity class defined above for some q ∈ [0, 1).
Assumption 6.
All entries in the loading matrix B are uniformly bounded, i.e., ‖B‖_max = O(1), and all the eigenvalues of BᵀB/p are strictly bounded away from 0 and ∞.
Assumption 7.
The sample covariance matrices of f and u converge to the true covariance matrices at rate √(log p / n) in the elementwise maximum norm.
In Assumption 5, cov(f) = I_K is imposed to avoid the non-identifiability issue of B and f. If the rows of B are i.i.d. copies of some K-dimensional distribution b, then BᵀB/p converges almost surely to E[bbᵀ] as p → ∞, and Assumption 6 holds when E[bbᵀ] has eigenvalues bounded away from 0 and ∞. Assumptions 5-6 together characterize the "low-rank plus sparse" structure of the covariance matrix of x. That is,
cov(x) = BBᵀ + cov(u),
where the first part BBᵀ is of low rank K, and the second part cov(u) is sparse in the sense of the uniformity class above. This decomposition has a "spike plus non-spike" structural interpretation as well: the smallest nonzero eigenvalue of BBᵀ is of order p, while the largest eigenvalue of cov(u) is of a much smaller order. This eigengap plays the key role in estimating B and F.
Assumption 7 requires that the sample covariance matrices of f and u converge to their ideal counterparts at an appropriate rate. Kneip and Sarda (2011) provided sufficient conditions for it to hold in the case that the xᵢ's are i.i.d. Fan et al. (2013) established the same rate for stationary and weakly-correlated time series. Our recent work on concentration inequalities for general Markov chains (Jiang et al., 2018) can verify this assumption in the case that the xᵢ's are functionals of ergodic Markov chains.
The next theorem summarizes the theoretical results on factor model estimation under Assumptions 5-7. Part (b) of this theorem bounds the difference between the column spaces of F̂ and F in terms of principal angles, which is new relative to the previous theory in the literature (Fan et al., 2013) and may be of independent interest. Parts (c) and (d), which are immediate corollaries of part (b), derive Assumption 3.
Definition 2.
The principal angles θ₁, …, θ_K between two linear spaces spanned by the orthonormal column vectors of V₁ and V₂ are defined as θ_k = arccos(ρ_k),
where ρ₁ ≥ … ≥ ρ_K are the singular values of V₁ᵀV₂ or V₂ᵀV₁.
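Principal angles of this kind are readily computed from the singular values of the product of orthonormal bases; a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spaces of A and B.

    Columns are orthonormalized first, so the inputs need not be orthonormal;
    the angles are the arccosines of the singular values of Qa^T Qb.
    """
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))   # clip guards against round-off
```

For identical subspaces all angles are zero, and an angle of π/2 signals a direction in one space orthogonal to the other.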
Theorem 2.
Let V consist of the scaled left singular vectors of FBᵀ, which are orthonormal vectors spanning the column space of F. Under Assumptions 5-7, the following statements hold.

Eigenvalue recovery:

Eigenspace recovery:

Common factor recovery:
for some nearly orthogonal matrix H with ‖HᵀH − I_K‖ and ‖HHᵀ − I_K‖ asymptotically negligible.

Idiosyncratic component recovery:
5 Simulation Experiments
This section reports simulation results. As a basic case, we fix the sample size n, the dimensionality p and the number of common factors K, and generate B, F and U. We then set the true parameters α*, β* and σ*.
For prior (6), we choose g to be the inverse-gamma density and ψ to be the Gaussian density, and set the scaling and sparsity hyperparameters as described in Section 2. Starting from the null model, we iterate a Gibbs sampler and drop the initial iterations as the burn-in period. The implementation details of the Gibbs sampler are given in the appendix.
The pseudo-posterior distribution is evaluated in terms of five measures. The posterior mean of β is compared to β* in terms of estimation error. The model selection rate and the sure screening rate are also computed. The former is the proportion of the posterior samples that select exactly the true model, i.e., S = S*, and the latter is the proportion of the posterior samples that select all sparse coefficients, i.e., S ⊇ S*. To evaluate the adaptivity to the unknown sparsity s, we report the average model size |S|. To evaluate the adaptivity to the unknown standard deviation σ*, the posterior mean of σ is compared to σ* in terms of relative estimation error. These measures are evaluated over 100 replicates of the datasets, and their averages are reported.
For comparison purposes, the factor-adjusted lasso method is implemented using the R package glmnet (Friedman et al., 2010). The penalty hyperparameters of the lasso methods are tuned by 10-fold cross-validation. Since the generic Bayesian / lasso method with X as covariates can be seen as the factor-adjusted Bayesian / lasso method with the underestimate K̂ = 0 of the number of factors, we also include them in the comparison.
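The paper's lasso baseline uses the R package glmnet; as an illustrative Python analogue, a minimal coordinate-descent lasso that leaves the estimated factor coefficients unpenalized can be sketched as follows (the function name, objective scaling and penalization scheme are our assumptions, not glmnet's):

```python
import numpy as np

def lasso_cd(Z, y, lam, n_unpenalized=0, n_iter=200):
    """Minimal coordinate-descent lasso for
        (1/2n) * ||y - Z w||^2 + lam * sum over penalized j of |w_j|,
    where the first `n_unpenalized` coordinates (e.g. estimated factor
    coefficients) carry no penalty. A didactic sketch, not glmnet."""
    n, d = Z.shape
    w = np.zeros(d)
    col_sq = (Z ** 2).sum(axis=0) / n            # per-coordinate curvature
    r = y - Z @ w                                # current residual
    for _ in range(n_iter):
        for j in range(d):
            r += Z[:, j] * w[j]                  # remove coordinate j from the fit
            rho = Z[:, j] @ r / n
            if j < n_unpenalized:
                w[j] = rho / col_sq[j]           # plain least-squares update
            else:                                # soft-thresholding update
                w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= Z[:, j] * w[j]                  # put updated coordinate back
    return w
```

For the factor-adjusted lasso one would call `lasso_cd(np.hstack([F_hat, U_hat]), y, lam, n_unpenalized=K)`, so that only the idiosyncratic coefficients are penalized (variable names hypothetical).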
5.1 Comparison of four methods, and insensitivity to misestimates of K
Table 1 summarizes the five measures of the four methods in the basic case. Results show that the factor-adjusted Bayesian method outperforms the factor-adjusted lasso method in the tasks of estimation and model selection. The poor performance of the factor-adjusted lasso method may partly result from the less satisfactory hyperparameter tuning procedure implemented in the R package glmnet.
We feed the factor-adjusted methods with various estimates K̂ of the number of factors, and observe that their performances are insensitive to overestimates of K (Table 1). In the case that there is no correlation among the covariates, i.e., K = 0, the factor-adjusted Bayesian method performs only slightly worse than the generic Bayesian method (Table 2).
Method  est. error  model sel. rate  sure screen. rate  avg. model size  rel. error of σ̂
generic Bayes,  1.536  0.0%  100.0%  15.92  5.940
Factor-adjusted Bayes,  0.124  88.2%  100.0%  5.13  2.057
Factor-adjusted Bayes,  0.124  87.0%  100.0%  5.14  2.058
Factor-adjusted Bayes,  0.130  86.4%  100.0%  5.14  2.057
Factor-adjusted Bayes,  0.133  86.1%  100.0%  5.14  2.069
generic lasso,  1.189  0%  100%  92.57  3.187
Factor-adjusted lasso,  0.460  27%  100%  21.20  1.899
Factor-adjusted lasso,  0.463  25%  100%  21.56  1.865
Factor-adjusted lasso,  0.467  27%  100%  26.07  1.787
Factor-adjusted lasso,  0.466  24%  100%  29.21  1.657
Method  est. error  model sel. rate  sure screen. rate  avg. model size  rel. error of σ̂
generic Bayes,  0.092  90.5%  100.0%  5.10  0.881
Factor-adjusted Bayes,  0.095  87.9%  100.0%  5.13  0.906
Factor-adjusted Bayes,  0.097  88.4%  100.0%  5.12  0.914
Factor-adjusted Bayes,  0.098  88.5%  100.0%  5.12  0.932
Factor-adjusted Bayes,  0.102  88.9%  100.0%  5.12  0.968
generic lasso,  0.495  53%  100%  11.72  1.302
Factor-adjusted lasso,  0.498  61%  100%  11.98  1.248
Factor-adjusted lasso,  0.500  56%  100%  13.28  1.279
Factor-adjusted lasso,  0.481  56%  100%  12.48  1.141
Factor-adjusted lasso,  0.487  58%  100%  13.62  1.114
We emphasize that the meaning of the model selection rate for the Bayesian methods is slightly different from that for the frequentist methods. For example, a 50% model selection rate given by a frequentist method means that it selects the true sparse model in 50 out of 100 replicates of the dataset. In contrast, a 90% model selection rate given by a Bayesian method means that, on average, 9 of every 10 posterior samples of parameters hit the true sparse model in a single replicate of the dataset. In the simulation experiments reported in Tables 1-2, at least 7 of every 10 pseudo-posterior samples obtained by our method hit the true sparse model in each of the 100 replicates of the dataset.
5.2 Scalability as n, p and s increase
We vary the sample size n, the dimensionality p and the sparsity s in the basic case, and test the scalability of the proposed methodology.
In Figure 1(a), we fix all other parameters in the basic case but vary the sample size n. In Figure 1(b), we fix all other parameters but vary the dimensionality p. In Figure 1(c), we fix all other parameters but vary the sparsity s. For the factor-adjusted methods, the estimated numbers of factors are used.
We observe that our method outperforms the other three methods in terms of estimation error and model selection rate under each combination of (n, p, s), and achieves a relative error of σ̂ comparable to that of the factor-adjusted lasso method.
5.3 Estimating the standard regression model
Recall that, when X admits a factor structure (3), the standard regression model (1) is a special case of the factor-adjusted regression model (4) with the parameter constraint α* = Bᵀβ*. We expect that the factor-adjusted Bayesian method solves model (1) as well. To verify this expectation, we generate responses from model (1) (or equivalently, impose the constraint α* = Bᵀβ*), and test the four methods on simulation datasets.
Method  est. error  model sel. rate  sure screen. rate  avg. model size  rel. error of σ̂
generic Bayes,  0.070  91.5%  100.0%  5.09  0.913
Factor-adjusted Bayes,  0.090  91.1%  100.0%  5.09  1.690
Factor-adjusted Bayes,  0.091  90.7%  100.0%  5.10  1.715
Factor-adjusted Bayes,  0.093  90.4%  100.0%  5.10  1.733
Factor-adjusted Bayes,  0.095  89.8%  100.0%  5.11  1.763
generic lasso,  0.734  13%  100%  10.18  3.266
Factor-adjusted lasso,  0.454  53%  100%  15.17  1.125
Factor-adjusted lasso,  0.471  57%  100%  14.96  1.193
Factor-adjusted lasso,  0.465  48%  100%  16.65  1.139
Factor-adjusted lasso,  0.492  55%  100%  18.06  1.213
We see that the factor-adjusted Bayesian method does solve model (1). Interestingly, while the factor adjustment added to the lasso method significantly increases the model selection rate from 13% to roughly 50%, the generic Bayesian method performs comparably to or even better than the factor-adjusted Bayesian method. We will discuss this phenomenon in the discussion section.
6 Predicting U.S. Bond Risk Premia
This section applies our method to predict U.S. bond risk premia with a large panel of macroeconomic variables. The response variables are monthly U.S. bond risk premia with maturities of two to five years, spanning the period from January 1964 to December 2003 (Ludvigson and Ng, 2009). The m-year bond risk premium at period t + 1 is defined as the (log) holding return from buying an m-year bond at period t and selling it as an (m − 1)-year bond at period t + 1, in excess of the (log) return on the one-year bond bought at period t. The covariates are the macroeconomic variables collected in the FRED-MD database (McCracken and Ng, 2016) during the same period.
The covariates over the 480 months are strongly correlated. The scree plot of the principal component analysis (PCA) of these covariates (Figure 2) shows that the first principal component alone accounts for 55.9% of the total variance, and that the first five principal components account for 89.7% of the total variance.
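Variance proportions of this kind can be computed from the singular values of the centered data matrix. A minimal sketch with synthetic data of matching length (480 months; the number of covariates and the correlation structure here are illustrative, not the FRED-MD data):

```python
import numpy as np

def explained_variance(X, k):
    """Proportion of total variance captured by the top k principal components."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    var = s ** 2
    return var[:k].sum() / var.sum()

# synthetic panel: 480 months, strongly correlated columns
rng = np.random.default_rng(1)
common = rng.standard_normal(480)
X = np.outer(common, np.ones(130)) + 0.3 * rng.standard_normal((480, 130))
print(explained_variance(X, 1))  # large, since one common direction dominates
print(explained_variance(X, 5))
```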
We consider rolling-window regression and one-step-ahead prediction. Specifically, for each time window ahead of a given month, we regress a U.S. bond risk premium on the macroeconomic variables of the preceding month, and then use the fitted model to form an out-of-sample prediction of the premium in the following month.
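The rolling-window scheme can be sketched as follows, with ordinary least squares standing in for the fitting methods compared in the paper (the window size, the fitter, and all names are placeholders of ours):

```python
import numpy as np

def rolling_one_step(X, y, window, fit):
    """Rolling-window regression with one-step-ahead prediction.

    At each month t, fit on the `window` most recent lagged pairs
    (x_s, y_{s+1}) and predict y_{t+1} from x_t.  `fit(X, y)` must
    return a function mapping a covariate vector to a prediction.
    """
    preds = []
    for t in range(window, len(y) - 1):
        model = fit(X[t - window:t], y[t - window + 1:t + 1])
        preds.append(model(X[t]))
    return np.array(preds)

def ols(X, y):
    """Ordinary least squares stand-in for the actual fitting methods."""
    Z = np.c_[np.ones(len(X)), X]                    # add an intercept column
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return lambda x: beta[0] + x @ beta[1:]

# synthetic lagged data: y_{t+1} depends on x_t
rng = np.random.default_rng(2)
X = rng.standard_normal((120, 5))
b = np.array([1.0, 0.0, 0.0, 2.0, 0.0])
y = np.zeros(120)
y[1:] = X[:-1] @ b + 0.1 * rng.standard_normal(119)
preds = rolling_one_step(X, y, window=60, fit=ols)
```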
The regression function is fitted by one of the generic lasso method, the factor-adjusted lasso method, the generic Bayesian method, the factor-adjusted Bayesian method, and the principal component regression (PCR) method (Ludvigson and Ng, 2009). For the factor-adjusted methods, the number of common factors is estimated by (5). For the Bayesian methods, we use prior (6). For PCR, the top eight principal components are included in the regression model, in a similar vein to Ludvigson and Ng (2009). The R package pls (Wehrens and Mevik, 2007) is used for the implementation of PCR.
The prediction performance is evaluated by the out-of-sample $R^2$, computed as
\[
R^2_{\mathrm{oos}} = 1 - \frac{\sum_t (y_t - \hat{y}_t)^2}{\sum_t (y_t - \bar{y}_t)^2},
\]
where $y_t$ is one of the two-year, three-year, four-year and five-year U.S. bond risk premia, $\hat{y}_t$ is the prediction of $y_t$ by one of the five methods in comparison, and $\bar{y}_t$ is the average of the observed $y_t$.
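A direct implementation of this out-of-sample $R^2$ (with the benchmark average passed in explicitly, since the averaging window it is computed over is not specified here):

```python
import numpy as np

def oos_r2(y, y_hat, y_bar):
    """Out-of-sample R^2: one minus the prediction error relative to the
    error of the benchmark mean forecast y_bar."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y_bar) ** 2)

y = np.array([1.0, 2.0, 3.0, 4.0])
print(oos_r2(y, y, y.mean()))                     # 1.0: perfect forecast
print(oos_r2(y, np.full(4, y.mean()), y.mean()))  # 0.0: no better than the mean
```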
Table 4 summarizes the out-of-sample $R^2$ the five methods achieve on this task, and Table 5 reports the average size of the sparse models they select. We observe that the factor-adjusted Bayesian method and the factor-adjusted lasso method achieve higher out-of-sample $R^2$ than the other methods, but the factor-adjusted Bayesian method selects much sparser models than the factor-adjusted lasso method.
Table 4: Out-of-sample $R^2$ for predicting U.S. bond risk premia.

Method  2-yr bond  3-yr bond  4-yr bond  5-yr bond
PCR  0.646  0.603  0.568  0.540
generic Bayes  0.765  0.734  0.722  0.696
factor-adjusted Bayes  0.775  0.753  0.747  0.726
generic lasso  0.719  0.717  0.701  0.688
factor-adjusted lasso  0.766  0.764  0.746  0.719
Table 5: Average size of the selected sparse models.

Method  2-yr bond  3-yr bond  4-yr bond  5-yr bond
generic Bayes  12.97  12.97  13.13  13.05
factor-adjusted Bayes  11.04  11.39  11.63  11.41
generic lasso  24.06  24.25  25.62  25.71
factor-adjusted lasso  34.46  35.12  36.91  36.57
7 Discussion
We propose a factor-adjusted regression model to handle the linear relationship between the response variable and possibly highly correlated covariates. We decompose the covariates into common factors, which explain most of the variation, and idiosyncratic components, and assume that all common factors but only a small number of idiosyncratic components contribute to the response. The corresponding Bayesian methodology is then developed for estimating such a model. Theoretical results suggest that the proposed methodology consistently estimates the factor-adjusted model and thus yields consistent predictions, under an easy-to-satisfy sparse eigenvalue condition on the idiosyncratic components instead of the original covariates.
Our factor-adjusted model covers the standard linear model as a submodel under a side constraint. Thus, our proposed methodology easily handles the case when the standard linear regression model is assumed to be the underlying model. In simulation studies on the submodel, we find that the factor adjustment greatly improves the performance of the lasso, while the generic Bayesian sparse regression is comparable to the factor-adjusted Bayesian sparse regression in terms of estimation error and model selection rate (Table 3). This suggests a fundamental difference between the frequentist and Bayesian sparse regression methods. Indeed, one can prove under Assumptions 1–2 and 5–7 a bound involving two additional terms.
If the number of common factors is fixed, these two terms are of constant order, and then an argument similar to the proof of Theorem 1 would establish the convergence and model selection consistency of the generic Bayesian regression under the standard regression model (1).
Nonetheless, we recommend the factor-adjusted Bayesian regression on model (4) over the generic Bayesian regression on model (1) for three reasons. First, the theoretical analyses of the former allow the number of common factors to grow with the sample size, whereas the latter requires it to be fixed. Second, model (4) provides more flexibility than its submodel (1) in the regression analyses and can potentially extract more explanatory power from the data. On the real dataset of U.S. bond risk premia, the factor-adjusted Bayesian regression achieves 1.0%–3.0% higher out-of-sample $R^2$ with one or two fewer variables (Tables 4–5). Third, in the no-correlation case (although it is unlikely in practice), the factor-adjusted Bayesian regression pays a negligible price for model misspecification (Table 2).
Acknowledgement
We would like to thank Yun Yang for helpful discussions.
References
 Ahn and Horenstein (2013) Ahn, S. C. and Horenstein, A. R. (2013). Eigenvalue ratio test for the number of factors. Econometrica 81 1203–1227.
 Armagan et al. (2013) Armagan, A., Dunson, D. B. and Lee, J. (2013). Generalized double Pareto shrinkage. Statistica Sinica 23 119.
 Bai (2003) Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica 71 135–171.
 Bai and Ng (2006) Bai, J. and Ng, S. (2006). Confidence intervals for diffusion index forecasts and inference for factor-augmented regressions. Econometrica 74 1133–1150.
 Bai and Yin (1988) Bai, Z.D. and Yin, Y.Q. (1988). Necessary and sufficient conditions for almost sure convergence of the largest eigenvalue of a Wigner matrix. Annals of Probability 1729–1741.
 Barron (1998) Barron, A. R. (1998). Information-theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems. Bayesian Statistics 6 27–52.
 Belloni et al. (2012) Belloni, A., Chen, D., Chernozhukov, V. and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica 80 2369–2429.
 Bhattacharya et al. (2015) Bhattacharya, A., Pati, D., Pillai, N. S. and Dunson, D. B. (2015). DirichletLaplace priors for optimal shrinkage. Journal of the American Statistical Association 110 1479–1490.
 Bickel and Levina (2008) Bickel, P. J. and Levina, E. (2008). Covariance regularization by thresholding. Annals of Statistics 36 2577–2604.
 Bickel et al. (2009) Bickel, P. J., Ritov, Y., Tsybakov, A. B. et al. (2009). Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics 37 1705–1732.
 Bühlmann and van de Geer (2011) Bühlmann, P. and van de Geer, S. (2011). Statistics for high-dimensional data: methods, theory and applications. Springer, New York.
 Bunea et al. (2007) Bunea, F., Tsybakov, A., Wegkamp, M. et al. (2007). Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics 1 169–194.
 Candes and Tao (2007) Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics 35 2313–2351.
 Castillo et al. (2015) Castillo, I., SchmidtHieber, J. and van der Vaart, A. (2015). Bayesian linear regression with sparse priors. Annals of Statistics 43 1986–2018.
 Davis and Kahan (1970) Davis, C. and Kahan, W. M. (1970). The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis 7 1–46.
 Dobriban and Fan (2016) Dobriban, E. and Fan, J. (2016). Regularity properties for sparse regression. Communications in Mathematics and Statistics 4 1–19.
 Donoho and Elad (2003) Donoho, D. L. and Elad, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proceedings of the National Academy of Sciences 100 2197–2202.
 Donoho et al. (2006) Donoho, D. L., Elad, M. and Temlyakov, V. N. (2006). Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory 52 6–18.
 Donoho and Huo (2001) Donoho, D. L. and Huo, X. (2001). Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory 47 2845–2862.
 Fama and French (1993) Fama, E. F. and French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics 33 3–56.
 Fan et al. (2016) Fan, J., Ke, Y. and Wang, K. (2016). Decorrelation of covariates for high dimensional sparse regression. arXiv preprint arXiv:1612.08490 .
 Fan and Li (2001) Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association 96 1348–1360.
 Fan et al. (2013) Fan, J., Liao, Y. and Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75 603–680.
 Fan et al. (2018) Fan, J., Liu, H., Sun, Q. and Zhang, T. (2018). I-LAMM for sparse learning: simultaneous control of algorithmic complexity and statistical error. Annals of Statistics 46 814.
 Fan and Lv (2008) Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of Royal Statistical Society B 70 849–911.
 Fan et al. (2011) Fan, J., Lv, J. and Qi, L. (2011). Sparse highdimensional models in economics. Annual Review of Economics 3 291–317.
 Forbes and Rigobon (2002) Forbes, K. J. and Rigobon, R. (2002). No contagion, only interdependence: measuring stock market comovements. Journal of Finance 57 2223–2261.
 Friedman et al. (2010) Friedman, J., Hastie, T. and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33 1–22.
 Ghosal et al. (2000) Ghosal, S., Ghosh, J. K. and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Annals of Statistics 28 500–531.
 Ishwaran and Rao (2005) Ishwaran, H. and Rao, J. S. (2005). Spike and slab gene selection for multigroup microarray data. Journal of the American Statistical Association 100 764–780.
 Jiang et al. (2018) Jiang, B., Sun, Q. and Fan, J. (2018). Bernstein’s inequality for general markov chains. arXiv preprint arXiv:1805.10721 .
 Kneip and Sarda (2011) Kneip, A. and Sarda, P. (2011). Factor models and variable selection in high-dimensional regression analysis. Annals of Statistics 39 2410–2447.
 Lam and Yao (2012) Lam, C. and Yao, Q. (2012). Factor modeling for high-dimensional time series: inference for the number of factors. Annals of Statistics 40 694–726.
 Laurent and Massart (2000) Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics 1302–1338.
 Lintner (1975) Lintner, J. (1975). The valuation of risk assets and the selection of risky investments in stock portfolios and capital budgets. In Stochastic Optimization Models in Finance. Elsevier, 131–155.
 Liu (2005) Liu, J. (2005). Eigenvalue and singular value inequalities of schur complements. In The Schur Complement and its Applications, chap. 2. Springer, 47–82.
 Ludvigson and Ng (2009) Ludvigson, S. C. and Ng, S. (2009). Macro factors in bond risk premia. The Review of Financial Studies 22 5027–5067.
 Luo et al. (2009) Luo, R., Wang, H. and Tsai, C.L. (2009). Contour projected dimension reduction. Annals of Statistics 37 3743–3778.
 McCracken and Ng (2016) McCracken, M. W. and Ng, S. (2016). FRED-MD: a monthly database for macroeconomic research. Journal of Business & Economic Statistics 34 574–589.
 Narisetty and He (2014) Narisetty, N. N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. Annals of Statistics 42 789–817.
 Park and Casella (2008) Park, T. and Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association 103 681–686.
 Pelekis (2016) Pelekis, C. (2016). A lower bound on binomial tails: an approach via tail conditional expectations. arXiv preprint arXiv:1609.06651 .
 Polson and Scott (2012) Polson, N. G. and Scott, J. G. (2012). Local shrinkage rules, Lévy processes and regularized regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 74 287–311.
 Raskutti et al. (2010) Raskutti, G., Wainwright, M. J. and Yu, B. (2010). Restricted eigenvalue properties for correlated gaussian designs. Journal of Machine Learning Research 11 2241–2259.
 Ročková and George (2018) Ročková, V. and George, E. I. (2018). The spike-and-slab lasso. Journal of the American Statistical Association 113 431–444.
 Sharpe (1964) Sharpe, W. F. (1964). Capital asset prices: a theory of market equilibrium under conditions of risk. Journal of Finance 19 425–442.
 Shen and Wasserman (2001) Shen, X. and Wasserman, L. (2001). Rates of convergence of posterior distributions. Annals of Statistics 29 687–714.
 Song and Liang (2017) Song, Q. and Liang, F. (2017). Nearly optimal Bayesian shrinkage for high-dimensional regression. arXiv preprint arXiv:1712.08964 .
 Stock and Watson (2002) Stock, J. H. and Watson, M. W. (2002). Forecasting using principal components from a large number of predictors. Journal of the American Statistical Association 97 1167–1179.
 Su and Candes (2016) Su, W. and Candes, E. (2016). Slope is adaptive to unknown sparsity and asymptotically minimax. Annals of Statistics 44 1038–1068.
 Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58 267–288.
 Wang and Fan (2017) Wang, W. and Fan, J. (2017). Asymptotics of empirical eigenstructure for ultrahigh dimensional spiked covariance model. Annals of Statistics 45 1342–1374.
 Wehrens and Mevik (2007) Wehrens, R. and Mevik, B.H. (2007). The pls package: principal component and partial least squares regression in R. Journal of Statistical Software 18 1–24.
 Yin et al. (1988) Yin, Y.Q., Bai, Z.D. and Krishnaiah, P. R. (1988). On the limit of the largest eigenvalue of the large dimensional sample covariance matrix. Probability Theory and Related Fields 78 509–521.
 Yu et al. (2014) Yu, Y., Wang, T. and Samworth, R. J. (2014). A useful variant of the Davis–Kahan theorem for statisticians. Biometrika 102 315–323.
 Zhang and Huang (2008) Zhang, C.H. and Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. Annals of Statistics 36 1567–1594.
 Zhao and Yu (2006) Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research 7 2541–2563.
Notation. We summarize all notation used in the appendices. Some of them may have been defined in the main body of the paper.
Let $I_d$ be the identity matrix of dimension $d$, $0_d$ the column vector of $d$ zeros, and $0_{d_1 \times d_2}$ the $d_1 \times d_2$ matrix of zeros. If the dimension of an identity or zero matrix is clear from the context, we omit the subscripts. We write $\mathrm{diag}(a_1, \ldots, a_d)$ for a diagonal matrix with elements $a_1, \ldots, a_d$. For a symmetric matrix $A$, we write its trace as