Variable selection and regression analysis for graphstructured covariates withan application to genomics\thanksrefT1
Abstract
Graphs and networks are common ways of depicting biological information. In biology, many different biological processes are represented by graphs, such as regulatory networks, metabolic pathways and protein–protein interaction networks. This kind of a priori use of graphs is a useful supplement to the standard numerical data such as microarray gene expression data. In this paper we consider the problem of regression analysis and variable selection when the covariates are linked on a graph. We study a graphconstrained regularization procedure and its theoretical properties for regression analysis to take into account the neighborhood information of the variables measured on a graph. This procedure involves a smoothness penalty on the coefficients that is defined as a quadratic form of the Laplacian matrix associated with the graph. We establish estimation and model selection consistency results and provide estimation bounds for both fixed and diverging numbers of parameters in regression models. We demonstrate by simulations and a real data set that the proposed procedure can lead to better variable selection and prediction than existing methods that ignore the graph information associated with the covariates.
10.1214/10AOAS332 \volume4 \issue3 2010 \firstpage1498 \lastpage1516
Variable Selection for Graph Covariates \thankstextT1Supported in part by NIH Grants CA127334 and ES009111.
and
Regularization \kwdsign consistency \kwdnetwork \kwdLaplacian matrix \kwdhighdimensional data.
1 Introduction
There has been a growing interest in penalized least squares problems via or other types of regularization, especially in highdimensional settings. Important penalty functions that can lead to sparse variable selection in regression include Lasso [Tibshirani (1996)] and SCAD [Fan and Li (2001)]. In particular, Lasso has the crucial advantage of being a convex problem, which leads to efficient computational algorithms by coordinate descent [Efron et al. (2004); Friedman et al. (2007); Wu and Lange (2008)] and sparse solutions. Zou (2006) proposed a novel adaptive Lasso procedure and presented results on model selection consistency and oracle properties of the parameter estimates. Zhao and Yu (2006) presented the irrepresentable condition for model selection consistency of Lasso. Zhang and Huang (2006) studied the sparsity and bias of the Lasso selection in highdimensional linear regression. Fan and Li (2001) and Huang and Xie (2007) established the asymptotic oracle properties of the SCADpenalized least squares estimators when the number of covariates is fixed or increases with the sample sizes. These novel penalized estimation methods are quite effective in selecting relevant variables and in predicting future outcomes, especially in highdimensional settings.
New estimation procedures have also been developed in recent years to account for certain structures of the explanatory variables. These include the group Lasso procedure [Yuan and Lin (2006)] when the explanatory variables are grouped or organized in a hierarchical manner, the elastic net (Enet) procedure [Zou and Hastie (2005)] that deals with groups of highly correlated variables, and the fused Lasso [Tibshirani et al. (2005)] that imposes the penalty on the absolute differences of the regression coefficients in order to account for some smoothness of the coefficients. Nardi and Rinaldo (2008) established the asymptotic properties of the group Lasso estimator for linear models. Jia and Yu (2008) provided conditions for model selection consistency of the elastic net when . Zou and Zhang (2009) proposed an adaptive elastic net with a diverging number of parameters and established its oracle property. Among these procedures, the Enet regularization and the fused Lasso are particularly appropriate for the analysis of genomic data, where the former encourages a grouping effect and the latter often leads to smoothness of the coefficient profiles for ordered covariates.
Motivated by a genomic application in order to account for network information in the analysis of genomic data, Li and Li (2008) proposed a networkconstrained regularization procedure for fitting linear regression models and for variable selection, where the predictors in the regression model are genomic data that are measured on the genetic networks, which we call the graphstructured covariates. In particular, we assume that the covariates in the regression model are values of the nodes on a graph, where a link between two nodes may indicate a functional relationship between two genes in a genetic network or physical neighborhood between two voxels on brain images. Since many biological networks are constructed using data from highthroughput experiments, often there is a probability associated with an edge to indicate the certainty of a link. Such an edge probability can be used as a weight in a undirected graph, in which case we have a weighted graph. This graphconstrained regularization procedure is similar in spirit to the fused Lasso [Tibshirani et al. (2005)], both of which try to smooth the regression coefficients in certain ways. However, the fused Lasso does not utilize prior graph information. Second, instead of using the norm on the differences of the coefficients of the linked variables, the fused Lasso uses the norm on the differences, which tends to lead to the same regression coefficients for linked variables. In this paper we consider the general problem of regression analysis when the explanatory variables are nodes on a graph and present a cyclical coordinate descent algorithm [Friedman et al. (2007)] to implement the networkconstrained regularization procedure of Li and Li (2008). This algorithm provides new insight on how neighboring variables affect the coefficient estimate of a node. We also extend the procedure of Li and Li (2008) to account for the possibility of different signs of the regression coefficients for neighboring variables. In addition, we provide theoretical results of the estimates, including sign consistency and error bounds of the estimator and consistency.
This paper is organized as follows. In Section 2 we describe the problem of regression analysis with covariates measured on graphs. We then present a graphconstrained estimation (Grace) procedure in order to account for the graph structures in Section 2.1 and an efficient coordinate descent algorithm to implement the proposed regularization methods in Section 2.3. We present the estimation and model selection consistency results in Section 3. We provide Monte Carlo simulation results in Section 4 and results from the application to the analysis of a data set on the gene expression of brain aging in Section 5. Finally, we give a brief discussion of the methods and results.
2 Regression analysis for covariates measured on a graph
Consider a weighted graph , where is the set of vertices that correspond to the predictors, is the set of edges indicating that the predictors and are linked on the graph and there is an edge between and , and is the set of weights of the edges, where denotes the weight of edge . In genomic studies, biological networks are often represented as graphs, an edge between and on the graph can indicate some functional relationship between them and the weight can be used to measure the uncertainty of the edge between two vertices, for example, indicating the probability of having an edge between two variables when the graph is constructed from data. For each given sample, we assume that we have numerical measurements on each of the vertices and these measurements are treated as explanatory variables in a regression analysis framework. For the th node, let be the numerical measurement of the th vertex on the graph for the th individual. Further, let be the measured values at the th vertex for i.i.d. samples. Consider the problem of variable selection and estimation where we have design matrix and response vector , and they follow a linear model
(1) 
where and . Throughout this paper we assume that the predictors and the response are centered so that
In this paper we consider that the design matrix is a deterministic matrix in the fixed design settings.
When is large, we assume that model (1) is “sparse,” that is, most of the true regression coefficients are exactly zero. Without loss of generality, we assume the first elements of vector are nonzeroes. Denote and , then elementwise and . Now write and as the first and last columns of , respectively, and let , which can then be expressed in the following blockwise form:
The goal of this paper is to develop a regularization procedure for selecting the true relevant variables. Different from the existing approaches, we particularly account for the fact that the explanatory variables are related on a graph. We make this more precise in the next section.
2.1 Graphconstrained regularization and variable selection
In order to account for the fact that the explanatory variables are measured on a graph, we first introduce the Laplacian matrix [Chung (1997)] associated with a graph. Let the degree of the vertex be . We say is an isolated vertex if . Let . Following Chung (1997), we define the Laplacian matrix for graph with the th element defined by
(2) 
It is easy to verify that this matrix is positive semidefinite with 0 as the smallest eigenvalue and 2 as the largest eigenvalue when all the weights are 1 [Chung (1997)]. To allow the matrix to change with , we further express this matrix in blockwise form,
where corresponds to the nodes that are relevant to the response and corresponds to the nodes that are not relevant.
The Laplacian matrix has the following interpretations. For a given vector , the edge derivative of along the edge at is defined as
and, therefore, the local variation of at can be measured by
The smoothness of vector with respect to the graph structure can be expressed as
This variation functional for vectors penalizes vectors that differ too much over nodes that are linked. It contains a scaling by . One intuitive reason for such a scaling is to allow a small number of nodes with large to have more extreme values of while the usually much greater number of nodes with small should not ordinarily allow to have very large . This variation functional has been widely used in semisupervised learning on graphs [Zhu (2005); Zhou et al. (2004)].
For many problems with covariates measured on a graph, we would expect that the neighboring variables are correlated and, therefore, the regression coefficients would show some smoothness. One way to account for such a dependence of the regression coefficients is to impose a Markov random field (MRF) prior to the collection of vectors. The MRF decomposes the joint prior distribution of the ’s into lowerdimensional distributions based on the graphneighborhood structures. A common MRF model is the Gaussian MRF model that assumes that the joint distribution of is given by
which is an improper density. Based on this Gaussian MRF prior assumption, Li and Li (2008) introduced the following graphconstrained estimation of the regression coefficients, denoted by ,
(3) 
where
where is the Laplacian as defined in (2) and the tuning parameters control the amount of regularization for sparsity and smoothness. For the special case when , the estimate reduces to the Lasso, and when is the identity matrix, the estimate reduces to the elastic net estimates.
2.2 An adaptive graphconstrained regularization
The Grace procedure may not perform well when two variables that are linked on the graph have different signs in their regression coefficients, in which case the coefficients are not expected to be locally smooth. For example, for genetic networks, two genes might be negatively correlated with the phenotypes and are therefore expected to have different signs in their regression coefficients. In order to account for the sign differences, we can first perform a standard least square regression when or the elastic net regression when and denote the estimate as . We can then modify the above objective function as
where
Note that the matrix is still positive semidefinite. We call the defined by
(4) 
the adaptive Grace (aGrace).
2.3 A coordinate descent algorithm
Friedman et al. (2007) presented a coordinate descent algorithm for solving the Lasso and the Enet regularization. In this section we develop a similar algorithm for the proposed graphconstrained regularization. We only present the detailed algorithm for the optimization problem defined by equation (3). Similar algorithms can be developed by the aGrace defined by (4). If we let and , the Grace can be written as
(5) 
where
is the graphconstrained penalty function.
Given a covariate , suppose we have estimated for , and we want to partially minimize the objective function with respect to . We can rewrite the objective function in as
We would like to compute the gradient at , which only exists when . We first consider the case that the covariate is connected to some other nodes (variables) on the network. If , due to the standardization of the covariates, we can differentiate the objective function with respect to as
Similarly, we can get the corresponding expression when . Following the calculus by Donoho and Johnstone (1994) and Friedman et al. (2007), we obtain the coordinatewise update form for as
(6) 
where:

[]

is the partial residual for fitting , that is, the fitted value excluding the contribution from . Since the covariates are standardized, is the simple leastsquares coefficient while fitting the partial residual to , .

is the softthresholding operator with value
When covariate is not connected to other nodes on the network, that is, when it has no neighbors, the corresponding coordinatewise updating formula becomes the Lasso updating formula, that is
(7) 
Comparing the two updated forms of (6) and , an intuitive explanation can be drawn to help to understand the effect of the graphconstraint penalty on the coefficients. For an isolated predictor, the graph penalty is vanished and, thus, we only apply a softthresholding for the Lasso penalty, while for a connected predictor, form takes into account the graphconstraint to the penalty by adding the scaled summation of the coefficients of the neighboring covariates to the simple leastsquares coefficient and applying a proportional shrinkage for the graph penalty.
Given , we can compute the solution path for a decreasing sequence of values for , starting from the smallest value for which there is no covariate selected, that is, . Similar to Friedman et al. (2007), we can set , and construct a sequence of values of decreasing from to on the log scale. Typical values are and . Crossvalidation (CV) can be used to select the two tuning parameters and .
3 Error bound and model selection consistency for fixed and diverging
In this section we provide some theoretical results on the proposed Grace procedure, including the error bounds, consistency of Grace and the model selection consistency for both fixed and diverging when tends to infinity with the sample size . Our theoretical development follows that of Zhao and Yu (2006), Jia and Yu (2008) and Zou and Zhang (2009) on sign consistency of Lasso and adaptive elastic net estimates. In our theoretical analysis, we assume the following regularity conditions throughout: {longlist}[(A1)]
We use and to denote the minimum and maximum eigenvalues of a positive definite matrix , respectively. We further assume that is positive definite and
where and are two positive constants that do not depend on .
, as .
These two conditions assume that the predictor matrix has a reasonably good behavior and were also assumed in Zhao and Yu (2006) and in Zou and Zhang (2009). Condition (A1) is also the condition (F) in Fan and Peng (2004) and condition (A2) ensures that the rows of the matrix behave like a sample from a probability distribution in [Portnoy (1984)]. These two conditions hold naturally if one assumes that are i.i.d. with finite second moments.
3.1 Error bound and consistency of Grace
We first provide the following nonasymptotic risk bound for the Grace of the regression coefficients defined by (2.1) for any and :
Theorem 3.1
Given the data , define the Grace as
for nonnegative tuning parameters and . Then under the regularity condition (A1), we have
(8) 
The proof of this theorem is given in Li and Li (2010). Note that this result is not asymptotic and holds for any and . From this theorem, under the regularity assumption (A1) and the following further assumptions on and the tuning parameters : {longlist}[(A.3)]
,
,
and , we have
which implies that the Grace of is consistent. This result implies that the Grace procedure chooses the important variables with high probability and that falsely chosen variables by Grace have very small coefficients. The consistency result suggests that we may use some hardthresholding procedure to further eliminate the variables with very small Grace coefficients. Alternatively, an interesting randomized selection procedure proposed by Bickel, Ritov and Tsybakov (2008) can be used to further eliminate the variables with small estimated Grace coefficients. Note that under the classical setting where , and are all fixed as , the assumptions (A3)–(A5) hold when
3.2 Model selection consistency when is fixed
We next establish the results on model selection consistency for the standard case where and are fixed when . Following Zhao and Yu (2006), we define the Grace of to be sign consistent if there exists and as functions of such that
To establish the sign consistency of the Grace, we first provide the following graphconstrained irrepresentable condition (GCIC): there exists and , , such that
(9)  
where is a vector of 1s with length and the inequality holds elementwise. Further, we assume that , where is positive definite. The GCIC is a consequence of the Karush–Kuhn–Tucker (KKT) conditions for the following constrained optimization problem that corresponds to the penalized optimization problem of equation (5):
Theorem 3.2
For fixed and , if , where is positive definite and condition (A.2) holds, the graphconstrained estimate is sign consistent if and only if GCIC (3.2) holds for that satisfy and , for .
This theorem is a special case of Theorem 3.3 and its proof is similar to that of Zhao and Yu (2006) for Lasso estimates. We therefore omit its proof in this paper. Note that the required conditions on the sparsity tuning parameter are the same as those for the Lasso [Zhao and Yu (2006)], for example, is a suitable choice. This theorem indicates that under some restrictive conditions of the design matrix and the Laplacian matrix of the network, the sign consistency property holds for the graphconstrained regularization. To gain further insight into GCIC, consider the special cases when is preselected and fixed and when goes to infinity, the GCIC reverses back to the irrepresentable condition for the Lasso given in Zhao and Yu (2006) and the graphconstrained penalty function is reduced to the Lasso penalty.
3.3 Model selection consistency when diverges
We now consider the model selection consistency of the graphconstrained regularization procedure under the settings when the number of covariates also goes to infinity as , in which case, the assumptions and the regularity conditions for Theorems 3.1 and 3.2 become inappropriate as does not converge and may change as grows. The following theorem shows that for the general scalings when and all go to infinity, under some additional conditions between and , GCIC also guarantees that the Grace is sign consistent in selecting the true model.
Theorem 3.3
Suppose each column of is normalized to the norm of and GCIC (3.2) holds. Define and , where denotes the minimal eigenvalue. Let . If and are chosen such that: {longlist}[(a)]
If ,
or if ,
If , then the Grace is sign consistent as .
A proof analogous to Jia and Yu (2008) can be found in Li and Li (2010). Theorem 3.3 gives a general sign consistency result for the Grace for general scalings of and . If for some and for some , it is easy to check that the conditions and guarantee that condition (b) in Theorem 3.3 holds. In the settings when and are fixed, if converges to a nonnegative definite matrix , converges to a nonnegative number. In addition, it is easy to verify that the conditions in Theorem 3.2 guarantee that the conditions (a) and (b) in Theorem 3.3 hold.
4 Monte Carlo simulations
We conducted Monte Carlo simulations to evaluate the proposed Grace and aGrace procedures and to compare the performance of this procedure with Lasso and Enet in terms of prediction errors and identification of relevant variables. We simulated the graph to mimic gene regulation modules. We used genes and variables interchangeably in this section. We assumed that the graph consisted of 200 unconnected regulatory modules with 200 transcription factors (TFs) and each regulated 10 different genes for a total of 2200 variables. Among these modules and genes, we further assumed that four TFs and their 10 regulated genes (for a total of 44 variables) were associated with the response based on the following model:
(10) 
We considered two different models. For the first model, we assumed that the coefficients in model (10) were specified as
and the was random meanzero normal error with variance . For each TF, the value was simulated from a distribution, and conditional on the value of the TF, we simulated the expression levels of the genes that they regulated from a conditional normal distribution with correlations of 0.2, 0.5 and 0.9, respectively. We therefore had a total of 2200 variables and 44 of them were relevant. For the second model, we considered the case when the regulated genes had different signs in regression coefficients, where the regression coefficients in model (10) have the same absolute values as in Model 1, but for each simulated module, three out of the 10 genes regulated by the TF had different signs from the other 7 genes. The values were generated in the same way as in previous simulations. In this model, genes that are regulated by the same transcription factor are assumed to have different regression coefficients.
For each model, our simulated data consisted of a training set, an independent validation set and an independent test set with a sample size of 200 for all three data sets. Models were fitted on training data only, and the validation data were used to select the tuning parameters. We computed the prediction meansquared errors on the test data set. For each model, we repeated the simulations 100 times. Table 1 shows the prediction meansquare errors for several different procedures. For Model 1 when the neighboring genes have the same signs in regression coefficients, we observed that the Grace gave the smallest prediction errors for all four models with different correlations among the predictors. Both Grace and Enet performed better than Lasso in prediction. When the correlation is very high, the prediction errors of these procedures were comparable, however, Grace still gave the smallest prediction error among the procedures compared. When the signs of the regression coefficients were the same, aGrace was reduced to Grace and gave the same prediction results. For Model 2 when the neighboring variables have different signs of coefficients, aGrace adjusting for the signs of the regression coefficients gave the smallest prediction errors, further indicating the importance of adjusting for the signs in the regularization. In general, Grace gave similar prediction results as the Enet, except when the correlation between the transcription factors and their regulated genes was very high, in which case Enet resulted in a slightly smaller prediction error.
Model 1 (Cor)  Model 2 (Cor)  

Method  0.2  0.5  0.9  0.2  0.5  0.9 
Grace  24.93  23.22  22.56  53.08  42.07  28.20 
(2.97)  (2.41)  (2.20)  (6.45)  (5.03)  (2.87)  
aGrace  24.93  23.22  22.56  27.70  26.23  25.55 
(2.97)  (2.41)  (2.20)  (3.66)  (3.03)  (2.76)  
Enet  51.33  37.37  25.82  56.18  45.65  27.33 
(6.65)  (4.69)  (2.67)  (7.22)  (5.81)  (2.72)  
Lasso  53.41  40.30  27.82  57.62  47.65  29.23 
(6.68)  (4.94)  (2.98)  (7.01)  (5.53)  (2.78) 
To compare the performance on variable selection, Figure 1 shows the receiver operating characteristic (ROC) curves of several different procedures in selecting the relevant variables for the models with correlation of 0.2 and 0.9 between the TF and their regulated genes. For Grace, aGrace and Enet, the ROC curves were obtained as a function of the sparsity parameter with tuning parameter selected based on 5fold crossvalidation among the values of 0.1, 1, 10, 100 and 1000. For Model 1 when the neighboring genes have the same signs in regression coefficients [Figure 1 plots (a) and (b)], Grace gave much larger areas under the ROC cruves than Enet and Lasso, indicating better performance in variable selection for Grace. In addition, fivefold crossvalidation always chose the largest for Grace and for Enet in all 100 replications. For Model 2 when the neighboring variables have different signs of coefficients, aGrace adjusting for the signs of the regression coefficients performed better than the other three procedures compared, resulting in larger areas under the curves, and Grace still performed better than Lasso and Enet on variable selections in both low and high correlation scenarios. When the correlation among the relevant variables is low, the 5fold CV selected for aGrace and for Enet in all 100 replications and selected for Grace in most of the replications. When the correlation among the relevant variables was high, the 5fold CV selected for aGrace and for Grace in most of the 100 replications and selected for Enet in all the replications.
5 Application to networkbased analysis of gene expression data
To demonstrate the proposed method, we consider the problem of identifying agedependent molecular modules based on the gene expression data measured in human brains of individuals of different ages published in Lu et al. (2004). In this study the gene expression levels in the postmortem human frontal cortex were measured using the Affymetrix arrays for 30 individuals ranging from 26 to 106 years of age. To identify the agingregulated genes, Lu et al. (2004) performed simple linear regression analysis for each gene with age as a covariate. We analyzed this data set by combining the KEGG regulatory network information with the gene expression data [Kanehisa and Goto (2002)]. In particular, we limited our analysis to the genes that can be mapped to the KEGG regulatory work and focused on the problem of identifying the subnetworks of the KEGG regulatory network that are associated with human brain aging. By merging the gene expression data with the KEGG regulatory pathways, the final KEGG network includes 1305 genes and 5288 edges.
No. genes  No. edges  CV error  Tuning parameters  

Grace  45  9  0.079  , 
aGrace  73  28  0.077  , 
Enet  41  10  0.077  , 
Lasso  18  0  0.098 
We treated the logarithm of the individual age as the response variable and the expression levels (after transformation) of 1305 genes on the KEGG network as the explanatory variables in our analysis. Table 2 shows the results of several different procedures where the tuning parameters were selected by fivefold crossvalidations. Overall, we observed that the Lasso selected the fewest number of genes with relatively large crossvalidation errors and Grace and Enet selected roughly the same number of genes with similar CV errors. However, the adaptive Grace resulted in more identified genes with similar CV errors than the other two procedures. Figure 2 shows the subnetworks identified by four different estimation procedures. As a comparison, we also included the genes selected by Lasso, although it did not select any linked pairs of genes on the KEGG network. It is interesting to note that as we impose more constraints on the regression coefficients, more linked genes are identified. Both Enet and Grace identified some common subnetworks that were associated with brain aging. These included fibroblast growth factors (FGF) and its receptors. It has been demonstrated that FGFs are associated with many developmental processes including neural induction [Bottcher and Niehrs (2005)] and are involved in multiple functions including cell proliferation, differentiation, survival and aging [Yeoh and de Haan (2007)]. It is also interesting to observe that mitogenactivated protein kinase (MAPK) (MAPK1 and MAPK9) and the specific MAPK kinase (MAP2K) were also identified by Enet and Grace. The MAPKs play important roles in induction of apoptosis [Hayesmoore et al. (2009)]. Other interesting genes include RAS proteinspecific guanine nucleotidereleasing factor 1 (RASGRF1), the functionality of which is highly significant in various contexts of the central nervous system. In the hippocampus, RASGRF2 has been shown to interact with the NR2A subunits of NMDARs, triggering RasERK activation and induction of longterm potentiation, a form of neuronal plasticity that contributes to memory storage in the brain [Tian et al. (2004); Lu et al. (2004)]. Finally, the insulin receptor gene (INSR) is also identified. INSR binds insulin (INS) and regulates energy metabolism. Evidence from model organisms, including results from fruit flies [Tatar et al. (2001)] and roundworms [Kimura et al. (1997)], relates INSR homologues to aging, most likely as part of the GH1/IGF1 axis. These results indicated that our method can indeed recover some biologically interesting molecular modules or KEGG subnetworks that are related to brain aging in humans.
It is important to point out that the adaptive Grace identified several small sets of connected genes that were missed by Enet or the standard Grace. One of the subnetworks included EPHRIN and Eph receptor, both of which were found to be related to neural development and entohinohippocampal axon targeting [Flanagan and Vaderhaeghen (1998); Stein et al. (1999)]. Another subnetwork was part of the JakState signaling, which is important in both mature and aging brains [DeFrajaa et al. (2000)]. Aging was also found to be associated with increased human T cell CC chemokine receptor gene expression [Yung et al. (2003)]. Other interesting subnetworks included PVRL3–PVRL1 that are associated with cell adhesion.
6 Discussion
We have introduced and studied the theoretical properties of a graphconstrained regularized estimation procedure for linear regressions in order to incorporate information coded in graphs. Such a regularization procedure can also be regarded as a penalized least squared estimation where the penalty is defined as a combination of the and penalties on degreescaled differences of coefficients between variables linked on the graphs. This penalty function induces both sparsity and smoothness with respect to the graph structure of the regression coefficients. Simulation studies indicated that when the coefficients are similar for variables that are neighbors on the graph, the proposed procedure has better prediction and identification performance than other commonly used regularization procedures such as Lasso and elastic net regressions. Such improvement results from effectively utilizing the neighboring information in estimating the regression coefficients. If the smoothness assumption on the coefficients does not hold, we expect that the crossvalidation selects a very small value of and, therefore, the proposed procedure would perform similarly as the Lasso. In analysis of the brain aging gene expression data, different from Lasso, the new procedure tends to identify sets of linked genes on the networks, which often leads to better biological interpretation of the genes identified. Although the methods presented are largely motivated by applications in genomic data, they can be applied to other settings when the covariates are nodes on general graphs, such as in image analysis.
Although the methods presented in this paper were developed mainly for linear models, similar methods can be developed for the generalized linear models and the censored survival data regression models, where we can use the negative of the logarithm of the likelihood or partial likelihood as the loss function. Similar to the techniques presented in Friedman et al. (2007) and Wu and Lange (2008), we can use the coordinate descent procedure together with the iterative reweighted least square to obtain the solution path. Such models have great applications in genomic data analysis in identifying the genes or subnetworks that are associated with binary or censored survival data outcomes. Other extensions include replacing the part of the Grace penalty with other sparse penalty functions such as SCAD or bridge penalty [Huang et al. (2008)]. Important future research also includes how to handle covariates that are linked on directed graphs. Finally, to incorporate the fact that the linked nodes might be negatively correlated and the corresponding regression coefficients may have different signs, we introduced an adaptive signadjusted graphconstrained regularization procedure and showed that such a procedure can perform better than the original graphconstrained regularization. The theoretical property of such an adaptive procedure is unknown and is an area for future research.
Acknowledgments
We thank the two reviewers for their comments that have greatly improved the presentation of this paper.
[idsuppA] \stitleProofs of Theorem 3.1 and Theorem 3.3 \slink[doi]10.1214/10AOAS332SUPP \slink[url]http://lib.stat.cmu.edu/aoas/332/supplement.pdf \sdatatype.pdf \sdescriptionWe present the details of the proofs of Theorem 3.1 and Theorem 3.3 in the Supplemental Materials.
References
 (1) Bickel, P. L., Ritov, Y. and Tsybakov, A. B. (2008). Hierarchical selection of variables in sparse highdimensional regression. Technical report, Dept. Statistics, Univ. California, Berkeley.
 (2) Bottcher, R. T. and Niehrs, C. (2005). Fibroblast growth factor signaling during early vertebrate development. Endocrine Reviews 26 63–77.
 (3) Chung, F. (1997). Spectral Graph Theory. CBMS Reginal Conferences Series 92. Amer. Math. Soc., Providence, RI. \MR1421568
 (4) DeFraja, C., Conti, L., Govoni, S., Battaini, F. and Cattaneo, E. (2000). STAT signalling in the mature and aging brain. International Journal of Developmental Neuroscience 18 439–446.
 (5) Donoho, D. and Johnstone, I. (1994). Ideal spatial adaptation via wavelet shrinkage. Biometrika 81 425–455. \MR1311089
 (6) Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression. Ann. Statist. 32 407–499. \MR2060166
 (7) Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 1348–1360. \MR1946581
 (8) Fan, J. and Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Ann. Statist. 32 928–961. \MR2065194
 (9) Flanagan, J. G. and Vanderhaeghen, P. (1998). The ephrins and Eph receptors in neural development. Annual Review Neuroscience 21 309–345.
 (10) Friedman, J., Hastie, T., Hoefling, H. and Tibshirani, R. (2007). Pathwise coordinate optimization. Ann. Appl. Statist. 1 302–332. \MR2415737
 (11) Hayesmoore, J. B., Bray, N. J., Cross, W. C., Owen, M. J., O’Donovan, M. C. and Morris, H. R. (2009). The effect of age and the H1c MAPT haplotype on MAPT expression in human brain. Neurobiol. Aging 30 1652–1656.
 (12) Huang, J., Horowitz, J. L. and Ma, S. (2008). Asymptotic properties of bridge estimators in sparse highdimensional regression models. Ann. Statist. 36 587–613.
 (13) Huang, J. and Xie, H. (2007). Asymptotic oracle properties of SCADpenalized least squares estimators. In Asymptotics: Particles, Processes and Inverse Problems. IMS Lecture Notes Monogr. Ser. 55 149–166. IMS, Beachwood, OH. \MR2459937
 (14) Jia, J. and Yu, B. (2008). On model selection consistency of elastic net when . Technical Report 756, Dept. Statistics, Univ. California, Berkeley.
 (15) Kanehisa, M. and Goto, S. (2002). KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28 27–30.
 (16) Kimura, K. D., Tissenbaum, H. A., Liu, Y. and Ruvkun, G. (1997). daf2, an insulin receptorlike gene that regulates longevity and diapause in Caenorhabditis elegans. Science 277 942–946.
 (17) Li, C. and Li, H. (2008). Networkconstrained regularization and variable selection for analysis of genomic data. Bioinformatics 24 1175–1182.
 (18) Li, C. and Li, H. (2010). Supplement to “Variable selection and regression analysis for graphstructured covariates with an application to genomics” DOI: 10.1214/10AOAS332SUPP.
 (19) Lu, T., Pan, Y., Kao, S.Y., Li, C., Kohane, I., Chan, J. and Yankner, B. A. (2004). Gene regulation and DNA damage in the aging human brain. Nature 429 883–891.
 (20) Nardi, Y. and Rinado, A. (2008). On the asymptotic properties of the group lasso estimator for linear models. Electron. J. Statist. 2 605–633. \MR2426104
 (21) Portnoy, S. (1984). Asymptotic behavior of Mestimators of regression parameters when is large. I. Consistency. Ann. Statist. 12 1298–1309. \MR0760690
 (22) Stein, E., Savaskan, N. E., Ninnemann, O., Nitsch, R., Zhou, R. and Skutella, T. (1999). A role for the Eph ligand ephrinA3 in entorhinohippocampal axon targeting. Journal of Neuroscience 19 8885–8893.
 (23) Tatar, M., Kopelman, A., Epstein, D., Tu, M. P., Yin, C. M. and Garofalo, R. S. (2001). A mutant Drosophila insulin receptor homolog that extends lifespan and impairs neuroendocrine function. Science 292 107–110.
 (24) Tian, X., Gotoh, T., Tsuji, K., Lo, E. H., Huang, S. and Feig, L. A. (2004). Developmentally regulated role for RasGRFs in coupling NMDA glutamate receptors to Ras, Erk and CREB. EMBO J. 23 1567–1575.
 (25) Tibshirani, R. J. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58 267–288. \MR1379242
 (26) Tibshirani, R., Saunders, M., Rosset, S., Zhu, J. and Knight, K. (2005). Sparsity and smoothness via the fused lasso. J. Roy. Statist. Soc. Ser. B 67 91–108. \MR2136641
 (27) Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. Ann. Appl. Statist. 2 224–244. \MR2415601
 (28) Yeoh, J. S. and de Haan, G. (2007). Fibroblast growth factors as regulators of stem cell selfrenewal and aging. Mechanisms of Ageing and Development 128 17–24.
 (29) Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. Roy. Statist. Soc. Ser. B 68 49–67. \MR2212574
 (30) Yung, R. L. and Mo, R. (2003). Aging is associated with increased human T cell CC chemokine receptor gene expression. Journal of Interferon & Cytokine Research 23 575–582.
 (31) Zhang, C. and Huang, J. (2006). The sparsity and bias of the Lasso selection in highdimensional linear regression. Ann. Statist. 36 1567–1594. \MR2435448
 (32) Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. J. Mach. Learn. Res. 7 2541–2567. \MR2274449
 (33) Zhou, D., Bousquet, O., Lal, T., Weston, J. and Scholkopf, B. (2004). Learning with local and global consistency. In NIPS 16 321–328. MIT Press, Cambridge, MA.
 (34) Zhu, X. (2005). Semisupervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin–Madison.
 (35) Zou, H. (2006). The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 1418–1429. \MR2279469
 (36) Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. Roy. Statist. Soc. Ser. B 67 301–320. \MR2137327
 (37) Zou, H. and Zhang, H. H. (2009). On the adaptive elastic net with a diverging number of parameters. Ann. Statist. 37 1733–1751. \MR2533470