Joint estimation of sparse multivariate regression and conditional graphical models

# Joint estimation of sparse multivariate regression and conditional graphical models

Junhui Wang
Department of Mathematics, Statistics,
and Computer Science
University of Illinois at Chicago
Chicago, IL 60607
###### Abstract

The multivariate regression model is a natural generalization of the classical univariate regression model for fitting multiple responses. In this paper, we propose a high-dimensional multivariate conditional regression model for constructing sparse estimates of the multivariate regression coefficient matrix that account for the dependency structure among the multiple responses. The proposed method decomposes the multivariate regression problem into a series of penalized conditional log-likelihoods, one for each response conditioned on the covariates and the other responses. It allows simultaneous estimation of the sparse regression coefficient matrix and the sparse inverse covariance matrix. Asymptotic selection consistency and normality are established when both the dimension of the covariates and the number of responses diverge. The effectiveness of the proposed method is also demonstrated in a variety of simulated examples as well as an application to the Glioblastoma multiforme cancer data.


Key words: Covariance selection, Gaussian graphical model, large p small n, multivariate regression, regularization

## 1 Introduction

The multivariate regression model is a key statistical tool for analyzing datasets with multiple responses. A standard approach is to decompose the multivariate regression model and fit each response via a marginal univariate regression model. However, this approach is suboptimal in general, as it completely ignores the dependency structure among the responses. For example, the gene expressions of many genes are strongly correlated due to shared genetic variants or other unmeasured common regulators (Kendziorski et al., 2006). With the dependency structure appropriately incorporated, one would naturally expect a more efficient multivariate regression model in terms of both estimation and prediction. Furthermore, the dependency structure among the responses can be nicely interpreted in a graphical model under the multivariate Gaussian assumption (Edwards, 2000), where two Gaussian responses are connected in the graph if the corresponding entry in the precision matrix (inverse covariance matrix) is nonzero.

To model the multivariate regression problem, Breiman and Friedman (1997) proposed the curds and whey method to improve prediction performance by utilizing the dependency among responses. The curds part fits a univariate regression model for each response against the covariates, and the whey part refits each response against the fitted values from the curds part. However, the method was developed in the low-dimensional setup and does not address the challenges when the data dimension is diverging. Yuan et al. (2007) and Chen and Huang (2012) proposed high-dimensional reduced-rank regression models, which assume that all marginal regression functions reside in a common low-dimensional space. This approach focuses on dimension reduction and relies largely on the reduced-rank assumption. Turlach et al. (2005) imposed sparsity on the regression model through a norm penalty on the coefficient matrix. This method is able to identify sparsity, but may produce bias in model estimation due to the non-adaptive penalty. Recent work by Rothman et al. (2010), Yin and Li (2011) and Lee and Liu (2012) formulated the multivariate regression problem in a penalized log-likelihood framework that allows joint estimation of the multivariate regression model and the conditional Gaussian graphical model. This formulation requires an alternating optimization scheme, which is computationally expensive and cannot guarantee a global optimum.

In this paper, we propose a multivariate conditional regression model to tackle the multivariate regression problem with diverging dimension. The key idea is to formulate the problem through the conditional log-likelihood function of each response conditioned on the covariates and the other responses. The conditional log-likelihood function is then equipped with the adaptive Lasso penalty (Zou, 2006) to facilitate joint estimation of the sparse multivariate regression coefficient matrix and the sparse precision matrix. The proposed model leads to a series of augmented adaptive Lasso regression models, which can be efficiently solved by any existing optimization package. More importantly, its asymptotic properties are established in terms of estimation consistency and selection consistency with diverging dimension. Specifically, the dimension of covariates and the number of responses are allowed to diverge at an exponential order of the sample size. Numerical experiments with both simulated and real examples also support the effectiveness of the proposed method.

The rest of the paper is organized as follows. Section 2 provides a brief introduction to the multivariate regression model, with an emphasis on the penalized log-likelihood method. Section 3 describes the proposed penalized conditional log-likelihood method in detail, with theoretical justification in Section 4 and numerical experiments in Section 5. Section 6 contains a discussion, and the Appendix is devoted to the technical proofs.

## 2 Preliminaries

In a multivariate regression setting, suppose that the training dataset consists of $(x_i, y_i)$, $i = 1, \ldots, n$, where $x_i \in R^p$ and $y_i \in R^q$. Let $X = (x_1, \ldots, x_n)^T$ and $Y = (y_1, \ldots, y_n)^T$ be the design matrix and response matrix, and let $X_j$ and $y_k$ be the $j$-th covariate and the $k$-th response. For simplicity, the covariates and responses are centered, so that

$$\sum_{i=1}^n x_{ij} = 0,\quad \sum_{i=1}^n y_{ik} = 0;\qquad j = 1,\ldots,p;\ k = 1,\ldots,q.$$

A standard multivariate regression model is then formulated as

$$Y = XB + e, \qquad (1)$$

where $B = (\beta_1, \ldots, \beta_q)$ with $\beta_k$ being the regression coefficient vector for the $k$-th response, and $e = (e_1, \ldots, e_n)^T$ with $e_i$ being the $i$-th error vector. The random vectors $e_i$ are assumed to be independently and identically sampled from a $q$-dimensional Gaussian distribution with positive definite covariance matrix $\Sigma$.

The maximum likelihood formulation of (1), after dropping constant terms, yields

$$\min_{B,\Omega}\ -\log|\Omega| + \mathrm{tr}\big((Y - XB)\,\Omega\,(Y - XB)^T\big), \qquad (2)$$

where $\Omega = \Sigma^{-1}$ is positive definite and known as the precision matrix. The precision matrix is closely connected with Gaussian graphical models (Edwards, 2000), since the conditional dependency structure among the responses can be fully determined by $\Omega$. Specifically, $\omega_{sk} = 0$ implies that the $s$-th and $k$-th responses are conditionally independent given the covariates and the other response variables.

When the dimension of covariates is large, it is generally believed that the responses rely on only a small proportion of them, while the other covariates are noise and provide no information about the responses at all. In addition, when the number of responses is large, the dependency structure among responses becomes sparse, as some responses may have little relationship with each other. Therefore, the penalized log-likelihood approach has been widely employed to analyze the multivariate regression model in the literature, including Rothman et al. (2010), Yin and Li (2011) and Lee and Liu (2012). The penalized likelihood approach can be formulated as

$$\min_{B,\Omega}\ -\log|\Omega| + \mathrm{tr}\big((Y - XB)\,\Omega\,(Y - XB)^T\big) + \lambda_{1n}\, p_1(B) + \lambda_{2n}\, p_2(\Omega), \qquad (3)$$

where $p_1(\cdot)$ and $p_2(\cdot)$ are sparsity-encouraging penalties, such as the adaptive Lasso penalties with appropriate weights, and $\lambda_{1n}$ and $\lambda_{2n}$ are two tuning parameters. To optimize (3), an alternating updating scheme is used, which updates $B$ and $\Omega$ separately while pretending the other is fixed. Specifically, when $B$ is fixed, (3) can be solved via the graphical Lasso algorithm (Friedman, 2008), and when $\Omega$ is fixed, (3) can be solved via the coordinate descent algorithm (Lee and Liu, 2012). However, as pointed out in Yin and Li (2011) and Lee and Liu (2012), the alternating updating scheme cannot guarantee a global optimum, and is often computationally expensive and thus not practically scalable.
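The alternating scheme for (3) can be sketched as follows. This is a toy illustration, not the authors' implementation: the $B$-step uses proximal gradient in place of coordinate descent, the $\Omega$-step uses a ridge-stabilized inverse residual covariance in place of the graphical Lasso, and all names and parameter values (`alternating_fit`, `lam1`, `eps`) are illustrative.

```python
import numpy as np

def soft_threshold(A, t):
    """Elementwise soft-thresholding, the proximal map of the L1 penalty."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def alternating_fit(X, Y, lam1=0.1, eps=1e-2, n_outer=10, n_inner=50):
    """Alternate between the B-step (Omega fixed) and the Omega-step (B fixed)
    for an objective of the form (3). The B-step runs proximal gradient on
    tr((Y-XB) Omega (Y-XB)^T)/n + lam1*||B||_1; the Omega-step inverts the
    ridge-stabilized residual covariance (a stand-in for the graphical Lasso)."""
    n, p = X.shape
    q = Y.shape[1]
    B, Omega = np.zeros((p, q)), np.eye(q)
    for _ in range(n_outer):
        # Lipschitz constant of the gradient of the smooth part w.r.t. B
        L = 2.0 * np.linalg.norm(X.T @ X, 2) * np.linalg.norm(Omega, 2) / n
        for _ in range(n_inner):
            grad = -2.0 * X.T @ (Y - X @ B) @ Omega / n
            B = soft_threshold(B - grad / L, lam1 / L)
        R = Y - X @ B
        Omega = np.linalg.inv(R.T @ R / n + eps * np.eye(q))
    return B, Omega
```

Note how the two steps couple: the $B$-step depends on the current $\Omega$ through the weighted quadratic form, which is why the scheme must be iterated and why it can stall at a local solution of the joint problem.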

## 3 Proposed Methodology

In this section, a new estimation method based on penalized conditional log-likelihood is developed for jointly estimating the sparse multivariate regression coefficient matrix and the sparse precision matrix. The key idea is motivated by the simple fact that, given the model in (1),

$$y_k \mid (X, Y_{-k}) \sim N_n\big(X\beta_k + (Y_{-k} - XB_{-k})\gamma_k,\ \tilde\sigma_{kk} I_n\big), \qquad (4)$$

for any $k$, where $Y_{-k}$ denotes the response matrix without $y_k$, $B_{-k}$ denotes the coefficient matrix without $\beta_k$, $\tilde\sigma_{kk} = \omega_{kk}^{-1}$, $\beta_k$ stays the same as in (1), and

$$\gamma_k = \Sigma_{-k,-k}^{-1}\Sigma_{-k,k} = -\,\Omega_{-k,k}/\omega_{kk}. \qquad (5)$$

Since $\omega_{kk}$ is always positive, it follows from (5) that $\gamma_{sk} = 0$ if and only if $\omega_{sk} = 0$, where we write $\gamma_k = (\gamma_{sk})_{s \neq k}$ for convenience. Consequently, the sparsity in $B$ can be determined by whether $\beta_{jk} = 0$ or not, and the sparsity in $\Omega$ can be determined by whether $\gamma_{sk} = 0$ or not.
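The identity in (5) can be checked numerically; the sizes and the construction of the covariance matrix below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
q, k = 5, 2  # number of responses and the index singled out (illustrative)

# A random positive definite covariance matrix and its precision matrix
A = rng.standard_normal((q, q))
Sigma = A @ A.T + q * np.eye(q)
Omega = np.linalg.inv(Sigma)

mask = np.arange(q) != k
# gamma_k via the conditional-covariance form Sigma_{-k,-k}^{-1} Sigma_{-k,k}
gamma_sigma = np.linalg.solve(Sigma[np.ix_(mask, mask)], Sigma[mask, k])
# gamma_k via the precision-matrix form -Omega_{-k,k} / omega_kk
gamma_omega = -Omega[mask, k] / Omega[k, k]

assert np.allclose(gamma_sigma, gamma_omega)
```

The agreement of the two expressions is exactly what lets the sparsity of $\Omega$ be read off from the conditional regression coefficients $\gamma_k$.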

To allow joint estimation of the sparse multivariate regression coefficient matrix and the sparse precision matrix, we formulate the model in (4) as a series of penalized conditional regressions of each response against the covariates and the other responses. Specifically, for the $k$-th response,

$$\min_{\beta_k,\gamma_k}\ \|y_k - X\beta_k - (Y_{-k} - XB_{-k})\gamma_k\|_2^2 + \lambda_{1n}\, p_1(\beta_k) + \lambda_{2n}\, p_2(\gamma_k), \qquad (6)$$

where $\|\cdot\|_2$ is the usual Euclidean norm, and $p_1$ and $p_2$ are the adaptive Lasso penalties. When $B_{-k}$ in (6) is replaced by an initial consistent estimate $\hat B^{(0)}_{-k}$, the final formulation for the proposed multivariate conditional regression model is

$$\min_{B,\Gamma}\ \sum_{k=1}^q \|y_k - X\beta_k - (Y_{-k} - X\hat B^{(0)}_{-k})\gamma_k\|_2^2 + \lambda_{1n}\sum_{k=1}^q p_1(\beta_k) + \lambda_{2n}\sum_{k=1}^q p_2(\gamma_k). \qquad (7)$$

The following computing algorithm can be employed to solve (7).

Algorithm 1:

Step 1. Initialize $\hat B^{(0)}$ and the adaptive weights.

Step 2. For $k = 1, \ldots, q$, solve (6) for $\hat\beta_k$ and $\hat\gamma_k$.

As computational remarks, $\hat B^{(0)}$ can be initialized by separate Lasso regressions ignoring the dependency structure. The adaptive weights are set as the reciprocals of the absolute values of consistent initial estimates, as in Zou (2006). Since (6) is a convex optimization problem, its global minimum can be obtained by any available adaptive Lasso regression procedure. Furthermore, the coordinate descent algorithm (Friedman et al., 2007) can be employed to further improve the computational efficiency of solving (6). More importantly, Step 2 fits the adaptive Lasso regression model (6) for each $k$ separately, and thus can be easily parallelized and distributed to multiple computing nodes. Therefore, Algorithm 1 is scalable and can efficiently handle large datasets.
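Step 2 of Algorithm 1 can be sketched with a simple proximal-gradient solver standing in for an off-the-shelf adaptive Lasso routine; the function names, the single tuning parameter `lam`, and the pilot-estimate floor `tau` are illustrative choices, not values from the paper:

```python
import numpy as np

def adaptive_lasso(Z, y, lam, w, n_iter=300):
    """min ||y - Z b||^2 / n + lam * sum_j w_j |b_j| via proximal gradient."""
    n, d = Z.shape
    L = 2.0 * np.linalg.norm(Z.T @ Z, 2) / n  # Lipschitz constant of the gradient
    b = np.zeros(d)
    for _ in range(n_iter):
        b = b + 2.0 * Z.T @ (y - Z @ b) / (n * L)            # gradient step
        b = np.sign(b) * np.maximum(np.abs(b) - lam * w / L, 0.0)  # prox step
    return b

def mcr_step2(X, Y, B0, lam=0.1, tau=1e-3):
    """For each response k, regress y_k on (X, Y_{-k} - X B0_{-k}) with
    adaptive weights 1/|pilot| built from a least-squares pilot fit."""
    n, p = X.shape
    q = Y.shape[1]
    B, Gamma = np.zeros((p, q)), np.zeros((q, q))  # Gamma[s, k] holds gamma_sk
    for k in range(q):
        mask = np.arange(q) != k
        Z = np.hstack([X, Y[:, mask] - X @ B0[:, mask]])
        pilot = np.linalg.lstsq(Z, Y[:, k], rcond=None)[0]
        w = 1.0 / np.maximum(np.abs(pilot), tau)  # adaptive weights as in Zou (2006)
        coef = adaptive_lasso(Z, Y[:, k], lam, w)
        B[:, k], Gamma[mask, k] = coef[:p], coef[p:]
    return B, Gamma
```

Because each pass of the loop over `k` is independent given `B0`, the `q` fits can be distributed across workers exactly as described above.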

When identifying the sparsity in the conditional graphical model defined by $\Omega$, the symmetry of $\Omega$ implies that $\omega_{sk} = \omega_{ks}$, and thus $\gamma_{sk} = 0$ if and only if $\gamma_{ks} = 0$. Consequently, additional refinement is necessary to correct possible inconsistency between $\hat\gamma_{sk}$ and $\hat\gamma_{ks}$. Similarly as in Meinshausen and Bühlmann (2006), one natural way is to set

$$\hat\gamma^{\wedge}_{sk} = 0 \quad \text{if} \quad \hat\gamma_{sk} = 0 \ \wedge\ \hat\gamma_{ks} = 0,$$

or a less conservative way is to set

$$\hat\gamma^{\vee}_{sk} = 0 \quad \text{if} \quad \hat\gamma_{sk} = 0 \ \vee\ \hat\gamma_{ks} = 0.$$

In the numerical experiments, the less conservative rule is used, and the resulting selection performance appears to be satisfactory.
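The two symmetrization rules can be written as a small helper; here `Gamma[s, k]` stores $\hat\gamma_{sk}$ and the function only adjusts the estimated support:

```python
import numpy as np

def symmetrize_support(Gamma, rule="or"):
    """Apply the AND rule (zero only when both gamma_sk and gamma_ks are zero)
    or the less conservative OR rule (zero when either of them is zero)."""
    nz = Gamma != 0
    keep = (nz | nz.T) if rule == "and" else (nz & nz.T)
    return np.where(keep, Gamma, 0.0)
```

Under the OR rule an edge survives only if both directed estimates are nonzero, which is why it yields the sparser graph used in the experiments.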

## 4 Asymptotic properties

This section establishes the asymptotic properties of the proposed multivariate conditional regression model in terms of selection and estimation accuracy. Let $B^*$ be the true regression coefficient matrix, $\Omega^*$ be the inverse of the true covariance matrix $\Sigma^*$, and $\Gamma^*$ be defined as in (5) with $\Omega^*$. The selection accuracy is measured by the sign agreement between the estimates and $(B^*, \Omega^*)$, and the estimation accuracy is quantified by the asymptotic normality of $(\hat\beta_k, \hat\gamma_k)$.

Without loss of generality, we assume that for all ’s, and denote

$$M = \begin{pmatrix} n^{-1}X^TX & 0 \\ 0 & \Sigma^* \end{pmatrix}.$$

Let , , , , , and . Let , , , and . Denote $\Lambda_{\min}(A)$ and $\Lambda_{\max}(A)$ as the minimum and maximum eigenvalues of a matrix $A$. Denote $\lambda_{init}$ and $\lambda_n$ as the tuning parameters used in the initial Lasso regression and (6), respectively. The following technical conditions are assumed.

1. There exists a positive constant such that and . In addition, .

2. For some integers , , and a positive constant ,

$$\frac{1}{K(d, m, k_0, M)} := \min_{J_0 \subset \{1,\ldots,p+q\},\ |J_0| \le d}\ \ \min_{\alpha \neq 0,\ \|\alpha_{J_0^c}\|_1 \le k_0 \|\alpha_{J_0}\|_1} \frac{\|M^{1/2}\alpha\|_2}{\|\alpha_{J_{0m}}\|_2} > 0.$$
3. Let and be defined as below,

Assumption (A1) implies that , , and . Assumption (A2) is similar to the restricted eigenvalue assumption in Bickel et al. (2008) and Zhou et al. (2009). It implies that for any subset $J_0$ with $|J_0| \le d$, we have $\Lambda_{\min}(d) > 0$, where

$$\Lambda_{\min}(d) = \min_{J_0 \subset \{1,\ldots,p+q\},\ |J_0| \le d}\ \ \min_{\alpha \neq 0,\ \alpha_{J_0^c} = 0} \frac{\alpha^T M \alpha}{\alpha_{J_0}^T \alpha_{J_0}}.$$

Assumption (A3) is similar to the conditions in Zhao and Yu (2006) and Meinshausen (2007), and implies that the nonzero entries of $B^*$ and $\Gamma^*$ do not decay too fast and thus are not dominated by the noise terms.

Theorem 1 (Selection consistency). Suppose that conditions (A1)-(A3) are satisfied with and , the initial estimates are set as the solution of the separate Lasso regression, and . Then as $n \to \infty$,

$$P\big(\mathrm{sgn}(\hat B) \neq \mathrm{sgn}(B^*)\ \text{ or }\ \mathrm{sgn}(\hat\Omega) \neq \mathrm{sgn}(\Omega^*)\big) \longrightarrow 0,$$

when , , and .

Theorem 2 (Asymptotic normality). Suppose that the conditions in Theorem 1 are satisfied. Let , where $\alpha$ is any vector with unit length, and is the principal submatrix of defined by . Then

$$n^{1/2} s_k^{-1} \alpha^T \left( \begin{pmatrix} \hat\beta_k \\ \hat\gamma_k \end{pmatrix} - \begin{pmatrix} \beta^*_k \\ \gamma^*_k \end{pmatrix} \right) \stackrel{d}{\longrightarrow} N(0,1) \quad \text{for any } k,$$

when , and .

Theorems 1 and 2 show that with consistent initial estimates, the proposed multivariate conditional regression model achieves both selection consistency and asymptotic normality.

## 5 Numerical experiments

This section examines the effectiveness of the proposed multivariate conditional regression model on a variety of simulated examples and a real application to the Glioblastoma Cancer Dataset (TCGA, 2008). The proposed multivariate conditional regression model with the adaptive Lasso penalty, denoted as aMCR, is compared against the alternating updating algorithm for (3) (ALT; Yin and Li, 2011; Lee and Liu, 2012), and the separate Lasso regression (SEP; estimating each $\beta_k$ and $\gamma_k$ separately).

The comparison is conducted with respect to the estimation and selection accuracy of $B$ and $\Omega$. Specifically, the estimation accuracy of $B$ is measured by the Frobenius norm, the matrix 1-norm, and the matrix $\infty$-norm of $\hat B - B^*$. The estimation accuracy of $\Omega$ is not reported, as the primary interest is the sparsity structure and the proposed method does not produce $\hat\Omega$ directly. The selection accuracy of $B$ and $\Omega$ is measured by the symmetric differences

$$\mathrm{Dist}(\hat A_\beta, A_\beta) = \frac{|\hat A_\beta \setminus A_\beta| + |A_\beta \setminus \hat A_\beta|}{pq};\qquad \mathrm{Dist}(\hat A_\omega, A_\omega) = \frac{|\hat A_\omega \setminus A_\omega| + |A_\omega \setminus \hat A_\omega|}{q^2},$$

where $A_\beta$ and $A_\omega$ are the active sets defined by $B^*$ and $\Omega^*$, and $|\cdot|$ denotes the set cardinality. We also report the specificity (Spe), sensitivity (Sen) and Matthews correlation coefficient (Mcc) scores, defined as

$$\mathrm{Spe} = \frac{TN}{TN + FP},\qquad \mathrm{Sen} = \frac{TP}{TP + FN},\qquad \mathrm{Mcc} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}},$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives in identifying the nonzero elements in or , and “positive” refers to the nonzero entries.
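These scores can be computed directly from the estimated and true supports; the helper below is a straightforward transcription of the definitions, not code from the paper:

```python
import numpy as np

def selection_scores(est, truth):
    """Spe, Sen, and Mcc for support recovery; 'positive' means a nonzero entry."""
    e, t = np.asarray(est) != 0, np.asarray(truth) != 0
    tp = float(np.sum(e & t)); tn = float(np.sum(~e & ~t))
    fp = float(np.sum(e & ~t)); fn = float(np.sum(~e & t))
    spe = tn / (tn + fp)
    sen = tp / (tp + fn)
    mcc = (tp * tn - fp * fn) / np.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return spe, sen, mcc
```

For degenerate supports (all-zero or all-nonzero matrices) some denominators vanish; a production version would guard against that case.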

Furthermore, tuning parameters are used in most penalized log-likelihood formulations to balance model estimation and model complexity. For example, the tuning parameters $\lambda_{1n}$ and $\lambda_{2n}$ in (3) and (6) control the tradeoff between the sparsity and the estimation accuracy of the multivariate regression models. In the numerical experiments, we employed the Bayesian information criterion (BIC; Schwarz, 1978) to select the tuning parameters, which has been shown to perform well in tuning penalized likelihood methods (Wang et al., 2007). The BIC criterion is minimized through a search on a two-dimensional equally-spaced grid. Other data-adaptive model selection criteria, such as cross validation, can be employed as well (Lee and Liu, 2012).
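A BIC-driven grid search can be sketched as below; the criterion $n\log(\mathrm{RSS}/n) + df\log n$, with $df$ equal to the number of nonzero coefficients, is one common choice for Lasso-type fits and not necessarily the exact criterion used in the paper:

```python
import numpy as np

def bic(y, yhat, df):
    """BIC-style score: n * log(RSS / n) + df * log(n)."""
    n = len(y)
    rss = float(np.sum((y - yhat) ** 2))
    return n * np.log(rss / n) + df * np.log(n)

def grid_search(fit, lam1_grid, lam2_grid):
    """Minimize BIC over a two-dimensional grid; `fit(lam1, lam2)` is assumed
    to return (y, yhat, df) for the model fitted at those tuning parameters."""
    best_score, best = np.inf, None
    for l1 in lam1_grid:
        for l2 in lam2_grid:
            y, yhat, df = fit(l1, l2)
            s = bic(y, yhat, df)
            if s < best_score:
                best_score, best = s, (l1, l2)
    return best
```

The `fit` callback is a hypothetical interface; in practice it would refit (6) at each grid point and report residuals and degrees of freedom.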

### 5.1 Simulated examples

The simulated examples follow the same setup as in Li and Gui (2006), Fan et al. (2009), Peng et al. (2009) and Yin and Li (2011). First, each entry of the precision matrix is generated from the product of a Bernoulli random variable with success rate proportional to and a uniform random variable on . For each row, all off-diagonal entries are divided by 3/2 times the sum of their absolute values. The final precision matrix is obtained by symmetrizing the generated matrix and setting the diagonal entries to 1. Next, each entry of the coefficient matrix is generated from the product of a Bernoulli random variable with success rate proportional to and a uniform random variable on , where is the minimum absolute value of the nonzero entries in . Finally, with the generated $B^*$ and $\Omega^*$, each entry of the covariate matrix is generated independently from the standard Gaussian distribution, and the response vector is generated from model (1).
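The precision-matrix recipe above can be sketched as follows; the Bernoulli success rate and the Uniform support below are illustrative stand-ins for the original values:

```python
import numpy as np

def make_precision(q, prob=0.1, rng=None):
    """Sparse precision matrix: Bernoulli x Uniform entries, off-diagonal rows
    rescaled by 3/2 times their absolute row sum, symmetrized, unit diagonal.
    The rescaling keeps off-diagonal absolute row sums below 2/3, so the
    result is strictly diagonally dominant and hence positive definite."""
    rng = rng or np.random.default_rng(0)
    A = rng.binomial(1, prob, (q, q)) * rng.uniform(0.5, 1.0, (q, q))
    np.fill_diagonal(A, 0.0)
    row = np.abs(A).sum(axis=1, keepdims=True)
    A = A / np.maximum(1.5 * row, 1.0)  # the max() guards against empty rows
    Omega = (A + A.T) / 2.0
    np.fill_diagonal(Omega, 1.0)
    return Omega
```

The diagonal-dominance argument in the docstring is why this construction always yields a valid (positive definite) precision matrix, regardless of the realized sparsity pattern.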

Six models are considered, and for each given model, a training sample of observations is generated.

Model 1: , where and ;

Model 2: , where and ;

Model 3: , where and ;

Model 4: , where and ;

Model 5: , where and ;

Model 6: , where and .

Each model is replicated 50 times, and the averaged performance measures as well as the estimated standard errors are reported in Tables 1 and 2.

 Tables 1 and 2 about here

It is evident that the proposed aMCR delivers superior numerical performance, in terms of both estimation and selection accuracy of $B$ and $\Omega$, against the other competitors across all six simulated examples. In Tables 1 and 2, we only report the numerical performance of ALT on Models 2 and 3, due to the computational burden of running ALT on the other examples with larger dimensions. Although the performance of ALT might be improved if some random-start algorithm were employed to partially overcome the issue of local minima, the inefficient alternating algorithm remains a major obstacle to applying ALT to high-dimensional datasets.

In Tables 1 and 2, the advantage of aMCR and ALT over SEP demonstrates that the inclusion of the dependency structure in (3) and (6) is indeed helpful in identifying the sparsity in $B$ and $\Omega$, and thus in estimating $B$. As for the selection accuracy, aMCR yields higher Spe and Mcc but lower Sen in most examples. This is due to the fact that aMCR tends to produce sparser models than SEP when the correlations among the responses are positive (Lee and Liu, 2012). Although sparser models are produced, aMCR still yields a smaller symmetric difference than SEP. As for the estimation accuracy of $B$, it is clear that aMCR outperforms SEP under all three metrics. This implies that the proposed multivariate conditional regression model can improve not only the accuracy of identifying nonzero entries in the precision matrix, but also the accuracy of estimating the multivariate regression coefficient matrix.

### 5.2 Real application

In this section, we apply the proposed multivariate conditional regression model to a Glioblastoma multiforme (GBM) cancer dataset studied by the Cancer Genome Atlas (TCGA) Research Network (TCGA, 2008; Verhaak et al., 2010). GBM is the most common and most aggressive malignant primary brain tumor in adults. The original dataset collected by TCGA consists of 202 samples, 11861 gene expression values and 534 microRNA expression values. One primary goal of the study is to regress the microRNA expressions on the gene expressions and model how the microRNAs regulate the gene expressions. It is also of interest to construct the underlying network among the microRNAs. The proposed model can achieve these two goals simultaneously, where the sparse coefficient matrix reveals the regulatory relationship among the microRNA and gene expressions and the sparse precision matrix can be interpreted as the dependency structure among the microRNAs.

For illustration, some preliminary data cleaning is conducted by removing missing values and prescreening the less expressed genes and microRNAs as in TCGA et al. (2010) and Lee and Liu (2012). In particular, 6 samples with missing values are removed, leaving 196 complete samples in the dataset. Furthermore, the genes and microRNAs are sorted by their median absolute deviation (MAD), and the top 500 genes and top 20 microRNAs with the largest MADs are selected.

The dataset is randomly split into a training set with 120 samples and a test set with 76 samples. On the training set, each method is fitted to estimate the multivariate regression coefficient matrix and the precision matrix. Since the truth is unknown in the real application, the estimation performance is measured by the predictive square error (Pse) estimated on the test set, defined as

$$\mathrm{Pse} = |\text{test set}|^{-1} \sum_{i \in \text{test set}} \|Y_i - \hat Y_i\|_F^2,$$

where $|\text{test set}|$ denotes the cardinality of the test set. In addition, the number of genes selected by each method is also reported.
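The Pse criterion is a plain average of squared prediction errors over the test rows:

```python
import numpy as np

def pse(Y_test, Y_hat):
    """Predictive squared error: |test set|^{-1} * sum_i ||Y_i - Yhat_i||^2."""
    Y_test, Y_hat = np.asarray(Y_test), np.asarray(Y_hat)
    return float(np.mean(np.sum((Y_test - Y_hat) ** 2, axis=1)))
```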

The averaged Pse and numbers of selected genes as well as their estimated standard errors based on 50 replications are reported in Table 3.

 Table 3 about here

Clearly, the proposed aMCR yields a sparser multivariate regression model and achieves a smaller Pse than the separate regression model. This agrees with the conclusion in Lee and Liu (2012); the sparser regression model is due to the fact that the joint estimation method obtains more shrinkage when strong positive correlations are present among the selected microRNAs. Again, the numerical performance of ALT is not reported due to the computational burden, but we note that in Lee and Liu (2012), the Pse of ALT is 1.23 (.032) and the number of selected genes is 78.0 (32.15) based on a slightly smaller dataset.

Figure 1 displays the estimated conditional dependency structure among the microRNAs based on the estimated precision matrix of the microRNAs. Compared with the results in Lee and Liu (2012), the graphical structure in Figure 1 captures the strong positive correlations among the selected microRNA pairs, including the tuple of hsa.mir.136, hsa.mir.376a and hsa.mir.377. More importantly, it produces a sparser dependency structure than that in Lee and Liu (2012), and rules out more microRNA pairs with weak correlations, such as hsa.mir.bart19 and hsa.mir.124a (with pairwise correlation ).

 Figure 1 about here

## 6 Summary

This article proposes a joint estimation method for the multivariate regression model and the dependency structure among the multiple responses. As opposed to the existing methods that maximize a penalized joint log-likelihood function, the proposed method is formulated through a penalized conditional log-likelihood function, leading to efficient computation and superior numerical performance. Its asymptotic estimation and selection consistencies are established for diverging dimension of covariates and number of responses. Finally, it is worth pointing out that the penalized conditional log-likelihood formulation can be extended to a general framework without the Gaussian distributional assumption, such as in Finegold and Drton (2011) and Lee et al. (2012).

## Acknowledgment

The author would like to thank Wonyul Lee and Yufeng Liu (University of North Carolina at Chapel Hill) for sharing their code on the alternating updating algorithm and the Glioblastoma multiforme cancer dataset.

## Appendix

Proof of Theorem 1: We first establish upper bounds for and , where and are the solution of

$$\min_{\beta_k, \gamma_k}\ \|y_k - X\beta_k - \hat y_{-k}\gamma_k\|^2 + \lambda_n \Big( \sum_{j=1}^p u_{jk}|\beta_{jk}| + \sum_{s \neq k} v_{sk}|\gamma_{sk}| \Big), \qquad (8)$$

where $\hat y_{-k} = Y_{-k} - X\hat B^{(0)}_{-k}$ is a surrogate of $Y_{-k} - XB^*_{-k}$. Based on the model assumption (4), we have

$$y_k = X\beta^*_k + \hat y_{-k}\gamma^*_k + \xi_k + \epsilon_k, \qquad (9)$$

where $\xi_k$ accounts for the surrogate error and $\epsilon_k$ is the Gaussian noise. Furthermore, let $\zeta$ be the augmented coefficient vector, $Z = (X, \hat y_{-k})$ be the augmented covariate matrix, and $r_j$ the corresponding adaptive weights; then the model (8) can be simplified as

$$\min_{\zeta}\ \|y_k - Z\zeta\|^2 + \lambda_n \sum_{j=1}^{p+q-1} r_j |\zeta_j|. \qquad (10)$$

We now verify conditions (11) and (12) in the Lemma at the end of the Appendix. First, for simplicity, let

$$T = \Big\{ \max_{j,s}\ n^{-1}(X_j)^T e_s \le a_1^2 \big(8 n^{-1}\log(p+q)\big)^{1/2} \Big\},$$

and it follows from the proof of Lemma 9.1 in Zhou et al. (2009) that . Also let , be the submatrix of without the -th row and column, , and

$$\mathcal{Y}_k = \Big\{ \max_{j,s} |\Delta_{js}| \le 8 n^{-1/2} \big(\log(p+q)\big)^{1/2} \Big\}.$$

It then follows from Lemma 9.3 in Zhou et al. (2009) that . Additionally, since is asymptotically larger than , there exists a constant $c_1$ such that on the set ,

$$\Lambda_{\min}\big(n^{-1}\tilde Z_{A_k}^T \tilde Z_{A_k}\big) \ge 2 c_1 \Lambda_{\min}(d),$$

for any subset with . Furthermore, let and , then

$$\big|\Lambda_{\min}(n^{-1}Z_A^T Z_A) - \Lambda_{\min}(n^{-1}\tilde Z_A^T \tilde Z_A)\big| \le \|n^{-1}Z_A^T Z_A - n^{-1}\tilde Z_A^T \tilde Z_A\|_2 \le \|n^{-1}Z_A^T Z_A - n^{-1}\tilde Z_A^T \tilde Z_A\|_\infty \le \|n^{-1}\hat y_{A_2}^T X_{A_1}\|_\infty + \|n^{-1}X_{A_1}^T \hat y_{A_2}\|_\infty + \|n^{-1}\hat y_{A_2}^T \hat y_{A_2} - n^{-1}e_{A_2}^T e_{A_2}\|_\infty,$$

where $\|A\|_2$ is the operator norm of a matrix $A$. Since $\hat y_{-k}$ is constructed from the initial Lasso estimate, we have on the set $T$,

$$\max\big( \|n^{-1}\hat y_{A_2}^T X_{A_1}\|_\infty,\ \|n^{-1}X_{A_1}^T \hat y_{A_2}\|_\infty \big) \le O(n^{-1}\lambda_{init}\, d).$$

Also, conditional on the set , it follows from Assumption (A1) that

$$\|n^{-1}\hat y_{A_2}^T \hat y_{A_2} - n^{-1}e_{A_2}^T e_{A_2}\|_\infty \le O\big(n^{-1} d \lambda_{init}\, n^{-1/2}(\log(p+q))^{1/2}\big) + O\big(n^{-2} d \lambda_{init}^2\big).$$

Therefore, on the set ,

$$\Lambda_{\min}\big(n^{-1}Z_{A_k}^T Z_{A_k}\big) \ge \Lambda_{\min}\big(n^{-1}\tilde Z_{A_k}^T \tilde Z_{A_k}\big) - \big|\Lambda_{\min}(n^{-1}Z_A^T Z_A) - \Lambda_{\min}(n^{-1}\tilde Z_A^T \tilde Z_A)\big| \ge c_1 \Lambda_{\min}(d),$$

for sufficiently large $n$, since the cardinality of is bounded by and is asymptotically larger than .

Next, it follows from Bickel et al. (2008) that under Assumption (A2) with and ,

$$\delta_{A_k} := \max_{j \in A_k} |\tilde\zeta_j - \zeta^*_j| \le 4 K(d,d,3,M)^2\, n^{-1} d^{1/2} \lambda_{init};\qquad \delta_{A_k^c} := \max_{j \in A_k^c} |\tilde\zeta_j - \zeta^*_j| \le 16 K(d,d,3,M)^2\, n^{-1} d^{1/2} \lambda_{init},$$

on the corresponding set, where $\tilde\zeta$ is the solution of the initial Lasso regression. Therefore,

$$\frac{r_{\min}(A_k^c)}{r_{\max}(A_k)} = \frac{\min_{j \in A_k} |\tilde\zeta_j|}{\max_{j \in A_k^c} |\tilde\zeta_j|} \ge \frac{\zeta^*_{\min} - \delta_{A_k}}{\delta_{A_k^c}},$$

where $\zeta^*_{\min}$ denotes the minimum absolute value of the nonzero entries of $\zeta^*$. Furthermore, it follows from Lemma 10.3 of Zhou et al. (2009) that on this set, there exists a positive constant such that . Therefore, when $n$ is sufficiently large,

$$\|Z_{A_k^c}^T Z_{A_k} (Z_{A_k}^T Z_{A_k})^{-1}\|_\infty \le \frac{r_{\min}(A_k^c)}{r_{\max}(A_k)} (1 - \eta),$$

for some $\eta \in (0, 1)$, provided that .

Finally, it follows from the Lemma at the end of the Appendix that for each $k$,

$$P\big(\mathrm{sgn}(\hat\zeta) \neq \mathrm{sgn}(\zeta^*)\big) = O\big((p+q)^{-2}\big),$$

provided that and . Consequently, , which implies the desired result immediately.

Lemma. Consider the linear model in (10), where the design matrix $Z$ satisfies

$$\Lambda_{\min}\big(n^{-1}Z_{A_k}^T Z_{A_k}\big) \ge c_1 \Lambda_{\min}(d) > 0, \qquad (11)$$

$$\|Z_{A_k^c}^T Z_{A_k} (Z_{A_k}^T Z_{A_k})^{-1}\|_\infty \le \frac{r_{\min}(A_k^c)}{r_{\max}(A_k)} (1 - \eta). \qquad (12)$$