Variable selection in multivariate linear models with highdimensional covariance matrix estimation
Abstract.
In this paper, we propose a novel variable selection approach in the framework of multivariate linear models taking into account the dependence that may exist between the responses. It consists in estimating beforehand the covariance matrix of the responses and to plug this estimator in a Lasso criterion, in order to obtain a sparse estimator of the coefficient matrix. The properties of our approach are investigated both from a theoretical and a numerical point of view. More precisely, we give general conditions that the estimators of the covariance matrix and its inverse have to satisfy in order to recover the positions of the null and non null entries of the coefficient matrix when the size of is not fixed and can tend to infinity. We prove that these conditions are satisfied in the particular case of some Toeplitz matrices. Our approach is implemented in the R package MultiVarSel available from the Comprehensive R Archive Network (CRAN) and is very attractive since it benefits from a low computational load. We also assess the performance of our methodology using synthetic data and compare it with alternative approaches. Our numerical experiments show that including the estimation of the covariance matrix in the Lasso criterion dramatically improves the variable selection performance in many cases.
1. Introduction
The multivariate linear model consists in generalizing the classical linear model, in which a single response is explained by variables, to the case where the number of responses is larger than 1. Such a general modeling can be used in a wide variety of applications ranging from econometrics (lutkepohl:2005) to bioinformatics (Meng:2014). In the latter field, for instance, multivariate models have been used to gain insight into complex biological mechanisms like metabolism or gene regulation. This has been made possible thanks to recently developed sequencing technologies. For further details, we refer the reader to mehmood:liland:2012. However, the downside of such a technological expansion is to include irrelevant variables in the statistical models. To circumvent this, devising efficient variable selection approaches in the multivariate setting has become a growing concern.
A first naive approach to deal with the variable selection issue in the multivariate setting consists in applying classical univariate variable selection strategies to each response separately. Some wellknown variable selection methods include the least absolute shrinkage and selection operator (LASSO) proposed by Tib96 and the smoothly clipped absolute deviation (SCAD) approach devised by fan:li:2001. However, such a strategy does not take into account the dependence that may exist between the different responses.
In this paper, we shall consider the following multivariate linear model:
(1) 
where denotes the random response matrix, denotes the design matrix, denotes a coefficient matrix and denotes the random error matrix, where is the sample size. In order to model the potential dependence that may exist between the columns of , we shall assume that for each in ,
(2) 
where denotes the covariance matrix of the th row of the error matrix . We shall moreover assume that the different rows of are independent. With such assumptions, there is some dependence between the columns of but not between the rows. Our goal is here to design a variable selection approach which is able to identify the positions of the null and non null entries in the sparse matrix by taking into account the dependence between the columns of .
This issue has recently been considered by LeeLiu2012 who extended the approach of rothman:2010. More precisely, LeeLiu2012 proposed three approaches for dealing with this issue based on penalized maximum likelihood with a weighted regularization. In their first approach is estimated by using a plugin estimator of , in the second one, is estimated by using a plugin estimator of and in the third one, and are estimated simultaneously. LeeLiu2012 also investigate the asymptotic properties of their methods when the sample size tends to infinity and the number of rows and columns of is fixed.
In this paper, we propose to estimate beforehand and to plug this estimator in a Lasso criterion, in order to obtain a sparse estimator of . Hence, our methodology is close to the first approach of LeeLiu2012. However, there are two main differences: The first one is the asymptotic framework in which our theoretical results are established and the second one is the strategy that we use for estimating . More precisely, in our asymptotic framework, is allowed to depend on and thus to tend to infinity as tends to infinity at a polynomial rate. Moreover, in LeeLiu2012, is estimated by using an adaptation of the Graphical Lasso (GLASSO) proposed by friedman:hastie:tibshirani:2008. This technique has also been considered by yuan:li:2007, banerjee:2008 and rothman:2008. In this paper, we give general conditions that the estimators of and have to satisfy in order to be able to recover the support of that is to find the positions of the null and non null entries of the matrix . We prove that when is a particular Toeplitz matrix, namely the covariance matrix of an AR(1) process, the assumptions of the theorem are satisfied.
Let us now describe more precisely our methodology. We start by “whitening” the observations by applying the following transformation to Model (1):
(3) 
The goal of such a transformation is to remove the dependence between the columns of . Then, for estimating , we proceed as follows. Let us observe that (3) can be rewritten as:
(4) 
with
(5) 
where denotes the vectorization operator and the Kronecker product.
With Model (4), estimating is equivalent to estimate since . Then, for estimating , we use the classical LASSO criterion defined as follows for a nonnegative :
(6) 
where and denote the classical norm and norm, respectively. Inspired by zhao:yu:2006, Theorem 1 established some conditions under which the positions of the null and non null entries of can be recovered by using .
In practical situations, the covariance matrix is generally unknown and has thus to be estimated. Let denote an estimator of . Then, the estimator of is such that
When is replaced by , (3) becomes
(7) 
which can be rewritten as follows:
(8) 
where
(9) 
In Model (8), is estimated by
(10) 
By extending Theorem 1, Theorem 5 gives some conditions on the eigenvalues of and on the convergence rate of and its inverse to and , respectively, under which the positions of the null and non null entries of can be recovered by using .
We prove in Section 2.3 that when is a particular Toeplitz matrix, namely the covariance matrix of an AR(1) process, the assumptions of Theorem 5 are satisfied. This strategy has been implemented in the R package MultiVarSel, which is available on the Comprehensive R Archive Network (CRAN), for more general Toeplitz matrices such as the covariance matrix of ARMA processes or general stationary processes. For a successful application of this methodology to particular “omic” data, namely metabolomic data, we refer the reader to nous:bioinfo:arxiv:2017. For a review of the most recent methods for estimating highdimensional covariance matrices, we refer the reader to HDCovEst.
The paper is organized as follows. Section 2 is devoted to the theoretical results of the paper. The assumptions under which the positions of the non null and null entries of can be recovered are established in Theorem 1 when is known and in Theorem 5 when is unknown. Section 2.3 studies the specific case of the AR(1) model. We present in Section 3 some numerical experiments in order to support our theoretical results. The proofs of our main theoretical results are given in Section 4.
2. Theoretical results
2.1. Case where is known
Let us first introduce some notations. Let
Let also define
(12) 
where and denote the columns of belonging to the set defined in (11) and to its complement , respectively.
More generally, for any matrix , denotes the partitioned matrix extracted from by considering the rows of belonging to the set and the columns of belonging to the set , with indicating all the rows or all the columns.
The following theorem gives some conditions under which the estimator defined in (6) is signconsistent as defined by zhao:yu:2006, namely,
where the function maps positive entries to 1, negative entries to 1 and zero to 0.
Theorem 1.
Assume that satisfies Model (4). Assume also that there exist some positive constants , , and positive numbers , such that satisfying:

for all , for all , , where is the th column of defined in (5),

for all , , where denotes the smallest eigenvalue of ,

, where is defined in (11) and is the cardinality of the set ,

.
Assume also that the following strong Irrepresentable Condition holds:

There exists a positive constant vector such that
where is a vector of 1 and the inequality holds elementwise.
Then, for all that satisfies
we have
where is defined by (6).
Remark 1.
Observe that if , for some positive , then the first condition of 1 becomes . Hence for large values of , the size of is much larger than .
The proof of Theorem 1 is given in Section 4. It is based on Proposition 2 which is an adaptation to the multivariate case of Proposition 1 in zhao:yu:2006.
Proposition 2.
We give in the following proposition which is proved in Section 4 some conditions on and under which Assumptions 1 and 2 of Theorem 1 hold.
Proposition 3.
2.2. Case where is unknown
Similarly as in (11) and (12), we introduce the following notations:
(15) 
and
(16) 
where and denote the columns of belonging to the set defined in (11) and to its complement , respectively.
Proposition 4.
The following theorem extends Theorem 1 to the case where is unknown and gives some conditions under which the estimator defined in (10) is signconsistent. The proof of Theorem 5 is given in Section 4 and is based on Proposition 4.
Theorem 5.
Assume that Assumptions 1, 2, 3, 4, 1 and 1 of Theorem 1 hold. Assume also that, there exist some positive constants , , and , such that for all ,

,

,

,

.
Suppose also that

, as tends to infinity,

, as tends to infinity.
Let be defined by (10), then
In the previous assumptions, , , and denote the largest eigenvalue, the smallest eigenvalue, the spectral radius and the infinite norm (induced by the associated vector norm) of the matrix .
Remark 3.
In order to estimate , we propose the following strategy:

Fitting a classical linear model to each column of the matrix in order to have access to an estimation of the random error matrix . It is possible since is assumed to be fixed and smaller than .

Estimating from by assuming that has a particular structure, Toeplitz for instance.
More precisely, defined in the first step is such that:
(19) 
which implies that
(20) 
where is defined in (5).
2.3. The AR(1) case
2.3.1. Sufficient conditions for Assumption 1 of Theorem 1
The following proposition gives some conditions under which the strong Irrepresentable Condition 1 of Theorem 1 holds.
Proposition 6.
Assume that , , …, in Model (1) are independent AR(1) processes satisfying:
where the ’s are zeromean i.i.d. Gaussian random variables with variance and . Assume also that defined in (1) is such that , where is a positive constant. Moreover, suppose that if , then and . Suppose also that for all , or is not in . Then, the strong Irrepresentable Condition 1 of Theorem 1 holds.
2.3.2. Sufficient conditions for Assumptions 7, 8, 9 and 10 of Theorem 5
The following proposition establishes that in the particular case where the , , …, are independent AR(1) processes, our strategy for estimating provides an estimator satisfying the assumptions of Theorem 5.
Proposition 7.
Lemma 8.
3. Numerical experiments
The goal of this section is twofold: to provide sanity checks for our theoretical results in a wellcontrolled framework; and to investigate the robustness of our estimator to some violations of the assumptions of our theoretical results. The latter may reveal a broader scope of applicability for our method than the one guaranteed by the theoretical results.
We investigate in the AR(1) framework presented in Section 2.3. Indeed, all assumptions made in Theorems 1 and 5 can be specified with wellcontrollable simulation parameters in the AR(1) case with balanced design matrix .
Point aims to explore the limitations of our theoretical framework and assess its robustness. To this end, we propose two numerical studies relaxing some of the assumptions of our theorems: first, we study the effect of an unbalanced design – which violates the sufficient condition of the irrepresentability condition (IC) given in Proposition 6 – on the signconsistency; and second, we study the effect of other types of dependence than an AR(1).
In all experiments, the performance are assessed in terms of signconsistency. In other words, we evaluate the probability for the sign of various estimators to be equal to . We compare the performance of three different estimators:

the LASSO criterion applied to the raw data, which we call rawlasso hereafter. Its theoretical properties are established only in the univariate case in alquier:doukhan:2011.
3.1. AR(1) dependence structure with balanced oneway ANOVA
In this section, we consider Model (1) where is the design matrix of a oneway ANOVA with two balanced groups. Each row of the random error matrix is distributed as a centered Gaussian random vector as in Equation (2) where the matrix is the covariance matrix of an AR(1) process defined in Section 2.3.
In this setting, Assumptions 1, 2 and Condition 1 of Theorem 1 are satisfied, see Propositions 3 and 6. The three remaining assumptions 3, 4 and 1 are related to more practical quantities: 3 controls the sparsity level of the problem, involving ; 4 basically controls the signaltonoise ratio, involving and 1 links the sample size , and the two constants , , so that an appropriate range of penalty exists for having a large probability of support recovery. This latter assumption is used in our experiments to tune the difficulty of the support recovery as follows: we consider different values of , , , and we choose a sparsity level and a minimal magnitude in such that Assumptions 3 and 4 are fulfilled. Hence, the problem difficulty is essentially driven by the validity of Assumption 1 where with , and so by the relationship between , and .
We consider a large range of sample sizes varying from to and three different values for in . The constants , are chosen such that with and in . Additional values of and have also been considered and the corresponding results are available upon request. Finally, we consider two values for the parameter appearing in the definition of the AR(1) process: .
Note that in this AR(1) setting with the estimator of defined in (21) , all the assumptions of Theorem 5 are fulfilled, see Proposition 7.
The frequencies of support recovery for the three estimators averaged over 1000 replications is displayed in Figure 1.
We observe from Figure 1 that whitenedlasso and oracle have similar performance since is well estimated. These two approaches always exhibit better performance than rawlasso, especially when . In this case, the sample size required to reach the same performance is indeed ten time larger for rawlasso than for oracle and whitenedlasso.
Finally, the performance of all estimators are altered when is too small, especially in situations where the signal to noise ratio (SNR) is small and the signal is not sparse enough, these two characteristics corresponding to small values of .
3.2. Robustness to unbalanced designs and correlated features
The goal of this section is to study some particular design matrices in Model (1) that may lead to violation of the Irrepresentability Condition 1.
To this end, we consider the multivariate linear model (1) with the same AR(1) dependence as the one considered in Section 3.1. Then, two different matrices are considered: First, an oneway ANOVA model with two unbalanced groups with respective sizes and such that ; and second, a multiple regression model with correlated Gaussian predictors such that the rows of are i.i.d. .
For the oneway ANOVA, violation of 1 may occur when is too different from 1/2, as stated in Proposition 6. For the regression model, we choose for a matrix () such that , , when . The other simulation parameters are fixed as in Section 3.1.
We report in Figure 2 the results for the case where and both for unbalanced oneway ANOVA (top panels) and regression with correlated predictors (bottom panels). For the oneway ANOVA, varies in . For the regression case, varies in . In both cases, the gray lines correspond to the ideal situation (that is, either unbalanced or uncorrelated) denoted Ideal in the legend of Figure 2. The probability of support recovery is estimated over 1000 runs.
From this figure, we note that correlated features or unbalanced designs deteriorate the support recovery of all estimators. This was expected for these LASSObased methods which all suffer from the violation of the irrepresentability condition 1. However, we also note that whitenedlasso and oracle have similar performance, which means that the estimation of is not altered, and that whitening always improves the support recovery.
3.3. Robustness to more general autoregressive processes
In this section, we consider the case where is the design matrix of a oneway ANOVA with two balanced groups and where is the covariance matrix of an AR() process with in . Figure 3 displays the performance of the different estimators when . Here, for computing in whitenedlasso, the parameters of the AR() process are estimated as follows. They are obtained by averaging over the rows of defined in (19) the estimations obtained for the th row of by using standard estimation approaches for AR processes described in Brockwell:1990. As previously, we observe from this figure that whitenedlasso and oracle have better performance than rawlasso.
4. Proofs
Proof of Proposition 2.
For a fixed nonnegative , by (6),
Denoting , we get
Thus,
where
Since the first derivative of with respect to is equal to
satisfies
and
Note that, if , then and .
Proof of Theorem 1.
By Proposition 2,
where and are defined in (13) and (14). It is thus enough to prove that and tend to zero as tends to infinity.
By definition of ,
(27) 
where
and
By definition of and 1,
(28) 
where
Note that, for all in ,
Moreover,
where denotes the largest eigenvalue of the matrix . Observe that
(29) 
by Assumption 2 of Theorem 1. Thus, for all in ,
(30) 
By Assumption 4 of Theorem 1, we get thus that for all in ,
(31) 
Thus,
(32) 
Since is a centered Gaussian random vector having a covariance matrix equal to identity, is a centered Gaussian random vector with a covariance matrix equal to:
Hence, by (29), we get that for all in ,