
Variable selection in multivariate linear models with high-dimensional covariance matrix estimation

M. Perrot-Dockès UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, 75005, Paris, France marie.perrot-dockes@agroparistech.fr C. Lévy-Leduc UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, 75005, Paris, France celine.levy-leduc@agroparistech.fr L. Sansonnet UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, 75005, Paris, France laure.sansonnet@agroparistech.fr  and  J. Chiquet UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, 75005, Paris, France julien.chiquet@agroparistech.fr
June 30, 2019
Abstract.

In this paper, we propose a novel variable selection approach in the framework of multivariate linear models taking into account the dependence that may exist between the responses. It consists in estimating beforehand the covariance matrix of the responses and in plugging this estimator into a Lasso criterion, in order to obtain a sparse estimator of the coefficient matrix. The properties of our approach are investigated both from a theoretical and a numerical point of view. More precisely, we give general conditions that the estimators of the covariance matrix and its inverse have to satisfy in order to recover the positions of the null and non null entries of the coefficient matrix when the number q of responses is not fixed and can tend to infinity. We prove that these conditions are satisfied in the particular case of some Toeplitz matrices. Our approach is implemented in the R package MultiVarSel, available from the Comprehensive R Archive Network (CRAN), and is very attractive since it benefits from a low computational load. We also assess the performance of our methodology using synthetic data and compare it with alternative approaches. Our numerical experiments show that including the estimation of the covariance matrix in the Lasso criterion dramatically improves the variable selection performance in many cases.

1. Introduction

The multivariate linear model generalizes the classical linear model, in which a single response is explained by p variables, to the case where the number of responses is larger than 1. Such a general modeling can be used in a wide variety of applications ranging from econometrics (lutkepohl:2005) to bioinformatics (Meng:2014). In the latter field, for instance, multivariate models have been used to gain insight into complex biological mechanisms such as metabolism or gene regulation. This has been made possible thanks to recently developed sequencing technologies. For further details, we refer the reader to mehmood:liland:2012. However, the downside of such a technological expansion is the inclusion of irrelevant variables in the statistical models. To circumvent this, devising efficient variable selection approaches in the multivariate setting has become a growing concern.

A first naive approach to deal with the variable selection issue in the multivariate setting consists in applying classical univariate variable selection strategies to each response separately. Some well-known variable selection methods include the least absolute shrinkage and selection operator (LASSO) proposed by Tib96 and the smoothly clipped absolute deviation (SCAD) approach devised by fan:li:2001. However, such a strategy does not take into account the dependence that may exist between the different responses.

In this paper, we shall consider the following multivariate linear model:

(1)    Y = XB + E,

where Y denotes the n × q random response matrix, X denotes the n × p design matrix, B denotes the p × q coefficient matrix and E denotes the n × q random error matrix, n being the sample size. In order to model the potential dependence that may exist between the columns of E, we shall assume that, for each i in {1, …, n},

(2)    E_i ∼ N(0, Σ),

where Σ denotes the covariance matrix of the i-th row E_i of the error matrix E. We shall moreover assume that the different rows of E are independent. With such assumptions, there is some dependence between the columns of E but not between its rows. Our goal here is to design a variable selection approach which is able to identify the positions of the null and non null entries in the sparse matrix B by taking into account the dependence between the columns of E.

This issue has recently been considered by Lee-Liu-2012, who extended the approach of rothman:2010. More precisely, Lee-Liu-2012 proposed three approaches for dealing with this issue based on penalized maximum likelihood with a weighted regularization. In their first approach, B is estimated by using a plug-in estimator of Σ⁻¹; in the second one, Σ⁻¹ is estimated by using a plug-in estimator of B; and in the third one, B and Σ⁻¹ are estimated simultaneously. Lee-Liu-2012 also investigate the asymptotic properties of their methods when the sample size n tends to infinity and the number of rows and columns of B is fixed.

In this paper, we propose to estimate Σ beforehand and to plug this estimator into a Lasso criterion, in order to obtain a sparse estimator of B. Hence, our methodology is close to the first approach of Lee-Liu-2012. However, there are two main differences: the first one is the asymptotic framework in which our theoretical results are established and the second one is the strategy that we use for estimating Σ. More precisely, in our asymptotic framework, q is allowed to depend on n and thus to tend to infinity as n tends to infinity at a polynomial rate. Moreover, in Lee-Liu-2012, Σ⁻¹ is estimated by using an adaptation of the Graphical Lasso (GLASSO) proposed by friedman:hastie:tibshirani:2008. This technique has also been considered by yuan:li:2007, banerjee:2008 and rothman:2008. In this paper, we give general conditions that the estimators of Σ and Σ⁻¹ have to satisfy in order to be able to recover the support of B, that is, to find the positions of the null and non null entries of the matrix B. We prove that when Σ is a particular Toeplitz matrix, namely the covariance matrix of an AR(1) process, the assumptions of the theorem are satisfied.

Let us now describe more precisely our methodology. We start by “whitening” the observations by applying the following transformation to Model (1):

(3)    Y Σ^{-1/2} = X B Σ^{-1/2} + E Σ^{-1/2}.

The goal of such a transformation is to remove the dependence between the columns of the error matrix. Then, for estimating B, we proceed as follows. Let us observe that (3) can be rewritten as:

(4)    𝒴 = 𝒳 ℬ + ℰ,

with

(5)    𝒴 = vec(Y Σ^{-1/2}),   𝒳 = Σ^{-1/2} ⊗ X,   ℬ = vec(B),   ℰ = vec(E Σ^{-1/2}),

where vec denotes the vectorization operator and ⊗ the Kronecker product.

With Model (4), estimating B is equivalent to estimating ℬ since ℬ = vec(B). Then, for estimating ℬ, we use the classical LASSO criterion defined as follows for a nonnegative λ:

(6)    ℬ̂(λ) = argmin_ℬ { ‖𝒴 − 𝒳ℬ‖₂² + λ‖ℬ‖₁ },

where ‖·‖₂ and ‖·‖₁ denote the classical ℓ₂-norm and ℓ₁-norm, respectively. Inspired by zhao:yu:2006, Theorem 1 establishes some conditions under which the positions of the null and non null entries of ℬ can be recovered by using ℬ̂(λ).
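To fix ideas, the whitening-and-vectorization strategy of (3)-(6) can be sketched in a few lines of R. The sketch below is illustrative only: the dimensions, the parameter values and the use of the glmnet package for the Lasso step are our own choices and do not correspond to the MultiVarSel implementation (note also that glmnet rescales the residual sum of squares by the number of observations, so its lambda is on a different scale than the one in (6)).

## Minimal sketch of Equations (3)-(6) with a known covariance matrix Sigma.
## Dimensions and parameter values are illustrative only.
library(glmnet)

set.seed(1)
n <- 30; p <- 5; q <- 10; phi <- 0.7
X <- matrix(rnorm(n * p), n, p)                    # n x p design matrix
B <- matrix(0, p, q); B[1, 1:3] <- 2               # sparse p x q coefficient matrix
Sigma <- phi^abs(outer(1:q, 1:q, "-"))             # AR(1) covariance of each row of E

## Model (1)-(2): rows of E are i.i.d. N(0, Sigma)
E <- matrix(rnorm(n * q), n, q) %*% chol(Sigma)
Y <- X %*% B + E

## Whitening (3): right-multiply by the symmetric inverse square root of Sigma
eig <- eigen(Sigma, symmetric = TRUE)
Sigma_inv_half <- eig$vectors %*% diag(1 / sqrt(eig$values)) %*% t(eig$vectors)

## Vectorized model (4)-(5): vec(Y Sigma^{-1/2}) = (Sigma^{-1/2} %x% X) vec(B) + noise
cal_Y <- as.vector(Y %*% Sigma_inv_half)
cal_X <- kronecker(Sigma_inv_half, X)

## Lasso criterion (6); in practice lambda is chosen by cross-validation
fit <- glmnet(cal_X, cal_Y, intercept = FALSE, standardize = FALSE)
B_hat <- matrix(as.vector(coef(fit, s = 0.1))[-1], p, q)  # drop intercept, reshape to p x q
all(sign(B_hat) == sign(B))                        # sign-recovery check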

In practical situations, the covariance matrix Σ is generally unknown and thus has to be estimated. Let Σ̂ denote an estimator of Σ and let Σ̂^{-1/2} denote the corresponding estimator of Σ^{-1/2}.

When Σ is replaced by Σ̂, (3) becomes

(7)    Y Σ̂^{-1/2} = X B Σ̂^{-1/2} + E Σ̂^{-1/2},

which can be rewritten as follows:

(8)    𝒴̂ = 𝒳̂ ℬ + ℰ̂,

where

(9)    𝒴̂ = vec(Y Σ̂^{-1/2}),   𝒳̂ = Σ̂^{-1/2} ⊗ X,   ℰ̂ = vec(E Σ̂^{-1/2}).

In Model (8), ℬ is estimated by

(10)    argmin_ℬ { ‖𝒴̂ − 𝒳̂ℬ‖₂² + λ‖ℬ‖₁ }.

By extending Theorem 1, Theorem 5 gives some conditions on the eigenvalues of Σ̂ and on the convergence rates of Σ̂ and its inverse to Σ and Σ⁻¹, respectively, under which the positions of the null and non null entries of ℬ can be recovered by using the estimator defined in (10).

We prove in Section 2.3 that when Σ is a particular Toeplitz matrix, namely the covariance matrix of an AR(1) process, the assumptions of Theorem 5 are satisfied. This strategy has been implemented in the R package MultiVarSel, which is available on the Comprehensive R Archive Network (CRAN), for more general Toeplitz matrices such as the covariance matrices of ARMA processes or of general stationary processes. For a successful application of this methodology to particular “-omic” data, namely metabolomic data, we refer the reader to nous:bioinfo:arxiv:2017. For a review of the most recent methods for estimating high-dimensional covariance matrices, we refer the reader to HDCovEst.

The paper is organized as follows. Section 2 is devoted to the theoretical results of the paper. The assumptions under which the positions of the null and non null entries of ℬ can be recovered are established in Theorem 1 when Σ is known and in Theorem 5 when Σ is unknown. Section 2.3 studies the specific case of the AR(1) model. We present in Section 3 some numerical experiments in order to support our theoretical results. The proofs of our main theoretical results are given in Section 4.

2. Theoretical results

2.1. Case where Σ is known

Let us first introduce some notations. Let

(11)    J = { j ∈ {1, …, pq} : ℬ_j ≠ 0 },

where ℬ is defined in (5) and where ℬ_j denotes the j-th component of the vector ℬ defined in (5).

Let also define

(12)

where 𝒳_J and 𝒳_{J^c} denote the columns of 𝒳 belonging to the set J defined in (11) and to its complement J^c, respectively.

More generally, for any matrix, the corresponding partitioned matrix is extracted by considering the rows belonging to a given set of row indices and the columns belonging to a given set of column indices, a dot indicating all the rows or all the columns.

The following theorem gives some conditions under which the estimator ℬ̂(λ) defined in (6) is sign-consistent as defined by zhao:yu:2006, namely,

P( sign(ℬ̂(λ)) = sign(ℬ) ) → 1, as n → ∞,

where the sign function maps positive entries to 1, negative entries to -1 and zero entries to 0.

Theorem 1.

Assume that 𝒴 satisfies Model (4). Assume also that there exist some positive constants and positive numbers such that the following conditions are satisfied:

  1. for all , for all , , where is the th column of defined in (5),

  2. for all , , where denotes the smallest eigenvalue of ,

  3. , where J is defined in (11) and |J| denotes the cardinality of the set J,

  4. .

Assume also that the following strong Irrepresentable Condition holds:

  1. There exists a positive constant vector such that

    where 1 denotes a vector of ones and the inequality holds element-wise.

Then, for all λ satisfying

we have

where ℬ̂(λ) is defined by (6).

Remark 1.

Observe that if , for some positive , then the first condition of 1 becomes . Hence for large values of , the size of is much larger than .

The proof of Theorem 1 is given in Section 4. It is based on Proposition 2 which is an adaptation to the multivariate case of Proposition 1 in zhao:yu:2006.

Proposition 2.

Let ℬ̂(λ) be defined by (6). Then

where

(13)

and

(14)

In (13) and (14), 𝒳_J and 𝒳_{J^c} are defined in (12), and ℬ_J and ℬ_{J^c} denote the components of ℬ belonging to J and to J^c, respectively. Note that the previous inequalities hold element-wise.

The proof of Proposition 2 is given in Section 4.

We give in the following proposition, which is proved in Section 4, some conditions on X and Σ under which Assumptions 1 and 2 of Theorem 1 hold.

Proposition 3.

If there exist some positive constants , , , such that, for all ,

  1. for all , ,

  2. ,

  3. ,

  4. ,

then Assumptions 1 and 2 of Theorem 1 are satisfied.

Remark 2.

Observe that Conditions 1 and 2 hold in the case where the columns of the matrix X are orthogonal.

We give in Proposition 6 in Section 2.3 some conditions under which the strong Irrepresentable Condition 1 holds in the specific case where Σ is the covariance matrix of an AR(1) process.

2.2. Case where Σ is unknown

Similarly to (11) and (12), we introduce the following notations:

(15)

and

(16)

where 𝒳̂_J and 𝒳̂_{J^c} denote the columns of 𝒳̂ belonging to the set J defined in (11) and to its complement J^c, respectively.

A straightforward extension of Proposition 2 leads to the following proposition for Model (8).

Proposition 4.

Consider the estimator defined by (10). Then

where

(17)

and

(18)

In (17) and (18), 𝒳̂_J and 𝒳̂_{J^c} are defined in (16), and ℬ_J and ℬ_{J^c} denote the components of ℬ belonging to J and to J^c, respectively. Note that the previous inequalities hold element-wise.

The following theorem extends Theorem 1 to the case where Σ is unknown and gives some conditions under which the estimator defined in (10) is sign-consistent. The proof of Theorem 5 is given in Section 4 and is based on Proposition 4.

Theorem 5.

Assume that Assumptions 1, 2, 3, 4 and Condition 1 of Theorem 1 hold. Assume also that there exist some positive constants such that, for all n,

  1. ,

  2. ,

  3. ,

  4. .

Suppose also that

  1. , as n tends to infinity,

  2. , as n tends to infinity.

Consider the estimator defined by (10); then

In the previous assumptions, , , and denote the largest eigenvalue, the smallest eigenvalue, the spectral radius and the infinity norm (induced by the associated vector norm) of the matrix .

Remark 3.

Observe that Assumptions 5 and 6 hold in the case where the columns of the matrix X are orthogonal. Note also that Assumptions 7 and 8 are the same as Conditions 3 and 4 in Proposition 3.

In order to estimate Σ, we propose the following strategy:

  • Fitting a classical linear model to each column of the matrix Y in order to have access to an estimation Ê of the random error matrix E. This is possible since p is assumed to be fixed and smaller than n.

  • Estimating Σ from Ê by assuming that Σ has a particular structure, Toeplitz for instance (a short code sketch is given below).

More precisely, Ê defined in the first step is such that:

(19)

which implies that

(20)

where is defined in (5).
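As an illustration, the first step above amounts to computing ordinary least squares residuals column by column; a minimal sketch in R, reusing the simulated X and Y from the sketch given in the Introduction (the variable names are hypothetical), is:

## Step 1: fit a classical linear model to each column of Y and keep the residuals.
## lm.fit accepts a matrix response, so E_hat below is the n x q matrix of residuals.
E_hat <- lm.fit(X, Y)$residuals

## Step 2: plug E_hat into a structured covariance estimator, for instance the
## AR(1) estimator sketched at the end of Section 2.3.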

We prove in Proposition 7 below that our strategy for estimating Σ provides an estimator satisfying the assumptions of Theorem 5 in the case where E_1, E_2, …, E_n are assumed to be independent AR(1) processes.

2.3. The AR(1) case

2.3.1. Sufficient conditions for Assumption 1 of Theorem 1

The following proposition gives some conditions under which the strong Irrepresentable Condition 1 of Theorem 1 holds.

Proposition 6.

Assume that E_1, E_2, …, E_n in Model (1) are independent AR(1) processes satisfying, for 1 ≤ i ≤ n and 2 ≤ t ≤ q,

E_{i,t} = φ E_{i,t-1} + W_{i,t},

where the W_{i,t}’s are zero-mean i.i.d. Gaussian random variables with variance σ² and |φ| < 1. Assume also that X defined in (1) is such that , where is a positive constant. Moreover, suppose that if , then and . Suppose also that for all , or is not in . Then, the strong Irrepresentable Condition 1 of Theorem 1 holds.

The proof of Proposition 6 is given in Section 4.

2.3.2. Sufficient conditions for Assumptions 7, 8, 9 and 10 of Theorem 5

The following proposition establishes that, in the particular case where E_1, E_2, …, E_n are independent AR(1) processes, our strategy for estimating Σ provides an estimator satisfying the assumptions of Theorem 5.

Proposition 7.

Assume that E_1, E_2, …, E_n in Model (1) are independent AR(1) processes satisfying, for 1 ≤ i ≤ n and 2 ≤ t ≤ q,

E_{i,t} = φ E_{i,t-1} + W_{i,t},

where the W_{i,t}’s are zero-mean i.i.d. Gaussian random variables with variance σ² and |φ| < 1. Let

where

(21)

where Ê is defined in (19). Then, Assumptions 7, 8, 9 and 10 of Theorem 5 are valid.

The proof of Proposition 7 is given in Section 4. It is based on the following lemma.

Lemma 8.

Assume that E_1, E_2, …, E_n in Model (1) are independent AR(1) processes satisfying, for 1 ≤ i ≤ n and 2 ≤ t ≤ q,

E_{i,t} = φ E_{i,t-1} + W_{i,t},

where the W_{i,t}’s are zero-mean i.i.d. Gaussian random variables with variance σ² and |φ| < 1. Let

where Ê is defined in (19). Then,

Lemma 8 is proved in Section 4. Its proof is based on Lemma 10 in Section 5.
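In the AR(1) case, the whole plug-in procedure can be sketched as follows in R. The lag-one autocorrelation below is one natural estimator in the spirit of (21) but is not necessarily the exact formula used in the paper, and the variable names continue the illustrative sketches given above.

## AR(1) parameter estimated from the residual rows (illustrative choice)
phi_hat <- mean(apply(E_hat, 1, function(e) sum(e[-1] * e[-length(e)]) / sum(e^2)))

## Implied Toeplitz estimate of Sigma and its symmetric inverse square root
Sigma_hat <- phi_hat^abs(outer(1:q, 1:q, "-"))
eig_hat <- eigen(Sigma_hat, symmetric = TRUE)
Sigma_hat_inv_half <- eig_hat$vectors %*% diag(1 / sqrt(eig_hat$values)) %*% t(eig_hat$vectors)

## Whitened Lasso of (10): same steps as with the true Sigma, with Sigma_hat plugged in
fit_hat <- glmnet::glmnet(kronecker(Sigma_hat_inv_half, X),
                          as.vector(Y %*% Sigma_hat_inv_half),
                          intercept = FALSE, standardize = FALSE)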

3. Numerical experiments

The goal of this section is twofold: to provide sanity checks for our theoretical results in a well-controlled framework; and to investigate the robustness of our estimator to some violations of the assumptions of our theoretical results. The latter may reveal a broader scope of applicability for our method than the one guaranteed by the theoretical results.

We investigate the first point in the AR(1) framework presented in Section 2.3. Indeed, all the assumptions made in Theorems 1 and 5 can be specified with well-controlled simulation parameters in the AR(1) case with a balanced design matrix X.

The second point aims to explore the limitations of our theoretical framework and to assess its robustness. To this end, we propose two numerical studies relaxing some of the assumptions of our theorems: first, we study the effect of an unbalanced design – which violates the sufficient condition for the Irrepresentable Condition (IC) given in Proposition 6 – on the sign-consistency; and second, we study the effect of other types of dependence than an AR(1) process.

In all experiments, the performance is assessed in terms of sign-consistency. In other words, we evaluate the probability that the sign of the various estimators equals sign(ℬ). We compare the performance of three different estimators:

  • ℬ̂(λ) defined in (6), which corresponds to the LASSO criterion applied to the data whitened with the true covariance matrix Σ; we call this estimator oracle. Its theoretical properties are established in Theorem 1. A schematic code sketch of the three estimators is given after this list.

  • the estimator defined in (10), which corresponds to the LASSO criterion applied to the data whitened with an estimator Σ̂ of the covariance matrix Σ; we refer to this estimator as whitened-lasso. Its theoretical properties are established in Theorem 5.

  • the LASSO criterion applied to the raw data, which we call raw-lasso hereafter. Its theoretical properties are established only in the univariate case in alquier:doukhan:2011.
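In code, these three estimators differ only through the matrix used to whiten the data before the vectorized Lasso step; the following R fragment is a schematic illustration reusing the objects of the earlier sketches (glmnet and the helper function below are our own illustrative choices):

## The three competitors only differ through the whitening matrix W:
## the identity (raw-lasso), the true Sigma^{-1/2} (oracle) or its estimate (whitened-lasso).
whiten_and_lasso <- function(Y, X, W) {
  glmnet::glmnet(kronecker(W, X), as.vector(Y %*% W),
                 intercept = FALSE, standardize = FALSE)
}
fit_raw      <- whiten_and_lasso(Y, X, diag(ncol(Y)))
fit_oracle   <- whiten_and_lasso(Y, X, Sigma_inv_half)
fit_whitened <- whiten_and_lasso(Y, X, Sigma_hat_inv_half)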

3.1. AR(1) dependence structure with balanced one-way ANOVA

In this section, we consider Model (1) where X is the design matrix of a one-way ANOVA with two balanced groups. Each row of the random error matrix E is distributed as a centered Gaussian random vector as in Equation (2), where the matrix Σ is the covariance matrix of an AR(1) process as defined in Section 2.3.

In this setting, Assumptions 1, 2 and Condition 1 of Theorem 1 are satisfied, see Propositions 3 and 6. The three remaining assumptions 3, 4 and 1 are related to more practical quantities: 3 controls the sparsity level of the problem, involving ; 4 basically controls the signal-to-noise ratio, involving and 1 links the sample size , and the two constants , , so that an appropriate range of penalty exists for having a large probability of support recovery. This latter assumption is used in our experiments to tune the difficulty of the support recovery as follows: we consider different values of , , , and we choose a sparsity level and a minimal magnitude in such that Assumptions 3 and 4 are fulfilled. Hence, the problem difficulty is essentially driven by the validity of Assumption 1 where with , and so by the relationship between , and .

We consider a large range of sample sizes varying from to and three different values for in . The constants , are chosen such that with and in . Additional values of and have also been considered and the corresponding results are available upon request. Finally, we consider two values for the parameter appearing in the definition of the AR(1) process: .

Note that in this AR(1) setting, with the estimator of Σ defined through (21), all the assumptions of Theorem 5 are fulfilled; see Proposition 7.
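For concreteness, one replicate of this experiment can be sketched as follows (standalone R code with hypothetical parameter values; the actual grids of n, q and of the AR(1) parameter used for Figure 1 are not reproduced here):

## One replicate: balanced two-group one-way ANOVA design with AR(1) errors
set.seed(2)
n <- 100; q <- 20; phi <- 0.9
X <- cbind(rep(c(1, 0), each = n / 2), rep(c(0, 1), each = n / 2))  # balanced design
B <- matrix(0, nrow = 2, ncol = q); B[1, 1:5] <- 1                  # sparse signal
Sigma <- phi^abs(outer(1:q, 1:q, "-"))
Y <- X %*% B + matrix(rnorm(n * q), n, q) %*% chol(Sigma)

## Oracle whitening followed by the Lasso path; the replicate is counted as a
## success if some lambda on the path recovers the sign pattern of B
eigS <- eigen(Sigma, symmetric = TRUE)
W <- eigS$vectors %*% diag(1 / sqrt(eigS$values)) %*% t(eigS$vectors)
fit <- glmnet::glmnet(kronecker(W, X), as.vector(Y %*% W),
                      intercept = FALSE, standardize = FALSE)
success <- any(apply(as.matrix(fit$beta), 2,
                     function(b) all(sign(b) == sign(as.vector(B)))))
## Frequencies such as those in Figure 1 are averages of this indicator over replicates.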

The frequencies of support recovery for the three estimators, averaged over 1000 replications, are displayed in Figure 1.

Figure 1. Frequencies of support recovery in a multivariate one-way ANOVA model with two balanced groups and an AR(1) dependence.

We observe from Figure 1 that whitened-lasso and oracle have similar performance since Σ is well estimated. These two approaches always exhibit better performance than raw-lasso, especially when the AR(1) parameter φ is close to 1. In this case, the sample size required to reach the same performance is indeed ten times larger for raw-lasso than for oracle and whitened-lasso.

Finally, the performance of all estimators is altered when n is too small, especially in situations where the signal-to-noise ratio (SNR) is small and the signal is not sparse enough, these two characteristics corresponding to small values of the constants appearing in Assumptions 3 and 4.

3.2. Robustness to unbalanced designs and correlated features

The goal of this section is to study some particular design matrices X in Model (1) that may lead to a violation of the Irrepresentable Condition 1.

To this end, we consider the multivariate linear model (1) with the same AR(1) dependence as the one considered in Section 3.1. Then, two different design matrices X are considered: first, a one-way ANOVA model with two unbalanced groups of respective sizes n1 and n2; and second, a multiple regression model with correlated Gaussian predictors such that the rows of X are i.i.d. Gaussian vectors.

For the one-way ANOVA, a violation of Condition 1 may occur when the proportion n1/n is too different from 1/2, as stated in Proposition 6. For the regression model, we choose for the covariance matrix of the rows of X a correlation matrix whose off-diagonal entries control the amount of correlation between the predictors. The other simulation parameters are fixed as in Section 3.1.

We report in Figure 2 the results both for the unbalanced one-way ANOVA (top panels) and for the regression with correlated predictors (bottom panels). For the one-way ANOVA, the group proportion n1/n varies. For the regression case, the correlation level between predictors varies. In both cases, the gray lines correspond to the ideal situation (that is, either balanced or uncorrelated), denoted Ideal in the legend of Figure 2. The probability of support recovery is estimated over 1000 runs.

Figure 2. Frequencies of support recovery in general linear models with unbalanced designs: one-way ANOVA and regression.

From this figure, we note that correlated features or unbalanced designs deteriorate the support recovery of all estimators. This was expected for these LASSO-based methods, which all suffer from the violation of the Irrepresentable Condition 1. However, we also note that whitened-lasso and oracle have similar performance, which means that the estimation of Σ is not altered, and that whitening always improves the support recovery.

3.3. Robustness to more general autoregressive processes

In this section, we consider the case where X is the design matrix of a one-way ANOVA with two balanced groups and where Σ is the covariance matrix of an AR process of higher order. Figure 3 displays the performance of the different estimators in this setting. Here, for computing Σ̂ in whitened-lasso, the parameters of the AR process are estimated as follows: they are obtained by averaging over the rows of Ê defined in (19) the estimations obtained for each row of Ê by using standard estimation approaches for AR processes described in Brockwell:1990. As previously, we observe from this figure that whitened-lasso and oracle have better performance than raw-lasso.

Figure 3. Frequencies of support recovery in one-way ANOVA with AR() covariance matrix.
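A sketch of this parameter-estimation step for a higher-order AR process is given below, using the Yule-Walker estimator of base R and averaging the estimated coefficients over the residual rows; the chosen order and the averaging scheme are our reading of the description above and may differ from the exact implementation.

## Estimate AR coefficients of a fixed (illustrative) order from each row of E_hat
ar_order <- 2
ar_by_row <- apply(E_hat, 1, function(e)
  stats::ar(e, aic = FALSE, order.max = ar_order, method = "yule-walker")$ar)
ar_coefs <- rowMeans(matrix(ar_by_row, nrow = ar_order))

## Toeplitz covariance (up to a scale factor) implied by the averaged coefficients,
## built from the theoretical autocorrelation function of the fitted AR process
acf_theo <- stats::ARMAacf(ar = ar_coefs, lag.max = ncol(E_hat) - 1)
Sigma_hat_ar <- toeplitz(acf_theo)   # then whiten as in the AR(1) sketch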

4. Proofs

Proof of Proposition 2.

For a fixed nonnegative λ, by (6),

Denoting , we get

Thus,

where

Since the first derivative of with respect to is equal to

satisfies

and

Note that, if , then and .

Let us now prove that when and , defined in (13) and (14), are satisfied then there exists satisfying:

(22)
(23)
(24)

Note that implies:

(25)

By denoting

(26)

we obtain from (25) that (22) and (23) hold. Note that implies:

Hence,

which is (24) by (26). This concludes the proof. ∎

Proof of Theorem 1.

By Proposition 2,

where and are defined in (13) and (14). It is thus enough to prove that and tend to zero as tends to infinity.

By definition of ,

(27)

where

and

By definition of and 1,

(28)

where

Note that, for all in ,

Moreover,

where denotes the largest eigenvalue of the matrix . Observe that

(29)

by Assumption 2 of Theorem 1. Thus, for all in ,

(30)

By Assumption 4 of Theorem 1, we thus get that for all in ,

(31)

Thus,

(32)

Since is a centered Gaussian random vector having a covariance matrix equal to the identity, is a centered Gaussian random vector with a covariance matrix equal to:

Hence, by (29), we get that for all in ,